Sitemaps, Meta Data, and robots.txt Forum

    
Disallow all URLs with '_'
danwhitehouse - 9:37 am on Mar 27, 2012 (gmt 0)

Hi all,
We've just changed URL manager, which has left a great number of redundant URLs live. A lot of them follow the same convention: an '_' in them.

I've thought about creating a script that detects URLs with '_' in them and adds them to robots.txt on the fly, with a cron job running in the background.

I was wondering if anyone knew of a faster way to do this?

Thanks again

 

Andy Langton - 3:34 pm on Mar 29, 2012 (gmt 0)

There shouldn't be a need to robots-exclude redundant pages - if they have moved elsewhere, 301 them. If they don't exist any more, they will die out on their own.

Robots excluding them implies that they are present in search results - if they are, they should be good candidates for 301 redirecting.

If the desire is to remove the pages from search results, then a simple 404 (or 410) response is the quickest way to do that - robots exclusion won't speed this up.

All that said, if robots exclusion is the right choice in this situation, you could use a wildcard:

User-agent: googlebot
Disallow: *_*

Note that not all spiders support the wildcard - YMMV.

danwhitehouse - 3:41 pm on Mar 29, 2012 (gmt 0)

Hi Andy,
That's a great help with the wildcard. It seems like a very easy solution.

The problem comes from there being 1000s of redundant URLs with '_' in them. We're doing all our redirects through .htaccess. That'd be a big file if all those redirects were added.

Andy Langton - 3:44 pm on Mar 29, 2012 (gmt 0)

I'd still be a bit unsure that robots.txt is the best solution to this one.

Incidentally, if you've replaced the underscores with a different character, then you can pattern match with some crazy mod_rewrite code.

One final comment: I forgot that the golden rule of robots exclusion is prefix matching, so the trailing wildcard is unnecessary and, worse still, my syntax was lazy. Here's a corrected version:

User-agent: googlebot
Disallow: /*_
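
With Google's wildcard support, Disallow: /*_ blocks any URL whose path contains an underscore anywhere - for example, hypothetical paths like /widgets_old/ or /page_name.html would both be excluded.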

danwhitehouse - 3:48 pm on Mar 29, 2012 (gmt 0)

Thanks Andy. I'll let you know how we get on!

Andy Langton - 3:49 pm on Mar 29, 2012 (gmt 0)

Remember you can test in Google Webmaster Tools to check this will do what you want it to. Always wise before making robots changes!

danwhitehouse - 4:03 pm on Mar 29, 2012 (gmt 0)

Great plan, will do!

g1smd - 4:28 pm on Mar 29, 2012 (gmt 0)

If they are truly "Gone" then I would use this:

RewriteRule _ - [G]
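
(The [G] flag tells Apache to answer any request whose path contains an underscore with a 410 Gone response.)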


@andy I was about to post the correction for robots.txt. The pattern should be /*_ but you beat me to it. :)

danwhitehouse - 3:12 pm on Mar 30, 2012 (gmt 0)

OK, the following solution is now live:

Disallow: /*_

There are 6,638 'not found' errors as of 16:12 on 30th March '12.

Hopefully the number will be a lot less next week! Will report back!

danwhitehouse - 8:23 am on Apr 2, 2012 (gmt 0)

Morning all,
3 days on and the new robots.txt has been downloaded a number of times, but the pages are still indexed. Total errors now stand at 6,368, with multiple URLs indexed containing the infamous underscore. Is it too soon to expect these pages to drop out of the index?

lucy24 - 9:56 am on Apr 2, 2012 (gmt 0)

Did you ask in gwt to have the unwanted pages removed? Otherwise you'll be waiting until the cows come home. Blocking a page in robots.txt by itself doesn't de-index it if it's already in the index.

Pin to your wall: Google Never Forgets an URL.

danwhitehouse - 10:22 am on Apr 2, 2012 (gmt 0)

Is there a quick way to remove thousands of URLs from the index?

phranque - 11:34 am on Apr 2, 2012 (gmt 0)

i agree with g1smd and Andy Langton that the proper response is to provide a 301 or 410 as appropriate for each requested url and not to exclude those urls from being crawled.

danwhitehouse - 1:56 pm on Apr 2, 2012 (gmt 0)

Yes, they are gone. So should I add this to .htaccess? What will it do?

RewriteRule _ -

Thanks!

lucy24 - 4:35 pm on Apr 2, 2012 (gmt 0)

As written? A 500 error, probably.

RewriteRule _ - [F]

or

RewriteRule _ - [G]

will slam the door in the face of anyone trying to set foot in anything containing a lowline. But it still won't make g### de-index the pages ;)

Andy Langton - 8:40 pm on Apr 2, 2012 (gmt 0)

Google is going to revisit and re-evaluate those pages on its own schedule - primarily based on the links into the pages, and how important they are.

The "brute force" method is to request removal, but I wouldn't normally recommend it. Otherwise, you can submit a sitemap of the erroneous URLs which will speed things up somewhat.

When it comes down to it, though, if you struggle to get Google to revisit URLs, then they aren't very important in the general scheme of things!

Is there a particular problem with the URLs you're trying to remove?

danwhitehouse - 7:38 am on Apr 3, 2012 (gmt 0)

Hi Andy,
There's no problem with the URLs; they were part of an obsolete URL manager. I'd like to see them removed or redirected as there are so many, and surely G is looking at this as one of its ranking signals.

Andy Langton - 7:56 am on Apr 3, 2012 (gmt 0)

IMO the quickest route that will get you where you want to be is to redirect the URLs (assuming there is a reasonably close new location), and then submit a sitemap of those URLs.

But low value URLs will always take a long time to be re-evaluated - other than URL removal the only way to speed that up would be to link to the URLs you want removed!

danwhitehouse - 8:02 am on Apr 3, 2012 (gmt 0)

I spoke to one of the devs yesterday about redirecting with .htaccess, and he said that each .htaccess lookup would slow down server response time due to the extra checks it has to do. Any other type of manual 301 would also take time, like adding a redirect to each page to take the load off the server side. Any ideas what would be best practice for the redirects? Thanks!

g1smd - 8:58 am on Apr 3, 2012 (gmt 0)

A single RewriteRule can rewrite all requests with a "_" in them to a special PHP script that runs only for those requests.

RewriteRule _ /special.php [L]

This new PHP script then serves the required 404, 410, or 301 HTTP response code based on rules you program into that single PHP script.

It can get its data to do that either from an array you create, or from the database that runs the site.

This method adds almost no load to the server; only one extra rule is parsed per HTTP request.

Adding thousands of individual rules, one for each of your errant URLs, would slow the server down considerably. Additionally, that approach would be almost impossible to maintain if there are more than a couple of hundred URLs involved.
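
A rough sketch of what such a script might look like, assuming a hand-built array of old-to-new mappings - the file name, array contents and messages below are illustrative only, not the actual site code:

<?php
// special.php - receives every request whose path contains an underscore,
// via: RewriteRule _ /special.php [L]

// Illustrative mapping of old URLs to their new locations.
$redirects = array(
    '/job.php?job_id=1853622' => '/job/1853622/',
    // ...further mappings, or replace this with a database lookup
);

// REQUEST_URI still holds the originally requested URL after the rewrite.
$requested = $_SERVER['REQUEST_URI'];

if (isset($redirects[$requested])) {
    // The content has moved - issue a permanent redirect.
    header('Location: ' . $redirects[$requested], true, 301);
    exit;
}

// No replacement exists - tell crawlers the page is gone for good.
header('HTTP/1.1 410 Gone');
echo 'This page has been permanently removed.';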

danwhitehouse - 8:05 am on Apr 4, 2012 (gmt 0)

Hi all,
It seems as though there are more issues at play here, and in our efforts to be 404-free I've suggested that we also use the following rules. I was wondering what you thought about such redirects and whether big G would be affected.

Here's my notes, with URLs hidden..

I have added rules to the apache config to redirect these old urls to the new ZF version of the job detail page.
ie.
[SITE.co.uk...]
now redirects to
[SITE.co.uk...]

This brought to light a second issue: duplicate content. The URL above is not, in fact, the complete URL. We can only configure Apache to redirect to:
[SITE.co.uk...]
When the proper url should be:
[SITE.co.uk...]

In our current code, everything that comes after /job/1853622/ in the URL is ignored, so you could type:
[SITE.co.uk...]
and this would resolve correctly, just like the other URLs, which means Google will see it as duplicate content.

I have patched the job_detail controller code to check the request URI against what the urlmanager assumes it should be and, if it doesn't match, redirect to the proper final URL. This will sort out any duplicate content issues we might have for the job detail pages.
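
As a rough sketch of that kind of check - the variable names and helper below are made up for illustration and are not the real controller code:

<?php
// Illustrative only: build the canonical URL the way the urlmanager would.
function make_slug($text) {
    // lowercase, then collapse runs of non-alphanumerics into hyphens
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($text)), '-');
}

$jobId     = 1853622;                // would come from the database
$jobTitle  = 'IT jobs in london';    // would come from the database
$canonical = '/job/' . $jobId . '/' . make_slug($jobTitle);

// Any other variant of the URL gets a permanent redirect to the canonical form.
if ($_SERVER['REQUEST_URI'] !== $canonical) {
    header('Location: http://www.example.com' . $canonical, true, 301);
    exit;
}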

In summary:
[SITE.co.uk...]
redirects to
[SITE.co.uk...]
which then also redirects to
[SITE.co.uk...]

g1smd - 9:00 am on Apr 4, 2012 (gmt 0)

Use example.com in this forum to stop URL auto-linking.

We need to see what you typed.

danwhitehouse - 9:32 am on Apr 4, 2012 (gmt 0)

Here's the list of URLs above, from top to bottom:

http://www.example.com/job.php?job_id=1853622

http://www.example.com/job/1853622/

http://www.example.com/job/1853622/

http://www.example.com/job/1853622/

http://www.example.com/job/1853622/Hello-Google-I-am-a-duplicate-content-page

http://www.example.com/job.php?job_id=1853622

http://www.example.com/job/1853622/

http://www.example.com/job/1853622/IT-jobs-in-london

lucy24 - 10:00 am on Apr 4, 2012 (gmt 0)

"Here's my notes, with URLs hidden."

They certainly are ;)

Yes, you can say example.co.uk along with .com or .org or .net. I learned that recently by experiment.

Andy Langton - 3:56 pm on Apr 4, 2012 (gmt 0)

"In our current code, everything that comes after /job/1853622/ in the URL is ignored"


Unfortunately, this creates a potentially infinite "URL space" and is also the main reason migrating these URLs will prove difficult. The benefits of keywords in a URL need to be weighed against this sort of thing!

You can maintain such URLs only if you have a way of validating the string at the end, which adds a layer of complexity and introduces a chain of redirects, which is best avoided if possible!

In terms of mod_rewrite and performance, if you're using a pattern-matched URL, e.g.

^job.php\?job_id=([0-9]+)$ /job/$1

You're not going to see any visible performance decrease IMO. The pattern will only match the old format of URLs, and this type of rule is in place on some very busy websites!
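
One wrinkle worth noting: in Apache, the RewriteRule pattern is matched against the URL path only, never the query string, so capturing job_id needs a RewriteCond. A sketch under that assumption (the redirect target mirrors the pattern above and is illustrative):

# RewriteRule never sees the query string, so capture job_id with a
# RewriteCond and reference it in the target as %1.
RewriteCond %{QUERY_STRING} ^job_id=([0-9]+)$
RewriteRule ^/?job\.php$ /job/%1/? [R=301,L]
# The trailing ? on the target drops the old query string from the new URL.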

In any case, the redirects are a better plan than 404s/robots, since the content has genuinely moved. Avoid the chain if you can!
