
Why Google Might "Ignore" a robots.txt Disallow Rule

Thousands of pages show up in the Google cache! aka "here we go again"

     

g1smd

4:35 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Here's the skinny on this one:

This concerns a 750 000 URL forum that has had a large number of URLs disallowed in the robots.txt file for about 18 months. The forum has about 50 000 valid threads.

The disallowed URLs are those that, for a guest or a bot, show only an "Error. You are not logged in" message - URLs that would otherwise be used to reply to a thread, start a new thread, send a PM, show a "print-friendly" screen, edit the user profile, and so on. There is never a need for search engines to index these. Search engines only need to see the thread indexes and the message threads themselves.

Google has not been indexing the content at the disallowed URLs, but has shown a large number of them as URL-only entries for a long time. They are most easily seen in a site:domain.com listing. This is pretty much normal operation, and that part has been working OK. The disallowed URLs are listed in the User-agent: * part of the robots.txt file.
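
For illustration only - these are typical vBulletin script paths, not necessarily the exact rules in this site's file - that section looks something like:

User-agent: *
Disallow: /forum/newreply.php
Disallow: /forum/newthread.php
Disallow: /forum/private.php
Disallow: /forum/printthread.php
Disallow: /forum/profile.php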

.

As you may have seen before, I have written several times about how a forum can have duplicate content for each thread, because each thread in a vBulletin or phpBB forum (and most other packages too) has multiple URLs that can reach the same content.

For a vBulletin forum each thread could show up as each of these URLs:

/forum/showthread.php?t=54321
/forum/showthread.php?t=54321&p=22446688
/forum/showthread.php?t=54321&page=2
/forum/showthread.php?mode=hybrid&t=54321
/forum/showthread.php?p=22446688&mode=linear#post22446688
/forum/showthread.php?p=22446688&mode=threaded#post22446688
/forum/showthread.php?t=34567&goto=nextnewest
/forum/showthread.php?t=87654&goto=nextoldest
/forum/showthread.php?goto=lastpost&t=54321
/forum/showpost.php?p=22446688
/forum/showpost.php?p=22446688&postcount=45
/forum/printthread.php?t=54321

and that is without counting the further combinations created by the page parameter (for threads that run to more than one page) and the pp parameter (which changes the default number of posts per page); either or both of those can be added to most of the URLs above as well.

The robots.txt file had been set up long ago to exclude several of the URL patterns for thread duplicate content - but critically, not all combinations - and for the excluded URLs, Google had only shown URL-only entries, if anything at all.

.

In a vBulletin forum, the "next" and "previous" links cause massive duplicate content issues because they allow a thread like
/forum/showthread.php?t=54321 to be indexed as
/forum/showthread.php?t=34567&goto=nextnewest and as
/forum/showthread.php?t=87654&goto=nextoldest too.

Additionally, if any of the three threads is bumped, the indexed "next" and "previous" links no longer point to the same thread, because they contain the thread number of the thread that they were ON (along with the goto parameter), not the thread number of the thread that they actually pointed to.

This is a major programming error by the people who designed the forum software. The link should either contain the true thread number of the thread that it points to, or else clicking the "next" and "previous" links should go via a 301 redirect to a URL containing the true canonical thread number of the target thread.

Those duplicate content URLs have all been indexed before, but the robots.txt file has now been amended to disallow them. This is what was added to the robots.txt file just a few days ago:

User-Agent: Googlebot
Disallow: /*nextnewest
Disallow: /*nextoldest
Disallow: /*mode
Disallow: /*highlight

.

Here's the punchline:

The disallowed URLs in the User-Agent: * section of the robots.txt file are now being indexed and cached by Google. The cache time-stamps show dates and times just hours after the robots.txt file was amended with the additional Googlebot-specific section.

I would have assumed that Google would decline to index both the URLs disallowed in the User-agent: Googlebot section and the URLs disallowed in the User-agent: * section.

What appears to happen is that as soon as you add a User-agent: Googlebot section, Google starts indexing all URLs that are not mentioned in that section, even if they are mentioned in the User-agent: * section and supposedly disallowed for all user agents.

That is, if you have a User-agent: Googlebot section, then you also need to repeat all URLs found in the User-agent: * section in the Googlebot-specific section.
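
A sketch of what that seems to require (the first two paths are placeholders standing in for whatever is already disallowed under User-agent: *):

User-agent: *
Disallow: /forum/newreply.php
Disallow: /forum/newthread.php

User-agent: Googlebot
Disallow: /forum/newreply.php
Disallow: /forum/newthread.php
Disallow: /*nextnewest
Disallow: /*nextoldest
Disallow: /*mode
Disallow: /*highlight

In other words, Googlebot reads only the record addressed to it, so every rule it is meant to obey has to appear in that record.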

That, to me, is not how it should work.

.

Can someone from Google clarify whether Google is supposed to follow both the User-agent: * and User-agent: Googlebot sections if both are present, or whether it ignores User-agent: * whenever a User-agent: Googlebot section is present?

The latter is what appears to happen right now.

.

Side note: Looks like the other stuff at [webmasterworld.com...] is fixed, by the way.

[edited by: g1smd at 4:52 pm (utc) on Aug. 13, 2006]

vanessafox

9:41 pm on Aug 13, 2006 (gmt 0)

5+ Year Member



Hi Brett/moderators,

I hope it's OK to post these links as they describe how Googlebot handles robots.txt files.

This page provides links to information on robots.txt files as they pertain to Googlebot:
[google.com...]

And this page provides information on how Googlebot interprets the situation being discussed here:
[google.com...]

If you want to block access to all bots other than the Googlebot, you can use the following syntax:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

Googlebot follows the line directed at it, rather than the line directed at everyone.

You can always use the robots.txt analysis tool in our webmaster tools to see how Googlebot will interpret a robots.txt file.

g1smd

9:46 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Cool. Now we're talking.

Many thanks for such a rapid resolution to such an easy misunderstanding...

jdMorgan

10:03 pm on Aug 13, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Googlebot is "smart" in that it uses record-specificity priority, rather than record-order priority.

For most other robots, the safest way to exclude all but the specified robots would be:

User-agent: The_Allowed_Bot
Disallow:

User-agent: Another_Allowed_Bot
Disallow:

User-agent: *
Disallow: /


Here the allowed 'bots find their specific record first, and accept it, while all other robots continue to the catch-all record at the end.

This should work for Googlebot as well, and is a much safer bet for 'dumber' robots.

Jim

[edited by: jdMorgan at 10:05 pm (utc) on Aug. 13, 2006]

GoogleGuy

12:47 am on Aug 14, 2006 (gmt 0)

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Thanks for hopping in, Vanessa. Personally, I always take a new robots.txt file out for a test drive before I put it live. There's one in the Google Webmaster Tools, and Brett offered one of the earliest ones I saw on the web.

The rule of thumb I always use is "the most specific directive applies." So if you say "Everyone in the room, leave. g1smd, please stay; we need to chat" then everyone but g1smd would mosey.

Although this is how we've done things for a long time (and I think every other major engine works this way), I agree it's good to get the word out, g1smd. It's on the front page of WebmasterWorld, so I think the word is out. :) My takeaway would be to find a good robots.txt checker and test out a new file before making it live.

[edited by: GoogleGuy at 12:49 am (utc) on Aug. 14, 2006]

bufferzone

12:11 pm on Aug 14, 2006 (gmt 0)

10+ Year Member



I would definitely recommend that you take a close look at Google Webmaster Tools. The robots.txt tool will validate your file as well as test new entries. You also have the ability to run detailed tests for the many flavours of Googlebot, so that you can specify exactly which bot can do what.

Brett_Tabke

2:55 pm on Aug 14, 2006 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



also be aware that when you are using nonstandard robots.txt syntax, bad things can happen. Never use nonstandard syntax with wildcard agent names.


User-Agent: *
Allow: /
Disallow: /cgi-bin

That will disallow the entire site on some bots.

There is no "Allow" directive in the original robots.txt standard, and we have documented Slurp misinterpreting it in the past as a disallow line, which ultimately removed the entire site from Yahoo. That behavior has since been changed, but it is a clear-cut case of what can go wrong when using nonstandard (improper) syntax in robots.txt.
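
For reference, the same intent can be expressed with standard syntax alone, since anything not explicitly disallowed is crawlable by default:

User-Agent: *
Disallow: /cgi-bin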

SEOEgghead

7:36 pm on Aug 15, 2006 (gmt 0)

5+ Year Member



I think Google is only following the specification here.

This debate arises from a somewhat ambiguous statement in the robots.txt standard.

"the record describes the default access policy for any robot that has not matched any of the other records."

In other words, if a specific rule is present the "*" rule is ignored. Order shouldn't matter.

So ... according to a close reading of the specification, the rules for a specific user agent _entirely override_ the "User-agent: *" rules. Therefore, any rule under "User-agent: *" that should also be applied to googlebot must be repeated under "User-agent: googlebot."

Am I wrong?

[edited by: SEOEgghead at 7:37 pm (utc) on Aug. 15, 2006]

mcavic

10:20 pm on Aug 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So ... according to a close reading of the specification, the rules for a specific user agent _entirely override_ the "User-agent: *" rules. Therefore, any rule under "User-agent: *" that should also be applied to googlebot must be repeated under "User-agent: googlebot."

Correct.

SEOEgghead

10:48 pm on Aug 15, 2006 (gmt 0)

5+ Year Member



Out of curiosity, does anyone know if Yahoo and MSN also follow this very literal interpretation of the specification? I would hope so.

I agree that it's wise to put "*" last regardless. I think that cures the ambiguity, right?

I think many programmers would tend to read it the _wrong_ way. I have in the past. It kind of reads like a "switch" statement in C -- where order counts and you would expect the default case to come last. It's a so-so analogy, but you see my point. If the standard had used the token "default" instead of "*", nobody would mistake it for a glob/regex, as many have ... </rant>

tedster

12:05 am on Aug 16, 2006 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Up above, GoogleGuy said "I believe most/all search engines interpret robots.txt this way--a more specific directive takes precedence over a weaker one."

I tend to trust his experience on this one.

jdMorgan

12:18 am on Aug 16, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I would recommend going with the Standard, and coding to the lowest common denominator to avoid problems. While our esteemed Googlers will undoubtedly provide authoritative answers for how Google does things, and Google's implementations are usually correct, compliant, and robust, extending those attributes to 'most robots' is rather a big stretch.

Jim

mcavic

1:22 pm on Aug 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



does anyone know if Yahoo and MSN also follow this very literal interpretation of the specification

They would have to. It's the only interpretation that works logically, isn't it? If a bot were to follow its own section AND the * section, there would be no way to allow that bot more access than all the other bots (since Allow: is non-standard).

Brett_Tabke

2:10 pm on Aug 17, 2006 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



> It's the only interpretation that works logically, isn't it?

no. Some bots have, in the past, worked in increasing order of importance: what comes last in the robots.txt trumps what comes first. A sequential reading of robots.txt is how Infoseek and Scooter used to work.
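
For example, a sketch with a placeholder bot name: given

User-agent: Example_Bot
Disallow:

User-agent: *
Disallow: /

a specificity-based parser lets Example_Bot crawl everything, while a purely sequential last-record-wins reading could end up applying the final Disallow: / to it as well.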

[edited by: Brett_Tabke at 4:23 pm (utc) on Aug. 19, 2006]

g1smd

2:19 pm on Aug 17, 2006 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



New search results published to [gfe-eh.google.com ] today show that the modified robots.txt file is now being followed, and all the problems with indexed URLs are being cleared up. I don't yet see any change on most other DCs.

SEOEgghead

9:26 pm on Aug 17, 2006 (gmt 0)

5+ Year Member



Brett,

Who says those bots were logical, though? :) Anyway, my assessment is that you should place the "*" (default) rule last and repeat, in each bot-specific section, any rules from "*" that you also want applied to that bot. This covers most of the bases.
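
Something like this, as a minimal sketch with placeholder paths:

User-agent: Googlebot
Disallow: /shared-stuff
Disallow: /googlebot-only-stuff

User-agent: *
Disallow: /shared-stuff

The shared rule is repeated under Googlebot because Googlebot never reads the "*" record, and "*" goes last so that bots that simply take the first matching record still find their own record before the catch-all.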

Make sense?

