This concerns a 750 000 URL forum that has had a large number of URLs disallowed in the robots.txt file for about 18 months. The forum has about 50 000 valid threads.
The disallowed URLs are those that for a guest or a bot only show an "Error. You are not logged in" message - URLs that would otherwise be used to reply to a thread, start a new thread, send a PM, show a "print-friendly" screen, edit the user profile, and so on. There is never a need for search engines to try to index these. Search engines only need to see the thread indexes, and the message threads themselves.
Google has not been indexing the content at the disallowed URLs, but has shown a large number of them as URL-only entries for a long time. They are most easily seen in a site:domain.com listing. This is pretty much normal operation, and that part has been working OK. The disallowed URLs are listed in the User-agent: * part of the robots.txt file.
.
As you may have seen before, I have written several times about how a forum can have duplicate content for each thread, because each thread in a vBulletin or phpBB forum (and in most other packages too) has multiple URLs that can reach the same content.
For a vBulletin forum each thread could show up as each of these URLs:
/forum/showthread.php?t=54321
/forum/showthread.php?t=54321&p=22446688
/forum/showthread.php?t=54321&page=2
/forum/showthread.php?mode=hybrid&t=54321
/forum/showthread.php?p=22446688&mode=linear#post22446688
/forum/showthread.php?p=22446688&mode=threaded#post22446688
/forum/showthread.php?t=34567&goto=nextnewest
/forum/showthread.php?t=87654&goto=nextoldest
/forum/showthread.php?goto=lastpost&t=54321
/forum/showpost.php?p=22446688
/forum/showpost.php?p=22446688&postcount=45
/forum/printthread.php?t=54321
and that is without introducing URLs that include the page parameter, for threads that are more than one page long, and the pp parameter for changing the default number of posts per page; either or both of which can be added to most of the URLs above too.
The robots.txt file had been set up long ago to exclude several of the URL patterns for thread duplicate content (but critically, not all combinations), and for the excluded URLs Google had only shown URL-only entries, if anything at all.
.
In a vBulletin forum, the "next" and "previous" links cause massive duplicate content issues because they allow a thread like
/forum/showthread.php?t=54321 to be indexed as
/forum/showthread.php?t=34567&goto=nextnewest and as
/forum/showthread.php?t=87654&goto=nextoldest too.
Additionally if any of the three threads is bumped, the "next" and "previous" links that are indexed no longer point to the same thread, because they contain the thread number of the thread that they were ON (along with the goto parameter), not the real thread number of the thread that they actually pointed to.
This is a major programming error by the people who designed the forum software. The link should either contain the true thread number of the thread that it points to, or else clicking the "next" and "previous" links should go via a 301 redirect to a URL that includes the canonical thread number of the target thread.
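To make the redirect idea concrete, here is a minimal sketch in Python. It is not vBulletin code: the `LAST_POST` table, the `resolve_goto` function, and the ordering by last-post time are all assumptions made for illustration, with the thread order chosen to match the example URLs above.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical data: thread id -> time of last post (any sortable value).
# Ordered so that 54321 sits between 34567 (older) and 87654 (newer).
LAST_POST = {34567: 100, 54321: 200, 87654: 300}

def resolve_goto(url, last_post=LAST_POST):
    """Return (status, location) for a showthread URL.

    A "goto" link is answered with a 301 to the canonical URL of the
    thread it currently points to; any other URL is served as-is.
    """
    query = parse_qs(urlsplit(url).query)
    goto = query.get("goto", [None])[0]
    if goto not in ("nextnewest", "nextoldest"):
        return 200, url
    thread_id = query.get("t", [""])[0]
    if not thread_id.isdigit() or int(thread_id) not in last_post:
        return 404, None
    ordered = sorted(last_post, key=last_post.get)  # oldest activity first
    step = 1 if goto == "nextnewest" else -1
    index = ordered.index(int(thread_id)) + step
    if not 0 <= index < len(ordered):
        return 404, None  # no neighbouring thread in that direction
    return 301, f"/forum/showthread.php?t={ordered[index]}"

print(resolve_goto("/forum/showthread.php?t=34567&goto=nextnewest"))
print(resolve_goto("/forum/showthread.php?t=87654&goto=nextoldest"))
```

Both calls resolve to /forum/showthread.php?t=54321, so only the canonical URL would ever be the final address a crawler sees, even after a thread is bumped and the neighbours change.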
Those duplicate content URLs have all been indexed before, but now the robots.txt file has been amended to disallow those. This is what was added to the robots.txt file just a few days ago:
User-Agent: Googlebot
Disallow: /*nextnewest
Disallow: /*nextoldest
Disallow: /*mode
Disallow: /*highlight
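For anyone wanting to check what those wildcard lines would catch, here is a rough sketch in Python. It assumes the pattern semantics Google documents for Googlebot (`*` matches any run of characters, a trailing `$` anchors the end, and matching starts at the beginning of the path); wildcards are an extension and not part of the original robots.txt standard. The function names are made up for this example.

```python
import re

def wildcard_rule_to_regex(rule):
    """Translate a Disallow pattern with * and an optional trailing $."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with "any characters".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path, rules):
    """True if any rule matches the path from its first character."""
    return any(wildcard_rule_to_regex(rule).match(path) for rule in rules)

GOOGLEBOT_RULES = ["/*nextnewest", "/*nextoldest", "/*mode", "/*highlight"]

for path in [
    "/forum/showthread.php?t=54321",
    "/forum/showthread.php?t=34567&goto=nextnewest",
    "/forum/showthread.php?mode=hybrid&t=54321",
]:
    print(path, is_disallowed(path, GOOGLEBOT_RULES))
```

Note that under this reading a pattern as short as /*mode matches any URL containing "mode" anywhere after the first slash, so it is worth testing against real URLs from the site before going live.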
.
Here's the punchline:
The disallowed URLs in the User-Agent: * section of the robots.txt file are now being indexed and cached by Google. The cache time-stamps start showing up for dates and times that are just hours after the date and time that the robots.txt file was amended by adding the additional Googlebot-specific information.
I would have assumed that Google would not index the URLs listed in the User-agent: Googlebot section, and would also not index the URLs listed in the User-agent: * section.
What appears to happen is that as soon as you add a User-agent: Googlebot section, Google starts indexing all URLs that are not mentioned in that section, even if they are mentioned in the User-agent: * section and supposedly disallowed for all user agents.
That is, if you have a User-agent: Googlebot section, then you also need to repeat all URLs found in the User-agent: * section in the Googlebot-specific section.
That, to me, is not how it should work.
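The effect can be reproduced offline with Python's standard-library `urllib.robotparser`, which happens to read the file the same way. This is only a sketch: it is not Google's parser, it does not understand wildcards (so plain path prefixes are used), and `/forum/newreply.php` is a hypothetical stand-in for one of the URLs in the long-standing User-agent: * section.

```python
from urllib.robotparser import RobotFileParser

def can_fetch(robots_txt, agent, path):
    """Parse robots.txt text and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# Hypothetical stand-in for the existing "all robots" section.
GENERIC = """\
User-agent: *
Disallow: /forum/newreply.php
"""

# The same file after a Googlebot-specific record is added.
WITH_GOOGLEBOT = GENERIC + """
User-agent: Googlebot
Disallow: /forum/printthread.php
"""

# The workaround: repeat the generic rule inside the Googlebot record.
REPEATED = GENERIC + """
User-agent: Googlebot
Disallow: /forum/newreply.php
Disallow: /forum/printthread.php
"""

for label, text in [
    ("generic only", GENERIC),
    ("with Googlebot record", WITH_GOOGLEBOT),
    ("rules repeated", REPEATED),
]:
    print(label, can_fetch(text, "Googlebot", "/forum/newreply.php"))
```

With this parser the reply URL is refused in the first and third cases and allowed in the second, while any other bot name is still refused in all three.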
.
Can someone from Google clarify whether Google is supposed to follow both User-agent: * and User-agent: Googlebot if both are present, or whether it ignores User-agent: * if User-agent: Googlebot is present?
The latter is what appears to happen right now.
.
Side note: Looks like the other stuff at [webmasterworld.com...] is fixed, by the way.
[edited by: g1smd at 4:52 pm (utc) on Aug. 13, 2006]
I hope it's OK to post these links as they describe how Googlebot handles robots.txt files.
This page provides links to information on robots.txt files as they pertain to Googlebot:
[google.com...]
And this page provides information on how Googlebot interprets the situation being discussed here:
[google.com...]
If you want to block access to all bots other than the Googlebot, you can use the following syntax:
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
Googlebot follows the line directed at it, rather than the line directed at everyone.
You can always use the robots.txt analysis tool in our webmaster tools to see how Googlebot will interpret a robots.txt file.
For most other robots, the safest way to exclude all but the specified robots would be:
User-agent: The_Allowed_Bot
Disallow:

User-agent: Another_Allowed_Bot
Disallow:

User-agent: *
Disallow: /
This should work for Googlebot as well, and is a much safer bet for 'dumber' robots.
Jim
[edited by: jdMorgan at 10:05 pm (utc) on Aug. 13, 2006]
The rule of thumb I always use is "the most specific directive applies." So if you say "Everyone in the room, leave. g1smd, please stay; we need to chat" then everyone but g1smd would mosey.
Although this is how we've done things for a long time (and I think every other major engine works this way), I agree it's good to get the word out, g1smd. It's on the front page of WebmasterWorld, so I think the word is out. :) My takeaway would be to find a good robots.txt checker and test out a new file before making it live.
[edited by: GoogleGuy at 12:49 am (utc) on Aug. 14, 2006]
User-Agent: *
Allow: /
Disallow: /cgi-bin
That will disallow the entire site on some bots.
There is no "Allow" directive in the original robots.txt standard, and we have documented Slurp misinterpreting it in the past as a Disallow line, which ultimately removed the entire site from Yahoo. That behavior has been changed, but it is a clear-cut case of what can go wrong when using nonstandard (improper) syntax in robots.txt.
This debate arises from a somewhat ambiguous statement in the robots.txt standard.
"the record describes the default access policy for any robot that has not matched any of the other records."
In other words, if a specific rule is present the "*" rule is ignored. Order shouldn't matter.
So ... according to a close reading of the specification, the rules for a specific user agent _entirely override_ the "User-agent: *" rules. Therefore, any rule under "User-agent: *" that should also be applied to googlebot must be repeated under "User-agent: googlebot."
Am I wrong?
[edited by: SEOEgghead at 7:37 pm (utc) on Aug. 15, 2006]
I agree that it's wise to put "*" last in any case. I think that cures the ambiguity, right?
I think many programmers would tend to read it the _wrong_ way. I have in the past. It kind of reads like a "switch" statement in C, where order counts. By convention (though C does not actually require it), the default case goes last. It's a so-so analogy, but you see my point. If they had used the token "default" instead of "*", nobody would mistake it for a glob/regex like many have ... </rant>
Jim
Does anyone know if Yahoo and MSN also follow this very literal interpretation of the specification?
No. Some bots have in the past read the file in increasing order of importance: what comes last in the robots.txt trumps what comes first. A sequential reading of robots.txt is how Infoseek and Scooter used to work.
[edited by: Brett_Tabke at 4:23 pm (utc) on Aug. 19, 2006]
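To compare the readings discussed in this thread side by side, here is a small Python sketch. It is a toy model, not any search engine's parser: the three strategy names ("specific", "first", "last") and all function names are made up for illustration, and only plain-prefix Disallow lines are handled.

```python
def parse_records(text):
    """Split robots.txt text into (agents, rules) records."""
    records, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if rules:  # a rule line closed the previous record
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field.lower() == "disallow":
            rules.append(value)
    if agents:
        records.append((agents, rules))
    return records

def rules_for(text, bot, strategy):
    """Pick the Disallow rules one bot would obey under a given reading."""
    bot = bot.lower()
    hits = [r for r in parse_records(text)
            if "*" in r[0] or any(a in bot for a in r[0])]
    if not hits:
        return []
    if strategy == "specific":  # a named record overrides "*"
        named = [r for r in hits if any(a != "*" and a in bot for a in r[0])]
        return (named or hits)[0][1]
    if strategy == "first":     # stop at the first record that applies
        return hits[0][1]
    if strategy == "last":      # later records trump earlier ones
        return hits[-1][1]
    raise ValueError(strategy)

def blocked(text, bot, path, strategy):
    return any(rule and path.startswith(rule)
               for rule in rules_for(text, bot, strategy))

STAR_FIRST = "User-agent: *\nDisallow: /\n\nUser-agent: Googlebot\nDisallow:\n"
STAR_LAST = "User-agent: Googlebot\nDisallow:\n\nUser-agent: *\nDisallow: /\n"

for name, text in [("star first", STAR_FIRST), ("star last", STAR_LAST)]:
    for strategy in ("specific", "first", "last"):
        print(name, strategy, blocked(text, "Googlebot", "/forum/", strategy))
```

In this model the "specific" reading gives the same answer for both orderings, a first-match reader is only safe when "*" comes last, and a last-wins reader is only safe when "*" comes first. That is one more reason to test a file against each bot that matters rather than rely on a single ordering.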