Forum Moderators: Robert Charlton & goodroi
I checked the stats in Webmaster Tools, modified my sitemap.xml file, resubmitted it, and waited. Finally, this morning Googlebot came back and went right back to that calendar; hundreds of hits were recorded. Around 6:00am I modified robots.txt again, excluding the entire directory. Still, hours later, the bot persists in gobbling up those calendar pages.
The directory is banned in robots.txt. The directory is gone from my sitemap. There are still links to the wiki from my home page, but not to the calendar. I'm under the impression that a change in robots.txt would be respected immediately. Could I be wrong?
Here is the relevant portion of robots.txt, and I can't see anything there that would allow this behavior.
User-agent: Googlebot
Disallow: /wiki/
What gives? The last thing I want to have to think about today is how to recover from another ban given to the bot.
Normally, I'd give any 'bot 24 hours, just to be on the safe side.
A possible cause of your problem is that something else is wrong with your robots.txt file. Perhaps it's invalid, or perhaps there is another record in it that Googlebot treats as overriding your Disallow.
While Googlebot is normally fairly sophisticated about handling multiple robots.txt records whose User-agent string might apply to it --it apparently looks for the 'best match'-- I wouldn't count on that behaviour. You should assume that any 'bot will accept (only) the first record in your robots.txt that matches or partially-matches its user-agent name or "*".
Another thing to look out for: each record must end with a blank line, including the last one.
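For illustration, here is a hypothetical robots.txt with two records, each terminated by a blank line (the paths and the second record are made up, not from the original file):

```text
User-agent: Googlebot
Disallow: /wiki/

User-agent: *
Disallow: /private/

```

Note the blank line after the last record as well; some parsers are strict about it.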
I also recommend that you check the MIME type returned with robots.txt fetches, and be sure that it's "text/plain". Be sure to use standard encoding and line endings -- I recommend Unix-standard line endings, that is, line-feed only.
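A quick way to check the encoding and line-ending points is to inspect the raw bytes of the file. This is just a sketch; the sample body below is hypothetical, not your actual file:

```python
# Sanity-check a robots.txt body for CRLF line endings and non-ASCII bytes.
# The sample content here is hypothetical.
body = b"User-agent: Googlebot\r\nDisallow: /wiki/\r\n\r\n"

has_crlf = b"\r\n" in body                    # True means Windows-style line endings
is_ascii = all(byte < 128 for byte in body)   # robots.txt should be plain ASCII

print("CRLF present:", has_crlf)
print("ASCII only:", is_ascii)

# Normalize to LF-only line endings, as recommended above.
fixed = body.replace(b"\r\n", b"\n")
```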
Have another look at the Standard for Robot Exclusion, and read it strictly: If it doesn't say that something will work, then assume that it won't work.
Jim
Who or what will decide which specific files may have different restrictions?
record must end with a blank line, including the last one.
Red-faced. They were, until this morning's change. Like a good coder, I thought I might be saving a few bytes. Blank lines are a scourge :(
MIME-type returned with robots.txt fetches, and be sure that it's "text/plain"
The Content-Type had been verified as text/html; charset=utf-8.
and read it strictly
Point taken, and that's coming up. I'm smiling because I've done that several times, and of all the tasks I've ever had to do as a webmaster, reading the exclusions text is the worst. Period. But here I go again. Ugh. I should just paste a copy into my r*.txt and be done with it.
I'm still taking hits on that calendar. Since mid-January there have been over 25,000 hits on that one page - for every day and session ID imaginable. Session IDs are part of the problem, and that's a configuration issue on the wiki. But I thought robots.txt would be enough to prevent this sort of thing by excluding access altogether. Here is one of the final log entries from yesterday.
66.249.66.76 - - [30/Jan/2007:23:59:10 -0800] "GET /tiki-calendar.php?viewmode=day&mon=7&day=-27&year=2004&PHPSESSID=390c412fe5716dacdae0b2598c02c248 HTTP/1.1" 200 25153 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Jim, I'm going to look at my htaccess too, there are a lot of overrides in there, maybe one is a problem.
I'm giving the entire network a 410 response on that page right now, until my newest version of robots.txt is cached. At last they have moved on to another page...
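For anyone wanting to do the same, a 410 can be sent from .htaccess via mod_rewrite's G flag. This is a sketch assuming the calendar script lives at /tiki-calendar.php in the site root (adjust the pattern to your layout):

```apache
<IfModule mod_rewrite.c>
  RewriteEngine On
  # Answer 410 Gone for the calendar script (hypothetical path)
  RewriteRule ^tiki-calendar\.php$ - [G,L]
</IfModule>
```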
66.249.66.76 - - [30/Jan/2007:23:59:10 -0800] "GET /tiki-calendar.php?viewmode=day&mo...
And your robots.txt Disallow:
Disallow: /wiki/
and I'm hoping that the latter is not intended to Disallow the former...
If by chance you're rewriting requests for /tiki-calendar to /wiki/<something>, then you'll need to Disallow the URL (/tiki-calendar) in robots.txt, not the filepath (/wiki).
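That mismatch can be checked with Python's standard urllib.robotparser. This sketch feeds it the same rules quoted above and shows that Disallow: /wiki/ says nothing about /tiki-calendar.php:

```python
from urllib.robotparser import RobotFileParser

# Parse the rules quoted above directly (no network fetch needed).
parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot",
    "Disallow: /wiki/",
])

print(parser.can_fetch("Googlebot", "/wiki/SomePage"))      # blocked by /wiki/
print(parser.can_fetch("Googlebot", "/tiki-calendar.php"))  # not covered at all
```

Robots.txt matches the URL path as requested, not the filesystem path the server maps it to, which is exactly why a rewrite can quietly defeat a Disallow rule.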
Jim