Forum Moderators: Robert Charlton & goodroi

Robots.txt

how long before it is obeyed


grandpa

6:27 pm on Jan 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Googlebot ran afoul of a spam trap back in mid-January after it discovered a calendar on a wiki page on my site. Presumably it was going through each calendar day, rather quickly, and a ban was automatically imposed. I discovered the ban, removed it, and modified my robots.txt file to prevent a recurrence by adding Disallow entries for pages in that directory.

I checked the stats in Webmaster Tools, modified my sitemap.xml file and resubmitted it, then waited. Finally, this morning Googlebot came back, and it went right back to that calendar. Hundreds of hits were recorded this morning. At about 6:00 am I again modified robots.txt, excluding the entire directory. Still, hours later, the bot persists in gobbling up those calendar pages.

The directory is banned in robots.txt. The directory is gone from my sitemap. There are still links to the wiki from my home page, but not to the calendar. I was under the impression that a change to robots.txt would be respected immediately. Could I be wrong?

Here is the relevant portion of robots.txt, and I can't see anything there that would allow this behavior.

User-agent: Googlebot
Disallow: /wiki/

What gives? The last thing I want to have to think about today is how to recover from another ban given to the bot.

jdMorgan

6:56 pm on Jan 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You could go back through your raw logs and see when the 'bot that loaded the disallowed pages fetched robots.txt -- match them up by IP address.
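Something along these lines would do it -- a rough sketch only, assuming Apache-style combined logs in a file named access.log; adjust the filename and URL patterns to your own setup:

import re
from collections import defaultdict

# collect robots.txt fetches and calendar-page fetches, keyed by client IP
hits = defaultdict(list)
with open("access.log") as log:
    for line in log:
        m = re.match(r'(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+)', line)
        if not m:
            continue
        ip, stamp, path = m.groups()
        if path == "/robots.txt" or path.startswith("/tiki-calendar"):
            hits[ip].append((stamp, path))

# for each IP that fetched robots.txt, print its requests in order,
# so you can see whether the disallowed pages came before or after
for ip, requests in sorted(hits.items()):
    if any(p == "/robots.txt" for _, p in requests):
        for stamp, path in requests:
            print(ip, stamp, path)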

Normally, I'd give any 'bot 24 hours, just to be on the safe side.

A possible cause of your problem is that something else is wrong with your robots.txt file. Perhaps it's invalid, or perhaps there is another record in it that Googlebot considers an override of your Disallow.

While Googlebot is normally fairly sophisticated about handling multiple robots.txt records whose User-agent string might apply to it --it apparently looks for the 'best match'-- I wouldn't count on that behaviour. You should assume that any 'bot will accept (only) the first record in your robots.txt that matches or partially matches its user-agent name or "*".

Another thing to look out for: each record must end with a blank line, including the last one.
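For example, a minimal layout that plays by those rules -- most specific record first, records separated by blank lines, and a blank line after the last record too (this is only an illustration using your /wiki/ rule):

User-agent: Googlebot
Disallow: /wiki/

User-agent: *
Disallow: /wiki/

That trailing blank line after the final record is exactly the sort of thing that gets lost when trimming bytes.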

I also recommend that you check the MIME type returned with robots.txt fetches, and be sure that it's "text/plain". Be sure to use standard encoding and line endings -- I recommend Unix-standard line endings, that is, line-feeds only.
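A quick way to check is a small Python sketch like this (substitute your own hostname for www.example.com):

import urllib.request

# fetch robots.txt and print the Content-Type header the server returns;
# it should read text/plain, not text/html
with urllib.request.urlopen("http://www.example.com/robots.txt") as resp:
    print(resp.headers.get("Content-Type"))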

Have another look at the Standard for Robot Exclusion, and read it strictly: If it doesn't say that something will work, then assume that it won't work.

Jim

trinorthlighting

7:00 pm on Jan 31, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If the pages that are indexed are in the supplemental index, it might take a while for Google to recrawl them and drop the URLs out of the index.

grandpa

10:58 am on Feb 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



OK, it's not quite 24 hours from the final change to my file, and the Webmaster Tools are reporting that the directory is disallowed. However, the following message from the tools is still a little bothersome:

Disallow: /wiki/ -- Detected as a directory; *specific files may have different restrictions*

(Emphasis is mine.)

Who or what will decide which specific files may have different restrictions?

"another record in it that Googlebot considers as an override of your Disallow"

I can't spot anything that might be considered an override.

"error in robots.txt"

I spotted something in the tools' Parsing Results informing me that Crawl-delay: 20 would be ignored by G-bot. The crawl delay is set up for msn, and that record was ahead of the Googlebot rules. The Tools still indicate that my rules are working on selected directories -- so not an error in any sense, unless the Tools and the bots behave differently. (I've sketched a cleaned-up layout at the end of this post.)

"each record must end with a blank line, including the last one"

Red-faced here -- they didn't, until this morning's change. Like a good coder I thought I might be saving a few bytes. Blank lines are a scourge :(

"check the MIME-type returned with robots.txt fetches, and be sure that it's 'text/plain'"

The Content-Type had been verified as text/html; charset=utf-8.

"and read it strictly"

Point taken, and that's coming up. I'm smiling because I've done that several times, and of all the tasks I've ever had to do as a webmaster, reading the exclusions text is the worst. Period. But here I go again. Ugh. I should just paste a copy into my r*.txt and be done with it.

I'm still taking hits on that calendar. Since mid-January there have been over 25,000 hits on that one page, for every day and session ID imaginable to mankind. Session IDs are part of the problem, and that's a configuration issue on the wiki. But I thought using robots.txt would be enough to prevent this sort of thing by excluding access altogether. Here is one of the final log entries from yesterday:

66.249.66.76 - - [30/Jan/2007:23:59:10 -0800] "GET /tiki-calendar.php?viewmode=day&mon=7&day=-27&year=2004&PHPSESSID=390c412fe5716dacdae0b2598c02c248 HTTP/1.1" 200 25153 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Jim, I'm going to look at my .htaccess too; there are a lot of overrides in there, and maybe one is a problem.
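For the record, here's the cleaned-up layout I'm aiming for -- just my own sketch, and note that the wildcard in the last Googlebot Disallow is a Googlebot extension rather than part of the original exclusion standard, so I'm not certain every bot will honor it:

User-agent: Googlebot
Disallow: /wiki/
Disallow: /*PHPSESSID

User-agent: msnbot
Crawl-delay: 20
Disallow: /wiki/

User-agent: *
Disallow: /wiki/

Each record separated by a blank line, and one after the last record as well, per Jim's advice.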

grandpa

12:56 pm on Feb 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My current cached version still has groups ending without CR/LF, so the directory changes I made can still be considered suspect. Anyway, that's why I couldn't stop the bot when I made my changes. And I have the answer to my question of how long before it is obeyed. I removed the Crawl-delay until the whole mess is sorted out, i.e. starting from scratch. One thing is clear: once the bot found that calendar, it passed the URL to all its friends and they had a field day.

I'm giving the entire network a 410 response on that page right now, until my newest version of robots.txt is cached. At last they have moved on to another page...
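In case it helps anyone else, the 410 is coming from a couple of lines in .htaccess, roughly like this (assuming mod_rewrite is available and tiki-calendar.php lives at the web root):

RewriteEngine On
# send 410 Gone for the calendar page, regardless of query string
RewriteRule ^tiki-calendar\.php$ - [G]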

jdMorgan

1:51 pm on Feb 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm looking at your log entry:

66.249.66.76 - - [30/Jan/2007:23:59:10 -0800] "GET /tiki-calendar.php?viewmode=day&mo...

And your robots.txt Disallow:

Disallow: /wiki/

and hoping that the latter is not intended to Disallow the former...

If by chance you're rewriting requests for /tiki-calendar to /wiki/<something>, then you'll need to Disallow the URL (/tiki-calendar) in robots.txt, not the filepath (/wiki).
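That is, something along these lines -- hypothetical, so match it against the URLs exactly as they appear in your raw logs:

User-agent: Googlebot
Disallow: /tiki-calendar.php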

Jim

grandpa

2:29 pm on Feb 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Actually, the rewrite is to wiki.domain.com/page.php and the directory path is /wiki. Are you suggesting that the directory path will not work as a Disallow with that rewrite?

jdMorgan

2:56 pm on Feb 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You need to disallow URLs, not filepaths.
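Remember, too, that robots.txt is fetched per hostname: if those pages are actually served as wiki.domain.com/page.php, then the file at wiki.domain.com/robots.txt is the one that governs them, and its rules must match the URL-path as requested on that host -- for example (hypothetical):

User-agent: Googlebot
Disallow: /page.php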

Jim

grandpa

5:02 pm on Feb 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



::sigh::