
bad boy googlebot

or is it my fault?


incywincy

8:32 am on Jul 22, 2002 (gmt 0)

10+ Year Member



i thought that i had excluded googlebot from my cgi bin but it doesn't seem to adhere to my robots.txt!! have i made a mistake or doesn't googlebot respect the robots.txt?

fragment from robots.txt:
User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /protected/
Disallow: /private/
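As a sanity check on the rules quoted above (not part of the original thread; a minimal sketch using Python's standard-library robots.txt parser, with the fragment pasted inline so it's self-contained — against a live site you would point the parser at the /robots.txt URL instead):

```python
# Parse the quoted robots.txt fragment and ask whether a compliant
# crawler may fetch the URLs from the access log below.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /protected/
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler must not fetch anything under /cgi-bin/:
print(rp.can_fetch("Googlebot", "/cgi-bin/track.pl?os=www.example.com"))  # False
# ...but ordinary pages are still allowed:
print(rp.can_fetch("Googlebot", "/index.html"))  # True
```

If the parser also says the URL is disallowed, the file itself is fine and the question is why the crawler fetched it anyway.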

fragment from access log:
64.68.82.26 - - [19/Jul/2002:12:44:02 +0100] "GET /cgi-bin/track.pl?os=www.example.com HTTP/1.0" 302 302 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.82.78 - - [19/Jul/2002:13:24:19 +0100] "GET /cgi-bin/track.pl?os=www.example.com HTTP/1.0" 302 294 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.82.6 - - [19/Jul/2002:13:54:35 +0100] "GET /cgi-bin/track.pl?os=www.example.com HTTP/1.0" 302 289 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.82.78 - - [19/Jul/2002:14:06:31 +0100] "GET /cgi-bin/track.pl?os=www.example.com HTTP/1.0" 302 297 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

[edited by: engine at 8:49 am (utc) on July 22, 2002]
[edit reason] Edited for generic website examples [/edit]

gsx

8:41 am on Jul 22, 2002 (gmt 0)

10+ Year Member



Google can choose to spider your whole site. It normally does respect robots.txt, and it shouldn't list the pages it spiders within these areas, but it can still fetch them. That can help it filter out porn and other dodgy sites with its adult filters, for example.

ciml

9:41 am on Jul 22, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd say the other way round, gsx. Google should not fetch /robots.txt protected URLs, but often lists them without fetching them.

incywincy, when did you last change your /robots.txt file to exclude /cgi-bin/? I can't see why Google would spider those URLs.

incywincy

10:25 am on Jul 22, 2002 (gmt 0)

10+ Year Member



hi ciml, i changed my robots.txt about a month or two ago based on an example posted here at wmw.

in the past i had noticed that the full cgi-path and target url (this is a click thru counter script) were included in google's index, that's why i put the cgi-bin in my robots.txt.

Grumpus

11:37 am on Jul 22, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Aren't you calling "track.pl" as a web counter or something along those lines from every page (or several pages) on your site? If so, googlebot's gonna hit the tracking code from any page you call it because, well, because you called it. It's not actually going into that directory, it's just executing the code from the snippet you placed in pages outside of the banned region.

G.

incywincy

11:42 am on Jul 22, 2002 (gmt 0)

10+ Year Member



hi grumpus, i'm not sure what you mean. i use track.pl to count click-thrus on my banner ads, certainly not on every page of my website. my question was why googlebot does a GET on a url that i have disallowed. sure, it can pull the html page that contains the track.pl call, but it shouldn't then do a GET on the cgi-bin url, should it?

Grumpus

11:49 am on Jul 22, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Seems to me that Googlebot will crawl the page, then follow the links on it. So if it's not actually executing the script on the "load" of the page as I first guessed, but only when a link is "clicked", then all the googlebot is doing there is trying to follow every link on the page. I'm of the opinion that the bot doesn't know "where" a link actually goes until it tries to follow it. So, in essence, you haven't told the bot not to follow the links on page A, and it tries the link to your track.pl. Once it gets there, it realizes it's not supposed to index anything there, so it moves on.

G.

ciml

11:54 am on Jul 22, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> about a month or two ago

Then I find this quite surprising. I suggest contacting Google; hopefully your problem won't be lost among the many obviously false reports I'm sure they get each day.

backus

12:01 pm on Jul 22, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google has already admitted, at least once, to spidering pages even when it shouldn't, though it doesn't list them.

ciml

12:03 pm on Jul 22, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



backus, it's rare and as far as I know they don't intend to fetch /robots.txt forbidden pages with the standard 'bot.

Key_Master

5:47 pm on Jul 22, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



incywincy,

It's not as rare as ciml has suggested. It happens more frequently than people choose to believe. If Google gets back to you on this, please post their explanation of why the spider chose to disobey your robots.txt.

mbauser2

8:24 pm on Jul 22, 2002 (gmt 0)

10+ Year Member



incywincy, if this is about the site in your profile, I have to say: Your robots.txt is a little weird.

It includes a lot of "User-agent" tokens that don't look legit to me, like "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)". The robots.txt specification says tokens should be one word, without version information. Also, some of the tokens you're using include characters not typically seen in robots.txt, like parentheses.

It's possible those nonstandard tokens are causing Googlebot to parse your robots.txt wrong, but that's just an educated guess. My advice would be to cut all the User-agent tokens you haven't verified with the entities you're trying to block, and see if that helps.
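A cleaned-up fragment along the lines mbauser2 suggests (a sketch, not incywincy's actual file; the spec expects one short token per record, matched as a substring of the robot's name):

```
# Not a valid token: a full browser string with spaces and parentheses
# User-agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows NT)

# Valid: one short token per record
User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /protected/
Disallow: /private/
```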