
Googlebot appears to ignore robots.txt

         

Key_Master

4:43 am on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google is crawling the cgi-bin on one of my sites even though it is (and has been) clearly prohibited from doing so in my (validated) robots.txt file. My security software kicked in automatically and prevented Googlebot from retrieving any information but this still has me greatly concerned. Others on this forum have reported similar invasions into their cgi-bin.

Privacy and security concerns aside, what effect does this have on a site's theme and overall PR?

<rant>GoogleGuy, what part of Disallow: /cgi-bin/ does your robot not understand? How would you like it if I sent my spider into your protected areas? :(</rant>

No respect!
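For reference, the rule being ignored is a single robots.txt directive; a minimal file that tells all compliant crawlers to stay out of the cgi-bin looks like this:

```
# robots.txt -- block all compliant robots from the CGI directory
User-agent: *
Disallow: /cgi-bin/
```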

Key_Master

6:35 am on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


Searching Google for [url=http://www.google.com/search?q=%2Fcgi-bin%2F]/cgi-bin/[/url] yields more examples.

www.altavista.com/cgi-bin/query
http://www.altavista.com/robots.txt

www.listbot.com/cgi-bin/subscriber
http://www.listbot.com/robots.txt

babelfish.altavista.com/cgi-bin/translate
http://babelfish.altavista.com/robots.txt

www.nsf.gov/cgi-bin/getpub?gpg
http://www.nsf.gov/robots.txt

Hehe! [url=http://www.google.com/search?q=%22google.%2Bcom/cgi-bin/%22]Googlebot[/url] doesn't even obey the rules at home.

labs.google.com/cgi-bin/keys
http://labs.google.com/robots.txt

Brett_Tabke

7:04 am on Jul 4, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



All the examples I'm looking at, KM, are "linked" listings. Google is obeying the robots.txt on them. None of the ones that appear to be offending have descriptions, so Google is a-ok on those.

Like the labs one:
[google.com...]

That's all fine because the link is built from inbound links. Google's not under obligation to "not list" those.

It also doesn't say that Google can't list anything that is blocked by a robots.txt. As far as I know, a robots.txt specifically blocks the crawler - not the listing/indexing. So the SE is still free to do what Google is doing and build a listing based solely on inbound links.

I don't care for that either KM.
<edit>fixed url</edit>

[edited by: Brett_Tabke at 7:39 am (utc) on July 4, 2002]

Key_Master

7:11 am on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It also doesn't say that Google can't list anything that is blocked by a robots.txt. As far as I know, a robots.txt specifically blocks the crawler - not the listing/indexing. So the SE is still free to do what Google is doing and build a listing based solely on inbound links.

What's the purpose of a robots.txt then? It can crawl and list sites blocked in robots.txt and still be compliant? No offense, but that sounds like a bunch of $#&*$^!
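For what it's worth, the strict interpretation is easy to state mechanically. A sketch using Python's standard-library robots.txt parser (a modern convenience, not anything from the thread; example.com is a placeholder): a compliant crawler must not fetch a disallowed path, whatever it decides about listing the URL.

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly -- no network fetch needed.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
])

# A compliant crawler may not fetch anything under /cgi-bin/ ...
print(rp.can_fetch("Googlebot", "http://example.com/cgi-bin/query"))  # False
# ... but other paths are fair game.
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))     # True
```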

PsychoTekk

7:16 am on Jul 4, 2002 (gmt 0)

10+ Year Member



i also got pages indexed by google which are
1st: disallowed by robots.txt and
2nd: have the noindex tag on em :(

Brett_Tabke

7:16 am on Jul 4, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Correct. I've always wondered about the question of timing:

The Robots Exclusion Protocol is a method that allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot.

When does that go into effect? What about Google, which apparently has separate crawlers that "do the deed" vs. those that grab an occasional copy of robots.txt?

I do agree with your strict interpretation of it. I've grown used to Google's behavior, since they've done it on and off since the start.

[robotstxt.org...]

Key_Master

7:32 am on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why only the /cgi-bin/? Why doesn't it crawl other robots.txt prohibited directories and files?

If Google can't obey robots.txt (the strict way, not some made-up Googly way), the consequences are going to be disastrous for Web security. And I thought Microsoft was bad.

Hide those credit card lists! You have to presume Google will index it if it finds it. Don't rely on robots.txt for peace of mind.

PsychoTekk

7:44 am on Jul 4, 2002 (gmt 0)

10+ Year Member



Why only the /cgi-bin/? Why doesn't it crawl other robots.txt prohibited directories and files?

It also crawled disallowed folders other than cgi-bin, which I know were disallowed by robots.txt from the first day they were set up.

chiyo

8:14 am on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There is absolutely no way you should have credit card lists on a public server anyway. They should at least be on a non-public, secure server. When you put up information on the public Web, you have to live with people or robots other than those you want looking at it. Robots.txt is one way of reducing this, but there is no law that says robots have to respect it. The only way to be 99.9% sure that confidential info is not retrieved is to store it off the Web or on secure, protected servers.

highman

8:32 am on Jul 4, 2002 (gmt 0)

10+ Year Member



>Hide those credit card lists!

Oh dear... no way! You should not have lists like that on a server!
If you want total security, don't link to it.

Woz

8:47 am on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think KM was being specific when saying "Credit Cards", rather using that as an example.

But the message still applies: don't keep sensitive information on an open server or, if you must, secure it with passwords and the like.

There was a case a while ago about credit card or banking details being indexed by the big G, was there not? Can't remember the details, but a worthwhile wake-up call nonetheless.

Onya
Woz

ciml

1:48 pm on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Surely the main point is "...to indicate to visiting robots which parts of their site should not be visited by the robot...". Other than rare occasions in the past, Googlebot's been very well behaved about robots.txt: it can list blocked URLs, but it shouldn't fetch them.

If I find someone with unencrypted credit card numbers on my servers then I throw them off. It's just not right, even if the numbers are outside of document root. If numbers are stored encrypted, then it is not right to have the decryption key on the server either.

Doofus

7:21 pm on Jul 4, 2002 (gmt 0)



A major Google bug beginning early this morning.

Despite a cgi-bin disallow that Google has respected for six months now, it crawled nearly 5,000 cgi links today before I caught and blocked them.

Each of these 5,000 generates a social network diagram that requires very substantial CPU activity.

The speed at which Google was grabbing these turned our machine into a boat anchor; the CPU load was way over the top.

Bad boy Googlebot, bad boy.

Key_Master

7:44 pm on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Googlebot is still trying to crawl mine. I wrote a plugin that sends Google a detailed E-mail each and every time it accesses my cgi-bin.

Their inbox is going to be full tomorrow.

Grumpus

8:11 pm on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah - since yesterday sometime, Googlebot has been jinkin' around in several of my "Disallowed" directories. Never had the problem before. Didn't have it when the crawl started last week, either. Interesting, eh?

G.

Doofus

8:25 pm on Jul 4, 2002 (gmt 0)



My "block" consists of a single-line "Server too busy" message that allows me to handle about 30 Googlebot cgi-bin requests per second without affecting our load. So far I'm only getting about 8 per second. It's still trying though.

I think the problem is that Google picked up all these cgi-bin links from my static files. It got thousands of these static files, as usual, in the normal crawl that ended a couple days ago. A lot of these cgi-bin links ended up as "uncrawled" link listings in their SERPs from the June crawl and update. I just noticed this in June; I don't know if this is new or not, but I would have preferred that Google not list them in the SERPs at all.

Now it's the July crawl and, it seems, Google is in Stage Two of the crawl, where they go back and get the uncrawled links from Stage One. The only problem is that they forgot to check the robots.txt before launching Stage Two. The robots.txt was the reason they didn't get crawled in Stage One!

I have noticed in the past that Google ends up with a handful of cgi-bin pages from my site due to external links (from weblogs and such) that appear now and then. It's not in the habit of checking robots.txt for these one-shot GETs. But this latest Googlebot behavior is a bug at a more fundamental level. If Google doesn't get it straightened out, the next step is to block them at our router.

For some reason, .htaccess in my www.domain.org root doesn't work on our cgi-bin, in the sense that a working "Deny" in the root for www.domain.org will still allow access to www.domain.org/cgi-bin/... It's probably a configuration issue. All of the various domains on our Class C use the same common cgi-bin directory.

Does anyone know of how to put something in Apache's httpd.conf to give me .htaccess control over cgi-bin?
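One plausible explanation, offered here as an assumption: a ScriptAlias'd cgi-bin normally lives outside DocumentRoot, so a .htaccess under the document root can never govern it. A <Directory> block in httpd.conf (Apache 1.3 syntax; the path and address range are hypothetical) can either carry the deny directly or enable a .htaccess placed inside the cgi-bin directory itself:

```
# httpd.conf (Apache 1.3 syntax) -- path is hypothetical. Because the
# shared cgi-bin sits outside every DocumentRoot, access control for it
# has to be declared on the directory itself.
<Directory "/usr/local/apache/cgi-bin">
    # Let a .htaccess inside this directory carry Order/Deny directives...
    AllowOverride Limit
    Order allow,deny
    Allow from all
    # ...or deny directly here (hypothetical address range):
    # Deny from 216.239.46.
</Directory>
```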

Key_Master

10:30 pm on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google has responded. They are telling me that they're getting a 403 error when they access robots.txt using User-agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html). Although I have tried, I can't seem to emulate this problem and I don't think it's an error on my part. Here's the server header I get.

GET /robots.txt HTTP/1.1

Host: mydomain.com
Connection: close
User-Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)

HTTP/1.1 200 OK
Date: Thu, 04 Jul 2002 22:19:56 GMT
Server: Apache/1.3.20 (Unix) ApacheJServ/1.1.2 PHP/4.2.1 FrontPage/5.0.2.2510 Rewrit/1.1a
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain

Input?

GoogleGuy

11:25 pm on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'll check into it and see if something has changed on our end to make sure we're respecting robots.txt. If you think we're crawling your site despite a robots.txt, could you check if you give a 403 when you try to fetch /robots.txt on your own site? (Not you Key_Master, I'm talking about the 1-2 other people who have chimed in. KM, do you do any user-agent/IP delivery stuff? Just curious, because I'm seeing some weird stuff..)

GoogleGuy

11:55 pm on Jul 4, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



KM, our folks are seeing 403's when we try to fetch your robots.txt. Is there anything strange with your ISP or host? If you do any cloaking/IP/user-agent delivery, that may be part of the problem. Looks like a normal fetch gives a 200, but Googlebot sees a 403?

Key_Master

12:02 am on Jul 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



KM, do you do any user-agent/IP delivery stuff? Just curious, because I'm seeing some weird stuff..

LMAO! :)

Weird stuff maybe (in a nerdy type of way). IP delivery, somewhat. For example, I feed Googlebot a

<meta name="googlebot" content="noarchive">

tag and MSIE browsers a

<meta name="MSSmartTagsPreventParsing" content="TRUE">

tag. I also conduct studies on the elusive type of spiders that use common user-agent strings (e.g. Mozilla/x.x (compatible; MSIE x.x; Windows NT)). Sorry if that was you who got banned. The lack of a referrer will trigger the deny plugin for that user-agent. Other than that, I don't cloak content (nor could I with my current setup). I'm just interested in preventing unauthorized access to my content. You could call it a hobby of mine. ;)

Key_Master

12:09 am on Jul 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am seeing no log entries for any Google IP showing a 403. Could it be possible the 403 you are seeing is on your end?

I also sent my spider in from a different domain to see if I would get a 403 error but it didn't.

GoogleGuy

12:42 am on Jul 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Gotcha. If this was a simple case (straight Apache server, static robots.txt text file with vanilla set-up), that would help. Sounds like you've got a more complicated set-up though. Could you tell me more about the "deny plug-in" and how you match user-agents? Is there any way to turn that off temporarily to see if that's the problem?

Key_Master

1:01 am on Jul 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've disabled robots.txt logging. You won't see the

<!-- Name Version (URL) -->

tag anymore. I should have mentioned, it's all done via SSI (no mod_rewrite or URL redirections). For .html pages the tag is inserted between the <head></head>. For .txt pages you would only see

<!-- Name Version (URL) -->

without the <meta> tags or other HTML elements being shown.

The deny plugin can be seen in the plugin graphic available on the site listed in my profile. I don't want to elaborate more than that out of respect for the TOS here. You can, however, stickymail me or E-mail me directly for more details if you like.
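As a sketch of the kind of SSI-based conditional tagging described above (mod_include's legacy expr syntax; the exact mechanism Key_Master uses isn't disclosed), one could write:

```
<!--#if expr="${HTTP_USER_AGENT} = /Googlebot/" -->
<meta name="googlebot" content="noarchive">
<!--#elif expr="${HTTP_USER_AGENT} = /MSIE/" -->
<meta name="MSSmartTagsPreventParsing" content="TRUE">
<!--#endif -->
```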

rjohara

2:40 am on Jul 5, 2002 (gmt 0)

10+ Year Member



[big]GoogleGuy[/big] 7:55 pm on July 4, 2002 (utc -4)

KM, our folks are seeing 403's when we try to fetch your robots.txt.

You know, I wonder if anybody at, say, Microsoft is answering user queries on a public bulletin board on the 4th of July? (US Independence Day [archives.gov], for the international crowd. That's why your US referrals vanished today.)

This raises two other possibilities, though: (1) GoogleGuy is in fact the night watchman at the GooglePlex and just makes all this stuff up from some secretary's computer that got left on by mistake; or (2) the GooglePlex is a facade and the whole operation is run out of Indonesia.

RJO ;)

Key_Master

10:11 pm on Jul 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



GoogleGuy, I spent the morning sifting through yesterday's server log file. Googlebot made over half a dozen requests for robots.txt, and each hit returned a 200. The only 403 errors logged were made by the technician while using the user-agent Wget. As I told the technician, we ban Wget (like many other webmasters do) because of abuse. Is this your new user-agent, and if so, should we place it in our robots.txt?

Other than that, no other hits were logged with a 403 (which could only mean your requests that resulted in a 403 never reached our server).

Here's a sample from the log file showing a 403:

216.239.45.4 - - [04/Jul/2002:17:57:12 -0400] "GET /robots.txt HTTP/1.0" 403 - "-" "Wget/1.8.2"

I've had around a hundred people (literally) from around the world check on this robots.txt file and every one got a 200.

straight Apache server, static robots.txt text file with vanilla set-up

It is a static robots.txt file. Don't know what you mean by straight Apache server though. I don't have, use, or need mod_rewrite if that is what you are alluding to.
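Log checks like the one above can be automated; a small sketch (Python regex over a combined-format entry, using the sample line from this post) that pulls out the status code and user-agent for /robots.txt fetches:

```python
import re

# Combined-log-format sample from the post above.
line = ('216.239.45.4 - - [04/Jul/2002:17:57:12 -0400] '
        '"GET /robots.txt HTTP/1.0" 403 - "-" "Wget/1.8.2"')

# Capture the status code and the trailing user-agent field.
LOG_RE = re.compile(r'"GET /robots\.txt [^"]*" (\d{3}) .*"([^"]*)"$')

m = LOG_RE.search(line)
status, agent = int(m.group(1)), m.group(2)
print(status, agent)  # 403 Wget/1.8.2
```

Run over a whole access log, this quickly shows which user-agents are actually drawing the 403s.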

GoogleGuy

4:31 am on Jul 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



rjohara, the web doesn't rest so neither do we. :) This is kind of a weird case, because it almost looks like the behavior changed at one point. Let me make sure that I've got it straight though:
1. Has anything changed in the last few weeks? ISP, IP addresses, hosting companies, web server config, etc.?
2. You do return 403's for user-agent wget, but 200's for all other user-agents? (Googlebot always uses the user-agent Googlebot, but someone doing spot checks by hand might use wget, which might cause some confusion when we try to check it and see a 403.)
3. Have you seen Googlebot hits in your cgi-bin lately (last day or so?), or has it stopped? We periodically refetch the robots.txt, so we should have a fresh copy by now and should be respecting it--assuming that we were able to grab it correctly. :)

nancyb

4:33 am on Jul 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Can't say I trust Googlebot to follow robots.txt. He has crawled my cgi-bin and a couple of other directories that have always been denied by robots.txt. And then he indexed a couple of pages from those banned directories as well.

I never mentioned it before, just fumed and stomped around a little, but thought I'd add my 2 cents today.

Key_Master

11:50 pm on Jul 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



GoogleGuy,

No changes were made (e.g. same IP, same configuration). Googlebot has been behaving itself lately though.

Sorry if I came across a little ticked off (which I was), but I meant no disrespect towards yourself or Google.com :)

GoogleGuy

1:35 am on Jul 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Glad that Googlebot is behaving now--let me know if it gets away again. At this point, my best guess is a transient error (either on our part or a 403 somehow); it looks like when the bots grabbed a refreshed robots.txt, things worked fine.

Just a quick note to other people who think that we've got problems with our robots.txt: We hadn't heard any complaints about robots.txt for several months, but we always want to make sure that we're handling robots.txt correctly. If you've seen some behavior that you think might be bad, you can mail to search-quality@google.com with a subject line like "robots.txt".

Here are a few things that you should check for first though:
1. If you try to fetch www.yourdomain.com/robots.txt, do you get a valid page, preferably with a 200 status code? You should not return a 403 (forbidden error).
2. If you see result pages returned that are supposed to be forbidden, are you sure we crawled the page? Google can return a link as a result even if we didn't crawl the page. One key thing to look for is that if the page doesn't have a snippet, then we probably didn't crawl the page.

Bonus points if you run a simple server (no cloaking/IP or user agent delivery, etc.), and if your configuration has been the same for several months. For better or for worse, >99% of robots.txt complaints turn out to be due to a webmaster mistake, so doing these simple checks will act as a prefilter and let people know that they should take your email seriously.
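Check #1 can be exercised end-to-end without touching a live site; a self-contained sketch (Python standard library only, nothing from the thread) that serves a deliberately misconfigured robots.txt and confirms a crawler-style fetch sees the 403:

```python
import http.server
import threading
import urllib.error
import urllib.request

class Handler(http.server.BaseHTTPRequestHandler):
    """Toy server that forbids robots.txt -- the misconfiguration at issue."""
    def do_GET(self):
        if self.path == "/robots.txt":
            self.send_error(403)   # the failure mode GoogleGuy describes
        else:
            self.send_response(200)
            self.end_headers()

    def log_message(self, *args):
        pass                       # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

req = urllib.request.Request(
    f"http://127.0.0.1:{port}/robots.txt",
    headers={"User-Agent": "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"},
)
try:
    status = urllib.request.urlopen(req).status
except urllib.error.HTTPError as e:
    status = e.code                # a 403 here means check #1 fails

print(status)  # 403
server.shutdown()
```

Pointing the same fetch at your real domain (and comparing against a plain-browser user-agent) reproduces the spot check Google's technician performed.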

Hope this helps,
GoogleGuy

Brett_Tabke

10:49 am on Jul 10, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



> 99% of robots.txt complaints turn out to
> be due to a webmaster mistake,

Interesting. Thanks for the effort.

This 53-message thread spans 2 pages.