Forum Moderators: Robert Charlton & goodroi

Googlebot does not obey robots.txt

4:37 am on May 5, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2003
posts:1929
votes: 0


I've never seen this happen before. Googlebot got caught in the spider trap. The path was specifically banned by robots.txt. There was nothing behind it but a trap.

IP: 66.249.65.238 
UA: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Did this happen to anyone else?
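For readers unfamiliar with the setup: a spider trap is a page linked somewhere on the site but disallowed in robots.txt, so any client that requests it has ignored the file. A minimal sketch in Python (the trap path and record fields are illustrative, not from the original post):

```python
# Hypothetical spider-trap check: /trap/ is assumed to be disallowed in
# robots.txt, so any request for it comes from a non-compliant client.
TRAPPED_PREFIXES = ("/trap/",)

def check_request(path, ip, user_agent):
    """Build a log record; 'trapped' is True when the client
    requested a path that robots.txt disallows."""
    trapped = any(path.startswith(prefix) for prefix in TRAPPED_PREFIXES)
    return {"ip": ip, "ua": user_agent, "path": path, "trapped": trapped}

record = check_request(
    "/trap/page.html",
    "66.249.65.238",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
)
```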

2:45 pm on May 5, 2005 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38200
votes: 96


Robots.txt is not an access ban; it is a listings/index ban. Google will download an entire site, no problem, but it will not list the pages specified in robots.txt.

9:02 pm on May 5, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2003
posts:1929
votes: 0


It is an access ban.

From robots.txt FAQ [robotstxt.org]:

How do I prevent robots scanning my site?

The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:

User-agent: * 
Disallow: /

From Introduction [robotstxt.org]

...These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

From Google FAQ [google.com]:

1. How should I request that Google not crawl part or all of my site?

The standard for robot exclusion given at [robotstxt.org...] provides for a file called robots.txt that you can put on your server to exclude Googlebot and other web crawlers. (Googlebot has a user-agent of "Googlebot".)...

9:15 pm on May 5, 2005 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


"It is an access ban. "
Whatever it means technically doesn't matter. Google will still read the files, but not list them.

Dayo_UK

9:19 pm on May 5, 2005 (gmt 0)

Inactive Member
Account Expired

 
 


>>>>>>"It is an access ban. "
>>>>>>whatever it means technically, doesn't matter. Google will still read the files, but not list them.

It's also the Mozilla bot - who knows what the purpose of this bot is?

9:21 pm on May 5, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2003
posts:1929
votes: 0


It has never been caught before, on any of the sites that I own. This is the first time this has ever happened. I demand an explanation from Sergey and Larry! :)

Dayo_UK

9:22 pm on May 5, 2005 (gmt 0)

Inactive Member
Account Expired

 
 


:)

Not the only person it has happened to:-

http://www.google.com/search?hl=en&q=googlebot+mozilla [google.com]

1:29 am on May 6, 2005 (gmt 0)

Full Member

10+ Year Member

joined:July 18, 2004
posts:238
votes: 0


Since Google does not index the disallowed pages, it seems OK, because the pages are publicly accessible anyway.

Anybody can see and study all the pages in the public directories of the site.

We should simply be aware that the pages that are forbidden in robots.txt may influence the rank of the pages that are allowed.

Vadim.

5:42 am on May 6, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member powdork is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 13, 2002
posts:3347
votes: 0


They should not be accessed. That is what obeying robots.txt means. That's what their robots.txt faq says. My understanding was that robots.txt would keep the robot from crawling a specified page/directory, but that Google may still index the url as a result of finding a link to it. The way to keep a page from being indexed is to let it be crawled and put the noindex meta tag in the head of the document.
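A minimal sketch of that approach (page contents illustrative): leave the page out of robots.txt so it can be crawled, and put the exclusion in the document head instead:

```html
<head>
  <title>Example page</title>
  <!-- the page may be crawled, but robots are asked not to index it -->
  <meta name="robots" content="noindex">
</head>
```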

5:46 am on May 6, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member powdork is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 13, 2002
posts:3347
votes: 0


It's weird that Googlebot is indexing itself. A search for 66.249.65.238 shows that the IP has been crawled and indexed. No wonder they have so many pages.

Actually, it occurs to me that this may be the result of the bot trap?

8:53 am on May 6, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:June 3, 2002
posts:566
votes: 0


Powdork is absolutely right, I cannot understand a statement like
whatever it means technically, doesn't matter. Google will still read the files, but not list them.

If robots.txt contains

User-agent: whatever
Disallow: /foo/

then accessing anything within /foo/ is a no-no for the bot, and it is irrelevant whether it would be listed or not in a SERP. The purpose of disallowing a directory, or even /, is defeated if the bot crawls it nevertheless.

Slightly OT:
It is also well known that Googlebot-Image behaves badly and ignores anything that is listed under User-agent: * in robots.txt. You need to copy all the lines listed under User-agent: * into a new User-agent: Googlebot-Image section.
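For example, the workaround described above means duplicating the shared rules (directory names illustrative):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /images/
```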

Interestingly, also Yahoo-MMCrawler does not behave well.

9:43 pm on May 7, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


>> Google will still read the files, but not list them. <<

Wrong again.

They will show the files as URL-only entries in the SERPs. They appear as URL-only, only because Google is asked to not index the content. Robots.txt says nothing about recording that the URL simply exists, so Google does record it.

You have to manually remove the entries by submitting the URL of your robots.txt file to the URL console (removal tool). Removal takes a few days.

4:14 pm on May 9, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 12, 2002
posts:4479
votes: 0


>They should not be accessed. That is what obeying robots.txt means. That's what their robots.txt faq says. My understanding was that robots.txt would keep the robot from crawling a specified page/directory, but that Google may still index the url as a result of finding a link to it. The way to keep a page from being indexed is to let it be crawled and put the noindex meta tag in the head of the document.

Right. The basic idea of robots.txt was to stop bots from wasting bandwidth that a site pays for. If I have a site with 10,000 pages and I don't care whether it is listed in Google, I don't want to pay for the bandwidth of Googlebot accessing those pages over and over again.

11:48 pm on May 9, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 24, 2003
posts:729
votes: 0


I too use robot traps and have never seen Googlebot get trapped. I've been thinking about the possibility of this happening, and here are my thoughts:

Googlebot could have fallen into the bot trap, or indexed banned pages, because of the problem Googlebot has with 302 redirects. Maybe a 302 redirect caused Googlebot to hit your robot trap without knowing it had been redirected there from a different site, so it did not realize it needed to request the robots.txt file for your site.

Possible? yes/no

12:10 am on May 10, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2003
posts:1929
votes: 0


Possible? yes/no

It is definitely some kind of problem, but we can only speculate about what it is.

As several people mentioned before, robots.txt is a ban for the robot, and they should obey it. Every robot should check robots.txt before accessing any documents on a given domain. It's as simple as that.

Even wget on *nix systems obeys robots.txt by default when downloading recursively: pages that are disallowed are skipped.
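The same pre-flight check is available in Python's standard library via urllib.robotparser. This sketch parses the two-line "ban everything" file quoted earlier and confirms that a compliant crawler must refuse the fetch (host name illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
# parse() takes the file's lines directly, so no network fetch is needed
rules.parse([
    "User-agent: *",
    "Disallow: /",
])

# A compliant robot must not fetch anything on this host
allowed = rules.can_fetch("Googlebot", "http://example.com/any/page.html")
```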

12:39 am on May 10, 2005 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38200
votes: 96


> They will show the files as URL-only entries in the SERPs.

semantics - that's still not indexing them for public consumption.

> Every robot should check robots.txt before accessing
> any documents on a given domain. It's simple as that.

I agree. Unfortunately, that is not Google's interpretation of the worthless, toothless, all-but-useless robots exclusion standard.

1:00 am on May 10, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 16, 2002
posts:2139
votes: 1


>>Actually, it occurs to me that may be the result of the bot trap?

It has happened to me. A space in a dir name sent it into a tizzy.

2:27 am on May 18, 2005 (gmt 0)

New User

10+ Year Member

joined:May 8, 2005
posts:29
votes: 0


In my robots.txt I have a disallow for cart.php; however, Google grabbed it and indexed it, and it has been in the index for months. Full index and cache, not URL-only.

8:45 pm on May 18, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Are you sure that there isn't an error in the format of your robots.txt file?

To remove the pages from the index, submit the URL of your robots.txt file to the Google URL Removal service. It is one of the options in their URL console. Removal takes a couple of days.

9:17 am on May 19, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 25, 2001
posts:661
votes: 1


If you read the fine print....

The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links.

[robotstxt.org...]
Try <META NAME="ROBOTS" CONTENT="NOINDEX">

11:13 am on May 19, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:June 29, 2004
posts:81
votes: 0


g1smd said: They will show the files as URL-only entries in the SERPs. They appear as URL-only, only because Google is asked to not index the content. Robots.txt says nothing about recording that the URL simply exists, so Google does record it.

You have to manually remove the entries by submitting the URL of your robots.txt file to the URL console (removal tool). Removal takes a few days.

Hello everybody

so what g1smd says is correct, as far as I have seen for my website... after you add links to robots.txt, they don't show up with a snippet, rather only the URL...

but if I want that bare URL to be removed as well, do I have to type something like this into Google's remove URL tool:
" [somename.com...] "

hope someone can make it clear for me..

Thanks a lot to everyone

Regards,

KaMran:-)