Forum Moderators: Robert Charlton & goodroi

Googlebot does not obey robots.txt

4:37 am on May 5, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2003
posts:1929
votes: 0


I've never seen this happen before. Googlebot got caught in the spider trap. The path was specifically banned by robots.txt. There was nothing behind it but a trap.

IP: 66.249.65.238 
UA: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Did this happen to anyone else?
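For readers unfamiliar with the setup: a spider trap is a page linked somewhere on the site but disallowed in robots.txt, so any client that requests it has ignored the file. A minimal sketch in Python (the trap path and record fields are illustrative, not from the original post):

```python
# Hypothetical spider-trap check: /trap/ is assumed to be disallowed in
# robots.txt, so any request for it comes from a non-compliant client.
TRAPPED_PREFIXES = ("/trap/",)

def check_request(path, ip, user_agent):
    """Build a log record; 'trapped' is True when the client
    requested a path that robots.txt disallows."""
    trapped = any(path.startswith(prefix) for prefix in TRAPPED_PREFIXES)
    return {"ip": ip, "ua": user_agent, "path": path, "trapped": trapped}

record = check_request(
    "/trap/page.html",
    "66.249.65.238",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
)
```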

2:45 pm on May 5, 2005 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38200
votes: 96


Robots.txt is not an access ban; it is a listings/index ban. Google will download an entire site, no problem, but it will not list the pages specified in robots.txt.

9:02 pm on May 5, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2003
posts:1929
votes: 0


It is an access ban.

From robots.txt FAQ [robotstxt.org]:

How do I prevent robots scanning my site?

The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:

User-agent: * 
Disallow: /

From Introduction [robotstxt.org]

...These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

From Google FAQ [google.com]:

1. How should I request that Google not crawl part or all of my site?

The standard for robot exclusion given at [robotstxt.org...] provides for a file called robots.txt that you can put on your server to exclude Googlebot and other web crawlers. (Googlebot has a user-agent of "Googlebot".)...

9:15 pm on May 5, 2005 (gmt 0)

Senior Member

joined:Dec 29, 2003
posts:5428
votes: 0


"It is an access ban. "
Whatever it means technically doesn't matter. Google will still read the files, but not list them.

Dayo_UK

9:19 pm on May 5, 2005 (gmt 0)

Inactive Member
Account Expired

 
 


>>>>>>"It is an access ban. "
>>>>>>whatever it means technically, doesn't matter. Google will still read the files, but not list them.

It's also the Mozilla bot - who knows what the purpose of this bot is?

9:21 pm on May 5, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2003
posts:1929
votes: 0


It has never been caught before, on any of the sites that I own. This is the first time this has ever happened. I demand an explanation from Sergey and Larry! :)

Dayo_UK

9:22 pm on May 5, 2005 (gmt 0)

Inactive Member
Account Expired

 
 


:)

Not the only person it has happened to:-

http://www.google.com/search?hl=en&q=googlebot+mozilla [google.com]

1:29 am on May 6, 2005 (gmt 0)

Full Member

10+ Year Member

joined:July 18, 2004
posts:238
votes: 0


Since Google does not index the disallowed pages, it seems OK, because the pages are publicly accessible anyway.

Anybody can see and study all the pages in the public directories of the site.

We should simply be aware that the pages that are forbidden in robots.txt may influence the rank of the pages that are allowed.

Vadim.

5:42 am on May 6, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member powdork is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 13, 2002
posts:3347
votes: 0


They should not be accessed. That is what obeying robots.txt means. That's what their robots.txt faq says. My understanding was that robots.txt would keep the robot from crawling a specified page/directory, but that Google may still index the url as a result of finding a link to it. The way to keep a page from being indexed is to let it be crawled and put the noindex meta tag in the head of the document.
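A minimal sketch of that approach (page contents illustrative): leave the page out of robots.txt so it can be crawled, and put the exclusion in the document head instead:

```html
<head>
  <title>Example page</title>
  <!-- the page may be crawled, but robots are asked not to index it -->
  <meta name="robots" content="noindex">
</head>
```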

5:46 am on May 6, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member powdork is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 13, 2002
posts:3347
votes: 0


It's weird that Googlebot is indexing itself. A search for 66.249.65.238 shows that the IP has been crawled and indexed. No wonder they have so many pages.

Actually, it occurs to me that this may be the result of the bot trap?

8:53 am on May 6, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:June 3, 2002
posts:566
votes: 0


Powdork is absolutely right, I cannot understand a statement like
whatever it means technically, doesn't matter. Google will still read the files, but not list them.

If robots.txt contains

User-agent: whatever
Disallow: /foo/

then accessing anything within /foo/ is a no-no for the bot, and it is irrelevant whether it would be listed or not in a SERP. The purpose of disallowing a directory, or even /, is defeated if the bot crawls it nevertheless.

Slightly OT:
It is also well known that Googlebot-Image behaves badly and ignores anything that is listed under User-agent: * in robots.txt. You need to copy all the lines listed under User-agent: * into a new User-agent: Googlebot-Image section.
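For example, the workaround described above means duplicating the shared rules (directory names illustrative):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

User-agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /images/
```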

Interestingly, also Yahoo-MMCrawler does not behave well.

9:43 pm on May 7, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


>> Google will still read the files, but not list them. <<

Wrong again.

They will show the files as URL-only entries in the SERPs. They appear as URL-only, only because Google is asked to not index the content. Robots.txt says nothing about recording that the URL simply exists, so Google does record it.

You have to manually remove the entries by submitting the URL of your robots.txt file to the URL console (removal tool). Removal takes a few days.

4:14 pm on May 9, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 12, 2002
posts:4479
votes: 0


>They should not be accessed. That is what obeying robots.txt means. That's what their robots.txt faq says. My understanding was that robots.txt would keep the robot from crawling a specified page/directory, but that Google may still index the url as a result of finding a link to it. The way to keep a page from being indexed is to let it be crawled and put the noindex meta tag in the head of the document.

Right. The basic idea of robots.txt was to stop bots from wasting bandwidth that a site pays for. If I have a site with 10,000 pages and I don't care whether it is listed in Google, I don't want to pay for the bandwidth of Googlebot accessing those pages over and over again.

11:48 pm on May 9, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 24, 2003
posts:729
votes: 0


I too use robot traps and have never seen Googlebot get trapped. I've been thinking about the possibility of this happening, and here are my thoughts:

Googlebot could have fallen into the bot trap, or indexed banned pages, because of the problem Googlebot has with 302 redirects. Maybe a 302 redirect caused Googlebot to hit your robot trap without knowing it had been redirected there from a different site, so it did not realize it needed to request the robots.txt file for your site.

Possible? yes/no

12:10 am on May 10, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 18, 2003
posts:1929
votes: 0


Possible? yes/no

It is definitely some kind of problem, but we can only speculate about what it is.

As several people mentioned before, robots.txt is a ban for the robot, and they should obey it. Every robot should check robots.txt before accessing any documents on a given domain. It's as simple as that.

Even wget on *nix systems obeys robots.txt by default when downloading recursively: pages that are disallowed are skipped.
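The same pre-flight check is available in Python's standard library via urllib.robotparser. This sketch parses the two-line "ban everything" file quoted earlier and confirms that a compliant crawler must refuse the fetch (host name illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
# parse() takes the file's lines directly, so no network fetch is needed
rules.parse([
    "User-agent: *",
    "Disallow: /",
])

# A compliant robot must not fetch anything on this host
allowed = rules.can_fetch("Googlebot", "http://example.com/any/page.html")
```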

12:39 am on May 10, 2005 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38200
votes: 96


> They will show the files as URL-only entries in the SERPs.

semantics - that's still not indexing them for public consumption.

> Every robot should check robots.txt before accessing
> any documents on a given domain. It's simple as that.

I agree. Unfortunately, that is not Google's interpretation of the worthless, toothless, all-but-useless robots exclusion standard.

1:00 am on May 10, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 16, 2002
posts:2139
votes: 1


>>Actually, it occurs to me that may be the result of the bot trap?

It has happened to me. A space in a dir name sent it into a tizzy.

2:27 am on May 18, 2005 (gmt 0)

New User

10+ Year Member

joined:May 8, 2005
posts:29
votes: 0


In my robots.txt I have a disallow for cart.php; however, Google grabbed it and indexed it, and it has been in the index for months. Full index and cache, not URL-only.

8:45 pm on May 18, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Are you sure that there isn't an error in the format of your robots.txt file?

To remove the pages from the index, submit the URL of your robots.txt file to the Google URL Removal service. It is one of the options in their URL console. Removal takes a couple of days.

9:17 am on May 19, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 25, 2001
posts:661
votes: 1


If you read the fine print....

The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links.

[robotstxt.org...]
Try <META NAME="ROBOTS" CONTENT="NOINDEX">

11:13 am on May 19, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:June 29, 2004
posts:81
votes: 0


g1smd said: They will show the files as URL-only entries in the SERPs. They appear as URL-only, only because Google is asked to not index the content. Robots.txt says nothing about recording that the URL simply exists, so Google does record it.

You have to manually remove the entries by submitting the URL of your robots.txt file to the URL console (removal tool). Removal takes a few days.

Hello everybody

so what g1smd says is correct, as far as I have seen for my website... after you add links to robots.txt, they don't show up with a snippet, rather only the URL...

but if I want that bare URL to be removed as well, do I have to type something like this into Google's remove URL tool:
" [somename.com...] "

hope someone can make it clear for me..

Thanks a lot to everyone

Regards,

KaMran:-)