Is Google Disregarding robots.txt?

Google seems to be indexing pages from a site blocked using robots.txt

5:59 pm on Jul 7, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 27, 2002
votes: 0

Can someone help me understand this one?

We have used robots.txt on one of our sites to prevent Google from accessing any of the files, as follows:

User-agent: Googlebot
Disallow: /

What I have noticed is that Google is somehow getting some of the pages anyway. Out of about 20,000, they now have about 3,670.

Also interesting is that on the search results page for:

oursitename site:www.example.com

Google shows: Results 1 - 9 of about 3,670

And only 9 URL links, without title or description, show up. There is no way to access any of the other supposed 3,670 results.

We have another site that has the same pages, and the reason we block Google from the mirror site is to avoid a penalty. I'm concerned about these pages getting in despite the robots.txt block, and about a possible penalty.

Any help on understanding this would be appreciated.

[edited by: ciml at 6:06 pm (utc) on July 7, 2005]
[edit reason] Examplified [/edit]

6:08 pm on July 7, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member ciml is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 22, 2001
votes: 2

I suggest looking at your logs. /robots.txt exclusion does not prevent Google from listing the URLs; it only prevents Googlebot from fetching them.
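To illustrate the fetching side of this, here is a minimal sketch using Python's standard `urllib.robotparser` to evaluate the rules quoted in the first post. The helper name `can_fetch` and the page URL are assumptions made up for the example, and this only shows how a compliant crawler would read the file. It says nothing about whether a search engine may still list a URL it learned about elsewhere.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the original post.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

def can_fetch(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a compliant crawler identifying as `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Hypothetical URL on the example domain used in this thread.
page = "http://www.example.com/page.html"
print("Googlebot may fetch:", can_fetch("Googlebot", page))
print("Other bot may fetch:", can_fetch("SomeOtherBot", page))
```

Note that these rules only address Googlebot; any other crawler is left unrestricted.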

The URL-only listings indicate that Google are doing the right thing, so the question is how they found the URLs.

My guess is that either Google visited the site before the /robots.txt was added, or there's some other way for Googlebot to see links to those URLs.
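One way to check the logs for this is sketched below. It assumes the server writes the common "combined" access log format and that matching on the user-agent string is good enough (it can be spoofed). The function name `googlebot_requests` and the sample log lines are invented for illustration, not taken from the poster's site.

```python
import re

# Assumed format: Apache/Nginx "combined" access log.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_requests(lines):
    """Return (timestamp, path) for each request whose user agent mentions Googlebot."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and "googlebot" in m.group("agent").lower():
            hits.append((m.group("time"), m.group("path")))
    return hits

# Made-up sample lines; in practice, iterate over the real log file instead.
SAMPLE = [
    '192.0.2.1 - - [01/Jul/2005:10:00:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 35 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [01/Jul/2005:10:00:05 +0000] "GET /page.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

for when, path in googlebot_requests(SAMPLE):
    print(when, path)
```

If the only Googlebot requests after the block went live are for /robots.txt, the listed URLs most likely came from an earlier crawl or from links on other sites.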

6:16 pm on July 7, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Feb 17, 2004
votes: 0

latimer -

As ciml says. Also, my understanding is that if other sites link to you, Googlebot will index those links during spidering, but the listings are then not fleshed out with content when another bot returns to index the content. You can remove those listings (until the same thing happens again) using the Google robots.txt removal tool, but use great caution with it!

8:39 pm on July 7, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 27, 2002
votes: 0

Thanks for the replies. Very helpful.

9:29 pm on July 7, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 10, 2005
votes: 0

I see them obeying, but even though the pages are not indexed, the homepage still is (without a title being displayed). It has been, and I suspect it will be in the future.

4:33 am on July 8, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 24, 2000
votes: 4

For "research" I created a site over a year ago that used every dirty trick in the book that was outside Google's TOS, to see if it would get blasted.
This site ranked within top in most searches.
I put the site "up for review" and Google whacked it in three days.

It didn't follow the classic sandbox effect like other sites; it was more deliberate.

The domain name came up for renewal two months ago, and I decided to block Googlebot and move the content to a new domain.

Two weeks later the site started to get Google traffic.

Go figure

[edited by: minnapple at 4:38 am (utc) on July 8, 2005]

4:37 am on July 8, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:May 16, 2005
votes: 0

I think the Google removal tool/program should work...