Is Google Disregarding robots.txt?

Google seems to be indexing pages from a site blocked using robots.txt

latimer

5:59 pm on Jul 7, 2005 (gmt 0)

Can someone help me understand this one?

We have used robots.txt on one of our sites to prevent Google from accessing any of the files, as follows:

User-agent: Googlebot
Disallow: /

What I have noticed is that Google is somehow getting some of the pages anyway. Out of about 20,000 pages, they now have about 3,670.

Also interesting is the search results page for:

oursitename site:www.example.com

Google shows: Results 1 - 9 of about 3,670

And only 9 URL links, without title or description, show up. There is no way to access any of the other supposed 3,670 results.

We have another site that has the same pages, and the reason we block Google from the mirror site is to avoid a penalty. We are concerned about these pages getting in despite the robots.txt block, and about a possible penalty.

Any help on understanding this would be appreciated.

[edited by: ciml at 6:06 pm (utc) on July 7, 2005]
[edit reason] Examplified [/edit]

ciml

6:08 pm on Jul 7, 2005 (gmt 0)

I suggest looking at your logs. /robots.txt exclusion does not prevent Google from listing the URLs; it only prevents Googlebot from fetching them.
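To see what the quoted rules actually forbid, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It only models the fetch decision, which is all that robots.txt governs; it says nothing about whether a URL can be listed. Python's parser is not Google's own implementation, and the URL and the `SomeOtherBot` name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the original post.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/some-page.html"  # hypothetical URL
    print("Googlebot may fetch:", may_fetch("Googlebot", url))
    print("SomeOtherBot may fetch:", may_fetch("SomeOtherBot", url))
```

The rules block Googlebot from fetching anything on the site while leaving other crawlers unrestricted. A URL that Googlebot never fetches can still show up as a URL-only listing if it is discovered some other way.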

The URL-only listings indicate that Google are doing the right thing, so the question is how they found the URLs.

My guess is that either Google visited the site before the /robots.txt was added, or there's some other way for Googlebot to see links to those URLs.
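One way to check the logs for the first possibility is sketched below. It assumes the server writes Apache "combined" format access logs, which the thread does not confirm, and the sample lines are invented for illustration. It lists every path requested by a client whose user agent contains "Googlebot", ignoring /robots.txt itself.

```python
import re

# Request line, status, size, referer and user agent of an Apache
# "combined" log entry (assumed format; adjust for your server).
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_page_fetches(lines):
    """Return paths requested by a Googlebot user agent, excluding /robots.txt."""
    paths = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        if "Googlebot" in match.group("agent") and match.group("path") != "/robots.txt":
            paths.append(match.group("path"))
    return paths


# Invented sample entries, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jul/2005:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 35 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.1 - - [01/Jul/2005:10:00:05 +0000] "GET /page1.html HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.7 - - [01/Jul/2005:10:01:00 +0000] "GET /page2.html HTTP/1.1" 200 4096 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for path in googlebot_page_fetches(SAMPLE_LOG):
        print(path)
```

If the list is empty for the period after the robots.txt went up, the URLs were most likely picked up earlier or through links elsewhere. Keep in mind that a user agent string can be faked, so a match is a hint rather than proof that the request came from Google.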

joeduck

6:16 pm on Jul 7, 2005 (gmt 0)

latimer -

As ciml says. Also, my understanding is that if other sites link to you, Googlebot will index those links during spidering but then not flesh them out with content when another bot returns to index the content. You can remove those listings (until the same thing happens again) using the Google robots.txt removal tool, but use great caution with it!

latimer

8:39 pm on Jul 7, 2005 (gmt 0)

Thanks for the replies. Very helpful.

AndAgain

9:29 pm on Jul 7, 2005 (gmt 0)

I see them obeying, but even though the pages are not indexed, the homepage still is (without a title being displayed). It has been, and I suspect it will be in the future.

minnapple

4:33 am on Jul 8, 2005 (gmt 0)

For "research" I created a site over a year ago that did every dirty trick in the book that was outside Google's TOS, to see if it would get blasted.
This site ranked within the top results for most searches.
I put the site "up for review" and Google whacked it in three days.

It didn't follow the classic sandbox effect like other sites; it was more deliberate.

The domain name came up for renewal two months ago, and I decided to block Googlebot and move the content to a new domain.

Two weeks later the site started to get Google traffic.

Go figure

[edited by: minnapple at 4:38 am (utc) on July 8, 2005]

Adversity Sure Fire

4:37 am on Jul 8, 2005 (gmt 0)

I think the Google removal tool/program should work...