User-agent: Googlebot
Disallow: /
What I have noticed is that Google is somehow getting some of the pages anyway. Out of about 20,000 pages, they now have about 3,670.
Also interesting is that on the search results page for:
oursitename site:www.example.com
Google shows: Results 1 - 9 of about 3,670
And only 9 URL links, without title or description, show up. There is no way to access any of the other supposed 3,670 results.
We have another site that has the same pages, and the reason we block Google from the mirror site is to avoid a penalty. We are concerned about these pages getting in despite the robots.txt block, and about a possible penalty.
Any help on understanding this would be appreciated.
[edited by: ciml at 6:06 pm (utc) on July 7, 2005]
[edit reason] Examplified [/edit]
The URL-only listings indicate that Google are doing the right thing: the URLs are listed, but the pages themselves have not been fetched. So the question is how they found the URLs.
My guess is that either Google visited the site before the /robots.txt was added, or there's some other way for Googlebot to see links to those URLs.
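To rule out a problem with the rules themselves, here is a minimal sketch using Python's standard `urllib.robotparser` to check that the two lines quoted in the first post really do block Googlebot from every path. The helper name `is_blocked` and the test URL are only illustrative assumptions, and this says nothing about how Google's own parser behaves; it only shows how a standard parser reads the file.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the original post.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt forbids user_agent from fetching url.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/page.html"  # placeholder URL
    print("Googlebot blocked:", is_blocked(ROBOTS_TXT, "Googlebot", url))
    print("Other bot blocked:", is_blocked(ROBOTS_TXT, "SomeOtherBot", url))
```

If the first check reports the path as blocked, the file is doing its job. A Disallow rule only stops fetching, so a URL learned from a link elsewhere can still show up as a URL-only listing.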
As ciml says. Also, my understanding is that if other sites link to you, Googlebot will index those links during spidering, but the listings are then not fleshed out with content when another bot returns to index the content. You can remove those listings (until the same thing happens again) using the Google robots.txt removal tool, but use great caution with it!
I didn't follow the classic sandbox effect like other sites; it was more deliberate.
The domain name came up for renewal two months ago, and I decided to block Googlebot and move the content to a new domain.
Two weeks later the site started to get Google traffic.
Go figure.
[edited by: minnapple at 4:38 am (utc) on July 8, 2005]