Shaking off the Googlebot

Sometimes it's necessary!

yosmc

2:22 am on Apr 12, 2003 (gmt 0)

I'm aware that the Googlebot obeys robots.txt and does what it is supposed to do, but I find communication with the bot fairly difficult. Here's what I am stating in my robots.txt file:

#no robots
User-agent: *
Disallow: /images/
Disallow: /profile.php
Disallow: /posting.php
Disallow: /privmsg.php

Yet, Google has freshly indexed pages like...

[mysite.com...]
[mysite.com...]

Why?!

PS. Yep, I have already tested the file with the robots.txt checker available on this site, and the script appears to be very happy with it.
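For anyone who wants to repeat that kind of check locally, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the checker mentioned above, just a rough equivalent. The `example.com` URLs and the `is_allowed` helper are made up for illustration (the real URLs in this thread are truncated), and the parser's matching rules may differ in detail from what Googlebot actually does.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules exactly as quoted in the post above.
ROBOTS_TXT = """\
#no robots
User-agent: *
Disallow: /images/
Disallow: /profile.php
Disallow: /posting.php
Disallow: /privmsg.php
"""


def is_allowed(url, user_agent="Googlebot"):
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs, chosen only to exercise each kind of rule.
    for url in (
        "http://example.com/profile.php?mode=viewprofile&u=2",
        "http://example.com/images/logo.gif",
        "http://example.com/index.php",
    ):
        print(url, "allowed" if is_allowed(url) else "disallowed")
```

With these rules, the `profile.php` and `/images/` URLs come back disallowed and `index.php` comes back allowed. Note that this only answers "may the bot fetch this URL?"; it says nothing about whether the URL can still show up in search results.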

jdMorgan

2:28 am on Apr 12, 2003 (gmt 0)

yosmc,

Your robots.txt as shown should have prevented those URLs from being spidered, as long as that's what was in it at the time the site was crawled.

You should report this to Google.

Jim

TheDave

3:12 am on Apr 12, 2003 (gmt 0)

Even though you have excluded those pages from being crawled, Google can still find a link to them, and it's the link that it knows about. Are the pages in Google's cache?

yosmc

3:25 am on Apr 12, 2003 (gmt 0)

Interesting point, Dave. No, they're not, and the listings show the URL instead of the page title. But why would Google add a link to the index that it's not allowed to spider anyway? Is it a feature, or a bug?

jdMorgan

3:30 am on Apr 12, 2003 (gmt 0)

yosmc,

OK, you didn't say so in your first post. Have a look at this recent thread [webmasterworld.com], and the link I posted there.

Jim

yosmc

3:36 am on Apr 12, 2003 (gmt 0)

Thanks, very cool, that's exactly what I wanted to know. Sorry for the missing info. In my part of the world it's already fairly late (or early), and I simply forgot. :)

GoogleGuy

6:09 am on Apr 12, 2003 (gmt 0)

That's right. If we see references to a page that we can't crawl, we can still return a pointer to the link.