Forum Moderators: open


googlebot ignoring robot.txt?


hannamyluv

7:27 pm on Nov 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Okay, I really did try to look for anything about this but I couldn't find it. If it's out there, please just kindly direct me there.

Anyway, I have a little hobby site that I put up. I put up a robot.txt that disallows a certain directory. I have triple checked the robot.txt and I have put it through the validator. Everything checks out, but googlebot still crawls the pages in that directory. Nobody else crawls it, just googlebot.

Does googlebot crawl everything and just index what's allowed, or is something wrong here?
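[Editor's note: for reference, a minimal rules file of the sort described here would look like the following; the directory name is hypothetical, and the file must sit at the site root.]

```
User-agent: *
Disallow: /private/
```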

Brett_Tabke

10:07 pm on Nov 29, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



There is one interpretation of the robots.txt standard (Google's) under which robots.txt only blocks crawling, and listing an uncrawled URL in the index is still OK.

Does google list the pages? If so, how long has the robots.txt been in place?

(no offense, but we've seen a few dozen of these claims and all but 1 [webmasterworld.com] have turned out to be webmaster error).

more reading [google.com]...

GoogleGuy

7:01 am on Nov 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm, didn't see a site in your profile. Did we actually fetch pages from that directory, or do we just show titles without any cached page link (we can do that for urls that we see referenced but didn't crawl)? Did you make the change recently? We try to refetch the robots.txt file pretty often, but if you just changed it then we might not have found it yet.

Jenstar

8:10 am on Nov 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do you disallow everything, or specific bots? If the latter, could it have been the Google Mediapartners bot? If you are running AdSense, or if someone views those pages using Opera, the Mediapartners bot (User-agent: Mediapartners-Google*) will arrive and spider those pages to display relevant ads via AdSense. They are completely different bots and require separate entries in robots.txt.
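[Editor's note: a sketch of the separate records Jenstar describes, with a hypothetical blocked directory. Each crawler matches its own User-agent record, so a rule written only for Googlebot would not stop Mediapartners-Google.]

```
User-agent: Googlebot
Disallow: /private/

User-agent: Mediapartners-Google
Disallow: /private/
```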

DanA

11:31 am on Nov 30, 2003 (gmt 0)

10+ Year Member



It takes some time (in my case it was more than a month) for a robots.txt to be taken into account, but Google doesn't follow the rules for zip or PDF files.
I'm not sure whether it downloads the files or just checks for links to them, but I think it shouldn't make a difference.

[edited by: DanA at 1:38 pm (utc) on Nov. 30, 2003]

ThomasB

1:09 pm on Nov 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Did we actually fetch pages from that directory, or do we just show titles without any cached page link (we can do that for urls that we see referenced but didn't crawl)?

How can we prevent these sites/directories from being listed or spidered at all?

@hannamyluv
It's called robots.txt and not robot.txt. Maybe the problem is the name of the file?

hannamyluv

1:30 pm on Nov 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, I am sure it's something I am doing; I just wanted to make sure before I went beating my head against the wall trying to figure it out.

I'll check the file; perhaps the name is singular, so I will take a look at that. Thanks for the help.

*sigh, I feel like a newbie. I kinda wish I had a programmer/techie at home like I do at work.*
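[Editor's note: one way to sanity-check a robots.txt rule set before blaming the crawler is Python's standard urllib.robotparser. This is a sketch with hypothetical rules and URLs, and it only checks the syntax and matching of the rules, not what any particular bot actually does.]

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules of the kind discussed in this thread.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The blocked directory is off-limits to a compliant crawler...
print(parser.can_fetch("Googlebot", "http://example.com/private/page.html"))  # False
# ...while the rest of the site remains fetchable.
print(parser.can_fetch("Googlebot", "http://example.com/index.html"))  # True
```

A misnamed file (robot.txt instead of robots.txt) would never be fetched by crawlers at all, so no parser would ever see these rules.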