
Google News Archive Forum

    
robots.txt and googlebot
why isn't googlebot obeying?
kryton




msg:145792
 2:06 am on Mar 11, 2003 (gmt 0)

I added something like the following at the top of my robots.txt:

User-agent: *
Disallow: /adserver/
Disallow: /addbiography.cgi
Disallow: /addtrivia.cgi
Disallow: /addquotes.cgi
Disallow: /search.cgi

I added that about 4 hours ago. I am aware that googlebot doesn't request robots.txt all the time, but the following showed up in my logs:

crawler9.googlebot.com - - [11/Mar/2003:00:55:19 +0000] "GET /robots.txt HTTP/1.0" 200 869 "-" "Mediapartners-Google/2.1 (+http://www.googlebot.com/bot.html)"
crawler9.googlebot.com - - [11/Mar/2003:00:55:19 +0000] "GET /robots.txt HTTP/1.0" 200 869 "-" "Mediapartners-Google/2.1 (+http://www.googlebot.com/bot.html)"
crawler9.googlebot.com - - [11/Mar/2003:00:56:48 +0000] "GET /addquotes.cgi?celeb=Robin%20Williams HTTP/1.0" 200 12988 "-" "Mediapartners-Google/2.1 (+http://www.googlebot.com/bot.html)"
crawler9.googlebot.com - - [11/Mar/2003:00:58:29 +0000] "GET /addquotes.cgi?celeb=Robin%20Williams HTTP/1.0" 200 12988 "-" "Mediapartners-Google/2.1 (+http://www.googlebot.com/bot.html)"

crawler9.googlebot.com read robots.txt but still requested /addquotes.cgi?celeb=Robin%20Williams. Why? Is there more than one instance of googlebot running at once? If so, is it possible that the instance which requested the addquotes.cgi page had read robots.txt earlier, before I added the Disallow lines?
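To confirm that the rules above really cover the URL in the log, here is a minimal sketch using Python's standard urllib.robotparser. It only shows how a standards-following parser reads the posted rules; it says nothing about how Google's own crawlers interpret them. The helper name is_allowed is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules as posted at the top of the thread.
RULES = """\
User-agent: *
Disallow: /adserver/
Disallow: /addbiography.cgi
Disallow: /addtrivia.cgi
Disallow: /addquotes.cgi
Disallow: /search.cgi
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given robots.txt text.

    Hypothetical helper: parses the text in memory instead of fetching it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    agent = "Mediapartners-Google/2.1"
    for path in ("/addquotes.cgi?celeb=Robin%20Williams", "/index.html"):
        print(path, is_allowed(RULES, agent, path))
```

Disallow values are matched as path prefixes, so the query string after /addquotes.cgi does not stop the rule from applying.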

 

projectphp




msg:145793
 2:22 am on Mar 11, 2003 (gmt 0)

Google caches your robots.txt file, so it only requests it once per crawl.

Why not put a robots meta tag on those pages, i.e. <meta name="robots" content="noindex,follow" />?

This will stop Google from indexing the page, but it will still follow the links.
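As a way to check that a CGI page actually emits the tag suggested above, here is a small sketch that pulls the robots directives out of a page's HTML with Python's standard html.parser. The class and function names are made up for illustration, and the sketch only inspects markup; it makes no claim about what any crawler does with the result.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are also routed here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,follow" /></head></html>'
    print(sorted(robots_directives(page)))
```

Note that a crawler can only read this tag if it is allowed to fetch the page: a bot that obeys a robots.txt Disallow for the same URL never sees the meta tag.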

kryton




msg:145794
 2:29 am on Mar 11, 2003 (gmt 0)

projectphp, as you can see, googlebot requested robots.txt just before it crawled the cgi page. The only thing I can think of is that there is more than one instance of googlebot running per IP, and the request for the cgi page came from an instance which had requested robots.txt before I updated it.

Good idea to change my cgi pages as you suggested, with the <meta name="robots" content="noindex,follow" />
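One way to look at the timing question in the logs is to measure how long after a robots.txt fetch each disallowed URL was requested by the same host and user agent. The sketch below assumes the combined log format shown in the first post; the regular expression and the function names are made up for illustration. It cannot tell crawler instances apart, it only reports the gaps.

```python
import re
from datetime import datetime

# Matches the combined log format lines quoted in the first post.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def hits_after_robots_fetch(lines, disallowed_prefixes):
    """Yield (path, seconds since the latest robots.txt fetch) for each
    request to a disallowed prefix, tracked per (host, user agent)."""
    last_fetch = {}
    prefixes = tuple(disallowed_prefixes)
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        key = (record["host"], record["agent"])
        if record["path"] == "/robots.txt":
            last_fetch[key] = record["ts"]
        elif key in last_fetch and record["path"].startswith(prefixes):
            gap = record["ts"] - last_fetch[key]
            yield record["path"], gap.total_seconds()
```

Passing the log file's lines together with the Disallow prefixes from robots.txt (for example ["/addquotes.cgi", "/search.cgi"]) lists every disallowed request along with how many seconds had passed since that host and agent last fetched robots.txt.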

Powdork




msg:145795
 4:46 am on Mar 11, 2003 (gmt 0)

What are the IP ranges of the visits? I've never seen that "Mediapartners" bit.

kryton




msg:145796
 5:11 am on Mar 11, 2003 (gmt 0)

if you ping crawler9.googlebot.com it is 64.68.87.79

Powdork




msg:145797
 5:17 am on Mar 11, 2003 (gmt 0)

if you ping crawler9.googlebot.com it is 64.68.87.79

I know that. What I'm asking is whether the visits with that user agent string are from IPs that are normally associated with googlebot, or whether someone is masquerading as Gbot. I'm guessing that 'Mediapartners' is part of a rDNS lookup, but my stats program doesn't have the functionality to check that.
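The check being asked about here is usually done as a forward-confirmed reverse DNS lookup: resolve the visiting IP to a hostname, make sure the hostname is under an expected domain, then resolve that hostname back and confirm it returns the same IP. The sketch below uses Python's socket module. The function name and the injectable lookup parameters are made up for illustration, and the ".googlebot.com" suffix is an assumption taken from the hostnames in the log above, not an authoritative list of Google's crawler domains.

```python
import socket


def is_forward_confirmed(ip, allowed_suffixes=(".googlebot.com",),
                         reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a hostname under one of
    allowed_suffixes and that hostname resolves back to the same ip.

    reverse_lookup and forward_lookup are hypothetical hooks so the logic
    can be exercised without live DNS; by default real lookups are used.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda name: socket.gethostbyname_ex(name)[2]

    try:
        hostname = reverse_lookup(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # Suffix check with the leading dot, so "googlebot.com.example.net"
    # or "notgooglebot.com" do not pass.
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return False

    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

Calling is_forward_confirmed("64.68.87.79") would perform live DNS queries. The user agent string plays no part in this check, since it is sent by the client and can be set to anything.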

AthlonInside




msg:145798
 5:23 am on Mar 11, 2003 (gmt 0)

ME TOO! Now I have both meta robots and robots.txt to see if they obey this time.

kryton




msg:145799
 5:54 am on Mar 11, 2003 (gmt 0)

Powdork, I have no idea. I must admit I thought that user agent was strange. Is it possible that someone could masquerade as googlebot?

GoogleGuy




msg:145800
 6:14 am on Mar 11, 2003 (gmt 0)

kryton, I believe that this user agent is associated with our content ads program. Are you on a site that shows our content ads?

kryton




msg:145801
 3:12 pm on Mar 11, 2003 (gmt 0)

No, GoogleGuy, I am not.

acronym




msg:145802
 7:04 pm on Mar 13, 2003 (gmt 0)

I'm seeing this mediapartners bot, too.

See my post here:

I am not doing anything with "content ads" that I know of (unless my ad network is serving them up somehow)?

GoogleGuy, what's this about and why is this bot so aggressive?

Mike

Chris_1




msg:145803
 7:10 pm on Mar 13, 2003 (gmt 0)

Us too - we've got the Google Mediapartners bot and have been wondering what it is... We are not involved with Google ads in any way (at this point, only Overture).

Chris

GoogleGuy




msg:145804
 8:40 pm on Mar 13, 2003 (gmt 0)

kryton, do you sell any banner ads? Content ads can also show up on your site that way. I believe the Mediapartners user agent is the bot for content ads, but it would be helpful if anyone could mention specific urls. I guess that's out of bounds, but I'm pretty sure that these are for content ads.

Powdork




msg:145805
 9:12 pm on Mar 13, 2003 (gmt 0)

Wouldn't the ip ranges of the bots give all the needed info?

Chris_1




msg:145806
 11:27 pm on Mar 13, 2003 (gmt 0)

GG,

We use a little cgi script to rotate through some 234x60 banners - but it's only informational internal ads that link to different pages on our site - nothing external.

I've heard your stickymail is turned off - how should I send you our URL?

Thanks,

Chris

GoogleGuy




msg:145807
 2:25 am on Mar 14, 2003 (gmt 0)

Mmm. Do it as a spam report and mention your nick and this url. I'll get it. Thanks..

Powdork




msg:145808
 3:12 am on Mar 14, 2003 (gmt 0)

GoogleGuy,
Do people really send in their own URLs to the spam report when you ask them? I try to keep everything above board, but that would still make me a tad nervous. ;)

Key_Master




msg:145809
 3:53 am on Mar 14, 2003 (gmt 0)

I'm seeing the bot also although it is behaving itself. And though I don't serve Google content ads (yet), I'm certainly interested.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved