Sitemaps, Meta Data, and robots.txt Forum

allow major engines, block the fleas
and to add further control
youfoundjake
WebmasterWorld Senior Member 5+ Year Member
Msg#: 3134879 posted 9:01 pm on Oct 25, 2006 (gmt 0)

I want to block my site from being spidered by everything but Google, Yahoo, MSN, and ia_archiver.

I also want to block Google, Yahoo, MSN, and ia_archiver from spidering the forum. How does this look for syntax?

User-agent: *
Disallow: /forum/

User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org
Disallow:

User-agent: *
Disallow: /

 

vanessafox
5+ Year Member
Msg#: 3134879 posted 1:11 am on Oct 26, 2006 (gmt 0)

I can't speak for all bots, but Googlebot follows the record aimed at it, if there is one, and ignores the rest. So in this case it would interpret the file as allowing it access to everything, including /forum/, because the empty Disallow in the record that names Googlebot takes precedence over the catch-all. I would recommend something like this:

User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org
Disallow: /forum/

User-agent: *
Disallow: /

You can always verify how Googlebot will interpret a robots.txt file using the robots.txt analysis tool in Google Webmaster Tools. You can just add the site you're interested in to your account, paste the test file into the tool, and check specific URLs to see if the test file would block or allow them.

youfoundjake
WebmasterWorld Senior Member 5+ Year Member
Msg#: 3134879 posted 1:23 am on Oct 26, 2006 (gmt 0)

It's so simple. :)
Thanks Vanessa, hope you have a good night.

jdMorgan
WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member
Msg#: 3134879 posted 5:18 am on Oct 26, 2006 (gmt 0)

Be prepared also for ancient bots, quasi-bots, and broken bots that can't handle the (valid according to the Standard) multiple user-agent records. I suggest backing up your robots.txt with 'stronger stuff,' such as mod_rewrite user-agent checks, if possible.
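For example, a rough .htaccess sketch along those lines (assuming Apache with mod_rewrite enabled; the user-agent substrings are only illustrative, adjust them to what actually shows up in your logs):

RewriteEngine On
# Let anything fetch robots.txt itself
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Never touch the crawlers listed in the robots.txt above
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot|Mediapartners-Google|AdsBot-Google|ia_archiver) [NC]
# Send 403 Forbidden to anything else that identifies itself as a robot
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider|libwww|wget|curl) [NC]
RewriteRule .* - [F]

That only catches bots that announce themselves in the user-agent string, of course; anything spoofing a browser header sails right through.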

There are plenty of badly-coded 'bots out there that are not really malicious, just incompetent...

Jim

youfoundjake
WebmasterWorld Senior Member 5+ Year Member
Msg#: 3134879 posted 5:16 pm on Oct 26, 2006 (gmt 0)

Thanks JD, looking into various "bot traps" to resolve that. Lots of threads to comb through...
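For reference, the basic pattern seems to be roughly this (a rough sketch, and the /bot-trap/ name is just a placeholder): disallow a decoy directory in robots.txt, link to it invisibly from a page, and then anything that requests it is by definition ignoring robots.txt, so it can be logged and blocked.

# Added to every record in robots.txt, so no compliant crawler should ever fetch it
Disallow: /bot-trap/

# In the Apache server config or a vhost (CustomLog isn't allowed in .htaccess):
# tag anything that wanders into the trap and log it to its own file,
# assuming the stock "combined" LogFormat is defined
SetEnvIf Request_URI "^/bot-trap/" trapped_bot
CustomLog logs/trapped-bots.log combined env=trapped_bot

The trap page gets linked in a way humans won't follow (a hidden link or 1x1 image), and whatever IPs and user-agents turn up in that log can go into Deny rules or the mod_rewrite conditions above.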
