
Sitemaps, Meta Data, and robots.txt Forum

    
Exclude ALL bots but major search engines
Trying to make the perfect robots.txt
serpmaster




msg:3486078
 11:03 am on Oct 24, 2007 (gmt 0)

This is what I currently have:

User-agent: Googlebot
User-agent: Yahoo! Slurp
User-agent: MSNBot
Disallow:

User-agent: *
Disallow: /

It is my understanding that this robots.txt will refuse all (well-behaved) bots and let only Google, Yahoo! and Live Search crawl my site. I am not interested in letting ANY other bots on my site, unless they are likely to provide a lot of traffic for me.
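
One way to sanity-check that reading is to feed the file to a robots.txt parser. The sketch below uses Python's standard-library urllib.robotparser; the helper name allowed and the agent name SomeOtherBot are made up for illustration, and this only shows how that one parser interprets the rules, not how any particular crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the post above, copied verbatim.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Yahoo! Slurp
User-agent: MSNBot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent, url="/"):
    """Return True if this parser would let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# "SomeOtherBot" is a hypothetical crawler that is not named in the file.
for agent in ("Googlebot", "MSNBot", "SomeOtherBot"):
    print(agent, allowed(agent))
```

With this parser the named bots fall under the empty Disallow (everything allowed) and any other agent falls through to the catch-all record. It says nothing about bots that never read robots.txt in the first place.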

Are there any other search engines that I should include? Maybe ASK? What User-agent does it use? Is it worth it? Or should I just go for these top-3 ones?

[edited by: encyclo at 10:20 pm (utc) on Nov. 10, 2007]

 

thegreatpretender




msg:3486977
 7:33 am on Oct 25, 2007 (gmt 0)

Only the major search engines obey robots.txt; you should block the other bots via .htaccess. I'm no expert on it either, but I'm sure somebody else here will be able to help you.

Woz




msg:3487058
 9:18 am on Oct 25, 2007 (gmt 0)

serpmaster, thegreatpretender is making the point that your efforts to block all but authorised bots via robots.txt will prove largely fruitless in that only the major bots obey robots.txt directives, and then only usually.

At the other end of the scale, given that any unscrupulous bots are attempting to spider your content unscrupulously, they are hardly likely to stop their unscrupulous activities just because you ask them to. That would be like leaving your house door open with a nice polite sign on the door asking all would-be robbers to please leave your property alone.

If you want to ensure you stop all unauthorised bots, then you need to take more effective measures than using only robots.txt. If your server is Apache-based, then .htaccess is the way to go, as thegreatpretender suggests. With other servers you need to use other methods.
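
As a rough illustration of what "other methods" could look like when the site runs behind a Python WSGI application rather than Apache, here is a minimal sketch that refuses requests by User-Agent header. Everything named here is an assumption for the example: the blocked substrings, block_unwanted_bots, demo_app and status_for are hypothetical, and a bot that fakes its User-Agent will still get through.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical User-Agent substrings to refuse; not taken from this thread.
BLOCKED_AGENT_SUBSTRINGS = ("badbot", "examplescraper")


def block_unwanted_bots(app, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Wrap a WSGI app so requests from listed user agents get 403 Forbidden."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Anything not on the list is passed through to the real application.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in application that always answers 200 OK."""
    body = b"Hello"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(agent):
    """Run one fake request through the wrapped app and return its status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = agent
    seen = []
    block_unwanted_bots(demo_app)(environ, lambda status, headers: seen.append(status))
    return seen[0]
```

Unlike robots.txt, this is enforced by the server instead of relying on the bot to cooperate, which is the same idea as the .htaccess approach.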

Onya
Woz

Woz




msg:3487063
 9:25 am on Oct 25, 2007 (gmt 0)

[robotstxt.org...]

[robotstxt.org...]

Onya
Woz

jdMorgan




msg:3487164
 12:03 pm on Oct 25, 2007 (gmt 0)

Although I disallow some sections to all robots, this is effectively what I'd have if the site were completely open to them. The 'mix' of allowed robots depends on many factors, such as your primary market (e.g. U.S. or E.U.), whether your site is listed in the ODP, whether you want thumbnail images of your pages to appear on MSN and Ask, whether you have mobile-device pages on your site, and whether you want your site archived to support copyright claims, etc.

So there's no simple answer, and what's right for me is likely not right for you.

# Googlebots, msnbots, Yahoo, and Ask
User-agent: Googlebot
User-agent: msnbot/
User-agent: searchpreview
User-agent: slurp
User-agent: Teoma
User-agent: YahooSeeker/M1A1-R2D2
# DMOZ/ODP, Verizon, girafa page thumbnailer, Internet Archiver
User-agent: Robozilla
User-agent: Verizon Superpages Web Crawler
User-agent: girafa
User-agent: ia_archiver
Disallow:

# disallow all others
User-agent: *
Disallow: /


Jim
