Exclude ALL bots but major search engines
Trying to make the perfect robots.txt
This is what I currently have:
User-agent: Yahoo! Slurp
It is my understanding that this robots.txt will turn away all other (well-behaved) bots and allow only Google, Yahoo! and Live Search to crawl my site. I am not interested in letting ANY other bots on my site, unless they are likely to send me a lot of traffic.
Are there any other search engines that I should include? Maybe Ask? What User-agent does it use? Is it worth it? Or should I just stick with these top three?
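For anyone who wants to check this kind of allow-list before uploading it, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt text in it is an assumption based on the description above, not the exact file from the post, and the tokens Googlebot, Slurp and msnbot should be verified against each engine's own documentation. ExampleScraper is a made-up name standing in for any other bot.

```python
from urllib.robotparser import RobotFileParser

# Assumed allow-list: one group for the three engines (empty Disallow = allow
# everything), then a catch-all group that disallows the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot_name, path="/"):
    """Return True if a compliant bot with this name may fetch the path."""
    return parser.can_fetch(bot_name, path)


# ExampleScraper is hypothetical; it falls through to the "*" group.
for bot in ("Googlebot", "Slurp", "msnbot", "ExampleScraper"):
    print(bot, allowed(bot))
```

This only tells you how a bot that honours robots.txt would read the file. It says nothing about bots that ignore it.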
[edited by: encyclo at 10:20 pm (utc) on Nov. 10, 2007]
Only the major search engines obey robots.txt, so you should block the other bots via .htaccess. I'm no expert on it either, but I'm sure somebody else here will be able to help you.
serpmaster, thegreatpretender's point is that trying to block all but authorised bots via robots.txt will prove largely fruitless, because only the major bots obey robots.txt directives, and even then not always.
At the other end of the scale, unscrupulous bots that are out to spider your content are hardly likely to stop just because you ask them to. That would be like leaving your front door open with a polite sign asking all would-be robbers to please leave your property alone.
If you want to be sure of stopping all unauthorised bots, then you need to take more effective measures than robots.txt alone. If your server is Apache-based, then .htaccess is the way to go, as thegreatpretender suggests. With other servers you need to use other methods.
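To illustrate what "other methods" can mean when the site runs behind a Python application server rather than Apache, here is a rough sketch of the same idea as a WSGI wrapper: refuse requests whose User-Agent matches a deny-list. Everything named here (BLOCKED_UA, block_bad_bots, demo_app and the patterns themselves) is hypothetical and only for illustration; it is not a translation of any particular .htaccess rule set.

```python
import re

# Hypothetical deny-list patterns; replace with the bots seen in your own logs.
BLOCKED_UA = re.compile(r"(examplescraper|badbot)", re.IGNORECASE)


def block_bad_bots(app):
    """Wrap a WSGI app and answer 403 when the User-Agent is on the deny-list."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_UA.search(user_agent):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Anything not on the deny-list is passed through unchanged.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = block_bad_bots(demo_app)
```

Bear in mind that a bot can send any User-Agent string it likes, so a check like this only stops bots that identify themselves honestly.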
Although I disallow some sections to all robots, this is effectively what I'd have if the site were completely open to them. The 'mix' of allowed robots depends on many factors, such as your primary market (e.g. U.S. or E.U.), whether your site is listed in the ODP, whether you want thumbnail images of your pages to appear on MSN and Ask, whether you have mobile-device pages on your site, and whether you want your site archived to support copyright claims, etc.
So there's no simple answer, and what's right for me is likely not right for you.
# Googlebots, msnbots, Yahoo, and Ask
# DMOZ/ODP, Verizon, girafa page thumbnailer, Internet Archiver
User-agent: Verizon Superpages Web Crawler
# disallow all others