To allow some robots only

GoldenHammer

12:46 am on Jan 26, 2006 (gmt 0)

I would like to allow only some robots and disallow all others. How can I identify the robot names to use in robots.txt? What are the robot names for Google, MSN and Yahoo?

Sample:

User-agent: robot1
Disallow:

User-agent: robot2
Disallow:
.
.
.
User-agent: robotn
Disallow:

User-agent: *
Disallow: /

Pfui

2:11 am on Jan 26, 2006 (gmt 0)

1.) You don't need to enter anything to allow the robots you want to visit. The main aspect of robots.txt is to Disallow the ones you don't (assuming they honor robots.txt in the first place).

2.) That said, you can also use robots.txt to control aspects of the major SE bots' behaviors. You'll find the specifics when you visit the majors' sites, use search engines, and read The Web Robots Pages [robotstxt.org].

3.) Also, a lot of the major SEs have more than one bot. Here are just a few to give you an idea. Note that the list is not robots.txt-ready; rather, it's to help you ID some of the majors you may see:

GOOGLE-related...

User-agent: Googlebot
User-agent: Mediapartners-Google*
User-agent: Googlebot-Image

MICROSOFT-related...

User-agent: msnbot
User-agent: SandCrawler - Compatibility Testing

YAHOO!-related...

User-agent: Slurp
User-agent: Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)

(The preceding are ALL case-sensitive.)

4.) Last but not least, regularly eyeball your own server logs and check IPs for more info about what/who is visiting you.
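As a rough illustration of point 4, here is a minimal Python sketch that tallies user-agent strings from access-log lines. It assumes the server writes the Apache "combined" log format, where the user-agent is the last quoted field on each line; the sample lines, IP addresses and the name `count_user_agents` are invented for the example, so adjust the pattern to whatever your own logs look like.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format, in which the user-agent
# is the final double-quoted field on each line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each user-agent string to its hit count."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    '192.0.2.1 - - [26/Jan/2006:02:11:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [26/Jan/2006:02:12:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 120 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.1 - - [26/Jan/2006:02:13:00 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

counts = count_user_agents(SAMPLE_LINES)
for agent, hits in counts.most_common():
    print(hits, agent)
```

Keep in mind that a user-agent string can be spoofed, so a tally like this is only a starting point; checking the visiting IPs is still worthwhile.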

GoldenHammer

2:19 am on Jan 26, 2006 (gmt 0)

Thanks for the idea. Is there already a list of robot names available on the web for easy reference?

Pfui

2:47 am on Jan 26, 2006 (gmt 0)

There are new robots cropping up seemingly every single day, from the majors, from individuals, from all over the globe. Check the Web Robots Pages, plus use search engines to locate sites tracking/compiling recent bot accesses.

There are literally hundreds, if not thousands, of bots nowadays, not to mention bots spoofing browsers, so if you're really into a list, you'll probably have to roll your own. Good luck!

Dijkgraaf

3:39 am on Jan 26, 2006 (gmt 0)

The only such reference I know of is The Web Robots Database [robotstxt.org], but I'm not sure it is being updated, and as previous posters said, there are new ones popping up at a great rate.

GoldenHammer

1:33 am on Jan 28, 2006 (gmt 0)

Thanks for the information.

It seems to me that the robots (like Googlebot, Msnbot, Slurp, etc.) just ignored the robot.txt. Do I have to use .htaccess to make it effective?

Pfui

4:54 am on Jan 28, 2006 (gmt 0)

The majors you mention are very good about heeding robots.txt. If they appear to be ignoring yours, chances are there's a problem either with your file (upload as ASCII/text) or its format.

Use Search Engine World's Robots.txt Validator [searchengineworld.com] to make sure your robots.txt is A-OK. And see also WW sister site's excellent Robots.txt Tutorial [searchengineworld.com].
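Besides an online validator, one way to sanity-check the rules locally is Python's standard `urllib.robotparser` module. The sketch below feeds it a file in the shape asked about at the top of the thread (a few named robots allowed, everyone else disallowed) and reports which agents may fetch the site root. The robot names come from the list earlier in the thread; `SomeOtherBot` and the helper name `allowed_agents` are made up for the example. This only shows how one parser reads the file, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# An empty "Disallow:" means nothing is disallowed for that robot;
# the final "*" record shuts out every robot not named above it.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url="/"):
    """Map each agent name to True/False: may it fetch `url`?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["Googlebot", "msnbot", "Slurp", "SomeOtherBot"],  # last name is hypothetical
)
for agent, allowed in result.items():
    print(agent, "allowed" if allowed else "disallowed")
```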

GoldenHammer

12:51 pm on Jan 28, 2006 (gmt 0)

Thanks for the useful links.

BTW, got my robot.txt validated and no error found. Not sure why the Yahoo and MS robots ignored it.....

jdMorgan

3:14 pm on Jan 28, 2006 (gmt 0)

> BTW, got my robot.txt validated and no error found. Not sure why the Yahoo and MS robots ignored it...

Do be sure to call it "robots.txt", since you've mentioned "robot.txt" more than once in this thread. The file must be named "robots.txt".

Jim
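To make the naming point concrete: robots look for the file under the exact name "robots.txt" at the root of the host. A small Python sketch, using only the standard library, that derives that location from any page URL (the function name `robots_txt_url` and the example.com address are illustrative only):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt location for the host serving `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/forum/thread.html?id=1"))
```

A file uploaded as "robot.txt", or placed in a subdirectory, will not be found at that address and so has no effect.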