Some spiders read robots.txt, some read the html meta robots tags in the pages, and some read both. There are also "bad" spiders which don't read either, and some that read either or both, but ignore whatever they read. Thus, a lot of energy is spent banning these bad spiders from sites, since their usual purpose is to collect e-mail addresses for unsolicited commercial e-mail, or to steal site content.
For the good spiders, as long as your on-page html meta robots tags agree with what you have specified in robots.txt, you should be OK. That is, a page which you have Disallowed in robots.txt should contain a NOINDEX,NOFOLLOW meta robots tag, and one that is allowed by robots.txt should contain either INDEX,FOLLOW or INDEX,NOFOLLOW (according to whether any of the pages that page links to should be spidered).
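To make the "agree with each other" rule concrete, here is a minimal Python sketch that checks one page against a robots.txt. It uses only the standard library (`urllib.robotparser` and `html.parser`). The sample rules, the example.com URLs, and the helper names `meta_directives` and `agrees` are made up for illustration, and the check simply encodes the rule of thumb described above, not any search engine's actual behaviour.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class MetaRobotsParser(HTMLParser):
    """Collects the directives of the first <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = None  # stays None if the page has no such tag

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.directives is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives = {
                d.strip().lower() for d in content.split(",") if d.strip()
            }


def meta_directives(html):
    """Return the set of meta robots directives, or None if the tag is absent."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


def agrees(robots_txt, url, html, agent="*"):
    """True if the page's meta robots tag matches what robots.txt says."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    allowed = rules.can_fetch(agent, url)
    directives = meta_directives(html)
    if directives is None:
        return False  # no tag at all: nothing explicit to agree with
    if allowed:
        # Allowed pages should say INDEX (FOLLOW or NOFOLLOW is up to you).
        return "index" in directives
    # Disallowed pages should say both NOINDEX and NOFOLLOW.
    return {"noindex", "nofollow"} <= directives
```

For example, `agrees(ROBOTS_TXT, "http://example.com/private/a.html", '<meta name="robots" content="NOINDEX,NOFOLLOW">')` reports agreement, while the same page carrying `index,follow` does not.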
Originally, the on-page meta robots tag was intended for people who wrote web pages, but did not have access to the administrative functions on the web server, i.e. robots.txt. The meta tag allowed them some control over spiders indexing their content.
If you do have access to robots.txt, the on-page meta robots tags are kind of redundant...
However, according to something I read somewhere (maybe here), Inktomi's default behaviour is INDEX,NOFOLLOW. As a result, I implement both robots.txt and the on-page meta robots tags to feed Inktomi the INDEX,FOLLOW directives.
Thanks for the reply. If I can just ask this: will Slurp disallow the page if I don't put the noindex,nofollow in the html, despite robots.txt being present?
Also, in my own experience, I don't usually have any robots tag in my html, and I do find my sites get picked up by free Inktomi, so it does seem they are followed and indexed w/o the tag on the page...
Will Slurp disallow the page if I don't put the noindex,nofollow on the html despite the robots.txt being present?
Yes, as long as robots.txt has a Disallow for the page.
The on-page meta robots stuff is redundant for robots.txt-Disallowed pages, and the only open question (to which you have contributed evidence) is for cases where the page is allowed by robots.txt, but the on-page meta robots tag is missing. In that case, someone, somewhere said that Slurp would index that page, but not follow the links.
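The three cases above can be written out as a small decision function. This is only a sketch of the reasoning in this thread, in Python: the name `effective_policy` is hypothetical, and the default of "index but don't follow" when the tag is missing is the unverified claim under discussion, not confirmed Slurp behaviour. Pass a different `default` to model a spider that follows links without a tag.

```python
# (index, follow) when no meta robots tag is present.
# ASSUMPTION: this mirrors the unverified INDEX,NOFOLLOW claim from the thread.
CLAIMED_DEFAULT = (True, False)


def effective_policy(allowed_by_robots_txt, meta, default=CLAIMED_DEFAULT):
    """Return (fetch, index, follow) for one page.

    allowed_by_robots_txt: False if robots.txt has a matching Disallow.
    meta: set of lowercase meta robots directives, or None if the tag is missing.
    """
    if not allowed_by_robots_txt:
        # A Disallowed page is never fetched, so its meta tag is never even read.
        return (False, False, False)
    if meta is None:
        # Allowed, but no tag: fall back to the spider's default.
        return (True,) + tuple(default)
    index = "noindex" not in meta
    follow = "nofollow" not in meta
    return (True, index, follow)
```

With the claimed default, `effective_policy(True, None)` gives `(True, True, False)`, which is the "indexed but links not followed" case; an explicit `{"index", "follow"}` tag gives `(True, True, True)`.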
I need to go find that information again, so I can cite it properly. I remember being surprised when I read it, since a robot should not require the on-page meta robots tags to spider a site completely. And I also remember thinking "I'd say this info is wrong, except for the authoritative source." That was some time ago, and things may have changed since then.
<edit>Here's the link to the source of the info I cited - Brett's Using a Robots Meta Tag article [searchengineworld.com].</edit>