You can't use robots.txt to "tell" spiders anything. You can "request" that good spiders stay out of your site or certain directories, but it is up to them to obey. The bad ones don't, and the really insolent ones won't even fetch your robots.txt.
So, by definition, a "bad 'bot" won't fetch or obey your robots.txt file.
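To make the "request, not command" point concrete, here is a minimal sketch of what a well-behaved robot does before fetching a URL, using Python's standard urllib.robotparser. The robots.txt content, the "ExampleBot" name, and the example.com URLs are made up for illustration. Note that the check runs entirely on the robot's side; a bad 'bot simply never calls it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def polite_can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a robot that honours robots.txt would fetch this URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/private/report.html"):
        print(url, polite_can_fetch("ExampleBot", url))
```

Nothing on the server enforces the result. If you need enforcement, it has to come from the server configuration or the firewall.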
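Let's look at this from an access control standpoint. There are four methods, listed here from the outermost to the innermost: the firewall, the server configuration (httpd.conf or .htaccess), robots.txt, and the on-page robots meta-tag. Each method is dependent upon being allowed by the method before it.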
So, an on-page robots meta-tag will be read and processed by only a robot that is allowed to fetch the page by robots.txt, is not denied access by your server configuration, and is not blocked by the firewall on your server.
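The dependency chain can be sketched as a small function. This is only a model of the reasoning above, assuming a robot that obeys robots.txt; the function and parameter names are invented for the example.

```python
def deepest_layer_reached(blocked_by_firewall: bool,
                          denied_by_server: bool,
                          disallowed_in_robots_txt: bool) -> str:
    """Return the layer at which an obedient robot's request stops.

    Layers are checked from the outside in; the robots meta-tag is only
    ever read if every earlier layer lets the request through.
    """
    if blocked_by_firewall:
        return "firewall"
    if denied_by_server:
        return "server configuration"
    if disallowed_in_robots_txt:
        return "robots.txt"
    return "robots meta-tag"


if __name__ == "__main__":
    # A robot that passes every outer layer gets to read the meta-tag.
    print(deepest_layer_reached(False, False, False))
    # A page disallowed in robots.txt never has its meta-tag read.
    print(deepest_layer_reached(False, False, True))
```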
Now there are some subtle details here, having to do with certain robots' behaviours. You should block 'bad' robots using the firewall or access controls like those used in httpd.conf or .htaccess. Then you should control good robots' fetching of your pages with robots.txt.
Now here's where it gets complex: No 'good' robot will fetch a resource (page, file, image, etc.) if that resource is disallowed in robots.txt. However, some 'good' robots, even though they won't fetch your resource, may still list its URL in their search results if they find a link pointing to it anywhere on the Web. Examples include Google, Yahoo Slurp, and Ask Jeeves.
In order to keep a page from being listed in those engines' search results, you have to *allow* the page to be fetched by not Disallowing it in robots.txt, and then use the on-page robots meta-tag (with a "noindex" value) to prevent the page from appearing in their listings.
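Here is a rough sketch of why that works, modelling an engine that lists URLs it finds links to and that honours "noindex" when it is allowed to read the page. The sample page and the function names are assumptions made for the example, not any engine's actual logic.

```python
from html.parser import HTMLParser

# Hypothetical page carrying a robots meta-tag.
PAGE = ('<html><head><meta name="robots" content="noindex, follow">'
        '</head><body>Hello</body></html>')


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_be_listed(disallowed_in_robots_txt: bool, html: str) -> bool:
    """Could the modelled engine show this URL in its results?"""
    if disallowed_in_robots_txt:
        # The page is never fetched, so its meta-tag is never seen;
        # the URL can still be listed from links found elsewhere.
        return True
    return "noindex" not in robots_directives(html)
```

With this model, may_be_listed(True, PAGE) is True and may_be_listed(False, PAGE) is False: the Disallow hides the very tag that would have kept the page out of the listings.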
The next problem is that robots meta-tags can only be included in HTML pages. So you can't use the robots meta-tag to keep things like Word documents, PDF files, and images out of the search results if someone links to them. In Google's case, you can use the "filetype" exclusion that they've added as an extension to their robots.txt processing (See their Webmaster info page for more details). For non-HTML pages only, Google seems to treat the Disallow as a request not to list the resource in search results.
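For illustration, Google's extended syntax allows "*" as a wildcard and "$" as an end-of-URL anchor, so a rule such as "Disallow: /*.pdf$" targets every URL ending in .pdf. The sketch below shows one way such a pattern could be matched against a URL path. It is an assumption-laden approximation and not Google's implementation; the helper names are invented.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard Disallow pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Without '$', the pattern is treated as a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_patterns) -> bool:
    """True if any pattern matches the path from its first character."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


if __name__ == "__main__":
    rules = ["/*.pdf$", "/*.doc$"]
    for path in ("/docs/manual.pdf", "/docs/manual.pdf.html", "/index.html"):
        print(path, is_disallowed(path, rules))
```

Robots that don't support this extension may interpret such a line differently or ignore it, which is why the result can't be relied on for other engines.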
For the other robots, I'm not sure; the only thing you can do is to Disallow those resources in robots.txt and hope for the best. It's possible that these other engines may adopt Google's extended robots.txt syntax, or that they may adopt their own in the future.
I hope the above is clear; it's not at all a simple thing to "get it right."