robots.txt and meta tags

can I use both?

kdawg

2:09 pm on Dec 11, 2004 (gmt 0)

I want to know what would happen if I added a robots.txt file to my website and also had a robots meta tag on my homepage. I want to use robots.txt to tell a long list of bad spiders to stay away, and use the meta tag to give other directions to the spiders I do want crawling my site. Can this be done, or will it confuse the spiders?

jdMorgan

11:23 pm on Dec 11, 2004 (gmt 0)

Whoa...

You can't use robots.txt to "tell" spiders anything. You can "request" that good spiders stay out of your site or certain directories, but it is up to them to obey. The bad ones don't, and the really insolent ones won't even fetch your robots.txt.

So, by definition, a "bad 'bot" won't fetch or obey your robots.txt file.

Let's look at this from an access control standpoint: Each method is dependent upon being allowed by the method above it.

1. Server firewall

2. Server access control (e.g. httpd.conf and .htaccess files on Apache servers)

3. robots.txt file

4. On-page robots meta-tag

So, an on-page robots meta-tag will be read and processed only by a robot that is allowed to fetch the page by robots.txt, is not denied access by your server configuration, and is not blocked by the firewall on your server.
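To make the layering concrete, here is a minimal Python sketch of the chain. The function name and the two boolean flags are hypothetical, invented for illustration; the robots.txt check uses the standard library's urllib.robotparser, and it only models a well-behaved robot that actually honours robots.txt.

```python
from urllib.robotparser import RobotFileParser


def meta_tag_is_seen(user_agent, path, firewall_blocked, server_denied, robots_lines):
    """Return True only if a well-behaved robot gets far enough to read the page.

    firewall_blocked and server_denied are hypothetical flags standing in for
    the firewall and the httpd.conf/.htaccess layers.
    """
    if firewall_blocked:
        # Layer 1: the connection never reaches the web server.
        return False
    if server_denied:
        # Layer 2: the server refuses the request (e.g. 403 Forbidden).
        return False
    # Layer 3: a good robot checks robots.txt before fetching the page.
    parser = RobotFileParser()
    parser.parse(robots_lines)
    # Layer 4 (the meta-tag) is only reachable if the fetch is allowed.
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    rules = ["User-agent: *", "Disallow: /private/"]
    print(meta_tag_is_seen("GoodBot", "/index.html", False, False, rules))
    print(meta_tag_is_seen("GoodBot", "/private/page.html", False, False, rules))
```

Each early return corresponds to one layer in the list, so a meta-tag on a page under /private/ is never read by a robot that obeys these rules.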

Now there are some subtle details here, having to do with certain robots' behaviours. You should block 'bad' robots using the firewall or access controls like those used in httpd.conf or .htaccess. Then you should control good robots' fetching of your pages with robots.txt.
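As an illustration of the server-side idea only, here is a small Python WSGI middleware that refuses requests by User-Agent substring. This is a stand-in, not the Apache httpd.conf or .htaccess configuration itself, and the blocked names are made up. Keep in mind that a User-Agent string can be forged, so a check like this only stops robots that identify themselves honestly.

```python
# Hypothetical substrings of unwanted robots' User-Agent strings.
BLOCKED_AGENT_SUBSTRINGS = ("badbot", "emailharvester")


def block_bad_robots(app):
    """Wrap a WSGI app so that requests from listed user-agents get a 403."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in agent for name in BLOCKED_AGENT_SUBSTRINGS):
            # Refuse before the page (or robots.txt) is ever served.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def site(environ, start_response):
    """Placeholder application standing in for the real website."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Welcome\n"]


protected_site = block_bad_robots(site)
```

The point of doing this at the server level is that it does not depend on the robot's cooperation, unlike robots.txt.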

Now here's where it gets complex: no 'good' robot will fetch a resource (page, file, image, etc.) that is Disallowed in robots.txt. However, some 'good' robots, even though they won't fetch the resource, may still list its URL in their search results if they find a link pointing to it anywhere on the Web. Examples include Google, Yahoo Slurp, and Ask Jeeves.

In order to keep a page from being listed in those engines' search results, you have to *allow* the page to be fetched by not Disallowing it in robots.txt, and then use the on-page robots meta-tag to prevent the page from appearing in their listings.
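The interaction between the two can be sketched in Python. The helper below is hypothetical: it reports whether a well-behaved robot would ever see a "noindex" directive, which requires both that robots.txt allows the fetch and that the page carries a tag such as `<meta name="robots" content="noindex,follow">`. It uses only the standard library and says nothing about how any particular search engine behaves internally.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def noindex_will_be_seen(robots_lines, user_agent, path, html):
    """True if the robot may fetch the page and the page says noindex."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    if not rules.can_fetch(user_agent, path):
        # Disallowed pages are never fetched, so the tag is never read.
        return False
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in meta.directives
```

With a Disallow on the page, the function returns False even though the tag is present, which is exactly the trap described above: the URL can still be listed because the robot never got to read the noindex.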

The next problem is that robots meta-tags can only be included in HTML pages. So you can't use the robots meta-tag to keep things like Word documents, PDF files, and images out of the search results if someone links to them. In Google's case, you can use the "filetype" exclusion that they've added as an extension to their robots.txt processing (See their Webmaster info page for more details). For non-HTML pages only, Google seems to treat the Disallow as a request not to list the resource in search results.
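As a rough sketch of how such an extended pattern could be matched, assume the extension in question is the `*` wildcard and `$` end-of-URL anchor that Google documents for robots.txt paths, so that a line like `Disallow: /*.pdf$` covers every PDF. The function name below is hypothetical, and this is not Google's actual matching code.

```python
import re


def google_style_match(pattern, path):
    """Match a URL path against a robots.txt pattern with * and $ support.

    '*' matches any run of characters; a trailing '$' anchors the end of
    the URL. Without '$' the pattern is an ordinary prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with a regex wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, as robots.txt rules do.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    print(google_style_match("/*.pdf$", "/docs/manual.pdf"))
    print(google_style_match("/*.pdf$", "/docs/manual.html"))
```

A robot that does not understand these extensions is likely to treat the pattern as a literal path prefix, in which case `/*.pdf$` would match nothing useful.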

For the other robots, I'm not sure; the only thing you can do is to Disallow those resources in robots.txt and hope for the best. It's possible that these other engines may adopt Google's extended robots.txt syntax, or that they may adopt their own in the future.

I hope the above is clear; it's not at all a simple thing to "get it right."

Jim