To Robots.txt or not

Do we use both?

obsos

6:04 am on Jul 30, 2004 (gmt 0)

We have a large portal (just over 10,000 pages) and we use the robots meta tag (e.g. index,follow) on all pages. This works well for us. We can choose which pages we want indexed at the time of creation.

However, we've recently had a company review our site with a view to search optimisation and they have advised that we use a robots.txt file.

Do we need to use both methods?

If we use a blank robots.txt file, does this override the metatag instructions on pages that we don't want indexed (noindex,follow)?

And what about directories like "global" or "images"?
If we have "index,follow" on a page that has images (and they all do), does excluding the "images" directory in robots.txt stop those images from being indexed?

Cheers
Robyn

jdMorgan

6:43 am on Jul 30, 2004 (gmt 0)

Robyn,

> If we use a blank robots.txt file, does this override the metatag instructions on pages that we don't want indexed (noindex,follow)?

No, it won't override those tags, since the robot will be allowed to fetch the pages and read them.

Beyond that, it gets a bit complicated...

If you exclude a URL-path using a Disallow in robots.txt, then the search engines won't fetch pages in those paths, and therefore won't see the robots meta-tags on those pages. Since images don't include html meta-tags, this is a separate issue for them and for other non-html filetypes.
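As a rough illustration of the fetch rule described above, here is a minimal Python sketch using the standard library's urllib.robotparser. The robot name "ExampleBot", the /private/ path, and the example.com URLs are assumptions made up for the example, and real spiders may parse robots.txt somewhat differently from this module.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A blank robots.txt: nothing is disallowed, so every page can be fetched
# and its on-page robots meta-tag can be read.
BLANK = ""

# A robots.txt with a Disallow (the /private/ path is hypothetical).
WITH_DISALLOW = """\
User-agent: *
Disallow: /private/
"""

page = "http://www.example.com/private/page.html"

print(can_fetch(BLANK, "ExampleBot", page))          # True: page is fetched, meta-tag is seen
print(can_fetch(WITH_DISALLOW, "ExampleBot", page))  # False: page is never fetched
```

In the second case the robot never requests the page, so whatever "noindex" or "index" value sits in its meta-tag is never read.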

However, Google, Ask Jeeves, and quite recently, Yahoo will list a page in results even if they have been disallowed from fetching that page by robots.txt. They do this when they find a link to any page from any other page they have indexed. The unfetched page is listed by URL only, without title or description in Google and AJ; Yahoo uses the link text from the link it found as the page title in search results.

This leads to an interesting sort of paradox: In order to tell Google, AJ, and Yahoo not to list a page, you must allow it to be fetched in robots.txt, but include the "noindex" value in the on-page robots meta-tag.
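To make the paradox concrete, the following Python sketch models the behaviour described above. It is only a model of what these engines appear to do, not their actual logic; the function name listing_outcome, the robot name "ExampleBot", and the sample paths are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects the values of any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def listing_outcome(robots_txt: str, agent: str, url: str, html: str) -> str:
    """Model the outcome for one page, given robots.txt and the page's HTML."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    if not robots.can_fetch(agent, url):
        # The page is never read, so its meta-tag is never seen.
        return "not fetched; may still be listed by URL only"
    meta = RobotsMetaParser()
    meta.feed(html)
    if "noindex" in meta.directives:
        return "fetched; not listed"
    return "fetched; may be listed"


NOINDEX_PAGE = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
DISALLOW_ALL_PRIVATE = "User-agent: *\nDisallow: /private/\n"

print(listing_outcome(DISALLOW_ALL_PRIVATE, "ExampleBot", "/private/a.html", NOINDEX_PAGE))
print(listing_outcome("", "ExampleBot", "/private/a.html", NOINDEX_PAGE))
```

The first call reports that the page may still be listed by URL only, even though it carries "noindex"; the second, with fetching allowed, reports that it is not listed.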

The crux of the matter is one of semantics: we want to tell robots which pages not to list in search results, whereas the Standard for Robots Exclusion specifies that robots.txt tells search engine spiders which pages not to fetch. Furthermore, Google, AJ, and Yahoo have adopted the stance that they want to list pages they find links to, in order to reveal more of "the hidden Web" -- obscure pages for which no effort has been made to rank in search engines.

Therefore, I recommend a mixed robots.txt and on-page robots meta-tag approach for maximum control and flexibility. You may also find situations where it is necessary to use server-side URL rewrites to (technically) cloak some pages in order to keep them from being listed in the search engine results.

Images are typically not fetched by text-based search engine robots. Instead, they are handled by separate robots which feed the image search or shopping-service functions of search portals. You can handle them with separate robots.txt records targeted at those specific robots. So far, a robots.txt disallow seems to keep them from being listed.
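A separate record for an image robot might look like the robots.txt text in the sketch below, checked here with Python's urllib.robotparser. Googlebot-Image is used as an example of an image robot's user-agent name; the /images/ path and file names are assumptions, and you should check each engine's documentation for the name its image robot actually uses.

```python
from urllib.robotparser import RobotFileParser

# One record for the image robot, one catch-all record for everyone else.
# An empty "Disallow:" means nothing is disallowed for that record.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /images/

User-agent: *
Disallow:
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# The image robot is kept out of the images directory only.
print(robots.can_fetch("Googlebot-Image", "/images/logo.gif"))  # False
print(robots.can_fetch("Googlebot-Image", "/index.html"))       # True

# The text robot falls through to the catch-all record.
print(robots.can_fetch("Googlebot", "/images/logo.gif"))        # True
```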

pdf, xls, and several other non-html filetypes currently fall in a grey area, since they are "text" but not html. I haven't been able to get mine out of the search listings, but then again, I haven't spent much time trying, either.

I've tried to pick my words carefully, so I hope this is not too confusing.

Jim

obsos

10:48 pm on Aug 1, 2004 (gmt 0)

Thanks Jim,

I *think* I understand. What I'm hearing is to use both .. possibly a blank robots.txt file to allow everyone access to everything, then control it through the use of the metatags?

Cheers
Robyn

jdMorgan

11:20 pm on Aug 1, 2004 (gmt 0)

Robyn,

You'll have to decide based on your own site: how many pages it has, and how much traffic it gets from users and from search engine spiders. You can significantly decrease the amount of bandwidth consumed by spiders by using robots.txt. If you use on-page robots meta-tags only -- without using robots.txt -- then all pages will be fetched in order to read the robots meta-tags. On the other hand, the spiders listed above will list disallowed pages if they find a link to them and your only robots-control is in robots.txt.

Since you're asking, I use a layered approach: Different records for different groups of robots in robots.txt, plus on-page robots meta-tags for some of those pages that are allowed in robots.txt in order to let the robots fetch the noindex robots meta-tags so they won't list those pages at all. In addition, I use user-agent-based redirection to serve a simplified robots.txt to some second-tier robots that cannot handle multiple user-agent records in robots.txt (background [webmasterworld.com]).
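The user-agent-based part of that setup can be sketched roughly as follows in Python. This is not the actual configuration, which is not shown here; the robot names "simplebot" and "legacycrawler", the paths, and the function select_robots_txt are all hypothetical, and in practice the same selection would more likely be done with server rewrite rules.

```python
# Full file: multiple records, one per group of robots.
FULL_ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /images/

User-agent: *
Disallow: /cgi-bin/
"""

# Simplified file: a single catch-all record for robots that
# cannot handle multiple user-agent records.
SIMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
"""

# Hypothetical substrings identifying the limited, second-tier robots.
LIMITED_ROBOTS = ("simplebot", "legacycrawler")


def select_robots_txt(user_agent: str) -> str:
    """Pick which robots.txt body to serve for a request's User-Agent header."""
    ua = user_agent.lower()
    if any(name in ua for name in LIMITED_ROBOTS):
        return SIMPLE_ROBOTS_TXT
    return FULL_ROBOTS_TXT


print(select_robots_txt("SimpleBot/1.0") is SIMPLE_ROBOTS_TXT)
print(select_robots_txt("Googlebot/2.1") is FULL_ROBOTS_TXT)
```

Both checks print True: the limited robot gets the single-record file, and everything else gets the full one.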

One little text file, so many complications... :)

Note that since I posted last, Zyborg (Looksmart & WiseNut) has joined the list of robots that will list a disallowed page based on incoming links only.

Jim