Forum Moderators: goodroi

Robots.txt: Overriding Disallow?

Is it possible to override a disallow on a directory on a url level?


thatoneguy

5:16 pm on Jul 26, 2009 (gmt 0)

10+ Year Member



I was wondering if anyone knows if it is possible to override specific URLs that are contained within a directory that is disallowed.

Example:


User-Agent: *
Disallow: /mostly-useless/
Allow: /mostly-useless/really-important-page.html

I know the last line is incorrect, but it gets the point across. The problem is that the majority of that directory doesn't belong in the index, but I also have 1-2 pages in there that do need to be indexed. I can't think of any method in robots.txt to instruct the single "Allow" URL to take precedence over the Disallow that is applied to the entire directory.

My plan down the road is to move those pages into a better-suited directory, but with the way things are set up now, that isn't feasible.

--edit--
You know what, I actually just tested this in Google's webmaster tools and the above example works just fine :) silly me. Situation resolved!
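For anyone who wants to check this locally: precedence between Allow and Disallow is parser-specific. Google resolves conflicts with the most specific (longest) matching rule, while some older parsers, including Python's stdlib urllib.robotparser, simply apply the first matching rule in file order. A quick sketch (example.com is just a placeholder domain):

```python
import urllib.robotparser

def can_fetch(rules, url, agent="*"):
    """Parse a robots.txt supplied as a list of lines and test one URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)
    return rp.can_fetch(agent, url)

page = "https://example.com/mostly-useless/really-important-page.html"

# Allow listed before Disallow: a first-match parser permits the page.
allow_first = [
    "User-agent: *",
    "Allow: /mostly-useless/really-important-page.html",
    "Disallow: /mostly-useless/",
]

# Disallow listed first: the same parser now blocks the page,
# even though a longest-match parser like Google's would still allow it.
disallow_first = [
    "User-agent: *",
    "Disallow: /mostly-useless/",
    "Allow: /mostly-useless/really-important-page.html",
]

print(can_fetch(allow_first, page))     # True
print(can_fetch(disallow_first, page))  # False
```

So even among parsers that do understand Allow, the safe habit is to list the more specific Allow lines before the broad Disallow.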

GaryK

6:08 pm on Jul 26, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It works in Google, but it might not work with all search engines, as not all of them support the Allow directive.

jdMorgan

6:11 pm on Jul 26, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The "Allow" solution works for Google -- and a very few other robots. "Allow" is a non-standard "extension" of the robots.txt protocol, and is not understood by all robots.

It would be best practice for robots encountering a line that they do not understand to simply ignore that line, but many second-tier robots will simply 'blow up' and either ignore your site or do the wrong thing. There is a stunning lack of robustness in robots.txt handling, even among the major players.

Therefore, you should add additional policy records to your robots.txt so that the "Allow" line is only fed to robots that will understand it. Similarly, some robots don't understand "Crawl-Delay" or "Sitemap" directives, or query-string and wildcard URL patterns. As a Webmaster, you should only feed a 'bot that which you know it can digest.
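Concretely, that split might look something like this (Googlebot is documented to support Allow, while the catch-all group sticks to plain Disallow):

User-Agent: Googlebot
Disallow: /mostly-useless/
Allow: /mostly-useless/really-important-page.html

User-Agent: *
Disallow: /mostly-useless/

Note that a robot which matches a specific User-Agent record ignores the "*" record entirely, so the Googlebot group has to repeat the Disallow line rather than inherit it.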

Unfortunately, this means going to the "webmaster help" page of each and every 'bot that is important to you to research the current state of their robots.txt handling.

Jim