Forum Moderators: goodroi
Example:
User-Agent: *
Disallow: /mostly-useless/
Allow: /mostly-useless/really-important-page.html
I know the last line is technically non-standard, but it gets the point across. The problem is that the majority of that directory doesn't belong in the index, but I also have one or two pages in there that do need to be indexed. I can't think of any method in robots.txt to make the single "Allow" URL take precedence over the Disallow that is applied to the entire directory.
My plan down the road is to move those pages into a better-suited directory, but with the way things are set up now, that isn't feasible.
--edit--
You know what, I actually just tested this in Google's webmaster tools and the above example works just fine :) silly me. Situation resolved!
It would be best practice for robots encountering a line that they do not understand to simply ignore that line, but many second-tier robots will simply 'blow up' and either ignore your site or do the wrong thing. There is a stunning lack of robustness in robots.txt handling, even among the major players.
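To make that concrete (treating Python's standard-library parser as a stand-in for one of those stricter robots; `example.com` is just a placeholder host): `urllib.robotparser` matches rules in file order, first match wins, so the exact record that Google accepts gives the opposite answer here. Google instead picks the most specific (longest) matching rule, which is why the Allow wins in its tester.

```python
from urllib.robotparser import RobotFileParser

# The same record the original poster tested in Google's tools.
rules = """\
User-agent: *
Disallow: /mostly-useless/
Allow: /mostly-useless/really-important-page.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The stdlib parser checks rules in file order and stops at the
# first match: "Disallow: /mostly-useless/" matches the important
# page before the Allow line is ever considered.
print(rp.can_fetch("SomeBot",
      "https://example.com/mostly-useless/really-important-page.html"))
# Pages outside the disallowed directory match no rule and default
# to allowed.
print(rp.can_fetch("SomeBot", "https://example.com/other-page.html"))
```

Two parsers, one file, two different crawl decisions: that is exactly the robustness gap described above.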
Therefore, you should add additional policy records to your robots.txt so that the "Allow" line is only fed to robots that will understand it. Similarly, some robots don't understand the "Crawl-delay" or "Sitemap" directives, or query-string and wildcard URL patterns. As a Webmaster, you should only feed a 'bot that which you know it can digest.
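One hedged sketch of that per-bot layout (the specific rules are illustrative, not a recommendation; a robot uses the most specific User-agent record that matches it and ignores the rest):

```
# Record for a robot known to support Allow (Googlebot does)
User-agent: Googlebot
Disallow: /mostly-useless/
Allow: /mostly-useless/really-important-page.html

# Lowest-common-denominator record for everyone else:
# only directives every robot understands
User-agent: *
Disallow: /mostly-useless/
```

The trade-off is duplication: the shared rules must be repeated in each record, since a robot that matches a specific record does not also read the `*` record.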
Unfortunately, this means going to the "webmaster help" page of each and every 'bot that is important to you to research the current state of their robots.txt handling.
Jim