phranque - 9:58 am on Jun 4, 2010 (gmt 0)
i received a few hints about crawling and indexing issues from a small alpine bird that i thought should be shared.
if robots.txt is your primary line of defense to prevent indexing of content, consider what happens when the request for robots.txt returns a 404.
this means robots.txt is Not Found and therefore anything is fair game for the crawlers.
this status may occur with any number of server glitches or perhaps with an incorrectly configured CDN.
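a rough python sketch of how you might monitor for that failure mode. the names `robots_txt_status` and `robots_txt_risk` and the returned labels are made up for illustration, and only the 404 case comes from the point above; anything else that isn't a 2xx is simply flagged for a manual look rather than guessing at crawler behavior.

```python
import urllib.error
import urllib.request


def robots_txt_status(site: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for <site>/robots.txt (hypothetical helper)."""
    url = site.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is carried on the exception
        return err.code


def robots_txt_risk(status_code: int) -> str:
    """Map a robots.txt status code to an illustrative risk label."""
    if 200 <= status_code < 300:
        return "rules_apply"       # file was served, disallows can be read
    if status_code == 404:
        return "no_rules"          # Not Found: crawlers see no restrictions
    return "check_manually"        # anything else: don't assume, investigate
```

you could run `robots_txt_risk(robots_txt_status("https://www.example.com"))` from a cron job (example.com is a placeholder) and alert on anything other than `"rules_apply"`.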
therefore if you are using robots.txt to reduce server load or bandwidth usage but noindexing is also important, you should also use the robots meta tag, the x-robots-tag HTTP header, or perhaps authentication if appropriate, to block indexing.
for example you could block indexing by using .htaccess to add an x-robots-tag HTTP header for all URLs in a specific path, or for URLs matching a pattern similar to your robots.txt disallows.
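the same idea can be sketched outside of .htaccess. below is a minimal python wsgi version, not the apache approach itself: it adds the header to any response whose path starts with one of your disallowed prefixes. `needs_noindex`, `noindex_middleware` and the `/private/` prefix are hypothetical names for illustration, and simple prefix matching is an assumption; real robots.txt patterns with wildcards would need more than this.

```python
def needs_noindex(path, disallow_prefixes):
    """True if the path starts with any of the (hypothetical) disallowed prefixes."""
    return any(path.startswith(prefix) for prefix in disallow_prefixes)


def noindex_middleware(app, disallow_prefixes):
    """Wrap a WSGI app so matching paths get an X-Robots-Tag: noindex header."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start(status, headers, exc_info=None):
            if needs_noindex(path, disallow_prefixes):
                # copy the list so the wrapped app's own headers are not mutated
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped
```

usage would be something like `application = noindex_middleware(application, ["/private/"])`, keeping the prefix list in step with the disallow lines in your robots.txt so the header still applies if robots.txt goes missing.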
in any case, you should avoid making fast-paced changes in the robots.txt, such as dayparting to control crawler access times, since crawlers cache robots.txt and may act on a stale copy, which can result in unintended consequences.