cpollett - 1:02 am on Apr 2, 2012 (gmt 0)
What I meant was that the queue responsible for a collection of hosts probably wants to keep all the robot stuff in memory, so it's not too slow to do look-ups as it decides whether or not to schedule a URL. Checking disk is slow. So if you have a policy of flushing your robot data every day and re-fetching it, you limit the amount of memory that needs to be used for robot stuff, which means you can use that memory for other things that can help speed up the crawl process. There is a trade-off going on -- it obviously makes sense to cache the robot data for a little while, but hold it too long and you detract from better uses of that memory.
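To make the trade-off concrete, here is a minimal sketch of that kind of policy: an in-memory per-host robots cache with a time-to-live, so look-ups stay fast but stale entries get flushed and their memory freed. The class name, the `fetch_robots` callback, and the one-day TTL are all illustrative assumptions, not how any particular queue server actually implements it.

```python
import time

ROBOTS_TTL = 24 * 60 * 60  # hypothetical policy: flush robot data after one day


class RobotsCache:
    """In-memory cache of per-host robots.txt rules with a simple expiry policy."""

    def __init__(self, ttl=ROBOTS_TTL):
        self.ttl = ttl
        self.entries = {}  # host -> (fetch_time, parsed_rules)

    def get_rules(self, host, fetch_robots):
        """Return cached rules for host, re-fetching them if the entry has expired."""
        now = time.time()
        cached = self.entries.get(host)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]            # fast path: no disk or network access
        rules = fetch_robots(host)      # slow path: fetch and parse robots.txt again
        self.entries[host] = (now, rules)
        return rules

    def flush_expired(self):
        """Drop expired entries so that memory can go to other crawl state instead."""
        now = time.time()
        self.entries = {h: e for h, e in self.entries.items()
                        if now - e[0] < self.ttl}
```

With a shorter TTL the cache stays small and more memory is left for the rest of the crawl; with a longer TTL you re-fetch robots.txt less often but pay for it in memory, which is exactly the balance described above.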