lucy24 - 12:01 am on Apr 2, 2012 (gmt 0)
each with a different token suggests to me their queue takes at least several days from link extraction to actual download
Not just Google. I recently goofed on a set of, ahem, relative links (unrelated post elsewhere). Fortunately I caught them before anyone but Yandex came by. There were eight potentially affected URLs; they crawled seven of them at once, and then came back a week later to look for the 8th* and pick up their final 404.
it's actually easier to make your crawler faster if you check robots.txt not so infrequently because it means you don't have to keep as big a cache of them in memory
I tried that both with and without the "in-" but couldn't wrap my brain around it either way :( All I know is that some robots make robots.txt the first stop on each individual visit, while some outsource it and catch up when they feel like it.
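One possible reading of the quoted claim, as a rough sketch only: if a crawler re-fetches robots.txt after a short time-to-live, it can throw away old copies sooner, so fewer of them sit in memory at once. The `RobotsCache` class, the `fetch` callable and the `ttl_seconds` parameter below are made-up names for illustration, not how any real search engine's crawler is known to work. Only `urllib.robotparser.RobotFileParser` is real (Python standard library).

```python
import time
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Hypothetical per-host robots.txt cache with a time-to-live."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self.fetch = fetch        # assumed callable: host -> robots.txt text
        self.ttl = ttl_seconds    # shorter TTL = more fetches, smaller cache
        self.clock = clock
        self.entries = {}         # host -> (fetched_at, parser)

    def _evict_expired(self):
        # Drop every copy older than the TTL so it stops using memory.
        now = self.clock()
        expired = [h for h, (t, _) in self.entries.items() if now - t >= self.ttl]
        for host in expired:
            del self.entries[host]

    def allowed(self, host, path, agent="*"):
        self._evict_expired()
        if host not in self.entries:
            # "First stop on each visit" behaviour once the old copy is gone.
            parser = RobotFileParser()
            parser.parse(self.fetch(host).splitlines())
            self.entries[host] = (self.clock(), parser)
        return self.entries[host][1].can_fetch(agent, path)
```

With a long `ttl_seconds` the same sketch behaves like the robots that "catch up when they feel like it": fewer robots.txt requests, but stale rules and a bigger cache.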
Wish there were a tab in WMT for "robots.txt has changed" so you could request an immediate update.
* Mental picture of robot thinking vaguely "Did I forget to do something...?"