On March 18, I rolled out a new version of my website. One of its features was a statistics page about the search indexes on my site. This feature, which had actually gone live about a week earlier, had a series of bar charts with a link next to each bar. So if a bar said 10 million, there would be a link to a search that would list those results, 10 at a time, with a next button. Unlike Google, this does not cut off after the first 1,000 results. For instance, the top-level domain chart had a .com anchor linking to all the pages in a given index that were downloaded from a .com domain. Here is an example of what this looks like:
Foolishly, when I pushed this page I hadn't thought about robots. I have since added noindex, nofollow metas and rel=nofollow on all the links, to be safe (example markup below). Shortly after pushing this change, I switched from the statistics page for a 20 million page index to one for a 100 million page index. As I don't have a lot of hard drive space, I deleted the old crawl (March 18).

A couple of days later, I was sitting there thinking: why are the hard drives on my poor Mac minis spinning like there is no tomorrow? When I looked at my console messages, I realized spiders were going nuts, essentially recrawling my whole index. The worst offenders seemed to be some spiders from 180.76.5 pretending to be Baidu spiders. In any case, it is now more than 10 days since I deleted the 20 million page index. I blocked the 180.76.5 IPs, but spiders from legitimate IPs, Googlebot in particular, are still requesting URLs that point into search results for my deleted index.

If I had been thinking about this from an experimental point of view and had designed it better, my accident could have been used to estimate the maximum time from when a URL is harvested by Google until it actually gets scheduled and crawled. My guess is that Google had treated those links as relatively low budget, so they were lingering in the queue. As I said, if I had intended to do this, I could have made a better test, because I am not 100% certain that the garbage pages Google was getting for the nonexistent links until a day or two ago didn't exacerbate the problem; they now return a normal 404.

As of yesterday, the robots.txt file should be blocking this nonexistent query (assuming * in Disallow lines is recognized). Googlebot has downloaded the robots.txt file four times but is still making these queries. I guess this last part could be used to estimate the time from when a robots.txt file is downloaded until it propagates through Google's crawl queue.
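For what it's worth, the Disallow rule I'm relying on is along these lines; the path here is only illustrative, not my actual query string:

    User-agent: *
    Disallow: /search?*

The * is supposed to match anything after the query path, which is the part I'm assuming Google recognizes.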
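And for the metas and link attributes I mentioned above, the markup is roughly this (the URL and anchor text are placeholders, not my real pages):

    <meta name="robots" content="noindex, nofollow">
    <a href="/search?tld=com&start=0" rel="nofollow">10,000,000</a>

The meta goes in the head of the pages in question, and the rel=nofollow goes on each of the links out of the bar charts.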