|Estimating the Size of Google's URL Crawl Queue|
| 6:01 am on Mar 30, 2012 (gmt 0)|
On March 18, I rolled out a new version of my website. One of the features of the new version, rolled out about a week earlier, was a statistics page about the search indexes on my site. This page had a series of bar charts with a link next to each bar: if a bar said 10 million, the link led to a search listing those results, 10 at a time, with a next button. Unlike Google, this does not cut off after the first 1000 results. For instance, the top-level-domain chart had a .com anchor linking to all the pages in a given index that were downloaded from a .com domain.
Foolishly, when I pushed this page I hadn't thought about robots. I have since added noindex, nofollow metas and rel=nofollow on all the links, to be safe. Shortly after pushing this change I switched from a statistics page for an index of 20 million pages to one of 100 million. As I don't have a lot of hard drive space, I deleted the old crawl (March 18).
A couple of days later I was sitting there thinking: why are the hard drives on my poor Mac minis spinning like there is no tomorrow? When I looked at my console messages I realized spiders were going nuts, essentially recrawling my whole index. The worst offenders seemed to be some spiders from 180.76.5 pretending to be Baidu spiders. In any case, it is now more than 10 days since I deleted the 20 million page index. I blocked the 180.76.5 IPs, but from the spiders at legitimate IPs, in particular Googlebot, I am still getting requests for URLs into search results for my deleted index.
If I had been thinking about this from an experimental point of view and had designed this better, my accident could have been used to estimate the maximum time from when a URL is harvested by Google until it actually gets scheduled and crawled. My guess is Google had treated those links as relatively low budget, so they were lingering in the queue. As I said, if I had intended to do this I could have made a better test, because I am not 100% certain that the garbage pages Google was getting for the nonexistent links until a day or two ago didn't exacerbate the problem; it now goes to a normal 404. As of yesterday the robots.txt file should be blocking this nonexistent query (assuming * in Disallow lines is recognized); Googlebot has downloaded the robots.txt four times but is still making queries. I guess this last could be used to estimate the time from when the robots.txt file is downloaded until it propagates through Google's queue.
| 1:11 pm on Mar 30, 2012 (gmt 0)|
It looks like Googlebot is following my robots.txt (no surprise). It is still requesting pages involving the nonexistent statistics page, but it occurred to me that since this gives a 302 to a 404, Googlebot could just be checking whether the location moved back or something. In any case, my setup wasn't originally meant to measure the Google queue size; I was just trying to point out that something like this, done correctly, could be used to estimate queue size.
| 1:23 pm on Mar 30, 2012 (gmt 0)|
302 to 404 is a very bad thing to do. Return 404 at the originally requested URL.
It can take 24 hours or more for robots.txt directives to kick in once they pull the file from your site.
| 1:48 pm on Mar 30, 2012 (gmt 0)|
Yes. Glancing at some other posts I see this. Would a 301 to a 404 be kosher?
| 1:54 pm on Mar 30, 2012 (gmt 0)|
Not really. The 301 says the content has moved to a different URL.
If the content has gone, return the correct 404 or 410 status at the originally requested URL.
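The advice above can be sketched as a minimal WSGI app (hypothetical paths, not anyone's actual code from this thread): serve the error status directly at the requested URL instead of 302-redirecting to an error page, so crawlers see the 404 where they asked for it.

```python
# Minimal WSGI sketch of "return the error at the originally requested URL".
# The page set here is made up for illustration.

EXISTING = {"/", "/index.html"}  # assumed set of live pages

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path in EXISTING:
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html><body>ok</body></html>"]
    # Gone or never existed: answer 404 right here,
    # never a 302 redirect to a "not found" page.
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found.\n"]
```

The point is simply that the status line is emitted for the URL the crawler requested; a redirect would instead tell it the content lives somewhere else and invite re-checking.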
| 2:23 pm on Mar 30, 2012 (gmt 0)|
Thanks. Fixed that to a 404. Code probably couldn't predict if 410 or not.
| 8:40 pm on Mar 30, 2012 (gmt 0)|
|Code probably couldn't predict if 410 or not. |
No, you generally have to do them manually. If you've physically deleted an entire directory, requests for that directory could get a collective 410, but only if you're willing to cheat a little by letting it return 410 even for requests for files that never existed in the first place. Otherwise it's back to listing the names one by one.
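The "collective 410" idea above can be sketched in a few lines (the directory name is hypothetical): once a whole directory has been deleted, answer 410 Gone for anything under it, deliberately including filenames that never existed, rather than enumerating the old names one by one.

```python
# Sketch of a blanket 410 for a wholesale-deleted directory.
# DELETED_DIRS is an assumed, made-up example.

DELETED_DIRS = ("/stats/",)  # directories removed wholesale

def status_for(path):
    """Pick a status line for an unknown URL."""
    if any(path.startswith(d) for d in DELETED_DIRS):
        return "410 Gone"       # covers never-existed names too (the "cheat")
    return "404 Not Found"      # default for anything else unknown
```

The trade-off is exactly the one described: 410 signals "gone for good" to crawlers, at the cost of claiming that about paths you never actually served.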
| 8:16 pm on Apr 1, 2012 (gmt 0)|
So I left my robots.txt open so that Google could still download statistics pages, just not follow any links off them. The refresh interval on robots.txt, as mentioned, is probably less than 24 hours (it's actually easier to make your crawler faster if you re-check robots.txt fairly frequently, because it means you don't have to keep as big a cache of robots data in memory). I am still getting requests from Google for the nonexistent stats page. The reason is that I have an anti-CSRF token that I tack onto the query string. So Google, presumably while there were still links to that page, was extracting a new link each time it requested one of the other pages on my site. The fact that I am still seeing requests to that page, each with a different token, suggests to me their queue takes at least several days from link extraction to actual download.
| 9:05 pm on Apr 1, 2012 (gmt 0)|
To be more precise: as a portion of my CSRF token is a UNIX timestamp, I can tell the link must be from February. I see a few from Feb 2 and some from Feb 5.
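Dating the links works as sketched below, assuming a hypothetical token format like `"<timestamp>-<random hex>"` (the poster's actual token layout isn't given): pull the embedded UNIX timestamp back out of the logged query string and render it as a date.

```python
# Sketch: recover the harvest date from a CSRF token that embeds a UNIX
# timestamp. The "<timestamp>-<hex>" layout is an assumption for illustration.

import time

def token_date(token):
    """Extract the embedded UNIX timestamp and render it as a UTC date."""
    ts = int(token.split("-", 1)[0])  # assumed: timestamp is the first field
    return time.strftime("%Y-%m-%d", time.gmtime(ts))
```

Any monotone stamp tucked into generated URLs gives the same free measurement: the spread of dates in incoming requests bounds the age of the crawler's link queue.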
| 12:01 am on Apr 2, 2012 (gmt 0)|
|each with a different token suggests to me their queue takes at least several days from link extraction to actual download |
Not just google. I recently goofed on a set of, ahem, relative links (unrelated post elsewhere). Fortunately I caught them before anyone but Yandex came by. There were eight potentially affected URLs; they crawled seven of them at once, and then came back a week later to look for the 8th* and pick up their final 404.
|it's actually easier to make your crawler faster if you check robots.txt not so infrequently because it means you don't have to keep as big a cache of them in memory |
I tried that both with and without the "in-" but couldn't wrap my brain around it either way :( All I know is that some robots make robots.txt the first stop on each individual visit, while some outsource it and catch up when they feel like it.
Wish there were a tab in WMT for "robots.txt has changed" so you could request an immediate update.
* Mental picture of robot thinking vaguely "Did I forget to do something...?"
| 1:02 am on Apr 2, 2012 (gmt 0)|
What I meant was that the queue responsible for a collection of hosts probably wants to keep all the robots data in memory, so it's not too slow to do lookups as it decides whether or not to schedule a URL. Checking disk is slow. So if you have a policy of flushing your robots data every day and re-fetching it, you limit the amount of memory that needs to be used for robots data, which means you can use that memory for other things that can help speed up the crawl process. There is a trade-off going on: it obviously makes sense to cache the robots data for a little while, but hold it too long and you detract from better uses of that memory.