aakk9999 - 1:46 pm on Mar 16, 2011 (gmt 0)
Crawl allowance - Just a bit
In order to even see the noindex meta tag, Googlebot must crawl the page. It may crawl less frequently after it verifies the noindex a few times, but it must continue to crawl.
What would be the best way to recover crawl allowance? Is it a robots.txt exclusion, or perhaps returning a 301 redirect, or returning a 410 Gone (if appropriate)?
Let's say there was a mistake where a large number of dynamic URLs were exposed to Google (e.g. through a development script error). This has since been fixed, but Google knows about these URLs and will keep requesting them periodically. What would be the best way to "recover crawl budget"?
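To make the 410 option concrete, here is a minimal sketch of how the leaked URLs could be answered once the script error is fixed. It assumes the leaked URLs can be recognised by a single query parameter; the name `debug_session` and the example.com URLs are made up for illustration. It only shows what the server would return, not how Google reacts to it.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical: the query parameter the buggy script leaked into URLs.
LEAKED_PARAM = "debug_session"


def status_for(url: str) -> int:
    """Return the HTTP status this sketch would send for `url`."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if LEAKED_PARAM in query:
        return 410  # Gone: this URL was never meant to exist
    return 200      # everything else is served normally


if __name__ == "__main__":
    for url in (
        "http://example.com/page?debug_session=abc",
        "http://example.com/page?id=3",
    ):
        print(status_for(url), url)
```

The same check could return 301 with a Location header instead, if each leaked URL has an obvious canonical equivalent.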
Where I am heading with this is that on this forum we mostly talk about "crawling / indexing / ranking", but I see another step, which is a "URLs TODO" list kept somewhere.
So the accidental leakage of dynamic URLs may substantially increase this "URLs TODO" list for the site. Would G. ever drop a URL from the "TODO" list once it knows about it, regardless of whether it actually requests or crawls the page?
The way I see it is:
You may have URLs on that "TODO" list that:
a) should not even be requested (because of a robots.txt exclusion) and so will not be crawled
b) will be requested but will not be crawled (e.g. the response is 404, 301, 302, 410, 5xx etc.), yet seem to remain on the "TODO" list for later.
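The two groups above can be sketched in code, purely as an illustration of the distinction I am drawing. The robots.txt rule, the URLs and the `todo_bucket` helper are all assumptions of mine; the only real API used is Python's standard `urllib.robotparser`. This says nothing about how Google actually stores or prunes its list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def todo_bucket(url: str, status: int, robots: RobotFileParser) -> str:
    """Classify a known URL into group a), group b) or 'crawled'."""
    if not robots.can_fetch("Googlebot", url):
        return "a"        # excluded by robots.txt, never requested
    if status != 200:
        return "b"        # requested, but no page content to crawl
    return "crawled"      # requested and crawled normally


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

if __name__ == "__main__":
    known = [
        ("http://example.com/search/q?x=1", 200),
        ("http://example.com/old-page", 410),
        ("http://example.com/about", 200),
    ]
    for url, status in known:
        print(todo_bucket(url, status, robots), url)
```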
I am wondering whether the size of this "TODO" list also affects the depth and frequency of crawling of important pages and, if so, how to reduce the list. Or, in the case of mistakenly flooding the site with URLs that are subsequently removed or redirected, how to tell Google: stop checking these, do not waste your time? Or perhaps it does not matter at all how big this "TODO" list is?
For instance, provided there are no links (internal or external) to a page that returns 410, will G. drop that URL from the "TODO" list for good? Similarly, would a 404 be dropped if there are no links to that page (even though it may take longer)? With regard to 301: perhaps a URL will be requested less and less often the longer it keeps returning a 301 response? Or even less frequently if no links point to the URL that responds with 301? Is there any chance of such a URL being completely dropped from G.'s "TODO" list?
Would be interested in any thoughts on all this.