|Googlebot massively tries URLs not existing for 3+ years|
2014 brought an interesting new pattern in Googlebot's visits to one of my sites.
Some back-story first. I have shot myself in the foot, so to speak, twice over the last 3 years: in trying to create a more accessible and manageable URL structure, I changed content URLs twice. Once to eliminate capital letters from URLs, and another time to change pagination so that each page of content is longer (which led to fewer pages). Not every single content page changed its URL, but close to 90% did.
Googlebot has already come around (many times) since then, and Google knows about the most recent URLs because I see them ranking on their own. The most recent URLs have not changed in about 1.5 years.
Since this January 1st, I see almost every Googlebot hit starting with the oldest URLs first, which in most cases creates three Googlebot requests to read one content page, diluting the crawl budget.
So, it goes like this:
- request old page with capitalized URL and wrong pagination -> 301
- request more recent but still old page with lowercase URL and wrong pagination -> 301
- request proper page -> 200
With quite a bit of work I can do something to eliminate the second request (both 301s are created by my server), but it behooves me to understand why they are coming back for pages that have not existed for 3+ years, that were already served a 301 (i.e. permanent) redirect 3+ years ago, and that have clearly been replaced by the new pages in Google's index (since the new ones have been ranking for a couple of years). In other words, short of Google's index being wiped clean and Googlebot following old links again, I can't find a reasonable explanation. Even then, not all content pages have or had external links, and there are no internal links to the old URLs for Gbot to follow.
So, does Google ever have index "spring cleaning" events, when everything gets wiped clean and has to be recrawled and reindexed afresh? What might the implications be?
Search engines never forget. The bigger your site is, the longer their memories are. Disclaimer: I made up the second sentence.
Your post seems to imply that there was a two-step redirect:
from name1 to name2
from name2 to name3
rather than a one-step
from name1 to name3
from name2 to name3
If that's the case you should absolutely consolidate your redirects. In fact you should turn back the clock and consolidate them from Day 1. Uhm. Not practical, I guess.
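As a rough illustration of consolidating a two-step redirect into one-step redirects, here is a minimal Python sketch. It assumes the redirects can be expressed as a simple old-path to new-path mapping; the function name and the example paths are hypothetical and are not taken from the original poster's site.

```python
def flatten_redirects(redirects):
    """Map every old path straight to its final destination.

    `redirects` maps an old path to the path it currently redirects to.
    """
    flat = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        # Follow the chain until we reach a path that is not itself redirected.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target!r}")
            seen.add(target)
            target = redirects[target]
        flat[start] = target
    return flat


# Hypothetical paths standing in for name1 -> name2 -> name3.
chained = {
    "/Articles/Widgets-Page7": "/articles/widgets-page7",
    "/articles/widgets-page7": "/articles/widgets-page4",
}
flat = flatten_redirects(chained)
```

With the flattened mapping, both the capitalized URL and the lowercase URL point directly at the current page, so a crawler needs one 301 instead of two before it reaches the 200.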
Any possibility you could make some new pages with those old URLs?
It might be a way to take advantage of what is happening anyway.
Thanks for your input, guys. Yes, I do realize they don't forget, but I thought this was the case for 404 and 302 codes, the ones that imply that perhaps the document is missing or relocated temporarily. It sounds like a waste of resources to keep coming back for documents that were permanently relocated. Also, it was very curious to me that they came in droves for the 3-year-old URLs, not the 1.5-year-old ones or a combination thereof. It looks like some old database of URLs got restored from a backup made 3 years ago. Does anyone see an increase of interest specifically in old URLs on their sites (be they correct URLs or ones since redirected or removed)?
I have them coming back for URLs older than that. All the time. Only thing I can think of is maybe old scrapers that suddenly show back up with a lot of old links to my older URL structures. Since I can't do anything about it, I've given up worrying about it; it is what it is.
Yeah, Google never forgets. I've seen it trying URLs which had not existed for 8 years!
When Dynamic Search Ads came out, our company was a BETA client. When I pulled a destination URL report I was alarmed to see Google had been trying to land people on these ancient URLs. The problem was that all the redirect rules for those URLs had been deleted because they were so old.
The only solution I know of is to 410 these URLs OR remove them via Google Webmaster Tools but then you have to block in robots.txt or configure your server to return 404s when Googlebot checks through your removal request.
Welcome to WebmasterWorld, SEOWeasel.
There's no doubt that a complete crawl will go on every so often, with the idea being to eliminate pages that don't exist. It's also possible that the URLs are linked from somewhere.
For example: Are the urls in the wayback machine?
Thanks again for your input guys. I guess returning 410 might have eventually stopped these crawls (or not), but I do need to have redirects instead: some of the old URLs were linked to and many of those old links are pretty good and still deliver referrals, and I shall assume some of that mysterious substance - link juice :)
As far as there being links to the oldest pages - I am certain there are a lot still out there. I am also not surprised that they come around from time to time to check on those. The main question is: why the massive renewal of interest now? I have seen Googlebot try old URLs alongside the new before, but this time around accesses for the new URLs are drowned in the sea of hits on the old.
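One way to put a number on how far hits on old URLs outweigh hits on new ones is to tally Googlebot requests by response status from the access log. The sketch below is only an illustration: it assumes the common "combined" log format, the sample lines are made up, and matching on the user-agent string alone does not prove a request really came from Google.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; adjust the pattern otherwise.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_counts(lines):
    """Count status codes for requests whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("status")] += 1
    return counts


# Made-up sample lines for illustration only.
GBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [05/Jan/2014:10:00:00 +0000] "GET /Old/Page-7 HTTP/1.1" 301 0 "-" "{GBOT}"',
    f'66.249.66.1 - - [05/Jan/2014:10:00:01 +0000] "GET /old/page-7 HTTP/1.1" 301 0 "-" "{GBOT}"',
    f'66.249.66.1 - - [05/Jan/2014:10:00:02 +0000] "GET /old/page-4 HTTP/1.1" 200 5120 "-" "{GBOT}"',
    '203.0.113.9 - - [05/Jan/2014:10:00:03 +0000] "GET /old/page-4 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]
counts = googlebot_status_counts(sample)
```

Comparing the 301 count with the 200 count over a few weeks of logs would show whether the surge in requests for old URLs is real or just an impression.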
Have other people noticed an upsurge in the same behavior? Seems like, if it's something a search engine does at intervals of a year or more, the mega-crawls would have to be randomly distributed throughout the year. And, just to make you uneasy, some of those random mega-crawls would randomly and coincidentally come right after the search engine has instituted some new, highly publicized algorithm change. Or right after you've made substantive site changes ;)
Check if Google crawls a URL that redirects to the removed page.
For example, if you return 410 for /removedpage but 301 for /removedpage?some=querystring (redirects to fix canonical issues including www/non-www), Google keeps checking the redirected urls and keeps requesting deleted pages.
I had old URLs (3 or more years old) in my crawl errors but I never marked them as fixed or even worried about them. Only since the "Hummingbird" update did I start taking notice and start working through the errors. I was able to work through all the old URLs and 301 redirect them to the new URLs that had the same or updated content, and they haven't come back since and are not indexed.
|if you return 410 for /removedpage but 301 for /removedpage?some=querystring (redirects to fix canonical issues including www/non-www) |
This kind of thing can and should be fixed, though. If the simple path returns a 410, then the same 410, without a 301, will be returned for the same path with any appended queries. And your [G] directives will come before any redirects, so there should be no opportunity to redirect anything.
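The ordering described above can be sketched in Python as a small decision function. This is only an illustration of "check for gone pages first, on the path alone, before any canonicalization redirect"; it is not server configuration, and the names `GONE_PATHS`, `CANONICAL_HOST`, and `decide`, as well as the https scheme, are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical values; a real site would load these from its own config.
GONE_PATHS = {"/removedpage"}
CANONICAL_HOST = "www.example.com"


def decide(host, request_uri):
    """Return (status, location) for a request, checking 410 rules first."""
    path = urlsplit(request_uri).path
    # The gone check runs first and looks at the path only, so
    # /removedpage?some=querystring gets the same 410 as /removedpage.
    if path in GONE_PATHS:
        return 410, None
    # Canonicalization redirects are considered only for pages that still exist.
    if host != CANONICAL_HOST:
        return 301, f"https://{CANONICAL_HOST}{request_uri}"
    return 200, None
```

Because the 410 branch returns before the host check, a request for the removed page on the non-canonical host, with or without a query string, never receives a 301 that would send the crawler back to a deleted page.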
|The main question is: why the massive renewal of interest now? |
When enough people start seeing old 404s, it's usually a sign of an index refresh, possibly an update. I'm beginning to see reports on the forum suggesting that some sites, at least, are seeing refreshes and changes. Possibly... and this is conjecture... these sites are on a segment of the index that's being tested... or it could be the entire index, perhaps an index rolling out in chunks. Perhaps it's a Penguin update coming.
See my last post (of May 29, 2013) on this thread, as well as several posts along the way, about Google and 404s....
17 May 2013 - GWT Sudden Surge in Crawl Errors for Pages Removed 2 Years Ago?
|I've observed that in addition to periodically rechecking the lists of 404s it keeps, Google also often recrawls these lists when there's a refresh of the index, as might occur at a large update of the type we just had. |
This observation from a 2006 interview with the Google Sitemaps Team is helpful... [smart-it-consulting.com...]
My emphasis added...
|When Googlebot receives either (a 404 or 410) response when trying to crawl a page, that page doesn't get included in the refresh of the index. So, over time, as the Googlebot recrawls your site, pages that no longer exist should fall out of our index naturally. |
My sense of the above is that by recrawling the old lists at updates or refreshes, Google is able to generate "clean" reference points of sorts, with currently 404ed urls removed from the visible index. The above interview was in 2006, though, and the index has gotten much more complex, so it's hard to say whether the 404ed pages are removed from the index in one pass, or after many....
It's also getting on my nerves: they are spidering pages that don't even exist and never did, and of course old pages that have not been online for years.
|And your [G] directives will come before any redirects, so there should be no opportunity to redirect anything. |
That's the trick, I used to have [G] directives at the end.
|It's also getting on my nerves: they are spidering pages that don't even exist and never did, and of course old pages that have not been online for years. |
Google may be checking whether you're generating random pages or content based on keywords in the URL.