
Google SEO News and Discussion Forum

    
Googlebot massively tries URLs not existing for 3+ years
1script




msg:4635319
 6:03 pm on Jan 5, 2014 (gmt 0)

2014 brought an interesting new pattern in Googlebot's visits to one of my sites.

Some back-story first. I shot myself in the foot, so to speak, twice over the last 3 years: in trying to create a more accessible and manageable URL structure, I changed the URLs of my content twice. Once to eliminate capital letters from URLs, and another time to change the pagination so that each page of content is longer (which led to fewer pages). Not every content page changed its URL, but close to 90% did.

Googlebot has come around many times since then, and Google knows about the most recent URLs because I see them ranking on their own. The current URLs have not changed in about 1.5 years.

Since January 1st, I see almost every Googlebot visit starting with the oldest URLs first, which in most cases means three Googlebot requests to read one content page, diluting the crawl budget.
So, it goes like this:
  1. request for the old page with the capitalized URL and old pagination -> 301
  2. request for the more recent, but still old, page with the lowercase URL and old pagination -> 301
  3. request for the current page -> 200


With quite a bit of work I can eliminate the second request (both 301s are created by my server), but I'd like to understand why Googlebot keeps coming back for pages that have not existed for 3+ years, have been answered with 301 (i.e. permanent) redirects for 3+ years, and have clearly been replaced by the new pages in Google's index (the new ones have been ranking for a couple of years). In other words, short of the Google index being wiped clean and Googlebot following old links again, I can't find a reasonable explanation. Even then, not all content pages have or had external links, and there are no internal links to the old URLs for Googlebot to follow.
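For illustration, a minimal .htaccess sketch of the single-hop mapping (assuming Apache mod_rewrite; the paths are hypothetical placeholders, not my actual URLs). Each legacy form of a page answers with one 301 straight to the current URL, instead of chaining capitalized -> lowercase -> new pagination:

RewriteEngine On
# hypothetical paths: capitalized URL with old pagination -> current page, one hop
RewriteRule ^Widgets/Blue-Widgets/page3\.html$ /widgets/blue-widgets/ [R=301,L]
# lowercase URL with old pagination -> current page, one hop
RewriteRule ^widgets/blue-widgets/page3\.html$ /widgets/blue-widgets/ [R=301,L]

With hundreds of pages these rules would be generated from a mapping of old paths to new ones, but the principle is the same: every legacy form should point at the final URL directly.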

So, does Google ever have index "spring cleaning" events, when everything gets wiped clean and has to be recrawled and reindexed afresh? What might the implications be?

 

lucy24




msg:4635342
 9:29 pm on Jan 5, 2014 (gmt 0)

Search engines never forget. The bigger your site is, the longer their memories are. Disclaimer: I made up the second sentence.

Your post seems to imply that there was a two-step redirect:

from name1 to name2
and
from name2 to name3

rather than a one-step

from name1 to name3
and
from name2 to name3

If that's the case you should absolutely consolidate your redirects. In fact you should turn back the clock and consolidate them from Day 1. Uhm. Not practical, I guess.

creeking




msg:4635374
 11:24 pm on Jan 5, 2014 (gmt 0)

any possibility you could make some new pages with those old URLs?

might be a way to take advantage of what is happening anyway.

1script




msg:4635394
 1:06 am on Jan 6, 2014 (gmt 0)

Thanks for your input, guys. Yes, I do realize they don't forget, but I thought that applied to 404 and 302 codes, the ones that imply the document may be missing or only temporarily relocated. It seems like a waste of resources to keep coming back for documents that were permanently relocated. It also struck me as curious that they came in droves for the 3-year-old URLs, not the 1.5-year-old ones, or some combination thereof. It looks as if some old database of URLs got restored from a backup made 3 years ago. Does anyone else see an increase of interest specifically in old URLs on their sites (whether still-correct URLs or ones since redirected/removed)?

netmeg




msg:4635522
 2:30 pm on Jan 6, 2014 (gmt 0)

I have them coming back for URLs older than that. All the time. Only thing I can think of is maybe old scrapers that suddenly show back up with a lot of old links to my older URL structures. Since I can't do anything about it, I've given up worrying about it; it is what it is.

SEOWeasel




msg:4635768
 3:59 pm on Jan 7, 2014 (gmt 0)

Yeah, Google never forgets. I've seen it trying URLs that had not existed for 8 years!
When Dynamic Search Ads came out, our company was a beta client. When I pulled a destination URL report, I was alarmed to see Google had been trying to land people on these ancient URLs. The problem was that all the redirect rules for those URLs had been deleted because they were so old.

The only solution I know of is to return 410 for these URLs, or to remove them via Google Webmaster Tools; but then you have to block them in robots.txt or configure your server to return 404s when Googlebot checks on your removal request.
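If you go the 410 route, a minimal sketch (assuming Apache mod_rewrite and a hypothetical retired section) looks like this:

RewriteEngine On
# everything under the retired section answers 410 Gone
RewriteRule ^old-section/ - [G]

The [G] flag makes Apache answer 410 Gone for anything matching the pattern; mod_alias's "Redirect gone" can do the same for individual paths.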

engine




msg:4635806
 6:50 pm on Jan 7, 2014 (gmt 0)

Welcome to WebmasterWorld SEOWeasel

@1script
There's no doubt that a complete crawl will go on every so often, with the idea being to eliminate pages that don't exist. It's also possible that the URLs are linked from somewhere.
For example: are the URLs in the Wayback Machine?

1script




msg:4635854
 12:29 am on Jan 8, 2014 (gmt 0)

Thanks again for your input, guys. I guess returning 410 might eventually have stopped these crawls (or not), but I do need to have redirects instead: some of the old URLs were linked to, many of those old links are pretty good and still deliver referrals, and, I shall assume, some of that mysterious substance, link juice :)

As far as there being links to the oldest pages - I am certain there are a lot still out there. I am also not surprised that they come around from time to time to check on those. The main question is: why the massive renewal of interest now? I have seen Googlebot try old URLs alongside the new before, but this time around accesses for the new URLs are drowned in the sea of hits on the old.

lucy24




msg:4635879
 7:53 am on Jan 8, 2014 (gmt 0)

Have other people noticed an upsurge in the same behavior? Seems like, if it's something a search engine does at intervals of a year or more, the mega-crawls would have to be randomly distributed throughout the year. And, just to make you uneasy, some of those random mega-crawls would randomly and coincidentally come right after the search engine has instituted some new, highly publicized algorithm change. Or right after you've made substantive site changes ;)

levo




msg:4635976
 5:50 pm on Jan 8, 2014 (gmt 0)

Check whether Google crawls a URL that redirects to the removed page.

For example, if you return 410 for /removedpage but 301 for /removedpage?some=querystring (redirects meant to fix canonical issues, including www/non-www), Google keeps checking the redirected URLs and keeps requesting the deleted pages.
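A sketch of how that situation can arise (hypothetical rules and path, assuming Apache mod_rewrite): the canonical cleanup redirect sits above the Gone rule, so the query-string variant is 301ed before the 410 is ever reached.

RewriteEngine On
# canonical cleanup first: strip any query string
RewriteCond %{QUERY_STRING} .
RewriteRule ^removedpage$ /removedpage? [R=301,L]
# the Gone rule comes too late: /removedpage?some=querystring never reaches it
RewriteRule ^removedpage$ - [G]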

my_name




msg:4635991
 6:44 pm on Jan 8, 2014 (gmt 0)

I had old URLs (3 or more years old) in my crawl errors, but I never marked them as fixed or even worried about them. Only since the "Hummingbird" update did I start taking notice and working through the errors. I was able to work through all the old URLs and 301 redirect them to the new URLs with the same or updated content; they haven't come back since and are not indexed.

lucy24




msg:4636015
 7:57 pm on Jan 8, 2014 (gmt 0)

if you return 410 for /removedpage but 301 for /removedpage?some=querystring (redirects to fix canonical issues including www/non-www)

This kind of thing can and should be fixed, though. If the simple path

/removedpage

returns a 410, then the same 410 (without a 301) will be returned for the same path with any appended queries. And your [G] directives should come before any redirects, so there is no opportunity to redirect anything.
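A minimal sketch of that ordering (hypothetical path and hostname, assuming Apache mod_rewrite in .htaccess):

RewriteEngine On
# Gone rules first: the pattern matches the path only, so /removedpage,
# /removedpage?some=querystring, etc. all get the 410 here
RewriteRule ^removedpage$ - [G]
# canonical redirects (www/non-www and the like) can follow; they never
# see the retired path because [G] has already answered it
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]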

Robert Charlton




msg:4636038
 10:01 pm on Jan 8, 2014 (gmt 0)

The main question is: why the massive renewal of interest now?

When enough people start seeing old 404s, it's usually a sign of an index refresh, possibly an update. I'm beginning to see reports on the forum suggesting that some sites, at least, are seeing refreshes and changes. Possibly... and this is conjecture... these sites are on a segment of the index that's being tested... or it could be the entire index, perhaps an index rolling out in chunks. Perhaps it's a Penguin update coming.

See my last post (of May 29, 2013) on this thread, as well as several posts along the way, about Google and 404s....

17 May 2013 - GWT Sudden Surge in Crawl Errors for Pages Removed 2 Years Ago?
http://www.webmasterworld.com/google/4575982.htm [webmasterworld.com]

I've observed that in addition to periodically rechecking the lists of 404s it keeps, Google also often recrawls these lists when there's a refresh of the index, as might occur at a large update of the type we just had.

This observation from a 2006 interview with the Google Sitemaps Team is helpful... [smart-it-consulting.com...]

My emphasis added...
When Googlebot receives either (a 404 or 410) response when trying to crawl a page, that page doesn't get included in the refresh of the index. So, over time, as the Googlebot recrawls your site, pages that no longer exist should fall out of our index naturally.

My sense of the above is that by recrawling the old lists at updates or refreshes, Google is able to generate "clean" reference points of sorts, with currently 404ed URLs removed from the visible index. The above interview was in 2006, though, and the index has gotten much more complex, so it's hard to say whether the 404ed pages are removed from the index in one pass, or after many....

zeus




msg:4636046
 10:39 pm on Jan 8, 2014 (gmt 0)

It's also getting on my nerves: they are spidering pages that don't even exist and never have, and of course old pages that haven't been online for years.

levo




msg:4636106
 6:35 am on Jan 9, 2014 (gmt 0)

And your [G] directives should come before any redirects, so there is no opportunity to redirect anything.


That's the trick: I used to have the [G] directives at the end.

It's also getting on my nerves: they are spidering pages that don't even exist and never have, and of course old pages that haven't been online for years.


Google is checking whether you're generating random pages/content based on keywords in the URL.
