Forum Moderators: Robert Charlton & goodroi


Massive jumps in GSC legacy crawl errors - who sees this?


Simon_H

4:14 pm on Sep 7, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



We've seen some massive jumps in legacy crawl errors in Search Console, i.e. pages that haven't existed on the site for months or years, where the 'linked from' list likewise contains only pages, both on-site and off-site, that have been gone just as long. Crawl errors were sitting at around 1,000, but on 3rd August they jumped to 10,000. On 1st September, they jumped to 50,000.

I've never seen Googlebot go this crazy before. It's as if it's performing an exhaustive and historical update of its link graphs.

I know others have seen this too; @Jhurwith, @BushieTop and others have reported it on multiple forums. Have you seen this on your site? It would be good to work out what we have in common, or whether this is just random. Our site is UK ecommerce, under both Penguin and Panda, but it's been clean for years. It would be interesting to see the profile of other sites with these crawl errors.

NickMNS

12:19 pm on Sep 20, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Robert what you are saying is mostly true and it makes sense to deal with links that no longer point to anything.

As far as the legacy crawl errors go, this has no impact. I am getting errors for pages that were deleted 2 to 3 years ago. All the internal links to those pages were deleted and there were never any external links pointing to them, yet they still appear in the error report. When I click the 'linked from' tab, all the referenced pages are either also gone or have had the links removed.

Wilburforce

12:35 pm on Sep 20, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



it makes sense to deal with links that no longer point to anything


There shouldn't be any. If you don't already, you should run a link-checker on your site for internal link errors every time you change anything, as well as periodically (meaning frequently) for external link errors.
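
For anyone who doesn't already have a checker in their toolchain, a minimal internal link checker is only a few lines. This is a sketch, not a drop-in tool: it assumes Python with the requests package, and https://www.example.com/ is a placeholder start URL.

    # Minimal internal link checker: breadth-first crawl of same-host
    # pages, reporting any internal link that returns a 4xx/5xx status.
    # Sketch only -- the start URL is a placeholder.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urldefrag, urljoin, urlparse

    import requests  # pip install requests

    START = "https://www.example.com/"
    HOST = urlparse(START).netloc

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    seen, queue = {START}, deque([START])
    while queue:
        page = queue.popleft()
        try:
            resp = requests.get(page, timeout=10)
        except requests.RequestException as exc:
            print(f"FETCH ERROR {page}: {exc}")
            continue
        if resp.status_code >= 400:
            print(resp.status_code, page)  # broken internal link target
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        extractor = LinkExtractor()
        extractor.feed(resp.text)
        for href in extractor.links:
            url = urldefrag(urljoin(page, href)).url  # resolve, drop #fragment
            if urlparse(url).netloc == HOST and url not in seen:
                seen.add(url)
                queue.append(url)

This only prints the broken target; in practice you'd also record which page linked to it, and throttle the requests.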

Wpow

6:11 pm on Sep 20, 2016 (gmt 0)

5+ Year Member



Could it be that Google all of a sudden started crawling archive.is? All these pages are 404s, and the 'Linked From' pages are mostly 404s too, except for the one archive.is 'Linked From' I found.

Here is some discussion about this site from a couple years ago:
[webmasterworld.com...]

martinibuster

4:35 am on Sep 21, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm wondering if there are links to the legacy pages causing these errors. There's a recent post discussing an increase in backlink data, which seems to indicate a larger number of links getting crawled, or else simply being revealed.

One of my sites experienced crawl errors due to obscure automated/scraper sites linking to non-existent pages on my site. I thought, wow, that's a pretty deep crawl and deep reporting in GSC to be showing me that garbage.

Google Is Giving Me My Links Back (Doubled in a month)
[webmasterworld.com...]

Wpow

6:56 am on Sep 21, 2016 (gmt 0)

5+ Year Member



Wow, martinibuster, you are still in the biz. I remember you from the pre-redemption period days. So you think it's more likely that the crawl errors are not associated with how they are testing or rolling out the update, but rather a result of normal indexing of sites that happen to hold old link data, like archive.is or sites scraping an archive.is-type site?

Simon_H

10:25 am on Sep 21, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



I'm going to throw a theory out there. Might be nonsense, but interested to see how it plays out...

I wonder if Google has been doing a major application framework or system upgrade over the past few months or year, which is now being released pre-Penguin. I'm not talking about the algo, but the underlying framework. For example, if Penguin is going realtime, then Google may need to store and process significantly more backlink data than it was capable of doing 2 years ago. This would mean Penguin 4 isn't just a code update; it's a system overhaul, including hardware, application framework and codebase = huge work. And there may be other reasons for such upgrades too.

It would explain the Penguin delays. If Google has been switching across to this new framework over the past month or so, it may also explain why there have been so many bug reports recently, e.g. GSC indexer not working, GSC search analytics not working, GA real time not working, Google+ 500 errors, all of which seem like integration issues.

It may also explain the sudden jump in GSC legacy crawl errors and backlink profile changes; the new system is starting its deeper crawl.

John Mueller's mentioned they've prepared an official Penguin communication, which seemed a slightly strange thing to say at the time. I wonder if it will be more than just 'Penguin is now live' and will include some of the background above, pointing out that what has been happening is actually far more than just a code update.

martinibuster

10:52 am on Sep 21, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



So you think that is more likely the crawl errors are not associated with how they are testing or rolling out the update...


I believe it is indeed associated with the update. I posted about this on Facebook last night, that Tedster pointed out years ago how increased crawling can be associated with updates. Simon's done a good job of summarizing the clues of what's going on.

Robert Charlton

11:58 am on Sep 21, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Simon, not nonsense at all. Just some quick thoughts, but I agree that there are many undercurrents suggesting that changes in the search infrastructure are afoot.

I think I've been seeing machine learning-based search results in the serps themselves that appear to be relearning some old things, even as new query rewriting is also being added, and while similar to old results, some of these seem to be temporarily a step back. Less smooth than the segue to Hummingbird, which was essentially invisible. Maybe Google will come to ask us, as they did with Hummingbird, if we noticed a difference between the earlier and eventual results.

On searches involving multiple entities, Google seems to be struggling to decide which entity is the core query and which are the modifiers. I think I'm seeing the effects of RankBrain: in many cases the revised sense of the query is way beyond synonyms and really very good, but at other times the results are oddly primitive, probably because there's not yet that much data, and Google needs to grab at whatever obvious clues it has... vocabulary matching and the like.

If what I think I'm seeing is real, it's not just an algo change... Google seems to be building the serps differently. And yes, this would be much more than just a code update.

martinibuster

2:48 pm on Sep 21, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Illyes very recently declared the idea of crawl spikes as a precursor to an algo update a myth. Read the article here. [seroundtable.com]

Mueller and Illyes speak within specific contexts. Thus, when people ask "Muellyes" if something is part of the ranking algorithm, they'll honestly answer no, without elaborating that the item may actually be part of the modification engine or some other context. The context of Illyes' statement is increased crawling as a precursor to an algorithm update. What Simon_H is discussing is something else entirely. In my opinion, Illyes' recent statement does not conflict with Simon's observations or negate them.

Logical assumptions
I'm inclined to agree with Simon's points tying the anomalies together within the context of a framework upgrade (software/hardware), and to note again that those points sit outside the context of Illyes' statement and so don't conflict with it. The hiccups aren't necessarily a symptom of a simple update, but may be a symptom of upgrading the entire framework.

Opinion
In my opinion, this change logically has to be radical, something new, because (fact) Penguin itself is going realtime as a component of either the ranking engine or the modification engine part of the algorithm. In my opinion, Penguin is an immense link graph-related application, one that used to take months to calculate. The scale of taking that realtime, of integrating it as part of the overall algorithm, is in my opinion a remarkable achievement, and I have no doubt that (opinion), like anything else at that scale of change, there will be symptoms such as have been experienced and discussed this month on WebmasterWorld.



Roger

Barbados

12:58 pm on Sep 23, 2016 (gmt 0)

10+ Year Member



On a slightly tangential but related point, we've just had a customer come to us with a Google partial match penalty. Going through WMT, it soon became clear that this was because Google had recrawled links built in 2009/10 and was showing them in the 'latest links' file.

We've had to rework his disavow file (many of the links were previously disavowed in 2013) to disavow the root domain rather than just the individual link, which is what the previous firm had done. What surprised me, though, was that links from old abandoned .php directories were showing up in a new crawl.
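
(For anyone doing the same rework, converting URL entries to domain entries is mechanical. A rough sketch, assuming Python and the standard disavow text format of one URL or "domain:" entry per line; the filenames are placeholders:)

    # Rewrite a disavow file so URL-level entries become domain-level
    # ("domain:example.com") entries, de-duplicated, with comments kept.
    # Sketch only -- old_disavow.txt / new_disavow.txt are placeholders.
    # Assumes URL entries include a scheme, per the disavow file format.
    from urllib.parse import urlparse

    seen = set()
    with open("old_disavow.txt") as src, open("new_disavow.txt", "w") as dst:
        for raw in src:
            line = raw.strip()
            if not line or line.startswith("#"):
                dst.write(raw)  # keep comments and blank lines as-is
                continue
            host = line[7:] if line.startswith("domain:") else urlparse(line).netloc
            if host and host not in seen:
                seen.add(host)
                dst.write(f"domain:{host}\n")

Whether to disavow at the root is still a judgment call per domain; the script only does the mechanical conversion.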

Something smells fishy at the moment but I'm not sure what it is....

TechNoob

2:54 pm on Sep 23, 2016 (gmt 0)

5+ Year Member



Saw a strange increase in crawl errors on old pages here as well; it happened at the beginning of September.

Used the link removal tool as a precaution and marked the errors as fixed. Very few soft 404s remain. Fairly clean since.

NickMNS

3:02 pm on Sep 23, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Simon_H I'm not sure I fully agree with your theory, but given the confirmation of some sort of Penguin event, it would seem that there is some sort of correlation between an increase in crawling of legacy pages and an algo refresh.

@TechNoob, the 'mark as fixed' button does nothing except clear the reported pages from GSC.

Robert Charlton

3:47 pm on Sep 23, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



To put a bookmark into this discussion and note the announcement made today...

Penguin: Core, realtime and updated today
Sep 23, 2016
https://www.webmasterworld.com/google/4819521.htm [webmasterworld.com]

I'm curious whether the reported 404s will diminish, having been there for "calibration" and for Google's own bookmarking purposes, or whether they will stay up at the high end.

TechNoob

3:56 pm on Sep 23, 2016 (gmt 0)

5+ Year Member



@NickMNS Right, of course, but I have been waiting to see if any repeat offenders pop up.

Simon_H

7:32 pm on Sep 24, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



Freaky that in @Robert_Charlton's example link, it was at the beginning of May 2013 that a massive jump in crawl errors was reported, and Penguin 2 hit on the 22nd of that month. The pattern is virtually identical for Sep 2016 and P4. No doubt P2 and P4 work in completely different ways, but Google is going to be hard-pushed to keep denying any relationship between legacy crawl error increases and updates involving links.

I wonder if the reason they keep emphatically denying this is that, now that P4 is out, they won't be announcing updates any more. So the last thing they want is for webmasters to have any clue when updates will be happening. If they gave any hint that a jump in crawl errors = an update, that would spoil their plans.

Wilburforce

12:54 am on Sep 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Google is going to be hard-pushed to keep denying any relationship between legacy crawl error increases and updates involving links


It is quite a leap from the limited evidence available to each of us to the conclusion that Google is lying. Google might be lying, whatever the evidence, but

1. Correlation does not prove causality;
2. Not all webmasters have seen a spike in crawl errors (I haven't);
3. Webmasters who have not seen a spike wouldn't have anything to report;
4. Those who report a spike are a biased sample;
5. Those who report a spike (there are only 13 posters in this thread so far) are not a statistically significant sample.

I did see a spike in the number of pages crawled on 1 September, and in bytes downloaded on 2 September, and my server log shows higher-than-normal activity on 2 September.

However, I have seen similar spikes that were obviously nothing to do with Penguin (and spikes in crawl errors at other times), so I wouldn't personally infer a relationship.
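
(For anyone wanting to put numbers on that sort of spike, a rough per-day tally of Googlebot requests from a common/combined-format access log is easy to produce. A sketch; access.log is a placeholder path, and this matches on user-agent only, which can be spoofed:)

    # Per-day tally of requests whose user-agent claims to be Googlebot,
    # read from a common/combined-format access log. Sketch only.
    import re
    from collections import Counter
    from datetime import datetime

    # e.g. ... [02/Sep/2016:04:14:00 +0000] "GET /page HTTP/1.1" ...
    DATE_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}):")

    hits = Counter()
    with open("access.log") as log:
        for line in log:
            if "Googlebot" in line:
                m = DATE_RE.search(line)
                if m:
                    hits[datetime.strptime(m.group(1), "%d/%b/%Y").date()] += 1

    for day in sorted(hits):
        print(day, hits[day])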

One question to consider is whether Google has any spare or redundant crawl capacity, which would be a necessary precursor to any general increase in crawl. Otherwise, any increase in crawl in one place would have to mean a decrease elsewhere, and it could not be generally true that Google crawls more before an update.

A second question is what gets crawled when. I think we can accept that Google periodically revisits historical pages and links, so we should ask whether crawl resources are widely diverted from active to historical content prior to (e.g.) a Penguin update. My expectation - whatever Google might say - is that they would not.

My belief is that a (probably fairly constant) portion of crawl trawls historical data on a rolling cycle, and that for some sites this will coincide with a widely-publicised algorithm change.

Simon_H

11:12 am on Sep 25, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



@Wilburforce Many of those points are already covered in this thread. Yes, Googlebot activity spikes at times for one reason or another, and it's quite normal to see legacy crawl errors from years back. But what people were seeing here is off the scale: multiple independent reports on WebmasterWorld, SER and Moz of *huge* jumps in legacy crawl errors, not just people associating the usual fluctuations with Penguin. Also, Penguin has been pending release at any moment for the last year, but nothing like this was reported until Aug/Sep 2016.

Interesting point about where Google gets the capacity to do these huge recrawls, but I think you may have answered your own question! Only a small number of sites have seen this, so even if there were a shared Googlebot resource pool, grabbing more resource for 0.5% of sites would not necessarily have any noticeable impact on the other 99.5%.

It's not as black and white as either 'Google is lying!' or 'Google is telling the truth!'. The increased activity could, for example, be due to some system upgrade, which in turn is a precursor to an algo update. Or, as Robert suggests, it could be a sign of testing, where only once the final test passes is there a green light to deploy, i.e. a loose relationship between the jump in crawl errors and the algo update. So Google isn't lying per se; it's more that they're being a little cheeky with how they answer the question.

I disagree with your "statistically significant sample" approach. Almost everything in SEO is purely anecdotal, and nothing can be proved. In the past, updates hit everyone at the same time, but not any more. Traffic changes tell you very little because they're affected by things like SERP layout changes. Rank trackers tell you very little because the SERPs can no longer be simplified into a single set of ranks seen by all users. John M said earlier this week that updates might only affect small numbers of sites, and in the past he has said testing happens on ~1% of sites. There are a billion sites on the web, so what would be a "statistically significant sample"? Maybe a million sites? Or just a thousand? We're at a point where you will never get a "statistically significant sample" of anything. If you won't accept that anything happens until you get that level of proof, you're not going to find any satisfaction on these forums!

EditorialGuy

4:14 pm on Sep 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For what it's worth, I see hardly any crawl errors for our main site: a couple for desktop, a handful for smartphone, and about 99 for feature phone, the latter being old pages that were deleted months ago.

The site has been at its current domain for about 15 years.

Wilburforce

4:33 pm on Sep 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you won't accept that anything happens until you get that level of proof, you're not going to find any satisfaction on these forums!


It isn't a case of whether anything happens, or whether I find it satisfying. It is a case of whether A causes B (or, if we abandon causality, at least has a higher than chance association with B).

Here, we are considering whether A (an impending Penguin update) causes B (an increase in some Google activity that results in historical crawl errors). Google says it doesn't. All I am asking is what grounds we have, without robust analysis, to argue the point.

westcoast

5:37 pm on Sep 25, 2016 (gmt 0)

5+ Year Member Top Contributors Of The Month



As posted in another thread, our 20-year-old site has now got 104,000 outstanding 404s.

The number jumped by 90,000 a few weeks ago.

This massive number of unactionable / ancient legacy links from 5 to 15 years ago is making this part of webmaster tools useless for us.
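
(One way to confirm that those reported URLs really are dead ends is to export them from Search Console and bulk-check their live status. A sketch, assuming Python with requests and a one-URL-per-line export saved as crawl_errors.txt; the filename is a placeholder:)

    # Bulk-check the live HTTP status of URLs exported from the GSC
    # crawl errors report. Sketch only -- crawl_errors.txt (one URL
    # per line) is a placeholder for however you export the list.
    import requests  # pip install requests

    with open("crawl_errors.txt") as f:
        urls = [u.strip() for u in f if u.strip()]

    for url in urls:
        try:
            # HEAD keeps it cheap; some servers only answer GET properly.
            status = requests.head(url, timeout=10, allow_redirects=True).status_code
        except requests.RequestException:
            status = "ERR"
        print(status, url)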

Wilburforce

6:50 pm on Sep 25, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



our 20-year-old site has now got 104,000 outstanding 404s


What do you think accounts for this?

Is there anything that sites reporting very large numbers of such errors have in common?

Simon_H

8:22 pm on Sep 25, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



@Wilburforce I could be wrong, but I don't think Google answered that question. They answered whether B (a jump in crawl errors) means that A (an algo update) is imminent, and said no. It's your point about causation/correlation. They've insinuated there's zero relationship between the two, but I don't believe they have specifically denied that certain algo updates are preceded by certain changes in Googlebot activity.

In terms of what grounds we have to assume any kind of link between the two: all we know is that over the past few weeks, many sites have reported a spike in legacy crawl errors of a scale that doesn't appear to have been seen in many years, and then, shortly afterwards, Google announced P4. Which is very similar to what happened in May 2013. That's all we know.

Robert Charlton

5:16 am on Sep 26, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



westcoast, welcome to WebmasterWorld.

Regarding your comment...
...The number jumped by 90,000 a few weeks ago.

This massive number of unactionable / ancient legacy links from 5 to 15 years ago is making this part of webmaster tools useless for us.
...you are not the only one who's noticed that this amount of data is too much for the average user.

This is one of the seven main points covered in a list of articles John Mueller has posted on his Google+ page. I mentioned these earlier in the thread, but perhaps didn't draw enough attention to them. Here's a link to John's list, followed by the fifth listing, which is about the point you raise. Even your numbers and Google's example are pretty close.
(The links to webmastercentral.blogspot now redirect to webmasters.googleblog, and where I post these links I've replaced them with the new destination urls.)

John Mueller - 404 Crawl Error Reading List
John Mueller > Public
https://plus.google.com/+JohnMueller/posts/RMjFPCSs5fm [plus.google.com]

5) We list crawl errors in Webmaster Tools by priority, which is based on several factors. If the first page of crawl errors is clearly irrelevant, you probably won't find important crawl errors on further pages.
https://webmasters.googleblog.com/2012/03/crawl-errors-next-generation.html

Crawl Errors: The Next Generation
March 12, 2012
[webmasters.googleblog.com...]
Less is more
We used to show you at most 100,000 errors of each type. Trying to consume all this information was like drinking from a firehose, and you had no way of knowing which of those errors were important (your homepage is down) or less important (someone's personal site made a typo in a link to your site). There was no realistic way to view all 100,000 errors - no way to sort, search, or mark your progress. In the new version of this feature, we've focused on trying to give you only the most important errors up front. For each category, we'll give you what we think are the 1000 most important and actionable errors. You can sort and filter these top 1000 errors, let us know when you think you've fixed them, and view details about them.

That should make dealing with the 90,000 additional links easier, unless you find lots of relevant errors in the top pages.

Wilburforce

5:47 pm on Sep 26, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmmm...

I certainly don't have a spike in errors, and I'm pretty sure it has nothing to do with Penguin, but I have a few new errors that are:

1. Pages that have never existed, from
2. Internal links that have never existed.

A couple of them are links to mysite.com/mobile/examplepage.htm, but no directory called /mobile/ has ever existed, and no link has ever been created to a page in it, while one is reported as a link from a PDF that does not have, and never has had, a link in it.

Make of that what you will, but it looks to me like Google might have suffered some kind of data glitch recently, which could account for a lot of errors on a big site.
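
(The PDF case is checkable, at least: link annotations in a PDF can be listed directly. A sketch using the pypdf package, which is my assumption about tooling, with document.pdf as a placeholder filename:)

    # List URI link annotations in a PDF, to verify whether it actually
    # contains any hyperlinks. Sketch only -- pypdf is one of several
    # libraries that can do this; document.pdf is a placeholder.
    from pypdf import PdfReader  # pip install pypdf

    reader = PdfReader("document.pdf")
    for page_num, page in enumerate(reader.pages, start=1):
        for annot in page.annotations or []:
            obj = annot.get_object()
            if obj.get("/Subtype") == "/Link":
                action = obj.get("/A")
                if action is not None:
                    uri = action.get_object().get("/URI")
                    if uri:
                        print(f"page {page_num}: {uri}")

Note this only finds actual link annotations; Google may also treat bare URLs in the PDF's text as links, so an empty result doesn't entirely rule out discovery from the file.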

EditorialGuy

5:56 pm on Sep 26, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Make of that what you will, but it looks to me like Google might have suffered some kind of data glitch recently, which could account for a lot of errors on a big site.

One that surprised me was the reporting of nearly 100 "feature phone" crawl errors (compared to virtually no errors for desktop and smartphone crawls) in this day and age.

Nutterum

11:45 am on Sep 27, 2016 (gmt 0)

10+ Year Member Top Contributors Of The Month



Seen a massive spike in crawl rate. Yesterday had the biggest number of pages crawled since 2009. Also massive numbers of legacy soft-404 pages that are no longer (or should not be...) visible to Google. Yep, this is the Penguin crawler snuffing the site out. This I can tell for sure.