Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 39 message thread spans 2 pages.
17 May 2013 - GWT Sudden Surge in Crawl Errors for Pages Removed 2 Years Ago?
Frost_Angel




msg:4575984
 4:09 am on May 21, 2013 (gmt 0)

Why would I have a sudden surge (over 6,000) in crawl errors for pages I removed 2 years ago that have been correctly 404'd?

Started on the 17th and grows daily. I use wordpress and I know these pages 404.

Does this have something to do with the latest update rumblings? If so - again - why? Where are they coming from?

Any advice is appreciated.

Thanks in advance.

 

Robert Charlton




msg:4575991
 5:42 am on May 21, 2013 (gmt 0)

Does this have something to do with the latest update rumblings?

Hmmm... thinking out loud, and guessing....

This most likely means that Google is running an old dataset, possibly among other datasets, perhaps for purposes of comparison. This seems to happen at times of big change.

It's hard to know how broadly sites are being affected, so I hesitate to whip out a crystal ball, but "2 years ago" sounds bigger than, say, 'six months ago' might.

How have you been doing in updates over the past several years? Mayday? Panda? Penguin? This might suggest how big a swath could be noticeably affected.

Frost_Angel




msg:4576203
 2:08 pm on May 21, 2013 (gmt 0)

I was hit by Panda in April 2011. I lost 2/3 of my traffic. I was also in the process of moving my 10 year old HTML site to wordpress at the exact same time - so it's hard to figure out exactly what happened. Timing was just horrible.

But after so much improvement and really revamping the entire site - I was hit again by Panda in late September 2012. Lost even more traffic. Now the site is on life support.

But my crawl errors are typically maybe 10-25 a week -- for them to jump to 6,780 overnight, and for ALL of them to come from a script that was removed 2 years ago -- is downright weird. There is no trace of this script, the pages were dynamic and everything was properly 404'd years ago.

I'm just in a state of shock that these pages are showing up again and tons of them. That has to mean that Google cached them somewhere.

tedster




msg:4576219
 3:25 pm on May 21, 2013 (gmt 0)

Specifically, it means that Google has a list of every URL they ever crawled and occasionally they look again, even years later. They do that kind of "historical crawling" on various cycles and I do see them doing that in recent days.

They also may have a cache of the content they retrieved in past crawls (they do store a whole lot of data in their Caffeine infrastructure) but all they really need to store for the behavior reported here is that list.

lucy24




msg:4576291
 8:07 pm on May 21, 2013 (gmt 0)

It's not just you. Over the last couple of weeks the googlebot has revisited-- that is, requested-- every page I have ever 410'd. Explicit 410, not just 404. That means: pages that for months and months were requested by absolutely nobody except the bingbot (which doesn't seem to grasp the 410 concept*).


* Possibly because they know that if they keep bugging me, I will eventually knuckle under and feed them a few more handcrafted 301s :(
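One way to confirm this kind of revisit from your own side is to count crawler requests for removed pages in the server access log. The following is only a minimal sketch: it assumes a combined-format access log, and the sample lines, paths, and IP addresses are invented for illustration. The function name `crawler_hits` is hypothetical.

```python
import re
from collections import Counter

# Pulls the request path, status code, and user-agent out of a
# combined-format access log line (an assumed format; adjust as needed).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_hits(lines, agent_substring="Googlebot", statuses=("404", "410")):
    """Count requests per (status, path) made by one crawler, for the given statuses."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        if agent_substring.lower() not in match["agent"].lower():
            continue  # a different visitor
        if match["status"] in statuses:
            hits[(match["status"], match["path"])] += 1
    return hits


# Hypothetical sample lines, invented for illustration only.
SAMPLE = [
    '66.249.66.1 - - [20/May/2013:04:09:00 +0000] "GET /jobs/search.php?q=nurse HTTP/1.1" 404 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [20/May/2013:04:10:00 +0000] "GET /old-page.html HTTP/1.1" 410 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '157.55.39.1 - - [20/May/2013:04:11:00 +0000] "GET /old-page.html HTTP/1.1" 410 0 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '66.249.66.1 - - [20/May/2013:04:12:00 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    for (status, path), count in sorted(crawler_hits(SAMPLE).items()):
        print(status, path, count)
```

Matching on the user-agent string alone is a simplification, since anyone can send a Googlebot user-agent; it is good enough for a quick look at which removed URLs are being requested again.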

Robert Charlton




msg:4576307
 8:28 pm on May 21, 2013 (gmt 0)

Thanks, lucy. I'd been seeing mention of enough of this to assume it wasn't a single site.

Is it just 404s/410s?... or are there other time glitches that suggest that Google might be looking at older aspects of your sites?

Frost_Angel




msg:4576349
 10:24 pm on May 21, 2013 (gmt 0)

Do any of you think this could be a good thing? Because the crawl errors I am seeing are from a script I had running on my site that allowed people to do job searches. I had no idea the pages were being cached by Google and put into the serps because the script was supposed to not allow this. (So said the developer! Anything for a sale.... ugh...)

Any way... it ends up - 80,000+ pages of aggregated (THIN) content from other job sites are created and thrown into the serps under my domain name - basically my domain looks like a huge web spammer - when really I was just naive - didn't understand, not technically brilliant...., whatever you want to call it.

I immediately --- IMMEDIATELY - when I figured out what was happening, removed the script, 404'd everything coming from that database folder and over a period of 6 months.... they quietly went away -- maybe one or two showed up here and there....
But now... I have 6,730 of them!

It's really freaking me out.

What I am wondering....
(Hoping....)

Is that *G is checking to see if I removed all these thin pages - or double checking.... because maybe this is what hurt me Panda-wise in the first place.
If *G sees I really have removed the junk -maybe they will stop hating me. LOL

Just seems weird for them to start showing up after such a long period of time. I know everyone is saying a major Penguin and Panda update that will rock the internet is "coming...." or has started... and I just wonder if this isn't a "pre-check" or "pre-update" before the update. Like priming a well - just making sure all is running clear before they crank it on.

Robert Charlton




msg:4576352
 10:27 pm on May 21, 2013 (gmt 0)

On the Google Update thread, nwelsh posted that he believes we're seeing old data, and that he thinks it's a total reindex in progress...

Google Updates and SERP Changes - May 2013
12:59 pm on May 21, 2013 (PST -8)
http://www.webmasterworld.com/google/4569639-12-30.htm#msg4576314

Sorry we don't have a way of linking directly to a post, but the above might get you there.

I was thinking that it might be a confirmation reindex, run on a portion of the web, to compare results with seed sets prior to a complete reindex, but I'm now thinking it could well be a full reindex.

Regarding the date, new member rossell123 asked earlier...
So nobody knows what the update was that took place on 16th May 2013?

May 17th in Webmaster Tools is close enough to confirm this date, and I've added May 17th to the title of this thread.

For now, discussion should probably continue either here or in the May update thread, depending on the types of observations you have.

Robert Charlton




msg:4576356
 10:40 pm on May 21, 2013 (gmt 0)

PS: Frost_Angel - As I posted in another thread where a member was similarly panicked, I would just chill out for a while.

404 errors are reported to alert you that requests for the urls are showing errors. If for some reason you were getting errors on pages that you assumed were in the index, then you should be concerned.

In this case, the errors for pages that have been removed are expected behavior. I'd watch with interest the history as it unfolds in Google, as it might give you some flashes of insight, but... considering current circumstances and that you are not alone... I certainly wouldn't lose any sleep over what you're seeing.

Here's the nwelsh post in full, as I think it might be helpful to calm you. Notice the numbers he's seeing... and that he doesn't seem to be worried by them....

It's an old data set. Our site has over 500,000 pages, and the URLs showing up in the index right now are pages we had removed from our site after the first version of Panda/Penguin hit.

I believe this is the pattern:

Reset data to old
Reindex the web
Unleash the new algorithm on the new index

Could be 100% wrong, but believe this is what's happening.

nwelsh




msg:4576396
 4:26 am on May 22, 2013 (gmt 0)

Robert,

It also would fit in with what Matt Cutts said about having an algorithm update in a few weeks.

We got a similar set of these Crawl Error messages in WMT and as the OP mentioned, they were from pages removed two years ago.

Frost_Angel




msg:4576398
 4:58 am on May 22, 2013 (gmt 0)

I'm glad I'm not the only one experiencing this. Makes me feel better because it's like walking on eggshells. I'm just a simple "mom" site that had done really well for 10 years when I got hit with Panda the first time. The people I hired to move my site from HTML to Wordpress made many mistakes - including having an "old" version of my site live, and a test version running - along with the new version.... so 3 copies of my site running. I don't know if it was that which brought on Panda or the script churning out thin content pages that were being gobbled up by G.

But I want to make sure I understand this right....
The theory any way....

Google has improved Panda and Penguin (in their minds) - and now to really test it... and to fix both the bad and good that it did - they are going to start with a clean slate so-to-speak and set the database BACK as much as they can to pre-Panda and pre-Penguin.

Then unleash their newer/cooler version and hope it catches more problem sites and hurts the innocent sites "less".

Am I way off?

nwelsh




msg:4576399
 5:11 am on May 22, 2013 (gmt 0)

I'm not sure what's going on, it's an assumption based on what's happening on our site. Maybe the data has been rolled back for only a few domains?

If you had a sudden spike in crawl errors, maybe you're part of that group.

As of today, the index still shows old URLs that don't exist for our site.
It could all be a site level issue too. It's hard to tell.

Frost_Angel




msg:4576400
 5:20 am on May 22, 2013 (gmt 0)

I appreciate the opinions though. It's helpful I think.
Thank you.

garyr_h




msg:4576401
 5:21 am on May 22, 2013 (gmt 0)

Are we all sure that it started May 16/17? I started noticing a huge change in traffic on May 14 (and it has continually changed since then).

tedster




msg:4576402
 5:29 am on May 22, 2013 (gmt 0)

The opening poster's report is about crawl errors, not a traffic change - right?

diberry




msg:4576554
 3:19 pm on May 22, 2013 (gmt 0)

Right, tedster. If there's an algo update happening, then chances are we will see a lot of different changes happening *alongside* this crawl errors May 17th thing. The other changes are getting discussed in the monthly updates thread. It sure feels like there's one big change going on and all these observations are parts of the whole. But we'll see.

Frost_Angel




msg:4576594
 4:31 pm on May 22, 2013 (gmt 0)

Yep - just had 3,000 more come in last night for the same damn pages I removed 2 years ago. And I have pages showing up from when the site was in HTML - which was two years ago too.

Robert Charlton




msg:4577550
 7:59 pm on May 24, 2013 (gmt 0)

Frost_Angel, lucy24, nwelsh, and others who were seeing reported crawl errors due to old pages appearing in the index...

...now that we've had the Penguin 2.0 update, what's the situation with these crawl errors? Did they go away, or are they taking time to get purged from the system?

I should note that WMT is another aspect of this, and it might well take time to update.

lucy24




msg:4578785
 11:18 pm on May 28, 2013 (gmt 0)

Nothing new here.

By happenstance I recently removed a clutch of pages, so I'm getting a lot of 410s anyway. No unusual 404 activity lately: that seems to have come a month or two back with a flurry of garbage requests. That is: not misreading of links or bad parameters, but the kind you get when the cat walks across the keyboard-- or the search engine is checking for soft 404s.

By unrelated happenstance I also added a whole slew of (non-indexed but crawlable) pages last month. So this too may have triggered some robotic investigation.

Frost_Angel




msg:4578790
 11:22 pm on May 28, 2013 (gmt 0)

This is how mine is going right now.

I had 108 crawl errors on 5/19 - that was a little higher than normal....
But not out of the question.
Then it goes like this:

5/20: 6,730
5/21: 9,888
5/22: 8,067
5/23: 7,088
5/24: 7,126
5/25: 7,138
5/26: 6,201
5/27: 5,210

These are ALL pages that were 404'd YEARS ago. It's really bizarre. Not to mention that with all this going on - I also got an unnatural linking email on 5/16, right before all this, and I do not buy links and really go after no links on my own either. The site is 14 years old... I have so many backlinks - I don't check my backlink profile, or even understand it. I don't buy links (couldn't afford it even if I wanted to!) -- so it's weird that I get this warning out of the blue and then all of a sudden all these old pages that have been gone forever show up.

I don't understand it. But I am also not a trained SEO person. I come to WW to just "try" and teach myself and learn.

If you know anyone offering a paid internship in SEO/backlink building - I would love a chance at it... LOL

Take care.
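For anyone wanting to track a spike like this, the daily counts quoted in the post above can be turned into day-over-day changes with a few lines of Python. This is just an illustrative sketch; the figures are copied from the post, and the helper names `daily_changes` and `peak` are made up here.

```python
# Daily crawl-error counts as reported in the post above (5/20 to 5/27, 2013).
counts = [
    ("5/20", 6730), ("5/21", 9888), ("5/22", 8067), ("5/23", 7088),
    ("5/24", 7126), ("5/25", 7138), ("5/26", 6201), ("5/27", 5210),
]


def daily_changes(series):
    """Return (day, count, change from the previous day) for each day after the first."""
    return [
        (day, value, value - prev_value)
        for (_, prev_value), (day, value) in zip(series, series[1:])
    ]


def peak(series):
    """Return the (day, count) pair with the highest count."""
    return max(series, key=lambda item: item[1])


if __name__ == "__main__":
    for day, value, change in daily_changes(counts):
        print(f"{day}: {value:>6,} ({change:+,})")
    peak_day, peak_value = peak(counts)
    last_day, last_value = counts[-1]
    drop = (peak_value - last_value) / peak_value
    print(f"Peak on {peak_day}: {peak_value:,}; down {drop:.1%} by {last_day}")
```

On these numbers the count peaks on 5/21 and has fallen by a little under half by 5/27, which fits a one-off recrawl that is slowly being worked through rather than a problem that keeps growing.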

indyank




msg:4578841
 2:58 am on May 29, 2013 (gmt 0)

AFAIK, there is definitely a relationship between Panda and an increase in crawl errors for negatively impacted sites. But I am not sure about the relationship between an increase in crawl errors and other algorithmic changes.

Frost_Angel




msg:4578843
 3:02 am on May 29, 2013 (gmt 0)

It seems like if Google came back around and saw that all that web spam was indeed gone - then they would lighten up on the Panda hurt site.

tedster




msg:4578844
 3:09 am on May 29, 2013 (gmt 0)

As I understand it, Panda is not directly about webspam - that is Penguin's job. Panda is about "thin content", and especially the type produced by so-called content farms. It was originally described by Matt Cutts and Amit Singhal as the kind of site that "falls between the main ranking team and the spam team".

Frost_Angel




msg:4578854
 4:25 am on May 29, 2013 (gmt 0)

That's what I meant by webspam.... for my site - it was churning out thin content (web spamming the search engine with thin content pages) - that's what I meant.
Like I said... I had a script that was not supposed to create pages that were cached and used in the serps - but they were - over 80,000 of them and now they are gone - but after 2 years they are showing up as 404's again.

dougwilson




msg:4578855
 4:48 am on May 29, 2013 (gmt 0)

If we've removed a folder or file and entered that in robots.txt...

Disallow: /dir/
Disallow: /*.htm$

... wouldn't it mean links are followed from other sites?

Rather than G's index archive?
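As a rough way to check what a rule like the one above actually blocks, Python's standard `urllib.robotparser` can evaluate a robots.txt against a URL. This is only a sketch: `example.com` and the helper name `is_crawlable` are placeholders, a `User-agent: *` line is assumed, and the wildcard rule (`/*.htm$`) is left out because, as far as I know, the standard-library parser treats paths as plain prefixes and does not implement that extension.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: only the directory rule from the post, under a
# catch-all user-agent group. The wildcard rule is deliberately omitted.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dir/
"""


def is_crawlable(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain.
    for url in ("http://example.com/dir/page.htm", "http://example.com/about.html"):
        print(url, is_crawlable(ROBOTS_TXT, url))
```

A crawler that honours the rule would not request anything under `/dir/` at all, so it would never see the 404 status for those URLs. That is a separate question from how the crawler first learned the URLs, whether from external links or from its own stored history.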

indyank




msg:4578881
 6:32 am on May 29, 2013 (gmt 0)

It seems like if Google came back around and saw that all that web spam was indeed gone


Nope, I don't think it is about coming back to see whether those thin content pages are gone. It is more about crawling the entire set of URLs they have in their history DB for that site. This includes old URLs that you had removed years ago and URLs that never existed on the site.

Frost_Angel




msg:4579018
 1:48 pm on May 29, 2013 (gmt 0)

What would be Google's point in crawling old 404 urls? What benefit would they gain from it? It seems they would be verifying something or updating something? Otherwise - what's the point of 404-ing in the first place if they keep coming back and rechecking or looking for them again?

tedster




msg:4579023
 2:02 pm on May 29, 2013 (gmt 0)

There's a Google explanation out there somewhere. Essentially, they say they monitor old 404s for years once they have that URL in "storage", because in their experience, many webmasters 404 a URL and then replace it later.

As long as you've got no remaining internal links pointing to it, and it's not ranking, then there is no real problem.

Frost_Angel




msg:4579034
 2:18 pm on May 29, 2013 (gmt 0)

OK - that makes sense. Makes me worry less. Must be something they run before updates.

dougwilson




msg:4579056
 3:26 pm on May 29, 2013 (gmt 0)

"many webmasters 404 a URL and then replace it later"

If Google really says that's the reason for storing all data forever, they need to hire better writers.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved