I just want to do what I can to let google know these pages are gone - no need to recrawl.
Google knows that these pages are gone, but it has its own reasons for recrawling.
By reporting the 404s, Google is just telling you that they requested the url for the page, and that your server didn't find anything and returned a "404 Not Found" response to Googlebot.
If you think that your server should have found something... i.e., if you believe the pages are still around and that Google should not have gotten a 404 Not Found response when it requested the url... then Google's message is useful because it alerts you to a possible problem. Otherwise, 404s are the expected response and are perfectly normal.
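If you want to confirm for yourself what status code your server is actually returning for one of the reported urls, a quick sketch like this works (Python standard library only; the url you pass in would be one of your own reported 404s, not anything assumed here):

```python
# Check what HTTP status code a server returns for a given url, roughly
# what Googlebot sees. Uses only the Python standard library.
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def status_of(url):
    """Return the HTTP status code for a HEAD request to url."""
    try:
        return urlopen(Request(url, method="HEAD")).status
    except HTTPError as err:
        # urlopen raises on 4xx/5xx, but the response code is on the error
        return err.code
```

A 404 here confirms the pages really are gone and Google's report is just informational; a 200 would mean the page is still being served and the report points at a real problem.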
As to why Google recrawls urls that you think are gone or non-existent, there are numerous reasons. One is that links to the urls may persist somewhere on the web. You can't do anything about some of the external links, but by recrawling periodically over time, Google will keep track of the responses and recrawl these old urls less often.
It might be, though, that a site will still have internal nav links to the urls of pages that have been removed. This is unlikely in your case because you hadn't gotten requests for these urls for a long while. It can be worth checking a site with Xenu or Screaming Frog, though, to make sure that these urls aren't in the site's code.
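At its core, what Xenu or Screaming Frog does is pull the hrefs out of each page's HTML and check each one's response code. A minimal sketch of the extraction step, using only the Python standard library (the sample HTML below is illustrative; in practice you'd feed it your own fetched pages):

```python
# Extract internal (root-relative) links from a page's HTML, so each can
# then be checked for a 404. Standard library only.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects every href found on <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(html, prefix="/"):
    """Return hrefs that point within the site (root-relative here)."""
    parser = LinkCollector()
    parser.feed(html)
    return [href for href in parser.links if href.startswith(prefix)]
```

Running this over your pages and checking each collected url's status is enough to verify that none of the removed urls are still linked in the site's own code.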
I've observed that in addition to periodically rechecking the lists of 404s it keeps, Google also often recrawls these lists when there's a refresh of the index, as might occur at a large update of the type we just had.
This observation from a 2006 interview with the Google Sitemaps Team is helpful... [smart-it-consulting.com
...] My emphasis added...
When Googlebot receives either (a 404 or 410) response when trying to crawl a page, that page doesn't get included in the refresh of the index. So, over time, as the Googlebot recrawls your site, pages that no longer exist should fall out of our index naturally.
My sense of the above is that by recrawling the old lists at updates or refreshes, Google is able to generate "clean" reference points of sorts, with currently 404ed urls removed from the visible index. The above interview was in 2006, though, and the index has gotten much more complex, so it's hard to say whether the 404ed pages are removed from the index in one pass, or after many.
There is a separate crawl list, and your observation suggests that the old urls are recrawled. I note from your report that the number of 404s peaked at just about the time of the update, and that the number is trending down gradually.
Re robots.txt, etc, in situations like this, I'll quote John Mueller's comments, cited above, for reference here...
For large-scale site changes like this, I'd recommend:
- don't use the robots.txt
- use a 301 redirect for content that moved
- use a 410 (or 404 if you need to) for URLs that were removed
- make sure that the crawl rate setting is set to "let Google decide" (automatic), so that you don't limit crawling
- use the URL removal tool only for urgent or highly-visible issues.
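To make the 301 / 410 / 404 distinction in John's list concrete, here's a minimal sketch of how a server might route old urls. The paths and the WSGI framing are purely illustrative (your own setup would more likely do this in the web server config or your framework's routing), but the status codes are the ones his recommendations call for:

```python
# Illustrative WSGI app: 301 for content that moved, 410 for content
# deliberately removed, 404 for everything else. Paths are hypothetical.
MOVED = {"/old-page": "/new-page"}   # old url -> new location
REMOVED = {"/retired-page"}          # urls intentionally taken down

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path in MOVED:
        # Content moved: permanent redirect passes signals to the new url.
        start_response("301 Moved Permanently", [("Location", MOVED[path])])
        return [b""]
    if path in REMOVED:
        # Content removed on purpose: 410 tells crawlers it's gone for good.
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"Gone"]
    # Everything else: a plain 404 is the normal, expected response.
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not Found"]
```

The practical difference is small: Google treats a 410 as a slightly stronger "gone" signal than a 404, which is why John suggests it for urls that were removed intentionally, but a 404 works fine too.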