Forum Moderators: phranque


410 Gone

I think it's time to replace all those 404s the bots are eating.


pendanticist

6:58 am on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For a long, long time, I've #*$!ed about stupid bots and how they can't seem to understand that when a page has been deleted or changed, they ought to just stop requesting it.

Well, we've decided to 'assist' the bots by actually telling them past 404 pages are now 410 Gone. It's the least I can do.

From RFC 2616, section 10.4.11 (410 Gone):
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.

The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.

Is there anything specific I need to do?

Would I need to create a Custom 410 page?

Thanks.

Marino

9:50 am on Jan 31, 2005 (gmt 0)

10+ Year Member



Hello,

Maybe this:

[...]
ErrorDocument 410 /mypathto/my410errorpage.htm
[...]
RewriteRule mypagethatdoesnotexistanymore.html [mygentlesite.com...] [R=410,L]
[...]

You could use a RewriteBase directive, too, depending on your directories and where you intend to put this .htaccess file.
If the pages have some string in common, you should use a regexp to match them.
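For instance, if the removed pages all live under one directory, a sketch along these lines (paths and filenames hypothetical) would mark them all Gone:

```apache
ErrorDocument 410 /mypathto/my410errorpage.htm
RewriteEngine On
# Any request under /old-articles/ gets a 410 Gone response
RewriteRule ^old-articles/.+\.html$ - [G]
```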

I don't know if you use PHP. For one of my sites, I found it more convenient to replace the content of any no-longer-existing pages with a script:

<?php
header("HTTP/1.1 301 Moved Permanently");
header("Expires: Fri, 31 Dec 2004 01:00:00 GMT");
header("Location: [mygentlesite...]");
?>

If you do not wish to have a replacement page, just redirect it towards the home page.

It worked with the search engines. When the page has not been visited for a month, I remove it.

Longhaired Genius

11:43 am on Jan 31, 2005 (gmt 0)

10+ Year Member



Here's an informative article [diveintomark.org] on Error Code 410 Gone.

Marino

1:37 pm on Jan 31, 2005 (gmt 0)

10+ Year Member



Thanks for the URL. Learning and learning...
So the correct code is:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule regexp - [G,L]

...where regexp should match the set of pages gone. The "-" means no rewriting, and the G flag means "Gone".
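Spelled out as a complete .htaccess fragment (page names hypothetical):

```apache
RewriteEngine On
# Only act on requests that don't match an existing file
RewriteCond %{REQUEST_FILENAME} !-f
# "-" = no substitution; G = send 410 Gone (implies L)
RewriteRule ^(oldpage1|oldpage2|retired/.*)\.html$ - [G]
```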

However, for people who might have bookmarked a no-longer-existing page, wouldn't a 410 only be more upsetting than a 404? I've been told by usability gurus that average users should never see an error message. They advise redirecting to a search form or the home page.

As said in the page "HTTP Error 410: Gone" :

[...] When the average AOL user (or below average web surfer/Cerfer) tries to get a page which is either gone or was never there to begin with, I don’t think they’re going to care if it’s 404 or 410. The end result is the same... they’re scratching their heads wondering what happened and trying to find a link to fire off an email to webmaster@domain.tld [...]

But for bots, that will do. When a bot meets a 410, what should it do? Remove the entry from its database, OK. But stop crawling the site? I don't know whether error messages are bad for a site's rankings...

Does anyone know?

Longhaired Genius

2:29 pm on Jan 31, 2005 (gmt 0)

10+ Year Member



I don't think a 410 message would have any detrimental effect on the crawling of a site by robots. If the site is set up properly, a robot would never get a 410 from a link that is actually on the site, only from remote links and its own list of cached links, in which case it would most probably still have plenty of good links to follow.

You can set up a custom 410 page with whatever information you want to give humans who might see it, and reference it from your .htaccess, e.g.:

ErrorDocument 410 /errors/410.html

and it will automatically be presented instead of the standard 410 message.

jdMorgan

4:11 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Remember that 410-Gone is an HTTP/1.1 response, and should not be returned to an HTTP/1.0 user-agent. Also be aware that while some user-agents "claim" in their request headers to be HTTP/1.0-compliant, they are actually HTTP/1.1-compatible.

So what's all that mean?

It means that if you have your own IP address, and therefore can support HTTP/1.0 requests to your site, then you should check to see if the user-agent has sent a Host header in its request, and if so, treat it as an HTTP/1.1 user-agent. If not, then do not send a 410 response, because HTTP/1.0 user-agents won't understand it to be anything more than a general 400-series error code.


# A Host header implies an HTTP/1.1-capable client
RewriteCond %{HTTP_HOST} .
# Request does not match an existing file
RewriteCond %{REQUEST_FILENAME} !-f
# Respond 410 Gone
RewriteRule .* - [G]

I don't really recommend the above approach -- automatically declaring a 410 when any resource is found to be missing. This is because 410 is a rather "final" declaration, as opposed to a non-committal 404 (More info here [webmasterworld.com]). Therefore, if you accidentally remove a page and a spider sees a 410, it could immediately remove that page from its index, leaving you to start over on ranking that page. This might only be important during the time it takes the search engine to find that your accidentally-410'ed page has been restored, but if it was an important-to-revenue page on a busy site, a 30-day re-indexing delay could be a real concern.

With that in mind, I suggest using an actual list of known-to-have-been-recently-removed pages, rather than the method shown above. Yes, it's more work. But lower risk.


# A Host header implies an HTTP/1.1-capable client
RewriteCond %{HTTP_HOST} .
RewriteCond %{REQUEST_URI} ^/(gone_page1|gone_page2|gone_page[4-9]|junk|old.+|byebye)\.html$ [OR]
RewriteCond %{REQUEST_URI} ^/(old_menu|old_cat|old_dir/products)\.php$
RewriteRule .* - [G]

In this way, files you know have been removed get a 410-Gone if requested by an HTTP/1.1-compatible user-agent, and files that are accidentally removed or renamed, or missing files that are requested by an HTTP/1.0 user-agent, will get a 404-Not Found response.
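To also show any humans who hit one of these pages something friendlier than the server's default message, the list-based ruleset can be combined with the ErrorDocument directive mentioned earlier in the thread (paths hypothetical):

```apache
ErrorDocument 410 /errors/410.html
RewriteEngine On
RewriteCond %{HTTP_HOST} .
RewriteCond %{REQUEST_URI} ^/(gone_page1|gone_page2)\.html$
RewriteRule .* - [G]
```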

Jim

balam

5:16 pm on Jan 31, 2005 (gmt 0)

10+ Year Member



Jim's come in with an excellent post, and here's a further note on my final statement (from the thread Jim referenced) about the poor implementation of 410 Gone:

Yahoo - HTTP/1.0
MSNbot - HTTP/1.0
Gigabot - HTTP/1.0
Jeeves - HTTP/1.0
"Old" Googlebot ("Googlebot/2.1 [...]") - HTTP/1.0
"New" Googlebot ("Mozilla/5.0 (compatible; Googlebot/2.1 [...]") - HTTP/1.1

(Again,) I think only Google - both versions of their bot - has understood a 410. Once again, poor implementation has meant a useful tool has fallen by the wayside. :(

pageoneresults

5:23 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a related question concerning the HTTP version. HTTP 1.1 has been out for quite some time now. Why would any bot today follow the HTTP 1.0 standard? I've recently been involved in a discussion concerning this and after some research discovered some disturbing differences between HTTP 1.0 and HTTP 1.1.

A few years ago I ran across a tool called WebBug. It allows you to check server responses based on HTTP 0.9, 1.0 and 1.1. When checking various servers, both Apache and Windows, different response codes are returned depending on the HTTP version chosen. If a 301 is in place from the root domain to a sub-domain, HTTP 1.0 returns a 200 status while HTTP 1.1 returns a 301 status. Why is that?

jdMorgan

5:39 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> If a 301 is in place for root domain to sub-domain, HTTP 1.0 returns a 200 status. HTTP 1.1 returns a 301 status. Why is that?

Because a true HTTP/1.0 request does not include a Host header.

By definition, you cannot "check" an HTTP/1.0 request to see if it is addressed to the correct domain or subdomain -- this information is simply not available in the HTTP/1.0 request. All resolution of domain name to IP address takes place in the DNS phase of the client request. After the domain is resolved to an IP address in an HTTP/1.0 transaction, the requested host information is essentially lost.

Therefore, no domain-based redirects are possible for HTTP/1.0, and you cannot properly handle HTTP/1.0 requests on a shared name-based server -- This is one reason that having your own IP address used to be absolutely required.
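The difference is visible in the raw requests themselves. An HTTP/1.0 request carries no host information, while HTTP/1.1 requires the Host header (example.com used as a placeholder):

```
GET /page.html HTTP/1.0

GET /page.html HTTP/1.1
Host: www.example.com
```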

Again, to see if an HTTP/1.1 user-agent is masquerading as an HTTP/1.0 agent (probably for maximum compatibility), just check the %{HTTP_HOST} variable for non-blank.

Jim

pendanticist

12:08 am on Feb 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks all.

Some time back I went through the 301 redirect thing and it worked quite well... for redirecting. It didn't do squat for those orphan files running around in other engines' databases. They just kept popping up over and over again, driving me nuts looking at all the inefficient crawls.

In this case, I'll be putting up 410s for each and every one of those files that are still being requested.