Well, we've decided to 'assist' the bots by actually telling them past 404 pages are now 410 Gone. It's the least I can do.
10.4.11 410 Gone
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.

The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.
Is there anything specific I need to do?
Would I need to create a Custom 410 page?
Thanks.
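For a single retired URL, one low-effort option is mod_alias's Redirect directive with the "gone" status; this is only a sketch, and /old-promo.html is a hypothetical path:

# Mark one retired URL as 410 Gone; the "gone" status takes no target URL
Redirect gone /old-promo.html

Apache then serves its standard 410 message, or your custom page if an ErrorDocument 410 is defined.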
Maybe this:
[...]
ErrorDocument 410 /mypathto/my410errorpage.htm
[...]
RewriteRule ^mypagethatdoesnotexistanymore\.html$ - [R=410,L]
[...]
You could use a RewriteBase directive, too, depending on your directories and where you intend to put this .htaccess file.
If the pages have some string in common, you should use a regexp to match them.
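Put together, such an .htaccess might look like the sketch below; the oldsection- filename prefix is hypothetical, and the [G] flag produces the same 410 Gone status as [R=410,L]:

ErrorDocument 410 /mypathto/my410errorpage.htm

RewriteEngine On
# Adjust or omit RewriteBase depending on where this .htaccess lives
RewriteBase /
# Every removed page matching the common pattern is answered with 410 Gone
RewriteRule ^oldsection-.*\.html$ - [G]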
I don't know if you use PHP. For one of my sites, I found it more convenient to replace the content of any page that no longer exists with a script:
<?php
// Permanent redirect for a page that no longer exists
header("HTTP/1.1 301 Moved Permanently");
header("Expires: Fri, 31 Dec 2004 01:00:00 GMT");
header("Location: [mygentlesite...]");
exit;
?>
If you do not wish to have a replacement page, just redirect it towards the home page.
It worked with the search engines. When the page has not been visited for a month, I remove it.
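If you would rather keep that home-page fallback in .htaccess than in PHP, a rough equivalent (with example.com and /removed-page.html as placeholders) would be:

# Send visitors of a removed page to the home page with a permanent (301) redirect
Redirect permanent /removed-page.html http://www.example.com/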
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule regexp - [G,L]
...where regexp should match the set of pages gone. The "-" means no rewriting, and the G flag means "Gone".
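Filled in, that might look like the following sketch, where the promo- filename prefix is only an example:

RewriteEngine On
# Leave requests for files that really exist alone
RewriteCond %{REQUEST_FILENAME} !-f
# Everything matching the pattern of the removed pages is reported as Gone
RewriteRule ^promo-.*\.html$ - [G,L]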
However, for people who might have bookmarked a page that no longer exists, a 410 would only be more upsetting than a 404, no? I've been told by usability gurus that average users should never see an error message. They advise redirecting to a search form, or to the home page.
As the page "HTTP Error 410: Gone" puts it:
[...] When the average AOL user (or below average web surfer/Cerfer) tries to get a page which is either gone or was never there to begin with, I don’t think they’re going to care if it’s 404 or 410. The end result is the same... they’re scratching their heads wondering what happened and trying to find a link to fire off an email to webmaster@domain.tld [...]
But for bots, that will do. When a bot meets a 410, what should it do? Remove the entry from its database, OK. And stop crawling the site? I don't know whether error messages are that good for a site's rankings...
Does anyone know?
You can set up a custom 410 page with whatever information you want to give the humans who might see it, and reference it from your .htaccess, e.g.,
ErrorDocument 410 /errors/410.html
and it will automatically be presented instead of the standard 410 message.
So what's all that mean?
It means that if you have your own IP address, and therefore can support HTTP/1.0 requests to your site, then you should check to see if the user-agent has sent a Host header in its request, and if so, treat it as an HTTP/1.1 user-agent. If not, then do not send a 410 response, because HTTP/1.0 user-agents won't understand it to be anything more than a general 400-series error code.
RewriteCond %{HTTP_HOST} .
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule .* - [G]
With that in mind, I suggest using an actual list of known-to-have-been-recently-removed pages, rather than the method shown above. Yes, it's more work. But lower risk.
RewriteCond %{HTTP_HOST} .
RewriteCond %{REQUEST_URI} ^/(gone_page1|gone_page2|gone_page[4-9]|junk|old.+|byebye)\.html$ [OR]
RewriteCond %{REQUEST_URI} ^/(old_menu|old_cat|old_dir/products)\.php$
RewriteRule .* - [G]
Jim
Yahoo - HTTP/1.0
MSNbot - HTTP/1.0
Gigabot - HTTP/1.0
Jeeves - HTTP/1.0
"Old" Googlebot ("Googlebot/2.1 [...]") - HTTP/1.0
"New" Googlebot ("Mozilla/5.0 (compatible; Googlebot/2.1 [...]") - HTTP/1.1
(Again,) I think only Google - both versions of their bot - has understood a 410. Once again, poor implementation has meant a useful tool has fallen by the wayside. :(
A few years ago I ran across a tool called WebBug. It allows you to check server responses using HTTP 0.9, 1.0 and 1.1. When checking various servers, both Apache and Windows, different response codes are returned depending on the HTTP version chosen. If a 301 is in place from the root domain to a sub-domain, HTTP 1.0 returns a 200 status while HTTP 1.1 returns a 301 status. Why is that?
Because a true HTTP/1.0 request does not include a Host header.
By definition, you cannot "check" an HTTP/1.0 request to see if it is addressed to the correct domain or subdomain -- this information is simply not available in the HTTP/1.0 request. All resolution of domain name to IP address takes place in the DNS phase of the client request. After the domain is resolved to an IP address in an HTTP/1.0 transaction, the requested host information is essentially lost.
Therefore, no domain-based redirects are possible for HTTP/1.0, and you cannot properly handle HTTP/1.0 requests on a shared name-based server -- This is one reason that having your own IP address used to be absolutely required.
Again, to see if an HTTP/1.1 user-agent is masquerading as an HTTP/1.0 agent (for maximum compatibility, probably), just check %{HTTP_HOST} for non-blank.
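To make that concrete, here is a sketch of a canonical-host redirect (www.example.com is a placeholder) that depends entirely on %{HTTP_HOST}. A true HTTP/1.0 request sends no Host header, so the first condition fails, the rule never fires, and the page comes back 200 rather than 301, which matches the difference WebBug reports when you switch protocol versions:

RewriteEngine On
# Only act when the client actually sent a Host header (HTTP/1.1-capable)
RewriteCond %{HTTP_HOST} .
# ...and that Host is not already the canonical one
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]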
Jim
Some time back I went thru the 301 re-direct thing and it worked quite well...for re-directing. Didn't do squat for those orphan files running around in other engines' databases. They just kept popping up over and over again, driving me nuts looking at all the inefficient crawls.
In this case, I'll be putting up 410s for each and every one of those files that are still being requested.