1) Below is the content of my .htaccess file:
IndexIgnore *
#
ErrorDocument 404 /admin/errors/404.php
ErrorDocument 410 /admin/errors/404.php
#
RewriteEngine On
#
# All .html to .php
#
RewriteRule ^(.*)\.html$ $1.php [L]
#
# Old .php links lying in search engine indexes -- do a Gone 410
# If a user requests a .php link directly from browser -- do a Gone 410
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\.php\ HTTP/
RewriteRule .* - [G]
2. [If you were to try] this link -> http://www.example.com/what.html you get a correct 404 response.
3. [If you were to try] this link -> http://www.example.com/what.php you get a correct 410 response. But you also get this message: "Additionally, a 410 Gone error was encountered while trying to use an ErrorDocument to handle the request."
Why is my ErrorDocument 410 /admin/errors/404.php not working?
I have spent the past 2 days trying all sorts of things and I have read all the posts on this forum, but I could not resolve this :( Any help you can provide will be greatly appreciated.
Thanks
[edited by: jdMorgan at 2:56 pm (utc) on Nov. 28, 2006]
[edit reason] No URLs, please. See TOS. [/edit]
IndexIgnore *
#
ErrorDocument 404 /admin/errors/404.php
ErrorDocument 410 /admin/errors/404.php
#
RewriteEngine On
#
# All .html to .php
#
RewriteRule ^([^.]+)\.html$ $1.php [L]
#
# Old .php links lying in search engine indexes -- do a Gone 410
# If a user requests a .php link directly from browser -- do a Gone 410
#
RewriteCond %{REQUEST_URI} !^/admin/errors/[^.]+\.php$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.php\ HTTP/
RewriteRule \.php$ - [G]
If this does not help, check to be sure that your custom error pages do not reference other .php URLs via HTTP, since that would result in direct client requests for additional "Gone" .php URLs.
Jim
The changes you have suggested work. You have saved us so much time and headache -- I am glad I can move on to the next task at hand tomorrow when I go to work. Thank you.
A minor issue
=============
RewriteRule ^(.*)\.html$ $1.php [L] <-- our old code
RewriteRule ^([^.]+)\.html$ $1.php [L] <-- our tweak
Your tweak is working correctly for example.com/what.html, but it is resulting in a 404 for example.com/folder/what.html (folder is the name of a sub-directory).
What do you think would be the reason?
Enlighten us please
====================
Would you be kind enough to answer the question below, just so I understand this a bit better:
# -- My old code --
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\.php\ HTTP/
RewriteRule .* - [G]
# -----------------
1. A user types in example.com/what.php - This matches the RewriteCond above and would result in a [G].
2. Next, Apache would get ready to process the [G] and find that it needs to serve /admin/errors/404.php.
3. The big question:
a) At this point would Apache serve output from 404.php and be done OR
b) Apache would still keep looking for further directive matches in the .htaccess file?
If b) is true, I would expect Apache not to match the RewriteCond above, because the internal URL that it now has in hand to process is /admin/errors/404.php, which would not match THE_REQUEST.
Please shed some light.
Also
====
a) I just finished reading your wonderful post at [webmasterworld.com...] We have printed it and included it in our documentation folder.
b) Is the approach we have taken here (doing a Gone for old .php links lying around in indexes and also for direct .php requests from users) a best practice, or should we be doing a 301 instead? Our IT wants to hide the fact that we use .php as our scripting language, and they argued that if we silently transport a what.php to what.html, this will expose the fact that we use PHP. (Don't even ask me why they would want to hide this info :) -- I have no clue!)
Please ignore the minor issue that I described above -- I posted in a hurry without testing much. Below are the two small issues we are having after implementing your suggestions.
RewriteRule ^([^.]+)\.html$ $1.php [L]
RewriteCond %{REQUEST_URI} !^/admin/errors/[^.]+\.php$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.php\ HTTP/
RewriteRule \.php$ - [G]
1. example.com/what.ever.html <-- this is resulting in a 404 (I guess the first RewriteRule needs tweaking).
2. example.com/what.ever.php <-- this is showing the actual .php page rather than resulting in a Gone
Both of these have to do with files that have a period as part of the filename. Can you suggest a workaround to accommodate this please?
Thanks again.
So, despite the fact that you'll lose some efficiency, with your dotted URLs you'll need to use ".+" instead of "[^.]+". Here the meaning is "Match one or more of any character." This will force the regular-expressions processing into backtracking mode, since it will initially try to match everything into the ".+" pattern, fail to get a match, and then begin incrementally backing off one character at a time from the end of the requested URL-path in order to find a match. Compared to a specific pattern it's horribly inefficient, but your choice of URL format requires it.
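As a rough sketch of that change, assuming the same rules posted earlier in the thread and nothing else in the file, the dotted-filename version might look like this:

```apache
# Sketch only: ".+" replaces "[^.]+" so that filenames containing
# periods (e.g. what.ever.html) can match.
#
# Internally rewrite any .html request to the corresponding .php script
RewriteRule ^(.+)\.html$ $1.php [L]
#
# Return 410 Gone for direct client requests for .php URLs, but never
# for the custom error documents themselves. The error-document names
# contain no extra periods, so "[^.]+" can stay in the first condition.
RewriteCond %{REQUEST_URI} !^/admin/errors/[^.]+\.php$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.+\.php\ HTTP/
RewriteRule \.php$ - [G]
```

With these patterns, a request for what.ever.html should be rewritten to what.ever.php, and a direct client request for what.ever.php should receive the 410.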
The original problem was caused by the fact that despite the [L], [P], [F], and [G] flags taking immediate action and terminating mod_rewrite processing, they do so only for the current pass through the .htaccess file. While there is an efficiency gain from stopping this unnecessary processing, you need to be aware that any time a URL is changed internally, .htaccess processing will be re-invoked, so that any access controls which apply to the new URL can be checked and applied.
Therefore, mod_rewrite has the appearance of being recursive, and you must explicitly prevent rules from interacting unless you want them to. This is the reason for using THE_REQUEST in the thread you mentioned; Since it checks the original client request, it is completely unaffected by internal rewrites taking place in the context of the current HTTP request. Note however, that external redirects terminate the current HTTP transaction and cause the client to begin a new one, and since HTTP is a stateless protocol, the server will have no memory that the new request is related to the previous one. Therefore following an external redirect, THE_REQUEST will have the newly-requested URL.
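To make the two passes concrete, here is an annotated sketch of what happens to a direct request for what.php under the original rules, followed by the corrected rules from earlier in the thread. The pass-by-pass comments are a simplified reading of the behavior described above, not server log output.

```apache
# Client sends: GET /what.php HTTP/1.1
#
# Original rules, pass 1: URL-path is "what.php"
#   THE_REQUEST is "GET /what.php HTTP/1.1"   -> condition matches
#   RewriteRule .* - [G]                      -> 410; Apache now makes an
#                                                internal request for
#                                                /admin/errors/404.php
# Original rules, pass 2: URL-path is "admin/errors/404.php"
#   THE_REQUEST is still "GET /what.php HTTP/1.1" -> condition matches again
#   RewriteRule .* - [G]                      -> a second 410, raised while
#                                                serving the ErrorDocument
#
# Corrected rules: REQUEST_URI reflects the current (internal) URL, so the
# first condition fails on pass 2 and the error document is served.
RewriteCond %{REQUEST_URI} !^/admin/errors/[^.]+\.php$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.php\ HTTP/
RewriteRule \.php$ - [G]
```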
Best practices for *any* old URL are to 301-redirect it to a new URL that directly replaces the original content when possible. If not, then the old URL should be redirected to a URL whose contents "make sense" to a requestor of the original URL. In many cases, a page that is "close enough to make sense" is available -- say an old product page URL being redirected to a new product page, or even to the product category page, if that specific product becomes unavailable. Here the user will see that the old product is no longer listed in the category, and can surmise that you no longer carry that product. And the user can then browse for similar replacement products.
If no sensible replacement is available, then redirect to a page that says so, and provide links to the site map and to the home page of the site. This is all about not confusing the visitor and being helpful; technical considerations should be mostly ignored.
I typically 301-redirect the very few URLs that I remove to the new location for a year, and then change the code to respond with 410-Gone in perpetuity. That "in perpetuity" part should indicate why you should never change a URL, except for legal (e.g. Trademark, Service Mark) reasons. There are many sites on the Web today whose URLs say "widgets.html" despite the fact that the Web server technology has changed from static HTML to SSI, to CGI scripting, then to Cold Fusion, and now to PHP with MySQL. There is no good reason to change a URL unless a lawyer calls you and says you must. And that includes not changing URLs just because the site technology changes. More here: [w3.org...]
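A minimal sketch of that two-phase approach, using hypothetical URLs (old-widgets.html and widgets.html are placeholders, not pages from this thread):

```apache
# Hypothetical URLs, for illustration only.
#
# Phase 1 (roughly the first year): send visitors and robots
# to the replacement page
RewriteRule ^old-widgets\.html$ /widgets.html [R=301,L]
#
# Phase 2 (afterwards): remove the rule above and enable this one,
# so the old URL answers 410 Gone from then on
# RewriteRule ^old-widgets\.html$ - [G]
```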
I hope that covers your questions...
Jim
Thanks for sharing the knowledge.
No URL changes (almost) and Don't use inefficient reg exps in .htaccess -- Both points are very much valid and we will keep these in mind for future code changes.
1. We have around 5 files that have periods as part of the file name. We just went ahead and renamed these files without the period.
2. We added the below code to 301 redirect old what.ever.html to new what.html
RewriteRule ^about/contact\.us\.html$ /about/contact.html [R=301,L]
RewriteRule ^about/give\.us\.feedback\.html$ /about/feedback.html [R=301,L]
RewriteRule ^about/other\.providers\.html$ /about/links.html [R=301,L]
RewriteRule ^about/terms\.of\.use\.html$ /about/terms.html [R=301,L]
RewriteRule ^tell\.a\.friend\.html$ /about/tellafriend.html [R=301,L]
This code is working fine, but I would still like to know if this is the correct approach, or should we be adding a !^about/contact.us.html exclusion to the rule below to be safe and efficient?
RewriteRule ^([^.]+)\.html$ $1.php [L] (this rule is at the top of the file)
=====
With this change, both of our minor issues have been resolved. We now use your below code as-is and it works fine.
RewriteRule ^([^.]+)\.html$ $1.php [L]
RewriteCond %{REQUEST_URI} !^/admin/errors/[^.]+\.php$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.php\ HTTP/
RewriteRule \.php$ - [G]
No, it's not necessary to add that exclusion, since that URL will have already been redirected by the time the html-to-php rewrite rule runs if the rules are ordered correctly.
The proper order for what you've posted is as shown in your post: Specific-page redirects, then internal rewrites. If you also have a redirect rule for www/non-www domain canonicalization, put that after the page redirects and before the internal rewrites.
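Put together, that ordering could be sketched as follows. The canonicalization rule is an assumption for illustration: it presumes www.example.com is the preferred hostname and that this .htaccess file sits in the site root.

```apache
RewriteEngine On
#
# 1. Specific-page external redirects
RewriteRule ^about/contact\.us\.html$ /about/contact.html [R=301,L]
RewriteRule ^tell\.a\.friend\.html$ /about/tellafriend.html [R=301,L]
#
# 2. Domain canonicalization (assumed: non-www redirects to www)
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
#
# 3. Internal rewrites
RewriteRule ^([^.]+)\.html$ $1.php [L]
#
# 4. 410 Gone for direct client requests for .php URLs
RewriteCond %{REQUEST_URI} !^/admin/errors/[^.]+\.php$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.php\ HTTP/
RewriteRule \.php$ - [G]
```

Because the external redirects end the current request, a redirected URL never reaches the internal rewrite in the same pass, which is why no extra exclusion is needed on the html-to-php rule.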
Jim