.htaccess URL Error Checking

How to get .htaccess to find filenames without extensions?

         

markis

4:32 am on Oct 18, 2008 (gmt 0)

10+ Year Member



Thanks for reading -

I'm in the process of trying to error-check inbound links and redirect "badly-linked" traffic to either a valid page or to my 404 page. (FYI, all pages on my site are .php) Within the current dir, it's working really well, with the exception of one IRRITATING glitch:

The .htaccess file successfully redirects all wonky urls to the 404 page, EXCEPT existing filenames WITHOUT an extension. For example --

/conTAct.htm ---> contact.php
/conWXYZ.php ---> 404_not_found.php
/conWXYZ ---> 404_not_found.php
/contact. ---> 404_not_found.php
**BUT**
/contact slips through everything and loads the webhost's generic 404 page. No matter what I do!

Any help to get this darn thing to work would be much appreciated. Here's the .htaccess file, and the little "helper" script at the beginning of the 404_not_found.php:


RewriteEngine On
RewriteRule ^(.+)/$ http://www.example.com/$1 [R=301,NC,L]
RewriteRule ^(.+)\.html$ http://www.example.com/$1.php [R=301,NC,L]
RewriteRule ^(.+)\.htm$ http://www.example.com/$1.php [R=301,NC,L]
RewriteRule ^(.+)\.shtml$ http://www.example.com/$1.php [R=301,NC,L]
RewriteRule ^(.+)\.asp$ http://www.example.com/$1.php [R=301,NC,L]

# Try to get /contact to turn into /contact.php ... doesn't work.
RewriteRule ^([a-zA-Z]+)$ http://www.example.com/$1.php [R=301,NC,L]

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule (.*) /404_not_found.php?$1 [L]


<?php

function file_iexists($path) {
    $dirname = dirname($path);
    $filename = basename($path);
    $dir = dir($dirname);
    while (($file = $dir->read()) !== false) {
        if (strtolower($file) == strtolower($filename)) {
            $dir->close();
            return $file;
        }
    }
    $dir->close();
    return false;
}

$page = substr($_SERVER['QUERY_STRING'], 0, 999);
if ($page) {$page = file_iexists($page);}

if ($page) {
    $page = "http://www.example.com/".$page;
    header("Location: $page");
}

else {
    header("HTTP/1.0 404 Not Found");
}

?>

[edited by: jdMorgan at 4:47 am (utc) on Oct. 18, 2008]
[edit reason] Please use example.com only. [/edit]

jdMorgan

5:01 am on Oct 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Could be you've got AcceptPathInfo enabled (Apache 2.x+ only) or the MultiViews option enabled. Or it might be mod_dir getting in there and interfering, depending on the LoadModule order.

I'd suggest the following:


Options -MultiViews
RewriteEngine on
#
RewriteRule ^([a-z]+)$ http://www.example.com/$1.php [NC,R=301,L]
RewriteRule ^(.+)/$ http://www.example.com/$1 [R=301,L]
RewriteRule ^(.+)\.s?html?$ http://www.example.com/$1.php [NC,R=301,L]
RewriteRule ^(.+)\.asp$ http://www.example.com/$1.php [NC,R=301,L]
#
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule (.*) /404_not_found.php?$1 [L]

You could also replace all of the redirects with just two:

RewriteRule ^([a-z]+)$ http://www.example.com/$1.php [NC,R=301,L]
RewriteRule ^(.+)(/¦\.s?html?¦\.asp)$ http://www.example.com/$1.php [NC,R=301,L]

Replace the broken pipe "¦" characters with solid pipes ("|") before use; posting on this forum modifies the pipe characters.

Completely flush your browser cache before testing any new code.

Jim

markis

6:04 am on Oct 18, 2008 (gmt 0)

10+ Year Member



<font-size = "freakin' huge"> WOW! </font> Thank you, thank you ... -> infinity. Both did the trick neatly and easily, but I like the second one even better!

While I've got you on the phone, am I doing the R=301 thing right? I'm trying to impress on Google et al that the link is NOT the sloppy one that some webmaster linked to, but the new one. Anything else fishy / bad SEO / not-robust here?

For others' reference, here's the updated .htaccess file. I added A-Z0-9 to the first rule, because my site has other characters in the filenames as well. Works very well with preliminary tests:

Options -MultiViews
RewriteEngine on
#
RewriteRule ^([a-zA-Z0-9]+)$ http://www.example.com/$1.php [NC,R=301,L]
RewriteRule ^(.+)(/¦\.s?html?¦\.asp)$ http://www.example.com/$1.php [NC,R=301,L]
#
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule (.*) /404_not_found.php?$1 [L]

As Jim says, replace the broken pipe characters with solid ones before use.

g1smd

9:46 am on Oct 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In your script, the Location header on its own makes PHP return a 302 redirect.

You need an extra line, placed before that one, that sends a "301 Moved Permanently" status header.
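
A minimal sketch of that change, assuming the matched filename is already known. The function names are made up for illustration, and example.com is the thread's placeholder host. PHP's header() takes an optional third argument that sets the status code, which is one way to avoid the default 302:

```php
<?php
// Sketch only: build the absolute URL for a matched file.
// The host is the thread's placeholder; swap in the real one.
function redirect_target(string $file): string
{
    return 'http://www.example.com/' . rawurlencode($file);
}

// Send a permanent redirect and stop the script.
// Without the third argument, a Location header defaults to 302.
function redirect_permanently(string $file): void
{
    header('Location: ' . redirect_target($file), true, 301);
    exit;
}
```

Sending header("HTTP/1.1 301 Moved Permanently") before the Location header should have the same effect.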

Test using Live HTTP Headers for Firefox, and make sure you throw a good number of valid and non-valid URLs at it, both for www and non-www, with and without mixed case, with and without port number, parameters in a different order, parameters missing, extra parameters appended or within, with and without random trailing punctuation, and for all sorts of other expected and unexpected URLs.

Alternatively, build a list as a text file and run that list through Xenu LinkSleuth. I can run a test of 5000 URLs in just a few minutes. It always finds a logic error in my thoughts or in the coding of what I thought I had done.

jdMorgan

2:10 pm on Oct 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I added A-Z0-9

It is not necessary to add "A-Z" because the rule already has "a-z" in the pattern and an [NC] flag on it, making the pattern-match case insensitive. Adding the redundant uppercase sub-pattern only makes it run 33% slower, and doesn't change anything else.
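
To see the effect outside of Apache, here is a small sketch that uses PHP's preg_match() with the "i" modifier as a stand-in for the [NC] flag (both are PCRE-based). The function name is hypothetical, and the pattern is the lowercase-only version with digits added:

```php
<?php
// Sketch: the "i" modifier plays the role of mod_rewrite's [NC] flag,
// so the character class needs no explicit A-Z range.
function matches_nc(string $path): bool
{
    return preg_match('/^[a-z0-9]+$/i', $path) === 1;
}
```

Mixed-case names such as "ConTact" match, while anything containing a dot or slash does not.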

Jim

markis

8:55 pm on Oct 18, 2008 (gmt 0)

10+ Year Member



OK, thanks to you both very much. I'll change that header to a 301, and I'll take out the redundant A-Z thing, and test, test, test. Good calls.

I'm also going to monkey with that script to get it to take an educated "guess" at which page the link was intending to hit, i.e., ContacTR.oops would still return contact.php. If there's some doubt, it'll load the 404 page and give the user the option of picking the best one of, say, 5.

Thanks again, you two!

- m.

g1smd

9:00 pm on Oct 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As long as the correction is done as a 301 redirect, then it should work fine.

jdMorgan

9:31 pm on Oct 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



FYI: The proper response code for "Server can't figure it out automatically, so please pick one" is "300-Multiple Choices", and not "404-Not Found".
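
One way to read that in PHP terms, as a sketch: choose the status from the number of candidate pages the script found. The function name and the one/many/none mapping are assumptions for illustration, not something from the script in this thread:

```php
<?php
// Hypothetical helper: map the number of candidate pages to a status code.
function status_for_matches(int $matchCount): int
{
    if ($matchCount === 1) {
        return 301; // exactly one match: redirect permanently
    }
    if ($matchCount > 1) {
        return 300; // several plausible pages: let the visitor pick
    }
    return 404;     // nothing found
}

// Usage, assuming $candidates holds the matching filenames:
// http_response_code(status_for_matches(count($candidates)));
```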

Jim

markis

5:52 am on Oct 19, 2008 (gmt 0)

10+ Year Member



OK, thanks Jim. Actually, I modified the script to use PHP's similar_text() with an 80% similarity threshold, so it finds the closest match to the requested file. Now the 404 page only actually loads if it can't find anything at all. The process might be interesting to someone looking to do very basic URI error-checking (I haven't tested it completely yet, but it seems OK so far):

1. .htaccess strips any trailing '.' and '/'.
2. .htaccess adds a file extension to any file without one.
3. .htaccess changes any wrong common file extensions to the right one.
4. .htaccess calls the 404 page if the file isn't in the directory.
5. 404 page looks for a case-insensitive match.
6. 404 page looks for a "fuzzy" match at 80%.
7. If all of the above fails, the 404 page throws a 404 header and brings up an apology, a menu, and a search form.

* If any of the above successfully finds a page, the page is loaded with a 301 header.
* Any queries will be lost. Not a biggie for my site, but you could easily mod this to accommodate.
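
The modified script itself was not posted, so the following is only a guess at what step 6 could look like. The function name, the candidate list, and the choice to compare lowercased names are all assumptions. similar_text() reports the similarity as a percentage through its third argument:

```php
<?php
// Hypothetical helper: return the candidate filename most similar to
// $requested, or null if none reaches $threshold percent.
function closest_match(string $requested, array $candidates, float $threshold = 80.0): ?string
{
    $best = null;
    $bestPercent = 0.0;
    foreach ($candidates as $candidate) {
        // Compare lowercased names so case differences do not count.
        similar_text(strtolower($requested), strtolower($candidate), $percent);
        if ($percent >= $threshold && $percent > $bestPercent) {
            $best = $candidate;
            $bestPercent = $percent;
        }
    }
    return $best;
}

// Usage, with an illustrative list of the site's pages:
// $page = closest_match('contcat.php', ['contact.php', 'index.php', 'about.php']);
```

Where the threshold sits decides how forgiving the guess is, so it is worth replaying real 404s from the logs before settling on a value.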

With this process, every URL that threw a 404 on my site in the last 3 months now loads fine, except one really goofy one, which just goes to my 404 page.

Anyhow, thanks again Jim. You sure saved the day yesterday!

- m.

g1smd

8:07 am on Oct 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Please also check the HTTP headers using Live HTTP Headers for Firefox to make sure that any request (valid or non-valid) has only one response (or two for a redirect), and not a long "chain" of responses, correcting the URL one issue at a time.
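
For a scripted version of the same check, here is a rough sketch. It relies on PHP's get_headers(), which by default follows redirects and returns the headers of every response in sequence, so counting the status lines gives the length of the chain. The function name is made up for illustration:

```php
<?php
// Sketch: count how many HTTP responses appear in a get_headers() result.
// One means a direct answer, two means a single redirect, and more than
// two means a chain.
function count_responses(array $headers): int
{
    $count = 0;
    foreach ($headers as $line) {
        // Each response starts with a status line such as "HTTP/1.1 301 ...".
        if (is_string($line) && preg_match('#^HTTP/\d#', $line) === 1) {
            $count++;
        }
    }
    return $count;
}

// Usage (needs network access; get_headers() returns false on failure):
// $headers = get_headers('http://www.example.com/conTAct.htm');
// if ($headers !== false) { echo count_responses($headers), "\n"; }
```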