Forum Moderators: coopster

Am I processing clean URLs correctly?

         

bernk

11:12 pm on Feb 15, 2012 (gmt 0)

10+ Year Member



I'm making a relatively small non-CMS based site and would like to use clean URLs in the form of:
mysite.com/home
mysite.com/therapy
mysite.com/therapy/acupuncture
etc.

I've used mod_rewrite to send all requests to /index.php which contains the following working code. Is there a better way I could do this? Is there a glaring problem with the way I've done it?

This is my first time using mod_rewrite and clean_urls so I'd like to be sure I haven't messed something up without knowing.


<?php

// Sanitize the request and put its parts into an array called $url_array
$url_request = strip_tags($_SERVER['REQUEST_URI']);
$url_array = explode("/", $url_request);

// Clean $url_array by removing empty elements
// The first element is always empty so let's shift it off
array_shift($url_array);

// If the last element is also empty then pop it off
if(end($url_array) == ""){
array_pop($url_array);
}

// if $url_array is empty, show home and exit this script
if (empty($url_array)){
include("home.html");
exit();
}

$filename = implode("-", $url_array) . ".html";

if ( file_exists($filename) ) {

// file exists, include it
include($filename);
exit();
} else {

// file does not exist, should return a 404
// header('HTTP/1.0 404 Not Found');
exit("<h1>404 Not Found</h1>\nThe page that you have requested could not be found. You may like to start over from <a href='/'>Home</a>");
}

?>


My .htaccess file looks like this:


Options -MultiViews

ErrorDocument 404 /404.html

RewriteEngine On
RewriteBase /

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d

RewriteRule . /index.php [L]


And just to be 100% sure, should all the hard-coded links look like
<a href="/therapy/acupuncture">Acupuncture</a>
throughout the site?

Thanks for any feedback!

g1smd

12:59 am on Feb 16, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's a very good start.

There's a way to make it much more efficient.

Remove the -f and -d checks. They require Apache to go look on the hard drive to see if things exist long before the point that content is actually going to be delivered. This is a very slow and inefficient process.

Rather than rewrite all requests that don't resolve to a physical file or folder you can instead rewrite all requests that match your extensionless URL format.

The " . " RegEx pattern becomes
^([^/.]+)$
for root or
^(([^/]+/)+[^/.]+)$
for folders, or similar. Yes, there are now two rules.
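
To see what the two patterns accept, here is a small PHP check. It is only a sketch: matches_clean_url is a made-up helper name, and the paths are written the way .htaccess sees them, without the leading slash.

```php
<?php
// Hypothetical helper: reports whether a path (no leading slash, as a
// RewriteRule in .htaccess sees it) matches either of the two patterns.
function matches_clean_url(string $path): bool
{
    $root_pattern   = '#^([^/.]+)$#';          // one segment, no dot
    $folder_pattern = '#^(([^/]+/)+[^/.]+)$#'; // folders, then a dot-free last segment

    return preg_match($root_pattern, $path) === 1
        || preg_match($folder_pattern, $path) === 1;
}

var_dump(matches_clean_url('therapy'));             // bool(true)
var_dump(matches_clean_url('therapy/acupuncture')); // bool(true)
var_dump(matches_clean_url('style.css'));           // bool(false)
var_dump(matches_clean_url('images/logo.png'));     // bool(false)
var_dump(matches_clean_url('therapy/'));            // bool(false)
```

Requests for real files such as style.css never reach index.php, because the dot in the last segment keeps them from matching.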

The code will run a lot faster.

Ahead of these two rewrites, make sure you place your standard non-www to www redirecting code using yet another RewriteRule.


And your code is the first in a very long time that actually addresses the problem "how to signal a 404 error for requests which have no content to return?" This shows you are thinking about the "big picture" of how the whole site works not just focussing on "how to rewrite a request".

bernk

1:38 am on Feb 16, 2012 (gmt 0)

10+ Year Member



Thanks very much for your reply, g1smd. It's much appreciated.

So my .htaccess file should look more like this? …regexes still scare me but I'm working on it.

Options -MultiViews

ErrorDocument 404 /404.html

RewriteEngine On
RewriteBase /

RewriteCond %{HTTP_HOST} !^www\.
RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301]

RewriteRule ^([^/.]+)$ /index.php
RewriteRule ^(([^/]+/)+[^/.]+)$ /index.php [L]


I'm afraid I didn't quite get the last two rewrite rules right. Another nudge?

As for the 404 stuff, it's funny you mention it because although I did provide a message to the user it's not an actual 404. You'll notice that I have
header('HTTP/1.0 404 Not Found')
commented out. The reason for that is that it was giving me a warning that it couldn't do it. I'm not at my work computer so I can't reproduce the error right now for a better explanation.

Thanks again!

g1smd

10:17 pm on Feb 16, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You almost got the rules right. The patterns are right. Every rule also needs the [L] flag.

For your sanity, put a blank line after every RewriteRule and add a # comment before each block of code.

The warning you had about 404 errors was likely that it was "too late" to send the HTTP header. You should place the PHP code that sends it BEFORE the point where the DOCTYPE and HTML page start being sent.
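
A minimal sketch of that ordering, assuming PHP 7.1 or later. The helper names resolve_page and send_page and the $base_dir parameter are made up for illustration; the home.html and 404.html names and the dash-joined file naming come from the script posted above.

```php
<?php
// Hypothetical helper: turns the request's path segments into a status code
// and a file to send. $base_dir is an assumed folder holding the .html files.
function resolve_page(array $segments, string $base_dir): array
{
    if ($segments === []) {
        return [200, $base_dir . '/home.html'];
    }

    $file = $base_dir . '/' . implode('-', $segments) . '.html';

    if (is_file($file)) {
        return [200, $file];
    }

    // Nothing to serve: report 404 and fall back to the error page.
    return [404, $base_dir . '/404.html'];
}

// Hypothetical helper: status first, page body second.
function send_page(int $status, string $file): void
{
    http_response_code($status); // must run before any output
    readfile($file);             // sends the file as-is, without running PHP in it
}

// Usage sketch, with $url_array built as in the original script:
// [$status, $file] = resolve_page($url_array, __DIR__);
// send_page($status, $file);
```

Checking with is_file() before sending anything avoids the include warning being printed ahead of the header.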

RewriteBase / is not needed here, because the rules rewrite to an absolute path (/index.php), so it can be left out.

bernk

12:52 pm on Feb 17, 2012 (gmt 0)

10+ Year Member



Alright, I've simplified the PHP a little down to:


<?php

// Report all PHP errors
error_reporting(E_ALL);

// Show what's running
// echo $_SERVER['PHP_SELF'] . "<br>";

// Sanitize the request and put its parts into an array called $url_array
$url_request = strip_tags($_SERVER['REQUEST_URI']);
$url_array = explode("/", $url_request);

// Clean $url_array by removing empty elements
// The first element is always empty so let's shift it off
array_shift($url_array);

// If the last element is also empty then pop it off
if(end($url_array) == ""){

array_pop($url_array);

}

// If $url_array is empty, show home and exit this script
if (empty($url_array)){

include("home.html");
exit();

}

// Create the file name to include
$filename = implode("-", $url_array) . ".html";

// Try to include the file, if fails then return 404
if ( !include($filename) ){

//header('HTTP/1.0 404 Not Found');
exit("404 <a href='/'>Home</a>");

}

?>


And the .htaccess looks like this now:


# Necessary for 1&1 - http://httpd.apache.org/docs/2.0/content-negotiation.html
Options -MultiViews

# Tell Apache to enable mod_rewrite
RewriteEngine On

# Redirect non-www requests to www
# RewriteCond %{HTTP_HOST} !^www\.
# RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]

RewriteRule ^([^/.]+)$ /index.php [L]
RewriteRule ^(([^/]+/)+[^/./?]+)$ /index.php [L]

ErrorDocument 404 /404.html

# 1&1 supported:
#
# ErrorDocument   Define your own error pages.
# AddType         Assign a MIME-Type to a file ending.
# RewriteEngine   Activate mod_rewrite module
# Allow/Deny      Host or IP based access control
# FilesMatch      File based access control
# AuthType        "Basic" password check
# Redirect        Redirection to another page or site
# Options         (de)activate index, symbolic links, etc.


Now I have a couple of problems. I guess the main one is my lack of skill with regular expressions. I tried adding /? to your rule, but I'm not sure if I put it in the correct spot.

The 404 header is something that's totally eluding me too—and you're right, that is in fact the error. When I request /something/that/does/not/exist I get a "no such file or directory" warning followed by a "cannot modify header information - headers already sent" warning and a 200 status.

The weird thing is that if I request /something/that/exists then something-that-exists.html is included, but when I request the same URL with a trailing slash I get a 404 status! I can't figure out why. First, my script should be dropping the empty element after the trailing slash, and second, how come that request loads the 404 while the other example does not!?

Any ideas? Anyone?

Thanks again for taking the time to help.

g1smd

1:06 pm on Feb 17, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



example.com/this-page is the canonical URL for a page.

example.com/this-folder/ is the canonical URL for a folder.

If you request example.com/this-folder you should be redirected to example.com/this-folder/

If you request example.com/this-page/ you should get a 404 error.
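
The four cases above can be sketched as one decision function. This is an illustration only: canonical_action is a made-up name, and $folders is an assumed list of URL paths that the site treats as folders, written without the trailing slash.

```php
<?php
// Hypothetical helper: decides what to do with a requested path.
// Returns [action, url], where action is 'serve', 'redirect' or 'not_found'.
function canonical_action(string $path, array $folders): array
{
    if ($path === '/') {
        return ['serve', '/'];
    }

    $has_slash = substr($path, -1) === '/';
    $trimmed   = rtrim($path, '/');

    if (in_array($trimmed, $folders, true)) {
        // Folders are canonical with the slash; redirect if it is missing.
        return $has_slash
            ? ['serve', $trimmed . '/']
            : ['redirect', $trimmed . '/'];
    }

    // Pages are canonical without the slash; the slashed form is a 404.
    return $has_slash ? ['not_found', $path] : ['serve', $path];
}

// Usage sketch:
// canonical_action('/this-folder', ['/this-folder']); // redirect to /this-folder/
// canonical_action('/this-page/',  ['/this-folder']); // not_found
```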

--

The server response to a URL request should be an HTTP header with technical details about the page or file that follows, then a blank line, followed by the actual HTML page or image/css/js file content.

As soon as your script sends anything at all, including a space, carriage return, or any characters, the headers have finished and the page content has begun. If you now try to send more header data, you are "too late".

It's a bit like writing an email and then halfway through the message writing "Oh! and CC: john.doe@example.com too". He would never see the message because the delivery address details are not in the header.
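
PHP can tell you whether it is already "too late". The sketch below uses the built-in headers_sent() and http_response_code() functions; try_send_404 is a made-up helper name, and logging the location with error_log() is just one possible way to report it.

```php
<?php
// Hypothetical helper: sends a 404 status only while headers are still open.
// Returns true if the status was set, false if output had already begun.
function try_send_404(): bool
{
    // headers_sent() fills $file and $line with the place where output started.
    if (headers_sent($file, $line)) {
        error_log("Too late for a 404 header: output began at $file line $line");
        return false;
    }

    http_response_code(404);
    return true;
}
```

If this returns false, the logged file and line point at whatever produced the early output, such as a stray space or a displayed warning.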

[edited by: g1smd at 1:09 pm (utc) on Feb 17, 2012]

bernk

1:06 pm on Feb 17, 2012 (gmt 0)

10+ Year Member



Just realized what part of the problem was—the error reporting itself! Of course the errors themselves are output as html and precede the header call. That's what was breaking it.

The trailing slash is still guilty of strangeness. When I request /something/that/does/not/exist/ I get a 404 status and the 404.html page I specified in the .htaccess file, but if I request the same thing without the trailing slash I don't get the 404.html, just a blank page despite the fact that I'm still getting a 404 status.

g1smd

1:20 pm on Feb 17, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You got it. Yes, the error message is the problem. You should put all the error messages into an array and then set up a <div> on the content page to display those messages. You can turn the display of the <div> on and off by using a $debug = 1; variable at the top of your script along with suitable on-page logic.

Good script design sees the top half doing all the "logic" stuff and the bottom half sending the DOCTYPE and the results to the browser.
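
A rough sketch of that layout, assuming PHP 7 or later. The function names check_page and render_debug_box and the "debug" class name are invented for the example; only the $debug = 1 switch and the messages-in-an-array idea come from the advice above.

```php
<?php
// Top half (logic): decide the status and collect messages. Nothing is echoed.
// Hypothetical helper: returns [status, messages] for the file to be served.
function check_page(string $filename): array
{
    $status   = 200;
    $messages = [];

    if (!is_file($filename)) {
        $status     = 404;
        $messages[] = "No such file: $filename";
    }

    return [$status, $messages];
}

// Bottom half (output): build the <div> that shows the collected messages.
// Hypothetical helper: returns '' when debugging is off or nothing was logged.
function render_debug_box(array $messages, int $debug): string
{
    if ($debug !== 1 || $messages === []) {
        return '';
    }

    $items = '';
    foreach ($messages as $message) {
        $items .= '<li>' . htmlspecialchars($message) . '</li>';
    }

    return '<div class="debug"><ul>' . $items . '</ul></div>';
}

// Usage sketch:
// $debug = 1;
// [$status, $messages] = check_page(__DIR__ . '/therapy.html');
// http_response_code($status);               // headers are settled here
// echo '<!DOCTYPE html>';                    // output starts only now
// echo render_debug_box($messages, $debug);
```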

In your script, make sure the path to the included 404 file is correct; otherwise the script will be looking in the wrong folder.
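
One common way to pin the path down is to build it from the script's own folder with the __DIR__ constant. In this sketch, local_file is a made-up helper name, and it assumes the 404 page sits next to index.php.

```php
<?php
// Hypothetical helper: builds a path inside this script's own folder, so the
// result does not depend on the current working directory or include_path.
function local_file(string $name): string
{
    // basename() drops any directory parts, e.g. '../x.html' becomes 'x.html'.
    return __DIR__ . '/' . basename($name);
}

// Usage sketch:
// include local_file('404.html');
```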