Removing .php extension with .htaccess

From an older thread in forum92

         

Patrick Taylor

10:02 am on Nov 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This relates to an older thread in forum92 -> [webmasterworld.com...]

As Jim suggested, I have spent time reading the mod_rewrite and regular expressions documentation but there are two things I still don't understand, in:

# REDIRECT to non.php extension
RewriteCond %{REQUEST_URI} !^/index\.php$
RewriteCond %{REQUEST_URI} !^/excluded/(.+)/(.+)\.php$
# Use THE_REQUEST to prevent infinite loops
RewriteCond %{THE_REQUEST} ^GET\ /[^.]+\.php\ HTTP
RewriteRule ^([^.]+)\.php$ /$1 [R=301,L]

(1) Why is regex pattern ([^.]+) more efficient than (.*) and how does it do the same, when (as far as I can understand) ([^.]+) means one or more of not single arbitrary characters?

(2) How exactly does RewriteCond %{THE_REQUEST} ^GET\ /[^.]+\.php\ HTTP prevent infinite loops?

When searching the web for "prevent infinite loops" there are lots of references to this code, but no actual reasons why they are prevented.

Patrick

vincevincevince

10:08 am on Nov 6, 2007 (gmt 0)



1) Why is regex pattern ([^.]+) more efficient than (.*) and how does it do the same, when (as far as I can understand) ([^.]+) means one or more of not single arbitrary characters?

^.+ is more efficient than .* or .+ because it starts at the beginning of the string and can only move forwards. .* or .+ can start anywhere and end anywhere, possibly requiring multiple reads of the string instead of just starting at character 0 and moving forwards one at a time.

(2) How exactly does RewriteCond %{THE_REQUEST} ^GET\ /[^.]+\.php\ HTTP prevent infinite loops?

If the GET request ends in .php (as in the Cond you reference), then it has not been rewritten yet. If it was rewritten then it wouldn't be ending in .php!
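
A rough way to see this is the Python sketch below. It assumes Python's `re` module treats this pattern the same way Apache's regex engine does, and the request lines and the helper name `client_asked_for_php` are made up for illustration. The point it relies on is that THE_REQUEST holds the request line exactly as the client sent it, so an internal rewrite on the server does not change it.

```python
import re

# Pattern copied from the RewriteCond; Python's `re` is assumed to behave
# like Apache's regex engine for this expression.
THE_REQUEST_PATTERN = re.compile(r"^GET\ /[^.]+\.php\ HTTP")


def client_asked_for_php(request_line):
    """Return True when the client's own request line asks for a .php URL."""
    return THE_REQUEST_PATTERN.search(request_line) is not None


# The client typed /about.php, so the condition holds and the redirect fires.
first_request = "GET /about.php HTTP/1.1"

# After the 301 the client asks for /about. Even if the server then maps
# that to about.php internally, the request line stays as it was sent.
second_request = "GET /about HTTP/1.1"

print(client_asked_for_php(first_request))   # True
print(client_asked_for_php(second_request))  # False
```

Because the second request line never contains .php, the redirect rule cannot fire again for it, which is where the loop is broken.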

Patrick Taylor

10:16 am on Nov 6, 2007 (gmt 0)



Thanks for the answer. I now understand (2), but not (1), still.

It relates to ([^.]+). If the caret means not, then why doesn't the regex exclude everything instead of including everything as does (.*)? Sorry if I'm not explaining this very well.

vincevincevince

10:25 am on Nov 6, 2007 (gmt 0)



<correction> Not sure... I just realised you were talking about the [^.] inside the expression not that at the start.

Patrick Taylor

10:50 am on Nov 6, 2007 (gmt 0)



With the caret inside square brackets, [^.] will match any character that is not any character. I'm missing the logic somewhere.

jdMorgan

11:42 am on Nov 6, 2007 (gmt 0)



The period, when inside [] is not a regex token, it is a literal period. Therefore, the "[^.]+" pattern matches "one or more characters up until the next literal period," and thus "knows exactly when to stop matching," unlike the greedy and promiscuous ".*" which will initially match the entire string, and then have to repeatedly "back off" one character at a time from the end in order to allow the following "starved" subpatterns to match.

A pattern like "^[^/]+/[^.]+\.jpg$" can be matched in a single left-to-right pass, while a pattern like "^.*/.*\.jpg$" may take dozens or even hundreds of trial passes in order to match. Want to kill your server? Use three or more ".*"s in a row matching long-tailed requested URLs, and use several dozens of rules like that.

Uninformed use of many rules containing multiple ".*" subpatterns is one of the contributors to slow sites and early/unnecessary server upgrades -- In many cases, the performance difference is terribly obvious, even to us 'slow' humans.

If you're still not a believer, then test this yourself as described above on a busy server. :)

Jim
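
For anyone who wants to poke at the two styles of pattern, here is a small Python sketch. It assumes Python's `re` module matches these expressions the same way Apache does; the sample paths are invented, and capture groups have been added to the patterns from the post so the matched pieces can be printed. Python does not expose how many trial passes the engine made, so this only shows what each pattern ends up matching, not the cost.

```python
import re

url_path = "images/photos/holiday.jpg"

# [^.]+ consumes characters only until it reaches the first literal period.
stop_at_dot = re.match(r"[^.]+", url_path)
print(stop_at_dot.group(0))  # images/photos/holiday

# .* first grabs the whole string, then hands characters back one at a time
# until the rest of the pattern (/, then \.jpg) is able to match.
greedy = re.match(r"^(.*)/(.*)\.jpg$", url_path)
print(greedy.groups())  # ('images/photos', 'holiday')

# The more specific pattern, with groups added for illustration.
specific = re.match(r"^([^/]+)/([^.]+)\.jpg$", "images/holiday.jpg")
print(specific.groups())  # ('images', 'holiday')
```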

Patrick Taylor

2:58 pm on Nov 6, 2007 (gmt 0)



Jim, many thanks for the explanation, which is really very interesting. I am definitely a believer even if I'm slow on the uptake.

If a dot inside square brackets is a literal period (a full stop), the logic (to a non-programmer) could be hard to grasp, because [^.]+ seems to mean "match one or more of something that does not contain a full stop", without actually defining the 'something.'

Patrick

jdMorgan

4:49 pm on Nov 6, 2007 (gmt 0)



Just remember that regex works on a character basis, and things will be clearer. The two most useful (and analogous) descriptions of "[^.]+" are, "match one or more sequential characters not a full stop" or "match one or more characters until you find a full stop."

Also, you may need the 'other half' of this code -- a rule that maps a request for /requested-pagename back to /filename.php. On review, this may be what was missing in the original thread. So the whole solution (in addition to changing the links on your pages, which you're presumably doing in PHP) would be:


# Externally redirect direct client requests for .php files to non-.php URLs
RewriteCond %{THE_REQUEST} ^GET\ /([^/]+/)*[^.]+\.php(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]+/)*[^.]+)\.php$ http://www.example.com/$1 [R=301,L]
#
# Internally rewrite extensionless page URLs to PHP files
# if no extension or trailing slash on requested URL
RewriteCond %{REQUEST_URI} !(\.|/$)
# and if filename exists when .php is appended
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule (.+) /$1.php [L]

This code is generalized to handle any number of directory levels and/or appended query strings, and I removed the RewriteConds for excluded URLs, only the second of which might be needed for your site.

Jim
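
How the two rules work together can be sketched in Python. This is only a simulation under stated assumptions: Python's `re` stands in for Apache's regex engine, the set `EXISTING_FILES` is a hypothetical stand-in for the `-f` file test, `apply_rules` is an invented helper, and query strings are ignored when building the redirect target (Apache would normally carry them over).

```python
import re

# Hypothetical stand-in for the "-f" file-exists test on disk.
EXISTING_FILES = {"about.php", "blog/post.php"}

REQUEST_LINE = re.compile(r"^GET\ /([^/]+/)*[^.]+\.php(\?[^\ ]*)?\ HTTP/")
PHP_PATH = re.compile(r"^(([^/]+/)*[^.]+)\.php$")
HAS_DOT_OR_TRAILING_SLASH = re.compile(r"(\.|/$)")


def apply_rules(request_line, path):
    """Mimic one pass of both rules; path has no leading slash, as in .htaccess."""
    # Rule 1: the client itself asked for a .php URL, so redirect externally.
    php_match = PHP_PATH.match(path)
    if php_match and REQUEST_LINE.search(request_line):
        return ("redirect", "http://www.example.com/" + php_match.group(1))

    # Rule 2: extensionless URL whose .php file exists, so rewrite internally.
    no_extension = not HAS_DOT_OR_TRAILING_SLASH.search("/" + path)
    if no_extension and (path + ".php") in EXISTING_FILES:
        return ("rewrite", "/" + path + ".php")

    # Neither rule applies; the path is served as it stands.
    return ("pass", "/" + path)


# Direct request for the .php file: external redirect.
print(apply_rules("GET /about.php HTTP/1.1", "about.php"))
# ('redirect', 'http://www.example.com/about')

# Follow-up request for the clean URL: internal rewrite.
print(apply_rules("GET /about HTTP/1.1", "about"))
# ('rewrite', '/about.php')

# Second pass after that rewrite: the path now ends in .php, but the
# request line does not, so rule 1 stays quiet and nothing loops.
print(apply_rules("GET /about HTTP/1.1", "about.php"))
# ('pass', '/about.php')
```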

Patrick Taylor

6:00 pm on Nov 6, 2007 (gmt 0)



Jim, thanks so much. I'm going to digest that piece by piece.

The site in question is in WordPress, and the permalink structure (and internal linking) gives URLs like:

www.mysite.com/my-page

To remove unwanted trailing slashes (which WordPress seems to make available, even when the permalink structure is set not to) I have:

# REDIRECT to non trailing slash if not real directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ /$1 [R=301,L]

Patrick
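
As a rough illustration of what the trailing-slash rule does, here is a Python sketch. It assumes Python's `re` matches the rule's pattern the same way Apache does; the set `REAL_DIRECTORIES` is a hypothetical stand-in for the `-d` directory test, and `strip_trailing_slash` is an invented helper name.

```python
import re

# Hypothetical stand-in for the "-d" directory test on disk.
REAL_DIRECTORIES = {"wp-admin", "wp-content/uploads"}

TRAILING_SLASH = re.compile(r"^(.+)/$")


def strip_trailing_slash(path):
    """Return the redirect target, or None when the rule does not apply."""
    match = TRAILING_SLASH.match(path)
    if match is None:
        return None  # no trailing slash, nothing to do
    if match.group(1) in REAL_DIRECTORIES:
        return None  # a real directory keeps its slash
    return "/" + match.group(1)


print(strip_trailing_slash("my-page/"))   # /my-page
print(strip_trailing_slash("wp-admin/"))  # None
print(strip_trailing_slash("my-page"))    # None
```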

[edited by: Patrick_Taylor at 6:08 pm (utc) on Nov. 6, 2007]