As Jim suggested, I have spent time reading the mod_rewrite and regular expressions documentation but there are two things I still don't understand, in:
# REDIRECT to non-.php extension
RewriteCond %{REQUEST_URI} !^/index\.php$
RewriteCond %{REQUEST_URI} !^/excluded/(.+)/(.+)\.php$
# Use THE_REQUEST to prevent infinite loops
RewriteCond %{THE_REQUEST} ^GET\ /[^.]+\.php\ HTTP
RewriteRule ^([^.]+)\.php$ /$1 [R=301,L]
(1) Why is regex pattern ([^.]+) more efficient than (.*) and how does it do the same, when (as far as I can understand) ([^.]+) means one or more of not single arbitrary characters?
(2) How exactly does RewriteCond %{THE_REQUEST} ^GET\ /[^.]+\.php\ HTTP prevent infinite loops?
When searching the web for "prevent infinite loops" there are lots of references to this code, but no actual reasons why they are prevented.
Patrick
1) Why is regex pattern ([^.]+) more efficient than (.*) and how does it do the same, when (as far as I can understand) ([^.]+) means one or more of not single arbitrary characters?
[^.]+ is more efficient than .* or .+ because it can only move forwards through the string and has to stop at the first period. .* or .+ first grabs everything up to the end of the string and then has to give characters back one at a time until the rest of the pattern fits, possibly requiring multiple reads of the string instead of just starting at character 0 and moving forwards one at a time.
(2) How exactly does RewriteCond %{THE_REQUEST} ^GET\ /[^.]+\.php\ HTTP prevent infinite loops?
A pattern like "^[^/]+/[^.]+\.jpg$" can be matched in a single left-to-right pass, while a pattern like "^.*/.*\.jpg$" may take dozens or even hundreds of trial passes in order to match. Want to kill your server? Use three or more ".*"s in a row matching long-tailed requested URLs, and use several dozens of rules like that.
Uninformed use of many rules containing multiple ".*" subpatterns is one of the contributors to slow sites and early, unnecessary server upgrades. In many cases the performance difference is obvious, even to us 'slow' humans.
If you're still not a believer, then test this yourself as described above on a busy server. :)
Jim
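To make the "single pass versus many trial passes" point concrete, here is a rough sketch in Python. It is a toy backtracking matcher written only for illustration, not how Apache's regex library is actually implemented, and real engines apply optimizations this ignores. The names count_steps, negated, greedy and the sample URL are all made up for this example.

```python
# Toy backtracking matcher: NOT the real PCRE algorithm, just a way to count trials.
def count_steps(tokens, text):
    """Return (matched, steps) for a fully anchored match of tokens against text."""
    steps = 0

    def match(ti, pos):
        nonlocal steps
        steps += 1
        if ti == len(tokens):
            return pos == len(text)          # "$": must have reached the end
        kind, arg = tokens[ti]
        if kind == "lit":                    # exactly one literal character
            return pos < len(text) and text[pos] == arg and match(ti + 1, pos + 1)
        # Greedy quantifiers: "any" models .*   (zero or more of anything),
        #                     "not" models [^x]+ (one or more characters other than x).
        minimum = 0 if kind == "any" else 1
        end = pos
        while end < len(text) and (kind == "any" or text[end] != arg):
            end += 1
        # Try the longest stretch first, then give characters back one at a time.
        for stop in range(end, pos + minimum - 1, -1):
            if match(ti + 1, stop):
                return True
        return False

    return match(0, 0), steps


def lits(s):
    return [("lit", c) for c in s]


negated = [("not", "/"), ("lit", "/"), ("not", ".")] + lits(".jpg")   # ^[^/]+/[^.]+\.jpg$
greedy = [("any", None), ("lit", "/"), ("any", None)] + lits(".jpg")  # ^.*/.*\.jpg$

url = "images/holiday/2007/beach/photo.jpg"   # hypothetical request path
print("negated class:", count_steps(negated, url))
print("greedy .*    :", count_steps(greedy, url))
```

In this toy model both patterns match the sample URL, but the ".*" version needs more trial steps, and the gap widens on URLs that do not match at all (try changing ".jpg" to ".png" in the URL).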
If a dot inside square brackets is a literal period (a full stop), the logic (to a non-programmer) could be hard to grasp, because [^.]+ seems to mean "match one or more of something that does not contain a full stop", without actually defining the 'something.'
Patrick
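On the question of what the 'something' is: inside square brackets the dot is a literal period, the leading ^ negates the class, and the class as a whole stands for one single character. So [^.]+ reads as "one or more characters, none of which is a period". A quick way to see this is to try the pattern outside Apache. The sketch below uses Python's re module as a stand-in (its syntax agrees with Apache's for this simple pattern, but that is an assumption to verify on your own server), and the helper name and sample paths are invented for the example.

```python
import re

# Pattern from the RewriteRule above; per-directory paths have no leading slash.
NO_DOTS = re.compile(r"^([^.]+)\.php$")
ANYTHING = re.compile(r"^(.*)\.php$")


def captured(pattern, path):
    """Return what $1 would hold, or None if the pattern does not match."""
    m = pattern.match(path)
    return m.group(1) if m else None


for path in ["about.php", "blog/post.php", "archive.2007.php", ".php"]:
    print(path, "| [^.]+ ->", captured(NO_DOTS, path), "| .* ->", captured(ANYTHING, path))
```

The two patterns capture the same thing only when the path contains no period other than the one before "php". A path such as archive.2007.php is matched by (.*) but rejected by ([^.]+).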
Also, you may need the 'other half' of this code -- a rule that maps a request for /requested-pagename back to /filename.php. On review, this may be what was missing in the original thread. So the whole solution (in addition to changing the links on your pages, which you're presumably doing in PHP) would be:
# Externally redirect direct client requests for .php files to non-.php URLs
RewriteCond %{THE_REQUEST} ^GET\ /([^/]+/)*[^.]+\.php(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]+/)*[^.]+)\.php$ http://www.example.com/$1 [R=301,L]
#
# Internally rewrite extensionless page URLs to PHP files
# if no extension or trailing slash on requested URL
RewriteCond %{REQUEST_URI} !(\.|/$)
# and if filename exists when .php is appended
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule (.+) /$1.php [L]
Jim
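As for why the THE_REQUEST condition stops the loop: THE_REQUEST holds the request line exactly as the browser sent it, and it does not change when a URL is rewritten internally. Once the second rule maps /page to /page.php, the rules in .htaccess are run again on the new path, and the first rule would redirect it straight back to /page unless it checks what the client originally asked for. The Python sketch below is a much simplified model of that behavior, not Apache itself. The function name, the hop limit, and the assumption that the .php file always exists are all invented for illustration.

```python
import re

# Stand-ins for the two rule patterns above (per-directory paths have no leading slash).
REDIRECT_RULE = re.compile(r"^(([^/]+/)*[^.]+)\.php$")
CLIENT_ASKED_FOR_PHP = re.compile(r"^GET /([^/]+/)*[^.]+\.php(\?[^ ]*)? HTTP/")


def simulate(first_request_line, check_the_request, max_hops=6):
    """List what happens for one visit; gives up after max_hops client requests."""
    events = []
    request_line = first_request_line
    for _ in range(max_hops):
        path = request_line.split(" ")[1].lstrip("/")
        while True:
            m = REDIRECT_RULE.match(path)
            allowed = (not check_the_request) or CLIENT_ASKED_FOR_PHP.match(request_line)
            if m and allowed:
                # External 301: the client comes back with a brand new request line.
                events.append("301 -> /" + m.group(1))
                request_line = "GET /" + m.group(1) + " HTTP/1.1"
                break
            if "." not in path and not path.endswith("/"):
                # Internal rewrite (assumes the .php file exists); rules run again.
                path += ".php"
                events.append("rewrite -> /" + path)
                continue
            events.append("serve /" + path)
            return events
    events.append("gave up: redirect loop")
    return events


print(simulate("GET /about.php HTTP/1.1", check_the_request=True))
print(simulate("GET /about.php HTTP/1.1", check_the_request=False))
```

With the check in place the visit ends after one redirect and one internal rewrite. Without it, the model keeps bouncing between the redirect and the rewrite until the hop limit is reached.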
The site in question is in WordPress, and the permalink structure (and internal linking) gives URLs like:
www.mysite.com/my-page
To remove unwanted trailing slashes (which WordPress seems to make available, even when the permalink structure is set not to) I have:
# REDIRECT to non trailing slash if not real directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ /$1 [R=301,L]
Patrick
[edited by: Patrick_Taylor at 6:08 pm (utc) on Nov. 6, 2007]