ht*p://www.site.com/string-string-digits?
I'd like to remove the question mark so that the URLs only ever display as:
ht*p://www.site.com/string-string-digits
but not if there are actually parameters, eg:
ht*p://www.site.com/string-string-digits?item=something
In my .htaccess, should I use a RewriteRule or a 301 redirect? I don't do this sort of thing very often, so some guidance would be appreciated.
Patrick
GET /string-string-digits? HTTP/1.1
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
RewriteRule ^([a-z]+-[a-z]+-[0-9]+/?)$ http://www.example.com/$1? [NC,R=301,L]
Jim
RewriteRule ^([a-z0-9-]+)$ /$1.php [L]
The aim is to be able to link to URLs like:
http://www.example.com/mypage
http://www.example.com/mypage1
http://www.example.com/my-page
http://www.example.com/my-page1
http://www.example.com/my1-page
http://www.example.com/my-page-1
I'm wondering if my RewriteRule is actually correct. According to my understanding, ([a-z0-9-]+) will match any combination of lowercase letters, hyphens, and digits, but now I'm not sure because I'm getting strange results when using a form button to "get" (eg):
http://www.example.com/my-page-1
http://www.example.com/my-page on the other hand is fine.
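One way to sanity-check the character class outside Apache is a short Python sketch. It assumes that Python's `re` module treats this simple pattern the same way Apache's regex engine does, and that in a per-directory .htaccess file the URL-path is tested without its leading slash. The `paths` list just repeats the example URLs above.

```python
import re

# Same pattern as in the RewriteRule. In .htaccess context the leading
# slash is assumed to be stripped before the pattern is applied.
PATTERN = re.compile(r"^([a-z0-9-]+)$")

# URL-paths taken from the example links (hypothetical page names).
paths = ["mypage", "mypage1", "my-page", "my-page1", "my1-page", "my-page-1"]

results = {p: bool(PATTERN.match(p)) for p in paths}

for path, matched in results.items():
    print(f"{path}: {'match' if matched else 'no match'}")
```

If every path reports a match, the pattern itself is unlikely to be what treats /my-page-1 differently from /my-page.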
[edited by: Patrick_Taylor at 12:08 pm (utc) on Jan. 15, 2007]
There are some FORM buttons that use GET to open a new page, but this adds a trailing question mark to the URL of the new page. I want to use my .htaccess file to remove those question marks, but only when there are no parameters like ?item=blue. My .htaccess file is now:
#
Options +FollowSymLinks
RewriteEngine On
#
# Canonical url fix
RewriteCond %{HTTP_HOST} ^my-site\.com [NC]
RewriteRule ^(.*)$ ht tp://www.my-site.com/$1 [R=301,L]
#
# Rewrite 'any-string' to 'any-string.php'
RewriteRule ^([a-z0-9-]*)$ /$1.php [L]
#
RewriteCond %{THE_REQUEST} [?]
RewriteRule ^(.*)\?$ ht tp://www.my-site.com/$1 [R=301,L]
#
What happens now is that when the button is used to open /page-page-2 a 404 error results - the browser has tried to open a page called:
http://www.example.com/page-page-2.php <- literally that URL, with "example"!
If I try to open /second-page the browser opens:
[my-site.com...] - no question mark but .php added onto the URL (my .htaccess is supposed to prevent this).
Added: Jim, sorry but I don't follow this:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/
[edited by: Patrick_Taylor at 7:05 pm (utc) on Jan. 15, 2007]
See my previous post about matching the entire browser request above. This pattern requires a "?" without any query parameters in order to match and invoke the rule. Refer to the regular-expressions tutorial cited in our charter, take that pattern apart token-by-token, and it should be clear.
A blank query string will not show up in the %{QUERY_STRING} variable, and a query string is never directly visible to the URL-path variable tested by RewriteRule. In any case, the "?" never appears in either variable. Therefore, you must use %{THE_REQUEST}, parse the entire request line, and look for a "?" followed by space-HTTP/ to detect a blank query string.
Options +FollowSymLinks
RewriteEngine On
#
# Remove question mark if blank query string
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
#
# Canonical domain fix
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# Rewrite 'matching-string' to 'matching-string.php'
RewriteRule ^([a-z0-9\-]+)$ /$1.php [L]
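To see why neither the RewriteRule pattern nor %{QUERY_STRING} can detect a blank query string, the following Python sketch splits a request line roughly the way the explanation above describes. The function name `split_request_line` and the sample request lines are hypothetical, and the splitting is a simplified model of what the server does, not Apache's actual code.

```python
def split_request_line(request_line):
    """Split 'GET /path?query HTTP/1.1' into (path, query).

    Simplified model: 'path' stands for what a RewriteRule pattern tests
    (without the leading slash, as in .htaccess) and 'query' stands for
    what %{QUERY_STRING} holds. Neither of them contains the "?".
    """
    method, target, version = request_line.split(" ")
    path, _, query = target.partition("?")
    return path.lstrip("/"), query


blank = split_request_line("GET /first-page? HTTP/1.1")
plain = split_request_line("GET /first-page HTTP/1.1")
with_params = split_request_line("GET /first-page?item=blue HTTP/1.1")

print(blank)        # ('first-page', '')
print(plain)        # ('first-page', '')
print(with_params)  # ('first-page', 'item=blue')
```

The first two results are identical, which is why only the raw request line in %{THE_REQUEST} can tell a trailing "?" apart from no "?" at all.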
When I type in:
ht*p://www.my-domain.com/first-page?
and go, becomes:
ht*p://www.my-domain.com/first-page (OK)
When I type in:
ht*p://www.my-domain.com/second-page?
and go, becomes:
ht*p://www.my-domain.com/second-page.php (not ok)
When I type in:
ht*p://www.my-domain.com/page-page-1?
and go, becomes:
ht*p://www.my-domain.com/page-page-1 (OK)
When I type in:
ht*p://www.my-domain.com/page-page-2?
and go, becomes:
ht*p://www.example.com/page-page-2.php (error 404)
The actual files are:
ht*p://www.my-domain.com/first-page.php
ht*p://www.my-domain.com/second-page.php
ht*p://www.my-domain.com/page-page-1.php
ht*p://www.my-domain.com/page-page-2.php
Aside from that, are you completely flushing your browser cache after changing/uploading your modified code? If not, you're likely seeing previously-cached responses and pages.
Otherwise, do I understand the problem to be that the "second-page-2" request is being externally redirected, thus exposing the "second-page-2.php" rewritten URL in your browser address bar? In this case, you'll need to look for reasons that the URL is being externally redirected *after* the internal rewrite, which is the last rule in the file (as I posted it).
Possible reasons include conflicting rules in other .htaccess files or server config files (e.g. httpd.conf) or that your host has set UseCanonicalName On, or that you have MultiViews (content-negotiation) enabled, etc. -- Something other than this code is forcing an external redirect and "exposing" your .php extension after the internal rewrite.
The same appears to be true for the "page-page" problem, as there is nothing in the code I posted that would do this. Since all three patterns match the entire requested URL-path, they would duplicate the entire URL-path if they malfunctioned, and not just part of the path.
You might want to take a look at your PHP script(s), and make sure that it does not generate any redirects to the non-canonical domain, and carefully inspect the links it is generating on your pages for other problems as well.
Whatever the exact cause, it appears to be outside the scope of the posted code.
Jim
Added: it's this part: ^[A-Z]{3,9}\ - I've seen this many times. What does that actually do? I know it matches any 3 to 9 capital letters followed by a space, but why? Is it for HTTP?
[edited by: Patrick_Taylor at 9:10 pm (utc) on Jan. 15, 2007]
To catch this situation, you'll need to examine "%{THE_REQUEST}" which is the exact request sent by the client (e.g. browser), and for this case, would look something like this:
GET /string-string-digits? HTTP/1.1
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/
"Match 3 to 9 uppercase letters, followed by a space, followed by a slash, followed by any number of any characters except for a question mark, followed by a question mark, followed by a space, followed by 'HTTP/' followed by anything else" (we don't care what the HTTP version number is). (The literal spaces and the question mark must be 'escaped' in the pattern with a '\', along with any other regex or mod_rewrite operators or quantifier tokens which may appear, such as ^$%.?*()+[]{}|\ or a space.)
---
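The token-by-token reading of the %{THE_REQUEST} pattern given above can be tried out with a small Python sketch. It assumes Python's `re` module handles these particular tokens the same way as Apache's regex engine. The helper name `has_blank_query` and the sample request lines are hypothetical, and the escaped spaces are written as `[ ]` only because verbose mode ignores bare whitespace.

```python
import re

REQUEST_PATTERN = re.compile(
    r"""
    ^[A-Z]{3,9}   # request method: 3 to 9 uppercase letters (GET, POST, ...)
    [ ]           # a literal space
    /             # the slash that starts the URL-path
    [^?]*         # any number of characters other than a question mark
    \?            # a literal question mark
    [ ]           # a literal space, so nothing may follow the question mark
    HTTP/         # start of the protocol version; the rest is not checked
    """,
    re.VERBOSE,
)


def has_blank_query(request_line):
    """Return True if the request line ends its URL with a bare '?'."""
    return REQUEST_PATTERN.search(request_line) is not None


print(has_blank_query("GET /string-string-digits? HTTP/1.1"))                 # True
print(has_blank_query("GET /string-string-digits HTTP/1.1"))                  # False
print(has_blank_query("GET /string-string-digits?item=something HTTP/1.1"))   # False
```

The leading `[A-Z]{3,9}` simply consumes the request method so that the rest of the pattern lines up with the URL that follows it.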
The "^(.*)$" and "(.*)" patterns are functionally equivalent, because the ".*" pattern is maximally-greedy -- it will always match as many characters as possible. Therefore any anchoring of this pattern is superfluous.
This greediness is also the reason why it should never be used more than once in any pattern. If used twice or more, the first instance will initially 'consume' as much of the input string as possible, and then the pattern-matching engine will (usually) have to retry many, many times, backing off one character at a time, to get a match. For this reason, a negative-match pattern such as the "[^?]*" I used above should be preferred, as it allows evaluation in a single left-to-right pass through the input string.
For example, the pattern "^([^/]+)/([^.]+)\.(.+)$" might be hundreds of times more efficient than "^(.*)/(.*)\.(.*)$" depending on the length of the requested URL-path that must be matched into the last two subpatterns.
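A rough way to compare the two styles is the Python sketch below. It assumes Python's `re` engine backtracks in broadly the same way as Apache's, which may not hold exactly, and the greedy pattern is written with an escaped dot so that both patterns look for a literal ".". The sample paths are hypothetical, and the timing figures vary by machine, so no expected numbers are shown.

```python
import re
import timeit

NEGATED = re.compile(r"^([^/]+)/([^.]+)\.(.+)$")   # negated character classes
GREEDY = re.compile(r"^(.*)/(.*)\.(.*)$")          # repeated greedy ".*"

path = "directory/filename.html"  # hypothetical URL-path

negated_groups = NEGATED.match(path).groups()
greedy_groups = GREEDY.match(path).groups()
print(negated_groups)  # ('directory', 'filename', 'html')
print(greedy_groups)   # ('directory', 'filename', 'html')

# Rough timing on a long path: the greedy version has to back off from
# the end of the string before it finds the slash and the dot.
long_path = "directory/" + "x" * 2000 + ".html"
for name, pattern in (("negated", NEGATED), ("greedy", GREEDY)):
    seconds = timeit.timeit(lambda: pattern.match(long_path), number=1000)
    print(f"{name}: {seconds:.4f}s for 1000 matches")
```

The two patterns are not interchangeable for every input: on a path with several slashes or dots they can split the string differently, so it is worth checking a replacement pattern against representative URLs before relying on it.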
Jim
Patrick
Added: it's rather like Latin. I was poor at Latin at school but could always see the beauty of it.
[edited by: Patrick_Taylor at 11:23 pm (utc) on Jan. 15, 2007]