Removing a question mark from a URL

I know this has been asked before, but the thread is now closed

     
9:57 pm on Jan 11, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


There are some form buttons that open URLs like:

ht*p://www.site.com/string-string-digits?

I'd like to remove the question mark so that the URLs only ever display as:

ht*p://www.site.com/string-string-digits

but not if there are actually parameters, eg:

ht*p://www.site.com/string-string-digits?item=something

In my .htaccess, should I use a RewriteRule or a 301 redirect? I don't do this sort of thing very often, so some guidance would be appreciated.

Patrick

11:09 pm on Jan 11, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


To catch this situation, you'll need to examine "%{THE_REQUEST}" which is the exact request sent by the client (e.g. browser), and for this case, would look something like this:
GET /string-string-digits? HTTP/1.1

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

If only the string-string-digits URLs are giving you trouble, then try to make the RewriteRule pattern more specific to them, something like:

RewriteRule ^([a-z]+-[a-z]+-[0-9]+/?)$ http://www.example.com/$1? [NC,R=301,L]

which will accept only <letters>-<letters>-<numbers> followed by an optional trailing slash. That's as specific as I can get with your description of the problematic URLs.
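For anyone who wants to check which paths that narrower pattern accepts before uploading it, here is a minimal sketch in Python. It assumes Python's `re` module behaves like Apache's regex engine for a pattern this simple, with `re.IGNORECASE` standing in for the [NC] flag. The function name and the sample paths are made up for illustration.

```python
import re

# Same pattern as the RewriteRule above; re.IGNORECASE stands in for [NC].
# In .htaccess context the leading slash is not part of the tested path,
# so the sample paths below have none.
PATTERN = re.compile(r"^([a-z]+-[a-z]+-[0-9]+/?)$", re.IGNORECASE)


def matches(path: str) -> bool:
    """Return True if the URL-path would be accepted by the pattern."""
    return PATTERN.search(path) is not None


if __name__ == "__main__":
    for sample in ["string-string-123", "String-String-123/", "my-page", "page-page-2.php"]:
        print(sample, matches(sample))
```

Only the first two samples fit the <letters>-<letters>-<numbers> shape; the other two are rejected.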

Jim

11:38 pm on Jan 11, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


Thanks Jim. That's a great starter. I will study it carefully and have a go.

12:05 pm on Jan 15, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


I haven't solved this yet. My .htaccess file contains:

RewriteRule ^([a-z0-9-]+)$ /$1.php [L]

The aim is to be able to link to URLs like:

http://www.example.com/mypage
http://www.example.com/mypage1
http://www.example.com/my-page
http://www.example.com/my-page1
http://www.example.com/my1-page
http://www.example.com/my-page-1

I'm wondering if my RewriteRule is actually correct. According to my understanding, ([a-z0-9-]+) will match any combination of lowercase letters, hyphens, and digits, but now I'm not sure because I'm getting strange results when using a form button to "get" (eg):

http://www.example.com/my-page-1

http://www.example.com/my-page on the other hand is fine.
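As a quick sanity check of that pattern outside Apache, here is a minimal sketch in Python. It assumes Python's `re` module treats this pattern the same way Apache's regex engine does, and `rewrite_target` is a hypothetical helper that only imitates the substitution in the rule. It suggests the pattern itself accepts all six example paths, hyphens and digits included.

```python
import re
from typing import Optional

# The pattern from the RewriteRule, tested against the path without its leading slash.
PATTERN = re.compile(r"^([a-z0-9-]+)$")

# The six example paths listed above.
PATHS = ["mypage", "mypage1", "my-page", "my-page1", "my1-page", "my-page-1"]


def rewrite_target(path: str) -> Optional[str]:
    """Imitate 'RewriteRule ^([a-z0-9-]+)$ /$1.php': return the rewritten path or None."""
    match = PATTERN.search(path)
    return f"/{match.group(1)}.php" if match else None


if __name__ == "__main__":
    for path in PATHS:
        print(path, "->", rewrite_target(path))
```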

[edited by: Patrick_Taylor at 12:08 pm (utc) on Jan. 15, 2007]

6:49 pm on Jan 15, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


I'll try to explain this better.

There are some FORM buttons that use GET to open a new page, but this adds a trailing question mark onto the URL of the new page. I want to use my .htaccess file to remove those question marks, but only when there are no parameters like ?item=blue - so my .htaccess file is now:

#
Options +FollowSymLinks
RewriteEngine On
#
# Canonical url fix
RewriteCond %{HTTP_HOST} ^my-site\.com [NC]
RewriteRule ^(.*)$ ht tp://www.my-site.com/$1 [R=301,L]
#
# Rewrite 'any-string' to 'any-string.php'
RewriteRule ^([a-z0-9-]*)$ /$1.php [L]
#
RewriteCond %{THE_REQUEST} [?]
RewriteRule ^(.*)\?$ ht tp://www.my-site.com/$1 [R=301,L]
#

What happens now is that when the button is used to open /page-page-2 a 404 error results - the browser has tried to open a page called:

http://www.example.com/page-page-2.php <- literally that URL, with "example"!

If I try to open /second-page the browser opens:

[my-site.com...] - no question mark but .php added onto the URL (my .htaccess is supposed to prevent this).

Added: Jim, sorry but I don't follow this:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/

[edited by: Patrick_Taylor at 7:05 pm (utc) on Jan. 15, 2007]

7:15 pm on Jan 15, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


> Added: Jim, sorry but I don't follow this:
> RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/

See my previous post about matching the entire browser request above. This pattern requires a "?" without any query parameters in order to match and invoke the rule. Refer to the regular-expressions tutorial cited in our charter, take that pattern apart token-by-token, and it should be clear.

A blank query string will not show up in the %{QUERY_STRING} variable, and a query string is never directly visible to the URL-path variable tested by RewriteRule. In any case, the "?" never appears in either variable. Therefore, you must use %{THE_REQUEST}, parse the entire request line, and look for a "?" followed by space-HTTP/ to detect a blank query string.
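To illustrate why the parsed variables cannot tell the two cases apart, here is a minimal sketch in Python. It is only an analogy: `urlsplit` from the standard library stands in for the server's own URL parsing (an assumption, not a description of Apache internals), and the helper names are made up. The parsed query is the same empty string whether or not the "?" was sent; only the raw request line keeps the difference.

```python
from urllib.parse import urlsplit


def split_request_line(request_line: str):
    """Split 'METHOD target HTTP/x.y' into its three space-separated parts."""
    method, target, version = request_line.split(" ")
    return method, target, version


def parsed_query(request_line: str) -> str:
    """Query string as a URL parser reports it (stand-in for %{QUERY_STRING})."""
    _, target, _ = split_request_line(request_line)
    return urlsplit(target).query


def raw_target_ends_with_question_mark(request_line: str) -> bool:
    """Look at the unparsed request target, as a test on %{THE_REQUEST} does."""
    _, target, _ = split_request_line(request_line)
    return target.endswith("?")


if __name__ == "__main__":
    for line in ["GET /first-page? HTTP/1.1", "GET /first-page HTTP/1.1"]:
        print(repr(parsed_query(line)), raw_target_ends_with_question_mark(line))
```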


Options +FollowSymLinks
RewriteEngine On
#
# Remove question mark if blank query string
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
#
# Canonical domain fix
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# Rewrite 'matching-string' to 'matching-string.php'
RewriteRule ^([a-z0-9\-]+)$ /$1.php [L]

Jim

8:19 pm on Jan 15, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


Thanks, but it's most peculiar. I have used your suggested .htaccess and things are much the same as I described above. I have also tried typing in some of the test URLs in the browser address bar to see what happens - eg:

When I type in:
ht*p://www.my-domain.com/first-page?
and go, becomes:
ht*p://www.my-domain.com/first-page (OK)

When I type in:
ht*p://www.my-domain.com/second-page?
and go, becomes:
ht*p://www.my-domain.com/second-page.php (not ok)

When I type in:
ht*p://www.my-domain.com/page-page-1?
and go, becomes:
ht*p://www.my-domain.com/page-page-1 (OK)

When I type in:
ht*p://www.my-domain.com/page-page-2?
and go, becomes:
ht*p://www.example.com/page-page-2.php (error 404)

The actual files are:
ht*p://www.my-domain.com/first-page.php
ht*p://www.my-domain.com/second-page.php
ht*p://www.my-domain.com/page-page-1.php
ht*p://www.my-domain.com/page-page-2.php

8:46 pm on Jan 15, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


It appears from your last example that you forgot to modify "example.com" --our only approved "example" domain here-- to your own domain name in one or more rules.

Aside from that, are you completely flushing your browser cache after changing/uploading your modified code? If not, you're likely seeing previously-cached responses and pages.

Otherwise, do I understand the problem to be that the "second-page-2" request is being externally redirected, thus exposing the "second-page-2.php" rewritten URL in your browser address bar? In this case, you'll need to look for reasons that the URL is being externally redirected *after* the internal rewrite, which is the last rule in the file (as I posted it).

Possible reasons include conflicting rules in other .htaccess files or server config files (e.g. httpd.conf) or that your host has set UseCanonicalName On, or that you have MultiViews (content-negotiation) enabled, etc. -- Something other than this code is forcing an external redirect and "exposing" your .php extension after the internal rewrite.

The same appears to be true for the "page-page" problem, as there is nothing in the code I posted that would do this; since all three patterns match the entire requested URL-path, they would duplicate the entire URL-path if they malfunctioned, and not just part of the path.

You might want to take a look at your PHP script(s), and make sure that it does not generate any redirects to the non-canonical domain, and carefully inspect the links it is generating on your pages for other problems as well.

Whatever the exact cause, it appears to be outside the scope of the posted code.

Jim

9:03 pm on Jan 15, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


Jim, many thanks again. I didn't doubt your advice for one tiny instant. In fact I hadn't completely cleared the cache but was relying on the usual CTRL/F5. All is well now, except that I still want to understand the workings of the solution. This always happens... I like to know what's happening but since I only have an .htaccess issue every few weeks or so, what laboriously gets learnt gets partly forgotten. It gets worse as one gets older.

Added: it's this part: ^[A-Z]{3,9}\ - I've seen this many times. What does that actually do? I know it matches any 3 to 9 capital letters, but why? Is it for HTTP?

[edited by: Patrick_Taylor at 9:10 pm (utc) on Jan. 15, 2007]

9:16 pm on Jan 15, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


Also, are these the same? ->

RewriteRule ^(.*)$ etc etc
RewriteRule (.*) etc etc

9:50 pm on Jan 15, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


As previously stated, the pattern matches an entire HTTP request line, as sent by the client (browser, bot, etc.), with the exception of the HTTP version number:

> To catch this situation, you'll need to examine "%{THE_REQUEST}" which is the exact request sent by the client (e.g. browser), and for this case, would look something like this:
> GET /string-string-digits? HTTP/1.1

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^?]*\?\ HTTP/

"Match 3 to 9 uppercase letters, followed by a space, followed by a slash, followed by any number of any characters except for a question mark, followed by a question mark, followed by a space, followed by 'HTTP/' followed by anything else" (we don't care what the HTTP version number is). (The literal spaces and the question mark must be 'escaped' in the pattern with a '\', along with any other regex or mod_rewrite operators or quantifier tokens which may appear, such as ^$%.?*()+[]{}|\ or a space.)

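The same token-by-token reading can be tried out with a minimal sketch in Python. It assumes Python's `re` module treats this pattern the way Apache's regex engine does; `re.VERBOSE` is used only so that each token can carry a comment, and the function name is made up for illustration.

```python
import re

# The RewriteCond pattern, spread over several lines so each token can be annotated.
REQUEST_LINE = re.compile(
    r"""
    ^[A-Z]{3,9}   # the method: 3 to 9 uppercase letters (GET, POST, OPTIONS, ...)
    \             # a literal space, escaped
    /             # the slash that starts the URL-path
    [^?]*         # any number of characters that are not a question mark
    \?            # a literal question mark
    \             # a literal space, so nothing may sit between the question mark and the protocol
    HTTP/         # start of the protocol version; the version number itself is not checked
    """,
    re.VERBOSE,
)


def is_blank_query_request(request_line: str) -> bool:
    """Return True if the request line has a '?' with no query parameters after it."""
    return REQUEST_LINE.search(request_line) is not None


if __name__ == "__main__":
    for line in [
        "GET /string-string-digits? HTTP/1.1",
        "GET /string-string-digits HTTP/1.1",
        "GET /string-string-digits?item=something HTTP/1.1",
    ]:
        print(line, "->", is_blank_query_request(line))
```

Only the first request line matches. The second has no question mark at all, and in the third the question mark is followed by parameters rather than by the space before HTTP/.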
---

The "^(.*)$" and "(.*)" patterns are functionally equivalent, because the ".*" pattern is maximally-greedy -- it will always match as many characters as possible. Therefore any anchoring of this pattern is superfluous.
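A minimal sketch in Python of that equivalence, assuming Python's `re` module is a fair stand-in for Apache's regex engine and that the input is a single-line URL-path (no newline characters). The helper name and sample paths are made up for illustration.

```python
import re
from typing import Optional


def captured(pattern: str, path: str) -> Optional[str]:
    """Return what the first group captures, or None if the pattern does not match."""
    match = re.search(pattern, path)
    return match.group(1) if match else None


if __name__ == "__main__":
    for path in ["", "my-page-1", "dir/file.html"]:
        # Anchored and unanchored forms capture the whole path either way.
        print(repr(captured(r"^(.*)$", path)), repr(captured(r"(.*)", path)))
```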

This greediness is also the reason why it should never be used more than once in any pattern; if used twice or more, the first instance will initially 'consume' as much of the input string as possible, and then the pattern-matching engine will (usually) have to retry many, many times, backing off one character at a time, to get a match. For this reason, a negative-match pattern such as the "[^?]*" I used above should be preferred, as it allows evaluation in a single left-to-right pass through the input string.

For example, the pattern "^([^/]+)/([^.]+)\.(.+)$" might be hundreds of times more efficient than "^(.*)/(.*)\.(.*)$", depending on the length of the requested URL-path that must be matched into the last two subpatterns.
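A minimal sketch in Python comparing what the two styles capture, assuming Python's `re` module is a fair stand-in for Apache's regex engine. The greedy pattern is written here with its dot escaped so the two are comparable, and the sample paths are made up. The sketch does not measure speed; it only shows that the patterns agree on a simple directory/file.extension path and split a path with extra slashes or dots differently, so swapping one for the other is worth testing against real URLs.

```python
import re

SPECIFIC = re.compile(r"^([^/]+)/([^.]+)\.(.+)$")  # negated character classes
GREEDY = re.compile(r"^(.*)/(.*)\.(.*)$")          # three greedy ".*" groups


def groups(pattern: "re.Pattern[str]", path: str):
    """Return the tuple of captured groups, or None if the pattern does not match."""
    match = pattern.search(path)
    return match.groups() if match else None


if __name__ == "__main__":
    for path in ["dir/file.html", "a/b/c.tar.gz"]:
        print(path, groups(SPECIFIC, path), groups(GREEDY, path))
```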

Jim

11:13 pm on Jan 15, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 30, 2003
posts:932
votes: 0


Jim, thanks ever so much for that. The penny finally dropped: %{THE_REQUEST} is not a URL. Sorry it took me a while to see it. Your patient explanation is pure gold - great stuff.

Patrick

Added: it's rather like Latin. I was poor at Latin at school but could always see the beauty of it.

[edited by: Patrick_Taylor at 11:23 pm (utc) on Jan. 15, 2007]