Strip query string and question mark using mod rewrite

         

Willscrlt

11:31 am on Mar 14, 2007 (gmt 0)

10+ Year Member



I wanted to add this to the end of the existing topic, but the topic is closed to new posts. The information in the earlier topic is useful:
[webmasterworld.com...]

I think that Justin was the one who led me to the solution that eventually worked for my needs.

I have a Joomla content management system with a search engine friendly add-on. Before I added the SEF URLs, I embedded hyperlinks into some PDF files and other documentation that have been distributed far and wide. The problem is that I added a query string to the URL for tracking purposes (http://www.example.com/?ref=SOURCE). The search engine friendly URL plugin is confused by the query string, and clicking through on one of those embedded links receives the message "You are not authorised to view this resource." That is not something I want my clients to see, obviously.

I figured it should be a pretty easy task to just strip the query string from the URL using Apache mod_rewrite. It turns out that it is pretty easy to implement (2 lines in my .htaccess file do the magic), but it took me nearly 4 hours of research and trial and error to work out the exact syntax that did what I needed it to do.

I'd also like to point out something that's probably really obvious for experienced mod_rewrite coders, but for people struggling enough with regular expressions, this tip might be very useful.

In mod_rewrite, the RewriteRule must match before the RewriteCond is even evaluated. In other words, even though the rewrite condition is listed first, it's not processed until the rewrite rule detects a match.

The easy fix (though probably not the most efficient, and if you want to improve your server's performance, a more specific match seems like a good idea) is to have the RewriteRule match any URL. Then the RewriteCond always will be checked. It's much more intuitive to humans that way, but the server probably does not appreciate the extra work.
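A minimal sketch of that evaluation order, assuming a per-directory .htaccess context. The "^content/" pattern and the "ref=" condition here are illustrative only and are not the rules discussed in the rest of this post:

```apache
RewriteEngine On

# Evaluated second: mod_rewrite only looks at this condition after the
# RewriteRule pattern below has already matched the URL-path.
RewriteCond %{QUERY_STRING} ^ref= [NC]

# Evaluated first: the pattern "^content/" is tested against the URL-path.
# If it does not match, the RewriteCond above is never consulted.
# "-" means "no substitution"; this rule only demonstrates the ordering.
RewriteRule ^content/ - [L]
```

Swapping "^content/" for a pattern that matches every request is what makes the condition run for every URL.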

Here is what I wanted to happen:
1) If the query string only contains my "ref" element, then erase the entire query string from the URL.
2) If the query string is empty, but a question mark is at the end of the request, strip the question mark from the address.
3) If there are other things in the query string, leave the entire query string alone. Maybe in the future I will have some other plugin that processes the query string before the search engine friendly URL add-on does its magic.

The first challenge is even detecting that pesky question mark. It does not appear in the variables you would expect, because it is the delimiter between the REQUEST_URI and the QUERY_STRING. As such, it isn't in either of those variables. The only place you can find it is in the mod_rewrite special variable THE_REQUEST.

THE_REQUEST is the entire first line of the HTTP 1.1 (or HTTP 1.0 I suppose) request. For example: "GET /?ref=SOURCE HTTP/1.1".

Now that we know where it is, the tricky part is getting rid of it. You do that in the RewriteRule section by placing a ? after the full URL in the substitution. Fortunately, that strips the entire query string instead of leaving a question mark behind.
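To illustrate the trailing question mark on its own, here is a small sketch. The paths "old-page" and "new-page" are hypothetical. The [QSD] flag in the commented-out alternative exists only in Apache 2.4 and later, so it is an assumption about the server version rather than part of the setup described here:

```apache
# Request: /old-page?ref=SOURCE
# The "?" ending the substitution gives the target an empty query string,
# so the redirect goes to /new-page and not to /new-page?ref=SOURCE.
RewriteRule ^old-page$ /new-page? [R=301,L]

# Apache 2.4+ alternative: [QSD] (query string discard) has the same effect
# without the trailing "?".
# RewriteRule ^old-page$ /new-page [R=301,QSD,L]
```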

My RewriteRule matches using the following regexp: ".?". That basically means zero or one of any character, and because the pattern is not anchored, it matches every request. (I originally tried "^(.*)", but for some reason that permutation did not work, even though they seem nearly identical to me.)

Assuming that there is such a match (and there should always be one), my RewriteCond is checked. I evaluate the %{THE_REQUEST} variable for the following regexp: \?(ref=([_0-9a-z-])*)?\ HTTP [NC]

It basically says that anywhere directly preceding the HTTP part of the request, look for a question mark (the backslash escapes the question mark) optionally followed by my special ref element. The ref can be empty or contain any number of letters, numbers, dashes, or underscores, but no other punctuation. The next backslash followed by a space indicates the required single space between the REQUEST_URI (and the QUERY_STRING) and the rest of THE_REQUEST. Since ampersands ("&") and most other punctuation are excluded, and since the ref tag has to immediately follow the ?, any other elements in the QUERY_STRING will cause this check to fail and the rule will be ignored. Also, if there is no question mark, then the check will fail, and the rule will be ignored. The [NC] at the end simply makes the entire check case insensitive. I could also have used [_0-9a-zA-Z-] to allow for upper and mixed case ref values, but then the "ref=" name itself would only be matched in lower case, and I didn't want to take the chance of missing one.
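The same pattern, broken into its pieces as comments. The second, commented-out condition is a tidier variant offered as an assumption, not the version used in this post: it moves the * inside the group and adds the slash after HTTP, and it should accept the same requests.

```apache
# \?                     a literal question mark (the backslash escapes it)
# (ref=([_0-9a-z-])*)?   optionally "ref=" plus zero or more allowed characters
# \ HTTP                 an escaped space, then the start of the protocol token
RewriteCond %{THE_REQUEST} \?(ref=([_0-9a-z-])*)?\ HTTP [NC]

# Tidier variant (assumed equivalent for matching purposes):
# RewriteCond %{THE_REQUEST} \?(ref=[_0-9a-z-]*)?\ HTTP/ [NC]
```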

Here is my complete code:

# Once the RewriteRule pattern has matched, check whether the request ends in a lone question mark or in only the 'ref' element
RewriteCond %{THE_REQUEST} \?(ref=([_0-9a-z-])*)?\ HTTP [NC]
# The pattern .? matches any request, so whenever the condition above is true, strip the entire query string and do a 301 redirect
RewriteRule .? http://www.example.com%{REQUEST_URI}? [R=301,L]

Here is a more generic version which only strips a dangling question mark (one followed by an empty query string) from the end of a URI:

RewriteCond %{THE_REQUEST} \?\ HTTP [NC]
RewriteRule .? http://www.example.com%{REQUEST_URI}? [R=301,L]

I place this as the FIRST set of rules checked. In the original thread, it was mentioned that Justin's version caused serious headaches in some cases. I followed Justin's later advice and moved it first, and I didn't have any problems. The regular expression checks are even more specific now, so, hopefully, even more problems will be avoided for others.

Example links showing how it works:
http://www.example.com/content/category/7/34/78/? (Stripped)
http://www.example.com/content/category/7/34/78/?ref= (Stripped)
http://www.example.com/content/category/3/7/63/?ref=WbmstrWrld (Stripped)
http://www.example.com/content/category/3/7/63/?ref=WbmstrWrld?foo=bar (Ignored, invalid link)
http://www.example.com/content/category/7/?/34/78/? (Glitch! Truncates at the first ?, not the last.)

I'd appreciate any suggestions on how to fix the "glitch" in the last example. What I hoped it would do is chop off the last question mark and leave the one in the middle. Looking at the matching regexp, I see WHY it does what it does, but not how to fix it.

Also, if anyone has ideas for optimizing the matching part of the RewriteRule for improved performance, I'd really appreciate those suggestions, too. Thanks in advance!

BTW, I'm not "new" at this, but I only seem to tweak my .htaccess file once or twice a year, and it's almost like I'm starting fresh every time. So frustrating. :-) Anyway, I hope this helps someone else and saves them a lot of time!

--Willscrlt

[edited by: jdMorgan at 2:40 pm (utc) on Mar. 14, 2007]
[edit reason] Example.com. Please see TOS & forum charter. [/edit]

jdMorgan

2:55 pm on Mar 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The "glitch" you encountered is to be expected, since a question mark isn't valid in a URL, but only in an attached query string -- See RFC2396 [faqs.org] for more info.

Apache therefore "believes" that the URL ends at "/content/category/7/" and that the following "?/34/78/?" is the query string. Unless you change that "?" to some other character, I doubt that there is a work-around, as Apache is simply complying with the HTTP/1.x specifications on URL-parsing.

In other words, by the time your mod_rewrite code gets control, Apache has already parsed "/content/category/7/" into the %{REQUEST_URI} variable and "/34/78/?" into %{QUERY_STRING}.

Even if you encode that question mark as %253f, the behaviour is likely to be the same.

If the problem is important to you, I suppose you could rewrite in this special case, re-assigning the first part of the (now) "?/34/78/?" QUERY_STRING variable to the REQUEST_URI, but I suspect you'll need to use another character as a "token" to replace that initial question mark; it usually proves difficult to overcome URL restrictions imposed by the protocol and hard-coded into the server.

Jim

Willscrlt

3:43 am on Mar 15, 2007 (gmt 0)

10+ Year Member



Oh yeah. Duh! :-)

It makes perfect sense that it would behave that way, because question marks just should not appear more than once in a proper URL.

Not really to worry, though. The "glitchy" example was just an experiment to fully test the code, not an actual URL that I need it to parse.

I don't really know enough about the raw HTTP request (what THE_REQUEST returns). I'm guessing, based on what you just said, that there should only be one question mark within any proper GET request. So, I could simplify my regexp by searching for the first ? found, and not have to search for the one adjacent to the HTTP portion. Right?

Or would searching for such a specific version of the question mark actually help speed up or otherwise improve the search? It would seem logical to me that a search for any question mark anywhere in THE_REQUEST should be faster than a question mark that meets additional criteria.

If so, then maybe my overly broad rule could be rewritten just to check if the QUERY_STRING exists or something. But would that tell if there is a dangling question mark at the end?

The problem that I see is that the rule only checks the main portion of the URL, and that's the part you can't check for a question mark, because the question mark exists outside of both the main portion and the query portion. It's the boundary. So I guess you do have to check EVERY URL to see if a question mark exists. In which case, I don't think you could optimize the rule any more. Not unless the rule can check THE_REQUEST directly.
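A sketch of why %{QUERY_STRING} alone cannot spot the dangling question mark, assuming the same example.com host used earlier in the thread. The "^" rule pattern is a suggested alternative to ".?", not something tested in this thread; it matches every URL-path without consuming any characters.

```apache
# %{QUERY_STRING} is empty for both of these requests:
#   GET /content/category/7/34/78/ HTTP/1.1
#   GET /content/category/7/34/78/? HTTP/1.1
# so a test like the one below is true for both, and using it to strip
# the question mark would redirect the first request to itself forever:
# RewriteCond %{QUERY_STRING} ^$

# THE_REQUEST still holds the raw request line, so it can tell them apart.
RewriteCond %{THE_REQUEST} \?\ HTTP/ [NC]
RewriteRule ^ http://www.example.com%{REQUEST_URI}? [R=301,L]
```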

These are the kinds of mental circles I wander around in when trying to deal with rules and regular expressions like this. Having learned the syntax and meaning of regexps has really helped, but the logic is still rather dizzying at times. :-)

--Willscrlt

jdMorgan

3:11 pm on Mar 15, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Short answer: Leave your code as it is; it accomplishes everything you set out to accomplish, and its one apparent flaw is due to server behaviour outside of its control -- in other words (as stated above), it is the server rejecting the URL as invalid that causes the request to fail before mod_rewrite is even invoked. So there's nothing you can (or even should) do in mod_rewrite to handle that malformed URL.

THE_REQUEST is the request line, the first line of the request headers sent by the browser. Try Firefox with the "Live HTTP Headers" extension if you'd like to see some actual request headers -- using this extension is an irreplaceable part of my site QA process now, and I use it to check every redirect and 410-Gone I install. Just for an example, here's what %{THE_REQUEST} looks like when my browser requests the WebmasterWorld Apache Library page:

GET /libraryv4.cgi?sortby=Date-Last-Post&sortdir=rvs&sortby=Date-Last-Post&sortdir=rvs&libshow=10&viewforum=92 HTTP/1.1

Jim