Mod-rewrite and querystrings

mod_rewrite, querystring

         

dougmcdonald

10:02 pm on Aug 2, 2011 (gmt 0)

10+ Year Member



Hi everyone,

I just stumbled upon WebmasterWorld while redesigning my site and taking a quick foray into mod_rewrite at the same time.

I was hoping someone with a bit more experience in the area could point me in the right direction on two queries I have.

My first centres around rewritten URLs and querystrings. Put simply, if I rewrite /index.php?page=news&title=somenews to /news/somenews/, how can I access the variable 'title' (or maybe I can't!) in my child PHP page? Normally I'd just use $_REQUEST['varname'].

Is this possible, or do I have to pass variables about via some method other than the querystring (since I guess I don't technically have one after the rewrite)? If it's any help, I've been using this as my rewrite line for the piece in question:

RewriteRule ^articles/(.*)/(.*)/(.*)/(.*)$ index.php?page=viewarticle&y=$1&m=$2&d=$3&t=$4


My second question is, I hope, a bit simpler. How (again with mod_rewrite) can I redirect requests for my root directory, e.g. www.mysite.com/, to www.mysite.com/home/? I thought the following statement might do it, but it doesn't seem to:

RedirectMatch ^/$ http://mysite.com/home/


Any pointers on this would be very warmly received as this has been vexing me for a few days now!

Many thanks,

Doug

lucy24

11:05 pm on Aug 2, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The query-string part is actually straightforward. Conceptually straightforward, that is: putting together the right regular expression to capture exactly the parts you want is not so easy.

#1 By default, a rewrite simply ignores the query string. That is, mod_rewrite stashes it in a safe place, does all the rewriting, and then quietly reappends the original query without change. The query is not part of the rewrite pattern.

#2 To delete the entire query, put a ? at the end of the target.

#3 To replace the existing query with something else, put ?newstuff at the end of the target. (#2 and #3 are really the same thing: you're just replacing the query with either something or nothing.)

In order to capture all or part of the query, you need

RewriteCond %{QUERY_STRING} blahblah

minus the leading ? which is implied. The blahblah will then contain one or more pieces in parentheses, which will be rendered as %1 %2 etc (not $1 $2 etc) in your target.

I deliberately left this part vague because the exact structure of the blahblah will depend on the exact structure of your query string, and which pieces you need to capture. It is easiest if the query always contains the same elements in the same order.
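
As a rough illustration of points #1 to #3 and the RewriteCond capture, the rules might look like the sketch below. It assumes a .htaccess file in the document root; the paths (old-a, new-a, item.php, items/), the id parameter and the www.example.com hostname are all made up for the example.

```apache
RewriteEngine On

# 1. No "?" in the target: the original query string is re-appended unchanged.
#    /old-a?x=1  ->  /new-a?x=1
RewriteRule ^old-a$ /new-a [L]

# 2. Bare "?" at the end of the target: the query string is dropped.
#    /old-b?x=1  ->  /new-b
RewriteRule ^old-b$ /new-b? [L]

# 3. "?newstuff" at the end of the target: the query string is replaced.
#    /old-c?x=1  ->  /new-c?y=2
RewriteRule ^old-c$ /new-c?y=2 [L]

# Capturing part of the query: parentheses in a RewriteCond pattern
# are referenced as %1, %2, ... in the target (not $1, $2, ...).
#    /item.php?id=42  ->  external redirect to /items/42
RewriteCond %{QUERY_STRING} ^id=([0-9]+)$
RewriteRule ^item\.php$ http://www.example.com/items/%1? [R=301,L]
```

The trailing "?" on the last target is what stops the old id=42 from being re-appended to the redirected URL.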

dougmcdonald

7:41 am on Aug 3, 2011 (gmt 0)




Hi Lucy, thanks very much for the reply.

On point #1 - Does that imply that if I have a php script which previously used the global $_REQUEST['paramname'] that the request object should still be accessible after the rewrite?
The reason I ask, is that my page structure includes an index.php, with an include to 'controller.php' which grabs $_REQUEST['page'] from the querystring and uses a simple switch statement to decide which page to include.
In my example it would include 'viewarticle.php' based on the value of 'page' in the querystring.

I can verify that the include 'controller.php' can access $_REQUEST['page'], as it successfully serves 'viewarticle.php', but 'viewarticle.php' doesn't seem to be able to access $_REQUEST[''] in general. Is this purely because of its included nature, do you think?

With the comment regarding 'RewriteCond %{QUERY_STRING} blahblah', does that relate to accessing the querystring in my mod_rewrite commands? If so, I might have asked the wrong question, as I'm more keen to access the querystring in my PHP than in my rewriting. But if one is a precursor to the other, then I guess all I'm really after is making sure I can still access the variables in my server-side script after a rewrite.

Many thanks,

Doug

g1smd

7:56 am on Aug 3, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Put simply, if I rewrite, /index.php?page=news&title=somenews to /news/somenews/

Mod_rewrite does the exact opposite. It takes the incoming URL request for
example.com/news/somenews/
and rewrites it to fetch content from the server filepath at
/index.php
and passes the parameters
page=news&title=somenews
to that resource.

Once you get that, you understand what mod_rewrite actually does. It does not "make" anything. It works on incoming requests. It therefore follows that the first step is changing the links on your pages to point to the SEF format URLs. URLs are defined in links. Mod_rewrite only starts working after that link is clicked.

By the way, you must change the .* patterns to something else. The .* means "match as much of the input as possible, slashes included". You can't follow "everything" with "everything" without the parser having to make many backtracking trial matches per request in order to see what you actually meant.

^([^/]+)/([^/]+)/([^/]+)/([^/.]+)$
may work a lot better.

Finally, if any of your rules use RewriteRule, do not use Redirect or RedirectMatch anywhere in the same site. Use RewriteRule for all of your rules.
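
Applied to the two original questions, that advice could come out something like the following. This is only a sketch: it assumes the rules live in the .htaccess file of the document root (so patterns have no leading slash) and uses www.example.com as a stand-in hostname.

```apache
RewriteEngine On

# External redirect: a request for the bare root goes to /home/.
# In a document-root .htaccess the root request is seen as an
# empty path, hence the pattern ^$.
RewriteRule ^$ http://www.example.com/home/ [R=301,L]

# Internal rewrite: each [^/]+ stops at the next slash, so the four
# segments cannot swallow one another the way (.*) can.
RewriteRule ^articles/([^/]+)/([^/]+)/([^/]+)/([^/.]+)$ index.php?page=viewarticle&y=$1&m=$2&d=$3&t=$4 [L]
```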

dougmcdonald

7:37 am on Aug 4, 2011 (gmt 0)




Thanks for the reply g1smd. Firstly, sorry, I was thinking backwards when I wrote that; as you say, it's the rewrite from the pretty format to the lengthy querystring.

Thanks again for the pointers on the * wildcard. I had assumed (obviously incorrectly) that the structure of the regex /blaa/blaaa/blaaaaa/ would need to be allowed to accept anything and that the position of the slashes would kind of delimit it. I realise, now you've elaborated, that the first part would match /blaa/blaaa/blaaaaa/ as well as the /blaa/ I wanted it to, hence the looping you describe. Since the parts are designed to take dates in the format /2011/08/01/Sometext/, I will further refine these. My initial goal was just to get the links to point to the right files before I ran through any optimisation.

Following on from this point, do you think the usage of .* as part of a matching expression may have led to the seemingly invisible nature of the querystring to nested PHP scripts?

Finally, thanks again for the RedirectMatch heads-up. I had read examples in a few places using both, but they seemed quite hazy on the whats and whys, so it's nice to hear someone clear things up for me.

g1smd

7:58 am on Aug 4, 2011 (gmt 0)




No problem.

If you're using dates, then the first part of the pattern might look like ^([12][0-9]{3})/([0][1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/([^/.]+)$

This allows requests beginning
/20AA/3F4/55/...
/1900/88/99/...
/FOO/BAR/QUUX/...
to return 404, or to be picked up by other, later, more general, rewrite rules on the site.

Query string is not affected by the RewriteRule pattern. Unless you explicitly replace or clear the query string it is automatically re-appended to the target. RewriteRule pattern looks only at the PATH part of the URL request.
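
Put together with the articles/ prefix from the original rule, the date pattern might be used as sketched below. One thing worth noting: this target carries its own query string, which counts as replacing the incoming one, so the QSA flag is shown as a way to keep any incoming parameters as well. Whether that is wanted here is an assumption, and ref=rss is just an invented parameter.

```apache
RewriteEngine On

# year 1000-2999 / month 01-12 / day 01-31 / title without slash or dot.
# QSA appends the incoming query string after the new parameters, so
# /articles/2011/08/01/Sometext?ref=rss also passes ref=rss to index.php.
RewriteRule ^articles/([12][0-9]{3})/(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/([^/.]+)$ index.php?page=viewarticle&y=$1&m=$2&d=$3&t=$4 [QSA,L]
```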

dougmcdonald

7:22 pm on Aug 4, 2011 (gmt 0)




Hi again g1smd, thanks so much for the date format pattern. I got annoyed with not being able to grab the querystring, so I started playing about with my php and outputting various things.
Turns out, my previous pattern was doing some weirdness and plugging together some of the sections, resulting in my only getting 3 parameters when I was expecting 4. Because initially I was only looking at the last param, I thought I was getting none....doh!

Once I had established it was related to the pattern match rather than the ability to see the params after re-write, I tweaked the pattern you provided to:

RewriteRule ^articles/([12][0-9]{3})/([0][1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/([^/.]+).$ index.php?page=viewarticle&y=$1&m=$2&d=$3&t=$4

Which does exactly what I want. Thanks again for the pointers on this one (I'm aware there aren't any major changes, but I do understand what it's doing, which is good!) :P

g1smd

7:29 pm on Aug 4, 2011 (gmt 0)




What is the period immediately before $ for?

It matches "any single character". But why?

Add the [L] flag to this and every RewriteRule.

dougmcdonald

8:22 am on Aug 10, 2011 (gmt 0)




The period was designed to match any character as you said, but more with the goal of allowing a trailing / on the URL.

What I really wanted was zero or 1 occurrence of /, but I just plopped an anything on the end, as I figured 'anything' would match / or a non '/', if that makes any sense? :S

lucy24

9:50 am on Aug 10, 2011 (gmt 0)




Zero or 1 is expressed with ?. If you want 0 or 1 of some specific character, use that character: /?

Is the overall idea to have no more than one directory after the part with the numbers? What if you get a bookmark or type-in that includes "index.html" after the final slash?

Since the preceding part is tightly constrained and you're capturing all the way to the end, this is one case where (.*) can be OK. It simply means "once you've got the date sorted out, capture the leftovers, if any".

g1smd

6:05 pm on Aug 10, 2011 (gmt 0)




Yes, you can use /? for "optional trailing slash".

Use this pattern when you want to externally redirect several old URLs to a single new URL.

However, if you use this pattern with an internal rewrite you will have a Duplicate Content problem on your hands. You will have two URLs that both deliver content with the "200 OK" status.

The correct approach is to redirect "with slash" to "without slash", and then and only then rewrite the "without slash" requests to the internal filepath to actually fetch the content.
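
A minimal sketch of that two-step approach for the article URLs, assuming a document-root .htaccess and the placeholder hostname www.example.com:

```apache
RewriteEngine On

# Step 1 (external redirect): the "with slash" form is sent to the
# "without slash" form, so only one URL ever answers 200 OK.
RewriteRule ^(articles/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^/.]+)/$ http://www.example.com/$1 [R=301,L]

# Step 2 (internal rewrite): only the slashless form fetches content.
RewriteRule ^articles/([12][0-9]{3})/(0[1-9]|1[012])/(0[1-9]|[12][0-9]|3[01])/([^/.]+)$ index.php?page=viewarticle&y=$1&m=$2&d=$3&t=$4 [L]
```

The redirect deliberately uses a looser date pattern than the rewrite: a slash-terminated request with an impossible date is redirected once and then falls through to a 404.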

dougmcdonald

8:11 pm on Aug 14, 2011 (gmt 0)




Thanks for the breakdown guys, in answer to the questions:

lucy24, the URL in question is never expecting to have anything beyond the final slash. It's purely a rewrite to ensure that lengthy querystring-based URLs are more humanly readable, so the index.html shouldn't be a problem in this case.

g1smd, This is an internal re-write and I may well have a duplicate content issue, I'm assuming the downside of this is SEO wise?
Thanks again for pointing me at the correct way to handle these. I have a similar problem with my main links /home/, /about/ etc., where I want /home and /home/ to both be allowed and point to the same file. I guess in this example I should be redirecting /home/ to /home (for efficiency). I will examine my rules for this pattern too, to ensure I'm doing things by the book in this area!

Thanks again everyone for the massively helpful pointers in this area!

Cheers,

Doug

lucy24

9:37 pm on Aug 14, 2011 (gmt 0)




I want /home and /home/ to both be allowed and point to the same file

Is the file itself /home.html (or home.php etc) or is it /home/index.html? The added slash in "naked" directory names tends to happen automatically unless you've done something to disable it. In extensionless filenames, the extension may be added either by your own explicit htaccess, or by MultiViews via mod_negotiation.

:: memo to self: add MultiViews to list of things my host does by default ::

g1smd

10:11 pm on Aug 14, 2011 (gmt 0)




That's an important point.

If the file on the drive is "home.html" then redirect URL requests "with slash" to "without slash" and rewrite only "without slash" URL requests to fetch the content.

If the file on the drive is "/home/index.html" let DirectoryIndex serve the index file for "with slash" URL requests and redirect "without slash" URL requests to "with slash", or let "DirectorySlash" take care of it.
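
The two cases might be sketched as follows. The hostname is a placeholder, and the Options line assumes the host permits overriding Options in .htaccess.

```apache
# Case 1: the content lives in /home.html
Options -MultiViews
RewriteEngine On
# Redirect "with slash" to "without slash" ...
RewriteRule ^home/$ http://www.example.com/home [R=301,L]
# ... and rewrite only "without slash" to the real file.
RewriteRule ^home$ /home.html [L]

# Case 2: the content lives in /home/index.html
# No rewrite rules are needed; mod_dir handles it, and both of
# these directives are shown at their default values.
# DirectoryIndex index.html
# DirectorySlash On
```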


:: memo to self: add MultiViews to list of things my host does by default ::
Do you mean turn it on? If so, don't turn it on. It does not play nice with Mod_Rewrite and is a source of endless Duplicate Content.

Mod_negotiation is another minefield all on its own. Good luck with that. :)

lucy24

11:38 pm on Aug 14, 2011 (gmt 0)




Breathe easy: I just re-checked using random extensionless filenames. I must have remembered wrong, because I get 404* errors. I assume the 404 means it translates "/foobar" into a request for "/foobar/index.html" and then can't find it. I do know that the host adds "index.html" before the request ever reaches my htaccess, because RedirectRules won't recognize anything with a final slash.

Anyway, tralala, I figured out a quick test to verify that mod_setenvif is processed before mod_rewrite which in turn is processed before mod_alias. Useful to know, since the only other information I have is which modules I've got, period. I think those three are the only ones with any .htaccess involvement.


* I can count very, very fast. :)

g1smd

11:52 pm on Aug 14, 2011 (gmt 0)




I think there are several other modules that have settings that can be set by .htaccess rules.

mod_rewrite which in turn is processed before mod_alias
Correct operation should be mod_alias before mod_rewrite.

That would allow a request to be externally redirected by a mod_alias Redirect or RedirectMatch directive to a new URL and then for that new request to be internally rewritten using a mod_rewrite RewriteRule directive.

When mod_rewrite runs before mod_alias you end up with problems. An incoming URL request is rewritten to an internal path by the mod_rewrite RewriteRule and the internal pointer is updated. Next, when mod_rewrite runs, it takes that pointer and uses it as a part of the new URL and then exposes it out on to the web in the Location part of the HTTP header of the redirect.

This is why there are the oft-repeated rules
- "if you use RewriteRule for any of your directives, use it for ALL of your directives", and
- "list RewriteRule redirects before RewriteRule rewrites" and
- "list redirects from most specific to most general" and
- "list rewrites from most specific to most general".
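
Those ordering rules could translate into a .htaccess layout along these lines. The old-news and news paths, the page values and the hostname are invented for illustration only.

```apache
RewriteEngine On

# --- External redirects first, most specific to most general ---
RewriteRule ^old-news/special\.html$ http://www.example.com/news/special [R=301,L]
RewriteRule ^old-news/([^/.]+)\.html$ http://www.example.com/news/$1 [R=301,L]

# --- Internal rewrites second, most specific to most general ---
RewriteRule ^news/archive$ index.php?page=newsarchive [L]
RewriteRule ^news/([^/.]+)$ index.php?page=news&title=$1 [L]
```

With the redirects listed first, a request is corrected to its final public URL before any rule maps it to an internal path, so the internal index.php path never ends up in a Location header.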

lucy24

12:15 am on Aug 15, 2011 (gmt 0)




While pawing through Apache I found a reference to Rewrite Logs [httpd.apache.org], which would be amazingly useful and informative except that it looks as if you can't do it in .htaccess :(

Next, when mod_rewrite runs, it takes that pointer and uses it as a part of the new URL and then exposes it out on to the web in the Location part of the HTTP header of the redirect.

Did you mean to say mod_alias, or am I hopelessly confused? The two are not mutually exclusive.
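
For reference, the rewrite log is switched on in the main server or virtual-host configuration, roughly as follows. The log path is a placeholder, and which form applies depends on the Apache version.

```apache
# Apache 2.2 and earlier: httpd.conf or <VirtualHost>, not .htaccess.
RewriteLog "/var/log/apache2/rewrite.log"
RewriteLogLevel 3

# Apache 2.4 replaced both directives with per-module LogLevel;
# rewrite traces then go to the regular error log.
# LogLevel alert rewrite:trace3
```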

g1smd

7:33 am on Aug 15, 2011 (gmt 0)




That'll teach me to post at 1 a.m.

Corrected:
When mod_rewrite runs before mod_alias you end up with problems. An incoming URL request is rewritten to an internal path by the mod_rewrite RewriteRule and the internal pointer is updated. Next, when mod_alias runs, it takes that pointer and uses it as a part of the new URL and then exposes it out on to the web in the Location part of the HTTP header of the redirect.

lucy24

3:16 pm on Aug 15, 2011 (gmt 0)




Yup. My quick-and-dirty test involved putting three redundant lines in .htaccess, all responding to a request for /foobar (directory I, ahem, don't actually have). Combination of paraphrase and literal text:

-- set variable keep_out (leading to "Deny from...")
-- RewriteRule with R=301 to /rewritten_url
-- Redirect 301 to /redirected_url

First try: 403 with status bar saying /foobar, meaning that mod_setenvif kicked in first. Deleting this line: 404 screen with status bar saying /rewritten_url. Deleting this line too: nothing left for browser to show but 404 screen with /redirected_url.

That's assuming I understand correctly that everything involving any one module is processed in a batch before htaccess moves along to the next module. I double-checked by putting the three lines in a different order (Redirect first, SetEnvIf last); no change.
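
A test along those lines might be reconstructed like this. It is a guess at the shape of the three lines rather than the actual file, written in Apache 2.2 access-control syntax with a placeholder hostname.

```apache
# mod_setenvif sets the variable; the Deny line then answers 403.
SetEnvIf Request_URI "^/foobar" keep_out
Order Allow,Deny
Allow from all
Deny from env=keep_out

# mod_rewrite: external redirect to /rewritten_url.
RewriteEngine On
RewriteRule ^foobar http://www.example.com/rewritten_url [R=301,L]

# mod_alias: external redirect to /redirected_url.
Redirect 301 /foobar http://www.example.com/redirected_url
```

Removing the blocks one at a time from the top and requesting /foobar after each change shows which remaining directive wins.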

g1smd

3:24 pm on Aug 15, 2011 (gmt 0)




Yes, per request, Apache runs one module at a time, and that module will carry out everything in the .htaccess configuration file that applies to it. When that module is done processing the current request, control passes to the next Apache module to do its thing.

Apache modules are run in the reverse order to how they are listed in the LoadModule part of the httpd.conf file (that is the Apache 1.3 behaviour; Apache 2.x orders modules by their hook priorities instead). It's a fatal error to run mod_alias after mod_rewrite, so I never take that chance and use only mod_rewrite directives, never mod_alias.