I'm using the following RewriteRule that works great but it does not seem to process the "?" symbol (it seems to completely ignore it):
RewriteEngine On
RewriteRule ^([a-zA-Z][a-zA-Z])-([a-zA-Z][a-zA-Z])/([^/]+)$ /do.php?cat=$1&cat=$2&cat=$3 [L]
For example, if I go to mysite.com/aa-bb/question?, it processes it as:
do.php?cat1=aa&cat2=bb&name=question
and not:
do.php?cat1=aa&cat2=bb&name=question?
Yet if I go directly to:
mysite.com/do.php?cat1=aa&cat2=bb&name=question?
(without making use of the .htaccess rule), it processes the "?" without problems, so I assume the issue is with the .htaccess RewriteRule.
Any ideas on how I can get it to process the "?" instead of ignoring it? (It also ignores anything AFTER the "?" mark, which I don't want either).
Just to be clear, by "process" I mean that if I go to mysite.com/aa-bb/question? and do a $_GET['name'], I would like it to "get" the "?" symbol as well (e.g. I'd like $_GET['name'] to return "question?" and not just "question".)
Maybe one of these resources will lead us on the right track (I couldn't make too much sense out of them, but they seem relevant):
[ask.metafilter.com...]
[webmasterworld.com...]
[forums.devshed.com...]
[evisibility.com...]
[webmasterworld.com...]
Thanks, I appreciate your help!
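As a rough illustration of why the "?" never reaches $_GET['name']: the first "?" in a requested URL separates the path from the query string, and a RewriteRule pattern is matched against the path only. The sketch below mimics that split in PHP. split_request_uri() is a hypothetical helper written for this example, not something Apache or PHP provides.

```php
<?php
// Hypothetical helper: split a request URI at the first "?",
// path before it, query string after it.
function split_request_uri(string $uri): array
{
    $parts = explode('?', $uri, 2);
    return [
        'path'  => $parts[0],
        'query' => $parts[1] ?? null, // null when there is no "?" at all
    ];
}

$r = split_request_uri('/aa-bb/question?');
// $r['path'] is "/aa-bb/question"; the "?" itself ends up in neither part.
```

Only the path part is what the ([^/]+) subpattern gets to see, which matches the behaviour described in the question.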
If you wish to apply your rule only when the query string is empty, then you'll need to use a RewriteCond to check QUERY_STRING. With several additional tweaks for efficiency, it would look like this:
RewriteEngine on
#
RewriteCond %{QUERY_STRING} =""
RewriteRule ^([a-z]{2})-([a-z]{2})/([^/]+)$ /do.php?cat=$1&cat=$2&name=$3 [NC,L]
RewriteEngine on
#
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[a-z]{2}-[a-z]{2}/[^/\ ]+\ HTTP/
RewriteRule ^([a-z]{2})-([a-z]{2})/([^/]+)$ /do.php?cat=$1&cat=$2&name=$3 [NC,L]
Jim
I gave both rules a shot but unfortunately they did not work.
Just to clarify, my site does not use parameters (?) except behind the scenes (e.g. in the rewrite rule). So what I want it to do is if I go to the following URL:
mysite.com/aa-bb/Are you there? Yes.
I want the $_GET['name'] to pick up everything after the last slash, including the literal "?" symbol and everything after it: Are you there? Yes.
Right now, it ignores the "?" and anything after it.
Perhaps I need a Rewrite Rule that will convert the "?" in the URL to a "%3F"?
I say that because if I go to:
mysite.com/aa-bb/Are you there%3F Yes.
It works, but is not user-friendly.
Any ideas?
Thanks again.
Because someone who is up to no good could cause you big problems by appending query strings to your URLs and using that to "confuse" search engines as to the "correct" URLs for your pages, you're basically playing with fire by accepting a question mark for any purpose other than to demarcate a query string. You will likely even see Googlebot occasionally 'testing' your site (I mean even today, regardless of your rules or any future changes to them) to see what your server does with 'bogus' querystrings. If you accept them, Google will consider your site to have an 'infinite URL-space' and will arbitrarily limit the depth to which they are willing to crawl your site.
So, you can either replace that character with a non-reserved character and change it back 'inside' your script, or you can omit it, and inform the user that it's being omitted if that might make any difference to them or to their activities on your site. There's no "user-friendly" way around allowing reserved characters in URLs.
Note that in your example, spaces will also be converted to %20 because, although they're not reserved, they are "restricted." See RFC2396 - Uniform Resource Identifiers (URI): Generic Syntax [faqs.org] if you need more info about characters allowed in various parts of URLs. You *are not* free to allow any character you like, any place in the requested URL or query string.
And as I stated above, "you may wish to add an additional and similar rule to 301-redirect requests having a trailing "?" or a trailing "?" plus query string to the same URL but with the query string removed, unless you want those non-canonical URLs to return a 404-Not Found."
With your new explanation of your goals, this would go hand-in-hand with a change to the database to remove trailing question marks from the expected "Get" values.
Also, should you wish to continue, please be very specific; "It does not work" tells us almost nothing here...
Jim
So you advise against playing around with the "?" (and other special characters) in a URL for security and SEO reasons?
If so, how would I go about adding an "additional and similar rule to 301-redirect requests having a trailing "?" or a trailing "?" plus query string to the same URL but with the query string removed, unless you want those non-canonical URLs to return a 404-Not Found" ?
Also, for the sake of being thorough, when you say "you can replace that character with a non-reserved character and change it back 'inside' your script", how exactly would I do that considering the "?" is not getting passed to my script?
Thanks again, I appreciate your time.
As for replacing question marks, just a fairly-bad example:
URL was "Are you there?"
Modify your script to link to "Are you there~" where "~" is a non-reserved character replacing the "?".
Use mod_rewrite to rewrite the "Are you there~" URL to your script so you can generate the requested page.
In your script change "~" back to "?" before trying to look up this page in your database to generate this page's content.
I am *not* recommending using the "~" character specifically, just any non-reserved character that is not likely to appear in your URLs. There's really no good way to allow the passing of 'free text' except in query strings, but that defeats your whole purpose in trying to use "friendly" URLs.
You'd have a similar kind of problem if you tried to pass a URL like "Are/you/there" because the slash also has meaning to the server.
Jim
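The placeholder approach described above might look roughly like this in PHP. This is only a sketch: the function names are hypothetical, and "~" is used purely because it is the character in the example, on the assumption that it never occurs in a real name.

```php
<?php
// Assumption: "~" never occurs in real names; pick any unreserved
// character your own data cannot contain.
const PLACEHOLDER = '~';

// Used when generating links: swap "?" out, then percent-encode the rest.
function name_to_segment(string $name): string
{
    return rawurlencode(str_replace('?', PLACEHOLDER, $name));
}

// Used in the script before the database lookup: undo the swap.
// $_GET values arrive already percent-decoded, so no rawurldecode() here.
function segment_to_name(string $segment): string
{
    return str_replace(PLACEHOLDER, '?', $segment);
}
```

The swap is only reversible if the placeholder really is absent from the data, which is the weak point of the whole idea.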
By the way, I got the following code from another forum which actually does exactly what I want it to do and works perfectly:
RewriteCond %{REQUEST_URI} [a-zA-Z][a-zA-Z]-[a-zA-Z][a-zA-Z]/[^/]+
RewriteCond %{THE_REQUEST} ^GET[\ ]/([a-zA-Z][a-zA-Z])-([a-zA-Z][a-zA-Z])/([^/\ ]+)[\ ]HTTP
RewriteRule .* /do.php?cat=%1&cat=%2&name=%3 [NE,L]
By "works", I mean it passes on the "?" and everything after it and keeps the URL "clean" and intact.
Do you advise against using this solution?
Try this tweak:
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([a-zA-Z]{2})-([a-zA-Z]{2})/([^/\ ]+)\ HTTP/
RewriteRule ^[a-z]{2}-[a-z]{2}/[^/]+$ /do.php?cat=%1&cat=%2&name=%3 [NC,NE,L]
Also, why do you have two query string parameters both named "cat"? -- that looks dangerous, as you never know how future versions of PHP might handle that -- i.e. one name/value pair could be dropped or get overwritten by the other, and become unavailable to your script.
If that rule works for you, great -- But be aware of the 'bogus query string problems' I warned about above, as the rest of the Web is going to treat the question mark as a query string delimiter and treat anything after that question mark as a query string, regardless of how you treat it internal to your site.
Technically, what you are doing here 'breaks the rules' about URLs, and you are going to have to handle the problems that arise as a result; Be very sure that your script will return a 404-Not Found unless the values of cat1, cat2 and name are valid and can be found in your database.
Jim
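A minimal sketch of the "return a 404 unless the values are valid" check, assuming the valid pages can be looked up by category pair. The $knownPages array is a made-up stand-in for the database, and status_for() is a hypothetical name.

```php
<?php
// Hypothetical lookup table standing in for the database.
$knownPages = [
    'aa' => ['bb' => ['Are you there?', 'question?']],
];

// Decide the HTTP status for a request: 200 only when all three values
// are well-formed and present in the data, 404 otherwise.
function status_for(array $pages, string $cat1, string $cat2, string $name): int
{
    if (!preg_match('/^[a-z]{2}\z/i', $cat1) || !preg_match('/^[a-z]{2}\z/i', $cat2)) {
        return 404;
    }
    $names = $pages[strtolower($cat1)][strtolower($cat2)] ?? [];
    return in_array($name, $names, true) ? 200 : 404;
}

// In do.php this might be used as:
// http_response_code(status_for($knownPages, $_GET['cat1'] ?? '', $_GET['cat2'] ?? '', $_GET['name'] ?? ''));
```

The strict in_array() comparison means a bogus suffix after the "?" only yields a page when that exact name exists.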
As for the two "cat" parameters, that was simply my typo (it should be cat1 and cat2).
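For what it is worth, the overwrite Jim warned about is easy to reproduce with PHP's own query-string parser. The snippet below only shows what parse_str() does with the two query strings; it makes no claim about how future PHP versions will behave.

```php
<?php
// Parse the query string as produced by the rule with the typo.
parse_str('cat=aa&cat=bb&name=question', $params);
// The later "cat" overwrites the earlier one, so "aa" is no longer
// reachable through $params['cat'].

// With distinct names, all three values survive.
parse_str('cat1=aa&cat2=bb&name=question', $fixed);
```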
By the way, when I go to mysite.com/aa-bb/ and enter nothing after the slash (or forgo the last slash altogether), I get a 404 error. Based on your .htaccess script, do you know where this actually tries to take me behind the scenes? I would like to set up a "main" page for each category pair if nothing is entered after the slash (or if the slash is omitted).
Thanks again for your time and continued assistance.
The server will therefore use its "default URL-to-filename resolution," and attempt to take you to the physical index page in the physical /aa-bb/ subdirectory. Since neither exists, you get a 404.
If you like, you can make your rule accept a blank path after /aa-bb/ as long as your script can do something with such a request: Simply change the "+" character on the final subpattern (in both lines) to a "*" character, changing the quantifier from 'match one or more' to 'match zero or more'.
Jim
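The effect of swapping "+" for "*" can be checked outside Apache with the same kind of pattern in PHP. This is a sketch for experimenting only: matches() is a hypothetical helper, and the paths are written without a leading slash because that is how a RewriteRule in .htaccess sees them.

```php
<?php
// The same pattern with the two quantifiers on the last group.
// "#" is used as the regex delimiter so "/" needs no escaping.
$onePlus  = '#^([a-z]{2})-([a-z]{2})/([^/]+)$#i';
$zeroPlus = '#^([a-z]{2})-([a-z]{2})/([^/]*)$#i';

// True when the pattern matches the given path.
function matches(string $pattern, string $path): bool
{
    return preg_match($pattern, $path) === 1;
}
```

Note that neither variant matches "aa-bb" without the trailing slash, so that case would still need its own rule or a redirect.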
I found a very bizarre error that I can't get my head around and was hoping you could give me another hand.
On my main index page (mysite.com), I have a form with 3 fields for cat1, cat2, and name. That form simply posts those 3 variables to mysite.com/makeit.php, which itself simply does a simple header redirect to:
mysite.com/cat1-cat2/name
Which, as we already know from the .htaccess, gets processed as
mysite.com/do.php?cat1=%1&cat2=%2&name=%3
That was working fine until I tested it with non-standard characters, such as Chinese characters.
For example, if on the main index page form I input:
cat1="aa"
cat2="bb"
name=[Chinese characters aren't getting processed correctly by this forum, so the character in my example is the one Google displays here: [google.com...]
then instead of the Chinese characters getting preserved in Chinese, they change to "&#20013;&#22269;" [literally] (as read by makeit.php) and then as an empty string "" when makeit.php does the header redirect to mysite.com/cat1-cat2/name.
Here is where it gets even more bizarre, if I go to mysite.com/aa-bb/[proper chinese character] directly via the URL, it works fine and the "[proper chinese character]" gets read/processed as it is (e.g. not modified in any way). Furthermore, I have the same form from the main index page on each dynamic page found at mysite.com/cat1-cat2/name and when I submit the same exact variables ("aa","bb","[proper chinese character]") via that form (which posts to the exact same makeit.php), it also gets read/processed correctly.
So the problem ONLY happens when using NON-LATIN characters via the MAIN INDEX Form.
This is very bizarre and I hope I explained myself correctly. Even though the form within mysite.com/aa-bb/name posts to the same makeit.php page as the form on the main index page (by posting to "../makeit.php"), it seems that because it is being "launched" from "within" mysite.com/aa-bb/name, it is INVOKING the .htaccess rule we created and thus being processed correctly (e.g. not modified in any way).
So I had two questions:
1) First of all, is my use of the makeit.php intermediary "post" file to keep the URL clean even when the variables are submitted via a form (as opposed to directly via the URL) the best way of going about this (it's the only way I could figure out)?
2) What is causing the current problem with non-standard characters being submitted via the main index page form and how can I fix it?
Thanks again, I really appreciate it.
The only good news is that your testing as described above indicates that the problem isn't in the rewriterule.
Jim
@Jim: I ran the Live HTTP Headers and here are the results step-by-step. Hopefully this will help us figure out where the problem is and how to fix it. Thanks again!
[START]
@@@@@@@@@@@@@@@@@@@@@@@@@@@
STEP 1: LOADING MAININDEX PAGE
###########################
#request# GET [mysite.com...]
GET /
#request# GET [mysite.com...]
###########################
***************************
[mysite.com...]
GET / HTTP/1.1
Host: mysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
HTTP/1.x 200 OK
Date: Wed, 23 Sep 2009 22:33:43 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.10
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
***************************
@@@@@@@@@@@@@@@@@@@@@@@@@@@
CURRENT URL: [mysite.com...]
STEP 2: SUBMITTING FORM WITH CHINESE CHARACTER:
###########################
#request# POST [mysite.com...]
POST /makeit.php name=%26%2320013%3B%26%2322269%3B&cat1=aa&cat2=bb
#request# GET [mysite.com...]
#redirect# GET /aa-bb/&
#request# GET [mysite.com...]
###########################
***************************
[mysite.com...]
POST /makeit.php HTTP/1.1
Host: mysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: [mysite.com...]
Content-Type: application/x-www-form-urlencoded
Content-Length: 47
name=%26%2320013%3B%26%2322269%3B&cat1=aa&cat2=bb
HTTP/1.x 301 Moved Permanently
Date: Wed, 23 Sep 2009 22:34:50 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.10
Location: [mysite.com...]
Content-Length: 0
Connection: close
Content-Type: text/html
----------------------------------------------------------
[mysite.com...]
GET /aa-bb/& HTTP/1.1
Host: mysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: [mysite.com...]
HTTP/1.x 200 OK
Date: Wed, 23 Sep 2009 22:34:50 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.10
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
***************************
@@@@@@@@@@@@@@@@@@@@@@@@@@@
CURRENT URL: [mysite.com...]
STEP 3: SUBMITTING FORM AGAIN WITH CHINESE CHARACTER BUT WITHIN NEW URL (ABOVE)
###########################
#request# POST [mysite.com...]
POST /makeit.php name=%E4%B8%AD%E5%9B%BD&cat1=aa&cat2=bb
#request# GET [mysite.com...]
#redirect# GET /aa-bb/%E4%B8%AD%E5%9B%BD
#request# GET [mysite.com...]
###########################
***************************
[mysite.com...]
POST /makeit.php HTTP/1.1
Host: mysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: [mysite.com...]
Content-Type: application/x-www-form-urlencoded
Content-Length: 37
name=%E4%B8%AD%E5%9B%BD&cat1=aa&cat2=bb
HTTP/1.x 301 Moved Permanently
Date: Wed, 23 Sep 2009 22:35:56 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.10
Location: [mysite.com...]
Content-Length: 0
Connection: close
Content-Type: text/html
----------------------------------------------------------
[mysite.com...]
GET /aa-bb/%E4%B8%AD%E5%9B%BD HTTP/1.1
Host: mysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: [mysite.com...]
HTTP/1.x 200 OK
Date: Wed, 23 Sep 2009 22:35:57 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.10
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
***************************
CURRENT URL: [mysite.com...] CHINESE CHARACTER]
[/END]
In the first case, the characters are HTML-entity-encoded and then hex-encoded, while in the second case, they are only hex-encoded.
So the question is, why does your "form-maker script" do HTML-entity-encoding in the first case, which breaks the URL, and not in the second case?
As this is rather beyond the scope of an Apache server forum, I'd suggest asking about it in the PHP forum (with only a very small code snippet, if required).
Oh, and do beware of copy-and-paste within the browser, which will copy/paste the HTML-entity, not the character itself. If you're doing that instead of typing, then all bets are off.
Jim
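The two layers of encoding can be peeled apart in PHP to confirm this reading of the headers. The two strings below are the "name" values copied from the captured POST bodies; the variable names are just for illustration.

```php
<?php
// The "name" values from the two captured request bodies.
$fromIndexForm = '%26%2320013%3B%26%2322269%3B';
$fromInnerForm = '%E4%B8%AD%E5%9B%BD';

// One layer of percent-decoding, roughly what PHP applies to $_POST.
$a = rawurldecode($fromIndexForm); // HTML numeric entities remain
$b = rawurldecode($fromInnerForm); // raw UTF-8 bytes of the two characters

// A second step is needed only for the entity-encoded value.
$aDecoded = html_entity_decode($a, ENT_QUOTES | ENT_HTML5, 'UTF-8');
```

After the first step the index-form value contains "#" characters, which a browser treats as the start of a fragment. That would explain why the redirect in step 2 ended up requesting only /aa-bb/&.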
-------------
//makeit.php
$name = $_POST["name"];
$cat1 = $_POST["cat1"];
$cat2 = $_POST["cat2"];
Header( "HTTP/1.1 301 Moved Permanently" );
$url = 'http://mysite.com/'.$cat1.'-'.$cat2.'/'.$name;
Header('Location: '.$url);
exit;
-------------
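One possible hardening of the redirect above, offered as a sketch rather than a drop-in fix: percent-encode each value before placing it in the Location header, so that characters such as "?", "#", "&" and spaces stay inside the path segment. redirect_target() is a hypothetical name; the host is the one used in the thread.

```php
<?php
// Build the redirect target with every value percent-encoded.
// rawurlencode() encodes "?" as %3F, "#" as %23 and a space as %20.
function redirect_target(string $cat1, string $cat2, string $name): string
{
    return 'http://mysite.com/' . rawurlencode($cat1) . '-' . rawurlencode($cat2)
        . '/' . rawurlencode($name);
}

// In makeit.php this might be used as:
// header('Location: ' . redirect_target($_POST['cat1'], $_POST['cat2'], $_POST['name']), true, 301);
// exit;
```

This does not undo any HTML-entity encoding that happened before the POST; it only keeps whatever arrived from being cut off at "#" or "?".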
Even though the form within mysite.com/aa-bb/name posts to the same EXACT makeit.php page as the form on the main index page (by posting to "../makeit.php" vs. "makeit.php", same exact file though), it seems that because it is being "launched" from "within" mysite.com/aa-bb/name (e.g. inside a /cat1-cat2/ "directory"), it is INVOKING the .htaccess rule we created and thus being processed correctly (e.g. not HTML-entity-encoded), whereas the form on the index page is NOT invoking the .htaccess rule since it is not inside the /cat1-cat2/ "directory" and thus is being HTML-entity-encoded.
So how can I make the form on the main index page either 1) obey the same .htaccess rules as the form within the "directory" (e.g. with its own .htaccess rule), or 2) be fooled into thinking it is also coming from within a /cat1-cat2/ "directory"?
FORM 1 - Main Index Form (which processes *WITH* HTML-entity-encoding)
-----------------
<form action="makeit.php" method="post">
<p>
Enter:<br/>
<input type="text" name="name" size="20" value="" />
<br />
<select name="cat1" id="cat1">
<option value="aa" >aa</option>
<option value="bb" >bb</option>
<option value="cc" >cc</option>
</select>
to
<select name="cat2" id="cat2">
<option value="aa" >aa</option>
<option value="bb" >bb</option>
<option value="cc" >cc</option>
</select>
<br/>
<input type="submit" value="Submit" />
</p>
</form>
FORM 2 - Form Inside /cat1-cat2/ "Directory" (which processes *WITH NO* HTML-entity-encoding)
-----------------
<form action="../makeit.php" method="post">
<p>
Enter:<br/>
<input type="text" name="name" size="20" value="" />
<br />
<select name="cat1" id="cat1">
<option value="aa" >aa</option>
<option value="bb" >bb</option>
<option value="cc" >cc</option>
</select>
to
<select name="cat2" id="cat2">
<option value="aa" >aa</option>
<option value="bb" >bb</option>
<option value="cc" >cc</option>
</select>
<br/>
<input type="submit" value="Submit" />
</p>
</form>
There are exactly 2 (and only 2) differences between each form.
Difference 1: FORM 1 posts to "makeit.php" while FORM 2 posts to "../makeit.php" since it is inside the /cat1-cat2/ "directory".
Difference 2: FORM 2 is "launched" from within a /cat1-cat2/ directory and FORM 1 is not.
It is Difference 2 that makes me think FORM 2 gets the special .htaccess rule treatment whereas FORM 1 does not.
What are your thoughts?
It's likely in the form, and specifically, with "name." Where does this "name" value come from?
Again, if it was copied-and-pasted from an HTML page, then it will be HTML-entity-encoded, and cause a problem. You could of course prove this by renaming the non-working form to something else (to back it up), and copying the working form into your root directory...
Jim
RewriteRule ^[a-z]{2}\-[a-z]{2}\/ /do.php [L]
You access the original string used by the browser as $_SERVER['REQUEST_URI'] and it includes the question marks (even multiple).
In PHP, you can then parse out the contents (using a variant of jdMorgan's regex):
<?php
preg_match("/^\/([a-z]{2})\-([a-z]{2})\/([^/]*)$/",$_SERVER['REQUEST_URI'],$match);
$cat1=$match[1];
$cat2=$match[2];
$name=$match[3];
If things are encoded, then decode them. html_entity_decode(), for example.
Jim
So, in this code, we're looking at THE_REQUEST -- the raw client request line, and using [NE] to avoid double-encoding this single-encoded embedded question mark in order to bypass all of this automatic decoding/encoding.
We *could* look at the query string from inside PHP, and if it were non-blank, rebuild the literal question mark and move the query string back into the variable where we wanted it in the first place, although this process is not strictly reversible and could be tinkered with maliciously.
But this doesn't affect the problem directly at hand, because this current problem is being caused not by URL-encoding, but by an HTML-entity encoding that is occurring prior to the browser's URL-encoding.
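The "rebuild the question mark inside PHP" idea could be sketched as follows. rebuild_name() is a hypothetical helper, and $rawQuery stands for the original query string however it is made available to the script; that is an assumption, since the rules in this thread do not pass it along.

```php
<?php
// Hypothetical helper: if a query string arrived, treat it as the tail
// of the name and put the "?" back in front of it.
function rebuild_name(string $name, string $rawQuery): string
{
    if ($rawQuery === '') {
        // A trailing "?" with nothing after it cannot be told apart from
        // no "?" at all, which is why the process is not strictly reversible.
        return $name;
    }
    return $name . '?' . rawurldecode($rawQuery);
}
```

Because anyone can append a query string to a URL, the rebuilt name still has to be checked against known values before it is used.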
Jim
If this is true, then I'm going to rescind what I said, and also recommend trying the method vincevincevince proposed above, because I can see no way to make Apache treat URL-paths that can have question marks and/or ampersands in them correctly, unless these characters are encoded and passed in an appended query string. So at this point, I'm out of ideas.
The only change I'd make is to remove the unnecessary escaping in that rule, leaving just:
RewriteRule ^[a-z]{2}-[a-z]{2}/ /do.php [L]
Jim
Vince, I gave what you said a shot but I'm getting an error with the preg_match:
Warning: preg_match(): Unknown modifier ']'
preg_match("/^\/([a-z]{2})\-([a-z]{2})\/([^/]*)$/",$_SERVER['REQUEST_URI'],$match)
What would I need to do to fix this error?
Thanks again.
@Jim: Your solution was working perfectly (precisely as I wanted it) except when using the form in the main index page, which I still can't get my head around. I'll give Vince's advice a shot and if that doesn't work as intended for whatever reason, we can go back to tackling the other issue. Thanks.
I can see this all ending in tears. You're fighting all of the HTTP specs, and a simple change to PHP version, or something else, could bring your whole site down at some unknown time in the future.
preg_match("/^\/([a-z]{2})\-([a-z]{2})\/([^\/]*)$/",$_SERVER['REQUEST_URI'],$match)
Added a \ before the penultimate /.
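Another way to avoid the "Unknown modifier" warning is to use a delimiter other than "/", so the slashes in the pattern need no escaping at all. The sketch below wraps that in a function; parse_friendly_uri() is a hypothetical name, and the i flag is an assumption added to mirror the [NC] flag used in the rules earlier.

```php
<?php
// Parse the raw request URI; "#" as delimiter avoids escaping every "/".
// Returns null when the URI does not have the /xx-yy/name shape.
function parse_friendly_uri(string $requestUri): ?array
{
    if (!preg_match('#^/([a-z]{2})-([a-z]{2})/([^/]*)$#i', $requestUri, $match)) {
        return null;
    }
    return [
        'cat1' => $match[1],
        'cat2' => $match[2],
        // REQUEST_URI is not percent-decoded, so decode the name here.
        'name' => rawurldecode($match[3]),
    ];
}
```

In do.php it might be called as parse_friendly_uri($_SERVER['REQUEST_URI']), with a 404 returned when the result is null.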
I agree with g1smd... you should really be escaping the ? in the sequence BEFORE putting it into the URL. Note that the $_SERVER variables come from the server, so this could break on either PHP or Apache upgrade.
(I suppose you didn't test URLs with # in them yet?)