mod rewrite Question: Processing the "?" symbol

.htaccess question mark

         

NeedExpertHelp

10:17 am on Sep 22, 2009 (gmt 0)

10+ Year Member



Hello everyone, I'm new to WW and had a question about the "?" in .htaccess.

I'm using the following RewriteRule that works great but it does not seem to process the "?" symbol (it seems to completely ignore it):

RewriteEngine On
RewriteRule ^([a-zA-Z][a-zA-Z])-([a-zA-Z][a-zA-Z])/([^/]+)$ /do.php?cat=$1&cat=$2&cat=$3 [L]

For example, if I go to mysite.com/aa-bb/question?, it processes it as:

do.php?cat1=aa&cat2=bb&name=question

and not:

do.php?cat1=aa&cat2=bb&name=question?

Yet if I go directly to:

mysite.com/do.php?cat1=aa&cat2=bb&name=question?

(without making use of the .htaccess rule), it processes the "?" without problems, so I assume the issue is with the .htaccess RewriteRule.

Any ideas on how I can get it to process the "?" instead of ignoring it? (It also ignores anything AFTER the "?" mark, which I don't want either).

Just to be clear, by "process" I mean that if I go to mysite.com/aa-bb/question? and do a $_GET['name'], I would like it to "get" the "?" symbol as well (e.g. I'd like $_GET['name'] to return "question?" and not just "question".)

Maybe one of these resources will lead us on the right track (I couldn't make too much sense out of them, but they seem relevant):

[ask.metafilter.com...]

[webmasterworld.com...]

[forums.devshed.com...]

[evisibility.com...]

[webmasterworld.com...]

Thanks, I appreciate your help!

NeedExpertHelp

10:20 am on Sep 22, 2009 (gmt 0)

10+ Year Member



I did not find an "edit" button so allow me to correct the following:

The RewriteRule should read:

RewriteEngine On
RewriteRule ^([a-zA-Z][a-zA-Z])-([a-zA-Z][a-zA-Z])/([^/]+)$ /do.php?cat=$1&cat=$2&name=$3 [L]

Thanks.

jdMorgan

1:34 pm on Sep 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



RewriteRule cannot "see" query strings appended to the requested URL-path, which is why your code appears to 'ignore' query strings.

If you wish to apply your rule only when the query string is empty, then you'll need to use a RewriteCond to check QUERY_STRING. With several additional tweaks for efficiency, it would look like this:


RewriteEngine on
#
RewriteCond %{QUERY_STRING} =""
RewriteRule ^([a-z]{2})-([a-z]{2})/([^/]+)$ /do.php?cat=$1&cat=$2&name=$3 [NC,L]

Note that this will still rewrite a request with a "?" at the end, but a blank query string. If that is a problem, then a less-efficient but more-thorough solution would be something like this:

RewriteEngine on
#
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[a-z]{2}-[a-z]{2}/[^/\ ]+\ HTTP/
RewriteRule ^([a-z]{2})-([a-z]{2})/([^/]+)$ /do.php?cat=$1&cat=$2&name=$3 [NC,L]

and you may wish to add an additional and similar rule to 301-redirect requests having a trailing "?" or a trailing "?" plus query string to the same URL but with the query string removed, unless you want those non-canonical URLs to return a 404-Not Found.

Jim

NeedExpertHelp

3:02 pm on Sep 22, 2009 (gmt 0)

10+ Year Member



Hi jdMorgan, thanks for your prompt response, I appreciate it. I've seen some of your other posts re: .htaccess and you seem to be an .htaccess GOD. :)

I gave both rules a shot but unfortunately they did not work.

Just to clarify, my site does not use parameters (?) except behind the scenes (e.g. in the rewrite rule). So what I want it to do is if I go to the following URL:

mysite.com/aa-bb/Are you there? Yes.

I want the $_GET['name'] to pick up everything after the last slash, including the literal "?" symbol and everything after it: Are you there? Yes.

Right now, it ignores the "?" and anything after it.

Perhaps I need a Rewrite Rule that will convert the "?" in the URL to a "%3F"?

I say that because if I go to:

mysite.com/aa-bb/Are you there%3F Yes.

It works, but is not user-friendly.

Any ideas?

Thanks again.

jdMorgan

4:33 pm on Sep 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, you're fighting the HTTP protocol, in which "?" is a reserved character, and indicates the end of the URL-path and the beginning of the query string.
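
To make that split concrete, here is a small sketch in PHP (the scripting language used elsewhere in this thread). It relies on the built-in parse_url(); the helper name split_request_target is hypothetical and only illustrates where a generic URL parser draws the line between URL-path and query string. It is not a description of Apache's internals.

```php
<?php
// Hypothetical helper: split a request target the way a generic URL parser does.
// Everything after the first literal "?" is treated as the query string.
function split_request_target($target)
{
    $parts = parse_url($target);
    return [
        'path'  => isset($parts['path']) ? $parts['path'] : '',
        'query' => isset($parts['query']) ? $parts['query'] : '',
    ];
}

// A trailing "?" leaves the path without it and an empty query string.
print_r(split_request_target('/aa-bb/question?'));
// A "?" in the middle cuts the intended name in two.
print_r(split_request_target('/aa-bb/Are%20you%20there?%20Yes.'));
// A percent-encoded "?" (%3F) stays inside the path.
print_r(split_request_target('/aa-bb/Are%20you%20there%3F%20Yes.'));
```

The third call shows why the %3F form of the URL behaves differently: the encoded character is data, not a delimiter.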

Because someone who is up to no good could cause you big problems by appending query strings to your URLs and using that to "confuse" search engines as to the "correct" URLs for your pages, you're basically playing with fire by accepting a question mark for any purpose other than to demarcate a query string. You will likely even see Googlebot occasionally 'testing' your site (I mean even today, regardless of your rules or any future changes to them) to see what your server does with 'bogus' querystrings. If you accept them, Google will consider your site to have an 'infinite URL-space' and will arbitrarily limit the depth to which they are willing to crawl your site.

So, you can either replace that character with a non-reserved character and change it back 'inside' your script, or you can omit it, and inform the user that it's being omitted if that might make any difference to them or to their activities on your site. There's no "user-friendly" way around allowing reserved characters in URLs.

Note that in your example, spaces will also be converted to %20 because, although they're not reserved, they are "restricted." See RFC2396 - Uniform Resource Identifiers (URI): Generic Syntax [faqs.org] if you need more info about characters allowed in various parts of URLs. You *are not* free to allow any character you like, any place in the requested URL or query string.

And as I stated above, "you may wish to add an additional and similar rule to 301-redirect requests having a trailing "?" or a trailing "?" plus query string to the same URL but with the query string removed, unless you want those non-canonical URLs to return a 404-Not Found."

With your new explanation of your goals, this would go hand-in-hand with a change to the database to remove trailing question marks from the expected "Get" values.

Also, should you wish to continue, please be very specific; "It does not work" tells us almost nothing here...

Jim

NeedExpertHelp

7:03 pm on Sep 22, 2009 (gmt 0)

10+ Year Member



Thanks for another great reply Jim.

So you advise against playing around with the "?" (and other special characters) in a URL for security and SEO reasons?

If so, how would I go about adding an "additional and similar rule to 301-redirect requests having a trailing "?" or a trailing "?" plus query string to the same URL but with the query string removed, unless you want those non-canonical URLs to return a 404-Not Found" ?

Also, for the sake of being thorough, when you say "you can replace that character with a non-reserved character and change it back 'inside' your script", how exactly would I do that considering the "?" is not getting passed to my script?

Thanks again, I appreciate your time.

jdMorgan

7:15 pm on Sep 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Search this forum for "redirect remove question mark" and similar.

As for replacing question marks, just a fairly-bad example:
URL was "Are you there?"
Modify your script to link to "Are you there~" where "~" is a non-reserved character replacing the "?".
Use mod_rewrite to rewrite the "Are you there~" URL to your script so you can generate the requested page.
In your script change "~" back to "?" before trying to look up this page in your database to generate this page's content.
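
The four steps above could be sketched in PHP roughly as follows. The function names and the QMARK_PLACEHOLDER constant are hypothetical, and "~" is used only because it is the character in the example; the sketch assumes the placeholder never occurs in a real page name, otherwise the swap is not reversible.

```php
<?php
// Assumption: "~" never appears in real page names; pick another
// non-reserved character if it can.
const QMARK_PLACEHOLDER = '~';

// When generating links: swap "?" for the placeholder, then percent-encode
// whatever else is not URL-safe (spaces and so on).
function name_to_path_segment($name)
{
    return rawurlencode(str_replace('?', QMARK_PLACEHOLDER, $name));
}

// Inside the script, before the database lookup: undo the swap.
// The argument is expected to be percent-decoded already, as $_GET values are.
function path_segment_to_name($decodedSegment)
{
    return str_replace(QMARK_PLACEHOLDER, '?', $decodedSegment);
}

$segment = name_to_path_segment('Are you there?');
echo $segment, "\n";
echo path_segment_to_name(rawurldecode($segment)), "\n";
```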

I am *not* recommending using the "~" character specifically, just any non-reserved character that is not likely to appear in your URLs. There's really no good way to allow the passing of 'free text' except in query strings, but that defeats your whole purpose in trying to use "friendly" URLs.

You'd have a similar kind of problem if you tried to pass a URL like "Are/you/there" because the slash also has meaning to the server.

Jim

NeedExpertHelp

7:40 pm on Sep 22, 2009 (gmt 0)

10+ Year Member



Thanks Jim.

By the way, I got the following code from another forum which actually does exactly what I want it to do and works perfectly:

RewriteCond %{REQUEST_URI} [a-zA-Z][a-zA-Z]-[a-zA-Z][a-zA-Z]/[^/]+
RewriteCond %{THE_REQUEST} ^GET[\ ]/([a-zA-Z][a-zA-Z])-([a-zA-Z][a-zA-Z])/([^/\ ]+)[\ ]HTTP
RewriteRule .* /do.php?cat=%1&cat=%2&name=%3 [NE,L]

By "works", I mean it passes on the "?" and everything after it and keeps the URL "clean" and intact.

Do you advise against using this solution?


jdMorgan

8:39 pm on Sep 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, that rule doesn't comprehend the optimizations I posted above... It doesn't work for anything but GET requests either. The patterns aren't anchored, leading to ambiguity as to what will be matched in certain circumstances. There are some 'strange' regex subpatterns in there that indicate that the author doesn't understand regular expressions very well. The first RewriteCond is not needed, as that function can be done in the rule itself.

Try this tweak:


RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([a-zA-Z]{2})-([a-zA-Z]{2})/([^/\ ]+)\ HTTP/
RewriteRule ^[a-z]{2}-[a-z]{2}/[^/]+$ /do.php?cat=%1&cat=%2&name=%3 [NC,NE,L]

Note that [NC] on the rule makes checking for [A-Z] unnecessary, since the check is case-insensitive and checking for [a-z] will therefore suffice. The same trick can't really be used in the RewriteCond without introducing excessive permissiveness regarding the requested HTTP method and protocol, though - both will be uppercase-only unless the request is spoofed by a user-agent that sends invalid requests.

Also, why do you have two query string parameters both named "cat"? -- that looks dangerous, as you never know how future versions of PHP might handle that -- i.e. one name/value pair could be dropped or get overwritten by the other, and become unavailable to your script.

If that rule works for you, great -- But be aware of the 'bogus query string problems' I warned about above, as the rest of the Web is going to treat the question mark as a query string delimiter and treat anything after that question mark as a query string, regardless of how you treat it internal to your site.

Technically, what you are doing here 'breaks the rules' about URLs, and you are going to have to handle the problems that arise as a result; Be very sure that your script will return a 404-Not Found unless the values of cat1, cat2 and name are valid and can be found in your database.
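
A minimal sketch of that check in PHP, under stated assumptions: the $knownPages array is a hypothetical stand-in for the database, and status_for_request is an illustrative name, not code from the original site. The function only decides which status to send; the actual header() call is left as a comment.

```php
<?php
// Hypothetical stand-in for the database: the (cat1, cat2, name) triples that exist.
$knownPages = [
    'aa|bb|question?' => true,
];

// Return the HTTP status the script should send for the given parameters.
function status_for_request(array $get, array $knownPages)
{
    foreach (['cat1', 'cat2', 'name'] as $param) {
        if (!isset($get[$param]) || !is_string($get[$param])) {
            return 404; // missing or malformed parameter
        }
    }
    // Categories must be exactly two letters, mirroring the rewrite pattern.
    if (!preg_match('/^[a-z]{2}\z/i', $get['cat1']) || !preg_match('/^[a-z]{2}\z/i', $get['cat2'])) {
        return 404;
    }
    $lookup = strtolower($get['cat1']) . '|' . strtolower($get['cat2']) . '|' . $get['name'];
    return isset($knownPages[$lookup]) ? 200 : 404;
}

$status = status_for_request(['cat1' => 'aa', 'cat2' => 'bb', 'name' => 'no-such-page'], $knownPages);
if ($status === 404) {
    // In the real script: header('HTTP/1.1 404 Not Found'); then output an error page.
    echo "404 Not Found\n";
}
```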

Jim

NeedExpertHelp

9:03 pm on Sep 22, 2009 (gmt 0)

10+ Year Member



Jim, you are a GENIUS! It worked like a charm (plus great advice). I like any code that is shorter and more efficient, so you take the cake.

As for the two "cat" parameters, that was simply my typo (it should be cat1 and cat2).

By the way, when I go to mysite.com/aa-bb/ and enter nothing after the slash (or forgo the last slash altogether), I get a 404 error. Based on your .htaccess script, do you know where this actually tries to take me behind the scenes? I would like to set up a "main" page for each category pair if nothing is entered after the slash (or if the slash is omitted).

Thanks again for your time and continued assistance.

jdMorgan

1:37 am on Sep 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Behind the scenes it takes you nowhere, because the rule requires at least one character past the "/aa-bb/".

The server will therefore use its "default URL-to-filename resolution," and attempt to take you to the physical index page in the physical /aa-bb/ subdirectory. Since neither exists, you get a 404.

If you like, you can make your rule accept a blank path after /aa-bb/ as long as your script can do something with such a request: Simply change the "+" character on the final subpattern (in both lines) to a "*" character, changing the quantifier from 'match one or more' to 'match zero or more'.
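
The difference between the two quantifiers can be checked outside Apache. This sketch uses PHP's preg_match() on the path as a RewriteRule in .htaccess would see it (without the leading slash); the variable names are illustrative only.

```php
<?php
// Same path shape as the rule; the "i" flag plays the role of [NC].
$onePlus  = '/^[a-z]{2}-[a-z]{2}\/[^\/]+$/i';  // "+" means one or more characters
$zeroPlus = '/^[a-z]{2}-[a-z]{2}\/[^\/]*$/i';  // "*" means zero or more characters

foreach (['aa-bb/question', 'aa-bb/', 'aa-bb'] as $path) {
    printf("%-16s +:%d  *:%d\n", $path, preg_match($onePlus, $path), preg_match($zeroPlus, $path));
}
```

Note that neither pattern matches "aa-bb" with no trailing slash, so that request would still need its own rule or a redirect.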

Jim

NeedExpertHelp

6:59 pm on Sep 23, 2009 (gmt 0)

10+ Year Member



Hi Jim,

I changed the "+" to "*" and it worked brilliantly, thanks!

I've been testing it all day today and so far there are no kinks.

I'll let you know if I have any problems.

Thanks again for all your help, you're a master of your craft and I appreciate your time helping other lesser souls. :)

NeedExpertHelp

10:05 pm on Sep 23, 2009 (gmt 0)

10+ Year Member



Hi again Jim,

I found a very bizarre error that I can't get my head around and was hoping you could give me another hand.

On my main index page (mysite.com), I have a form with 3 fields for cat1, cat2, and name. That form simply posts those 3 variables to mysite.com/makeit.php, which itself simply does a simple header redirect to:

mysite.com/cat1-cat2/name

Which, as we already know from the .htaccess, gets processed as

mysite.com/do.php?cat1=%1&cat2=%2&name=%3

That was working fine until I tested it with non-standard characters, such as Chinese characters.

For example, if on the main index page form I input:

cat1="aa"
cat2="bb"
name=[Chinese characters aren't getting processed correctly by this forum, so the character in my example is the one Google displays here: [google.com...]]

then instead of the Chinese characters getting preserved in Chinese, they change to "&#20013;&#22269;" [literally] (as read by makeit.php) and then as an empty string "" when makeit.php does the header redirect to mysite.com/cat1-cat2/name.

Here is where it gets even more bizarre, if I go to mysite.com/aa-bb/[proper chinese character] directly via the URL, it works fine and the "[proper chinese character]" gets read/processed as it is (e.g. not modified in any way). Furthermore, I have the same form from the main index page on each dynamic page found at mysite.com/cat1-cat2/name and when I submit the same exact variables ("aa","bb","[proper chinese character]") via that form (which posts to the exact same makeit.php), it also gets read/processed correctly.

So the problem ONLY happens when using NON-LATIN characters via the MAIN INDEX Form.

This is very bizarre and I hope I explained myself correctly. Even though the form within mysite.com/aa-bb/name posts to the same makeit.php page as the form on the main index page (by posting to "../makeit.php"), it seems that because it is being "launched" from "within" mysite.com/aa-bb/name, it is INVOKING the .htaccess rule we created and thus being processed correctly (e.g. not modified in any way).

So I had two questions:

1) First of all, is my use of the makeit.php intermediary "post" file to keep the URL clean even when the variables are submitted via a form (as opposed to directly via the URL) the best way of going about this (it's the only way I could figure out)?

2) What is causing the current problem with non-standard characters being submitted via the main index page form and how can I fix it?

Thanks again, I really appreciate it.

g1smd

10:07 pm on Sep 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Test it with URL requests that you expect to be non-valid, and make sure you also get the correct operation for those.

Also try www and non-www, with and without extra parameters, and anything else that might 'break' it.

jdMorgan

10:22 pm on Sep 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd recommend that you install the "Live HTTP Headers" add-on for Firefox/Mozilla browsers, and find out at which step the characters are being hex-encoded or HTML-entity-encoded (or both). You can't really make any progress until you narrow down the problem's location...

The only good news is that your testing as described above indicates that the problem isn't in the rewriterule.

Jim

NeedExpertHelp

10:53 pm on Sep 23, 2009 (gmt 0)

10+ Year Member



Hi g1smd, thanks for jumping in. I'm not sure what you mean, could you please be more specific?

@Jim: I ran the Live HTTP Headers and here are the results step-by-step. Hopefully this will help us figure out where the problem is and how to fix it. Thanks again!

[START]

@@@@@@@@@@@@@@@@@@@@@@@@@@@
STEP 1: LOADING MAININDEX PAGE

###########################
#request# GET [mysite.com...]
GET /
#request# GET [mysite.com...]
###########################

***************************
[mysite.com...]

GET / HTTP/1.1
Host: mysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 200 OK
Date: Wed, 23 Sep 2009 22:33:43 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.10
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
***************************

@@@@@@@@@@@@@@@@@@@@@@@@@@@
CURRENT URL: [mysite.com...]
STEP 2: SUBMITTING FORM WITH CHINESE CHARACTER:

###########################
#request# POST [mysite.com...]
POST /makeit.php name=%26%2320013%3B%26%2322269%3B&cat1=aa&cat2=bb
#request# GET [mysite.com...]
#redirect# GET /aa-bb/&
#request# GET [mysite.com...]
###########################

***************************
[mysite.com...]

POST /makeit.php HTTP/1.1
Host: mysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: [mysite.com...]
Content-Type: application/x-www-form-urlencoded
Content-Length: 47
name=%26%2320013%3B%26%2322269%3B&cat1=aa&cat2=bb
HTTP/1.x 301 Moved Permanently
Date: Wed, 23 Sep 2009 22:34:50 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.10
Location: [mysite.com...]
Content-Length: 0
Connection: close
Content-Type: text/html
----------------------------------------------------------
[mysite.com...]

GET /aa-bb/& HTTP/1.1
Host: mysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: [mysite.com...]

HTTP/1.x 200 OK
Date: Wed, 23 Sep 2009 22:34:50 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.10
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
***************************

@@@@@@@@@@@@@@@@@@@@@@@@@@@
CURRENT URL: [mysite.com...]

STEP 3: SUBMITTING FORM AGAIN WITH CHINESE CHARACTER BUT WITHIN NEW URL (ABOVE)

###########################
#request# POST [mysite.com...]
POST /makeit.php name=%E4%B8%AD%E5%9B%BD&cat1=aa&cat2=bb
#request# GET [mysite.com...]
#redirect# GET /aa-bb/%E4%B8%AD%E5%9B%BD
#request# GET [mysite.com...]
###########################

***************************
[mysite.com...]

POST /makeit.php HTTP/1.1
Host: mysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: [mysite.com...]
Content-Type: application/x-www-form-urlencoded
Content-Length: 37
name=%E4%B8%AD%E5%9B%BD&cat1=aa&cat2=bb
HTTP/1.x 301 Moved Permanently
Date: Wed, 23 Sep 2009 22:35:56 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.10
Location: [mysite.com...]
Content-Length: 0
Connection: close
Content-Type: text/html
----------------------------------------------------------
[mysite.com...]

GET /aa-bb/%E4%B8%AD%E5%9B%BD HTTP/1.1
Host: mysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: [mysite.com...]

HTTP/1.x 200 OK
Date: Wed, 23 Sep 2009 22:35:57 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_perl/2.0.4 Perl/v5.8.8
X-Powered-By: PHP/5.2.10
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
***************************

CURRENT URL: [mysite.com...] CHINESE CHARACTER]

[/END]

jdMorgan

11:11 pm on Sep 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, that's far too much to read in detail, but it's obvious that makeit.php is posting differently in the two cases.

In the first case, the characters are HTML-entity-encoded and then hex-encoded, while in the second case, they are only hex-encoded.

So the question is, why does your "form-maker script" do HTML-entity-encoding in the first case, which breaks the URL, and not in the second case?
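
The two encodings can be peeled apart in PHP to see the difference. This is only an illustrative sketch: the two strings are the "name" values from the POST bodies in the header log above, and the variable names are made up for the example.

```php
<?php
// The two "name" values captured in the Live HTTP Headers log above.
$fromIndexForm = '%26%2320013%3B%26%2322269%3B'; // STEP 2 (main index form)
$fromInnerForm = '%E4%B8%AD%E5%9B%BD';           // STEP 3 (form inside /aa-bb/)

// One layer of percent-decoding, which PHP already applies when filling $_POST.
$step2 = urldecode($fromIndexForm); // numeric HTML entities: "&#20013;&#22269;"
$step3 = urldecode($fromInnerForm); // raw UTF-8 bytes for the two characters

// Only the STEP 2 value needs a second pass, this time HTML-entity decoding.
$step2Decoded = html_entity_decode($step2, ENT_QUOTES, 'UTF-8');

var_dump($step2);
var_dump($step2Decoded === $step3); // both forms carried the same two characters
```

A common cause of this pattern, not confirmed anywhere in this thread, is a form page whose character encoding cannot represent the typed characters; browsers then substitute numeric character references before URL-encoding the form data.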

As this is rather beyond the scope of an Apache server forum, I'd suggest asking about it in the PHP forum (with only a very small code snippet, if required).

Oh, and do beware of copy-and-paste within the browser, which will copy/paste the HTML-entity, not the character itself. If you're doing that instead of typing, then all bets are off.

Jim

NeedExpertHelp

11:22 pm on Sep 23, 2009 (gmt 0)

10+ Year Member



That's the thing, I don't think this has to do with PHP because all the makeit.php file does is take the cat1, cat2, and name variables from "post" and redirect/reorganize them to [mysite.com...] Here is the 9-line makeit.php script:

-------------
//makeit.php
$name = $_POST["name"];
$cat1 = $_POST["cat1"];
$cat2 = $_POST["cat2"];

Header( "HTTP/1.1 301 Moved Permanently" );
$url = 'http://mysite.com/'.$cat1.'-'.$cat2.'/'.$name;
Header('Location: '.$url);
exit;
-------------
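
One thing the script above does not do is percent-encode the values before placing them in the Location header. A hedged sketch of how that could look is below; build_redirect_url is a hypothetical helper, not part of the original makeit.php, and whether encoded URLs are acceptable depends on how "clean" the address bar needs to be.

```php
<?php
// Hypothetical helper for makeit.php: build the redirect target from POSTed values.
// rawurlencode() keeps "?", "#", "&" and non-ASCII bytes from being read
// as URL syntax by the browser that follows the redirect.
function build_redirect_url($cat1, $cat2, $name)
{
    return 'http://mysite.com/'
        . rawurlencode($cat1) . '-' . rawurlencode($cat2)
        . '/' . rawurlencode($name);
}

echo build_redirect_url('aa', 'bb', '&#20013;&#22269;'), "\n";
echo build_redirect_url('aa', 'bb', 'Are you there? Yes.'), "\n";
```

Without the encoding, a value such as "&#20013;&#22269;" puts a literal "#" into the URL, and a browser treats everything from "#" onward as a fragment that is never sent to the server. That would be consistent with the "GET /aa-bb/&" request in the STEP 2 log.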

Even though the form within mysite.com/aa-bb/name posts to the same EXACT makeit.php page as the form on the main index page (by posting to "../makeit.php" vs. "makeit.php", same exact file though), it seems that because it is being "launched" from "within" mysite.com/aa-bb/name (e.g. inside a /cat1-cat2/ "directory"), it is INVOKING the .htaccess rule we created and thus being processed correctly (e.g. not HTML-entity-encoded), whereas the form on the index page is NOT invoking the .htaccess rule since it is not inside the /cat1-cat2/ "directory" and thus is being HTML-entity-encoded.

So how can I make the form on the main index page either 1) obey the same .htaccess rules as the form within the "directory" (e.g. with its own .htaccess rule), or 2) be fooled into thinking it is also coming from within a /cat1-cat2/ "directory"?

jdMorgan

11:36 pm on Sep 23, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's not makeit.php that is doing this, it is whatever POSTs to makeit.php. So look at the forms that are doing the posting.

Jim

NeedExpertHelp

11:54 pm on Sep 23, 2009 (gmt 0)

10+ Year Member



Here are both forms:

FORM 1 - Main Index Form (which processes *WITH* HTML-entity-encoding)
-----------------

<form action="makeit.php" method="post">
<p>
Enter:<br/>
<input type="text" name="name" size="20" value="" />
<br />
<select name="cat1" id="cat1">
<option value="aa" >aa</option>
<option value="bb" >bb</option>
<option value="cc" >cc</option>
</select>
to
<select name="cat2" id="cat2">
<option value="aa" >aa</option>
<option value="bb" >bb</option>
<option value="cc" >cc</option>
</select>
<br/>
<input type="submit" value="Submit" />
</p>
</form>

-----------------

FORM 2 - Form Inside /cat1-cat2/ "Directory" (which processes *WITH NO* HTML-entity-encoding)
-----------------

<form action="../makeit.php" method="post">
<p>
Enter:<br/>
<input type="text" name="name" size="20" value="" />
<br />
<select name="cat1" id="cat1">
<option value="aa" >aa</option>
<option value="bb" >bb</option>
<option value="cc" >cc</option>
</select>
to
<select name="cat2" id="cat2">
<option value="aa" >aa</option>
<option value="bb" >bb</option>
<option value="cc" >cc</option>
</select>
<br/>
<input type="submit" value="Submit" />
</p>
</form>

-----------------

There are exactly 2 (and only 2) differences between each form.

Difference 1: FORM 1 posts to "makeit.php" while FORM 2 posts to "../makeit.php" since it is inside the /cat1-cat2/ "directory".

Difference 2: FORM 2 is "launched" from within a /cat1-cat2/ directory and FORM 1 is not.

It is Difference 2 that makes me think FORM 2 gets the special .htaccess rule treatment whereas FORM 1 does not.

What are your thoughts?

jdMorgan

12:17 am on Sep 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've already said there is no difference in .htaccess treatment... Or if there is, it is because the POSTed data is HTML-entity-encoded. The problem occurs *before* .htaccess is processed.

It's likely in the form, and specifically, with "name." Where does this "name" value come from?

Again, if it was copied-and-pasted from an HTML page, then it will be HTML-entity-encoded, and cause a problem. You could of course prove this by renaming the non-working form to something else (to back it up), and copying the working form into your root directory...

Jim

vincevincevince

12:17 am on Sep 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I know you have gone a long way on the mod_rewrite path here but I am going to suggest cutting off the last parameter (ignoring it) and reading it from PHP. As much as mod_rewrite is amazing, it still likes to think of strings as URLs ... you are trying to break all the URL/HTTP rules and so it is an uphill battle. Once it is in PHP, PHP does not care that it 'should' be a URL... a string is a string.


RewriteRule ^[a-z]{2}\-[a-z]{2}\/ /do.php [L]

You access the original string used by the browser as $_SERVER['REQUEST_URI'] and it includes the question marks (even multiple).

In PHP, you can then parse out the contents (using a variant of jdMorgan's regex):

<?php
preg_match("/^\/([a-z]{2})\-([a-z]{2})\/([^/]*)$/",$_SERVER['REQUEST_URI'],$match);
$cat1=$match[1];
$cat2=$match[2];
$name=$match[3];

If things are encoded, then decode them. html_entity_decode(), for example.


jdMorgan

12:19 am on Sep 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, you could read the vars in PHP, but the problem is that REQUEST_URI isn't the same as THE_REQUEST, because it is un-encoded, which caused problems seen in the first several posts of this thread. And unfortunately, as I understand it, THE_REQUEST is a 'private' mod_rewrite variable, and not visible to PHP.

Jim

vincevincevince

12:50 am on Sep 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Jim; your point is almost exactly mine... it is the URL processing and encoding which is causing problems here.

Chinese, for example, will be seen as: %E4%B8%AD%E5%9C%8B
That can be easily changed by: urldecode("%E4%B8%AD%E5%9C%8B")

jdMorgan

2:15 am on Sep 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, but see the bit about the question mark embedded in the URL-path -- That is, a question mark *not* intended as a query_string delimiter. If we let the content-handler get hold of the URL before reformatting it, any requested URL-path containing an encoded question mark will get un-encoded, the question mark will be removed, and anything following it will get moved into a query string that was never intended or expected to exist.

So, in this code, we're looking at THE_REQUEST -- the raw client request line, and using [NE] to avoid double-encoding this single-encoded embedded question mark in order to bypass all of this automatic decoding/encoding.

We *could* look at the query string from inside PHP, and if it were non-blank, rebuild the literal question mark and move the query string back into the variable where we wanted it in the first place, although this process is not strictly reversible and could be tinkered with maliciously.

But this doesn't affect the problem directly at hand, because this current problem is being caused not by URL-encoding, but by an HTML-entity encoding that is occurring prior to the browser's URL-encoding.

Jim

vincevincevince

3:09 am on Sep 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ah; PHP's REQUEST_URI does in fact show you the whole string, e.g.

/ab-cd/hello?are --> All goes to $_SERVER['REQUEST_URI']

From there, the PHP can be used to decode (once, twice, does not matter...):

e.g.
html_entity_decode(urldecode($name));

jdMorgan

3:50 am on Sep 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> You access the original string used by the browser as $_SERVER['REQUEST_URI'] and it includes the question marks (even multiple).

If this is true, then I'm going to rescind what I said, and also recommend trying the method vincevincevince proposed above, because I can see no way to make Apache treat URL-paths that can have question marks and/or ampersands in them correctly, unless these characters are encoded and passed in an appended query string. So at this point, I'm out of ideas.

The only change I'd make is to remove the unnecessary escaping in that rule, leaving just:


RewriteRule ^[a-z]{2}-[a-z]{2}/ /do.php [L]

Since we're going to pull the 'variables' out in PHP itself, the RewriteCond is no longer needed.

Jim

NeedExpertHelp

8:21 am on Sep 24, 2009 (gmt 0)

10+ Year Member



Interesting discussion guys, thanks.

Vince, I gave what you said a shot but I'm getting an error with the preg_match:

Warning: preg_match(): Unknown modifier ']'

preg_match("/^\/([a-z]{2})\-([a-z]{2})\/([^/]*)$/",$_SERVER['REQUEST_URI'],$match)

What would I need to do to fix this error?

Thanks again.

@Jim: Your solution was working perfectly (precisely as I wanted it) except when using the form in the main index page, which I still can't get my head around. I'll give Vince's advice a shot and if that doesn't work as intended for whatever reason, we can go back to tackling the other issue. Thanks.

g1smd

10:06 am on Sep 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You probably have to escape the question mark, otherwise it might be processed as a modifier for the string comparison.

I can see this all ending in tears. You're fighting all of the HTTP specs, and a simple change to PHP version, or something else, could bring your whole site down at some unknown time in the future.

vincevincevince

12:02 pm on Sep 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try:
preg_match("/^\/([a-z]{2})\-([a-z]{2})\/([^\/]*)$/",$_SERVER['REQUEST_URI'],$match)

Added a \ before the penultimate /.

I agree with g1smd... you should really be escaping the ? in the sequence BEFORE putting it into the URL. Note that the $_SERVER variables come from the server, so this could break on either PHP or Apache upgrade.

(I suppose you didn't test URLs with # in them yet?)
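
Putting the corrected pattern together with the decoding step, a sketch of the parsing could look like this. The function name parse_request_uri is hypothetical, the "i" flag is an added assumption to mirror [NC], and "#" is used as the regex delimiter purely so the slashes need no escaping.

```php
<?php
// Hypothetical helper: pull cat1, cat2 and name out of the raw request
// target, e.g. the value of $_SERVER['REQUEST_URI'].
function parse_request_uri($requestUri)
{
    // "#" as the pattern delimiter avoids escaping every "/" in the pattern.
    if (!preg_match('#^/([a-z]{2})-([a-z]{2})/([^/]*)$#i', $requestUri, $match)) {
        return null; // caller should answer 404
    }
    return [
        'cat1' => $match[1],
        'cat2' => $match[2],
        // rawurldecode() leaves "+" alone; urldecode() would turn it into a space.
        'name' => rawurldecode($match[3]),
    ];
}

print_r(parse_request_uri('/aa-bb/Are%20you%20there?%20Yes.'));
print_r(parse_request_uri('/aa-bb/%E4%B8%AD%E5%9B%BD'));
var_dump(parse_request_uri('/aa-bb/extra/segment'));
```

A literal "#" in the address is a different matter: browsers keep the fragment to themselves, so nothing after it appears in the request and no server-side parsing can recover it.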
