Regex in RewriteRule

         

fmchris

7:37 am on Jul 11, 2009 (gmt 0)

10+ Year Member



Okay, I have the following rewrite rule at the moment:

RewriteRule ^([a-z0-9_-]+)/?([a-z0-9_-]+)?(\.png|/)?$ generation/image.php?user=$1&template=$2 [NC,L]

It accepts URLs that look like the following:

user_string_here/template_string_here(.png)

And obviously, the .png is optional.

Now the problem is I want to pass URL encoded characters to this. When I change it to look like this:

RewriteRule ^([a-z0-9_-%]+)/?([a-z0-9_-]+)?(\.png|/)?$ generation/image.php?user=$1&template=$2 [NC,L]

I get a server error. Any idea on how I can allow url encoded characters to pass through here?

jdMorgan

1:55 pm on Jul 11, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have you tested URL-encoded characters with your original code? The reason I ask is that RewriteRule usually "sees" fully-decoded URL-paths, so I'm not sure that you even need to 'handle' them in any special way.

Be aware that the hyphen is a special character inside a character class (square brackets), where it specifies a range, as in "a-z" in your pattern. In "[a-z0-9_-%]" the sequence "_-%" is read as a range from "_" to "%", which is out of order, so the pattern fails to compile and you get the server error. It's a good idea to always escape a literal hyphen, as in "[a-z0-9_\-]". If indeed you need to include "%" in your group, then escaping the hyphen or putting the "%" before the hyphen will fix that error.
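To see the hyphen problem in isolation, here is a minimal sketch that uses Python's re module as a stand-in. That is an assumption on my part: Apache compiles patterns with PCRE, not Python's engine, but both reject a range whose endpoints are out of order, and "_" sorts after "%". The helper name compiles() is made up for this illustration.

```python
import re

def compiles(pattern):
    """Return True if the regex compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# "_-%" is read as a range from "_" (0x5F) to "%" (0x25): out of order.
broken = compiles(r"^([a-z0-9_-%]+)$")

# Escaping the hyphen, or moving it to the end of the class, makes it literal.
escaped = compiles(r"^([a-z0-9_\-%]+)$")
hyphen_last = compiles(r"^([a-z0-9_%-]+)$")

print(broken, escaped, hyphen_last)  # False True True
```

Only the first variant is rejected, which is consistent with the server error appearing as soon as "%" was appended after the hyphen.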

I also suggest that you include the first slash within the optional group to speed up parsing and make the assignment of URL-path-parts to back-references more predictable:


RewriteRule ^([a-z0-9_\-%]+/)?([a-z0-9_\-]+)(\.png|/)?$ generation/image.php?user=$1&template=$2 [NC,L]

However, you need to be aware of a lurking problem with this: As with your original, the "user" can be blank, and the path can end with ".png" or "/" or nothing. Further, "png" can be uppercase, lowercase, or mixed-case. This means that many variations of the URL will result in serving the exact same content, which may be a problem if you want this content to rank in search. It would be better to decide which is the single "correct" (canonical) URL format, and externally redirect the variant formats to the correct format before doing the internal rewrite.

For more info, search for "duplicate content" and "canonical URL" in our Google forum.

Jim

fmchris

5:10 pm on Jul 11, 2009 (gmt 0)

10+ Year Member



Hi, I purposely wrote the code to have three different possible extensions, so that's not a problem. It's not indexed content.

The slash is outside of the first group because I'm parsing the values with PHP, and if I leave the slash in I have to parse it out inside the PHP code anyway.

In the original code, if I visit:

generation/image.php?user=test-%23test&template=test

It returns test-#test and test as the values, respectively. However, if I visit:

/test-%23test/test

I get a 404.

fmchris

5:16 pm on Jul 11, 2009 (gmt 0)

10+ Year Member



On second thought, I'm wondering if for some reason the RegEx is unencoding the URL, so it would redirect to

/test-#test/test

Which is invalid, because # is a special character in URLs (it marks the start of the fragment).

jdMorgan

5:50 pm on Jul 11, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The regex doesn't un-encode the URL, but Apache does -- as I noted above.
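A rough way to picture what the rule is matching against, sketched in Python rather than Apache (an assumption: urllib.parse.unquote stands in for Apache's own decoding step, re stands in for PCRE, and the leading slash is dropped as it would be in a per-directory context):

```python
import re
from urllib.parse import unquote

# The original pattern, and a variant with "%" added to the first class.
# re.IGNORECASE stands in for the [NC] flag.
ORIGINAL = re.compile(r"^([a-z0-9_-]+)/?([a-z0-9_-]+)?(\.png|/)?$", re.IGNORECASE)
WITH_PERCENT = re.compile(r"^([a-z0-9_\-%]+)/?([a-z0-9_-]+)?(\.png|/)?$", re.IGNORECASE)

requested = "test-%23test/test"   # what the client sent
decoded = unquote(requested)      # what the pattern is compared with

print(decoded)                                   # test-#test/test
print(ORIGINAL.match(decoded) is None)           # True: "#" is not in the class
print(WITH_PERCENT.match(decoded) is None)       # True: adding "%" does not help
```

Because the decoded path contains a literal "#", neither pattern matches, no rewrite happens, and the request falls through to a 404.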

If you need to leave it encoded, then try the following:

Use a RewriteCond examining %{THE_REQUEST} to capture the original, still-encoded client request in a back-reference (e.g. %1). Then use that back-reference in the RewriteRule substitution, and add the [NE] flag (if necessary) to prevent double-encoding.

To exclude the trailing slash on the path, use "^(([a-z0-9_\-%]+)/)?([a-z0-9_\-]+)(\.png|/)?$" and then use $2 and $3 instead of $1 and $2 in the substitution. This puts the "optional-path-boundary" where it should be, but excludes the trailing slash.
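The group numbering is easier to check with a quick test. This is a sketch in Python's re module (an assumption: it is not the PCRE engine Apache uses, but group numbering works the same way), and the sample paths are invented for illustration:

```python
import re

# Outer group 1 includes the slash; inner group 2 is the user without it.
PATTERN = re.compile(r"^(([a-z0-9_\-%]+)/)?([a-z0-9_\-]+)(\.png|/)?$", re.IGNORECASE)

both = PATTERN.match("some_user/some_template.png")
print(both.group(1))  # some_user/
print(both.group(2))  # some_user      -> would be $2
print(both.group(3))  # some_template  -> would be $3
print(both.group(4))  # .png

# With no user part, the whole optional group is skipped.
template_only = PATTERN.match("some_template")
print(template_only.group(2))  # None
print(template_only.group(3))  # some_template
```

In mod_rewrite an unset group expands to an empty string, so the second case would pass an empty "user" parameter.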

Jim

fmchris

6:50 pm on Jul 11, 2009 (gmt 0)

10+ Year Member



Alright, I read the documentation on %{THE_REQUEST}, and from what I see I should have something along these lines to parse the URI out of the HTTP request:

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^.]+\ HTTP/
RewriteRule ^(([a-z0-9_\-%]+)/)?([a-z0-9_\-]+)(\.png|/)?$ generation/image.php?user=$2&template=$3 [NC,L]

Now, I don't understand how to use the backreference from RewriteCond in RewriteRule.

I did some research on the issue I'm having, and the "easy" fix is usually recommended as being double-encoding the URL. However, that won't work in this case since I'm not generating the URLs.

fmchris

7:14 pm on Jul 11, 2009 (gmt 0)

10+ Year Member



Ah... I found an interesting Apache 2.2 mod_rewrite flag that just saved me a lot of time...

[httpd.apache.org...]

Adding the B flag to my RewriteRule has apparently solved the problem.

jdMorgan

8:03 pm on Jul 11, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is what I meant:

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /(([a-z0-9_\-%]+)/)?([a-z0-9_\-]+)(\.png|/)?\ HTTP/ [NC]
RewriteRule ^ generation/image.php?user=%2&template=%3 [L]
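For anyone who wants to check what %2 and %3 would capture, here is a small sketch of the same pattern in Python (an assumption: Python's re stands in for PCRE, the escaped spaces are written as plain spaces, and the sample request lines are made up):

```python
import re

# Same pattern as the RewriteCond; re.IGNORECASE stands in for [NC].
REQUEST_LINE = re.compile(
    r"^[A-Z]+ /(([a-z0-9_\-%]+)/)?([a-z0-9_\-]+)(\.png|/)? HTTP/",
    re.IGNORECASE,
)

m = REQUEST_LINE.match("GET /test-%23test/test HTTP/1.1")
user, template = m.group(2), m.group(3)
print(user)      # test-%23test  (still encoded, unlike the decoded URL-path)
print(template)  # test

# A request with a query string does not match this pattern as written.
with_query = REQUEST_LINE.match("GET /test-%23test/test?x=1 HTTP/1.1")
print(with_query)  # None
```

The request line keeps the "%23" exactly as the client sent it, which is the point of matching on THE_REQUEST instead of the decoded path.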

You may or may not need the [B] flag -- I'm not sure.

Jim