Regular Expression Almost Working - Apache Web Server forum at WebmasterWorld - WebmasterWorld

Forum Moderators: phranque


Regular Expression Almost Working

Not picking up my trailing character

Merganser

5:30 am on Feb 3, 2012 (gmt 0)

10+ Year Member

I use the following RewriteRule (I have some rewrite conditions above it):

RewriteRule ^page1/([^\ ]+)/?$ - [F]

My intent is for the rule to capture everything after the "page1/" in the $1 variable (which is then used in the rewrite conditions). But I am also trying to strip any trailing / out of the $1 variable (if it exists).

So, I think I am instructing it to capture all characters (up to a space, which should not be present) that follow the page1/ but, if a slash is present as the last character, to consider it not included within the $1 variable.

The problem is that if a trailing slash is present, it remains included within the $1 variable. I figure that if a trailing / exists it likely gets included in the $1 because it is not a space, and then the /?$ resolves to "zero slashes were present at the end". So, a match is made but with the / in the $1 variable.

I am at a bit of a loss because I want to strip the trailing slash from the variable but not any intermediate slashes which may be present.

Any help on a syntax solution would be greatly appreciated.

g1smd

8:07 am on Feb 3, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Find "page1" followed by a "slash", then read anything that is "not a space" (so that will include any slashes) and be "greedy" with that read (grab everything), and optionally follow with a "slash" (but it never will, because the "not a space" bit already grabbed it).
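To make the greedy read concrete, here is a minimal sketch that runs the original pattern outside the server. It assumes Python's re module behaves like the PCRE engine in mod_rewrite for a pattern this simple, and the helper name captured is made up for illustration.

```python
import re

# The pattern from the original RewriteRule, unchanged.
ORIGINAL = re.compile(r"^page1/([^\ ]+)/?$")


def captured(path):
    """Return what $1 would hold for this path, or None if the rule does not match."""
    m = ORIGINAL.match(path)
    return m.group(1) if m else None


# The greedy "not a space" class swallows the trailing slash,
# so the optional /? is left with nothing to match.
print(captured("page1/abc/def/"))  # abc/def/
print(captured("page1/abc/def"))   # abc/def
```

Both requests match, but only the second gives a $1 without the trailing slash.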

So, define in clear English what you want. Perhaps the middle part will be "not a space or slash", or perhaps it will begin with some other thing before the final part is "not a space or slash" followed by "slash"?

lucy24

8:35 am on Feb 3, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Edit: OK, so I type much much slower than g1 :P

You do not need to exclude spaces, because they will never be present in your URL anyway. Ahem. Right? If any of your page names do contain spaces, stop messing about with mod_rewrite and get those pages renamed before anything else.

Can there be more than one directory after "page1/" ? If not, then it is easy because you constrain the Rule to [^/]+ meaning "pick up everything you meet until you come to a slash". Or, if you know how many directories there are, package it as ([^/]+/[^/]+) et cetera for the appropriate number of directories.
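A quick sketch of the fixed-depth case, tried outside the server. Assumptions: Python's re stands in for the PCRE engine in mod_rewrite, the ^page1/ prefix and $ anchor are carried over from the original rule, and the names ONE_DIR, TWO_DIRS and capture are made up for illustration.

```python
import re

# Exactly one directory after page1/, trailing slash optional.
ONE_DIR = re.compile(r"^page1/([^/]+)/?$")

# Exactly two directories after page1/, trailing slash optional.
TWO_DIRS = re.compile(r"^page1/([^/]+/[^/]+)/?$")


def capture(pattern, path):
    """Return group 1 for this path, or None when the pattern does not match."""
    m = pattern.match(path)
    return m.group(1) if m else None


print(capture(ONE_DIR, "page1/abc/"))       # abc
print(capture(ONE_DIR, "page1/abc/def"))    # None (too many directories)
print(capture(TWO_DIRS, "page1/abc/def/"))  # abc/def
```

Because the capture can no longer contain a slash at its end, the optional /? is the only place left for the trailing slash to go.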

Your Rule as written will never exclude a trailing slash, because you have done two things that reinforce each other:

#1 The capture does not say "capture everything except a slash"
and
#2 The final slash is optional, so if it has already been captured, the RegEx does not have to spit it out again.

If you don't know how many directories are involved, it becomes a little trickier. Look closely at this pattern. Don't bring out the scissors and paste pot until you're sure you understand what it means:

(([^/]+/)*[^/]+)/?
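If it helps to see the pattern run before pasting it anywhere, here is a small sketch. Assumptions: the ^page1/ prefix and the $ anchor are added from the original rule, Python's re is standing in for the PCRE engine in mod_rewrite, and the names ANY_DEPTH and captured_path are made up for illustration.

```python
import re

# The pattern above, with the page1/ prefix and anchors from the original rule.
ANY_DEPTH = re.compile(r"^page1/(([^/]+/)*[^/]+)/?$")


def captured_path(path):
    """Return what $1 would hold (the outer group), or None when nothing matches."""
    m = ANY_DEPTH.match(path)
    return m.group(1) if m else None


# For "abc/def/" the repeated group first grabs "abc/" and "def/", which leaves
# nothing for the final [^/]+. The engine then backtracks: it gives "def/" back,
# lets [^/]+ take "def", and the optional /? takes the trailing slash.
print(captured_path("page1/abc/def/"))     # abc/def
print(captured_path("page1/abc/def/ghi"))  # abc/def/ghi
print(captured_path("page1/abc"))          # abc
```

The capture has to end on a non-slash character, so a trailing slash can only ever be matched by the /? outside the parentheses.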

Now, different issue. You say "which is then used in the rewrite conditions". Good wording, because it means you know what order things get evaluated in. But if any of those Conditions involves matching the request against some fixed text, try to move that down into the Rule itself. That way, mod_rewrite doesn't have to stop and evaluate Conditions for every single request it ever gets.

g1smd

8:21 pm on Feb 3, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

And be careful with end anchoring. The above pattern without end anchoring would match and extract the folder names when an image or stylesheet is requested.

Merganser

3:12 am on Feb 4, 2012 (gmt 0)

10+ Year Member

There can be more than one directory and I do not know how many so your last suggestion would be applicable.

I think I understand 95% of it. Three points I am not sure on:

1) After working through some paper exercises (writing candidate URLs on paper and evaluating them against the construct), it looks like the server would need to iterate through the URL, because something like abc/def/ would fully match the first bit but then fail the second bit, so a non-match overall. But if the server determined that abc/ matched the first bit and def/ matched the second, it would be a match overall. Does this process work by iteration? I was assuming it was a single left-to-right sweep.

2) If an image or stylesheet is requested, wouldn't it also match? With something like abc/def/ghi/image.gif, the abc/def/ghi/ would match the first bit and image.gif would match the last (because the final / was optional).

3) What is the best way to inspect the captured $1 variable when testing? If I knew how to inspect the $1, I could more easily be self-sufficient even if by trial and error.
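One offline way to do the trial and error from point 3, offered as a rough sketch and not as something the server itself provides: run candidate paths through the pattern in a script and print the first capture group. This assumes Python's re agrees with the PCRE engine in mod_rewrite for patterns this simple, and inspect_capture is a made-up helper name.

```python
import re


def inspect_capture(pattern, paths):
    """Map each path to the text captured by group 1, or None when there is no match."""
    compiled = re.compile(pattern)
    results = {}
    for path in paths:
        # search() is used because a RewriteRule pattern is only anchored
        # where it says so with ^ or $.
        m = compiled.search(path)
        results[path] = m.group(1) if m else None
    return results


if __name__ == "__main__":
    candidates = ["page1/abc/def/", "page1/abc/def", "page1/abc/def/ghi/image.gif"]
    for path, value in inspect_capture(r"^page1/(([^/]+/)*[^/]+)/?$", candidates).items():
        print(f"{path!r} -> {value!r}")
```

This only checks the regular expression. It says nothing about how the rewrite conditions or the rest of the configuration behave on the live server.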

Thanks to you both.

g1smd

8:04 am on Feb 4, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

2) Using the $ end anchor will be useful. If you don't want to match URLs with extensions, make sure the last bit says either

[^/.]+

which says don't match a slash or period, or

[a-z0-9]+

which matches only lowercase letters and digits.
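A sketch of how the two suggestions behave once they are dropped into the last segment of the multi-directory pattern. Assumptions: the surrounding ^page1/ ... /?$ framing is mine, Python's re stands in for the PCRE engine in mod_rewrite, and the names NO_DOT, NO_DOT_UNANCHORED, ALNUM_ONLY and capture are made up for illustration.

```python
import re

# Last segment may not contain a slash or a period.
NO_DOT = re.compile(r"^page1/((?:[^/]+/)*[^/.]+)/?$")

# Same pattern without the $ end anchor, to show why the anchor matters.
NO_DOT_UNANCHORED = re.compile(r"^page1/((?:[^/]+/)*[^/.]+)/?")

# Last segment may contain only lowercase letters and digits.
ALNUM_ONLY = re.compile(r"^page1/((?:[^/]+/)*[a-z0-9]+)/?$")


def capture(pattern, path):
    """Return group 1 for this path, or None when the pattern does not match."""
    m = pattern.match(path)
    return m.group(1) if m else None


print(capture(NO_DOT, "page1/abc/def/"))                          # abc/def
print(capture(NO_DOT, "page1/abc/def/ghi/image.gif"))             # None
print(capture(NO_DOT_UNANCHORED, "page1/abc/def/ghi/image.gif"))  # abc/def/ghi/image
print(capture(ALNUM_ONLY, "page1/abc/image.gif"))                 # None
```

Without the end anchor, the image request still matches and yields a truncated capture, which is the kind of accidental match the anchor is there to prevent.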