Regex kicking my @$$ (again)

         

techtheatre

8:26 am on Jul 10, 2010 (gmt 0)

10+ Year Member


I was so proud of myself for getting as far as I did with regex this time, but then I went overboard and can't figure it out. I am trying to write a mod_rewrite rule for my .htaccess file that will take a URL like this:
http://myserver.com/any/file/path/?/Save/Data42/

and turn it into this:
view.php?path=any/file/path/&action=Save&extra=Data42

I managed to get the first part (I think it is correct) and figured out how to capture everything after the question mark, but I can't work out how to split that trailing part at the slashes. The one catch is that the path will usually end just before the question mark; only sometimes will one or two additional elements follow (in other words, the optional elements are: "?", "Save", "Data42").

This works partially:
RewriteRule ^([a-zA-Z0-9-/]+)[[\?]?(.*)]?$ /view.php?path=$1&$2 [L]

Can someone help me get the ending parts of this split up at the slashes? Thanks!

techtheatre

8:28 am on Jul 10, 2010 (gmt 0)

10+ Year Member



P.S. This is what I managed to get to...but it is VERY wrong:
^([a-zA-Z0-9-/]+)[[\?]?([a-z0-9])?([a-z0-9])?]?$

g1smd

11:12 am on Jul 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The question mark has a very specific meaning within a URL.

Having slashes after the question mark breaks the specifications for the characters that are allowed to appear in the query string attached to the URL.

Get rid of the question mark.

Another factor: go for all lower-case URLs, otherwise you'll run into all sorts of Duplicate Content problems.
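To illustrate the point about the question mark, here is a small Python sketch (Python is used only as a convenient way to inspect the URL; it is not part of the Apache setup). It shows how a standard URL parser splits the example URL from the first post. Everything after the first "?" becomes the query string, and a RewriteRule pattern is matched against the path portion only, so the pattern never gets to see "/Save/Data42/".

```python
from urllib.parse import urlsplit

# The example URL from the first post.
url = "http://myserver.com/any/file/path/?/Save/Data42/"
parts = urlsplit(url)

# Everything after the first "?" is the query string, not part of the path.
url_path = parts.path
query_string = parts.query

print(url_path)      # /any/file/path/
print(query_string)  # /Save/Data42/
```

If the extra data really had to travel in the query string, it would have to be tested with a RewriteCond on %{QUERY_STRING} rather than in the RewriteRule pattern.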

techtheatre

5:46 pm on Jul 10, 2010 (gmt 0)

10+ Year Member



I assume the question mark you want me to remove is the one that marks the breakpoint between the path and the optional parts? I have to use something there because all my "path" strings will be of varying lengths. Maybe I could divide it with a backslash "\" instead of a question mark?
[myserver.com...]

Either way, that does not fix my regex problem...

Any thoughts on that? THANKS!

g1smd

6:13 pm on Jul 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Break it with a single hyphen, if single hyphens never appear anywhere else in the URL.

Otherwise choose a double hyphen or some other valid character; a backslash is not in the list of valid characters.

techtheatre

8:32 pm on Jul 10, 2010 (gmt 0)

10+ Year Member



Hyphens will be in use...it looks like an allowed character would be the asterisk "*" symbol? So that gives me the following...but my regex problem is still not solved: how do I break up the one or two optional parts that may come after the asterisk?

So now my incoming URL is this:
[myserver.com...]

And I am still stuck here: it identifies the path (first part) successfully, and the part after the breakpoint (asterisk) successfully...but I still need to split that part at the slashes:
^([a-zA-Z0-9-/]+)[[\*]?(.*)]?$

Need:
$1 = any/file/path/
$2 = Save
$3 = Data42

So, I know that this is now even more messed up than my original attempt...but this is where I have gotten (with NO success)...
^([a-zA-Z0-9-/]+)[(\*/)([^/].*)/([^/].*)]?$

techtheatre

1:40 am on Jul 11, 2010 (gmt 0)

10+ Year Member


I am making some progress...I now have the following:
^([a-zA-Z0-9-/]+)\*?([a-zA-Z0-9]*)\*?([a-zA-Z0-9]*)$

It works with asterisks as the dividers at the end:
any/file/path/*Save*Data42

Now I just have to figure out how to make those back into slashes.

techtheatre

1:52 am on Jul 11, 2010 (gmt 0)

10+ Year Member



HOORAY! I figured it out! I would never have gotten close without this website: [gskinner.com...]

Here is my result:
^([a-zA-Z0-9-/]+)(?:\*/)?([a-zA-Z0-9]*)(?:/)?([a-zA-Z0-9]*)(?:/)?$

It works on this:
any/file/path/*/Save/Data42/
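For anyone wanting to check this pattern outside of Apache, here is a minimal Python sketch. It assumes Python's re module behaves like PCRE for this particular pattern (which it should, since only basic features are used); the hyphen in the character class is escaped for safety, and the helper name split_url is made up for the illustration. It also shows why the "*" separator matters: without it, the greedy first group swallows the whole URL.

```python
import re

# Pattern from the post above, with the hyphen escaped inside the class.
PATTERN = re.compile(
    r"^([a-zA-Z0-9\-/]+)(?:\*/)?([a-zA-Z0-9]*)(?:/)?([a-zA-Z0-9]*)(?:/)?$"
)


def split_url(url_path):
    """Return (path, action, extra), or None if the pattern does not match."""
    match = PATTERN.match(url_path)
    return match.groups() if match else None


with_separator = split_url("any/file/path/*/Save/Data42/")
path_only = split_url("any/file/path/")
# Without the "*" separator the greedy first group takes everything.
no_separator = split_url("any/file/path/Save/Data42/")

print(with_separator)  # ('any/file/path/', 'Save', 'Data42')
print(path_only)       # ('any/file/path/', '', '')
print(no_separator)    # ('any/file/path/Save/Data42/', '', '')
```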

g1smd

9:48 am on Jul 11, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Some optimisation is useful, not least changing A-Za-z to just a-z and appending the [NC] flag. Parses 33% faster.
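As a rough illustration of the equivalence only (it says nothing about speed), the Python sketch below compares an explicit a-zA-Z class with a lower-case class plus a case-insensitive flag, which plays the role of [NC] here. The sample strings are invented for the example, and the comparison is only meant for plain ASCII input.

```python
import re

# Explicit upper- and lower-case ranges, as in the earlier pattern.
explicit = re.compile(r"^[a-zA-Z0-9]+$")
# Lower-case range only, with case-insensitive matching (analogous to [NC]).
case_insensitive = re.compile(r"^[a-z0-9]+$", re.IGNORECASE)

samples = ["Save", "DATA42", "save", "no-match!"]

# Both patterns should accept or reject each ASCII sample identically.
same_results = all(
    bool(explicit.match(s)) == bool(case_insensitive.match(s)) for s in samples
)

print(same_results)  # True
```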

techtheatre

2:41 pm on Jul 11, 2010 (gmt 0)

10+ Year Member



okay. cool. thanks for the tip! anything else?

techtheatre

5:26 am on Aug 23, 2010 (gmt 0)

10+ Year Member



Okay...so after more than a month, I am revisiting this. I have spent this time entering all my info into the CMS, but now when I try "turning on" mod_rewrite with my shiny new regex, everything dies and I get an error 500. Unfortunately I am not able to access the server logs (1and1 shared Linux hosting)...so I am hoping someone out there can tell me what is wrong with this .htaccess file. Everything else works fine if I comment out the last line (the fancy new RewriteRule):


Options -MultiViews
Options +FollowSymLinks
Options +Indexes

RewriteEngine on
RewriteBase /

# forbid access to the configuration file
RewriteRule \.htaccess - [F]


# do not rewrite these directories (NC=no case sensitivity) and stop processing further (L=last)
RewriteRule ^css/$ - [NC,L]
RewriteRule ^images/$ - [NC,L]


# do not perform any rewrite on these filetypes and stop processing further
RewriteRule \.(txt|gif|jpe?g|png|css|ico|js|pdf)$ - [NC,L]


# use rules to make SEO friendly URLs
# Base path only: any/file-path/can/be/here/
# Base + Actions (2): any/file/path/*/save/data/(any extra junk can go here for SEO)
# Converts to this: view.php?path=any/file/path/&action=save&extra=data
# NOTE: Filepath can only be alphanumeric or dash (and slash, obviously). Action and extra are alphanumeric only.
RewriteRule ^([a-z0-9-/]+)(?:\*/)?([a-z0-9]*)(?:/)?([a-z0-9]*)(?:/)?(?:.)*$ /view.php?path=$1&action=$2&extra=$3 [NC,L]

jdMorgan

2:54 pm on Aug 23, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What version of Apache does your host provide?

The regex you have used is PCRE -- Perl-Compatible Regular Expressions. These are only supported on Apache 2.0 and later.

Your Options can/should all be combined on one line, as in "Options -MultiViews +FollowSymLinks +Indexes" for efficiency. Further, your "skip rules" could be combined down to one or two rules instead of three.

Jim

techtheatre

12:06 am on Aug 24, 2010 (gmt 0)

10+ Year Member



Hi Jim,
Thanks! Good call. The Apache version is 1.3.34 so obviously that is the problem. Now the issue is that I have absolutely no idea how to change my current regex into one that will work... Can you help?!? It took me hours just to figure this much out, and apparently I was trying the wrong thing all along. :-(

jdMorgan

1:48 am on Aug 24, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The problem is the "(?:<whatever>)" subexpressions, which are "clustering but non-capturing" subexpressions. By "clustering," we mean that the entire parenthesized subexpression is quantified by the quantifier token (e.g. "?", "*", "+") that follows. However, unlike normal parenthesized/clustering subexpressions, non-capturing subexpressions do not create back-references (for use here as "$1" through "$9").

You can get the same effect by allowing the subexpression to capture, but then ignoring the captured value (now $2 in the new version). It's a little slower, but at least it will work... Also, I believe that the clustering/non-capturing subpatterns at the end of your pattern can simply be omitted, leaving the pattern without an end-anchor. And since this is faster than fully specifying the pattern, we kind of make up for lost time on the unnecessary capturing we've now had to do... :)

So, I'd expect this rule to do much the same thing on Apache 1.3.x as what you had working on Apache 2.x:

RewriteRule ^([a-z0-9\-/]+)(\*/)?([a-z0-9]*)/?([a-z0-9]*) /view.php?path=$1&action=$3&extra=$4 [NC,L]

Note that the second subexpression (now without the "?:") is now a capturing expression, but since we now ignore $2, no matter...
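The renumbering can be seen in a quick Python sketch. This assumes Python's re module numbers groups the same way as the regex engines discussed here (groups are counted by their opening parenthesis, which is standard); re.IGNORECASE stands in for the [NC] flag, and the sample URL is the one from the .htaccess comments.

```python
import re

url_path = "any/file/path/*/save/data/"

# Non-capturing separator: the values land in groups 1, 2 and 3.
non_capturing = re.match(
    r"^([a-z0-9\-/]+)(?:\*/)?([a-z0-9]*)/?([a-z0-9]*)", url_path, re.IGNORECASE
)

# Capturing separator: the separator takes group 2,
# so action and extra shift to groups 3 and 4.
capturing = re.match(
    r"^([a-z0-9\-/]+)(\*/)?([a-z0-9]*)/?([a-z0-9]*)", url_path, re.IGNORECASE
)

print(non_capturing.groups())  # ('any/file/path/', 'save', 'data')
print(capturing.groups())      # ('any/file/path/', '*/', 'save', 'data')
```

That shift is why the substitution uses $3 and $4 instead of $2 and $3.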

Once you know regex pretty well, you get to learn all about its different "flavors." :)

Jim

techtheatre

3:37 am on Aug 26, 2010 (gmt 0)

10+ Year Member



Okay...thanks! I did not completely follow your explanation, but can try to read up on it further and also play around with bits and pieces to see what happens. I will give this modified version a try and post back to let people know how it works out.

Another question: My original regex you said was "PCRE"...what is the name of this correct version (so I know which tutorials to be reading next time)?

techtheatre

3:48 am on Aug 26, 2010 (gmt 0)

10+ Year Member



Hrmm...so I am back to my error 500. I read through the revised regex and now understand what you were talking about with $2 being captured but unused...and the "throw away" part at the end not needing to be captured. It all looks to me like it should logically work great. Unfortunately it is back to errors when I use this new RewriteRule. The only thing I could think of is that the regex is trying to capture the /view.php page and redirect it somewhere (to itself)...so I tried ignoring this file in rewrite rules by adding the following (with no success) to a line ABOVE the new rewrite rule:
RewriteRule ^view.php - [NC,L]

Any other words of wisdom that I can try? THANKS!

g1smd

10:44 am on Aug 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The two versions are PCRE (Perl Compatible Regular Expressions), which is what Apache 2.x uses, and POSIX (extended) regular expressions, which is what Apache 1.3 uses.

What do your Server Error Logs say about the error?

jdMorgan

2:28 pm on Aug 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's probably an "infinite loop" warning, since the rule will rewrite previously-rewritten requests for view.php to itself...

RewriteCond %{REQUEST_URI} !^/view\.php$
RewriteRule ^([a-z0-9\-/]+)(\*/)?([a-z0-9]*)/?([a-z0-9]*) /view.php?path=$1&action=$3&extra=$4 [NC,L]
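A quick way to see the loop is a Python sketch of the second pass. It assumes Python's re module is close enough to Apache's regex engine for this pattern. In .htaccess context the rule pattern sees the rewritten request as "view.php" (no leading slash), while %{REQUEST_URI} keeps the slash.

```python
import re

# The rule pattern from above; re.IGNORECASE stands in for [NC].
RULE_PATTERN = re.compile(
    r"^([a-z0-9\-/]+)(\*/)?([a-z0-9]*)/?([a-z0-9]*)", re.IGNORECASE
)

# After the first rewrite, the request is for view.php and the rules run again.
second_pass = RULE_PATTERN.match("view.php")
matches_again = second_pass is not None
first_group = second_pass.group(1)

# The RewriteCond only lets the rule run when the URI is NOT /view.php.
guard_allows_rule = re.match(r"^/view\.php$", "/view.php") is None

print(matches_again)      # True: the unanchored pattern matches "view"
print(first_group)        # view
print(guard_allows_rule)  # False: the condition blocks the second rewrite
```

Because the pattern has no end anchor, matching just the "view" part is enough for the rule to fire again, which is what the RewriteCond prevents.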

And actually, although longer, the following method is 'neater' and easier to understand and maintain, with fewer potential side-effects:

RewriteRule ^([a-z0-9\-/]+)[*/]([a-z0-9]+)/([a-z0-9]+)/$ /view.php?path=$1&action=$2&extra=$3 [NC,L]
RewriteRule ^([a-z0-9\-/]+)[*/]([a-z0-9]+)/$ /view.php?path=$1&action=$2 [NC,L]
RewriteRule ^([a-z0-9\-/]+)[*/]$ /view.php?path=$1 [NC,L]
RewriteRule ^$ /view.php? [NC,L]

You may have to modify to suit, because I am not sure how to resolve the ambiguities of your original code and my previous modifications to your code.
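To see which URL shapes the three path rules accept, here is a hedged Python sketch that tries the patterns in order, the way the RewriteRules would be tried. The rewrite helper and the parameter names are made up for the illustration, the empty-path rule is left out, and Python's re module is assumed to behave like Apache's engine for these patterns.

```python
import re

# The three path-matching patterns from the post above, in rule order.
RULES = [
    (re.compile(r"^([a-z0-9\-/]+)[*/]([a-z0-9]+)/([a-z0-9]+)/$", re.I),
     ("path", "action", "extra")),
    (re.compile(r"^([a-z0-9\-/]+)[*/]([a-z0-9]+)/$", re.I),
     ("path", "action")),
    (re.compile(r"^([a-z0-9\-/]+)[*/]$", re.I),
     ("path",)),
]


def rewrite(url_path):
    """Return the parameters captured by the first matching rule, or None."""
    for pattern, names in RULES:
        match = pattern.match(url_path)
        if match:
            return dict(zip(names, match.groups()))
    return None


star_then_slash = rewrite("any/file/path/*/save/data/")
star_only = rewrite("any/file/path/*save/data/")
plain_path = rewrite("any/file/path/")

print(star_then_slash)  # None
print(star_only)        # {'path': 'any/file/path/', 'action': 'save', 'extra': 'data'}
print(plain_path)       # {'path': 'any', 'action': 'file', 'extra': 'path'}
```

Two things stand out in this sketch: the "*/" form used earlier in the thread is not matched at all, and because "[*/]" also accepts a plain slash, a path-only URL gets split into path, action and extra. Restricting the separator to "\*" alone would be one way to remove that ambiguity, if that fits the URL scheme.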

Jim