Forum Moderators: Ocean10000 & phranque

Regex kicking my @$$ (again)

     
8:26 am on Jul 10, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts: 91
votes: 0

I was so proud of myself for getting as far as I did with regex this time, but then I went overboard and can't figure it out. I am trying to write a mod_rewrite rule for my .htaccess file that will take a URL like this:
http://myserver.com/any/file/path/?/Save/Data42/

and turn it into this:
view.php?path=any/file/path/&action=Save&extra=Data42

I managed to get the first part (I think it is correct) and figured out how to capture the part after the question mark, but I can't work out how to split that part up. The catch is that the path will usually end just before the question mark; only sometimes will there be one or two additional elements after it (in other words, the optional elements are "?", "Save" and "Data42").

This works partially:
RewriteRule ^([a-zA-Z0-9-/]+)[[\?]?(.*)]?$ /view.php?path=$1&$2 [L]

Can someone help me get the ending parts of this split up at the slashes? Thanks!
8:28 am on July 10, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts:91
votes: 0


P.S. This is what I managed to get to, but it is VERY wrong:
^([a-zA-Z0-9-/]+)[[\?]?([a-z0-9])?([a-z0-9])?]?$
11:12 am on July 10, 2010 (gmt 0)

Senior Member

g1smd (WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month)

joined:July 3, 2002
posts:18903
votes: 0


The question mark has a very specific meaning within a URL.

Everything after the question mark is treated as the query string, not as part of the path, and a RewriteRule pattern is only ever matched against the path. Putting slash-separated, path-like data after the question mark works against that meaning.

Get rid of the question mark.
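
To illustrate the point about the question mark, here is a minimal Python sketch (Python is only a convenient way to test this; it is not part of the original setup). It splits the example URL from the first post with the standard library's urlsplit and shows that everything after the "?" ends up in the query string rather than in the path.

```python
from urllib.parse import urlsplit

# Example URL from the first post in this thread.
url = "http://myserver.com/any/file/path/?/Save/Data42/"

parts = urlsplit(url)

# The path stops at the question mark ...
print(parts.path)   # /any/file/path/
# ... and everything after it is the query string.
print(parts.query)  # /Save/Data42/
```

In mod_rewrite terms, a RewriteRule pattern is matched against the path portion only; the query string would have to be tested separately with a RewriteCond on %{QUERY_STRING}.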

Another factor: go for all lower-case URLs; otherwise you'll run into all sorts of Duplicate Content problems.
5:46 pm on July 10, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts:91
votes: 0


I assume the question mark you want me to remove is the one that marks the breakpoint? I have to use something there, because my "path" strings will be of varying lengths. Maybe I could divide it with a backslash "\" instead of a question mark?
[myserver.com...]

Either way, that does not fix my regex problem...

Any thoughts on that? THANKS!
6:13 pm on July 10, 2010 (gmt 0)

Senior Member

g1smd

joined:July 3, 2002
posts:18903
votes: 0


Break it with a single hyphen, if single hyphens never appear anywhere else in the URL.

Otherwise choose a double hyphen or some other valid character; a backslash is not in the list of valid characters.
8:32 pm on July 10, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts: 91
votes: 0


Hyphens will be in use... It looks like the asterisk "*" would be an allowed character? That gives me the following, but my regex problem is still not solved: how do I break up the one or two optional parts that may come after the asterisk?

So now my incoming URL is this:
[myserver.com...]

And I am still at the point where it identifies the path (first part) successfully, and the part after the breakpoint (asterisk) successfully, but I still need to split that part up at the slashes:
^([a-zA-Z0-9-/]+)[[\*]?(.*)]?$

Need:
$1 = any/file/path/
$2 = Save
$3 = Data42

So, I know that this is now even more messed up than my original attempt, but this is where I have gotten to (with NO success):
^([a-zA-Z0-9-/]+)[(\*/)([^/].*)/([^/].*)]?$
1:40 am on July 11, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts: 91
votes: 0

I am making some progress. I now have the following:
^([a-zA-Z0-9-/]+)\*?([a-zA-Z0-9]*)\*?([a-zA-Z0-9]*)$

It works with asterisks as the dividers at the end:
any/file/path/*Save*Data42

Now I just have to figure out how to turn those back into slashes.
1:52 am on July 11, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts: 91
votes: 0


HOORAY! I figured it out! I would never have gotten close without this website: [gskinner.com...]

Here is my result:
^([a-zA-Z0-9-/]+)(?:\*/)?([a-zA-Z0-9]*)(?:/)?([a-zA-Z0-9]*)(?:/)?$

It works on this:
any/file/path/*/Save/Data42/
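
For anyone who wants to check what this pattern captures, here is a small Python sketch. It uses Python's re module as a stand-in for a PCRE tester (an assumption; for the constructs used here the two should behave the same), and the helper name split_url is made up for the illustration.

```python
import re

# Pattern from the post above, copied verbatim.
PATTERN = re.compile(
    r"^([a-zA-Z0-9-/]+)(?:\*/)?([a-zA-Z0-9]*)(?:/)?([a-zA-Z0-9]*)(?:/)?$"
)

def split_url(url_path):
    """Return (path, action, extra), or None if the pattern does not match."""
    match = PATTERN.match(url_path)
    return match.groups() if match else None

print(split_url("any/file/path/*/Save/Data42/"))  # ('any/file/path/', 'Save', 'Data42')
print(split_url("any/file/path/*/Save/"))         # ('any/file/path/', 'Save', '')
print(split_url("any/file/path/"))                # ('any/file/path/', '', '')
```

The optional parts come back as empty strings when they are missing, so view.php would receive empty action and extra values in those cases.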
9:48 am on July 11, 2010 (gmt 0)

Senior Member

g1smd

joined:July 3, 2002
posts:18903
votes: 0


Some optimisation is useful, not least changing A-Za-z to just a-z and appending the [NC] flag. Parses 33% faster.
2:41 pm on July 11, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts: 91
votes: 0


Okay, cool. Thanks for the tip! Anything else?
5:26 am on Aug 23, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts: 91
votes: 0


Okay, so after more than a month, I am revisiting this. I have spent this time entering all my info into the CMS, but now when I try "turning on" mod_rewrite with my shiny new regex, everything dies and I get an error 500. Unfortunately I am not able to access the server logs (1and1 shared Linux hosting), so I am hoping someone out there can tell me what is wrong with this .htaccess file. Everything else works fine if I comment out the last line (the fancy new RewriteRule):


Options -MultiViews
Options +FollowSymLinks
Options +Indexes

RewriteEngine on
RewriteBase /

# forbid access to the configuration file
RewriteRule \.htaccess - [F]


# do not rewrite these directories (NC=no case sensitivity) and stop processing further (L=last)
RewriteRule ^css/$ - [NC,L]
RewriteRule ^images/$ - [NC,L]


# do not perform any rewrite on these filetypes and stop processing further
RewriteRule \.(txt|gif|jpe?g|png|css|ico|js|pdf)$ - [NC,L]


# use rules to make SEO friendly URLs
# Base path only: any/file-path/can/be/here/
# Base + actions (2): any/file/path/*/save/data/(any extra junk can go here for SEO)
# Converts to this: view.php?path=any/file/path/&action=save&extra=data
# NOTE: Filepath can only be alphanumeric or dash (and slash, obviously). Action and extra are alphanumeric only.
RewriteRule ^([a-z0-9-/]+)(?:\*/)?([a-z0-9]*)(?:/)?([a-z0-9]*)(?:/)?(?:.)*$ /view.php?path=$1&action=$2&extra=$3 [NC,L]
2:54 pm on Aug 23, 2010 (gmt 0)

Senior Member

jdmorgan (WebmasterWorld Top Contributor of All Time, 10+ Year Member)

joined:Mar 31, 2002
posts:25430
votes: 0


What version of Apache does your host provide?

The regex you have used is PCRE -- Perl-Compatible Regular Expressions. These are only supported on Apache 2.0 and later.

Your Options can/should all be combined on one line, as in "Options -MultiViews +FollowSymLinks +Indexes" for efficiency. Further, your "skip rules" could be combined down to one or two rules instead of three.

Jim
12:06 am on Aug 24, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts: 91
votes: 0


Hi Jim,
Thanks! Good call. The Apache version is 1.3.34, so obviously that is the problem. Now the issue is that I have absolutely no idea how to change my current regex into one that will work... Can you help?!? It took me hours just to figure this much out, and apparently I was trying the wrong thing all along. :-(
1:48 am on Aug 24, 2010 (gmt 0)

Senior Member

jdmorgan

joined:Mar 31, 2002
posts:25430
votes: 0


The problem is the "(?:<whatever>)" subexpressions, which are "clustering but non-capturing" subexpressions. By "clustering," we mean that the entire parenthesized subexpression is quantified by the quantifier token (e.g. "?", "*", "+") that follows. However, unlike normal parenthesized/clustering subexpressions, non-capturing subexpressions do not create back-references (for use here as "$1" through "$9").

You can get the same effect by allowing the subexpression to capture, but then ignoring the captured value (now $2 in the new version). It's a little slower, but at least it will work... Also, I believe that the clustering/non-capturing subpatterns at the end of your pattern can simply be omitted, leaving the pattern without an end-anchor. And since this is faster than fully specifying the pattern, we kind of make up for lost time on the unnecessary capturing we've now had to do... :)

So, I'd expect this rule to do much the same thing on Apache 1.3.x as what you had working on Apache 2.x:

RewriteRule ^([a-z0-9\-/]+)(\*/)?([a-z0-9]*)/?([a-z0-9]*) /view.php?path=$1&action=$3&extra=$4 [NC,L]

Note that the second subexpression (now without the "?:") is now a capturing expression, but since we now ignore $2, no matter...
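
The renumbering can be seen in a quick Python sketch. Python's re module is not the POSIX engine that Apache 1.3 uses, so this only illustrates how groups are numbered, not how 1.3 parses the pattern; re.IGNORECASE stands in for the [NC] flag.

```python
import re

# Pattern from the rule above; re.IGNORECASE stands in for [NC].
PATTERN = re.compile(
    r"^([a-z0-9\-/]+)(\*/)?([a-z0-9]*)/?([a-z0-9]*)", re.IGNORECASE
)

match = PATTERN.match("any/file/path/*/Save/Data42/")

# Groups are numbered by their opening parenthesis, left to right.
for number, value in enumerate(match.groups(), start=1):
    print(f"${number} = {value!r}")
# $1 = 'any/file/path/'
# $2 = '*/'
# $3 = 'Save'
# $4 = 'Data42'
```

Because the "*/" group now takes the $2 slot, the action and extra values move to $3 and $4, which is why the substitution uses those numbers.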

Once you know regex pretty well, you get to learn all about its different "flavors." :)

Jim
3:37 am on Aug 26, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts: 91
votes: 0


Okay...thanks! I did not completely follow your explanation, but can try to read up on it further and also play around with bits and pieces to see what happens. I will give this modified version a try and post back to let people know how it works out.

Another question: You said my original regex was "PCRE"... what is the name of the flavor that does work here (so I know which tutorials to read next time)?
3:48 am on Aug 26, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:May 12, 2007
posts: 91
votes: 0


Hmm... so I am back to my error 500. I read through the revised regex and now understand what you were talking about with $2 being captured but unused, and the "throw away" part at the end not needing to be matched. It all looks to me like it should logically work great. Unfortunately I get the same error when I use this new RewriteRule. The only thing I could think of is that the regex is matching the /view.php page and rewriting it (to itself), so I tried excluding this file from the rewrite by adding the following (with no success) on a line ABOVE the new RewriteRule:
RewriteRule ^view.php - [NC,L]

Any other words of wisdom that I can try? THANKS!
10:44 am on Aug 26, 2010 (gmt 0)

Senior Member

g1smd

joined:July 3, 2002
posts:18903
votes: 0


The two versions are PCRE (Perl Compatible Regular Expressions) and POSIX. Apache 1.3 uses the POSIX flavor (extended regular expressions); Apache 2.0 and later use PCRE.

What do your Server Error Logs say about the error?
2:28 pm on Aug 26, 2010 (gmt 0)

Senior Member

jdmorgan

joined:Mar 31, 2002
posts:25430
votes: 0


It's probably an "infinite loop" warning, since the rule will rewrite previously-rewritten requests for view.php to itself...

RewriteCond %{REQUEST_URI} !^/view\.php$
RewriteRule ^([a-z0-9\-/]+)(\*/)?([a-z0-9]*)/?([a-z0-9]*) /view.php?path=$1&action=$3&extra=$4 [NC,L]
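
A rough way to see why the loop can happen, sketched in Python (used here only as a regex tester, with re.IGNORECASE standing in for [NC]): the pattern has no end anchor and everything after the first group is optional, so the leading "view" of the already-rewritten path is enough for a match.

```python
import re

# Same pattern as in the RewriteRule above.
PATTERN = re.compile(
    r"^([a-z0-9\-/]+)(\*/)?([a-z0-9]*)/?([a-z0-9]*)", re.IGNORECASE
)

# In .htaccess the pattern sees the path without its leading slash.
rewritten = "view.php"

match = PATTERN.match(rewritten)
print(match.groups())  # ('view', None, '', '')
```

Without the RewriteCond, the request for view.php would therefore be rewritten again on the next pass, with path=view, and so on.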

And actually, although longer, the following method is 'neater' and easier to understand and maintain, with fewer potential side-effects:

RewriteRule ^([a-z0-9\-/]+)[*/]([a-z0-9]+)/([a-z0-9]+)/$ /view.php?path=$1&action=$2&extra=$3 [NC,L]
RewriteRule ^([a-z0-9\-/]+)[*/]([a-z0-9]+)/$ /view.php?path=$1&action=$2 [NC,L]
RewriteRule ^([a-z0-9\-/]+)[*/]$ /view.php?path=$1 [NC,L]
RewriteRule ^$ /view.php? [NC,L]

You may have to modify to suit, because I am not sure how to resolve the ambiguities of your original code and my previous modifications to your code.
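
As a hedged illustration of those ambiguities, the sketch below runs the four patterns in order against a few sample paths. It uses Python's re module as a tester (not the engine Apache 1.3 uses), re.IGNORECASE in place of [NC], and a made-up helper named first_match.

```python
import re

# The four patterns above, in the same order.
RULES = [
    r"^([a-z0-9\-/]+)[*/]([a-z0-9]+)/([a-z0-9]+)/$",
    r"^([a-z0-9\-/]+)[*/]([a-z0-9]+)/$",
    r"^([a-z0-9\-/]+)[*/]$",
    r"^$",
]

def first_match(url_path):
    """Return (rule_number, groups) for the first rule that matches, else None."""
    for number, pattern in enumerate(RULES, start=1):
        match = re.match(pattern, url_path, re.IGNORECASE)
        if match:
            return number, match.groups()
    return None

for sample in (
    "any/file/path/*save/data/",
    "any/file/path/*/save/data/",
    "any/file/path/",
):
    print(sample, "->", first_match(sample))
# any/file/path/*save/data/ -> (1, ('any/file/path/', 'save', 'data'))
# any/file/path/*/save/data/ -> None
# any/file/path/ -> (1, ('any', 'file', 'path'))
```

In this test the "*" form without a following slash is split as intended, the "*/" form used earlier in the thread matches none of the rules, and a plain three-segment path is read as path, action and extra. That last case looks like the kind of ambiguity that needs resolving before these rules are used as-is.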

Jim