Forum Moderators: phranque

How to parse a url in .htaccess

         

KrabyPatties

8:41 am on Nov 27, 2008 (gmt 0)

10+ Year Member



This may be more of a regex question, but I am trying to parse a string from the URL and pass the parsed result as a parameter to a PHP file. For example, I currently have:

RewriteEngine on
RewriteRule ^([a-z0-9]+)\.html$ /parse.php?data=$1 [NC,L]

This allows me to enter 'foobar.html' which calls 'parse.php?data=foobar'. This is ALMOST what I want. I need to do something like:

'joemama-was-here-today.html' and have it parse out the first set of characters up to the 1st '-', resulting in a similar call to 'parse.php?data=joemama'.

Is this possible? Thanks.

phranque

9:41 am on Nov 27, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld [webmasterworld.com], KrabyPatties!

maybe something like this?
RewriteRule ^([^-]*)[-a-z0-9]*\.html$ /parse.php?data=$1 [NC,L]
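A quick way to see what that pattern captures is to try it outside Apache. The sketch below uses Python's re module as a stand-in for mod_rewrite's regex engine (the two agree for a pattern this simple), with re.IGNORECASE playing the role of [NC]. The helper name first_segment is made up for illustration.

```python
import re

# Same pattern as the RewriteRule above; IGNORECASE stands in for [NC].
PATTERN = re.compile(r"^([^-]*)[-a-z0-9]*\.html$", re.IGNORECASE)


def first_segment(path):
    """Return what $1 would hold, or None if the rule would not match."""
    match = PATTERN.match(path)
    return match.group(1) if match else None


if __name__ == "__main__":
    for path in ("joemama-was-here-today.html", "foobar.html", "foobar.php"):
        print(path, "->", first_segment(path))
```

Note that everything after the first hyphen is matched but thrown away, so any trailing text reaches the same script call.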

KrabyPatties

10:07 am on Nov 27, 2008 (gmt 0)

10+ Year Member



Ah, you are the man! It works great. I really appreciate your help.

g1smd

8:58 pm on Nov 27, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Be aware that this type of "part wildcard" URL in the rewrite means that your site will generate infinite Duplicate Content: any text at all after the first hyphen returns the same page.

The correct way to do this is for the script to check the remaining words against the database; every URL except the one correct version should return a 404 error.

KrabyPatties

1:13 am on Nov 28, 2008 (gmt 0)

10+ Year Member


I see, good point. The first part of the string is going to be a 5-digit ID for the record in the database, and the remainder of the string is allowed to change but should still point to the same record. I will take your advice, but rather than returning a 404 I will just 301 to the correct version whenever the ID part of the string is valid.

So that leaves me needing another RewriteRule which hopefully you or Phranque could help me with. I would also like to make sure that the script is only called when there is a combination of 5 alphanumeric characters followed by a hyphen at the beginning of the string. For example:

00001-blue-skies-bring-tears.html would call /parse.php?id=00001&title=blue-skies-bring-tears

Thanks for all the help guys.

KrabyPatties

9:57 pm on Nov 28, 2008 (gmt 0)

10+ Year Member



I figured it out guys. It only took me 3 hours of trial and error. Here is my solution:

RewriteRule ^([a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9])-([-a-z0-9]*)\.html$ /parse.php?param1=$1&param=$2 [NC,L]

g1smd

11:33 pm on Nov 28, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The first bit can be done with
([0-9]{5})-(...
instead.

It matches five digits and a hyphen.

It doesn't match if any letters are present before the first hyphen, or if there are more or fewer than five digits.

If you really do need five letters and numbers then this would do it:

([a-z0-9]{5})-(...
assuming the letters are always lower case.
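To check the two {5} forms against sample URLs, here is a sketch in Python (its re module behaves the same as Apache's regex engine for these patterns). The part after the hyphen is an assumption: it reuses ([-a-z0-9]*)\.html$ from the rule posted earlier in the thread, and there is no [NC] equivalent, so matching is case-sensitive. The names DIGITS_ONLY, ALNUM_LOWER and split_url are made up for illustration.

```python
import re

# Assumed completion of the two fragments, based on the earlier rule.
DIGITS_ONLY = re.compile(r"^([0-9]{5})-([-a-z0-9]*)\.html$")
ALNUM_LOWER = re.compile(r"^([a-z0-9]{5})-([-a-z0-9]*)\.html$")


def split_url(pattern, path):
    """Return the ($1, $2) pair, or None if the pattern does not match."""
    match = pattern.match(path)
    return match.groups() if match else None


if __name__ == "__main__":
    samples = (
        "00001-blue-skies-bring-tears.html",  # exactly five digits
        "0001-too-short.html",                # four digits
        "000001-too-long.html",               # six digits
        "6e4ad-mixed.html",                   # letters and digits
    )
    for path in samples:
        print(path, split_url(DIGITS_ONLY, path), split_url(ALNUM_LOWER, path))
```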

Your script would need to generate the 301 response itself, using two header() calls: one to send the "301" status and the other to send the new URL in a Location header.
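The decision the script has to make can be sketched as below. This is Python rather than PHP, purely so the logic can be run and tested on its own; the RECORDS table and the resolve function are hypothetical stand-ins for a real database lookup. In parse.php, the 301 branch would correspond to the status header followed by the Location header.

```python
# Hypothetical record store: ID -> the one canonical title for that record.
RECORDS = {"00001": "blue-skies-bring-tears"}


def resolve(record_id, title, records=RECORDS):
    """Return (status, location) for a request to /<record_id>-<title>.html."""
    canonical = records.get(record_id)
    if canonical is None:
        # Unknown ID: nothing to show and nothing to redirect to.
        return 404, None
    if title != canonical:
        # Valid ID but wrong title text: send the client to the one true URL.
        return 301, f"/{record_id}-{canonical}.html"
    # Exact match: this is the only form that gets a normal response.
    return 200, None


if __name__ == "__main__":
    print(resolve("00001", "blue-skies-bring-tears"))
    print(resolve("00001", "some-other-words"))
    print(resolve("99999", "no-such-record"))
```

Only one URL per record returns 200, which is what removes the duplicate content problem described earlier in the thread.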

phranque

4:30 am on Nov 29, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



thanks to g1smd for pointing out the issues with my simplistic response.

"assuming the letters are always lower case"

if using the NC flag the case is ignored...

g1smd

10:07 am on Nov 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If using [NC], your document can have up to 2^5 = 32 different valid URLs, because every combination of upper and lower case letters in the five-character ID makes a seemingly "valid" URL.

Again, your backend script would need to verify the name against the database and redirect to the correct URL when an incorrectly cased request is received.

That is, the real document ID is 6e4ad but NC would "make" 6e4Ad and 6e4aD and several other combinations also appear to be "valid".
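To put a number on that, the sketch below lists every upper/lower case spelling of an ID. It is plain Python, and the function name case_variants is made up for illustration. Digits have only one form, so the 2^5 ceiling is reached only when all five characters are letters.

```python
from itertools import product


def case_variants(doc_id):
    """List every upper/lower case spelling of doc_id."""
    # Each letter contributes two choices; each digit contributes one.
    choices = [sorted({ch.lower(), ch.upper()}) for ch in doc_id]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for doc_id in ("6e4ad", "abcde", "00001"):
        variants = case_variants(doc_id)
        print(doc_id, len(variants), variants[:4])
```

For 6e4ad that is three letters, so eight spellings would all reach the same record under [NC].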

You need to avoid that Duplicate Content scenario by making only one form able to return "200 OK", and all the others return 301 or 404. That's down to your script.

KrabyPatties

6:00 pm on Nov 29, 2008 (gmt 0)

10+ Year Member



Thanks for all of the knowledge guys. My script is now streamlined and bulletproof.