
conflicting rewrites

Rewrite ignoring [L] flag; can't get RewriteCond right

run2

6:05 pm on Jul 17, 2009 (gmt 0)

10+ Year Member



Hi all,
I'm definitely new to the Rewrite Engine, and am having a bit of a nightmare.
I have an .htaccess file with two rewrite rules I'm using to create static URLs, both of which work fine individually, but not in combination.

Script is currently:

Options +FollowSymLinks
RewriteEngine on

RewriteRule ^(.*)/(.*)\.php$ product.php?manufacturer=$1&style=$2 [L]

RewriteRule ^(.*)/$ index.php?manufacturer=$1
RewriteRule ^(.*)$ index.php?manufacturer=$1

I've been reading around and have come to a dead end, but I understand that:
1) The [L] flag just stops this pass through the rules, and the rewrite engine then goes round again (so catching the rewritten URL and modifying it in the second set of rules)
2) I need a RewriteCond to follow the first rule. Which is where the problem lies - I've found lots of posts saying I need one, but have absolutely no clue what to put in it.

The htaccess file resides in a folder called 'manufacturers', and basically I want anyone going looking for manufacturers/whatever (or manufacturers/whatever/) to go to index.php?manufacturer=whatever with users then progressing to specific product pages at product.php?manufacturer=whatever&product=something.
This is probably a no-brainer for many of you, but it's giving me massive headaches!

As an aside, I strongly suspect that one additional line of code can be used to push requests for pages (products) that don't exist to a custom 404 error page at /error_docs/404.php? Am I right? If so, how?

Hope this makes sense, and hope someone can help before stressing about it ruins my weekend!
Cheers,
Nik

jdMorgan

7:23 pm on Jul 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The first rule does not need a RewriteCond, because its output path lacks the "subdirectory" path-part required to match the rule's pattern, so it cannot match again.

But the second rule's ".*" pattern matches "anything, everything, or nothing -- with or without a trailing slash," and so must be prevented from being re-invoked (and looping) on previously-internally-rewritten requests for either index.php or product.php:


Options +FollowSymLinks
RewriteEngine on
#
RewriteRule ^(([^/]+/)+)([^.]+)\.php$ product.php?manufacturer=$1&style=$3 [L]
#
RewriteCond $1 !(product|index)\.php$
RewriteRule ^(.*)/?$ index.php?manufacturer=$1 [L]
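The effect of that negated RewriteCond can be checked outside Apache; Python's re module supports the basic tokens used here, and the test strings below are made up for illustration:

```python
import re

# The condition pattern from the rule above: it matches any captured
# path that ends in "product.php" or "index.php".  mod_rewrite's "!"
# prefix negates the match, so those requests skip the rule and the
# rewrite cannot loop on its own output.
cond = re.compile(r'(product|index)\.php$')

def rule_applies(captured_path):
    # RewriteCond $1 !(product|index)\.php$  -> apply only on NO match
    return cond.search(captured_path) is None

print(rule_applies('acme'))         # True: a real manufacturer request
print(rule_applies('index.php'))    # False: the rule's own output
print(rule_applies('product.php'))  # False: the first rule's output
```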

The regex patterns in the first rule above have been modified to improve performance.

Note that you really should not allow that trailing slash to be optional, because that creates two URLs for each page/product. That is duplicate-content, a much-discussed search ranking problem. You should pick either slashed or non-slashed URLs as your preferred (canonical) URL-form, and rewrite only those to your script. The other form should be detected and 301-redirected to the canonical form.

-----

Missing/bogus product page URLs:

As far as the server is concerned, any and all pages (products) in that subdirectory exist, because all are rewritten to one or the other script file, which physically exists. The scripts themselves are the only things that can "know" whether a page (product) exists, based on the presence or absence of a database entry for that page/product. Therefore, the script itself must check for the page's existence, and return a proper 404-Not Found response if no database entry can be found to generate an HTML page for that product.

Note that for pages/products that once existed, but which you no longer wish to support, you can/should return a 410-Gone response instead of a 404, assuming that you mark obsolete product records as obsolete, rather than just deleting them (In other words, whether you can support this function depends on how you set up and administer your database). A 410 response (and its error page) essentially tells visitors "We used to sell that, but no longer do," rather than just saying, "Sorry, we can't find that page, and we don't know why," which is all that a 404 means.

It is best practice to "remember" all old/obsolete URLs that you used to have pages for, and to handle those separately (410-Gone) from mis-typed or bogus URLs (404-Not Found).
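That 200/410/404 decision can be sketched as simple lookup logic (the function and the two ID sets are hypothetical stand-ins for database queries; in practice this check lives in the PHP script):

```python
# Hypothetical sketch of the status-code decision described above.
# live_ids / retired_ids stand in for database lookups.
def product_status(product_id, live_ids, retired_ids):
    if product_id in live_ids:
        return 200  # product exists: render the page
    if product_id in retired_ids:
        return 410  # Gone: "we used to sell that, but no longer do"
    return 404      # Not Found: mistyped or bogus URL

print(product_status('widget-a', {'widget-a'}, {'widget-z'}))  # 200
print(product_status('widget-z', {'widget-a'}, {'widget-z'}))  # 410
print(product_status('nonsense', {'widget-a'}, {'widget-z'}))  # 404
```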

Jim

[edited by: jdMorgan at 3:41 pm (utc) on July 19, 2009]

run2

11:00 pm on Jul 18, 2009 (gmt 0)

10+ Year Member



Hi Jim,
firstly, thanks loads for this, and apologies for not replying sooner - it did get through in time to save my weekend, but literally as I was on my way out the door, so no time to reply!

Anyway, if I'm gonna learn, I need to unpick this and understand it rather than just copying & pasting so...
First RewriteRule says "find anything that starts either with or without a / (the [^/] signifying the 'with'), is followed by a / and is then followed by something.php and redirect to ... When done, stop this rewrite run & start again."
The RewriteCond says "if the first bracketed pattern isn't product or index.php, then..."
Not sure about the significance of the ? in the second RewriteRule. Is it literally a question mark? Or is it saying "find anything that may or may not end in a / and redirect to..."

Thanks for the tip about the 404/410 pages as well - I can check for non-existent pages/products in the PHP, so no big issues there, but may well alter the database to allow 'deleted' products to retain some value which can be used to pass through a 410 - I'm becoming a bit of a stickler for standards compliance, clean code etc, so the least I can do is make sure things like this aren't left unresolved - to not do so would just be lazy (and that's something which will set me off on a rant I can probably join in on elsewhere on the forums!)...

Thanks again,
Nik

jdMorgan

3:43 pm on Jul 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The meanings of regular-expression tokens change with their context, so I commend the regex tutorial cited in our Apache Forum Charter to you. Used to start a character group (as denoted by "[]"), the caret "^" means "NOT." So while "[abc]" matches any single character "a", "b", or "c", "[^abc]" matches any single character that is NOT one of those three.

"?" is a quantifier meaning, "match zero or one of the preceding character, alternate character group, or parenthesized sub-pattern."
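For example, in Python's re (which behaves identically for this token), "/?" makes a trailing slash optional -- "widgets" here is just a sample path:

```python
import re

# "/?" = zero or one trailing slash, so both URL forms match.
pattern = re.compile(r'^widgets/?$')

print(bool(pattern.match('widgets')))    # True
print(bool(pattern.match('widgets/')))   # True
print(bool(pattern.match('widgets//')))  # False: "?" allows at most one
```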

"[^/]+/" is "Match one or more characters which are not a slash, followed by a slash."

"([^/]+/)+" is "Match one or more characters which are not a slash, followed by a slash, and do all that one or more times."

So "^(([^/]+/)+)([^.]+)\.php$" is "Match one or more characters which are not a slash, followed by a slash, one or more times, followed by one or more characters which are not a period, followed by a period, and ending with 'php'."

So that pattern parses the requested URL-path into the "directory part" in $1 and the "filename part" in $3. It will work properly as long as the 'filename.filetype' part of your URLs does not contain multiple periods. If it does, then the same technique used for slashes can be used to make the period-matching more robust, e.g. "^(([^/]+/)+)(([^.]+\.)*[^.]+)\.php$"
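Both patterns can be exercised in Python's re, which shares these token semantics (the example URL paths are invented):

```python
import re

# First-rule pattern: directory part in group 1, filename in group 3.
base = re.compile(r'^(([^/]+/)+)([^.]+)\.php$')
m = base.match('acme/widgets.php')
print(m.group(1), '|', m.group(3))   # acme/ | widgets

# The more robust variant tolerates extra periods in the filename.
robust = re.compile(r'^(([^/]+/)+)(([^.]+\.)*[^.]+)\.php$')
m = robust.match('acme/mk.2.php')
print(m.group(1), '|', m.group(3))   # acme/ | mk.2

# The base pattern rejects a filename containing a second period.
print(base.match('acme/mk.2.php'))   # None
```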

Jim

jdMorgan

4:18 pm on Jul 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just realized that the above patterns will include the trailing slash in $1. So an even more robust solution would be:

RewriteRule ^(([^/]+/)*[^/]+)/(([^.]+\.)*[^.]+)\.php$ product.php?manufacturer=$1&style=$3 [L]
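Checked in Python's re (same token semantics; example paths invented), $1 now excludes the trailing slash:

```python
import re

# Revised pattern: group 1 = directory path without trailing slash,
# group 3 = filename without ".php".
pattern = re.compile(r'^(([^/]+/)*[^/]+)/(([^.]+\.)*[^.]+)\.php$')

m = pattern.match('acme/widgets.php')
print(m.group(1), '|', m.group(3))   # acme | widgets

m = pattern.match('brand/series/mk.2.php')
print(m.group(1), '|', m.group(3))   # brand/series | mk.2
```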

Jim

run2

7:53 pm on Jul 19, 2009 (gmt 0)

10+ Year Member



Truly amazing - I stand in awe :)
More complicated than I thought, so will definitely go read that tutorial and see how I get on.
Thanks loads, Jim - it really is most appreciated.

jdMorgan

8:23 pm on Jul 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You could probably make several much simpler patterns, but they'd be less efficient to process. The idea of using these negative-match patterns is that the matching engine will know exactly when to stop matching the URL-path to a particular 'piece' of the sub-pattern, so multiple back-off-and-retry matching attempts can be eliminated or at least greatly reduced.

And as noted, part of the complexity comes from the flexibility of the pattern: It will accept one or more 'subdirectory levels' and more than one period in the filename itself. If you don't need that flexibility, you can simplify the pattern.

Jim