mod rewrite problems

Forum Moderators: phranque

Message Too Old, No Replies

mod rewrite problems

musicmaster

6:49 pm on Apr 6, 2009 (gmt 0)

I have a rewrite rule
RewriteRule ^([^~]*)~[^/]+\/(.*) $1$2

It is supposed to turn strings like
localhost/abcd/~gh/23/
into
localhost/abcd/23/
after which the line is further processed in my htaccess.

Problem is that it does not always work. The line above works, but when I put something behind the last slash (like localhost/abcd/~gh/23/abbn) it doesn't work.

So I tried some debugging to see what this rule produces. I tried the two lines below and both give the same result:
RewriteRule ^(.*) [localhost2...] [R=301,L]
RewriteRule ^(b.+)$ test.php?c=$1 [L]

Both give the same result:

The line
localhost/abcd/~gh/23/
becomes
localhost/abcd/23//~h/23/

So the debugging is a problem on its own. Can anybody help me by explaining what is going on?

Thanks

g1smd

6:57 pm on Apr 6, 2009 (gmt 0)

You need [L] on the end of the rule.

You need to end anchor the pattern and provide the trailing / in the pattern if the trailing slash should be present.

Do not escape the slash, that is, \/ should be / only.

The first backreference matches up to the tilde, so the slash will be included inside what is matched. You need to have the slash outside what is matched.

As coded, the pattern doesn't match what you wanted it to do. However, it does work exactly as coded.

Something like:

RewriteRule ^([^/]+/)~[^/]+/([^/]+/)  /$1$2 [L]

However, this opens up your site to many DuplicateContent problems, because it accepts any URL with anything at all on the end.

jdMorgan

7:26 pm on Apr 6, 2009 (gmt 0)

Just a note pertaining to a couple of side-issues here...

The first rule --the redirect to localhost2-- may be problematic is one of two ways: Since it does not check the currently-requested hostname, it will cause an 'infinite' loop if localhost and localhost2 both resolve to the same virtual host. If they resolve to different virtual hosts, then the first rule will redirect the request from localhost to localhost2, and the second rule will never be invoked.

Use a RewriteCond testing %{HTTP_HOST} ^localhost\.?(:[0-9]+)?$ to fix the problem in the first case, or move the second rule to localhost2/.htaccsss to fix the second case.

Another point is the statement "after which the line is further processed in my htaccess." Because of a known bug in Apache mod_rewrite, it is a good idea --and often actually required-- to do all processing for a given HTTP request with a single rule. What I mean by 'a given HTTP request' is that all internal rewrites need to be accomplished in a single-step rule. Under certain (but common) circumstances, trying to process a URL-path through multiple internal-rewrite rules can result in corruption of the req_rec variable, leading to repeated URL-path-parts being injected into the rewritten URL-path.

Redirects *can* be done in multiple steps since they involve multiple HTTP request/response transactions, but this is also bad because it confuses search engines, and requires multiple client-server exchanges, slowing the user experience.

Your best bet is to do all necessary rewrites and all redirects in one step. This can be simplified by ordering your rules with external redirects first, in order from most-specific pattern (fewest URLs affected) to least-specific pattern, followed by all internal rewrites, again ordered from most-specific to least-specific pattern.

Jim

musicmaster

6:25 pm on Apr 7, 2009 (gmt 0)

Thank you for your replies.

g1smd: the "\/" is the result from an experiment. It is not the problem.
: "end anchor". The point of the code is that I want to handle both lines ending with a slash and lines ending without a slash. (So I could only end-anchor with a "$" as far as I know - but I doubt whether adding that has an added value). The problem is that this line does not work for me for those lines that do not end with a slash. And I have no idea why.

jdMorgan: localhost2 is really different, so no problem there.
: the two lines (localhost2 and test.php) are two efforts to do the same thing. So I used only one of them at a time.
: good to know about the corruption of the req_rec variable. I suppose that was happened for me with the localhost2 and test.php test lines. I will watch out for that in other situations.

jdMorgan

6:46 pm on Apr 7, 2009 (gmt 0)

Are you using mod_userdir? Is that how you ended up with the tilde "~" characters in your URL-paths?

Also, be aware that mod_speling, mod_negotiation, mod_dir, mod_alias, and (on Apache 2.x) the AcceptPathInfo directive can interfere with the expected operation of mod_rewrite. If you are not knowingly using these, disable them.

g1smd's point about end-anchoring your pattern is sound. If you do not end-anchor the pattern, then "any* URL that matches up to that point but includes *anything* else in addition will also match the rule, get rewritten, and result in a 200-OK server response along with the same content as the expected URL. This creates 'infinite' duplicate-content, can cause ranking troubles in search, and invites "GoogleBombing" -- Search for the "duplicate content" topic in our Google News forum for more info.

From a slightly different angle, only one specific and unique URL should access any given resource (e.g. Web page) on your site. All non-canonical variations should result in a 301-Moved Permanently redirect to the canonical URL. So, pick either the non-trailing-slash (recommended) or the trailing-slash URL, match only that exact URL-path in the current RewriteRule, and add another RewriteRule to redirect the variant URL to that canonical URL. Again, more on URL canonicalization can be found in our Google News forum. This may seem academic or even pedantic at this point, but running a very tight ship can prevent potential disasters and "mystery problems" in the future.

Jim

musicmaster

9:47 am on Apr 8, 2009 (gmt 0)

The tilde is my own invention. I need flags for different display of the header of the page. So I am not using mod_usedir (at least as far as I know: I use a standard XAMPP installation). So if the space between two slashes starts with a tilde it is defined as a flag.

I don't understand your point about end-anchoring. As far as I can see it this is a greedy algorithm that takes as much as it can. So ending with "$" should give the same result as not end-anchoring a "(.*)".

As I plan it urls without slashes have a different meaning. If it would be about cars it would be something like
ford/ about the company
ford/models a list of models
ford/dealers a dealer list
ford/ka/ about the ford Ka model
ford/ka/engine about the Ka engine

You write that all non-canonical variations should result in a 301. I suppose you mean those that are redirected to my PHP page but that don't fit the required pattern. One of my problems is that some (not allowed) urls result in a 404 that ignores my custom 404 (ErrorDocument) as defined in .htaccess.

musicmaster

11:06 am on Apr 8, 2009 (gmt 0)

One more test - and I understand even less. Now I have reduced .htaccess to a single rewrite-line:
RewriteRule ^([^~]*)~[^/]+/([^/]*) $1$2 [R=301,L]

The page it should produce doesn't exist, but that isn't a problem: I just want to see how it changes the url in my address bar.

The result:
[localhostb...]
becomes
[localhostb...]
while the hoped for result was:
[localhostb...]

Oops, I solved this one: it should be "/$1$2" instead of "$1$2"

g1smd

11:54 am on Apr 8, 2009 (gmt 0)

*** =Now I have reduced .htaccess to a single rewrite-line. ***

Nope. You have changed it from being a rewrite to being a redirect.

If you really want a redirect, then the target URL must include the domain name.

As for the end anchoring of a pattern used in a rewrite, the idea is that only one URL request format should result in a rewrite. If you allow multiple URLs to activate the rewrite and be served the same content, then you have created a Duplicate Content problem. If different URLs could be requested, then redirect the alternatives to the canonical form, and rewrite only the canonical form.

Say I wanted my rewrite to work for the page named /abc here. If i end anchor the pattern only a request for /abc will be rewritten.

If I fail to end anchor it, then any request that merely begins with /abc will be rewritten. That is /abcdef and /abc1234 and /abc56q7 and /abcjwusjs and /abc8282 and /abc38119 and /abc-anything will all be rewritten and all of those URL requests will be served the same content, and all with the HTTP Status of '200 OK'.

That is, all of those URLs will 'exist' and they will cause you a lot of problems.

jdMorgan

1:13 pm on Apr 8, 2009 (gmt 0)

My understanding was that that rule was changed from an internal rewrite to a redirect for testing purposes only -- so that the action of the RewriteRule could be exposed and seen in the browser's address bar.

I'm not sure we've said this explicitly yet, but there does not appear to ba anything wrong with the code that would cause the very-strange rewriting behaviour. That's why I've mentioned looking for outside influences interfering with the operation of this rule.

The problem with not end-anchoring is simple. If you do not specify the entire valid "URL-space" to be rewritten, then any URL that matches the *prefix* defined by the RewriteRule pattern will get rewritten, and will result in a 200-OK response from your server. This means, for example, that a competitor could easily arrange for hundreds of links to your site in the form example.com/ford/dealers-convicted-of-fraud, example.com/ford-dealers-selling-used-cars-as-new, and example.com/ford-dealers-selling-lemons-as-new. All of these URLs would work and could be promoted to rank in search engines, thus associating your site and the dealers you are promoting with consumer fraud... And add to that the fact that all of these duplicate-content URLs would compete with your 'real' URLs for ranking in search results.

Let's forget all of that for now, though and get back to the central question, which is "Why is the action/result of your RewriteRule so odd?"

I suggest you try rewriting URLs that don't exist to files that don't exist, just as a test, in order to try to avoid the problems with other rules or directives interfering with your rule. Also, this might expose the action of the other rewriting modules I mentioned above, if you have not yet disabled them.

Also try redirecting your intended URL-set to an external URL. For example, detect your desired URL-set, move the variables into a parameter named "q=" and redirect to www.google.com/search?q="blah" as a valid search URL. Then see if you still find those odd URL-path-parts in the result.

Jim

musicmaster

3:03 pm on Apr 8, 2009 (gmt 0)

Thanks for your feedback.

I have given up on using mod_rewrite for anything but the most simple processing. I keep encountering situations where code works one second and stops working the next. I will now use a central processing model with one php file that divides the tasks.

.htaccess becomes something like:
RewriteRule ^(.*)/(.*)/(.*)/(.*) index.php?num=4&a=$1&b=$2&c=$3&d=$4 [L]
RewriteRule ^(.*)/(.*)/(.*) index.php?num=3&a=$1&b=$2&c=$3 [L]
RewriteRule ^(.*)/(.*) index.php?num=2&a=$1&b=$2 [L]

And the rest is handled in php.

jdMorgan

5:12 pm on Apr 8, 2009 (gmt 0)

That's fine, but I was hoping to get to the bottom of the problem, because it may in fact be a server mis-configuration that will cause you further grief in the future. Your initially-posted rule should have worked -- and I've dealt with far more complex rule-sets and have developed some measure of judgment in this matter.

If you would like your new rules to run several hundred times faster, swap out the ambiguous, promiscuous, and greedy ".*" subpatterns for more-specific ones:


RewriteRule ^([^/]+)/([^/]+)/([^/]+)/([^/]+)/$ index.php?num=4&a=$1&b=$2&c=$3&d=$4 [L]
RewriteRule ^([^/]+)/([^/]+)/([^/]+)/$ index.php?num=3&a=$1&b=$2&c=$3 [L]
RewriteRule ^([^/]+)/([^/]+)/$ index.php?num=2&a=$1&b=$2 [L]

The "[^/]+" subpatterns shown here match characters up to the first "/" that they find and then stop, unlike the ".*" subpattern, which will initially match the entire requested URL-path and then have to back off one character at a time until a match can be found for all subpatterns in the pattern. This mpre=specific subpattern allows the requested URL-path to be matched in a single left-to-right pass, instead of requiring hundreds of iterations (or thousands, depending on the requested-URL-path length) while trying to find a match from the tail-end of the URL-path while backing-off toward the head.

It also prevents accepting a requested URL-path with one or more additional "subdirectories" such as
/a/b/c/d/e/ where the result of using your ".*" rule would be num=4&a=a&b=b&c=c&d=d/e

You've also been advised several times to end-anchor those patterns, and warned of the search-ranking and exploitation vulnerability consequences of not doing so. I will repeat that advice only indirectly here, and you may un-anchor these patterns if you wish.

Jim

musicmaster

6:37 pm on Apr 8, 2009 (gmt 0)

I have mostly copied it from and have now this (the only differences are the end-slash and the * in the last group):

RewriteRule ^([^/]+)/([^/]+)/([^/]+)/([^/]*)$ index.php?num=4&a=$1&b=$2&c=$3&d=$4 [L]
RewriteRule ^([^/]+)/([^/]+)/([^/]*)$ index.php?num=3&a=$1&b=$2&c=$3 [L]
RewriteRule ^([^/]+)/([^/]*)$ index.php?num=2&a=$1&b=$2 [L]

I will check for the fraudulent car dealers in the code. I suppose I will need some 404 header command there.

The only fault I found in my code was that I used relative paths in my webpages. Mod_rewrite interpreted that at different times differently: so at one session a link might work and at another session the same link from the same page did not work. What made me decide to stop was when I encountered a similar problem with the "$1$2" that I used in my example. Initially it worked, then at some point I got paths starting with "C:\" in my modified links and I had to put a slash before the "$1$2". Later on , the slashes version gave an error and I had to go back to a slashless version. But unlike the relative paths I have no idea what triggered this problem.

jdMorgan

10:22 pm on Apr 8, 2009 (gmt 0)

Be aware that mod_rewrite never sees "a relative link" -- All it ever sees is fully-formed URLs as requested by the browser (or a search engine robot, etc.). It is the client (e.g. browser) that interprets page- or domain-relative links, by taking the current URL in its address bar as the base and adding your relative link path to the appropriate parts of it. When it sends a URL to your server, it is always a full "http://example.com/dir/page.whatever" URL.

To prevent problems with rewriting from 'virtual subdirectory URLs' to files located at different directory-levels, the use of only server-relative or canonical URLs is recommended; Otherwise, you must add more code to handle all the special cases, and although that works, it also results in duplicate-content problems.

Jim

[edited by: jdMorgan at 10:24 pm (utc) on April 8, 2009]