Forum Moderators: phranque


How to best handle thousands of RewriteRules?


jimih

10:50 am on Apr 18, 2012 (gmt 0)

10+ Year Member



Hi,

I have contributed to a web site with quite a lot of traffic, where we faced the problem of handling redirects from a large number of old URLs to new ones. We are talking about 30,000 old URLs that need to be mapped to new URLs, and there is no generic pattern that can be applied. So the company responsible for the production servers suggested that we simply generate a 30,000-line list of RewriteRules and include it in the existing Apache httpd configuration.

Now, what are the main things one should think of when doing something like this?

The lines look something like this:

RewriteRule /.*_12345\.xx$ http://www.some-site.com/123456 [R=301,NE,L]



How efficient would mod_rewrite be in dealing with 30,000 lines like this? Assuming there is no rule before these 30,000 lines that ends the chain (using the L flag, for example), would it be very bad for performance? Is there an easy way to make sure that these 30,000 rules are only considered if the incoming URL ends with ".xx", so that they have no negative impact at all on all other requests?

Note that even though the site has a lot of traffic, there is a very efficient caching server in front of Apache, so in reality not *that* many requests actually reach Apache. But I am still worried about the performance.

Regards
/J

incrediBILL

6:11 am on Apr 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



simply generate a 30,000-line list of RewriteRules and include it in the existing Apache httpd configuration.


Kiss performance g'bye and go find a new host that knows better.

You could implement a rewrite map which, if you used the DBM format, would be pretty quick.
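For illustration, a minimal sketch of that RewriteMap approach, using the request path itself as the lookup key. Every file name and path below is made up, and the DBM file would be built once with `httxt2dbm -i redirects.txt -o redirects.map` from a plain-text file of "old-path new-path" pairs:

```apache
RewriteEngine On
RewriteMap redirects "dbm:/etc/apache2/redirects.map"

# One rule covers all 30,000 mappings: look the requested path up in
# the map. The "|-" default makes the lookup return "-" on a miss, so
# the condition fails and the request passes through untouched.
RewriteCond ${redirects:%{REQUEST_URI}|-} !=-
RewriteRule \.xx$ ${redirects:%{REQUEST_URI}} [R=301,NE,L]
```

Note that RewriteMap is only allowed in server or virtual-host context, not in .htaccess.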

Otherwise, I'd load all those into a MySQL database and use a PHP script to do the redirects.
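Sketching that script-plus-database idea in Python rather than the suggested PHP, with a plain dict standing in for the MySQL table; the paths and targets are hypothetical samples:

```python
# A dict stands in for the MySQL table of old-path -> new-path rows;
# a real deployment would query the database here instead.
REDIRECTS = {
    "/old/page_12345.xx": "/123456",
    "/old/page_67890.xx": "/678901",
}

def lookup(path):
    """Return the new location for an old URL, or None if unknown."""
    return REDIRECTS.get(path)

def redirect_app(environ, start_response):
    """Minimal WSGI handler that issues the 301s itself."""
    target = lookup(environ.get("PATH_INFO", ""))
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]
    start_response("301 Moved Permanently", [("Location", target)])
    return [b""]
```

Apache would then need only a single rule routing ".xx" requests to this handler, instead of scanning 30K rules per request.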

Shared server?

If so, do what the host suggests and then run Xenu across your site at top speed.

Enjoy.

jimih

6:33 am on Apr 19, 2012 (gmt 0)

10+ Year Member



Thanks for your reply Bill (?),

But could you elaborate a little on why this is more or less a doomed approach? The majority of the incoming requests will not match any of these 30,000 rules. Are you telling me there is no way to avoid a performance penalty for those requests, even though a simple regex can match all those other requests?

Or, put another way: if there is a rule around line 100 of the config that matches 99% of all requests, and that has the L flag to stop further processing, will the following 30,000 lines *still* be a performance problem for those 99% of requests? If yes, can you explain why? If no, then I guess we are fine.

incrediBILL

7:31 am on Apr 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You assume it finds a match and stops, and you also assume it finds a match soon, which is possible but unpredictable. Most likely, on each access to your server it will unsuccessfully process many thousands of rules linearly, unless they are ordered by most frequently accessed to speed things along. Don't forget: whenever your site would normally generate a 404, you'll have processed all 30K rules for each request that isn't found. If someone started attacking your server looking for random paths used in vulnerable software, which happens all the time, many times daily on most sites, it could put a serious load on the server.

My suggestion of using a RewriteMap, or a simple PHP script with a MySQL database, to handle 30K rules has a much better prospect of being viable on a site with significant traffic.

So unless you can figure out how to break down your problem into a few rules with one catch-all and keep it under a thousand, I really wouldn't recommend doing it with 30K rules.

To satisfy your curiosity, a simple test would be to measure page load speeds on your current site, then put in 30K dummy rules generated by a script, test again, and see what happens when requesting a page that matches far down in the list.
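That generator script might look something like this; the host name and URL shape are invented to mirror the rule in the first post:

```python
def dummy_rules(n, host="www.example.com"):
    """Emit n throwaway RewriteRule lines for a load test."""
    return "\n".join(
        f"RewriteRule /.*_{i:05d}\\.xx$ http://{host}/{i} [R=301,NE,L]"
        for i in range(n)
    )

# Write 30,000 rules to a file that a test vhost can Include:
if __name__ == "__main__":
    with open("dummy_rules.conf", "w") as f:
        f.write(dummy_rules(30000) + "\n")
```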

lucy24

7:59 am on Apr 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Overlapping Bill:
RewriteRule /.*_12345\.xx$ http://www.some-site.com/123456 [R=301,NE,L]

How efficient would mod_rewrite be in dealing with 30,000 lines like this?

Oh. Ouch. Oh. Ouch. Ow, ow, ow. If you mean the "like this" part literally, tell your established users to send you their orders by carrier pigeon. They'll reach you faster.

It's this bit:

/.*_12345\.xx$


Every single time mod_rewrite reaches this Rule-- or any of its 30,000 siblings-- it will have to go through the following:

/.*_12345\.xx$

gobblegobblegobble to the very end

Oh. Oops! I'm supposed to look for a lowline (an underscore).

Backtrack, look for _, backtrack further, look for _, backtrack...

OK, here's a lowline. Now to see if it's followed by "12345.xx".

The only good news is that the lowline you want is the last one-- potentially-- in the request. Otherwise there would be still more backtracking and re-evaluating.
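Python's `re` engine backtracks the same way, so the effect is easy to demonstrate. The second pattern below is a hypothetical alternative that captures the numeric key once, so a single rule plus a lookup table (RewriteMap, database, dict) could stand in for all 30,000:

```python
import re

# The posted pattern: ".*" swallows the whole path, then the engine
# backtracks from the end hunting for "_12345.xx" on every request.
greedy = re.compile(r"/.*_12345\.xx$")

# One keyed pattern instead: capture the trailing number so a map
# lookup can supply the target URL.
keyed = re.compile(r"^/.*_(\d+)\.xx$")

assert greedy.search("/old/page_12345.xx")
assert not greedy.search("/old/page_99999.xx")
assert keyed.match("/old/page_99999.xx").group(1) == "99999"
```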

Instead you proceed to the redirect, including the "noescape" flag even though nothing in the URL would call for escaping anyway. You don't have query strings, do you? There's not much point in redirecting to an extensionless URL and then keeping a clunky long query string at the end of it all.

Now, does the .xx represent some specific extension? At a minimum, it's got to be a page, like .html or .php. So any requests for images or stylesheets or scripts have to read the rule right along with everything else-- unless you can deflect them right at the beginning.

if there is a rule on line 100 or so in the config that matches 99% of all requests

Is that 99% of all page requests, or just 99% of all requests everywhere?

In any case, there is no flag that says "The remainder of this file is devoted exclusively to mod_rewrite, so the rest of youse can pack up and go home". All the other mods still have to plow through the whole 30,000 lines on the off chance that Line 14,592 contains a mod_alias directive or Line 28,824 has a rule for mod_setenvif.

Grab all requests of all kinds that will not require special handling and stick an [L] flag on them. The rest get rewritten to a PHP script that will do its stuff without bothering anyone else.
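In config terms, that advice might look something like this; the extension list and script name are placeholders:

```apache
RewriteEngine On

# Anything that needs no special handling stops here.
RewriteRule \.(css|js|png|jpe?g|gif|ico)$ - [L]

# Every legacy ".xx" request goes to one script that looks the old
# path up and issues the 301 itself, with no 30,000-rule scan involved.
RewriteRule \.xx$ /redirect.php [L]
```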