
Mod Rewrite on large Ecommerce Website

Facing some unusual problems.


irldonalb

1:34 pm on Apr 28, 2008 (gmt 0)

10+ Year Member



Hi All,

I'm trying to get a company I work closely with to implement mod_rewrite. During our peak period, the website receives approximately 180,000 page impressions an hour, or over a million impressions a day (just to give an indication of the load involved).

However, the IT department has hit back with some unusual obstacles:

- Apparently storing mod_rewrite rules in an .htaccess file for a site of our size is impractical
- The httpd.conf file should only be used to store 10 rewrite URLs (I think we'll need at least 30 dynamic rules)
- Server processing time will increase by 30%. Conveniently, we're already running at 80% of maximum capacity...

I’d like to get some independent opinions. It’s difficult for me to challenge the above as I’ve never attempted to rewrite the URLs of such a busy website.

Thanks
Donal

jdMorgan

5:15 pm on Apr 28, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sounds like FUD to me...

The only absolute truth I read there is that mod_rewrite for a busy site should be done at the server config level in httpd.conf or conf.d, etc. This is because code in the config files is compiled at server start-up and then executed as native code on a per-HTTP-request basis, whereas code in .htaccess is interpreted on a per-HTTP-request basis. Therefore, the code execution in httpd.conf is much more efficient.
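To make the comparison concrete, here's a minimal sketch of what the server-config version might look like (the URL scheme, script name, and paths are placeholders, not your actual site):

    <VirtualHost *:80>
        ServerName www.example.com
        DocumentRoot /var/www/html

        RewriteEngine On
        # Parsed once at server start-up, then applied per request.
        # Hypothetical rule: map a friendly product URL to the real script.
        RewriteRule ^/products/([^/]+)/([0-9]+)\.html$ /product.php?cat=$1&id=$2 [L]
    </VirtualHost>

Note that in server-config context the pattern sees the full URL-path with its leading slash; the same rule in .htaccess would be written without it (^products/...), and the .htaccess file would be located, read, and parsed again on every single request.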

30 lines of RewriteRules in httpd.conf is roughly comparable to 30 lines of PHP code. Are your e-commerce applications more than 30 lines long? If so, your server should have already crashed! I call foul on this.

Write the code and make them test it. Whoever is wrong buys the beer (you should prepare for a serious hangover).

Some tips:

  • Make the patterns in RewriteRules and RewriteConds as specific as possible. For example, avoid the construct ^(.*)something(.*)$ at all costs. If forced to use ^(.*)something(.*)something(.*)$, suicide is an attractive alternative.
  • Use RewriteConds if necessary to make sure that any 'file exists' checks are only executed when absolutely required.
  • Use RewriteConds if necessary so that any reverse-DNS checks are only executed when absolutely required.

Multiple "match any number of any characters" regex patterns force the regular-expressions matching engine to retry many, many times to find a match. Instead, use negative matching: for example, ^(.*)/(.*)$ might be re-coded as ^([^/]+)/(.+)$

Or, if the requested URL-path you're trying to match contains multiple slashes, perhaps you'd want ^(([^/]+/)+)(.+)$ to pick up only the final URL-path-part in $3, with the rest in $1, more closely matching the 'greedy' behavior of the first sub-pattern in ^(.*)/(.*)$

These are just examples; the best way to code a pattern is highly dependent on the URL-path you're trying to match. But the general rule holds: avoid multiple ambiguous subpatterns.
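To put that in rule form, a hypothetical before-and-after for a single-directory-level URL (the script name and URL scheme are invented for illustration):

    # Slow: two ambiguous "anything" subpatterns invite heavy backtracking.
    RewriteRule ^(.*)/(.*)\.html$ /page.php?dir=$1&page=$2 [L]

    # Better: negated character classes anchor the match at the slash.
    RewriteRule ^([^/]+)/([^/.]+)\.html$ /page.php?dir=$1&page=$2 [L]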

File-exists checks invoke a call to the operating system's file manager, which may in turn (depending on filesystem caching) invoke a read of the physical disk. Therefore, file-exists checks should be done only when absolutely required. If that's not clear, I'm talking about "RewriteCond %{REQUEST_FILENAME} -f", for example.
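A sketch of that qualification idea (catalog.php and the URL scheme are made up for illustration). The rule's pattern is tested before its conditions, and ANDed conditions short-circuit in order, so the expensive disk check only fires when everything cheaper has already matched:

    # Specific rule pattern weeds out most requests before any conditions run.
    # The cheap extension test then runs before the expensive disk check.
    RewriteCond %{REQUEST_URI} !\.(css|js|gif|jpg|png)$
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteRule ^catalog/(.+)$ /catalog.php?path=$1 [L]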

Reverse-DNS lookups invoke a request by your server to a DNS server, so again, this should be avoided. An example would be "RewriteCond %{REMOTE_HOST} ^(www\.)?example\.com"
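Same principle for the rDNS case; a hypothetical sketch (the User-Agent string and hostname are placeholders):

    # Cheap check first: the User-Agent test costs nothing, so the
    # reverse-DNS lookup triggered by %{REMOTE_HOST} only fires for
    # requests that already claim to be FriendlyBot.
    RewriteCond %{HTTP_USER_AGENT} FriendlyBot [NC]
    RewriteCond %{REMOTE_HOST} !\.friendlybot\.example$ [NC]
    RewriteRule ^ - [F]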

In both cases, try to make the RewriteRule pattern as specific as possible, and add any RewriteConds which might further qualify the file-exists check or rDNS lookup to prevent unnecessary execution.

Jim

TheMadScientist

5:08 pm on May 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you are still looking for opinions:

I would guess that if you are not currently using mod_rewrite to block a number of spiders, you could actually alleviate some of your server load by implementing a known 'bad user-agent' block list, which should more than make up for any additional lines of code.
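For example, a minimal sketch of such a block list (the bot names are placeholders, not a vetted list):

    RewriteEngine On
    # Known bad bots get a 403 Forbidden on their very first request,
    # before any dynamic page is built or sent.
    RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper|LinkHarvester) [NC]
    RewriteRule ^ - [F]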

If you block them at the initial request with a <p>Forbidden</p> page (16 characters), you should save an incredible amount of bandwidth compared to letting them run rampant and download any full page they feel like.

Really, if your rulesets are well written, I would guess it's not an issue. I would also guess there is some mod_rewrite in use on what is very arguably the busiest and fastest message board online: when you type in webmasterworld.com it resolves to www.webmasterworld.com, and the site is written in .pl but the URLs say .htm.
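For what it's worth, both of those behaviors are short rules; a purely illustrative sketch (this is a guess, not WebmasterWorld's actual configuration):

    RewriteEngine On
    # Canonical-host redirect: send bare-domain requests to www.
    RewriteCond %{HTTP_HOST} ^webmasterworld\.com$ [NC]
    RewriteRule ^(.*)$ http://www.webmasterworld.com/$1 [R=301,L]

    # Serve .pl scripts under .htm URLs; the rewrite is internal,
    # so visitors only ever see .htm.
    RewriteRule ^([^/.]+)\.htm$ /$1.pl [L]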