Forum Moderators: phranque
I've always used (.*) because whenever I tried longer patterns, they never worked for me, until just now, when I tried allowing extra characters through. For security, should I bother changing all my mod_rewrite'd sites from (.*) to ([0-9a-zA-Z\-\.\'\_]+)?
Remember that you can use the [NC] flag to avoid 26 of those character compares: [a-z] with the [NC] flag is equivalent to [A-Za-z].
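A quick Python sketch of that equivalence (Python's regex semantics are close enough to mod_rewrite's PCRE-style engine for this point; the test strings are just examples):

```python
import re

# [a-z] plus a case-insensitive flag accepts exactly the same strings
# as the spelled-out [A-Za-z], while giving the engine one character
# range to test per position instead of two.
pattern_nc = re.compile(r'^[a-z]+$', re.IGNORECASE)  # like [a-z] with [NC]
pattern_both = re.compile(r'^[A-Za-z]+$')            # both ranges spelled out

for s in ('Page', 'INDEX', 'about', 'Mixed'):
    assert bool(pattern_nc.match(s)) == bool(pattern_both.match(s))
```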
There is another good reason not to use ".*": it is the greediest, most promiscuous pattern, and it often forces the regular-expression engine to backtrack several times before finding a match.
For a simple example, take the pattern ^(.*)\.html$. A much more efficient pattern would be ^([^.]+)\.html$ because it allows a single-pass match evaluation from left to right.
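You can watch that difference in any PCRE-style engine; a small Python illustration (the filename is made up for the demo):

```python
import re

# With the greedy pattern, (.*) first swallows the whole string
# "page.name.html", then gives characters back one at a time until
# "\.html$" can match, so the capture ends up as "page.name".
greedy = re.match(r'^(.*)\.html$', 'page.name.html')
assert greedy.group(1) == 'page.name'

# The restrictive pattern matches in a single left-to-right pass.
# Note it is not equivalent, though: [^.]+ only accepts names with
# no embedded dots, so "page.name.html" is rejected outright.
assert re.match(r'^([^.]+)\.html$', 'page.name.html') is None
assert re.match(r'^([^.]+)\.html$', 'page.html').group(1) == 'page'
```

The trade-off is the usual one: the tighter character class buys speed and predictability at the cost of matching a narrower set of URLs.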
Anyway, back to your original question: a compromise would be to use ".*" and ".+" type patterns on the URL-path part of a request, but the more restrictive ([0-9a-z.'_\-]+) with the [NC] flag on query strings.
Jim
RewriteRule ^([a-z0-9._\-]+)/([^.]+)\.html$ cgi-bin/file.cgi?Operation=something&something=$1 [NC,L]
I prefer to do parameter validation in the scripts themselves, since Perl and PHP both have better regular-expression and string-handling facilities than mod_rewrite. I also code scripts with the approach recommended by most security experts: define exactly and restrictively what the script will accept, rather than trying to predict all possible exploits and reject those. The latter approach is a maintenance nightmare, and it leaves the script open to exploitation until you discover or are informed of an exploit and can code a fix. The 'restrictive' approach is undoubtedly what inspired the source of information that led you to post your question.
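A minimal sketch of that whitelist-first style (shown in Python for brevity, though the same idea applies in Perl or PHP; the pattern, length cap, and function name are illustrative assumptions, not anyone's actual rules):

```python
import re

# Accept only what is explicitly allowed: letters, digits, dot,
# underscore, apostrophe, and hyphen, with an arbitrary length cap.
# Anything else is rejected, so novel exploit styles fail by default.
VALID_PARAM = re.compile(r"^[0-9a-z._'-]{1,64}$", re.IGNORECASE)

def accept_param(value):
    """Return value unchanged if it passes the whitelist, else raise."""
    if VALID_PARAM.match(value):
        return value
    raise ValueError('rejected request parameter: %r' % value)

assert accept_param("my-page_v2.html") == "my-page_v2.html"

raised = False
try:
    accept_param("../../etc/passwd")   # '/' is not whitelisted
except ValueError:
    raised = True
assert raised
```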
Also, since most scripting languages have full access to the original HTTP request headers, there is no guarantee that a script, once invoked, won't read REQUEST_URI directly and extract information itself, bypassing the careful pattern-based validation in the mod_rewrite rules you used to pass the request to the script.
One way around that would be to use a 'filter' rule at the very top of your rules, something like:
RewriteCond %{QUERY_STRING} [^&=a-z0-9._\-] [NC,OR]
RewriteCond %{REQUEST_URI} [^/#a-z0-9._\-] [NC,OR]
RewriteCond %{THE_REQUEST} \.(php|pl)\ HTTP/ [NC]
RewriteRule .* - [F]
Note that I just typed this code, and I may have omitted some characters from the 'allowed character' lists that your site requires to function; I erred on the side of caution in composing them. I also assumed that all of your published URLs are static in appearance and refer to page names that do not contain ".pl" or ".php", and that all scripts are invoked by rewriting those static URLs. In other words, this approach works if your published URLs refer to .html page names or to page names without any file extension. If that holds, there is no way a script can be invoked without the request first passing through the RewriteRule filter.
Jim