Forum Moderators: phranque
<FilesMatch "\.(php|html|htm|asp|jsp|aspx|jspx|cfm|pl)$">
Header set imagetoolbar "no"
Header set MSSmartTagsPreventParsing "TRUE"
</FilesMatch>
I get Internal Server Error 500 every time. Syntax is as per the book.
Server uses Apache 2.0.54.
This does not cause an error...
<FilesMatch "\.(php|html|htm|asp|jsp|aspx|jspx|cfm|pl)$">
# Header set imagetoolbar "no"
# Header set MSSmartTagsPreventParsing "TRUE"
</FilesMatch>
...so it is the Header lines that appear to be the problem.
Apache Version: Apache/2.0.54 (Debian GNU/Linux) PHP/4.3.10-16
Loaded Modules: core mod_access mod_auth mod_log_config mod_logio mod_env mod_setenvif prefork http_core mod_mime mod_status mod_autoindex mod_negotiation mod_dir mod_alias mod_so mod_cgi mod_php4 mod_rewrite mod_userdir
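For what it's worth, the Header directive is provided by mod_headers, and that module does not appear in the Loaded Modules list above; an unrecognised directive in .htaccess produces exactly this 500 Internal Server Error. A trivial check against the list as posted:

```python
# The Loaded Modules list quoted above, exactly as posted:
loaded = ("core mod_access mod_auth mod_log_config mod_logio mod_env "
          "mod_setenvif prefork http_core mod_mime mod_status mod_autoindex "
          "mod_negotiation mod_dir mod_alias mod_so mod_cgi mod_php4 "
          "mod_rewrite mod_userdir").split()

# "Header" comes from mod_headers; without that module, Apache chokes on
# the directive in .htaccess and returns a 500 Internal Server Error.
print('mod_headers' in loaded)   # False -- the module is not loaded
```

If that is indeed the cause, loading mod_headers (or wrapping the two Header lines in an IfModule mod_headers.c container) would be the fix.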
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\/([^/]*/)*index\.(php|html?).*\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(php|html?)$ http://www.example.com/test/$1 [R=301,L]
rather than this:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(php|html?).*\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(php|html?)$ http://www.example.com/test/$1 [R=301,L]
yet it still seemed to work.
The only difference between the two is the single escaped space after the request method in the first RewriteCond line.
However, there are a lot of other rules placed before this one, also dealing with index filenames. They are for requests with specific parameters, so maybe this set of rules never actually gets used by any of the URL requests that are in my test data.
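That hypothesis is easy to test outside Apache. A quick Python simulation of the two RewriteCond patterns against a typical THE_REQUEST value (a hypothetical request line, not taken from the thread's test data) suggests the version without the escaped space can never match, so its rule would silently never fire:

```python
import re

request = 'GET /somedir/index.html HTTP/1.1'   # typical THE_REQUEST value

# As posted, with "\/" (no space) directly after the method:
broken = re.compile(r'^[A-Z]{3,9}\/([^/]*/)*index\.(php|html?).*\ HTTP/')
# With the escaped space "\ /" between the method and the path:
fixed = re.compile(r'^[A-Z]{3,9}\ /([^/]*/)*index\.(php|html?).*\ HTTP/')

print(bool(broken.match(request)))   # False: no "/" ever follows "GET"
print(bool(fixed.match(request)))    # True
```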
[edited by: jdMorgan at 1:16 am (utc) on Sep. 21, 2008]
[edit reason] example.com [/edit]
I guess you might see a problem should you be reusing %1 later on somewhere else. In this case I wasn't reusing it in any way.
[edited by: g1smd at 12:21 am (utc) on Sep. 20, 2008]
At one point I saw in the reports that one URL which should have been redirected was not being redirected. This was also seen in the lists that Xenu LinkSleuth produces and which can be exported to a spreadsheet application for better analysis.
This investigation revealed that the non-redirecting URL was the only URL that would be touched by the final rule of them all -- the generic non-www to www redirect -- as it was the only URL that did not match any of the other previous rules.
Upon looking at the code for the non-www to www rule, there was a simple typo found in it. This was easily fixed, and then confirmed as being OK by running my test data again.
However, this brings it home why the general non-www to www rule should always be the last one.
In this case, 1099 URLs are processed by preceding rules and just one of 1100 URLs is processed by this final rule.
If, however, I had misplaced this rule by locating it first, it would have operated for 550 different URL requests, and would then be creating a Redirection Chain in 549 of those cases as there is another rule that also needs to do some extra work for those other 549 requests.
That would be a dangerous situation indeed. So specific rules go first, and the general stuff goes last, to catch anything that the preceding rules missed.
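The ordering effect can be sketched in Python with two hypothetical rules (invented URLs and patterns; a crude first-match-wins resolver stands in for the server's redirect behaviour):

```python
import re

def resolve(url, rules, max_hops=5):
    """Follow first-match-wins redirects until no rule applies."""
    hops = [url]
    for _ in range(max_hops):
        for pattern, repl in rules:
            new = re.sub(pattern, repl, hops[-1], count=1)
            if new != hops[-1]:
                hops.append(new)   # one redirect issued; client re-requests
                break
        else:
            return hops
    return hops

# A specific index-canonicalisation rule, and the generic non-www fix:
specific = (r'^http://(?:www\.)?example\.com/(.*?)index\.html$',
            r'http://www.example.com/\1')
generic = (r'^http://example\.com/(.*)$',
           r'http://www.example.com/\1')

url = 'http://example.com/dir/index.html'
print(resolve(url, [specific, generic]))   # one hop, straight to the target
print(resolve(url, [generic, specific]))   # two hops -- a redirect chain
```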
Is the stuff after the final closing bracket even needed at all? We're done matching by then, having proven that it is an index filename.
Dunno -- That would be up to you. You've proven it's an "index" URL after "index\." but you haven't proven it's "php" or "html" until you identify one character beyond the end of those strings, which could be either a "?" starting a query string, or the space before "HTTP" if no query delimiter is present.
On the one hand, it's good to canonicalize "reasonable" URLs, but what if, for example, a competitor discovered that you will 301 anything that even looks like an index URL, instead of returning a 404, for say, "index.html-my-knickers". That might invite some unwelcome linking "games."
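Jim's point is easy to demonstrate in Python (the "knickers" request is the hypothetical from above; one extra character after the extension settles the matter):

```python
import re

req_ok = 'GET /index.html HTTP/1.1'
req_query = 'GET /index.html?page=2 HTTP/1.1'
req_fake = 'GET /index.html-my-knickers HTTP/1.1'

loose = re.compile(r'index\.(php|html?)')
# Require one character beyond the extension: either "?" starting a
# query string, or the space before "HTTP":
strict = re.compile(r'index\.(php|html?)[? ]')

for req in (req_ok, req_query, req_fake):
    print(bool(loose.search(req)), bool(strict.search(req)))
```

The loose pattern accepts all three requests; the strict one rejects the "index.html-my-knickers" request.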
In the end it's all down to what the individual Webmaster feels is best for his/her site.
Jim
However, I'm not sure that unwelcome linking games are a problem if the unwanted incoming link simply hits a 301 redirect on the site.
It's obviously a huge problem when another URL returns valid content with a 200 OK HTTP status, but not so dangerous with a redirect in place.
RewriteCond %{QUERY_STRING} pge=0&?
RewriteCond %{QUERY_STRING} mu=([clxvi]+)&?
RewriteCond %{QUERY_STRING} Muid=([0-9]+)&?
RewriteRule ^(index\.php)?$ http://www.example.com/pages/%1/%2? [R=301,L]
Only the second variable makes it through the redirect, and it is found in the position where the first variable should be. That is:
http://example.com/index.php?mu=vii&Muid=33&pge=0
redirects to:
http://www.example.com/pages/33/
instead of to:
http://www.example.com/pages/vii/33
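The cause can be simulated in Python (this mimics, rather than runs, mod_rewrite): %N back-references always come from the last RewriteCond that matched, so each condition's captures overwrite the previous ones instead of accumulating:

```python
import re

query = 'mu=vii&Muid=33&pge=0'
conds = [r'pge=0&?', r'mu=([clxvi]+)&?', r'Muid=([0-9]+)&?']

last = None
for pattern in conds:
    m = re.search(pattern, query)
    if m is None:
        break
    last = m                    # each match replaces the previous captures

pct1 = last.group(1) if last.re.groups >= 1 else ''   # %1
pct2 = last.group(2) if last.re.groups >= 2 else ''   # %2
print('http://www.example.com/pages/%s/%s' % (pct1, pct2))
# -> .../pages/33/ -- the mu value never survives to the substitution
```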
It results in a single 404 error in my test data, but there are several hundred different URLs that 301 redirect to it.
The site "looks" like it works. It is only by running Xenu LinkSleuth over a comprehensive set of test URLs that this stuff is coming up.
Additionally, I was fooled by the browser many times. It shows a URL like this:
http://example.com/pages.php?Muid=44&mu=vii&pge=0
like this:
http://example.com/pages.php?Muid=44µ=vii&pge=0
on screen -- the browser treats the "&mu" as the entity for µ. Mouseover the URL in the report and it is seen correctly.
I now have to re-run my test data with all the & changed to &amp; - forgot that I needed to check it both ways.
[edited by: jdMorgan at 1:17 am (utc) on Sep. 21, 2008]
[edit reason] example.com [/edit]
In order to 'pick up and collect' the pieces, you need to 'carry forward' previously-matched back-references from one RewriteCond to the next. Aside from using separate rules to handle different query-string-variable orders, here's one way to do it:
RewriteCond %{QUERY_STRING} pge=0&?
RewriteCond %{QUERY_STRING} &?mu=([clxvi]+)&?
RewriteCond %1>%{QUERY_STRING} ^([^>]+)>([^&]*&)*Muid=([0-9]+)&?
RewriteRule ^(index\.php)?$ http://www.example.com/pages/%1/%3? [R=301,L]
The ">" character used in the third RewriteCond is arbitrary, and has no special meaning. It is simply a relatively-rare and unique-for-this-specific-case character that is used to demarcate the previously-matched value from the one being currently evaluated.
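The same example, simulated in Python with the thread's query string, shows the carry-forward in action (the pge=0 condition is omitted here since it captures nothing):

```python
import re

query = 'mu=vii&Muid=33&pge=0'

# The second RewriteCond captures the mu value...
m2 = re.search(r'&?mu=([clxvi]+)&?', query)
# ...and the third tests "%1>%{QUERY_STRING}", so the capture rides along:
teststring = m2.group(1) + '>' + query
m3 = re.match(r'^([^>]+)>([^&]*&)*Muid=([0-9]+)&?', teststring)

# %1 is the carried mu value, %3 is the Muid value:
print('http://www.example.com/pages/%s/%s' % (m3.group(1), m3.group(3)))
# -> http://www.example.com/pages/vii/33
```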
Jim
[edited by: jdMorgan at 1:09 am (utc) on Sep. 21, 2008]
Some of the URLs have 4 variables, and some are optional, so there is no way that I am putting 16 rules in to fix that up. I'll go for the much shorter option!
I also tried to find a mega-post you did a year or two ago where someone was trying to fix multiple variables to always be in the same order - as I guessed that might get me started.
The best I came up with is this, as yet untested:
RewriteCond %{QUERY_STRING} &?pge=0&?
RewriteCond %{QUERY_STRING} &?mu=[clxvi]+&?
RewriteCond %{QUERY_STRING} &?Muid=([0-9]+)&?
RewriteRule ^(index\.php)?$ http://www.example.com/pages/%1/ [QSA]
-- hoping that with no [L] here, the redirect isn't invoked *yet*.
RewriteCond %{QUERY_STRING} &?mu=([clxvi]+)&?
RewriteRule ^pages/(.*)/$ http://www.example.com/pages/%1/$1? [R=301,L]
I had already thought about adding something before the variable names to make the rules more efficient, but wanted to fix the other stuff first.
I need to look closely at your example, at least several times I think.
I assume that your code doesn't care about the parameter order, nor does it care if there are any other parameters present (they get dumped).
[edited by: jdMorgan at 1:14 am (utc) on Sep. 21, 2008]
[edit reason] example.com [/edit]
Correct on both counts. They all get dumped because of the trailing "?" on the substitution.
You can use the exact same method to re-order the parameters: "pick up the pieces" of the query string one RewriteCond at a time, put them into %1 (separated by tokens like my ">" character above) to "carry them along", and then re-assemble at the end. (I believe I came up with this after my so-called "mega-post" and I like this method better.)
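A Python analogue of that pick-up-and-reassemble idea, using the thread's parameter names (the loop stands in for a chain of RewriteConds, one per parameter, in the desired output order):

```python
import re

query = 'pge=0&Muid=33&mu=vii'     # parameters arriving in arbitrary order
wanted = ['mu', 'Muid', 'pge']     # the canonical order to emit

carried = []                       # plays the role of the ">"-joined %1
for name in wanted:
    m = re.search(r'(?:^|&)' + name + r'=([^&]*)', query)
    if m is None:
        break                      # a required parameter is missing
    carried.append('%s=%s' % (name, m.group(1)))   # carry the pair forward

print('&'.join(carried))           # mu=vii&Muid=33&pge=0
```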
In general, avoid 'stacked' rewrites that change the URL-path -- these trigger a long-standing and still-present documented bug in Apache mod_rewrite, which results in multiple 'copies' of parts of the URL-path appearing in req_rec (the "current URL-path" variable used by mod_rewrite). If you must stack rewrites, then copy the URL-path from RewriteRule into a 'user variable', work on it there, and then put it back. A trivial example:
RewriteRule \.php$ - [E=MyURL:/index.php]
...
RewriteRule \.php$ - [E=MyURL:%{ENV:MyURL}?added-parm=foo]
...
RewriteCond %{ENV:MyURL} (.+)
RewriteRule \.php$ %1 [L]
Jim
I need to do more reading, and a lot more testing.
I am absolutely convinced that a large number of sites have installed code which *appears* to work but which has unseen flaws that are causing all sorts of problems "under the hood".
Without my long list of test URLs to run through Xenu LinkSleuth I would have signed this off as "done" about two days ago. Instead, I find several serious flaws that aren't all that apparent when simply browsing the site, and which I would have been extremely lucky to have picked out as candidates to sample with Live HTTP headers.
These redirects are picking up incoming links to URLs that only existed in an older version of the CMS, and sending the requests on to the new URL for the same resource.
Very true... and go look at some of the other on-line forums dealing with mod_rewrite and related subjects: There are many more examples of bad code than good code on the Web: Dot-star-this and dot-star-that... People are upgrading servers all over the world because their regex patterns are so sloppy and slow! Or because they do unnecessary file-exists checks left and right (WordPress, for example).
Long test-URL lists are a requirement -- be sure to include them in your bid. :)
Most folks skip requirements specification and testing -- or give them very short shrift. It's all too common. :(
Jim
I fixed up your other suggestions too. I already changed ^index\.php|$ to become ^(index\.php)?$ a few days ago.
I am going to leave it until tomorrow, have a final look, and then run all the test URLs through Xenu LinkSleuth again. That process only takes about 5 minutes, but analysing the output takes quite a bit longer.
RewriteCond %{QUERY_STRING} pge=0&?
RewriteCond %{QUERY_STRING} mu=([clxvi]+)&?
RewriteCond %1>%{QUERY_STRING} ^([^>]+)>([^&]*&)*Muid=([0-9]+)&?
RewriteRule ^(index\.php)?$ http://www.example.com/test/pages/%1/%3? [R=301,L]
#
RewriteCond %{QUERY_STRING} mu=([clxvi]+)&?
RewriteCond %1>%{QUERY_STRING} ^([^>]+)>([^&]*&)*Muid=([0-9]+)&?
RewriteRule ^pages?\.php$ http://www.example.com/test/pages/%1/%3? [R=301,L]
http://www.example.com/test/?pge=0&Muid=33&mu=ii&l1i=
Other URLs, with index or page filename included, are either directly serving content or are being incorrectly handled by a different redirect.
[edited by: jdMorgan at 1:14 am (utc) on Sep. 21, 2008]
[edit reason] example.com, formatting, & disabled smiles [/edit]
The very last item on the site navigation list is now "Forbidden" when clicked.
The URL as displayed does not have to pass through any redirects to deliver content.
The URL is rewritten to some other, different, internal file path to fetch the content.
If I remove all the redirects, and leave just the rewrite, then that URL does work just fine.
This one is going to need some thought.
It might be something within the PHP coding, and that would be out of my control.
There are a few URLs that still have a one-parameter query string on the end. The parameter value is used by some JavaScript functions to do with styling the navigation bar. The navigation bar is going to be redesigned using CSS, and that will eventually make the query string completely obsolete (which is why that parameter isn't featured in any of the rewrites so far implemented).
For this one page with issues, if I remove the value from the query string, or remove the whole query string, then the page displays fine. The issue only occurs with one link/one URL on the whole site. My guess is that something in the JavaScript or PHP is responsible for the error message.
Is it worth finding and fixing, if the query string will be completely gone within weeks? I think not.
# Deny Hacker Query Strings Looking to Exploit Database Features:
RewriteCond %{QUERY_STRING} declare|char|set|cast|convert|delete|drop|exec|insert|meta|script|select|truncate|update [NC]
RewriteRule . - [F]
Ooops.
Since there will be no more query string after next week, it isn't all that important, but it would be nice to add some extra qualifier to that rule so that it reliably detects attempted hacks, while still allowing "normal" operation of a site.
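One possible qualifier, sketched in Python with the word list from the rule above: wrap the alternation in \b word boundaries, so that a keyword embedded in an innocent parameter such as "charset" or "offset" no longer trips the filter, while a free-standing injected keyword still does. The same \b works in mod_rewrite's regex flavour.

```python
import re

words = (r'declare|char|set|cast|convert|delete|drop|exec|insert'
         r'|meta|script|select|truncate|update')

naive = re.compile(words, re.I)
bounded = re.compile(r'\b(?:%s)\b' % words, re.I)

legit = 'charset=utf-8&offset=10'       # normal site operation
attack = 'id=1;drop+table+users'        # injection attempt

print(bool(naive.search(legit)), bool(bounded.search(legit)))    # True False
print(bool(naive.search(attack)), bool(bounded.search(attack)))  # True True
```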