Forum Moderators: phranque

Server Error 500 with FilesMatch and Header

Followed the syntax by the book...


g1smd

10:11 am on Sep 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What's wrong with this?

<FilesMatch "\.(php|html|htm|asp|jsp|aspx|jspx|cfm|pl)$">
Header set imagetoolbar "no"
Header set MSSmartTagsPreventParsing "TRUE"
</FilesMatch>

I get Internal Server Error 500 every time. Syntax is as per the book.

Server uses Apache 2.0.54.

.

This does not cause an error...

<FilesMatch "\.(php|html|htm|asp|jsp|aspx|jspx|cfm|pl)$">
# Header set imagetoolbar "no"
# Header set MSSmartTagsPreventParsing "TRUE"
</FilesMatch>

...so it is the Header lines that appear to be the problem.

g1smd

10:16 am on Sep 19, 2008 (gmt 0)


Can I simply assume that the problem is that mod_headers isn't installed or enabled?

I don't have access to httpd.conf to look and see, or change it.
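A defensive pattern worth noting here (an editor's sketch, not from the original posts): wrapping the Header directives in an <IfModule> test makes them a no-op, rather than a 500 error, when mod_headers is missing.

```apache
# Only apply the Header directives when mod_headers is loaded;
# without this guard, "Header" is an unknown directive and
# .htaccess parsing fails with a 500 Internal Server Error.
<IfModule mod_headers.c>
    <FilesMatch "\.(php|html|htm|asp|jsp|aspx|jspx|cfm|pl)$">
        Header set imagetoolbar "no"
        Header set MSSmartTagsPreventParsing "TRUE"
    </FilesMatch>
</IfModule>
```

The trade-off is that the guard hides the misconfiguration silently: the headers are simply never sent until the module is enabled.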

g1smd

11:18 am on Sep 19, 2008 (gmt 0)


Good old phpinfo(); comes to the rescue:

Apache Version: Apache/2.0.54 (Debian GNU/Linux) PHP/4.3.10-16

Loaded Modules: core mod_access mod_auth mod_log_config mod_logio mod_env mod_setenvif prefork http_core mod_mime mod_status mod_autoindex mod_negotiation mod_dir mod_alias mod_so mod_cgi mod_php4 mod_rewrite mod_userdir

g1smd

11:20 am on Sep 19, 2008 (gmt 0)


Now to wrap my head round the infinite redirect loop problem that I also have...

g1smd

1:05 pm on Sep 19, 2008 (gmt 0)


... an accidental ? where there shouldn't have been one.

Grrr.

jdMorgan

1:23 pm on Sep 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You're debugging code far too early in the morning for us yanks! :)

No mod_headers? Wow, good luck with managing caching issues on that server! :(

Jim

g1smd

1:57 pm on Sep 19, 2008 (gmt 0)


Apologies for being on UTC+0000, many hours ahead of you. :-)

Yeah, I obviously need to look at the headers stuff in a lot more detail. Hadn't got that far, yet.

Debugging the CMS URL structure, and Duplicate Content issues, has been a big enough job...

g1smd

11:50 pm on Sep 19, 2008 (gmt 0)


After getting it all working, I noticed that the index filename rule looked like this:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\/([^/]*/)*index\.(php|html?).*\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(php|html?)$ http://www.example.com/test/$1 [R=301,L]

rather than this:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(php|html?).*\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(php|html?)$ http://www.example.com/test/$1 [R=301,L]

yet it still seemed to work.

The difference is the single space in the top line of the two.

However, there are a lot of other rules placed before this one, also dealing with index filenames. They are for requests with specific parameters, so maybe this set of rules never actually gets used by any of the URL requests that are in my test data.

[edited by: jdMorgan at 1:16 am (utc) on Sep. 21, 2008]
[edit reason] example.com [/edit]

jdMorgan

12:09 am on Sep 20, 2008 (gmt 0)


A properly-formatted HTTP request will always have that space, so the other rules must have been catching these requests. The request is always going to look like:

HEAD /index.html HTTP/1.0
or
GET /index.php?foo=bar&howdy=do HTTP/1.1
or similar, but always with that space.

Jim

g1smd

12:12 am on Sep 20, 2008 (gmt 0)


True, but the next part of the rule says to match anything that isn't a / so maybe it still works OK even with the typo?

I guess you might see a problem should you be reusing %1 later on somewhere else. In this case I wasn't reusing it in any way.

[edited by: g1smd at 12:21 am (utc) on Sep. 20, 2008]

jdMorgan

12:14 am on Sep 20, 2008 (gmt 0)


BTW,

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(php|html?)(\?[^\ ]*)?\ HTTP/

will execute quite a bit faster if the site uses query strings on .php URLs, especially long ones... :)

Jim

g1smd

12:18 am on Sep 20, 2008 (gmt 0)


Noted.

Is the stuff after the final closing bracket even needed at all?

We're done matching by then, proven that it is an index filename.

g1smd

12:59 am on Sep 20, 2008 (gmt 0)


My test data has almost 1100 Duplicate Content Variation URLs in it.

At one point I saw in the reports that one URL which should have been redirected was not being redirected. This also showed up in the lists that Xenu LinkSleuth produces, which can be exported to a spreadsheet application for better analysis.

This investigation revealed that the non-redirecting URL was the only URL that would be touched by the final rule of them all -- the generic non-www to www redirect -- as it was the only URL that did not match any of the other previous rules.

Looking at the code for the non-www to www rule, I found a simple typo. It was easily fixed, and then confirmed as OK by running my test data again.

However, this brings home why the general non-www to www rule should always be the last one.

In this case, 1099 URLs are processed by preceding rules and just one of 1100 URLs is processed by this final rule.

If, however, I had misplaced this rule by putting it first, it would have fired for 550 different URL requests, creating a redirection chain in 549 of those cases, because another rule still needs to do extra work for those 549 requests.

That would be a dangerous situation indeed. So specific rules go first, and the general stuff goes last, to catch anything that the preceding rules missed.
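The ordering principle can be sketched with a pair of hypothetical rules (the URLs and parameters here are illustrative, not from the actual site):

```apache
# Specific redirects first: each one handles a known legacy URL.
RewriteCond %{QUERY_STRING} ^old-param=([0-9]+)$
RewriteRule ^old-page\.php$ http://www.example.com/new-page/%1/? [R=301,L]

# Generic catch-all last: non-www to www, reached only by requests
# that no earlier, more specific rule has already redirected.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

Placed in the opposite order, the catch-all would fire first for every non-www request, and each specific rule would then add a second redirect, producing exactly the redirection chains described above.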

jdMorgan

2:17 am on Sep 20, 2008 (gmt 0)


> Is the stuff after the final closing bracket even needed at all?
> We're done matching by then, proven that it is an index filename.

Dunno -- That would be up to you. You've proven it's an "index" URL after "index\." but you haven't proven it's "php" or "html" until you identify one character beyond the end of those strings, which could be either a "?" starting a query string, or the space before "HTTP" if no query delimiter is present.

On the one hand, it's good to canonicalize "reasonable" URLs, but what if, for example, a competitor discovered that you will 301 anything that even looks like an index URL, instead of returning a 404, for say, "index.html-my-knickers". That might invite some unwelcome linking "games."
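Jim's anchoring suggestion, applied to the index-filename rule from earlier in the thread, would look something like this (an untested editor's sketch):

```apache
# (\?[^\ ]*)?\ HTTP/ requires the character after the extension to be
# either "?" (start of a query string) or the space before "HTTP",
# so a request for "index.html-my-knickers" no longer matches.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.(php|html?)(\?[^\ ]*)?\ HTTP/
RewriteRule ^(([^/]*/)*)index\.(php|html?)$ http://www.example.com/test/$1 [R=301,L]
```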

In the end it's all down to what the individual Webmaster feels is best for his/her site.

Jim

g1smd

8:35 am on Sep 20, 2008 (gmt 0)


Ah, I see, so only match if there is a ? or space after the file extension. That makes a lot of sense.

However, I'm not sure that unwelcome linking games are a problem if the unwanted incoming link simply hits a 301 redirect on the site.

It's obviously a huge problem when another URL returns valid content with a 200 OK HTTP status, but not so dangerous with a redirect in place.

g1smd

11:38 am on Sep 20, 2008 (gmt 0)


This doesn't do what I expected it to:

RewriteCond %{QUERY_STRING} pge=0&?
RewriteCond %{QUERY_STRING} mu=([clxvi]+)&?
RewriteCond %{QUERY_STRING} Muid=([0-9]+)&?
RewriteRule ^(index\.php)?$ http://www.example.com/pages/%1/%2? [R=301,L]

Only the second variable makes it through the redirect, and it is found in the position where the first variable should be. That is:

http://example.com/index.php?mu=vii&Muid=33&pge=0

redirects to:

http://www.example.com/pages/33/

instead of to:

http://www.example.com/pages/vii/33

It results in a single 404 error in my test data, but there are several hundred different URLs that 301 redirect to it.

The site "looks" like it works. It is only by running Xenu LinkSleuth over a comprehensive set of test URLs that this stuff is coming up.

.

Additionally, I was fooled by the browser many times. It shows a URL like this:

http://example.com/pages.php?Muid=44&mu=vii&pge=0

like this:

http://example.com/pages.php?Muid=44u=vii&pge=0

on screen. Mouseover the URL in the report and it is seen correctly.

I now have to re-run my test data with all the & changed to &amp; - forgot that I needed to check it both ways.

[edited by: jdMorgan at 1:17 am (utc) on Sep. 21, 2008]
[edit reason] example.com [/edit]

jdMorgan

3:38 pm on Sep 20, 2008 (gmt 0)


Back-references contain the values of the matched parenthesized sub-expressions in the last-matched RewriteCond only -- as noted in the mod_rewrite documentation.

In order to 'pick up and collect' the pieces, you need to 'carry forward' previously-matched back-references from one RewriteCond to the next. Aside from using separate rules to handle different query-string-variable orders, here's one way to do it:


RewriteCond %{QUERY_STRING} pge=0&?
RewriteCond %{QUERY_STRING} &?mu=([clxvi]+)&?
RewriteCond %1>%{QUERY_STRING} ^([^>]+)>([^&]*&)*Muid=([0-9]+)&?
RewriteRule ^(index\.php)?$ http://www.example.com/pages/%1/%3? [R=301,L]

Note the use of the &? "soft anchors" on all query string parts. This prevents problems such as future use of a new, longer query variable that partially matches one of the shorter, older ones. For example, compare the results with and without soft anchoring if a new query string variable called "NEWmu" were to be introduced.

The ">" character used in the third RewriteCond is arbitrary, and has no special meaning. It is simply a relatively-rare and unique-for-this-specific-case character that is used to demarcate the previously-matched value from the one being currently evaluated.
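To make the "carry forward" mechanics concrete, here is the same rule set annotated with what each back-reference holds for a sample request (an editor's trace, using the query string from earlier in the thread):

```apache
# Sample request: /index.php?pge=0&mu=vii&Muid=33
RewriteCond %{QUERY_STRING} pge=0&?
# matches; no captures needed
RewriteCond %{QUERY_STRING} &?mu=([clxvi]+)&?
# matches; %1 = "vii"
RewriteCond %1>%{QUERY_STRING} ^([^>]+)>([^&]*&)*Muid=([0-9]+)&?
# tests the string "vii>pge=0&mu=vii&Muid=33":
#   ([^>]+)   re-captures "vii" as %1 (carried forward)
#   ([^&]*&)* consumes "pge=0&mu=vii&" as %2
#   ([0-9]+)  captures "33" as %3
RewriteRule ^(index\.php)?$ http://www.example.com/pages/%1/%3? [R=301,L]
# redirects to http://www.example.com/pages/vii/33
```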

Jim

[edited by: jdMorgan at 1:09 am (utc) on Sep. 21, 2008]

g1smd

3:48 pm on Sep 20, 2008 (gmt 0)


I have spent the last couple of hours looking for stuff like this, and 99% of the tutorials out there never get that deep. Thanks!

Some of the URLs have 4 variables, and some are optional, so there is no way that I am putting 16 rules in to fix that up. I'll go for the much shorter option!

I also tried to find a mega-post you did a year or two ago where someone was trying to fix multiple variables to always be in the same order - as I guessed that might get me started.

The best I came up with is this, as yet untested:

RewriteCond %{QUERY_STRING} &?pge=0&?
RewriteCond %{QUERY_STRING} &?mu=[clxvi]+&?
RewriteCond %{QUERY_STRING} &?Muid=([0-9]+)&?
RewriteRule ^(index\.php)?$ http://www.example.com/pages/%1/ [QSA]
RewriteCond %{QUERY_STRING} &?mu=([clxvi]+)&?
RewriteRule ^pages/(.*)/$ http://www.example.com/pages/%1/$1? [R=301,L]

(On the [QSA] rule I'm hoping that, with no [L] there, the redirect isn't invoked *yet*.)

I had already thought about adding something before the variable names to make the rules more efficient, but wanted to fix the other stuff first.

I need to look closely at your example, at least several times I think.

I assume that your code doesn't care about the parameter order, nor does it care if there are any other parameters present (they get dumped).

[edited by: jdMorgan at 1:14 am (utc) on Sep. 21, 2008]
[edit reason] example.com [/edit]

jdMorgan

5:00 pm on Sep 20, 2008 (gmt 0)


> I assume that your code doesn't care about the parameter order, nor does it care if there are any other parameters present (they get dumped).

Correct on both counts. They all get dumped because of the trailing "?" on the substitution.

You can use the exact same method to re-order the parameters: "pick up the pieces" of the query string one RewriteCond at a time, put them into %1 (separated by tokens like my ">" character above) to "carry them along", and then re-assemble at the end. (I believe I came up with this after my so-called "mega-post" and I like this method better.)

In general, avoid 'stacked' rewrites that change the URL-path -- these trigger a long-standing and still-present documented bug in Apache mod_rewrite, which results in multiple 'copies' of parts of the URL-path appearing in req_rec (the "current URL-path" variable used by mod_rewrite). If you must stack rewrites, then copy the URL-path from RewriteRule into a 'user variable', work on it there, and then put it back. A trivial example:


RewriteRule \.php$ - [E=MyURL:/index.php]
...
RewriteRule \.php$ - [E=MyURL:%{ENV:MyURL}?added-parm=foo]
...
RewriteCond %{ENV:MyURL} (.+)
RewriteRule \.php$ %1 [L]

Note that the URL 'seen' by RewriteRule is never changed until the very last rule. All others use "-" as the substitution address, and manipulate only the "MyURL" variable. This prevents the Apache bug from happening.

Jim

g1smd

5:13 pm on Sep 20, 2008 (gmt 0)


That's even more esoteric than the first example. :-)

I need to do more reading, and a lot more testing.

I am absolutely convinced that a large number of sites have installed code which *appears* to work but which has unseen flaws that are causing all sorts of problems "under the hood".

Without my long list of test URLs to run through Xenu LinkSleuth I would have signed this off as "done" about two days ago. Instead, I find several serious flaws that aren't all that apparent when simply browsing the site, and which I would have been extremely lucky to have picked out as candidates to sample with Live HTTP headers.

These redirects are picking up incoming links to URLs that only existed in an older version of the CMS, and sending the requests on to the new URL for the same resource.

jdMorgan

5:36 pm on Sep 20, 2008 (gmt 0)


> I am absolutely convinced that a large number of sites have installed code which *appears* to work but which has unseen flaws that are causing all sorts of problems "under the hood".

Very true... and go look at some of the other on-line forums dealing with mod_rewrite and related subjects: There are many more examples of bad code than good code on the Web: Dot-star-this and dot-star-that... People are upgrading servers all over the world because their regex patterns are so sloppy and slow! Or because they do unnecessary file-exists checks left and right (WordPress, for example).

Long test-URL lists are a requirement -- be sure to include them in your bid. :)

Most folks skip requirements specification and testing -- or give them very short shrift. It's all too common. :(

Jim

g1smd

6:21 pm on Sep 20, 2008 (gmt 0)


I *have* looked elsewhere, and I wasn't impressed.

There is a heck of a lot of junk code out there.

At least we do get there in the end, here at WebmasterWorld...

g1smd

7:40 pm on Sep 20, 2008 (gmt 0)


I've gone through every line and optimised everything as far as I can; I had previously missed a few ^(.*)$ that are now changed to (.*) instead.

I fixed up your other suggestions too. I had already changed ^index\.php|$ to become ^(index\.php)?$ a few days ago.

I am going to leave it until tomorrow, have a final look, and then run all the test URLs through Xenu LinkSleuth again. That process only takes about 5 minutes, but analysing the output takes quite a bit longer.

g1smd

12:32 am on Sep 21, 2008 (gmt 0)


These rules are seemingly being ignored now. Either the old URL serves its content directly, or it issues a 404 error. There's no redirect to the new URL:

RewriteCond %{QUERY_STRING} pge=0&?
RewriteCond %{QUERY_STRING} mu=([clxvi]+)&?
RewriteCond %1>%{QUERY_STRING} ^([^>]+)>([^&]&)*Muid=([0-9]+)&?
RewriteRule ^(index\.php)?$ http://www.example.com/test/pages/%1/%3? [R=301,L]
#
RewriteCond %{QUERY_STRING} mu=([clxvi]+)&?
RewriteCond %1>%{QUERY_STRING} ^([^>]+)>([^&]&)*Muid=([0-9]+)&?
RewriteRule ^pages?\.php$ http://www.example.com/test/pages/%1/%3? [R=301,L]

It's URLs like this returning 404 (without index filename):

http://www.example.com/test/?pge=0&Muid=33&mu=ii&l1i=

Other URLs, with index or page filename included, are either directly serving content or are being incorrectly handled by a different redirect.

[edited by: jdMorgan at 1:14 am (utc) on Sep. 21, 2008]
[edit reason] example.com, formatting, & disabled smiles [/edit]

jdMorgan

1:08 am on Sep 21, 2008 (gmt 0)


My example above had a missing quantifier:

RewriteCond %1>%{QUERY_STRING} ^([^>]+)>([^&]*&)*Muid=([0-9]+)&?

BTW, please use the [ code ] tags instead of italics -- much easier to read. And disable smilies when posting code, too. Thanks.

Jim

g1smd

5:47 am on Sep 21, 2008 (gmt 0)


Many Thanks. I stared at that code for ages without knowing what to do.

.

Does WebmasterWorld use smilies? I am aware of that issue on other forums.

In all my years here, I have never seen one.

g1smd

7:29 am on Sep 21, 2008 (gmt 0)


Just one set of URLs left to redirect; I missed one example format with only two variables.

Everything else is working fine. The extra * fixed it.

g1smd

11:21 am on Sep 21, 2008 (gmt 0)


One final oddity.

The very last item on the site navigation list is now "Forbidden" when clicked.

The URL as displayed does not have to pass through any redirects to deliver content.

The URL is rewritten to some other, different, internal file path to fetch the content.

If I remove all the redirects, and leave just the rewrite, then that URL does work just fine.

This one is going to need some thought.

It might be something within the PHP coding, and that would be out of my control.

g1smd

12:32 pm on Sep 21, 2008 (gmt 0)


It is most odd that just one link fails like this.

There are a few URLs that still have a one-parameter query string on the end. The parameter value is used by some JavaScript functions to do with styling the navigation bar. The navigation bar is going to be redesigned using CSS, and that will eventually make the query string completely obsolete (which is why that parameter isn't featured in any of the rewrites so far implemented).

For this one page with issues, if I remove the value from the query string, or remove the whole query string, then the page displays fine. The issue only occurs with one link/one URL on the whole site. My guess is that something in the JavaScript or PHP is responsible for the error message.

Is it worth finding and fixing, if the query string will be completely gone within weeks? I think not.

g1smd

7:35 pm on Sep 21, 2008 (gmt 0)



While out for a walk, far away from the web, it just occurred to me that this fails because that query string value includes the word "set", so it simply walks straight into this rule:

# Deny Hacker Query Strings Looking to Exploit Database Features:

RewriteCond %{QUERY_STRING} declare|char|set|cast|convert|delete|drop|exec|insert|meta|script|select|truncate|update [NC]
RewriteRule . - [F]

Ooops.

Since there will be no more query string after next week, it isn't all that important, but it would be nice to add some extra qualifier to that rule so that it reliably detects attempted hacks, while still allowing "normal" operation of a site.
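One way to add such a qualifier (an editor's sketch, untested, not from the thread): require the keywords to appear in SQL-like pairings rather than as lone words, so an innocent value such as "set" no longer trips the rule.

```apache
# Block only keyword combinations typical of injection attempts;
# a lone "set" or "select" in an ordinary parameter value passes.
RewriteCond %{QUERY_STRING} (select.+from|insert.+into|update.+set|delete.+from|drop.+table|truncate.+table|declare.+char) [NC]
RewriteRule . - [F]
```

This narrows the rule considerably, so it should be weighed against whatever attack patterns the original keyword list was meant to catch.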
