How to get .htaccess to recognize the quote (") character?

Forum Moderators: phranque

It seems to ignore it

MichaelBluejay

2:28 am on Oct 25, 2005 (gmt 0)

I have this line of code in my .htaccess file to strip off any trailing characters in a url:

RewriteRule (.*\.html).+$ [mydomain.com...] [R=301]

That fixes a lot of problem links to my site from blogs and forums where people make typos when entering my url. But there's a request from another site for a url on my site that ends in a quote mark and the quote mark doesn't get stripped:

http://mydomain.com/directory/page.html"

Maybe it's a bot requesting the page, I don't know, but I still want to strip the quote character. .htaccess just ignores it, and delivers a "Page not found" error. Is there any way to get .htaccess to strip any characters after the .html even if it's a quote mark?

jdMorgan

1:58 pm on Oct 25, 2005 (gmt 0)

I'd suggest changing your RewriteRule pattern to make it less ambiguous:


RewriteRule ^([^.]+\.html).+$ http://example.com/$1 [R=301,L]

If that doesn't work, then try using RewriteCond %{REQUEST_URI} and testing for \%22 to catch a standard double-quote character.


RewriteCond %{REQUEST_URI} ^/([^.]+\.html)\%22$
RewriteRule \.html http://example.com/%1 [R=301,L]
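To make the difference between the two approaches concrete, here is a small Python sketch (not Apache code). It assumes a hypothetical request path, and it assumes that the RewriteRule pattern is tested against the percent-decoded URL-path while the RewriteCond pattern is tested against a still-encoded string. Which form a given server variable actually holds depends on the Apache version and configuration, so treat this only as an illustration of what each pattern is able to match.

```python
import re
from urllib.parse import unquote

# Hypothetical request path: a stray double quote travels as %22 on the wire.
raw_path = "/directory/page.html%22"
decoded_path = unquote(raw_path)  # percent-escapes turned back into characters

# RewriteCond-style pattern from the post, tested against the encoded form.
cond = re.compile(r"^/([^.]+\.html)\%22$")
# RewriteRule-style pattern from the post; .htaccess rules see no leading slash.
rule = re.compile(r"^([^.]+\.html).+$")

cond_match = cond.match(raw_path)
rule_match = rule.match(decoded_path.lstrip("/"))

print(decoded_path)
print(cond_match.group(1))
print(rule_match.group(1))
```

Under these assumptions both patterns capture directory/page.html, so either one could serve as the basis for the redirect target.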

Jim

MichaelBluejay

11:28 pm on Oct 25, 2005 (gmt 0)

Okay, good news and bad news. The good news is that this had nothing to do with the quote character! My original code actually handles that nicely. The problem was that the url in question was actually http://example.com/subdir/page.html", and the subdir had its own .htaccess file, so for some reason it was ignoring the Rewrite command that was in the .htaccess that's at the root level.

The bad news is that just copying the Rewrite code to the subdir's .htaccess file doesn't fix the problem, because the url above gets redirected to http://http://example.com/page.html (without the subdir). This is true whether I use my original code or your single-line substitute code. (And by the way, I don't understand what's better about your version.)

I don't see what's wrong with the code because with (.*) in parentheses, it looks like it should catch all the characters, meaning the whole url after the domain name, including the subdirectory name and the slash. But it's not. What am I doing wrong?

jdMorgan

12:40 am on Oct 26, 2005 (gmt 0)

We've discussed this at length here, but basically, it takes n+5 passes for mod_rewrite to match your pattern ".*\.html.+$", where n is the length of the 'tail' string following "html".

On the first pass, the ".*" matches the entire requested URL-path. But this leaves nothing for the rest of the pattern to match, so mod_rewrite tries again, backing off one character from the end. But the rest of the "\.html" pattern still can't match. So it backs off again and again, until the "\.html.+" pattern can be satisfied, with ".*" matching everything ahead of that, thus requiring n+5 passes.

With the pattern "^[^.]+\.html.+$", the first subpattern matches all characters up to but not including the period preceding "html", then "\." matches the period, then "html" is matched, and then the 'tail' pattern of ".+" matches whatever follows "html" (which was presumed to be a double-quote character in the first post) -- all in one single left-to-right pass. So, it's much faster and doesn't waste CPU time.

Using ".*" is "easy" but often inefficient, and frequently leads to unexpected results. It should be used only when a more specific pattern cannot be used.
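The back-off described above can be sketched in Python. This is a simplified simulation, not what mod_rewrite runs internally: the helper greedy_backoffs is a hypothetical name, and it counts the characters that a greedy ".*" has to give back before the remainder of the pattern fits. The sample paths are made up.

```python
import re

# The two patterns from the thread; the syntax is the same in Python for these cases.
GREEDY = re.compile(r"(.*\.html).+$")
NEGATED = re.compile(r"^([^.]+\.html).+$")
# The part of the greedy pattern that follows ".*".
REST = re.compile(r"\.html.+$")

def greedy_backoffs(path: str) -> int:
    """Count characters a greedy '.*' gives back before the rest can match.

    Returns -1 (a sentinel chosen for this sketch) if the rest never matches.
    """
    # '.*' first swallows the whole string, then retreats one character at a time.
    for end in range(len(path), -1, -1):
        if REST.match(path, end):
            return len(path) - end
    return -1

quote_tail = 'directory/page.html"'   # tail length n = 1
long_tail = "page.htmlXYZ"            # tail length n = 3
dotted_dir = 'dir.v2/page.html"'      # directory name containing a period

print(greedy_backoffs(quote_tail), greedy_backoffs(long_tail))
print(GREEDY.match(quote_tail).group(1), NEGATED.match(quote_tail).group(1))
print(GREEDY.match(dotted_dir).group(1), NEGATED.match(dotted_dir))
```

With a one-character tail the greedy pattern gives back 6 characters, and with a three-character tail it gives back 8, which is n+5 in both cases. One caveat of the more specific pattern: because "[^.]+" refuses to cross a period, it does not match at all when a directory name contains one, as the last line shows.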

If your root .htaccess file seems not to apply to subdirectories, then add


RewriteOptions inherit

to those .htaccess files in subdirectories where you want the root .htaccess to apply as well (see http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html#RewriteOptions). Some servers already have this configured and some don't.

Jim

MichaelBluejay

1:24 am on Oct 26, 2005 (gmt 0)

Okay, thanks for the tip about more efficient RewriteRules. I've filed that away for future reference.

I'm not sure that RewriteOptions Inherit is what I want, because then that will load the whole .htaccess file from the root, and it seems like THAT will waste CPU time. I'd rather just put the few commands I need into the subdir's .htaccess file.

Which brings me back to the problem mentioned in my last post: Neither your code nor my code works for stripping trailing characters when the file is in a subdirectory. The url gets rewritten to the root level, not the subdir level. Could you tell me how I'd fix this? I assume it's possible to do without RewriteOptions Inherit.

Thanks!

jdMorgan

1:48 am on Oct 26, 2005 (gmt 0)

That duplicated "http" indicates a fairly severe server misconfiguration... Not sure what would cause that, but a call to your host or your sysadmin would seem to be in order. There's likely either another internal rewrite taking place that's messing that up, a bad Alias, or a bad ServerName (in conjunction with UseCanonicalName on) configuration.

Jim

MichaelBluejay

3:13 am on Oct 26, 2005 (gmt 0)

There is no duplicated http. The problem is that http://example.com/subdir/page.htmlX gets rewritten to http://example.com/page.html (without the subdir).

jdMorgan

3:24 am on Oct 26, 2005 (gmt 0)

Sorry, I took you at your word:

"because the url above gets redirected to http://http://example.com/page.html (without the subdir)."

In order to make the rule work in your subdirectory, just add the subdir back in:


RewriteRule ^([^.]+\.html).+$ http://example.com/subdir/$1 [R=301,L]

In a per-directory .htaccess context, the path to the current (sub)directory is removed before matching, so it isn't present in the URL-path that the RewriteRule pattern sees and doesn't get copied into the new URL. You must therefore add it back explicitly, as in the rule above.
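The prefix stripping can be simulated in a few lines of Python. This is only a sketch of the behavior described here, not of Apache's real implementation; per_dir_redirect and its parameters are hypothetical names invented for the illustration.

```python
import re
from typing import Optional

# Same pattern as the RewriteRule in this thread.
RULE = re.compile(r"^([^.]+\.html).+$")

def per_dir_redirect(url_path: str, htaccess_dir: str, target_prefix: str) -> Optional[str]:
    """Simulate a rule living in <htaccess_dir>/.htaccess (simplified sketch)."""
    if not url_path.startswith(htaccess_dir):
        return None  # this .htaccess file does not apply to the request
    # The directory prefix is removed before the pattern is tested.
    local_path = url_path[len(htaccess_dir):]
    match = RULE.match(local_path)
    if match is None:
        return None
    # The back-reference holds only what the pattern saw, without the directory.
    return target_prefix + match.group(1)

request = "/subdir/page.htmlX"

# Rule copied unchanged into the subdirectory: the subdir is lost.
lost = per_dir_redirect(request, "/subdir/", "http://example.com/")
# Rule with the subdirectory added back to the substitution.
fixed = per_dir_redirect(request, "/subdir/", "http://example.com/subdir/")
# Same rule in the root .htaccess: the pattern sees the subdir itself.
from_root = per_dir_redirect(request, "/", "http://example.com/")

print(lost)
print(fixed)
print(from_root)
```

In this simulation the unchanged rule yields http://example.com/page.html, while the other two calls yield http://example.com/subdir/page.html.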

Jim