lucy24 - 2:51 am on Oct 22, 2013 (gmt 0)
The important question is: what do the requests look like at the moment they reach your site? I'm used to seeing the \x form in UA strings-- usually undesirable ones-- while URLs should be safely decoded by the time they reach you.
:: detour to test a random string ::
Oh, interesting. Error logs use the \x notation, while access logs use percent-encoding. At least this week, on my current real-life server. But that's just logging; what actually reaches the site-- including my htaccess file-- is the raw character.
I can't guarantee that this will work on all servers and all Apache installations, but my test site was happy with this:
RewriteRule ^([\w/.-]*)[^\w/.-] http://www.example.com/$1 [R=301,L]
I simply grouped all the characters that can occur in a URL: "word" characters (alphanumerics plus lowline, i.e. underscore), slash, dot, hyphen. Add punctuation as necessary, looking only at the body of the URL; queries don't matter, because a RewriteRule pattern only sees the path. If your server changes its mind and starts viewing the "bad" characters as either percent-encoded or \x-encoded, you are still good to go. In each of those forms, a non-permitted character (% or \) is still the very first thing in the added part.
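For reference, here is the same rule sketched out with comments, plus a variant showing what "add punctuation as necessary" might look like. The tilde and comma in the second version are only an illustration I picked, not something from the original question, and example.com stands in for your own hostname.

```apache
RewriteEngine On

# ^([\w/.-]*)   capture the longest leading run of permitted characters
# [^\w/.-]      then require one character that is NOT permitted
# Everything from that character onward is dropped; the redirect
# goes to the captured part only.
RewriteRule ^([\w/.-]*)[^\w/.-] http://www.example.com/$1 [R=301,L]

# Hypothetical variant that also permits tilde and comma in URLs.
# Use this INSTEAD of the rule above, not alongside it.
# The hyphen stays last in the brackets so it is read literally.
# RewriteRule ^([\w/.~,-]*)[^\w/.~,-] http://www.example.com/$1 [R=301,L]
```

The redirected URL contains only permitted characters, so the rule cannot match a second time and there should be no redirect loop.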
Now, this will only work if the garbage comes after the legitimate part of the request, because all you're doing is chopping it off. And if you're getting bad requests with accented letters like é, or non-Roman ones like ᐄ, you may have to change the details of the bracketed group: the \w notation is pretty all-encompassing, so depending on how your server's regex engine is set up, it may count those letters as legitimate.
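If it turns out that \w is letting unwanted letters through, one option is to spell out the permitted set by hand. This is an untested sketch: whether \w matches anything beyond plain ASCII on a given server is something I'd check rather than assume, and it only makes sense if your legitimate URLs are themselves ASCII-only.

```apache
RewriteEngine On

# Same rule, with \w replaced by an explicit ASCII-only list:
# letters, digits, underscore, slash, dot, hyphen.
# Any other character, accented or non-Roman letters included,
# counts as garbage and is cut off along with whatever follows it.
RewriteRule ^([A-Za-z0-9_/.-]*)[^A-Za-z0-9_/.-] http://www.example.com/$1 [R=301,L]
```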