| 4:30 pm on Oct 21, 2013 (gmt 0)|
Unless you have your own server with firewalls, there is no way to keep a request from reaching the server-- and from being logged.
:: detour to look up ::
Oh, the BOM. Don't know why I have to look it up every time.
If they are robots making garbage requests, a 404 is a perfectly legitimate response, though you may want to lock out repeat offenders on other criteria.
If, instead, these are human requests with accidentally appended garbage, it's pretty easy to redirect (not rewrite!) them to the garbage-free version. But we'll need a little more information to hammer out the right format.
| 11:56 pm on Oct 21, 2013 (gmt 0)|
Redirect is fine. What type of info do you need?
| 2:51 am on Oct 22, 2013 (gmt 0)|
The important question is: what do the requests look like at the moment they reach your site? I'm used to seeing the \x form in UA strings-- usually undesirable ones-- while URLs should be safely percent-decoded by the time they reach you.
:: detour to test a random string ::
Oh, interesting. Error logs use the \x notation, while access logs use percent-encoding. At least this week, on my current real-life server. But that's just logging; what actually reaches the site-- including my htaccess file-- is the raw character.
I can't guarantee that this will work on all servers and all Apache installations, but my test site was happy with this:
RewriteRule ^([\w/.-]*)[^\w/.-] http://www.example.com/$1 [R=301,L]
I simply grouped all the characters that can occur in an URL: "word" characters (alphanumerics plus lowline), slash, dot, hyphen. Add punctuation as necessary, looking only at the body of the URL; queries don't matter. If your server changes its mind and starts viewing the "bad" characters as either percent-encoded or \x-encoded, you are still good to go. In each of those forms, a non-permitted character (% or \) is still the very first thing in the added part.
Now, this will only work if the garbage comes after the legitimate part of the request. You're simply chopping it off. And if you're getting bad requests with accented letters like é, or non-Roman ones like ᐄ, you'll have to change the details of the bracketed group. The \w notation only covers ASCII word characters, which is why those need to be added explicitly.
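For what it's worth, one way to widen the bracketed group without enumerating every accented letter is to permit any high byte-- a sketch only, assuming a PCRE-based Apache where \xhh escapes work in character classes, with example.com as a stand-in as before:

```apache
# Permit any byte in the \x80-\xFF range alongside the usual URL body
# characters, so UTF-8 letters like é or ᐄ pass through untouched.
# The first low-ASCII character outside the class still triggers the
# trim-and-redirect to the clean version.
RewriteRule ^([\w/.\x80-\xff-]*)[^\w/.\x80-\xff-] http://www.example.com/$1 [R=301,L]
```

Crude, since it waves through *any* non-ASCII byte, but it saves you from guessing which alphabets your legitimate URLs use.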
| 2:39 am on Oct 23, 2013 (gmt 0)|
Thanks for trying, lucy. I put the RewriteRule in my htaccess and it didn't make a difference. Server didn't object to it but the crap is still filling the error log.
| 3:42 am on Oct 23, 2013 (gmt 0)|
Phooey. What happens when you manually request an URL with non-ASCII characters in it? Take one of your ordinary URLs and shove something like a curly quote onto the end-- or into the middle, doesn't matter. Do you get redirected? If no, there's some unrelated problem.
| 1:14 am on Oct 24, 2013 (gmt 0)|
Traced the problem back to an entry where the author used “ ” in an href, which XHTML didn't like. Also validated the page. This will likely stop future error log entries, but the ones that search engines keep requesting will continue to be a pain.
| 4:41 am on Oct 24, 2013 (gmt 0)|
Is this UGC? If so, the writer probably didn't even realize what he was doing. (Should have, but didn't.) Betcha he had one of those "smart quotes" options turned on. I have to type my curly quotes and apostrophes manually because the text editor isn't clever enough to leave them unaltered when-and-only-when they're inside html tags.
But if a request is reaching your site with extraneous bad characters, it should still redirect :( Or return a [G] or whatever approach you choose to take.
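If you go the [G] route instead of redirecting, a minimal sketch-- same assumed character class as the earlier rule, and [G] is just shorthand for a 410 Gone:

```apache
# Refuse outright: any request containing a character outside the
# permitted set gets a 410 Gone instead of a redirect.
RewriteRule [^\w/.-] - [G]
```

Robots tend to give up on a 410 faster than on a 301, so this may thin the logs sooner-- but humans with an accidentally mangled link get nothing useful.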