URL request has unicode characters

Any way to rewrite or redirect using htaccess?

     
2:09 pm on Oct 21, 2013 (gmt 0)

10+ Year Member



I have recently started to see a lot of requests for URLs that have Unicode characters appended to the end. These requests are filling my error log at an alarming rate and I don't know how to prevent it. Is there a way to rewrite these requests using htaccess?

Sample request from error log:

/httpdocs/blogs/thisblog/\xe2\x80\x9dhttp:,


The server answers them with a 404 ("File does not exist"), and that's what's filling the logs.
4:30 pm on Oct 21, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Unless you have your own server with firewalls, there is no way to keep a request from reaching the server-- and from being logged.

:: detour to look up ::

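Oh, a curly quote: \xe2\x80\x9d is the UTF-8 encoding of U+201D, the right double quotation mark (not the BOM, which is \xef\xbb\xbf). Don't know why I have to look it up every time.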
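For anyone else who has to look it up: a minimal Python sketch, assuming the error log writes each non-ASCII byte as a literal \xNN escape, that turns the logged sample back into characters. The function name decode_log_escapes is made up for illustration. The three bytes in the sample come out as U+201D, the right curly double quote; a UTF-8 BOM would be the bytes EF BB BF instead.

```python
import re
import unicodedata


def decode_log_escapes(logged: str) -> str:
    """Turn literal \\xNN escapes from a log line back into UTF-8 text.

    Assumes the logged text is otherwise plain ASCII (hypothetical helper).
    """
    raw = logged.encode("ascii")
    # Replace each four-character escape such as \xe2 with the single byte it names.
    unescaped = re.sub(
        rb"\\x([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        raw,
    )
    return unescaped.decode("utf-8")


# The garbage part of the sample request, exactly as the error log shows it.
sample = r"\xe2\x80\x9dhttp:,"
decoded = decode_log_escapes(sample)
first = decoded[0]
print(f"U+{ord(first):04X}", unicodedata.name(first))

# For comparison: the byte sequence of a UTF-8 byte order mark.
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes)
```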

If they are robots making garbage requests, a 404 is a perfectly legitimate response, though you may want to lock out repeat offenders on other criteria.

If, instead, these are human requests with accidentally appended garbage, it's pretty easy to redirect (not rewrite!) them to the garbage-free version. But we'll need a little more information to hammer out the right format.
11:56 pm on Oct 21, 2013 (gmt 0)

10+ Year Member



Redirect is fine. What type of info do you need?
2:51 am on Oct 22, 2013 (gmt 0)

lucy24 (WebmasterWorld Senior Member)



The important question is: what do the requests look like at the moment they reach your site? I'm used to seeing the \x form in UA strings-- usually undesirable ones-- while URLs should be safely decoded by the time they reach you.

:: detour to test a random string ::

Oh, interesting. Error logs use the \x notation, while access logs use percent-encoding. At least this week, on my current real-life server. But that's just logging; what actually reaches the site-- including my htaccess file-- is the raw character.
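To make the two log notations concrete, here is a small Python sketch that writes the same path both ways. It only imitates what the logs look like: the function names are hypothetical, and real Apache escaping also handles control characters and a few other cases that this ignores.

```python
from urllib.parse import quote


def error_log_style(path: str) -> str:
    """Imitate the \\xNN notation: every non-ASCII byte becomes \\x plus two hex digits."""
    out = []
    for byte in path.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))
        else:
            out.append(f"\\x{byte:02x}")
    return "".join(out)


def access_log_style(path: str) -> str:
    """Imitate percent-encoding of the same bytes, leaving slashes alone."""
    return quote(path, safe="/")


# A path ending in a right curly double quote (U+201D), as the raw character.
raw_path = "/blogs/thisblog/\u201d"
print(error_log_style(raw_path))
print(access_log_style(raw_path))
```

Either way it is the same three bytes; only the spelling in the log differs.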

I can't guarantee that this will work on all servers and all Apache installations, but my test site was happy with this:

RewriteRule ^([\w/.-]*)[^\w/.-] http://www.example.com/$1 [R=301,L]

I simply grouped all the characters that can occur in an URL: "word" characters (alphanumerics plus lowline), slash, dot, hyphen. Add punctuation as necessary, looking only at the body of the URL; queries don't matter. If your server changes its mind and starts viewing the "bad" characters as either percent-encoded or \x-encoded, you are still good to go. In each of those forms, a non-permitted character (% or \) is still the very first thing in the added part.
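A quick way to see what that pattern keeps and what it chops, sketched in Python rather than Apache. Assumptions: re.ASCII is set so that \w means only A-Z, a-z, 0-9 and underscore (what \w matches on a given Apache build may differ), the leading slash is dropped the way htaccess sees the path, and the helper name redirect_target is a placeholder.

```python
import re
from typing import Optional

# Same pattern as the RewriteRule: capture the leading run of permitted
# characters, then require one character that is not permitted.
GARBAGE = re.compile(r"^([\w/.-]*)[^\w/.-]", re.ASCII)


def redirect_target(path: str) -> Optional[str]:
    """Return the URL the rule would redirect to, or None if the rule does not fire."""
    m = GARBAGE.match(path)
    if m is None:
        return None
    return "http://www.example.com/" + m.group(1)


# The same garbage in three spellings: raw character, percent-encoded, \x-escaped.
for candidate in (
    "blogs/thisblog/\u201dhttp:,",
    "blogs/thisblog/%E2%80%9Dhttp:,",
    r"blogs/thisblog/\xe2\x80\x9dhttp:,",
    "blogs/thisblog/post.html",  # clean request, should be left alone
):
    print(repr(candidate), "->", redirect_target(candidate))
```

In all three spellings the first non-permitted character sits right after the legitimate part, so the captured group is the same.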

Now, this will only work if the garbage comes after the legitimate part of the request. You're simply chopping it off. And if you're getting bad requests with accented letters, or non-Roman ones like ᐄ, you'll have to change the details of the bracketed group. The \w notation is pretty all-encompassing.
2:39 am on Oct 23, 2013 (gmt 0)

10+ Year Member



Thanks for trying, lucy. I put the RewriteRule in my htaccess and it didn't make a difference. Server didn't object to it but the crap is still filling the error log.
3:42 am on Oct 23, 2013 (gmt 0)

lucy24 (WebmasterWorld Senior Member)



Phooey. What happens when you manually request an URL with non-ASCII characters in it? Take one of your ordinary URLs and shove something like a curly quote onto the end-- or into the middle, doesn't matter. Do you get redirected? If no, there's some unrelated problem.
1:14 am on Oct 24, 2013 (gmt 0)

10+ Year Member



Traced the problem back to an entry where the author used a curly quote in an href, which XHTML didn't like. Fixed it and validated the page. This will likely stop future error log entries, but the requests that keep coming in from search engines will continue to be a pain.
4:41 am on Oct 24, 2013 (gmt 0)

lucy24 (WebmasterWorld Senior Member)



the author used

Is this UGC? If so, the writer probably didn't even realize what he was doing. (Should have, but didn't.) Betcha he had one of those "smart quotes" options turned on. I have to type my curly quotes and apostrophes manually because the text editor isn't clever enough to leave them unaltered when-and-only-when they're inside html tags.

But if a request is reaching your site with extraneous bad characters, it should still redirect :( Or return a 410 with the [G] flag, or whatever approach you choose to take.