Problem with strange characters

11:34 am on Jun 27, 2011 (gmt 0)



We've found that we're receiving requests containing weird or non-English characters, and we want to redirect these requests to our custom 404 page.

The requests are like these:

[mysite.com...]
[mysite.com...]
[mysite.com...]

Even when I try it myself with accented characters, I want it to redirect to the 404 page:

[mysite.com...]

I've tried many things but I can't get it to work. These characters in the request are immediately URL-encoded to %xx.

For instance, this rule won't work:
RewriteRule !^([_-a-zA-Z0-9\/]*)$ /not_found.php [L]
It gives an internal server error (500).
This one
RewriteRule ^([^_-a-zA-Z0-9\/]*)$ /not_found.php [L]
doesn't work as expected either.

Any suggestions appreciated.
3:38 pm on June 27, 2011 (gmt 0)

lucy24 (Senior Member, US)


Rule #1: Use example.com in your examples. All other addresses get auto-converted into active links. You don't want people to go to your page, you want them to see what you typed.
http://www.example.com/some%C3%ADtext
http://www.example.com/some%C3%A9text
http://www.example.com/some%F3text

Even when I try it myself with accented characters, I want it to redirect to the 404 page:

http://www.example.com/s%EF%BF%BDme%EF%BF%BDtext

Ouch. I hope that last one wasn't meant to be your actual address, since it contains two "I can't deal with this" UTF-8 characters (%EF%BF%BD is U+FFFD, the replacement character). You meet them most often when certain Latin-1 characters are reinterpreted as UTF-8 but fall in a forbidden range. Hence the need for example.com, so the characters you typed can remain visible as í, é and ... ó? How on earth did you achieve that? (The first two came through in UTF-8 encoding, the last as the single Latin-1 byte %F3 instead of the expected %C3%B3.)

I'm missing something. If they're requesting pages that don't exist-- for whatever reason-- they should be landing on the 404 page anyway. Did you mean that you want to direct "vanilla" 404s and "special-character" 404s to different locations?

For instance, this rule won't work:
RewriteRule !^([_-a-zA-Z0-9\/]*)$ /not_found.php [L]
It gives an internal server error (500).

Woo hoo, you have remembered Rule #2*, which is to explain how it doesn't work. It may simply dislike the \/. Apache doesn't seem to like it when you escape characters that don't need to be escaped. (Some RegEx dialects don't care.) You are technically allowed to use ! in the Rule, but it has to be done with extreme caution. Save it for a preceding RewriteCond, where you can get away with more.

The second version, with your anchors,
RewriteRule ^([^_-a-zA-Z0-9\/]*)$ /not_found.php [L]

would only pick up requests made up entirely of characters outside that set, with not a single ordinary letter, digit or slash among them. You are not likely to get many of those.
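A rough sketch of the unanchored alternative, not tested against the poster's server. It assumes the rule lives in the site's root .htaccess and that RewriteRule is handed the path after the %xx escapes have been decoded, so /some%C3%ADtext arrives as raw non-ASCII bytes. The dot in the allowed set is an addition of this sketch, so that /not_found.php and ordinary file names such as style.css pass through.

```apache
RewriteEngine On

# No ^ or $ anchors: one character outside the allowed set, anywhere in
# the path, is enough to match. The hyphen goes first in the class so it
# is a literal hyphen rather than part of a range.
RewriteRule [^-_a-zA-Z0-9/.] /not_found.php [L]
```

Because not_found.php itself contains only allowed characters, the rewritten request does not match a second time. The script would still have to send the 404 status itself, since an internal rewrite on its own answers with 200.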

Do you have any pages that actually contain the % character? If not, and they've been encoded before they reach your .htaccess, it may be simplest just to say % (without anchors) meaning that something somewhere in the request has had to be encoded. Or %25 if other characters like / are also being encoded.
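One way the "just look for %" idea might be written, offered as a sketch rather than a tested fix. As far as I know, the path a RewriteRule tests has already been decoded, so this version checks %{THE_REQUEST}, the raw request line, instead. It assumes that no legitimate URL on the site contains an escaped character and that the custom page is /not_found.php, already configured through ErrorDocument.

```apache
RewriteEngine On

# Do not touch the error page itself when Apache fetches it internally.
RewriteCond %{REQUEST_URI} !^/not_found\.php$
# THE_REQUEST is the request line as the browser sent it, for example
#   GET /some%C3%ADtext HTTP/1.1
# Match a % anywhere in the path, stopping before any query string.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s[^?\s]*%
# An R code outside 300-399 drops the substitution and returns that status.
RewriteRule ^ - [R=404,L]
```

Answering with a 404 status directly, instead of rewriting to the script, keeps the response at the URL that was requested.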

Error documents are generally static. What does the php do?


* I made this up. But so far nobody has objected loudly.
3:46 pm on June 27, 2011 (gmt 0)

g1smd (Senior Member)


I assume the 500 error occurs because, after the request is rewritten to "/not_found.php", the new path still matches the pattern, so the rule rewrites to "/not_found.php" again and again in an infinite loop. Add a RewriteCond exclusion to stop this happening.
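A minimal sketch of such an exclusion, assuming the rule sits in the root .htaccess and the error script is /not_found.php as in the original post. The character class is also written with the hyphen first and the slash unescaped; that tidy-up is a choice made for this sketch, not something stated in the thread.

```apache
RewriteEngine On

# Leave the error script alone, so a request that has already been
# rewritten to it is not rewritten a second time.
RewriteCond %{REQUEST_URI} !^/not_found\.php$
# Any path not made up solely of the allowed characters goes to the script.
RewriteRule !^[-_a-zA-Z0-9/]*$ /not_found.php [L]
```

As written, this still sends every URL containing a dot (style.css, for example) to the script, so the allowed set would probably need widening before real use.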


One note on "error 404 pages". The 404 error must be served at the originally requested URL to tell the browser that THIS URL does not exist. A 404 error is NEVER a redirect. Redirects have 30x status codes. You do NOT want a redirect here.
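For reference, the usual way to get that behaviour is Apache's ErrorDocument directive. A sketch, assuming the custom page is the /not_found.php script from the original post:

```apache
# A local path is served internally: the browser keeps the URL it asked
# for and receives a 404 status along with the custom page.
ErrorDocument 404 /not_found.php

# Avoid a full URL here. Given one, Apache sends the client a redirect
# instead, so the original URL never reports 404:
# ErrorDocument 404 http://www.example.com/not_found.php
```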