|change encoding of URIs: from ISO8859 to UTF-8|
How do I change Apache's URL encoding from ISO8859 to UTF-8?
I need to have some files whose names contain accented characters.
For example, ò.txt
In other servers, I transfer the file with FTP, and everything works.
I type the URL in the browser:
The browser automatically encodes the URL to UTF-8
Apache receives the GET request
GET /%C3%B2.txt HTTP/1.1
and decodes it, correctly returning the file "ò.txt"
On some servers (running WHM + cPanel, btw) this fails, and Apache incorrectly searches for a funnily-named file. In the error log, I see
File does not exist: /home/example/public_html/\xc3\xb2.txt
I have seen that requesting the same file by manually entering a URL like this works:
(i.e. Apache receives a "GET /%f2.txt" request and serves the "ò.txt" file)
It appears that on this server the URLs are decoded using ISO8859-1 instead of UTF-8, as happens on my other servers.
Where do I change this setting?
If I get it correctly, AddDefaultCharset (which btw I have tried) does not do this, as it changes the default charset for the response, not the request!
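For what it's worth, the two decodings produce visibly different percent-escapes for the same accented filename. A minimal sketch in Python (used purely to illustrate the byte-level difference; the filename is just an example):

```python
# Side-by-side of the two percent-encodings at issue: the same filename,
# escaped as UTF-8 bytes vs. ISO-8859-1 bytes.
from urllib.parse import quote

name = "ò.txt"                                # accented filename
print(quote(name))                            # UTF-8 bytes C3 B2 -> %C3%B2.txt
print(quote(name, encoding="iso-8859-1"))     # Latin-1 byte F2  -> %F2.txt
```

So a server that percent-decodes as Latin-1 will only find the file when it gets the `%F2` form, while every browser sends the `%C3%B2` form.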
I suppose it would be a complete waste of time to say: Don't
I'm all for putting the content of any and all files into UTF-8. But filenames? That's not only your server. It's everything that the request meets in transit from user's browser, across the internet and into the server. And further complications if the file is linked from some third site.
AddCharset (mod_mime) and AddDefaultCharset (core) apply to the file's contents, not its name. Both should be viewed mainly as information you're giving the end user's browser.
are you willing to trust every proxy server that sees your file to treat that filename correctly?
how lucky do you feel?
what she said.
asciify your filenames and your life will be much simpler.
you should read the man page for convmv:
this utility converts filenames from one encoding to another.
the section on "Filesystem issues" is enlightening.
While I generally agree with the "Don't" approach (to make one's life simple), the fact is that I am migrating to my server several customers' sites that were already written that way (things like IMG SRC="école.jpg").
I.e., sometimes one cannot take the simpler road.
That said, I suggest a small experiment...
go to the main russian wikipedia page ( ru.wikipedia.org )
You'll get redirected to a page with a Cyrillic URL, like all pages of ru.wikipedia.
If you copy the URL in a notepad, you'll get the form that was actually requested to the server:
That is, UTF-8 characters have been percent-encoded, as specified in RFC 3986
This URL can safely travel through proxies, through Google translate, etc. etc.
So, what is good for wikipedia can be good for my sites :-)
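You can reproduce what the browser does locally; percent-encoding the Cyrillic page title gives the same form you see when pasting the URL into a notepad (Python just for illustration, and assuming the main page title is "Заглавная страница"):

```python
# Percent-encode a Cyrillic title the way browsers do (UTF-8 + RFC 3986).
from urllib.parse import quote

title = "Заглавная_страница"   # ru.wikipedia main-page title (assumed)
encoded = quote(title)
print(encoded)                  # each Cyrillic letter becomes two %XX escapes
```

The underscore and other ASCII characters pass through unescaped; everything outside ASCII is two `%XX` escapes per letter.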
If only I was able to configure apache...
The odd thing is, other apache installations behave as expected (use UTF-8 for escape-encoding), I just cannot find the way to set the url decoding on cpanel servers.
Oh, supporting files. I assume you're not editing the individual pages, so there's nothing you can do to stop the end user's browser from requesting "école.jpg" in the first place. But can we assume you've got encoding information in place for the page content, so the browser is requesting é rather than ├ę?
There are two points of contact: What the browser originally sees, and what reaches your server. In between is a vast area that's out of your control. What you may need to do is set up a simple RewriteRule that looks for any URL containing \\x (escape the literal backslash) and rewrites to a php script that does the conversion. Note that \x is essentially the same as %. The form has a technical name which escapes me haha at the moment.
The browser is not requesting ò, it's requesting
GET /%C3%B2.txt HTTP/1.1
(that's what I actually see from Live HTTP headers, and that's what I see in the server logs.)
So, the "in between area" is pretty safe, the request travels in a properly encoded form.
The problem is just server-side.
In this installation of Apache, the percent-encoded URLs are decoded using ISO8859-1, while in other installations I have the URLs are decoded using UTF-8.
In this installation, apache would require a request like "http://www.example.com/%f2.txt" to serve the file ò.txt (i.e., it decodes the percent-encoded URL with ISO8859 instead of UTF-8)
The problem is, the rest of the world is pretty consistent in encoding the URL in UTF-8, i.e. the request the server gets is "http://www.example.com/%C3%B2.txt"
|The browser is not requesting ò |
What does the <img src... say? ò or %C3%B2?
Ugh, that would be horrible. A bare F2 byte is flat-out invalid in UTF-8, so that would be an even worse error than misinterpreting UTF-8 as 8859-1.
:: google, google ::
Are we on Tomcat by any chance?
|By default, Tomcat uses ISO-8859-1 character encoding when decoding URLs received from a browser. This can cause problems |
... Yes, indeed. It says here [confluence.atlassian.com] (really https, but forums refused to cooperate)
<Connector port="8090" URIEncoding="UTF-8"/>
which of course is so much Hungarian to me, but it's one of the few pages I found that even understands the question; most turn out to be about file encoding within an html document, not the URL. Curiously, Tomcat seems to be a recurring theme among those that do know what's going on. Also java, which is not much help.
Here is another useful if infuriating [blog.lunatech.com] version from a few years back:
|The standards do not define any way by which a URI might specify the encoding it uses, so it has to be deduced from the surrounding information. For HTTP URLs it can be the HTML page encoding, or HTTP headers. This is often confusing and a source of many errors. In fact, the latest version of the URI standard defines that new URI schemes use UTF-8 |
Thanks. Just what we needed to hear.
And then there's this [freecode.com], which strikes me as a throwing-in-the-towel approach, but may actually work. (Full disclosure: I found the link on That Other Forum.)
> What does the <img src... say? ò or %C3%B2?
The img src says ò
But even typing the address directly, the problem is the same.
> Are we on Tomcat by any chance?
No, no Tomcat, just plain apache
Yes, I saw the lunatech page, which is very useful (for example) in pointing out that there are different reserved chars in different parts of a URL.
Also, this page:
Gives the most details I have found so far.
What makes me FURIOUS is that some apaches work one way, others the other way.
UNLESS IT'S A PROBLEM OF NEWER/OLDER APACHE.....
i think the problem is whether or not the filename is normalized and if so in what form.
yes, the file is normalized.
from ssh, doing vi "ò.txt" works
I think the problem is that they can't or won't asciify the filenames.
Just how many filenames are involved? There's always Option B, which is to quietly rewrite all requests to the form they "should" have. Any incoming request that isn't valid utf-8 -- either leading \x as in \xc3\xb2, or one-byte characters in the 80-FF range -- gets retro-converted to utf-8. This can't happen in the config file, of course, but it should only be a few lines of php.
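The detection logic for that is tiny in any language. A sketch in Python of the "retro-convert anything that isn't valid UTF-8" idea (the poster proposed PHP; the function name here is mine):

```python
# Option B sketch: leave valid-UTF-8 requests alone, upgrade Latin-1 ones.
from urllib.parse import quote, unquote_to_bytes

def normalize_to_utf8(path: str) -> str:
    raw = unquote_to_bytes(path)      # percent-decode to raw bytes
    try:
        raw.decode("utf-8")           # already valid UTF-8?
        return path                   # yes: pass through untouched
    except UnicodeDecodeError:
        # single-byte Latin-1 request: re-encode and re-escape as UTF-8
        return quote(raw.decode("iso-8859-1").encode("utf-8"), safe="/")
```

So `normalize_to_utf8("/%F2.txt")` yields `/%C3%B2.txt`, while a request that is already `/%C3%B2.txt` passes through unchanged.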
And then there's Option C, which is to say ### it and rename everything until such a time as the server catches up. I really don't see the need for image files to have non-ascii names. Pages maybe, but that doesn't seem to be the situation.
Option B is do-able, but:
1) there are several sites, with their (sometimes complex) rewrites, and there would be a lot of tweaking to do to check if the new one interferes
2) rewriting and UTF-8 typically don't play well together (try googling; there are scores of pages of people having problems)
> I really don't see the need for image files to have non-ascii names.
Indeed, images are secondary. There are other cases where losing UTF-8 characters would be a bigger problem.
For example, there is a whole section of "downloadable documents", where many files have Russian names.
Asciifying everything would mean most files would stop having meaningful names for the user!
|1) there are several sites, with their (sometimes complex) rewrites, and there would be a lot of tweaking to do to check if the new one interferes |
Make a generic rewriting script that replaces all \x with %, interprets the result as UTF-8 and re-encodes as 8859-1.
:: pause to reread OP ::
|I have seen that requesting the same file manually entering an URL like this works: |
(i.e. apache receives a "GET /%f2.txt" request and serves the "ò.txt" file)
Are you here talking about literally typing in the 8859-1 form %F2? Not ˛? That's fine. The only peril is if requests arriving at the server have more than one possible encoding. You would then have to do some preliminary stuff to identify the encoding of characters in the Latin-1 range. That's still not un-doable; it's just a few more lines.
Do all these sites live on the same server? If so, an alternative worth looking into is a RewriteMap.
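mod_rewrite's "prg:" map type lets a RewriteMap call out to a long-running external program, which sidesteps the "can't do this in the config file" problem. A sketch of such a map program, converting UTF-8-escaped paths into the Latin-1 form this particular server expects (Python for illustration; the direction and the deployment details are assumptions):

```python
#!/usr/bin/env python3
# RewriteMap "prg:" helper: reads one path per line on stdin and writes
# the ISO-8859-1 re-encoded form on stdout, one line per request.
import sys
from urllib.parse import quote, unquote_to_bytes

def utf8_to_latin1_url(path: str) -> str:
    raw = unquote_to_bytes(path)              # raw request bytes
    try:
        latin1 = raw.decode("utf-8").encode("iso-8859-1")
        return quote(latin1, safe="/")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return path                           # not convertible: pass through

if __name__ == "__main__" and not sys.stdin.isatty():
    for line in sys.stdin:
        sys.stdout.write(utf8_to_latin1_url(line.rstrip("\n")) + "\n")
        sys.stdout.flush()                    # prg: maps must not buffer output
```

Wired up with something roughly like `RewriteMap fixenc "prg:/usr/local/bin/fixenc.py"` plus a `RewriteRule` that substitutes `${fixenc:$1}` (the directives are real mod_rewrite features; the script path is hypothetical). The unbuffered-output requirement is important: Apache reads the map's reply line by line.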