|Character encoding: UTF-8 to ISO-*|
I have a text field that accepts a word and sends it as a GET parameter to a translation server (http://www.linguee.com). Everything works fine as long as I use ASCII characters, but if I add, for example, a "ü", things get messed up because of the encoding.
Thanks a lot!
I don't think you are doing anything wrong. %C3%BC is the correct UTF-8 encoding for ü, and you said that your file is UTF-8. It would be more worrying if you didn't get this result.
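For anyone who wants to check the numbers, here is a minimal sketch of where %C3%BC and %FC come from. It only uses standard built-ins (`encodeURIComponent`, `TextEncoder`, `charCodeAt`); the variable names are just for illustration.

```javascript
// ü is U+00FC. In UTF-8 it takes two bytes (C3 BC); in ISO-8859-1 it is the single byte FC.
const word = "men\u00FC"; // "menü", written as an escape so this file's own encoding doesn't matter

// encodeURIComponent always percent-encodes the UTF-8 bytes of the string.
const utf8Encoded = encodeURIComponent(word);

// The raw UTF-8 bytes of ü, shown as hex for comparison.
const utf8Bytes = Array.from(new TextEncoder().encode("\u00FC"),
  (b) => b.toString(16).toUpperCase());

// The code point itself, which is what a Latin-1 encoder would send as one byte.
const codePoint = "\u00FC".charCodeAt(0).toString(16).toUpperCase();

console.log(utf8Encoded);         // men%C3%BC
console.log(utf8Bytes.join(" ")); // C3 BC
console.log(codePoint);           // FC
```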
The fact that the translation service demands ISO-Latin-1 encoding is their problem, not yours. How do they handle input outside the Latin-1 range, like Cyrillic or Hebrew?
Oh, wait. Are you doing the encoding at your end (charCodeAt or whatever it's called) or are you sending out the unencoded word? I kinda think you want a parseInt in there somewhere. At least that's what I've got in my own script-changing routines. But those are for non-Roman scripts. Are you working strictly in the Latin-1 range?
:: wandering off to see what's involved in disencoding simple Latin-1 ::
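A rough sketch of what "disencoding" simple Latin-1 could look like, with parseInt doing the hex-to-number step. The function name `decodeLatin1Percent` is made up for this example, and it assumes every %XX escape stands for exactly one ISO-8859-1 byte.

```javascript
// Decode %XX escapes as single ISO-8859-1 bytes (hypothetical helper, not a built-in).
function decodeLatin1Percent(text) {
  return text
    .replace(/\+/g, " ") // form encoding uses "+" for spaces
    .replace(/%([0-9A-Fa-f]{2})/g, (match, hex) =>
      // parseInt turns the two hex digits into a number. In Latin-1 the byte
      // value equals the Unicode code point, so fromCharCode is all we need.
      String.fromCharCode(parseInt(hex, 16)));
}

console.log(decodeLatin1Percent("men%FC"));    // menü
console.log(decodeLatin1Percent("men%C3%BC")); // menÃ¼ (UTF-8 bytes misread as Latin-1)
```

The second call shows the classic symptom: feed UTF-8 escapes to a Latin-1 decoder and each byte of the ü turns into its own character.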
Hey, thanks a lot for the reply!
I must say I am not very familiar with character encoding. All I know is that they have this form on their homepage and if I type "menü" they call this URL:
If I write a form from my UTF-8 page I call
If I escape the URL and alert it before opening the linguee page, I see the correct result:
but when the link opens I get:
So it seems there is some weird translation going on which takes %FC to %C3%BC. Is there a way to avoid that?
Uh-oh, that didn't quite work. Mouse-hovering on your examples shows four identical sets of "menĂ¼", so something is getting re-encoded in transit :)
The remainder of this thread will come through as garbage if your browser is not set to UTF-8. If it is set to UTF-8, the first three posts will say "menï¿½" in place of the intended "menĂ¼". Ajurnaqtuq, c'est la vie, et cetera.
The first three links come through as %FC in the, er, menu bar, while the page title has "menĂ¼". The fourth link says "menĂ¼" in the menu bar while the page title says "menĂ£Å“". This is definitely not UTF-8 being reinterpreted as 8859-1. (I know this without looking it up because the œ character is not in Latin-1.)
Further investigation of page source, plus detour to IANA [iana.org], tells me that all four are encoded as ISO-8859-15 (described here [iana.org]) (!) alias Latin-9 (double !).
The key difference (I'm quoting) is:
BC CAPITAL LIGATURE OE U0152
BD SMALL LIGATURE OE U0153
Does that BC sound vaguely familiar? It should.
But wait! If you put the word "menü" into UTF-8 and reinterpret as Latin-9, you do not get "menãœ". You get "menÃŒ", with capital letters. In fact you can't get to lower-case "menãœ" from UTF-8 at all, so the page must be running a script to put everything into lower case, even if it is lower case garbage.
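The byte juggling here can be checked with a small sketch. The override table holds the eight positions where ISO-8859-15 differs from ISO-8859-1; the helper names are invented for the example, and the lower-casing step is only my guess at what the site does.

```javascript
// The eight byte positions where ISO-8859-15 (Latin-9) differs from ISO-8859-1.
const LATIN9_OVERRIDES = {
  0xA4: 0x20AC, // euro sign
  0xA6: 0x0160, // capital S with caron
  0xA8: 0x0161, // small s with caron
  0xB4: 0x017D, // capital Z with caron
  0xB8: 0x017E, // small z with caron
  0xBC: 0x0152, // capital ligature OE
  0xBD: 0x0153, // small ligature oe
  0xBE: 0x0178, // capital Y with diaeresis
};

// In Latin-1 every byte value is its own Unicode code point.
function decodeLatin1(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Latin-9 is Latin-1 plus the eight overrides above.
function decodeLatin9(bytes) {
  return Array.from(bytes, (b) =>
    String.fromCharCode(LATIN9_OVERRIDES[b] || b)).join("");
}

const utf8 = new TextEncoder().encode("men\u00FC"); // bytes 6D 65 6E C3 BC
const asLatin1 = decodeLatin1(utf8);
const asLatin9 = decodeLatin9(utf8);

console.log(asLatin1);               // menÃ¼
console.log(asLatin9);               // menÃŒ
console.log(asLatin9.toLowerCase()); // menãœ
```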
In my previous post I thought I was just being snippy when I said it's the translation site's problem. But if they are going to go around encoding themselves in 8859-15, it really is their problem. (Here I detoured to another browser to make sure it wasn't just reading an existing 8859-15 cookie.) If the site used 8859-1 you could probably deal with it, but 8859-15 is simply not going to work.
So we'd better backtrack to the original problem and find a different solution. Does your own site have features that require you to send non-ASCII text over the Internet to a translation source? You'll need to find a site that either uses UTF-8 encoding, or accepts information about the source file's encoding so it can do the conversion at its end.
Thank you so much for the patience. And you know what's funny? If I mousehover on my links (FF, v9) I see four different links! God, it seems a complicated problem... And I don't get the last question: my website is in UTF-8 encoding. Or you mean the target website?
And are you saying that even if I change the encoding of the required URL on the fly it will still be reconverted on their website?
Now that you mention it: when I was hovering on the links before, I was in ISO-Latin-1. Now that I'm in UTF-8 I see... well, at least three different forms. Nos. 1 and 3 end in %FC, No. 4 is the word with its proper u-umlaut, and No. 2 is the Unicode "I can't deal with this" character-- the same one that now displays in the text fields of the first three posts in this thread :)
-- whoops! --
<div>Copy and paste this into text box: menü</div>
<input type="hidden" name="lang" value="english-german">
<input type="text" name="query">
<input type="submit" value="Submit">
If you do it that way, what happens to the encoding? The root problem is that OP's site uses UTF-8-- which is perfectly appropriate-- while the destination site uses, of all things, 8859-15, aka Latin-9.
|All characters are encoded before sent (spaces are converted to "+" symbols, and special characters are converted to ASCII HEX values) |
but this doesn't really give the necessary information.
More ominous is the list here:
Note that it includes the full series of %8\h and %9\h -- and those shouldn't even exist. They're Windows-Latin-1 encodings that aren't recognized by unicode, though they may de facto work on sites headed 8859-1 (not -15).
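To make the %8h / %9h point concrete: in ISO-8859-1 proper, and in Unicode, 0x80 through 0x9F are C1 control codes, while Windows-1252 puts printable characters there. Here is a sketch that spots such escapes in a query string. The sample table is deliberately partial, and all names in it are made up for the example.

```javascript
// Matches percent escapes in the 0x80-0x9F range, i.e. %8h and %9h.
const C1_ESCAPE = /%[89][0-9A-Fa-f]/g;

// A few well-known Windows-1252 assignments in that range (not the full table).
const CP1252_SAMPLES = {
  "80": "\u20AC", // euro sign
  "85": "\u2026", // horizontal ellipsis
  "91": "\u2018", // left single quotation mark
  "92": "\u2019", // right single quotation mark
  "93": "\u201C", // left double quotation mark
  "94": "\u201D", // right double quotation mark
};

// Return every escape in the query that falls in the 0x80-0x9F range.
function findC1Escapes(query) {
  return query.match(C1_ESCAPE) || [];
}

// Look up what Windows-1252 would show for one escape; null if not in the partial table.
function cp1252Meaning(pct) {
  const hex = pct.slice(1).toUpperCase();
  return CP1252_SAMPLES[hex] || null;
}

const found = findC1Escapes("caf%E9+%93quoted%94+%80+5");
console.log(found);                             // [ '%93', '%94', '%80' ]
console.log(found.map(cp1252Meaning).join("")); // the two curly quotes, then the euro sign
```

Note that %E9 is not flagged: it sits in the ordinary Latin-1 printable range.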
well, apparently there is this hack:
But it seems to work only from ISO to UTF-8 and not in the other direction.
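For the other direction, here is a hand-rolled sketch that percent-encodes a string as ISO-8859-15 bytes instead of UTF-8. It is not the hack mentioned above; `encodeLatin9Query` is a name invented here. It also assumes the target really wants 8859-15 bytes and that the browser passes the finished URL along untouched, which the earlier posts suggest may not happen.

```javascript
// Characters that ISO-8859-15 adds, mapped to their byte values.
const UNICODE_TO_LATIN9 = {
  0x20AC: 0xA4, 0x0160: 0xA6, 0x0161: 0xA8, 0x017D: 0xB4,
  0x017E: 0xB8, 0x0152: 0xBC, 0x0153: 0xBD, 0x0178: 0xBE,
};
// The Latin-1 characters those eight bytes displaced; they have no Latin-9 byte.
const DISPLACED = [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE];

// Hypothetical helper: percent-encode text as ISO-8859-15 for a GET query value.
function encodeLatin9Query(text) {
  let out = "";
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    let byte;
    if (cp in UNICODE_TO_LATIN9) {
      byte = UNICODE_TO_LATIN9[cp];
    } else if (cp <= 0xFF && !DISPLACED.includes(cp)) {
      byte = cp; // same value as in Latin-1
    } else {
      // Cyrillic, Hebrew and so on simply cannot be represented.
      throw new RangeError("U+" + cp.toString(16).toUpperCase() + " has no Latin-9 byte");
    }
    if (byte === 0x20) {
      out += "+"; // form-style space
    } else if (/[A-Za-z0-9._~-]/.test(ch)) {
      out += ch; // unreserved ASCII goes through as-is
    } else {
      out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(encodeLatin9Query("men\u00FC"));   // men%FC
console.log(encodeLatin9Query("\u0153uvre"));  // %BDuvre
```

Anything outside the Latin-9 repertoire throws rather than being silently mangled, so the caller can decide what to do with Cyrillic or Hebrew input.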