

Character encoding: UTF-8 to ISO-*

   
7:55 am on Feb 16, 2012 (gmt 0)

5+ Year Member



Hello everybody!

I have a text field that accepts a word and sends it as a GET parameter to a translation server (http://www.linguee.com). Everything works fine as long as I use ASCII characters, but if I add, for example, a "ü", things get messed up because of encoding.

I tried it on Linguee, and "ü" should be encoded as %FC. I tried all the JavaScript encoding functions, but I always get %C3%BC instead. What am I doing wrong? Could the problem be caused by the fact that my page uses UTF-8 while Linguee uses an ISO encoding? How can I solve this issue?

Thanks a lot!
9:30 am on Feb 16, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



I don't think you are doing anything wrong. %C3%BC is the correct UTF-8 encoding for ü, and you said that your file is UTF-8. It would be more worrying if you didn't get this result.
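As a rough illustration (not from the original posts), the sketch below shows why the JavaScript functions return %C3%BC: `encodeURIComponent` percent-encodes the UTF-8 bytes of each character, whatever charset the page declares. The helper `percentEncodeBytes` is a made-up name used only for comparison.

```javascript
// encodeURIComponent always works on the UTF-8 bytes of the string.
const word = "men\u00FC"; // "menü"

// ü is U+00FC; TextEncoder yields its UTF-8 bytes, 0xC3 0xBC.
const utf8Bytes = Array.from(new TextEncoder().encode("\u00FC"));

// Hypothetical helper: percent-encode every byte by hand.
function percentEncodeBytes(bytes) {
  return bytes
    .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

const encodedWord = encodeURIComponent(word);        // "men%C3%BC"
const encodedUmlaut = percentEncodeBytes(utf8Bytes); // "%C3%BC"
console.log(encodedWord, encodedUmlaut);
```

The built-in function and the hand-rolled byte encoding agree, which suggests the %C3%BC result is expected behaviour rather than a bug in the page.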

The fact that the translation service demands ISO-Latin-1 encoding is their problem, not yours. How do they handle input outside the Latin-1 range, like Cyrillic or Hebrew?

Edit:
Oh, wait. Are you doing the encoding at your end (charCodeAt or whatever it's called) or are you sending out the unencoded word? I kinda think you want a parseInt in there somewhere. At least that's what I've got in my own script-changing routines. But those are for non-Roman scripts. Are you working strictly in the Latin-1 range?

:: wandering off to see what's involved in disencoding simple Latin-1 ::
12:47 pm on Feb 16, 2012 (gmt 0)

5+ Year Member



Hey, thanks a lot for the reply!

I must say I am not very familiar with character encoding. All I know is that they have this form on their homepage and if I type "menü" they call this URL:
[linguee.com...]

If I write a form from my UTF-8 page I call
[linguee.com...]

If I escape the URL and alert it before opening the Linguee page, I correctly see:
[linguee.com...]
but when the link opens I get:
[linguee.com...]

So it seems there is some weird translation going on which takes %FC to %C3%BC. Is there a way to avoid that?
8:15 pm on Feb 16, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24



Uh-oh, that didn't quite work. Mouse-hovering on your examples shows four identical sets of "menü", so something is getting re-encoded in transit :)

Advance warning:

The remainder of this thread will come through as garbage if your browser is not set to UTF-8. If it is set to UTF-8, the first three posts will say "men�" in place of the intended "menü". Ajurnaqtuq, c'est la vie, et cetera.

Now then...

The first three links come through as %FC in the, er, menu bar, while the page title has "menü". The fourth link says "menü" in the menu bar while the page title says "menãœ". This is definitely not UTF-8 being reinterpreted as 8859-1. (I know this without looking it up because the œ character is not in Latin-1.)

Further investigation of page source, plus detour to IANA [iana.org], tells me that all four are encoded as ISO-8859-15 (described here [iana.org]) (!) alias Latin-9 (double !).

The key difference (I'm quoting) is:

BC CAPITAL LIGATURE OE U0152
BD SMALL LIGATURE OE U0153

Does that BC sound vaguely familiar? It should.

But wait! If you put the word "menü" into UTF-8 and reinterpret as Latin-9, you do not get "menãœ". You get "menÃŒ", with capital letters. In fact you can't get to lower-case "menãœ" from UTF-8 at all, so the page must be running a script to put everything into lower case, even if it is lower case garbage.
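The reinterpretation described above can be reproduced with a short sketch. It takes the UTF-8 bytes of "menü" and reads them back as ISO-8859-1 and as ISO-8859-15, which differ in only eight byte positions. The helper `reinterpret` and the table name are hypothetical, and the lower-casing step is only a guess at what the page does.

```javascript
// The eight byte values where ISO-8859-15 (Latin-9) differs from ISO-8859-1.
const LATIN9_OVERRIDES = new Map([
  [0xA4, 0x20AC], // euro sign
  [0xA6, 0x0160], // S with caron
  [0xA8, 0x0161], // s with caron
  [0xB4, 0x017D], // Z with caron
  [0xB8, 0x017E], // z with caron
  [0xBC, 0x0152], // capital ligature OE
  [0xBD, 0x0153], // small ligature oe
  [0xBE, 0x0178], // Y with diaeresis
]);

// Hypothetical helper: read raw bytes as Latin-1 or Latin-9 text.
function reinterpret(bytes, asLatin9) {
  return Array.from(bytes, (b) =>
    String.fromCharCode(
      asLatin9 && LATIN9_OVERRIDES.has(b) ? LATIN9_OVERRIDES.get(b) : b
    )
  ).join("");
}

const utf8 = new TextEncoder().encode("men\u00FC"); // bytes 6D 65 6E C3 BC
const asLatin1 = reinterpret(utf8, false); // "men" + U+00C3 + U+00BC
const asLatin9 = reinterpret(utf8, true);  // "men" + U+00C3 + U+0152
const lowered = asLatin9.toLowerCase();    // "men" + U+00E3 + U+0153
console.log(asLatin1, asLatin9, lowered);
```

Byte BC is the one that matters here: it is a fraction sign in Latin-1 but the capital OE ligature in Latin-9, and lower-casing the Latin-9 reading gives the string seen in the page title.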

In my previous post I thought I was just being snippy when I said it's the translation site's problem. But if they are going to go around encoding themselves in 8859-15, it really is their problem. (Here I detoured to another browser to make sure it wasn't just reading an existing 8859-15 cookie.) If the site used 8859-1 you could probably deal with it, but 8859-15 is simply not going to work.

So we'd better backtrack to the original problem and find a different solution. Does your own site have features that require you to send non-ASCII text over the Internet to a translation source? You'll need to find a site that either uses UTF-8 encoding, or accepts information about the source file's encoding so it can do the conversion at its end.
8:45 am on Feb 17, 2012 (gmt 0)

5+ Year Member



OMG :)

Thank you so much for your patience. And you know what's funny? If I hover the mouse over my links (FF, v9) I see four different links! God, it seems like a complicated problem... And I don't get the last question: my website is in UTF-8 encoding. Or do you mean the target website?

And are you saying that even if I change the encoding of the required URL on the fly it will still be reconverted on their website?
10:39 am on Feb 17, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24



Hm. Can you change a text string's encoding in JavaScript? You'd have to tell it, in effect, "I'm giving you this UTF-8-encoded text. Now you have to re-encode it as 8859-15, convert that into percent-encoding, ship it off to this other site-- and then reverse the process with whatever they send back." Or does the percent-encoding happen en route after you have shipped off your word?

It's all done with two mouse clicks in my text editor-- one to pick the encoding, one to say whether I want to convert or reinterpret-- but I haven't a clue how to do it in javascript. It's not a straight conversion, like switching from decimal to hexadecimal. The letters are simply in different places.
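For what it's worth, a JavaScript string has no byte encoding of its own, so "re-encoding" comes down to looking up each character's byte value in the target charset and percent-encoding that byte. Below is a minimal sketch for ISO-8859-15, assuming the target site really expects that charset. The function names `toLatin9Byte` and `encodeLatin9Component` are invented for illustration, and rejecting characters outside Latin-9 is just one possible policy.

```javascript
// Unicode code points that ISO-8859-15 stores at bytes where ISO-8859-1
// has a different character.
const UNICODE_TO_LATIN9 = new Map([
  [0x20AC, 0xA4], // euro sign
  [0x0160, 0xA6], // S with caron
  [0x0161, 0xA8], // s with caron
  [0x017D, 0xB4], // Z with caron
  [0x017E, 0xB8], // z with caron
  [0x0152, 0xBC], // capital ligature OE
  [0x0153, 0xBD], // small ligature oe
  [0x0178, 0xBE], // Y with diaeresis
]);
// Latin-1 characters that those eight bytes displace (not in Latin-9).
const DISPLACED = new Set(UNICODE_TO_LATIN9.values());

// Return the Latin-9 byte for one character, or null if it has none.
function toLatin9Byte(ch) {
  const cp = ch.codePointAt(0);
  if (UNICODE_TO_LATIN9.has(cp)) return UNICODE_TO_LATIN9.get(cp);
  if (cp <= 0xFF && !DISPLACED.has(cp)) return cp;
  return null;
}

// Percent-encode a string as ISO-8859-15 for use in a URL component.
function encodeLatin9Component(text) {
  let out = "";
  for (const ch of text) {
    if (/^[A-Za-z0-9_.~-]$/.test(ch)) {
      out += ch; // unreserved ASCII passes through unchanged
      continue;
    }
    const byte = toLatin9Byte(ch);
    if (byte === null) {
      const hex = ch.codePointAt(0).toString(16).toUpperCase();
      throw new RangeError("U+" + hex + " is not in ISO-8859-15");
    }
    out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

console.log(encodeLatin9Component("men\u00FC")); // "men%FC"
```

The result would then be appended to the search URL by hand instead of going through `encodeURIComponent`. Whether the target site accepts it has not been tested here.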

Now that you mention it: when I was hovering on the links before, I was in ISO-Latin-1. Now that I'm in UTF-8 I see... well, at least three different forms. Nos. 1 and 3 end in %FC, No. 4 is the word with its proper u-umlaut, and No. 2 is the unicode "I can't deal with this" character-- the same one that now displays in the text fields of the first three posts in this thread :)

I think we are dealing with three different encodings. I think it would be better if I cleaned the rat cage-- which I should have done several hours ago-- and maybe in the meantime someone who speaks fluent Javascript will stop by and shed light.
3:37 pm on Feb 17, 2012 (gmt 0)

WebmasterWorld Senior Member fotiman is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



Since you already have a text box for the value, why don't you just use a regular form submit instead of trying to submit it via JavaScript? For example, this would work:


<div>Copy and paste this into text box: menü</div>
<form action="http://www.linguee.com/english-german/search">
<div>
<input type="hidden" name="lang" value="english-german">
<input type="text" name="query">
<input type="submit" value="Submit">
</div>
</form>
10:12 pm on Feb 17, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24



If you do it that way, what happens to the encoding? The root problem is that OP's site uses UTF-8-- which is perfectly appropriate-- while the destination site uses, of all things, 8859-15, aka Latin-9.

w3c says
All characters are encoded before sent (spaces are converted to "+" symbols, and special characters are converted to ASCII HEX values)

but this doesn't really give the necessary information.

More ominous is the list here:

[w3schools.com...]

Note that it includes the full series of %8\h and %9\h -- and those shouldn't even exist. They're Windows-Latin-1 encodings that aren't recognized by unicode, though they may de facto work on sites headed 8859-1 (not -15).
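Coming back to the question of what a plain form submit does with the encoding: per the HTML specification, a browser encodes submitted fields in the page's own encoding unless the form has an `accept-charset` attribute, in which case it uses the encoding named there. Here is a hedged sketch that generates such a form from JavaScript. `buildSearchForm` and `escapeAttr` are hypothetical helpers, and the action URL and field names are copied from the form posted earlier in the thread. Whether the target site handles the result correctly is untested.

```javascript
// Escape a value for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: build form markup whose accept-charset asks the
// browser to encode the submitted fields in the given charset.
function buildSearchForm(action, charset) {
  return [
    `<form action="${escapeAttr(action)}" accept-charset="${escapeAttr(charset)}">`,
    `<input type="hidden" name="lang" value="english-german">`,
    `<input type="text" name="query">`,
    `<input type="submit" value="Submit">`,
    `</form>`,
  ].join("\n");
}

const formHtml = buildSearchForm(
  "http://www.linguee.com/english-german/search",
  "ISO-8859-15"
);
console.log(formHtml);
```

The returned markup could be written into the page so the browser does the conversion itself. The action URL can be varied per language pair if the language has to be part of the path.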
7:49 am on Feb 20, 2012 (gmt 0)

5+ Year Member



Hello people!

well, apparently there is this hack:
[stackoverflow.com...]
But it seems to work only from ISO to UTF-8, not in the other direction.

Fotiman, I use javascript because I want to translate to many languages, and since this doesn't seem to be a GET parameter I have to build the final URL with JS.
 
