
JavaScript and AJAX Forum

    
Character encoding: UTF-8 to ISO-*
fm86




msg:4418274
 7:55 am on Feb 16, 2012 (gmt 0)

Hello everybody!

I have a text field that accepts a word and sends it as a GET parameter to a translation server (http://www.linguee.com). Everything works fine as long as I use ASCII characters, but if I add, for example, a "ü", things get messed up because of encoding.

I tried it on Linguee, and "ü" should be encoded as %FC. I tried all the JavaScript encoding functions, but I always get %C3%BC instead. What am I doing wrong? Could the problem be caused by the fact that my page uses UTF-8 while Linguee uses an ISO encoding? How can I solve this issue?

Thanks a lot!

 

lucy24




msg:4418288
 9:30 am on Feb 16, 2012 (gmt 0)

I don't think you are doing anything wrong. %C3%BC is the correct UTF-8 encoding for ü, and you said that your file is UTF-8. It would be more worrying if you didn't get this result.
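For reference, here is a minimal sketch of what the built-in functions do with "ü" (U+00FC). The variable names are only illustrative. `encodeURIComponent` always percent-encodes the UTF-8 bytes, while the legacy (deprecated) `escape` emits a single `%XX` for code units below 256, which is why it happens to produce the Latin-1 byte here.

```javascript
// Compare the built-in encoders on a word containing "ü" (U+00FC).
const word = "men\u00FC"; // "menü"

// encodeURIComponent percent-encodes the UTF-8 bytes: ü becomes C3 BC.
const utf8Encoded = encodeURIComponent(word);

// The deprecated escape() writes code units below 256 as one %XX,
// which coincides with the ISO-8859-1 byte for ü.
const legacyEncoded = escape(word);

console.log(utf8Encoded);   // men%C3%BC
console.log(legacyEncoded); // men%FC
```

Note that `escape` switches to a `%uXXXX` form for anything above U+00FF, so it is not a general-purpose converter.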

The fact that the translation service demands ISO-Latin-1 encoding is their problem, not yours. How do they handle input outside the Latin-1 range, like Cyrillic or Hebrew?

Edit:
Oh, wait. Are you doing the encoding at your end (charCodeAt or whatever it's called) or are you sending out the unencoded word? I kinda think you want a parseInt in there somewhere. At least that's what I've got in my own script-changing routines. But those are for non-Roman scripts. Are you working strictly in the Latin-1 range?

:: wandering off to see what's involved in disencoding simple Latin-1 ::

fm86




msg:4418319
 12:47 pm on Feb 16, 2012 (gmt 0)

Hey, thanks a lot for the reply!

I must say I am not very familiar with character encoding. All I know is that they have this form on their homepage and if I type "menü" they call this URL:
[linguee.com...]

If I write a form from my UTF-8 page I call
[linguee.com...]

If I escape the URL and alert it before opening the Linguee page, I see the correct value:
[linguee.com...]
but when the link opens I get:
[linguee.com...]

So it seems there is some weird translation going on which takes %FC to %C3%BC. Is there a way to avoid that?

lucy24




msg:4418471
 8:15 pm on Feb 16, 2012 (gmt 0)

Uh-oh, that didn't quite work. Mouse-hovering on your examples shows four identical sets of "menü", so something is getting re-encoded in transit :)

Advance warning:

The remainder of this thread will come through as garbage if your browser is not set to UTF-8. If it is set to UTF-8, the first three posts will say "men�" in place of the intended "menü". Ajurnaqtuq, c'est la vie, et cetera.

Now then...

The first three links come through as %FC in the, er, menu bar, while the page title has "menü". The fourth link says "menü" in the menu bar while the page title says "menãœ". This is definitely not UTF-8 being reinterpreted as 8859-1. (I know this without looking it up because the œ character is not in Latin-1.)

Further investigation of page source, plus detour to IANA [iana.org], tells me that all four are encoded as ISO-8859-15 (described here [iana.org]) (!) alias Latin-9 (double !).

The key difference (I'm quoting) is:

BC CAPITAL LIGATURE OE U0152
BD SMALL LIGATURE OE U0153

Does that BC sound vaguely familiar? It should.

But wait! If you put the word "menü" into UTF-8 and reinterpret as Latin-9, you do not get "menãœ". You get "menÃŒ", with capital letters. In fact you can't get to lower-case "menãœ" from UTF-8 at all, so the page must be running a script to put everything into lower case, even if it is lower case garbage.
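The reinterpretation described above can be reproduced in a few lines. This is only a sketch: `decodeLatin9` and `LATIN9_OVERRIDES` are hypothetical names, and the table lists the eight byte positions where ISO-8859-15 differs from ISO-8859-1. Every other byte is treated as its Latin-1 character.

```javascript
// The eight byte positions where ISO-8859-15 differs from ISO-8859-1.
const LATIN9_OVERRIDES = {
  0xA4: "\u20AC", // euro sign
  0xA6: "\u0160", // S with caron
  0xA8: "\u0161", // s with caron
  0xB4: "\u017D", // Z with caron
  0xB8: "\u017E", // z with caron
  0xBC: "\u0152", // capital ligature OE
  0xBD: "\u0153", // small ligature oe
  0xBE: "\u0178", // Y with diaeresis
};

// Hypothetical helper: read raw bytes as if they were ISO-8859-15 text.
function decodeLatin9(bytes) {
  return Array.from(bytes, (b) =>
    b in LATIN9_OVERRIDES ? LATIN9_OVERRIDES[b] : String.fromCharCode(b)
  ).join("");
}

// "menü" as UTF-8 is the byte sequence 6D 65 6E C3 BC.
const utf8Bytes = new TextEncoder().encode("men\u00FC");
const misread = decodeLatin9(utf8Bytes);

// C3 reads as U+00C3 (A with tilde), BC reads as U+0152 (capital OE).
console.log(misread === "men\u00C3\u0152");               // true
// Lower-casing the garbage gives the form seen in the page title.
console.log(misread.toLowerCase() === "men\u00E3\u0153"); // true
```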

In my previous post I thought I was just being snippy when I said it's the translation site's problem. But if they are going to go around encoding themselves in 8859-15, it really is their problem. (Here I detoured to another browser to make sure it wasn't just reading an existing 8859-15 cookie.) If the site used 8859-1 you could probably deal with it, but 8859-15 is simply not going to work.

So we'd better backtrack to the original problem and find a different solution. Does your own site have features that require you to send non-ASCII text over the Internet to a translation source? You'll need to find a site that either uses UTF-8 encoding, or accepts information about the source file's encoding so it can do the conversion at its end.

fm86




msg:4418678
 8:45 am on Feb 17, 2012 (gmt 0)

OMG :)

Thank you so much for the patience. And you know what's funny? If I mousehover on my links (FF, v9) I see four different links! God, it seems a complicated problem... And I don't get the last question: my website is in UTF-8 encoding. Or you mean the target website?

And are you saying that even if I change the encoding of the required URL on the fly it will still be reconverted on their website?

lucy24




msg:4418707
 10:39 am on Feb 17, 2012 (gmt 0)

Hm. Can you change a text string's encoding in JavaScript? You'd have to tell it, in effect, "I'm giving you this UTF-8-encoded text. Now you have to re-encode it as 8859-15, convert that into percent-encoding, ship it off to this other site-- and then reverse the process with whatever they send back." Or does the percent-encoding happen en route after you have shipped off your word?

It's all done with two mouse clicks in my text editor-- one to pick the encoding, one to say whether I want to convert or reinterpret-- but I haven't a clue how to do it in javascript. It's not a straight conversion, like switching from decimal to hexadecimal. The letters are simply in different places.
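One way to do that conversion by hand in JavaScript, as a rough sketch rather than a tested solution for this site: since ISO-8859-15 matches Unicode code points below U+0100 except at eight positions, a small lookup table is enough. The names `encodeLatin9Component`, `latin9Byte`, `LATIN9_BYTES` and `DROPPED` are made up for this example, and it assumes you build the URL yourself so that the browser has nothing left to re-encode.

```javascript
// Characters whose ISO-8859-15 byte differs from their Unicode code point.
const LATIN9_BYTES = new Map([
  ["\u20AC", 0xA4], ["\u0160", 0xA6], ["\u0161", 0xA8], ["\u017D", 0xB4],
  ["\u017E", 0xB8], ["\u0152", 0xBC], ["\u0153", 0xBD], ["\u0178", 0xBE],
]);

// Latin-1 code points that ISO-8859-15 replaced and cannot represent.
const DROPPED = new Set([0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]);

// Return the ISO-8859-15 byte for one character, or throw if there is none.
function latin9Byte(ch) {
  if (LATIN9_BYTES.has(ch)) return LATIN9_BYTES.get(ch);
  const code = ch.codePointAt(0);
  if (code > 0xFF || DROPPED.has(code)) {
    throw new RangeError("Not representable in ISO-8859-15: " + ch);
  }
  return code;
}

// Percent-encode text as ISO-8859-15 bytes, leaving unreserved ASCII as-is.
function encodeLatin9Component(text) {
  let out = "";
  for (const ch of text) {
    const byte = latin9Byte(ch);
    const isUnreserved = /[A-Za-z0-9\-_.~]/.test(ch);
    out += isUnreserved
      ? ch
      : "%" + byte.toString(16).toUpperCase().padStart(2, "0");
  }
  return out;
}

console.log(encodeLatin9Component("men\u00FC"));   // men%FC
console.log(encodeLatin9Component("\u0153uvre"));  // %BDuvre
```

Input outside the Latin-9 range (Cyrillic, Hebrew and so on) throws instead of being silently mangled, so the caller can decide what to do with it.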

Now that you mention it: when I was hovering on the links before, I was in ISO-Latin-1. Now that I'm in UTF-8 I see... well, at least three different forms. Nos. 1 and 3 end in %FC, No. 4 is the word with its proper u-umlaut, and No. 2 is the unicode "I can't deal with this" character-- the same one that now displays in the text fields of the first three posts in this thread :)

I think we are dealing with three different encodings. I think it would be better if I cleaned the rat cage-- which I should have done several hours ago-- and maybe in the meantime someone who speaks fluent Javascript will stop by and shed light.

lucy24




msg:4418708
 10:39 am on Feb 17, 2012 (gmt 0)

-- whoops! --

Fotiman




msg:4418819
 3:37 pm on Feb 17, 2012 (gmt 0)

Since you already have a text box for the value, why don't you just use a regular form submit instead of trying to submit it via JavaScript? For example, this would work:


<div>Copy and paste this into text box: menü</div>
<form action="http://www.linguee.com/english-german/search">
<div>
<input type="hidden" name="lang" value="english-german">
<input type="text" name="query">
<input type="submit" value="Submit">
</div>
</form>

lucy24




msg:4418988
 10:12 pm on Feb 17, 2012 (gmt 0)

If you do it that way, what happens to the encoding? The root problem is that OP's site uses UTF-8-- which is perfectly appropriate-- while the destination site uses, of all things, 8859-15, aka Latin-9.

w3c says
All characters are encoded before sent (spaces are converted to "+" symbols, and special characters are converted to ASCII HEX values)

but this doesn't really give the necessary information.

More ominous is the list here:

[w3schools.com...]

Note that it includes the full series of %8\h and %9\h -- and those shouldn't even exist. They're Windows-Latin-1 encodings that aren't recognized by unicode, though they may de facto work on sites headed 8859-1 (not -15).
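On the question of what a plain form submit does to the encoding: by default the browser encodes form fields in the page's own encoding (UTF-8 here), but a form's `accept-charset` attribute can request a different one. The sketch below builds such a form from script. `buildSearchForm` and its parameters are hypothetical, the `/<language pair>/search` URL pattern is an assumption generalized from the form action quoted earlier in the thread, and none of this has been tested against the live site.

```javascript
// Build a GET form that asks the browser to encode its fields as ISO-8859-15.
// doc is the page's document object; langPair is e.g. "english-german".
// The URL pattern is an assumption based on the form posted in this thread.
function buildSearchForm(doc, langPair, word) {
  const form = doc.createElement("form");
  form.method = "get";
  form.action = "http://www.linguee.com/" + langPair + "/search";
  form.acceptCharset = "ISO-8859-15"; // reflects the accept-charset attribute

  const query = doc.createElement("input");
  query.type = "hidden";
  query.name = "query";
  query.value = word;
  form.appendChild(query);
  return form;
}

// In a browser:
// const form = buildSearchForm(document, "english-german", "menü");
// document.body.appendChild(form);
// form.submit();
```

Because the form is created in script, the language pair can vary per request even though it is part of the path rather than a GET parameter.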

fm86




msg:4419478
 7:49 am on Feb 20, 2012 (gmt 0)

Hello people!

well, apparently there is this hack:
[stackoverflow.com...]
But it seems to work only from ISO to UTF-8, not in the other direction.

Fotiman, I use javascript because I want to translate to many languages, and since this doesn't seem to be a GET parameter I have to build the final URL with JS.
