ASCII above 128 auto-converted to Unicode in browsers?

In both Mozilla and IE, ASCII char 150 is turned into Unicode

         

phaze

10:13 am on Jan 24, 2004 (gmt 0)

10+ Year Member



If I write a script that outputs ASCII code 150 to a browser, Mozilla (and IE) see it as Unicode 2013 in JavaScript. Does anyone know why that is? Are all chars above 128 converted to Unicode? If so, what's the conversion logic?

It's important to me because I'm writing a regex editing tool in JavaScript that creates long regular expressions that are ultimately used in a Perl script. So when I see a Unicode char in JavaScript that is actually ASCII 150, things start getting funky. Took me a while to chase this one down.

thanks for any input.

Mark.

asquithea

12:19 pm on Jan 24, 2004 (gmt 0)

10+ Year Member



The standard ASCII character set only goes from 0 to 127, so (in my humble opinion) your question is a wee bit meaningless in a non-platform-specific context. I presume that the browser is compensating by performing a conversion based on your platform's ASCII extensions.

nafmo

12:20 pm on Jan 24, 2004 (gmt 0)

10+ Year Member



There is no such thing as ASCII above 127: ASCII is a 7-bit code and only uses the values 0-127. To use a code above 127, you need to declare the proper encoding for your document, and when your JavaScript reads those characters, it will see their Unicode equivalents.
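The effect described above can be reproduced directly. A minimal sketch, using the standard `TextDecoder` API (available in modern browsers and Node, though not in the browsers of 2004): a byte 0x96 in a document treated as windows-1252 reaches script code as the Unicode en dash, not as code 150.

```javascript
// A byte 0x96 decoded as windows-1252 becomes U+2013 (en dash),
// which is what JavaScript's charCodeAt then reports.
const dec = new TextDecoder('windows-1252');
const ch = dec.decode(Uint8Array.from([0x96]));

console.log(ch);                            // "–" (en dash)
console.log(ch.charCodeAt(0).toString(16)); // "2013"
```

The same decoding happens implicitly when the browser parses the page, which is why the script never sees the raw byte value.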

Characters 128-159 of Unicode are specifically disallowed from use in HTML by the HTML standard, which means that NCRs (numeric character references) with these numbers are illegal. Most browsers will probably interpret them as if they were written using windows-1252 encoding, but that's only because there are tons of old pages that exploited bugs in pre-Unicode-aware browsers, and it is not something you should rely on, even a little bit, in a new document.

bird

1:26 pm on Jan 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Characters 128-159 of Unicode are specifically disallowed from use in HTML by the HTML standard

The reason is that no characters are defined for this range, a consequence of compatibility with Latin-1 (and the same holds for all ISO-8859-X character sets).
Position 150 in Windows-1252 appears to translate to Unicode 8211, so that can't be it either.

What exactly is the purpose of your script outputting a byte value of 150? What character set is the page encoded in where you run this script? It's not unlikely that you fell victim to a conceptual error and might need to be doing something entirely different...

phaze

6:47 pm on Jan 24, 2004 (gmt 0)

10+ Year Member



Nafmo, you ROCK!

Thank you very very very much. I'm not sure if I could have solved this myself at all.

It's Windows-1252, and here's the spec giving the Unicode equivalents. As you can see, 0x96 (decimal 150) is Unicode 0x2013.

[microsoft.com...]

I'm writing regex parsers for other people's pages, so I don't control the charset at all. Many of these pages do not specify charsets, so the browsers are assuming the extended ASCII is Windows-1252 and converting it to Unicode. This is the case for both Mozilla on Linux and IE on Windows 98. (Weird that Moz defaults to an MS codepage, but anyway.)

So to solve this, I'm simply going to put a lookup table in my JavaScript: when my JS sees the already-converted Unicode char, it converts it back to the ASCII value and adds that to the regex, which will be used in a Perl script on the raw data to do the actual parsing.
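A minimal sketch of that lookup-table idea (the table and function names are illustrative, not from the thread): browsers map the windows-1252 bytes 0x80-0x9F to specific Unicode code points, so the table simply inverts that mapping.

```javascript
// Inverse of the windows-1252 -> Unicode mapping for bytes 0x80-0x9F.
// (Bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are undefined in windows-1252.)
const CP1252_FROM_UNICODE = {
  0x20AC: 0x80, 0x201A: 0x82, 0x0192: 0x83, 0x201E: 0x84,
  0x2026: 0x85, 0x2020: 0x86, 0x2021: 0x87, 0x02C6: 0x88,
  0x2030: 0x89, 0x0160: 0x8A, 0x2039: 0x8B, 0x0152: 0x8C,
  0x017D: 0x8E, 0x2018: 0x91, 0x2019: 0x92, 0x201C: 0x93,
  0x201D: 0x94, 0x2022: 0x95, 0x2013: 0x96, 0x2014: 0x97,
  0x02DC: 0x98, 0x2122: 0x99, 0x0161: 0x9A, 0x203A: 0x9B,
  0x0153: 0x9C, 0x017E: 0x9E, 0x0178: 0x9F,
};

// Map one character back to its windows-1252 byte value;
// characters outside the table pass through unchanged.
function toCp1252Byte(ch) {
  const cp = ch.charCodeAt(0);
  return CP1252_FROM_UNICODE[cp] ?? cp;
}

console.log(toCp1252Byte('\u2013')); // 150 (the original byte)
console.log(toCp1252Byte('a'));      // 97 (plain ASCII passes through)
```

Characters from 0 to 127 are identical in ASCII, windows-1252, and Unicode, so only the 0x80-0x9F range needs translating.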

Thank you very very much.

bird

8:19 pm on Jan 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's Windows-1252, and here's the spec giving the Unicode equivalents. As you can see, 0x96 (decimal 150) is Unicode 0x2013.

Ah, but that's not how you presented it at first. The hexadecimal value 0x2013 is of course equivalent to decimal 8211. So you see that the unassuming characters "0x" make a very big difference in understanding your problem!
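The notation point is easy to check; a quick sketch (variable names are just illustrative):

```javascript
// Hex 0x2013 and decimal 8211 name the same code point; only the radix differs.
const hexForm = 0x2013;                 // as written in the Microsoft table
const decForm = 8211;                   // as JavaScript's charCodeAt reports it

console.log(hexForm === decForm);       // true
console.log('\u2013'.charCodeAt(0));    // 8211
console.log((8211).toString(16));       // "2013"
```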

So the browsers are assuming the extended ASCII is Windows-1252 and converting it to Unicode. This is the case for both Mozilla on Linux and IE on Windows 98. (Weird that Moz defaults to an MS codepage, but anyway.)

Sounds like the JavaScript engines do that conversion based on the charset of the underlying page. But anything is good enough as long as it produces the result you need, eh?

when my JS sees the already-converted Unicode char, it converts it back to the ASCII value,

Please do yourself a favour and stop calling byte values beyond 128 "ASCII". They're not. I'm not quite sure what charset they belong to in your case (most likely the same as the processed page), but you're only confusing yourself and others when you call them "ASCII".

choster

9:34 pm on Jan 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Another recent discussion about Unicode and character sets is at [webmasterworld.com...] .