It's important to me because I'm writing a regex editing tool in JavaScript that creates long regular expressions that are ultimately used in a Perl script. So when I see a Unicode char in JavaScript that is actually ascii 150, things start getting funky. Took me a while to chase this one down.
Thanks for any input.
Mark.
Characters 128-159 of Unicode are specifically disallowed from use in HTML by the HTML standard, which means that NCRs with these numbers are illegal. Most browsers will probably interpret them as if they were written in the windows-1252 encoding, but that's only because there are tons of old pages that exploited bugs in pre-Unicode-aware browsers, and not something you should rely on, even a little bit, in a new document.
The reason is that this range holds only the C1 control characters, no printable ones, which is a consequence of compatibility with Latin-1 (and the same goes for all the ISO-8859-X character sets).
Position 150 in Windows-1252 appears to translate to Unicode 8211, so that can't be it either.
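For what it's worth, you can verify that mapping directly with the TextDecoder API. A quick sketch, assuming TextDecoder is available (it is in current browsers and Node.js):

    // Decode the single byte 0x96 (decimal 150) as Windows-1252.
    const decoder = new TextDecoder('windows-1252');
    const ch = decoder.decode(new Uint8Array([0x96]));

    console.log(ch);                              // "–" (an en dash)
    console.log(ch.codePointAt(0));               // 8211
    console.log(ch.codePointAt(0).toString(16));  // "2013"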
What exactly is the purpose of your script outputting a byte value of 150? What character set is the page encoded in where you run this script? It's not unlikely that you fell victim to a conceptual error, and might need to be doing something entirely different...
Thank you very very very much. I'm not sure if I could have solved this myself at all.
It's Windows-1252, and here's the spec giving the Unicode equivalents. As you can see, 0x96 (decimal 150) is Unicode 0x2013.
[microsoft.com...]
I'm writing regex parsers for other people's pages, so I don't control the charset at all. Many of these pages do not specify a charset. So the browsers are assuming the extended ascii is Windows-1252 and converting it to Unicode. This is the case for both Mozilla on Linux and IE on Windows 98. (Weird that Moz defaults to an MS codepage, but anyway.)
So to solve this, I'm simply going to put a lookup table in my JavaScript: when my js sees the already converted unicode char, it converts it back to the ascii value, and adds that to the regex, which will then be used in a Perl script on the raw data to do the actual parsing.
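For concreteness, here's a minimal sketch of what such a lookup table could look like. The table is just the standard Windows-1252-to-Unicode mapping for bytes 0x80-0x9F, reversed; the function name toCp1252Regex and the \xHH escape output are illustrative assumptions, not code from the actual tool:

    // Maps the Unicode code points that browsers substitute for
    // Windows-1252 bytes 0x80-0x9F back to the original byte values.
    // (Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Windows-1252.)
    const UNICODE_TO_CP1252 = {
      0x20AC: 0x80, 0x201A: 0x82, 0x0192: 0x83, 0x201E: 0x84,
      0x2026: 0x85, 0x2020: 0x86, 0x2021: 0x87, 0x02C6: 0x88,
      0x2030: 0x89, 0x0160: 0x8A, 0x2039: 0x8B, 0x0152: 0x8C,
      0x017D: 0x8E, 0x2018: 0x91, 0x2019: 0x92, 0x201C: 0x93,
      0x201D: 0x94, 0x2022: 0x95, 0x2013: 0x96, 0x2014: 0x97,
      0x02DC: 0x98, 0x2122: 0x99, 0x0161: 0x9A, 0x203A: 0x9B,
      0x0153: 0x9C, 0x017E: 0x9E, 0x0178: 0x9F
    };

    // Rewrite each such character as a \xHH escape, so a Perl regex
    // applied to the raw (undecoded) bytes will match the original byte.
    function toCp1252Regex(pattern) {
      let out = '';
      for (const ch of pattern) {
        const byte = UNICODE_TO_CP1252[ch.codePointAt(0)];
        out += byte !== undefined
          ? '\\x' + byte.toString(16).toUpperCase()
          : ch;
      }
      return out;
    }

    console.log(toCp1252Regex('foo\u2013bar'));   // "foo\x96bar"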
Thank you very very much.
Ah, but that's not how you presented it at first. The hexadecimal value 0x2013 is of course equivalent to decimal 8211. So you see that the unassuming characters "0x" make a very big difference in understanding your problem!
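You can see the equivalence straight from JavaScript's number literals:

    console.log(0x2013);               // 8211
    console.log((8211).toString(16));  // "2013"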
So the browsers are assuming the extended ascii is Windows-1252 and converting it to Unicode. This is the case for both Mozilla on Linux and IE on Windows 98. (Weird that Moz defaults to an MS codepage, but anyway.)
Sounds like the JavaScript engines do that conversion based on the charset of the underlying page. But anything is good enough as long as it produces the result you need, eh?
when my js sees the already converted unicode char, it converts it back to the ascii value,
Please do yourself a favour and stop calling byte values above 127 "ASCII". They're not. I'm not quite sure what charset they belong to in your case (most likely the same as the processed page), but you're only confusing yourself and others when you call them "ASCII".