Forum Moderators: open
[unicode.org...]
Edit: I have to correct myself. It's not Unicode, it's an ISO standard. You'll find the ISO standards here:
[iso.org...]
(the ISO standards are incorporated into Unicode, which was the reason for my confusion)
As long as you're not involved in data processing or any kind of non-basic English data, you won't even understand the consequences and necessities of encoding.
At the core of it all is that the way data is stored, bits and bytes, does not correspond well to the type of data we want to store, words and letters.
Encoding is a way of turning the one, words and letters, into the other, bits and bytes. By knowing the encoding used, you can reverse the process. Without knowing the encoding, you wouldn't know how to reverse the process and you would end up with gibberish.
The confusion comes from the fact that for many decades we've gotten used to a default encoding, ASCII, which was understood by all computers and assumed as default.
Unfortunately, ASCII is not able to encode all we want and need to encode, and that is why we have the ISO standards and various encoding schemes, including Unicode.
I hope this gives you a brief idea of why encoding schemata are necessary, and perhaps a bit of an idea of why it's a complex and confusing subject for so many people.
Regards,
SN
The validator (like any web browser) needs to know what character set you used to write your page.
The ISO-8859-1 character set (aka Latin-1) is probably the most universally used for Western web sites (us-ascii is actually just a subset of the ISO charset).
am I better off (safer) doing something like "charset=iso-8859-1"?

Yes, I would use ISO-8859-1, also known as the "Latin-1" set, for a typical English language website.
Strictly speaking, US-ASCII is a set of 128 characters and control codes originally adopted for teletype machines, and includes only unaccented letters, numbers, basic English punctuation, and a handful of common characters. IBM developed an "extended ASCII" set, and then Windows developed its own set, "Windows-1252," but both of these are proprietary whereas Latin-1 is a global standard.
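The difference shows up in the 0x80-0x9F range, which Windows-1252 fills with printable characters while Latin-1 reserves it for control codes (a quick Python check, my own illustration):

```python
# Windows-1252 puts printable characters at 0x80-0x9F ...
assert b"\x93".decode("cp1252") == "\u201c"    # left double quotation mark
assert b"\x80".decode("cp1252") == "\u20ac"    # the Euro sign

# ... where ISO-8859-1 (Latin-1) only has control codes there.
assert b"\x93".decode("iso-8859-1") == "\x93"  # an unprintable control code
```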
There's no need to use US-ASCII in a web browser, which has much more sophisticated display capabilities than a teletype machine. You won't save any bandwidth or make the page display any faster or anything like that by using the more limited set. And your website may eventually include foreign names or words with accented letters or characters that hadn't been invented yet when ASCII was developed (e.g. São Paulo, €75.50). It's important to spell them correctly. Don't wish any Spanish-speaking customers a Nuevo Ano.
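For example (in Python, using the strings from the post above): US-ASCII simply cannot encode an accented letter, while Latin-1 can. Note, though, that the Euro sign needs ISO-8859-15 (Latin-9), as a later post in this thread points out:

```python
city = "São Paulo"

# US-ASCII has no code for the accented letter:
try:
    city.encode("ascii")
except UnicodeEncodeError:
    pass  # not representable in US-ASCII

# Latin-1 encodes it in one byte per character:
assert city.encode("iso-8859-1") == b"S\xe3o Paulo"

# The Euro sign is not in Latin-1, but ISO-8859-15 has it at 0xA4:
assert "\u20ac75.50".encode("iso-8859-15") == b"\xa475.50"
```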
Yes, the official ISO site has quite stupid policies - making a standard and then not making it available for free is ridiculous. OTOH, someone seems to disagree with me on this.
For the same reason, if you do a search on "iso-8859-1" you will find lots of unofficial pages that do a very good job of explaining these standards and even provide illustrations - I just thought it was better to link to the official site instead of a private one.
/claus
The ISO-8859-x encodings are *not* the same as Unicode. At best they serve a similar purpose to the physical encoding schemes of Unicode. There is a variety of ISO encodings for different language and character families. Incidentally, the numerical values (but not necessarily the binary representations!) of ISO-8859-1 (Latin-1) and the first 256 positions of Unicode are the same. This could be practical, if it weren't for the missing Euro sign in Latin-1, which makes ISO-8859-15 (Latin-9) a more useful replacement nowadays.
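Both facts are easy to verify in Python (my own illustration):

```python
# The Unicode code point of a Latin-1 character equals its byte value:
assert ord("é") == 0xE9
assert "é".encode("iso-8859-1") == b"\xe9"

# But the Euro sign is missing from Latin-1; Latin-9 places it at 0xA4:
assert "€".encode("iso-8859-15") == b"\xa4"
```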
Here's the full set:
ISO 8859-1 west European languages (Latin-1)
ISO 8859-2 central and east European languages (Latin-2)
ISO 8859-3 southeast European and miscellaneous languages (Latin-3)
ISO 8859-4 Scandinavian/Baltic languages (Latin-4)
ISO 8859-5 Latin/Cyrillic
ISO 8859-6 Latin/Arabic
ISO 8859-7 Latin/Greek
ISO 8859-8 Latin/Hebrew
ISO 8859-9 Latin-1 modification for Turkish (Latin-5)
ISO 8859-10 Lappish/Nordic/Eskimo languages (Latin-6)
ISO 8859-11 Latin/Thai
ISO 8859-13 Baltic Rim languages (Latin-7)
ISO 8859-14 Celtic (Latin-8)
ISO 8859-15 west European languages (Latin-9)
ISO 8859-16 some east European languages (Latin-10)
All of those include 256 character positions, so that the physical representation of each takes one byte. Of those, the first 128 are identical, to remain backwards compatible with US-ASCII (US-ASCII only defines those 128 positions). As a consequence, you can write English language text with all of them, and it will be stored as the same byte sequence. But the HTML standards require that a character set be declared even if the file only contains English text, because there are other valid (and increasingly common) encoding schemes available that don't follow the same principle.
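The shared lower half is easy to demonstrate (a quick Python check; the encoding names are Python's aliases for the sets listed above):

```python
text = "plain English text"
reference = text.encode("ascii")

# ASCII-only text produces the same bytes in every ISO-8859 set
# (and, as it happens, in UTF-8 as well):
for charset in ("iso-8859-1", "iso-8859-5", "iso-8859-7", "utf-8"):
    assert text.encode(charset) == reference
```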
The differences between the ISO character sets are in the upper 128 character positions. If your text uses any "funny characters" like umlauts, or even characters from a completely different writing system, then you need to tell the browser what each of those byte values actually means. Reading Arabic text with a Cyrillic font isn't quite as amusing as it might seem at first... ;)
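The effect is easy to reproduce: take bytes written with one ISO set and read them with another (a Python sketch; the German word is my own example):

```python
word = "grüße"
data = word.encode("iso-8859-1")   # ü and ß become bytes 0xFC and 0xDF

# Read the same bytes with the Latin/Cyrillic table instead:
garbled = data.decode("iso-8859-5")
assert garbled != word             # the upper-half bytes now mean Cyrillic letters
```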
An alternative is to use one of the physical encodings of Unicode. In this case it is important to remember that Unicode itself doesn't define how text is stored physically, it just assigns a running number to each character it knows about. If you want to store it on disk, then you still have to decide about a specific encoding.
Most often this will be UTF-8, which shares the numerical values of its first 256 positions with Latin-1, but is binary-compatible only in the US-ASCII range (the first 128 positions). For everything beyond that, it uses sequences of two or more bytes, so that it can represent all legal Unicode values (= all languages); otherwise there would be no "special bytes" left to signal that a multibyte character follows. So the compatibility between Latin-1 and any Unicode encoding is really limited to the numerical values. Unfortunately, the multibyte sequences also mean that not every character in your text will take the same amount of space on disk, which will confuse many editors (but not the web browser). Other Unicode representations use at least two or four bytes for all characters.
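The byte counts make the limited compatibility concrete (a Python illustration):

```python
# Identical bytes only in the US-ASCII range:
assert "A".encode("utf-8") == "A".encode("iso-8859-1") == b"A"

# A Latin-1 character beyond ASCII: one byte in Latin-1, two in UTF-8:
assert "é".encode("iso-8859-1") == b"\xe9"
assert "é".encode("utf-8") == b"\xc3\xa9"

# Characters outside Latin-1 entirely need even more bytes:
assert "€".encode("utf-8") == b"\xe2\x82\xac"
```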