Forum Moderators: open
[unicode.org...]
Edit: I have to correct myself. It's not Unicode, it's an ISO standard. You'll find the ISO standards here:
[iso.org...]
(the ISO standards are incorporated into Unicode, which was the reason for my confusion)
As long as you're not involved in data processing or any kind of non-basic English data, you won't even understand the consequences and necessities of encoding.
At the core of it all is that the way data is stored, bits and bytes, does not correspond well to the type of data we want to store, words and letters.
Encoding is a way of turning the one, words and letters, into the other, bits and bytes. By knowing the encoding used, you can reverse the process. Without knowing the encoding, you wouldn't know how to reverse the process and you would end up with gibberish.
The confusion comes from the fact that for many decades we've gotten used to a default encoding, ASCII, which was understood by all computers and assumed as default.
Unfortunately, ASCII is not able to encode all we want and need to encode, and that is why we have the ISO standards and various encoding schemes, including Unicode.
I hope this gives you a brief idea of why encoding schemata are necessary, and perhaps a bit of an idea of why it's a complex and confusing subject for so many people.
Regards,
SN
The validator (like any web browser) needs to know what character set you used to write your page.
The ISO-8859-1 character set (aka Latin-1) is probably the most universally used for Western web sites (us-ascii is actually just a subset of the ISO charset).
am I better off (safer) doing something like "charset=iso-8859-1"?

Yes, I would use ISO-8859-1, also known as the "Latin-1" set, for a typical English language website.
Strictly speaking, US-ASCII is a set of 128 characters and control codes originally adopted for teletype machines, and includes only unaccented letters, numbers, basic English punctuation, and a handful of common characters. IBM developed an "extended ASCII" set, and then Windows developed its own set, "Windows-1252," but both of these are proprietary whereas Latin-1 is a global standard.
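The difference shows up in the 0x80-0x9F range, which Windows-1252 fills with printable characters while Latin-1 reserves it for control codes (a quick Python check, my own illustration):

```python
# Windows-1252 puts printable characters at 0x80-0x9F ...
assert b"\x93".decode("cp1252") == "\u201c"    # left double quotation mark
assert b"\x80".decode("cp1252") == "\u20ac"    # the Euro sign

# ... where ISO-8859-1 (Latin-1) only has control codes there.
assert b"\x93".decode("iso-8859-1") == "\x93"  # an unprintable control code
```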
There's no need to use US-ASCII in a web browser, which has much more sophisticated display capabilities than a teletype machine. You won't save any bandwidth or make the page display any faster or anything like that by using the more limited set. And your website may eventually include foreign names or words with accented letters or characters that hadn't been invented yet when ASCII was developed (e.g. São Paulo, €75.50). It's important to spell them correctly. Don't wish any Spanish-speaking customers a Nuevo Ano.
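For example (in Python, using the strings from the post above): US-ASCII simply cannot encode an accented letter, while Latin-1 can. Note, though, that the Euro sign needs ISO-8859-15 (Latin-9), as a later post in this thread points out:

```python
city = "São Paulo"

# US-ASCII has no code for the accented letter:
try:
    city.encode("ascii")
except UnicodeEncodeError:
    pass  # not representable in US-ASCII

# Latin-1 encodes it in one byte per character:
assert city.encode("iso-8859-1") == b"S\xe3o Paulo"

# The Euro sign is not in Latin-1, but ISO-8859-15 has it at 0xA4:
assert "\u20ac75.50".encode("iso-8859-15") == b"\xa475.50"
```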
Yes, the official ISO site has quite stupid policies - making a standard and then not making it available for free is ridiculous. OTOH, someone seems to disagree with me on this.
For the same reason, if you do a search on "iso-8859-1" you will find lots of unofficial pages that do a very good job of explaining these standards and even provide illustrations - I just thought it was better to link to the official site instead of a private one.
/claus
The ISO-8859-x encodings are *not* the same as Unicode. At best they serve a similar purpose to the physical encoding schemes of Unicode. There is a variety of ISO encodings for different language and character families. Incidentally, the numerical values (but not necessarily the binary representations!) of ISO-8859-1 (Latin-1) and the first 256 positions of Unicode are the same. This could be practical, if it weren't for the missing Euro sign in Latin-1, which makes ISO-8859-15 (Latin-9) a more useful replacement nowadays.
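Both facts are easy to verify in Python (my own illustration):

```python
# The Unicode code point of a Latin-1 character equals its byte value:
assert ord("é") == 0xE9
assert "é".encode("iso-8859-1") == b"\xe9"

# But the Euro sign is missing from Latin-1; Latin-9 places it at 0xA4:
assert "€".encode("iso-8859-15") == b"\xa4"
```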
Here's the full set:
ISO 8859-1 west European languages (Latin-1)
ISO 8859-2 central and east European languages (Latin-2)
ISO 8859-3 southeast European and miscellaneous languages (Latin-3)
ISO 8859-4 Scandinavian/Baltic languages (Latin-4)
ISO 8859-5 Latin/Cyrillic
ISO 8859-6 Latin/Arabic
ISO 8859-7 Latin/Greek
ISO 8859-8 Latin/Hebrew
ISO 8859-9 Latin-1 modification for Turkish (Latin-5)
ISO 8859-10 Lappish/Nordic/Eskimo languages (Latin-6)
ISO 8859-11 Latin/Thai
ISO 8859-13 Baltic Rim languages (Latin-7)
ISO 8859-14 Celtic (Latin-8)
ISO 8859-15 west European languages (Latin-9)
ISO 8859-16 some east European languages (Latin-10)
All of those include 256 character positions, so that the physical representation of each takes one byte. Of those, the first 128 are identical, to remain backwards compatible with US-ASCII (US-ASCII only defines those 128 positions). As a consequence, you can write English language text with all of them, and it will be stored as the same byte sequence. But the HTML standards require that a character set be declared even if the file only contains English text, because there are other valid (and increasingly common) encoding schemes available that don't follow the same principle.
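The shared lower half is easy to demonstrate (a quick Python check; the encoding names are Python's aliases for the sets listed above):

```python
text = "plain English text"
reference = text.encode("ascii")

# ASCII-only text produces the same bytes in every ISO-8859 set
# (and, as it happens, in UTF-8 as well):
for charset in ("iso-8859-1", "iso-8859-5", "iso-8859-7", "utf-8"):
    assert text.encode(charset) == reference
```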
The differences between the ISO character sets are in the upper 128 character positions. If your text uses any "funny characters" like umlauts, or even characters from a completely different writing system, then you need to tell the browser what each of those byte values actually means. Reading Arabic text with a Cyrillic font isn't quite as amusing as it might seem at first... ;)
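The effect is easy to reproduce: take bytes written with one ISO set and read them with another (a Python sketch; the German word is my own example):

```python
word = "grüße"
data = word.encode("iso-8859-1")   # ü and ß become bytes 0xFC and 0xDF

# Read the same bytes with the Latin/Cyrillic table instead:
garbled = data.decode("iso-8859-5")
assert garbled != word             # the upper-half bytes now mean Cyrillic letters
```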
An alternative is to use one of the physical encodings of Unicode. In this case it is important to remember that Unicode itself doesn't define how text is stored physically, it just assigns a running number to each character it knows about. If you want to store it on disk, then you still have to decide about a specific encoding.
Most often this will be UTF-8, which shares the numerical values of its first 256 positions with Latin-1, but is binary-compatible only in the US-ASCII range (the first 128 positions). For everything beyond that, it uses sequences of two or more bytes, so that it can represent all legal Unicode values (= all languages); otherwise there would be no "special bytes" left to signal that a multibyte character follows. So the compatibility between Latin-1 and any Unicode encoding is really limited to the numerical values. Unfortunately, the multibyte sequences also mean that not every character in your text will take the same amount of space on disk, which will confuse many editors (but not the web browser). Other Unicode representations use at least two or four bytes for all characters.
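The byte counts make the limited compatibility concrete (a Python illustration):

```python
# Identical bytes only in the US-ASCII range:
assert "A".encode("utf-8") == "A".encode("iso-8859-1") == b"A"

# A Latin-1 character beyond ASCII: one byte in Latin-1, two in UTF-8:
assert "é".encode("iso-8859-1") == b"\xe9"
assert "é".encode("utf-8") == b"\xc3\xa9"

# Characters outside Latin-1 entirely need even more bytes:
assert "€".encode("utf-8") == b"\xe2\x82\xac"
```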