Confused about Encoding

kfegarty

6:17 pm on Jun 30, 2005 (gmt 0)

I am coding our XML file and I am confused when I look at others. In some I see the first line as encoding="iso-8859-1"?

others
encoding="windows-1252"?

encoding="UTF-8"?

What should I use? What is the difference?

choster

6:19 pm on Jul 5, 2005 (gmt 0)

Welcome to WebmasterWorld!

Computers don't see text as shapes; they see numeric codes. A character set is a table which translates between the codes and shapes. The same code used to generate the letter "a" in one character set might be used to display a hiragana "shu" (or a Braille sequence, or a bullet, or an ancient Mycenaean Linear B footstool ideogram) in another. This is why, for instance, "smart quotes" on older Mac programs rendered as funny accent marks in Windows programs, and vice versa: they used different character sets.
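To make the smart-quote example concrete, here is a small Python sketch. It assumes Python's built-in "mac_roman" and "cp1252" codecs are fair stand-ins for what older Mac and Windows programs used; the sample text and variable names are made up for illustration.

```python
# The same bytes turn into different characters depending on which
# character set is used to read them back.
text = "\u201chi\u201d"  # the word hi wrapped in curly ("smart") quotes

mac_bytes = text.encode("mac_roman")  # how an older Mac program might store it
win_bytes = text.encode("cp1252")     # how a Windows program might store it

print(mac_bytes)  # b'\xd2hi\xd3'
print(win_bytes)  # b'\x93hi\x94'

# Read each byte string with the other platform's character set.
mac_read_on_windows = mac_bytes.decode("cp1252")
win_read_on_mac = win_bytes.decode("mac_roman")

print(mac_read_on_windows)  # ÒhiÓ
print(win_read_on_mac)      # ìhiî
```

Nothing in the bytes says which table to use, which is why the encoding label on the file matters.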

Unicode (http://www.unicode.org/), as its name implies, is a project to create a universal character set containing all the letters, numbers, punctuation marks, and so on for all the major writing systems of the world, essentially combining dozens of existing character sets into one. UTF-8 is a standard way of encoding Unicode characters as bytes, using one to four bytes per character. According to the XML specification, all XML processors must support UTF-8 and UTF-16 at a minimum.

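ISO-8859-1, also known as Latin-1, is a character set containing characters used by Western European languages (including English, French, Spanish, and German). It predates Unicode, and its 256 characters are the first 256 characters of Unicode, so everything in it can also be written in UTF-8. The bytes are not interchangeable, though: only the ASCII range (codes 0 to 127) is stored identically in both, while an accented letter such as "é" is one byte in ISO-8859-1 and two bytes in UTF-8.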
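A short Python sketch of how ISO-8859-1 and UTF-8 relate, using the standard "iso-8859-1" and "utf-8" codecs. The sample word and variable names are only for illustration.

```python
word = "caf\u00e9"  # "café"

latin1_bytes = word.encode("iso-8859-1")
utf8_bytes = word.encode("utf-8")

print(latin1_bytes)  # b'caf\xe9'      one byte for the accented letter
print(utf8_bytes)    # b'caf\xc3\xa9'  two bytes for the accented letter

# The Unicode number for the letter equals its ISO-8859-1 byte value.
same_number = ord("\u00e9") == latin1_bytes[-1] == 0xE9

# Plain ASCII text is byte-for-byte identical in both encodings.
ascii_identical = "cafe".encode("iso-8859-1") == "cafe".encode("utf-8")

# ISO-8859-1 bytes with accented letters are not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(same_number, ascii_identical, latin1_is_valid_utf8)  # True True False
```

So a file that contains only unaccented English text can carry either label, but anything beyond ASCII has to be labeled with the encoding it was actually saved in.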

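Windows-1252 is a proprietary character set for Western European languages created by Microsoft and widely used in Windows applications. It matches ISO-8859-1 except in the range 0x80 to 0x9F, where ISO-8859-1 has control codes and Windows-1252 has printable characters such as "smart quotes", dashes, and the euro sign.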
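Where Windows-1252 and ISO-8859-1 part ways can be seen in a few lines of Python. This is a rough sketch using the standard "cp1252" and "iso-8859-1" codecs; the byte values are picked by hand for illustration.

```python
import unicodedata

# Three bytes from the 0x80-0x9F range.
raw = bytes([0x93, 0x80, 0x94])

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print(as_cp1252)  # “€”  (curly quotes around a euro sign)

# Under ISO-8859-1 the same bytes are invisible control codes.
latin1_all_controls = all(unicodedata.category(ch) == "Cc" for ch in as_latin1)

# From 0xA0 upward the two character sets agree.
upper_range = bytes(range(0xA0, 0x100))
upper_range_matches = upper_range.decode("cp1252") == upper_range.decode("iso-8859-1")

# The euro sign has no slot in ISO-8859-1 at all.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(latin1_all_controls, upper_range_matches, euro_fits_latin1)  # True True False
```

This is why text pasted from Windows applications often loses its quotes and dashes when a file is labeled ISO-8859-1.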

kfegarty

6:53 pm on Jul 5, 2005 (gmt 0)

Thanks for the clarification. Is one better to use than the other? Or does it really matter to the RSS readers and aggregators?

choster

8:43 pm on Jul 5, 2005 (gmt 0)

UTF-8 is the "lowest common denominator" for XML 1.0 processors, and probably what you have to begin with. I'd stick with it unless you have a compelling reason to do otherwise.
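As a quick check of the "lowest common denominator" point, the sketch below feeds UTF-8 bytes to Python's standard xml.etree.ElementTree parser, once with an explicit label and once with no encoding declaration. The element name and text are invented for the example, and other XML processors may differ in details.

```python
import xml.etree.ElementTree as ET

# The same UTF-8 content, once without and once with an encoding label.
doc_without_label = "<title>Caf\u00e9 menu</title>".encode("utf-8")
doc_with_label = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<title>Caf\u00e9 menu</title>"
).encode("utf-8")

title_a = ET.fromstring(doc_without_label).text
title_b = ET.fromstring(doc_with_label).text

# Both parse to the same text, accented letter included.
print(title_a == title_b == "Caf\u00e9 menu")  # True
```

The unlabeled document parses correctly because an XML file with no encoding declaration is treated as UTF-8 (or UTF-16 if it starts with a byte order mark).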

That said, you should label your data according to the form it actually takes. If your content is encoded as Windows-1252 (for instance, because it was generated in Microsoft Word), you can't just change the label, so to speak, and call it UTF-8; the characters will display improperly or not at all.
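Here is a hedged sketch of what mislabeling looks like in practice, again with Python's standard xml.etree.ElementTree. The helper parse_with_label is hypothetical, written only for this example; it prepends an XML declaration with the given label to content that was really saved as Windows-1252.

```python
import xml.etree.ElementTree as ET

# Content with curly quotes, saved as Windows-1252 (as Word might do).
body = "<title>\u201cSmart\u201d quotes</title>"
payload = body.encode("cp1252")


def parse_with_label(label):
    """Parse the Windows-1252 payload under the given encoding label."""
    declaration = f'<?xml version="1.0" encoding="{label}"?>'.encode("ascii")
    try:
        return ET.fromstring(declaration + payload).text
    except ET.ParseError as err:
        return f"parse error: {err}"


honest = parse_with_label("windows-1252")
mislabeled = parse_with_label("UTF-8")

print(honest)      # the title text, curly quotes intact
print(mislabeled)  # a parse error message, since the bytes are not valid UTF-8
```

A strict parser rejects the mislabeled file outright; a more lenient feed reader might instead show replacement characters or garbage where the quotes should be. Converting the content to UTF-8 first, then labeling it UTF-8, avoids both outcomes.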