© = ©
™ = ™
® = ®
Do a search on google for HTML ASCII, there are plenty of sites with tables.
&#153; is deprecated and &#174; is not the right one
I am not sure why the W3C would choose to deprecate characters in the ASCII table. Could you please give a link showing where it talks about this, thanks.
ISO-8859-1 reserves values 128-159 for control characters. These values are undefined and always have been for
- ASCII (which only goes to 127; there is no such thing as ASCII character 153, 169 or 174. ASCII is a seven-bit system so that programs can use the other bit to check for parity and so forth).
- ISO-8859-1 (which reserves these values for non-displaying control characters).
- Unicode (which overlaps with ISO-8859-1 for 0-255)
These values are defined for Windows-1252, will cause you problems in almost any other character set, and should never be used in HTML. You should use the Unicode value, that is &#8482; in decimal notation.
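To make the point concrete, here is a minimal Python sketch (the variable names are only illustrative) that decodes the single byte 153 under two character sets. It assumes Python's built-in `cp1252` and `latin-1` codecs are a fair stand-in for Windows-1252 and ISO-8859-1.

```python
import unicodedata

raw = b"\x99"  # the single byte with decimal value 153

# Windows-1252 assigns this byte to the trademark sign.
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 (as implemented by Python's latin-1 codec) maps it to a control character.
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252), ord(as_cp1252))                   # the trademark sign and its Unicode code point
print(repr(as_latin1), unicodedata.category(as_latin1))  # category "Cc" means control character

# The decimal numeric entity is built from the Unicode code point, not from the byte value.
entity = f"&#{ord(as_cp1252)};"
print(entity)
```

The last line prints the entity to put in the HTML source instead of the Windows-1252 byte value.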
Tom
if I understand correctly I should be using
charset=iso-8859-1
Maybe, maybe not. You should use the correct character set. Support for ISO-8859-1 is probably the second most widespread (after ASCII), but if you want to use true typographic markup (em dashes and such), you would probably want to use UTF-8.
and always use Unicode entities as Choster posted above.
More or less. As I said in the original post, if you are going to serve up pages as iso-8859-*, you should always use Unicode entities for code points that have numbers 128-159 decimal in the Windows-1252 character set. However, if your underlying text from your word processor or whatever is Unicode text and you serve pages up as UTF-8, these "same" characters should render just fine ("same" meaning they look pretty much like one another, though if you look at them with a hex editor, they will not be the same).
Basically it depends on how many such characters you are using. Examples are curly quotes (single and double), the florin sign, en and em dashes.
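If you have a lot of these characters, the conversion can be scripted. The following is only a sketch, assuming the source text is a run of Windows-1252 bytes; `cp1252_to_entities` is a made-up helper name, not part of any library.

```python
def cp1252_to_entities(raw: bytes) -> str:
    """Decode Windows-1252 bytes and replace every non-ASCII character
    with a decimal numeric entity based on its Unicode code point."""
    text = raw.decode("cp1252")
    # xmlcharrefreplace turns each character ASCII cannot encode into &#NNNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Curly double quotes (0x93, 0x94), an em dash (0x97) and a trademark sign (0x99)
sample = b"\x93Hi\x94 \x97 Acme\x99"
print(cp1252_to_entities(sample))  # &#8220;Hi&#8221; &#8212; Acme&#8482;
```

Note that this also converts accented letters such as é into entities, which is harmless but not strictly needed if the page is served as iso-8859-1.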
* Far and away the best discussion of the problem in general and particularly as it relates to HTML presentation is offered on Jukka Korpela's page On the use of some MS Windows characters in HTML [cs.tut.fi]. He has a complete chart of problem values.
* See also the Windows-1252 code point table near the bottom of the Wikipedia article on ISO-8859-1 [en.wikipedia.org] with problem characters highlighted.
* And Chris Wendt's comments [lists.w3.org] from way back in 1998.
* The quick converter from codeside is an easy way to convert Windows-1252 to numeric entities [code.cside.com].
Tom
If I set my character type to charset=iso-8859-1, and want to produce the copyright symbol these websites with charts suggest the following,
http://slackerhtml.tripod.com/html/ascii.html suggests using ©
http://www.ascii.cl/htmlcodes.htm states this at the top of its page,
Standard ASCII set, HTML Entity names, ISO 10646, ISO 8879, ISO 8859-1 Latin alphabet No. 1
Browser support: All browsers
and says to use &copy; or &#169;
[w3schools.com...] says to just use ©
There are many other websites that all say to use either &copy; or &#169;
So what's the deal, is everybody just doing it wrong?
decimal numeric entities
named entities
hexadecimal numeric entities
Modern browsers support all of them pretty well. However, modern browsers also support UTF-8 just fine, so the whole thing is becoming less and less important. Looking way ahead in internet time, in ten years you could probably get rid of most entities except < > & etc which must be retained because of their special meaning in markup.
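As a quick check of the three entity styles, this sketch uses Python's standard `html` module to resolve each form for the copyright sign and to escape the characters that have special meaning in markup. The variable names are illustrative only.

```python
import html

# Three ways of writing the copyright sign (U+00A9, decimal 169, hex A9)
decimal_form = html.unescape("&#169;")
named_form = html.unescape("&copy;")
hex_form = html.unescape("&#xA9;")

print(decimal_form, named_form, hex_form)  # all three resolve to the same character

# The markup-significant characters still need escaping, even on a UTF-8 page.
escaped = html.escape("<a & b>")
print(escaped)  # &lt;a &amp; b&gt;
```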
Tom
I have just one more question.
you said:
>As I said in the original post, if you are going to serve up pages as iso-8859-*, you should always use Unicode entities for code points that have numbers 128-159 decimal in the Windows-1252 character set.
Can you tell me what is meant by Windows-1252 set?
Is this how Windows views the characters?
I have a Mac, BTW. Both unicode and Ascii work fine on my computer, IE 5.1.
Thanks for your help
Lorel
The brief answer: it's just a character set like any other. Certain hex numbers have certain values, and those are not necessarily the same for a given character encoding. There is no such thing as "plain text" as all text has to be encoded into numbers and those numbers depend on the character set you are using.
I don't know about Macs, but there is no reason that you couldn't have Windows-1252 installed on your Mac. Perhaps, for example, if you use older Microsoft products, they may depend on that character set and may have installed it (newer MS products use Unicode).
For Western European languages there are four main encodings you might find being used on a Windows machine:
- ASCII (no accents or fancy punctuation, so English only - think a standard US typewriter)
- ISO-8859-* family (accents for most Euro languages and some other characters, but not large character sets - only 256 code points each - so each region needs its own encoding. Western Europe uses 8859-1, aka Latin-1)
- Unicode, which uses a variety of encodings (UTF-8, UTF-16LE, UTF-16BE) but is a single standard that covers all European characters, many other scripts and a lot of fancy punctuation (64K characters in the basic plane, with room for over a million code points in total).
- Windows-1252 which takes unused code points in ISO-8859-1 and assigns values for commonly used characters, such as em dashes and curly quotes.
These determine the actual underlying *number* used to represent a character. Unicode values overlap perfectly with iso-8859-1 for the first 256 code points (0-255) and iso-8859-1 overlaps with ascii for the first 128 code points. However, values 128-159 in Windows-1252 do not overlap with any other character set that I know of and those values are reserved for control characters in iso-8859-1. That means the display value for those numbers is undefined in any character set other than Windows-1252. So it is up to the OS or user agent to decide what to do with a code like &#153;. The display value is only defined if the page is being served up (using an XML declaration or a meta charset tag) as Windows-1252.
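The overlap claims can be checked mechanically. This is a rough sketch using Python's built-in codecs as stand-ins for the character sets (`cp1252` for Windows-1252, `latin-1` for iso-8859-1); the variable names are illustrative only.

```python
# ASCII agrees with iso-8859-1 on bytes 0-127.
ascii_in_latin1 = all(
    bytes([i]).decode("ascii") == bytes([i]).decode("latin-1")
    for i in range(128)
)

# iso-8859-1 agrees with Unicode on code points 0-255.
latin1_in_unicode = all(
    bytes([i]).decode("latin-1") == chr(i)
    for i in range(256)
)

# Byte values where Windows-1252 and iso-8859-1 disagree.
# errors="replace" is used because a few bytes are unassigned in Python's cp1252 codec.
differs = [
    i for i in range(256)
    if bytes([i]).decode("cp1252", errors="replace") != bytes([i]).decode("latin-1")
]

print(ascii_in_latin1, latin1_in_unicode)
print(differs[0], differs[-1], len(differs))  # first and last differing byte, and how many
```

Under these assumptions the only disagreements are the 32 byte values from 128 to 159.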
If, when you put a &#153; in a page, it renders as you expect on your Mac, that means you have the Windows-1252 character set on your machine [edit: see correction below]. However, you can't count on others having that character set or, if they do, you can't count on them interpreting the code as you wish. Indeed, if the page is iso-8859-1 or utf-8, those numbers should NOT be displayed as anything other than a box or a question mark. One could consider it a browser bug if the browser actually displayed anything in that case.
[correction: older Macs have a charset known as MacRoman which differs from both ISO-8859-1 and Windows-1252 in the 80-9F (128-159) range, but I don't know what the overlap is. Some code points may render the same in Windows-1252 and MacRoman - I don't know]
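For anyone curious about the MacRoman question, here is a small sketch that compares the two character sets, assuming Python's `mac_roman` codec matches what older Macs used. The variable names are illustrative only.

```python
byte_153 = b"\x99"

# The same byte means different things in the two character sets.
print(byte_153.decode("mac_roman"))  # an accented letter in MacRoman
print(byte_153.decode("cp1252"))     # the trademark sign in Windows-1252

# The trademark sign lives at a different byte value in each set.
tm_mac = "\u2122".encode("mac_roman")
tm_win = "\u2122".encode("cp1252")
print(tm_mac.hex(), tm_win.hex())

# Byte values in the 80-9F range that decode to the same character in both sets.
shared = [
    hex(i) for i in range(0x80, 0xA0)
    if bytes([i]).decode("mac_roman") == bytes([i]).decode("cp1252", errors="replace")
]
print(shared)
```

Running it lists whichever byte values, if any, the two sets share in that range.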
Whew! Huge post and I'm not even sure I answered your question...
Tom