Greek characters needed for web page

unicode, character sets, decimal, hexadecimal - my brain is about to explode!

         

Trisha

2:55 am on Nov 6, 2002 (gmt 0)

10+ Year Member



I've done searches on Google and read all sorts of stuff about unicode, character sets, etc., but I still don't really understand it.

What I need to do is put Greek letters on a site. The site will be in XHTML 1.1 and I'm using UTF-8. I would like the site to validate and to follow the WCAG accessibility guidelines as much as possible.

Some pages I found talked about entity, decimal and hexadecimal notation, and I didn't understand that at all.

For a small alpha, one site said to use:


U+03B1

another site said:


&alpha; or &#945; or &#x3B1;

yet another explained:
"of the form &#xxxx;, where xxxx is the position of the character in the Unicode character set"

which made sense to me, except after I found the chart with the characters I wanted on unicode.org, I couldn't figure out how to use the chart.

My site also uses a MySQL database and PHP. When testing some of the above it would appear right in IE5.5 the first time I looked at it, but after editing it again through the PHP/CMS web interface thingy I made, the characters looked totally different.
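
One possible cause of characters changing after a save (an assumption, since the thread does not show the PHP or MySQL settings) is that UTF-8 bytes get re-read as a single-byte encoding somewhere in the round trip. A minimal Python sketch of that effect, with purely illustrative variable names:

```python
# Illustration only: the same bytes interpreted with two different encodings.
original = "\u03b1"                           # Greek small letter alpha

stored_bytes = original.encode("utf-8")       # what a UTF-8 page submits: b'\xce\xb1'

# If some layer assumes ISO 8859-1, each byte becomes its own character.
misread = stored_bytes.decode("iso-8859-1")   # two characters instead of one alpha

# Reading the bytes back as UTF-8 recovers the original character.
roundtrip = stored_bytes.decode("utf-8")

print(repr(misread), repr(roundtrip))
```

If the garbled text looks like two accented Latin characters per Greek letter, a mismatch of this kind is worth checking before anything else.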

I'm also getting quite a few white boxes where characters should be. Oh, and the font used for the site is Verdana.

Is there anyone who can help me sort this out or point me to a site that explains it all clearly?

Thanks!

phollings

3:49 pm on Nov 6, 2002 (gmt 0)

10+ Year Member



Trisha --

Off the top of my head I can't fully answer your question, but let me try a simple answer.

Depending on a computer's language settings, operating system, browser, etc., specific character sets are installed. Regardless of the character set, each character is represented internally as a unique number. These numbers, in turn, can be expressed in various numbering systems, e.g., decimal (base 10) or hexadecimal (base 16). In HTML, the convention is to precede the number with an ampersand and a number sign (&#, or &#x for hexadecimal) and follow it with a semicolon. This tells the browser that it is dealing with a character reference and not the literal number. Also, some characters have a short name that can be used instead, e.g., "&amp;" for ampersand (note: these names are case sensitive).
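
To make the idea of "one number, several notations" concrete, here is a small Python sketch. Python is only used as a calculator here, and the variable names are illustrative:

```python
# One character, one number, several ways of writing that number.
alpha = "\u03b1"                              # Greek small letter alpha

code_point = ord(alpha)                       # the character's number in Unicode
decimal_form = str(code_point)                # base 10
hex_form = format(code_point, "X")            # base 16, upper-case digits
unicode_notation = "U+{:04X}".format(code_point)  # the notation used on unicode.org charts

print(decimal_form, hex_form, unicode_notation)
```

The decimal and hexadecimal strings are the parts that go between "&#" (or "&#x") and ";" in a page.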

In practice, all I think you need to do is find the characters you need in a table and use the corresponding code in your HTML. Then test the result on several browsers to verify that the character set you used is likely to be available to your site visitors. Some good tables are available at: [visibone.com...] . Perhaps someone else can comment on the choice of character sets?

HTH,

Peter Hollings

tedster

4:19 pm on Nov 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The Verdana font doesn't enter into the picture - but you might be able to get good results writing your characters in the "Symbol" font, which is pretty widely distributed.

I believe that Symbol is standard on Windows, Macintosh and Unix machines today. Unfortunately they use different numbering systems for the characters, making things challenging if you try to use the unicode or HTML code for the characters. But using font-family:Symbol or <font face="Symbol"> may give better results.

Extended character set information [hclrss.demon.co.uk]

Test page for Greek characters in Symbol [dibonsmith.com]
(View Source to see which English letter corresponds to which Greek letter)

Trisha

12:01 am on Nov 7, 2002 (gmt 0)

10+ Year Member



I just spent a few more hours today reading whatever I could find about this subject. Here's a summary of what I found (please post any corrections or additions):

Unicode (ISO 10646) - a basic introduction:
In the past most computers used fonts that contained a maximum of 256 characters. The first 128 characters were a-z, A-Z, etc. The second 128 depended on where you lived: they could be accented letters, punctuation marks or characters from the Arabic, Greek, etc. alphabets. Unicode replaces this system with one that assigns a unique number to each character in each of the major languages of the world and potentially allows for over a million unique characters.

There are three ways of specifying Unicode characters in (x)html (other than the characters that appear on a typical keyboard)

1 - Character Entity References; written like:


&character_entity;
an example:
&alpha;

252 of these characters are available.

2 - Numeric Character References; written like:


&#decimal_reference;
an example:
&#945;

all unicode characters are available with this method
this is also sometimes referred to as a decimal character reference

3 - Hexadecimal Character References; written like:


&#xhexadecimal_reference;
an example:
&#x3B1;

all unicode characters are available with this method

Methods 1 and 2 can cause problems in Netscape 4.x, unless the characters are present in the document's character encoding. Netscape 4.x only recognizes a few hexadecimal characters.
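
As a quick check that the three notations above name the same character, here is a short sketch using Python's standard html module. The list name is illustrative:

```python
import html

# Entity, decimal and hexadecimal forms of Greek small letter alpha.
references = ["&alpha;", "&#945;", "&#x3B1;"]

# html.unescape resolves each reference to the character it stands for.
decoded = [html.unescape(ref) for ref in references]

# All three should resolve to the single character U+03B1.
all_same = len(set(decoded)) == 1
print(decoded, all_same)
```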

Most articles I read seemed to suggest that it is preferable to convert hexadecimal to decimal and use method 2.
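
The hexadecimal-to-decimal step can be sketched in a few lines of Python. The function name to_decimal_reference is made up for this example; it takes the notation shown on the unicode.org charts and returns a decimal reference:

```python
def to_decimal_reference(unicode_notation):
    """Turn chart notation such as 'U+03B1' into a decimal reference such as '&#945;'."""
    text = unicode_notation.strip()
    # Drop the 'U+' prefix if present; what remains is the hexadecimal number.
    if text.upper().startswith("U+"):
        text = text[2:]
    # int(..., 16) reads the digits as base 16 and yields the decimal value.
    return "&#{};".format(int(text, 16))


print(to_decimal_reference("U+03B1"))
print(to_decimal_reference("U+03B2"))
```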

unicode.org has charts showing the hexadecimal codes. One of the articles I read explained how to convert hexadecimal to decimal, don't remember which one now, without looking it back up. Other web sites list entity, decimal and hexadecimal. Here are a couple (if they get edited out I can send them to people personally):
[htmlhelp.com...]

[cs.tut.fi...]

concerning use of the symbol font:
There are accessibility problems with it, and if it is applied with a font tag, note that the font element is deprecated in HTML 4 and is not part of XHTML 1.1. This page gives more information about why it is not a good idea to do it this way:
[ppeph.gla.ac.uk...]

additional references for the above information:
[alanwood.net...]
[cs.tut.fi...]

tedster - I mentioned verdana because some fonts apparently don't include all characters, from what I understand. Verdana should though, I wanted people to know I wasn't using some hardly-ever-heard-of font that wouldn't be able to display some characters.

I still have 2 more problems to solve: 1) the MySQL/PHP problem and 2) many characters not appearing properly on my computer, probably an OS issue. I'll go ask about these in the appropriate forums.

tedster

12:15 am on Nov 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The support problems that currently exist are the reason I suggested the Symbol font. Sometimes I get pragmatic and toss standards and theory to the wind.

However, from a strict viewpoint, you (and the web page you cite) are totally correct. Search engines would index the English characters, for instance, and that could easily generate confusion.

I'd love to know what you finally settle on using - and why. Looks like every choice involves a hearty trade-off of some kind.

Trisha

2:15 am on Nov 7, 2002 (gmt 0)

10+ Year Member



I hadn't even thought of the search engine issues involved!

I found another problem since my last post, which I hadn't been aware of before either:


... It seems that IE 4.0 behaves very erroneously with any links containing the # character (i.e. links to locations in the same document or links to specific locations in other documents) if UTF-8 encoding is specified ...

I don't use those links very much, except to make "skip navigation/go to main content" type links for accessibility. And accessibility is of course one of the reasons the symbol font shouldn't be used. It does pretty much make it impossible to find a solution that will work in all cases.

My current plan is to use the decimal numeric character reference, but to make it meet WCAG, I suppose I will have to use


<abbr><span lang="gr">decimal numeric character references for greek letters here</span></abbr>

I need abbr so that screen readers would read out the letter names (for example: alpha, beta, etc.) rather than try to pronounce them as a word, although apparently there is very little support for abbr anyway. In this case I'm using it for a science term for an enzyme, which is always referred to with 4 Greek letters. And lang="gr" so they would be recognized as Greek letters (I'm guessing that "gr" is Greek). I need to ask about this at an accessibility mailing list/message board; currently there is a discussion taking place at the w3c-wai-ig mailing list about abbr vs. acronym, so it's good timing I guess. Of course, I'll just confuse the issue by bringing non-English characters into it! :)

(all this and people will still hire the 15 y.o. down the street with a bad WYSIWYG editor who "knows computers", but not HTML - instead of me!)

g1smd

3:17 am on Nov 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




The ISO 8859-1 character set is for Western Europe, but did you know that ISO 8859-7 is the Greek character set? You can use these two if the complexities of UTF-8 overwhelm.

In any case, when you specify a character set for a document, you are not restricted to using only those characters, as you can encode other characters and use those with the document.
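
A small sketch of what this can look like in practice, using Python's built-in codecs. The sample text is arbitrary. With ISO 8859-1 as the document encoding, Greek letters have no byte of their own, so they are written out as numeric character references; with ISO 8859-7 they fit directly:

```python
# Arbitrary sample text containing alpha and beta.
text = "\u03b1 and \u03b2"

# ISO 8859-1 has no Greek letters, so ask Python to substitute
# decimal numeric character references for anything that does not fit.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# ISO 8859-7 does contain these letters, one byte each.
greek_bytes = text.encode("iso-8859-7")

print(latin1_bytes)
print(greek_bytes)
```

Either way the browser ends up with the same characters, provided the declared character set matches the bytes actually sent.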

As for <span lang="gr">, the lang attribute says what the language is (the codes come from ISO 639, where Greek is "el" rather than "gr"), but it does not say what character set is actually being used. That is a separate declaration that has to be made in the document header.

Trisha

3:59 am on Nov 7, 2002 (gmt 0)

10+ Year Member




The ISO 8859-1 character set is for Western Europe, but did you know that ISO 8859-7 is the Greek character set? You can use these two if the complexities of UTF-8 overwhelm.

Yes, I would like to use Unicode for a number of reasons though.


In any case, when you specify a character set for a document, you are not restricted to using only those characters, as you can encode other characters and use those with the document.

Can you explain this further, maybe there is something else about this I'm not yet understanding?


As for <span lang="gr">, the lang attribute says what the language is (the codes come from ISO 639, where Greek is "el" rather than "gr"), but it does not say what character set is actually being used. That is a separate declaration that has to be made in the document header.

Yes, I know this doesn't refer to the character set, but Guideline 4 of the WCAG says to "clarify natural language usage" and "use markup that facilitates pronunciation or interpretation of abbreviated or foreign text"; that is why it is in there.

Eric_Jarvis

10:32 am on Nov 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



for Greek characters in an otherwise English page definitely use Unicode...I'd recommend following the advice from Alan Flavell's pages, he developed them to help mathematicians and physicists build websites using mathematical symbols (and hence a lot of Greek characters)...start from those and you won't go far wrong

isn't ISO 8859-7 for modern Greek, which may differ slightly?
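
One way to see the difference is to test which characters ISO 8859-7 can hold. The sketch below uses a made-up helper name, fits_iso_8859_7, and two sample characters: an alpha with the modern single accent (tonos) and an alpha with a breathing mark as used in polytonic Greek:

```python
def fits_iso_8859_7(text):
    """Return True if every character in text has a byte in ISO 8859-7."""
    try:
        text.encode("iso-8859-7")
        return True
    except UnicodeEncodeError:
        return False


monotonic = "\u03ac"   # alpha with tonos, used in modern Greek
polytonic = "\u1f00"   # alpha with psili, used in polytonic Greek

print(fits_iso_8859_7(monotonic), fits_iso_8859_7(polytonic))

# UTF-8 can represent both characters.
both_as_utf8 = (monotonic + polytonic).encode("utf-8")
```

For plain letters such as alpha or beta in an English page, either encoding can represent them.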