Armenian characters in a web page - How?

         

kapow

6:58 pm on Mar 17, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We are creating a site that will have some pages in Armenian - using Armenian characters. How do you get a web page to display non-western characters? I have been reading about Unicode - but I'm not sure how to make them work :( Is it as simple as this or is something else required?:

1.) use this meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

2.) Get the client to add their text with a CMS.

DrDoc

7:02 pm on Mar 17, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, if it is going to be all Armenian, you may want to specify the most precise character set available for it. Otherwise, yes, utf-8 should work.

penders

10:43 am on Mar 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



1.) use this meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Just to note, the document itself will also need to be saved with UTF-8 encoding (without a BOM, the 'signature'), so your editor will need to support this. And presumably, if you are using a CMS to store content in a DB, then the DB will need to store it as UTF-8 too?
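
One way to check a saved file without relying on the editor is to inspect its raw bytes. The following is only a minimal Python sketch: the helper name `check_utf8_no_bom` and the sample strings are made up for illustration and are not from this thread.

```python
import codecs


def check_utf8_no_bom(data: bytes) -> dict:
    """Report whether raw file bytes are valid UTF-8 and whether they start with a BOM."""
    # codecs.BOM_UTF8 is the three-byte signature b'\xef\xbb\xbf'.
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"valid_utf8": valid, "has_bom": has_bom}


armenian = "Բարև"  # sample Armenian text, assumed for the demo

# Saved as UTF-8 without a signature: the desired state.
print(check_utf8_no_bom(armenian.encode("utf-8")))

# Saved as UTF-8 with a signature: still valid UTF-8, but the BOM is present.
print(check_utf8_no_bom(codecs.BOM_UTF8 + armenian.encode("utf-8")))

# A lone 0xA9 byte (the ANSI copyright sign) is not valid UTF-8.
print(check_utf8_no_bom(b"\xa9 2008"))
```

For a real page you would pass in the result of reading the file in binary mode instead of the sample strings.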

kapow

11:54 am on Mar 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



if it is going to be all Armenian
If possible I would like some English characters on the same page (if that's not possible, those pages will be all Armenian).
the document will need to be saved as UTF-8 encoding as well (without a BOM - signature)
I use Dreamweaver. A typical test page I just created has this in the CSS file:
@charset "utf-8";
And on 'Save As' DW presents a check-box with:
Include Unicode Signature (BOM) (I left it unticked).
Do you think that's enough?

if you are using a CMS to store content in a DB then the DB will need to support/save as UTF-8
Our CMS edits html pages directly and does not use a DB.

jelle76

12:34 pm on Mar 18, 2008 (gmt 0)

10+ Year Member



You can easily mix English characters with Armenian (or even Chinese if you'd like) if you use UTF-8. UTF-8 basically extends the standard ASCII character set with other characters: ASCII characters keep their single-byte codes, and everything else is stored as multi-byte sequences (so the internal representation changes, but that is not for you to worry about).
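
To make the "extends ASCII" point concrete, here is a small Python sketch. The sample words and the helper `byte_lengths` are assumptions for illustration only.

```python
english = "Hello"
armenian = "Բարև"  # sample Armenian text, assumed for the demo

# Plain ASCII text produces exactly the same bytes in ASCII and in UTF-8.
assert english.encode("utf-8") == english.encode("ascii")


def byte_lengths(text: str) -> list:
    """Return how many bytes each character occupies once encoded as UTF-8."""
    return [len(ch.encode("utf-8")) for ch in text]


print(byte_lengths(english))   # [1, 1, 1, 1, 1]
print(byte_lengths(armenian))  # [2, 2, 2, 2]

# Both scripts can sit in the same string and survive a round trip.
mixed = english + " " + armenian
print(mixed.encode("utf-8").decode("utf-8") == mixed)  # True
```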

I do not know how Dreamweaver handles files. The easiest way to see whether your CMS does it right is to try: create a test file and open it in a browser. If the Armenian is displayed correctly, you are on the right track. In your browser's View > Character Encoding menu, verify that the UTF-8 option is selected: that tells you the page was rendered as UTF-8.
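
If you want a known-good test file to compare against whatever the CMS produces, a script can write one. This is a rough Python sketch only: the file name, the page content and the helper `write_test_page` are all invented for the example.

```python
import tempfile
from pathlib import Path


def write_test_page(path: Path) -> bytes:
    """Write a small mixed English/Armenian page as UTF-8 and return the bytes on disk."""
    html = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "<title>UTF-8 test</title>\n"
        "</head>\n<body>\n"
        "<p>English: Hello</p>\n"
        "<p>Armenian: Բարև</p>\n"
        "</body>\n</html>\n"
    )
    # The "utf-8" codec writes no signature; "utf-8-sig" would prepend a BOM.
    path.write_text(html, encoding="utf-8")
    return path.read_bytes()


test_file = Path(tempfile.gettempdir()) / "armenian_test.html"
raw = write_test_page(test_file)
print("Open this file in a browser:", test_file)
print("Starts with a BOM:", raw.startswith(b"\xef\xbb\xbf"))
```

Open the printed path in a browser and check the encoding menu as described above.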

J

penders

1:48 pm on Mar 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you think thats enough?

Sounds OK.

kapow

7:09 pm on Mar 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just found I need to specify a font that works for Armenian too. This seems to work:
font-family: 'Sylfaen', serif;

Can anyone shed more light on this? e.g. is that the most compatible font for Armenian?

mikewm

7:03 pm on Mar 20, 2008 (gmt 0)

10+ Year Member



Penders: "Just to note, the document will need to be saved as UTF-8 encoding as well (without a BOM - signature) - so your editor will need to support this. And presumably if you are using a CMS to store content in a DB then the DB will need to support/save as UTF-8 ? "

Why does it need to be saved in UTF-8? The source file I'm using is in ANSI and it displays in HTML with the headers in UTF-8 because I read some values from the DB that are in UTF-8. In fact, if I encode the source file as UTF-8 and I use session_start or something else that uses the headers, it sends some weird characters first and fails. So all my source files are encoded in ANSI even though the page displays UTF-8 characters in Portuguese and so on.

kapow

7:31 pm on Mar 21, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK - I have now discovered my client's Word docs are in ANSI and so can't be pasted into a UTF-8 web page (CMS).

Can anyone confirm which Windows text editor can convert from ANSI to UTF-8?

There is a bewildering list here:
[alanwood.net...]
I can see that some say they handle this or that format - but I need to get my client to CONVERT from one format to another (ansi to utf8).

Reading other threads I see recommendations for WorldPad and TextPad. Can anyone confirm which editor will convert ANSI Armenian to UTF-8 Armenian?

mikewm

7:55 pm on Mar 21, 2008 (gmt 0)

10+ Year Member



Use Notepad++. It is under the Format menu.
I had the same problem with Portuguese characters. I now have iso-8859-1 and the source file encoded in UTF-8. But it will also work if you have the meta tag set to UTF-8 and the source file in ANSI.

kapow

1:31 pm on Mar 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just to be clear:
- The source files are MS-Word documents in ANSI.
- I want the resulting web pages to be UTF-8 (with no ANSI content on them),
- So I need the client to convert from ANSI to UTF-8 BEFORE pasting into our cms.
- The text files WILL NOT have any html i.e. he is not editing html in the Word or Notepad files. The html will be applied after pasting the converted UTF-8 text into the cms.

So would Notepad++ allow him to paste ANSI text into Notepad++ and then (still in Notepad++) convert to UTF-8?

Or is another text editor better for this?

g1smd

10:26 pm on Mar 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The ODP has a category in Armenian; you could maybe take a look at that for some hints.

penders

12:29 am on Mar 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Penders: "Just to note, the document will need to be saved as UTF-8 encoding as well (without a BOM - signature) - so your editor will need to support this. And presumably if you are using a CMS to store content in a DB then the DB will need to support/save as UTF-8 ? "

mikewm: Why does it need to be saved in UTF-8? The source file I'm using is in ANSI and it displays in HTML with the headers in UTF-8 because I read some values from the DB that are in UTF-8. In fact, if I encode the source file as UTF-8 and I use session_start or something else that uses the headers, it sends some weird characters first and fails. So all my source files are encoded in ANSI even though the page displays UTF-8 characters in Portuguese and so on.

Why does it need to be saved in UTF-8?

If you are typing any Unicode characters directly into your HTML/PHP document then you will need to save the file as UTF-8. I guess you don't have any 'special' chars in the document itself? You are lucky in this respect as the (single byte) characters you are using are a subset of UTF-8.

The document is ANSI but you are telling the browser to display it as UTF-8. Try typing the copyright symbol (Alt+0169 on Windows), save it as ANSI, tell the browser it's UTF-8 and it won't display correctly. Save it as ANSI, display it as ANSI - OK. Save it as UTF-8, display it as UTF-8 - OK. The copyright symbol is not stored as the same bytes in UTF-8 as it is in ANSI: Windows-1252 uses the single byte 0xA9, while UTF-8 uses the two bytes 0xC2 0xA9.
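
The copyright-symbol experiment can also be reproduced outside the browser. Below is a small Python sketch that assumes 'ANSI' means Windows-1252 (`cp1252`), which the thread does not actually state; the helper `decode_or_replace` is invented for the example.

```python
ansi_bytes = "©".encode("cp1252")  # b'\xa9'      (one byte)
utf8_bytes = "©".encode("utf-8")   # b'\xc2\xa9'  (two bytes)


def decode_or_replace(data: bytes, encoding: str) -> str:
    """Decode bytes the way a lenient browser would, marking bad bytes with U+FFFD."""
    return data.decode(encoding, errors="replace")


print(ansi_bytes.hex(), utf8_bytes.hex())  # a9 c2a9

# Saved as ANSI, displayed as ANSI - OK.
print(decode_or_replace(ansi_bytes, "cp1252"))

# Saved as UTF-8, displayed as UTF-8 - OK.
print(decode_or_replace(utf8_bytes, "utf-8"))

# Saved as ANSI, displayed as UTF-8 - the lone 0xA9 byte is invalid,
# so it comes out as the replacement character.
print(decode_or_replace(ansi_bytes, "utf-8") == "\ufffd")  # True
```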

I read some values from DB that are in UTF-8.

They are in UTF-8 and you are displaying them in UTF-8 - OK. The rest of the 'ANSI' document shares the same codes as UTF-8 (possibly by chance, since I guess you are not using any out of the ordinary characters, just the regular a-z, A-Z, 0-9 and basic punctuation. Start using curly quotes etc. and it will be a problem.) - but otherwise OK. You may also be using numeric character references to display any special chars (rather than typing the chars directly in the document), which is again OK.

A test: These two 'ANSI' characters "Ï€" are in fact the single UTF-8 character for the 'Greek Small Letter Pi' (U+03C0). Change the character encoding in your browser to UTF-8 and you will see the UTF-8 character as intended. The other characters remain the same, yet they are ANSI (but share the same codes as UTF-8). This is how your webpage is coping.
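
The same test expressed in Python, as a quick sketch. Again 'ANSI' is assumed to mean Windows-1252 (`cp1252`), and the variable names are only illustrative.

```python
# The Greek small letter pi (U+03C0) is two bytes in UTF-8.
pi_bytes = "π".encode("utf-8")
print(pi_bytes.hex())  # cf80

# Read as ANSI, those same two bytes show up as two separate characters.
as_ansi = pi_bytes.decode("cp1252")
print(len(as_ansi))  # 2

# Read as UTF-8, they are the single character that was intended.
as_utf8 = pi_bytes.decode("utf-8")
print(len(as_utf8))  # 1

# Plain a-z, 0-9 and basic punctuation are identical under both encodings,
# which is why an otherwise 'ANSI' page still looks fine when served as UTF-8.
plain = b"Hello, world 123"
print(plain.decode("cp1252") == plain.decode("utf-8"))  # True
```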

if I encode the source file as UTF-8 and I use session_start or something else that uses the headers, it sends some weird characters first and fails.

Do you get something like:
Warning: session_start(): Cannot send session cache limiter - headers already sent...
?

This sounds as if you are including the BOM (Byte Order Mark) when you save the file as UTF-8? This must be omitted. The BOM appears in the first 3 bytes of the file (although invisible to you in your text editor when viewed as UTF-8). And importantly before your "<?php ...". Unfortunately, as far as I'm aware, PHP does not understand the BOM. It will treat this as output (some weird characters) before the headers are sent and will consequently fail.

In Notepad++ this is Format > Encode in UTF-8 without BOM. In Notepad2 this is the other way round; you explicitly have to request 'with Signature' in order to get a BOM, simply picking UTF-8 does not include it.

See this recent thread for more info on the BOM (and removing it): [webmasterworld.com...]
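
For files that already carry a BOM, removal can also be scripted rather than done by re-saving each one in an editor. A minimal Python sketch follows; the function name `strip_utf8_bom` and the sample PHP line are invented for the example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM; other input is returned unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A PHP source file saved 'with signature': three bytes sit before "<?php".
with_bom = codecs.BOM_UTF8 + b"<?php session_start(); ?>"
clean = strip_utf8_bom(with_bom)

print(with_bom[:8])  # b'\xef\xbb\xbf<?php'
print(clean[:5])     # b'<?php'
```

To apply it to a real file, read the file in binary mode, pass the bytes through the function and write the result back, ideally after making a backup.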

So, you seem to be OK in your particular situation providing you keep to the basic set of characters in your HTML/PHP document or use numeric char refs. Certainly something to be aware of. What is best practise in this case? Have a mixture of character encodings or go 100% UTF-8?

kapow: A typical test page I just created has this in the css file:
@charset "utf-8";

Do external CSS files need to be UTF-8 encoded? Do they contain any content? Do they contain any non-ANSI chars?

So would Notepad++ allow him to paste ANSI text into Notepad++ and then (still in Notepad++) convert to UTF-8?

Yes, I believe so (as mikewm mentions above), but be sure to pick the "Format > Convert to UTF-8 without BOM", not simply the "Encode to" option as this could lose data!
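
The difference between converting and merely relabelling can be sketched in Python. The thread does not say which legacy encoding the client's Armenian text uses, so this example substitutes Windows-1252 (`cp1252`) with Portuguese text as a stand-in; `convert_to_utf8` is a hypothetical helper.

```python
def convert_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Transcode: decode with the real source encoding, then re-encode as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


ansi = "Olá, São Paulo".encode("cp1252")

# 'Convert to UTF-8': the characters are preserved, only the bytes change.
converted = convert_to_utf8(ansi)
print(converted.decode("utf-8"))  # Olá, São Paulo

# Relabelling the same bytes as UTF-8 without transcoding: the accented
# letters are invalid sequences and are lost to the replacement character.
relabelled = ansi.decode("utf-8", errors="replace")
print("\ufffd" in relabelled)  # True
```

The important part is knowing the true source encoding; decoding with the wrong one gives wrong characters without raising any error.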

penders

2:04 pm on Mar 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What is best practise in this case? Have a mixture of character encodings or go 100% UTF-8?

To quote encyclo, from this thread: ANSI, Unicode, UTF-8, and the path of most resistance! [webmasterworld.com]

>> 3.) Should I save all the files as utf-8?
The answer to question three is yes, keep it consistent - go utf-8 for everything by default - even if you think the page only contains ASCII characters you will be saving yourself a lot of hassle in the long run. Note: don't use anything outside the usual ASCII range for PHP function names and such.

mikewm

9:56 pm on Mar 25, 2008 (gmt 0)

10+ Year Member



I did what you said. I converted all my files (not 'Encode in...') to UTF-8 without BOM, set the HTML charset with <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />, and my table's collation in the database is set to utf8_bin. Now, when reading from the database, the correct characters are not displayed. Plain HTML displays well, and data that comes through AJAX, which sends and receives UTF-8, displays well too. So the only thing going wrong is reading UTF-8 from the DB. Must I change the collation from utf8_bin to another one?

penders

10:24 am on Mar 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



...and my table's collation in the database is set to utf8_bin.

I really don't know much about the database specifics here I'm afraid. But isn't the collation just the rules that govern how characters are compared, not how they are actually stored/encoded?
[dev.mysql.com...]

utf8_bin - Binary collation (case-sensitive comparison ?)

Looking around, I would guess that the correct character set encoding to use is simply 'utf8'. And perhaps set after having connected to the DB like:

SET NAMES utf8

Just a thought... if data has already been written to the DB with a latin charset, it may need to be converted... read as latin, written back as utf8? (not sure)
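
As an illustration of what that kind of conversion involves, here is a Python sketch of UTF-8 text that was read back through a Latin character set, and of reversing the mistake. It is equally speculative: Python's `latin-1` codec stands in for whatever charset the database connection actually used, and `repair_double_encoded` is a made-up helper, not a MySQL feature.

```python
def repair_double_encoded(text: str) -> str:
    """Undo one round of 'UTF-8 bytes misread as Latin-1'."""
    return text.encode("latin-1").decode("utf-8")


original = "São"

# What you see when UTF-8 bytes are interpreted as Latin-1: each accented
# letter turns into two junk characters.
misread = original.encode("utf-8").decode("latin-1")
print(len(original), len(misread))  # 3 4

# Reading the junk back as Latin-1 bytes and decoding them as UTF-8
# recovers the original text.
print(repair_double_encoded(misread) == original)  # True
```

This only works if the text went through exactly one wrong decode and no bytes were dropped on the way, so test on a copy of the data first.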

This is rather speculative, however, so maybe the Database forum [webmasterworld.com] can offer more sound advice? In fact I notice that the latest thread "MySQL converting character sets question [webmasterworld.com]" perhaps deals with a related topic (although no replies as yet).

I would be interested to know the outcome of this.