I work strictly in Notepad, so to save Unicode characters I have to change the encoding. One of my questions in this regard is that Notepad gives me several options, including "Unicode" and "UTF-8". What is the difference between them? My understanding is that UTF-16 supports more character sets than UTF-8, though at the cost of larger file size and greater bandwidth in some regards.
When I save normal files (such as my news page) as Unicode or UTF-8, the text on the page renders in one of the Asian languages. However, I have declared all of my pages as being initially written in English, like so...
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
I am forced to save with the "ANSI" encoding on, say, my news page. My language menu is included by all the pages as a PHP include, and that file is saved as UTF-8.
I'm sure this is wrong in some sense. You can get away (as I am now) with serving a page saved as ANSI while declaring it UTF-8, since plain English text comes out the same in both, but the conflict is that the include file is UTF-8 and the main files are ANSI.
Right now I have no browser errors that I know of, just the W3C validator coughing at the point where the language.php file begins (and just that one error). I'm sure this has something to do with the possible conflict of meshing ANSI and UTF-8 saved files.
My XML declaration declares the page as UTF-8, most pages are saved as ANSI, and my language include file is saved with a UTF-8 encoding.
Some people will suggest removing the translation services, but the path of least resistance does not help me solve the problems which I am intentionally trying to take on. What are my best options to sort out this mesh?
John
The first step you will have to take is to give up Notepad. Notepad has a bug (OK, not strictly a bug, but a very undesirable feature). When saving as UTF-8, it adds a byte-order mark (BOM): the character U+FEFF, a zero-width no-break space, placed at the very start of the file to flag the contents as Unicode and, for UTF-16, to show whether the byte order is "little-endian" or "big-endian" (waaay off-topic for this post!). A BOM is not required in UTF-8 and is best avoided on the web, particularly in PHP files, where it gets sent to the browser as output ahead of everything else in the file. Notepad will break your UTF-8 pages, sometimes to the extent that you will need a hex editor to fix them.
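For anyone who wants to check a file without reaching for a hex editor, here is a minimal Python sketch of spotting and removing a UTF-8 BOM. The function name strip_utf8_bom is made up for illustration; the only assumption is that the BOM shows up as the three bytes EF BB BF at the very start of the data.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte-order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data

# Simulate what a BOM-writing editor would save for the text "<?php"
with_bom = codecs.BOM_UTF8 + "<?php".encode("utf-8")
print(with_bom[:3].hex())        # efbbbf
print(strip_utf8_bom(with_bom))  # b'<?php'
```

Files without a BOM pass through unchanged, so it should be safe to run over a whole set of pages.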
If you use a web-friendly text editor then just select UTF-8 and save away. You may need to convert older files, and it is best to be consistent and save as UTF-8 everywhere. Don't forget that one big advantage of UTF-8 is that it is ASCII-compatible, so the transition is easy.
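The ASCII-compatibility point can be checked in a few lines of Python. This is only a sketch: the sample strings are invented, and "cp1252" stands in for what Windows editors call "ANSI" on a Western system.

```python
# Pure-ASCII markup produces identical bytes in all three encodings
text = '<html xml:lang="en">Plain English markup</html>'

ascii_bytes = text.encode("ascii")
ansi_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == ansi_bytes == utf8_bytes)  # True

# The encodings only diverge once a non-ASCII character appears
accented = "café"
print(accented.encode("cp1252"))  # b'caf\xe9'
print(accented.encode("utf-8"))   # b'caf\xc3\xa9'
```

So a page that really is pure ASCII does not change at all when re-saved as UTF-8 (without a BOM); only pages with accented or non-Latin characters need real conversion.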
The specified language is not in direct relation to the charset - English content can be in UTF-8, ISO-8859-1 or US-ASCII, for example. When dealing with a multi-language page, you define as usual the default language, and you place the lang or xml:lang attributes as required on the appropriate surrounding elements for each language.
comprehensive post on this topic
Comprehensive, no, just an introduction:
[webmasterworld.com...]
(also in the forum library [webmasterworld.com])
So...
1.) What is happening with Textpad?
2.) Does crossing ANSI and UTF-8 files (even if the UTF-8 files are correctly coded) create errors?
3.) Should I save all the files as UTF-8?
4.) In short how much of a bandwidth difference does a specific charset save over using UTF-8?
(Not that bandwidth is a concern, more of a curiosity.)
and a little off-topic...
5.) Will working in Textpad with my text color coded make me less hard core? ;)
Thanks folks!
John
1.) What is happening with Textpad?
2.) Does crossing ANSI and UTF-8 files (even if the UTF-8 files are correctly coded) create errors?
3.) Should I save all the files as UTF-8?
Some information on ANSI code pages from Wikipedia (one of the few places with decent information about character encoding, surprisingly):
[en.wikipedia.org...]
The ANSI in question is an encoding for Windows operating systems - a collection of single-byte and multi-byte code pages which are not synonymous with UTF-8. Textpad is most certainly getting confused. Can it open the files saved as UTF-8 by Notepad?
As you can see, mixing two different encodings is very problematic. Notepad may have messed up the UTF-8 pages to the extent that Textpad won't recognize them. Stick to just Textpad; if necessary, abandon the Notepad-saved files and return to the original source to get the characters you need.
The answer to question three is yes, keep it consistent - go UTF-8 for everything by default - even if you think the page only contains ASCII characters you will be saving yourself a lot of hassle in the long run. Note: don't use anything outside the usual ASCII range for PHP function names and such.
4.) In short how much of a bandwidth difference does a specific charset save over using UTF-8?
(Not that bandwidth is a concern, more of a curiosity.)
For a mostly English page it will save a few bytes at most, and legacy encodings have the advantage of backwards-compatibility - so if you have a large IE3 audience you need to worry, otherwise UTF-8 has a clear advantage.
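For the curious, the size difference is easy to measure. The sketch below uses invented sample strings and a hypothetical helper, encoded_sizes; the legacy code pages chosen (cp1252, cp1253, cp932) are just plausible examples for each language.

```python
def encoded_sizes(text: str, legacy: str) -> tuple:
    """Return (bytes in the legacy encoding, bytes in UTF-8) for the same text."""
    return len(text.encode(legacy)), len(text.encode("utf-8"))

print(encoded_sizes("Hello, world", "cp1252"))  # (12, 12)
print(encoded_sizes("Déjà vu", "cp1252"))       # (7, 9)
print(encoded_sizes("Ελληνικά", "cp1253"))      # (8, 16)
print(encoded_sizes("日本語", "cp932"))          # (6, 9)
```

Markup is ASCII either way, so on a real page the overall difference tends to be smaller than these text-only figures suggest.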
An extremely important tool to help with the conversions is iconv, which is available on many (most?) Linux/Unix systems and can convert files easily between different encodings. There are versions floating around the web which can be used on Windows too. There is also a PHP function which uses the library. If you convert your existing site files with iconv then you can just change the specified character encoding at the top of the page and you're more or less set (do a convert from CP-1252 or ISO-8859-1 to UTF-8).
5.) Will working in Textpad with my text color coded make me less hard core? ;)
Yes. ;)
Actually, not really, Notepad is just hard work for the sake of it - and it is a bad tool for building websites. Anyway, Notepad is simply not really hardcore: you should start using a proper operating system [ubuntu.com] (which uses UTF-8 as the default encoding for the entire OS, BTW) instead, then you can use vi over SSH to live-edit config files without taking backups. :)
The hardest part of UTF-8 is the conversion process - once you are up and running and using UTF-8 exclusively then things are easier. The tools do exist now for using UTF-8 on any platform, even though modern Linux distributions have the edge over Windows, which still defaults to its own windows-1252 encoding.
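If iconv is not to hand, the same CP-1252 to UTF-8 conversion can be sketched in Python. This is an illustration rather than a finished tool: convert_to_utf8 and the file name news.html are made up, and it assumes the file really is in the source encoding you name.

```python
import tempfile
from pathlib import Path

def convert_to_utf8(path: Path, source_encoding: str = "cp1252") -> None:
    """Re-save a text file as UTF-8 (no BOM), decoding it from source_encoding."""
    # Work on raw bytes so line endings are left exactly as they were
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))  # the plain "utf-8" codec writes no BOM

# Try it on a throwaway file saved as "ANSI"
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "news.html"
    page.write_bytes("café".encode("cp1252"))
    convert_to_utf8(page)
    print(page.read_bytes())  # b'caf\xc3\xa9'
```

Run it only once per file: converting a file that is already UTF-8 as if it were CP-1252 garbles every non-ASCII character, so keep a backup of the originals.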
Textpad set to use UTF-8 ... Asian characters were converted to question marks ... What is happening with Textpad?
Changes are made under
Configure... ¦ Preferences... ¦ Document Classes ¦ Default ¦ Font
The final comment that needs making is that windows-1252 is the Western European (Latin 1) code page - other scripts (e.g. Greek, Cyrillic, Japanese) have their own code pages.
WARNING: "includes-toolbar-language.php" contains characters that do not exist in code page 1252 (ANSI - Latin I).
They will be converted to the system default character, if you click OK.
Is this a setting in Windows or Textpad?
I've gone to this option and set it as the default for all files...
Configure Menu --> Preferences --> Document Classes --> Default --> Default Encoding --> UTF-8
I've pasted text from Babel Fish (Chinese Simplified) in Firefox directly into Textpad, even after first saving a new file as UTF-8, with no luck.
I've found the UTF-8 BOM option (right-click the document's editing area and open the preferences), though I won't turn that on.
I came across this in Textpad's help file, which seems to be what we're looking for, but I need it dumbed down for me if possible please lol...
How to Work with Unicode Files
Overview:
TextPad automatically detects 16-bit Unicode and UTF-8 encoded characters, when opening files. Unicode characters may be in "little endian" (Intel) or "big endian" (RISC) order, and the order is preserved when a file is saved.
Internally, these files are converted to single or double byte characters (DBCS), using the locale corresponding to the font script selected for the document class. For example, if the screen font for the Text document class is MS Mincho, with the script set to Japanese, Unicode characters in *.TXT files will be converted to the corresponding DBCS characters in code page 932.
WARNING: This means that it is only possible to edit, without data loss, files containing characters from the implied code page. Other characters will be converted into a system default character (normally "?"), if you confirm that is what you want to do.
Conversion:
Conversion between various file formats and encodings can be made using the Save As command on the File menu. The options for encoding are ANSI, DOS, Unicode, Unicode (big endian) and UTF-8.
The Find in Files and Compare Files commands automatically convert files to the internal format, so they can operate independently of character encoding and end of line characters. For example, a file containing UTF-8 characters can be compared to another containing Unicode characters. The code page used depends on the font specified for the "Search Results" and "Command Results" document classes.
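The data-loss warning in the help text can be reproduced outside TextPad. The Python sketch below is only an approximation of what the editor does internally: it pushes an invented sample string through code page 1252 and lets anything unmappable become "?".

```python
text = "Menu: English, Ελληνικά, 日本語"

# Characters with no slot in code page 1252 are replaced by b'?'
as_ansi = text.encode("cp1252", errors="replace")
print(as_ansi.decode("cp1252"))  # Menu: English, ????????, ???
```

Once the question marks have been written to disk the original characters are gone, which is why the warning dialog matters.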
I can convert to Textpad just fine if I can get it to work correctly. I don't know how I'm going to get the original source ... if copying the characters from Firefox or Notepad and pasting them into Textpad does not work, well... there is some sort of byte or BOM related issue at hand. I'm sure my ignorance of the subject is shining right now lol.
The only person who has ever visited my site with IE3 has been me and I'm not worried about that market. ;)
So there is a way to set Windows to use UTF-8 as the OS's default encoding or are you just trying to convert me? Actually I'm putting my first self-built system together from extra parts to build a Linux system so another off-topic question if I may, will I be able to run KHTML/Webcore on Ubuntu?
I'll give those encoding related links more reading through-out this week while I work on the conversion to UTF-8.
Thanks for your help!
John
It is actually easy (honest). You are receiving this error-msg:
contains characters that do not exist in code page 1252 (ANSI - Latin I)
...because the Latin-I script (font) does not contain glyphs to represent some characters in the file (so it uses "?" instead).
Internally, these files are converted to single or double byte characters (DBCS)
For example, if the screen font for the Text document class is MS Mincho, with the script set to Japanese
(Remembering that what I am presenting is all theory) let's say that you want to work with Japanese in UTF-8. You will need to:
In Control Panel ¦ Regional & Language Options ¦ Advanced, make sure that the Code-Pages that you want available are selected.
So there is a way to set Windows to use UTF-8 as the OS's default encoding
Windows is "Unicode" internally (its own variety, UTF-16), and that is not the same thing as UTF-8.
The problem is actually that TextPad is not (windows-)Unicode, and therefore makes use of the old-fashioned code-pages. Damn shame.
Notepad was set to Western script and so was Textpad! So I began changing the font (started with good ol' Arial) and was seeing certain language scripts become available (or disappear) depending on my font choice.
The question now is which fonts/scripts will work?
John
In Textpad I went to...
Configure...
Preferences...
Document Classes...
Default...
Font...
I then set the font to MS Mincho with Japanese script. I pasted Japanese text (?) and it worked!
I pasted Chinese text and one of the symbols was converted to a question mark but the other worked.
I pasted some Greek text and it also worked.
~Now~ my question is, how do I edit in a font I like that I find readable? Do I have to manually go through the entire list of fonts and check each list of supported scripts? Do I have to BUY some sort of Unicode supporting font from the internet?
How is it that Notepad can use Western script with the same font and display Asian characters but Textpad can not? Is this linked to the BOM? It's all slowly starting to make sense...
John
I've had a look, but cannot now find the old VB help articles.
What I do find interesting is that PHP also keeps reminding me of old-VB.
I suspect that NotePad (and the browsers) use modern display Windows-APIs, and not the old ANSI displays, and therefore do not rely on Code-pages. TextPad does rely on the (ancient, still supported) Code pages, and that is where the difficulty is coming from.
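The earlier paste results (Japanese and Greek working under the Japanese script, one Chinese character turning into a question mark) would be consistent with this code-page theory. As a rough check, the Python sketch below lists which characters of a string have no slot in a given code page. The helper name unencodable is made up, and Python's cp932 and cp1252 codecs are assumed to match the Windows code pages TextPad uses.

```python
def unencodable(text: str, code_page: str) -> list:
    """Return the characters of text that the given code page cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(code_page)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

print(unencodable("日本語", "cp932"))   # []
print(unencodable("αβγ", "cp932"))      # []  (the Japanese code page carries basic Greek letters)
print(unencodable("日本語", "cp1252"))  # ['日', '本', '語']
```

Feeding it the Chinese text that was pasted should show which characters the Japanese code page lacks; Simplified Chinese forms with no Japanese equivalent would be the likely candidates.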
More research required.
[alanwood.net...]
I've been checking Alan's site out, I'll post what I find later.
John
We are both going to have to wait for TextPad 5 - any year now.
Robin_reala:
In CSS you specify a list of fallback fonts...
I find myself at another crossroads with Unicode, just as I am dealing with IE6 not working at my site because of my own advancement, and waiting for IE7 to alleviate part of that issue. Unicode support is only a minor item on my radar, but important nonetheless. I'm just glad that I'm not trying to work for clients right now while dealing with all of this! I read somewhere that there is an editor with Unicode support as a selling point...
Let me clarify something first... support (when I say it) either is or is not. If a program "supports" 99.9% of a standard, it does not support that standard; not in my book. Especially when I'm trying to learn everything (not hard to track me), all I really need is to spend a week trying to figure something out, do it correctly in eighty ways, and then find out the programs are the ones to blame. So I wanted to clarify that. ;)
But yes, I read somewhere (and closed the page before it clicked in my head) that someone mentioned there was an editor with Unicode support as one of its main selling points. Do you happen to know what editor that is?
John