Saving an .html as proper UTF-8

Issues with BOM

         

apprentice

2:37 pm on Sep 24, 2006 (gmt 0)

10+ Year Member



Having incorporated some Chinese characters into several pages of a site, it was time to save the HTML as proper UTF-8 rather than the Windows-standard ANSI (or whatever that is). Not suspecting anything, I tried doing that in Notepad, which proved a bad mistake: it automatically adds the byte order mark (BOM), which is a no-no for the web at the moment for several reasons outlined in older threads of this forum.

When opening the Notepad-saved UTF-8 pages, TextPad gives the following error: Warning: "blahblah.html" contains characters that do not exist in code page 1252 ANSI - Latin I. They will be converted to the system default character if you click OK. Since I got no other option, I clicked OK. It seems to me that this automatically drops the UTF-8 encoding, as the Chinese characters appear as ? again. My guess is that in order to fix the damaged pages (thankfully only a few of them) I will have to remove the scrambled Chinese characters and re-save as UTF-8 from within TextPad. The save option has a file format dropdown with the options 'no change', 'PC', 'MAC' and 'UNIX'. I am not quite clear what this actually means for the file, so could someone tell me which is the right choice? My development PC runs XP and the hosting is on a Linux server. Should I choose 'PC' or 'UNIX'? I also presume that saving in TextPad (at least with the default settings) doesn't add the BOM (which should have been discarded when opening the initial UTF-8 file saved from Notepad!?). Am I on the right track or did I miss something?
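
For illustration, here is a minimal Python sketch of what that warning seems to describe. It is an assumption about the behaviour, not TextPad's actual code, and the function names are made up for the example. Code page 1252 has no slot for Chinese characters, so each one is swapped for a ?, whereas UTF-8 keeps them intact and Python's "utf-8" codec writes no BOM.

```python
def to_code_page_1252(s: str) -> bytes:
    """Encode to windows-1252, substituting '?' for anything it cannot represent."""
    return s.encode("cp1252", errors="replace")


def to_utf8(s: str) -> bytes:
    """Encode to UTF-8; Python's "utf-8" codec never prepends a BOM."""
    return s.encode("utf-8")
```

Once the characters have been replaced by ? they cannot be recovered from the saved file, which would explain why the damaged pages need the Chinese text pasted in again.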

Regards.

encyclo

12:51 am on Sep 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have never used TextPad, so I can't confirm what the behaviour would be when saving files as UTF-8 or how it would handle the BOM. An earlier thread in this forum mentions TextPad and the precise settings you need:

  • ANSI, Unicode, UTF-8, and the path of most resistance [webmasterworld.com]

    If the BOM remains a problem, just break out your favorite hex editor and remove it (back up the file first!).

    The PC, MAC and UNIX settings are probably to do with line endings - in Windows, a line ends with \r\n, and in UNIX with just \n.
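
As a rough sketch of what such a setting presumably does (an assumption, since I haven't used TextPad; the function names are hypothetical), line-ending conversion in Python could look like this. It works on raw bytes, which is safe for ASCII, ANSI and UTF-8 files because the bytes 0D and 0A never occur inside a multi-byte UTF-8 character. It would not be safe for UTF-16.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Convert CRLF (Windows) and lone CR (old Mac) line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_windows_line_endings(data: bytes) -> bytes:
    """Normalise to LF first, then expand every LF to CRLF."""
    return to_unix_line_endings(data).replace(b"\n", b"\r\n")
```

Either style is normally fine for HTML served from a Linux host, since browsers accept both.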

    BTW I never really got into using UTF-8 until I switched to Ubuntu Linux, in which all text files are in UTF-8 by default. It seems that better tools and editors exist in Linux compared with Windows.

  • apprentice

    6:59 pm on Sep 25, 2006 (gmt 0)

    10+ Year Member



    O.k. That took me a while to sort out. I think I ought to mention what I went through in case it saves someone else's time trying to fiddle with the BOM.

    As I mentioned at the beginning I made the mistake of using Notepad. Never again for HTML/UTF. It's still brilliant for saving in ANSI I suppose.

    Having broken all the pages on that site with Notepad, I tried TextPad. My issue with that was that I never managed to find a way to paste raw Traditional Chinese characters. I tried what was mentioned in an earlier thread, changing to as many different fonts as my patience would allow.

    Then I tried UltraEdit, which has always been my favourite text editor. I managed to remove the FF FE (ÿþ) character from the already broken HTML, but having no experience in hex editing I felt uncomfortable saving the files as such. So I thought I should go through all the pages (not many, thankfully), load them in Firefox (FF), view source, select all, and paste that for a clean start. Immediately after pasting into UltraEdit, as expected, the BOM wasn't there - but after saving, even with the no-BOM option, the built-in HEX editor suggested that the BOM was back! It was really frustrating.

    I tried using Unired, again saving without BOM - back to UltraEdit's HEX editor and BOM was there. After that it was obvious. I tried the freeware XVI32 and for the same file reported by UltraEdit as carrying BOM, XVI32 showed otherwise! Final test. Pasted the parsed HTML source of a page into Notepad and saved as UTF-8. Did the same using UltraEdit/No BOM UTF-8. XVI32 reported EF BB BF at the beginning of the Notepad-saved file (as expected) and a healthy UltraEdit-saved file starting with 3C 21 (<!). Plus, with UltraEdit I can now paste Traditional Chinese characters in raw and save without it adding the BOM.

    It is just so frustrating that the UltraEdit HEX editor for some reason wrongly shows FF FE at the beginning of every UTF-8 file I created, even though I chose to save without the BOM. Does UltraEdit make this false assumption about the non-existent BOM because of the charset=utf-8 within the HTML?

    Cheers.

    bill

    6:11 am on Sep 26, 2006 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    with UltraEdit I can now paste Traditional Chinese characters in raw and save without it adding the BOM.

    I haven't used UltraEdit on production sites yet, but I'm looking into it. Could you clarify this statement a bit? Did you have to manually edit out the BOM first to be able to use UltraEdit with Chinese on UTF-8 pages? I'm not clear about your process as I'm not familiar with the software you're mentioning.

    Is UltraEdit good for working with UTF-8 BOM-less files right off the bat, or is some additional or interim processing with other software required?

    apprentice

    7:54 am on Sep 26, 2006 (gmt 0)

    10+ Year Member



    There are several points here, I think. First, this approach is still not very practical for a large site, as it involves a manual process that is only really suitable for fixing a few pages of a small site. UltraEdit, at least version 12.10b which I just tried, can indeed save a BOM-less UTF-8 HTML file from scratch. On the downside, you can't trust its built-in HEX editor: no matter what I tried, it always reported the presence of a BOM at the beginning of the HTML I was trying to save. Using another HEX editor, like XVI32, shows that the BOM wasn't actually there. The process I followed, which worked for me, is:

    1) Start UltraEdit -> New document (creates a *.* blank document).

    2) Go to the site (I use FF), view source, select all, copy and paste that into UltraEdit's blank document.

    3) Save as - dialog options: 'example.html', 'all files', 'DOS', 'UTF-8 No BOM'.

    4) Having saved the file as above, you can now paste in raw Chinese characters from a site. If you tried that prior to saving as UTF-8, the characters would be pasted as ?. I'm not sure whether there is an option to default to UTF-8 when you hit 'new document'.

    5) That will give you a UTF-8 BOM-less file from scratch; to verify that, you can use XVI32 or any other HEX editor.

    Didn't change much of the defaults within the UltraEdit Preferences - I just made sure that the file-type defaults to 'DOS' (for convenience purposes). As I said that worked for me so it would be good to try the evaluation version of UltraEdit first just to make sure it does the trick for you as well. Although not practical, it proved quite handy for me that I had broken those few pages with Notepad, without backing them up first. Hope this helps and apologies if I have been ambiguous in my explanation.

    Regards.

    bill

    9:26 am on Sep 26, 2006 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    That makes your process a lot clearer. Thanks for taking the time to explain that.

    I am wondering whether there may be an issue with the FF source viewer that is adding the BOM to the files? Did you try FTPing the files and opening them directly with UltraEdit? I'm wondering whether you'd get the same results.

    apprentice

    9:53 pm on Sep 26, 2006 (gmt 0)

    10+ Year Member



    Just tried that now. I FTP-ed the file across and the BOM is still reported by UltraEdit, whilst XVI32 still claims the BOM isn't there. So it wasn't FF - that's for sure. This is probably an issue with UltraEdit - it has reported FF FE for every HTML file I have tried to open. Maybe it purposely shows it like that for some reason I cannot understand. It doesn't matter though, as I have learned not to trust its built-in HEX editor in anything regarding the BOM.

    Regards.

    apprentice

    10:02 pm on Sep 26, 2006 (gmt 0)

    10+ Year Member



    There is another thing I would like to know. I will be converting all the *.html files of the site to BOM-less UTF-8. I will only be using Chinese characters in a few of the pages, but since the site is small I thought I should do it for consistency. My question is whether I should also convert other text-based files to BOM-less UTF-8, for example *.xml, *.rss, *.css and *.txt - including things such as the SiteMap file. My guess is that I should leave these in their current ANSI encoding. I would be a bit worried about having the robots file in UTF-8 - it's already being ignored by some robots, so it's anyone's guess what would happen if I converted it ;)

    Cheers.

    encyclo

    1:31 am on Sep 27, 2006 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    It all depends on the contents of the files in question. For example, the robots.txt file is unlikely to contain characters outside of the usual US-ASCII charset, so in fact there is no alteration of the file when "converted" to UTF-8 as the code points for US-ASCII characters are the same in US-ASCII, ISO-8859-1, windows-1252 (ANSI) and UTF-8.
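
A quick way to convince yourself of this, sketched in Python (the function name and the sample strings are only illustrative), is to encode the same text in each charset and compare the resulting bytes:

```python
ENCODINGS = ["ascii", "iso-8859-1", "cp1252", "utf-8"]


def encodes_identically(text: str) -> bool:
    """True if the text yields the same bytes in every encoding listed above."""
    try:
        encoded = {text.encode(name) for name in ENCODINGS}
    except UnicodeEncodeError:
        # At least one encoding cannot represent the text at all.
        return False
    return len(encoded) == 1
```

A typical robots.txt such as "User-agent: *" followed by "Disallow: /private/" passes this check, while text containing é or Chinese characters does not.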

    Overall, it is easiest from a development point of view to use one consistent charset throughout the site. If you need to convert a large number of files, you can use the iconv utility present in most Linux distributions (i.e. on most *nix web servers).
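
If iconv isn't to hand (on a Windows development PC, say), the same bulk conversion can be sketched in Python. This is only a rough outline under stated assumptions: the files really are windows-1252 at the moment, and the function names and the "*.html" default pattern are mine, not part of any tool.

```python
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str = "cp1252") -> None:
    """Rewrite one file as BOM-less UTF-8, assuming it is currently in source_encoding."""
    text = path.read_bytes().decode(source_encoding)
    # Working on bytes leaves the existing line endings untouched;
    # the "utf-8" codec does not write a BOM.
    path.write_bytes(text.encode("utf-8"))


def convert_tree(root: Path, pattern: str = "*.html", source_encoding: str = "cp1252") -> int:
    """Convert every matching file under root in place and return how many were converted."""
    count = 0
    for path in sorted(root.rglob(pattern)):
        convert_to_utf8(path, source_encoding)
        count += 1
    return count
```

Run it on a copy of the site: it overwrites files in place, and running it on a file that is already UTF-8 would mangle any non-ASCII characters by encoding them a second time.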

    bill

    7:47 am on Oct 18, 2006 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    I just wanted to clarify here for future reference.

    A Byte Order Mark (BOM) in a UTF-8 file looks like this in a HEX editor:

    EF BB BF

    Correct? To remove the BOM you would simply remove that string, right?
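
For reference, removing exactly those three bytes can be sketched in a few lines of Python, assuming the file really is UTF-8 (the function name is just for illustration):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Only a BOM at the very start of the file is touched; the rest of the bytes are returned unchanged.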

    Above, in #:3096378 you mentioned:

    I managed to remove the FF FE (ÿþ) character from the already broken HTML, but since having no experience in HEX editing I felt uncomfortable saving the files as such.

    Isn't FF FE the UTF-16 Little Endian BOM? I'm seeing this in some UltraEdit text files via the hex editor. Were you suggesting removing that? If FF FE shows up at the beginning of a UTF-8 file, is that an issue? (I'm seeing this on some files that I had supposedly saved as UTF-8 without BOM.)

    I guess what I'm looking for is a more definitive answer as to the hex data that needs to be removed to remove BOM.

    Is the UltraEdit hex editor all that bad? Would it be advisable to get another?

    apprentice

    6:25 pm on Oct 19, 2006 (gmt 0)

    10+ Year Member



    Hi,

    When I saved my HTML from within Notepad as UTF-8, UltraEdit reported it as EF BB BF. I loaded that page in Firefox and copied the code from View Source (I didn't remove the BOM using the HEX editor, as I wasn't too comfortable doing that). Then I saved it as BOM-less UTF-8 using UltraEdit. UltraEdit still reported FF FE at the beginning of every HTML file I have tried so far (on 2 different PCs, running different versions of UltraEdit), so I can only assume that this is a problem with UE's built-in HEX editor. Other HEX editors I tried on the same files, including XVI32 (ver 2.51) and HexEdit (ver 1.03), didn't report the FF FE on my properly saved UTF-8 HTML, suggesting that UE's HEX editor is mistaken. UltraEdit, in my opinion, is probably one of the best editors out there, but I won't be trusting its HEX editor with regard to the BOM. Still, if you use FF's View Source, copy and paste the code into UltraEdit and save as UTF-8, no BOM, you should be o.k. I think this is what is confusing here: UltraEdit is indeed capable of creating BOM-less files, but for some reason its HEX editor always suggests the BOM is there!

    So my suggestion, for a start, would be to use a different HEX editor - like XVI32 or HexEdit, which are both free. That way you can be assured that at least your HTML doesn't contain the BOM. From what I know in my limited HTML experience, the first thing at the beginning of an HTML file should be the DOCTYPE. Anything else could lead to numerous issues, potentially preventing a browser from reading the page properly, or maybe causing other problems with SEs. To be sure that doesn't happen, and to be certain that the BOM isn't there, I would use a HEX editor (not UE's!) to check that the first thing you see in the file is 3C 21 44 etc., which translates to <!D (the beginning of the DOCTYPE tag). That gives me peace of mind with the 4.01S that I use for my HTML - but I am not quite sure what the case would be for a properly served XHTML Strict, where an XML declaration often precedes the DOCTYPE.

    Sorry if I couldn't be of more help - I too am only starting to get to grips with it!
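
For anyone who would rather script this check than open a HEX editor, here is a minimal Python sketch of the same idea. The function name and the wording of the messages are my own, and it only knows the three BOMs discussed in this thread.

```python
import codecs

# Known BOM signatures, checked against the first bytes of the file.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8 BOM"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),  # FE FF
]


def describe_start(data: bytes) -> str:
    """Say whether the data starts with a BOM, with <! or with something else."""
    for signature, name in BOMS:
        if data.startswith(signature):
            return name
    if data.startswith(b"<!"):
        return "no BOM, starts with <!"
    return "no BOM, starts with " + data[:4].hex(" ")
```

Reading the file with open(path, "rb").read(16) and passing the result to describe_start is enough, since only the first few bytes matter.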

    Regards.

    encyclo

    12:56 am on Oct 20, 2006 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Isn't FF FE the UTF-16 Little Endian BOM?

    Yes, this is a UTF-16 BOM - however that doesn't stop some editors adding it when supposedly saving as UTF-8 (or more often "Unicode" without specifying which version - like in Notepad).

    The difference between UTF-16 and UTF-8 is evident when dealing with US-ASCII characters, which are encoded as single-byte ASCII-compatible in UTF-8 but not UTF-16.

    If you encounter a UTF-16 LE BOM then you need to verify how US-ASCII characters are encoded. If they are single-byte ASCII-compatible characters, then your document is UTF-8 with an incorrect BOM - if you use a hex editor to remove the BOM the document should function as UTF-8 when served as such.

    If characters from the US-ASCII range are double-byte encoded (again a hex editor is your friend), then you need to use iconv, which is downloadable or available in most Linux distributions (sorry I don't know what you can use on Windows), to convert the file before editing the BOM.

    Sometimes you can get ISO-8859-1 (or other legacy encoded) documents preceded by a UTF-16 BOM, usually due to encoding mishaps. In this case, you can remove the BOM with a hex editor, then use iconv to re-encode as UTF-8 - assuming you can figure out what legacy encoding was used.
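
The three cases above can be told apart roughly as in the following Python sketch. It is a heuristic under stated assumptions, not a reliable detector: it assumes the file begins with mostly US-ASCII text (HTML markup, for instance), and the function name, the 200-byte sample size and the result strings are made up for the example.

```python
import codecs


def guess_after_utf16_bom(data: bytes) -> str:
    """Guess what really follows a leading FF FE, assuming mostly ASCII content."""
    if not data.startswith(codecs.BOM_UTF16_LE):
        return "no UTF-16 LE BOM"
    body = data[len(codecs.BOM_UTF16_LE):]
    sample = body[:200]
    # In UTF-16 LE each US-ASCII character is followed by a 00 byte,
    # so every second byte of ASCII-heavy text is zero.
    high_bytes = sample[1::2]
    if len(sample) >= 2 and high_bytes.count(0) > len(high_bytes) // 2:
        return "looks like real UTF-16 LE: convert it"
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return "single-byte but not valid UTF-8: probably a legacy encoding"
    return "looks like UTF-8 with a stray BOM: remove the BOM"
```

A pure US-ASCII body is valid UTF-8 as well, so it lands in the last case, which is harmless because the bytes are the same either way.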

    You should never attempt to use UTF-16 on the web as the presence of a BOM is mandatory, and user agent support is limited.

    In every case, it is vital (obviously) to keep backups of the original file. :)

    bill

    1:43 am on Oct 20, 2006 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    If you encounter a UTF-16 LE BOM then you need to verify how US-ASCII characters are encoded.

    I was having this issue with Chinese language files. In particular these were not HTML files so there were no ASCII characters to verify against. I added some English words to the mix and UltraEdit's hex editor is still showing the UTF-16 Little Endian BOM FF FE. On further testing I got the same results with XHTML & HTML files.

    So I grabbed a copy of XVI32, and none of the files shows either a UTF-8 or UTF-16 BOM. Just to be on the safe side I tried out HexEdit as well. Same results. Of the two I liked the XVI32 interface better, but they both return the same results for all files tested.

    Conclusion: The UltraEdit hex editor is screwy? I'm going to have to check their forums to see if this issue has been raised. It's a shame as I like the rest of the editor functions.

    If you encounter a UTF-16 LE BOM then you need to verify how US-ASCII characters are encoded. If they are single-byte ASCII-compatible characters, then your document is UTF-8 with an incorrect BOM - if you use hex editor to remove the BOM the document should function as UTF-8 when served as such.

    How would you verify the ASCII character encoding? Is this done through a hex editor?

    bill

    12:18 am on Oct 21, 2006 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    I heard back from the folks at UltraEdit. According to them the HEX editor is working as designed.

    As encyclo pointed out, Windows does not natively support UTF-8, unlike Linux. So using UltraEdit in Windows, for example, when a UTF-8 format file is loaded it is internally converted to Unicode format for editing and is converted back to UTF-8 format when written to disk. Because of this we are seeing the above issue with the BOM. Therefore the HEX display is accurately representing the state of the file at the time it is being edited.
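
To illustrate that explanation, here is a small Python sketch. It is an assumption about what the editor does internally, based only on the reply above, and the function names are hypothetical. The file on disk is BOM-less UTF-8, while the in-memory copy is the same text in UTF-16 LE with its FF FE mark, which would be what the built-in HEX view displays.

```python
import codecs


def on_disk(text: str) -> bytes:
    """What a 'UTF-8, no BOM' save writes to disk."""
    return text.encode("utf-8")


def in_editor_memory(disk_bytes: bytes) -> bytes:
    """Assumed in-memory form: the same text as UTF-16 LE, preceded by FF FE."""
    text = disk_bytes.decode("utf-8")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

For the text <!D the disk bytes are 3C 21 44, while the in-memory bytes are FF FE 3C 00 21 00 44 00. That matches what XVI32 and UltraEdit's HEX view each reported for the same file.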

    apprentice

    11:45 am on Oct 21, 2006 (gmt 0)

    10+ Year Member



    Thanks for reporting back on that, Bill. A different philosophy adopted by UE, then. They could have highlighted that in their help file in the first place, as the majority out there still use Windows and are therefore likely to be confused by it, as we were.

    Up until recently I used HTML-Kit for developing pages for my site, but that is a bit of a problem now that I have moved to UTF-8 encoding, which HTML-Kit does not support. Does anyone know of a plugin that offers such functionality?

    Regards.