|retro codepage characters|
how to get proper display of CP437 characters
hi everyone! it has been a long time since i was last active in these forums... a lot of stuff has changed over the years and i now find myself seeking assistance for a problem that i don't know how to solve...
the problem is that i have some pages generated on a CP437 system using CP437 characters... specifically the line drawing and box characters used back in the good (bad?) old days of plain jane DOS and dial-up BBSes...
what i want to do is to have these characters properly displayed no matter what character set the browser is using... i've been able to get most of the way toward this goal but... yeah, there's always a but(t) in the way ;)
i have set my CSS to use an @font-face so as to download this font and display these characters properly... for the most part, this works... however, when i use FF's web developer tools and change the encoding back to UTF-8 from ISO-8859-1 or Windows-1252, the characters are "gibberish"... this makes no sense because they are displayed properly when initially viewed with this particular setting at UTF-8 but i get lost trying to track this all as i try to work through it...
the pages are identified with [meta http-equiv="Content-Type" content="text/html; charset=IBM437"] (< and > changed to [ and ]) and the (apache) server (under my control) is not sending anything similar in the headers which might override this... i have verified this with a non-rendering browser...
so, i have a font file in place referenced with @font-face... what i'm thinking is that i need to tweak this font file to include these specific characters in a second cell within the font table... this second spot would be in the UTF-8 positions that the characters occupy... in this way, the characters should be available for all character sets, IBM437, Windows-1252, ISO-8859-1 and UTF-8...
is this a valid theory?
is there something else that i'm missing?
is there a better way for me to handle this without specifically translating these characters when the pages are generated?
|when i use FF's web developer tools and change the encoding back to UTF-8 from ISO-8859-1 or Windows-1252, the characters are "gibberish"... this makes no sense because they are displayed properly when initially viewed with this particular setting at UTF-8 |
Does your html show the "real" characters, or numerical entities? Entities are immune from encoding issues-- but they are otherwise unreadable. Judgement call.
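As a rough illustration of the entity approach (a sketch in Python, not the tool discussed in this thread; the function name cp437_to_entities is made up for the example), CP437 bytes can be turned into pure-ASCII text with numeric entities, so the result no longer depends on which encoding the browser picks:

```python
def cp437_to_entities(raw: bytes) -> bytes:
    """Return pure-ASCII bytes; every non-ASCII character becomes &#NNNN;."""
    text = raw.decode("cp437")  # interpret the DOS bytes as CP437
    # xmlcharrefreplace swaps each character that ASCII cannot hold
    # for its decimal numeric character reference
    return text.encode("ascii", "xmlcharrefreplace")


# 0xC9 0xCD 0xBB is the top edge of a double-line box in CP437
print(cp437_to_entities(b"\xc9\xcd\xbb"))
```

Note that this only handles the non-ASCII characters. It does not escape <, > or &, so that would still need doing separately if the descriptions can contain them.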
Why are you changing encodings at all? Make sure every page has a charset declaration, or set it globally for the site.
No matter how obscure the character, it's in Unicode somewhere, and UTF-8 can encode all of it. Here it sounds as if you're describing the "box drawing" range, which was incorporated specifically because one long-ago OS used it.
:: shuffling papers ::
hex 2500 - 257F (decimal 9472 - 9599)
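For anyone who wants to check that range against CP437 itself, here is a small Python sketch (the names are invented for the example) that uses Python's built-in cp437 codec to sort the high bytes by where they land in Unicode. It assumes Python's mapping matches the CP437 variant in use. The line-drawing characters fall inside 2500 - 257F, while the solid blocks and shades fall just after it, in 2580 - 259F:

```python
BOX_DRAWING = range(0x2500, 0x2580)        # U+2500..U+257F
BLOCKS_AND_SHADES = range(0x2580, 0x25A0)  # U+2580..U+259F


def classify_cp437_graphics():
    """Split CP437 bytes 0x80-0xFF into box-drawing and block/shade bytes."""
    box, blocks = [], []
    for byte in range(0x80, 0x100):
        code_point = ord(bytes([byte]).decode("cp437"))
        if code_point in BOX_DRAWING:
            box.append(byte)
        elif code_point in BLOCKS_AND_SHADES:
            blocks.append(byte)
    return box, blocks


box, blocks = classify_cp437_graphics()
print("box drawing:", [hex(b) for b in box])
print("blocks/shades:", [hex(b) for b in blocks])
```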
Is the sole purpose of your embedded font to display these characters? If so, let users stick with a local font if they've got one. For me that's
:: further shuffling of papers ::
Hm. Buncha CJK fonts, Apple Symbols, Apple Gothic, Arial Unicode, the usual third-party fonts. Someone will know if there's something in the standard Windows package that covers the same range.
:: detour to refresh memory ::
In @font-face rules, as in other CSS font declaration, list possible sources in order of preference. So first "local", then "url".
thanks, lucy... yes, the pages are generated with the actual characters... this is a DOS app that has no clue about anything else other than CP437... in fact, it doesn't even know about that because it simply uses what DOS has available...
as for changing encodings, i assume you are asking why i do this in the browser... i do this because i'm trying to test what others may see... i know that if it works properly for the three main ones (Windows-1252, ISO-8859-1 and UTF-8), then it should work properly for the rest... but that's pure theory...
yes, i am describing exactly those line drawing and box characters :)
every page does have a character set declaration... i've always put one in since i taught myself HTML coding back in the mid-90s ;)
yes, the sole purpose of my @font-face font is for these pages...
i've tried listing those that should work first but something isn't right and i don't know what... "Courier New" should work but doesn't... FreeMono should work but doesn't... Nouveau_IBM should work but doesn't... i've tried numerous others, too, and finally found "Perfect DOS VGA 437" which works for the most part but not when the browser is in UTF-8 mode... the problem i've found is that if there is a local font but it doesn't have the characters needed, "gibberish" (aka the wrong character) is displayed from that local font instead of the others being used for the proper character...
in the end, i suspect that i will have to test my theory from my first post and list only that font in the css for the elements used... i'm just hoping that others will have more insight into this... i've been working on it for several weeks now...
thanks for your response! you did give me the UTF-8 positions of those characters which will make it easier for me to work on my font file for my test :)
When you say "doesn't work" do you mean that the font isn't getting embedded, or that it won't display the characters? Sometimes a font has multiple names and you have to dredge up the precisely correct form for embedding. Another issue that has bitten some people is font substitution. This is A Good Thing and is normally what you want-- but it can lead to mistaken ideas about what characters are available in a given font.
Short of using entities, there's no viable way of making non-ASCII characters display as intended, independent of encoding. But there's no earthly reason to change the encoding once the page is in place. So this is just a one-time change.
Unless, urk, your target audience includes people with antiquated browsers that can't read the charset declaration. And even then, you can fall back on an explanatory note. (I used to do this in ebooks back when there was a realistic possibility that some readers might be stuck in MSIE 5. That was, as you can imagine, some time ago.) "If parts of this page display as garbage, you may need to change your browser's 'character set' to UTF-8."
How is all this happening, mechanically? That is, how do you get from
Point A: ancient DOS document with 'box drawing'
Point B: HTML file containing those same characters
Is there a database involved, or is everything getting routed via a helpful text editor that changes encodings on request?
:: memo to self: find out whether Text Wrangler can change encodings in bulk (SubEthaEdit is excellent w/r/t encodings, but only works on individual open documents) ::
This is a fun question :)
"doesn't work" means that the proper characters are not being displayed...
the path is as follows:
1. new files arrive on system via FTN (Fidonet Technology Network) transfers.
2. the files are processed and their descriptions extracted and inserted into the BBS' proprietary files database format. there is no SQL involved... everything is 16bit DOS using CP437...
3. a tool, written in Borland Pascal 7, extracts the file names, sizes, dates and descriptions and creates the complete htm(l) files directly from the data. filenames are limited to 8.3 so index.htm is what is created in each file area directory for the files it contains...
i wrote the tool which creates the whole htm(l) page... header, body, etc... it is very simple and there is no conversion from anything to anything else... in the header, the charset (IBM437) is specified as previously noted... the elements used on the page carry ids and/or classes so that CSS can be used... the CSS file is maintained manually and covers pretty much the entire site... the several hundred files areas are one small portion of the site... i cannot control what the creators of the files use in their descriptions... there are some that like to use ASCII drawings for logos and frames around the descriptions... some do more and the description is actually an ASCII drawing which may use "high ASCII" characters like the line drawing and box characters... i simply want to display them properly to everyone that views my files area pages...
it has been a fun mystery to deal with... it has also been aggravating at times when it looks good on machineA and completely different on machineB... sometimes both machines are running the same browser and other times not... sometimes the same OS and other times not... testing testing testing... there are times when i remember why i left the IT industry and stopped coding for others with unrealistic demands ;)
Did you see this:
Google "HOWTO Display 437 art on non MS-DOS displays"
.. include the quotes for the 'exact word or phrase' result.
This seems much safer than relying on the end user's browser to have an antiquated encoding that hasn't been used in eons. Especially if there's a chance your end user will try to cut-and-paste.
@Jonsey: no, i have not seen that... i will take a look at it... thanks!
@lucy24: this is why i'm looking at the theory of copying the characters to their UTF-8 positions in the font table so that they work for both "normal" and UTF-8... this is the whole idea behind using a downloadable font... isn't it? so that it works for all systems?
You may have misunderstood what character encoding is all about. The only way to achieve an "encoding-neutral" font is by replacing the first 128 characters-- the ASCII range-- with glyphs that look like your target script. When I was in school I had to do this for Greek and Sanskrit; you can still find a lot of Inuktitut online in assorted legacy fonts. I kinda think I've even got a Tagalog font that does the same thing. (Do not ask me how or why I acquired a Tagalog legacy font. I have absolutely no idea.)
You do not want to do this. The unicode standard was developed precisely so you wouldn't have to.
Offer an embedded font if you have reason to believe your users won't be able to read the text as intended. But don't rely on the font. Ideally your CSS would give a long list of font names, running through all the ones likely to be present on major operating systems, with your custom font at the very end. That way users only have to resort to it if they don't have any other font that can display the characters.
|so that they work for both "normal" and UTF-8 |
There is no such thing as "normal". Once you go beyond ASCII, decisions have to be made. That's why there are dozens upon dozens of legacy encodings. Open your box drawings in a text editor that lets you change encodings on the fly, and watch as the boxes magically jump from é to Â to Œ ̏ (I just made that up)-- and even into different scripts if you're really feeling playful.
What you need to do is to add one more step to the program that currently processes the files. Let it globally replace every byte in the 80-FF range with the equivalent Unicode character: the box-drawing ones land in the 2500-whatever-it-was range, the accented letters and symbols land elsewhere. Some parts may be tricky because one-byte encodings tended to use the 80-9F sector for printable characters, while Unicode reserves those code points for control characters. But it's possible to work around this.
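A minimal sketch of such a replacement step, written in Python rather than in the Pascal of the actual tool, and assuming Python's built-in cp437 codec matches the characters the DOS system produces. The names build_table and cp437_to_utf8 are invented for the example. The idea is a 256-entry lookup table, built once, which is the kind of structure that could also be written out as a constant array in another language:

```python
def build_table():
    """Map each byte 0x00-0xFF to the UTF-8 bytes of its CP437 character."""
    return [bytes([b]).decode("cp437").encode("utf-8") for b in range(256)]


TABLE = build_table()


def cp437_to_utf8(raw: bytes) -> bytes:
    """Translate a CP437 byte string to UTF-8, one table lookup per byte."""
    return b"".join(TABLE[b] for b in raw)


# ASCII bytes pass through unchanged; high bytes grow to 2 or 3 bytes each
print(cp437_to_utf8(b"+-\xc4\xc4-+"))
```

Two caveats. Not every high byte ends up in the box-drawing range: byte 80, for instance, is a capital C with cedilla in CP437 and maps to U+00C7. And Python's codec treats bytes 00-1F as control characters, so the DOS glyphs sometimes shown for those positions (smileys, arrows and so on) would need their own table entries if they matter.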
i really didn't want to have to change the program that creates the files because it also creates other files for other uses... but if i/we are to be forced to use UTF-8 then i guess that's the best option :|
"encoding-neutral" is apparently what the font i'm currently using is doing because the character table is exactly the original 256 characters of (what eventually became) CP437... as far as i can tell, it works perfectly... except when the browser is in UTF-8 mode :(
That's because the browser isn't really displaying the characters. It's displaying codepoints 80 through FF in one specific encoding. See what happens if you change the encoding manually (probably on the "View" menu) to any random other one: Windows-Latin-1, Mac Roman, that long list of Central European forms.
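The same experiment can be run outside a browser. This is only a sketch (the sample bytes and the views helper are made up for the example): it takes three bytes that form the top edge of a double-line box in CP437 and decodes them under several encodings, which is roughly what happens when the encoding is switched by hand:

```python
SAMPLE = b"\xc9\xcd\xbb"  # top edge of a double-line box, as CP437 bytes


def views(raw: bytes) -> dict:
    """Decode the same bytes under several encodings."""
    result = {}
    for name in ("cp437", "latin-1", "cp1252", "mac_roman"):
        result[name] = raw.decode(name)
    # These bytes are not valid UTF-8 as a whole, so invalid sequences
    # are replaced with U+FFFD instead of raising an error
    result["utf-8"] = raw.decode("utf-8", errors="replace")
    return result


for name, text in views(SAMPLE).items():
    print(f"{name:10} {text!r}")
```

The bytes never change. Only the table used to interpret them does, which is why a font on its own cannot make the page encoding-independent.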
I went and checked. Both SubEthaEdit and TextWrangler can do 437 ("Latin-US"). So I guess it isn't that obscure.
You don't have to "change" the program. Just add an optional loop for these files.
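One possible shape for such a loop, sketched in Python as a separate post-processing pass so that the generating program stays untouched. Everything here is an assumption made for illustration: the function name convert_tree, the idea of running it over a directory tree after generation, and the exact text of the charset declaration it looks for.

```python
from pathlib import Path

OLD_DECL = "charset=IBM437"  # declaration assumed to be in the generated pages
NEW_DECL = "charset=UTF-8"


def convert_tree(root, pattern="index.htm"):
    """Re-encode matching files under root from CP437 to UTF-8, in place.

    Files that no longer carry the IBM437 declaration are skipped, so
    running the pass twice does not convert a page a second time.
    """
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        raw = path.read_bytes()
        if OLD_DECL.encode("ascii") not in raw:
            continue  # already converted, or not one of the generated pages
        text = raw.decode("cp437").replace(OLD_DECL, NEW_DECL)
        path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted
```

It would be called with the top directory of the file areas, for example convert_tree("files") where "files" is a placeholder path, and it returns the list of pages it rewrote. The other output files the program creates are left alone because they do not match the pattern.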
Punch line: Thanks to my previous post, which threw in some random non-ASCII characters, the browser has now decided that the encoding of the present page is
:: drumroll ::
DOS 437, apparently, since one of my random letters now shows up as part of a box. (Edit: Not really. It appears to be something Cyrillic, but I can't place it. Two cyrillic letters, a box, and a decimal entity. It looks as if the page was originally in Latin-1.)
Further punchline: Nobody today even needs box drawings, do they? You achieve the same result by setting borders on table cells.
Waldo Kitty, you can use synchronet's content handler asc_handler.js to load up a file with highascii and display it with the codepage you are looking for.
with some modification, you can even have it spit it out as a new file.
view the source here to a file run through this content handler [eob-bbs.com...]
@mro1337: :LOL: funny seeing you here! thanks for the pointer... this is actually for a different system than the synchronet bbs i help to admin ;) that one is on *nix but the system in question is running on OS2 and is pure DOS 16bit stuff... i'll take a look there as well, though... i might be able to extract some ideas on how i will handle this situation... the tool i wrote, mentioned above, is being requested for use by others in the bbs world so i'm trying to figure this out and come up with a solution that will work for others as well as the systems i use it on now :)
@lucy24: that's funny that your browser decided to do that :) as for borders and such, they only work on HTML pages... i'm looking at these not only working on HTML but also (still) "plain text" "ASCII" pages... for many of us, there is ASCII and "high-ASCII" even though many keep trying to use different terms... sometimes it really aches in the bones to be such an old system/network admin and i'm not all that old! :lol:
Waldokitty you can create a utility that generates html output the way synchronet's ascii handler does. or you can just simply use synchronet on a 32bit system and use jsexec and the handler script to output the html you need. look at the source to the .html link i posted and you can see how deuce rigged it to output the correct chars.
btw, i found this thread because i use a google search string for bbses and have it show sites in the past few months.