|Byte encoding woes (character mojibake)|
My file looks fine in the editor, but in the browser it turns to gibberish.
I'm trying to use "special" characters like curly quotes (“”), em dashes (—), etc., but in the browser they come up looking like garbage.
I'm using CFEclipse, and under Window/Preferences/General/Workspace, then in the box Text file encoding, I have set it to "Other: UTF-8."
On the page, I have the content-type meta tag set to "text/html; charset=UTF-8"
According to the server test I ran, it is outputting the file in UTF-8.
Yet still the browser outputs curly quotes like this: "â€œGetting Things Doneâ€&?"
Is there another place this could be getting screwed up along the line?
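For what it's worth, that exact garbage can be reproduced by encoding the curly quotes as UTF-8 and then decoding the bytes as Windows-1252 (Cp1252). The sketch below is only an illustration of that mismatch; the function name is made up, and it does not tell you which step in your setup is doing the wrong decode.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as Windows-1252."""
    utf8_bytes = text.encode("utf-8")
    # errors="replace" substitutes U+FFFD for bytes that Windows-1252 leaves undefined
    return utf8_bytes.decode("cp1252", errors="replace")


left = misread_as_cp1252("\u201c")   # left curly quote, UTF-8 bytes E2 80 9C
right = misread_as_cp1252("\u201d")  # right curly quote, UTF-8 bytes E2 80 9D
print(left, right)
```

The left quote comes out as "â€œ". The right quote ends in byte 0x9D, which Windows-1252 does not define, so its third character has no proper mapping. That would be consistent with the odd trailing character in the output quoted above.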
Adding the content-type meta tag doesn't work if the web server you are using sends its own Content-Type HTTP header, because the header takes precedence over the meta tag. You should use an HTTP server header checker (available as a Firefox add-on and on some tool websites) to see which non-visible information is attached to each page sent to the browser from your server.
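If you would rather check the header from a script than from an add-on, something like the following should do. It is a minimal sketch using only the Python standard library; the function names are my own, and the URL you pass in is whatever page you are testing.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_header(content_type: str):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None when missing
    return msg.get_content_charset()


def charset_from_url(url: str):
    """Fetch a page and report the charset the server declares in its header."""
    with urlopen(url) as response:
        return response.headers.get_content_charset()


print(charset_from_header("text/html; charset=UTF-8"))
```

If charset_from_url returns something other than utf-8 (or None), the server header and the meta tag disagree, and the browser will generally follow the header.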
Search in your favorite search engine for encoding curly quotes in HTML.
The first page that came up for me [dwheeler.com] describes how to correctly encode them.
What scripting language are you using? Many of them have an HTML encode method that converts these characters into the correct character references, e.g. Left Double Quotation Mark (“) becomes &#8220;.
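As an illustration of that kind of conversion, here is a small Python sketch (the helper name is hypothetical). It uses the built-in "xmlcharrefreplace" error handler, which turns every non-ASCII character into a numeric character reference. Note that it does not escape markup characters such as < or &.

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_char_refs("\u201cGetting Things Done\u201d"))
```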
Like I said, I checked the server headers, and they're being sent out as UTF-8.
Also, I don't want to encode characters in HTML. The goal is to use UTF-8, a universal character set, so that I don't have to encode characters. I'm lucky enough to use English, so there aren't too many occasions where I have to use accented characters, but if I were Icelandic, my code would quickly become unreadable if I had to encode every single non-standard letter form.
I seem to have solved the problem, though it feels a bit dirty. The text file is encoded as Cp1252 in Eclipse, and when sent over the server, the special characters hold true. I'm not sure why a single UTF-8 encoding gets choked up while a dual encoding appears fine, but it's working for me now.
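One possible explanation, and this is only a guess about the setup, is that the server reads the source file as Cp1252 regardless of how the editor saved it, and then re-encodes its output as UTF-8. The sketch below models that assumption with two made-up functions; it is not how ColdFusion is known to be configured here.

```python
def serve(file_bytes: bytes, server_reads_as: str) -> bytes:
    """Model a server that decodes the source file, then emits UTF-8."""
    text = file_bytes.decode(server_reads_as)
    return text.encode("utf-8")


def browser_view(response_bytes: bytes) -> str:
    """Model a browser that trusts the UTF-8 header."""
    return response_bytes.decode("utf-8")


sample = "\u201cGetting Things Done"
saved_utf8 = sample.encode("utf-8")      # editor set to UTF-8
saved_cp1252 = sample.encode("cp1252")   # editor set to Cp1252

# Assumption: the server always reads source files as Cp1252
garbled = browser_view(serve(saved_utf8, "cp1252"))
clean = browser_view(serve(saved_cp1252, "cp1252"))
print(garbled)
print(clean)
```

Under that assumption the UTF-8 file gets decoded with the wrong code page before it is sent, while the Cp1252 file matches what the server expects and comes out as valid UTF-8. If this is what is happening, telling the server the real encoding of the source file would be the cleaner fix.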
If anyone can provide further insight into this bizarre behind-the-scenes world, I would surely appreciate it.
For any issue with incorrect encodings and UTF-8, first make sure the environment works end to end; then you can check whether the editor in use is the problem.
I don't know what server language you are using but do this:
Create a simple form with a text area box
Insert/Copy or type in the foreign string.
Upon form submission save the result in the database
Check with a db tool to see whether the string in the db is stored exactly as you typed it.
Then retrieve the text from the db, send it to the client end (browser), and see if it shows as you expect.
If it's ok, then the problem is likely with the editor or its settings. Otherwise backtrack, see if it's the db encodings or some other filtering function on the server end that changes things before sending the text to the client.
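The database part of the steps above can be sketched as a round-trip check. This example uses an in-memory SQLite database purely as a stand-in; the table and function names are invented, and your real test would go through your own form, server code, and database.

```python
import sqlite3


def round_trip(text: str) -> str:
    """Store a string in a database and read it straight back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE notes (body TEXT)")
        # Parameter binding keeps the string intact and avoids SQL injection
        conn.execute("INSERT INTO notes (body) VALUES (?)", (text,))
        (stored,) = conn.execute("SELECT body FROM notes").fetchone()
        return stored
    finally:
        conn.close()


original = "\u201cGetting Things Done\u201d"
print(round_trip(original) == original)
```

If the comparison fails in your real environment, compare the stored value with the original at each hop (form submission, server code, database) to find where it first changes.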
Thanks, enigma1, that's the kind of answer I was looking for. Here is what I found:
I'm using Microsoft SQL Server 2005, and if you set the datatype to nvarchar, it should accept Unicode text. If I input a foreign character directly into the database, things work okay. In the database view, I just see hollow boxes, but when I output it to the browser, I see the correct characters.
However, when I'm submitting via the form, I only get question marks in both views.
For security reasons, when a user submits data from forms, it goes through a ColdFusion validation tag (cfqueryparam with cfsqltype set to cf_sql_varchar). Could this be what is messing up the characters (nvarchar vs. varchar), or is it likely something else on the server? Like I said, the server headers check out as UTF-8.
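Question marks are the typical symptom of Unicode text being squeezed into a single-byte code page somewhere along the way. The sketch below models that narrowing in Python; the function name and the choice of Cp1252 are assumptions, and it does not prove that cfqueryparam is the step responsible.

```python
def narrow_to_code_page(text: str, code_page: str = "cp1252") -> str:
    """Model a lossy conversion from Unicode to a single-byte code page."""
    # Characters the code page cannot represent are replaced with "?"
    return text.encode(code_page, errors="replace").decode(code_page)


print(narrow_to_code_page("\u65e5\u672c\u8a9e"))  # three Japanese characters
print(narrow_to_code_page("\u201c"))              # left curly quote survives
```

Characters that exist in the code page survive; anything else is lost for good, so the question marks cannot be repaired after the fact.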
It would seem that either the tool you use to browse the database or the database configuration isn't set up for UTF-8.
First, in the database, what you type in is what you should see. So check a few things.
- Check that the database and all character tables/columns are set up for a UTF-8 collation such as utf8_general_ci. That name is from MySQL; I am not sure about MS SQL, but there should be an exact equivalent.
- When you connect to the database, make sure you set the connection charset to UTF-8 as well.
Second, when you edit directly in the db with your tool, you should see all the characters exactly as you typed them, without encodings. If you copy something from your db browsing tool into Notepad, you should see the exact characters you originally typed; passing through the clipboard should not alter anything. If you see the hollow squares in Notepad, that means the db isn't set up right.
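Hollow squares can also appear simply because the font in use has no glyph for a character, so it may help to look at the code points rather than at the rendering. A small sketch (the helper name is made up) that lists what a string actually contains:

```python
import unicodedata


def describe(text: str):
    """List each character's code point and Unicode name."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for code_point, name in describe("\u201c?"):
    print(code_point, name)
```

If the value read back from the db lists the expected code points, the data is fine and only the display is at fault. If it lists QUESTION MARK where a foreign character should be, the character was lost before or during storage.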
That's the whole idea behind UTF-8: it supports the characters of all major languages without needing special encodings.