Character Encoding Issues

Weird Characters Showing Up

         

cabbagehead

10:07 pm on Nov 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have built a CMS for a client. The client is copy-pasting text from Microsoft Word into the CMS and saving it. When the CMS content is later displayed, several of the characters are shown as weird junk characters like: β€” or ’.

I fixed this at one level by converting the MySQL database to UTF8. This fixed it insofar as my computer is now displaying those characters properly, but apparently some other computers are not. They are still showing weird characters.

Does anyone know what the remaining problem might be? Are some client machines missing the fonts needed to display those characters? Or are they perhaps not being told to interpret the text as UTF8?
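
A minimal sketch of how junk such as ’ can arise, assuming the stored bytes are UTF-8 but the browser reads them as windows-1252. It is written in Python rather than PHP purely for illustration, and the helper name misread_as_cp1252 is hypothetical, not something from the CMS.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as windows-1252.

    This mimics a browser that was never told the page is UTF-8.
    errors="replace" is needed because a few byte values (0x9D, for
    example) are undefined in Python's cp1252 codec.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


# U+2019 (Word's curly apostrophe) is three bytes in UTF-8: E2 80 99.
# Read one byte at a time as windows-1252, they become three characters.
garbled_apostrophe = misread_as_cp1252("\u2019")

# U+2014 (Word's long dash) is E2 80 94 in UTF-8.
garbled_dash = misread_as_cp1252("\u2014")
```

If the junk always comes in groups of two or three characters that start with an accented letter, the data is probably fine and the page is only being read in the wrong encoding.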

cabbagehead

2:25 am on Nov 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I found my own solution here. Looks like I just needed to add a meta tag specifying UTF8 in the HTML. Everything seems good now. :)

eelixduppy

6:31 am on Nov 10, 2006 (gmt 0)



Glad you solved it, cabbagehead! ;)

cabbagehead

11:01 am on Nov 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Spoke too soon. I thought I had resolved this by adding that UTF8 meta tag. Somehow, however, I'm now seeing a ton of square/question mark characters that I was not seeing earlier today. It seems I've traded one problem for another.

Any thoughts? :(

alfaguru

11:21 am on Nov 10, 2006 (gmt 0)

10+ Year Member



I'm no expert on character sets, but when they copy/paste from MS Word I suspect IE (is it IE?) is failing to translate correctly to UTF8.

I'd do some experimentation on the side to see if there's a reliable way to detect and deal with the problem characters at input time.

johnjoyce

8:10 am on Nov 12, 2006 (gmt 0)

10+ Year Member



not only do you need to specify UTF-8 in the xhtml/html
you also need to be sure you are saving the file itself in this encoding!
MS Word is not good for a text editor. It will often insert invisible characters that Word reads for document formatting.
Also do not use .rtf files. They will have this problem as well.
Make sure you are using a "text editor" and not a "word processor".
UTF-8 and UTF-16 are two different encodings of the same Unicode character set, so both can represent the same text, but their bytes are not interchangeable. You really do have to make sure your document is saved as UTF-8 in the text editor's settings/preferences.

Well-formed html/xhtml is also critical. This means your file must be correct. Use the W3C's validators to make sure your file is valid. The validator will check the encoding declared and tell you if it is actually correct. Just declaring UTF-8 in the meta tags doesn't make it so; the file itself has to be written and saved as UTF-8.
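
One way to check whether content really is UTF-8, rather than merely declared as such, is to try decoding its bytes strictly. This is a minimal Python sketch; the function name is_valid_utf8 is hypothetical, and a pass only means the bytes are well-formed UTF-8, not that they were produced by the intended conversion.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # strict mode: raises on any malformed sequence
    except UnicodeDecodeError:
        return False
    return True


# Curly quotes saved as real UTF-8 pass the check.
saved_as_utf8 = "\u201cok\u201d".encode("utf-8")

# The same quotes saved as single windows-1252 bytes (0x93, 0x94) do not.
saved_as_cp1252 = b"\x93ok\x94"
```

Pure ASCII text passes either way, so the check only becomes informative once the content contains characters such as Word's curly quotes.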

Now, one problem is that PHP is not encoding-aware: its strings are plain sequences of bytes. So it is best to have it take the text from a file outside of the .php file and store that text in a variable.
It's more work in the beginning, especially for simple things, but infinitely more reusable and maintainable, because you are not mixing PHP and HTML/XHTML code in the same file. The receiving user-agent (browser) never knows the difference. The html/xhtml sources can be .txt files encoded as UTF-8 whose contents are used by PHP to create that dynamic web page.

If you don't do these things, results will be unpredictable, and will depend on the "user-agent" that reads and renders the page. (usually a web browser)

cabbagehead

11:31 am on Nov 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I tried having both the HTML and the DB set to UTF-8 and still the problem kept happening, so I finally gave up and just wrote a function to find/replace the most common characters. Sucks to have to do that but I didn't see any other reasonable solution and the client insists on using MS Word and was getting frustrated with the issue ... so I just opted for the easy solution. :)

Thanks for the input all the same!
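
A find/replace of this kind might look roughly like the following Python sketch. The original function was not posted, so the character list and the name plain_punctuation are assumptions chosen for illustration, and the sketch assumes the text has already been decoded correctly.

```python
# Common Word "smart" punctuation mapped to plain ASCII stand-ins.
# This list is an assumption; the original function may have covered
# a different set of characters.
WORD_PUNCTUATION = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
}

_TABLE = str.maketrans(WORD_PUNCTUATION)


def plain_punctuation(text: str) -> str:
    """Replace Word's typographic punctuation with ASCII equivalents."""
    return text.translate(_TABLE)


sample = plain_punctuation("\u201cHi\u201d \u2014 it\u2019s\u2026")
```

This sidesteps the encoding question rather than answering it: any character outside the table is passed through untouched.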

encyclo

8:19 pm on Nov 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are a couple of things you can do. Firstly, the browser should be able to handle the content in a textarea as being the same charset as the one defined on the page itself. However, you can help the browser handle the text correctly by setting the charset for the form as well:

<form action="script.php" method="post" accept-charset="UTF-8">
...
</form>

In IE in particular this will ensure that windows-1252-specific encodings such as curly quotes etc. from Word are transmitted (and therefore added to your database) encoded as UTF-8.

If you are running a Linux or other Unix-based server, you can use iconv to convert your existing windows-1252 files and data to UTF-8. This, added to HTTP headers and meta elements specifying UTF-8 on every page (put the meta element before the title element!), should allow your application to run smoothly in UTF-8.

As you are using PHP, you should at least use iconv instead of a home-grown solution for converting the data. See:

[gnu.org...]
[php.net...]
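
For illustration only, the conversion iconv performs here (windows-1252 in, UTF-8 out) can be sketched in a couple of lines of Python. The function name cp1252_to_utf8 is hypothetical, and the sketch assumes the input really is windows-1252; running it on data that is already UTF-8 would double-encode it.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Reinterpret windows-1252 bytes as text and re-encode them as UTF-8."""
    return data.decode("cp1252").encode("utf-8")


# Word's curly double quotes are the single bytes 0x93 and 0x94 in
# windows-1252; after conversion each becomes a three-byte UTF-8 sequence.
converted = cp1252_to_utf8(b"\x93Hi\x94")
```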

cabbagehead

9:28 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok - I have taken all discussed steps:

1. UTF8 meta tag in the head of each page
2. accept-charset="UTF-8" in each form
3. converted the dB to UTF8

...now it is supporting special characters and glyphs etc from other languages ... but I am still getting those empty boxes in place of common MS Word formatted characters such as apostrophes, quotes, the "..." character and the extended dash "-".

Any thoughts on those? This stuff is driving me batty! I thought I had it for sure this time. :(

cabbagehead

9:31 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok - this is a test to see how WebMasterWorld handles some of these characters I'm talking about:

“Can you still hear me?” — this is a test

cabbagehead

9:32 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Aaaaagh! Looks like WebmasterWorld handles them fine! And I don't even see a UTF8 declaration at the top of the freaking page!

What the hell!

<<hairs being ripped from head one by one>>

cabbagehead

9:36 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Note the characters appear correct in the DB ... it seems the problem is with the representation of those characters on the HTML page.

henry0

9:41 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd like to know what your CMS is based on.
Is it a "pure" textarea with some formatting of your own and an image loader,
or is it an existing text editor grabbed from the free GPL offerings?
I have seen many of those editors act weirdly when any MS doc is pasted in.

cabbagehead

9:51 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It is a custom CMS I've built using PHP. It's just a standard <textarea> input field the text is being entered into.

At this point, however, my problem does not appear to be input - the data looks fine in the dB. The problem is that these characters are not being represented properly on the output page. I'm just getting stupid squares/boxes where apostrophes and quotation marks should be.

henry0

9:56 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have seen that and was getting mad about it.
Is a Mac involved in this by any chance?
I had a partner sending me files from a Mac that produced what you are experiencing when opened in my Ultra Edit (text editor).

<EDIT>
Same thing with DW files sent to me from a Mac.
</edit>

cabbagehead

10:04 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok - I just found this out:

If I *REMOVE* the UTF8 declaration (meta tag on display page) then those MS Word characters show properly...but none of the glyphs or special characters work in that case. Conversely if I add back the UTF meta tag then just the opposite is true.

Sigh.

encyclo

10:57 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What is the browser telling you when you add and remove your UTF-8 declaration? In Firefox (this is the easiest), press Ctrl + I to open the page info dialog. The Encoding is listed there.

If the MS-Word curly quotes work and Firefox says windows-1252 or ISO-8859-1, then those quotes are not UTF-8 encoded.

How did you convert the database? Did you specify ISO-8859-1 to UTF-8 or windows-1252 to UTF-8?
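
The reason the source encoding matters can be shown with a short Python sketch (the variable names are only illustrative). ISO-8859-1 and windows-1252 agree everywhere except bytes 0x80 to 0x9F, which is exactly where Word's curly quotes, long dashes and ellipsis live in windows-1252. ISO-8859-1 maps those bytes to invisible control characters, which browsers commonly render as boxes or not at all.

```python
import unicodedata

# Bytes as a windows-1252 system would store: curly quotes around "Hi",
# a long dash, and a curly apostrophe.
word_bytes = b"\x93Hi\x94 \x97 it\x92s"

# Converting "from ISO-8859-1" turns 0x93 into the control character U+0093.
as_latin1 = word_bytes.decode("iso-8859-1")

# Converting "from windows-1252" turns 0x93 into U+201C, the real curly quote.
as_cp1252 = word_bytes.decode("cp1252")

# Both results encode to valid UTF-8, but only one holds the right characters.
latin1_utf8 = as_latin1.encode("utf-8")   # starts with C2 93
cp1252_utf8 = as_cp1252.encode("utf-8")   # starts with E2 80 9C

# "Cc" is the Unicode category for control characters.
first_char_category = unicodedata.category(as_latin1[0])
```

Data converted the first way is valid UTF-8 and may look plausible in some database tools, yet it contains control characters instead of punctuation.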

cabbagehead

9:17 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




System: The following 2 messages were spliced on to this thread from: http://www.webmasterworld.com/php/3172650.htm [webmasterworld.com] by jatar_k - 4:42 pm on Nov. 29, 2006 (pst -8)


Ok - this is going to be the death of me. It is DRIVING ME CRAZY!

I've been trying to get a CMS system I built to support and properly show those many special characters: specially formatted dashes, "..." characters, special apostrophes etc that come from MS Word. I thought I had it but it's back with a vengeance.

I've done the following to date:

1. UTF8 meta tag in the head of each page
2. accept-charset="UTF-8" in each form
3. converted the dB to UTF8

.... now, I am seeing the proper values in the dB but funny little squares where those MS Word formatted characters should be.

Can someone please help?!?! Does anyone know what the heck to do?!?!?!?!?!?!?!?!?!?!

cabbagehead

10:04 pm on Nov 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok - here's what I've got now .... sigh ...

If I *REMOVE* the UTF8 declaration (meta tag on display page) then those MS Word characters show properly...but none of the glyphs or special characters work in that case. Conversely if I add back the UTF meta tag then just the opposite is true.

So it would appear I must choose between support of glyphs and support of MS formatted apostrophes, quotes, etc.

Is this correct? Am I missing something? Sigh.
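
One possible reading of this either/or behaviour, offered as an assumption rather than a diagnosis, is that the stored data is mixed: the foreign glyphs are real UTF-8 while the Word punctuation is still single windows-1252 bytes. Under that assumption, a repair pass could decode as UTF-8 and fall back to windows-1252 only for the bytes that are not valid UTF-8. A Python sketch, with the hypothetical names cp1252_fallback and repair_mixed:

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: decode bytes that are not valid UTF-8 as windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad_bytes = err.object[err.start:err.end]
    # Bytes undefined in cp1252 become U+FFFD instead of raising again.
    return bad_bytes.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def repair_mixed(raw: bytes) -> str:
    """Decode bytes that mix UTF-8 sequences with stray windows-1252 bytes."""
    return raw.decode("utf-8", errors="cp1252_fallback")


# A UTF-8 encoded "café" followed by windows-1252 curly quotes.
mixed = "caf\u00e9 ".encode("utf-8") + b"\x93hi\x94"
repaired = repair_mixed(mixed)
```

Re-encoding the repaired text with .encode("utf-8") would give one consistent encoding to store. If the data is not actually mixed in this way, the fallback never triggers and valid UTF-8 passes through unchanged.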