UTF8 problems moving from W2k to W2003 Server

utf-8 problems

         

rmassart

3:24 pm on Oct 5, 2005 (gmt 0)

10+ Year Member



I am running a multilingual ASP site off a MySQL database. All my data is stored as UTF-8 in order to accommodate a wide range of characters; in fact, the whole database is UTF-8. This all works fine on the W2000 server running IIS 5: text can be saved to the database in any language and is displayed correctly when retrieved and sent to a browser (with charset utf-8). However, when I copy my ASP files to the W2003 server, making no changes, and then try to display text from the same database (still on the W2000 server), it simply returns a load of "?" for non-Latin characters.
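A row of "?" is the typical result of Unicode text being pushed through a single-byte code page that has no slot for the character. The following is only a minimal Python sketch of that effect, not the ASP code from the site. The sample string is made up, and whether this conversion is what actually happens on the W2003 server is an assumption.

```python
# Hypothetical sample text; not taken from the real database.
text = "Pound £, Cyrillic Д, Greek Ω"

# cp1252 (Windows Western European) has a slot for £ but none for Д or Ω.
# errors="replace" substitutes "?" for every character it cannot encode.
as_cp1252 = text.encode("cp1252", errors="replace")
round_tripped = as_cp1252.decode("cp1252")
print(round_tripped)

# The same text survives a UTF-8 round trip unchanged.
utf8_round_tripped = text.encode("utf-8").decode("utf-8")
print(utf8_round_tripped == text)
```

If the "?" really are stored or transmitted as the literal byte 0x3F, the original character is already lost at that point and cannot be recovered by changing the charset in the browser.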

I have tried playing around with the Server.Codepage settings (which seem to have little effect) and I have saved all my ASP files in UTF-8 format. None of this seems to help.

Also, I have gone to the trouble of retrieving my data with phpMyAdmin to ensure it really is stored as UTF-8, and that seems to work fine.

I guess there are differences in the way IIS 6 handles UTF-8 data compared to IIS 5, but I have not managed to figure out what they are.

Any help in this matter is very much appreciated.

Thanks,
Robin

rmassart

3:28 pm on Oct 5, 2005 (gmt 0)


Some investigation has brought up the following strange behaviour (well I think so anyway).

On IIS5:

With Session.Codepage = 1252, submitting "A pound sign: £ should appear here" in a form field results in:

"A pound sign: Â£ should appear here"

in the mysql database.

Given that I don't think my SQL client (SqlYog) can display Unicode, this would appear to be correct according to the following article, which states that a £ encoded in UTF-8 is displayed as Â£ in a Latin charset.

[czyborra.com...]
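The effect the article describes can be reproduced in a few lines. This is a small Python sketch rather than anything run on the server, and it assumes the SQL client displays bytes as Windows-1252.

```python
pound = "£"  # U+00A3

# UTF-8 encodes U+00A3 as the two bytes C2 A3.
utf8_bytes = pound.encode("utf-8")
print(utf8_bytes.hex())

# A client that reads those two bytes as Windows-1252 shows two characters:
# C2 is "Â" and A3 is "£".
shown_as_latin = utf8_bytes.decode("cp1252")
print(shown_as_latin)
```

So seeing "Â£" in a Latin-only client is consistent with the database holding a correctly encoded UTF-8 pound sign.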

If I now change the codepage to 65001 (UTF-8), then only the following is stored:

"A pound sign:"

i.e. all characters from the £ onwards (including the £ itself) are cut off.

Repeating this whole exercise on IIS6 gives this behaviour:

For session.codepage = 1252, the following is stored in the DB:

"A pound sign: Ã‚Â£ should appear here"

This, I believe (going by the site mentioned above), is an extra conversion: a UTF-8 string is interpreted as Latin and then converted into UTF-8 again, i.e. the MySQL database thinks the string to be stored is: "A pound sign: Â£ should appear here"
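The suspected extra conversion can be simulated step by step. This Python sketch assumes the misreading happens through Windows-1252. Where in the IIS/ADO/MySQL chain it really occurs is not established here.

```python
original = "£"

# Step 1: the browser sends the pound sign as UTF-8 (bytes C2 A3).
once = original.encode("utf-8")

# Step 2: some layer misreads those bytes as Windows-1252 text ...
misread = once.decode("cp1252")

# Step 3: ... and then encodes that text as UTF-8 a second time.
twice = misread.encode("utf-8")
print(twice.hex())

# What a Latin-only client would display for the doubly encoded bytes.
display = twice.decode("cp1252")
print(display)

# Reversing steps 3 and 2 recovers the original character.
repaired = twice.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired)
```

Four bytes where two were expected is a quick way to spot double encoding when looking at the raw column contents.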

And for session codepage = 65001 I get:

"A pound sign: £ should appear here"

This *appears* to work, but if I try putting in a Cyrillic character (e.g. Д, not sure if this will display) with codepage = 65001, then the db stores:

"xx" for the cyrillic character (where x = a byte the font of my sql client can't display), which is surely too many bytes for utf-8.
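As a cross-check on the byte count: UTF-8 uses two bytes for every code point from U+0080 to U+07FF, which includes the basic Cyrillic block, so two bytes for Д is what a correct UTF-8 encoding produces. The short Python sketch below shows this. Whether the two bytes the SQL client showed are these exact values is not verified.

```python
cyrillic_de = "Д"  # U+0414, CYRILLIC CAPITAL LETTER DE

# Encode to UTF-8 and inspect the raw bytes.
encoded = cyrillic_de.encode("utf-8")
print(len(encoded), encoded.hex())

# A client limited to Windows-1252 would render those bytes as two
# unrelated characters, or as placeholders if its font lacks them.
as_seen_in_latin_client = encoded.decode("cp1252")
print(as_seen_in_latin_client)
```

Comparing the stored bytes against D0 94 (for example with MySQL's HEX() function) would show whether the value was stored correctly or mangled on the way in.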

The locale on the server is set to UK, which might be why IIS 6 understands the £ character. But my impression is that the server runs in Unicode internally and should therefore understand all characters. My gut feeling is that (with codepage = 65001) I'm having trouble making ADO (on both IIS 5 and IIS 6) understand that the characters are already in UTF-8 when they are received from the browser and that they don't need converting again. Does this make any sense? If so, how do I tell ADO that the charset is UTF-8? And can an SQL statement be encoded in UTF-8 in the first place?

Sorry for the long post, but any help is much appreciated.

Thanks,
Robin