Forum Moderators: coopster

multi language corrupt data

         

omoutop

10:28 am on Aug 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello all and thanks for any tips/advice you will provide.

Recently I was called in to upgrade a very old site.

PHP 4.3, MySQL 5.0.41, shared hosting, no access to any conf/ini file, no control panel (Plesk or anything similar).

All the pages (live pages, admin page) were in ISO-8859-1 (the default "English" encoding).
At some point they extended parts of the site with 4 more languages, yet the page encoding stayed the same.
So they typed German text into a form on an ISO-8859-1 page and submitted it to a PHP script, which stored it in MySQL in an unknown charset.
When they retrieve the text, it appears correctly on their pages.

Now they wish to use the stored data to populate various minor websites. Some of these are hosted on the same server, others on a different server with a different configuration.

These new websites will be mostly static, except for a couple of pages that will be filled dynamically with XML data from the original site.

So far, my attempts to get properly encoded data have failed.
If I set my XML to send English-encoded data, it messes up my page encoding.
If I set the XML to send UTF-8 data, I get scrambled data.

I hope I have made my problem clear.

I have no experience with language handling in PHP.
I must not touch the MySQL data directly, since it displays properly on the parent website.
Whatever method is suggested must alter the data after it is retrieved from the database and before it is printed on screen in the new websites.

So, is there any way to get proper data from this mess?

lostdreamer

11:04 am on Aug 10, 2011 (gmt 0)

10+ Year Member



Have you tried the utf8_encode() / utf8_decode() functions ?
[php.net...]
[php.net...]

This way you should be able to get the data out of MySQL, and convert it to the correct character set.

Otherwise you might have more success with mysql_set_charset();
[php.net...]
With it, you can tell PHP that the MySQL DB has a different charset than your website (and it will convert the data for you)


Regards,
LostDreamer

omoutop

11:16 am on Aug 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have tried utf8_encode/decode and it didn't help.
As for the mysql_set_charset, this is for PHP 5+.
I must work in php 4.

penders

5:59 pm on Aug 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is it possible that HTML entities are stored in the database? And this is how they got round the issue of displaying special characters on an ISO-8859-1 page? If you are saving this out to an XML file then the entities could be a problem, unless they are decoded - or is that the problem?!

lucy24

9:25 pm on Aug 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



8859-1 is not only English. It includes all characters normally used in the major western European languages. In the case of German, that means the standard umlauts ÄÖÜäöü and the eszet (don't quote me on the spelling) ß, guillemets «» and so on. So you don't have to change encodings just to accommodate German (or French, or Spanish, or any Scandinavian language).

If they used one of the obscure encodings that's a superset of 8859-1-- or if your program says Latin-1 when it really means ASCII-- then you may be in trouble. And if you're looking to the future, UTF-8 will cover all possibilities. But if you are using genuine 8859-1 it should not be necessary to change anything at this point.
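A quick way to test whether a given text fits in ISO-8859-1 at all is a round trip through that encoding. This is only a sketch: it assumes the mbstring extension is available (not confirmed for the PHP 4.3 host in this thread), `fits_latin1()` is a made-up helper name, and the sample strings are illustrative.

```php
<?php
// Hypothetical helper: true if every character of a UTF-8 string exists in ISO-8859-1.
// Assumes the mbstring extension is loaded.
function fits_latin1($utf8)
{
    // Characters missing from ISO-8859-1 get substituted during conversion,
    // so the round trip only reproduces the input when nothing was lost.
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    $back   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    return $back === $utf8;
}

// Strings are written as UTF-8 byte escapes so the file's own encoding does not matter.
$german  = "Gr\xC3\xBC\xC3\x9Fe";                               // "Grüße"
$greek   = "\xCE\xA0\xCE\xB9\xCF\x83\xCE\xAF\xCE\xBD\xCE\xB1";  // "Πισίνα"
$turkish = "\xC5\x9F";                                          // "ş"

var_dump(fits_latin1($german));   // bool(true)
var_dump(fits_latin1($greek));    // bool(false)
var_dump(fits_latin1($turkish));  // bool(false)
```

German survives the round trip; Greek and the Turkish s with cedilla do not, which is why those languages need something other than plain Latin-1.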

helenp

9:39 pm on Aug 10, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I use this in my connection file. I don't know which PHP version it is for, but it converts correctly for me (Spanish and Swedish characters):
mysql_query ("SET NAMES 'utf8'");

omoutop

6:02 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@penders:
No, I don't see any HTML entities stored in the database.

@lucy24:
Running mysql_client_encoding($db), I see that the default database encoding is Latin-1. The metas in the original pages are iso-8859-1. So far I can see English, French, German, Italian, Spanish and Swedish. I can't see proper Greek, Russian or Turkish.

@helenp:
mysql_query("SET NAMES 'utf8'");
mysql_query("SET CHARACTER SET utf8");
I tried both of these, but no luck with Greek, Russian or Turkish. The other languages behave correctly.

I tried iconv() and even setlocale(LC_ALL, 'el_GR.UTF-8') (for Greek), but nothing came out correctly (as expected).

If I run the script that creates the XML file, I see all languages correctly. If I pull that XML data into the new website, I see scrambled data.

My XML is in UTF-8 encoding. I will try various combinations of XML encoding and data encoding and see if I get lucky.

Is there any other approach I can try? Something other than XML, perhaps?

penders

7:26 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I must not mess with the mysql data directly, since they appear on parent website properly.


So far i can see english, french, german, italian, spanish and swedish. I can't see proper greek, russian, turkish.


And the greek, russian and turkish appear OK on the parent site? I doubt that you'd be able to display these OK with an ISO-8859-1 charset?!

But regardless of the above, if...

If I run the script that creates the XML file, I see all languages correctly. If I pull that XML data into the new website, I see scrambled data.


If the text appears correctly in the (UTF-8 encoded) XML file then you've already extracted the language data from the MySQL database correctly - is that correct?

The problem seems to be reading the XML into your page? How are you doing this? Multi-byte string functions [php.net]? Are your pages UTF-8 encoded?

omoutop

7:43 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



penders,

If I run the script that creates the XML file, I see all languages correctly. If I pull that XML data into the new website, I see scrambled data.


I "cheat" on this: my page sets its charset to Greek (windows-1253 only, iso-8859-7 is not working) and the XML file is in utf-8. This is the only way I can see Greek characters so far, and it's confusing me how this is even possible.

The new script (the one that reads the XML data) is in utf-8 encoding (PHP headers plus meta encoding).

mysql_client_encoding($db) states that the database is in Latin1.
mb_detect_encoding($str) states that the data is in utf-8 and ascii, which is another confusion.

I ran both of these on a clean page: no headers, no metas, just a connection to the db and echoing 10 rows of data from the table.

The parent site uses only the English encoding (iso-8859-1 in both header and meta). The child sites are on a different server and are all in utf-8.

lucy24

7:46 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I see that the default database encoding is Latin-1. The metas in the original pages are iso-8859-1.

Same thing. Unless they're being evil and interpreting "Latin-1" as, say, "Windows-Latin-1" instead of "ISO-Latin-1". But that doesn't seem to be the problem.

So far i can see english, french, german, italian, spanish and swedish. I can't see proper greek, russian, turkish.

Right. Latin-1 is the Roman alphabet with a few selected diacritics. No cyrillic, no Greek, no... well, Turkish is written in Roman but it's got a few wonky characters. Dotless i, I think, and

:: detour to Omniglot ::

ş (s with cedilla) and ğ (g with breve, apparently standing in for what unicode calls a caron and the rest of the known universe calls a hacek). Not in Latin-1. Which means the Forums will make mincemeat of them ;)

mb_detect_encoding($str) states that the data is in utf-8 and ascii

That sounds as if mb_detect_et cetera is due for a vacation. ASCII is a subset of UTF-8 and of Latin-1; the codepoints are the same either way. Is it maybe trying to say that some pieces of data are safely in the ASCII range, while the non-ASCII ones are UTF-8 encoded?
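The subset relationship can be checked directly. A minimal sketch, assuming the mbstring extension (and a PHP version that has mb_check_encoding(), which may not be true of the 4.3 install in this thread); the Greek sample bytes are illustrative, not taken from the actual database.

```php
<?php
// mb_check_encoding() reports whether a byte string is valid in the named encoding.
$ascii_word = "Wi-Fi";                                              // every byte is below 0x80
$greek_utf8 = "\xCE\xA0\xCE\xB9\xCF\x83\xCE\xAF\xCE\xBD\xCE\xB1";   // a Greek word as UTF-8
$greek_raw  = "\xD0\xE9\xF3\xDF\xED\xE1";                           // the same word as single bytes

var_dump(mb_check_encoding($ascii_word, 'ASCII'));  // bool(true)
var_dump(mb_check_encoding($ascii_word, 'UTF-8'));  // bool(true): plain ASCII is also valid UTF-8
var_dump(mb_check_encoding($greek_utf8, 'ASCII'));  // bool(false)
var_dump(mb_check_encoding($greek_utf8, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($greek_raw,  'UTF-8'));  // bool(false): single-byte Greek is not valid UTF-8
```

So a report of "ASCII" for some strings and "UTF-8" for others is consistent with one data set: the ASCII-only strings simply never leave the range that all these encodings share.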

omoutop

7:54 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That sounds as if mb_detect_et cetera is due for a vacation. ASCII is a subset of UTF-8 and of Latin-1; the codepoints are the same either way. Is it maybe trying to say that some pieces of data are safely in the ASCII range, while the non-ASCII ones are UTF-8 encoded?


Yes, there are some words that are not translated in any language. For example, the word Wi-Fi stays the same on every translation in every language. That word mb_detect sees as ASCII

penders

8:22 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



i "cheat" on this.. my page set charset to greek (windows-1253 only, iso-859-7 is not working) and the xml file is in utf-8. This is the only way to see greek characters so far - and its confusing me on how this is possible.


Hhhmmm, yes, this is confusing! If your data is UTF-8 encoded then your page that you are outputting to should be the same encoding? Otherwise you're going to have to convert it (I would have thought)? And convert to different encodings for different languages?

omoutop

8:26 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, I think it is something like this.
For now I am concentrating on one language, hoping that the solution for the others will be the same.

lucy24

9:36 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



!
Can you post a few samples of the texts that are coming through as garbage, alongside the non-garbage versions? The Forums may auto-convert the non-Latin-1 characters to numerical entities, but that is OK. Just to confirm what type of encoding mess is happening.

omoutop

10:10 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Notes for quick reference:
- All data are for greek language.
- mysql_client_encoding($db) states that database is in Latin1
- mb_detect_encoding($str) states that the data is in utf-8 and ascii (ascii are the english words inside greek text)
- Php 4.3, MySQL client version: 5.1.49, Server version: 4.0.27, phpMyAdmin - 2.11.11.3
- Parent site uses and stores data under meta encoding ISO-8859-1
- Child site must export data in xml format and present them in UTF-8


A) Sample data using mysql_query("SET NAMES 'utf8'");

Ðéóßíá
ÐáéäéêÞ Ðéóßíá
ÊÞðïò
Êëéìáôéóìüò
Ôçëåüñáóç


B) Sample data using <meta http-equiv="Content-Type" content="text/html; charset=windows-1253" /> (this works for me, i see greek data printed on screen)

&#928;&#953;&#963;&#943;&#957;&#945;
&#928;&#945;&#953;&#948;&#953;&#954;&#942; &#928;&#953;&#963;&#943;&#957;&#945;
&#922;&#942;&#960;&#959;&#962;
&#922;&#955;&#953;&#956;&#945;&#964;&#953;&#963;&#956;&#972;&#962;
&#932;&#951;&#955;&#949;&#972;&#961;&#945;&#963;&#951;



C) Sample data using <meta http-equiv="Content-Type" content="text/html; charset=iso-8857-1" />

Ðéóßíá
ÐáéäéêÞ Ðéóßíá
ÊÞðïò
Êëéìáôéóìüò
Ôçëåüñáóç


D) Using: mb_convert_encoding($str, "ISO-8859-7", "auto") or mb_convert_encoding($str, "UTF-8", "auto") gives me empty string

E) Sample data using iconv("ISO-8859-1", "UTF-8", $str)

Ã&#144;éóßíá
Ã&#144;áéäéêÞ Ã&#144;éóßíá
ÊÞðïò
Êëéìáôéóìüò
Ôçëåüñáóç


F) Using iconv("UTF-8", "iso-8859-1", $str) gives me empty string


G) Sample data using utf8_encode($str) & utf8_decode($str)

Ã&#144;éóßíá
Ã&#144;áéäéêÞ Ã&#144;éóßíá
ÊÞðïò
Êëéìáôéóìüò
Ôçëåüñáóç

penders

10:29 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A) and C) display OK for me providing I change the encoding of the page to greek (ISO-8859-7 or Windows-1253)

B) should always display OK because it's HTML entities and not dependent on an encoding.


Sample data using <meta http-equiv="Content-Type" content="text/html; charset=iso-8857-1" />


And the page is saved as ISO-8857-1 ?
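If samples A) and C) really are single-byte Greek that merely gets displayed as Latin-1, the text can be re-labelled in PHP instead of in the browser. The following is a sketch under that assumption only: it needs the mbstring extension, `repair_greek()` is a made-up name, and it treats the bytes as ISO-8859-7 (Windows-1253 differs in a few positions, for example capital alpha with tonos, so real data may need iconv() with that charset instead).

```php
<?php
// Hypothetical helper, assuming mbstring: the input is UTF-8 text that looks like
// "Ðéóßíá" because single-byte Greek was read as Latin-1 somewhere upstream
// (for example after SET NAMES 'utf8' on a Latin-1 column).
function repair_greek($mojibake_utf8)
{
    // Step 1: undo the Latin-1 reading to recover the original single bytes.
    $raw = mb_convert_encoding($mojibake_utf8, 'ISO-8859-1', 'UTF-8');
    // Step 2: read those bytes as Greek (ISO-8859-7) and emit real UTF-8.
    return mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-7');
}

$mojibake = "\xC3\x90\xC3\xA9\xC3\xB3\xC3\x9F\xC3\xAD\xC3\xA1"; // "Ðéóßíá" as UTF-8
$fixed    = repair_greek($mojibake);

echo bin2hex($fixed), "\n"; // cea0ceb9cf83ceafcebdceb1, the UTF-8 bytes of "Πισίνα"
```

The result matches the entities in sample B) (&#928;&#953;&#963;&#943;&#957;&#945;). If the rows are fetched over the default Latin-1 connection, the bytes are already raw and only step 2 would be needed.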

omoutop

10:34 am on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And the page is saved as ISO-8857-1 ?


The testing page is a simple PHP page created with Dreamweaver. Wherever I mention a meta, that meta is in the page. Otherwise no meta is included.

lucy24

9:17 pm on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A and C come through fine for me in the Forums, as does B if I paste into a text editor and do HTML Preview. (That is, the entities agree with the letters.) The browser has intelligently gone to Greek encoding (8859-7, not Windows Greek) of its own volition.

E and G are the weird ones. What you have there is the Dreaded Question Mark, meaning that parts of your text use codepoints that are permitted in Latin-1-- or some other one-byte encoding-- but not in UTF-8. Except that the overall pattern-- one recurring letter alternating with a second variable letter-- is what you'd get in the opposite situation: UTF-8 from the two-byte range being reinterpreted as Latin-1. If you did that with your sample text you would get something entirely different... which I can't paste in because now that we are in Greek encoding it all goes to entities. Anyway, the recurring character is &#206; (capital I with circumflex) or CE, the first byte of most Greek letters in UTF-8 (lowercase pi through omega start with CF instead).

So what you've got is UTF-8 being reinterpreted as something, but I can't for the life of me figure out what. It isn't 8859-7 or Windows-Greek, because in both of those the recurring letter would be capital Xi. What we're looking for is an encoding where codepoint CE is capital Gamma. Or a multiple series of reinterpretations that I can't reproduce on the text editor.
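The "recurring first byte" pattern is easy to reproduce. A small sketch, assuming the mbstring extension and using the first sample word as illustrative input; it shows genuine UTF-8 Greek being misread as Latin-1, not the unidentified conversion that produced E and G.

```php
<?php
// "Πισίνα" written as UTF-8 byte escapes, so the file's own encoding does not matter.
$greek_utf8 = "\xCE\xA0\xCE\xB9\xCF\x83\xCE\xAF\xCE\xBD\xCE\xB1";

// Each letter is two bytes; the first byte is CE for most letters, CF for sigma.
echo bin2hex($greek_utf8), "\n"; // cea0ceb9cf83ceafcebdceb1

// Pretend the bytes are Latin-1 and re-encode: every byte becomes its own character,
// so the lead byte CE shows up as a recurring capital I with circumflex.
$misread = mb_convert_encoding($greek_utf8, 'UTF-8', 'ISO-8859-1');

echo mb_strlen($greek_utf8, 'UTF-8'), "\n"; // 6
echo mb_strlen($misread, 'UTF-8'), "\n";    // 12
```

Six letters turn into twelve characters, every other one being the lead byte, which is the alternating pattern described above.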