homepage Welcome to WebmasterWorld Guest from 54.161.175.231
register, free tools, login, search, pro membership, help, library, announcements, recent posts, open posts,
Pubcon Platinum Sponsor 2014
Home / Forums Index / Code, Content, and Presentation / PHP Server Side Scripting
Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
Accents being broken - turns to å
brokaddr




msg:4495332
 8:59 pm on Sep 14, 2012 (gmt 0)

This seemed to happen out of nowhere. Previously, special characters would save to my database/be passed by PHP perfectly.

Now, all of the accents and special characters are malforming.

- The database table stores as latin1_swedish_ci
- The file that sanitizes inputs/whitelists these accents is saved as UTF-8 encoding

I'm not sure where else to look?

 

lucy24




msg:4495343
 9:37 pm on Sep 14, 2012 (gmt 0)

You've already diagnosed the problem. At this point, what are you looking for? You need to include a line in both places that explicitly sets the encoding. Or rather in all three places, since the final-destination html also has to have the same encoding. Don't just cross your fingers and hope that they'll come out the same.

Also check to make sure it behaves the same locally and online. Sometimes your ftp-or-equivalent turns out to be the culprit. (Voice Of Experience.)

swa66




msg:4495422
 7:14 am on Sep 15, 2012 (gmt 0)

I would convert everything to UTF-8, this is really the simplest solution, but you need to do the conversion carefully.

e.e. strlen() in your php looks for bytes, not characters, and hence it can be off quite a bit, your database is the same, if you have char() fields: they take triple space in UTF-8 (worst case scenario is what char aims for in storage); varchar doesn't do that.

Your problem is going to get worse as you might be contaminating your database with having a mix of iso-latin and utf-8 in it, something that you only can clean up if you know which fields have been contaminated.

brokaddr




msg:4495490
 3:14 pm on Sep 15, 2012 (gmt 0)

You need to include a line in both places that explicitly sets the encoding.

Forgive me for being daft, but how would I do that, exactly?

Also check to make sure it behaves the same locally and online. Sometimes your ftp-or-equivalent turns out to be the culprit. (Voice Of Experience.)

I save the file as UTF-8, in my client (WinSCP), the "Server Environment" is set to UTF-8 encoding for filenames = Auto
Anything else I'd need to check?

I would convert everything to UTF-8, this is really the simplest solution, but you need to do the conversion carefully.

Change the database tables from latin1_swedish_ci to UTF-8, correct? Would utf8_unicode_ci be the best option? I primarily use A-Z alphabet, but deal with a lot of foreign accents and characters from time to time.

Your problem is going to get worse as you might be contaminating your database with having a mix of iso-latin and utf-8 in it, something that you only can clean up if you know which fields have been contaminated.

Does the entire database need to be changed, or just the fields that are acting up? I don't have a lot of experience with this, I didn't even realize there was a difference in database storage until this happened. I've actually had it set to latin1_swedish_ci since 2005 or so, never had any issues until recently.

swa66




msg:4495563
 8:27 pm on Sep 15, 2012 (gmt 0)

If you managed to get your data in a mixed encoding in your database:
It's old but was in my bookmarks as a pointer to commands to get out of such a mess: [oreillynet.com...]

lucy24




msg:4495588
 9:47 pm on Sep 15, 2012 (gmt 0)

Forgive me for being daft, but how would I do that, exactly?

I was hoping you would know ;) Almost anything that has the capacity to make databases includes a command to specify character encoding. Usually you can also convert between one and another. And, of course, html itself includes-- or should include, heh heh-- a "charset" declaration. I assume the contents of your database will end up online in some form, so don't overlook that step.

The good news is that if all you're using is "vanilla" Swedish letters like å you can stay with Latin-1 encoding. That's ISO-Latin-1; I definitely don't recommend anything more specialized. And if you go to UTF-8 you're only into the two-byte range, so your database won't get significantly bigger.

swa66




msg:4495594
 10:14 pm on Sep 15, 2012 (gmt 0)

Maybe we need to elaborate a bit further:

As far as I understand (might not be perfect - so I very much welcome corrections)

MYSQL 5

I'm not using older ones anymore, but I doubt they function the same.

It uses a character set -> this is how the data is encoded, so it determines what the binary representation for an A is, but does it for each and every character it supports.

It also uses a collation -> this is how data is compared and sorted. E.g. it determines if it is to be sorted case sensitive or not, if an "" (ü should the board eat it) and an "ue" should be considered the same (e.g. useful in German) etc.

You can set the character set and the collation for the mysql instance, for a database, for a table or even for a column. A character set in mysql has a default collation.
e.g.
- for the character set latin1, the default collation is latin1_swedish_ci
- for the character set utf8, the default collation is utf8_general_ci

You can see it with the SQL commands
show character set;
and
show collation;


mysql also uses a character set and collation to communicate with the clients.

To set the communication between a client and the database in utf-8 you use:
SET NAMES 'utf8';
for e.g. interactive use or
charset 'utf8';
to make it persistent through reconnections, or if you use the mysqli interface from php:
$mysqli->set_charset("utf8");
(object mode, assuming $myslqi being the object) or e.g.
mysql_set_charset('utf8', $conn);
for the obsolete mysql interface.

When creating a database you can specify the character set and/or the collation.

CREATE DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]


If both are ommitted the compile time default is used (typically latin1 and latin1_swedish_ci . If only one is suported then the other is derived from the supplied one (e.g. set the collation to utf8_general_ci and it will also force a character set of utf-8.

To change the character set and/or collation, ALTER TABLE can be used.

You can see the character set and collations used with command like:


show variables like "collation%";
show variables like "character%";


If you look at these: do not worry about character_set_server and collation_server still being latin1 and not utf8: they are the server defaults if you specify nothing, the rest should be utf-8.

HTML
In order to make sure the browsers understand that the data they get sent and that they send to you , they too need to be told it is in utf-8:

You can set a HTTP header (IE ignores this)
You can add a HTML head <meta> tag

e.g.:
In apache's httpd.conf (or I guess in a .htaccess)

AddDefaultCharset UTF-8
AddType text/html;charset=utf-8 html


The obsolete HTML4 etc require it to be:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

HTML5 wants us to use:
<meta charset="utf-8" />


[you want it in both the HTTP header (apache and or header() from php) and in the meta tag.]

PHP
To output html:
Do not forget to set the content type header. E.g. I use
header('Content-Type: application/xhtml+xml;charset=UTF-8');
for my polyglot html5 (take care with the serious consequences of this choice).

To process strings, you need the mb_ variants.

E.g.

- to validate input: you need to make sure the sting is valid UTF-8:

if ( !mb_check_encoding($string,'UTF-8') ) {
//bad input
}
The reason here is that a single character in UTF-8 can be represented by a number of bytes, so a string could contain "unfinished" sequences leading to a number of potential issues

- strlen() on UTF-8 data returns the size in bytes, not the number of characters, so you need :
mb_strlen($string, 'UTF-8')


There's a truckload of mb_ string functions (see references below) - use them whenever you deal with UTF-8 data.



References:
[php.net...]
[mysql.he.net...]
[php.net...]

brokaddr




msg:4498461
 10:20 pm on Sep 21, 2012 (gmt 0)

This is a lot more complicated than I'd expected!

I changed the page charset from iso to utf-8, the scrambled characters are now view properly. However, in the database they are still scrambled.

For example:
Database: Schöllingstraße
Viewed in php/html/browser: Schllingstrae

Older entries that are written to the database properly, are now reversed.
For example:
Database: Strae
Viewed in php/html/browser: Stra&#65533;e (it appears as the diamond-questionmark, not the ascii code)

Both tables have the exact same encoding, of latin1_swedish_ci - so I'm thinking this isn't a database problem, after all.

swa66




msg:4498474
 11:04 pm on Sep 21, 2012 (gmt 0)

As I feared, the data in your database is messed up.

You have data that's encoded in UTF-8 and data that is encoded in iso-latin-1 in your database.
And you've told your database it's all latin-1. (the swedish collation isn't all that important)

What you tell the database the data is, is mostly important for the (automatic) translations going on, for the length calculation etc.

There is no easy solution to correct this. Data contamination is one of the hardest things to solve.

The best you can do: tell your database everything is UTF-8 (this doesn't change what's stored). And tell your application and database to communicate in UTF-8 as well.

Next you'll have to search your database for invalid UTF-8.
php's md_check_encoding() [php.net...] might be helpful as would the HEX sql function, and manual reviews.

brokaddr




msg:4498482
 11:51 pm on Sep 21, 2012 (gmt 0)

If it's database contamination, why are the older entries readable?

As I mentioned, from only about a month ago:
In the database: Strae
.. it is only malformed once viewed in the site. But, if I go to phpmyadmin, it looks fine.

lucy24




msg:4498492
 12:47 am on Sep 22, 2012 (gmt 0)

If it's database contamination, why are the older entries readable?

Because you changed horses in midstream. You can either tell the database to convert from ISO-8859 to UTF-8, or tell it to keep UTF-8 throughout. But unfortunately you can't tell it to treat one part of the database one way and another part the other way. There is no alternative but to go into your database and fix it manually. Is there a mechanism for splitting one database into two, and merging two databases into one? Any way to dump them into a text editor, do some stuff and dump back?

I know what you're going through, anyway. I've had the equivalent thing happen in e-books when I didn't notice in time that I'd opened the raw file in Mac-Roman and then started adding non-ASCII characters in what I thought was ISO-Latin-1. The only comfort is that it was a text editor, so "all" I had to do was globally replace all the garbage pairs, like your å, with the proper letter.

swa66




msg:4498622
 4:23 pm on Sep 22, 2012 (gmt 0)

If you have a date or other way to split the database into two parts: the UTF-8 data and the iso-latin-1 data: then you could split the tables, and convert one of them automatically and rejoin both. BUT you need to find the criterion to do that split. Might be easy if you know when it changed and you have e.g. automatic timestamps for a last update, or excellent logs - but it might be nearly impossible just as well.

To get the entire database into a text editor all you need is mysqldump (commandline).

You can edit that text (it's actually a collection of sql commands ready to feed to mysql to restore the database).

brokaddr




msg:4499991
 2:14 am on Sep 26, 2012 (gmt 0)

Until I was testing the other day, I've never changed the encoding of either html output or database storage - the 'corruption' occurred at random.

Changing to UTF seems futile until I can narrow down the initial cause of corruption. Could an automatic mysql upgrade cause it, perhaps?

swa66




msg:4500035
 6:25 am on Sep 26, 2012 (gmt 0)

AFAIK, there's a problem with IE (it's always the same culprit isn't it?) in that when it tries to encode things in iso-latin-1, it sometimes keeps characters in UTF-8. Trying to force it all to stay in iso-latin-1 is futile once your input braks up like that.
I
ve never seen that wacko behaviour if it's all moved to UTF-8 and make sure not to do conversions anymore (it's safe to convert between unicode encodings (UTF-8 to and from UTF-16 is ok). [you use UTF-16 for some databases (mysql does UTF-8 - but e.g. MSFT's SQL server uses UTF-16 internally]

IMHO: the sooner you move it all to UTF-8 the sooner your can stop messing up your data.

Global Options:
 top home search open messages active posts  
 

Home / Forums Index / Code, Content, and Presentation / PHP Server Side Scripting
rss feed

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Home ¦ Free Tools ¦ Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About ¦ Library ¦ Newsletter
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved