How to solve character set problem of PHP files

Forum Moderators: coopster

dbarasuk

4:08 pm on Feb 19, 2015 (gmt 0)

Hi everyone,
I'm having serious problems because I edited my PHP code in an editor I wasn't used to, CodeLobster, whereas all my files had previously been edited in Notepad++ (I was just trying CodeLobster out).

Now, when I build SQL queries and inspect them before they reach the database, everything looks OK. The problem happens with French words containing accents, for example 'Février 2015'. When I echo the field in the browser, it shows 'Février 2015' with no problem. However, once this value reaches MySQL, it becomes 'FÚvrier 2015'. All subsequent operations on that value saved in the database fail. I've tried to change the charset in Notepad++ from ANSI (the default in Notepad++) to UTF-8 with no success. I've also tried to change the whole database to latin1, but my problem was not solved.

Can anyone tell me what to do? I'm in trouble.

Thanks to everyone that will help me.

dbarasuk

4:12 pm on Feb 19, 2015 (gmt 0)

NOTE: What I want is that the database should not change html values I'm sending. For instance, if I send 'Février 2015' (with an accent), I want to find that same value in the database table, and not 'FÚvrier 2015' as in my case.

Please save me from this trouble.

brotherhood of LAN

4:19 pm on Feb 19, 2015 (gmt 0)

This sounds like a character set/collation issue with your database. Make sure your fields use a UTF-8 character set (with a collation such as utf8_unicode_ci) if you plan on having non-ASCII characters in there.

When retrieving from the DB, I typically go for

SELECT CAST(field AS BINARY) AS field

Which'll avoid any character mangling or assumptions about the stored data.

Also make sure the page is served with a UTF-8 charset (in the Content-Type header or a meta tag) if you're looking at it through your browser.

dbarasuk

4:47 pm on Feb 19, 2015 (gmt 0)

What's the difference between UTF8 WITH BOM and WITHOUT BOM?
The problem is that I cannot rewrite all my queries. It's a big project and I can't find them all.

brotherhood of LAN

4:52 pm on Feb 19, 2015 (gmt 0)

I don't think it'll matter here, but you can Google to find some already existing explanations.

The first thing you should do is check that the data actually stored in the database is what you expect it to be (i.e. properly encoded).

lucy24

7:32 pm on Feb 19, 2015 (gmt 0)

"What's the difference between UTF8 WITH BOM and WITHOUT BOM?"

For most purposes, nothing.

But was it really a change from é (eacute) to Ú (Uacute)? That's not Latin-1 vs. UTF-8; it's two different one-byte encodings. (But not Mac vs. Windows. I checked.)

"What I want is that the database should not change html values I'm sending."

What do you mean by "html values"? Are non-ASCII characters stored in the database as entities (&eacute; or possibly &#xE9; or, uh, whatever the decimal form is)?

dbarasuk

9:50 am on Feb 20, 2015 (gmt 0)

Hi lucy24,
Yes, it was a change from é (eacute) to Ú (Uacute).
What I mean by html values is in fact form textbox values.
In the form textbox, the value I had was "é", and it was changed to Ú (Uacute) once in the database. But I wanted it to remain "é", as it was sent from my form.

I'm still in trouble!

Thanks for your guidance. I am waiting.

lucy24

9:04 pm on Feb 20, 2015 (gmt 0)

What you need to understand is that there is no such thing as é. The computer stores a number and then when it comes time to generate and display the text, it converts that number back into a visible character. That's why file encoding matters. The computer's stored number-- E9 let's say (hexadecimal for 233)-- may be interpreted as é or it may be interpreted as È or it may be Ú ... or it may come through as something in Cyrillic or Korean.

It looks as if what you have is a mixup between ISO-Latin-1 (eacute encoded as E9) and DOSLatin1 or DOSLatin2 (both with Uacute encoded as E9). And you now have a problem, because the database has no way of knowing that some records are in one encoding while others are in a different one.
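To make the "one number, several characters" point concrete, here is a small sketch in Python (chosen only because it ships with all of these code pages; nothing in the original setup uses it). It assumes that "DOSLatin1" and "DOSLatin2" correspond to code pages 850 and 852.

```python
# One stored byte, several possible readings.
raw = bytes([0xE9])  # 233 in decimal

as_latin1 = raw.decode("latin-1")      # ISO-8859-1
as_cp850 = raw.decode("cp850")         # assumed to be "DOSLatin1"
as_cp852 = raw.decode("cp852")         # assumed to be "DOSLatin2"
as_macroman = raw.decode("mac_roman")  # classic Mac encoding

print(as_latin1, as_cp850, as_cp852, as_macroman)  # é Ú Ú È

# The value from the first post, written as Latin-1 and read back as CP850:
garbled = "Février 2015".encode("latin-1").decode("cp850")
print(garbled)  # FÚvrier 2015
```

The bytes never change in this sketch; only the table used to turn them back into characters does.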

Your second problem will be finding someone who understands both character sets and databases to walk you through the solution. (You will have figured out that I only know character sets.) I'm not sure there is any alternative to pulling out all the "wrong" information and re-entering it with the correct encoding. Your computer is just a dumb machine; it has no way of knowing that you "meant" é rather than Ú. (If it were a different mixup, like Latin-1 vs. UTF-8-- one-byte vs. multi-byte-- there might be ways to test.)
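As an illustration of the kind of test that works for the one-byte vs. multi-byte case, here is a rough Python sketch. The helper name looks_like_utf8 is made up for this example. It relies on the fact that accented text saved in a one-byte encoding almost never happens to be valid UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes = "Février".encode("latin-1")  # é stored as the single byte E9
utf8_bytes = "Février".encode("utf-8")      # é stored as the two bytes C3 A9

print(looks_like_utf8(latin1_bytes))  # False
print(looks_like_utf8(utf8_bytes))    # True
```

Plain ASCII text passes this check in either encoding, so it only says something about values that contain accented characters. It cannot tell two one-byte encodings apart, which is the situation here.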

Right now you are primarily in Latin-1. Don't try changing to UTF-8 until you make sure that all your existing data uses the same encoding. Otherwise you'll never get it disentangled.
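If you can work out which records went through the wrong code page, the swap itself can be reversed mechanically for those records. A rough Python sketch follows, under two assumptions that need checking against your real data: that the wrong table was CP850, and that the intended one was Latin-1. The function name undo_cp850_mixup is invented for the example.

```python
def undo_cp850_mixup(garbled: str) -> str:
    """Hypothetical helper: reverse a Latin-1 value that was read as CP850.

    Assumes the mixup really was Latin-1 vs. CP850. Values containing
    characters that do not exist in CP850 are returned unchanged.
    """
    try:
        # Recover the original bytes, then read them with the intended table.
        return garbled.encode("cp850").decode("latin-1")
    except UnicodeEncodeError:
        return garbled


print(undo_cp850_mixup("FÚvrier 2015"))  # Février 2015
```

Running this over a record that was stored correctly would damage it (a genuine é would come out as a control character), so it is only safe on rows already known to be affected. Try it on a copy of the table first.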

not2easy

11:11 pm on Feb 20, 2015 (gmt 0)

Going back to this part:

"All subsequent operations on that value saved in the database fail. I've tried to change the charset in Notepad++ from ANSI (the default in Notepad++) to UTF-8 with no success. I've also tried to change the whole database to latin1, but my problem was not solved."

- please don't confuse this with being "the solution" - but years ago I had the Notepad++ character issue too. I do not know whether current versions of Windows offer the same options. If they do, just open up good old Windows Notepad (not ++): when you save a text file there is a tiny dropdown menu for choosing the encoding. ANSI is the Windows version of Latin-1 and it is the default encoding, but it lets you change the default there to ascii. After I made that change, I was able to change the character encoding in Notepad++ as well.

Don't choose "with BOM": the BOM is a "Byte Order Mark" that Windows used way back when, and it shows up as weird characters on the viewed page if your page isn't 100% in the same encoding. If available, select "Unicode UTF-8". Unfortunately I never found a way in Notepad++ (other than one page at a time) to convert the old ANSI files to UTF-8.
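For anyone wanting to see what the BOM actually is, and to convert many files at once outside the editor, here is a rough Python sketch. It assumes that "ANSI" means Windows-1252 (cp1252). The helper name convert_to_utf8 and the "my_project" folder are invented for the example. Work on a backup copy.

```python
import codecs
from pathlib import Path

# The UTF-8 BOM is just three extra bytes at the very start of the file.
with_bom = "é".encode("utf-8-sig")  # b'\xef\xbb\xbf\xc3\xa9'
without_bom = "é".encode("utf-8")   # b'\xc3\xa9'
print(with_bom == codecs.BOM_UTF8 + without_bom)  # True

# Read as a one-byte encoding, those bytes turn into visible junk.
print(with_bom.decode("cp1252"))  # ï»¿Ã©


def convert_to_utf8(path: Path, source_encoding: str = "cp1252") -> None:
    """Hypothetical helper: rewrite one file as UTF-8 without a BOM."""
    raw = path.read_bytes()
    if raw.startswith(codecs.BOM_UTF8):
        # Already UTF-8 with a BOM: keep the text, drop the three BOM bytes.
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    else:
        try:
            text = raw.decode("utf-8")  # already valid UTF-8 (or plain ASCII)
        except UnicodeDecodeError:
            # Assumed to be "ANSI"; this raises if the file contains bytes
            # that cp1252 leaves undefined.
            text = raw.decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


# Example use on a hypothetical project folder:
# for php_file in Path("my_project").rglob("*.php"):
#     convert_to_utf8(php_file)
```

In a PHP file, a BOM sits in front of the opening <?php tag and gets sent to the browser as output, which is one more reason to save without it.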

This does not answer the issue of your sql character encoding. Sorry, I am not much help there.