Moving Character Set to UTF8

         

Knucklehead00

7:05 pm on Jan 7, 2010 (gmt 0)

10+ Year Member



I am working to migrate the data from one database to another and I am running into a plethora of special characters.

The database that I am migrating from is set up with the Latin1 character set (otherwise known as windows-1252, I believe). I am attempting to migrate these special characters to UTF-8, which is the character set of both the new database and the HTTP Content-Type.

To make sure that I am migrating from the proper character set, here is some example text of what I am dealing with:

Dresden, Germany-based KSW Microtec AG, a supplier of RFID components and inlays for secure cards and documents, reports that it has received the MasterCard certification for its facilities and production process.

The company, which offers RFID components and Thinlams for high security applications in the fields of eBanking and eGovernment, also notes that it now “further expands its production environment compliant with the EAL 5+ security standard at its headquarters in Dresden.”

The company says it aims to receive the common criteria certification “within the next months.”

The EAL5+ certification, KSW Microtec says, will enable it to offer a new generation Thinlam with microcontrollers for electronic personal identity cards and ePassport inlays at the highest security level manufactured in Germany.

This is what it looks like on the previous site:

Dresden, Germany-based KSW Microtec AG, a supplier of RFID components and inlays for secure cards and documents, reports that it has received the MasterCard certification for its facilities and production process.

The company, which offers RFID components and Thinlams for high security applications in the fields of eBanking and eGovernment, also notes that it now “further expands its production environment compliant with the EAL 5+ security standard at its headquarters in Dresden.”

The company says it aims to receive the common criteria certification “within the next months.”

The EAL5+ certification, KSW Microtec says, will enable it to offer a new generation Thinlam with microcontrollers for electronic personal identity cards and ePassport inlays at the highest security level manufactured in Germany.

Smart quotes for the win!

In many other instances, there are problems with apostrophes, quotes, and, as shown above, smart quotes.

Does anyone have any idea how to work through this problem? I have spent about 10-12 hours on this and have tried iconv, mb_convert_encoding, and both utf8_decode and utf8_encode, and none of them produced the required output.

Thanks in advance!
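For reference, the garbled sample above has the shape that UTF-8 bytes take when they are decoded as Windows-1252 and then encoded as UTF-8 a second time. The following is a minimal sketch, not a tested migration script: it assumes the mbstring extension is loaded, and the helper names cp1252_to_utf8 and undo_double_encoding are hypothetical. Note that utf8_encode and utf8_decode work with ISO-8859-1, where bytes 0x93 and 0x94 are control codes rather than smart quotes, which may be why they did not help here.

```php
<?php
// Hypothetical helper: convert raw Windows-1252 bytes (what a "latin1"
// column often holds in practice) to UTF-8. Assumes mbstring is available.
function cp1252_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: undo one round of double encoding, i.e. text that was
// already UTF-8 but was converted from Windows-1252 to UTF-8 a second time.
function undo_double_encoding(string $text): string
{
    return mb_convert_encoding($text, 'Windows-1252', 'UTF-8');
}

// Smart quotes stored as single Windows-1252 bytes (0x93 and 0x94).
$original = "\x93further expands\x94";
$utf8 = cp1252_to_utf8($original);
echo bin2hex(substr($utf8, 0, 3)), "\n"; // hex of the opening quote in UTF-8

// Simulate the mistake: text that is already UTF-8 gets converted again.
$mojibake = cp1252_to_utf8("\xE2\x80\x9Cfurther");
echo $mojibake, "\n";

// Reverse the extra conversion and compare with the intended bytes.
echo undo_double_encoding($mojibake) === "\xE2\x80\x9Cfurther"
    ? "repaired\n"
    : "not repaired\n";
```

One caveat: the third UTF-8 byte of the closing quote, 0x9D, has no mapping in Windows-1252, which would explain the replacement character in the sample and means that particular character may not be recoverable by reversing the conversion.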

TheMadScientist

9:01 pm on Jan 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmmmm...

Looks like you're not getting too much help, and I might not be much either, since it's not something I've done before, but why not find out what chars are not working and then just loop through the data and string replace it into a duplicate version of the table with the HTML entity...

EG

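/* If you need to use the LIMIT below, add another col to the table called Updated or something (default 0) and set the value to 1 as each row is fixed, so you can work your way through the DB in batches. */

// Assumes $db is an open mysqli connection; duplicate_table, the column names
// and the batch size of 1000 are placeholders.
$q = $db->query("SELECT id, col_with_bad_chars FROM duplicate_table WHERE Updated = 0 LIMIT 1000");
$u = $db->prepare("UPDATE duplicate_table SET col_with_bad_chars = ?, Updated = 1 WHERE id = ?");

while ($r = $q->fetch_assoc()) {
    // 'Windows Smart Quote Char' stands for whichever character is not working.
    $New_Chars = str_replace('Windows Smart Quote Char', '&quot;', $r['col_with_bad_chars']);
    $id = (int) $r['id'];
    $u->bind_param('si', $New_Chars, $id); // prepared statement handles the escaping
    $u->execute();
}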
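The replacement step on its own can be sketched without a database. This is only an illustration of the string-replace idea: the function name smart_punctuation_to_entities is hypothetical, and it assumes the text arrives as raw Windows-1252 bytes, which depends on how the connection character set is configured.

```php
<?php
// Hypothetical helper: swap the Windows-1252 "smart" punctuation bytes for
// named HTML entities. Assumes $text holds raw Windows-1252 bytes.
function smart_punctuation_to_entities(string $text): string
{
    $map = [
        "\x91" => '&lsquo;', // left single quote
        "\x92" => '&rsquo;', // right single quote, also used as apostrophe
        "\x93" => '&ldquo;', // left double quote
        "\x94" => '&rdquo;', // right double quote
        "\x96" => '&ndash;', // en dash
        "\x97" => '&mdash;', // em dash
    ];

    // strtr replaces every mapped byte in one pass and leaves the rest alone.
    return strtr($text, $map);
}

echo smart_punctuation_to_entities("it now \x93further expands\x94"), "\n";
```

Using one map with strtr keeps all the problem characters in a single place, so more entries can be added as other bad characters turn up.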

rocknbil

3:17 am on Jan 11, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think you might have to set the character collation for your database/tables to match. Obviously, as your samples clearly demonstrate, one can't be rendered in the other.

ALTER DATABASE [dbname] CHARACTER SET latin1 COLLATE latin1_swedish_ci;

ALTER TABLE [tablename] CHARACTER SET latin1 COLLATE latin1_swedish_ci;

The real problem is how they got there in the first place . . . these should be encoded prior to entry in the DB, but sometimes we're just stuck cleaning up the mess . . .

coopster

12:46 pm on Jan 22, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



>>but sometimes we're just stuck cleaning up the mess . . .
:)
Reminds me of an article I read on O'Reilly a few years back regarding utf8 conversion [oreillynet.com].

penders

12:54 pm on Jan 23, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



...why not find out what chars are not working and then just loop through the data and string replace it into a duplicate version of the table with the HTML entity...

But a big advantage of using UTF-8 is that you don't need to use HTML entities. The data is immediately searchable, with no conversion required...
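A small sketch of the searchability point, using made-up variable names and one sentence from the sample text: a plain substring search finds the phrase in the UTF-8 version, but in the entity-encoded version it only matches after the entities have been decoded.

```php
<?php
// The search term: an opening smart quote followed by "within", as UTF-8 bytes.
$needle = "\xE2\x80\x9C" . 'within';

// The same sentence stored two ways (illustrative values only).
$stored_utf8     = 'certification ' . "\xE2\x80\x9C" . 'within the next months.' . "\xE2\x80\x9D";
$stored_entities = 'certification &ldquo;within the next months.&rdquo;';

// Direct search on the UTF-8 text.
var_dump(strpos($stored_utf8, $needle) !== false);

// The same search on the entity-encoded text.
var_dump(strpos($stored_entities, $needle) !== false);

// The entity version has to be decoded before the search can match.
$decoded = html_entity_decode($stored_entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump(strpos($decoded, $needle) !== false);
```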