Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
Handling Char Encodings
XMLMania




msg:1291799
 3:16 pm on Oct 15, 2004 (gmt 0)

Hi,

I need some help converting a variety of char encodings to ISO-8859-1 for storage in a MySQL database.

Currently I'm using a LOOOONG array in a str_replace; however, this is becoming impractical, excessively LONG and VERY tedious.

Any help or pointers would be gratefully received!

 

ergophobe




msg:1291800
 5:10 pm on Oct 15, 2004 (gmt 0)

You could just read the data one row or one line at a time (depending on whether it's in a DB or a file), use the mb_convert_encoding() [php.net] function, and write it back out to an SQL file line by line. Then just upload the SQL file to your DB. I just did that to convert a small DB from ISO-8859-1 to UTF-8 a couple of days ago. On my small DB (only 5000 records and 9MB) it took only a couple of seconds.
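A minimal sketch of that line-by-line approach, not the exact script I ran. The function name convert_stream(), the `news` table, and the in-memory streams standing in for real dump files are all made up for illustration; only fgets(), fwrite() and mb_convert_encoding() are real PHP calls.

```php
<?php
// Hypothetical helper: copy a text stream line by line, converting each line
// from $from to $to. Splitting on newlines is only safe for ASCII-compatible
// encodings such as UTF-8 and ISO-8859-1, where a newline byte never occurs
// inside a multi-byte character.
function convert_stream($in, $out, string $from, string $to): int
{
    $lines = 0;
    while (($line = fgets($in)) !== false) {
        fwrite($out, mb_convert_encoding($line, $to, $from));
        $lines++;
    }
    return $lines; // number of lines converted
}

// Usage: in-memory streams stand in for the source and target SQL files.
$in = fopen('php://memory', 'w+');
fwrite($in, "INSERT INTO news VALUES ('caf\xC3\xA9');\n"); // UTF-8 "café"
rewind($in);

$out = fopen('php://memory', 'w+');
$count = convert_stream($in, $out, 'UTF-8', 'ISO-8859-1');

rewind($out);
$result = stream_get_contents($out); // the "é" is now the single byte 0xE9
```

With real files you would pass handles from fopen('dump.sql', 'r') and fopen('converted.sql', 'w') instead (both file names are placeholders). Because only one line is held in memory at a time, the size of the dump should not matter much.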

You may or may not find this thread on Unicode Support [webmasterworld.com] useful (depending on whether you have the conversion issues solved already or not).

Tom

[edited by: ergophobe at 6:59 pm (utc) on Oct. 15, 2004]

XMLMania




msg:1291801
 5:45 pm on Oct 15, 2004 (gmt 0)

I've looked into the mb_XXX functions, and it seems this is most likely my only option.

Unfortunately the database is 800MB with 600000+ rows :(

ergophobe




msg:1291802
 7:07 pm on Oct 15, 2004 (gmt 0)

So your DB is not even 12x mine and mine was processed in literally under three seconds on a 750MHz Athlon. I bet that on a fast machine, your data could be processed in 20-30 seconds and, I assume, this is a one-time conversion, so who cares if it takes 10 minutes to run?

There are also standalone programs that will convert encodings. You could do an SQL dump, convert the text file with a standalone, and upload it to your DB.

Honestly, though, what's going to take the time is just going through the steps. The 30 seconds it takes to do the character conversion is nothing.

XMLMania




msg:1291803
 7:31 pm on Oct 15, 2004 (gmt 0)

Unfortunately it's on a news search engine, which is constantly parsing XML, returning results, being syndicated, filtering news items, etc. -- all requiring queries, and worst of all... all on a mediocre 2GHz Celeron with half a gig of RAM.

But I WILL have to do what you specified at some point :-(.

And dumping the index is definitely NOT an option; re-indexing on this sissy of a server takes AGES.

Also, 800/9 != 12 lol

XMLMania




msg:1291804
 7:47 pm on Oct 15, 2004 (gmt 0)

I've tested the following:

<?php
$str = file_get_contents($_GET['url']);

echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n";
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN">
<html xml:lang="en">
<head>
<title></title>
</head>

<body>
<?php
echo htmlentities(mb_convert_encoding($str, 'ISO-8859-1', mb_detect_encoding($str)));
?>
</body>
</html>

Is mb_convert_encoding($str, 'ISO-8859-1', mb_detect_encoding($str)) similar to what you used?

ergophobe




msg:1291805
 11:10 pm on Oct 15, 2004 (gmt 0)


Also, 800/9 != 12 lol

What's an order of magnitude and a 25% error between friends? Somehow I was not looking at your post and was remembering different numbers when I wrote that.


Is mb_convert_encoding($str, 'ISO-8859-1', mb_detect_encoding($str)) similar to what you used?

No, for all the reasons outlined in my long post in the thread I referenced.

You're trying to figure out the encoding of form output that is being sent to you as ISO-8859-1 (because that's the encoding your web page declares) but that is, unfortunately, being posted into the form as something else.

You have to make sure that ISO-8859-1 is not a viable encoding to detect; otherwise you will not get any conversion. So you need some idea of which encodings might come in, make a list of the encodings you will test for (none of which is ISO-8859-1), and tell PHP about that list using mb_detect_order(). Otherwise it won't work at all.
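A rough sketch of what I mean, under a few assumptions: to_latin1() is a made-up name, the candidate list here holds only UTF-8 and you would extend it with whatever you expect to receive, and the list is passed straight to mb_detect_encoding() rather than set globally with mb_detect_order(). When the second argument is omitted, mb_detect_encoding() uses the mb_detect_order() list instead.

```php
<?php
// Hypothetical helper: convert a string of uncertain encoding to ISO-8859-1.
// $candidates deliberately does NOT contain ISO-8859-1, because any byte
// sequence is valid ISO-8859-1 and it would therefore always "match".
function to_latin1(string $str, array $candidates = ['UTF-8']): string
{
    // Strict mode: returns false unless $str is valid in one of the candidates.
    $from = mb_detect_encoding($str, $candidates, true);

    if ($from === false) {
        // No candidate matched; assume the bytes are already ISO-8859-1.
        return $str;
    }

    return mb_convert_encoding($str, 'ISO-8859-1', $from);
}

$fromUtf8    = to_latin1("caf\xC3\xA9"); // UTF-8 input is converted
$alreadyDone = to_latin1("caf\xE9");     // invalid as UTF-8, left untouched
```

Characters that have no ISO-8859-1 equivalent cannot survive this conversion; mbstring replaces them with a substitute character, so check a sample of the output before running it over the whole table.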

Look at my messages in the Unicode thread I mentioned previously for more detail.

[edited by: ergophobe at 4:26 pm (utc) on Oct. 16, 2004]

XMLMania




msg:1291806
 9:58 am on Oct 16, 2004 (gmt 0)

That's great, thanks for the help :-)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved