homepage Welcome to WebmasterWorld Guest from 54.242.126.126
register, free tools, login, search, pro membership, help, library, announcements, recent posts, open posts,
Become a Pro Member
Home / Forums Index / Code, Content, and Presentation / PHP Server Side Scripting
Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
Inputting UTF-8 Encoded Data to MySQL
XMLMania

10+ Year Member



 
Msg#: 6565 posted 8:10 pm on Jan 19, 2005 (gmt 0)

I've been having major headaches over inserting UTF-8 encoded data into MySQL 4.1.x

The problem is many characters are being replaced with "?".

Languages I need to support are far eastern languages such as Japanese and Cryllic symbols.

Data is recieved in the form of XML. I'm using PHP5/MySQL 4.1 and libxml to parse XML.

I can ECHO the parsed XML and it appears FINE, but when I insert it into the database and retrieve the data some chracters are mangled.

I know that MySQL 4.1 now has full UTF-8/Unicode support, libxml too. Also note, i'm traversing the XML via DOM, this FORCES libxml to convert the parsed file to UTF-8 :-)

Heres some code:

[code]<?php
ob_start('ob_gzhandler');
header('Content-Type: text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');

include 'System/runCheck.php';

runCheck(date('H'), date('i'));

function processData($str) {
// Snipped this, all it does is replace entitie codes with actual characters.
}

mysql_connect('XXX', 'XXX', 'XXX');
mysql_select_db('XXX');

$timeStart = time();
$feeds = array(ARRAY OF XML FEEDS);

foreach ($feeds as $feedURL) {
$checkTime = round(time() - $timeStart);

if ($checkTime < 150) {
if ($feed = new DOMDocument) {
$feed->strictErrorChecking = false;
$feed->preserveWhiteSpace = false;
$feed->load($feedURL);

switch (strtolower($feed->documentElement->tagName)) {
case 'rss';
$feedType = 'rss';
$feedVersion = $feed->documentElement->getAttribute('version');
$feedItemsElement = 'item';

break;

case 'feed';
$feedType = 'atom';
$feedVersion = $feed->documentElement->getAttribute('version');
$feedItemsElement = 'entry';

break;

case 'rdf:rdf':
$feedType = 'rdf';
$feedVersion = '1.0';
$feedItemsElement = 'item';
}

if (in_array($feedType, array('rss', 'atom', 'rdf'))) {
$feedTitle = $feed->getElementsByTagname('title')->item(0);

if (in_array(strtolower($feedTitle->parentNode->nodeName), array('channel', 'feed'))) $feedTitle = $feedTitle->textContent;

else $feedTitle = '';

echo "<h1>$feedTitle</h1>";

$items = $feed->getElementsByTagname($feedItemsElement);

foreach ($items as $item) {
unset($title, $body, $link);

foreach ($item->childNodes as $node) {
if ($feedType == 'atom') {
switch (strtolower($node->nodeName)) {
case 'title':
$title = processData($node->textContent);

break;

case 'content':
$body = processData($node->textContent);

break;

case 'summary':
if (!$body) $body = processData($node->textContent);

break;

case 'link':
$link = trim($node->getAttribute('href'));

break;
}
}

else {
switch (strtolower($node->nodeName)) {
case 'title':
$title = processData($node->textContent);

break;

case 'description':
$body = processData($node->textContent);

break;

case 'content:encoded':
if (!$body) $body = processData($node->textContent);

break;

case 'link':
$link = trim($node->textContent);

break;
}
}
}

$checksum = sha1($title . $body . $link);
$adCheckURLS = array('spam site url', 'spam site url', 'spam site url', 'spam site url', 'spam site url', 'viewRssAd.php');

unset($ad);

foreach ($adCheckURLS as $adURL) {
$check = strpos(strtolower($link), $adURL);

if ($check) {
$ad = true;

break;
}
}

if (!$ad && substr(strtolower($title), 0, 4)!= 'adv:' && strtolower($title)!= 'spam site') mysql_query("INSERT INTO items (title, feedtitle, body, feedurl, itemurl, lang, checksum, time) VALUES ('" . addslashes($title) . "', '" . addslashes($feedTitle) . "', '" . addslashes($body) . "', '" . addslashes($feedURL) . "', '" . addslashes($link) . "', 'xx', '$checksum', '" . time() . "')");

#echo "<h2 style=\"margin-bottom: 0\">$title</h2><p style=\"color: blue; margin: 0\">$body</p><p style=\"color: green; margin: 0\">$link</p>";
}
}
}
}
}
?>[code]

 

Timotheos

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 6565 posted 11:15 pm on Jan 19, 2005 (gmt 0)

Hi XMLmania,

Is mySQL the default character set for UTF-8?

Do you have something like phpMyAdmin to see if it indeed being inserted with question marks?

Tim

ergophobe

WebmasterWorld Administrator ergophobe us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 6565 posted 5:02 pm on Jan 21, 2005 (gmt 0)

Are you certain that whatever you are using to view the data you retrieve from the DB (mysql shell client? browser with PHPMyAdmin?) is properly interpreting your data as utf-8?

In other words, do you know whether the data is being corrupted on its way into the DB, on its way out, or upon presentation to the screen?

bcolflesh

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 6565 posted 5:06 pm on Jan 21, 2005 (gmt 0)

By default, all the table collations in 4.1.x are:

latin1_swedish_ci

did you change yours to:

utf8_general_ci

?

XMLMania

10+ Year Member



 
Msg#: 6565 posted 10:08 pm on Jan 21, 2005 (gmt 0)

Sorry for the late reply.

Database is fully UTF-8 enabled. Connections, results and data is UTF-8 encoded. (UTF-8 is default).

I'm pretty certain data gets corrupted between coming out of the libxml based parser and being inserted into the database.

Whats wierd is when I copy some Japanese, Korea, Russian etc and create a form in PHP to submit the data via post it all works flawlessly.

Timotheos

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 6565 posted 10:25 pm on Jan 21, 2005 (gmt 0)

Hi XMLMania,

Are you aware of the multibyte string functions [php.net]? I noticed some things in your code like strtolower that (I believe) don't work on multibyte character sets like UTF-8. Could this be the problem?

Tim

XMLMania

10+ Year Member



 
Msg#: 6565 posted 11:09 am on Jan 22, 2005 (gmt 0)

Strtolower() is used on tag names so it doesn't get confused between variations in XML.

The source XML is version 1.0 so tag names won't contain multibyte characters or symbols.

I've asked the people on the MySQL list which told me to set the default encoding to UTF-8.

Would someone possibly test my script?

Global Options:
 top home search open messages active posts  
 

Home / Forums Index / Code, Content, and Presentation / PHP Server Side Scripting
rss feed

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Home ¦ Free Tools ¦ Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About ¦ Library ¦ Newsletter
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved