Spanish characters, simplexml and character encoding

what a nightmare

         

Mike521

9:41 pm on Dec 16, 2008 (gmt 0)

10+ Year Member



I've been wrestling with a Spanish character problem all day. The short version is that I need to send Spanish characters in XML data through HTTP POST without screwing them up, but it seems like SimpleXML doesn't understand them.

We have a process that does something like this:

1. take data from user
2. put data into xml
3. post xml to receiving script
4. use simplexml to read incoming xml
5. make a new xml string for receiving script 2
6. post the reformatted xml to receiving script 2

Complications include HTML special characters, URL encoding and charsets.
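The steps above can be sketched roughly as follows. This is only a sketch under some assumptions: every step stays in UTF-8, the POST field is called `xml`, and the helper names (`buildXml`, `buildPostBody`, `readPostedText`) are made up for illustration. No real HTTP request is sent; the POST body is just passed along as a string.

```php
<?php
// Hypothetical helpers; a sketch that keeps every step in UTF-8.

// Step 2: wrap user input in XML, escaping &, <, > and quotes so the markup stays well-formed.
function buildXml(string $text): string
{
    $escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    return '<?xml version="1.0" encoding="UTF-8"?><data>' . $escaped . '</data>';
}

// Step 3: form-encode the XML as the body of an HTTP POST (field name "xml" is assumed).
function buildPostBody(string $xml): string
{
    return http_build_query(['xml' => $xml]);
}

// Step 4: what the receiving script would do with the raw POST body.
function readPostedText(string $postBody): string
{
    parse_str($postBody, $fields);                 // undoes the URL encoding
    $doc = simplexml_load_string($fields['xml']);  // parses the XML, resolving entities
    return (string) $doc;                          // text of the root <data> element
}

$userInput = "& spanish cháractér's";              // this file is assumed to be saved as UTF-8
$body      = buildPostBody(buildXml($userInput));
$received  = readPostedText($body);

echo $received, "\n";
```

The point of the sketch is that XML escaping and URL encoding are two separate layers, each undone exactly once on the receiving side, and neither one changes the charset.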

I've gotten to step 4 -- I receive an ISO-8859-1 encoded XML string on POST. When I call simplexml_load_string, the Spanish characters turn into useless garbage.

for example:


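$xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>";
$xml .= "<data>spanish cháractérs</data>";
$incomingXML = simplexml_load_string( $xml );
// <data> is the root element, so $incomingXML already is <data>;
// $incomingXML->data would look for a nested <data> child and print nothing
echo $incomingXML;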

output will be: spanish cháractérs
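One thing that may explain the garbage: SimpleXML (through libxml2) converts the document from its declared encoding to UTF-8 when parsing, so the text it hands back is UTF-8 even when the input was ISO-8859-1. A minimal sketch of the example above, with the accented letters written as byte escapes so it does not depend on how the PHP file itself is saved; using `iconv` for the conversion back is my assumption, not something from the original script:

```php
<?php
// "\xE1" and "\xE9" are the single-byte ISO-8859-1 codes for á and é.
$xml  = '<?xml version="1.0" encoding="iso-8859-1"?>';
$xml .= "<data>spanish ch\xE1ract\xE9rs</data>";

$doc  = simplexml_load_string($xml);
$text = (string) $doc;            // <data> is the root, so cast the root itself

// The accented letter comes back as a two-byte UTF-8 sequence.
echo bin2hex("\xE1"), ' -> ', bin2hex(substr($text, 10, 2)), "\n";

// Convert back only if the page really is served as ISO-8859-1.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $text);
```

If the page is served as ISO-8859-1 and `$text` is echoed unconverted, each accented letter shows up as two junk characters, which would match the symptom described.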

I'm using ISO-8859-1 because UTF-8 apparently doesn't handle Spanish characters: every time I try utf8_encode, or htmlentities with UTF-8 as the encoding, the script breaks completely.
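UTF-8 can represent Spanish characters, so the breakage here more likely comes from a mismatch between the bytes and the encoding the function is told to expect. A small sketch of two ways that can happen; `iconv('ISO-8859-1', 'UTF-8', ...)` stands in for utf8_encode, which makes the same ISO-8859-1 assumption about its input:

```php
<?php
$latin1 = "ch\xE1ract\xE9rs";                        // ISO-8859-1 bytes
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);     // correct: convert once

// Converting a string that is already UTF-8 encodes each byte again.
$twice  = iconv('ISO-8859-1', 'UTF-8', $utf8);

// htmlentities() told to expect UTF-8 rejects raw ISO-8859-1 bytes
// and returns an empty string (no ENT_IGNORE / ENT_SUBSTITUTE flag set).
$broken = htmlentities($latin1, ENT_QUOTES, 'UTF-8');
$ok     = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

echo bin2hex($utf8), "\n", bin2hex($twice), "\n";
var_dump($broken, $ok);
```

So the conversion has to happen exactly once, and the encoding argument has to match what the string actually contains.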

Has anyone had to tackle a similar problem?

eelixduppy

10:01 pm on Dec 16, 2008 (gmt 0)



In the page you are echoing this data to, did you set the character encoding?

Mike521

3:18 pm on Dec 22, 2008 (gmt 0)




Yes, I did. I found out today that the HTTP header was set to ISO-8859-1, and it takes precedence over the meta tag. That was part of the problem and is now fixed.
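For anyone hitting the same thing, one way to make the header agree with the meta tag is to send it from PHP before any output. A minimal sketch; the helper name `sendCharsetHeader` is made up, and the server or php.ini (`default_charset`) may also be setting the charset:

```php
<?php
// Hypothetical helper: send the charset in the HTTP header so it agrees with the meta tag.
function sendCharsetHeader(string $charset = 'utf-8'): string
{
    $line = 'Content-Type: text/html; charset=' . $charset;
    if (!headers_sent()) {      // header() only works before any output is sent
        header($line);
    }
    return $line;               // returned so the caller can log or inspect it
}

$sent = sendCharsetHeader();
```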

Mike521

6:07 pm on Dec 22, 2008 (gmt 0)




OK, I've narrowed it down to a couple of remaining problems:

1. urlencode - it *seems* to turn Spanish characters into gibberish
2. simplexml_load_string - it *seems* to break when I send Spanish characters

For example, for #1, I have the following test page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Untitled Document</title>
</head>
<body>
<p>just checking: èá</p>
<?php
$string = "& spanish cháractér's";
$encodedString = urlencode( $string );
$deEncodedString = urldecode( $encodedString );
echo "<p>" . $string . "</p>\r\n";
echo "<p>" . $encodedString . "</p>\r\n";
echo "<p>" . $deEncodedString . "</p>\r\n";
?>
</body></html>

The output for the above on my server is:


just checking: èá
& spanish cháractér's
%26+spanish+ch%C3%A1ract%C3%A9r%27s
& spanish cháractér's

Notice how line 3 appears to be just gibberish (check it online here; if you decode the string, it's garbage): [meyerweb.com...]

But PHP's urldecode turns it right back to normal, so I don't know what the situation is there.
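A possible explanation: urlencode works on bytes, and `%C3%A1` is simply the two UTF-8 bytes of á. Decoding gives the same bytes back, so the result only looks like garbage if something reads those bytes as a single-byte charset. The sketch below assumes that is what the online tool does, which I can't confirm:

```php
<?php
$encoded = '%26+spanish+ch%C3%A1ract%C3%A9r%27s';    // line 3 of the output above
$bytes   = urldecode($encoded);                      // raw bytes, no charset attached

// Read as UTF-8 (what the test page declares), C3 A1 is one character: á.
$asUtf8 = $bytes;

// A tool that treats each byte as ISO-8859-1 sees two characters per accent.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $bytes);

echo $asUtf8, "\n", $asLatin1, "\n";
```

On a page served as UTF-8 the first line displays normally, which would be why PHP's round trip looks fine while the external decoder does not.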