Invalid character in XML file

Help needed.

     
8:14 pm on Dec 10, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 7, 2003
posts: 383
votes: 0


I have been supplied XML data that will not validate because it contains an invalid character.

I need to write a script to replace this character, but I am unable to see what the character is.

I've tried several editors and also tried encoding the XML file with all different character sets, but still can't get it to display.

Textpad shows it as a thick pipe.

Frontpage will only load the document up to the character.

Stylusstudio shows it as a square box.

Anybody know a way to find out what this character is, or a way of removing invalid characters from an XML document?

8:20 pm on Dec 10, 2007 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 31, 2003
posts:9074
votes: 6


Try opening the file in a hex editor; it should help identify the character. You can look up the hex code to see what it is. :)

8:34 pm on Dec 10, 2007 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member httpwebwitch is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 29, 2003
posts:4061
votes: 0


yeah - check it in a hex editor. you might have a "beep" in your XML.

maybe something like PHP's htmlentities() would expose the nasty little thing, though it will also change all your <'s into &lt;'s

9:10 pm on Dec 10, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 21, 2005
posts: 1526
votes: 0


I use a Mac program called "BBEdit." It has a feature called "Zap Gremlins" that is designed for exactly this kind of thing.
9:32 pm on Dec 10, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 7, 2003
posts: 383
votes: 0


Thanks for the input Guys.

Problem is I can't do this locally. As the XML is dynamically pulled, I need to write a PHP script to replace these nasties before it's transformed using XSLT.

I've downloaded a hex editor and this is the result.

00 as shortint: 0
00 3C as word: 15360
00 3C as integer: 15360
00 3C 2F 70 as longint: 1882143744
00 3C 2F 70 as 32 bit IEEE single: 2.16929649071649E29
00 3C 2F 70 6F 73 74 61 as 64 bit IEEE double: 2.87521632282426E161

How do I get an ASCII code from this?
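
Reading the dump above, the first byte is 00, i.e. character code 0 (NUL), and the bytes after it (3C 2F 70) are the ASCII text "</p". A minimal sketch of how PHP could report such bytes directly, assuming the data is already in a string; `find_control_bytes` is a hypothetical helper name, not something from this thread:

```php
<?php
// Hypothetical helper: list every ASCII control byte that XML 1.0 forbids
// (everything below 0x20 except tab 0x09, linefeed 0x0A, carriage return 0x0D).
function find_control_bytes(string $data): array
{
    $found = [];
    // PREG_OFFSET_CAPTURE turns each match into [matched byte, byte offset].
    preg_match_all('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $data, $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[0] as [$byte, $offset]) {
        $found[] = sprintf('offset %d: 0x%02X', $offset, ord($byte));
    }
    return $found;
}

// The first four bytes from the hex editor output: 00 3C 2F 70.
print_r(find_control_bytes("\x00</p"));
```

Each entry gives the position and the hex code, which can then be looked up in an ASCII table.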

10:57 pm on Dec 10, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 7, 2003
posts: 383
votes: 0


$response = iconv("UTF-8","UTF-8//IGNORE",$response);

Found this Guys. Works a treat.

Thanks for the help.

12:28 am on Dec 11, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 7, 2003
posts: 383
votes: 0


FALSE ALARM

It echoed the XML fine, but when I write it to a file it still has the same problem.
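
A possible explanation, offered as an assumption rather than something confirmed in this thread: a NUL byte (0x00) is itself valid UTF-8, so iconv with //IGNORE has nothing to discard and passes it through unchanged. A small sketch with a made-up sample string:

```php
<?php
$response = "<a>ok\0</a>"; // hypothetical sample with an embedded NUL byte

// An empty pattern with the /u flag only matches when the subject is valid UTF-8.
$isValidUtf8 = preg_match('//u', $response) === 1;

// Same call as in the post above; it only drops byte sequences that are invalid UTF-8.
$converted = iconv('UTF-8', 'UTF-8//IGNORE', $response);

// Check whether the NUL byte is still present after the conversion.
$stillHasNul = strpos($converted, "\0") !== false;

var_dump($isValidUtf8, $stillHasNul);
```

If both values come out true, the string is valid UTF-8 and still not valid XML, so the character has to be removed explicitly.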

12:59 am on Dec 11, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 7, 2003
posts: 383
votes: 0


$response=str_replace("\0","",$response);

This fixed it.

Funny thing I was searching on Google for this problem and this thread has been indexed already. Less than 2hrs.

I think Brett must have his own Googlebot parked over there. ;-)

4:08 am on Dec 11, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 21, 2005
posts: 1526
votes: 0


The principal mission of this site is SEO, so don't be surprised at how well it does.

That will work for this one case, but you might want to have a more general RegEx for "XML Clean" code.

The Spec [w3.org] says that we can use 0x0009 (tab), 0x000A (linefeed), 0x000D (carriage return), 0x0020 - 0xD7FF, 0xE000 - 0xFFFD or 0x10000 - 0x10FFFF.

0x0000 ain't in there. The same goes for a whole lot of other characters.

I have to get some shut-eye, so I can't whip up the required RegEx, but it should be fairly straightforward using these rules.
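
For anyone who wants that RegEx, here is a minimal sketch built on the character ranges the XML 1.0 spec allows (tab, linefeed, carriage return, 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF). It assumes the input is UTF-8; `xml_clean` is a hypothetical function name, and throwing on invalid UTF-8 is just one possible choice:

```php
<?php
// Hypothetical helper: remove every character outside the XML 1.0 Char ranges.
function xml_clean(string $xml): string
{
    // The /u flag makes PCRE treat pattern and subject as UTF-8 code points.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    $clean = preg_replace($pattern, '', $xml);
    if ($clean === null) {
        // preg_replace() returns null on error, e.g. when the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

// Usage: strip a NUL and a 0x03 from a small fragment.
echo xml_clean("<p>a\0b\x03c</p>\n");
```

This removes the characters outright; replacing them with a space or a placeholder instead would only need a different second argument to preg_replace().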

2:12 am on Jan 15, 2008 (gmt 0)

New User

5+ Year Member

joined:Jan 15, 2008
posts: 1
votes: 0


I had the same prob! I fixed it with this little clearing.php script:


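<?php
$input = "input.xml";
$file = fopen($input, "r") or exit("Unable to open file!");
$writefile = fopen("prep_" . $input, "w") or exit("Unable to open file!");
// fgetc() returns false at end of file, so the loop stops cleanly.
while (($character = fgetc($file)) !== false)
{
    $code = ord($character);
    // Keep tab (9), linefeed (10), carriage return (13) and everything from 32 up;
    // the other control characters are not allowed in XML 1.0.
    if ($code == 9 || $code == 10 || $code == 13 || $code >= 32) {fwrite($writefile, $character);}
    else {echo "Illegal Char Found: " . $code . " <br>";}
}
fclose($file);
fclose($writefile);
?>

It reads the XML one character at a time and writes each one into a new XML file, skipping control characters below 32 other than tab, linefeed and carriage return.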

I had some illegal chars value 3 in there, so this works fine!

Later

2:38 am on Jan 15, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 21, 2005
posts: 1526
votes: 0


Thanks, and welcome to WebmasterWorld!

It's great that your inaugural post to WebmasterWorld is a solution, not a question.

10:15 am on Jan 15, 2008 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 7, 2003
posts: 383
votes: 0


Welcome dude212 and thank you, super solution.
 
