preg replace problems processing XML data

         

gnetcon

11:46 pm on Jul 30, 2009 (gmt 0)



Hello, all!

I apologize if this has been asked and answered. I couldn't find a good solution anywhere. Thought I'd ask the experts. : )

I'm parsing an XML file, and I have run across a problem in the output. SimpleXML is having "issues" because some of the data contains raw "&", "<", ">", etc.

What I'm looking for is a regex pattern to replace all of the special characters in the text BETWEEN the XML tags, and not in the tags themselves. I tried htmlentities() on the whole file, but that screws up all of the XML.

So, I need:

<tag name="tag">Replace the &lt; and &amp; signs but not the ...</tag>

Any guidance? TIA!

eelixduppy

1:50 am on Jul 31, 2009 (gmt 0)



How are you getting the XML file? Are you making it yourself? If you can change the way the XML is created you can avoid this problem altogether.
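
For anyone who does control the generating side, here is a minimal sketch of what "changing the way the XML is created" could look like in PHP: build the document with DOMDocument and let it escape the text on output. The `root` wrapper element and the sample text are assumptions made up for illustration; the `tag` element mirrors the example in the first post.

```php
<?php
// Sketch only: DOMDocument escapes text content when the document is serialized.
$doc  = new DOMDocument('1.0', 'UTF-8');
$root = $doc->createElement('root');          // hypothetical wrapper element
$doc->appendChild($root);

$tag = $doc->createElement('tag');
$tag->setAttribute('name', 'tag');
// createTextNode() takes the raw text; saveXML() writes it out escaped.
$rawText = 'Replace the < and & signs';
$tag->appendChild($doc->createTextNode($rawText));
$root->appendChild($tag);

$out = $doc->saveXML();
echo $out;

// The result can be read back without errors.
$check = simplexml_load_string($out);
```

Because the escaping happens in the writer, the consumer never sees a raw "<" or "&" in text content.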

gnetcon

4:25 am on Jul 31, 2009 (gmt 0)



It's provided by a 3rd party program that cannot be changed. I just load the contents of the file into SimpleXML.

gnetcon

2:32 pm on Jul 31, 2009 (gmt 0)



An update:

If this helps, this is the exact problem I'm having:


<proto name="packet">This is the <&> text that needs escaped.</proto>

So, anything inside the <proto name="packet"></proto> needs to be escaped. Everything else is fine.
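
To see what the parser complains about, a small sketch like the following may help. It feeds the line above to SimpleXML and prints the libxml errors instead of failing silently. The `root` wrapper element is an assumption added so the sample is a complete document.

```php
<?php
// Sketch only: show libxml rejecting a raw "<&>" inside text content.
$bad = '<root><proto name="packet">This is the <&> text that needs escaped.</proto></root>';

libxml_use_internal_errors(true);   // collect parse errors instead of emitting warnings
$parsed = simplexml_load_string($bad);

if ($parsed === false) {
    // Each entry is a LibXMLError object with message, line and column.
    foreach (libxml_get_errors() as $error) {
        echo 'line ', $error->line, ': ', trim($error->message), "\n";
    }
    libxml_clear_errors();
}
```

Checking for a `false` return and dumping `libxml_get_errors()` is usually quicker than guessing from a blank page.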

gnetcon

3:55 pm on Jul 31, 2009 (gmt 0)



Update #2:

Well, the regex was really super simple. The following gathers what I need:


preg_match_all("/<proto name=\"packet\">(.*?)<\/proto>/is", $xml, $matches);

From this I get my array of matches. Now my problem is updating the data. I tried this:


foreach ($matches[1] as $match) {
    $replaced = str_replace(array("&", "<", ">"), array("&amp;", "&lt;", "&gt;"), $match);
    $xml = str_replace($match, $replaced, $xml);
}

I just get a blank page.

What would be the fastest way to update these items? This file is currently 3 MB, but it could be MUCH larger.

TIA!
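
Since the pattern already finds the right spans, one option is to do the match and the replacement in a single pass with preg_replace_callback(), rather than searching the whole document again for each match. This is a sketch under assumptions, not a diagnosis of the blank page: the function name `escapePacketContent` and the `root` wrapper in the sample are made up for illustration.

```php
<?php
// Sketch only: escape the content of every <proto name="packet"> element in one pass.
function escapePacketContent($xml)
{
    return preg_replace_callback(
        '/(<proto name="packet">)(.*?)(<\/proto>)/is',
        function ($m) {
            // $m[1] = opening tag, $m[2] = raw content, $m[3] = closing tag.
            // ENT_NOQUOTES: escape only &, < and >; leave quotes alone.
            return $m[1] . htmlspecialchars($m[2], ENT_NOQUOTES) . $m[3];
        },
        $xml
    );
}

$sample = '<root><proto name="packet">This is the <&> text that needs escaped.</proto></root>';
$fixed  = escapePacketContent($sample);
echo $fixed, "\n";

// The repaired string now loads, and the text comes back unescaped.
$doc = simplexml_load_string($fixed);
echo (string) $doc->proto, "\n";
```

Two things worth checking on a multi-megabyte file: preg_replace_callback() returns null when PCRE hits an error such as the backtrack limit, so test the return value before using it; and if some packets might already contain entities like `&amp;`, the fourth argument of htmlspecialchars() (`double_encode`) can be set to false so they are not escaped twice.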

idfer

7:54 pm on Jul 31, 2009 (gmt 0)



Considering the size of your text, you may be better off rolling your own search and replace with strpos() and substr(). Here's an example:

$tagStart = '<proto name="packet">';
$tagEnd = '</proto>';
$tagStartLen = strlen($tagStart);

$corrected = '';
$ptr = 0;
while (($posStart = strpos($xml, $tagStart, $ptr)) !== false) {
    // Append text up to and including tagStart, advance ptr.
    $corrected .= substr($xml, $ptr, $posStart - $ptr + $tagStartLen);
    $ptr = $posStart + $tagStartLen;

    // Find tagEnd.
    $posEnd = strpos($xml, $tagEnd, $ptr);
    if ($posEnd === false) {
        // No tagEnd: treat everything up to the end of the document as content.
        $posEnd = strlen($xml);
    }

    // Append escaped tag content, advance ptr to beginning of tagEnd.
    // htmlspecialchars() with ENT_NOQUOTES escapes only &, < and >;
    // htmlentities() would also emit named entities such as &eacute;,
    // which XML does not define.
    $corrected .= htmlspecialchars(substr($xml, $ptr, $posEnd - $ptr), ENT_NOQUOTES);
    $ptr = $posEnd;
}

// Append rest of text.
$corrected .= substr($xml, $ptr);

Hope this helps.

gnetcon

10:19 pm on Sep 5, 2009 (gmt 0)



idfer:

That code worked well. I ended up working with the developer to make sure the code is now escaped, but this should help anyone else who might have the same issue.