Forum Moderators: coopster
Today I tried to parse an HTML document in order to inject some content at certain places. First I tried the native DOMDocument object. The problem with it is that it gets lost with CDATA sections and script comments like:
<script type='text/javascript'>/*<![CDATA[*/
var urls = new Array();
urls[0] = 'blablah';
// urls[...] = ...
var url = urls[Math.floor(Math.random()*urls.length)];
document.write('<div class="banner" style="background:transparent url('+url+') top left no-repeat;"></div>');
/*]]>*/</script>
I don't know whether it happens while parsing or while printing the DOM tree back out, but I get different results and my page is simply broken. This happens without doing anything to the DOM after loading it, just:
$document = new DOMDocument();
@$document->loadHTMLFile('page.html');
echo $document->saveHTML();
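In case it helps frame the question, here is the kind of workaround I was considering: pull the `<script>` bodies out with placeholders before parsing, so libxml never sees the CDATA tricks, then swap them back in after `saveHTML()`. This is only a sketch under my assumptions (the placeholder format `@@SCRIPTn@@` is something I made up, and it assumes the page lives in `page.html`):

```php
<?php
$html = file_get_contents('page.html');

// Stash each <script> body behind a placeholder libxml won't touch.
$scripts = array();
$html = preg_replace_callback(
    '#(<script\b[^>]*>)(.*?)(</script>)#is',
    function ($m) use (&$scripts) {
        $key = '@@SCRIPT' . count($scripts) . '@@';
        $scripts[$key] = $m[2];       // remember the original body
        return $m[1] . $key . $m[3];  // leave the placeholder behind
    },
    $html
);

$document = new DOMDocument();
@$document->loadHTML($html);

// ... inject content into the tree here ...

// Restore the script bodies in the serialized output.
echo strtr($document->saveHTML(), $scripts);
```

But that feels fragile (regexes over HTML, nested edge cases), which is why I'm asking whether there's a cleaner way.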
For reference, I'm using libxml 2.7.3. I don't know if there's a known issue about this.
So, do you guys know a safe way to parse HTML documents that might not be completely strict, modify them slightly, and then render them back out?
I had a look at HTML Purifier, but it seems complicated at first glance, and I don't know whether it would help me keep working with the DOM API...
Or maybe there are some good lenient HTML parsers written in PHP?
Any idea would be great! :)
Thanks!