I want a way to parse it, to get all the goodies inside.
I want to find the <title>, I want to loop through all the <p>'s and pay special attention to the <blockquote>s, I want to define a callback function to do something special with the <h1>s and <h2>s and so on.
The thing is, this document might not be valid XML... I actually don't know. I've been handed a huge pile of old documents - we're talking thousands of *.HTM files, and I need to get them into a CMS with as little pain as possible.
Can anyone point me in the right direction?
Solutions that use only basic PHP5 features (i.e. no PEAR extensions) are preferred; I don't want to have to bug my ISP to install a new PHP config.
Anyone else have any ideas?
So,
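// Swap <br>, <br/> and <br /> for a placeholder so strip_tags() leaves the line breaks alone
$NewHTMLstring=preg_replace('#<br\s*/?>#i','¦*br*¦',$HTMLstring);
// Strip everything except the tags we want to match on
$NewHTMLstring=strip_tags($NewHTMLstring,'<title><p><blockquote><html><body><h1><h2>');
// Each capture stops at the next tag; the captured text ends up in index 1
preg_match('#<title>([^<]+)<[\w/]+>#i',$NewHTMLstring,$title);
preg_match_all('#<blockquote>([^<]+)<[\w/]+>#i',$NewHTMLstring,$blockquote);
preg_match_all('#<p>([^<]+)<[\w/]+>#i',$NewHTMLstring,$p);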
And so on...
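To make the steps above a bit more concrete, here is a rough sketch run against a made-up sample page. The sample HTML, the handle_heading() function and the "imported" class name are all invented for illustration, and the <br> pattern is loosened slightly so it also catches <br />. It sticks to plain PHP5 functions (no closures, no extensions beyond PCRE).

```php
<?php
// Invented sample input, standing in for one of the old *.HTM files.
$HTMLstring = '<html><head><title>Old Page</title></head><body>'
    . '<h1>Heading</h1><p>First line<br />second line</p>'
    . '<blockquote>A quoted passage</blockquote><p>Last one</p></body></html>';

// 1. Protect the line breaks with a placeholder so strip_tags() ignores them.
$clean = preg_replace('#<br\s*/?>#i', '¦*br*¦', $HTMLstring);

// 2. Drop every tag except the ones we want to match on.
$clean = strip_tags($clean, '<title><p><blockquote><html><body><h1><h2>');

// 3. Loop through the paragraphs and turn the placeholder back into <br />.
preg_match_all('#<p>([^<]+)</p>#i', $clean, $p);
$paragraphs = array();
foreach ($p[1] as $text) {
    $paragraphs[] = str_replace('¦*br*¦', '<br />', $text);
}

// 4. Hypothetical callback that does "something special" with <h1>/<h2>:
//    here it only adds a class attribute.
function handle_heading($m)
{
    return '<' . $m[1] . ' class="imported">' . $m[2] . '</' . $m[1] . '>';
}
$withHeadings = preg_replace_callback('#<(h[12])>([^<]+)</\1>#i', 'handle_heading', $clean);
```

$paragraphs then holds the text of each <p>, and $withHeadings is the cleaned string with the headings rewritten by the callback.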
If you want to keep any links or other tags, I would convert them the same way as the <br>s before calling strip_tags(); they can then easily be converted back once you have made the matches you need...
You could do the same with any other tags you want to leave in place but that should not count as the 'close' of an opening tag. E.g. <b>, <strong> and <span> could easily be replaced with ¦*tag*¦ ¦*/tag*¦; they would then not be stripped by strip_tags() or matched as 'the close of a tag', and could be reversed easily later. You could probably use BBCode for the 'change and reverse' tags too.
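A possible way to do that 'change and reverse' for inline tags is sketched below. protect_tags() and restore_tags() are hypothetical helper names, not anything built into PHP, and the sample string is made up; treat it as one way to do it rather than the way.

```php
<?php
// Hypothetical helper: turn <b>, </b>, <span class="x"> etc. into
// ¦*b*¦, ¦*/b*¦, ¦*span class="x"*¦ so strip_tags() no longer sees them as tags.
function protect_tags($html, $tags)
{
    $alt = implode('|', array_map('preg_quote', $tags));
    // \b keeps "b" from also matching <br> or <body>.
    return preg_replace('#<(/?(?:' . $alt . ')\b[^>]*)>#i', '¦*$1*¦', $html);
}

// Hypothetical helper: turn the placeholders back into real tags.
function restore_tags($text)
{
    return preg_replace('#¦\*(/?[a-z].*?)\*¦#is', '<$1>', $text);
}

// Invented sample paragraph with inline markup we want to keep.
$html = '<p>Some <b>bold</b> and <span class="x">styled</span> text</p>';

$protected = protect_tags($html, array('b', 'strong', 'span'));
$protected = strip_tags($protected, '<p>');

// The inline tags no longer contain "<", so [^<]+ runs to the real </p>.
preg_match('#<p>([^<]+)</p>#i', $protected, $m);
$paragraph = restore_tags($m[1]);
```

$paragraph ends up as the inner text of the <p> with the <b> and <span> markup back in place.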
I might actually convert to two different custom 'tag styles' to reduce the chance of 'closing' a match too soon: replace with the custom string I used above, switch known 'opening tags that close preceding tags' to BBCode style, and change the expression to reflect the new style. IMO you're more likely to run into a non-HTML < in the middle of a text string than a [, but it depends to some extent on what you're working with.
I guess the short version is: replace the actual HTML with 'custom delimiters' to make matching easy, and leave yourself a way to convert back to HTML easily.
* I hope all this rambling makes at least a bit of sense to someone. LOL.
The problem with strip_tags(), however, is that if a tag is not closed, it will remove more than it should. There are some other situations where it would remove more, too. The HTML still has to be well formed for this to work, but it's not a bad idea.
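A small illustration of that failure, plus one possible pre-check so the worst files can be set aside for a manual look. The sample strings are invented and has_unclosed_tag() is a hypothetical helper; it only catches the simple case of a "<tagname" that never reaches its ">", so treat it as a rough filter, not a validator.

```php
<?php
// Invented sample: "<b" is never closed with ">", so strip_tags()
// treats everything after it as part of the tag.
$broken   = 'Intro text <b and the rest of the sentence';
$stripped = strip_tags($broken);   // the text after "<b" is lost

// Hypothetical pre-check: a "<" that starts a tag name but runs into
// another "<" or the end of the input before any ">".
function has_unclosed_tag($html)
{
    return preg_match('#<[a-z/!][^<>]*(?=<|\z)#i', $html) === 1;
}

$needsReview = has_unclosed_tag($broken);
// A bare "<" followed by a space is not treated as a tag start.
$looksFine   = !has_unclosed_tag('<p>5 < 6 is fine</p>');
```

Files where $needsReview comes back true could be logged and fixed by hand before running them through strip_tags().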