Forum Moderators: coopster

PHP to parse HTML?

         

httpwebwitch

8:57 pm on Dec 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Say, I load a big HTML document in as a string.

I want a way to parse it, to get all the goodies inside.
I want to find the <title>, loop through all the <p>s, pay special attention to the <blockquote>s, and define a callback function to do something special with the <h1>s, <h2>s and so on.

The thing is, this document might not be valid XML... I actually don't know. I've been handed a huge pile of old documents - we're talking thousands of *.HTM files, and I need to get them into a CMS with as little pain as possible.

Can anyone point me in the right direction?

Solutions that use only basic PHP5 stuff (i.e. no PEAR extensions) are preferred; I don't want to have to bug my ISP to install a new PHP config.

eelixduppy

1:28 am on Dec 3, 2009 (gmt 0)



I cannot think of anything really simple that would work every time. If the HTML were valid XML this would be easy. Would it be worth coding for it as if it were correct XML and then seeing how much gets ported over? I'm afraid that if you cannot pin down exactly what to expect in the files (i.e. valid XML), you can't be 100% sure that you get everything.

Anyone else have any ideas?

TheMadScientist

2:02 am on Dec 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'd think about replacing all the <br> tags with a custom string such as [ br ] or ¦*br*¦ to get them out of the way, then using strip_tags() to remove every tag except the ones you want to match. After that, match each section from its opening tag to the next opening or closing tag, which is where the element would be 'closed' even if the markup isn't valid. Here's an idea of what I'm thinking:

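// Swap <br>, <br/> and <br /> for a placeholder so they survive strip_tags()
$NewHTMLstring=preg_replace('#<br\s*/?>#i','¦*br*¦',$HTMLstring);
// Keep only the tags we want to match against
$NewHTMLstring=strip_tags($NewHTMLstring,'<title><p><blockquote><html><body><h1><h2>');

// Capture the text from each opening tag up to the next tag of any kind
preg_match('#<title>([^<]+)<[\w/]+>#i',$NewHTMLstring,$title);
preg_match_all('#<blockquote>([^<]+)<[\w/]+>#i',$NewHTMLstring,$blockquote);
preg_match_all('#<p>([^<]+)<[\w/]+>#i',$NewHTMLstring,$p);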

And so on...

If you want to leave any links or other tags I would change those the same as the <br>s before the strip_tags(), then they can easily be converted back after any matches you need are made...

You could do the same thing with any other tags you would like to leave in place but don't want treated as the 'close' of an opening tag. E.g. <b>, <strong> and <span> could easily be replaced with ¦*tag*¦ and ¦*/tag*¦; they would then not be stripped by strip_tags() or matched as 'the close of a tag', and could be reversed easily later. You could probably use BBCode for the 'change and reverse' tags too.

I might actually convert to two different custom 'tag styles' to reduce the chance of 'closing' a match too soon: inline tags get the custom string I used above, while known 'opening tags that close preceding tags' are switched to BBCode style, with the expression changed to reflect the new style. IMO you're more likely to run into a non-HTML < in the middle of a text string than a [, but it would depend to some extent on what you're working with.

I guess the short version is replace actual HTML with 'custom delimiters' to make matching easy and leave yourself a way to convert back to HTML easily.
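
For anyone who wants to try the 'custom delimiters' round trip, here is a rough sketch of one way it could look. The helper names protect_tags() and restore_tags() are made up for this example, and it uses BBCode-style [ ] placeholders rather than the ¦*tag*¦ form. It assumes the text itself contains no square brackets that look like tags, so treat it as a starting point, not a finished tool.

```php
<?php
// Hypothetical helpers: swap chosen tags for [bracket] placeholders and back.

function protect_tags($html, array $tags)
{
    // Build an alternation such as "b|br" from the tag names.
    $names = implode('|', array_map('preg_quote', $tags));
    // <b class="x"> becomes [b class="x"], </b> becomes [/b].
    return preg_replace('#<(/?(?:' . $names . ')\b[^<>]*)>#i', '[$1]', $html);
}

function restore_tags($text, array $tags)
{
    $names = implode('|', array_map('preg_quote', $tags));
    // Reverse of protect_tags(): [b] becomes <b>, [/b] becomes </b>.
    return preg_replace('#\[(/?(?:' . $names . ')\b[^\[\]]*)\]#i', '<$1>', $text);
}

$keep = array('b', 'br');
$html = '<div><p>Some <b>bold</b> text<br />more</p></div>';

// 1. Hide the inline tags we want to survive.
$protected = protect_tags($html, $keep);
// 2. Strip everything except the tags we match against.
$stripped = strip_tags($protected, '<p>');
// 3. Match from the opening tag to the next tag of any kind.
preg_match_all('#<p>([^<]+)<[\w/]+>#i', $stripped, $matches);
// 4. Turn the placeholders back into HTML.
$paragraphs = array();
foreach ($matches[1] as $text) {
    $paragraphs[] = restore_tags($text, $keep);
}

print_r($paragraphs);
```

The \b after the tag name is there so that protecting b does not also catch <body> or <blockquote>.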

* I hope all this rambling makes at least a bit of sense to someone. LOL.

eelixduppy

5:01 am on Dec 3, 2009 (gmt 0)



>> strip_tags

The problem with strip_tags(), however, is that if a tag is left unclosed it will remove more than it should. There are some other situations where it removes too much as well. The HTML still has to be reasonably well formed for this approach to work, but it's not a bad idea.
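
A quick way to see the effect is to run the same text through strip_tags() once with intact markup and once with a tag whose closing > has gone missing. The two sample strings are invented for this illustration.

```php
<?php
// Two made-up inputs: the second one has lost the ">" of its <b> tag.
$intact = 'Price: <b>10</b> dollars';
$broken = 'Price: <b 10 dollars';

// Well-formed input: only the tags are removed.
$fromIntact = strip_tags($intact);

// Unclosed tag: strip_tags() never sees the end of the tag,
// so the text after the "<" is dropped along with it.
$fromBroken = strip_tags($broken);

var_dump($fromIntact);
var_dump($fromBroken);
```

With thousands of old files it may be worth comparing the text length before and after stripping and flagging any file that shrinks far more than expected.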

vincevincevince

6:26 am on Dec 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



First:
[php.net...]
Second:
[php.net...]
Third:
Loop through the results and output what you need
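
The links above are cut off, so the sketch below is only one guess at a 'load it, query it, loop over the results' approach: it assumes PHP's built-in DOMDocument class, whose loadHTML() method accepts markup that is not valid XML. The function name visit_tags() and the handler array are made up for the example, and the closure needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper: call one callback per matching element.
function visit_tags($html, array $handlers)
{
    $doc = new DOMDocument();

    // Old documents are rarely valid, so collect parser warnings quietly
    // instead of letting them be printed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($handlers as $tag => $handler) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            // Pass the trimmed text and the node itself to the callback.
            call_user_func($handler, trim($node->textContent), $node);
        }
    }
}

// Sample input with an unclosed <p>, as old hand-written pages often have.
$html = '<html><head><title>Old page</title></head><body>'
      . '<h1>Heading</h1><p>First para<p>Second para</p>'
      . '<blockquote>Quoted</blockquote></body></html>';

$found = array();
$collect = function ($text, $node) use (&$found) {
    $found[$node->nodeName][] = $text;
};

visit_tags($html, array(
    'title'      => $collect,
    'h1'         => $collect,
    'p'          => $collect,
    'blockquote' => $collect,
));

print_r($found);
```

Each tag can have its own callback, so the special handling for headings or blockquotes would go in separate functions instead of the single $collect used here.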

TheMadScientist

6:26 am on Dec 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Interesting... I actually haven't had the need to use strip_tags, so that's good information to have.

httpwebwitch

5:33 pm on Dec 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



vvv, that looks very promising! I'll try it.

coopster

3:30 pm on Dec 5, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



You may also consider using the tidy extension.
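
A minimal sketch of what that might look like, assuming the tidy extension is available on the host (it is not part of every PHP build, so the function falls back to returning the input unchanged). The name repair_html() and the option values are choices made for this example only.

```php
<?php
// Hypothetical helper: clean up markup with tidy when it is installed.
function repair_html($html)
{
    if (!function_exists('tidy_repair_string')) {
        // tidy is not loaded on this host; hand the markup back untouched.
        return $html;
    }

    $config = array(
        'output-xhtml' => true, // ask for well-formed XHTML output
        'wrap'         => 0,    // do not re-wrap long lines
    );

    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? $html : $repaired;
}

// Sample input with unclosed <p> tags.
$messy = '<title>Old page</title><p>First para<p>Second para';
echo repair_html($messy);
```

The repaired output can then be fed to whichever parser is being used, which should make the 'is it valid XML?' question less of a worry.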