Robust Parsing of HTML

I am trying to extract different elements on a page and I was hoping for a little advice so that I can make the scripts as robust as possible.

I have created a simple test page that has keyword and description meta tags along with a <title>, an <h1> and two <p> sections. I first read the page into a string, then parse out the different sections and echo them so I know the extraction works:

// (.*?) is non-greedy so it stops at the first </title>; /s lets . match newlines
preg_match("/<title[^>]*>(.*?)<\/title>/is", $html, $tag_contents);
$title = $tag_contents[1];
echo $title;

Now I am having trouble when the markup gets a little more complicated, for example when a tag carries attributes (<h1 class="xyz">text</h1>).
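To make the attribute case concrete, here is a minimal sketch of the kind of pattern I think is needed, assuming the tag is not nested inside another tag of the same name. The sample markup and the variable $heading are made up for illustration:

```php
<?php
// Made-up sample markup for illustration.
$html = '<html><body><h1 class="xyz">Main heading</h1></body></html>';

// [^>]* allows any attributes inside the opening tag,
// (.*?) is non-greedy, /s lets . match newlines, /i ignores case.
if (preg_match('/<h1\b[^>]*>(.*?)<\/h1>/is', $html, $tag_contents)) {
    $heading = trim($tag_contents[1]);
    echo $heading, "\n";
}
```

The \b after the tag name is there so that a pattern for a short tag name does not accidentally match a longer one that starts with the same letters.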

Also, if there are two <p> blocks of text, it only finds and prints the first one.
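As far as I can tell this is because preg_match stops after the first match, whereas preg_match_all collects every match. A small sketch, again with made-up sample markup:

```php
<?php
// Made-up sample markup with two paragraphs.
$html = '<p>First paragraph.</p><p class="intro">Second paragraph.</p>';

// preg_match_all fills $matches[1] with the contents of every <p> block.
preg_match_all('/<p\b[^>]*>(.*?)<\/p>/is', $html, $matches);
$paragraphs = $matches[1];

foreach ($paragraphs as $i => $text) {
    echo ($i + 1), ': ', trim($text), "\n";
}
```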

On a slightly different note, is there a recommended order for parsing an HTML page? Here is my proposed order:

1. get the page into a string
2. remove everything in <script> tags
3. extract meta tags with a regex rather than the meta tag string function (which is apparently slower)
4. extract all text between the desired tags (<title>, <h1>, <p> etc)
5. remove any other HTML from the results of step 4 (eg <b> tags, <br> etc)
6. echo the desired variables to screen to check that they work
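The steps above can be sketched roughly as follows. This is only an outline under some assumptions: the page is an inline sample rather than a fetched URL, the meta regex assumes the name attribute comes before content, and extract_tag is a hypothetical helper name of my own:

```php
<?php
// Step 1: page in a string (inline sample here instead of a real fetch).
$html = <<<HTML
<html><head>
<title>Sample Page</title>
<meta name="keywords" content="php, regex, parsing">
<meta name="description" content="A small test page.">
<script type="text/javascript">var x = "<p>not content</p>";</script>
</head><body>
<h1 class="xyz">Main <b>heading</b></h1>
<p>First <b>paragraph</b>.<br></p>
<p>Second paragraph.</p>
</body></html>
HTML;

// Step 2: drop <script> blocks so their contents are never matched later.
$html = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);

// Step 3: meta tags (assumes name="..." appears before content="...").
preg_match_all(
    '/<meta\s+name=["\']([^"\']+)["\']\s+content=["\']([^"\']*)["\'][^>]*>/i',
    $html,
    $rows,
    PREG_SET_ORDER
);
$meta = array();
foreach ($rows as $row) {
    $meta[strtolower($row[1])] = $row[2];
}

// Steps 4 and 5: hypothetical helper that returns the text of every
// occurrence of a tag, with any inner markup removed by strip_tags().
function extract_tag($html, $tag)
{
    $name = preg_quote($tag, '/');
    preg_match_all('/<' . $name . '\b[^>]*>(.*?)<\/' . $name . '>/is', $html, $found);
    return array_map(function ($s) {
        return trim(strip_tags($s));
    }, $found[1]);
}

$title      = extract_tag($html, 'title');
$headings   = extract_tag($html, 'h1');
$paragraphs = extract_tag($html, 'p');

// Step 6: echo everything to check it worked.
print_r($meta);
print_r($title);
print_r($headings);
print_r($paragraphs);
```

Removing the scripts first matters in this sample, because the script body contains a <p> that would otherwise be picked up as page text.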

Is preg_match the way forward for what I propose?

I have read this regex information page [etext.lib.virginia.edu] but I am still struggling.

Cheers and thanks in advance.

ukgimp
