Robust Parsing of HTML

         

ukgimp

11:54 am on Jan 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am trying to extract different elements on a page and I was hoping for a little advice so that I can make the scripts as robust as possible.

I have created a simple test page with keyword and description meta tags, a <title>, an <h1> and two <p> sections. I first read the page into a string, then parse out each section and echo it to check that the pattern works:

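// Non-greedy (.*?) stops at the first </title>; the "s" modifier lets "." span newlines.
if (preg_match("/<title>(.*?)<\/title>/is", $html, $tag_contents)) {
    $title = trim($tag_contents[1]);
    echo $title;
}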

Now I am having trouble when the markup gets a little more complicated, e.g. <h1 class="xyz">text</h1>.
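One likely cause is that a pattern written for a bare <h1> requires the opening tag to be exactly that, so any attribute breaks the match. A minimal sketch of a pattern that tolerates attributes; the sample markup and the $heading variable are illustrative and not from the original script:

```php
<?php
// Illustrative sample markup; in the real script $html holds the fetched page.
$html = '<h1 class="xyz">Main heading</h1>';

// \b stops the pattern matching longer tag names; [^>]* skips any attributes.
if (preg_match('/<h1\b[^>]*>(.*?)<\/h1>/is', $html, $m)) {
    $heading = trim($m[1]);
    echo $heading, "\n";
}
```

The same [^>]* piece can be dropped into the pattern for any other tag that may carry attributes.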

If the page has two <p> blocks, preg_match only finds and prints the first one.
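preg_match stops after the first match by design; preg_match_all returns every match. A minimal sketch, assuming the paragraphs are not nested and using illustrative sample markup:

```php
<?php
// Illustrative sample markup; in the real script $html holds the fetched page.
$html = "<p>First paragraph.</p>\n<p class=\"intro\">Second\nparagraph.</p>";

// preg_match_all collects every match; $matches[1] holds the captured text of each <p>.
preg_match_all('/<p\b[^>]*>(.*?)<\/p>/is', $html, $matches);

foreach ($matches[1] as $i => $paragraph) {
    echo ($i + 1), ': ', trim($paragraph), "\n";
}
```

The non-greedy (.*?) matters here: a greedy (.*) would run from the first <p> to the last </p> and return one large match.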

On a slightly different note, is there a recommended order for parsing an HTML page? Here is my proposed order:

1. Get the page into a string.
2. Remove everything inside <script> tags.
3. Extract the meta tags with a regex rather than the meta tag string function (apparently slower).
4. Extract all text between the desired tags (<title>, <h1>, <p>, etc.).
5. Remove any remaining HTML from the results of step 4 (e.g. <b> and <br> tags).
6. Echo the extracted variables to the screen to check that it all works.
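The proposed order could be sketched roughly as follows. This is an illustration only: extract_parts() is a hypothetical helper, the sample page is made up, and the meta pattern assumes name="..." comes before content="..." with both double-quoted, which real pages will not always honour.

```php
<?php
// Hypothetical helper illustrating the proposed order; not a full HTML parser.
function extract_parts(string $html): array
{
    // Step 2: drop <script> blocks so their contents are never treated as text.
    $html = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);

    $parts = ['meta' => [], 'title' => [], 'h1' => [], 'p' => []];

    // Step 3: meta tags (assumes name before content, both double-quoted).
    preg_match_all('/<meta\s+name="([^"]*)"\s+content="([^"]*)"[^>]*>/i', $html, $meta, PREG_SET_ORDER);
    foreach ($meta as $m) {
        $parts['meta'][strtolower($m[1])] = $m[2];
    }

    // Steps 4 and 5: capture each desired tag, then strip inline markup such as <b> or <br>.
    foreach (['title', 'h1', 'p'] as $tag) {
        preg_match_all('/<' . $tag . '\b[^>]*>(.*?)<\/' . $tag . '>/is', $html, $found);
        foreach ($found[1] as $text) {
            $parts[$tag][] = trim(strip_tags($text));
        }
    }
    return $parts;
}

// Step 1: the page as a string (a literal here; file_get_contents() is one way to fetch it).
$html = '<html><head><title>Demo</title>'
      . '<meta name="keywords" content="php, regex">'
      . '<meta name="description" content="A test page">'
      . '<script>var s = "<p>not text</p>";</script></head>'
      . '<body><h1 class="xyz">Heading</h1>'
      . '<p>One <b>bold</b> word.</p><p>Two<br> lines.</p></body></html>';

// Step 6: print the result to check it.
print_r(extract_parts($html));
```

Removing the scripts first matters: in the sample page the <p> inside the script string would otherwise be picked up as a paragraph.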

Is preg_match the right way to do what I propose?

I have read this page of regex information [etext.lib.virginia.edu], but I am still struggling.

Cheers and thanks in advance.

Brett_Tabke

5:17 pm on Jan 27, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



As you pointed out, the problem with the above method is when you get into unexpected duplicate tags.

There are several good HTML stripping modules available, and I would go that route. Writing a full-featured, accurate HTML stripper is difficult and very time consuming, and such code also takes a lot of maintenance.
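As one illustration of leaning on an existing parser rather than hand-written patterns, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument). Which module to use is an assumption on my part, since none is named above, and the sample markup is made up.

```php
<?php
// Illustrative markup with a duplicate <p> and an attribute on <h1>.
$html = '<html><head><title>Demo</title></head><body>'
      . '<h1 class="xyz">Heading</h1><p>First.</p><p>Second <b>bold</b>.</p>'
      . '</body></html>';

// Real-world pages are often malformed; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName returns every matching element, however many there are.
$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $node) {
    // textContent is the element's text with nested tags removed.
    $paragraphs[] = trim($node->textContent);
}

$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo $title, "\n";
print_r($paragraphs);
```

Duplicate tags and attributes need no special handling here, because the parser deals with the markup and the script only asks for elements by name.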

ukgimp

9:16 am on Jan 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Brett

>>Doing a good full featured accurate html stripper is very time consuming and difficult

Is that where using Perl might come in?

As suggested, I will look at some of the existing modules out there. I just fancied being able to get at different parts of the text, classify them, and print or store them.

Oh well, back to work.

Cheers