I have created a simple page that has keyword and description meta tags along with a <title>, an <h1>, and two <p> sections. I first read the page into a string, then parse out the different sections and echo them so I know the patterns work:
// Non-greedy (.*?) stops at the first </title>; the s flag lets . span newlines.
if (preg_match('/<title>(.*?)<\/title>/is', $html, $tag_contents)) {
    $title = $tag_contents[1];
    echo $title;
}
Now I am having trouble when the markup gets a little more complicated, for example when a tag carries attributes (<h1 class="xyz">text</h1>).
If there are two <p> sets of text it only finds and prints the first one.
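
Both symptoms have a likely cause in the pattern itself: a literal <h1> does not match an opening tag that carries attributes, and preg_match stops after the first hit. A minimal sketch of one way around this, assuming well-formed, non-nested tags; the sample $html string is made up for illustration:

```php
<?php
// Hypothetical sample input standing in for the fetched page.
$html = '<h1 class="xyz">Heading</h1><p>First paragraph.</p><p class="a">Second paragraph.</p>';

// [^>]* allows optional attributes inside the opening tag;
// \b keeps <p from also matching tags such as <pre>.
$heading = null;
if (preg_match('/<h1\b[^>]*>(.*?)<\/h1>/is', $html, $m)) {
    $heading = $m[1];
    echo $heading, "\n";
}

// preg_match_all collects every match; $matches[1] holds the captured text.
preg_match_all('/<p\b[^>]*>(.*?)<\/p>/is', $html, $matches);
$paragraphs = $matches[1];
foreach ($paragraphs as $p) {
    echo $p, "\n";
}
```

This still breaks on nested tags of the same name or on a ">" inside an attribute value, so treat it as a starting point rather than a general parser.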
On a slightly different note, is there a recommended order for parsing an HTML page? Here is my proposed order:
1. get the page into a string
2. remove everything in <script> tags
3. extract the meta tags with a regex rather than the meta tag string function (apparently slower)
4. extract all text between the desired tags (<title>, <h1>, <p>, etc.)
5. remove any other HTML from the results of step 4 (e.g. <b> tags, <br>, etc.)
6. echo the resulting variables to the screen to check that it all works
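
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a robust parser: the helper names (strip_scripts, extract_meta, extract_tag) and the sample page are hypothetical, the meta pattern assumes name="..." comes before content="..." with double quotes, and tags are assumed not to nest inside themselves.

```php
<?php
// Hypothetical helper for step 2: drop <script> blocks and their contents.
function strip_scripts(string $html): string
{
    return preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html) ?? $html;
}

// Hypothetical helper for step 3: map each meta name to its content.
// Assumes name precedes content and both values are double-quoted.
function extract_meta(string $html): array
{
    preg_match_all('/<meta\s+name="([^"]*)"\s+content="([^"]*)"[^>]*>/i', $html, $m);
    return array_combine(array_map('strtolower', $m[1]), $m[2]);
}

// Hypothetical helper for steps 4 and 5: capture the inner text of every
// $tag element, then strip any nested markup such as <b> or <br>.
function extract_tag(string $html, string $tag): array
{
    $t = preg_quote($tag, '/');
    preg_match_all('/<' . $t . '\b[^>]*>(.*?)<\/' . $t . '>/is', $html, $m);
    return array_map(function ($s) {
        return trim(strip_tags($s));
    }, $m[1]);
}

// Step 1: in practice this string would come from something like file_get_contents().
$html = '<html><head><title>Demo</title>'
      . '<meta name="keywords" content="php, regex">'
      . '<meta name="description" content="A test page">'
      . '<script>var x = "<p>not text</p>";</script></head>'
      . '<body><h1 class="xyz">Hello</h1><p>One <b>bold</b><br>word.</p><p>Two.</p></body></html>';

$clean      = strip_scripts($html);
$meta       = extract_meta($clean);
$titles     = extract_tag($clean, 'title');
$headings   = extract_tag($clean, 'h1');
$paragraphs = extract_tag($clean, 'p');

// Step 6: dump the results to check them.
print_r($meta);
print_r($titles);
print_r($headings);
print_r($paragraphs);
```

Removing the scripts first matters here because the sample script contains a string that looks like a <p> element, which would otherwise be picked up in step 4.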
Is preg_match the way forward for what I propose?
I have read this regex information page [etext.lib.virginia.edu] but I am still struggling.
Cheers and thanks in advance.
There are several good HTML strip modules available. I would go that route. Writing a good, full-featured, accurate HTML stripper is very time consuming and difficult. Such code also takes a lot of maintenance.
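
The reply does not name a particular module. As one possible illustration of leaning on an existing parser instead of hand-written patterns, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument), assuming that extension is enabled. The tag_texts helper and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: return the trimmed text of every $tag element.
function tag_texts(DOMDocument $doc, string $tag): array
{
    $texts = [];
    foreach ($doc->getElementsByTagName($tag) as $node) {
        // textContent already excludes nested markup such as <b>.
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Hypothetical sample page.
$html = '<html><head><title>Demo</title></head>'
      . '<body><h1 class="xyz">Hello</h1><p>One <b>bold</b> word.</p><p>Two.</p></body></html>';

$doc = new DOMDocument();
// Real-world pages are rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$headings   = tag_texts($doc, 'h1');
$paragraphs = tag_texts($doc, 'p');

print_r($headings);
print_r($paragraphs);
```

With a parser, attributes on the opening tag and repeated elements need no special handling, which is most of what the hand-written patterns struggle with.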
>>Writing a good, full-featured, accurate HTML stripper is very time consuming and difficult
Is that where using Perl might come in?
As suggested, I will look at some of the existing modules out there. I just fancied being able to get at different parts of the text, classify them, and print or store them.
Oh well, back to work.
Cheers