snagging and parsing html with regexp in php

Forum Moderators: coopster



broniusm

5:00 pm on Nov 4, 2003 (gmt 0)

I am working on a script that grabs some remote HTML and parses the TDs of a table into array elements, so that I can then use the data on my own. I have either a regexp problem or a PHP regexp syntax problem, and I can't tell which part is wrong or why:


// pick out individual events & store them in an array
$matches = preg_match_all("<td[^>]*>(.*)</td>", $alltext, $arrdata);
// raw output the TDs
print "<textarea>";
for ($i=0; $i<count($arrdata[0]); $i++) {
print "\nMatch $i:\n".$arrdata[0][$i];
}
print "</textarea>";

At this stage, I just want to grab all TDs and display them one by one. Eventually, I will parse the semi-structured data within the TD to make better sense of it.

Can someone please help?

coopster

5:26 pm on Nov 4, 2003 (gmt 0)

The expression should be enclosed in delimiters, such as a forward slash (/). For example:


$matches = preg_match_all("/<td.*>(.*)<\/td>/Ui", $alltext, $arrdata);
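
For anyone reading along, here is a minimal sketch of how the delimiters and modifiers fit together. The sample string in $html is made up for illustration and only stands in for the $alltext from the question; $cells is likewise a hypothetical name.

```php
<?php
// Hypothetical sample input standing in for the remote HTML ($alltext).
$html = '<table><tr><TD class="a">one</TD><td>two</td></tr></table>';

// The first and last "/" are the delimiters; the letters after the
// closing "/" are modifiers. U makes quantifiers ungreedy by default,
// i makes the match case-insensitive. The "/" inside </td> is escaped
// so it is not mistaken for the closing delimiter.
$count = preg_match_all('/<td[^>]*>(.*)<\/td>/Ui', $html, $cells);

// $cells[0] holds the full matches, $cells[1] the text captured by (.*).
for ($i = 0; $i < $count; $i++) {
    print "Match $i: " . $cells[1][$i] . "\n";
}
```

Single quotes around the pattern keep PHP from interpreting the backslash itself, so only the regex engine sees the escape.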

broniusm

7:33 pm on Nov 4, 2003 (gmt 0)

thanks coopster-- right on!
now I seem to have a problem with either whitespace or linebreaks.. the raw html looks like this, so you can see it's a bit spacey:


 <tr>
<td colspan=2 class=searchres>
      
      
      <b><a href="[..url..]">[..event title..]</a></b><br>
      Monday, 11/3/2003 at 4:00pm<br>
      Meetings & Conventions<br>
      [Event Category]
</td>
 </tr>

I think I need to strip all that for the regexp to pick up what I'm requesting.

coopster

10:09 pm on Nov 4, 2003 (gmt 0)

I'm not sure if I understand your request, but if it isn't working because of newlines, add the s (PCRE_DOTALL) modifier. If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded:


$matches = preg_match_all("/<td.*>(.*)<\/td>/Uis", $alltext, $arrdata);
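
As a rough illustration of the difference, the sketch below runs the same pattern with and without the s modifier against a shortened copy of the row posted above. The variable names ($html, $without, $with, $a, $b) are assumptions for the example, and trim() is just one possible way to drop the surrounding whitespace.

```php
<?php
// Shortened copy of the posted row; placeholders are left exactly as posted.
$html = <<<HTML
 <tr>
<td colspan=2 class=searchres>

      <b><a href="[..url..]">[..event title..]</a></b><br>
      Monday, 11/3/2003 at 4:00pm<br>
</td>
 </tr>
HTML;

// Without s, the dot stops at the newline right after the opening <td ...>,
// so the pattern cannot reach </td> and nothing matches.
$without = preg_match_all('/<td[^>]*>(.*)<\/td>/Ui', $html, $a);

// With s (PCRE_DOTALL), the dot also matches newlines, so the whole cell
// body is captured.
$with = preg_match_all('/<td[^>]*>(.*)<\/td>/Uis', $html, $b);

print "Matches without s: $without\n";
print "Matches with s: $with\n";

// trim() strips the leading and trailing whitespace from the captured cell.
if ($with > 0) {
    print trim($b[1][0]) . "\n";
}
```

So there is no need to strip the line breaks from the HTML first; the captured text can be cleaned up after the match.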

broniusm

4:44 am on Nov 5, 2003 (gmt 0)

coopster- ya did it again! :)

Thanks very much for your expertise. It's a bit frustrating to know you're on the right track but not to be 100% sure about the syntax. You helped me through the muck.