Parse/Extract data from page

Forum Moderators: coopster

The page is the result of a remote search query collected via cURL

erikcw

1:09 am on Jul 5, 2004 (gmt 0)

Hi All,

I am using cURL to access a remote search query. I am having no problems getting the data, but parsing it into something more useful is causing me some trouble.

I want to strip out all of the "header data" (everything from <html>.......<table>[GOOD DATA]<more junk>)

I am guessing that Regular Expressions are probably the best way to go, but I don't have the slightest idea of how to begin. Can anyone share some info to get me on my way?

Thanks!

bsterz

3:10 am on Jul 5, 2004 (gmt 0)

If you have some perl skilz I would suggest:

HTML::Parser
[search.cpan.org...]

And:
HTML::TableExtract

I use em for tons of screen scraping stuff.

Bill

erikcw

5:09 am on Jul 5, 2004 (gmt 0)

Unfortunately, Perl is not an option. I need to do it in PHP. Any ideas?

coopster

2:01 pm on Jul 5, 2004 (gmt 0)

I have seen some classes for parsing html tables with PHP as well, can't remember where now, maybe somebody else can offer more...

Anyway, here is a regular expression to get you started:

// preg_match() returns 1 on a match and 0 when nothing matches
if (preg_match('/<table>(.*)<\/table>/Uis', $string, $matches)) {
    print $matches[1];
}

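Breaking it down a bit, we are looking for a subpattern [php.net]. We do this by using parentheses. Note that our subpattern captures everything between the <table> tags. The slash in the closing table tag has to be escaped since the slash is also the pattern delimiter. The "Uis" characters are modifiers: U makes the match ungreedy, so it stops at the first closing tag; i makes it case-insensitive; s lets the dot match newlines. As written, the pattern only matches a bare <table> tag with no attributes.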
There are quite a few good tutorials and links all over WebmasterWorld regarding regular expressions. The manual comes in quite handy as well.

Regular Expression Functions (Perl-Compatible) [php.net]
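
To see what each modifier contributes, here is a minimal sketch run against a made-up snippet. The sample HTML and the $inner variable are assumptions for illustration only, not the actual search page.

```php
<?php
// Hypothetical sample standing in for the page fetched with cURL.
$string = "<html><body>junk<TABLE>\n<tr><td>1. Apple</td></tr>\n</TABLE>more junk"
        . "<table><tr><td>second</td></tr></table></body></html>";

// i: <table> also matches <TABLE>
// s: the dot matches newlines, so the capture can span several lines
// U: quantifiers become ungreedy, so the match ends at the first closing tag
$inner = null;
if (preg_match('/<table>(.*)<\/table>/Uis', $string, $matches)) {
    $inner = $matches[1];
    print $inner;
}
```

Without the U modifier the capture would run on to the last closing table tag in the page, which is usually not what you want when the page holds more than one table.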

erikcw

10:11 pm on Jul 5, 2004 (gmt 0)

Worked like a charm! Wow, I thought that the code was going to be so much more complex! I ended up using a combination of the REGEX match above and preg_replace to pull out the last of the junk-data.

Thanks!

coopster

10:24 pm on Jul 5, 2004 (gmt 0)

You're welcome. Now, if you are simply using echo or print to send that to the browser to see what you are getting, you may have <tr> and <td> tags in there still you know! Use your browser's View Source option to have a look at what is indeed there. Unless, of course, that is what you are referring to when you said you used preg_replace() to get the rest of the junk data out...
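
If you would rather see the leftover markup on the page than in View Source, or want the cells as an array, something along these lines may help. It is only a sketch: it assumes a simple, well-formed table with no nested tables, and the sample markup and variable names are made up.

```php
<?php
// Hypothetical captured table contents ($matches[1] from the earlier preg_match).
$inner = "<tr><td>1. Apple</td><td>red</td></tr>\n"
       . "<tr><td>2. Pear</td><td>green</td></tr>";

// Escaping the markup makes the browser show the leftover tags instead of rendering them.
print htmlspecialchars($inner);

// Collect each row, then each cell inside that row.
$rows = array();
preg_match_all('/<tr[^>]*>(.*?)<\/tr>/is', $inner, $rowMatches);
foreach ($rowMatches[1] as $rowHtml) {
    preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $rowHtml, $cellMatches);
    $rows[] = $cellMatches[1];
}
```

Each element of $rows is then an array holding the cell contents of one row.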

erikcw

10:41 pm on Jul 5, 2004 (gmt 0)

In order to correct having the extra "table data", I used strip_tags() to pull out all of the html. I was then left with just the data. I then ran it through preg_replace() to remove numbers that were next to each item (it was a numbered list).
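
For anyone following along, the two steps can be sketched roughly like this. The sample data and the "1. " numbering format are assumptions; the real page may number its items differently, in which case the pattern needs adjusting.

```php
<?php
// Hypothetical table contents left over after the preg_match step.
$inner = "<tr><td>1. Apple</td></tr>\n<tr><td>2. Pear</td></tr>\n<tr><td>10. Plum</td></tr>";

// Remove every HTML tag, leaving only the text.
$text = strip_tags($inner);

// Drop the leading "1. ", "2. ", ... on each line (m makes ^ match at every line start).
$clean = preg_replace('/^[ \t]*\d+\.[ \t]*/m', '', $text);

// One item per array element.
$items = explode("\n", trim($clean));
```

Keeping the numbering pattern anchored to the start of the line avoids stripping digits that are part of the data itself.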

Thanks!

coopster

10:59 pm on Jul 5, 2004 (gmt 0)

Hey, good for you, strip_tags() was a nice touch. Cheers.