Forum Moderators: coopster

Parse/Extract data from page

The page is the result of a remote search query fetched via cURL

         

erikcw

1:09 am on Jul 5, 2004 (gmt 0)

10+ Year Member



Hi All,

I am using cURL to access a remote search query. I am having no problems getting the data, but parsing it into something more useful is causing me some trouble.

I want to strip out all of the "header data" (everything from <html>.......<table>[GOOD DATA]<more junk>)

I am guessing that Regular Expressions are probably the best way to go, but I haven't the slightest idea of how to begin. Can anyone share some info to get me on my way?

Thanks!

bsterz

3:10 am on Jul 5, 2004 (gmt 0)

10+ Year Member



If you have some perl skilz I would suggest:

HTML::Parser
[search.cpan.org...]

And:
HTML::TableExtract

I use em for tons of screen scraping stuff.

Bill

erikcw

5:09 am on Jul 5, 2004 (gmt 0)

10+ Year Member



Unfortunately, Perl is not an option. I need to do it in PHP. Any ideas?

coopster

2:01 pm on Jul 5, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I have seen some classes for parsing html tables with PHP as well, can't remember where now, maybe somebody else can offer more...

Anyway, here is a regular expression to get you started:

// Capture everything between the first <table> tag and its closing </table>.
if (preg_match("/<table>(.*)<\/table>/Uis", $string, $matches)) {
    print $matches[1];
}

Breaking it down a bit, we are looking for a subpattern [php.net], which we mark with parentheses. Here the subpattern captures everything between the <table> tags. The slash in the closing table tag has to be escaped because the slash is also the pattern delimiter. The "Uis" characters are modifiers: U makes the match ungreedy so it stops at the first closing tag, i makes it case-insensitive, and s lets the dot match newlines.
There are quite a few good tutorials and links all over WebmasterWorld regarding regular expressions. The manual comes in quite handy as well.

Regular Expression Functions (Perl-Compatible) [php.net]
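
As a rough illustration of what the three modifiers do, here is a minimal sketch. The sample markup in $string is made up for the example and is not from the actual search results; the variable names $table and $greedy are likewise only for illustration.

```php
<?php
// Hypothetical sample input: uppercase tags and line breaks inside the first table.
$string = "<html><body>junk<TABLE>\n<tr><td>1. Apples</td></tr>\n</TABLE>"
        . "more junk<table><tr><td>other</td></tr></table></body></html>";

// U: ungreedy, so (.*) stops at the first closing tag instead of the last one.
// i: case-insensitive, so <TABLE> matches as well as <table>.
// s: the dot also matches newlines, so the capture can span several lines.
if (preg_match("/<table>(.*)<\/table>/Uis", $string, $matches)) {
    $table = $matches[1];
} else {
    $table = null; // no bare <table> tag found
}

// For comparison: without U the match is greedy and runs to the last </table>.
preg_match("/<table>(.*)<\/table>/is", $string, $greedy);
```

With this input, $table holds only the rows of the first table, while $greedy[1] also swallows the "more junk" text and the start of the second table. Note that the pattern as posted only matches a <table> tag with no attributes.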

erikcw

10:11 pm on Jul 5, 2004 (gmt 0)

10+ Year Member



Worked like a charm! Wow, I thought that the code was going to be so much more complex! I ended up using a combination of the REGEX match above and preg_replace to pull out the last of the junk-data.

Thanks!

coopster

10:24 pm on Jul 5, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



You're welcome. Now, if you are simply using echo or print to send that to the browser to see what you are getting, you may still have <tr> and <td> tags in there, you know! Use your browser's View Source option to see what is really there. Unless, of course, that is what you were referring to when you said you used preg_replace() to get the rest of the junk data out...
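
Besides View Source, one way to see leftover tags while debugging is to escape the string before printing it. This is only a sketch; $fragment is a made-up stand-in for whatever ends up in $matches[1].

```php
<?php
// Hypothetical fragment, standing in for what $matches[1] might hold.
$fragment = "<tr><td>1. Apples</td></tr>";

// Printed directly, the browser treats <tr> and <td> as markup and hides them.
// Escaping first makes the tags show up as literal text on the page.
$visible = htmlspecialchars($fragment);
print $visible;
```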

erikcw

10:41 pm on Jul 5, 2004 (gmt 0)

10+ Year Member



To get rid of the extra "table data", I used strip_tags() to pull out all of the HTML, which left me with just the data. I then ran it through preg_replace() to remove the numbers that were next to each item (it was a numbered list).

Thanks!
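
The cleanup described above might look something like the sketch below. The actual markup and number format were not posted, so the sample rows in $table and the assumed "N. " prefix in the pattern are guesses for illustration only.

```php
<?php
// Hypothetical table contents, standing in for the captured $matches[1].
$table = "<tr><td>1. Apples</td></tr>\n"
       . "<tr><td>2. Pears</td></tr>\n"
       . "<tr><td>10. Plums</td></tr>";

// Remove every HTML tag, leaving only the text of each row.
$text = strip_tags($table);

// Drop the leading "N. " counter on each line (assumed format).
// The m modifier makes ^ match at the start of every line.
$clean = preg_replace('/^[ \t]*\d+\.[ \t]*/m', '', $text);

// One item per array element.
$items = explode("\n", $clean);
```

If the numbers are formatted differently (for example "1)" or "#1"), the preg_replace() pattern would need adjusting to match.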

coopster

10:59 pm on Jul 5, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Hey, good for you, strip_tags() was a nice touch. Cheers.