Forum Moderators: coopster & jatar k

how to scrape data from remote webpage

scrap, extract, php, parse

     
12:09 am on Jul 17, 2009 (gmt 0)

New User

5+ Year Member

joined:June 18, 2009
posts: 36
votes: 0


Hi All,
PHP junkyard mechanic here with a snag:
I found (and then lost) a website that had examples of how to pull data from HTML pages in a directory.
I'm trying to modify the code to grab data from multiple entries on a page (like a listing). Here is the code I currently have:
// picks only valid filenames
if (strpos($file, '.htm', 1) || strpos($file, '.html', 1)) {
    $pathfile = $path . "/" . $file;
    //echo($pathfile);
    $data = file_get_contents($pathfile);
    // \d* allows for the numeric suffix on the ids in the markup (boxfname1, boxfname3, ...)
    $getfname = '/id="boxfname\d*">(.+?)<\/li>/';
    $getlname = '/id="boxlname\d*">(.+?)<\/li>/';
    $getco = '/id="boxcompname\d*">(.+?)<\/li>/';
    $getti = '/id="boxtitlename\d*">(.+?)<\/li>/';
    $getli = '/<a href="(.+?)<\/a>/';

    // preg_match stops at the first match, so each call yields one entry only
    preg_match($getfname, $data, $match1);
    preg_match($getlname, $data, $match2);
    preg_match($getco, $data, $match3);
    preg_match($getti, $data, $match4);
    preg_match($getli, $data, $match5);
    // var_dump($match1);
}
-------------------------
The rows of data are in a table, formatted like this:

<td colspan="4">
<div id="box1" class="dragableBox1">
<ul style="list-style-type: none;display:inline;height:30px">
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxfname1">Nadia</li>
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxlname1">Narmeen</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxcompname1">Ayla & Company</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxtitlename1">Buyer</li>
</ul>
</div>
</td>
-------------------
How can I set up the above code in a loop, so that I can extract the entries into an array and reconstruct them later down the line?

Thanks,

1:28 pm on July 17, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:June 10, 2008
posts: 1130
votes: 0


Check out the cURL functions for fetching the remote page.
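For anyone unsure what the cURL suggestion looks like in practice, here is a minimal sketch. The helper name `fetch_page` and the 10-second timeout are assumptions made for illustration; they are not from the thread. It only fetches the page. The parsing still has to be done separately.

```php
<?php
// Hypothetical helper (not from the thread): fetch a remote page with cURL.
// Returns the response body as a string, or null if the transfer fails.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // give up if the connection stalls
        CURLOPT_TIMEOUT        => $timeoutSeconds, // cap the whole transfer
    ]);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}
```

A call such as `$data = fetch_page('http://example.com/listing.html');` (placeholder URL) could then stand in for the `file_get_contents($pathfile)` line. `file_get_contents` can also read URLs directly when `allow_url_fopen` is enabled, so cURL is mainly useful when you want control over timeouts and redirects.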
7:29 pm on July 17, 2009 (gmt 0)

New User

5+ Year Member

joined:June 18, 2009
posts: 36
votes: 0


This is based on this snippet, which is one of several:
<div id="box3" class="dragableBox1">
<ul style="list-style-type: none;display:inline;height:30px">
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxfname3">Kathy</li>
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxlname3">Savaze</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxcompname3">Peter Kate Shoes</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxtitlename3">Buyer</li>
</ul>
</div>

If I could just loop this, I'd be set:

<tr>
<?php
// picks only valid filenames
if (strpos($file, '.htm', 1) || strpos($file, '.html', 1)) {
    $data = file_get_contents($pathfile);
    // \d* allows for the numeric suffix on the ids in the markup (boxfname1, boxfname3, ...)
    $getfname = '/id="boxfname\d*">(.+?)<\/li>/';
    $getlname = '/id="boxlname\d*">(.+?)<\/li>/';
    $getco = '/id="boxcompname\d*">(.+?)<\/li>/';
    $getti = '/id="boxtitlename\d*">(.+?)<\/li>/';
    $getli = '/<a href="(.+?)<\/a>/';

    preg_match($getfname, $data, $match1);
    preg_match($getlname, $data, $match2);
    preg_match($getco, $data, $match3);
    preg_match($getti, $data, $match4);
    preg_match($getli, $data, $match5);
    // var_dump($match1);
?><td><?php echo $match1[1]; ?></td>
<td><?php echo $match2[1]; ?></td>
<td><?php echo $match3[1]; ?></td>
<td><?php echo $match4[1]; ?></td>
<td><a target="_blank" href="http://<?php echo $match5[1]; ?>"><?php echo $match5[1]; ?></a></td>
<?php } /* end of the filename check */ ?>
</tr>

---------
How do I get to the next row in the table?

7:38 pm on July 17, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:June 10, 2008
posts: 1130
votes: 0


You can echo HTML with PHP. Check this out:

//echo the start of the table
echo("<table>");
for ($i = 0; $i < 10; $i++) {
    $temp = $i + 1;
    echo("<tr><td>row $i column $i</td><td>row $i column $temp</td></tr>\n");
}//for
//finish the table
echo("</table>");

Run that and then look at the HTML source.
For your purposes, just echo("<tr>"); wherever you need a new row.
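The same echo-in-a-loop idea can be applied to an array of extracted records instead of a counter. This is only a sketch: the `$rows` array and the `render_table` function are made up for illustration, with values borrowed from the markup posted earlier in the thread.

```php
<?php
// Hypothetical sample records, shaped like the name/company/title rows in this thread.
$rows = [
    ['first' => 'Nadia', 'last' => 'Narmeen', 'company' => 'Ayla & Company',   'title' => 'Buyer'],
    ['first' => 'Kathy', 'last' => 'Savaze',  'company' => 'Peter Kate Shoes', 'title' => 'Buyer'],
];

// Build one <tr> per record and one <td> per field.
// htmlspecialchars() keeps characters such as & from breaking the markup.
function render_table(array $rows): string
{
    $html = "<table>\n";
    foreach ($rows as $row) {
        $html .= '<tr>';
        foreach ($row as $cell) {
            $html .= '<td>' . htmlspecialchars($cell) . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}

echo render_table($rows);
```

Returning a string rather than echoing inside the function makes it easy to inspect the markup before sending it to the browser.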

2:53 am on July 19, 2009 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 22, 2005
posts: 185
votes: 0


preg_match only catches the first occurrence;
preg_match_all is better.

Look, something like this is much more appropriate:

$data = file_get_contents($pathfile);

// [^>]* skips over each tag's attributes; the /s modifier lets . match newlines
$re = '/<ul[^>]*>\s*<li[^>]*>(.*?)<\/li>\s*<li[^>]*>(.*?)<\/li>\s*<li[^>]*>(.*?)<\/li>\s*<li[^>]*>(.*?)<\/li>\s*<\/ul>/s';

if (preg_match_all($re, $data, $get, PREG_SET_ORDER))
{
    foreach ($get as $aux)
    {
        // inside a heredoc, variables are interpolated directly (no nested echo tags needed)
        echo <<<EOM
<tr>
<td>{$aux[1]}</td>
....... etc ............
</tr>
EOM;
    }
}
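To tie this back to the original question (getting every entry into an array for later use), here is a self-contained sketch of the preg_match_all approach. The sample markup is trimmed from the snippets posted in this thread. The array keys (`first`, `last`, `company`, `title`) are assumed names chosen for illustration. The pattern uses `[^>]*` so that each tag's attributes are skipped, and it assumes every `<ul>` holds exactly four `<li>` items in this order.

```php
<?php
// Sample markup trimmed from the snippets in this thread (some attributes removed).
$data = <<<'HTML'
<div id="box1" class="dragableBox1">
<ul style="list-style-type: none;display:inline;height:30px">
<li class="searchresulttextnew" id="boxfname1">Nadia</li>
<li class="searchresulttextnew" id="boxlname1">Narmeen</li>
<li class="searchresulttextnew" id="boxcompname1">Ayla & Company</li>
<li class="searchresulttextnew" id="boxtitlename1">Buyer</li>
</ul>
</div>
<div id="box3" class="dragableBox1">
<ul style="list-style-type: none;display:inline;height:30px">
<li class="searchresulttextnew" id="boxfname3">Kathy</li>
<li class="searchresulttextnew" id="boxlname3">Savaze</li>
<li class="searchresulttextnew" id="boxcompname3">Peter Kate Shoes</li>
<li class="searchresulttextnew" id="boxtitlename3">Buyer</li>
</ul>
</div>
HTML;

// One <ul> followed by four <li> items; each (.*?) captures one item's text.
$li = '<li[^>]*>(.*?)<\/li>\s*';
$re = '/<ul[^>]*>\s*' . str_repeat($li, 4) . '<\/ul>/s';

$entries = [];
if (preg_match_all($re, $data, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[0] is the whole match; $m[1]..$m[4] are the four captured fields.
        $entries[] = [
            'first'   => trim($m[1]),
            'last'    => trim($m[2]),
            'company' => trim($m[3]),
            'title'   => trim($m[4]),
        ];
    }
}

print_r($entries);
```

With PREG_SET_ORDER, each element of `$matches` is one complete row, which makes the loop straightforward. A regex like this is brittle if the markup changes, so PHP's DOM extension may be worth a look for anything less uniform.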

4:28 pm on July 21, 2009 (gmt 0)

New User

5+ Year Member

joined:June 18, 2009
posts: 36
votes: 0


Got it, thanks NomikOS.
I started working on exactly the same thing already. Thanks for the tip.
 
