
Home / Forums Index / Code, Content, and Presentation / PHP Server Side Scripting
Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
how to scrape data from remote webpage
scrap, extract, php, parse
sasori




msg:3954041
 12:09 am on Jul 17, 2009 (gmt 0)

Hi all,
PHP junkyard mechanic here with a snag:
I found (and lost) a website that had examples of how to pull data from HTML pages in a directory.
I'm trying to modify the code to grab data from multiple entries on a page (like a listing). Here is the code I currently have:
// picks only valid filenames
if (strpos($file, '.htm', 1) || strpos($file, '.html', 1)) {
// echos to the client a nice unordered list of images
$pathfile = $path."/".$file;
//echo($pathfile);
$data = file_get_contents($pathfile);
$getfname = '/id="boxfname">(.+?)<\/li>/';
$getlname = '/id="boxlname">(.+?)<\/li>/';
$getco = '/id="boxcompname">(.+?)<\/li>/';
$getti = '/id="boxtitlename">(.+?)<\/li>/';
$getli = '/<a href="(.+?)<\/a>/';

preg_match($getfname,$data,$match1);
preg_match($getlname,$data,$match2);
preg_match($getco,$data,$match3);
preg_match($getti,$data,$match4);
preg_match($getli,$data,$match5);
// var_dump($match);
-------------------------
the rows of data are in a table, formatted like this:

<td colspan="4">
<div id="box1" class="dragableBox1">
<ul style="list-style-type: none;display:inline;height:30px">
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxfname1">Nadia</li>
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxlname1">Narmeen</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxcompname1">Ayla & Company</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxtitlename1">Buyer</li>
</ul>
</div>
</td>
-------------------
How can I set the above code up in a loop, so that I can extract the entries into an array and reconstruct them later down the line?

Thanks,

 

andrewsmd




msg:3954346
 1:28 pm on Jul 17, 2009 (gmt 0)

Check out the cURL functions.
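
In case the pages live on another server rather than in a local directory, here is a minimal sketch of fetching one with the cURL extension. The helper name `fetchPage`, the timeout value, and the example URL are assumptions for illustration, not something from this thread.

```php
<?php
// Hypothetical helper: download a page and return its body as a string.
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    if ($ch === false) {
        throw new RuntimeException("Could not initialise cURL for $url");
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up after this many seconds
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    return $body;
}

// Usage (placeholder URL, not from the thread):
// $data = fetchPage('http://www.example.com/listing.html');
```

The returned string can be passed to the same regular expressions that are currently run on the output of file_get_contents().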

sasori




msg:3954575
 7:29 pm on Jul 17, 2009 (gmt 0)

This is based on this snippet, which is one of several:
<div id="box3" class="dragableBox1">
<ul style="list-style-type: none;display:inline;height:30px">
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxfname3">Kathy</li>
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxlname3">Savaze</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxcompname3">Peter Kate Shoes</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxtitlename3">Buyer</li>
</ul>
</div>

If I could just loop this, I'd be set:

<tr>
<?
// picks only valid filenames
if (strpos($file, '.htm', 1) || strpos($file, '.html', 1)) {
// echos to the client a nice unordered list
$data = file_get_contents($pathfile);
$getfname = '/id="boxfname">(.+?)<\/li>/';
$getlname = '/id="boxlname">(.+?)<\/li>/';
$getco = '/id="boxcompname">(.+?)<\/li>/';
$getti = '/id="boxtitlename">(.+?)<\/li>/';
$getli = '/<a href="(.+?)<\/a>/';

preg_match($getfname,$data,$match1);
preg_match($getlname,$data,$match2);
preg_match($getco,$data,$match3);
preg_match($getti,$data,$match4);
preg_match($getli,$data,$match5);
// var_dump($match);
?><td><? echo $match1[1]; ?></td>
<td><? echo $match2[1]; ?></td>
<td><? echo $match3[1]; ?></td>
<td><? echo $match4[1]; ?></td>
<td><a target="_blank" href="http://<? echo $match5[1]; ?>"><? echo $match5[1]; ?></a></td>
</tr>

---------
How do I get to the next row in the table?

andrewsmd




msg:3954579
 7:38 pm on Jul 17, 2009 (gmt 0)

You can echo HTML with PHP. Check this out:

//echo the start of the table
echo("<table>");
for($i = 0; $i < 10; $i++){
$temp = $i + 1;
echo("<tr><td>row $i column $i</td><td>row $i column $temp</td></tr>\n");
}//for
//finish the table
echo("</table>");
Run that and then look at the HTML source.
For your purposes, just echo("<tr>"); wherever you need a new row.
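
The same idea applied to scraped data might look like the sketch below: one <tr> per entry, one <td> per field. The `renderTable` helper and the hard-coded `$entries` array are assumptions for illustration; the sample values are taken from the snippets posted above.

```php
<?php
// Sample rows; in practice these would come from the scraped page.
$entries = [
    ['Nadia', 'Narmeen', 'Ayla & Company', 'Buyer'],
    ['Kathy', 'Savaze', 'Peter Kate Shoes', 'Buyer'],
];

// Hypothetical helper: build an HTML table, one row per entry.
function renderTable(array $rows): string
{
    $html = "<table>\n";
    foreach ($rows as $row) {
        $html .= '<tr>';
        foreach ($row as $cell) {
            // Escape &, < and > so scraped text cannot break the markup.
            $html .= '<td>' . htmlspecialchars($cell) . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}

echo renderTable($entries);
```

Building the string first and echoing it once keeps the loop separate from the output, which makes it easier to reuse the rows later.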

NomikOS




msg:3955195
 2:53 am on Jul 19, 2009 (gmt 0)

preg_match only catches the first occurrence;
preg_match_all is better.

Look, something like this is more appropriate:

$data = file_get_contents($pathfile);

$re = "/<ul[^>]*>\s*<li[^>]*>(.*?)<\/li>\s*<li[^>]*>(.*?)<\/li>\s*<li[^>]*>(.*?)<\/li>\s*<li[^>]*>(.*?)<\/li>\s*<\/ul>/s";

if (preg_match_all($re, $data, $get, PREG_SET_ORDER))
{
foreach ($get as $aux)
{
echo <<<EOM
<tr>
<td>{$aux[1]}</td>
....... etc ............
</tr>
EOM;
}
}
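
To collect the entries into an array first (as asked in the original post) rather than echoing them straight away, a sketch along these lines may help. It assumes every entry has the four id attributes in the order shown in the posted markup (boxfnameN, boxlnameN, boxcompnameN, boxtitlenameN); the helper name `extractEntries` and the array keys are made up for the example.

```php
<?php
// Sample markup copied from the thread; a real page would be loaded with
// file_get_contents($pathfile).
$data = <<<'HTML'
<div id="box1" class="dragableBox1">
<ul style="list-style-type: none;display:inline;height:30px">
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxfname1">Nadia</li>
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxlname1">Narmeen</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxcompname1">Ayla & Company</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxtitlename1">Buyer</li>
</ul>
</div>
<div id="box3" class="dragableBox1">
<ul style="list-style-type: none;display:inline;height:30px">
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxfname3">Kathy</li>
<li class="searchresulttextnew" align="left" valign="top" width="15%" id="boxlname3">Savaze</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxcompname3">Peter Kate Shoes</li>
<li class="searchresulttextnew" align="left" valign="top" width="20%" id="boxtitlename3">Buyer</li>
</ul>
</div>
HTML;

// Hypothetical helper: return one associative array per listing entry.
function extractEntries(string $html): array
{
    // (\d+) captures the numeric suffix of the first id; \1 requires the
    // same suffix on the other three ids, so fields from different entries
    // are never mixed.
    $re = '/id="boxfname(\d+)">(.*?)<\/li>\s*'
        . '<li[^>]*id="boxlname\1">(.*?)<\/li>\s*'
        . '<li[^>]*id="boxcompname\1">(.*?)<\/li>\s*'
        . '<li[^>]*id="boxtitlename\1">(.*?)<\/li>/s';

    $entries = [];
    if (preg_match_all($re, $html, $rows, PREG_SET_ORDER)) {
        foreach ($rows as $row) {
            $entries[] = [
                'fname'   => trim($row[2]),
                'lname'   => trim($row[3]),
                'company' => trim($row[4]),
                'title'   => trim($row[5]),
            ];
        }
    }
    return $entries;
}

$entries = extractEntries($data);
print_r($entries);
```

With PREG_SET_ORDER each element of $rows is one complete entry, so the foreach maps directly onto one table row per entry.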

sasori




msg:3956660
 4:28 pm on Jul 21, 2009 (gmt 0)

got it, thanks NomikOS.
I started working on exactly the same thing already. Thanks for the tip.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved