Forum Moderators: coopster

screen scraper for PHP?

partha

6:25 am on Jan 22, 2005 (gmt 0)

Is there an easy-to-use PHP script that can be configured to pull certain info off a web page and store it in some variables, without having to do a lot of coding by hand?

mincklerstraat

10:03 am on Jan 22, 2005 (gmt 0)


<?php
//Don't use this illegally - for PHP 4.3 or more
//scraper.php
$url = 'http://www.webmasterworld.com/forum88';
$findstuff = array();
$findstuff['start'][] = 'Moderated by\: \<a href\="/vewprofile\.cgi\?action=view&amp;member\=';
$findstuff['end'][] = '"';
$findstuff['start'][] = '<img src="http\://showcase\.netins\.net/web/phdss/WebmasterWorldgfx/thread\.png" alt="thread icon" align="left"><font size="2" face="verdana" color="\#000000"><b><a href="/forum88/[0-9]*\.htm" target="_top">';
$findstuff['end'][] = '</a>';
$scrapegreedy = 0;
$user_agent = 'browser';
$preg = 1;
$debug = 0;
$foundbits = go_scrape($url, $findstuff, $scrapegreedy, $user_agent, $preg, $debug);
echo '<pre>';
print_r($foundbits);
echo '</pre>';
//
//
//
function go_scrape($url, $findstuff, $scrapegreedy, $user_agent='', $preg=0, $debug=0){
    // Choose the User-Agent header that file_get_contents() will send.
    if(!empty($user_agent)){
        // '==' (comparison), not '=': the script as first posted had '=' here,
        // which is the mistake corrected in the follow-up post below.
        if($user_agent == 'browser') ini_set('user_agent', $_SERVER['HTTP_USER_AGENT']);
        else ini_set('user_agent', $user_agent);
    } else ini_set('user_agent', 'scraper.php - www.webmasterworld.com/forum88/6614.htm');
    $contents = file_get_contents($url);
    if(!$contents) return 'no contents';
    $foundbits = array();
    foreach($findstuff['start'] as $k => $v){
        // Plain strings: escape regex metacharacters and the '#' delimiter.
        if(empty($preg)){
            $v = preg_quote($v, '#');
            $findstuff['end'][$k] = preg_quote($findstuff['end'][$k], '#');
        }
        // Build #start(.*?)end# (lazy) or #start(.*)end# (greedy).
        $pregstring = '#'.$v.'(.*';
        if(empty($scrapegreedy)) $pregstring .= '?)';
        else $pregstring .= ')';
        $pregstring .= $findstuff['end'][$k].'#';
        if($debug) echo htmlspecialchars($pregstring).'<br />';
        // Only the first match for each start/end pair is kept.
        $check = preg_match($pregstring, $contents, $matches);
        if($check) $foundbits[] = $matches[1];
        else $foundbits[] = '* none found *';
    }
    if($debug) return array($foundbits, $contents);
    else return $foundbits;
}

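Put your URL in $url, and put the HTML that surrounds each thing you want to get (the beginning and the end) in $findstuff['start'] and $findstuff['end'], one pair for each bit you want, like above. Leave $scrapegreedy at 0 to stop at the first end string found after the start; set it to 1 to make the match greedy, so it grabs as much as it can before an end string. You can set the user agent if you want; setting it to 'browser' sends the user agent of the browser currently calling the script. Set $preg to 1 if you want your strings to be used as preg_match patterns (properly escaped, with your special preg stuff inside, but no delimiters; the script uses # as the delimiter, so escape any literal #). If they're just 'normal' strings, set it to 0 or leave it empty.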
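To try the start/end idea without fetching anything, here is a minimal sketch that applies the same pattern-building to a string. The helper name scrape_between() and the sample markup are made up for illustration; they are not part of scraper.php.

```php
<?php
// Hypothetical helper (not part of scraper.php): same start/end idea as
// go_scrape(), but applied to a string so it can be tried without a network.
function scrape_between($contents, $start, $end, $greedy = false, $preg = false) {
    if (!$preg) {
        // Plain strings: escape regex metacharacters and the '#' delimiter.
        $start = preg_quote($start, '#');
        $end   = preg_quote($end, '#');
    }
    $capture = $greedy ? '(.*)' : '(.*?)';
    if (preg_match('#' . $start . $capture . $end . '#', $contents, $matches)) {
        return $matches[1];
    }
    return null; // nothing found
}

// Made-up sample markup, for illustration only.
$html = '<li><b>first thread</b></li><li><b>second thread</b></li>';

$lazy   = scrape_between($html, '<b>', '</b>');               // plain strings, lazy
$greedy = scrape_between($html, '<b>', '</b>', true);         // plain strings, greedy
$regex  = scrape_between($html, '<li><b>[a-z]+ ', '</b>', false, true); // preg mode

echo $lazy, "\n";    // first thread
echo $greedy, "\n";  // first thread</b></li><li><b>second thread
echo $regex, "\n";   // thread
```

The greedy result shows why 0 is usually the safer setting for $scrapegreedy: the capture runs on to the last end string it can reach.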

If you set $debug to 1, it'll output each regular expression used so you can check it, and it'll return array($foundbits, $contents), that is, the matches plus the contents of the page fetched, instead of just the matches.
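Because the return value changes shape with $debug (and is the string 'no contents' when the fetch fails), calling code may want to normalise it. A rough sketch, assuming a made-up helper unpack_scrape_result() and stand-in values shaped like go_scrape()'s return, so nothing is actually fetched:

```php
<?php
// Hypothetical helper (not part of scraper.php): normalise go_scrape()'s
// return value. go_scrape() returns the string 'no contents' on failure,
// array($foundbits, $contents) when $debug is on, and $foundbits otherwise.
function unpack_scrape_result($result, $debug) {
    if (!is_array($result)) {
        return array('found' => array(), 'page' => null, 'error' => $result);
    }
    if ($debug) {
        list($foundbits, $contents) = $result;
        return array('found' => $foundbits, 'page' => $contents, 'error' => null);
    }
    return array('found' => $result, 'page' => null, 'error' => null);
}

// Stand-in values (made up) shaped like the three possible returns.
$debug_result  = array(array('some moderator', '* none found *'), '<html>...</html>');
$normal_result = array('some moderator');
$failed_result = 'no contents';

$d = unpack_scrape_result($debug_result, 1);
$n = unpack_scrape_result($normal_result, 0);
$f = unpack_scrape_result($failed_result, 0);

print_r($d['found']);       // the matches, same as in non-debug mode
echo strlen($d['page']), " bytes of page source\n";
echo $f['error'], "\n";     // the failure string from go_scrape()
```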

The example above outputs the first moderator name found and the first thread title found in this PHP forum.

mincklerstraat

11:12 am on Jan 22, 2005 (gmt 0)


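Mistake in the script as I first posted it: the user-agent check used = (assignment) where == (comparison) was meant, so any non-empty $user_agent was treated as 'browser'.
Replace: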

if($user_agent = 'browser') ini_set('user_agent', $_SERVER['HTTP_USER_AGENT']);

with:

if($user_agent == 'browser') ini_set('user_agent', $_SERVER['HTTP_USER_AGENT']);
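A small sketch of why the single = matters, using a made-up user agent string: the assignment overwrites the variable and evaluates to 'browser', which is truthy, so the 'browser' branch always runs.

```php
<?php
$user_agent = 'my-bot/1.0';   // hypothetical custom user agent

// Assignment: overwrites $user_agent and evaluates to 'browser' (truthy).
$assign_result = ($user_agent = 'browser') ? 'browser branch' : 'custom branch';
$after_assign  = $user_agent; // now 'browser', the custom value is lost

$user_agent = 'my-bot/1.0';
// Comparison: only true when the caller really passed 'browser'.
$compare_result = ($user_agent == 'browser') ? 'browser branch' : 'custom branch';

echo $assign_result, "\n";   // browser branch
echo $after_assign, "\n";    // browser
echo $compare_result, "\n";  // custom branch
```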