I copied this routine from the manual. It works great, but I only need to download the first 30k of each web page being crawled. How can I do that?
[pre]
$pagehandle=@fopen($pagebeingcrawled,"rb");
$contents = "";
do {
$data = @fread($pagehandle, 8192);
if (strlen($data) == 0) {
break;
}
$contents .= $data;
} while(true);
@fclose($pagehandle);
[/pre]
Anyone lend a hand?
Much thanks!
Nick
It's such a specific number that I thought it might have something to do with this (which is given as the reason for doing it this way):
When reading from network streams or pipes, such as those returned when reading remote files or from popen() and proc_open(), reading will stop after a packet is available. This means that you should collect the data together in chunks as shown in the example below.
Make any sense?
Nick
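$target = 30 * 1024; // stop after roughly 30k
$chunk = 1024;       // bytes requested per fread() call
$readsofar = 0;
$pagehandle = @fopen($pagebeingcrawled, "rb");
$contents = "";
if ($pagehandle) {
    do {
        // never ask for more than is still needed to reach the target
        $data = @fread($pagehandle, min($chunk, $target - $readsofar));
        if ($data === false || strlen($data) == 0) {
            break;
        }
        $contents .= $data;
        // count the bytes actually returned; network reads can come up short
        $readsofar += strlen($data);
    } while ($readsofar < $target);
    @fclose($pagehandle);
}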
You're telling the script to read 8192 bytes before "doing something" with it.
jatar_k's solution should work too, depending on which version of PHP you're running.
[ Editor's Note: It *can* stop before the requested length is reached, because the behavior of fread() was corrected in PHP 4.3.2 to stop reading on (A) a whole packet, (B) maxlen bytes, or (C) EOF, whichever comes first. This change will be reflected in the next build of the manual. ]
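For what it's worth, here is a minimal sketch of a capped read that does not depend on how much each fread() call happens to return. The helper name read_up_to() and the in-memory demo stream are my own assumptions for illustration, not something from the manual. It counts the bytes actually received and never requests more than is still missing.

```php
<?php
// Hypothetical helper (not from the manual): read at most $limit bytes
// from an already opened stream, in chunks of up to $chunk bytes.
function read_up_to($handle, $limit, $chunk = 8192)
{
    $contents = '';
    while (strlen($contents) < $limit && !feof($handle)) {
        // Request only what is still missing so the cap is never exceeded.
        $want = min($chunk, $limit - strlen($contents));
        $data = fread($handle, $want);
        if ($data === false || $data === '') {
            break; // read error or nothing more to read
        }
        // $data may be shorter than $want on network streams.
        $contents .= $data;
    }
    return $contents;
}

// Demo on an in-memory stream so the sketch runs without network access.
$stream = fopen('php://memory', 'w+b');
fwrite($stream, str_repeat('x', 50000));
rewind($stream);
$first30k = read_up_to($stream, 30 * 1024);
fclose($stream);
echo strlen($first30k), "\n";
```

For a remote page you would pass the handle returned by fopen($pagebeingcrawled, "rb") instead of the in-memory stream.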
To do what he mentioned, though, not even save anything in the variable until after the body tag...
>>But what I meant was not to even save anything until the opening body tag...
That would have to occur inside the fread do loop:
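// skip everything before the opening body tag; '<body' also matches tags with attributes
if (!$contents) $data = stristr($data, '<body');
if ($data !== false) $contents .= $data;
//$contents .= $data; // the original line this replaces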
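One caveat with checking each chunk on its own: the body tag can be split across two reads, in which case a per-chunk search never sees it. Below is a rough sketch of one way around that, keeping a short tail of the previous chunk until the tag turns up. The function name read_from_body() and the tiny chunk size in the demo are assumptions of mine, picked only to force the split.

```php
<?php
// Hypothetical helper: return up to $limit bytes, starting at the opening
// body tag, from an already opened stream.
function read_from_body($handle, $limit, $chunk = 1024)
{
    $pending = '';   // data seen so far while the tag has not been found
    $contents = '';  // data kept from the tag onward
    $found = false;
    while (strlen($contents) < $limit && !feof($handle)) {
        $data = fread($handle, $chunk);
        if ($data === false || $data === '') {
            break;
        }
        if ($found) {
            $contents .= $data;
            continue;
        }
        $pending .= $data;
        $pos = stripos($pending, '<body');
        if ($pos !== false) {
            $found = true;
            $contents = substr($pending, $pos);
            $pending = '';
        } else {
            // '<body' is 5 characters, so a partial match at the end of
            // this chunk is at most 4 characters long; keep just that tail.
            $pending = substr($pending, -4);
        }
    }
    // The last chunk may have overshot the cap, so trim to $limit.
    return substr($contents, 0, $limit);
}

// Demo: a chunk size of 6 splits '<body' between two reads.
$html = '<html><head><title>t</title></head><body class="x"><p>hi</p></body></html>';
$stream = fopen('php://memory', 'w+b');
fwrite($stream, $html);
rewind($stream);
$body = read_from_body($stream, 30 * 1024, 6);
fclose($stream);
echo $body, "\n";
```

If the page has no body tag at all, this returns an empty string, so you may want a fallback for that case.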
Bear with me as I try to explain myself ;-)
I am working on a function to grab links from a URL, and the REGEX DrDoc posted looked to be what I needed. The problem I am having (as best as I can figure) is that if the link does not contain 'http://', the result will include a quote (") preceding the link. If the link contains 'http://', then the link is fine. Is there any mod to the REGEX that will take care of this?
function grablinks($document){
    $match = array(); // start empty so the function always returns an array
    preg_match_all("/href=[\"']?(.*)[\"']?[ >]/Uis" , $document, $links); // DrDoc's REGEX
    foreach ($links[1] as $val) {
        if(!empty($val))
            $match[] = $val;
    }
    // return the links
    return $match;
}
Thnx.
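As far as I can tell, the stray quote comes from the U modifier: it makes every quantifier lazy, including the ? on the opening quote, so the quote is skipped there and ends up inside the captured group instead. Here is a rough sketch of one possible mod. The pattern and the name grablinks_sketch() are my own assumptions, not DrDoc's original. It drops the U modifier and captures everything up to the next quote, whitespace, or '>'.

```php
<?php
// Hypothetical variant of grablinks(); the pattern is an assumption,
// not the regex originally posted in the thread.
function grablinks_sketch($document)
{
    // Optional opening quote (greedy here, since /U is not used), then
    // capture every character up to the next quote, whitespace or '>'.
    preg_match_all('/href\s*=\s*["\']?([^"\'\s>]+)/i', $document, $links);
    return $links[1]; // empty array when nothing matches
}

$html = '<a href="page.html">one</a> '
      . '<a href=\'http://example.com/\'>two</a> '
      . '<a href=other.html>three</a>';
print_r(grablinks_sketch($html));
```

One limitation to be aware of: a quoted URL that contains a space would be cut off at the space, so this is only a starting point.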