Forum Moderators: coopster

Help with fread() - Limit size of file being read?

         

Nick_W

7:56 pm on Nov 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi all,

I copied this routine from the manual. It works great, but I need to download only the first 30k of each webpage being crawled.

[pre]
$pagehandle = @fopen($pagebeingcrawled, "rb");
$contents = "";
do {
    // read the stream in chunks of up to 8192 bytes
    $data = @fread($pagehandle, 8192);
    if (strlen($data) == 0) {
        break; // nothing more to read
    }
    $contents .= $data;
} while (true);
@fclose($pagehandle);
[/pre]

Anyone lend a hand?

Much thanks!

Nick

jatar_k

8:00 pm on Nov 26, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



why not change the fread line?

$data = @fread($pagehandle, 30000);

Nick_W

8:04 pm on Nov 26, 2003 (gmt 0)

Well, let me tell ya!

It's such a specific number that I thought it might have something to do with this (which is given as the reason for doing it this way):


When reading from network streams or pipes, such as those returned when reading remote files or from popen() and proc_open(), reading will stop after a packet is available. This means that you should collect the data together in chunks as shown in the example below.

Make any sense?

Nick

DrDoc

8:08 pm on Nov 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How about something like this:

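$target = 30 * 1024;  // stop after 30 KB
$chunk = 1024;        // most bytes requested per fread() call
$readsofar = 0;

$pagehandle = @fopen($pagebeingcrawled, "rb");
$contents = "";
do {
    // never ask for more than is still needed to reach $target
    $data = @fread($pagehandle, min($chunk, $target - $readsofar));
    if (strlen($data) == 0) {
        break;
    }
    $contents .= $data;
    // fread() may return fewer bytes than requested, so count what actually arrived
    $readsofar += strlen($data);
} while ($readsofar < $target);
@fclose($pagehandle);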

Nick_W

8:11 pm on Nov 26, 2003 (gmt 0)

That certainly looks like the ticket. Though wouldn't jatar's way do the same thing more simply?

What is the significance of '8192'?

Cheers guys, much obliged ;)

Nick

DrDoc

8:27 pm on Nov 26, 2003 (gmt 0)

8192 = 8192 bytes

You're telling the script to read up to 8192 bytes at a time before "doing something" with them.

jatar_k's solution should work too, depending on which version of PHP you're running.

[ Editor's Note: It *can* stop before because the behavior of fread() has been corrected with PHP 4.3.2 to stop reading on (A) Whole packet, (B) Maxlen bytes, or (C) EOF, whichever comes first. This change will be reflected in the next build of the manual. ]
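
A minimal sketch of the "collect the data in chunks" idea from the note above, capped at a byte limit. The helper name read_up_to() and its parameters are made up for illustration, and an in-memory stream stands in for a remote page; it is not the manual's code.

```php
<?php
// Hypothetical helper: read at most $limit bytes from an already open stream,
// looping because a single fread() may return fewer bytes than requested.
function read_up_to($handle, $limit, $chunk = 8192)
{
    $contents = '';
    while (strlen($contents) < $limit && !feof($handle)) {
        // never ask for more than is still missing
        $data = fread($handle, min($chunk, $limit - strlen($contents)));
        if ($data === false || $data === '') {
            break; // read error or nothing left
        }
        $contents .= $data;
    }
    return $contents;
}

// Usage: a 50000-byte in-memory stream plays the part of the crawled page.
$handle = fopen('php://memory', 'w+b');
fwrite($handle, str_repeat('x', 50000));
rewind($handle);
$first = read_up_to($handle, 30 * 1024);
fclose($handle);
echo strlen($first), "\n"; // 30720
```

Because the loop counts the bytes that actually arrived, it should behave the same whether fread() hands back full chunks or short network packets.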

DrDoc

8:29 pm on Nov 26, 2003 (gmt 0)

Here's how jatar_k's solution would work:

$pagehandle = @fopen($pagebeingcrawled,"rb");
$contents = @fread($pagehandle, 30*1024);
@fclose($pagehandle);

...which is a lot leaner.
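
For readers on PHP 5 or later (newer than the 4.x versions discussed in this thread), stream_get_contents() takes a length argument and keeps reading until that many bytes or EOF, so the one-call version does not depend on how much a single fread() returns. A small sketch, with an in-memory stream standing in for the remote page:

```php
<?php
// An in-memory stream stands in for the page opened with fopen($pagebeingcrawled, "rb").
$pagehandle = fopen('php://memory', 'w+b');
fwrite($pagehandle, str_repeat('y', 50000));
rewind($pagehandle);

// Read up to 30 KB from the current position, then stop.
$contents = stream_get_contents($pagehandle, 30 * 1024);
fclose($pagehandle);

echo strlen($contents), "\n"; // 30720
```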

Nick_W

8:30 pm on Nov 26, 2003 (gmt 0)

Yeah, less than 4.3 :(

I know what 8192 means but why not 9000? - I just thought the actual figure might be significant?

Nick

coopster

8:35 pm on Nov 26, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I think the 8192 is just a round number the author picked so that no single read pulls more than 8KB from the target file (8192 bytes / 1024 bytes per KB = 8).

Nick_W

8:37 pm on Nov 26, 2003 (gmt 0)

Right! - That would make sense I think ;)

Cheers all..

Nick

coopster

8:38 pm on Nov 26, 2003 (gmt 0)

Therefore, jatar_k's example would work, except to get the figure you want, Nick:
30*1024=30720 :)
<edit>as DrDoc pointed out earlier</edit>

Nick_W

8:45 pm on Nov 26, 2003 (gmt 0)

Now all I have to do is decide how much of a page is worth reading to get the recently added links from bloglike sites ---- How long is a piece of string? ;)

Nick

DrDoc

8:53 pm on Nov 26, 2003 (gmt 0)

...and where to start reading...

Maybe you shouldn't start counting until you have passed the opening body tag? Otherwise you may get a load of headers, style sheets, javascripts, and meta junk to parse through :)

coopster

9:07 pm on Nov 26, 2003 (gmt 0)

preg_match_all("/href=\"(.*)\"/Uis" , stristr($string, '<body>'), $matches);

$matches[1] would have the links? ;)

DrDoc

9:12 pm on Nov 26, 2003 (gmt 0)

preg_match_all("/href=[\"']?(.*)[\"']?[ >]/Uis" , stristr($string, '<body>'), $matches);

But what I meant was not to even save anything until the opening body tag...

Nick_W

9:46 pm on Nov 26, 2003 (gmt 0)

So what's the difference between your two snippets there, guys? And am I right in thinking they find anchor tags after the body tag?

Nick

jatar_k

10:02 pm on Nov 26, 2003 (gmt 0)

30720

yeah, alright I was being lazy

bunch o' comedians ;)

coopster

10:12 pm on Nov 26, 2003 (gmt 0)

Yes, Nick, that's right -- the REGEX finds links after the initial body tag. DrDoc corrected my initial REGEX to accommodate single quotations [w3.org] or even lack of quotations surrounding href attributes.

To do what he mentioned, though, not even save anything in the variable until after the body tag...

>>But what I meant was not to even save anything until the opening body tag...

That would have to occur inside the fread do loop:


if (!$contents) $data = stristr($data, '<body>');
if ($data) $contents .= $data;
//$contents .= $data;
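
A rough sketch of that idea as a standalone function, under two assumptions that go beyond the snippet above: the body tag may carry attributes (so it searches for '<body' rather than '<body>'), and the tag may be split across two chunks (so a few trailing bytes are carried over). The name read_from_body() and the tiny chunk size in the usage lines are purely illustrative.

```php
<?php
// Hypothetical helper: keep at most $limit bytes, starting at the opening body tag.
function read_from_body($handle, $limit, $chunk = 8192)
{
    $pending = '';    // bytes seen so far while still looking for the tag
    $contents = '';
    $found = false;
    while (!feof($handle) && strlen($contents) < $limit) {
        $data = fread($handle, $chunk);
        if ($data === false || $data === '') {
            break;
        }
        if ($found) {
            $contents .= $data;
            continue;
        }
        $pending .= $data;
        $pos = stripos($pending, '<body');
        if ($pos !== false) {
            $found = true;
            $contents = substr($pending, $pos); // discard everything before the tag
            $pending = '';
        } else {
            // '<body' is 5 characters, so keeping the last 4 catches a split tag
            $pending = substr($pending, -4);
        }
    }
    return substr($contents, 0, $limit);
}

// Usage: a 7-byte chunk size forces the tag to straddle two reads.
$html = '<html><head><title>t</title></head><body class="x"><a href="/a">a</a></body></html>';
$handle = fopen('php://memory', 'w+b');
fwrite($handle, $html);
rewind($handle);
$kept = read_from_body($handle, 20, 7);
fclose($handle);
echo $kept, "\n"; // <body class="x"><a h
```

If the page has no body tag at all, this returns an empty string, which may or may not be what a crawler wants.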

Nick_W

10:18 pm on Nov 26, 2003 (gmt 0)

Right! - Back to work on this in the morning, great help and discussion guys, appreciated ;)

I'm afraid I'm a poor Geek, I just don't keep late hours well hehe!

Nick

Jabocra

6:22 am on Nov 27, 2003 (gmt 0)

10+ Year Member



>>Yes, Nick, that's right -- the REGEX finds links after the initial body tag. DrDoc corrected my initial REGEX to accommodate single quotations or even lack of quotations surrounding href attributes.

Bear with me as I try to explain myself ;-)

I am working on a function to grab links from a URL and the REGEX DrDoc posted looked to be what I needed. The problem I am having (as best as I can figure) is that if the link does not contain 'http://' the result will include a quote (") preceding the link. If the link contains 'http://' then the link is fine. Is there any mod to the REGEX that will take care of this?

function grablinks($document){

    preg_match_all("/href=[\"']?(.*)[\"']?[ >]/Uis" , $document, $links); // DrDocs REGEX

    $match = array(); // so an array is returned even when nothing matches
    foreach ($links[1] as $val) { // each() was removed in PHP 8
        if(!empty($val))
            $match[] = $val;
    }

    // return the links
    return $match;

}

Thnx.
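
One likely cause of the stray quote: the U modifier makes every quantifier lazy, including the ? after the quote class, so the optional opening quote prefers to match nothing and the quote ends up inside the capture group. A hedged sketch of one possible rewrite follows; the function name grab_links_sketch() and the pattern are illustrative, not DrDoc's original, and the (?| branch-reset group assumes a reasonably recent PCRE.

```php
<?php
// Hypothetical rewrite: match double-quoted, single-quoted, and unquoted
// href values as three explicit alternatives instead of optional quotes.
function grab_links_sketch($document)
{
    // (?| ... ) is a branch-reset group, so every alternative captures into group 1.
    $pattern = '/href\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';
    preg_match_all($pattern, $document, $found);

    $links = array();
    foreach ($found[1] as $val) {
        if ($val !== '') { // skip empty href=""
            $links[] = $val;
        }
    }
    return $links;
}

// Usage: one link of each quoting style.
$doc = '<a href="/about">A</a> <a href=\'http://example.com/x\'>B</a> <a href=page.html>C</a>';
print_r(grab_links_sketch($doc));
// /about, http://example.com/x, page.html
```

Regexes only go so far with real-world HTML, so treat this as a starting point rather than a complete link extractor.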