Forum Moderators: coopster


fopen() and fgets() to scrape a remote page, and bandwidth concerns.

Can I limit what's downloaded? (I don't want to download images)

         

Christopher C

4:28 pm on Mar 22, 2004 (gmt 0)

10+ Year Member



Hello,

I am using fopen() and fgets() to take the content from a remote website one line at a time, then using regular expressions to parse the data for use on my own site.

The page that I'm scraping is a whopping 395k in size including images. I have permission to scrape this page, but I'm sure that permission will vanish if I have to download the full 395k each time, as it will really chew up his bandwidth.

My question is: can I just scrape the HTML code itself (perhaps this is what fgets() is already doing)? All I need is the actual HTML, so downloading the images is just wasting bandwidth (to the tune of about 365k).

thanks,
chris

gethan

5:00 pm on Mar 22, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Chris, don't worry. fgets() does not grab the images (also take a look at file()). If you wanted the images you would have to write code that specifically requests them as well (e.g. by following each src='').
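To illustrate, here's a minimal sketch of the line-by-line approach. A local temp file stands in for the remote page so the sketch runs anywhere; fopen() works the same on an 'http://...' URL (assuming allow_url_fopen is enabled). The point is that fgets() only reads the bytes of the document itself, so the image reference is just text and big.jpg is never downloaded:

```php
<?php
// Sketch: read a page line by line with fopen()/fgets().
// A local temp file stands in for the remote URL (hypothetical).
$page = tempnam(sys_get_temp_dir(), 'page');
$fp = fopen($page, 'w');
fwrite($fp, "<html>\n<img src='big.jpg'>\n</html>\n");
fclose($fp);

$handle = fopen($page, 'r');
$count = 0;
while (($line = fgets($handle)) !== false) {
    $count++;   // run your regular expressions against $line here
}
fclose($handle);
unlink($page);
echo $count;    // prints 3 - only three lines of text were read
?>
```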

Hopefully the pure html is well below 395k :)

Added: You might want to consider caching the page (saving a local copy) if the page is updated for example once a day...
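Something like this, say (the cache path and the one-day TTL are just assumptions for the sketch, not anything Chris has described):

```php
<?php
// Sketch of the caching idea: reuse a local copy while it is still fresh,
// and only hit the remote server once the copy is older than $ttl seconds.
function cached_fetch($url, $cacheFile, $ttl)
{
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);   // fresh enough: no remote hit
    }
    $html = file_get_contents($url);            // fetch the remote page
    $fp = fopen($cacheFile, 'w');
    fwrite($fp, $html);                         // save a local copy
    fclose($fp);
    return $html;
}

// hypothetical URL, cache path and one-day TTL:
// $html = cached_fetch('http://example.com/page.html', '/tmp/page-cache.html', 86400);
?>
```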

[edited by: gethan at 5:03 pm (utc) on Mar. 22, 2004]

coopster

5:01 pm on Mar 22, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



That's right, the fgets() file function is merely reading the file, not rendering it, nor downloading any images. Have you also considered the file [php.net] function or perhaps the file_get_contents [php.net] function? They will save you from having to use a loop just to read the entire file. Just another option...
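For comparison, both alternatives sketched (again with a local temp file standing in for the hypothetical remote URL): file_get_contents() gives you the whole document as one string, file() gives you an array of lines, and either way you can drop the fgets() loop and run the regular expressions in a single pass:

```php
<?php
// Sketch: read the whole document at once instead of looping with fgets().
$source = tempnam(sys_get_temp_dir(), 'page');
$fp = fopen($source, 'w');
fwrite($fp, "<p>one</p>\n<p>two</p>\n");
fclose($fp);

$html  = file_get_contents($source);  // entire document as one string
$lines = file($source);               // array of lines, newlines kept

// parse the whole page in one pass, e.g.:
preg_match_all('/<p>(.*?)<\/p>/', $html, $matches);

unlink($source);
?>
```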

Christopher C

6:02 pm on Mar 22, 2004 (gmt 0)

10+ Year Member



Ah, this is good news guys, thanks :) Yes indeed, the HTML is much smaller (about 30k or so). I am also storing the parsed content directly in a database, but it's a product database so I have to keep it updated to make sure our sites stay in sync.

I've tried file_get_contents() and it also works, though the script runs quite a bit slower than a line-by-line fgets(). Am I right that it is less intrusive, since it makes just one call to the server instead of multiple calls? Once the script is running from a cron job the load time won't matter anyway.

Thanks,
Chris