I am using fopen() and fgets() to pull the content from a remote website one line at a time, and then using regular expressions to parse out the data for use on my own site.
The page I'm scraping is a whopping 395k in size including images. I have permission to scrape this page, but I'm sure that permission will vanish if I have to download the full 395k each time, as it will really chew up his bandwidth.
My question is, can I just scrape the HTML code itself (perhaps this is what fgets() is already doing)? All I need is the actual HTML, so downloading the images would just waste bandwidth (to the tune of about 365k).
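Roughly, here is what I'm doing at the moment. The URL and the pattern below are just placeholders, not the real site or the real data:

<?php
// Open the remote page and read it one line at a time,
// running a regex over each line as it comes in.
// allow_url_fopen must be enabled for fopen() to accept a URL.
$fp = fopen('http://www.example.com/page.html', 'r');
if ($fp) {
    while (!feof($fp)) {
        $line = fgets($fp, 4096);
        // Placeholder pattern; the real one matches the data I actually need.
        if (preg_match('/<td class="price">([^<]+)<\/td>/', $line, $match)) {
            echo $match[1] . "\n";
        }
    }
    fclose($fp);
}
?>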
thanks,
chris
Yes, fopen()/fgets() only fetch the HTML document itself; the images are separate requests that a browser would make, so you aren't downloading them. Hopefully the pure HTML is well below 395k :)
Added: You might want to consider caching the page (saving a local copy) if it is only updated, say, once a day...
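Something along these lines; the cache file path, the URL, and the one-day window are just example choices:

<?php
// Re-fetch the remote page only if the local copy is missing
// or older than a day; otherwise work from the cached file.
$cacheFile = '/tmp/scrape_cache.html';
$maxAge    = 86400; // one day in seconds

if (!file_exists($cacheFile) || (time() - filemtime($cacheFile)) > $maxAge) {
    $html = file_get_contents('http://www.example.com/page.html');
    if ($html !== false) {
        $out = fopen($cacheFile, 'w');
        fwrite($out, $html);
        fclose($out);
    }
}
$html = file_get_contents($cacheFile);
?>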
I've tried file_get_contents() and it also works, though the script runs quite a bit slower than the line-by-line fgets() version. Am I right that it's less intrusive because it makes one call to the server instead of multiple calls? Once the script is in a cron job the load time won't matter anyway.
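For reference, the one-call version is roughly this (URL and pattern are placeholders again):

<?php
// Pull the whole document down in a single call, then run
// the regex over it in one pass instead of line by line.
$html = file_get_contents('http://www.example.com/page.html');
if ($html !== false && preg_match_all('/<td class="price">([^<]+)<\/td>/', $html, $matches)) {
    foreach ($matches[1] as $value) {
        echo $value . "\n";
    }
}
?>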
Thanks,
Chris