PHP crawler optimization

darkphoenix

9:54 am on Jan 13, 2009 (gmt 0)

10+ Year Member



Hi All

I'm having some difficulties with my PHP crawler and was hoping some of you might have some optimization suggestions :)

- I have a PHP crawler which crawls about 30 websites (each with about 100-200 subpages) in a single-threaded environment. The crawler is started by a cron job each night.

- The crawler retrieves each page using:
$page = implode('', file($url));

The content found on each page is then saved to a MySQL database as it is found.

The problem is that some nights this operation times out, so only half of the pages get crawled and I have to restart the crawler the next morning to pick up the remaining sites.

So I was hoping some of you could help me figure out a way to optimize my crawler so it doesn't time out all the time.

Best Regards

janharders

10:00 am on Jan 13, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



don't use the file functions on urls. files and urls aren't meant to be treated alike, and imho it was a bad decision by the php devs to overload them that way (as was register_globals etc ... they'll figure it out eventually). instead, use a "real" http library, like curl for example, where you can define timeouts and easily write code to decide what to do when an error occurs (and what to do for each type of error).

It'll save you a lot of trouble and you'll have much more control.
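
for example, something along these lines (just a rough, untested sketch - the timeouts, user agent and url are placeholders, tune them to your setup):

<?php
// fetch a single page with cURL instead of file(),
// with explicit timeouts and basic error handling.
function fetch_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // give up connecting after 5 seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // hard limit per request, so one slow site can't stall the whole run
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0'); // placeholder UA

    $page = curl_exec($ch);
    if ($page === false) {
        // curl_error() tells you what went wrong (timeout, dns failure, ...)
        // so you can log it and move on to the next url instead of dying
        error_log('fetch failed for ' . $url . ': ' . curl_error($ch));
    }
    curl_close($ch);
    return $page; // page body, or false on error
}

$page = fetch_page('http://www.example.com/'); // placeholder url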

darkphoenix

10:12 am on Jan 13, 2009 (gmt 0)

10+ Year Member



Ok thanks a lot - that's just the kind of small cURL example I was after, I'll give it a quick try.

darkphoenix

10:26 am on Jan 13, 2009 (gmt 0)

10+ Year Member



Damn, that worked like a charm with cURL (just gave it a quick try).

One of my crawlers that took 1 min 29 sec before now takes only 31 sec.

Thanks a lot mate.

janharders

10:31 am on Jan 13, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



you're welcome!
you should see even better results when handling bigger files. you could also save the last time you visited each url and then ask the server to return data only if the page has changed since then (a conditional GET) - that could save your script a lot of time.
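
with curl that's just a couple of extra options - roughly like this (again an untested sketch; $last_visit stands for the unix timestamp you stored for that url on the previous crawl):

<?php
$url        = 'http://www.example.com/'; // placeholder
$last_visit = time() - 86400;            // placeholder: when you last crawled this url

// conditional GET: only download the page if it changed since the last crawl
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE); // send an If-Modified-Since header
curl_setopt($ch, CURLOPT_TIMEVALUE, $last_visit);                  // ... with this timestamp

$page = curl_exec($ch);
if (curl_getinfo($ch, CURLINFO_HTTP_CODE) == 304) {
    // 304 Not Modified: the server sent no body, so skip parsing
    // and keep what's already stored in the database for this url
    $page = false;
}
curl_close($ch);

note that not every server honours If-Modified-Since - if one answers 200 anyway, you just get the full page as usual.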