PHP Site Scraper

Want to develop a php site scraper

dermotirl

11:25 am on Jun 21, 2006 (gmt 0)

10+ Year Member



Hi, I want to build a site scraper in PHP that will scrape news content from a number of our websites and our partners' websites. There are a large number of sites, so it is very time-consuming to do it manually.

I have looked into it myself but am not sure where to start. Some people say to do it using fopen and others say to use cURL. Does anyone know which is best to use, or know of any tutorials that may be of help?

Thanks

eelixduppy

12:11 pm on Jun 21, 2006 (gmt 0)



Welcome to WebmasterWorld!

>>>our websites and partners website's
Do these sites have databases holding this information? If so, you can take the data straight from there, especially for your own sites, and that will reduce some of the work that has to be done.

dermotirl

1:34 pm on Jun 21, 2006 (gmt 0)

10+ Year Member



Yes, the sites have databases, but I cannot get access to the databases because the partners will not give out that information, which I cannot really blame them for.

There are just over 100 sites that I want to scrape for their news, insert it into my own database and then display it on the site.
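For the scraping route, the extraction step might look roughly like the sketch below. It is only an illustration: the function name extract_headlines and the per-site XPath query are hypothetical, and it assumes PHP's DOM extension is available. Because every one of the 100 or so sites will have different markup, the query is passed in per site rather than hard-coded.

```php
<?php
/**
 * Pull headline text and links out of an HTML page.
 * extract_headlines and $xpathQuery are illustrative names, not from the thread.
 * $xpathQuery must select <a> elements and will differ for every site.
 */
function extract_headlines(string $html, string $xpathQuery): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpathQuery);
    if ($nodes === false) {
        return []; // the XPath expression itself was invalid
    }

    $items = [];
    foreach ($nodes as $node) {
        if ($node instanceof DOMElement) {
            $items[] = [
                'title' => trim($node->textContent),
                'url'   => $node->getAttribute('href'), // may be a relative URL
            ];
        }
    }
    return $items;
}
```

Each returned row could then be written to your own database with a prepared INSERT statement. Relative links would need to be resolved against the source site's base URL before being stored.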

Philosopher

1:38 pm on Jun 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If they are partner sites, how about getting them to set you up your own username/password with read-only permissions?

trillianjedi

1:39 pm on Jun 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I cannot get access to the databases because the partners will not give out that information, which I cannot really blame them for.

Might be easier for you to build them a script that they install on their server that only your server can call - they then dump DB data through that script to you.

Very easy to restrict access to it by IP address. I would think that's a lot better for them than having you hammering Apache to try and crawl it?
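A minimal sketch of what such a partner-side script could look like, purely as an illustration. The file name export.php, the ALLOWED_IPS list, the placeholder address and the row format are all assumptions; the partner would replace the sample row with a query against their own database.

```php
<?php
// export.php: hypothetical script a partner installs on their own server.
// ALLOWED_IPS and the row shape are assumptions for illustration only.

const ALLOWED_IPS = ['203.0.113.10']; // placeholder address from a documentation range

/** Return true only when the caller's address is on the allow list. */
function is_allowed_client(string $remoteAddr, array $allowedIps): bool
{
    return in_array($remoteAddr, $allowedIps, true);
}

/** Encode news rows as JSON for the calling server. */
function export_news(array $rows): string
{
    return json_encode(['items' => $rows]);
}

// Only handle a request when running under a web server, not from the command line.
if (PHP_SAPI !== 'cli') {
    if (!is_allowed_client($_SERVER['REMOTE_ADDR'] ?? '', ALLOWED_IPS)) {
        http_response_code(403);
        exit('Forbidden');
    }
    header('Content-Type: application/json');
    // In a real script these rows would come from the partner's own database query.
    echo export_news([
        ['title' => 'Example headline', 'url' => 'https://example.com/news/1'],
    ]);
}
```

Note that REMOTE_ADDR is the address of whatever connects directly to the web server, so behind a proxy or load balancer the check would need adjusting.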

TJ

dermotirl

2:00 pm on Jun 21, 2006 (gmt 0)

10+ Year Member



Thanks, I'll try that, but in the event they don't or won't, are there any other alternatives? I still think scraping them might be a good option.

I'm looking into using cURL and it seems to be better than fopen, as I can connect to a URL without getting the following error (URL file-access is disabled in the server configuration).
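That error is what PHP reports when allow_url_fopen is switched off in php.ini, which is why fopen fails on URLs while cURL is unaffected. A rough sketch of a cURL fetch follows, assuming the cURL extension is installed. The function name fetch_url, the user agent string and the timeout values are made up for illustration.

```php
<?php
/**
 * Fetch a page with the cURL extension.
 * Returns the response body, or null on a transport error or non-2xx status.
 * fetch_url, the user agent and the timeouts are illustrative assumptions.
 */
function fetch_url(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'NewsFetcher/0.1 (hypothetical)',
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}
```

With around 100 sites to visit, it would be sensible to pause between requests and cache results so the partners' servers are not hit harder than necessary.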