I am trying to write a crawler that will let me scrape dynamic web pages. As I am a PHP developer, I would like to know whether it can be done with PHP.
Can anybody give me some pointers on how to proceed? Also, I would like to know if there is any other technology or technique with which it can be done.
Thanks in advance.
The typical tool for this job is cURL [php.net].
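As a rough illustration of the cURL suggestion, here is a minimal sketch that fetches a page and pulls the link targets out of the returned HTML. The helper names fetchPage() and extractLinks(), the user agent string, and the example URL are hypothetical and chosen only for this example; the sketch assumes the cURL and DOM extensions are enabled.

```php
<?php
// Minimal sketch: fetch a page with PHP's cURL extension.
// fetchPage() is a hypothetical helper name, not part of cURL itself.
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    if ($ch === false) {
        throw new RuntimeException("Could not initialise cURL for $url");
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1', // placeholder user agent
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body;
}

// Pure helper: collect the href values of all <a> elements in an HTML string.
function extractLinks(string $html): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Suppress the warnings that malformed real-world markup tends to trigger.
    @$doc->loadHTML($html);
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Example usage (commented out because it performs a network request):
// $html = fetchPage('http://example.com/');
// print_r(extractLinks($html));
```

This only sees the HTML the server sends; anything the page adds later through JavaScript will not be in the returned string.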
Thanks for your reply.
I have used cURL successfully for crawling static web pages; however, content that is rendered by JavaScript could not be fetched with cURL.
Is there any way to deal with AJAX content? For example, on page load a JS function is called; that function fetches data using AJAX and adds it to the innerHTML of a DIV.
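One approach that sometimes works for the case described above is to skip the JavaScript entirely and request the URL that the AJAX call itself fetches (it can usually be found by watching the page's network requests in the browser). The sketch below assumes that endpoint returns JSON; the helper names fetchAjaxJson() and decodeAjaxResponse() and the endpoint URL are hypothetical. If the endpoint returns an HTML fragment instead, the raw body would be used as-is rather than decoded.

```php
<?php
// Sketch: request the data endpoint that the page's JavaScript would call.
// The endpoint URL and the JSON response format are assumptions for illustration.
function fetchAjaxJson(string $endpoint): array
{
    $ch = curl_init($endpoint);
    if ($ch === false) {
        throw new RuntimeException("Could not initialise cURL for $endpoint");
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_HTTPHEADER     => [
            // Some servers check this header before answering AJAX requests.
            'X-Requested-With: XMLHttpRequest',
            'Accept: application/json',
        ],
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return decodeAjaxResponse($body);
}

// Pure helper: turn the response body into a PHP array, or fail loudly.
function decodeAjaxResponse(string $body): array
{
    $data = json_decode($body, true);
    if (!is_array($data)) {
        throw new UnexpectedValueException('Response was not a JSON object or array');
    }
    return $data;
}

// Example usage (commented out because it performs a network request):
// $data = fetchAjaxJson('http://example.com/ajax/items'); // hypothetical endpoint
// print_r($data);
```

This does not help when the data is computed in the browser rather than fetched from a URL; in that case something that actually executes the JavaScript is needed.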
Google can't make that guarantee for all sites on the web.
EDIT: You should also look into SSJS [en.wikipedia.org]; I'm not sure whether it could be hooked up to PHP.
EDIT2: You don't happen to run your site on a PC, do you? Darn. [addons.mozilla.org]
EDIT3: Eureka! I was searching for "Gecko" but needed "SpiderMonkey." And it looks like Pooh Bear might be involved too: [devzone.zend.com...]
[edited by: Jesdisciple at 6:29 pm (utc) on June 19, 2009]
Yes, I think you could do that. However, one caveat is that the extension sometimes "loses information" when converting PHP objects to JavaScript. So IMO, you'd need to be aware of those conversion issues, and build in appropriate safeguards when working with user-supplied JavaScript.