Forum Moderators: coopster & jatar k

Crawling dynamic (AJAX) website

Deep web crawling

   
10:11 am on Jun 17, 2009 (gmt 0)

5+ Year Member



Hello,

I am trying to write a crawler that will let me scrape dynamic webpages. As I am a PHP developer, I am trying to see if it can be done with PHP.

Can anybody give me some pointers on how to proceed? Also, I would like to know if there is any other technology or technique with which it can be done.

Thanks in advance.

10:58 am on Jun 17, 2009 (gmt 0)

5+ Year Member



Please announce when your script is finished.
12:35 pm on Jun 17, 2009 (gmt 0)

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Welcome to WebmasterWorld, itsdone.

The typical tool for this job is cURL [php.net].

12:50 pm on Jun 17, 2009 (gmt 0)

5+ Year Member



Hi coopster,

Thanks for your reply.

I have used cURL successfully to crawl static webpages; however, the content that JavaScript adds to the page could not be fetched with cURL.

Is there any way to deal with AJAX content? For example, on page load a JS function is called, and that function fetches data using AJAX and adds it to the innerHTML of a DIV.

2:48 pm on Jun 19, 2009 (gmt 0)

10+ Year Member



I don't think even Google can do that, my friend. I had a client with AJAX functionality of that sort on a hotel booking site, and only 10 pages of the site were indexed...

Good luck! Hope you have a lot of time on your hands, a lot of patience, and a PhD in something!

3:12 pm on Jun 19, 2009 (gmt 0)

5+ Year Member



I once heard about a lightweight command-line 'firefox' application that can be run via PHP, but I don't have any more details, I'm afraid.

There's no way, to my knowledge, of rendering JavaScript in PHP itself; you'd need to go through some other software.

3:56 pm on Jun 19, 2009 (gmt 0)

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



... or write your own software that parses the JavaScript, locates the AJAX fetch routines, and executes them.
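
As a rough illustration of the "locate the fetch routines" step, here is a minimal sketch in Python (the same pattern should carry over to PHP's preg_match_all). It assumes the page's script calls XMLHttpRequest's open() with plain string literals for the method and URL; URLs assembled at runtime would be missed, and a real parser would be needed for those. The sample script and its /ajax/rooms.php URL are made up for the example.

```python
import re

# Matches calls such as xhr.open("GET", "/data.php?id=1") where both
# arguments are plain string literals. URLs built at runtime are missed.
OPEN_CALL = re.compile(
    r"""\.open\(\s*(['"])(GET|POST)\1\s*,\s*(['"])(.+?)\3""",
    re.IGNORECASE,
)


def find_ajax_requests(js_source):
    """Return (method, url) pairs for literal XMLHttpRequest open() calls."""
    return [
        (m.group(2).upper(), m.group(4))
        for m in OPEN_CALL.finditer(js_source)
    ]


# Hypothetical script of the kind described earlier in the thread:
# a function run on page load that fills a DIV via AJAX.
sample_js = """
function loadRooms() {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "/ajax/rooms.php?hotel=12", true);
    xhr.onreadystatechange = function () {
        if (xhr.readyState == 4) {
            document.getElementById("rooms").innerHTML = xhr.responseText;
        }
    };
    xhr.send(null);
}
"""

print(find_ajax_requests(sample_js))
```

Each URL found this way could then be fetched with cURL like any static page.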
5:33 pm on Jun 19, 2009 (gmt 0)

5+ Year Member



Alternatively, if you provide your dynamically-retrievable URLs in a uniform (i.e. preg-compatible) format in an inline script tag at the top of each page, and use a uniform URL syntax for accessing the listed URLs, your PHP wouldn't have to know a bit of actual JavaScript.

Google can't make that guarantee for all sites on the web.
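
To make the idea concrete, here is a small sketch, again in Python rather than PHP. It assumes a convention you would define yourself: every page lists its AJAX URLs as a JSON array assigned to a variable in an inline script tag. The variable name ajaxUrls and the sample URLs are hypothetical, and the pattern would break if a URL itself contained a closing square bracket.

```python
import json
import re

# Hypothetical convention: each page lists its AJAX URLs in a JSON array
# assigned to a variable named ajaxUrls inside an inline script tag.
URL_LIST = re.compile(r"var\s+ajaxUrls\s*=\s*(\[.*?\]);", re.DOTALL)


def extract_ajax_urls(html):
    """Return the URLs listed in the page's ajaxUrls array, or [] if absent."""
    match = URL_LIST.search(html)
    if match is None:
        return []
    # The array is valid JSON under the assumed convention.
    return json.loads(match.group(1))


sample_html = """<html><head>
<script type="text/javascript">
var ajaxUrls = ["/ajax/rooms.php?hotel=12", "/ajax/rates.php?hotel=12"];
</script>
</head><body><div id="rooms"></div></body></html>"""

print(extract_ajax_urls(sample_html))
```

The crawler never has to interpret the page's JavaScript; it only needs to know the agreed format.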

EDIT: You should also look into SSJS [en.wikipedia.org]; I'm not sure whether it could be hooked up to PHP.

EDIT2: You don't happen to run your site on a PC do you? Darn. [addons.mozilla.org]

EDIT3: Eureka! I was searching "Gecko" but needed "spidermonkey." And it looks like Pooh Bear might be involved too: [devzone.zend.com...]

[edited by: Jesdisciple at 6:29 pm (utc) on June 19, 2009]

10:28 am on Jun 22, 2009 (gmt 0)

5+ Year Member



Thanks for your replies. I will check those tools.
6:39 pm on Jun 22, 2009 (gmt 0)

5+ Year Member



Zend says you can't manipulate the DOM with that extension; however, you could probably register PHP's native DOM implementation with the JS. That is, if this bug doesn't bite you:
Yes, I think you could do that. However, one caveat is that the extension sometimes "loses information" when converting PHP objects to JavaScript. So IMO, you'd need to be aware of those issues and build in appropriate safeguards when working with user-supplied JavaScript.