Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

Crawling dynamic (AJAX) website
Deep web crawling

 10:11 am on Jun 17, 2009 (gmt 0)


I am trying to write a crawler that will let me scrape dynamic web pages. As I am a PHP developer, I would like to see whether it can be done with PHP.

Can anybody give me some pointers on how to proceed? I would also like to know whether any other technology or technique can do this.

Thanks in advance.



 10:58 am on Jun 17, 2009 (gmt 0)

Please announce when your script is finished.


 12:35 pm on Jun 17, 2009 (gmt 0)

Welcome to WebmasterWorld, itsdone.

The typical tool for this job is cURL [php.net].
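
For anyone who has not used it before, here is a minimal sketch of fetching a page with the cURL extension. The fetch_page() helper name and the user agent string are made up for illustration, and the function only returns the HTML as the server sends it, before any JavaScript has run.

```php
<?php
// Minimal sketch: fetch the raw HTML of a page with PHP's cURL extension.
// fetch_page() is a hypothetical helper name, not part of cURL itself.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1', // assumed name, pick your own
    ]);
    $body = curl_exec($ch);               // false on failure

    return $body === false ? null : $body;
}
```

Calling fetch_page() with a page URL should give back the markup as a string, or null if the request failed.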


 12:50 pm on Jun 17, 2009 (gmt 0)

Hi coopster,

Thanks for your reply.

I have used cURL successfully to crawl static web pages; however, content that JavaScript adds to the page could not be fetched with cURL.

Is there any way to deal with AJAX content? For example, on page load a JS function is called, and that function fetches data using AJAX and adds it to the innerHTML of a DIV.


 2:48 pm on Jun 19, 2009 (gmt 0)

I don't think even Google can do that, my friend. I had a client with that sort of AJAX functionality on a hotel booking site, and only 10 pages of the site were indexed...

Good luck! Hope you have a lot of time on your hands and a lot of patience and a PhD in something!


 3:12 pm on Jun 19, 2009 (gmt 0)

I once heard about a lightweight command-line 'firefox' application that can be run via PHP, but I don't have any more details, I'm afraid.

There's no way, to my knowledge, of rendering JavaScript in PHP; you'd need to go through some other software.


 3:56 pm on Jun 19, 2009 (gmt 0)

... or write your own software that parses the JavaScript, locates the AJAX fetch routines, and executes their processes.
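
A very rough sketch of the "locate the fetch routines" part, assuming the page uses plain XMLHttpRequest with string-literal URLs. The find_ajax_urls() name and the sample script are made up for illustration; this is a regex scan rather than a real JavaScript parser, so URLs that are built at runtime will be missed.

```php
<?php
// Sketch: find URLs passed to XMLHttpRequest.open() in a chunk of JavaScript.
// find_ajax_urls() is a hypothetical helper; the regex only catches
// string-literal URLs, not ones assembled from variables.
function find_ajax_urls(string $javascript): array
{
    $pattern = '/\.open\(\s*["\'](?:GET|POST)["\']\s*,\s*["\']([^"\']+)["\']/i';
    preg_match_all($pattern, $javascript, $matches);

    // $matches[1] holds the captured URL of every open() call found.
    return array_values(array_unique($matches[1]));
}

// Made-up sample script, similar to what a page might run on load.
$js = 'var xhr = new XMLHttpRequest(); '
    . 'xhr.open("GET", "/ajax/rooms.php?hotel=1", true); xhr.send(null);';

print_r(find_ajax_urls($js));
```

Each URL found this way could then be requested directly with cURL, which skips the JavaScript step entirely.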


 5:33 pm on Jun 19, 2009 (gmt 0)

Alternatively, if you provide your dynamically-retrievable URLs in a uniform (i.e. preg-compatible) format in an inline script tag at the top of each page, and use a uniform URL syntax for accessing the listed URLs, your PHP wouldn't have to know a bit of actual JavaScript.

Google can't make that guarantee for all sites on the web.
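
To illustrate the uniform-list idea, here is a sketch under some assumptions: the ajaxUrls variable name, the array layout and the extract_listed_urls() helper are all invented for the example, and your own pages could use any format as long as it stays consistent.

```php
<?php
// Sketch: read a uniform URL list from an inline script tag.
// The "ajaxUrls" variable name and list layout are assumptions for illustration.
function extract_listed_urls(string $html): array
{
    // Grab everything between the brackets of: var ajaxUrls = [ ... ];
    if (!preg_match('/var\s+ajaxUrls\s*=\s*\[(.*?)\]\s*;/s', $html, $list)) {
        return [];
    }
    // Pull out each double-quoted entry from the list.
    preg_match_all('/"([^"]+)"/', $list[1], $urls);

    return $urls[1];
}

// Made-up sample page with the list at the top.
$html = <<<'HTML'
<html><head>
<script type="text/javascript">
var ajaxUrls = ["/ajax/rooms.php?page=1", "/ajax/rooms.php?page=2"];
</script>
</head><body><div id="rooms"></div></body></html>
HTML;

print_r(extract_listed_urls($html));
```

The PHP side never has to understand what the page's JavaScript does with those URLs; it only needs to match the agreed format.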

EDIT: You should also look into SSJS [en.wikipedia.org]; I'm not sure whether it could be hooked up to PHP.

EDIT2: You don't happen to run your site on a PC do you? Darn. [addons.mozilla.org]

EDIT3: Eureka! I was searching "Gecko" but needed "spidermonkey." And it looks like Pooh Bear might be involved too: [devzone.zend.com...]

[edited by: Jesdisciple at 6:29 pm (utc) on June 19, 2009]


 10:28 am on Jun 22, 2009 (gmt 0)

Thanks for your replies. I will check those tools.


 6:39 pm on Jun 22, 2009 (gmt 0)

Zend says you can't manipulate the DOM with that extension; however, you could probably register PHP's native DOM implementation with the JS. That is, if this bug doesn't bite you:
Yes, I think you could do that. However, one caveat is that the extension sometimes "loses information" when converting PHP objects to JavaScript. So IMO, you'd need to be aware of those issues and build in appropriate safeguards when working with user-supplied JavaScript.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved