
PHP Server Side Scripting Forum

    
Crawling a dynamic (AJAX) website
Deep web crawling
itsdone




msg:3935010
 10:11 am on Jun 17, 2009 (gmt 0)

Hello,

I am trying to write a crawler that will let me scrape dynamic webpages. As I am a PHP developer, I would like to see whether it can be done with PHP.

Can anybody give me some pointers on how to proceed? I would also like to know whether there is any other technology or technique with which it can be done.

Thanks in advance.

 

extomas




msg:3935038
 10:58 am on Jun 17, 2009 (gmt 0)

Please let us know when your script is finished.

coopster




msg:3935096
 12:35 pm on Jun 17, 2009 (gmt 0)

Welcome to WebmasterWorld, itsdone.

The typical tool for this job is cURL [php.net].
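As a rough illustration of the cURL approach, a minimal fetch helper might look like the sketch below. The function name `fetch_page` and the timeout values are assumptions chosen for illustration. It returns only the raw HTML the server sends and does not run any JavaScript on the page.

```php
<?php
// Hypothetical helper (name and defaults are illustrative, not from this thread).
// Returns the response body as a string, or null on any failure.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    if (!function_exists('curl_init')) {
        return null; // cURL extension is not loaded
    }
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}
```

Calling `fetch_page('http://www.example.com/')` would give the static markup of that page, which can then be parsed with PHP's DOM or preg functions.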

itsdone




msg:3935105
 12:50 pm on Jun 17, 2009 (gmt 0)

Hi coopster,

Thanks for your reply.

I have used cURL successfully to crawl static webpages. However, content that JavaScript adds to the page could not be fetched with cURL.

Is there any way to deal with AJAX content? For example, on page load a JS function is called, and that function fetches data using AJAX and adds it to the innerHTML of a DIV.

Pico_Train




msg:3936793
 2:48 pm on Jun 19, 2009 (gmt 0)

I don't think even Google can do that, my friend. I had a client with AJAX functionality of that sort on a hotel booking site, and only 10 pages of the site were indexed...

Good luck! Hope you have a lot of time on your hands, a lot of patience, and a PhD in something!

nick279




msg:3936813
 3:12 pm on Jun 19, 2009 (gmt 0)

I once heard about a lightweight, command-line-based 'firefox' application that can be run via PHP, but I don't have any more details, I'm afraid.

There's no way to my knowledge of rendering JavaScript in PHP - you'd need to go through some software.

coopster




msg:3936848
 3:56 pm on Jun 19, 2009 (gmt 0)

... or write your own software that parses the JavaScript, locates the AJAX fetch routines, and executes their processes.
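A very rough sketch of that idea, assuming the page's scripts call `XMLHttpRequest.open()` with literal URL strings: scan the JavaScript source for those calls and collect the URLs, which the crawler can then request directly. The function name `find_ajax_urls` and the sample script are hypothetical, and a regular expression is only a stand-in for real JavaScript parsing, so URLs assembled at runtime will be missed.

```php
<?php
// Hypothetical helper: scans JavaScript source text for XMLHttpRequest-style
// open("METHOD", "url") calls and returns the literal URLs it finds.
// Only string literals are seen; URLs built at runtime are not detected.
function find_ajax_urls(string $jsSource): array
{
    $pattern = '/\.open\(\s*["\'](?:GET|POST)["\']\s*,\s*["\']([^"\']+)["\']/i';
    preg_match_all($pattern, $jsSource, $matches);
    // Remove duplicates while keeping first-seen order.
    return array_values(array_unique($matches[1]));
}

// Made-up sample script for illustration.
$js = 'var x = new XMLHttpRequest(); x.open("GET", "/rooms.php?page=1", true); x.send(null);';
print_r(find_ajax_urls($js));
```

Each URL found this way may be relative, so it would need to be resolved against the page's address before being fetched.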

Jesdisciple




msg:3936924
 5:33 pm on Jun 19, 2009 (gmt 0)

Alternatively, if you provide your dynamically-retrievable URLs in a uniform (i.e. preg-compatible) format in an inline script tag at the top of each page, and use a uniform URL syntax for accessing the listed URLs, your PHP wouldn't have to know a bit of actual JavaScript.

Google can't make that guarantee for all sites on the web.
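The inline-listing idea above could be sketched roughly as follows. It assumes each page declares its dynamically loaded URLs in one array literal assigned to a variable named `crawlUrls`; that name, the array format, and the function `extract_listed_urls` are all hypothetical conventions chosen for this example.

```php
<?php
// Hypothetical convention: the page contains an inline script such as
//   var crawlUrls = ["/data/rooms.php?id=1", "/data/rooms.php?id=2"];
// This helper returns the quoted strings from that array, or [] if absent.
function extract_listed_urls(string $html): array
{
    // Find the array literal assigned to the (hypothetical) crawlUrls variable.
    if (!preg_match('/var\s+crawlUrls\s*=\s*\[(.*?)\]\s*;/s', $html, $block)) {
        return [];
    }
    // Pull each quoted string out of the array literal.
    preg_match_all('/["\']([^"\']+)["\']/', $block[1], $urls);
    return $urls[1];
}

// Made-up sample page for illustration.
$html = '<html><head><script type="text/javascript">'
      . 'var crawlUrls = ["/data/rooms.php?id=1", "/data/rooms.php?id=2"];'
      . '</script></head><body></body></html>';
print_r(extract_listed_urls($html));
```

Because the format is fixed in advance, the crawler needs only these two preg calls and never has to interpret the page's real JavaScript.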

EDIT: You should also look into SSJS [en.wikipedia.org]; I'm not sure whether it could be hooked up to PHP.

EDIT2: You don't happen to run your site on a PC, do you? Darn. [addons.mozilla.org]

EDIT3: Eureka! I was searching "Gecko" but needed "spidermonkey." And it looks like Pooh Bear might be involved too: [devzone.zend.com...]

[edited by: Jesdisciple at 6:29 pm (utc) on June 19, 2009]

itsdone




msg:3938100
 10:28 am on Jun 22, 2009 (gmt 0)

Thanks for your replies. I will check those tools.

Jesdisciple




msg:3938365
 6:39 pm on Jun 22, 2009 (gmt 0)

Zend says you can't manipulate the DOM with that extension; however, you could probably register PHP's native DOM implementation with the JS. That is, if this bug doesn't bite you:
Yes, I think you could do that. However, one caveat is that the extension sometimes "loses information" when converting PHP objects to JavaScript. So IMO, you'd need to be aware of those issues and build in appropriate safeguards when working with user-supplied JavaScript.
