
Forum Moderators: coopster & jatar k


Crawling a dynamic (AJAX) website

Deep web crawling

     
10:11 am on Jun 17, 2009 (gmt 0)

New User

5+ Year Member

joined:June 17, 2009
posts: 9
votes: 0


Hello,

I am trying to write a crawler that will let me scrape dynamic web pages. As I am a PHP developer, I would like to see whether it can be done with PHP.

Can anybody give me some pointers on how to proceed? I would also like to know whether there is any other technology or technique with which it can be done.

Thanks in advance.

10:58 am on June 17, 2009 (gmt 0)

New User

5+ Year Member

joined:Apr 21, 2009
posts:8
votes: 0


Please announce when your script is finished.

12:35 pm on June 17, 2009 (gmt 0)

Administrator

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 31, 2003
posts:12533
votes: 0


Welcome to WebmasterWorld, itsdone.

The typical tool for this job is cURL [php.net].

12:50 pm on June 17, 2009 (gmt 0)

New User

5+ Year Member

joined:June 17, 2009
posts: 9
votes: 0


Hi coopster,

Thanks for your reply.

I have used cURL successfully to crawl static web pages; however, the content that JavaScript adds to the page could not be fetched with cURL.

Is there any way to deal with AJAX content? For example, on page load some JS function is called, and that function fetches data using AJAX and adds it to the innerHTML of a DIV.

2:48 pm on June 19, 2009 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 4, 2005
posts:621
votes: 0


I don't think even Google can do that, my friend. I had a client whose hotel booking site used AJAX functionality of that sort, and only 10 of its pages were indexed...

Good luck! Hope you have a lot of time on your hands, a lot of patience, and a PhD in something!

3:12 pm on June 19, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:July 21, 2008
posts:103
votes: 0


I once heard about a lightweight, command-line-based 'firefox' application which can be run via PHP, but I don't have any more details, I'm afraid.

To my knowledge, there's no way of rendering JavaScript in PHP itself; you'd need to go through some external software.

3:56 pm on June 19, 2009 (gmt 0)

Administrator

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 31, 2003
posts:12533
votes: 0


... or write your own software that parses the JavaScript, locates the AJAX fetch routines, and executes those requests itself.
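
A rough sketch of the "locate the AJAX fetch routines" step, for illustration only. It is written in Python, but the same regular expressions should port to PHP's preg_match_all. It assumes the page's scripts call XMLHttpRequest's open() or jQuery-style helpers with literal URL strings; the three patterns and the function name find_ajax_urls are hypothetical, and any URL that is built up at runtime will be missed.

```python
import re
from urllib.parse import urljoin

# Hypothetical patterns for a few common AJAX call styles; real sites vary widely.
AJAX_PATTERNS = [
    # xhr.open("GET", "/path", true)
    re.compile(r"""\.open\(\s*["'](?:GET|POST)["']\s*,\s*["']([^"']+)["']""", re.I),
    # $.get("/path"), $.post("/path"), $.getJSON("/path")
    re.compile(r"""\$\.(?:get|post|getJSON)\(\s*["']([^"']+)["']"""),
    # $.ajax({ url: "/path" })
    re.compile(r"""\burl\s*:\s*["']([^"']+)["']"""),
]


def find_ajax_urls(script_source, page_url):
    """Return absolute URLs the script appears to request, without duplicates."""
    found = []
    for pattern in AJAX_PATTERNS:
        for match in pattern.finditer(script_source):
            # Resolve relative URLs against the page that contained the script.
            absolute = urljoin(page_url, match.group(1))
            if absolute not in found:
                found.append(absolute)
    return found
```

Each URL returned this way can then be fetched directly with cURL, just like a static page, and the response (often an HTML fragment or JSON) parsed on its own.
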
5:33 pm on June 19, 2009 (gmt 0)

New User

5+ Year Member

joined:June 2, 2009
posts: 36
votes: 0


Alternatively, if you provide your dynamically-retrievable URLs in a uniform (i.e. preg-compatible) format in an inline script tag at the top of each page, and use a uniform URL syntax for accessing the listed URLs, your PHP wouldn't have to know a bit of actual JavaScript.

Google can't make that guarantee for all sites on the web.
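
To make the idea concrete, here is a minimal sketch, assuming each page lists its AJAX URLs as a JSON-style array assigned to a variable in an inline script tag. The variable name ajaxUrls and the function listed_urls are made up for this example; it is Python, but the pattern would work the same way with PHP's preg_match and json_decode.

```python
import json
import re

# Assumed convention (hypothetical): every page contains an inline script such as
#   var ajaxUrls = ["/data/rooms?id=1", "/data/rooms?id=2"];
# with double-quoted strings, so the array literal is also valid JSON.
URL_LIST_RE = re.compile(r"var\s+ajaxUrls\s*=\s*(\[.*?\])\s*;", re.S)


def listed_urls(html):
    """Return the URLs a page declares in its ajaxUrls array, or [] if none."""
    match = URL_LIST_RE.search(html)
    if match is None:
        return []
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The array was not valid JSON (e.g. single-quoted strings).
        return []
```

The crawler never has to interpret any JavaScript here; it only reads a list that the site owner agreed to publish in a fixed format.
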

EDIT: You should also look into SSJS [en.wikipedia.org]; I'm not sure whether it could be hooked up to PHP.

EDIT2: You don't happen to run your site on a PC, do you? Darn. [addons.mozilla.org]

EDIT3: Eureka! I was searching "Gecko" but needed "spidermonkey." And it looks like Pooh Bear might be involved too: [devzone.zend.com...]

[edited by: Jesdisciple at 6:29 pm (utc) on June 19, 2009]

10:28 am on June 22, 2009 (gmt 0)

New User

5+ Year Member

joined:June 17, 2009
posts:9
votes: 0


Thanks for your replies. I will check those tools.

6:39 pm on June 22, 2009 (gmt 0)

New User

5+ Year Member

joined:June 2, 2009
posts: 36
votes: 0


Zend says you can't manipulate the DOM with that extension; however, you could probably register PHP's native DOM implementation with the JS. That is, if this bug doesn't bite you:
Yes, I think you could do that. However, one caveat is that the extension sometimes "loses information" when converting PHP objects to JavaScript. So IMO, you'd need to be aware of those issues and build in appropriate safeguards when working with user-supplied JavaScript.
 
