Grabbing other websites' data

michaelbs

4:19 pm on Aug 5, 2003 (gmt 0)

Hi,

Don't know if anyone can point me in the right direction.
I am looking to set up a comparison shopping site, but I am stuck and wondered if anyone knows how to grab data from other websites, e.g. large widget companies.

Bit of a long shot but I do know this is possible.

Cheers,

Mike

Iguana

4:38 pm on Aug 5, 2003 (gmt 0)

You can use Perl:

use LWP::Simple;
# get() returns undef if the request fails
my $doc = get('http://www.website.com/widget.html');
die "Could not fetch page\n" unless defined $doc;

If you are using ASP you can use the MS XML parser to get a page.

Once you get the page you need to parse it to get the product/price data out - that's a bit difficult

too much information

4:47 pm on Aug 5, 2003 (gmt 0)

Hi Michaelbs,

This worked well for me recently. (in ASP)

<%
Response.Buffer = True

Dim xml, text

Set xml = Server.CreateObject("Microsoft.XMLHTTP")
'Or if this doesn't work then try:
'Set xml = Server.CreateObject("MSXML2.ServerXMLHTTP")

'False = synchronous: Send does not return until the response has arrived
xml.Open "GET", "http://www.domain_to_read.com/default.asp", False
xml.Send

'Only echo the page if the request succeeded
If xml.Status = 200 Then
    text = xml.ResponseText
    Response.Write text
End If

Set xml = Nothing
%>

Of course you will need to decide how to take the HTML apart.

michaelbs

10:28 am on Aug 6, 2003 (gmt 0)

Cheers for the info fellas.

Is there no way to just grab an element of a page rather than the whole page?

Cheers,
Mike

WibbleWobble

10:37 am on Aug 6, 2003 (gmt 0)

Resource Index [php.resourceindex.com] has a couple of tutorials to get you around the basics of content retrieval with PHP.

Iguana

10:43 am on Aug 6, 2003 (gmt 0)

No, you have to fetch the whole page and then write your own parser to grab the text you want. To grab an element directly you would need to let the page render in a browser (which happens asynchronously) and then read the browser's object model - you don't want to go there.
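
For what it's worth, the "fetch everything, then parse out the bit you want" approach can be sketched in Python with just the standard library. This is only an illustration under assumptions: the sample page, the id "price" and the names ElementTextGrabber and grab_element_text are all made up, and a real merchant page may not give the element you want a handy id at all.

```python
from html.parser import HTMLParser


class ElementTextGrabber(HTMLParser):
    """Collect the text inside the first element with a given id."""

    # Void elements never get a closing tag, so they must not affect nesting depth.
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # nesting depth inside the target element
        self.done = False     # True once the target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in self.VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in self.VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def grab_element_text(html, target_id):
    """Return the whitespace-normalised text of the element, or None if absent."""
    grabber = ElementTextGrabber(target_id)
    grabber.feed(html)
    grabber.close()
    if not grabber.done:
        return None
    return " ".join("".join(grabber.chunks).split())


# Hypothetical page; a real one would come from the fetch step.
SAMPLE_PAGE = """
<html><body>
  <h1>Blue Widget</h1>
  <div id="price">Price: <b>$19.99</b><br> each</div>
  <div id="stock">In stock</div>
</body></html>
"""

print(grab_element_text(SAMPLE_PAGE, "price"))
```

The whole page is still downloaded and fed through the parser; only the text inside the matching element is kept. Badly nested HTML will confuse the depth counter, so treat it as a starting point rather than a robust scraper.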

For Amazon there is always the possibility of using their web services to get pricing info - that requires server-side scripting on your part. But if you develop the text grab routines for other merchant sites, you might as well use them for Amazon.

asmith_2048

2:27 pm on Aug 6, 2003 (gmt 0)

michaelbs,

You're talking about a very serious web crawler, one that performs a recursive descent through the target site, looking for variables and text named "price" (and any permutations thereof/abbreviations/codes/languages other than English/etc.) and identifying the items they match. By the time you've coded against all the possible display combinations you'll have built up quite a body of source and spent many a sleepless night refining and testing your baby.
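
To give a feel for the shape of such a crawler, here is a rough Python sketch of the recursive descent with a very naive price matcher. Everything in it is an assumption for illustration: the names crawl_for_prices, LinkCollector and PRICE_PATTERN, the in-memory FAKE_SITE standing in for real HTTP requests, and the regex, which only knows a currency symbol followed by a number and none of the permutations mentioned above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Matches a currency symbol followed by an amount, e.g. $19.99 or $1,250.00.
PRICE_PATTERN = re.compile(r"[$£€]\s?\d+(?:,\d{3})*(?:\.\d{2})?")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl_for_prices(start_url, fetch, max_pages=50):
    """Walk same-host links from start_url and map each URL to the prices found on it.

    `fetch` is any callable that takes a URL and returns HTML text, or None on failure.
    """
    host = urlparse(start_url).netloc
    seen = set()
    found = {}

    def visit(url):
        # Skip pages already visited and stop once the page budget is spent.
        if url in seen or len(seen) >= max_pages:
            return
        seen.add(url)
        html = fetch(url)
        if html is None:
            return
        prices = PRICE_PATTERN.findall(html)
        if prices:
            found[url] = prices
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)
            # Stay on the starting host so the crawl cannot wander off-site.
            if urlparse(target).netloc == host:
                visit(target)

    visit(start_url)
    return found


# Hypothetical three-page site held in memory so the sketch runs offline.
FAKE_SITE = {
    "http://shop.example/": '<a href="/widgets.html">Widgets</a> <a href="http://other.example/">Partner</a>',
    "http://shop.example/widgets.html": '<a href="blue.html">Blue</a> <a href="/">Home</a> Red widget: $5.00',
    "http://shop.example/blue.html": "Blue widget: $1,250.00 <a href='widgets.html'>Back</a>",
}

print(crawl_for_prices("http://shop.example/", FAKE_SITE.get))
```

Note that this only finds strings that look like prices; working out which product each price belongs to is the part that takes the sleepless nights.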

Please note that what you want is considerably more sophisticated behaviour than your typical mail crawler or search engine bot (the former only has to look for email addresses, the latter typically just grabs pages for later, offline indexing).

A couple of (M$) tools you could use that haven't been mentioned yet are the web browser control that comes with VB (I'm not sure what the .NET equivalent is), which is very easy to code against, and the more difficult, but also more satisfying, WinInet C API. Both of these have the added advantage of appearing in server logs as some version of Internet Explorer, making them harder to identify and automatically block.

... which brings me to your last problem. Poorly coded/deliberately impolite/downright hostile crawlers are notorious bandwidth hogs and as such are the bane of web administrators world-wide (us, and I suspect most of the people posting on this site, included). A software agent requests pages far faster than any human user and can seriously degrade a site's performance for the people (read: "potential customers") trying to use it.

A slow site is an unprofitable site, which directly threatens the livelihood of the people behind it. You could find yourself deeply unpopular in a very short space of time ... if you're operating a site of your own, for instance, and your crawler operates from the same IP addresses, you might find yourself swamped by "counter crawls", or simply blocked.
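
If you do go ahead, one way to stay on the right side of site owners is to obey robots.txt and space your requests out. A minimal Python sketch, assuming a made-up bot name, a made-up robots.txt and a hypothetical PoliteFetcher wrapper around whatever fetch function you already have (urllib.robotparser itself is part of the standard library):

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "PriceCompareBot"  # hypothetical name; pick one that identifies you

# Hypothetical robots.txt content; normally you would download the site's own file.
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /checkout/",
]


class PoliteFetcher:
    """Wrap a fetch callable with a robots.txt check and a minimum delay between requests."""

    def __init__(self, fetch, robots_lines, min_delay=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)
        # Honour the site's Crawl-delay if it asks for more than our own minimum.
        site_delay = self.robots.crawl_delay(USER_AGENT)
        self.delay = max(min_delay, site_delay or 0)
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def get(self, url):
        """Return the fetched page, or None if robots.txt forbids the URL."""
        if not self.robots.can_fetch(USER_AGENT, url):
            return None
        if self.last_request is not None:
            wait = self.delay - (self.clock() - self.last_request)
            if wait > 0:
                self.sleep(wait)
        self.last_request = self.clock()
        return self.fetch(url)


# Stand-in fetch function so the sketch runs without touching the network.
fetcher = PoliteFetcher(lambda url: "<html>stub for " + url + "</html>", SAMPLE_ROBOTS)
print(fetcher.get("http://shop.example/widgets.html"))
print(fetcher.get("http://shop.example/checkout/pay"))
```

The clock and sleep arguments are only there so the delay logic can be exercised without actually waiting. None of this settles whether you are allowed to reuse the data; it only keeps the crawler from hammering the site.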

I don't know whether to say, "good luck!" or, "you have been warned!"

buckworks

2:41 pm on Aug 6, 2003 (gmt 0)

Do you have copyright permission to use content from the other sites?