This worked well for me recently (in ASP):
<%
Response.Buffer = true
Dim xml, text
Set xml = Server.CreateObject("Microsoft.XMLHTTP")
' Or, if this doesn't work, try the server-side component instead:
' Set xml = Server.CreateObject("MSXML2.ServerXMLHTTP")
xml.Open "GET", "http://www.domain_to_read.com/default.asp", false
xml.Send
text = xml.ResponseText
Response.write(text)
Set xml = Nothing
%>
Of course you will need to decide how to take the HTML apart.
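As a rough idea of what "taking the HTML apart" could look like, here is a minimal sketch in Python (not ASP), using only the standard library. It assumes the merchant marks prices up in an element whose class or id contains "price" and that prices start with a currency symbol; both are assumptions for illustration, and real sites will differ. The names `PriceFinder` and `extract_prices` are made up for this example.

```python
from html.parser import HTMLParser
import re

# Assumed price format: currency symbol, digits, optional thousands commas and cents.
PRICE_RE = re.compile(r"[$£€]\s?\d[\d,]*(?:\.\d{2})?")

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PriceFinder(HTMLParser):
    """Collects price-like strings found inside elements whose
    class or id attribute contains the word 'price'."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while we are inside a "price" element
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1          # nested tag inside a price element
            return
        for name, value in attrs:
            if name in ("class", "id") and value and "price" in value.lower():
                self._depth = 1       # entering a price element
                return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.prices.extend(PRICE_RE.findall(data))


def extract_prices(html):
    """Return all price-like strings found in 'price' elements of html."""
    finder = PriceFinder()
    finder.feed(html)
    finder.close()
    return finder.prices
```

You would feed it the text you grabbed, e.g. `extract_prices(text)`. Expect to write a variant of the matching rules for each merchant, since no two sites mark up their prices the same way.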
For Amazon there is always the possibility of using their web services to get pricing info - that requires server-side scripting on your part. But if you develop the text grab routines for other merchant sites, you might as well use them for Amazon.
You're talking about a very serious web crawler: one that performs a recursive descent through the target site, looking for variables and text named "price" (and any permutations thereof: abbreviations, codes, languages other than English, etc.) and identifying the items they match. By the time you've coded against all the possible display combinations you'll have built up quite a body of source and spent many a sleepless night refining and testing your baby.
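To give a feel for the recursive descent part, here is a bare-bones sketch in Python. It only follows same-host links and only flags pages containing the literal word "price", so it covers none of the permutations mentioned above. The `fetch` argument is a hypothetical function you supply (URL in, HTML text or `None` out), and `crawl`, `LinkCollector` and `max_pages` are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import re

PRICE_WORD_RE = re.compile(r"price", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Gathers the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Descend through same-host links from start_url and return the
    URLs of pages whose HTML mentions 'price'.

    fetch is a caller-supplied function: URL -> HTML text, or None on failure.
    max_pages caps the number of pages visited (and so the recursion depth).
    """
    host = urlparse(start_url).netloc
    seen = set()
    hits = []

    def visit(url):
        if url in seen or len(seen) >= max_pages:
            return
        seen.add(url)
        html = fetch(url)
        if html is None:
            return
        if PRICE_WORD_RE.search(html):
            hits.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment.
            target = urljoin(url, href).split("#", 1)[0]
            if urlparse(target).netloc == host:
                visit(target)

    visit(start_url)
    return hits
```

Passing `fetch` in as a parameter keeps the traversal separate from the HTTP code, so you can test it against a dictionary of canned pages before pointing it at a live site.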
Please note that what you want is considerably more sophisticated behaviour than your typical mail crawler or search engine bot (the former only has to look for email addresses, the latter typically just grabs pages for later, offline indexing).
A couple of (M$) tools that haven't been mentioned yet are the web browser control that comes with VB (I'm not sure what the .NET equivalent is), which is very easy to code against, and the more difficult, but also more satisfying, WinInet C API. Both of these have the added advantage of appearing in server logs as some version of Internet Explorer, making them harder to identify and automatically block.
... which brings me to your last problem. Poorly coded/deliberately impolite/downright hostile crawlers are notorious bandwidth hogs and as such are the bane of web administrators world-wide (us, and I suspect most of the people posting on this site, included). A software agent requests pages far faster than any human user and can seriously degrade a site's performance for the people (read: "potential customers") trying to use it.
A slow site is an unprofitable site, which directly threatens the livelihood of the people behind it. You could find yourself deeply unpopular in a very short space of time ... if you're operating a site of your own, for instance, and your crawler operates from the same IP addresses, you might find yourself swamped by "counter crawls", or simply blocked.
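One way to avoid hammering a site is to enforce a minimum gap between requests. The following is only a sketch in Python: the `Throttle` class and its parameters are invented for this example, and the right interval is a judgment call that depends on the site you are visiting.

```python
import time


class Throttle:
    """Enforces a minimum delay between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval   # seconds between requests
        self._clock = clock                # injectable for testing
        self._sleep = sleep                # injectable for testing
        self._last = None                  # time of the previous request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one `Throttle(2.0)` (say) per target site and call `wait()` immediately before every page request, so the crawler never asks for pages faster than one every two seconds.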
I don't know whether to say, "good luck!" or, "you have been warned!"