Scraping Webpages

Two ways of doing it - however I need help

         

chris_f

1:38 pm on Feb 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Guys,

I have a syndicate of clients. They have around 30 smaller sites and one major site which is the daddy of them all and contains the general information. Each small site has a different person in charge. All the sites are static. They approached me and asked for a site search so that people can search across all the sites. I can do the search functionality no problem and I can code it to strip HTML from the pages and so forth ... my problem is getting their pages into a database.

To my knowledge there are two ways of doing this:
1. Use a component. The catch is that they want the site search to be CHEAP. Does anyone know where I can get a cheap or FREE component that can fetch pages? It needs to run on WinXP Pro with IIS5.
2. Component-less code. Is this possible? If so, does anyone know where I can see some code for this?

btw, this is ASP and not ASP.net

Chris.

Woz

1:59 pm on Feb 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can do it on IIS5 with the XML Object thus

Dim objXMLHTTP, strScrape, strURL
strURL = "http://www.TheSiteYouWantToScrape.com"
Set objXMLHTTP = Server.CreateObject("Microsoft.XMLHTTP")
objXMLHTTP.Open "GET", strURL, False
objXMLHTTP.Send
strScrape = objXMLHTTP.responseText
Set objXMLHTTP = Nothing

Then parse strScrape to extract what you want.
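
For the "parse strScrape" step, here is a rough sketch of one way to strip the markup before indexing, assuming the built-in VBScript RegExp object is good enough for static pages. StripTags is a made-up helper name, not something from this thread, and the pattern is deliberately crude: it does not decode entities and it leaves the contents of script and style blocks in place.

```vbscript
<%
' Sketch only: StripTags is a hypothetical helper, not part of the thread's code.
Function StripTags(strHTML)
    Dim re
    Set re = New RegExp
    re.Global = True        ' replace every match, not just the first
    re.IgnoreCase = True
    ' Anything that looks like a tag: "<", one or more non-">" characters, ">".
    re.Pattern = "<[^>]+>"
    ' Use a space as the replacement so adjacent words do not run together.
    StripTags = re.Replace(strHTML, " ")
    Set re = Nothing
End Function
%>
```

Something like strText = StripTags(strScrape) would then give you text that is closer to what you want in the search database.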

Onya
Woz

olias

2:02 pm on Feb 6, 2003 (gmt 0)

10+ Year Member



You can fetch webpages using MSXML, which is available if you are using ASP on IIS5. The code would be something like this:

Set xml = Server.CreateObject("Microsoft.XMLHTTP")
xml.Open "GET", "http://www.example.com/index.htm", False
xml.Send ""
xmloutput = xml.responseText
Set xml = Nothing

<added>Damn, I was way too slow!</added>

chris_f

2:04 pm on Feb 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Woz and Olias,

I'll give it a go. Do I need to download the XML object, or do you think I will already have it?

Chris

[edited by: chris_f at 2:18 pm (utc) on Feb. 6, 2003]

olias

2:05 pm on Feb 6, 2003 (gmt 0)

10+ Year Member



Should already have it.

chris_f

2:10 pm on Feb 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Cool. I'll give it a go tonight.

Chris

Woz

2:14 pm on Feb 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Do I need to download the XML object?

As olias says, if you are running IIS 5 then you should have it. I am using this on Win2kPro IIS5 and Winserver2k IIS5 with no problems.

>Damn, I was way too slow!
Wasn't aware there was a race on ... ;)

Onya
Woz

BlobFisk

2:44 pm on Feb 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've been looking at this and this thread came as a great and well-timed help!

The problem is that I keep getting a timeout error:


error '80072ee2'
/alex/devzone/scrapeTest.asp, line 12

I can get it to work on sites on the same server... but it times out for anything externally. Could that be to do with a firewall setting or something?

graywolf

2:47 am on Feb 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It works locally for me but on the outside I get this error


TIME: 2/6/2003 9:43:00 PM
asp code:
Number:-2146697211
Source:
Category:
Path: /scrape.asp
Path: D:\inetpub\store\scrape.asp
File:/scrape.asp
line:7
Column:-1
Description:
number:-2146697211

chris_f

9:25 am on Feb 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This code seems to work locally, but not for anything outside the local network. I get

Error Type: (0x800C0005)
/cf/site_search.asp, line 11

Browser Type:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)

Page:
GET /cf/site_search.asp

Time:
07 February 2003, 09:23:59

More information:
Microsoft Support

Xoc

9:50 am on Feb 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



On an IIS Server, you want to use the ServerXMLHTTP [msdn.microsoft.com] object, not the XMLHTTP object. Besides offering better security, it has a timeout that you can set.

Note that if you have a proxy server, ServerXMLHTTP does not use the proxy settings from IE. Instead you must run proxycfg [msdn.microsoft.com] to tell ServerXMLHTTP about the proxy server.
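
As a minimal sketch of the timeout setting mentioned above: ServerXMLHTTP has a setTimeouts method that takes four values in milliseconds (resolve, connect, send, receive) and is called before Open. The FetchPage name and the particular timeout values are assumptions for illustration only, so adjust them to suit your sites.

```vbscript
<%
' Sketch only: FetchPage is a hypothetical helper and the timeout values are arbitrary.
Function FetchPage(strURL)
    Dim objHTTP
    FetchPage = ""          ' empty string signals "could not fetch"
    Set objHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")
    ' Milliseconds: resolve, connect, send, receive. Set before calling Open.
    objHTTP.setTimeouts 5000, 5000, 10000, 30000
    On Error Resume Next    ' a timeout raises a run-time error; trap it here
    objHTTP.Open "GET", strURL, False
    objHTTP.Send
    If Err.Number = 0 Then
        ' Only accept the body when the server answered 200 OK.
        If objHTTP.Status = 200 Then FetchPage = objHTTP.responseText
    End If
    On Error GoTo 0
    Set objHTTP = Nothing
End Function
%>
```

Returning an empty string on failure lets a loop over the 30-odd sites skip a slow or unreachable one instead of stopping the whole page with an error.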

chris_f

10:24 am on Feb 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Xoc,

I'll try that.

Chris

chris_f

10:34 am on Feb 8, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've got the code working, as long as you're not behind a proxy server.

The code:

<html>
<head>
<title>Site Search</title>
</head>
<body>
<%
Dim objXMLHTTP, strScrape, strURL
strURL = "http://www.chrisfelstead.co.uk/default.asp"
Set objXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")
objXMLHTTP.Open "GET", strURL, False
objXMLHTTP.Send
strScrape = objXMLHTTP.responseText
Set objXMLHTTP = Nothing
%>
Here is the page:
<br><br>
<% Response.Write(strScrape) %>
</body>
</html>

Chris

tomasz

5:16 pm on Feb 8, 2003 (gmt 0)

10+ Year Member



chris_f,

I noticed from your error page that you are running the .NET Framework.

For .NET you can use the example below:

Dim wc As New System.Net.WebClient()
Dim html As String = System.Text.Encoding.ASCII.GetString(wc.DownloadData("http://microsoft.com"))

chris_f

10:07 am on Feb 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



tomasz,

I am not using the .NET Framework. This is classic ASP.

Chris