Forum Moderators: ocean10000

Web scraping with .NET

garann

3:53 am on Oct 27, 2006 (gmt 0)

10+ Year Member



Hello,

Does anyone know of a simple way to copy an entire website from one server to another using ASP.NET? I need something that can start at the root of the site and grab everything under it.

I have FTP information for the site I want to copy from, but either I'm misunderstanding the documentation or the built-in classes only let me copy one file at a time.
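You're right that the built-in FTP support works one file at a time, but you can loop over a directory listing yourself. Here's a rough sketch using .NET 2.0's FtpWebRequest; the host, credentials, and local path are placeholders, and it assumes a flat directory (subfolders would need a recursive listing):

```csharp
// Sketch only: download every file in one FTP directory.
// Host, credentials, and local path below are placeholders.
using System;
using System.IO;
using System.Net;

class FtpMirror
{
    static void Main()
    {
        string ftpRoot = "ftp://ftp.example.com/";                        // placeholder host
        NetworkCredential creds = new NetworkCredential("user", "pass");  // placeholder login

        // get the file listing for the root directory
        FtpWebRequest listReq = (FtpWebRequest)WebRequest.Create(ftpRoot);
        listReq.Method = WebRequestMethods.Ftp.ListDirectory;
        listReq.Credentials = creds;

        string listing;
        using (StreamReader reader = new StreamReader(listReq.GetResponse().GetResponseStream()))
        {
            listing = reader.ReadToEnd();
        }

        // download each file named in the listing
        foreach (string name in listing.Split(new char[] { '\r', '\n' },
                                              StringSplitOptions.RemoveEmptyEntries))
        {
            FtpWebRequest fileReq = (FtpWebRequest)WebRequest.Create(ftpRoot + name);
            fileReq.Method = WebRequestMethods.Ftp.DownloadFile;
            fileReq.Credentials = creds;

            using (Stream src = fileReq.GetResponse().GetResponseStream())
            using (FileStream dst = File.Create(Path.Combine(@"C:\mirror", name)))
            {
                byte[] buf = new byte[4096];
                int n;
                while ((n = src.Read(buf, 0, buf.Length)) > 0)
                    dst.Write(buf, 0, n);
            }
        }
    }
}
```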

I could also use a web scraping utility, but I'm hesitant to ask my client to pay for it. If there's a free one anyone knows of, that would be a hugely appreciated alternative.

(And before anyone asks, I'm not doing anything shady. My client has an old site on a hosting service whose WYSIWYG she likes very much. I want to set her up so she can continue using that as a "development server", and import the content to her new site.)

Thanks!
g.

garann

6:41 am on Oct 27, 2006 (gmt 0)

10+ Year Member



Man, I knew I was going to end up doing this the hard way.. ;)

Here's the code for anyone who's interested. It could be improved through the use of regular expressions, but it was kind of quick and dirty.


using System;
using System.Collections;
using System.IO;
using System.Net;
using System.Text;

ArrayList scraped;

protected void Page_Load(object sender, EventArgs e)
{
    scraped = new ArrayList();
}

protected void btnSubmit_ServerClick(object sender, EventArgs e)
{
    scrapePage("index.html");
    results.Visible = true;
}

private void scrapePage(String filename)
{
    // keep track of what we've done
    scraped.Add(filename);

    WebClient objWebClient = new WebClient();
    UTF8Encoding objUTF8 = new UTF8Encoding();
    String page = "";
    Byte[] bytes;
    try
    {
        bytes = objWebClient.DownloadData("http://example.com/" + filename);
        page = objUTF8.GetString(bytes);
    }
    catch (Exception)
    {
        return;
    }

    if (filename.IndexOf(".htm") > 0)
    {
        // make everything lowercase so we can find what we need
        String pageCopy = page.ToLower();

        String[] keys = { "href=", "src=", "src = " };
        String[] links = pageCopy.Split(keys, StringSplitOptions.RemoveEmptyEntries);
        // get rid of html up to the first link
        links[0] = "";

        for (int i = 1; i < links.Length; i++)
        {
            int firstSpace = links[i].IndexOf(' ');
            if (firstSpace > 0)
            {
                // shorten these to include only what's before the first space
                links[i] = links[i].Substring(0, firstSpace).Replace("\"", "");
            }
            else
            {
                // if there are no spaces in the section, this section is junk - erase it
                links[i] = "";
            }
        }

        // call this function for each non-blank link we haven't already scraped
        foreach (String link in links)
        {
            if (link != "" && !scraped.Contains(link))
            {
                scrapePage(link);
            }
        }
    }

    try
    {
        File.WriteAllBytes(MapPath("/") + filename, bytes);
    }
    catch
    {
        return;
    }
}
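Since I mentioned regular expressions: here's roughly what the link extraction above would look like with Regex instead of Split. Just a sketch, assuming the same quoted-attribute, relative-link pages as the code above:

```csharp
// Sketch: pull href/src values out of a page with a regex instead of Split.
// Assumes quoted attribute values, as in the code above.
using System.Collections;
using System.Text.RegularExpressions;

private ArrayList extractLinks(String page)
{
    ArrayList links = new ArrayList();
    // match href="..." or src='...' and capture the URL between the quotes
    MatchCollection matches = Regex.Matches(page,
        "(?:href|src)\\s*=\\s*[\"']([^\"'>]+)[\"']",
        RegexOptions.IgnoreCase);
    foreach (Match m in matches)
    {
        String link = m.Groups[1].Value;
        if (!links.Contains(link)) links.Add(link);
    }
    return links;
}
```

This also avoids the first-space trick, since the regex stops at the closing quote.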

Easy_Coder

7:11 pm on Oct 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you have the FTP info why not just open up an FTP client and just copy the stuff into another FTP window for the other server?

oxbaker

9:16 pm on Oct 30, 2006 (gmt 0)

5+ Year Member



how about xcopy?

garann

1:01 am on Nov 14, 2006 (gmt 0)

10+ Year Member



Easy Coder, this is something that needs to be done automatically by the client.

oxbaker, that's a good idea.. Is it free?

Jimmy Turnip

12:40 pm on Nov 15, 2006 (gmt 0)

10+ Year Member



Have you looked at creating a batch file to ftp it? You could then get .NET to execute it, or schedule it.

For example create these two files and put them in the same directory:

New text file, save as data.ftp:
================================

OPEN ftp.example.com
USER myusername mypassword
LCD C:\Inetpub\wwwroot\
CD mywebsitefolder
BINARY
PUT myfile.html
QUIT

New text file, save as transfer.bat
===================================

ftp -n -s:data.ftp

Now run the batch file to FTP whatever files are listed with PUT commands in data.ftp.
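And to kick it off from .NET, something like this should do it (the path to the batch file is a placeholder):

```csharp
// Sketch: launch the batch file from .NET; the path is a placeholder.
using System.Diagnostics;

Process proc = Process.Start(@"C:\scripts\transfer.bat");
proc.WaitForExit(); // block until the transfer finishes
```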
