Microsoft IIS Web Server and ASP.NET Forum

    
Web scraping with .NET
garann




msg:3136393
 3:53 am on Oct 27, 2006 (gmt 0)

Hello,

Does anyone know of a simple way to copy an entire website from one server to another using ASP.NET? I need something that can start at the root of the site and grab everything under it.

I have FTP information for the site I want to copy from, but either I'm misunderstanding the documentation or the built-in classes only let me copy one file at a time.

I could also use a web scraping utility, but I'm hesitant to ask my client to pay for it. If there's a free one anyone knows of, that would be a hugely appreciated alternative.

(And before anyone asks, I'm not doing anything shady. My client has an old site on a hosting service whose WYSIWYG she likes very much. I want to set her up so she can continue using that as a "development server", and import the content to her new site.)

Thanks!
g.
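
On the "one file at a time" point: the built-in FTP classes do work per file, but a directory listing can drive a loop over them. Below is a minimal sketch of that idea, not a tested solution. The class and method names (`FtpFolderCopy`, `ParseListing`, `ListDirectory`, `DownloadAll`) are made up for illustration, and it assumes the server returns a plain list of names for a directory listing. It copies a single folder only; subfolders would need the detailed listing and recursion.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

public static class FtpFolderCopy
{
    // Turns the raw text of a name-only listing into a list of entry names.
    public static List<string> ParseListing(string listing)
    {
        List<string> names = new List<string>();
        char[] lineBreaks = new char[] { '\r', '\n' };
        foreach (string line in listing.Split(lineBreaks, StringSplitOptions.RemoveEmptyEntries))
        {
            string name = line.Trim();
            if (name.Length == 0 || name == "." || name == "..") continue;
            names.Add(name);
        }
        return names;
    }

    // Asks the server for the names in one remote folder (ftpUrl is e.g. "ftp://ftp.example.com/site/").
    public static List<string> ListDirectory(string ftpUrl, NetworkCredential credentials)
    {
        FtpWebRequest request = (FtpWebRequest)WebRequest.Create(ftpUrl);
        request.Method = WebRequestMethods.Ftp.ListDirectory;
        request.Credentials = credentials;
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return ParseListing(reader.ReadToEnd());
        }
    }

    // Downloads every listed entry into localDir, still one file per call under the hood.
    // Entries that are folders will throw here; this sketch does not handle them.
    public static void DownloadAll(string ftpUrl, NetworkCredential credentials, string localDir)
    {
        Directory.CreateDirectory(localDir);
        using (WebClient client = new WebClient())
        {
            client.Credentials = credentials;
            foreach (string name in ListDirectory(ftpUrl, credentials))
            {
                client.DownloadFile(ftpUrl.TrimEnd('/') + "/" + name, Path.Combine(localDir, name));
            }
        }
    }
}
```

Something like `FtpFolderCopy.DownloadAll("ftp://ftp.example.com/site/", new NetworkCredential("user", "pass"), @"C:\temp\site")` would be the call, with the host, login, and paths replaced by real values.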

 

garann




msg:3136512
 6:41 am on Oct 27, 2006 (gmt 0)

Man, I knew I was going to end up doing this the hard way.. ;)

Here's the code for anyone who's interested. It could be improved through the use of regular expressions, but it was kind of quick and dirty.


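using System;
using System.Collections;
using System.IO;
using System.Net;
using System.Text;

// Members of the page's code-behind class; btnSubmit and results are
// controls declared in the .aspx markup.
ArrayList scraped;

protected void Page_Load(object sender, EventArgs e)
{
    scraped = new ArrayList();
}

protected void btnSubmit_ServerClick(object sender, EventArgs e)
{
    scrapePage("index.html");
    results.Visible = true;
}

private void scrapePage(String filename)
{
    // keep track of what we've done
    scraped.Add(filename);

    String page = "";
    Byte[] bytes;
    try
    {
        // WebClient is disposable, so release it as soon as the download is done
        using (WebClient objWebClient = new WebClient())
        {
            bytes = objWebClient.DownloadData("http://example.com/" + filename);
        }
        page = Encoding.UTF8.GetString(bytes);
    }
    catch (Exception)
    {
        // download failed (missing file, timeout, etc.) - skip this one
        return;
    }

    if (filename.IndexOf(".htm") > 0)
    {
        // make everything lowercase so we can find what we need
        // (this lowercases the links too, so it assumes the source server
        // doesn't care about case in paths)
        String pageCopy = page.ToLower();

        String[] keys = { "href=", "href = ", "src=", "src = " };
        String[] links = pageCopy.Split(keys, StringSplitOptions.RemoveEmptyEntries);
        // get rid of html up to the first link
        links[0] = "";

        for (int i = 1; i < links.Length; i++)
        {
            // the link ends at the first space or '>' after the attribute name
            int end = links[i].IndexOfAny(new char[] { ' ', '>' });
            if (end > 0)
            {
                // keep only the link itself, minus any quotes and leading slash
                links[i] = links[i].Substring(0, end).Replace("\"", "").Replace("'", "").TrimStart('/');
            }
            else
            {
                // no terminator in the section means this section is junk - erase it
                links[i] = "";
            }
        }

        // call this function for each usable link we've produced;
        // links are treated as relative to the site root
        foreach (String link in links)
        {
            // skip blanks, in-page anchors, and anything with a scheme
            // (http:, mailto:, javascript:), which isn't part of this site's file tree
            if (link != "" && !link.StartsWith("#") && link.IndexOf(':') < 0)
            {
                if (!scraped.Contains(link)) scrapePage(link);
            }
        }
    }

    try
    {
        // create subfolders as needed before writing the file
        String target = Path.Combine(MapPath("/"), filename);
        Directory.CreateDirectory(Path.GetDirectoryName(target));
        File.WriteAllBytes(target, bytes);
    }
    catch (Exception)
    {
        // couldn't write (bad characters in the name, permissions) - skip it
        return;
    }
}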
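
For anyone who wants to try the regular expression route mentioned above, here is a rough sketch of what the link extraction could look like. It is an illustration rather than a drop-in replacement: `LinkExtractor` and `ExtractLinks` are hypothetical names, and the pattern only looks at `href` and `src` attributes, the same two the string-splitting version handles.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Matches href/src attributes whose value is double-quoted, single-quoted, or unquoted.
    // The "url" group name is reused across the three alternatives, which .NET allows.
    private static readonly Regex LinkPattern = new Regex(
        @"\b(?:href|src)\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))",
        RegexOptions.IgnoreCase);

    // Returns the distinct site-relative links found in the HTML, in document order.
    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();
        foreach (Match match in LinkPattern.Matches(html))
        {
            string url = match.Groups["url"].Value.Trim();

            // Skip empty values, in-page anchors, and anything with a scheme
            // (http:, mailto:, javascript:), since those are not local files.
            if (url.Length == 0 || url.StartsWith("#") || url.Contains(":")) continue;

            if (!links.Contains(url)) links.Add(url);
        }
        return links;
    }
}
```

Because the match is case-insensitive, the page doesn't need to be lowercased first, so the original casing of each link is kept. In the code above, `scrapePage` could loop over `LinkExtractor.ExtractLinks(page)` instead of splitting the string.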


Easy_Coder




msg:3137390
 7:11 pm on Oct 27, 2006 (gmt 0)

If you have the FTP info, why not open up an FTP client and copy the files straight across to the other server in a second FTP window?

oxbaker




msg:3140239
 9:16 pm on Oct 30, 2006 (gmt 0)

How about xcopy?

garann




msg:3155757
 1:01 am on Nov 14, 2006 (gmt 0)

Easy_Coder, this is something that needs to be done automatically by the client.

oxbaker, that's a good idea.. Is it free?

Jimmy Turnip




msg:3157156
 12:40 pm on Nov 15, 2006 (gmt 0)

Have you looked at creating a batch file to FTP it? You could then get .NET to execute it, or schedule it.

For example, create these two files and put them in the same directory:

New text file, save as data.ftp:
================================

OPEN ftp.example.com
USER myusername mypassword
LCD C:\Inetpub\wwwroot\
CD mywebsitefolder
BINARY
PUT myfile.html
QUIT

New text file, save as transfer.bat:
====================================

ftp -n -s:data.ftp

Now run the batch file to transfer whatever files are named in the PUT commands in data.ftp.
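
As a rough illustration of the "get .NET to execute it" part, something along these lines could launch transfer.bat from code. It is only a sketch: `FtpBatchRunner`, `BuildStartInfo`, and `Run` are hypothetical names, and it assumes the account the site runs under is allowed to start cmd.exe and read the folder holding the two files.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class FtpBatchRunner
{
    // Builds the launch settings for the batch file without starting anything.
    public static ProcessStartInfo BuildStartInfo(string batchPath)
    {
        string fullPath = Path.GetFullPath(batchPath);
        ProcessStartInfo info = new ProcessStartInfo("cmd.exe", "/c \"" + fullPath + "\"");

        // Run from the batch file's own folder so "-s:data.ftp" finds the script next to it.
        info.WorkingDirectory = Path.GetDirectoryName(fullPath);
        info.UseShellExecute = false;        // required in order to redirect output
        info.RedirectStandardOutput = true;  // capture what ftp prints
        info.CreateNoWindow = true;
        return info;
    }

    // Runs the batch file, waits for it to finish, and hands back the exit code and output.
    public static int Run(string batchPath, out string output)
    {
        using (Process process = Process.Start(BuildStartInfo(batchPath)))
        {
            output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

The exit code may not reflect a failed transfer, so it is probably worth checking the captured output as well.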
