
Forum Moderators: ocean10000


Web scraping with .NET

     
3:53 am on Oct 27, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 4, 2002
posts:508
votes: 0


Hello,

Does anyone know of a simple way to copy an entire website from one server to another using ASP.NET? I need something that can start at the root of the site and grab everything under it.

I have FTP information for the site I want to copy from, but either I'm misunderstanding the documentation or the built-in classes only let me copy one file at a time.

I could also use a web scraping utility, but I'm hesitant to ask my client to pay for it. If there's a free one anyone knows of, that would be a hugely appreciated alternative.

(And before anyone asks, I'm not doing anything shady. My client has an old site on a hosting service whose WYSIWYG she likes very much. I want to set her up so she can continue using that as a "development server", and import the content to her new site.)

Thanks!
g.

6:41 am on Oct 27, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 4, 2002
posts:508
votes: 0


Man, I knew I was going to end up doing this the hard way.. ;)

Here's the code for anyone who's interested. It could be improved through the use of regular expressions, but it was kind of quick and dirty.


using System;
using System.Collections;
using System.IO;
using System.Net;
using System.Text;

// The members below belong in the page's code-behind class.

// Files we have already fetched, so we never scrape the same one twice
ArrayList scraped;

protected void Page_Load(object sender, EventArgs e)
{
    scraped = new ArrayList();
}

protected void btnSubmit_ServerClick(object sender, EventArgs e)
{
    scrapePage("index.html");
    results.Visible = true;
}

private void scrapePage(String filename)
{
    // keep track of what we've done
    scraped.Add(filename);

    WebClient objWebClient = new WebClient();
    UTF8Encoding objUTF8 = new UTF8Encoding();
    String page = "";
    Byte[] bytes;
    try
    {
        bytes = objWebClient.DownloadData("http://example.com/" + filename);
        page = objUTF8.GetString(bytes);
    }
    catch (Exception)
    {
        // couldn't download this file - skip it
        return;
    }

    // only HTML pages are searched for further links
    if (filename.IndexOf(".htm") > 0)
    {
        // make everything lowercase so we can find what we need
        // (note: this also lowercases the links themselves, which
        // breaks on servers with case-sensitive paths)
        String pageCopy = page.ToLower();

        String[] keys = { "href=", "src=", "src = " };
        String[] links = pageCopy.Split(keys, StringSplitOptions.RemoveEmptyEntries);
        // get rid of html up to the first link
        links[0] = "";

        for (int i = 1; i < links.Length; i++)
        {
            // the link ends at the first space or at the end of the tag
            int end = links[i].IndexOfAny(new char[] { ' ', '>' });
            if (end > 0)
            {
                // keep only the link itself and strip the quotes
                links[i] = links[i].Substring(0, end).Replace("\"", "").Replace("'", "");
            }
            else
            {
                // no terminator in the section, so this section is junk - erase it
                links[i] = "";
            }
        }

        // call this function for each non-blank link we've produced
        foreach (String link in links)
        {
            if (link != "" && !scraped.Contains(link)) scrapePage(link);
        }
    }

    try
    {
        String target = MapPath("/") + filename;
        // make sure the subfolder exists before writing into it
        Directory.CreateDirectory(Path.GetDirectoryName(target));
        File.WriteAllBytes(target, bytes);
    }
    catch
    {
        // couldn't write this file - skip it
        return;
    }
}
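
For anyone who wants to try the regular-expression route mentioned above, here is a minimal sketch rather than a drop-in replacement. LinkExtractor and ExtractLinks are made-up names for illustration. It assumes attribute values are quoted, and it deliberately skips absolute URLs, anchors and query strings so that only site-relative paths come back. Unlike the lowercase-and-split approach, it leaves the case of each link alone.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original post).
public static class LinkExtractor
{
    // Matches href="..." or src='...' in any letter case and captures
    // the value up to a quote, '#', '?', whitespace or '>'.
    private static readonly Regex LinkPattern = new Regex(
        @"\b(?:href|src)\s*=\s*[""']([^""'#?\s>]+)",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            string link = m.Groups[1].Value;

            // Assumption: anything with a scheme (http:, mailto:, javascript:)
            // or a protocol-relative prefix points off-site, so skip it.
            if (link.Contains(":") || link.StartsWith("//")) continue;

            // Report each link only once, with its original casing.
            if (!links.Contains(link)) links.Add(link);
        }
        return links;
    }
}
```

In scrapePage, the split-and-trim section could then be swapped for a loop over LinkExtractor.ExtractLinks(page), keeping the scraped.Contains check as it is.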

7:11 pm on Oct 27, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 2, 2003
posts:1184
votes: 0


If you have the FTP info, why not just open an FTP client and copy the files into another FTP window for the other server?

9:16 pm on Oct 30, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:May 11, 2006
posts:177
votes: 0


How about xcopy?

1:01 am on Nov 14, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 4, 2002
posts:508
votes: 0


Easy Coder, this is something that needs to be done automatically by the client.

oxbaker, that's a good idea.. Is it free?

12:40 pm on Nov 15, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Dec 18, 2003
posts:146
votes: 0


Have you looked at creating a batch file to ftp it? You could then get .NET to execute it, or schedule it.

For example create these two files and put them in the same directory:

New text file, save as data.ftp:
================================

OPEN ftp.example.com
USER myusername mypassword
LCD C:\Inetpub\wwwroot\
CD mywebsitefolder
BINARY
PUT myfile.html
QUIT

New text file, save as transfer.bat
===================================

ftp -n -s:data.ftp

Now run the batch file to transfer the files listed with PUT in data.ftp.
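
To get .NET to execute the batch file, something along these lines should work. This is a rough sketch, not tested against a real server: FtpBatchRunner, BuildStartInfo and Run are made-up names, and the folder passed in is assumed to be the one holding both transfer.bat and data.ftp. The working directory matters because ftp looks for data.ftp relative to it.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (names are illustrative only).
public static class FtpBatchRunner
{
    // Describes how to launch the batch file through cmd.exe without a window.
    public static ProcessStartInfo BuildStartInfo(string batchPath, string workingDir)
    {
        ProcessStartInfo info = new ProcessStartInfo("cmd.exe", "/c \"" + batchPath + "\"");
        info.WorkingDirectory = workingDir;   // so ftp -s:data.ftp finds its script
        info.UseShellExecute = false;         // required to redirect output
        info.CreateNoWindow = true;
        info.RedirectStandardOutput = true;
        return info;
    }

    // Runs the batch file, waits for it to finish and returns the captured output.
    public static string Run(string batchPath, string workingDir)
    {
        using (Process p = Process.Start(BuildStartInfo(batchPath, workingDir)))
        {
            string output = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            return output;
        }
    }
}
```

Calling FtpBatchRunner.Run("transfer.bat", folder) returns whatever the ftp session printed. It is probably safer to check that text for errors than to rely on the exit code, since the exit code may not reflect a failed transfer.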

 
