C# spider using System.Net to page scrape in ASP.Net

aspx version

         

korkus2000

12:40 pm on Oct 7, 2003 (gmt 0)




C# Spider
ASP.Net Version

I have been spending a lot of time learning C# and .net. I wanted to share some of the things I have learned. I always liked the idea of open source widgets, and wanted to add a widget here. If anyone is interested in adding to the spider or fixing code, post it here.

C# is Microsoft's new language for the .Net initiative. It is really close to Java, but does have some differences. It is a C-based language, so if you know JavaScript then the leap to C# is not that hard.

So let’s look at some code. I wrote this using code-behind in Visual Studio, but changed it over to on-page scripting. I figured most people out there don't have Visual Studio, so a code-behind module wouldn't make much sense. If you see any unnecessary code left over from the conversion, let me know.

<%@ Page language="c#" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.Text" %>
<script runat="server">
private void getURLInfo_Click(object sender, System.EventArgs e)
{
WebClient objWebClient = new WebClient();
string strURL = URLinputBox.Text;
UTF8Encoding objUTF8 = new UTF8Encoding();
try
{
string scrappedText = objUTF8.GetString(objWebClient.DownloadData(strURL));

string[] splitHTMLTags = scrappedText.Split('<','>');
int titlePosition = 0;
string metaDescContent =" ";

for(int i=0;i<splitHTMLTags.GetUpperBound(0);i++)
{
string stringSwitch;
if(splitHTMLTags[i].Length > 22)
{
stringSwitch = splitHTMLTags[i].Substring(0,23);
}
else
{
stringSwitch = splitHTMLTags[i];
}
switch(stringSwitch.ToLower())
{
case "title":
titlePosition = i + 1;
break;
case "meta name=\"description\"":
string[] descriptionArray = splitHTMLTags[i].Split('"');
metaDescContent = descriptionArray[3];
break;
default:
break;
}
}
if(titlePosition == 0)
{
HTMLtitle.Text = @"No title available";
}
else
{
HTMLtitle.Text = splitHTMLTags[titlePosition];
}
if(metaDescContent == " ")
{
HTMLdesc.Value = @"No description available";
}
else
{
HTMLdesc.Value = metaDescContent;
}
}
catch(Exception err)
{
Response.Write(err.Message);
}
}
</script>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
<HEAD>
<title>Spider</title>
<meta content="JavaScript" name="vs_defaultClientScript">
</HEAD>
<body>
<form id="Form1" method="post" runat="server" action=WebForm1.aspx>
<asp:textbox id="URLinputBox" Runat="server"></asp:textbox><asp:button id="getURLInfo" Runat="server" Text="Get Info" OnClick="getURLInfo_Click"></asp:button><br>
<br>
<asp:textbox id="HTMLtitle" Runat="server" Width="420"></asp:textbox><br>
<textarea id="HTMLdesc" rows="4" wrap="soft" cols="50" Runat="server"></textarea><br>
</form>
</body>
</HTML>



So let's look at the @ directives [msdn.microsoft.com].

<%@ Page language="c#" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.Text" %>

First we use the Page directive to declare the page language. Then we use the Import directive to import the System.Net namespace [msdn.microsoft.com] from the .Net Framework. System.Net has the WebClient class that we will be using to grab the HTML from the page (you can see more about the WebClient class at MSDN [msdn.microsoft.com]). The second Import brings in System.Text, which holds the UTF8Encoding class used further down.

Next we use a server side script tag to hold our code:

<script runat="server">

Notice the runat="server" attribute, which makes the server process this as server-side code instead of sending it to the browser. Next we set up a click method to fire when our button is clicked.

private void getURLInfo_Click(object sender, System.EventArgs e)
{

We use the keyword private so the method is only accessible from within our page's own class. Private is a member access modifier, and it is the most restrictive one. For more info on access modifiers you can look at MSDN [msdn.microsoft.com].

We use the keyword void to declare that the method returns no value. For more information on void you can look at MSDN [msdn.microsoft.com].

Then we declare the method name as getURLInfo_Click. This method takes two arguments that are generated for us. The object that raised the event:

object sender

and the event information:

System.EventArgs e

So now for the meat of the code:

WebClient objWebClient = new WebClient();
string strURL = URLinputBox.Text;

UTF8Encoding objUTF8 = new UTF8Encoding();
try
{
string scrappedText = objUTF8.GetString(objWebClient.DownloadData(strURL));

We instantiate a new object objWebClient of the type WebClient. Then I set up a variable strURL which is of the string data type. In C# you have to declare the data type of the variable. I am assigning it the value of our asp:TextBox URLinputBox.

<asp:textbox id="URLinputBox" Runat="server"></asp:textbox>

Then we create an instance of the UTF8Encoding class [msdn.microsoft.com] named objUTF8. This will be used to turn the bytes returned by the WebClient object objWebClient into a string.

UTF8Encoding objUTF8 = new UTF8Encoding();

Next I use a try catch block for error handling. I will explain that later.

Then we basically scrape the inputted url from the URLinputBox.

string scrappedText = objUTF8.GetString(objWebClient.DownloadData(strURL));

I create a string variable, scrappedText, to hold the scraped text. It gets its value from the objWebClient.DownloadData method [msdn.microsoft.com], which takes our strURL variable and returns the page as an array of bytes. This method takes one argument, a fully qualified URL including the http://.

Then we run the GetString method [msdn.microsoft.com] of the objUTF8 object. This decodes the bytes from DownloadData as UTF-8 and gives us a string. Now we have scraped the URL and have a string we can work with. At this point we could do a Response.Write and recreate the page on our page, but that could be copyright infringement, and we really don't want to go there.
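To make the two steps easier to see, here is a minimal sketch that does only the decoding half. The DecodeSketch class and the BytesToText helper are made-up names for illustration, not part of the spider, and the bytes would be built by hand in a test so nothing is actually downloaded.

```csharp
using System;
using System.Text;

public static class DecodeSketch
{
    // Hypothetical helper (not in the original post): turns raw bytes,
    // such as those returned by WebClient.DownloadData, into a string
    // by decoding them as UTF-8.
    public static string BytesToText(byte[] rawBytes)
    {
        UTF8Encoding objUTF8 = new UTF8Encoding();
        return objUTF8.GetString(rawBytes);
    }
}
```

Feeding it the bytes of a plain ASCII string, for example the result of new UTF8Encoding().GetBytes("<title>Spider</title>"), gives the same text back.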

string[] splitHTMLTags = scrappedText.Split('<','>');

Now we need to split the long string into something we can use. Since I am only after the title and the meta description, we need to make those values accessible. You can do this in a variety of ways. I chose this solution because it seemed the easiest way to get what I wanted.

We declare a string array splitHTMLTags. We use an array (declared with the "[]" characters; more info at MSDN [msdn.microsoft.com]) because the Split method [msdn.microsoft.com] of the string object returns a string array. We pass the Split method the characters at which we want to start a new array element. The method throws away the delimiters themselves. We use single quotes because the delimiters need to be characters. If we needed a string we would use double quotes.
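If it is hard to picture what Split hands back, this small sketch may help. SplitSketch and SplitTags are hypothetical names used only here; the call itself is the same Split('<','>') as in the spider.

```csharp
using System;

public static class SplitSketch
{
    // Hypothetical helper: splits markup on '<' and '>' the same way
    // the spider does. The delimiters are not kept in the result.
    public static string[] SplitTags(string html)
    {
        return html.Split('<', '>');
    }
}
```

For the input "<title>Spider</title>" this returns five entries: an empty string, title, Spider, /title, and another empty string. The empty entries come from the text before the first '<' and after the last '>'. That is also why the title text sits one position after the entry that reads title.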

BTW, C# wants the statement terminator ";" at the end of each statement. Lines that only open or close a block with curly brackets, or that begin a control structure like try, for, or if, don't get one.

int titlePosition = 0;
string metaDescContent =" ";

We also declare some variables to help us find the values we want when searching through the string array. The meta description was a pain since I needed the value of an attribute and not enclosed text.
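To show why the attribute value is awkward to get at, here is a rough sketch of the quote-splitting trick the loop below relies on. MetaSketch and ContentOf are hypothetical names, and the sketch assumes the tag is written with double quotes and with name before content.

```csharp
using System;

public static class MetaSketch
{
    // Hypothetical helper: takes the text of a meta description tag
    // (what sits between '<' and '>') and returns the content value,
    // or null when there are not enough quoted parts to read it.
    public static string ContentOf(string metaTagText)
    {
        string[] parts = metaTagText.Split('"');
        // For: meta name="description" content="A simple spider"
        // parts[0] = meta name=      parts[1] = description
        // parts[2] =  content=       parts[3] = A simple spider
        return parts.Length > 3 ? parts[3] : null;
    }
}
```

Unlike the spider, this sketch checks the array length first, so a tag without quotes gives null instead of raising an exception.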

Now we need to run a for loop to look through the string array for our values:

//Declare the for loop. The GetUpperBound method [msdn.microsoft.com] returns the max index
//of the splitHTMLTags array, and the loop stops just short of it.
//That keeps i + 1 (used for the title below) inside the array,
//so we don't get an out of bounds exception.

for(int i=0;i<splitHTMLTags.GetUpperBound(0);i++)
{

//We declare a local variable to use in our switch statement.

string stringSwitch;

//Find out if the text length is long enough to be that nasty meta description tag

if(splitHTMLTags[i].Length > 22)
{

//If it is long enough assign the first 23 characters to the variable.
//We only want those characters since we can only trust that the string
//will be equivalent for those first characters.

stringSwitch = splitHTMLTags[i].Substring(0,23);
}
else
{

//assign the small string to the variable.

stringSwitch = splitHTMLTags[i];
}

//Start a switch statement to look for our values
//using the ToLower string method to make the string lowercase.

switch(stringSwitch.ToLower())
{

//If our text is the title tag

case "title":

//Raise the array number by 1 because we want the next array entry
//after the beginning of the title tag.

titlePosition = i + 1;

//Leave the switch statement.

break;

//Now we need to look for our description. Since our string has double quotes
//we need to escape them with the \ character before them.

case "meta name=\"description\"":

//Now we need to split the string using the double quotes as a delimiter.

string[] descriptionArray = splitHTMLTags[i].Split('"');

//If the meta description puts its values in double quotes,
//the content attribute value will be in the 4th array position.
//Since array positions are zero based we use 3 in the brackets.
//We don't have any code to deal with tags that skip the quotes.

metaDescContent = descriptionArray[3];
break;

//In our default we just have a break since we don't want to do anything with the other values.

default:
break;
}
}

//Now I check to see if there was a title or description.
//If there isn't then I want to display some text to alert the user.
//I use the @ string literal to tell the compiler that
//I want to use the string exactly like it is.
//We don't have to use it here. It is handy when a string contains "\",
//like a file path, because you don't have to escape each backslash.

if(titlePosition == 0)
{
HTMLtitle.Text = @"No title available";
}
else
{
HTMLtitle.Text = splitHTMLTags[titlePosition];
}
if(metaDescContent == " ")
{
HTMLdesc.Value = @"No description available";
}
else
{
HTMLdesc.Value = metaDescContent;
}
}

Now we can use if statements to write the values back out to our page elements.

catch(Exception err)
{
Response.Write(err.Message);
}
}

Then we use the catch from our try to stop nasty exceptions. We basically write the error message to the page if one is raised. The Try...Catch...Finally statement [msdn.microsoft.com] is a great way to catch exceptions. I did not use the finally section because I didn't need code to fire at the end every time it is run.
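For anyone who has not seen all three parts together, here is a minimal sketch with the finally section included. TryCatchSketch and ReadSlot are hypothetical names for illustration only. The array lookup stands in for the descriptionArray[3] line, which is one place the spider can raise an exception.

```csharp
using System;

public static class TryCatchSketch
{
    // Hypothetical helper: reads one array slot and reports what happened.
    public static string ReadSlot(string[] parts, int index)
    {
        string result;
        try
        {
            // Throws IndexOutOfRangeException when index is past the end.
            result = parts[index];
        }
        catch (IndexOutOfRangeException err)
        {
            // Same idea as the spider: surface the message instead of crashing.
            result = "Error: " + err.Message;
        }
        finally
        {
            // Runs whether or not an exception was thrown.
            Console.WriteLine("Finished reading slot " + index);
        }
        return result;
    }
}
```

Asking for a slot that exists returns its value, while asking for one past the end returns a string starting with Error: and the finally line is printed in both cases.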

Then here’s our html:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
<HEAD>
<title>Spider</title>
<meta content="JavaScript" name="vs_defaultClientScript">
</HEAD>
<body>
<form id="Form1" method="post" runat="server" action=WebForm1.aspx>
<asp:textbox id="URLinputBox" Runat="server"></asp:textbox><asp:button id="getURLInfo" Runat="server" Text="Get Info" OnClick="getURLInfo_Click"></asp:button><br>
<br>
<asp:textbox id="HTMLtitle" Runat="server" Width="420"></asp:textbox><br>
<textarea id="HTMLdesc" rows="4" wrap="soft" cols="50" Runat="server"></textarea><br>
</form>
</body>
</HTML>

So there is my simple spider. Needs some work, but the basics are there. Anyone have a different way they would do it, a VB.Net solution, or some added functionality?

ziggystardust

6:24 pm on Oct 9, 2003 (gmt 0)




Nice! :)

You could add some multithreading functionality to it... if you want to be able to scrape several sites simultaneously that is.

Good luck with your future projects.
//ZS

Xoc

6:59 pm on Oct 10, 2003 (gmt 0)




Note that the BestBBS software that runs the bulletin board removes indenting from code examples. The example above originally was all nicely indented.

Nice job Korkus2000!