I have been spending a lot of time learning C# and .NET, and I wanted to share some of what I have learned. I have always liked the idea of open source widgets and wanted to add one here. If anyone is interested in adding to the spider or fixing the code, post it here.
C# is Microsoft's new language for the .NET initiative. It is really close to Java, but does have some differences. It is a C-based language, so if you know JavaScript the leap to C# is not that hard.
So let’s look at some code. I wrote this using code-behind in Visual Studio, but changed it over to on-page scripting. I figured most people out there don't have Visual Studio, so the code-behind module wouldn't make much sense. If you see any unnecessary code left over from the conversion, let me know.
<%@ Page language="c#" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.Text" %>
<script runat="server">
private void getURLInfo_Click(object sender, System.EventArgs e)
{
WebClient objWebClient = new WebClient();
string strURL = URLinputBox.Text;
UTF8Encoding objUTF8 = new UTF8Encoding();
try
{
string scrappedText = objUTF8.GetString(objWebClient.DownloadData(strURL));
string[] splitHTMLTags = scrappedText.Split('<','>');
int titlePosition = 0;
string metaDescContent =" ";
for(int i=0;i<splitHTMLTags.GetUpperBound(0);i++)
{
string stringSwitch;
if(splitHTMLTags[i].Length > 22)
{
stringSwitch = splitHTMLTags[i].Substring(0,23);
}
else
{
stringSwitch = splitHTMLTags[i];
}
switch(stringSwitch.ToLower())
{
case "title":
titlePosition = i + 1;
break;
case "meta name=\"description\"":
string[] descriptionArray = splitHTMLTags[i].Split('"');
metaDescContent = descriptionArray[3];
break;
default:
break;
}
}
if(titlePosition == 0)
{
HTMLtitle.Text = @"No title available";
}
else
{
HTMLtitle.Text = splitHTMLTags[titlePosition];
}
if(metaDescContent == " ")
{
HTMLdesc.Value = @"No description available";
}
else
{
HTMLdesc.Value = metaDescContent;
}
}
catch(Exception err)
{
Response.Write(err.Message);
}
}
</script>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
<HEAD>
<title>Spider</title>
<meta content="JavaScript" name="vs_defaultClientScript">
</HEAD>
<body>
<form id="Form1" method="post" runat="server" action=WebForm1.aspx>
<asp:textbox id="URLinputBox" Runat="server"></asp:textbox><asp:button id="getURLInfo" Runat="server" Text="Get Info" OnClick="getURLInfo_Click"></asp:button><br>
<br>
<asp:textbox id="HTMLtitle" Runat="server" Width="420"></asp:textbox><br>
<textarea id="HTMLdesc" rows="4" wrap="soft" cols="50" Runat="server"></textarea><br>
</form>
</body>
</HTML>
Next we use a server side script tag to hold our code:
<script runat="server">
Notice the runat="server" attribute, which makes IIS process this as server-side code. Next we set up a click method to fire when our button is clicked.
private void getURLInfo_Click(object sender, System.EventArgs e)
{
We use the keyword private to make the method accessible only from our page. private is a member access modifier: a private member can be used only inside the class that declares it. For more info on access modifiers you can look at MSDN [msdn.microsoft.com].
We use the keyword void to declare that the method returns no data. For more information on void you can look at MSDN [msdn.microsoft.com].
Then we declare the method name as getURLInfo_Click. This method takes two arguments that are generated for us. The object that raised the event:
object sender
and the event information:
System.EventArgs e
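To see where those two arguments come from, here is a minimal sketch outside of ASP.NET. FakeButton and HandlerSketch are made-up names for illustration only; they just stand in for the real asp:button, which raises its Click event in the same (sender, e) shape.

```csharp
using System;

// Hypothetical stand-in for a control that raises a Click event.
public class FakeButton
{
    public event EventHandler Click;

    // Raises the event, passing itself as sender and empty event data.
    public void RaiseClick()
    {
        if (Click != null)
        {
            Click(this, EventArgs.Empty);
        }
    }
}

public static class HandlerSketch
{
    // Records the type name of whatever object raised the last event.
    public static string LastSenderType = "";

    // Same signature as getURLInfo_Click: the sender plus the event data.
    private static void OnClick(object sender, EventArgs e)
    {
        LastSenderType = sender.GetType().Name;
    }

    // Attaches the handler to the button's Click event.
    public static void Wire(FakeButton button)
    {
        button.Click += new EventHandler(OnClick);
    }
}
```

On the page itself the OnClick="getURLInfo_Click" attribute of the asp:button does the wiring for us.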
So now for the meat of the code:
WebClient objWebClient = new WebClient();
string strURL = URLinputBox.Text;
UTF8Encoding objUTF8 = new UTF8Encoding();
try
{
string scrappedText = objUTF8.GetString(objWebClient.DownloadData(strURL));
We instantiate a new object, objWebClient, of the type WebClient. Then I set up a variable, strURL, of the string data type. In C# you have to declare the data type of the variable. I am assigning it the text of our asp:textbox, URLinputBox.
<asp:textbox id="URLinputBox" Runat="server"></asp:textbox>
Then we create a new instance of the UTF8Encoding class [msdn.microsoft.com] named objUTF8. This will be used to create a string from the data returned by the WebClient object objWebClient.
UTF8Encoding objUTF8 = new UTF8Encoding();
Next I use a try catch block for error handling. I will explain that later.
Then we basically scrape the URL entered in the URLinputBox.
string scrappedText = objUTF8.GetString(objWebClient.DownloadData(strURL));
I create a string variable, scrappedText, to hold the scraped string. I assign it the output of the objWebClient.DownloadData method [msdn.microsoft.com], called with our strURL variable. This method takes one argument, a fully qualified URL including the http://.
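Since the URL has to be fully qualified, one option is to check the input before handing it to DownloadData. This is only a sketch and not part of the spider above; UrlCheckSketch and IsFullyQualifiedHttpUrl are names I made up for illustration, and allowing https as well as http is an assumption.

```csharp
using System;

public static class UrlCheckSketch
{
    // Returns true only for absolute URLs that use http or https.
    public static bool IsFullyQualifiedHttpUrl(string input)
    {
        Uri parsed;
        // TryCreate returns false instead of throwing when the text is not an absolute URI.
        if (!Uri.TryCreate(input, UriKind.Absolute, out parsed))
        {
            return false;
        }
        return parsed.Scheme == Uri.UriSchemeHttp || parsed.Scheme == Uri.UriSchemeHttps;
    }
}
```

In the click method this could guard the download, for example by showing a message when IsFullyQualifiedHttpUrl(strURL) is false.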
Then we run the GetString method [msdn.microsoft.com] of the objUTF8 object. This decodes the bytes returned by DownloadData as UTF-8 and gives us a string. Now we have scraped the URL and have a string we can work with. At this point we could do a Response.Write and recreate the page on our page, but that would be copyright infringement, and we really don't want to go there.
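The decoding step can be tried on its own without hitting the network. In this sketch a hand-made byte array stands in for what DownloadData would return; DecodeSketch and DecodeUtf8 are illustrative names, not part of the spider.

```csharp
using System;
using System.Text;

public static class DecodeSketch
{
    // Decodes raw bytes (such as those from WebClient.DownloadData) as UTF-8 text.
    public static string DecodeUtf8(byte[] rawBytes)
    {
        UTF8Encoding utf8 = new UTF8Encoding();
        return utf8.GetString(rawBytes);
    }
}
```

For example, the bytes 0x3C, 0x62, 0x3E decode to the string "<b>". Keep in mind this assumes the page really is UTF-8; a page served in another encoding may come out with garbled characters.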
string[] splitHTMLTags = scrappedText.Split('<','>');
Now we need to split the long string into something we can use. Since I am only after the title and the meta description, we need to make those values accessible. You can do this in a variety of ways. I chose this solution because it seemed the easiest way to get what I wanted.
We declare a string array, splitHTMLTags. We use an array (declared with the "[]" characters; more info at MSDN [msdn.microsoft.com]) because the Split method [msdn.microsoft.com] of the string object returns a string array. We pass Split the characters at which we want a new array entry to start. The method throws away the delimiters themselves. We use single quotes because the delimiters need to be characters. If we needed a string we would use double quotes.
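To make the result of the split concrete, here is a small sketch on a one-line piece of markup. SplitSketch and SplitTags are illustrative names only.

```csharp
using System;

public static class SplitSketch
{
    // Splits markup on '<' and '>' the same way the spider does.
    public static string[] SplitTags(string html)
    {
        return html.Split('<', '>');
    }
}
```

Calling SplitTags("<title>Hi</title>") gives five entries: "", "title", "Hi", "/title" and "". The empty strings come from the delimiters at the very start and end, and the title text sits in the entry right after "title", which is what the spider relies on.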
BTW C# wants the statement terminator ";" at the end of each statement. Control structures like try, for, or if that wrap their code in curly brackets don't take one after the closing bracket.
int titlePosition = 0;
string metaDescContent =" ";
We also declare some variables to help us find the values we want when searching through the string array. The meta description was a pain since I needed the value of an attribute and not enclosed text.
Now we need to run a for loop to look through the string array for our values:
//Declare the for loop and keep i below the highest index of the splitHTMLTags array.
//If we don't we will get an out of bounds exception.
//The GetUpperBound method [msdn.microsoft.com] will return the max index of our array.
for(int i=0;i<splitHTMLTags.GetUpperBound(0);i++)
{
//We declare a local variable to use in our switch statement.
string stringSwitch;
//Find out if the text length is long enough to be that nasty meta description tag
if(splitHTMLTags[i].Length > 22)
{
//If it is long enough assign the first 23 characters to the variable.
//We only want those characters since we can only trust that the string
//will be equivalent for those first characters.
stringSwitch = splitHTMLTags[i].Substring(0,23);
}
else
{
//assign the small string to the variable.
stringSwitch = splitHTMLTags[i];
}
//Start a switch statement to look for our values
//using the ToLower string method to make the string lowercase.
switch(stringSwitch.ToLower())
{
//If our text is the title tag
case "title":
//Raise the array number by 1 because we want the next array entry
//after the beginning of the title tag.
titlePosition = i + 1;
//Leave the switch statement.
break;
//Now we need to look for our description. Since our string has double quotes
//we need to escape them with the \ character before them.
case "meta name=\"description\"":
//Now we need to split the string using the double quotes as a delimiter.
string[] descriptionArray = splitHTMLTags[i].Split('"');
//This assumes the meta description uses quotes;
//if it doesn't, we don't have any code to deal with that.
//The content attribute value will be in the 4th array position.
//Since array positions are zero based we use 3 in the brackets.
metaDescContent = descriptionArray[3];
break;
//In our default we just have a break since we don't want to do anything with the other values.
default:
break;
}
}
//Now I check to see if there was a title or description.
//If there isn't then I want to display some text to alert the user.
//I use the @ string literal to tell the compiler that
//I want to use the string exactly like it is.
//We don't have to use it here. It stops "\" from being read as an
//escape character, like if you need a file path.
if(titlePosition == 0)
{
HTMLtitle.Text = @"No title available";
}
else
{
HTMLtitle.Text = splitHTMLTags[titlePosition];
}
if(metaDescContent == " ")
{
HTMLdesc.Value = @"No description available";
}
else
{
HTMLdesc.Value = metaDescContent;
}
}
The if statements at the end write the values back out to our page elements.
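As a possible variation, the same search can be pulled out into two small methods that use StartsWith instead of Substring and check the array length before reading the content value. This is a sketch of an alternative, not the code above; TagScanSketch, FindTitle and FindDescription are made-up names, and returning null when nothing is found is my own choice.

```csharp
using System;

public static class TagScanSketch
{
    // Returns the text that follows the first <title> tag, or null if there is none.
    public static string FindTitle(string html)
    {
        string[] parts = html.Split('<', '>');
        // Stop one short of the end so parts[i + 1] is always a valid index.
        for (int i = 0; i < parts.Length - 1; i++)
        {
            if (string.Equals(parts[i], "title", StringComparison.OrdinalIgnoreCase))
            {
                return parts[i + 1];
            }
        }
        return null;
    }

    // Returns the quoted content value of the description meta tag, or null.
    public static string FindDescription(string html)
    {
        const string prefix = "meta name=\"description\"";
        foreach (string part in html.Split('<', '>'))
        {
            if (part.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            {
                string[] pieces = part.Split('"');
                // pieces[3] only exists when the content attribute is quoted.
                if (pieces.Length > 3)
                {
                    return pieces[3];
                }
            }
        }
        return null;
    }
}
```

Like the original, this still expects the name attribute to come first and the values to be in double quotes; it just returns null in the other cases instead of raising an exception.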
catch(Exception err)
{
Response.Write(err.Message);
}
}
Then we use the catch from our try to stop nasty exceptions. We basically write the error message to the page if one is raised. The Try...Catch...Finally statement [msdn.microsoft.com] is a great way to catch exceptions. I did not use the finally section because I didn't need code to fire at the end every time it is run.
Then here’s our HTML:
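For anyone who wants to see all three parts together, here is a small sketch built around the one spot in the spider most likely to throw: reading array position 3 when the meta tag has no quotes. TryCatchSketch, ContentOrMessage and the Attempts counter are hypothetical names used only for this illustration.

```csharp
using System;

public static class TryCatchSketch
{
    // Counts how many times ContentOrMessage has run, whether or not it failed.
    public static int Attempts = 0;

    // Returns the quoted content value of a tag, or an error note if it has none.
    public static string ContentOrMessage(string tag)
    {
        string result;
        try
        {
            // Throws IndexOutOfRangeException when the tag has fewer than 4 pieces.
            result = tag.Split('"')[3];
        }
        catch (IndexOutOfRangeException err)
        {
            result = "error: " + err.GetType().Name;
        }
        finally
        {
            // Runs every time, after either the try or the catch.
            Attempts++;
        }
        return result;
    }
}
```

Passing a quoted tag returns the content value, while an unquoted one returns the error note; Attempts goes up by one in both cases, which is the kind of always-run code the finally section is for.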
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
<HEAD>
<title>Spider</title>
<meta content="JavaScript" name="vs_defaultClientScript">
</HEAD>
<body>
<form id="Form1" method="post" runat="server" action=WebForm1.aspx>
<asp:textbox id="URLinputBox" Runat="server"></asp:textbox><asp:button id="getURLInfo" Runat="server" Text="Get Info" OnClick="getURLInfo_Click"></asp:button><br>
<br>
<asp:textbox id="HTMLtitle" Runat="server" Width="420"></asp:textbox><br>
<textarea id="HTMLdesc" rows="4" wrap="soft" cols="50" Runat="server"></textarea><br>
</form>
</body>
</HTML>
So there is my simple spider. It needs some work, but the basics are there. Does anyone have a different way they would do it, a VB.Net solution, or some added functionality?