
Spider Design

walking the tightrope


BerndH

3:47 am on Mar 30, 2006 (gmt 0)



I run a Canada and US directory <snip>, and respider all the URLs about every 2 or 3 months to weed out hijacked domains, dead sites, etc.

I haven't found any good sites yet on spider design, so I had to program mine from scratch, and I know it still needs major improvement. (I have "lworld spider" in the user-agent string.) Some ISPs have blocked my IP, and others have growled.

Has anyone come across any sites that are good working guides for spider design? I'd like to meet the goal of periodic URL checks without getting people up in arms.

I run 10 threads at a time, which - unless I'm wrong - is low enough not to hammer a server if a group of 10 URLs happens to reside on the same server. I'm using VB6.

I'd welcome constructive feedback. Always trying to improve, and practice good net etiquette.

[edited by: volatilegx at 9:00 pm (utc) on Mar. 31, 2006]
[edit reason] no URL dropping please [/edit]

volatilegx

9:01 pm on Mar 31, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld :)

I believe you picked the right forum to post in. There are a bunch of people here who have strong opinions on these matters.

Pfui

11:01 pm on Mar 31, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi, BerndH.

I don't know about spider design, sorry; just what I don't like about what spider designers do, or don't do :)

So if your spider heeds robots.txt, three cheers! Erm... Does it?

nathan_enns

12:06 pm on Apr 3, 2006 (gmt 0)

10+ Year Member



Hi BerndH,

I wrote a web crawler for my search engine company in PHP and then rewrote it in Python. I do not know of any web pages to refer you to, but I would be happy to help if I can.

A few thoughts:

1. What type of connection are you running the crawler from?

2. Make sure it follows robots.txt correctly. Writing the robots.txt parser in PHP was a pain, but Python has a built-in parser; there's a short sketch of it after this list. I am not sure about VB, but if you have to write the parser manually, test it until you are confident that it works correctly.

3. You might consider having the user-agent string contain a link back to a page that describes the crawler in a little more detail.

4. I suggest making it pause 2 seconds before downloading another page from the same domain (the sketch below includes this too).

5. The number of threads you run should be chosen to improve the overall performance of the crawler, not adjusted because the crawler is getting banned from sites. Figure out why sites don't like your crawler and then try to improve it... which may involve talking with angry people.
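
To make points 2 and 4 concrete, here is a minimal sketch in Python 3 using the built-in urllib.robotparser module. The user-agent string and its URL are made up, and 2.0 is just the delay value suggested above; treat it as a starting point, not a finished crawler:

import time
import urllib.robotparser
from urllib.parse import urlparse

# Hypothetical user-agent; point the URL at a real "about this crawler" page.
USER_AGENT = "lworldspider/1.0 (+http://www.example.com/spider.html)"

_parsers = {}    # cached robots.txt parser per host
_last_hit = {}   # timestamp of the last request per host

def allowed(url):
    """Point 2: check a URL against the host's robots.txt."""
    host = urlparse(url).netloc
    if host not in _parsers:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        try:
            rp.read()        # download and parse robots.txt
        except OSError:
            pass             # unreachable host: the parser then denies by default
        _parsers[host] = rp
    return _parsers[host].can_fetch(USER_AGENT, url)

def polite_wait(url, delay=2.0):
    """Point 4: sleep so at least `delay` seconds pass between hits to one host."""
    host = urlparse(url).netloc
    gap = time.time() - _last_hit.get(host, 0.0)
    if gap < delay:
        time.sleep(delay - gap)
    _last_hit[host] = time.time()

Caching one parser per host means robots.txt is fetched once per run instead of once per URL.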

Feel free to PM me or reply if you have any questions.

Best of luck!

Ocean10000

7:27 pm on Apr 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




I'd like to meet the goal of periodic URL checks without getting people up in arms.

I run 10 threads at a time, which - unless I'm wrong - is low enough not to hammer a server if a group of 10 URLs happens to reside on the same server. I'm using VB6.

From the sound of his post, it is not a spider but a link checker/validator. I have done this type of work in the past for different sites, all under a million URLs give or take a few, so take what you will from that.

1. When checking URLs, I often did a first pass through the list to collect the unique hostnames, then downloaded the robots.txt file for each of those hostnames and stored it in a DB with the date downloaded. I only downloaded them once a week. (A rough sketch follows this list.)

2. Work through the robots.txt files and mark which URLs you should exclude from the list of URLs to check.

3. The URLs that are marked as forbidden I usually flag so that they are hidden on the website, or made less prominent than the rest that allow your link checker.

4. Check whether the remaining URLs still exist, and do what you will with the results.

5. Spot-check the forbidden URLs for spam, etc. in your browser, and mark them as you see fit in your directory.
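
Here is a rough sketch of steps 1, 2, and 4 in Python 3, with SQLite standing in for the DB. The table layout, the one-week refresh window, and the user-agent are all illustrative, not anything from BerndH's actual setup:

import sqlite3
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

WEEK = 7 * 24 * 3600
UA = "exampledirectorybot/1.0"   # hypothetical user-agent

db = sqlite3.connect("linkcheck.db")
db.execute("CREATE TABLE IF NOT EXISTS robots "
           "(host TEXT PRIMARY KEY, body TEXT, fetched REAL)")

def robots_for(host):
    """Step 1: one robots.txt download per unique hostname, at most weekly."""
    row = db.execute("SELECT body, fetched FROM robots WHERE host = ?",
                     (host,)).fetchone()
    if row is None or time.time() - row[1] > WEEK:
        try:
            body = urllib.request.urlopen(f"http://{host}/robots.txt",
                                          timeout=10).read().decode("utf-8", "replace")
        except OSError:
            body = ""                    # unreachable or missing: treat as allow-all
        db.execute("REPLACE INTO robots VALUES (?, ?, ?)",
                   (host, body, time.time()))
        db.commit()
        row = (body, time.time())
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(row[0].splitlines())
    return rp

def check(urls):
    """Steps 2 and 4: mark forbidden URLs, then test the rest for life signs."""
    results = {}
    for url in urls:
        if not robots_for(urlparse(url).netloc).can_fetch(UA, url):
            results[url] = "forbidden"   # steps 3/5: hide and spot-check these
            continue
        try:
            req = urllib.request.Request(url, method="HEAD",
                                         headers={"User-Agent": UA})
            results[url] = urllib.request.urlopen(req, timeout=10).status
        except OSError:
            results[url] = "dead"
    return results

A HEAD request is usually enough to tell a live URL from a dead one without downloading the whole page.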