Forum Moderators: open
I haven't found any good sites yet on spider design, so I had to program mine from scratch, and I know it still needs major improvement. (I have "lworld spider" in the user-agent string.) Some ISPs have blocked my IP, and others have growled.
Has anyone come across any sites that are good working guides for spider design? I'd like to be able to meet the goal of periodic URL checks without getting people up in arms.
I run 10 threads at a time, which - unless I'm wrong - is low enough not to hammer a server if a group of 10 URLs happens to reside on the same server. I'm using VB6.
I'd welcome constructive feedback. Always trying to improve, and practice good net etiquette.
[edited by: volatilegx at 9:00 pm (utc) on Mar. 31, 2006]
[edit reason] no URL dropping please [/edit]
I wrote a web crawler for my search engine company in PHP and then rewrote it in Python. I do not know of any web pages to refer you to, but I would be happy to help if I can.
A few thoughts:
1. What type of connection are you running the crawler from?
2. Make sure it follows robots.txt correctly. Writing the robots.txt parser in PHP was a pain, but Python has a built-in parser. I am not sure about VB, but if you have to write the parser manually, test it until you are confident that it works correctly.
3. You might consider having the user-agent string contain a link back to a page describing the crawler a little more.
4. I suggest making it pause 2 seconds before downloading another page from the same domain.
5. The number of threads you run should be chosen to improve the overall performance of the crawler. You should not change the number of threads because the crawler is getting banned from sites. Figure out why they don't like your crawler and then try to improve it... which may involve talking with angry people.
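To make points 2 and 4 concrete, here is a minimal Python sketch using the built-in robots.txt parser I mentioned. The "lworldspider" token and the demo rules are just placeholders for whatever your crawler actually uses, and in a real crawler you would fill the cache by fetching each host's /robots.txt (e.g. with RobotFileParser.read()) rather than passing the text in:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "lworldspider"   # hypothetical user-agent token

parsers = {}      # one cached robots.txt parser per host
last_fetch = {}   # time of the last request made to each host

def load_rules(host, robots_txt):
    """Cache parsed robots.txt rules for a host (text passed in for the demo)."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    parsers[host] = rp

def allowed(url):
    """Point 2: honour robots.txt before fetching a URL."""
    host = urlparse(url).netloc
    rp = parsers.get(host)
    return rp is None or rp.can_fetch(USER_AGENT, url)

def polite_wait(url, delay=2.0):
    """Point 4: pause before downloading another page from the same domain."""
    host = urlparse(url).netloc
    gap = time.time() - last_fetch.get(host, 0.0)
    if gap < delay:
        time.sleep(delay - gap)
    last_fetch[host] = time.time()

# Demo rules for one host: everything under /private/ is off limits.
load_rules("example.com", "User-agent: *\nDisallow: /private/")
```

Each worker thread would call allowed() and then polite_wait() before every fetch; that way the per-domain delay holds no matter how the URLs are spread across your 10 threads.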
Feel free to PM me or reply if you have any questions.
Best of luck!
I'd like to be able to meet the goal of periodic url checks, without getting people up in arms. I run 10 threads at a time, which - unless I'm wrong - is low enough not to hammer a server if a group of 10 urls happen to reside on the same server. I'm using vb6.
From the sound of his post, it is not a spider but a link checker/validator. I have done this type of work in the past for different sites. I am talking about sites with under a million URLs, give or take a few, so take what you will from that.
1. What I often did when checking URLs was a first pass through the list to get the unique hostnames, then download the robots.txt file for each of those hostnames and store it in a db with the date downloaded; I only downloaded them once a week.
2. Process the robots files and mark which URLs you should exclude from the list of URLs to check.
3. The URLs that are marked as forbidden I usually flag so that they are hidden on the website, or less prominent than the rest that allow your link checker.
4. Proceed to check whether the remaining URLs still exist and do what you will with the results.
5. Spot-check the forbidden URLs for spam, etc. in your browser, and mark them as you see fit in your directory.
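A rough Python sketch of steps 1, 2 and 4 above. The dict stands in for the db, the one-week window is the schedule from step 1, and robots_txt_source is a placeholder for however you actually download each host's /robots.txt; the "my-link-checker" user-agent is made up:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

ONE_WEEK = 7 * 24 * 3600
UA = "my-link-checker"   # hypothetical user-agent token

robots_cache = {}        # host -> {"fetched": timestamp, "parser": parser}

def get_parser(host, robots_txt_source, now=None):
    """Step 1: re-download a host's robots.txt at most once a week."""
    now = now if now is not None else time.time()
    entry = robots_cache.get(host)
    if entry is None or now - entry["fetched"] > ONE_WEEK:
        rp = urllib.robotparser.RobotFileParser()
        rp.parse(robots_txt_source(host).splitlines())
        entry = {"fetched": now, "parser": rp}
        robots_cache[host] = entry
    return entry["parser"]

def partition_urls(urls, robots_txt_source):
    """Steps 2 and 4: split the list into URLs to check vs. forbidden ones."""
    to_check, forbidden = [], []
    for url in urls:
        host = urlparse(url).netloc
        rp = get_parser(host, robots_txt_source)
        (to_check if rp.can_fetch(UA, url) else forbidden).append(url)
    return to_check, forbidden
```

The to_check list then goes to your existence checker (step 4), and the forbidden list is what you would hide or spot-check by hand (steps 3 and 5).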