I have written a few crawlers and content extractors using the usual Ruby libraries. Here are a few requirements: the crawler must not abuse any websites, and the usual polite behavior, like checking robots exclusion rules, will of course be followed. Any ideas on how to write such a crawler?
cURL is often the tool of choice for this job. Wrap your fetching methods in the logic you have defined here and you are off and running. I recommend crawling your own domains first, or a test domain if you don't have your own, so you can confirm your crawler is following your logic correctly.
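A minimal sketch of that idea in Ruby (since the original poster mentioned Ruby libraries), using only the standard library. The user-agent string, crawl delay, and example.com test domain are assumptions, and the robots.txt handling is a deliberately simplified Disallow check rather than a full parser:

```ruby
require 'net/http'
require 'uri'

USER_AGENT  = 'ExampleCrawler/0.1 (+http://example.com/bot)' # assumed identifier
CRAWL_DELAY = 2 # assumed seconds between requests to the same host

# Collect disallowed path prefixes for our crawler (simplified: only honors
# "User-agent: *" groups and plain Disallow lines, no wildcards).
def disallowed_paths(host)
  robots = Net::HTTP.get(URI("http://#{host}/robots.txt")) rescue ''
  rules, applies = [], false
  robots.each_line do |line|
    case line.strip
    when /\AUser-agent:\s*\*\z/i  then applies = true
    when /\AUser-agent:/i         then applies = false
    when /\ADisallow:\s*(\S+)/i   then rules << $1 if applies
    end
  end
  rules
end

# Polite fetch: respects the simplified robots rules, identifies itself,
# and sleeps between requests so a single host is never hammered.
def polite_fetch(url, blocked)
  uri = URI(url)
  return nil if blocked.any? { |prefix| uri.path.start_with?(prefix) }
  sleep CRAWL_DELAY
  Net::HTTP.start(uri.host, uri.port) do |http|
    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = USER_AGENT
    http.request(request)
  end
end

blocked  = disallowed_paths('example.com')            # hypothetical test domain
response = polite_fetch('http://example.com/page', blocked)
puts response.code if response
```

The same wrapper would sit around a cURL call just as well; the point is that every fetch goes through one place where the politeness rules are enforced.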
I have some experience in developing a crawler and scraper, but crawling a few hundred sites is a much different story. I do not want to write a custom scraper for each website.
It is often easiest to use a database, as you mentioned earlier. Set up a database that stores the domains you will crawl as well as the URLs you crawl. Create a control loop that selects the appropriate domain/URL, and you can reuse your non-specific methods over and over again for each and every domain/URL instance.
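A rough sketch of that setup, again in Ruby, using the sqlite3 gem. The schema, table names, and the fetch_and_extract stand-in are assumptions made for illustration; the idea is simply that the same generic code runs against whatever domain/URL row the controller pulls next:

```ruby
require 'sqlite3'

# Assumed schema: one table of domains, one queue of URLs keyed to a domain.
db = SQLite3::Database.new('crawler.db')
db.execute_batch <<~SQL
  CREATE TABLE IF NOT EXISTS domains (
    id   INTEGER PRIMARY KEY,
    host TEXT UNIQUE NOT NULL
  );
  CREATE TABLE IF NOT EXISTS urls (
    id        INTEGER PRIMARY KEY,
    domain_id INTEGER NOT NULL REFERENCES domains(id),
    url       TEXT UNIQUE NOT NULL,
    crawled   INTEGER DEFAULT 0
  );
SQL

# Stand-in for the non-specific fetch/extract methods the post describes;
# nothing here is tailored to any particular site.
def fetch_and_extract(url)
  puts "fetching #{url}"   # placeholder for the shared fetch + extract logic
end

# Control loop: pick the next uncrawled URL, run the same generic methods
# against it, then mark it done.
loop do
  row = db.get_first_row('SELECT id, url FROM urls WHERE crawled = 0 LIMIT 1')
  break unless row
  id, url = row
  fetch_and_extract(url)
  db.execute('UPDATE urls SET crawled = 1 WHERE id = ?', [id])
end
```

Anything site-specific (crawl delay overrides, extraction hints) can live as extra columns on the domains table, so the code itself stays generic while the data drives the differences.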