Forum Moderators: coopster


Developing a crawler

crawler, spider


wmhelp

4:17 pm on Jul 2, 2009 (gmt 0)

10+ Year Member



I need to develop a crawler that crawls several hundred sites in a specific business category and extracts the content and URLs of product and service pages; other types of pages can be ignored. Most of the sites are small (a few hundred pages at most).

Here are a few requirements:

  • Extracted content and URLs need to be inserted into a database (or updated if they already exist).
  • Pages that have not been modified should not be re-fetched.
  • The crawler should detect errors and recover from them.

I have written a few crawlers and content extractors using the usual Ruby libraries.

Any ideas on how to write such a crawler without abusing any of the websites? The usual polite behavior, like honoring robots exclusion rules, will be followed, of course.

coopster

5:42 pm on Jul 6, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



cURL [php.net] is often the tool of choice for this job. Surround the fetching methods with the logic you have defined here and you are off and running. I recommend crawling your own domains first, or a test domain if you don't have your own, so you can confirm your crawler is following your logic correctly.
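
For your "don't re-fetch unmodified pages" requirement, cURL can send a conditional If-Modified-Since request for you. A rough sketch of what I mean (the function name, user-agent string, and error handling are just placeholders for your own):

<?php
// Fetch a URL only if it changed since $lastFetched (a Unix timestamp).
// Returns the body on 200, or null on 304 (not modified) or error.
function fetchIfModified($url, $lastFetched = 0)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/0.1 (+http://example.com/bot)');

    if ($lastFetched > 0) {
        // Ask the server for the page only if it changed since our last fetch.
        curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
        curl_setopt($ch, CURLOPT_TIMEVALUE, $lastFetched);
    }

    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($body === false || $code >= 400) {
        return null; // let the caller log the error and retry later
    }
    if ($code == 304) {
        return null; // not modified, nothing to re-parse
    }
    return $body;
}
?>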

wmhelp

10:20 pm on Jul 6, 2009 (gmt 0)

10+ Year Member



@coopster:
cURL is often the tool of choice for this job. Surround the fetching methods with the logic you have defined here and you are off and running. I recommend crawling your own domains first, or a test domain if you don't have your own, so you can confirm your crawler is following your logic correctly.

I have some experience developing crawlers and scrapers, but crawling a few hundred sites is a much different story. I don't want to write a custom scraper for each website.

coopster

12:37 pm on Jul 7, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



It really isn't a different story; you don't have to write a separate crawler for each site. Nothing should be hard-coded into the classes or functions you create. Simply pass the domain name and/or URL to the crawler when you create an instance in your control logic.
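
Here's the rough shape of what I mean (class and method names are just examples, not a finished design):

<?php
class Crawler
{
    private $domain;

    public function __construct($domain)
    {
        // The only site-specific state, passed in rather than hard-coded.
        $this->domain = rtrim($domain, '/');
    }

    public function crawl($path = '/', $lastFetched = 0)
    {
        $url  = $this->domain . $path;
        $html = fetchIfModified($url, $lastFetched); // helper from my earlier post
        if ($html === null) {
            return array(); // unchanged or failed; nothing new to extract
        }
        return $this->extractLinks($html);
    }

    private function extractLinks($html)
    {
        // Generic extraction; works the same way for every site.
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $links = array();
        foreach ($dom->getElementsByTagName('a') as $a) {
            $links[] = $a->getAttribute('href');
        }
        return $links;
    }
}

// One class, many sites:
$c    = new Crawler('http://www.example.com');
$urls = $c->crawl();
?>

The constructor takes the domain, so the same class covers every site; only the data differs.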

It is often easiest to use a database, as you mentioned earlier. Set up a database that stores the domains you will crawl as well as the URLs you crawl. Create a control loop that selects the appropriate domain/URL, and you can reuse your generic methods over and over again for each and every domain/URL instance.
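
A sketch of that control logic, assuming a hypothetical two-table schema (adjust the names to your own):

<?php
// Hypothetical schema:
//   domains(id, domain)
//   urls(id, domain_id, url, last_fetched)
$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'pass');

// Pick the URLs that are due for a visit, oldest first.
$due = $db->query(
    "SELECT u.id, u.url, d.domain, UNIX_TIMESTAMP(u.last_fetched) AS last_fetched
       FROM urls u JOIN domains d ON d.id = u.domain_id
      ORDER BY u.last_fetched ASC
      LIMIT 100"
);

$update = $db->prepare("UPDATE urls SET last_fetched = NOW() WHERE id = ?");

foreach ($due as $row) {
    $crawler = new Crawler($row['domain']); // same generic class every time

    $path = parse_url($row['url'], PHP_URL_PATH);
    if ($path === null) {
        $path = '/';
    }

    $links = $crawler->crawl($path, (int)$row['last_fetched']);
    // ... insert/update extracted content and any newly found URLs here ...

    $update->execute(array($row['id']));
    sleep(2); // be polite: don't hammer anyone's server
}
?>

For the insert-or-update part of your first requirement, MySQL's INSERT ... ON DUPLICATE KEY UPDATE is one way to handle it in a single statement.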