Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in 5 minutes!
So you can download it and use it to crawl web documents, but that does nothing by itself. The data you retrieve still needs to be processed, and this tool does not do that for you.
I block many terms found in UAs, including: spider, crawler, scrape, download, etc. But some beneficial actors may also include these terms in their user agents, so you need to whitelist the ones you want to allow.
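A minimal sketch of that kind of filter in plain Java. The term lists and class name here are illustrative, not taken from any particular firewall or server config; the point is just that the whitelist pass happens before the blocked-term pass:

```java
import java.util.List;
import java.util.Locale;

// Hypothetical UA filter: block known crawler terms unless the UA is whitelisted.
public class UaFilter {
    // Illustrative lists; a real deployment would maintain far longer ones.
    private static final List<String> BLOCKED_TERMS =
            List.of("spider", "crawler", "scrape", "download");
    private static final List<String> WHITELIST =
            List.of("googlebot", "bingbot"); // beneficial actors you choose to allow

    public static boolean isBlocked(String userAgent) {
        String ua = userAgent.toLowerCase(Locale.ROOT);
        // Whitelisted actors pass even if their UA also matches a blocked term.
        for (String good : WHITELIST) {
            if (ua.contains(good)) {
                return false;
            }
        }
        for (String bad : BLOCKED_TERMS) {
            if (ua.contains(bad)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that crawler4j's default UA string contains "crawler", so a filter like this catches it out of the box.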
9:56 pm on Nov 24, 2012 (gmt 0)
Yeah, I didn't like them snooping on our ecommerce site. So many competitors use this type of tool to grab our pricing and then beat us by a penny that it's not even funny.
10:35 pm on Nov 24, 2012 (gmt 0)
Just block any user agent containing "code.google.com" and you'll solve this problem once and for all.
I actually block anything with "http" or "www" in the user agent (after the initial whitelist pass, of course), which stops just about everything that actually advertises who it is.
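That heuristic works because self-advertising bots almost always embed an info URL in their UA string, and a URL contains "http" or "www". A sketch of the check, again with an illustrative whitelist and class name:

```java
import java.util.List;
import java.util.Locale;

// Hypothetical post-whitelist check: a UA that embeds a URL is almost
// certainly a bot advertising itself, so block it unless it was whitelisted.
public class UaUrlCheck {
    // Allowed even though their UAs typically contain URLs.
    private static final List<String> WHITELIST =
            List.of("googlebot", "bingbot");

    public static boolean isBlocked(String userAgent) {
        String ua = userAgent.toLowerCase(Locale.ROOT);
        // The whitelist pass happens first, as described above.
        for (String good : WHITELIST) {
            if (ua.contains(good)) {
                return false;
            }
        }
        // URLs in a UA show up as "http" or "www".
        return ua.contains("http") || ua.contains("www");
    }
}
```

Ordinary browser UAs contain neither substring, so real visitors are unaffected.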