Forum Moderators: open
We are considering starting a specialist search engine to complement a network of sites that we run. We know who we want to crawl (hand-edited lists, categorised and spam-checked), but we don't know the best way to make sure the bot isn't banned.
I know that publishing a URL with details of the bot and the reason for collecting data can help if human intervention happens. It's the automated blocking of bots crawling from outside recognised IP ranges that I'm trying to avoid.
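For what it's worth, the identification side is straightforward to get right in code. Here's a minimal sketch (the bot name "ExampleBot" and the info URL are placeholders, not real): announce yourself honestly in the User-Agent, with the info URL included, and check robots.txt before every fetch, since those are the first things most automated blockers look at.

import urllib.robotparser
import urllib.request
from urllib.parse import urlsplit

# Placeholder identity: a descriptive name plus a "+URL" pointing at a page
# explaining what the bot does and how to contact you.
USER_AGENT = "ExampleBot/1.0 (+http://www.example.co.uk/bot-info.html)"

def can_fetch(url: str) -> bool:
    """Check the site's robots.txt before requesting a page."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        # If robots.txt is unreachable, err on the side of not crawling.
        return False
    return rp.can_fetch(USER_AGENT, url)

def fetch(url: str) -> bytes:
    """Fetch a page, always sending the identifying User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

None of that stops an IP-range block on its own, but a bot that identifies itself and obeys robots.txt gives a webmaster a reason to whitelist you rather than escalate.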
I'm not looking for a cheater's charter, just a list of sensible things to implement.
Any ideas?
Further information: the search engine will cover a large vertical for the UK only. 20,000 sites have been hand-checked so far (8,000 passed), with another 40,000 to go. Most of the sites are small, and we only intend to hold up to 1,000 pages from any one site (most sites are much smaller than that). Data is fairly static for most sites, but we want to establish how frequently each site changes so we can set an automated re-visit delay per site, along the lines of the sketch below.
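One simple way to derive that re-visit delay is to hash each page body on every crawl and adjust the interval multiplicatively: back off when nothing changed, tighten up when it did. A minimal sketch of the idea follows; the bounds and multipliers here are made-up starting points, not recommendations.

import hashlib

MIN_DELAY_DAYS = 7    # never hammer a site more often than this
MAX_DELAY_DAYS = 90   # never let a page go stale longer than this

def next_revisit_delay(old_hash: str, body: bytes,
                       current_delay: float) -> tuple[str, float]:
    """Return (new_hash, new_delay_in_days) for a page just re-crawled."""
    new_hash = hashlib.sha256(body).hexdigest()
    if new_hash == old_hash:
        # Unchanged since last visit: stretch the interval, capped at the max.
        delay = min(current_delay * 1.5, MAX_DELAY_DAYS)
    else:
        # Changed: halve the interval, floored at the minimum.
        delay = max(current_delay / 2, MIN_DELAY_DAYS)
    return new_hash, delay

Per-page scheduling like this also keeps your request rate to any one host low, which helps with exactly the automated blocking you're worried about.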
You might also want to give some careful thought to naming your bot, avoiding anything that sounds 'scary'.
Personally, I ban all bots that have nothing to offer in return for crawling.