Forum Moderators: open


How do you make sure a new, genuine spider is not blocked?


inbound

12:04 am on Jan 11, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I know how much of a problem rogue spiders are; as a result, more and more people are taking steps to restrict bot access.

We are considering starting a specialist search engine to complement a network of sites that we run. We know who we want to crawl (hand-edited lists, categorised and spam-checked) but don't know the best way to make sure the bot isn't banned.

I know that adding a URL with details of the bot and the reason for collecting data can help if human intervention happens. It's the automated blocking of bots outside specific IP ranges that I'm trying to avoid.
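For illustration, one common convention is to put that information URL straight into the user-agent string, so it shows up in every log line (the bot name and URL below are made up, not a real product):

    MySpecialistBot/1.0 (+http://www.example.com/bot.html)

A webmaster who sees that UA in their logs can follow the link before deciding whether to block you.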

I'm not looking for a cheater's charter, just a list of sensible things to implement.

Any ideas?

Further information: the search engine will cover a large vertical for the UK only. There are 20,000 sites that have been hand-checked so far (8,000 OK), with another 40,000 to go. Most of the sites are small, and we only intend to hold up to 1,000 pages from any site (most sites are much smaller than that). Data is fairly static for most sites, but we would want to establish how frequently sites change in order to set an automated re-visit delay for each.
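One simple way to derive that re-visit delay is to fingerprint each page and adapt the interval depending on whether the content changed between crawls. This is only a sketch under assumed parameters; the hash choice, the halving/doubling, and the day limits are all assumptions, not anything the thread prescribes:

    import hashlib

    def next_revisit_days(old_hash, new_body, current_delay_days,
                          min_days=1, max_days=60):
        """Shorten the delay when a page changed since the last crawl,
        lengthen it when the page was unchanged."""
        new_hash = hashlib.sha1(new_body).hexdigest()
        if new_hash != old_hash:
            # page changed: come back sooner
            delay = max(min_days, current_delay_days // 2)
        else:
            # page static: back off
            delay = min(max_days, current_delay_days * 2)
        return new_hash, delay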

keyplyr

9:41 am on Jan 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




In addition to "adding a URL with details of the bot and the reason for collecting data", I think most webmasters just want unknown bots to request and obey robots.txt, use a defined UA, crawl from a verifiable IP, and request files at a reasonable rate.
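As a rough sketch of what that adds up to in practice (Python; the bot name, info URL, and default delay are assumptions, not anything prescribed here):

    import time
    import urllib.request
    import urllib.robotparser

    UA = "MySpecialistBot/1.0 (+http://www.example.com/bot.html)"  # hypothetical

    def polite_fetch(site, paths, default_delay=10):
        """Request robots.txt first, skip disallowed paths, and pause
        between requests so the crawl rate stays reasonable."""
        rp = urllib.robotparser.RobotFileParser(site + "/robots.txt")
        rp.read()
        # honour an explicit Crawl-delay if the site sets one
        pause = rp.crawl_delay(UA) or default_delay
        for path in paths:
            if not rp.can_fetch(UA, site + path):
                continue  # obey the disallow rules
            req = urllib.request.Request(site + path,
                                         headers={"User-Agent": UA})
            with urllib.request.urlopen(req) as resp:
                yield path, resp.read()
            time.sleep(pause)

A verifiable IP usually means crawling from addresses whose reverse DNS resolves back to your stated domain, so a suspicious webmaster can confirm the UA isn't forged.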

You might also give some careful thought to naming your bot; avoid anything 'scary'.

Staffa

9:53 am on Jan 12, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In addition to the above, it might be wise to restrict your first visit to a site to robots.txt and the index page. That way the webmaster can check who you are and why you are crawling, and then decide whether or not to allow access in the future.
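A minimal sketch of that first-visit behaviour, reusing the hypothetical UA from the example above (the idea of deferring the rest of the site to a later pass is an assumption about how you might stage the crawl):

    import urllib.request
    import urllib.robotparser

    UA = "MySpecialistBot/1.0 (+http://www.example.com/bot.html)"  # hypothetical

    def first_visit(site):
        """On the first visit, request only robots.txt and the index page;
        leave the rest of the site for a later pass."""
        rp = urllib.robotparser.RobotFileParser(site + "/robots.txt")
        rp.read()
        if not rp.can_fetch(UA, site + "/"):
            return None  # the site has opted out; go no further
        req = urllib.request.Request(site + "/", headers={"User-Agent": UA})
        with urllib.request.urlopen(req) as resp:
            return resp.read()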

Personally, I ban all bots that have nothing to offer in return for crawling.