I've been working on a monitoring script that reports on the top bots crawling my site at any given moment. During testing I've noticed a few bots that are legitimate but based in other countries.
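For context, the report is nothing fancy. Here's a minimal sketch of the idea in Python, assuming an Apache/nginx combined-format access log (the path and the crawler keywords are just placeholders):

```python
import re
from collections import Counter

LOG = "/var/log/apache2/access.log"  # hypothetical path; adjust for your server
UA_RE = re.compile(r'"([^"]*)"$')    # in combined format the UA is the last quoted field

counts = Counter()
with open(LOG, encoding="utf-8", errors="replace") as f:
    for line in f:
        m = UA_RE.search(line.rstrip())
        if not m:
            continue
        ua = m.group(1)
        # crude crawler test: most bots identify themselves in the UA string
        if any(k in ua.lower() for k in ("bot", "spider", "crawl", "slurp")):
            counts[ua] += 1

# top ten crawlers by request count
for ua, n in counts.most_common(10):
    print(f"{n:8d}  {ua}")
```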
My question: should I block these bots? Yandex, for instance, is the crawler for a Russian search engine/portal. My site is US based, and my traffic is 99% US based, so I'm not getting much benefit from Russian traffic. Does that mean I should just block them and save the processor time and bandwidth they're eating up?
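(If blocking turns out to be the answer, Yandex does obey robots.txt, so something like this should stop the crawl without touching the firewall:)

```
User-agent: Yandex
Disallow: /
```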
Currently Yandex ranks third in crawl volume on my site, behind only Google and Yahoo. We're talking tens of thousands of pages a day.
What advice would you give me? At first I thought, why not just leave it? It's more exposure for my site. But then I started thinking maybe it's pointless.
I guess the same question goes for all those prototype search bots that are trying to make a name for themselves. They typically don't crawl many pages, but they're always on the site.
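(Those little bots often ignore robots.txt, so if one misbehaves, blocking by User-Agent at the server works too. A sketch for an Apache 2.2 .htaccess, with made-up bot names standing in for whatever shows up in your logs:)

```
# match the offending UA strings (names here are hypothetical)
SetEnvIfNoCase User-Agent "SomePrototypeBot" bad_bot
SetEnvIfNoCase User-Agent "AnotherBetaCrawler" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```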
I allow Yandex to crawl my server's sites here (UK). I see the occasional hit from the SE but not enough to worry about. I originally allowed it for a client with a Russian-facing site, which is no longer up.
Taking just two sites' stats for the past six weeks, the top crawlers on both are msn, slurp and yandex, followed by "unknown" on one site and vagabondo on the other. Goog comes in very low on hits (48 of 600 total and 125 of 1500 total).
I recently ran an exercise tying Yandex crawl IPs to its bot's User-Agent. Very messy. Far fewer actual IPs than MS, but fragmented across more Class C's.
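For anyone wanting to repeat the exercise, here's roughly how I went about it. A sketch in Python, assuming a combined-format log where the client IP is the first field; "Class C" here just means the /24 prefix:

```python
import re
from collections import defaultdict

LOG = "access.log"  # adjust path for your server
UA_RE = re.compile(r'"([^"]*)"$')  # UA is the last quoted field

ips = defaultdict(set)       # user agent -> full IPs seen
prefixes = defaultdict(set)  # user agent -> /24 prefixes ("Class C's")

with open(LOG, encoding="utf-8", errors="replace") as f:
    for line in f:
        ip = line.split(" ", 1)[0]
        m = UA_RE.search(line.rstrip())
        if not m:
            continue
        ua = m.group(1)
        if "yandex" in ua.lower():
            ips[ua].add(ip)
            prefixes[ua].add(ip.rsplit(".", 1)[0])

for ua in sorted(ips):
    print(f"{ua}\n  {len(ips[ua])} IPs across {len(prefixes[ua])} /24s")
```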
I'm working through several other bots at the moment, especially those used by meta engines, ensuring that when everyone drops google there is still something left to send traffic to my customers. :)