Forum Moderators: open
Kitenga [google.com] in Google shows lots of usage from the Māori bible, but does not indicate other references.
Perhaps this person has Māori roots...
The names for some of these bots are selected specifically for that reason.
I denied the Pac-Bell sub-range, even though the bot only read robots.txt. It provided no URL and no reasonable way to find a home page.
Don
http://www.kitenga.com/about.html
Kitenga is developing next-generation search, extraction and e-commerce technologies in conjunction with partners to take the World Wide Web to the next level, where searching is no longer about keywords and wading through mountains of results, but is about making sense out of information.
We get an awful lot of automated abuse these days, and it makes some of us jumpy. I strongly suggest you add a URL to your user-agent string that resolves to a Web page on your domain explaining your project and containing robots exclusion information. This is a recommended "best practice" for legitimate robot operators.
Exhibit A - Top three spider UAs on my site today:
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
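For what it's worth, sending such a header from a crawler is trivial. Here's a minimal sketch using only the Python standard library; the bot name and URLs are placeholders, not anyone's real project:

```python
# Identify a crawler with a URL in its User-Agent string, following
# the pattern the big engines use ("BotName/version (+info-URL)").
# "ExampleBot" and example.com are placeholders.
import urllib.request

UA = "ExampleBot/1.0 (+http://www.example.com/bot.html)"

req = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": UA},
)

# The request would be sent with:
#   with urllib.request.urlopen(req) as resp:
#       html = resp.read()
print(req.get_header("User-agent"))
```

One header, and every Webmaster who greps an access log can find out who you are.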
This will prevent a lot of mistrust, and reduce the number of inquiries you receive via e-mail.
Thanks,
Jim
I'd venture to say that a lot of Webmasters prefer the Web page approach. When faced with what might be an e-mail address harvester, I'd rather not have to send an e-mail to find out, if you appreciate my logic there. :) Also, if your project is wildly successful, you may "suffer from success" with the e-mail approach: you might need several people spending all day answering "Who are you guys?" e-mails from Webmasters. The e-mail approach scales poorly.
Also, as you probably know, there is often a lot of confusion about robots.txt, and the User-agent name needed to control fetches. In some cases, the name we see in our server access logs is what the 'bot will recognize in robots.txt. In other cases, it's not. So, a direct Web page URL gives you the opportunity to communicate the required robots.txt User-agent name to the Webmaster community with one click.
From your site:
The Kitenga User-Agent is Kitenga-crawler-bot-alpha/0.9. A simple rule like this:

User-Agent: Kitenga-crawler-bot-alpha/0.9
Disallow: *

in your robots.txt file will stop crawling of your site completely.
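One aside on the quoted rule: the original robots exclusion standard treats Disallow values as path prefixes, so most parsers expect "Disallow: /" to block a whole site; "Disallow: *" is nonstandard and many parsers will ignore it. A quick sketch with Python's standard-library parser shows how the token matching works (the bot name here is a placeholder, not Kitenga's actual token):

```python
# Check which User-agent token a robots.txt rule actually matches,
# using Python's built-in parser. "ExampleBot" is a placeholder.
import urllib.robotparser

rules = """\
User-agent: ExampleBot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The parser matches on the token before the "/version" part:
print(rp.can_fetch("ExampleBot/1.0", "/private/page.html"))    # False
print(rp.can_fetch("SomeOtherBot/1.0", "/private/page.html"))  # True
```

This is exactly why publishing the robots.txt token matters: a Webmaster who guesses the wrong name gets no protection at all.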
Anyway, please understand that it's a jungle out here, and Webmasters have to deal with an ever-increasing number of malicious 'bots and "mis-implemented" crawlers (my polite term). Helping us to make a quick determination of the robot's intent will help you, too.
Thanks for posting, and best of luck with your project!
Jim