
Kitenga


wilderness

5:33 am on Jan 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Couldn't find anything on Google, with the exception of log lines.

69.225.14.60 - - [14/Jan/2005:13:17:53 -0800] "GET /robots.txt HTTP/1.1" 200
3219 "-" "Kitenga-crawler-bot-alpha/0.9"

pendanticist

10:54 pm on Jan 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The only things I found are that the term is Māori for "discovery" and, of course, that the user is in the US.

Kitenga [google.com] in Google shows lots of usage from the Māori Bible, but does not indicate other references.

Perhaps this person has Māori roots...

wilderness

11:13 pm on Jan 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Glenn,
I went through 6-8 pages at Google with no success, and even tried adding a few extra terms.

The names for some of these bots are selected specifically for that reason.

I denied the Pac-Bell sub-range, even though the bot only read robots.txt. They provide no URL or any reasonable possibility of finding a home page.
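
For anyone who would rather deny by user-agent than by IP range, a minimal .htaccess sketch for Apache (assuming mod_setenvif and mod_access are available; the UA substring is taken from the log line above) might look like:

```apache
# Flag any request whose User-Agent starts with the Kitenga crawler string
SetEnvIfNoCase User-Agent "^Kitenga-crawler-bot" bad_bot

# Deny flagged requests, allow everyone else
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Blocking on the UA prefix rather than the full "Kitenga-crawler-bot-alpha/0.9" string keeps the rule working if the version number changes.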

Don

pendanticist

11:23 pm on Jan 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Congrats, Don! Looks like you found another new one...

http//www.kitenga.com/about.html

Kitenga is developing next-generation search, extraction and e-commerce technologies in conjunction with partners to take the World Wide Web to the next level, where searching is no longer about keywords and wading through mountains of results, but is about making sense out of information.

wilderness

11:43 pm on Jan 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the heads-up Glenn.

This page on their site is more interesting (at least from a denial perspective):

http//www.kitenga.com/consulting.html

Staffa

12:21 pm on Jan 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just tried, and could not access any page on their site; each visit was immediately redirected to M$ dot com.

Reason enough to ban, I would think.

pendanticist

2:41 pm on Jan 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try adding the ":" in the appropriate place. ;)

Staffa

6:47 pm on Jan 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you, pendanticist.
For sure I overlooked that bitty ;o)

kitenga

11:42 pm on Jan 26, 2005 (gmt 0)

10+ Year Member



Gentlemen, please do let me know if the Kitenga Alpha bot misbehaved on your site. There is no "denial" question here; we are building a specialized search engine, and are definitely not scraping emails or engaging in other trashy behavior. If you want a source-based exclusion, I can do that for you as well.

jdMorgan

12:30 am on Jan 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Kitenga, and welcome to WebmasterWorld!

We get an awful lot of automated abuse these days, and it makes some of us jumpy. I strongly suggest you add a URL to your user-agent string that resolves to a Web page on your domain that explains your project and contains robots exclusion information. This is a recommended "best practice" for legitimate robot operators.

Exhibit A - Top three spider UAs on my site today:

Googlebot/2.1 (+http://www.googlebot.com/bot.html) 
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

This will prevent a lot of mistrust, and reduce the number of inquiries you receive via e-mail.

Thanks,
Jim

kitenga

12:53 am on Jan 27, 2005 (gmt 0)

10+ Year Member



Thanks much for the advice. We do populate the HTTP_FROM field with an email address, but I will add the url as suggested.

jdMorgan

4:55 am on Jan 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The problem with the HTTP_FROM header is that it's not displayed in the standard server Common Log Format or NCSA extended/combined log format. So this effort is largely wasted -- unless the Webmaster sets up a custom log format or implements a custom logging script. Few do.
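
(For the rare Webmaster who does want to see that header, a one-line custom log format is enough. A sketch for Apache's mod_log_config, extending the standard combined format with the From request header:)

```apache
# Combined log format plus the From: request header appended at the end
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{From}i\"" combined_from
CustomLog logs/access_log combined_from
```
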

I'd venture to say that a lot of Webmasters prefer the Web page approach. When faced with what might be an e-mail address harvester, I'd rather not have to send them an e-mail to find out, if you appreciate my logic there. :) Also, if your project is wildly successful, you may "suffer from success" using this approach: you might need several people spending all day answering the "Who are you guys?" e-mails from Webmasters. The e-mail approach scales poorly.

Also, as you probably know, there is often a lot of confusion about robots.txt, and the User-agent name needed to control fetches. In some cases, the name we see in our server access logs is what the 'bot will recognize in robots.txt. In other cases, it's not. So, a direct Web page URL gives you the opportunity to communicate the required robots.txt User-agent name to the Webmaster community with one click.

From your site:

The Kitenga User-Agent is Kitenga-crawler-bot-alpha/0.9. A simple rule like this:

User-Agent: Kitenga-crawler-bot-alpha/0.9
Disallow: *

in your robots.txt file will stop crawling of your site completely.


Here, I'd say, "Don't be too specific, and don't be so pessimistic." That example's OK, but it's got two problems. First, I'd never include version info in my robots.txt unless it was absolutely required. So, this example might leave me wondering, "What if I just use 'User-agent: Kitenga-crawler', will that work?" (The Standard for Robot Exclusion says it should, and I'd assume so, but others with less experience won't have a clue.) And second, why provide only an example of how to block your robot completely? Most of us just want to keep 'good' robots out of our cgi-bin, shopping carts, script-generated infinite URL-spaces, and "admin" areas. I'd recommend several more examples, and move this info to a sub-page if it gets too big.
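
Along those lines, a sketch of the kind of partial-exclusion example such a page could show (the directory paths here are hypothetical, and the short "Kitenga" token assumes prefix matching of the User-agent name, as the Robots Exclusion Standard describes):

```
# Keep the crawler out of script, cart, and admin areas only
User-agent: Kitenga
Disallow: /cgi-bin/
Disallow: /cart/
Disallow: /admin/

# (replacing the lines above with a single "Disallow: /" would
# block the crawler from the whole site instead)
```
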

Anyway, please understand that it's a jungle out here, and Webmasters have to deal with an ever-increasing number of malicious 'bots and "mis-implemented" crawlers (my polite term). Helping us to make a quick determination of the robot's intent will help you, too.

Thanks for posting, and best of luck with your project!

Jim

larryhatch

5:57 am on Jan 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I was about to note that there is a Katenga Province in Nigeria, but I had misspelled that.
It's Katanga Province, with three letters 'a'.