
Kyluka

fusion5

1:37 am on May 30, 2006 (gmt 0)

10+ Year Member



Anybody know who these guys are? Good/Bad?
Mozilla/5.0 (compatible; Kyluka crawl; [kyluka.com...] crawl@kyluka.com)

Thx,
-W

Pfui

8:21 pm on May 30, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I block Kyluka and its kin Sokitomi by agent (& will probably block their primary IP, too) because although they ask for robots.txt --

1.) Their namesake sites play hide-the-ball. And seeing as how I don't know who the creator is, what they're doing, and/or what they'll do with MY data, they can't have it.

2.) The crawlers hit via HEAD requests. That's atypical and I've long had problems with HEAD-hitters (...and never get any from AOL or any legit host).

3.) They're aggressive and undeterred by 403s. The same ".sfo1.dsl.speakeasy.net" user out of San Francisco -- probably both crawlers' SF-based owner per the WHOIS registration -- hit with Kyluka up to eight times a day from two different accounts, one of which also hit the same way with Sokitomi.

Three strikes, they're OUT.

P.S. Separated at birth? Oh, yeah...

Sokitomi: Only visited the first part of May, then gone; also did GETs:

dsl092-019-252.sfo1.dsl.speakeasy.net - - [17/May/2006:01:35:43 -0700] "HEAD /robots.txt HTTP/1.1" 403
"Mozilla/5.0 (compatible; Sokitomi crawl; [sokitomi.com...] crawl@sokitomi.com)"
dsl092-019-252.sfo1.dsl.speakeasy.net - - [17/May/2006:01:35:44 -0700] "HEAD / HTTP/1.1" 403
"Mozilla/5.0 (compatible; Sokitomi crawl; [sokitomi.com...] crawl@sokitomi.com)"
dsl092-019-252.sfo1.dsl.speakeasy.net - - [18/May/2006:12:47:46 -0700] "HEAD /robots.txt HTTP/1.1" 403
"Mozilla/5.0 (compatible; Sokitomi crawl; [sokitomi.com...] crawl@sokitomi.com)"
dsl092-019-252.sfo1.dsl.speakeasy.net - - [18/May/2006:12:47:47 -0700] "HEAD / HTTP/1.1" 403
"Mozilla/5.0 (compatible; Sokitomi crawl; [sokitomi.com...] crawl@sokitomi.com)"

Kyluka: Coincidentally, showed up the last part of May; no GETs:

dsl092-019-252.sfo1.dsl.speakeasy.net - - [27/May/2006:01:22:49 -0700] "HEAD /robots.txt HTTP/1.1" 403
"Mozilla/5.0 (compatible; Kyluka crawl; [kyluka.com...] crawl@kyluka.com)"
dsl092-019-252.sfo1.dsl.speakeasy.net - - [27/May/2006:01:22:50 -0700] "HEAD / HTTP/1.1" 403
"Mozilla/5.0 (compatible; Kyluka crawl; [kyluka.com...] crawl@kyluka.com)"

thetrasher

11:52 am on May 31, 2006 (gmt 0)

10+ Year Member



www.kyluka.com
=> birch.kyluka.com
=> 66.92.19.252
=> dsl092-019-252.sfo1.dsl.speakeasy.net
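
For anyone who wants to repeat the lookup, here's a minimal sketch using only Python's standard library; the hostname is the one from this thread, and of course it may no longer resolve the same way:

# Forward and reverse lookup, standard library only. The hostname is
# from this thread and may no longer resolve as it did in 2006.
import socket

ip = socket.gethostbyname("www.kyluka.com")   # forward (A record)
host, _, _ = socket.gethostbyaddr(ip)         # reverse (PTR record)
print(ip)    # 66.92.19.252 at the time of this thread
print(host)  # dsl092-019-252.sfo1.dsl.speakeasy.net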

The crawlers hit via HEAD requests. That's atypical and I've long had problems with HEAD-hitters (...and never get any from AOL or any legit host)
66.249.66.71 - - [25/Mar/2006:13:00:07 +0100] "GET /robots.txt HTTP/1.1" 200 122 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.71 - - [25/Mar/2006:13:00:07 +0100] "HEAD / HTTP/1.1" 200 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

fusion5

9:25 pm on Jun 1, 2006 (gmt 0)

10+ Year Member



Alright, cool.
Thanx for the replies!
-W

Lord Majestic

5:49 pm on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



2.) The crawlers hit via HEAD requests. That's atypical and I've long had problems with HEAD-hitters (...and never get any from AOL or any legit host).

HEADs are as much a part of the HTTP standard as GETs, and using HEAD helps save bandwidth for everyone.
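
To make the bandwidth point concrete, here's a minimal sketch using only Python's standard library; example.com is a placeholder host:

# Compare what GET and HEAD actually transfer for the same resource.
# example.com is a placeholder host.
import http.client

def fetch(method):
    conn = http.client.HTTPConnection("example.com")
    conn.request(method, "/robots.txt")
    resp = conn.getresponse()
    body = resp.read()  # always empty for HEAD
    conn.close()
    return resp.status, len(body)

print("GET:", fetch("GET"))    # headers plus the full content-body
print("HEAD:", fetch("HEAD"))  # the same headers, zero body bytes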

Pfui

6:32 pm on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well sure, HEAD requests are one of the standard methods [httpd.apache.org], but they're not standard on my sites. Never have been. Neither are CONNECT, DELETE, OPTIONS, PUT and TRACE.

For years, the vast majority of HEAD reqs I've observed have been, of course, from robots, never 'regular' browsers/visitors. Seeing as how I take a hard line against unauthorized bots, and the major SEs I allow all use GET, continuing to block HEADs works for me. YMMV

Lord Majestic

10:06 pm on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, all I am saying is that a HEAD request not only saves you bandwidth but also does not obtain your content. Using a HEAD request for robots.txt is actually a very smart move: about 80% of servers do not have one, and many of those return a custom 404 that wastes bandwidth. So any bot that uses a HEAD request for robots.txt is a webmaster-friendly bot that IMO really should be encouraged, but never mind.

The funny thing is that if you reject a HEAD request for robots.txt, it's pretty legitimate to assume you have no robots.txt at all, because the standard requires robots.txt to be publicly accessible -- which it ain't if you reject the standard HEAD request method.
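
As a sketch of the bot-side logic being described here -- probe with a cheap HEAD first, GET the body only on a 200 -- with a placeholder host (a real crawler would also handle timeouts, redirects and per-host caching):

# Probe robots.txt with HEAD; spend the GET bandwidth only if it exists.
import urllib.error
import urllib.request

def fetch_robots_txt(host):
    url = "http://%s/robots.txt" % host
    try:
        probe = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(probe) as resp:
            if resp.status != 200:
                return None   # treat anything but 200 as "no robots.txt"
    except urllib.error.URLError:
        return None           # 4xx/5xx or network error: headers only, no body wasted
    with urllib.request.urlopen(url) as resp:  # file exists: fetch the body
        return resp.read().decode("utf-8", "replace")

print(fetch_robots_txt("example.com"))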

Pfui

12:46 am on Jun 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We could each cite standards and practices till the bots came home, but that would be silly. I've repeatedly explained how my 403 methodology is based on my sites' needs and requirements. Perhaps stats from one of my sites will aid in understanding --

May 2006
Requested: robots.txt
Methods
GET: 4093
HEAD: 23 (0.56%)

All of the HEAD requests came from just three ISPs and all were anonymous bots worthy of 403, e.g., this thread's topic, Kyluka, and its twin, Sokitomi -- both distinctly Webmaster-UNfriendly bots as I detailed in Msg. #2.
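
Stats like these can be tallied straight from an access log. A minimal sketch, assuming the combined log format shown in the entries above; access.log is a placeholder path:

# Count GET vs. HEAD requests for robots.txt in an Apache access log.
import re
from collections import Counter

request_re = re.compile(r'"(GET|HEAD) /robots\.txt HTTP/[0-9.]+"')

methods = Counter()
with open("access.log") as log:
    for line in log:
        match = request_re.search(line)
        if match:
            methods[match.group(1)] += 1

total = sum(methods.values()) or 1
for method, count in sorted(methods.items()):
    print("%s: %d (%.2f%%)" % (method, count, 100.0 * count / total))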

Bandwidth cost to 403 unwanted HEAD reqs: $0

Freedom from bandwidth-abusing bots: Priceless

###

Lord Majestic

1:16 pm on Jun 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The bandwidth cost of a HEAD is the same as that of a 403: both send headers only; given network overheads, it's going to be almost exactly the same, if not identical. The consequence, however, is that by denying HEAD requests to robots.txt you are not gaining anything, but potentially losing something: requests for URLs that you disallowed in your robots.txt may actually come through, and that will be perfectly legitimate, since it was you who denied access to robots.txt.

So, in this situation it seems logical to allow all HEAD requests, or at least those for robots.txt. Of course, that means you won't have the pleasure of having denied access to some guys you consider bad -- and they may well be bad, which makes this move self-justifying. However, the badness of people should not be judged by HEAD requests to robots.txt: if anything, webmasters who don't like bots wasting their bandwidth should be interested in making sure that robots.txt is always accessible, regardless of source.

In my experience, using a HEAD request to check for the presence of robots.txt is much faster and more bandwidth-efficient, not just for the bot but also for the webmaster. Surely, if some bot writers care enough to use HEADs, they should be encouraged rather than whacked on the spot?

jdMorgan

10:16 pm on Jun 5, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



HEAD requests for robots.txt should always be allowed along with GET requests.

Since the Last-Modified and ETag headers can also be checked using HEAD, some user-agents use HEAD as a quick (and low-bandwidth) test to see if a page has been updated. In fact, it may be more reliable than counting on the server to return a 304 Not Modified response. AOL, in particular, uses a lot of HEAD requests when checking whether the caching proxies at the borders of its network need to be updated.
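
A minimal sketch of that kind of freshness check; the URL and cached validator values are placeholders:

# One HEAD request fetches the validators without transferring the body.
import urllib.request

def is_stale(url, cached_etag, cached_last_modified):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        etag = resp.headers.get("ETag")
        last_modified = resp.headers.get("Last-Modified")
    # If either validator changed (or disappeared), refetch the page.
    return etag != cached_etag or last_modified != cached_last_modified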

There's nothing mysterious about HEAD requests. The difference between HEAD and GET is that a HEAD request asks the server to send only the HTTP response headers for a particular resource, so the response is smaller because it doesn't include the content-body of the 'page' itself. In fact, if you append the HTML page to a HEAD response, you get what a GET response looks like.

Jim