Thx,
-W
1.) Their namesake sites play hide-the-ball. And seeing as how I don't know who the creator is, what they're doing, and/or what they'll do with MY data, they can't have it.
2.) The crawlers hit via HEAD requests. That's atypical and I've long had problems with HEAD-hitters (...and never get any from AOL or any legit host).
3.) They're aggressive and undeterred by 403s. The same ".sfo1.dsl.speakeasy.net" user out of San Francisco -- probably both crawlers' SF-based owner per the WHOIS registration -- hit with Kyluka up to eight times a day from two different accounts, one of which also hit the same way with Sokitomi.
Three strikes, they're OUT.
P.S.
Separated at birth? Oh, yeah...
Sokitomi: Only visited the first part of May, then gone; also did GETs:
dsl092-019-252.sfo1.dsl.speakeasy.net - - [17/May/2006:01:35:43 -0700] "HEAD /robots.txt HTTP/1.1" 403
"Mozilla/5.0 (compatible; Sokitomi crawl; [sokitomi.com...] crawl@sokitomi.com)"
dsl092-019-252.sfo1.dsl.speakeasy.net - - [17/May/2006:01:35:44 -0700] "HEAD / HTTP/1.1" 403
"Mozilla/5.0 (compatible; Sokitomi crawl; [sokitomi.com...] crawl@sokitomi.com)"
dsl092-019-252.sfo1.dsl.speakeasy.net - - [18/May/2006:12:47:46 -0700] "HEAD /robots.txt HTTP/1.1" 403
"Mozilla/5.0 (compatible; Sokitomi crawl; [sokitomi.com...] crawl@sokitomi.com)"
dsl092-019-252.sfo1.dsl.speakeasy.net - - [18/May/2006:12:47:47 -0700] "HEAD / HTTP/1.1" 403
"Mozilla/5.0 (compatible; Sokitomi crawl; [sokitomi.com...] crawl@sokitomi.com)"
Kyluka: Coincidentally, showed up the last part of May; no GETs:
dsl092-019-252.sfo1.dsl.speakeasy.net - - [27/May/2006:01:22:49 -0700] "HEAD /robots.txt HTTP/1.1" 403
"Mozilla/5.0 (compatible; Kyluka crawl; [kyluka.com...] crawl@kyluka.com)"
dsl092-019-252.sfo1.dsl.speakeasy.net - - [27/May/2006:01:22:50 -0700] "HEAD / HTTP/1.1" 403
"Mozilla/5.0 (compatible; Kyluka crawl; [kyluka.com...] crawl@kyluka.com)"
By contrast, the major SEs fetch robots.txt via GET, e.g.:
66.249.66.71 - - [25/Mar/2006:13:00:07 +0100] "GET /robots.txt HTTP/1.1" 200 122 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
For years, the vast majority of HEAD requests I've observed have come from robots, never 'regular' browsers/visitors. Seeing as how I take a hard line against unauthorized bots, and the major SEs I allow use GET, continuing to block HEADs works for me. YMMV
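For what it's worth, the 403-all-HEADs policy described above can be sketched as a hypothetical WSGI app (illustrative only, not anyone's actual server config):

```python
def app(environ, start_response):
    """Hard-line policy sketch: 403 every HEAD request, serve GETs normally."""
    if environ.get("REQUEST_METHOD") == "HEAD":
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b""]  # a 403'd HEAD costs essentially no bandwidth
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, GET visitor.\n"]

# Quick offline check with fake WSGI environs:
def status_for(method):
    seen = {}
    app({"REQUEST_METHOD": method}, lambda s, h: seen.setdefault("status", s))
    return seen["status"]

print(status_for("HEAD"))  # -> 403 Forbidden
print(status_for("GET"))   # -> 200 OK
```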
The funny thing is that if you reject a HEAD request for robots.txt, then it's pretty legitimate to assume you have no robots.txt at all, because the standard requires robots.txt to be publicly accessible, which it isn't if you reject the standard HEAD request method.
May 2006
Requested: robots.txt
Methods:
GET: 4093
HEAD: 23 (0.56%)
All of the HEAD requests came from just three ISPs and all were anonymous bots worthy of 403, e.g., this thread's topic, Kyluka, and its twin, Sokitomi -- both distinctly Webmaster-UNfriendly bots as I detailed in Msg. #2.
Bandwidth cost to 403 unwanted HEAD reqs: $0
Freedom from bandwidth-abusing bots: Priceless
###
So, in this situation it seems logical to allow all HEAD requests, or at least those for robots.txt. Of course, that means you won't have the pleasure of denying access to some guys you consider bad; and they may well be bad, which makes the move feel self-justifying. But badness should not be judged by HEAD requests to robots.txt: if anything, webmasters who don't like bots wasting their bandwidth should be interested in making sure that robots.txt is always accessible, regardless of source.
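For those who want that middle ground, here's a hypothetical WSGI sketch of the carve-out: 403 HEAD everywhere except robots.txt (illustrative only; the app and helper names are my own):

```python
def app(environ, start_response):
    """Carve-out sketch: robots.txt stays reachable by HEAD even while
    all other HEAD requests get a 403."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "/")
    if method == "HEAD" and path != "/robots.txt":
        start_response("403 Forbidden", [])
        return [b""]
    body = b"User-agent: *\nDisallow:\n" if path == "/robots.txt" else b"page\n"
    start_response("200 OK", [("Content-Length", str(len(body)))])
    return [b""] if method == "HEAD" else [body]

def status_for(method, path):
    seen = {}
    app({"REQUEST_METHOD": method, "PATH_INFO": path},
        lambda s, h: seen.setdefault("status", s))
    return seen["status"]

print(status_for("HEAD", "/robots.txt"))  # -> 200 OK
print(status_for("HEAD", "/index.html")) # -> 403 Forbidden
```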
In my experience, using a HEAD request to check for the presence of robots.txt is much faster and more bandwidth-efficient, not just for the bot but also for the webmaster. Surely, if some bot writers care enough to use HEADs, they should be encouraged to do so rather than whacked on the spot?
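A bot that wants to use HEAD politely could fall back to GET when its HEAD is refused, something like this illustrative Python sketch (the helper names are my own):

```python
import urllib.error
import urllib.request

def inconclusive_head(method, status):
    """A HEAD turned away with 403/405 says nothing about whether
    robots.txt exists, so the polite fallback is to retry with GET."""
    return method == "HEAD" and status in (403, 405)

def robots_txt_exists(base_url):
    """Probe for robots.txt cheaply with HEAD, falling back to GET when
    the server rejects the HEAD method outright."""
    for method in ("HEAD", "GET"):
        req = urllib.request.Request(
            base_url.rstrip("/") + "/robots.txt", method=method)
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status == 200
        except urllib.error.HTTPError as err:
            if inconclusive_head(method, err.code):
                continue  # server dislikes HEAD; try again with GET
            return False
    return False
```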
Since the Last-Modified and ETag headers can also be checked using HEAD, some user-agents use HEAD as a quick (and low-bandwidth) test to see if a page has been updated. In fact, it may be more reliable than counting on the server to return a 304 Not Modified response. AOL, in particular, uses a lot of HEAD requests when checking to see if the caching proxies at the borders of its network need to be updated.
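The validator comparison could look something like this (an illustrative sketch, not any proxy's actual logic; the header names are the standard HTTP ones):

```python
def looks_modified(cached, fresh):
    """Compare validators from a cached copy against those from a fresh
    HEAD response; True means the full resource should be re-fetched."""
    if cached.get("ETag") and fresh.get("ETag"):
        return cached["ETag"] != fresh["ETag"]
    if cached.get("Last-Modified") and fresh.get("Last-Modified"):
        return cached["Last-Modified"] != fresh["Last-Modified"]
    return True  # no validators to compare: assume it changed

print(looks_modified({"ETag": '"abc"'}, {"ETag": '"abc"'}))  # -> False
print(looks_modified({"ETag": '"abc"'}, {"ETag": '"xyz"'}))  # -> True
```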
There's nothing mysterious about HEAD requests. The difference between HEAD and GET is that a HEAD request asks the server to send only the HTTP response headers for a particular resource. So the response to a HEAD request is smaller, because it doesn't include the content-body of the 'page' itself. In fact, a GET response is just a HEAD response with the content-body appended.
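You can see the difference for yourself with Python's stdlib: spin up a tiny local server and issue both methods (a self-contained demo; the handler is illustrative):

```python
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    body = b"<html><body>hello</body></html>"

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(self.body)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(self.body)  # headers plus the body

    def do_HEAD(self):
        self._send_headers()  # same headers, no body

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
results = {}
for method in ("HEAD", "GET"):
    conn.request(method, "/")
    resp = conn.getresponse()
    results[method] = (resp.getheader("Content-Length"), len(resp.read()))
conn.close()
server.shutdown()

print(results)  # identical Content-Length, but only GET carries a body
```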
Jim