I had a "D'oh!" moment.
This is a robot:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
This is a robot:
Acoon v4.1.0 (www.acoon.de)
This is a robot:
Mozilla/6.0 (compatible)
This is a robot:
Googlebot-Image/1.0
(at 19 characters, probably the shortest named robot I know-- apart from YahooCacheSystem, which I no longer give a ### about)
This is a robot:
vlc/1.1.6
This is a robot:
2
(I am not making this up. Admittedly, my logs have been known to get the hiccups.)
This is either a robot or someone who deserves to be treated as one:
-
Now, moving in the other direction, and stipulating for the sake of discussion that I've correctly identified the humans:
robot:
Opera/9.00 (Windows NT 5.1; U; en)
human:
Nokia5233/UC Browser7.9.0.102/50/355/UCWEB
(at 42 characters, the shortest human UA I've seen)
robot:
Mozilla/5.0 (compatible; IntelCSbot/0.2.1beta)
humans:
Mozilla/4.0 (PSP (PlayStation Portable); 2.00)
KWC-Buckle/ UP.Browser/7.2.7.2.541 (GUI) MMP/2.0
robots:
Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)
Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)
and so on. Beyond a certain point it's all humans-- or robots spoofing humans-- with obvious aberrations like
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51
and (probably the longest self-identified robot we normally see)
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
There's no upper limit to the length of a UA string. But below a certain length, it's a robot.
To check things out I went prowling through the last couple of days' logs... and instantly netted webcollage/1.156 at 16 characters. I never knew they existed; they sneaked under my radar by requesting images with the correct referer, as if human. (They did not sneak under everyone else's radar. There are lots of WebmasterWorld threads talking about it. Apparently hotlinkers without the hotlink.)
All of this leads to the obvious thought:
RewriteCond %{HTTP_USER_AGENT} ^.{0,some-integer-here}$
RewriteRule (/|\.html)$ - [F]
By constraining it to html, I don't have to bother with exceptions for robots.txt and so on. Robots that prey on image files are few and far between; they can be dealt with separately.
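For concreteness, a sketch with an arbitrary stand-in of 25 (the number is pulled out of the air, that's the whole question-- and note the explicit {0,n} form, since PCRE doesn't read a bare {,n} as a quantifier):
RewriteCond %{HTTP_USER_AGENT} ^.{0,25}$
RewriteRule (/|\.html)$ - [F]
Anything ending in / or .html gets a 403; robots.txt, css, js and images sail through untouched.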
Question.
What's a safe number to use?
Set it too low and it's not worth the trouble. Set it too high and you have to pile on the exceptions for authorized robots-- and risk locking out humans with weird mobile devices.
Two tiers?
.{0,15}
no argument, you're out.
.{0,40}
unless your name is {second Condition listing exceptions}.
?
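Spelled out, it might look something like this-- a sketch only, with both thresholds and the whitelist as placeholders (Googlebot-Image here stands for whatever actually belongs on the exception list):
# tier one: 15 characters or fewer, no questions asked
RewriteCond %{HTTP_USER_AGENT} ^.{0,15}$
RewriteRule (/|\.html)$ - [F]
# tier two: 40 characters or fewer AND not on the whitelist
RewriteCond %{HTTP_USER_AGENT} ^.{0,40}$
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Image
RewriteRule (/|\.html)$ - [F]
Consecutive RewriteConds are ANDed by default, so the second tier only fires when the UA is short and also fails the name check.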