Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
how many bytes make a human?
lucy24
msg:4414589 · 2:42 am on Feb 6, 2012 (gmt 0)

I had a "D'oh!" moment.

This is a robot:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

This is a robot:
Acoon v4.1.0 (www.acoon.de)

This is a robot:
Mozilla/6.0 (compatible)

This is a robot:
Googlebot-Image/1.0
(at 19 characters, probably the shortest named robot I know-- apart from YahooCacheSystem, which I no longer give a ### about)

This is a robot:
vlc/1.1.6

This is a robot:
2
(I am not making this up. Admittedly, my logs have been known to get the hiccups.)

This is either a robot or someone who deserves to be treated as one:
-

Now, moving in the other direction, and stipulating for the sake of discussion that I've correctly identified the humans:

robot:
Opera/9.00 (Windows NT 5.1; U; en)

human:
Nokia5233/UC Browser7.9.0.102/50/355/UCWEB
(at 42 characters, the shortest human UA I've seen)

robot:
Mozilla/5.0 (compatible; IntelCSbot/0.2.1beta)

humans:
Mozilla/4.0 (PSP (PlayStation Portable); 2.00)
KWC-Buckle/ UP.Browser/7.2.7.2.541 (GUI) MMP/2.0

robots:
Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)
Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)

and so on. Beyond a certain point it's all humans-- or robots spoofing humans-- with obvious aberrations like

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51

and (probably the longest self-identified robot we normally see)

Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

There's no upper limit to the length of a UA string. But below a certain length, it's a robot.

To check things out I went prowling through the last couple of days' logs... and instantly netted webcollage/1.156 at 16 characters. I never knew they existed; they sneaked under my radar by requesting images with the correct referer, as if human. (They did not sneak under everyone else's radar. There are lots of WebmasterWorld threads talking about it. Apparently hotlinkers without the hotlink.)

All of this leads to the obvious thought:

RewriteCond %{HTTP_USER_AGENT} ^.{0,some-integer-here}$
RewriteRule (/|\.html)$ - [F]

By constraining it to html, I don't have to bother with exceptions for robots.txt and so on. Robots that prey on image files are few and far between; they can be dealt with separately.

Question.

What's a safe number to use?

Set it too low and it's not worth the trouble. Set it too high and you have to pile on the exceptions for authorized robots-- and risk locking out humans with weird mobile devices.

Two tiers?

.{0,15}

no argument, you're out.

.{0,40}

unless your name is {second Condition listing exceptions}.

?
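For concreteness, the two tiers might be sketched in mod_rewrite roughly like this (untested; the thresholds and the allow-list entry are placeholders, not vetted values, and note that PCRE wants `{0,n}`, since `{,n}` is read literally):

```apache
# Tier 1: no UA at all, or 15 characters or fewer: no argument, you're out.
RewriteCond %{HTTP_USER_AGENT} ^.{0,15}$
RewriteRule (/|\.html)$ - [F]

# Tier 2: 40 characters or fewer, unless the UA matches an allow-list of
# known short robots (Googlebot-Image here is just an example entry).
RewriteCond %{HTTP_USER_AGENT} ^.{0,40}$
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Image [NC]
RewriteRule (/|\.html)$ - [F]
```

Because the rule pattern is anchored to `/` and `.html`, requests for robots.txt and image files fall through untouched, as described above.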

 

Samizdata
msg:4414599 · 4:08 am on Feb 6, 2012 (gmt 0)

below a certain length, it's a robot

As I read this thread my user-agent is "lucy24" (without the quotes).

The ability to specify any user-agent is built in to my (mainstream) browser.

I am not a robot, though you are entitled to consider me undesirable.

In all honesty, you wouldn't be the first.

...

lucy24
msg:4414619 · 7:33 am on Feb 6, 2012 (gmt 0)

The ability to specify any user-agent is built in to my (mainstream) browser.

I think they all have it, though some are easier to find than others. (I detoured here to look.) Safari is definitely the most straightforward. Others you have to dig around in extensions.

But if you're pretending to be a robot, you can't complain if you're treated like one ;)

adrian20
msg:4414655 · 11:33 am on Feb 6, 2012 (gmt 0)

Commenting purely from my own experience:

A while ago I settled on a fixed maximum UA length of 208 for all visits. I started at 180, but found that some genuine user-agents are really long.

I confess I also tried to add a second, lower limit of 19 (as you've mentioned), so the accepted range would run from 19 to 208, but I could never get that second rule working: every attempt ended in a 500 error.

As for whether there is a single safe number to use, I'm not sure. I try to be practical and realistic about these cases: I can't satisfy everyone, and some of us humans are trying to make machines act more like humans.

And given the times we live in on the Internet, the security of my content has become a matter of national security. So that second rule is still pending: 208 as the upper bound, 19 as the lower.

Samizdata
msg:4414687 · 2:18 pm on Feb 6, 2012 (gmt 0)

if you're pretending to be a robot, you can't complain if you're treated like one

I wasn't pretending to be a robot at all.

And I have no complaints about my treatment on this site (which allowed me to post).

Bot control is an art rather than a science.

...

Pfui
msg:4414704 · 3:24 pm on Feb 6, 2012 (gmt 0)

Bot control is also a heckuva lot simpler if you whitelist. Start with the requirement that all UAs start with the word Mozilla and tweak to taste.
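A minimal skeleton of that approach (illustrative only; the allow-list here is an example, not a recommendation) could look like:

```apache
# Whitelist skeleton: forbid any UA that does not begin with "Mozilla"
# and is not on a short allow-list (example entries only; tweak to taste).
RewriteCond %{HTTP_USER_AGENT} !^Mozilla
RewriteCond %{HTTP_USER_AGENT} !^(Opera|Googlebot-Image) [NC]
RewriteRule (/|\.html)$ - [F]
```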

lucy24
msg:4414775 · 5:44 pm on Feb 6, 2012 (gmt 0)

I don't much care for Opera myself, but surely that's overkill? :)

Pfui
msg:4414820 · 7:39 pm on Feb 6, 2012 (gmt 0)

Of course, "tweak to taste" means you allow the ones you want, block the ones you don't. (And a huge, HUGE number of Operas are actually bad bots, btw. Including Mozilla.*Opera versions.)

Why mess with 'safe numbers' theories when an easily implemented solution already exists? Why not simply whitelist?

dstiles
msg:4414864 · 9:25 pm on Feb 6, 2012 (gmt 0)

I think the official length of a UA is 127 bytes but a LOT of browsers, especially MSIE, exceed that. Often they include sub-strings of the original UA (eg "mozilla...6.0..." is inserted into the new UA of, eg, MSIE 7). This, as far as I can tell, is due in part to faulty MS updates (or faulty machines) and in part to GTB install/updates.

I find it far more reliable to use presence/absence of various headers in conjunction with known bad and suspect UAs and other things. I also check for certain characters in the UA and block on those - this catches things like "vlc/1.1.6" and "2".
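The exact character tests aren't spelled out here, but one guess at the kind of check that would catch both "vlc/1.1.6" and "2" is rejecting any UA containing neither a space nor a parenthesis (a sketch, not the actual rule):

```apache
# Sketch only: a UA with no opening parenthesis and no whitespace is rarely
# a real browser, so "vlc/1.1.6" and "2" both fail this test. Note it would
# also catch short named robots such as Googlebot-Image/1.0.
RewriteCond %{HTTP_USER_AGENT} !\(
RewriteCond %{HTTP_USER_AGENT} !\s
RewriteRule (/|\.html)$ - [F]
```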

Opera 9 is obsolete, although I still allow it at the moment, depending on other things. My Ubuntu copy is version 11.61. Most versions below 10, I would have thought, could be discarded UNLESS it's coming from a mobile: some of those are weird. Same with Firefox up to a point: 3.6 is still "official" in some linux installations but we're now up to 10 (6 numbers in under a year!) - for what seem to be some very stupid reasons. MSIE 6 and below can be rejected unless you think some of your punters really are dumb enough to be using MSIE 6 a couple of years after MS discontinued support for it.

Webcollage is sort of legit. Some Linux machines use it as default browser "wallpaper" or something like that. I've been blocking it for years and it's one reason why my favicons are not in the web roots.

Samizdata: come at me with anything other than a real browser (or reasonable approximation thereof) and you die! :)

I do check for lengths of querystrings. Anything over a reasonable length (on sites that accept QS) and it's IP-killed: it's almost always a SQL injection attempt or similar.
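That query-string check might be sketched as follows (the 255 threshold is an arbitrary example, not the figure used above):

```apache
# Reject any request whose query string runs to 255 characters or more;
# overlong query strings are almost always SQL-injection or similar probes.
RewriteCond %{QUERY_STRING} ^.{255,}
RewriteRule .* - [F]
```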

Samizdata
msg:4414899 · 12:04 am on Feb 7, 2012 (gmt 0)

come at me with anything other than a real browser (or reasonable approximation thereof) and you die!

Humans die, bots do not.

All of us here use various methods to control bots, and many involve the user-agent.

I do it myself, as I'm sure you all understand.

But the bottom line is that a user-agent is not proof of anything.

Whatever you decide my user-agent means, I remain human.

Or a reasonable approximation thereof.

...

lucy24
msg:4414964 · 4:48 am on Feb 7, 2012 (gmt 0)

MSIE 6 and below can be rejected unless you think some of your punters really are dumb enough to be using MSIE 6 a couple of years after MS discontinued support for it.

Heh. There exist people on this planet who still use MSIE for Mac-- and that means 5. I originally thought it was because they had very very old computers that just couldn't use anything else. Never mind that even ten years ago, there were better choices.

But it turns out you can't tell from the UA string. MSIE 5 doesn't know from Intel-- it just knows 68K vs. PPC-- so it says PPC. (I tested on myself.) Like those elderly www sites whose html divides the universe into MSIE and Netscape.

I think the official length of a UA is 127 bytes but a LOT of browsers, especially MSIE, exceed that.

There doesn't seem to be any limit to the number of .NET CLR statements they will throw in. And the Googlebot-Mobile UA I quoted in the first post is over 200 characters. Doubly improbable because mobiles tend to have shorter UAs. Less room, haha.

Wonky spacing might be another good one. Either missing or too many. I recently blocked the plainclothes MSNbot-- the one that claims to be MSIE 7. Different thread. One of its distinguishing traits is that every .NET CLR is preceded by a double space.

One of my favorite recent UAs is
Mozilla/4.0 (compatible; MSIE 4.01; Digital AlphaServer 1000A 4/233; Windows NT; Powered By 64-Bit Alpha Processor)

I have no idea what that would be in real life. But it sounds a lot like my pocket calculator, which is 30 years old and still going strong.

And then there was the query that emerged from the logs as
Rat.+++On.++Bed.++Cheese

I tried to disentangle those plusses, but gave up. Not sure about the full stops, either.

I get weird queries, so I can't mess with them. The ones that decode to multi-byte characters in particular are definitely legit.

dstiles
msg:4415224 · 11:01 pm on Feb 7, 2012 (gmt 0)

Samizdata - I agree the UA is not proof of the client type or use but if my system can't recognise it as a valid browser it's rejected - is that a better word than die? :)

Lucy - turns out I'm wrong on 127 - it was 256 BUT...

From an MSDN blog Feb 2010:

"For security reasons, ASP.NET (1.1 and 2.0) limited the maximum length of User-Agent strings to 256 characters. We have released fixes that ASP.NET recognize User-Agent strings that contain as many as 512 characters.
ASP.NET 1.1

FIX: You cannot browse an ASP.NET 1.1 Web site if the User-Agent string that is in the browser contains more than 256 characters"

From a 2009 MSDN blog:

"In IE7 and below, if the UA string grows to over 260 characters, the navigator.userAgent property is incorrectly computed."

Which could well be the problem with all of the multiple mozilla inclusions etc.

Also from the same blog:

"Many websites will return only error pages upon receiving a UA header over a fixed length (often 256 characters)"

...and...

"Notably, the RFC does not define a maximum length for the header value"

So my original 127 was incorrect and what I thought was an RFC limitation turns out to be a browser limitation.

I would certainly kill the rat. The + signs, by the way, are probably spaces as seen in querystrings etc.
