|Simple user agent checks|
Fast checks for user agents
I am looking for good, simple heuristics to check whether a user agent is "believable". For example, I want a user agent like "RAV1.23" to be rejected, but a normal one like "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" to be accepted. I'm thinking about these rules:
1. If the user agent doesn't contain "/", reject it.
2. If the user agent doesn't start with a letter, reject it (this catches some bizarre ones)
3. If the user agent doesn't contain at least one space, reject it (This seems like a bad idea, e.g. "NokiaE66/UCWEB18.104.22.168/28/800" looks legit but has no space)
4. If user agent is 10 characters or less, reject it (This allows "Mozilla/4.0" but nothing shorter)
Will these rules reject any legitimate user agents? What I'm basically looking for is rules that will detect unknown, suspicious user agents while not rejecting any legitimate ones.
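For reference, the four rules above could be sketched like this (a minimal illustration, not a recommended filter; note that rule 3 rejects "Mozilla/4.0" itself, which is part of why it seems dubious):

```python
import re

def is_believable_ua(ua: str) -> bool:
    """Apply the four heuristic rules from the question. Returns True
    only if the user agent passes every check."""
    if "/" not in ua:                  # rule 1: must contain a slash
        return False
    if not re.match(r"[A-Za-z]", ua):  # rule 2: must start with a letter
        return False
    if " " not in ua:                  # rule 3: must contain a space
        return False
    if len(ua) <= 10:                  # rule 4: must be longer than 10 chars
        return False
    return True
```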
Your rules might be different from others'... There are few hard and fast rules (other than banning known server farms by IP... and even then some might disagree).
You do for your site what is best for your site.
That said, any deny is a potential deny to a real live human being (potential! is! key! word!). Allow all, or allow only what you want (called whitelisting), where you permit a specific set and reject the rest.
Whatever makes it easier to sleep at night is what you will do. Chasing bad actors and compiling a huge list is a lot of work... much easier to say who can come in the door than to say let everybody in Except this one, that one, that other one, that one other there, whoops, he's a friend of that one, and by golly there's another one on the same block as that first one... Blacklisting will make you nuts.
Looking at your rules 1, 2, 3, and 4 above, that's close to whitelisting. Investigate that concept. See if it fits with your intended audience...
Another factor in whitelisting's favor: properly implemented, it tends to have much less overhead.
I whitelist and then have a very small blacklist for those which manage to pass through the whitelist ruleset.
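That layered approach could be sketched roughly like this (the patterns below are placeholders for illustration, not a recommended ruleset):

```python
import re

# Hypothetical patterns for illustration only -- a real ruleset would be
# tuned to the site's actual audience.
WHITELIST = [
    re.compile(r"^Mozilla/\d"),  # mainstream browsers
    re.compile(r"^Opera/\d"),
]
BLACKLIST = [
    re.compile(r"EmailSiphon", re.I),  # a known scraper that mimics browsers
]

def allow_ua(ua: str) -> bool:
    """Whitelist first; a very small blacklist then catches impostors
    that manage to pass through the whitelist ruleset."""
    if not any(p.search(ua) for p in WHITELIST):
        return False
    return not any(p.search(ua) for p in BLACKLIST)
```

The point of the ordering is that the blacklist only ever has to cover agents that already look legitimate, which keeps it tiny.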
|3. If the user agent doesn't contain at least one space, reject it |
I've seen more of the opposite: robotic UAs that contain multiple spaces like "  " * in the middle. Can't remember a real human in that form.
* In the Forums, as in HTML, you keep the extra space from being eaten by judicious use of nonbreaking spaces ;)
Thanks for the feedback everyone :) lucy24, are you talking about multiple spaces? Yes, that does sound suspicious..
The reason I'm avoiding whitelisting is mobile user agents - there seem to be a lot of them. We may end up doing it though... there are a lot of PHP libraries available that identify mobile UAs, and we could use those.
I definitely want to avoid explicit blacklisting, because they multiply every day.
|The reason I'm avoiding whitelisting is mobile user agents |
That's no excuse not to whitelist, as there are way more bad bots to blacklist.
There are some pretty simple ways of detecting mobile agents using a combination of things in the user agent and headers that identify most of them, and some simple PHP scripts already exist that do it.
Worst case, just use browscap.ini to look them up.
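One such combination (a token check on the UA string plus a couple of telltale request headers) might look like this sketch. The token list is illustrative only; the X-WAP-Profile and WAP Accept values are real but coverage here is far from complete:

```python
# Illustrative token list -- not exhaustive.
MOBILE_TOKENS = ("Mobile", "Android", "iPhone", "Nokia", "Opera Mini", "UCWEB")

def looks_mobile(ua: str, headers: dict) -> bool:
    """Rough mobile check combining UA tokens with request headers."""
    if any(tok in ua for tok in MOBILE_TOKENS):
        return True
    # X-WAP-Profile is sent by many older handsets.
    if "x-wap-profile" in {k.lower() for k in headers}:
        return True
    # WAP-capable devices often advertise this media type.
    return "application/vnd.wap.xhtml+xml" in headers.get("Accept", "")
```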