Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Simple user agent checks
Fast checks for user agents
btherl




msg:4476573
 11:41 pm on Jul 17, 2012 (gmt 0)

Hi,

I am looking at good, simple heuristics to check if a user agent is "believable". For example, I want a user agent like "RAV1.23" to be rejected, but a normal one like "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" to be accepted. I'm thinking about these rules

1. If the user agent doesn't contain "/", reject it.
2. If the user agent doesn't start with a letter, reject it (this catches some bizarre ones)
3. If the user agent doesn't contain at least one space, reject it (This seems like a bad idea, eg "NokiaE66/UCWEB8.5.0.163/28/800" looks legit but has no space)
4. If user agent is 10 characters or less, reject it (This allows "Mozilla/4.0" but nothing shorter)

Will these rules reject any legitimate user agents? What I'm basically looking for are rules that will detect unknown, suspicious user agents while not rejecting any legitimate ones.
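
The four rules listed above can be sketched in a few lines of Python. This is only an illustration of the rules as written, not a tested filter: the function name `is_believable` and the `require_space` flag are made up for this example, and rule 3 is off by default because the post itself doubts it.

```python
def is_believable(user_agent: str, require_space: bool = False) -> bool:
    """Return False if any of the proposed rules rejects the user agent."""
    # Rule 1: must contain a slash somewhere.
    if "/" not in user_agent:
        return False
    # Rule 2: must start with an ASCII letter.
    first = user_agent[:1]
    if not (first.isascii() and first.isalpha()):
        return False
    # Rule 3 (optional): must contain at least one space.
    if require_space and " " not in user_agent:
        return False
    # Rule 4: ten characters or fewer is too short.
    if len(user_agent) <= 10:
        return False
    return True
```

With the default settings, "RAV1.23" is rejected by rule 1, while "Mozilla/4.0" (11 characters) and "NokiaE66/UCWEB8.5.0.163/28/800" are both accepted. Passing `require_space=True` rejects the Nokia string, which is the false positive the post worries about.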

Thanks.

 

tangor




msg:4476673
 9:49 am on Jul 18, 2012 (gmt 0)

Your rules might be different than others... There are few hard and fast rules (other than banning known server farms by IP... and even then some might disagree).

You do for your site what is best for your site.

That said, any deny is a potential deny to a real live human being (potential! is! key! word!). Allow all, or allow only what you want (called white listing) where you only allow a specific set and reject the rest.

Whatever makes it easier to sleep at night is what you will do. Chasing bad actors and compiling a huge list is a lot of work... much easier to say who can come in the door than to say let everybody in Except this one, that one, that other one, that one over there, whoops, he's a friend of that one, and by golly there's another one on the same block as that first one... Black listing will make you nuts.

Looking at your 1, 2, 3, 4 above, that's close to white listing. Investigate that concept. See if it fits with your intended audience...

motorhaven




msg:4476849
 6:09 pm on Jul 18, 2012 (gmt 0)

Another factor in white-listing's favor is that, properly implemented, it tends to have much less overhead.

I whitelist and then have a very small blacklist for those which manage to pass through the whitelist ruleset.
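
A minimal Python sketch of that ordering: check the whitelist first, then a short blacklist for anything that slipped through. The patterns and the bot name `ExampleScraper` are hypothetical placeholders, not anyone's real ruleset.

```python
import re

# Hypothetical whitelist: only user agents shaped like common browsers get in.
WHITELIST = [re.compile(p) for p in (
    r"^Mozilla/\d\.\d \(",
    r"^Opera/\d",
)]

# Hypothetical short blacklist for agents that pass the whitelist anyway.
BLACKLIST = [re.compile(p, re.IGNORECASE) for p in (
    r"ExampleScraper",
)]

def allow(user_agent: str) -> bool:
    """Whitelist first; the blacklist only runs on what the whitelist admits."""
    if not any(p.search(user_agent) for p in WHITELIST):
        return False
    return not any(p.search(user_agent) for p in BLACKLIST)
```

Most junk fails the whitelist on the first test, so the blacklist stays small and is consulted only for the few requests that look like browsers.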

lucy24




msg:4476925
 11:05 pm on Jul 18, 2012 (gmt 0)

3. If the user agent doesn't contain at least one space, reject it

I've seen more of the opposite: Robotic UAs that contain multi-spaces like "   " * in the middle. Can't remember a real human in that form.


* In the Forums, as in HTML, you keep the extra space from being eaten by judicious   use   of   nonbreaking   spaces ;)
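
A check for runs of spaces like the ones described above might look like this in Python. It is a sketch only: the name `has_multi_space` is invented here, and the check covers plain ASCII spaces, not tabs or nonbreaking spaces.

```python
import re

# Two or more consecutive ASCII spaces anywhere in the string.
MULTI_SPACE = re.compile(r" {2,}")

def has_multi_space(user_agent: str) -> bool:
    """True if the user agent contains a run of two or more spaces."""
    return MULTI_SPACE.search(user_agent) is not None
```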

btherl




msg:4476947
 12:57 am on Jul 19, 2012 (gmt 0)

Thanks for the feedback, everyone :) lucy24, are you talking about multiple spaces? Yes, that does sound suspicious.

The reason I'm avoiding whitelisting is mobile user agents - there seem to be a lot of them. We may end up doing it though. There are a lot of PHP libraries available that identify mobile UAs and we could use those.

I definitely want to avoid explicit blacklisting, because bad user agents multiply every day.

incrediBILL




msg:4477587
 6:14 pm on Jul 20, 2012 (gmt 0)

The reason I'm avoiding whitelisting is mobile user agents


That's no excuse not to whitelist as there are way more bad bots to blacklist.

There are some pretty simple ways of detecting mobile agents, using a combination of things in the user agent and headers, that ID most of them, and there are some simple PHP scripts that already do it.
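
As a rough illustration of combining the user agent with other request headers, here is a Python sketch. The token list and header names are assumptions chosen for the example (commonly seen mobile hints such as `X-Wap-Profile`), not a complete or authoritative set, and `looks_mobile` is a made-up name.

```python
# Assumed substrings that often appear in mobile user agents (not exhaustive).
MOBILE_UA_TOKENS = (
    "mobile", "android", "iphone", "ipod", "blackberry",
    "nokia", "opera mini", "windows phone", "symbian", "ucweb",
)

# Assumed request headers that are typically sent only by mobile clients.
MOBILE_HEADERS = ("x-wap-profile", "x-operamini-phone-ua")

def looks_mobile(headers: dict) -> bool:
    """Guess whether a request comes from a mobile client."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    ua = lowered.get("user-agent", "").lower()
    if any(token in ua for token in MOBILE_UA_TOKENS):
        return True
    if any(name in lowered for name in MOBILE_HEADERS):
        return True
    # WAP content types in Accept are another hint.
    return "vnd.wap" in lowered.get("accept", "").lower()
```

A request identified this way could then be admitted by a whitelist without listing every handset's user agent individually.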

Worst case, just use browscap.ini to look them up.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved