
Search Engine Spider and User Agent Identification Forum

    
Detecting human users via user agent
Not looking for 100% accuracy, close enough will do
bhonda
 9:13 am on Jan 23, 2013 (gmt 0)

I'm sure this will have been asked before, but I couldn't find it.

What I'm trying to do is give a rough estimate of how many visitors have hit each page on one of our sites. Basically, all I have to go on is the user agent. I'm logging all accesses, including all bots, but I want to filter out anything that is obviously a bot.

These figures only need to be rough, so I'm not worried about the odd access here and there being missed or included.

Is there a really quick way that you guys know of to do this?

Could it be as simple as checking for the string 'bot' - do you think that would grab 80% of the bots? I'd be fine with that!
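A minimal sketch of that substring filter, assuming Apache "combined" log format where the user agent is the last quoted field (the file name and patterns here are placeholders, not anything from the thread):

```python
import re
from collections import Counter

# Crude sketch: count page hits per path, skipping self-identifying bots.
LOG_FILE = "access.log"   # hypothetical path
LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')
BOT_HINTS = ("bot", "crawler", "spider")   # catches UAs that admit to being bots

page_hits = Counter()
with open(LOG_FILE) as fh:
    for line in fh:
        m = LINE_RE.search(line)
        if not m:
            continue
        if any(hint in m.group("ua").lower() for hint in BOT_HINTS):
            continue                       # obviously a bot, skip it
        page_hits[m.group("path")] += 1

for path, hits in page_hits.most_common(20):
    print(f"{hits:8d}  {path}")
```

This only removes bots that announce themselves; anything faking a browser user agent slips through, which is what the replies below are getting at.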

 

incrediBILL
 12:51 am on Jan 24, 2013 (gmt 0)

Checking user agents is wholly insufficient as most bad bots use the same user agents humans do. If you have the IP address, you could filter out anything that comes from data centers like RackSpace, ServerBeach, ThePlanet, etc. as no humans exist there.

The best tell I've found, ever, is javascript that checks for human activity like mouse or keyboard input and passes that information along to the server. Second best is to examine the HTTP headers, because most bad bots with fake user agents don't bother to fake the headers properly.

If all you have is the log file, the best you can do is look for clients that didn't load graphics, css, or javascript, things bots don't need. Check the speed and volume of access, as humans can't read 30 pages a minute, or 3 pages in a second, or hundreds or thousands of pages a day. Also, real humans don't ask for robots.txt and rarely request legal.html, policy.html, etc.
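A rough sketch of that speed-and-volume check, again assuming Apache combined logs; the file name, regex, and cutoffs (30 pages a minute, 500 pages a day) are illustrative, not tuned values from the thread:

```python
import re
from collections import defaultdict
from datetime import datetime

LOG_FILE = "access.log"   # hypothetical path
LINE_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:GET|POST) (?P<path>\S+)')
PAGE_HINTS = (".html", ".htm", "/")        # count pages only, not images/css/js

minute_hits = defaultdict(int)             # (ip, minute) -> page requests
daily_hits = defaultdict(int)              # ip -> page requests

with open(LOG_FILE) as fh:
    for line in fh:
        m = LINE_RE.match(line)
        if not m or not m.group("path").lower().endswith(PAGE_HINTS):
            continue
        ts = datetime.strptime(m.group("ts").split()[0], "%d/%b/%Y:%H:%M:%S")
        minute_hits[(m.group("ip"), ts.strftime("%Y%m%d%H%M"))] += 1
        daily_hits[m.group("ip")] += 1

fast = {ip for (ip, _), n in minute_hits.items() if n > 30}   # >30 pages/minute
heavy = {ip for ip, n in daily_hits.items() if n > 500}       # >500 pages/day
print("Likely bots:", sorted(fast | heavy))
```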

Good luck.

lucy24
 1:25 am on Jan 24, 2013 (gmt 0)

Until recently, you could get close to 99% just by counting requests for favicon.ico. And then the mobiles came out. Some ask for one of the eight variations on "apple-touch-icon",* some don't.

:: rant ::

humans can't read 30 pages a minute, or 3 pages in a second

And conversely: If there are requests for associated files like images, humans will get them all in a lump, as fast as the server will deliver them. Robots tend to space their requests evenly, regardless of filetype.

If it's a human on a satellite connection, all bets are off.


* If you have two of the eight, it will ask for one of the other six.
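The two tells in this post can be checked from the same logs: whether a client ever fetched favicon.ico (or an apple-touch-icon variant), and whether its requests are metronomically spaced. A sketch under the same assumptions as above; the 0.25 evenness cutoff and the minimum of 5 requests are arbitrary:

```python
import re
import statistics
from collections import defaultdict
from datetime import datetime

LOG_FILE = "access.log"   # hypothetical path
LINE_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+)')

icon_ips = set()                           # IPs that fetched an icon (human-ish)
times = defaultdict(list)                  # ip -> request timestamps

with open(LOG_FILE) as fh:
    for line in fh:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, path = m.group("ip"), m.group("path").lower()
        if path.endswith("favicon.ico") or "apple-touch-icon" in path:
            icon_ips.add(ip)
        ts = datetime.strptime(m.group("ts").split()[0], "%d/%b/%Y:%H:%M:%S")
        times[ip].append(ts)

for ip, stamps in times.items():
    if len(stamps) < 5:
        continue                           # too few requests to judge
    stamps.sort()
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    mean = statistics.mean(gaps)
    spread = statistics.pstdev(gaps)
    evenly_spaced = mean > 0 and spread / mean < 0.25   # low variation = robotic pacing
    if evenly_spaced and ip not in icon_ips:
        print(f"{ip}: evenly spaced requests, never fetched an icon - probably a bot")
```

As noted above, a human on a slow or satellite connection can look robotic by this measure, so treat it as one signal among several.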

Samizdata
 1:43 am on Jan 24, 2013 (gmt 0)

These figures only need to be rough

You could filter out those with "bot" in the string as suggested.

Then subtract an estimated percentage that seems likely (think "shrinkage").

how many visitors have hit each page

The percentage for the home page should be significantly higher as many bots go no further.

Whatever you do will be a guesstimate unless you spend a lot of time on it.

...

jlnaman
 11:26 pm on Mar 24, 2013 (gmt 0)

If you track IP address by user agent, you may detect non-humans by changes in the user agent. I have noticed how some scumbots rapidly mutate their user agent strings. For example, IP addr 198.27.74.10 attacks 3 times in a row, about 8 seconds apart, every 12 hours. No two of its user agent strings are ever the same. Here are 10 recent examples (over a day and a half):
Mozilla/5.0 (Windows NT 6.2; rv:8.0) Gecko/20050108
Mozilla/5.0 (Linux i686; rv:11.0) Gecko/20000505 Firefox/11.0
Mozilla/5.0 (68K) AppleWebKit/587.0 (KHTML, live Gecko)
Mozilla/5.0 (Linux x86_64; rv:5.0) Gecko/20090507 Firefox/5.0
Mozilla/5.0 (Linux x86_64; rv:9.0) Gecko/20000927 Firefox/9.0
Mozilla/5.0 (Linux x86_64; rv:12.0) Gecko/20000621 Firefox/12.0
Mozilla/5.0 (compatible; MSIE 4.0; 68K; Win64;
Mozilla/5.0 (compatible; MSIE 10.0; Windows NT
Mozilla/5.0 (Windows NT 6.2) AppleWebKit/587.0 (KHTML,
Mozilla/5.0 (68K; rv:11.0) Gecko/20020906 Firefox/11.0
#=========
Others show more diversity:
Same IP address, all within 90 seconds:
66.249.73.112 = DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1;+http://www.google.com/bot.html)
66.249.73.112 = SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0
66.249.73.112 = Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Same IP address, six minutes apart:
50.30.34.47 = SEOstats 2.1.0 https://github.com/eyecatchup/SEOstats
50.30.34.47 = wscheck.com/1.0.0 (+http://wscheck.com/)
50.30.34.47 = bot.wsowner.com/1.0.0 (+http://wsowner.com/)
#=========
All data is from my Apache logs over the last 24 days: 158 unique user agents.
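A sketch of that idea, flagging IPs that present many different user agent strings; it assumes Apache combined logs (user agent as the last quoted field), and the cutoff of 5 distinct UAs per IP is arbitrary:

```python
import re
from collections import defaultdict

LOG_FILE = "access.log"   # hypothetical path
LINE_RE = re.compile(r'^(?P<ip>\S+) .*"(?P<ua>[^"]*)"\s*$')

uas_by_ip = defaultdict(set)               # ip -> set of user agent strings seen
with open(LOG_FILE) as fh:
    for line in fh:
        m = LINE_RE.match(line)
        if m:
            uas_by_ip[m.group("ip")].add(m.group("ua"))

for ip, uas in sorted(uas_by_ip.items(), key=lambda kv: len(kv[1]), reverse=True):
    if len(uas) >= 5:
        print(f"{ip}: {len(uas)} distinct user agents")
```

The Googlebot example above shows the caveat: legitimate crawlers also rotate user agents from one IP, so a high count is a reason to look closer, not proof by itself.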

lucy24
 2:27 am on Mar 25, 2013 (gmt 0)

I remember one Polish robot that appended the current time to its UA string and spaced its visits just far enough apart that the UA was never the same twice in a row.

Oh, and its clock was slow ;)
