Forum Moderators: Ocean10000 & incrediBILL

Detecting human users via user agent

Not looking for 100% accuracy, close enough will do

9:13 am on Jan 23, 2013 (gmt 0)

5+ Year Member

I'm sure this will have been asked before, but I couldn't find it.

What I'm trying to do is give a rough estimate as to how many visitors have hit each page in one of our sites. Basically, all I have to go off is the user agent. I'm logging all accesses, including all bots, but I want to filter out anything that is obviously a bot.

These figures only need to be rough, so I'm not worried about the odd access here and there being missed or included.

Is there a really quick way that you guys know of to do this?

Could it be as simple as checking for the string 'bot' - do you think that would grab 80% of the bots? I'd be fine with that!
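A rough sketch of the substring check being asked about, in Python. The record layout (path, user agent) and every hint other than 'bot' are assumptions for illustration; this is a quick filter, not a reliable bot detector.

```python
from collections import Counter

# "bot" is the substring proposed in the post; the other hints are assumed extras.
BOT_HINTS = ("bot", "crawl", "spider", "slurp")


def looks_like_bot(user_agent: str) -> bool:
    """Return True if the user agent carries an obvious bot marker."""
    ua = user_agent.lower()
    # Treating an empty user agent as a bot is an assumption, not a rule.
    return not ua.strip() or any(hint in ua for hint in BOT_HINTS)


def count_human_hits(records):
    """Count hits per page, skipping obvious bots.

    records: iterable of (path, user_agent) pairs taken from the access log.
    """
    hits = Counter()
    for path, user_agent in records:
        if not looks_like_bot(user_agent):
            hits[path] += 1
    return hits
```

Anything that sends a browser-style user agent will still be counted as a visitor, so the totals are an upper bound at best.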
12:51 am on Jan 24, 2013 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

Checking user agents is wholly insufficient as most bad bots use the same user agents humans do. If you have the IP address, you could filter out anything that comes from data centers like RackSpace, ServerBeach, ThePlanet, etc. as no humans exist there.
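For the data center idea, a minimal Python sketch using the standard ipaddress module. The ranges below are placeholders from the reserved documentation blocks, not real hosting ranges; you would have to supply the actual CIDR blocks of the hosts you want to exclude.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges), NOT real data center ranges.
DATACENTER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_datacenter_ip(ip: str) -> bool:
    """Return True if the address falls inside any listed range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)
```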

The best tell I've found, ever, is JavaScript that checks for human activity such as mouse or keyboard input and passes that information along to the server. Second best is to examine the HTTP headers, because most bad bots with fake user agents don't bother to fake the headers properly.

Using just user agents, the best you can do is look in the log file for signs such as a visitor that didn't load graphics, CSS, or JavaScript, which are things bots don't need. Check the speed and volume of access, as humans can't read 30 pages a minute, or 3 pages in a second, or hundreds or thousands of pages a day. Also, real humans don't ask for robots.txt and rarely request legal.html, policy.html, etc.
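Those log-file signs could be sketched in Python roughly as follows. The record layout (ip, unix time, path), the asset suffix list, and the function name are all assumptions; the 30-per-minute threshold is taken from the remark above.

```python
from collections import defaultdict

# Assumed list of file types that a browser fetches but most bots skip.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")
MAX_PAGES_PER_MINUTE = 30  # from the "30 pages a minute" rule of thumb


def suspicious_ips(records):
    """Return {ip: set of reasons} for clients that behave like bots.

    records: iterable of (ip, unix_time, path) tuples from the access log.
    """
    pages = defaultdict(list)   # page request times per IP
    got_asset = set()           # IPs that loaded at least one asset
    reasons = defaultdict(set)
    for ip, ts, path in records:
        path = path.lower().split("?", 1)[0]
        if path == "/robots.txt":
            reasons[ip].add("robots.txt")
        elif path.endswith(ASSET_SUFFIXES):
            got_asset.add(ip)
        else:
            pages[ip].append(ts)
    for ip, times in pages.items():
        if ip not in got_asset:
            reasons[ip].add("no assets")
        # Sliding window: count page requests within any span shorter than 60 s.
        times.sort()
        start = 0
        for end, t in enumerate(times):
            while t - times[start] >= 60:
                start += 1
            if end - start + 1 >= MAX_PAGES_PER_MINUTE:
                reasons[ip].add("too fast")
                break
    return dict(reasons)
```

Clients with a warm browser cache may also show up under "no assets", so it is probably safer to require more than one reason before discarding a visitor.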

Good luck.
1:25 am on Jan 24, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

Until recently, you could get close to 99% just by counting requests for favicon.ico. And then the mobiles came out. Some ask for one of the eight variations on "apple-touch-icon",* some don't.

:: rant ::

humans can't read 30 pages a minute, or 3 pages in a second

And conversely: If there are requests for associated files like images, humans will get them all in a lump, as fast as the server will deliver them. Robots tend to space their requests evenly, regardless of filetype.
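One way to put a number on "spaced evenly" is the variation in the gaps between one client's requests. This is only a sketch: the function names and the 0.25 cutoff are assumptions, and the cutoff would need tuning against real logs.

```python
from statistics import mean, pstdev


def gap_variation(timestamps):
    """Coefficient of variation of the gaps between consecutive requests.

    Returns None with fewer than three requests or when every gap is zero.
    """
    times = sorted(timestamps)
    gaps = [b - a for a, b in zip(times, times[1:])]
    if len(gaps) < 2:
        return None
    avg = mean(gaps)
    if avg == 0:
        return None
    return pstdev(gaps) / avg


def evenly_spaced(timestamps, threshold=0.25):
    """True when request gaps are nearly uniform (robot-like)."""
    cv = gap_variation(timestamps)
    return cv is not None and cv < threshold
```

A human page view followed by a lump of image requests gives a few tiny gaps and one long one, so the variation is high; a robot ticking along every 8 seconds gives a variation near zero.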

If it's a human on a satellite connection, all bets are off.

* If you have two of the eight, it will ask for one of the other six.
1:43 am on Jan 24, 2013 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member

These figures only need to be rough

You could filter out those with "bot" in the string as suggested.

Then subtract an estimated percentage that seems likely (think "shrinkage").

how many visitors have hit each page

The percentage for the home page should be significantly higher as many bots go no further.

Whatever you do will be a guesstimate unless you spend a lot of time on it.

11:26 pm on Mar 24, 2013 (gmt 0)

If you track User Agents by IP address, you may detect non-humans by changes in the User Agent. I have noticed how some scumbots rapidly mutate their User Agent strings. For example, one IP address attacks 3 times in a row, about 8 seconds apart, every 12 hours. No two user agent strings are ever the same. Here are 10 recent examples (over a day and a half):
Mozilla/5.0 (Windows NT 6.2; rv:8.0) Gecko/20050108
Mozilla/5.0 (Linux i686; rv:11.0) Gecko/20000505 Firefox/11.0
Mozilla/5.0 (68K) AppleWebKit/587.0 (KHTML, live Gecko)
Mozilla/5.0 (Linux x86_64; rv:5.0) Gecko/20090507 Firefox/5.0
Mozilla/5.0 (Linux x86_64; rv:9.0) Gecko/20000927 Firefox/9.0
Mozilla/5.0 (Linux x86_64; rv:12.0) Gecko/20000621 Firefox/12.0
Mozilla/5.0 (compatible; MSIE 4.0; 68K; Win64;
Mozilla/5.0 (compatible; MSIE 10.0; Windows NT
Mozilla/5.0 (Windows NT 6.2) AppleWebKit/587.0 (KHTML,
Mozilla/5.0 (68K; rv:11.0) Gecko/20020906 Firefox/11.0
Others show more diversity:
Same IP address, all within 90 seconds:
DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1;+http://www.google.com/bot.html)
SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/ (GUI) MMP/2.0
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Same IP address, six minutes apart:
SEOstats 2.1.0 [github.com...]
wscheck.com/1.0.0 (+http://wscheck.com/)
bot.wsowner.com/1.0.0 (+http://wsowner.com/)
All data from my Apache Logs, in the last 24 days. 158 unique User agents.
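Tracking User Agents per IP address could look something like this in Python. The (ip, user agent) record layout and the cutoff of 3 distinct agents are assumptions; an office or mobile gateway sharing one address will legitimately show several browsers, so treat the result as a hint only.

```python
from collections import defaultdict


def user_agents_by_ip(records):
    """Group distinct user agent strings by client IP.

    records: iterable of (ip, user_agent) pairs from the access log.
    """
    seen = defaultdict(set)
    for ip, user_agent in records:
        seen[ip].add(user_agent)
    return seen


def mutating_ips(records, max_agents=3):
    """Return the IPs that presented more than max_agents distinct user agents."""
    return {
        ip
        for ip, agents in user_agents_by_ip(records).items()
        if len(agents) > max_agents
    }
```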
2:27 am on Mar 25, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

I remember one Polish robot that appended the current time to its UA string, and spaced its visits just far enough apart that this meant the UA would never be the same two times in a row.

Oh, and its clock was slow ;)