|Detecting human users via user agent|
Not looking for 100% accuracy, close enough will do
I'm sure this will have been asked before, but I couldn't find it.
What I'm trying to do is give a rough estimate as to how many visitors have hit each page in one of our sites. Basically, all I have to go off is the user agent. I'm logging all accesses, including all bots, but I want to filter out anything that is obviously a bot.
These figures only need to be rough, so I'm not worried about the odd access here and there being missed or included.
Is there a really quick way that you guys know of to do this?
Could it be as simple as checking for the string 'bot' - do you think that would grab 80% of the bots? I'd be fine with that!
Checking user agents is wholly insufficient, as most bad bots send the same user agents humans do. If you have the IP address, you could filter out anything that comes from data centers like RackSpace, ServerBeach, ThePlanet, etc., since no humans browse from there.
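A rough sketch of the data-center filter in Python, for what it's worth. The two networks below are only placeholder documentation ranges, not real RackSpace or ServerBeach allocations, and `DATACENTER_NETWORKS` and `is_datacenter_ip` are names made up for this example. You would have to look up the actual hosting-provider ranges yourself.

```python
import ipaddress

# ASSUMPTION: placeholder ranges (RFC 5737 documentation blocks).
# Real data-center CIDR blocks would have to be looked up and listed here.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(ip_string):
    """Return True if the address falls inside any listed network."""
    try:
        address = ipaddress.ip_address(ip_string)
    except ValueError:
        # Malformed address in the log: cannot be classified, so not flagged.
        return False
    return any(address in network for network in DATACENTER_NETWORKS)
```

Anything for which `is_datacenter_ip` returns True would simply be left out of the visitor count.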
Until recently, you could get close to 99% just by counting requests for favicon.ico. And then the mobiles came out. Some ask for one of the eight variations on "apple-touch-icon",* some don't.
:: rant ::
|humans can't read 30 pages a minute, or 3 pages in a second |
And conversely: If there are requests for associated files like images, humans will get them all in a lump, as fast as the server will deliver them. Robots tend to space their requests evenly, regardless of filetype.
If it's a human on a satellite connection, all bets are off.
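The two timing hints above (too many pages per minute, and requests spaced too evenly) could be sketched roughly like this in Python. It assumes you have already collected the request timestamps, in seconds, for one IP address. The function names and all thresholds (30 pages, 60 seconds, 10% variation, 5 requests minimum) are guesses for illustration, not tested values.

```python
import statistics


def exceeds_page_rate(timestamps, max_pages=30, window_seconds=60):
    """Return True if any window shorter than window_seconds holds
    max_pages or more page requests."""
    times = sorted(timestamps)
    start = 0
    for end, current in enumerate(times):
        # Move the window start forward until the window spans
        # less than window_seconds.
        while current - times[start] >= window_seconds:
            start += 1
        if end - start + 1 >= max_pages:
            return True
    return False


def looks_evenly_spaced(timestamps, max_variation=0.1, min_requests=5):
    """Return True if the gaps between requests are nearly identical.

    ASSUMPTION: "evenly spaced" is taken to mean that the standard deviation
    of the gaps is at most max_variation times the mean gap.
    """
    times = sorted(timestamps)
    if len(times) < min_requests:
        return False  # too few requests to judge
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap = statistics.mean(gaps)
    if mean_gap == 0:
        return False  # all requests at the same instant: the human "lump"
    return statistics.pstdev(gaps) / mean_gap <= max_variation
```

Neither check helps with the satellite-connection case, so treat a hit from either one as a hint rather than proof.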
* If you have two of the eight, it will ask for one of the other six.
|These figures only need to be rough |
You could filter out those with "bot" in the string as suggested.
Then subtract an estimated percentage that seems likely (think "shrinkage").
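In Python, those two steps might look something like the sketch below. The function name `estimate_human_hits` and the default shrinkage of 20% are pure assumptions for the example. Pick whatever percentage seems plausible for your own site.

```python
def estimate_human_hits(user_agents, shrinkage=0.2, markers=("bot",)):
    """Rough human-visitor estimate for one page.

    user_agents: one user agent string per logged hit.
    shrinkage:   ASSUMED fraction of the remaining hits that are still
                 bots pretending to be browsers.
    markers:     lowercase substrings that mark an obvious bot.
    """
    not_obvious_bots = [
        ua for ua in user_agents
        if not any(marker in ua.lower() for marker in markers)
    ]
    # Subtract the estimated percentage of disguised bots.
    return round(len(not_obvious_bots) * (1 - shrinkage))
```

Adding more markers such as "crawl" or "spider" is an obvious extension, but those are also just guesses.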
|how many visitors have hit each page |
The percentage for the home page should be significantly higher as many bots go no further.
Whatever you do will be a guesstimate unless you spend a lot of time on it.
If you track IP address by user agent, you may detect non-humans by changes in the user agent. I have noticed that some scumbots rapidly mutate their user agent strings. For example, IP address 126.96.36.199 attacks 3 times in a row, about 8 seconds apart, every 12 hours, and no two of its user agent strings are ever the same. Here are 10 recent examples (over a day and a half):
Mozilla/5.0 (Windows NT 6.2; rv:8.0) Gecko/20050108
Mozilla/5.0 (Linux i686; rv:11.0) Gecko/20000505 Firefox/11.0
Mozilla/5.0 (68K) AppleWebKit/587.0 (KHTML, live Gecko)
Mozilla/5.0 (Linux x86_64; rv:5.0) Gecko/20090507 Firefox/5.0
Mozilla/5.0 (Linux x86_64; rv:9.0) Gecko/20000927 Firefox/9.0
Mozilla/5.0 (Linux x86_64; rv:12.0) Gecko/20000621 Firefox/12.0
Mozilla/5.0 (compatible; MSIE 4.0; 68K; Win64;
Mozilla/5.0 (compatible; MSIE 10.0; Windows NT
Mozilla/5.0 (Windows NT 6.2) AppleWebKit/587.0 (KHTML,
Mozilla/5.0 (68K; rv:11.0) Gecko/20020906 Firefox/11.0
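Tracking user agents per IP address, as described above, can be sketched in a few lines of Python. This assumes the log has already been parsed into (ip, user_agent) pairs. The function name and the threshold of 3 distinct strings are assumptions for the example. Bear in mind that a shared office or mobile-carrier IP can legitimately show several user agents, so the threshold needs tuning.

```python
from collections import defaultdict


def ips_with_mutating_agents(log_entries, min_distinct=3):
    """Map each suspicious IP to its number of distinct user agents.

    log_entries:  iterable of (ip, user_agent) pairs.
    min_distinct: ASSUMED threshold; an IP seen with this many or more
                  different user agent strings is reported.
    """
    agents_by_ip = defaultdict(set)
    for ip, user_agent in log_entries:
        agents_by_ip[ip].add(user_agent)
    return {
        ip: len(agents)
        for ip, agents in agents_by_ip.items()
        if len(agents) >= min_distinct
    }
```

Restricting the count to a time window (say, one day) would probably cut down on false positives from shared addresses.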
Others show more diversity:
Same IP address, all within 90 seconds:
188.8.131.52 = DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1;+http://www.google.com/bot.html)
184.108.40.206 = SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/220.127.116.11.c.1.101 (GUI) MMP/2.0
18.104.22.168 = Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Same IP address, six minutes apart:
22.214.171.124 = SEOstats 2.1.0 https://github.com/eyecatchup/SEOstats
126.96.36.199 = wscheck.com/1.0.0 (+http://wscheck.com/)
188.8.131.52 = bot.wsowner.com/1.0.0 (+http://wsowner.com/)
All data is from my Apache logs over the last 24 days: 158 unique user agents.
I remember one Polish robot that appended the current time to its UA string, and spaced its visits just far enough apart that the UA would never be the same twice in a row.
Oh, and its clock was slow ;)