Is page request a Bot or Human?
Generic detection and filtering.
I'm trying to find the simplest way to detect whether a page request is from a real human visitor or a robot/crawler, etc.
It doesn't need to be 100% effective; 85-90% is fine.
Here is what I have; let me know if you think it's ineffective or know of a better way.
Also, if you know how to convert my ElseIf statements to a Select Case, please tell me; I tried and failed.
I'm using ASP/VBScript.
ua = Request.ServerVariables("HTTP_USER_AGENT")
If InStr(1, ua, "Opera") > 0 Then
    Response.Write "IS HUMAN"
ElseIf InStr(1, ua, "Mozilla") > 0 Then
    Response.Write "IS HUMAN"
ElseIf InStr(1, ua, "MSIE") > 0 Then
    Response.Write "IS HUMAN"
Else
    Response.Write "IS NOT HUMAN"
End If
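On the Select Case question: VBScript has a "Select Case True" idiom that lets you test a series of Boolean expressions, which is how an ElseIf chain like this one is usually converted. A minimal sketch using the same user-agent checks:

```vbscript
Dim ua
ua = Request.ServerVariables("HTTP_USER_AGENT")

' "Select Case True" matches the first Case whose expression is True
Select Case True
    Case InStr(1, ua, "Opera") > 0, _
         InStr(1, ua, "Mozilla") > 0, _
         InStr(1, ua, "MSIE") > 0
        Response.Write "IS HUMAN"
    Case Else
        Response.Write "IS NOT HUMAN"
End Select
```

Note that a plain `Select Case ua` won't work here, because each branch needs a substring test rather than an exact match.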
If you use that method then you'll be mis-classifying quite a few user agents. Most browsers (including MSIE) have "Mozilla" in the UA, but so do many spiders, including: Ask Jeeves/Teoma; grub-client; ZyBorg; and Slurp.
A more effective approach might be to look for the platform (i.e. "Linux", "Mac", "Win", etc.) to indicate a 'Human'.
I think for Windows there is a public "BROWSCAP.INI" file that people are using to filter traffic on their sites - try searching for reference to it here or on Google.
So are you saying that spiders don't identify a platform?
Or are their identified platforms something other than the standard "Linux", "Mac", "Win", etc.?
A user agent (UA) is just a string of text - how you interpret it is up to you. Most web browsers however will identify one of those platforms in the UA. A quick inspection of our logs shows that you might want to add the following to that list: "FreeBSD", "WebTV" and maybe "Lynx" as it doesn't appear to list a platform.
Others might want to double-check this but I think you could get 95% or even 99% accuracy using such a system.
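That platform test could be sketched in VBScript like this (the token list is illustrative and would need tuning against your own logs):

```vbscript
Function LooksHuman(ua)
    Dim platforms, p
    ' Platform tokens that most real browsers report in the UA string
    platforms = Array("Win", "Mac", "Linux", "FreeBSD", "WebTV", "Lynx")
    LooksHuman = False
    For Each p In platforms
        If InStr(1, ua, p, vbTextCompare) > 0 Then
            LooksHuman = True
            Exit For
        End If
    Next
End Function

If LooksHuman(Request.ServerVariables("HTTP_USER_AGENT")) Then
    Response.Write "IS HUMAN"
Else
    Response.Write "IS NOT HUMAN"
End If
```

Using `vbTextCompare` makes the match case-insensitive, so "WIN" and "win" variants are caught too.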
O.k. dcrombie, I did a quick search of my server logs and was not able to find any quality indexing bots such as
(Google, AltaVista, Inktomi, etc.) that identified a platform, so I would agree that your approach is the better one.
I also got "BROWSCAP.INI" working; it looks like an easy solution for detection, but it's a VERY large file, so I have concerns about its speed.
I'll keep working on it and hopefully end up with a quality function.
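If the file size is the worry: classic ASP ships with the MSWC.BrowserType component, which reads BROWSCAP.INI for you and exposes its entries as properties. Whether a "crawler" entry exists depends on the browscap.ini version you install, so treat this as a sketch rather than a guaranteed API:

```vbscript
' MSWC.BrowserType looks up the current request's UA in BROWSCAP.INI.
' The "crawler" property only exists in browscap.ini versions that define it.
Dim bc
Set bc = Server.CreateObject("MSWC.BrowserType")
If CStr(bc.crawler) = "True" Then
    Response.Write "IS NOT HUMAN"
Else
    Response.Write "IS HUMAN"
End If
```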
Just a little background info:
The site being built is part of an industry that is very dishonest, and there's a lot of copycat behavior.
So this function will load a "Real" set of MetaTags for search engine bots, while non-bot visitors will see a generic set of tags.
The only critical design goal is that 100% of all major indexing spiders must be detected and fed the "Real" MetaTags.
I would consider the major spiders to be:
All the Web
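The tag switch itself could be as simple as the sketch below. `IsSpider` is a hypothetical helper standing in for whichever detection method gets settled on (UA tokens, browscap, etc.):

```vbscript
' IsSpider() is a hypothetical function built from the detection
' approach chosen above; only the branching is shown here.
If IsSpider(Request.ServerVariables("HTTP_USER_AGENT")) Then
    Response.Write "<meta name=""description"" content=""The real description"">"
    Response.Write "<meta name=""keywords"" content=""real,keywords"">"
Else
    Response.Write "<meta name=""description"" content=""A generic description"">"
End If
```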
Wow, are you still seeing these in your logs? I have not seen these for, well, nearly 2 years.
My search of the log files shows (Googlebot, Slurp, Fast-Webcrawler, ia_archiver, Infoseek) as repeat visitors.
My list of major spiders was off the top of my head and definitely needs to be refined.
I've never cared much about search page rankings or indexing bots, and I'm finding it to be a very complicated and controversial subject to learn.
As always, input is most welcome.
Robots will typically not stay on a page very long.
A human needs to read the page and digest its contents before moving to a new page.
When robots visit my website they stay less than 5 seconds per page.
Humans average 5 seconds - 90 seconds per page.
Robots are, well, "robotic"; they can "visit" the same page several times in one second.
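That revisit-speed test could be approximated with a Session timestamp. A rough sketch (note that Session state relies on cookies, which many bots don't send back, and that is itself a signal):

```vbscript
' Flag anything that re-requests a page within 1 second of its last hit
Dim lastHit
lastHit = Session("LastHit")
If lastHit <> "" Then
    If DateDiff("s", CDate(lastHit), Now()) < 1 Then
        Session("IsBot") = True
    End If
End If
Session("LastHit") = CStr(Now())
```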
Also a human will not click on invisible links.
I typically pepper my page with invisible links.
Many of them are counters and other tools.
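An invisible-link trap can be sketched like this: the link is hidden, so no human should ever request the target page, and anything that does gets flagged (the file name trap.asp is illustrative):

```vbscript
' On every page: emit a link that no human should see or click
Response.Write "<a href=""trap.asp"" style=""display:none"">.</a>"
```

```vbscript
' In trap.asp: whatever followed the hidden link is treated as a bot
Session("IsBot") = True
```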