

HTML Forum

Is page request a Bot or Human?
Generic detection and filtering.

 1:04 am on Dec 24, 2003 (gmt 0)

I'm trying to find the simplest way to detect whether a page request comes from a real human visitor or from a robot, crawler, etc.
It doesn't need to be 100% effective; 85-90% is fine.

Here is what I have. Let me know if you think it's ineffective or know of a better way.
Also, if you know how to convert my Else If statements to a Select Case, please tell me; I tried and failed.
I'm using ASP/VBScript.

' Read the User-Agent header once, then test it for known browser tokens.
Dim ua
ua = Request.ServerVariables("HTTP_USER_AGENT")
If InStr(1, ua, "Opera") > 0 Then
    Response.Write "IS HUMAN"
ElseIf InStr(1, ua, "Mozilla") > 0 Then
    Response.Write "IS HUMAN"
ElseIf InStr(1, ua, "MSIE") > 0 Then
    Response.Write "IS HUMAN"
Else
    Response.Write "IS NOT HUMAN"
End If




 11:26 am on Dec 24, 2003 (gmt 0)

If you use that method then you'll be mis-classifying quite a few user agents. Most browsers (including MSIE) have "Mozilla" in the UA, but so do many spiders, including: Ask Jeeves/Teoma; grub-client; ZyBorg; and Slurp.

A more effective approach might be to look for the platform (e.g. "Linux", "Mac", "Win", etc.) to indicate a 'Human'.

I think for Windows there is a public "BROWSCAP.INI" file that people are using to filter traffic on their sites - try searching for reference to it here or on Google.


 2:07 pm on Dec 24, 2003 (gmt 0)

So are you saying that spiders don't identify a platform?
Or do they identify a platform other than the standard "Linux", "Mac", "Win", etc.?



 2:29 pm on Dec 24, 2003 (gmt 0)

A user agent (UA) is just a string of text - how you interpret it is up to you. Most web browsers however will identify one of those platforms in the UA. A quick inspection of our logs shows that you might want to add the following to that list: "FreeBSD", "WebTV" and maybe "Lynx" as it doesn't appear to list a platform.

Others might want to double-check this but I think you could get 95% or even 99% accuracy using such a system.
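
A minimal sketch of the platform-token check described above, written in Python for brevity (in ASP/VBScript the same substring test would be done with InStr). The function name looks_human and the sample UA strings are made up for illustration, and the token list is only the one suggested in this thread, not an exhaustive one.

```python
# Platform tokens suggested in the thread; illustrative, not exhaustive.
PLATFORM_TOKENS = ("Linux", "Mac", "Win", "FreeBSD", "WebTV", "Lynx")


def looks_human(user_agent):
    """Return True if the UA string contains one of the platform tokens."""
    ua = user_agent or ""  # treat a missing header as an empty string
    return any(token in ua for token in PLATFORM_TOKENS)


# Sample UA strings below are hypothetical examples.
print(looks_human("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
print(looks_human("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))
```

The match is case-sensitive on purpose, so "Win" matches "Windows" but not the "win" in "Darwin".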


 4:26 pm on Dec 24, 2003 (gmt 0)

OK dcrombie, I did a quick search of my server logs and was not able to find any quality indexing bots
(Google, AltaVista, Inktomi, etc.) that identified a platform, so I would agree that your approach is the better one.

I also got "BROWSCAP.INI" working. It looks like an easy solution for detection, but it is a VERY large file, so I have concerns about its speed.
I'll keep working on it and hopefully end up with a quality function.

Just a little background info:

The site being built is part of an industry that is very dishonest, and there's a lot of copycat behavior.
So this function will load a "Real" set of MetaTags for search engine bots, while all other (non-bot) visitors will see a generic set of tags.

The only critical design goal is that 100% of the major indexing spiders must be detected and fed "Real" MetaTags.

I would consider the major spiders to be:

Northern Light
All the Web
Direct Hit


 5:39 pm on Dec 24, 2003 (gmt 0)

Wow, are you still seeing these in your logs? I have not seen them for, well, nearly 2 years.

Northern Light



 6:42 pm on Dec 24, 2003 (gmt 0)

My search of the log files shows Googlebot, Slurp, Fast-Webcrawler, ia_archiver and Infoseek
as repeat visitors.
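
A rough sketch of checking a UA string against the spider names found in the logs above, again in Python. The function name is_known_spider and the sample UA strings are hypothetical, and the list holds only the five names mentioned here, so it would need regular upkeep to stay useful.

```python
# Spider names taken from the log search reported in this thread.
SPIDER_TOKENS = ("Googlebot", "Slurp", "Fast-Webcrawler", "ia_archiver", "Infoseek")


def is_known_spider(user_agent):
    """Return True if the UA string contains a known spider name (case-insensitive)."""
    ua = (user_agent or "").lower()  # treat a missing header as an empty string
    return any(token.lower() in ua for token in SPIDER_TOKENS)


# Sample UA strings below are hypothetical examples.
print(is_known_spider("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))
print(is_known_spider("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```

Matching on an explicit list of names fits the stated goal of catching every major spider better than guessing from the absence of a platform, at the cost of maintaining the list.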

My list of major spiders was off the top of my head and definitely needs to be refined.

I've never cared much about search page rankings or indexing bots, and I'm finding
it to be a very complicated and controversial subject to learn.

As always, input is most welcome.


 7:19 pm on Dec 24, 2003 (gmt 0)

Robots will typically not stay on a page very long.

A human needs to read the page and digest its contents before moving to a new page.

When robots visit my website they stay less than 5 seconds per page.

Humans average 5 to 90 seconds per page.

Robots are, well, "robotic"; they can "visit" the same page several times in one second.
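
The timing idea above could be sketched like this in Python. It assumes the request times for one visitor have already been collected (for example, grouped by IP address from the logs, which is my assumption and not something stated in this thread). The function name looks_robotic is hypothetical, and the 5-second threshold is simply the figure quoted above.

```python
def looks_robotic(timestamps, min_dwell=5.0):
    """Return True if the average gap between a visitor's requests is under min_dwell seconds.

    timestamps: request times in seconds for a single visitor, in any order.
    """
    times = sorted(timestamps)
    if len(times) < 2:
        return False  # a single request gives no timing information
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    average_gap = sum(gaps) / len(gaps)
    return average_gap < min_dwell


# Hypothetical sample data: four hits inside one second vs. a slow reader.
print(looks_robotic([0.0, 0.2, 0.4, 0.9]))
print(looks_robotic([0, 30, 95, 140]))
```

Note that this only works after several requests have been seen, so it cannot classify the very first page request from a visitor.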

Also a human will not click on invisible links.

I typically pepper my page with invisible links.
Many of them are counters and other tools.
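
One way the invisible-link idea might be put to work, sketched in Python: any visitor that requests a link target no human could see gets flagged. The trap paths, the client identifiers and the function name flag_trap_visitors are all made up for illustration.

```python
# Hypothetical targets of invisible links placed on the page.
TRAP_PATHS = {"/trap/a.html", "/trap/b.html"}


def flag_trap_visitors(requests):
    """Return the set of clients that requested at least one trap path.

    requests: iterable of (client_id, path) pairs, e.g. parsed from a server log.
    """
    return {client for client, path in requests if path in TRAP_PATHS}


# Hypothetical sample log entries.
sample_log = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.2", "/trap/a.html"),
    ("10.0.0.1", "/about.html"),
]
print(flag_trap_visitors(sample_log))
```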

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved