
HTML Forum

    
Is page request a Bot or Human?
Generic detection and filtering.
mossimo

10+ Year Member



 
Msg#: 6861 posted 1:04 am on Dec 24, 2003 (gmt 0)

I'm trying to find the simplest way to detect whether a page request comes from a real human visitor or from a robot, crawler, etc.
It doesn't need to be 100% effective; 85-90% is fine.

Here is what I have. Let me know if you think it's ineffective or know of a better way.
Also, if you know how to convert my ElseIf statements to a Select Case, please tell me. I tried and failed.
I'm using ASP/VBScript.

Dim ua
ua = Request.ServerVariables("HTTP_USER_AGENT")
' Any of these browser tokens counts as a human visitor
If InStr(1, ua, "Opera") > 0 Then
    Response.Write "IS HUMAN"
ElseIf InStr(1, ua, "Mozilla") > 0 Then
    Response.Write "IS HUMAN"
ElseIf InStr(1, ua, "MSIE") > 0 Then
    Response.Write "IS HUMAN"
Else
    Response.Write "IS NOT HUMAN"
End If

Cheers
mossimo

 

dcrombie

10+ Year Member



 
Msg#: 6861 posted 11:26 am on Dec 24, 2003 (gmt 0)

If you use that method then you'll be mis-classifying quite a few user agents. Most browsers (including MSIE) have "Mozilla" in the UA, but so do many spiders, including: Ask Jeeves/Teoma; grub-client; ZyBorg; and Slurp.
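
To make the misclassification concrete, here is a minimal sketch in Python (not the ASP/VBScript used above) of the same substring test. The user-agent strings are illustrative samples assumed for this example, not entries from anyone's logs, and `naive_is_human` is a hypothetical name.

```python
# Illustrative user-agent strings (assumed samples, not real log entries)
SAMPLE_AGENTS = {
    "MSIE browser": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Slurp spider": "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)",
    "Ask Jeeves spider": "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)",
}


def naive_is_human(user_agent: str) -> bool:
    """Substring test from the first post: any of the three tokens means 'human'."""
    return any(token in user_agent for token in ("Opera", "Mozilla", "MSIE"))


if __name__ == "__main__":
    for label, ua in SAMPLE_AGENTS.items():
        verdict = "IS HUMAN" if naive_is_human(ua) else "IS NOT HUMAN"
        print(label, "->", verdict)
```

With these sample strings, both spiders pass the test because their user agents also start with "Mozilla".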

A more effective approach might be to look for the platform (e.g. "Linux", "Mac", "Win", etc.) to indicate a 'Human'.

I think for Windows there is a public "BROWSCAP.INI" file that people are using to filter traffic on their sites. Try searching for references to it here or on Google.

mossimo

10+ Year Member



 
Msg#: 6861 posted 2:07 pm on Dec 24, 2003 (gmt 0)

So are you saying that spiders don't identify a platform?
Or do they identify a platform other than the standard "Linux", "Mac", "Win", etc.?

Thanks

dcrombie

10+ Year Member



 
Msg#: 6861 posted 2:29 pm on Dec 24, 2003 (gmt 0)

A user agent (UA) is just a string of text; how you interpret it is up to you. Most web browsers, however, will identify one of those platforms in the UA. A quick inspection of our logs shows that you might want to add the following to that list: "FreeBSD", "WebTV" and maybe "Lynx", as it doesn't appear to list a platform.

Others might want to double-check this but I think you could get 95% or even 99% accuracy using such a system.
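
As a rough illustration of the platform-token idea, the sketch below is written in Python rather than ASP/VBScript. The token list comes from the posts above; the function name `looks_human` and the case-sensitive matching are assumptions made for this example.

```python
# Platform tokens suggested in this thread. "Lynx" is a browser name, kept
# because it does not appear to list a platform in its user agent.
HUMAN_TOKENS = ("Linux", "Mac", "Win", "FreeBSD", "WebTV", "Lynx")


def looks_human(user_agent: str) -> bool:
    """Return True when the UA contains any platform token (case-sensitive)."""
    return any(token in user_agent for token in HUMAN_TOKENS)


if __name__ == "__main__":
    print(looks_human("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
    print(looks_human("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))
```

Short tokens such as "Win" and "Mac" are plain substrings, so they can also match inside unrelated words. Checking the result against real logs is still needed before trusting the accuracy figures.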

mossimo

10+ Year Member



 
Msg#: 6861 posted 4:26 pm on Dec 24, 2003 (gmt 0)

OK dcrombie, I did a quick search of my server logs and could not find any quality indexing bots
(Google, AltaVista, Inktomi, etc.) that identified a platform, so I agree that your approach is the better one.

I also got "BROWSCAP.INI" working. It looks like an easy solution for detection, but it is a VERY large file, so I have concerns about its speed.
I'll keep working on it and hopefully end up with a quality function.

Just a little background info:

The site being built is part of an industry that is very dishonest, and there's a lot of copycat behavior.
So this function will load a "Real" set of MetaTags for search engine bots, while non-bot visitors will see a generic set of tags.

The only critical design goal is that 100% of the major indexing spiders must be detected and fed the "Real" MetaTags.

I would consider the major spiders to be:

Lycos
Altavista
Webcrawler
Northern Light
Excite
All the Web
Direct Hit
Google
Hotbot
Go
Slurp

ncw164x

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 6861 posted 5:39 pm on Dec 24, 2003 (gmt 0)

Wow, are you still seeing these in your logs? I have not seen them for, well, nearly 2 years.

Lycos
Webcrawler
Northern Light
Excite
Hotbot
Go

ncw164x

mossimo

10+ Year Member



 
Msg#: 6861 posted 6:42 pm on Dec 24, 2003 (gmt 0)

My search of the log files shows Googlebot, Slurp, Fast-Webcrawler, ia_archiver and Infoseek
as repeat visitors.

My list of major spiders was off the top of my head and definitely needs to be refined.
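
The spider names found in the logs could feed a simple lookup like the one sketched here in Python (not ASP/VBScript). The tag strings and the names `is_known_spider` and `meta_tags_for` are hypothetical placeholders, and the case-insensitive match is an assumption.

```python
# Spider tokens reported as repeat visitors in the log search above
SPIDER_TOKENS = ("Googlebot", "Slurp", "Fast-Webcrawler", "ia_archiver", "Infoseek")

# Hypothetical placeholder tag sets; the real content is site-specific
REAL_TAGS = '<meta name="description" content="real description">'
GENERIC_TAGS = '<meta name="description" content="generic description">'


def is_known_spider(user_agent: str) -> bool:
    """Case-insensitive check for any known spider token in the UA."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in SPIDER_TOKENS)


def meta_tags_for(user_agent: str) -> str:
    """Pick the tag set to emit for this request."""
    return REAL_TAGS if is_known_spider(user_agent) else GENERIC_TAGS
```

A fixed list like this only catches spiders it already names, so any crawler missing from the list would be served the generic tags.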

I've never cared much about search page rankings or indexing bots, and I'm finding
it to be a very complicated and controversial subject to learn.

As always, input is most welcome.
mossimo

tomparis

10+ Year Member



 
Msg#: 6861 posted 7:19 pm on Dec 24, 2003 (gmt 0)

Robots will typically not stay on a page very long.

A human needs to read the page and digest its contents before moving to a new page.

When robots visit my website they stay less than 5 seconds per page.

Humans average 5 to 90 seconds per page.

Robots are, well, "robotic"; they can "visit" the same page several times in one second.

Also a human will not click on invisible links.

I typically pepper my page with invisible links.
Many of them are counters and other tools.
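
The timing and hidden-link observations above can be sketched as a small heuristic. This is Python rather than ASP/VBScript, and it assumes you already have the request timestamps (in seconds) for one visitor; the function names are hypothetical, and only the 5-second threshold comes from the post.

```python
from typing import List

MIN_HUMAN_SECONDS = 5.0  # threshold taken from the post above


def dwell_times(request_times: List[float]) -> List[float]:
    """Gaps in seconds between consecutive page requests from one visitor."""
    ordered = sorted(request_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def looks_robotic(request_times: List[float], hit_hidden_link: bool = False) -> bool:
    """Flag a visitor who followed an invisible link or moves too quickly."""
    if hit_hidden_link:
        return True
    gaps = dwell_times(request_times)
    if not gaps:
        return False  # a single request gives no timing evidence
    return sum(gaps) / len(gaps) < MIN_HUMAN_SECONDS
```

For example, `looks_robotic([0, 30, 75])` averages 37.5 seconds per page and is not flagged, while several requests within one second are.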


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved