

Excluding all UAs with "spider", "bot", and "crawl" in them

Is this too broad a brush to paint with?


cfx211

8:37 pm on Jun 28, 2006 (gmt 0)

10+ Year Member



In order to keep bots out of our visitors table we have a job that goes and deletes any visitor whose user agent field matches up to a known crawler.

I was thinking of altering the SQL that does this to delete any visitor where any of the following applies:

lower(user_agent) like '%bot%'
or lower(user_agent) like '%spider%'
or lower(user_agent) like '%crawler%'

I would then add an IN list with all the other critters out there that don't use those three words in their UA.

Before I go and do this, can anyone think of a situation where this would delete someone who was a real human? I don't know of any browsers with those words in the UA, but I could be missing something.

GaryK

6:21 pm on Jun 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What OS and web server are you using, please?

cfx211

6:45 pm on Jun 29, 2006 (gmt 0)

10+ Year Member



We do not use web logs; our application writes visit/visitor information directly into an Oracle database.

In this case, when a session is initiated we capture the cookie information, including the IP and user agent, and write it to a database table.

I am not sure why you want to know the OS/server info; my question relates to data cleansing after it is in the Oracle table. We delete records that are known to be bots using a procedure. What I want to do is alter that procedure to delete any record where "bot", "spider", or "crawl" is found in the user agent string. Before I do that, I just want to make sure there are no common user agents used by humans for normal browsing that contain those three words. Can you think of any?

incrediBILL

6:51 pm on Jun 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Actually, if the UA doesn't start with "Mozilla/" or "Opera/", it's typically not a human.

That's how I block bots: everything that doesn't match gets tossed. So much for cell phones and people using Lynx, but my site doesn't work for them anyway.

Then you need to subfilter the user agent, as there are about 100 items I knock out; anything with "http://", "crawler", "spider", "download" or "robot" in it is pretty safe to zap.

Just "bot" alone will nail things it probably shouldn't, like BOTtom, BOTher, BOTtle; you get the point. You need to match Googlebot, Spambot, etc. one at a time, or check anything that matches "bot" to make sure the word ends in "bot" rather than having it in the middle.
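The checks above can be sketched in a few lines; the following is a Python illustration (the actual cleanup would live in the Oracle procedure), using a word-boundary regex so "bot" only matches at the end of a word, as suggested. The function name and the sample UA strings are hypothetical.

```python
import re

# "bot" only counts when it ends a word (Googlebot, Spambot),
# not when it is buried inside one (BOTtom, BOTher, BOTtle).
BOT_AT_WORD_END = re.compile(r"bot\b", re.IGNORECASE)

def looks_like_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    # Prefix test from the post: real browsers start with Mozilla/ or Opera/.
    if not (ua.startswith("mozilla/") or ua.startswith("opera/")):
        return True
    # Substring tests from the post, plus the word-boundary "bot" check.
    if any(s in ua for s in ("http://", "crawler", "spider", "download", "robot")):
        return True
    return bool(BOT_AT_WORD_END.search(ua))

print(looks_like_bot("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(looks_like_bot("Mozilla/5.0 (Windows NT 5.1) Firefox/1.5"))         # False
print(looks_like_bot("Mozilla/5.0 BOTtleBrowser"))  # hypothetical UA: False
```

Note that this still flags any UA ending in "bot" (e.g. a hypothetical "Abbot" browser), so the one-at-a-time whitelist approach mentioned above is the safer refinement.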

cfx211

8:16 pm on Jun 29, 2006 (gmt 0)

10+ Year Member



Just what I needed to know. Thanks.