
Search Engine Spider and User Agent Identification Forum

Stale bad bot lists
Need a list of live versus stale bad bots
knonymouse




msg:4432837
 2:07 am on Mar 24, 2012 (gmt 0)

A Google search about bad bots turns up several examples of long lists of bots to block.

Excerpt:

...
RewriteCond %{HTTP_USER_AGENT} ^Morfeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
...
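
For context, rule sets like that excerpt normally sit inside a mod_rewrite block and end with a single forbidding rule. A minimal sketch, reusing two of the names above purely as placeholders and not as a recommended list:

# Sketch of how such a blocklist is typically wired up
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Morfeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts
# Anything matching one of the conditions above gets a 403 Forbidden
RewriteRule .* - [F]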

Elsewhere, commenters have noted how many of the script kiddies represented on those lists have long since moved on to other things.

It would be helpful if someone knowledgeable were publishing a current hotlist of bots to ban, indicating whether each one is a mere pest or pure evil. Failing that, a link to the best regularly updated source would do.

It would also be interesting to see a second list of apparently lifeless bots that could be purged from .htaccess as dead wood.

(Of course, the latter list becomes a resource for future bot namers.)

 

incrediBILL




msg:4432843
 2:20 am on Mar 24, 2012 (gmt 0)

All those ancient big long bot lists do is slow your server down processing them.

If you haven't seen it in 12 months, it's probably dead.

wilderness




msg:4432856
 2:47 am on Mar 24, 2012 (gmt 0)

All those ancient big long bot lists do is slow your server down processing them.


In addition, those long lists encourage badly formatted syntax to be copied and pasted from one instructional website to the next, prompting newcomers who don't recognize the bad formatting to put it into active use.

"crawler" and "spider" will stop a vast majority that chose to ID themselves. There are about a dozen other names (synonyms)(which have been discussed here over and over) that most everybody agrees on.

After that, each webmaster must determine what is beneficial or detrimental to their own website(s).

I deny "Linux" in a UA, most others do not.
Malformed browser UAs with extra or missing spaces will stop others.

Building such a list and keeping it current is a vicious circle from a group standpoint.

There are simply too many variations and too many custom requirements to offer a "one-size-fits-all".

Many others prefer white-listing over black-listing entirely.

incrediBILL




msg:4432861
 3:04 am on Mar 24, 2012 (gmt 0)

Many others prefer white-listing over black-listing entirely.


It's the only way to fly.

motorhaven




msg:4433087
 9:39 pm on Mar 24, 2012 (gmt 0)

Does anyone have a decent whitelist or know of one on this site? I spend way too much time chasing down yet another variety of bad bot, scraper, etc. If the wrong people seeing it is an issue, can someone stickymail me?

motorhaven




msg:4433088
 9:41 pm on Mar 24, 2012 (gmt 0)

Wilderness,

Denying "Linux" in a UA will stop Android users from browsing your site. Maybe that's not significant on your site, but they represent about 10% of our users.

wilderness




msg:4433090
 10:06 pm on Mar 24, 2012 (gmt 0)

Many thanks motorhaven.

I review the 403's daily.
From 7 PM EST (00:00 GMT on my host) through 5 PM EST, there was a single Linux request that was in fact Android.

Today the site was quite a bit more active than recent days, and even recent Saturdays, due to the rejuvenation of a trivia thing I do in the widget forums.

wilderness




msg:4433091
 10:12 pm on Mar 24, 2012 (gmt 0)

Does anyone have a decent whitelist or know of one on this site?


motorhaven,
AFAIK there's never been one published.
I've saved some links where Bill or another member supplied a few example lines.
I've also got some blacklist lines that function as a whitelist.

Some of these modules were provided to me and others on the "condition" that we'd agree to NEVER post them in an open forum.

wilderness




msg:4433093
 10:29 pm on Mar 24, 2012 (gmt 0)

2006 excerpt [webmasterworld.com]


2006 Bill (nine days after above) [webmasterworld.com]

lucy24




msg:4433128
 2:45 am on Mar 25, 2012 (gmt 0)

Many others prefer white-listing over black-listing entirely.

Every time I go to take a closer look at whitelisting instructions, it turns out to involve a robot identifying itself upfront, for example by dutifully visiting some cranked-up version of robots.txt, or by not spoofing a human UA. So the easiest ones to trap are the stupid and/or honest robots.

Now, what would be nice-- and might even be sort-of possible-- is a current list of IP ranges showing where the servers live.


:: pause to contemplate mental picture of a stale robot ::

motorhaven




msg:4433132
 3:01 am on Mar 25, 2012 (gmt 0)

Looks like it's simply not going to be easy. I'll go back to my dynamic robots.txt/honey pot project and work on that some more. It also has measures in place to detect what looks like a normal browser but isn't. I just wish I could do it more proactively with a decent whitelist rather than so reactively. I hate leeches!

wilderness




msg:4433135
 3:23 am on Mar 25, 2012 (gmt 0)

Every time I go to take a closer look at whitelisting instructions, it turns out to involve a robot identifying itself upfront,


lucy,
If that's your understanding, then you're misunderstanding it.

The theory is to DENY ALL, and then make exceptions for the visitors you choose.

Bill does lots of tracking by other methods and moves visitors around to track them even further.
He's explained it multiple times.

incrediBILL




msg:4433143
 6:00 am on Mar 25, 2012 (gmt 0)

The theory is to DENY ALL, and then make exceptions for the visitors you choose.


Exactly.

Deny All.

Then punch a hole in the 'firewall' for browsers and allowed bots only. Cell phones are the only thing that's kind of tricky, and using browscap.ini can help with that.

Everything else bounces off the website like it was a grade A circus trampoline.
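
A bare-bones sketch of that deny-all-then-punch-holes idea, using Apache 2.2-style access control; the agent names are placeholders only, and a real whitelist also verifies IPs and handles the mobile cases mentioned above:

# Sketch: mark the agents you choose to allow...
SetEnvIfNoCase User-Agent "Mozilla|Opera" allowed_in
SetEnvIfNoCase User-Agent "Googlebot|bingbot" allowed_in
# ...deny everyone, then punch holes for the marked requests only
Order Deny,Allow
Deny from all
Allow from env=allowed_in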

lucy24




msg:4433153
 8:47 am on Mar 25, 2012 (gmt 0)

Well, yes, that was my point. How do you distinguish between a browser and a robot? You can't, unless the robot introduces itself as R. Daneel Olivaw or spoofs some wildly improbable browser. Don't know about your server, but mine doesn't come with x-ray vision that would let it say "This UA string may look like a perfectly respectable* MSIE 9 with the ordinary human trimmings, but it's really an evil robot from Kazakhstan".

And I'm not going to lock out those Canadians using MSIE 5 for Mac just because I think they're bonkers ;) They're my audience. Not, ahem, people who are bonkers. Canadians with elderly browsers and/or slow connections.


* For a given definition of "respectable".

keyplyr




msg:4433156
 8:59 am on Mar 25, 2012 (gmt 0)

How do you distinguish between a browser and a robot?

The behavior - but that of course is after the fact.

incrediBILL




msg:4433159
 9:05 am on Mar 25, 2012 (gmt 0)

How do you distinguish between a browser and a robot? You can't


You can to a degree.

Web security is built in layers and you peel away bots one layer at a time, like peeling an onion.

First, the robots.txt layer for the good guys that honor robots.txt. It's whitelisted as well, so you get the maximum bang for the buck by stopping as much as possible here before they keep burning more server resources.

Second, the .htaccess or script layer to forcibly filter out all the bad guys that ignore robots.txt. You obviously know the bots that announce themselves; the user agents that claim to be browsers you let through to the next level, which is the best you can do in layer 2.

Third, filter by IP policy: firewall any browsers coming from commercial locations such as hosting-farm IP ranges, or in reverse, block robots coming from residential IP ranges (sketched at the end of this post).

Fourth, filter by behavior: header content, whether it loads CSS, JS, images, etc.

Fifth, filter by volume: how fast, how many pages, and whether it deviates from human behavior.

So on and so forth...

Then there are other policies and rules in place that keep separating bot behavior from human behavior until it's as good as it gets. In the end some bots still sneak in, but far fewer than before.
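
As one concrete illustration of the IP-policy layer, a sketch using reserved documentation ranges in place of any real hosting farm's allocation:

# Sketch: refuse requests arriving from "commercial" address space
# (192.0.2.0/24 and 203.0.113.0/24 are reserved documentation ranges,
#  standing in here for real colo/hosting-farm allocations)
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 203.0.113.0/24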

keyplyr




msg:4433161
 9:17 am on Mar 25, 2012 (gmt 0)

I say make everyone do a CAPTCHA - LOL

lucy24




msg:4433163
 10:27 am on Mar 25, 2012 (gmt 0)

Fourth, filter by behavior: header content, whether it loads CSS, JS, images, etc.

Well, of course. But that's all after the fact. My log-wrangling has a multi-layered pattern of: human, probably human, maybe human, doubtful and (very very rare) definitely robot. Does it pick up the favicon (darn those mobile devices for destroying a perfect test!), the css (the plainclothes bing/msie robots always walk away with errorstyles.css), come in from a search engine with a plausible query? If it's using g### translate do all the images go to a second, human IP?

A good day is one that doesn't call for reopening the logs to verify someone's humanity. A really good day is one with no unfamiliar robots because all visitors are already on the Ignore list-- one way or the other.

The question was, how can your htaccess (acting as bouncer) identify a new robot that it hasn't met before? Standard human behaviors like requesting the full set of images can't happen until after the critter itself has been admitted. Standard robotic behaviors like going home to a server farm in Moldavia don't become obvious until after you've identified a new robot and looked up where it lives. ("And the horse you rode in on.")

wilderness




msg:4433186
 1:14 pm on Mar 25, 2012 (gmt 0)

The question was, how can your htaccess (acting as bouncer) identify a new robot that it hasn't met before?


Under the white-listing scheme you're not required to ID every newcomer, as you already have all the doors closed.

(Similar to a new colo customer arriving from a colo range you previously denied: why waste your time adding them to your most-wanted list?)

incrediBILL




msg:4433229
 5:08 pm on Mar 25, 2012 (gmt 0)

I say make everyone do a CAPTCHA - LOL


I fail to see the humor here as I practically do this already.

Detecting mouse moves, key presses, etc. are all forms of CAPTCHA: they tell you there's a human at the browser selecting menu options and typing in data.

Showing a box with squiggly lines and text is old school and intrusive.

motorhaven




msg:4433236
 6:12 pm on Mar 25, 2012 (gmt 0)

Here's a fundamental problem at least in my case:

- Most of the "rogue" bots I get these days are using legitimate UAs.

- Most come in, grab a few pages using a single IP, and don't come back under that IP.

Using an after the fact "bot catcher" doesn't work in these cases. By the time I and/or my filters/programs have caught an IP and locked it out, the bot has moved on to a new IP, switched to another legit UA, etc. I end up adding all this processing overhead with diminishing results... and still see 30%+ of my bandwidth going out to them.

Someone in a previous thread described an .htaccess method they came up with that looks closely at headers, matching what the headers should contain for a given UA against what they actually contain. The problem with this approach is that, alas, it wasn't described in detail, so it's left to me to spend an enormous amount of time logging headers from all visitors and comparing the headers from real visitors with those from fake ones.

Even then, it's far from perfect, because of things like corporate and military visitors whose proxies grab info either before or after the user gets the page (or get the page for them). These proxies don't present "normal" UAs, nor is their header content identical to a real browser's, and if I block them out I lose an enormous number of visitors at military/corporate desktops.

wilderness




msg:4433241
 7:04 pm on Mar 25, 2012 (gmt 0)

motorhaven,
I've never comprehended the use of testing headers myself!

I believe I've had a mere two lines in more than a decade that checked headers, and those were copied from a thread with a custom solution.

I find (despite the ravings of others) the FF Header plug-in to be a royal-PITA.

I believe dstiles uses headers extensively; perhaps he'll chime in later?

It's my understanding (or lack thereof) that if the header FAILS to contain "gzip" then it's likely a rogue.

This line dates from the AVG vulnerability a couple of years back (damned if I could figure out how to use the thing in other instances to my benefit):

RewriteCond %{HTTP:Accept-Encoding} !gzip,\ deflate
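
On its own a RewriteCond does nothing; in the old recipes a line like that was paired with a forbidding rule, roughly as below (a sketch only; the pattern is a loose substring test, and as later posts note, some legitimate agents fail it):

# Sketch: forbid requests whose Accept-Encoding does not contain
# "gzip, deflate" (loose match; legitimate exceptions exist, so this
# would normally be combined with other conditions)
RewriteCond %{HTTP:Accept-Encoding} !gzip,\ deflate
RewriteRule .* - [F]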

motorhaven




msg:4433244
 7:38 pm on Mar 25, 2012 (gmt 0)

Wilderness,

The gzip/deflate header information is extremely useful, and something I didn't think of. Before any blocking, I've added it to my Apache log format (sketched below) and I'm watching it now. So far it appears that legitimate browsers (and the legitimate crawlers I've seen in the past few minutes) do indeed send it.

I'll run this for a few days to see what exceptions, if any, there are to this rule for legit users/search crawlers.
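
For anyone wanting to log the same thing, the change is small: a combined-style format extended with the Accept-Encoding and Accept-Language request headers. This is only a sketch; LogFormat/CustomLog belong in the main server or vhost config, and the log path is a placeholder.

# Sketch: the usual combined format plus two request headers
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Accept-Encoding}i\" \"%{Accept-Language}i\"" combined_hdrs
CustomLog logs/access_log combined_hdrs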

Thank you!

incrediBILL




msg:4433246
 7:42 pm on Mar 25, 2012 (gmt 0)

I've never comprehended the use of testing headers myself!


I don't test them, I block on them.

Most poorly written bots don't even include certain header fields in the request, or they send stupid headers, add something dumb, or lack simple things like the language they're using (a simple example is sketched below).

I don't spend a lot of time looking at headers, I have scripts that analyze them for me.
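
One simple example of the missing-field idea: anything claiming to be a browser but sending no Accept-Language at all gets refused. This is a sketch only; some legitimate agents and proxies also omit the header, so in practice it would be one signal among several.

# Sketch: UA claims to be Mozilla-something, yet no Accept-Language header
RewriteCond %{HTTP_USER_AGENT} ^Mozilla [NC]
RewriteCond %{HTTP:Accept-Language} ^$
RewriteRule .* - [F]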

Using an after the fact "bot catcher" doesn't work in these cases. By the time I and/or my filters/programs have caught an IP and locked it out, the bot has moved on to a new IP, switched to another legit UA, etc. I end up adding all this processing overhead with diminishing results... and still see 30%+ of my bandwidth going out to them.


Get used to it, that's going to be bot blocking in the future.

What we're doing now, in this very forum, is on the verge of being obsolete when it comes to actually catching and blocking bad bots. Being the lone wolf bot hunter isn't going to work much anymore. It's going to take a collective network of websites to collect, detect and block these random IPs doing the bidding of the bot herder.

I only know this because I'm working on the problem :)

I've recently encountered masses of unstoppable IPs hosted on residential computers doing things they shouldn't be doing. Monitoring their activity across multiple computers is the only way to identify which of them is involved in the attacks vs. being actual humans making stray hits on websites.

Scrapers aren't the only problem; there are corporate crawlers out there that also crave access to these networks of IPs so they can data-mine your sites undetected. I won't cite specific examples because there's no way to prove it's them 100% until they do something that verifies they've been in the honeypot, but I'm positive it's happening.

IPv6 will just make the situation worse, much worse.

wilderness




msg:4433248
 7:46 pm on Mar 25, 2012 (gmt 0)

Thank you!


Widget people thank me all the time, and I've no clue why I deserved it.

Here's one of the old AVG threads [webmasterworld.com]. There were multiple threads from when AVG began exposing all its customers to webmasters who monitored their logs.

wilderness




msg:4433250
 7:51 pm on Mar 25, 2012 (gmt 0)

I don't test them, I block on them.


Bill,
In order to block on them, you must at least have some comprehension of what the headers should be, unlike myself.

wilderness




msg:4433252
 7:55 pm on Mar 25, 2012 (gmt 0)

Get used to it, that's going to be bot blocking in the future.


After more than a decade of accumulating IPs (server farms and others) and UAs, there aren't many that slip through the cracks, although an occasional rascal appears.

If only the ranges could be somehow converted to IPv6

Key_Master




msg:4433258
 8:08 pm on Mar 25, 2012 (gmt 0)

motorhaven, with white listing, you can allow proxies or any other entity safe passage into your site. FWIW, a military proxy isn't likely to be banned in my server configuration.

The real problem you are grappling with, is a lack of usable information. You are reliant on server access logs which still log the same visitor details as they did in the '90s. Details similar to:

REMOTE_ADDR{'101.80.225.183'}
SERVER_DATE{'Thu Mar 15 07:50:00 2012'}
HTTP_REFERER{'http://www.example.com/example.htm'}
HTTP_USER_AGENT{'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1068.0 Safari/536.3'}


You don't use MSIE 5.5 anymore, so why are you still dependent on server logs? At a bare minimum, you should be capturing all the HTTP headers a browser sends to the server. This will give you a header resolution of:

REMOTE_ADDR{'101.80.225.183'}
SERVER_DATE{'Thu Mar 15 07:50:00 2012'}
HTTP_ACCEPT_CHARSET{'ISO-8859-1,utf-8;q=0.7,*;q=0.3'}
HTTP_ACCEPT_ENCODING{'gzip,deflate,sdch'}
HTTP_ACCEPT_LANGUAGE{'zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4'}
HTTP_CONNECTION{'keep-alive'}
HTTP_HOST{'www.example.com'}
HTTP_REFERER{'http://www.example.com/example.htm'}
HTTP_USER_AGENT{'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1068.0 Safari/536.3'}


But if you want to really get serious about browser sniffing, you need to dig a lot deeper than that (by the way, don't let the JavaScript execution or valid Chrome headers fool you - this is a bot):

REMOTE_ADDR{'101.80.225.183'}
REMOTE_HOST{'101.80.225.183'}
SERVER_DATE{'Thu Mar 15 07:50:00 2012'}
JS_date{'Thu Mar 15 2012 22:50:01 GMT+0800 (%u4E2D%u56FD%u6807%u51C6%u65F6%u95F4)'}
JS_appCodeName{'Mozilla'}
JS_appName{'Netscape'}
JS_appVersion{'5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1068.0 Safari/536.3'}
JS_platform{'Win32'}
JS_product{'Gecko'}
JS_productSub{'20030107'}
JS_userAgent{'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1068.0 Safari/536.3'}
JS_vendor{'Google Inc.'}
JS_onLine{'true'}
JS_language{'en-US'}
JS_cookieEnabled{'true'}
JS_javaEnabled{'true'}
JS_plugins{'Remoting Viewer, Native Client, Chrome PDF Viewer, Shockwave Flash, Microsoft® DRM, Microsoft® DRM, Windows Media Player Plug-in Dynamic Link Library, Google Update, PowerEnter Plug-in for SPDB, Kingsoft@Firefox ActiveX Comm'}
JS_webkitHidden{'true'}
JS_webkitVisibilityState{'hidden'}
JS_domain{'www.example.com'}
JS_referer{'http%3A//www.google.com/url%3Fsa%3Dt%26rct%3Dj%26q%3D%26esrc%3Ds%26source%3Dweb%26cd%3D6%26ved%3D0CE0QFjAF%26url%3D
http%253A%252F%252Fwww.example.com
%252Fexample.htm%26ei%3DdQFiT53pMI2UiAef9InjBQ%26usg
%3DAFQjCNGyC2cBy9CH9KQWztPxA0fu9xh4Tg'}
JS_historyLength{'1'}
JS_topLocation{'http%3A//www.example.com/example.htm'}
JS_colorDepth{'32'}
JS_pixelDepth{'32'}
JS_availHeight{'770'}
JS_availWidth{'1280'}
JS_height{'800'}
JS_width{'1280'}
JS_innerHeight{'709'}
JS_innerWidth{'1280'}
JS_locationbar{'true'}
JS_menubar{'true'}
JS_personalbar{'true'}
JS_scrollbars{'true'}
JS_statusbar{'true'}
JS_toolbar{'true'}
HTTP_ACCEPT{'*/*'}
HTTP_ACCEPT_CHARSET{'ISO-8859-1,utf-8;q=0.7,*;q=0.3'}
HTTP_ACCEPT_ENCODING{'gzip,deflate,sdch'}
HTTP_ACCEPT_LANGUAGE{'zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4'}
HTTP_CONNECTION{'keep-alive'}
HTTP_HOST{'www.example.com'}
HTTP_REFERER{'http://www.example.com/example.htm'}
HTTP_USER_AGENT{'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1068.0 Safari/536.3'}


As far as understanding HTTP headers goes, you can learn an awful lot about them on your own, using your own set of browsers. Each brand of browser is going to send the same headers for a file as any other visitor using the same browser. To do this, you don't need a Firefox plugin; you just have to have the proper tools installed on your site. To demonstrate this, I've given any webmasterworld referrer admin access to a headers plugin on my site. If you use the tool with different brands of browsers you can see that each browser uses its own unique set of headers. You'll also notice that the headers sent by a given browser are reliably consistent. If you began to whitelist only browsers that send valid headers, your .htaccess file would shrink considerably.

But the bot masters are getting smarter. Whitelisting based on valid headers will knock out the majority of the riffraff, but not all of it. As incrediBILL pointed out, the future is bringing some tough challenges. Better start educating yourself now, before you get overwhelmed by the bad guys.

[edited by: incrediBILL at 8:40 pm (utc) on Mar 25, 2012]
[edit reason] broken down JS_referer because of length [/edit]

motorhaven




msg:4433259
 8:24 pm on Mar 25, 2012 (gmt 0)

Thanks for the tips so far.

I've already found several valid exceptions to gzip/deflate, including Googlebot (from valid Google 66.249.x.x), Yahoo, Bing, etc. My first phase of attack is whitelisting these (sketched below).

I have a pretty good set of PHP scripts I've written that do a lot of analysis beyond just the log files... it just gets overwhelming sometimes, but I'll add much more to it based on the tips given here so far.
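
A sketch of that first phase: skip the Accept-Encoding test for the known crawler names and forbid everything else that fails it. The UA names here are only a starting point and should still be verified against the engines' published IPs or via reverse DNS before being trusted.

# Sketch: exempt the major crawlers from the Accept-Encoding check,
# then forbid anything else that omits gzip
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Slurp) [NC]
RewriteCond %{HTTP:Accept-Encoding} !gzip [NC]
RewriteRule .* - [F]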

wilderness




msg:4433264
 8:41 pm on Mar 25, 2012 (gmt 0)

motorhaven,
I seem to recall some old references in which the language header was used to block some pests as well.
