|Bad Bot Blocking|
How to recognize bad bots and block them
| 4:53 pm on Dec 10, 2010 (gmt 0)|
This question has four parts:
- Our existing flood control, which may render concerns about bad bots moot.
- What harm can bad bots do?
- How do you recognize bad bots?
- How do you block bad bots?
I am wondering about the issue of "bad bots". Several years ago our site suffered a number of attacks that overwhelmed our db server and made the website unresponsive.
We instituted "flood control", which does not block a request outright but instead serves a simple static error page when access frequency exceeds a threshold.
Our immediate issue was solved: we have ample capacity to serve such pages, and as long as our db server is protected from expensive accesses, all is well. We have not had such a shutdown since we put the flood control in place.
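The flood-control idea described above can be sketched roughly as a per-client sliding-window counter. This is a minimal illustration, not the poster's actual implementation; the window and threshold values are placeholders you would tune to your own traffic.

```python
import time
from collections import defaultdict, deque

# Illustrative "flood control": count recent requests per client and
# divert to a cheap static error page (instead of hitting the database)
# once a threshold is exceeded. Values below are examples only.
WINDOW_SECONDS = 10
MAX_REQUESTS = 30  # per client per window

_recent = defaultdict(deque)  # client_ip -> timestamps of recent requests

def allow_request(client_ip, now=None):
    """Return True if the request may hit the (expensive) db-backed page."""
    now = time.time() if now is None else now
    q = _recent[client_ip]
    # Drop timestamps that have aged out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False  # serve the static error page instead
    q.append(now)
    return True
```

In use, the front-end would call `allow_request()` before any database work and serve the static page on a `False` result.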
What harm can bad bots do?
In general, the answer to this question would be that bad bots could gather information in a way we do not want to permit, or consume bandwidth. We believe we are presently immune from such harm because (1) our site is designed so that only public information is available to a non-logged-in user, so no private information can be obtained by a bot; and (2) we have the flood control, which prevents the excessive harmful traffic of the type we have previously experienced.
My question is thus, given the above two protections, is there anything else that a bad bot could do?
How do you recognize bad bots?
Because the recognition question is widely discussed, rather than answer it yet again here, I would prefer a link to a current, highly regarded reference. Of course this question is moot if the above "what harm" question does not show any hazards.
How do you block bad bots?
Again, if not moot, I prefer a reference to a current, highly regarded method rather than yet another rehash.
| 9:08 pm on Dec 10, 2010 (gmt 0)|
To begin. . .
1) The recent trend is to "deny all" via white-listing, then make exceptions for those visitors you desire.
a) whether based upon User-Agent, request, IP range, header, or any combination of the four.
b) I'm not aware (with only a few exceptions) of any "complete" public examples providing the coding for white-listing. Were such examples made public, it would be an easy task for all the harvesters and bad bots to circumvent the coding.
2) Before white-listing, the previous trend was "black-listing" (denying after the fact, based upon either your own record of activity or a record provided by another).
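The "deny all, then whitelist" idea in point 1 can be sketched as a check that combines User-Agent and IP range, two of the four criteria mentioned. The UA token and network below are illustrative placeholders (the Googlebot range shown is an example; verify current ranges yourself), not a complete or recommended list.

```python
import ipaddress

# Minimal "deny by default" whitelist: allow a known user-agent only
# when it arrives from the IP ranges you expect it on. The single
# entry here is an example, not a maintained list.
ALLOWED = {
    "Googlebot": [ipaddress.ip_network("66.249.64.0/19")],  # example range
}

def is_whitelisted(user_agent, client_ip):
    """Deny by default; allow a UA token only from its expected networks."""
    ip = ipaddress.ip_address(client_ip)
    for ua_token, networks in ALLOWED.items():
        if ua_token in user_agent and any(ip in net for net in networks):
            return True
    return False
```

Pairing each allowed UA with its expected networks is what defeats simple UA spoofing, which a UA-only check would not.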
|How do you recognize bad bots? |
1) You review and analyze your raw visitor logs.
a) This requires an awareness of your website(s) layout and pages, and how regular visitors interact with them.
b) the capability to examine IP ranges and User-Agents.
c) I'm not aware of any website which provides examples and explanations of criteria for interpreting a site's traffic, at least with a method of determining what is good or bad. Every website (or webmaster) has different goals, and as a result each must determine individually what is beneficial or detrimental to their own site(s).
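The log review in point 1 can be started with a small script that parses Apache "combined"-format lines and surfaces the busiest IP/User-Agent pairs for manual inspection. As the post notes, what counts as "bad" is up to you; this only ranks candidates. A minimal sketch:

```python
import re
from collections import Counter

# Parse Apache "combined" log lines and count requests per
# (client IP, User-Agent) pair, for manual review.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def busiest_clients(log_lines, top=10):
    """Return the most active (ip, user_agent) pairs with request counts."""
    counts = Counter()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m:
            counts[(m.group("ip"), m.group("ua"))] += 1
    return counts.most_common(top)
```

Sorting the output and eyeballing the top entries against your own knowledge of normal visitor behavior is exactly the manual step points a) and b) describe.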
|How do you block bad bots? |
There are too many examples across the entire internet (not just here at Webmaster World) to list; however, not all are current (recent activity or updated).
In addition and considering the www-as-a-whole, some websites are providing solutions which are inefficient, even inaccurate.
Here's a very old (note the 2001 date) and long thread at Webmaster World: Close to Perfect [webmasterworld.com]. Many of the lines in it were provided inaccurately by people seeking solutions; as a result, GREAT CARE should be taken (no blind copying and pasting) before any of the methods or lines are implemented.
|prefer a reference to a current, highly regarded method |
No such example exists!
Suggest you review the Forum Library, near top of page, as well as the two leading threads at page top.
| 9:39 pm on Dec 10, 2010 (gmt 0)|
jasimon9 - bad bots can scrape your content for use elsewhere - eg on thin "affiliate" and virus-trapping sites - and can also affect your ranking in search engines.
If you are convinced that your system is proof against the spate of virus injection and other attacks that are prevalent daily then you can ignore those, although I would advise trapping them on attack vectors and 403'ing them.
I would also advise Wilderness' whitelist approach. Since a large number of bots do not even look at robots.txt, that file is of little use against the baddies. If you have a robust URL rewrite system available that can incorporate whitelisting then use that.
In addition I would advise blocking all the server farms you can find - it's amazing how many new ones crop up each day. Many server farms get compromised by viruses that allow them to be used within botnets (the same applies to broadband, of course, but that is less easy to block). Apart from that, some server farms (eg Amazon) permit some really nasty bots to run unchecked. Blocking them can lower your bandwidth requirements considerably.
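Blocking server-farm ranges like this usually comes down to a CIDR membership test. A minimal sketch follows; the networks shown are documentation-only ranges (RFC 5737), stand-ins for real hosting ranges you would gather from your own logs and WHOIS lookups.

```python
import ipaddress

# Example block list. These are RFC 5737 documentation ranges used as
# placeholders -- substitute the hosting/server-farm ranges you have
# actually identified.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(client_ip):
    """True if the client falls inside any blocked network range."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in BLOCKED_NETWORKS)
```

In practice the same list is more often expressed as `Deny from` lines or firewall rules, but the membership logic is identical.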
My advice would be to read this forum beginning approx 2 years back: most current problems have been addressed here since then.
| 1:29 am on Dec 11, 2010 (gmt 0)|
Agreed. And when it comes to spotting and blocking bots, this forum is the "current, highly regarded reference." Ahem:)
| 11:50 am on Dec 12, 2010 (gmt 0)|
|How do you recognize bad bots? |
Throw up a forum on a separate box (a throwaway domain), then get that forum noticed and let the bots run wild on it. Make sure to create generic pages on the site, a.k.a. guestbook, contact-us, etc.
Analyze the IP data to collect server-farm IP ranges and compromised IPs from regular ISPs.
| 4:42 pm on Dec 13, 2010 (gmt 0)|
I did not mention that we also block "the worst offenders" at the firewall level. I don't want to get too specific here. However, this would not block the typical spider behaving badly.
dstiles: we deal with injection and other vulnerabilities at another level (not via blocking of bots or traffic attacks). We are required to be PCI compliant, so in addition to the use of coding best practices, we have monitoring from third parties.
others: good ideas and suggestions