Forum Moderators: DixonJones
Following great input from this forum, I constructed some 'bot traps' to catch the nasty ones that do not obey robots.txt and are just snooping around on my site.
I have accumulated a LARGE number of IP addresses (500+) which are banned from accessing my web pages. It occurred to me, though, that some spiders may not fall into my trap soon enough - I would have liked to stop them even BEFORE they come to me.
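In case it helps anyone, the trap itself is simple in principle: list a decoy URL in robots.txt as disallowed, link to it invisibly, and ban whatever fetches it anyway. A minimal sketch of that idea in Python (file names and paths are illustrative only, not my actual setup):

```python
# Sketch of a 'bot trap': the decoy URL is disallowed in robots.txt and
# linked only via a hidden link, so only misbehaving crawlers request it.
# Any IP that hits it is appended to a ban file that the 403 check reads.
# File name below is illustrative, not from an actual installation.
import os

BAN_FILE = "banned_ips.txt"

def load_banned():
    if not os.path.exists(BAN_FILE):
        return set()
    with open(BAN_FILE) as f:
        return {line.strip() for line in f if line.strip()}

def trap_hit(remote_ip):
    """Called when the decoy URL is requested; records the offending IP."""
    if remote_ip not in load_banned():
        with open(BAN_FILE, "a") as f:
            f.write(remote_ip + "\n")

def is_banned(remote_ip):
    """Called on every normal request; return True to serve a 403."""
    return remote_ip in load_banned()
```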
That could be done by having a public 'black list' of nasty bots' IP addresses, similar to the black lists of email spamming addresses. If webmasters learn to trust that list, it can be used to proactively bar those bots.
Does anyone know if such a black list exists?
Is there a place where I can publish my list, so that other people will be able to use it - should they choose to?
Cheers,
MC
But there has been a lot of work here in collecting bad user agent names:
[webmasterworld.com...]
[webmasterworld.com...]
Your work naturally complements that, I think.
A static list fails to account for the fact that dial-up IPs are dynamically reassigned, that open proxies do have some legitimate uses, and that corporate servers with configuration problems will often be fixed if you send the admin an e-mail and explain the problem.
Basically IP-based lists need to be purged periodically to avoid banning new users of 'previously-bad' IPs. Therefore, coordinating and maintaining such a list is a big job. Who would handle 'appeals' in case there was a mistake? Who would handle the 'slander and libel' lawsuits? If you publish the list, you give the bad guys a clear indication that they've been detected, and they move on to some other IP range, leaving some innocent soul to inherit the condemned IP address.
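To make the purging point concrete, here is a rough sketch of what a periodic expiry sweep might look like, assuming each entry carries a timestamp of its last bad hit (the 30-day window is an arbitrary example):

```python
# Illustrative only: expire IP bans older than a chosen age so that
# reassigned dial-up IPs don't stay blacklisted forever.
import time

MAX_AGE = 30 * 24 * 3600  # e.g. 30 days; the right value is debatable

def purge(entries, now=None):
    """entries: dict mapping IP -> unix timestamp of the last bad hit."""
    now = now or time.time()
    return {ip: ts for ip, ts in entries.items() if now - ts <= MAX_AGE}
```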
Another problem is that users of such a list might disagree on implementation details: While some would want ultra-precise lists of individual IP addresses, others might say, "Look, half the addresses in that class B IP address range are bad, let's just block the whole class B and save filespace and processing time."
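Just to illustrate the two styles, a hypothetical sketch (the addresses are placeholders, not real offenders):

```python
# Two implementation styles for the same list: exact IPs vs. whole ranges.
# Purely illustrative; Python's ipaddress module does the CIDR matching.
from ipaddress import ip_address, ip_network

exact_list = {ip_address("192.0.2.15"), ip_address("192.0.2.44")}
range_list = [ip_network("172.16.0.0/16")]  # "just block the whole class B"

def blocked(addr):
    a = ip_address(addr)
    return a in exact_list or any(a in net for net in range_list)
```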
So, it ends up being a problem with too many variables.
Jim
First - UA blocking is only a partial answer. It is so easy to spoof the UA that you must be really careless, ignorant or shameless to announce your presence in such a way (if you have BAD intentions!).
I have implemented the UA blocking. It works, but only up to a point. I am not sure how to measure it, but I think that UA spoofing is now rampant. I get 2-3 'bad apples' every day, and virtually all of them have a legit UA. So now they fall into the trap and are banned - by IP address.
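For what it's worth, the UA blocking part is just substring matching against a list of known bad names. A rough sketch (the names here are only examples, not an authoritative list):

```python
# Illustrative UA check: the fragments below are commonly cited examples,
# not the maintained lists linked elsewhere in this thread.
BAD_UA_FRAGMENTS = ["EmailSiphon", "WebCopier", "Offline Explorer"]

def ua_blocked(user_agent):
    """Return True if the request should get a 403 based on its UA string."""
    ua = (user_agent or "").lower()
    return any(frag.lower() in ua for frag in BAD_UA_FRAGMENTS)
```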
So far, I have received only ONE complaint from a user saying that he is barred from access and doesn't understand why. Now, a bit of a traffic report. Percentage of BLOCKED hits per month (error 403) during 2004:
Jan - 5.6%
Feb - 6.1%
Mar - 3.5%
Apr - 4.3%
May - 3.3%
I consider it quite significant...
Jim, that was an excellent post! I would want to discuss your comments in detail, but let's just start with a few:
Publishing the IP addresses would make the list worthless against the nasty bots.
Perhaps, but...
* It is not dead easy to jump from one IP range to another.
* If the impact of the published list were so great, they would be discovered pretty soon anyway.
But let's see if we can get around that. How about creating a 'Nasty Bot Fighting Society', whose members contribute to the banned IP list and get updates, but which will NOT be in the public domain? Membership could be regulated via recommendations, etc.
I also have an idea about maintenance. The list should be 'self-maintained'. How about this:
- Every time a bot from a NEW IP misbehaves, the script sends an email (in a fixed format) to the List Maintainer.
- An IP is considered 'bad' once more than X members have sent such an email.
- An appeal (via the web site) is ALWAYS granted, but the IP address is NOT removed (only 'suspended').
- As soon as the IP address is used again by nasty bots, it is banned again (no need for X votes).
It doesn't seem to me beyond the capabilities of a decent PHP+MySQL coder (which I am, sadly, not).
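To make the idea a bit more concrete, here is a rough sketch of the voting/suspension logic (the threshold and all the names are mine, purely for illustration):

```python
# Sketch of the proposed 'self-maintained' list: an IP becomes banned once
# X distinct members report it; an appeal suspends (not removes) the entry;
# a further report after an appeal re-bans it without needing X votes again.
# Threshold and names are illustrative, not an actual implementation.
X_VOTES = 3

class BanList:
    def __init__(self):
        self.reports = {}      # ip -> set of member ids that reported it
        self.banned = set()
        self.suspended = set()

    def report(self, ip, member):
        if ip in self.suspended:          # previously appealed: one strike re-bans
            self.suspended.discard(ip)
            self.banned.add(ip)
            return
        self.reports.setdefault(ip, set()).add(member)
        if len(self.reports[ip]) >= X_VOTES:
            self.banned.add(ip)

    def appeal(self, ip):
        """Appeals are always granted: the IP is suspended, not removed."""
        if ip in self.banned:
            self.banned.discard(ip)
            self.suspended.add(ip)

    def is_banned(self, ip):
        return ip in self.banned
```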
What do you think?
A little OT... If you view the robots.txt file for WebmasterWorld, you'll probably end up with a very good and up-to-date starting point for named bots. I do believe Brett frequently updates that file.