Forum Moderators: phranque
I've built Birdman's PHP version of the original PERL script found right here at WebmasterWorld, with a couple of changes (the main one being direct IP banning in the firewall rather than via .htaccess), which I'll post in the PHP/PERL forum later on.
Now that I have the ban mechanism, what I need is a decent set of traps. I have built two, so far. One looks like this:-
<a href="/come_here_little_spider.html"></a>
The simplest type of no-anchor-text bait. You wouldn't expect many spiders to be that stupid, would you? You'd be surprised: I've clocked a few bots that have come into this one already, after 24 hours of linking it.
The other is a simple 1x1 pixel invisible .gif:-
<a href="/come_here_little_spider.html"><img src="gif.gif"></a>
I'll be sprinkling these around various pages. Are there any other ways to disguise links so as not to be seen by humans on screen? Anything obvious that I've missed?
Should I give people a chance to read a "don't go any further" message, or just ban mercilessly?
TJ
I also wonder if non-English speakers try to download my whole site for translating rather than for the nefarious reasons I fear. I really don't like people trying to rip the whole site.
I hesitate to post my best bad-bot-baiting techniques because their operators may read here.
Jim
So maybe we can instead look at these aspects of allowing good bots in.
The engines that you do want to index you are easy - most have well-known user-agents. What about the WAP gateways and pre-fetchers? Where do you obtain a list?
Perhaps a case of running the bot-trap for a few weeks without actually doing the IP bans, just to see what you catch?
What are people's methods for collating user-agent data?
TJ
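One way to do the dry run suggested above is to log trap hits without banning anything, then tally the user-agents and flag any that look like crawlers you want to keep. Here is a minimal Python sketch, assuming the hits have already been pulled out of the log as a list of user-agent strings. The function name and the pattern list are made up for illustration and are not part of the scripts in this thread.

```python
import re
from collections import Counter

# Assumed whitelist of crawler name fragments (illustrative, not exhaustive).
GOOD_BOT_PATTERNS = ["googlebot", "slurp", "msnbot", "jeeves", "mediapartners-google"]
GOOD_BOT_RE = re.compile(
    "|".join(re.escape(p) for p in GOOD_BOT_PATTERNS), re.IGNORECASE
)


def summarize_trap_hits(user_agents):
    """Count hits per user-agent and split them into wanted and unwanted."""
    counts = Counter(user_agents)
    wanted = {ua: n for ua, n in counts.items() if GOOD_BOT_RE.search(ua)}
    unwanted = {ua: n for ua, n in counts.items() if not GOOD_BOT_RE.search(ua)}
    return wanted, unwanted


if __name__ == "__main__":
    # Hypothetical sample data standing in for a few weeks of trap hits.
    hits = [
        "Mozilla/5.0 (compatible; Googlebot/2.1)",
        "EvilRipper/1.0",
        "EvilRipper/1.0",
    ]
    wanted, unwanted = summarize_trap_hits(hits)
    print("Would have banned wanted bots:", wanted)
    print("Everything else:", unwanted)
```

Anything that turns up in the first group is a sign that the trap or robots.txt needs adjusting before the bans go live.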
I do not want to miss the PHP version.
I recently had three sites sending me a bandwidth alarm
due to G going crazy over a PHP calendar of events
(eating the monthly BW in a matter of one week).
It's my server, but I did set quotas and would like them respected.
As a quick fix I now rely on robots.txt.
Thanks
naughty_bot.php is a file that will never get called by anything other than a bot. It's a very simple script that simply logs the IP, datetime and user-agent string into a MySQL based DB table.
naughty_bot.php
<?php
// Bad bot detection by TrillianJedi
// Released to open-source GPL 2006
// See threads at www.webmasterworld.com for info
// and background
//
// Build 1.0
// BETA and completely experimental - USE AT YOUR OWN RISK
//
// This is a file that nothing should be looking at, so
// if we're in here, we're a naughty person. In that case
// we log their details in the DB which the PERL script will
// pickup later from CRON and institute an IP ban
//
// Note: the mysql_* functions were removed in PHP 7; on current
// PHP use mysqli or PDO instead.

$server = 'localhost';
$username = 'your_db_username';
$password = 'your_db_password';
$dbname = 'your_db_name';

$connection = mysql_connect($server, $username, $password) or die("Cannot make the connection");
$db = mysql_select_db($dbname, $connection) or die("Cannot connect to the database");

// The user-agent is supplied by the visitor, so escape it before it goes into the query
$ip = $_SERVER['REMOTE_ADDR'];
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$sql = "INSERT INTO naughty_bots VALUES (INET_ATON('".mysql_real_escape_string($ip, $connection)."'), '".mysql_real_escape_string($agent, $connection)."', NOW())";
mysql_query($sql, $connection);

echo "Gotcha...\n";
echo "IP : ".htmlspecialchars($ip)."\n";
echo "UA : ".htmlspecialchars($agent)."\n";
?>
ipblock.pl is a PERL script called by CRON every couple of minutes that examines the database table for entries and, if there are any, blocks them at an IPtables level. If you use an APF or other firewall, substitute as necessary.
ipblock.pl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dsn = 'DBI:mysql:your_DB_name:localhost';
my $db_user_name = 'your_db_username';
my $db_password = 'your_db_password';
my $dbconn = DBI->connect($dsn, $db_user_name, $db_password)
    or die "Cannot connect: $DBI::errstr";

my $query = $dbconn->prepare(qq{
SELECT INET_NTOA(IP) FROM naughty_bots
});
$query->execute();

while (my ($ip) = $query->fetchrow_array())
{
print "iptables -I INPUT -s $ip/24 -j DROP\n";
# system() rather than exec(): exec() replaces this process, so the loop would stop after the first IP
system('iptables', '-I', 'INPUT', '-s', "$ip/24", '-j', 'DROP');
$dbconn->do("DELETE FROM naughty_bots WHERE IP = INET_ATON(?)", undef, $ip);
}

$query->finish();
$dbconn->disconnect();
The idea behind this is to (a) log everything to a DB (my version of the PHP script is slightly more advanced in logging, but that's easy when you use a DB so you can do what's best for you) and (b) block at IPtables level rather than using .htaccess, which just makes more sense to me from a server resources point of view, especially if .htaccess starts to get filled up quite quickly.
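Worth noting that the script above blocks $ip/24, not just the single address that tripped the trap. A quick Python sketch of what that range covers, using the standard ipaddress module; the function name and the sample address (from the documentation range) are only for illustration:

```python
import ipaddress


def blocked_range(ip, prefix=24):
    """Return the network that 'iptables -s ip/prefix' would match."""
    # strict=False lets us pass a host address and get its enclosing network.
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


if __name__ == "__main__":
    net = blocked_range("203.0.113.57")
    print(net, "covers", net.num_addresses, "addresses")
```

So one trapped bot takes 255 neighbouring addresses down with it. Change /24 to /32 in the PERL script if you only want to block the offending IP itself.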
This is a work in progress, it's changing daily. I'll probably move the code part of this thread over to the PHP forum at some point. Parts of these scripts will require table locks.
The database table structure is:-
IP        | int(11) unsigned
UserAgent | varchar(64)
DateTime  | datetime
You want to make IP a unique column.
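For anyone wondering why the IP column is an unsigned int: INET_ATON turns the dotted address into a single 32-bit number, and anything from 128.0.0.0 upwards no longer fits in a signed int. A small Python sketch of the same conversion; the helper names are mine, and this is only meant to mirror what the MySQL functions do for IPv4:

```python
import ipaddress


def ip_to_int(ip):
    """Same idea as MySQL's INET_ATON for an IPv4 address."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(number):
    """Same idea as MySQL's INET_NTOA."""
    return str(ipaddress.IPv4Address(number))


if __name__ == "__main__":
    value = ip_to_int("192.168.1.10")
    print(value, "->", int_to_ip(value))
    # A signed 32-bit INT tops out at 2147483647, hence the unsigned column.
    print("fits in signed int:", value <= 2**31 - 1)
```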
You then probably want to start cloaking your robots.txt file as per the way that BT does that here, a PERL script which I converted to PHP:-
<?php
header("Content-Type: text/plain; charset=UTF-8");
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (isset($_GET['view']) && $_GET['view'] == "producecode") {
include("robots.txt");
exit;
}

# Simple agent check to keep the snoopy happy and to keep bad bots out and good bots in.
# The /i flag makes the match case-insensitive, since real agents send "Slurp" and "Googlebot".
if (preg_match('/slurp/i', $agent)
|| preg_match('/msnbot/i', $agent)
|| preg_match('/Jeeves/i', $agent)
|| preg_match('/googlebot/i', $agent)
|| preg_match('/Mediapartners-Google/i', $agent)
) {
include('robots.txt');
} else {
echo "User-agent: *\n";
echo "Disallow: /\n";
}
?>
You do not want to start calling the PERL script from CRON until you notice in your DB that the bots you want to crawl are no longer appearing in the naughty_bots table.
TJ
The potential hole that I see is that bad-bot blocking happens only every few minutes (or at whatever frequency the "info checking and iptables updating" script runs). Is it possible, then, that the worst-case scenario is: checking script runs, bad bot enters, roams free for X amount of time, checking script runs, bad bot blocked? During those X minutes the bad bot is free to do its thing. I guess technically you could get the checking script to run more often than once a minute, but that might take a toll on server resources.
I personally prefer things to run based on events rather than on a timetable (i.e. bad-bot-showed-up-block-it-right-away vs. bad-bot-showed-up-it's-not-blocked-until-script-runs-at-some-predetermined-time). I see your point regarding .htaccess getting big in a hurry, though.
I think it might be possible to use .htaccess for a short-term, immediate bad-bot block and still use iptables for the longer-term block. In a nutshell, the scripts might work something like this (very high-level overview):
-bad bot trips the trap
-gets immediately blocked by a dynamically generated .htaccess
-some script runs at a predetermined interval, gets the info from .htaccess and/or some other log files, and blocks the caught bad bots via iptables. .htaccess then gets dynamically rewritten to its default (empty or otherwise) values.
With this you get bad bots banned immediately, .htaccess doesn't get too big (etc.), and the "main blocking" is still done at the iptables level.
Note: I haven't done this; it occurred to me while reading this thread, so take it with a grain of salt...
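The three steps above could be sketched roughly like this in Python. This is untested against a live server and rests on a few assumptions: the default .htaccess already contains "Order Allow,Deny" and "Allow from all" (Apache 2.2-style access control), it has no "Deny from" lines of its own, and the box is Unix so fcntl file locking is available. The function names are invented for the example.

```python
import fcntl


def ban_now(htaccess_path, ip):
    """Trap side: append a 'Deny from' line the moment the trap is hit."""
    with open(htaccess_path, "a") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # released when the file is closed
        fh.write(f"Deny from {ip}\n")


def drain_bans(htaccess_path):
    """Cron side: return the banned IPs and reset .htaccess to its default."""
    with open(htaccess_path, "r+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        lines = fh.readlines()
        banned = [line.split()[2] for line in lines if line.startswith("Deny from ")]
        kept = [line for line in lines if not line.startswith("Deny from ")]
        fh.seek(0)
        fh.writelines(kept)
        fh.truncate()
    return banned
```

The cron job would then feed whatever drain_bans() returns to the firewall script, so the bot is blocked by Apache from its first hit and by iptables from the next cron run onward.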
-bad bot trips the trap
-gets immediately blocked by a dynamically generated .htaccess
-some script runs at a predetermined interval, gets the info from .htaccess and/or some other log files, and blocks the caught bad bots via iptables. .htaccess then gets dynamically rewritten to its default (empty or otherwise) values.
That's an excellent suggestion, and one I'll have to consider implementing. It should be quite easy actually, as the .htaccess could be regenerated with a database dump....
That's why I always advocate using a real-time script with a banned IP database or .htaccess because then you're strictly blocking the web server access, nothing more, and people with shared accounts can't update the server firewall in the first place.
If you must use the firewall itself, consider adding "-p tcp --dport 80" (the port match only works once the protocol is specified) to restrict the block to just Apache.
If you want to get tricky, you could do a combination of real-time script and .htaccess and append blocks in .htaccess after several page requests and then prune the .htaccess file every 24 hours and start over. Then you would have the best of all worlds catching it in real-time, hardening the server after it looks persistent, not overflowing .htaccess with excessive blocks other than for the day, and it works with shared hosting accounts.
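A rough Python sketch of that escalate-then-prune idea, assuming the real-time script can keep a small per-IP counter somewhere. The class name, the threshold of 3 hits and the in-memory dict are all assumptions for illustration; a real setup would keep the counts in a database or file.

```python
import time


class TrapCounter:
    """Escalate an IP to a hard .htaccess block after `threshold` hits."""

    def __init__(self, threshold=3, max_age=24 * 3600):
        self.threshold = threshold
        self.max_age = max_age
        self.hits = {}  # ip -> (hit count, time of first hit)

    def record(self, ip, now=None):
        """Record a hit; return True exactly once, when ip reaches the threshold."""
        now = time.time() if now is None else now
        count, first_seen = self.hits.get(ip, (0, now))
        count += 1
        self.hits[ip] = (count, first_seen)
        return count == self.threshold

    def prune(self, now=None):
        """Forget every IP whose first hit is at least max_age seconds old."""
        now = time.time() if now is None else now
        self.hits = {
            ip: entry
            for ip, entry in self.hits.items()
            if now - entry[1] < self.max_age
        }
```

When record() returns True the script would append the block to .htaccess, and the daily prune() is the point at which the .htaccess entries would be cleared and everything starts over.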
If you want to get tricky, you could do a combination of real-time script and .htaccess and append blocks in .htaccess after several page requests and then prune the .htaccess file every 24 hours and start over.
Bill, how is this different from what I suggested? (I ask without any adversarial tone, just out of curiosity, as I can't seem to see the difference between what we both said.)
Oh yes, that little prefetch issue...
RewriteEngine On
SetEnvIf X-moz prefetch HAS_X-moz=prefetch
RewriteCond %{ENV:HAS_X-moz} prefetch
RewriteRule .* - [F,L]
incrediBILL,
For the .htaccess almost illiterate people such as myself, what does that do?
I have a way of handling the prefetchers that is working for me, but am open to new or different ways.
Also, you can add:
User-agent: Fasterfox
Disallow: /
to your robots.txt to keep Firefox from prefetching on your site.
[edited by: Jordo_needs_a_drink at 9:59 pm (utc) on Dec. 14, 2006]
This page has more information:
[webaccelerator.google.com...]
The robots.txt entry is definitely simpler, but won't stop Google's Web Accelerator.
I use the robots.txt for FF (since it only blocks the prefetches themselves, not the users), but have added something else to my script to keep google prefetchers from being blocked by hitting the bot trap.
I don't really want to completely block prefetchers from my sites entirely, although I understand a lot of webmasters do, because of bandwidth, etc.
If the .htaccess method you posted only blocks the prefetches, but the users themselves can still access the site, then I'd like to try that.
Edit - I just read up on what [F,L] is and read the link you posted again, and the lightbulb just went off. I didn't realize that X-moz is set only on the prefetches themselves; for some stupid reason I was thinking the user always had it set...
[edited by: Jordo_needs_a_drink at 12:20 am (utc) on Dec. 15, 2006]
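Since the X-moz header only shows up on the prefetch requests themselves, a trap script can check for it before logging anything. The exact change made to the script mentioned above was not posted, so the following Python sketch is only a guess at the general shape. It assumes the request headers are available as a plain dict, and both function names are made up.

```python
def is_prefetch(headers):
    """True if the request carries 'X-moz: prefetch'."""
    # Header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == "x-moz" and value.strip().lower() == "prefetch":
            return True
    return False


def should_log_trap_hit(headers):
    """Only log (and later ban) requests that are not browser prefetches."""
    return not is_prefetch(headers)
```

A human whose browser prefetched the trap link would then be left alone, while a bot following the same link without the header still gets logged.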