Spider Traps and Honey Pots - Webmaster General forum at WebmasterWorld - WebmasterWorld



Spider Traps and Honey Pots

Design Considerations

trillianjedi

3:04 pm on Nov 24, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

So, I finally got around to building an automated system for blocking badly behaved bots. Took me a while as I've never really had bandwidth or server resource issues. Now the issue is primarily about scraping (and similar).

I've built Birdman's PHP version of the original PERL script found right here at WebmasterWorld, with a couple of changes (direct IP banning in the firewall rather than .htaccess is the main one) which I'll post in the PHP/PERL forum later on.

Now that I have the ban mechanism, what I need is a decent set of traps. I have built two, so far. One looks like this:-

<a href="/come_here_little_spider.html"></a>

The simplest type of no-anchor-text bait. You wouldn't expect many spiders to be that stupid, would you? You'd be surprised: I've clocked a few bots that have come into this one already, 24 hours after linking it.

The other is a simple 1x1 pixel invisible .gif:-

<a href="/come_here_little_spider.html"><img src="gif.gif"></a>

I'll be sprinkling these around various pages. Are there any other ways to disguise links so as not to be seen by humans on screen? Anything obvious that I've missed?

Should I give people a chance to read a "don't go any further message" or just ban mercilessly?

TJ

4string

4:17 pm on Nov 24, 2006 (gmt 0)

10+ Year Member

I've been trying to do this, too. One problem I get is users using web accelerators tripping things. There are some popular Firefox accelerator extensions.

I also wonder if non-English speakers try to download my whole site for translating rather than for the nefarious reasons I fear. I really don't like people trying to rip the whole site.

jdMorgan

4:39 pm on Nov 24, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

It's a very good idea to incorporate exclusions for web accelerators, WAP translators, and major search engines -- The first two because they tend to pre-fetch blindly, and the last just in case you make a mistake while editing robots.txt.

I hesitate to post my best bad-bot-baiting techniques because their operators may read here.

Jim

trillianjedi

4:53 pm on Nov 24, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Thanks guys - and yes Jim you're right, it's not in the interest of anyone with good quality answers to post them ;)

So maybe we can instead look at these aspects of allowing good bots in.

The engines that you do want to index you are easy - most have well-known user-agents. What about the WAP gateways and pre-fetchers? Where do you obtain a list?

Perhaps a case of running the bot-trap for a few weeks without actually doing the IP bans, just to see what you catch?

What are people's methods for collating user-agent data?

TJ

jdMorgan

5:40 pm on Nov 24, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Yes, track the UAs for a while, logging them to a file (easy mod to the script), then import into a spreadsheet, sort, de-duplicate, and look for and remove revision numbers and other unnecessary substrings.

Jim
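A rough sketch of that clean-up step, done in Python instead of a spreadsheet. The regex, the function names and the sample UA strings are assumptions made up for illustration, not anything Jim posted; real logs will need a few more rules.

```python
import re
from collections import Counter

# Assumption: a "revision number" is a slash followed by a digit and any
# trailing word characters or dots, e.g. "/2.1" in "Googlebot/2.1".
_REVISION = re.compile(r"/\d[\w.]*")

def normalize_ua(ua):
    """Strip revision numbers and collapse whitespace in one UA string."""
    return " ".join(_REVISION.sub("", ua).split())

def summarize(uas):
    """De-duplicate UA strings after normalizing, counting hits per UA."""
    return Counter(normalize_ua(ua) for ua in uas)

if __name__ == "__main__":
    logged = [
        "Googlebot/2.1 (+http://www.google.com/bot.html)",
        "Googlebot/2.0 (+http://www.google.com/bot.html)",
        "msnbot/1.0 (+http://search.msn.com/msnbot.htm)",
    ]
    for ua, hits in summarize(logged).most_common():
        print(hits, ua)
```

Sorting by hit count puts the heavy crawlers at the top, which is usually where you decide what goes on the exclusion list.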

henry0

6:20 pm on Nov 24, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Hi TJ,
How will we know when you have posted in the PHP forum (aside from refreshing every other hour)?
Auto-notification is still in a "fixing state".

I do not want to miss the PHP version.
I recently had three sites send me a bandwidth alarm
due to G going crazy on a PHP calendar of events
(eating the monthly BW in a matter of one week).
It's my server, but I did set quotas and would like to respect them.

As a quick fix I now rely on robots.txt.
Thanks

balam

6:23 pm on Nov 24, 2006 (gmt 0)

10+ Year Member

> good quality answer

You must look beyond the UA, Grasshopper. If you do, you will know what I speak of.

It's fun being cryptic - I don't do it near enough!

trillianjedi

7:55 pm on Nov 24, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Henry, as it's pretty straightforward I'll post it in here for the moment. However, it's a two-parter, and it does rely on PERL for the second part (the blocking) but you could probably convert this to PHP using shell_exec (having put your mind to the security implications of that first).

naughty_bot.php is a file that will never get called by anything other than a bot. It's a very simple script that simply logs the IP, datetime and user-agent string into a MySQL based DB table.

naughty_bot.php

<?php
// Bad bot detection by TrillianJedi
// Released to open-source GPL 2006
// See threads at www.webmasterworld.com for info
// and background
//
// Build 1.0
// BETA and completely experimental - USE AT YOUR OWN RISK
//
// This is a file that nothing should be looking at, so
// if we're in here, we're a naughty person. In that case
// we log their details in the DB which the PERL script will
// pickup later from CRON and institute an IP ban
$server = 'localhost';
$username = 'your_db_username';
$password = 'your_db_password';
$dbname = 'your_db_name';
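// mysqli, since the old mysql_* functions no longer exist in current PHP
$connection = mysqli_connect($server, $username, $password, $dbname) or die ("Cannot connect to the database");
$ip = $_SERVER['REMOTE_ADDR'];
// The user-agent is attacker-controlled: trim it to the column width and
// bind it as a parameter instead of pasting it into the SQL string
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? substr($_SERVER['HTTP_USER_AGENT'], 0, 64) : '';
// INSERT IGNORE because IP is a unique column and repeat offenders will hit it
$stmt = mysqli_prepare($connection, "INSERT IGNORE INTO naughty_bots VALUES (INET_ATON(?), ?, NOW())");
mysqli_stmt_bind_param($stmt, 'ss', $ip, $ua);
mysqli_stmt_execute($stmt);
// Plain text output so the echoed user-agent cannot inject markup
header('Content-Type: text/plain');
echo "Gotcha...\n";
echo "IP : ".$ip."\n";
echo "UA : ".$ua."\n";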
?>

ipblock.pl is a PERL script called by CRON every couple of minutes that examines the database table for entries and, if there are any, blocks them at an IPtables level. If you use an APF or other firewall, substitute as necessary.

ipblock.pl

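#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dsn = 'DBI:mysql:your_DB_name:localhost';
my $db_user_name = 'your_db_username';
my $db_password = 'your_db_password';
my $dbconn = DBI->connect($dsn, $db_user_name, $db_password, { RaiseError => 1 });
my $query = $dbconn->prepare(qq{
SELECT INET_NTOA(IP) FROM naughty_bots
});
$query->execute();
while (my ($ip) = $query->fetchrow_array())
{
    print "iptables -I INPUT -s $ip/24 -j DROP\n";
    # system(), not exec(): exec() replaces this process and never returns,
    # so the loop would stop after the first IP. The list form skips the shell.
    if (system('iptables', '-I', 'INPUT', '-s', "$ip/24", '-j', 'DROP') == 0) {
        # Only forget the entry once the firewall rule is actually in place
        $dbconn->do('DELETE FROM naughty_bots WHERE IP = INET_ATON(?)', undef, $ip);
    }
}
$query->finish();
# Disconnect last, after all queries have run
$dbconn->disconnect();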

The idea behind this is to (a) log everything to a DB (my version of the PHP script is slightly more advanced in logging, but that's easy when you use a DB so you can do what's best for you) and (b) block at IPtables level rather than using .htaccess, which just makes more sense to me from a server resources point of view, especially if .htaccess starts to get filled up quite quickly.

This is a work in progress, it's changing daily. I'll probably move the code part of this thread over to the PHP forum at some point. Parts of these scripts will require table locks.

The database table structure is:-

IP: int(11) unsigned
UserAgent: varchar(64)
DateTime: datetime

You want to make IP a unique column.
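For anyone wondering why IP is an unsigned integer, here is a small Python sketch of what MySQL's INET_ATON() and INET_NTOA() do, plus what the /24 in the iptables rule really covers. The function names are mine and the addresses are documentation-range examples, so treat it as an illustration under those assumptions and not part of the scripts above.

```python
import ipaddress

def inet_aton(ip):
    """Dotted quad -> unsigned 32-bit integer, like MySQL's INET_ATON()."""
    return int(ipaddress.IPv4Address(ip))

def inet_ntoa(n):
    """Unsigned 32-bit integer -> dotted quad, like MySQL's INET_NTOA()."""
    return str(ipaddress.IPv4Address(n))

def block_range(ip, prefix=24):
    """The network that a firewall rule for ip/prefix actually matches."""
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)

if __name__ == "__main__":
    n = inet_aton("192.0.2.77")
    print(n, inet_ntoa(n))
    # Anything from 128.0.0.0 upwards no longer fits in a signed 32-bit int
    print(inet_aton("255.255.255.255") > 2**31 - 1)
    net = block_range("192.0.2.77")
    print(net, net.num_addresses)
```

The last line is worth a look before copying the /24: one trapped address takes its 255 neighbours down with it.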

You then probably want to start cloaking your robots.txt file as per the way that BT does that here, a PERL script which I converted to PHP:-

<?php
header("Content-Type: text/plain; charset=UTF-8");
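$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (isset($_GET['view']) && $_GET['view'] == "producecode") {
readfile("robots.txt");
exit;
}
# Simple agent check to keep the snoopy happy and to keep bad bots out and good bots in.
# Case-insensitive, because the real agents send "Googlebot" and "Slurp" with capitals.
if (preg_match('/slurp/i', $agent)
|| preg_match('/msnbot/i', $agent)
|| preg_match('/Jeeves/i', $agent)
|| preg_match('/googlebot/i', $agent)
|| preg_match('/Mediapartners-Google/i', $agent)
) {
# readfile() sends the file as-is; include() would run any PHP inside it
readfile('robots.txt');
}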
else {
echo "User-agent: *\n";
echo "Disallow: /\n";
}
?>

You do not want to start calling the PERL script from CRON until you notice in your DB that the bots you want to crawl are no longer appearing in the naughty_bots table.

TJ

henry0

8:30 pm on Nov 24, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Hi TJ,
Thanks a lot :)
This is great work that calls for in-depth thought!

Cannot wait to see it in PHP and read comments about it
Henry

trillianjedi

9:06 pm on Nov 24, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Henry - there is nothing new here other than my use of a DB and firewall rather than .htaccess.

All of these things are well documented here - both by Brett and jdMorgan and key_master in the PHP/PERL forums.

TJ

Tastatura

10:53 pm on Nov 24, 2006 (gmt 0)

10+ Year Member

The comment below makes sense only if I correctly understood what your scripts are doing in regards to bad bots (I think good bot handling is straightforward):
-Bad bot trips the trap
-Bad bot's various info gets logged
-Another script runs every few minutes (or at any other frequency) and updates iptables to block the bad bot

The potential hole I see is that bad-bot blocking only happens every few minutes (or at whatever frequency the "info checking and iptables updating" script runs). Is it possible, then, that the worst-case scenario is: checking script runs -> bad bot enters -> roams free for X amount of time -> checking script runs -> bad bot blocked? During those X minutes the bad bot is free to do its thing. I guess technically you could get the checking script to run more often than once a minute, but that might take a toll on server resources.

I personally prefer things to run based on events rather than on a timetable (i.e. bad-bot-showed-up-block-it-right-away vs. bad-bot-showed-up-it's-not-blocked-until-script-runs-at-some-predetermined-time). I see your point regarding .htaccess getting big in a hurry, though.

I think that it might be possible to use .htaccess for a short-term, immediate bad bot block and still use iptables for the longer-term block. In a nutshell the scripts might work something like this (very high level overview):
-bad bot trips the trap
-gets immediately blocked by dynamically generated .htaccess
-some script runs at predetermined interval and gets info from .htaccess and/or some other log files and blocks caught bad bots via iptables. .htaccess gets dynamically rewritten to its default (empty or otherwise) values.

With this you get bad bots banned immediately, .htaccess doesn't get too big (etc.), and the "main blocking" is still done at the iptables level.

Note: I haven't done this; it occurred to me when reading this thread, so take it with a grain of salt...

trillianjedi

11:09 pm on Nov 24, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

The CRON is currently running once every minute, but yes, the ideal would be to have this work in realtime. The idea behind part PERL part PHP is security related (Apache has no rights to call PERL, so I can let it exec stuff).

-bad bot trips the trap
-gets immediately blocked by dynamically generated .htaccess
-some script runs at predetermined interval and gets info from .htaccess and/or some other log files and blocks caught bad bots via iptables. .htaccess gets dynamically rewritten to its default (empty or otherwise) values.

That's an excellent suggestion, and one I'll have to consider implementing. It should be quite easy actually, as the .htaccess could be regenerated with a database dump....
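To make the two-stage idea concrete, here is a minimal Python sketch that works on the .htaccess text only. Everything in it is an assumption for illustration: the marker comments, the function names, and the Apache 2.2-style "Deny from" lines (newer Apache versions use Require directives instead). Nobody in this thread has posted or tested such code.

```python
import ipaddress

BEGIN = "# BEGIN bot-trap"   # hypothetical markers around the generated section
END = "# END bot-trap"

def _split(text):
    """Return (lines before the section, IPs inside it, lines after it)."""
    lines = text.splitlines()
    if BEGIN in lines and END in lines:
        start, end = lines.index(BEGIN), lines.index(END)
        ips = [line.split()[-1] for line in lines[start + 1:end]
               if line.startswith("Deny from ")]
        return lines[:start], ips, lines[end + 1:]
    return lines, [], []

def _join(before, ips, after):
    section = [BEGIN] + ["Deny from " + ip for ip in ips] + [END]
    return "\n".join(before + section + after) + "\n"

def add_block(text, ip):
    """Stage 1, called by the trap page: block the IP immediately."""
    ip = str(ipaddress.ip_address(ip))  # raises ValueError on anything odd
    before, ips, after = _split(text)
    if ip not in ips:
        ips.append(ip)
    return _join(before, ips, after)

def drain_blocks(text):
    """Stage 2, called from cron: empty the section and hand back the IPs
    so they can be moved into the firewall."""
    before, ips, after = _split(text)
    return _join(before, [], after), ips

if __name__ == "__main__":
    htaccess = add_block("RewriteEngine On\n", "203.0.113.9")
    print(htaccess)
    htaccess, caught = drain_blocks(htaccess)
    print(caught)
```

A real version would also need file locking around the read-modify-write, since the trap page and the cron job can touch .htaccess at the same moment.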

pixeltierra

6:59 am on Nov 25, 2006 (gmt 0)

10+ Year Member

Are there any other ways to disguise links so as not to be seen by humans on screen? Anything obvious that I've missed?

You could use variations on CSS display:none, visibility:hidden, height:1px, etc.

incrediBILL

2:58 am on Dec 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

The only problem with firewall blocking vs .htaccess is when you block an IP that's on a shared server with maybe 100s of domains, your server is no longer accessible to that server. Therefore you aren't able to send/receive email to potential clients also hosting there, and you're blocking RSS feeds, etc..

That's why I always advocate using a real-time script with a banned IP database or .htaccess because then you're strictly blocking the web server access, nothing more, and people with shared accounts can't update the server firewall in the first place.

If you must use the firewall itself, consider adding "--dport 80" or whatever it is and restrict the block to just Apache.

If you want to get tricky, you could do a combination of real-time script and .htaccess and append blocks in .htaccess after several page requests and then prune the .htaccess file every 24 hours and start over. Then you would have the best of all worlds catching it in real-time, hardening the server after it looks persistent, not overflowing .htaccess with excessive blocks other than for the day, and it works with shared hosting accounts.
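On the "--dport 80" point, a short Python sketch of building such a rule as an argument list. The helper name and the example address are made up; the one thing to know is that iptables only accepts --dport together with a protocol match such as -p tcp.

```python
import ipaddress

def drop_rule(ip, port=80):
    """Build an iptables argument list that drops only TCP traffic from
    `ip` to `port`, leaving mail and other services reachable."""
    ip = str(ipaddress.ip_address(ip))  # validate before it nears a command line
    return ["iptables", "-I", "INPUT", "-p", "tcp", "-s", ip,
            "--dport", str(port), "-j", "DROP"]

if __name__ == "__main__":
    # Printed only; passing the list to subprocess.run() would need root
    print(" ".join(drop_rule("203.0.113.9")))
```

Passing the list form to subprocess.run() avoids the shell entirely, which matters when the address came from a database that a web script writes to.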

incrediBILL

3:10 am on Dec 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Oh yes, that little prefetch issue...

RewriteEngine On
# Match the request header directly; only prefetch requests carry X-moz: prefetch
RewriteCond %{HTTP:X-moz} ^prefetch [NC]
RewriteRule .* - [F,L]

Tastatura

3:15 am on Dec 14, 2006 (gmt 0)

10+ Year Member

If you want to get tricky, you could do a combination of real-time script and .htaccess and append blocks in .htaccess after several page requests and then prune the .htaccess file every 24 hours and start over.

Bill, how is this different from what I suggested? (My question is without an adversarial tone, just curiosity, as I can't seem to see the difference between what we both said.)

incrediBILL

3:39 am on Dec 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I don't think there is any difference except I was in a hurry and skipped your post earlier ;)

Jordo needs a drink

9:49 pm on Dec 14, 2006 (gmt 0)

10+ Year Member

Oh yes, that little prefetch issue...
RewriteEngine On
# Match the request header directly; only prefetch requests carry X-moz: prefetch
RewriteCond %{HTTP:X-moz} ^prefetch [NC]
RewriteRule .* - [F,L]

incrediBILL,
For the .htaccess almost illiterate people such as myself, what does that do?

I have a way of handling the prefetchers that is working for me, but am open to new or different ways.

Also, you can add:
User-agent: Fasterfox
Disallow: /

to your robots.txt to keep Firefox from prefetching on your site.

[edited by: Jordo_needs_a_drink at 9:59 pm (utc) on Dec. 14, 2006]

Jordo needs a drink

9:54 pm on Dec 14, 2006 (gmt 0)

10+ Year Member

Also,
I've been thinking about adding a javascript link to the access denied pages, that when clicked, will automatically unban the IP. Or some other type of captcha.

The reason for this is to unban the dynamic IPs that you end up blocking.

Any thoughts on this?

incrediBILL

11:04 pm on Dec 14, 2006 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

My code should block when it sees the prefetch header used by Firefox or Google Web Accelerator.

This page has more information:
[webaccelerator.google.com...]

The robots.txt entry is definitely simpler, but won't stop Google's Web Accelerator.

Jordo needs a drink

12:12 am on Dec 15, 2006 (gmt 0)

10+ Year Member

I guess what I'm asking is: are you completely blocking users who are prefetching from your website, or are you only blocking their prefetches with the .htaccess method?

I use the robots.txt for FF (since it only blocks the prefetches themselves, not the users), but have added something else to my script to keep google prefetchers from being blocked by hitting the bot trap.

I don't really want to completely block prefetchers from my sites entirely, although I understand a lot of webmasters do, because of bandwidth, etc.

If the .htaccess method you posted only blocks the prefetches, but the users themselves can still access the site, then I'd like to try that.

Edit - I just read up on what [F,L] is and read the link you posted again, and the lightbulb just went on. I didn't realize that X-moz is set only on the prefetches themselves; for some stupid reason I was thinking the user always had it set...

[edited by: Jordo_needs_a_drink at 12:20 am (utc) on Dec. 15, 2006]
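The same distinction can be made inside the trap script itself. Below is a minimal Python sketch, assuming the request headers arrive as a plain dict; the function names and the "refuse"/"ban" labels are invented for illustration and are not Jordo's actual method, which he did not post.

```python
def is_prefetch(headers):
    """True if the request carries the X-moz: prefetch header."""
    for name, value in headers.items():
        if name.lower() == "x-moz" and value.strip().lower() == "prefetch":
            return True
    return False

def trap_action(headers):
    """Decide what the trap page does with this request.

    "refuse" means answer 403 and forget about it; "ban" means log the IP
    for blocking. A prefetch is refused but never banned, so the person
    behind the browser can still reach the site normally.
    """
    return "refuse" if is_prefetch(headers) else "ban"

if __name__ == "__main__":
    print(trap_action({"User-Agent": "Mozilla/5.0", "X-moz": "prefetch"}))
    print(trap_action({"User-Agent": "SomeScraper"}))
```

Since any bot can send the same header, this only protects real prefetchers from the trap; it is not a way to tell good clients from bad ones.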