
Webmaster General Forum

    
Spider Traps and Honey Pots
Design Considerations
trillianjedi

Msg#: 3167082 posted 3:04 pm on Nov 24, 2006 (gmt 0)

So, I finally got around to building an automated system for blocking badly behaved bots. Took me a while as I've never really had bandwidth or server resource issues. Now the issue is primarily about scraping (and similar).

I've built Birdmans PHP version of the original PERL script found right here at WebmasterWorld, with a couple of changes (direct IP banning in the firewall rather than .htaccess is the main one) which I'll post in the PHP/PERL forum later on.

Now that I have the ban mechanism, what I need is a decent set of traps. I have built two, so far. One looks like this:-

<a href="/come_here_little_spider.html"></a>

The simplest type of no-anchor-text bait. You wouldn't expect many spiders to be that stupid, would you? You'd be surprised; I've clocked a few bots hitting this one already, within 24 hours of linking it.

The other is a simple 1x1 pixel invisible .gif:-

<a href="/come_here_little_spider.html"><img src="gif.gif"></a>

I'll be sprinkling these around various pages. Are there any other ways to disguise links so as not to be seen by humans on screen? Anything obvious that I've missed?

Should I give people a chance to read a "don't go any further message" or just ban mercilessly?

TJ

 

4string

Msg#: 3167082 posted 4:17 pm on Nov 24, 2006 (gmt 0)

I've been trying to do this, too. One problem I run into is users with web accelerators tripping the traps. There are some popular Firefox accelerator extensions.

I also wonder if non-English speakers try to download my whole site for translating rather than for the nefarious reasons I fear. I really don't like people trying to rip the whole site.

jdMorgan

Msg#: 3167082 posted 4:39 pm on Nov 24, 2006 (gmt 0)

It's a very good idea to incorporate exclusions for web accelerators, WAP translators, and major search engines -- The first two because they tend to pre-fetch blindly, and the latter just in case you make a mistake while editing robots.txt.
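A very rough sketch of that kind of whitelist check in PHP (the user-agent substrings are examples only, so build your own list from your logs; accelerators are better caught via their prefetch headers, as discussed further down the thread):

<?php

// Illustrative whitelist check, run before anything gets logged or banned.
// The substrings below are examples only; collect your own from your logs.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$good_bots = array('googlebot', 'slurp', 'msnbot', 'jeeves', 'mediapartners-google');

foreach ($good_bots as $needle) {
    if (stripos($ua, $needle) !== false) {
        exit; // looks like a major search engine, so never trap or ban it
    }
}

// ...otherwise carry on to the logging/banning code...

?>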

I hesitate to post my best bad-bot-baiting techniques because their operators may read here.

Jim

trillianjedi

Msg#: 3167082 posted 4:53 pm on Nov 24, 2006 (gmt 0)

Thanks guys - and yes Jim you're right, it's not in the interest of anyone with good quality answers to post them ;)

So maybe we can instead look at these aspects of allowing good bots in.

The engines that you do want to index you are easy - most have well-known user-agents. What about the WAP gateways and pre-fetchers? Where do you obtain a list?

Perhaps a case of running the bot-trap for a few weeks without actually doing the IP bans, just to see what you catch?

What are people's methods for collating user-agent data?

TJ

jdMorgan

Msg#: 3167082 posted 5:40 pm on Nov 24, 2006 (gmt 0)

Yes, track the UAs for a while, logging them to a file (an easy mod to the script), then import into a spreadsheet, sort, de-duplicate, and look for and remove revision numbers and other unnecessary substrings.
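In PHP terms the collapsing step might look something like this (a rough sketch, assuming one raw user-agent string per line in the log; ua.log is just an example filename):

<?php

// Rough sketch: collapse a raw user-agent log into a de-duplicated, sorted list.
// Assumes one UA string per line in ua.log (example filename).
$lines = file('ua.log', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$seen = array();
foreach ($lines as $ua) {
    // Strip version/revision numbers so e.g. "Foo/1.2.3" and "Foo/1.3.0" collapse together
    $ua = preg_replace('/[0-9][0-9.]*/', '', $ua);
    $ua = trim(preg_replace('/\s+/', ' ', $ua));
    $seen[$ua] = true;
}

$uas = array_keys($seen);
sort($uas);
echo implode("\n", $uas), "\n";

?>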

Jim

henry0

Msg#: 3167082 posted 6:20 pm on Nov 24, 2006 (gmt 0)

Hi TJ,
How will we know when you have posted it in the PHP forum (aside from refreshing every other hour!), since auto-notification is still in a "being fixed" state?

I do not want to miss the PHP version. I recently had three sites send me a bandwidth alarm because of Google going crazy over a PHP calendar of events (eating the monthly bandwidth in a matter of one week). It's my server, but I did set quotas and would like them respected.

As a quick fix I now rely on robots.txt.
Thanks

balam

Msg#: 3167082 posted 6:23 pm on Nov 24, 2006 (gmt 0)

> good quality answer

You must look beyond the UA, Grasshopper. If you do, you will know what I speak of.

It's fun being cryptic - I don't do it near enough!

trillianjedi

Msg#: 3167082 posted 7:55 pm on Nov 24, 2006 (gmt 0)

Henry, as it's pretty straightforward I'll post it in here for the moment. However, it's a two-parter, and it does rely on PERL for the second part (the blocking) but you could probably convert this to PHP using shell_exec (having put your mind to the security implications of that first).

naughty_bot.php is a file that will never get called by anything other than a bot. It's a very simple script that logs the IP, datetime and user-agent string into a MySQL table.

naughty_bot.php


<?php

// Bad bot detection by TrillianJedi
// Released to open-source GPL 2006
// See threads at www.webmasterworld.com for info
// and background
//
// Build 1.0
// BETA and completely experimental - USE AT YOUR OWN RISK
//
// This is a file that nothing should be looking at, so
// if we're in here, we're a naughty person. In that case
// we log their details in the DB which the PERL script will
// pickup later from CRON and institute an IP ban

$server = 'localhost';
$username = 'your_db_username';
$password = 'your_db_password';
$dbname = 'your_db_name';

$connection = mysql_connect($server, $username, $password) or die("Cannot make the connection");
$db = mysql_select_db($dbname, $connection) or die("Cannot connect to the database");

// Escape the values before they go into the query; bots can (and do)
// send deliberately malformed user-agent strings
$ip = mysql_real_escape_string($_SERVER['REMOTE_ADDR'], $connection);
$ua = mysql_real_escape_string($_SERVER['HTTP_USER_AGENT'], $connection);

$sql = "INSERT INTO naughty_bots VALUES (INET_ATON('$ip'), '$ua', NOW())";
mysql_query($sql, $connection);

echo "Gotcha...\n";
echo "IP : ".$_SERVER['REMOTE_ADDR']."\n";
echo "UA : ".$_SERVER['HTTP_USER_AGENT']."\n";

?>

ipblock.pl is a PERL script, called by CRON every couple of minutes, that examines the database table for entries and, if there are any, blocks them at iptables level. If you use APF or another firewall, substitute as necessary.

ipblock.pl


#!/usr/bin/perl

use DBI;

my $dsn = 'DBI:mysql:your_DB_name:localhost';
my $db_user_name = 'your_db_username';
my $db_password = 'your_db_password';
my $dbconn = DBI->connect($dsn, $db_user_name, $db_password);

my $query = $dbconn->prepare(qq{
SELECT INET_NTOA(IP) FROM naughty_bots
});
$query->execute();

while (my ($ip) = $query->fetchrow_array())
{
    print "iptables -I INPUT -s $ip/24 -j DROP\n";
    $dbconn->do("DELETE FROM naughty_bots WHERE IP=INET_ATON('$ip')");
    # system() rather than exec(): exec would replace this process, so only
    # the first IP would ever get blocked and the loop would never finish.
    # Note that /24 drops the offender's whole class C, not just the one IP.
    system("iptables -I INPUT -s $ip/24 -j DROP");
}

$query->finish();
$dbconn->disconnect();

The idea behind this is to (a) log everything to a DB (my version of the PHP script is slightly more advanced in its logging, but that's easy when you use a DB, so do whatever suits you best) and (b) block at iptables level rather than using .htaccess, which just makes more sense to me from a server-resources point of view, especially if .htaccess would otherwise start filling up quickly.

This is a work in progress; it's changing daily. I'll probably move the code part of this thread over to the PHP forum at some point. Parts of these scripts will require table locks.

The database table structure is:-


| IP        | int(11) unsigned |
| UserAgent | varchar(64)      |
| DateTime  | datetime         |

You want to make IP a unique column.
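For reference, the equivalent CREATE TABLE is roughly this (the names match the scripts above; column sizes are just what is shown in the structure, so adjust to taste):

CREATE TABLE naughty_bots (
  `IP` INT(11) UNSIGNED NOT NULL,
  `UserAgent` VARCHAR(64),
  `DateTime` DATETIME,
  UNIQUE KEY (`IP`)
);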

You then probably want to start cloaking your robots.txt file, as per the way that BT does it here (originally a PERL script, which I converted to PHP):-


<?php

header("Content-Type: text/plain; charset=UTF-8");

$agent = $_SERVER['HTTP_USER_AGENT'];

if (isset($_GET['view']) && $_GET['view'] == "producecode") {
include("robots.txt");
exit;
}

# Simple agent check to keep the snoopy happy and to keep bad bots out and good bots in.

if (preg_match('/slurp/i', $agent)
|| preg_match('/msnbot/i', $agent)
|| preg_match('/Jeeves/i', $agent)
|| preg_match('/googlebot/i', $agent)
|| preg_match('/Mediapartners-Google/i', $agent)
) {

include ('robots.txt');
}

else {
echo "User-agent: *\n";
echo "Disallow: /\n";

}

?>

You do not want to start calling the PERL script from CRON until you can see in your DB that the bots you do want crawling you are no longer appearing in the naughty_bots table.
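Once that's the case, the CRON entry itself is the usual sort of thing, run as a user that is actually allowed to call iptables (i.e. root); the path is just a placeholder:

# run the blocker every couple of minutes; adjust the path and frequency to suit
*/2 * * * * /usr/bin/perl /path/to/ipblock.pl >> /var/log/ipblock.log 2>&1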

TJ

henry0

Msg#: 3167082 posted 8:30 pm on Nov 24, 2006 (gmt 0)

Hi TJ,
Thanks a lot :)
This is great work that calls for some in-depth thought!

Cannot wait to see it in PHP and read comments about it
Henry

trillianjedi

Msg#: 3167082 posted 9:06 pm on Nov 24, 2006 (gmt 0)

Henry - there is nothing new here other than my use of a DB and firewall rather than .htaccess.

All of these things are well documented here by Brett, jdMorgan and key_master in the PHP/PERL forums.

TJ

Tastatura

Msg#: 3167082 posted 10:53 pm on Nov 24, 2006 (gmt 0)

The comment below makes sense only if I've correctly understood what your scripts are doing with bad bots (I think good-bot handling is straightforward):
- Bad bot trips the trap
- The bad bot's various details get logged
- Another script runs every few minutes (or at whatever frequency you choose) and updates iptables to block the bad bot

The potential hole I see is that the blocking only happens every few minutes (or at whatever frequency the "check the log and update iptables" script runs). The worst-case scenario would then be: checking script runs, bad bot enters, roams free for X minutes, checking script runs again, bad bot gets blocked. During those X minutes the bad bot is free to do its thing. Technically you could run the checking script more often than once a minute, but that might take a toll on server resources.

I personally prefer things to run on events rather than on a timetable (i.e. bad bot shows up and is blocked right away, vs. bad bot shows up and isn't blocked until the script runs at some predetermined time). I do see your point about .htaccess getting big in a hurry, though.

I think it might be possible to use .htaccess for a short-term, immediate bad-bot block and still use iptables for the longer-term block. In a nutshell the scripts might work something like this (very high-level overview):
- bad bot trips the trap
- it gets immediately blocked by a dynamically generated .htaccess
- some script runs at a predetermined interval, pulls the entries from .htaccess and/or other log files, blocks the caught bots via iptables, and rewrites .htaccess back to its default (empty or otherwise) state.

With this you get bad bots banned immediately, .htaccess doesn't get too big, and the "main blocking" is still done at the iptables level.

Note: I haven't done this; it occurred to me while reading this thread, so take it with a grain of salt...

trillianjedi

Msg#: 3167082 posted 11:09 pm on Nov 24, 2006 (gmt 0)

The CRON is currently running once every minute, but yes, the ideal would be to have this work in real time. The idea behind the part-PERL, part-PHP split is security-related: Apache has no rights to run the PERL script, so I can safely let that script exec stuff.

- bad bot trips the trap
- it gets immediately blocked by a dynamically generated .htaccess
- some script runs at a predetermined interval, pulls the entries from .htaccess and/or other log files, blocks the caught bots via iptables, and rewrites .htaccess back to its default (empty or otherwise) state.

That's an excellent suggestion, and one I'll have to consider implementing. It should be quite easy actually, as the .htaccess could be regenerated with a database dump....
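Something along these lines, perhaps (a rough sketch only: the table and column names match the scripts above, the .htaccess path is a placeholder, and it rewrites the whole deny block rather than appending):

<?php

// Rough sketch: regenerate an .htaccess deny block from the naughty_bots table.
// The file path is a placeholder, and note that this overwrites the whole file
// rather than appending to it.
$connection = mysql_connect('localhost', 'your_db_username', 'your_db_password') or die("Cannot make the connection");
mysql_select_db('your_db_name', $connection) or die("Cannot connect to the database");

$result = mysql_query("SELECT INET_NTOA(IP) AS ip FROM naughty_bots", $connection);

$rules = "Order Allow,Deny\nAllow from all\n";
while ($row = mysql_fetch_assoc($result)) {
    $rules .= "Deny from " . $row['ip'] . "\n";
}

// Write to a temp file and rename, so Apache never reads a half-written .htaccess
file_put_contents('/path/to/site/.htaccess.tmp', $rules);
rename('/path/to/site/.htaccess.tmp', '/path/to/site/.htaccess');

?>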

pixeltierra

Msg#: 3167082 posted 6:59 am on Nov 25, 2006 (gmt 0)

Are there any other ways to disguise links so as not to be seen by humans on screen? Anything obvious that I've missed?

You could use variations on CSS: display:none, visibility:hidden, height:1px, etc...
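For example, something like this keeps the trap link out of sight for humans while leaving it in the markup for a dumb crawler (purely illustrative, reusing the trap URL from the first post):

<div style="display:none">
<a href="/come_here_little_spider.html">spider trap</a>
</div>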

incrediBILL

Msg#: 3167082 posted 2:58 am on Dec 14, 2006 (gmt 0)

The only problem with firewall blocking vs. .htaccess is that when you block an IP that is a shared server hosting maybe 100s of domains, that server can no longer reach yours at all. You aren't able to send/receive email to potential clients also hosting there, you're blocking RSS feeds, etc.

That's why I always advocate using a real-time script with a banned-IP database or .htaccess: then you're strictly blocking web server access, nothing more, and people with shared hosting accounts couldn't update the server firewall in the first place anyway.

If you must use the firewall itself, consider adding "--dport 80" or whatever it is and restrict the block to just Apache.
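For example (the address is only a placeholder):

# drop only web traffic from the offending address; mail, DNS, etc. still get through
iptables -I INPUT -p tcp --dport 80 -s 203.0.113.45 -j DROP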

If you want to get tricky, you could do a combination of a real-time script and .htaccess: append a block to .htaccess after several trap page requests, then prune the .htaccess file every 24 hours and start over. That gives you the best of all worlds: you catch it in real time, you harden the server once the offender looks persistent, you never overflow .htaccess with more than a day's worth of blocks, and it works with shared hosting accounts.

incrediBILL

Msg#: 3167082 posted 3:10 am on Dec 14, 2006 (gmt 0)

Oh yes, that little prefetch issue...

RewriteEngine On
SetEnvIf X-moz prefetch HAS_X-moz=prefetch
RewriteCond %{ENV:HAS_X-moz} prefetch
RewriteRule .* - [F,L]

Tastatura

Msg#: 3167082 posted 3:15 am on Dec 14, 2006 (gmt 0)


If you want to get tricky, you could do a combination of a real-time script and .htaccess: append a block to .htaccess after several trap page requests, then prune the .htaccess file every 24 hours and start over.

Bill, how is this different from what I suggested? (My question isn't meant in an adversarial tone; just curiosity, as I can't seem to see the difference between what we both said.)

incrediBILL

Msg#: 3167082 posted 3:39 am on Dec 14, 2006 (gmt 0)

I don't think there is any difference except I was in a hurry and skipped your post earlier ;)

Jordo needs a drink

Msg#: 3167082 posted 9:49 pm on Dec 14, 2006 (gmt 0)

Oh yes, that little prefetch issue...

RewriteEngine On
SetEnvIf X-moz prefetch HAS_X-moz=prefetch
RewriteCond %{ENV:HAS_X-moz} prefetch
RewriteRule .* - [F,L]

incrediBILL,
For the .htaccess almost illiterate people such as myself, what does that do?

I have a way of handling the prefetchers that is working for me, but am open to new or different ways.

Also, you can add:
User-agent: Fasterfox
Disallow: /

to your robots.txt to keep the Fasterfox extension from prefetching on your site.

[edited by: Jordo_needs_a_drink at 9:59 pm (utc) on Dec. 14, 2006]

Jordo needs a drink

Msg#: 3167082 posted 9:54 pm on Dec 14, 2006 (gmt 0)

Also,
I've been thinking about adding a JavaScript link to the access-denied pages that, when clicked, will automatically unban the IP. Or some other type of captcha.

The reason for this is to unban the dynamic IPs that you end up blocking.
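As a very rough sketch against the naughty_bots table from earlier in the thread, the unban page itself could be as simple as this; it assumes the ban is enforced at the script or .htaccess level (so a blocked visitor can still reach the page at all) and that a captcha check, not shown here, has already been passed:

<?php

// Sketch of a self-service unban page. Assumes bans are enforced at the
// script/.htaccess level so a blocked visitor can still load this page,
// and that a captcha (not shown) has already been passed.
$connection = mysql_connect('localhost', 'your_db_username', 'your_db_password') or die("Cannot make the connection");
mysql_select_db('your_db_name', $connection) or die("Cannot connect to the database");

$ip = mysql_real_escape_string($_SERVER['REMOTE_ADDR'], $connection);
mysql_query("DELETE FROM naughty_bots WHERE IP=INET_ATON('$ip')", $connection);

echo "Your address has been removed from the block list.\n";

?>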

Any thoughts on this?

incrediBILL

Msg#: 3167082 posted 11:04 pm on Dec 14, 2006 (gmt 0)

My code should block when it sees the prefetch header used by Firefox or Google Web Accelerator.

This page has more information:
[webaccelerator.google.com...]

The robots.txt entry is definitely simpler, but won't stop Google's Web Accelerator.
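If you would rather do the same check inside a script than in .htaccess, the PHP equivalent is a short sketch like this (HTTP_X_MOZ is simply how PHP exposes the X-moz request header):

<?php

// Refuse prefetch requests outright; normal requests from the same user still work.
if (isset($_SERVER['HTTP_X_MOZ']) && stripos($_SERVER['HTTP_X_MOZ'], 'prefetch') !== false) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}

?>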

Jordo needs a drink

Msg#: 3167082 posted 12:12 am on Dec 15, 2006 (gmt 0)

I guess what I'm asking is: are you completely blocking users who prefetch from your website, or only blocking their prefetch requests with the .htaccess method?

I use the robots.txt entry for FF (since it only blocks the prefetches themselves, not the users), but I have added something else to my script to keep Google prefetchers from being blocked by hitting the bot trap.

I don't really want to completely block prefetchers from my sites entirely, although I understand a lot of webmasters do, because of bandwidth, etc.

If the .htaccess method you posted only blocks the prefetches, but the users themselves can still access the site, then I'd like to try that.

Edit - I just read up on what [F,L] does and re-read the link you posted, and the lightbulb just went off. I didn't realize that X-moz is set only on the prefetch requests themselves; for some stupid reason I was thinking the user always had it set...

[edited by: Jordo_needs_a_drink at 12:20 am (utc) on Dec. 15, 2006]
