Forum Moderators: coopster
I am posting a revised version of this PHP spider trap [webmasterworld.com], because of a flaw that was recognized by our local Apache Web Server [webmasterworld.com] guru. Thanks, jdMorgan [webmasterworld.com]!
Basically, it needed file locking to prevent the .htaccess file from being opened by one request while another is still writing to it. This could happen on a busy server.
Also, before I move on, I'd like to extend the credit for this script to Key_Master [webmasterworld.com]. Key_Master posted the original bad bot script [webmasterworld.com], written in Perl. There is also a modified version [webmasterworld.com].
How it Works
When the file getout.php is accessed, it opens your .htaccess file and appends the visiting bad bot's IP address to the list of banned IPs.
Before you do anything, you'll need to disallow the file (getout.php) in your robots.txt file. Any decent bot should read and obey this file. Do not use the spider trap for a few days after adding the robots.txt disallow; you have to give the good bots enough time to read the amended robots.txt. If you start using the trap right away, you stand a chance of banning good spiders!
Example robots.txt disallow:
User-agent: *
Disallow: /getout.php
Next, create a new folder in your root folder and name it /trap/. You can name it anything, really, but /trap/ is what I have in the script, so you'll need to alter the script if you name it differently.
Chmod your .htaccess file to 644 and chmod getout.php to 755. You should put getout.php in the root folder. Or, simply change the robots.txt file to reflect the location of the file if you put it elsewhere.
Add these lines to your .htaccess file at the very top.
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
Ok, now you are ready to add some invisible links to your pages to catch the misbehaving bots. Don't forget to wait a few days for the good bots to catch the updated robots.txt file.
You can use a 1x1 transparent .gif for your links like so:
<a href="/getout.php" onclick="return false">
<img src="/clear.gif" alt="" /></a>
There are other ways as well: CSS absolute positioning, the display property, or the visibility property. jdMorgan also suggests adding links within <!--comment tags-->.
getout.php
Any PHP peeps out there, feel free to suggest ways to streamline this code :)
<?php
$lock_dir = $_SERVER["DOCUMENT_ROOT"] . "/trap/lock";
$filename = $_SERVER["DOCUMENT_ROOT"] . "/.htaccess";
$bad_bot_ip = str_replace(".", "\.", $_SERVER["REMOTE_ADDR"]);
$content = "SetEnvIf Remote_Addr ^" . $bad_bot_ip . "$ getout\r\n";
function make_lock_dir(){
global $lock_dir;
$key = @mkdir($lock_dir, 0777);
$i = 0;
while ($key === FALSE && $i++ < 20) {
clearstatcache();
usleep(rand(5,85));
$key = @mkdir($lock_dir, 0777);
}
return $key;
}
function write_ban(){
global $filename, $bad_bot_ip, $content, $lock_dir;
$handle = fopen($filename, 'r');
$content .= fread($handle,filesize($filename));
fclose($handle);
$handle = fopen($filename, 'w+');
fwrite($handle, $content,strlen($content));
fclose($handle);
rmdir($lock_dir);
print "Goodbye!";
}
function stale_check(){
global $lock_dir;
if (fileatime($lock_dir) < time()-120){
rmdir($lock_dir);
if (make_lock_dir()!== False) write_ban();
} else {
exit;
}
}
if (make_lock_dir()!== False) {
write_ban();
} else {
stale_check();
}
?>
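Since suggestions are invited above: one way to streamline would be to drop the lock-directory dance and use PHP's flock() on the .htaccess file itself. A hedged sketch, not a tested drop-in replacement; append_ban() is my own helper name, and the ban line format mirrors the original script.

```php
<?php
// Sketch: flock()-based alternative to the mkdir() lock directory above.
// append_ban() is a hypothetical helper; ban line format matches the script.
function append_ban($filename, $remote_addr)
{
    $ip   = str_replace(".", "\\.", $remote_addr);   // escape dots for the SetEnvIf regex
    $line = "SetEnvIf Remote_Addr ^" . $ip . "$ getout\r\n";

    $handle = fopen($filename, "r+");
    if ($handle === FALSE) {
        return FALSE;
    }
    if (flock($handle, LOCK_EX)) {                   // other writers block here
        $current = stream_get_contents($handle);     // keep the existing rules
        rewind($handle);
        fwrite($handle, $line . $current);           // prepend the new ban
        fflush($handle);
        flock($handle, LOCK_UN);
    }
    fclose($handle);
    return TRUE;
}

if (isset($_SERVER["REMOTE_ADDR"])) {
    append_ban($_SERVER["DOCUMENT_ROOT"] . "/.htaccess", $_SERVER["REMOTE_ADDR"]);
    print "Goodbye!";
}
```

Note that flock() is advisory, so it only protects against other copies of this same script, which is all the lock directory did anyway.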
Enjoy! Thanks to Key_Master and jdMorgan!
[edited by: jatar_k at 4:39 pm (utc) on June 29, 2004]
[edit reason] Birdman requested edit [/edit]
To test the script without touching your live .htaccess, simply create a file named htaccess.txt and then change this line in the script:
$filename = $_SERVER["DOCUMENT_ROOT"] . "/.htaccess";
to
$filename = $_SERVER["DOCUMENT_ROOT"] . "/htaccess.txt";
Don't forget to chmod the test file to 644. To test it, just browse to yourdomain.com/getout.php! You should see the text, "Goodbye!".
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
I can't go so far as to say it's certain. I have used the trap on a few very well-ranking sites for many months now with no problems. I know quite a few others use similar scripts and have been using them for years.
I suppose if you are really worried, you could disallow the /trap/ folder and then put your hidden links on pages within that folder. Then, the good bots shouldn't even see them. Of course, you will still have to have at least one link into the /trap/ folder, but behaving bots should not follow it.
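For that variation, the robots.txt entry would look something like this (with the hidden links living on pages inside /trap/):

```
User-agent: *
Disallow: /trap/
```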
Another topic I should mention is WAP users. The various translators used to make html pages available to WAP have a behaviour you need to be aware of, and that is that they pre-fetch most if not all links on every page the user accesses. And since they are not robots, they don't read robots.txt. You may need to place an exception in .htaccess to prevent WAP proxies from banning themselves if your site sees much action from WAP.
Also, remember that you don't have to link directly to your bot script's URL in your pages. You can link to any URL you like, and then use mod_rewrite to internally rewrite those requests to your script. You can then use nice, tasty names like "e-mail", "login", and "members" and such, even though the site has no real pages like that. Remember to disallow these pseudo-pages in robots.txt as well. I suggest waiting several days after updating robots.txt before you put a new poison URL into service; some robots do not update their copy of your robots.txt frequently, so you need to give them a chance to pick up the changes.
Jim
At least until you're comfortable that you're not taking out important search spiders.
Therefore, I suggest "getout.php" should just say
<?php
$ipaddress=$_SERVER['REMOTE_ADDR'];
$useragent=$_SERVER['HTTP_USER_AGENT'];
mail("adminguy@example.com", "bad bot on $ipaddress", "Bot $useragent is going where it shouldn't. Consider banning.");
?>
you could also always add something like "http://www.example.com/banthisip.php?address=$ipaddress" to the body of the text and use the above script kindly provided.
Maybe even an rwhois URL in the email body as well so you can do a reverse lookup and make sure it isn't google or something being tricky..
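Putting those two suggestions together, the mail body might be built like this. Both URLs are placeholders: banthisip.php is a hypothetical endpoint you'd have to write yourself, and the lookup link stands in for whatever whois service you prefer.

```php
<?php
// Sketch: notification mail body with a ban link and a whois lookup link.
// Both example.com URLs are hypothetical placeholders.
function build_mail_body($ipaddress, $useragent)
{
    $body  = "Bot $useragent is going where it shouldn't. Consider banning.\n\n";
    $body .= "Ban it: http://www.example.com/banthisip.php?address=$ipaddress\n";
    $body .= "Look it up: http://www.example.com/whois.php?ip=$ipaddress\n";
    return $body;
}

// Usage, with the mail() script above:
// mail("adminguy@example.com", "bad bot on $ipaddress",
//      build_mail_body($ipaddress, $useragent));
```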
As Jim said, he and I have been using this for a year and a half. I have a bit busier site than Jim, and I can tell you that I rarely have a good bot hit it. If I do, it is because I added a new alias and did not leave an updated robots.txt file up long enough, or (in the case of msnbot) I had the syntax wrong (MSN will let you limit the frequency of its bot; when I added that, it caused a problem. MSN techs were very helpful in getting this fixed for me, BTW!)
RE the WAP proxies, add this after you grab the ip:
if ($visitor_ip =~ /^216\.239\.3[379]\.5$|^216\.239\.35\.4$/) {
print "Content-type: text/html\n\n";
print "<html>\n";
print "<head>\n";
print "<title>Forward On</title>\n<META NAME=\"robots\" CONTENT=\"NOINDEX,NOFOLLOW\">\n";
print "</head>\n";
print "<body>\n";
print "<p><b>We had an error.<BR>Please return to continue!</b></p>\n";
print "</body>\n";
print "</html>\n";
exit;
}
else {
and the WAP proxies will not get banned. That is the Perl version; I am not sure how you will update that for PHP...
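For what it's worth, a rough PHP translation of that Perl pass might look like this. is_wap_proxy() is my own helper name, and the IP list is copied from the Perl snippet; verify those gateway addresses are still current before relying on them.

```php
<?php
// Sketch: PHP version of the Perl WAP-proxy pass above.
// The IPs come from the Perl snippet; check they are still current.
function is_wap_proxy($ip)
{
    return preg_match('/^216\.239\.3[379]\.5$|^216\.239\.35\.4$/', $ip) === 1;
}

if (isset($_SERVER["REMOTE_ADDR"]) && is_wap_proxy($_SERVER["REMOTE_ADDR"])) {
    header("Content-type: text/html");
    echo "<html>\n<head>\n<title>Forward On</title>\n";
    echo "<meta name=\"robots\" content=\"NOINDEX,NOFOLLOW\">\n</head>\n<body>\n";
    echo "<p><b>We had an error.<br>Please return to continue!</b></p>\n";
    echo "</body>\n</html>\n";
    exit; // skip the ban code entirely for WAP gateways
}
// ...otherwise fall through to the ban code.
```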
And Jim's addition of flock is great. A LOT of credit goes to Key_Master for the original script, too; that is what got Jim and me interested in this project initially.
Another point I notice on my busier site: I get 5-10 bans a day. Not sure how much it slows things down, but I wipe out all bans over two weeks old. I use the same script for about 5 sites, all of which feed a single ban list, so I added this:
# Set Date
$date = scalar localtime ( time );
# Write banned IP to .htaccess file
open(HTACCESS,">".$rootdir."/bad_ip.txt") || die $!;
flock(HTACCESS,2);
seek(HTACCESS,0,0);
print HTACCESS "\^".$visitor_ip."\$\n\# $date (NAME OF SITE)\n";
foreach $deny_ip (@htaccess) {
print HTACCESS $deny_ip;
}
So I not only log the banned IP, but the date and time (whoops, caps lock, sorry!), and also the site that the offender hit. A lot of times, you will see these guys crawling sites, just adding a 1 to the IP and doing it again... so my system-wide ban stops them right off! (In addition to having this on all my sites, the first IP of my block will send ANY hits to ban!)
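A sketch of how the datestamped, multi-site ban list plus the two-week rule might look in PHP. The pipe-separated file format here is my own invention for illustration, not the format the Perl snippet above writes; adapt to taste.

```php
<?php
// Sketch: shared, datestamped ban list with a two-week expiry.
// Format (my own, for illustration): "1.2.3.4|2004-06-29 16:39:00|SITENAME"
function add_ban($filename, $ip, $site)
{
    $line = $ip . "|" . date("Y-m-d H:i:s") . "|" . $site . "\n";
    file_put_contents($filename, $line, FILE_APPEND | LOCK_EX);
}

function prune_bans($filename, $max_age_days = 14)
{
    $keep = array();
    foreach (file($filename) as $line) {
        $parts = explode("|", trim($line));
        // Keep entries newer than the cutoff (and anything we can't parse)
        if (count($parts) < 3 || strtotime($parts[1]) > time() - $max_age_days * 86400) {
            $keep[] = $line;
        }
    }
    file_put_contents($filename, implode("", $keep), LOCK_EX);
}
```

Run prune_bans() from a cron job, or cheaply at the top of the trap script itself.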
So I just wanted to point out there is a LOT you can do with this to protect your sites, and it works really well!
Dave
Accidentally banning good bots isn't the only concern when you're adding deny-from lines automatically.
Another problem that can develop is that you can end up banning ISP-owned dynamic IPs used by people running email grabbers, scrapers, click agents, etc., from their own PC. In which case it's only a bad bot until they log off, after which it's a legitimate surfer.
Yes, that is a concern... which is one of the reasons I datestamp all the bans. I spent some time whois-ing all the IPs, and found that there were not enough possible "real" users in the mix for me to care about, so I came up with my two-week rule. That worked for me. YMMV, of course! However, I think you will find, when you analyze your logs, that this catches many more than you thought might be there (probably just JR wanna-be hackers, really, running automated scripts), but it does help keep them from affecting your site as a whole.
The beauty of this script is that it is a building block: you can go and do what is right for your site. You can easily mod it to add a pass for WAP proxies (as I have done), use it site- or server-wide, and add all sorts of other things that are specific to your site. You can hide it under any number of "juicy" names, and change those names monthly! You can also mod it so it does not automatically ban, but sends any user IP to a "You Have Been Banned" page, giving them instructions on how to contact you for review... so you can do that, too!
It is not a fix wholly by itself, either; it should be used with other methods of protection.
Cheers!
dave
1. Why not use <meta name="robots" content="noindex,nofollow"> in addition to a robots.txt entry? The benefit being there's no need to wait a few days / weeks (and hope!) for the good bots to have read robots.txt.
2. I think a better method for providing the 'invisible' link would be to use:
<a href="/getout.php" onclick="return false" style="display:none">Email Addresses For Harvesting!</a>
3. I rather like the idea of making life difficult for spammers and [blibbleblobble.co.uk...] seems like a good solution. Any opinion on pros / cons of using this in conjunction with Birdman's solution?
It is not. The script on the link you posted should not be used, at least not the way it is on that page. What it does is feed the spambots made-up addresses like, say:
Abigail.Altemus (at) nytimes.com
Now, consider a person by the name of Abigail Altemus getting a job at the NY Times. Or, in general, consider the attitude towards you from the sites in that list ('yahoo.com', 'microsoft.com', 'msn.com', 'ntl.com', 'msdn.org', 'fbi.gov', 'ftc.gov', 'nytimes.com', 'yahoo.fr', 'yahoo.de', 'aol.com') for generating excessive spam to their domains.
Do you see why this is wrong? This script increases spam, by providing valid email addresses, even if made up. It does not do what the writer intended, rather, it does exactly what spam-scripts do, only it's a limited version.
Do you see why this is wrong?
I do and I did when I read it - just forgot to add that comment when I posted. The principle still applies, just replace the potentially valid addresses with 'bilbo.unlikelysurname@notmuchchance34.com', etc.
I've emailed the site owner to suggest he changes the script accordingly.
I know that the fact that nobody else in this thread has the same problem means that it must be ME who's doing something wrong. But I have no clue what it is!
One of the few downsides of PHP (although some consider it a feature) is that it runs under user/group nobody.
Nobody generally needs a file to be set to 777 in order to do anything with it.
So either you're going to need .htaccess to be 777 (I agree, not a great idea) or you'll have to change the config to allow PHP more freedom. If you don't have access to the config then you're out of luck.
Unless PHP is allowed to access files outside of the web directory on your server, in which case you can put the .htaccess outside of the pub dir and chmod it anything you want w/o worrying too much.
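A quick way to see what you're up against before blaming the script: a small diagnostic sketch. report_access() is my own name, and posix_getuid() needs the POSIX extension (usually present on the Unix hosts discussed here).

```php
<?php
// Diagnostic sketch: report the uid PHP runs as (often "nobody", per the
// discussion above) and whether a given file is writable by that process.
function report_access($target)
{
    if (function_exists("posix_getuid")) {
        echo "PHP process uid: " . posix_getuid() . "\n";
    }
    return is_writable($target);
}

// Usage (path assumed): report_access($_SERVER["DOCUMENT_ROOT"] . "/.htaccess");
```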
You might want to consider using perl instead. It generally runs under the domain account user and group and can therefore do a lot more.
In fact if the server recognizes that the script created the file in question during the current incarnation it should be able to do read/writes even at default (644) permissions.
The nobody/nobody user/group that PHP runs under has always been a mystery to me. You'd think that security is something you'd leave up to the programmer instead of forcing it on them.
Of course the fact that it's harder for a beginning programmer to write a bad script in PHP than it is in any other language is a big part of why PHP has become so popular, so what do I know? :-)
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
<Files *>
order deny,allow
deny from 111.111.111.111
allow from all
</Files>
(111.111.111.111 is just an IP I made up for the example.)
Or should you just put your
deny from 111.111.111.111
inbetween the first example, so it looks like this...
<Files *>
order deny,allow
deny from env=getout
deny from 111.111.111.111
allow from env=allowsome
</Files>
:)