
PHP Server Side Scripting Forum

Updated PHP Bad Bot Script
AKA: Spider Trap
Birdman




msg:1297822
 12:54 pm on Jun 29, 2004 (gmt 0)

Hello everyone,

I am posting a revised version of this PHP spider trap [webmasterworld.com], because of a flaw that was recognized by our local Apache Web Server [webmasterworld.com] guru. Thanks, jdMorgan [webmasterworld.com]!

Basically, it needed file locking to prevent the .htaccess file from being opened by a second request while it is already being written to. This could happen on a busy server.
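(Side note: the same protection could also be achieved with PHP's built-in flock() instead of the lock directory the script below uses. Here is a minimal, untested sketch for comparison only; unlike the posted script, it appends the ban line rather than prepending it.)

<?php
// Sketch only: flock()-based alternative to the lock directory.
// Appends the ban line to .htaccess while holding an exclusive lock.
$filename = $_SERVER["DOCUMENT_ROOT"] . "/.htaccess";
$line = "SetEnvIf Remote_Addr ^"
      . str_replace(".", "\.", $_SERVER["REMOTE_ADDR"])
      . "$ getout\r\n";

$handle = fopen($filename, 'a');            // append mode, no truncation
if ($handle && flock($handle, LOCK_EX)) {   // block until the lock is ours
    fwrite($handle, $line);
    flock($handle, LOCK_UN);
}
if ($handle) fclose($handle);
print "Goodbye!";
?>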

Also, before I move on, I'd like to extend credit for this script to Key_Master [webmasterworld.com]. Key_Master posted the original bad bot script [webmasterworld.com], written in Perl, and there is a modified version [webmasterworld.com] as well.

How it Works
When the file getout.php is accessed, it opens your .htaccess file and adds the visitor's (bad bot's) IP address to the list of banned IPs.

Before you do anything, you'll need to disallow the file (getout.php) in your robots.txt file. Any decent bot should be reading and obeying this file. Do not put the spider trap into use for a few days after adding the robots.txt disallow; you have to give the good bots enough time to read the amended robots file. If you start using the trap right away, you stand a chance of banning good spiders!

Example robots.txt disallow:
User-agent: *
Disallow: /getout.php

Next, create a new folder in your root folder and name it /trap/. You can name it anything, really, but that's what I have in the script, so you'll need to alter the script if you name it differently.

Chmod your .htaccess file to 644 and chmod getout.php to 755. Put getout.php in the root folder, or simply change the robots.txt entry to reflect its location if you put it elsewhere.

Add these lines to your .htaccess file at the very top.
SetEnvIf Request_URI "^(/403.*\.htm¦/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

Ok, now you are ready to add some invisible links to your pages to catch the misbehaving bots. Don't forget to wait a few days for the good bots to catch the updated robots.txt file.

You can use a 1x1 transparent .gif for your links like so:
<a href="/getout.php" onclick="return false">
<img src="/clear.gif" /></a>

There are other ways as well, such as using CSS absolute positioning, the display property, or the visibility property. jdMorgan also suggests adding links within <!--comment tags-->.

getout.php
Any PHP peeps out there, feel free to suggest ways to streamline this code :)
<?php

$lock_dir = $_SERVER["DOCUMENT_ROOT"] . "/trap/lock";
$filename = $_SERVER["DOCUMENT_ROOT"] . "/.htaccess";

// Escape the dots so the IP forms a literal match in the SetEnvIf pattern.
$bad_bot_ip = str_replace(".", "\.", $_SERVER["REMOTE_ADDR"]);
$content = "SetEnvIf Remote_Addr ^" . $bad_bot_ip . "$ getout\r\n";

// Use a lock directory as a mutex: mkdir() is atomic, so only one request
// at a time can hold the lock. Retry up to 20 times before giving up.
function make_lock_dir(){
    global $lock_dir;
    $key = @mkdir($lock_dir, 0777);
    $i = 0;
    while ($key === FALSE && $i++ < 20) {
        clearstatcache();
        usleep(rand(5,85));
        $key = @mkdir($lock_dir, 0777);
    }
    return $key;
}

// Prepend the new ban line to .htaccess, then release the lock.
function write_ban(){
    global $filename, $bad_bot_ip, $content, $lock_dir;
    $handle = fopen($filename, 'r');
    $content .= fread($handle, filesize($filename));
    fclose($handle);
    $handle = fopen($filename, 'w+');
    fwrite($handle, $content, strlen($content));
    fclose($handle);
    rmdir($lock_dir);
    print "Goodbye!";
}

// If the lock could not be acquired, check whether it is stale (older than
// two minutes); if so, remove it and try once more, otherwise just exit.
function stale_check(){
    global $lock_dir;
    if (fileatime($lock_dir) < time()-120){
        rmdir($lock_dir);
        if (make_lock_dir() !== FALSE) write_ban();
    } else {
        exit;
    }
}

if (make_lock_dir() !== FALSE) {
    write_ban();
} else {
    stale_check();
}

?>

Enjoy! Thanks to Key_Master and jdMorgan!

[edited by: jatar_k at 4:39 pm (utc) on June 29, 2004]
[edit reason] Birdman requested edit [/edit]

 

Birdman




msg:1297823
 1:07 pm on Jun 29, 2004 (gmt 0)

One more thing. It's not a bad idea to test this script on a fake .htaccess file before putting it into use. I tested it thoroughly, but it's easy enough to test it on your server too.

Simply create a file named htaccess.txt and then change this line in the script:

$filename = $_SERVER["DOCUMENT_ROOT"] . "/.htaccess";

to

$filename = $_SERVER["DOCUMENT_ROOT"] . "/htaccess.txt";

Don't forget to chmod the test file to 644. To test it, just browse to yourdomain.com/getout.php! You should see the text, "Goodbye!".

Warboss Alex




msg:1297824
 2:17 pm on Jun 29, 2004 (gmt 0)

"_Morgan also suggests adding links within <!--comment tags-->."

I thought bots didn't read comments ..

Birdman




msg:1297825
 2:43 pm on Jun 29, 2004 (gmt 0)

It really depends on how the bot is configured. Some will look only for mailto: links, and others may look for anything that resembles a link (href or http, etc.).

ukgimp




msg:1297826
 2:58 pm on Jun 29, 2004 (gmt 0)

Nearly got this working. Great idea

I do struggle when I put this in the .htaccess

<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

it causes a 500 error.

The line $content is written ok.

Could I just use

$content ="RewriteCond %{$bad_bot_ip} [NC,OR]"

or similar

ukgimp




msg:1297827
 3:29 pm on Jun 29, 2004 (gmt 0)

I get the following error

[Tue Jun 29 16:24:47 2004] [alert] [client 127.0.0.1] c:/phpdev/www/.htaccess: order not allowed here

Birdman




msg:1297828
 3:30 pm on Jun 29, 2004 (gmt 0)

I apologize :( I left a line out of my first post. This is what you need to have in your .htaccess to get started (as always, replace the broken pipe ¦ with an unbroken one):

SetEnvIf Request_URI "^(/403.*\.htm¦/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

ukgimp




msg:1297829
 3:41 pm on Jun 29, 2004 (gmt 0)

Still getting the warning about the order.

Suggestions?

I do have other rewrite rules in there, just some URL rewrites to make things static.

Birdman




msg:1297830
 3:49 pm on Jun 29, 2004 (gmt 0)

Hmmm, it could have something to do with the httpd.conf settings.

You could try it this way (without the extra line from my last post):

<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=getout
</Files>

And yes, I think you could do it with RewriteCond directives too.
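If you go the mod_rewrite route, the $content line in getout.php would have to emit a RewriteCond/RewriteRule pair instead of a SetEnvIf line. A rough, untested sketch of that change, assuming RewriteEngine On is already present in your .htaccess:

<?php
// Sketch only, untested: have getout.php emit a mod_rewrite ban instead
// of a SetEnvIf line. Assumes "RewriteEngine On" already appears in the
// .htaccess and that the generated lines land above your other rules.
$bad_bot_ip = str_replace(".", "\.", $_SERVER["REMOTE_ADDR"]);
$content  = "RewriteCond %{REMOTE_ADDR} ^" . $bad_bot_ip . "$\r\n";
$content .= "RewriteRule .* - [F,L]\r\n";   // answer every request with 403 Forbidden
?>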

kapow




msg:1297831
 5:11 pm on Jun 29, 2004 (gmt 0)

Thanks so much Birdman :)
Can't wait to try it.

If I put invisible links on my index page to a page that is disallowed, I still have invisible links on my index page. Is it certain this will not invite a Google penalty for suspected SEO tricks?

Birdman




msg:1297832
 5:19 pm on Jun 29, 2004 (gmt 0)

Is it certain...?

I can't go so far as to say it's certain. I have used the trap on a few very well ranking sites for many months now with no problems. I know quite a few others use similar scripts and have been using them for years.

I suppose if you are really worried, you could disallow the /trap/ folder and then put your hidden links on pages within that folder. Then, the good bots shouldn't even see them. Of course, you will still have to have at least one link into the /trap/ folder, but behaving bots should not follow it.

jdMorgan




msg:1297833
 6:41 pm on Jun 29, 2004 (gmt 0)

Your site can get banned for hidden links or other SEO tricks intended to fool the search engines or users. The major search engines are aware that invisible gifs and such are used for hit counters and other legitimate uses. Also, as stated, the bad-bot files you are linking to must be disallowed in robots.txt to prevent legitimate spiders from following those links. Therefore, it is obvious to anyone inspecting your site that there is no intent to fool search engines or users.

Another topic I should mention is WAP users. The various translators used to make html pages available to WAP have a behaviour you need to be aware of, and that is that they pre-fetch most if not all links on every page the user accesses. And since they are not robots, they don't read robots.txt. You may need to place an exception in .htaccess to prevent WAP proxies from banning themselves if your site sees much action from WAP.

Also, remember that you don't have to link directly to your bot script's URL in your pages. You can link to any URL you like, and then use mod_rewrite to internally rewrite those requests to your script. You can then use nice, tasty names like "e-mail", "login", and "members" and such, even though the site has no real pages like that. Remember to disallow these pseudo-pages in robots.txt as well. I suggest waiting several days after updating robots.txt before you put a new poison URL into service; some robots do not update their copy of your robots.txt frequently, so you need to give them a chance to pick up the changes.

Jim

blaze




msg:1297834
 7:11 pm on Jun 29, 2004 (gmt 0)

Actually, instead of banning IPs, a wiser idea is just to send an email or some other notification that a bot is going where it shouldn't.

At least until you're comfortable that you're not taking out important search spiders.

Therefore, I suggest "getout.php" should just say

<?php
$ipaddress = $_SERVER['REMOTE_ADDR'];
$useragent = $_SERVER['HTTP_USER_AGENT'];   // note: HTTP_USER_AGENT, not USER_AGENT
mail("adminguy@example.com", "bad bot on $ipaddress", "Bot $useragent is going where it shouldn't. Consider banning.");
?>

You could also add something like "http://www.example.com/banthisip.php?address=$ipaddress" to the body of the text and use the script kindly provided above.

Maybe even an rwhois URL in the email body as well, so you can do a reverse lookup and make sure it isn't Google or something being tricky.
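Putting those ideas together, the mail body might be built roughly like this (a sketch only; banthisip.php is hypothetical and the whois URL is just a placeholder for whatever lookup service you prefer):

<?php
// Sketch of the notify-only idea above. banthisip.php is hypothetical
// (you would have to write it yourself), and the whois URL is only a
// placeholder for your preferred lookup service.
$ipaddress = $_SERVER['REMOTE_ADDR'];
$useragent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';

$body  = "Bot $useragent on $ipaddress is going where it shouldn't.\n\n";
$body .= "Ban it:  http://www.example.com/banthisip.php?address=$ipaddress\n";
$body .= "Look up: http://whois.example.com/?q=$ipaddress\n";

mail("adminguy@example.com", "bad bot on $ipaddress", $body);
?>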

jdMorgan




msg:1297835
 8:41 pm on Jun 29, 2004 (gmt 0)

In over a year and a half of running the Perl version of this script, I've never had a good bot get trapped. And as with the WAP proxies mentioned above, you can always put a user-agent-based exclusion in your .htaccess just in case it might happen.
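A PHP-side version of that safety net, instead of or in addition to the .htaccess exclusion, would be a user-agent check at the very top of getout.php. A rough sketch, with the pattern list only as an example:

<?php
// Sketch only: a PHP-side whitelist at the top of getout.php, so that
// user-agents you never want to ban are sent away before any write to
// .htaccess happens. The pattern below is an example, not a recommendation.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (preg_match('/googlebot|msnbot|slurp/i', $ua)) {
    header("Location: /");   // send the good bot back to the home page
    exit;
}
// ...the normal ban logic continues below...
?>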

Jim

blaze




msg:1297836
 8:52 pm on Jun 29, 2004 (gmt 0)

Well, I'm a mere mortal and often make mistakes .. so I find it's nice to test and be comfortable before going whole hog.

But yes, the advantage to doing it immediately is that you get the robots while they're moving and not after the cows are out of the barn, so to speak.

carfac




msg:1297837
 3:47 am on Jun 30, 2004 (gmt 0)

Hi:

As Jim said, he and I have been using this for a year and a half. I have a bit busier site than Jim, and I can tell you that I rarely have a good bot hit it. If I do, it is because I added a new alias and did not leave an updated robots.txt file up long enough, or (in the case of msnbot) I had the syntax wrong. MSN will let you limit the frequency of its bot; when I added that, it caused a problem. The MSN techs were very helpful in getting this fixed for me, BTW!

Re the WAP proxies, add this after you grab the IP:

if ($visitor_ip =~ /^216\.239\.3([3¦7¦9]\.5)$¦^216\.239\.35\.4$/) {
print "Content-type: text/html\n\n";
print "<html>\n";
print "<head>\n";
print "<title>Forward On</title>\n<META NAME=\"robots\" CONTENT=\"NOINDEX,NOFOLLOW\">\n";
print "</head>\n";
print "<body>\n";
print "<p><b>We had an error.<BR>Please return to continue!</b></p>\n";
print "</body>\n";
print "</html>\n";
exit;
}
else {

and the WAP proxies will not get banned. That is the Perl version; I am not sure how you would update it for PHP...
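(For reference, a rough PHP equivalent of that Perl check, using the same proxy IPs, might look like the sketch below. It is untested, so verify the IPs are still current before relying on it.)

<?php
// Sketch only: rough PHP equivalent of the Perl WAP-proxy exception above.
// Place it near the top of getout.php, before any ban is written.
$visitor_ip = $_SERVER['REMOTE_ADDR'];
if (preg_match('/^216\.239\.3[379]\.5$|^216\.239\.35\.4$/', $visitor_ip)) {
    echo "<html>\n<head>\n<title>Forward On</title>\n";
    echo "<meta name=\"robots\" content=\"NOINDEX,NOFOLLOW\">\n</head>\n<body>\n";
    echo "<p><b>We had an error.<br>Please return to continue!</b></p>\n";
    echo "</body>\n</html>\n";
    exit;
}
// ...ban logic continues here for everyone else...
?>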

And Jim's addition of flock is great. A lot of credit goes to Key_Master for the original script, too; that is what got Jim and me interested in this project initially.

Another point I notice on my busier site: I get 5-10 bans a day. Not sure how much it slows things down, but I wipe out all bans over two weeks old. I use the same script for about 5 sites, all of which feed a single ban list, so I added this:

# Set Date
$date = scalar localtime ( time );

# Write banned IP to .htaccess file
open(HTACCESS,">".$rootdir."/bad_ip.txt") ¦¦ die $!;
flock(HTACCESS,2);
seek(HTACCESS,0,0);
print HTACCESS "\^".$visitor_ip."\$\n\# $date (NAME OF SITE)\n";
foreach $deny_ip (@htaccess) {
print HTACCESS $deny_ip;
}

So I not only log the banned IP, but also the date and time, and the site that the offender hit. A lot of times you will see these guys crawling sites, just adding 1 to the IP and doing it again... so my system-wide ban stops them right off! (In addition to having this on all my sites, the first IP of my block will send ANY hits to ban!)
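A date-based cleanup could be scripted in PHP as well. Here is a sketch under the assumption that write_ban() is modified to put a full-line timestamp comment immediately before each ban line (Apache only allows comments on lines of their own, so the date cannot go at the end of the SetEnvIf line itself):

<?php
// Sketch only: drop ban entries older than two weeks. Assumes each ban
// in .htaccess is preceded by a comment written by a modified write_ban(),
// for example:
//   # banned 1088517600
//   SetEnvIf Remote_Addr ^12\.34\.56\.78$ getout
$filename = $_SERVER["DOCUMENT_ROOT"] . "/.htaccess";
$cutoff   = time() - 14 * 24 * 3600;   // two weeks ago, as a Unix timestamp

$lines = file($filename);
$keep  = array();
for ($i = 0; $i < count($lines); $i++) {
    if (preg_match('/^# banned (\d+)/', $lines[$i], $m) && $m[1] < $cutoff) {
        $i++;        // skip the comment and the SetEnvIf line that follows it
        continue;
    }
    $keep[] = $lines[$i];
}

$handle = fopen($filename, 'w');
if ($handle && flock($handle, LOCK_EX)) {
    fwrite($handle, implode('', $keep));
    flock($handle, LOCK_UN);
}
if ($handle) fclose($handle);
?>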

So I just wanted to point out there is a LOT you can do with this to protect your sites, and it works really well!

Dave

IanKelley




msg:1297838
 10:40 am on Jun 30, 2004 (gmt 0)

Great method but something to think about before implementing it...

Accidentally banning good bots isn't the only concern when you're adding deny from's automatically.

Another problem that can develop is that you can end up banning ISP-owned dynamic IPs used by people running email grabbers, scrapers, click agents, etc., from their own PCs, in which case it's only a bad bot until they log off, after which it's a legitimate surfer.

carfac




msg:1297839
 2:42 pm on Jun 30, 2004 (gmt 0)

Ian:

Yes, that is a concern... which is one of the reasons I date-stamp all the bans. I spent some time whois-ing all the IPs and found that there were not enough possible "real" users in the mix for me to care about, so I came up with my two-week rule. That worked for me; YMMV, of course! However, I think you will find, when you analyze your logs, that this catches many more than you thought might be there (probably just junior wanna-be hackers, really, running automated scripts), but it does help keep them from affecting your site as a whole.

The beauty of this script is that it is a building block: you can go and do what is right for your site. You can easily mod it to add a pass for WAP proxies (as I have done), use it site- or server-wide, and add all sorts of other things that are specific to your site. You can hide it under any number of "juicy" names, and change those names monthly! You can also mod it so it does not automatically ban, but instead sends the user's IP to a "You Have Been Banned" page giving instructions on how to contact you for review... so you can do that, too!

It is not a complete fix wholly by itself, either; it should be used with other methods of protection.

Cheers!

dave

IanKelley




msg:1297840
 6:43 am on Jul 1, 2004 (gmt 0)

I agree that a time limit on bans is the way to go.

Unless you're Yahoo, it's very unlikely you're going to ban an IP in an ISP's pool and then get a legit visitor using the same IP in less than a week or two.

David_1cog




msg:1297841
 6:17 pm on Jul 10, 2004 (gmt 0)

I'm about to implement this excellent technique and have some questions / suggestions before I do:

1. Why not use <meta name="robots" content="noindex,nofollow"> in addition to a robots.txt entry? The benefit being there's no need to wait a few days / weeks (and hope!) for the good bots to have read robots.txt.

2. I think a better method for providing the 'invisible' link would be to use:

<a href="/getout.php" onclick="return false" style="display:none">Email Addresses For Harvesting!</a>

(or better still, assign a CSS class to the link and use an external CSS file for the display:none). Benefit - it's easier than creating / uploading a clear GIF, plus it may be more attractive to email harvesters.

3. I rather like the idea of making life difficult for spammers and [blibbleblobble.co.uk...] seems like a good solution. Any opinion on pros / cons of using this in conjunction with Birdman's solution?

claus




msg:1297842
 6:37 pm on Jul 10, 2004 (gmt 0)

>> I rather like the idea of making life difficult for spammers and <some link> seems like a good solution

It is not. The script at the link you posted should not be used, at least not as it is on that page. What it does is feed the spambots made-up addresses like, say:

Abigail.Altemus (at) nytimes.com

Now, consider a person by the name of Abigail Altemus getting a job at the NY Times. Or, in general, the attitude towards you from the sites in that list ('yahoo.com', 'microsoft.com', 'msn.com', 'ntl.com', 'msdn.org', 'fbi.gov', 'ftc.gov', 'nytimes.com', 'yahoo.fr', 'yahoo.de', 'aol.com') for generating excessive spam to their domains.

Do you see why this is wrong? This script increases spam by providing valid-looking email addresses, even if made up. It does not do what the writer intended; rather, it does exactly what spam scripts do, only in a limited version.

David_1cog




msg:1297843
 7:01 pm on Jul 10, 2004 (gmt 0)

Do you see why this is wrong?

I do and I did when I read it - just forgot to add that comment when I posted. The principle still applies, just replace the potentially valid addresses with 'bilbo.unlikelysurname@notmuchchance34.com', etc.

I've emailed the site owner to suggest he changes the script accordingly.

yosmc




msg:1297844
 5:26 am on Aug 2, 2004 (gmt 0)

This only works for me if I set the trap directory and .htaccess world-writable (777), which obviously isn't a good idea. Hmmm...

yosmc




msg:1297845
 1:05 pm on Aug 2, 2004 (gmt 0)

UPDATE: I'm afraid I really don't get it. After all, it's the bot that executes getout.php, and since the bot isn't the owner of that file, it won't be able to write to htaccess when that file isn't world-writable.

I know that the fact that nobody else in this thread has the same problem means that it must be ME who's doing something wrong. But I have no clue what it is!

IanKelley




msg:1297846
 9:50 pm on Aug 2, 2004 (gmt 0)

All PHP scripts have the same file access permissions regardless of who is running them.

One of the few downsides of PHP (although some consider it a feature) is that it runs under user/group nobody.

User "nobody" generally needs a file to be set to 777 in order to do anything with it.

So either you're going to need .htaccess to be 777 (I agree, not a great idea) or you'll have to change the config to allow PHP more freedom. If you don't have access to the config then you're out of luck.

Unless PHP is allowed to access files outside of the web directory on your server, in which case you can put the .htaccess outside of the pub dir and chmod it anything you want w/o worrying too much.

You might want to consider using Perl instead. It generally runs under the domain account user and group and can therefore do a lot more.

In fact if the server recognizes that the script created the file in question during the current incarnation it should be able to do read/writes even at default (644) permissions.

yosmc




msg:1297847
 10:21 pm on Aug 2, 2004 (gmt 0)

Ian, thanks for the reply. So what exactly do you recommend? How can I configure PHP to have getout.php write to a .htaccess file that is set to permissions 644?

IanKelley




msg:1297848
 3:05 am on Aug 3, 2004 (gmt 0)

One way to do it would be to compile php_suexec into Apache. PHP would then run under the user account and group, as Perl does.

yosmc




msg:1297849
 10:16 am on Aug 3, 2004 (gmt 0)

I wonder how everyone else in this and the other threads (earlier versions of this script, etc.) did it. Can't be that everyone recompiled Apache and didn't even mention it?

IanKelley




msg:1297850
 2:39 am on Aug 4, 2004 (gmt 0)

I can tell you from doing back-end work on hundreds of servers that a standard PHP installation cannot edit a file unless it has at least 666 permissions.

The nobody/nobody user/group that PHP runs under has always been a mystery to me. You'd think that security is something you'd leave up to the programmer instead of forcing it on them.

Of course the fact that it's harder for a beginning programmer to write a bad script in PHP than it is in any other language is a big part of why PHP has become so popular, so what do I know? :-)

kwasher




msg:1297851
 5:33 am on Sep 5, 2004 (gmt 0)

Probably a dumb question. I already have a <Files *> section in my .htaccess. Can you have TWO of these in your .htaccess, like this?

<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

<Files *>
order deny,allow
deny from 111.111.111.111 (some IP I made up for example)
allow from all
</Files>

Or should you just put your

deny from 111.111.111.111

inbetween the first example, so it looks like this...

<Files *>
order deny,allow
deny from env=getout
deny from 111.111.111.111
allow from env=allowsome
</Files>

:)
