Apache Web Server Forum
Setting up a bad bot trap
I've got the files in place
youfoundjake
msg:3202978 - 4:50 pm on Dec 29, 2006 (gmt 0)

I'm starting out by setting up my robots.txt:

User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org
Disallow: /forum/

User-agent: *
Disallow: /
User-agent: *
Disallow: /herring/

This has not worked: Google and Yahoo keep poking their heads in (I get an email notification whenever someone visits).

So I'm thinking of switching it to this:

User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org
Disallow: /forum/
Disallow: /herring/

User-agent: *
Disallow: /
Disallow: /herring/
Disallow: /forum/

Does the second entry look correct?

 

jdMorgan
msg:3203083 - 6:10 pm on Dec 29, 2006 (gmt 0)

The first example is invalid: every record in robots.txt must be followed by a blank line, including the last one. In simple terms, every "User-agent:" line should have one blank line separating it from the last "Disallow:" of the previous record.

The second code example is better, but the "User-agent: * -- Disallow: /" record already blocks all remaining robots from all files, so the Disallows that follow it are redundant.

Many if not most robots don't support the multiple-user-agents-per-record format. I suggest using that format only for the "big three" or "big four" robots, and using explicit records or the wild-card record for all minor robots.
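
For example, keeping the multiple-user-agents format for the big robots only, the two records you seem to want would look like this (each record ended by a blank line):

User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org
Disallow: /forum/
Disallow: /herring/

User-agent: *
Disallow: /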

The robots.txt [webmasterworld.com] forum may be more appropriate for getting good answers to further questions.

Jim

youfoundjake
msg:3203125 - 6:57 pm on Dec 29, 2006 (gmt 0)

jdMorgan, thanks for the response. I did think about posting in the robots.txt forum, but this thread will be progressing into Apache mod_rewrite, which I know you're a genius at (but if you want to move it, OK :).

I want to block all bots except for the
User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org

And I want to block those bots from /forum/ and /herring/, since /herring/ will be the bot trap and forum indexing is a nightmare that leads to supplemental results.

I'm ultimately going to try to implement the solution at kloth.net/internet/bottrap.php, so I'm just taking it one step at a time.

[edited by: jdMorgan at 8:25 pm (utc) on Dec. 29, 2006]
[edit reason] De-linked [/edit]

youfoundjake
msg:3203239 - 9:17 pm on Dec 29, 2006 (gmt 0)

Sorry about the link, Jim. It appears I mistyped: I meant to say that I'm implementing a solution I found at that site.

I have a question about paths. When implementing it on the index page (below the text I added for navigation back to the main site, in case a user lands on the bad-robot page), I'm getting:

Warning: fopen() [function.fopen]: open_basedir restriction in effect. File(/blacklist.dat) is not within the allowed path(s): ('.:/proc/uptime:/tmp:/home:/usr/local/lib/php:/nfs/home:/usr/home:/usr/local/bin/') in /home/domain/public_html/herring/index.php on line 16

Warning: fopen(/blacklist.dat) [function.fopen]: failed to open stream: Operation not permitted in /home/domain/public_html/herring/index.php on line 16
Error opening file ...

My site has a basic layout: www.example.com, with a hidden link to www.example.com/herring/index.php, which holds the bot trap.

I was also referred to this page [webmasterworld.com...]

Is the error message because I don't have sufficient privileges to that path?

jdMorgan
msg:3203356 - 11:24 pm on Dec 29, 2006 (gmt 0)

It looks like you need to move blacklist.dat into your Web-accessible space. Since only the script and the server will access it, you can deny HTTP access to it by using mod_access or mod_rewrite in your .htaccess file(s).
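
For example, a minimal sketch using mod_access, assuming blacklist.dat sits in the directory this .htaccess governs:

<Files blacklist.dat>
order allow,deny
deny from all
</Files>

The script still reads and writes the file through the filesystem, so this shuts off HTTP access only.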

Jim

youfoundjake
msg:3203363 - 11:29 pm on Dec 29, 2006 (gmt 0)

OK, I resolved the path issue. I now have a record of whatever visits index.php saved in blacklist.dat:

67.121.255.255 - - [2006-12-29 (Fri) 14:10:50] "GET /herring/index.php HTTP/1.1" Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)

Is there a way to also get this pushed into the .htaccess file, so that it will prevent them from accessing the site again?

It involves:

$fp = fopen($filename,'a+');
fwrite($fp,"$REMOTE_ADDR - - [$datum] \"$REQUEST_METHOD $REQUEST_URI $SERVER_PROTOCOL\" $HTTP_REFERER $HTTP_USER_AGENT\n");
fclose($fp);

And is there any harm in being able to pull up the .dat in a browser window?

jdMorgan
msg:3203378 - 11:39 pm on Dec 29, 2006 (gmt 0)

I'm not sure which script you're using, but both of the bad-bot scripts published here on WebmasterWorld prepend records to .htaccess to block further access. You can lift the Perl or PHP code from those two scripts.
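
The basic idea, as a rough PHP sketch rather than either script verbatim (the file locking a busy site needs is left out, and the path assumes .htaccess lives in the document root):

<?php
// Sketch: prepend a deny line for the current visitor's IP to .htaccess.
$htaccess = $_SERVER['DOCUMENT_ROOT'] . '/.htaccess';
// Escape the dots so the IP reads as a literal inside SetEnvIf's regex.
$ip = str_replace('.', '\.', $_SERVER['REMOTE_ADDR']);
$ban = "SetEnvIf Remote_Addr ^" . $ip . "$ getout\r\n";
$fp = fopen($htaccess, 'r');
$rest = fread($fp, filesize($htaccess)); // existing rules
fclose($fp);
$fp = fopen($htaccess, 'w');
fwrite($fp, $ban . $rest); // new ban line goes first
fclose($fp);
?>

A matching "deny from env=getout" block in .htaccess then keeps that address out.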

Jim

youfoundjake
msg:3203415 - 12:09 am on Dec 30, 2006 (gmt 0)

Cool, jd. I'll look around some more, but it seems like it's just the two of us. :) Come here often?

youfoundjake
msg:3203973 - 9:41 pm on Dec 30, 2006 (gmt 0)

Update:
Using information from here: [webmasterworld.com...] as well as kloth.net/internet/bottrap.php

I have everything in place, starting with a hidden link on my home page:

<a href="/herring/index.php"><img src="images/pixel.gif" border="0" alt=" " width="1" height="1"></a>

And for whatever it's worth, here is the current .htaccess file, which is chmod 644:


SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

Options All Indexes
IndexOptions FancyIndexing

Options +FollowSymlinks
RewriteEngine on
RewriteCond %{HTTP_HOST} ^domain.com [NC]
RewriteRule ^(.*)$ [domain.com...] [R=301,NC]

RewriteCond %{THE_REQUEST} ^.*\/index\.htm?
RewriteRule ^(.*)index\.html?$ [domain.com...] [R=301,L]

<Files 403.shtml>
order allow,deny
allow from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
.........
RewriteCond %{HTTP_USER_AGENT} ^Xaldon

RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.domain.com.* - [F]

This controls access to the site once a bad bot hits index.php: the script prepends a "SetEnvIf Remote_Addr ... getout" line for the bot's IP to .htaccess, and that getout variable triggers this portion:

SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>

Here is the text of index.php, which does two things: it adds the IP address to .htaccess, and it sends me an email notification and appends the visitor to a blacklist that gets checked later (a little redundant, perhaps, since there's no real need to check the blacklist once the IP address has been denied).
(Forgive the sloppy paste-together job; any optimization would be greatly appreciated. I don't know all the syntax; I just found that it works for me, so I'm not going to break it.)

This is chmod 755:

<?php
// Prepend a "SetEnvIf Remote_Addr ... getout" line for this visitor's IP
// to .htaccess. A lock directory serves as a mutex so two simultaneous
// hits can't corrupt the file.
$lock_dir = $_SERVER["DOCUMENT_ROOT"] . "/lock";
$filename = $_SERVER["DOCUMENT_ROOT"] . "/.htaccess";
// Escape the dots so the IP is matched literally by SetEnvIf's regex.
$bad_bot_ip = str_replace(".", "\.", $_SERVER["REMOTE_ADDR"]);
$content = "SetEnvIf Remote_Addr ^" . $bad_bot_ip . "$ getout\r\n";

function make_lock_dir() {
    global $lock_dir;
    $key = @mkdir($lock_dir, 0777);
    $i = 0;
    // If another request holds the lock, retry up to 20 times.
    while ($key === FALSE && $i++ < 20) {
        clearstatcache();
        usleep(rand(5, 85));
        $key = @mkdir($lock_dir, 0777);
    }
    return $key;
}

function write_ban() {
    global $filename, $bad_bot_ip, $content, $lock_dir;
    // Read the existing .htaccess, then rewrite it with the ban line first.
    $handle = fopen($filename, 'r');
    $content .= fread($handle, filesize($filename));
    fclose($handle);
    $handle = fopen($filename, 'w+');
    fwrite($handle, $content, strlen($content));
    fclose($handle);
    rmdir($lock_dir); // release the lock
    print "Goodbye!";
}

function stale_check() {
    global $lock_dir;
    // If the lock directory is more than two minutes old, assume the request
    // that made it died; clear it and try once more. Otherwise give up.
    if (fileatime($lock_dir) < time() - 120) {
        rmdir($lock_dir);
        if (make_lock_dir() !== FALSE) write_ban();
    } else {
        exit;
    }
}

if (make_lock_dir() !== FALSE) {
    write_ban();
} else {
    stale_check();
}
?>

<?php
// On PHP >= 4.2.0, where register_globals is off by default, pull the
// $_SERVER variables ($REMOTE_ADDR, $HTTP_USER_AGENT, etc.) into globals.
if (phpversion() >= "4.2.0") {
extract($_SERVER);
}
?>
<html>
<head><title>Bad Robots </title></head>
<body>
<p>There is nothing here to see. So what are you doing here?</p>
<p><a href="http://www.domain.com">Go home.</a></p>
<?php
$badbot = 0;
/* scan the blacklist.dat file for addresses of SPAM robots
to prevent filling it up with duplicates */
$filename = "blacklist.dat";
$fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
while ($line = fgets($fp,255)) {
$u = explode(" ",$line);
/* first field of the log line is the IP; compare as a plain string
(ereg() with unescaped dots can mis-match) */
if ($u[0] == $REMOTE_ADDR) {$badbot++;}
}
fclose($fp);
if ($badbot == 0) { /* we just see a new bad bot not yet listed! */
/* send a mail to hostmaster */
$timestamp = time();
$datum = date("Y-m-d (D) H:i:s",$timestamp);
$from = "badbot-watch@domain.com";
$to = "webmaster@domain.com";
$subject = "Domainname: bad robot";
$msg = "A bad robot hit $REQUEST_URI $datum \n";
$msg .= "address is $REMOTE_ADDR, agent is $HTTP_USER_AGENT\n";
mail($to, $subject, $msg, "From: $from");
/* append bad bot address data to blacklist log file: */
$fp = fopen($filename,'a+');
fwrite($fp,"$REMOTE_ADDR - - [$datum] \"$REQUEST_METHOD $REQUEST_URI $SERVER_PROTOCOL\" $HTTP_REFERER $HTTP_USER_AGENT\n");
fclose($fp);
}
?>
</body>
</html>
<?php include($_SERVER['DOCUMENT_ROOT'] . "/herring/blacklist.php");?>

And then, of course, there is blacklist.php:


<?php
// Same as index.php: make the $_SERVER variables available as globals.
if (phpversion() >= "4.2.0") {
extract($_SERVER);
}
$badbot = 0;
/* look for the IP address in the blacklist file */
$filename = "blacklist.dat";
$fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
while ($line = fgets($fp,255)) {
$u = explode(" ",$line);
/* compare the logged IP to the visitor's IP as a plain string */
if ($u[0] == $REMOTE_ADDR) {$badbot++;}
}
fclose($fp);
if ($badbot > 0) { /* this is a bad bot, reject it */
sleep(12);
print ("<html><head>\n");
print ("<title>Site unavailable, sorry</title>\n");
print ("</head><body>\n");
print ("<center><h1>Welcome ...</h1></center>\n");
print ("<p><center>Unfortunately, due to abuse, this site is temporarily not available ...</center></p>\n");
print ("<p><center>If you feel this in error, send a mail to the hostmaster at this site,<br>
if you are an anti-social ill-behaving SPAM-bot, then just go away.</center></p>\n");
print ("</body></html>\n");
exit;
}
?>

Finally, I have my robots.txt set up like this:

User-agent: Slurp
User-agent: Googlebot
User-agent: msnbot
User-agent: Mediapartners-Google
User-agent: Adsbot-Google
User-agent: ia_archiver-web.archive.org
Disallow: /forum/
Disallow: /herring/

User-agent: *
Disallow: /

Hopefully this helps someone out there.
