Home / Forums Index / Code, Content, and Presentation / PHP Server Side Scripting
Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
PHP Spider Trap
An alternative to the Perl version
Birdman
msg:1312929, 1:48 pm on Mar 7, 2004 (gmt 0)

Hello,

I recently set up my own spider trap after reading about it here. I finally got sick of site-suckers driving up my bandwidth to the point I had to upgrade my hosting package twice.

So anyway, I don't use Perl much and decided to make a PHP trap. It's working nicely, and I just wanted to post it up here in case anyone wants to use it.

*Notes:

  1. Add the robots.txt snippet a few days before luring bots to the trap. This gives the good bots time to read the disallow and obey it.
  2. chmod .htaccess to 666 and getout.php to 755 (please correct me here if I'm wrong).
  3. Replace any broken pipe (¦) with a solid one (|) in the .htaccess snippet.
  4. Edit getout.php with the real path to your .htaccess file, and change the email address to your own so you will receive the "spider alert".
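The thread doesn't show the lure itself; a typical approach (my assumption, not Birdman's exact markup) is a link on every page that human visitors won't follow but crawlers will, for example a one-pixel image link (the `pixel.gif` name is hypothetical):

```html
<!-- hypothetical lure: effectively invisible to visitors, followed by site-suckers -->
<a href="/getout.php"><img src="pixel.gif" width="1" height="1" border="0" alt=""></a>
```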

Robots.txt

User-agent: *
Disallow: /getout.php

.htaccess (keep this code at the top of the file)

SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
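For illustration, here is what the top of .htaccess would read after getout.php has caught one visitor (the IP 1.2.3.4 is hypothetical). The script prepends the ban line, but since every SetEnvIf directive is evaluated, the order of the SetEnvIf lines doesn't matter:

```apache
SetEnvIf Remote_Addr ^1\.2\.3\.4$ getout
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
```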

getout.php

<?php
// Prepend a deny rule for this visitor's IP to .htaccess,
// then e-mail yourself the details of the catch.
$filename = "/var/www/html/.htaccess";

// Escape the dots so SetEnvIf matches the IP literally
$content = "SetEnvIf Remote_Addr ^" . str_replace(".", "\\.", $_SERVER["REMOTE_ADDR"]) . "$ getout\r\n";

// Read the current .htaccess and append it below the new ban line
$handle = fopen($filename, 'r');
$content .= fread($handle, filesize($filename));
fclose($handle);

// Write the combined contents back out
$handle = fopen($filename, 'w+');
fwrite($handle, $content, strlen($content));
fclose($handle);

// HTTP_REFERER is not always set, so guard against a PHP notice
$referer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : '';
mail("me@mysite.com",
    "Spider Alert!",
    "The following IP just got banned because it accessed the spider trap.\r\n\r\n"
    . $_SERVER["REMOTE_ADDR"] . "\r\n" . $_SERVER["HTTP_USER_AGENT"] . "\r\n" . $referer,
    "From: trap@mysite.com");
print "Goodbye!";
?>
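One caveat with the read-then-rewrite step: two trap hits at the same moment could overwrite each other's changes to .htaccess. A locked variant of just that step (a sketch; the `ban_ip` helper name is mine, not from the thread, and the path assumption is the same as above):

```php
<?php
// Prepend a SetEnvIf ban line for $ip to the given .htaccess file.
// flock() keeps two simultaneous trap hits from clobbering each other.
function ban_ip($htaccess, $ip)
{
    $line = "SetEnvIf Remote_Addr ^" . str_replace(".", "\\.", $ip) . "$ getout\n";
    $handle = fopen($htaccess, 'c+');   // read/write, no truncation
    if ($handle === false) {
        return false;
    }
    flock($handle, LOCK_EX);            // exclusive lock while we rewrite
    $existing = stream_get_contents($handle);
    rewind($handle);
    ftruncate($handle, 0);
    fwrite($handle, $line . $existing); // new ban goes on top
    flock($handle, LOCK_UN);
    fclose($handle);
    return true;
}
```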

 

isitreal
msg:1312930, 4:22 pm on Mar 7, 2004 (gmt 0)

Birdman: I was following that thread too, and the .htaccess spider-blocking one; like you, I'm not comfortable with Perl. This looks like a really good solution.

Thanks for posting it, can you keep us updated if you find any problems with it?

====
footnote:
I started testing this and it works exactly as claimed; it's really nicely thought out and easy to implement. Re the permissions: they can be set to 404 for the getout.php file and 606 for the .htaccess file.

The group permissions only apply to other users on the server, and the execute permission controls viewing folder contents, I think anyway. So all you need is read permission on the getout.php file and read/write permission on the .htaccess file; someone correct me if I'm wrong about that.

If this script does keep adding blocked IP addresses to the list, on a large site that might lead to some problems: a spider could be using a dynamically assigned IP address, which means some other user might conceivably find themselves blocked in the future. I can't tell for sure whether that's the case from testing on just my own IP. Whatever the case, I'm definitely going to test this thing out and see if it starts catching spiders. It's an elegant solution, much better than the .htaccess spider-blocking lists I was playing with last year; those identify spiders by user-agent string, so they can be fooled by a spider simply sending a standard browser user-agent.
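The dynamic-IP worry above could be addressed by expiring old bans. Here is a sketch of a pruning pass, assuming each ban line is preceded by a `# banned <unix-time>` comment line; that convention and the `prune_bans` helper are my own additions, not part of Birdman's script:

```php
<?php
// Remove any "# banned <timestamp>" / "SetEnvIf ... getout" pair
// older than $max_age seconds (default: one week) from the
// .htaccess text, so stale dynamic-IP bans don't pile up.
function prune_bans($htaccess_text, $now, $max_age = 604800)
{
    $lines = explode("\n", $htaccess_text);
    $kept = array();
    for ($i = 0; $i < count($lines); $i++) {
        if (preg_match('/^# banned (\d+)$/', $lines[$i], $m)
            && isset($lines[$i + 1])
            && strpos($lines[$i + 1], ' getout') !== false
            && ($now - (int)$m[1]) > $max_age) {
            $i++;        // skip the expired ban line as well
            continue;
        }
        $kept[] = $lines[$i];
    }
    return implode("\n", $kept);
}
```

You would run this over the file contents (under the same lock as the write) each time a new ban is added.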

isitreal
msg:1312931, 8:06 pm on Mar 7, 2004 (gmt 0)

It seems a shame to leave the spider with nothing at all to reward it for its hard work. This will give it as many randomly generated email addresses as you want, although I'd keep the number under 10,000 so the request doesn't time out while the spider waits for the page to load.

I made a mistake on the permissions: getout.php needs 604 permissions, or else you can't upload changes to it. If you're never going to change it, you can set it to 404.

<?php
// Ban the visitor's IP exactly as in the original script above
$filename = "/var/www/html/.htaccess";
$content = "SetEnvIf Remote_Addr ^" . str_replace(".", "\\.", $_SERVER["REMOTE_ADDR"]) . "$ getout\r\n";
$handle = fopen($filename, 'r');
$content .= fread($handle, filesize($filename));
fclose($handle);
$handle = fopen($filename, 'w+');
fwrite($handle, $content, strlen($content));
fclose($handle);

$referer = isset($_SERVER["HTTP_REFERER"]) ? $_SERVER["HTTP_REFERER"] : '';
mail("me@mysite.com",
    "Spider Alert!",
    "The following IP just got banned because it accessed the spider trap.\r\n\r\n"
    . $_SERVER["REMOTE_ADDR"] . "\r\n" . $_SERVER["HTTP_USER_AGENT"] . "\r\n" . $referer,
    "From: trap@mysite.com");

// Build one bogus mailto link: 10 random letters, an '@',
// then 6 more letters and '.com'
function new_email()
{
    $email = '';
    $letters_array = array('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
        'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z');
    for ($i = 0; $i < 17; $i++) {
        $email .= ($i !== 10) ? $letters_array[mt_rand(0, 25)] : '@';
    }
    $email .= '.com';
    return '<a href="mailto:' . $email . '">' . $email . "</a>\n";
}

// Feed the spider a page full of bogus addresses
$page = '';
for ($i = 0; $i < 5000; $i++) {
    $page .= new_email();
}

$page .= "Goodbye!";
echo $page;
?>

vkaryl
msg:1312932, 11:05 pm on Mar 18, 2004 (gmt 0)

isitreal, you just cracked me up completely when I read this: a php randomizer for bogus emails! Fantastic!

I'm gonna try this thing this weekend. I'm SERIOUSLY drowned in junk.... "junque" I can handle, but THIS! Sheesh....

gruntre
msg:1312933, 5:07 am on Apr 20, 2004 (gmt 0)

Do you need to have certain Apache modules loaded on the web server for .htaccess to work?

My web host says these modules aren't active on Apache by default, and they're not about to become active on my server either!

I found this out because I wanted to modify the cache-control headers with .htaccess. Is this related?

I don't understand PHP but Isitreal you cracked me up with your cunning reward for the spiders plan - good one mate!

isitreal
msg:1312934, 5:26 am on Apr 20, 2004 (gmt 0)

If they aren't giving you .htaccess support, then maybe it's time to find a new host. Try giving them the option of supporting your needs or losing your business; it's no loss either way.

neweb
msg:1312935, 7:53 am on Apr 21, 2004 (gmt 0)

This is great!

And much better than those HUGE lists I've been trying to keep up with!

I have one question though.... is there any way I can "hide" or "disguise" my email address in php? I can do it easily in javascript.. but not sure if it will work in php?

Thanks for your efforts on this!

Darla

isitreal
msg:1312936, 3:00 pm on Apr 21, 2004 (gmt 0)

I used to do those huge lists too, but then I started to take a closer look at my logfiles and realized that the spiders are using standard user agent strings, or none, or some mozilla version. This is a default setting on many new spiders, and easy to set on most others. Spending time looking through logfiles strikes me as one of the less productive ways a webmaster can spend their time, and always forces you to react to a past event instead of dealing with a current one.

This thread continues here [webmasterworld.com].

re: php: No, there's no real way to hide your email address with php; php runs on the server, so whatever it outputs still ends up in the HTML that the harvester sees. You can make a php email form of course, but then the user has to type in their email address rather than use their email client. I suspect many users don't even know their email address, so I tend to shy away from this option, although it is foolproof, since your email address is not in the html.

There are javascript methods that work fine, just search for email javascript protection on webmasterworld. The best solution is to keep the spider out of your site from the beginning.

Putting in an image of your email address strikes me as one of the worst solutions out there, since it forces the user to type your address out by hand after opening their email client.
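For what it's worth, one partial server-side measure that does get used (my sketch, not something suggested in the thread) is encoding every character of the address as an HTML entity. Browsers render it normally and mailto: links still work, but the literal string never appears in the page source; naive harvesters miss it, while determined ones simply decode the entities, so it's a deterrent rather than real protection:

```php
<?php
// Encode each character of $text as a decimal HTML entity,
// e.g. 'a' becomes '&#97;'. The address shown here is a placeholder.
function entity_encode($text)
{
    $out = '';
    for ($i = 0; $i < strlen($text); $i++) {
        $out .= '&#' . ord($text[$i]) . ';';
    }
    return $out;
}

echo '<a href="' . entity_encode('mailto:me@mysite.com')
   . '">' . entity_encode('me@mysite.com') . '</a>';
```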

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved