
Ask Jeeves disregarding robots.txt

         

Scooter24

11:20 am on Sep 9, 2002 (gmt 0)

10+ Year Member Top Contributors Of The Month



Just noticed that Ask Jeeves, the spider of ask.com, disregarded my robots.txt and crawled through some directories which were off limits. Anybody noticed the same?

jdMorgan

9:24 pm on Sep 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, confirmed.

I had the same thing happen to one of my sites this morning. If those pages show up in their index, I'll have to take stronger measures.

Thanks for posting - I wondered if this had happened to others.

Jim

martin

8:24 am on Sep 10, 2002 (gmt 0)

10+ Year Member



Did you change your robots.txt recently? They may be using a cached copy.

Scooter24

10:08 am on Sep 10, 2002 (gmt 0)

10+ Year Member Top Contributors Of The Month



I haven't changed robots.txt recently, meaning not in the past three weeks. Three weeks ago, though, I added a directory and disallowed it in robots.txt on the same day.
Interestingly, Ask Jeeves accessed exactly this new disallowed directory and not the other, older disallowed ones.

But I'd say that a well-mannered spider should read robots.txt every time. In any case, I've just banned Ask Jeeves from my site (it brings very little traffic anyway).
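For anyone wanting to do the same, it's just a couple of lines in robots.txt. I'm not certain of the exact user-agent token they honour, so check it against your own log files first:

```
# Ban the Ask Jeeves crawler from the whole site
# (the user-agent token below is a guess; verify it in your logs)
User-agent: Ask Jeeves/Teoma
Disallow: /
```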

martin

6:49 am on Sep 11, 2002 (gmt 0)

10+ Year Member



Some spiders fetch robots.txt every day, but they have obviously opted not to.

carfac

10:59 pm on Sep 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi:

I just caught Ask Jeeves where he did not belong, too. He got himself banned automatically, but I am going to send a note to ask.com. I expect a lot more than this from a search engine like Ask...

dave

Scooter24

9:42 am on Sep 13, 2002 (gmt 0)

10+ Year Member Top Contributors Of The Month



"He got himself banned automatically"

Did you use a script which added a 'deny from IP-address' line to .htaccess?

If yes, how did you do it (I'm trying to do the same)?

martin

10:31 pm on Sep 13, 2002 (gmt 0)

10+ Year Member



# Replace 127.0.0.1 with the spider's real IP
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^127\.0\.0\.1$
RewriteRule !^you-are-banned\.txt$ /you-are-banned.txt [L]

They will be shown the text file for whatever request they make.

I don't think you can automate this without a scripting language/CGI. It's up to you how to implement it.

Scooter24

2:38 am on Sep 14, 2002 (gmt 0)

10+ Year Member Top Contributors Of The Month



But then the script needs write permission on .htaccess. What if your PHP runs as a different user, such as 'nobody'? You would have to set the permissions of .htaccess to 666, giving everybody read/write access.

carfac

4:17 am on Sep 14, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Scooter:

You want an automated script that you ban in robots.txt but link to invisibly... and if any bad spiders run the script, BAM, they are banned?

Yep, I got that.

Here is the code for the script:

#!/usr/local/bin/perl
# Name this script trap.pl, upload it in ASCII mode to your cgi-bin and set the file permissions to CHMOD 755.

# This is the only variable that needs to be modified. Replace it with the absolute path to your root directory.
$rootdir = "/path/to/root/dir";

# Grab the IP of the bad bot
$visitor_ip = $ENV{'REMOTE_ADDR'};

# WAPs read it all; we do not want to ban them, so send a polite page!

if ($visitor_ip =~ /^216\.239\.3[3-9]\.5$|^216\.239\.35\.4$/) {
print "Content-type: text/html\n\n";
print "<html>\n";
print "<head>\n";
print "<title>Forward On</title>\n";
print "</head>\n";
print "<body>\n";
print "<p><b>Please <A HREF=\"http://www.yourdomain.com/\">Click Here</A> to continue!</b></p>\n";
print "</body>\n";
print "</html>\n";
exit;
}
else {

# Escape the dots so the IP is treated literally in the SetEnvIf regex
$visitor_ip =~ s/\./\\\./g;

# Set Date
$date = scalar localtime ( time );

# Open .htaccess file
open(HTACCESS, $rootdir."/.htaccess") || die $!;
@htaccess = <HTACCESS>;
close(HTACCESS);

# Write banned IP to .htaccess file
open(HTACCESS, ">".$rootdir."/.htaccess") || die $!;
print HTACCESS "SetEnvIf Remote_Addr \^".$visitor_ip."\$ ban\n# $date\n";
foreach $deny_ip (@htaccess) {
print HTACCESS $deny_ip;
}
close(HTACCESS);

# Close
print "Content-type: text/html\n\n";
print "<html>\n";
print "<head>\n";
print "<title>Access Denied!</title>\n";
print "</head>\n";
print "<body>\n";
print "<p><b>Access Denied!</b></p>\n";
print "<A HREF=\"http://www.imdb.com/harvest_me/\"> </A>\n";
print "</body>\n";
print "</html>\n";
exit;

################END OF SCRIPT

I found this script on this site and have modded it for my own use. I added the bit so it would NOT ban WAPs; if you find anyone else getting banned that should not be, add their IP to that section (and let me know!). I also added a time stamp, which I found helpful for cleaning the file out every week. (I would recommend emptying it weekly or so; a long .htaccess slows down the server.) I also spotted IMDB.com's spider trap... it's fun, so I linked the bad spider to that. More fun!

OK, there HAS to be an .htaccess file in your root directory, this CGI file HAS to go into the root directory, and you HAVE to be able to execute CGI in your root.

Put this in your .htaccess:

<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
</Files>

#################END htaccess

Save the script as xxxxx.cgi and add its name to the Disallow section of robots.txt. (You might then want to wait a week before uploading the script; some spiders do not read robots.txt on every visit.)
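For example, if you named it trap.cgi (just an example name), the robots.txt entry would be:

```
# Well-behaved spiders will skip the trap; bad ones will step right in it
User-agent: *
Disallow: /trap.cgi
```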

Then just put an invisible link or two to the script, and away you go. I have a couple other tricks for suckering them bad spiders, sticky me for those!

dave

Scooter24

8:43 am on Sep 14, 2002 (gmt 0)

10+ Year Member Top Contributors Of The Month



This PHP one will also do the job, but it requires write access on .htaccess. It also sends you an email when somebody gets banned.
By the way, CGI doesn't work on my server (well, it's available, but I've never managed to get it to work).

<?php

$fd = fopen(".htaccess","r");
$file = "";
$line = fgets($fd, 4096);
while ((substr ($line, 0, 10) != "allow from") and (!feof ($fd))) {
$file = $file . $line;
$line = fgets($fd, 4096);
}
$file = $file . $line;
$line = fgets($fd, 4096);
while ((substr ($line, 0, 9) == "deny from") and (!feof ($fd))) {
$file = $file . $line;
$line = fgets($fd, 4096);
}
$file = $file . "deny from " . $_SERVER ["REMOTE_ADDR"] . "\n";
$file = $file . $line;
while (!feof ($fd)) {
$line = fgets($fd, 4096);
$file = $file . $line;
}
fclose ($fd);

$fd = fopen(".htaccess","w");
fwrite ($fd, $file);
fclose ($fd);

$message = "PHP-file: " . $_SERVER["PHP_SELF"] . "\n";
$message = $message . "IP address: " . $_SERVER ["REMOTE_ADDR"] . "\n";
$message = $message . "HTTP_REFERER: " . $_SERVER ["HTTP_REFERER"] . "\n";
$message = $message . "User agent: " . $_SERVER ["HTTP_USER_AGENT"] . "\n";
$postdt = date ("j.n.Y - H:i");
$message = $message . $postdt . "\n";

mail('your@email.address', 'Web site attack', $message);

?>

martin

10:42 pm on Sep 14, 2002 (gmt 0)

10+ Year Member



If you're that worried about world-writable files, use PHP itself to do the checks. It will be *slower*, though.
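A rough sketch of what I mean (untested; 'banned-ips.txt' is just a name I made up): let the trap script append to a plain text file instead of .htaccess, so only that file needs to be writable by the web server, and put something like this at the top of your pages:

```php
<?php
// Untested sketch: check a plain ban file from PHP itself, so
// .htaccess never needs to be world-writable.
$banfile = "banned-ips.txt"; // made-up name; one IP per line
if (file_exists($banfile)) {
    foreach (file($banfile) as $ip) {
        if (trim($ip) == $_SERVER["REMOTE_ADDR"]) {
            header("HTTP/1.0 403 Forbidden");
            print "Access Denied!";
            exit;
        }
    }
}
?>
```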

Meanwhile, I caught Inktomi disregarding a robots.txt ban from Aug 8. They only do it from certain IPs, though; it looks like they have distinct cached copies of it.

Scooter24

6:03 am on Sep 15, 2002 (gmt 0)

10+ Year Member Top Contributors Of The Month



It's not that I'm crazy about world-writable files; it's just that CGI scripts don't work on my server (CGI is installed, though). PHP at least works. Something in my .htaccess file disables CGI. I've spent weeks trying to find out why, posting in all possible forums, and it still doesn't work.

martin

12:30 pm on Sep 16, 2002 (gmt 0)

10+ Year Member



I found that you can use an external map with mod_rewrite. Check the Apache docs.
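Roughly, it looks like this (untested, and note that RewriteMap only works in the server config, not in .htaccess, so you need access to httpd.conf):

```
# httpd.conf (server config), not .htaccess
RewriteEngine On
RewriteMap banned txt:/path/to/banned-ips.txt
RewriteCond ${banned:%{REMOTE_ADDR}|OK} !=OK
RewriteRule .* - [F]
```

The map file holds one "key value" pair per line, e.g. "216.239.33.5 ban", and Apache re-reads it whenever its modification time changes, so a script can append to it without ever touching .htaccess.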

Scooter24

2:12 pm on Sep 16, 2002 (gmt 0)

10+ Year Member Top Contributors Of The Month



What is an external map? Sorry, I'm not familiar with Apache.

martin

5:00 pm on Sep 17, 2002 (gmt 0)

10+ Year Member



See the Rewrite* docs [httpd.apache.org] and the
URL Rewrite Guide [httpd.apache.org]