In this case they were prevented from viewing my robots file:
xx.xxx.xx.xx - - [24/Apr/2003:11:29:19 -0400] "GET /robots.txt HTTP/1.1" 403 1445 "-" "sitecheck.internetseer.com (For more info see: http://sitecheck.internetseer.com)"
The next day they came back and requested more files. They were permitted to fetch my form mail script, and a few minutes later they requested more pages and were forbidden access:
xx.xxx.xx.xx - - [25/Apr/2003:18:07:09 -0400] "GET /cgi-bin/myformmailscript.pl HTTP/1.1" 200 1020 "-" "sitecheck.internetseer.com (For more info see: http://sitecheck.internetseer.com)"
xx.xxx.xx.xx - - [25/Apr/2003:18:11:36 -0400] "HEAD /file.htm HTTP/1.1" 403 0 "-" "sitecheck.internetseer.com (For more info see: http://sitecheck.internetseer.com)"
I don't want to prevent anyone from accessing my robots file, and at the same time I don't understand why they were permitted to access my form mail.
I'm using the bad-bots trapping script that has been posted on this forum:
SetEnvIf Request_URI "^(/403.*\.html|/robots\.txt|/uaterms\.html)$" allowsome
and the line in my htaccess banning them is:
RewriteEngine On
[list of bad bots]
RewriteCond %{HTTP_USER_AGENT} sitecheck\.internetseer\.com [NC,OR]
[more bad bots]
RewriteRule !^403.html$ - [F,L]
I should mention that this has only occurred with respect to internetseer.com - the other search engines (google, inktomi, Ask Jeeves, etc.) have all been able to get the robots file and spider my site.
I hope I've provided adequate information. Anyone have any clues as to what I may have done wrong here?
Thanks ahead of time for any assistance!
You have several questions here...
In order to NOT block internetseer and other quasi-bad bots from robots.txt, you'll need to change your rewrite rule from
RewriteRule !^403.html$ - [F,L]
to
RewriteRule !^(403\.html|robots\.txt)$ - [F]
Note that [L] with [F] is redundant, and that the above assumes that you have a custom 403 error page named 403.html.
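Put together, the relevant part of your .htaccess would look roughly like this (just a sketch -- the comment stands in for your own list of bad-bot conditions, and it assumes the internetseer condition is the last one, so it takes no [OR] flag):
RewriteEngine On
# ... your other bad-bot RewriteCond lines, each ending in [NC,OR] ...
RewriteCond %{HTTP_USER_AGENT} sitecheck\.internetseer\.com [NC]
# Forbid these user-agents everywhere except the 403 page and robots.txt
RewriteRule !^(403\.html|robots\.txt)$ - [F]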
Now, as to why internetseer can still fetch your form mail script, that may depend on your set-up. If the script was provided by your hosting service, it may not actually reside in your directory structure, and therefore your .htaccess will not apply to it. It is fairly typical on a shared server for the hosting company to redirect formmail requests to a subdirectory under their control, and then to place a symbolic link into your web root directory or your cgi-bin directory that makes it look like formmail is in your directory tree when it really isn't. In this case, your .htaccess will be bypassed, and it is up to the host to protect it. I suppose you could download the script to your computer and then upload it to another subdirectory on your site and protect it there, but check with your host first.
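If, on the other hand, the script really does live in your own cgi-bin, one way to cover it (only a sketch, and assuming your host honors .htaccess in that directory for these directives) is to drop a small .htaccess into the cgi-bin directory itself -- keep in mind that SetEnvIf and allow/deny lines in your web-root .htaccess only apply to that directory tree:
# .htaccess placed in /cgi-bin/ -- sketch only; whether these directives work there depends on your host's AllowOverride setting
SetEnvIfNoCase User-Agent "internetseer" banned
<Files *>
order allow,deny
allow from all
deny from env=banned
</Files>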
The above presupposes that you actually have a formmail script. If you're just using it as spider bait, read on.
If bad bots are not eating your spider bait, you may have to put a more tasty variety of bait out. Also, make sure that the allow,deny setup in your .htaccess file which tests the spider trap script's output looks something like this:
# Block bad-bots using lines written by bad_bot.pl script above
SetEnvIf Request_URI "^(/403\.html|/robots\.txt|/uaterms\.html)$" allowsome
<Files *>
order deny,allow
deny from env=banned
allow from env=allowsome
</Files>
# Thu Apr 17 09:07:32 2003
SetEnvIf Remote_Addr ^212\.198\.0\.96$ banned
It is also critical that you do not block any user-agent from reading robots.txt, ever. Otherwise, some of them that are merely nuisances won't be able to read robots.txt, and so they can't be blamed for subsequently swallowing your spider bait. The whole plan has to work together consistently. The only way we can tell a good bot from a bad bot is that the bad ones won't obey robots.txt. And they have to be able to fetch it to obey it.
The other thing to note is that the spider trap script as originally posted will return a 200 status code for the first request from a bad bot grabbing the bait. After that, the spider is banned by IP address, and will get 403 responses. So one 200-OK response is to be expected unless you modify the script itself to return a 403 header.
HTH,
Jim
The form mail script wasn't provided by my host. It's one I've been using for quite a while, and I've renamed it so it doesn't sound like a form mail script. That tells me that to find it, they actually had to follow the link from the form on my html page itself.
However, I have changed hosts recently. I'm still finding out about my new set up. One of the things that's different is that I have to put my cgi scripts in a cgi-bin folder which is at the same directory level as the html files, i.e.
/home/user/public/cgi-bin/
/home/user/public/html/
In the past my cgi-bin has been under the html folder. Don't know if that makes any difference - can't think why it would.
Yes, my bad-bot script is set up in the way you've described, and it's returning results such as this:
SetEnvIf Remote_Addr ^148\.xxx\.xx\.xx$ getout
But there hasn't been a new one added since I changed hosts. So now that you've got me thinking along this line, I'm wondering what may be different in the new set up. I think I'll go try to ban myself.
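(If anyone else wants to run the same test: I'm just going to add a line like the one below by hand -- the address is a placeholder for my own IP -- then request a page, check the log for a 403, and remove the line afterward. It uses "getout" because that's the variable my script writes.)
# Temporary self-ban for testing -- replace the placeholder address with your own IP, delete when done
SetEnvIf Remote_Addr ^192\.0\.2\.10$ getout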
I'll correct the rewrite rule as you have pointed out and then see what happens next. Internetseer has been flooding my sites lately so I don't think I'll have to wait long for an answer.
Jim, thank you again.
This forum is read by friend and foe alike.
Regarding Slurp and MSNbot getting trapped: I have clearly disallowed the URL(s) in my robots.txt, and everybody is allowed to fetch it, yet these two still sometimes get trapped. I have not seen Googlebot ask for that script even once. Can it therefore be deduced that there is no problem with my robots.txt?
If yes, then what could be the reason behind it? I could not understand.
As
Slurp occasionally fetches directory paths that are not linked, and perhaps this is why you're trapping it. For example, if you have a page "/events/march.html", Slurp will decide --on its own, and without finding any link to it-- to try to fetch the directory "/events/". I have had no such trouble with Slurp, though, so do re-check your robots.txt and raw log files.
The MSNbots seem to be a total mess these days; they apparently cannot or will not differentiate between their own various 'bots in robots.txt. For example, "User-agent: MSNbot/" and "User-agent: MSNbot-media" are treated as if they are the same thing, and Disallowing one will Disallow the other, even though those two substrings don't match both robots' names! They currently ignore on-page <meta name="robots" content="noindex,nofollow"> tags as well. Something appears to be very wrong with MSNbot right now, and I sure hope they're working to fix it.
So currently, the only way to keep the MSNbots 100% safe is to rewrite any bad-bot script request from any_msnbot_user_agent+(valid remote_host OR valid remote_IP_address) to a "safe" page. I suggest a more-or-less blank page with a single link to your home page, and a meta noindex,follow tag on it. You can also use this technique to keep Slurp out of trouble if required.
In simplified form, without the checking of the remote_host or remote_addr variables, I mean:
# Rewrite bad-bot script requests from msnbot or slurp to a safe page
RewriteCond %{HTTP_USER_AGENT} (msnbot|slurp) [NC]
RewriteRule ^bad_bot_script\.pl$ /safe-page.html [L]
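The fuller version adds a check that the request really is coming from an MSN or Inktomi host before letting it off the hook. Something along these lines (a sketch only -- %{REMOTE_HOST} needs hostname lookups to be available on your server, and the host names shown are assumptions, so check your own raw logs for the crawlers' actual host names):
# Rewrite bad-bot script requests from verified msnbot or slurp to a safe page
RewriteCond %{HTTP_USER_AGENT} (msnbot|slurp) [NC]
RewriteCond %{REMOTE_HOST} \.(msn\.com|inktomisearch\.com)$ [NC]
RewriteRule ^bad_bot_script\.pl$ /safe-page.html [L]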