In this case they were prevented from viewing my robots file:
xx.xxx.xx.xx - - [24/Apr/2003:11:29:19 -0400] "GET /robots.txt HTTP/1.1" 403 1445 "-" "sitecheck.internetseer.com (For more info see: http://sitecheck.internetseer.com)"
The next day they came back and requested more files. They were permitted to fetch my form mail script, and a few minutes later they requested more pages and were forbidden access:
xx.xxx.xx.xx - - [25/Apr/2003:18:07:09 -0400] "GET /cgi-bin/myformmailscript.pl HTTP/1.1" 200 1020 "-" "sitecheck.internetseer.com (For more info see: http://sitecheck.internetseer.com)"
xx.xxx.xx.xx - - [25/Apr/2003:18:11:36 -0400] "HEAD /file.htm HTTP/1.1" 403 0 "-" "sitecheck.internetseer.com (For more info see: http://sitecheck.internetseer.com)"
I don't want to prevent anyone from accessing my robots file, and at the same time I don't understand why they were permitted to access my form mail.
I'm using the bad-bots trapping script that has been posted on this forum:
SetEnvIf Request_URI "^(/403.*\.html|/robots\.txt|/uaterms\.html)$" allowsome
and the line in my htaccess banning them is:
RewriteEngine On
[list of bad bots]
RewriteCond %{HTTP_USER_AGENT} sitecheck\.internetseer\.com [NC,OR]
[more bad bots]
RewriteRule !^403.html$ - [F,L]
I should mention that this has only occurred with respect to internetseer.com - the other search engines (google, inktomi, Ask Jeeves, etc.) have all been able to get the robots file and spider my site.
I hope I've provided adequate information. Anyone have any clues as to what I may have done wrong here?
Thanks ahead of time for any assistance!
You have several questions here...
In order to NOT block internetseer and other quasi-bad bots from robots.txt, you'll need to change your rewrite rule from
RewriteRule !^403.html$ - [F,L]
to
RewriteRule !^(403\.html|robots\.txt)$ - [F]
Note that [L] with [F] is redundant, and that the above assumes that you have a custom 403 error page named 403.html.
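Put together, the relevant part of your .htaccess would look roughly like this (just a sketch -- the comment stands in for your own list of bad-bot conditions, and it assumes the internetseer condition is the last one, so it takes no [OR] flag):
RewriteEngine On
# ... your other bad-bot RewriteCond lines, each ending in [NC,OR] ...
RewriteCond %{HTTP_USER_AGENT} sitecheck\.internetseer\.com [NC]
# Forbid these user-agents everywhere except the 403 page and robots.txt
RewriteRule !^(403\.html|robots\.txt)$ - [F]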
Now, as to why internetseer can still fetch your form mail script, that may depend on your set-up. If the script was provided by your hosting service, it may not actually reside in your directory structure, and therefore your .htaccess will not apply to it. It is fairly typical on a shared server for the hosting company to redirect formmail requests to a subdirectory under their control, and then to place a symbolic link into your web root directory or your cgi-bin directory that makes it look like formmail is in your directory tree when it really isn't. In this case, your .htaccess will be bypassed, and it is up to the host to protect it. I suppose you could download the script to your computer and then upload it to another subdirectory on your site and protect it there, but check with your host first.
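If, on the other hand, the script really does live in your own cgi-bin, one way to cover it (only a sketch, and assuming your host honors .htaccess in that directory for these directives) is to drop a small .htaccess into the cgi-bin directory itself -- keep in mind that SetEnvIf and allow/deny lines in your web-root .htaccess only apply to that directory tree:
# .htaccess placed in /cgi-bin/ -- sketch only; whether these directives work there depends on your host's AllowOverride setting
SetEnvIfNoCase User-Agent "internetseer" banned
<Files *>
order allow,deny
allow from all
deny from env=banned
</Files>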
The above presupposes that you actually have a formmail script. If you're just using it as spider bait, read on.
If bad bots are not eating your spider bait, you may have to put a more tasty variety of bait out. Also, make sure that the allow,deny setup in your .htaccess file which tests the spider trap script's output looks something like this:
# Block bad-bots using lines written by bad_bot.pl script above
SetEnvIf Request_URI "^(/403\.html|/robots\.txt|/uaterms\.html)$" allowsome
<Files *>
order deny,allow
deny from env=banned
allow from env=allowsome
</Files>
# Thu Apr 17 09:07:32 2003
SetEnvIf Remote_Addr ^212\.198\.0\.96$ banned
It is also critical that you do not block any user-agent from reading robots.txt, ever. Otherwise, some of them that are merely nuisances won't be able to read robots.txt, and so they can't be blamed for subsequently swallowing your spider bait. The whole plan has to work together consistently. The only way we can tell a good bot from a bad bot is that the bad ones won't obey robots.txt. And they have to be able to fetch it to obey it.
The other thing to note is that the spider trap script as originally posted will return a 200 status code for the first request from a bad bot grabbing the bait. After that, the spider is banned by IP address, and will get 403 responses. So one 200-OK response is to be expected unless you modify the script itself to return a 403 header.
HTH,
Jim
The form mail script wasn't provided by my host. It's one I've been using for quite a while, and I've renamed it so it doesn't sound like a form mail script. That tells me that to find it, they actually had to follow the link from the form on my html page itself.
However, I have changed hosts recently. I'm still finding out about my new set up. One of the things that's different is that I have to put my cgi scripts in a cgi-bin folder which is at the same directory level as the html files, i.e.
/home/user/public/cgi-bin/
/home/user/public/html/
In the past my cgi-bin has been under the html folder. Don't know if that makes any difference - can't think why it would.
Yes, my bad-bot script is set up in the way you've described, and it's returning results such as this:
SetEnvIf Remote_Addr ^148\.xxx\.xx\.xx$ getout
But there hasn't been a new one added since I changed hosts. So now that you've got me thinking along this line, I'm wondering what may be different in the new set up. I think I'll go try to ban myself.
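(If anyone else wants to run the same test: I'm just going to add a line like the one below by hand -- the address is a placeholder for my own IP -- then request a page, check the log for a 403, and remove the line afterward. It uses "getout" because that's the variable my script writes.)
# Temporary self-ban for testing -- replace the placeholder address with your own IP, delete when done
SetEnvIf Remote_Addr ^192\.0\.2\.10$ getout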
I'll correct the rewrite rule as you have pointed out and then see what happens next. Internetseer has been flooding my sites lately so I don't think I'll have to wait long for an answer.
Jim, thank you again.
This forum is read by friend and foe alike.
Regarding Slurp and MSNbot getting trapped: I have clearly disallowed the URL(s) in my robots.txt, and everybody is allowed to fetch it, yet these two still sometimes get trapped. I have not seen Googlebot ask for that script even once. Can it therefore be deduced that there is no problem with my robots.txt?
If yes, then what could be the reason behind it? I could not understand.
As
Slurp occasionally fetches directory paths that are not linked, and perhaps this is why you're trapping it. For example, if you have a page "/events/march.html", Slurp will decide --on its own, and without finding any link to it-- to try to fetch the directory "/events/". I have had no such trouble with Slurp, though, so do re-check your robots.txt and raw log files.
The MSNbots seem to be a total mess these days; they apparently cannot or will not differentiate between their own various 'bots in robots.txt. For example, "User-agent: MSNbot/" and "User-agent: MSNbot-media" are treated as if they are the same thing, and Disallowing one will Disallow the other, even though those two substrings don't match both robots' names! They currently ignore on-page <meta name="robots" content="noindex,nofollow"> tags as well. Something appears to be very wrong with MSNbot right now, and I sure hope they're working to fix it.
So currently, the only way to keep the MSNbots 100% safe is to rewrite any bad-bot script request from any_msnbot_user_agent+(valid remote_host OR valid remote_IP_address) to a "safe" page. I suggest a more-or-less blank page with a single link to your home page, and a meta noindex,follow tag on it. You can also use this technique to keep Slurp out of trouble if required.
In simplified form, without the checking of the remote_host or remote_addr variables, I mean:
# Rewrite bad-bot script requests from msnbot or slurp to a safe page
RewriteCond %{HTTP_USER_AGENT} (msnbot|slurp) [NC]
RewriteRule ^bad_bot_script\.pl$ /safe-page.html [L]
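The fuller version adds a check that the request really is coming from an MSN or Inktomi host before letting it off the hook. Something along these lines (a sketch only -- %{REMOTE_HOST} needs hostname lookups to be available on your server, and the host names shown are assumptions, so check your own raw logs for the crawlers' actual host names):
# Rewrite bad-bot script requests from verified msnbot or slurp to a safe page
RewriteCond %{HTTP_USER_AGENT} (msnbot|slurp) [NC]
RewriteCond %{REMOTE_HOST} \.(msn\.com|inktomisearch\.com)$ [NC]
RewriteRule ^bad_bot_script\.pl$ /safe-page.html [L]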