newbie in a world of pain trying to ban bots with .htaccess

         

sross

5:28 am on Feb 9, 2005 (gmt 0)

10+ Year Member



Hi guys, I run a good-sized vB3 community, and lately bots are tearing me to shreds. My host has shut me down three times because the bots push the CPU to 80% usage (this is on a beefy semi-dedicated box that usually runs at 0.20), and I am getting worried. I don't know much about how to stop them, other than people saying to use robots.txt and add entries to the .htaccess file. I tried a robots.txt that blocks everything, without luck. I had some luck banning Googlebot IPs in my .htaccess, but what I really want to do is ban by bot name. I have read many sites and many posts about this here and still can't find an answer for someone who is clueless. I need someone to say "copy and paste THIS into your .htaccess file and you are set." I tried putting this in to see what would happen:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^msnbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} ^ask\ jeeves [OR]
RewriteCond %{HTTP_USER_AGENT} ^askjeeves [OR]
RewriteCond %{HTTP_USER_AGENT} ^slurp@inktomi [OR]
RewriteCond %{HTTP_USER_AGENT} ^wisenutbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^alexa [OR]
RewriteRule ^.* - [F,L

My site then went haywire an hour later and crashed with hundreds of these errors:

public_html/.htaccess: RewriteRule: bad flag delimiters

Can anyone help me? This looks like a great resource so I am subscribing. Thanks!

jdMorgan

6:14 am on Feb 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



sross,

Welcome to WebmasterWorld!

Your RewriteRule is missing the closing "]".
The last RewriteCond *must not* have an [OR] flag on it. The [OR] flag means "logical or", and you cannot 'or' a RewriteRule with a RewriteCond.
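
As a minimal sketch of the shape described above (ExampleBotA and ExampleBotB are made-up names used only for illustration, not real robots), every RewriteCond except the last carries [OR], and the RewriteRule flags sit inside a matched pair of square brackets:

```apache
RewriteEngine On
# Every condition except the last one carries [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExampleBotA [OR]
# The last condition has no [OR]; it is followed directly by the rule
RewriteCond %{HTTP_USER_AGENT} ^ExampleBotB
# Flags are enclosed in "[" and "]"; F returns 403 Forbidden
RewriteRule .* - [F]
```

If either condition matches, the request is refused with a 403 response; otherwise the rule is skipped.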

Slurp now belongs to Yahoo, which bought Inktomi.

You'd be far better off using robots.txt to ask these robots to leave your site (or parts of it) alone. All the 'bots on your list will respect robots.txt. Robots that do not respect robots.txt can then be blocked using mod_rewrite.

Jim

sross

6:24 am on Feb 9, 2005 (gmt 0)

10+ Year Member



Hi, and thanks, Jim!

I did have the "]" in the actual .htaccess; I must have missed it when I copied and pasted. I set my robots.txt to block everything about 10 days ago, but it has not kicked in yet. Are you saying it should look like this? Is the setup below now a correct way to ban these bots?

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^msnbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} ^ask\ jeeves [OR]
RewriteCond %{HTTP_USER_AGENT} ^askjeeves [OR]
RewriteCond %{HTTP_USER_AGENT} ^slurp@inktomi [OR]
RewriteCond %{HTTP_USER_AGENT} ^wisenutbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^alexa
RewriteRule ^.* - [F,L]

dcrombie

10:02 am on Feb 9, 2005 (gmt 0)



You're getting confused between robots.txt and .htaccess.

Legitimate web spiders will request robots.txt, and if you want to block them you do so by listing the name of their robot along with the rules you want it to follow.

In .htaccess you can restrict access according to the User-Agent header, which is usually different from the name you would use in robots.txt.

Sample user agents:

Mozilla/2.0 (compatible; Ask Jeeves/Teoma) 
Googlebot/2.1 (+http://www.google.com/bot.html)
Googlebot-Image/1.0
msnbot/1.0 (+http://search.msn.com/msnbot.htm)

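Your rules above will ONLY block msnbot, as it's the only one of these user agents that matches any of your regular expressions ("^msnbot"). The patterns are case-sensitive and anchored to the start of the string by "^", so "^googlebot" does not match "Googlebot/2.1", and the Ask Jeeves user agent begins with "Mozilla", not "ask jeeves".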
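
One way to make the conditions match the sample user agents listed above is to drop the leading "^" and add the [NC] (no-case) flag. This is only a sketch that assumes those sample strings are what the robots actually send. Check your own access logs before relying on it, and note that it covers only the three robots in the sample list:

```apache
RewriteEngine On
# No leading ^, so the token may appear anywhere in the User-Agent string.
# NC makes the comparison case-insensitive, so "googlebot" matches "Googlebot/2.1".
RewriteCond %{HTTP_USER_AGENT} msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} googlebot [NC,OR]
# The space is escaped with a backslash, as in the original rules.
# The last condition has no OR.
RewriteCond %{HTTP_USER_AGENT} ask\ jeeves [NC]
RewriteRule .* - [F]
```

Because "googlebot" is unanchored, this also catches "Googlebot-Image/1.0".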

;)