Forum Moderators: goodroi


Baiduspider - not checking robots.txt & how to block?


TreeShare

3:55 pm on Apr 30, 2005 (gmt 0)

10+ Year Member



Hello,

For the last several days running (that I've noticed), my Apache access logs are full of thousands of visits from Baiduspider. I realize it's a legitimate Chinese search engine spider, but I want to block it - I have no use for Chinese traffic.

In my logs I see NO access of robots.txt by Baiduspider (IP 61.135.145.205).

If I'm right that it doesn't use robots.txt, can anyone suggest a good way to block it entirely?

Thanks!

Me:
Win2k Pro SP4, Apache 2.0.54, MySQL 4.1.11, PHP 4.3.11
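(For reference, this is the robots.txt rule a compliant crawler would obey, though the logs above suggest Baiduspider never fetched the file. "Baiduspider" is Baidu's documented user-agent token.)

```
User-agent: Baiduspider
Disallow: /
```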

TreeShare

3:30 am on May 1, 2005 (gmt 0)

10+ Year Member



Update -

Having surfed around a lot, I'm trying the following in my httpd.conf to block Baiduspider:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider+ [NC,OR]
RewriteRule ^.*$ - [F]

I'm new to mod_rewrite, and I don't know if this will work ...

TreeShare

12:13 pm on May 2, 2005 (gmt 0)

10+ Year Member



Update -

Ok, this is insane. The above blocking method is not working. Example of recent visits from Baiduspider:

May 1st, 1292 requests, stayed 2 hours 14 minutes
All on IP 61.135.145.205

Please, can anyone help me block this?
Thanks!

Span

1:01 pm on May 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi, if that is the only rewrite condition there, try taking out the OR flag. Also, you don't need the +.

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC]
RewriteRule .* - [F]
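A quick sanity check on the patterns themselves, sketched with Python's re module (whose syntax is comparable to the PCRE-style regexes mod_rewrite uses, and the user-agent string below is illustrative): both patterns match Baiduspider's user-agent, which supports Span's point that the trailing + was harmless and the stray [OR] flag was the likely problem.

```python
import re

# A Baiduspider user-agent string as commonly seen in access logs (illustrative)
ua = "Baiduspider+(+http://www.baidu.com/search/spider.htm)"

# Original pattern: the trailing + just means "one or more r's",
# so it still matches the string -- the + was not the bug.
print(bool(re.search(r"^Baiduspider+", ua, re.IGNORECASE)))  # True

# Span's simplified pattern matches the same string.
print(bool(re.search(r"^Baiduspider", ua, re.IGNORECASE)))   # True
```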

TreeShare

12:48 pm on May 4, 2005 (gmt 0)

10+ Year Member



Thanks for your suggestion, Span. Before I saw it, I found another method which I've just implemented ...

A new .htaccess file at the site root:

SetEnvIfNoCase User-Agent "^Baiduspider" bad_bot

<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

... so I'll see if this works first.
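(One common reason an .htaccess file like this gets silently ignored is that httpd.conf doesn't permit the overrides it uses: SetEnvIf needs the FileInfo override class, and Order/Allow/Deny need Limit. A sketch of what would have to be enabled, with the directory path as a placeholder for the actual document root:)

```
<Directory "C:/path/to/document/root">
    AllowOverride FileInfo Limit
</Directory>
```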

TreeShare

8:07 am on May 5, 2005 (gmt 0)

10+ Year Member



Update -

Previous method didn't work. But now I'm finally getting somewhere with this:

SetEnvIfNoCase User-Agent "^Baidu" bad_bot
<Directory />
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Directory>

Directly in my Apache httpd.conf

Now every request Baiduspider makes is getting a 403 error. I just wish I could make it go away altogether, coz it's still hitting me thousands of times a day :(
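(The working SetEnvIfNoCase match can be emulated for testing; a minimal sketch in Python, assuming comparable regex semantics: a case-insensitive, unanchored search where ^ pins the match to the start of the User-Agent header. As for making it go away altogether: a 403 response body is only a few bytes, so even thousands of hits cost little bandwidth; stopping the requests themselves would take a block at the firewall rather than in Apache.)

```python
import re

# Emulate: SetEnvIfNoCase User-Agent "^Baidu" bad_bot
# SetEnvIfNoCase applies the regex as a case-insensitive search on the header.
BAD_BOT = re.compile(r"^Baidu", re.IGNORECASE)

def is_bad_bot(user_agent):
    """Return True if this User-Agent would get the bad_bot env var set."""
    return BAD_BOT.search(user_agent) is not None

print(is_bad_bot("Baiduspider+(+http://www.baidu.com/search/spider.htm)"))  # True
print(is_bad_bot("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"))     # False
```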