blocking Baidu by User Agent - what am I doing wrong?

         

Dan99

3:24 am on Nov 7, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



OK, I'm trying to block Baidu by User Agent. They just don't listen to robots.txt.

I put this in my .htaccess file

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Baidu [NC]
RewriteRule .* - [F]


But then I see this in my log.

180.76.6.16 - - [06/Nov/2014:21:07:11 -0500] "GET *******.pdf HTTP/1.1" 200 5884642 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"


Duh. So why did Baidu get that file? Why wasn't it blocked? I'm confused. Now, I can block them with

deny from 180.76 123.125.71 220.181.108 66.249.84


but I'd like to do it in a more Baidu-specific way.

not2easy

4:14 am on Nov 7, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Probably because the UA does not start with "Baiduspider" or "Baidu"; it starts with "Mozilla". Try it without the ^ anchor.
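For reference, a minimal sketch of the rule with the anchor removed. It assumes the same .htaccess context as the original snippet and has only been checked by eye against the log line in the first post, not against any other Baidu UA.

```apache
RewriteEngine on
# No leading ^, so "Baidu" may appear anywhere in the User-Agent header.
# This single condition also covers "Baiduspider", so a second
# condition (and the [OR] flag) is no longer needed.
RewriteCond %{HTTP_USER_AGENT} Baidu [NC]
# Answer every matching request with 403 Forbidden.
RewriteRule .* - [F]
```

With this in place, the logged UA "Mozilla/5.0 (compatible; Baiduspider/2.0; ...)" should match, because the pattern is now allowed to hit in the middle of the string.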

Dan99

5:08 am on Nov 7, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks. That did it. I got 'em.

I had to go back and refresh my memory about regex vocabulary. Indeed, the ^ matches strings exactly equal to what follows it. Without the ^ it matches strings containing it.

lucy24

6:48 am on Nov 7, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



^ is an opening anchor. It means the string to be matched has to start with the specified text.
$ is a closing anchor. It behaves exactly the same, mutatis mutandis: the string to be matched has to end with the specified text.
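A rough illustration of the difference, using the Baiduspider UA from the log line at the top of the thread. The patterns are examples only. Each RewriteCond is meant to be read on its own, not stacked into one rule set (stacked conditions would be ANDed together).

```apache
# Opening anchor: the UA must START with "Baiduspider".
# Does not match the logged UA, which starts with "Mozilla".
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider

# Closing anchor: the UA must END with "spider.html)".
# Matches the logged UA, whose last characters are exactly those.
RewriteCond %{HTTP_USER_AGENT} spider\.html\)$

# No anchor: "Baiduspider" may appear anywhere in the UA.
# Matches the logged UA.
RewriteCond %{HTTP_USER_AGENT} Baiduspider

# Both anchors: the whole UA must be exactly "Baiduspider".
# Does not match the logged UA.
RewriteCond %{HTTP_USER_AGENT} ^Baiduspider$
```

Only the last form means "exactly equal to"; a single ^ or $ pins down just one end of the string.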

I think it's quite common for people to see anchors in published RewriteRules and simply misunderstand what they're for. Anchors could easily be mistaken for some essential part of mod_rewrite syntax, along the lines of "I've finished naming the rule and now I'm starting on the next phrase".

Dan99

2:00 pm on Nov 7, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Right. That's a better statement than I made.

I saw several recommendations for how to use RewriteRules for User Agents, and many recommended that anchor. I frankly don't see why. I want to kill off any request that has "Baidu" (matched with [NC]) anywhere in the UA string, not just ones that start with it.

lucy24

9:58 pm on Nov 7, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"many recommended that anchor"

Scratch all those many off your "advice worth following" list, then. It means they copied their rules from some other source, which itself didn't understand what the anchor does.

Try to avoid [NC] unless it's really needed, since it creates a little more work for the server. What you're really saying is
[Bb][Aa][Ii][Dd][Uu]
The [NC] flag saves time for you, not for the server.

The string "baidu" with any kind of casing is not likely to occur in anything other than Baidu spiders. So [NC] will not do any harm; it just might not be necessary. I don't suppose it's all that common in fake UAs. Not like, say, "GoogleBot" (note casing) which I've personally seen.
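To make the comparison concrete, here are the three ways of writing the condition side by side. This is only a sketch of the syntax; each line is an alternative, not something to use together, and it makes no attempt to measure what any of them costs the server.

```apache
# Case-insensitive via the flag: matches "Baidu", "baidu", "BAIDU", ...
RewriteCond %{HTTP_USER_AGENT} Baidu [NC]

# The same set of matches with the casing spelled out by hand.
RewriteCond %{HTTP_USER_AGENT} [Bb][Aa][Ii][Dd][Uu]

# Case-sensitive: matches only the exact casing "Baidu",
# which is what the logged UA uses in "Baiduspider".
RewriteCond %{HTTP_USER_AGENT} Baidu
```

If every real Baidu UA you see uses the casing "Baidu", the third form is enough and the flag can be dropped.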

Dan99

10:12 pm on Nov 7, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hmmm. Interesting. So "Baidu [NC]" creates more work for the server than "[Bb][Aa][Ii][Dd][Uu]"? Should that be obvious?