Forum Moderators: phranque


yowzaa! Baidu is steamrolling my logs

         

Dan99

2:53 am on Nov 15, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



OK, I've had a site up for many years, and had occasional pokes and prods from Baidu. They became more frequent. Several months ago I asked them to go away in my robots.txt file. They fetched that file, and didn't go away. No big surprise. Oh, BTW, it's really Baidu: the hits come from the well-known Baidu IP ranges, mostly 180.76 and 220.181, with the user agent "Baiduspider/2.0; +http://www.baidu.com/search/spider.html". I don't exclude other search engines in my robots.txt file, but I ask them to show up infrequently, and they seem to obey that.
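For reference, a robots.txt along those lines might look like the sketch below. (The exact file is an assumption on my part; also note Crawl-delay is a non-standard directive that only some engines honor, which fits the "I ask them to show up infrequently" approach.)

```
# Ask Baidu to stay out entirely
User-agent: Baiduspider
Disallow: /

# Ask everyone else to crawl gently (non-standard; honored by some bots)
User-agent: *
Crawl-delay: 10
```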

So I blocked Baidu. By UA, and not by IP. Every single Baidu hit gets 403'd now.
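In case it's useful to anyone, a UA-based 403 like that takes only a few lines of mod_setenvif in .htaccess. This is a minimal sketch in Apache 2.2-style syntax, not necessarily what Dan99 used; "bad_bot" is just an arbitrary flag name.

```apacheconf
# Match the Baidu UA (case-insensitive) and set a flag
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
# Deny flagged requests -- these show up as 403s in the log
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```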

But the hits are getting manic. I'm now getting one every minute or two, and although they're hardly using my bandwidth, they're kinda filling up my logs.

What are my options? I'm assuming after a while they'll just get bored and go away. But that hasn't happened yet.

Dan99

4:17 am on Dec 1, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



FWIW, here we are, several weeks later, and the Baidu hits are starting to subside after I denied service to it. They were coming in one every few minutes originally, and gradually got less frequent. As of today, they're coming in once every hour or two. Baidu is getting bored with 403s, which is all I'm giving it. These are almost all coming from 180.76 IPs, which are real Baidu addresses according to DNS lookups, and they go by "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)".

Don't let anyone tell you that Baidu obeys robots.txt. I tried to ban it there long ago. Not only did Baidu not obey my robots.txt directive, it's really not clear to me that Baidu ever even looked at that file.

Fie, fie on Baidu.

lucy24

5:31 am on Dec 1, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Don't let anyone tell you that Baidu obeys robots.txt

At one time there was a difference in behavior between Japanese Baidu and Chinese Baidu. Don't know if that's still the case.

What are my options?

Is it your own server? If so, you could use a firewall. It won't stop the requests, but at least they won't make it as far as the server logs. Then again-- if your main issue is log bloat, and if it is your own server, you could always exclude Baidu from logging. (I wouldn't personally do this, just pointing out it's an option.)

Search engines never get bored and go away.

:: detour to check random selection of logs ::

Weird. They seem to fixate on a handful of specific files, and keep asking for them month after month. I don't think it's because they are linked from someplace that permits Baidu crawling, though I guess you can't absolutely exclude the possibility.

Dan99

7:14 pm on Dec 1, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Yep. Why didn't I think of using a firewall? I just have to look into how to use ipfw on my Mac server. I think it's straightforward.
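For anyone else on an older OS X server, the ipfw rules might look something like the sketch below. The rule numbers are arbitrary, the ranges are the two mentioned upthread, and ipfw was deprecated in favor of pf around OS X 10.7, so check which firewall your version actually uses.

```shell
# Drop inbound web traffic from Baidu's two main ranges
sudo ipfw add 100 deny tcp from 180.76.0.0/16 to any 80
sudo ipfw add 110 deny tcp from 220.181.0.0/16 to any 80
```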

How does one exclude Baidu from being logged? I didn't know you could do such exclusion. I'm assuming that excluding Baidu from Apache logging is worse than stopping it with a firewall because in the former case it is still taking up Apache horsepower, and it's useful to have an honest log of what is taking up that horsepower.

As to Japanese Baidu, these IPs resolve out to "Beijing Baidu Netcom Science and Technology Co." So if it's a Japanese Baidu, they're outsourcing their server to Beijing.

lucy24

10:04 pm on Dec 1, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



By the usual yawn-provoking coincidence, I was reading about conditional logs [httpd.apache.org] just a few days ago (link to 2.2, but 2.0 and 2.4 are essentially the same-- scroll down near the end of the "Access Logs" section). It's done in conjunction with mod_setenvif, and looks like this:

BrowserMatch Baidu dontlog
CustomLog logs/access_log common env=!dontlog


(Replace "common" with "combined" or whatever you normally use, and replace "dontlog" with any name of your choice.)

it's useful to have an honest log of what is taking up that horsepower

Exactly. That's why I wouldn't want a conditional log myself. Save the exclusions for the next step, where you customize your log-wrangling software to pick out the parts that matter.
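As a trivial example of that next step, filtering at analysis time with grep keeps the raw log honest. The sample log lines below are inlined just for illustration; in practice you'd point it at your real access log.

```shell
# Two fake log lines standing in for a real access log
printf '%s\n' \
  '180.76.5.1 - - "GET / HTTP/1.1" 403 "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"' \
  '203.0.113.9 - - "GET /page HTTP/1.1" 200 "Mozilla/5.0 (Macintosh)"' > sample.log

# Strip the Baidu noise before feeding the log to analysis tools
grep -v 'Baiduspider' sample.log > sample.filtered
cat sample.filtered
```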

I really haven't thought much about Baidu in a couple of years. They used to crawl from a few different IPs, one of which was in Japan. I just block by IP, not by UA, so the Japanese version could always get in. I might do some bot-watching in January and then I'll see what they're up to.

Dan99

4:23 pm on Dec 28, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Search engines never get bored and go away.

Actually, Baidu did. Quite suddenly. Just GONE. As of two days ago, roughly a month into their attack. Now, what has been ramping up in its place, and remains, are noxious pokes from 202.46.x.x with what appears to be a forged UA -- Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2 (huh? Firefox 6?), which are also from China. Says to me that Baidu looks over at ShenZhen Sunrise and says "I'm getting bored. You take over for a while." About one in ten requests are now from them. Those too are easy to block, but end up littering my logs. Maybe they'll get bored and go away as well.

I never bothered with the firewall, but as long as there is some possibility of boredom and quitting, I'll hold out for a while more.

Dan99

3:48 pm on Jan 11, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Just a quick followup. 202.46.x.x, which resolves to ShenZhen Sunrise Technologies in China, is GONE from my logs. Two weeks ago, they were hitting me a few dozen times an hour (and getting 403'ed every time). A week ago that had dropped off to a few times an hour. Two days ago, they DISAPPEARED. As I said, I think these guys were connected with Baidu, and so the whole Baidu saga is over for me, it would seem. Keeping my fingers crossed.

Again (and I have seen this referred to elsewhere): when Baidu gets tired of hitting you and getting 403'ed repeatedly, it turns things over to its ShenZhen servers. It would seem that those servers get tired as well.

lucy24

9:00 pm on Jan 11, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Baidu looks over at ShenZhen Sunrise and says "I'm getting bored. You take over for a while."

Hee. 202.46.what-exactly? I've got 202.46. broken into assorted smaller bits, some as small as /23 and /24 but I think they're all primary, not an upstream server farm. My notes for 202.46.32.0/19 say "202.46.53-59 many robots" -- is that where your unwanted visitors came from?

Dan99

9:23 pm on Jan 11, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I found it necessary to block 202.46.32.0/19. They all resolve to the same ShenZhen Sunrise Technologies. So it looks like my blocking needs correspond exactly to what you have in your notes.
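For anyone following along, a /19 leaves 13 host bits, so it spans a block of 32 in the third octet. A comment-annotated sketch of what that deny line actually covers:

```apacheconf
# 202.46.32.0/19 = 202.46.32.0 through 202.46.63.255
# (third octet 32-63: 0010 0000 through 0011 1111 under a /19 mask)
deny from 202.46.32.0/19
```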

Dan99

9:39 pm on Jan 11, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I should say that I was seeing them from 202.46.49.x to 202.46.63.x. So I guess I was seeing a somewhat wider range than you were.

lucy24

12:23 am on Jan 12, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Notes don't say how long ago I blocked the range. And once someone is blocked, it's out of sight out of mind unless I'm doing an At Home With The Robots exercise. I've blocked all of China /16 and up a priori, but to keep the htaccess manageable, smaller ranges only get blocked if they do something to offend.

EastTexas

2:59 am on Jan 18, 2015 (gmt 0)

10+ Year Member



# Non-USA is blocked

deny from 202.0.0.0/202.72.95.255
deny from 202.72.112.0/202.255.255.255

<IfModule mod_setenvif.c>
# SetEnvIfNoCase User-Agent ^$ keep_out

SetEnvIfNoCase User-Agent (360spider|baidu|bla|bla|bla) keep_out

<limit GET POST PUT>
Order Allow,Deny
Allow from all
Deny from env=keep_out
</limit>
</IfModule>

lucy24

7:30 am on Jan 18, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



deny from 202.0.0.0/202.72.95.255
deny from 202.72.112.0/202.255.255.255

I am 95% certain that a quotation from The Princess Bride is warranted here ... but for the sake of the other 5%, what is the part after the / intended to signify?

EastTexas

8:05 pm on Jan 18, 2015 (gmt 0)

10+ Year Member



# block a partial domain via network/netmask values
[a nonauthoritative source giving bad examples ]

[edited by: phranque at 1:07 am (utc) on Jan 19, 2015]
[edit reason] snipped url [/edit]

Dan99

8:25 pm on Jan 18, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Two points. One is that blocking Baidu was never an issue; that's pretty trivial to do, and as I said in my OP, I did it. The issue was keeping those denials out of my access logs. The second is that in your example above, 202.72.95.255 and 202.255.255.255 sure don't look like any sensible netmasks. What am I missing?

wilderness

8:35 pm on Jan 18, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



# block a partial domain via network/netmask values


FWIW, you need to STOP providing links to this website in this forum or any other forum.

You seem to believe that they offer some type of valid authority, when in fact many of the syntax examples they provide contain errors. Some of these will even throw 500 errors and take a website down when added blindly.

EastTexas

8:49 pm on Jan 18, 2015 (gmt 0)

10+ Year Member



Thanks for the info 8)

lucy24

9:17 pm on Jan 18, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What am I missing?

"You keep using that netmask. I do not think it means what you think it means."

[cyberflunk.com...]
[freesoft.org...]

et cetera. I'm sure there's an explanation somewhere at apache.org as well, but I couldn't find it; they just seem to assume you already know.
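For the record, Apache 2.2's Allow/Deny directives accept a few specific address forms, none of which is an arbitrary start/end IP range. A sketch of the valid spellings (the 202.46 values here are just placeholders):

```apacheconf
deny from 202.46.32.17               # a full IP
deny from 202.46                     # a partial IP (matches 202.46.*.*)
deny from 202.46.32.0/255.255.224.0  # network/netmask
deny from 202.46.32.0/19             # network/CIDR
```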

EastTexas

9:59 pm on Jan 18, 2015 (gmt 0)

10+ Year Member



Searching ixquick for: deny netmask (htaccess)
turned up a lot of good info.

Thanks again 8)