Baiduspider - does it obey robots.txt ?
1script
msg:4348359
6:56 pm on Aug 5, 2011 (gmt 0)

I'm getting Baiduspider+ (and less frequently Baiduspider) pummeling my sites and I would like to disallow it. Two days ago I added this to my robots.txt:

User-agent: Baiduspider
Disallow: /

They've read it 5 times already and still haven't stopped. The user agent string leads to an error page:

"Baiduspider+(+http://www.baidu.com/search/spider.htm)"

And of course it's in Chinese (Mandarin?) so I can't tell what it says.

So, do you guys know if they just don't obey robots.txt, or if they just take their sweet time adjusting to a robots.txt change? I'm tempted to just firewall them out, but that would mean they can't read any future robots.txt changes and will keep pounding.

So, does anyone know how best to stop Baiduspider?

P.S. I've added this today, will see what happens:

User-agent: Baiduspider+
Disallow: /

 

g1smd
msg:4348425
9:58 pm on Aug 5, 2011 (gmt 0)

I always allow several days for search engines to process what they find in the robots.txt file. They are not all instantaneous in updating their crawl rules.

Pfui
msg:4348434
10:33 pm on Aug 5, 2011 (gmt 0)

1.) Are you sure the UA is "Baiduspider"? If not, it'd be no great shakes if you added variations until it/they stopped crawling:

User-agent: Baiduspider
User-agent: baiduspider
User-agent: Baiduspider+
Disallow: /

Btw, Wikipedia says: "The user-agent string of Baidu search engine is baiduspider." [en.wikipedia.org...]

2.) The fastest fix would be to simply block the IP address(es) in .htaccess, except for access to robots.txt. That way, the bot can still read/heed the file it needs, and you're spared unnecessary hits until it does (see the sketch after this list).

3.) Baidu's spiders hit my sites scores of times/day, always hailing from 119.63.196. --

119.63.196.53
119.63.196.120
119.63.196.84
119.63.196.73
119.63.196.83
119.63.196.91
etc. etc. etc.

They're fully Disallowed via "User-agent: *" in robots.txt (dynamically generated) and follow it. But it'd be nicer if they stopped coming by at all.
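
For point 2.), a minimal mod_rewrite sketch of "block the IP except for robots.txt" - assuming Apache with mod_rewrite enabled in .htaccess, and using the 119.63.196. prefix above purely as an example:

RewriteEngine On
# Example IP prefix from this thread - substitute the addresses you actually see
RewriteCond %{REMOTE_ADDR} ^119\.63\.196\.
# Let robots.txt through so the bot can still read the Disallow rules
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [F]

Everything else from that prefix gets a 403 while robots.txt stays readable.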

tangor
msg:4348462
12:33 am on Aug 6, 2011 (gmt 0)

I never mind serving a robots.txt to bots who read and obey! And Baidu, for me, has behaved very nicely.

lucy24
msg:4348468
1:45 am on Aug 6, 2011 (gmt 0)

There seem to be two of them. One is Japanese (119.et cetera) and behaves itself; the other is Chinese (123.et cetera) and doesn't. I dealt with them ages ago by IP alone.

1script
msg:4348479
3:24 am on Aug 6, 2011 (gmt 0)

Thanks for your suggestions, guys.
Mine is in the 119. IP range. Anyway, I'm going to wait and see what happens over the next few days and then stop them at the firewall if they don't stop on their own. But from what everyone's saying they seem to be a well-behaved robot, so I hope it won't come to that.
Cheers!

dstiles
msg:4348720
10:00 pm on Aug 6, 2011 (gmt 0)

I allow the Japanese one but block the Chinese one by IP (i.e. not in robots.txt).

The JP one seems well behaved.

PS: doubtless the JP and the CN spiders share their gleanings so it probably doesn't matter about blocking the CN one.

I have the IPs below for baiduspider. There may be some I haven't noticed, especially the Chinese ones.

61.135.169.32 - 61.135.169.32 (CN: China)
61.135.190.1 - 61.135.190.254 (CN: China)
119.63.192.128 - 119.63.192.254 (JP: Japan)
119.63.193.0 - 119.63.193.255 (JP: Japan)
119.63.196.1 - 119.63.196.127 (JP: Japan)
119.63.198.0 - 119.63.198.255 (JP: Japan)
119.63.199.103 - 119.63.199.103 (JP: Japan)
123.125.66.0 - 123.125.66.255 (CN: China)
123.125.71.0 - 123.125.71.255 (CN: China)
220.181.7.0 - 220.181.7.255 (CN: China)
220.181.108.0 - 220.181.108.255 (CN: China)

dstiles
msg:4349994
4:57 pm on Aug 10, 2011 (gmt 0)

I've just seen a new baidu bot UA on a valid Japanese baidu bot IP 119.63.196.78 ...

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Anyone else seeing this?

Staffa
msg:4350057
6:38 pm on Aug 10, 2011 (gmt 0)

This UA has been around for a while, last seen on 06 Aug from 123.125.71.104

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

dstiles
msg:4350124
9:28 pm on Aug 10, 2011 (gmt 0)

Thanks for the confirmation.

123.125.71.104 is one of the Chinese bot sources - I block that but it's useful to know that both JP and CN are using the same UA.

1script
msg:4350187
2:39 am on Aug 11, 2011 (gmt 0)

123.125.71.104 is one of the Chinese bot sources - I block that but it's useful to know that both JP and CN are using the same UA.

But are they feeding their results to the same search engine?
I mean, I have ZERO referrals from baidu.com and my understanding was that Baidu is a Chinese search engine. Clearly, my lack of both Chinese and Japanese is showing: is there a page on that site somewhere that actually explains the relationship between the CN and JP counterparts? Are they both feeding the index of the same search engine? If so, I would block BOTH - what difference does it make which country the bot is coming from? In the end, we're all interested in the traffic, and that's not coming from either country as far as I can tell.

By the way, a week into this, after adding every permutation of baiduspider's name to robots.txt, the Japanese baiduspider hasn't stopped and hasn't even slowed down. So much for following robots.txt ...

Next step

iptables -A INPUT -s 119.63.192.0/21 -j DROP
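# 119.63.192.0/21 covers 119.63.192.0 - 119.63.199.255, i.e. the Baidu JP ranges listed earlier in the thread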
service iptables save


I'm going to give them a couple more days and then drop their access if they don't heed the robots.txt.

tangor
msg:4350218
4:43 am on Aug 11, 2011 (gmt 0)

@1script: Is Baidu requesting pages OTHER than robots.txt? If so, then boot 'em, but WORLD+DOG should be able to get robots.txt... else robots.txt means nothing. I don't mind them checking back to see if I've changed my mind...

keyplyr
msg:4350258
9:25 am on Aug 11, 2011 (gmt 0)

As another point of view, although I have never seen measurable traffic from baidu, or any Chinese source for that matter, I do allow them to freely crawl. They have always followed robots.txt directives and my sites do show up in their SERP. More importantly, I'd rather show a branded presence there than have that market share be assumed by copycats and posers. The Chinese are well known for knock-offs.

1script
msg:4350396
3:15 pm on Aug 11, 2011 (gmt 0)

@tangor: Yes, they come for URLs other than robots.txt - roughly 10,000 of them every day. It's not an outrageous amount, but on principle: why don't they obey the robots.txt? Maybe I have my robots.txt wrong for them? Problem is, their spider description page (linked from the UA string) leads to an error page, so there's no good info on what I should put in my robots.txt.

Here is what I have:

User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider+
Disallow: /


Do they call themselves by any other name?

The robots.txt is regularly picked up by both Chinese (once a day) and Japanese Baidu bots (3-4 times a day).

Funny thing: the Chinese bot reads only robots.txt and leaves. The Japanese one reads robots.txt and 10,000 other pages - the exact opposite of what everyone in this thread has been saying so far. It appears to be the Japanese bot that misbehaves.

And yes, I am reluctant to just ax them with a firewall, because I would like to be able to change my mind at some point in the future. If they can no longer get to robots.txt, they wouldn't know I'd changed my mind ...

wilderness
msg:4350442
4:57 pm on Aug 11, 2011 (gmt 0)

Next step

iptables -A INPUT -s 119.63.192.0/21 -j DROP
service iptables save


It's not really necessary to accumulate and apply a long list of IPs for this bot.

Rather, simply deny access to any UA that contains 'spider'; such bots deny themselves by having chosen a long-abused term when they named themselves ("crawler" is another example).
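
A minimal sketch of that approach for Apache (assuming 2.2-style access directives are allowed in .htaccess; the term list is just an example):

# Deny any request whose UA contains 'spider' or 'crawler', case-insensitively
SetEnvIfNoCase User-Agent (spider|crawler) bad_ua
Order Allow,Deny
Allow from all
Deny from env=bad_ua

Note this also denies them robots.txt; if you want them to keep reading it, you'd need an exception along the lines of the mod_rewrite sketch earlier in the thread.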

1script
msg:4350476
5:52 pm on Aug 11, 2011 (gmt 0)

@wilderness: you may be onto something! It never occurred to me before but I cannot remember a single bot that I would care for that has the word "spider" in it. It's so 90s :) !

My only concern with denying by UA is that they'd still be using Apache resources. I don't know if it's a valid concern, but what if they don't handle a 403 HTTP response properly and keep pounding? From a resources standpoint, I think axing them at the firewall uses fewer resources (and creates less traffic, too - no response needs to be sent, however small).

wilderness
msg:4350495
6:23 pm on Aug 11, 2011 (gmt 0)

I don't know if it's a valid concern but what if they don't treat a 403 HTTP response properly and keep pounding?


If you use a custom 403, you may reduce the bandwidth to nearly zero KB (JdMorgan has provided such an explanation in the past; my own current custom 403 results in 315kb). However, even absent a custom 403 (i.e. the browser/server default), the bandwidth is less than 800kb.

If the bot returned a hundred or a thousand times to eat 403s, none of that bandwidth would have any noticeable effect on your server costs.

The one downside is that 403s continue to clutter your raw logs and/or stats software.

You could easily black-list hundreds or thousands of UAs, and if configured properly it would NOT have any ill effect on server CPU or bandwidth load.
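
For reference, one minimal way to get a tiny custom 403 body - a sketch assuming Apache 2.4 syntax and that ErrorDocument is allowed in .htaccess; the string form serves a few-byte plain-text message instead of the default error page:

ErrorDocument 403 "Forbidden"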

FWIW, there are a couple dozen of these abusive UA terms. I've added some in the past, however failed to bookmark the threads.

wilderness
msg:4350501
6:42 pm on Aug 11, 2011 (gmt 0)

Relatively recent thread [webmasterworld.com]

"Extractor" is one of the so-called standard deny terms.

Others: crawler, spider, download, harvest, email, larbin, Nutch, link, PHP, Reaper, Wget, fetch, curl, libwww, and any variations of wording with similar definitions.

Pfui
msg:4350539
8:04 pm on Aug 11, 2011 (gmt 0)

Do they call themselves by any other name?

As mentioned above (ahem:) --

Btw, Wikipedia says: "The user-agent string of Baidu search engine is baiduspider." [en.wikipedia.org...]

I must be missing something here -- no mod_rewrite? -- because a solution's always seemed as easy as:

RewriteCond %{HTTP_USER_AGENT} baidu [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [F]

(10,000 pages crawled daily by a company that provides no discernible benefit? No way.)

g1smd
msg:4350555
8:41 pm on Aug 11, 2011 (gmt 0)

my own current Custom 403 results in 315kb.

Err. You should try 850 bytes, including headers. :)

wilderness
msg:4350570
9:21 pm on Aug 11, 2011 (gmt 0)

You should try 850 bytes


Perhaps I'm a dunce?
Is this "k" or kb"?

HTTP/1.1" 403 361

dstiles
msg:4350583
10:08 pm on Aug 11, 2011 (gmt 0)

I allow baidu JP because one of my customers has Japanese customers. As I said above, it is quite possible that JP and CN share results: if so, why permit both?

And, as also noted above, baidu JP hasn't given me any problems with robots.txt compliance.

g1smd
msg:4350594
10:37 pm on Aug 11, 2011 (gmt 0)

HTTP/1.1 403 361

That's bytes. And very minimal it is too. :)

tangor
msg:4350641
3:15 am on Aug 12, 2011 (gmt 0)

Let's not get sidetracked on the size of the 403 (though that is important! Mine is 299 bytes; you'll have to optimize your HTML to get it shorter), nor worry about filtering 403s OUT of reports (very easy to do). The real question as regards the OP is Baidu, which does honor robots.txt... and in my opinion any "Baidu" which does not honor robots.txt needs to be nuked by IP, selectively, since Baidu does, in the main, honor robots.txt (which suggests that bad actors are riding their coattails). If, on the other hand, nothing from Asia is desired, then denying "baidu" in .htaccess is reasonable (results in a 403, see above).

If one is not already filtering for "spider, et al." then now might be the time to get started. Again, 403...

I'm in North America and Baidu honors robots.txt with no worries - it's been a charm... and that includes the China version, too (rDNS checked on Baidu, both countries). Those others CLAIMING TO BE Baidu which are NOT in compliance are nuked as bad actors.

My robots.txt starts with the bots allowed then ends with:

# Disallow all others
User-agent: *
Disallow: /

Works a charm, and even a significant number of Russian bots comply, too. Work smart, not hard!
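
A sketch of that whitelist shape (the allowed bot names here are only placeholders - list whichever crawlers you actually want):

# Allowed bots first: an empty Disallow means "crawl everything"
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

# Disallow all others
User-agent: *
Disallow: /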

dstiles
msg:4358110
9:00 pm on Sep 2, 2011 (gmt 0)

Today I had several dozen hits on several IPs in the range 180.76.5/24 with the Baidu spider UA...

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

IP range 180.76/16 belongs to Baidu China.

Haven't seen this range before.

1script
msg:4358123
9:23 pm on Sep 2, 2011 (gmt 0)

By coincidence I ran this literally 5 minutes before reading your message :)

iptables -A INPUT -s 180.76.0.0/16 -j DROP
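# 180.76.0.0/16 = 180.76.0.0 - 180.76.255.255, the Baidu China allocation mentioned above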
service iptables save

dstiles
msg:4358137
9:51 pm on Sep 2, 2011 (gmt 0)

Always depends on your attitude to baidu china, of course. :)

Pfui
msg:4360413
8:48 am on Sep 9, 2011 (gmt 0)

From Baidu China, two seconds apart:

baiduspider-ad-61-135-186-29.crawl.baidu.com
baiduspider-ad-61-135-186-18.crawl.baidu.com

robots.txt? NO

Note the missing space after the semicolon before "baidu Transcoder" and the space after the last open paren:

Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.8;baidu Transcoder) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729)

Pfui
msg:4360430
9:51 am on Sep 9, 2011 (gmt 0)

...Then 60 minutes later, Baidu Japan:

119.63.196.111
Baiduspider+(+http://www.baidu.com/search/spider.htm)
robots.txt? Yes

Then 40 minutes later, Baidu China again, with four bare IPs seconds apart and another malformed UA:

180.149.133.15
180.149.133.39
180.149.133.14
180.149.133.16

robots.txt? NO

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; baidu Transcoder;)

Enough.

keyplyr
msg:4367563
1:26 am on Sep 27, 2011 (gmt 0)

There seem to be quite a lot of Chinese IPs calling themselves Baiduspider. I've just added a white-list filter allowing only those ranges that are verified as Baidu-owned in WHOIS *and* have rDNS of the form baiduspider-*.crawl.baidu.com.

I don't much care whether they're Japanese or Chinese, just that they're valid.
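
A quick way to spot-check a claimed Baiduspider IP from the shell (a sketch; the address is just one quoted earlier in this thread, and the expected hostname pattern is the baiduspider-*.crawl.baidu.com form reported above):

IP=123.125.71.104
# Reverse DNS: expect a hostname ending in .crawl.baidu.com
host "$IP"
# Ownership: expect a Baidu-registered netblock
whois "$IP" | grep -iE 'netname|descr'
# Finally, forward-resolve the hostname returned above and confirm it maps back to $IP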
