Forum Moderators: Ocean10000 & incrediBILL

Baidu Behaving Badly: Goes undercover w/ cloaked UA; omits robots.txt

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

     

Pfui

10:36 pm on May 23, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Baiduspider's been around forever and the company gets a lot of mentions in this site's "The Search Engine World / Asia and Pacific Region" forum [webmasterworld.com]. The IPs I see are similar permutations of this morning's hits --

119.63.193.226
Baiduspider+(+http://www.baidu.com/search/spider.htm)
119.63.193.225
Baiduspider+(+http://www.baidu.com/search/spider.htm)
119.63.193.224
Baiduspider+(+http://www.baidu.com/search/spider.htm)

-- a.k.a.:

Baidu, Inc., Japan
119.63.192.0 - 119.63.199.255
119.63.192.0/21
(CIDR courtesy of the terrifically nifty "IP to CIDR online converter [ip2cidr.com]")
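For anyone who would rather not depend on an online converter, the same range-to-CIDR conversion can be sketched with Python's standard `ipaddress` module. The helper name `range_to_cidrs` is made up for this example; only the two range endpoints come from the post above.

```python
import ipaddress

def range_to_cidrs(first, last):
    """Collapse an inclusive IPv4 range into the smallest list of CIDR blocks."""
    first_ip = ipaddress.ip_address(first)
    last_ip = ipaddress.ip_address(last)
    # summarize_address_range yields the networks that exactly cover the range
    return [str(net) for net in ipaddress.summarize_address_range(first_ip, last_ip)]

cidrs = range_to_cidrs("119.63.192.0", "119.63.199.255")
print(cidrs)  # ['119.63.192.0/21']

# Membership check for one of the IPs seen in the logs
print(ipaddress.ip_address("119.63.193.226") in ipaddress.ip_network(cidrs[0]))  # True
```

A range that does not fall on a clean power-of-two boundary comes back as several blocks rather than one.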

Baiduspider is also well-behaved, asking for robots.txt and heeding it. My only complaint is that they hit my most popular site waaay too many times/day. That's it, complaint-wise.

Until today. Almost three hours on-the-nose after the above 'okay' hits:

119.63.193.226
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
119.63.193.225
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
119.63.193.224
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

Exact same IPs as the prior hits. Exact same order of IPs. Exact same files requested. But now, the UA's cloaked and robots.txt was not requested.
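One way to spot this pattern in a log is to group user agents by IP and flag any address that announced itself as a bot and also showed up with a different UA. The sketch below assumes the log has already been parsed into `(ip, user_agent)` pairs; the function name `find_ua_switchers` and the sample list are illustrative, built from the hits quoted above.

```python
from collections import defaultdict

def find_ua_switchers(hits, bot_token="Baiduspider"):
    """Return {ip: [other UAs]} for IPs that sent a bot UA and also some other UA."""
    seen = defaultdict(set)
    for ip, ua in hits:
        seen[ip].add(ua)

    switchers = {}
    for ip, uas in seen.items():
        bot_uas = {ua for ua in uas if bot_token in ua}
        other_uas = uas - bot_uas
        # Flag only IPs that used both a declared bot UA and something else
        if bot_uas and other_uas:
            switchers[ip] = sorted(other_uas)
    return switchers

BOT_UA = "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

# Sample data: the three IPs from the post, first with the bot UA, then the browser UA
sample_hits = [
    ("119.63.193.226", BOT_UA),
    ("119.63.193.225", BOT_UA),
    ("119.63.193.224", BOT_UA),
    ("119.63.193.226", BROWSER_UA),
    ("119.63.193.225", BROWSER_UA),
    ("119.63.193.224", BROWSER_UA),
]

for ip, uas in sorted(find_ua_switchers(sample_hits).items()):
    print(ip, uas)
```

This only shows that an address changed UAs. It says nothing about why, and shared proxies can produce the same signature.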

(Aside: Isn't that also a highly unlikely UA because the OS is unpatched? The only hits I see with it are suspect at best.)

I'm not sure if the above is 100% brand new (mis)behavior because the only file I let Baidu's 'visible' UA and/or that CIDR see is robots.txt so any other hits get 403'd. (That UA will probably get the same treatment now, too.)
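The policy described here (robots.txt only for the declared UA and/or the CIDR, 403 for everything else) can be sketched as a small decision function. This is an illustration of the rule, not the poster's actual server config: the name `status_for` is hypothetical, the path is assumed to be `/robots.txt`, and matching the cloaked UA by exact string is an assumption that could also catch real browsers sending the same string.

```python
import ipaddress

BAIDU_JP = ipaddress.ip_network("119.63.192.0/21")  # CIDR from the post
CLOAKED_UA = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"

def status_for(ip, user_agent, path):
    """Return the HTTP status this policy would give a request."""
    from_baidu_range = ipaddress.ip_address(ip) in BAIDU_JP
    declared_bot = "Baiduspider" in user_agent
    cloaked = user_agent == CLOAKED_UA  # assumption: exact-match block

    if from_baidu_range or declared_bot or cloaked:
        # Restricted clients may read robots.txt and nothing else
        return 200 if path == "/robots.txt" else 403
    return 200

print(status_for("119.63.193.226", CLOAKED_UA, "/index.html"))   # 403
print(status_for("119.63.193.226", CLOAKED_UA, "/robots.txt"))   # 200
```

Keying on the CIDR as well as the UA is what catches the cloaked hits, since the UA alone no longer identifies the crawler.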

Anyway, be on the lookout, gang.

incrediBILL

10:58 pm on May 23, 2009 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



When you see a bot's IPs suddenly start showing browser UAs, and always the same UA, the logical conclusion to draw is that they are starting to take screenshots.

So far this has been the case every time I've seen this pattern, but it's almost always a Firefox UA, rarely MSIE.

Pfui

2:36 am on May 24, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Hmm. The Baidu mothership(s) shouldn't have any of the site's graphics cached, and none were snagged this time -- the cloaked UA just hit html files.

.
Addenda to OP

1.) A new UA? Only for Baidu.jp bots? From the aforementioned Baidu CIDR today:

Baiduspider+(+http://www.baidu.jp/spider/)

2.) If Baidu.com, a Chinese company, crawls from Beijing-based IPs, here's another IP. Perhaps the presence of baidu.com -- not baidu.jp -- in the URL indicates country of crawl. Or something:

123.125.64.15
Baiduspider+(+http://www.baidu.com/search/spider.htm)

Finis. (Really:)

incrediBILL

3:15 am on May 24, 2009 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



If it's HTML only, perhaps they're checking for cloakers and other SE abuse.

Regardless, robots.txt should still be honored.

Pfui

6:39 pm on Jun 21, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Something new from baidu.com comes...

Note incorrect spelling of "Mozilla" and non-baidu URL in UA. FYI: McAfee has an "Adware, spyware, or viruses" report about lssw365.net [siteadvisor.com], dated 06-16-09.

In one day (partial listing; chronological):

06/18 05:51:07

baiduspider-123-125-66-32.crawl.baidu.com
Mosilla+(+http://www.lssw365.net/)
robots.txt? YES

06/18 08:34:58

baiduspider-123-125-66-17.crawl.baidu.com
Mosilla+(+http://www.lssw365.net/)
robots.txt? YES

06/18 11:13:28

119.63.193.56
Baiduspider+(+http://www.baidu.jp/spider/)
robots.txt? YES

06/18 14:13:36

123.125.64.15
Baiduspider+(+http://www.baidu.com/search/spider.htm)
robots.txt? YES

GaryK

8:02 pm on Jun 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Same here:

06/18/2009 12:54:52 UTC

Mosilla+(+http://www.lssw365.net/)
123.125.66.40
baiduspider-123-125-66-40.crawl.baidu.com
robots.txt? YES

06/17/2009 17:24:15 UTC

Baiduspider+(+http://www.baidu.com/search/spider.htm)
123.125.66.40
baiduspider-123-125-66-40.crawl.baidu.com
robots.txt? YES
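Since the reverse-DNS names in these listings embed the IP, a quick consistency check is to parse the address back out of the hostname and compare. The sketch below assumes the `baiduspider-A-B-C-D.crawl.baidu.com` pattern seen above holds generally, which the thread does not confirm; the function names are made up. It is a string check only and does no DNS lookups, so a proper verification would still need a reverse lookup followed by a forward lookup.

```python
import re

# Pattern inferred from the hostnames quoted in this thread (an assumption)
_HOST_RE = re.compile(
    r"^baiduspider-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.crawl\.baidu\.com$"
)

def ip_from_crawl_hostname(hostname):
    """Extract the dotted IP embedded in a crawl hostname, or None if it does not fit."""
    match = _HOST_RE.match(hostname.lower().rstrip("."))
    if not match:
        return None
    octets = match.groups()
    if any(int(octet) > 255 for octet in octets):
        return None
    return ".".join(octets)

def hostname_matches_ip(hostname, ip):
    """True when the hostname fits the pattern and embeds exactly this IP."""
    return ip_from_crawl_hostname(hostname) == ip

print(hostname_matches_ip("baiduspider-123-125-66-40.crawl.baidu.com", "123.125.66.40"))  # True
print(hostname_matches_ip("baiduspider-123-125-66-40.crawl.baidu.com", "123.125.66.41"))  # False
```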

BTW, the McAfee page says they tested links from the lssw365.net domain and that, when they visited the site, they found that most of its links are to sites which are safe or have only minor safety/annoyance issues.

Leosghost

9:30 pm on Jun 21, 2009 (gmt 0)

WebmasterWorld Senior Member leosghost is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



lssw365.net is the official site for the PRC's "Green Dam" internet censorship system. "Bill" has a thread running here [webmasterworld.com] at WebmasterWorld. I suggest the two "Bills" exchange sm's :)

dstiles

9:47 pm on Jun 21, 2009 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I'm getting the occasional MSIE 7 UA mentioned above. I'm blocking it on a specific header combination after a bad scan but it may turn out I can block it on all combinations.

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

 
