Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Baidu Behaving Badly: Goes undercover w/ cloaked UA; omits robots.txt
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
Pfui




msg:3919202
 10:36 pm on May 23, 2009 (gmt 0)

Baiduspider's been around forever and the company gets a lot of mentions in this site's "The Search Engine World / Asia and Pacific Region" forum [webmasterworld.com]. The IPs I see are similar permutations of this morning's hits --

119.63.193.226
Baiduspider+(+http://www.baidu.com/search/spider.htm)
119.63.193.225
Baiduspider+(+http://www.baidu.com/search/spider.htm)
119.63.193.224
Baiduspider+(+http://www.baidu.com/search/spider.htm)

-- a.k.a.:

Baidu, Inc., Japan
119.63.192.0 - 119.63.199.255
119.63.192.0/21
(CIDR courtesy of the terrifically nifty "IP to CIDR online converter [ip2cidr.com]")
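For anyone who would rather do the range-to-CIDR conversion locally, here is a minimal sketch using Python's standard `ipaddress` module. The variable names are just for illustration; the two endpoint addresses are the ones listed above.

```python
import ipaddress

# Range endpoints as reported for Baidu, Inc., Japan in the post above.
first = ipaddress.IPv4Address("119.63.192.0")
last = ipaddress.IPv4Address("119.63.199.255")

# summarize_address_range yields the networks that exactly cover the range.
blocks = list(ipaddress.summarize_address_range(first, last))
print([str(block) for block in blocks])

# Membership test for one of the crawler IPs seen in the logs.
crawler_ip = ipaddress.IPv4Address("119.63.193.226")
print(crawler_ip in blocks[0])
```

The range spans eight consecutive /24s starting on an aligned boundary, which is why it collapses to a single /21.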

Baiduspider is also well-behaved, asking for robots.txt and heeding it. My only complaint is that they hit my most popular site waaay too many times/day. That's it, complaint-wise.

Until today. Almost three hours on-the-nose after the above 'okay' hits:

119.63.193.226
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
119.63.193.225
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
119.63.193.224
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

Exact same IPs as the prior hits. Exact same order of IPs. Exact same files requested. But now, the UA's cloaked and robots.txt was not requested.
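One way to spot this pattern in your own logs is to group hits by IP and flag any address that shows up with both a self-declared spider UA and some other UA. The sketch below assumes the log has already been parsed into `(ip, user_agent, path)` tuples. The sample paths, the 203.0.113.7 address, and the helper names are all hypothetical, and the "is it a bot" test is a deliberately crude substring check.

```python
from collections import defaultdict

# Hypothetical, pre-parsed log entries: (ip, user_agent, path).
hits = [
    ("119.63.193.226", "Baiduspider+(+http://www.baidu.com/search/spider.htm)", "/robots.txt"),
    ("119.63.193.226", "Baiduspider+(+http://www.baidu.com/search/spider.htm)", "/page.html"),
    ("119.63.193.226", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)", "/page.html"),
    ("203.0.113.7", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)", "/page.html"),
]


def is_declared_bot(user_agent):
    """Crude assumption: a UA that names itself a spider/bot/crawler is one."""
    lowered = user_agent.lower()
    return any(token in lowered for token in ("spider", "bot", "crawl"))


def find_ua_switchers(entries):
    """Return {ip: [non-bot UAs]} for IPs seen with both bot and non-bot UAs."""
    agents_by_ip = defaultdict(set)
    for ip, user_agent, _path in entries:
        agents_by_ip[ip].add(user_agent)

    flagged = {}
    for ip, agents in agents_by_ip.items():
        bot_agents = {ua for ua in agents if is_declared_bot(ua)}
        other_agents = agents - bot_agents
        if bot_agents and other_agents:
            flagged[ip] = sorted(other_agents)
    return flagged


print(find_ua_switchers(hits))
```

An IP that only ever presents the browser UA (the 203.0.113.7 entry here) is not flagged, so the check targets the switch itself rather than the UA string.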

(Aside: Isn't that also a highly unlikely UA because the OS is unpatched? The only hits I see are suspect at best.)

I'm not sure if the above is 100% brand new (mis)behavior because the only file I let Baidu's 'visible' UA and/or that CIDR see is robots.txt so any other hits get 403'd. (That UA will probably get the same treatment now, too.)
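The policy described here (robots.txt only, 403 for everything else) can be sketched as a small decision function. This is an illustration, not the actual server config: the function name, the `block_cloaked_ua` flag, and the 203.0.113.7 test address are assumptions, and matching the MSIE 7 string on any IP is the "probably" part of the plan, so it is left switchable.

```python
import ipaddress

BAIDU_JP = ipaddress.ip_network("119.63.192.0/21")  # CIDR from the post
CLOAKED_UA = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"


def status_for(ip, user_agent, path, block_cloaked_ua=True):
    """Return the HTTP status to serve: 200 or 403.

    Requests from the Baidu CIDR, with the visible Baiduspider UA, or
    (optionally) with the cloaked MSIE 7 UA may fetch robots.txt only.
    """
    in_range = ipaddress.ip_address(ip) in BAIDU_JP
    declared = "baiduspider" in user_agent.lower()
    cloaked = block_cloaked_ua and user_agent == CLOAKED_UA

    if not (in_range or declared or cloaked):
        return 200  # not a restricted visitor
    return 200 if path == "/robots.txt" else 403


spider = "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
print(status_for("119.63.193.226", spider, "/robots.txt"))
print(status_for("119.63.193.226", spider, "/index.html"))
print(status_for("119.63.193.225", CLOAKED_UA, "/index.html"))
```

Because the CIDR check runs independently of the UA, the cloaked hits from 119.63.193.224-226 would already be refused even with `block_cloaked_ua=False`.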

Anyway, be on the lookout, gang.

 

incrediBILL




msg:3919205
 10:58 pm on May 23, 2009 (gmt 0)

When you see a bot's IPs suddenly start showing browser UAs, and always the same UA, the logical conclusion to draw is that they are starting to take screenshots.

So far this has been the case every time I've seen this pattern happen, but it's almost always a Firefox UA, rarely MSIE.

Pfui




msg:3919248
 2:36 am on May 24, 2009 (gmt 0)

Hmm. The Baidu mothership(s) shouldn't have any of the site's graphics cached, and none were snagged this time -- the cloaked UA just hit html files.

.
Addenda to OP

1.) A new UA? Only for Baidu.jp bots? From the aforementioned Baidu CIDR today:

Baiduspider+(+http://www.baidu.jp/spider/)

2.) If Baidu.com, a Chinese company, crawls from Beijing-based IPs, here's another IP. Perhaps the presence of baidu.com -- not baidu.jp -- in the URL indicates country of crawl. Or something:

123.125.64.15
Baiduspider+(+http://www.baidu.com/search/spider.htm)

Finis. (Really:)

incrediBILL




msg:3919255
 3:15 am on May 24, 2009 (gmt 0)

If it's HTML only, perhaps they're checking for cloakers and other SE abuse.

Regardless, robots.txt should still be honored.

Pfui




msg:3937820
 6:39 pm on Jun 21, 2009 (gmt 0)

Something new from baidu.com comes...

Note incorrect spelling of "Mozilla" and non-baidu URL in UA. FYI: McAfee has an "Adware, spyware, or viruses" report about lssw365.net [siteadvisor.com], dated 06-16-09.

In one day (partial listing; chronological):

06/18 05:51:07

baiduspider-123-125-66-32.crawl.baidu.com
Mosilla+(+http://www.lssw365.net/)
robots.txt? YES

06/18 08:34:58

baiduspider-123-125-66-17.crawl.baidu.com
Mosilla+(+http://www.lssw365.net/)
robots.txt? YES

06/18 11:13:28

119.63.193.56
Baiduspider+(+http://www.baidu.jp/spider/)
robots.txt? YES

06/18 14:13:36

123.125.64.15
Baiduspider+(+http://www.baidu.com/search/spider.htm)
robots.txt? YES

GaryK




msg:3937857
 8:02 pm on Jun 21, 2009 (gmt 0)

Same here:

06/18/2009 12:54:52 UTC

Mosilla+(+http://www.lssw365.net/)
123.125.66.40
baiduspider-123-125-66-40.crawl.baidu.com
robots.txt? YES

06/17/2009 17:24:15 UTC

Baiduspider+(+http://www.baidu.com/search/spider.htm)
123.125.66.40
baiduspider-123-125-66-40.crawl.baidu.com
robots.txt? YES

BTW, the McAfee page says they tested links from the lssw365.net domain and that when "[w]e visited this site, we found that most of its links are to sites which are safe or have only minor safety/annoyance issues."
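Since the hostnames in these logs embed the IP (baiduspider-123-125-66-40.crawl.baidu.com for 123.125.66.40), a quick offline sanity check is to confirm that a logged hostname and IP agree. The sketch below assumes that naming pattern holds generally, which is only inferred from the handful of entries in this thread, and the function names are made up. It does not replace a real forward-confirmed reverse DNS lookup (for example via `socket.gethostbyaddr` and `socket.gethostbyname`), which needs network access.

```python
import re

# Pattern inferred from the log entries in this thread (an assumption).
HOST_RE = re.compile(
    r"^baiduspider-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.crawl\.baidu\.com$"
)


def embedded_ip(hostname):
    """Return the dotted IP encoded in a crawl.baidu.com hostname, or None."""
    match = HOST_RE.match(hostname.lower())
    return ".".join(match.groups()) if match else None


def hostname_matches_ip(hostname, ip):
    """True only if the hostname fits the pattern and encodes the same IP."""
    return embedded_ip(hostname) == ip


print(embedded_ip("baiduspider-123-125-66-40.crawl.baidu.com"))
print(hostname_matches_ip("baiduspider-123-125-66-32.crawl.baidu.com", "123.125.66.32"))
```

Anchoring the pattern at both ends means a lookalike such as `baiduspider-123-125-66-40.crawl.baidu.com.example.net` is rejected.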

Leosghost




msg:3937891
 9:30 pm on Jun 21, 2009 (gmt 0)

lssw365.net is the official site for the PRC "Green Dam" internet censorship system. "Bill" has a thread running here [webmasterworld.com] at WebmasterWorld. I suggest the two "Bills" exchange sm's :)

dstiles




msg:3937896
 9:47 pm on Jun 21, 2009 (gmt 0)

I'm getting the occasional MSIE 7 UA mentioned above. I'm blocking it on a specific header combination after a bad scan, but it may turn out I can block it on all combinations.

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved