Forum Moderators: Ocean10000 & incrediBILL

What is Baidu BaiDoing?

     
12:22 pm on Mar 30, 2012 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


Found some consecutive requests from Baidu crawler IPs, with user agents that are odd for that range.

Checking for mobile pages?

123.125.71.98,China,"Mozilla/5.0 (Linux;u;Android 2.3.7;zh-cn;) AppleWebKit/533.1 (KHTML,like Gecko) Version/4.0 Mobile Safari/533.1 (compatible; +http://www.baidu.com/search/spider.html)",/index.html

No clue, FF asking for robots.txt?

180.76.5.93,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
180.76.5.89,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
180.76.5.87,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt

At first I was thinking screen shots, but robots.txt?
5:57 pm on Mar 30, 2012 (gmt 0)

New User

joined:Jan 22, 2012
posts: 30
votes: 1


Is 180.76.5. really Baidu? I blocked that range after it was pinging my site like crazy.
I know 180.76.5.100 shows up on www.ipfraudreporter.com
7:02 pm on Mar 30, 2012 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


Reverse DNS sure says it's Baidu Spider:
180.76.5.93 -> baiduspider-180-76-5-93.crawl.baidu.com.

Whois says it's Baidu:

inetnum: 180.76.0.0 - 180.76.255.255
netname: Baidu
descr: Beijing Baidu Netcom Science and Technology Co., Ltd.
descr: Baidu Plaza, No.10, Shangdi 10th street,Haidian District Beijing,100080
country: CN

Hard to argue with all that!
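
For anyone who wants to script the whois half of that check, here is a minimal sketch. It assumes only the single inetnum record quoted above (Baidu announces other ranges too, which this does not cover), and the helper name `in_whois_range` is made up for illustration.

```python
import ipaddress

# Range copied from the whois output quoted above (one inetnum record only).
WHOIS_FIRST = ipaddress.ip_address("180.76.0.0")
WHOIS_LAST = ipaddress.ip_address("180.76.255.255")

# Collapse the first/last pair into CIDR blocks; this range is one /16.
WHOIS_NETS = list(ipaddress.summarize_address_range(WHOIS_FIRST, WHOIS_LAST))


def in_whois_range(ip):
    """Return True if ip falls inside the whois range quoted in the thread."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WHOIS_NETS)


if __name__ == "__main__":
    for ip in ("180.76.5.93", "180.76.5.89", "180.76.5.87"):
        print(ip, in_whois_range(ip))
```

A range match only says who owns the address block. It says nothing about what is running on the address.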
7:17 pm on Mar 30, 2012 (gmt 0)

New User

joined:Jan 22, 2012
posts: 30
votes: 1


Yeah, I guess you're right. I was thinking that Baidu might also run an ISP in China, and that some random person was running a crawler from that ISP.

One other possibility as to why you are seeing a browser look at robots.txt: there may be an actual person coming from the spider's IP address, trying to figure out why their spider is misbehaving. I do that sometimes.
7:19 pm on Mar 30, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5811
votes: 64



I eventually blocked all country ranges for China.
7:31 pm on Mar 30, 2012 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


One other possibility as to why you are seeing a browser look at robots.txt: there may be an actual person coming from the spider's IP address, trying to figure out why their spider is misbehaving. I do that sometimes.


Don't see how a real human can be on 3 IPs at basically the same time.

2012-03-29,13:56:57
2012-03-29,13:57:25
2012-03-29,13:57:31

I'm sure it's something automated.
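
A quick way to put a number on "basically the same time", as a rough sketch: it uses only the three timestamps listed above, and the function name `span_seconds` is just an illustration.

```python
from datetime import datetime

# The three request times listed above, in the log's own format.
STAMPS = [
    "2012-03-29,13:56:57",
    "2012-03-29,13:57:25",
    "2012-03-29,13:57:31",
]


def span_seconds(stamps, fmt="%Y-%m-%d,%H:%M:%S"):
    """Seconds between the earliest and latest timestamp in stamps."""
    times = sorted(datetime.strptime(s, fmt) for s in stamps)
    return (times[-1] - times[0]).total_seconds()


if __name__ == "__main__":
    print(span_seconds(STAMPS))
```

All three requests fall within 34 seconds of each other.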
8:05 pm on Mar 30, 2012 (gmt 0)

New User

joined:Jan 22, 2012
posts: 30
votes: 1


Render farm seemed like a good idea. I don't know why they would hit robots.txt unless the crawl itself was being done entirely with scripted instances of Firefox. Did the FF versions check out?
11:14 pm on Mar 31, 2012 (gmt 0)

Junior Member

10+ Year Member

joined:June 25, 2005
posts:179
votes: 1


Reverse DNS sure says it's Baidu Spider:
180.76.5.93 -> baiduspider-180-76-5-93.crawl.baidu.com.

That's step 1.
Step 2 is the forward lookup (DNS -> IP): resolve that hostname and check that it comes back to the same IP.
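
The two steps can be sketched roughly like this in Python. Assumptions: the `.crawl.baidu.com` suffix is taken from the reverse DNS result quoted above, not from any official list, and the function names are invented for the example. The lookups are passed in as parameters so the logic can be tried without touching the network.

```python
import socket


def reverse_lookup(ip):
    """Step 1: PTR lookup, IP -> hostname (None if it fails)."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Step 2: forward lookup, hostname -> list of IPv4 addresses."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_crawler(ip, suffix=".crawl.baidu.com",
                        reverse=reverse_lookup, forward=forward_lookup):
    """True only if the PTR name has the expected suffix AND resolves back to ip."""
    host = reverse(ip)
    if not host:
        return False
    host = host.rstrip(".").lower()  # PTR results may carry a trailing dot
    if not host.endswith(suffix):
        return False
    return ip in forward(host)


if __name__ == "__main__":
    # Live DNS; the result depends on what the resolver returns today.
    print(is_verified_crawler("180.76.5.93"))
```

The forward step matters because whoever controls the reverse zone for an IP can make the PTR record say anything. Only the owner of the named domain can make the hostname resolve back to that IP.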
4:02 am on Apr 1, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12714
votes: 244


?
Baidu at 180.76 asks for robots.txt all the time. I just assumed it's a feeble attempt to convince me it's a real, law-abiding robot.

:: detour to raw logs ::

Hm, wonder who told them I've got a file called /fun/panda.html ? They've been 403'd from the whole site since long before the file was created, and all links leading to the page are nofollow.*


* Nope, guess again. It's because almost the entire text of the page is actual search strings-- meaning that it would be utterly useless if it came up in any search. Even worse than indexing a search-results page.