
Forum Moderators: Ocean10000 & incrediBILL


What is Baidu BaiDoing?

     

incrediBILL

12:22 pm on Mar 30, 2012 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Found some consecutive requests from Baidu crawler IP ranges with odd user agents.

Checking for mobile pages?

123.125.71.98,China,"Mozilla/5.0 (Linux;u;Android 2.3.7;zh-cn;) AppleWebKit/533.1 (KHTML,like Gecko) Version/4.0 Mobile Safari/533.1 (compatible; +http://www.baidu.com/search/spider.html)",/index.html

No clue, FF asking for robots.txt?

180.76.5.93,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
180.76.5.89,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
180.76.5.87,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt

At first I was thinking screen shots, but robots.txt?
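For anyone who wants to pull the same fields out of their own logs, here is a minimal sketch. It assumes the log is comma-separated in the order shown above (IP, country, quoted user agent, path); the variable names are made up for illustration, and your own log format may differ.

```python
import csv

# Sample lines copied from the post above; the user agent is quoted
# because it contains commas.
LOG = '''
123.125.71.98,China,"Mozilla/5.0 (Linux;u;Android 2.3.7;zh-cn;) AppleWebKit/533.1 (KHTML,like Gecko) Version/4.0 Mobile Safari/533.1 (compatible; +http://www.baidu.com/search/spider.html)",/index.html
180.76.5.93,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
180.76.5.89,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
180.76.5.87,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
'''

# csv.reader honours the quotes, so each row has exactly four fields.
rows = list(csv.reader(LOG.strip().splitlines()))

# Requests for robots.txt whose user agent does not mention the spider page.
odd_robots_hits = [
    (ip, agent)
    for ip, _country, agent, path in rows
    if path == "/robots.txt" and "baidu.com/search/spider" not in agent
]

for ip, agent in odd_robots_hits:
    print(ip, agent)
```

The "does not mention the spider page" test is only a rough heuristic picked for this example, not a rule Baidu publishes.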

cpollett

5:57 pm on Mar 30, 2012 (gmt 0)



Is 180.76.5. really Baidu? I blocked that range after it was pinging my site like crazy.
I know 180.76.5.100 shows up on
www.ipfraudreporter.com

incrediBILL

7:02 pm on Mar 30, 2012 (gmt 0)




Reverse DNS sure says it's Baidu Spider:
180.76.5.93 -> baiduspider-180-76-5-93.crawl.baidu.com.

Whois says it's Baidu:

inetnum: 180.76.0.0 - 180.76.255.255
netname: Baidu
descr: Beijing Baidu Netcom Science and Technology Co., Ltd.
descr: Baidu Plaza, No.10, Shangdi 10th street,Haidian District Beijing,100080
country: CN

Hard to argue with all that!
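The whois block above covers 180.76.0.0 through 180.76.255.255, which is the same as 180.76.0.0/16. A quick sketch of checking an address against that range with Python's standard `ipaddress` module (the constant and function names are just illustrative):

```python
import ipaddress

# 180.76.0.0 - 180.76.255.255, as listed in the whois record above.
BAIDU_RANGE = ipaddress.ip_network("180.76.0.0/16")

def in_baidu_range(ip: str) -> bool:
    """Return True if ip falls inside the whois range quoted above."""
    return ipaddress.ip_address(ip) in BAIDU_RANGE

for ip in ("180.76.5.93", "180.76.5.100", "123.125.71.98"):
    print(ip, in_baidu_range(ip))
```

This only says the address sits in that one registered block. It says nothing about other ranges Baidu may use, such as the 123.125.71.98 hit in the first post.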

cpollett

7:17 pm on Mar 30, 2012 (gmt 0)



Yeah, I guess you're right. I was thinking that Baidu might also run an ISP in China, and that some random person was running a crawler from that ISP.

One other possibility as to why you are seeing a browser look at the robots.txt is that there is actually a real person coming from the IP address of the spider trying to figure out why their spider is misbehaving? I do that sometimes.

keyplyr

7:19 pm on Mar 30, 2012 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




I eventually blocked all country ranges for China.

incrediBILL

7:31 pm on Mar 30, 2012 (gmt 0)




One other possibility as to why you are seeing a browser look at the robots.txt is that there is actually a real person coming from the IP address of the spider trying to figure out why their spider is misbehaving? I do that sometimes.


Don't see how a real human can be on 3 IPs at basically the same time.

2012-03-29,13:56:57
2012-03-29,13:57:25
2012-03-29,13:57:31

I'm sure it's something automated.
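To put numbers on "basically the same time", a small sketch that computes the gaps between the three timestamps above. It assumes they are in the `date,time` format shown and all in the same time zone.

```python
from datetime import datetime

# The three request times quoted above.
stamps = [
    "2012-03-29,13:56:57",
    "2012-03-29,13:57:25",
    "2012-03-29,13:57:31",
]

times = [datetime.strptime(s, "%Y-%m-%d,%H:%M:%S") for s in stamps]

# Seconds between each request and the next one.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

# Seconds from the first request to the last.
span = (times[-1] - times[0]).total_seconds()

print(gaps, span)
```

That works out to gaps of 28 and 6 seconds, so all three requests landed within 34 seconds.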

cpollett

8:05 pm on Mar 30, 2012 (gmt 0)



Render farm seemed like a good idea. I don't know why they would hit robots.txt unless the crawl itself was entirely being done with scripted instances of Firefox. Did the ff versions check out?

thetrasher

11:14 pm on Mar 31, 2012 (gmt 0)

10+ Year Member



Reverse DNS sure says it's Baidu Spider:
180.76.5.93 -> baiduspider-180-76-5-93.crawl.baidu.com.

That's step 1.
Step 2 is a forward DNS lookup (hostname -> IP) to confirm the name resolves back to the same address.
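The two steps together can be sketched roughly like this. The `.crawl.baidu.com` suffix is taken from the hostname quoted above; treating it as the only valid suffix is an assumption, and the function name and the `reverse`/`forward` parameters are hypothetical hooks added so the check can be tried without touching the network.

```python
import socket

def is_verified_crawler(ip, allowed_suffixes=(".crawl.baidu.com",),
                        reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a single IPv4 address."""
    # Default resolvers use the standard library; both can be swapped out.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    # Step 1: reverse lookup (IP -> hostname).
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not host.endswith(tuple(allowed_suffixes)):
        return False

    # Step 2: forward lookup (hostname -> IPs) must include the original IP.
    try:
        return ip in forward(host)
    except OSError:
        return False

if __name__ == "__main__":
    # Live lookup; the result depends on current DNS records.
    print(is_verified_crawler("180.76.5.93"))
```

Step 2 matters because whoever controls the reverse zone for an IP can make step 1 return any name they like, but they cannot make baidu.com's own forward zone point back at their address.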

lucy24

4:02 am on Apr 1, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



?
Baidu at 180.76 asks for robots.txt all the time. I just assumed it's a feeble attempt to convince me it's a real, law-abiding robot.

:: detour to raw logs ::

Hm, wonder who told them I've got a file called /fun/panda.html ? They've been 403'd from the whole site since long before the file was created, and all links leading to the page are nofollow.*


* Nope, guess again. It's because almost the entire text of the page is actual search strings-- meaning that it would be utterly useless if it came up in any search. Even worse than indexing a search-results page.
 
