| What is Baidu BaiDoing?
|
incrediBILL

msg:4435208 | 12:22 pm on Mar 30, 2012 (gmt 0) | Found some consecutive requests from Baidu crawler IPs with odd user agents for that range.
Checking for mobile pages?
123.125.71.98,China,"Mozilla/5.0 (Linux;u;Android 2.3.7;zh-cn;) AppleWebKit/533.1 (KHTML,like Gecko) Version/4.0 Mobile Safari/533.1 (compatible; +http://www.baidu.com/search/spider.html)",/index.html
No clue, FF asking for robots.txt?
180.76.5.93,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
180.76.5.89,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
180.76.5.87,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
At first I was thinking screenshots, but robots.txt?
|
cpollett

msg:4435350 | 5:57 pm on Mar 30, 2012 (gmt 0) | Is 180.76.5. really Baidu? I blocked that range after it was pinging my site like crazy. I know 180.76.5.100 shows up on www.ipfraudreporter.com
|
incrediBILL

msg:4435367 | 7:02 pm on Mar 30, 2012 (gmt 0) | Reverse DNS sure says it's Baidu Spider:
180.76.5.93 -> baiduspider-180-76-5-93.crawl.baidu.com.
Whois says it's Baidu:
inetnum: 180.76.0.0 - 180.76.255.255
netname: Baidu
descr: Beijing Baidu Netcom Science and Technology Co., Ltd.
descr: Baidu Plaza, No.10, Shangdi 10th street,Haidian District Beijing,100080
country: CN
Hard to argue with all that!
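For anyone who wants to test log entries against the whois range quoted above, here is a minimal sketch using Python's standard `ipaddress` module. The function name `in_quoted_baidu_range` is made up for illustration, and the range is only the single `inetnum` block from this post; Baidu may well crawl from other blocks too (123.125.71.98 from the first post falls outside it).

```python
import ipaddress

# Bounds copied from the whois "inetnum" line quoted in this post.
RANGE_FIRST = ipaddress.ip_address("180.76.0.0")
RANGE_LAST = ipaddress.ip_address("180.76.255.255")


def in_quoted_baidu_range(ip: str) -> bool:
    """Return True if `ip` is an IPv4 address inside the quoted inetnum block."""
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        # The quoted block is IPv4 only; IPv6 addresses cannot match it.
        return False
    return RANGE_FIRST <= addr <= RANGE_LAST


if __name__ == "__main__":
    for ip in ("180.76.5.93", "180.76.5.89", "180.76.5.87", "123.125.71.98"):
        print(ip, in_quoted_baidu_range(ip))
```

A range match only says who owns the address space, not what kind of client is sitting on the address.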
|
cpollett

msg:4435377 | 7:17 pm on Mar 30, 2012 (gmt 0) | Yeah, I guess you're right. I was thinking that Baidu might also run an ISP in China, and that some random person was running a crawler from that ISP. One other possibility as to why you are seeing a browser look at the robots.txt is that there is actually a real person coming from the IP address of the spider trying to figure out why their spider is misbehaving? I do that sometimes.
|
keyplyr

msg:4435378 | 7:19 pm on Mar 30, 2012 (gmt 0) | I eventually blocked all country ranges for China.
|
incrediBILL

msg:4435385 | 7:31 pm on Mar 30, 2012 (gmt 0) | | One other possibility as to why you are seeing a browser look at the robots.txt is that there is actually a real person coming from the IP address of the spider trying to figure out why their spider is misbehaving? I do that sometimes. |
| Don't see how a real human can be on 3 IPs at basically the same time. 2012-03-29,13:56:57 2012-03-29,13:57:25 2012-03-29,13:57:31 I'm sure it's something automated.
|
cpollett

msg:4435397 | 8:05 pm on Mar 30, 2012 (gmt 0) | Render farm seemed like a good idea. I don't know why they would hit robots.txt unless the crawl itself was entirely being done with scripted instances of Firefox. Did the ff versions check out?
|
thetrasher

msg:4435733 | 11:14 pm on Mar 31, 2012 (gmt 0) | Reverse DNS sure says it's Baidu Spider: 180.76.5.93 -> baiduspider-180-76-5-93.crawl.baidu.com. |
| That's step 1. Step 2 is forward DNS->IP lookup
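The two steps can be sketched in Python with the standard `socket` module. This is a rough illustration, not a vetted bot verifier: the function names are made up, and the `.crawl.baidu.com` suffix is taken from the hostname quoted above rather than from any official Baidu list. The resolvers are passed in as parameters so the logic can be tried without touching DNS.

```python
import socket


def reverse_lookup(ip):
    """Step 1: IP -> hostname (PTR record). Returns None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Step 2: hostname -> list of IPv4 addresses. Returns [] on failure."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_forward_confirmed(ip, suffix=".crawl.baidu.com",
                         reverse=reverse_lookup, forward=forward_lookup):
    """True only if the PTR name ends with `suffix` AND resolves back to `ip`."""
    host = reverse(ip)
    if not host:
        return False
    host = host.rstrip(".").lower()
    if not host.endswith(suffix):
        # Anyone who controls their own reverse zone can claim any name,
        # so the name alone proves nothing without the forward check below.
        return False
    return ip in forward(host)


if __name__ == "__main__":
    # Performs live DNS lookups; the result depends on current DNS records.
    print(is_forward_confirmed("180.76.5.93"))
```

The forward check matters because the owner of an IP block controls its PTR records and can make them say anything, while only Baidu controls what names under baidu.com resolve to.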
|
lucy24

msg:4435772 | 4:02 am on Apr 1, 2012 (gmt 0) | ? Baidu at 180.76 asks for robots.txt all the time. I just assumed it's a feeble attempt to convince me it's a real, law-abiding robot. :: detour to raw logs :: Hm, wonder who told them I've got a file called /fun/panda.html ? They've been 403'd from the whole site since long before the file was created, and all links leading to the page are nofollow.* * Nope, guess again. It's because almost the entire text of the page is actual search strings-- meaning that it would be utterly useless if it came up in any search. Even worse than indexing a search-results page.
|