Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
What is Baidu BaiDoing?
incrediBILL




msg:4435208
 12:22 pm on Mar 30, 2012 (gmt 0)

Found some consecutive requests from Baidu crawler IPs with user agents that are odd for that range.

Checking for mobile pages?

123.125.71.98,China,"Mozilla/5.0 (Linux;u;Android 2.3.7;zh-cn;) AppleWebKit/533.1 (KHTML,like Gecko) Version/4.0 Mobile Safari/533.1 (compatible; +http://www.baidu.com/search/spider.html)",/index.html

No clue, FF asking for robots.txt?

180.76.5.93,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
180.76.5.89,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt
180.76.5.87,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt

At first I was thinking screen shots, but robots.txt?
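For anyone who wants to spot this pattern in their own logs, here is a minimal sketch that groups the rows quoted above by user agent and path. It assumes the log is comma-separated as IP, country, quoted user agent, path (the layout shown in the post); the function names and the spider-URL check are illustrative, not anything Baidu documents.

```python
import csv
from collections import defaultdict

# Rows copied from the post above: IP, country, user agent, requested path.
LOG_LINES = [
    '123.125.71.98,China,"Mozilla/5.0 (Linux;u;Android 2.3.7;zh-cn;) AppleWebKit/533.1 (KHTML,like Gecko) Version/4.0 Mobile Safari/533.1 (compatible; +http://www.baidu.com/search/spider.html)",/index.html',
    '180.76.5.93,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt',
    '180.76.5.89,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt',
    '180.76.5.87,China,"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2",/robots.txt',
]


def group_by_agent_and_path(lines):
    """Map (user_agent, path) to the list of IPs that sent that request."""
    groups = defaultdict(list)
    # csv.reader handles the commas inside the quoted user-agent field.
    for ip, _country, agent, path in csv.reader(lines):
        groups[(agent, path)].append(ip)
    return dict(groups)


def claims_baidu(agent):
    """True if the user agent advertises the Baidu spider info URL."""
    return "baidu.com/search/spider.html" in agent


if __name__ == "__main__":
    for (agent, path), ips in group_by_agent_and_path(LOG_LINES).items():
        label = "claims Baidu" if claims_baidu(agent) else "no spider token"
        print(f"{path} <- {len(ips)} IP(s), {label}: {', '.join(ips)}")
```

Several IPs sending an identical browser user agent to the same path, with no spider token, is the oddity described in the post.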

 

cpollett




msg:4435350
 5:57 pm on Mar 30, 2012 (gmt 0)

Is 180.76.5. really Baidu? I blocked that range after it was pinging my site like crazy.
I know 180.76.5.100 shows up on www.ipfraudreporter.com

incrediBILL




msg:4435367
 7:02 pm on Mar 30, 2012 (gmt 0)

Reverse DNS sure says it's Baidu Spider:
180.76.5.93 -> baiduspider-180-76-5-93.crawl.baidu.com.

Whois says it's Baidu:

inetnum: 180.76.0.0 - 180.76.255.255
netname: Baidu
descr: Beijing Baidu Netcom Science and Technology Co., Ltd.
descr: Baidu Plaza, No.10, Shangdi 10th street,Haidian District Beijing,100080
country: CN

Hard to argue with all that!
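The whois part of that check can be reproduced offline. This is a small sketch, using only the inetnum range quoted above, that tests whether an address falls inside the range and shows the equivalent CIDR block; the helper name `in_range` is made up for illustration.

```python
import ipaddress

# inetnum range taken from the whois output quoted in the post.
FIRST = ipaddress.ip_address("180.76.0.0")
LAST = ipaddress.ip_address("180.76.255.255")


def in_range(ip, first=FIRST, last=LAST):
    """Return True if ip lies within the inclusive range [first, last]."""
    return first <= ipaddress.ip_address(ip) <= last


# Collapse the start/end pair into CIDR notation for use in block lists.
NETWORKS = list(ipaddress.summarize_address_range(FIRST, LAST))

if __name__ == "__main__":
    print(NETWORKS)
    for ip in ("180.76.5.93", "180.76.5.100", "123.125.71.98"):
        print(ip, in_range(ip))
```

Note that 123.125.71.98 from the first post is outside this particular inetnum, so it would need its own whois lookup.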

cpollett




msg:4435377
 7:17 pm on Mar 30, 2012 (gmt 0)

Yeah, I guess you're right. I was thinking that Baidu might also run an ISP in China, and that some random person was running a crawler from that ISP.

One other possibility as to why you are seeing a browser look at the robots.txt is that there is actually a real person coming from the IP address of the spider trying to figure out why their spider is misbehaving? I do that sometimes.

keyplyr




msg:4435378
 7:19 pm on Mar 30, 2012 (gmt 0)


I eventually blocked all country ranges for China.

incrediBILL




msg:4435385
 7:31 pm on Mar 30, 2012 (gmt 0)

One other possibility as to why you are seeing a browser look at the robots.txt is that there is actually a real person coming from the IP address of the spider trying to figure out why their spider is misbehaving? I do that sometimes.


Don't see how a real human can be on 3 IPs at basically the same time.

2012-03-29,13:56:57
2012-03-29,13:57:25
2012-03-29,13:57:31

I'm sure it's something automated.
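To put a number on "basically the same time", here is a quick sketch that measures the spread of the three timestamps listed above. The format string is an assumption based on how the timestamps are written in the post.

```python
from datetime import datetime

# Timestamps copied from the post above.
STAMPS = [
    "2012-03-29,13:56:57",
    "2012-03-29,13:57:25",
    "2012-03-29,13:57:31",
]


def span_seconds(stamps, fmt="%Y-%m-%d,%H:%M:%S"):
    """Seconds between the earliest and latest timestamp in stamps."""
    times = sorted(datetime.strptime(s, fmt) for s in stamps)
    return (times[-1] - times[0]).total_seconds()


if __name__ == "__main__":
    print(span_seconds(STAMPS))  # 34.0
```

Three different IPs with the same user agent inside a 34-second window fits the automated explanation better than a single person browsing.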

cpollett




msg:4435397
 8:05 pm on Mar 30, 2012 (gmt 0)

Render farm seemed like a good idea. I don't know why they would hit robots.txt unless the entire crawl was being done with scripted instances of Firefox. Did the FF versions check out?

thetrasher




msg:4435733
 11:14 pm on Mar 31, 2012 (gmt 0)

Reverse DNS sure says it's Baidu Spider:
180.76.5.93 -> baiduspider-180-76-5-93.crawl.baidu.com.

That's step 1.
Step 2 is a forward DNS lookup (hostname -> IP) to confirm that the name resolves back to the same address.
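Both steps together can be sketched as below. The default suffix `.crawl.baidu.com` is an assumption taken from the single hostname quoted in this thread, and the function and parameter names are illustrative. The resolver arguments exist only so the logic can be exercised without live DNS; `gethostbyname_ex` covers IPv4 only.

```python
import socket


def _reverse(ip):
    """Step 1: reverse (PTR) lookup, IP -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    """Step 2: forward lookup, hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(ip, suffix=".crawl.baidu.com",
                        reverse=_reverse, forward=_forward):
    """Return True only if the PTR name has the expected suffix AND
    that name resolves back to the original IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(suffix):
            return False
        # Whoever controls the reverse zone can put any name in a PTR
        # record, so the forward lookup is what makes the name trustworthy.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses;
        # treat any lookup failure as "not verified".
        return False


if __name__ == "__main__":
    print(is_verified_crawler("180.76.5.93"))
```

The result depends on live DNS at the time it runs, so a False today says nothing about what the records looked like in 2012.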

lucy24




msg:4435772
 4:02 am on Apr 1, 2012 (gmt 0)

?
Baidu at 180.76 asks for robots.txt all the time. I just assumed it's a feeble attempt to convince me it's a real, law-abiding robot.

:: detour to raw logs ::

Hm, wonder who told them I've got a file called /fun/panda.html ? They've been 403'd from the whole site since long before the file was created, and all links leading to the page are nofollow.*


* Nope, guess again. It's because almost the entire text of the page is actual search strings-- meaning that it would be utterly useless if it came up in any search. Even worse than indexing a search-results page.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved