
Search Engine Spider and User Agent Identification Forum

this time it's Sogou
61.135 returns

lucy24 (WebmasterWorld Senior Member)

Msg#: 4629112 posted 11:32 pm on Dec 9, 2013 (gmt 0)

October 2010 [webmasterworld.com] (Soso)
August 2011 [webmasterworld.com] (Yodao)
July 2012 [webmasterworld.com] (Baidu)
September 2012 [webmasterworld.com] (thread next door in Apache)

The current incarnation looks like this (spacing as shown):*

 - - [23/Sep/2013:09:50:10 -0700] "GET /robots.txt HTTP/1.1" 200 1014 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
 - - [23/Sep/2013:09:50:11 -0700] "GET /ebooks/paston/paston6b.html HTTP/1.1" 403 2963 "-" "New-Sogou-Spider/1.0 (compatible; MSIE 5.5; Windows 98)"
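
For anyone who wants to pull entries like these apart programmatically, here is a minimal Python sketch, assuming the logs are in Apache combined format. The address in the sample line (192.0.2.1, a documentation address) is a placeholder rather than the real one, and `parse_line` is just an illustrative name.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one combined-format line, or None if it does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

# 192.0.2.1 is a placeholder address, not the one from the logs above.
sample = ('192.0.2.1 - - [23/Sep/2013:09:50:11 -0700] '
          '"GET /ebooks/paston/paston6b.html HTTP/1.1" 403 2963 "-" '
          '"New-Sogou-Spider/1.0 (compatible; MSIE 5.5; Windows 98)"')
entry = parse_line(sample)
print(entry["status"], entry["agent"])
```

Each field comes back as a string, so compare the status against "403" rather than the integer 403 unless you convert it first.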

What is it with Chinese robots anyway? They always seem to put on UA strings that would get them blocked even from a previously unknown IP.
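
To make "would get them blocked" concrete, here is a rough Python sketch of a UA check that ignores the IP entirely. The token list is my own assumption about what counts as implausibly old (MSIE 6 and earlier, Windows 95/98), and `looks_stale` is a hypothetical helper name; adjust both to your own rules.

```python
import re

# Assumed list of tokens that no current browser would send.
# "MSIE [1-6]\." matches MSIE 1.x through 6.x but not MSIE 10 or 11.
STALE_TOKENS = re.compile(r'MSIE [1-6]\.|Windows (?:95|98)\b')

def looks_stale(user_agent):
    """True if the UA string claims a browser or OS from the assumed stale list."""
    return bool(STALE_TOKENS.search(user_agent))

for ua in (
    "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
    "New-Sogou-Spider/1.0 (compatible; MSIE 5.5; Windows 98)",
):
    print(looks_stale(ua), ua)
```

The first UA passes and the second is flagged, which matches the 200 and 403 in the log lines quoted earlier, though the actual rule on my server may differ from this sketch.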

Personal hunch: the idea is to lull servers into complacency by first asking for robots.txt. It isn't very determined though; it goes away after one or two 403s. (If anyone has been asleep for the last five years, the uber-range is If only Ukrainian robots lived in such nice fat /10 blocks!)

Cursory log search tells me they also show up at** with the same behavior pattern except that they don't change UAs after getting robots.txt. I don't know if either one is legit; free lookup is uninformative on both.
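
A less cursory version of that log search could look something like the Python sketch below. It assumes combined-format logs and lists every host that asked for /robots.txt under one UA and then requested anything else under a UA it never used for robots.txt. The function name `ua_switchers` and both sample addresses (192.0.2.1 and 198.51.100.7, documentation ranges) are placeholders, not the real ones.

```python
import re
from collections import defaultdict

# Only the fields needed here: host, request path, status, user-agent.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def ua_switchers(lines):
    """Hosts that fetched /robots.txt under one UA and another path under a different UA."""
    robots_agents = defaultdict(set)
    other_agents = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in combined format
        bucket = robots_agents if m["path"] == "/robots.txt" else other_agents
        bucket[m["host"]].add(m["agent"])
    return sorted(
        host for host, agents in robots_agents.items()
        if other_agents.get(host, set()) - agents
    )

SOGOU = "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
FAKE = "New-Sogou-Spider/1.0 (compatible; MSIE 5.5; Windows 98)"
# Placeholder hosts: the first switches UA after robots.txt, the second does not.
sample_lines = [
    f'192.0.2.1 - - [23/Sep/2013:09:50:10 -0700] "GET /robots.txt HTTP/1.1" 200 1014 "-" "{SOGOU}"',
    f'192.0.2.1 - - [23/Sep/2013:09:50:11 -0700] "GET /ebooks/paston/paston6b.html HTTP/1.1" 403 2963 "-" "{FAKE}"',
    f'198.51.100.7 - - [23/Sep/2013:10:00:00 -0700] "GET /robots.txt HTTP/1.1" 200 1014 "-" "{SOGOU}"',
    f'198.51.100.7 - - [23/Sep/2013:10:00:01 -0700] "GET /index.html HTTP/1.1" 200 512 "-" "{SOGOU}"',
]
print(ua_switchers(sample_lines))
```

Note that this ignores request order and timing, so it would also catch a host that switched UAs hours apart; narrowing it to "within a second or two of robots.txt" would need the timestamp parsed as well.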

* The referenced page is in Chinese except for the recurring phrases "sogou spider" and "robots.txt". Rumor has it they're compliant, but who gives a ###.
** Not to be confused with, which sometimes claims to be Baidu.

