Sogou web spider

new crawl range

         

keyplyr

11:58 pm on Jul 24, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




UA: Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
Protocol: HTTP/1.1
Robots.txt: Yes
Host: sogou.com
218.30.96.0 - 218.30.127.255 new
218.30.96.0/19

Other ranges this UA has used:
61.135.0.0 - 61.135.255.255
61.135.0.0/16

106.37.0.0 - 106.39.255.255
106.37.0.0/16
106.38.0.0/15

106.120.0.0 - 106.121.255.255
106.120.0.0/15

123.112.0.0 - 123.127.255.255
123.112.0.0/12

220.178.0.0 - 220.181.255.255
220.178.0.0/15
220.180.0.0/15

All Parent: Chinanet or Unicom
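
For anyone who wants to double-check range-to-CIDR conversions like the ones above, here is a minimal Python sketch using the standard library's ipaddress module. The helper name range_to_cidrs is made up for illustration; it is not anything Sogou or this forum provides.

```python
import ipaddress

def range_to_cidrs(first: str, last: str) -> list[str]:
    """Return the smallest list of CIDR blocks covering first..last inclusive.

    Hypothetical helper name, used only for this illustration.
    """
    nets = ipaddress.summarize_address_range(
        ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)
    )
    return [str(n) for n in nets]

print(range_to_cidrs("218.30.96.0", "218.30.127.255"))  # ['218.30.96.0/19']
print(range_to_cidrs("106.37.0.0", "106.39.255.255"))   # ['106.37.0.0/16', '106.38.0.0/15']
```

A range that does not start on a suitable boundary comes back as more than one block, which is why some entries need two CIDR lines.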

TorontoBoy

12:58 am on Jul 25, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



Sogou 搜狗 (translation: Search Dog) is one of China's larger search engines. I find they index non-Chinese sites very poorly. I have allowed them to index my content for over a year, but when I use their search engine to look for it I can find only a fraction of it. This may be due to the influence of the Great Firewall of China and censorship; I don't know.

Sogou often uses ranges in Chinanet Beijing and changes them frequently, so they are difficult to pin down.

lucy24

2:36 am on Jul 25, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's no use, TorontoBoy; this site has chosen to remain strictly Latin-1.

Other ranges this UA has used:
An extremely common pattern on one site is
106.38.blahblah robots.txt, followed immediately by
36.110.blahblah some page

otoh when they use 123.126.blahblah, the robots.txt and page request come from the same IP.

I suppose it's no use wondering why they obey robots.txt on your site while utterly ignoring it on mine. I've even given them a block of their own in case they're one of the rare robots that are too primitive to pick their name out of a list.
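
One way to sanity-check a dedicated robots.txt block is to feed it to Python's standard urllib.robotparser and ask whether the full UA string matches it. This is only a sketch: the rules and paths below are invented for illustration, and it shows how Python's parser reads the file, not how Sogou's crawler actually interprets it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths are made up for this example.
ROBOTS_TXT = """\
User-agent: Sogou web spider
Disallow: /private/

User-agent: *
Disallow:
"""

SOGOU_UA = "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The dedicated block applies to the Sogou UA only.
print(parser.can_fetch(SOGOU_UA, "/private/page.html"))           # False
print(parser.can_fetch(SOGOU_UA, "/public/page.html"))            # True
print(parser.can_fetch("SomeOtherBot/1.0", "/private/page.html")) # True
```

If a well-behaved parser picks the block up correctly and the robot still fetches the disallowed paths, the problem is on the robot's side rather than in the file.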

Is there a difference between the 220.181 Sogou and all the others? It seems to be the only one where robots.txt is almost never followed by an immediate page request.

keyplyr

2:40 am on Jul 25, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I suppose it's no use wondering why they obey robots.txt on your site while utterly ignoring it on mine.
Don't know who you're speaking to, but if it's me, I never said they *obey* robots.txt; I only documented that the agent *requests* robots.txt.

TorontoBoy

3:14 am on Jul 25, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



Here's my pattern from July 22 (24 hrs), in chronological order, for UA Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07), sorted by UA then date/time:
218.30.103.5 GET /robots.txt HTTP/1.1
106.38.241.159 GET /robots.txt HTTP/1.1
106.38.241.155 GET /robots.txt HTTP/1.1
220.181.125.146 GET /robots.txt HTTP/1.1
220.181.125.146 major scrape (48 GETs)
218.30.103.145 GET /robots.txt HTTP/1.1
106.38.241.155 GET /robots.txt HTTP/1.1
106.38.241.155 GET /robots.txt HTTP/1.1
220.181.125.146 GET /robots.txt HTTP/1.1
220.181.125.146 major scrape (45 GETs)
106.38.241.159 GET /robots.txt HTTP/1.1
123.126.113.154 GET /robots.txt HTTP/1.1
106.38.241.155 GET /robots.txt HTTP/1.1
106.38.241.159 GET /robots.txt HTTP/1.1
220.181.125.146 major scrape (43 GETs)
123.126.68.114 GET /robots.txt HTTP/1.1
220.181.125.146 GET /robots.txt HTTP/1.1
220.181.125.146 GET /robots.txt HTTP/1.1
123.126.68.114 major scrape (16 GETs)

They used 220.181.125.146 three times and 123.126.68.114 once to scrape me. Both IPs read my robots.txt, but 220.181.125.146 does not adhere to it.

218.30.103.5 (2 times) and 106.38.241.159 (7 times) both read my robots.txt but never ask for anything else.
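
This kind of tally can be roughed out in a few lines of Python. It is a sketch under assumptions: it takes already-parsed (ip, path) pairs rather than a real access log, the function name is invented, and the page paths in the sample are placeholders, not actual URLs from the log.

```python
from collections import defaultdict

def robots_only_ips(requests):
    """Return IPs that fetched /robots.txt but never requested anything else.

    `requests` is an iterable of (ip, path) pairs; the name and shape are
    assumptions for this sketch, not a real log format.
    """
    paths_by_ip = defaultdict(set)
    for ip, path in requests:
        paths_by_ip[ip].add(path)
    return sorted(
        ip for ip, paths in paths_by_ip.items() if paths == {"/robots.txt"}
    )

sample = [
    ("218.30.103.5", "/robots.txt"),
    ("106.38.241.159", "/robots.txt"),
    ("220.181.125.146", "/robots.txt"),
    ("220.181.125.146", "/page-a.html"),   # placeholder path
    ("123.126.68.114", "/robots.txt"),
    ("123.126.68.114", "/page-b.html"),    # placeholder path
]

print(robots_only_ips(sample))  # ['106.38.241.159', '218.30.103.5']
```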

It is like some dysfunctional octopus whose many arms are not too coordinated. The bot is a bit ditzy, which I find endearing though puzzling. I have Chinese content and want it indexed by Chinese search engines, so I allow Sogou, Baidu, Yisou, Tencent 360 to ravage me.

keyplyr

3:23 am on Jul 25, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So basically the same ranges I posted.

Different server nodes will have different IP addresses. These are all Chinanet servers. This is not abnormal.
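
A quick way to confirm that is to test each observed address against the posted start-end ranges. The sketch below uses Python's ipaddress module; the function name in_posted_ranges is made up here, the range list is copied from the first post in this thread, and 36.110.0.1 is just an arbitrary stand-in for the 36.110 block mentioned earlier.

```python
import ipaddress

# Start-end ranges as posted earlier in the thread.
POSTED_RANGES = [
    ("218.30.96.0", "218.30.127.255"),
    ("61.135.0.0", "61.135.255.255"),
    ("106.37.0.0", "106.39.255.255"),
    ("106.120.0.0", "106.121.255.255"),
    ("123.112.0.0", "123.127.255.255"),
    ("220.178.0.0", "220.181.255.255"),
]

def in_posted_ranges(ip: str) -> bool:
    """True if ip falls inside any posted range (hypothetical helper)."""
    addr = ipaddress.IPv4Address(ip)
    return any(
        ipaddress.IPv4Address(lo) <= addr <= ipaddress.IPv4Address(hi)
        for lo, hi in POSTED_RANGES
    )

for ip in ("218.30.103.5", "106.38.241.159", "220.181.125.146",
           "123.126.68.114", "36.110.0.1"):
    print(ip, in_posted_ranges(ip))
```

The four addresses from the July 22 log all come back True; the 36.110 stand-in comes back False, so that block is not covered by the list.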

I allow... to ravage me.
Be careful what you wish for.

TorontoBoy

12:02 pm on Jul 31, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



Sogou prepares for future US IPO [scmp.com].

keyplyr

6:58 pm on Jul 31, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How much are you gonna buy?

TorontoBoy

8:41 pm on Jul 31, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



I know Chinese and have used Sogou as a search engine. It is far inferior to Baidu; the gap is wider than the one between Baidu and Google search. I shall buy 50 cents, or wumao.