Sohu search robot misbehaving

Surprising considering their reach

         

jdMorgan

11:02 pm on Jul 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Saw this in my logs recently, but not for the first time:

220.181.26.73 - - [29/Jul/2005:16:52:42 -0400] "GET / HTTP/1.1" 403 670 "-" "sohu-search"
220.181.26.73 - - [29/Jul/2005:16:52:42 -0400] "GET //robots.txt HTTP/1.1" 200 16078 "-" "sohu-search"

Here sohu (I presume it's a valid sohu IP) attempts to fetch my index page without asking, and then after being rebuffed, comes back with a malformed request for robots.txt! Two bugs in one session!
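For anyone who wants to scan their own logs for the same two problems, here is a rough Python sketch. It assumes the Apache Combined Log Format shown above and plain path requests (not full proxy-style URLs); the names `parse_line` and `audit` are made up for illustration. It also trusts the order of lines in the log, which may not always match the order in which requests arrived.

```python
import re

# Combined Log Format:
# host ident user [time] "request" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def audit(lines):
    """Return (ip, agent, problem) tuples for the two behaviours described above."""
    seen_robots = set()   # clients that have already asked for robots.txt
    problems = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        client = (entry["ip"], entry["agent"])
        path = entry["path"]
        # Assumption: a doubled slash anywhere in the path counts as malformed.
        if "//" in path:
            problems.append(client + ("malformed path: " + path,))
        # Treat //robots.txt as a (sloppy) robots.txt fetch.
        if re.sub(r"/{2,}", "/", path) == "/robots.txt":
            seen_robots.add(client)
        elif client not in seen_robots:
            problems.append(client + ("page requested before robots.txt: " + path,))
    return problems
```

Feeding it the two lines above reports one "page requested before robots.txt" entry for `/` and one "malformed path" entry for `//robots.txt`.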

I would expect better from one of the top search engines in their market.

Jim

volatilegx

2:57 am on Jul 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



220.181.26.73 - - [29/Jul/2005:16:52:42 -0400] "GET //robots.txt HTTP/1.1" 200 16078 "-" "sohu-search"

At least they got the robots.txt, even if the GET was malformed.

jdMorgan

3:21 am on Jul 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah, I take care of malformed-but-fixable requests. But sohu's still blocked until they fetch robots.txt *before* requesting pages. After that, I'll decide if I still want to Disallow them in robots.txt.
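To illustrate what "malformed-but-fixable" can mean for a request like `//robots.txt`, here is a minimal Python sketch that collapses repeated slashes in the path. This is only an illustration of the idea, not the actual server configuration used on the site; the function name `normalize_path` is hypothetical.

```python
import re

def normalize_path(path):
    """Collapse runs of slashes in the path part, leaving any query string untouched."""
    path_part, sep, query = path.partition("?")
    return re.sub(r"/{2,}", "/", path_part) + sep + query
```

With this, `//robots.txt` is treated the same as `/robots.txt`, while slashes inside a query string are left alone.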

This behaviour is like being forced off a property by the security personnel, and then going back to read the prominent "No Trespassing" sign on the gate.

Really, my main point is that even 'big' search companies often run 'mis-implemented' robots.

Jim

GaryK

3:55 am on Jul 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've always assumed that when log entries show the exact same time, down to the second, it's possible robots.txt was read before any pages were requested. This odd example notwithstanding, is that an incorrect assumption on my part? Or are log entries always sequential even if the time is the same?

jdMorgan

4:06 am on Jul 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, that'd be a pretty fast machine to fetch *and* parse my robots.txt *and* then return in the same second... Besides which, they are Disallowed, and have been Disallowed forever, because I use the "Disallow by default" construct in my robots.txt file on this site. Known (and useful) 'bots have specific policy records, while unknown or unwelcome 'bots are disallowed at the end with:

User-agent: *
Disallow: /
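
A quick way to check how a "Disallow by default" file treats a given robot is Python's standard `urllib.robotparser`. The sketch below is only an illustration: the `ExampleBot` record and its `/private/` rule are invented stand-ins for the real per-robot policy records, which are not shown in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one named record for a welcome robot,
# followed by the catch-all record that disallows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""

def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits the given user-agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Here `allowed("ExampleBot", "/")` is true because the named record applies, while `allowed("sohu-search", "/")` is false because only the catch-all record matches.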

You are correct about the logging order ambiguity: The log entry seems to be created when the server finishes processing a request, which can often lead to apparently-reversed entries such as images, CSS, or external JS files seeming to be fetched *before* the page that includes them.
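
As a rough illustration of that ambiguity, the Python sketch below only trusts the relative order of two entries when their timestamps differ. It assumes the timestamp format seen in the log lines earlier in this thread and an English month abbreviation; the function names are hypothetical.

```python
from datetime import datetime

def parse_clf_time(stamp):
    """Parse a log timestamp such as '29/Jul/2005:16:52:42 -0400'."""
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")

def order_is_reliable(stamp_a, stamp_b):
    """With one-second resolution, equal timestamps say nothing about order."""
    return parse_clf_time(stamp_a) != parse_clf_time(stamp_b)
```

For the two sohu-search entries, both stamped 16:52:42, this returns false, so the timestamps alone cannot show which request came first.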

Jim

wilderness

4:49 am on Jul 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But sohu's still blocked until they fetch robots.txt

Much easier Jim, to deny the entire 220. :)

I recall you having the Oceanic ranges denied, as I provided those ranges to you.
Have you left the remaining far-east ranges at APNIC open?

Don

jdMorgan

5:34 am on Jul 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OT, but that was a different site that seemed "more interesting" to 202.. and other AP IPs. AP IPs don't generate much traffic to the site I'm reporting here, and I just run the 'bots scripts and block known-bad UAs and unknown/unwelcome spiders on it, which covers most exploits this site sees from AP IPs (and keeps my access-control file smaller).

The AP IP list you provided is still in use on the other site though... I'm not sure why they love it, but apparently that site's IP address used to belong to a site that they were very interested in...

Jim

volatilegx

3:10 am on Jul 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just a note: the whois info for this IP doesn't match other IPs I have listed for sohu.

jdMorgan

3:48 pm on Jul 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interesting... Do you see the same behaviour (fetch order or malformed robots.txt request)?

Do your IPs show as specifically registered to sohu, or to an ISP?

I suppose this could be someone spoofing their UA.

Jim

volatilegx

5:18 pm on Jul 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The other IPs (61.135.130.*** and 61.135.131.***) are assigned to:

netname: CNCGROUP-BJ
descr: CNCGROUP Beijing province network
descr: China Network Communications Group Corporation
descr: No.156,Fu-Xing-Men-Nei Street,
descr: Beijing 100031
country: CN

220.181.26.73 is assigned to:

netname: CHINANET-IDC-BJ
country: CN
descr: CHINANET Beijing province network
descr: China Telecom
descr: No.31,jingrong street
descr: Beijing 100032

I haven't seen malformed requests for robots.txt that I recall.
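
For comparing a visiting IP against ranges already on file, something like the following Python sketch could be used. It treats 61.135.130.*** and 61.135.131.*** as full /24 networks, which is an assumption on my part, and the names `KNOWN_SOHU_NETS` and `in_known_ranges` are made up. A mismatch only says the IP is outside the listed ranges, not who actually operates it.

```python
import ipaddress

# Assumption: the two ranges listed above are complete /24 networks.
KNOWN_SOHU_NETS = [
    ipaddress.ip_network("61.135.130.0/24"),
    ipaddress.ip_network("61.135.131.0/24"),
]

def in_known_ranges(ip, networks=KNOWN_SOHU_NETS):
    """Return True if ip falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

`in_known_ranges("220.181.26.73")` is false, consistent with the whois mismatch noted above.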