Chinese bot shape-shifts on the fly.

3:22 pm on Aug 18, 2011 (gmt 0)

Yikespider is more like it...

I certainly don't like the looks of this one from a.k.a. ChinaNet Beijing. Note the different UAs literally from one second to the next, one version for HEAD and one for GET. And note the unbalanced " in the GET:

- - [1n/Aug/2011:0n:20:31 -0700] "HEAD /dir/file.html HTTP/1.0" 302 0 "-" "jikespider ("Mozilla/5.0)"
- - [1n/Aug/2011:0n:20:32 -0700] "GET /dir/file.html HTTP/1.0" 302 215 "-" "jikespider "Mozilla/5.0"
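A stray quote inside the UA can trip up log parsers that assume every quoted field contains no quotes. A rough sketch of one workaround, assuming the usual combined-log tail (request, status, size, referer, user-agent): match the last field greedily up to the final quote on the line. The names `LOG_TAIL` and `extract_ua` are made up for illustration and are not from any log tool.

```python
import re

# Tail of a combined-format log line: "request" status size "referer" "user-agent".
# The final field is matched greedily up to the LAST quote on the line, so a
# stray quote inside the UA does not cut the field short.
LOG_TAIL = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>.*)"\s*$'
)

def extract_ua(line):
    """Return the user-agent field of a log line, or None if the line does not match."""
    m = LOG_TAIL.search(line)
    return m.group("ua") if m else None
```

This only holds if the user-agent is the last field on the line; a log format with extra trailing fields would need a different pattern.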

robots.txt? NO

Found yet another version elsewhere:

JikeSpider Mozilla/5.0 (compatible; JikeSpider; +http://shoulu.jike.com/spider.html)

Here's some circa June 24, 2011 news about this creepy-crawler [cnngo.com...] --

"Earlier this week, China's state-run company People’s Search announced the re-launch of its search engine Jike.com.

"Formerly named Goso.cn, the search engine was first launched by People’s Search, a joint venture between People’s Daily and People.com, in May 2010. ..."

Considering I have firewall killfile rules against umpteen Chinese CIDRs and .hta blocks against a gazillion others, I'm amazed at the relentless, troublesome, and now state-sanctioned traffic from that part of the world. And I reckon it's only going to get worse...

10:46 pm on Aug 22, 2011 (gmt 0)

Oh, gosospider. I remember them.

- - [...] "HEAD / HTTP/1.0" 403 271 "-" "jikespider (\"Mozilla/5.0)"
- - [...] "GET / HTTP/1.0" 403 2266 "-" "jikespider \"Mozilla/5.0"

Right down to that extra quotation mark (escaped).

The unnerving thing is, I can't for the life of me figure out why they landed a 403. Not that I'm complaining, mind, but I've pored over my htaccess and can only conclude that the server is psychic.

(pfui, we don't have the same host do we? Mine suddenly went haywire on IP addresses too.)

While mopping up, I found these guys, who must be their cousins. Yawn.

- - [...] "GET /robots.txt HTTP/1.0" 200 769 "-" "Mozilla/5.0 ()"

1:57 am on Aug 23, 2011 (gmt 0)

(Lucy: Doubt it, unless yours is in downtown Seattle?)

All: I found this after I wrote the OP. It came in at the same time from the same IP (and was also seen by Lucy):
Mozilla/5.0 ()

FWIW, that UA did ask for robots.txt, w/o success: When a UA is that absurd, the visitor gets a one-way ticket to

(Lucy: That may also be where your 403 came from. Servers can kick uneven parens automatically.)

For those of you keeping score at home, jikespider used THREE distinct, and distinctly screwy UAs in mere seconds:

jikespider ("Mozilla/5.0)
jikespider "Mozilla/5.0
Mozilla/5.0 ()
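For anyone who wants to flag these in a log-scanning script, here is a minimal sketch of the three oddities on display: an odd number of double quotes, parentheses that don't pair up, and an empty "()" comment. The function name `ua_problems` and the labels it returns are invented for this example. Worth noting that the parentheses in all three strings above actually do balance; it's the lone quote (and the empty parens) that give them away.

```python
def ua_problems(ua):
    """Return a list of structural oddities found in a user-agent string."""
    problems = []

    # An odd number of double quotes means one was never closed.
    if ua.count('"') % 2:
        problems.append("unbalanced quote")

    # Walk the string tracking paren depth; a dip below zero or a
    # non-zero final depth means the parens do not pair up.
    depth = 0
    for ch in ua:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break
    if depth != 0:
        problems.append("unbalanced parens")

    # A comment section with nothing in it, as in 'Mozilla/5.0 ()'.
    if "()" in ua:
        problems.append("empty parens")

    return problems


for ua in ('jikespider ("Mozilla/5.0)', 'jikespider "Mozilla/5.0', 'Mozilla/5.0 ()'):
    print(ua, "->", ua_problems(ua))
```

The better-formed variant quoted earlier, `JikeSpider Mozilla/5.0 (compatible; JikeSpider; +http://shoulu.jike.com/spider.html)`, passes all three checks, so a filter like this would only catch the mangled versions.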

(Who programmed this thing, a spam harvester? ;)

7:29 pm on Aug 23, 2011 (gmt 0)

My little friend, back again --

-- still not ID'ing itself in the first hit, still using the same trio o' mangled UAs. Also still HEAD'ing yet another page it's supposedly never seen:

GET /robots.txt
Mozilla/5.0 ()

HEAD /dir/filename.html
jikespider ("Mozilla/5.0)

GET /dir/filename.html
jikespider "Mozilla/5.0