Naughty Yahoo User Agents


Pfui - 9:04 am on Jun 12, 2006 (gmt 0)


My logs show that it's been Slurping robots.txt as China for a while, and for months typically in 'pairs' with assorted .html files -- all of which were both generically and specifically Disallowed to it in robots.txt (regular Slurp has robots.txt-specified access).
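For anyone wanting to sanity-check a "generic plus specific" Disallow setup like the one described above, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the "Slurp China" and "Slurp" tokens, and the /private/ path are all assumptions for illustration; the actual robots.txt from this post is not shown anywhere in the thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a specific record for the China crawler, a more
# permissive record for regular Slurp, and a generic catch-all record.
ROBOTS_TXT = """\
User-agent: Slurp China
Disallow: /

User-agent: Slurp
Disallow: /private/

User-agent: *
Disallow: /
"""

def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# Pass the short robots.txt token, not the full User-Agent header.
china_allowed = parser.can_fetch("Slurp China", "/file.html")
slurp_allowed = parser.can_fetch("Slurp", "/file.html")
other_allowed = parser.can_fetch("SomeOtherBot", "/file.html")
```

Python's parser matches agent tokens by substring and uses the first matching record in file order, which is why the more specific record is listed first here. This only shows how one parser reads the file; it says nothing about how any real crawler interprets it.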

But at first, you're right, Jim, it didn't ask for robots.txt under its own ID; it always showed up within seconds of regular Slurp asking for robots.txt. That was when I first saw Slurp China, back in November, 2005:

lj9118.inktomisearch.com - - [17/Nov/2005:02:23:03 -0800] "GET /robots.txt HTTP/1.0" 302 213 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
lj9083.inktomisearch.com - - [17/Nov/2005:02:23:05 -0800] "GET /file.html HTTP/1.0" 302 213 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]

Within a few weeks, it started asking for just robots.txt, using its own ID:

lj9119.inktomisearch.com - - [01/Dec/2005:08:41:09 -0800] "GET /robots.txt HTTP/1.0" 200 3990 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]

And thereafter, robots.txt plus a single (and robots.txt-Disallowed) file, akin to your excerpt:

lj9119.inktomisearch.com - - [14/Feb/2006:23:42:30 -0800] "GET /robots.txt HTTP/1.0" 200 6401 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
lj9062.inktomisearch.com - - [14/Feb/2006:23:42:37 -0800] "GET /dir/file.html HTTP/1.0" 200 29109 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
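A rough sketch of how such robots.txt-plus-file 'pairs' could be pulled out of an access log. It assumes combined log format lines like the excerpts above, with user-agent strings that may be truncated as they are here. The function names, the "Slurp China" token, and the 10-second window are my own choices for illustration, not anything from Yahoo or from the poster's setup.

```python
import re
from datetime import datetime, timedelta

# Tolerant pattern: the user agent is "everything after the last opening
# quote", since the excerpts above are cut off before the closing quote.
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>.*)$'
)

def parse_line(line):
    """Return a dict for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    rec = match.groupdict()
    # %b expects English month abbreviations (the default C locale).
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["agent"] = rec["agent"].rstrip('"')
    return rec

def find_pairs(lines, agent_token="Slurp China", window_seconds=10):
    """Pair each robots.txt request with the next file request by the same
    agent token that arrives within the time window."""
    window = timedelta(seconds=window_seconds)
    last_robots = None
    pairs = []
    for line in lines:
        rec = parse_line(line)
        if rec is None or agent_token not in rec["agent"]:
            continue
        if rec["path"] == "/robots.txt":
            last_robots = rec
        elif last_robots is not None:
            delta = rec["time"] - last_robots["time"]
            if timedelta(0) <= delta <= window:
                pairs.append((last_robots, rec))
            last_robots = None
    return pairs

# The two 14/Feb/2006 lines from the excerpt above, truncation included.
SAMPLE = [
    'lj9119.inktomisearch.com - - [14/Feb/2006:23:42:30 -0800] "GET /robots.txt HTTP/1.0" 200 6401 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]',
    'lj9062.inktomisearch.com - - [14/Feb/2006:23:42:37 -0800] "GET /dir/file.html HTTP/1.0" 200 29109 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]',
]

pairs = find_pairs(SAMPLE)
```

Because the match is on the agent token, the November 2005 pattern (regular Slurp fetching robots.txt, Slurp China fetching the file) would not be paired by this sketch; matching on hostname suffix instead would be one way to catch that case.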

I tried just letting it have robots.txt and Forbidding (not just Disallowing) all else, but it didn't miss a beat. So for months now, I've 403'd it on everything.
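The two policies just described (robots.txt only, then 403 on everything) can be sketched as a small WSGI wrapper. This is only an illustration in Python; the post does not say what server or rules were actually used, and the names block_agent and allow_robots are hypothetical.

```python
SLURP_CHINA_TOKEN = "Yahoo! Slurp China"

def block_agent(app, token=SLURP_CHINA_TOKEN, allow_robots=False):
    """Wrap a WSGI app so requests whose User-Agent contains `token`
    get a 403. With allow_robots=True, /robots.txt is still served."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        path = environ.get("PATH_INFO", "")
        exempt = allow_robots and path == "/robots.txt"
        if token in agent and not exempt:
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else passes through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

Calling block_agent(app, allow_robots=True) corresponds to the first attempt (robots.txt allowed, all else Forbidden); the default corresponds to 403'ing every request from that user agent.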

Yet still it comes:

lj910179.inktomisearch.com - - [11/Jun/2006:23:37:14 -0700] "GET /robots.txt HTTP/1.0" 403 803 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
lj910053.inktomisearch.com - - [11/Jun/2006:23:37:21 -0700] "GET /dir/file.html HTTP/1.0" 403 803 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]

And as mentioned, from Day One, its info page has been inaccessible to the Chinese font- and language-challenged.

If Slurp China hadn't been a Yahoo spawn, it would've been a goner six months ago. But because of its heritage, I tried to work with it, and/or around it. However, as a direct result of its behavior, nowadays I have far less tolerance for all of Yahoo's countless UAs/IPs/Hosts and their seemingly Yahoo-beneficial screw-ups.

.
P.S.

A few more oddities for your compilation, GaryK:

dcf1.labs.corp.yahoo.com
NO UA

demo03.labs.corp.yahoo.com
NO UA

search1.labs.corp.yahoo.com
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

And one more curious lineage [webmasterworld.com] --


Thread source: http://www.webmasterworld.com/search_engine_spiders/3276.htm