lucy24 - 7:10 am on Dec 16, 2012 (gmt 0)
Anyone else come across something calling itself the TosCrawler? It's one of those under-the-radar robots. Turns out it first showed its face in September-- once-- and has been visiting sporadically since October.
IP (exact): 184.108.40.206 and ..253
UA: currently TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')
but has used 1.4 and 1.5.1 in past months, not always in sequential order. (The "at"s, "dot"s and single quotes are in the original, not an artifact of log wrangling.)
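If you want to see which versions have turned up in your own logs, something like this would pull the version number out of the UA string. It's only a sketch: the pattern is built from the one UA string quoted above, and the function name is made up for illustration.

```python
import re

# Pattern taken from the UA quoted above; the version is whatever follows "Nutch-".
TOSCRAWLER_RE = re.compile(r"TosCrawler/Nutch-(\d+(?:\.\d+)*)")

def nutch_version(user_agent):
    """Return the Nutch version from a TosCrawler UA string, or None for any other agent."""
    match = TOSCRAWLER_RE.search(user_agent)
    return match.group(1) if match else None
```

Feed it the UA field from each log line and collect the results in a set to see which versions have visited.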
Behavior: unobjectionable. Generally only one page at a time; about once a month it will scoop up 8-10 in the course of a day, generally top-level directory pages. Reads robots.txt, usually holding it for about 12 hours, though I once found it going over 24 hours on the same robots.txt. Can't say anything about compliance, because my roboted-out text files are mostly in deeper directories and it mostly sticks to the top level.
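For anyone who wants to check the robots.txt holding time on their own site, here is a rough sketch. It assumes your logs are in Apache combined format (an assumption, adjust the pattern if yours differ), and the function name and the "TosCrawler" substring match are just illustrative choices.

```python
import re
from datetime import datetime

# Assumed layout: Apache "combined" log format.
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)

def robots_fetch_gaps(lines, agent="TosCrawler"):
    """Return the hours between successive robots.txt fetches by the given agent."""
    times = []
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        _ip, stamp, _method, path, user_agent = match.groups()
        if agent in user_agent and path == "/robots.txt":
            times.append(datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z"))
    times.sort()
    # Gap between each fetch and the next one, in hours.
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(times, times[1:])]
```

Call it with an open log file, e.g. `robots_fetch_gaps(open("access.log"))`. Note that a long gap only means the robot held the file that long if it also requested pages in between; otherwise it may simply have been away.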
Far as I can make out, the IP belongs to an ISP named Plala. (Is this one of those Toyota prestige-naming things?) This strikes me as odd. I mean, Toshiba, wouldn't they have their own name on their IP?
The linked page in the UA leads eventually to ../crawl_info_en.html, which says, among other things:
The main goal of developing the crawler is to collect web pages for R&D related to natural language processing. Using the collected web pages, we extract new or unknown words, and we analyze statistical information such as word frequency. Utilizing this information, we develop highly accurate statistical machine translation systems, text-to-speech systems and so on.
:: pause here to sit on hands ::
but there are no details about UA or IP. Wonder if they answer email?