Anyone else come across something calling itself the TosCrawler? It's one of those under-the-radar robots. Turns out it first showed its face in September-- once-- and has been visiting sporadically since October.
IP (exact): 22.214.171.124 and ..253
UA: currently TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')
but has used 1.4 and 1.5.1 in past months, not always in sequential order. (The "at"s, "dot"s and single quotes are in the original, not an artifact of log wrangling.)
Behavior: unobjectionable. Generally only one page at a time; about once a month it will scoop up 8-10 in the course of a day, generally top-level directory pages. Reads robots.txt, usually holding it for about 12 hours though I once found it going over 24 hours on the same robots.txt. Can't say anything about compliance, because my roboted-out text files are mostly in deeper directories.
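For anyone wanting to check the robots.txt holding time in their own logs, here is a rough Python sketch. It assumes you have already pulled two lists of timestamps out of the log for this UA, one for robots.txt requests and one for page requests. The function name and the sample times are made up for illustration, not taken from my logs.

```python
from bisect import bisect_right
from datetime import datetime

def robots_age_at_fetch(robots_times, page_times):
    """For each page fetch, return how old the most recent robots.txt fetch was.

    Returns None for a page fetched before any robots.txt request was seen.
    """
    robots_sorted = sorted(robots_times)
    ages = []
    for t in sorted(page_times):
        # Index of the first robots.txt fetch strictly after this page fetch.
        i = bisect_right(robots_sorted, t)
        ages.append(None if i == 0 else t - robots_sorted[i - 1])
    return ages

# Hypothetical sample data: one robots.txt fetch, two later page fetches.
robots = [datetime(2013, 1, 1, 0, 0)]
pages = [datetime(2013, 1, 1, 6, 0), datetime(2013, 1, 2, 1, 0)]
ages = robots_age_at_fetch(robots, pages)
```

The largest value in the result is the longest the robot went on a single robots.txt fetch.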
Far as I can make out, the IP belongs to an ISP named Plala. (Is this one of those Toyota prestige-naming things?) This strikes me as odd. I mean, Toshiba, wouldn't they have their own name on their IP?
The linked page in the UA leads eventually to ../crawl_info_en.html which says among other things
|The main goal of developing the crawler is to collect web pages for R&D related to natural language processing. Using the collected web pages, we extract new or unknown words, and we analyze statistical information such as word frequency. Utilizing this information, we develop highly accurate statistical machine translation systems, text-to-speech systems and so on. |
:: pause here to sit on hands ::
but there are no details about UA or IP. Wonder if they answer email?
All you need to know is Nutch: any leading or trailing name is possible, as well as any action.
Nutch should be in your UA denials?
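A minimal sketch of that kind of UA denial, in Python for illustration only. It assumes your setup lets you inspect the User-Agent string before answering. The helper names are hypothetical, and the match is a plain case-insensitive search for "nutch" anywhere in the string, so leading or trailing names make no difference.

```python
import re

# Case-insensitive match on "nutch" anywhere in the User-Agent.
NUTCH_RE = re.compile(r"nutch", re.IGNORECASE)

def is_nutch(user_agent):
    """Return True if the User-Agent mentions Nutch in any position."""
    return bool(NUTCH_RE.search(user_agent or ""))

def status_for(user_agent):
    """Hypothetical policy: 403 for anything Nutch-based, 200 otherwise."""
    return 403 if is_nutch(user_agent) else 200

# The UA quoted earlier in this thread.
tos_ua = ("TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; "
          "'Rdc-crawler at ml dot toshiba dot co dot jp')")
blocked = status_for(tos_ua)
```

Most people would do the same thing in the server configuration rather than in code; the point is only that the test is on "nutch" and not on the robot's own name.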
|we extract new or unknown words, and we analyze statistical information such as word frequency. Utilizing this information, we develop highly accurate statistical machine translation systems, text-to-speech systems and so on. |
This reeks of SEO.
I have 126.96.36.199/24 listed for TOS (Toshiba) as a bot with the attribute Kill!
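For illustration, a range check like that can be sketched with Python's standard ipaddress module. The range below is the 192.0.2.0/24 documentation block, used purely as a placeholder and not the crawler's real range, and the list and function names are hypothetical.

```python
import ipaddress

# Placeholder only: 192.0.2.0/24 is a reserved documentation range.
# Substitute whatever ranges you actually keep on your own list.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

def is_blocked(ip):
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)

inside = is_blocked("192.0.2.253")
outside = is_blocked("198.51.100.7")
```

Since the robot reportedly comes from more than one address, a range test alone may miss it; combining it with the UA test is the safer bet.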
no nutch is good nutch
I'm not sure you can put my site and SEO into the same sentence.
My only mental association with Toshiba is TV, so I'm thinking TVs that produce their own closed captioning and/or subtitles on the fly. Matter of fact about half of the inner pages they've picked up to date say something about translation-- but in my case this is not statistically meaningful ;)
They seem to be especially interested in one subgroup of paintings. I'll have to see if I consistently use some word that has an alternate meaning.
All you need to know is Nutch
I second that. And yes, they came by starting in September and got blocked.
Good Nutch is a 403d Nutch.
Apparently it comes from several IPs: [projecthoneypot.org...] . Showed up on my sites mid October or so.
wonder if it's somehow related to start.toshiba.com
just found a referral from them..
Looks like their search is powered by Google, but look at how the "related searches" come up in the results... interesting.
I get these tobi's refers fairly regularly (I guess they like me), coming from Cellco users.
Also here's a 2009 UA:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; PeoplePC 1.0; Toshiba; (R1 1.5))"
I first searched my logs for "Toshiba" but that was a red herring. It shows up as part of the referer-query string in g### mobile searches: tablet-android-toshiba, ms-android-toshiba and so on.
Postscript, in case it makes a difference to anyone:
I fired off an e-mail and, thanks to the time difference, which I'd, ahem, forgotten about, received an almost immediate reply. Toshiba says it is their robot, in spite of the generic IP. (Toshiba's website is also splat in the middle of a random bunch of others sharing the same address. Were they standing behind the door when IP ranges were handed out?)
:: idly wondering whether A Very Big Registrar would even care that someone* wrote "State" in the line that asks for "State or Province" ::
* Irritating but wholly unrelated referer spam. Couldn't Fairpoint have taken a whole /8 to themselves somewhere, so I wouldn't have to keep swatting them by /17s and /18s?
I've blocked all "nutch" for years. IMO if they can't get their own bot and name it accordingly, why should I recognize them?
:: bump ::
Now here's an interesting coincidence. After months of nibbling at a page here, a page there, Toshiba has taken to gulping up to 40 pages at once. Deep enough that I can be pretty sure they are honoring robots.txt. (I have two directories that are fully accessible to humans, but off-limits to robots.) Pages only, no other stuff.
We'll call it a coincidence because my e-mail IP is unrelated to my www IP, my signature never includes the domain name, and the log entry I quoted does not include a domain name. Quick detour to g### confirms that I appear to have the only site in the world with the exact pagename I randomly quoted. Oops.