TosCrawler - Crawler, Spider, and User Agent ID forum at WebmasterWorld - WebmasterWorld

TosCrawler

lucy24

7:10 am on Dec 16, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Anyone else come across something calling itself the TosCrawler? It's one of those under-the-radar robots. Turns out it first showed its face in September-- once-- and has been visiting sporadically since October.

IP (exact): 60.36.84.94 and ..253
UA: currently TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')
but has used 1.4 and 1.5.1 in past months, not always in sequential order. (The "at"s, "dot"s and single quotes are in the original, not an artifact of log wrangling.)
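
For anyone who wants to track which Nutch version shows up when, here is a minimal sketch of pulling the crawler name and version out of a UA shaped like the one quoted above. The regular expression and the function name `parse_nutch_ua` are my own assumptions, built from that single UA string; other Nutch-based robots may well format theirs differently.

```python
import re

# Assumed pattern, derived only from the one UA string quoted in this thread:
# "<name>/Nutch-<version> (...)". Not an official Nutch format definition.
NUTCH_UA = re.compile(r"^(?P<name>[^/\s]+)/Nutch-(?P<version>[\d.]+)", re.IGNORECASE)


def parse_nutch_ua(user_agent):
    """Return (crawler_name, nutch_version), or None if the UA does not fit the pattern."""
    match = NUTCH_UA.match(user_agent)
    if match is None:
        return None
    return match.group("name"), match.group("version")


ua = ("TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; "
      "'Rdc-crawler at ml dot toshiba dot co dot jp')")
print(parse_nutch_ua(ua))  # ('TosCrawler', '1.6')
```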

Behavior: unobjectionable. Generally only one page at a time; about once a month it will scoop up 8-10 in the course of a day, generally top-level directory pages. Reads robots.txt, usually holding it for about 12 hours though I once found it going over 24 hours on the same robots.txt. Can't say anything about compliance, because my roboted-out text files are mostly in deeper directories.
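
One way to check compliance from the logs, sketched here with Python's standard `urllib.robotparser`: feed it the robots.txt rules and the paths the robot actually requested, and see whether any of them were off-limits. The rules and paths below are made-up placeholders, not my real ones, and `disallowed_hits` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def disallowed_hits(robots_lines, user_agent, requested_paths):
    """Return the requested paths that the given robots.txt rules forbid for this UA."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [path for path in requested_paths
            if not parser.can_fetch(user_agent, path)]


# Hypothetical rules and log paths, for illustration only.
robots = ["User-agent: *", "Disallow: /private/"]
paths = ["/", "/paintings/index.html", "/private/notes.txt"]
print(disallowed_hits(robots, "TosCrawler/Nutch-1.6", paths))  # ['/private/notes.txt']
```

An empty list only means the robot never asked for anything forbidden; if it never went near the roboted-out directories, that says nothing either way.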

Far as I can make out, the IP belongs to an ISP named Plala. (Is this one of those Toyota prestige-naming things?) This strikes me as odd. I mean, Toshiba, wouldn't they have their own name on their IP?

The linked page in the UA leads eventually to ../crawl_info_en.html which says among other things

The main goal of developing the crawler is to collect web pages for R&D related to natural language processing. Using the collected web pages, we extract new or unknown words, and we analyze statistical information such as word frequency. Utilizing this information, we develop highly accurate statistical machine translation systems, text-to-speech systems and so on.

:: pause here to sit on hands ::

but there are no details about UA or IP. Wonder if they answer email?

wilderness

6:57 pm on Dec 16, 2012 (gmt 0)

All you need to know is Nutch and any leading or trailing name is possible, as well as any action.

Nutch should be in your UA denials?
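
As a rough illustration of that kind of UA denial, assuming a simple case-insensitive substring match (the function name `is_denied_ua` and the token list are hypothetical, and a real setup would more likely live in the server config than in a script):

```python
def is_denied_ua(user_agent, denied_tokens=("nutch",)):
    """Return True if any denied token appears anywhere in the UA, ignoring case."""
    lowered = user_agent.lower()
    return any(token in lowered for token in denied_tokens)


# Matches regardless of the leading or trailing name wrapped around "Nutch".
print(is_denied_ua("TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm)"))
print(is_denied_ua("Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"))
```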

wilderness

8:02 pm on Dec 16, 2012 (gmt 0)

we extract new or unknown words, and we analyze statistical information such as word frequency. Utilizing this information, we develop highly accurate statistical machine translation systems, text-to-speech systems and so on.

This reeks of SEO.

dstiles

10:18 pm on Dec 16, 2012 (gmt 0)

I have 60.36.84.0/24 listed for TOS (Toshiba) as a bot with the attribute Kill!
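
A minimal sketch of testing a visitor IP against a listed range like that one, using Python's standard `ipaddress` module. The list name `BLOCKED_RANGES` and the function `is_blocked` are hypothetical; only the /24 itself comes from this thread.

```python
import ipaddress

# The one range mentioned above; extend the list as needed.
BLOCKED_RANGES = [ipaddress.ip_network("60.36.84.0/24")]


def is_blocked(ip):
    """Return True if the address falls inside any blocked range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_RANGES)


print(is_blocked("60.36.84.94"))  # True: the IP from the first post
print(is_blocked("60.36.85.1"))   # False: outside the /24
```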

incrediBILL

10:29 pm on Dec 16, 2012 (gmt 0)

no nutch is good nutch

lucy24

12:15 am on Dec 17, 2012 (gmt 0)

This reeks of SEO.

I'm not sure you can put my site and SEO into the same sentence.

::snrk::

My only mental association with Toshiba is TV, so I'm thinking TVs that produce their own closed captioning and/or subtitles on the fly. Matter of fact about half of the inner pages they've picked up to date say something about translation-- but in my case this is not statistically meaningful ;)

They seem to be especially interested in one subgroup of paintings. I'll have to see if I consistently use some word that has an alternate meaning.

Bewenched

12:20 am on Dec 17, 2012 (gmt 0)

All you need to know is Nutch

I second that. And yes, they came by starting in September and got blocked.

blend27

12:29 am on Dec 17, 2012 (gmt 0)

no nutch is good nutch

Good Nutch is a 403d Nutch.

Apparently it comes from several IPs: [projecthoneypot.org...]. Showed up on my sites mid-October or so.

Bewenched

1:29 am on Dec 17, 2012 (gmt 0)

wonder if it's somehow related to start.toshiba.com

just found a referral from them..
Looks like their search is powered by Google, but look at how the "related searches" come up in the results... interesting.

wilderness

3:19 am on Dec 17, 2012 (gmt 0)

I get these tobi's refers fairly regularly (I guess they like me) coming from Cellco users.

http ://start.toshiba.com/search/index.php?context=homepage

Also here's a 2009 UA:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; PeoplePC 1.0; Toshiba; (R1 1.5))"

lucy24

5:56 am on Dec 17, 2012 (gmt 0)

I first searched my logs for "Toshiba" but that was a red herring. It shows up as part of the referer-query string in g### mobile searches: tablet-android-toshiba, ms-android-toshiba and so on.

lucy24

10:01 am on Dec 17, 2012 (gmt 0)

Postscript, in case it makes a difference to anyone:

I fired off an e-mail and, thanks to time difference which I'd, ahem, forgotten about, received an almost immediate reply. Toshiba says it is their robot, in spite of the generic IP. (Toshiba's website is also splat in the middle of a random bunch of others sharing the same address. Were they standing behind the door when IP ranges were handed out?)

:: idly wondering whether A Very Big Registrar would even care that someone* wrote "State" in the line that asks for "State or Province" ::

* Irritating but wholly unrelated referer spam. Couldn't Fairpoint have taken a whole /8 to themselves somewhere, so I wouldn't have to keep swatting them by /17s and /18s?

keyplyr

8:38 pm on Dec 17, 2012 (gmt 0)

I've blocked all "nutch" for years. IMO if they can't get their own bot and name it accordingly, why should I recognize them?

lucy24

6:55 pm on Dec 25, 2012 (gmt 0)

:: bump ::

Now here's an interesting coincidence. After months of nibbling at a page here, a page there, Toshiba has taken to gulping up to 40 pages at once. Deep enough that I can be pretty sure they are honoring robots.txt. (I have two directories that are fully accessible to humans, but off-limits to robots.) Pages only, no other stuff.

We'll call it a coincidence because my e-mail IP is unrelated to my www IP, my signature never includes the domain name, and the log entry I quoted does not include a domain name. Quick detour to g### confirms that I appear to have the only site in the world with the exact pagename I randomly quoted. Oops.

Hmm.