Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
TosCrawler
lucy24




 
Msg#: 4528638 posted 7:10 am on Dec 16, 2012 (gmt 0)

Anyone else come across something calling itself the TosCrawler? It's one of those under-the-radar robots. Turns out it first showed its face in September-- once-- and has been visiting sporadically since October.

IP (exact): 60.36.84.94 and ..253
UA: currently TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')
but has used 1.4 and 1.5.1 in past months, not always in sequential order. (The "at"s, "dot"s and single quotes are in the original, not an artifact of log wrangling.)
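For anyone who wants to track the version hopping in their own logs, here is a minimal Python sketch that splits a UA of this shape into its parts. The field layout is inferred from the one string quoted above, so treat the pattern as an assumption rather than a documented Nutch format; `parse_nutch_ua` is a made-up helper name.

```python
import re

# Pattern for user agents shaped like:  Name/Nutch-<version> (<url>; <contact>)
# The layout is inferred from the single UA quoted in this thread.
NUTCH_UA = re.compile(
    r"^(?P<name>[^/\s]+)/Nutch-(?P<version>[\d.]+)\s*"
    r"\((?P<url>[^;]+);\s*(?P<contact>.*)\)$"
)

def parse_nutch_ua(ua):
    """Return a dict with name, version, url and contact, or None if no match."""
    m = NUTCH_UA.match(ua.strip())
    if m is None:
        return None
    fields = m.groupdict()
    # The logged contact field is wrapped in literal single quotes; drop them.
    fields["contact"] = fields["contact"].strip("'")
    return fields

ua = ("TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; "
      "'Rdc-crawler at ml dot toshiba dot co dot jp')")
info = parse_nutch_ua(ua)
```

Here `info["version"]` comes back as the string `1.6`, and the same pattern also accepts three-part versions such as 1.5.1.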

Behavior: unobjectionable. Generally only one page at a time; about once a month it will scoop up 8-10 in the course of a day, generally top-level directory pages. Reads robots.txt, usually holding it for about 12 hours though I once found it going over 24 hours on the same robots.txt. Can't say anything about compliance, because my roboted-out text files are mostly in deeper directories.
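One way to put a number on how long a robot holds robots.txt is to measure, for each page request, the age of its most recent robots.txt fetch. The Python sketch below does that for a list of already-parsed log entries. The sample entries and paths are invented for illustration, and `max_robots_age` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def max_robots_age(requests):
    """requests: (datetime, path) pairs for one crawler, sorted by time.

    Returns the largest age of the last-fetched robots.txt at the moment of
    a page request, or None if no page request followed a robots.txt fetch.
    """
    last_robots = None
    worst = None
    for when, path in requests:
        if path == "/robots.txt":
            last_robots = when
        elif last_robots is not None:
            age = when - last_robots
            if worst is None or age > worst:
                worst = age
    return worst

# Made-up sample entries, not taken from any real log.
sample = [
    (datetime(2012, 12, 1, 0, 0), "/robots.txt"),
    (datetime(2012, 12, 1, 3, 0), "/"),
    (datetime(2012, 12, 2, 1, 30), "/paintings/"),
    (datetime(2012, 12, 2, 2, 0), "/robots.txt"),
    (datetime(2012, 12, 2, 4, 0), "/about/"),
]
longest = max_robots_age(sample)
```

With these sample entries the longest hold is 25 hours 30 minutes, the gap between the first robots.txt fetch and the `/paintings/` request.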

Far as I can make out, the IP belongs to an ISP named Plala. (Is this one of those Toyota prestige-naming things?) This strikes me as odd. I mean, Toshiba, wouldn't they have their own name on their IP?

The linked page in the UA leads eventually to ../crawl_info_en.html which says among other things
The main goal of developing the crawler is to collect web pages for R&D related to natural language processing. Using the collected web pages, we extract new or unknown words, and we analyze statistical information such as word frequency. Utilizing this information, we develop highly accurate statistical machine translation systems, text-to-speech systems and so on.


:: pause here to sit on hands ::

but there are no details about UA or IP. Wonder if they answer email?

 

wilderness




 
Msg#: 4528638 posted 6:57 pm on Dec 16, 2012 (gmt 0)

All you need to know is Nutch: any leading or trailing name is possible, as well as any action.

Nutch should be in your UA denials?
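A UA denial of this kind boils down to a case-insensitive substring test. Here is a minimal Python sketch of the idea, not any particular server's configuration syntax. `status_for` is a hypothetical helper that returns the status code a server would send.

```python
import re

# Case-insensitive match, so "Nutch", "nutch" and "NUTCH" are all caught,
# whatever name is wrapped around it.
DENIED_UA = re.compile(r"nutch", re.IGNORECASE)

def status_for(user_agent):
    """Return 403 for denied user agents, 200 otherwise."""
    if user_agent and DENIED_UA.search(user_agent):
        return 403
    return 200

blocked = status_for("TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm)")
allowed = status_for("Mozilla/5.0")
```

Matching on the bare substring is what makes the leading or trailing name irrelevant: any UA that admits to being built on Nutch gets the same answer.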

wilderness




 
Msg#: 4528638 posted 8:02 pm on Dec 16, 2012 (gmt 0)

we extract new or unknown words, and we analyze statistical information such as word frequency. Utilizing this information, we develop highly accurate statistical machine translation systems, text-to-speech systems and so on.


This reeks of SEO.

dstiles




 
Msg#: 4528638 posted 10:18 pm on Dec 16, 2012 (gmt 0)

I have 60.36.84.0/24 listed for TOS (Toshiba) as a bot with the attribute Kill!
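For reference, checking an address against that /24 takes a few lines with Python's standard `ipaddress` module. This is only a sketch of the range test, with `is_blocked` as a made-up name; how the result is acted on (firewall, server config, script) is left open.

```python
import ipaddress

# The range quoted above.
BLOCKED = ipaddress.ip_network("60.36.84.0/24")

def is_blocked(ip):
    """True if the dotted-quad address falls inside the blocked range."""
    return ipaddress.ip_address(ip) in BLOCKED

first = is_blocked("60.36.84.94")    # one of the two addresses reported earlier
second = is_blocked("60.36.84.253")  # the other one
outside = is_blocked("60.36.85.1")   # neighbouring /24, not covered
```

Both addresses from the opening post fall inside the range; the neighbouring /24 does not.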

incrediBILL




 
Msg#: 4528638 posted 10:29 pm on Dec 16, 2012 (gmt 0)

no nutch is good nutch

lucy24




 
Msg#: 4528638 posted 12:15 am on Dec 17, 2012 (gmt 0)

This reeks of SEO.

I'm not sure you can put my site and SEO into the same sentence.

::snrk::

My only mental association with Toshiba is TV, so I'm thinking TVs that produce their own closed captioning and/or subtitles on the fly. Matter of fact about half of the inner pages they've picked up to date say something about translation-- but in my case this is not statistically meaningful ;)

They seem to be especially interested in one subgroup of paintings. I'll have to see if I consistently use some word that has an alternate meaning.

Bewenched




 
Msg#: 4528638 posted 12:20 am on Dec 17, 2012 (gmt 0)


All you need to know is Nutch

I second that. And yes, they came by starting in September and got blocked.

blend27




 
Msg#: 4528638 posted 12:29 am on Dec 17, 2012 (gmt 0)

no nutch is good nutch
Good Nutch is a 403d Nutch.


Apparently it comes from several IPs: [projecthoneypot.org...] . Showed up on my sites mid October or so.

Bewenched




 
Msg#: 4528638 posted 1:29 am on Dec 17, 2012 (gmt 0)

wonder if it's somehow related to start.toshiba.com

just found a referral from them..
Looks like their search is powered by Google, but look at how the "related searches" come up in the results... interesting.

wilderness




 
Msg#: 4528638 posted 3:19 am on Dec 17, 2012 (gmt 0)

I get these tobi's refers fairly regularly (I guess they like me), coming from Cellco users.

http ://start.toshiba.com/search/index.php?context=homepage

Also here's a 2009 UA:
"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; PeoplePC 1.0; Toshiba; (R1 1.5))"

lucy24




 
Msg#: 4528638 posted 5:56 am on Dec 17, 2012 (gmt 0)

I first searched my logs for "Toshiba" but that was a red herring. It shows up as part of the referer-query string in g### mobile searches: tablet-android-toshiba, ms-android-toshiba and so on.
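To tell this kind of false hit from a real one, it helps to see which query parameter the word actually sits in. Here is a rough Python sketch using the standard `urllib.parse` module. The referer below is invented, its host is a placeholder, and the parameter name `client` is an assumption, since the post only gives the values.

```python
from urllib.parse import urlsplit, parse_qsl

def params_containing(referer, needle):
    """Return the (name, value) query pairs whose value contains needle."""
    query = urlsplit(referer).query
    needle = needle.lower()
    return [(k, v) for k, v in parse_qsl(query) if needle in v.lower()]

# Hypothetical referer: placeholder host, assumed parameter names.
ref = "http://search.example/search?q=some+painting&client=tablet-android-toshiba"
hits = params_containing(ref, "toshiba")
```

If the only hit is a browser or device tag rather than the search terms, the log line is about the visitor's hardware, not about the crawler.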

lucy24




 
Msg#: 4528638 posted 10:01 am on Dec 17, 2012 (gmt 0)

Postscript, in case it makes a difference to anyone:

I fired off an e-mail and, thanks to time difference which I'd, ahem, forgotten about, received an almost immediate reply. Toshiba says it is their robot, in spite of the generic IP. (Toshiba's website is also splat in the middle of a random bunch of others sharing the same address. Were they standing behind the door when IP ranges were handed out?)

:: idly wondering whether A Very Big Registrar would even care that someone* wrote "State" in the line that asks for "State or Province" ::


* Irritating but wholly unrelated referer spam. Couldn't Fairpoint have taken a whole /8 to themselves somewhere, so I wouldn't have to keep swatting them by /17s and /18s?

keyplyr




 
Msg#: 4528638 posted 8:38 pm on Dec 17, 2012 (gmt 0)

I've blocked all "nutch" for years. IMO if they can't get their own bot and name it accordingly, why should I recognize them?

lucy24




 
Msg#: 4528638 posted 6:55 pm on Dec 25, 2012 (gmt 0)

:: bump ::

Now here's an interesting coincidence. After months of nibbling at a page here, a page there, Toshiba has taken to gulping up to 40 pages at once. Deep enough that I can be pretty sure they are honoring robots.txt. (I have two directories that are fully accessible to humans, but off-limits to robots.) Pages only, no other stuff.
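A compliance check like this can be made a little less eyeball-dependent by replaying the fetched paths against the robots.txt rules. The Python sketch below uses the standard `urllib.robotparser` module. The rules and paths are invented stand-ins, not the real site's, and `violations` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

def violations(robots_txt, agent, fetched_paths):
    """Return the fetched paths that the given rules disallow for agent."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [p for p in fetched_paths if not rp.can_fetch(agent, p)]

# Invented rules: two directories closed to all robots.
rules = """\
User-agent: *
Disallow: /robots-off-one/
Disallow: /robots-off-two/
"""

# Invented paths standing in for one day's log entries from the crawler.
fetched = ["/", "/gallery/index.html", "/robots-off-one/notes.html"]
bad = violations(rules, "TosCrawler", fetched)
```

An empty result over a deep enough crawl is decent evidence of compliance; in this made-up sample the third path would be flagged.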

We'll call it a coincidence because my e-mail IP is unrelated to my www IP, my signature never includes the domain name, and the log entry I quoted does not include a domain name. Quick detour to g### confirms that I appear to have the only site in the world with the exact pagename I randomly quoted. Oops.

Hmm.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved