Forum Moderators: Ocean10000 & keyplyr

TosCrawler

toshiba

     
5:24 pm on Mar 26, 2015 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


UA: TosCrawler/Nutch-1.8 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')
Robots.txt: yes
Host: 67.228.191.72 - 67.228.191.79 (67.228.191.72/29)

First: I don't understand why any reputable company would use Nutch, or even if they do, why they would include this attribute in the UA string. IMO it looks like "well we weren't as smart as 20 million script kiddies to write our own bot so we searched the web and found this free generic bot."

Second: the info page didn't resolve for me in English so I have no idea what this thing is up to. When I chose the link for the English version, I was forwarded to their (Japanese language) index page - duh!

Third: Their sub-net 67.228.191.72/29 is part of SoftLayer (server farm) 67.228.0.0/16 which I (and many other webmasters) block, so end of story.
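
For anyone who wants to check the arithmetic, here is a minimal Python sketch using the standard ipaddress module. The names toshiba_block, softlayer_block and is_blocked are made up for illustration; only the two ranges come from the post above.

```python
import ipaddress

# Ranges quoted in the post; the variable names are illustrative only.
toshiba_block = ipaddress.ip_network("67.228.191.72/29")
softlayer_block = ipaddress.ip_network("67.228.0.0/16")


def is_blocked(ip, blocked=(softlayer_block,)):
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocked)


if __name__ == "__main__":
    # A /29 holds 8 addresses: .72 through .79
    print(toshiba_block[0], "-", toshiba_block[-1])
    # The /29 sits entirely inside the /16 (subnet_of needs Python 3.7+)
    print(toshiba_block.subnet_of(softlayer_block))
    print(is_blocked("67.228.191.74"))
```

So blocking the /16 already covers the crawler's /29 without a separate rule.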
5:27 am on Mar 27, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2040
votes: 1


I agree they look amateur. But at least they're using a newer Nutch...

Back in June and October, 2012, both times from --

i60-36-84-72.s43.a014.ap.plala.or.jp
(60.36.84.72)

-- they used the similar, and similarly painful-looking:

TosCrawler/Nutch-1.4 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')


robots.txt? YES

Project Honey Pot for 60.36.84.72 shows that UA plus these variations active in 2012:

Tobot/Nutch-1.4 ('tobot at eel dot rdc dot toshiba dot co dot jp')
TosCrawler/Nutch-1.4 ('Rdc-crawler at ml dot toshiba dot co dot jp')


No clue if they really are Toshiba-related, or just look that way, or why Toshiba would crawl at all.
7:53 am on Mar 27, 2015 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Jan 26, 2014
posts:197
votes: 0


I google-translated the crawl_info page: it's their research people. Researching natural language stuff apparently.
8:13 am on Mar 27, 2015 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


Well when I google-translated that page the translator said it was already in English.
8:20 am on Mar 27, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15305
votes: 703


Oh, good heavens, are they back? I haven't seen them since
::shuffling papers::
February 2013, at the end of several months of avid but well-behaved crawling. Their last UA contained the element
http://www.toshiba.co.jp/rdc/about/crawl_info_en.htm

... leading to a page that's still there. Note the _en
(Their IP at the time was an absolutely consistent 60.36.84.49 which I believe really is Toshiba, or at least was back then.)

I tend to approve of projects that have anything whatsoever to do with language, so they're welcome to return.
9:01 am on Mar 27, 2015 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891



I block "nutch" at server, reasons stated above. I do allow the real Nutch, just not the bazillion clones.
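
A rough Python sketch of that kind of server-side check, assuming a plain case-insensitive substring test on the UA. The function name mentions_nutch is hypothetical, and the exception for "the real Nutch" is not shown since the thread doesn't say how it is recognized.

```python
def mentions_nutch(user_agent):
    """True if the UA string contains "nutch" in any letter case."""
    return "nutch" in user_agent.casefold()


if __name__ == "__main__":
    # UA strings reported in this thread, shortened to the product part
    for ua in ("TosCrawler/Nutch-1.8", "Tobot/Nutch-1.4", "tsip-agent/Nutch-1.8"):
        print(ua, mentions_nutch(ua))
```

Folding the case first matters because the clones are not consistent about capitalization.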
11:55 am on Mar 27, 2015 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


"I tend to approve of projects that have anything whatsoever to do with language, so they're welcome to return."

You allow Softlayer?
7:57 pm on Mar 27, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15305
votes: 703


Softlayer is a project? I thought they were a server farm.

:: detour to look up ::

I've only got two of their ranges blocked-- and that's from a list going back so long, some of them don't even have the "robot" color code. They're not on the Shoot To Kill list (an elite group that includes Hetzner, OVH and a short list of others whose names escape me at the moment).
9:00 pm on Mar 27, 2015 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3209
votes: 17


I have all detected softlayer ranges blocked, US and EU...

5.10.64.0 - 5.10.127.255
5.153.0.0 - 5.153.63.255
37.58.64.0 - 37.58.127.255
50.22.0.0 - 50.23.255.255
50.97.0.0 - 50.97.255.255
66.228.112.0 - 66.228.127.255
67.228.0.0 - 67.228.255.255
74.86.0.0 - 74.86.255.255
75.126.0.0 - 75.126.255.255
108.168.128.0 - 108.168.255.255
119.81.0.0 - 119.81.255.255
158.85.0.0 - 158.85.255.255
159.8.0.0 - 159.8.255.255
159.122.0.0 - 159.122.255.255
159.253.128.0 - 159.253.159.255
169.38.0.0 - 169.38.255.255
169.45.0.0 - 169.48.255.255
169.50.0.0 - 169.51.255.255
169.53.0.0 - 169.63.255.255
173.192.0.0 - 173.193.255.255
174.36.0.0 - 174.37.255.255
192.155.192.0 - 192.155.255.255
192.255.0.0 - 192.255.63.255
198.23.64.0 - 198.23.127.255
208.43.0.0 - 208.43.255.255
208.101.0.0 - 208.101.63.255
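
If your firewall or server config wants CIDR notation rather than start-end ranges, something like the following Python sketch converts lines in the format used above. It relies on the standard ipaddress module; the helper name range_to_cidrs is made up for this example.

```python
import ipaddress


def range_to_cidrs(line):
    """Convert a 'first - last' line into the CIDR blocks that cover it exactly."""
    first, last = (ipaddress.ip_address(part.strip()) for part in line.split("-"))
    return [str(net) for net in ipaddress.summarize_address_range(first, last)]


if __name__ == "__main__":
    for line in (
        "67.228.0.0 - 67.228.255.255",
        "159.253.128.0 - 159.253.159.255",
        "169.45.0.0 - 169.48.255.255",
    ):
        print(line, "->", range_to_cidrs(line))
```

Ranges that are not aligned on a power-of-two boundary, like 169.45 through 169.48, come out as several blocks rather than one.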
9:12 pm on Mar 27, 2015 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


ditto
12:27 am on Mar 28, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15305
votes: 703


What was the "language" connection? I really hope this wasn't simply an obvious joke that went whoosh over my head.
12:55 am on Mar 28, 2015 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Jan 26, 2014
posts:197
votes: 0


Fragment from the TosCrawler crawl_info page, as translated by google:

Publish a Web page collection policy of the Research and Development Center. The Research and Development Center, research and such as natural language processing technology, in order to carry out product development that applies this, you have to collect the Web page. To everyone of administrator of the Web page, Please understand our collection purposes and policies, to ask for your cooperation, and to publish the collection policy. In regard to read, so please contact if you have any questions or requests, please.

(Google translate leaves room for improvement, but is hugely better than nothing. Sometimes if it guesses the wrong language as it did in this case, you have to override it.)
1:00 am on Mar 28, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15305
votes: 703


No, no, I meant what does Softlayer have to do with language?

You don't need Google Translate. I gave their English-language URL above.

But I actually stopped by to say:
169.38.0.0 - 169.38.255.255
169.45.0.0 - 169.48.255.255
169.50.0.0 - 169.51.255.255
169.53.0.0 - 169.63.255.255

This seemed so odd that I looked it up. All the other pieces of 169.32-169.63 are Credit Suisse. So if you don't have any content that's attractive to Swiss bankers, you are probably safe blocking the whole 169.32.0.0/11 in one fell swoop.

Turns out I'd got more of Softlayer identified than I thought. My bad, forgot to switch off case matching. softlayer != Softlayer != SoftLayer. Never knew they had tentacles in Korea and Singapore, though.

192.255.0.0/18 (Softlayer) + 192.255.64.0/18 (Micfo) = 192.255.0.0/17
Wait, it gets better.
192.255.0.0/17 + 192.255.128.0/17 (Hostwinds) = 192.255.0.0/16
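
The merging can be double-checked with a short Python sketch. It uses ipaddress.collapse_addresses from the standard library; the helper names merge and all_inside are only for illustration, and the ranges are the ones quoted in this thread.

```python
import ipaddress


def merge(*cidrs):
    """Collapse adjacent or overlapping networks into the fewest CIDR blocks."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


def all_inside(supernet, ranges):
    """True if every (first, last) pair lies within the given supernet."""
    big = ipaddress.ip_network(supernet)
    return all(
        ipaddress.ip_address(first) in big and ipaddress.ip_address(last) in big
        for first, last in ranges
    )


if __name__ == "__main__":
    print(merge("192.255.0.0/18", "192.255.64.0/18"))
    print(merge("192.255.0.0/18", "192.255.64.0/18", "192.255.128.0/17"))
    softlayer_169 = [
        ("169.38.0.0", "169.38.255.255"),
        ("169.45.0.0", "169.48.255.255"),
        ("169.50.0.0", "169.51.255.255"),
        ("169.53.0.0", "169.63.255.255"),
    ]
    print(all_inside("169.32.0.0/11", softlayer_169))
```

The /11 runs from 169.32.0.0 to 169.63.255.255, so it covers all four of the 169.x ranges plus the gaps between them.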
1:33 am on Mar 28, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2040
votes: 1


I wonder if they're stepping up their crawling? I've not seen them in ages and then suddenly a few minutes ago:

67.228.191.74-static.reverse.softlayer.com
(67.228.191.74)

TosCrawler/Nutch-1.8 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')


robots.txt? YES

Project Honey Pot (PHP) for the IP [projecthoneypot.org...] shows they've also run Yet Another Nutch variation:

tsip-agent/Nutch-1.8


(P.S. I've blocked every SoftLayer forEVER but for robots.txt and will happily continue to do so.)
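
In case the "blocked but for robots.txt" policy isn't clear, here is a minimal Python sketch of the decision. The names BLOCKED and should_serve are hypothetical, and the single /16 stands in for a full SoftLayer list; how the rule is actually wired into a server is not shown.

```python
import ipaddress

# Stand-in list; a real setup would carry every range you block.
BLOCKED = [ipaddress.ip_network("67.228.0.0/16")]


def should_serve(ip, path):
    """Serve robots.txt to everyone; refuse everything else to blocked ranges."""
    if path == "/robots.txt":
        return True
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in BLOCKED)


if __name__ == "__main__":
    print(should_serve("67.228.191.74", "/robots.txt"))
    print(should_serve("67.228.191.74", "/index.html"))
```

Leaving robots.txt reachable means a well-behaved bot can still read its Disallow and go away on its own.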
1:58 am on Mar 28, 2015 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


What always bugged me about these Nutch clones is they don't follow the "disallow: nutch" directive in robots.txt. Half of them don't follow it even if I include their specific tagging, for example "TosCrawler/Nutch". So again I say what is the #$*& point of including this attribute in the UA string if it has no application for the webmaster who is receiving it?
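
One possible reason for the mismatch, sketched in Python with the standard library's urllib.robotparser: that parser compares robots.txt names only against the part of the UA before the first slash, so the "Nutch" bit never gets looked at. This shows one parser's behavior only and says nothing about how Nutch itself matches; the helper name allowed and the test path are made up.

```python
from urllib.robotparser import RobotFileParser

UA = ("TosCrawler/Nutch-1.8 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; "
      "'Rdc-crawler at ml dot toshiba dot co dot jp')")


def allowed(agent_name, user_agent=UA, path="/page.html"):
    """Would this UA be allowed if robots.txt disallowed everything for agent_name?"""
    parser = RobotFileParser()
    parser.parse(["User-agent: " + agent_name, "Disallow: /"])
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for name in ("nutch", "TosCrawler/Nutch", "TosCrawler"):
        print(name, "->", "allowed" if allowed(name) else "disallowed")
```

With this parser only the bare "TosCrawler" record applies to the bot; the "nutch" and "TosCrawler/Nutch" records are silently skipped.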