Nutch Sightings from 100+ IPs

         

incrediBILL

7:04 pm on Jun 14, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm sure Nutch has been discussed before but I thought it might interest this group to see the recent scope of Nutch utilization that I culled from my spider archive.

Sadly, many of the users don't further identify the purpose of their usage by modifying the user agent details. Personally, I wish the author would alter Nutch so it won't even run unless you change the user agent.

You'll note a few actual business names in the list that you might find interesting, so take a look.


124.32.246.36 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

124.32.246.45 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

128.208.6.200 NutchCVS/0.7.1 (Nutch running at UW; [crawlers.cs.washington.edu...] sycrawl@cs.washington.edu)

128.208.6.226 NutchCVS/0.8-dev (Nutch running at UW; [nutch.org...] sycrawl@cs.washington.edu)

128.208.6.227 NutchCVS/0.8-dev (Nutch running at UW; [nutch.org...] sycrawl@cs.washington.edu)

128.208.6.77 NutchCVS/0.8-dev (Nutch running at UW; [nutch.org...] sycrawl@cs.washington.edu)

129.242.19.138 NutchCVS/0.06-dev (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

129.34.20.19 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

129.78.64.106 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

131.112.16.140 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

131.112.16.220 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

131.211.84.21 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

136.165.45.122 NutchCVS/0.06-dev (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

137.43.154.203 NutchCVS/0.06-dev (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

147.202.90.2 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

164.67.195.24 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

164.67.195.245 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

164.67.195.26 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

164.67.195.27 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

164.67.195.68 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

164.67.195.85 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

166.214.93.76 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

193.203.240.117 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

193.203.240.118 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

193.203.240.119 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

193.203.240.120 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

193.203.240.121 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

193.203.240.122 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

193.252.148.51 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

203.113.130.205 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

203.131.194.84 NutchCVS/0.7 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

203.147.0.44 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

203.244.218.1 NutchCVS/0.06-dev (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

209.131.61.1 NutchCVS/0.7 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

210.174.3.130 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

210.196.73.193 NutchCVS/0.06-dev (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

210.245.31.15 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

210.245.31.18 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

212.12.114.238 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

212.127.226.60 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

212.137.33.140 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

212.156.230.210 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

212.58.116.72 NutchCVS/0.7 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

213.132.175.101 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

213.186.36.107 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

213.251.133.12 Misterbot-Nutch/0.7.1 (Misterbot-Nutch; [misterbot.fr;...] nutch at misterbot.fr)

216.93.185.12 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

220.218.159.50 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

221.114.253.210 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

221.116.237.114 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

221.221.237.35 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

24.222.153.250 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

24.224.226.18 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

58.186.61.164 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

58.187.12.236 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

58.87.139.90 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

59.160.240.115 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

60.248.9.114 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

61.135.151.175 NutchCVS/0.06-dev (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

62.129.132.47 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

62.168.188.151 NutchCVS/0.7 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

62.40.36.87 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

63.133.162.98 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

64.105.36.210 NutchCVS/0.06-dev (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

64.151.112.44 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

64.241.242.18 NutchCVS/0.05 (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

64.242.88.10 NutchCVS/0.05 (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

64.242.88.60 NutchCVS/0.05 (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

64.34.172.78 BurstFind Crawler 1.0/0.7.1 (Nutch; [lucene.apache.org...] crawler@burstfind.com)

64.34.180.167 Nokia6620/2.0 (4.22.1) SymbianOS/7.0s Series60/2.1 Profile/MIDP-2.0 Configuration/CLDC-1.0/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

64.38.10.26 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

64.71.164.103 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

64.71.164.107 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

64.71.164.108 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

64.71.164.125 Krugle/Krugle,Nutch/0.8+ (Krugle web crawler; [krugle.com...] webcrawler@krugle.com)

65.220.67.9 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

65.9.20.49 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

65.91.114.3 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

66.108.32.4 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

66.15.68.234 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

66.162.5.43 NutchCVS/0.7 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

66.207.120.226 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

66.243.31.34 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

67.111.28.139 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

67.52.101.242 NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

68.205.124.164 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

68.205.127.94 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

69.248.26.83 Comrite/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

69.55.233.28 Argus/1.1 (Nutch; [simpy.com...] feedback at simpy dot com)

70.197.81.79 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

70.30.97.106 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

70.56.66.216 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

70.96.99.254 NutchCVS/0.7 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

71.241.153.125 NutchCVS/0.7 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

71.35.163.79 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

72.0.207.162 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

72.2.25.67 NutchCVS/0.06-dev (Nutch; [nutch.org...] nutch-agent@lists.sourceforge.net)

72.5.173.12 sdcresearchlabs-testbot/0.8-dev (www.shopping.com/bot.html; [lucene.apache.org...] researchbot@shopping.com)

72.51.37.148 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

81.203.142.109 NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

83.246.79.28 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

84.191.111.92 NutchCVS/0.7.1 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)

Pfui

12:39 am on Jun 15, 2006 (gmt 0)

Yeah. Nutch is quite the plague. Last April, I mused over its rapid proliferation [webmasterworld.com] and ranted a bit about Rude Nutch Users.

I stopped getting overly irked after deciding to 403 every Nutch user, but I know they're still ringing the bell. It would be SO NICE if bot-coders programmed their spawn to GO AWAY after, say, three 403s.

incrediBILL

12:54 am on Jun 15, 2006 (gmt 0)

I've been taking a lot of abuse from my musings about nutch too, which is why I just posted the list of sighted crawlers and left it at that.

GaryK

1:11 am on Jun 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I read your rant from yesterday Bill. I was pleased to see it and I agree completely. But please don't encourage these people to change the user agent. It's so easy to block Nutch right now with just one line of code. I'd hate to have to start playing the IP Address game with yet another user agent.

incrediBILL

5:15 pm on Jun 18, 2006 (gmt 0)

Another one showed up:

203.199.83.162
pro3.rediffmailpro.com
"NutchCVS/0.7.2 (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"

bobothecat

7:31 pm on Jun 18, 2006 (gmt 0)



Guess the easiest way to deal with this at the moment would be a simple:

RewriteCond %{HTTP_USER_AGENT} Nutch [NC,OR]

in the .htaccess file - has worked for me for years now :)
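Worth noting for anyone copying that line: a RewriteCond does nothing without a RewriteRule after it, and the [OR] flag only makes sense when another condition follows (left on the last condition, it is a well-known way to end up matching every request). A minimal sketch of a complete block, assuming mod_rewrite is enabled and .htaccess overrides are allowed; the catch-all pattern and the 403 response are illustrative choices, not necessarily what the poster runs:

```apache
# Sketch only: assumes mod_rewrite is loaded and AllowOverride permits it.
RewriteEngine On

# Case-insensitive substring match on the User-Agent header.
# No [OR] here because this is the last (and only) condition.
RewriteCond %{HTTP_USER_AGENT} Nutch [NC]

# Leave the URL unchanged ("-") and answer with 403 Forbidden.
RewriteRule ^ - [F]
```

The match is a plain case-insensitive substring, so it catches renamed builds only as long as they keep "Nutch" somewhere in the user agent string.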

It's so easy to block Nutch right now with just one line of code. I'd hate to have to start playing the IP Address game with yet another user agent.

Though if we had a more private forum... we'd have less of a chance for 'them' to see:

[webmasterworld.com...]

Pfui

8:05 pm on Jun 18, 2006 (gmt 0)

SetEnvIfNoCase User-Agent "Nutch" no_way 

Works for me:)
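On its own, that SetEnvIfNoCase line only sets the no_way variable; it needs an access rule that acts on it. A sketch of the usual pairing, assuming Apache 2.0/2.2-style access control (mod_setenvif plus Order/Allow/Deny); on Apache 2.4 the Require form shown in the comments should be the equivalent:

```apache
# Tag any request whose User-Agent contains "Nutch" (case-insensitive).
SetEnvIfNoCase User-Agent "Nutch" no_way

# Apache 2.0/2.2 style: allow everyone, then deny tagged requests.
Order Allow,Deny
Allow from all
Deny from env=no_way

# Apache 2.4 style (use instead of the three lines above):
# <RequireAll>
#     Require all granted
#     Require not env no_way
# </RequireAll>
```

Denied requests get a 403, and the same no_way variable can be reused for other purposes, such as conditional logging.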

incrediBILL

9:56 pm on Jun 18, 2006 (gmt 0)

I never said they got onto my site - nothing gets onto my site.

I just see what's trying to get on, which is amusing to say the least.

One of these days I might even spot something useful and let it thru my automatic firewall but to date nothing has motivated me in that direction.

Just reporting what's out there, as this is the "Search Engine Spider Identification" forum ;)

bobothecat

10:07 pm on Jun 18, 2006 (gmt 0)



Just reporting what's out there, as this is the "Search Engine Spider Identification" forum

...like trying to drum-up business for your Blog? :(

Iguana

10:34 pm on Jun 18, 2006 (gmt 0)

Hey, I'm a daily reader of incredibill's blog. How is he benefitting from my 'business'? I don't see adverts or links to benefit from the measly PR. I just thought it was a great read + a bit of info. Have I been duped again?

bobothecat

10:45 pm on Jun 18, 2006 (gmt 0)



Have I been duped again?

I've read his blog as well... when it's secondary news (originally reported on WebmasterWorld)... your guess is as good as mine.

Pfui

10:52 pm on Jun 18, 2006 (gmt 0)

Bill doesn't need defending, but bobo, methinks you've got the wrong impression.

I, too, am a reader (when Blogspot cooperates), and now-twice Anon poster because I've run into related info. Actually, it appears a bunch of us tend to hang out in both places.

There's a different feeling over there -- aside from Bill's cuss words (if I had a dollar...) -- although that might be nausea from its current Mid-Century Mod puke-green decor;)

Now if only we could find a place to exchange data privately, away from the eyes of those we're trying to stop.

bobothecat

10:57 pm on Jun 18, 2006 (gmt 0)



Bill doesn't need defending, but bobo, methinks you've got the wrong impression.

I'll be happy to agree... but I can't help thinking there's a bit of sampling going on.

But I do agree about a private forum. :)

incrediBILL

11:04 pm on Jun 18, 2006 (gmt 0)

...like trying to drum-up business for your Blog?

I don't care if people read that or not really, stopping the bots is all that's important.

I will admit I've double posted more than a couple of times when I thought the data should reach a wider audience, like 100+ recent instances of new bots using the same core crawler.

The problem with the moderated forum is that, although the signal-to-noise ratio is better, the speed to publish isn't, which can be a tad frustrating when new things show up.

But I do agree about a private forum.

No arguments here either, as I've noticed the wrong people read when I post no matter where it is, and I see them "upgrade" their wares, fixing things I've used to identify them.

Have I been duped again?

Not unless you fell into a copying machine ;)

[edited by: incrediBILL at 11:09 pm (utc) on June 18, 2006]

bobothecat

11:07 pm on Jun 18, 2006 (gmt 0)



Bill,

No hard feelings... guess this is what it sounds like when doves cry :)

Pfui

11:38 pm on Jun 18, 2006 (gmt 0)

Speaking of speed...

Stay tuned for a still-pending, four-bot post with details about the following, NONE of which ask for robots.txt:

- Dawang Version
- g3.pl (leaseweb)
- yoono
- SPIP (jujuscript)

Too bad mod Dan has a life when most of us, m'self included, might benefit from more of one... Weekend? Whazzat?

Anyway, and bringing this back around to the topic of this thread (heh):

No new Nutch sightings.

incrediBILL

11:57 pm on Jun 18, 2006 (gmt 0)

Here goes the thread off on a tangent ;)

g3.pl is a very busy spider that I've been watching for a while; it operates out of many IPs. yoono only hit a few times from 3 IPs, but the other 2 are new to me.

However, nothing new from Nutch has shown up since earlier today.

GaryK

3:51 am on Jun 19, 2006 (gmt 0)

I'm open to hosting a private forum for members only. There are just two problems: I don't want to offend Dan, and I don't want to get accused of trying to drive traffic to my site by anyone. I'm not directing my comment about that to anyone here. It's a valid concern, so I wanted to address it up-front.