Nutch

     
11:15 am on Feb 13, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


209.25.87.38 - - [13/Feb/2003:02:07:49 -0800] "GET /robots.txt HTTP/1.0" 200 2058 "-" "Nutch"

Don't have time this AM to explore.
Google returned a few resources, although they weren't working for some reason.

It only read robots.txt and one other page.

11:26 am on Feb 13, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 24, 2002
posts:87
votes: 0


:-) Guess you beat me to it. Read my topic for more info. Looks like a new search engine; it even has a retrievable cache.
11:21 am on Feb 13, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 24, 2002
posts:87
votes: 0


I found this in my logs:

209.25.87.38 - - [13/Feb/2003:01:50:59 +0100] "GET /robots.txt HTTP/1.0" 200 144 "-" "Nutch"
209.25.87.38 - - [13/Feb/2003:01:51:07 +0100] "GET / HTTP/1.0" 200 10151 "-" "Nutch"

A Whois turned up this rather useless information:

Cable & Wireless CWIX2 (NET-209-25-0-0-1)
209.25.0.0 - 209.25.127.255
Internet Business Services CWI-IBS66 (NET-209-25-86-0-1)
209.25.86.0 - 209.25.87.255
NT Sales IBS-NTS-03 (NET-209-25-86-0-2)
209.25.86.0 - 209.25.87.255
Malachiarts Networks NTS-MALI-01 (NET-209-25-87-32-1)
209.25.87.32 - 209.25.87.63
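
As a rough cross-check of the ranges above, here is a minimal Python sketch using only the standard `ipaddress` module. The names `RANGES`, `CRAWLER_IP` and `to_cidr` are made up for illustration; the addresses are copied from the Whois output and the log lines. It converts each listed range to CIDR notation and confirms that the crawler's address sits inside every level of the delegation.

```python
import ipaddress

# Ranges copied from the Whois output above, as (start, end) pairs.
RANGES = [
    ("209.25.0.0", "209.25.127.255"),   # Cable & Wireless
    ("209.25.86.0", "209.25.87.255"),   # Internet Business Services / NT Sales
    ("209.25.87.32", "209.25.87.63"),   # Malachiarts Networks
]

# Address that requested robots.txt in the log lines.
CRAWLER_IP = ipaddress.ip_address("209.25.87.38")


def to_cidr(start, end):
    """Return the CIDR blocks that exactly cover the range start..end."""
    return list(
        ipaddress.summarize_address_range(
            ipaddress.ip_address(start), ipaddress.ip_address(end)
        )
    )


for start, end in RANGES:
    blocks = to_cidr(start, end)
    inside = any(CRAWLER_IP in block for block in blocks)
    names = ", ".join(str(block) for block in blocks)
    print(f"{start} - {end} = {names}; contains crawler: {inside}")
```

The smallest allocation works out to a /27 (32 addresses), which is the tightest range one could deny while still covering the address seen in the logs.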

BUT WAIT! Check what happens if you point your browser to 209.25.87.38!

New Search-engine and Who owns it?

11:51 am on Feb 13, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 5, 2003
posts:47
votes: 0


Found this

[elib.cs.berkeley.edu...]

Nutch seems to be an attempt to create an open-source Search Engine.

10:05 pm on Feb 13, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


I cannot recall seeing a subnet this far separated from the backbone provider. Three levels down.
IMO it doesn't say much for either the security or the integrity of any of the four levels of the net. :(

Cable & Wireless CWIX2 (NET-209-25-0-0-1)
209.25.0.0 - 209.25.127.255
Internet Business Services CWI-IBS66 (NET-209-25-86-0-1)
209.25.86.0 - 209.25.87.255
NT Sales IBS-NTS-03 (NET-209-25-86-0-2)
209.25.86.0 - 209.25.87.255
Malachiarts Networks NTS-MALI-01 (NET-209-25-87-32-1)
209.25.87.32 - 209.25.87.63

Then it turns out that Malachiarts Networks is a colocation provider leasing rack space.
[malachiarts.com...]

I think if it were a venture by U of Cal, Berkeley, the attempts wouldn't have been so shoddy, and they would perhaps have provided some sort of pre-release notice.

I denied 209.25.87. this AM and may expand upward.

As most of you realize, my traffic goals are relative to my market. I'm not going to build a search engine on the subject content of my websites for anybody unless I'm compensated. I have over 1000 distinctive and categorized subject URLs, which took many hours to accumulate. To allow somebody to absorb that in a few seconds is senseless.
U of Cal has a BIG research budget; let them tread in somebody else's direction. NOT MINE!

The thread (Domanova) immediately following this thread is a good example. The spider name (if I recall correctly) was Jack.
I was told either here or on alt.webmaster that it was a reputable and established bot. On that premise, after I had denied their bot, I removed my denies. FORTUNATELY they never returned. If they had? Where would my data and sweat of brow be? They've gone belly up! They're no longer responsible? Would they pass my data and permission on to somebody else for pennies? Somebody who holds a lesser integrity?
I think all these questions MUST be asked before allowing a new bot to travel our/my sites.

9:16 pm on Feb 20, 2003 (gmt 0)

Junior Member

10+ Year Member

joined:Aug 18, 2002
posts:131
votes: 0


I'm denying too, only because they fail to provide a URL in their user-agent string. I don't have much hope for them if they can't even do that right.
9:43 pm on Feb 20, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 25, 2002
posts:872
votes: 0


I spotted Nutch too...

If you have access to the From: header, you might find it points you towards SourceForge, or to be more specific:

[sourceforge.net...]

Which ties in with Equiano's finding.

- Tony

6:14 pm on Mar 22, 2003 (gmt 0)

New User

10+ Year Member

joined:Mar 9, 2003
posts:16
votes: 0


I just got hit this morning,

66-113-76-112.rev.ibsinc.com - - [22/Mar/2003:xx:xx:xx -0x00] "GET /robots.txt HTTP/1.0" 200 543 "-" "Nutch"

It hit robots.txt and the root, nothing else. I will try a Disallow in robots.txt; not sure it obeys.

10:04 pm on Mar 25, 2003 (gmt 0)

Full Member

10+ Year Member

joined:Oct 22, 2002
posts:217
votes: 0


Got hit this morning & still no URL in the UA string. Emailed the creators this afternoon and requested that they add the URL.

I read over their project description & other notes (http://www.nutch.org/docs/index.html) and it looks interesting. Certainly wish them the best.

4:42 am on Mar 28, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Had a visit from nutch just now. It requested my (validated) robots.txt, and then promptly violated it.

66.113.76.112 - - [27/Mar/2003:20:52:19 -0500] "GET /robots.txt HTTP/1.0" 200 2980 "-" "Nutch"
66.113.76.112 - - [27/Mar/2003:20:52:25 -0500] "GET /lgl_docs.html HTTP/1.0" 200 24831 "-" "Nutch"

Plonk!

Jim
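
For anyone who wants to pull this pattern out of a larger log, here is a minimal Python sketch. It assumes the log is in the combined format shown above; `LOG_RE`, `parse_line` and `fetched_after_robots` are hypothetical names chosen for this example. It lists every path a given agent requested after its host had already fetched robots.txt. Whether a listed path was actually disallowed still has to be checked against the site's own robots.txt, which is not reproduced in this thread.

```python
import re

# Combined log format:
# host ident user [time] "method path protocol" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


def fetched_after_robots(lines, agent_prefix):
    """List paths requested by matching agents after their host fetched robots.txt."""
    hosts_with_robots = set()
    paths = []
    for line in lines:
        entry = parse_line(line)
        if entry is None or not entry["agent"].startswith(agent_prefix):
            continue
        if entry["path"] == "/robots.txt":
            hosts_with_robots.add(entry["host"])
        elif entry["host"] in hosts_with_robots:
            paths.append(entry["path"])
    return paths


# The two lines quoted in the post above.
SAMPLE = [
    '66.113.76.112 - - [27/Mar/2003:20:52:19 -0500] "GET /robots.txt HTTP/1.0" 200 2980 "-" "Nutch"',
    '66.113.76.112 - - [27/Mar/2003:20:52:25 -0500] "GET /lgl_docs.html HTTP/1.0" 200 24831 "-" "Nutch"',
]

print(fetched_after_robots(SAMPLE, "Nutch"))  # ['/lgl_docs.html']
```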

4:49 am on Mar 28, 2003 (gmt 0)

New User

10+ Year Member

joined:Mar 9, 2003
posts:16
votes: 0


I got my answer. It doesn't respect robots.txt.
It hit robots.txt and 6 seconds later went straight to root.
I know it looks harmless now, but if or when it starts traversing image directories that carry a Disallow, it will be too late.
Going for the ban.
3:35 am on Mar 30, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 27, 2002
posts:1685
votes: 0


This Nutch is a strange-acting thing....

66.113.76.112 - - [29/Mar/2003:13:47:59 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:13:48:07 -0800] "GET /Multicultural.html HTTP/1.0" 200 6698 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:14:04:38 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:14:04:43 -0800] "GET /Citation_Guides.html HTTP/1.0" 200 3101 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:14:46 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:15:01 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:15:02 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"

66.113.76.112 - - [29/Mar/2003:15:15:07 -0800] "GET /Mega_Sites.html HTTP/1.0" 200 13642 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:32:31 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:32:37 -0800] "GET /Philosophy.html HTTP/1.0" 200 14159 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:42:14 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:42:20 -0800] "GET /History_A-I.html HTTP/1.0" 200 13207 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:13:18 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:13:38 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:13:53 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:08 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:23 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:39 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:54 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:15:03 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"

66.113.76.112 - - [29/Mar/2003:16:15:09 -0800] "GET /Bar_Review.html HTTP/1.0" 200 2845 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:27:45 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:27:51 -0800] "GET /Aboriginal_Governance.html HTTP/1.0" 200 9565 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:46:45 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:46:50 -0800] "GET /Vocabulary.html HTTP/1.0" 200 4181 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:29 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:29 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:32 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:32 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:38 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"

66.113.76.112 - - [29/Mar/2003:17:16:43 -0800] "GET /About_TheWall-Fact-Sheet.htm HTTP/1.0" 200 24391 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:54:24 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:54:30 -0800] "GET /Humanities.html HTTP/1.0" 200 11713 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:59:10 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:59:19 -0800] "GET /Hmmmmm.html HTTP/1.0" 200 6112 "-" "http://www.nutch.org/"

<Oh, it now provides that url you requested.>

Pendanticist.

4:36 am on Mar 30, 2003 (gmt 0)

New User

10+ Year Member

joined:Mar 9, 2003
posts:16
votes: 0


Went to [nutch.org...]

I couldn't find anything on respecting robots.txt or what to use for exclusion.
Did I miss something?

Interesting that they do get the robots.txt. My experience has been that they don't respect it.

11:15 am on Mar 30, 2003 (gmt 0)

New User

10+ Year Member

joined:Mar 30, 2003
posts:27
votes: 0


Nutch visited my pages a few times last week; twice yesterday it hit my spam trap and blocked itself.
A blatant violation of robots.txt.

Two IPs yesterday, both blocked:
66.113.76.112
66.113.76.111

oh well.

7:55 am on Apr 8, 2003 (gmt 0)

New User

10+ Year Member

joined:Apr 7, 2003
posts:3
votes: 0


Hi,

I'm one of the developers working on the Nutch project, and we were given a very friendly heads-up about the discussion here.

As a result of the feedback from this forum, we've made some changes to our robot. There were some bug fixes: we had intended to follow robots.txt directives, and as you noticed, we pulled the file but did not always obey it. We have also moved our contact email from the HTTP "From" header into the User-agent header, and added a URL.

We have not made the index built from the misbehaving crawl public. We're in the process of recrawling with a patched version of our robot, and we're hoping the new version will not generate any further ill-will. We are truly sorry we violated netiquette, and we're working to right our ship.

We hope to win back those of you who've banned us, once we demonstrate that we've gotten our stuff together. In the meantime, we welcome bug reports and (hopefully constructive) criticism at nutch-agent@lists.sourceforge.net (which goes to our whole team).

-Tom

7:59 am on Apr 8, 2003 (gmt 0)

New User

10+ Year Member

joined:Apr 7, 2003
posts:3
votes: 0


PS:

For those of you who've banned us, and won't see our bot URL in your logs, you can find it at:

[nutch.org...]

2:35 pm on Apr 8, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Parliaments,

There is still a discrepancy between the described interpretation of robots.txt and the actual interpretation. The document at [nutch.org...] states that disallowing the user-agent "Nutch" in robots.txt will block both Nutch and NutchOrg. The example robots.txt supports this by showing that


User-agent: NutchOrg
Disallow:

should be included with a disallow of "Nutch" if we desire to allow access for the development version, but not the "commercial" version.

However, until last night, I had not yet seen the new Nutch robots information (at the URL included in the new ua string), and had included only


User-agent: Nutch
Disallow: /

Last night I logged this transaction:


66.113.76.112 - - [08/Apr/2003:00:11:31 -0400] "GET /robots.txt HTTP/1.1" 200 3002 "-" "NutchOrg/0.03-dev (Nutch; http://www.nutch.org/docs/bot.html; nutch-agent@lists.sourceforge.net)"
66.113.76.112 - - [08/Apr/2003:00:11:42 -0400] "GET / HTTP/1.1" 403 775 "-" "NutchOrg/0.03-dev (Nutch; http://www.nutch.org/docs/bot.html; nutch-agent@lists.sourceforge.net)"

So obviously, NutchOrg did not interpret my Disallow as applying to both Nutch and NutchOrg, and therefore tried to access my site.
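
To illustrate the interpretation the Nutch page describes, here is a small Python sketch using the standard library's `urllib.robotparser`, which matches the robots.txt agent token as a case-insensitive substring of the robot's name (the approach recommended by the original robots exclusion document). This is only a reference point for the documented behaviour, not Nutch's own code; the helper `allowed` and the constant names are invented for this example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url="/"):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The record that was in place when the transaction above was logged.
ONLY_NUTCH = """\
User-agent: Nutch
Disallow: /
"""

# The form shown in the Nutch example: exempt NutchOrg, block everything else named Nutch.
WITH_EXEMPTION = """\
User-agent: NutchOrg
Disallow:

User-agent: Nutch
Disallow: /
"""

# "Nutch" is a substring of "NutchOrg", so the single record covers both names.
print(allowed(ONLY_NUTCH, "NutchOrg/0.03-dev"))      # False
print(allowed(ONLY_NUTCH, "Nutch"))                  # False

# With the explicit NutchOrg record listed first, only plain "Nutch" is blocked.
print(allowed(WITH_EXEMPTION, "NutchOrg/0.03-dev"))  # True
print(allowed(WITH_EXEMPTION, "Nutch"))              # False
```

Under this substring reading, the request for `/` logged above should not have been made.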

As a workaround, I had to install the following modified block in my .htaccess file to allow NutchOrg:


RewriteCond %{HTTP_USER_AGENT} ^Nutch
RewriteCond %{HTTP_USER_AGENT} !^NutchOrg
RewriteRule !^(403.*\.html|robots\.txt)$ - [F]
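
To make the effect of these three directives concrete, here is a rough Python emulation. It is not Apache itself: `is_forbidden` is a made-up helper, the alternation in the pattern is taken to be an ordinary `|`, and paths are written without a leading slash on the assumption that this is how mod_rewrite presents them in .htaccess context. The two conditions are ANDed, and the leading `!` on the rule pattern means the rule fires for every path that does NOT match.

```python
import re

# Paths that stay reachable even for blocked agents: robots.txt and the 403 error pages.
EXEMPT_PATHS = re.compile(r"^(403.*\.html|robots\.txt)$")


def is_forbidden(user_agent, path):
    """Emulate the rule set: answer 403 when the agent starts with 'Nutch',
    does not start with 'NutchOrg', and the path is not an exempt one."""
    starts_with_nutch = re.search(r"^Nutch", user_agent) is not None
    starts_with_nutchorg = re.search(r"^NutchOrg", user_agent) is not None
    path_is_exempt = EXEMPT_PATHS.search(path) is not None
    return starts_with_nutch and not starts_with_nutchorg and not path_is_exempt


DEV_AGENT = "NutchOrg/0.03-dev (Nutch; http://www.nutch.org/docs/bot.html; nutch-agent@lists.sourceforge.net)"

print(is_forbidden("Nutch", "index.html"))    # True: plain Nutch is blocked
print(is_forbidden("Nutch", "robots.txt"))    # False: robots.txt stays readable
print(is_forbidden(DEV_AGENT, "index.html"))  # False: NutchOrg is let through
print(is_forbidden("Mozilla/4.0", "index.html"))  # False: other agents are unaffected
```

Leaving robots.txt and the error page exempt matters: a blocked robot can still read the Disallow that applies to it, and the server can still serve its custom 403 document.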

The robots information you have published should be brought into correspondence with the actual behaviour of the robot as quickly as possible. As you have seen here, we have only the behaviour of the robot - and most often nothing else - by which to judge user-agents. Because of past abuse, some of us are a bit picky about who we "donate" bandwidth to. Oftentimes, permissiveness is rewarded by having our entire site downloaded, and a subsequent flood of new unsolicited e-mail for Viagra, etc. Other times, the webmaster will notice that someone has copied his entire site, and simply made a few changes to re-brand it. This often means spending a lot of money to hire an attorney, and sometimes involves international law. Not to mention the small webmaster whose entire site is thus downloaded, causing him to go over his bandwidth quota for the month, invoking penalties - or forcing him to take his site off-line to avoid penalties.

So, any misbehaviour of a robot, misidentification of the using organization, or inaccuracy in its information page is viewed with great suspicion - It's self-defense.

The proper functioning and description of your robots.txt handling is critical.

One more thing: Don't make the same mistake as Grub.org. Specify in your contract of sale or terms of use (whichever is applicable) that using Nutch without a correct and accurate robot identification and using organization identification/contact string in the user-agent field is prohibited. Make it iron-clad. Make it so that abusers immediately and unconditionally lose all rights to use Nutch if they break this rule, and they pay for all legal fees necessary to stop their use. Take it further: Specify that they must identify themselves and provide a correct and functional e-mail contact and robot information URL. Do not allow other organizations to use the sourceforge contact info - make them use their own. Understand that one transgression might easily result in another wave of postings here and on other sites, and soon "Nutch" will be dead, because it will be disallowed from a large number of sites over time. You've built it, now you've got to protect and defend it from abusers. It's sad, but necessary; the Web ain't what it used to be.

The two items above are critical for the survival of Nutch. Otherwise, it will join the list of useful-but-uncontrolled user-agents like Indy Library and Grub, which most savvy webmasters block (or allow only on a case-by-case basis).

Hope this helps,
Jim

4:13 pm on Apr 8, 2003 (gmt 0)

New User

10+ Year Member

joined:Apr 7, 2003
posts:3
votes: 0


Jim,

Thank you for the very thoughtful message. I will be forwarding your advice regarding the licensing terms to our developer list.

Sorry this is a short and rushed note, but I will be dashing off to look into the violation of robots.txt you've reported. I'll respectfully submit that we don't want to update our bot page- we want to fix our bot so it behaves according to that description. I'm hoping the remainder of this thread can be about how the bot is now behaving itself.

I'd also like to mention that I am glad that you're giving us the benefit of the doubt here- by still allowing our agent via your .htaccess rule. We do appreciate that!

-Tom

4:32 pm on Apr 8, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Parliaments,

As long as the 'bot behaves as described, or is described as it actually behaves, you should be OK.

With the development UA being longer than the release UA, you're risking some loss of 'net space - You might want to consider calling the development 'bot something like "NutchOrg" and the release version "NutchBot" - anything that will eliminate the requirement to have an "allow" for NutchOrg if a disallow for Nutch is present in robots.txt.

This would simplify webmasters' jobs, allow you to correctly support partial UA matching when analyzing robots.txt, and comply with the robots.txt standard for partial UA matching as well, all without requiring a second entry in robots.txt.

I suspect this partial-matching problem led someone on your team to implement an "exact-match-required" modification, and that is the reason that NutchOrg thought it could access my site when "Nutch" was disallowed and that the 'bot no longer behaves as described. This could be avoided with a naming change as recommended above.

Hope this helps - and thanks for your responsiveness.
Jim
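
The naming point can be sketched in a few lines of Python. This assumes the case-insensitive substring matching discussed above; `matches` and `collides` are hypothetical helpers, and "NutchBot" is only the example name suggested in the post, not a real release name.

```python
def matches(robots_token, bot_name):
    """Case-insensitive substring match of a robots.txt User-agent token against a bot name."""
    return robots_token.lower() in bot_name.lower()


def collides(name_a, name_b):
    """True if a record written for one bot name would also catch the other."""
    return matches(name_a, name_b) or matches(name_b, name_a)


# Current scheme: the release name is a prefix of the development name.
print(collides("Nutch", "NutchOrg"))     # True

# Suggested scheme: neither name contains the other.
print(collides("NutchOrg", "NutchBot"))  # False

# A single "Nutch" record would still cover both robots when that is what the webmaster wants.
print(matches("Nutch", "NutchOrg"), matches("Nutch", "NutchBot"))  # True True
```

With non-overlapping names, a webmaster can block either robot with one record, or both with the shared prefix, without needing a separate "allow" entry.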

8:10 pm on Apr 11, 2003 (gmt 0)

New User

10+ Year Member

joined:Mar 30, 2003
posts:27
votes: 0


Well, I turned Nutch back on again to see what they do next. They haven't come around yet, though.
3:20 am on Apr 25, 2003 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


Hi:

Nutch has been around the last couple of days for me. Must report nothing bad so far. The request rate is never more than one page every five seconds, but averages MUCH less than that; one every couple of minutes is about typical.

Has read robots, and so far, no violations!

dave

12:51 am on Apr 26, 2003 (gmt 0)

New User

10+ Year Member

joined:Mar 30, 2003
posts:27
votes: 0


They showed up the other day and looked OK. I would like to see what they are doing with the results of this.
 
