Forum Moderators: open
209.25.87.38 - - [13/Feb/2003:01:50:59 +0100] "GET /robots.txt HTTP/1.0" 200 144 "-" "Nutch"
209.25.87.38 - - [13/Feb/2003:01:51:07 +0100] "GET / HTTP/1.0" 200 10151 "-" "Nutch"
A Whois turned up this rather useless information:
Cable & Wireless CWIX2 (NET-209-25-0-0-1)
209.25.0.0 - 209.25.127.255
Internet Business Services CWI-IBS66 (NET-209-25-86-0-1)
209.25.86.0 - 209.25.87.255
NT Sales IBS-NTS-03 (NET-209-25-86-0-2)
209.25.86.0 - 209.25.87.255
Malachiarts Networks NTS-MALI-01 (NET-209-25-87-32-1)
209.25.87.32 - 209.25.87.63
BUT WAIT! Check what happens if you point your browsers to 209.25.87.38!
New Search-engine and Who owns it?
[elib.cs.berkeley.edu...]
Nutch seems to be an attempt to create an open-source Search Engine.
Then it turns out that Malachiarts Networks is a colocation provider leasing rack space.
[malachiarts.com...]
I think if it were a venture by UC Berkeley, the attempts wouldn't have been so shoddy, and they would perhaps have provided some sort of pre-release.
I denied 209.25.87. this morning and may expand the block upward.
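For anyone wanting to do the same, a deny on that range might look like this in .htaccess. This is a sketch assuming Apache with the standard access-control module enabled; the trailing dot makes the Deny match the whole 209.25.87.x block:

```apache
# Block the 209.25.87.x range (classic Apache Order/Allow/Deny syntax)
Order Allow,Deny
Allow from all
Deny from 209.25.87.
```

Expanding upward to the parent allocation would just mean shortening the prefix, e.g. `Deny from 209.25.`.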
As most of you realize, my traffic goals are relative to my market. I'm not going to build a search engine on the subject content of my websites for anybody unless I'm compensated. I have over 1000 distinctive, categorized subject URLs which took many hours to accumulate. To allow somebody to absorb that in a few seconds is senseless.
UC Berkeley has a BIG research budget; let them tread in somebody else's direction. NOT MINE!
The thread (Domanova) immediately following this thread is a good example. The spider name (if I recall correctly) was Jack.
I was told either here or on alt.webmaster that it was a reputable and established bot. On that premise, after I had denied their bot, I removed my denies. Fortunately, they never returned. If they had? Where would my data and sweat of brow be? They've gone belly up! They're no longer responsible. Would they mine my data and sell it to somebody else for pennies? Somebody who holds less integrity?
I think all these questions MUST be asked before allowing a new bot to travel our/my sites.
If you have access to the FROM: header, you might find it points you towards SourceForge, or to be more specific:
[sourceforge.net...]
Which ties in with Equiano's finding.
- Tony
66.113.76.112 - - [27/Mar/2003:20:52:19 -0500] "GET /robots.txt HTTP/1.0" 200 2980 "-" "Nutch"
66.113.76.112 - - [27/Mar/2003:20:52:25 -0500] "GET /lgl_docs.html HTTP/1.0" 200 24831 "-" "Nutch"
Plonk!
Jim
66.113.76.112 - - [29/Mar/2003:13:47:59 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:13:48:07 -0800] "GET /Multicultural.html HTTP/1.0" 200 6698 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:14:04:38 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:14:04:43 -0800] "GET /Citation_Guides.html HTTP/1.0" 200 3101 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:14:46 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:15:01 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:15:02 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:15:07 -0800] "GET /Mega_Sites.html HTTP/1.0" 200 13642 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:32:31 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:32:37 -0800] "GET /Philosophy.html HTTP/1.0" 200 14159 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:42:14 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:42:20 -0800] "GET /History_A-I.html HTTP/1.0" 200 13207 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:13:18 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:13:38 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:13:53 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:08 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:23 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:39 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:54 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:15:03 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:15:09 -0800] "GET /Bar_Review.html HTTP/1.0" 200 2845 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:27:45 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:27:51 -0800] "GET /Aboriginal_Governance.html HTTP/1.0" 200 9565 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:46:45 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:46:50 -0800] "GET /Vocabulary.html HTTP/1.0" 200 4181 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:29 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:29 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:32 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:32 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:38 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:43 -0800] "GET /About_TheWall-Fact-Sheet.htm HTTP/1.0" 200 24391 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:54:24 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:54:30 -0800] "GET /Humanities.html HTTP/1.0" 200 11713 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:59:10 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:59:19 -0800] "GET /Hmmmmm.html HTTP/1.0" 200 6112 "-" "http://www.nutch.org/"
<Oh, it now provides that URL you requested.>
Pendanticist.
I couldn't find anything on respecting robots.txt or what to use for exclusion.
Did I miss something?
Interesting that they do get the robots.txt. My experience has been that they don't respect it.
I'm one of the developers working on the Nutch project, and we were given a very friendly heads-up about the discussion here.
As a result of the feedback from this forum, we've made some changes to our robot. There were some bugfixes: we had intended to follow robots.txt directives, but as you noticed, we'd pulled the file without always obeying it. We have also moved our contact email from the HTTP "From" header into the User-agent header, and added a URL.
We have not made the index built from the misbehaving crawl public. We're in the process of recrawling with a patched version of our robot, and we're hoping the new version will not generate any further ill-will. We are truly sorry we violated netiquette, and we're working to right our ship.
We hope to win back those of you who've banned us, once we demonstrate that we've gotten our stuff together. In the meantime, we welcome bug reports and (hopefully constructive) criticism at nutch-agent@lists.sourceforge.net (which goes to our whole team).
-Tom
For those of you who've banned us, and won't see our bot URL in your logs, you can find it at:
[nutch.org...]
There is still a discrepancy between the described interpretation of robots.txt and the actual interpretation. The document at [nutch.org...] states that disallowing the user-agent "Nutch" in robots.txt will block both Nutch and NutchOrg. The example robots.txt supports this by showing an explicit allow for NutchOrg:
User-agent: NutchOrg
Disallow:
However, until last night, I had not yet seen the new Nutch robots information (at the URL included in the new UA string), and had included only
User-agent: Nutch
Disallow: /
Last night I logged this transaction:
66.113.76.112 - - [08/Apr/2003:00:11:31 -0400] "GET /robots.txt HTTP/1.1" 200 3002 "-" "NutchOrg/0.03-dev (Nutch; http://www.nutch.org/docs/bot.html; nutch-agent@lists.sourceforge.net)"
66.113.76.112 - - [08/Apr/2003:00:11:42 -0400] "GET / HTTP/1.1" 403 775 "-" "NutchOrg/0.03-dev (Nutch; http://www.nutch.org/docs/bot.html; nutch-agent@lists.sourceforge.net)"
So obviously, NutchOrg did not interpret my Disallow as applying to both Nutch and NutchOrg, and therefore tried to access my site.
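For reference, the 1994 robots exclusion standard recommends that a robot use a case-insensitive substring match of its name against each User-agent line, under which a record for "Nutch" would also apply to "NutchOrg". A minimal sketch of the difference (the function name is mine, not from the Nutch source):

```python
def agent_record_applies(record_name: str, robot_name: str) -> bool:
    """Return True if a robots.txt User-agent record applies to a robot.

    The 1994 robots exclusion standard recommends a case-insensitive
    substring match of the robot's name against the User-agent line,
    so a record naming "Nutch" also covers "NutchOrg". An exact string
    comparison (a plausible cause of the access logged above) does not.
    """
    if record_name == "*":          # wildcard record applies to all robots
        return True
    return record_name.lower() in robot_name.lower()

# Substring matching: a Disallow record for "Nutch" covers NutchOrg.
print(agent_record_applies("Nutch", "NutchOrg/0.03-dev"))  # True
# Exact matching would miss it.
print("Nutch" == "NutchOrg")                                # False
```

Under substring semantics, my single "User-agent: Nutch" record should have blocked the NutchOrg crawler as well.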
As a workaround, I had to install the following modified block in my .htaccess file to allow NutchOrg:
RewriteCond %{HTTP_USER_AGENT} ^Nutch
RewriteCond %{HTTP_USER_AGENT} !^NutchOrg
RewriteRule !^(403.*\.html|robots\.txt)$ - [F]
So, any misbehaviour of a robot, misidentification of the operating organization, or inaccuracy in its information page is viewed with great suspicion; it's self-defense.
The proper functioning and description of your robots.txt handling is critical.
One more thing: Don't make the same mistake as Grub.org. Specify in your contract of sale or terms of use (whichever is applicable) that using Nutch without a correct and accurate robot identification and operating-organization identification/contact string in the User-agent field is prohibited. Make it iron-clad: abusers immediately and unconditionally lose all rights to use Nutch if they break this rule, and they pay all legal fees necessary to stop their use.
Take it further: Specify that they must identify themselves and provide a correct and functional e-mail contact and robot information URL. Do not allow other organizations to use the SourceForge contact info; make them use their own.
Understand that one transgression might easily result in another wave of postings here and on other sites, and soon "Nutch" will be dead, because it will be disallowed from a large number of sites over time. You've built it; now you've got to protect and defend it from abusers. It's sad, but necessary; the Web ain't what it used to be.
The two items above are critical for the survival of Nutch. Otherwise, it will join the list of useful-but-uncontrolled user-agents like Indy Library and Grub, which most savvy webmasters block (or allow only on a case-by-case basis).
Hope this helps,
Jim
Thank you for the very thoughtful message. I will be forwarding your advice regarding the licensing terms to our developer list.
Sorry this is a short and rushed note, but I will be dashing off to look into the robots.txt violation you've reported. I'll respectfully submit that we don't want to update our bot page; we want to fix our bot so that it behaves according to that description. I'm hoping the remainder of this thread can be about how the bot is now behaving itself.
I'd also like to mention that I'm glad you're giving us the benefit of the doubt here by still allowing our agent via your .htaccess rule. We do appreciate that!
-Tom
As long as the 'bot behaves as described, or is described as it actually behaves, you should be OK.
With the development UA name containing the release UA name as a prefix, you're risking some loss of 'net space. You might want to consider calling the development 'bot something like "NutchOrg" and the release version "NutchBot", or anything else that eliminates the need for an explicit allow for NutchOrg whenever a disallow for Nutch is present in robots.txt.
This would simplify webmasters' jobs, allow you to correctly support partial UA matching when analyzing robots.txt, and comply with the robots.txt standard for partial UA matching as well, all without requiring a second entry in robots.txt.
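Until such a rename, a webmaster who wants to block the release crawler while admitting the development one apparently needs both records, along the lines of the nutch.org example (the record order here is my assumption; a conforming bot should pick the record that best matches its name):

```
# Explicitly allow the development crawler...
User-agent: NutchOrg
Disallow:

# ...while blocking the release crawler.
User-agent: Nutch
Disallow: /
```

With non-overlapping names, the first record would simply be unnecessary.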
I suspect this partial-matching problem led someone on your team to implement an "exact-match-required" modification, and that this is why NutchOrg thought it could access my site when "Nutch" was disallowed, and why the 'bot no longer behaves as described. This could be avoided with the naming change recommended above.
Hope this helps - and thanks for your responsiveness.
Jim