Forum Moderators: open
209.25.87.38 - - [13/Feb/2003:01:50:59 +0100] "GET /robots.txt HTTP/1.0" 200 144 "-" "Nutch"
209.25.87.38 - - [13/Feb/2003:01:51:07 +0100] "GET / HTTP/1.0" 200 10151 "-" "Nutch"
A Whois turned up this rather useless information:
Cable & Wireless CWIX2 (NET-209-25-0-0-1)
209.25.0.0 - 209.25.127.255
Internet Business Services CWI-IBS66 (NET-209-25-86-0-1)
209.25.86.0 - 209.25.87.255
NT Sales IBS-NTS-03 (NET-209-25-86-0-2)
209.25.86.0 - 209.25.87.255
Malachiarts Networks NTS-MALI-01 (NET-209-25-87-32-1)
209.25.87.32 - 209.25.87.63
BUT WAIT! Check what happens if you point your browsers to 209.25.87.38!
New Search-engine and Who owns it?
[elib.cs.berkeley.edu...]
Nutch seems to be an attempt to create an open-source Search Engine.
Then it turns out that Malachiarts Networks is a colocation provider leasing rack space.
[malachiarts.com...]
I think if it were a venture by UC Berkeley, the attempts wouldn't have been so shoddy, and they would perhaps have provided some sort of pre-release.
I denied 209.25.87. this morning and may expand the block upward.
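For anyone wanting to do the same, a deny on that range might look like this in .htaccess. This is a sketch assuming Apache with the standard access-control module enabled; the trailing dot makes the Deny match the whole 209.25.87.x block:

```apache
# Block the 209.25.87.x range (classic Apache Order/Allow/Deny syntax)
Order Allow,Deny
Allow from all
Deny from 209.25.87.
```

Expanding upward to the parent allocation would just mean shortening the prefix, e.g. `Deny from 209.25.`.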
As most of you realize, my traffic goals are relative to my market. I'm not going to build a search engine on the subject content of my websites for anybody unless I'm compensated. I have over 1000 distinctive, categorized subject URLs which took many hours to accumulate. To allow somebody to absorb that in a few seconds is senseless.
UC Berkeley has a BIG research budget; let them tread in somebody else's direction. NOT MINE!
The thread (Domanova) immediately following this thread is a good example. The spider name (if I recall correctly) was Jack.
I was told either here or on alt.webmaster that it was a reputable and established bot. On that premise, after I had denied their bot, I removed my denies. Fortunately, they never returned. If they had? Where would my data and sweat of brow be? They've gone belly up! They're no longer responsible. Would they mine my data and sell it to somebody else for pennies? Somebody who holds less integrity?
I think all these questions MUST be asked before allowing a new bot to travel our/my sites.
If you have access to the FROM: header, you might find it points you towards SourceForge, or to be more specific:
[sourceforge.net...]
Which ties in with Equiano's finding.
- Tony
66.113.76.112 - - [27/Mar/2003:20:52:19 -0500] "GET /robots.txt HTTP/1.0" 200 2980 "-" "Nutch"
66.113.76.112 - - [27/Mar/2003:20:52:25 -0500] "GET /lgl_docs.html HTTP/1.0" 200 24831 "-" "Nutch"
Plonk!
Jim
66.113.76.112 - - [29/Mar/2003:13:47:59 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:13:48:07 -0800] "GET /Multicultural.html HTTP/1.0" 200 6698 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:14:04:38 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:14:04:43 -0800] "GET /Citation_Guides.html HTTP/1.0" 200 3101 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:14:46 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:15:01 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:15:02 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:15:07 -0800] "GET /Mega_Sites.html HTTP/1.0" 200 13642 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:32:31 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:32:37 -0800] "GET /Philosophy.html HTTP/1.0" 200 14159 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:42:14 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:15:42:20 -0800] "GET /History_A-I.html HTTP/1.0" 200 13207 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:13:18 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:13:38 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:13:53 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:08 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:23 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:39 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:14:54 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:15:03 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:15:09 -0800] "GET /Bar_Review.html HTTP/1.0" 200 2845 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:27:45 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:27:51 -0800] "GET /Aboriginal_Governance.html HTTP/1.0" 200 9565 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:46:45 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:16:46:50 -0800] "GET /Vocabulary.html HTTP/1.0" 200 4181 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:29 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:29 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:32 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:32 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:38 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:16:43 -0800] "GET /About_TheWall-Fact-Sheet.htm HTTP/1.0" 200 24391 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:54:24 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:54:30 -0800] "GET /Humanities.html HTTP/1.0" 200 11713 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:59:10 -0800] "GET /robots.txt HTTP/1.0" 200 188 "-" "http://www.nutch.org/"
66.113.76.112 - - [29/Mar/2003:17:59:19 -0800] "GET /Hmmmmm.html HTTP/1.0" 200 6112 "-" "http://www.nutch.org/"
<Oh, it now provides that URL you requested.>
Pendanticist.
I couldn't find anything on respecting robots.txt or what to use for exclusion.
Did I miss something?
Interesting that they do get the robots.txt. My experience has been that they don't respect it.
I'm one of the developers working on the Nutch project, and we were given a very friendly heads-up about the discussion here.
As a result of the feedback from this forum, we've made some changes to our robot. There were some bugfixes: we had intended to follow robots.txt directives, but as you noticed, we'd pulled the file without always obeying it. We have also moved our contact email from the HTTP "From" header into the User-agent header, and added a URL.
We have not made the index built from the misbehaving crawl public. We're in the process of recrawling with a patched version of our robot, and we're hoping the new version will not generate any further ill-will. We are truly sorry we violated netiquette, and we're working to right our ship.
We hope to win back those of you who've banned us, once we demonstrate that we've gotten our stuff together. In the meantime, we welcome bug reports and (hopefully constructive) criticism at nutch-agent@lists.sourceforge.net (which goes to our whole team).
-Tom
For those of you who've banned us, and won't see our bot URL in your logs, you can find it at:
[nutch.org...]
There is still a discrepancy between the described interpretation of robots.txt and the actual interpretation. The document at [nutch.org...] states that disallowing the user-agent "Nutch" in robots.txt will block both Nutch and NutchOrg. The example robots.txt supports this by showing an explicit allow for NutchOrg:
User-agent: NutchOrg
Disallow:
However, until last night, I had not yet seen the new Nutch robots information (at the URL included in the new UA string), and had included only
User-agent: Nutch
Disallow: /
Last night I logged this transaction:
66.113.76.112 - - [08/Apr/2003:00:11:31 -0400] "GET /robots.txt HTTP/1.1" 200 3002 "-" "NutchOrg/0.03-dev (Nutch; http://www.nutch.org/docs/bot.html; nutch-agent@lists.sourceforge.net)"
66.113.76.112 - - [08/Apr/2003:00:11:42 -0400] "GET / HTTP/1.1" 403 775 "-" "NutchOrg/0.03-dev (Nutch; http://www.nutch.org/docs/bot.html; nutch-agent@lists.sourceforge.net)"
So obviously, NutchOrg did not interpret my Disallow as applying to both Nutch and NutchOrg, and therefore tried to access my site.
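For reference, the 1994 robots exclusion standard recommends that a robot use a case-insensitive substring match of its name against each User-agent line, under which a record for "Nutch" would also apply to "NutchOrg". A minimal sketch of the difference (the function name is mine, not from the Nutch source):

```python
def agent_record_applies(record_name: str, robot_name: str) -> bool:
    """Return True if a robots.txt User-agent record applies to a robot.

    The 1994 robots exclusion standard recommends a case-insensitive
    substring match of the robot's name against the User-agent line,
    so a record naming "Nutch" also covers "NutchOrg". An exact string
    comparison (a plausible cause of the access logged above) does not.
    """
    if record_name == "*":          # wildcard record applies to all robots
        return True
    return record_name.lower() in robot_name.lower()

# Substring matching: a Disallow record for "Nutch" covers NutchOrg.
print(agent_record_applies("Nutch", "NutchOrg/0.03-dev"))  # True
# Exact matching would miss it.
print("Nutch" == "NutchOrg")                                # False
```

Under substring semantics, my single "User-agent: Nutch" record should have blocked the NutchOrg crawler as well.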
As a workaround, I had to install the following modified block in my .htaccess file to allow NutchOrg:
RewriteCond %{HTTP_USER_AGENT} ^Nutch
RewriteCond %{HTTP_USER_AGENT} !^NutchOrg
RewriteRule !^(403.*\.html|robots\.txt)$ - [F]
So, any misbehaviour of a robot, misidentification of the operating organization, or inaccuracy in its information page is viewed with great suspicion; it's self-defense.
The proper functioning and description of your robots.txt handling is critical.
One more thing: Don't make the same mistake as Grub.org. Specify in your contract of sale or terms of use (whichever is applicable) that using Nutch without a correct and accurate robot identification and operating-organization identification/contact string in the User-agent field is prohibited. Make it iron-clad: abusers immediately and unconditionally lose all rights to use Nutch if they break this rule, and they pay all legal fees necessary to stop their use.
Take it further: Specify that they must identify themselves and provide a correct and functional e-mail contact and robot information URL. Do not allow other organizations to use the SourceForge contact info; make them use their own.
Understand that one transgression might easily result in another wave of postings here and on other sites, and soon "Nutch" will be dead, because it will be disallowed from a large number of sites over time. You've built it; now you've got to protect and defend it from abusers. It's sad, but necessary; the Web ain't what it used to be.
The two items above are critical for the survival of Nutch. Otherwise, it will join the list of useful-but-uncontrolled user-agents like Indy Library and Grub, which most savvy webmasters block (or allow only on a case-by-case basis).
Hope this helps,
Jim
Thank you for the very thoughtful message. I will be forwarding your advice regarding the licensing terms to our developer list.
Sorry this is a short and rushed note, but I will be dashing off to look into the robots.txt violation you've reported. I'll respectfully submit that we don't want to update our bot page; we want to fix our bot so that it behaves according to that description. I'm hoping the remainder of this thread can be about how the bot is now behaving itself.
I'd also like to mention that I'm glad you're giving us the benefit of the doubt here by still allowing our agent via your .htaccess rule. We do appreciate that!
-Tom
As long as the 'bot behaves as described, or is described as it actually behaves, you should be OK.
With the development UA name containing the release UA name as a prefix, you're risking some loss of 'net space. You might want to consider calling the development 'bot something like "NutchOrg" and the release version "NutchBot", or anything else that eliminates the need for an explicit allow for NutchOrg whenever a disallow for Nutch is present in robots.txt.
This would simplify webmasters' jobs, allow you to correctly support partial UA matching when analyzing robots.txt, and comply with the robots.txt standard for partial UA matching as well, all without requiring a second entry in robots.txt.
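Until such a rename, a webmaster who wants to block the release crawler while admitting the development one apparently needs both records, along the lines of the nutch.org example (the record order here is my assumption; a conforming bot should pick the record that best matches its name):

```
# Explicitly allow the development crawler...
User-agent: NutchOrg
Disallow:

# ...while blocking the release crawler.
User-agent: Nutch
Disallow: /
```

With non-overlapping names, the first record would simply be unnecessary.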
I suspect this partial-matching problem led someone on your team to implement an "exact-match-required" modification, and that this is why NutchOrg thought it could access my site when "Nutch" was disallowed, and why the 'bot no longer behaves as described. This could be avoided with the naming change recommended above.
Hope this helps - and thanks for your responsiveness.
Jim