"NutchCVS" (again) but from penguin26.parc.xerox.com

No robots.txt

         

Pfui

7:13 pm on Jul 25, 2006 (gmt 0)


Everyone uses Nutch, but this bot-runner surprised me. It came out of nowhere and has been hitting hot and heavy for two days, requesting existing files and some surprisingly long-gone ones (dirs/filenames removed from the log excerpt below). Wonder where they found out what to hit? Hmm.

penguin26.parc.xerox.com - - [25/Jul/2006:03:52:01 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:03:54:18 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:08:58:00 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:01:18 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:01:33 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:02:03 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:02:23 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:02:26 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:03:19 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:04:25 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:37:39 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:38:56 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:40:59 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:09:48:09 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
penguin26.parc.xerox.com - - [25/Jul/2006:11:28:31 -0700]
"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"

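For anyone who wants to put a number on "hot and heavy", here is a minimal Python sketch that tallies requests per hour from an excerpt shaped like the one above. The function name hits_per_hour and the regular expression are my own assumptions, not part of Nutch or any log tool. It reads only the bracketed timestamp and ignores the user-agent lines.

```python
import re
from collections import Counter
from datetime import datetime

# Matches an access-log timestamp such as [25/Jul/2006:03:52:01 -0700].
# The stricter pattern avoids the bracketed link text in the user-agent lines.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def hits_per_hour(lines):
    """Count log entries per hour of day (server-local time, as logged)."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # user-agent line or anything else without a timestamp
        when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        counts[when.hour] += 1
    return counts


# The first three entries from the excerpt, in the same two-line layout.
SAMPLE = [
    'penguin26.parc.xerox.com - - [25/Jul/2006:03:52:01 -0700]',
    '"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"',
    'penguin26.parc.xerox.com - - [25/Jul/2006:03:54:18 -0700]',
    '"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"',
    'penguin26.parc.xerox.com - - [25/Jul/2006:08:58:00 -0700]',
    '"NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"',
]

for hour, count in sorted(hits_per_hour(SAMPLE).items()):
    print(f"{hour:02d}:00  {count} request(s)")
```

Feeding it the full excerpt instead of SAMPLE would show where the requests cluster. On a real log you would probably also filter by host or user agent first.
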
GaryK

11:11 pm on Jul 25, 2006 (gmt 0)



Xerox's Palo Alto Research Center is using Nutch on a Linux box? And apparently not doing a very good job of it. Is this a case of how the mighty have fallen? ;)

Any user agent that starts with NutchCVS/ gets banned immediately. I love it when the bot-runner doesn't even bother to change the default user agent.
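
In case it helps anyone, the rule amounts to a simple prefix test. This is only a sketch in Python, not my actual server config. The function name is_default_nutch_agent is made up for illustration, and I'm assuming a case-sensitive match on the exact prefix.

```python
# Prefix carried by Nutch builds whose operator left the default agent string.
BANNED_PREFIX = "NutchCVS/"


def is_default_nutch_agent(user_agent: str) -> bool:
    """Return True when the user agent starts with the default NutchCVS/ prefix."""
    return user_agent.startswith(BANNED_PREFIX)


# User-agent string as it appears in the log excerpt earlier in the thread.
ua = "NutchCVS/0.8-dev (Nutch; [lucene.apache.org...] nutch-agent@lucene.apache.org)"
if is_default_nutch_agent(ua):
    print("ban")  # in practice this would map to a 403 or a firewall rule
```

A prefix test deliberately leaves customized Nutch agents alone. Those need a separate decision.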

[edited by: GaryK at 11:13 pm (utc) on July 25, 2006]

thetrasher

9:02 am on Jul 26, 2006 (gmt 0)



"Wonder where they found out what to hit?"
Search engines?!

From nutch-0.7.2.tar.gz\nutch-0.7.2\src\engines\Google.src:

# Google plugin

<search
name="Google"
description="Google Search"
method="GET"
action="http://www.google.com/search"
update="http://www.google.com/mozilla/google.src"
updateCheckDays=1
>

<input name="q" user>
<input name="sourceid" value="mozilla-search">
<inputnext name="start" factor="10">
<inputprev name="start" factor="10">

<interpret
resultListStart="<body"
resultListEnd="</body>"

resultItemStart="<p class=g>"
resultItemEnd="<br>"
>
</search>
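
To make that concrete, here is a rough Python sketch of the request this plugin definition describes: a GET to the action URL with the user's query in q and the fixed sourceid value. The function name build_search_url is hypothetical. I'm also assuming that inputnext/inputprev page through results by stepping start in multiples of the factor (10), which is my reading of the file, not something I checked against Nutch's code.

```python
from urllib.parse import urlencode

# Values copied from the Google.src definition above.
ACTION = "http://www.google.com/search"
SOURCE_ID = "mozilla-search"
PAGE_FACTOR = 10  # from factor="10" on inputnext/inputprev (paging is an assumption)


def build_search_url(query: str, page: int = 0) -> str:
    """Build the GET URL for a query, optionally for a later result page."""
    params = {"q": query, "sourceid": SOURCE_ID}
    if page > 0:
        params["start"] = page * PAGE_FACTOR
    return f"{ACTION}?{urlencode(params)}"


print(build_search_url("nutch engines"))
print(build_search_url("nutch engines", page=2))
```

Whether the crawler at PARC actually seeds itself this way is a guess on my part. The sketch only shows what the plugin file encodes.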

www.google.com/search?q=cache:oqN_1H25-s4J:cvs.sourceforge.net/viewcvs.py/nutch/nutch/engines/+nutch+engines+%22google.src%22&hl=de&lr=&strip=1
This subdirectory contains Altavista.src, FAST.src, Google.src and Inktomi.src.

[edited by: jatar_k at 6:06 pm (utc) on July 26, 2006]