MaxPointCrawler

keyplyr

8:21 am on Oct 11, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




UA: MaxPointCrawler/Nutch-1.10 (maxpoint.crawler at maxpointinteractive dot com)
Protocol: HTTP/1.0
Robots.txt: Yes
Host: corenap.com / zayo.com
208.123.64.0 - 208.123.95.255
208.123.64.0/19
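For anyone who wants to double-check that the start/end pair and the CIDR notation above describe the same block, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `in_posted_range` is made up for illustration; only the addresses come from the post.

```python
import ipaddress

# Start and end addresses exactly as posted above
first = ipaddress.ip_address("208.123.64.0")
last = ipaddress.ip_address("208.123.95.255")

# Collapse the start/end pair into the smallest list of CIDR blocks
blocks = list(ipaddress.summarize_address_range(first, last))

# The CIDR form as posted above
network = ipaddress.ip_network("208.123.64.0/19")


def in_posted_range(addr: str) -> bool:
    """Hypothetical helper: True if addr falls inside the posted /19."""
    return ipaddress.ip_address(addr) in network


if __name__ == "__main__":
    print(blocks)
    print(network.num_addresses)
    print(in_posted_range("208.123.80.1"))
```

If the two notations agree, `blocks` holds a single network equal to `network`.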

Archived thread: [webmasterworld.com...]

If you sell ad space and/or publish AdSense, Bing Ads, etc., you may wish to allow this marketing data agent. I allow several Nutch UAs, including this one. So far it has behaved, and hopefully it will bring pots of gold my way.

lucy24

9:26 pm on Oct 11, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: MaxPointCrawler/Nutch-1.10 (maxpoint.crawler at maxpointinteractive dot com)
<snip>
Robots.txt: Yes

I've never blocked the "Nutch" UA element, even though it's an obvious robot marker, because for some reason robots that call themselves "Nutch" are exceptionally likely to read and obey robots.txt. Maybe it's coincidence; maybe it's built into the robot's default programming. (Someone out there probably knows.)

keyplyr

8:18 am on Oct 12, 2015 (gmt 0)

I allow several today, but in the past I did have "nutch" blocked via htaccess without mercy.

10-15 years ago, when there was only one "nutch," I allowed it. It obeyed robots.txt and was coming from a source I thought either beneficial or benign.

Then it became freely distributed software, which in theory is a good thing for the WWW, but unaccountable and therefore a possible threat to my interests IMO. There were over a dozen hitting my sites, each adding its own moniker to the UA. Unless I used their complete UA in robots.txt, they didn't obey it; none obeyed just "nutch."

There seem to be only a few left now, so I block via htaccess but poke holes to allow the friendlies through... insuring against another uprising. Ya never know.
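The "block nutch, but poke holes for the friendlies" idea can be sketched roughly as below. This is Python standing in for the logic, not actual htaccess rules, and the allowlist contents and function name are assumptions for illustration only.

```python
# Hypothetical allowlist: lowercase substrings identifying the Nutch-based
# crawlers that are let through. Only MaxPointCrawler is taken from this thread.
ALLOWED_NUTCH_TOKENS = ("maxpointcrawler",)


def is_blocked(user_agent: str) -> bool:
    """Block any UA containing 'nutch' unless it matches an allowlisted token."""
    ua = user_agent.lower()
    if "nutch" not in ua:
        # This rule only concerns Nutch-based agents; others are handled elsewhere.
        return False
    return not any(token in ua for token in ALLOWED_NUTCH_TOKENS)


if __name__ == "__main__":
    friendly = ("MaxPointCrawler/Nutch-1.10 "
                "(maxpoint.crawler at maxpointinteractive dot com)")
    unknown = "SomeOtherBot/Nutch-1.9"  # made-up UA for illustration
    print(is_blocked(friendly))
    print(is_blocked(unknown))
```

The default is to deny anything carrying the "nutch" marker, so a new and unknown Nutch variant stays blocked until someone deliberately adds it to the list.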

lucy24

9:02 am on Oct 12, 2015 (gmt 0)

none obeyed just "nutch"

I don't generally get that specific. (Except things like "We both know you don't index that format, so willya just stay the ### out?" Or, in rare cases, "I've got nothing against you personally, but you live in a bad neighborhood so let's save everyone some trouble by just telling you upfront not to crawl.") It's enough for me if they read robots.txt and then refrain from requesting any of the boilerplate pages which, by their nature, have to be linked from all pages including the front page, but are none of any robot's concern. The one thing better than a 403'd request is no request at all.
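As a rough illustration of the "read robots.txt, then stay out of the boilerplate pages" behavior, here is a small sketch using Python's standard `urllib.robotparser`. The `/boilerplate/` path and the file names are invented for the example; they are not anyone's real rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one catch-all record keeping every robot
# out of a made-up boilerplate directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /boilerplate/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# UA string as posted earlier in this thread
UA = "MaxPointCrawler/Nutch-1.10"

if __name__ == "__main__":
    # A compliant robot would skip the first path and may request the second.
    print(parser.can_fetch(UA, "/boilerplate/contact.html"))
    print(parser.can_fetch(UA, "/index.html"))
```

This only models what a well-behaved robot should conclude from the file; whether a given crawler actually honors it still has to be checked in the logs.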

:: exhausted because earlier today I re-checked the past year's logs to see which IPs can have the bad_russia designation or exact-IP lockout lifted ... at least until the next time someone picks up a bug ::

keyplyr

10:47 am on Oct 12, 2015 (gmt 0)

Probably should start a new thread... but here's another nutch I recently decided to let through:

198.148.15.20 - - [11/Oct/2015:02:26:22 -0700] "GET /robots.txt HTTP/1.0" 200 1416 "-" "Nutch/2.2.1 (page scorer; http://integralads.com/site-indexing-policy/)"

Again, useful for those of us who publish ads.
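For picking requests like the one above out of a log, a minimal sketch with a regular expression might look like this. It assumes the common "combined" log layout and a closing quote on the user-agent field; the pattern and field names are my own, not taken from any particular tool.

```python
import re

# Assumed layout: combined log format
# (IP, identity, user, time, request, status, size, referer, user agent)
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# The log line from the post above, with the closing quote assumed
line = (
    '198.148.15.20 - - [11/Oct/2015:02:26:22 -0700] '
    '"GET /robots.txt HTTP/1.0" 200 1416 "-" '
    '"Nutch/2.2.1 (page scorer; http://integralads.com/site-indexing-policy/)"'
)

match = LOG_PATTERN.match(line)
fields = match.groupdict() if match else {}

if __name__ == "__main__":
    # Show which client fetched robots.txt and what it called itself
    print(fields.get("ip"), fields.get("path"), fields.get("user_agent"))
```

From there it is easy to filter for UAs containing "nutch" and see whether each one requested robots.txt before anything else.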