SpiderMan/1.0

         

Pfui

7:36 pm on May 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



tss37-2-82-239-104-nnn.fbx.proxad.net
SpiderMan/1.0

Asked for robots.txt, then promptly ignored it. Twice. So I'd mark this --

robots.txt? NO

No clue if it's new, old, reincarnated, etc.
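For anyone who wants to check a "robots.txt? NO" verdict against their own logs, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the paths, and the helper name `disallowed_hits` are made up for illustration; only the user agent string comes from the post above.

```python
from urllib.robotparser import RobotFileParser

def disallowed_hits(robots_txt: str, user_agent: str, requested_paths: list[str]) -> list[str]:
    """Return the requested paths that robots.txt disallows for this user agent.

    Hypothetical helper: requested_paths would come from your own access log.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in requested_paths if not parser.can_fetch(user_agent, p)]

# Illustrative robots.txt, not from any real site.
EXAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

hits = disallowed_hits(
    EXAMPLE_ROBOTS,
    "SpiderMan/1.0",
    ["/index.html", "/private/page.html"],
)
```

If a bot fetched robots.txt and `hits` is non-empty for the paths it requested afterwards, that is the "asked, then ignored" pattern described above.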

blend27

12:44 pm on May 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Mary Jane is gonna be upset. I mean, the dude is out there for a quick scrape... what's up with that?

GaryK

5:29 am on Jun 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'll add some UAs with Spiderman in them to the mix in case it helps. I've been collecting these since Dec. 2004:

AESOP_com_SpiderMan
Neo Lee/Nutch-0.9 (Nutch spiderman; [lucene.apache.org...] MyEmail)
Peter Wang/Nutch-1.0-dev (Nutch spiderman; [peterpuwang.googlepages.com...] ; MyEmail)
search.ch V1.4.2 (spiderman@search.ch; [search.ch)...]
SpiderMan
SpiderMan Mozilla/4.0( compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; Alexa Toolbar)

dstiles

4:51 pm on Jun 14, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A note on Nutch: I got a hit from some Princeton bot with "Nutch" split in two by an underscore. Worth watching out for. Nutchers may be getting cleverer - could be up to gerbil level of intelligence soon!

Regex N(|_)u(|_)t(|_)c(|_)h works for me but I'm keeping an eye out for hyphens as well. :)
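A rough sketch of how the same idea could be extended to hyphens, written in Python so it is easy to test against sample user agents. The pattern and the function name `looks_like_nutch` are assumptions for illustration, not the regex actually in use here; it also ignores case, which the original pattern does not.

```python
import re

# Assumed variant of the pattern above: allow an optional "_" or "-"
# between each letter of "nutch", and ignore case.
NUTCH_RE = re.compile(r"n[_-]?u[_-]?t[_-]?c[_-]?h", re.IGNORECASE)

def looks_like_nutch(user_agent: str) -> bool:
    """Return True if the user agent contains a (possibly split) 'nutch'."""
    return NUTCH_RE.search(user_agent) is not None
```

The optional character class `[_-]?` does the same job as an empty-or-underscore alternation, just with the hyphen added. It only allows one separator between letters, so something like "nu__tch" would still slip through.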

Pfui

4:07 am on Jun 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The roll-your-own Nutch bots are everywhere, and too rarely do the runners program them to heed robots.txt. Oh, and Princeton's been around a LOT lately:

mmx.cs.princeton.edu
nu_tch-princeton/Nu_tch-1.0-dev (princeton crawler for cass project; [cs.princeton.edu...] zhewang a_t cs ddot princeton dot edu)

Also from: aegis.cs.princeton.edu

FWIW, last year I started redirecting:

RewriteCond %{REMOTE_HOST} \.cs\.

Works like a charm:)
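To see what that condition does and doesn't catch, here is a small Python sketch that applies the same unanchored pattern to a few hostnames. The function name `is_cs_host` is hypothetical, and this only mimics the pattern match, not how Apache resolves the remote host.

```python
import re

# Same pattern as the RewriteCond above: a literal ".cs." anywhere in the hostname.
CS_HOST_RE = re.compile(r"\.cs\.")

def is_cs_host(remote_host: str) -> bool:
    """Return True if the hostname contains '.cs.' somewhere in the middle."""
    return CS_HOST_RE.search(remote_host) is not None
```

Because the pattern needs a dot on both sides, it matches hosts like mmx.cs.princeton.edu but not a bare cs.princeton.edu. It also matches any non-university host that happens to have a ".cs." label.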

Oops. Thread's not about Nutch and university CS depts. Back to SpiderMan! Well, I've only seen "SpiderMan/1.0" coming from the same proxad.net host in the OP. Could be this runner's a film fan script-kiddie.