nu tch

         

wilderness

2:00 am on Jul 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Don't recall seeing this UA previously, at least not with the modified nutch term.

128.112.139.zzz - - [30/Jul/2009:02:22:57 +0100] "GET /robots.txt HTTP/1.0" 200 4858 "-" "nu_tch-princeton/Nu_tch-1.0-dev (princeton crawler for cass project; [cs.princeton.edu...] zhewang a_t cs ddot princeton dot edu)"

GaryK

3:35 am on Jul 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm, another one I forgot to report at the time.

First seen on April 14, 2009
Last seen July 26, 2009
Total visits 19

ROBOTS.TXT? Yes

I don't save IP Addresses though. Sorry.

keyplyr

8:08 am on Jul 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Those Princeton lads think they're pretty clever. However, they're still blocked, as are all research and CS projects.

dstiles

9:04 pm on Jul 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I reported the princeton nutch last month, with a regex to catch nutch UAs broken in this way:

[webmasterworld.com...]

wilderness

10:25 pm on Jul 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Many thanks dstiles.

I did recall a recent mention, but was unable to locate the reference in a search.

jdMorgan

11:29 pm on Jul 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Regex N(|_)u(|_)t(|_)c(|_)h

That's kind of doing it the hard way. How about just
N_?u_?t_?c_?h

or, taking the concern for possible hyphens into account
N[-_]?u[-_]?t[-_]?c[-_]?h
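
For anyone who wants to try the two patterns outside a server config, here is a minimal sketch in Python's `re` module. It assumes case-insensitive matching (the UA in the log uses both "nu_tch" and "Nu_tch"), and the names `UNDERSCORE_ONLY`, `HYPHEN_OR_UNDERSCORE` and `is_nutch_variant` are made up for illustration. Other regex libraries may behave differently.

```python
import re

# Optional underscore between each letter of "nutch".
# re.IGNORECASE is an assumption; the thread does not say which flags are used.
UNDERSCORE_ONLY = re.compile(r"n_?u_?t_?c_?h", re.IGNORECASE)

# Optional hyphen OR underscore between each letter.
# The hyphen comes first inside the brackets so it is taken literally.
HYPHEN_OR_UNDERSCORE = re.compile(r"n[-_]?u[-_]?t[-_]?c[-_]?h", re.IGNORECASE)


def is_nutch_variant(user_agent: str) -> bool:
    """Return True if the user agent contains 'nutch', plain or broken up."""
    return HYPHEN_OR_UNDERSCORE.search(user_agent) is not None


if __name__ == "__main__":
    # Shortened form of the UA from the log line earlier in the thread.
    for ua in ("nu_tch-princeton/Nu_tch-1.0-dev", "Nutch-1.0", "Mozilla/5.0"):
        print(ua, is_nutch_variant(ua))
```

The character-class version catches everything the underscore-only version does, plus hyphenated spellings such as "nu-tch".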

Jim

dstiles

8:16 pm on Jul 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not that familiar with regex - I use it only occasionally under vbScript which in any case is rather limited. I have found a few instances where things don't work as expected (according to my limited knowledge and using documented examples) and ? was one of them.

Your first solution isn't adequate for future expansion. Why is your second solution easier? It has more characters. Is it faster?

jdMorgan

5:57 pm on Aug 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's the same number of characters if you take out the additional provision for hyphens. It's also "more correct" in that it uses alternate groups as intended and doesn't rely on the potentially-dangerous "blank or underscore" parenthesized subpattern. I was actually surprised to see that posted as working code; my gut reaction is that I'd expect the regex parser to reject it. But obviously if it works for you, then it must be acceptable, at least with some regex libraries.
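
As a rough check of the "some regex libraries" point, here is a sketch using Python's `re` module, which does accept the "blank or underscore" subpattern, written here as `(|_)`. The names `EMPTY_ALTERNATIVE`, `OPTIONAL_UNDERSCORE` and `same_verdict` are hypothetical, and case-insensitive matching is an assumption. This says nothing about how VBScript or any other engine treats the same pattern.

```python
import re

# "Blank or underscore" written as an alternation with an empty first branch.
EMPTY_ALTERNATIVE = re.compile(r"n(|_)u(|_)t(|_)c(|_)h", re.IGNORECASE)

# The same idea written with the ? quantifier.
OPTIONAL_UNDERSCORE = re.compile(r"n_?u_?t_?c_?h", re.IGNORECASE)


def same_verdict(text: str) -> bool:
    """Return True if both patterns agree on whether text contains a match."""
    return bool(EMPTY_ALTERNATIVE.search(text)) == bool(OPTIONAL_UNDERSCORE.search(text))


if __name__ == "__main__":
    for sample in ("Nutch", "nu_tch", "n_u_t_c_h", "nu-tch", "Mozilla"):
        print(sample, same_verdict(sample))
    # The alternation form also creates one capture group per parenthesis pair.
    print(EMPTY_ALTERNATIVE.groups, OPTIONAL_UNDERSCORE.groups)
```

In this engine the two forms agree on what matches; the only visible difference is that the parenthesized form creates four capture groups that the `?` form does not need.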

Jim

dstiles

7:41 pm on Aug 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks. As I said, not too familiar with regex, just enough to get by. I've tried your solution and it works in IIS vbScript. :)