Princetonbot

lucy24

6:24 pm on Apr 7, 2017 (gmt 0)

I found this in logs while looking for something else. It was active a little over a year ago, from 29 January through 18 March 2016, disappearing as quickly as it had appeared. During that time it only picked up images--a variety of them, on two different sites, but it had one particular favorite directory that it visited especially often.

IP: 128.112.155.170-173 (128.112 is Princeton, and what do you bet 128.112.155 is the Computer Science department? The 170-173 is odd, since four addresses would normally make a /30 and these don't line up with one, but it's too many to be coincidental.)
Requests: assorted image files
Referer: as if human (that is, whatever page the image belongs to--but they never got the page itself)
UA:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/600.1.4 (KHTML, like Gecko) Safari/600.1.4 (compatible; Princetonbot/1.0; +http://http://tigress-web.princeton.edu/~fy/bot.html)
(Can you guess what I was searching for that led to this accidental find?)
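
For anyone who wants to pick these requests out of their own logs, here is a rough Python sketch. It assumes you have already pulled the client IP and the User-Agent string out of each log line; the function names are made up for illustration. It also shows why 170-173 looks odd: the range does not sit on a single CIDR boundary.

```python
import ipaddress
import re

# Range observed in the logs described above.
FIRST = ipaddress.IPv4Address("128.112.155.170")
LAST = ipaddress.IPv4Address("128.112.155.173")

# Smallest set of CIDR blocks covering exactly 170-173.
# The range straddles a /30 boundary, so it splits into two /31 blocks.
BLOCKS = list(ipaddress.summarize_address_range(FIRST, LAST))

# Token taken from the UA string quoted above.
UA_TOKEN = re.compile(r"Princetonbot/\d", re.IGNORECASE)


def in_observed_range(ip: str) -> bool:
    """True if ip falls inside 128.112.155.170-173."""
    return FIRST <= ipaddress.IPv4Address(ip) <= LAST


def looks_like_princetonbot(ip: str, user_agent: str) -> bool:
    """Match on either the IP range or the UA token (hypothetical helper)."""
    return in_observed_range(ip) or bool(UA_TOKEN.search(user_agent))


if __name__ == "__main__":
    print([str(block) for block in BLOCKS])
    print(looks_like_princetonbot("128.112.155.171", "Mozilla/5.0"))
```

Matching on the IP range as well as the UA token is deliberate here, because a UA-only check would miss any request from these addresses that arrives without the Princetonbot token.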

Further pawing through logs reveals that for a couple weeks earlier in January 2016 they used our old friend Chrome 34 for similar requests:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36

Although I haven't seen them in over a year, the page in the UA is still there, telling me--no surprises here--that
Princetonbot is an image crawler from Princeton University. It collects Internet images for an on-going research project to further our understanding in big data and deep learning.

We obtained the list of image URLs from popular image search engines, so we are not going to crawl web pages and we don’t download images with prohibited access from search engines. Also, we randomize our image access to a particular website to avoid peak traffic to the remote image server.

There's also an “Opt-Out” paragraph, but they don’t seem to have heard of robots.txt. Apparently they think that if Googlebot-Image is allowed to crawl it, then so are they.
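
For comparison, this is roughly what a robots.txt check looks like on the crawler side, using Python's standard urllib.robotparser. The rules and the "Princetonbot" token are assumptions for illustration only; nothing on the bot page says it reads robots.txt or which token it would answer to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
# The "Princetonbot" token is an assumption, not something the bot documents.
ROBOTS_TXT = """\
User-agent: Princetonbot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(agent: str, url: str) -> bool:
    """Return True if the rules above permit agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    image = "http://example.com/images/photo.jpg"
    print(allowed("Princetonbot", image))
    print(allowed("Googlebot-Image", image))
```

A crawler that works only from a search engine's URL list and never runs a check like this will fetch whatever Googlebot-Image was allowed to see, regardless of what the site says about other robots.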

:: insert nasty crack about The Princeton Personality here ::

keyplyr

10:17 pm on Apr 7, 2017 (gmt 0)

Been around a very long time.

Extremely interesting story... The first time I saw this bot (roughly 10 years ago) there was a lot in the news about scams from Nigerian Princes, etc. With this somewhere in the back of my mind, when I first saw this UA I read it as Prince Tono and immediately blocked it :)

There's a similar UA: princetononline... blah blah. I assume Prince Tono has nothing to do with this one either.