"Robots.txt: No"
That's funny. I met them recently too, and--as might be expected for Nutch--they did ask for robots.txt.
"...as might be expected for Nutch--they did ask for robots.txt."
I made an effort to look through 24 hours of data and did not see a request for robots.txt from either the UA or the IP. It would be odd for any agent (except maybe a search engine) to cache robots.txt for more than 24 hours. I don't.
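For anyone who wants to repeat that kind of check, here is a rough sketch of how it could be scripted. It assumes the server writes Combined Log Format and that the lines you feed it already cover the 24-hour window you care about. The function name `robots_requests`, the sample IPs, and the sample user-agent strings are all made up for illustration; this is not the method used above, just one way to do it.

```python
import re

# Combined Log Format (assumed):
# IP identity user [time] "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def robots_requests(lines, ua_fragment=None, ip=None):
    """Return parsed log entries that requested /robots.txt and match
    either the user-agent fragment or the IP address."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        # Ignore any query string when comparing the path
        path = m.group("path").split("?", 1)[0]
        if path.lower() != "/robots.txt":
            continue
        ua_match = (ua_fragment is not None
                    and ua_fragment.lower() in m.group("ua").lower())
        ip_match = ip is not None and m.group("ip") == ip
        if ua_match or ip_match:
            hits.append(m.groupdict())
    return hits


# Hypothetical sample lines, standing in for 24 hours of log data
sample = [
    '192.0.2.10 - - [01/Jan/2024:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    '192.0.2.20 - - [01/Jan/2024:10:05:00 +0000] "GET /page.html HTTP/1.1" 200 5120 "-" "OtherBot/2.0"',
    '192.0.2.20 - - [01/Jan/2024:10:06:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for hit in robots_requests(sample, ua_fragment="OtherBot", ip="192.0.2.20"):
        print(hit["time"], hit["ip"], hit["ua"])
```

Matching on the IP as well as the UA matters because a bot may fetch robots.txt under a different user-agent string than the one it crawls with, as the third sample line is meant to suggest.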
"...it implies they already know about the site. If so, the knowledge must be recent"
Unless it is following a link, the bot just tries to execute whatever it's programmed to do. There is no "knowledge" needed. Restrictions aside, and depending on what language it's written in, it can request a directory with no previous history on that server/account. It doesn't need to "know" the directory exists. That's human logic :)
"Does SNAPSHOT mean anything?"
I think that UA is documented somewhere. As its name implies, it takes a snapshot of your page, and it's often used by those site-info type services.