TSPcassiopeeMesosCrawler

         

keyplyr

8:09 pm on Jun 11, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




UA: TSPcassiopeeMesosCrawler/Nutch-1.13-SNAPSHOT
Protocol: HTTP/1.0
Robots.txt: No
Host: Institut Telecom SudParis (ISP)
157.159.0.0 - 157.159.255.255
157.159.0.0/16
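For anyone who wants to test a visitor's address against this block, here is a minimal Python sketch using the standard-library `ipaddress` module. The helper name `in_sudparis_range` and the sample addresses are made up for illustration; only the 157.159.0.0/16 block comes from the post above.

```python
import ipaddress

# CIDR block reported above for Institut Telecom SudParis
SUDPARIS_NET = ipaddress.ip_network("157.159.0.0/16")

def in_sudparis_range(ip: str) -> bool:
    """Return True if ip falls inside 157.159.0.0/16 (hypothetical helper)."""
    return ipaddress.ip_address(ip) in SUDPARIS_NET

# First and last addresses of the /16, to compare with the range listed above
first, last = SUDPARIS_NET[0], SUDPARIS_NET[-1]

print(first, "-", last)
print(in_sudparis_range("157.159.10.20"))  # sample address inside the block
print(in_sudparis_range("157.160.0.1"))    # sample address outside the block
```

The same check can be adapted to whatever blocking or logging layer is in use; the sketch only shows that the range and the CIDR notation describe the same addresses.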

lucy24

10:10 pm on Jun 11, 2017 (gmt 0)




Robots.txt: No
That's funny. I met them recently too, and--as might be expected for Nutch--they did ask for robots.txt.

Site A: / only
Site B: /directory/ only
(This kind of thing is always interesting, as it implies they already know about the site. If so, the knowledge must be recent, as they proceeded directly to https rather than getting redirected from http.)

:: detour to archived logs ::

Does SNAPSHOT mean anything? I found an unrelated robot from a while back, calling itself
Dispatch/0.11.1-SNAPSHOT
(“Unrelated” = different IP, different behavior.)

keyplyr

10:15 pm on Jun 11, 2017 (gmt 0)




...as might be expected for Nutch--they did ask for robots.txt.
I made an effort to look through 24 hours of data and did not see a request for robots.txt either from the UA or the IP. It would be odd for any agent (except maybe a SE) to cache robots.txt for more than 24 hours. I don't.

However, I agree, Nutch requests robots.txt by default. Maybe I missed it :)
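A log search like the one described above can be scripted rather than done by eye. The following is a rough Python sketch, assuming an Apache/nginx "combined" log format (an assumption, since no log format is given in this thread). The function name `robots_requests` is hypothetical. It keeps any robots.txt request that comes from either the 157.159.0.0/16 block or the TSPcassiopeeMesosCrawler UA.

```python
import ipaddress
import re

NET = ipaddress.ip_network("157.159.0.0/16")   # block from the first post
UA_TOKEN = "TSPcassiopeeMesosCrawler"          # UA name from the first post

# Assumed combined log format:
# IP ident user [time] "METHOD path proto" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def robots_requests(lines):
    """Return log lines requesting /robots.txt from the IP block or the UA."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        # Ignore any query string when comparing the path
        if m.group("path").split("?", 1)[0] != "/robots.txt":
            continue
        try:
            from_net = ipaddress.ip_address(m.group("ip")) in NET
        except ValueError:
            from_net = False  # first field was a hostname, not an IP
        if from_net or UA_TOKEN in m.group("ua"):
            hits.append(line)
    return hits
```

Usage would be along the lines of `robots_requests(open("access.log"))`, with the path replaced by the real log file. An empty result over the window searched would support the "Robots.txt: No" reading for that window only.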

keyplyr

11:20 pm on Jun 11, 2017 (gmt 0)




...it implies they already know about the site. If so, the knowledge must be recent
Unless it is following a link, the bot just tries to execute whatever it's programmed to do. There is no "knowledge" needed. Aside from restrictions & depending on what language is used, it can open a directory without previous history with that server/account. It doesn't need to "know" the directory exists. That's human logic :)

It's a simple command to open a directory... the name of that directory is irrelevant; likewise for getting files in that directory.

Now, with the advancement of AI implementation, I can foresee future bots "learning" as they go, but outside of Google & MS (and a couple of other actors) using AI in some utilities, I haven't seen any evidence of web crawlers/spiders using AI (yet).

Does SNAPSHOT mean anything?
I think that UA is documented somewhere. Like it's name implies, it takes a snapshot of your page, often used in those site-info type services.

RE: Dispatch/0.11.1-SNAPSHOT
It's the same thing. Like a lot of extensions, SNAPSHOT can be included with many other applications. Note the apostrophe.