The Knowledge AI

         

keyplyr

8:56 pm on Apr 18, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: The Knowledge AI
Protocol: HTTP/1.1
Robots.txt: Yes
Host: Hurricane Electric he.net
66.160.128.0 - 66.160.207.255
66.160.192.0/20, 66.160.128.0/18
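
For anyone wondering how the dotted range above maps to the two CIDR blocks, here is a minimal sketch using Python's standard `ipaddress` module. The variable names are just for illustration; the addresses are the ones posted above.

```python
import ipaddress

# Range as posted above (Hurricane Electric).
first = ipaddress.ip_address("66.160.128.0")
last = ipaddress.ip_address("66.160.207.255")

# summarize_address_range yields the CIDR blocks that exactly cover the range.
cidrs = [str(net) for net in ipaddress.summarize_address_range(first, last)]
print(cidrs)  # ['66.160.128.0/18', '66.160.192.0/20']
```

The range is not a single power-of-two block, which is why it takes a /18 plus a /20 to cover it.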

Travis

10:27 am on Apr 20, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



"AI" the magical word that you see everywhere now.

keyplyr

7:50 pm on Apr 20, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All requests have been for robots.txt, where this agent is not disallowed, but the IP range is blocked. (I allow all IPs access to robots.txt.)

This agent is possibly scanning for another actor.

Also coming from...
Host: Hurricane Electric he.net
64.62.128.0 - 64.62.255.255
64.62.128.0/17
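
If you want to check whether a given address from your logs falls in any of the Hurricane Electric blocks listed in this thread, something like the following would do. It is only a sketch: `HE_NETWORKS` and `in_listed_ranges` are hypothetical names, and the list covers only the CIDRs posted here, not everything he.net owns.

```python
import ipaddress

# CIDR blocks as posted in this thread; not a complete he.net list.
HE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("66.160.128.0/18", "66.160.192.0/20", "64.62.128.0/17")
]

def in_listed_ranges(ip: str) -> bool:
    """Return True if ip lies inside any of the blocks listed above."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in HE_NETWORKS)

print(in_listed_ranges("64.62.200.1"))    # inside 64.62.128.0/17
print(in_listed_ranges("66.160.208.0"))   # just past the end of 66.160.192.0/20
```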

lucy24

1:11 am on Apr 21, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This agent is possibly scanning for another actor.
In general, I would say it doesn't do a fat lot of good to ask for robots.txt under one name and pages under a different one. Admittedly I know of one (only one) entity that does exactly this--but they've got a fixed IP down to the last a.b.c.d so there's not much shadiness involved.

All requests have been for robots.txt, where this agent is not disallowed
On my main site, robots.txt has been followed by requests for the basic directories, blocked on header grounds which I haven't looked into more closely. As an encouraging sign, the roboted-out area (pages linked from front page and from 403 page) has not been requested.

On my personal site, it's been robots.txt and-that's-all.

Last time I ran logs, I added a Disallow just for ### and giggles, since I really do prefer more than a bare name. I'll see in a few days if they come back.

keyplyr

1:16 am on Apr 21, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In general, I would say it doesn't do a fat lot of good to ask for robots.txt under one name and pages under a different one.
More than once I've seen some bot UA request robots.txt to get their shopping list, then come back using a browser UA for those disallowed files... just say'n.

lucy24

6:58 pm on May 16, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Short version, the long version [webmasterworld.com] having been relocated:

This robot will stay away if you give it a robots.txt Disallow using its full exact name
User-Agent: The Knowledge AI
Disallow: /
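
To sanity-check the directive before deploying it, here is a minimal sketch using Python's standard `urllib.robotparser`. It only shows how that one parser reads the two lines above; how the bot itself parses robots.txt is not known from this thread. The name `OtherBot` and the path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the post above.
ROBOTS_TXT = """\
User-Agent: The Knowledge AI
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named agent is shut out of everything.
blocked = parser.can_fetch("The Knowledge AI", "/any/page.html")
# A hypothetical agent with a different name is unaffected by this rule.
other = parser.can_fetch("OtherBot", "/any/page.html")

print(blocked, other)  # False True
```

Note that Python's parser matches agent names case-insensitively and by substring, so it is more forgiving than a robot that apparently wants its full exact name.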

lucy24

5:56 pm on Jun 24, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I broke down and allowed them after finally finding a robots.txt directive they could understand. (“Never attribute to malice” et cetera.) Since then they’ve been crawling ... and crawling ... and crawling. Or technically I guess spidering, since it’s clear that each visit involves following up links they found on earlier visits. Their initial shopping list may have been from {reputable site I can identify} but they’ve been moving outward.

Now, either they can’t do HTTPS or they are very slow in following redirects, because they’ve been visiting my personal site for about a week and a half--it took them 5 or 6 days to find a link, since most are in a disallowed directory--and to date it’s been nothing but HTTP requests. That means they get robots.txt as-is, but everything else is redirected. We Shall See.


---

[edited by: keyplyr at 7:01 pm (utc) on Jun 24, 2018]
[edit reason] splice clean-up [/edit]

lucy24

8:36 pm on Jul 19, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: bump ::

Out of curiosity, has anyone met it on an HTTPS site? In the course of more than a month, they’ve eaten a steady diet of redirects at the HTTP version of my personal site, but have yet to make a single request by HTTPS. Is it possible they don’t know how?
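
For anyone who wants to check their own logs, here is a rough sketch that tallies response codes for this UA. It assumes the common "combined" access-log format with the user-agent as the last quoted field; `LINE_RE`, `status_counts` and the sample lines are all made up for illustration (192.0.2.1 is a documentation address, not a real visitor).

```python
import re
from collections import Counter

# Combined log format: ... "METHOD path PROTO" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def status_counts(lines, user_agent="The Knowledge AI"):
    """Count response status codes for requests made by one exact user-agent."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("ua") == user_agent:
            counts[m.group("status")] += 1
    return counts

# Invented sample lines, not real log data.
sample = [
    '192.0.2.1 - - [01/Jan/2000:00:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "The Knowledge AI"',
    '192.0.2.1 - - [01/Jan/2000:00:00:01 +0000] "GET /dir1/ HTTP/1.1" 301 0 "-" "The Knowledge AI"',
    '192.0.2.1 - - [01/Jan/2000:00:00:02 +0000] "GET / HTTP/1.1" 200 500 "-" "SomeBrowser"',
]
print(status_counts(sample))
```

The log line itself does not say whether a request came in over HTTP or HTTPS, so run it separately on each host's log (if they are kept apart) to see whether the HTTPS side ever gets a hit.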

keyplyr

3:16 am on Jul 20, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Out of curiosity, has anyone met it on an HTTPS
Of course... I posted didn't I ;)

I think I've been pretty clear in these forums that I would never subject users to security malevolence by publishing unsecure documents of any kind.

As for requesting HTTP and learning of the secure site through a 301 redirect... some bots may not be built to negotiate protocol the same sophisticated way browsers do. I'd have to see their source code, but I would assume if the bot was achieving what it was intended to do, there would be little incentive to rewrite it just because the web went secure.

lucy24

5:30 am on Jul 20, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



if the bot was achieving what it was intended to do
Whatever that is. I tend to doubt it was “intended to” see how many 301s it could rack up on a single site. On your secure sites, did it start out by making HTTP requests, or have you only ever met it on HTTPS from the beginning?

Since it mostly does from-scratch spidering, rather than follow an old and perhaps outdated shopping list, I've had no occasion to see how it responds to redirects in general. But most robots doing full-scale spidering tend to follow up on redirects within a day or so (search engines are much faster), or on their next visit, whenever it may be.

:: detour to check something ::

Come to think of it, I don't know where it's getting its shopping list for this particular site. Aside from the root, it requests both directory pages (this is my personal site), one interior page in /dir1/ and three in /dir2/. On a handful of early visits it got a 403, which would have provided it with the names of the top-level directories, but not of their contents.

:: further delving into raw logs ::

Oh, that's funny. I spot-checked a few specific aa.bb.cc.dd IPs that it has used, and the very first one (in the 64.62 group) was formerly used frequently by BUbiNG. But it’s not systematic, just a funny coincidence.