robot clusters

         

lucy24

11:18 pm on Aug 11, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Someone out there has decided to spend their summer making up a new robot script, and it's been vexing me since mid-July.

IP: entirely random, but not typically The Usual Suspects or well-known server farms. Any one IP might do just one request, often two, rarely more. Most seem to be from ARIN ranges.
UA: each cluster uses a random selection of exactly five humanoid UAs, always a different set.
Headers: ample and varied; only rarely do they include one that would trigger a lockout.
Requests: around 20-25 within a short time period, typically 20 seconds or so. Each cluster involves a random combination of {some specific page, different each time} AND the / root AND--usually but not always--robots.txt. There is absolutely nothing distinctive about the pages selected; they could easily put all my URLs in a hat and pick one.
Referer: randomly google OR root OR blank OR ... robots.txt. (This vexed me so much that I have added /robots.txt to the bad_ref environmental variable.) Redirected requests due to missing directory slash always give the original wrong request as referer, and always stay with the same IP-and-UA combo. Everything else is random.

Mutter, mutter, grumble.

phranque

11:50 pm on Aug 11, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



sorry about that!
=8)

that actually sounds like pretty optimal scraper behavior...

lucy24

12:49 am on Aug 12, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



that actually sounds like pretty optimal scraper behavior...
How much scraping can you do, if you don't get any supporting files?

One thing I double-checked for before posting was requests for the same page with supporting files in the same general time frame--which would make me suspicious of the apparent human.

phranque

1:09 am on Aug 12, 2022 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



if you don't get any supporting files

i was wondering but you hadn't mentioned that.

dstiles

8:08 am on Aug 12, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> UA: each cluster uses a random selection of exactly five humanoid UAs, always a different set.

Are these current browsers, old or ancient?

> Headers: ample and various, rarely some that will trigger a lockout.

Do those headers include SecFetch?

lucy24

4:42 pm on Aug 12, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are these current browsers, old or ancient?

Brand-new, version numbers 99 and up. Standard browsers such as Chrome, Firefox, Edg, Safari (the latter obviously not such high version numbers!). Not the FF/50s and Chrome/60s that are so popular with unimaginative robots, let alone MSIE 6--which I have actually seen in the past year, making me wonder why they even bother.

Do those headers include SecFetch?
Often but not always. I pulled the headers from one cluster and fine-tooth-combed for shared features, but absolutely nothing was common to all--except the basics like Accept: that have to be present or they would be blocked up front, or conversely ones like Connection: that are present in all requests without exception. I've often looked for some blockable aspect of the Sec-Fetch / Sec-Ch family, but they're used by too many humans, especially Android.

I only became aware of these clusters because a noticeable proportion were getting in, which made the time clustering jump out. A significant proportion of the 403s are because the cluster involves a deep interior page claiming to be linked from the root. (Or, even more ridiculously, from robots.txt.)

phranque, I can just about count on my fingers the number of malign robots--as opposed to search engines and the like--that request anything but pages. On rare occasions they get page-plus-any-scripts, though they hardly ever act on the scripts. (Are they doing this to learn how to send in piwik/matomo requests, perhaps for purposes of referer spam? It doesn't do them any good, if so.) But that's robots-in-general, not the clusters that have been plaguing me of late.

I'm not concerned with server load, since thirty requests in the span of 20 seconds is still less work than one typical page with all resources; it doesn't even create a blip in logs.* But it's aggravating.

* One subsection of my personal site involves pages with so many supporting files, I can actually see how many humans have visited just by eyeballing the size of the day's log file.

dstiles

8:38 am on Aug 13, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not sure how one goes about trapping anything like that. :( I have, as you noted, added robots.txt to the referer trap but that appears to be a minimal deterrent.

Is there any order to the requests? eg robots.txt, root, page or are they random-ish?

The thought occurs: a distributed bot? Though I have no idea what they'd be looking for. And most bots like to advertise themselves, and are unlikely to co-operate in using more than a single IP per set.

lucy24

4:51 pm on Aug 13, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is there any order to the requests? eg robots.txt, root, page or are they random-ish?
They appear to be random, though I can't swear to it because my server's access logs are sometimes a little hiccupy: for example, a human visit might show a slew of images at 12:01:42 and then, after those in the logs, the page request at 12:01:41. So when a flurry of requests comes in during the same second, there's no way to be absolutely certain they arrived in the order shown. The order in access logs does seem to match the order in header logs. (And the requests are just far enough apart that logged headers never get tangled up, as can happen with over-speedy robots.)

a distributed bot?
I was thinking infected individual computers, but you'd expect more of those to come from places like Vietnam or the Philippines, not ARIN ranges. Looking at the most recent cluster, I did spot one from Bangladesh and one from Iraq, and a few RIPE, but really it varies all over the map: some lesser-known server or colo ranges, some to all appearances human IP.

Here, for example, is the second-to-last cluster. I've collapsed the UAs into that day's five. They are always different, but to date it is always exactly five. You can see that the only consistent handling of referers is when there is an index redirect involved. In this particular batch, quite a few were blocked due to the Rtt header, which I have yet to see from a human or legitimate robot.

Total number of requests, 25, sent directly to https. This is pretty typical; to date the range is from a low of 14 (did they get interrupted?) to a high of 32.
162.43.242.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 301 481 "-" "{Firefox 101}" 
206.204.33.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://www.google.com/" "{Chromium 101a}"
213.188.85.abc - - [09/Aug/2022:21:02:23 -0700] "GET / HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
152.39.227.abc - - [09/Aug/2022:21:02:23 -0700] "GET /robots.txt HTTP/2.0" 200 197 "https://www.google.com/" "{Firefox 99}"
162.43.242.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw/ HTTP/2.0" 200 46167 "https://example.com/ebooks/shaw" "{Firefox 101}"
152.39.227.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
149.71.176.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://www.google.com/" "{Chromium 101a}"
193.176.22.abc - - [09/Aug/2022:21:02:23 -0700] "GET / HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
31.204.13.abc - - [09/Aug/2022:21:02:23 -0700] "GET /robots.txt HTTP/2.0" 200 197 "https://www.google.com/" "{Firefox 99}"
31.204.13.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
76.189.21.abc - - [09/Aug/2022:21:02:24 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://www.google.com/" "{Chromium 101a}"
206.204.4.abc - - [09/Aug/2022:21:02:23 -0700] "GET / HTTP/2.0" 200 7849 "-" "{Safari 14}"
73.0.139.abc - - [09/Aug/2022:21:02:24 -0700] "GET / HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
71.72.184.abc - - [09/Aug/2022:21:02:24 -0700] "GET /robots.txt HTTP/2.0" 200 197 "https://www.google.com/" "{Firefox 99}"
144.142.209.abc - - [09/Aug/2022:21:02:24 -0700] "GET / HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
71.72.184.abc - - [09/Aug/2022:21:02:24 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
141.242.156.abc - - [09/Aug/2022:21:02:24 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://www.google.com/" "{Chromium 101a}"
208.207.171.abc - - [09/Aug/2022:21:02:25 -0700] "GET /robots.txt HTTP/2.0" 200 197 "https://www.google.com/" "{Firefox 99}"
208.207.171.abc - - [09/Aug/2022:21:02:25 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Firefox 99}"
206.204.4.abc - - [09/Aug/2022:21:02:29 -0700] "GET / HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Firefox 99}"
64.79.240.abc - - [09/Aug/2022:21:02:29 -0700] "GET / HTTP/2.0" 200 7849 "-" "{Safari 14}"
64.79.240.abc - - [09/Aug/2022:21:02:33 -0700] "GET / HTTP/2.0" 200 7849 "-" "{Safari 14}"
64.79.240.abc - - [09/Aug/2022:21:02:38 -0700] "GET / HTTP/2.0" 200 7849 "-" "{Safari 14}"
64.79.240.abc - - [09/Aug/2022:21:02:42 -0700] "GET /ebooks/shaw HTTP/2.0" 301 481 "-" "{Safari 14}"
64.79.240.abc - - [09/Aug/2022:21:02:42 -0700] "GET /ebooks/shaw/ HTTP/2.0" 200 46167 "https://example.com/ebooks/shaw" "{Safari 14}"

dstiles

9:28 am on Aug 14, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Vietnam or the Philippines, not ARIN ranges

Judging by my mail server (in the UK), a lot of bad hits come from ARIN ranges. Also India. Not so many on the web server, but some, and often from server ranges.

> Rtt header

That's a new one on me. I'll look further into that one.

I don't have the relevant logs to hand but do the major bots accompany their robots.txt probes with a referer? If not, block the hit for robots.txt + referer. Not sure if that would do any good but it may confuse the bot. Beyond that, sorry, no idea; but I'll keep an eye on the logs for similar.

lucy24

4:35 pm on Aug 14, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



do the major bots accompany their robots.txt probes with a referer?
Never that I can think of. In fact, one purpose of my robots.php rewrite is to screen requests. (This is a recent addition. Originally I did the php rewrite so I could #1 log headers and #2 include a single shared robots file for all sites.) If the robots.txt request includes one or more of:
$_SERVER['HTTP_REFERER'] || $_ENV['noagent'] || $_ENV['bad_agent'] || $_ENV['bad_range'] || $_ENV['lying_bot']

(all of which should be self-explanatory, except that "lying_bot" simply means a humanoid user-agent such as Chrome or Firefox)
then they get a minimalist robots.txt that flatly Disallows everyone. That's why, in the sample posted above, the response sizes for robots.txt are so small.
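
In outline, the screen amounts to something like this (a minimal sketch, not the real file: the flag names are the ones above, but how Apache hands them to PHP, and the shared filename, will vary with the setup):

<?php
// robots.php -- rough sketch of the screening described above.
// Assumes the request for /robots.txt has been rewritten to this script
// and that the noagent / bad_agent / bad_range / lying_bot flags were set
// as environment variables (depending on PHP config they may surface via
// getenv() or $_SERVER rather than $_ENV).
header('Content-Type: text/plain');

$flagged = !empty($_SERVER['HTTP_REFERER'])   // robots.txt fetches shouldn't carry a referer
        || !empty($_ENV['noagent'])
        || !empty($_ENV['bad_agent'])
        || !empty($_ENV['bad_range'])
        || !empty($_ENV['lying_bot']);        // humanoid UA such as Chrome or Firefox

if ($flagged) {
    // Minimalist version: flatly disallow everyone.
    echo "User-agent: *\nDisallow: /\n";
} else {
    // Otherwise serve the single shared robots file used for all sites.
    readfile(__DIR__ . '/robots-shared.txt'); // illustrative filename
}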

Not that it matters a whole lot with robots that don't bother looking at robots.txt until after one or more page requests.

Tangentially: There is, I believe, one isolated place on my main site--it's part of the “At Home with the Robots” subdirectory--that explicitly links to robots.txt, so human requests would legitimately come in with a referer. But those would also have human user-agents, so they'd get the same file either way.

dstiles

10:58 am on Aug 16, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Don't know how this fits in, if at all, but, as I noted above, I put in a block if robots.txt appears in the referer.

This morning I had 7 hits on robots.txt, all within 20 seconds, from a "real" browser with SecFetch. Every IP was different but all from the US. The first three hits were Chrome on NT 6.1 (i.e. Windows 7); the others were MSIE 10 on NT 6.2 (both UAs blocked on my server anyway for claiming stupidly old OSes).

lucy24

5:24 pm on Aug 16, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



as I noted above, I put in a block if robots.txt appears in the referer
It would never have occurred to me to list robots.txt as bad_ref ... until this cluster started using it.

:: detour to archived logs, going back to late 2013, for some number-crunching ::

Requests giving robots.txt as referer
56% or a bit over half are requests for robots.txt

14% are for favicon.ico (on my sites, favicon is never explicitly named in the html, let alone in a .txt file, and most visitors are allowed to have it, if only because it's less work for the server than a 403)

18% are requests for / root, with the earliest in September 2017--and the second-earliest a full year later.

NO requests for any page other than root with robots.txt as referer

Noteworthy: 18% of all requests giving robots.txt as referer come from the present calendar year. Going purely by age of site, it should be well under 10%. So this looks like an up-and-coming behavior.

As long as I was there, I checked conversely for robots.txt requests with a referer (of any kind). These are about three times as frequent as the reverse (robots.txt as referer), with the obvious auto-referer overlap.

18.5% or a bit under 1/5 give robots.txt as the referer

only about 2% give / root as referer, and this was first seen in 2019.

The rest are a miscellaneous batch ranging from google (yeah, right) to apparent referer spam, including a fair number from DuckDuckBot--whether real or fake I didn't bother to check--along with misspellings of my own site name such as "http:// www.example.com" [sic space] where it's really https://example.com.
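
(The number-crunching itself is nothing elaborate; a single pass over the archived access logs along these lines would produce the same tallies. A rough sketch only: the log location is made up, and it assumes the usual combined format.)

<?php
// Rough sketch: tally requests that give robots.txt as the referer,
// broken down by requested path. Assumes combined-format access logs;
// the glob path is illustrative only.
$byPath = [];
$total  = 0;

foreach (glob('/path/to/archived-logs/access*.log') as $file) {
    foreach (file($file) as $line) {
        // combined format: ... "GET /page HTTP/x.x" status bytes "referer" "user-agent"
        if (!preg_match('/"(?:GET|HEAD|POST) (\S+)[^"]*" \d+ \S+ "([^"]*)"/', $line, $m)) {
            continue;
        }
        if (stripos($m[2], 'robots.txt') === false) {
            continue;               // referer does not mention robots.txt
        }
        $total++;
        $byPath[$m[1]] = ($byPath[$m[1]] ?? 0) + 1;
    }
}

arsort($byPath);
printf("%d requests gave robots.txt as referer\n", $total);
foreach ($byPath as $path => $count) {
    printf("%6.1f%%  %s\n", 100 * $count / $total, $path);
}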

All of this is a bit of a digression from the original theme of robot clusters, but robot behavioral psychology is an endlessly fascinating subject.

lucy24

8:50 pm on Aug 16, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Follow-up: After several clear days in a row, I got another one yesterday. Of 32 requests, 8 were for robots.txt (all served with the minimalist Disallow-everyone version) and a further 20 were blocked. The 4 that got in were referer-less requests for the root, which are pretty impossible to block when there are no header/IP/UA offenses.

This cluster, like the one before that, added a new and slightly worrying feature: it was followed within seconds by a to-all-appearances-human request for the interior page in question.

Oh, and Today I Learned ... that “ka” is the language code for Georgian, hence Accept-Language: ka-GE in about half the requests.

:: wandering off to investigate Kartvelian languages, because if you can’t at least learn something, what's the use ::

dstiles

1:42 pm on Sep 2, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I got another of these this morning. 24 hits: 12 to robots.txt with a robots.txt referer, 4 others blocked with a 403, and 6 returning 200. Of the 200s, 4 were to INDEX, one of which also requested the two CSS files; the others were to the GDPR and REPAIRS pages. There were also two 403s from the PRODMAP page. All of those pages are linked from INDEX.

A possibility of blocking occurs to me, using this set of hits as an example.

The first hit was to robots.txt with robots.txt as the referer. Use that hit to trigger a delay ON THAT DOMAIN (and possibly only for ARIN / US IPs) of, say, 45 to 60 seconds. Alternatively, trigger a minimal captcha, if the site is a busy one.
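
In PHP terms the trigger half might be no more than touching a per-domain marker file when that first hit arrives (a sketch under those assumptions; the marker path is made up, and any ARIN / US-only narrowing would happen before this point):

<?php
// Sketch of the trigger: when robots.txt is requested WITH robots.txt
// as the referer, record the current time in a per-domain marker file.
// Marker location is illustrative only.
$referer = $_SERVER['HTTP_REFERER'] ?? '';

if (stripos($referer, 'robots.txt') !== false) {
    $marker = sys_get_temp_dir() . '/cluster-lock-' . md5($_SERVER['HTTP_HOST'] ?? 'default');
    file_put_contents($marker, (string) time());   // base time for the delay window
}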

lucy24

5:06 pm on Sep 2, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, so that's where they went. I haven't seen any more since I last posted, so they must have moved along to your site :)

Did all of yours come from the same hostname (widely different IP ranges, but all belonging to the same company)? If so, a short-lived response really might work.

tangor

4:09 am on Sep 3, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Feel left out, haven't found any of the above in the last 18 months' logs. Maybe not widespread yet?

dstiles

8:10 am on Sep 3, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Did all of yours come from the same hostname

Not sure what you mean. The target was a single web site. The source IPs were all different providers - charter, comcast, oculus etc.

A simple php delay will not work, obviously, and I can't think of a simple way of checking status across several sessions. I'm going to save a base time to a file then check for expiry. If I save a new base time every time there is a robots.txt referer I can probably keep the timeout to around 15 seconds, judging from this one pattern.
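
Something like this, perhaps (a rough sketch of the file-based check, using the same marker file as the trigger sketch above; the 15 seconds is just the figure mentioned here):

<?php
// Sketch of the check: on each request, see whether a base time was
// saved recently; if so, the request falls inside the lockout window.
$marker  = sys_get_temp_dir() . '/cluster-lock-' . md5($_SERVER['HTTP_HOST'] ?? 'default');
$timeout = 15;   // seconds, per the estimate above

if (is_readable($marker) && (time() - (int) file_get_contents($marker)) < $timeout) {
    // Still inside the window: refuse (or substitute a minimal captcha).
    http_response_code(403);
    exit;
}

// Save a new base time whenever a robots.txt referer turns up again.
if (stripos($_SERVER['HTTP_REFERER'] ?? '', 'robots.txt') !== false) {
    file_put_contents($marker, (string) time());
}

Writing a fresh base time on every robots.txt-referer hit means the window keeps sliding while a cluster is active, so the 15-second figure only has to outlast the gaps between hits rather than the whole cluster.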

Just noticed: the time between loading the index file and loading each css file is around 3 or 4 seconds each, so something is deliberately loading css after some "thought" - ie it's not a browser action despite its claim to be safari, chrome etc.

lucy24

4:20 pm on Sep 3, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, oops, I misunderstood what you meant by “ON THAT DOMAIN”. I thought you meant all the requests were coming FROM an identifiable origin. Handling would depend very much on the nature of the site. On mine, f'rinstance, it is very unlikely any human would notice if the entire site simply shut down for 60 seconds.

the time between loading the index file and loading each css file is around 3 or 4 seconds each, so something is deliberately loading css after some "thought"

Yah, I've seen similar behavior in requests that get flagged as “maybe human, maybe not”, though not in the specific context of robot-clustering; those have never asked for supporting files. If it takes several seconds for a page request to be followed by a css or js request, and they never get around to images at all ... I would say I smell a rat, except that rats are actually quite nice-smelling. (“Corn chips” is a common comparison.)

dstiles

9:46 am on Sep 4, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm not sure how to go about timing css and image files.

Anyway, I've built the trap, now to wait for the mice. :)