Forum Moderators: open
that actually sounds like pretty optimal scraper behavior...

How much scraping can you do if you don't get any supporting files?
Are these current browsers, old or ancient?
Do those headers include Sec-Fetch?

Often but not always. I pulled the headers from one cluster and went over them with a fine-tooth comb for shared features, but absolutely nothing was common to all, except basics like Accept: that have to be present or the request would be blocked up front, and conversely ones like Connection: that are present in all requests without exception. I've often looked for some blockable aspect of the Sec-Fetch / Sec-CH family, but those headers are sent by too many humans, especially on Android.
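That kind of "common to all" comb-through is easy to script: intersect the sets of header names across the cluster. A minimal Python sketch (the per-request header dicts and the sample cluster here are invented for illustration, not my actual logs):

```python
from functools import reduce

def shared_headers(requests):
    """Return the header names present in every request of a cluster.

    `requests` is a list of dicts mapping header name -> value, one
    dict per logged request (a hypothetical log format).
    """
    if not requests:
        return set()
    return reduce(lambda acc, r: acc & set(r), requests[1:], set(requests[0]))

# Two invented requests from one "cluster":
cluster = [
    {"Accept": "*/*", "Connection": "keep-alive", "Sec-Fetch-Mode": "navigate"},
    {"Accept": "*/*", "Connection": "keep-alive", "Accept-Language": "en-US"},
]
print(sorted(shared_headers(cluster)))  # ['Accept', 'Connection'] -- only the basics survive
```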
Is there any order to the requests? e.g. robots.txt, root, page, or are they random-ish?

They appear to be random, though I can't swear to it, because my server's access logs are sometimes a little hiccupy: a human visit might show a slew of images at 12:01:42 and then, after those in the log, the page request at 12:01:41. So when a flurry of requests comes in during the same second, there's no way to be absolutely certain they arrived in the order shown. The order in access logs does seem to match the order in my header logs. (And the requests are just far enough apart that logged headers never get tangled up, as can happen with over-speedy robots.)
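The root of that ambiguity is that the common/combined log format only records whole seconds, so within one second the logged order is the logger's, not necessarily the network's. A quick illustration (the two log lines are invented):

```python
from datetime import datetime

def parse_ts(logline):
    # Pull the [day/mon/year:HH:MM:SS zone] field out of a combined-format line.
    raw = logline.split("[", 1)[1].split("]", 1)[0]
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")

a = '1.2.3.4 - - [09/Aug/2022:21:02:23 -0700] "GET /page HTTP/2.0" 200 100 "-" "UA"'
b = '1.2.3.4 - - [09/Aug/2022:21:02:23 -0700] "GET /style.css HTTP/2.0" 200 50 "-" "UA"'

# Identical down to the second, so log order is the only ordering available:
print(parse_ts(a) == parse_ts(b))  # True
```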
a distributed bot?

I was thinking infected individual computers, but then you'd expect more of them to come from places like Vietnam or the Philippines, not ARIN ranges. Looking at the most recent cluster, I did spot one from Bangladesh and one from Iraq, plus a few RIPE addresses, but really it varies all over the map: some lesser-known server or colo ranges, some to all appearances human IPs.
162.43.242.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 301 481 "-" "{Firefox 101}"
206.204.33.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://www.google.com/" "{Chromium 101a}"
213.188.85.abc - - [09/Aug/2022:21:02:23 -0700] "GET / HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
152.39.227.abc - - [09/Aug/2022:21:02:23 -0700] "GET /robots.txt HTTP/2.0" 200 197 "https://www.google.com/" "{Firefox 99}"
162.43.242.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw/ HTTP/2.0" 200 46167 "https://example.com/ebooks/shaw" "{Firefox 101}"
152.39.227.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
149.71.176.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://www.google.com/" "{Chromium 101a}"
193.176.22.abc - - [09/Aug/2022:21:02:23 -0700] "GET / HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
31.204.13.abc - - [09/Aug/2022:21:02:23 -0700] "GET /robots.txt HTTP/2.0" 200 197 "https://www.google.com/" "{Firefox 99}"
31.204.13.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
76.189.21.abc - - [09/Aug/2022:21:02:24 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://www.google.com/" "{Chromium 101a}"
206.204.4.abc - - [09/Aug/2022:21:02:23 -0700] "GET / HTTP/2.0" 200 7849 "-" "{Safari 14}"
73.0.139.abc - - [09/Aug/2022:21:02:24 -0700] "GET / HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
71.72.184.abc - - [09/Aug/2022:21:02:24 -0700] "GET /robots.txt HTTP/2.0" 200 197 "https://www.google.com/" "{Firefox 99}"
144.142.209.abc - - [09/Aug/2022:21:02:24 -0700] "GET / HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
71.72.184.abc - - [09/Aug/2022:21:02:24 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"
141.242.156.abc - - [09/Aug/2022:21:02:24 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://www.google.com/" "{Chromium 101a}"
208.207.171.abc - - [09/Aug/2022:21:02:25 -0700] "GET /robots.txt HTTP/2.0" 200 197 "https://www.google.com/" "{Firefox 99}"
208.207.171.abc - - [09/Aug/2022:21:02:25 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Firefox 99}"
206.204.4.abc - - [09/Aug/2022:21:02:29 -0700] "GET / HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Firefox 99}"
64.79.240.abc - - [09/Aug/2022:21:02:29 -0700] "GET / HTTP/2.0" 200 7849 "-" "{Safari 14}"
64.79.240.abc - - [09/Aug/2022:21:02:33 -0700] "GET / HTTP/2.0" 200 7849 "-" "{Safari 14}"
64.79.240.abc - - [09/Aug/2022:21:02:38 -0700] "GET / HTTP/2.0" 200 7849 "-" "{Safari 14}"
64.79.240.abc - - [09/Aug/2022:21:02:42 -0700] "GET /ebooks/shaw HTTP/2.0" 301 481 "-" "{Safari 14}"
64.79.240.abc - - [09/Aug/2022:21:02:42 -0700] "GET /ebooks/shaw/ HTTP/2.0" 200 46167 "https://example.com/ebooks/shaw" "{Safari 14}"
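The lockstep referer/agent pairing in an excerpt like the one above jumps out if you tally (referer, user-agent) pairs across IPs. A rough Python sketch run against three of the lines (the regex and field positions are mine, assuming the standard combined log format):

```python
import re
from collections import Counter

# Combined log format fields: ip, identd, user, [time],
# "request", status, bytes, "referer", "user-agent".
LOG = r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \d+ "([^"]*)" "([^"]*)"'

lines = [
    '162.43.242.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 301 481 "-" "{Firefox 101}"',
    '152.39.227.abc - - [09/Aug/2022:21:02:23 -0700] "GET /robots.txt HTTP/2.0" 200 197 "https://www.google.com/" "{Firefox 99}"',
    '152.39.227.abc - - [09/Aug/2022:21:02:23 -0700] "GET /ebooks/shaw HTTP/2.0" 403 3354 "https://example.com/robots.txt" "{Chromium 101b}"',
]

pairs = Counter()
for line in lines:
    m = re.match(LOG, line)
    if m:
        ip, method, path, status, referer, agent = m.groups()
        pairs[(referer, agent)] += 1

for (referer, agent), n in pairs.most_common():
    print(n, referer, agent)
```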
do the major bots accompany their robots.txt probes with a referer?

Never that I can think of. In fact, one purpose of my robots.php rewrite is to screen requests. (This is a recent addition. Originally I did the PHP rewrite so I could #1 log headers and #2 include a single shared robots file for all sites.) A robots.txt request gets blocked if it includes one or more of:
$_SERVER['HTTP_REFERER'] || $_ENV['noagent'] || $_ENV['bad_agent'] || $_ENV['bad_range'] || $_ENV['lying_bot']

As I noted above, I put in a block if robots.txt appears in the referer. It would never have occurred to me to list robots.txt as bad_ref ... until this cluster started using it.
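That condition reads as one simple boolean test. A Python paraphrase, not the actual PHP, with plain dicts standing in for $_SERVER and $_ENV (the flag names mirror the ones above; everything else is invented):

```python
def suspicious_robots_request(headers, env):
    """Flag a robots.txt request for blocking, per the conditions above.

    `headers` and `env` are plain dicts standing in for PHP's
    $_SERVER and $_ENV superglobals.
    """
    return bool(
        headers.get("Referer")      # humans don't "follow a link" to robots.txt
        or env.get("noagent")
        or env.get("bad_agent")
        or env.get("bad_range")
        or env.get("lying_bot")
    )

print(suspicious_robots_request({"Referer": "https://www.google.com/"}, {}))  # True
print(suspicious_robots_request({}, {}))  # False
```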
The time between loading the index file and loading each CSS file is around 3 or 4 seconds, so something is deliberately loading the CSS after some "thought".