Spike of Chrome robots.txt requests

Unwitting scanning?

         

Pfui

4:47 pm on Jul 1, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Time was, 99.9% of robots.txt hits came from known entities, à la Google, Facebook, Twitter, and miscellaneous independent bot runners using, oh, Screaming Frog SEO Spider and the like.

But in the last 30 days, wow. Day in and day out, hits galore, and only to robots.txt, from what appear to be individual accounts. And all one-offs, as if someone, or some mobile program, is actually seeking the file, and reading it, and heeding it, and leaving. But how can that be? Suddenly everybody's a snoop, or worse, a maybe-scraper, AND a respectful one? Nah.

So what do you think is going on? Does anyone know if a newish spoof seeks robots.txt behind the scenes? How handy for someone to silently deploy an army of unwitting file scanners...

Okay, on to the geeky deets. The primary commonalities are "Linux; Android" (old versions) plus too-old Chromes (like 50-plus major versions behind), whether the hit comes from India or Estonia. Example UAs from just the past few days; I'm including a lot, though not all, in case you see clues or patterns --

Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.6973.1483 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.7715.1557 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.7534.1156 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.7765.1774 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.1871.1614 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.3228.1273 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.4891.1444 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.4890.1101 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.4145.1757 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.4440.1373 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.4140.1260 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.7916.1732 Mobile Safari/537.36

Thoughts? Thanks!
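For anyone who wants to pull the same kind of hits out of their own logs, here is a minimal sketch of the "Linux; Android plus too-old Chrome" test described above. The names `chrome_major` and `is_stale_mobile_chrome`, and the defaults `current_major=114` and `max_lag=50`, are assumptions for illustration (114 is roughly where stable Chrome sat around the time of this thread); adjust them to taste.

```python
import re

# Captures the major version from a "Chrome/NN." token in a UA string.
CHROME_RE = re.compile(r"Chrome/(\d+)\.")

# One of the UAs quoted in the post above.
SAMPLE_UA = (
    "Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.4890.1101 "
    "Mobile Safari/537.36"
)


def chrome_major(ua):
    """Return the Chrome major version as an int, or None if absent."""
    match = CHROME_RE.search(ua)
    return int(match.group(1)) if match else None


def is_stale_mobile_chrome(ua, current_major=114, max_lag=50):
    """True for Android UAs whose Chrome major is max_lag or more behind.

    current_major and max_lag are assumed values, not taken from the thread.
    """
    major = chrome_major(ua)
    if major is None:
        return False
    return "Linux; Android" in ua and (current_major - major) >= max_lag


if __name__ == "__main__":
    print(chrome_major(SAMPLE_UA), is_stale_mobile_chrome(SAMPLE_UA))
```

This only looks at the UA string, so it flags the pattern and says nothing about who is really behind the request.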

not2easy

1:30 pm on Jul 2, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If these are sorted into a different order:

Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.6973.1483 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.7765.1774 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.4145.1757 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.7715.1557 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.7534.1156 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.1871.1614 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.3228.1273 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.4440.1373 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.4140.1260 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.7916.1732 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.4891.1444 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.4890.1101 Mobile Safari/537.36

They seem to follow patterns based on the Android version (each Android version always pairs with the same device and build), EXCEPT for the Chrome vintage, which doesn't fit anything. So, blame AI?
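The regrouping above can also be done mechanically. A rough sketch, assuming the UAs all follow the "(Linux; Android X; DEVICE Build/BUILD) ... Chrome/VERSION" shape seen in this thread; `parse_ua` and `group_by_device` are hypothetical helper names, and the sample list is a subset of the strings already posted.

```python
import re
from collections import defaultdict

# Matches the Android/device/build block and the Chrome version token.
UA_PATTERN = re.compile(
    r"\(Linux; Android (?P<android>[\d.]+); (?P<device>.+?) "
    r"Build/(?P<build>[^)]+)\).*?Chrome/(?P<chrome>[\d.]+)"
)

SAMPLES = [
    "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.6973.1483 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.7715.1557 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.7534.1156 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.4891.1444 Mobile Safari/537.36",
]


def parse_ua(ua):
    """Split a UA into its Android, device, build and Chrome parts."""
    match = UA_PATTERN.search(ua)
    return match.groupdict() if match else None


def group_by_device(uas):
    """Map (android, device, build) to the list of Chrome versions seen."""
    groups = defaultdict(list)
    for ua in uas:
        info = parse_ua(ua)
        if info is not None:
            key = (info["android"], info["device"], info["build"])
            groups[key].append(info["chrome"])
    return dict(groups)


if __name__ == "__main__":
    for key, versions in group_by_device(SAMPLES).items():
        print(key, versions)
```

If the fixed part of each UA really is a template, every key should show one Android/device/build combination with a scatter of unrelated Chrome versions hanging off it.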

lucy24

2:54 pm on Jul 2, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Pfui wrote: "as if someone, or some mobile program, is actually seeking the file, and reading it, and heeding it, and leaving"
Can you expand on the “and heeding it”? After robots.txt, what requests do or don’t they make?

I could readily envision a humanoid malign robot heading straight for robots.txt to learn the names of disallowed directories that it might otherwise not know about. But that doesn’t seem to be your case.

Does the site’s robots.txt say anything about humanoid user-agents? I’ve taken to setting an environment variable called lying_bot if the UA contains some basic elements like “Chrome” or “Firefox” (turned off for a handful of legit robots, including bing and G###, whose full UA string includes one of these). It isn’t used for access control, but it is used to decide whether they see the “real” robots.txt or a minimalist version that simply Disallows everyone.
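The lying_bot idea can be sketched outside the server config too. This is only an illustration of the logic, not the actual setup described above: the function names are made up, and the exemption tokens ("bingbot" plus a placeholder "examplebot") are assumptions standing in for whatever legit robots a site chooses to trust.

```python
# Browser-ish tokens that a robots.txt requester has little reason to carry.
BROWSER_TOKENS = ("Chrome", "Firefox")

# Assumed exemption list: legit robots whose full UA includes a browser token.
# "examplebot" is a placeholder, not a real crawler.
EXEMPT_ROBOT_TOKENS = ("bingbot", "examplebot")

# Minimalist version that simply Disallows everyone.
MINIMAL_ROBOTS_TXT = "User-agent: *\nDisallow: /\n"


def is_lying_bot(user_agent):
    """True when the UA looks like a browser and is not an exempt robot."""
    lowered = user_agent.lower()
    if any(token in lowered for token in EXEMPT_ROBOT_TOKENS):
        return False
    return any(token in user_agent for token in BROWSER_TOKENS)


def robots_txt_for(user_agent, real_robots_txt):
    """Pick which robots.txt body to show; no access control is applied."""
    if is_lying_bot(user_agent):
        return MINIMAL_ROBOTS_TXT
    return real_robots_txt
```

As in the description above, the flag only decides which file is shown; it does not block anything.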

Pfui

4:39 pm on Jul 2, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



not2easy: Shouldn't AI be smart enough to mimic more recent OS and/or Chrome versions? :)

lucy: I've got a similar system going on. All requests for robots.txt are routed to a CGI script that serves up assorted versions of a robots.txt-named file depending on who or what is asking. If the requester is whitelisted, they get their custom version. If they or their UA aren't okay, they get:

# The use of robots or other automated means to access this site
# without the express permission of this site is strictly prohibited. ... [etc.]

User-agent: *
Disallow: /
Disallow: /robots.txt

(The default version also includes a handful of instructions for old bots with atypical requirements.)

So this new crop of 'visitors' gets basically nothing but a No. But neither do they ask for anything else. And then they're gone. Since when are so many so well-behaved? Uh, never.
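For readers who haven't set up this kind of routing, here is a bare-bones sketch of the idea in Python. It is not the script described above: `CUSTOM_VERSIONS`, the "examplebot" entry and the function names are all hypothetical, and the deny-all notice is abbreviated to what was quoted in the post.

```python
import os
import sys

# Default body for anyone not whitelisted (notice abbreviated as quoted above).
DENY_ALL = (
    "# The use of robots or other automated means to access this site\n"
    "# without the express permission of this site is strictly prohibited.\n"
    "\n"
    "User-agent: *\n"
    "Disallow: /\n"
    "Disallow: /robots.txt\n"
)

# Hypothetical whitelist: UA token -> custom robots.txt body.
CUSTOM_VERSIONS = {
    "examplebot": "User-agent: examplebot\nDisallow: /private/\n",
}


def choose_robots_txt(user_agent):
    """Return the custom version for a whitelisted UA, else the deny-all."""
    lowered = user_agent.lower()
    for token, body in CUSTOM_VERSIONS.items():
        if token in lowered:
            return body
    return DENY_ALL


def cgi_response(environ):
    """Build a CGI response (header, blank line, body) from the environment."""
    body = choose_robots_txt(environ.get("HTTP_USER_AGENT", ""))
    return "Content-Type: text/plain; charset=utf-8\r\n\r\n" + body


if __name__ == "__main__":
    # Under CGI the server passes the UA in the HTTP_USER_AGENT variable.
    sys.stdout.write(cgi_response(os.environ))
```

A real version would presumably also check where the request comes from, since a UA token alone is easy to fake.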

not2easy

4:57 pm on Jul 2, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Pfui wrote: "Shouldn't AI be smart enough to mimic more recent OS and/or Chrome versions?"
I'm sure it could be, but AI is only as good as its instructions are. ;)

lucy24

12:54 am on Jul 3, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Pfui wrote:
Disallow: /
Disallow: /robots.txt
(a) Isn't that redundant?
(b) Shouldn't everyone be allowed to see robots.txt (and why Disallow it when, by definition, they are already there)?
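On question (a), a tiny sketch of plain prefix matching shows why the second line adds nothing for a rule-following robot. It assumes the simplest reading of Disallow (a path is blocked if it starts with any non-empty rule) and ignores Allow lines and wildcards; `is_disallowed` is a made-up helper name. On (b), RFC 9309 states that the /robots.txt URI is implicitly allowed, so a compliant crawler should fetch it regardless of either line.

```python
def is_disallowed(path, disallow_rules):
    """Simplified check: blocked if the path starts with any non-empty rule."""
    return any(rule != "" and path.startswith(rule) for rule in disallow_rules)


# The two rule sets under discussion.
SLASH_ONLY = ["/"]
SLASH_PLUS_ROBOTS = ["/", "/robots.txt"]

if __name__ == "__main__":
    for path in ("/", "/robots.txt", "/some/page.html"):
        print(path, is_disallowed(path, SLASH_ONLY),
              is_disallowed(path, SLASH_PLUS_ROBOTS))
```

Both rule sets give the same answer for every path, because "/" is a prefix of every path, "/robots.txt" included.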

Pfui

2:30 am on Jul 3, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



- The specific robots.txt file Disallow is for anyone/thing tempted to include the file or any of its contents in compilations/SERPs/whatevers. Years ago it was a thing to include the line, a kind of belt-and-suspenders step akin to Nothing to See Here + Forget You Were Here. I don't see any reason to strip it out now, do you?

- Everyone is allowed to see their curated (heh) robots.txt file. Well, almost everyone. Some chronically bad sources -- e.g., AWS -- are flatly denied access to everything, robots.txt included, because on the rare occasions they ask for it, they ignore it anyway.