Four-Minute Bytes

Dive with me into the AWS weeds for a moment...

         

Pfui

6:27 pm on Sep 5, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In recent months, I've noticed a strange dance performed by none other than AWS (always ap-southeast-1) and Bytespider. Initially the pattern involved a robots.txt-only hit from AWS running Bytespider, followed by a robots.txt-only hit from a non-AWS address (let's call those Independents) running a non-bot UA. This went on for ages until recently when the pattern changed to two AWS+Byte hits together, no Indies.

Here's the interesting bit: the timing.

Every. Single. Time. the second robots.txt-only hit follows the first by four minutes, give or take a few seconds (the gaps in the examples below run from 3:57 to 4:15). In ALL cases, and I've tracked many scores of these: first the original AWS+Indie pairs, and thereafter the AWS+Byte pairs. Four minutes.

Well, I think it's interesting:)

Those of you who save your logs, take a look back at ".ap-southeast-1.compute.amazonaws.com" hits running Bytespider and asking for robots.txt. Then count forward four minutes, within a few seconds either way. See? Anybody?
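
For anyone who would rather script the look-back than eyeball it, here is a minimal Python sketch of that check. It assumes the log has already been parsed into records. The `Hit` tuple, the `find_pairs` name and the 20-second tolerance are illustrative choices, not anything taken from the original logs or tools.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed log line (hypothetical record layout)."""
    host: str
    user_agent: str
    path: str
    when: datetime


def is_aws_bytespider(hit):
    """True for a Bytespider UA coming from an ap-southeast-1 EC2 host."""
    return (hit.host.endswith(".ap-southeast-1.compute.amazonaws.com")
            and "Bytespider" in hit.user_agent)


def find_pairs(hits, target=timedelta(minutes=4),
               tolerance=timedelta(seconds=20)):
    """Pair each AWS+Bytespider robots.txt hit with any later robots.txt
    hit (from any host) that lands within `tolerance` of `target`."""
    robots = sorted((h for h in hits if h.path == "/robots.txt"),
                    key=lambda h: h.when)
    pairs = []
    for i, first in enumerate(robots):
        if not is_aws_bytespider(first):
            continue
        for second in robots[i + 1:]:
            gap = second.when - first.when
            if gap > target + tolerance:
                break  # sorted by time, so nothing later can match
            if gap >= target - tolerance:
                pairs.append((first, second, gap))
    return pairs


# The thread gives times of day only, so no calendar date is assumed here.
def at(clock):
    return datetime.strptime(clock, "%H:%M:%S")


BYTESPIDER_UA = ("Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 "
                 "(KHTML, like Gecko) Mobile Safari/537.36 (compatible; "
                 "Bytespider; spider-feedback@bytedance.com)")

sample = [
    Hit("ec2-47-128-54-170.ap-southeast-1.compute.amazonaws.com",
        BYTESPIDER_UA, "/robots.txt", at("21:05:17")),
    Hit("ec2-47-128-52-93.ap-southeast-1.compute.amazonaws.com",
        BYTESPIDER_UA, "/robots.txt", at("21:01:18")),
]

for first, second, gap in find_pairs(sample):
    print(gap)  # prints 0:03:59
```

The second hit is matched regardless of host or UA, so the same search should also pick up the earlier AWS+Indie pattern. Widen or narrow `tolerance` to taste.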

EXAMPLES: AWS+Byte & AWS+Byte (in each pair, the later hit is listed first)

ec2-47-128-54-170.ap-southeast-1.compute.amazonaws.com
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
/robots.txt GET: 21:05:17
ec2-47-128-52-93.ap-southeast-1.compute.amazonaws.com
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
/robots.txt GET: 21:01:18

ec2-47-128-27-73.ap-southeast-1.compute.amazonaws.com
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
/robots.txt GET: 17:44:20
ec2-47-128-48-133.ap-southeast-1.compute.amazonaws.com
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
/robots.txt GET: 17:40:23

EXAMPLES: AWS+Byte & Indie (in each pair, the later Indie hit is listed first)

86-46-71-xxx-dynamic.agg2.ety.prp-wtd.eircom.net
Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.4324.1180 Mobile Safari/537.36
/robots.txt GET: 18:08:42
ec2-47-128-54-63.ap-southeast-1.compute.amazonaws.com
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
/robots.txt GET: 18:04:27

203.94.xx.x
Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.1244.1462 Mobile Safari/537.36
/robots.txt GET: 12:28:28
ec2-47-128-34-174.ap-southeast-1.compute.amazonaws.com
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
/robots.txt GET: 12:24:26

(Personally, I find the latter examples more concerning: were the non-AWS twins knowingly involved, or were their addresses just picked at random to pair up? Jes' musing:)

lucy24

6:54 pm on Sep 5, 2024 (gmt 0)


Ooh, I love delving into robotic behavior. But for those of us who keep our logs in aa.bb.cc.dd format, how do you translate all those hosts? (I used to have a series of conversion patterns, for use when I goofed in htaccess and everything was thrown into lookup mode, but haven't needed it in a while.)

Meanwhile I looked up /robots.txt requests from Bytespider-any-IP, followed within a reasonable time period by any robots.txt. Nothing, darn it. Keep us posted! I’m entranced.
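
On translating the hosts back to aa.bb.cc.dd: EC2 public host names of the form seen in this thread carry the IPv4 address in the name itself, so a pattern match is enough for those. Here is a minimal Python sketch, assuming that naming form. The function name `ec2_host_to_ip` is made up for illustration, and other host names (such as the eircom one) would need a DNS lookup instead.

```python
import ipaddress
import re

# Matches the leading "ec2-a-b-c-d." label of an EC2 public host name.
EC2_HOST = re.compile(r"^ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.")


def ec2_host_to_ip(host):
    """Return the dotted IPv4 address embedded in an EC2 host name,
    or None if the name does not follow the ec2-a-b-c-d pattern."""
    match = EC2_HOST.match(host)
    if not match:
        return None
    candidate = ".".join(match.groups())
    try:
        # Validates each octet (rejects e.g. 999) and normalises the text.
        return str(ipaddress.IPv4Address(candidate))
    except ipaddress.AddressValueError:
        return None


print(ec2_host_to_ip(
    "ec2-47-128-54-170.ap-southeast-1.compute.amazonaws.com"))
# prints 47.128.54.170
```

With the addresses in hand, a search over aa.bb.cc.dd logs can be restricted to the same hosts listed in the examples.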

Pfui

12:58 am on Sep 6, 2024 (gmt 0)


- I don't translate the Host names into their IPs every time only because it's always been easier for me to regard hits by names, not numbers. (I also use an ancient Perl script that reads my access_log and presents the data in a great grid so I can see X number of lines by visitor+files hit, not just an endless jumble of single hits. That's how the four-minute thing jumped out.) If I need to boil down a Host name for killfiling its IP or range/CIDR, I throw it into a whois like domaintools, or my current data-rich favorite: [myip.ms...]

- Thanks for having a look-see -- I figured you might want to:)
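
For readers without a script like that, the grouped-by-visitor view can be roughly approximated in a few lines of Python. This is only a sketch of the general idea, not the Perl script mentioned above. The `(host, path, time)` tuple layout and the `group_by_visitor` and `render` names are assumptions for illustration.

```python
from collections import defaultdict


def group_by_visitor(hits):
    """Collect (host, path, time_string) tuples into one time-sorted
    list of (time, path) rows per host."""
    grid = defaultdict(list)
    for host, path, when in hits:
        grid[host].append((when, path))
    return {host: sorted(rows) for host, rows in grid.items()}


def render(grid):
    """Format the grouped hits as an indented plain-text listing."""
    lines = []
    for host, rows in grid.items():
        lines.append(f"{host} ({len(rows)} hits)")
        for when, path in rows:
            lines.append(f"    {when}  {path}")
    return "\n".join(lines)


hits = [
    ("ec2-47-128-52-93.ap-southeast-1.compute.amazonaws.com",
     "/robots.txt", "21:01:18"),
    ("ec2-47-128-54-170.ap-southeast-1.compute.amazonaws.com",
     "/robots.txt", "21:05:17"),
]
print(render(group_by_visitor(hits)))
```

A visitor whose whole visit is a single robots.txt line stands out quickly in a listing like this, which is the kind of pattern the four-minute pairs are built from.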