Seeing some strange hits today

Bytespider for one

         

SumGuy

12:15 am on Jun 8, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



Seeing some strange hits today.

First, several dozen hits from this bot:

Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)

operating from 156.59.198.135 (Zenlayer, Singapore). It grabbed several dozen PDF files, and it knew the paths to get them. It did not grab any HTML files and did not request robots.txt. I had been blocking some Zenlayer IPs, but now the entire AS21859 is blocked.

Also today, about 53 hits from these User-Agents:

Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.9279.1304 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.3165.1420 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.3566.1039 Mobile Safari/537.36

Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.1095.1898 Mobile Safari/537.36

About a dozen hits from each one. The Chrome version varied from 39 to 59; otherwise the UAs were identical.

I am going to scan my logs for those UA's, results in a few days.

These hits were also getting PDF files almost exclusively, sometimes an HTML file (but not the accessory files that a human browser would be requesting to render the page). No referrer.
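A log scan for this pattern (PDF request, no referrer) could look roughly like the sketch below. It assumes the server writes the common Apache/Nginx "combined" log format; the sample line, its IP (a documentation address) and the path are made up for illustration, not taken from the actual logs.

```python
import re

# Assumed "combined" log format; the group names are illustrative.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"'
)


def is_suspicious(line: str) -> bool:
    """Return True for a PDF request that arrives with no referrer."""
    m = LOG_LINE.match(line)
    if m is None:
        return False  # not a combined-format line; ignore it
    path = m.group("path").split("?", 1)[0].lower()
    return path.endswith(".pdf") and m.group("referrer") in ("", "-")


# Hypothetical log line, built from one of the UAs quoted above.
sample = (
    '203.0.113.7 - - [08/Jun/2023:00:15:00 +0000] '
    '"GET /docs/manual.pdf HTTP/1.1" 200 51234 "-" '
    '"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/39.0.3566.1039 Mobile Safari/537.36"'
)
print(is_suspicious(sample))
```

This only flags candidates. A human following a bookmarked PDF link also sends no referrer, so the result is a list to eyeball rather than a block list.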

The countries these hits came from were Australia, Canada, Ireland, New Zealand, UK and US (about 8 to 10 for each). I'm blocking a lot of the third world (in the router) so it makes these hits from western countries stand out. The IP's belonged to what looks like residential/commercial ISP's (ACCESS-SK, ATT, Bell Canada, BT (UK), BT (Ireland), Comcast, Eir Broadband, Foxtel, Mercury NZ, Microplex, One New Zealand Group, Rogers, SASKTEL, Sky UK Limited, SPACEX-STARLINK, Spark New Zealand Trading, Telstra, Vodafone Australia, Vodafone Ireland).

I have never seen anything like this. I think this is related to the Bytedance hits. Bytedance, aka TikTok.

I think Bytedance is using the TikTok app on people's phones to access files that perhaps they (China) are having problems reaching. For the past few years I've been blocking every Chinese /16 net-block that I see hitting my router / webserver / mailserver.

The only alternate explanation is that there is a new bot or malware operating on people's cell phones. Or at least made to look like the hits are coming from cell phones, from residential / commercial IP's.

lucy24

3:49 am on Jun 8, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Interesting. All the Bytespiders I find in logs include
https://zhanzhang.toutiao.com/
in the UA string, even where nothing else--headers, IP--is the same. That makes me suspect at least some of them are spoofed, for all the good it does them. In particular, there is the blizzard of hits from a putative Bytespider with UA beginning in 'Mozilla (botrunner's cat stepped on the keyboard at an inopportune moment, leading to yet another reason to block).

SumGuy

12:12 pm on Jun 9, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



The string fragment Build/OPD3.170816.012 is diagnostic for the Bytespider.

It shows up in the User-Agent used by Byte Dance's Bytespider seen in hits from 2018, 2019, 2020 and 2022 from these IP's:

220.243.136.68
220.243.135.124
220.243.136.235
111.225.148.134
111.225.148.138
111.225.148.149
52.80.105.123

The above (except for the AWS IP) have host-names of bytespider-(IP).crawl.bytedance.com.

An example user-agent is:

Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.8963.1615 Mobile Safari/537.36

When testing for bytespider hits, the chrome version is variable and ranges from 39.x to 59.x

I have yet to test, but I theorize these are also diagnostic for Bytespider:

Build/LRX21T
Build/MRA58N
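As a rough sketch of how that fingerprint could be tested, the function below combines the three build strings with the Chrome 39 to 59 range. Note the assumptions: Build/LRX21T and Build/MRA58N are only theorized above to be diagnostic, and the function and constant names are invented for this example.

```python
import re

# Build/OPD3.170816.012 is the string observed above; the other two
# are untested guesses from the post, included here as assumptions.
BUILD_TOKENS = ("Build/OPD3.170816.012", "Build/LRX21T", "Build/MRA58N")
CHROME_MAJOR = re.compile(r"Chrome/(\d+)\.")


def matches_fingerprint(ua: str) -> bool:
    """True if the UA carries a listed build token and Chrome 39-59."""
    if not any(token in ua for token in BUILD_TOKENS):
        return False
    m = CHROME_MAJOR.search(ua)
    return m is not None and 39 <= int(m.group(1)) <= 59


example = (
    "Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.8963.1615 "
    "Mobile Safari/537.36"
)
print(matches_fingerprint(example))
```

Real handsets once sent these build strings too, so a match is a hint and not proof.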

Because of the recent emergence of web hits from commercial / residential IP space in several western countries that include Build/OPD3.170816.012, I believe this constitutes strong evidence that Byte Dance is using their Tik Tok app (which, as far as I know, is only available for cell phones and not PCs) as a web proxy for their byte spider search engine.

This would probably be done if Byte Dance had a strong desire to shield the source of their bot's file requests or to get around IP blocks that web servers may have in place to limit access.

lucy24

5:52 pm on Jun 11, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, I spoke too soon. In the last few days' logs I find a clutch of the Bytespiders you cited in the first post--the ones with MobileSafari and bytedance in the UA string. They come from a variety of IPs that can generally but not always be summed up as The Usual Suspects, generally but not always with some header deficits. I only noticed today because one of them had no header deficits and came from an IP that wasn't flagged as bad_range, so it got through. (In my access controls, bad_range is an environment variable that may be unset to accommodate some wide-ranging crawlers.)

I hope they don't make a habit of it, or onto the bad_agent list it goes.

martinibuster

7:46 pm on Jun 11, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I believe this constitutes strong evidence that Byte Dance is using their Tik Tok app... as a web proxy for their byte spider search engine.


Bytedance operates other apps besides Tik Tok. It's highly unlikely to be from Tik Tok given the extreme scrutiny that Tik Tok receives. It is far likelier that one of its other apps with hundreds of millions of users is doing this, IF that's what is happening.

lucy24

8:14 pm on Jun 11, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Adding insult to injury: One of the recent hits I noticed was to my contact page, which (a) is roboted-out and (b) cannot possibly have been referenced in any kind of human communication.

martinibuster

1:15 am on Jun 12, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Ah, okay. I may be right, Lucy24 posted the hint.

toutiao is a Bytedance app. It's a news aggregator. So it's probably not Tik Tok, it may be toutiao, if that's what is getting referenced.

[en.wikipedia.org...]

SumGuy

1:58 pm on Jun 12, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



I made the observation that recent hits where the user-agent contained

OPD3.170816.012
Build/LRX21T
Build/MRA58N

were suspicious - no referrer, and in my case they were after pdf files AND robots.txt (now that I look at port 80 hits), and favicon.ico was not retrieved. Hits from a few years ago from confirmed Bytespider IPs also contained the same strings. Prior to 2018, such as 2015 - 2017, I do see a few valid hits using those strings (my search limited to 2015 to present).

This seems to have been a short-lived phenomenon: after seeing these hits during the first week of June, I'm not seeing them in the last few days. I'm also now blocking the Zenlayer IP range containing 156.59.198.135 and I know that IP (operating the ByteSpider) has been trying (and failing) to reach my server since the block.

I would be interested to know if anyone else has seen the above Build strings in web hits and if they look "robotic" or not.

I have zero knowledge of ByteDance beyond their connection with TikTok. Do they operate a web search portal interface?

martinibuster

5:25 pm on Jun 12, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



No, they don't operate a search portal. Just news, social media and video editing tools.
It's weird that they're looking for PDFs.

The fact that it's using Zenlayer, which provides servers to VPNs, makes it suspicious, because hackers and spammers both use VPNs.

Maybe it's not Bytedance but rather a hacker that's spoofing that user agent in the hopes that firewalls will let it through?

I would consider those visits as hacking probes.

lucy24

6:33 pm on Jun 12, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Looking at the ones in my logs--both the ordinary toutiao ones and the recent bytedance--I'm inclined to think a great many of them are spoofed. The wide range of IPs and the inconsistent headers both point that way.

Similarly I see a lot of fake baiduspider. I suppose it must be on the same principle as the fake googlebot that was so common a few years ago (but has now largely disappeared as botrunners realize how counterproductive it is). This implies that there exists a legitimate Bytespider that is common enough to be worth spoofing, though maybe it only operates in some regions.

SumGuy

2:04 pm on Jun 13, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



More sightings:

[wordpress.org...]

See also:

[udger.com...]

Pfui

3:08 am on Jun 30, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Belatedly... I've had hits from the Bytespider UA for about two months, only courtesy of AWS: ".ap-southeast-1.compute.amazonaws.com" and ".ap-southeast-2.compute.amazonaws.com". Files sought range from html and plain text to cgi; no pdfs, no graphics.

The malformed UA has a single leading apostrophe and no space after its intra-string closing apostrophe, ditto before its URL:

'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Mobile Safari/537.36'Bytespider;https://zhanzhang.toutiao.com/

Out of approx. 20 hits total, it requested robots.txt three times. Always ignored.

lucy24

5:55 pm on Jun 30, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



a single leading apostrophe and no space after its intra-string closing apostrophe
Oh, I hadn't noticed the second apostrophe before--the leading one was obvious--nor yet the missing space after the semicolon. This makes me wonder if the UA string is actually constructed in the form

'some-human-UA-here'Bytespider;real-bytespider-url
(i.e. apostrophe + string + apostrophe + Bytespider + semicolon + URL)

via some standard Build Your Robot script. The “real” Bytespider--in my logs, less than half as common as the fakers--has a space after the semicolon.
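To make that guess concrete, here is a small sketch that tests a UA against exactly that shape (apostrophe + string + apostrophe + Bytespider + semicolon + URL with no space). The pattern is only my reading of the one sample quoted above, and the function name is invented for the example.

```python
import re

# 'some-human-UA-here'Bytespider;url  (no space after the semicolon)
MALFORMED = re.compile(r"^'(?P<inner>[^']*)'Bytespider;(?P<url>\S+)$")


def split_malformed(ua: str):
    """Return (inner_ua, url) if the UA has the malformed shape, else None."""
    m = MALFORMED.match(ua)
    if m is None:
        return None
    return m.group("inner"), m.group("url")


sample = (
    "'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 "
    "Mobile Safari/537.36'Bytespider;https://zhanzhang.toutiao.com/"
)
print(split_malformed(sample))
```

A UA with a space after the semicolon, or without the wrapping apostrophes, returns None here, which is how the sketch separates the two variants.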

SumGuy

12:13 am on Jul 6, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



I'm looking for LRX21T or MRA58N in the user-agent and if I see it, I'm giving them a 410 html error page.

I see that once in a while, it was more common a few weeks ago. They're only going for pdf files on my site.

Frank_Rizzo

7:37 pm on Oct 18, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I started getting crawled by this a few months ago. First I 403'd any UA that contained Bytedance or Bytespider. They still crawled, so I started blocking complete IP ranges. It went quiet for a month, but now they are back using what looks like an unlimited number of genuine home user / mobile phone IPs. IP blocking those would be impossible, as there are hundreds if not thousands in a few days.

I noticed too that the UAs were mostly those of phones, especially older ones, so now I am back to 403'ing UAs that contain these:

LRX21T
OPD3
iPhone OS 11_0
MRA58N
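For anyone doing this at the application layer instead of in the web server, the token list above could be applied with something like the WSGI wrapper below. This is a Python stand-in, not the setup described in this post; the function names and the demo app are invented for the example.

```python
# Tokens from the list above; matching is a plain substring test.
BLOCKED_TOKENS = ("LRX21T", "OPD3", "iPhone OS 11_0", "MRA58N")


def block_by_user_agent(app, tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 when the User-Agent has a token."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(token in ua for token in tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Hypothetical application that the middleware protects."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


def status_for(ua: str) -> str:
    """Run one fake request through the wrapped app; return its status."""
    seen = []
    wrapped = block_by_user_agent(demo_app)
    wrapped({"HTTP_USER_AGENT": ua}, lambda status, headers: seen.append(status))
    return seen[0]
```

Substring matching this broad will also catch any genuine visitor still on one of those old phones, which is the trade-off being accepted here.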

I am always up for a whack-a-mole fight and usually prevail but these are getting boring now. I just wish they would p*** off.

mrgood

7:00 pm on Oct 20, 2023 (gmt 0)



They actually DDoSed sites I am admin for by overfilling the HTTP logs. This bot sucks up articles and images from different IPs at the same time without any delays. Most IPs are from 47.128.0.0/16, which is Amazon's Singapore cloud. A few others are 156.59.198.135 and 156.59.198.136.

I suppose they suck images for their generative AI.

Some info from their site:

[...]
Copyright Infringement

We do not allow any content that infringes copyright. The use of copyrighted content of others without proper authorization or legally valid reason may lead to a violation of TikTok's policies.

At the same time, not all unauthorized uses of copyrighted content constitute an infringement. In many countries, exceptions to copyright infringement allow the use of copyrighted works under certain circumstances without authorization. These include the fair use doctrine in the United States and permitted acts of fair dealing in the European Union (and other equivalent exceptions under applicable local laws in other countries).
[...]

But they themselves steal content... Of course they choose the second paragraph for themselves.

User Agents:

47.128.0.0/16
"Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"

156.59.198.135 156.59.198.136
"Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"

i.e. the same for both.

Bewenched

9:24 pm on Dec 1, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just got hammered with this one from the following IP addresses, and it never once asked for robots.txt. Insta ban now.
110.249.201.207
110.249.201.208
110.249.201.232
110.249.201.233
110.249.201.236
110.249.201.237
110.249.201.238
110.249.201.241

not2easy

10:36 pm on Dec 1, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



That is China's Unicom:
110.240.0.0/12
110.240.0.0 - 110.255.255.255
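A quick way to double-check that range and confirm the reported addresses fall inside it, using Python's standard ipaddress module (the variable and function names are just for this example):

```python
import ipaddress

UNICOM_BLOCK = ipaddress.ip_network("110.240.0.0/12")

# The addresses reported in the previous post.
REPORTED = [
    "110.249.201.207", "110.249.201.208", "110.249.201.232",
    "110.249.201.233", "110.249.201.236", "110.249.201.237",
    "110.249.201.238", "110.249.201.241",
]


def in_block(ip: str, block=UNICOM_BLOCK) -> bool:
    """True if the address lies inside the given network."""
    return ipaddress.ip_address(ip) in block


print(UNICOM_BLOCK[0], "-", UNICOM_BLOCK[-1])    # 110.240.0.0 - 110.255.255.255
print(all(in_block(ip) for ip in REPORTED))      # True
```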

dstiles

10:00 am on Dec 2, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All the chrome versions mentioned above are obsolete versions. I block all chrome versions below 100 or so.

I also block bytewhatever.

It's worth running MaxMind's GeoIP and blocking on country such as RU, CN etc.

Bewenched

5:26 pm on Dec 2, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I finally had to block Bytespider, they hit our server with over 20,000 sessions in a matter of 20 minutes.
I'd hoped to have some traffic from tictoc or whatever, but in the last 3 months not enough to justify the massive load they put on our server.
I blocked it by the user agent.

tangor

11:01 pm on Dec 3, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I wonder why some more than others are hammered. I had a total of 37 hits from bytedance in November and all honored robots.txt.

SumGuy

12:31 am on Dec 8, 2023 (gmt 0)

5+ Year Member Top Contributors Of The Month



You will note that the IP's mentioned by @Bewenched all resolve to bytespider-ip-address.crawl.bytedance.com.

Which turns on this lightbulb in my head: Is it possible, within the confines of a hosted environment, to block access to your website based on a regular-expression match that you can make against a host-name IP lookup?
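In application code, the idea might be sketched as below: reverse-resolve the IP, match the host-name against a regular expression, then forward-resolve the name to confirm it maps back to the same IP. The exact shape of the IP part of the host-name is an assumption (the pattern accepts digits, dots and dashes), the function names are invented, and whether a given hosted environment permits DNS lookups like this is a separate question.

```python
import re
import socket

# Assumed shape: bytespider-<ip-ish>.crawl.bytedance.com
BYTESPIDER_HOST = re.compile(
    r"^bytespider-[0-9.\-]+\.crawl\.bytedance\.com\.?$", re.IGNORECASE
)


def reverse_lookup(ip: str):
    """PTR lookup; returns the host-name or None on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(host: str):
    """A-record lookup (IPv4 only); returns a list of addresses."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def is_bytespider_host(ip: str, reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """True if the IP reverse-resolves to a matching name that resolves back."""
    host = reverse(ip)
    if host is None or not BYTESPIDER_HOST.match(host):
        return False
    return ip in forward(host)
```

The resolver functions are passed in as parameters so the check can be exercised with canned answers, and so results can be cached instead of hitting DNS on every request.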

lucy24

1:16 am on Dec 8, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is it possible, within the confines of a hosted environment,
I would think so--but don't do it unless you are already doing a hostname lookup on every request, as it's a bit of a strain on the server.

:: brief run to apache docs [httpd.apache.org] ::

Both Require host and Require forward-dns claim to work only with complete hostnames, but I think that's all we need. Otherwise you're looking at Require expr which can be messy--especially in htaccess, where it would have to be recompiled on every request.

Did you at some point explain why blocking by user-agent won't do?

Bewenched

6:05 pm on Dec 8, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes we blocked them by user agent and are keeping an eye out for them coming without a user agent or disguised as a valid one.
They darn near crashed our server with 20k sessions.

blend27

3:17 pm on Dec 11, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@SumGuy

-- host-name lookup on every request

I would not do it on every request either. The way it runs on my sites is that the lookup is made after you look up the IP in IP-Tables (blocked and WHITELISTED bots) to see if it belongs to a hosting range. If not, do a host-name lookup, but ONLY once for regular users, and then store it in the user's session.

If host-name contains one of the following

CDN77,CONTABO,amazonaws,your-server,ovh.net,linode,secureserver.net,googleusercontent

I nuke the request and write the IP to a MAP file that is picked up by .HTACCESS so it is blocked on subsequent requests. The MAP file is processed before the IP-Tables lookup.

3 steps:

1. .htaccess map file look up.
2. ip-tables look up
3. rdns look up (if it was not already in the user's session, which is created by the programming lang of your choice)
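The three steps above could be sketched in Python roughly as follows. This is a simplified illustration, not the actual implementation: the map file and the IP tables are replaced by in-memory sets of single IPs (no range matching), the session is a plain dict, and the class, attribute and parameter names are all invented for the example.

```python
# Host-name fragments from the list above, lower-cased for matching.
HOSTING_FRAGMENTS = (
    "cdn77", "contabo", "amazonaws", "your-server",
    "ovh.net", "linode", "secureserver.net", "googleusercontent",
)


class Gatekeeper:
    def __init__(self, resolve, blocked=None, allowed=None):
        self.resolve = resolve                  # callable: ip -> host-name or None
        self.blocked_map = set(blocked or ())   # stand-in for the map file
        self.allowed = set(allowed or ())       # stand-in for whitelisted bots
        self.lookups = 0                        # counts rDNS lookups performed

    def allow(self, ip: str, session: dict) -> bool:
        # Step 1: map file lookup.
        if ip in self.blocked_map:
            return False
        # Step 2: IP table lookup (whitelist only in this sketch).
        if ip in self.allowed:
            return True
        # Step 3: rDNS lookup, at most once per session.
        if "hostname" not in session:
            self.lookups += 1
            session["hostname"] = (self.resolve(ip) or "").lower()
        if any(frag in session["hostname"] for frag in HOSTING_FRAGMENTS):
            self.blocked_map.add(ip)            # blocked on later requests
            return False
        return True
```

Storing the host-name in the session is what keeps the DNS cost down: a regular visitor pays for one lookup, and a blocked IP is caught by the cheap set test from then on.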

...here is an old one in PHP to get you started: [webmasterworld.com...]

But then again, I would not worry about server resources unless you are dishing out hundreds of thousands of unique visitor page views daily.



[edited by: not2easy at 4:16 pm (utc) on Dec 11, 2023]
[edit reason] removed . to fix 404 [/edit]