Forum Moderators: open

what do they want?

         

lucy24

9:23 pm on Jan 29, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Granted, not so much an ID question but--my favorite subject--a robot psychology question:

For many months now, I've noticed a particular robotic behavior: a cluster of requests for some random page (html only), in the range of 5-20 in rapid succession, all from different IPs and UAs. I tend to doubt any kind of DDoS exploit, as there would be more of them, closer together, likely resulting in a 429* code. Best guess: infected human machines, as most come from broadband IP ranges all over the world--with a slightly higher proportion of countries that I don't ordinarily see much of--with the occasional colo/server thrown into the mix. Since there is no unifying feature, no distinctive headers, all I can do is temporarily block the IP (generally /24 for 3 months), with the happy result that at least half of any given cluster gets a 403.

Question: What the ### do they want? Why isn't it enough to request a page just once?

* The 429 response only started showing up in logs a couple of years ago, probably as a byproduct of one of the host's periodic server changes. I've never asked, but I think they are true 429, “too many requests”, as would happen if you bombarded a lot of different sites living on the same server.
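
For anyone who wants to pull these clusters out of logs after the fact, here is a minimal sketch of the pattern described above: a run of requests for the same page, close together in time, with no IP and no UA repeated. The tuple layout, the 60-second window and the function name find_clusters are all assumptions for illustration, not anything taken from a real log format.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def find_clusters(hits, window=timedelta(seconds=60), min_size=5):
    """hits: iterable of (timestamp, ip, user_agent, path) tuples.

    Returns (path, run) pairs where run is a list of at least min_size
    requests for one path, all inside `window`, with every IP and every
    UA different. The thresholds are guesses, not measured values.
    """
    by_path = defaultdict(list)
    for hit in hits:
        by_path[hit[3]].append(hit)

    clusters = []
    for path, group in by_path.items():
        group.sort(key=lambda h: h[0])
        start = 0
        while start < len(group):
            end = start
            # extend the run while requests stay inside the time window
            while end + 1 < len(group) and group[end + 1][0] - group[start][0] <= window:
                end += 1
            run = group[start:end + 1]
            ips = {h[1] for h in run}
            uas = {h[2] for h in run}
            if len(run) >= min_size and len(ips) == len(run) and len(uas) == len(run):
                clusters.append((path, run))
                start = end + 1
            else:
                start += 1
    return clusters
```

This only flags clusters when reading logs; it does nothing to help the server decide at request time.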

lucy24

6:40 pm on Feb 17, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Regarding the request header priority flag
I checked the Priority header last time I ran logs, and there turned out to be nothing diagnostic. Darn.

> Is there an X-forwarded-for header that's in use?
It's pretty rare. I’ve got a rule that begins
SetEnvIf X-Forwarded-For ^\D
meaning that if the value doesn't start with a digit, out they go. But
:: quick run to logged headers ::
the header, with any value, is only sent in .0028 (¼ of 1%) of all requests.

Incidentally, while looking it up, I discovered there is also a header, even rarer, X-Forwarded-Proto, with value http or https. It seems to be used only by robots, including a tiny handful that I currently don't block (all of these with value https). The value http could be a block criterion, but 100% of them are already blocked on other grounds, so why bother.

blend27

10:44 pm on Feb 17, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



-- not with Win-10 + chrome --
That is my point :)

Also look into "sentry-trace" header.

Interesting part, or maybe I am bugging now:

Safari UA: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0.1 Safari/605.1.15
Brave UA: Mozilla/5.0 (iPad; CPU OS 26_0_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Brave/1 Mobile/15E148 Safari/604.1

on my 9th Gen iPad, iPadOS version 26.0.1

So the OS is not 10, it is 26. What's up with that?

lucy24

3:44 am on Feb 18, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The OS and browser version reported in the UA string are not necessarily what the (human) machine actually uses. F’rinstance:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:147.0) Gecko/20100101 Firefox/147.0
CAN mean
Mac OS 13.7.8
Firefox 147.0.4
but it CAN ALSO mean
Mac OS 26.2

where the first is my desktop, the second my laptop, checked in real time. Someone using Chrome might like to confirm that actual version numbers do not spring nimbly from 142.0.0 to 143.0.0 to 144.0.0 and so on, as UA strings would have us believe.

> Also look into "sentry-trace" header.
Checking my logged headers, I find
Sentry-Trace: {long string of hexadecimals, sometimes with -0 at the end}
...
botheader: Sentry-Trace
which tells me that at some time in the past I flagged this header as up to no good ... and then promptly forgot I’d done so ;) Neither mozilla nor rfc seems to have anything to say--in fact nobody does except the manufacturer--which does seem to suggest it isn't used for any legitimate purpose.

blend27

5:34 pm on Feb 18, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



-- Sentry-Trace: {long string of hexadecimals, sometimes with -0 at the end} --

...that is usually populated by Sentry (sentry.io) framework code that tracks errors in your "APP".

So once I figured that out I stopped feeding sweet 403 responses to the requests and instead started feeding 200s with some random TEXT in an HTML template, to encourage them to reveal more of the hosting IPs/proxies they use for it.

Sometimes me be evil.....

lucy24

6:16 pm on Feb 18, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Mwa ha ha.

Incidentally, just now I checked the UA string sent by Safari. It still says “Intel Mac OS X 10_15_7” as Firefox does, while the Safari version is either 18.6 or 26.2. (Did they jump up the version number to match the OS? Who knows.) I don't currently have Chrome myself, but a quick look at yesterday's logs reveals that Chrome is generally that same 10_15_7.

Next I took a quick look at the ones that don’t give this OS number. Most are already blocked; most of the rest would be blocked by a newly added image-specific rule. Technically, not blocked but rewritten to our friend the single-pixel gif, which is less work for the server, and just as unhelpful for scrapers.

not2easy

6:34 pm on Feb 18, 2026 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Safari 26.3 is the most recent version for Apple silicon Macs, at least on Tahoe 26.3. The update to 26.3 was just last week. iOS is the same 26.3 for iPhones.

lucy24

4:13 am on Feb 19, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



By not-at-all surprising coincidence, my laptop has just this instant updated itself to 26.3. I didn't check Safari, but I'm sure it tagged along.

<tangent>
Remember when you had to pay for system updates, unless they had a fit of generosity and decided to give one out for free? I feel old.
</tangent>

Taran

8:00 pm on Feb 19, 2026 (gmt 0)

10+ Year Member Top Contributors Of The Month



They're likely just refreshing to check for content updates or simple state changes without triggering heavy JS-based detection. I've blocked entire /24 ranges before for this exact pattern. It’s annoying but temporary blocks usually keep the noise down for a few months.

Kendo

1:02 am on Feb 20, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Late last year my stats jumped by 5,000%. That's when we all found a lot of unwanted traffic coming from networks like Tencent. For a long time I blocked networks at the firewall to no avail, even after blocking most of Asia. I had to unblock a couple of clients in India that got caught in my net. Stats were still high for a long time, but over the last couple of weeks my stats have gone back to what they were originally.

Maybe all the AI wanna-bes have got their data and now need to learn how to process it.

lucy24

5:28 am on Feb 20, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Last few days of logs suggest they've instead adopted a new pattern: single request for
/directory/subdir
redirected to
/directory/subdir/
giving the originally requested “/directory/subdir” as referer. Human browsers don't do this, as it's a flat contradiction of how referers are supposed to be sent when a request is redirected. And mod_rewrite flatly refuses to block them, no matter how I tweak the rule.

But I'm tired of blocking--latest logs featured several hundred, between this and other patterns--and will just ignore and wait them out. (I also noticed a great increase in requests for single images, but those are easy to deal with.)
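
For picking this pattern out of logs (not for blocking at the server), a small sketch of the test described above: a request for the slash-terminated URL whose referer is the same URL minus the slash. The function name is made up, and it assumes you already have the request path and the Referer value as strings.

```python
from urllib.parse import urlsplit

def is_self_referer_after_slash_redirect(path, referer):
    """True when a request for '/dir/sub/' names '/dir/sub' as its referer."""
    if not referer or not path.endswith("/") or path == "/":
        return False
    # compare only the path part, so the scheme and host in the referer don't matter
    ref_path = urlsplit(referer).path
    return ref_path == path[:-1]
```

It compares paths only, so a referer from some other host with the same path would also match; add a host check if that matters on your site.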

blend27

10:48 pm on Feb 26, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



-- May be all the AI wanna-bes have got their data and now need to learn how to process it.--

I am being scraped left and right as we speak.

Goog decided about 6 months ago that they should not send any converting traffic any more.

Most requests either get blocked by the software firewall or pass through.

Here is the kicker: the ones that are not "DETECTED" present themselves with a "PROPER set of headers based on a standard header set". The AI BOT got a list of URIs somewhere (no, I do not have a sitemap file, never had). The URIs that are being requested have NO HTTP referrer and are not linked from any pages on this site....

So hear-hear: there is a set of 17 sites (some IIS and some Apache; I do not look at raw log files because all data is picked up by scripts that produce HTML) that run off the same set of "Banned Hosting Ranges" and talk to each other LIVE. The DB is pulled in bi-weekly from IPInfo and parsed/updated into ranges, which are linked to another dataset of ASNs that are marked as hosting ranges either manually or by getting "usageType" as "Data Center/Web Hosting/Transit" from an AbuseIPDB API call. There are other ways to figure out whether something is a hosting range or not, but...

Proxies, mobile and residential, are what I am left with.

I joined this forum on Dec 27, 2004 - 27 is not a proper header stretcher value, but it is an ol' good pile of FUN.
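
The range lookup described above (is this IP inside one of the banned hosting ranges?) can be sketched with nothing but the standard library. This is not blend27's actual script, which isn't shown in the thread; the function names are hypothetical and the CIDRs in the usage note are documentation-only ranges, not real hosting ranges.

```python
import ipaddress

def build_ranges(cidrs):
    """Parse CIDR strings once, up front, into network objects."""
    return [ipaddress.ip_network(c, strict=False) for c in cidrs]

def in_hosting_range(ip, ranges):
    """True if ip falls inside any of the parsed networks."""
    addr = ipaddress.ip_address(ip)
    # only compare like with like (IPv4 against IPv4, IPv6 against IPv6)
    return any(addr.version == net.version and addr in net for net in ranges)
```

For example, build_ranges(["192.0.2.0/24", "2001:db8::/32"]) followed by in_hosting_range("192.0.2.77", ranges) gives True. A linear scan is fine for a few thousand ranges; a full IPInfo-sized dataset would want a sorted list or a database index instead.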

SumGuy

12:02 am on Mar 15, 2026 (gmt 0)

5+ Year Member Top Contributors Of The Month



I have a few hundred PDF files on my site, full reprints of scientific research papers, and that's what these bots that use residential proxies go after. A new trick that I've seen them start doing in the past few days is - they hit my default page, download every file to render it like an actual human browser would, and then hit the URL for the PDF they're really after.

I have maybe a hundred rules that look just at request headers (not the IP) and they're getting hung up mostly on those rules, but 2 rules in particular are all I need for these recent hits:

Accept language is "zh,zh;q=0.8" or "zh,zh;q=0.9"

That's it. Not that the accept-language includes those strings, but that they ARE those strings.

And 90% of these are coming from US residential or commercial IP addresses (because I'm already IP-blocking the third world).

Kendo

2:46 am on Mar 15, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> I have a few hundred PDF files on my site,

Have you considered DRM or copy protecting those pages/files?

lucy24

5:58 pm on Mar 15, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Accept language is "zh,zh;q=0.8" or "zh,zh;q=0.9"
>
> That's it. Not that the accept-language includes those strings, but that they ARE those strings.
You’re more tolerant than I am; I unconditionally block ^zh unless it specifies zh-(tw|TW). Sorry, China. Mumble mumble signal-to-noise ratio mumble mumble.

And, heh-heh, I do hope your rule is actually expressed as
^zh,zh;q=0\.[89]$
rather than laying out two separate patterns.
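
For anyone checking this pattern against logged headers rather than in the server config, a quick sketch of the same "IS the string, not CONTAINS the string" test. The function name is invented; fullmatch stands in for the ^...$ anchors.

```python
import re

# Same pattern as above; fullmatch() supplies the anchoring at both ends.
ZH_EXACT = re.compile(r"zh,zh;q=0\.[89]")

def is_zh_bot_language(accept_language):
    """True only when the whole header value is zh,zh;q=0.8 or zh,zh;q=0.9."""
    return ZH_EXACT.fullmatch(accept_language) is not None
```

A value like "zh-CN,zh;q=0.9" or "zh,zh;q=0.8,en;q=0.5" does not match, which is the point: only the two exact strings do.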

In the last week or two, log size has gone through the ceiling, and there is NOTHING to distinguish humans from fully humanoid robots (requesting all supporting files). That is, nothing I can securely identify even when processing logs, let alone something the server can use at the outset.

And I still don't understand why mod_rewrite refuses to understand $1 in a RewriteCond when the docs specifically say they do recognize it. Grumble.

SumGuy

1:53 pm on Mar 17, 2026 (gmt 0)

5+ Year Member Top Contributors Of The Month



> I do hope your rule is actually expressed as (stuff) rather than laying out two separate patterns.

My web-server software is "Abyss" made by Aprelium, running on win-NT 4 server. I do have 2 separate rules for the 2 different zh strings (which differ only on the q= part).

With this software there's some things I just can't figure out when it comes to specifying header match strings. For example, I'd like to have a rule for when some field is actually blank, say the referer, but I have to put something in the operand of the rule, and it won't take just a pair of quotes. So instead I test the referer to see if it contains http, because if it doesn't, that means it's blank. I also can't figure out how to specify a CIDR for a requesting-source IP address.

But otherwise regarding the current scourge, I've never seen a new bot phenomenon (like this zh,zh;q=0.8/9 thing) last for more than a day, but this one is persistent, meaning it's not adapting as quickly as I've seen for this sort of thing.

And I do have other "zh" rules. I've seen the request-language being simply "zh", or even just "en" and I block those as well.

I've got a good idea when my rule has caught a bot vs real human browser. My "I think you're a bot" page loads (or is supposed to load) a single jpg image. When indeed that image is requested by the browser, which is rare, I'm pretty sure that was a human, but lots of times it's not requested.

And when a bot is blocked (ie it gets the robot page) it's common to see a rash of successive requests within a few seconds or a minute from different IP's for the same file, usually using the exact same UA.

blend27

2:49 pm on Mar 17, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



-- That is, nothing I can securely identify even when processing logs --

...stick an image at the end of a page with the loading="lazy" attribute. If it is requested within the same second as the page itself >>> 99% a Bot.

lucy24

5:34 pm on Mar 17, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> If it is requested within the same second as the page it self >>> 99% a Bot.
But if it is requested three seconds later, it may still be a bot, because there’s often a delay before requesting supporting files. Unfortunately this sometimes happens with humans too--especially when, as would not be the case here, the page is significantly longer than the viewport. (I kinda think some browsers, especially mobiles, behave this way by default.)

The most recent development is that these almost-certainly-bots are not only requesting but acting on piwik/matomo files. It's pretty blatant when piwik shows nine consecutive visits to the same page, all from different IPs. I simply don't have that kind of site.

<tangent>
Looking it up, I find the AI explanation accompanied by “This video explains...” If someone needs a video to explain how to enter a line of code, it is possible they are in the wrong line of work.
</tangent>

blend27

5:54 pm on Mar 17, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Also, look at the order in which the header labels come in/are sent:

"User-Agent" is in no way sent by a normal browser after the "Accept-Language" header.

[edited by: blend27 at 6:03 pm (utc) on Mar 17, 2026]

blend27

6:02 pm on Mar 17, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



it may still be a bot >> maybe that is the keyword... write an image tag using JavaScript; I bet that is requested first if/because the bot executes JavaScript. Add a ?timestamp= query string that says what time the IMG was written/reffed, then programmatically compare that query-string value to the timestamp when the page was first requested.

Fire up Dev Tools in FF or Chrome, see what happens. Remember loading="lazy" attribute.
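
A rough sketch of the timing comparison discussed above, for use when processing logs. The function name and the three labels are invented for illustration, and the one-second cutoff is just the "same second" rule of thumb from the thread, not a tested threshold. It assumes you have already paired each page request with the matching request for the lazy-loaded image (or None if there wasn't one).

```python
from datetime import datetime

def lazy_image_verdict(page_time, image_time, same_second=1.0):
    """Label one page view using the lazy-image heuristic from the thread."""
    if image_time is None:
        return "no-image"       # image never requested: inconclusive on its own
    delta = (image_time - page_time).total_seconds()
    if delta < same_second:
        return "likely-bot"     # lazy image fetched at once, before any scrolling
    return "inconclusive"       # a later request could be a human or a patient bot
```

As lucy24 points out, anything outside the "same second" case proves nothing either way, which is why the sketch has only one bucket that claims anything.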

blend27

6:13 pm on Mar 17, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



BTW, the AbuseIPDB free API (register) tells you whether an IP is in a hosting range or has issues, just saying, simple scripting stuff. Look it up if not in session, then do your headers. IPInfo gives you the Country and ASN of an IP, and also gives you a DB file that can be compared by IP/CIDR (code for it, takes my script less than 8 minutes - if anybody wants it - it would be 2 USD)/ASN if you keep it locally.

^^ That is about 1.4 GB(including Log files that can be truncated). Once you split, "Normalize" from SQL perspective, it is like 228 Megs.

Happy hunting.

Oh and F the war.

SumGuy

11:26 pm on Mar 17, 2026 (gmt 0)

5+ Year Member Top Contributors Of The Month



> stick an image at the end of a page with loading="lazy" attribute. If it is requested within the same
> second as the page it self >>> 99% a Bot.

No. If these bots want a specific html or pdf or what-ever file, they will just ask for it and nothing else. They're not scraping by following links within the file.

When these bots ask for pdf files on my site, I perform a URL re-write so the file they get is (paraphrasing) "I-think-your-a-bot.html" and force-feed them that file. That file contains code to render "you-are-bot.jpg" on that page. A real browser will request the jpg file, the bot won't.

AbuseIPDB doesn't track these residential VPN's or proxies BTW.

lucy24

3:44 am on Mar 18, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"user-agent" is no way sent by a normal browser behind "Accept Language" header

Says who?
:: quick run to test site to request robots.txt in my everyday browser ::
Content-Length: 0
Connection: close
Host: www.example.com
Priority: u=0, i
Sec-Fetch-User: ?1
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-Dest: document
Upgrade-Insecure-Requests: 1
Dnt: 1 *
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: {PII suppressed}
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:148.0) Gecko/20100101 Firefox/148.0
In that order.

> They're not scraping by following links within the file.
Unfortunately, that seems to be exactly what the last few days’ robots are doing. Page, few seconds’ delay, then all supporting files including favicon and analytics, followed by analytics file request (which lives on a different site, so a further few seconds’ delay is never diagnostic). These may or may not be the same robots I formerly flagged as botnets, where supporting files are always requested by someone in a Usual Suspects range. Maybe they finally noticed they weren't getting those other files.

If a human comes in via a search engine, their browser in collusion with the search engine may preload the html before the human actually clicks on the link, which will again create a multi-second delay between page and supporting files.

... which is why so much of this comes out sounding like “Why don’t you / Yes but”.


* I swear I just saw somewhere in the fine print that FF no longer supports the dnt header, but I guess that won't stop them from continuing to send it until the cows come home.

SumGuy

12:36 pm on Mar 18, 2026 (gmt 0)

5+ Year Member Top Contributors Of The Month



> > They're not scraping by following links within the file.
> Unfortunately, that seems to be exactly what the last few days’ robots are doing.

For not more than maybe the past week, the new thing I've been seeing is a bot that hits my default or landing page, requests all the files that a normal human browser would request (about 30 files which includes a couple js files and .gif's and jpg's - but maybe not favicon.ico I'd have to check that) and then immediately requests a pdf with no referer using an interior link that was not visible on the landing page. Maybe that's what the bots are doing on your site also. These are the bots using zh,zh;q=0.8/9 for language.

In my logs, those hits now look different - they are requesting my landing page but getting my you're-a-bot page, they then are asking for the PDF which is their real target but instead are getting the you're-a-bot page a second time. This all happens in the same second or within a couple seconds of each other.

I really do believe the ask-for-landing-page-first strategy was a tactic they thought would get around blocking strategies that they were encountering.

When I test their IP's on spur they are 100% identified as belonging to that list of about a dozen different residential VPN / Proxy networks.

blend27

3:17 pm on Mar 18, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Lucy24

-- few seconds’ delay, then all supporting files including favicon --

JavaScript function: [developer.mozilla.org...] . Are they Humans or are they Dancers? No Visa/MS/Amex image from the bottom of the page without s'crolling down the river...

-- These may or may not be--

I got Tencent bots requesting image linked files that are dated to past October from perfectly fine/clean residential IPs. US Charter, Comcast, TMo, Rogers... It is fun. Proxy stuff.

--Says who? / Headers --

Every normal browser has a specific order of headers sent to the server. Bots written in different programming languages have different ways of sending data when requesting a page/image/pdf. So testing on PII FF is fine, but look at the order your iPad Safari sends (I know you got one ;)), then Chrome 'based', Edge, etc. Most bots/scrapers just get something from Git and run with it....

We are getting there..

SumGuy

2:38 pm on Mar 20, 2026 (gmt 0)

5+ Year Member Top Contributors Of The Month



If the ordering of the header fields sent by a bot vs an authentic browser is or could be different, causing no operational handling problems for the typical web server, then that is indeed a huge signal.

Problem is, for me, my software has no ability to tell me the order of the fields let alone make decisions based on it.

lucy24

4:11 pm on Mar 20, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> my software has no ability to tell me the order of the fields let alone make decisions based on it.
Likewise. That is, I could look at logged headers and maybe get some further information--beyond the presence/absence/content of individual headers, which I do use for access control--but that only helps in after-the-fact identification. I guess it would be theoretically possible to detour every page request via a php-or-equivalent script that analyzes the headers. But holy cow that would be a lot of extra work, not just for me but for the server.

blend27

12:20 pm on Mar 21, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



--But holy cow that would be a lot of extra work, not just for me but for the server.--

getHTTPRequestData() is a native function in ColdFusion (based on Java), which is what I have built my sites with for the past 2 decades++.

In PHP it is getallheaders(), which I believe has been the same since version 4 ( << since all, including holy, cows came home) >> [php.net...]

DOTNET: Request.Headers

Every Browser lists request and response headers in Dev Tools, Every API Authentication thingy ever listed on the web relies on request headers parameters...

Part of CGI, as always, when holy request comes in.

blend27

12:31 pm on Mar 21, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@lucy24, notice how different these headers are from what you posted above...
Host: www.webmasterworld.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:147.0) Gecko/20100101 Firefox/147.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br, zstd
Referer: https://www.webmasterworld.com/post-v6.cgi
Connection: keep-alive
Cookie: lastvisitinfo=mana-mama-strip-stiff-curly.01; splorks=googlygook
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Priority: u=0, i
Pragma: no-cache
Cache-Control: no-cache


So what you are looking for in this instance: do a search, let's say on Goog, for the ColdFusion function: "getHTTPRequestData() get unordered original list of headers in Coldfusion sent by a client". Each browser you are trying to investigate will have a different order and structure [array] of headers/values >> juicy, built in, and live before any HTML is served to the client requesting it, like nobody has ever seen before! Thank you for paying attention to this matter!
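
To make the header-order idea concrete, here is a small sketch that inspects order without sorting anything. It assumes the headers are available as a list of (name, value) pairs in the order they arrived, which is something to verify for your own platform or logging script, since some of them reorder or merge headers. Both function names are made up for the example.

```python
def header_order(raw_headers):
    """Lower-cased header names in the order received (no sorting)."""
    return [name.lower() for name, _ in raw_headers]

def sent_before(raw_headers, first, second):
    """True if `first` arrives before `second`; None if either is missing."""
    order = header_order(raw_headers)
    try:
        return order.index(first.lower()) < order.index(second.lower())
    except ValueError:
        return None

# The first five headers of the Firefox request posted above, in that order.
firefox_sample = [
    ("Host", "www.webmasterworld.com"),
    ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:147.0) Gecko/20100101 Firefox/147.0"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    ("Accept-Language", "en-US,en;q=0.9"),
    ("Accept-Encoding", "gzip, deflate, br, zstd"),
]
```

With that sample, sent_before(firefox_sample, "User-Agent", "Accept-Language") is True. Whether the opposite order really marks a bot is the open question in this thread, so treat the result as one signal to log, not a verdict.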

lucy24

6:10 pm on Mar 21, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> In PHP is getallheaders()
Uhm, this is where we came in. My header-logging script includes this very function. But it's just for creating a log file; it doesn't lead to any other action.

blend27

12:27 pm on Mar 23, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



-- But it's just for creating a log file; it doesn't lead to any other action.--

Bingo! It already does have the object in memory then, and interacts with the file system on your hosting server.

...from there it is just a simple transfer of the logic you would write in .htaccess and such into simple PHP logic.

Try it: create a separate .php template and try to evaluate some rules against getallheaders() information provided.

* I don't do PHP, but the logic is all over the Web/WebmasterWorld

Briefly Translated by Goog -- some of my googlygook rules:
<?php
// Header names are case-insensitive, and getallheaders() returns them in
// whatever case they arrived in, so normalize the keys before testing.
// array_change_key_case() keeps the original order of the array.
$headers = array_change_key_case(getallheaders(), CASE_LOWER);

// Block any request that carries a 'Tth-Endproxy' header
if (isset($headers['tth-endproxy'])) {
    // do your logging here, then block the request
    http_response_code(403);
    exit;
}

// UA claims Windows but the client hint says Linux: a contradiction, so block
$ua = $headers['user-agent'] ?? '';
$platform = $headers['sec-ch-ua-platform'] ?? '';
if (stripos($ua, 'Windows NT') !== false && stripos($platform, 'Linux') !== false) {
    // do your logging here, then block the request
    http_response_code(403);
    exit;
}
?>

++ In order of things: if you need help, do a search for this on Goog/AI: how to inspect a structure, in PHP, to determine the order of elements and their position in that structure. Do not sort the headers array; just inspect the order, according to which browser is sending the headers.

Makes Fence...