Can you spot the difference?

         

lucy24

9:08 pm on May 22, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I found a vexatious cluster of obvious robots in yesterday’s logs.
IP: various, changing every 10-15 requests, later returning for a final flurry of 30 or so.
UA: assorted plausible humanoid.

Before reading on, set your stopwatch.

Headers:
Accept-Encoding: br,gzip
Referrer: https://www.google.com/
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Did you spot it?

I have now added a line to the "botheader" environmental variable.

not2easy

10:14 pm on May 22, 2025 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I'm probably not catching it, but did every request include the same referer?

lucy24

10:47 pm on May 22, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Go have a cup of coffee and look again :)

Hint: The part in [ code ] markup is a direct copy-paste from the logged headers, with nothing typed in by me.

SumGuy

2:09 am on May 23, 2025 (gmt 0)

5+ Year Member Top Contributors Of The Month



The header field names are present in your logs?

lucy24

4:43 am on May 23, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, that's part of the copy-and-paste.

Mwa ha.

not2easy

1:10 pm on May 23, 2025 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I don't see the request or UA field, is this header from the 403 page?

BTW @ SumGuy - there is a tutorial here: [webmasterworld.com...]

SumGuy

1:52 pm on May 23, 2025 (gmt 0)

5+ Year Member Top Contributors Of The Month



My web server generates one log file per day. The first line of each log file lists the field names; every other line is an individual file request from an external client. So I never see "Accept-Encoding: br,gzip" on a log line. I just see "br,gzip". Field items are separated by spaces and enclosed in double quotes. Everything Lucy24 posted in the box, these items:

br,gzip
https://www.google.com/
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
en-US,en;q=0.5

I see exactly those items, those exact strings, in my log files. None of them are items that I could use to trigger a bot-detection because they are also found in legit web hits. Maybe the encoding string - br,gzip - might have a space in it, such as - br, gzip - or it might also include deflate. But everything else looks normal.

Now maybe if there's something missing, like a blank user-agent, that's something that indicates bot.

lucy24

4:28 pm on May 23, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Again: I posted the relevant parts. My header logging function (courtesy the late lamented iBill) includes:

    // $fh is the already-open log file handle; each header is written as received
    foreach (getallheaders() as $name => $value) {
        fwrite($fh, "$name: $value\n");
    }

Frankly I’m surprised that _I_ homed in on it as fast as I did. Some days, I might have stared blankly at it for hours.

:: twiddling thumbs happily, because sooner or later you’re going to kick yourself ::

SumGuy

2:24 pm on May 24, 2025 (gmt 0)

5+ Year Member Top Contributors Of The Month



Referer is supposed to be spelled without a double-r? So "Referrer" is peculiar?

not2easy

2:54 pm on May 24, 2025 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Referer is what shows in most Apache logs. It is not the correct English spelling, but it is the spelling the HTTP header itself uses, and changing it would mess up some tasks/scripts. I've never seen an Apache server log that spells it as "Referrer".

lucy24

4:06 pm on May 24, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Everyone. Please. These are not Apache server logs. They are the request headers sent in by the robot: spelled, capitalized and punctuated exactly as the request has them. Another I’ve occasionally seen is “Useragent”, like that.
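
For anyone who wants to play with the idea outside Apache, here is a minimal sketch in Python (not the PHP/Apache setup used in this thread). It compares the header names a client sent against a short list of known misspellings. The function name `find_bot_headers` and the `SUSPECT_NAMES` set are hypothetical; the two entries are simply the misspellings mentioned in this thread.

```python
# Hypothetical list: misspelled header names mentioned in this thread.
SUSPECT_NAMES = {"referrer", "useragent"}


def find_bot_headers(raw_headers):
    """Return the header names, as sent, that match a known misspelling.

    raw_headers is a list of (name, value) pairs exactly as received.
    Header names are case-insensitive, so the comparison is done in lowercase.
    """
    return [name for name, _ in raw_headers if name.lower() in SUSPECT_NAMES]


request = [
    ("Accept-Encoding", "br,gzip"),
    ("Referrer", "https://www.google.com/"),
    ("Accept-Language", "en-US,en;q=0.5"),
]
print(find_bot_headers(request))  # ['Referrer']
```

A correctly spelled "Referer" passes untouched, so the check only fires on the typo itself, not on the header's value.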

SumGuy

12:35 am on May 25, 2025 (gmt 0)

5+ Year Member Top Contributors Of The Month



On my (non-linux, non-apache) server, when it comes to Request Header items that show up in my logs, there is a default set, which is this:

Request Header User-Agent
Request Header Referer
Request Header Cookie
Request Header Accept
Request Header Accept-Language
Request Header Accept-Charset

I'm just sort of stumbling on this now, as this is a section of the config that I've never really dived into before.

I believe now that if a given client sends an item in the Request Header, such as "Referrer", that it would not be logged because it doesn't show up in the above list. In the logs, the place where the "Referer" would show up would be blank. I have just added another field to the Request Header logging, this new field being "Referrer". I expect it to be blank the vast majority of the time. If indeed something shows up there, that will certainly be of interest to me and I will see if it can be leveraged for bot-detection.

A while ago I had a similar idea with another possible header field - the "forwarded for" or "X-forwarded for" field. If such does or can exist, it might also be useful. Again, I have to add this (and spell it correctly) to my log configuration for it to show up, if indeed something out there actually sends it.

lucy24

4:15 pm on May 25, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Quoting SumGuy: if a given client sends an item in the Request Header, such as "Referrer", that it would not be logged because it doesn't show up in the above list
Exactly. Bear in mind that in some respects computers are vastly dumber than humans, and thus don’t recognize the concept of misspelling.

Access log entry from last night (IP lightly obfuscated):
45.38.206.abc - - [24/May/2025:23:49:22 -0700] "GET /ebooks/beauty/beauty1.html HTTP/1.1" 403 3466 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 OPR/117.0.0." 
Complete logged headers for the same request (exemplified):
2025-05-24:23:49:22
URL: /ebooks/beauty/beauty1.html
Status: 403
HTTPS: on
IP: 45.38.206.abc
----
Content-Length: 0
Connection: close
Host: example.com
Accept-Encoding: br,gzip
Referrer: https://www.google.com/
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 OPR/117.0.0.
----
botheader: Referrer
lying_bot: Chrome/132.0.0.0
----
My logheaders code is attached to all pages including error documents, so I can always tell why they’re blocked. (If you wondered, “lying_bot” doesn’t necessarily mean a lying robot, though in this case it certainly does. I use it when determining which version of robots.txt to serve.)

Last night I spent some time poring over a part of Apache docs that I don’t usually deal with, since access logs are done at the server level and I’m on shared hosting. Turns out we are dealing with what they call “NCSA extended/combined log format” (search me), i.e.
"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
(literal quotation marks are escaped because the whole pattern is in quotes), which translates as:

remote hostname OR IP address, depending on whether lookups are enabled (off by default unless I made a mistake)
remote logname
remote user, if authenticated (on my site, these two are always - -)
“Time the request was received, in the format [18/Sep/2011:19:18:28 -0400]. The last number indicates the timezone offset from GMT”
first line of request, in quotes
final status of request, after all processing
“Size of response in bytes, excluding HTTP headers. In CLF format, i.e. a '-' rather than a 0 when no bytes are sent.”
contents of the header with exact name "Referer" ("-" if not sent), in quotes
contents of the header with exact name “User-Agent” ("-" if not sent), in quotes
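
The field list above can be sketched as a parser. This is a rough Python illustration, assuming the combined format string quoted above and assuming that quotes inside a logged field are backslash-escaped; the names `COMBINED` and `parse_line` are made up for the example.

```python
import re

# One named group per field of the combined format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
# Quoted fields allow backslash-escaped characters such as \" inside them.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>(?:[^"\\]|\\.)*)" "(?P<agent>(?:[^"\\]|\\.)*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line is not in combined format."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


sample = (
    '45.38.206.abc - - [24/May/2025:23:49:22 -0700] '
    '"GET /ebooks/beauty/beauty1.html HTTP/1.1" 403 3466 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36 OPR/117.0.0."'
)
fields = parse_line(sample)
print(fields["status"], fields["referer"])  # 403 -
```

Note that the referer comes out as "-" for this request: the robot's "Referrer" header is invisible to the access log, which is why the separate header log was needed to spot it.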

An interesting quirk is that you can tell from access logs if a given header was empty--logged as ""--vs. not sent at all--logged as "-"--but there doesn't seem to be any way to make the distinction in access-control rules using the various Apache mods that govern access.
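
When post-processing logs rather than writing access rules, the distinction is easy to recover. A tiny Python sketch, assuming the logging behaviour described above ("" for an empty header, "-" for one not sent); `referer_state` is a hypothetical helper name.

```python
def referer_state(logged_value):
    """Classify a Referer field taken from a combined-format log line.

    logged_value is the text between the quotes, without the quotes.
    """
    if logged_value == "-":
        return "not sent"
    if logged_value == "":
        return "sent but empty"
    return "sent"


for value in ["-", "", "https://www.google.com/"]:
    print(repr(value), "->", referer_state(value))
```

One caveat: a client that literally sends "-" as its referer would be indistinguishable from one that sent nothing, so treat "not sent" as a strong hint rather than proof.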

Quoting SumGuy: the "forwarded for" or "X-forwarded for" field
I've got one that says
SetEnvIf X-Forwarded-For ^(unknown|\W) botheader=XForward
:: detour here to disentangle cat toy from robotic vacuum ::

making another automatic 403. The header isn’t malign in and of itself; normally its value is an IP address. In fact I should change my rule to \D (non-digit), because looking it up I find all kinds of bogus values, though the ones that don’t fit my existing pattern all managed to get themselves blocked anyway.
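
As an alternative to a first-character test, the value can be checked against what it is normally supposed to hold. A hedged Python sketch, assuming the header carries a comma-separated list of plain IP addresses; `xff_looks_bogus` is a hypothetical name, and this is not the SetEnvIf rule above.

```python
import ipaddress


def xff_looks_bogus(value):
    """Return True if any comma-separated entry is not a valid IP address."""
    for entry in value.split(","):
        try:
            ipaddress.ip_address(entry.strip())
        except ValueError:
            return True
    return False


for value in ["203.0.113.7", "203.0.113.7, 198.51.100.2", "unknown", "", "fe80::1"]:
    print(repr(value), xff_looks_bogus(value))
```

The last sample matters: an IPv6 address can begin with a letter, so a pattern that rejects any value starting with a non-digit would also reject "fe80::1". On the other hand, this sketch flags entries that carry a port or brackets, so it may be too strict for some proxies.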