
Getting started: header-based access controls

     
8:24 pm on Jun 28, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


How do I get started on using header-based access controls?

Currently I am on an Apache server, shared host, using an .htaccess, robots.txt and error file, and this works pretty well. I run multiple web sites, each in its own directory, with the .htaccess in public_html, where my SetEnvIf directives cascade down to all subdirectories (inherited by them). I regularly read my raw access log, find the bad guys and ban them by UA (SetEnvIf) or IP (deny from) in .htaccess. I do have an error file, which I review, but it reveals very little.

How do I set up header-based access controls? Can someone point me to a link or two to get me started?

Thanks All!
10:22 pm on June 28, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


The tricky part is that you don’t want to give robots too much information, so this post will be incomplete by its nature. But we start with the mechanics. mod_setenvif is inherited normally--where “normal” means “not like mod_rewrite”--making it useful for rules intended to apply to many sites, as in a "userspace" or "primary/addon" shared-hosting setup.

You start with the basic syntax (do not copy-and-paste :) )
SetEnvIf header-field-name header-field-value one-or-more-variables-to-set
It is very rarely appropriate to use SetEnvIfNoCase, because anything wrongly cased is likely to be wrong, like the “GoogleBot” I used to meet a few years ago. The other mod_setenvif directive is
BrowserMatch blahblah
which is simply shorthand for
SetEnvIf User-Agent blahblah
Again, think twice about BrowserMatchNoCase. You want the real thing, not a wrongly cased faker.

A couple of simple examples:
BrowserMatch ^$ no_agent
SetEnvIf Accept ^$ noaccept

Deny from env=no_agent
Deny from env=noaccept
Those are straightforward: Slam the door in their face if they don’t send an Accept header, or don't send a User-Agent header. Note that there is no way to distinguish between a header that is absent--Apache logs say "-"--and one that is empty--Apache logs say "" alone--but fortunately it doesn't matter.

I am on Apache 2.2, so directives are in the form “Deny from”. If you are on Apache 2.4, this part will involve assorted "Require" rules instead, but they still take the same arguments:
Require env noaccept 
(assuming a <RequireNone> envelope).
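(Spelled out, a sketch of that envelope might look like this--standard 2.4 directives, using the variable names from the examples above:)
# Apache 2.4: let everyone in except requests flagged by the SetEnvIf rules
<RequireAll>
Require all granted
<RequireNone>
Require env noaccept
Require env no_agent
</RequireNone>
</RequireAll>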

But what about law-abiding robots who don't send the required header? Then you start poking holes:
BrowserMatch Googlebot !noaccept

SetEnvIf Remote_Addr ^31\.13\.(6[4-9]|[7-9][0-9]|1[01]\d|12[0-7]) !no_agent
meaning “turn off the environmental variable I just set”. (I am assuming here that you have a rule elsewhere, probably using mod_rewrite, that unconditionally denies anyone who claims to be Googlebot but doesn’t come from a Google crawl range. The only other search engine UA that’s routinely faked is, for some reason, Baidu.) The second line, involving IP addresses, is specifically for Facebook, which has recently picked up a nasty habit of not sending a UA. There are actually five possible IP ranges; there's a thread somewhere hereabouts that lists them.
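(A bare-bones sketch of that unconditional deny, purely for illustration--the 66.249 range here stands in for Google's published crawl ranges and is not exhaustive, and it assumes RewriteEngine On elsewhere in the file:)
# anything claiming to be Googlebot from outside a Google crawl range gets a 403
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteRule .* - [F]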

That was the easy part. The hard part is knowing what headers to deny, and what holes to poke. For this you will need to log headers and study them closely for a while. I think it was incrediBill who originally came up with the header-logging code. Make it part of one of your standard includes so it executes on all pages including any custom error pages.
<?php
// Helper: fetch a $_SERVER value if present, false otherwise.
function get_server($var)
{
    return isset($_SERVER[$var]) ? $_SERVER[$var] : false;
}

// Polyfill for getallheaders() on servers where it is unavailable
// (e.g. PHP running as CGI/FastCGI): rebuild the request headers
// from the HTTP_* entries in $_SERVER.
if (!function_exists('getallheaders'))
{
    function getallheaders()
    {
        $headers = array();
        foreach ($_SERVER as $name => $value)
        {
            if (substr($name, 0, 5) == 'HTTP_')
            {
                $headers[str_replace(' ', '-', ucwords(strtolower(str_replace('_', ' ', substr($name, 5)))))] = $value;
            }
        }
        return $headers;
    }
}

$ip = get_server('REMOTE_ADDR');

// One log file per day, appended to on every request.
$fh = fopen($_SERVER['DOCUMENT_ROOT'] . "/boilerplate/headers-" . date('Ymd') . ".log", "a");
fwrite($fh, date('Y-m-d:') . date("H:i:s\n"));
$thispage = $_SERVER['REQUEST_URI'];
fwrite($fh, "URL: $thispage\n");
fwrite($fh, "IP: $ip\n");

// Log every request header the client sent, one per line.
foreach (getallheaders() as $name => $value)
{
    fwrite($fh, "$name: $value\n");
}

fwrite($fh, "----\n\n");
fclose($fh);
?>
I have a /boilerplate/ directory that I use for, well, boilerplate, so that's where I told it to keep my logged headers. It will make a new file each day, and they will remain there until you delete them (unlike log files, which your host probably wipes after a set time period). Watch out! Log files typically “roll over” at some dead hour of the night, so your logged headers will sometimes be under a different date than the access logs themselves.

You are logging request headers. That means you cannot tell by looking at the headers what response your server sent out in reply. This is why you need to log headers on your 403 page. It tells you when requests for non-page files were denied, and you can then figure out if you need to poke additional holes for particular filetypes.
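(If you don't already have custom error documents wired up, the .htaccess side of that is just the following--paths are illustrative:)
# route error responses through custom PHP pages so they can log headers
ErrorDocument 403 /403.php
ErrorDocument 404 /404.php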

Once you start logging headers, look at the differences between humans and robots. Human requests always contain certain headers, and they often have a different range of values than robotic headers. So you can start setting environmental variables like "noagent" or "badlanguage" or sometimes even "botheader"--my catchall for headers that nobody but a robot ever sends. Watch out for mobiles, which sometimes send rather minimalist, bot-like headers.

In addition to User-Agent and Referer, I recommend looking especially closely at Accept and its relatives: Accept-Language, Accept-Charset, Accept-Encoding. Also look at From and Via, which tend to lead to individual judgement calls. Most headers in X- are just white noise, but there are a handful of exceptions.
2:50 pm on June 29, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


Thanks @lucy24. I will need time to digest this!
2:25 pm on July 3, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


OK, I got it logging, though logheaders.php needed to go deep within my theme rather than in the subdirectory where I wanted it. The header file is logging.

I named the code "logheaders.php" and put it into a subdir of public_html called "loghead". On my site I did an include

<body>
<?php include '/loghead/logheaders.php';?>

This did not work. Does this file need to be in the same dir where it is included?

and changed the php code to:
$fh = fopen($_SERVER['DOCUMENT_ROOT'] . "/loghead/headers-". date('Ymd') . ".log","a");

The php code does work and I see logging. Not that I understand much yet.

Are there permissions I need to set? The dir is set to 770, with user/group having rwx and world having none. Is this correct? logheaders.php is set to 755. Is this correct?

Shared server: Apache Version 2.2.34, PHP Version 5.3.29
3:58 pm on July 3, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


I added logging to my 403.php file
6:17 pm on July 3, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


Does this file need to be in the same dir where it is included?
I do mine with SSI since I'd already coded for those, so it just meant adding a line to the existing footer using “include virtual”.
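(Roughly like this, that is--the path here matches the php include below, not necessarily your layout:)
<!--#include virtual="/includes/logheaders.php" -->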

Oh, wait, I do have it as a php include in one place, because I rewrite robots.txt to robots.php. (For two reasons: so I can use a single shared Disallow list for all sites, and so I can log headers on robots.txt requests.)
include ($_SERVER['DOCUMENT_ROOT'] . "/includes/logheaders.php");
You probably just forgot the DOCUMENT_ROOT bit.

Are there permissions I need to set?
If things are getting written, then you can safely assume the permissions are what they need to be.

:: uneasily wondering what I said in earlier versions of this post that triggered a 403 (yikes!) from the site when I tried to Preview ::
7:59 pm on July 3, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


<?php include($_SERVER['DOCUMENT_ROOT'] . '/loghead/logheaders.php') ?>

Thanks! That seemed to be the trick!

Now for the task of figuring out how to use this new header info!
6:18 pm on July 5, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


My logging is working well. And now for the questions:
Raw access log:
188.72.127.* [03/Jul/2018:19:22:58 GET /example/chinese-tourist-to-new-zealand-visa-requirements/ HTTP/1.0 200 43198 http://example.com/chinese-tourist-to-new-zealand-visa-requirements/ Mozilla/5.0 (Windows NT 6.1; Win64; rv:38.0) Gecko/20100101 Firefox/38.0

Headers
2018-07-03:23:22:58
URL:/example/chinese-tourist-to-new-zealand-visa-requirements/
IP:188.72.127.*
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Host:example.com
Referer:http://example.com/wp/2018/06/28/chinese-tourist-to-new-zealand-visa-requirements/
User-Agent:Mozilla/5.0

188.72.127.0 - 188.72.127.127
netname: NLNetwork Nederlands Net NL
There is exactly a 4 hr time difference between the raw access log and the headers log. This I have seen before.
Should I ban all http/1.0s?
Why is the raw access log UA (Mozilla/5.0 (Windows NT 6.1; Win64; rv:38.0) Gecko/20100101 Firefox/38.0) different from the headers UA (Mozilla/5.0)? Should they not be the same, as they are from the same source?

What else can the headers tell me that is not available from the raw access log, related to IDing a bot?
9:55 pm on July 5, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


There is exactly a 4 hr time difference between the raw access log and the headers log. This I have seen before.
Heh. I suppose they're using different clocks--that is, the access logs may be set to use something other than server time. Presumably you know which of the two is correct.

Should I ban all http/1.0s?
There's a quite recent thread [webmasterworld.com] on this very subject. As with so many things, it's a judgement call.

Why is the raw access log UA (Mozilla/5.0 (Windows NT 6.1; Win64; rv:38.0) Gecko/20100101 Firefox/38.0) different from the headers UA (Mozilla/5.0). Should they not be the same, as they are from the same source?
One would think so. Is it possible something sneaked into your code that causes it only to return the first part of the UA? The first “word” (set of non-space characters), or everything before the parentheses, are the two possibilities that jump out at me.
10:32 pm on July 5, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


I actually did not study the code, and did not modify the php you recommended. The header UAs seem suspiciously short, but no, incrediBill's code does nothing special with the UA.

fwrite($fh, date('Y-m-d:') . date("H:i:s\n"));
fwrite($fh, "URL: $thispage\n");
fwrite($fh, "IP: $ip\n");

foreach (getallheaders() as $name => $value)
{
fwrite($fh, "$name: $value\n");
}

He writes the date/time, then URL, then IP, then loops over everything else in the headers. There is no code specifically for the UA.

That the header UA and the raw access log UA differ is interesting! I need to find a live human visit and look at both the header and the raw access log.
10:52 pm on July 5, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


Here's a live human. In my raw access log I've been tracking humans for a number of years, so I can positively ID them. They have a tell-tale "smell".

Raw Access Log:
64.229.227.* [03/Jul/2018:12:11:52 GET /example/rocker-ra-200-acoustic-guitar/ HTTP/1.1 200 59712 https://www.google.ca/ Mozilla/5.0 (Linux; Android 7.0; SM-G390W Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Mobile Safari/537.36

Header Log:
2018-07-03:16:11:52
URL:/example/rocker-ra-200-acoustic-guitar/
IP:64.229.227.*
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip,
Accept-Language:en-US,en;q=0.9
Connection:keep-alive
Host:example.com
Referer:https://www.google.ca/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0

Raw access log UA: Mozilla/5.0 (Linux; Android 7.0; SM-G390W Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Mobile Safari/537.36
Header Log UA: Mozilla/5.0

Maybe they are just different, but why? I'll need to research this. Anyone know?

It looks like humans have a language: Accept-Language:en-US,en;q=0.9
11:49 pm on July 5, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


Accept-Encoding:gzip,
Really? gzip followed by a comma? I've never seen that--but I have seen thousands of gzip, comma, more-stuff. Is something eating the parts after the space?

Yes, humans have a language--usually several--and, unless it's a mobile, the Accept-Language header is rarely something minimalist like "en" or "en-us" alone.
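(If you ever want to act on that, a toggle in the same style as the earlier examples would do--the variable name is illustrative:)
SetEnvIf Accept-Language ^$ no_language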

Pull up some standard search engine spiders like bingbot or Googlebot and show me what their headers say. This is intriguing.
11:57 pm on July 5, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


Accept-Encoding:gzip,

I just confirmed that yes, some have a comma at the end, some have no comma, and some have a comma followed by other info:
Accept-Encoding:gzip,
Accept-Encoding:gzip
Accept-Encoding:gzip,deflate,br


Googlebot
2018-07-03:16:08:52
URL:/example/toronto-chinese-neighbourhoods/
IP:66.249.69.153
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding:gzip,deflate,br
Connection:keep-alive
From:googlebot(at)googlebot.com
Host:example.com
If-Modified-Since:Tue,
User-Agent:Mozilla/5.0

Yisou
2018-07-03:15:30:58
URL:/example/tag/cambridge-university-press/
IP:106.11.155.141
Accept:*/*
Accept-Encoding:gzip
Accept-Language:zh-cn,en-us,zh-tw,en-gb,en;
Connection:Keep-Alive
Host:example.com
User-Agent:YisouSpider

Yandex
2018-07-03:15:41:18
URL:/example/tag/put-on/
IP:5.255.250.153
Accept:*/*
Accept-Encoding:gzip,deflate
Connection:Keep-Alive
From:support@search.yandex.ru
Host:example.com
User-Agent:Mozilla/5.0

Bingbot
2018-07-03:16:15:05
URL:/example/tag/partition/
IP:157.55.39.255
Accept:*/*
Accept-Encoding:gzip,
Cache-Control:no-cache
Connection:Keep-Alive
From:bingbot(at)microsoft.com
Host:example.com
Pragma:no-cache
User-Agent:Mozilla/5.0

Alexa
2018-07-03:16:32:10
URL:/example/2017/03/20/persimmons-china-and-smog/
IP:54.165.90.203
Connection:close
From:crawler@alexa.com
Host:example.com
User-Agent:ia_archiver

Sogou
2018-07-03:17:14:22
URL:/example/beijing-university-student-dorms-past-vs-present/
IP:218.30.103.83
Accept:*/*
Accept-Encoding:gzip,deflate
Accept-Language:zh-cn
Connection:close
Host:example.com
User-Agent:Sogou

This one returned a cookie. yummy!
2018-07-03:20:44:51
URL:/example/san-yang-pai-cy-760-range-hood-onoff-switch/
IP:67.164.105.166
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip,
Accept-Language:en-US,en;q=0.9
Connection:keep-alive
Cookie:_ga=GA1.2.1786510478.1530593713;
Dnt:1
Host:example.com
Referer:https://www.google.com/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0
12:44 am on July 6, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


It really, really looks as if something is eating everything from the first space onward. And what about the missing space after the header name? Is that an artifact of posting or was it missing all along? It's clearly present in the code
 fwrite($fh, "$name: $value\n");
I'm personally baffled. Try posting a parallel question in the php subforum and see if an explanation jumps up and hits someone in the face.

Googlebot:
If-Modified-Since:Tue,
User-Agent:Mozilla/5.0
obviously s/b
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
(or whatever it is for the vanilla googlebot--I just grabbed the first header), and similarly for Yandex and so on.

And then
If-Modified-Since: Wed, 14 Jun 2017 01:48:16 GMT
(again, mutatis mutandis).

Similarly for Bing:
Accept-Encoding:gzip,
User-Agent:Mozilla/5.0
s/b
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Accept-Encoding: gzip, deflate
(Psst! Bing! I think the Cache-Control header was meant to obviate the old-fashioned Pragma header.)

YisouSpider? Really? I've never set eyes on them. They must be triggered by Chinese links.
1:44 am on July 6, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


Punt to the php sub-forum: [webmasterworld.com...]

What is s/b?

I write some content in bilingual English and Chinese, and encourage the Chinese search engines to index me: Yisou, Baidu, Sogou, 360 Haosou
1:57 am on July 6, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


What is s/b?
Sorry. s/b “should be”.
6:45 pm on July 17, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


I am slowly working through request headers. Headers really open up a whole new area to check. I have some questions. Are these assumptions correct?

  • All human browsers have a request header
  • All human browsers will provide an Accept-Language


SetEnvIf Accept ^$ bad_header
deny from env=bad_header
Is this the code to deny anyone who does not provide a header?

I would like a conditional deny if a PUT is combined with no Accept header, or if a PUT is combined with a language such as Russian. I think I can use SetEnvIf to check for a header, or to check for Accept-Language, but how do you combine it with a condition such as a PUT?

RewriteCond %{REQUEST_METHOD} ^(PUT) does this condition, but I do not see one for Accept-Language. And that is mod_rewrite, not SetEnvIf.

Can I use SetEnvIf as a condition on REQUEST_METHOD? I prefer SetEnvIf because it is inherited by all my sites and mod_rewrite is not.
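(For reference, SetEnvIf can match Request_Method directly, and a later directive can unset a variable set by an earlier one, so "PUT with no Accept header" can be expressed by chaining--a hedged sketch, variable name illustrative:)
# flag every PUT request...
SetEnvIf Request_Method ^PUT$ suspect_put
# ...then clear the flag whenever any Accept header is present,
# leaving it set only for PUTs with a missing or empty Accept
SetEnvIf Accept . !suspect_put
Deny from env=suspect_put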
7:43 pm on July 17, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


All human browsers have a request header
What do you mean by “a request header”? There are request headers and there are response headers. Everything that comes in with the request is a request header.

Is this the code to deny anyone who does not provide a header?
Oh. When you say “header” do you mean “Accept: header”? Personally I'd make it more precise; my own version says noaccept for a missing Accept: header, and then there's a long list of other missing and/or anomalous headers. And then I have to disable the “noaccept” environmental variable for law-abiding robots that happen not to send it.

:: shuffling papers ::

My current exemption list for missing Accept: header has 32 names, up to and including Googlebot.

There is variability in human headers, but yeah, Accept: is pretty basic.

Did you ever figure out the eat-everything-after-the-spaces issue?
8:03 pm on July 17, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


I have not figured out the seemingly missing spaces issue in logheaders.php. The code looks simple enough, but I will need time to go through every line and see if I can find an error. It is on my to-do list. The code posted in the php sub-forum did not get any response.

The header I am talking about is the one that gets logged in headers-yyyymmdd.log. This is what is logged coming from the requester, so I thought this was called a request header. What is the name of this type of header? Here's an example:
2018-07-16:10:38:45
URL:/example/2017/02/09/parking-ticket-city-of-toronto-canada/
IP:23.101.169.*
Accept:*/*
Accept-Encoding:gzip,
Accept-Language:en-US
Connection:Keep-Alive
Host:example.com
Referer:http://www.bing.com/search?q=city+of+toronto+plate+d.o.t&form=MSNH14&sc=8-4&sp=-1&qs=n&sk=
User-Agent:Mozilla/5.0
Is each line of the above considered a header? When I say header I am talking about all the above lines in total as a single object and I call that a header. This may be incorrect.

SetEnvIf Accept ^$ bad_header
deny from env=bad_header
If I block anything without a header, does this have negative consequences? All the human requests I have checked have headers. Many bad bots that spam me do not have headers.
9:24 pm on July 17, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


Yes, each individual line is a header. There's the Referer header, the User-Agent header, the Accept-Language header, et cetera et cetera. Hence the “foreach” business in the LogHeaders code. The requested URL itself is not technically a header; that's why there's a separate bit of code for it in the LogHeaders function. In fact I added this because the original didn't have it, so I had to keep cross-checking with access logs to see what file was requested by suchandsuch IP at suchandsuch time.

Many bad bots that spam me do not have headers.
If there's a request, there must be headers. At the absolute minimum the request carries the sender's IP address--strictly part of the connection rather than a header, but logged alongside them--because otherwise there is no place to send the requested material to. Conversely, on shared hosting there has to be a Host: header, or the request will not reach your site in the first place. That’s why it is called a “Request”: the User-Agent, whether human or robotic, is asking your server to send them suchandsuch content. Some 400 responses--exactly 400 I mean, not 400-class in general--may be because an essential header is missing.

Are you logging headers on your custom 403 and 404 pages? Even if robots don't bother to look at the page, your server still prepares it and sends it out. The only time I don't see request headers is when there is a 418 response (my host's response with mod_security) because those are handled at the server level and never reach my userspace at all. (Whee! It's the shared-hosting analogue to a firewall, except that they do still get listed in access logs.)
12:14 am on July 18, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


2018-07-18:00:05:51
URL: /example/tag/security/page/2/
IP: 157.55.39.*
Accept: */*
Accept-Encoding: gzip, deflate
Cache-Control: no-cache
Connection: Keep-Alive
From: bingbot(at)microsoft.com
Host: example.com
Pragma: no-cache
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
----
After thoroughly going through the code I found absolutely nothing incorrect. I then began inserting random "***" into the code to try to problem solve. This proved unhelpful but did shift my thinking.

It turns out that it was my misuse of OpenOffice Calc and how I imported the file. Do not use a " " as a delimiter to separate columns. Noob error.
12:40 am on July 18, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


You mean all this time you've been quoting OpenOffice output rather than the actual headers-whatever.log file, which can be opened in any text editor?

Oh.
8:19 pm on July 19, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


OK, I am finally getting used to this request header info. It provides a lot of additional information. I'll keep monitoring the difference between a human and a bot. Humans are a lot cleaner in their request headers: they read one page at a time and therefore generate a single log entry. Bots are very messy and greedy for multiple pages.

One issue I have run into is that I have included the logheaders.php code in the head portion of my site. This executes and logs on each page fetch. But if the request is for a specific file, such as a .jpg, there is no page to render, so the code does not run and therefore there is no logging. Is there something I can put into my .htaccess to log these?

I am logging 403s but not yet logging 404s. I need to do the code for 404s as well. I have not yet added any code to my .htaccess from the log file analysis.
8:44 pm on July 19, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


I need to do the code for 404s as well.
Just put it in your custom 404 page. It's one line.
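That is, the same one-line include you're already using on regular pages--your /loghead/ path, not mine:
<?php include($_SERVER['DOCUMENT_ROOT'] . '/loghead/logheaders.php'); ?>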

I don't log requests for anything but pages or robots.txt; it's not worth the bother. But since I do log headers on the custom 403 page, that means when an image request is blocked, I get a chance to see its headers. This, in turn, means that I know when I need to set a few extra toggles based on the filetype of the request.

Besides, I don't know how ;)

On robots.txt requests I do it by rewriting to “robots.php”, which can then include anything it wants to. I originally did this so multiple sites could share a single comprehensive User-Agent list; the header logging was just a bonus.
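(The rewrite itself is nothing exotic--a sketch along these lines, not my exact rule:)
# hand robots.txt requests to a PHP script that can log headers
RewriteRule ^robots\.txt$ /robots.php [L]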
2:00 am on July 22, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


I think I will start with checking request headers for the presence of Accept, Host and UA. I already check for UA. This, along with IP, should be the bare minimum headers for any request.

I have yet to see any request that did not come with an Accept or Host header, but I'm just starting out. I have seen requests with no UA in my raw access log.
2:55 am on July 22, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


Missing UA is definitely the most common transgression. If you are on shared hosting, you will never see a request with missing Host: header, simply because the request won't reach your site in the first place. The other header that is never, ever missing on my sites is Connection. In my case its value is always, without exception, "close"; for other people it may have some other constant value.

Note that
BrowserMatch !.
or
BrowserMatch ^$
will apply both to empty UA headers and to missing ones. But an empty UA header--or, for that matter, any empty header--is extremely rare. They may even be processing glitches rather than intentional omissions.

:: detour to logged headers ::

Hm, now that's interesting. The great majority of requests with empty (as opposed to missing) User-Agent header are for robots.txt. Most of the rest are for ads.txt--a file I happen not to have.
5:07 pm on July 25, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


from @lucy24 [webmasterworld.com...]
Request Headers: did not trigger request headers

The second request--the one that got as far as a 403--should have triggered header logging. (You've got it on your 403 page, right?)

Yes, and thanks, you are correct. My 403.php did not give me an error or warning of any sort, but also did not log. I have logheaders.php in a subdirectory and 403.php did not find it. I have duplicated the php in public_html and now it logs.

Ditto for 404.php, which I had not noticed was also not logging. Both are now logging.

I have been noticing that some spamming bots do not send the Accept request header, which I have now used in my rules, as you explained.
1:11 am on July 27, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


@lucy24 I know you say that your Connection header always says:
Connection: close

I am trying to figure out the logic. It seems the Connection value is consistent for each search engine. I have not verified what humans do on a regular basis, by browser type. How this helps with IDing bots I do not know.
Here is someone I verified as "Human":

47.148.106.12 [25/Jul/2018:20:09:37 GET /example/2015/11/05/san-yang-pai-cy-760-range-hood-onoff-switch-replacement/ HTTP/1.1 200 59303 [google.com...] (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36

2018-07-26:00:09:38
URL: /example/2015/11/05/san-yang-pai-cy-760-range-hood-onoff-switch/
IP: 47.148.106.12
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Host: example.com
Referer: [google.com...]
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36

Baiduspider always seems to send Connection: close
180.76.15.151 [25/Jul/2018:20:03:41 GET /example/2017/07/14/content-security-policy/ HTTP/1.1 200 51889 - Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

2018-07-26:00:03:41
URL: /example/2017/07/14/content-security-policy/
IP: 180.76.15.151
Accept: */*
Accept-Encoding: gzip
Accept-Language: en-US
Connection: close
Host: example.com
User-Agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Googlebot always sends me Connection: keep-alive
2018-07-26:00:06:59
URL: /example/2009/05/29/jack-stand-points-nissan-altima/
IP: 66.249.69.150
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip,deflate,br
Connection: keep-alive
From: googlebot(at)googlebot.com
Host: example.com
If-Modified-Since: Sun, 22 Jul 2018 22:58:13 GMT
User-Agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
X-Https: 1

Yisou always sends me Connection: Keep-Alive
2018-07-26:00:10:27
URL: /example/2016/06/14/vnpt-vn-content-scraper-research/
IP: 106.11.156.170
Accept: */*
Accept-Encoding: gzip
Accept-Language: zh-cn,en-us,zh-tw,en-gb,en;
Connection: Keep-Alive
Host: example.com
User-Agent: YisouSpider

Bingbot always seems to send me Connection: Keep-Alive
2018-07-26:00:13:59
URL: /example/2009/02/16/chinese-overseas/
IP: 157.55.39.252
Accept: */*
Accept-Encoding: gzip, deflate
Cache-Control: no-cache
Connection: Keep-Alive
From: bingbot(at)microsoft.com
Host: example.com
Pragma: no-cache
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Seznam always sends me Connection: keep-alive
2018-07-26:00:19:10
URL: /example/tag/milk/
IP: 77.75.79.109
Accept: */*
Accept-Encoding: gzip, deflate
Accept-Language: cs
Connection: keep-alive
Host: example.com
If-Modified-Since: Tue, 17 Jul 2018 04:50:51 GMT
User-Agent: Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)

Yandex always sends me Connection: Keep-Alive
2018-07-26:00:26:53
URL: /example/tag/%E7%8C%AA%E6%89%92/
IP: 5.255.250.153
Accept: */*
Accept-Encoding: gzip,deflate
Connection: Keep-Alive
From: support@search.yandex.ru
Host: example.com
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Sogou always sends me Connection: close
2018-07-26:00:29:18
URL: /example/2009/03/13/increase-efficiency-drupal-feed-aggregator/www.google.com/reader
IP: 111.202.100.87
Accept: */*
Accept-Encoding: gzip,deflate
Accept-Language: zh-cn
Connection: close
Host: example.com
User-Agent: Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)

HTTP/1.1 defines the "close" connection option for the sender to signal that the connection will be closed after completion of the response. [w3.org...]

This Mozilla doc [developer.mozilla.org...] says the HTTP/1.0 default is Connection: close and the HTTP/1.1 default is a persistent connection (keep-alive). Neither of these defaults is consistently reflected in my header logs.
1:51 am on July 27, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15031
votes: 665


I read somewhere--possibly even hereabouts in the middle of something unrelated--that hosts may do things to the Connection: header. It certainly seems that way in my case, since it is always present and always has the same value. For comparison purposes, the Host: header is always present--it has to be--but its value can vary, as in example.com vs. WWW.EXAMPLE.COM. (I have never seen a legitimate request giving my hostname in ALL CAPITALS. I suspect they are very, very elderly robots.)

:: looking vaguely around for not2easy, whose superpowers include Finding Stuff ::

If you're seeing different values of the header, and also some requests where it's absent entirely, then you must be seeing the actual header as it is sent--or not sent--by the requester.
1:59 am on July 27, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 444
votes: 37


The Connection header from my server seems to always be present. While consistent for each individual search engine, it can be either Connection: close or Connection: Keep-Alive.