Forum Moderators: phranque
SetEnvIf header-field-name header-field-value one-or-more-variables-to-set
It is very rarely appropriate to use SetEnvIfNoCase, because anything wrongly cased is likely to be wrong, like the “GoogleBot” I used to meet a few years ago. The other mod_setenvif directive is
BrowserMatch blahblah
which is simply shorthand for
SetEnvIf User-Agent blahblah
Again, think twice about BrowserMatchNoCase. You want the real thing, not a wrongly cased faker.
BrowserMatch ^$ no_agent
SetEnvIf Accept ^$ noaccept
Deny from env=no_agent
Deny from env=noaccept
Those are straightforward: slam the door in their face if they don’t send an Accept header, or don't send a User-Agent header. Note that there is no way to distinguish between a header that is absent--Apache logs say "-"--and one that is empty--Apache logs say "" alone--but fortunately it doesn't matter. The Apache 2.4 equivalent of a Deny line is
Require env noaccept
(assuming a <RequireNone> envelope). Exceptions are written with an exclamation mark:
BrowserMatch Googlebot !noaccept
SetEnvIf Remote_Addr ^31\.13\.(6[4-9]|[7-9][0-9]|1[01]\d|12[0-7]) !no_agent
meaning “turn off the environmental variable I just set”. (I am assuming here that you have a rule elsewhere, probably using mod_rewrite, that unconditionally denies anyone who claims to be Googlebot but doesn’t come from a Google crawl range. The only other search engine UA that’s routinely faked is, for some reason, Baidu.) The second line, involving IP addresses, is specifically for Facebook, which has recently picked up a nasty habit of not sending a UA. There are actually five possible IP ranges; there's a thread somewhere hereabouts that lists them.
<?php
// Return a $_SERVER entry, or false when it is not set.
function get_server($var)
{
    return isset($_SERVER[$var]) ? $_SERVER[$var] : false;
}

// Fallback for setups where PHP does not provide getallheaders().
if (!function_exists('getallheaders'))
{
    function getallheaders()
    {
        $headers = array(); // must start as an array, not a string
        foreach ($_SERVER as $name => $value)
        {
            if (substr($name, 0, 5) == 'HTTP_')
            {
                // HTTP_ACCEPT_ENCODING becomes Accept-Encoding
                $headers[str_replace(' ', '-', ucwords(strtolower(str_replace('_', ' ', substr($name, 5)))))] = $value;
            }
        }
        return $headers;
    }
}

$ip = get_server('REMOTE_ADDR');
// One log file per day, opened in append mode.
$fh = fopen($_SERVER['DOCUMENT_ROOT'] . "/boilerplate/headers-". date('Ymd') . ".log","a");
fwrite($fh, date('Y-m-d:') . date("H:i:s\n"));
$thispage = $_SERVER['REQUEST_URI'];
fwrite($fh, "URL: $thispage\n");
fwrite($fh, "IP: $ip\n");
// Write every request header, one per line.
foreach (getallheaders() as $name => $value)
{
    fwrite($fh, "$name: $value\n");
}
fwrite($fh, "----\n\n");
fclose($fh);
?>
I have a /boilerplate/ directory that I use for, well, boilerplate, so that's where I told it to keep my logged headers. It will make a new file each day, and they will remain there until you delete them (unlike log files, which your host probably wipes after a set time period). Watch out! Log files typically “roll over” at some dead hour of the night, so your logged headers will sometimes be under a different date than the access logs themselves.
<body>
<?php include '/loghead/logheaders.php';?>
$fh = fopen($_SERVER['DOCUMENT_ROOT'] . "/loghead/headers-". date('Ymd') . ".log","a");
Does this file need to be in the same dir where it is included?
I do mine with SSI since I'd already coded for those, so it just meant adding a line to the existing footer using “include virtual”.
include ($_SERVER['DOCUMENT_ROOT'] . "/includes/logheaders.php");
You probably just forgot the DOCUMENT_ROOT bit.
Are there permissions I need to set?
If things are getting written, then you can safely assume the permissions are what they need to be.
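On the include path question above: in PHP a leading slash in an include means the root of the server's filesystem, not the root of the site, which is why the DOCUMENT_ROOT prefix matters. Here is a minimal sketch of building the path; site_path() is a made-up helper name and the document root shown is an assumed value for illustration only.

```php
<?php
// Hypothetical helper (not from the thread): join the document root and a
// site-relative path without doubling or dropping the slash between them.
function site_path($docroot, $relative)
{
    return rtrim($docroot, '/') . '/' . ltrim($relative, '/');
}

// In a real page this would be $_SERVER['DOCUMENT_ROOT'];
// the value below is an assumption for the example.
$docroot = '/home/example/public_html';

$file = site_path($docroot, '/loghead/logheaders.php');
echo $file, "\n";

// include $file;   // enable once the file exists at that location
```

The same helper works for the log file path, so the include and the fopen() call stay in step if the directory is renamed.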
188.72.127.* [03/Jul/2018:19:22:58 GET /example/chinese-tourist-to-new-zealand-visa-requirements/ HTTP/1.0 200 43198 http://example.com/chinese-tourist-to-new-zealand-visa-requirements/ Mozilla/5.0 (Windows NT 6.1; Win64; rv:38.0) Gecko/20100101 Firefox/38.0
2018-07-03:23:22:58
URL:/example/chinese-tourist-to-new-zealand-visa-requirements/
IP:188.72.127.*
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Host:example.com
Referer:http://example.com/wp/2018/06/28/chinese-tourist-to-new-zealand-visa-requirements/
User-Agent:Mozilla/5.0
There is exactly a 4 hr time difference between the raw access log and the headers log. This I have seen before.
Heh. I suppose they're using different clocks--that is, the access logs may be set to use something other than server time. Presumably you know which of the two is correct.
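If the two logs really are on different clocks, one option is to make the PHP side state its time zone explicitly. The sketch below assumes (the thread does not confirm this) that the headers log is written in UTC and the access log in a zone four hours behind; America/Toronto is only a guess that happens to fit the sample above, and log_stamp() is a made-up helper name.

```php
<?php
// Hypothetical helper (not from the thread): format a moment in a named
// time zone, using the same layout as the headers log.
function log_stamp(DateTimeImmutable $when, $zone)
{
    return $when->setTimezone(new DateTimeZone($zone))->format('Y-m-d:H:i:s');
}

// The headers-log time from the sample, read as UTC (an assumption).
$moment = new DateTimeImmutable('2018-07-03 23:22:58', new DateTimeZone('UTC'));

echo log_stamp($moment, 'UTC'), "\n";
echo log_stamp($moment, 'America/Toronto'), "\n"; // assumed access-log zone
```

In the logging script itself, the equivalent change would be to pass new DateTimeImmutable('now') to the helper instead of calling date().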
Should I ban all http/1.0s?
There's a quite recent thread [webmasterworld.com] on this very subject. As with so many things, it's a judgement call.
Why is the raw access log UA (Mozilla/5.0 (Windows NT 6.1; Win64; rv:38.0) Gecko/20100101 Firefox/38.0) different from the headers UA (Mozilla/5.0)? Should they not be the same, as they are from the same source?
One would think so. Is it possible something sneaked into your code that causes it only to return the first part of the UA? The first “word” (set of non-space characters), or everything before the parentheses, are the two possibilities that jump out at me.
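One way to narrow this down might be to write the raw value with its length and explicit quoting, so it is obvious whether the string is already short when PHP receives it or gets cut later on its way into the file. This is only a debugging sketch; describe_value() is a made-up name, not part of the original script.

```php
<?php
// Hypothetical debugging helper (not from the thread): show a value with
// its byte length and JSON-style quoting so spaces and cut-offs are visible.
function describe_value($value)
{
    return strlen($value) . ' bytes: ' . json_encode($value, JSON_UNESCAPED_SLASHES);
}

// Read the User-Agent straight from $_SERVER, before anything else touches it.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
echo describe_value($ua), "\n";
```

If the length reported here covers the full UA while the log file still shows only the first word, the loss is happening after the value is read, not before.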
fwrite($fh, date('Y-m-d:') . date("H:i:s\n"));
fwrite($fh, "URL: $thispage\n");
fwrite($fh, "IP: $ip\n");
foreach (getallheaders() as $name => $value)
{
fwrite($fh, "$name: $value\n");
}
64.229.227.* [03/Jul/2018:12:11:52 GET /example/rocker-ra-200-acoustic-guitar/ HTTP/1.1 200 59712 https://www.google.ca/ Mozilla/5.0 (Linux; Android 7.0; SM-G390W Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Mobile Safari/537.36
2018-07-03:16:11:52
URL:/example/rocker-ra-200-acoustic-guitar/
IP:64.229.227.*
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip,
Accept-Language:en-US,en;q=0.9
Connection:keep-alive
Host:example.com
Referer:https://www.google.ca/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0
Accept-Encoding:gzip,
Really? gzip followed by a comma? I've never seen that--but I have seen thousands of gzip, comma, more-stuff. Is something eating the parts after the space?
Accept-Encoding:gzip,
Accept-Encoding:gzip,
Accept-Encoding:gzip
Accept-Encoding:gzip,deflate,br
2018-07-03:16:08:52
URL:/example/toronto-chinese-neighbourhoods/
IP:66.249.69.153
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding:gzip,deflate,br
Connection:keep-alive
From:googlebot(at)googlebot.com
Host:example.com
If-Modified-Since:Tue,
User-Agent:Mozilla/5.0
2018-07-03:15:30:58
URL:/example/tag/cambridge-university-press/
IP:106.11.155.141
Accept:*/*
Accept-Encoding:gzip
Accept-Language:zh-cn,en-us,zh-tw,en-gb,en;
Connection:Keep-Alive
Host:example.com
User-Agent:YisouSpider
2018-07-03:15:41:18
URL:/example/tag/put-on/
IP:5.255.250.153
Accept:*/*
Accept-Encoding:gzip,deflate
Connection:Keep-Alive
From:support@search.yandex.ru
Host:example.com
User-Agent:Mozilla/5.0
2018-07-03:16:15:05
URL:/example/tag/partition/
IP:157.55.39.255
Accept:*/*
Accept-Encoding:gzip,
Cache-Control:no-cache
Connection:Keep-Alive
From:bingbot(at)microsoft.com
Host:example.com
Pragma:no-cache
User-Agent:Mozilla/5.0
2018-07-03:16:32:10
URL:/example/2017/03/20/persimmons-china-and-smog/
IP:54.165.90.203
Connection:close
From:crawler@alexa.com
Host:example.com
User-Agent:ia_archiver
2018-07-03:17:14:22
URL:/example/beijing-university-student-dorms-past-vs-present/
IP:218.30.103.83
Accept:*/*
Accept-Encoding:gzip,deflate
Accept-Language:zh-cn
Connection:close
Host:example.com
User-Agent:Sogou
2018-07-03:20:44:51
URL:/example/san-yang-pai-cy-760-range-hood-onoff-switch/
IP:67.164.105.166
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip,
Accept-Language:en-US,en;q=0.9
Connection:keep-alive
Cookie:_ga=GA1.2.1786510478.1530593713;
Dnt:1
Host:example.com
Referer:https://www.google.com/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0
fwrite($fh, "$name: $value\n");
I'm personally baffled. Try posting a parallel question in the php subforum and see if an explanation jumps up and hits someone in the face.
If-Modified-Since:Tue,
User-Agent:Mozilla/5.0
obviously s/b
User-Agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
(or whatever it is for the vanilla googlebot--I just grabbed the first header), and similarly for Yandex and so on.
If-Modified-Since: Wed, 14 Jun 2017 01:48:16 GMT
(again, mutatis mutandis).
Accept-Encoding:gzip,
User-Agent:Mozilla/5.0
s/b
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Accept-Encoding: gzip, deflate
(Psst! Bing! I think the Cache-Control header was meant to obviate the old-fashioned Pragma header.)
SetEnvIf Accept ^$ bad_header
deny from env=bad_header
Is this the code to deny anyone who does not provide a header? All human browsers have a request header
What do you mean by “a request header”? There are request headers and there are response headers. Everything that comes in with the request is a request header.
Is this the code to deny anyone who does not provide a header?
Oh. When you say “header” do you mean “Accept: header”? Personally I'd make it more precise; my own version says noaccept for a missing Accept: header, and then there's a long list of other missing and/or anomalous headers. And then I have to disable the “noaccept” environmental variable for law-abiding robots that happen not to send it.
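Putting that description together, a minimal sketch might look like the following. The variable names follow the ones used earlier in the thread; the choice of ia_archiver as the exempted robot is only an illustration, picked because it shows up without an Accept header in the logged samples. It is not a recommended whitelist, and the Order/Allow lines assume Apache 2.2-style access control.

```apache
# Sketch only. Flag a missing or empty Accept header.
SetEnvIf Accept ^$ noaccept

# Flag a missing or empty User-Agent header.
BrowserMatch ^$ no_agent

# Turn the flag back off for a robot that is welcome but sends no Accept
# header (ia_archiver is an illustrative choice, not a recommendation).
BrowserMatch ^ia_archiver !noaccept

# Apache 2.2-style access control: allow everyone, then deny flagged requests.
Order Allow,Deny
Allow from all
Deny from env=noaccept
Deny from env=no_agent
```

Using a separate variable for each missing header, rather than a single bad_header, makes it possible to switch off one check for a given robot without switching off the others.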
2018-07-16:10:38:45
URL:/example/2017/02/09/parking-ticket-city-of-toronto-canada/
IP:23.101.169.*
Accept:*/*
Accept-Encoding:gzip,
Accept-Language:en-US
Connection:Keep-Alive
Host:example.com
Referer:http://www.bing.com/search?q=city+of+toronto+plate+d.o.t&form=MSNH14&sc=8-4&sp=-1&qs=n&sk=
User-Agent:Mozilla/5.0
Is each line of the above considered a header? When I say header I am talking about all the above lines in total as a single object and I call that a header. This may be incorrect.
SetEnvIf Accept ^$ bad_header
deny from env=bad_header
If I block anything without a header, does this have negative consequences? All the human requests I have checked have headers. Many bad bots that spam me do not have headers.
Many bad bots that spam me do not have headers.
If there's a request, there must be headers. At the absolute minimum there is an IP address (strictly speaking that comes from the connection rather than from a header, even though the logging script prints it as “IP:”), because otherwise there is no place to send the requested material to. Likewise, on shared hosting there has to be a Host: header, or the request will not reach your site in the first place. That’s why it is called a “Request”: the User-Agent, whether human or robotic, is asking your server to send them suchandsuch content. Some 400 responses--exactly 400 I mean, not 400-class in general--may be because an essential header is missing.
2018-07-18:00:05:51
URL: /example/tag/security/page/2/
IP: 157.55.39.*
Accept: */*
Accept-Encoding: gzip, deflate
Cache-Control: no-cache
Connection: Keep-Alive
From: bingbot(at)microsoft.com
Host: example.com
Pragma: no-cache
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
----
After thoroughly going through the code I found absolutely nothing incorrect. I then began inserting random "***" into the code to try to problem solve. This proved unhelpful but did shift my thinking.
I need to do the code for 404s as well.
Just put it in your custom 404 page. It's one line.
Request Headers: did not trigger request headers
The second request--the one that got as far as a 403--should have triggered header logging. (You've got it on your 403 page, right?)
Connection : close
47.148.106.12 [25/Jul/2018:20:09:37 GET /example/2015/11/05/san-yang-pai-cy-760-range-hood-onoff-switch-replacement/ HTTP/1.1 200 59303 [google.com...] (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
2018-07-26:00:09:38
URL: /example/2015/11/05/san-yang-pai-cy-760-range-hood-onoff-switch/
IP: 47.148.106.12
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Host: example.com
Referer: [google.com...]
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
180.76.15.151 [25/Jul/2018:20:03:41 GET /example/2017/07/14/content-security-policy/ HTTP/1.1 200 51889 - Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
2018-07-26:00:03:41
URL: /example/2017/07/14/content-security-policy/
IP: 180.76.15.151
Accept: */*
Accept-Encoding: gzip
Accept-Language: en-US
Connection: close
Host: example.com
User-Agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
2018-07-26:00:06:59
URL: /example/2009/05/29/jack-stand-points-nissan-altima/
IP: 66.249.69.150
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip,deflate,br
Connection: keep-alive
From: googlebot(at)googlebot.com
Host: example.com
If-Modified-Since: Sun, 22 Jul 2018 22:58:13 GMT
User-Agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
X-Https: 1
2018-07-26:00:10:27
URL: /example/2016/06/14/vnpt-vn-content-scraper-research/
IP: 106.11.156.170
Accept: */*
Accept-Encoding: gzip
Accept-Language: zh-cn,en-us,zh-tw,en-gb,en;
Connection: Keep-Alive
Host: example.com
User-Agent: YisouSpider
2018-07-26:00:13:59
URL: /example/2009/02/16/chinese-overseas/
IP: 157.55.39.252
Accept: */*
Accept-Encoding: gzip, deflate
Cache-Control: no-cache
Connection: Keep-Alive
From: bingbot(at)microsoft.com
Host: example.com
Pragma: no-cache
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
2018-07-26:00:19:10
URL: /example/tag/milk/
IP: 77.75.79.109
Accept: */*
Accept-Encoding: gzip, deflate
Accept-Language: cs
Connection: keep-alive
Host: example.com
If-Modified-Since: Tue, 17 Jul 2018 04:50:51 GMT
User-Agent: Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)
2018-07-26:00:26:53
URL: /example/tag/%E7%8C%AA%E6%89%92/
IP: 5.255.250.153
Accept: */*
Accept-Encoding: gzip,deflate
Connection: Keep-Alive
From: support@search.yandex.ru
Host: example.com
User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
2018-07-26:00:29:18
URL: /example/2009/03/13/increase-efficiency-drupal-feed-aggregator/www.google.com/reader
IP: 111.202.100.87
Accept: */*
Accept-Encoding: gzip,deflate
Accept-Language: zh-cn
Connection: close
Host: example.com
User-Agent: Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)