
Forum Moderators: Ocean10000 & phranque

Getting started: header-based access controls

     
8:24 pm on Jun 28, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 519
votes: 46


How do I get started on using header-based access controls?

Currently I am on an Apache server, shared host, using an .htaccess, robots.txt and error file, and this works pretty well. I run multiple web sites, each in its own directory, with the .htaccess in public_html, where my SetEnvIf directives cascade down to all subdirectories (inherited downward). I regularly read my raw access log, find the bad guys and ban them by UA (SetEnvIf) or IP (deny from) in .htaccess. I do have an error file, which I review, but it reveals very little.

How do I set up header-based access controls? Can someone point me to a link or two to get me started?

Thanks All!
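For example, is the idea something as simple as a PHP-level check along these lines (a rough sketch I made up, with placeholder file and function names, checking for a missing Accept header), or is it normally all done in .htaccess?

<?php
// headercheck.php - a minimal, illustrative header-based access check.
// (File, function and message names here are placeholders, not working code.)
// Deny requests that arrive with no Accept header at all; include this near
// the top of a page template before any other output.
function deny_if_no_accept_header() {
    $accept = isset($_SERVER['HTTP_ACCEPT']) ? trim($_SERVER['HTTP_ACCEPT']) : '';
    if ($accept === '') {
        http_response_code(403);
        echo '<h1>Forbidden</h1>';
        exit;
    }
}
deny_if_no_accept_header();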
12:32 am on Oct 21, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Oct 4, 2018
posts: 44
votes: 2


Okay, so I have now

- commented out my previous .htaccess directive and replaced it with a custom forbidden.php page containing both my prior message and the includes code from my child theme's header.php file

- tested it by trying my denied-from-all offlimits.txt, which successfully both displayed my message to me and logged my visit (in real time) in my headers.log

But in looking at my access and error logs I noticed that a 403'd bingbot visit* was not header-logged, even though it occurred between a prior, header-logged 200 Googlebot visit and my own header-logged 403 test. Several other visits haven't been header-logged either.

Now, I understand that only pages containing my includes code (my WordPress pages with headers and my custom error document forbidden.php) should be logging visits in my headers.log. That means either bingbot was not served the custom forbidden.php page (although all of these have been standard 403 errors, not 500 server errors), or something else happened.

Any clues?

Is there any way to simply place the include necessary to write to the header log directly in the .htaccess file itself in order to trigger it at the earliest possible stage?

Thanks,

* I selectively block several bingbot IPs simply to cripple its volume of hits on my server
12:54 am on Oct 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15444
votes: 739


If two requests come in at the same time, the logged headers may be garbled together. This is not a significant problem for me, as I've got a small low-traffic site, but it may lead to confusion on busier sites when two or more requests are writing to the same headers.log file at the same time. It's possible the code would need to be tweaked.
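For anyone who hasn't set this up, the logger being discussed is roughly along these lines (a sketch with made-up file and function names, not my exact code). Because every header goes out as its own little append, two requests hitting at once can interleave their lines:

<?php
// logheaders.php - rough sketch of a line-by-line header logger.
// Each fwrite() is a separate append, so two requests arriving at the same
// moment can interleave their lines in headers.log.
function logheaders() {
    $fh = fopen(__DIR__ . '/headers.log', 'a');
    if ($fh === false) {
        return;
    }
    fwrite($fh, date('Y-m-d:H:i:s') . "\n");
    fwrite($fh, 'URL: ' . $_SERVER['REQUEST_URI'] . "\n");
    fwrite($fh, 'IP: ' . $_SERVER['REMOTE_ADDR'] . "\n");
    foreach (getallheaders() as $name => $value) {  // getallheaders() is available under Apache
        fwrite($fh, $name . ': ' . $value . "\n");
    }
    fwrite($fh, "\n");
    fclose($fh);
}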

:: shuffling papers ::

Here's an unusually clean example. It is no coincidence that most of these tangles involve requests for things like wp files that are the hallmark of a malign robot, firing off requests as fast as it can:
2017-11-25:17:25:31
2017-11-25:17:25:31
URL: /xmlrpc.php
IP: 85.93.88.nnn
URL: /blog/xmlrpc.php
IP: 85.93.88.nnn
Cookie: wordpress_test_cookie=WP+Cookie+check
Cookie: wordpress_test_cookie=WP+Cookie+check
Accept-Language: en-US,en;q=0.8
Accept-Language: en-US,en;q=0.8
Content-Type: application/x-www-form-urlencoded
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; fr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; fr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8
Cache-Control: max-age=0
Cache-Control: max-age=0
Content-Length: 217
Content-Length: 217
Connection: close
Connection: close
Host: example.com
Host: example.com
And here's a rare one involving two different legitimate robots:
2017-12-21:05:12:41
2017-12-21:05:12:41
URL: /
IP: 157.55.39.189
URL: /robots.txt
IP: 68.180.230.166
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Host: www.example.com
From: bingbot(at)microsoft.com
Accept-Encoding: gzip, deflate
Connection: close
Accept: */*
Accept: */*
Pragma: no-cache
Host: www.example.com
Connection: close
User-Agent: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Cache-Control: no-cache

Whew!
12:56 am on Oct 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


I selectively block several bingbot IPs simply to cripple its volume of hits on my server
You may want to reconsider that.

Use robots.txt crawl-rate and crawl-delay directives instead. Blocking Bingbot just ends up reducing visitors.



12:57 am on Oct 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15444
votes: 739


Oh, oops, there was a question.
Is there any way to simply place the include necessary to write to the header log directly in the .htaccess file itself
I suppose you could do it by rewriting all page requests, both external and internal (like requests for the custom 403 page), to a php file that first invokes the logheaders function and then includes the entire requested page. That's how I log headers on requests for robots.txt. But I tend to doubt this is the best way to achieve the intended result. You can't put php directly in htaccess, if that's what you meant.
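Very roughly, the robots.txt version is something like this (again a sketch with made-up names, not my exact code):

<?php
// robots.php - sketch of the rewrite-to-PHP idea for robots.txt.
// Assumes an .htaccess rule along the lines of
//   RewriteRule ^robots\.txt$ /robots.php [L]
// so every request for robots.txt passes through this script.
require_once __DIR__ . '/logheaders.php';
logheaders();                                   // log this request's headers first
header('Content-Type: text/plain; charset=UTF-8');
readfile(__DIR__ . '/robots-source.txt');       // then emit the actual robots.txt content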

Consider using crawl-rate and crawl-delay directives instead.
Does bingbot honor crawl-delay? That Other Search Engine, as we all know, explicitly doesn't.
1:11 am on Oct 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


We don't even need to use robots.txt

Google has the setting here:
GSC > Crawl > Limit Google's maximum crawl rate

Bing has it here:
BWT > Crawl Control > Crawl Rate

My point is that blocking bingbot is like shooting yourself in the foot.
1:53 am on Oct 21, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Oct 4, 2018
posts: 44
votes: 2


Does bingbot honor crawl-delay? That Other Search Engine, as we all know, explicitly doesn't.


No, which is why I block it by IP instead. I get more referral hits from DuckDuckGo, which crawls about 1/100th as much as the one Bing IP I still allow.

The most apparent thing I've noticed so far logging headers is how few visitors actually get logged: successful 200s and, now, 403s (at least my own). By far the most regular entry in my headers log is the big G, even when someone else was hitting my server far more often around the time its visits were recorded. Perhaps this follows logically from G spidering actual page after actual page, while others may be hitting my site in ways that do not trigger the header.php includes.

I should also point out that Bing was exempt from my no-Accept-header denials at the time its 403 (denied by me because of that particular IP) went unlogged alongside my logged test 403; that is, it wasn't denied for lacking the proper header credentials.

In case this means anything as well, I should also reiterate that my header log is in real time sync with my access log, not displaced by four hours or whatever.
2:19 am on Oct 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


I get more referral hits from DuckDuckGo, which crawls about 1/100th as much as the one Bing IP I still allow.
I think you misunderstand the way indexing works. You cannot block all Bing IPs except one and expect the Bing search index to send you good traffic. All you're doing is blocking your site from being indexed, which translates to blocking visitors. No wonder you aren't getting as much traffic from it as from DDG.

So what if Bingbot crawls a lot. That's what bots do, and it's to your benefit. I showed you where and how to control the crawl rate. But hey, not everyone likes visitors to their site. We have several members that block almost everything.
3:42 am on Oct 21, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Oct 4, 2018
posts: 44
votes: 2


keyplyr, I would be the last to suggest that bingbot does not obey your every crawl rate whim.

My own experience over more than several years, however, has been different. Whether I let Bing crawl at will or try to adjust its crawl rate, both through robots.txt and within their Webmaster Tools console, I find that 1) Bing respects neither, 2) it sends me virtually no traffic at all compared to Google and others, and 3) at the same time it floods my shared server with so much useless load that my host ends up killing my legitimate scripts prematurely to keep the total impact within my quota.

And, as I mentioned, for every hundred or so spiderings from the Bing IP I still allow (I rotate them randomly) I may get one or two hits, while for every seven or eight spiderings from unregulated DuckDuckGo I get the same number, both monstrously dwarfed, of course, by Google.

So I kill Bing with fire. Mostly.

Did you have any thoughts about my access header logging?
3:47 am on Oct 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


bingbot does not obey your every crawl rate whim.
Sorry, I don't understand your meaning.

Each SE indexes differently. Their algos use different metrics. It's difficult to appease both Google & Bing to get good results from both, but it can be done. Just takes work.




3:49 am on Oct 21, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Oct 4, 2018
posts: 44
votes: 2


Lucy24, I may have been making a foolish mistake by not realizing that my new headers log differs materially from my host's access and server logs. When I access the latter, a snapshot copy of the original is produced at the time of opening, which I peruse while the original keeps on logging. It may very well be - I'll have to study this further before piping up again - that blindly opening the only copy of my headers log, instead of copying it elsewhere and reading that, disrupts any attempt to write to it, thus leaving these otherwise inexplicable lacunae in the logging.
3:54 am on Oct 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


Did you have any thoughts about my access header logging?
It may be sufficient on its own for smaller sites; however, I feel blocking by header should be only one part of a comprehensive security approach for medium to larger sites.

More Blocking Methods [webmasterworld.com]
4:10 am on Oct 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15444
votes: 739


It may very well be - I'll have to study this further before piping up again - that blindly opening the only copy of my headers log, instead of copying it elsewhere and reading that, disrupts any attempt to write to it, thus leaving these otherwise inexplicable lacunae in the logging.
Oh, cripes. I don't think it would ever occur to me to open an in-progress log file. Well, in the case of access and error logs, I'm pretty sure I couldn't if I wanted to; in all cases I download them to my HD and take it from there. Come to think of it, Fetch interprets a double-click as Download, so I could never open a file by accident. (Plenty of accidental downloads, though, and then I have to remember where the computer puts them by default.)

But in general, the simple act of opening a document shouldn't affect what's being done to it by other processes. At least not with this millennium's computers. I often have the same file open in two different text editors, for example. Each one will yap at me if I've made changes in the other, but that's as far as it goes.

If I remember, I'll experiment with this on my test site, which logs headers just like everyone else, but no loss if things melt down.
4:54 am on Oct 21, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Oct 4, 2018
posts: 44
votes: 2


Don't bother, Lucy24. Copying the headers log file off-server and reading it there doesn't seem to change anything. In fact, since the headers log has rolled over to a new day, my test 403s are no longer logging, whether I try to access my test offlimits.txt or a standard target like xmlrpc.php - although the new day's log continues to faithfully log Google in real time, the 10/21 headers log in sync with the 10/20 access.log.

Something else is causing these log entries to not be made.
2:00 pm on Oct 21, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Oct 4, 2018
posts: 44
votes: 2


Well, logging has resumed, including my own test 403s at least. The best I can determine at this point is that the currently active headers log somehow doesn't like to be touched in any fashion, whether opened directly or just copied off-server, so next-day historical review seems the best course for now. I'm still not sure why only my own test 403s from my own cache-cleared browser seem to be getting logged and not others'.
3:11 pm on Oct 21, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts:519
votes: 46


The most apparent thing I've noticed so far logging headers is how few visitors actually get logged: successful 200s and, now, 403s (at least my own). By far the most regular entry in my headers log is the big G ... Perhaps this follows logically from G spidering actual page after actual page, while others may be hitting my site in ways that do not trigger the header.php includes.

It shows you just how few humans there are on your site, and just how many bots are scraping you. If a bot only accesses an image, you won't get a header logging. Any WP page should use header.php, which renders the header of your theme. There should be no WP page that does not use this.

On the plus side, if you kill off enough rogue bots (and this does not include Mr. G or Mr. B), your proportion of humans will go up. I'd rather concentrate on writing good content to attract more humans.

You would need to try really hard to mess up your log just by reading it. A read will download a copy to your local computer. Unless you edit and save back to your log there will be no data collision. I always download my logs to local and read them. If you delete a log the script will simply recreate it and continue on its merry way.
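For what it's worth, the hook into the theme is usually nothing more than a require near the top of the child theme's header.php, roughly like this (the include and function names are stand-ins, not anyone's actual code):

<?php
// First lines of a child theme's header.php, before any output is sent.
// Pulls in the hypothetical logheaders.php and records the request, so every
// themed page (and any custom error page reusing the same include) writes an
// entry to headers.log.
require_once get_stylesheet_directory() . '/logheaders.php';
logheaders();
// ...the theme's usual <!DOCTYPE html> and header markup follows below...
?>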
4:31 pm on Oct 21, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Oct 4, 2018
posts: 44
votes: 2


That's reassuring.

TorontoBoy, my nascent custom 403 page is listed in my .htaccess file

ErrorDocument 403 /forbidden.php


and contains simply
<h1>message</h1>
and the identical includes code quoted earlier in the thread, from my theme/child theme header.php file.
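Spelled out, the whole file is just along these lines (paraphrased, with a stand-in name for my includes code):

<?php
// forbidden.php - custom 403 ErrorDocument that also logs request headers.
// The include name is a stand-in; the real code is the same includes block
// used in the theme's header.php.
require_once __DIR__ . '/logheaders.php';
logheaders();                 // record the blocked request's headers
http_response_code(403);      // keep the 403 status explicit (Apache normally preserves it anyway)
?>
<h1>message</h1>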

Would you have any guesses as to why I can trigger header logging all day long by using my own browser to throw a 403, but other transgressors hitting the same resource (e.g. xmlrpc.php) so far don't get header logged themselves?

Incidentally, my theme/child theme's existing custom 404 page already gets header logged without any additional help from me.
5:22 pm on Oct 21, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts:519
votes: 46


I'm unsure, but it seems to work for me. I also don't have a /blog/

210.112.224.0 - 210.112.255.255
netname: ELIMNET
descr: ELIMNET, INC.
country: KR

210.112.232.* [10/Oct/2018:17:15:51 POST /blog/xmlrpc.php HTTP/1.1 403 638-Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; fr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8

2018-10-10:17:15:51
URL: /blog/xmlrpc.php
IP: 210.112.232.*
Content-Length: 217
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Cookie: wordpress_test_cookie=WP+Cookie+check
Host: example.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; fr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8



[edited by: not2easy at 8:35 pm (utc) on Oct 21, 2018]
[edit reason] anonymized IP with * [/edit]

5:52 pm on Oct 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15444
votes: 739


It shows you just how few humans there are on your site, and just how many bots are scraping you. If a bot only accesses an image, you won't get a header logging.
But the vast majority of robots request only pages, and every one of those--including the custom 403 page sent out on every 403, regardless of requested filetype--includes logheaders code. Currently, on my site the only requests whose headers don't get logged are the ones intercepted by mod_security on the server level. (I know they exist, because they're listed in access logs and error logs, but they never reach my userspace.)

It occurred to me that overlapping logs might come through differently if the logheaders function began and ended with an output buffer (ob_blahblah) instead of writing line-by-line to headers.log. So that's another thing for the test site. Might not make any difference, though.
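Something like this, building the whole entry first and appending it with one locked write (a sketch with assumed names, not tested code):

<?php
// logheaders.php - variant that assembles the whole entry as one string and
// appends it in a single locked write, so overlapping requests produce whole
// entries instead of interleaved lines.
function logheaders() {
    $entry  = date('Y-m-d:H:i:s') . "\n";
    $entry .= 'URL: ' . $_SERVER['REQUEST_URI'] . "\n";
    $entry .= 'IP: ' . $_SERVER['REMOTE_ADDR'] . "\n";
    foreach (getallheaders() as $name => $value) {
        $entry .= $name . ': ' . $value . "\n";
    }
    $entry .= "\n";
    // FILE_APPEND creates headers.log if it is missing; LOCK_EX serializes writers.
    file_put_contents(__DIR__ . '/headers.log', $entry, FILE_APPEND | LOCK_EX);
}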
8:05 pm on Oct 21, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts:519
votes: 46


But the vast majority of robots request only pages

Not for me. With WordPress there are a lot of RSS feeds, none of which runs header.php. I could fix that. The RSS feed people are quite well behaved, after I threw out the troublemakers that were accessing other resources.

Anyone, bot or human, that requests a WP image or download file directly does not render a full page and therefore is not header logged. All the search engines do this, so this is significant. For me that is a lot of bot activity.

That said I am still happy with what I can get from request headers. I'm glad I made the effort and always recommend that others do the same.
8:11 pm on Oct 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


there are a lot of RSS feeds... RSS feed people are quite well behaved
Might want to take a closer look. RSS feeds are one of the ways your content can be hijacked and displayed on a remote site without leaving a trace.