Correct sequence of htaccess contents

cannot stop curl from accessing pages

         

dstiles

10:19 am on Sep 11, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Couple of decades of internet experience but fairly new to apache (on linux).

The scenario is: squirrel webmail running on apache on linux server. That's the only web site on that server. Squirrel has a limited clientele, all known, local and "controllable".

I can find no real information about the sequence of declaring various options in htaccess in order to block undesirables. It seems to mostly work as I have it arranged but there are still a few incursions from curl and, despite geo-blocking, the occasional access from unwanted countries. I can accept that some of the latter may be due to late responses from the dns resolver (Unbound) but it would be nice not to have to block IP ranges for their use of curl.

My htaccess file is below. I have reduced multiple lines to avoid clutter in here. I would appreciate guidance on whether or not I have declared things in a reasonable sequence and why curl keeps getting through.

Typical log entry for curl...
194.59.251.217 - - [11/Sep/2018:01:38:37 +0100] "GET / HTTP/1.1" 302 4900 "-" "curl/7.43.0"
194.59.251.217 - - [11/Sep/2018:01:38:37 +0100] "GET /src/login.php HTTP/1.1" 200 4240 "-" "curl/7.43.0"

htaccess...

#========
# geo-block
GeoIPEnable On
SetEnvIf GEOIP_COUNTRY_CODE CN BlockCountry
SetEnvIf GEOIP_COUNTRY_CODE US BlockCountry
SetEnvIf GEOIP_COUNTRY_CODE VN BlockCountry
Deny from env=BlockCountry

#========
# periodic server alive test...
SetEnvIfNoCase User-Agent "urlwatch" dontlog

#========
order allow,deny

#========
# several IP denials...
deny from 185.35.62.0/23
deny from 180.96.0.0/11

#========
# reject hotlinking
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https://mail\.example\.net/.*$ [NC]
RewriteRule \.(gif|jpg|css)$ - [F]

#========
# only allow get/post/http/1.1
RewriteEngine On
RewriteCond %{THE_REQUEST} !^(POST|GET)\ /.*\ HTTP/1\.1$
#RewriteCond %{REQUEST_METHOD} !^(HEAD|OPTIONS|PUT)
RewriteRule .* - [F]

# only allow mozilla, urlwatch and only https
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_USER_AGENT} ^urlwatch\ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(agent|analy[sz]|anonymous|archive|bandit|bot|brand|cherrypicker|clshttp|collector|compatible;[a-z]|craftbot|crawl|curl|deepnet|discover|download|explorer|extract|file|grab|greasemonkey|harvest|indy\slibrary|java|larbin|le[ae]ch|legs|link|lynx|mail|miner|netcraft|ninja|n[-_\s]?u[-_\s]?t[-_\s]?c[-_\s]?h|open|perl|php|proxy|python|ripper|script|search|seo|shodan|sitemap|snoop|sph?ider|stripper|sucker|survey|sweep|torrent|webpictures|webspider|worm).*$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^ht[tm][lpr] [NC]
RewriteRule . - [F,L]
#========
allow from all
#========

#========
# security policies
Header set Strict-Transport-Security "max-age=15552001; includeSubDomains; preload"
Header set X-Frame-Options "SAMEORIGIN"
Header set X-Xss-Protection "1; mode=block"
Header set X-Content-Type-Options "nosniff"
Header set X-Permitted-Cross-Domain-Policies "none"
Header set Referrer-Policy "strict-origin-when-cross-origin"
Header set Content-Security-Policy "default-src 'none'; style-src 'self'; child-src 'self'; frame-ancestors 'self'; base-uri 'self'; script-src 'self'; form-action 'self'; img-src 'self'; object-src 'none'; block-all-mixed-content;"
Header set Expect-CT "enforce,max-age=30"
#========


[edited by: not2easy at 6:18 pm (utc) on Sep 11, 2018]
[edit reason] exemplified mail server [/edit]

not2easy

2:37 pm on Sep 11, 2018 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Sorry, I'm not the answer person for everything on the list and I'm a little short on time right now, but the curl question looks like a matter of syntax. As far as the UA blocks go, I have not seen them written with the syntax above: the list normally just begins with ( and ends with ), rather than beginning with ^.*( and ending with ).*$

If you run a search using RewriteCond %{HTTP_USER_AGENT} you'll find many examples of it to compare.

HTH

wilderness

3:32 pm on Sep 11, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



dstiles,
lucy provided the order of Rules so many times previously that she had a working bookmark. I'm sure it still exists.

The ^.* is some form of 'begins with', and I'd suggest removing it. In the UA examples you've provided, what you're looking for is 'contains' (absent any leading characters).
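
A minimal sketch of the unanchored 'contains' form, assuming a shortened token list picked from the original post purely for illustration. An unanchored regex already matches anywhere in the string, so the leading ^.* and trailing .*$ add nothing:

```apache
RewriteEngine on

# Forbid any request whose User-Agent contains one of these tokens,
# case-insensitively. The token list is an illustrative subset only.
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|curl|python|spider) [NC]
RewriteRule ^ - [F]
```

A broad token such as "bot" or "mail" can also catch user agents you want, so it is worth testing against your own logs before relying on a list like this.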

TorontoBoy

6:28 pm on Sep 11, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



#========
# periodic server alive test...
SetEnvIfNoCase User-Agent "urlwatch" dontlog

I'm not the expert, but you set the environment variable dontlog for the UA "urlwatch", but it seems you don't do anything with it?

# only allow mozilla, urlwatch and only https
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_USER_AGENT} ^urlwatch\ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(stuff..more stuff).*$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^ht[tm][lpr] [NC]
RewriteRule . - [F,L]


Don't you need an [OR] on the ^.*(stuff..more stuff).*$ line?

Pseudo coding:
- if not https, or urlwatch, or Mozilla 5, or (bunch of stuff), or ht(options), then forbid. This is not exactly what is written in the comment.

RewriteCond %{HTTP_REFERER} !^https://mail\.example\.net/.*$ [NC]
You might want to make this anonymous

lucy24

8:55 pm on Sep 11, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Short version: Each module is an island. It will see only its own directives--yes, even if they are mixed line-by-line with other mods. (This is why I have a test site.) So for overall categories of rules-- Allow/Deny, SetEnvIf, RewriteRule and so on-- put them in the order that is most convenient for you to read, since the server doesn't care. I like to put core directives-- ErrorDocument, Header Set and so on-- at the beginning, simply because there are fewer of them, so they're easy to find. The same goes for <Files(Match)> envelopes, since they will typically only contain one or two lines.
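
One possible layout along those lines, as a sketch only. It reuses directives from the original post, groups them by module for readability, and assumes the Apache 2.2-style Order/Allow/Deny directives the post already uses (on Apache 2.4 these need mod_access_compat):

```apache
# Header and other one-off directives first: few of them, easy to find
Header set X-Content-Type-Options "nosniff"

# mod_setenvif: these lines only set flags, they block nothing by themselves
SetEnvIfNoCase User-Agent "urlwatch" dontlog

# Allow/Deny lines kept together
Order allow,deny
Allow from all
Deny from 185.35.62.0/23
Deny from 180.96.0.0/11

# mod_rewrite: all RewriteCond/RewriteRule lines kept together
RewriteEngine on
RewriteCond %{THE_REQUEST} !^(POST|GET)\ /.*\ HTTP/1\.1$
RewriteRule ^ - [F]
```

The grouping is for the human reader; within a single module the relative order of its own directives still matters.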

Another thing to note is that htaccess is basically concerned with http(s) requests. In general, it has no effect on requests made in other forms, like mail or ftp. That's what makes it possible for you to fetch, edit and upload the htaccess file itself, even though it can't be requested and viewed in the usual way.

only allow mozilla
At this point, the vast majority of robots have “Mozilla” at the beginning of their UA string, so a check for this element is no longer as useful as it was 10 years ago.

lucy provided the order of Rules so many times previously that she had a working bookmark
Hiya, Don, long time no see. I don't know about bookmarks, but I know I've got a few slabs of boilerplate saved. Here's a bit about ordering of RewriteRules, extracted from the middle of a longer document about all kinds of htaccess cleanup:
At the beginning is the single line
RewriteEngine on

A RewriteBase is almost never needed; get rid of any lines that mention it. Instead, make sure every target begins with either protocol-plus-domain or a slash / for the root.

Sort RewriteRules twice.

First group them by severity. Access-control rules (flag [F]) go first. Then any 410s (flag [G]). Not all sites will have these. Then external redirects (flag [R=301,L] unless there is a specific reason to say something different). Then simple rewrite (flag [L] alone). Finally, there may be a few rules without [L] flag, such as cookies or environmental variables.

Function overrides flag. If your redirects are so complicated that they've been exiled to a separate .php file, the RewriteRule will have only an [L] flag. But group it with the external redirects. If certain users are forcibly redirected to an "I don't like your face" page, the RewriteRule will have an R flag. But group it with the access-control [F] rules.

Then, within each functional group, list rules from most specific to most general. In most htaccess files, the second-to-last external redirect will take care of "index.html" requests. The very last one will fix the domain name, such as with/without www.

Leave a blank line after each RewriteRule, and put a # comment before each ruleset (Rule plus any preceding Conditions). A group of closely related rulesets can share an explanation.
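
The grouping described above might look like the following skeleton. This is a sketch under assumptions: www.example.com, old-page.html and article.php are hypothetical names, and a real site will not necessarily need every group.

```apache
RewriteEngine on

# 1. Access control first: forbid a hypothetical unwanted user agent
RewriteCond %{HTTP_USER_AGENT} curl [NC]
RewriteRule ^ - [F]

# 2. Then 410s: a hypothetical page that is gone for good
RewriteRule ^old-page\.html$ - [G]

# 3. Then external redirects, most specific first: strip index.html
RewriteCond %{THE_REQUEST} \s/+(.*/)?index\.html[\s?]
RewriteRule ^(.*/)?index\.html$ https://www.example.com/$1 [R=301,L]

# ... and the most general redirect last: canonical scheme and host name
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

# 4. Then simple internal rewrites
RewriteRule ^article/([0-9]+)$ /article.php?id=$1 [L]
```

Every target begins with either protocol-plus-domain or a slash, so no RewriteBase is needed.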

Now then. What was the question again?

:: scurrying off to edit boilerplate because domain-name-canonicalization now often includes https as well ::

TorontoBoy

9:22 pm on Sep 11, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



SetEnvIf User-Agent "curl\/" keep_out

This worked for me to ban all curls, until I found out that Drupal.org uses curl in their bot, which I need.
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_USER_AGENT} ^urlwatch\ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(stuff..stuff).*$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^ht[tm][lpr] [NC]
RewriteRule . - [F,L]

IF (Cond 1 OR Cond 2 OR Cond 3 OR Cond 4) AND Cond 5 then forbid. Because the 4th condition has no [OR], Apache joins it to the 5th with its implicit AND, and a run of [OR] conditions is evaluated as one group before that AND is applied. Your curl matches condition 4 but not condition 5, so the ruleset as a whole is not satisfied. Therefore your curl is allowed through to annoy you.
'ornext|OR' (or next condition)
Use this to combine rule conditions with a local OR instead of the implicit AND. Typical example:
[httpd.apache.org...]
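
A hedged sketch of a ruleset that does what the comment "only allow mozilla, urlwatch and only https" says. As I read it, simply adding the missing [OR] would also forbid urlwatch and every Mozilla/5.0 browser, because those two conditions are not negated. Splitting the tests into separate rulesets avoids mixing AND with OR at all. The blocked-token list here is an illustrative subset, not the original list:

```apache
RewriteEngine on

# Forbid anything that did not arrive over https
RewriteCond %{HTTPS} off
RewriteRule ^ - [F]

# Forbid any UA that does not start with an allowed prefix
# (a curl/7.43.0 user agent fails this test)
RewriteCond %{HTTP_USER_AGENT} !^(urlwatch|Mozilla/5\.0)
RewriteRule ^ - [F]

# Forbid UAs that start with an allowed prefix but contain a blocked token
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|curl|python|spider) [NC]
RewriteRule ^ - [F]
```

A request is forbidden as soon as any one ruleset matches, which gives the "A or B or C" behaviour without a single [OR] flag.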

lucy24

11:05 pm on Sep 11, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Further observation: the [OR] operator is most useful when you're assessing different aspects of the request--for example when a domain-name-canonicalization redirect looks separately at both {HTTPS} and {HTTP_HOST}.

When you're looking at the same aspect, like IP or User-Agent, it is often easier if you put each set in a single pipe-delimited line. Obviously there are limits; you don't want to put seventy-five different User-Agents on the same line. But it does make it easier when there are just a few items involved, like
RewriteCond %{HTTP_USER_AGENT} ^(urlwatch\.|Mozilla/5\.0)
Mixing [AND] (implicit) and [OR] in the same ruleset can lead to grief. Not because the server gives a hoot, but because you then need to pay extra-close attention to what goes in what order so you don't say “(A and B) or C” when you meant to say “A and (B or C)”

SetEnvIf User-Agent "curl/" keep_out
This worked for me to ban all curls, until I found out that Drupal.org uses this in their bot, which I need
Option B, in that case, would be to switch off the variable (note incidentally that you don't need to escape / slashes in mod_setenvif):
BrowserMatch Drupal !keep_out
or
SetEnvIf Remote_Addr ^1\.2\.3\.4 !keep_out
(I don't know exactly what's involved in legitimate Drupal requests, but you get the idea.) Remember also that
BrowserMatch
means exactly the same thing as
SetEnvIf User-Agent
at a savings of seven bytes. That's what the shorthand is for.
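
Put together, the set-then-unset approach might look like this. A sketch only: keep_out is the variable name used above, 192.0.2.10 is a placeholder address standing in for a trusted source, and the Order/Allow/Deny lines assume Apache 2.2-style access control (mod_access_compat on 2.4):

```apache
# Flag every curl user agent
BrowserMatch curl/ keep_out

# Un-flag one trusted source; later directives can switch the variable off
SetEnvIf Remote_Addr ^192\.0\.2\.10$ !keep_out

# Act on the flag: with "allow,deny", a matching Deny wins over Allow
Order allow,deny
Allow from all
Deny from env=keep_out
```

The unset line has to come after the line that sets the variable, since mod_setenvif processes its directives in the order they appear.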

dstiles

6:36 pm on Sep 12, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Many thanks to all for your input! Got a problem with my desktop's sound at present but I'll work through the htaccess file as soon as possible. But generally...

not2easy, wilderness, I thought the wildcards were necessary but your comments show they are not.

TorontoBoy - dontlog works ok but I'll look closer at the correct syntax. You are correct about the [OR] - missed that entirely. :(

Lucy - so many points to look through! Thank you. Hopefully I can get to it in a day or two.

I may be back! :)

dstiles

9:54 pm on Sep 15, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Finally got to playing with this again. I've modified htaccess in alignment with the advice you've all given so many thanks again.

TorontoBoy:
> SetEnvIfNoCase User-Agent "urlwatch" dontlog
I have a note in my htaccess file that CustomLog is not allowed in htaccess (originally set up around April last). I have just reinstated it but it throws a server error.
CustomLog /var/log/apache2/access.log combined env=!dontlog

Otherwise, I'll now sit back and see if I get any more curl UAs - or any other nasties.

Samizdata

10:38 pm on Sep 15, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'll now sit back and see if I get any more curl UAs

You could change the user-agent in your browser to test.

...

lucy24

1:14 am on Sep 16, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a note in my htaccess file that customlog is not allowed in htaccess
Yup, everything to do with logging can only be said in the config file--either lying loose for the whole server, or in a vhost envelope. (Apache docs always list these two categories separately, even though things that can be done in config, but can't be done in vhost, can pretty well be counted on the fingers of one hand.) The same applies to LogLevel directives, which apply specifically to the Error Log*, and to RewriteLog, which is What It Says On The Box.

The server will obviously not care if you set an environmental variable called "dontlog" in htaccess--you just can't do anything with it. (Except, of course, sneakily use it for some entirely different, non-logging-related purpose.)


* No loss, it turns out. No matter how high you crank up the logging level, the Error Log will never tell you more than “request denied by server configuration”.
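
For reference, a sketch of where the logging half would live. It assumes a Debian-style layout (the log path is the one quoted earlier in the thread) and that the "combined" log format is already defined by a LogFormat line in the main config:

```apache
# Server config or a <VirtualHost> block, not .htaccess
<VirtualHost *:443>
    # Hypothetical host name, matching the example domain used in the thread
    ServerName mail.example.net

    # Flag the monitoring requests (this line may equally live in .htaccess)
    SetEnvIfNoCase User-Agent "urlwatch" dontlog

    # Log every request except the flagged ones
    CustomLog /var/log/apache2/access.log combined env=!dontlog
</VirtualHost>
```

Only the CustomLog line is restricted to the config file; the variable it tests can be set anywhere a SetEnvIf directive is allowed.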

dstiles

1:23 pm on Sep 16, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Happily the dontlog does just that in htaccess - does not log that UA. Although urlwatch does actually work on it.

samizdata - I could just run curl from here. :) No time at present, though.

TorontoBoy

2:36 pm on Sep 16, 2018 (gmt 0)

5+ Year Member Top Contributors Of The Month



Tell us if your curls get blocked, but monitor them, because some service you want might use them.

lucy24

5:37 pm on Sep 16, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you also control the server? Or is the specific environmental variable “dontlog” declared by your host, so you can set it in htaccess and it will be recognized elsewhere? (The analogy that comes to mind is the RewriteMap, which can be invoked in htaccess although it can only be defined in config.)

dstiles

9:43 am on Sep 17, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



TorontoBoy
> some service you want might use them.

No, and if so I can easily modify it. As I said, it's just for squirrelmail. So far, no more curls.

Lucy
Yes, I control the server. In apache2.conf I have the CustomLog line as noted above but it's commented out. Still, no periodic urlwatch entries, which is what I want. Do undeclared targets get lost anyway?

lucy24

4:04 pm on Sep 17, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do undeclared targets get lost anyway?
Do you mean, does the server recognize the “dontlog” environmental variable even if you haven't told it what it means? I shouldn't think so. Do you have some independent way of knowing that the requests are in fact coming in, just not getting logged? (You don't have a firewall, do you? If requests are blocked before they even reach the server, then logging preferences wouldn't apply.)

dstiles

1:29 pm on Sep 18, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



urlwatch runs three times an hour on this and other servers (linux and ASP). I get emails if any of the targets fails - for example, I've had three for this specific server in the past week, all accountable. I originally set up the dontlog because there were more entries for these in the log than anything else (squirrel only gets used occasionally).

OK, I've just run a more extensive search. CustomLog/dontlog is defined in sites-available and conf-available which, of course, are (in this case) sym-linked in the relevant -enabled folders. Oops! Sorry, I did say I wasn't too good at apache. :( Thanks for the help/advice, Lucy.