
Forum Moderators: Ocean10000 & phranque


Correct sequence of htaccess contents

cannot stop curl from accessing pages

10:19 am on Sep 11, 2018 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles

joined:May 14, 2008
posts:3286
votes: 19


Couple of decades of internet experience but fairly new to apache (on linux).

The scenario is: squirrel webmail running on apache on linux server. That's the only web site on that server. Squirrel has a limited clientele, all known, local and "controllable".

I can find no real information about the sequence of declaring various options in htaccess in order to block undesirables. It seems to mostly work as I have it arranged but there are still a few incursions from curl and, despite geo-blocking, the occasional access from unwanted countries. I can accept that some of the latter may be due to late responses from the dns resolver (Unbound) but it would be nice not to have to block IP ranges for their use of curl.

My htaccess file is below. I have reduced multiple lines to avoid clutter in here. I would appreciate guidance on whether or not I have declared things in a reasonable sequence and why curl keeps getting through.

Typical log entry for curl...
194.59.251.217 - - [11/Sep/2018:01:38:37 +0100] "GET / HTTP/1.1" 302 4900 "-" "curl/7.43.0"
194.59.251.217 - - [11/Sep/2018:01:38:37 +0100] "GET /src/login.php HTTP/1.1" 200 4240 "-" "curl/7.43.0"

htaccess...

#========
# geo-block
GeoIPEnable On
SetEnvIf GEOIP_COUNTRY_CODE CN BlockCountry
SetEnvIf GEOIP_COUNTRY_CODE US BlockCountry
SetEnvIf GEOIP_COUNTRY_CODE VN BlockCountry
Deny from env=BlockCountry

#========
# periodic server alive test...
SetEnvIfNoCase User-Agent "urlwatch" dontlog

#========
order allow,deny

#========
# several IP denials...
deny from 185.35.62.0/23
deny from 180.96.0.0/11

#========
# reject hotlinking
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https://mail\.example\.net/.*$ [NC]
RewriteRule \.(gif|jpg|css)$ - [F]

#========
# only allow get/post/http/1.1
RewriteEngine On
RewriteCond %{THE_REQUEST} !^(POST|GET)\ /.*\ HTTP/1\.1$
#RewriteCond %{REQUEST_METHOD} !^(HEAD|OPTIONS|PUT)
RewriteRule .* - [F]

# only allow mozilla, urlwatch and only https
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_USER_AGENT} ^urlwatch\ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(agent|analy[sz]|anonymous|archive|bandit|bot|brand|cherrypicker|clshttp|collector|compatible;[a-z]|craftbot|crawl|curl|deepnet|discover|download|explorer|extract|file|grab|greasemonkey|harvest|indy\slibrary|java|larbin|le[ae]ch|legs|link|lynx|mail|miner|netcraft|ninja|n[-_\s]?u[-_\s]?t[-_\s]?c[-_\s]?h|open|perl|php|proxy|python|ripper|script|search|seo|shodan|sitemap|snoop|sph?ider|stripper|sucker|survey|sweep|torrent|webpictures|webspider|worm).*$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^ht[tm][lpr] [NC]
RewriteRule . - [F,L]
#========
allow from all
#========

#========
# security policies
Header set Strict-Transport-Security "max-age=15552001; includeSubDomains; preload"
Header set X-Frame-Options "SAMEORIGIN"
Header set X-Xss-Protection "1; mode=block"
Header set X-Content-Type-Options "nosniff"
Header set X-Permitted-Cross-Domain-Policies "none"
Header set Referrer-Policy "strict-origin-when-cross-origin"
Header set Content-Security-Policy "default-src 'none'; style-src 'self'; child-src 'self'; frame-ancestors 'self'; base-uri 'self'; script-src 'self'; form-action 'self'; img-src 'self'; object-src 'none'; block-all-mixed-content;"
Header set Expect-CT "enforce,max-age=30"
#========


[edited by: not2easy at 6:18 pm (utc) on Sep 11, 2018]
[edit reason] exemplified mail server [/edit]

2:37 pm on Sept 11, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy

joined:Dec 27, 2006
posts:4559
votes: 363


Sorry, I'm not the answer person for all of the list, and I'm a little short on time right now, but the curl question looks like a matter of syntax. As for the UA blocks, I have not seen them written with the syntax above - the ^.*( should simply begin with ( and end with ) rather than ).*$

If you run a search using RewriteCond %{HTTP_USER_AGENT} you'll find many examples of it to compare.

HTH
3:32 pm on Sept 11, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness

joined:Nov 11, 2001
posts:5507
votes: 5


dstiles,
lucy provided the order of Rules so many times previously that she had a working bookmark. I'm sure it still exists.

^.* is some form of 'begins with' and I'd suggest removing it. In the UA examples you've provided, what you're looking for is 'contains' (absent any leading characters).
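To illustrate the point, here is a minimal sketch of the 'contains' form: an unanchored pattern already matches anywhere in the User-Agent string, so the ^.*( ... ).*$ wrappers are unnecessary. (The UA list here is a stand-in, not the one from the thread.)

```apache
# Hypothetical sketch: 'contains' match, no anchors or .* wrappers needed
RewriteCond %{HTTP_USER_AGENT} (curl|wget|libwww) [NC]
RewriteRule . - [F]
```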
6:28 pm on Sept 11, 2018 (gmt 0)

Preferred Member from CA 

TorontoBoy

joined:Feb 7, 2017
posts:579
votes: 60


#========
# periodic server alive test...
SetEnvIfNoCase User-Agent "urlwatch" dontlog

I'm not the expert, but you set the environment variable dontlog for the UA "urlwatch", and it seems you don't do anything with it?

# only allow mozilla, urlwatch and only https
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_USER_AGENT} ^urlwatch\ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(stuff..more stuff).*$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^ht[tm][lpr] [NC]
RewriteRule . - [F,L]


Don't you need an [OR] on the ^.*(stuff..more stuff).*$ line?

Pseudo coding:
- if not https, or urlwatch, or Mozilla 5, or (bunch of stuff), or ht(options), then forbid. This is not exactly what is written in the comment.

RewriteCond %{HTTP_REFERER} !^https://mail\.example\.net/.*$ [NC]
You might want to make this anonymous
8:55 pm on Sept 11, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24

joined:Apr 9, 2011
posts:15936
votes: 889


Short version: Each module is an island. It will see only its own directives--yes, even if they are mixed line-by-line with other mods. (This is why I have a test site.) So for overall categories of rules-- Allow/Deny, SetEnvIf, RewriteRule and so on-- put them in the order that is most convenient for you to read, since the server doesn't care. I like to put core directives-- ErrorDocument, Header Set and so on-- at the beginning, simply because there are fewer of them, so they're easy to find. The same goes for <Files(Match)> envelopes, since they will typically only contain one or two lines.

Another thing to note is that htaccess is basically concerned with http(s) requests. In general, it has no effect on requests made in other forms, like mail or ftp. That's what makes it possible for you to fetch, edit and upload the htaccess file itself, even though it can't be requested and viewed in the usual way.

only allow mozilla
At this point, the vast majority of robots have “Mozilla” at the beginning of their UA string, so a check for this element is no longer as useful as it was 10 years ago.

lucy provided the order of Rules so many times previously that she had a working bookmark
Hiya, Don, long time no see. I don't know about bookmarks, but I know I've got a few slabs of boilerplate saved. Here's a bit about ordering of RewriteRules, extracted from the middle of a longer document about all kinds of htaccess cleanup:
At the beginning is the single line
RewriteEngine on

A RewriteBase is almost never needed; get rid of any lines that mention it. Instead, make sure every target begins with either protocol-plus-domain or a slash / for the root.

Sort RewriteRules twice.

First group them by severity. Access-control rules (flag [F]) go first. Then any 410s (flag [G]). Not all sites will have these. Then external redirects (flag [R=301,L] unless there is a specific reason to say something different). Then simple rewrite (flag [L] alone). Finally, there may be a few rules without [L] flag, such as cookies or environmental variables.

Function overrides flag. If your redirects are so complicated that they've been exiled to a separate .php file, the RewriteRule will have only an [L] flag. But group it with the external redirects. If certain users are forcibly redirected to an "I don't like your face" page, the RewriteRule will have an R flag. But group it with the access-control [F] rules.

Then, within each functional group, list rules from most specific to most general. In most htaccess files, the second-to-last external redirect will take care of "index.html" requests. The very last one will fix the domain name, such as with/without www.

Leave a blank line after each RewriteRule, and put a
# comment

before each ruleset (Rule plus any preceding Conditions). A group of closely related rulesets can share an explanation.
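The ordering described above might look like this as a bare skeleton (the domains, paths and patterns are placeholders, not taken from the thread):

```apache
RewriteEngine on

# access control first [F]
RewriteCond %{HTTP_USER_AGENT} badbot [NC]
RewriteRule . - [F]

# gone pages [G]
RewriteRule ^old-section/ - [G]

# external redirects [R=301,L], most specific first
RewriteRule ^index\.html$ https://www.example.com/ [R=301,L]

# domain-name canonicalization, very last of the redirects
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) https://www.example.com/$1 [R=301,L]

# internal rewrites [L] alone, last
RewriteRule ^page/(\w+)$ /page.php?name=$1 [L]
```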

Now then. What was the question again?

:: scurrying off to edit boilerplate because domain-name-canonicalization now often includes https as well ::
9:22 pm on Sept 11, 2018 (gmt 0)

Preferred Member from CA 

TorontoBoy

joined:Feb 7, 2017
posts:579
votes: 60


SetEnvIf User-Agent "curl\/" keep_out

This worked for me to ban all curls, until I found out that Drupal.org uses this in their bot, which I need
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_USER_AGENT} ^urlwatch\ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0 [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*(stuff..stuff).*$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^ht[tm][lpr] [NC]
RewriteRule . - [F,L]

IF Condition 1 OR Condition 2 OR Condition 3 OR (Condition 4 implicitly AND Condition 5) then forbid. The missing [OR] on the 4th condition means Apache applies its implicit AND. Your curl matches Condition 4, but combined with Condition 5 the pair does not match, so your curl is allowed through to annoy you.
'ornext|OR' (or next condition)
Use this to combine rule conditions with a local OR instead of the implicit AND. Typical example:
[httpd.apache.org...]
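If the intent really is "forbid when any one of these matches", every condition except the last needs an explicit [OR]. A sketch of the mechanics only (whether these particular conditions express the original allow-list intent is the separate question raised above):

```apache
# each condition except the last carries [OR]; the last gets the
# default implicit AND, which here simply ends the OR chain
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_USER_AGENT} (curl|wget) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ht[tm][lpr] [NC]
RewriteRule . - [F]
```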
11:05 pm on Sept 11, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24

joined:Apr 9, 2011
posts:15936
votes: 889


Further observation: the [OR] operator is most useful when you're assessing different aspects of the request--for example when a domain-name-canonicalization redirect looks separately at both {HTTPS} and {HTTP_HOST}.

When you're looking at the same aspect, like IP or User-Agent, it is often easier if you put each set in a single pipe-delimited line. Obviously there are limits; you don't want to put seventy-five different User-Agents on the same line. But it does make it easier when there are just a few items involved, like
RewriteCond %{HTTP_USER_AGENT} ^(urlwatch\.|Mozilla/5\.0)
Mixing [AND] (implicit) and [OR] in the same ruleset can lead to grief. Not because the server gives a hoot, but because you then need to pay extra-close attention to what goes in what order so you don't say “(A and B) or C” when you meant to say “A and (B or C)”

SetEnvIf User-Agent "curl/" keep_out
This worked for me to ban all curls, until I found out that Drupal.org uses this in their bot, which I need
Option B, in that case, would be to switch off the variable (note incidentally that you don't need to escape / slashes in mod_setenvif):
BrowserMatch Drupal !keep_out
or
SetEnvIf Remote_Addr ^1\.2\.3\.4 !keep_out
(I don't know exactly what's involved in legitimate Drupal requests, but you get the idea.) Remember also that
BrowserMatch
means exactly the same thing as
SetEnvIf User-Agent
at a savings of seven bytes. That's what the shorthand is for.
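Put together with the Order/Deny style used in the htaccess earlier in the thread, the blacklist-with-exception pattern might look like this (the Drupal match pattern is a guess, as noted above):

```apache
# flag curl by default...
BrowserMatch curl keep_out
# ...but clear the flag for the bot we actually want (pattern is a guess)
BrowserMatch Drupal !keep_out

# Apache 2.2-style access control: with Order allow,deny, a matching
# Deny wins over Allow from all
Order allow,deny
Allow from all
Deny from env=keep_out
```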
6:36 pm on Sept 12, 2018 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles

joined:May 14, 2008
posts:3286
votes: 19


Many thanks to all for your input! Got a problem with my desktop's sound at present but I'll work through the htaccess file as soon as possible. But generally...

not2easy, wilderness, I thought the wildcards were necessary but your comments show they are not.

TorontoBoy - dontlog works ok but I'll look closer at the correct syntax. You are correct about the [OR] - missed that entirely. :(

Lucy - so many points to look through! Thank you. Hopefully I can get to it in a day or two.

I may be back! :)
9:54 pm on Sept 15, 2018 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3286
votes: 19


Finally got to playing with this again. I've modified htaccess in alignment with the advice you've all given so many thanks again.

TorontoBoy:
> SetEnvIfNoCase User-Agent "urlwatch" dontlog
I have a note in my htaccess file that CustomLog is not allowed in htaccess (originally set up around April last). I have just reinstated it but it throws a server error.
CustomLog /var/log/apache2/access.log combined env=!dontlog

Otherwise, I'll now sit back and see if I get any more curl UAs - or any other nasties.
10:38 pm on Sept 15, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member samizdata

joined:Aug 29, 2006
posts:1378
votes: 18


I'll now sit back and see if I get any more curl UAs

You could change the user-agent in your browser to test.

...
1:14 am on Sept 16, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24

joined:Apr 9, 2011
posts:15936
votes: 889


I have a note in my htaccess file that customlog is not allowed in htaccess
Yup, everything to do with logging can only be said in the config file--either lying loose for the whole server, or in a vhost envelope. (Apache docs always list these two categories separately, even though things that can be done in config, but can't be done in vhost, can pretty well be counted on the fingers of one hand.) The same applies to LogLevel directives, which apply specifically to the Error Log*, and to RewriteLog, which is What It Says On The Box.

The server will obviously not care if you set an environmental variable called "dontlog" in htaccess--you just can't do anything with it. (Except, of course, sneakily use it for some entirely different, non-logging-related purpose.)


* No loss, it turns out. No matter how high you crank up the logging level, the Error Log will never tell you more than “request denied by server configuration”.
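In the main config or a vhost envelope (not .htaccess), the conditional-logging setup would look something like this sketch; the paths follow the Debian layout mentioned later in the thread, and the ServerName is the thread's example domain:

```apache
<VirtualHost *:443>
    ServerName mail.example.net

    # set the dontlog variable for the monitoring UA...
    SetEnvIfNoCase User-Agent "urlwatch" dontlog

    # ...and skip logging any request that carries it
    CustomLog /var/log/apache2/access.log combined env=!dontlog
</VirtualHost>
```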
1:23 pm on Sept 16, 2018 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles

joined:May 14, 2008
posts:3286
votes: 19


Happily, the dontlog does just that in htaccess - it does not log that UA - and urlwatch itself does still work.

samizdata - I could just run curl from here. :) No time at present, though.
2:36 pm on Sept 16, 2018 (gmt 0)

Preferred Member from CA 

TorontoBoy

joined:Feb 7, 2017
posts:579
votes: 60


Tell us if your curls get blocked, but monitor them, because some service you want might use them.
5:37 pm on Sept 16, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24

joined:Apr 9, 2011
posts:15936
votes: 889


Do you also control the server? Or is the specific environmental variable “dontlog” declared by your host, so you can set it in htaccess and it will be recognized elsewhere? (The analogy that comes to mind is the RewriteMap, which can be invoked in htaccess although it can only be defined in config.)
9:43 am on Sept 17, 2018 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles

joined:May 14, 2008
posts:3286
votes: 19


TorontoBoy
> some service you want might use them.

No, and if so I can easily modify it. As I said, it's just for squirrelmail. So far, no more curls.

Lucy
Yes, I control the server. In apache2.conf I have the CustomLog line as noted above but it's commented out. Still, no periodic urlwatch entries, which is what I want. Do undeclared targets get lost anyway?
4:04 pm on Sept 17, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24

joined:Apr 9, 2011
posts:15936
votes: 889


Do undeclared targets get lost anyway?
Do you mean, does the server recognize the “dontlog” environmental variable even if you haven't told it what it means? I shouldn't think so. Do you have some independent way of knowing that the requests are in fact coming in, just not getting logged? (You don't have a firewall, do you? If requests are blocked before they even reach the server, then logging preferences wouldn't apply.)
1:29 pm on Sept 18, 2018 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles

joined:May 14, 2008
posts:3286
votes: 19


urlwatch runs three times an hour on this and other servers (linux and ASP). I get emails if any of the targets fails - for example, I've had three for this specific server in the past week, all accountable. I originally set up the dontlog because there were more entries for these in the log than anything else (squirrel only gets used occasionally).

Ok, just ran a more extensive search. CustomLog/dontlog is defined in sites-available and conf-available, which, of course, are (in this case) sym-linked in the relevant -enabled folders. Oops! Sorry, I did say I wasn't too good at apache. :( Thanks for the help/advice, Lucy.