Forum Moderators: phranque

How to block specific webcrawlers in .htaccess

         

stephen22

9:49 pm on Oct 23, 2024 (gmt 0)

Top Contributors Of The Month



I'm not much of an Apache developer... in fact, not at all. ;)

I run a single site on Apache.

Recently, I am being swamped by facebook and meta crawlers/spiders. I literally had 250 of them crawling the site today.

They have the user-agent "facebookexternalhit" and "meta-externalhit".

What specific Apache code could I put in my .htaccess to completely block these crawlers? Thanks!

lucy24

11:05 pm on Oct 23, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Some basics: What, if anything, do you currently use to block unwanted visitors?

Can we assume the server is Apache 2.4 or later, using the Require directive syntax? (If it currently says Allow and Deny, we’ll need a little more information.)

The object is to build on whatever you’re currently using. If you have no idea what’s in your current .htaccess file--or, indeed, if you don’t have one at all--then oh boy, welcome to lots of fun ;)

tangor

12:20 am on Oct 24, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



--or, indeed, if you don’t have one at all--then oh boy, welcome to lots of fun ;)


...And you have come to the right place to find out how all this mysterious stuff works!

stephen22

6:13 am on Oct 24, 2024 (gmt 0)

Top Contributors Of The Month



Thanks for the replies. I DO have an .htaccess file, and I do understand somewhat how to use it, for the most part. But anything complicated I got from asking questions like this or searching on the web, not from any real understanding of Apache.

The Apache version is 2.4.59.

For example, I have this kind of stuff:

RewriteEngine On

# send the naked URL to the forums index
RewriteCond %{HTTP_HOST} ^forums\.example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.forums\.example\.com$
RewriteRule ^/?$ "https\:\/\/forums\.example.com\/forum" [R=301,L]

RewriteRule ^\.well-known\/acme-challenge\/ - [L]
RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]


(I really don't remember what that second part does, but it's been in there forever.)

I have been blocking certain individual IPs or ranges using this:


order allow,deny

deny from xxx.xxx.xxx.xxx
deny from xxx.xxx.xxx

allow from all


...and lots of Redirects. That's about it. ;)

The Contractor

11:27 am on Oct 24, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can redirect them to a specific page (full url to file) like a simple blank.html file.

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} "facebookexternalhit" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "meta-externalhit" [NC]

RewriteRule ^.*$ https://www.yoursite.com/blank.html [L,R]
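
One caveat with the rule above, offered as a hedged sketch rather than as The Contractor's own code: because the pattern matches every request, the crawler's follow-up request for blank.html would itself be redirected again. Assuming blank.html sits in the site root, an extra condition exempts it (www.yoursite.com and blank.html are the placeholders from the post above):

```apache
RewriteEngine On

# (facebookexternalhit OR meta-externalhit) AND not already asking for blank.html
# note: stephen22 later corrects the second name to meta-externalagent
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} meta-externalhit [NC]
RewriteCond %{REQUEST_URI} !^/blank\.html$
RewriteRule ^ https://www.yoursite.com/blank.html [R=302,L]
```

In mod_rewrite, conditions joined by [OR] are grouped first and then ANDed with the remaining conditions, so the exemption applies to both user-agents.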

lucy24

4:51 pm on Oct 24, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Whew. Smallest point first:
"https\:\/\/forums\.example.com\/forum"
The quotation marks aren’t necessary, and you don’t need to escape anything in the target of a RewriteRule. Obviously it does no harm, since the rule has been working for a while, but it’s needless clutter.

I disagree with Contractor’s approach (and here, too, the quotation marks aren’t needed), but let’s deal with some other stuff first.

When you say “lots of redirects”, do you mean rules in the form Redirect blahblah or RedirectMatch blahblah? If so, you've got mod_alias mixed with mod_rewrite, which can create issues, so we’ll talk about those by and by.

The combination of Apache 2.4 with “Allow, Deny” means your server is using mod_access_compat. This module was created especially to bridge the gap between 2.2-and-earlier syntax, and 2.4-and-later syntax. It will eventually go away, so let’s start by updating your access controls.

Before anything else make a copy of your htaccess file. If you haven't already done so, save the copy on your personal hard drive with a name that doesn't have the leading dot (so you can easily find it). Open the htaccess (the “real” one) in the text editor of your choice.

Now add this:
<RequireAll>
Require all granted
<RequireNone>
Require env unwanted
</RequireNone>
</RequireAll>
In your text editor, globally change all occurrences of “Deny from” to “Require ip” and move them into the RequireNone envelope. Delete the “Allow from” and “Order” lines; they’re no longer needed.

Upload the edited htaccess and confirm that your site continues to work. (This is why you made the backup.)

Now the fun part! You can use mod_rewrite for access control, but there are advantages to using mod_auththingummy (the Require business) in conjunction with mod_setenvif. Add this line outside the Require envelope (before or after doesn’t matter, but before makes more intuitive sense):
BrowserMatch facebookexternalhit unwanted

The word “unwanted” here is the name of an environment variable. You can call it anything you like. By default its value is set to 1, but if you want to be fancy you can give each one a different value:
BrowserMatch facebookexternalhit unwanted=facebook
BrowserMatch ^meta-externalagent unwanted=meta
and so on. This is the simplest way to deny by user-agent.
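
For reference, BrowserMatch is shorthand for SetEnvIf applied to the User-Agent header, so the first two forms below should behave the same. A minimal sketch; the case-insensitive variant is shown only as an option, not something this thread relies on:

```apache
# shorthand form, as used above
BrowserMatch facebookexternalhit unwanted=facebook

# equivalent long form
SetEnvIf User-Agent facebookexternalhit unwanted=facebook

# case-insensitive variant, in case a bot's capitalization varies
BrowserMatchNoCase ^meta-externalagent unwanted=meta
```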

:: quick detour to raw logs ::

I’ve never seen a meta-externalhit, but I do find some meta-externalagent requesting images. Further detour to my own htaccess tells me I blocked them early last month, and then promptly forgot they existed. Did you notice the ^ in front of meta? mod_setenvif, like mod_rewrite, uses Regular Expressions. The “meta-externalagent” happens to come at the very beginning of the UA string, so the ^ saves your server a picosecond or so: if it isn’t the first thing you see, stop checking and move on to the next item.

stephen22

5:43 am on Oct 25, 2024 (gmt 0)

Top Contributors Of The Month



Thank you for your reply!

Yes, it's "meta-externalagent" - sorry for my typo.

At this very moment, I have about 250 of facebookexternalhit and meta-externalagent spiders crawling my site. (They completely ignore robots.txt BTW, hence this thread.) Or maybe it's not 250 spiders per se; but my forums software has a page that shows "who's online", and there are 250 threads being simultaneously crawled by these bots.

So let me see if I've got this correct. The end result (for me) would look like this?


BrowserMatch facebookexternalhit unwanted=facebook
BrowserMatch ^meta-externalagent unwanted=meta

<RequireAll>
Require all granted
<RequireNone>
Require env unwanted
Require ip xxx.xxx.xxx.xxx
Require ip xxx.xxx.xxx
Require ip xxx.xxx
</RequireNone>
</RequireAll>

I'm assuming the Require ip directive works for ranges, i.e. xxx.xxx.xxx or xxx.xxx ?

EDIT:
And question:
I have two .htaccess files on this site.
The site structure is pretty simple. There is the root site (which is actually a sub-domain) forums.example.com.
This contains nothing but a few things like favicons and some folders of images.
Then there is the folder /forum which contains the actual forum software.
As you can see from the code I posted previously, using a URL to the root site rewrites so it goes to the forum index (forums.example.com/forum). There is no "root site" from the user perspective. This may be hinky but it's the way I had to set it up as it's a very old site and forums (been around 24+ years) and has been moved multiple times and the structure can't really be modified.
Then there is a separate .htaccess file inside the /forum directory, which contains different code.

Should I have both of these? Or everything in one .htaccess file in the root directory? I think the one in the /forum directory contains some things necessary for the operation of the forum software, so maybe they could not be combined in any way. So I would assume the stuff we are talking about here would go in the root directory .htaccess?

EDIT 2 (and UPDATE).
So before going to bed here, as an experiment I have just this at the top of my root .htaccess file, and did not for the moment rewrite the allow/deny stuff (of which there is a lot):
(cleaned up of escaping as you suggested)

RewriteEngine On

# send the naked URL to the forums index
RewriteCond %{HTTP_HOST} ^forums.example.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.forums.example.com$
RewriteRule ^/?$ https://forums.example.com/forum [R=301,L]

RewriteRule ^.well-known/acme-challenge/ - [L]
RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

# =====================================================================
BrowserMatch facebookexternalhit unwanted=facebook
BrowserMatch ^meta-externalagent unwanted=meta

<RequireAll>
Require all granted
<RequireNone>
Require env unwanted
</RequireNone>
</RequireAll>
# =====================================================================


As of about 20 minutes later, every single facebook and meta crawler has disappeared! I guess it's working. ;)

lucy24

5:06 pm on Oct 25, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm assuming the Require ip directive works for ranges, i.e. xxx.xxx.xxx or xxx.xxx ?
Yes, it’s exactly the same as the old Allow/Deny rule, including partials and CIDR ranges like 45.72.0.0/17 (meaning 45.72.0.0 through 45.72.127.255).
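
A minimal sketch of the forms Require ip accepts, using addresses from the documentation-reserved ranges as placeholders (none of them come from this thread). The lines sit inside a RequireNone envelope, so each one denies rather than admits:

```apache
<RequireAll>
    Require all granted
    <RequireNone>
        # single address
        Require ip 192.0.2.15
        # partial address: everything in 198.51.100.x
        Require ip 198.51.100
        # the same range written in CIDR form
        Require ip 198.51.100.0/24
        # a /25 covers 203.0.113.0 through 203.0.113.127
        Require ip 203.0.113.0/25
    </RequireNone>
</RequireAll>
```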

every single facebook and meta crawler has disappeared!
Tralala! Next time you check your access logs, you should see a lovely line of requests, each with a 403 response.

Should I have both of these? Or everything in one .htaccess file in the root directory?
In your case, it should be possible to combine everything into a single htaccess. Sometimes it's necessary to have more than one if it's a hosting setup with a “primary”/“addon” structure, where most sites’ directories are inside the “primary” site directory. (Happily, mine has a “userspace” setup, where all sites are parallel. That lets me have a single htaccess covering access controls for all sites, and then site-specific htaccess for any individual sites.) “Directories” here means the physical directories on the server, which may or may not align with what the user sees in the URL.

The “one or many” question mainly becomes relevant when you’ve got RewriteRules, because mod_rewrite doesn’t inherit as straightforwardly as other mods. As a general rule, don’t use mod_rewrite more than once along the same physical filepath. That is, if you’ve got directory A containing directories B and C, have all your RewriteRules either in A only, or in B and C only.

The other exception is if you’ve got rules that are specific to one physical directory: for example, you’ve got a global
Options -Indexes
for the whole site, but then for some directories you do want to allow auto-indexing. Since you can’t use a <Directory> section in htaccess, you’d need to make a supplemental htaccess with just one rule in it. Same goes if you want part of the site to use a different ErrorDocument. And so on. But these are specific situations that can be dealt with as they arise.
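
As a sketch of that supplemental-htaccess idea: the directory name /downloads and the error page path are hypothetical, and the host has to permit these overrides (AllowOverride) for either file to take effect.

```apache
# /.htaccess (site root): no auto-generated directory listings anywhere
Options -Indexes

# /downloads/.htaccess (hypothetical directory): turn listings back on here only
Options +Indexes

# the same supplemental file could also carry its own error page (hypothetical path)
ErrorDocument 404 /downloads/notfound.html
```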

The rule with
.well-known/acme-challenge/
looks like something the host added when the site went https. It’s used by Let’s Encrypt; leave it as-is. The same applies to most things with leading dot; if something shows up and you know you didn’t put it there, check with the host.

stephen22

8:53 pm on Oct 25, 2024 (gmt 0)

Top Contributors Of The Month



Thank you all!

One other problem I'd love to fix. I am being crawled it seems by some unknown and unpublished bots that come from the same cloud company in Singapore, but use many different ranges of IPs. They may even be spoofing IPs for all I know about how this works. I've been trying to block IP ranges, but they just come back later with other ones. But the IPs all resolve to something like "ecs-159-138-111-201.compute.hwclouds-dns.com", where the first part is the ip address. How could I block EVERYTHING that comes from "hwclouds-dns.com"?

Also, you mentioned Redirects? As this is an ancient site, it has accumulated several hundred redirects in the form of:


Redirect permanent /releasenotes/ko https://forums.example.com/forum/forumdisplay.php?f=229


Should I be doing anything differently here?

lucy24

10:16 pm on Oct 25, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They may even be spoofing IPs for all I know about how this works.
Probably not. It isn't like faking a caller ID, or putting a bogus return address on snail mail. If a request comes in with a fake IP, they won't receive the requested file, because it will be sent to the fake address. So IP spoofing is not likely unless someone's trying a DDOS attack, aimed at either your site or the real owner of the faked IP.

How could I block EVERYTHING that comes from "hwclouds-dns.com"?
You should be able to do it with mod_authz_host, but this is not something I'm awfully familiar with; I stick with the numerical IP. The syntax is
Require host badname.com
Note that this requires your server to do a reverse-DNS lookup on every request, so you have to judge whether this extra work is worth it. Most offending colos or server farms have a finite number of IP ranges, which you could block in a few lines.
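
A hedged sketch of how that line could sit in the envelope built earlier in the thread. Going by the mod_authz_host documentation, a bare domain name matches any hostname ending in that domain, and the module does a reverse lookup followed by a forward lookup to confirm the result:

```apache
<RequireAll>
    Require all granted
    <RequireNone>
        Require env unwanted
        # denies any client whose verified hostname ends in hwclouds-dns.com
        Require host hwclouds-dns.com
    </RequireNone>
</RequireAll>
```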

Now, about Redirects: There are two ways to issue a redirect:
mod_rewrite (which executes fairly early), directives in the form RewriteRule blahblah
mod_alias (which executes fairly late), directives in the form Redirect(Match) blahblah
You cannot change the execution order of the various modules; you can only reorder directives within a given module.

If you use both mod_alias and mod_rewrite, you risk having things happen in unintended order. In particular, canonicalization (with/without www, or http/https) can only be done in mod_rewrite, which executes before mod_alias, so you may end up with chained redirects. So it's advisable to change everything to Rewrite.

Make a fresh backup of your currently working htaccess. Are you OK with Regular Expressions? If yes, you can update everything with a couple of global changes.

Edit: Whoops! I've had to delete a whole slab of my post because it may not be up-to-date. I'll come back later and figure out what I meant to say.

Teatime for me.

lucy24

2:32 am on Oct 26, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hours later: False alarm; I just couldn’t make sense of--stop me if you’ve heard this one--my own code.

Now then! If you really do have vast numbers of Redirect statements, you’ll want to run a few global replaces. In everything that follows, I’ve assumed \1 and \2 for captured groups. If your text editor uses $1 and $2, change as appropriate, but don’t change any other \ backslashes you may come across. This applies only to the global replaces you’re running here and now; your server uses $1 (or, in special cases, %1) so those won’t change.

First change . to \. in the pattern, because mod_rewrite always uses Regular Expressions, while mod_alias (Redirect by that name) only does so in RedirectMatch. No, the doubled \\ is not an error.
^(Redirect \w+ \S+?[^\\])\.
TO
\1\\.
and repeat until it rinses clean. (Because there might be more than one.) Similarly, get rid of all quotation marks as needless clutter:
^(Redirect.+)"
TO
\1
and repeat as needed.

Now then:
^Redirect(?:Match)? (?:301|permanent) (\^)?/(.+)
TO
RewriteRule \1\2 [R=301,L]
The meat of these rules is the part expressed as (.+) which includes both pattern and target. (This is the part that caused me to run off in a panic, having forgotten mod_alias syntax.) If you have any temporary redirects (302 instead of 301) we’ll deal with them later.

If you have any mod_alias rules that return a 410 or 403, add these:
# 410 if needed
^Redirect(?:Match)? 410 (\^)?/(.+)
TO
RewriteRule \1\2 - [G]

# 403 if needed
^Redirect(?:Match)? 403 (\^)?/(.+)
TO
RewriteRule \1\2 - [F]
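
To make the 410 replacement concrete, here is what a single line would look like before and after, assuming the dot-escaping replace has already been run. The path /old-page.html is invented for illustration:

```apache
# before (mod_alias, shown in its intermediate state with the dot already escaped):
# Redirect 410 /old-page\.html

# after (mod_rewrite): the leading slash is gone, and "-" means "no substitution"
RewriteRule old-page\.html - [G]
```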

Stop here, and search your whole htaccess and confirm that the word “Redirect” no longer occurs anywhere.

If you’re good to go, proceed to the fun part:

begin boilerplate from a document I put together years ago

Sort RewriteRules twice.

First group them by severity. Access-control rules (flag [F]) go first. Then any 410s (flag [G]). Not all sites will have these. Then external redirects (flag [R=301,L] unless there is a specific reason to say something different). Then simple rewrites (flag [L] alone). Finally, there may be a few rules without [L] flag, such as cookies or environment variables.

Function overrides flag. If certain users are forcibly redirected to an "I don't like your face" page, the RewriteRule will have an R flag. But group it with the access-control [F] rules.

Then, within each functional group, list rules from most specific to most general. In most htaccess files, the second-to-last external redirect will take care of "index.html" requests. The very last one will fix the domain name and protocol: with/without www, and http vs. https.

Leave a blank line after each RewriteRule, and put a
# comment

before each ruleset (Rule plus any preceding Conditions). A group of closely related rulesets can share an explanation.

end boilerplate
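
A minimal sketch of what that ordering might look like in practice. Every path here is hypothetical except the forum redirect, which is borrowed from this thread, and the last redirect is one common canonicalization pattern rather than the right one for every site:

```apache
RewriteEngine On

# access control (hypothetical directory)
RewriteRule ^private/ - [F]

# gone for good (hypothetical page)
RewriteRule ^old-promo\.html$ - [G]

# specific external redirect
RewriteRule ^forum/oasys/tut https://forums.example.com/forum/forumdisplay.php?forumid=154 [R=301,L]

# most general external redirect: force https
RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

# internal rewrite (hypothetical)
RewriteRule ^about$ /about.php [L]
```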

That will do for now.

stephen22

11:18 pm on Oct 26, 2024 (gmt 0)

Top Contributors Of The Month



Thank you, but that is way over my head. :) I'm not ok with RegExp, sadly.

I would need to know what the end result is supposed to look like. Let's say this is a piece of what I have now:


Redirect permanent /forum/oasys/ann https://forums.example.com/forum/forumdisplay.php?forumid=155
Redirect permanent /forum/oasys/gen https://forums.example.com/forum/forumdisplay.php?forumid=140
Redirect permanent /forum/oasys/seq https://forums.example.com/forum/forumdisplay.php?forumid=156
Redirect permanent /forum/oasys/kar https://forums.example.com/forum/forumdisplay.php?forumid=142
Redirect permanent /forum/oasys/tut https://forums.example.com/forum/forumdisplay.php?forumid=154


What would that turn into with your recommendations? Thanks.

lucy24

4:04 am on Oct 27, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



With such a long list of redirects, you may feel safer pasting this whole section of your htaccess into a separate text file for working purposes, so you can merrily do global replaces without worrying about changing something unintended. And then paste the whole thing back in.

If it all looks like your specimen, we don’t have to worry about literal periods . (in the pattern only) and/or quotation marks (anywhere).

Are you really redirecting “friendly” URLs (no query string) to “unfriendly” ones? Well, I guess if it’s been set up that way for years, don’t mess with it.

Since you don't do Regular Expressions--I know the feeling: I was deathly afraid of RegEx for the first few years--you'll need to do it in two steps. First change all
Redirect permanent /
(include the leading slash / because mod_rewrite doesn't use it in this environment) to
RewriteRule 
(with trailing space). And then add
 [R=301,L]
(with leading space) to the end of each line. The R=301 part is the equivalent of “Redirect permanent”, and L is a necessary flag for most RewriteRules.

When done, search for any remaining occurrences of “Redirect” and we’ll deal with those as needed.

Once everything is done, look at the part of my earlier post that begins “Sort RewriteRules twice”. It should make sense now.

lucy24

5:47 am on Oct 27, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So, for example,
Redirect permanent /forum/oasys/tut https://forums.example.com/forum/forumdisplay.php?forumid=154
becomes
RewriteRule forum/oasys/tut https://forums.example.com/forum/forumdisplay.php?forumid=154 [R=301,L]

not2easy

11:50 am on Oct 27, 2024 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



How could I block EVERYTHING that comes from "hwclouds-dns.com"?

Not to get in the way of lucy24's excellent explanations, but if you are blocking via IP CIDR, the one you mentioned (HUAWEI CLOUDS) is at 159.138.0.0/16

lucy24

4:44 pm on Oct 27, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Addendum: After re-reading your description of the site, it is probably a good idea to start all your RewriteRules with
RewriteRule ^
replacing the former / with ^. This saves your server a few nanoseconds, because the element /forum/ happens to come at the very beginning of the URL. Consider it your very first step into the world of Regular Expressions.
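
Applied to the five sample lines stephen22 posted, the two-step replace plus the ^ anchor would give something like the following. A sketch only: unlike Redirect, a RewriteRule does not append whatever follows the matched prefix to the target, which should not matter for fixed targets like these.

```apache
RewriteRule ^forum/oasys/ann https://forums.example.com/forum/forumdisplay.php?forumid=155 [R=301,L]
RewriteRule ^forum/oasys/gen https://forums.example.com/forum/forumdisplay.php?forumid=140 [R=301,L]
RewriteRule ^forum/oasys/seq https://forums.example.com/forum/forumdisplay.php?forumid=156 [R=301,L]
RewriteRule ^forum/oasys/kar https://forums.example.com/forum/forumdisplay.php?forumid=142 [R=301,L]
RewriteRule ^forum/oasys/tut https://forums.example.com/forum/forumdisplay.php?forumid=154 [R=301,L]
```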

The range
159.138.0.0/16
can also be expressed as
159.138
and-that's-all. (The same goes for anything ending in /8 or /24.) This is purely a matter of personal preference.

Martin Potter

7:52 pm on Oct 27, 2024 (gmt 0)

5+ Year Member Top Contributors Of The Month



Many thanks to lucy24 for the above explanations. I have now finished changing all of the deny from's to the better Require ip formats.

I think now that I need a good O'Reilly book to get me started on regex.

stephen22

2:41 am on Oct 28, 2024 (gmt 0)

Top Contributors Of The Month




Not to get in the way of lucy24's excellent explanations, but if you are blocking via IP CIDR, the one you mentioned (HUAWEI CLOUDS) is at 159.138.0.0/16


It also appears to be at these ranges (all of these were crawling my site and showed "hwclouds-dns.com"):
94.74
101.44
110.238
111.119
114.119
119.8
119.13
124.243
159.38
166.108
190.92

I was trying to block all of these and thought it would be easier to just block hwclouds-dns.com - because there are probably more ranges.

lucy24

6:01 am on Oct 28, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



thought it would be easier to just block
It might be easier for you in the short term, but not necessarily easier on the server, because of the extra DNS-lookup step. Are you sure about the huawei? It's noteworthy that the IPs seem to come from all around the globe, not just Asia. But I think we talked about this somewhere upthread, arriving at
Require host blahblah

Now, if you wanted to be unkind, and if you have no legitimate human traffic from China, you could say
SetEnvIf Accept-Language ^zh badlang
SetEnvIf Accept-Language ^zh-(tw|TW) !badlang
coupled with this line inside the RequireNone envelope:
Require env badlang
Translated from Apache to English, that means “Don’t admit Chinese-speaking visitors unless they are from Taiwan.” I don't know why robots even bother to send an Accept-Language header, but I just checked and found tens of thousands of them over the past year on a not-very-big* site.

* For present purposes, “not very big” means that daily access logs tend to run less than 1MB. This is often a more useful metric than filesize or pagecount.
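
Put together with the envelope from earlier in the thread, the language block might look like this. A hedged sketch: SetEnvIf lines are processed in the order they appear, so the second line clears the variable that the first one set for Taiwanese locales.

```apache
# flag any Accept-Language that starts with zh ...
SetEnvIf Accept-Language ^zh badlang
# ... then clear the flag again for zh-tw / zh-TW
SetEnvIf Accept-Language ^zh-(tw|TW) !badlang

<RequireAll>
    Require all granted
    <RequireNone>
        Require env unwanted
        Require env badlang
    </RequireNone>
</RequireAll>
```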

stephen22

3:22 pm on Oct 28, 2024 (gmt 0)

Top Contributors Of The Month




Are you sure about the huawei? It's noteworthy that the IPs seem to come from all around the globe, not just Asia.

Unless I am doing something wrong (quite possible). But when I identify a suspicious IP, I use a free tool to perform a reverse lookup, and they all said something like:


IP Address Geolocation
159.138.110.12 or ecs-159-138-110-12.compute.hwclouds-dns.com is an IPv4 address owned by Huawei International Pte. LTD and located in Singapore, Singapore


All of these example IPs from widely disparate ranges give the same thing. So how are you determining that they come from around the globe? (Just want to make sure I am not misreading something here and banning ranges inappropriately...) I have nothing against China per se so the language block seems overkill....

94.74.87.185
101.44.162.220
114.119.185.59
111.119.197.210
124.243.139.68
159.138.110.12

stephen22

4:54 pm on Oct 28, 2024 (gmt 0)

Top Contributors Of The Month



(more...)
For example, I just removed my block on hwclouds-dns.com. Within minutes, all of these showed up, all reverse DNS to hwclouds-dns.com:

(range 124.243)
124.243.145.58
124.243.187.217
124.243.135.55
124.243.136.131
124.243.145.104

range (190.92)
190.92.209.4
190.92.200.171
190.92.206.158
190.92.204.35

110.238.108.0
114.119.174.108
101.44.164.88

I do not have this many legitimate viewers in china/singapore. ;)

lucy24

5:57 pm on Oct 28, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I do not have this many legitimate viewers
Besides, you’ll know when you look at the day’s access logs, because it’s rare for robots to request supporting files. Just pages and-that’s-all.

I’ve got a set of IP tables that say what country they’re assigned to--I used to update it every few years, but got tired--and each top-level range (the /8 sectors) is controlled by one of the regional registries. 110-126 is APNIC (Asia and Pacific), but 190 is LACNIC (Latin America and Caribbean), 159 is ARIN (North America) and so on.

stephen22

11:10 pm on Nov 2, 2024 (gmt 0)

Top Contributors Of The Month




So, for example,

Redirect permanent /forum/oasys/tut https://forums.example.com/forum/forumdisplay.php?forumid=154

becomes

RewriteRule forum/oasys/tut https://forums.example.com/forum/forumdisplay.php?forumid=154 [R=301,L]


I finally got around to experimenting with this. What I did was take this one Redirect (that is tested and working):


Redirect permanent /forum/m3 https://forums.example.com/forum/forumdisplay.php?forumid=191


... commented it out, made sure it was no longer working, and changed it to this, up at the top of my htaccess file:


RewriteEngine On

# send the naked URL to the forums index
RewriteCond %{HTTP_HOST} ^forums.example.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.forums.example.com$
RewriteRule ^/?$ https://forums.example.com/forum [R=301,L]

RewriteRule ^.well-known/acme-challenge/ - [L]
RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

# experiment
RewriteRule forum/m3 https://forums.example.com/forum/forumdisplay.php?forumid=191 [R=301,L]


It is not working. Any idea on what I'm doing wrong?

lucy24

4:13 pm on Nov 3, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is this in your “outer” htaccess or the “inner” one? If "outer"--which it looks like it must be--does the inner also have RewriteRules?

Before anything else: In your example you have four RewriteRules (put a blank line after the .well-known one), currently in order 1, 2, 3, 4. The order should be

2, 4, 1, 3

where
#2 is a take-no-further-action rule
#4 is a specific redirect
#1 is, in effect, an index redirect
#3 is a canonicalization redirect

In fact the RewriteConds aren't needed for #1 (which will become #3), since the point is to send ALL root requests to /forum, assuming I understood your explanation of the wonky site layout. But if /forum is a real, physical directory, it should be /forum/ with final slash, or else mod_dir will give you a double redirect.

#3 is also not optimal, but we'll deal with that later.

Do this rearranging, and verify that #4 (which is now #2) still doesn't work.

stephen22

4:58 pm on Nov 4, 2024 (gmt 0)

Top Contributors Of The Month



Hi, first of all, thank you for your ongoing replies, I really appreciate it!

Assuming I understood you correctly, it now looks like this:

RewriteEngine On

RewriteRule ^.well-known/acme-challenge/ - [L]

# experiment
RewriteRule forum/m3 https://forums.example.com/forum/forumdisplay.php?forumid=191 [R=301,L]

# send the naked URL to the forums index
#RewriteCond %{HTTP_HOST} ^forums.example.com$ [OR]
#RewriteCond %{HTTP_HOST} ^www.forums.example.com$
RewriteRule ^/?$ https://forums.example.com/forum/ [R=301,L]

RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

(I commented out the RewriteCond lines that you said are not needed, and added the slash; just to evaluate. I think that part still works.)

It's still not working.

To answer your questions:
This *is* the outer .htaccess file, in the root of forums.example.com/ .

There are a few more RewriteRule lines at the bottom of this file, that redirect certain links used in the forum back to the main site, i.e. forums.example.com -> example.com.

# all links to the vp folder:
RewriteRule ^vp(.*)$ https://example.com/vp/$1 [L,R=302]
# all links to the oasys vgui folder:
RewriteRule ^oasys/gui(.*)$ https://example.com/oasys/gui/$1 [L,R=302]
# all links to the m3 vgui folder:
RewriteRule ^m3/gui(.*)$ https://example.com/m3/gui/$1 [L,R=302]
# all links to the youtube shortcuts:
RewriteRule ^youtube(.*)$ https://example.com/youtube/$1 [L,R=302]


And yes, in the inner .htaccess file, there are some Rewrite Rule lines (along with the notes I have for them - this stuff is really ancient):

# the following 2 lines sends everything through https
# even though this is already in the root .htaccess file, it seems it needs to be here
# or locating directly to the forum doesn't have https, i.e. example.com/forum
RewriteCond %{HTTPS} off
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [R,L]

# experiment to keep out MSN Bots
RewriteCond %{HTTP_REFERER} ^msnbot/2\.0b [NC]
RewriteRule .* - [F,L]

RewriteCond %{HTTP_REFERER} ^msnbot-media/1\.1 [NC]
RewriteRule .* - [F,L]

lucy24

6:40 pm on Nov 5, 2024 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Start by moving all RewriteRules to the “outer” htaccess. This is much easier than figuring out what inheritance rules your server uses, and coding accordingly. In Apache 2.2 and earlier, any RewriteRules in an inner htaccess would simply eliminate RewriteRules in an outer htaccess unless you included a line about inheritance. In Apache 2.4 there's a wider range of options, some of which are inherited server-wide unless you explicitly say something different. Much simpler to keep them all in a single location.
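
For completeness, the “line about inheritance” is the RewriteOptions directive. A minimal sketch of what it would look like in the inner /forum/.htaccess, shown only to illustrate the mechanism; the advice above is still to keep every RewriteRule in the outer file:

```apache
# /forum/.htaccess
RewriteEngine On

# pull in the parent directory's conditions and rules;
# per the mod_rewrite docs they are applied after the rules in this file
RewriteOptions Inherit
```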

Now, picking one at random:
RewriteRule ^vp(.*)$ https://example.com/vp/$1 [L,R=302]
Doesn't this potentially create an endless loop?
example.com/vp would redirect to example.com/vp/
example.com/vp/ would redirect to example.com/vp//
example.com/vp/blahblah would redirect to example.com/vp//blahblah

I note that these are temporary redirects. Why?

In any case, this is where having example.com and forums.example.com in the same physical location can create confusion, so let’s work on getting all rules into optimal order. Currently, if you request, say,
example.com/blahblah
(here I mean literally “blahblah”, i.e. any URL that doesn’t actually exist on the site), where do you end up? Try it in your browser and see what the address bar says.

Another trivia point: once you have [F]--or any other 400- or 500-class response--you don't need the [L]. It does no harm, but isn't needed.
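
As a hedged illustration of that last point, the two msnbot rulesets from the inner htaccess could collapse into one rule carrying [F] alone. This sketch also assumes the bot name was meant to be tested against the User-Agent header rather than the referer, which is a guess about intent and not something confirmed in the thread; it also matches any version number, not just 2.0b and 1.1.

```apache
# keep out MSN bots (assumption: match on User-Agent, not HTTP_REFERER)
RewriteCond %{HTTP_USER_AGENT} ^msnbot(-media)?/ [NC]
RewriteRule ^ - [F]
```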

And to think that this thread started out just asking innocently how the ### to block facebookexternalwhatever.