Forum Moderators: phranque

Making .htaccess generic

Making site-specific rules generic

         

Patrick Taylor

9:18 pm on Aug 24, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My .htaccess file appears to be working ok but it is for websites for users other than me (although I used it on my websites as well). So I have tried to make things as simple as possible for non-tech users. The basic .htaccess file does not require editing and works 'out of the box' but it has limitations that can be overcome with an EXTENDED version requiring one or two edits. Ideally though it would not need editing to remove those limitations.

(1) Concerns removing index.php from the home page URL and only the home page URL. The website might be in root or a sub directory (or sub directories) so the only way I have been able to achieve this is ACTION (i) and (ii) below, which requires a user to manually edit the file to enter site-specific details.

(2) Concerns removing the .php file suffix from all root URLs but not those in sub directories. ACTION (iii) below requires a manual edit, again to enter site-specific details.

EXTENDED version
<IfModule mod_rewrite.c>
RewriteEngine on
##
# Forbid direct viewing txt files in pages folder
RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\ (.*)/pages/(.*)\.txt [NC]
RewriteRule ^ "-" [F]
##
# SITE SPECIFIC ACTION
# Remove index php root only
# ACTION (i) exclude named subfolders
# eg (folder1) or (folder1|folder2) etc
RewriteCond %{REQUEST_URI} !(admin|diagnostics|visits) [NC]
# ACTION (ii) set site folder from site root
# eg / for root or /cms/ for subfolder
RewriteRule ^index\.php$ /cms/ [R=301,L]
##
# Remove php extensions
# ACTION (iii) exclude named subfolders
# eg (folder1) or (folder1|folder2) etc
RewriteCond %{REQUEST_URI} !(admin|diagnostics|visits) [NC]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\ (.*)\.php [NC]
RewriteRule ^ %1 [R=301,L]
# END SITE SPECIFIC
##
# Rewrite non php URLs to php on server
# php URLs still usable
# Is not a directory
RewriteCond %{REQUEST_FILENAME} !-d
# Is an actual php file
RewriteCond %{REQUEST_FILENAME}\.php -f
# Internally rewrite to actual php file
RewriteRule ^(.*)$ $1.php
##
# If not found then relative path to error 404 file
# Is not an actual file or directory
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* inc/404.php [L]
</IfModule>


I would like to know whether those features can be achieved in a 'generic' .htaccess that does not require editing for site-specific features. I would prefer not to have to introduce RewriteBase.

lucy24

10:55 pm on Aug 24, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would prefer not to have to introduce RewriteBase.
No need to, because (a) the default RewriteBase is already / (root of whatever site you're currently on) and, more importantly, (b) every RewriteRule target should begin with either https://example.com/ (for external redirects) or / (for internal rewrites), making the RewriteBase question entirely moot.
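
As a rough illustration of the two target forms described above (the page names here are placeholders, not taken from the rules in this thread):

```apache
# External redirect: the target is a full URL with scheme and hostname
RewriteRule ^old-page$ https://example.com/new-page [R=301,L]

# Internal rewrite: the target is a path that begins with a slash
RewriteRule ^new-page$ /new-page.php [L]
```

Neither target is a relative path, so neither rule relies on a RewriteBase setting.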

The part I'm not getting is why you want to make a generic, one-size-fits-all* .htaccess. It seems like the opposite of what most sites strive for, which is a set of rules specific to one site, one server, one configuration. Is this for hosting that is intended to support WP or other CMS that will be used by people who are afraid to touch their own htaccess?

* My fingers wanted to type “one-site-fits-all”. Nice try, fingers.

phranque

1:03 am on Aug 25, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



every RewriteRule target should begin with either https://example.com/ (for external redirects)

that means this:
RewriteRule ^index\.php$ /cms/ [R=301,L]

should instead be something like this:
RewriteRule ^index\.php$ https://www.example.com/cms/ [R=301,L]


in addition to the above consideration, it would be more efficient if you included the second conditional in the RewriteRule in this ruleset to prevent firing that ruleset on every pass:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\ (.*)\.php [NC]
RewriteRule ^ %1 [R=301,L]

something like this:
RewriteRule ^(.+)\.php$ https://www.example.com/$1 [R=301,L]


next, rather than use this method to "show" a 404 error page:
# If not found then relative path to error 404 file
# Is not an actual file or directory
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* inc/404.php [L]

you should instead use the [G] flag on that RewriteRule to generate a 410 status code for the response and then you can also specify a custom 410 error document if necessary. (e.g., /inc/410.php)
something like this:
# If not found then 410 Gone response
# Is not an actual file or directory
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . - [G]
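
A minimal sketch of how the custom 410 document mentioned above might be wired in. It assumes the site sits in the document root and that a file named /inc/410.php exists; that file name is only the example given above, not something known to be part of the CMS:

```apache
# Assumption: /inc/410.php exists and the site is installed in the document root
ErrorDocument 410 /inc/410.php

# Is not an actual file or directory
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Return 410 Gone; the server then serves the ErrorDocument as the body
RewriteRule . - [G]
```

The ErrorDocument path is a URL-path measured from the document root, so an install in a subfolder would need that folder name in the path.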


finally, you should have a hostname canonicalization redirect, typically after the last (more specific) external redirection ruleset and before the first (most specific) internal rewrite ruleset.
in your case it would be the last ruleset.
something like this:
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule (.*) https://www.example.com/$1 [R=301,L]

phranque

1:07 am on Aug 25, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



after rereading your original problem statement, i can see that my suggestions would require edits for the following:
- obviously "example" (the domain name) is different for each site.
- www vs non-www (or other subdomain names) may be different for each site.
- http vs https may be different for each site.

tangor

2:43 am on Aug 25, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



may be different for each site


Which kind of suggests that a generic .htaccess would still have to be modified FOR EACH SITE ...

YMMV.

(More bluntly: .htaccess is usually unique to each site.)

lucy24

4:07 am on Aug 25, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\ (.*)\.php [NC]
RewriteRule ^ %1 [R=301,L]

something like this:

RewriteRule ^(.+)\.php$ https://www.example.com/$1 [R=301,L]
Here I'd go with option C. Since the vast majority of requests will not end in .php, why make the server go to the work of a capture that ends up being thrown away--especially one that can’t help being inefficient? Instead, something like
RewriteCond %{THE_REQUEST} /([^.]+)\.php
RewriteRule \.php$ https://example.com/%1 [R=301,L,NS]
where the [NS] flag immediately excludes things like auto-indexes (if any directory allows them) or, conversely, directory indexes in /index.php, and then the RewriteCond takes care of anything left over. That's assuming there don't happen to be literal . periods in a normal URL. (Perfectly legal, but a lot of rules are much easier if you don't use them.)

But it seems like this kind of rule would only be needed if the site does, in fact, get legitimate requests in .php, most likely if there used to be URLs in .php which have now gone extensionless. Otherwise, a lot of sites get away with a flat-out [F] on explicit requests for .php, since they tend to be malign robots looking for files they're not supposed to see, or that don't exist at all.

RewriteCond %{REQUEST_FILENAME} !-d
This would seem to be superfluous, unless the site actually has directories named /directory.php/. (Legitimate robots will ask for directories without final slash, so if you have a /directory.php/ there will be requests for /directory.php without slash.)

phranque

4:30 am on Aug 25, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



^(.+)\.php$

i'm interested in your explanation about how this is an especially inefficient capture.

RewriteCond %{THE_REQUEST} /([^.]+)\.php

this will fail to match requests for any directory names containing dots in url paths or any php filenames containing 2 or more dots.

Patrick Taylor

10:07 am on Aug 25, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To answer lucy24's question, the context is a small CMS that can easily be installed by someone at least capable of:

(1) Obtaining a domain name.

(2) Setting up a web hosting package.

(3) Uploading folders and files to their web space.

Or at least they know someone who will do that for them; then all they have to do is log in and make web pages.

When the CMS is installed it has a BASIC .htaccess file more or less as follows:

<IfModule mod_rewrite.c>
RewriteEngine on

# 1 ESSENTIAL ACCESS CONTROL

# Forbid direct viewing txt files in pages folder
RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\ (.*)/pages/(.*)\.txt [NC]
RewriteRule ^ "-" [F]

# 2 ESSENTIAL GENERIC INTERNAL REWRITES

# Rewrite non php URLs to php on server
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME}\.php -f
RewriteRule ^(.*)$ $1.php

# If not found then relative path to error 404 file
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* inc/404.php [L]

</IfModule>


If the system is installed on Microsoft-IIS it still works but (i) all the public URLs will have a .php file extension and (ii) the text files that contain the content will be directly accessible in a browser (which might only matter for password-protected pages if they have some). So the BASIC .htaccess file attempts to address those two issues.

The idea of an EXTENDED .htaccess file is to build on the BASIC but it would appear from your replies that it requires manual edits. That is fine for a tech-savvy user (in the same way that people edit wp-config.php when they install WordPress) but there is obviously a risk factor when users are invited to edit something like .htaccess.

At one time my CMS install and setup routine created an .htaccess with the site-specific rules but I decided to simplify it and remove that routine. Maybe I should add it back in. It's a question of whether it is worth it to deal with issues most users wouldn't notice or bother about (such as whether index.php is accessible or not).

From your replies (thanks) an EXTENDED user-editable .htaccess file might be this:

<IfModule mod_rewrite.c>
RewriteEngine on

# 1 ESSENTIAL ACCESS CONTROL

# Forbid direct viewing txt files in pages folder
RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\ (.*)/pages/(.*)\.txt [NC]
RewriteRule ^ "-" [F]

# 2 OPTIONAL SITE SPECIFIC EXTERNAL REDIRECTS USER EDITS

# Remove index php root only
RewriteCond %{REQUEST_URI} !(admin|diagnostics|visits) [NC]
RewriteRule ^index\.php$ https://files.domain.com/cms/ [R=301,L]

# Remove php extensions root only
RewriteCond %{REQUEST_URI} !(admin|diagnostics|visits) [NC]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,}\ (.*)\.php [NC]
RewriteRule ^(.+)\.php$ https://files.domain.com/cms/$1 [R=301,L]

# Hostname canonicalization redirect 1 if not sub domain
# RewriteCond %{HTTP_HOST} ^www\. [NC]
# RewriteRule (.*) https://files.domain.com/cms/$1 [R=301,L]

# Hostname canonicalization redirect 2
RewriteCond %{HTTPS} off
RewriteRule (.*) https://files.domain.com/cms/$1 [R=301,L]

# 3 ESSENTIAL GENERIC INTERNAL REWRITES

# Rewrite non php URLs to php on server
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME}\.php -f
RewriteRule ^(.*)$ $1.php

# If not found then relative path to error 404 file
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* inc/404.php [L]

</IfModule>


This works on a test site in a sub-directory of a sub-domain.

As far as error 404 is concerned the path to the error file has to be relative or else it goes to the website root (which is a different file if the CMS is installed in a subfolder). If I use the normal ErrorDocument 404 /inc/404.php the path is absolute to the root whereas the rules above go to the right file. So I have not understood exactly what phranque is suggesting.

phranque

10:56 am on Aug 25, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I have not understood exactly what phranque is suggesting.

you don't need/want to rewrite the request to the 404 error file.
you do need to provide a 4XX status code on your response and the fact of that status being returned will provide your custom 404 (or 410 in this case) content with the error response.

with your current implementation the response will be a 200 OK status code.

Patrick Taylor

11:46 am on Aug 25, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmmm, well, using the .htaccess file above, when I go to a non-existent URL on 'files.domain.com/cms/' the status code is 404 Not Found. That is when I use Developer tools in Chrome. The address bar shows files.domain.com/cms/non-existent-page and the content of inc/404.php is displayed, including a link back to the sub-domain plus the sub-directory.

The directory /inc/ is in the website 'root' but in a sub-directory it is not actually the root. The PHP file inc/404.php contains header('HTTP/1.1 404 Not Found');

It has to be a relative path to that file and the only way I found to do that is those rules above.

lucy24

4:57 pm on Aug 25, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



how this is an especially inefficient capture
I wouldn’t say “especially” inefficient. Just run-of-the-mill vanilla inefficient, like any rule that has non-final .+ or .* because then the server has to backtrack “oh, whoops, I was supposed to leave room for .php at the end”.

this will fail to match requests for any directory names containing dots in url paths or any php filenames containing 2 or more dots
Yes, that’s right. That’s why I specified that this handy time-saver can only be used if your file or directory names (whether real or virtual) do not contain literal dots. This, in turn, obviously depends on the site: apache dot org, to take the obvious example, has directory names with dots because that’s where they keep information about version numbers (2.0, 2.2, 2.4); any site involving IP lookups will have URLs ending in 1.2.3.4 ... and so on.

when I go to a non-existent URL on 'files.domain.com/cms/' the status code is 404 Not Found. That is when I use Developer tools in Chrome. The address bar shows files.domain.com/cms/non-existent-page and the content of inc/404.php is displayed, including a link back to the sub-domain plus the sub-directory
That would seem to be exactly the intended result. What’s the problem?

Error documents, including the 404 page, cannot contain relative links (ones that don’t begin in / slash) because the visitor--whether human or robot--doesn’t “know” that they’re in the error document; they think they’re on the originally requested page. All links have to begin in / for site root.

Patrick Taylor

5:26 pm on Aug 25, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That would seem to be exactly the intended result. What’s the problem?

There isn't a problem! I was trying to grasp what phranque was suggesting.

The point being that I needed a way for the 404 page to resolve to a subfolder (which it does) instead of the root folder which might have a different one altogether (and anyway if the user is in the sub-domain the 404 page needs to be in the sub-domain - i.e. relative.)

The bottom line seems to be that it isn't possible for the .htaccess file to be generic as long as it contains 301 redirects. The only way is with user input of some kind.

lucy24

7:34 pm on Aug 25, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What, exactly, does “resolve” mean? The ordinary directive would be in the form
ErrorDocument 404 /missing.html
where the leading / means the root of the current (sub)domain, whatever it may be. If a user requests
sub.example.com/imaginarypage.html
they will be served the same content they would get if they explicitly requested
sub.example.com/missing.html
regardless of where that page physically lives.

If you want all the subdomains to serve the same physical 404 page it gets trickier, because the one thing you cannot do is include a hostname in the ErrorDocument directive. (That is, it is physically possible--it won't crash the server--but it changes the desired 404 response into a 302.)
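
To make the difference concrete, here is a short sketch of the two forms. The hostname and file name are the same placeholders used above:

```apache
# Local URL-path: the client still receives the 404 status,
# with /missing.html served as the response body.
ErrorDocument 404 /missing.html

# Full URL (commented out on purpose): the server redirects the client
# to that address, so the client no longer sees the original 404 status.
# ErrorDocument 404 https://example.com/missing.html
```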

Patrick Taylor

12:21 am on Aug 26, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What I mean is that the 404 Page Not Found when the CMS is in a sub-directory (there might be something completely different in the root) has to be in the sub-directory and not go to the 404 Page Not Found of the root. So it has to be relative to the sub-directory.

ErrorDocument 404 /missing.html is absolute to the root so that rule can't be used in a sub-directory. It's just a matter of me explaining it properly. I am talking about a content management system installed in a sub-directory on a stand-alone basis with something else entirely in the root. I need them to act independently. My last rule above seems to do it nicely.

It would not matter if it was all one 'website' (which technically it is) but it matters in this instance. I hope that explains it.

lucy24

2:25 am on Aug 26, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, oops. You’re talking about a physical subdirectory; I was thinking of subdomains. My bad.

Happily, there's still an out, because .htaccess is based on physical directory structure, not on URL filepaths. Or, at least, there's an out if you don't mind having two separate htaccess files: the main one at the site root, and then a supplementary one in the directory containing the CMS. It might even consist of the single line

ErrorDocument 404 /directory/missing.html

... but then again, if the whole CMS lives in a subdirectory, the parts of the htaccess specific to the CMS might also be located there, and then we don't even need to think about what happens in the rest of the site.

It sounds as if you're talking about a custom CMS rather than something standard like WordPress. But I'm going to go find the moderator who--very much unlike me--speaks fluent WordPress, because she can probably shed more light.

not2easy

3:11 am on Aug 26, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I believe that the description of this CMS is entirely different from the way WP works. It is also unlike typical CMS used per directory, in that it appears to be not based on typical structures, as though different subdirectories have different individual filesystems. If the CMS itself is central, I do not know how you can avoid different individual .htaccess files, though it sounds like it is handled via the CMS. If you do not need individual .htaccess files, great, but I do not know (or see) how it could manage the canonical 301 for https:/www/non-www to avoid duplicate content. If there is any .htaccess file used per subdirectory, it would need to have that canonical rewrite locally with RewriteBase / . This is not to say it cannot work, but it is probably using separate local folders (or rewrites) to deal with things such as 404 error pages. I would do thorough testing of https:/www/non-www URLs to check that canonical URLs all resolve as expected.

One kind of setup I've worked with is similar in that I needed to build a separate 404.html for it which is stored in that directory's /html/pages/ folder. That setup does use a separate htaccess file in addition to the one in the root directory. Since that means another canonical 301 for https:/etc it does also require RewriteBase /directoryname/

That setup happens to also have an installation of WP in a directory other than root, but it has its /index.php file in the root directory so it appears to be installed in the root directory. The homepage URL shows https://example.com/ using WP installed at https://example.com/folder/ and they all cooperate.

Patrick Taylor

10:00 am on Aug 26, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The CMS is not like WordPress. It is supplied in a folder named cms which contains all the files. A user can simply upload the folder and they will have the cms fully functional in a folder named cms. It has to operate independently of the root or any other sub-directory. Alternatively they can upload all the files inside the cms folder (not the folder itself) to the root and they have the cms fully functional in the root. They can install it in both if they want, with both functioning independently.

As far as Page Not Found is concerned, Page Not Found must be specific to each installation. So a 404 in the sub-directory has to show in the address bar a non-existent URL for the sub-directory, not the root.

I appreciate the replies, thank you. It seems that a 'generic' .htaccess is not possible when there are external redirects and since I don't want users messing with .htaccess I am going to make it so the system auto-generates the correct rules for wherever the system is installed. It did that at one time but I removed the routine to simplify things and reduce the chance of errors on servers other users might install it on. It is just a matter of re-instating the routine.

tangor

12:55 am on Aug 28, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A product, a method, a query, now it makes sense. :)

Sounds interesting and I wish you all the best of success going forward!

Patrick Taylor

12:17 pm on Aug 28, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks tangor. This has been "going forward" since 2008. At one stage I had the system auto-generating the .htaccess file based on the site location, protocol and all the rest of it, but then I removed that part (to simplify the brainwork) so in that sense "went backwards". I have now gone forwards again and the file can be auto-generated with redirects containing the actual URLs.