Forum Moderators: phranque


More fun with .config files

         

csdude55

5:13 am on Nov 19, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For those new to this thread: I'm moving my .htaccess rules to /etc/apache2/conf.d/userdata/. I've found that I can shave over 500 ms off each page's load time, so it's worth a little effort on my end :-)

@lucy24 helped me a lot when she pointed out that [L] isn't really applicable in the .conf files, so I had to change them to [PT]. That's where I am now.

With some of my domains, I set a variable based on the first parameter in the URL. E.g.:

www.example.com/foo/bar/

has an internal rewrite (shout out to @phranque for helping me figure out what that's called) to:

www.example.com/bar/?default=foo

Here's the code I'm using in the .conf:

# I'm including these here just in case they're related
## Force https://www
RewriteCond %{HTTP_HOST} !^(?:www|ww2|images)\. [NC]
RewriteCond %{REQUEST_URI} !^/cgi-bin/chat.cgi
RewriteRule ^ https://www.%{HTTP_HOST}%{REQUEST_URI} [R=301]

## Force trailing /
RewriteCond %{HTTP_HOST} !^images\. [NC]
RewriteCond %{REQUEST_URI} !^/includes

# empty implies they're on the homepage
RewriteCond %{REQUEST_URI} !^$

RewriteCond %{REQUEST_URI} !(?:\..{2,4}|/)$
RewriteRule ^(.+)$ /$1/ [QSA,R=301,PT]


##### this is the main section, though
#
# only apply if the domain is example.com
RewriteCond %{HTTP_HOST} (?:www\.)?example\.(?:co|net) [NC]

# and the default parameter doesn't already exist
RewriteCond %{QUERY_STRING} !default=

# and it's not a URL that shouldn't need it
RewriteCond %{REQUEST_URI} !^/(?:ads|images|includes|cgi-bin)

# allow for the ~example in case I need it when moving to a new server
RewriteRule ^(/~example)?/(\w+)/(.*)$ $1/$3?default=$2 [QSA]


If I use [PT] here and then have another rewrite later in the .conf, the second rewrite doesn't run and I get a 404. If I leave it off then it rewrites as expected.

The issue, though, is when there's NOT a second rewrite... then I get a 404 error! But there's nothing in my error log, either.

I'm lost on this one. Do I need to add a RewriteRule at the end of the .conf as a safety net or something?

csdude55

8:45 am on Nov 19, 2020 (gmt 0)




Update... at the very end of the .conf file, I added this just before the ErrorDocuments:

## Safety net for "default"
RewriteCond %{HTTP_HOST} (?:www\.)?example\.(?:co|net) [NC]
RewriteCond %{QUERY_STRING} (?:^|&)default=
RewriteRule ^ - [PT]

That seemed to work, although it feels like a workaround rather than a solution...

And for future reference, based on some earlier conversations with Lucy I also changed:

# original
RewriteCond %{QUERY_STRING} !default=

# corrected
RewriteCond %{QUERY_STRING} !(?:^|&)default=
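For anyone following along, here's a sketch of why the anchor matters (the query strings are made-up examples):

```apache
# Unanchored: the pattern "default=" also matches inside "?mydefault=1",
# so the negated condition would fail and the rule would wrongly be skipped:
RewriteCond %{QUERY_STRING} !default=

# Anchored: "default" must start the query string or follow an "&",
# so "?default=foo" and "?a=1&default=foo" match, but "?mydefault=1" doesn't:
RewriteCond %{QUERY_STRING} !(?:^|&)default=
```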

w3dk

12:54 pm on Nov 19, 2020 (gmt 0)




.... pointed out that [L] isn't really applicable in the .conf files, I had to change them to [PT]. So that's where I am now.


I think some context is missing here? As a general rule that doesn't make sense. The "L" flag IS applicable in a server context. Although "PT" (passthrough) implies "L", so [PT] is the same as [PT,L].

"PT" is required in a server context when you need the result of the rewrite to be treated as a URI and passed back through the rewrite engine - to be processed by other rewrites, Alias, etc. This is the default behaviour in a directory (ie. .htaccess) context, so it's not required in .htaccess.

However, you shouldn't need to be using "PT" on all your rewrites (this is what causes problems for many in a directory context and you end up having to use additional directives to prevent rewrite loops etc).
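To illustrate the point about Alias, a minimal sketch (the Alias path and URLs here are hypothetical):

```apache
# In a server context, a rewrite target is treated as a filesystem path
# by default. PT tells Apache to treat it as a URL-path instead, so that
# mod_alias (and any later per-directory rewrites) can still act on it:
Alias "/docs" "/var/www/manual"
RewriteEngine On
RewriteRule "^/help/(.*)$" "/docs/$1" [PT]
# Without PT, Apache would look for a file literally named "/docs/..."
```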


RewriteRule ^(.+)$ /$1/ [QSA,R=301,PT]


It doesn't really make sense to use "PT" with an external redirect... just use "L" instead. (Although it won't make any difference in this instance.)

The QSA flag is superfluous here.

However, this will result in a double slash at the start of the URL-path (when used in a server context). You need to either remove the slash from the capturing subpattern, or remove the slash from the start of the substitution string.
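A sketch of the two fixes just described; either one avoids the doubled slash in a server context:

```apache
# Option 1: don't capture the leading slash...
RewriteRule ^/(.+)$ /$1/ [R=301,L]

# Option 2: ...or don't repeat it in the substitution
RewriteRule ^(/.+)$ $1/ [R=301,L]
```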



# allow for the ~example in case I need it when moving to a new server
RewriteRule ^(/~example)?/(\w+)/(.*)$ $1/$3?default=$2 [QSA]


www.example.com/foo/bar/



Unless "bar/" is a valid file (seems unlikely) then this requires further rewriting I assume? (Which you imply by having to remove the PT flag.)

If you don't include the "L" or "PT" or "END" flags then processing naturally continues to the next rule... the output of the previous is passed to the next. However, when you use "PT" then (as mentioned above) the substitution string is treated as a URI and the rewriting process starts over, which can change the URL-path you are expecting to match.





Update... at the very end of the .conf file, I added this just before the ErrorDocuments:


That code block doesn't really do anything except perhaps cause the rewrite engine to start over. Which perhaps suggests your directives are in the wrong order?


Not that it really matters, but it would be more logical (readable) to define your ErrorDocuments first.

w3dk

1:23 pm on Nov 19, 2020 (gmt 0)




Just a couple of other things...


# I'm including these here just in case they're related
## Force https://www
RewriteCond %{HTTP_HOST} !^(?:www|ww2|images)\. [NC]
RewriteCond %{REQUEST_URI} !^/cgi-bin/chat.cgi
RewriteRule ^ https://www.%{HTTP_HOST}%{REQUEST_URI} [R=301]


You should include the "L" flag on the external redirect. Otherwise processing is going to continue through your directives, which is generally a bad thing (or at best unnecessary) when triggering an external redirect.

Note that your comment states "Force https://www", but the rule only strictly forces www. (Although the HTTP to HTTPS redirect should be in the vHost, preferably using a mod_alias Redirect)
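For reference, a minimal sketch of that vHost approach (the hostnames are placeholders):

```apache
<VirtualHost *:80>
    ServerName example.com
    ServerAlias www.example.com
    # mod_alias: send every plain-HTTP request to the canonical HTTPS host
    Redirect 301 / https://www.example.com/
</VirtualHost>
```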


# empty implies they're on the homepage
RewriteCond %{REQUEST_URI} !^$


This is not true (regardless of whether you are in a server or directory context). This condition will always be successful since the REQUEST_URI server variable contains the root-relative URL-path starting with a slash. It always starts with a slash; it is never empty.
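If the intent really was to exclude the homepage, the test in a server context would be against the root path instead (a sketch):

```apache
# REQUEST_URI is never empty; the homepage is "/"
RewriteCond %{REQUEST_URI} !^/$
```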

lucy24

7:40 pm on Nov 19, 2020 (gmt 0)




On the [PT] issue: The docs don't say so clearly, but going by the results you reported in the previous thread in which the issue came up, this seems to be only applicable to internal rewrites, not to external redirects. So a rule that in htaccess would end in [L] alone would instead say [PT], but a rule that previously ended in [R=301,L] should stay that way.

This part...
## Force trailing /
RewriteCond %{HTTP_HOST} !^images\. [NC]
RewriteCond %{REQUEST_URI} !^/includes
# empty implies they're on the homepage
RewriteCond %{REQUEST_URI} !^$
RewriteCond %{REQUEST_URI} !(?:\..{2,4}|/)$
RewriteRule ^(.+)$ /$1/ [QSA,R=301,PT]

Is the idea here that all your URLs end in / (as if they were directories, but they aren't real physical directories) rather than extensionless? This is much easier if your URLs happen not to contain literal . periods, because then you can probably reduce the whole thing to a conditionless

RewriteRule ^/([^.]+[^./])$ https://example.com/$1/ [R=301,L]

replacing “example.com” with whatever you're using to support multiple hostnames on a single file. (This, incidentally, makes canonicalization a pain and may lead to chained redirects, but that's another issue.)

csdude55

7:27 am on Nov 20, 2020 (gmt 0)




I think some context is missing here? As a general rule that doesn't make sense. The "L" flag IS applicable in a server context. Although "PT" (passthrough) implies "L", so [PT] is the same as [PT,L].

"PT" is required in a server context when you need the result of the rewrite to be treated as a URI and passed back through the rewrite engine - to be processed by other rewrites, Alias, etc. This is the default behaviour in a directory (ie. .htaccess) context, so it's not required in .htaccess.

However, you shouldn't need to be using "PT" on all your rewrites (this is what causes problems for many in a directory context and you end up having to use additional directives to prevent rewrite loops etc).

Hmm. OK, @w3dk, mea culpa. I guess that in my case the previous issue I had was "solved" by simply replacing all of the [L] flags with [PT], so my mind said "tr// all those little suckers!" LOL

So let me see if I've got this straight:

1. In .htaccess, if I use [L] then the server gets to that line, performs the rewrite, then immediately kicks out and starts the .htaccess back over.

2. In .conf, if I use [L] then the server gets to that line, performs the rewrite, then stops.

3. In .conf, if I use [PT] then the server gets to that line, performs the rewrite, then immediately kicks out and starts the .conf back over.

Is that right?


# allow for the ~example in case I need it when moving to a new server
RewriteRule ^(/~example)?/(\w+)/(.*)$ $1/$3?default=$2 [QSA]

www.example.com/foo/bar/


Unless "bar/" is a valid file (seems unlikely) then this requires further rewriting I assume? (Which you imply by having to remove the PT flag.)

If you don't include the "L" or "PT" or "END" flags then processing naturally continues to the next rule... the output of the previous is passed to the next. However, when you use "PT" then (as mentioned above) the substitution string is treated as a URI and the rewriting process starts over, which can change the URL-path you are expecting to match.

Well...

I originally used [PT] on this RewriteRule, but what happened was that pages with no later rewrites worked while pages that did have later rewrites didn't. I removed [PT] and it reversed: pages with no later rewrites threw an error, while those with rewrites worked!

As an explanation, at example.com I prompt the user to select a category. Then, any subsequent page that the user views includes the category as the first parameter in the URL; eg,

example.com/category/page/

But I also have several domains that correspond to each category, so if they choose one of those then the category is already determined (meaning, it's no longer separate in the URL); eg,

category.com/page/

So there will never be a literal file at example.com/category/, it will always be a parameter.
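A hypothetical sketch of what that per-category vHost rewrite might look like (the hostname and category value here are made up, not from the actual config):

```apache
# On a category-specific domain the category is implied by the host,
# so it can be injected as the "default" parameter directly:
RewriteCond %{HTTP_HOST} ^(?:www\.)?category\.com$ [NC]
RewriteCond %{QUERY_STRING} !(?:^|&)default=
RewriteRule ^/(.*)$ /$1?default=category [QSA]
```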

## Safety net for "default"
RewriteCond %{HTTP_HOST} (?:www\.)?example\.(?:co|net) [NC]
RewriteCond %{QUERY_STRING} (?:^|&)default=
RewriteRule ^ - [PT]

That code block doesn't really do anything except perhaps cause the rewrite engine to start over. Which perhaps suggests your directives are in the wrong order?

I can't honestly say that I understand why it "works", either. I was originally thinking that it would get to this line and then "stop", but now I understand that my logic was wrong. But still, it works...?

I currently have this rule near the top (for testing I removed a lot of stuff, so first I force the www and trailing /, then this), just like I have it in the .htaccess. After that I rewrite to have subjects in the URL for message board posts, then the "safety net", then the ErrorDocument lines and the AddHandler to force PHP 5.6.

For the sake of testing I removed it from the top, replaced the safety net with it, and added [PT]. The result was that pages that had rewriterules earlier in the .conf returned 404 errors, while pages without rewrites worked just fine. On the sites other than example.com, everything worked just fine.

Then, just for the heck of it, I replaced [PT] with [L]. Same result, though.

So I don't think it's an issue of order? But I have no explanation for why it actually works, which makes me worry that this isn't a good solution... even if it's not slower, it may not continue to work long term.

csdude55

7:39 am on Nov 20, 2020 (gmt 0)




Is the idea here that all your URLs end in / (as if they were directories, but they aren't real physical directories) rather than extensionless? This is much easier if your URLs happen not to contain literal . periods, because then you can probably reduce the whole thing to a conditionless

RewriteRule ^/([^.]+[^./])$ https://example.com/$1/ [R=301,L]


replacing “example.com” with whatever you're using to support multiple hostnames on a single file. (This, incidentally, makes canonicalization a pain and may lead to chained redirects, but that's another issue.)

Awesome! That's exactly the plan, and it's much cleaner :-) The main condition in mine was:

RewriteCond %{REQUEST_URI} !(?:\..{2,4}|/)$

but I was testing that it didn't end with a . followed by 2-4 characters OR a /; that eliminated my JS, CSS, PHP, CGI, and image formats. And since I already had that condition, I plugged in exceptions for images.example.com and example.com/includes (which are all PHP scripts). But you're right, I don't have any URLs with a literal period, so it was probably overkill.

Is there any reason to explicitly define the domain? In 2 or 3 tests, this seems to be working OK:

RewriteRule ^/([^.]+[^./])$ /$1/ [R=301,L]


Oh, and thanks for the other catches, @w3dk! I've corrected those :-) And you're right about it not forcing https... that comment was a leftover from my .htaccess, but now I force https:// serverwide.

lucy24

5:56 pm on Nov 20, 2020 (gmt 0)




Yes, a redirect without host and protocol specified will simply reuse whatever was in the original request. This does potentially result in a chained redirect if some part was non-canonical. But I'd consider it a low-concern problem, since we're now dealing with visitors who knowingly and intentionally requested a nonexistent URL in the first place, so they deserve everything they get.

More: In the case of real, physical directories, any request for /directory without slash will be handled by mod_dir issuing a 301 redirect to /directory/ with slash. Most search engines will periodically request /directory as part of their (to put it tactfully) entrapment routines, in the same way that they periodically request the wrong hostname and/or protocol. A handful of crawlers--Applebot especially comes to mind--make a real habit of requesting /directory without slash--possibly thinking that everyone ought to be extensionless. Of course, they've no way of knowing if the URL in question is or is not a real directory. In general, there's no reason to humor them; let them take their double redirect and lump it.
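As an aside, that mod_dir behaviour is controlled by the DirectorySlash directive; a sketch, with a made-up path:

```apache
# mod_dir redirects /directory -> /directory/ by default (DirectorySlash On).
# It can be disabled where needed, though the docs warn this has security
# implications (autoindex may serve a listing at the slashless URL):
<Directory "/var/www/html/archive">
    DirectorySlash Off
</Directory>
```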

csdude55

11:25 pm on Nov 20, 2020 (gmt 0)




Over the years I've noticed a handful of requests for virtual directories without the slash, even though all of my links include them. So this rule is really just a safety net to make sure everyone ends up where they're supposed to, and maybe to ensure that Google doesn't penalize me for having duplicate content (which is the same logic behind forcing the www). They might not even do that anymore, for all I know... but it doesn't hurt to "fix" it, either way.

Do you have any other thoughts on the safety net rule I had to write? I have it running on the live site now (for about 16 hours) and so far no complaints, but I'm very much worried about it!