How to implement a "transfer/gateway" page

Apache, PHP

         

eric76

9:28 pm on Jan 8, 2015 (gmt 0)

10+ Year Member



Hi,

I'm rather a beginner at web development, so I would be really glad if someone could point me in the right direction on how to solve the following problem.

Basically, I have a website that is being replaced. But rather than deleting the old material and putting in the new, I need to keep the old material visible at its original URLs for public archive purposes. To warn visitors that this material is "outdated", and that they probably want to move on to the new site, I have a "transfer" page, or "gateway", or whatever it should be called: basically a web page with the above-mentioned information and a choice of two links, the first redirecting to the new site, the other proceeding to the resource that was requested.

My question is how to solve this neatly on the server. My current solution is ugly: I replaced every resource on the old site with a copy of one and the same PHP script, which implements the transfer page. The script uses a session variable to remember whether the visitor wants to go to the archive. If so, any further requests are automatically redirected to a copy of the original folder, where all the original files reside. I think this is an ugly solution for several reasons.

In my mind, a nicer solution would be to configure the web server itself, so that any request for a URL matching some pattern that describes the old site goes to the script. This way, the script need only exist in one file, and the original files can remain in their original locations.

I'm searching for info on how to do this. But maybe there is an even better solution? For instance, is it possible to configure the web server so as to eliminate the call to the script completely once the visitor has affirmed that he wants to go to the archive?

Thanks!

eric76

12:14 am on Jan 9, 2015 (gmt 0)

10+ Year Member



Ok, I might be able to answer my own question, as I have been looking for information and stumbled across rewrite rules. I figured out that I can do a rewrite rule in the Apache config file:

RewriteRule ^(my.domain/my/path/.*)$ my.domain/transfer_script?url=$1

Could anyone who is knowledgeable in the subject confirm that this is a reasonable solution, or even a standard one? Then I would feel more confident ^^.

Thanks!

phranque

1:20 am on Jan 9, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld, eric76!


why not include a prominent link to the new site from the legacy site?
are both sites on the same hostname?
do you want the legacy site indexed by search engines?

lucy24

2:56 am on Jan 9, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteRule ^(my.domain/my/path/.*)$ my.domain/transfer_script?url=$1

<quibble>
I'd say
RewriteRule ^my\.domain/my/path/(.+) my.domain/transfer_script?url=$1 [L]

No point in including material in the query that will always be the same. Use the [L] flag unless you've got a compelling reason to omit it.
</quibble>

:: detour to look up format of RewriteRule target when lying loose in config file just for my own future reference ::

eric76

3:32 am on Jan 9, 2015 (gmt 0)

10+ Year Member



Hi and thanks!

phranque,

We were thinking of "just" having a link included let's say in an info banner at the top of the old pages, but decided that in order to avoid having to inject information into existing pages, and to make it really certain that no one can miss that they are visiting old pages, they should get the transfer page "in their face", which forces them to make a conscious decision by clicking one of two links. Only if the visitor clicks "proceed to old page, I know what I'm doing", should they get to it.

I have however realized a critical problem with my approach (although I haven't been able to test it yet): if I use a rewrite rule that sends all old URLs to the transfer page, then even if the transfer script detects that the visitor wants to see the old pages, a redirect to the requested URL will fail, because the rewrite rule now turns on me and sends any redirect to an old page back to the transfer page. The rewrite rule that at first was my friend now becomes my worst problem, as I don't know of any way of overriding it and saying, "well now, you should actually go to that page, because the visitor said he wants to".

So is there any way of using rewrite rules such that all requests are rewritten at first, but the rewrite is skipped for those visitors (based on their session, I presume) who have confirmed that they do indeed want to see the old material?

The legacy site and the new site have different host names.

We want the legacy pages still indexed - everything should be the same, except for the information/decision transfer page. This is why we came up with the "transfer page" idea to start with.


lucy24, thanks for correcting some of the details in my rule suggestion!

eric76

4:09 am on Jan 9, 2015 (gmt 0)

10+ Year Member



Ok, I have some new hope again, so the story continues... :)

It is indeed apparently possible to set conditions based on cookie content for rewrite rules (https://www.howtoforge.com/community/threads/using-cookie-content-as-rewriterule-variable-in-apache-reverseproxy.60254/).

I will have to read up on that, and figure out exactly how to set up the cookie from the PHP-script.
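A cookie-gated rewrite of the kind described here might be sketched as follows. This is only an illustration under assumptions: the cookie name archive_ok, the script path /transfer_script.php and the area /my/path/ are placeholders, and the rules are assumed to sit in the server or virtual-host config, where the RewriteRule pattern is matched against the URL-path only (leading slash, no hostname).

```apache
RewriteEngine On

# Only send the visitor to the gateway when the (hypothetical) cookie
# "archive_ok=1" is NOT present in the Cookie request header.
RewriteCond %{HTTP_COOKIE} !(^|;\s*)archive_ok=1(;|$)

# Internal rewrite of everything under /my/path/ to the gateway script,
# passing the requested path along in the query string.
RewriteRule ^/my/path/(.+) /transfer_script.php?url=$1 [L]
```

Once the script has set the cookie, later requests fail the condition and the original files are served from their original locations. If the hostname needs to be checked as well, that would be a separate RewriteCond on %{HTTP_HOST}, since the rule pattern itself never sees the hostname.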

Cheers

lucy24

5:34 am on Jan 9, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



and figure out exactly how to set up the cookie from the PHP-script

You can also set cookies right in mod_rewrite if that turns out to be more practical. Alternatively, you can set environmental variables-- either via mod_setenvif or mod_rewrite-- and use them in a RewriteCond.
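Both options can be sketched roughly like this. The names are assumptions for illustration only: /confirm-archive is a hypothetical URL behind the "proceed" link, archive_ok is a made-up cookie and variable name, old.example.com stands in for the legacy hostname, and the rules are assumed to live in the server or virtual-host config.

```apache
RewriteEngine On

# Variant 1: set the cookie from mod_rewrite itself.
# Flag format: CO=name:value:domain:lifetime:path  (lifetime is in minutes)
# Here the cookie lasts one day and is limited to /my/path/.
RewriteRule ^/confirm-archive$ /my/path/ [CO=archive_ok:1:old.example.com:1440:/my/path/,R=302,L]

# Variant 2: turn the cookie into an environment variable with mod_setenvif,
# then test the variable instead of the raw Cookie header.
SetEnvIf Cookie "(^|;\s*)archive_ok=1(;|$)" archive_ok=1
RewriteCond %{ENV:archive_ok} !=1
RewriteRule ^/my/path/(.+) /transfer_script.php?url=$1 [L]
```

The two variants are independent of each other; whether the cookie is set by mod_rewrite or by the PHP script, the check in variant 2 stays the same.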

eric76

2:20 pm on Jan 9, 2015 (gmt 0)

10+ Year Member



Thanks lucy, I will look into that.

More specifically, a visitor can visit different areas of the legacy site, and each of these areas is treated independently, i.e. if a visitor has confirmed the intention to look at old material in one area ("root path"), the transfer page should still pop up again if the visitor decides to visit another area. So as I see it, the rewrite condition needs to be able to compare the area the visitor is requesting against whether that visitor has a cookie confirming the intention to go into that area.

I feel that a programming language would be needed for that. I suspect mod_rewrite can do it, but it will probably get messy unless I come up with some clever simplifying idea. Maybe I should have a cookie for each area (per user)?
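The one-cookie-per-area idea could look something like the sketch below. Everything named here is hypothetical: the areas /area1/ and /area2/, the cookies ok_area1 and ok_area2, and the script /transfer_script.php with its area and url parameters. Server or virtual-host context is assumed.

```apache
RewriteEngine On

# Area 1: show the gateway unless this area's own cookie is present.
RewriteCond %{HTTP_COOKIE} !(^|;\s*)ok_area1=1(;|$)
RewriteRule ^/area1/(.*) /transfer_script.php?area=area1&url=$1 [L]

# Area 2: same pattern, different cookie, so confirming area 1
# does not unlock area 2.
RewriteCond %{HTTP_COOKIE} !(^|;\s*)ok_area2=1(;|$)
RewriteRule ^/area2/(.*) /transfer_script.php?area=area2&url=$1 [L]
```

An alternative that avoids separate names is to give the cookie a path when setting it: a browser only sends a cookie back for URLs under that cookie's path, so a cookie limited to /area1/ would never show up on requests for /area2/.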

lucy24

7:59 pm on Jan 9, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



By default, a cookie applies only to the directory of the page that set it (and anything below it). If you've got a large site, that could add up to a colossal number of cookies. But mod_rewrite defaults to whole-site cookies unless you specify a path.

For comparison purposes, here's what cookies in mod_rewrite [httpd.apache.org] (link for 2.2) might look like. First part:

RewriteCond %{HTTP_USER_AGENT}
{here I list assorted very old browsers}
RewriteCond %{HTTP_COOKIE} !oldbrowser
RewriteCond %{HTTP_REFERER} !\?
{search-engine queries are assumed to be human unless a different rule kicks in later}
RewriteCond %{ENV:qiniq} !.
{environmental variable covering assorted further exceptions}
RewriteRule (^|\.html|/)$ http://example.com/boilerplate/goaway.html [R=301,L]


Second part (note absence of [L] flag):

RewriteCond %{HTTP_USER_AGENT}
{identical list of old browsers here}
RewriteRule (^|\.html|/)$ - [CO=oldbrowser:1:.example.com:525600]


How it works:

First part. Before the request reaches mod_rewrite, it passed through mod_setenvif. (This is the de facto sequence on my server. You may want to double-check on your own.) At this point, certain IPs* trigger an environmental variable. In the course of mod_rewrite, requests for pages (only) trigger a check for assorted elderly browsers, followed by a check for various exceptions including the environmental variable. If no exception applies, the request is redirected to a page called /goaway.html. For you, this would instead be something like /redirect.php?url=%1 where the very last RewriteCond is %{REQUEST_URI} so the server doesn't have to capture unless all conditions apply.

Second part. Those requests that do fit into one of the possible exceptions will carry on through the rest of mod_rewrite, winding up with a rule that sets a cookie. If you were setting a different cookie for each page, that would be a further : colon-delimited group immediately after the one setting the cookie's lifetime. (Careful! mod_rewrite uses minutes, not seconds.) Future requests for the same page will then coast on past because the cookie will be found.

Disclaimer: I don't know if you can use $1 or %1 in a flag, setting the "path" part of the cookie. If not, it will be easier to rewrite (not redirect) to a php script that gets the information, sets the cookie and then proceeds to the redirect page.
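Adapted to the transfer-page case, the first part might be sketched as below. This is an assumption-laden illustration, not a tested ruleset: archive_ok is a made-up cookie name, /redirect.php is the script name suggested above, and a 302 is used instead of the 301 in the original example so that browsers are less likely to cache the redirect permanently.

```apache
RewriteEngine On

# Skip the gateway when the visitor already carries the cookie.
RewriteCond %{HTTP_COOKIE} !(^|;\s*)archive_ok=1(;|$)

# Last condition: capture the full request path as %1 for use in the target.
# Conditions are only evaluated after the rule pattern has matched.
RewriteCond %{REQUEST_URI} ^(.*)$

# Pages only: an empty path, *.html, or a URL ending in a slash.
RewriteRule (^|\.html|/)$ /redirect.php?url=%1 [R=302,L]
```

The script at /redirect.php is then responsible for showing the two links and for setting the cookie when the visitor chooses the archive.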


* On this specific site, I poke all possible holes for various IP ranges in northern Canada. It's a short, finite list.

eric76

2:35 am on Jan 11, 2015 (gmt 0)

10+ Year Member



Lucy24, thanks for an interesting example. I do as you mentioned at the end: redirect to a PHP script that sets the cookie. Just as well, as the script is needed there anyway to function as the transfer page. And I also set the path of the cookie as you said.

I read somewhere that a rule like "RewriteRule . new_url" will rewrite all URLs, but that doesn't work if it resides in /some/path/.htaccess and the URL ends with /some/path/ i.e. ends with a slash. For some reason I need to do "RewriteRule (.*) new_url" to fix that case. But that seems to trip my rules in other ways.

Lucy, you suggested
RewriteRule ^my\.domain/my/path/(.+) etc

Indicating that you don't like the lonely trailing slash either. Can someone explain what's going on with it? Should I perhaps have a rule that removes it, in case there is nothing after it?

Thanks!

lucy24

3:46 am on Jan 11, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Edit: Well, ###, it looks as if I was composing a very long reply even while you were deleting the material I was replying to. So now I'll have to go back and see what you ended up deciding to ask.

but not for
some/path/

That is very odd indeed. So let's look at what is supposed to happen:

What's going on with the trailing slash?

If you put in a request for
/some/path
then mod_dir kicks in and issues a 301 redirect to
/some/path/
and then mod_dir again looks to see whether you have a DirectoryIndex line. 999 times out of 1000 this will be something like "index.html" or "index.php" referring to a file in the directory you're aiming for.

Those are the default behaviors for mod_dir. And at this point I need to double-check: Is this some type of shared hosting, or is it your own server?

The most likely way I can think of that a request for
/some/path/
with trailing slash would not activate the htaccess located in this directory is if you have a non-standard DirectoryIndex such as
DirectoryIndex /some/otherpath/index.php

Then a request for URL /some/path/ would end up being a request for a file located in /some/otherpath/ possibly subject to a different htaccess.

If it's your own server, there are other possible explanations involving mod_alias. But there's still the question of why /some/path would behave differently from /some/path/ since by default the first form redirects to the second ... unless you've set
DirectorySlash off

This can be done in htaccess, but I can't imagine a host ever doing it by default for the whole site. See Apache docs [httpd.apache.org] for dire warnings. Short version: auto-indexing works with or without the trailing slash-- assuming you've enabled it-- but DirectoryIndex (using a physical file) only works with the trailing slash.
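For reference, the mod_dir behavior described in this post corresponds to directives along these lines. The values shown are the documented defaults as far as I know; a given host may well have changed them, so treat this as a sketch to compare against the actual config.

```apache
# mod_dir defaults, spelled out explicitly for reference.

# A request for /some/path (a directory, no slash) is answered with a
# 301 redirect to /some/path/.
DirectorySlash On

# A request for /some/path/ is then served from the first index file
# found in that directory.
DirectoryIndex index.html

# The non-standard form mentioned above would instead send every directory
# request to one fixed file elsewhere on the site:
# DirectoryIndex /some/otherpath/index.php
```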

That's why it becomes necessary to know whose server it is.

:: detour here to read up on FallbackResource which I could swear I've never heard of in my life, but there it is in 2.2, with further detour to query confusing wording ::



And now for the revised reply to the revised post:
I read somewhere that a rule like "RewriteRule . new_url" will rewrite all URLs, but that doesn't work if it resides in /some/path/.htaccess and the URL ends with /some/path/ i.e. ends with a slash.

It shouldn't make any difference unless you've got a weird DirectoryIndex line. No, wait, there's one more exception. I've never found this spelled out in the docs, but I noticed it in the course of playing around on my test site:

When a RewriteRule ends in some kind of [R] flag, the "pattern" seems to apply only to the visible URL. When it ends in an [L] flag alone, the pattern may also apply to internal requests -- such as /index.html in a directory. So a request for some directory, ending in / slash, either will or will not trigger a rule whose pattern ends in "html":
/path/morepath/
may or may not be interpreted as
/path/morepath/index.html
depending on flag. Hence the difference between .* and .+

Again, I'm describing what I've personally seen in action, not what I've found in docs.
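One further possibility, offered as a hedged note rather than a diagnosis of this particular server: in per-directory context (htaccess), mod_rewrite strips the directory's own path from the URL before matching, so a request for the directory itself leaves an empty string to match against. A pattern of "." needs at least one character and cannot match an empty string, while ".*" or "^" can. The sketch below assumes an htaccess file in /some/path/ and a hypothetical target /gateway.php that lives outside that directory.

```apache
# /some/path/.htaccess
RewriteEngine On

# Request for /some/path/page.html  -> string tested is "page.html"
# Request for /some/path/           -> string tested is "" (empty)

# Would NOT match the empty string, so /some/path/ itself slips through:
# RewriteRule . /gateway.php [R=302,L]

# Matches every request, including the empty string:
RewriteRule ^ /gateway.php [R=302,L]
```

This is also why rules aimed at a directory's own URL in htaccess are commonly written with the pattern ^$.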

RewriteRule ^my\.domain/my/path/(.+) etc

Oh, right, you are the one with his own server. I have to remember that: It's significant in mod_rewrite because in per-directory context (including htaccess) the domain name would never be part of the pattern.

Should I perhaps have a rule that removes it, in case there is nothing after it?

No, certainly not, or mod_dir will go insane. But I think I did end up answering the question, even if you reworded the whole thing while I was typing :)

When choosing between .+ and .* you should think about which forms can actually occur, and under what circumstances. For example: a rule meant to cover anything and everything in a particular directory, such as a global redirect, is generally
/directory/.*
But say you're looking for anything with html extension. You wouldn't say
.*\.html
because if someone is putting in a request for "example.com/.html" there's no point in humoring them even for a moment.

eric76

7:23 am on Jan 11, 2015 (gmt 0)

10+ Year Member



Omg, Lucy, I'm so sorry that I troubled you like this :O, you are too kind :). I was pulling my hair out trying to understand what was going on, but in the end I understood that the htaccess was in fact executed; the rule just wouldn't fire for some reason in those trailing / cases, and so I rephrased my question more concisely.

A very simple

RewriteRule . some_url [R,L]

will not get executed if the path goes to where the htaccess resides, with a trailing slash. In that case it goes to the index.html of that URL (since there is one). I haven't touched DirectoryIndex in my htaccess - maybe it is set in the server conf file.

Since I get it to work with the (.*) pattern I will leave it for now. I have a hundred more things trying to fit in my brain concerning rewriting; truly, this is a mind-boggling subject. For instance, the following may be an interesting challenge for you. I'm used to "normal" programming languages, and I'd like to get this kind of rewriting semantics:

IF Cond A THEN {
IF Cond B THEN Rule F
} ELSE IF Cond A' THEN {
IF Cond B' THEN Rule F
}

The only way I have come up with so far of formulating this is:

RewriteCond A
RewriteCond B
RewriteRule F [L]

RewriteCond !A
RewriteCond A'
RewriteCond B'
RewriteRule F [L]

It took me a while to realize that the !A was necessary in the else-part, as otherwise the case "A and not B and A' and B'" could fire the second part.

I wouldn't be surprised if there are a number of ways of writing this in mod_rewrite, I'm just wondering if my approach is reasonable.

Thanks again!

lucy24

8:53 am on Jan 11, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



IF Cond A THEN {
IF Cond B THEN Rule F
} ELSE IF Cond A' THEN {
IF Cond B' THEN Rule F
}

Well, you could move up to Apache 2.4 which supports <If> envelopes, but that's a pretty extreme measure :) RewriteConds just have two operators: AND (the default, always implied) and OR (meaning "either this line or the next one" -- sort of a glorified pipe | for situations where the conditions look at different areas).

So let's unpack this version:

IF Cond A
 {
 IF Cond B THEN Rule F
 }
ELSE IF Cond A'
 {
 IF Cond B' THEN Rule F
 }

Is this intended to be different from

IF (A and B)
OR (A' and B')
?

Remember that the [L] flag means "stop rewriting here and start over from the beginning". So once the rule has executed, the requested URI will no longer be the same, and the conditions won't even be evaluated. Is it that you never want to check for A' and B' if A is present (but B isn't, or the rule would already have executed)?

There are further options involving [C] (chain) and [S] (skip). But these require careful attention, since you're now treating rulesets as a package: you can't just rearrange them at random, or add or delete a rule, without changing anything else.

It may be simpler to lay out in English* what the various circumstances are. And, of course, you can always shovel certain requests into a php script where you can nest parentheses to your heart's content ;) But let's see what we can achieve in mod_rewrite first.


* Which I now realize is not your native language, but work with me here.

eric76

11:48 pm on Jan 11, 2015 (gmt 0)

10+ Year Member



I was just waiting for admin to make the reasonable move of this subject to somewhere more technically oriented - I didn't know it would get this gory along the way ^^.

Thanks again lucy24, for your input. Indeed, that I "never want to check for A' and B' if A is present" is exactly the case. That is a small but important difference between my if-structure and "IF (A and B) OR (A' and B')".

An English explanation would be this:
A checks whether URL is "some/path/subfolder"
B checks that the cookie does not contain that path
If both of these are true, then I want to rewrite

A' checks whether URL is in "some/path"
B' checks that the cookie does not contain that path
If both of these are true, then again, I want to rewrite

But if A and !B is true in the first section, then I want everything to stop. I did use the [L] flag, but realized it doesn't help, as the logic for when I want it is "negative". If the rewrite rule does *not* apply, then I want the L flag.

I suspected that you might mention "chaining" and "skip". I saw these flags during my furious study of rewriting, but it was a bit too much to take in that quickly. If you have some advice then I'll gladly take notes, otherwise I'll just let it be, although I do think it would be possible to do it in a nicer way. For instance, in this case A' is just a substring of A, so I was thinking that this might be possible to fix already in A with a tricky regex instead. But I felt that the regexes would get too complex; they are mind-boggling enough as they are.

I suspect that tomorrow, Monday, I will not be asked to delve deeper into the subject, and will just leave it if it works ^^.

Thanks so much!

lucy24

12:33 am on Jan 12, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



somewhere more technically oriented

Well, we're already in the Apache subforum, so you can't get much more technical. Unless we started out somewhere else and I wasn't paying attention. (I work in "Recent Posts" mode so I don't always know what subforum I'm in.)

Indeed, that I "never want to check for A' and B' if A is present" is exactly the case.

If so, the !A line is the right approach to take.
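Written out with concrete tests, the two rules with the !A guard might look like the sketch below. The paths come from the description earlier in the thread; the cookie names ok_subfolder and ok_path, the script /transfer_script.php and its area parameter are assumptions made for illustration.

```apache
RewriteEngine On

# First rule: A = request is inside the subfolder, B = no cookie for it.
RewriteCond %{REQUEST_URI} ^/some/path/subfolder/
RewriteCond %{HTTP_COOKIE} !(^|;\s*)ok_subfolder=1(;|$)
RewriteRule ^ /transfer_script.php?area=subfolder [L]

# Second rule: !A keeps subfolder requests out, then A' and B'.
RewriteCond %{REQUEST_URI} !^/some/path/subfolder/
RewriteCond %{REQUEST_URI} ^/some/path/
RewriteCond %{HTTP_COOKIE} !(^|;\s*)ok_path=1(;|$)
RewriteRule ^ /transfer_script.php?area=path [L]
```

A visitor inside the subfolder who already holds ok_subfolder fails the first rule on B and the second rule on !A, so the request passes through untouched, which is the "A and not B means stop" behavior asked for.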

Since you've got two parallel and independent rules, you may like to list them in order of "most likely to succeed". If it turns out that the form involving A' and B' happens more often, then put that one first with a !A' condition on the other rule.

Within each ruleset, list conditions in order of "most likely to fail". There's no point in looking up something that will be true 90% of the time if you also need something that's only true 5% of the time.

The good news is: I don't see any benefit to [C] or [S] in this situation.

Do make sure the body of the rule contains as much information as possible. In particular: if any of your specified conditions (A, B, etcetera) refers to a URL path, like "only try this rule if the request was for something in /directory/subdir/", then that should go in the rule itself, not in a condition. That way, the server knows right away that it won't need to evaluate conditions on most requests.
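As a rough sketch of that advice, the path tests can be folded into the rule patterns, leaving only the cookie check as a condition. The assumptions here are server or virtual-host context (hence the leading slash in the patterns), hypothetical cookie names ok_subfolder and ok_path, and a hypothetical /transfer_script.php; the negative lookahead is ordinary PCRE syntax and stands in for a separate "not in the subfolder" condition.

```apache
RewriteEngine On

# Requests inside the subfolder: pattern does the path test,
# the single condition does the cookie test.
RewriteCond %{HTTP_COOKIE} !(^|;\s*)ok_subfolder=1(;|$)
RewriteRule ^/some/path/subfolder/ /transfer_script.php?area=subfolder [L]

# Requests elsewhere under /some/path/: the lookahead (?!subfolder/)
# excludes the subfolder directly in the pattern.
RewriteCond %{HTTP_COOKIE} !(^|;\s*)ok_path=1(;|$)
RewriteRule ^/some/path/(?!subfolder/) /transfer_script.php?area=path [L]
```

In an htaccess file at the document root the same patterns would be written without the leading slash.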

eric76

4:36 pm on Feb 8, 2015 (gmt 0)

10+ Year Member



Thanks a lot for all the input Lucy24!

I may ask questions on rewrite rules again some time in the future, but right now I'm good!

I'd mark this thread as answered if I only knew how - I don't see a button for it.

And creds to you Lucy24! ^^