Using mod-rewrite to redirect to new domain

Noob warning!

         

Rsw0001

6:28 am on Mar 20, 2012 (gmt 0)

10+ Year Member



I'm in the process of moving my old website (which is a subdomain on my ISP's free hosting site) to a new host. At the moment, I'm using the .htaccess file to do 301 redirects to the new site. This is working well, but the problem is that most users won't notice that the URL has changed from:
myoldsubdomain.myoldhost.net
to:
mynewdomain.com

So, in the last few weeks before the old site gets shut down, I'd like to have a page come up that informs the user about the move, so they can update their bookmarks.

I'd like to have the URLs rewritten as follows:

Input URL:
myoldsubdomain.myoldhost.net/mypage.html

Rewritten URL:
myoldsubdomain.myoldhost.net/redirectpage.html?mypage

So, the redirect page has all the necessary notice info, and JavaScript to advise the user and then redirect to the new page on the new site. That part works fine, but I'm having a problem with the .htaccess file. This is the code I've used:

RewriteRule ^(.*)\.html$ Pageredirect\.html\?$1 [L]


I tested this using an online regex applet, and it produces the correct URL, but when I put this on my web server, it keeps throwing a 500 error. I thought maybe I needed to use the full URL. So I tried this:

RewriteRule ^http://myoldsubdomain\.myoldhost\.net/(.*)\.html$ http://myoldsubdomain.myoldhost\.net/Pageredirect\.html\?$1 [L]


But, I still get the same 500 error. I'd be very grateful if someone could set me straight.

Rsw0001

8:17 am on Mar 20, 2012 (gmt 0)

10+ Year Member



I think I've figured it out. For anyone who's curious, I'm now using this:

RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ http://myoldsubdomain.myoldhost.net/Pageredirect.html\?$1


(The RewriteCond line is required to prevent infinite redirects.)

g1smd

8:26 am on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The rewritten internal pointer rematches the rule pattern "<something>.html" and is rewritten again in an infinite loop.

Use a negative match RewriteCond checking REQUEST_URI to exclude already-rewritten requests.
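
A toy model of that loop, as a minimal sketch: JavaScript regexes stand in for the rule pattern and the RewriteCond, and the function names (rewriteOnce, countPasses) are invented for illustration. Real mod_rewrite has more moving parts, so treat this as a picture of the reasoning rather than of Apache itself.

```javascript
// Simplified model: the rule is re-run against its own output until
// nothing matches or a pass limit is reached.
const rulePattern = /^(.*)\.html$/;   // RewriteRule pattern
const guard = /Pageredirect/;         // RewriteCond %{REQUEST_URI} !Pageredirect

function rewriteOnce(path, useGuard) {
  // With the guard on, an already-rewritten request is left alone.
  if (useGuard && guard.test(path)) return null;
  // The query string is not part of the path, so only the file name is kept.
  return rulePattern.test(path) ? "Pageredirect.html" : null;
}

function countPasses(path, useGuard, limit = 10) {
  let passes = 0;
  let current = path;
  while (passes < limit) {
    const next = rewriteOnce(current, useGuard);
    if (next === null) break;
    current = next;
    passes++;
  }
  return passes;
}

console.log(countPasses("mypage.html", false)); // 10 (stopped only by the limit)
console.log(countPasses("mypage.html", true));  // 1
```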

For site moves, I usually just include text along the very top of every page of the old site saying the site will be offline for a few minutes on a certain date for a move to a new domain (or for a redesign), ending with the words "... when this message is no longer visible, the update is complete". For a site move, the old site is then completely taken down and replaced with redirects.

When you include the protocol and domain in the rule target you get a 302 redirect. Your script then generates another redirect giving you an unwanted multiple step redirect chain. Do not use a redirect for the first step. Use an internal rewrite.

The confusing part is that a RewriteRule can be configured to either deliver a redirect or perform an internal rewrite. There are only minor syntax changes for each function and using the wrong one can kill your rankings.

If you include protocol and domain name and/or the [R] flag, you get a 302 redirect. Omit both for a rewrite.

Every rule needs the [L] flag.

[edited by: g1smd at 8:37 am (utc) on Mar 20, 2012]

lucy24

8:28 am on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You need to sort out your escape-slashes. Escape in the pattern. Never in the target.

Rsw0001

8:44 am on Mar 20, 2012 (gmt 0)

10+ Year Member



Hmm, I did notice that I had escape slashes in the target, but this was after I'd already got it working. Apparently, my server is ignoring them. Anyway, I'll get rid of them so they don't cause problems down the road.

Thank you all, for the replies.

g1smd

8:49 am on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Let's see the final code, with all the fixes from above applied.

Rsw0001

9:07 am on Mar 20, 2012 (gmt 0)

10+ Year Member



g1smd,
I have both the old and new sites up and running at the moment.
My strategy was that I would generate 301 redirects for a couple of weeks in order to allow the search engines to update their information, and then I would implement what I've discussed here. Because of the way the old website is set up, it would be extremely cumbersome to add a notice at the top of every page. However, I did put a notice on the home page and on an updates page. Hopefully, this will be enough.

Here is the .htaccess code that is now working:


ErrorDocument 403 /403.html
ErrorDocument 404 /404.html
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ http://myoldsubdomain.myoldhost.net/Pageredirect.html?$1 [L]


and here is the html for the redirect page "Pageredirect.html":


<html><head><title></title>
<script language="JavaScript">

// Build a link to the same page on the new site from the raw query string.
function setURL(){
var cell1=document.getElementById("newlink");
var cell2=document.getElementById("info1");
var cell3=document.getElementById("info2");
// The query string is the old page name without ".html", e.g. "?mypage".
var request=document.location.search.substring(1)+".html";
var part1="<A HREF=\"http://mynewdomain.com/";
var part2="\"><FONT COLOR=\"#000099\"><U>http://mynewdomain.com/";
var part3="</U></FONT></A>";
cell1.innerHTML=part1+request+part2+request+part3;
cell2.innerHTML="The new link to the requested page is:";
cell3.innerHTML="Please update your bookmarks and then click on the link to go to the page.";
}

</script>
</head>

<body onload="setURL();">
<form onsubmit="return false" > <br> <br> <br>
<table align="center" border="0" style="font-size: 14pt">
<caption style="font-size: 24pt"><b>This page has moved!</b></caption>
<tbody>
<tr>
<td align="left">&nbsp;</td>
</tr>
<tr>
<td align="left" id="info1">Javascript must be enabled in order to redirect to the new page</td>
</tr>
<tr>
<td align="left" id="newlink">&nbsp;</td>
</tr>
<tr>
<td align="left" id="info2">&nbsp;</td>
</tr>
<tr>
<td align="left">&nbsp;</td>
</tr>
<tr>
<td align="left">&nbsp;</td>
</tr>
<tr>
<td align="left">Or, to go to the new home page:</td>
</tr>
<tr>
<td>
<A HREF="http://mynewdomain.com/"><FONT COLOR="#000099"><U>http://mynewdomain.com/</U></FONT></A>
</td>
</tr>
</tbody>
</table>
</form>
</body></html>


Again, thank you both, for your quick replies.

g1smd

8:48 pm on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You have a double redirect.

The first redirect, the one in .htaccess, should be changed to an internal rewrite.

Rsw0001

9:35 pm on Mar 20, 2012 (gmt 0)

10+ Year Member



Okay, but I haven't had enough experience with rewriting to know how to do this. Could you explain? Thanks.

g1smd

9:39 pm on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



From the post above:
If you include protocol and domain name and/or the [R] flag, you get a 302 redirect. Omit both for a rewrite.

Every rule needs the [L] flag.

Rsw0001

2:25 am on Mar 21, 2012 (gmt 0)

10+ Year Member



I'm not having much luck.

I tried this and got a 404 error:
RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ Pageredirect.html?$1 [L]


Then I tried this and got a 404 error:
RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ mysubdomain.myoldhost.net/Pageredirect.html?$1 [L]


Finally, I tried this:
RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ /Pageredirect.html?$1 [L]

It got to the redirect page, but the query string was missing.

The only way I could get it to work was to use the full protocol and domain name. I'm wondering if the problem is due to this being a subdomain, and I don't know the full structure of the directories.

Meanwhile, I had another idea. Why don't I do something like this?


# Permanent redirect for search engine bots
RewriteCond %{HTTP_USER_AGENT} bot\.htm [NC]
RewriteRule ^(.*)html$ http://mynewdomain.com$1html [R=301,L]
# For everyone else, divert them to redirect announcement page
RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ http://mysubdomain.myoldhost.net/Pageredirect.html?$1 [L]


The first two lines will detect most search engine webcrawlers, and give a 301 redirect to the new page on the new site, which should keep them happy. For everyone else, the next two lines will send them to the announcement page. Do you see any problems with this?

lucy24

3:21 am on Mar 21, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: detour to text editor so I can spread myself out ::

Using example.com for your old site and example.org for the new one, and assuming all of this is happening at the old site:

RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ Pageredirect.html?$1 [L]

What this Rewrite does:
IF the original request was not for "Pageredirect"
THEN take any request for a page-- including the null page www.example.com/.html --and silently rewrite to
www.example.com/Pageredirect.html
The original request, minus html component, turns into a new query string; if there was an earlier query string it is overwritten.

RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ mysubdomain.myoldhost.net/Pageredirect.html?$1 [L]

What this Rewrite does:
IF, THEN as above, only this time you are silently rewriting to
www.example.com/www.example.com/Pageredirect.html
with query string as above

RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ /Pageredirect.html?$1 [L]

There should be no difference between this version and version #1-- but the leading slash is safer. (g1 has explained this at least 400 times, but it has not sunk in yet.) Are you absolutely positive you transcribed the wording of both versions exactly as you had them?

RewriteCond %{HTTP_USER_AGENT} bot\.htm [NC]
RewriteRule ^(.*)html$ http://www.example.org$1html [R=301,L]

What this Rewrite does:
IF the user-agent string contains the text "bot.htm"
THEN take any request ending in "html" and 301 redirect to
http://www.example.org{request}html
FOR EXAMPLE:
www.example.com/foobar.html >> www.example.orgfoobar.html
AND ALSO:
www.example.com/html >> www.example.orghtml
www.example.com/foobarhtml >> www.example.orgfoobarhtml
www.example.com/foobar/widget/html >> www.example.orgfoobar/widget/html

RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ http://www.example.com/Pageredirect.html?$1 [L]

What this Rewrite does:
IF the previous Rule + Condition did not apply, AND request was not for Pageredirect
THEN 302 redirect to
www.example.com/Pageredirect.html
This is exactly the same as your versions #1 and #3, except that by giving the full protocol and domain, you are forcing a Redirect. By default, it is a 302.
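
The missing-slash cases above can be reproduced outside Apache. A rough sketch, assuming JavaScript's regex engine behaves like mod_rewrite's for these simple patterns and that the path is matched without its leading slash, as it is in .htaccess; the helper name applyRule is made up, and the second rule is just one possible variant with the slash moved into the target:

```javascript
// Apply a pattern/target pair the way a RewriteRule would, or return null
// when the pattern does not match. "$1" in the target is the first capture.
function applyRule(pattern, target, path) {
  return pattern.test(path) ? path.replace(pattern, target) : null;
}

// Rule as posted: no slash after the host name in the target.
const broken = [/^(.*)html$/, "http://www.example.org$1html"];
// Variant: capture the whole path and put the slash in the target.
const withSlash = [/^(.*)$/, "http://www.example.org/$1"];

for (const path of ["foobar.html", "html", "foobar/widget/html"]) {
  console.log(path, "->", applyRule(broken[0], broken[1], path));
}
console.log(applyRule(withSlash[0], withSlash[1], "foobar.html"));
```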

Rsw0001

7:09 am on Mar 21, 2012 (gmt 0)

10+ Year Member



Are you absolutely positive you transcribed the wording of both versions exactly as you had them?


I'm as sure as I can possibly be. Obviously I did change my true subdomain and domain to the example ones, but everything was copied and pasted into the post. I tried this again just to double check:

RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ mysubdomain.myoldhost.net/Pageredirect.html?$1 [L]


My browser displays this message when I use the above code:
> The requested URL /e/l/mysubdomain.myoldhost.net/public/mysubdomain.myoldhost.net/Pageredirect.html was not found on this server.
> Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.

Notice the /e/l/ at the front. Perhaps you can make some sense out of it.
FYI, when I log into my site using an FTP client, one of the top level directories is "public" and it is in the "public" directory where the entire site resides. You'll notice that appears in the rewritten URL.

After reading your comments, I've changed my code to the following and it appears to work:

# Permanent redirect for search engine bots
RewriteCond %{HTTP_USER_AGENT} bot\.htm [NC]
RewriteRule ^(.*)$ http://mynewdomain.com/$1 [R=301,L]
# For everyone else, divert them to page moved announcement page
RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^(.*)\.html$ http://mysubdomain.myoldhost.net/Pageredirect.html?$1 [L]

When I set my browser to spoof the googlebot user agent, it gets a 301 redirect to the corresponding page on the new site. And with the normal browser user agent, it goes to the announcement page, which is what I want (I think).

g1smd

7:47 am on Mar 21, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Your latest code delivers a redirect. When you include the protocol and domain in the rule target you get a 302 redirect. Your script then generates another redirect giving you an unwanted multiple step redirect chain. Do not use a redirect for the first step. Use an internal rewrite.


Backing up to the error message you saw when you had code for a rewrite in place:
The requested URL /e/l/mysubdomain.myoldhost.net/public/mysubdomain.myoldhost.net/Pageredirect.html was not found on this server.

The fact that the internal filepath is being exposed as a URL means that you have an external redirect happening after the internal rewrite.

This means that your rules are in the wrong order. You must list all redirects first from most specific to most general and then list all rewrites from most specific to most general.

You must use RewriteRule for all of your rules. Do not use Redirect or RedirectMatch for any of your rules. Make sure that you use RewriteRule for all of your redirects and for all of your rewrites.

You don't have a proper query string. The rewrite code should end ?name=$1 [L] rather than just ?$1 here.

The subpattern (.*) can only be used on the end of a pattern. It must never appear at the beginning or in the middle of a pattern.

Add a blank line after each RewriteRule for code clarity.

lucy24

8:52 am on Mar 21, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The subpattern (.*) can only be used on the end of a pattern. It must never appear at the beginning or in the middle of a pattern.

Ah, knew I'd forgotten something. In this group of rewrite/redirects, what you generally want is [^.] or in full

^([^.]+)\.html$

That is, capture everything up to the first period. There won't be any until you get to the extension, since the host/domain name has already been dropped.
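
A quick way to see the difference between the two patterns, as a sketch only: it assumes JavaScript regexes are a fair stand-in for mod_rewrite's here, and the sample paths and the capture helper are invented for illustration.

```javascript
const greedy = /^(.*)\.html$/;      // original pattern
const upToDot = /^([^.]+)\.html$/;  // "everything up to the first period"

// Return the captured page name, or null when the pattern does not match.
function capture(pattern, path) {
  const m = pattern.exec(path);
  return m ? m[1] : null;
}

for (const path of ["mypage.html", "sub/mypage.html", ".html", "dir.v2/page.html"]) {
  console.log(JSON.stringify(path), capture(greedy, path), capture(upToDot, path));
}
```

Both patterns capture the same name for ordinary pages. The stricter one rejects the null page ".html", and it also rejects any path with a period before the extension, which is worth knowing if a directory name contains one.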

You don't have a proper query string.

D'oh! Not {blahblah} but {queryname=blahblah}.

You could probably tweak your php script to handle a "naked" query, but it's likely to be more trouble than it's worth.
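
For a page script written in JavaScript, accepting both forms is not much code. A minimal sketch, assuming the parameter is called name as in the suggested target ?name=$1; pageFromQuery is a hypothetical helper, not something from the posted page:

```javascript
// Accepts a query string such as "?name=mypage" or "?mypage" and returns
// the page name, or "" when there is no query at all.
function pageFromQuery(search) {
  // Preferred form: a proper name=value pair (get() returns the decoded value).
  const named = new URLSearchParams(search).get("name");
  if (named !== null) return named;
  // Fallback: a "naked" query, returned as-is without the leading "?".
  return search.startsWith("?") ? search.substring(1) : search;
}

console.log(pageFromQuery("?name=mypage")); // mypage
console.log(pageFromQuery("?mypage"));      // mypage
```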

Rsw0001

12:34 am on Mar 22, 2012 (gmt 0)

10+ Year Member



Okay, I've changed the match pattern and removed the protocol and domain from the rewrite rule. So, I now have this:

RewriteRule ^([^.]+)\.html$ /Pageredirect.html?$1 [L]

This now appears to do the internal rewrite, as it does correctly go to Pageredirect.html. However, the query string now comes up blank.

I've also tried:

RewriteRule ^([^.]+)\.html$ /Pageredirect.html?name=$1 [L]

However this shouldn't be necessary as I'm just using the Javascript code:
query = document.location.search.substring(1);
which returns the entire raw query string. So, adding the "name=" just creates more work. (The site doesn't use php, just plain old html with some javascript.)

When I add the protocol and domain to the replacement string, so that it does a redirect rather than a rewrite, then the query string is properly added and my announcement page works correctly.

Can you think of any reason why doing it as an internal rewrite would cause the query string to disappear?

g1smd

12:57 am on Mar 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The reason it "disappears" is that the "work" to process it and issue a proper redirect should be happening in a PHP (or similar) script inside the *server*. The server should be issuing a 301 redirect from the old URL to the new.

Javascript runs entirely inside the browser and so cannot "see" the appended query data in the server's internal pointer. The Pageredirect script should be a PHP script inside the server that issues a proper 301 redirect to the new site.
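
To make the difference concrete: a script in the page only ever sees the URL in the address bar. A small sketch, assuming the standard URL class (available in browsers and Node.js); requestedPage is an invented helper that mirrors what the posted page does with location.search:

```javascript
// Returns the page name the announcement script would build, or null when
// the address-bar URL carries no query string.
function requestedPage(addressBarUrl) {
  const query = new URL(addressBarUrl).search.substring(1);
  return query === "" ? null : query + ".html";
}

// Internal rewrite: the browser still shows the URL it originally asked for.
console.log(requestedPage("http://mysubdomain.myoldhost.net/mypage.html")); // null
// External redirect: the browser is sent to the new URL, query string included.
console.log(requestedPage("http://mysubdomain.myoldhost.net/Pageredirect.html?mypage")); // mypage.html
```

With a rewrite in place the original page name is, in principle, still visible to the script as location.pathname, since the address bar never changes.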

Trying to do this all in the browser is completely the wrong approach. You end up with two redirects, the second of which isn't a real 301 redirect.

Search engines likely cannot follow the chain as there isn't a proper redirect to the new site. If you continue using a redirect to Pageredirect.html (instead of a rewrite) and then run some javascript in the browser, what I imagine will happen for search engines is that it will appear that all requests for old URLs on the old site are redirected to mysubdomain.myoldhost.net/Pageredirect.html. When the bot later returns and directly accesses that URL there will be no way presented for it to get to the new site. Bots don't immediately follow redirects. They access pages from a crawl list compiled earlier.

Rsw0001

1:38 am on Mar 22, 2012 (gmt 0)

10+ Year Member



Except that I've added the rule to immediately redirect search engines to the new site:

# Permanent redirect for search engine bots
RewriteCond %{HTTP_USER_AGENT} bot\.htm [NC]
RewriteRule ^(.*)$ http://mynewdomain.com/$1 [R=301,L]

# For everyone else, divert them to the announcement page
RewriteCond %{REQUEST_URI} !Pageredirect [NC]
RewriteRule ^([^.]+)\.html$ http://mysubdomain.myoldhost.net/Pageredirect.html?$1 [L]

I have checked the server log, and it is indeed giving the search engines an immediate 301 redirect to the correct page on the new site. I do understand that the remaining code is less than optimal, but I have no experience with php. So, I'm trying to do it using the tools that I understand. As they say: "When you're a hammer, every problem looks like a nail." :) If this was going to stay up and running indefinitely, then I'd be much more concerned about it. But it will all be shut down in about a month anyway.

g1smd

7:32 am on Mar 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If bots are being properly redirected then that is good.

Showing different content to bots and users is risky, but if it is only in place for a few weeks it is unlikely to be a problem.

I'd missed the bit where you mentioned bots get redirected.

The crucial bit to stop an infinite loop is this line:
RewriteCond %{REQUEST_URI} !Pageredirect [NC]

though the NC flag is probably not needed.

lucy24

8:22 am on Mar 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



though the NC flag is probably not needed.

It may even be counterproductive.

If someone explicitly asks for PAGEREDIRECT or pageredirect-- typing from memory, let's say-- you should probably send them to a correctly cased Pageredirect rather than risk them slamming into a 404 if you're on a case-sensitive server.

Rsw0001

12:49 pm on Mar 26, 2012 (gmt 0)

10+ Year Member



A brief update:

I took out the [NC] for the reasons given.

After a few days with this .htaccess file deployed, Google has now re-crawled about 2/3 of my site, and I see that it has updated its links to the new pages, and the ranking is about the same as before (in the top five, for certain queries). So I'll consider this a success.


The subpattern (.*) can only be used on the end of a pattern. It must never appear at the beginning or in the middle of a pattern.

Originally, I didn't understand g1's comments about this, but after browsing through some other threads here, I found the explanation that it eats up a tremendous amount of server processing time. Okay, that totally makes sense now. Thanks for pointing that out. A good lesson learned.

Interesting side note:
Since the code is not checking to see what kind of files the search engine bots are looking at, the .htaccess code is giving 301 redirects to the new site when the robots.txt file is requested, so the webcrawlers are reading the robots file on the new site. I can see that this is probably not a good thing, but I'm too lazy to fix it, especially now that I see that the search engines have properly updated their links. I'm going to give Google credit for having enough smarts to figure out that the site is moving. I'm sure they've seen this before.

Interesting side note #2:
Since the Yahoo crawler doesn't have the string "bot.htm" anywhere in its UA string, I added a couple of lines to the .htaccess file to accommodate them too. Interestingly, the log shows that they've not crawled the site since the .htaccess file has been deployed. Yet, they've managed, somehow, to update their links to the new site. It appears that their search results are the same as the Bing search results. Did Yahoo get bought out by Microsoft? I seem to remember reading something to that effect, but I have a short attention span.

Thanks again for all the help.

g1smd

2:28 pm on Mar 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You can block the redirect for specific filetypes (or specific names if you have to) by using

RewriteCond %{REQUEST_URI} !\.txt$


or similar. You might also want to not redirect requests for the Google WMT and Bing Webmasters account verification files etc.
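
How the two conditions combine can be checked quickly in a sketch like the one below. It assumes JavaScript regexes as a stand-in for mod_rewrite's, and the user-agent strings are typical published examples that may not match what a crawler sends today; redirectsToNewSite is a made-up name.

```javascript
const botPattern = /bot\.htm/i;  // RewriteCond %{HTTP_USER_AGENT} bot\.htm [NC]
const textFile = /\.txt$/;       // RewriteCond %{REQUEST_URI} !\.txt$

// True when both conditions pass, i.e. the 301 to the new site would fire.
function redirectsToNewSite(userAgent, requestUri) {
  return botPattern.test(userAgent) && !textFile.test(requestUri);
}

const googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
const slurp = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)";

console.log(redirectsToNewSite(googlebot, "/mypage.html")); // true
console.log(redirectsToNewSite(googlebot, "/robots.txt"));  // false
console.log(redirectsToNewSite(slurp, "/mypage.html"));     // false
```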


Glad you found the extra information about why using the (.*) subpattern is often harmful. It's repeated on a regular basis in this forum.

lucy24

10:36 pm on Mar 26, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It appears that their search results are the same as the Bing search results. Did Yahoo get bought out by Microsoft? I seem to remember reading something to that effect, but I have a short attention span.

Yahoo now uses Bing's data, though some people have observed that it does different things with it.

All that's left in Yahoo's own robotic name are Slurp and the CacheSystem, both of which are badly behaved robots that can be locked out with a clear conscience. Unless it turns out that one of them does something terrifically important that nobody has figured out yet ;)