RewriteRule for Newbie

26 hours in mod_rewrite chair and I'm getting nowhere

         

abraxas1618

3:46 pm on Apr 8, 2009 (gmt 0)

10+ Year Member



Hi,
What follows is a paste of what I sent to my hosting company tech support, the response I received and some course of action i took.

I have a website, www.someproduct.com with a bunch of subdirectories.

I have another domain name www.someotherproduct.com , this domain has been aliased to www.someproduct.com so if you type in www.someotherproduct.com in browser, you get the www.someproduct.com website home page but has www.someotherproduct.com written in address bar of browser. Which is great!

On the www.someproduct.com website is a directory called /someotherproduct. If I type in www.someotherproduct.com/someotherproduct I get the default page in that directory. Great! I wish to craft (I'm learning that is an apt word for this realm of magic) a rule that says, in effect: if the address is www.someotherproduct.com, go to the /someotherproduct directory but keep www.someotherproduct.com in the address bar of the browser and don't show the /someotherproduct path. I think, in the language used, I want /someotherproduct to be top-level or '/' for www.someotherproduct.com

Does this make sense at all? This is the response I received

RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.someotherproduct\.com$ [OR]
RewriteCond %{HTTP_HOST} ^someotherproduct\.com$
RewriteRule ^(.*)$ [someproduct.com...] [L]

From what I read in the Apache URL Rewriting Guide, I sort of worked out this wasn't going to answer my question. The browser obviously shows www.someproduct.com/someotherproduct/ in the address bar, not the www.someotherproduct.com/ that I wanted. OK, I reasoned that lines 2 and 3 are conditionals matching the right domain name (I really am learning this on the fly) and that statement 4 is executed if either of the preceding statements is true. The [L] means don't execute any more statements after it (?)

I changed the last statement to

RewriteRule ^(.*)$ /someotherproduct/$1 [L]

which returns an error. Because of the way the hosting company has set up its logging I can't view new logs for 24 hours, so I have an 18 hr wait to have a looksee.

I tried

RewriteRule ^(.*)$ www.someotherproduct.com/someotherproduct/$1 [L]

which was a big mistake. Endless loop. [Slaps hand]

Since then I've been going through [httpd.apache.org...] and the apache forums here (had a look at rewritebase and rewritelog), but I'm not sure what I am looking for

I would be really appreciative to anyone who could offer me pointers on this problem.

Sincerest thanks

abraxas1618

3:49 pm on Apr 8, 2009 (gmt 0)

10+ Year Member



Sorry, forgot to mention that .htaccess is stored in /www of www.someproduct.com

jdMorgan

4:51 pm on Apr 8, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



By putting the protocol and domain in the RewriteRule substitution, you make it a URL instead of a filepath, resulting in an external redirect response to your browser, which then changes the URL in its address bar, and re-requests the resource from that new URL.

You want an internal rewrite, not an external redirect. And you must also explicitly prevent an 'infinite' rewriting loop if you place this code into an .htaccess file:


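RewriteEngine on
# Apply only to requests for someotherproduct.com, with or without "www."
RewriteCond %{HTTP_HOST} ^(www\.)?someotherproduct\.com$
# Skip paths that already point into the subdirectory; this prevents the rewriting loop
RewriteCond $1 !someotherproduct/
# Internal rewrite: the substitution is a local path, so the browser's address bar does not change
RewriteRule ^(.*)$ /someotherproduct/$1 [L]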

Note that instead of accepting either www- or non-www domains as this code and your original code do, you should pick one or the other as the canonical domain, and externally redirect all other hostname variations to that one canonical hostname. So, you might want to precede this new rule with:

# Externally redirect non-canonical someproduct requests to canonical www.someproduct.com domain
RewriteCond %{HTTP_HOST} someproduct\.com [NC]
RewriteCond %{HTTP_HOST} !^www\.someproduct\.com$
RewriteRule ^(.*)$ http://www.someproduct.com/$1 [R=301,L]
#
# Externally redirect non-canonical someotherproduct requests to canonical www.someotherproduct.com domain
RewriteCond %{HTTP_HOST} someotherproduct\.com [NC]
RewriteCond %{HTTP_HOST} !^www\.someotherproduct\.com$
RewriteRule ^(.*)$ http://www.someotherproduct.com/$1 [R=301,L]

The first redirect rule here will take care of non-canonical requests for
someproduct.com
ww.someproduct.com
wwww.someproduct.com
junk.someproduct.com
www.SomeProduct.Com
www.someproduct.com.
www.someproduct.com:80
www.someproduct.com.:80
and all other variations on the canonical someproduct hostname. And the second rule does the same for someotherproduct host as well.

Doing this prevents your sites having duplicate-content-related problems, such as poor ranking because the different URLs are effectively competing with each other, and prevents potential 'attack' by your competitors intentionally linking to "fraudulent-sale-of.someproduct.com" and getting your site indexed in search that way. (These are just two examples of why you should force domain canonicalization -- there are many more.) The bottom line is that for each unique page on your site, there should be one and only one valid URL that can be used to reach it; all others should result in a 404 or a 301 redirect to the correct, single URL.

Once you get the above rules working, there's one more rule you may wish to add at the very top (order is important, and I've actually presented them in reverse-order here to simplify discussion):


RewriteCond %{THE_REQUEST} ^[A-Z]+\ /someotherproduct/[^\ ]*\ HTTP/
RewriteRule ^someotherproduct/(.*)$ http://www.someotherproduct.com/$1 [R=301,L]

This prevents duplicate pages at www.someotherproduct.com/xyz, www.someotherproduct.com/someotherproduct/xyz, and www.someproduct.com/someotherproduct/xyz -- it prevents having three different URLs for the same page.
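
Since the rules were presented in reverse order, here is a sketch of how the pieces might sit together in one .htaccess file, read top to bottom. It assumes the file lives in the web root of www.someproduct.com, as stated earlier in the thread, and it has not been tested on your host. A request line has the form "GET /someotherproduct/xyz HTTP/1.1", so the THE_REQUEST pattern in this sketch ends in "\ HTTP/".

```apache
RewriteEngine on

# 1. Direct client requests for /someotherproduct/... are redirected to the
#    same page at the root of www.someotherproduct.com
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /someotherproduct/[^\ ]*\ HTTP/
RewriteRule ^someotherproduct/(.*)$ http://www.someotherproduct.com/$1 [R=301,L]

# 2. Non-canonical someproduct hostnames are redirected to www.someproduct.com
RewriteCond %{HTTP_HOST} someproduct\.com [NC]
RewriteCond %{HTTP_HOST} !^www\.someproduct\.com$
RewriteRule ^(.*)$ http://www.someproduct.com/$1 [R=301,L]

# 3. Non-canonical someotherproduct hostnames are redirected to www.someotherproduct.com
RewriteCond %{HTTP_HOST} someotherproduct\.com [NC]
RewriteCond %{HTTP_HOST} !^www\.someotherproduct\.com$
RewriteRule ^(.*)$ http://www.someotherproduct.com/$1 [R=301,L]

# 4. Internal rewrite: serve www.someotherproduct.com from the subdirectory
#    without changing the URL shown in the browser
RewriteCond %{HTTP_HOST} ^(www\.)?someotherproduct\.com$
RewriteCond $1 !someotherproduct/
RewriteRule ^(.*)$ /someotherproduct/$1 [L]
```

THE_REQUEST holds the request line exactly as the client sent it, so rule 1 does not fire on paths that rule 4 has already rewritten internally. That is why the two can coexist without looping.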

Jim

[edited by: jdMorgan at 4:54 pm (utc) on April 8, 2009]

abraxas1618

12:39 am on Apr 9, 2009 (gmt 0)

10+ Year Member



Jesus...
Thanks Jim! Thank you very much for your response. Seems a tad more than the hosting company indicated, "you'll probably get away with two lines."

Just got out of bed (Aussie time), going to try it out and I will get back to you. Thanks again...Muchly!

abraxas1618

3:19 am on Apr 9, 2009 (gmt 0)

10+ Year Member



Ok,
Tried...
RewriteCond %{HTTP_HOST} ^(www\.)?someotherproduct\.com$
RewriteCond $1 !someotherproduct/
RewriteRule ^(.*)$ /someotherproduct/$1 [L]

.. worked like a charm, thanks!

Tried both of the external redirects for non-canonical someproduct and someotherproduct requests, but I get 404 errors if I type something amiss like wwww.someproduct.com or ww.someotherproduct.com, though someproduct.com and someotherproduct.com do work. This happens with or without the redirects in the .htaccess file. Would this be due to another level of redirection outside the scope of .htaccess?

Looked at the third rule. Got an explanation from one of jd01's posts on '\ ' (escaped space, haven't seen that before). The rule made me think: I don't mind if /someotherproduct is reached through someproduct.com/someotherproduct, since someotherproduct is part of the family of someproduct.com. All the redirects are in reality due to a new product launch, so it's a marketing requirement.

This is certainly a very involved subject. Your notes on duplicate content are very worthy of future consideration. Most of my development work is IIS and such things have been hidden from me by being taken care of by other people. I picked up a copy of Rich Bowen's Apache mod_rewrite to read on the train.

Thanks again Jim for your help

jdMorgan

4:26 am on Apr 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I don't mind if /someotherproduct is reached through someproduct.com/someotherproduct since someotherproduct is part of the family of someproduct.com

Careful there, mate. You are saying that you do not mind if the two URLs /someotherproduct and someproduct.com/someotherproduct compete with each other for ranking in search engines, and neither reaches its full potential. This is duplicate content, and the thread in our Google News forum archive titled "Duplicate Content -- Get it right or die" pretty much says it all.

Also, to be clear, are you getting 404 errors on junk.someproduct.com requests, or are you getting DNS errors -- e.g. "Cannot find the server"?

A 404 indicates that your DNS does forward requests for junk.someproduct.com to your server, and that your server then maps them to *some* filespace (location unknown to me, but likely available in your server error log file). A "Server not found" or similar message indicates that your DNS zone file is set up with specific subdomains defined, rather than a wildcard subdomain record. In that case, only the specifically-defined domain and subdomain requests will be passed to your server, and the rest will be dropped at the DNS request level -- i.e. the browser won't even connect to your server.

Jim

abraxas1618

5:43 am on Apr 9, 2009 (gmt 0)

10+ Year Member



Good call, I'll read Duplicate Content after I post this.

Actually.. i'll read it now

abraxas1618

2:59 am on Apr 10, 2009 (gmt 0)

10+ Year Member



Back!
14 hours and three headaches later. That was one hell of an enlightening read, so a quick update on what I learned and how it applies to my situation.

RewriteCond %{HTTP_HOST} ^(www\.)?someotherproduct\.com(\.au)?$
RewriteCond $1 !someotherproduct/
RewriteRule ^(.*)$ /someotherproduct/$1 [L]

Notice the .au conditional? There's another domain name pointing to same place. Had a cringing feeling when reading Duplicate Content thread. But I came to a decision that I'd like to run by you.

I'm going to dynamically header("X-Robots-Tag: noindex, nofollow", true); both the someotherproduct.com and someotherproduct.com.au pages and allow crawling through someproduct.com/someotherproduct/ path only. How's that sound?
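
For reference, the same header could in principle be sent from .htaccess instead of from PHP. This is a swapped-in technique, not what the post above describes, and it is only an untested sketch: it assumes mod_setenvif and mod_headers are enabled on the host, which the thread does not confirm, and the variable name NOINDEX_HOST is made up for illustration.

```apache
# Flag requests whose Host header is one of the someotherproduct hostnames
# (optional "www.", optional ".au", optional port). NOINDEX_HOST is a
# hypothetical variable name chosen for this sketch.
SetEnvIfNoCase Host ^(www\.)?someotherproduct\.com(\.au)?(:[0-9]+)?$ NOINDEX_HOST

# Send the header only on flagged requests; requests for
# www.someproduct.com/someotherproduct/ are left alone
Header set X-Robots-Tag "noindex, nofollow" env=NOINDEX_HOST
```

Either way, the header is only seen after a crawler has fetched the URL.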

Regarding 404 errors, you were right, they weren't (sorry, shoulda looked harder). It was DNS request level.

I really like this site Jim particularly the Latest News frontpage

Many thanks
Ant

g1smd

8:20 am on Apr 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You could do that, but they will still have to spider the URLs to see the noindex tag.

You are much better off to redirect the duplicate URL version to the canonical form. You can control that using a RewriteCond or two.
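
A minimal sketch of what that could look like for the .com.au hostname, assuming www.someotherproduct.com is chosen as the canonical form (the thread has not settled which hostname is canonical, so treat that as a placeholder). It would go above the internal rewrite, alongside the other external redirects.

```apache
# Assumption: www.someotherproduct.com is the canonical hostname.
# Any .com.au variant, with or without "www.", is redirected to it.
# The pattern is left unanchored at the end so that a trailing dot
# or a port number in the Host header still matches.
RewriteCond %{HTTP_HOST} ^(www\.)?someotherproduct\.com\.au [NC]
RewriteRule ^(.*)$ http://www.someotherproduct.com/$1 [R=301,L]
```

Jim's second canonical-host redirect earlier in the thread already has this effect, because www.someotherproduct.com.au contains "someotherproduct.com" but is not exactly www.someotherproduct.com. The explicit condition here just makes the intent visible. With a redirect in place, the (\.au)? alternative in the internal rewrite condition would no longer be reached.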