htaccess nearly perfect but help please

         

Gral

9:30 pm on May 13, 2011 (gmt 0)

10+ Year Member



Hi, after trawling these forums for in excess of 20 hours this week I've come up with the following htaccess file that does nearly everything I need:

RewriteEngine on

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*[^.]+\.html([?#][^\ ]*)?\ HTTP/
RewriteRule ^(([^/]+/)*[^.]+)\.html$ http://example.com.au/$1 [R=301,L]

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /tagword/([^/]+/)*index\.html
RewriteRule ^tagword/(([^/]+/)*)index\.html$ http://www.example.com.au/tagword/$1 [R=301,L]

RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(([^/]+/)*[^.]+)$ /$1.html [L]

RewriteCond %{HTTP_HOST} ^www.example.com.au$
RewriteRule ^/?$ "http\:\/\/example\.com\.au" [R=301,L]

RewriteCond %{HTTP_HOST} ^example.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.example.com$
RewriteRule ^/?$ "http\:\/\/example\.com\.au\/" [R=301,L]

It removes www, .html and parks my .com domain on top of my actual .com.au domain.

But it does NOT remove index from root or sub-folder URLs.

Can anyone help?

PS. I don't understand the code... I have spent hours and hours piecing it together from forum posts so I know what it does, not the details of how or why.

Do I have my rules in the right order? And is there an issue with the .com.au domain? I simply added in .au whenever a piece of code mentioned .com. Is that correct?

I feel soooo close to solving this, any help would be greatly appreciated.

Gral

g1smd

9:41 pm on May 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do NOT escape colons or slashes in the target URL.

DO escape literal periods in the RegEx patterns.

The order of the rules is very important. Redirects must be listed first, from most specific to most general. Rewrites must be listed last.

The index.html rule never gets to run because the previous more general .html rule has already interfered with the request. The rule order is incorrect.

The very last rule in your list never gets to run because the previous rule already runs for "NOT www.example.com.au". The rule order is incorrect.

Using
(www\.)?example\.com
is more efficient than testing separately for www and non-www.
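
As a rough sketch of that combined test (assuming the goal is to send both forms of the .com host to the non-www .com.au site, as described in the first post):

```apache
# One condition covers both "example.com" and "www.example.com".
# The (www\.)? group makes the "www." prefix optional.
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$
RewriteRule (.*) http://example.com.au/$1 [R=301,L]
```

In a .htaccess file the path captured by (.*) has no leading slash, which is why the target URL adds one before $1.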

The internal rewrite is in the wrong place (currently third). It MUST be the very last rule in the list otherwise a non-www request will see the internally rewritten pointer exposed as a new URL back out on to the web.

Add a
# comment
to each code block explaining what it does. It will help you see the logic errors more clearly.
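
A commented outline of the ordering described above might look like the following. It is only a skeleton with placeholder comments, not working rules:

```apache
RewriteEngine on

# 1. External redirect, most specific:
#    requests for index.html (strip "index.html" from the URL)

# 2. External redirect, more general:
#    requests for any other .html page (strip the ".html" extension)

# 3. External redirect, most general:
#    requests on the wrong hostname (parked domain, unwanted www)

# 4. Internal rewrite, always last:
#    map the extensionless URL back to the .html file on disk
```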

You have no rule to redirect general non-www pages to www.

Are both domains hosted on the same server or separately?

[edited by: g1smd at 9:52 pm (utc) on May 13, 2011]

lucy24

9:51 pm on May 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Edit: Oops, overlapped other reply. Hope we said the same thing.

:: detour to verify that you actually own both domains (earlier post) ::

Where is each separate piece of code located? That is, in which .htaccess?

example.com and example.com.au are entirely different locations even if they have the same content and you own both of them. Once you've sent people from .com to .com.au, any further activity has to be in the .com.au .htaccess file.

I would think the simplest way is to start by sending your visitors to australia-- a single .htaccess command-- and then do any further rewrites once they get there.

Gral

10:24 pm on May 13, 2011 (gmt 0)

10+ Year Member



Wow great. I am doing this for my Dad in evenings after work and have gone from knowing nothing about html to building a site in 3 weeks so your help is much appreciated!

He owns the .com and .com.au domains. There is nothing at .com, it just gets redirected to .com.au so I only have 1 htaccess file in the root directory of my .com.au server. I don't know if the .com domain is hosted at all - I just sent my domain registration proof to my host and said I want .com to point to the .com.au site.

So I guess the .com to .com.au is a "redirect" and should therefore come 1st? I tried to combine the rules like you said... I tested it and it seemed to work. Would appreciate any feedback.

How does this look:

RewriteEngine on

#redirect dotcom to dotcomdotau
RewriteCond %{HTTP_HOST} ^(www\.)?example.com$
RewriteRule ^/?$ "http\:\/\/example\.com\.au\/" [R=301,L]


RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /tagword/([^/]+/)*index\.html
RewriteRule ^tagword/(([^/]+/)*)index\.html$ http://www.example.com.au/tagword/$1 [R=301,L]

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*[^.]+\.html([?#][^\ ]*)?\ HTTP/
RewriteRule ^(([^/]+/)*[^.]+)\.html$ http://example.com.au/$1 [R=301,L]

RewriteCond %{HTTP_HOST} ^www.example.com.au$
RewriteRule ^/?$ "http\:\/\/example\.com\.au" [R=301,L]

RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(([^/]+/)*[^.]+)$ /$1.html [L]

I have fixed my site links to point to / instead of index, but moving the 'remove index' rule to after the 'remove html' rule didn't stop index from appearing when I deliberately type it into the browser.

PS. I am sorry, I don't know what you mean by "escape" as I am really just copying code from places and trying to make sense of it. I know this is a shortcut but I have already had to learn html and css in 3 weeks and simply don't have enough time to learn mod rewrite before I need to get this site finished.

Thanks for your help. I think I'm getting warmer rather than colder, right?

And yes, you're right, www is no longer a problem at the domain root, but it can still appear at sub-pages. I don't know how to fix that.

[edited by: tedster at 12:41 pm (utc) on May 18, 2011]
[edit reason] switched to example.com [/edit]

g1smd

10:42 pm on May 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No. No. No. The domain and www redirects MUST be the LAST redirects (and be before the first (and in this case, your only) rewrite).

With your code if there is a request for a .com index page, then first there is a redirect to .com.au and then there is another redirect to remove the index. This is incorrect. You must avoid this redirection chain.

With the domain and www redirects positioned last, they should never get to run for index requests. Instead the index redirect removes the index and fixes the domain and www at the SAME TIME inside the one rule.

The domain and www rules therefore run only for requests OTHER THAN for index.html.

The unwanted escaping is all the added junk in the target URL in some of the rules. Simply,
http\:\/\/www\.example\.com\/
should be
http://www.example.com/
in the target URL.

You must escape literal periods in patterns so . becomes \. in the RegEx patterns.

The remove index.html rule must be BEFORE the remove .html rule, and it should force the www and domain name at the same time for those requests.

To redirect all pages of the site to www and to the right domain, change the ^/?$ pattern to (.*) and add /$1 to the target URL.
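
A hedged before-and-after sketch of that change, using the non-www .com.au target from the posted code (swap in whichever hostname is meant to be canonical):

```apache
# Before (commented out): only the bare root URL is redirected.
# RewriteCond %{HTTP_HOST} ^www\.example\.com\.au$
# RewriteRule ^/?$ http://example.com.au/ [R=301,L]

# After: every requested path is captured and carried over to the target.
RewriteCond %{HTTP_HOST} ^www\.example\.com\.au$
RewriteRule (.*) http://example.com.au/$1 [R=301,L]
```

With the second form, a request for www.example.com.au/about would be sent to example.com.au/about rather than being left on the www hostname.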

Use 'example' to stop the forum auto-linking the code.

Gral

11:43 pm on May 13, 2011 (gmt 0)

10+ Year Member



Fantastic! the www has completely gone from root and now sub-pages as well so I am making progress.

But I am still seeing index

I think I know what you mean about escaping the . in the regex patterns, but I'm not sure which dots it applies to. There are unescaped periods in the last 2 redirects, but those rules are working, so I'm not sure what to do there.

I am doing my best with this... am I getting closer?

RewriteEngine on

# index redirect(most specific)
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /tagword/([^/]+/)*index\.html
RewriteRule ^tagword/(([^/]+/)*)index\.html$ http://www.example.com.au/tagword/$1 [R=301,L]

# html redirect
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*[^.]+\.html([?#][^\ ]*)?\ HTTP/
RewriteRule ^(([^/]+/)*[^.]+)\.html$ http://example.com.au/$1 [R=301,L]

# dotcom to dotcomdotau redirect
RewriteCond %{HTTP_HOST} ^(www\.)?example.com$
RewriteRule ^/?$ "http://example.com.au/" [R=301,L]

#www to http redirect
RewriteCond %{HTTP_HOST} ^www.example.com.au$
RewriteRule ^/?$ "http://example.com.au" [R=301,L]

# internal rewrite
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(([^/]+/)*[^.]+)$ /$1.html [L]

Thanks again for your help.

[edited by: jdMorgan at 7:28 pm (utc) on May 27, 2011]
[edit reason] Example.com [/edit]

lucy24

12:55 am on May 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Escape" means
#1 what you do: precede the character with a backslash \
#2 what it means: "use this actual character, rather than its special regex meaning". Colon : semicolon ; and slash / have no special meaning, so you don't need to escape them. Most regex dialects don't really care if you escape something that doesn't need to be escaped, but it adds clutter.

\ itself has to be escaped: \\

. is a funny one because un-escaped it means "any one character" including . itself-- so it may actually not make any difference in the case of www addresses, unless you have some very weird typists among your visitors.

^ means two different things depending on context: either "the very beginning of the line" or (inside brackets) "not".

? and * and + must be escaped whenever you mean the literal character, because otherwise they mean "apply RegEx stuff to the preceding character" (you can look up the details elsewhere).

( ) [ ] must likewise be escaped when you mean them literally, because unescaped they group things or start a bracket expression.

In htaccess, a space inside a pattern has to be escaped because otherwise it means "I'm done with this piece of the rule, so move along to the next one". RegExes in other situations don't require this.

Inside of brackets, - and ^ and ] and \ need to be escaped if you want them to be seen as literal characters. Everything else including . and space has its literal meaning (though in htaccess the space still needs its backslash, for the reason above).
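
To make the escaping rules concrete, here is a small illustrative fragment. The hostnames are the thread's placeholders, and the lines are meant to be read rather than pasted in as a complete ruleset:

```apache
# Unescaped (commented out): each "." matches any one character,
# so a host such as "wwwXexample-com" would also match.
# RewriteCond %{HTTP_HOST} ^www.example.com$

# Escaped: each "\." matches only a literal period.
RewriteCond %{HTTP_HOST} ^www\.example\.com$

# The pattern escapes its period; the target URL is plain text,
# so its colons, slashes and periods are left alone.
RewriteRule ^(.*)\.html$ http://example.com.au/$1 [R=301,L]
```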

...

g1smd, can you explain in words of two syllables how users get from .com to .com.au in the first place if-- as Gral suggests-- .com doesn't exist at all? Or are we only dealing with people who are already at .com.au and need to be prevented from accidentally leaving?

:: looking vaguely around for "clueless" emoticon ::

Gral

1:11 am on May 14, 2011 (gmt 0)

10+ Year Member



Hi Lucy,

Maybe I don't know the right terminology... actually forget "maybe"... it's pretty obvious I have a lack of terminology :)

I set up a website at the .com.au domain. The host offers a service which they call "parked domains". I had to log into my .com domain name registrar and change the nameservers to point at the same nameservers as for my .com.au site, then use my host control panel to set up the parked domain. The result seems to be that anyone entering mydomain.com into their browser ends up at my .com.au site. I'm not sure if that's because of the code in my htaccess file or if the host is doing something additional, but it seems to work.

I am still digesting your comments about escaping and will go over my code and repost my next efforts at getting rid of index

Cheers for your help

lucy24

1:24 am on May 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



D'oh. I should have known that; mine does the same thing.

Matter of fact, your host might also have a "with or without www" option. (Mine goes three ways: accept both forms, add www or delete www.) If you can deal with it via a single clickbutton, you don't need to say anything in your own htaccess at all, so that's one headache gone.

Gral

1:51 am on May 14, 2011 (gmt 0)

10+ Year Member



yeah my host does have that www auto thing... but I think it just dumps code into my htaccess file, which I have since improved thanks to the help of this forum.

And I'm pretty sure the .com to .com.au rule was also written into my htaccess file by the host's control panel, which has also since been improved.

Righto, for "escaping" I assume it's only in the URL part of the lines that begin with RewriteCond. So I added \ before . in the URLs. Do I also need to add \ before . in the html rewrite, i.e. in the first line of the final rule?

Either way, this doesn't seem relevant for the index rule which is the part that is not working, so I'm at a bit of a loss as to what to do next.

RewriteEngine on

# index redirect most specific
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /tagword/([^/]+/)*index\.html
RewriteRule ^tagword/(([^/]+/)*)index\.html$ http://www.example.com.au/tagword/$1 [R=301,L]

# html redirect
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*[^.]+\.html([?#][^\ ]*)?\ HTTP/
RewriteRule ^(([^/]+/)*[^.]+)\.html$ http://example.com.au/$1 [R=301,L]

# dotcom to dotcomdotau redirect
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$
RewriteRule ^/?$ "http://example.com.au/" [R=301,L]

#www to http redirect
RewriteCond %{HTTP_HOST} ^www\.example\.com\.au$
RewriteRule ^/?$ "http://example.com.au" [R=301,L]

# internal rewrite
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(([^/]+/)*[^.]+)$ /$1.html [L]

[edited by: jdMorgan at 7:31 pm (utc) on May 27, 2011]
[edit reason] Example.com [/edit]

lucy24

4:09 am on May 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Bookmark this location if you haven't already:
Apache Rewrites [httpd.apache.org].

Are you sure you want THE_REQUEST? I would have thought REQUEST_URI and then you can dispense with the "[A-Z]+\ " at the beginning.

Can you explain in English what you want the first rule to do? All rewrites boil down to "if the user types x, quietly send them to y but let the browser's address bar continue to say x".

g1smd

7:34 am on May 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, you must use THE_REQUEST otherwise you might get an infinite rewrite loop when the external request is internally mapped to the index.html file and which therefore matches the index redirect pattern again.
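
A sketch of the difference, assuming the index redirect should apply to every folder and point at the non-www host:

```apache
# Unguarded (commented out): this tests the current internal path, so it can
# also fire after "/" has been mapped internally to "index.html", and the
# redirect then repeats.
# RewriteRule ^(([^/]+/)*)index\.html$ http://example.com.au/$1 [R=301,L]

# Guarded: THE_REQUEST holds the client's original request line, for example
# "GET /tagword/index.html HTTP/1.1", and internal rewrites do not change it.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html
RewriteRule ^(([^/]+/)*)index\.html$ http://example.com.au/$1 [R=301,L]
```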

For the
# dotcom to dotcomdotau redirect
remove the $ from the pattern so that it can redirect even if there is a port number present on the originally requested URL.

The pattern for the
#www to http redirect
is currently
^www\.example\.com\.au$
and therefore fails to redirect URL requests with a port number. Change that one pattern to be exactly
!^(example\.com\.au)?$
here, the
!
being "not".
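
Put together as a sketch, the negated host test could read as follows. The comment about an empty Host header is my reading of why the group is optional, so treat it as an assumption:

```apache
# Redirect any hostname that is not exactly "example.com.au":
# www, a port suffix, a trailing dot, the parked .com domain, and so on.
# The trailing "?" makes the group optional, so an empty Host header
# (as sent by some HTTP/1.0 clients) is not redirected in a loop.
RewriteCond %{HTTP_HOST} !^(example\.com\.au)?$
RewriteRule (.*) http://example.com.au/$1 [R=301,L]
```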

The rule order looks like it is correct.

You also need
DirectoryIndex index.html
at the very top of your code.

Look at your last two redirects. The target URL ends with
.au/
in one and
.au
in the other. Always include the trailing slash. The canonical URL is
example.com.au/
with a slash.

To fix the site for "all" pages, change both
^/?$
patterns to
(.*)
and put
/$1
on the end of the target URL, on BOTH of the bare domain redirects.


Your very first rule does not redirect
example.com/index.html
to
example.com/
because the ruleset requires
tagword/
to be present in the request. Instead,your second rule kicks in and redirects
example.com/index.html
to
example.com/index
. To fix this, you can safely delete
tagword/
from the condition and the rule of your first ruleset so that it correctly redirects ALL
index.html
requests.

You might want to change all occurrences of
\.html
in all patterns to
\.html?
so that both
.html
and
.htm
requests are redirected.
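
For illustration, the extension redirect with the optional "l" might look like this. It assumes the files on disk all end in .html, since the internal rewrite only ever adds that extension:

```apache
# "\.html?" matches both ".html" and ".htm" because the "?" makes the "l" optional.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*[^.]+\.html?([?#][^\ ]*)?\ HTTP/
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://example.com.au/$1 [R=301,L]
```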

The
.html
in the final condition does need to be escaped to
\.html
. Escape all periods in all RegEx patterns in the condition and the rule.

[edited by: jdMorgan at 7:32 pm (utc) on May 27, 2011]
[edit reason] Example.com [/edit]

Gral

8:48 am on May 14, 2011 (gmt 0)

10+ Year Member



OK I made all those changes but now the site doesn't load at all.

The only thing I don't need is the .htm thing. This is a brand new, simple site with only html. The only inbound links so far are links that I can control so I think just getting rid of .html will be enough.

This is the code I have but now nothing seems to work; the site doesn't load at all, it just waits forever:

DirectoryIndex index.html

RewriteEngine on

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html
RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com.au/$1 [R=301,L]

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*[^.]+\.html([?#][^\ ]*)?\ HTTP/
RewriteRule ^(([^/]+/)*[^.]+)\.html$ http://example.com.au/$1 [R=301,L]

RewriteCond %{HTTP_HOST} ^(www\.)?example\.com
RewriteRule (.*) "http://example.com.au/$1" [R=301,L]

RewriteCond %{HTTP_HOST} !^(example\.com\.au)?$
RewriteRule (.*) "http://example.com.au/$1" [R=301,L]

RewriteCond %{REQUEST_FILENAME}\.html -f
RewriteRule ^(([^/]+/)*[^.]+)$ /$1.html [L]


Did I go wrong somewhere?

g1smd

8:55 am on May 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's likely a simple typo somewhere. It might have resulted in an internal rewrite loop or an external redirect loop.

Does the Live HTTP Headers extension for Firefox give any clue?

The quotes around the target URLs can be deleted.

Use
^[A-Z]{3,9}\
in place of
^[A-Z]+\
in the second ruleset.

There are three places in the file where you have
[^.]+
and where I believe that should be
[^/.]+
instead.
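
A sketch of why the extra slash matters, using the internal rewrite as the example:

```apache
# [^.]+ also matches "/", so the pattern itself accepts a folder URL such as
# "tagword/"; only the -f check stops it being rewritten to "/tagword/.html".
# RewriteRule ^(([^/]+/)*[^.]+)$ /$1.html [L]

# [^/.]+ cannot cross a slash, so the final path segment must be a real name.
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(([^/]+/)*[^/.]+)$ /$1.html [L]
```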

I am trying to think about the logic of
RewriteCond %{REQUEST_FILENAME}\.html -f
as I have long suspected that was either not needed or is incorrect in some way. It slows requests down.

EDIT: Yes, that line is OK; "Rewrite only if the rewrite is going to be successful". Deleting it will make no difference, as an unsuccessful rewrite will correctly trigger a 404 error anyway.

g1smd

9:27 am on May 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Spotted it.

This line
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com

also matches requests for www.example.com.au
and so creates an infinite redirect loop. Change that line to:
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com(\.?(:[0-9]+)?)$


That was my mistake.
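
For reference, here is an untested sketch of the whole file with the corrections from this thread applied. It assumes the canonical site is the non-www example.com.au, so the index redirect targets that host directly instead of www, and it uses [^/.]+ in place of [^.]+ as suggested above:

```apache
DirectoryIndex index.html

RewriteEngine on

# 1. index.html redirect (most specific): strip "index.html" and fix the host in one hop
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html
RewriteRule ^(([^/]+/)*)index\.html$ http://example.com.au/$1 [R=301,L]

# 2. .html redirect: strip the extension from directly requested pages
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*[^/.]+\.html([?#][^\ ]*)?\ HTTP/
RewriteRule ^(([^/]+/)*[^/.]+)\.html$ http://example.com.au/$1 [R=301,L]

# 3. parked .com domain (with or without www, trailing dot or port) to .com.au
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com(\.?(:[0-9]+)?)$
RewriteRule (.*) http://example.com.au/$1 [R=301,L]

# 4. any other non-canonical hostname (www, port number) to example.com.au
RewriteCond %{HTTP_HOST} !^(example\.com\.au)?$
RewriteRule (.*) http://example.com.au/$1 [R=301,L]

# 5. internal rewrite (always last): serve page.html for an extensionless URL
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(([^/]+/)*[^/.]+)$ /$1.html [L]
```

Block 4 already covers the hosts that block 3 matches, so block 3 is kept here only because the thread treats the parked domain as its own step.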

lucy24

8:23 pm on May 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"You also need DirectoryIndex index.html at the very top of your code."

Aha, you're writing like a grownup. (I use the term figuratively: for aught I know, you're closer to my son's age.)

If the OP's host is anything like mine, they've got a sort of uber-htaccess that includes certain built-in decisions. They're not in your own .htaccess, they're somewhere Up Above. Depending on host, you may or may not be allowed to change them. Or you may be allowed to, but only if you ask nicely. (Consider the recent unrelated post about adding a favicon.) So find the fine print-- again, if it's anything like mine it will be scattered all over the map-- and read it.

The uber-.htaccess might include:
-- the with-or-without-www option
-- the handling of index files. Mine looks for "index.html" "index.htm" and "index.php" in that order. By default, auto-generated indexing is on, but you can turn it off at any level. I don't know whether you can manually add other forms like main.html or index.jsp.
-- infinite loops: mine's got the Apache-recommended stop at ten [httpd.apache.org] option, after which it gives up and serves a 500 or possibly 503 instead. (I know that this has to be in the uber-.htaccess because I don't have it myself, but I've spotted it in my error logs. Oops.)

g1smd

8:49 pm on May 14, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, some things are usually/often taken care of in the main hosting configuration that comes with the account.

I usually still add those, as I do remember the time I moved a site to another host and everything broke because something that "hosts usually take care of" wasn't taken care of.

Anyway, the last problem that OP needs to fix is the typo that I introduced into the code a few posts back. That's the problem with editing code without testing it.

And on the subject of Server Errors. As jdMorgan would say: "Error 500? Great, only 499 to go...".

lucy24

3:57 am on May 18, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I had a belated "D'oh!" moment pertaining to this and a couple of other recent threads in this forum. Posting here as a reminder to myself for next time it comes up.

The OPs (all of them) are not being silly by escaping all those slashes \/ and they're not doing it to annoy us. They've got the behavior internalized because you do have to escape slashes in JavaScript, at least in some contexts (inside a regex literal, for instance). Also presumably semicolons. Don't remember about colons.