Forum Moderators: phranque


Help with .htaccess

Redirect extensionless page URLs to add .html extension


sjedwardz

9:17 pm on Nov 1, 2009 (gmt 0)

10+ Year Member



Hi - I did some SEO on a specific web page, but on some of the links I forgot to add the ".html" extension.

Now if I check my backlinks in Yahoo Site Explorer, I get some with the .html and some without!

If you click the links or enter the address without the .html extension, it works fine and goes to the same page, but I'd like to get the backlinks combined.

So how do I do that with .htaccess? I've been trying for the last few hours and have been pulling my hair out!

Here is an example to help:

www.domain.com/dir/page1 and
www.domain.com/dir/page1.html

What would the code be to convert www.domain.com/dir/page1 to www.domain.com/dir/page1.html?

Thanks

Shaun

jdMorgan

9:30 pm on Nov 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Please post your best-effort code as a basis for discussion.

Thanks,
Jim

sjedwardz

9:58 pm on Nov 1, 2009 (gmt 0)

10+ Year Member



OK - this is one of them! I've tried a few variations, but to no avail:

RewriteEngine On
RewriteRule photography/wedding-photo1$ [domain.com...]

jdMorgan

11:24 pm on Nov 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Because you did not specify a 301 redirect, the result will be a 302, which won't tell search engines to 'fix' their listings to reference the new URL. Add [R=301,L] to the end of your rule.

I also suggest that you start-anchor your RewriteRule pattern.

If you end up with more than a few dozen of these redirects, there are 'generic' solutions available to do what you need to do for any extensionless URL using a few additional RewriteConds.

Jim

sjedwardz

11:45 pm on Nov 1, 2009 (gmt 0)

10+ Year Member



Thanks for that - will give it a try tomorrow as it's late here.

Can you just clarify what you mean by "start-anchor your RewriteRule pattern"?

As you can tell I'm a noob on this!

jdMorgan

12:03 am on Nov 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your pattern is end-anchored with a "$" but not start-anchored with a "^". Therefore, any URL ending with "photography/wedding-photo1" (e.g. in *any* directory on your site) will be redirected.
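For example, a fully-anchored version of the earlier rule might look like this (a sketch only -- substitute your own host and paths):

RewriteRule ^photography/wedding-photo1$ http://www.example.com/photography/wedding-photo1.html [R=301,L]

With the leading "^", only a URL-path that begins with "photography/" can match, rather than any path that happens to end with that string.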

See the regular-expressions tutorial cited in our Forum Charter for a lot of useful information. Please don't use mod_rewrite and regular expressions without understanding; As you can see, very small omissions or errors can have serious consequences for your site's ranking and function. Server config code like this is most definitely *not* a copy-and-paste proposition...

Jim

sjedwardz

12:38 am on Nov 2, 2009 (gmt 0)

10+ Year Member



Thanks,

I did have a ^ on one version but hadn't done the [R=301,L] bit.

Btw, if I type the URL without the .html in the browser, should I expect the URL to change once the page has loaded?

It does this for my redirect that adds www before the domain name.

jdMorgan

5:22 pm on Nov 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, if you use the external redirect syntax in your rule, then by definition, the client state changes if it opts to follow the redirect response from your server. So the browser address bar will update.

You didn't mention other rules. If you're having problems with other rules, then be aware that rule order is important; You should order your rules with all external redirects (using the [R=30x] flags and/or specifying a full URL starting with "http" or "https") placed first, and ordered from most-specific patterns and conditions (fewest URLs affected) to least-specific, followed by all of your internal rewrites, again in order from most- to least-specific.

Doing this will prevent two problems: It will prevent multiple/chained/stacked redirects resulting from a single client request, and it will prevent having an external redirect 'expose' an internally-rewritten filepath as a URL.
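As a rough sketch of that ordering (all paths and filenames here are hypothetical):

# 1) Page-specific external redirects, most-specific first
RewriteRule ^old-page$ http://www.example.com/new-page.html [R=301,L]
# 2) Canonical-hostname (www) redirect
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
# 3) Internal rewrites last, again most-specific first
RewriteRule ^widgets/([^/]+)$ /widget.php?name=$1 [L]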

Remember to always use an [L] flag on every rule unless you know why you don't want to, and to always completely-flush (delete) your browser cache before testing any new server-side code.

You want to verify that *any* 'incorrect' URL is redirected straight to the correct URL in one single step: One redirect no matter how many 'problems' the requested URL has... Use a server headers checker to verify this.

Jim

sjedwardz

11:22 pm on Nov 2, 2009 (gmt 0)

10+ Year Member



Thanks for that, but I still can't get it to work! Here is my exact .htaccess:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^example.com [NC]
RewriteRule ^(.*)$ http://www.example.com [L,R=301]
RewriteRule ^photography/wedding-photographer-surrey$ http://www.example.com/photography/wedding-photographer-surrey.html [R=301,L]

It's just not picking it up, even if I change the redirect URL to a completely different page!

[edited by: jdMorgan at 11:28 pm (utc) on Nov. 2, 2009]
[edit reason] example.com [/edit]

sjedwardz

11:24 pm on Nov 2, 2009 (gmt 0)

10+ Year Member



Somebody mentioned that there is a generic rewrite that will work on all extensionless URLs, but I haven't been able to find it.

jdMorgan

11:27 pm on Nov 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have you tried a single, very simple redirect before jumping into a complicated multi-rule test?

Comment-out all the other RewriteCond and RewriteRule lines, and try something trivial like:


RewriteEngine on
RewriteRule ^foo$ http://www.google.com/ [R=301,L]

Request the URL-path /foo from your server, and you should land at Google.

If that works, we can address the rules you posted (which are in the wrong order, BTW).

Jim

sjedwardz

11:40 pm on Nov 2, 2009 (gmt 0)

10+ Year Member



jd -

did what you suggested i.e.

RewriteEngine on
RewriteRule ^foo$ [google.com...] [R=301,L]

and that worked! But I can't for the life of me get it to do what I want!

sjedwardz

11:44 pm on Nov 2, 2009 (gmt 0)

10+ Year Member



Put in this

RewriteRule ^photography/wedding-photographer-surrey$ [google.com...] [R=301,L]
RewriteRule ^foo$ [google.com...] [R=301,L]

But if I enter http://www.example.com/photography/wedding-photographer-surrey into browser it doesn't go to google, but www.example.com/foo does...

[edited by: jdMorgan at 11:46 pm (utc) on Nov. 2, 2009]
[edit reason] example.com [/edit]

jdMorgan

11:49 pm on Nov 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try
 Options -MultiViews 

then.

Have you got another .htaccess file in the /photography directory, by any chance?

Don't post your own domain unless you want searches for that name to result in this thread out-ranking your site... Please see our Terms of Service and this forum's Charter.

Jim

sjedwardz

12:12 am on Nov 3, 2009 (gmt 0)

10+ Year Member



I'll check tomorrow as it's late here.

I didn't realise about not posting my own domain; I didn't put it in the first few posts and only put it in the last one to show exactly what I was doing.

sjedwardz

8:39 am on Nov 3, 2009 (gmt 0)

10+ Year Member



Brilliant - the Options -MultiViews worked -- you are a star!

g1smd

10:00 am on Nov 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Now to address the rule ordering in order to avoid redirection chains.

sjedwardz

11:04 am on Nov 3, 2009 (gmt 0)

10+ Year Member



Ok - what order should they be in?

TheMadScientist

11:15 am on Nov 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As a rule, mine anyway, unless I know I need to break it:

1.) Exclusions from redirects & rewrites come first.
EG RewriteRule \.(txt|js|css|gif|jpg)$ - [L]

2.) 'Page Specific' External Redirects go second, because they can contain your canonicalization.

3.) Canonicalization comes third.

4.) Internal Rewrites are fourth.

RewriteRule ^photography/wedding-photographer-surrey$ http://www.example.com/photography/wedding-photographer-surrey.html [R=301,L]

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

The way your rules were ordered if a person requested:
example.com/photography/wedding-photographer-surrey they would initially be sent to: www.example.com/photography/wedding-photographer-surrey, then to www.example.com/photography/wedding-photographer-surrey.html, which is two redirects in a row, or 'stacked' redirects. A single redirect will pass link weight, where a 'chain' or 'stack' or 'more than one' will not, so by reversing the order and taking care of the www. non-www. issue at the same time, you can continue passing link weight and save some processing time, because there is only one redirect.

I also edited your canonicalization ruleset a bit to redirect anything that is NOT www.example.com (or empty) to www.example.com... You may need to leave it the way you had it, but this version (if you have wildcard domains enabled) will redirect ww.example.com and wwww.example.com to the correct location, so it works a bit better as a 'courtesy' to visitors if you can use it... As long as you don't have other sub-domains set up, it should be fine. If you do have other sub-domains, it can still be adjusted to exclude those.
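If you did have a real sub-domain to preserve, the adjustment might look like this (a sketch; "blog" is a made-up sub-domain name):

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteCond %{HTTP_HOST} !^blog\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]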

mark_roach

11:52 am on Nov 3, 2009 (gmt 0)

10+ Year Member



Nice post, MadScientist. You have explained that very clearly, and it has highlighted a minor problem I have had for some time.

I use a .htaccess in a lot of my directories for internal and external redirecting. Externals first and internals second and it works just fine.

I also handle canonicalization for all urls by placing the rules you have used in your post in my server root directory.

However as it stands this can create a chain of redirects.

eg. mydomain.com/oldpage/
to www.mydomain.com/oldpage/
to www.mydomain.com/newpage/

I can see if I move the rules to handle canonicalization from the root to each individual folder this chain will be eliminated.

But this will then cause any URLs in the root to not be redirected.

I can think of two possible solutions.

One, which I am sure will work, would be to code the rules in the root to be more specific and only redirect the root itself and URLs in the root directory.

Alternatively, and this is what I am unsure about, could I leave everything as-is and simply remove the [L] flag from the redirect in the root directory?

TheMadScientist

12:11 pm on Nov 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Personally, I only ever use a single .htaccess in the root...

If you move your 'directory specific' rules to the root, you can 'skip' through them fairly efficiently and not have to worry about the issue. What I mean is:

# S=NUM is the number of rules
# (not counting conditions) in
# the ruleset for each Directory
# I'll pretend you have 5 rules
# in each directory...

RewriteRule !^Directory1 - [S=5]
# Directory 1 Rules Here

RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/

RewriteRule !^Directory2 - [S=11]
RewriteRule !^[^/]{10}/SubDir - [S=5]

# Directory2/SubDir Rules Here
RewriteRule ^[^/]{10}/[^/]{6}/
RewriteRule ^[^/]{10}/[^/]{6}/
RewriteRule ^[^/]{10}/[^/]{6}/
RewriteRule ^[^/]{10}/[^/]{6}/
RewriteRule ^[^/]{10}/[^/]{6}/

# Directory2 Rules Here
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/

Basically, if you set your file up right (and can count) you can have all your specific rulesets in the root file and if you order them correctly and put some thought into 'finding matches and skipping to rulesets' you can still be very efficient. Here's another example:

# If it's not Directory 1 or 2 Skip 'em all
RewriteRule !^Directory(1|2) - [S=17]

# If it's not Directory 1 we know it's 2
RewriteRule !^[^.]{9}1 - [S=5]

# Directory 1 Rules Here
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/

# We know it's Directory 2, so if it's not the specific sub, skip 5
RewriteRule !^[^/]{10}/SubDir - [S=5]

# Directory2/SubDir Rules Here
RewriteRule ^[^/]{10}/[^/]{6}/
RewriteRule ^[^/]{10}/[^/]{6}/
RewriteRule ^[^/]{10}/[^/]{6}/
RewriteRule ^[^/]{10}/[^/]{6}/
RewriteRule ^[^/]{10}/[^/]{6}/

# We already know it's 2, so just run these
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/
RewriteRule ^[^/]{10}/

jdMorgan

2:06 pm on Nov 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I suggest using [S=] skip rules only for very small skip counts (one or two or three) when possible, especially if on-going additions or changes to the site might affect the rules being skipped. For example, if you add a rule to any of the 'sections' in the code above and forget to update the skip count, there's a real danger of breaking your site and doing serious search-ranking damage. Even though it may seem less efficient, a better long-term solution is to add the 'exclusion' to each 'section' as a RewriteCond. This avoids the potential danger.

Take advantage of the fact that no RewriteConds are processed unless the RewriteRule pattern matches; By making the pattern very specific, a lot of wasted effort can be avoided. Also, attention to the order of the RewriteConds can help; put the RewriteCond most likely to cause the rule to be skipped first. The only exception to this rule of thumb is for RewriteConds that do file-exists checks or reverse-DNS lookups; Because they are horribly CPU-intensive, they should always be last.
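As a sketch of that condition ordering (hypothetical paths and script name; the cheap query-string test runs first, the CPU-intensive file-exists check last):

RewriteCond %{QUERY_STRING} ^$
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^downloads/([^/]+)$ /fetch.php?file=$1 [L]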

I'm not sure what the difficulty being addressed with this solution is. A simpler solution is to put all redirects in the root .htaccess file, leaving only internal rewrites in subdirectory .htaccess files if needed.

Jim

TheMadScientist

3:29 pm on Nov 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think jdMorgan and I will have to sort of agree, but sort of disagree at the same time... jdMorgan's is easier to manage, but depending on how you structure things and how your site is structured, mine could be quite a bit more efficient...

jdMorgan suggests putting the exclusions in the condition:

RewriteCond %{REQUEST_URI} ^[^/]{10}/stuff-to-match
RewriteRule ^Directory1

RewriteCond %{REQUEST_URI} ^[^/]{10}/[^/]{6}/stuff-to-match
RewriteRule ^Directory2/SubDir/

RewriteCond %{REQUEST_URI} ^[^/]{10}/stuff-to-match
RewriteRule ^Directory2

Here's the main difference I see and where your specific setting comes in to play... (If I'm understanding his post correctly.)

If your directory names are 'essentially the same' or if you are matching multiple sub-directories past one or two levels and use 'very specific' (as specific as possible) rules to eliminate errors EG

RewriteCond %{REQUEST_URI} ^[^.]{36}/page-to-match1.html
RewriteCond %{REQUEST_URI} ^[^.]{36}/page-to-match2.html
RewriteCond %{REQUEST_URI} ^[^.]{36}/page-to-match3.html
RewriteRule ^directory/subdir/subsubdir/one-more/ http://www.example.com/some-redirect [R=301,L]

RewriteCond %{REQUEST_URI} ^[^.]{36}/another-to-match1.html
RewriteCond %{REQUEST_URI} ^[^.]{36}/another-to-match2.html
RewriteCond %{REQUEST_URI} ^[^.]{36}/another-to-match3.html
RewriteRule ^directory/subdir/subsubdir/another/ http://www.example.com/some-other-redirect [R=301,L]

RewriteCond %{REQUEST_URI} ^[^.]{35}/third-to-match1.html
RewriteCond %{REQUEST_URI} ^[^.]{35}/third-to-match2.html
RewriteCond %{REQUEST_URI} ^[^.]{35}/third-to-match3.html
RewriteRule ^directory/subdir/subsubdir/a-third/ http://www.example.com/some-third-redirect [R=301,L]

# Keep in mind the preceding is more of an example of the rule-matching necessary; the conditions could all be combined into a single condition, but I wanted to make sure the code is 'more readable'. The main point I'm making is WRT the matching necessary for the rules.

Above, you match the first 27 characters before the pattern is broken and move to the second possible match, then match 29 characters to get to the 3rd rule. With mine, you match the minimal number of characters, and I usually try to only match each section of a URL 'specifically' once...

I have .htaccess files I manage that skip 50+ rules. Those requests would have to match at least a portion of the skipped rules if I did not find a way to skip past them with a single match / no-match, and even if I switched from rules to conditions I would still have a good number of rules to partially match.

I can see jdMorgan's point, where his way is more 'error-proof' and not too much less efficient, but personally I try to use the most efficient way I can, which means on sites I manage (and usually own) I have to pay more attention to what I'm doing than most people.

Like I said, the difference in efficiency is highly dependent on your URL structure. If the main directories all start with a different character, then the pattern is broken after a single character and there's not really a reason to use the less 'management friendly' version I posted. But if you have to match the first 27+ characters before you break the pattern, have to redirect 20+ sub-directories for some reason (rather than the 3 in my example), and have traffic, you might consider something that takes more attention to manage, because by combining all the redirects in one file, rather than in the sub-directory .htaccess files, you could otherwise be adding significantly to the processing, depending on your exact situation.

IOW: Here's what I see as the differences in our posts, and you have to evaluate your situation yourself... If it's not a high-traffic site, or you can break the matching patterns easily, or 'a number of other good reasons go here'... use jdMorgan's way. If you're a bit more risk-tolerant, speed and efficiency are absolutely essential, you have super-long URLs you will have to match repeatedly in a number of rules, etc.... you might consider mine.

<aside>
One of the cool things about coding and scripting is where there are two people writing code there are usually 4 opinions on how to do it. :)
</aside>

mark_roach

4:22 pm on Nov 3, 2009 (gmt 0)

10+ Year Member



Wow, quite a lot of stuff to digest, but I think I am still following :-)

I think in my specific case the safer method suggested by jdMorgan is the way forward.

I have about 10 directories and for most the first character differs with the worst 2 directories having 5 characters that match.

Since the purpose of the majority of my external redirects is simply to remove the file extension, I suspect I could add one RewriteCond/RewriteRule pair to the .htaccess in the root to effect this change.

I would precede this rule with the more specific re-writes and follow it with rules to handle canonicalization.

I can then leave the slightly more complicated internal re-writes in their own directory specific .htaccess files.

Would I be right in assuming that the internal rewrites would always be processed last, and that I would never be at risk of exposing any of my internal filepaths?

I should also add that I only use .htaccess files while I am developing, as I find it easier to make changes. In my production environment I put all these directives in the httpd.conf file. Does this negate any of the performance overhead of having directory-specific rules?

jdMorgan

4:25 pm on Nov 3, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually, I was not necessarily suggesting adding a bunch of RewriteConds, but rather, making the rule patterns as specific as possible in order to avoid both redundant RewriteConds and difficult-to-maintain "skip rules."

In the examples above, where we're using "directory1" and "directory2", etc., there's a possible red herring, in that it appears that the matching engine would have to match all of "directory" in each case before getting to "1" and deciding that no match was present and that further processing was unnecessary. But real directory names may be quite different, and if ordered by "shortest match first," a large gain in performance is possible if the real directory names are more like "able" and "carla" and "charlie" -- Here, a decision can be taken after only one or two characters.

There's also another way to do it, sort of halfway between the methods already discussed: You can use a RewriteCond "lookup table" approach:


RewriteCond $1>replacement-path1 ^old-URL-path1-pattern>(.+)$ [OR]
RewriteCond $1>replacement-path2 ^old-URL-path2-pattern>(.+)$ [OR]
RewriteCond $1>replacement-path3 ^old-URL-path3-pattern>(.+)$
RewriteRule ^directory1/(.+)$ /%1 [L]

Note that I'm using ">" only as a sort of "soft anchor" to demarcate the end of one concatenated variable and the beginning of the other; The character itself has no special meaning, but won't be present in unescaped form in any valid URL-path. Using this 'anchor' prevents ambiguous matches, but comes in most handy when only a partial match is sought or needed, such as in
 RewriteCond $1>Replacement-path1 ^old-URL-path1-pattern-prefix[^>]+>(.+)$ [OR] 

And of course, if you've got server config-level access, a RewriteMap would be even better.
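A RewriteMap sketch, for server-config level only (the map name and file location are hypothetical):

# In httpd.conf or a vhost, not .htaccess:
RewriteMap legacymap txt:/etc/apache2/legacy-redirects.txt
# Redirect only when the map returns a non-empty replacement
RewriteCond ${legacymap:$1} !^$
RewriteRule ^directory1/(.+)$ ${legacymap:$1} [R=301,L]

where each line of legacy-redirects.txt holds an "old-path new-path" pair.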

However, I agree that each Webmaster must evaluate all possible techniques and decide for him/herself which is best in any given circumstance. Mod_rewrite code isn't something to be written slap-dash, installed and forgotten; A lot of thought and analysis should go into it first, and maximizing efficiency is a major consideration. I just intended to point out that maintainability is another such consideration.

Jim

TheMadScientist

2:17 am on Nov 4, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@ jdMorgan:
Thanks for the clarification!

@ mark_roach
Safer is *usually* better, so go with what works best for you and your situation, erring on the side of caution, unless you really know what you are doing and have reason to do otherwise.

In the examples above, where we're using "directory1" and "directory2", etc., there's a possible red herring, in that it appears that the matching engine would have to match all of "directory" in each case before getting to "1" and deciding that no match was present and that further processing was unnecessary. But real directory names may be quite different, and if ordered by "shortest match first," a large gain in performance is possible if the real directory names are more like "able" and "carla" and "charlie" -- Here, a decision can be taken after only one or two characters.

Definitely, and where performance is only minimally impacted, then I definitely think easier to manage is better, because the performance impact will probably not be noticeable to the end user and should not impact the server to a great extent either.

One of the cool things about coding and scripting is where there are two people writing code there are usually 4 opinions on how to do it. :)

See what I mean... There are many different ways to arrive at the same result, and which you choose is really up to you and your situation. I could probably come up with a couple more if necessary, and I'm sure jdMorgan could.

jdMorgan

2:36 am on Nov 4, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



External redirects in the root-level .htaccess file will be processed first only if there are internal rewrites following them in the root-level .htaccess file, or if there are no external redirects in the subdirectory .htaccess file(s).

The server starts at the root .htaccess file, processes the rewriterules in order, then goes to the next-lower subdirectory in the path to the requested file, and processes those rules in order, and continues this until it runs out of subdirectories "above" the requested file to look at.

So make sure that when seen from that viewpoint, there are no internal rewrites preceding any external redirects. As long as that is the case, then internal filepath exposure is not a concern.

Jim

sjedwardz

4:09 pm on Dec 5, 2009 (gmt 0)

10+ Year Member



Sorry to resurrect this thread!
Thanks for all of the help in getting it fixed for me. That was just a one-off, though.

What would I need to do so ANY URL that does not have a ".html" extension gets redirected to one that does?

jdMorgan

6:50 pm on Dec 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try a search for "extensionless URLs RewriteCond %{REQUEST_FILENAME}" -- That should point you to several very relevant and detailed previous threads.
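Those threads generally converge on something like the following (a sketch only -- it assumes the target .html file actually exists on disk, and "www.example.com" is a placeholder):

# Only for URLs with no extension...
RewriteCond %{REQUEST_URI} !\.[a-zA-Z0-9]+$
# ...and only when the corresponding .html file exists on disk
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.+)$ http://www.example.com/$1.html [R=301,L]

The file-exists check prevents redirecting directories and nonsense URLs into .html requests that would then return 404.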

Jim

g1smd

9:59 pm on Dec 5, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Should that also apply to images, CSS files, robots.txt and other files?

Think carefully - you don't mean "any URL" here at all, do you?