Forum Moderators: Robert Charlton & goodroi
For the sake of Google, last year I changed my .htaccess file so that http://example.com redirects to the www version.
Should I also be redirecting www.example.com/index.html to just www.example.com? I have changed my internal links to remove "index.html", but I was wondering if I should redirect it completely as well.
If so, how do I do this?
[edited by: tedster at 7:24 pm (utc) on Oct. 2, 2006]
[edit reason] use example.com [/edit]
RewriteCond %{THE_REQUEST} ^.*\/index\.html?
RewriteRule ^(.*)index\.html?$ http://www.domain.com/$1 [R=301,L]
OR
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html?\ HTTP/
RewriteRule ^(.*)index\.html?$ http://www.domain.com/$1 [R=301,L]
There is no infinite loop. When you ask for / the server gets the index page, whatever it is called, without telling you what it is actually called.
In .htaccess:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html
RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]
Jim
Yet I am not too familiar with Apache and .htaccess: there is no problem in having TWO such rewrite blocks, is there (one for the canonical www/non-www redirect and one for this index.html redirect)?
For the past few weeks I have been wondering why Google has indexed some pages, created in the summer, with the same PageRank as the domain itself, even though almost all backlinks point to the bare domain name. Even the current TBPR update did not seem to change that. I will now see whether this PR dilution between domain.de and domain.de/index.html was the reason.
However, those changes were made only for search engines, not for my visitors.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.net [NC]
RewriteRule (.*) http://www.example.net/$1 [R=301,L]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.htm
RewriteRule ^index\.htm$ http://www.example.net/ [R=301,L]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.htm
I pondered that a bit, wondering why it matches from 3 to 9 letters. I guess this is an attempt to match the HTTP method (GET, POST, OPTIONS, and so on) at the start of THE_REQUEST.
But why use THE_REQUEST in the first place? Why not just use REQUEST_URI?
This condition also doesn't address multiple directories containing an index.html - only the root instance.
This seems to deal with multiple index.html's on your site:
RewriteRule ^(.*/)index\.html$ $1 [L,R=301]
No need for a RewriteCond at all.
Try your RewriteRule-only code: it will create an infinite redirection loop, due to its interaction with DirectoryIndex, as described in my first post.
It is true that the version I posted only redirects requests in the root directory. However, g1smd posted two variants that handle all subdirectories as well, if that is needed.
Jim
My sample code redirects both index.htm and index.html in both the root and in folders. It preserves the folder name in the redirect too.
One change to your code: you need to do the index redirect first, so that all */index requests are redirected to www.domain.com/*/ in a single hop.
If you do it the other way round, domain.com/*/index first gets redirected to www.domain.com/*/index, and only then on to www.domain.com/*/. You need to avoid a redirection chain like that.
Do the index-to-www-non-index redirect first (because that one works for all index pages, whether www or non-www, and whether in the root or in a folder), and only after that test for non-www and redirect everything that remains to www.
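Put together in that order, the earlier example.net ruleset would look something like this (a sketch only; example.net is the thread's placeholder, and the index pattern here covers both .htm and .html, in the root and in folders):

```apache
RewriteEngine On

# 1. Index redirect first: any request for /index.htm or /index.html,
#    in the root or in a folder, www or non-www, goes straight to the
#    canonical www folder URL in a single hop.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html?\ HTTP/
RewriteRule ^(.*/)?index\.html?$ http://www.example.net/$1 [R=301,L]

# 2. Only then canonicalise whatever remains from non-www to www.
RewriteCond %{HTTP_HOST} ^example\.net [NC]
RewriteRule (.*) http://www.example.net/$1 [R=301,L]
```

With this ordering, a request for http://example.net/folder/index.html is answered with a single 301 to http://www.example.net/folder/ instead of a two-step chain.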
RewriteRule ^(.*/)index\.html$ http://example.com$1 [L,R=301]
(I use the non-www convention.)
Follow this with your secondary-site redirection rule.
Since you are doing a redirect anyway when you find index.html, there is no harm in adding "http://example.com" to the front of the redirect target, whether it is needed or not.
Still not grokking the "infinite loop" issue, as I haven't seen it. I've only addressed index.html, though, not index.htm, index.php, etc. I'm not sure those are too important.
My goal is to have "clean" URLs. The only reason for redirecting index.html, if your internal links are written using just the directories, is that many people assume its existence and will automatically stick it in when hand-constructing a link. You want to gently correct those links back to the clean version.
I notice some sites go one way or the other with this. Yahoo doesn't bother to redirect, nor does Google. I've seen sites that redirect from / to /index.html.
Wikipedia redirects from / to /wiki/Main_Page. This is pretty typical for CMSs, and makes some sense. Wikipedia wants a permanent URL for the wiki, but allows for the possibility of sticking some other home page in front of it in the future.
If all of those are true, you'll get a loop.
So, a simple example (omitting other config/setup directives) would be:
DirectoryIndex index.html
RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]
The cure is to use THE_REQUEST to verify that the client originally requested /index.html before invoking the redirect. This breaks the loop, because THE_REQUEST, unlike REQUEST_URI, is not updated by the action of mod_dir.
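As a sketch, the loop-free version of the simple example above adds that one condition (example.com as before):

```apache
DirectoryIndex index.html
RewriteEngine On

# THE_REQUEST still contains the request line the client actually sent
# (e.g. "GET /index.html HTTP/1.1"). When mod_dir internally maps a
# request for "/" onto /index.html, THE_REQUEST remains "GET / ...",
# this condition fails, and no redirect loop occurs.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html
RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]
```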
Many, if not most, webmasters are on shared hosting and are stuck with .htaccess solutions. This is one of the differences between the .htaccess context and that of httpd.conf or conf.d. Since I answer many more .htaccess questions, I tend to forget to explicitly mention or describe the differences. Hopefully that's not the point of confusion here.
Jim
As for other internal pages, you can use full links like "http://www.domain.com/folder/that.page.html" or you can use simpler links like "/folder/that.page.html" that BEGIN with a "/" each time, combined with the <base> tag like this: <base href="http://www.domain.com/"> where that tag appears once in the head section of each and every page of the site.
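As a minimal illustration of that pattern (the domain and file names are the thread's own placeholders): a link beginning with "/" is resolved against the host of the <base> URL, with the whole path replaced.

```html
<!-- One page of the site; the <base> tag appears once in each head section. -->
<html>
<head>
  <base href="http://www.domain.com/">
</head>
<body>
  <!-- Server-relative link; with the base above it resolves to
       http://www.domain.com/folder/that.page.html -->
  <a href="/folder/that.page.html">That page</a>
</body>
</html>
```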
Also
As for other internal pages, you can use full links like "http://www.domain.com/folder/that.page.html" or you can use simpler links like "/folder/that.page.html" that BEGIN with a "/" each time, combined with the <base> tag like this: <base href="http://www.domain.com/"> where that tag appears once in the head section of each and every page of the site.
Wouldn't the links then be www.domain.com//folder/that.page.html, with two slashes, which would be a 404?
The intended use of the <base> tag is as a shortcut to some directory other than the root.
For example, perhaps a page contains a lot of images that are not in the current directory, or maybe not even on the current site. Let's say they are all in www.example2.com/project22/results/images.
You could use a <base> tag:
<base href="http://www.example2.com/project22/results/images/" />
Now you can refer to the images simply by their file names, without needing to add all that stuff in front of them.
Taking your specific question, there are three main ways you can link to an object on a page (such as an image) or to another page on your site.
The client (browser or SE robot) will resolve these links as follows:
<img src="blue_widget.gif"> (Page-relative path) Remove the 'file name' from current URL (in browser address bar) and add "blue_widget.gif"
<img src="/red_widget.gif"> (Server-relative path) Remove entire local URL from current URL (leaving only domain name), and add "red_widget.gif"
<img src="http://example.com/green_widget.gif"> (Canonical URL) Use this URL to get the object, disregarding the current URL.
Using the page-relative linking method, you can also remove and add directory levels as desired:
<img src="../../images/blue_widget.gif"> (Page-relative path) Remove two subdirectory levels and the 'file name' from the current URL and add "images/blue_widget.gif"
The methods above will give the stated results, without injecting any extra slashes into the URL.
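A worked example of the three methods (the shop/widgets path is hypothetical, chosen only to show the resolution):

```html
<!-- Current page: http://www.example.com/shop/widgets/page.html -->

<!-- Page-relative: http://www.example.com/shop/widgets/blue_widget.gif -->
<img src="blue_widget.gif">

<!-- Server-relative: http://www.example.com/red_widget.gif -->
<img src="/red_widget.gif">

<!-- Canonical: used exactly as written -->
<img src="http://example.com/green_widget.gif">

<!-- Page-relative with directory levels removed:
     http://www.example.com/images/blue_widget.gif -->
<img src="../../images/blue_widget.gif">
```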
Jim
[edited by: jdMorgan at 9:44 pm (utc) on Oct. 6, 2006]
<img src="Images/logo.gif" alt="description" width="371" height="59" longdesc="http://www.domain.com" />
<img src="http://www.domain.com/Images/logo.gif" alt="description" width="371" height="59" longdesc="http://www.domain.com" />
More immune to duplicate-content issues. If all internal links include the www, then even without a non-www-to-www redirect it is much harder for any non-www URLs to be indexed and then "stick".
More immune to URL hijacking, which was very common a year or two back. That was where people would point multiple dodgy 302 redirects at your site and get your content listed at their URL.