I was thinking of setting both to be directory indexes, then 301'ing the incorrect URL to the proper URL. Is that OK to do?
DirectoryIndex index.htm index.html
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(.*)index\.html?$ http://example.com/$1 [R=301,L]
The mod_rewrite code, while appearing to have redundant lines, is correct: It looks for /index.htm or /index.html in any directory, and ONLY if the client originally requested either of those does it do the redirect. This prevents it from trying to redirect the internally-generated requests that result when DirectoryIndex is applied. Without this construct, you would get an 'infinite' loop.
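To make the loop concrete, here is a minimal sketch of what the unguarded version would do. It assumes the code sits in the .htaccess file at the document root and that example.com stands in for your own hostname. It is for illustration only, not something to deploy.

```apache
# Looping version: the RewriteCond on THE_REQUEST has been left out on purpose.
DirectoryIndex index.html
RewriteEngine on

# 1. The client requests "/".
# 2. DirectoryIndex handling internally maps "/" to "/index.html".
# 3. The rule below sees "index.html" on that internal request and
#    answers with a 301 redirect back to "/".
# 4. The client requests "/" again, and the cycle repeats until the
#    browser gives up.
RewriteRule ^(.*)index\.html?$ http://example.com/$1 [R=301,L]
```

With the RewriteCond in place, step 3 does not fire: THE_REQUEST still reads "GET / HTTP/1.1" during the internal request, so a pattern looking for index.htm or index.html in the request line does not match.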
Note that if you have problems with this working in subdirectories, you may need to add the following to the .htaccess file in each affected subdirectory:
RewriteOptions inherit
RewriteEngine on
To prevent unnecessary redirects and keep your search listings looking clean, always refer to index pages as "/" in your on-site links, and try to get your inbound links updated to refer to your index pages that way.
Simple, huh? ;)
Jim
it all sounds ok actually except for this bit:
> To prevent unnecessary redirects and keep your search listings looking clean, always refer to index pages as "/" in your on-site links, and try to get your inbound links updated to refer to your index pages that way.
would absolute links achieve the same thing? I use them already, and *most* incoming links are going to the correct place, it just looks like the engines got confused since both .htm and .html worked for a while...a long while I'm afraid.
also, since I'm taking a pretty big hit already due to this screw up, would I be better off just renaming the offending files to avoid all this entirely? hmmm...if I rename to .htm, then set the index to just .htm and 301 redirect the .html to the .htm, would that work?
and one more, since this site is on my own server now is it possible to control the directory index for each directory in the site? I don't have many instances of this really, so that is an option if it's doable.
can you tell I'm stressed yet? :)
We're getting into several different topics here...
First, for best results and a "professional look" the index page URL should be http://example.com/
It doesn't matter what the filename is -- DirectoryIndex in each directory's .htaccess file can resolve that for you.
URLs are not necessarily equal in any way to filepaths or filenames, as the use of mod_rewrite demonstrates.
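As a rough sketch of the per-directory approach, each directory's .htaccess file can name only the index file that actually exists there. The two files and the "articles" directory below are hypothetical, and this assumes the server config permits the directive in .htaccess (AllowOverride Indexes or broader).

```apache
# /.htaccess  (site root; the index file here is index.htm)
DirectoryIndex index.htm

# /articles/.htaccess  (separate file in a hypothetical subdirectory
# whose index file is index.html)
DirectoryIndex index.html
```

Either way, visitors and search engines only ever need to see "/" and "/articles/"; the filename behind each URL stays a server-side detail.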
Then we have the relative, absolute, and canonical URL reference topic:
<a href="http://example.com/"> Canonical URL </a>
<a href="/somepage.htm"> Absolute URL-path </a> (specifies absolute path from server root)
<a href="../somedir/somepage.htm"> Relative URL-path </a> (specifies path relative to current location)
The terms are mis-used more often than not, so I wanted to define my terms...
Jim
I'm fairly sure that the engines have indexed both index.html and .htm since the directory index was set to both but NOT because others are linking to both, I think either my site is linking to both somewhere or the engines just figured it out and found both versions. So, that leads me to these questions:
1) What issues can that cause? Dup content penalty (I seem to be ranking ok)? Does this dilute PR at all? Anything else? FWIW both versions have the same PR and backlinks right now.
2) I would like to have both versions work since I unfortunately have both in different areas of the site and there are too many to run around renaming. Assuming my site is linking to both somewhere incorrectly and I fix that by linking to /, then set the DirectoryIndex in each directory correctly, will that solve the problem best?
3) As long as I can't find any external links linking to the wrong version of these index files, all the links and PR I see on those isn't real, right?
> 1) What issues can that cause? Dup content penalty (I seem to be ranking ok)? Does this dilute PR at all?
> Anything else? FWIW both versions have the same PR and backlinks right now.
This type of "duplicate-content penalty" is manifested as a reduction of PR, because it is caused by a splitting of PR. "index.htm" and "index.html" are completely different pages, as far as search engines are concerned, and so each accrues its own PR. Therefore, having two URLs for the same thing "splits" the PR and link popularity across those two URLs.
Google and possibly others have what they call "canonicalization" routines that they apparently use to post-process their indexes and figure out duplication problems like this. But if they run out of time and don't get to your site before it's time to roll out a new index, then you may be left with the split-PR problem. I am not one to depend on the kindness of strangers, so I prefer to fix this myself.
It does not matter whether the links to both index variants are internal links or external links; search engines follow links, period.
The only problem with defining multiple DirectoryIndex files comes when the index file present in the directory is not the first one in the DirectoryIndex list. The server has to check for "file exists" on each candidate filename in the list ahead of the actual, existing file's name. Since this requires additional calls to the filesystem manager, it slows things down, and it's a good idea to eliminate the problem if possible.
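If both names have to stay in the list for a while, one way to limit the extra filesystem checks is to list first whichever name the directory actually uses. A small sketch, assuming a directory whose index file is named index.html:

```apache
# The file in this directory is index.html, so it is listed first and
# found on the first check. index.htm is kept only as a fallback.
DirectoryIndex index.html index.htm
```

Once the filenames are consistent across the site, the fallback name can simply be dropped from the list.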
You can use a shell script or batch file to fix the html-htm schizophrenic filenames problem, then internally link to your index pages only as "/" (use a multi-file search-and-replace utility if needed) and then move on, worry-free, into the future. Happiness will follow from running a very tight ship. :)
Jim
> It does not matter whether the links to both index variants are internal links or external links; Search engines follow links, period.
I generally agree with that, but in many of these instances I think they picked up the wrong index file on their own. I have *many* directories that have no incoming external links, and only one *correct* / internal link to the directory, and somehow the engines still list both in all cases.
> you may be left with the split-PR problem
I know what you mean, but this case seems different to me. On each version of the page I have identical PR and backlinks. Why is that? The incorrect version could have some PR and links, but very little compared to the correct version...right? Seems almost like they are giving full PR to both versions, but I don't know...
> internal links or external links
As I mentioned above, I have incorrect index files that have no incoming links, but the engines show identical backlinks and PR. Where is that coming from?
> link to your index pages only as "/"
I'll be doing this to the site tonight. Does this also apply to all my .htaccess rules? I think I have some that point to /index.html now.
> shell script or batch file to fix the html-htm schizophrenic filenames problem
I was going to do this, but I'm a bit paranoid about doing it now since the site is just coming into season. So, I think I'm going to set DirectoryIndex in each directory for now, then once prime time passes for the site, I'm going to redo the entire section of the site that is causing this problem.
Please tell us more about your test conditions: Is it slow when you request "/index.html" AND when you request "/" or only the first case? Understand that the code I posted above does an external 301 redirect, which means that in responding to the redirect, your browser must send a second HTTP request to the server, thus increasing the apparent load time. Also, certain coding errors could cause a slow-down.
Jim
> I know what you mean, but this case seems different to me. On each version of the page I have identical PR and backlinks. Why is that? The incorrect version could have some PR and links, but very little compared to the correct version...right? Seems almost like they are giving full PR to both versions, but I don't know...
As a result of their "canonicalization" post-processing, which they do at their leisure. This introduces an important dependency of your site upon their current practices, which may change...
> As I mentioned above, I have incorrect index files that have no incoming links, but the engines show identical backlinks and PR, where is that coming from?
Maybe they picked it up from the Google Toolbar, or from a temporary error in your site config that 'exposed' those URL-paths. The backlinks and PR are a result of the "canonicalization" post-processing mentioned above.
> link to your index pages only as "/"
> I'll be doing this to the site tonight. Does this also apply to all my .htaccess rules? I think I have some that point to /index.html now.
It depends on whether those rules are internal rewrites or external redirects. For internal rewrites, it should not matter, but external redirect rules should be corrected to redirect to "/".
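To illustrate the distinction, here is a minimal sketch of the two rule types as they might appear in a root .htaccess file. The URL-paths "start" and "old-page.html" are hypothetical, and example.com stands in for your own hostname.

```apache
RewriteEngine on

# Internal rewrite: no R flag, so the client never sees the target.
# Naming the index file here does not expose a second URL.
RewriteRule ^start$ /index.html [L]

# External redirect: the R=301 flag sends the new URL to the client,
# so the target should be the canonical "/" and not "/index.html".
RewriteRule ^old-page\.html$ http://example.com/ [R=301,L]
```

A quick way to tell them apart in an existing file is to look for an R flag or a full http:// target; those are the rules worth checking.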
Jim
Without the code you posted above index.php works reasonably quickly, not as quick as the rest of the site, but reasonably quickly.
With a straightforward rewrite, index.html was about the same speed as index.php.
As it stands now:
#DirectoryIndex index.php
RewriteEngine on
#RewriteCond %{THE_REQUEST} ^[A-Z]+\ [^/]*/index\.php?\ HTTP/
#RewriteRule ^(.*)index\.php?$ http://www.example.com/$1 [R=301,L]
Redirect /index.html http://www.example.com/index.php [R=301]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^(.+)\.com [NC]
RewriteRule (.*) http://www.%1.com/$1 [R=301,L]
This is the fastest I can get it currently, but would rather use your method.
Incidentally, the .htaccess contains many more lines which are not relevant to this discussion (I think), such as additional rewrites for dynamic URLs.
Thanks
It looks like the regex in your modified RewriteCond is not going to do what you expect. I'd suggest:
RewriteCond %{THE_REQUEST} ^[A-Z]+\ .*/index\.php\ HTTP/
The "?" in the RewriteCond in the original post above was present only to match either "html" or "htm" -- you don't need it in either your RewriteCond or RewriteRule.
If you have query strings appended to the URL-path, then delete everything in the RewriteCond after the URL-path, i.e.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ .*/index\.php
Your Redirect directive should also be changed to redirect to "/", rather than to "index.php", and I would suggest placing the domain-name redirect first, ahead of the index file redirects. (Note that the relative order of execution of your mod_rewrite code and mod_alias code is set by the server config, not by their order in your file. For this reason, you might consider using mod_rewrite to redirect the index.html file as well.)
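Putting those suggestions together, one possible arrangement might look like the sketch below. It has not been tested against your server; www.example.com stands in for your hostname, and it assumes index.php is the intended index file and that the code lives in the root .htaccess file.

```apache
DirectoryIndex index.php
RewriteEngine on

# 1. Domain-name redirect first: send non-www requests to the www hostname.
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^(.+)\.com [NC]
RewriteRule (.*) http://www.%1.com/$1 [R=301,L]

# 2. Redirect client requests for index.php to the directory URL.
#    If query strings may be appended, drop the trailing "\ HTTP/"
#    from the condition as described above.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ .*/index\.php\ HTTP/
RewriteRule ^(.*)index\.php$ http://www.example.com/$1 [R=301,L]

# 3. Redirect the old index.html with mod_rewrite instead of mod_alias,
#    so the order of execution follows the order in this file.
RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]
```

Keeping everything in mod_rewrite means the rules run top to bottom as written, which makes the behavior easier to reason about when you time your tests.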
Jim