index.html and index.htm both have links to them

what's the best way to get only one to work?

         

Craig_F

11:21 pm on Jul 29, 2005 (gmt 0)

10+ Year Member



I have both index.html and index.htm files with links to them. I *thought* I had solved this problem by setting the directory index to .htm and 301 redirecting the .html to the .htm, BUT I just noticed that in some directories the .html needs to be the directory index. What is the best way to fix this?

I was thinking of setting both to be directory indexes, then 301'ing the incorrect URL to the proper URL. Is that OK to do?

jdMorgan

2:06 am on Jul 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can get in trouble doing this. But there's a solution using mod_rewrite. Set up your DirectoryIndex as

DirectoryIndex index.htm index.html

and then add mod_rewrite code:

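# Match only when the client itself asked for index.htm or index.html, in any directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html?\ HTTP/
# $1 is empty or already ends in "/", so no extra slash is added after it
RewriteRule ^(.*)index\.html?$ http://example.com/$1 [R=301,L]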

How this works:
If the original client-requested URL is <anything>index.html or <anything>index.htm, then the client is redirected to <anything>/
So now your index pages are known as "/", and not as index.htm or index.html
The server, upon receipt of a request for "/" then uses DirectoryIndex to find the file it should serve. For those directories containing one of index.htm or index.html, that file is served. If both are present, then the first one appearing in the DirectoryIndex list will be served.

The mod_rewrite code, while appearing to have redundant lines, is correct: It looks for /index.htm or /index.html in any directory, and ONLY if the client originally requested either of those does it do the redirect. This prevents it from trying to redirect the internally-generated requests that result when DirectoryIndex is applied. Without this construct, you would get an 'infinite' loop.
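
Putting the pieces together, a root-level .htaccess might look like the sketch below. It is only an illustration: example.com stands in for the real host, and the RewriteCond is written with a /([^/]+/)* prefix (an assumed form) so that it matches index files in subdirectories as well as in the root. The comments trace how a direct request and a DirectoryIndex lookup are treated differently.

```apache
# Hypothetical root .htaccess; example.com is a placeholder host
DirectoryIndex index.htm index.html

RewriteEngine on

# THE_REQUEST holds the request line as the client sent it, for example
#   GET /widgets/index.html HTTP/1.1
# It does not change when the server looks up an index file internally.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html?\ HTTP/
# In .htaccess the pattern sees the path without its leading slash,
# so $1 is "" for the root or "widgets/" for a subdirectory.
RewriteRule ^(.*)index\.html?$ http://example.com/$1 [R=301,L]

# Client asks for /widgets/index.html -> condition matches -> 301 to /widgets/
# Client asks for /widgets/           -> condition fails   -> DirectoryIndex
#                                        serves the index file, no redirect
```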

Note that if you have problems with this working in subdirectories, you may need to add


RewriteOptions inherit

to the .htaccess files in those subdirectories. If there is no mod_rewrite code already in those files, you'll have to add

RewriteEngine on

ahead of the RewriteOptions line as well.
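
For illustration, a subdirectory .htaccess with no other rewrite code would then contain just the lines below. The directory name is hypothetical. Because Apache treats inherited rules as if they had been copied into the subdirectory's own file, it is worth checking after adding this that a request for the subdirectory's index file still redirects to that subdirectory's "/".

```apache
# Hypothetical /widgets/.htaccess
RewriteEngine on
# Pull in the rewrite rules defined in the parent directory's .htaccess
RewriteOptions inherit
```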

To prevent unnecessary redirects and keep your search listings looking clean, always refer to index pages as "/" in your on-site links, and try to get your inbound links updated to refer to your index pages that way.

Simple, huh? ;)

Jim

Craig_F

2:32 am on Jul 30, 2005 (gmt 0)

10+ Year Member



Wow, that sounds so simple, I should have thought of that ;) (if I knew what the h*ll I was doing!)

It all sounds OK, actually, except for this bit:

> To prevent unnecessary redirects and keep your search listings looking clean, always refer to index pages as "/" in your on-site links, and try to get your inbound links updated to refer to your index pages that way.

Would absolute links achieve the same thing? I use them already, and *most* incoming links are going to the correct place. It just looks like the engines got confused since both .htm and .html worked for a while... a long while, I'm afraid.

Also, since I'm taking a pretty big hit already due to this screw-up, would I be better off just renaming the offending files to avoid all this entirely? Hmm... if I rename to .htm, then set the index to just .htm and 301 redirect the .html to the .htm, would that work?

And one more: since this site is on my own server now, is it possible to control the directory index for each directory in the site? I don't have many instances of this really, so that is an option if it's doable.

Can you tell I'm stressed yet? :)

jdMorgan

2:58 am on Jul 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Refer to your index pages as <a href="/"> or, using "absolute" -- really "canonical" -- links as <a href="http://example.com/">

We're getting into several different topics here...

First, for best results and a "professional look" the index page URL should be http://example.com/
It doesn't matter what the filename is -- DirectoryIndex in each directory's .htaccess file can resolve that for you.
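
As a rough sketch of that per-directory approach, each directory can name only the index file it actually contains. The directory names here are made up, the two sections are two separate files, and the sketch assumes the server config allows DirectoryIndex in .htaccess (AllowOverride including Indexes).

```apache
# --- /articles/.htaccess (hypothetical): this directory contains index.htm
DirectoryIndex index.htm

# --- /products/.htaccess (hypothetical): this directory contains index.html
DirectoryIndex index.html
```

Either way, visitors and search engines only ever see http://example.com/articles/ and http://example.com/products/.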

URLs are not necessarily equal in any way to filepaths or filenames, as the use of mod_rewrite demonstrates.

Then we have the relative, absolute, and canonical URL reference topic:


<a href="http://example.com/"> Canonical URL </a>
<a href="/somepage.htm"> Absolute URL-path </a> (specifies absolute path from server root)
<a href="../somedir/somepage.htm"> Relative URL-path </a> (specifies path relative to current location)

Note that *all* URL references are resolved in the client browser, and 'relative' links are relative to where the browser 'thinks' the current page is, and not to where it 'really is' on the server. This often causes massive confusion when people mod_rewrite URLs and then wonder why the browser requests incorrect relative image locations, etc.
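
A small, made-up example of that confusion: suppose an internal rewrite maps a "pretty" URL onto a script in the server root. The rule, script name, and image path below are all hypothetical.

```apache
RewriteEngine on
# Internal rewrite: /widgets/blue is served by /product.php?color=blue,
# but the browser still believes the page lives at /widgets/blue
RewriteRule ^widgets/([a-z]+)$ /product.php?color=$1 [L]

# If that page contains <img src="images/logo.gif">, the browser resolves it
# against /widgets/blue and requests /widgets/images/logo.gif,
# not /images/logo.gif. Linking as <img src="/images/logo.gif"> avoids this.
```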

The terms are mis-used more often than not, so I wanted to define my terms...

Jim

Craig_F

11:11 pm on Aug 1, 2005 (gmt 0)

10+ Year Member



OK, I'm back on this now. I got caught up with some other stuff unexpectedly over the weekend. Here's what I'm thinking now:

I'm fairly sure that the engines have indexed both index.html and .htm since the directory index was set to both, but NOT because others are linking to both. I think either my site is linking to both somewhere or the engines just figured it out and found both versions. So, that leads me to these questions:

1) What issues can that cause? Dup content penalty (I seem to be ranking ok)? Does this dilute PR at all? Anything else? FWIW both versions have the same PR and backlinks right now.

2) I would like to have both versions work since I unfortunately have both in different areas of the site and there are too many to run around renaming. Assuming my site is linking to both somewhere incorrectly and I fix that by linking to /, then set the Directory Index in each directory correctly, will that solve the problem best?

3) As long as I can't find any external links pointing to the wrong version of these index files, all the links and PR I see on those aren't real, right?

jdMorgan

11:59 pm on Aug 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The setting of DirectoryIndex has nothing to do with links or search engines. It is used internally only, and is not visible to be "figured out" by search engines.

> 1) What issues can that cause? Dup content penalty (I seem to be ranking ok)? Does this dilute PR at all?
> Anything else? FWIW both versions have the same PR and backlinks right now.

This type of "duplicate-content penalty" is manifested as a reduction of PR, because it is caused by a splitting of PR. "index.htm" and "index.html" are completely different pages, as far as search engines are concerned, and so each accrues its own PR. Therefore, having two URLs for the same thing "splits" the PR and link popularity across those two URLs.

Google and possibly others have what they call "canonicalization" routines that they apparently use to post-process their indexes and figure out duplication problems like this. But if they run out of time and don't get to your site before it's time to roll out a new index, then you may be left with the split-PR problem. I am not one to depend on the kindness of strangers, so I prefer to fix this myself.

It does not matter whether the links to both index variants are internal links or external links; Search engines follow links, period.

The only problem with defining multiple DirectoryIndex files comes when the index file present in the directory is not the first one in the DirectoryIndex list. The server has to check for "file exists" on each candidate filename in the list ahead of the actual, existing file's name. Since this requires additional calls to the filesystem manager, it slows things down, and it's a good idea to eliminate the problem if possible.

You can use a shell script or batch file to fix the html-htm schizophrenic filenames problem, then internally link to your index pages only as "/" (use a multi-file search-and-replace utility if needed) and then move on, worry-free, into the future. Happiness will follow from running a very tight ship. :)

Jim

Craig_F

1:52 pm on Aug 2, 2005 (gmt 0)

10+ Year Member



> It does not matter whether the links to both index variants are internal links or external links; Search engines follow links, period.

I generally agree with that, but in many of these instances I think they picked up the wrong index file on their own. I have *many* directories that have no incoming external links, and only one *correct* / internal link to the directory, and somehow the engines still list both in all cases.

> you may be left with the split-PR problem

I know what you mean, but this case seems different to me. On each version of the page I have identical PR and backlinks. Why is that? The incorrect version could have some PR and links, but very little compared to the correct version...right? Seems almost like they are giving full PR to both versions, but I don't know...

> internal links or external links

As I mentioned above, I have incorrect index files that have no incoming links, but the engines show identical backlinks and PR. Where is that coming from?

> link to your index pages only as "/"

I'll be doing this to the site tonight. Does this also apply to all my .htaccess rules? I think I have some that point to /index.html now.

> shell script or batch file to fix the html-htm schizophrenic filenames problem

I was going to do this, but I'm a bit paranoid about doing it now since the site is just coming into season. So, I think I'm going to set DirectoryIndex in each directory for now; then, once prime time passes for the site, I'm going to redo the entire section of the site that is causing this problem.

Bridge

3:12 pm on Aug 4, 2005 (gmt 0)

10+ Year Member



Could I jump in and seek some expert advice here? I have implemented the suggestions above and found that the index page now loads very slowly, while all other links (also rewritten) are fast and work fine. Any suggestions on speeding up the home page would be much appreciated.

jdMorgan

5:08 pm on Aug 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Bridge,

Please tell us more about your test conditions: Is it slow when you request "/index.html" AND when you request "/" or only the first case? Understand that the code I posted above does an external 301 redirect, which means that in responding to the redirect, your browser must send a second HTTP request to the server, thus increasing the apparent load time. Also, certain coding errors could cause a slow-down.

Jim

jdMorgan

5:15 pm on Aug 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



CraigF,

> I know what you mean, but this case seems different to me. On each version of the page I have identical PR and backlinks. Why is that? The incorrect version could have some PR and links, but very little compared to the correct version...right? Seems almost like they are giving full PR to both versions, but I don't know...

As a result of their "canonicalization" post-processing, which they do at their leisure. This introduces an important dependency of your site upon their current practices, which may change...

> As I mentioned above, I have incorrect index files that have no incoming links, but the engines show identical backlinks and PR, where is that coming from?

Maybe they picked it up from the Google Toolbar, or from a temporary error in your site config that 'exposed' those URL-paths. The backlinks and PR are a result of the "canonicalization" post-processing mentioned above.

> link to your index pages only as "/"

> I'll be doing this to the site tonight. Does this also apply to all my .htaccess rules? I think I have some that point to /index.html now.

It depends on whether those rules are internal rewrites or external redirects. For internal rewrites, it should not matter, but external redirect rules should be corrected to redirect to "/".
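
To make the distinction concrete, here is a hedged sketch with two invented rules (the URL-paths home and old-home.html are hypothetical). Only the second rule exposes its target to clients, so only that one needs to point at "/".

```apache
RewriteEngine on

# Internal rewrite (no R flag, no scheme or host in the target): the client
# never sees "index.html", so this target can stay as it is
RewriteRule ^home$ /index.html [L]

# External redirect: the target is sent back to the client in the Location
# header, so it should name the index page as "/"
RewriteRule ^old-home\.html$ http://example.com/ [R=301,L]
```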

Jim

Bridge

5:48 pm on Aug 4, 2005 (gmt 0)

10+ Year Member



jdMorgan, the default site index is index.php (I mean a real page that exists). There has been an index.html for some time, which is indexed in the search engines, though it is not a real page; it was a rewrite.

Without the code you posted above index.php works reasonably quickly, not as quick as the rest of the site, but reasonably quickly.

With a straightforward rewrite, index.html was about the same speed as index.php.

As it stands now:

#DirectoryIndex index.php
RewriteEngine on
#RewriteCond %{THE_REQUEST} ^[A-Z]+\ [^/]*/index\.php?\ HTTP/
#RewriteRule ^(.*)index\.php?$ http://www.example.com/$1 [R=301,L]

Redirect 301 /index.html http://www.example.com/index.php

RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^(.+)\.com [NC]
RewriteRule (.*) http://www.%1.com/$1 [R=301,L]

This is the fastest I can get it currently, but would rather use your method.

Incidentally, the .htaccess contains many more lines which are not relevant to this discussion (I think), such as additional rewrites for dynamic URLs.

Thanks

jdMorgan

6:44 pm on Aug 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Bridge,

It looks like the regex in your modified RewriteCond is not going to do what you expect. I'd suggest:


RewriteCond %{THE_REQUEST} ^[A-Z]+\ .*/index\.php\ HTTP/

since the URL-path will *always* start with a slash. This pattern will do for testing, and perhaps you can come up with a more efficient version later.

The "?" in the RewriteCond in the original post above was present only to match "html" or "htm" -- you don't need it in either your RewriteCond or RewriteRule.

If you have query strings appended to the URL-path, then delete everything in the RewriteCond after the URL-path, i.e.


RewriteCond %{THE_REQUEST} ^[A-Z]+\ .*/index\.php

otherwise, it won't match when a query is present.

Your Redirect directive should also be changed to redirect to "/", rather than to "index.php", and I would suggest placing the domain-name redirect first, ahead of the index file redirects. (Note that the relative order of execution of your mod_rewrite code and mod_alias code is set by the server config, not by their order in your file. For this reason, you might consider using mod_rewrite to redirect the index.html file as well.)
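
Taken together, those suggestions might be assembled along the following lines. This is only a sketch under assumptions: www.example.com is a placeholder for the real host, the index.html redirect has been folded into the mod_rewrite rule using a (php|html) alternation that is illustrative shorthand and does not appear in the posts above, and the remaining rewrite rules in the file are omitted.

```apache
DirectoryIndex index.php
RewriteEngine on

# 1. Domain-name redirect first: send non-www .com hosts to the www host
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^(.+)\.com [NC]
RewriteRule (.*) http://www.%1.com/$1 [R=301,L]

# 2. Index redirect: only when the client itself requested index.php or
#    index.html (with or without a query string), redirect to the directory "/"
RewriteCond %{THE_REQUEST} ^[A-Z]+\ .*/index\.(php|html)
RewriteRule ^(.*)index\.(php|html)$ http://www.example.com/$1 [R=301,L]
```

With this ordering, a request for /index.php on the bare domain takes two hops (first to the www host, then to "/"), while requests that are already canonical are not redirected at all.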

Jim

Bridge

11:04 am on Aug 23, 2005 (gmt 0)

10+ Year Member



jdMorgan

Many thanks, works like a charm.