Forum Moderators: Robert Charlton & goodroi
I hope there are some people around here that are willing to answer my question.
Unfortunately, on my website I have sometimes used a URL that includes index.html for a page, because I always assumed that it would not matter. I never had any complaints from Google.
That was until recently, when I got a message in Webmaster Tools about duplicate title and description tags. The pages that Google refers to are, for example, /carnival/ and /carnival/index.html.
What can I do?
Thanks in advance.
Jos
Any identical or substantially similar content available via a different URL is a technical error and results in unnecessary duplication of content. In some instances this might affect search engine performance.
The solution is to permanently (i.e. with a 301 HTTP status code) redirect requests for /directory/index.html to /directory/ with no filename present. This is best practice regardless of search engines. How to do this depends on your server setup, but if you're on an Apache server there are lots of good resources related to this over in the Apache forum [webmasterworld.com]. You can find relevant threads in the library of the Apache forum [webmasterworld.com], or via a site search [webmasterworld.com]. This is a harder problem to solve on servers like IIS without the installation of third-party components.
Thanks again. I will use WW for other questions too.
Jos
[edited by: Receptional_Andy at 10:54 am (utc) on Aug. 20, 2008]
[edit reason] Please use example.com - it can never be owned [/edit]
That thread is pretty comprehensive and may seem a bit intimidating if you aren't used to working with mod_rewrite, but is well worth taking the time to work through and understand.
I wouldn't recommend excluding index documents via robots exclusion since there's the potential for unwanted side effects.
If you haven't already, you should also get into the habit of making sure your site's internal links point to a directory without a document index - i.e. link to www.example.com/ not www.example.com/index.htm and www.example.com/directory/ not www.example.com/directory/index.htm
*** Maybe a file can't be redirected to a directory? ***
Yes it can. If you use Apache it is very easy to do. Make sure that all your internal links point to the shorter version of the URL too.
*** If I disallow the /directory/index.shtml in the robots.txt, would that affect the http://example.com/directory/ url too ***
Matching works from the left, so blocking a named file does not affect any shorter URLs.
You could block the longer URL like this, but the 301 redirect is better.
If you do use robots.txt, and as long as nothing links to the longer URL, and all links point to the shorter URL, the PageRank will then build for that shorter URL.
I found out that the to-be-redirected url had to include the domain name. The index.shtml is still shown in address bar, but the redirect doesn't keep the site from being displayed. I hope this will tell google that there is only one page containing the title and description tag. Strange, all other redirects - file to file, maybe that's the difference - did not need that. Only the file-to-redirect-to needed the complete url.
Again thanks. Jos
ErrorDocument 404 /404.html
Options +FollowSymlinks
RewriteEngine on
rewritecond %{http_host} ^www.example.com [nc]
rewriterule ^(.*)$ http://example.com/$1 [r=301,nc]
-.-.-.-.-.-
Could this be causing the trouble with the redirects?
Jos
[edited by: Receptional_Andy at 12:19 pm (utc) on Aug. 20, 2008]
[edit reason] Please use example.com to replace personal URLs [/edit]
Your "catch all" (shown above) rule should be the last one in your redirects section, and all of the redirects should be placed before your rewrites.
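As a rough sketch of that ordering (the page names and the product.php script are hypothetical, not taken from this site), an .htaccess could be laid out like this:

```apache
RewriteEngine on
#
# 1. Specific external redirects come first
RewriteRule ^old-page\.html$ http://example.com/new-page.html [R=301,L]
#
# 2. The "catch all" hostname redirect is the last of the redirects
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule (.*) http://example.com/$1 [R=301,L]
#
# 3. Internal rewrites (no R flag) go after all external redirects
RewriteRule ^products/([0-9]+)$ /product.php?id=$1 [L]
```

With this order, a request that matches a specific redirect is sent straight to its final URL on the canonical hostname in one hop, and an internally rewritten path is never exposed to the visitor by a later redirect.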
It's clear that something else is going on here. You need the Live HTTP headers extension for Mozilla Firefox to check these out.
Jos
On Apache, in example.com/.htaccess:
# Parse .html and .inc files for server-side includes
AddHandler server-parsed .html .inc
#
# Declare custom 404 error document
ErrorDocument 404 /404.html
#
# Set up to enable mod_rewrite
Options +FollowSymlinks
RewriteEngine on
#
# Redirect index.html in any directory to directory index "/"
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html
RewriteRule ^(([^/]+/)*)index\.html$ http://example.com/$1 [R=301,L]
#
# Redirect non-canonical "www" domain variants to example.com
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule (.*) http://example.com/$1 [R=301,L]
It already worked for getting rid of the www, and now it works for getting rid of the index.html too.
Jim,
I also have directories containing index.shtml. I added another rewrite rule based on yours, but I replaced index\.html two times by index\.shtml and that doesn't work. Did I forget something?
Jos
No, that will loop, because REQUEST_URI and the URL-path seen by RewriteRule *will* be updated when the DirectoryIndex directive is applied.
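To illustrate the loop described here, this is a sketch that assumes a root .htaccess and the example.com hostname used elsewhere in this thread. The first rule is left commented out because it is the one that would loop. The second is guarded by THE_REQUEST, which holds the request line exactly as the client sent it and is not changed by internal processing.

```apache
# Unguarded version (loops): once DirectoryIndex has mapped /dir/ to
# /dir/index.html internally, this rule matches and redirects back to /dir/.
# RewriteRule ^(([^/]+/)*)index\.html$ http://example.com/$1 [R=301,L]
#
# Guarded version: THE_REQUEST looks like "GET /dir/index.html HTTP/1.1",
# so the redirect fires only when the client itself asked for index.html.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html
RewriteRule ^(([^/]+/)*)index\.html$ http://example.com/$1 [R=301,L]
```

A request for /dir/ passes through untouched, because its request line does not contain index.html even after the server has picked the index document internally.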
That was meant for .htaccess in /dir/ directory.
Tested, doesn't loop.
Anyway, from the above mentioned thread, I found your interesting version for the same thing:
RewriteCond %{ENV:myURI} ^(/([^/]+/)*)index\.html [NC]
RewriteRule . - [E=qRed:yes,E=myURI:%1]
A million gazillion thank yous for everyone who participated in <this thread>.
I also use server side includes and could never get the redirecting of the directory index to work. After reading through this thread I finally got it to work for my particular case! YEEEEEEEEEEAaaaaaaaaaH! You guys are awesome!
I really like the idea of NOT including the index file in the URL - seems shorter/cleaner. However, most of my internal links point to the index.html file. Google is showing half of these pages indexed as /directory/ and half indexed as /directory/index.html on a 6000+ page site.
If I clean up my internal links to point to the directory root and setup this 301 redirect in .htaccess, should I expect any fallout on my index.html pages?
======================================================
I'm sharing the code below in case anyone else's circumstances are just a little bit different than the other example.
# Parse .html and .inc files for server-side includes
AddHandler server-parsed .html .inc
#
#
# Set up to enable mod_rewrite
Options +FollowSymlinks +Includes -Indexes
RewriteEngine on
#
#
# Redirect requests for index.html in any directory to "/" in the same directory
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+/)?index\.html\ HTTP
RewriteRule ^(.+/)?index\.html$ http://www.example.com/$1 [R=301,L]
#
#
# Redirect requests for resources in non-www domains to same resources in www domain
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
[edited by: Robert_Charlton at 1:44 am (utc) on Aug. 25, 2008]
[edited by: nmjudy at 2:07 am (utc) on Aug. 25, 2008]
If I clean up my internal links to point to the directory root and setup this 301 redirect in .htaccess, should I expect any fallout on my index.html pages?
"Fallout" --if any-- should be positive.
BTW, this line
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
can be changed to
RewriteCond %{HTTP_HOST} !^www\.example\.com$
With that change, case errors, appended port numbers, and/or trailing dots will all be corrected, since the pattern now requires an *exact* match to your canonical hostname to avoid the redirection.
Someone also recently posted a "shorthand" method for accepting a blank hostname (result of an HTTP/1.0 request, which does not include a "Host" header), which eliminates the first RewriteCond as well:
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
Jim
After playing around with my pages, I noticed one thing that will break by redirecting index pages to the directory root. All anchor links that I have on my index.html pages will be hosed (index.html#anchor).
If I wanted to set the redirect up to be opposite of above, what would that look like? (ie redirect from /directory/ to /directory/index.html)
Also, my internal linking structure uses the index.html page and it would be much less work to make this change.
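For reference, a reverse rule might look something like the sketch below. It is untested here and assumes the .htaccess sits in the document root and that www.example.com is the canonical hostname. It would have to replace the index.html-to-directory redirect rather than sit alongside it, otherwise the two rules would redirect to each other endlessly.

```apache
# Hypothetical reverse rule: redirect "/" and "/directory/" to the
# index.html inside that directory.
# The condition limits this to directories that really contain an index.html.
RewriteCond %{DOCUMENT_ROOT}/$1index.html -f
RewriteRule ^(([^/]+/)*)$ http://www.example.com/$1index.html [R=301,L]
```

Note that the #anchor part of a URL is never sent to the server, so neither direction of redirect can see it or strip it.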
If someone enters the site through an external link that points to /directory/index.html#myanchor
the browser redirects to show /directory/
but the page display scrolls to the anchor (see note below).
I had to move my anchor Waaaaaaaaaay down the page to test it again, because the browser was saying one thing and it wasn't obvious looking at the page that it was doing anything.
If someone clicks on an internal anchor link, it displays in the browser as /directory/#myanchor and works as expected.
Currently all the site's internal links point to a directory/index.html file using a relative link structure.
Will I run into a problem if I go ahead and apply the redirect code in the .htaccess file before completing all the internal link changes to / or /directory/ or /directory/directory/ etc?
It's going to take me a while to make all the changes on a site this size. Will Google look at the number of 301 redirects on the site as a negative thing as it tries to crawl the site?