
Google SEO News and Discussion Forum

This 42-message thread spans 2 pages (page 1 of 2).
/directory/ and /directory/index.html are duplicate content?
bolognese

5+ Year Member



 
Msg#: 3726529 posted 6:48 pm on Aug 19, 2008 (gmt 0)

Hi there,

I hope there are some people around here that are willing to answer my question.

Unfortunately, on my website I have sometimes used URLs that include index.html, because I always assumed it would not matter. I never had any complaints from Google.
Recently, though, I got a message in Webmaster Tools about duplicate title and description tags. The pages Google refers to are, for example, /carnival/ and /carnival/index.html.
What can I do?

Thanks in advance.

Jos

 

Receptional Andy



 
Msg#: 3726529 posted 8:39 pm on Aug 19, 2008 (gmt 0)

Hi Jos, welcome to WebmasterWorld :)

Any identical or substantially similar content available via a different URL is a technical error and results in unnecessary duplication of content. In some instances this might affect search engine performance.

The solution is to permanently redirect (i.e. with a 301 HTTP status code) requests for /directory/index.html to /directory/ with no filename present. This is best practice regardless of search engines. How to do this depends on your server setup, but if you're on an Apache server there are lots of good resources on this over in the Apache forum [webmasterworld.com]. You can find relevant threads in the library of the Apache forum [webmasterworld.com], or via a site search [webmasterworld.com]. This is a harder problem to solve on servers like IIS without installing third-party components.

bolognese

5+ Year Member



 
Msg#: 3726529 posted 10:52 am on Aug 20, 2008 (gmt 0)

I'm glad there is a forum like WebmasterWorld. Unlike in Google Groups, my questions at least get answered here.
I tried the redirect already, but it did not work. In fact, it blocked the whole website from being displayed. Maybe a file can't be redirected to a directory?
Besides, I also have to add the domain to the page or directory I want the redirect to point to.
If I disallow the /directory/index.shtml in the robots.txt, would that affect the http://example.com/directory/ url too? That URL has PageRank that I would hate to lose.

Thanks again. I will use WW for other questions too.

Jos

[edited by: Receptional_Andy at 10:54 am (utc) on Aug. 20, 2008]
[edit reason] Please use example.com - it can never be owned [/edit]

Receptional Andy



 
Msg#: 3726529 posted 10:59 am on Aug 20, 2008 (gmt 0)

If you're on Apache and can use htaccess directives and mod_rewrite, you'll find a fantastic guide to avoiding all types of duplicate content in the thread A guide to fixing duplicate content & URL issues on Apache [webmasterworld.com].

That thread is pretty comprehensive and may seem a bit intimidating if you aren't used to working with mod_rewrite, but is well worth taking the time to work through and understand.

I wouldn't recommend excluding index documents via robots exclusion since there's the potential for unwanted side effects.

If you haven't already, you should also get into the habit of making sure your site's internal links point to a directory without a document index - i.e. link to www.example.com/ not www.example.com/index.htm and www.example.com/directory/ not www.example.com/directory/index.htm

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3726529 posted 11:02 am on Aug 20, 2008 (gmt 0)

Good call on that linked thread. An awesome chunk of code that deserves far wider readership.

*** Maybe a file can't be redirected to a directory? ***

Yes it can. If you use Apache it is very easy to do. Make sure that all your internal links point to the shorter version of the URL too.

*** If I disallow the /directory/index.shtml in the robots.txt, would that affect the http://example.com/directory/ url too ***

Matching works from the left, so blocking a named file does not affect any shorter URLs.

You could block the longer URL like this, but the 301 redirect is better.

If you do use robots.txt, and as long as nothing links to the longer URL, and all links point to the shorter URL, the PageRank will then build for that shorter URL.

bolognese

5+ Year Member



 
Msg#: 3726529 posted 11:50 am on Aug 20, 2008 (gmt 0)

Thanks a lot. I will check everything out.

I found out that the to-be-redirected url had to include the domain name. The index.shtml is still shown in address bar, but the redirect doesn't keep the site from being displayed. I hope this will tell google that there is only one page containing the title and description tag. Strange, all other redirects - file to file, maybe that's the difference - did not need that. Only the file-to-redirect-to needed the complete url.

Again thanks. Jos

activeco

10+ Year Member



 
Msg#: 3726529 posted 12:02 pm on Aug 20, 2008 (gmt 0)

I assume your DirectoryIndex directive for that directory is set to index.shtml.
In that case there should be no problems with the redirecting.
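
As a rough sketch of what that assumption means (the file list here is illustrative, not taken from Jos's server), the directive in the site's .htaccess might read:

```apache
# Serve index.shtml when a directory URL such as /carnival/ is requested;
# fall back to index.html if no index.shtml exists (the order is an assumption).
DirectoryIndex index.shtml index.html
```

With that in place, a request for /directory/ is answered internally with the index file, so no link ever needs to name it.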

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3726529 posted 12:08 pm on Aug 20, 2008 (gmt 0)

*** The index.shtml is still shown in address bar ***

... then it isn't working.

A redirect will issue a 301 HTTP header back to the browser making it start a new HTTP request for the new URL.

bolognese

5+ Year Member



 
Msg#: 3726529 posted 12:17 pm on Aug 20, 2008 (gmt 0)

All pages on my site benefit from SSI. A long time ago I found out how to make all .html pages act like .shtml. I also wanted my domain to be indexed without the www.
This is what's at the beginning of my .htaccess, apart from all the redirects:
-.-.-.-.-.-
AddHandler server-parsed .html .inc

ErrorDocument 404 /404.html

Options +FollowSymlinks
RewriteEngine on
rewritecond %{http_host} ^www.example.com [nc]
rewriterule ^(.*)$ http://example.com/$1 [r=301,nc]

-.-.-.-.-.-

Could this be causing the trouble with the redirects?

Jos

[edited by: Receptional_Andy at 12:19 pm (utc) on Aug. 20, 2008]
[edit reason] Please use example.com to replace personal URLs [/edit]

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3726529 posted 12:31 pm on Aug 20, 2008 (gmt 0)

I would use RewriteCond and RewriteRule and HTTP_HOST and R=301 (note casing) so as to be sure not to break anything.

Your "catch all" (shown above) rule should be the last one in your redirects section, and all of the redirects should be placed before your rewrites.

It's clear that something else is going on here. You need the Live HTTP headers extension for Mozilla Firefox to check these out.
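
For illustration only, the posted host rule with the conventional casing might look like the sketch below. The escaped dots and the added L flag are assumptions about tidy practice rather than something the original rule contained, and the NC flag is left off the RewriteRule because the pattern (.*) has no letters to match.

```apache
RewriteEngine On
#
# More specific redirects (for example, index-file redirects) would go here.
#
# Catch-all host redirect, kept last among the redirects
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```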

bolognese

5+ Year Member



 
Msg#: 3726529 posted 1:42 pm on Aug 20, 2008 (gmt 0)

Can it be that it takes some time until the redirect is working on the webserver of my hosting provider?
Sometimes I can see that it works, and the index.shtml disappears from the address bar, but sometimes nothing is displayed at all and it looks as if the web server hangs. Do .htaccess changes have an impact on performance?

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3726529 posted 1:51 pm on Aug 20, 2008 (gmt 0)

It should be almost instantaneous, but it is possible for something upstream to cache old data.

You do need to flush your browser cache, as that is usually the culprit.

bolognese

5+ Year Member



 
Msg#: 3726529 posted 3:46 pm on Aug 20, 2008 (gmt 0)

I don't believe the redirect is going to work. Firefox even reports that the page is redirecting in some kind of endless loop, which I think is correct: I am redirecting a page (index.shtml) to a URL containing only domain/directory/ and expecting that URL to fall back on the default file, which is the very file I redirected. That is a different kind of move than from file1 to file2.
I found out that the PageRank of the URL without the filename was 2. The filename version had no PageRank.
I have read the rewrite thread, but I'm still not sure whether I am doing it right, and I do not want to ruin the search results.
Could you give me the code that I have to enter into the .htaccess to make it a 301? That would be great.

Jos

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3726529 posted 4:32 pm on Aug 20, 2008 (gmt 0)

What code are you using for the index redirect?

If you have an endless loop, use the Live HTTP Headers extension for Mozilla Seamonkey to see what the responses actually are.

Is it something redirecting to itself, or a ping-pong where A redirects to B redirects to A redirects to B forever?

bolognese

5+ Year Member



 
Msg#: 3726529 posted 7:38 pm on Aug 20, 2008 (gmt 0)

I'm not redirecting any index.shtml or index.html so far. I think the only thing I need is for a filename like index.shtml or index.html to be stripped off the URL.

activeco

10+ Year Member



 
Msg#: 3726529 posted 12:03 am on Aug 21, 2008 (gmt 0)

Something like this?

DirectoryIndex index.shtml
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/dir/index\.s?html$
RewriteRule ^index\.s?html$ /dir/ [R=301,L]

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3726529 posted 1:40 am on Aug 21, 2008 (gmt 0)

No, that will loop, because REQUEST_URI and the URL-path seen by RewriteRule *will* be updated when the DirectoryIndex directive is applied.

On Apache, in example.com/.htaccess:

# Parse .html and .inc files for server-side includes
AddHandler server-parsed .html .inc
#
# Declare custom 404 error document
ErrorDocument 404 /404.html
#
# Set up to enable mod_rewrite
Options +FollowSymlinks
RewriteEngine on
#
# Redirect index.html in any directory to directory index "/"
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html
RewriteRule ^(([^/]+/)*)index\.html$ http://example.com/$1 [R=301,L]
#
# Redirect non-canonical "www" domain variants to example.com
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule (.*) http://example.com/$1 [R=301,L]

Jim

vero

5+ Year Member



 
Msg#: 3726529 posted 3:44 am on Aug 21, 2008 (gmt 0)

Jos, is your site on a hosted server? Some hosts don't let you write your own 301 for this redirect. If that's the case, just contact your hosting company and ask them to redirect index.html to /, then test it yourself to make sure it was done correctly.
While you're at it, you might want to ask them to redirect non-www to www or vice versa, depending on which version you prefer - again to avoid duplicate issues.

bolognese

5+ Year Member



 
Msg#: 3726529 posted 8:13 am on Aug 21, 2008 (gmt 0)

Vero,

Getting rid of the www. already worked, and now getting rid of the index.html works too.

Jim,

I also have directories containing index.shtml. I added another rewrite rule based on yours, but I replaced index\.html two times by index\.shtml and that doesn't work. Did I forget something?

Jos

activeco

10+ Year Member



 
Msg#: 3726529 posted 8:23 am on Aug 21, 2008 (gmt 0)

*** No, that will loop, because REQUEST_URI and the URL-path seen by RewriteRule *will* be updated when the DirectoryIndex directive is applied. ***

That was meant for .htaccess in /dir/ directory.
Tested, doesn't loop.

Anyway, from the above mentioned thread, I found your interesting version for the same thing:

RewriteCond %{ENV:myURI} ^(/([^/]+/)*)index\.html [NC]
RewriteRule . - [E=qRed:yes,E=myURI:%1]

bolognese

5+ Year Member



 
Msg#: 3726529 posted 10:26 am on Aug 21, 2008 (gmt 0)

Jim,

I used your ruleset on this thread:

[webmasterworld.com...]

It works. Thanks a lot.

Jos

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3726529 posted 8:54 pm on Aug 21, 2008 (gmt 0)

If you need to test for both index.html and index.shtml you can use index\.s?html simply making the "s" optional.
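
A hedged sketch of that change, applied to the index redirect Jim posted earlier in the thread (untested here; it assumes the rule lives in the root .htaccess of example.com):

```apache
# Redirect index.html or index.shtml in any directory to the directory URL "/"
# The "s?" makes the letter "s" optional, so one rule covers both extensions.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.s?html
RewriteRule ^(([^/]+/)*)index\.s?html$ http://example.com/$1 [R=301,L]
```

This replaces the separate .html and .shtml rules rather than sitting alongside them.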

nmjudy

10+ Year Member



 
Msg#: 3726529 posted 1:22 am on Aug 25, 2008 (gmt 0)

<moved from another location>

A million gazillion thank yous for everyone who participated in <this thread>.

I also use server-side includes and could never get the redirecting of the directory index to work. After reading through this thread I finally got it to work for my particular case! YEEEEEEEEEEAaaaaaaaaaH! You guys are awesome!

I really like the idea of NOT including the index file in the URL - seems shorter/cleaner. However, most of my internal links point to the index.html file. Google is showing half of these pages indexed as /directory/ and half indexed as /directory/index.html on a 6000+ page site.

If I clean up my internal links to point to the directory root and set up this 301 redirect in .htaccess, should I expect any fallout on my index.html pages?

======================================================

I'm sharing the code below in case anyone else's circumstances are just a little bit different than the other example.

# Parse .html and .inc files for server-side includes
AddHandler server-parsed .html .inc
#
#
# Set up to enable mod_rewrite
Options +FollowSymLinks +Includes -Indexes
RewriteEngine on
#
#
# Redirect requests for index.html in any directory to "/" in the same directory
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+/)?index\.html\ HTTP
RewriteRule ^(.+/)?index\.html$ http://www.example.com/$1 [R=301,L]
#
#
# Redirect requests for resources in non-www domains to same resources in www domain
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

[edited by: Robert_Charlton at 1:44 am (utc) on Aug. 25, 2008]

[edited by: nmjudy at 2:07 am (utc) on Aug. 25, 2008]

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3726529 posted 2:01 am on Aug 25, 2008 (gmt 0)

*** If I clean up my internal links to point to the directory root and set up this 301 redirect in .htaccess, should I expect any fallout on my index.html pages? ***

"Fallout" --if any-- should be positive.

BTW, this line
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
could be better written as
RewriteCond %{HTTP_HOST} !^www\.example\.com$

With that change, case errors, appended port numbers, and/or trailing dots will all be corrected, since the pattern now requires an *exact* match to your canonical hostname to avoid the redirection.

Someone also recently posted a "shorthand" method for accepting a blank hostname (result of an HTTP/1.0 request, which does not include a "Host" header), which eliminates the first RewriteCond as well:

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

That will accept either an exact match on the hostname or a blank Host header. Neither method of accepting a blank Host header (your original RewriteCond or this tweaked one) is required if your site is hosted on a name-based shared server; Name-based servers cannot be reached with HTTP/1.0 requests containing no "Host" header, since that's the header that name-based hosting depends on.

Jim

nmjudy

10+ Year Member



 
Msg#: 3726529 posted 2:56 pm on Aug 26, 2008 (gmt 0)

Thanks Jim!

After playing around with my pages, I noticed one thing that will break by redirecting index pages to the directory root. All anchor links that I have on my index.html pages will be hosed (index.html#anchor).

If I wanted to set the redirect up the opposite way, what would that look like? (i.e. redirect from /directory/ to /directory/index.html)

Also, my internal linking structure uses the index.html page and it would be much less work to make this change.

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3726529 posted 7:06 pm on Aug 26, 2008 (gmt 0)

Why will they be hosed?

This should work just fine: "/#anchor"

nmjudy

10+ Year Member



 
Msg#: 3726529 posted 7:59 pm on Aug 26, 2008 (gmt 0)

/directory/index.html#myanchor

redirects to

/directory/

Am I missing something?

g1smd

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3726529 posted 8:24 pm on Aug 26, 2008 (gmt 0)

You need to put /folder/#anchor in the links on your page.

Ordinary users browsing the site by clicking internal navigation should never hit a redirect on their travels.

nmjudy

10+ Year Member



 
Msg#: 3726529 posted 1:12 pm on Aug 27, 2008 (gmt 0)

This is what I see...and is probably what you mean...

If someone enters the site through an external link that points to /directory/index.html#myanchor
the browser redirects to show /directory/
but the page display scrolls to the anchor (see note below).

I had to move my anchor Waaaaaaaaaay down the page to test it again, because the browser was saying one thing and it wasn't obvious looking at the page that it was doing anything.

If someone clicks on an internal anchor link, it displays in the browser as /directory/#myanchor and works as expected.

nmjudy

10+ Year Member



 
Msg#: 3726529 posted 2:38 am on Aug 28, 2008 (gmt 0)

Another quick question....

Currently all the site's internal links point to a directory/index.html file using a relative link structure.

Will I run into a problem if I go ahead and apply the redirect code in the .htaccess file before completing all the internal link changes to / or /directory/ or /directory/directory/ etc?

It's going to take me a while to make all the changes on a site this size. Will Google treat the number of 301 redirects as a negative as it crawls the site?

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved