Forum Moderators: Robert Charlton & goodroi

Canonicalization: The Pros and Cons of Redirecting Various Type-ins to Root

Help Me Decide

     
6:24 am on May 2, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 8, 2003
posts:1419
votes: 0


Redirecting /index.html to /. For example, if someone types http://www.example.com/index.html, we take them to http://www.example.com/.

Why I might not do it

  1. It is perfectly natural for /index.html to be reachable.
  2. Almost 99% of websites will never do it, so it is a problem at Google's end that they need to solve.
  3. Very important: when a person comes to index.html, I redirect them to /, which makes my customer wait.

Why I might do it

  1. Purely from a links point of view (to avoid people linking to both / and /index.html) and to help Google (maybe Google treats links to / differently from links to /index.html). There may also be SEO benefits with search engines that cannot canonicalize URLs at their end.
  2. Technically, these two URLs can also serve different files.

What does Google do?
They take the most appropriate URL, which is the / version almost 99% of the time. I have never seen a duplicate content issue due to this so far. For www and non-www, Google has an option in the webmaster console.

What is your opinion about it?

Thanks,
AjiNIMC

12:47 am on May 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If all of your internal pages link back to the root as "/index.html" and all of your external incoming links from other sites are pointing at "www.domain.com/" then you have split your PR.

You are already in trouble in much the same way that www and non-www cause problems.

You should always link back to "/" from within your site. To make things tidy I always redirect (index|default|home)\.(html?|php|asp|cfm|jsp) back to "/" too, with a 301 redirect.

You can change the technology that runs the website at any time without exposing any new URLs to the outside world. Additionally you make it harder for people to know what technology actually runs the site.
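
As a minimal sketch of the redirect described above, assuming Apache with mod_rewrite enabled and the rules placed in the site's root .htaccess file. www.example.com is a placeholder host, and the THE_REQUEST guard is an addition not spelled out in the post; it is there so the rule acts only on what the client literally requested, not on the server's own internal lookup of an index file for "/":

```apache
RewriteEngine On

# Act only when the client itself asked for a named index file.
# THE_REQUEST holds the original request line, e.g. "GET /index.html HTTP/1.1".
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*(index|default|home)\.(html?|php|asp|cfm|jsp)(\?[^\ ]*)?\ HTTP/
# $1 captures any leading directory path, so /dir/index.php is sent to /dir/.
RewriteRule ^(([^/]+/)*)(index|default|home)\.(html?|php|asp|cfm|jsp)$ http://www.example.com/$1 [R=301,L]
```

With these two lines in place, internal links should still point at "/" directly, so that visitors and robots rarely need the redirect at all.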

4:02 pm on May 3, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:June 13, 2004
posts:650
votes: 0


To make things clearer, let me add that there is no such thing as a "root file".
Root ("/" on Unix systems) is the top-most DIRECTORY, as defined by the DocumentRoot directive (in Apache).
When a user requests a directory, what will be returned depends on a few server settings. In Apache, if you want to provide a default file for a directory, the DirectoryIndex directive must be set:
DirectoryIndex MyFile.html

or even: DirectoryIndex SomeOtherDir/Script.pl

So, if you don't know what you're doing, redirecting files back to root could cause loops and additional problems.
IMO, the simplest, fastest and most secure way is to set a default root file and exclude it in robots.txt.
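
To illustrate how a directory request is resolved, here is a small configuration sketch, assuming Apache's mod_dir is loaded. The file names are placeholders, not taken from any real site:

```apache
# mod_dir: when a request maps to a directory (e.g. "/"), try these
# files in order and serve the first one that exists. The lookup is
# internal, so the URL the visitor sees stays "/".
DirectoryIndex index.html index.php

# The target may also live outside the requested directory, as in the
# second example above (this path is hypothetical).
# DirectoryIndex /SomeOtherDir/Script.pl
```

Because the index file is reached through that internal lookup, a redirect rule that matches index.html unconditionally can also fire on the internal request and bounce "/" back to itself, which is the kind of loop warned about here.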

5:22 pm on May 3, 2007 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12389
votes: 409


"Make index.html as the canonical one..."

In my experience, links to your site (ie, to the default canonical) are going to tend to be in the form www.example.com/ no matter what you do, so you are creating a problem by making index.html your canonical.

"...So, if you don't know what you're doing, redirecting files back to root could cause loops and additional problems. IMO, the simplest, fastest and most secure way is to set a default root file and exclude it in robots.txt."

Agreed that the redirection, as well as the setup of the default root file, most definitely needs to be done by someone who knows what he or she is doing. When I get to index.html rewrites, I have someone else set up the server and do the rewrites for me.

But setting up a default root file, of, say, index.html, continuing to link to it and excluding it in robots.txt, IMO, is going to continue to split your PageRank and isn't going to cure your display problem... and it encourages potential links to the index.html form, further complicating the PageRank split.

5:40 pm on May 3, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:June 13, 2004
posts:650
votes: 0


"But setting up a default root file, of, say, index.html, continuing to link to it and excluding it in robots.txt, IMO, is going to continue to split your PageRank and isn't going to cure your display problem..."

No, index.html (or any other DirectoryIndex file) is transparent to robots and is associated with the root directory. All referring links still point to /. To avoid a duplicate content issue, direct access to the real file, in this case index.html, is disallowed in robots.txt.

5:43 pm on May 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 8, 2003
posts: 1419
votes: 0


I have done some more testing with inurl:.com/index.html searches in Google, and I have seen some possible duplicate content problems. Since I can't post the URLs here, I will be making a blog post (maybe this weekend).

I have not seen any site with two caches, one for index.html and another for /, but I have seen sites with different caches for ?id=1 and ?id=2 (though they go to the same page). I am running some experiments to confirm it; it might take 5 to 6 days.

How far should we take canonicalization? Google is the smartest website I have seen: its canonicalization is perfect, and no other search engine has done the same. Try ?id=1 on Google, then try the same on Yahoo, Ask or MSN.

9:57 pm on May 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I see many sites with their root page listed both as www.domain.com and as www.domain.com/index.html - and that is a problem.

Agreed, the root of a site is at "/" and that is what should be linked to, not a named index file. Linking to a named index file is asking for trouble.

10:45 pm on May 3, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:July 19, 2004
posts:142
votes: 0


g1smd,

I just saw your recent post under the Google canonicalization thread and saw your rewrite example.

You indicated
(index|default|home)\.(html?|php|asp|cfm|jsp)

I'm pretty rusty on rewrites as well as regular expressions, but shouldn't the rewrite be the following?

(index|default|home)\.(htm?|php|asp|cfm|jsp)

11:43 pm on May 3, 2007 (gmt 0)

Junior Member

joined:Mar 15, 2007
posts:120
votes: 0


Yes, I think it's important to serve the same content at only one URL unless you take other steps to make sure the duplicate version of the page cannot be indexed. Some people are experimenting with showing different content on /index.html and on the non-www http:// host, often using them as 404 pages, and I hear they are having some success. Of course, they are serving different content... Spammy? No. Clever? Possibly!

11:52 pm on May 3, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


html? accepts either htm or html as a valid name.

htm? accepts either ht or htm as a valid name.

All of those URLs, if requested, issue an external 301 redirect (not a rewrite) to the / version of the URL.
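
The difference between an external redirect and a rewrite can be sketched as follows, assuming mod_rewrite in a root .htaccess file. www.example.com is a placeholder, the file names in the second rule are hypothetical, and the THE_REQUEST guard is an added assumption so that the redirect reacts only to what the client literally requested:

```apache
RewriteEngine On

# External redirect: the server answers "301 Moved Permanently" with a
# Location header; the browser then requests "/" and its address bar changes.
# "html?" makes the final "l" optional, so index.htm and index.html both match.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?\ HTTP/
RewriteRule ^index\.html?$ http://www.example.com/ [R=301,L]

# Internal rewrite, shown for contrast: no R flag, so the server quietly
# serves a different file and the visitor's URL does not change.
RewriteRule ^about$ /about-us.html [L]
```

Only the first form tells search engines that the old URL has permanently moved, which is why it is the one used for canonicalization here.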

12:16 pm on May 8, 2007 (gmt 0)

New User

10+ Year Member

joined:Sept 9, 2006
posts: 27
votes: 0


I have been working via Webmasterworld pages on getting some of my sites correctly canonicalised - some great advice from jpMorgan, then I picked up a thread on the /index rewrite to start cutting off the enemy at the pass. Best forum around for this stuff.

The following is my current (simple) mod_rewrite setup, and I am still confused as to why the capitalisation in the domain doesn't get forced to lower case.

I assumed that www.EXAMPLE.COM would be forced to www.example.com - doesn't seem to work that way.

Rule 1 below deals with the index.htm(l) problem nicely, Rule 2 with the www. vs. non-www very well, and in combination they also work.

But I still have capitalisation issues - I don't mean inside the site or with individual URLs - different issue. I mean the capitalisation of the domain name and TLD.

I have read and think I understand the capitalisation discussion on webmasterworld, so I don't think that is the issue. I have also messed with adding [nc] to different lines and still can't make headway.

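Options +FollowSymLinks
RewriteEngine On
# Rule 1: redirect direct requests for /index.htm(l) to the root URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?\ HTTP/ [NC]
RewriteRule ^index\.html?$ http://www.example.com/ [R=301,L]
# Rule 2: redirect any other host name (e.g. non-www) to www.example.com
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]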

Any ideas from the experts? Have I misunderstood how far you can go here?

Thanks
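
On the capitalisation question, one hedged reading of the rules above: host names are case-insensitive, and browsers generally lowercase the host before sending the request, so a type-in of www.EXAMPLE.COM typically reaches the server as www.example.com and Rule 2 has nothing to correct. When a client does send a mixed-case Host header, the pattern is compared case-sensitively as long as [NC] is not set on that condition, so the redirect fires. The sketch below makes that explicit; the empty-host guard is an addition not present in the original rules:

```apache
RewriteEngine On

# Skip requests that carry no Host header at all (old HTTP/1.0 clients).
RewriteCond %{HTTP_HOST} !^$
# Redirect any host that is not exactly the lowercase canonical name.
# No [NC] here: with [NC], WWW.EXAMPLE.COM would count as a match
# and would never be redirected.
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

To see the rule act on capitalisation, the request probably needs to come from a client that lets you set the Host header explicitly rather than from a browser address bar.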
