Domain root and home.html - both showing with the site:operator - Google Search and SEO forum at WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

gouri

4:32 pm on Jan 17, 2010 (gmt 0)

When I run a site: operator search, the homepage of a site shows up under two different URLs, which I have not seen before.

It is showing up as:

site:www.example.com
site:www.example.com/home.html

Usually, site:www.example.com is the one that shows.

Has anyone experienced this, or does anyone know why it may be happening?

tedster

4:48 pm on Jan 17, 2010 (gmt 0)

If your server returns content for both URLs with a 200 HTTP status, then you've got a canonical URL problem [webmasterworld.com], gouri.

I'm interested in the fact that you are seeing this now with the site: operator, because I haven't come across an example in the SERPs for quite a few months. Google was apparently combining duplicate versions of a URL, or at least only showing one version. I expressed doubts here that they were "really" combining all the link juice for the two urls - at least in every case - but it became difficult to show any supporting data for that to a website owner.

Your report underscores the importance of handling potential canonical problems on your own server - and not depending on Google or any search engine to get it right for you. They may filter out the evidence so their users don't get the duplicate content, but that doesn't mean your site is getting the full ranking potential that it might.
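
To make the 200-status test concrete, here is a minimal Python sketch. It assumes you have already requested each URL (without following redirects) and noted the status code; the function name and the sample observations are hypothetical, not taken from gouri's site.

```python
from typing import Dict, List


def duplicate_urls(statuses: Dict[str, int], canonical: str) -> List[str]:
    """Return the non-canonical URLs that answer 200 instead of redirecting."""
    return sorted(
        url
        for url, status in statuses.items()
        if url != canonical and status == 200
    )


# Hypothetical observations: both the root and /home.html answer 200.
observed = {
    "http://www.example.com/": 200,
    "http://www.example.com/home.html": 200,
}

print(duplicate_urls(observed, "http://www.example.com/"))
```

Any URL this reports is serving the same content as the canonical one with its own 200 response. Once /home.html answers with a 301 to the root, the list should come back empty.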

tedster

7:27 pm on Jan 17, 2010 (gmt 0)

When Google is in technical disarray (as seems to be the case right now) we can often get glimpses and insights that are normally hidden from view. This may be such a case.

Gouri, in your case does the toolbar show the same PageRank for both versions of the URL?

gouri

7:42 pm on Jan 17, 2010 (gmt 0)

> When Google is in technical disarray (as seems to be the case right now) we can often get glimpses and insights that are normally hidden from view. This may be such a case.

I think that's a great point.

Maybe this can help us to learn about an area that we might not know a lot about and it can help us in the future.

> Gouri, in your case does the toolbar show the same PageRank for both versions of the URL?

Both versions of the page are showing the same PageRank.

But the appearance of site:www.example.com/home.html in the Google index is new.

That is what I am trying to understand, since pages that are new to the Google index usually don't have PageRank.

I also wanted to mention that this is not showing in Google Caffeine, only in regular Google (in case that means something).

tedster

8:03 pm on Jan 17, 2010 (gmt 0)

My suspicion has been that Google filters out the duplicates from showing in search results, and they point the public PageRank server to show the same PR - but that the actual link juice is still not combined in all situations. Some cases may be handled as we hope, especially the most common (index.html, default.aspx and the like) but home.html is only about 10% as common. Perhaps that's also a factor here.

At any rate, the message seems clear. We should continue to take responsibility for all URL canonicalization on our own shoulders.

Taking the responsibility theme a bit further, the canonical meta tag is apparently "working", but it is still handing responsibility over to Google. So I consider the canonical meta tag to be a back-up approach used to support other steps. Or sometimes, I use it as a last resort - a temporary band-aid for the messiest of technical tangles.

gouri

8:21 pm on Jan 17, 2010 (gmt 0)

In this situation, would using a robots.txt file to indicate that the /home.html version of the page should not be indexed help in terms of getting the full page rank and rankings for keywords that the page should be getting?

Or, since there is actually only one page, would I be unable to block the /home.html version of the file from being indexed?

TheMadScientist

8:33 pm on Jan 17, 2010 (gmt 0)

> In this situation, would using a robots.txt file to indicate that the /home.html version of the page should not be indexed help in terms of getting the full page rank and rankings for keywords that the page should be getting?

IMO No... and robots.txt does not keep a URL from being indexed, but rather keeps it from being spidered. (There's a difference) I recommend solving the issue the correct way by redirecting the /home.html to the root example.com/. It's a fairly simple redirect... If you cannot do this for some reason, then I would go with tedster's suggestion of the canonical tag, which means you cannot block the page in robots.txt, because the page needs to be spidered for the tag to be seen and have any effect whatsoever.
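
As a rough illustration of the spidering point, here is a small Python sketch using the standard library's robots.txt parser. The Disallow rule and the helper name are hypothetical; the sketch only shows that such a rule stops a compliant crawler from fetching /home.html, which is also why a canonical tag on a blocked page would never be read. It says nothing about whether the URL stays in the index.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_lines, url, agent="Googlebot"):
    """Return True if the given robots.txt lines let the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt that blocks only the /home.html version.
rules = [
    "User-agent: *",
    "Disallow: /home.html",
]

print(can_crawl(rules, "http://www.example.com/home.html"))
print(can_crawl(rules, "http://www.example.com/"))
```

The first call reports that the blocked URL may not be fetched, the second that the root may.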

gouri

8:47 pm on Jan 17, 2010 (gmt 0)

First, thank you for that response.

I do not have access to the root host file.

Does this mean that the only way that I would be able to do something is by putting a meta tag or canonical meta tag on the home page?

TheMadScientist

9:04 pm on Jan 17, 2010 (gmt 0)

> I do not have access to the root host file.

Interesting...

If you can get to the robots.txt location (example.com/robots.txt), you can usually get to the location of the .htaccess file for the root of the domain, because they are in the same directory.

The .htaccess is a hidden file, so you might need to turn invisibles 'on' in whatever FTP program you are using to see it, and there is some other advice you can get from the Apache Forum [webmasterworld.com] you should follow, like only using a plain text editor, not Word, to edit the file. Also, if it's not in the same directory as example.com/home.html, you can create one... (It seems like you should have access, even if you don't know you have access, unless you are on a Windows box, because the files you are talking about editing are all in the root folder and the .htaccess file you need to work with would be in the same place.)

> Does this mean that the only way that I would be able to do something is by putting a meta tag or canonical meta tag on the home page?

Yes, if you cannot find, get to or create an .htaccess file then you will probably need to use a canonical meta tag and hope for the best handling from Google. (Personally, I would not use a host where I could not use an .htaccess file for redirecting, so if it's your site and you want to be in control of what's going on, and do not have that level of control you might want to consider moving somewhere else.)

gouri

9:10 pm on Jan 17, 2010 (gmt 0)

I can't access the .htaccess file because that is in the root host file and I don't have access to it.

I also don't have FTP.

I think my only option is the meta tag. Can you please tell me what I would have to write in the tag?

TheMadScientist

9:13 pm on Jan 17, 2010 (gmt 0)

Well that's ugly...

In the <head> of the home page put:
<link rel="canonical" href="http://www.example.com/">

There's more detailed info here:
Specify Canonical URL with a Meta Tag [googlewebmastercentral.blogspot.com]
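
If you want to confirm the tag actually made it into the page, something like the following Python sketch would do. It assumes you have the page's HTML as a string; the class name, the function name and the sample page are made up for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attr.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical home page carrying the tag from this post.
page = """<html><head>
<title>Home</title>
<link rel="canonical" href="http://www.example.com/">
</head><body>Welcome</body></html>"""

print(find_canonical(page))
```

Run against the HTML served at both the root and /home.html, it should report the same canonical URL each time.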

gouri

9:32 pm on Jan 17, 2010 (gmt 0)

Thanks for the help with the coding and also the link. It has a lot of good information on it.

I am going to see if I can include this link tag on the page.

Is <link rel="canonical" href="http://www.example.com/"> considered a meta tag or is it something different? The reason I ask is that I think I would be able to add a meta tag but I am not sure that I can add a link tag.

Also, do you think Google will understand that this is pertaining to having www.example.com as the home page and not www.example.com/home.html or is it possible that they may think it is in reference to www.example.com instead of example.com?

I appreciate your help.

TheMadScientist

10:12 pm on Jan 17, 2010 (gmt 0)

> Is <link rel="canonical" href="http://www.example.com/"> considered a meta tag or is it something different?

Technically, I think it's considered a 'Relationship Link', but meta tag for general discussion will probably suffice and might even lessen confusion a bit. How that will affect your ability to add the info I really don't know. If it's a person you have to get it past, then I would probably refer to it as a meta tag. If it's some sort of goofy program or software you are using, then it might not work so well. (It does need to be <link> not <meta> to work, IMO.)

You can find more information here:
Links in HTML Documents [w3.org]

> Also, do you think Google will understand that this is pertaining to having www.example.com as the home page and not www.example.com/home.html or is it possible that they may think it is in reference to www.example.com instead of example.com?

IMO they will consider it the canonical location for all versions of the page, which, again IMO, will include www.example.com and example.com, but this is pure speculation on my part and someone else may have a better or more definite answer.

gouri

10:29 pm on Jan 17, 2010 (gmt 0)

I don't mean to make things more confusing, but many times this tag will be put in the meta section of the page that you don't want to show up in the search engine. That is the point of the tag, so the search engines know which page is to be considered the page to be ranked (the other one) and which one is the duplicate (the page that contains the tag). In my situation I only have one page, but it is showing up in two different ways in Google when I do a site: operator search.

I think this situation is a little different than actually having two versions of the same page in the web hosting account.

Does this change anything? Such as making it not a good idea to add such a tag since there is in reality only one page (not two different pages with the same content) in the web hosting account?

TheMadScientist

11:02 pm on Jan 17, 2010 (gmt 0)

It's the same content on two different URLs, which is addressed in the 3rd example down on the previously linked Google Blog page... The only difference is they are using Session IDs in their example rather than different locations. It's essentially the same though. You have the exact same content showing on different URLs, essentially the same as you would if you used Session IDs in your URLs.

The exact answer is a little over 1/3 of the way down the linked blog page in the replies to the questions asked.

Link to the Google Blog Provided Above:

Wade Leftwich said...
And I assume it's OK for the canonical page to have a 'link rel="canonical"' pointing to itself?
@Wade: Yes, it's absolutely okay to have a self-referential rel="canonical". It won't harm the system and additionally, by including a self-reference you better ensure that your mirrors have a rel="canonical" to you.

gouri

12:12 am on Jan 18, 2010 (gmt 0)

Thank you for explaining that to me and also for including the response from the Google Blog.

That is very helpful.

I am going to find out if I can include a <link> tag on the page.

gouri

1:16 pm on Jan 21, 2010 (gmt 0)

Would the Parameter Handling feature in Google Webmaster Tools be useful in this situation?

I think that also deals with duplicate content.

tedster

8:06 pm on Jan 21, 2010 (gmt 0)

There's no query string parameter in either URL (the domain root or home.html), so I don't see how it would be useful.
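
A quick way to see this, sketched in Python with the standard URL parser. The helper name is made up, and the third URL, with a session ID, is a hypothetical example of the kind of duplicate that parameter handling is meant for.

```python
from urllib.parse import parse_qs, urlsplit


def query_parameters(url: str) -> dict:
    """Return the query string parameters of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


urls = [
    "http://www.example.com/",
    "http://www.example.com/home.html",
    "http://www.example.com/home.html?sessionid=123",  # hypothetical
]

for url in urls:
    print(url, query_parameters(url))
```

The first two URLs yield no parameters at all, so there is nothing for the feature to act on. Only the third would give it something to ignore.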

crobb305

8:17 pm on Jan 21, 2010 (gmt 0)

For my site, Webmaster Tools is showing over one hundred 404s (Page Not Found) on googlebot requests for index.html. It also shows the referring URL for each instance, and NONE of those referrers link to me with index.html. They all link to me using the canonical. I am not sure how long this has been going on, because I only recently started using GWT. It has motivated me to clean up my .htaccess and ensure index.html resolves to the canonical.

I am not sure what effect this is having on my site's rankings, but it is disturbing, considering that NONE of the referring urls link to me with index.html. Could inbound links get discounted because googlebot is trying to crawl index.html and encountering a 404?

crobb305

8:40 pm on Jan 21, 2010 (gmt 0)

If you use the canonical tag, do you need it on every page, or homepage only?

TheMadScientist

8:44 pm on Jan 21, 2010 (gmt 0)

You need it on any page for which you would like another 'essentially the same' page to be considered the original source of the information, and it does not hurt to have it on the canonical page itself. So, you need it on every version of your home page (every location where your home page is presented) except the location you want to be considered the original (canonical) version, and there it's optional.