Forum Moderators: Robert Charlton & goodroi
It is showing up as:
site:www.example.com
site:www.example.com/home.html
Usually, site:www.example.com is the one that shows.
Has anyone experienced this and/or know why this may be happening?
[edited by: tedster at 4:50 pm (utc) on Jan. 17, 2010]
[edit reason] moved from another location [/edit]
I'm interested in the fact that you are seeing this now with the site: operator, because I haven't come across an example in the SERPs for quite a few months. Google was apparently combining duplicate versions of a URL, or at least only showing one version. I expressed doubts here that they were "really" combining all the link juice for the two urls - at least in every case - but it became difficult to show any supporting data for that to a website owner.
Your report underscores the importance of handling potential canonical problems on your own server - and not depending on Google or any search engine to get it right for you. They may filter out the evidence so their users don't get the duplicate content, but that doesn't mean your site is getting the full ranking potential that it might.
When Google is in technical disarray (as seems to be the case right now) we can often get glimpses and insights that are normally hidden from view. This may be such a case.
I think that's a great point.
Maybe this can help us to learn about an area that we might not know a lot about and it can help us in the future.
Gouri, in your case does the toolbar show the same PageRank for both versions of the URL?
Both versions of the page are showing the same PageRank.
But the appearance of site:www.example.com/home.html in the Google index is new.
It is interesting to try to understand what this means, since pages that are new in the Google index usually don't have PageRank.
I also wanted to mention that this is not showing in Google Caffeine, only in regular Google (in case it might mean something).
At any rate, the message seems clear. We should continue to take responsibility for all URL canonicalization on our own shoulders.
Taking the responsibility theme a bit further, the canonical meta tag is apparently "working", but it is still handing responsibility over to Google. So I consider the canonical meta tag to be a back-up approach used to support other steps. Or sometimes, I use it as a last resort - a temporary band-aid for the messiest of technical tangles.
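As a rough picture of what "URL canonicalization" means in this thread, here is a minimal Python sketch that maps the duplicate addresses of a home page onto one preferred URL. The host name, the list of index file names, and the function name canonicalize are all assumptions made for illustration; a real site would enforce this on the server rather than in a script.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration only: the preferred host and the file
# names that are really just the home page under another address.
CANONICAL_HOST = "www.example.com"
BARE_HOST = "example.com"
INDEX_NAMES = {"/home.html", "/index.html"}


def canonicalize(url):
    """Return the single preferred form of a URL on this (hypothetical) site."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST      # non-www folds into www
    path = parts.path or "/"       # an empty path means the root
    if path in INDEX_NAMES:
        path = "/"                 # index file names fold into the root
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


variants = [
    "http://www.example.com/",
    "http://www.example.com/home.html",
    "http://example.com/",
    "http://example.com/home.html",
]
# Collapse every variant and show the distinct results.
print({canonicalize(u) for u in variants})
```

All four variants should collapse to http://www.example.com/, which is the outcome a server-side redirect (or, failing that, the canonical tag) is meant to produce.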
Or since there is actually only one page, I wouldn't be able to block the /home.html version of the file from being indexed?
In this situation, would using a robots.txt file to indicate that the /home.html version of the page should not be indexed help in terms of getting the full page rank and rankings for keywords that the page should be getting?
IMO No... and robots.txt does not keep a URL from being indexed, but rather keeps it from being spidered. (There's a difference) I recommend solving the issue the correct way by redirecting the /home.html to the root example.com/. It's a fairly simple redirect... If you cannot do this for some reason, then I would go with tedster's suggestion of the canonical tag, which means you cannot block the page in robots.txt, because the page needs to be spidered for the tag to be seen and have any effect whatsoever.
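To make the spidering point above concrete, here is a small sketch using Python's standard urllib.robotparser. The robots.txt rule is hypothetical (it is what blocking /home.html might look like), and may_fetch is a made-up helper name. The check only models whether a well-behaved crawler is allowed to fetch a URL; it says nothing about whether a search engine will still list that URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that tries to "hide" the duplicate URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /home.html
"""


def may_fetch(url, user_agent="Googlebot"):
    """Return True if the robots.txt above lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch("http://www.example.com/"))
print(may_fetch("http://www.example.com/home.html"))
```

The second check should come back False: the crawler is told not to fetch /home.html, so it would never read a canonical tag placed on that URL. That is why the two approaches cannot be combined.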
I do not have access to the root host file.
Interesting...
If you can get to the robots.txt location (example.com/robots.txt), you can usually get to the location of the .htaccess file for the root of the domain, because they are in the same directory?
The .htaccess is a hidden file, so you might need to turn invisibles 'on' in whatever FTP program you are using to see it, and there is some other advice you can get from the Apache Forum [webmasterworld.com] you should follow, like only using a plain text editor, not Word, to edit the file. Also, if it's not in the same directory as example.com/home.html, you can create one... (It seems like you should have access, even if you don't know you have access, unless you are on a Windows box, because the files you are talking about editing are all in the root folder and the .htaccess file you need to work with would be in the same place.)
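For anyone unsure what the redirect is supposed to accomplish, the logic is small. The sketch below is not Apache configuration. It is a hypothetical Python WSGI app (the name app and the page body are made up) that shows only the intended behavior: a request for /home.html gets a 301 pointing at the root, and everything else is served normally. On Apache the equivalent rule would live in .htaccess, and the Apache Forum is the place to get the exact directives.

```python
# Illustration of the redirect behavior only; not an .htaccess rule.
CANONICAL_HOME = "http://www.example.com/"  # preferred home page URL


def app(environ, start_response):
    """Minimal WSGI app: permanently redirect /home.html to the root."""
    path = environ.get("PATH_INFO", "/")
    if path == "/home.html":
        # 301 tells browsers and crawlers the move is permanent.
        start_response("301 Moved Permanently", [("Location", CANONICAL_HOME)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>Home page (placeholder)</body></html>"]
```

The point is that only one address ever answers with the page itself, so there is nothing left for a search engine to guess about.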
Does this mean that the only way that I would be able to do something is by putting a meta tag or canonical meta tag on the home page?
Yes, if you cannot find, get to or create an .htaccess file then you will probably need to use a canonical meta tag and hope for the best handling from Google. (Personally, I would not use a host where I could not use an .htaccess file for redirecting, so if it's your site and you want to be in control of what's going on, and do not have that level of control you might want to consider moving somewhere else.)
In the <head> of the home page put:
<link rel="canonical" href="http://www.example.com/">
There's more detailed info here:
Specify Canonical URL with a Meta Tag [googlewebmastercentral.blogspot.com]
I am going to see if I can include this link tag on the page.
Is <link rel="canonical" href="http://www.example.com/"> considered a meta tag or is it something different? The reason I ask is that I think I would be able to add a meta tag but I am not sure that I can add a link tag.
Also, do you think Google will understand that this is pertaining to having www.example.com as the home page and not www.example.com/home.html or is it possible that they may think it is in reference to www.example.com instead of example.com?
I appreciate your help.
Is <link rel="canonical" href="http://www.example.com/"> considered a meta tag or is it something different?
Technically, I think it's considered a 'Relationship Link', but meta tag for general discussion will probably suffice and might even lessen confusion a bit. How that will affect your ability to add the info I really don't know. If it's a person you have to get it past, then I would probably refer to it as a meta tag. If it's some sort of goofy program or software you are using, then it might not work so well. (It does need to be <link> not <meta> to work, IMO.)
You can find more information here:
Links in HTML Documents [w3.org]
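To show the link-versus-meta distinction in practice, here is a rough sketch built on Python's standard html.parser. The class and function names are made up, and the meta variant in the second sample is a deliberately wrong, hypothetical form. The sketch reads the canonical URL only from a link element with rel="canonical", which is the format given in the Google post; it is not a claim about how Google's own parser is written.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated values.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the markup, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


good = '<head><link rel="canonical" href="http://www.example.com/"></head>'
# Hypothetical wrong form: a <meta> element instead of <link>.
wrong = '<head><meta name="canonical" content="http://www.example.com/"></head>'

print(find_canonical(good))
print(find_canonical(wrong))
```

The first sample yields the URL and the second yields None, since this sketch never looks at meta elements.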
Also, do you think Google will understand that this is pertaining to having www.example.com as the home page and not www.example.com/home.html or is it possible that they may think it is in reference to www.example.com instead of example.com?
IMO they will consider it the canonical location for all versions of the page, which, again IMO, will include www.example.com and example.com, but this is pure speculation on my part and someone else may have a better or more definite answer.
<link rel="canonical" href="http://www.example.com/">
will be put in the head section of the page that you don't want to show up in the search engine. That is the point of the tag, so the search engines know which page is to be considered the page to be ranked (the other one) and which one is the duplicate (the page that contains the tag). In my situation I only have one page, but it is showing up in two different ways in Google when I do a site: operator search.
I think this situation is a little different than actually having two versions of the same page in the web hosting account.
Does this change anything? Such as making it not a good idea to add such a tag since there is in reality only one page (not two different pages with the same content) in the web hosting account?
The exact answer is a little over 1/3 of the way down the linked blog page in the replies to the questions asked.
Link to the Google Blog Provided Above:
Wade Leftwich said...
And I assume it's OK for the canonical page to have a 'link rel="canonical"' pointing to itself?
@Wade: Yes, it's absolutely okay to have a self-referential rel="canonical". It won't harm the system and additionally, by including a self-reference you better ensure that your mirrors have a rel="canonical" to you.
I am not sure what effect this is having on my site's rankings, but it is disturbing, considering that NONE of the referring urls link to me with index.html. Could inbound links get discounted because googlebot is trying to crawl index.html and encountering a 404?