Whatever is going on with your rankings, it is not because of duplicate content. Search engines index character strings, not meanings - so translated content is most definitely not duplicate.
I would strongly suspect the type of content negotiation you are implementing.
1. What happens when googlebot requests a page?
2. Is there a natural click path to each language version, or is it all automatic redirects?
3. How much cross-linking is involved? Does each page link to its four counterparts?
Thanks Ted, my responses below:
1. Can this be viewed in Google Analytics or in some other web log analysis tool?
2. There is a natural click path to fr.example.com from www.example.com (a prominent country flag)
3. Very little cross-linking (apart from point 2), on the basis that readers of the English version share so little in common with readers of the other language versions.
I hope I have understood your questions. Thanks again and please feel free to ask more questions. I will respond immediately.
[edited by: tedster at 9:04 pm (utc) on Sep. 5, 2008]
Your server logs should tell you a more detailed story. Your tech team should also be able to tell you how the googlebot user agent will be handled - at least how they planned for it to be handled. Googlebot isn't a browser, and it won't be responding to language negotiation. So you need to make sure it can get a response for your urls in that situation.
I don't use GA very much, but the accounts that I can see do not have a search engine bots report. As far as I know, you need your server logs to see that.
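If you have raw access logs in the common combined format, a few lines of script will show you exactly what Googlebot requested and what status codes it got back. Here's a rough sketch (the log lines are invented for illustration; your real log format may differ):

```python
import re

# Combined Log Format: host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Return (path, status) pairs for requests whose user agent names Googlebot."""
    hits = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and "Googlebot" in m.group("agent"):
            hits.append((m.group("path"), m.group("status")))
    return hits

# Hypothetical sample lines, just to show the idea:
sample = [
    '66.249.66.1 - - [05/Sep/2008:09:04:00 +0000] "GET /produits.html HTTP/1.1" 302 0 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [05/Sep/2008:09:05:00 +0000] "GET /produits.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (Windows; U; fr)"',
]

print(googlebot_hits(sample))  # → [('/produits.html', '302')]
```

If Googlebot's hits show redirects or the wrong-language page, that's the negotiation problem surfacing in the logs.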
My general recommendation is not to use automated language detection and forced redirects. First, when I travel to other countries, that type of site drives me wild with frustration. I think you're much better off allowing users to get the exact url that they asked for, and making the language choices they can make very clear on the page.
The technical "trick" of automated language detection can definitely backfire in many ways.
thanks again tedster, very generous of you. my work is cut out for me.
cheers - J
Here are my findings for the interest of others who find this topic:
Googlebot (not being a browser) does not perform HTTP content negotiation. It crawls without any preferred language setting and sends no "Accept-Language" header, so only the English page is served whenever it requests a specific URL. As a result, Google sees all five sites as identical (the English version), which can mean fewer pages indexed and, at worst, a duplicate content penalty.
Short term solution:
. publish this tag <HTML lang="XX"> on every page of each site. Replace the XX with the appropriate language code (fr, de, es etc.)
. have your server send this header: Content-Language: XX (again, replace the XX with the appropriate language code)
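As a sketch of those two fixes together (the helper function is mine, not from any framework; the language codes are the real values):

```python
def language_metadata(lang_code):
    """Return the HTTP header and opening HTML tag that declare a page's language."""
    header = ("Content-Language", lang_code)
    html_tag = '<html lang="%s">' % lang_code  # lowercase <html> is equivalent
    return header, html_tag

for code in ("fr", "de", "es"):
    header, tag = language_metadata(code)
    print("%s: %s" % header, "and", tag)
# e.g. Content-Language: fr and <html lang="fr">
```

In practice the header comes from your server or application config rather than a helper like this, but the point is that both signals are declared explicitly per language, so crawlers don't have to rely on negotiation at all.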
[edited by: tedster at 7:40 pm (utc) on Sep. 15, 2008]
a quick question..
you said "3. How much cross-linking is involved? Does each page link to its four counterparts?"..
What do you mean by "four counterparts"?
johoney has five sites, and each one is a translation into another language. So in any given language, every page has four counterparts. Cross-linking every page directly to its four translations has been known to get a multi-lingual site into ranking problems. That kind of deep, internal and sitewide crosslinking between domains apparently runs into some kind of filter, or at least it has historically.