That's an interesting observation - thanks. When you say "kicks the page" I assume you mean filters it out or doesn't rank it, correct?
Have you tried changing this to see if the page comes back? Google engineers, in the past at least, denied that the server's HTTP header or the page's meta information for language had much effect on rankings.
My English-language site has got a fair amount of foreign-language phrases on it, by necessity. I haven't seen any negative effect as of today, but if Google starts to penalize that then I'm pretty much hosed.
|When you say "kicks the page" I assume you mean filters it out or doesn't rank it, correct? |
Yes ;) The pages did rank (and were indexed) and now they don't rank - and are not indexed.
- on a side note, the corresponding pages for the BlueWidgets where comment language = content language are enjoying a much higher/better indexation than before.
|Have you tried changing this to see if the page comes back? |
Yes... currently working on it, but there are a lot of pages, and 'generating' unique content won't work. Experimenting with language-detection software to only show comments/reviews for the relevant language.
|My English-language site has got a fair amount of foreign-language phrases on it, by necessity. I haven't seen any negative effect as of today, but if Google starts to penalize that then I'm pretty much hosed. |
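A rough sketch of that filtering step, in Python. Everything here is hypothetical: `comments_for_page`, the `detect` callable and the 0.9 confidence cut-off are illustrative names and values, and `toy_detect` is only a stopword-counting stand-in so the example runs on its own. A real setup would plug in whatever language-detection software is actually being used.

```python
from typing import Callable, Iterable, List, Tuple

# A detector takes a text and returns (language_code, confidence in 0..1).
Detector = Callable[[str], Tuple[str, float]]

def comments_for_page(comments: Iterable[str], page_lang: str,
                      detect: Detector, min_confidence: float = 0.9) -> List[str]:
    """Keep only comments confidently detected as the page's own language."""
    kept = []
    for comment in comments:
        lang, confidence = detect(comment)
        if lang == page_lang and confidence >= min_confidence:
            kept.append(comment)
    return kept

def toy_detect(text: str) -> Tuple[str, float]:
    """Stand-in detector for this demo only: counts a few common stopwords."""
    words = text.lower().split()
    en = sum(w in {"the", "and", "is", "this"} for w in words)
    es = sum(w in {"el", "la", "y", "es", "muy"} for w in words)
    if en == es:
        return ("unknown", 0.0)
    lang = "en" if en > es else "es"
    return (lang, max(en, es) / (en + es))

reviews = ["this is the best widget", "el widget es muy bueno"]
english_page_reviews = comments_for_page(reviews, "en", toy_detect)
spanish_page_reviews = comments_for_page(reviews, "es", toy_detect)
```

Comments that the detector cannot place with enough confidence are simply left off every language version, which is the conservative choice if the goal is to keep each page in one language.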
What I have seen is that it primarily affects pages where there is a substantial amount of 'foreign' language elements in one or more large blocks in the same language.
If you have a few short comments, then you shouldn't be affected.
However, to be safe I would do some checks to make sure that you have at a bare minimum 60-65% of 'correct' language content on the page.
Again, I only noticed this because not all of the pages from the website are indexed yet. Now Google has started skewing the indexing and dropping pages it probably finds "not relevant to index". So I would not call it a penalty... but it might as well be one, given what it is doing to my traffic, since a lot of important long-tail pages are no longer indexed.
|brotherhood of LAN|
|Experimenting with language-detection software |
I've tried using a free script/method that uses trigrams, 3-letter sequences to best guess a language. It seems to be fairly effective.
I've read somewhere that Google's AJAX language API [google.com] uses a similar method.
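For anyone curious what the trigram idea looks like in practice, here is a minimal sketch in Python. It is not the script mentioned above and makes no claim about how Google does it. The sample texts, the function names and the choice of cosine similarity for scoring are all assumptions made for illustration; usable profiles would be built from far larger corpora.

```python
from collections import Counter
import math

# Tiny illustrative training samples (assumption: real profiles need much more text).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a review "
          "of the blue widget which works well",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esta es una "
          "reseña del widget azul que funciona bien",
}

def trigrams(text):
    """Count overlapping 3-character sequences in the normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def guess_language(text):
    """Return the language whose trigram profile is most similar to the text."""
    counts = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(counts, PROFILES[lang]))

print(guess_language("the dog jumps over the fox"))
print(guess_language("el perro salta sobre el zorro"))
```

Very short comments give the method little to work with, so guesses on one-line reviews are likely to be less reliable than guesses on long ones.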
|I've tried using a free script/method that uses trigrams, 3-letter sequences to best guess a language. It seems to be fairly effective. |
The language detection seems to be working with a 90% reliability. Getting it to work over all pages/comments and languages is what's proving to be the challenge ;)
"...and are not indexed."
Then you have other issues than what you are suggesting. Google doesn't exclude pages from its index based on language.
|Then you have other issues than what you are suggesting. Google doesn't exclude pages from its index based on language. |
Yes. The website has one other major issue. I mentioned it earlier in the thread. We have too many pages (in relation to inbound external links) since our content is multilingual.
We used to have a good spread on the indexing though, and now it is heavily skewed, because of the language. Our total indexation has taken a slight beating (which happens from time to time), but the language factor has hurt specific portions of the website severely. It wasn't obvious at first, but after analyzing which pages have been dropped, and also what pages have been added, there can be only one conclusion.
Are the Spanish comments on the English version of the page identical to the Spanish comments on the Spanish version of the page? I.e. are you publishing the same comments on both versions of the page?
Also, what is your ratio of the English written content to the text from Spanish comments on the page? 40%-60%? 30%-70%? Other?
|are you publishing the same comments on both versions of the page? |
Yes, but since they are reviews they tend to be rather long-winded.
|what is your ratio of the English written content |
It varies. Worst case pages with lots of reviews have as little as 15% English.
-- -- -- -- -- -- --
I would like to stress that the following information is mostly guesswork and has not been tested yet. These are merely observations and not to be taken at face value.
For the affected pages it seems that the red-line is drawn at ~60%. That is to say English vs Spanish 60-40.
However that is only where the problem is bilingual. If the content is multilingual you can get closer to ~40% English... so long as none of the non-English content reaches higher than ~40% in and of itself.
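To make those two rules of thumb concrete, here is a small Python sketch. It only encodes the untested guesses above (about 60% for the bilingual case, about 40% for the multilingual case with no single other language above about 40%). The function names and the per-language word counts are hypothetical, and the counts are assumed to come from some separate language-detection step.

```python
def language_shares(word_counts):
    """Turn per-language word counts into fractions of the page."""
    total = sum(word_counts.values())
    if total == 0:
        return {}
    return {lang: count / total for lang, count in word_counts.items()}

def looks_at_risk(word_counts, primary="en"):
    """Apply the guessed thresholds; True means the page may be affected."""
    shares = language_shares(word_counts)
    primary_share = shares.get(primary, 0.0)
    others = {lang: s for lang, s in shares.items() if lang != primary and s > 0}
    if len(others) <= 1:
        # Bilingual (or single-language) page: primary language should hold ~60%.
        return primary_share < 0.60
    # Multilingual page: primary may drop to ~40%, but no other
    # single language should exceed ~40% on its own.
    return primary_share < 0.40 or max(others.values()) > 0.40

worst_case = looks_at_risk({"en": 150, "es": 850})               # 15% English
mostly_english = looks_at_risk({"en": 700, "es": 300})           # 70% English
mixed_ok = looks_at_risk({"en": 450, "es": 300, "fr": 250})      # 45/30/25
mixed_bad = looks_at_risk({"en": 450, "es": 450, "fr": 100})     # Spanish at 45%
```

Since the thresholds are guesses, they are best treated as a way to shortlist pages for a closer look rather than as a pass/fail check.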
Based on your answers, have you thought that maybe you are triggering a duplicate-content filter and not some new language filter? If Google thinks the pages have too much similar content, it might filter one version of the page out.
What you could try to test this is to add some "unique" Spanish comments on the English version of the page, ensuring that these comments do not appear on the Spanish (and other language) versions of the page. Then see if it gets indexed.
That is a valid point...
However, that does not explain the issue.
1. If there are enough comments/reviews (which is the case 70% of the time) we have a filter to sort the comments differently, i.e. the displayed comments are different on the Spanish vs. the English page, unless the user clicks to see more comments (these are on a separate page w/ "robots='noindex'" tag)
2. Comments/reviews were just an example I took. There are other pages where I am seeing the same issue with different types of (unique) content.
3. The problem has affected all languages and all categories of pages. The only common factor for dropped pages is the language factor.
OK, I see.... I am interested in what your testing / investigation will show because one of the sites I have runs in 6 languages.
We have a few pages where there is a mixture of English and local language(s), with English prevailing. These pages are not ranking very well on the local-language Google(s), but they never did rank particularly well.
What I am going to do is record the exact ranking position of these pages, then remove the English content and leave the local language content only. This will make these pages smaller, but they will still have around 300 words of unique content on the page. I will see if the ranking improves the next time Google caches these pages.
|Google has incorporated a language detection parameter to the algo. I can't say as to whether it is a *new* feature, or if it only in the past 3-4 weeks has been given greater value. |
I have thousands of multilingual pages. If Google even slightly altered its algorithm for languages, I think I would know it. I have seen no change recently.
I have to say I have not noticed any changes either, but I have not had the specific case wendy is talking about.
I will do my test anyway. I have 4 good pages that each mix English and another (one) language, with the ratio currently being around 60% - 40% in favour of English, even though the page was intended for the local language. I can test French, Italian, German and Hungarian.
The test will be on how much the ranking changes on local Google domains for the language-specific key phrase(s) after the 60% of English content has been removed from the page. I will not touch other elements of the page, I will only remove English content.
The domain in question is a .com domain, no geo targeting set.
My impression was that Google has never used the Content-Language information, except maybe as a vague hint when trying to break a tie. As far as I know, they have always (or for a very long time) relied on the actual contents of the page to determine the language of a page. And I can tell you that it is probably a very good choice: the Content-Language data is so horribly wrong in so many cases that it's very often completely useless. If it's present at all, of course...
Whether they pick a single language for the whole page only or do it on a more granular level (per div, p, sentence...), and whether that changed, I have no idea, though.
|I have thousands of multilingual pages. If Google even slightly altered its algorithm for languages, I think I would know it. I have seen no change recently. |
If your website hasn't been affected, then that's great. Mine, and 2 other websites (competitors) are showing the same issue.
|What I am going to do is record the exact ranking position of these pages, then remove the English content and leave the local language content only. This will make these pages smaller, but they will still have around 300 words of unique content on the page. I will see if the ranking improves the next time Google caches these pages. |
Well, my test has finished and I can report that both pages (one in Italian and one in German) have moved up significantly in the SERPs.
Basically, the German page had about 60% of its content in English. After removing the English content, the page still had 250 words in German remaining. After Google had cached it, the page jumped 2/3 up in ranking for a selected phrase. A similarly significant jump was observed for the Italian page.
Thanks for sharing the results of your test aakk9999.
For the pages I've "fixed" I am also seeing improvements.