|How does Google treat foreign languages in its Algo?|
I have a few questions about foreign-language SEO.
How does Google treat foreign language in its algorithms?
How does Google treat Unicode Languages in its Algo?
How does it treat Latin language compared to Unicode language? ie: Arabic Chinese
Do algo updates such as Panda affect all languages at once, or just Latin languages, or just English?
How does Googlebot differentiate between those different languages?
How come you do not hear about a Turkish, Arabic or Chinese Matt Cutts trying to tackle spam in their respective Language?
How does Google treat a URL in its Algo that has content Both in English and another language?
From my experience, especially with Unicode languages, it seems Google is still running its 2003 algo. Or is Google just showing the best of the worst in its search results?
Ooh, my favorite question ;) Or, ahem, eight questions.
My own site includes material in a language written in a non-Roman script which google does not know. That is, the script is unique to the language, so it isn't a case of seeing Urdu and mistaking it for Arabic. Here's what I can say from direct personal experience. It's pretty basic, but it's a start.
google's keyword list includes words in languages google does not know-- but only as exact matches. If you have variant forms like "cats, catty, cattish, cattery, cat's" etc. each one will be listed separately.* If you have the identical word in Roman and non-Roman script, those too will be separate. I don't know how this works if google does know the language.
google search will similarly only bring up exact matches. I don't simply mean that it won't offer synonyms. I mean that it won't offer fragments: if you search for the equivalent of cdefg, search will not include results for abcdefg or cdefghijkl. This is a serious problem if the language in question is inflected and/or compounding and/or agglutinative, so you never do get exact matches except for a handful of isolated common words.
I don't know how keyword lists work in non-Roman languages that google does know. I do know that Search can be excellent. I can remember when it became possible to search in polytonic Greek, because it was a night-and-day change from useless to really good.
* Granted, English isn't 100% either. I've got strings like "present, presented, presence, presents" or "states, state, stated, stately". But they're pretty darn close.
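The difference between exact matching and fragment matching described above can be illustrated with a small Python sketch. This is only a toy model under stated assumptions: the function names (`build_exact_index`, `exact_search`, `substring_search`) and the three sample "documents" are hypothetical, and nothing here claims to reflect how Google's index is actually built.

```python
from collections import defaultdict

def build_exact_index(docs):
    """Map each whitespace-separated token to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def exact_search(index, query):
    """Return only documents that contain the query as a whole token."""
    return sorted(index.get(query.lower(), set()))

def substring_search(docs, query):
    """Return documents that contain the query anywhere, even inside a longer word."""
    q = query.lower()
    return sorted(doc_id for doc_id, text in docs.items() if q in text.lower())

# Hypothetical documents, reusing the placeholder strings from the post above.
docs = {1: "abcdefg", 2: "cdefghijkl", 3: "cdefg"}
index = build_exact_index(docs)

exact_hits = exact_search(index, "cdefg")          # whole-token matches only
substring_hits = substring_search(docs, "cdefg")   # fragment matches as well
```

With an exact-token index, inflected or compounded forms of a word never match the bare form, which is why the behavior hurts agglutinative languages so badly.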
Very interesting findings you have. I'm looking more at how Google ranks websites in Unicode scripts in its search engine. For example, there is a website ranking no. 1 for a competitive keyword. I checked its backlinks and realized that all of them are sitewide links using that exact keyword as anchor text. This used to work on Google in 2003. Now, if anyone does this in English search, they will be banned before you can say "Google penalty".
I think it's a terrific food-for-thought question. How does g### apply its algorithm when it can't apply it to anything? Are we in a historical backwater where searches are still where they were in 2003 or even 1993?
If anyone hereabout routinely searches in Arabic or Japanese or some other language that g### might reasonably be expected to know, do you see a difference in behavior between the other language and English? (When the OP said "unicode languages" I assumed you meant languages in a completely non-Roman script, not just languages with less common diacritics.) My only meaningful experience is with Yandex image search, which can only tell me that they're translating their stuff more-or-less accurately.
Secondary question, with no thought required but I'm baffled. Why does the "Active Posts" list persist in saying there are seven messages in this thread when in fact there are only three? The Google SEO forum listing has it right.
I re-checked after posting this, and it's now correctly saying four. Curiouser and curiouser.
|Why does the "Active Posts" list persist in saying there are seven messages in this thread when in fact there are only three? The Google SEO forum listing has it right. I re-checked after posting this, and it's now correctly saying four. Curiouser and curiouser. |
lucy24 - You saw a discrepancy because, late last night, souffle's last message got posted five times, and I removed the four extras. I leave the rest to your imagination. ;)
<OT>lucy24.. many years ago we used to put such things down to Seafaring Nordic Kittens playing with the strings in the purl[sic]in the Gubbinsess..;)</OT>
And yes ..IME Google does deliver slightly less "esoteric" answers in "Non English" ( French at least )..less of the "we think that you really meant to search for"..thus IME, it is easier to predict where one is going to rank for a query ..
souffle (or should it be soufflé?)... to touch on several of your questions, and probably do only a very incomplete job of it....
|How does Google treat Unicode Languages in its Algo? |
I remember that some while back I'd posted that Caffeine supported Unicode. The thread (from Nov 2010), which I found via site Search here, only touches on what you're asking about, but it does suggest the difficulties you may have finding information...
Matt Cutts' Answer about Special Characters: "I Don't Know"
To find more about Google and Unicode support, I tried this search (on Google)....
None of the articles returned by the search get into the nitty gritty of how Unicode works in the Google algorithm and in mixed language environments, but here are the two most recent, which do provide partial answers to some of your questions about what searches can find via Google....
Unicode nearing 50% of the web
Official Google Blog
January 28, 2010
|...Unicode is growing both in usage and in character coverage. We recently upgraded to the latest version of Unicode, version 5.2 [unicode.org] ...We're constantly improving our handling of existing characters... after extensive testing, we just recently turned on support for these and thousands of other characters; your searches will now also find these documents.... |
Unicode over 60 percent of the web
Official Google Blog
February 3, 2012
|...We’ve long used Unicode as the internal format for all the text Google searches and process: any other encoding is first converted to Unicode. Version 6.1 just released with over 110,000 characters; soon we’ll be updating to that version and to Unicode’s locale data from CLDR 21 [cldr.unicode.org].... |
As you can see, just getting the standards for the infrastructure in place is a long process. The fact that Google uses the word "internal" in the above paragraph suggests that not everything has yet been followed through within the search algorithm or interface.
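To make the "any other encoding is first converted to Unicode" step concrete, here is a minimal Python sketch. It assumes the declared encoding of a page is known and correct, and the NFC normalization step is an added assumption, not something the blog post states. The function name `to_internal_unicode` is hypothetical.

```python
import unicodedata

def to_internal_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode bytes from a legacy or Unicode encoding into one internal text form."""
    text = raw.decode(declared_encoding)
    # Assumption: normalize so that equivalent character sequences compare equal.
    return unicodedata.normalize("NFC", text)

# The same Greek word served in two different encodings.
greek_legacy = "εντελες".encode("iso-8859-7")   # one byte per letter
greek_utf8 = "εντελες".encode("utf-8")          # two bytes per letter

a = to_internal_unicode(greek_legacy, "iso-8859-7")
b = to_internal_unicode(greek_utf8, "utf-8")
same_text = (a == b)
```

Once both pages are in the same internal form, the byte-level differences disappear and the two copies of the word can be matched against the same query. What the ranking algorithm then does with that text is a separate question.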
You ask several questions about mixed languages, which is probably the most difficult area that Google encounters. Increasingly, the algorithm depends on context, and when the languages are mixed, the proper determination of context becomes many orders of magnitude more complex.
When I last reviewed it, the suggestion was to try to avoid mixing of language on a page if possible. Here's a discussion on a mixed language problem that ultimately brought together lots of issues....
Translate problem in Google SERP - not always ranking right language
If you're a Supporter, I highly recommend checking out the link at the end of the above discussion to the "SEO for multi-language sites" thread, which is in the Supporters section. It's one of the more complete discussions on the topic we've had in WebmasterWorld. Site search should lead you to some other threads as well.
The treatment of language in search is a very broad topic. Location, hosting, linking, and TLDs related to language issues are another subset of the question.
If you can focus on an immediate issue, please advise, as your question will then become much easier to answer. That is the nature of search and of language. ;)
|When I last reviewed it, the suggestion was to try to avoid mixing of language on a page if possible. |
Can't help but think you may be better off if all the secondary languages use a different script, because then even a computer can tell the difference. If the googlebot meets Greek or Sanskrit or Inuktitut (I am sorry to say that I have at least one page that includes all three) it shrugs its robotic shoulders and moves on. But if there's German or Latin, it doesn't know what to do.
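A rough Python sketch of why a different script is easy for a machine to spot, while two languages in the same script are not. It leans on the first word of each character's Unicode name as a stand-in for its script, which is a simplifying assumption, and `script_counts` is a hypothetical helper, not anything Googlebot is known to use.

```python
import unicodedata
from collections import Counter

def script_counts(text: str) -> Counter:
    """Count letters per script, using the first word of the Unicode character name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts

# English, Greek, and Inuktitut on one line: three scripts, trivially separable.
mixed_scripts = script_counts("sake λόγος ᐃᓄᒃᑎᑐᑦ")

# English and German on one line: one script, so this check sees no boundary.
same_script = script_counts("for her sake um ihretwillen")
```

Telling German from English from Latin needs statistical language identification on top of this, and that is where short embedded words can be misjudged.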
Real-life example. I think I've posted about this before, but in more of a foo-type context. I asked g### to translate a Japanese story-- that is, the English translation of a Japanese story-- into German. Results were, hm, uneven. The key point is this: at two or three places in the story, the (English) text says, in italics, sake. The German dutifully says willen in italics. We will not talk about how long it took me to figure out why it was throwing in this word in contexts where it made no sense.
It now occurs to me to wonder: The original e-book is in HTML 4 and the word is in ordinary presentational <i> tags. Now, I've got a nebulous idea that HTML 5 has a tag that means, specifically and semantically, foreign. If this tag had been used instead of plain <i>, would the word have been left untranslated?
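For what it's worth, HTML5 does not add a dedicated "foreign" tag, but it does define a global `lang` attribute and a `translate` attribute that can be placed on `<i>`. The Python sketch below shows how a tool could pick out text marked that way. The class name `ForeignSpanFinder` and the sample sentence are hypothetical, and whether any particular translation service honors these attributes is not something this thread establishes.

```python
from html.parser import HTMLParser

class ForeignSpanFinder(HTMLParser):
    """Collect text inside elements marked with lang=... or translate="no"."""

    def __init__(self):
        super().__init__()
        self._stack = []      # one bool per open element: is its text protected?
        self.protected = []   # text a translator might choose to leave alone

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        flagged = "lang" in attrs or attrs.get("translate") == "no"
        inherited = bool(self._stack and self._stack[-1])
        self._stack.append(flagged or inherited)

    def handle_endtag(self, tag):
        # Sketch only: assumes well-nested markup with no void elements.
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and self._stack[-1] and data.strip():
            self.protected.append(data.strip())

sample = '<p>He poured the <i lang="ja" translate="no">sake</i> for her sake.</p>'
finder = ForeignSpanFinder()
finder.feed(sample)
protected_words = finder.protected
```

With plain `<i>` and no attributes, nothing in the markup distinguishes the Japanese word from the English one, so a translator has only the spelling to go on.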
|IME Google does deliver slightly less "esoteric" answers in "Non English" ( French at least )..less of the "we think that you really meant to search for" |
You spoke too soon. Only minutes ago I searched for the phrase εχουσι το εντελες* for a question in another forum. Google had the unmitigated, jaw-dropping, infernal gall to suggest that possibly I meant to say εντολες with an omicron in the middle-- and there isn't even such a word.
* I suppose the Forums will eat that in a single gulp. If you haven't the energy to reconstruct from numeric entities: echousi to enteles. Or (in google's fevered imagination) entoles.
"entoles" gives Greek performer Despina Vandi..and a linux "how to" ref and a "translation", it ( entoles ) being an English rendering of a biblical Greek word "ἔντολη"* used in "Mark" and "others" to translate "mitsvah" as meaning "commandments" or "precepts"...( the Greek text at * may well get eaten too ;-)..But then I don't suppose many of us here are set up to display Greek , Hebrew or Inuktitut**..;) ..
The speelshuckah underlined a lot of that post..especially the word Inuktitut..which is actually an English word, unlike ᐃᓄᒃᑎᑐᑦ ..
Oy vey => אױ װײ
Wow, I think this post has gotten too complicated for me. Thanks everyone for shedding some light.
Not complicated ..it is that this forum ( hand rolled by Brett ) doesn't do "unicode languages"..( possibly because we are all supposed to be posting as much as is possible in English, so as to make all posts understandable by the maximum number of readers )..
But when discussing unicode languages ..one is forced to use "example" words that "glitch" the forum..Most people here are not aware of it because those of us who use them are in the minority..never affects the others..
Your questions were interesting ..but the answers are not as clear cut as when discussing English..
|How does Google treat foreign language in its algorithms? |
It appears IME that it treats each one differently..
|How does Google treat Unicode Languages in its Algo? |
It appears IME that it treats each one differently..
|How does it treat Latin language compared to Unicode language? ie: Arabic Chinese |
I have trouble with understanding your question..did you mean ..
**How does it treat Latin language compared to Unicode language? ie: Arabic / Chinese** ..
|Do algo updates such as Panda affect all languages at once, or just Latin languages, or just English? |
Algo updates even in the English language are rolled out at different times and in slightly different ways to different English speaking areas..the 1st Panda roll out was in the USA before the UK ..and both of them were before France ..and it was before Spain etc etc ..no one reported back when the 1st Panda rollout hit any unicode language countries probably because the forum is very USA centric..
|How does Googlebot differentiate between those different languages? |
Differentiate in what way ? ..you mean crawl frequency ? or crawl rate ? ..or depth ?..or ...
|How come you do not hear about a Turkish, Arabic or Chinese Matt Cutts trying to tackle spam in their respective Language? |
Maybe there are Turkish, Arabic or Chinese Matt Cuttsess <= ( what is the plural of Matt Cutts ? *** ) but we wouldn't hear about them ( if there were ) here due to the English language being the official forum language thing..
I do know that there is not an equivalent spokesperson for Google's spam team for the French language..I've never heard of there being any for Chinese or Arabic either..
|How does Google treat a URL in its Algo that has content Both in English and another language? |
*** what would the collective noun be ? ..It can't be "cutlets"..<= that was taken a long time ago ;)..
Thanks Leosghost and everyone. I just wanted to understand the basics, namely that each language is treated differently. Now all I need is to find a group who own websites in my language and compare notes. I'm sure, though, that Panda has not rolled out to my language yet, but I think it's coming soon.
If I want to attract more customers from say France, does anyone know whether it's better for SEO to incorporate my website in French as part of my .co.uk site (typically via a French flag link on the home page), or host it in French of course on a France server and .fr domain ? Assume both sites are SE identical.
|host it in French of course on a France server and .fr domain |
Definitely this ..and incoming links from French sites , in French, hosted in France are better than incoming links in French, from French sites hosted outside France..
Thanks for that Leo :-)