Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 42 message thread spans 2 pages (page 2 shown).
Google ignores all code-level language information
phranque




msg:4510666
 6:26 am on Oct 22, 2012 (gmt 0)

i mentioned during a presentation at pubcon last week that google ignores language specification in html code and was approached several times afterwards for clarification.
i was surprised this was news, especially since some of those who asked were very familiar with multilingual sites.

so just to get this out there for discussion, from the Official Google Webmaster Central Blog - Working with multilingual websites:
http://googlewebmastercentral.blogspot.com/2010/03/working-with-multilingual-websites.html [googlewebmastercentral.blogspot.com]

Keep in mind that Google ignores all code-level language information, from “lang” attributes to Document Type Definitions (DTD). Some web editing programs create these attributes automatically, and therefore they aren’t very reliable when trying to determine the language of a webpage.


and from Webmaster Tools Help - Multi-regional and multilingual sites:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=182192 [support.google.com]
Make sure the page language is obvious
Google uses only the visible content of your page to determine its language. We don’t use any code-level language information such as lang attributes. You can help Google determine the language correctly by using a single language for content and navigation on each page, and by avoiding side-by-side translations. Translating only the boilerplate text of your pages while keeping the bulk of your content in a single language (as often happens on pages featuring user-generated content) can create a bad user experience if the same content appears multiple times in search results with various boilerplate languages.



this tells me google isn't that great at language and if not even google can "get it" it's a universal problem, so i would still recommend properly specifying language for all content.


just to be clear, "code-level" language information is distinct from "link-level" language information, which is the proprietary "link rel alternate hreflang" attribute google began supporting last year.

Official Google Webmaster Central Blog: New markup for multilingual content:
http://googlewebmastercentral.blogspot.com/2011/12/new-markup-for-multilingual-content.html [googlewebmastercentral.blogspot.com]

rel="alternate" hreflang="x" - Webmaster Tools Help:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=189077 [support.google.com]
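
For readers who have not seen the link-level markup, here is a minimal Python sketch that prints rel="alternate" hreflang tags for a set of translated pages. The function name hreflang_links and the example.com URLs are hypothetical and used only for illustration; the tag shape follows the Google posts linked above.

```python
from html import escape


def hreflang_links(alternates):
    """Build one <link rel="alternate" hreflang="..."> tag per language.

    `alternates` maps a language code to the URL of that translation.
    """
    tags = []
    for code, url in sorted(alternates.items()):
        tags.append(
            '<link rel="alternate" hreflang="{}" href="{}" />'.format(
                escape(code, quote=True), escape(url, quote=True)
            )
        )
    return tags


# Hypothetical URLs, for illustration only.
pages = {
    "en": "http://example.com/en/",
    "fr": "http://example.com/fr/",
}

for tag in hreflang_links(pages):
    print(tag)
```

Each language version of the page would carry the full set of tags in its head, so that every translation points at all the others.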

 

Maurice




msg:4514542
 9:12 am on Nov 1, 2012 (gmt 0)

@lucy24

For Arabic and Chinese sites you start having to get deep into character encodings. Quite surprised my rank-checking tool worked pretty much first time for Chinese queries.

lucy24




msg:4514651
 1:21 pm on Nov 1, 2012 (gmt 0)

I didn't understand how the parts of this line fit together:
the various language encodings (and this is a very complex subject even for UTF8 let alone UTF16 and Non latin languages)

Are you talking about rendering (an action that happens within the browser, text editor or equivalent) as distinct from file encoding (a means of storing data)? I thought each Chinese character was a typographic island. A far cry from Semitic languages where you have both position-based variant forms, and diacritics combined on the fly. I don't have to deal with this much, as most scripts I use are precombined-- same as European languages. But I do remember when my browser learned how to write Devanagari. The mac still can't do Bengali though; maybe it's in some later cat. (I'm in 10.6 and really don't want to change.)
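
To make the storage-versus-rendering distinction concrete, here is a small Python sketch that looks only at the storage side: how many code points and bytes a piece of text occupies in UTF-8 and UTF-16. The helper name byte_lengths and the sample strings are chosen for illustration; nothing here says anything about how a browser or font draws the text.

```python
def byte_lengths(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without a BOM)."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
    )


samples = {
    "precomposed Latin e-acute": "\u00e9",
    "one Chinese character": "\u4e2d",
    "Devanagari ka + vowel sign i": "\u0915\u093f",
}

for label, text in samples.items():
    print(label, byte_lengths(text))
```

The file encoding only decides how those code points are written to disk. Assembling them into the right shapes on screen is a separate job for the rendering layer.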

Google seems to be adding one language at a time. I can remember when searching in polytonic Greek took a flying leap upward and suddenly became very good. And they must be able to do Arabic and Japanese, because I've met both in logs.* But it is still impossible to search in any language that uses Devanagari or UCAS. I can understand about UCAS because it's such a small linguistic community. But Devanagari is used by a huge chunk of the world's population, and I've never seen any evidence of an Indian search engine picking up the slack. In the meantime you'd think g### would at least ask someone if their present defaults are really the best approach.


* My log wrangling includes a few lines to decode percent-encoded scripts. Turns out you have to use different functions for ASCII and non-ASCII.
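
As a rough illustration of that kind of log wrangling, here is a minimal Python sketch that decodes a percent-encoded search term. It assumes the query bytes are UTF-8, which is common but not guaranteed in real logs; the function name decode_query is hypothetical. In Python's standard library a single call covers both ASCII and non-ASCII input, so the two-function split described above belongs to whatever toolchain was being used there.

```python
from urllib.parse import unquote_plus


def decode_query(raw, encoding="utf-8"):
    """Decode a percent-encoded query fragment taken from a log line.

    errors="replace" substitutes U+FFFD for bytes that are not valid in
    `encoding`, so one malformed entry does not stop a whole log pass.
    """
    return unquote_plus(raw, encoding=encoding, errors="replace")


# ASCII-only and Devanagari examples.
print(decode_query("multi%20lingual+sites"))
print(decode_query("%E0%A4%AD%E0%A4%BE%E0%A4%B0%E0%A4%A4"))
```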

graeme_p




msg:4515550
 7:00 am on Nov 4, 2012 (gmt 0)

Is there any particular difficulty with Devanagari? Given what they do support, an omission like this must have a reason.

Also Google India offers a Hindi version. Are you saying it does not work well?

lucy24




msg:4515578
 9:23 am on Nov 4, 2012 (gmt 0)

Background: (graeme, you can skip this paragraph because I know that you already know it) Devanagari isn't precombined. So the application has to know that /e/ and /ai/, /u/ and /û/ go above or below the preceding letter-- which is not always the same width-- and that other vowel marks go inline but short /i/ goes before (that is, to the left) even though it's typed after. And more important it has to know how to assemble consonant clusters: -kt- is not simply -k- plus -t-, while combinations with -r- use a different letterform entirely. So it isn't enough just to have the appropriate font installed, as it would be with your average alphabetic script.
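
The point about typing order versus display order can be seen at the code-point level. This is a minimal Python sketch using the standard unicodedata module; the helper name describe is made up for illustration. It shows only what is stored, and assumes nothing about any particular font or shaping engine.

```python
import unicodedata


def describe(text):
    """List the Unicode name of each code point, in stored (typed) order."""
    return [unicodedata.name(ch) for ch in text]


# ka followed by short i: stored consonant-first, drawn with the i sign on the left.
ki = "\u0915\u093f"
# ka + virama + ta: three code points that a shaping engine draws as one -kt- conjunct.
kta = "\u0915\u094d\u0924"

print(describe(ki))
print(describe(kta))

# Normalizing does not merge them: there is no precombined code point for this syllable.
print(len(unicodedata.normalize("NFC", ki)))
```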

I have no idea why my computer can do Devanagari but not Bengali, since they work in almost exactly the same way and there is surely no shortage of Bengali computer geeks ;)

I detoured to google dot co dot in, picked Hindi and asked it to search for /bhârat/ and /bhârata/, and then /mahâ/ for good measure. (I have no imagination. Pretend that you don't notice the circumflexes standing in for macrons.) The first search was unnerving because #10 result was an English-language wikipedia article on "barat" (sic) which of course is not even the same word.

Both result sets strongly suggest that Hindi search is handled the same way as Sanskrit search; google presumably can't tell the difference. Search matches only exact text. This is not a huge disaster in Hindi because it's not a massively inflected language, but it makes Sanskrit effectively useless since the language is both inflected and compounding. You get the same problem with UCAS; in that case it's both inflected and polysynthetic.

We complain a lot about having to jump through hoops to get g### to understand that when we say "bare" we don't mean "bares" or "bared" or "barely" -- but what you have here is the equivalent of searching for information on "widgets" and missing half the relevant sites because they always say "widget", singular.

Obviously google's computer hasn't got around to learning Hindi inflectional endings. But it seems like they could at least offer the searcher a choice between "exact match" and "fragment".
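
The choice between "exact match" and "fragment" could look something like the following toy Python sketch. It is not a description of how any search engine works; the function name search, the mode names, and the sample documents are all invented for illustration, and it uses English words because the logic is the same in any script.

```python
def search(query, documents, mode="exact"):
    """Return the documents that match `query`.

    mode="exact":    some whitespace-separated token equals the query.
    mode="fragment": the query appears anywhere inside some token.
    """
    if mode not in ("exact", "fragment"):
        raise ValueError("mode must be 'exact' or 'fragment'")
    needle = query.lower()
    hits = []
    for doc in documents:
        tokens = doc.lower().split()
        if mode == "exact":
            matched = needle in tokens
        else:
            matched = any(needle in token for token in tokens)
        if matched:
            hits.append(doc)
    return hits


docs = ["blue widget for sale", "widgets and gadgets", "nothing relevant here"]
print(search("widget", docs, mode="exact"))
print(search("widget", docs, mode="fragment"))
```

Fragment matching on a stem is crude compared with real knowledge of inflectional endings, but it would at least stop the singular and plural from being treated as unrelated words.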

Oh, and I also got to see Google In Your Language in action. I already knew that the whole scheme was horrendously under-researched-- I mean, this is google we're talking about. I'd have a hard time believing they even consulted a linguist at all, because the whole structure is based on the assumption that all languages work exactly like English. One of the search options is "Search near..." followed by a box where you type a place name. Problem is, Hindi does this construction the other way around, so you get "... ke paas {something-or-other}" ... and then the box comes after, as in English, when it should be before.

:: looking around uneasily for approaching Moderator with scissors and box of stick-on labels ::

phranque




msg:4515579
 9:38 am on Nov 4, 2012 (gmt 0)

looking around uneasily for approaching Moderator with scissors and box of stick-on labels

actually i started out thinking "there she goes!"
=8)

but you immediately got into a better explanation than i ever could of the hubris and ethnocentrism evident in google's stance with language specification in html code.

i think google should respect the webmaster's specification until there is a "strong enough" signal to consider ignoring it.
google could easily do this at the document and/or hostname level.
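
As a sketch of what "respect the specification until there is a strong enough signal" might mean in practice, here is a small Python decision function. Everything in it is hypothetical: the name resolve_language, the confidence score, and the 0.9 threshold are assumptions made for illustration, not anything google has described.

```python
def resolve_language(declared, detected, confidence, threshold=0.9):
    """Pick a language for a document.

    declared   -- language given in the markup, or None if absent
    detected   -- language guessed from the visible text, or None
    confidence -- how sure the detector is, from 0.0 to 1.0
    threshold  -- hypothetical cut-off for overriding the declaration
    """
    if declared is None:
        return detected
    if detected is not None and detected != declared and confidence >= threshold:
        # The text strongly contradicts the markup, so the markup loses.
        return detected
    return declared


print(resolve_language("fr", "en", 0.95))
print(resolve_language("fr", "en", 0.60))
```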

to put this in scale and context, do you have any numbers of native language speakers in some of the languages affected by this?

phranque




msg:4515580
 9:48 am on Nov 4, 2012 (gmt 0)

how i'd like to respond to google's weak excuses for this policy:
You can help Google determine the language correctly by using a single language for content and navigation on each page, and by avoiding side-by-side translations.

once we get better at this language stuff, you could help even more so by properly specifying the primary language for the document and for any elements containing an alternate language.

Translating only the boilerplate text of your pages while keeping the bulk of your content in a single language (as often happens on pages featuring user-generated content) can create a bad user experience if the same content appears multiple times in search results with various boilerplate languages.

there are plenty of people who can get directions in english but want to speak french once they get there.
actually the bad user experience is google's inability to recognize languages.
you would think that an element in a document with an alternate language specification would be a pretty strong signal, but we'll discover that when we get better at this language stuff.

TheMadScientist




msg:4515630
 5:43 pm on Nov 4, 2012 (gmt 0)

Has anyone else thought about this from a search engine POV?

You have a little 'feedback' link on your pages and people start clicking it telling you there are pages displaying for the wrong languages...

So, you check your system and make sure you're handling the variables correctly and lang="FR" is really being treated as lang="FR" and not lang="EN". After you dig through everything and make sure you have everything right on your end you start looking at the source code of the pages only to find the ones not displaying correctly are actually miscoded.

Now you need to code a solution for your end, because the only other plausible alternative is to code a solution and then contact the webmaster to let them know their page is miscoded then Hope they change it. Either way, you have to use the actual text of the page to find the language, because if you don't you won't know which pages are coded correctly and which are not and there are too many pages to review by hand.
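
To illustrate the scenario, here is a deliberately tiny Python sketch that reads the declared lang attribute and then guesses the language from the visible text by counting stopwords. The stopword lists, the class name PageText, and the function check_page are all hypothetical, and a real detector would use far richer statistics; the point is only that the text has to be examined to find out whether the declaration can be trusted.

```python
from html.parser import HTMLParser

# Tiny hypothetical stopword lists, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "les", "des"},
}


class PageText(HTMLParser):
    """Collect the declared <html lang> value and the visible words."""

    def __init__(self):
        super().__init__()
        self.declared = None
        self.words = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.declared = dict(attrs).get("lang")
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.lower().split())


def check_page(html):
    """Return (declared language, language guessed from the text)."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    scores = {
        lang: sum(word in stop for word in parser.words)
        for lang, stop in STOPWORDS.items()
    }
    guessed = max(scores, key=scores.get) if any(scores.values()) else None
    return parser.declared, guessed


# A miscoded page: declared French, written in English.
page = '<html lang="FR"><body><p>The cat is in the garden and the dog is not.</p></body></html>'
print(check_page(page))
```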

All the suggestions and 'oh, they should do it this way' ideas do is add a layer of processing that would have to be developed and tested prior to implementation, then maintained for the duration of your search engine, and to what end? You STILL have to go by the language on the page rather than the declared language to make sure you get it right, so the bottom line of all the suggestions is: More work for absolutely nothing other than making a few webmasters happy their coding is being used.

It's not practical from a business or even 'search advancement' perspective to bother with it, because the time it would take to code, implement and maintain could be much more well spent...

aakk9999




msg:4515744
 3:39 am on Nov 5, 2012 (gmt 0)

You STILL have to go by the language on the page rather than the declared language...

It is not always so simple. I had a few cases where the page was completely in English (written by a native English speaker) AND also used lang="en", but Google decided the page was in the local language of the country where the domain was hosted. There was not one word of the local language on the page.

Google takes other signals into account and often ignores the language on the page itself, and it does get it wrong. I think this is less of a problem for domains hosted in English-speaking countries and owned by locals, but it can be a big problem where a site in a non-English-speaking country is run in English but hosted and owned locally.

In cases like this the entry in the English SERPs (google.com or google.co.uk) gets a "Translate this page" link, which, if clicked, makes Google translate the page from English to English. The CTR also suffers when "Translate this page" shows unnecessarily.

TheMadScientist




msg:4515785
 7:25 am on Nov 5, 2012 (gmt 0)

Okay, can we just go with the stinking point? When the language code is wrong in a larger number of cases than you get it wrong by coding an alternative solution, using your solution alone is the better decision. Applying your solution and then going back to decide whether or not to use the original language code (the one you had to ignore in the first place so you could apply your solution) just adds unnecessary steps to the process. OR do we really need to split hairs over what precisely they use just so people can get the stinking point:

It's impractical and silly to have to create an alternative then go back and decide whether to use the original or not, when you can just as reliably use the alternative you were forced to create due to misuse of the original.

Did I say they always get it right? No, but they say the reason they did it is because the language code is often incorrect ... Do you really think they went and coded a way to try and determine what language a page is written in and then told a story about why and that they continue to use it if they get it wrong more than they would by using what's coded on the page just for the extra work? Come on, seriously...

It makes them look bad when they get it wrong, so don't you think they would use the most reliable solution they have? I certainly do.

Why on earth do we have to qualify every single statement here?

Next time I'll try to remember to add (or some other alternative) when I say 'use the language on the page', just for the hair splitters who can't seem to 'get the point' without a perfect statement. My bad for not qualifying my statement with the utmost accuracy the first time ... I didn't think the point I was making was that difficult to get, even if they don't ever use the language on the page to make the determination, but in hindsight, obviously it was...

aakk9999




msg:4515967
 5:54 pm on Nov 5, 2012 (gmt 0)

@theMadScientist
I am sorry, I did not want to upset you and I think you perhaps misunderstood what I said.

I was not saying that Google should go back and use lang="en" either.

What I was saying is that whilst Google does not use HTML language attributes, it does not always go by the language it sees on the page either, which seems to affect mostly English-language sites hosted in a non-English-speaking country and owned by a local company.

It appears that whilst Google initially gets the language correct (just as you said, recognising the language the page is written in), it must also be folding in some local signals which over time tip Google's understanding of the language (local traffic with browser language set to the local language, perhaps?).

So at some point Google suddenly decides the language the page is written in is not English, but on the other hand, cutting and pasting the whole text from the page into Google Translate makes Google recognise the language correctly as English.

TheMadScientist




msg:4516043
 8:34 pm on Nov 5, 2012 (gmt 0)

Sorry I misunderstood ... I guess it just seems like there's so much hair splitting here these days you have to be absolutely perfect in your posts and qualify every single statement to the point of making it a bit frustrating to bother with even trying to post anything.

I didn't get what you said in your second post out of your first one, so my apologies for missing it the first time and thinking you were trying to say the point I was making wasn't valid since they don't necessarily use the language on the page to make a determination.

Thanks for clarifying, I appreciate it!

lucy24




msg:4522616
 5:27 pm on Nov 26, 2012 (gmt 0)

:: bump ::

I just remembered something, and had to go back to check.

g### says explicitly that it doesn't use the "lang = blahblah" information.

Meanwhile, in that other search engine's webmaster tools, the SEO Reports section under Reports & Data (not to be confused with the SEO Analyzer under Diagnostics & Tools) offers a list of errors including, quote,
The page is missing meta language information.

Follow the link and you'll get to an explanatory page that says among other things
The Meta Language information is used as a hint to help us understand the intended language and country/region the page content applies to. This can help if your site is not hosted in the country/region. Use the “content-language” meta tag to embed the culture code in the <head> section of your page. For example, <meta http-equiv="content-language" content="en-gb"> indicates that the page is in English and intended for the United Kingdom. Alternatively, you can use <html lang="en-gb"> or <title lang="en-gb">.

(Ignore the "title" option, which strikes me as demented. You can also argue about "intended for".)
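
For anyone who wants to audit their own pages for these hints, here is a minimal Python sketch that pulls out the html lang attribute and the content-language meta tag using the standard html.parser module. The class and function names are invented for illustration, and it only reports what the markup declares; it says nothing about how any search engine weighs it.

```python
from html.parser import HTMLParser


class LanguageHints(HTMLParser):
    """Collect code-level language declarations from a page."""

    def __init__(self):
        super().__init__()
        self.hints = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html" and attributes.get("lang"):
            self.hints["html lang"] = attributes["lang"]
        elif tag == "meta":
            name = (attributes.get("http-equiv") or "").lower()
            if name == "content-language":
                self.hints["meta content-language"] = attributes.get("content") or ""


def language_hints(html):
    """Return a dict of the language declarations found in `html`."""
    parser = LanguageHints()
    parser.feed(html)
    parser.close()
    return parser.hints


sample = (
    '<html lang="en-gb"><head>'
    '<meta http-equiv="content-language" content="en-gb">'
    "</head><body></body></html>"
)
print(language_hints(sample))
```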

Well, at least g### doesn't expressly order us to omit language information from the <head>. Leave it in, and you'll make at least one search engine happy.

Sigh.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved