Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 42 message thread spans 2 pages (page 1 of 2)
Google ignores all code-level language information
phranque




msg:4510666
 6:26 am on Oct 22, 2012 (gmt 0)

i mentioned during a presentation at pubcon last week that google ignores language specification in html code and was approached several times afterwards for clarification.
i was surprised this was news, especially since some of those who asked were very familiar with multilingual sites.

so just to get this out there for discussion, from the Official Google Webmaster Central Blog - Working with multilingual websites:
http://googlewebmastercentral.blogspot.com/2010/03/working-with-multilingual-websites.html [googlewebmastercentral.blogspot.com]

Keep in mind that Google ignores all code-level language information, from “lang” attributes to Document Type Definitions (DTD). Some web editing programs create these attributes automatically, and therefore they aren’t very reliable when trying to determine the language of a webpage.


and from Webmaster Tools Help - Multi-regional and multilingual sites:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=182192 [support.google.com]
Make sure the page language is obvious
Google uses only the visible content of your page to determine its language. We don’t use any code-level language information such as lang attributes. You can help Google determine the language correctly by using a single language for content and navigation on each page, and by avoiding side-by-side translations. Translating only the boilerplate text of your pages while keeping the bulk of your content in a single language (as often happens on pages featuring user-generated content) can create a bad user experience if the same content appears multiple times in search results with various boilerplate languages.
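To illustrate what "uses only the visible content" could mean in practice, here is a minimal Python sketch. It is not Google's method: the tiny stopword lists and the names `VisibleTextExtractor`, `visible_text` and `guess_language` are assumptions made up for this example. It strips the markup (so any lang attribute is never looked at) and guesses the language from the words a reader would actually see.

```python
import re
from html.parser import HTMLParser

# Toy stopword lists, assumed for illustration only; a real detector
# would use far larger statistical models.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "this", "has", "from"},
    "it": {"la", "di", "che", "questa", "ha", "un", "dall", "sempre"},
}


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes and ignores all attributes, including lang."""

    # Elements whose text is not part of the visible page body.
    HIDDEN = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a reader would see, with markup removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def guess_language(html):
    """Return the best-scoring language code, or None if nothing matches."""
    words = re.findall(r"[^\W\d_]+", visible_text(html).lower())
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With a detector of this kind, a page declared as `lang="en"` but written in Italian is classified by its words, not its markup, which is the behaviour the help page describes.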



this tells me google isn't that great at language and if not even google can "get it" it's a universal problem, so i would still recommend properly specifying language for all content.


just to be clear, "code-level" language information is distinct from "link-level" language information, which is the proprietary "link rel alternate hreflang" attribute google began supporting last year.

Official Google Webmaster Central Blog: New markup for multilingual content:
http://googlewebmastercentral.blogspot.com/2011/12/new-markup-for-multilingual-content.html [googlewebmastercentral.blogspot.com]

rel="alternate" hreflang="x" - Webmaster Tools Help:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=189077 [support.google.com]

 

lucy24




msg:4510693
 8:32 am on Oct 22, 2012 (gmt 0)

Some web editing programs create these attributes automatically, and therefore they aren’t very reliable when trying to determine the language of a webpage.

Yeah: they put in <lang = "en">. So if it says <lang = "something else"> shouldn't that be taken as a pretty strong indicator that the page is in some other language?

All the more so when you've got <lang> tags around small discrete sections of the content. I've grumbled elsewhere about g###'s translation of the single line "grazie a tutti" into Italian-- happily ignoring the <lang="it"> tag and therefore making, let us not put too fine a point upon it, utter fools of themselves.

Not long ago, I found a log entry telling me that google had attempted to translate a particular page into Italian. Problem is, the page in question is already in Italian.

Sample. I assure you I am not making this up.

"Original English [sic] Text":*
Questa pagina ha sempre avuto un insolito numero di visitatori provenienti dall'Italia.

Google translation:
This Pagina Semper ha avuto delle Nazioni Unite insolito Numero di Visitatori provenienti DALL'ITALIA.

Clearly google's definition of "obvious" is different from yours and mine.


* Created by a multi-stage process: Run the English text through :: cough-cough :: Google translate. Go over it myself and fix the blatant errors. Find a kind Italian to fix my fixes. (Being several thousand miles away, I could not hear her laughing hysterically.) Check some obscure technical terms and run it past the Italian again.

phranque




msg:4510715
 10:12 am on Oct 22, 2012 (gmt 0)

Yeah: they put in <lang = "en">. So if it says <lang = "something else"> shouldn't that be taken as a pretty strong indicator that the page is in some other language?

i think joomla defaults to lang="en-uk".

Tropical Island




msg:4510794
 1:11 pm on Oct 22, 2012 (gmt 0)

It's a real problem for us using Spanish.

We have a page aimed at our national market with prices in the local currency.

We have another almost identical page but with prices in US$ & Euros. I don't want the national market going to the US$ page as we have currency controls & it could cause legal problems.

In the same way I don't want my customers from outside our country seeing prices in the local currency because they are meaningless to them due to different official & black market exchange rates.

I have marked the HTML as "Español country" on the national page & "Español Spain" on the other.

TheMadScientist




msg:4510817
 1:56 pm on Oct 22, 2012 (gmt 0)

just to be clear, "code-level" language information is distinct from "link-level" language information, which is the proprietary "link rel alternate hreflang" attribute google began supporting last year.

Good to note the separation, but even though support of it may be limited to Google* it's not a proprietary attribute: (Standard in HTML 4.01 & HTML 5 according to w3schools.com HTML hreflang Attribute [w3schools.com])

12.3 Document relationships: the LINK element

<!ATTLIST LINK
%attrs; -- %coreattrs, %i18n, %events --
charset %Charset; #IMPLIED -- char encoding of linked resource --
href %URI; #IMPLIED -- URI for linked resource --
hreflang %LanguageCode; #IMPLIED -- language code --
type %ContentType; #IMPLIED -- advisory content type --
rel %LinkTypes; #IMPLIED -- forward link types --
rev %LinkTypes; #IMPLIED -- reverse link types --
media %MediaDesc; #IMPLIED -- for rendering on these media --
>

Links in HTML Documents [w3.org]


* I should note: I don't know if it's recognized by Google only, because I don't deal with multilingual sites unless I have to, but I remembered seeing it in the HTML docs when I read through the link section and thought I should point out it's not some new 'Google only thing' they decided to invent on us.

graeme_p




msg:4510828
 2:21 pm on Oct 22, 2012 (gmt 0)

Do they also ignore the content-language http header?

How much proprietary metadata does Google think they can reasonably expect? Why can they not use existing standards plus analysis of content?

It looks to me that they are so busy not being a search engine, they cannot do their core business very well.

I hope someone is going to come along with something better (and Bing is not it either).

phranque




msg:4510875
 4:08 pm on Oct 22, 2012 (gmt 0)

even though support of it may be limited to Google* it's not a proprietary attribute

thanks for pointing that out - i was inaccurate in that characterization.

for those interested, here's the W3C reference specific to that link element implementation.
Links and search engines:
http://www.w3.org/TR/html401/struct/links.html#h-12.3.3

Do they also ignore the content-language http header?

they did not mention header-level language specification, only code-level Language information:
http://www.w3.org/TR/html401/struct/dirlang.html [w3.org]

this means the discussion really only applies to the lang attribute.

even the "link rel alternate hreflang" element can also have its own lang attribute to specify the language of its title attribute, for example.
ironic that google would ignore that lang attribute here while being attentive to the hreflang attribute.


these all have different meanings.
for example, the HTTP Content-Language header can have multiple language values specified.

lucy24




msg:4510907
 5:20 pm on Oct 22, 2012 (gmt 0)

In any case, what's the use of putting information in a <link>? The information should be on the page itself.

I don't think g### distinguishes between "en" and "en-uk". For that matter, I've never met a site that came in bilingual versions. It's English either way.

:: insert obligatory wisecrack about translation into or out of en-au ::

phranque




msg:4511064
 1:39 am on Oct 23, 2012 (gmt 0)

what's the use of putting information in a <link>?

do you mean the title attribute on the link?

http://www.w3.org/TR/html401/struct/global.html#title
Values of the title attribute may be rendered by user agents in a variety of ways. For instance, visual browsers frequently display the title as a "tool tip" (a short message that appears when the pointing device pauses over an object). Audio user agents may speak the title information in a similar context. For example, setting the attribute on a link allows user agents (visual and non-visual) to tell users about the nature of the linked resource

lucy24




msg:4511100
 4:05 am on Oct 23, 2012 (gmt 0)

But that's only meaningful if you followed the link in the first place. Sure it can be useful to be told in advance "Oh, by the way, the page I'm recommending is in Turkish", on the off chance that the link text itself doesn't give a clue. ("This page in Italian" or "My favorite smileys site defaults to German though you can change it if you're a wimp", say.)

But it's far more important to get information from the page in isolation, the way you'd get it if you typed the address manually or used a bookmark or, heck, came in from a search engine.

Now, if you wanted to mess with a competitor by including some perfectly legitimate links to their site, and coding it to say the page is in some obscure language so it will come up in the wrong searches or the page title comes out meaning something vulgar ...

graeme_p




msg:4511136
 6:09 am on Oct 23, 2012 (gmt 0)

I misunderstood this initially.

Lucy, competitors cannot really do that because it must be used on all pages (presumably it gets ignored if not). So if example.com/en has <link rel="alternate" hreflang="es" href="http://example.com/es" /> then example.com/es must have <link rel="alternate" hreflang="en" href="http://example.com/en" />

On the other hand, the scope for webmasters to mess this up seems quite large.....
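A rough way to catch that kind of mistake is to check that every hreflang annotation is returned by the page it points to. The Python sketch below assumes you already have the HTML of each page in a dict keyed by URL; the names `HreflangCollector`, `collect_alternates` and `missing_return_links` are made up for this example, and the "must link back" rule is taken from the description above, not from any published validator.

```python
from html.parser import HTMLParser


class HreflangCollector(HTMLParser):
    """Collects <link rel="alternate" hreflang="..." href="..."> entries."""

    def __init__(self):
        super().__init__()
        self.alternates = {}  # hreflang value -> href

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "alternate" in rels and attr.get("hreflang") and attr.get("href"):
            self.alternates[attr["hreflang"].lower()] = attr["href"]


def collect_alternates(html):
    """Return {hreflang: href} for one page's HTML."""
    collector = HreflangCollector()
    collector.feed(html)
    collector.close()
    return collector.alternates


def missing_return_links(pages):
    """pages maps URL -> HTML. Return (source, target) pairs where the
    target page is known but does not link back to the source."""
    missing = []
    for url, html in pages.items():
        for target in collect_alternates(html).values():
            if target == url or target not in pages:
                continue  # self-reference, or a page we have no HTML for
            if url not in collect_alternates(pages[target]).values():
                missing.append((url, target))
    return missing
```

An empty result means every annotated pair in the supplied set points both ways; pages outside the set are skipped, not reported.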

I don't think g### distinguishes between "en" and "en-uk".


One of Google's scenarios for using this is for content with "small regional variations". They say "For example, you might have English-language content targeted at readers in the US, GB, and Ireland."

they did not mention header-level language specification


Yes, but I want to know! It seems to me to be the easiest way of doing it in most cases.

phranque




msg:4511142
 6:38 am on Oct 23, 2012 (gmt 0)

Yes, but I want to know! It seems to me to be the easiest way of doing it in most cases.


the purpose of the Content-Language header is different from the lang attribute.
the Content-Language header is more about the preferred language of the intended audience (i.e. visitor-centric) while the lang attribute is more about the actual language of the document or enclosing element (i.e. content-centric).

the latin primer example quoted below is a perfect example of this distinction: the content is in latin but the intended audience is english speakers.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.12
Content-Language

The Content-Language entity-header field describes the natural language(s) of the intended audience for the enclosed entity. Note that this might not be equivalent to all the languages used within the entity-body.

... The primary purpose of Content-Language is to allow a user to identify and differentiate entities according to the user's own preferred language. Thus, if the body content is intended only for a Danish-literate audience, the appropriate field is
Content-Language: da

If no Content-Language is specified, the default is that the content is intended for all language audiences. This might mean that the sender does not consider it to be specific to any natural language, or that the sender does not know for which language it is intended.

Multiple languages MAY be listed for content that is intended for multiple audiences. For example, a rendition of the "Treaty of Waitangi," presented simultaneously in the original Maori and English versions, would call for Content-Language: mi, en

However, just because multiple languages are present within an entity does not mean that it is intended for multiple linguistic audiences. An example would be a beginner's language primer, such as "A First Lesson in Latin," which is clearly intended to be used by an English-literate audience. In this case, the Content-Language would properly only include "en".

Content-Language MAY be applied to any media type -- it is not limited to textual documents.
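As a small illustration of the multi-value behaviour quoted above, this Python sketch splits a Content-Language value into its language tags and answers "is this entity intended for speakers of X?". The function names `parse_content_language` and `intended_for` are hypothetical, and matching on the primary subtag only is a simplifying assumption.

```python
def parse_content_language(value):
    """Split a Content-Language header value into a list of language tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def intended_for(value, language):
    """True if the header names `language` as an audience.

    An empty value is treated as "all audiences", following the RFC 2616
    text quoted above. Only the primary subtag is compared (assumption).
    """
    tags = parse_content_language(value)
    if not tags:
        return True
    wanted = language.lower()
    return any(tag.lower().split("-")[0] == wanted for tag in tags)
```

For the Treaty of Waitangi example, `parse_content_language("mi, en")` yields two tags, so the entity counts as intended for both audiences.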

Sgt_Kickaxe




msg:4511216
 10:08 am on Oct 23, 2012 (gmt 0)

I suspect that taking the code at face value became a problem when platforms like wordpress and drupal became popular and templates became readily available everywhere. Those templates were used by many different language sites yet the templates often had the same (wrong) language specs within the code.

This applies to more than just language specification, imo, and I'd be very surprised if there is anything at all that Google takes at face value from within the code save for perhaps rich snippets as an "extra" (not a ranking factor but an added feature). Even rich snippets are finding their way into templates and addons that are widely used.

- minimize sitewide boilerplate
- one language per page only
- clear definition between sections if a site is multi-lingual

Got it.

lucy24




msg:4511236
 10:37 am on Oct 23, 2012 (gmt 0)

one language per page only

Uhm... Uh... Have you any idea how many parallel translations I've put online over the years? :)

Sure there will always be a primary language. Even if it's only 51%. But when you're identifying that language you have to go by something. If it's a choice between
#1 a fully established standard markup that is sometimes used incorrectly
#2 trusting the computer's judgement (see my post near the top of this page)
#3 coming up with another proprietary markup and expecting everyone to use that correctly

... Feh. I'm not even going to finish the sentence.

Rumbas




msg:4511310
 2:32 pm on Oct 23, 2012 (gmt 0)

>this tells me google isn't that great at language and if not even google can "get it" it's a universal problem, so i would still recommend properly specifying language for all content.

I agree. Over the years we've seen Google having huge trouble distinguishing between similar language - Danish and Norwegian being a very good example.

It did surprise me, though, that they state they ignore the code-level language tags.

httpwebwitch




msg:4511384
 5:42 pm on Oct 23, 2012 (gmt 0)

This is not something I wanted to hear. I use appropriate lang attributes in my markup, perfectly according to the HTML spec, for translated versions of the same content at different URLs. For Google to ignore that and try to detect the language is ... rude?

I assume this is because too many webmasters use those lang attributes wrong, and Google finds them unreliable.

Oh well, I can just hope that my Spanish pages get identified as Spanish, and the French ones as French. My markup tells the real story. If Google gets it wrong, that's their problem to fix.

Hoping that Google gets things right is often the only shred of influence we have over representation in the SERPs.

aakk9999




msg:4511535
 3:09 am on Oct 24, 2012 (gmt 0)

I had and still occasionally have a huge problem with this. What I have found is that if you have a site in English that is targeting visitors worldwide, but the business has a locally based brick and mortar office and the website also has reasonable local traffic, Google suddenly decides that the page is in the local language.

Being on a ccTLD domain makes it even worse - Google just does not want to believe the page is in English if your ccTLD is from a non-English country and you are based in that country, regardless of how good the English on the page is and even if there is not a single foreign word on the page.

This is a big problem in the tourism niche when you are located in a particular country and are targeting an audience worldwide.

I have also found that when Google gets the language wrong and an English page gets "Translate this page" when searched in English Google, then your rankings drop a few spots - almost as if Google takes away a bit of relevance.

I have not noticed this to be a problem with any combination other than English / local language. Most of the sites I look after are multilingual with the "home language" being English, but only the English language has this problem (Google deciding it is not English).

I have found that the workaround for internal pages is to have a language folder in the URL (e.g. www.example.com/en/about-wigets ) - this seems to work for internal pages and Google is less likely to get it wrong. But if the home page is in English, and your brick and mortar address is (let's say) in Italy and you have a significant number of Italians clicking on your home page (which is in English) then Google often gets it wrong and declares the page to be in Italian.

I also found out that geotargeting the site to "Unlisted" seemed to help with home page language recognition.

phranque




msg:4511540
 4:20 am on Oct 24, 2012 (gmt 0)

I also found out that geotargeting the site to "Unlisted" seemed to help with home page language recognition.

this is only an option for gTLDs.

nikhilrajr




msg:4511574
 7:53 am on Oct 24, 2012 (gmt 0)

Here is my experience with language. The problem I see with lots of websites is with the use of the canonical tag. Ok, I have a main home page plus 5 home pages each offering a different language:
http://www.example.com/ - the main home page
http://www.example.com/?lang=en_EU
http://www.example.com/?lang=en_GB
http://www.example.com/?lang=de
http://www.example.com/?lang=fr
http://www.example.com/?lang=ja
Please see what Google says about Canonical tag [support.google.com...]

Google clearly says:
“The rel="canonical" attribute should be used only to specify the preferred version of many pages with identical content“

In all these pages the canonical tag points to http://www.example.com/ - english (US) version. How can http://www.example.com/?lang=en and http://www.example.com/?lang=de serve the same content?

What happens is Google might ignore all the other home pages and just show http://www.example.com/ - english (US) version in all the Google(US, UK, DE, FR) searches.
The best option is to use canonical tag in conjunction with alternate link tag See [support.google.com...]

But still it's a mystery [searchengineland.com...]

"The best practice is to place languages in subdirectory or subfolder rather than language parameter to help search engines more easily understand site structure." From this Google Webmaster video [youtube.com...] at 12:00 minutes

Having subfolder like http://www.example.com/de will help in verifying the German site and setting its geographic target in GWT to Germany.

phranque




msg:4511589
 8:37 am on Oct 24, 2012 (gmt 0)

In all these pages the canonical tag points to http://www.example.com/ - english (US) version.

this is an incorrect application of the link rel canonical element.
the content for each language should have its own canonical url.
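a minimal Python sketch of that rule: each language version gets a canonical pointing at itself, plus alternate links to every version. the `head_links` helper and the `versions` mapping are hypothetical, the subfolder URLs follow the http://www.example.com/de pattern mentioned earlier in the thread, and listing the page itself among the alternates is an assumption of this sketch.

```python
from html import escape


def head_links(versions, current_lang):
    """Build <link> tags for one language version of a page.

    versions     -- dict mapping hreflang code -> absolute URL
    current_lang -- the hreflang code of the page being rendered
    """
    # Each language version is canonical for itself, never for another language.
    canonical = versions[current_lang]
    lines = ['<link rel="canonical" href="{}">'.format(escape(canonical, quote=True))]
    # Every version lists all versions as alternates (itself included).
    for lang in sorted(versions):
        lines.append('<link rel="alternate" hreflang="{}" href="{}">'.format(
            escape(lang, quote=True), escape(versions[lang], quote=True)))
    return "\n".join(lines)


versions = {
    "en": "http://www.example.com/en",
    "de": "http://www.example.com/de",
    "fr": "http://www.example.com/fr",
}
print(head_links(versions, "de"))
```

because every page is generated from the same mapping, the german page can never end up with a canonical pointing at the english home page.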

Having subfolder like http://www.example.com/de will help in verifying the German site and setting its geographic target in GWT to Germany.

having a subdomain will have the advantages mentioned above and also allow the subdomain to be hosted within the targeted geography.
IP location is one of the stronger geolocation signals according to google.

lucy24




msg:4511606
 9:12 am on Oct 24, 2012 (gmt 0)

"The best practice is to place languages in subdirectory or subfolder rather than language parameter to help search engines more easily understand site structure."

I love it when the needs of search engines and the needs of human visitors are mutually exclusive. Your voice-text reader doesn't care what subdirectory the file lives in. It only cares what kind of unambiguous language information is present in the file.

TheMadScientist




msg:4511619
 10:08 am on Oct 24, 2012 (gmt 0)

If a text-reader doesn't care what directory a file is in, then how is it exclusive of text-readers to put the file in a 'language keyed' subdirectory or on a 'language keyed' subdomain?

And, if Google doesn't care what language identifier you use, then how is it exclusive of Google to identify a page using the correct markup for a text-reader?

Neither excludes the other from use of the file or information, lack of the use of both isn't a 'best practice', but Google's largely ignored HTML descriptions for years and they still somehow know what a page is about for the most part, so is there really a big deal here?

I mean now that people know they ignore it, are your pages showing up for the wrong searches more than they did before it was made known they ignore it or something? I doubt they just started ignoring it yesterday...

nikhilrajr




msg:4511630
 10:24 am on Oct 24, 2012 (gmt 0)

this is an incorrect application of the link rel canonical element.
the content for each language should have its own canonical url.


This is what I said : What happens is Google might ignore all the other home pages and just show http://www.example.com/ - english (US) version in all the Google(US, UK, DE, FR) searches.

having a subdomain will have the advantages mentioned above and also allow the subdomain to be hosted within the targeted geography.


There are issues. I would opt for this only if I am a big brand; and if I could generate content specific to their region! Not everyone could make a subdomain to rank because it spreads the authority.

To me having a single domain with multiple subfolders to target each language would help in getting a single domain that can build authority.

TheMadScientist




msg:4511632
 10:29 am on Oct 24, 2012 (gmt 0)

I guess I should add...

Yes, it's news to quite a few people, and it's good to know, but is there really anything to complain about or get up-in-arms over?

So there's another tag we know Google doesn't use ... Okay ... They even say they don't because it's unreliable, so it seems like a good thing for quite a few people who don't know what they're doing, and for those who do...

If you want to build a proper webpage, use the HTML attribute so it can be identified by user-agents who use the info.

If you want to build a site that's more easily ranked and understood by user-agents that don't use the tag (some search engine(s)), put the files in a language keyed subdirectory or on a language keyed subdomain.

If you want to do what's best for all user-agents, do both...

I would think it's 'Web 101' to not only do both, but to identify the links with a lang attribute, so there can be as little confusion as possible about the language of a given resource and site structure, regardless of visitor type and what markup they decide to give credence to.

lucy24




msg:4511648
 11:10 am on Oct 24, 2012 (gmt 0)

I doubt they just started ignoring it yesterday...

It definitely sheds light on why "search for pages in {some language}" has always been so utterly useless. They're not even looking at one possible source of information.

And then there's the counterpart, where you search for a non-English word and they helpfully say "tip: limit your search to pages in English". (Helloo? If I'd wanted English pages, I'd have said so. Or padded the search with something like "What does vixaxn mean?")

I detoured here for some random searches. Even if you search for a Greek word in Greek pages-- which should be a no-brainer-- you still get some pages that are primarily in English.

:: further digression to ponder the fact that Google Books thinks Dene = Inuktitut ::

It's that Microsoft thing. "We don't have to follow no steenking standards. We'll make up our own proprietary formulas and everyone else can jolly well accommodate themselves to us because we're the biggest."

phranque




msg:4511652
 11:23 am on Oct 24, 2012 (gmt 0)

what TheMadScientist said plus...
if you are building for the long term you should assume that google will eventually "learn" who can be "trusted" to properly use lang attribute.

phranque




msg:4511658
 11:35 am on Oct 24, 2012 (gmt 0)

There are issues. I would opt for this only if I am a big brand; and if I could generate content specific to their region! Not everyone could make a subdomain to rank because it spreads the authority.

To me having a single domain with multiple subfolders to target each language would help in getting a single domain that can build authority.


the subdomain is useful for geotargeting, where you are essentially filtering all the content into a specific google.ccTLD index.
when you are geotargeting you are pretty much by definition "generating content specific to the region."
in this case it will be more difficult to build regional authority in a subdirectory hosted offshore than in a subdomain hosted on an in-country IP address.

language targeting is distinct from geotargeting, to the extent that there is no mechanism for language targeting at hostname or subdirectory level in google.

Maurice




msg:4514108
 11:51 am on Oct 31, 2012 (gmt 0)

@lucy24 the problem is that a lot of sites are using poor quality CMSs (someone mentioned joomla) which default to EN, or they are built by developers who have ZERO knowledge of the various language encodings (and this is a very complex subject even for UTF8 let alone UTF16 and Non latin languages)

If part of the html spec is so debased by being used incorrectly then just ignoring it makes sense - less chance of a false positive.

lucy24




msg:4514382
 9:37 pm on Oct 31, 2012 (gmt 0)

even for UTF8 let alone UTF16 and Non latin languages

?

lucy24




msg:4514440
 12:35 am on Nov 1, 2012 (gmt 0)

Well, ### and ### that time limit anyway.

Follow-up thought arising from unrelated discussion in unrelated forum: Does anyone doubt that google can distinguish between pages made using WordPress, Joomla, Drupal, etc etc, and hand-crafted HTML? It's right there in the page source, isn't it? Possibly even in the visible text.

I can see false negatives: pages auto-marked <lang = "en"> when in fact their language is something else. But do you also get <lang = "something else"> when the language is really English? Never mind regions: just the initial 2- or 3-letter language code.
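For anyone who wants to see what a page actually declares, here is a minimal Python sketch that lists every lang attribute in a document and reduces it to its initial language code. The names `LangCollector`, `primary_subtag` and `declared_languages` are made up for this example, and accepting an underscore separator (as in "en_GB") is a leniency assumed here, not something the HTML spec allows.

```python
import re
from html.parser import HTMLParser

# A primary language subtag is taken to be 2 or 3 ASCII letters.
PRIMARY = re.compile(r"^[A-Za-z]{2,3}$")


class LangCollector(HTMLParser):
    """Records (element name, lang value) for every element with a lang attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "lang" and value:
                self.found.append((tag, value))


def primary_subtag(value):
    """Return the lowercased initial language code, or None if it is malformed."""
    first = re.split(r"[-_]", value.strip())[0]
    return first.lower() if PRIMARY.match(first) else None


def declared_languages(html):
    """Return [(element, primary language code)] in document order."""
    collector = LangCollector()
    collector.feed(html)
    collector.close()
    return [(tag, primary_subtag(value)) for tag, value in collector.found]
```

Comparing this list against the language of the visible text would show how often the declared code disagrees with the content on a given site.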

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved