
Google SEO News and Discussion Forum

Can Google crawl the same URL in multiple languages
graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 9:16 am on Aug 29, 2011 (gmt 0)

If a URL returns the same content in different languages, depending on browser language settings or on cookies recording previous choices, can Google index the URL in multiple languages (so that it can match it to searches in different languages)?

I think it cannot and I need a separate URL for each language, is that (still) correct?

 

yandr



 
Msg#: 4356217 posted 1:34 pm on Aug 29, 2011 (gmt 0)

No, Google cannot do that.

tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4356217 posted 2:57 pm on Aug 29, 2011 (gmt 0)

Yes - a separate URL for each language is correct.

g1smd

WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4356217 posted 5:39 pm on Aug 29, 2011 (gmt 0)

Also, Googlebot always comes from the US, so on sites that use auto-language settings Google will be fed the English version and no other language will ever be indexed. Be sure you avoid that problem.

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 11:58 am on Aug 30, 2011 (gmt 0)

Thanks for confirming that.

It's a bit pathetic that HTTP has provision for this, browsers can do it, CMSs can do it, it would be nice for users, and we cannot use it because of search engines.

Wouldn't it be nice if someone could link to a page that they read in their language, and everyone who followed the link could read it in their own language?

g1smd

WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4356217 posted 12:09 pm on Aug 30, 2011 (gmt 0)

That's only really possible if the website actually has the page published in every language there is.

I much prefer the scheme where every page of the site links to the same content (the same content page) in other languages via flags. This lets the user choose what they want to see.

rlange



 
Msg#: 4356217 posted 1:52 pm on Aug 30, 2011 (gmt 0)

graeme_p wrote:
It's a bit pathetic that HTTP has provision for this, browsers can do it, CMSs can do it, it would be nice for users, and we cannot use it because of search engines.

Well, without explicit links to these different languages, the only way Googlebot can discover them is by requesting every single URL as a different language. That is, potentially, over 3,000 requests per individual URL. Granted, there probably aren't nearly that many written languages with a significant presence on the Internet, but still...

Large amounts of random prodding would be a waste of resources for both the site and Google.

--
Ryan

Demaestro

WebmasterWorld Senior Member demaestro us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4356217 posted 2:53 pm on Aug 30, 2011 (gmt 0)

I don't think the same URL should resolve to different languages. The same file maybe, but the path to the file should be different.

example.com/en/page
example.com/fr/page
example.com/es/page
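A minimal sketch of how the one-URL-per-language scheme can coexist with browser language settings: the Accept-Language header is only used to suggest a language, while every language keeps its own crawlable URL. The language set, the function names, and the use of ISO 639-1 prefixes (so `es` for Spanish) are assumptions for illustration, not anything prescribed in this thread.

```python
# Hypothetical set of languages the site publishes; codes follow ISO 639-1.
SUPPORTED = ("en", "fr", "es")


def parse_accept_language(header):
    """Return the language tags in an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by original order.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def suggest_language(header, supported=SUPPORTED, default="en"):
    """Pick a language to suggest; a client sending no header gets the default."""
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0]
        if primary in supported:
            return primary
    return default


def localized_url(path, language):
    """Build the language-specific URL, e.g. /fr/page."""
    return "/" + language + "/" + path.lstrip("/")
```

For example, `localized_url("/page", suggest_language("fr-CA,fr;q=0.9,en;q=0.8"))` gives `/fr/page`, while a client that sends no Accept-Language header simply gets the default. Because `/en/page`, `/fr/page` and `/es/page` all exist as separate URLs, a crawler can reach every version through ordinary links.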

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 7:26 pm on Aug 30, 2011 (gmt 0)

Not flags please: names of languages as text. Flags, though widely used, do not work well: they correspond to countries, not languages. Many countries have multiple languages, many languages are used in several different countries, and flags can touch national sensitivities. I have seen three different flags used for English - and never the English flag. What flag do you use for Cantonese, or Tamil, or Hindi, or Bengali, or Serbo-Croat?

As for crawling multiple languages being impractical, search engines are constantly inventing ways for sites to communicate with them: robots.txt extensions, site maps, meta tags. Any one of these could be used to indicate available languages (doing it in the HTTP header would be ideal, but those standards change more slowly). They could solve the problem if they wanted to.

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4356217 posted 9:46 pm on Aug 30, 2011 (gmt 0)

and never the English flag.

You wouldn't recognize the English flag if you saw it. ;) You'd think it was something Scandinavian.

What flag do you use for Cantonese, or Tamil, or Hindi,

Did I miss a war? What's the other country in which Hindi is an official language?

The catch is that ordinary humans are more likely to recognize a flag than a two-letter abbreviation. If I've got a choice between the German flag and the UK flag I may grumble but I'll know which one to click. But if it says "de" I have to stop and remind myself that it means Deutsch, not Denmark.

:: still sulking because my computer can no longer be set to iu-ca ::

Demaestro

WebmasterWorld Senior Member demaestro us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4356217 posted 10:34 pm on Aug 30, 2011 (gmt 0)

The catch is that ordinary humans are more likely to recognize a flag than a two-letter abbreviation.


Sure, but why use two-letter abbreviations? They seem as lacking as flags. Using them in the URL makes sense, but when presenting a user with options to select a language, why limit yourself to two characters?

It also doesn't answer the question: when I click Canada's flag, do I get English or French? If I am Quebecois, do I have to recognize and select the French flag to get the site in French?

I think flag selection for language makes no sense from an end user perspective.

g1smd

WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4356217 posted 12:29 am on Aug 31, 2011 (gmt 0)

The two characters are usually those defined in ISO 639 and are understood worldwide.

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 1:06 pm on Aug 31, 2011 (gmt 0)

@lucy, the English flag is used, at sporting events, for example. I can even tell it apart from the subtly different flag of the City of London :)

Hindi is the same as Urdu, but the Pakistanis are not too keen on calling it Hindi.

The alternative to a flag is not two characters, it is the full name of the language, in that language. It is surprisingly easy to spot your native language in a list of foreign names of other languages - it is even easier if your language has its own alphabet.

I am working on a bilingual site, possibly later going tri-lingual, and each language has its own, very different, alphabet (and one language does not have an associated flag) so I will have a box in a corner something like "English | සින්හල" (spelling may be wrong, and you may not have a font that can display that installed, but you get the idea).

[edited by: graeme_p at 2:02 pm (utc) on Aug 31, 2011]
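A rough sketch of such a text-based switcher, assuming the language-prefixed URL scheme discussed earlier in the thread. The language list and the function name are hypothetical, and the Sinhala label uses the spelling සිංහල rather than the one typed above.

```python
from html import escape

# Hypothetical language list: (URL prefix, name of the language in that language).
LANGUAGES = [("en", "English"), ("si", "සිංහල")]


def language_switcher(path, current, languages=LANGUAGES):
    """Return an HTML fragment such as: English | <a href="/si/page" ...>සිංහල</a>"""
    page = path.lstrip("/")
    items = []
    for code, native_name in languages:
        label = escape(native_name)
        if code == current:
            # The language being viewed is shown as plain text, not a link.
            items.append(label)
        else:
            href = escape("/%s/%s" % (code, page))
            items.append(
                '<a href="%s" lang="%s" hreflang="%s">%s</a>'
                % (href, code, code, label)
            )
    return " | ".join(items)
```

Each link is ordinary markup pointing at a distinct URL, so it doubles as the explicit link a crawler needs to discover the other language versions.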

rlange



 
Msg#: 4356217 posted 1:10 pm on Aug 31, 2011 (gmt 0)

g1smd wrote:
The two characters are usually those defined in ISO 639 and are understood worldwide.

By whom? I wouldn't guarantee that my mother would recognize "EN" to stand for English and that's the only language she speaks. She might select it because it's the closest match, but there's no confidence in that selection. That's user unfriendliness.

No, the best option is the actual name of the language in that language.

--
Ryan

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4356217 posted 8:32 pm on Aug 31, 2011 (gmt 0)

Hindi is the same as Urdu, but the Pakistanis are not too keen on calling it Hindi.

In the bazaar maybe, but on your www page they would be different because Hindi is written in Devanagari while Urdu is written in Arabic. Stomping on further digression about Sanskrit loanwords in Hindi, vs Arabic and Turkish loanwords in Urdu.

spelling may be wrong, and you may not have a font that can display that installed, but you get the idea

Even if I had the font-- which incredibly I don't-- the Forums probably wouldn't cooperate. Uhm... Sinhala, two letter + vowel combinations followed by two independent letters? (For some reason my computer knows enough to do that, although even the Last Resort font can't say more than "empty box".)

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 4:06 am on Sep 1, 2011 (gmt 0)

Ooops, I was wrong about that. I have only ever come across them in their spoken form (as in someone who speaks Hindi being able to speak to someone who speaks Urdu). You may still want to classify them together on the net (for video and audio) and it still does not make sense to use the Indian flag for Hindi.

The forums do cooperate - it works for me. Maybe I should have used a more widely installed font to show the effect but boxes show what I want, and I have Sinhala transliteration installed so it was easy to type in.

If the Last Resort font is showing an empty box, I think that means you have an old version of it somehow [developer.apple.com ]

None of this has much effect on my point: that the full text of names is preferable to flags.

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 4:46 am on Sep 1, 2011 (gmt 0)

This discussion persuaded me to install GNU Unifont. It cannot cover all of Unicode, as Unicode has more characters than you can have in a TTF or OpenType font, but it covers the Basic Multilingual Plane (in low quality glyphs).

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4356217 posted 5:28 am on Sep 1, 2011 (gmt 0)

If the Last Resort font is showing an empty box, I think that means you have an old version of it somehow

Last time I tried Last Resort, it overrode fonts that I did have installed, showing the language glyph (I didn't mean literally an empty box) instead of the proper character. But it's apparently now built in to the OS in some way, because I get its functionality in some applications. Pity they couldn't put something similar in the iPad; those boxes make me crazy.

:: major detour here, including an inexplicable trip to the unicode www site-- which I cannot navigate to save my life-- in order to download an Apple font, followed by FontBook telling me, quote, This font file contains a font name that conflicts with a system font required by Mac OS X to display onscreen text. You should move this font file to the Trash. I guess that explains why the Apple page didn't have a download link ::

Wonder if any of those ftx... utilities includes a glyph-deletion system so you can have @-embedded fonts for decorative headers without dragging along a bunch of other characters you don't need, that just take up space and bandwidth?

ᐃᓄᒃᓱᐃᕈᑎᔪᓯᒃ << testing Forums' mood of the hour before-- stop me if you've heard this one-- running off to download seven Sinhala fonts

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 10:41 am on Sep 1, 2011 (gmt 0)

Definitely not heard that before!

Why not try GNU Unifont? There are Mac and Windows downloads here: [unifoundry.com ]

Linux users can just install the unifont package using their usual package manager.

I installed it and can now read the front page of Wikipedia (which has a complete list of all language versions) without any missing-character boxes. It is far from perfect (especially for languages like Sinhala that have modifiers - they appear as separate letters), but it's the best fallback I have found.

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 4:06 pm on Sep 1, 2011 (gmt 0)

Just to note I was wrong again: despite what I read, Unifont seems to cope with Sinhala (and presumably will cope with all Indic scripts) perfectly accurately - although it is ugly.

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4356217 posted 8:22 pm on Sep 1, 2011 (gmt 0)

There are two components to text display: the font itself, which is trivial, and Rendering Support for languages whose display is non-linear. My computer could tell which characters go together even though it couldn't display the characters. (It can now. South Indian* scripts are pretty. Except Tamil, which is all squared-off and boxy-looking.)

This page contains the latest release of the GNU Unifont, with glyphs for every printable code point in the Unicode 5.1 Basic Multilingual Plane (BMP).

Holy ###. Can't get much better than that. Except by reverting to Unicode 4.x, before they officiously stepped in and attempted to steal codepoint 1400.

16MB, wow. I've got a few CJK fonts in that range, but the median is around 80K :) Oh, you're right, it is ugly. And it doesn't know how to do Devanagari consonant clusters, though it gets the vowels right. Unless it's after two consonants, as in kri...

Oh, and I'm missing one script in wikipedia's 10,000+ range (I happen to know it's Malayalam, which I've actually got but Camino hasn't caught up to it yet), and two more in the 1,000+ range (Aramaic and ... wtf? I thought I had Oriya!). They may be in unifont, but I won't know until I restart.

:: wondering how much longer it will be before a moderator comes along with scissors and paste-pot ::


* Yes, I know, but historically it's the same immediate branch of the family.

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 4:19 pm on Sep 2, 2011 (gmt 0)

Yes, I know, but historically it's the same immediate branch of the family.


I am not sensitive about these things (and no one who is seems to be taking part in this thread). I am pretty sure you know more about the history of Sinhala script than I do.

it is ugly.


Less ugly than empty boxes IMO. I have a weird problem with Chromium which uses Unifont instead of the perfectly nice Sinhala font I have installed - Firefox and Epiphany use the better one.

You need a restart to get a font working!?

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4356217 posted 2:17 am on Sep 3, 2011 (gmt 0)

Not restart, just log in and out. But I'd still have to quit all applications, so it isn't worth the trouble.

I can see the fonts right away in FontBook, and they're available to most applications-- sometimes even if it was running while I installed the font! For example, the Sinhala word popped right up.

But some things are too intimately connected with the operating system to function separately. (I recently had a horrendous Safari problem because of this.) And I think Camino simply doesn't like Code2000. That's where I've got Malayalam and Oriya, and a few other South Asian scripts. Doesn't mind Code2001, though. If I paste the 1000+ paragraph from wikipedia into SubEthaEdit, it turns out I do have Aramaic. (The script is Syriac. Two different fonts, at that.) It also displays as intended in Safari, no surprise.

unifont shows up in FontBook, but the Character Viewer ignores it. Interesting.

I have a weird problem with Chromium which uses Unifont instead of the perfectly nice Sinhala font I have installed - Firefox and Epiphany use the better one.

I know that one. Some browsers go alphabetically-- serious annoyance for me, because that's likely to mean Alphabetum which I have yet to pay for-- others go by ... uhm ... some other factor. For Gothic (looking at the wikipedia front page again), Camino uses Code2001 while Safari uses Alphabetum. Go figure.

But hey, at least you're not in MSIE. (If you want to be exact, I think it's MSIE8 for Linux. Got this from a tester.) It would rather display blank boxes than obey explicit instructions to use DejaVu Sans for UCAS syllabics.

What was this thread about again?

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 5:15 am on Sep 3, 2011 (gmt 0)

MSIE8 for Linux


I did not know there was such a thing. Do you mean running IE 8 on Wine? I usually test with Windows in Virtualbox (and I recently found out you can convert the MS test Virtual PC images for testing with IE to run in Virtualbox).

What was this thread about again?


I forget, but it was not half as interesting.

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4356217 posted 6:54 am on Sep 3, 2011 (gmt 0)

:: shuffling papers ::

Sorry. Got Tor mixed up with Andy. It's

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR et cetera et cetera

Or, in human-speak, Windows XP SP3. Quote from the source (who, incidentally, is not a native speaker of English and does not live in an English-speaking country):
looking in IE Font settings, I cannot choose freely among all the fonts I have installed on the computer. For example, if I choose Japanese, it shows a list of 4-5 fonts to choose from. I guess it believes that those are the only fonts with those characters. And so I cant choose DejaVu Sans under UCAS; in fact I cant choose any font for UCAS in IE, cause the list is empty! (useless piece of junk). Now why would they make it so restrictive rather than just lets us choose whichever font we want from all available? Is that too much of a responsibility for us to handle?

The backstory, in case anyone is still following along, is that I was testing my version of the css-plus-javascript function that checks if you've got a particular font installed. Elsewhere in the same test I learned that Iceweasel behaves oddly, resulting in false positives. (This is why I had Linux on the brain.) So if you were doing the test for some Serious Purpose you'd have to detect the UA and push them to an alternative code. Adding decorative ᐀inuksuuk᐀ to your headers-- but only if the user's computer can display them-- does not count as a Serious Purpose.

Leosghost

WebmasterWorld Senior Member leosghost us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4356217 posted 2:17 pm on Sep 3, 2011 (gmt 0)

I forget, but it was not half as interesting.

᐀inuksuuk᐀ agreed ;-)

Interesting ..I can see them..and if i copy and paste them into "quick reply"..I can still see them as intended..
But if I copy and paste them into preview they do this �inuksuuk� .. :( .so ..some of tomorrow will be spent trying to discover why this penguin can see inuksuuk ( nice btw :) but can't pick them up and put them down again somewhere else..or that holding them during a page change mangles them.

Now to see if posting them via the submit button keeps them intact ?

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4356217 posted 4:47 pm on Sep 3, 2011 (gmt 0)

It's because the Forums don't :: cough, cough :: have a declared "charset", aka File Encoding. This works up until the moment someone uses a character outside of Latin-1, at which point you have to either reset the encoding manually or your browser does it for you.

In php/bb forums the problem is handled by auto-converting anything outside of Windows-Latin-1* to html decimal entities, which will display correctly in the post but are ### to edit. (A familiar problem to me because I often have to help people with polytonic Greek.)

Your copy-and-paste is what you see if characters in the UCAS range are reinterpreted as Latin-1; I know them well. Each character turns into a set of three: á (E1) followed by a non-displaying character (90 through 99) and finally by a semi-random character (80 through BF).
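The byte pattern described above can be reproduced with a short sketch. It uses strict ISO-8859-1 (Python's `latin-1` codec) rather than Windows-1252, which is an assumption made so that every byte value decodes; the variable names are illustrative only.

```python
# U+1400 is the character used on either side of "inuksuuk" in this thread.
text = "\u1400inuksuuk\u1400"

# In UTF-8 every character in the UCAS block (U+1400 to U+167F) is three bytes.
first = "\u1400".encode("utf-8")   # start of the block
last = "\u167f".encode("utf-8")    # end of the block
first_hex = ["%02X" % b for b in first]
last_hex = ["%02X" % b for b in last]

# Reading those UTF-8 bytes as Latin-1 turns each character into three:
# 0xE1 (a-acute), a C1 control character, and one more character.
mangled = text.encode("utf-8").decode("latin-1")

# The damage is reversible as long as no byte was dropped or altered.
repaired = mangled.encode("latin-1").decode("utf-8")

print(first_hex, last_hex)
print(len(text), len(mangled), repaired == text)
```

The lead byte is always E1 across the whole block, the second byte runs from 90 to 99, and the third from 80 to BF, which matches the three-character pattern described in the post.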


* Officially ISO-Latin-1, but de facto Windows-Latin-1. Someone pointed me to an explanatory link once. The difference is in a handful of "extras" like — (dash), curly quotes or apostrophes “” ‘’, the letters œ/Œ and a few more.

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 4:26 pm on Sep 4, 2011 (gmt 0)

@Lucy - I am very curious as to what led you into such expertise on languages, alphabets and encodings. I am used to all kinds of knowledge showing up on this forum but this is unexpected (and impressive, of course).

Nicely spotted about the lack of a declared encoding here. It all works OK for me because my browser's default encoding is set to UTF-8. I am not entirely sure if I am seeing what I am supposed to either side of the "inuksuuk". It looks like a rectangle made of a solid grid of dots to me.

I am not entirely clear why a browser would interpret ᐀inuksuuk᐀ correctly in other places but as Latin-1 in the preview. Did copy and paste change something?

WW really should be declaring an encoding, and it should be UTF-8 (efficient for mostly English text, but leaves room for everything).

I wonder what is most efficient (in terms of bandwidth and storage) for a site that will be half English and half Sinhala, possibly with some Tamil later?
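One way to approach that question is simply to measure. The sketch below counts encoded bytes for sample strings; the helper name and the samples are hypothetical, and it ignores compression, which a real server would probably apply.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and in UTF-16 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


english = "hello"
sinhala = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"   # the five code points of සිංහල
marked_up = "<p>\u0dc3</p>"                  # one Sinhala letter inside ASCII markup

for sample in (english, sinhala, marked_up):
    print(encoded_sizes(sample))
```

Sinhala code points take three bytes each in UTF-8 and two in UTF-16, while ASCII text and markup take one byte in UTF-8 and two in UTF-16. Which encoding comes out smaller for a whole page therefore depends on the ratio of Sinhala text to English text and HTML markup.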

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4356217 posted 9:44 pm on Sep 4, 2011 (gmt 0)

For characters in the Latin-1* range (decimal 128-255), UTF-8 takes up more room because it uses two bytes per character. But the total size difference is trivial when you add in all the other stuff that takes up room in a www page.

Beyond Latin-1, you have to choose between UTF-8 and entities. Ugh, ugh, ugh. (Omitting lengthy discussion about desirability or otherwise of using entities within Latin-1.) And I don't only say that because my fingers hate typing the word :: slowing down so I'll get it on the first try :: "entities".

Pro #1: Entities will be read and interpreted correctly even if you're in an elderly browser that can't deal with the "charset" declaration.

Pro #2: If you are on A Certain Platform, physically typing anything other than plain keyboard ASCII can be overwhelming.

Con #1: Entities make the raw html unreadable. Obviously not an issue if you never do read the raw html.**

Con #2: If a significant part of the text requires entities, it will come out larger than UTF-8 because each one takes an absolute minimum of four bytes. Like &pi; (named HTML 4 entity) or &#960; (decimal entity) or &#x03C0; (hexadecimal entity).
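The size comparison in Con #2 can be checked with a small sketch. The function name is made up for illustration; the entity forms are the three quoted above for the Greek letter pi.

```python
from html import unescape


def representation_sizes(char, named=None):
    """Byte counts for one character as raw UTF-8 and as HTML entities."""
    code_point = ord(char)
    forms = {
        "utf-8": char.encode("utf-8"),
        "decimal": ("&#%d;" % code_point).encode("ascii"),
        "hex": ("&#x%04X;" % code_point).encode("ascii"),
    }
    if named is not None:
        forms["named"] = ("&%s;" % named).encode("ascii")
    return {name: len(data) for name, data in forms.items()}


pi = "\u03c0"
sizes = representation_sizes(pi, named="pi")
print(sizes)

# All three entity spellings decode back to the same character.
decoded = [unescape(entity) for entity in ("&pi;", "&#960;", "&#x03C0;")]
print(decoded)
```

For pi the raw UTF-8 form is two bytes, against four, six and eight bytes for the named, decimal and hexadecimal entities.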

I am very curious as to what led you into such expertise on languages, alphabets and encodings. I am used to all kinds of knowledge showing up on this forum but this is unexpected (and impressive, of course).

Academic background in Classics and Linguistics plus being an ecumenical Language Junkie. In the nearer past I've been doing e-books for about 7 years. I tend to gravitate toward weird languages ("Help! Greek ligatures! Where's Lucy?") or archaic texts (I can remember when I didn't know what a yogh was). I tried to count once; I think I've done e-texts in about 15 languages. But some of those were one-offs. For example, I have no idea why I was dragooned into that 17th-century Gascon book, or the Malay grammar.

Nicely spotted about the lack of a declared encoding here.

It has come up before ;)

It all works OK for me because my browser's default encoding is set to UTF-8. I am not entirely sure if I am seeing what I am supposed to either side of the "inuksuuk". It looks like a rectangle made of a solid grid of dots to me.

The inuksuk character only exists in two UCAS fonts that I know of. One of them is Pigiarniq (the font the Nunavut government points you to), but the other is not Euphemia (included in the more recent Mac and Windows OS). It's at codepoint 1400, which was unassigned until Unicode 5 came along and decided it was going to be the "UCAS hyphen" (a character that nobody uses and afaik exists in no font) :(

When I Preview, anything beyond Latin-1 tends to change to decimal entities unless the browser is already set for UTF-8. On this page it is, because it decided that the ᐀ character outweighed the non-displayable ones.


* I'm using "Latin-1" as shorthand for "any one-byte charset". If your site is entirely in, say, Cyrillic, you'll have a different encoding but it's still just one byte.

** WYSIWYG editors are no longer just for wimps and dummies. My father, who spent his working life as a crystallographer and therefore speaks fluent Fortran, swears by Freeway.

graeme_p

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4356217 posted 2:01 pm on Sep 5, 2011 (gmt 0)

I was planning on UTF-8 for storage. I am still wondering about sending pages in Sinhala as UTF-16. I am not sure about browser support though.

Is the problem with Euphemia that it does not use the right code points? There are Sinhala fonts (not much used now, thankfully) which did not stick to any recognised encoding, and they caused so much confusion that it is still not uncommon for sites to use images for Sinhala text.

I still think HTML is better edited directly. Most HTML is generated from templates, and as they will be used on lots of pages (even right across a site or multiple sites), it's worth the extra effort. If you are directly editing static HTML, WYSIWYG is probably the way to go, but why would you do that except for a very small site?

Freeway (and possibly other editors?) bases each page on a template, making it almost a static HTML site generator, rather than just an editor. A perfectly good approach if your site has essentially static content and is small enough that re-uploading after a site-wide change will not take forever.
