Forum Moderators: Robert Charlton & goodroi


Can Google crawl the same URL in multiple languages


graeme_p

9:16 am on Aug 29, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If a URL returns the same content in different languages depending on browser language settings or cookies recording previous choices, can Google index the URL in multiple languages (so it can match it to searches in different languages)?

I think it cannot and I need a separate URL for each language, is that (still) correct?
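For illustration, here is a rough sketch of the "separate URL for each language" approach: read the browser's Accept-Language header once, then send the visitor to a language-specific URL so that every language version has its own address. The `/en/` and `/si/` path prefixes and both function names are made up for this example, and nothing here is a statement about how Google actually crawls.

```python
def pick_language(accept_language, supported, default):
    """Pick the best supported language from an Accept-Language header value."""
    choices = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without an explicit q value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        choices.append((quality, tag.strip().lower()))
    # Stable sort: tags with equal q keep the order they had in the header.
    choices.sort(key=lambda choice: choice[0], reverse=True)
    for quality, tag in choices:
        if quality <= 0:
            continue
        primary = tag.split("-")[0]  # "si-lk" -> "si"
        if primary in supported:
            return primary
    return default


def language_url(path, lang):
    """Build the hypothetical per-language URL, e.g. /si/news."""
    return "/" + lang + path


supported = {"en", "si"}
lang = pick_language("si-LK,si;q=0.9,en;q=0.8", supported, "en")
print(language_url("/news", lang))
```

The point of the design is that the negotiation only decides where to redirect; the content at each URL never changes with the browser settings.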

lucy24

10:40 pm on Sep 5, 2011 (gmt 0)




I was planning on UTF-8 for storage. I am still wondering about sending pages in Sinhala as UTF-16. I am not sure about browser support though.

I've never seen UTF-16 used for anything except Chinese, and so far I haven't seen any particular arguments in favor of it. Er. In favor of UTF-16. Not of Chinese.

Is the problem with Euphemia that it does not use the right code points? There are Sinhala fonts (not much used now, thankfully) which did not stick to any recognised encoding, and they caused so much confusion that it is still not uncommon for sites to use images for Sinhala text.

No, it's just that only one font uses codepoint 1400 at all. Actually two, but nobody uses Uqammaq. 1400, alias E19080, was officially unassigned (it's the very first codepoint in the UCAS sector) until Unicode 5, which was only a year or two back. So it was sort of a de facto Private Use point. I guess nobody at the Unicode Consortium ever noticed the inuksuk. There are still half a dozen unused codepoints at the end of the sector, so it's not like they had to swipe 1400 :(
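For anyone puzzled by "1400, alias E19080": the first is the code point (U+1400) and the second is the same character as UTF-8 bytes (E1 90 80). A quick check in Python, assuming a Python 3 whose bundled Unicode data is version 5.2 or later (otherwise the name lookup would fail):

```python
import unicodedata

ch = "\u1400"

# The same character seen three ways: code point, UTF-8 bytes, UTF-16 code unit.
utf8_hex = ch.encode("utf-8").hex().upper()
utf16_hex = ch.encode("utf-16-be").hex().upper()
name = unicodedata.name(ch)

print(f"U+{ord(ch):04X}", utf8_hex, utf16_hex, name)
```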

If you've got pre-1999 Sinhala fonts, you're in legacy font territory. Explanatory link suppressed because :: cough, cough :: the first page that comes to mind is one of mine. I once and only once found a page at unicode.org that lets you look up when a given codepoint or sector was first assigned, but I don't know if I'll ever find it again.

Quick edit: Oops, I tell a lie. Mercifully I did bookmark the page: [unicode.org...] I also see that the rudiments of Unicode go back to 1993; by 1999 we were in 3.0. But I remembered right about codepoint 1400. 5.2, September 2009.

I still think HTML is better edited directly. Most HTML is generated from templates, and as they will be used on lots of pages (even right across a site or multiple sites) it's worth the extra effort. If you are directly editing static HTML, WYSIWYG is probably the way to go, but why would you do that except for a very small site?

I don't use WYSIWYG myself. The people who hotlink to me apparently do. When I was offered Freeway and figured out what it was, I screamed Over my dead body! Or words to that effect. I like getting my hands dirty. Granted, it would be better if I learned php as an alternative to cutting-and-pasting the same thing across a bunch of pages every time I tweak the layout, but, well, one of these years.

I'd put the page into UTF-8 and leave it at that. My computer has a "Region"-- just one!-- associated with Sinhala, so Safari won't melt down if users set it as their first choice.
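A rough way to see why UTF-16 buys little for a web page: Sinhala letters take 3 bytes each in UTF-8 and 2 in UTF-16, but every ASCII character of markup takes 1 byte in UTF-8 and 2 in UTF-16. The sketch below compares the two on a tiny sample; the sample text (the word "Sinhala" written in Sinhala script) and the `encoded_sizes` helper are just for illustration, and the UTF-16 figure leaves out the byte order mark.

```python
# The word "Sinhala" in Sinhala script, written as escapes so the file stays ASCII.
sinhala_word = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd"
page = "<p>" + sinhala_word + "</p>"


def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


word_sizes = encoded_sizes(sinhala_word)
page_sizes = encoded_sizes(page)
print("bare word:", word_sizes)
print("with markup:", page_sizes)
```

On the bare word UTF-16 is smaller, but as soon as markup is added the saving shrinks or reverses, and real pages carry far more markup than this one.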

graeme_p

8:50 am on Sep 6, 2011 (gmt 0)




Thanks, I will stick to just UTF-8

I found the "fonts and input" page on your site, and that is what I was talking about. The fonts I saw did not even follow a pre-Unicode standard; I think they simply made up their own encoding.

There are still Sinhala sites I cannot get to display in any encoding, and which appear to work only with particular fonts, so I assume they are STILL using legacy fonts:

[rivira.lk ]

I chose that out of several I found as newspaper websites are normally regarded as OK to link to on WW.

My problem is going to be providing for Sinhala user input: most people do not know how to input Sinhala or do not have the software installed. Most people seem to type Sinhala in Latin letters, which is ugly and unreadable. Telling people to install something to use a site properly is usually futile.

it would be better if I learned php as an alternative to cutting-and-pasting the same thing across a bunch of pages every time I tweak the layout


Old fashioned server side includes should solve that problem, and are still supported by most Apache based hosts.
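To show the idea, here is a toy emulation in Python of what the server does with an include directive. The `<!--#include virtual="..." -->` form is the usual Apache SSI syntax; the `expand_includes` function, the dictionary standing in for files on disk, and the file names are all invented for this sketch, and a real server handles many more directives than this.

```python
import re

# Matches an SSI include directive such as <!--#include virtual="/nav.html" -->
INCLUDE = re.compile(r'<!--#include\s+virtual="([^"]+)"\s*-->')


def expand_includes(page, fragments):
    """Replace each include directive with the shared fragment it names.

    Directives naming an unknown fragment are left untouched in this sketch.
    """
    def substitute(match):
        return fragments.get(match.group(1), match.group(0))

    return INCLUDE.sub(substitute, page)


# Shared layout pieces live in one place (a dict here, files on a real host).
fragments = {"/nav.html": "<ul><li>Home</li></ul>"}
page = '<!--#include virtual="/nav.html" --><p>Page text</p>'
print(expand_includes(page, fragments))
```

Tweak the layout fragment once and every page that includes it picks up the change, which is the whole attraction over cutting and pasting.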

Also, you do not need to know PHP to use a PHP based CMS. Even modifying PHP based templates is doable because you can often start from an existing theme and modify just the HTML, treating the PHP bits as copy and paste blocks.

lucy24

7:30 pm on Sep 6, 2011 (gmt 0)




There are still Sinhala sites I cannot get to display in any encoding, and which appear to work only with particular fonts, so I assume they are STILL using legacy fonts:

[rivira.lk...]

I chose that out of several I found as newspaper websites are normally regarded as OK to link to on WW.

Ugh. There you've got two separate and unrelated problems. Not only is it a legacy font, it's an embedded font in a format only MSIE ever recognized:
"Please use only Internet Explorer web browser to browse www.rivira.lk. You cannot read properly in Mozilla Firefox or any other web browser, because those browsers do not support this particular font."

That's a dead giveaway that they are using the .eot (Embedded OpenType) format. It's all in the CSS:

[rivira.lk...]

So all those pretty links in the left margin are images. The "charset=x-user-defined" is also pretty lethal.

If I could read Sinhala* I would know whether this recent post [sransara.blogspot.com] explains things ;)


* Technically I don't know that I can't read Sinhala. I've never tried.

graeme_p

6:32 am on Sep 7, 2011 (gmt 0)




I knew you would like it :)

x-user-defined is what you are supposed to do if you make up your own encoding, so I think they have an excuse for that.

The site looks wonderful in the Google SERPs. No-one actually reads the site either - an Alexa rank of below 600 in its home country for a national newspaper is pretty bad.

My wife says the post is an explanation of the problem with an explanation of what they should be doing.

lucy24

6:46 pm on Sep 7, 2011 (gmt 0)




"Fix It", presumably ;)

I went over and asked someone on my e-books forum. Figured he'd say "been there, done that" because he is a Professional Computer Geek who worked in South Asia for quite a while. He says discouragingly "Probably you'll have to build your own transcoder" and pointed me towards the find-and-replace tool* in his tei2html package on-- dear God! will they follow us everywhere?-- google code.


* The exact words were "a little tool" ... which itself sounds unspeakably ominous.
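For a sense of what a home-built transcoder might look like at its simplest, here is a sketch of a table-driven find-and-replace that tries the longest legacy sequence first. The mapping table is illustrative only and has not been checked against any real Sinhala font; the function name and table are made up, and this is not how the tei2html tool works. A real legacy font would need a much larger table and probably extra reordering rules.

```python
# Illustrative legacy-to-Unicode table. These pairings are assumptions for the
# example and have NOT been verified against any real legacy Sinhala font.
LEGACY_TO_UNICODE = {
    "i": "\u0dc3",         # SINHALA LETTER DANTAJA SAYANNA
    "y": "\u0dc4",         # SINHALA LETTER HAYANNA
    ",": "\u0dbd",         # SINHALA LETTER DANTAJA LAYANNA
    "s": "\u0dd2",         # SINHALA VOWEL SIGN KETTI IS-PILLA
    "x": "\u0d82",         # SINHALA SIGN ANUSVARAYA
    # Two-character key: a vowel sign typed before the consonant in the legacy
    # text has to come after the consonant in Unicode order.
    "fy": "\u0dc4\u0dd9",  # HAYANNA + SINHALA VOWEL SIGN KOMBUVA
}


def transcode(text, table):
    """Convert legacy-font text to Unicode, longest table key first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No table entry starts here: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transcode("isxy,", LEGACY_TO_UNICODE))
```

Trying longer keys first matters because a single-character rule would otherwise fire in the middle of a sequence that needs to be converted as a unit.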

graeme_p

5:02 pm on Sep 8, 2011 (gmt 0)




Fix it, and how to fix it.

Sounds like an interesting forum. tei2html looks like a nice tool as well.

It does sound ominous - a little tool that you have to adapt. I do not think it would actually be difficult for someone who was good at text munging scripts.
This 36 message thread spans 2 pages: 36