Forum Moderators: phranque


Guidance on how to convert non latin characters in URL

         

Whitey

7:19 am on Dec 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



We receive an XML place name feed in Western characters that includes letters beyond the standard 25-letter alphabet.

Our developers' current URL rewrite setup simply omits letters that do not conform to that alphabet, and the truncated names are what appear in the URL key.

An example of this is:
Málaga gets rewritten into the URL as Mlaga [ it should be Malaga ]
Strenči gets rewritten as Streni [ it should be Strenci ]
Skødshoved Strand gets rewritten as skdshoved-strand [ it should be skodshoved-strand ]

PS - Heck - I can see even webmasterworld CMS has issues on Strenci !

Is there an easy way to convert these European accents and other deviations from the 25-letter alphabet?

lucy24

8:36 am on Dec 2, 2015 (gmt 0)




even webmasterworld CMS has issues

At WebmasterWorld all non-8859-1 characters are converted to decimal entities ... and then, before they are displayed, there's a further conversion of & to &amp;, so the entity is worthless. (Funny about č though, because I thought that was in 1252, and 8859-1 is de facto interpreted as 1252; that’s how we’re able to use “curly quotes”.) *

What's the Apache aspect? Once things are out there in the ether, they've already been percent-encoded. It's not Apache eating your non-ASCII characters; it works with whatever it's given. Are the extra characters part of real URLs, or are they falling victim to a function that was only supposed to sanitize URLs and is now going overboard?

How many characters are involved? Just Latin-with-diacritics? Not that it really makes any difference-- there are scads of different unicode ranges, some of them way out beyond most non-Roman scripts-- but it's visually easier when you can see it as "letter plus diacritic". And if you're really talking about place names in Europe, that's a pretty finite character set. (Place names in, say, Vietnam, and it gets messier.)

What do you want the code to do? Something like percent-encoding and then switching back again?
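For reference, percent-encoding round-trips cleanly -- it hides the non-ASCII bytes rather than removing them, which is why Apache never has to eat anything. A quick Python sketch (Python here is just for illustration; it assumes UTF-8, which is what browsers send on the wire):

```python
from urllib.parse import quote, unquote

# A non-ASCII path segment: é is two bytes (0xC3 0xA9) in UTF-8,
# and each byte gets its own percent-escape.
encoded = quote("montréal")
print(encoded)           # montr%C3%A9al
print(unquote(encoded))  # montréal
```

Note the round trip is lossless, unlike a sanitizer that silently drops characters.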

:: idly wondering which of the 26 letters your system doesn't like ::


* OK, I looked it up. s-hacek and z-hacek yes, c-hacek no. That explains it.

Whitey

1:11 am on Dec 3, 2015 (gmt 0)




Are the extra characters part of real URLs, or are they falling victim to a function that was only supposed to sanitize URLs and is now going overboard?

Wasn't sure if this question was directed at me - but from the way the current URLs are written, it looks like the system simply drops any character that isn't in the English alphabet.

How many characters are involved? Just Latin-with-diacritics? Not that it really makes any difference-- there are scads of different unicode ranges, some of them way out beyond most non-Roman scripts-- but it's visually easier when you can see it as "letter plus diacritic". And if you're really talking about place names in Europe, that's a pretty finite character set. (Place names in, say, Vietnam, and it gets messier.)

I haven't checked beyond European areas, so typically we are talking German, Scandinavian, Central European, Spanish and French - but there are others.

btw - just learned a new word, "diacritic". This is me being ignorant, but are there not standardised character conversion tables/tools that can be used to produce, or assist with, the rewrite rules that produce the URLs? Or is there a simpler way?

My objective is to issue a directive to our developers that is both do-able and explainable, plus learn along the way.

lucy24

5:04 am on Dec 3, 2015 (gmt 0)




This is me being ignorant, but are there not standardised character conversion tables/ tools that can be used to produce or assist re write rules to produce the URL's?

I'm sure there are, but I haven't personally used them :( There's no intrinsic relationship between a "base" letter and its modified forms; any given application has to be told that, for example, "ô" belongs with "o" rather than "a". (If I were giving the long version of this answer-- yes, this is the short version-- you would here get a disquisition on precombined vs. combining forms and the historical reasons for using one or the other. Well, it's not the Romans' fault that their language had fewer than 30 phonemes. We should all be so lucky.)

The problem, of course, is that one man's diacritic is another man's entirely different letter. For example æ began historically as a fusion of "a" and "e", but today it's a full-fledged independent letter in some languages' alphabets. Similarly it's no use saying that å is a modification of "a" and people will know what you mean if you just use "a". (You might convince a French person that in some circumstances é, è and ê can all be expressed as "e", but you obviously can't reduce ç to c.)
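To make the "base letter plus diacritic" idea concrete: Unicode normalization (form NFD) will split precombined letters like é and č into a base letter plus a combining mark, which you can then throw away. But exactly as noted above, it does nothing for letters like ø and æ, which are independent letters with no decomposition. A Python sketch, using the place names from the opening post:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits precombined letters into base + combining mark
    # (é -> e + U+0301 COMBINING ACUTE ACCENT), then we drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("Málaga"))      # Malaga
print(strip_diacritics("Strenči"))     # Strenci
print(strip_diacritics("Skødshoved"))  # Skødshoved -- ø survives untouched
```

So this gets you most Latin-with-diacritics cases for free, but the ø/æ/ß family still needs an explicit per-language mapping.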

I've got a feeling this comes down to a database question. I find it hard to believe that the database itself is currently ASCII-encoded; probably it could accept some non-ASCII letters, though not necessarily all of them. The question is which ones.

When I asked about URLs, I meant, for example,
example.com/hotels/montréal
as a real-life URL that someone wants to link to. That's é in the actual URL, as opposed to a simplified URL using
example.com/hotels/montreal

Or you could have
example.com/hotels/düsseldorf
vs.
example.com/hotels/duesseldorf
(I think Germans generally throw up their hands and do it this way for safety, but then German is an exceptionally easy language to work with in this respect.)
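Those language-specific conventions can't be derived from Unicode data; they have to be spelled out by hand. A hypothetical fallback table in Python -- the mappings below are illustrative, not exhaustive, and note how German ö -> oe sits alongside Danish ø -> o, which is exactly the "one man's diacritic" problem:

```python
# Illustrative only: real tables are per-language and much larger.
SUBSTITUTIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",  # German convention
    "æ": "ae", "ø": "o",  "å": "aa",             # Danish/Norwegian
    "ç": "c",  "é": "e",  "è": "e",  "ê": "e",   # French
}

def to_url_key(text: str) -> str:
    # Lowercase first, then substitute character by character,
    # passing through anything not in the table.
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in text.lower())

print(to_url_key("düsseldorf"))  # duesseldorf
print(to_url_key("Skødshoved"))  # skodshoved
```

In practice you'd probably run the NFD-style stripping first and keep a table like this only for the leftovers.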

Throughout this post I've intentionally limited examples to characters in the Latin-1 (but beyond ASCII) character set, because I know they will display correctly in the present forums.