Accessibility issues with foreign characters in folder names
I've scoured and scoured, trying to learn all I can online and talking to friends (who aren't internet savvy, but came from countries like Japan and Mexico), but so far I haven't really found a good way of doing this.
I will be having my website professionally translated into foreign languages. My domain name is a "made-up" word and will not have a translated equivalent, and to minimize hosting and domain fees, I will be housing all content on my domain.com address.
domain.com = English Main
domain.com/folder/ = English
domain.com/es/ = Spanish Main
domain.com/es/folder/ = Spanish
domain.com/ja/ = Japanese Main
domain.com/ja/folder/ = Japanese
Due to the nature of my evergreen content, I want my folder names translated as well, not simply reusing the English names. I think this would also help my foreign visitors: given the nature of my site and my advertising campaigns, I expect people to type in the folder of the content they want (in addition to people coming through normal links).
I wouldn't want foreign visitors to have to memorize English words just because that's my language. This seems to lend itself to branding difficulties.
But it seems that it's not very accessible to use foreign characters for folder names.
I don't think I want to substitute the closest Latin character, since año is a completely different word from ano. And besides, what would the Latin equivalent be for Asian and Arabic languages?
I think I might have to resort to numbers, and foreign visitors will just have to remember numbers, like:
I would love to hear the feedback of others on this.
First of all, remember that "foreign" is a matter of point of view. Letters like åäö are not "foreign" to me, although they may be to you. I have seen three different approaches:
1) Use the proper localized word for your folder name. Thus,
domain.com/year for English,
domain.com/año for Spanish, and
domain.com/år for Swedish. This way, the folder name will match an actual word (which may be used in searches, for example). It is immediately better understood by native speakers of the language in question. They don't have a hard time typing it, nor should there be any reason for confusion. If your server's system cannot handle such folder names, use mod_rewrite (or similar) to rewrite the path to a folder with the English equivalent name. By the way, there is no reason for your actual file system to have multiple folders: simply use index.html for English, index.es.html for Spanish, index.se.html for Swedish, and so on, and then use mod_rewrite or something similar to point to the correct file.
2) Use localized words, but employ the ASCII equivalent. I think this is the worst solution, as (like you already pointed out) "año" and "ano" are two different words. An overall weird solution.
3) Stick to the English names for folders and files. This is, in my opinion, a better solution than #2. It botches type-in-traffic, but that is not necessarily a big problem if you have a nice logical structure to your site.
In my opinion, and from not-so-thorough studies I have performed, solutions 1 and 3 work the best. #1 seems to be preferable from a user standpoint, although #3 performs just as well. #2 is not preferred by anyone, and it seemed to cause more confusion.
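As a rough illustration of approach 1 with a single real folder, here is a minimal Python sketch of the mapping that a rewrite rule would perform. The table `LOCALIZED_FOLDERS` and the helpers `decode_segment` and `resolve` are hypothetical names made up for this example, and the fallback from UTF-8 to ISO-8859-1 is an assumption about how differently behaving browsers might send the path, not a description of mod_rewrite itself.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical mapping: localized folder name -> file inside one real folder.
LOCALIZED_FOLDERS = {
    "year": "year/index.html",     # English
    "año": "year/index.es.html",   # Spanish
    "år": "year/index.se.html",    # Swedish
}


def decode_segment(raw: str) -> str:
    """Decode a percent-encoded path segment.

    Assumption: the client sent either UTF-8 or ISO-8859-1 bytes,
    so try UTF-8 first and fall back to ISO-8859-1.
    """
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")


def resolve(path: str):
    """Return the file to serve for a requested path, or None if unknown."""
    folder = decode_segment(path.strip("/"))
    return LOCALIZED_FOLDERS.get(folder)


print(resolve("/a%C3%B1o/"))  # UTF-8 encoded request
print(resolve("/a%F1o/"))     # ISO-8859-1 encoded request
print(resolve("/year/"))
```

If the mapping is applied as an internal rewrite rather than an external redirect, the visitor would normally keep seeing the localized path they typed.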
Thank you very much for your feedback DrDoc.
I realize now how biased the word "foreign" is. I guessed it was fine for the sake of this forum since we are all using the Latin alphabet to communicate. I now say "non-ASCII characters" instead.
I definitely agree #2 is the absolute worst approach.
I'm very interested in approach #1, as this seems ideal. I've tried experimenting with this, but I have noticed that different browsers seem to treat letters like ñ differently. I posted about that here: [webmasterworld.com...]
I'm curious about what you're saying about using one folder, but I'm not entirely clear on it. Are you saying that if the server doesn't support non-ASCII characters, I should use the rewrite to point to something like /year/index.es.html? Or that even if the server does support non-ASCII characters, it still might be preferable to rewrite them to /year/index.es.html?
I'm also wondering if the index.es.html file would show, or if it would just end up looking like /año/ or /year/ in the address bar.
Thank you very much.
In defence of transliteration (the use of non-accented US-ASCII characters instead of the correct accented versions), search engines do return very similar if not identical results with or without accents, i.e. they consider "año" and "ano" to be identical despite the fact that they are not. This is for practical reasons: many searches are done with erroneous or omitted accents.
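To make the accent-folding idea concrete, here is a small Python sketch. It is not how any particular search engine works; `fold_accents` is a hypothetical helper that simply decomposes each character and drops the combining marks.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("año"))                         # ano
print(fold_accents("año") == fold_accents("ano"))  # True
```

After folding, the two different words become indistinguishable, which is exactly the drawback raised earlier in the thread.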
When it comes to URL encoding for non-ASCII characters, it is a bit of a minefield, with certain inconsistencies and restrictions. Firstly, the early RFCs [rfc-editor.org] defining URLs only allow the use of US-ASCII, but HTML allows Unicode (from HTML 4). Therefore, for a URL to be valid, non-ASCII characters must be encoded. The encoding is done by taking the character's code point in ISO-8859-1 and converting it to its hex value preceded by a percent sign (%).
Characters which need to be encoded include a space (code point 32, converted to hex, making %20), quote marks, and the upper range of ISO-8859-1 characters (code points 128 to 255), which are outside the 7-bit ASCII range. You can consult a URL code chart [i-technica.com] for required values.
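The encoding rule described above (the ISO-8859-1 code point written as hex after a percent sign) can be tried out with Python's standard `urllib.parse.quote`. The wrapper name `encode_latin1` is just a label chosen for this sketch.

```python
from urllib.parse import quote


def encode_latin1(segment: str) -> str:
    """Percent-encode one path segment using ISO-8859-1 code points."""
    return quote(segment, safe="", encoding="iso-8859-1", errors="strict")


print(encode_latin1("año"))        # a%F1o
print(encode_latin1("my folder"))  # my%20folder

# A character outside ISO-8859-1 cannot be encoded this way.
try:
    encode_latin1("年")
except UnicodeEncodeError:
    print("not representable in ISO-8859-1")
```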
Even though HTML 4 allows the entire Unicode range of characters, the above encoding won't cover any character outside of ISO-8859-1. Unfortunately there is currently no reliable way of representing characters outside of the ISO-8859-1 character set in a URL. In the future there will be URIs, or Uniform Resource Identifiers [rfc-editor.org], and IRIs, or Internationalized Resource Identifiers [ietf.org], but we're not there yet. IRIs depend on UTF-8 encoding rather than the current US-ASCII URL structure. This, combined with IDNs (Internationalized Domain Names) based on "Punycode [en.wikipedia.org]", will finally get us to a stage where the web is truly international.
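For a feel of what the UTF-8 and Punycode approach amounts to, here is a hedged Python sketch using only the standard library. The function names `iri_path_to_uri` and `host_to_ascii` and the host `año.example` are invented for illustration, and Python's built-in `idna` codec follows the older IDNA 2003 rules, so treat the output as indicative rather than a statement of what any given browser or server does.

```python
from urllib.parse import quote


def iri_path_to_uri(path: str) -> str:
    """Percent-encode the UTF-8 bytes of a path, keeping the slashes."""
    return quote(path, safe="/")  # quote() uses UTF-8 by default


def host_to_ascii(host: str) -> str:
    """Convert an internationalized host name to its Punycode (xn--) form."""
    return host.encode("idna").decode("ascii")


print(iri_path_to_uri("/es/año/"))   # /es/a%C3%B1o/
print(iri_path_to_uri("/ja/年/"))    # /ja/%E5%B9%B4/
print(host_to_ascii("año.example"))  # xn--ao-zja.example
```

Unlike the ISO-8859-1 scheme, this covers Japanese and other scripts, at the cost of a longer and less readable encoded form.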
But after all that, where does that leave you? If you are sticking to western European languages (ISO-8859-1-based) then you can depend on URL encoding, but for other languages the situation is much harder. It is difficult, if not impossible, to have a rich URL scheme for multi-language content which conforms to current restrictions.