This is something that I wrote up for a client this morning:
Things to Think About when Internationalizing a Program
Dealing with Locale/Cultural
- Encoding Data
.NET strings are implicitly Unicode enabled. But any place that you exit .NET to store data, you must be sure to allocate and store Unicode rather than ASCII. This includes databases, registry, web services, text files, etc.
For web applications, you need to mark each page with a Unicode encoding (UTF-8 is the usual choice on the web) and send the matching headers.
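As a rough illustration of the "exit point" problem, here is a minimal sketch in Python (not .NET; the function names and the file name are made up for this example). It assumes UTF-8 as the storage encoding and shows that naming the encoding explicitly lets non-ASCII text survive a round trip through a file, while the same text cannot be squeezed into ASCII.

```python
import os
import tempfile


def save_text(path, text):
    # Name the encoding explicitly; the platform default may not be Unicode.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def load_text(path):
    # Read with the same encoding that was used for writing.
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def round_trip(text):
    """Write text to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")  # hypothetical file name
        save_text(path, text)
        return load_text(path)


def fits_in_ascii(text):
    """Return True only if every character can be stored as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False
```

The same idea applies to databases, the registry, and web services: decide on the encoding at every boundary instead of relying on a default.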
- Use Locale Model
You need to discover what language and region the person using the program has selected. In a Windows application, this is done by several .NET framework calls. In a web application, there is an HTTP header (Accept-Language) that has to be parsed.
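For the web case, the parsing could look something like the sketch below. It is written in Python rather than .NET, the function name is an assumption, and it only handles the common "tag;q=value" form of the Accept-Language header, so treat it as a starting point rather than a complete parser.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language value, best match first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        tag = tag.strip()
        params = params.strip()
        quality = 1.0  # a tag without a q value counts as fully preferred
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unreadable weight: treat as not acceptable
        if tag and quality > 0:
            # Sort by descending quality, then by original position.
            entries.append((-quality, position, tag))
    entries.sort()
    return [tag for _, _, tag in entries]
```

For a header such as "fr-CA,fr;q=0.8,en-US;q=0.6,en;q=0.4" this yields the tags in the order the user prefers them, and the application can pick the first one it supports.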
- Sorting & String Comparison
Comparing and sorting strings can be simple or complicated, depending on how many cultures are involved. For example, where diacritical marks fall in a sorting sequence depends on the language and culture. If your application deals with a single culture, sorting can be simple. If your application must deal with multiple cultures, you can have problems, because data must be re-sorted based on language and culture information. You won't believe how many hours the Access program managers talked about how you sort databases in different cultures!
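To see why a plain sort is not enough, here is a small Python sketch (not .NET). It compares a raw code-point sort with a crude accent-insensitive key built from Unicode decomposition. That key is only a stand-in to show the problem; it is not real culture-aware collation, which needs the rules for each language. The function name and word list are invented for the example.

```python
import unicodedata


def accent_insensitive_key(text):
    """Sort key that ignores accents and case (a rough approximation only)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks such as the acute accent split off from "é".
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Fall back to the original text so equal keys still sort consistently.
    return (stripped.casefold(), text)


WORDS = ["zebra", "école", "apple", "Eclair"]

# Raw code-point order puts capitals first and accented letters last.
naive_order = sorted(WORDS)

# The approximate key keeps "école" next to the other E words.
friendlier_order = sorted(WORDS, key=accent_insensitive_key)
```

Here naive_order comes out as Eclair, apple, zebra, école, while friendlier_order comes out as apple, Eclair, école, zebra.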
- Calendars
Calendars are very complicated things. They only seem simple because we've dealt with our calendar since we were young. To show how complicated they are, try to solve this question in your head: What day of the week is your birthday on in 2012? What does it take to figure that out? Figuring a date in any particular calendar isn't a problem, but if you deal with cultures that use calendars that aren't Gregorian, such as the Japanese or Taiwanese calendars, then things get somewhat more complicated. You must allow input in the different calendars and output them correctly. .NET has some pretty extensive classes for dealing with calendar issues, but I/O of the calendar data is still your responsibility.
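As a small taste of calendar conversion, the Python sketch below (not .NET; the function names are assumptions) answers the day-of-week question for a Gregorian date and converts between Gregorian years and years of the Taiwanese (Minguo) calendar, whose year 1 is 1912. Real calendar classes handle far more than this, such as eras and non-Gregorian month structures.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Minguo year 1 corresponds to Gregorian year 1912.
MINGUO_OFFSET = 1911


def weekday_name(day):
    """English weekday name for a Gregorian date."""
    return WEEKDAYS[day.weekday()]


def to_minguo(day):
    """Convert a Gregorian date to (Minguo year, month, day)."""
    if day.year <= MINGUO_OFFSET:
        raise ValueError("date is before the start of the Minguo calendar")
    return (day.year - MINGUO_OFFSET, day.month, day.day)


def from_minguo(year, month, day):
    """Convert a Minguo year, month, and day back to a Gregorian date."""
    return date(year + MINGUO_OFFSET, month, day)
```

For example, June 3, 2012 falls on a Sunday, and June 3, 2004 is year 93 in the Minguo calendar.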
- Formatting Dates
Outputting dates is different in different parts of the world. The localization settings in Windows help, but in a Web application, you must determine what format to output things in. You must be very careful about month/day/year versus day/month/year confusion. Many people have been burned because 6/3/2004 is ambiguous in an i18n context. It is either June 3rd or March 6th depending on where you are. The best practice is to output all dates in an unambiguous manner, using either the ISO format, 2004-06-03, or a spelled-out month, as in June 3, 2004.
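The ambiguity and the two unambiguous outputs can be shown with a short Python sketch (not .NET; the helper names are made up, and the month names are hard-coded in English to keep the example independent of the machine's locale).

```python
from datetime import date, datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def parse_numeric(text, day_first):
    """Parse a slash-separated date under one of the two conventions."""
    pattern = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, pattern).date()


def iso_date(day):
    """Year-month-day form, which reads the same everywhere."""
    return day.isoformat()


def spelled_out(day):
    """Month spelled out in English, for example 'June 3, 2004'."""
    return f"{MONTHS[day.month - 1]} {day.day}, {day.year}"
```

Parsing "6/3/2004" month-first gives June 3, while parsing it day-first gives March 6. Both iso_date and spelled_out leave no room for that confusion.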
- Formatting Times
Time formatting typically uses a 12 or 24 hour clock, and the same localization issues apply as with dates. In an application that works across the world, it is very important to store and move around all times in UTC (Coordinated Universal Time), because local times are ambiguous (and dates can be, too, across the International Date Line). Converting to and from local time may cause more problems than it is worth, because if end-users are talking on the phone about a display, they may be seeing different data. Whether you can get away with having end-users deal with UTC or whether times must be output as local is an important decision.
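Here is a minimal sketch of the "store UTC, convert at the edges" idea, in Python rather than .NET. It uses fixed hour offsets as a simplifying assumption, which ignores daylight saving time; a real application would use proper time zone data. The function names are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_naive, offset_hours):
    """Attach a fixed UTC offset to a local time and convert it to UTC."""
    zone = timezone(timedelta(hours=offset_hours))
    return local_naive.replace(tzinfo=zone).astimezone(timezone.utc)


def to_local(utc_time, offset_hours):
    """Convert a stored UTC time to a fixed-offset local time for display."""
    return utc_time.astimezone(timezone(timedelta(hours=offset_hours)))


# 11:30 PM on June 3 at UTC-7 is already June 4 in UTC.
stored = to_utc(datetime(2004, 6, 3, 23, 30), -7)

# The same stored instant as seen by a user at UTC+9.
seen_at_plus_nine = to_local(stored, 9)
```

Note that the two users see different calendar dates for the same instant, which is exactly the phone-call problem described above.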
- Currency
Currency is a big can of worms. How are input, output, storage, and conversion handled? There is no universal base monetary unit, so all money must be stored in a particular currency. Is storage done in one particular currency (such as US Dollars) or multiple (Dollars, Euros, and Yen)? Is conversion between currencies done at the moment of input, the moment of output, or some other time (such as midnight Eastern Time)? Where is the exchange rate data coming from, and how often is it updated? Is it important to be especially precise? A difference of .00001 in the exchange rate can make a difference if you are dealing with millions of dollars. If conversion is done at the moment of input, the exchange rates at that moment may need to be stored, depending on the application. Input and output are also complicated. Do you need to allow for input in multiple currencies? This requires some UI design consideration. Output is simpler, but also has considerations: . versus , as separators, for example. .NET has parsing and formatting functions that help here.
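One way to keep these questions honest in code is to never store a bare number. The Python sketch below (not .NET) pairs every amount with its currency, uses decimal arithmetic instead of floating point, and keeps the rate that was used with the result. The type names and the exchange rates are hypothetical, and rounding to two decimal places is an assumption that does not hold for every currency.

```python
from decimal import Decimal, ROUND_HALF_EVEN
from typing import NamedTuple


class Money(NamedTuple):
    amount: Decimal
    currency: str  # for example "USD" or "EUR"


class Conversion(NamedTuple):
    source: Money
    result: Money
    rate: Decimal  # stored so the conversion can be audited later


def convert(source, target_currency, rate):
    """Convert an amount at the given rate, rounding to two decimal places."""
    converted = (source.amount * rate).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_EVEN)
    return Conversion(source, Money(converted, target_currency), rate)


million = Money(Decimal("1000000.00"), "USD")

# Two made-up rates that differ only in the fifth decimal place.
first = convert(million, "EUR", Decimal("0.81234"))
second = convert(million, "EUR", Decimal("0.81235"))
difference = second.result.amount - first.result.amount
```

On a million dollars, that tiny difference in the rate moves the result by 10.00, which is why the source and precision of the rate matter.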
- Formatting Numbers
When outputting numbers, the localized settings may have to be addressed (. versus , as separators, where minus signs go, etc.). For a Windows application, this is easy. For a web application, the proper format must somehow be stored and retrieved to properly output the number.
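For the web case, "stored and retrieved" could be as simple as a lookup table keyed by culture. The Python sketch below (not .NET) assumes a tiny hand-written table with only two cultures and handles only the decimal and grouping separators, not the placement of the minus sign.

```python
# Hypothetical per-culture settings: (decimal separator, grouping separator).
SEPARATORS = {
    "en-US": (".", ","),
    "de-DE": (",", "."),
}


def format_number(value, culture, places=2):
    """Format a number using the separators stored for a culture."""
    decimal_sep, group_sep = SEPARATORS[culture]
    # Start from the English-style form, for example "1,234,567.89".
    text = f"{value:,.{places}f}"
    # Swap both separators in one pass so they cannot overwrite each other.
    return text.translate(str.maketrans({",": group_sep, ".": decimal_sep}))
```

With this table, 1234567.891 comes out as 1,234,567.89 for en-US and 1.234.567,89 for de-DE.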
- Addresses
There are UI and storage considerations for addresses, especially postal codes. Countries don't all use the U.S. Zip Code format. If your I/O and storage expect the 99999-9999 format and get a Canadian A9A 9A9 postal code, will they choke?
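A validation routine that does not choke might look like the Python sketch below (not .NET). The pattern table is an assumption that covers only the two formats mentioned above; for any country it does not know, it accepts whatever non-empty text the user typed instead of rejecting it.

```python
import re

# Patterns for the two formats discussed above; other countries are not listed.
POSTAL_PATTERNS = {
    "US": re.compile(r"\d{5}(-\d{4})?"),
    "CA": re.compile(r"[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d"),
}


def is_valid_postal_code(code, country):
    """Check a postal code against its country's pattern, if one is known."""
    code = code.strip()
    pattern = POSTAL_PATTERNS.get(country)
    if pattern is None:
        # Unknown country: do not guess at the format, just require something.
        return bool(code)
    return pattern.fullmatch(code) is not None
```

The storage side needs the same care: a column sized for ten digits and a hyphen will not hold every country's codes.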
- Telephone Numbers
Same issue as addresses. Telephone numbers vary across the world.
- Paper Size
Paper sizes in the U.S. are typically Letter (8.5" x 11") or Legal (8.5" x 14"), whereas the rest of the world mostly uses ISO formats such as A4. You must be very careful when printing reports to allow for the different sizes of paper, otherwise truncation or excessive white space may occur. Selecting paper sizes may be important.
- Measurement Units
The U.S. largely uses English measurements, whereas most of the rest of the world uses Metric measurements. How are I/O and storage of units accomplished? Do you store everything in Metric and convert as necessary, or do you store the size along with a separate field for the unit? You don't want your spacecraft crashing into Mars because of a screw-up here!
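Here is a minimal Python sketch (not .NET) of the first option: store everything in Metric and convert only at input and output. The function names and the "metric"/"us" labels are made up for the example. The conversion factor itself is exact, since a foot is defined as 0.3048 meters.

```python
METERS_PER_FOOT = 0.3048  # exact by definition


def feet_to_meters(feet):
    """Convert user input in feet to the stored unit (meters)."""
    return feet * METERS_PER_FOOT


def meters_to_feet(meters):
    """Convert a stored value in meters to feet for display."""
    return meters / METERS_PER_FOOT


def display_length(meters, system):
    """Render a stored length in the user's measurement system."""
    if system == "metric":
        return f"{meters:.2f} m"
    if system == "us":
        return f"{meters_to_feet(meters):.2f} ft"
    raise ValueError(f"unknown measurement system: {system}")
```

The alternative, storing a value plus a unit field, avoids rounding on repeated conversions but forces every calculation to check the unit first.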
- Word Length
Different languages have different requirements for word length. English is among the most compact Roman alphabet languages for expressing a given word, whereas Finnish typically requires about 50% more characters to say the same thing. This has some serious repercussions when designing the UI of a program or web page. If you create a nice looking screen in English, it may come out truncated or wrapped when expressed in Finnish. On the other hand, if you allow enough space to express the UI in Finnish, it may come out with unacceptable white space in English. In that case, multiple UIs may be necessary to account for different word lengths, or the UI may have to dynamically move and resize elements on the dialog.
- Fonts
Output of Unicode requires the proper font to be installed on the end-user's system, otherwise the output will look like a bunch of gibberish. How are you going to ensure that the proper font is installed, and how are you going to keep the end-user from uninstalling it? It may be worthwhile in a Windows application to check that the proper font is installed at startup time. In a web application, there is not much that you can do here, unless you want to put all the data into PDF files or GIF files.
- Keyboard Input
Fortunately, Windows properly handles input from the various keyboards and international characters.
- RTL and Vertical Scripts
Some languages write right-to-left or vertically instead of left-to-right. Frequently Windows and .NET take the hassle of dealing with these away from the application, but you may need different UIs to deal with these. There are many places where .NET has a property that you can set to right-to-left to accommodate these languages.
- Isolate Localizable Strings
When you internationalize an application, you need to isolate all the strings that need to be localized in a small number of places. This means that an application cannot embed English strings in its code. All strings must be stored in a resource file, XML file, database, or other common location. All dialogs must either have their strings pulled from this repository, or be individually recreated. The people translating the program into various languages do not want to look at code. .NET provides a number of tools, such as internationalized resource files, to help with this process. Access 1.0 shipped within 40 days of the English version in three other languages because we took this into account. The earliest Microsoft had ever shipped another language edition of a program before that was six months.
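The repository idea can be sketched in a few lines of Python (not .NET resource files). The catalog below is an in-memory dictionary standing in for whatever store you choose, and its keys and translations are invented for the example. The lookup falls back from a specific culture to its base language and then to a default language.

```python
# Hypothetical string catalog; in practice this lives outside the code.
CATALOG = {
    "en": {"greeting": "Hello", "farewell": "Goodbye"},
    "de": {"greeting": "Hallo"},
}


def lookup(key, culture, default_language="en"):
    """Find a string for a culture, falling back to less specific choices."""
    # For "de-AT" the candidates are "de-AT", then "de", then the default.
    candidates = [culture, culture.split("-")[0], default_language]
    for name in candidates:
        text = CATALOG.get(name, {}).get(key)
        if text is not None:
            return text
    raise KeyError(key)
```

A string missing from the German table quietly falls back to English here, which is convenient for users but is also exactly the kind of gap that localization testing needs to catch.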
- Handling Strings
Be very careful about string concatenation. Different languages use different word order than English. For example, English uses Subject-Verb-Direct Object, but a Maya language uses Verb-Direct Object-Subject. If you concatenate strings together to form, say, an error message, then when the string is internationalized, it may become gibberish. .NET provides a String.Format method that is designed to handle this, but it must be consistently used.
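The sketch below shows the same idea in Python, whose str.format plays a role similar to String.Format in .NET. The message templates are invented for the example. Because each translation is a whole sentence with named placeholders, the translator can put the pieces in whatever order the language needs, which concatenation cannot do.

```python
# One complete template per language; placeholders may appear in any order.
TEMPLATES = {
    "en": "{user} deleted the file {name}.",
    "de": "Die Datei {name} wurde von {user} gelöscht.",
}


def deletion_message(language, user, name):
    """Build the message from a whole-sentence template, not from fragments."""
    return TEMPLATES[language].format(user=user, name=name)
```

In the German template the file name comes before the user, the reverse of the English order, and the calling code does not have to know or care.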
- Other UI Items
Besides the other considerations above, avoid putting text into graphics. An icon is an icon, not a billboard. If text is placed into graphics, then those graphics must then be internationalized, as well. .NET also provides things like multiple Tooltip controls that can be internationalized easily.
- Testing
The application must be localized into at least one other language to test whether all the interface items that should be localized have been localized, and things that shouldn't be localized haven't. Then, as development continues, it needs to be rechecked regularly. I'll give a great trick for this that Microsoft uses: develop a tool that automatically localizes the English version into another language. You may think this is impossible because of the variability of different languages, but there is one language that has consistent rules for conversion from English: Pig Latin! It also has the added benefit that it adds two letters per word, making the word length more like Finnish. There is a Pig Latin version of most Microsoft programs somewhere. Testing may take considerable time.
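A toy version of such a tool fits in a few lines of Python. This is only my sketch of the idea, not Microsoft's tool, and the function names are made up. It uses one simple Pig Latin variant: leading consonants move to the end and "ay" is added, and words that start with a vowel just get "ay". It also leaves {placeholders} untouched so formatted messages still work.

```python
import re

# Match either a {placeholder} (left alone) or a run of letters (translated).
_TOKEN = re.compile(r"\{[^{}]*\}|[A-Za-z]+")
_VOWELS = "aeiouAEIOU"


def _pig_word(word):
    # Find where the leading consonant cluster ends.
    split = 0
    while split < len(word) and word[split] not in _VOWELS:
        split += 1
    result = (word[split:] + word[:split] + "ay").lower()
    # Keep an initial capital so sentence starts still look right.
    if word[0].isupper():
        result = result.capitalize()
    return result


def pseudo_localize(text):
    """Turn English UI text into Pig Latin, preserving {placeholders}."""
    def replace(match):
        token = match.group(0)
        return token if token.startswith("{") else _pig_word(token)
    return _TOKEN.sub(replace, text)
```

Run over the string repository, this turns "Save file {name}?" into "Avesay ilefay {name}?". Any plain English that still shows up in the UI afterward is a string that never made it into the repository.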
Internationalizing a program is a non-trivial exercise, and planning to internationalize should be done from the earliest stages of design on a project. If done with proper planning, the cost can be small, maybe 15% more effort. If done after the fact, it may take a total redesign from the ground up to redo the application.
© 2004 by Greg Reddick (aka Xoc), permission explicitly given for use on WebmasterWorld