Which ISO format should I use to store a user's language code? - internationalization

Should I use ISO 639-1 (2-letter abbreviation) or ISO 639-2 (3-letter abbreviation) to store a user's language code? Both are official standards, but which is the de facto standard in the development community? I think ISO 639-1 would be easier to remember, and is probably more popular for that reason, but that's just a guess.
The site I'm building will have a separate site for the US, Brazil, Russia, China, & the UK.
http://en.wikipedia.org/wiki/ISO_639

You should use IETF language tags because they are already used for HTTP/HTML/XML and many other technologies. They are based on several standards including the ISO-639 collection (yes language, region and culture selection are not so simple to define).
I wrote a more detailed article regarding the proper language code selection and usage. The idea is to use the simplest, shortest ISO 639-1 codes and to add more detail only for special cases. Inside the article there are codes for the ~30 most used languages, with reasons why I consider one alternative better than another.
In case you want to skip reading the entire article here is a short list of language codes (not to be confused with country codes): ar, cs, da, de, el, en, en-gb, es, fr, fi, he, hu, it, ja, ko, nb, nl, pl, pt, pt-pt, ro, ru, sv, tr, uk, zh, zh-hant
The following points may not be obvious but should be borne in mind:
en is used for en-us (American English); British English uses en-gb
pt is used for pt-br, and not for pt-pt, which has far fewer speakers
zh is used instead of zh-hans, zh-CN,...
zh-hant (Traditional Chinese) is used instead of more specific codes like zh-hant-TW or zh-TW
You can find more explanations inside the article.
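As a rough illustration of the four points above, here is a minimal Python sketch that collapses a more specific tag to the short code from the list. The PREFERRED table and the to_short_code name are my own assumptions for this example; they only encode the rules stated in this answer and are not taken from any standard or library.

```python
# Hypothetical mapping that encodes only the rules listed in this answer.
PREFERRED = {
    "en-us": "en",
    "pt-br": "pt",
    "zh-hans": "zh",
    "zh-cn": "zh",
    "zh-hant-tw": "zh-hant",
    "zh-tw": "zh-hant",
}


def to_short_code(tag: str) -> str:
    """Return the short code to store for `tag` (lowercase, dash-separated)."""
    # Accept both "en_US" and "en-US" spellings.
    normalized = tag.strip().replace("_", "-").lower()
    # Tags without a rule (e.g. "en-gb", "pt-pt") are stored unchanged.
    return PREFERRED.get(normalized, normalized)
```

Anything that is not in the table, such as en-gb or pt-pt, passes through unchanged, which matches the idea of specifying more only for special cases.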

I would go with a derivative of ISO 639. Specifically I like to use this: http://en.wikipedia.org/wiki/IETF_language_tag

I'm no expert, but every site I've ever seen uses ISO 639-1, including the current site I'm working on.
It works for us!

I've only ever seen 2-character language codes in use - so I'd recommend going with them unless your work involves delving into linguistics in some way. If all you're doing is customizing the browsing experience for the world at large, you won't need the extra repertoire offered by 3-character codes.

ISO 639-1 Alpha-2 are used pretty much universally.
They are used for example in HTTP content negotiation. If you ever wondered how an international website can automatically show you their homepage in your native language, that's how it works. (Although it's sometimes kinda annoying. I, for example, often get shown the default Apache homepage in German, because the webmaster turned on content negotiation, but only put content for English in.)
Most web browsers use them directly in their settings dialog box.
Most operating systems use them in their settings dialog boxes or configuration files.
Wikipedia uses them in their server names for the different language versions.
In other words: if your users aren't native English speakers, they will probably already have encountered them when configuring their software, because otherwise they wouldn't be able to use their computers.
The other members of the ISO 639 family are mostly of interest to linguists. Unless you expect Jesus Christ himself (ISO 639-2 Alpha-3 code arc) to visit your website, or maybe Klingons (tlh), ISO 639-1 has more languages than you ever can hope to support.
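To make the content negotiation part a bit more concrete, here is a small Python sketch of how a server might read the Accept-Language header and pick a language. The function names and the supported list are assumptions for the example. It only looks at the primary subtag and ignores the "*" wildcard, so treat it as a sketch rather than a complete implementation of the HTTP rules.

```python
def parse_accept_language(header: str) -> list[str]:
    """Return the language tags from an Accept-Language header, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat malformed weights as "not acceptable"
        if quality > 0:
            # Sort by descending quality, then by original position.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def pick_language(header: str, supported: list[str], default: str = "en") -> str:
    """Pick the first requested language whose primary subtag is supported."""
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0].lower()
        if primary in supported:
            return primary
    return default
```

With a browser set to German first and English second, pick_language("de-DE,de;q=0.9,en;q=0.8", ["en", "fr"]) falls through to "en", because no German content is available.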

Related

Is there a standard computer vocabulary for German? for Spanish?

I was given the task of coming up with shorter German words for the German version of our software.
It got me to thinking that there should be some sort of standard vocabulary for information technology somewhere. Like there "have to be" terms that most (if not all) German computer users use for what English-speakers call file, database, record, search, search terms, search hits, find and replace, delete, OCR ... you get the idea.
I found ISO 2382 on the ISO Web site, but it only seems to standardize English and French. Is there an equivalent standard for German? How about for Spanish, or for other languages?
I may suggest this book. Although quite dated, it was an attempt to come up with a set of standard computer terms for translating from German to English and back:
Grosses IWT-Wörterbuch der Computertechnik und der Wirtschaftsinformatik. Englisch-Deutsch. Deutsch-Englisch
I will offer up the answer, "no".
Even within English, there are no standard words to describe computer operations as you have presented them. Certainly one can "delete" a file, but one can also "erase" it, "remove" it, or (shudder) "move it to the trash can".
Instead of trying to solve the problem in the large, I suggest you solve the problem in the small. Build a glossary of commonly used German words, and whenever there is an opportunity to expand the Glossary, first look over the existing entries and do your best to reuse the current terminology.
In a way, the reason good English documentation works well is that good writers of English use a glossary-like technique, explicitly or implicitly. In the event that much of your documentation comes from a single source, or a related set of sources, you can make a "translation map" of "when they say X, we say Y". But even such simplifications often require native readers to re-read the translation in context, as languages are not nearly regular enough to do simple substitution without many pitfalls.
As a starting point, The Open Group (www.opengroup.org) seems to have defined glossaries as part of their work on The Open Group Architecture Framework (TOGAF), which appear to be the sort of thing I needed. For example, these document numbers and titles are taken directly from their Web site:
C148 TOGAF® 9.1 Translation Glossary: English – Hrvatski (Croatian)
C149 TOGAF® 9.1 Translation Glossary: English – Castilian Spanish
C146 TOGAF® 9.1 Translation Glossary: English – Portuguese (Portugal)
C13H TOGAF® 9.1 Translation Glossary: English – Slovak

Language codes for simplified Chinese and traditional Chinese?

We are creating multi-language subsites on our website.
I would like to use the 2-letter language codes. Spanish and French are easy. They will get URLs like:
mydomain.com/es
mydomain.com/fr
but I run into a problem with Traditional and Simplified chinese. Are there standards for which 2 letter codes to use for these languages?
mydomain.com/zh
mydomain.com/?
@dkarp gives an excellent general answer. I will add some additional specifics regarding Chinese:
There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc). The standard way to distinguish these would be with a country code, e.g. zh_CN for mainland China, zh_SG for Singapore, zh_TW for Taiwan, or zh_HK for Hong Kong.
Mainland China and Singapore both use simplified characters, and the others use traditional characters. Since China and Taiwan are the two with the biggest populations, just zh_CN and zh_TW are often used to distinguish the simplified and traditional character versions of a website.
More technically correct but not commonly used in practice, however, would be to use zh_HANS for (generic) simplified Chinese characters, and zh_HANT for traditional Chinese characters, except for rare cases when it is meaningful to distinguish different countries.
There is indeed a standard representation for this. As people have run into the exact same problem you are seeing -- same language, but different dialects or characters -- they've extended the two-letter language code with a two-letter region code. So you might have a universal French page at mydomain.com/fr, but internationalizing for French Canadian readers might leave you with a mydomain.com/fr_CA (Canada) and mydomain.com/fr_FR (France). Some platforms use a dash instead of an underscore to separate the language and region codes (hence fr-CA and fr-FR).
The standard locale for simplified Chinese is zh_CN. The standard locale for traditional Chinese is zh_TW.
I hesitate to point you towards the actual BCP 47 standards documents, as they're, uh, a little heavy on the detail and a little light on the readability. Just go with standard locale identifiers, like the ones used by Java, and you'll be fine.
I'm just going to leave this here.
CODE         LANG      FORM              REGION
zh           Chinese   -                 -
zh_Hans      Chinese   Han Simplified    -
zh_Hans_CN   Chinese   Han Simplified    China
zh_Hans_HK   Chinese   Han Simplified    Hong Kong SAR China
zh_Hans_MO   Chinese   Han Simplified    Macau SAR China
zh_Hans_SG   Chinese   Han Simplified    Singapore
zh_Hant      Chinese   Han Traditional   -
zh_Hant_HK   Chinese   Han Traditional   Hong Kong SAR China
zh_Hant_MO   Chinese   Han Traditional   Macau SAR China
zh_Hant_TW   Chinese   Han Traditional   Taiwan
Language is dependent upon where it is spoken (doh!), so language and locale codes reflect that reality. zh is the basic language code, but because there are two major forms of it, there are zh_Hans and zh_Hant, but they are still only language codes, not locales.
Location-specific
To fully specify which language is used in a particular location, the country code still has to be suffixed, giving zh_Hans_HK and zh_Hant_HK for simplified and traditional Chinese, respectively, both as spoken in Hong Kong.
Actually, the reality is that something more specific than country code is often required in many countries, but that is likely to exponentially increase the complexity and maintenance of databases like CLDR, plus the support infrastructure to feed into it, like IP to location details extraction, is not generally available or accurate enough.
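To show how the language, form and region parts of these codes fit together, here is a simplified Python sketch that splits a code into its pieces. LanguageTag and split_tag are names I made up for the example, and the length-based rules (4 letters for a script, 2 letters for a region) are a deliberate simplification that ignores numeric regions, variants and extensions.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def split_tag(tag: str) -> LanguageTag:
    """Split e.g. "zh_Hans_HK" or "zh-tw" into language, script and region."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        # A script is four letters and comes before any region.
        if len(part) == 4 and part.isalpha() and script is None and region is None:
            script = part.title()
        # A region is two letters here (numeric regions are ignored).
        elif len(part) == 2 and part.isalpha() and region is None:
            region = part.upper()
    return LanguageTag(language, script, region)
```

For example, split_tag("zh_Hans_HK") gives language "zh", script "Hans" and region "HK", while split_tag("zh") leaves both script and region empty.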
Fixed text
Now, if the code is just to specify which set of fixed strings to use in the user interface, or even whole pages sets on a site, a country suffix is not really necessary, unless there are more than a few places where the language varies significantly enough (location-based info) to bother creating a whole separate resource set.
The larger the resource set, the more likely that a language code based upon locale [in this context, just a language attribute, rather than a true locale, so you can call it what you like!] will be required, but at least you only have to do that when necessary.
On-the-fly values
However, if you want to format particular variable values, like dates, times, currencies and numbers, on-the-fly, locales become important, because all the tools that support such functionality (like those based upon Unicode CLDR data) expect them. The locale for these needs to be a separate setting from the code that selects the in-house-generated UI language, unless you want to create a resource set for every known locale, and maintain them ad nauseam!
Browser language tools
Note that when a locale is specified for editable parts of a web page, such as input boxes, and spellchecking has been enabled for the field (in attributes or CSS), the browser's language tools will spellcheck the field according to that locale.
Criteria
You have to be clear about what the resource set is providing, so consider:
Fixed strings? Language only.
Formatting on-the-fly? Locale.
Spellchecking in the viewing environment? Locale.
Whole pages/subsite? Language only, else locale (as a language variant) if significantly different content required.
Spreadsheet to minimise maintenance overhead
I use a spreadsheet to hold UI strings where each language code has a parent code, so that the cell for its version of a string has a formula that gets its string from the parent. To create a custom string for that language and string, I just overwrite the cell formula with the exact text. That minimises the amount of resource maintenance. I run a macro at the end that generates a complete resource file for each language.
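The same parent-code idea can be sketched outside a spreadsheet. The following Python is only an illustration under my own assumptions: the PARENTS and STRINGS tables, the sample keys and the function names are invented for the example, and the root language ("en" here) is assumed to define every key.

```python
from typing import Optional

# Hypothetical parent chain: each code inherits from its parent.
PARENTS: dict[str, Optional[str]] = {"de_AT": "de", "de": "en", "en": None}

# Only strings that differ from the parent are stored for each code.
STRINGS: dict[str, dict[str, str]] = {
    "en": {"save": "Save", "cancel": "Cancel", "january": "January"},
    "de": {"save": "Speichern", "january": "Januar"},
    "de_AT": {"january": "Jänner"},
}


def resolve(code: str, key: str) -> str:
    """Walk up the parent chain until a string for `key` is found."""
    current: Optional[str] = code
    while current is not None:
        text = STRINGS.get(current, {}).get(key)
        if text is not None:
            return text
        current = PARENTS.get(current)
    raise KeyError(key)


def build_resource_file(code: str) -> dict[str, str]:
    """Generate the complete string set for one code, like the macro does."""
    return {key: resolve(code, key) for key in STRINGS["en"]}
```

Here de_AT only overrides "january"; everything else comes from de, and "cancel" falls all the way back to en because nobody has translated it yet.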

Steps to develop a multilingual web application

What are the steps to develop a multilingual web application?
Should I store the language texts and resources in a database, or should I use property files or resource files?
I understand that I need to use CurrentCulture with C# along with CultureFormat etc.
I wanted to know your opinions on the steps to build a multilingual web application.
Doesn't have to be language specific. I'm just looking for steps to build this.
The specific mechanisms are different depending on the platform you are developing on.
As a cursory set of work items:
Separation of code from content. Generally, resources are compiled into assemblies with the help of resource files (in dot net) or stored in property files (in java, though there are other options), or some other location, and referred to by ID. If you want localization costs to be reasonable, you need to avoid changes to the IDs between releases, as most localization tools will treat new IDs as new content.
Identification of areas in the application which make assumptions about the locale of the user, especially date/time, currency, number formatting or input.
Create some mechanism for locale-specific CSS content; not all fonts work for all languages, and not all font-sizes are sane for all languages. Don't paint yourself into a corner of forcing Thai text to be displayed in 8 pt. Also, text directionality is going to be right-to-left for at least two languages.
Design your page content to reflow or resize reasonably when more or less content than you expect is present. Many languages expand 50-80% from English for short strings, and 30-40% for longer pieces of content (that's a rough rule of thumb, not a law).
Identify cultural presumptions made by your UI designers, and try to make them more neutral, or, if you've got money and sanity to burn, localizable. Mailboxes don't look the same everywhere, hand gestures aren't universal, and something that's cute or clever or relies on a visual pun won't necessarily travel well.
Choose appropriate encodings for your supported languages. It's now reasonable to use UTF-8 for all content that's sent to web browsers, regardless of language.
Choose appropriate collation for your databases, or enable alternate collations, if you are dealing with content in multiple languages in your databases. Case-insensitivity works differently in many languages than it does in English, and accent insensitivity is acceptable in some languages and generally inappropriate in others.
Don't assume words are delimited by spaces or that sentences are delimited by punctuation, if you're trying to support search.
Avoid:
Storing localized content in databases, unless there's a really, really, good reason. And then, think again. If you have content that is somewhat dynamic and representatives of each region need to customize it, it may be reasonable to store certain categories of content with an associated locale ID.
Trying to be clever with string concatenation. Also, try not to assume rules about pluralization or counting work the same for every culture. Make sure, at least, that the order of strings (and controls) can be specified with format strings that are typical of your platform, or well documented in your localization kit if you elect to roll your own for some reason.
Presuming that it's ok for code bugs to be fixed by localizers. That's generally not reasonable, at least if you want to deliver your product within a reasonable time at a reasonable cost; it's sometimes not even possible.
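As an illustration of the point about string concatenation and format strings, here is a small Python sketch. The TEMPLATES table and the deleted_message function are hypothetical names for the example. It uses named placeholders so that each translation can put the values in whatever order its grammar needs; it does not attempt to handle pluralization.

```python
# Hypothetical message catalog: one complete sentence per language,
# with named placeholders instead of concatenated fragments.
TEMPLATES = {
    "en": "{file} was deleted by {user}.",
    "de": "{user} hat {file} gelöscht.",
}


def deleted_message(lang: str, file: str, user: str) -> str:
    """Format the message for `lang`, falling back to English."""
    template = TEMPLATES.get(lang, TEMPLATES["en"])
    # Named arguments let the translator reorder the values freely.
    return template.format(file=file, user=user)
```

Concatenating file + " was deleted by " + user would hard-code the English word order; with the template approach the German string simply puts {user} first.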
The first step is to internationalize. The second step is to localize. The third step is to translate.

Displaying language lists: Which language should I use? [closed]

Every once in a while I'm confronted with displaying a list of available languages, and each and every time I ask myself:
Is it better to display the language in:
the currently selected language
English
in the language according to the button/list item
Examples:
English
German
French
or
English
Deutsch
Français
Is there any convention on which one should be used, is more polite or better in any other way? Are there other options?
I would say it's best to display the language in "its own language" (option #3). You can not necessarily expect the user to know the currently selected language, nor expect him to know English.
What's tricky is how to display the "Select your language" button in a language-neutral way. I usually go for a flag indicating the current language, since that tends to get the message across even though there's not always a 1:1 mapping between countries and languages.
I definitely think you should display in the language that matches the item in the button list.
Reasons:
If it's not the language you're interested in, you won't mind if you don't understand it, as long as you can find your own language.
Think about the last time you called customer service. How many times have you heard something like, "Para Espanol, marque dos"? It's very common, accepted practice to mix different languages in one UI (whether visual or audible).
Think about how you'd feel if you went to a Spanish site, and you couldn't find your language under "E". Maybe, eventually, you'd notice "Ingles", and think it probably translated to "English", but it's definitely better to save the user the trouble of translating and mentally alphabetizing.
The standard (in both senses of the word, i.e. what is actually used in the real world, and what the IETF/W3C/ISO says) is to use ISO 639-1 Alpha-2 language codes, possibly augmented with the full name of the language in English, its name in the language itself, a romanized transliteration of that name, or any combination thereof.
So, to keep with your example:
[de] German - Deutsch
[en] English
[fr] French - Français
[ja] Japanese - 日本語 (Nihongo)
Two options, first the name of the language in the selected locale or English, then the name of the language in itself between parens, or the other way around, e.g.:
English
French (Français)
German (Deutsch)
Spanish (Español)
or
English
Français (French)
Deutsch (German)
Español (Spanish)
English language name:
Pros:
Predictable sorting.
No need to think about different text flows.
Cons:
Users who don't speak English might have a harder time finding their language.
If the rest of the application is translated, it might look sloppy or grammatically wrong: Ditt språk är English/votre langue est English.
Language in its own name:
Pros:
Easier for the non-English speaker.
You have to think about encoding and text flow; A useful exercise. :-)
Cons:
Harder to navigate if the user is used to English or has her mind set on finding an English name.
You have to consider all language variants.
What is right really depends on the rest of your application. You might want to consider having all language names translated to all languages. If English is chosen, then you get to pick from:
English
Swedish
French
If Swedish:
Engelska
Svenska
Franska
...and French:
Anglais
Suédois
Français
But then the translation problem has turned from O(n) to O(n^2), which might be acceptable depending on what your current value of n is.
EDIT
As deceze points out, you will also have to handle the case when a user accidentally switches to a language she doesn't understand, and provide a way back, for example by always including a few major languages.
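To put numbers on the O(n) versus O(n^2) trade-off, here is a short Python sketch using the three languages from this answer. The NAMES table is filled with the names listed above; the function names and the "own name (translated name)" label format are my own assumptions for the example.

```python
# NAMES[ui_lang][code] is the name of language `code` as written in `ui_lang`.
NAMES = {
    "en": {"en": "English", "sv": "Swedish", "fr": "French"},
    "sv": {"en": "Engelska", "sv": "Svenska", "fr": "Franska"},
    "fr": {"en": "Anglais", "sv": "Suédois", "fr": "Français"},
}


def selector_labels(ui_lang: str) -> list[str]:
    """Label each language with its own name plus its name in `ui_lang`."""
    labels = []
    for code in NAMES:
        own = NAMES[code][code]
        translated = NAMES[ui_lang][code]
        labels.append(own if own == translated else f"{own} ({translated})")
    return labels


def names_to_maintain(n: int, translate_all: bool) -> int:
    """n names if each language is shown in itself, n * n if fully translated."""
    return n * n if translate_all else n
```

With three languages that is 3 names against 9; with 30 languages it is 30 against 900, which is where the "current value of n" starts to matter.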
I find it harder to find "Magyar" in a list of languages.
Because there are languages with non-latin character set, this is not a simple first-letter-lookup, as I lose focus when I first meet one of these.
Where should I look? At 'M' - Magyar? But where is M? EDIT: M in the (current language's) alphabet, not on the keyboard.
Have a look at this (from Wikipedia):
Български - I know, this is Bulgarian, but
བོད་ཡིག - what is this?
Bosanski
Català
Česky
Dansk
Deutsch
Ελληνικά
I would prefer something like this:
A...
B...
C...
.
.
Hungarian (Magyar)
If the UI were Japanese, I would be Ctrl+F-ing for "Magyar", though.
Whatever you do don't use the IP location to set the language.
Google is very annoying about this: when logging on from a new location I get Google in the local language and script. This is really annoying, particularly anywhere southeast of Croatia.
The worst offender, though, is Microsoft. When you try to purchase software, their servers keep switching languages depending on your location, which in many cases makes it impossible to pay for anything by credit card, because addresses, zip codes etc. are validated in the local format and not in the format of the country where your credit card was actually issued. (By the way, MS, the first four digits of a credit card number indicate the issuing institution, which is tied to a particular country, so it's not rocket science to work out that a UK postcode format is required rather than, say, a six-digit German ZIP code.)
Use country-flags in combination with the language name in that language (Deutsch, Francais, Nederlands, ...).
I don't know of any programming-related conventions about this, but I would prefer to see the name of a language in its own language.
For example:
English
Türkçe
Deutsch
Have a look at your Regional Settings.
This is how Microsoft implemented it. Seems like your version 1.

What are the best practices for multilanguage sites?

I want to make a multi-language site, such that all or almost all pages will be available in 2 or more translations. What are the best practices to follow?
For example, I consider these language selection mechanisms:
Cookie-based selection of the preferred language.
Based on Accept-Language header if the cookie is not set.
Based on GeoIP otherwise (probably).
Is there anything else?
How should different translations be served?
as LANG.example.com/page
as example.com/LANG/page
as example.com/page?hl=LANG
...
any of the above with a redirect to example.com/page? (It seems to be discouraged)
How to ensure that all the translations are properly indexed?
Sitemaps with all pages + correct Content-Language header are enough?
What is the best way to let the users know there are other translations, but do not distract them?
list available languages in the header/footer/sidebar (like Wikipedia)
put “Choose a language” selector next to the content
What is the best policy to deal with missing/outdated translations?
do not display missing pages at all or display a page in a different language?
display old translation, old translation with a warning or a page in a different language?
What else should I take into account? What should I do and what I definitely should not?
In addition to @Quassnoi's answers, ensure that you use standard RFC 4646 language identifiers (e.g. EN-US, DE-AT); you may already be aware of this. The CLDR project is an excellent repository of internationalization data (the Supplemental Data is really useful).
If a translation of a specific page is not available, use a language fallback mechanism back to the neutral language; for example "DE-AT", "DE", "" (neutral, e.g. "EN").
Most recent browsers and the underlying operating systems will correctly show all of the characters required for a locale selector list if the page is encoded correctly (I'd recommend all pages being UTF-8). Ensure that the locale list contains both the native and current-language names to allow both native and non-native speakers to view the specified translations, e.g. "Deutsch (German)" if the current locale is EN-*.
A lot of sites use a flag icon to show the current locale, but this is more relevant to the location and some people may be offended if you show only a dominant flag (e.g. the US or UK flag for English).
It may be worthwhile to have a more visible (semi-graphical) locale selector on the home page if no locale cookie has been submitted, using a combination of GeoIP and Accept-Language to determine the default locale choice.
Semi-related: if your users are located in different time zones, include a zone preference in their account profile for displaying time values in their local time. And store all time stamps using UTC.
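The language fallback mentioned in this answer ("DE-AT", then "DE", then the neutral language) could be sketched like this in Python. The function names and the choice of "EN" as the neutral language are assumptions for the example, and tags are compared exactly as given, without case normalization.

```python
from typing import Optional


def fallback_chain(tag: str, neutral: str = "EN") -> list[str]:
    """Return the tag, its shorter parents, and finally the neutral language."""
    parts = tag.replace("_", "-").split("-")
    # Drop one subtag at a time from the right: "DE-AT" -> "DE".
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if neutral not in chain:
        chain.append(neutral)
    return chain


def find_translation(tag: str, available: set[str]) -> Optional[str]:
    """Return the first language in the chain that has a translation."""
    for candidate in fallback_chain(tag):
        if candidate in available:
            return candidate
    return None
```

So a request for "DE-AT" against pages that exist only in "DE" and "EN" is served the "DE" page, and a request for a language with no translation at all ends up on the neutral one.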
Decide early on whether you need to support languages that require double-byte characters (Chinese, Japanese, Korean, etc.); Unicode is the preferable choice. It can be tedious to change later, especially if you have a database that doesn't use Unicode.
Cookie-based selection of the preferred language.
Based on Accept-Language header if the cookie is not set.
These two you should support.
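A minimal Python sketch of that order of precedence (cookie first, then Accept-Language, then a default) might look like the following. SUPPORTED and choose_language are hypothetical names for the example, and the accepted argument is assumed to be the header's language tags already sorted by preference.

```python
from typing import Optional, Sequence

SUPPORTED = ("en", "de", "fr")  # hypothetical list of site languages


def choose_language(
    cookie_lang: Optional[str],
    accepted: Sequence[str],
    default: str = "en",
) -> str:
    """Prefer the cookie, then the Accept-Language tags, then the default."""
    candidates = ([cookie_lang] if cookie_lang else []) + list(accepted)
    for tag in candidates:
        # Compare on the primary subtag only, e.g. "de-DE" -> "de".
        primary = tag.replace("_", "-").split("-")[0].lower()
        if primary in SUPPORTED:
            return primary
    return default
```

An explicit choice stored in the cookie always wins, so a user who picked French keeps getting French even if the browser asks for German.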
Put a big English banner at the top of your page that reads This page in English.
as example.com/LANG/page
This is the best choice.
LANG.example.com isn't good for autocomplete, and the question marks look ugly.
list available languages in the header/footer/sidebar (like Wikipedia)
A "Choose a language" drop-down is confusing: written in a foreign language it is unintelligible, and written in English it spoils the overall impression.
And you always tend to make the mistake of selecting a language you don't even have fonts for, leaving yourself on a page full of question marks.
display old translation with a warning
You know there is something you can read and get the point, but for the details you'd better get a dictionary and read it in English.
