I have some static resources (images and HTML files) that will be localized. One piece of software I've seen do this is Apache, which appends the locale to the name; for example, test_en_US.html or test_de_CH.html. I'm wondering whether this naming scheme is considered standard, or whether every project does it differently.
While there is no documented standard for naming localized files, I'd recommend using the format filename[_language[_country]] where
language is the ISO 639-1 two-letter language code
country is the ISO 3166-1 two-letter country code
For example:
myFile.txt (non-localized file)
myFile_en.txt (localized for global English)
myFile_en_US.txt (localized for US English)
myFile_en_GB.txt (localized for UK English)
Why? This is the most typical format used by operating systems, globalization tools (such as Trados and WorldServer), and programming languages. So unless you have a particular fondness for a different format, I see no reason to deviate from what most other folks are doing. It may save you some integration headaches down the road.
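A minimal Python sketch of how the filename[_language[_country]] scheme is typically resolved, trying the most specific file first. The function name localized_candidates is made up for illustration, and the sketch assumes a bare file name rather than a full path:

```python
from pathlib import PurePosixPath


def localized_candidates(filename, language=None, country=None):
    """Return candidate file names, most specific first."""
    path = PurePosixPath(filename)
    stem, suffix = path.stem, path.suffix
    candidates = []
    if language and country:
        candidates.append(f"{stem}_{language}_{country}{suffix}")
    if language:
        candidates.append(f"{stem}_{language}{suffix}")
    # The non-localized file is always the last resort.
    candidates.append(f"{stem}{suffix}")
    return candidates


print(localized_candidates("myFile.txt", "en", "GB"))
# ['myFile_en_GB.txt', 'myFile_en.txt', 'myFile.txt']
```

A loader would then open the first candidate that actually exists on disk.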
While there doesn't appear to be a standard convention as to where in the file name to place them, the international codes for language (e.g. "en") and region (e.g. "en-US") are both very common and very straightforward. Variations I've seen, excluding "enUS" vs. "en_US" vs. "en-US":
foo.enUS.ext
foo.ext_enUS
enUS.foo.ext
foo/enUS.ext
enUS/foo.ext
…ad nauseam
I personally favor the first and last variants. The former groups files by name/resource (good for situations in which a limited number of files need to be localized) and the latter groups files by locale (better for situations with a large number of localized files).
You should always use the de facto standard, which is the Unix/POSIX way with gettext. And you should use gettext to do your localization!
Therefore the one and only correct way is to name locales like this:
en
en_US
en_GB
Some applications, and Java developers especially, sometimes use en-US (hyphenated rather than underscored), and it is ALL WRONG!!!
gettext standard is this and only this:
locale
|_en_US
  |_LC_MESSAGES
    |_appname.mo
Where:
locale - name of the directory; it can vary, but it is highly recommended to stay with the name "locale"
en_US - any standard locale, like es_ES, pt_PT, ...
LC_MESSAGES - mandatory and cannot be changed!
appname.mo - the appname.po file compiled with msgfmt (appname is whatever you want)
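As a rough sketch of how this layout is consumed, Python's standard gettext module looks for localedir/language/LC_MESSAGES/domain.mo. The helper names catalog_path and load_translator are invented for illustration, and the sketch assumes the catalog may be missing, in which case fallback=True makes gettext return the untranslated message:

```python
import gettext
import os


def catalog_path(localedir, locale_name, domain):
    """Build the path where gettext expects the compiled catalog."""
    return os.path.join(localedir, locale_name, "LC_MESSAGES", domain + ".mo")


def load_translator(domain, localedir, locale_name):
    """Load a catalog; with fallback=True a missing .mo yields NullTranslations."""
    return gettext.translation(
        domain, localedir=localedir, languages=[locale_name], fallback=True
    )


translator = load_translator("appname", "locale", "en_US")
_ = translator.gettext
print(catalog_path("locale", "en_US", "appname"))
print(_("Create a new item"))
```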
Related
So the question is: can I declare that my application supports en-US and en-GB, and use a single resource file for all of them?
The intention is to make my application available in all English-speaking countries. But it's pointless to have different translations, because nothing in the text is region-specific.
Given that intention, does it make sense to list all those specific cultures in the manifest?
Yes - just use one English file and make it the default culture. This way, even when en-GB is selected, for example, the app will fall back to en-US :)
As for date formatting - just be sure to use CurrentCulture - it gets formatting from the Regional and Number settings (and not CurrentUICulture which is for language needs only). This way people with, say, en-US UI language and Number formatting set to de-DE will still see the app in English but have number formatting as German.
There is a common confusion between CurrentCulture and CurrentUICulture, along with the assumption that language equals formatting. That's why I see many 12-hour formats throughout Windows Phone/Store apps that simply ignore my Regional settings. A must-read regarding confusion about UI and Number formatting: http://forums.asp.net/post/1080435.aspx
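The same separation can be sketched outside .NET. The following Python illustration keeps the UI language and the formatting culture as two independent settings. Everything in it (UserSettings, UI_STRINGS, NUMBER_SEPARATORS) is hypothetical hand-written data, not a real API; a real application would take these from the platform or from CLDR-based tooling:

```python
from dataclasses import dataclass

# Hypothetical, hand-written data for illustration only.
UI_STRINGS = {"en": {"total": "Total"}, "de": {"total": "Summe"}}
NUMBER_SEPARATORS = {"en-US": (",", "."), "de-DE": (".", ",")}  # (grouping, decimal)


@dataclass
class UserSettings:
    ui_language: str     # plays the role of CurrentUICulture
    format_culture: str  # plays the role of CurrentCulture


def format_number(value, culture):
    """Format with two decimals using the culture's separators."""
    grouping, decimal = NUMBER_SEPARATORS[culture]
    text = f"{value:,.2f}"  # e.g. "1,234.50"
    # Swap separators via a placeholder so they do not clobber each other.
    return text.replace(",", "\x00").replace(".", decimal).replace("\x00", grouping)


def render_total(value, settings):
    label = UI_STRINGS[settings.ui_language]["total"]
    return f"{label}: {format_number(value, settings.format_culture)}"


# English UI, German number formatting.
print(render_total(1234.5, UserSettings("en", "de-DE")))  # Total: 1.234,50
```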
When I work on a web application with my colleagues, the names of i18n texts are chosen quite freely.
It's as if each person has their own naming rule.
As an example, take the text "Create a new item", which is used for a link.
A names the key in the resources file CreateANewItem, which runs all the words together.
B prefers to name it CreateLinkText, which describes its usage in the application.
C, however, wants to use CreateItemText, which summarizes its literal meaning.
When a text is longer or contains format placeholders for dynamic content, the naming varies a lot and agreement is hard to reach.
So I wonder whether there's a good naming rule or convention for i18n text in different cases: short, long, with format, vulnerable to change, etc. Or how do you do this in your project? With such a convention, maintenance would be easier and the code more readable.
Thanks a lot.
It's not something that I have seen so far in coding conventions. I guess it matters a lot less than other issues when it comes to coding standards. That said, I don't know what platform you are using, but if it's .NET there is a very short page on naming conventions for resource identifiers here: http://msdn.microsoft.com/en-us/library/vstudio/ms229037%28v=vs.100%29.aspx
We are creating multi-language subsites on our website.
I would like to use the 2-letter language codes. Spanish and French are easy. They will get URLs like:
mydomain.com/es
mydomain.com/fr
but I run into a problem with Traditional and Simplified Chinese. Are there standards for which 2-letter codes to use for these languages?
mydomain.com/zh
mydomain.com/?
@dkarp gives an excellent general answer. I will add some additional specifics regarding Chinese:
There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc). The standard way to distinguish these would be with a country code, e.g. zh_CN for mainland China, zh_SG for Singapore, zh_TW for Taiwan, or zh_HK for Hong Kong.
Mainland China and Singapore both use simplified characters, and the others use traditional characters. Since China and Taiwan are the two with the biggest populations, just zh_CN and zh_TW are often used to distinguish the simplified and traditional character versions of a website.
More technically correct, but not commonly used in practice, would be to use zh_Hans for (generic) simplified Chinese characters and zh_Hant for traditional Chinese characters, except in the rare cases where it is meaningful to distinguish different countries.
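The region-to-script relationship described above can be sketched as a small lookup. The function name chinese_script is invented for illustration, the region table covers only the four regions mentioned here, and treating a bare "zh" as simplified is an assumption rather than a rule:

```python
# Regions named in the answer above; other regions are not covered.
SCRIPT_BY_REGION = {"CN": "Hans", "SG": "Hans", "TW": "Hant", "HK": "Hant"}


def chinese_script(tag, default="Hans"):
    """Return 'Hans' or 'Hant' for a Chinese locale tag such as zh_TW or zh-Hant-HK."""
    parts = tag.replace("-", "_").split("_")
    if parts[0].lower() != "zh":
        raise ValueError(f"not a Chinese locale tag: {tag!r}")
    # An explicit script subtag wins over the region.
    for part in parts[1:]:
        if part.lower() in ("hans", "hant"):
            return part.title()
    for part in parts[1:]:
        region = part.upper()
        if region in SCRIPT_BY_REGION:
            return SCRIPT_BY_REGION[region]
    # Assumption: a bare "zh" is treated as simplified.
    return default


print(chinese_script("zh_TW"))       # Hant
print(chinese_script("zh-CN"))       # Hans
print(chinese_script("zh_Hant_HK"))  # Hant
```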
There is indeed a standard representation for this. As people have run into the exact same problem you are seeing -- same language, but different dialects or characters -- they've extended the two-letter language code with a two-letter region code. So you might have a universal French page at mydomain.com/fr, but internationalizing for French Canadian readers might leave you with a mydomain.com/fr_CA (Canada) and mydomain.com/fr_FR (France). Some platforms use a dash instead of an underscore to separate the language and region codes (hence fr-CA and fr-FR).
The standard locale for simplified Chinese is zh_CN. The standard locale for traditional Chinese is zh_TW.
I hesitate to point you towards the actual BCP 47 standards documents, as they're, uh, a little heavy on the detail and a little light on the readability. Just go with standard locale identifiers, like the ones used by Java, and you'll be fine.
I'm just going to leave this here.
CODE         LANG      FORM              REGION
zh           Chinese   -                 -
zh_Hans      Chinese   Han Simplified    -
zh_Hans_CN   Chinese   Han Simplified    China
zh_Hans_HK   Chinese   Han Simplified    Hong Kong SAR China
zh_Hans_MO   Chinese   Han Simplified    Macau SAR China
zh_Hans_SG   Chinese   Han Simplified    Singapore
zh_Hant      Chinese   Han Traditional   -
zh_Hant_HK   Chinese   Han Traditional   Hong Kong SAR China
zh_Hant_MO   Chinese   Han Traditional   Macau SAR China
zh_Hant_TW   Chinese   Han Traditional   Taiwan
Language is dependent upon where it is spoken (doh!), so language and locale codes reflect that reality. zh is the basic language code, but because there are two major forms of it, there are zh_Hans and zh_Hant, but they are still only language codes, not locales.
Location-specific
To fully specify which language is used in a particular location, the country code still has to be suffixed, giving zh_Hans_HK and zh_Hant_HK for simplified and traditional Chinese, respectively, as used in Hong Kong.
Actually, the reality is that something more specific than country code is often required in many countries, but that is likely to exponentially increase the complexity and maintenance of databases like CLDR, plus the support infrastructure to feed into it, like IP to location details extraction, is not generally available or accurate enough.
Fixed text
Now, if the code is just to specify which set of fixed strings to use in the user interface, or even whole pages sets on a site, a country suffix is not really necessary, unless there are more than a few places where the language varies significantly enough (location-based info) to bother creating a whole separate resource set.
The larger the resource set, the more likely that a language code based upon locale [in this context, just a language attribute, rather than a true locale, so you can call it what you like!] will be required, but at least you only have to do that when necessary.
On-the-fly values
However, if you want to format particular variable values, like dates, times, currencies and numbers, on the fly, locales become important, because all the tools that support such functionality (like those based upon Unicode CLDR data) expect them. The locale for these needs to be a separate setting from the code that selects which in-house UI language to use, unless you want to create a resource set for every known locale, and maintain them ad nauseam!
Browser language tools
Note that when you specify a locale for a web page that contains editable fields, such as input boxes, and spellchecking has been enabled for a field, the browser's language tools will spellcheck that field according to that locale.
Criteria
You have to be clear about what the resource set is providing, so consider:
Fixed strings? Language only.
Formatting on-the-fly? Locale.
Spellchecking in the viewing environment? Locale.
Whole pages/subsite? Language only, else locale (as a language variant) if significantly different content required.
Spreadsheet to minimise maintenance overhead
I use a spreadsheet to hold UI strings where each language code has a parent code, so that the cell for its version of a string has a formula that gets its string from the parent. To create a custom string for that language and string, I just overwrite the cell formula with the exact text. That minimises the amount of resource maintenance. I run a macro at the end that generates a complete resource file for each language.
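The parent-code inheritance described above can be approximated in a few lines of Python. This is only a sketch of the idea, not the author's spreadsheet or macro; PARENTS, STRINGS, resolve and build_resource_file are illustrative names, and the sample strings are made up:

```python
# Each language code points at its parent; None marks the root.
PARENTS = {"en_GB": "en", "en_US": "en", "en": None}

# Only overridden strings are stored per code, like overwritten spreadsheet cells.
STRINGS = {
    "en": {"color": "Color", "save": "Save"},
    "en_GB": {"color": "Colour"},
    "en_US": {},
}


def resolve(code, key):
    """Walk up the parent chain until some code defines the string."""
    while code is not None:
        value = STRINGS.get(code, {}).get(key)
        if value is not None:
            return value
        code = PARENTS.get(code)
    raise KeyError(key)


def build_resource_file(code):
    """Generate the complete string table for one code, as the final macro would."""
    chain = []
    current = code
    while current is not None:
        chain.append(current)
        current = PARENTS.get(current)
    keys = sorted({key for c in chain for key in STRINGS.get(c, {})})
    return {key: resolve(code, key) for key in keys}


print(build_resource_file("en_GB"))  # {'color': 'Colour', 'save': 'Save'}
```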
What are the steps to develop a multilingual web application?
Should I store the language texts and resources in a database, or should I use property files or resource files?
I understand that I need to use CurrentCulture with C#, along with CultureFormat etc.
I wanted to know your opinions on the steps to build a multilingual web application.
Doesn't have to be language specific. I'm just looking for steps to build this.
The specific mechanisms are different depending on the platform you are developing on.
As a cursory set of work items:
Separation of code from content. Generally, resources are compiled into assemblies with the help of resource files (in .NET) or stored in property files (in Java, though there are other options), or some other location, and referred to by ID. If you want localization costs to be reasonable, you need to avoid changes to the IDs between releases, as most localization tools will treat new IDs as new content.
Identification of areas in the application which make assumptions about the locale of the user, especially date/time, currency, number formatting or input.
Create some mechanism for locale-specific CSS content; not all fonts work for all languages, and not all font-sizes are sane for all languages. Don't paint yourself into a corner of forcing Thai text to be displayed in 8 pt. Also, text directionality is going to be right-to-left for at least two languages.
Design your page content to reflow or resize reasonably when more or less content than you expect is present. Many languages expand 50-80% from English for short strings, and 30-40% for longer pieces of content (that's a rough rule of thumb, not a law).
Identify cultural presumptions made by your UI designers, and try to make them more neutral, or, if you've got money and sanity to burn, localizable. Mailboxes don't look the same everywhere, hand gestures aren't universal, and something that's cute or clever or relies on a visual pun won't necessarily travel well.
Choose appropriate encodings for your supported languages. It's now reasonable to use UTF-8 for all content that's sent to web browsers, regardless of language.
Choose appropriate collation for your databases, or enable alternate collations, if you are dealing with content in multiple languages in your databases. Case-insensitivity works differently in many languages than it does in English, and accent insensitivity is acceptable in some languages and generally inappropriate in others.
Don't assume words are delimited by spaces or that sentences are delimited by punctuation, if you're trying to support search.
Avoid:
Storing localized content in databases, unless there's a really, really, good reason. And then, think again. If you have content that is somewhat dynamic and representatives of each region need to customize it, it may be reasonable to store certain categories of content with an associated locale ID.
Trying to be clever with string concatenation. Also, try not to assume rules about pluralization or counting work the same for every culture. Make sure, at least, that the order of strings (and controls) can be specified with format strings that are typical of your platform, or well documented in your localization kit if you elect to roll your own for some reason.
Presuming that it's ok for code bugs to be fixed by localizers. That's generally not reasonable, at least if you want to deliver your product within a reasonable time at a reasonable cost; it's sometimes not even possible.
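To illustrate the point about concatenation and format strings, here is a small Python sketch using named placeholders, so that each translation can put the values in whatever order its grammar needs. TEMPLATES and render are illustrative names and the German sentence is an example translation, not taken from any real product:

```python
# One complete sentence per language; translators control the word order.
TEMPLATES = {
    "en": "{user} moved {file} to {folder}.",
    "de": "{file} wurde von {user} nach {folder} verschoben.",
}


def render(language, **values):
    """Fill a whole-sentence template instead of concatenating fragments."""
    return TEMPLATES[language].format(**values)


print(render("en", user="Ana", file="a.txt", folder="Docs"))
# Ana moved a.txt to Docs.
print(render("de", user="Ana", file="a.txt", folder="Docs"))
# a.txt wurde von Ana nach Docs verschoben.
```

Building the same sentence as user + " moved " + file would lock in the English word order. Pluralization is deliberately left out of this sketch; it needs per-language rules.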
The first step is to internationalize. The second step is to localize. The third step is to translate.
I'm starting to modify my app, which uses all hardcoded strings for errors, GUI, etc. I'm considering these two approaches, but let me know if there is an even better way:
-Put all strings in resource (.rc) files.
-Define all strings in a file, once for each language. Use a preprocessor define to decide which strings get compiled in.
Which of these two approaches is generally preferred?
Put all the strings in resource files. Once you've done that, there are several good translation packages available. One useful thing these packages do is allow you to get translation done by somebody who doesn't program.
Remember, also, that internationalization (i18n) is a large subject, and there's a lot of things to consider. It isn't just a matter of translating strings. Do a web search on it, at the very least. You might want to read a book on it: I used International Programming for Windows by Schmitt as a guide. It's an old book from Microsoft Press, and I had to get it through a used book service; most of the more modern stuff seems to be on internationalizing .NET apps.
Without knowing more about your project (what sort of software, who the intended audience is, what sort of organization you have, what sort of budget, why you're interested in internationalization, etc.), this is about the most I can tell you.
Generally you see locale specific resource files containing strings referenced by key. Compiling different versions for different locales is a very rigid solution and will be a maintenance nightmare. Using resource files also allows the user to have fallback locales.
There's another approach: just put the strings in the source with something like tr(" ") and use one of the tools that extracts and converts them.
It works with any toolkit/GUI library.
You can mark text to be converted and text not to change (such as protocol strings or db keys).
It makes the source easier to read and search, instead of having to look up what IDS_MESSAGE34 means.
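A minimal Python sketch of the marking approach, assuming gettext as the backing tool. The name tr is bound by hand here, NullTranslations stands in for a real catalog (it returns every message unchanged), and login_messages with its strings is purely illustrative:

```python
import gettext

# Stand-in for a loaded catalog; a real application would load a .mo file.
translations = gettext.NullTranslations()
tr = translations.gettext


def login_messages(user):
    # Marked with tr(): picked up by the extraction tool and translated.
    greeting = tr("Welcome back, {name}!").format(name=user)
    # Not marked: a protocol string that must never be translated.
    command = "AUTH LOGIN"
    return greeting, command


print(login_messages("Ana"))  # ('Welcome back, Ana!', 'AUTH LOGIN')
```

The readable source text doubles as the lookup key, which is what makes the code searchable.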
One problem with resource files, at least with Windows/MFC, is that you can't use the string table in dialogs. So you have some text in the string table and some in the dialog section, which you have to deal with separately.