Translating for websites - English to Chinese (Simplified) - Characters or written? - internationalization

My question is this:
Should Chinese characters be used or should the translations be spelled out?
delete > 删除
or
delete > Shānchú

You should test your translated website with users of the destination language, and use the terms and spelling which they prefer. You should have a translator who is skilled in the source and destination language perform the translation, and accept the terms and spelling which they suggest.
I am not a user of websites in Chinese, nor a translator. But from what I understand of the language, Chinese should certainly be written in characters (e.g. 删除). The Latin script spelling (Shānchú) is used by those who do not read the Chinese script, as a way to understand the sound of the words.

Related

How to implement fuzzy search for Chinese pinyin and Japanese romaji?

I have some data in Chinese and Japanese, and I want to make it searchable by its romanizations (Pinyin for Chinese, Romaji for Japanese). Assume that the romanizations are already provided, separated by syllables.
eg. the text "示例文本", which romanizes to ["shi", "li", "wen", "ben"].
Users should be able to match this by typing
whole syllables, with or without space, eg. shi li wen ben or shiliwenben
initials or first few letters of syllables, eg. shlwb or slwb
they might also type only part of the string, eg. wenben or wb (these examples correspond to the last two syllables of the text above).
Is there an elegant way of implementing this?
(note: I did not specify any programming language in this question, because I want to implement this in different languages. If your response is language-specific or requires specific libraries, please make it clear. Thank you!)
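One possible approach, sketched in Python with only the standard library: strip spaces from the query, then check whether it can be split into non-empty prefixes of consecutive syllables, starting at any syllable. The function name matches and the recursive helper are made up for illustration; this is a sketch of the matching rule described above, not a tuned search index.

```python
def matches(query, syllables):
    """Return True if query consists of prefixes of consecutive syllables."""
    q = "".join(query.lower().split())  # ignore case and spaces
    if not q:
        return False

    def consume(rest, i):
        # rest: unmatched part of the query; i: index of the next syllable
        if not rest:
            return True
        if i >= len(syllables):
            return False
        syl = syllables[i]
        # try every non-empty prefix of this syllable, longest first
        for k in range(min(len(syl), len(rest)), 0, -1):
            if rest.startswith(syl[:k]) and consume(rest[k:], i + 1):
                return True
        return False

    # the match may begin at any syllable, so partial strings work too
    return any(consume(q, start) for start in range(len(syllables)))


syllables = ["shi", "li", "wen", "ben"]
for q in ["shi li wen ben", "shiliwenben", "shlwb", "slwb", "wenben", "wb", "bw"]:
    print(q, matches(q, syllables))
```

The same logic ports to most languages because it only needs string prefixes and recursion. For large data sets it would probably need an index in front of it.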

Things to take into account when internationalizing web app to handle chinese language

I have an MVC3 web app with i18n in 4 Latin-script languages... but I would like to add CHINESE in the future.
I'm working with standard resource file.
Any tips?
EDIT: Anything about reading direction? Numbers? Fonts?
I would start with these observations:
Chinese is written without spaces between words, meaning that a search engine (if needed) must not rely only on punctuation and whitespace to find words (as a rough first approximation, each character can be treated as a word); also, you might have mixed Latin and Chinese words
make sure to use UTF-8 for all your HTML documents (.resx files are UTF-8 by default)
make sure that your database collation supports Chinese - or use a separate database with an appropriate collation
make sure you don't reverse strings or do other unusual text operations that might break with multi-byte characters
make sure you don't call ToLower and ToUpper to check user-input text because again this might break with other alphabets (or rather scripts) - aka the Turkey Test
To test for all of the above and other possible issues, a good way is pseudolocalization.
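To illustrate the first point, here is a small Python sketch. The tokenize function and the sample sentence are made up for illustration, and it treats only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese, which is an assumption. A real search engine would normally use a proper word segmenter.

```python
import re

text = "我喜欢编程 and Python"

# Whitespace splitting keeps the whole Chinese run as a single "word".
print(text.split())

# One token per Chinese character, one token per run of other non-space text.
TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+")

def tokenize(s):
    """Rough tokenizer: each ideograph on its own, Latin runs kept whole."""
    return TOKEN.findall(s)

print(tokenize(text))
print(tokenize("MVC3应用"))  # mixed Latin and Chinese without any space
```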

Language codes for simplified Chinese and traditional Chinese?

We are creating multi-language subsites on our website.
I would like to use the 2-letter language codes. Spanish and French are easy. They will get URLs like:
mydomain.com/es
mydomain.com/fr
but I run into a problem with Traditional and Simplified Chinese. Are there standards for which two-letter codes to use for these languages?
mydomain.com/zh
mydomain.com/?
@dkarp gives an excellent general answer. I will add some additional specifics regarding Chinese:
There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc). The standard way to distinguish these would be with a country code, e.g. zh_CN for mainland China, zh_SG for Singapore, zh_TW for Taiwan, or zh_HK for Hong Kong.
Mainland China and Singapore both use simplified characters, and the others use traditional characters. Since China and Taiwan are the two with the biggest populations, just zh_CN and zh_TW are often used to distinguish the simplified and traditional character versions of a website.
More technically correct, but not commonly used in practice, would be to use zh_Hans for (generic) simplified Chinese characters and zh_Hant for traditional Chinese characters, adding a country code only in the rare cases when it is meaningful to distinguish different countries.
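As a rough sketch of how a site might apply this, the following Python maps an incoming tag to a script. The table REGION_TO_SCRIPT and the function chinese_script are hypothetical names, and the table covers only the four regions mentioned above.

```python
# Regions named in this answer; any other region is left undecided (None).
REGION_TO_SCRIPT = {"CN": "Hans", "SG": "Hans", "TW": "Hant", "HK": "Hant"}

def chinese_script(tag):
    """Return 'Hans' or 'Hant' for tags like zh_CN, zh-TW or zh_Hant_HK."""
    parts = tag.replace("-", "_").split("_")
    if parts[0].lower() != "zh":
        return None
    # An explicit script subtag wins over the region.
    for part in parts[1:]:
        if part.lower() in ("hans", "hant"):
            return part.capitalize()
    # Otherwise infer the script from the region, if it is a known one.
    for part in parts[1:]:
        script = REGION_TO_SCRIPT.get(part.upper())
        if script:
            return script
    return None

for tag in ["zh_CN", "zh-TW", "zh_Hans_HK", "zh", "fr_CA"]:
    print(tag, chinese_script(tag))
```

Returning None for a bare zh leaves the choice of default to the site, since the tag alone does not say which character set is wanted.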
There is indeed a standard representation for this. As people have run into the exact same problem you are seeing -- same language, but different dialects or characters -- they've extended the two-letter language code with a two-letter region code. So you might have a universal French page at mydomain.com/fr, but internationalizing for French Canadian readers might leave you with a mydomain.com/fr_CA (Canada) and mydomain.com/fr_FR (France). Some platforms use a dash instead of an underscore to separate the language and region codes (hence fr-CA and fr-FR).
The standard locale for simplified Chinese is zh_CN. The standard locale for traditional Chinese is zh_TW.
I hesitate to point you towards the actual BCP 47 standards documents, as they're, uh, a little heavy on the detail and a little light on the readability. Just go with standard locale identifiers, like the ones used by Java, and you'll be fine.
I'm just going to leave this here.
CODE         LANG      FORM              REGION
zh           Chinese   -                 -
zh_Hans      Chinese   Han Simplified    -
zh_Hans_CN   Chinese   Han Simplified    China
zh_Hans_HK   Chinese   Han Simplified    Hong Kong SAR China
zh_Hans_MO   Chinese   Han Simplified    Macau SAR China
zh_Hans_SG   Chinese   Han Simplified    Singapore
zh_Hant      Chinese   Han Traditional   -
zh_Hant_HK   Chinese   Han Traditional   Hong Kong SAR China
zh_Hant_MO   Chinese   Han Traditional   Macau SAR China
zh_Hant_TW   Chinese   Han Traditional   Taiwan
Language is dependent upon where it is spoken (doh!), so language and locale codes reflect that reality. zh is the basic language code, but because there are two major forms of it, there are zh_Hans and zh_Hant, but they are still only language codes, not locales.
Location-specific
To fully specify which language is used in a particular location, the country code still has to be suffixed, giving zh_Hans_HK and zh_Hant_HK for simplified and traditional Chinese, respectively, both as used in Hong Kong.
Actually, the reality is that something more specific than country code is often required in many countries, but that is likely to exponentially increase the complexity and maintenance of databases like CLDR, plus the support infrastructure to feed into it, like IP to location details extraction, is not generally available or accurate enough.
Fixed text
Now, if the code is just to specify which set of fixed strings to use in the user interface, or even whole pages sets on a site, a country suffix is not really necessary, unless there are more than a few places where the language varies significantly enough (location-based info) to bother creating a whole separate resource set.
The larger the resource set, the more likely that a language code based upon locale [in this context, just a language attribute, rather than a true locale, so you can call it what you like!] will be required, but at least you only have to do that when necessary.
On-the-fly values
However, if you want to format particular variable values, like dates, times, currencies and numbers, on the fly, locales become important, because all the tools that support such functionality (like those based upon Unicode CLDR data) expect them. The locale for these needs to be a separate setting from the code that selects the in-house UI language, unless you want to create a resource set for every known locale and maintain them ad nauseam!
Browser language tools
Note that when you specify a locale for a web page with editable fields, such as input boxes, and spellchecking has been enabled for a field (in attributes or CSS), the browser's language tools will spellcheck the field according to that locale.
Criteria
You have to be clear about what the resource set is providing, so consider:
Fixed strings? Language only.
Formatting on-the-fly? Locale.
Spellchecking in the viewing environment? Locale.
Whole pages/subsite? Language only, else locale (as a language variant) if significantly different content required.
Spreadsheet to minimise maintenance overhead
I use a spreadsheet to hold UI strings where each language code has a parent code, so that the cell for its version of a string has a formula that gets its string from the parent. To create a custom string for that language and string, I just overwrite the cell formula with the exact text. That minimises the amount of resource maintenance. I run a macro at the end that generates a complete resource file for each language.
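The same parent-fallback idea can be sketched outside a spreadsheet. In the Python below, the PARENT chain, the STRINGS data and the function names are all invented for illustration; only the strings that differ from the parent are stored, and a complete resource set is generated at the end, much like the macro step.

```python
# Hypothetical parent chain: each code inherits from its parent.
PARENT = {
    "zh_Hant_HK": "zh_Hant",
    "zh_Hant": "zh",
    "zh_Hans": "zh",
    "zh": "en",
    "en": None,
}

# Only overrides are stored; everything else comes from the parent.
STRINGS = {
    "en": {"delete": "Delete", "save": "Save"},
    "zh_Hans": {"delete": "删除"},
    "zh_Hant": {"delete": "刪除"},
}

def lookup(code, key):
    """Walk up the parent chain until some level defines the string."""
    while code is not None:
        value = STRINGS.get(code, {}).get(key)
        if value is not None:
            return value
        code = PARENT.get(code)
    raise KeyError(key)

def build_resource(code):
    """Generate the complete resource set for one language code."""
    return {key: lookup(code, key) for key in STRINGS["en"]}

print(build_resource("zh_Hant_HK"))
```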

Validating Kana Input

I am working on an application that allows users to input Japanese language characters. I am trying to come up with a way to determine whether the user's input is Japanese text of a particular kind (hiragana, katakana, or kanji).
There are certain fields in the application where entering Latin text would be inappropriate and I need a way to limit certain fields to kanji-only, or katakana-only, etc.
The project uses UTF-8 encoding. I don't expect to accept JIS or Shift-JIS input.
Ideas?
It sounds like you basically need to just check whether each Unicode character is within a particular range. The Unicode code charts should be a good starting point.
If you're using .NET, my MiscUtil library has some Unicode range support - it's primitive, but it should do the job. I don't have the source to hand right now, but will update this post with an example later if it would be helpful.
Not sure of a perfect answer, but the Unicode ranges for hiragana and katakana are listed on Wikipedia (and I would expect them to be available from unicode.org as well).
Hiragana: U+3040–U+309F
Katakana: U+30A0–U+30FF
Checking those ranges against the input should work as a validation for hiragana or katakana for Unicode in a language-agnostic manner.
For kanji, I would expect it to be a little more complicated, as I expect that the Chinese characters used in Chinese and Japanese are both included in the same range, but then again, I may be wrong here. (I wouldn't expect Simplified Chinese and Traditional Chinese to be included in the same range...)
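A minimal Python sketch of the range check follows. The hiragana and katakana ranges are the ones listed above. The kanji range is an assumption: it covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is shared by Chinese and Japanese, so it cannot tell the two languages apart and it misses the extension blocks. The name all_in_ranges is made up for illustration.

```python
HIRAGANA = (0x3040, 0x309F)
KATAKANA = (0x30A0, 0x30FF)
KANJI = (0x4E00, 0x9FFF)  # assumption: basic CJK Unified Ideographs block only

def all_in_ranges(text, *ranges):
    """True if text is non-empty and every character falls in one of the ranges."""
    return bool(text) and all(
        any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in text
    )

print(all_in_ranges("ひらがな", HIRAGANA))           # hiragana-only field
print(all_in_ranges("コーヒー", KATAKANA))           # katakana-only field
print(all_in_ranges("漢字", KANJI))                  # kanji-only field
print(all_in_ranges("ひらがなabc", HIRAGANA, KATAKANA))
```

Passing several ranges allows mixed fields, for example all_in_ranges(value, HIRAGANA, KATAKANA, KANJI) for any Japanese script.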
oh oh! I had this one once... I had a regex with the hiragana, then katakana and then the kanji. I forget the exact codes, I'll go have a look.
regex is great because you double the problems. And I did it in PHP, my choice for extra strong auto problem generation
--edit--
$pattern = '/[^\wぁ-ゔァ-ヺー\x{4E00}-\x{9FAF}_\-]+/u';
I found this here, but it's not great... I'll keep looking
--edit--
I looked through my portable hard drive.... I thought I had kept that particular snippet from the last company... sorry.

Is it possible to create INTERNATIONAL permalinks?

I was wondering how you deal with permalinks on international sites. By permalink I mean a link which is unique and human readable.
For English phrases it's no problem, e.g. /product/some-title/
but what do you do if the product title is in, e.g., Chinese?
How do you deal with this problem?
I am implementing an international site and one requirement is to have human-readable URLs.
Thanks for every comment
Characters outside the US-ASCII set are not permitted in URLs as written according to this spec (anything else has to be percent-encoded), so raw Chinese strings would be out immediately.
Where the product name can be localised, you can use urls like <DOMAIN>/<LANGUAGE>/DIR/<PRODUCT_TRANSLATED>, e.g.:
http://www.example.com/en/products/cat/
http://www.example.com/fr/products/chat/
accompanied by a mod_rewrite rule to the effect of:
RewriteRule ^([a-z]+)/products/([a-z]+)/?$ product_lookup.php?lang=$1&product=$2 [L]
For the first example above, this rule will call product_lookup.php?lang=en&product=cat. Inside this script is where you would access the internal translation engine (from the lang parameter, en in this case) to do the same translation you do on the user-facing side to translate, say, "Chat" on the French page, "Cat" on the English, etc.
Using an external translation API would be a good idea, but tricky to get a reliable one which works correctly in your business domain. Google have opened up a translation API, but it currently only supports a limited number of languages.
English <=> Arabic
English <=> Chinese
English <=> Russian
Take a look at Wikipedia.
They use national characters in URLs.
For example, the Russian home page URL is: http://ru.wikipedia.org/wiki/Заглавная_страница. The browser transparently percent-encodes all non-ASCII characters (as UTF-8 bytes) when sending the URL to the server.
But on the web page all URLs are human-readable.
So you don't need to do anything special -- just put your product names into URLs as is.
The webserver should be able to decode them for your application automatically.
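For illustration, this is roughly what that encoding looks like using Python's standard urllib.parse module. The path is the Wikipedia example above; nothing here is specific to any particular web server or framework.

```python
from urllib.parse import quote, unquote

path = "/wiki/Заглавная_страница"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character;
# "/" is left alone by default, and "_" is never encoded.
encoded = quote(path)
print(encoded)

# The server side reverses this to get the readable title back.
decoded = unquote(encoded)
print(decoded == path)
```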
I usually transliterate the non-ascii characters. For example "täst" would become "taest". GNU iconv can do this for you (I'm sure there are other libraries):
$ echo täst | iconv -t 'ascii//translit'
taest
Alas, these transliterations are locale dependent: in languages other than German, 'ä' could be transliterated as simply 'a', for example. But on the other hand, there should be a transliteration for every (commonly used) character set into ASCII.
How about some scheme like /productid/{product-id-number}/some-title/
where the site looks at the {number} and ignores the 'some-title' part entirely. You can put that into whatever language or encoding you like, because it's not being used.
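A small Python sketch of that scheme, assuming the /productid/ prefix from the example above. The pattern and the function name product_id_from_path are hypothetical. The title segment is matched but never inspected, so it can be in any language.

```python
import re

# /productid/<digits>/<optional title in any language>/
PRODUCT_URL = re.compile(r"^/productid/(\d+)(?:/[^/]*)?/?$")

def product_id_from_path(path):
    """Return the numeric product id, ignoring the title part entirely."""
    m = PRODUCT_URL.match(path)
    return int(m.group(1)) if m else None

print(product_id_from_path("/productid/42/some-title/"))
print(product_id_from_path("/productid/42/删除/"))
print(product_id_from_path("/productid/abc/some-title/"))
```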
If memory serves, you're only able to use English letters in URLs. There's a discussion to change that, but I'm fairly positive that it's not been implemented yet.
That said, you'd need to have a lookup table where you assign translations of products/titles to whatever word they'll be in the other language. For example:
foo.com/cat will need a translation lookup for "cat", "gato", "neko", etc.
Then your HTTP module, which parses those human-readable paths into an exact URL, will know which page to serve based upon the translations.
Creating a lookup for such a thing seems like overkill to me. I cannot create a lookup for all the different words in all languages. Maybe accessing a translation API would be a good idea.
So as far as I can see, it's not possible to use foreign chars in the permalink, as the URL specs do not allow it.
What do you think of encoding the special chars? Are those URLs recognized by Google then?

Resources