Things to take into account when internationalizing a web app to handle Chinese - asp.net-mvc-3

I have an MVC3 web app with i18n in 4 Latin-script languages, but I would like to add Chinese in the future.
I'm working with standard resource files.
Any tips?
EDIT: Anything about reading direction? Numbers? Fonts?

I would start with these observations:
Chinese is a character-based language with no spaces between words, meaning that a search engine (if needed) cannot rely only on punctuation and whitespace to find word boundaries (to a first approximation, each character is a word); also, you might have mixed Latin and Chinese words
make sure to use UTF-8 for all your HTML documents (.resx files are UTF-8 by default)
make sure that your database collation supports Chinese - or use a separate database with an appropriate collation
make sure you don't reverse strings or do other unusual text operations that might break with multi-byte characters
make sure you don't call ToLower and ToUpper to check user-input text, because again this can break with other alphabets (or rather scripts) - aka the Turkey Test (a quick sketch follows below)
A good way to test for all of the above and other possible issues is pseudolocalization.
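To make the ToLower/ToUpper point concrete, here is a minimal C# sketch (my illustration, not part of the original answer) of the Turkey Test: culture-sensitive casing breaks a comparison that an ordinal comparison handles fine.

using System;
using System.Globalization;
using System.Threading;

class TurkeyTestDemo
{
    static void Main()
    {
        // Simulate a Turkish user: culture-sensitive casing maps 'i' to 'İ' (dotted capital I).
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

        string userInput = "file";

        // Fragile: ToUpper() uses the current culture, so "file" becomes "FİLE" and the check fails.
        bool fragile = userInput.ToUpper() == "FILE";

        // Robust: ordinal comparison ignores the current culture entirely.
        bool robust = string.Equals(userInput, "FILE", StringComparison.OrdinalIgnoreCase);

        Console.WriteLine("culture-sensitive: " + fragile + ", ordinal: " + robust);
        // Under tr-TR this prints: culture-sensitive: False, ordinal: True
    }
}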

Related

Using ASCII delimiters (29-31) in modern programming

I'm currently building a hash key string (collapsed from a map) where the values are delimited by the special ASCII unit separator, character 31 (0x1F).
This nicely solves the problem of trying to guess which ASCII characters won't be used in the string values, and I don't need to worry about escaping or quoting values, etc.
However, reading about the history of this, it appears to be a relic from the 1960s, and I haven't seen many examples where strings are built and tokenised using this special character, so it all seems too easy.
Are there any issues to using this delimiter in a modern application?
I'm currently doing this in a non-Unicode C++ application, however I'm interested to know how this applies generally in other languages such as Java, C# and with Unicode.
The lower 128 characters of ASCII are set in stone in the Unicode standard, and this includes characters 0-31. The only reason you don't see special ASCII chars used in strings very often is human-interfacing limitations: they don't visualize well (if at all) when displayed on screen or written to a file, and you can't easily type them in from a keyboard either. They're also not allowed in unescaped form within various popular 'human readable' file formats, such as XML.
For logical processing tasks within a program that do not need end-user interaction, however, they are perfectly suitable for whatever use you can find for them. Your particular use sounds novel and efficient and I think you should definitely run with it.
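As a rough illustration of the technique (mine, not from the original answers; the field values are made up), joining and splitting on the ASCII unit separator in C#:

using System;

class UnitSeparatorDemo
{
    const char UnitSeparator = '\u001F'; // ASCII 31 (US)

    static void Main()
    {
        // Collapse a map's values into one hash-key string.
        string[] fields = { "warehouse-7", "bin-42", "2012-06-01" };
        string key = string.Join(UnitSeparator.ToString(), fields);

        // Split it back; this only works if no value can ever contain 0x1F itself.
        string[] roundTripped = key.Split(UnitSeparator);

        Console.WriteLine(string.Join(" | ", roundTripped)); // warehouse-7 | bin-42 | 2012-06-01
    }
}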
Your application is free to accept whatever binary format it pleases. However, if you need to embed arbitrary binary data in your input, you need to escape whatever delimiters or other special codes your format uses. This is true regardless of which ones you choose.
I'd also not ignore Unicode. It's 2012; by now it's rather silly to work with an outdated model for dealing with text. If your input data is textual, handle it as such.
The one issue that comes to mind is why invent another format instead of using XML or JSON; or if you need a compact encoding, a "binary" variant of those two (Fast Infoset, msgpack, who knows what else), or ASN.1? There's probably a whole bunch of other issues that you'll encounter when rolling your own that the design and tooling for those formats already solved.
I work with barcodes in a warehouse setting. We use ASCII code 31 as a field separator so that a single scan can populate multiple data fields. So, consider the ramifications if you think your hash key could end up on a barcode.

Tcl UTF-8 characters not displaying properly in UI

Objective: to have multi-language characters in the user ID in Enovia V6.
I am using UTF-8 encoding in a Tcl script, and it seems to save multi-language characters properly in the database (after some conversion). But in the UI I literally see the saved information from the database.
When doing the same exercise through Power Web, the saved data somehow gets converted back into the proper multi-language characters and displays properly.
Am I missing something in the Tcl approach?
Pasting one example to help understand the issue better.
Original Name: Kátai-Pál
Name saved in database as: KÃ¡tai-PÃ¡l
In UI I see name as: KÃ¡tai-PÃ¡l
In Tcl I use the below syntax:
set encoded [encoding convertto utf-8 Kátai-Pál];
Now the user name becomes: KÃ¡tai-PÃ¡l
In UI I see the name as “KÃ¡tai-PÃ¡l”
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (one per result character) that is the encoding of those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)
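For illustration only (the question is about Tcl; this C# sketch just shows the mechanism), here is how the "KÃ¡tai-PÃ¡l" form arises when UTF-8 bytes are decoded as Latin-1, and how decoding with the right encoding fixes it:

using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        string original = "Kátai-Pál";

        // Roughly what the Tcl script does: characters -> UTF-8 bytes.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

        // What a UI does when it treats those bytes as Latin-1 characters.
        string mojibake = Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
        Console.WriteLine(mojibake); // KÃ¡tai-PÃ¡l

        // The fix the answer describes: decode the bytes with the encoding actually used to write them.
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes)); // Kátai-Pál
    }
}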

In Ruby, how to automatically convert non-supported characters in text-processing?

(Using Ruby 1.8)
I only have a brief understanding of encoding and such... but what I want to know is: in any given script handling any given text file, is there some universal library or call I can make to turn non-standard characters into their nearest printable equivalent? I realize there's no "all-in-one" fix, but this is for an English (U.S. gov't) text file, so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-8­23
That hyphen is just a literal hyphen as I've typed it out here. In the file, though, it's something that looks like a hyphen (an en dash?), but when copying and pasting it, for example into this browser text box, it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash, or at least something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex, because it knows how to convert accented characters to similar-looking ASCII characters, or to ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Gray's article linked in that answer. It explains the problem and ways to fix it, ending up recommending Iconv too.
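The question is about Ruby 1.8, so the following is only a cross-language illustration of the same "degrade to the nearest ASCII" idea, sketched in C# with Unicode normalization (the helper name and the dash handling are my own assumptions, not Iconv's behaviour):

using System;
using System.Globalization;
using System.Text;

class AsciiFoldDemo
{
    // Rough ASCII fold: decompose, drop combining accents, map dash-like characters to '-'.
    static string ToRoughAscii(string input)
    {
        var sb = new StringBuilder();
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            var cat = CharUnicodeInfo.GetUnicodeCategory(c);
            if (cat == UnicodeCategory.NonSpacingMark)
                continue;                                   // drops the accent from 'á', 'ä', ...
            if (cat == UnicodeCategory.DashPunctuation || c == '\u00AD')
                sb.Append('-');                             // en/em dashes and the soft hyphen become '-'
            else if (c <= 127)
                sb.Append(c);                               // keep plain ASCII
            // anything else is silently dropped (a crude form of //IGNORE)
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(ToRoughAscii("0-8\u201323"));     // en dash -> "0-8-23"
        Console.WriteLine(ToRoughAscii("täst"));            // -> "tast" (note: not the German "taest")
    }
}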
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/, '')
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).

Validating Kana Input

I am working on an application that allows users to input Japanese language characters. I am trying to come up with a way to determine whether the user's input is Japanese (hiragana, katakana, or kanji).
There are certain fields in the application where entering Latin text would be inappropriate and I need a way to limit certain fields to kanji-only, or katakana-only, etc.
The project uses UTF-8 encoding. I don't expect to accept JIS or Shift-JIS input.
Ideas?
It sounds like you basically need to just check whether each Unicode character is within a particular range. The Unicode code charts should be a good starting point.
If you're using .NET, my MiscUtil library has some Unicode range support - it's primitive, but it should do the job. I don't have the source to hand right now, but will update this post with an example later if it would be helpful.
Not sure of a perfect answer, but there is a Unicode range for katakana and hiragana listed on Wikipedia. (Which I would expect are also available from unicode.org.)
Hiragana: Unicode 3040-309F
Katakana: Unicode 30A0-30FF
Checking those ranges against the input should work as a validation for hiragana or katakana for Unicode in a language-agnostic manner.
For kanji, I would expect it to be a little more complicated, as I expect that the Chinese characters used in Chinese and in Japanese are included in the same range, but then again, I may be wrong here. (I wouldn't expect Simplified Chinese and Traditional Chinese to be included in the same range, either...)
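A minimal C# sketch of the range check described above (the ranges are the ones quoted in the answer; the helper names are mine, not MiscUtil's):

using System;
using System.Linq;

static class KanaCheck
{
    // Hiragana U+3040-U+309F, Katakana U+30A0-U+30FF (per the answer above).
    static bool IsHiragana(char c) { return c >= '\u3040' && c <= '\u309F'; }
    static bool IsKatakana(char c) { return c >= '\u30A0' && c <= '\u30FF'; }

    static void Main()
    {
        Console.WriteLine("ひらがな".All(IsHiragana)); // True
        Console.WriteLine("カタカナ".All(IsKatakana)); // True
        Console.WriteLine("latin".All(IsHiragana));    // False
        // Kanji would need a similar check against the CJK Unified Ideographs block (U+4E00-U+9FFF).
    }
}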
oh oh! I had this one once... I had a regex with the hiragana, then katakana and then the kanji. I forget the exact codes, I'll go have a look.
regex is great because you double the problems. And I did it in PHP, my choice for extra strong auto problem generation
--edit--
$pattern = '/[^\wぁ-ゔァ-ヺー\x{4E00}-\x{9FAF}_\-]+/u';
I found this here, but it's not great... I'll keep looking
--edit--
I looked through my portable hard drive.... I thought I had kept that particular snippet from the last company... sorry.

Is it possible to create INTERNATIONAL permalinks?

I was wondering how you deal with permalinks on international sites. By permalink I mean a link which is unique and human-readable.
E.g. for English phrases it's no problem, e.g. /product/some-title/
But what do you do if the product title is in, e.g., Chinese?
How do you deal with this problem?
I am implementing an international site and one requirement is to have human-readable URLs.
Thanks for every comment.
Characters outside a small ASCII subset are not permitted unencoded in URLs according to this spec, so raw Chinese strings would be out immediately (they would have to be percent-encoded).
Where the product name can be localised, you can use urls like <DOMAIN>/<LANGUAGE>/DIR/<PRODUCT_TRANSLATED>, e.g.:
http://www.example.com/en/products/cat/
http://www.example.com/fr/products/chat/
accompanied by a mod_rewrite rule to the effect of:
RewriteRule ^([a-z]+)/products/([a-z]+)/?$ product_lookup.php?lang=$1&product=$2
For the first example above, this rule will call product_lookup.php?lang=en&product=cat. Inside this script is where you would access the internal translation engine (from the lang parameter, en in this case) to do the same translation you do on the user-facing side to translate, say, "Chat" on the French page, "Cat" on the English, etc.
Using an external translation API would be a good idea, but tricky to get a reliable one which works correctly in your business domain. Google have opened up a translation API, but it currently only supports a limited number of languages.
English <=> Arabic
English <=> Chinese
English <=> Russian
Take a look at Wikipedia.
They use national characters in URLs.
For example, the Russian home page URL is http://ru.wikipedia.org/wiki/Заглавная_страница. The browser transparently encodes all non-ASCII characters, replacing them with their percent-encoded codes when sending the URL to the server.
But on the web page all URLs are human-readable.
So you don't need to do anything special -- just put your product names into URLs as is.
The webserver should be able to decode them for your application automatically.
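To make that concrete, a small C# sketch (my illustration) of the encode/decode round trip the browser and web server perform:

using System;

class SlugEncodingDemo
{
    static void Main()
    {
        string slug = "Заглавная_страница";

        // What the browser effectively does: UTF-8 encode, then percent-encode each non-ASCII byte.
        string encoded = Uri.EscapeDataString(slug);
        Console.WriteLine(encoded); // %D0%97%D0%B0%D0%B3... (the underscore stays as-is)

        // The decoding step the web server / framework performs before handing you the value.
        Console.WriteLine(Uri.UnescapeDataString(encoded)); // Заглавная_страница
    }
}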
I usually transliterate the non-ASCII characters. For example, "täst" would become "taest". GNU iconv can do this for you (I'm sure there are other libraries):
$ echo täst | iconv -t 'ascii//translit'
taest
Alas, these transliterations are locale dependent: in languages other than German, 'ä' could be transliterated as simply 'a', for example. But on the other hand, there should be a transliteration into ASCII for every (commonly used) character set.
How about some scheme like /productid/{product-id-number}/some-title/
where the site looks at the {number} and ignores the 'some-title' part entirely. You can put that into whatever language or encoding you like, because it's not being used.
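A quick C# sketch of that scheme (the route pattern and ID are made up for illustration): extract the numeric ID and ignore whatever title text follows.

using System;
using System.Text.RegularExpressions;

class ProductIdRouteDemo
{
    static void Main()
    {
        // The title part can be in any language or encoding; only the number matters.
        string path = "/productid/12345/中文-产品-标题/";

        Match m = Regex.Match(path, @"^/productid/(\d+)/");
        if (m.Success)
        {
            int productId = int.Parse(m.Groups[1].Value);
            Console.WriteLine(productId); // 12345
        }
    }
}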
If memory serves, you're only able to use English letters in URLs. There's a discussion to change that, but I'm fairly positive that it's not been implemented yet.
That said, you'd need to have a look-up table where you assign translations of products/titles into whatever word they'll be in the other language. For example:
foo.com/cat will need a translation look-up for "cat", "gato", "neko", etc.
Then your HTTP module, which parses those human-readable segments into an exact URL, will know which page to serve based upon the translations.
Creating a look-up for such a thing seems like overkill to me. I cannot create a lookup for all the different words in all languages. Maybe accessing a translation API would be a good idea.
So as far as I can see, it's not possible to use foreign chars in the permalink, as the specs for URLs do not allow it.
What do you think of encoding the special chars? Are those URLs recognized by Google then?
