Why are non-ASCII characters still disliked in function / var names?

When I see a small program written for students, I often see something like this (Haskell, German):
ueber = "What the haeck!"
instead of
über = "What the häck!"
As many modern languages are specified to allow non-ASCII characters in declaration names via UTF-8, is there a special reason for avoiding these in a project, which is sure to be only for people who are able to input these characters (say for a team of German students?) or is this just a historical reason?
I know that you should keep names in a-zA-Z_0-9 if you develop an application internationally, but is there any reason for avoiding this in a "local" project?

is there a special reason for avoiding these in a project, which is sure to be only for people who are able to input these characters
That is certainly the main reason. Another reason that comes to mind is that many development tools (search functions, editors, parsers, documentation generators, code search engines, etc.) do not expect non-ASCII input in code.
Also, you never know where your code may be used one day! The smallest innocent school project can grow into a nice open-source tool that gets used around the globe. In that case, ASCII is the lowest common denominator, at least at the moment.

I've had to work on a project started by French developers. They had to spend quite a bit of time translating their program to English when more people joined the project. Teach your German students this lesson up front, and not only will they be able to share their code with others, they'll no longer need an über or ueber variable either.
BTW, ü is an alphabetic character. + and - are non-alphanumeric, and I'd say it's obvious why they're disliked in function names.
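Whether a code base already contains non-ASCII names is easy to check mechanically. The following is only a minimal sketch in Python (chosen for illustration; the question's example is Haskell), using the standard ast module. The helper name non_ascii_names is made up for this example.

```python
import ast


def non_ascii_names(source: str) -> list[str]:
    """Return the variable names in `source` that contain non-ASCII characters."""
    tree = ast.parse(source)
    # ast.Name nodes cover variables that are read or assigned.
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(name for name in names if not name.isascii())


sample = 'über = "What the häck!"\nueber = über\n'
print(non_ascii_names(sample))
```

A check like this could run in a pre-commit hook if a team decides to stay within ASCII; it only looks at variable names, not at function or attribute names.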

Related

Internationalisation - displaying gendered adjectives

I'm currently working on an internationalisation project for a large web application - initially we're just implementing French but more languages will follow in time. One of the issues we've come across is how to display adjectives.
Let's take "Active" as an example. When we received translations back from the company we're using, they returned "Actif(ve)", as English "Active" translates to masculine "Actif" or feminine "Active". We're unsure of how to display this, and wondered if there are any well established conventions in the web development world.
As far as I see it there are three possible scenarios:
1. We know at development time which noun a given adjective is referring to. In this case we can determine and use the correct gender.
2. We're referring to a user, either directly ("you") or in the third person. Short of making every user have a gender, I don't see a better approach than displaying both, i.e. "Actif(ve)"
3. We are displaying the adjective in isolation, not knowing which noun it's referring to. For example in a table of data, some rows might be dealing with a masculine entity, some feminine.
Scenarios 2 and 3 seem to be the toughest ones. Does anyone have any experience handling these issues? Any tips would be appreciated!
This is complex because we cannot anticipate every case, and there is a risk of drifting into an "opinion-based" answer, so I'll keep it short and generic.
Usually I prefer to give the translator context in the string itself, e.g. by providing a template such as _("active {user_name}") (this also lets each language choose its own word order).
Then you may need to change the code and templates into _("active {first_name_feminine}") and _("active {first_name_masculine}") (and possibly more variants for duals, trials, plurals, collectives, honorifics, etc.). Note: check that the translator does not mangle the {} and the string inside it. Usually you need specific export/import scripts. Alternatively, I add a note for the translator inside the string and strip the note when producing the English text. This can also be automated (be creative and delimit such notes with special Unicode characters that should not appear in normal text).
But if you cannot know the gender, Actif(ve) may well be the polite form used in that language. You need a native speaker to test it, and you should expect some back-and-forth changes.
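One way to picture the gender-specific variants is a catalog keyed by message and gender, with the combined form as the fallback when the gender is unknown. This is only a sketch under assumptions: the catalog layout and the function name translate_gendered are hypothetical and not taken from any particular i18n library.

```python
from typing import Optional

# Hypothetical French catalog: one entry per (message, gender) pair.
CATALOG_FR = {
    ("Active", "masculine"): "Actif",
    ("Active", "feminine"): "Active",
    ("Active", None): "Actif(ve)",  # gender unknown: show both forms
}


def translate_gendered(message: str, gender: Optional[str] = None) -> str:
    """Look up `message` for `gender`, then the unknown-gender form,
    and finally fall back to the untranslated source text."""
    for key in ((message, gender), (message, None)):
        if key in CATALOG_FR:
            return CATALOG_FR[key]
    return message


print(translate_gendered("Active", "feminine"))
print(translate_gendered("Active"))
```

Scenario 1 maps to a call with a known gender; scenarios 2 and 3 map to the call without one.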

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), that machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of Kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojuu on jun, or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
宜しくお願いします。
For data, dig through the data files of Google's Japanese IME (Mozc) here:
https://github.com/google/mozc/tree/master/src/data
There is lots of interesting data there, including IPA dictionaries.
Edit:
You may also try MeCab; it can use the IPA dictionary and can convert kanji to katakana for most words:
https://taku910.github.io/mecab/
There are Ruby bindings for it too:
https://taku910.github.io/mecab/bindings.html
And here is somebody who tested Ruby with MeCab, using the tagger option -Oyomi:
http://hirai2.blog129.fc2.com/blog-entry-4.html
Just a quick follow-up to explain the solution we eventually used. Thanks to all who recommended mecab--this appears to have done the trick.
We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a ruby script that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str) # -h1 means to hiragana, -w means output utf8
Using the awesomely convenient new Ruby 1.9.2 version of CSV, combine the input file with the mecab-translated file, so that the resulting file now has extra columns inserted, a la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
Use the data from the resulting .csv file to seed our rails app with its built-in values.
From time to time the client updates the source data, so we will need to do this whenever that happens.
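The normalising and merging steps above can be sketched roughly as follows. This is a Python illustration rather than the Ruby script that was actually used: the readings are typed by hand to stand in for the mecab -Oyomi output, the function names are made up, and the kana conversion relies on the fact that katakana ァ..ヶ sit exactly 0x60 code points above the matching hiragana.

```python
import csv
import io

# Katakana U+30A1..U+30F6 map to hiragana by subtracting 0x60.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def to_hiragana(text: str) -> str:
    """Convert katakana to hiragana; other characters pass through unchanged."""
    return text.translate(KATAKANA_TO_HIRAGANA)


def add_yomigana(source_csv: str, readings_csv: str) -> str:
    """Pair each name with its reading, row by row, as NAME, NAME_YOMIGANA."""
    names = csv.reader(io.StringIO(source_csv))
    readings = csv.reader(io.StringIO(readings_csv))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["NAME", "NAME_YOMIGANA"])
    # Each sample row has exactly one column.
    for (name,), (reading,) in zip(names, readings):
        writer.writerow([name, to_hiragana(reading)])
    return out.getvalue()


source = "北海道\n青森県\n"
readings = "ホッカイドウ\nアオモリケン\n"  # hand-typed stand-in for converter output
print(add_yomigana(source, readings))
```

Characters outside the mapped range, such as the prolonged sound mark ー, are left as they are, so real data may need extra handling.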
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
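Once the readings exist, steps one and three are straightforward. A minimal Python sketch, assuming the readings are already known (hand-typed here for five prefectures) and using plain code point order, which only approximates true gojuon order because voiced and small kana are interleaved with their base characters:

```python
# Hypothetical, hand-typed readings; in practice they would come from a
# tool such as MeCab or from manual transcription.
READINGS = {
    "北海道": "ほっかいどう",
    "青森県": "あおもりけん",
    "東京都": "とうきょうと",
    "大阪府": "おおさかふ",
    "沖縄県": "おきなわけん",
}


def sort_by_reading(names, readings):
    """Sort kanji names by their hiragana reading (plain code point order)."""
    return sorted(names, key=lambda name: readings[name])


print(sort_by_reading(list(READINGS), READINGS))
```

Sorting the kanji strings directly would instead follow their Unicode code points, which has nothing to do with pronunciation.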
EDIT
If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using MeCab is a good idea.
I'll also introduce another method.
If your app is written in Microsoft VBA, you can call the "GetPhonetic" function. It's easy to use.
see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx
Sorting prefectures by their pronunciation is not common. Most Japanese people are used to prefectures sorted by 「都道府県コード」 (prefecture code).
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in "JIS X 0401" or "ISO 3166-2:JP".
see (Wikipedia Japanese) :
http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89
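Sorting by prefecture code needs no readings at all. A small Python sketch using only the five codes quoted above (the full table has 47 entries); the function name is made up for this example:

```python
# Codes taken from the examples above (JIS X 0401); only five of the 47 shown.
PREFECTURE_CODES = {
    "沖縄県": 47,
    "東京都": 13,
    "北海道": 1,
    "大阪府": 27,
    "青森県": 2,
}


def sort_by_code(prefectures):
    """Sort prefecture names by their numeric prefecture code."""
    return sorted(prefectures, key=PREFECTURE_CODES.__getitem__)


for name in sort_by_code(PREFECTURE_CODES):
    print(f"{PREFECTURE_CODES[name]:02d}:{name}")
```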

Why do people use plain English as translation placeholders?

This may be a stupid question, but here goes.
I've seen several projects using some translation library (e.g. gettext) working with plain English placeholders. So for example:
_("Please enter your name");
instead of abstract placeholders (which has always been my instinctive preference)
_("error_please_enter_name");
I have seen various recommendations on SO to work with the former method, but I don't understand why. What I don't get is: what do you do if you need to change the English wording? If the actual text is used as the key for all existing translations, you would have to edit all the translations too, and change each key. Or don't you?
Isn't that awfully cumbersome? Why is this the industry standard?
It's definitely not proper normalization to do it this way. Are there massive advantages to this method that I'm not seeing?
Yes, you have to alter the existing translation files, and that is a good thing.
If you change the English wording, the translations probably need to change, too. Even if they don't, you need someone who speaks the other language to check.
You prep a new version, and part of the QA process is checking the translations. If the English wording changed and nobody checked the translation, it'll stick out like a sore thumb and it'll get fixed.
The main language already exists: you don't need to translate it.
Translators have better context with a real sentence than with vague placeholders.
The placeholders are just keys, so it's still possible to change the wording of the original language by creating a translation for it. When no translation exists, the placeholder itself is used as the displayed text.
We've been using abstract placeholders for a while and it was pretty annoying having to write everything twice when creating a new function. When English is the placeholder, you just write the code in English, you have meaningful output from the start and don't have to think about naming placeholders.
So my reason would be less work for the developers.
I like your second approach. When translating texts you always run into the problem of homonyms. For example, 'open' can describe the state of a window but is also the verb for performing the action. In other languages these homonyms may not exist, which is why you should be able to attach a meaning to your placeholders. The best approach is to store this meaning in your text library. If that is not possible on the platform or framework you use, it may be a good idea to define a 'development language'. This language adds meaning to the text entries, e.g. 'action_open' and 'state_open'. You will of course have to put extra effort into translating this language into plain English (or whatever language you develop for). I have applied this philosophy in some large projects, and in the long run it saves time (and headaches).
The best way in my opinion is keeping meaning separate so if you develop your own translation library or the one you use supports it you can do something like this:
_(i18n("Please enter your name", "error_please_enter_name"));
Where:
i18n(text, meaning)
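A rough idea of how such an i18n(text, meaning) helper could behave, sketched in Python with a hypothetical German catalog. GNU gettext offers a comparable mechanism through message contexts (msgctxt in the catalog, pgettext in code); the dictionary below is only a stand-in for a real catalog.

```python
# Hypothetical German catalog keyed by (meaning, text), so that the two
# senses of "Open" can be translated differently.
CATALOG_DE = {
    ("action_open", "Open"): "Öffnen",
    ("state_open", "Open"): "Geöffnet",
}


def i18n(text: str, meaning: str) -> str:
    """Translate `text` in the given `meaning`; fall back to the source text."""
    return CATALOG_DE.get((meaning, text), text)


print(i18n("Open", "action_open"))
print(i18n("Open", "state_open"))
```

The readable English text stays in the source, while the meaning disambiguates homonyms for the translator.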
Interesting question. I assume the main reason is that you don't have to care about translation or localization files during development as the main language is in the code itself.
Well it probably is just that it's easier to read, and so easier to translate. I'm of the opinion that your way is best for scalability, but it does just require that extra bit of effort, which some developers might not consider worth it... and for some projects, it probably isn't.
There's a fallback hierarchy, from most specific locale to the unlocalised version in the source code.
So French in France might have the following fallback route:
1. fr_FR
2. fr
3. Unlocalised: the source code.
As a result, having proper English sentences in the source code ensures that if a particular translation is not provided in step (1) or (2), you will at least get a proper, understandable sentence rather than random programmer garbage like “error_file_not_found”.
Plus, what do you do if it is a format string: “Sorry but the %s does not exist” ? Worse still: “Written %s entries to %s, total size: %d” ?
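The fallback route described above can be sketched in a few lines of Python. The catalogs and the names fallback_chain and lookup are hypothetical; real libraries such as gettext implement their own lookup, so treat this only as an illustration of the idea.

```python
# Hypothetical catalogs; a real application would load these from files.
CATALOGS = {
    "fr_FR": {"Cancel": "Annuler"},
    "fr": {"Cancel": "Annuler", "File not found": "Fichier introuvable"},
}


def fallback_chain(locale: str) -> list[str]:
    """Return the locales to try in order, e.g. 'fr_FR' -> ['fr_FR', 'fr']."""
    language = locale.split("_")[0]
    return [locale] if language == locale else [locale, language]


def lookup(message: str, locale: str) -> str:
    """Try each catalog in the chain, then fall back to the source string."""
    for name in fallback_chain(locale):
        catalog = CATALOGS.get(name, {})
        if message in catalog:
            return catalog[message]
    return message  # unlocalised: the text from the source code


print(lookup("File not found", "fr_FR"))
print(lookup("Sorry but the %s does not exist", "fr_FR") % "file")
```

With an English sentence as the key, the last line still produces a readable message even though no French translation exists.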
Quite old question but one additional reason I haven't seen in the answers yet:
You could end up with more placeholders than necessary, thus more work for translators and possible inconsistent translations. However, good editors like Poedit or Gtranslator can probably help with that.
To stick with your example:
The text "Please enter your name" could appear in a different context in a different template (that the developer is most likely not aware of and shouldn't need to be). E.g. it could be used not as an error but as a prompt like a placeholder of an input field.
If you use
_("Please enter your name");
it would be reusable, the developer can be unaware of the already existing key for an error message and would just use the same text intuitively.
However, if you used
_("error_please_enter_name");
in a previous template, developers wouldn't necessarily be aware of it and would make up a second key (most likely according to a predefined wording scheme to not end up in complete chaos), e.g.
_("prompt_please_enter_name");
which then has to be translated again.
So I think that doesn't scale very well. A pre-agreed wording scheme of suffixes/prefixes e.g. for contexts can never be as precise as the text itself I think (either too verbose or too general, beforehand you don't know and afterwards it's difficult to change) and is more work for the developer that's not worth it IMHO.
Does anybody agree/disagree?

Steps to develop a multilingual web application

What are the steps to develop a multilingual web application?
Should I store the language texts and resources in a database, or should I use property files or resource files?
I understand that in C# I need to use CurrentCulture along with CultureFormat etc.
I wanted to know your opinions on the steps to build a multilingual web application.
Doesn't have to be language specific. I'm just looking for steps to build this.
The specific mechanisms are different depending on the platform you are developing on.
As a cursory set of work items:
Separation of code from content. Generally, resources are compiled into assemblies with the help of resource files (in .NET) or stored in property files (in Java, though there are other options), or some other location, and referred to by ID. If you want localization costs to be reasonable, you need to avoid changes to the IDs between releases, as most localization tools will treat new IDs as new content.
Identification of areas in the application which make assumptions about the locale of the user, especially date/time, currency, number formatting or input.
Create some mechanism for locale-specific CSS content; not all fonts work for all languages, and not all font-sizes are sane for all languages. Don't paint yourself into a corner of forcing Thai text to be displayed in 8 pt. Also, text directionality is going to be right-to-left for at least two languages.
Design your page content to reflow or resize reasonably when more or less content than you expect is present. Many languages expand 50-80% from English for short strings, and 30-40% for longer pieces of content (that's a rough rule of thumb, not a law).
Identify cultural presumptions made by your UI designers, and try to make them more neutral, or, if you've got money and sanity to burn, localizable. Mailboxes don't look the same everywhere, hand gestures aren't universal, and something that's cute or clever or relies on a visual pun won't necessarily travel well.
Choose appropriate encodings for your supported languages. It's now reasonable to use UTF-8 for all content that's sent to web browsers, regardless of language.
Choose appropriate collation for your databases, or enable alternate collations, if you are dealing with content in multiple languages in your databases. Case-insensitivity works differently in many languages than it does in English, and accent insensitivity is acceptable in some languages and generally inappropriate in others.
Don't assume words are delimited by spaces or that sentences are delimited by punctuation, if you're trying to support search.
Avoid:
Storing localized content in databases, unless there's a really, really, good reason. And then, think again. If you have content that is somewhat dynamic and representatives of each region need to customize it, it may be reasonable to store certain categories of content with an associated locale ID.
Trying to be clever with string concatenation. Also, try not to assume rules about pluralization or counting work the same for every culture. Make sure, at least, that the order of strings (and controls) can be specified with format strings that are typical of your platform, or well documented in your localization kit if you elect to roll your own for some reason.
Presuming that it's ok for code bugs to be fixed by localizers. That's generally not reasonable, at least if you want to deliver your product within a reasonable time at a reasonable cost; it's sometimes not even possible.
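To illustrate why format strings beat concatenation, here is a small Python sketch with named placeholders. The templates and the render helper are hypothetical; the point is only that each translation can place the fields in its own order, which concatenated fragments cannot do.

```python
# Hypothetical templates: the Japanese version puts the total first.
TEMPLATES = {
    "en": "Page {page} of {total}",
    "ja": "{total}ページ中{page}ページ",
}


def render(locale: str, **fields) -> str:
    """Fill the locale's template; fall back to English if it is missing."""
    template = TEMPLATES.get(locale, TEMPLATES["en"])
    return template.format(**fields)


print(render("en", page=2, total=5))
print(render("ja", page=2, total=5))
```

Pluralization is a separate problem that named placeholders do not solve; it usually needs per-language plural rules from the localization library.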
The first step is to internationalize. The second step is to localize. The third step is to translate.

Localization best practices

I'm starting to modify my app, which uses all hardcoded strings for errors, GUI, etc. I'm considering these two approaches, but let me know if there is an even better way:
- Put all strings in resource (.rc) files.
- Define all strings in a file, once for each language. Use a preprocessor define to decide which strings get compiled in.
Which of these two approaches is generally preferred?
Put all the strings in resource files. Once you've done that, there are several good translation packages available. One useful thing these packages do is allow you to get translation done by somebody who doesn't program.
Remember, also, that internationalization (i18n) is a large subject, and there's a lot of things to consider. It isn't just a matter of translating strings. Do a web search on it, at the very least. You might want to read a book on it: I used International Programming for Windows by Schmitt as a guide. It's an old book from Microsoft Press, and I had to get it through a used book service; most of the more modern stuff seems to be on internationalizing .NET apps.
Without knowing more about your project (what sort of software, who the intended audience is, what sort of organization you have, what sort of budget, why you're interested in internationalization, etc.), this is about the most I can tell you.
Generally you see locale specific resource files containing strings referenced by key. Compiling different versions for different locales is a very rigid solution and will be a maintenance nightmare. Using resource files also allows the user to have fallback locales.
There's another approach: just put the strings in the source, wrapped in something like tr(" "), and use one of the tools that extracts and converts them.
It works with any toolkit/GUI library.
You can mark text to be converted and text not to change (such as protocol strings or db keys).
It makes the source easier to read and search, instead of having to look up what IDS_MESSAGE34 means.
One problem with resource files, at least with Windows/MFC, is that you can't use the string table in dialogs. So you have some text in the string table and some in the dialog section, which you have to deal with separately.
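The extraction step of the tr() approach can be pictured with a short Python sketch. It is deliberately naive (a regular expression over double-quoted literals, with a made-up function name and made-up sample source); real extraction tools parse the source properly and handle far more cases.

```python
import re

# Matches tr("...") with a double-quoted literal; escaped quotes are allowed.
TR_CALL = re.compile(r'\btr\(\s*"((?:[^"\\]|\\.)*)"\s*\)')


def extract_marked_strings(source: str) -> list[str]:
    """Return the unique tr("...") literals in order of first appearance."""
    return list(dict.fromkeys(TR_CALL.findall(source)))


sample = '''
label.setText(tr("Please enter your name"));
socket.send("HELO");            // protocol string, deliberately unmarked
button.setText(tr("Cancel"));
status.setText(tr("Cancel"));
'''
print(extract_marked_strings(sample))
```

Only the marked strings end up in the list handed to translators; the unmarked protocol string is left alone.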

Resources