How to handle different prepositions in internationalization? - internationalization

I'm currently working on an application which I plan to translate in the future (right now it's only in English).
I would like to have a resource file for each language and just put the appropriate string where needed. I don't know how to solve the following problem:
Example (English -> French):
It's raining in Paris -> Il pleut à Paris.
It's raining in Massachusetts -> Il pleut au Massachusetts.
(Sorry if the sentences in French are wrong, I translated them using Google Translator. I only want them to help me explain my problem.)
The thing is, the sentences in French have different prepositions depending on where it is raining. Is there any best practice regarding this? I can't just pull "It's raining in" from the resource file and append the location.
The project I'm working on is a web application in JavaScript.
Thanks.

Please provide more detail: what kind of app are you making?
As you stated, you can pull the string up to the variable ("It's raining in"), or you can split the sentence into a "start of sentence", the variable, and an "end of sentence". If a language such as English needs nothing after the variable, use only the "start of sentence" and leave the "end of sentence" empty. That is, if I understood you correctly.
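A minimal sketch of the start/end split described above, written in Python for brevity even though the project is in JavaScript. The resource keys (`raining_start`, `raining_end`) and the `RESOURCES` table are hypothetical names made up for illustration.

```python
# Hypothetical per-language resource tables; the key names are assumptions.
RESOURCES = {
    "en": {"raining_start": "It's raining in ", "raining_end": ""},
    "fr": {"raining_start": "Il pleut à ", "raining_end": "."},
}


def raining_message(lang: str, location: str) -> str:
    """Join "start of sentence" + variable + "end of sentence"."""
    strings = RESOURCES[lang]
    return strings["raining_start"] + location + strings["raining_end"]


if __name__ == "__main__":
    print(raining_message("en", "Paris"))
    print(raining_message("fr", "Paris"))
```

Note that the preposition is still fixed per language here: for Massachusetts the French table would produce "Il pleut à Massachusetts." rather than "au Massachusetts", so the split on its own does not cover the case in the question.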

Related

How to split/extract specific entities automatically to MS Luis?

I am currently working with MS LUIS.ai.
My string/utterance contains both English and Chinese.
Here is the problem:
When a sentence is entirely in English, it works fine in LUIS, probably because an English sentence is composed of words separated by spaces.
However, in Chinese (both Traditional and Simplified), a sentence is composed of words that are concatenated together and are difficult to split.
For example, in English I can write:
I love you so much: there are 5 words here. In LUIS I can select "I love you" and turn it into an entity. Later on, when more phrases like "I love you" go into LUIS, it can identify the related intent easily.
However, in Chinese if I write:
我很喜歡你: which has the same meaning as in English above. Under LUIS it will be counted as 1 word. If I want to extract the word 喜歡 (which means "Love/Like"), I cannot do this in LUIS.
Only if I put space around 喜歡 like this: 我很 喜歡 你 will I be able to select 喜歡 as a particular entity.
My Question:
Is there any way, method, or trick so that, when someone enters a joined string like the Chinese one above, LUIS can identify specific words as entities automatically, without any manual change?
Thank you very much in advance for all your help.
To perform machine learning, LUIS breaks an utterance into tokens based on culture. We cannot suppress tokenization. LUIS tokenizes Chinese at the character level and returns tokenized entities, whereas for English it tokenizes on every space or special character. In the zh-cn culture, LUIS expects the simplified Chinese character set instead of the traditional character set.
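To make the difference concrete, here is a rough sketch of the two tokenization styles described above. It is only an illustration, not LUIS's actual implementation; the function name `tokenize` and the way the culture string is checked are assumptions.

```python
import re


def tokenize(utterance: str, culture: str) -> list[str]:
    """Illustrative tokenizer: per character for zh-*, per word otherwise."""
    if culture.lower().startswith("zh"):
        # Character-level: every non-space character becomes its own token.
        return [ch for ch in utterance if not ch.isspace()]
    # Word-level: split on whitespace, keep special characters as tokens.
    return re.findall(r"\w+|[^\w\s]", utterance)


if __name__ == "__main__":
    print(tokenize("I love you so much", "en-us"))
    print(tokenize("我很喜歡你", "zh-cn"))
```

Under character-level tokenization, 喜歡 is simply two adjacent tokens, with or without surrounding spaces.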
Hope this helps!!

Stanford NERFeatureFactory description

Do you know where I can find more details on the description of the Stanford NERFeatureFactory?
I read the one at: https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html
but I do not understand them all (and some have no description).
For example: usePrev,
useWordPairs,
conjoinShapeNGrams,
useSum, ...
or
(pw,c) (t,c)
There was a similar question 2 years ago without a better description. I was wondering if something new came out since then.
Thanks for your help!
If you look through the source code of NERFeatureFactory you can see what is going on.
The source code is available here: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/ie/NERFeatureFactory.java
For example, useWordPairs creates features for the word under consideration and the previous/next word. You can see this in the code starting on line 1062...
As an example, consider the features for the word "New" in the text "...from New York...". The useWordPairs feature produces the features New-from-W-PW and New-York-W-NW.
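The example above can be mimicked in a few lines. This is only a Python sketch of the feature strings as described in this answer, not a port of the Java source; the function name `word_pair_features` is made up.

```python
def word_pair_features(tokens: list[str], i: int) -> list[str]:
    """Pair the current word with its previous and next word."""
    word = tokens[i]
    features = []
    if i > 0:
        # W-PW: current word + previous word
        features.append(f"{word}-{tokens[i - 1]}-W-PW")
    if i + 1 < len(tokens):
        # W-NW: current word + next word
        features.append(f"{word}-{tokens[i + 1]}-W-NW")
    return features


if __name__ == "__main__":
    print(word_pair_features(["from", "New", "York"], 1))
```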
A lot of the features have descriptions in that file as well.
It's helpful to look through the code and see what is being produced. For instance the conjoinShapeNGrams feature is producing features that attach the overall shape of the word and substrings of the word. You can see fully what is going on by looking at the code.
As an example of conjoinShapeNGrams, consider the name Wordsworth which would get features like worth-Xxxxxxxxxx-CNGram-CS , Words-Xxxxxxxxxx-CNGram-CS, etc...
This feature is capturing the presence of a certain substring and word shape together.
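A rough Python sketch of that idea follows. The shape function and the n-gram enumeration are simplified assumptions (the real CoreNLP code has several word-shape variants and its own n-gram limits), so treat the output format as illustrative only.

```python
def word_shape(word: str) -> str:
    """Simplified shape: X for uppercase, x for lowercase, d for digits."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def conjoined_shape_ngrams(word: str, min_len: int = 2, max_len: int = 6) -> set[str]:
    """Attach the overall word shape to each substring of the word."""
    shape = word_shape(word)
    features = set()
    for start in range(len(word)):
        for end in range(start + min_len, min(len(word), start + max_len) + 1):
            features.add(f"{word[start:end]}-{shape}-CNGram-CS")
    return features


if __name__ == "__main__":
    for feature in sorted(conjoined_shape_ngrams("Wordsworth")):
        print(feature)
```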
(pw, c) refers to "previous word" and "current word", which is linked to the usePrev flag
(t, c) refers to "part of speech tag" and "current word", which is linked to the useTags flag
It doesn't look like useSum does anything anymore...

Is there a standard computer vocabulary for German? for Spanish?

I was given the task of coming up with shorter German words for the German version of our software.
It got me thinking that there should be some sort of standard vocabulary for information technology somewhere. There "have to be" terms that most (if not all) German computer users use for what English speakers call file, database, record, search, search terms, search hits, find and replace, delete, OCR ... you get the idea.
I found ISO 2382 on the ISO Web site, but it only seems to standardize English and French. Is there an equivalent standard for German? How about for Spanish, or for other languages?
I can suggest this book which, although quite dated, was an attempt to come up with a set of standard computer terms for translating from German to English and back:
Grosses IWT-Wörterbuch der Computertechnik und der Wirtschaftsinformatik. Englisch-Deutsch. Deutsch-Englisch
I will offer up the answer, "no".
Even within English, there are no standard words to describe computer operations as you have presented them. Certainly one can "delete" a file, but one can also "erase" it, "remove" it, and (shudder) "move it to the trash can".
Instead of trying to solve the problem in the large, I suggest you solve the problem in the small. Build a glossary of commonly used German words, and whenever there is an opportunity to expand the Glossary, first look over the existing entries and do your best to reuse the current terminology.
In a way, the reason good English documentation works well is that good writers of English use a glossary-like technique, explicitly or implicitly. In the event that much of your documentation comes from a single source, or a related set of sources, you can make a "translation map" of "when they say X, we say Y". But even such simplifications often require native readers to re-read the translation in context, as languages are not nearly regular enough to do simple substitution without many pitfalls.
As a starting point, The Open Group (www.opengroup.org) seems to have defined glossaries as part of their work on The Open Group Architecture Framework (TOGAF), which appear to be the sort of thing I needed. For example, these document numbers and titles are taken directly from their Web site:
C148 TOGAF® 9.1 Translation Glossary: English – Hrvatski (Croatian)
C149 TOGAF® 9.1 Translation Glossary: English – Castilian Spanish
C146 TOGAF® 9.1 Translation Glossary: English – Portuguese (Portugal)
C13H TOGAF® 9.1 Translation Glossary: English – Slovak

formatting a string with html for localization in spring

I have a string with a URL in it.
eg: Under maintenance. Try again after sometime
I want to pass the complete string to the translators.
I see that there are two options
1. label.maintenance = Under maintenance. {0}Try again{1} after sometime (pass with anchor tags as variables)
2. label.maintenance = Under maintenance. <a href="...">Try again</a> after sometime (pass the string with html)
In the first case if the translator misplaces {0} and {1} that may spoil the format of my page.
In the second case, the translator has to understand html and can possibly inject incorrect links.
What is the best way to achieve this?
I have had this problem multiple times before and came to the conclusion that the second option is best. Although it requires that the translator know what HTML is all about, it turns out to give the best results.
It communicates clearly which part is a link, and if the translator needs to re-order the sentence (which happens all too often), (s)he knows which part belongs to the link.
In the first example, the translator wouldn't know why the sentence contains placeholders or what to do with them. At best (s)he will ask. At worst, the translator could re-order them, or move the translated "Try again" somewhere else and put another word between the placeholders.
The conclusion is: leave HTML in translatable resources. At the same time, try to hire experienced software translators as opposed to regular translators. Chances are high that they understand localization concepts and use a common glossary of software terms (e.g. the one from Microsoft). That would make your software easier for end users to understand and reduce the localization defects you will need to fix. In the end, hiring a software translator may cost less than a "cheap" regular one...
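For comparison, here is a small sketch of how the first option from the question would be rendered, plus a check for the failure mode described there (misplaced {0} and {1}). It uses Python's `str.format` as a stand-in for the placeholder substitution a Spring message source would perform; the function names and the `/retry` URL are assumptions.

```python
import re


def placeholders_in_order(message: str, count: int = 2) -> bool:
    """True if {0}, {1}, ... each appear exactly once and in ascending order."""
    found = [int(n) for n in re.findall(r"\{(\d+)\}", message)]
    return found == list(range(count))


def render_option_1(message: str, href: str) -> str:
    """Option 1: the anchor tags are passed in as the {0} and {1} arguments."""
    return message.format(f'<a href="{href}">', "</a>")


if __name__ == "__main__":
    source = "Under maintenance. {0}Try again{1} after sometime"
    print(placeholders_in_order(source))
    print(render_option_1(source, "/retry"))
    # A translation with swapped placeholders is caught before rendering.
    print(placeholders_in_order("{1}Try again{0}"))
```

A check like this can catch swapped or missing placeholders, but it cannot tell whether the right words ended up between them, which is the concern raised in this answer.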

Displaying language lists: Which language should I use? [closed]

Closed 10 years ago.
Every once in a while I'm confronted with displaying a list of available languages, and each and every time I ask myself:
Is it better to display the language in:
the currently selected language
English
in the language according to the button/list item
Examples:
English
German
French
or
English
Deutsch
Français
Is there any convention on which one should be used, is more polite or better in any other way? Are there other options?
I would say it's best to display the language in "its own language" (option #3). You can not necessarily expect the user to know the currently selected language, nor expect him to know English.
What's tricky is how to display the "Select your language" button in a language-neutral way. I usually go for a flag indicating the current language, since that tends to get the message across even though there's not always a 1:1 mapping between countries and languages.
I definitely think you should display in the language that matches the item in the button list.
Reasons:
If it's not the language you're interested in, you won't mind if you don't understand it, as long as you can find your own language.
Think about the last time you called customer service. How many times have you heard something like, "Para Espanol, marque dos"? It's very common, accepted practice to mix different languages in one UI (whether visual or audible).
Think about how you'd feel if you went to a Spanish site, and you couldn't find your language under "E". Maybe, eventually, you'd notice "Ingles", and think it probably translated to "English", but it's definitely better to save the user the trouble of translating and mentally alphabetizing.
The standard (in both senses of the word, i.e. what is actually used in the real world, and what the IETF/W3C/ISO says) is to use ISO 639-1 alpha-2 language codes, possibly augmented with the full name of the language in English, its name in the language itself, a romanized transliteration of that name, or any combination thereof.
So, to keep with your example:
[de] German - Deutsch
[en] English
[fr] French - Français
[ja] Japanese - 日本語 (Nihongo)
Two options, first the name of the language in the selected locale or English, then the name of the language in itself between parens, or the other way around, e.g.:
English
French (Français)
German (Deutsch)
Spanish (Español)
or
English
Français (French)
Deutsch (German)
Español (Spanish)
English language name:
Pros:
Predictable sorting.
No need to think about different text flows.
Cons:
Users who don't speak English might have a harder time finding their language.
If the rest of the application is translated, it might look sloppy or grammatically wrong: Ditt språk är English/votre langue est English.
Language in its own name:
Pros:
Easier for the non-English speaker.
You have to think about encoding and text flow; A useful exercise. :-)
Cons:
Harder to navigate if the user is used to English or has her mind set on finding an English name.
You have to consider all language variants.
What is right really depends on the rest of your application. You might want to consider having all language names translated into all languages. If English is chosen, then you get to pick from:
English
Swedish
French
If Swedish:
Engelska
Svenska
Franska
...and French:
Anglais
Suédois
Français
But then the translation problem has grown from O(n) to O(n^2), which might be acceptable depending on what your current value of n is.
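The growth is easy to see with the three languages above. In this sketch the nested dictionary layout (UI language, then listed language) is an assumption made for illustration.

```python
# NAMES[ui_language][listed_language] -> name shown to the user
NAMES = {
    "en": {"en": "English", "sv": "Swedish", "fr": "French"},
    "sv": {"en": "Engelska", "sv": "Svenska", "fr": "Franska"},
    "fr": {"en": "Anglais", "sv": "Suédois", "fr": "Français"},
}


def endonyms(names: dict[str, dict[str, str]]) -> dict[str, str]:
    """Each language in its own name: n strings."""
    return {code: names[code][code] for code in names}


def total_translations(names: dict[str, dict[str, str]]) -> int:
    """Every language name in every language: n * n strings."""
    return sum(len(row) for row in names.values())


if __name__ == "__main__":
    print(endonyms(NAMES))
    print(total_translations(NAMES))
```

With 3 languages the full table holds 9 names against 3 for the own-name list; every added language adds a whole row and a whole column.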
EDIT
As deceze points out, you will also have to handle the case when a user accidentally switches to a language she doesn't understand, and provide a way back, for example by always including a few major languages.
I find it harder to find "Magyar" in a list of languages.
Because there are languages with non-Latin character sets, this is not a simple first-letter lookup, as I lose focus when I first meet one of these.
Where should I look? At 'M' - Magyar? But where is M? EDIT: M in the (current language's) alphabet, not on the keyboard.
Have a look at this (from Wikipedia):
Български - I know, this is Bulgarian, but
བོད་ཡིག - what is this?
Bosanski
Català
Česky
Dansk
Deutsch
Ελληνικά
I would prefer something like this:
A...
B...
C...
.
.
Hungarian (Magyar)
If the UI were in Japanese, I would be Ctrl+F-ing "Magyar", though.
Whatever you do don't use the IP location to set the language.
Google is very annoying about this: when logging on from a new location I get Google in the local language and script. This is particularly annoying anywhere southeast of Croatia.
The worst offender, though, is Microsoft. When trying to purchase software, their servers keep switching languages depending on your location, which in many cases makes it impossible to pay for anything by credit card, as the addresses, zip codes, etc. are validated in the local format and not in the format of the country where your credit card was actually issued. (By the way, MS: the first four digits of a credit card number indicate the issuing institution, which is tied to a particular country, so it's not rocket science to work out that a UK postcode format is required rather than, say, a six digit German ZIP code.)
Use country flags in combination with the language name in that language (Deutsch, Français, Nederlands, ...).
I don't know of any programming-related conventions about this, but I would prefer to see the name of a language in its own language.
For example:
English
Türkçe
Deutsch
Have a look at your Regional Settings.
This is how Microsoft implemented it. Seems like your version 1.
Image: http://www.freeimagehosting.net/uploads/1c14f9f60d.jpg

Resources