AutoML Translation Supported languages

I have a question about AutoML Translation.
In the list of supported languages we did not find our language. Can we add Kazakh to create our dataset (for example, translation from Russian to Kazakh)?

It seems that they only support the languages in this list.
Note that AutoML Translation is in beta, so the product and its supported languages list may change.

Related

How to enable Latin names instead of local ones for some places in Carto Services?

Some places on the map are labeled with Cyrillic names, but I need only English/Latin names for places on the map; however, sometimes only local names exist. How can I implement this?
P.S.: I have spotted this issue with Belarusian and partly with Russian places.
Screenshot
About languages in general: it depends on which languages a specific place name is tagged with. OpenStreetMap always has a "local" variant in the local primary language, and the CARTO Mobile SDK uses this by default, but the data also includes other languages, so you can control it as follows.
CartoVectorTileLayer (both CartoOnlineVectorTileLayer and CartoOfflineVectorTileLayer are subclasses of it) has a setLanguage(String) method to select the language, so e.g.:
layer.setLanguage("en");
will give you English-language maps.
With SDK 4.0.2 and the nutiteq.osm tile source you can use the following languages: local/default, en, es, de, fr, it, ru, zh (Chinese), tr (Turkish), and et (Estonian).
With the latest CARTO SDK 4.1.0 and the new carto.streets source you can use any OSM language. I would suggest configuring the map based on the device language settings, with something like:
// Android
layer.setLanguage(Locale.getDefault().getLanguage());
// iOS / Xamarin
layer.Language = Foundation.NSLocale.PreferredLanguages[0].Substring(0, 2);
What if a specific name is not available in the given language? Then the MapView falls back to the 'local' language by default, so the map will not be empty. But what if the 'local' language is still unreadable and you would prefer Latin-alphabet names? In SDK 4.1.0 you can configure primary and secondary fallback languages: e.g. you set the primary language to 'de' for German users, then, to avoid unfamiliar alphabets (say Hebrew, Greek, most of Asia), set 'en' as the primary fallback; 'local' is then used only if both your primary and English names are missing:
layer.FallbackLanguage = "en";
Now, I know you want automatically transliterated / Romanized names, so that even if the source data from OpenStreetMap has names in Cyrillic only (Russia, Belarus, etc.), the map would show them in Latin characters. This is not exactly the same as translation; e.g. Moscow would become Moskva under Romanization, but it can be helpful in many cases, really with all non-Latin scripts, especially those from Asia (Chinese etc.). The problem is that many languages, including Russian, have several competing Romanization rules, so even if we wanted to, we could not do it at the general SDK map-rendering level. The CARTO SDK could provide an API for the app to apply its preferred transliteration table, but we do not have this yet. The SDK is open source and you are welcome to provide a patch for the feature. I added an issue in the project for this: https://github.com/CartoDB/mobile-sdk/issues/147
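For completeness, here is a minimal Java sketch of the kind of app-side transliteration table meant here. This is not a CARTO SDK API (as said, the SDK has no such hook yet); the class name and the deliberately incomplete character map are only illustrative, and a real app would pick the Romanization standard its users expect:

import java.util.HashMap;
import java.util.Map;

public final class Romanizer {
    // Deliberately incomplete Cyrillic-to-Latin table; real Romanization
    // rules differ per standard (BGN/PCGN, ISO 9, GOST, ...).
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('а', "a");
        TABLE.put('в', "v");
        TABLE.put('к', "k");
        TABLE.put('м', "m");
        TABLE.put('о', "o");
        TABLE.put('с', "s");
        TABLE.put('ж', "zh");
        TABLE.put('ш', "sh");
    }

    // Replaces every mapped Cyrillic character and leaves the rest as-is.
    // Case is folded for brevity; a real version would preserve it.
    public static String romanize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (char c : name.toCharArray()) {
            String latin = TABLE.get(Character.toLowerCase(c));
            out.append(latin != null ? latin : c);
        }
        return out.toString();
    }
}

With this table, Romanizer.romanize("Москва") returns "moskva", matching the Moscow/Moskva example above.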

Does LibShortText work with other languages too?

LibShortText is an open source tool for short-text classification and analysis.
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/
I have tried to figure out whether it also works with languages other than English (e.g. German), but I didn't find a hint.
Who knows the answer? Thank you in advance.
I think so (but it may need some extra preprocessing). LIBSVM and LIBLINEAR are both language-agnostic, and since LibShortText is built on top of LIBLINEAR, it should work for other languages too.
According to this paper, it has internal pre-processing methods to extract features.
libshorttext.converter: For given short texts, LibShortText follows the bag-of-word model to generate features. Users apply procedures in this library to pre-process short texts by tokenization, stemming (optional), and stop-word removal (optional). The library also allows users to choose between unigram and bigram features.
However, it looks like its stemming and stop-word removal support only English. So if you want better features extracted from non-English text, you might want to use your own pre-processing methods, for example using NLTK.
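NLTK itself is a Python library; purely to illustrate the steps named in the quote above (tokenization, stop-word removal, unigram and bigram features), here is a hedged Java sketch. The six-entry stop-word list is a placeholder and stemming is omitted:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public final class GermanPreprocessor {
    // Tiny illustrative stop-word list; a real German list has hundreds of entries.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("der", "die", "das", "und", "ist", "ein"));

    // Tokenize on non-letters (keeps umlauts), lowercase, drop stop words,
    // then emit unigram and bigram features for a bag-of-words model.
    public static List<String> features(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.GERMAN).split("[^\\p{L}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        List<String> feats = new ArrayList<>(tokens);          // unigrams
        for (int i = 0; i + 1 < tokens.size(); i++) {          // bigrams
            feats.add(tokens.get(i) + "_" + tokens.get(i + 1));
        }
        return feats;
    }
}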

Is there a standard computer vocabulary for German? for Spanish?

I was given the task of coming up with shorter German words for the German version of our software.
It got me to thinking that there should be some sort of standard vocabulary for information technology somewhere. Like there "have to be" terms that most (if not all) German computer users use for what English-speakers call file, database, record, search, search terms, search hits, find and replace, delete, OCR ... you get the idea.
I found ISO 2382 on the ISO Web site, but it only seems to standardize English and French. Is there an equivalent standard for German? How about for Spanish, or for other languages?
I would suggest this book, which, although quite dated, was an attempt to come up with a set of standard computer terms for translating between German and English:
Grosses IWT-Wörterbuch der Computertechnik und der Wirtschaftsinformatik. Englisch-Deutsch. Deutsch-Englisch
I will offer up the answer, "no".
Even within English, there are no standard words to describe computer operations as you have presented them. Certainly one can "delete" a file, but one can also "erase" it, "remove" it, and (shudder) "move it to the trash can".
Instead of trying to solve the problem in the large, I suggest you solve it in the small. Build a glossary of commonly used German words, and whenever there is an opportunity to expand the glossary, first look over the existing entries and do your best to reuse the current terminology.
In a way, the reason good English documentation works well is that good writers of English use a glossary-like technique, explicitly or implicitly. In the event that much of your documentation comes from a single source, or a related set of sources, you can make a "translation map" of "when they say X, we say Y". But even such simplifications often require native readers to re-read the translation in context, as languages are not nearly regular enough for simple substitution to work without many pitfalls.
As a starting point, The Open Group (www.opengroup.org) seems to have defined glossaries as part of their work on The Open Group Architecture Framework (TOGAF), which appear to be the sort of thing I needed. For example, these document numbers and titles are taken directly from their Web site:
C148 TOGAF® 9.1 Translation Glossary: English – Hrvatski (Croatian)
C149 TOGAF® 9.1 Translation Glossary: English – Castilian Spanish
C146 TOGAF® 9.1 Translation Glossary: English – Portuguese (Portugal)
C13H TOGAF® 9.1 Translation Glossary: English – Slovak

Language codes for simplified Chinese and traditional Chinese?

We are creating multi-language subsites on our website.
I would like to use the 2-letter language codes. Spanish and French are easy. They will get URLs like:
mydomain.com/es
mydomain.com/fr
but I run into a problem with Traditional and Simplified Chinese. Are there standards for which 2-letter codes to use for these languages?
mydomain.com/zh
mydomain.com/?
@dkarp gives an excellent general answer. I will add some additional specifics regarding Chinese:
There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc). The standard way to distinguish these would be with a country code, e.g. zh_CN for mainland China, zh_SG for Singapore, zh_TW for Taiwan, or zh_HK for Hong Kong.
Mainland China and Singapore both use simplified characters, and the others use traditional characters. Since China and Taiwan are the two with the biggest populations, just zh_CN and zh_TW are often used to distinguish the simplified and traditional character versions of a website.
More technically correct, though not commonly used in practice, would be to use zh_Hans for (generic) simplified Chinese characters and zh_Hant for traditional Chinese characters, reserving country codes for the rare cases when it is meaningful to distinguish different countries.
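As a hedged aside, these script subtags are understood directly by Java's java.util.Locale (Java 7+); note that BCP 47 tags use '-' rather than '_':

import java.util.Locale;

public class ScriptTags {
    public static void main(String[] args) {
        Locale simplified = Locale.forLanguageTag("zh-Hans-CN");
        Locale traditional = Locale.forLanguageTag("zh-Hant-TW");

        System.out.println(simplified.getScript());   // Hans
        System.out.println(traditional.getScript());  // Hant

        // Exact wording depends on the JDK's locale data,
        // e.g. "Chinese (Simplified, China)":
        System.out.println(simplified.getDisplayName(Locale.ENGLISH));
    }
}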
There is indeed a standard representation for this. As people have run into the exact same problem you are seeing -- same language, but different dialects or characters -- they've extended the two-letter language code with a two-letter region code. So you might have a universal French page at mydomain.com/fr, but internationalizing for French Canadian readers might leave you with mydomain.com/fr_CA (Canada) and mydomain.com/fr_FR (France). Some platforms use a dash instead of an underscore to separate the language and region codes (hence fr-CA and fr-FR).
The standard locale for simplified Chinese is zh_CN. The standard locale for traditional Chinese is zh_TW.
I hesitate to point you towards the actual BCP 47 standards documents, as they're, uh, a little heavy on the detail and a little light on the readability. Just go with standard locale identifiers, like the ones used by Java, and you'll be fine.
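A minimal Java sketch of turning that idea into URL paths, assuming a hypothetical set of supported locales and the underscore convention from the question:

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PathPicker {
    // The site's available translations (hypothetical set for this sketch).
    private static final List<Locale> SUPPORTED = Arrays.asList(
            Locale.forLanguageTag("es"),
            Locale.forLanguageTag("fr"),
            Locale.forLanguageTag("fr-CA"),
            Locale.forLanguageTag("zh-Hans"),
            Locale.forLanguageTag("zh-Hant"));

    // Map a visitor's locale to a URL path such as /fr or /fr_CA,
    // using '_' in the path even though BCP 47 tags use '-'.
    static String pathFor(Locale requested) {
        for (Locale candidate : SUPPORTED) {
            if (candidate.equals(requested)) {
                return "/" + candidate.toLanguageTag().replace('-', '_');
            }
        }
        // Fall back to a language-only match, then to a default.
        for (Locale candidate : SUPPORTED) {
            if (candidate.getLanguage().equals(requested.getLanguage())) {
                return "/" + candidate.toLanguageTag().replace('-', '_');
            }
        }
        return "/en";
    }

    public static void main(String[] args) {
        System.out.println(pathFor(Locale.forLanguageTag("fr-CA"))); // /fr_CA
        System.out.println(pathFor(Locale.forLanguageTag("fr-FR"))); // /fr
    }
}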
I'm just going to leave this here.
CODE         LANG      FORM             REGION
zh           Chinese   -                -
zh_Hans      Chinese   Han Simplified   -
zh_Hans_CN   Chinese   Han Simplified   China
zh_Hans_HK   Chinese   Han Simplified   Hong Kong SAR China
zh_Hans_MO   Chinese   Han Simplified   Macau SAR China
zh_Hans_SG   Chinese   Han Simplified   Singapore
zh_Hant      Chinese   Han Traditional  -
zh_Hant_HK   Chinese   Han Traditional  Hong Kong SAR China
zh_Hant_MO   Chinese   Han Traditional  Macau SAR China
zh_Hant_TW   Chinese   Han Traditional  Taiwan
Language is dependent upon where it is spoken (doh!), so language and locale codes reflect that reality. zh is the basic language code, but because there are two major written forms of it, there are also zh_Hans and zh_Hant; these are still only language codes, not locales.
Location-specific
To fully specify which language is used in a particular location, the country code still has to be suffixed, giving zh_Hans_HK and zh_Hant_HK for simplified and traditional Chinese, respectively, as used in Hong Kong.
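In Java, for instance, these pieces can be assembled separately with Locale.Builder; a small sketch using values from the table above:

import java.util.Locale;

public class BuildTag {
    public static void main(String[] args) {
        // Language, script, and region are kept as separate fields.
        Locale hk = new Locale.Builder()
                .setLanguage("zh")
                .setScript("Hant")
                .setRegion("HK")
                .build();
        System.out.println(hk.toLanguageTag()); // zh-Hant-HK
        System.out.println(hk.getScript());     // Hant
    }
}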
In reality, something more specific than a country code is often required within many countries, but that would greatly increase the complexity and maintenance of databases like CLDR, and the supporting infrastructure to feed into them, like IP-to-location lookup, is not generally available or accurate enough.
Fixed text
Now, if the code is just to specify which set of fixed strings to use in the user interface, or even whole page sets on a site, a country suffix is not really necessary, unless there are more than a few places where the language varies significantly enough (location-based info) to justify creating a whole separate resource set.
The larger the resource set, the more likely that a language code based upon locale (in this context just a language attribute rather than a true locale, so call it what you like!) will be required, but at least you only have to do that when necessary.
On-the-fly values
However, if you want to format particular variable values, like dates, times, currencies, and numbers, on the fly, locales become important, because all the tools that support such functionality (like those based upon Unicode CLDR data) expect them. The locale for these needs to be a setting separate from the code that selects the in-house-generated UI language, unless you want to create a resource set for every known locale and maintain them ad nauseam!
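To make the separation concrete, a minimal Java sketch with the UI language and the formatting locale held as two independent settings (the chosen values are only examples):

import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

public class OnTheFlyFormats {
    public static void main(String[] args) {
        // Fixed UI strings come from our own resources, keyed by language only.
        String uiLanguage = "en";
        System.out.println("UI resource set: " + uiLanguage);

        // Variable values are formatted with a full, CLDR-backed locale.
        Locale formatting = Locale.forLanguageTag("zh-Hant-TW");

        NumberFormat currency = NumberFormat.getCurrencyInstance(formatting);
        System.out.println(currency.format(1234.5)); // e.g. NT$1,234.50, depending on the JDK

        DateTimeFormatter date = DateTimeFormatter
                .ofLocalizedDate(FormatStyle.LONG)
                .withLocale(formatting);
        System.out.println(LocalDate.now().format(date));
    }
}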
Browser language tools
Note that when you specify a locale for an editable part of a web page, such as an input box, and spellcheck has been enabled for the field via attributes or CSS, the browser's language tools will spellcheck the field according to that locale.
Criteria
You have to be clear about what the resource set is providing, so consider:
Fixed strings? Language only.
Formatting on-the-fly? Locale.
Spellchecking in the viewing environment? Locale.
Whole pages/subsite? Language only, else locale (as a language variant) if significantly different content is required.
Spreadsheet to minimise maintenance overhead
I use a spreadsheet to hold UI strings, where each language code has a parent code, so that the cell for a language's version of a string has a formula that fetches the string from the parent. To create a custom string for that language, I just overwrite the cell formula with the exact text. That minimises the amount of resource maintenance. At the end I run a macro that generates a complete resource file for each language.
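The same parent-fallback lookup is easy to sketch in code; a minimal Java version, with an invented parent table and strings:

import java.util.HashMap;
import java.util.Map;

public class FallbackStrings {
    // Each code's parent; a missing entry means it is a root code.
    private static final Map<String, String> PARENT = new HashMap<>();
    // STRINGS.get(code) holds only the strings overridden for that code.
    private static final Map<String, Map<String, String>> STRINGS = new HashMap<>();
    static {
        PARENT.put("en_GB", "en");
        PARENT.put("zh_Hant_HK", "zh_Hant");
        PARENT.put("zh_Hant", "zh");

        // Map.of requires Java 9+.
        STRINGS.put("en", Map.of("save", "Save", "color", "Color"));
        STRINGS.put("en_GB", Map.of("color", "Colour")); // only the override
    }

    // Walk up the parent chain until some table defines the key.
    static String lookup(String code, String key) {
        for (String c = code; c != null; c = PARENT.get(c)) {
            Map<String, String> table = STRINGS.get(c);
            if (table != null && table.containsKey(key)) {
                return table.get(key);
            }
        }
        return "??" + key + "??"; // surface missing strings loudly
    }

    public static void main(String[] args) {
        System.out.println(lookup("en_GB", "color")); // Colour
        System.out.println(lookup("en_GB", "save"));  // Save (inherited)
    }
}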

Which ISO format should I use to store a user's language code?

Should I use ISO 639-1 (2-letter abbreviation) or ISO 639-2 (3-letter abbreviation) to store a user's language code? Both are official standards, but which is the de facto standard in the development community? I think ISO 639-1 would be easier to remember, and is probably more popular for that reason, but that's just a guess.
The site I'm building will have a separate site for the US, Brazil, Russia, China, & the UK.
http://en.wikipedia.org/wiki/ISO_639
You should use IETF language tags because they are already used for HTTP/HTML/XML and many other technologies. They are based on several standards, including the ISO 639 collection (yes, language, region, and culture selection is not so simple to define).
I wrote a more detailed article regarding proper language-code selection and usage. The idea is to use the simplest/shortest ISO 639-1 codes and to specify more only for special cases. The article includes codes for the ~30 most-used languages, with reasons why I consider one alternative better than another.
In case you want to skip reading the entire article here is a short list of language codes (not to be confused with country codes): ar, cs, da, de, el, en, en-gb, es, fr, fi, he, hu, it, ja, ko, nb, nl, pl, pt, pt-pt, ro, ru, sv, tr, uk, zh, zh-hant
The following points may not be obvious but should be borne in mind:
en is used for en-us (American English); British English uses en-gb
pt is used for pt-br, not pt-pt, which has far fewer speakers
zh is used instead of zh-hans, zh-CN,...
zh-hant (Traditional Chinese) is used instead of more specific codes like zh-hant-TW or zh-TW (see the sketch below)
You can find more explanations inside the article.
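A hedged sketch of the fallback rules above, using Java's built-in BCP 47 matching (the supported-tag list is hypothetical):

import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Locale.LanguageRange;

public class TagLookup {
    public static void main(String[] args) {
        // Tags this hypothetical site supports, matching the list above.
        List<String> supported =
                Arrays.asList("en", "en-gb", "pt", "pt-pt", "zh", "zh-hant");

        // A user preferring Traditional Chinese as written in Taiwan.
        List<LanguageRange> prefs =
                LanguageRange.parse("zh-hant-tw,zh-hant;q=0.9,en;q=0.5");

        // RFC 4647 lookup truncates from the right, so zh-hant-tw
        // falls back to zh-hant, which is supported.
        System.out.println(Locale.lookupTag(prefs, supported)); // zh-hant
    }
}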
I would go with a derivative of ISO 639. Specifically, I like to use this: http://en.wikipedia.org/wiki/IETF_language_tag
I'm no expert, but every site I've ever seen uses ISO 639-1, including the current site I'm working on.
It works for us!
I've only ever seen 2-character language codes in use - so I'd recommend going with them unless your work involves delving into linguistics in some way. If all you're doing is customizing the browsing experience for the world at large, you won't need the extra repertoire offered by 3-character codes.
ISO 639-1 alpha-2 codes are used pretty much universally.
They are used, for example, in HTTP content negotiation (sketched at the end of this answer). If you ever wondered how an international website can automatically show you its homepage in your native language, that's how it works. (Although it's sometimes kind of annoying. I, for example, often get shown the default Apache homepage in German, because the webmaster turned on content negotiation but only put in content for English.)
Most web browsers use them directly in their settings dialog box.
Most operating systems use them in their settings dialog boxes or configuration files.
Wikipedia uses them in their server names for the different language versions.
In other words: if your users aren't native English speakers, they will probably already have encountered them when configuring their software, because otherwise they wouldn't be able to use their computers.
The other members of the ISO 639 family are mostly of interest to linguists. Unless you expect Jesus Christ himself (ISO 639-2 alpha-3 code arc) to visit your website, or maybe Klingons (tlh), ISO 639-1 has more languages than you can ever hope to support.
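To round off the content-negotiation example mentioned above, here is a minimal Java sketch of matching an Accept-Language header against the languages a site actually has (the header value and language set are illustrative):

import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Locale.LanguageRange;

public class Negotiate {
    public static void main(String[] args) {
        // Typical Accept-Language header sent by a German-language browser.
        String header = "de-DE,de;q=0.9,en;q=0.6";

        // The languages this site actually has content for.
        List<Locale> available = Arrays.asList(
                Locale.forLanguageTag("en"),
                Locale.forLanguageTag("de"),
                Locale.forLanguageTag("ru"));

        List<LanguageRange> ranges = LanguageRange.parse(header);
        Locale best = Locale.lookup(ranges, available);
        // Serve the German pages; fall back to English if nothing matched.
        System.out.println(best != null ? best : Locale.ENGLISH);
    }
}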
