What standard do language codes of the form "zh-Hans" belong to? - internationalization

Through the REST API of an application, I receive language codes of the following form: ll-Xxxx.
two lowercase letters for the language (looks like ISO 639-1),
a dash,
a code of up to four letters, starting with an uppercase letter (looks like an ISO 639-3 macrolanguage code).
Some examples:
az-Arab Azerbaijani in the Arabic script
az-Cyrl Azerbaijani in the Cyrillic script
az-Latn Azerbaijani in the Latin script
sr-Cyrl Serbian in the Cyrillic script
sr-Latn Serbian in the Latin script
uz-Cyrl Uzbek in the Cyrillic script
uz-Latn Uzbek in the Latin script
zh-Hans Chinese in the simplified script
zh-Hant Chinese in the traditional script
From what I found online:
[ISO 639-1] is the first part of the ISO 639 series of international standards for language codes. Part 1 covers the registration of two-letter codes.
and
ISO 639-3 is an international standard for language codes. In defining some of its language codes, some are defined as macrolanguages [...]
Now I need to write a piece of code to verify that I receive a valid language code.
But since what I receive is a mix of ISO 639-1 (two-letter language) and ISO 639-3 (macrolanguage), which standard am I supposed to stick with? Do these codes belong to some sort of combined (perhaps common) standard?

The current reference for identifying languages is IETF BCP 47, which combines IETF RFC 5646 and RFC 4647.
Codes of the form ll-Xxxx combine an ISO 639-1 language code (two letters) and an ISO 15924 script code (four letters). BCP 47 recommends that language codes be written in lower case and that script codes be written "lowercase with the initial letter capitalized", but this is basically for readability.
BCP 47 also recommends using the shortest available ISO 639 code for the language subtag. So if a language has both an ISO 639-1 code (two letters) and an ISO 639-3 code (three letters), then you should use the ISO 639-1 one.

Following RFC 5646 (page 4), a language tag can be written in the following form: [language]-[script].
language (2 or 3 letters) is the shortest ISO 639 code
script (4 letters) is an ISO 15924 code (see also the relevant RFC section)
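Putting this into practice, here is a minimal syntactic check in Python. It is a sketch: it validates only the language-script shape described above, not whether each subtag actually exists in the IANA Language Subtag Registry.

```python
import re

# Minimal syntax check for tags of the form described above:
# a 2- or 3-letter ISO 639 language code, a hyphen, and a
# 4-letter ISO 15924 script code with an initial capital.
# Shape only; a full validator would also check each subtag
# against the IANA Language Subtag Registry.
TAG_RE = re.compile(r"[a-z]{2,3}-[A-Z][a-z]{3}")

def is_valid_tag(tag: str) -> bool:
    return TAG_RE.fullmatch(tag) is not None
```

Note that BCP 47 itself treats tags case-insensitively (the casing is only a formatting recommendation), so you may want to normalize case before matching rather than rejecting, say, "zh-hans" outright.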

Related

GoogleAPI translation to sr-Latn translating to sr-Cyrilic

Serbian has 2 alphabets, Latin and Cyrillic.
Is Latin supported, and how do I get it?
According to this post:
https://support.google.com/translate/thread/1836538/google-translate-to-serbian-latin?hl=en
it should work with the explicit, longer language codes:
instead of just 'sr' there are 'sr-Latn' and 'sr-Cyrl'.
But even with sr-Latn it still translates into the Cyrillic alphabet.

How to differ Chinese with GetLocaleInfo?

I want to get an ISO 639-1 language string from an LCID. The problem is that 2052 (Simplified Chinese) and 1028 (Traditional Chinese) both return zh (Chinese) instead of zh-CN and zh-TW.
The code I use is
WCHAR locale[8];
GetLocaleInfoW(lcid, LOCALE_SISO639LANGNAME, locale, 8);
Is there a way to get the right code?
ISO 639-1 specifies two-letter language codes, so GetLocaleInfo() correctly returns "zh" for both Simplified and Traditional Chinese: they are not differentiated in the ISO 639-1 spec.
A call with LOCALE_SNAME instead always returns a string that also contains the sub-tag, e.g. "de-DE" or "de-AT".
Everything else, for example a two-letter tag for "most" languages and a longer xx-YY tag for some "exceptions" (like Chinese, and which other ones?), is something custom and would therefore require custom code.
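Since such logic is necessarily custom, one way to write it is a hand-maintained override table keyed by LCID, shown here as a Python sketch. The two LCID values come from the question; the table is illustrative, not exhaustive.

```python
# Hand-maintained override table mapping "exception" LCIDs to longer
# tags; anything not listed falls back to the plain ISO 639-1 code
# returned by LOCALE_SISO639LANGNAME. Illustrative, not exhaustive.
LCID_OVERRIDES = {
    2052: "zh-CN",  # Simplified Chinese
    1028: "zh-TW",  # Traditional Chinese
}

def lcid_to_tag(lcid: int, iso639_code: str) -> str:
    """Return the override for a known exception LCID, otherwise
    the two-letter code obtained from the system call."""
    return LCID_OVERRIDES.get(lcid, iso639_code)
```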

Nifi ftp fails with path non-existent

Using nifi ListFTP and GetFTP processors I can access remote ftp directories and files as expected, except for this path:
/Oa 45° 25t 32rn
I get a non-existent path error. Other paths with spaces work fine. (and other clients 'filezilla' work fine with this path.) However, Nifi does not. If it's the degree char °, how do I escape it? I've tried:
"/Oa 45° 25t 32rn"
'/Oa 45° 25t 32rn'
'"'/Oa 45° 25t 32rn'"'
/Oa\ 45°\ 25t\ 32rn
Oa%2045%C2%B0%2025t%2032rn (url encoding, trying it all)
Any ideas why this is failing and how to resolve? Thanks.
I do not have an FTP server with a directory containing non-ASCII characters, so I cannot test this explicitly, but I would check that the path is being sent as UTF-8: the degree sign ° is U+00B0, and its UTF-8 encoding is the byte pair 0xC2 0xB0.
From FileZilla Character Encoding:
The FTP protocol is specified in RFC 959, which was published in 1985.
The FTP protocol is designed on top of the original Telnet protocol,
which is specified in RFC 854. The relevant sections of the Telnet
specification regarding FTP are those covering the Network Virtual
Terminal (NVT). According to RFC 854, the NVT requires the use of
(7-bit) ASCII as the character set. Use of any other character set
requires explicit negotiation. This character set only contains 127
different characters: English letters and numbers, punctuation
characters and a few control characters. Accented letters, umlauts or
other scripts are not contained in the ASCII character set.
In order to support non-English characters, the FTP specifications
were extended in 1999 in RFC 2640. This extension requires the use of
UTF-8 as the character set. This character set is a strict superset of
ASCII, every valid ASCII character is also the same character in
UTF-8. The UTF-8 character set can display any valid Unicode
character. That includes umlauts, accented letters and also different
scripts. This extension is fully backwards compatible with RFC 959.
As long as you're using only English characters, it doesn't matter if
the software you are using supports RFC 2640 or not. However, if you
use non-English characters without using RFC 2640 compatible software,
there will be problems--problems which are entirely self-made by not
obeying the specifications.
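To make concrete what RFC 2640 implies for the path in this question, here is a small Python sketch. It only inspects encodings locally; it does not talk to an FTP server.

```python
# The degree sign in the failing path is U+00B0. Under RFC 2640 the
# path should travel as UTF-8, where U+00B0 is the byte pair C2 B0;
# in a single-byte encoding such as Latin-1 it is the lone byte B0.
# A client and server that disagree on the encoding will end up
# looking up two different directory names.
path = "/Oa 45\u00b0 25t 32rn"

utf8_bytes = path.encode("utf-8")      # contains b"\xc2\xb0"
latin1_bytes = path.encode("latin-1")  # contains b"\xb0"
```

If the client (or NiFi processor) sends the Latin-1 bytes while the server stored the UTF-8 name, the "non-existent path" error follows even though the path looks identical on screen.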

What is VBS UCASE function doing to Japanese?

In order to avoid case conflicts comparing strings on an ASP classic site, some inherited code converts all strings with UCASE() first. This seems to work well across languages ... except Japanese. Here's a simple example on a Japanese string. I've provided the UrlEncoded values to make it clear how little is changing behind the scenes:
Server.UrlEncode("戦艦帝国") = %E6%88%A6%E8%89%A6%E5%B8%9D%E5%9B%BD
UCASE("戦艦帝国") = ƈ�ȉ�Ÿ�ś�
Server.UrlEncode(UCASE("戦艦帝国")) = %C6%88%A6%C8%89%A6%C5%B8%9D%C5%9B%BD
So is UCASE doing anything sensible with this Japanese string? Or is its behavior buggy, undefined, or known to be incompatible with Japanese?
(LCASE leaves the sample string alone. But now I'm wary of switching all comparisons to LCASE because I don't know if it bungles other non-western languages that do work with UCASE....)
https://msdn.microsoft.com/en-us/library/1systdcy(v=vs.84).aspx
Only lowercase letters are converted to uppercase; all uppercase letters and non-letter characters remain unchanged.
https://en.wikipedia.org/wiki/Letter_case
Most Western languages (particularly those with writing systems based on the Latin, Cyrillic, Greek, Coptic, and Armenian alphabets) use letter cases in their written form as an aid to clarity. Scripts using two separate cases are also called bicameral scripts. Many other writing systems make no distinction between majuscules and minuscules – a system called unicameral script or unicase.
"lowercase or uppercase letters" does not apply in Chinese-Japanese-Korean languages, hence, the output of UCase() should remain unchanged.

ISO 3166 code conversion - Alpha to numeric

I have a situation where I need to convert between ISO 3166 country codes.
For example, using the ISO 3166-1 alpha-3 standard for country codes, IOT is the alpha code for the British Indian Ocean Territory and 086 is its numeric equivalent.
Another example, this time with ISO 4217 currency codes: 'UZS' is the alpha code for the Uzbekistani som and 860 is its numeric equivalent.
You can find machine-processable lists of ISO 3166 country codes in a few places, e.g.:
in plain text format: http://download.geonames.org/export/dump/countryInfo.txt
in JSON: https://github.com/mledoze/countries (check the file countries.json, which contains much more than just country codes; the README describes its structure).
See also Full list of ISO ALPHA-2 and ISO ALPHA-3 country codes on GIS Stack Exchange.
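A minimal conversion sketch in Python; in practice you would load the full mapping from one of the machine-readable lists above. The two entries here are just illustrations (IOT/086 from the question, plus UZB/860 for Uzbekistan).

```python
# Sketch: alpha-3 -> numeric country code conversion via a small
# hand-built table. Keep the numeric codes as strings so leading
# zeros ("086") are preserved.
ALPHA3_TO_NUMERIC = {
    "IOT": "086",  # British Indian Ocean Territory
    "UZB": "860",  # Uzbekistan
}

def alpha3_to_numeric(code: str) -> str:
    return ALPHA3_TO_NUMERIC[code.upper()]
```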

Resources