Serbian has two alphabets, Latin and Cyrillic.
Is Latin supported, and how do I get it?
According to this post:
https://support.google.com/translate/thread/1836538/google-translate-to-serbian-latin?hl=en
it should work with the explicit, longer language code:
instead of just 'sr' there are 'sr-Latn' and 'sr-Cyrl'.
But even with sr-Latn it still translates into the Cyrillic alphabet.
I am creating a GS1 DataBar code in ZPL and I can't find a way to encode the FNC1 character that terminates variable-length GS1 Application Identifiers (GS1 AIs). To be honest, it is not a necessity. GS1 DataBar is mostly used for fresh foods and other groceries, and so far I have noticed only one variable-length GS1 AI (10, batch/lot) in regular use. I haven't done any research, though, so I may be wrong. Nevertheless, I wondered whether it is possible to insert an FNC1 character in ZPL. In other programming languages it is possible to include it, but I had no luck with ZPL. It seems that GS1 DataBar does not work well with hex commands. When I used the hex Group Separator [GS] _1D, the code didn't even render. Other FNC1 characters, like _1 from GS1 DM or >8 from GS1-128, do not work, as expected.
I found this answer on Zebra support, but it did not render in the Labelary ZPL viewer, so I am not sure whether it works. I tried including the # character both directly and as a hex character, but with no success.
My ZPL code:
^XA
^FO100,100^BRN,6,4,,,6
^FD010858000000000931030001251722022210ABC123^FS
^XZ
What I wonder is how to include, for example, a serial number AI (21) after the batch AI (10) at the end of the code.
I have a Ruby script, and when I try to replace accented characters, gsub doesn't work for me:
My folder name is "Réé Ab"
name = File.basename(Dir.getwd)
name.downcase!
name.gsub!(/[àáâãäå]/,'a')
name.gsub!(/æ/,'ae')
name.gsub!(/ç/, 'c')
name.gsub!(/[èéêë]/,'e')
name.gsub!(/[ìíîï]/,'i')
name.gsub!(/[ýÿ]/,'y')
name.gsub!(/[òóôõö]/,'o')
name.gsub!(/[ùúûü]/,'u')
The output is "réé ab". Why are the accented characters still there?
Each é in your folder name is actually made of two different Unicode codepoints: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).
p 'é'.each_codepoint.map{|e|"U+#{e.to_s(16).upcase.rjust(4,'0')}"} * ' ' # => "U+0065 U+0301"
However, the é in your regex is only one codepoint: U+00E9 (LATIN SMALL LETTER E WITH ACUTE). Wikipedia has an article about Unicode equivalence. The official Unicode FAQ also contains explanations and information about this topic.
How to normalize Unicode strings in Ruby depends on its version. Ruby has had built-in Unicode normalization support since 2.2, so you don't have to require a library or install a gem as in previous versions (here's an overview). To normalize name, simply call String#unicode_normalize with :nfc or :nfkc as the argument to compose é (U+0065 and U+0301) into é (U+00E9):
name = File.basename(Dir.getwd)
name.unicode_normalize! # thankfully :nfc is the default
name.downcase!
Of course, you could also use decomposed characters in your regular expressions, but that probably won't work on other file systems, and you would still have to normalize, this time with NFD or NFKD to decompose.
I should also point out that converting é to e or ü to u causes information loss. For example, the German word Müll (trash) would be converted to Mull (mull / forest humus).
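Putting the pieces together, a minimal sketch of the normalize-then-replace approach might look like the following. The folder name is written with explicit escapes so that the decomposed form is visible, and strip_accents is a hypothetical helper name, not something Ruby provides.

```ruby
# Hypothetical helper: compose to NFC first, then replace accented letters.
# Assumes Ruby 2.4+ so that downcase also handles non-ASCII letters.
def strip_accents(name)
  name.unicode_normalize(:nfc)
      .downcase
      .gsub(/[\u00E0-\u00E5]/, 'a')  # à á â ã ä å
      .gsub(/\u00E6/, 'ae')          # æ
      .gsub(/\u00E7/, 'c')           # ç
      .gsub(/[\u00E8-\u00EB]/, 'e')  # è é ê ë
      .gsub(/[\u00EC-\u00EF]/, 'i')  # ì í î ï
      .gsub(/[\u00F2-\u00F6]/, 'o')  # ò ó ô õ ö
      .gsub(/[\u00F9-\u00FC]/, 'u')  # ù ú û ü
      .gsub(/[\u00FD\u00FF]/, 'y')   # ý ÿ
end

# "Réé Ab" with each é stored as e + COMBINING ACUTE ACCENT
decomposed = "Re\u0301e\u0301 Ab"
puts strip_accents(decomposed)  # prints "ree ab"
```

Without the unicode_normalize call, the character classes never match, because the string contains a plain e followed by a combining accent rather than the single precomposed letter.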
I tagged this character-encoding and text because I know that if you type 'and' == 'and' into the Rails console, or almost any other programming language, you will get true. However, when one of my users pastes his text into my website, I can't spell check it properly or verify its originality via Copyscape because of some issue with the text (or maybe with my understanding of text encoding).
EXAMPLE:
If you copy and paste the following line into the rails console you will get false.
'аnd' == 'and' #=> false
If you copy and paste the following line into the rails console you will get true even though they appear exactly the same in the browser.
'and' == 'and' #=> true
The difference is, in the first example, the first 'аnd' is copied and pasted from my user's text that is causing the issues. All the other instances of 'and' are typed into the browser.
Is this an encoding issue?
How to fix my issue?
This isn't really an encoding problem. In the first case the strings compare as false simply because they are different.
The first character of the first string isn't a “normal” a; it is actually U+0430 CYRILLIC SMALL LETTER A. The first two bytes (208 and 176, or 0xD0 and 0xB0 in hex) are the UTF-8 encoding of this character. It just happens to look exactly like a “normal” Latin a, which is U+0061 LATIN SMALL LETTER A.
Here’s the “normal” a: a, and this is the Cyrillic a: а, they appear pretty much identical.
The fix for this really depends on what you want your application to do. Ideally you would want to handle all languages, and so you might want to just leave it and rely on users to provide reasonable input.
You could replace the character in question with a Latin a using e.g. gsub. The problem with that is that many other characters look similar to the more familiar ones. If you choose this route, you would be better off looking for a library/gem that does it for you, and you might find you're too strict about conversions.
Another option could be to choose a set of Unicode scripts that your application supports and refuse any characters outside those scripts. You can check for this fairly easily with Ruby's regular expression script support, e.g. /\p{Cyrillic}/ will match all Cyrillic characters.
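As a rough sketch of the script-based check, assuming the application only wants to accept Latin letters, something like this could flag the offending characters. Both helper names are hypothetical.

```ruby
# Hypothetical helper: returns every letter that is not in the Latin script.
def non_latin_letters(text)
  text.scan(/\p{L}/).reject { |char| char.match?(/\p{Latin}/) }
end

# Hypothetical helper: true when a word mixes Latin and Cyrillic letters,
# which is a common sign of a homoglyph sneaking in.
def mixed_latin_cyrillic?(word)
  word.match?(/\p{Latin}/) && word.match?(/\p{Cyrillic}/)
end

pasted = "\u0430nd"  # Cyrillic а followed by Latin n and d
typed  = "and"

p non_latin_letters(pasted).length  # => 1
p non_latin_letters(typed)          # => []
p mixed_latin_cyrillic?(pasted)     # => true
p mixed_latin_cyrillic?(typed)      # => false
```

Whether to reject such input, warn the user, or leave it alone is an application decision; the check only tells you the characters are there.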
The problem is not with encodings. A single file or a single terminal can only have a single encoding. If you copy and paste both strings into the same source file or the same terminal window, they will get inserted with the same encoding.
The problem is also not with normalization or folding.
The first string has 4 octets: 0xD0 0xB0 0x6E 0x64. The first two octets are a two-octet UTF-8 encoding of a single Unicode codepoint, the third and fourth octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0430 U+006E U+0064.
These three codepoints resolve to the following three characters:
CYRILLIC SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
The second string has 3 octets: 0x61 0x6E 0x64. All three octets are one-octet UTF-8 encodings of Unicode code points.
So, the string consists of three Unicode codepoints: U+0061 U+006E U+0064.
These three codepoints resolve to the following three characters:
LATIN SMALL LETTER A
LATIN SMALL LETTER N
LATIN SMALL LETTER D
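The byte and codepoint breakdown above can be reproduced with a short sketch. The pasted string is written with an explicit escape so the Cyrillic letter is unambiguous, and codepoint_labels is a hypothetical helper name.

```ruby
# Hypothetical helper: formats each codepoint of a string as U+XXXX.
def codepoint_labels(text)
  text.codepoints.map { |cp| format("U+%04X", cp) }
end

pasted = "\u0430nd"  # starts with CYRILLIC SMALL LETTER A
typed  = "and"       # starts with LATIN SMALL LETTER A

p pasted.bytes.map { |b| format("0x%02X", b) }  # => ["0xD0", "0xB0", "0x6E", "0x64"]
p codepoint_labels(pasted)                      # => ["U+0430", "U+006E", "U+0064"]
p codepoint_labels(typed)                       # => ["U+0061", "U+006E", "U+0064"]
p pasted == typed                               # => false
```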
Really, there is no problem at all! The two strings are different. With the font you are using, a Cyrillic a looks the same as a Latin a, but as far as Unicode is concerned, they are two different characters. (And in a different font, they might even look different!) There's really nothing you can do from an encoding or Unicode perspective, because the problem is not with encodings or Unicode.
This is called a homoglyph, two characters that are different but have the same (or very similar) glyphs.
What you could try to do is transliterate all strings into Latin (provided that you can guarantee that nobody ever wants to enter non-Latin characters), but really, the questions are:
Where does that Cyrillic a come from?
Maybe it was meant to be a Cyrillic a and really should be treated as not equal to a Latin a?
And depending on the answers to those questions, you might either want to fix the source, or just do nothing at all.
This is a very hot topic for browser vendors, BTW, because nowadays someone could register the domain google.com (with one of the letters switched out for a homoglyph) and you wouldn't be able to spot the difference in the address bar. This is called a homograph attack. That's why browsers may fall back to displaying the Punycode form of a domain name when the Unicode form looks suspicious.
I think it is an encoding issue; you can try something like this.
irb(main):010:0> 'and'.each_byte {|b| puts b}
97
110
100
=> "and"
irb(main):011:0> 'аnd'.each_byte {|b| puts b} #copied and
208
176
110
100
=> "аnd"
Through the REST API of an application, I receive language codes of the following form: ll-Xxxx.
a two-letter lowercase language code (looks like ISO 639-1),
a dash,
a code going up to four letters, starting with an uppercase letter (looks like an ISO 639-3 macrolanguage code).
Some examples:
az-Arab Azerbaijani in the Arabic script
az-Cyrl Azerbaijani in the Cyrillic script
az-Latn Azerbaijani in the Latin script
sr-Cyrl Serbian in the Cyrillic script
sr-Latn Serbian in the Latin script
uz-Cyrl Uzbek in the Cyrillic script
uz-Latn Uzbek in the Latin script
zh-Hans Chinese in the simplified script
zh-Hant Chinese in the traditional script
From what I found online:
[ISO 639-1] is the first part of the ISO 639 series of international standards for language codes. Part 1 covers the registration of two-letter codes.
and
ISO 639-3 is an international standard for language codes. In defining some of its language codes, some are defined as macrolanguages [...]
Now I need to write a piece of code to verify that I receive a valid language code.
But since what I receive is a mix of 639-1 (two-letter language) and 639-3 (macrolanguage), which standard am I supposed to stick with? Do these codes belong to some sort of mixed (perhaps common) standard?
The current reference for identifying languages is IETF BCP 47, which combines IETF RFC 5646 and RFC 4647.
Codes of the form ll-Xxxx combine an ISO 639-1 language code (two letters) and an ISO 15924 script code (four letters). BCP 47 recommends that language codes be written in lower case and that script codes be written "lowercase with the initial letter capitalized", but this is basically for readability.
BCP 47 also recommends that the language code should be the shortest available ISO 639 tag. So if a language is represented in both ISO 639-1 (two letters) and ISO 639-3 (three letters), then you should use the ISO 639-1 code.
Following RFC 5646 (page 4), a language tag can be written in the following form: [language]-[script].
language (2 or 3 letters) is the shortest ISO 639 code
script (4 letters) is an ISO 15924 code (see also RFC section)
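A minimal sketch of a shape check for such [language]-[script] tags could look like this. It only verifies the form described above and normalizes the casing; it does not confirm that the language or script is actually registered, and it ignores the other subtags BCP 47 allows (region, variant, and so on). The constant and method names are hypothetical.

```ruby
# Matches "language-script": 2 or 3 letters, a dash, then 4 letters.
# Case-insensitive, since the casing is a readability convention only.
LANGUAGE_SCRIPT_TAG = /\A(?<language>[a-z]{2,3})-(?<script>[a-z]{4})\z/i

# Hypothetical helper: returns the normalized parts, or nil if the shape is wrong.
def parse_language_tag(tag)
  match = LANGUAGE_SCRIPT_TAG.match(tag)
  return nil unless match

  {
    language: match[:language].downcase,  # e.g. "sr"
    script: match[:script].capitalize     # e.g. "Latn"
  }
end

p parse_language_tag("sr-Latn")  # => {:language=>"sr", :script=>"Latn"} (newer Rubies print {language: "sr", script: "Latn"})
p parse_language_tag("sr_Latn")  # => nil
```

For strict validation you would additionally look up the two parts in lists derived from the IANA Language Subtag Registry, which this sketch does not include.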
What kind of keyboard layout can type ISO 8859-1 characters?
Examples of what needs to be typed:
Ánam àbìa èbèa Ógbuá
First of all: Keyboard layouts and character sets are not directly tied to each other. If I type Ü on my keyboard while in a UTF-8 application, the resulting character will be a UTF-8 character. If I type it in an ISO-8859-1 application, it will be a character from that character set.
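To illustrate the point in Ruby (a sketch, not tied to any particular keyboard layout): the same Ü is stored as different bytes depending on the encoding, and a string can be tested for whether it fits into ISO-8859-1 at all. The helper name latin1_encodable? is hypothetical.

```ruby
u_umlaut = "\u00DC"  # Ü as a UTF-8 string

p u_umlaut.bytes                       # => [195, 156]  (two bytes in UTF-8)
p u_umlaut.encode("ISO-8859-1").bytes  # => [220]       (one byte in ISO-8859-1)

# Hypothetical helper: true if every character exists in ISO-8859-1.
def latin1_encodable?(text)
  text.encode("ISO-8859-1")
  true
rescue Encoding::UndefinedConversionError
  false
end

# The sample text from the question, written with escapes for clarity
sample = "\u00C1nam \u00E0b\u00ECa \u00E8b\u00E8a \u00D3gbu\u00E1"
p latin1_encodable?(sample)      # => true
p latin1_encodable?("\u0430nd")  # => false (Cyrillic а is not in ISO-8859-1)
```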
That said, there isn't a keyboard layout that covers all ISO-8859-1 characters; every country layout covers a part of them.
Full list of characters
According to Wikipedia, ISO-8859-1 covers the following languages' special characters in full:
Afrikaans, Albanian, Basque, Breton, Catalan, English (UK and US), Faroese, Galician, German, Icelandic, Irish (new orthography), Italian, Kurdish (The Kurdish Unified Alphabet), Latin (basic classical orthography), Leonese, Luxembourgish (basic classical orthography), Norwegian (Bokmål and Nynorsk), Occitan, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swahili, Swedish, Walloon
so you can safely assume that the keyboard layouts of those countries cover a part of ISO-8859-1.
This is what I have decided to do. Hope it puts somebody else on the right footing.
With Special thanks to #Pekka for the patience, guidance and support.
// Replaces combination char with special chars
$phrase = "`U `are ^here tod`ay.";
$search = array("`U", "`a", "^h");
$replace = array("û", "ñ", "à");
$resulte = str_replace($search, $replace, $phrase);
Could be cleaner in a function though