I recently read the UTF-8 Everywhere manifesto, a document arguing for handling text with UTF-8 by default. The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
However, some modern languages that use the UTF-8 default have built-in codepoint types, such as rune in Go and char in Rust.
What are these types actually useful for? Are they legacy from times before the meaninglessness of codepoints was broadly understood? Or is that an incomplete perspective?
Text has many different meanings and usages, so the question is difficult to answer.
First, about the term codepoint: we use "codepoint" because it is convenient, it implies a number (a code), and it is not easily confused with other terms. Unicode tells us that it does not use the terms codepoint and character consistently, but also that this is not a problem: the context makes it clear, and they are often interchangeable (except for the few codepoints which are not characters, like surrogates, and a few reserved codepoints). Note: Unicode is mostly about characters, and ISO 10646 was mostly about codepoints. The original ISO standard was a table of numbers (codepoints) and names, and Unicode is about the properties of characters. So we may say codepoint where "Unicode character" would be better, but "character" is easily confused with the C char type, and with font glyphs and graphemes.
Codepoints are a basic unit, so they are useful for most programs, e.g. to store text in databases, to exchange it with other programs, to save files, for sorting, etc. For exactly these reasons programming languages use the codepoint as a type. UTF-8 code units would be an alternative, but they are harder to navigate (think of UTF-8 as a tape you must read sequentially, and codepoint text as a hard disk where you can jump into the middle of the text). The analogy is not 100% accurate, because you may still need some context bytes. If you are receiving user text, your program probably does not need to split it into graphemes, handle ligatures, etc. if it will just store the data in a database. The codepoint is really low level and therefore fast for most operations.
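As a small illustration of that difference in Go (the question's rune type), here is a minimal sketch contrasting the byte (UTF-8 code unit) view of a string with the rune (codepoint) view; the sample string is arbitrary:

package main

import "fmt"

func main() {
    s := "héllo" // "é" occupies two UTF-8 bytes but is a single codepoint

    // Byte view: indices are UTF-8 code units, so multi-byte characters
    // are split and cannot be interpreted in isolation.
    for i := 0; i < len(s); i++ {
        fmt.Printf("byte %d: 0x%x\n", i, s[i])
    }

    // Rune view: range decodes the string into codepoints, reporting the
    // byte offset at which each codepoint starts.
    for i, r := range s {
        fmt.Printf("codepoint at byte %d: %U %q\n", i, r, r)
    }
}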
The other part of handling text is displaying it (or speech). This part is very complex, because we have many different scripts with very different rules, and then different languages with their own special cases. So we need a series of libraries, e.g. a text-layout engine (word separation, etc., like Pango), a shaping engine (to find which glyph to use, handle combining characters, decide where to put the next character, e.g. HarfBuzz), and a font library which renders the glyphs (Cairo plus FreeType). It is complex, but most programmers do not need special handling: they just read text from a database and send it to the screen, so they use the relevant library (which depends on the operating system) and move on. It is too complex for a language specification (and also a moving target; maybe in 30 years things will be more standardized). Because it is complex and involves many operations, we may use richer structures (an array of arrays of codepoints, i.e. an array of graphemes) without much of a slowdown. Note: fonts have codepoint tables used for various operations before the glyph index is found. Various APIs accept Unicode strings (as codepoint arrays, UTF-16, UTF-8, etc.).
Naturally things get more complex, and require a lot of knowledge of the different parts of Unicode, if you are writing an editor (WYSIWYG, but also for terminals): you mix both worlds, and you need much more information (e.g. for text selection). In that case you must create your own structures.
And really, things are complex: do you just want to show the first x characters on your blog (maybe as a preview), or split at word boundaries (some languages are not so linear, so the interpretation may be very wrong)? For now only humans can do a good job for all languages, so there is also no need yet for a supporting type in these languages.
The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
Where? It merely outlines advantages and disadvantages of code points. Two examples are:
Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character Ω, and must be treated identically.
Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
In other words: code points just index which graphemes Unicode supports.
Sometimes they're meant as single characters: one prominent example would be € (EURO SIGN), having only the code point U+20AC.
Sometimes the same character has multiple code-points as per context: the dollar sign exists as:
﹩ = U+FE69 (SMALL DOLLAR SIGN)
＄ = U+FF04 (FULLWIDTH DOLLAR SIGN)
💲 = U+1F4B2 (HEAVY DOLLAR SIGN)
Storage-wise, when searching for one variant you might want to match all 3 variants instead of relying on the exact code point only.
Sometimes multiple code points can be combined to form up a single character:
á = U+00E1 (LATIN SMALL LETTER A WITH ACUTE), also termed "precomposed"
á = combination of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) - in a text editor trying to delete á (from the right side) will mostly result in actually deleting the acute accent first. Searching for either variant should find both variants.
Storage-wise, you avoid having to search for both variants by performing Unicode normalization, i.e. NFC, to always favor precomposed code points over sequences of combining code points that form one character.
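As a minimal Go sketch of that, assuming the golang.org/x/text/unicode/norm package is available, both spellings of á from above can be brought to the same NFC form before storing or comparing:

package main

import (
    "fmt"

    "golang.org/x/text/unicode/norm"
)

func main() {
    precomposed := "\u00E1" // á as a single code point
    decomposed := "a\u0301" // a + combining acute accent

    fmt.Println(precomposed == decomposed) // false: different code point sequences

    // Normalizing both to NFC favours the precomposed form,
    // so the comparison (and any index lookup) now matches.
    fmt.Println(norm.NFC.String(precomposed) == norm.NFC.String(decomposed)) // true
}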
As for homoglyphs code points clearly distinguish the contextual meaning:
A = U+0041 (LATIN CAPITAL LETTER A)
Α = U+0391 (GREEK CAPITAL LETTER ALPHA)
А = U+0410 (CYRILLIC CAPITAL LETTER A)
Copy the greek or cyrillic character, then search this website for that letter - it will never find the other letters, no matter how similar they look. Likewise the latin letter A won't find the greek or cyrillic one.
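A quick way to see this is to print the code points: despite looking alike, the three letters have distinct values, which is why a search for one never matches the others (minimal Go sketch):

package main

import "fmt"

func main() {
    for _, r := range []rune{'A', 'Α', 'А'} { // Latin, Greek, Cyrillic
        fmt.Printf("%q = %U\n", r, r)
    }
    fmt.Println('A' == 'Α') // false: distinct code points, so searches never mix them
}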
Writing-system-wise, code points can be shared by multiple languages: the CJK portion is an attempt to use as few code points as possible while supporting as many languages as possible - Chinese (simplified, traditional, Hong Kong), Japanese, Korean, Vietnamese:
今 = U+4ECA
入 = U+5165
才 = U+624D
Dealing with code points as a programmer has valid reasons. Programming languages which support them may (or may not) handle encodings correctly (UTF-8 vs. UTF-16 vs. ISO-8859-1) and may (or may not) correctly produce surrogates for UTF-16. Text-wise, users should not be concerned about code points, although it would help them distinguish homographs.
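Whether surrogates are handled correctly is easy to check from code; this Go sketch encodes a code point outside the Basic Multilingual Plane (the heavy dollar sign from above) to UTF-16 and back using the standard unicode/utf16 package:

package main

import (
    "fmt"
    "unicode/utf16"
)

func main() {
    r := rune(0x1F4B2) // 💲 HEAVY DOLLAR SIGN, outside the BMP

    // Encoding to UTF-16 yields a surrogate pair, not a single code unit.
    units := utf16.Encode([]rune{r})
    fmt.Printf("%#04x\n", units) // two code units: high and low surrogate

    // Decoding the pair recovers the original code point.
    fmt.Printf("%U\n", utf16.Decode(units))
}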
Is there some established convention for sorting lines (characters)? Some convention that would play a role similar to the one PCRE plays for regular expressions.
For example, if you try to sort 0A1b-a2_B (each character on its own line) with Sublime Text (Ctrl-F9) and Vim (:%sort), the result will be the same (see below). However, I'm not sure it will be the same in other editors and IDEs.
-
0
1
2
A
B
_
a
b
Generally, characters are sorted based on their numeric value. While this used to apply only to ASCII characters, it has been adopted by Unicode encodings as well. http://www.asciitable.com/
If no preference is given to the contrary, this is the de facto standard for sorting characters. Save for the actual alphabetical characters, the ordering is somewhat arbitrary.
There are two main ways of sorting character strings:
Lexicographic: numeric value of either the codepoint values or the code unit values or the serialized code unit values (bytes). For some character encodings, they would all be the same. The algorithm is very simple but this method is not human-friendly.
Culture/Locale-specific: an ordinal database for each supported culture is used. For the Unicode character set, it's called the CLDR. Also, when sorting Unicode, the sort can respect grapheme clusters. A grapheme cluster is a base codepoint followed by a sequence of zero or more non-spacing marks (applied as extensions of the previous glyph).
For some older character sets with one encoding, designed for only one or two scripts, the two methods might amount to the same thing.
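A small Go sketch of the two methods, assuming the golang.org/x/text/collate and golang.org/x/text/language packages are available; the word list is arbitrary:

package main

import (
    "fmt"
    "sort"

    "golang.org/x/text/collate"
    "golang.org/x/text/language"
)

func main() {
    words := []string{"zebra", "Ähre", "apple", "Zucker"}

    // Lexicographic: compares raw byte/codepoint values, so "Ähre"
    // sorts after all ASCII letters.
    lex := append([]string(nil), words...)
    sort.Strings(lex)
    fmt.Println(lex)

    // Locale-specific: a collator built from CLDR data for German
    // places "Ähre" among the words starting with A.
    loc := append([]string(nil), words...)
    collate.New(language.German).SortStrings(loc)
    fmt.Println(loc)
}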
Sometimes, people read a format into strings, such as a sequence of letters followed by a sequence of digits, or one of several date formats. These are very specialized sorts that need to be applied where users expect them. Note: the ISO 8601 date format (which uses the Gregorian calendar) sorts correctly regardless of method (for all? character encodings).
I'm building an online store in Go. As would be expected, several important pieces need to record exact monetary amounts. I'm aware of the rounding problems associated with floats (e.g. 0.3 cannot be represented exactly).
The concept of currency seems easy to just represent as a string. However, I'm unsure of what type would be most appropriate to express the actual monetary amount in.
The key requirements would seem to be:
Can exactly express decimal numbers down to a specified number of decimal places, based on the currency (some currencies use more than 2 decimal places: http://www.londonfx.co.uk/ccylist.html )
Obviously basic arithmetic operations are needed - add/sub/mul/div.
Sane string conversion - which would essentially mean conversion to its decimal equivalent. Also, internationalization would need to be at least possible, even if all of the logic for that isn't built in (1.000 in Europe vs 1,000 in the US).
Rounding, possibly using alternate rounding schemes like Banker's rounding.
Needs to have a simple and obvious way to correspond to a database value - MySQL in my case. (Might make the most sense to treat the value as a string at this level in order to ensure it's value is preserved exactly.)
I'm aware of math/big.Rat and it seems to solve a lot of these things, but its string output won't work as-is, since it will output in the "a/b" form. I'm sure there is a solution for that too, but I'm wondering if there is some sort of existing best practice that I'm not aware of (couldn't easily find) for this sort of thing.
UPDATE: This package looks promising: https://code.google.com/p/godec/
You should keep i18n decoupled from your currency implementation. So no, don't bundle everything in a struct and call it a day. Mark what currency the amount represents but nothing more. Let i18n take care of formatting, stringifying, prefixing, etc.
Use an arbitrary precision numerical type like math/big.Rat. If that is not an option (because of serialization limitations or other barriers), then use the biggest fixed-size integer type you can to represent the amount of atomic money in whatever currency you are representing – cents for USD, yen for JPY, rappen for CHF, cents for EUR, and so forth.
When using the second approach, take extra care not to incur overflows, and define a clear and meaningful rounding behaviour for division.
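A minimal sketch of that second approach, using a hypothetical Money type (the name and methods are made up for illustration) that stores an int64 count of minor units plus an ISO 4217 code; the remainder-spreading rule in Split is just one possible rounding policy:

package main

import (
    "errors"
    "fmt"
)

// Money stores an exact amount as minor units (cents for USD/EUR, yen for JPY).
type Money struct {
    Amount   int64  // count of minor units
    Currency string // ISO 4217 code, e.g. "USD"
}

var ErrCurrencyMismatch = errors.New("currency mismatch")

// Add returns the sum, refusing to mix currencies.
func (m Money) Add(o Money) (Money, error) {
    if m.Currency != o.Currency {
        return Money{}, ErrCurrencyMismatch
    }
    return Money{m.Amount + o.Amount, m.Currency}, nil
}

// Split divides a non-negative amount into n parts, handing out the remainder
// one minor unit at a time so no money is lost or invented.
func (m Money) Split(n int64) []Money {
    parts := make([]Money, n)
    base, rem := m.Amount/n, m.Amount%n
    for i := range parts {
        parts[i] = Money{base, m.Currency}
        if int64(i) < rem {
            parts[i].Amount++
        }
    }
    return parts
}

func main() {
    price := Money{1999, "USD"}  // $19.99
    fmt.Println(price.Split(3))  // the one leftover cent goes to the first share
}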
One-line summary: suggest optimal (lookup-speed/compactness) data structure(s) for a multi-lingual dictionary representing primarily Indo-European languages (list at bottom).
Say you want to build some data structure(s) to implement a multi-language dictionary for let's say the top-N (N~40) European languages on the internet, ranking choice of language by number of webpages (rough list of languages given at bottom of this question).
The aim is to store the working vocabulary of each language (i.e. 25,000 words for English etc.) Proper nouns excluded. Not sure whether we store plurals, verb conjugations, prefixes etc., or add language-specific rules on how these are formed from noun singulars or verb stems.
Also your choice on how we encode and handle accents, diphthongs and language-specific special characters e.g. maybe where possible we transliterate things (e.g. Romanize German ß as 'ss', then add a rule to convert it). Obviously if you choose to use 40-100 characters and a trie, there are way too many branches and most of them are empty.
Task definition: Whatever data structure(s) you use, you must do both of the following:
The main operation in lookup is to quickly get an indication 'Yes this is a valid word in languages A,B and F but not C,D or E'. So, if N=40 languages, your structure quickly returns 40 Booleans.
The secondary operation is to return some pointer/object for that word (and all its variants) for each language (or null if it was invalid). This pointer/object could be user-defined e.g. the Part-of-Speech and dictionary definition/thesaurus similes/list of translations into the other languages/... It could be language-specific or language-independent e.g. a shared definition of pizza)
And the main metric for efficiency is a tradeoff of a) compactness (across all N languages) and b) lookup speed. Insertion time is not important. The compactness constraint excludes memory-wasteful approaches like "keep a separate hash for each word" or "keep a separate hash for each language, and each word within that language".
So:
What are the possible data structures, and how do they rank on the lookup-speed/compactness curve?
Do you have a unified structure for all N languages, or do you partition e.g. the Germanic languages into one sub-structure, Slavic into another, etc.? Or just N separate structures (which would allow you to Huffman-encode each)?
What representation do you use for characters, accents and language-specific special characters?
Ideally, give a link to an algorithm or code, esp. Python or else C.
(I checked SO and there have been related questions but not this exact question. Certainly not looking for a SQL database. One 2000 paper which might be useful: "Estimation of English and non-English Language Use on the WWW" - Grefenstette & Nioche. And one list of multi-language dictionaries)
Resources: two online multi-language dictionaries are Interglot (en/ge/nl/fr/sp/se) and LookWayUp (en<->fr/ge/sp/nl/pt).
Languages to include:
Probably mainly Indo-European languages for simplicity: English, French, Spanish, German, Italian, Swedish + Albanian, Czech, Danish, Dutch, Estonian, Finnish, Hungarian, Icelandic, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbo Croat, Slovak, Slovenian + Breton, Catalan, Corsican, Esperanto, Gaelic, Welsh
Probably include Russian, Slavic, Turkish, exclude Arabic, Hebrew, Iranian, Indian etc. Maybe include Malay family too. Tell me what's achievable.
I will not win points here, but a few remarks.
A multi-language dictionary is a large and time-consuming undertaking. You did not talk in detail about the exact uses for which your dictionary is intended: statistical probably, not translating, not grammatical, .... Different usages require different data to be collected, for instance classifying "went" as past tense.
First formulate your requirements in a document, together with a prototype of the programming interface. Asking for data structures before the algorithmic conception is something I often see with complex business logic. One would then start out wrong, risking feature creep, or premature optimisation, like that romanisation, which might have no advantage and would bar bidirectionality.
Maybe you can start with some active projects like Reta Vortaro; its XML might not be efficient, but it will give you some ideas for organisation. There are several academic linguistic projects. The most relevant aspect might be stemming: recognising greet/greets/greeted/greeter/greeting/greetings (#smci) as belonging to the same (major) entry. You want to take the already programmed stemmers; they often are well tested and already applied in electronic dictionaries. My advice would be to research those projects without losing too much energy and impetus to them; just enough to collect ideas and see where they might be used.
The data structures one can think up are IMHO of secondary importance. I would first collect everything in a well-defined database, and then generate the data structures used by the software. You can then compare and measure alternatives. And for a developer that might be the most interesting part: creating a beautiful data structure & algorithm.
An answer
Requirement:
Map of word to list of [language, definition reference].
List of definitions.
Several words can have the same definition, hence the need for a definition reference.
The definition could consist of a language-bound definition (grammatical properties, declinations), and/or a language-independent definition (description of the notion).
One word can have several definitions (book = (noun) reading material, = (verb) reserve use of location).
Remarks
As single words are handled, this does not consider that a given text is in general monolingual. Since a text can be of mixed languages, and I see no special overhead in the O-complexity, that seems irrelevant.
So an over-general abstract data structure would be:
Map<String /*Word*/, List<DefinitionEntry>> wordDefinitions;
Map<String /*Language/Locale/""*/, List<Definition>> definitions;
class Definition {
    String content;
}

class DefinitionEntry {
    String language;
    Ref<Definition> definition;
}
The concrete data structure:
The wordDefinitions are best served with an optimised hash map.
Please let me add:
I did come up with a concrete data structure at last. I started with the following.
Guava's MultiMap is what we have here, but Trove's collections with primitive types are what one needs if using a compact binary representation in core.
One would do something like:
import gnu.trove.map.*;
import gnu.trove.map.hash.*;
import java.io.*;

/**
 * Map of word to DefinitionEntry.
 * Key: word.
 * Value: offset in byte array wordDefinitionEntries,
 * 0 serves as null, which implies a dummy byte (data version?)
 * in the byte array at [0].
 */
TObjectIntMap<String> wordDefinitions = new TObjectIntHashMap<String>();

byte[] wordDefinitionEntries = new byte[...]; // Actually read from file.

void walkEntries(String word) throws IOException {
    int value = wordDefinitions.get(word);
    if (value == 0)
        return;
    DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(wordDefinitionEntries));
    in.skipBytes(value);
    int entriesCount = in.readShort();
    for (int entryno = 0; entryno < entriesCount; ++entryno) {
        int language = in.readByte();
        walkDefinition(in, language); // Index to readUTF8 or gzipped bytes.
    }
}
I'm not sure whether or not this will work for your particular problem, but here's one idea to think about.
A data structure that's often used for fast, compact representations of a language is a minimum-state DFA for the language. You could construct this by creating a trie for the language (which is itself an automaton for recognizing strings in the language), then using one of the canonical algorithms for constructing a minimum-state DFA from it. This may require an enormous amount of preprocessing time, but once you've constructed the automaton you'll have extremely fast lookup of words. You would just start at the start state and follow the labeled transitions for each of the letters. Each state could encode (perhaps) a 40-bit value recording, for each language, whether or not the state corresponds to a word in that language.
Because different languages use different alphabets, it might be a good idea to separate out each language by alphabet so that you can minimize the size of the transition table for the automaton. After all, if you have words using the Latin and Greek alphabets, the transitions on Latin letters from states representing Greek words would probably all lead to the dead state, while the transitions on Greek characters from states representing Latin words would also lead to the dead state. Having multiple automata, one per alphabet, could thus eliminate a huge amount of wasted space.
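To make the per-state language mask concrete, here is a minimal Go trie sketch (a plain trie rather than a minimized DFA, so the space figures would differ); the node type, language indices and the two sample words are made up for illustration:

package main

import "fmt"

// node is one trie state; langs has bit i set if the path to this node
// spells a valid word in language i (up to 64 languages in a uint64).
type node struct {
    children map[rune]*node
    langs    uint64
}

func newNode() *node { return &node{children: map[rune]*node{}} }

// insert adds word to the trie and marks it as valid in language lang.
func (n *node) insert(word string, lang uint) {
    cur := n
    for _, r := range word {
        next, ok := cur.children[r]
        if !ok {
            next = newNode()
            cur.children[r] = next
        }
        cur = next
    }
    cur.langs |= 1 << lang
}

// lookup returns the language bitmask for word (0 if it is in no language).
func (n *node) lookup(word string) uint64 {
    cur := n
    for _, r := range word {
        next, ok := cur.children[r]
        if !ok {
            return 0
        }
        cur = next
    }
    return cur.langs
}

func main() {
    const english, german = 0, 1
    root := newNode()
    root.insert("hand", english)
    root.insert("hand", german) // same spelling, valid in both languages
    root.insert("hund", german)

    fmt.Printf("%b\n", root.lookup("hand")) // 11: English and German
    fmt.Printf("%b\n", root.lookup("hund")) // 10: German only
}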
A common solution to this in the field of NLP is finite automata. See http://www.stanford.edu/~laurik/fsmbook/home.html.
Easy.
Construct a minimal, perfect hash function for your data (union of all dictionaries, construct the hash offline), and enjoy O(1) lookup for the rest of eternity.
Takes advantage of the fact your keys are known statically. Doesn't care about your accents and so on (normalize them prior to hashing if you want).
I had a similar (but not identical) task: implement a four-way mapping between sets, e.g. A, B, C, D
Each item x has its projections in all of the sets: x.A, x.B, x.C, x.D;
The task was: for each item encountered, determine which set it belongs to and find its projections in other sets.
Using languages analogy: for any word, identify its language and find all translations to other languages.
However, in my case a word could be uniquely identified as belonging to one language only, so there were no false friends such as burro, which in Spanish means donkey in English, whereas burro in Italian is butter in English (see also https://www.daytranslations.com/blog/different-meanings/)
I implemented the following solution:
Four maps/dictionaries matching the entry to its unique id (integer)
AtoI[x.A] = BtoI[x.B] = CtoI[x.C] = DtoI[x.D] = i
Four maps/dictionaries matching the unique id to the corresponding language
ItoA[i] = x.A;
ItoB[i] = x.B;
ItoC[i] = x.C;
ItoD[i] = x.D;
For each encountered item x, I need to do at most 4 searches to get its id (each search is O(log N)); then 3 access operations, each O(log N). All in all, O(log N).
I have not implemented this, but I don't see why hash sets cannot be used for either set of dictionaries to make it O(1).
Going back to your problem:
Given N concepts in M languages (so N*M words in total)
My approach adapts in the following way:
M lookup hash maps (one per language) that give you an integer id for a word (or None/null if the word does not exist in that language).
Overlapped case is covered by the fact that lookups for different languages will yield different ids.
For each word, you do M*O(1) lookups in the hash sets corresponding to languages, yielding K<=M ids, identifying the word as belonging to K languages;
for each id you need to do (M-1)*O(1) lookups in the actual dictionaries, mapping K ids to M-1 translations each.
In total, O(M + K·M), which I think is not bad, given that your M=40 and K will be much smaller than M in most cases (K=1 for quite a lot of words).
As for storage: N·M words + N·M integers for the id-to-word dictionaries, and the same amount for reverse lookups (word-to-id).
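A rough Go sketch of that adaptation, with made-up sample data (including the burro false-friend case, which simply yields different ids in different languages); wordToID holds the M per-language lookup maps and idToWord the reverse direction:

package main

import "fmt"

func main() {
    languages := []string{"en", "es", "it"}

    // One map per language: word -> concept id (the "M lookup maps").
    wordToID := map[string]map[string]int{
        "en": {"donkey": 1, "butter": 2},
        "es": {"burro": 1, "mantequilla": 2},
        "it": {"asino": 1, "burro": 2},
    }
    // Reverse direction: language -> concept id -> word.
    idToWord := map[string]map[int]string{
        "en": {1: "donkey", 2: "butter"},
        "es": {1: "burro", 2: "mantequilla"},
        "it": {1: "asino", 2: "burro"},
    }

    word := "burro"
    // M O(1) lookups: in which languages does the word exist, and with which id?
    for _, lang := range languages {
        id, ok := wordToID[lang][word]
        if !ok {
            continue
        }
        // (M-1) O(1) lookups: translations of this id into the other languages.
        for _, other := range languages {
            if other == lang {
                continue
            }
            fmt.Printf("%q in %s (id %d) -> %s: %q\n", word, lang, id, other, idToWord[other][id])
        }
    }
}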
I'm trying to come up with a way to estimate the number of English words a translation from Japanese will turn into. Japanese has three main scripts -- Kanji, Hiragana, and Katakana -- and each has a different average character-to-word ratio (Kanji being the lowest, Katakana the highest).
Examples:
computer: コンピュータ (Katakana: 6 characters); 計算機 (Kanji: 3 characters)
whale: くじら (Hiragana: 3 characters); 鯨 (Kanji: 1 character)
As data, I have a large glossary of Japanese words and their English translations, and a fairly large corpus of matched Japanese source documents and their English translations. I want to come up with a formula that will count numbers of Kanji, Hiragana, and Katakana characters in a source text, and estimate the number of English words this is likely to turn into.
Here's what Borland (now Embarcadero) thinks about English to non-English:
Length of English string (in characters)    Expected increase
1-5                                         100%
6-12                                         80%
13-20                                        60%
21-30                                        40%
31-50                                        20%
over 50                                      10%
I think you can sort of apply this (with some modification) for Japanese to non-Japanese.
Another element you might want to consider is the tone of the language. In English, instructions are phrased as an imperative, as in "Press OK." But in Japanese, imperatives are considered rude, and you must phrase instructions politely (in keigo), as in "OKボタンを押してください。"
Watch out for three-character kanji compounds. Many of the big words translate into three- or four-character kanji compounds, such as 国際化 (internationalization: 20 chars) and 高可用性 (high availability: 17 chars).
I would start with linear approximation: approx_english_words = a1*no_characters_in_script1 + a2 * no_chars_in_script2 + a3 * no_chars_in_script3, with the coefficients a1, a2, a3 fit from your data using linear least squares.
If this doesn't approximate very well, then look at the worst cases for the reasons they don't fit (specialized words, etc.).
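A self-contained Go sketch of that fit via the normal equations (XᵀX)a = Xᵀy, solved with a small Gauss-Jordan elimination; the per-document character counts and English word totals below are invented purely to show the shape of the data:

package main

import "fmt"

// fit3 solves the 3-coefficient least-squares problem for
// y ≈ a1*x1 + a2*x2 + a3*x3 using the normal equations.
func fit3(xs [][3]float64, ys []float64) [3]float64 {
    var m [3][4]float64 // augmented matrix [XᵀX | Xᵀy]
    for k, x := range xs {
        for i := 0; i < 3; i++ {
            for j := 0; j < 3; j++ {
                m[i][j] += x[i] * x[j]
            }
            m[i][3] += x[i] * ys[k]
        }
    }
    // Gauss-Jordan elimination with partial pivoting on the 3x4 system.
    for col := 0; col < 3; col++ {
        pivot := col
        for r := col + 1; r < 3; r++ {
            if abs(m[r][col]) > abs(m[pivot][col]) {
                pivot = r
            }
        }
        m[col], m[pivot] = m[pivot], m[col]
        for r := 0; r < 3; r++ {
            if r == col {
                continue
            }
            f := m[r][col] / m[col][col]
            for c := col; c < 4; c++ {
                m[r][c] -= f * m[col][c]
            }
        }
    }
    var a [3]float64
    for i := 0; i < 3; i++ {
        a[i] = m[i][3] / m[i][i]
    }
    return a
}

func abs(v float64) float64 {
    if v < 0 {
        return -v
    }
    return v
}

func main() {
    // Each row: (kanji count, hiragana count, katakana count) for one document.
    counts := [][3]float64{{120, 300, 40}, {80, 500, 10}, {200, 250, 90}, {50, 400, 30}}
    // Number of English words in the corresponding translation (invented figures).
    englishWords := []float64{210, 260, 300, 190}

    a := fit3(counts, englishWords)
    fmt.Printf("a1=%.3f a2=%.3f a3=%.3f\n", a[0], a[1], a[2])
}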
In my experience as a translator and localization specialist, a good rule of thumb is 2 Japanese characters per English word.
As an experienced translator between Japanese and English, I can say that this is extremely difficult to quantify, but typically in my experience English text translated from Japanese is nearly 200% as many characters as the source text. In Japanese there are many culturally specific phrases and nouns that can't be translated literally and need to be explained in English.
When translating, it is not unusual for me to take a single Japanese sentence and make a single English paragraph out of it in order for the meaning to be communicated to the reader. Off the top of my head, here is an example:
「懐かしい」
This literally means nostalgic. However, in Japanese it can be used as a single phrase in an exclamation. Yet, in English in order to convey a feeling of nostalgia we require a lot more context. For instance, you may need to turn that single phrase into a sentence:
"As I walked by my old elementary school, I was flooded with memories of the past."
This is why machine translation between Japanese and English is impossible.
Well, it's a little more complex than just the number of characters in a noun compared to English. For instance, Japanese also has a different grammatical structure compared to English, so certain sentences would use MORE words in Japanese, and others would use FEWER. I don't really know Japanese, so please forgive me for using Korean as an example.
In Korean, a sentence is often shorter than an English sentence, due mainly to the fact that it is cut short by using context to fill in the missing words. For instance, saying "I love you" could be as short as 사랑해 ("sarang hae", simply the verb "love"), or as long as the fully qualified sentence 저는 당신을 사랑해요 (I [topic] you [object] love [verb + polite modifier]). In a text, how it is written depends on context, which is usually set by earlier sentences in the paragraph.
Anyway, having an algorithm to actually KNOW this kind of thing would be very difficult, so you're probably much better off just using statistics. What you should do is use random samples where known Japanese texts and English texts have the same meaning. The larger the sample (and the more random it is) the better... though if they are truly random, it won't make much difference past a few hundred samples.
Now, another thing is that this ratio would change completely depending on the type of text being translated. For instance, a highly technical document is quite likely to have a much higher Japanese/English length ratio than a soppy novel.
As for simply using your dictionary of word-to-word translations - that probably won't work too well (and is probably wrong). The same word does not translate to the same word every time in a different language (although this is much more likely to hold in technical discussions). For instance, take the word beautiful. Not only is there more than one word I could assign it to in Korean (i.e. there is a choice), but sometimes I lose that choice, as in the sentence "that food is beautiful", where I don't mean the food looks good - I mean it tastes good - and my choice of translations for that word changes. And this is a VERY common circumstance.
Another big problem is optimal translation. Something that humans are really bad at, and something that computers are much, much worse at. Whenever I've proofread a document translated from another language into English, I can always see various ways to cut it much, much shorter.
So although, with statistics, you would be able to work out a pretty good average ratio in length between translations, this will be far different than it would be were all translations to be optimal.
It seems simple enough - you just need to find out the ratios.
For each script, count the number of script characters and English words in your glossary and work out the ratio.
This can be augmented with the Japanese source documents, assuming you can both detect which script a Japanese word is in and what the English equivalent phrase is in the translation. Otherwise you'll have to guesstimate the ratios or ignore this as source data.
Then, as you say, count the number of words in each script of your source text, do the multiplies, and you should have a rough estimate.
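For the counting step, Go's unicode package already has range tables for the three scripts; a rough sketch follows (the ratios are placeholders to be replaced with the ones computed from your glossary and corpus):

package main

import (
    "fmt"
    "unicode"
)

// Placeholder English-words-per-character ratios; fit these from your own corpus.
var ratios = map[string]float64{"Han": 0.7, "Hiragana": 0.2, "Katakana": 0.2}

func estimateEnglishWords(text string) float64 {
    counts := map[string]int{}
    for _, r := range text {
        switch {
        case unicode.Is(unicode.Han, r):
            counts["Han"]++
        case unicode.Is(unicode.Hiragana, r):
            counts["Hiragana"]++
        case unicode.Is(unicode.Katakana, r):
            counts["Katakana"]++
        }
    }
    total := 0.0
    for script, n := range counts {
        total += float64(n) * ratios[script]
    }
    return total
}

func main() {
    fmt.Printf("%.1f estimated English words\n", estimateEnglishWords("コンピュータを計算機と呼ぶ"))
}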
My (albeit tiny) experience seems to indicate that, no matter what the language, blocks of text take the same amount of printed space to convey equivalent information. So, for a large-ish block of text, you could assign a width count to each character in English (grab this from a common font like Times New Roman), and likewise use a common Japanese font at the same point size to calculate the number of characters that would be required.