How to SortBy two parameters in OCL?

I need to sort a collection of Persons by two parameters: first by surname and then by name. How can I do something like this in OCL?

The sortedBy operation sorts elements using the criterion expressed in its body and the < relation between the gathered results.
In your case, assuming that you have a surname attribute, the following statement will sort the collection c using the < operator on each gathered surname (so < on strings):
c->sortedBy(p | p.surname)
An idea could be to compute a unique string from the surname and the name concatenated together. Thus, if you have:
George Smith
Garry Smith
George Smath
The comparison would be done between "Smith_George", "Smith_Garry" and "Smath_George", and these would be ordered, following lexicographical order, as:
George Smath (Smath_George)
Garry Smith (Smith_Garry)
George Smith (Smith_George)
Finally, the OCL request would be (assuming surname and name as existing attributes):
c->sortedBy(p | p.surname + '_' + p.name)
This little trick does the job, but it is not "exactly" a two-parameter comparison for sortedBy.

The OCL sortedBy(lambda) appears to be very different from Java's sort(comparator), apparently requiring a projection of the objects as the metric for sorting. However, if the projection is self, you have a different Java functionality: if you do sortedBy(p | p), the sorting depends on the < operation for p.
To facilitate this, the Eclipse OCL prototype of a future OCL introduces an OclComparable type with a compareTo method, enabling all relational operations to be realized provided your custom type extends OclComparable.
(A similar OclSummable with zero() and sum() operations supports Collection::sum() generically; e.g. String realizes sum() as concatenation.)
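For contrast, here is roughly what the multi-key sort looks like on the Java side with a comparator chain (a sketch only; the Person record and its accessors are illustrative, not taken from the question):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TwoKeySortSketch {
    // Hypothetical Person carrying the two sort keys from the question.
    record Person(String surname, String name) {}

    public static void main(String[] args) {
        List<Person> c = new ArrayList<>(List.of(
                new Person("Smith", "George"),
                new Person("Smith", "Garry"),
                new Person("Smath", "George")));
        // Surname is the primary key, name breaks ties; no compound string key needed.
        c.sort(Comparator.comparing(Person::surname).thenComparing(Person::name));
        c.forEach(p -> System.out.println(p.surname() + " " + p.name()));
    }
}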

Thanks, you inspired me. I just raised http://issues.omg.org/browse/OCL25-213 whose text is:
The sortedBy iteration provides an elegant solution to a sort problem in which the sort metric is a projection of the sorted object. Thus sortedBy(p|p.name) or just sortedBy(name) is short and avoids the opportunities for typos from a more conventional exposition involving comparison of two objects. The natural solution may well be an efficient one for large collections with non-trivial metrics.
However the sortedBy solution is unfamiliar, and so confusing for newcomers, and unsuitable for multi-key sorting, for which an artificial compound single key may need to be constructed.
One solution might be to provide a more conventional iterator such as sort(p1, p2 | comparison-expression) allowing a two key sort:
sort(p1, p2 | let diff1 = p1.key1.compareTo(p2.key1) in
if diff1 <> 0 then diff1 else p1.key2.compareTo(p2.key2) endif)
However this has poor readability and ample opportunities for typos.
Alternatively sortedBy with a Tuple-valued metric might support multiple keys as:
sortedBy(Tuple{first=key1,second=key2})
(The alphabetical order of the Tuple part names determines the priority.)
(Since sortedBy is declaratively clear and compact, inefficient small/trivial implementations can be optimized to their sort() equivalents.)

Related

How can Clojure data-structures best be tagged with a type?

I'm writing a program that's manipulating polynomials. I'm defining polynomials recursively as either a term (base case) or a sum or product of polynomials (recursive cases).
Sums and products are completely identical as far as their contents are concerned. They just contain a sequence of polynomials. But they need to be processed very differently. So to distinguish them I have to somehow tag my sequences of polynomials.
Currently I have two records - Sum and Product - defined. But this is causing my code to be littered with the line (:polynomials sum-or-product) to extract the contents of polynomials. Also printing out even small polynomials in the REPL produces so much boilerplate that I have to run everything through a dedicated prettyprinting routine if I want to make sense of it.
Alternatives I have considered are tagging my sums and products using metadata instead, or putting a + or * symbol at the head of the sequence. But I'm not convinced that either of these approaches are good style and I'm wondering if there's perhaps another option I haven't considered yet.
Putting a + or * symbol at the head of the sequence sounds like it would print out nicely. I would try implementing the processing of these two different "types" via multimethods, which keeps the calling convention neat and extensible. That document starts from an object-oriented programming view, but the "area of a shape" example is a very neat illustration of what this approach can accomplish.
In your case you'd use the first element of the seq to determine whether you are dealing with a sum or a product of polynomials, and the multimethod would automagically use the correct implementation.

OCaml performance according to matching order

In OCaml, is there any relation between the order in a pattern-matching and performance?
For instance, if I declare a type:
type t = A | B | C
and then perform some pattern-matching as follows:
match t1 with
| A -> ...
| _ -> ...
From a performance point of view, is it equivalent to
match t1 with
| B -> ...
| _ -> ...
assuming in the first case there are as many A's as there are B's in the second?
In other words, should I worry about the order of declaration of constructors in a type, when considering performance?
There is a paper explaining how pattern-matching is compiled in OCaml:
“Optimizing Pattern Matching”, L. Maranget and F. Le Fessant, ICFP’01
It basically says that the semantics is "in order", but that it is usually compiled in the optimal way, independently of the order of the lines. The values of the constructors don't matter either; it's the number of constructors that makes the difference, i.e. whether the match is compiled as a tree of comparisons or as a jump table.
Optimality plus the exhaustiveness test makes pattern matching in OCaml probably the most wonderful feature of the language, and it is much more efficient than writing cascades of "if" manually.
This is an impossible question to answer carefully. However, in practice if you have a type whose constructors are all nullary (i.e., equivalent to small integers), and there are more than a very few of them, but less than a whomping huge pile of them, the code generator will almost certainly use a hardware jump table, which has essentially the same performance for each possible value.
In general, I wouldn't worry about things like this at all until you have identified the slow parts of your code. But there's almost no chance you would be able to speed things up by reordering a set of nullary constructors.

Maximum difference between columns using relational algebra

Is it possible to obtain the maximum difference between two columns (for example starting and ending weights)?
Right now I'm leaning towards no, as this would require a new column with the difference between the two columns for each row, then taking the max of that. Doing it the way I originally intended doesn't work either, since arithmetic operations are not allowed in the conditions of select operations (e.g. SIGMA (c1 - c2 < c3 - c4)(Table) is not allowed).
Disclosure: this is part of a homework question.
It can be done, exactly in the way you planned, but you need generalized projection for that. The generalized projection is the operator
Π(E1, E2,..., En)R
where R is a relation, and E1...En are expressions in the form a⊕b, where a and b are attributes of R or constants, and ⊕ is an arbitrary binary operator between them. The result is a relation with attributes E1...En.
This would allow you to project the differences into a new relation (R' := Π(x-y)R), then find the maximum on that, just as you planned.
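As a concrete sketch, assuming the attributes are named start and end (ρ renames, × is the Cartesian product):
D := Π(end - start) R                  (one column; call it d)
MaxDiff := D - Π(d1)( σ(d1 < d2)( ρ(d→d1)(D) × ρ(d→d2)(D) ) )   (rename d1 back to d for the subtraction)
The second step removes every difference for which a strictly larger difference exists, leaving only the maximum, and it needs nothing beyond the basic operators.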
If we're not allowed to use generalized projection, then I think we have no means to actually subtract one attribute from another, or to calculate anything from them, as the definition of projection allows only attribute names, and the definition of selection allows only expressions of the form aθb where a and b are attributes or constants and θ is a binary relational operator (this is logical, in its way, because if we have a relation R(X,Y), then we have no idea about the type of X or Y, making operations on them quite meaningless).
I think generalized projection is a great extension to relational algebra. It's obviously immensely useful in real life, and it can be defended even from a more scientific point of view: if we allow binary conditional operators on the values like "X > 50", then we made assumptions on the type already, rendering that point kind of moot. Your instructor may disagree, though.
If you're looking to do this in the real world, you should be able to do this with a subquery (or a view, which amounts to much the same thing), something like:
select max (diff) from (
select high - low as diff from blah blah blah
)
Whether this applies to the abstract world of relational algebra, I couldn't say. I'm too busy fixing those damn real-world problems :-)

Data structure for multi-language dictionary?

One-line summary: suggest optimal (lookup-speed/compactness) data structure(s) for a multi-lingual dictionary representing primarily Indo-European languages (list at bottom).
Say you want to build some data structure(s) to implement a multi-language dictionary for let's say the top-N (N~40) European languages on the internet, ranking choice of language by number of webpages (rough list of languages given at bottom of this question).
The aim is to store the working vocabulary of each language (e.g. 25,000 words for English, etc.). Proper nouns excluded. Not sure whether we store plurals, verb conjugations, prefixes etc., or add language-specific rules on how these are formed from noun singulars or verb stems.
Also your choice on how we encode and handle accents, diphthongs and language-specific special characters e.g. maybe where possible we transliterate things (e.g. Romanize German ß as 'ss', then add a rule to convert it). Obviously if you choose to use 40-100 characters and a trie, there are way too many branches and most of them are empty.
Task definition: Whatever data structure(s) you use, you must do both of the following:
The main operation in lookup is to quickly get an indication 'Yes this is a valid word in languages A,B and F but not C,D or E'. So, if N=40 languages, your structure quickly returns 40 Booleans.
The secondary operation is to return some pointer/object for that word (and all its variants) for each language (or null if it was invalid). This pointer/object could be user-defined, e.g. the Part-of-Speech and dictionary definition/thesaurus similes/list of translations into the other languages/... It could be language-specific or language-independent (e.g. a shared definition of pizza).
And the main metric for efficiency is a tradeoff of a) compactness (across all N languages) and b) lookup speed. Insertion time is not important. The compactness constraint excludes memory-wasteful approaches like "keep a separate hash for each word" or "keep a separate hash for each language, and for each word within that language".
So:
What are the possible data structures, and how do they rank on the lookup-speed/compactness curve?
Do you have a unified structure for all N languages, or do you partition, e.g. the Germanic languages into one sub-structure, Slavic into another, etc.? Or just N separate structures (which would allow you to Huffman-encode)?
What representation do you use for characters, accents and language-specific special characters?
Ideally, give a link to an algorithm or code, esp. Python or else C.
(I checked SO and there have been related questions but not this exact question. Certainly not looking for a SQL database. One 2000 paper which might be useful: "Estimation of English and non-English Language Use on the WWW" - Grefenstette & Nioche. And one list of multi-language dictionaries)
Resources: two online multi-language dictionaries are Interglot (en/ge/nl/fr/sp/se) and LookWayUp (en<->fr/ge/sp/nl/pt).
Languages to include:
Probably mainly Indo-European languages for simplicity: English, French, Spanish, German, Italian, Swedish + Albanian, Czech, Danish, Dutch, Estonian, Finnish, Hungarian, Icelandic, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbo Croat, Slovak, Slovenian + Breton, Catalan, Corsican, Esperanto, Gaelic, Welsh
Probably include Russian, Slavic, Turkish, exclude Arabic, Hebrew, Iranian, Indian etc. Maybe include Malay family too. Tell me what's achievable.
I will not win points here, but here are some things.
A multi-language dictionary is a large and time-consuming undertaking. You did not talk in detail about the exact uses for which your dictionary is intended: statistical, probably, not translating, not grammatical, .... Different usages require different data to be collected, for instance classifying "went" as past tense.
First formulate your requirements in a document and with a programmed interface prototype. Asking about data structures before the algorithmic conception is something I often see in complex business logic. One would then start out wrong, risking feature creep, or premature optimisation, like that romanisation, which might have no advantage and would bar bidirectionality.
Maybe you can start with some active projects like Reta Vortaro; its XML might not be efficient, but it will give you some ideas for organisation. There are several academic linguistic projects. The most relevant aspect might be stemming: recognising greet/greets/greeted/greeter/greeting/greetings (#smci) as belonging to the same (major) entry. You will want to use the already-programmed stemmers; they are often well tested and already applied in electronic dictionaries. My advice would be to research those projects without losing too much energy or impetus to them; just enough to collect ideas and see where they might be used.
The data structures one can think up are IMHO of secondary importance. I would first collect everything in a well-defined database, and then generate the data structures used by the software. You can then compare and measure alternatives. And it might be, for a developer, the most interesting part: creating a beautiful data structure & algorithm.
An answer
Requirement:
Map of word to list of [language, definition reference].
List of definitions.
Several words can have the same definition, hence the need for a definition reference.
The definition could consist of a language-bound definition (grammatical properties, declensions), and/or a language-independent definition (description of the notion).
One word can have several definitions (book = (noun) reading material, = (verb) reserve use of location).
Remarks
As single words are handled, this does not take into account that a given text is in general mono-lingual. Since a text can be of mixed languages, and I see no special overhead in the O-complexity, that seems irrelevant.
So an over-general abstract data structure would be:
Map<String /*Word*/, List<DefinitionEntry>> wordDefinitions;
Map<String /*Language/Locale/""*/, List<Definition>> definitions;
class Definition {
    String content;
}
class DefinitionEntry {
    String language;
    Ref<Definition> definition;
}
The concrete data structure:
The wordDefinitions are best served with an optimised hash map.
Please let me add:
I did come up with a concrete data structure at last. I started with the following.
Guava's Multimap is what we have here, but Trove's collections with primitive types are what one needs if using a compact binary representation in core.
One would do something like:
import gnu.trove.map.*;
import gnu.trove.map.hash.TObjectIntHashMap;

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

/**
 * Map of word to DefinitionEntry.
 * Key: word.
 * Value: offset in byte array wordDefinitionEntries,
 * 0 serves as null, which implies a dummy byte (data version?)
 * in the byte array at [0].
 */
TObjectIntMap<String> wordDefinitions = new TObjectIntHashMap<String>();
byte[] wordDefinitionEntries = new byte[...]; // Actually read from file.

void walkEntries(String word) throws IOException {
    int value = wordDefinitions.get(word);
    if (value == 0)
        return;
    DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(wordDefinitionEntries));
    in.skipBytes(value);
    int entriesCount = in.readShort();
    for (int entryno = 0; entryno < entriesCount; ++entryno) {
        int language = in.readByte();
        walkDefinition(in, language); // Index to readUTF8 or gzipped bytes.
    }
}
}
I'm not sure whether or not this will work for your particular problem, but here's one idea to think about.
A data structure that's often used for fast, compact representations of a language is a minimum-state DFA for the language. You could construct this by creating a trie for the language (which is itself an automaton for recognizing strings in the language), then using one of the canonical algorithms for constructing a minimum-state DFA for the language. This may require an enormous amount of preprocessing time, but once you've constructed the automaton you'll have extremely fast lookup of words. You would just start at the start state and follow the labeled transitions for each of the letters. Each state could encode (perhaps) a 40-bit value encoding, for each language, whether or not the state corresponds to a word in that language.
Because different languages use different alphabets, it might be a good idea to separate out each language by alphabet so that you can minimize the size of the transition table for the automaton. After all, if you have words using the Latin and Cyrillic alphabets, the state transitions for states representing Cyrillic words would probably all be to the dead state on Latin letters, while the transitions on Cyrillic characters for Latin words would also be to the dead state. Having multiple automata, one for each of these alphabets, could thus eliminate a huge amount of wasted space.
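A minimal sketch of the trie stage (before DFA minimization), with each node carrying a per-language bitmask, might look like this in Java; the class and method names are illustrative, not from the answer:

import java.util.HashMap;
import java.util.Map;

public class MultiLanguageTrie {
    // One node per prefix; languageBits has bit i set when the prefix
    // spells a complete word in language i (40 languages fit easily in a long).
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        long languageBits;
    }

    private final Node root = new Node();

    /** Register a word as valid in the language with the given index (0..39). */
    public void add(String word, int languageIndex) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.languageBits |= 1L << languageIndex;
    }

    /** Returns the bitmask of languages in which the word is valid (0 if none). */
    public long lookup(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return 0L;
            }
        }
        return node.languageBits;
    }

    public static void main(String[] args) {
        MultiLanguageTrie trie = new MultiLanguageTrie();
        trie.add("pizza", 0);   // e.g. 0 = English
        trie.add("pizza", 4);   // e.g. 4 = Italian
        long bits = trie.lookup("pizza");
        System.out.println((bits & 1L) != 0);        // valid in language 0?
        System.out.println((bits & (1L << 4)) != 0); // valid in language 4?
    }
}

Minimizing the trie (merging nodes with identical subtrees and identical bitmasks) would then give the compact automaton described above.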
A common solution to this in the field of NLP is finite automata. See http://www.stanford.edu/~laurik/fsmbook/home.html.
Easy.
Construct a minimal, perfect hash function for your data (union of all dictionaries, construct the hash offline), and enjoy O(1) lookup for the rest of eternity.
Takes advantage of the fact your keys are known statically. Doesn't care about your accents and so on (normalize them prior to hashing if you want).
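As a toy illustration only of "construct offline, then O(1) lookup": the brute-force seed search below is workable solely for tiny static key sets (real minimal-perfect-hash construction uses much smarter algorithms such as CHD or BDZ), and the little word list is purely for the demo:

import java.util.List;

public class TinyMinimalPerfectHash {
    // Brute-force search for a seed whose seeded hash maps every key to a
    // distinct slot in [0, n) -- a minimal perfect hash for this static set.
    static int findSeed(List<String> keys) {
        int n = keys.size();
        for (int seed = 1; seed < Integer.MAX_VALUE; seed++) {
            boolean[] used = new boolean[n];
            boolean ok = true;
            for (String key : keys) {
                int slot = Math.floorMod((key + "#" + seed).hashCode(), n);
                if (used[slot]) { ok = false; break; }
                used[slot] = true;
            }
            if (ok) return seed;
        }
        throw new IllegalStateException("no seed found");
    }

    public static void main(String[] args) {
        List<String> keys = List.of("hund", "dog", "chien", "perro", "cane");
        int seed = findSeed(keys);                 // done offline, once
        String[] table = new String[keys.size()];
        for (String key : keys) {
            table[Math.floorMod((key + "#" + seed).hashCode(), keys.size())] = key;
        }
        // O(1) membership lookup against the static set:
        for (String probe : List.of("chien", "gato")) {
            int slot = Math.floorMod((probe + "#" + seed).hashCode(), keys.size());
            System.out.println(probe + ": " + probe.equals(table[slot]));
        }
    }
}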
I had a similar (but not identical) task: implement a four-way mapping for sets, e.g. A, B, C, D.
Each item x has its projections in all of the sets: x.A, x.B, x.C, x.D.
The task was: for each item encountered, determine which set it belongs to and find its projections in other sets.
Using languages analogy: for any word, identify its language and find all translations to other languages.
However: in my case a word can be uniquely identified as belonging to one language only, so there are no false friends such as burro, which in Spanish means donkey in English, whereas burro in Italian means butter (see also https://www.daytranslations.com/blog/different-meanings/).
I implemented the following solution:
Four maps/dictionaries matching the entry to its unique id (integer)
AtoI[x.A] = BtoI[x.B] = CtoI[x.C] = DtoI[x.D] = i
Four maps/dictionaries matching the unique id back to the corresponding entry in each set
ItoA[i] = x.A;
ItoB[i] = x.B;
ItoC[i] = x.C;
ItoD[i] = x.D;
For each encounter x, I need to do 4 searches at worst to get its id (each search is O(log(N))); then 3 access operations, each O(log(N)). All in all, O(log(N)).
I have not implemented this, but I don't see why hash sets cannot be used for either set of dictionaries to make it O(1).
Going back to your problem:
Given N concepts in M languages (so N*M words in total)
My approach adapts in the following way:
M lookup hash maps that give you an integer id for every language (or None/null if the word does not exist in the language).
Overlapped case is covered by the fact that lookups for different languages will yield different ids.
For each word, you do M * O(1) lookups in the hash maps corresponding to the languages, yielding K <= M ids, identifying the word as belonging to K languages;
for each id you need to do (M-1) * O(1) lookups in the actual dictionaries, mapping K ids to M-1 translations each.
In total, O(M + K*M), which I think is not bad, given your M=40 and that K will be much smaller than M in most cases (K=1 for quite a lot of words).
As for storage: N*M words + N*M integers for the id-to-word dictionaries, and the same amount for reverse lookups (word-to-id). A small sketch of this layout follows.
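The sketch below shows that id-based layout in Java (class and method names are mine, chosen for illustration; languages and concept ids are plain ints):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiLangDictionary {
    private final int languageCount;                   // M
    private final List<Map<String, Integer>> wordToId; // per language: word -> concept id
    private final List<Map<Integer, String>> idToWord; // per language: concept id -> word

    public MultiLangDictionary(int languageCount) {
        this.languageCount = languageCount;
        wordToId = new ArrayList<>();
        idToWord = new ArrayList<>();
        for (int lang = 0; lang < languageCount; lang++) {
            wordToId.add(new HashMap<>());
            idToWord.add(new HashMap<>());
        }
    }

    public void add(int language, int conceptId, String word) {
        wordToId.get(language).put(word, conceptId);
        idToWord.get(language).put(conceptId, word);
    }

    /** M O(1) lookups: in which languages is the word valid, and under which concept id? */
    public Map<Integer, Integer> languagesOf(String word) {
        Map<Integer, Integer> languageToId = new HashMap<>();
        for (int lang = 0; lang < languageCount; lang++) {
            Integer id = wordToId.get(lang).get(word);
            if (id != null) {
                languageToId.put(lang, id);
            }
        }
        return languageToId;
    }

    /** (M-1) O(1) lookups: translations of one concept into all other languages. */
    public Map<Integer, String> translationsOf(int conceptId, int sourceLanguage) {
        Map<Integer, String> result = new HashMap<>();
        for (int lang = 0; lang < languageCount; lang++) {
            if (lang == sourceLanguage) continue;
            String word = idToWord.get(lang).get(conceptId);
            if (word != null) {
                result.put(lang, word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        MultiLangDictionary d = new MultiLangDictionary(3);
        d.add(0, 7, "butter");  // 0 = English
        d.add(1, 7, "Butter");  // 1 = German
        d.add(2, 7, "burro");   // 2 = Italian
        d.languagesOf("burro").forEach((lang, id) ->
                System.out.println(lang + " -> " + d.translationsOf(id, lang)));
    }
}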

Relational algebra - what is the proper way to represent a 'having' clause?

This is a small part of a homework question so I can understand the whole.
SQL query to list car prices that occur more than once:
select car_price from cars
group by car_price
having count (car_price) > 1;
The general form of this in relational algebra is
Y (gl, al) R
where Y is the Greek symbol (gamma), gl is the list of attributes to group by, and al is a list of aggregations.
The relational algebra:
Y (count(car_price)) cars
How is the having clause written in that statement? Is there a shorthand? If not, do I just need to select from that relation? Like this?
SELECT (count(car_price) > 1) [Y (count(car_price)) cars]
select count(*) from (select * from cars where price > 1) as cars;
also known as relational closure.
For a more or less precise answer to the actual question asked, "Relational algebra - what is the proper way to represent a 'having' clause?", it needs to be stated first that the question itself seems to suggest, or presume, that there exists such a thing as "THE" relational algebra, but that presumption is simply untrue!
An algebra is a set of operators, and anyone can define any set of operators he likes, meaning anyone can define any algebra he likes! In his most recent publication, Hugh Darwen mentions that RESTRICT is not a fundamental operator of the algebra, though lots of others do consider it as such.
Especially with respect to aggregations and summaries, there is little consensus as to how those should be incorporated in a relational algebra. Defining operators such as COUNT() (that take a relation as an argument value and return an integer) as part of the algebra might be problematic wrt the closure property of the algebra, precisely because such operators do not return a relation...
So the sorry, but nevertheless most appropriate, answer here seems to be that a conclusive answer to this question is almost impossible to give...
