Logoot CRDT: interleaving of data on concurrent edits to the same spot? - algorithm

I want to implement Logoot for eventually-convergent P2P text editing and I've run into a bit of a problem.
My understanding of Logoot is that the intervals between objects (lines of text in the original paper, but could be characters or words) can be divided infinitely on account of an unbounded identifier. This means that the position of an object is determined not by its neighbors as in WOOT (which would require tombstones) but by a fixed numerical point along the length of the string. Combined with a unique site identifier, this also gives us a total order and enables eventual convergence.
However... doesn't this cause a problem when concurrent edits are made to the same spot? If two disconnected clients start writing new sentences at the same cursor position and then merge, their sentences have a good chance of interleaving.
Below is a whiteboard example of what I'm talking about:
As you can see, both site B and site C divide the interval between "I" and "conquered" according to the rules of Logoot, giving us random points between the positions of (20,A) and (25,A). But nothing orders these points relative to each other, causing them to mix when merged. Meanwhile, neighbor-based algorithms can account for this issue since the causality chain of each object is preserved.
The above is a baby example, but in the more general case, imagine if two users wanted to insert a different sentence between two existing sentences. If one of the users happened to be offline, they shouldn't come back to a garbled mess! Clearly, to preserve intent, one sentence should follow the other.
Am I missing something in my reading of the paper, or is this an inherent downside to Logoot?
(Also, why is there a recorded clock value that's seemingly unused in the algorithm? The paper even points out that each object's identifier is necessarily unique without the clock.)

You're correct, this a real anomaly in Logoot and LSEQ. Whether or not it constitutes a intention violation depends on what your definition of intention is. An extension to the definition requiring that contiguous sequences remain contiguous unless they are split by a casually subsequent operation would make intuitive sense.
The clock is unnecessary. Most likely the authors used the (site, clock) pair or Lamport timestamp as their UUIDs out of convention. One site can never create two identical positions, so clocks will never need to be compared. (Assuming messages are received from a site in order, which is required for other aspects of Logoot/LSEQ too.)

Related

word2vec window size at sentence boundaries

I am using word2vec (and doc2vec) to get embeddings for sentences, but i want to completely ignore word order.
I am currently using gensim, but can use other packages if necessary.
As an example, my text looks like this:
[
['apple', 'banana','carrot','dates', 'elderberry', ..., 'zucchini'],
['aluminium', 'brass','copper', ..., 'zinc'],
...
]
I intentionally want 'apple' to be considered as close to 'zucchini' as it is to 'banana' so I have set the window size to a very large number, say 1000.
I am aware of 2 problems that may arise with this.
Problem 1:
The window might roll in at the start of a sentence creating the following training pairs:
('apple', ('banana')), ('apple', ('banana', 'carrot')), ('apple', ('banana', 'carrot', 'date')) before it eventually gets to the correct ('apple', ('banana','carrot', ..., 'zucchini')).
This would seem to have the effect of making 'apple' closer to 'banana' than 'zucchini',
since their are so many more pairs containing 'apple' and 'banana' than there are pairs containing 'apple' and 'zucchini'.
Problem 2:
I heard that pairs are sampled with inverse proportion to the distance from the target word to the context word- This also causes an issue making nearby words more seem more connected than I want them to be.
Is there a way around problems 1 and 2?
Should I be using cbow as opposed to sgns? Are there any other hyperparameters that I should be aware of?
What is the best way to go about removing/ignoring the order in this case?
Thank you
I'm not sure what you mean by "Problem 1" - there's no "roll" or "wraparound" in the usual interpretation of a word2vec-style algorithm's window parameter. So I wouldn't worry about this.
Regarding "Problem 2", this factor can be essentially made negligible by the choice of a giant window value – say for example, a value one million times larger than your largest sentence. Then, any difference in how the algorithm treats the nearest-word and the 2nd-nearest-word is vanishingly tiny.
(More specifically, the way the gensim implementation – which copies the original Google word2vec.c in this respect – achieves a sort of distance-based weighting is actually via random dynamic shrinking of the actual window used. That is, for each visit during training to each target word, the effective window truly used is some random number from 1 to the user-specified window. By effectively using smaller windows much of the time, the nearer words have more influence – just without the cost of performing other scaling on the whole window's words every time. But in your case, with a giant window value, it will be incredibly rare for the effective-window to ever be smaller than your actual sentences. Thus every word will be included, equally, almost every time.)
All these considerations would be the same using SG or CBOW mode.
I believe a million-times-larger window will be adequate for your needs, for if for some reason it wasn't, another way to essentially cancel-out any nearness effects could be to ensure your corpus's items individual word-orders are re-shuffled between each time they're accessed as training data. That ensures any nearness advantages will be mixed evenly across all words – especially if each sentence is trained on many times. (In a large-enough corpus, perhaps even just a 1-time shuffle of each sentence would be enough. Then, over all examples of co-occurring words, the word co-occurrences would be sampled in the right proportions even with small windows.)
Other tips:
If your training data starts in some arranged order that clumps words/topics together, it can be beneficial to shuffle them into a random order instead. (It's better if the full variety of the data is interleaved, rather than presented in runs of many similar examples.)
When your data isn't true natural-language data (with its usual distributions & ordering significance), it may be worth it to search further from the usual defaults to find optimal metaparameters. This goes for negative, sample, & especially ns_exponent. (One paper has suggested the optimal ns_exponent for training vectors for recommendation-systems is far different from the usual 0.75 default for natural-language modeling.)

Method to find optimal-ish result for numerous combinations

I have a pool of n objects, and each object has a known degree of "compatibility" with each other object. Recognizing that "best" is a relative term, how can I best group these objects into n/2 pairs where the overall result is "good"?
To further clarify, it'd be fairly simple to just iterate over the entire batch and say, "What's the best match for this one?", put them together, and move on, but that may lead to some items at the end of the process that have very low compatibility being paired, and a small change earlier on (perhaps making a second-best match) might have increased the overall quality of matches across the entire pool.

Constant-time Spelling Correction on Ten Million Entities

I have a list of ~10M entities. I need to match an entity that a user types out with an entity from the list. Users often misspell the entities (ie. orang instead of orange). I need to correct 1-2 instances of letter replacement (aca instead of aba), letter insertion (aca instead of ac), and letter deletion (aca instead of acca). I want to do this in constant time with respect to the size of the entity list.
Making a dictionary of all possible spellings that are off by 1-2 letters would be constant time but requires an intractably large amount of memory. Edit distance is linear in time with respect to the size of the entity list. I'm thinking there is probably a clever algorithm to prune down the candidate matches to <100 (maybe via a clever hash of the letters in the entity). Then I could run edit distance on the small set of candidates.
Does anyone know of a technique that will work here?
In addition to the linked document in Matt's comment (suggesting to only generate/compare/search via deletions), you can try using a DAWG aka MADFA aka DAFSA to store all possible distance=2 words. For example, for Python there's pyDAWG. Not sure if the space savings will be sufficient to meet your needs, as that depends on language, but if you affixes are similar, it could be quite significant: Each substitution/deletion is just an extra arc, and each insertion is only one more node.

Matching people based on names, DoB, address, etc

I have two databases that are differently formatted. Each database contains person data such as name, date of birth and address. They are both fairly large, one is ~50,000 entries the other ~1.5 million.
My problem is to compare the entries and find possible matches. Ideally generating some sort of percentage representing how close the data matches. I have considered solutions involving generating multiple indexes or searching based on Levenshtein distance but these both seem sub-optimal. Indexes could easily miss close matches and Levenshtein distance seems too expensive for this amount of data.
Let's try to put a few ideas together. The general situation is too broad, and these will be just guidelines/tips/whatever.
Usually what you'll want is not a true/false match relationship, but a scoring for each candidate match. That is because you never can't be completely sure if candidate is really a match.
The score is a relation one to many. You should be prepared to rank each record of your small DB against several records of the master DB.
Each kind of match should have assigned a weight and a score, to be added up for the general score of that pair.
You should try to compare fragments as small as possible in order to detect partial matches. Instead of comparing [address], try to compare [city] [state] [street] [number] [apt].
Some fields require special treatment, but this issue is too broad for this answer. Just a few tips. Middle initial in names and prefixes could add some score, but should be kept at a minimum (as they are many times skipped). Phone numbers may have variable prefixes and suffixes, so sometimes a substring matching is needed. Depending on the data quality, names and surnames must be converted to soundex or similar. Streets names are usually normalized, but they may lack prefixes or suffixes.
Be prepared for long runtimes if you need a high quality output.
A porcentual threshold is usually set, so that if after processing a partially a pair, and obtaining a score of less than x out of a max of y, the pair is discarded.
If you KNOW that some field MUST match in order to consider a pair as a candidate, that usually speeds the whole thing a lot.
The data structures for comparing are critical, but I don't feel my particular experience will serve well you, as I always did this kind of thing in a mainframe: very high speed disks, a lot of memory, and massive parallelisms. I could think what is relevant for the general situation, if you feel some help about it may be useful.
HTH!
PS: Almost a joke: In a big project I managed quite a few years ago we had the mother maiden surname in both databases, and we assigned a heavy score to the fact that = both surnames matched (the individual's and his mother's). Morale: All Smith->Smith are the same person :)
You could try using Full text search feature maybe, if your DBMS supports it? Full text search builds its indices, and can find similar word.
Would that work for you?

Shuffle and deal a deck of card with constraints

Here is the facts first.
In the game of bridge there are 4
players named North, South, East and
West.
All 52 cards are dealt with 13 cards
to each player.
There is a Honour counting systems.
Ace=4 points, King=3 points, Queen=2
points and Jack=1 point.
I'm creating a "Card dealer" with constraints where for example you might say that the hand dealt to north has to have exactly 5 spades and between 13 to 16 Honour counting points, the rest of the hands are random.
How do I accomplish this without affecting the "randomness" in the best way and also having effective code?
I'm coding in C# and .Net but some idea in Pseudo code would be nice!
Since somebody already mentioned my Deal 3.1, I'd like to point out some of the optimizations I made in that code.
First of all, to get the most flexibly constraints, I wanted to add a complete programming language to my dealer, so you could generate whole libraries of constraints with different types of evaluators and rules. I used Tcl for that language, because I was already learning it for work, and, in 1994 when Deal 0.0 was released, Tcl was the easiest language to embed inside a C application.
Second, I needed the constraint language to run fairly fast. The constraints are running deep inside the loop. Quite a lot of code in my dealer is little optimizations with lookup tables and the like.
One of the most surprising and simple optimizations was to not deal cards to a seat until a constraint is checked on that seat. For example, if you want north to match constraint A and south to match constraint B, and your constraint code is:
match constraint A to north
match constraint B to south
Then only when you get to the first line do you fill out the north hand. If it fails, you reject the complete deal. If it passes, next fill out the south hand and check its constraint. If it fails, throw out the entire deal. Otherwise, finish the deal and accept it.
I found this optimization when doing some profiling and noticing that most of the time was spent in the random number generator.
There is one fancy optimization, which can work in some instances, call "smart stacking."
deal::input smartstack south balanced hcp 20 21
This generates a "factory" for the south hand which takes some time to build but which can then very quickly fill out the one hand to match this criteria. Smart stacking can only be applied to one hand per deal at a time, because of conditional probability problems. [*]
Smart stacking takes a "shape class" - in this case, "balanced," a "holding evaluator", in this case, "hcp", and a range of values for the holding evaluator. A "holding evaluator" is any evaluator which is applied to each suit and then totaled, so hcp, controls, losers, and hcp_plus_shape, etc. are all holding evalators.
For smartstacking to be effective, the holding evaluator needs to take a fairly limited set of values. How does smart stacking work? That might be a bit more than I have time to post here, but it's basically a huge set of tables.
One last comment: If you really only want this program for bidding practice, and not for simulations, a lot of these optimizations are probably unnecessary. That's because the very nature of practicing makes it unworthy of the time to practice bids that are extremely rare. So if you have a condition which only comes up once in a billion deals, you really might not want to worry about it. :)
[Edit: Add smart stacking details.]
Okay, there are exactly 8192=2^13 possible holdings in a suit. Group them by length and honor count:
Holdings(length,points) = { set of holdings with this length and honor count }
So
Holdings(3,7) = {AK2, AK3,...,AKT,AQJ}
and let
h(length,points) = |Holdings(length,points)|
Now list all shapes that match your shape condition (spades=5):
5-8-0-0
5-7-1-0
5-7-0-1
...
5-0-0-8
Note that the collection of all possible hand shapes has size 560, so this list is not huge.
For each shape, list the ways you can get the total honor points you are looking for by listing the honor points per suit. For example,
Shape Points per suit
5-4-4-0 10-3-0-0
5-4-4-0 10-2-1-0
5-4-4-0 10-1-2-0
5-4-4-0 10-0-3-0
5-4-4-0 9-4-0-0
...
Using our sets Holdings(length,points), we can compute the number of ways to get each of these rows.
For example, for the row 5-4-4-0 10-3-0-0, you'd have:
h(5,10)*h(4,3)*h(4,0)*h(0,0)
So, pick one of these rows at random, with relative probability based on the count, and then, for each suit, choose a holding at random from the correct Holdings() set.
Obviously, the wider the range of hand shapes and points, the more rows you will need to pre-compute. A little more code, you can still do this with some cards pre-determined - if you know where the ace of spades or west's whole hand or whatever.
[*] In theory, you can solve these conditional probability issues for smart stacking with multiple hands, but the solution to the problem would make it effective only for extremely rare types of deals. That's because the number of rows in the factory's table is roughly the product of the number of rows for stacking one hand times the number of rows for stacking the other hand. Also, the h() table has to be keyed on the number of ways of dividing the n cards amongst hand 1, hand 2, and other hands, which changes the number of values from roughly 2^13 to 3^13 possible values, which is about two orders of magnitude bigger.
Since the numbers are quite small here, you could just take the heuristic approach: Randomly deal your cards, evaluate the constraints and just deal again if they are not met.
Depending on how fast your computer is, it might be enough to do this:
Repeat:
do a random deal
Until the board meets all the constraints
As with all performance questions, the thing to do is try it and see!
edit I tried it and saw:
done 1000000 hands in 12914 ms, 4424 ok
This is without giving any thought to optimisation - and it produces 342 hands per second meeting your criteria of "North has 5 spades and 13-16 honour points". I don't know the details of your application but it seems to me that this might be enough.
I would go for this flow, which I think does not affect the randomness (other than by pruning solutions that do not meet constraints):
List in your program all possible combinations of "valued" cards whose total Honour points count is between 13 and 16. Then pick randomly one of these combinations, removing the cards from a fresh deck.
Count how many spades you already have among the valued cards, and pick randomly among the remaining spades of the deck until you meet the count.
Now pick from the deck as much non-spades, non-valued cards as you need to complete the hand.
Finally pick the other hands among the remaining cards.
You can write a program that generates the combinations of my first point, or simply hardcode them while accounting for color symmetries to reduce the number of lines of code :)
Since you want to practise bidding, I guess you will likely be having various forms of constraints (and not just 1S opening, as I guess for this current problem) coming up in the future. Trying to come up with the optimal hand generation tailored to the constraints could be a huge time sink and not really worth the effort.
I would suggest you use rejection sampling: Generate a random deal (without any constraints) and test if it satisfies your constraints.
In order to make this feasible, I suggest you concentrate on making the random deal generation (without any constraints) as fast as you can.
To do this, map each hand to a 12byte integer (the total number of bridge hands fits in 12 bytes). Generating a random 12 byte integer can be done in just 3, 4 byte random number calls, of course since the number of hands is not exactly fitting in 12 bytes, you might have a bit of processing to do here, but I expect it won't be too much.
Richard Pavlicek has an excellent page (with algorithms) to map a deal to a number and back.
See here: http://www.rpbridge.net/7z68.htm
I would also suggest you look at the existing bridge hand dealing software (like Deal 3.1, which is freely available) too. Deal 3.1 also supports doing double dummy analysis. Perhaps you could make it work for you without having to roll one of your own.
Hope that helps.

Resources