Are add/remove set CRDTs monotonic? - data-structures

The internals of an add/remove set CRDT are monotonic, because we only ever add to the internal sets, so the internal state of the CRDT cannot ever go backwards in logical time.
However, the observed state of the CRDT is that we're adding and removing elements, so the observed state doesn't have to be monotonic.
If we chain these systems together and take actions based on the presence or absence of an element, it doesn't look very monotonic anymore. The final state will still converge eventually, but we may or may not see some elements for a while before it stabilizes. It's not unlikely that some side-effect happens because of that intermediate state, such as a user reading the state of the system and reacting before it converges.
What does it mean for a CRDT to be monotonic?

Just to add a TL;DR on top of alekibango's awesome answer:
Monotonicity refers to the fact that once operations are observed and applied by a replica, the object's state will always take into consideration that operation.
Once an operation is applied, it will never be un-applied.
The observed non-monotonicity of (most) CRDT Sets does not invalidate the CRDT monotonic property.
CRDT Sets that support the remove operation are, at their core, two G-Sets:
One of the G-Sets is the set of elements that were added.
The other G-set is the set of elements that were removed.
The observed state is the set of added elements minus the set of removed elements. Although each of the internal sets is clearly monotonic, their difference can appear to be non-monotonic.
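To make the distinction concrete, here is a minimal sketch of a 2P-Set in Python. The class name TwoPhaseSet and its method names are made up for this illustration and do not come from any library. It only aims to show that both internal sets grow while the observed value can shrink.

```python
class TwoPhaseSet:
    """Sketch of a 2P-Set: two grow-only sets plus a derived observed value."""

    def __init__(self):
        self.added = set()    # G-Set of every element ever added
        self.removed = set()  # G-Set of every element ever removed (tombstones)

    def add(self, x):
        self.added.add(x)

    def remove(self, x):
        # Only elements that were observed as added can be removed.
        if x in self.added:
            self.removed.add(x)

    def merge(self, other):
        # Merging is a union on both sides, so the internal state only grows.
        self.added |= other.added
        self.removed |= other.removed

    def value(self):
        # The observed state is a set difference, which is free to shrink.
        return self.added - self.removed


replica = TwoPhaseSet()
replica.add("a")
replica.add("b")
before = (set(replica.added), set(replica.removed), replica.value())
replica.remove("a")
after = (set(replica.added), set(replica.removed), replica.value())
```

Comparing before and after, both internal sets are supersets of their earlier selves, while the observed value lost an element.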

CRDT means Conflict-Free Replicated Data Type
This means that (diverging) instances of a CRDT can be merged together (in whatever order and with repetitions) and will finally reach a correct, consistent state.
Monotonicity might help with implementing this (see CALM -- Consistency as Logical Monotonicity). But it is not a requirement for your set instance.
Read these notes on CRDTs: https://github.com/pfrazee/crdt_notes
Some Examples of CRDT sets are:
G-set (grow only set, only adding items)
2P-set (keeps tombstones, element can be inserted once only)
LWW-set (uses timestamps for marking the 'time' of adding/removing an item, allows adding/removing an item multiple times; a concurrent add and remove is decided using a bias)
OR-set (similar to LWW-set, but uses unique tags to be sure which element we are removing)
Optimized OR-Sets (usable without keeping many tombstones around; see those if your sets are big and have many changes)
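As an illustration of how a bias settles a concurrent add and remove, here is a rough sketch of an LWW-set in Python. The class LWWSet, its bias parameter and the integer timestamps are assumptions made for this example; real implementations differ in how they generate and compare timestamps.

```python
class LWWSet:
    """Sketch of a last-writer-wins element set with a configurable bias."""

    def __init__(self, bias="add"):
        self.bias = bias     # which side wins when timestamps are equal
        self.add_ts = {}     # element -> latest add timestamp seen
        self.remove_ts = {}  # element -> latest remove timestamp seen

    def add(self, x, ts):
        self.add_ts[x] = max(ts, self.add_ts.get(x, ts))

    def remove(self, x, ts):
        self.remove_ts[x] = max(ts, self.remove_ts.get(x, ts))

    def merge(self, other):
        # Keeping the maximum timestamp per element makes merging
        # commutative, associative and idempotent.
        for x, ts in other.add_ts.items():
            self.add(x, ts)
        for x, ts in other.remove_ts.items():
            self.remove(x, ts)

    def __contains__(self, x):
        if x not in self.add_ts:
            return False
        if x not in self.remove_ts:
            return True
        added, removed = self.add_ts[x], self.remove_ts[x]
        return added > removed or (added == removed and self.bias == "add")
```

With bias="add", an element added and removed at the same timestamp stays in the set; with bias="remove" it does not.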
some links to read more:
https://hal.inria.fr/inria-00609399v1/document
https://github.com/CBaquero/delta-enabled-crdts
https://doc.akka.io/docs/akka/2.5/distributed-data.html
https://www.youtube.com/watch?v=ebWVLVhiaiY
https://www.youtube.com/watch?v=veeWamWy8dk
https://www.youtube.com/watch?v=xxjHC3yLDqw
https://www.youtube.com/watch?v=PQzNW8uQ_Y4

Related

Logoot CRDT: interleaving of data on concurrent edits to the same spot?

I want to implement Logoot for eventually-convergent P2P text editing and I've run into a bit of a problem.
My understanding of Logoot is that the intervals between objects (lines of text in the original paper, but could be characters or words) can be divided infinitely on account of an unbounded identifier. This means that the position of an object is determined not by its neighbors as in WOOT (which would require tombstones) but by a fixed numerical point along the length of the string. Combined with a unique site identifier, this also gives us a total order and enables eventual convergence.
However... doesn't this cause a problem when concurrent edits are made to the same spot? If two disconnected clients start writing new sentences at the same cursor position and then merge, their sentences have a good chance of interleaving.
Below is a whiteboard example of what I'm talking about:
As you can see, both site B and site C divide the interval between "I" and "conquered" according to the rules of Logoot, giving us random points between the positions of (20,A) and (25,A). But nothing orders these points relative to each other, causing them to mix when merged. Meanwhile, neighbor-based algorithms can account for this issue since the causality chain of each object is preserved.
The above is a baby example, but in the more general case, imagine if two users wanted to insert a different sentence between two existing sentences. If one of the users happened to be offline, they shouldn't come back to a garbled mess! Clearly, to preserve intent, one sentence should follow the other.
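The effect can be reproduced with a heavily simplified model. The sketch below is not Logoot itself: it uses exact fractions instead of Logoot's variable-length identifiers, and the function insert_run, the site names and the example words are invented for illustration. It only shows that positions chosen independently inside the same interval carry no relative order between sites.

```python
import random
from fractions import Fraction


def insert_run(words, left, right, site, rng):
    """Place words one after another between positions `left` and `right`.

    Each word gets a random position strictly between the previously
    inserted word and the right neighbour. The site name breaks ties.
    """
    ids = []
    lo, hi = Fraction(left), Fraction(right)
    for word in words:
        lo = lo + (hi - lo) * Fraction(rng.randint(1, 99), 100)
        ids.append(((lo, site), word))
    return ids


# Two disconnected sites both type into the gap between positions 20 and 25.
site_b = insert_run(["came,", "saw,", "and"], 20, 25, "B", random.Random(1))
site_c = insert_run(["never", "really"], 20, 25, "C", random.Random(2))

# Merging is just sorting by identifier.
merged = [word for _, word in sorted(site_b + site_c)]
```

Each site's own words keep their relative order in merged, but nothing prevents the words of site C from landing between those of site B.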
Am I missing something in my reading of the paper, or is this an inherent downside to Logoot?
(Also, why is there a recorded clock value that's seemingly unused in the algorithm? The paper even points out that each object's identifier is necessarily unique without the clock.)
You're correct, this is a real anomaly in Logoot and LSEQ. Whether or not it constitutes an intention violation depends on what your definition of intention is. An extension to the definition requiring that contiguous sequences remain contiguous unless they are split by a causally subsequent operation would make intuitive sense.
The clock is unnecessary. Most likely the authors used the (site, clock) pair or Lamport timestamp as their UUIDs out of convention. One site can never create two identical positions, so clocks will never need to be compared. (Assuming messages are received from a site in order, which is required for other aspects of Logoot/LSEQ too.)

Algorithm for concurrent queue (single consumer, multiple producers) based on shared dictionary data structure

I am looking for a queue algorithm that fulfills the following properties:
Processes communicate using only a shared dictionary (key-value-store)
Does not use any atomic operations other than load and store (no CAS, for example)
Supports multiple producers
Supports a single consumer
Producers can die at any time and queue must remain operational
The consumer can also die at any time and be restarted later, but there will never be more than one consumer-process running at a time
This is meant as a general question about a suitable algorithm, since I'd like to use it in a couple of different scenarios. But to help visualize the requirements, here is an example use-case:
I have a website with two pages: producer.html and consumer.html
producer.html can be opened in multiple tabs simultaneously
Each producer.html adds events to the queue
One copy of consumer.html is open and consumes these events (to aggregate and stream them to a webserver, for example)
If the multiple producer-tabs are opened by the user rather than the page, these tabs do not have references to each other available, so the usual communication methods (postMessage or calling directly into the other tab's JS code) are out. One of the ways they can still communicate with each other is via LocalStorage as suggested here: Javascript; communication between tabs/windows with same origin. But LocalStorage is not "thread-safe" as detailed here.
Note: There may be other ways to implement cross-tab communication in the browser (Flash, ...), but these are NOT the aim of this question as they won't translate to my other use-cases. This is really just an example use-case for the general queue algorithm that I am trying to find.
A couple more parameters:
The number of producers will never be very large (10s or 100s maybe), so the scaling of the number of reads and writes needed with respect to the number of producers is not really a concern.
I don't know beforehand how many producers I might have and there is no immediately obvious way to assign a number or index to them. (Many mutex algorithms (Lamport's Bakery, Eisenberg&McGuire, Szymański's, ...) maintain an array of state for each process, which wouldn't necessarily be a natural approach here, although I do not want to exclude these approaches ex ante, if they can be implemented using the shared dictionary in some way...)
The algorithm should be 100% reliable. So, I'd like to avoid things like the delay in Lamport's first Fast Mutex algorithm (page 2 in the PDF) since I don't have any kind of real-time guarantees.
It would be very helpful if the queue was FIFO, but it's not strictly required.
The algorithm should not be encumbered by any patents, etc.
Update:
The Two-Lock Concurrent Queue Algorithm by Michael and Scott looks like it could work, but I would need two things to implement it:
A locking mechanism using the shared dictionary that can survive the crash of a lock-holder
A reliable way to allocate a new node (if I move the allocation into the locked section, I could just generate new random keys until I find one that's not in use yet, but there might be a better way?)
Update 2:
It seems, I wasn't being specific enough about the dictionary:
It's really nothing more than a trivial key-value-store. It provides the functions get(key) to read the value of a key, put(key, value) to change the value of a key, and delete(key) to remove a key. In some of my use-cases, I can also iterate over keys, but if possible, I'd like to avoid it for generality. Keys are arbitrary and the producers and consumers can create or calculate them as needed. The dictionary does not provide any facilities for automatically generating unique keys.
Examples are HTML LocalStorage, Google AppEngine's Datastore, a Java Map, a Python dictionary, or even a file-system with only a single directory (where the keys would be the file-names and the values the content of the files).
After quite a bit of further reading and sleeping on things for a night, I came up with one way that should be able to accomplish what I need, but it might not be the most elegant:
The paper Wait-Free Algorithms for Fast, Long-Lived Renaming by Moir and Anderson generalizes Lamport's Fast Mutex Algorithm #2 (page 6 here) into the following building block (Figure 2):
When n processes enter this section of code, at most one of them will stop, at most n-1 will move right and at most n-1 will move down.
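Since the figure is not reproduced here, the following Python sketch shows the building block as it is commonly presented (often called a "splitter"). Treat it as an assumption about the figure's content rather than a copy of it. The generator form and the helper run are only a device for stepping through interleavings by hand; the shared variables live in a plain dict called box.

```python
def splitter(box, p):
    """One process's pass through the building block, as a generator.

    Each `yield` marks a point where another process may run.
    The generator's return value is "stop", "right" or "down".
    """
    box["X"] = p
    yield
    if box.get("Y"):          # a missing Y counts as false
        return "right"
    yield
    box["Y"] = True
    yield
    if box["X"] != p:
        return "down"
    return "stop"


def run(gen):
    """Drive a generator to completion and return its result."""
    try:
        while True:
            next(gen)
    except StopIteration as done:
        return done.value


# Without contention the first process stops; a later one finds Y set.
box = {}
alone = run(splitter(box, "p1"))
late = run(splitter(box, "p2"))

# Two processes that both write X before either reads Y.
contended = {}
a = splitter(contended, "a")
b = splitter(contended, "b")
next(a)                       # a writes X
next(b)                       # b overwrites X
result_a = run(a)
result_b = run(b)
```

In the contended interleaving neither process stops, which is consistent with "at most one" stopping.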
In Lamport's algorithm, stopping means the process acquired the lock, whereas moving right or down will simply send the process back to the beginning of this section of code. To release the lock, a process simply sets Y back to false. (Not quite correct, actually... See "Update" below...)
The big problem with this is that if any of the processes ever die while holding the lock (i.e. before releasing it), the block will simply stay locked forever.
Another problem is that every process needs to be assigned a unique process ID p.
The locked-forever problem can be fixed by borrowing an idea from Moir and Anderson, namely to send processes that end up moving right or down into a different building block rather than back to this one, leading to a structure like this (Figure 3 in the paper):
Except that in this case, I won't be using this grid to assign process IDs as M&A did (although I could probably solve the problem of the unique values for p with this). Instead, every box in the grid will correspond to a very simple queue. If a process stops on a box, it acquired the tail-lock for the corresponding queue (e.g. as per the algorithm by Michael and Scott) and proceeds to enqueue a new element to that queue. Upon completion, it sets the Y value of the box back to false to allow other processes to use this queue. This way, if there is high contention or if processes die before releasing locks, new queues will be created dynamically as needed.
The consumer-process doesn't need to worry about locking the heads of the queues when dequeuing elements, since it's the only process to ever do so. So, it simply traverses the tree of boxes to find all queues and trivially helps itself to their contained elements. One thing to note is that while each individual queue will be FIFO, there is no synchronization between the queues, so the combined queue will not necessarily be FIFO.
If we now change the boolean Y to a time-stamp (or null/0 to indicate false), the consumer can also expire locks after some safe timeout to re-activate dead queues.
A note about implementation using the dictionary:
The shared variables X and Y can be entries in the dictionaries with key-names X_123 and Y_123, where 123 is the number of the box.
p can simply be any unique random string and will be stored as the value of key X_123.
The boolean or time-stamp is also simply stored as the value of key Y_123. The producer-processes interpret a missing entry for Y_123 as false or null/0.
The box-numbers 123 need to be calculated from the move-pattern. One way to do this would be to start with 1 in the top-left corner. If the process stops in that box, we're done. If not, the current number (starting with 1) is shifted left by 1 (i.e. multiplied by 2) and, if the process moved down, also incremented by 1. Smaller (and fewer) numbers can be calculated with a different numbering scheme (I still need to work it out), but this one should work.
The queues then consist of one entry with key H_123 that holds the index of the current head of the queue in its value and one entry with key T_123 that holds the index of the tail. Both default to 0 if they don't exist.
To enqueue an item into queue 123, the tail index is read from T_123 (let's say it yields 48) and an entry with key Q_123_48 is put into the dictionary with its value containing the enqueued item. Afterwards, T_123 is incremented by 1.
After the item is enqueued, the Y_123 entry is set back to false or null/0 (not deleted!)
To dequeue an item, the head index is read from H_123 (let's say it yields 39) and compared to the tail index T_123. If it is smaller, an item is available at Q_123_39, which is then read and deleted from the dictionary. Afterwards, H_123 is incremented by 1.
To traverse the box-tree, the consumer starts with the box in the top left corner. For each box (e.g. 123), if a key Y_123 exists in the dictionary (even if it contains values null/0 or false), the consumer dequeues items from the corresponding queue, and then recursively moves right and down to the adjacent boxes. If no key Y_123 exists, this box hasn't been used by any processes yet and doesn't need to be considered (and neither do the boxes below or to its right).
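The enqueue and dequeue steps above could look roughly like this in Python, using a plain dict as a stand-in for the shared dictionary. The function names are made up, and the sketch assumes the caller already holds the tail lock of the box when enqueuing; the locking and the box traversal are left out.

```python
def enqueue(store, box, item):
    """Append an item to the queue of `box` (tail lock assumed to be held)."""
    tail = store.get(f"T_{box}", 0)      # a missing tail index defaults to 0
    store[f"Q_{box}_{tail}"] = item      # write the item first
    store[f"T_{box}"] = tail + 1         # then publish it by advancing the tail


def dequeue(store, box):
    """Remove and return the oldest item of `box`, or None if it is empty."""
    head = store.get(f"H_{box}", 0)
    tail = store.get(f"T_{box}", 0)
    if head >= tail:
        return None
    key = f"Q_{box}_{head}"
    item = store[key]                    # read the item ...
    del store[key]                       # ... delete it ...
    store[f"H_{box}"] = head + 1         # ... and only then advance the head
    return item
```

Only the single consumer ever touches H_123, which is why dequeue needs no lock in this scheme.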
I haven't actually implemented this yet, but I'll do that next. I just wanted to post this already to see if it could inspire other approaches or if anyone can see anything wrong with this idea.
Update:
I just noticed one complication: It is possible that if two processes are trying to acquire the lock for a queue simultaneously, both will fail and move on to the next block. This will leave that queue locked forever as no-one will be left to set Y back to false or null/0.
This is the reason why the "Long-Lived Renaming" algorithm by M&A as well as Lamport's algorithm #2 use an array of Y-values in which every process has its own entry that it resets also if it moves on to another block. Y is then only considered false if all entries are false.
Since I don't know before-hand how many processes I will have, I could implement this only if the dictionary had some way of enumerating keys (the keys would then be Y_123_456 where 456 is the value of p for each process).
But, with rare contention and the above described timeout-mechanism for reactivating dead queues, the issue might lead to only a little bit of memory inefficiency, rather than a major problem.
Update 2:
A better way to label the boxes would be this pattern:
If we call the total number of moves n (counting the move into the top left box also, i.e. n ≥ 1) and the number of moves to the right r, then the box-number can be calculated using
box = (n × (n - 1))/2 + r
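As a quick check of the formula, here is a small Python helper. The function name box_number and the representation of a path as a list of "right"/"down" strings are assumptions made for this example.

```python
def box_number(moves):
    """Box number reached by a path of moves, each "right" or "down".

    n counts the initial move into the top-left box as well,
    so n = len(moves) + 1, and r is the number of moves to the right.
    """
    n = len(moves) + 1
    r = sum(1 for move in moves if move == "right")
    return n * (n - 1) // 2 + r
```

The top-left box gets number 0, the two boxes one move away get 1 and 2, the next diagonal gets 3 to 5, and so on. Two paths that end in the same box yield the same number because only the counts of moves matter.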
Just use an RDBMS. It's pretty simple in MS SQL; for PostgreSQL you'd have to use the RETURNING keyword, and for MySQL you'd probably have to use triggers.
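CREATE TABLE Q ([Key] BIGINT IDENTITY(1,1) PRIMARY KEY, [Message] NVARCHAR(4000));
-- Enqueue: @message is the parameter holding the payload.
INSERT INTO Q ([Message]) OUTPUT inserted.* VALUES (@message);
-- Dequeue: READPAST skips rows that are locked by other sessions.
DELETE TOP (1) FROM Q WITH (READPAST) OUTPUT deleted.*;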
If you were really hoping for an algorithmic solution, just use a ring buffer.
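using System.Threading;

public static class RingBufferQueue {
    const int MAX_Q_SIZE = 20000000;
    static readonly string[] Q = new string[MAX_Q_SIZE];
    static long ProducerID = 0;
    static long ConsumerID = 0;

    // Atomically claims the next slot and stores the message there.
    // Note: nothing stops producers from lapping a slow consumer.
    public static long Produce(string message) {
        long key = Interlocked.Increment(ref ProducerID);
        int idx = (int)(key % MAX_Q_SIZE);
        Volatile.Write(ref Q[idx], message);
        return key;
    }

    // Single consumer only. Returns null if the next slot is not filled yet.
    public static string Consume() {
        int idx = (int)((ConsumerID + 1) % MAX_Q_SIZE);
        string message = Interlocked.Exchange(ref Q[idx], null);
        if (message != null) ConsumerID++;
        return message;
    }
}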

Synchronize two lists of objects

Problem
I have two lists of objects. Each object contains the following:
GUID (allows determining whether objects are the same from a business point of view)
Timestamp (updated to the current UTC time each time the object changes)
Version (positive integer; incremented each time the object changes)
Deleted (boolean flag; switched to "true" instead of actually deleting the object)
Data (some useful payload)
Any other fields if needed
Next, I need to sync two lists according to these rules:
If an object with some GUID is present in only one list, it should be copied to the other list
If an object with some GUID is present in both lists, the instance with the lower Version should be replaced with the one having the greater Version (nothing to do if the versions are equal)
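For reference, these two rules fit in a few lines of Python when both lists are held in memory. This is only a sketch under assumptions: each list is modelled as a dict keyed by GUID, objects are dicts with a "version" field, and the function name sync is invented. It says nothing about how to do this cheaply over a network, which is the actual question.

```python
def sync(a, b):
    """Bring two in-memory lists into agreement; each is a dict keyed by GUID."""
    # The union of keys is a new set, so both dicts can be modified in the loop.
    for guid in a.keys() | b.keys():
        in_a, in_b = a.get(guid), b.get(guid)
        if in_a is None:
            a[guid] = in_b                       # present only in b
        elif in_b is None:
            b[guid] = in_a                       # present only in a
        elif in_a["version"] < in_b["version"]:
            a[guid] = in_b                       # b has the newer instance
        elif in_b["version"] < in_a["version"]:
            b[guid] = in_a                       # a has the newer instance
        # Equal versions: nothing to do.
```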
Real-world requirements:
Each list has 50k+ objects, each object is about 1 KB
Lists are placed on different machines connected via Internet (e.g., mobile app and remote server), thus, algorithm shouldn't waste the traffic or CPU much
Most of the time (say, 96%) the lists are already in sync before the sync process starts, hence the algorithm should determine this with minimal effort
If there are any differences, most of the time they are pretty small (3-5 objects changed/added)
Should proceed OK if one list is empty (and other still has 50k+ items)
Solution #1 (currently implemented)
Client stores the time-of-last-sync-succeed (say T)
Both lists are asked for all objects having Timestamp > T (i.e. recently modified; in the production it's ... > (T - day) for better robustness)
These lists of recently modified objects are synced naively:
items present only in the first list are saved to the second list
items present only in the second list are saved to the first list
other items have their Versions compared and are saved to the appropriate list (if needed)
Pros:
Works great with small changes
Almost fits the requirements
Cons:
Depends on T, which makes the algorithm fragile: it's easy to sync last updates, but hard to make sure lists are completely synced (using minimal T like 1970-01-01 just hangs the sync process)
My questions:
Is there any common / best-practice / proven way to sync object lists?
Are there any better solutions [than #1] for my case?
P.S. Already viewed, not duplicates:
Compare Two List Of Objects For Synchronization
Two list synchronization
Summary
All answers have some worthwhile points. To summarize, here is the compiled answer I was looking for, based on the working sync system I finally implemented:
In general, use Merkle trees. They are extremely efficient at comparing large amounts of data.
If you can, rebuild your hash tree from scratch every time you need it.
Check the time required to rebuild hash tree. Most likely it's pretty fast (e.g., in my case on Nexus 4 rebuilding tree for 20k items takes ~2 sec: 1.8 sec for fetching data from DB + 0.2 sec for building tree; the server performs ~20x faster), so you don't need to store the tree in the DB and maintain it when data changed (my first try was rebuilding only relevant branches — it's not too complicated to implement, but is very fragile).
Nevertheless, it's OK to cache and reuse the tree if no data modifications were made at all. Once a modification happens, invalidate the whole cache.
Technical details
GUID is 32 chars long without any hyphens/braces, lowercase;
I use 16-ary tree with the height of 4, where each branch is related to the GUID's char. It may be implemented as actual tree or map:
0000 → (hash of items with GUID 0000*)
0001 → (hash of items with GUID 0001*)
...
ffff → (hash of items with GUID ffff*);
000 → (hash of hashes 000_)
...
00 → (hash of hashes 00_)
...
() → (root hash, i.e. hash of hashes _)
Thus, the tree has 65536 leaves and requires 2 MB of memory; each leaf covers ~N/65536 data items. Binary trees would be 2x more efficient in terms of memory, but they are harder to implement.
I had to implement these methods:
getHash(): returns the root hash; used for the primary check (as mentioned, in 96% of cases that's all we need to test);
getHashChildren(x): returns the list of hashes x_ (at most 16); used for efficiently discovering data differences with a single request;
findByIdPrefix(x): returns items with GUID x*; x must contain exactly 4 chars; used for requesting leaf items;
count(x): returns the number of items with GUID x*; when it is reasonably small, we can skip checking the tree branch by branch and transfer the whole bunch of items in a single request;
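A rough Python sketch of such a prefix-keyed hash tree is shown below. It is not the implementation described above: the function names, the use of SHA-256, and hashing (GUID, version) pairs instead of full objects are all assumptions for the sake of a short example. Both trees are also compared locally here, whereas in the real setup each level would be fetched from the remote side on demand.

```python
import hashlib

DEPTH = 4  # number of leading GUID characters that select a leaf


def _digest(parts):
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
    return h.hexdigest()


def build_tree(items):
    """Map every used GUID prefix of length 0..DEPTH to a hash.

    `items` maps GUID -> version (a stand-in for the full object).
    """
    leaves = {}
    for guid in sorted(items):
        leaves.setdefault(guid[:DEPTH], []).append(f"{guid}:{items[guid]};")
    tree = {prefix: _digest(entries) for prefix, entries in leaves.items()}
    # Build the inner levels bottom-up: each hash covers its children's hashes.
    for length in range(DEPTH - 1, -1, -1):
        children = {}
        for prefix in sorted(p for p in tree if len(p) == length + 1):
            children.setdefault(prefix[:length], []).append(prefix + tree[prefix])
        for prefix, entries in children.items():
            tree[prefix] = _digest(entries)
    return tree


def differing_leaves(local, remote, prefix=""):
    """Return leaf prefixes whose hashes differ, descending only where needed."""
    if local.get(prefix) == remote.get(prefix):
        return []                 # identical (or both absent): prune this branch
    if len(prefix) == DEPTH:
        return [prefix]
    result = []
    for char in "0123456789abcdef":
        result += differing_leaves(local, remote, prefix + char)
    return result
```

When the root hashes match, differing_leaves returns immediately, which corresponds to the common already-synced case.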
Because syncing is done per branch, transmitting small amounts of data, it's very responsive (you can check the progress at any time), very robust against unexpected termination (e.g., due to network failure), and it easily restarts from the last point if needed.
IMPORTANT: sometimes you will get stuck in a conflicting state: {version_1 = version_2, but hash_1 != hash_2}. In this case you must make some decision (maybe with the user's help, or by comparing timestamps as a last resort) and overwrite one item with the other to resolve the conflict, otherwise you'll end up with unsynced and unsyncable hash trees.
Possible improvements
Implement transmitting (GUID, Version) pairs without payload to make requests more lightweight.
Two suggestions come to mind, the first one is possibly something you're doing already:
1) Don't send entire lists of items with timestamps > T. Instead, send a list of (UUID, Version) tuples of objects with timestamps > T. Then the other side can figure out which objects it needs to update from that. Send the UUIDs of those back to request the actual objects. This avoids sending full objects if they have timestamp > T, but are nonetheless newer already (or present already with the latest Version) on the other side.
2) Don't process the full list at once, but in chunks, i.e. first sync 10%, then the next 10% etc. to avoid transferring too much data at once for big syncs (and to allow for restarting points if a connection should break). This can be done by e.g. starting with all UUIDs with a checksum equivalent to 0 modulo 10, then 1 modulo 10 etc.
Another possibility would be proactive syncing, e.g. asynchronously posting changes, possibly via UDP (unreliable, as opposed to TCP). You would still need to sync when you need current information, but chances are most of it is current.
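Both suggestions can be sketched in a few lines of Python. The function names are invented, CRC32 is just one possible checksum, and versions are assumed to be positive integers as in the question.

```python
import zlib


def chunk_of(guid, chunks=10):
    """Assign a GUID to one of `chunks` buckets via a CRC32 checksum."""
    return zlib.crc32(guid.encode("ascii")) % chunks


def guids_to_request(remote_pairs, local_versions):
    """From the peer's (GUID, Version) pairs, pick the objects worth fetching.

    An object is requested if it is unknown locally or older locally.
    """
    return [guid for guid, version in remote_pairs
            if local_versions.get(guid, 0) < version]
```

A sync round for chunk i would then only exchange pairs whose chunk_of value equals i, and only request full objects for the GUIDs returned by guids_to_request.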
You need to store not the time of the last synchronization, but the state of the objects (e.g. the hash of the object data) at the time of the last synchronization. Then you compare each list with the stored list and find which objects have changed on each side.
This is much more reliable than relying on time, because time requires that both sides have synchronized clocks that give the precise time (and this is not the case on most systems). For the same reason, your idea of detecting changes based on time + version can be more error-prone than it initially seems.
Also, you don't initially transfer object data, only GUIDs.
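A minimal Python sketch of this idea follows. It assumes each side keeps its objects as a dict from GUID to a data string; the names snapshot and changed_since are hypothetical, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def snapshot(objects):
    """Record a hash of each object's data, keyed by GUID."""
    return {guid: hashlib.sha256(data.encode("utf-8")).hexdigest()
            for guid, data in objects.items()}


def changed_since(objects, last_snapshot):
    """GUIDs added, modified or removed since the snapshot was taken."""
    current = snapshot(objects)
    return {guid for guid in current.keys() | last_snapshot.keys()
            if current.get(guid) != last_snapshot.get(guid)}
```

Each side would run changed_since against the snapshot stored at the last successful sync and exchange only the resulting GUIDs at first.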
BTW we've made a framework (free with source) which addresses exactly your problems. I am not giving the link because some alternatively talented people would complain.

Efficient mass modification of persistent data structures

I understand how trees are typically used to modify persistent data structures (create a new node and replace all its ancestors).
But what if I have a tree of 10,000's of nodes and I need to modify 1000's of them? I don't want to go through and create 1000's of new roots, I only need the one new root that results from modifying everything at once.
For example:
Let's take a persistent binary tree for example. In the single update node case, it does a search until it finds the node, creates a new one with the modifications and the old children, and creates new ancestors up to the root.
In the bulk update case could we do:
Instead of just updating a single node, you're going to update 1000 nodes on it in one pass.
At the root node, the current list is the full list. You then split that list between those that match the left node and those that match the right. If none match one of the children, don't descend to it. You then descend to the left node (assuming there were matches), split its search list between its children, and continue. When you have a single node and a match, you update it and go back up, replacing and updating ancestors and other branches as appropriate.
This would result in only one new root even though it modified any number of nodes.
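The pass described above might look like this in Python for a persistent binary search tree. The Node type and the function bulk_update are made up for this sketch, and updates for keys that are not in the tree are simply ignored.

```python
from collections import namedtuple

Node = namedtuple("Node", "key value left right")


def bulk_update(node, updates):
    """Return a new tree with `updates`, a list of (key, value) pairs, applied.

    The list is split between the children at every node. Subtrees that
    receive no updates are shared with the old tree instead of copied.
    """
    if node is None or not updates:
        return node
    smaller = [(k, v) for k, v in updates if k < node.key]
    larger = [(k, v) for k, v in updates if k > node.key]
    here = [v for k, v in updates if k == node.key]
    return Node(
        node.key,
        here[-1] if here else node.value,   # the last update for a key wins
        bulk_update(node.left, smaller),
        bulk_update(node.right, larger),
    )
```

The old root stays valid and unchanged, and exactly one new root is produced no matter how many nodes were updated.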
These kinds of "mass modification" operations are sometimes called bulk updates. Of course, the details will vary depending on exactly what kind of data structure you are working with and what kind of modifications you are trying to perform.
Typical kinds of operations might include "delete all values satisfying some condition" or "increment the values associated with all the keys in this list". Frequently, these operations can be performed in a single walk over the entire structure, taking O(n) time.
You seem to be concerned about the memory allocation involved in creating "1000's of new roots". Typical allocation for performing the operations one at a time would be O(k log n), where k is the number of nodes being modified. Typical allocation for performing the single walk over the entire structure would be O(n). Which is better depends on k and n.
In some cases, you can decrease the amount of allocation--at the cost of more complicated code--by paying special attention to when changes occur. For example, if you have a recursive algorithm that returns a tree, you might modify the algorithm to return a tree together with a boolean indicating whether anything has changed. The algorithm could then check those booleans before allocating a new node to see whether the old node can safely be reused. However, people don't usually bother with this extra check unless and until they have evidence that the extra memory allocation is actually a problem.
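Here is one way the "tree plus boolean" idea could look in Python. The Node type and the function map_values are assumptions for this sketch, which applies a function to every value in a binary tree and reuses any node under which nothing changed.

```python
from collections import namedtuple

Node = namedtuple("Node", "key value left right")


def map_values(node, fn):
    """Apply fn to every value; return (tree, changed)."""
    if node is None:
        return None, False
    left, left_changed = map_values(node.left, fn)
    right, right_changed = map_values(node.right, fn)
    value = fn(node.value)
    if not (left_changed or right_changed) and value == node.value:
        return node, False            # nothing changed below: reuse the old node
    return Node(node.key, value, left, right), True
```

The walk still visits all n nodes, but it only allocates along paths that lead to an actual change.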
A particular implementation of what you're looking for can be found in Clojure's (and ClojureScript's) transients.
In short, given a fully-immutable, persistent data structure, a transient version of it will make changes using destructive (allocation-efficient) mutation, which you can flip back into a proper persistent data structure again when you're done with your performance-sensitive operations. It is only at the transition back to a persistent data structure that new roots are created (for example), thus amortizing the attendant cost over the number of logical operations you performed on the structure while it was in its transient form.

Caching vector addition over changing collections

I have the following setup:
I have a largish number of uuids (currently about 10k but expected to grow unboundedly - they're user IDs) and a function f : id -> sparse vector with 32-bit integer values (no need to worry about precision). The function is reasonably expensive (not outrageously so, but probably on the order of a few 100ms for a given id). The dimension of the sparse vectors should be assumed to be infinite, as new dimensions can appear over time, but in practice is unlikely to ever exceed about 20k (and individual results of f are unlikely to have more than a few hundred non-zero values).
I want to support the following operations efficiently:
add a new ID to the collection
invalidate an existing ID
retrieve sum f(id) in O(changes since last retrieval)
i.e. I want to cache the sum of the vectors in a way that's reasonable to do incrementally.
One option would be to support a remove ID operation and treat invalidation as a remove followed by an add. The problem with this is that it requires us to keep track of all the old values of f, which is expensive in space. I potentially need to use many instances of this sort of cached structure, so I would like to avoid that.
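For clarity, that remove-then-add option could be sketched as follows in Python. The class CachedSum is hypothetical, sparse vectors are modelled as dicts from dimension to integer, and the stored dict is exactly the per-ID copy of old values that makes this approach expensive in space.

```python
from collections import Counter


class CachedSum:
    """Sketch: cached sum of f(id) that remembers each ID's last vector."""

    def __init__(self, f):
        self.f = f
        self.stored = {}       # id -> vector currently included in the total
        self.dirty = set()     # ids added or invalidated since the last retrieval
        self.total = Counter()

    def add(self, id_):
        self.dirty.add(id_)

    # Invalidation is a remove followed by an add, both deferred to retrieve().
    invalidate = add

    def retrieve(self):
        for id_ in self.dirty:
            self.total.subtract(self.stored.get(id_, {}))   # remove the old value
            fresh = Counter(self.f(id_))
            self.total.update(fresh)                        # add the new value
            self.stored[id_] = fresh
        self.dirty.clear()
        return {dim: v for dim, v in self.total.items() if v != 0}
```

Here f is only re-evaluated for IDs that changed since the last retrieval, at the cost of keeping one stored vector per ID.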
The likely usage pattern is that new IDs are added at a fairly continuous rate and are frequently invalidated at first. Ids which have been invalidated recently are much more likely to be invalidated again than ones which have remained valid for a long time, but in principle an old Id can still be invalidated.
Ideally I don't want to do this in memory (or at least I want a way that lets me save the result to disk efficiently), so an idea which lets me piggyback off an existing DB implementation of some sort would be especially appreciated.
