Is there a strongly consistent group membership protocol?

I'm looking for an algorithm where groups of connected nodes can be merged together to form a new group (by creating links between nodes from different groups), and where a group can be partitioned to form new groups.
Unlike with consensus-style membership protocols (e.g. the one described in the Raft paper), where only one group can remain after a partition, I'd like each new partition to form a new group.
Also, I'd like that for each partition, all of its members agree on who belongs to that partition, with a strong consistency guarantee.
Or put differently, I'd like the following property to hold: After a group undergoes a membership change, if two nodes that belonged to the original group can still communicate (there is a path between the two), they should agree on the sequence of changes that happened to the group.
My understanding is that the fact that each new partition will agree on a different set of members implies, in a sense, that the Consistency part of the CAP theorem is relaxed, which gives me hope that such a protocol may exist.

No consensus protocol (such as Paxos, Raft, etc.) can be leveraged to develop a multi-group membership protocol. This is because all consensus protocols are built on the fundamental idea that a decision can be taken only if a majority of members has agreed to (accepted) it. This is how the "split-brain" phenomenon is avoided: there can never be two partitions, each of size at least the majority (n/2)+1, that have agreed on a different leader (and thus a different member set), since at least one member would belong to both partitions and would have voted for only one of them (the one that asked for its vote first).
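To make the quorum-intersection argument concrete, here is a small self-contained check (illustrative only, not part of any real protocol) that enumerates every minimal majority of a five-node cluster and verifies that any two of them share a node:

    import java.util.*;

    public class QuorumIntersection {
        public static void main(String[] args) {
            int n = 5;
            List<Set<Integer>> quorums = majorities(n);
            for (Set<Integer> a : quorums) {
                for (Set<Integer> b : quorums) {
                    Set<Integer> common = new HashSet<>(a);
                    common.retainAll(b);
                    if (common.isEmpty()) {
                        throw new AssertionError("disjoint majorities: " + a + " and " + b);
                    }
                }
            }
            System.out.println("every pair of majority quorums of " + n + " nodes intersects");
        }

        // Enumerate all subsets of {0..n-1} of size (n/2)+1, i.e. the smallest majorities.
        static List<Set<Integer>> majorities(int n) {
            int size = n / 2 + 1;
            List<Set<Integer>> result = new ArrayList<>();
            for (int mask = 0; mask < (1 << n); mask++) {
                if (Integer.bitCount(mask) != size) continue;
                Set<Integer> s = new HashSet<>();
                for (int i = 0; i < n; i++) {
                    if ((mask & (1 << i)) != 0) s.add(i);
                }
                result.add(s);
            }
            return result;
        }
    }

Because any two majorities overlap in at least one node, two partitions can never both gather a majority and decide independently, which is exactly why a majority-based consensus protocol cannot give you one group per partition.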
One protocol that could possibly be leveraged to create a multi-group membership protocol is Virtual Synchrony. However, note that virtual synchrony is a protocol used to send messages to (statically) predefined process groups, i.e., to the currently existing members of those groups. As a result, it is not designed for cases where new process groups should be created (dynamically) at each new partition. Also note that virtual synchrony does not scale to larger groups, since message latency is proportional to the group's size.
I believe that by using the virtual synchrony protocol, you could develop such a membership protocol, which could satisfy the condition
After a group undergoes a membership change, if two nodes that belonged to the original group can still communicate (there is a path between the two), they should agree on the sequence of changes that happened to the group
However, note that this membership is not strongly consistent in a strict sense, because the failure of a node is only propagated through the group eventually. Nonetheless, messages (which are what matters most) will be delivered in a way that respects the group's membership views. This is achieved by imposing an order on message deliveries at the members' side.
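As a rough sketch of the view-based delivery idea (the types and names here are mine, not from any virtual synchrony implementation), each message is stamped with the view it was sent in, and a member only delivers messages that match its current view; surviving members install the same sequence of views:

    import java.util.*;

    record View(long id, Set<String> members) {}
    record Message(long viewId, String sender, String payload) {}

    class Member {
        private View currentView;
        private final List<Message> delivered = new ArrayList<>();

        Member(View initial) { this.currentView = initial; }

        // Deliver only messages sent in the view this member currently has installed,
        // by a sender that belongs to that view.
        void onReceive(Message m) {
            if (m.viewId() == currentView.id() && currentView.members().contains(m.sender())) {
                delivered.add(m);
            }
            // A real implementation would buffer or flush out-of-view messages as part
            // of the view change; that machinery is omitted here.
        }

        // Installing a view changes the membership; all surviving members agree on the
        // same sequence of installed views, which is the property asked for above.
        void installView(View next) { this.currentView = next; }

        List<Message> deliveredMessages() { return delivered; }
    }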
Another alternative approach to a membership protocol is gossip-based membership, with real-life implementations integrated in various industry tools, such as Consul. To leverage this approach, you could emit multiple different classes of messages from each member, depending on the different groups that you would like to monitor. However, again, these groups are statically defined inside the protocol, and the membership view is only eventually consistent (meaning that every failure will eventually be detected by all live members).
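For illustration, a bare-bones heartbeat-gossip sketch (invented names, and much simpler than Consul's actual SWIM-based implementation): each member advances a per-group heartbeat counter and periodically merges its tables with a random peer, so knowledge of who is alive in each group spreads epidemically and only eventually converges:

    import java.util.*;

    class GossipMember {
        final String id;
        // group name -> (member id -> highest heartbeat counter seen for that member)
        final Map<String, Map<String, Long>> tables = new HashMap<>();

        GossipMember(String id) { this.id = id; }

        // Called periodically by the member itself, for every group it participates in.
        void beat(String group) {
            tables.computeIfAbsent(group, g -> new HashMap<>()).merge(id, 1L, Long::sum);
        }

        // One gossip round: push our tables to a (randomly chosen) peer, which keeps the
        // highest counter it has seen for every member. A member whose counter stops
        // advancing is eventually suspected as failed by everyone (detection omitted).
        void gossipTo(GossipMember peer) {
            tables.forEach((group, table) ->
                table.forEach((member, beat) ->
                    peer.tables.computeIfAbsent(group, g -> new HashMap<>())
                               .merge(member, beat, Long::max)));
        }
    }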
As a conclusion, I think that a strongly consistent membership protocol is not feasible under a strict definition, since you cannot distinguish between a member that has failed and a member that is responding really, really slowly (the basis of the FLP impossibility result and the CAP theorem).

Related

Priority Queues VS Queues

How is a Priority Queue a Queue data structure? Since it doesn't follow FIFO, shouldn't it be named Priority Array or Priority Linked List, mainly because priority queues don't behave like a FIFO queue?
In a priority queue, an element with high priority is served before an element with low priority.
'If two elements have the same priority, they are served according to their order in the queue'
I think this will answer your question.
If you look at the most commonly used implementations, priority queues are essentially heaps: the elements are arranged in a particular order based on a priority defined by the programmer; in a simple example, the ascending or descending order of integers.
Think of a priority queue as a queue where, rather than retrieving elements based on when you added them, you retrieve them based on how they compare with each other. This comparison can simply be ascending or descending order in textbook examples. You can understand the ADT from an analogy in another StackOverflow answer:
You're running a hospital and patients are coming in. There's only one
doctor on staff. The first man walks in - and he's served immediately.
Next, a man with a cold comes in and requires assistance. You add him
to the queue and he waits in line for the doctor to become available.
Next, a man with an axe in his head comes through the door. He is
assigned a higher priority because he is a higher medical liability.
So the man with the cold is bumped down in line. Next, someone comes
in with breathing problems. So, once again, the man with the cold is
bumped down in priority. This is called triaging in the real world -
but in this case it's a medical line.
Implementing this in code would use a priority queue and a worker
thread (the doctor) to perform work on the consumable / units of work
(the patients).
In a real scenario, instead of patients, you might have processes waiting to be scheduled on the CPU.
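For illustration, the triage analogy maps almost directly onto java.util.PriorityQueue; the Patient type and the severity numbers below are invented for the example:

    import java.util.Comparator;
    import java.util.PriorityQueue;

    record Patient(String name, int severity) {}  // higher severity = treated sooner

    public class Triage {
        public static void main(String[] args) {
            PriorityQueue<Patient> waitingRoom =
                    new PriorityQueue<>(Comparator.comparingInt(Patient::severity).reversed());

            waitingRoom.add(new Patient("man with a cold", 1));
            waitingRoom.add(new Patient("man with an axe in his head", 10));
            waitingRoom.add(new Patient("breathing problems", 7));

            // The "doctor" (the worker thread in the analogy) always takes the most
            // severe case next, regardless of arrival order.
            while (!waitingRoom.isEmpty()) {
                System.out.println("treating: " + waitingRoom.poll().name());
            }
            // Prints: axe in head, then breathing problems, then cold (not arrival order).
        }
    }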
Read:
When would I use a priority queue?
In a plain queue, the natural ordering given by how long an element has been waiting in line can be considered the fairest: when you enter a line waiting for something, it's first come, first served.
Sometimes, however, there is something special about some elements that might suggest they should be served sooner than others that have waited longer. For example, we don't always read our emails in the order we received them; we often skip newsletters or "funny" jokes from friends to read work-related messages first.
Likewise, when you design or test an app, any bugs found are prioritized, and teams work on them based on severity. First, new bugs are discovered all the time, so new items will be added to the list. Say a nasty authentication bug is found; you'd need to have it solved by yesterday! Moreover, the priority of a bug can change over time. For instance, your CEO might decide that you are going after the market share that mostly uses browser X, and you have a big feature launch next Friday, so you really need to solve that bug at the bottom of the list within a couple of days.
Priority queues are especially useful when we need to consume elements in a certain order from a dynamically changing list (such as the list of tasks to run on a CPU), so that at any time we can get the next element (according to a certain criterion), remove it from the list, and (usually) stop worrying about the other elements.
That's the idea behind priority queues: they behave like regular, plain queues, except that the front of the queue is dynamically determined based on some kind of priority. The differences that the introduction of priority causes in the implementation are profound, enough to deserve a special kind of data structure.
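To tie this back to the bug-tracking example, here is a small sketch using java.util.PriorityQueue (the Bug type and severities are invented). Note that this class has no decrease-key operation, so a priority change is done by removing and re-inserting the element:

    import java.util.Comparator;
    import java.util.PriorityQueue;

    record Bug(String title, int severity) {}  // higher severity = fix first

    public class BugTriage {
        public static void main(String[] args) {
            PriorityQueue<Bug> backlog =
                    new PriorityQueue<>(Comparator.comparingInt(Bug::severity).reversed());

            backlog.add(new Bug("typo in footer", 1));
            backlog.add(new Bug("authentication broken", 9));
            Bug browserX = new Bug("layout broken in browser X", 3);
            backlog.add(browserX);

            // The CEO bumps the browser X bug: remove it and re-insert it with a new
            // severity, since priorities inside the queue cannot be edited in place.
            backlog.remove(browserX);
            backlog.add(new Bug("layout broken in browser X", 8));

            while (!backlog.isEmpty()) {
                System.out.println("next: " + backlog.poll().title());
            }
        }
    }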

In "Part-time Parliament" why does using the membership from decree n-3 work to add or remove members?

Section 3.3.6 of "The Part-Time Parliament" suggests that membership in the parliament (and thus the quorum for decisions) can be changed safely "by letting the membership of Parliament used in passing decree n be specified by the law as of decree n-3".
Translated into more common MultiPaxos terms, that means that the set of acceptors becomes part of the replicated state machine's state, changed by proposals to add or remove acceptors.
The quorum for slot N would be taken from the set of acceptors defined in the state when slot N-3 was decided.
Lamport offers no justification for this decision, and while his next paragraph says that changes must be handled with care and describes the ultimate failure of the Parliament, that failure happens for reasons unrelated to this particular issue.
Is this an adequate safeguard to ensure consistency? If so, what literature supports it?
I maintain a Paxos system that is a core component of several large web services. The system runs Basic Paxos, not Multi-Paxos. In that system, changes to the set of acceptors can be proposed like any other transition. The set of acceptors for a Paxos instance N is the one that was approved in instance N-1.
I am unsure if any literature supports this, but it is trivial to see that it works. Because Paxos guarantees consensus on transition N-1, it is guaranteed that hosts agree on which nodes can act as acceptors for transition N.
However, things get a little more complicated with Multi-Paxos and Raft, or any pipelined consensus algorithm. According to the Raft video lecture, this must be a two-phase approach, but I don't recall him explaining why.
On further reading of the Paxos slides for the Raft user study linked by Michael, I see that my suggestion is close, but in fact every decision needs to be made in a view that is agreed on by all participants. If we choose that view to be the one in effect at slot N-1, that limits the whole machine to lock-step: each slot can only be decided once the previous slot has been decided.
However, N-1 can be generalized to N-α, where Lamport sets α=3. As long as all participants agree on α, they agree on the view for each slot, which means that the rest of the algorithm's correctness holds.
This adds a fairly trivial amount of storage overhead, then: leaders must track the view for the most recent slot executed at the replica and the preceding α-1 slots. This is sufficient information to either determine the view for slot N (slot_views[N-α]) or know that the view is undefined (slot N-α or some previous slot is not yet decided) and thus ignore the proposal.
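For what it's worth, a hypothetical sketch of that bookkeeping (the names slotViews, viewFor and so on are mine, not from the paper or the slides); the only point is that the view for slot N is the acceptor set that was in effect once slot N-α had been decided and applied:

    import java.util.*;

    class ViewTracker {
        private final int alpha;  // Lamport uses alpha = 3
        // decided slot -> acceptor set in effect after applying that slot
        private final Map<Long, Set<String>> slotViews = new HashMap<>();

        ViewTracker(int alpha, Set<String> initialAcceptors) {
            this.alpha = alpha;
            // The first alpha slots are decided under the initial configuration.
            for (long s = -alpha; s < 0; s++) {
                slotViews.put(s, Set.copyOf(initialAcceptors));
            }
        }

        // Record the acceptor set in effect once slot n has been decided and applied
        // (it differs from the previous one only if slot n added or removed acceptors).
        void recordDecided(long n, Set<String> acceptorsAfterN) {
            slotViews.put(n, Set.copyOf(acceptorsAfterN));
        }

        // The view to use for slot n, or empty if slot n - alpha has not been decided
        // yet, in which case a proposal for slot n must wait (or be ignored).
        Optional<Set<String>> viewFor(long n) {
            return Optional.ofNullable(slotViews.get(n - alpha));
        }
    }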

Design of reputation engine

Let's imagine a social network where each user can gain reputation from others by, say, delegation. So, given that A and B each initially have a reputation of 1, when A delegates to B, A has 0 and B has 2.
Then B can delegate to C, and so on.
Also, the delegation has a scope, and scopes can be nested. So A can delegate reputation on all topics, or only on programming, or only on C#. And he can delegate on programming to B but on C# to C. That means the final reputation varies depending on the given scope.
So we get a kind of directed graph structure (probably a tree but it's not yet clear what about cycles) which we need to traverse to calculate the reputation.
I'm trying to model that with DDD principles and I'm not sure what the aggregate is here.
I suppose the delegation tree/graph is a candidate for that, as the aggregate is a unit of consistency. However, that means the aggregates would be very large. The scope thing complicates it even more, because it makes the aggregate boundary unclear. Is a delegation on C# part of the same aggregate as delegations on programming?
What about the user? As an aggregate it would have to store references (delegations) to/from other users. Again, which aggregate does a given user belong to?
A separate question is how to efficiently calculate the reputation. I guess a graph database will be more appropriate than a relational one in this case, but is that the only good answer?
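To make the calculation concrete, here is a tiny sketch of a naive traversal under the delegation rules described above (all names are invented; scope nesting and cycles are ignored, and a chained delegation is assumed to pass on the delegator's whole current balance):

    import java.util.*;

    record Delegation(String from, String to, String scope) {}

    class ReputationEngine {
        // Delegations in the order they were made.
        private final List<Delegation> delegations = new ArrayList<>();

        void delegate(String from, String to, String scope) {
            delegations.add(new Delegation(from, to, scope));
        }

        // Everyone starts with 1 point per scope; a delegation moves the delegator's
        // current balance in that scope to the delegatee (A=1, B=1; A->B gives A=0, B=2).
        int reputation(String user, String scope) {
            Map<String, Integer> balance = new HashMap<>();
            for (Delegation d : delegations) {
                if (!d.scope().equals(scope)) continue;
                int moving = balance.getOrDefault(d.from(), 1);
                balance.put(d.from(), 0);
                balance.put(d.to(), balance.getOrDefault(d.to(), 1) + moving);
            }
            return balance.getOrDefault(user, 1);
        }
    }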
A root aggregate is meant to enforce invariants. The rules of delegation you've described are one set of invariants. Not knowing what other invariants you may require, it is hard to tell what a suitable root aggregate would be, but simply going by what you've presented, "user" seems to me a perfect root aggregate to enforce all your delegation rules as invariants. A user may have one or more delegation scopes, which may themselves be root aggregates. A user can, under the rules of delegation, delegate to another user, which may in turn delegate under those same rules. This allows you to enforce all your invariants, and there is no problem storing references to (other) users under the rules of DDD.
Keep asking how you can enforce your domain specific rules consistently and you will find your root aggregates.
On your separate question: a graph database seems like a better idea than a relational database, but it's hard to tell with limited information. I suggest that you post this question separately and include your considerations about relational versus graph databases.

Networks: how to model them using aggregate roots?

A domain model defines, among other things, the relationships between entities, and we define aggregate roots to provide encapsulation and transaction boundaries. The well-known relationships are one-to-one (an entity or value object is contained within an aggregate root), one-to-many (an aggregate root contains a collection of child objects) and many-to-many. The latter are difficult, because many-to-many relationships between aggregate roots get you in trouble with the transaction boundary. Thus, in many cases, one direction of the many-to-many relationship is seen as more important, and only that direction is modeled as a one-to-many relation.
Now, take it one step further. Networks. Many-to-many relationships between equivalent partners. How can you model that without violating the transaction boundary on your aggregate roots?
Have a look at this widely-applicable example:
I have a network with nodes. Every node has a limited amount of ports. One port can only be connected to one port on another node. I have to be able to add and remove connections between nodes, using the ports.
An intuitive approach to this would be to model the nodes as aggregate roots containing ports. Connections seem to be value objects, and one port can have one connection. I could implement a Node.ConnectTo(nodeId, portId) method which adds the connection (between port X on node A and port Y on node B) to the aggregate root, node A. Preferably, I would call this method twice, once on node A and once on node B, and wrap the calls in a transaction. However, this would violate the transaction boundary, so I decide to only store it on node A.
To see the connection on node B in the application client, a separate read model would be needed. But that's no problem; the CQRS architecture provides us with these possibilities. So adding, removing and viewing connections is not a problem.
The problem arises when I want to validate whether a port is still free before I add a connection to it. The consequence of respecting our transaction boundary is that, in the write model, the fact that a port is already connected might not be known to the aggregate root at hand, but might be stored in any other aggregate root.
Of course, you could trust your client's validation, go ahead and add the connection if it's OK for the node you are adding it to, and rely on a process running consistency checks to execute compensating actions for invalid connections. But that seems like a lot of trouble compared to wrapping a transaction around two ConnectTo calls...
This made me think that maybe my aggregate roots were chosen incorrectly. And I started thinking about Nodes and Networks as aggregate roots, where a Network is a collection of Connections. The good thing about a Network aggregate is that you could always validate adding or removing connections. Except when a new connection would result in the joining of two existing networks... And your aggregate could become big, possibly ending up as one single huge network. Not feasible either.
So, how do you think this should be modeled? Do you see a solution where you respect aggregate roots as transaction boundaries, you can validate your network, and you do not risk storing your entire network as a single aggregate? Or am I asking for all three CAP properties here, and is it simply not possible?
I think your "new way" is flawed, since the View model should not produce an Exception that propagates "somehow" back to the domain model. The domain model need to resolve this by itself.
So, in this case (bind 1-to-1) you could utilize events within the domain model, so that
NodeA.connect( "port1" ).to( NodeB ).on( "port3" );
1. NodeA reserves "port1" on itself.
2. NodeA sends a "portConnectionRequest" to NodeB.
3. NodeB binds "port3" if available.
4. NodeB sends "portConnectionConfirmed" or "portConnectionDenied".
5. NodeA receives the event and acts accordingly.
The above assumes reliable messaging, which is easily achieved within the JVM, but much harder in a distributed environment, and yet that is where you need it most. If a reliable messaging system cannot be provided, I think you will have a Byzantine agreement problem at hand, or a subset of it.
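A condensed, in-process sketch of that request/confirm exchange (direct method calls stand in for the reliable messaging; the event and method names mirror the steps above but are otherwise invented):

    import java.util.*;

    class Node {
        final String id;
        // port -> "remoteNodeId:remotePort", or "RESERVED" while awaiting confirmation
        private final Map<String, String> portBindings = new HashMap<>();

        Node(String id) { this.id = id; }

        // Steps 1 and 2: reserve the local port, then ask the remote node for its port.
        void connect(String localPort, Node remote, String remotePort) {
            if (portBindings.containsKey(localPort)) {
                throw new IllegalStateException(localPort + " is already in use on " + id);
            }
            portBindings.put(localPort, "RESERVED");
            remote.onPortConnectionRequest(this, localPort, remotePort);
        }

        // Steps 3 and 4: bind the port if it is free, then confirm or deny.
        void onPortConnectionRequest(Node requester, String requesterPort, String localPort) {
            if (portBindings.containsKey(localPort)) {
                requester.onPortConnectionDenied(requesterPort);
            } else {
                portBindings.put(localPort, requester.id + ":" + requesterPort);
                requester.onPortConnectionConfirmed(requesterPort, this, localPort);
            }
        }

        // Step 5: act on the outcome.
        void onPortConnectionConfirmed(String localPort, Node remote, String remotePort) {
            portBindings.put(localPort, remote.id + ":" + remotePort);
        }

        void onPortConnectionDenied(String localPort) {
            portBindings.remove(localPort);  // release the reservation
        }
    }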
Ok, I read and thought some more about it and I guess this is the 'right' way to do it:
Before executing the ConnectTo method on node A, you validate whether the port on node B is still free, using an eventually consistent view model as your data source (not the domain model, which cannot validate this efficiently; see above).
ConnectTo is run only on node A, so no transaction boundary is violated.
If the view model is not able to connect the port on node B because it is already in use, a true concurrency exception has happened and it must be signaled. Some action needs to be taken (either manual intervention or an automated process must pick it up). The probability of this concurrency exception will usually be very low.
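For illustration, the whole flow could look roughly like the application service below (all type and method names here are hypothetical, not from any framework):

    // 1. Validate against the eventually consistent read model.
    // 2. Modify only node A's aggregate: one aggregate, one transaction.
    // 3. Leave stale-read conflicts to a later consistency check / compensating action.
    class ConnectPortsHandler {
        private final PortReadModel readModel;   // eventually consistent projection
        private final NodeRepository nodes;

        ConnectPortsHandler(PortReadModel readModel, NodeRepository nodes) {
            this.readModel = readModel;
            this.nodes = nodes;
        }

        void handle(String nodeA, String portA, String nodeB, String portB) {
            if (!readModel.isFree(nodeB, portB)) {
                throw new PortAlreadyConnectedException(nodeB + ":" + portB);
            }
            NodeAggregate a = nodes.load(nodeA);
            a.connectTo(portA, nodeB, portB);   // only node A is written
            nodes.save(a);
        }
    }

    interface PortReadModel { boolean isFree(String nodeId, String portId); }
    interface NodeRepository { NodeAggregate load(String nodeId); void save(NodeAggregate node); }
    interface NodeAggregate { void connectTo(String localPort, String remoteNodeId, String remotePort); }

    class PortAlreadyConnectedException extends RuntimeException {
        PortAlreadyConnectedException(String port) { super(port + " is already connected"); }
    }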

Strong Semaphore Queuing Discipline and Starvation

In William Stallings' book on operating systems, he defines a strong semaphore as one that has a FIFO queuing discipline, and a weak semaphore as one that is unordered. Surely there are other queuing disciplines for strong semaphores, such as by priority? Or would that no longer be a strong semaphore, since starvation would become possible? (Stallings says that strong semaphores do not allow starvation.) Is the primary distinction between strong and weak ordered/unordered, or starvation possible/impossible?
Yes, one non-FIFO non-starving possibility (among many) is to select the next process in a round-robin manner. For example, if the order is 1, 2, 3, 4, and while 1 is holding the semaphore, 4 and then 3 request it, then the next process up is 3. No process P starves because, after each request of P, each other process has at most one critical section before P's request is granted.
Definitions of "strong semaphore" in the first pages of hits from Google are split between "no starvation" and "FIFO". Which one is "right" is a matter of taste – given this mess (and the general overuse of strong as an adjective in mathematical writing), I'd probably use neither.
When it comes to the literature on semaphores, I have never seen (with my limited knowledge) anyone using FIFO or some other form of ordering as the criterion for the weak/strong categorization. In fact, starvation-freedom is not always the criterion either. The early literature (viz. Morris ('79), Martin and J. R. Burch ('85), Udding ('86), Friedberg and Peterson ('87), and Haldar and Subramanian ('88)) used certain characteristics of the 'P' and 'V' operations to define a weak semaphore. Interestingly, all of the definitions from the cited researchers eventually imply the possible presence of starvation in the case of weak semaphores.
Further, although FIFO guarantees starvation-freedom, referring to FIFO or some other form of ordering, in my opinion, restricts the behavior of the semaphore. One form of restriction is that a FIFO ordering would imply that the semaphore has some kind of buffer attached to it in order to keep track of the processes/threads blocked on the 'P' operation; for hardware implementations of semaphores, this might be too restrictive. Another is that, instead of treating all ordering schemes with the same overtaking bound k (i.e. no process is overtaken more than k times) as equivalent, one would have to consider each scheme as a different kind of semaphore. Thus, my personal opinion is to define a weak semaphore as one that does NOT guarantee starvation-freedom (but does guarantee deadlock-freedom). However, if you're deeper into research-grade activities, then by all means feel free to use more mathematical and/or fine-grained definitions as you prefer.
I think whether a priority queue causes starvation depends on how the priorities are assigned: it's just a regular queue, except that the next element served is the one with the highest priority. So if you implement the priorities with FIFO logic (the first in has the highest priority) there will be no starvation; otherwise starvation can occur.
