In order to convince oneself that the complications of standard algorithms such as Paxos and Raft are necessary, one must understand why simpler solutions aren't satisfactory. Suppose that, in order to reach consensus on a stream of events in a cluster of N machines (i.e., implement a replicated, ever-growing log), the following algorithm is proposed:
Whenever a machine wants to append a message to the log, it broadcasts the tuple (msg, rnd, prev), where msg is the message, rnd is a random number, and prev is the ID of the last message on the log.
When a machine receives a tuple, it inserts msg as a child of prev, forming a tree.
If a node has more than one child, only the one with highest rnd is considered valid; the path of valid messages through the tree is the main chain.
If a message is part of the main chain, and it is old enough, it is considered decided/final.
If a machine attempts to submit a message and, after some time, it isn't present on the main chain, that means another machine broadcast a message at roughly the same time, so the machine re-broadcasts until its message appears there.
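For concreteness, here is a minimal sketch of the proposed scheme in Python (the names, and the assumption that each message carries its own ID, are mine, purely for illustration):

    import time

    class LogTree:
        """One machine's view: messages form a tree keyed by their prev
        pointer; at each fork, the child with the highest rnd wins."""

        def __init__(self):
            self.nodes = {}                 # msg_id -> (msg, rnd, arrival_time)
            self.children = {"ROOT": []}    # msg_id -> list of child IDs

        def receive(self, msg_id, msg, rnd, prev):
            # Insert msg as a child of prev, forming a tree.
            self.nodes[msg_id] = (msg, rnd, time.time())
            self.children.setdefault(prev, []).append(msg_id)
            self.children.setdefault(msg_id, [])

        def main_chain(self):
            # Follow the path of valid children: at each fork, only the
            # child with the highest rnd is considered valid.
            chain, cur = [], "ROOT"
            while self.children.get(cur):
                cur = max(self.children[cur], key=lambda m: self.nodes[m][1])
                chain.append(cur)
            return chain

        def decided(self, min_age=10.0):
            # A message is decided/final if it is on the main chain
            # and old enough.
            now = time.time()
            return [m for m in self.main_chain()
                    if now - self.nodes[m][2] >= min_age]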
Looks simple, efficient and resilient to crashes. Would this algorithm work?
I think you have a problem if a machine sends two tuples in sequence and the first gets lost (packet loss/corruption or whatever).
In that case, let's say machine 1 has a prev element ID of 10 and sends two more messages, (msg, rnd, 10) = 11 and (msg, rnd, 11) = 12, to machine 2.
Machine 2 only receives (msg, rnd, 11) but does not have a message with ID 11 in its tree.
Machine 3 receives both, so it inserts them into its tree.
At this point the distributed trees have diverged.
I propose that each machine x send an ack back to the sender once the message is inserted into its tree, with the sender waiting for the acks before sending the next message.
The sender then resends the previous message to any machine that failed to ack within a given timeframe.
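Roughly what I mean, as a sketch (send and poll_acks are placeholders for whatever transport you use):

    import time

    ACK_TIMEOUT = 1.0   # seconds to wait for acks before resending (tunable)

    def send_reliably(peers, message, send, poll_acks):
        # Keep resending `message` until every peer has acked it.
        # send(peer, message) and poll_acks() stand in for the real
        # transport; poll_acks() returns the peers that have acked so far.
        pending = set(peers)
        while pending:
            for peer in pending:
                send(peer, message)
            time.sleep(ACK_TIMEOUT)          # wait out the timeframe
            pending -= set(poll_acks())      # drop whoever acked in time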
I am working on a system where I need to select a leader (out of n nodes) randomly. The leader would change after each round (after the current leader has finished its task). All the nodes would be communicating with each other.
A re-election would take place in two conditions:
The round is finished.
The leader dies prematurely.
Are there any implementations of this idea for reference? Is doing so a good idea? Why? Should this situation be approached differently?
As far as I have understood your question, you need to select a different leader from your nodes every time. To do this, you can put all the nodes in a queue, generate a random number from 0 to the queue length minus one, and name the node at that index as the leader. When it dies or finishes its work, remove that node from the queue and re-elect a leader by the same process; the queue is now one node shorter.
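A literal sketch of that process (node names are just examples):

    import random

    queue = ["node-0", "node-1", "node-2", "node-3"]   # example node IDs

    while queue:
        # Random index from 0 to len(queue) - 1; that node is the leader.
        leader = queue[random.randrange(len(queue))]
        print(leader, "leads this round")
        # ... the leader performs its round here ...
        queue.remove(leader)   # finished (or died): drop it and re-elect

One caveat this doesn't cover: in a distributed setting every node has to arrive at the same pick, for example by seeding the random generator with the round number or by having the outgoing leader announce its successor.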
Hope I have understood the question correctly.
I am using the following pdf as reference.
It says that lastApplied is the highest log entry applied to state machine, but how is that any different than the commitIndex?
Also is the matchIndex on leader just the commitIndex on followers? If not what is the difference?
Your observation is reasonable: most of the time, nextIndex equals matchIndex + 1, but it is not always the case.
For example, when a leader is first elected, matchIndex is initialized to 0, while nextIndex is initialized to the last log index + 1.
The difference is that these two fields serve different purposes. matchIndex is an accurate value indicating the index up to which the log entries of the leader and the follower are known to match. nextIndex, however, is only an optimistic "guess" at the index the leader should try in its next AppendEntries call. It can be a good guess (i.e., it equals matchIndex + 1), in which case the AppendEntries will succeed, but it can also be a bad guess (e.g., right after a leader is elected), in which case the AppendEntries will fail, and the leader will decrement nextIndex and retry.
As for lastApplied, it's simply another accurate value, indicating the index up to which the log entries of a follower have been applied to the underlying state machine. It's similar to matchIndex in that both are accurate values rather than heuristic guesses, but they mean different things and serve different purposes.
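A sketch of that bookkeeping on the leader's side (the field names follow the Raft paper; everything else is illustrative):

    class LeaderState:
        # Per-follower bookkeeping kept by a Raft leader.

        def __init__(self, last_log_index, follower_ids):
            # Optimistic guess: assume every follower is fully up to date.
            self.next_index = {f: last_log_index + 1 for f in follower_ids}
            # Accurate knowledge: nothing is known to match yet.
            self.match_index = {f: 0 for f in follower_ids}

        def on_append_entries_reply(self, follower, ok, sent_up_to):
            if ok:
                # The guess was right: record the accurate match point.
                self.match_index[follower] = sent_up_to
                self.next_index[follower] = sent_up_to + 1
            else:
                # The guess was wrong: back off one entry and retry.
                self.next_index[follower] = max(1, self.next_index[follower] - 1)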
... lastApplied is the highest log entry applied to state machine, but how is that any different than the commitIndex?
These are different in a practical system because the component that commits the data in the log is typically separate from the component that applies it to replicated state machine or database. The commitIndex is typically just nanoseconds or maybe a few milliseconds more up-to-date than lastApplied.
Is the matchIndex on leader just the commitIndex on followers? If not what is the difference?
They are different. There is a period of time when the data is on a server and not yet committed, such as during the replication itself.
The leader keeps track of the latest un-committed data on each of its peers and only needs to send log[matchIndex[peer], ...] to each peer instead of the whole log. This is especially useful if the peer is significantly behind the leader, because the leader can update the peer with a series of small AppendEntries calls, incrementally bringing it up to date.
Committed does not mean already applied; there is a time difference between them, but the applied index eventually catches up to the commit index.
matchIndex[i], which is stored on the leader, equals follower i's commitIndex, and both try to catch up to nextIndex.
I have an abstract question.
I need a service with fault tolerance. The service can only be running on one node at a time. This is the key requirement.
With two connected nodes: A and B.
If A is running the service, B must be waiting.
If A is turned off, B should detect this and start the service.
If A is turned on again, A should wait and not run the service.
And so on (if B is turned off, A starts; if A is turned off, B starts).
I have thought about a heartbeat protocol to sync the status of the nodes and detect timeouts; however, there are a lot of race conditions.
I could add a third node with a global lock, but I'm not sure how to do this.
Does anybody know a well-known algorithm for this? Or better, is there any open source software that lets me control this kind of thing?
Thanks
If you can provide some sort of a shared memory between nodes, then there is the classical algorithm that resolves this problem, called Peterson's algorithm.
It is based on two additional variables, called flag and turn. turn is an integer variable whose value is the index of the node that is allowed to be active at the moment. For example, turn = 1 indicates that node 1 has the right to be active and the other node should wait; in other words, it is its turn to be active, which is where the name comes from.
flag is a boolean array where flag[i] indicates that the i-th node declares itself ready for service. In your setup, flag[i] = false means the i-th node is down. The key part of the algorithm is that a node which is ready for service (i.e., flag[i] = true) has to wait until it obtains the turn.
The algorithm was originally developed to solve the problem of executing a critical section without conflict. In your case, the critical section is simply running the service. You just have to ensure that before the i-th node is turned off, it sets flag[i] to false. This is definitely the tricky part, because if a node crashes it obviously cannot set any value; I would handle that with some sort of heartbeat.
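Here is a sketch of the two-node version in Python, with threads standing in for the two nodes (across real machines you would need genuinely shared state, and the busy-wait would be replaced by the heartbeat):

    import threading

    flag = [False, False]   # flag[i]: node i declares itself ready to run
    turn = 0                # index of the node whose turn it is

    def node(i):
        # Peterson's entry/exit protocol around the critical section
        # (here, running the service).
        global turn
        other = 1 - i
        flag[i] = True
        turn = other                         # let the other node go first
        while flag[other] and turn == other:
            pass                             # busy-wait for our turn
        print("node", i, "is running the service")   # critical section
        flag[i] = False                      # done: declare ourselves inactive

    threads = [threading.Thread(target=node, args=(i,)) for i in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()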
Regarding open source software that solves similar problems, try searching for "cluster failover". Read about Paxos and the Google File System. There are plenty of solutions, but if you want to implement something yourself, I would try Peterson's algorithm.
I have two programs (host and slave) communicating over a com port. In the simplest case, the host sends a command to the slave and waits for a response, then does it again. But this means that each side has to wait for the other for every transaction. So I use a queue so the second command can be sent before the first response comes back. This keeps things flowing faster.
But I need a way of metering the use of the queue so that there are never more than N command/response pairs en route at any time. So for example if N is 3, I will wait to send the fourth command until I get the first response back, etc. And it must keep track of which response goes with which command.
One thought I had is to tag each command with an integer modulo counter which is also returned with the response. This would ensure that the command and response are always paired correctly and I can do a modulo compare to be able to meter the commands always N ahead of the responses.
What I am wondering is: is there a better way? Isn't this a somewhat common thing to do?
(I am using Python, but that is not important.)
Using a sequence number and modulo arithmetic is in fact quite a common way to both acknowledge messages received and tell the sender when it can send more messages - see e.g. http://en.wikipedia.org/wiki/Sliding_window_protocol. Unfortunately for you, the obvious example, TCP, is unusual in that it uses a sequence number based on byte counts, not message counts, but the principle is much the same - TCP just has an extra degree of flexibility.
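For concreteness, a minimal sketch of the tagging scheme from the question, with N = 3 (all names are illustrative, and the send callback stands in for the real com-port write):

    import itertools

    WINDOW = 3            # at most N = 3 command/response pairs in flight
    MODULO = 2 * WINDOW   # sequence space larger than the window, so an
                          # old tag can never be confused with a new one

    class Host:
        def __init__(self, send):
            self.send = send
            self.counter = itertools.count()
            self.in_flight = {}            # tag -> command awaiting response

        def submit(self, command):
            if len(self.in_flight) >= WINDOW:
                raise BlockingIOError("window full; wait for a response")
            tag = next(self.counter) % MODULO
            self.in_flight[tag] = command
            self.send((tag, command))

        def on_response(self, tag, response):
            # Pair the response with its command via the echoed tag.
            command = self.in_flight.pop(tag)
            return command, response

Keeping the sequence space larger than the window is the same consideration sliding-window protocols make.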
I am using the Birman-Schiper-Stephenson (BSS) protocol in a distributed system, under the assumption that the peer set of any node doesn't change. As the protocol dictates, messages that arrive at a node out of causal order have to be put in a 'delay queue'. My problem is with the organisation of the delay queue, where we must impose some kind of order on the messages. After deciding the order, we will have to implement a 'wake-up' procedure that efficiently searches the queue, whenever the node's timestamp changes, to find out whether one of the delayed messages can be 'woken up' and accepted.
I was thinking of segregating the delayed messages into bins based on the points of difference of their vector-timestamps with the timestamp of this node. But the number of bins can be very large and maintaining them won't be efficient.
Please suggest some designs for such a queue.
Sorry about the delay -- didn't see your question until now. Anyhow, if you look at Isis2.codeplex.com you'll see that in Isis2, I have a causalsend implementation that employs the same vector timestamp scheme we described in the BSS paper. What I do is to keep my messages in a partial order, sorted by VT, and then when a delivery occurs I can look at the delayed queue and deliver off the front of the queue until I find something that isn't deliverable. Everything behind it will be undeliverable too.
But in fact there is a deeper insight here: you actually never want to allow the queue of delayed messages to get very long. If the queue gets longer than a few messages (say, 50 or 100) you run into the problem that the guy with the queue could be holding quite a few bytes of data and may start paging or otherwise running slowly. So it becomes a self-perpetuating cycle in which because he has a queue, he is very likely to be dropping messages and hence enqueuing more and more. Plus in any case from his point of view, the urgent thing is to recover that missed message that caused the others to be out of order.
What this adds up to is that you need a flow control scheme in which the amount of pending asynchronous stuff is kept small. But once you know the queue is small, searching every single element won't be very costly! So this deeper perspective says flow control is needed no matter what, and then because of flow control (if you have a flow control scheme that works) the queue is small, and because the queue is small, the search won't be costly!
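To make that concrete, here is a sketch of the BSS delivery test plus a wake-up that simply scans the whole queue; a plain scan is a simpler structure than the VT-sorted queue described above, and it is cheap precisely because flow control keeps the queue short (all names are mine):

    def deliverable(vt_msg, vt_local, sender):
        # BSS delivery rule: the message is the next one expected from
        # `sender`, and everything it causally depends on from other
        # senders has already been delivered here.
        if vt_msg[sender] != vt_local[sender] + 1:
            return False
        return all(vt_msg[k] <= vt_local[k]
                   for k in range(len(vt_msg)) if k != sender)

    class DelayQueue:
        # A plain list: with flow control keeping the queue short, a full
        # scan on every wake-up is cheap.

        def __init__(self):
            self.pending = []   # (vt_msg, sender, msg) tuples

        def push(self, vt_msg, sender, msg):
            self.pending.append((vt_msg, sender, msg))

        def wake_up(self, vt_local, deliver):
            # Delivering one message can unblock others, so repeat until
            # a full pass finds nothing deliverable.
            progress = True
            while progress:
                progress = False
                for entry in list(self.pending):
                    vt_msg, sender, msg = entry
                    if deliverable(vt_msg, vt_local, sender):
                        self.pending.remove(entry)
                        deliver(msg)
                        vt_local[sender] += 1
                        progress = True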