The point of Raft is that all of the participants that are still working agree on what the state of the system is (or at least they should do once they have time to find out what the total consensus is). This means that they all agree on what messages have been received, and in what order. This also means that they must all get the same answer when they compute the consequences of receiving those messages. So the messages must be processed sequentially, or if they are processed in parallel, the participants have to use transactions and locking and so on so that the effect is as if the messages were processed sequentially. Under load, responses can be delayed, or some other sort of back-pressure used to get the senders to slow down, but you can't just drop messages because you are too busy, unless you do it in a way that ensures that all participants take the same decisions about this.

Most Raft implementations use pipelining, so several log entries can be in flight from the leader to a follower at once.
But the leader responds to a client write request with success only after it has received an ACK from a quorum of followers for a log offset equal to or greater than the offset at which that client request was written.
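A minimal sketch of that acknowledgement rule follows. The function and parameter names are illustrative assumptions, not taken from any particular Raft implementation, and the sketch leaves out the current-term check that the Raft paper also requires before committing.

```python
def quorum_commit_index(leader_last_index, follower_match_index):
    """Return the highest log offset stored on a majority of the cluster.

    leader_last_index: last offset in the leader's own log.
    follower_match_index: highest offset each follower has acknowledged.
    """
    # The leader counts itself as one of the replicas.
    acked = sorted([leader_last_index] + list(follower_match_index), reverse=True)
    quorum = len(acked) // 2 + 1
    # The quorum-th largest value is held by at least `quorum` nodes.
    return acked[quorum - 1]


def can_ack_client(request_index, leader_last_index, follower_match_index):
    """A client write is acknowledged once the quorum offset reaches it."""
    return quorum_commit_index(leader_last_index, follower_match_index) >= request_index
```

With pipelining, a single ACK for a later offset can release several waiting client requests at once, because every request at or below the quorum offset becomes acknowledgeable.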


ZooKeeper Time Synchronization

How does ZooKeeper deal with time?
Are the znodes/clients synchronized, and if so, how?
Otherwise, how does the algorithm work without time synchronization?
I see a related question here, but it does not answer my question:
How does Zookeeper synchronize the clock in the cluster
Thanks in Advance
As you may have heard, ZooKeeper elects a leader for the ensemble. Thereafter, all write requests go through the leader.
Therefore, the leader is responsible for preserving the order of write requests. (Yes, the order is determined by the time at which a request reaches the leader.) Since all write requests are served by the leader, there is no need to worry about clock synchronization: ZooKeeper doesn't depend on synchronized clocks.
How the leader transmits new values to the followers is a separate problem, which is addressed by a consensus algorithm, ZAB (ZooKeeper Atomic Broadcast). This protocol makes sure that a majority of the ensemble has accepted the new value before an OK response is sent for the write request.
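To illustrate why no clock is needed, here is a sketch of a leader that orders writes with a counter. This is not ZooKeeper's code; the class and method names are hypothetical. ZooKeeper's transaction ids (zxids) follow a similar idea by combining a leader epoch with a counter.

```python
class Leader:
    """Orders writes by arrival at the leader, using a counter instead of a clock."""

    def __init__(self, epoch):
        self.epoch = epoch      # increases with each newly elected leader
        self.counter = 0        # restarts for each epoch
        self.log = []

    def submit(self, write):
        """Assign the next transaction id to a write and record it."""
        self.counter += 1
        txn_id = (self.epoch, self.counter)  # tuples compare lexicographically
        self.log.append((txn_id, write))
        return txn_id
```

Any two writes can be ordered by comparing their ids, and ids issued by a later leader always sort after those of an earlier one, regardless of what the machines' wall clocks say.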

How do I practically use Raft algorithm

In the Raft paper, they mentioned that all the client interaction happens with the leader node. What I don't understand is that the leader keeps changing. So let's say my cluster is behind a load balancer. How do I notify the load balancer that the leader has changed? Or if I'm correct, is it that load balancer can send out client request to any of the node (follower or leader) and it is the responsibility of the follower node to send the request to the leader?
After the voting finishes, you will have a leader (new or old). It is the responsibility of the leader to send heartbeats to all the nodes at a regular interval (smaller than the keep-alive time of the network but bigger than the maximum round-trip time).
Your load balancer should update its record of the leader every time it gets a heartbeat. The load balancer will send data only to the leader because, according to the Raft algorithm, all client requests go directly to the leader; other nodes can't send data, only acknowledgments of vote and append commands.
There is a really good presentation on this here: Raft: Log-Replication
There are really two ways this can be done: either the load balancer needs to understand where the leader is or the followers can proxy requests to the leader.
There's nothing wrong with proxying client requests to the leader through a follower, and in fact there can be major benefits to it. Many Raft implementations allow clients to read from followers while maintaining sequential consistency. This can still be safely done with a load balancer sending requests to arbitrary nodes if the client keeps track of the last index it has seen and sends that with each request to ensure it does not see state go back in time. I won't write the full algorithm here, but this is described in the Raft dissertation which you should consult.
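The index-tracking idea can be sketched as follows. This is not the full algorithm from the Raft dissertation; the `Follower` and `Client` classes and their fields are hypothetical names used only for illustration.

```python
class Follower:
    def __init__(self):
        self.applied_index = 0   # index of the last log entry applied locally
        self.state = {}

    def apply(self, index, key, value):
        self.state[key] = value
        self.applied_index = index

    def read(self, key, client_last_index):
        # Refuse to serve state older than what the client has already seen.
        if self.applied_index < client_last_index:
            return None          # caller should wait, retry, or try another node
        return self.applied_index, self.state.get(key)


class Client:
    def __init__(self):
        self.last_index = 0      # highest index this client has observed

    def read(self, node, key):
        result = node.read(key, self.last_index)
        if result is None:
            raise RuntimeError("node is behind this client; retry elsewhere")
        self.last_index, value = result
        return value
```

Because the client carries `last_index` with every request, it does not matter which node the load balancer picks: a node that has fallen behind rejects the read instead of returning older state.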
But using a load balancer in this manner can become unsafe in certain cases. If clients are allowed to send multiple concurrent requests, the load balancer could route those requests through different nodes and they could arrive at the leader out of order. This can be accounted for by attaching a sequence number to client requests and reordering requests on the leader. But to do so, the implementation has to include sessions to allow the leader to track per-client state.
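A sketch of the per-client session state the leader would need for this reordering is shown below. The class name and the convention that sequence numbers start at 1 are assumptions made for the example.

```python
class SessionReorderBuffer:
    """Per-client session state held by the leader."""

    def __init__(self):
        self.next_seq = 1     # next sequence number to apply
        self.pending = {}     # out-of-order requests, keyed by sequence number

    def receive(self, seq, command):
        """Buffer a request and return the commands that are now ready, in order."""
        if seq < self.next_seq:
            return []         # duplicate of a request that was already applied
        self.pending[seq] = command
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

If request 2 arrives before request 1, it is held back; once request 1 arrives, both are released in the order the client issued them.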
Typically, though, Raft clients connect to specific nodes and stay connected to them for as long as possible to reduce the overhead of maintaining consistency while switching servers. If an implementation supports reading from followers, it can still be costly to switch servers since servers have to wait for state to catch up to maintain sequential consistency.

Raft Algorithm Normal Operations

I have read the Raft paper and have a question about the sequence of operations Raft executes upon receiving a client request.
To avoid a single point of failure, Raft maintains a replicated log on other machines, and a consensus module manages that log. The sequence of operations works as follows:
1) The client request is received at the leader's state machine, and the leader appends the command to its log.
2) The leader sends AppendEntries RPCs to its followers to copy the command into their local logs, and waits for an acknowledgment from a majority of the followers that the entry has been successfully appended to their local log files.
3) Once the leader has received acknowledgment that the request has been successfully logged in a majority of the followers' logs, the request is committed to the leader's state machine, causing a transition, and the output of that transition is returned to the client.
4) Ultimately, the leader notifies followers of committed entries in subsequent AppendEntries RPCs.
If above understanding is correct, then I can claim that the client request is being held for a bit of time for the replication process to complete. I may also claim that the success of a client request is heavily dependent on the success of the replication process (since the client command / request is not executed on the leader's machine until a majority acknowledgment has been received). The question is, how long it is expected to take on average for a client request to receive a response after the replication procedure is completed, also does that work efficiently for real-time systems?
suggests that systems such as Raft requesting the Consistency and Availability parts of the CAP theorem's trinity will suffer performance limits. You may also be interested in (A review of experiences with reliable multicast, by Birman), which describes experience with reliable multicast groups in high assurance systems such as air traffic control.
My takeaway from this is that a real system may want to be very careful about what information it guards with Raft, Paxos, and friends, and what it can guard less tightly. The other point of view is to go for a very sophisticated implementation of Paxos, such as Google Spanner, so that programmers don't have to worry about the problems of non-ACID systems.
If above understanding is correct, then I can claim that the client request is being held for a bit of time for the replication process to complete
Correct, the leader of the current term will acknowledge a client request only after the command has been replicated to a majority of nodes in the cluster.
I may also claim that the success of a client request is heavily dependent on the success of the replication process
That's also correct. At least a majority of nodes in the cluster (including the leader) need to be available and responsive in order for the command to be replicated successfully and for the leader to acknowledge the request.
how long it is expected to take on average for a client request to receive a response after the replication procedure is completed
That depends on the topology of your network. The latency of the response to a client request will be composed of the following parts (assuming no leader crashes):
* the latency required for the client request to be transmitted between the client and the leader.
* the latency of an AppendEntries request from the leader to followers to replicate the entry (sent in parallel to all the followers).
* the latency of an AppendEntries response from the followers to the leader.
* the time required by the leader to apply the command to its state machine (i.e. a disk write in the best case)
* the latency of the client response to be transmitted from the leader to the client
The latency of the various messages depends on the distance between nodes, but it will probably be in the order of tenths to hundreds of milliseconds.
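As a rough worked example, the components listed above can be added up as follows. The function name and the sample numbers in the usage note are made up for illustration; the sketch assumes no leader crash and treats each follower's AppendEntries request and response as one round-trip figure.

```python
def estimate_write_latency_ms(client_to_leader, follower_round_trips,
                              apply_time, leader_to_client):
    """Sum the latency components for one client write.

    follower_round_trips: AppendEntries request + response time per follower.
    """
    cluster_size = len(follower_round_trips) + 1   # followers plus the leader
    quorum = cluster_size // 2 + 1
    acks_needed = quorum - 1                       # the leader already has the entry
    # Requests go out in parallel, so the leader waits only for the
    # `acks_needed`-th fastest follower, not for all of them.
    if acks_needed:
        replication = sorted(follower_round_trips)[acks_needed - 1]
    else:
        replication = 0
    return client_to_leader + replication + apply_time + leader_to_client
```

For a five-node cluster with follower round trips of 10, 40, 80 and 200 ms, 5 ms each way between client and leader, and 2 ms to apply the command, `estimate_write_latency_ms(5, [10, 40, 80, 200], 2, 5)` gives 52 ms: the two slowest followers do not affect the result because only two follower acknowledgments are needed for a majority.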
also does that work efficiently for real-time systems?
It depends on your requirements for your specific case. But in general, real-time systems require latencies that are under a few milliseconds, so the answer is most likely no. Also, keep in mind that during periods of crashes and instability where new leader elections happen, latency can increase significantly.

ZeroMQ pattern for load balancing work across workers based on idleness

I have a single producer and n workers that I only want to give work to when they're not already processing a unit of work and I'm struggling to find a good zeroMQ pattern.
The producer is the requestor and creates a connection to each worker. It tracks which worker is busy and round-robins to idle workers
How to be notified of responses and still able to send new work to idle workers without dedicating a thread in the producer to each worker?
Producer pushes into one socket that all workers feed off, and workers push into another socket that the producer listens to.
Has no concept of worker idleness, i.e. work gets stuck behind long units of work
Non-starter, since there is no way to make sure work doesn't get lost
4) Reverse REQ/REP
Each worker is the REQ end and requests work from the producer and then sends another request when it completes the work
Producer has to block on a request for work until there is work (since each recv has to be paired with a send). This prevents workers from responding with work completion
Could be fixed with a separate completion channel, but the producer still needs some polling mechanism to detect new work and stay on the same thread.
5) PAIR per worker
Each worker has its own PAIR connection allowing independent sending of work and receipt of results
Same problem as REQ/REP with requiring a thread per worker
As much as zeroMQ is non-blocking/async under the hood, I cannot find a pattern that allows my code to be asynchronous as well, rather than blocking in many many dedicated threads or polling spin-loops in fewer. Is this just not a good use case for zeroMQ?
Your problem is solved with the Load Balancing Pattern in the ZMQ Guide. It's all about flow control whilst also being able to send and receive messages. The producer will only send work requests to idle workers, whilst the workers are able to send and receive other messages at all times, e.g. abort, shutdown, etc.
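The bookkeeping behind that pattern can be sketched in plain Python. No ZeroMQ calls are shown here; the class and method names are hypothetical, and in a real system `worker_ready` would be triggered by a message from the worker and each returned assignment would be sent over a socket.

```python
from collections import deque


class IdleWorkerDispatcher:
    """Hands out work only to workers that have announced they are idle."""

    def __init__(self):
        self.idle = deque()      # workers waiting for work, oldest first
        self.backlog = deque()   # work waiting for an idle worker

    def worker_ready(self, worker_id):
        """Called when a worker connects or finishes a unit of work."""
        self.idle.append(worker_id)
        return self._dispatch()

    def submit(self, work):
        """Called by the producer when new work arrives."""
        self.backlog.append(work)
        return self._dispatch()

    def _dispatch(self):
        # Pair idle workers with queued work until one side runs out.
        assignments = []
        while self.idle and self.backlog:
            assignments.append((self.idle.popleft(), self.backlog.popleft()))
        return assignments
```

Since both events go through the same dispatcher, a single thread that polls for "worker ready" messages and new work can drive it, which avoids the thread-per-worker problem from the question.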
Push/Pull is your answer.
When you send a message in ZeroMQ, all that happens initially is that it sits in a queue waiting to be delivered to the destination(s). When it has been successfully transferred it is removed from the queue. The queue is limited in length, but the limit can be set by changing a socket's high water mark.
There are one or more background threads that manage all this on your behalf, and your calls to the ZeroMQ API simply issue instructions to those threads. The threads at either end of a socket connection collaborate to marshal the transfer of messages, i.e. a sender won't send a message unless the recipient can receive it.
Consider what this means in a push/pull set up. Suppose one of your pull workers is falling behind. It won't then be accepting messages. That means that messages being sent to it start piling up until the high water mark is reached. ZeroMQ will no longer send messages to that pull worker. In fact AFAIK in ZeroMQ, a pull worker whose queue is more full than those of its peers will receive fewer messages, so the workload is evened out across all workers.
So What Does That Mean?
Just send the messages. Let 0MQ sort it out for you.
Whilst there's no explicit flag saying 'already busy', if messages can be sent at all then that means that some pull worker somewhere is able to receive it solely because it has kept up with the workload. It will therefore be best placed to process new messages.
There are limitations. If all the workers are full up then no messages are sent and you get blocked in the push when it tries to send another message. You can discover this only (it seems) by timing how long the zmq_send() took.
Don't Forget the Network
There's also the matter of network bandwidth to consider. Messages queued in the push will transfer at the rate at which they're consumed by the recipients, or at the speed of the network (whichever is slower). If your network is fundamentally too slow, then it's the Wrong Network for the job.
Of course, messages piling up in buffers represents latency. This can be restricted by setting the high water mark to be quite low.
This won't cure a high latency problem, but it will allow you to find out that you have one. If you have an inadequate number of pull workers, a low high water mark will result in message sending failing/blocking sooner.
Actually I think in ZeroMQ it blocks for push/pull; you'd have to measure elapsed time in the call to zmq_send() to discover whether things had got bottled up.
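As an analogy only, the effect of a low high water mark can be imitated with Python's standard `queue.Queue`, which is not ZeroMQ and makes no claim about the ZeroMQ API. The bounded queue stands in for the socket's outbound buffer, and `try_send` is a hypothetical helper that attempts a send without blocking.

```python
import queue


def try_send(outbound, message):
    """Attempt a non-blocking send; return False if the buffer is full."""
    try:
        outbound.put_nowait(message)
        return True
    except queue.Full:
        return False


outbound = queue.Queue(maxsize=2)        # stands in for a low high water mark
results = [try_send(outbound, n) for n in range(4)]
# results is [True, True, False, False]: with no consumer draining the
# buffer, the third send already reports that things are bottled up.
```

The smaller the buffer, the sooner the sender learns that the consumers are not keeping up, which is the trade-off described above.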
Thought about Nanomsg?
Nanomsg is a reboot of ZeroMQ; one of the same guys is involved. There are many things I prefer about it, and ultimately I think it will replace ZeroMQ. It has some fancier patterns which are more universally usable (PAIR works on all transports, unlike in ZeroMQ). Also, the patterns are essentially a pluggable component in the source code, so it is far simpler for patterns to be developed and integrated than in ZeroMQ. There is a discussion on the differences here
Philosophical Discussion
Actor Model
ZeroMQ is definitely in the realms of Actor Model programming. Messages get stuffed into queues / channels / sockets, and at some undetermined point in time later they emerge at the recipient end to be processed.
The danger of this type of architecture is that it is possible to have the potential for deadlock without knowing it.
Suppose you have a system where messages pass both ways down a chain of processes, say instructions in one way and results in the other. It is possible that one of the processes will be trying to send a message whilst the recipient is actually also trying to send a message back to it.
That only works so long as the queues aren't full and can (temporarily) absorb the messages, allowing everyone to move on.
But suppose the network briefly became a little busy for some reason, and that delayed message transfer. The message send might then fail because the high water mark had been reached. Whoops! No one is then sending anything to anyone anymore!
A development of the Actor Model, called Communicating Sequential Processes, was invented to solve this problem. It has a restriction; there is no buffering of messages at all. No process can complete sending a message until the recipient has received all the data.
The theoretical consequence of this was that it was then possible to mathematically analyse a system design and pronounce it to be free of deadlock. The practical consequence is that if you've built a system that can deadlock, it will do so every time. That's actually not so bad; it'll show up in testing, not post-deployment.
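The "no buffering" rule can be sketched with Python's standard library. `RendezvousChannel` is a hypothetical name, and this is only an illustration of the unbuffered hand-off, not a full CSP implementation (there is no choice or select operation, for instance).

```python
import queue
import threading


class RendezvousChannel:
    """Unbuffered channel: send() returns only after a receiver took the item."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._send_lock = threading.Lock()   # one sender at a time

    def send(self, item):
        with self._send_lock:
            self._slot.put(item)
            self._slot.join()        # block until the receiver calls task_done()

    def receive(self):
        item = self._slot.get()      # blocks until a sender has offered an item
        self._slot.task_done()       # releases the blocked sender
        return item
```

Two processes that each call `send` to the other before calling `receive` will block on every run, not only under load, which is the "it will deadlock every time" property described above.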
Curiously this is hinted at in the documentation of Microsoft's Task Parallel Library, where they advocate setting buffer lengths to zero in the interests of achieving a more robust application.
It'd be like setting the ZeroMQ high water mark to zero, but in zmq_setsockopt() a high water mark of 0 means no limit, not nought. The default is non-zero...
CSP is much more suited to real time applications. Any shortage of available workers immediately results in an inability to send messages (so your system knows it's failed to keep up with the real time demand) instead of resulting in an increased latency as data is absorbed by sockets, etc. (which is far harder to discover).
Unfortunately almost every communications technology we have (Ethernet, TCP/IP, ZeroMQ, nanomsg, etc) leans towards Actor Model. Everything has some sort of buffer somewhere, be it a packet buffer on a NIC or a socket buffer in an operating system.
Thus to implement CSP in the real world one has to implement flow control on top of the existing transports. This takes work, and it's slightly inefficient. But if a system needs it, it's definitely the way to go.
Personally I'd love to see 0MQ and Nanomsg adopt it as a behavioural option.

Suggestion for Oracle AQ dequeue approach

I have a need to dequeue messages coming from an Oracle Queue on a continuous basis.
As far as I can tell, we can dequeue messages in two ways: either through the asynchronous auto-notification approach or through a manual polling process where one dequeues one message at a time.
I can't go for the asynchronous notification feature, as the number of messages received could go up to 1000 within 5 minutes during peak hours and
I do not want to overload the database by spawning multiple callback procedures in the background.
With the manual polling process, I can create a one-time scheduler job that runs 24*7 and calls a stored proc that dequeues the messages in a loop in WAIT mode (kind of listening for a message).
The problem with this approach is that
1) the scheduler job runs continuously and occupies one permanent job slot
2) the stored procedure does not EXIT as it runs in a loop waiting for messages.
Are there any alternative/better solutions where I do not need to have a job/procedure running continuously looking for messages?
Can I use the auto-notification approach to get a notification for the very first message, unsubscribe the subscriber, dequeue further messages, and
subscribe to the queue again when there are no more messages? Is this a safe approach, and will I lose any messages between subscription and unsubscription?
BTW, We use Oracle 10gR2 database, so I can't use PURGE ON NOTIFICATION option.
Appreciate your expert solution!!
You're right, it's not a good idea to use auto-notification for a high-volume queue.
At one client I've seen a one-time scheduler job which runs 24*7, it seems to work reasonably well, and they can enqueue a special "STOP" message (which goes to the top of the queue) that it listens for and stops processing messages.
However, generally I'd lean towards a job that runs regularly (e.g. once per minute, or whatever granularity is suitable for you) which would dequeue all the messages. I'd put the dequeue in a loop with a loop counter and a "maximum messages" limiter based on the maximum number of messages you'd expect in a 1-minute period. The job would keep processing messages until (a) there are no more messages in the queue, or (b) the maximum limit has been reached.
You can then set the schedule for the job based on the maximum delay you want to see between an enqueue and a dequeue. E.g. if it doesn't matter if a message isn't processed within 5 minutes, you could set the job to run once every 5 minutes.
The maximum limit needs to be quite a high figure - e.g. 10x or 100x the expected maximum number - otherwise a spike could flood your queue and it might not keep up. The idea of the maximum limit is to ensure that the job never runs forever. This should give ops enough time to detect a problem with the queue (e.g. if some rogue process is flooding the queue with bogus messages).
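The shape of that bounded loop is sketched below in Python rather than PL/SQL. The standard-library `queue.Queue` stands in for the Oracle queue, `get_nowait` stands in for a no-wait dequeue, and `drain` and `handle` are hypothetical names; no Oracle AQ API is shown.

```python
import queue


def drain(q, max_messages, handle):
    """Dequeue until the queue is empty or max_messages have been processed.

    Returns (processed_count, hit_limit).
    """
    processed = 0
    while processed < max_messages:
        try:
            message = q.get_nowait()     # stand-in for a no-wait dequeue
        except queue.Empty:
            return processed, False      # (a) no more messages in the queue
        handle(message)
        processed += 1
    return processed, True               # (b) limit reached; the rest waits for the next run
```

Returning whether the limit was hit gives the scheduled job something to log or alert on, so a flooded queue shows up as repeated limit hits rather than as a job that never ends.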
