Raft: How to solve the performance bottleneck of the leader node?

In Raft, all operation requests are forwarded to the leader node, and the leader then replicates log entries to all followers. Under heavy load, the leader therefore becomes a bottleneck. How can this be solved?

This can be solved in different ways depending on your requirements. Here are some example solutions.
Partition the data. Many large-scale systems partition the data to spread the load (as well as to reduce the impact if a partition goes down). But transactions cannot cross partitions, which may or may not be a problem for your application. A sketch of this approach follows below.
Chain Consensus. This protocol spreads the work of moving data across all the nodes in the cluster. There is still a leader that is a bottleneck for accepting data, but its burden is smaller. Chain consensus also leads to slightly higher latencies than a broadcast system.
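For illustration, here is a minimal sketch in Go of the partitioning idea: a client-side router maps each key onto one of several independent Raft groups, so no single leader has to accept all writes. The group names and addresses are made up for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardLeaders maps each shard (an independent Raft group) to the address of
// its current leader. Names and ports are purely illustrative.
var shardLeaders = []string{
	"raft-group-0-leader:2379",
	"raft-group-1-leader:2379",
	"raft-group-2-leader:2379",
}

// leaderFor hashes the key and picks the Raft group responsible for it.
// Each group has its own leader, so write traffic is spread across groups,
// at the cost of losing cross-shard transactions.
func leaderFor(key string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	idx := int(h.Sum32() % uint32(len(shardLeaders)))
	return shardLeaders[idx]
}

func main() {
	for _, key := range []string{"user:42", "order:7", "cart:19"} {
		fmt.Printf("%s -> %s\n", key, leaderFor(key))
	}
}
```

Each shard still pays the full Raft replication cost internally, but the write load on any single leader is divided by the number of shards.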

Related

Does scaling etcd affect write performance?

The distributed key-value store etcd uses the Raft algorithm. The docs link to animations explaining how the replica nodes vote to make one node the leader (the recipient of external write instructions), and how the leader thereafter broadcasts all instructions to all nodes, attaching them to a heartbeat signal sent to the other nodes in a star topology and committing once a majority acknowledge.
The replication obviously provides resilience against failures of individual nodes, and presumably read performance scales up with the replica count.
Is it correct to understand that write performance is constant, and does not scale with replica count?
That is correct. A write requires a majority of nodes to acknowledge the new entry before it can be committed. A write may even get slower with an increased number of replicas, since it is only as fast as the slowest node in the quorum. Regarding reads, you might find the etcd docs about linearizability interesting. TL;DR: default reads also need a quorum.
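To make "as fast as the slowest node in the quorum" concrete, here is a small sketch with made-up latencies: a write commits at the moment the majority-th fastest node has acknowledged it, so adding replicas can push that point later.

```go
package main

import (
	"fmt"
	"sort"
)

// commitLatency models when a Raft-style write commits: at the moment the
// (majority)-th fastest node has acknowledged the entry. ackTimesMs includes
// the leader's own local append.
func commitLatency(ackTimesMs []float64) float64 {
	sorted := append([]float64(nil), ackTimesMs...)
	sort.Float64s(sorted)
	majority := len(sorted)/2 + 1
	return sorted[majority-1]
}

func main() {
	// Made-up per-node acknowledgement latencies in milliseconds.
	threeNodes := []float64{1, 3, 9}       // commits at 3 ms (2 of 3 acks)
	fiveNodes := []float64{1, 3, 4, 9, 20} // commits at 4 ms (3 of 5 acks)
	fmt.Println(commitLatency(threeNodes), commitLatency(fiveNodes))
}
```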

Raft nodes count

A Raft leader node sends AppendEntries RPCs to all followers. Obviously, network usage increases when we add a new follower, so my question is about how many nodes we can add to a cluster. In the Raft paper and in other places I have read that 5 nodes is the optimal choice for a cluster, but what would you say about a cluster with 100 nodes?
Yes, I understand that I can calculate whether the network bandwidth will be enough. My question is more general: is a cluster with tens of nodes a sign of bad architecture?
Yes, a cluster with tens of nodes is generally a bad idea. Typically, we see clusters go up to 7 nodes, but not really beyond that, and even that's atypical. 3 or 5 nodes is the most common.
If you want to scale across more than 3/5/7 nodes, you typically just shard the cluster, where each shard runs a completely separate and independent instance of the Raft protocol. If you need to scale for fault tolerance instead, you will have to relax the consistency requirements.
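A quick way to see why larger clusters buy fault tolerance rather than write throughput is to look at the quorum size; here is a tiny sketch with a few example cluster sizes.

```go
package main

import "fmt"

func main() {
	// For a Raft cluster of n voting nodes, a write needs acknowledgements
	// from a majority, and the cluster tolerates the failure of a minority.
	for _, n := range []int{3, 5, 7, 100} {
		quorum := n/2 + 1
		tolerated := (n - 1) / 2
		fmt.Printf("n=%d  quorum=%d  tolerated failures=%d\n", n, quorum, tolerated)
	}
}
```

With 100 nodes, every write would still need 51 acknowledgements and the leader would still replicate to 99 followers, which is why sharding, rather than one bigger cluster, is the usual route to more throughput.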

How to make RabbitMQ scalable?

I tried to test RabbitMQ, but I found that it has some problems:
with a cluster of 3 nodes, I can't publish/deliver more than 6,000 messages per second;
on the other hand, with one single node, I can publish/deliver up to 25,000 messages per second;
which means that the more nodes I add, the more performance deteriorates.
But according to this article: https://blog.pivotal.io/pivotal/products/rabbitmq-hits-one-million-messages-per-second-on-google-compute-engine
they can publish more than 1 million messages per second, so how do they do that?
I want to make RabbitMQ process more than 1 million messages per second.
I resolved the problem by adding a load balancer.
The producers send data to the load balancer. The load balancer in turn is connected to many RabbitMQ nodes, but those nodes are not connected to each other (to avoid the synchronization that hurts performance).
This way I can multiply the throughput (e.g., 3 nodes = 3x throughput).
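As a sketch of that layout (the node addresses and the publish function are placeholders rather than a real AMQP client), the balancer can be as simple as a round-robin choice over brokers that know nothing about each other:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Independent RabbitMQ nodes that are deliberately NOT clustered with each
// other; the addresses are placeholders.
var nodes = []string{"rabbit-1:5672", "rabbit-2:5672", "rabbit-3:5672"}

var counter uint64

// nextNode picks a broker round-robin, so publishers are spread evenly
// across the independent nodes.
func nextNode() string {
	i := atomic.AddUint64(&counter, 1)
	return nodes[int(i%uint64(len(nodes)))]
}

// publish stands in for a real AMQP publish call against the chosen node.
func publish(node, msg string) {
	fmt.Printf("publish %q to %s\n", msg, node)
}

func main() {
	for i := 0; i < 6; i++ {
		publish(nextNode(), fmt.Sprintf("message-%d", i))
	}
}
```

The trade-off is that without clustering there is no replication: if a node goes down, its queues go with it, so this layout buys throughput at the cost of redundancy.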
It might also depend on other factors such as your network or your hardware performance.
When reading a benchmark, always consider the environment surrounding the tests.
As for how to improve performance, you can upgrade your hardware or network if that is the limiting factor.
Switching to an SSD or using link aggregation on your network would be a good start.
In this test of RabbitMQ performance, the authors concluded that a small cluster will underperform a single-node cluster; more nodes need to be added to increase the performance. This makes sense when you think about the overhead induced by the replication required in a distributed system, especially given that RabbitMQ's focus is reliability.
The following is mentioned in a blog post by RabbitMQ:
If you use quorum queues or mirrored queues, then each message will be delivered to multiple brokers. If you have a cluster of three brokers and quorum queues with a replication factor of 3, then every broker will receive every message. In that case, we've created a cluster for redundancy only. But we can also create larger clusters for scalability. We could have a cluster of 9 brokers, with quorum queues with a replication factor of 3, and now we've spread that load out and can handle a much larger total throughput.
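The arithmetic behind that quote, as a rough model that ignores the difference between queue leaders and followers: with N brokers and quorum queues of replication factor r, each message lands on r brokers, so each broker sees roughly r/N of the total message volume.

```go
package main

import "fmt"

func main() {
	const repFactor = 3.0
	// Compare a 3-broker cluster and a 9-broker cluster at the same
	// replication factor: the bigger cluster spreads the same work thinner.
	for _, brokers := range []float64{3, 9} {
		share := repFactor / brokers
		fmt.Printf("%.0f brokers, replication factor %.0f: each broker sees ~%.0f%% of all messages\n",
			brokers, repFactor, share*100)
	}
}
```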

What's the benefit of advanced master election algorithms over bully algorithm?

I have read how current master election algorithms like Raft, Paxos, or Zab elect a master in a cluster, and I couldn't understand why they use such sophisticated algorithms instead of the simple bully algorithm.
I'm developing a cluster library and use UDP multicast for heartbeat messages. Each node joins a multicast address and also sends datagram packets to that address periodically. If the nodes find out that there is a new node sending packets to this multicast address, it is simply added to the cluster; similarly, when the nodes in the cluster stop getting packets from a node, they remove it from the cluster. When I need to choose a master node, I simply iterate over the nodes in the cluster and choose the oldest one.
I have read some articles implying that this approach is not effective and that more sophisticated algorithms like Paxos should be used in order to elect a master or detect failures via heartbeat messages. I couldn't understand why Paxos is better than the traditional bully algorithm for split-brain scenarios or other network failures, because I can easily find out when a quorum of nodes leaves the cluster without using Raft. The only benefit I see is the number of packets that each server has to handle: only the master sends heartbeat messages in Raft, while in this case each node has to send heartbeat messages to every other node. However, I don't think this is a problem, since I can simply implement a similar heartbeat algorithm without changing my master election algorithm.
Can someone elaborate on that?
From a theoretical perspective, Raft, Paxos and Zab are not leader election algorithms. They solve a different problem called consensus.
In your concrete scenario, the difference is the following. With a leader election algorithm, you can only guarantee that eventually one node is a leader. That means that during a period of time, multiple nodes might think they are the leader and, consequently, might act like one. In contrast, with the consensus algorithms above, you can guarantee that there is at most one leader in a logical time instant.
The consequence is this: if the safety of the system depends on the presence of a single leader, you might get in trouble relying only on leader election. Consider an example. Nodes receive messages from the UDP multicast and do A if the sender is the leader, but do B if the sender is not the leader. If two nodes check for the oldest node in the cluster at slightly different points in time, they might see different leaders. These two nodes might then receive a multicast message and process it in different ways, possibly violating some safety property of the system you'd like to hold (e.g., that all nodes either do A or do B, but never one does A while another does B).
With Raft, Paxos, and Zab, you can overcome that problem since these algorithms create sort of logical epochs, having at most one leader in each of them.
Two notes here. First, the bully algorithm is defined for synchronous systems. If you implement it exactly as described in the paper by Garcia-Molina, I believe you might run into problems in your partially synchronous system. Second, the Zab algorithm relies on a sort of bully algorithm for asynchronous systems: the leader is elected by comparing the lengths of the nodes' histories (which minimizes the recovery time of the system). Once the leader is elected, it tries to start the Zab protocol, which in turn locks the epoch for the leader.
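A minimal sketch of the epoch idea (the field names are hypothetical, not taken from any particular implementation): followers remember the highest term they have seen and ignore messages from leaders of older terms, so a deposed leader cannot keep acting as if it were current.

```go
package main

import "fmt"

// Message is a request stamped with the term (epoch) of the leader that sent it.
type Message struct {
	LeaderID string
	Term     int
}

// Follower remembers the highest term it has seen so far.
type Follower struct {
	currentTerm int
}

// Accept rejects any message from a leader of an older term, so two nodes can
// never both be obeyed as "the leader" within the same logical epoch.
func (f *Follower) Accept(m Message) bool {
	if m.Term < f.currentTerm {
		return false // stale leader from an old epoch
	}
	f.currentTerm = m.Term
	return true
}

func main() {
	f := &Follower{}
	fmt.Println(f.Accept(Message{LeaderID: "A", Term: 1})) // true
	fmt.Println(f.Accept(Message{LeaderID: "B", Term: 2})) // true: a newer epoch took over
	fmt.Println(f.Accept(Message{LeaderID: "A", Term: 1})) // false: A's term is stale
}
```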

Is it good to create virtual machines (nodes) to get better performance with Cassandra?

I know Cassandra works well in a multi-node setup: the more nodes, the better the performance. If I have two dedicated servers with the same hardware, would it be good to create several virtual machines on each of them to have more nodes, or not?
For example I have two dedicated server with this specifications:
1TB hard drive
64 GB RAM
8 core CPU
then create 8 virtual machines (nodes) on each of them, where each VM has:
~150GB hard drive
8 GB RAM
a share of the 8-core CPU
So I would have 16 nodes. Would these 16 nodes perform better than 2 nodes running directly on the two dedicated servers?
In other words, which side of this trade-off is better: more nodes with weaker hardware, or two stronger nodes?
I know it should be tested, but I want to know whether it is basically reasonable or not.
Adding new nodes always adds some overhead: they need to communicate with each other and sync their data. Therefore, the more nodes you add, the more you'd expect the overhead to increase. You'd add more nodes only when the existing number of nodes can't handle the input/output demands. Since in the situation you describe you'd actually be writing to the same disk, you'd effectively be slowing down your cluster by adding more nodes.
Imagine the situation: you have a server, it receives some data and then writes it to disk. Now imagine the same situation, but the disk is shared between two servers and they both write the same information at almost the same time to the same disk. The two servers also spend CPU cycles communicating with each other that the data has been written so they can stay in sync. I think this is enough to explain why what you are considering is not a good idea if you can avoid it.
EDIT:
Of course, this is only a layman's description. C* has a very nice architecture in which data is spread, according to an algorithm, across a certain range of nodes (not all of them), and when you query for a specific key, the algorithm can tell you where to find the data. That said, when you add or remove nodes, the new nodes have to tell the cluster that they want to share 'the burden', and as a result a recalculation of what is known as the 'token ring' takes place, at the end of which data may be shuffled around so that it remains accessible in a predictable way.
You can take a look at this:
http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes-2
But in general, while there is indeed some overhead when nodes communicate with each other, the number of nodes will almost never dramatically impact your query speed, positively or negatively, if you are querying for a single key.
"I know it should be tested, but I want to know basically is it reasonable or not?"
That will answer most of your assumptions.
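As a rough illustration of the token-ring idea mentioned above (heavily simplified: made-up tokens, no vnodes, no replication), each node owns a range of the hash ring and a partition key is placed on the node whose token covers the key's hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ringNode owns the range of hash values up to and including its token.
type ringNode struct {
	name  string
	token uint32
}

// A made-up token ring with four nodes covering the full 32-bit hash space.
var ring = []ringNode{
	{"node-a", 1 << 30},
	{"node-b", 2 << 30},
	{"node-c", 3 << 30},
	{"node-d", 1<<32 - 1},
}

// nodeFor hashes the partition key and finds the first node whose token is
// at least the hash, i.e. the node responsible for that key.
func nodeFor(key string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	hash := h.Sum32()
	i := sort.Search(len(ring), func(i int) bool { return ring[i].token >= hash })
	return ring[i].name
}

func main() {
	for _, key := range []string{"user:42", "sensor:7", "order:19"} {
		fmt.Printf("%s -> %s\n", key, nodeFor(key))
	}
}
```

The point of the exercise: a single-key query touches only the node(s) owning that token range, which is why adding nodes mostly redistributes ownership rather than speeding up individual lookups.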
The basic advantage of using Cassandra is availability. If you are planning to have just two dedicated servers, then there is a question mark over your data availability: in the worst case, you only ever have two replicas of the data at any point in time.
My take is to go for a nicely split dedicated setup in small chunks. Everything boils down to your use case.
1. If you have a lot of data flowing in and you consider the data king (in which case you need more replicas to cope with failures), I would prefer a high-end distributed setup.
2. If it is the other way around (data is not your forte and is just another part of your setup), you can go with the setup you have described.
3. If you have a cost constraint and you are a start-up with a minimal amount of data that is important to you, set up the two nodes you have with a replication factor of 2 (SimpleStrategy) and a replication factor of 1 (NetworkTopologyStrategy).

Resources