Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.
Terms act as a logical clock [14] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time.
From this paper, we learn that a term begins with an election and that the term number increases monotonically.
My question is: when does the term increase?
Does it increase with physical time, e.g. every minute or every hour?
Is it related to logical time?
Does it only increase when a new election happens?
How frequently will the term change?
How many log entries will be generated within a term?
The term is a logical timestamp, or what’s more generally referred to in distributed systems as an epoch. The frequency with which terms change is entirely dependent on node and network conditions. Terms are incremented only when a member starts a new election. So, the term will be incremented e.g. after a leader crashes, if a network partition leads to some members’ election timers expiring, if there’s enough latency in the network to expire election timers, or if an election ends without a winner.
Related
I'm trying to learn the Raft algorithm in order to implement it, but I haven't understood when the term is incremented. Apart from when a node passes from the follower state to the candidate state, are there other cases in which the term is incremented? For instance, when the leader advances the last commit index? Thanks
If you're looking from the standpoint of a single peer in the Raft cluster, you will update your term if:
you receive an RPC request or response that contains a term higher than yours (note that if you're in the leader mode, you also need to step down as leader and revert to follower)
you're in the follower mode and didn't hear from the current leader within the minimum election timeout (you'll switch to candidate mode)
you're in the candidate mode and didn't get enough votes to win the election and the election timeout elapses
Generally observing the system, the term will be increased only when a new election starts.
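Putting those rules together, here is a minimal sketch of how a single peer's term moves; the types and function names are illustrative, not taken from any particular Raft implementation.

```go
// Sketch of the three term-update rules described above. Names and
// structure are illustrative, not from any specific Raft library.
package main

import "fmt"

type role int

const (
	follower role = iota
	candidate
	leader
)

type peer struct {
	currentTerm uint64
	role        role
}

// onRPCTerm applies rule 1: any RPC request or response carrying a higher
// term moves this peer to that term and back to follower.
func (p *peer) onRPCTerm(rpcTerm uint64) {
	if rpcTerm > p.currentTerm {
		p.currentTerm = rpcTerm
		p.role = follower // a leader must step down here
	}
}

// onElectionTimeout applies rules 2 and 3: a follower that hasn't heard
// from a leader, or a candidate whose election failed, starts a new
// election and increments the term.
func (p *peer) onElectionTimeout() {
	p.currentTerm++
	p.role = candidate
}

func main() {
	p := &peer{currentTerm: 3, role: follower}
	p.onElectionTimeout() // no leader heard from: term 3 -> 4, become candidate
	p.onRPCTerm(7)        // someone else already won term 7: adopt it, revert to follower
	fmt.Println(p.currentTerm, p.role) // 7 0
}
```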
Suppose a network partition occurs and the leader A is in minority. Raft will elect a new leader B but A thinks it's still the leader for some time. And we have two clients. Client 1 writes a key/value pair to B, then Client 2 reads the key from A before A steps down. Because A still believes it's the leader, it will return stale data.
The original paper says:
Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests.
Isn't that too expensive? Does the leader have to talk to a majority of the nodes for every read request?
I'm surprised there's so much ambiguity in the answers, as this is quite well known:
Yes, to get linearizable reads from Raft you must round-trip through the quorum.
There are no shortcuts here. In fact, both etcd and Consul committed an error in their implementations of Raft and caused linearizability violations. The implementors erroneously believed (as did many people, including myself) that if a node thought of itself as a leader, it was the leader.
Raft does not guarantee this at all. A node can be a leader and not learn of its loss of leadership because of the very network partition that caused someone else to step up in the first place. Because clock error is taken as unbounded in distributed systems literature, no amount of waiting can solve this race condition. New leaders cannot simply "wait it out" and then decide "okay, the old leader must have realized it by now". This is just typical lease lock stuff - you can't use clocks with unbounded error to make distributed decisions.
Jepsen covered this error in detail, but to quote the conclusion:
[There are] three types of reads, for varying performance/correctness needs:
Anything-goes reads, where any node can respond with its last known value. Totally available, in the CAP sense, but no guarantees of monotonicity. Etcd does this by default, and Consul terms this “stale”.
Mostly-consistent reads, where only leaders can respond, and stale reads are occasionally allowed. This is what etcd currently terms “consistent”, and what Consul does by default.
Consistent reads, which require a round-trip delay so the leader can confirm it is still authoritative before responding. Consul now terms this consistent.
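For that third category, a minimal sketch of a quorum-confirmed read might look like the following; heartbeatOK is a hypothetical stand-in for the real RPC layer, and the peer names are made up.

```go
// Minimal sketch of a quorum-confirmed ("consistent") read. The
// heartbeatOK helper and the peer list are hypothetical stand-ins for
// whatever transport a real implementation uses.
package main

import "fmt"

// heartbeatOK stands in for sending a heartbeat RPC and getting a reply
// that does not carry a higher term. Here it is simulated.
func heartbeatOK(peer string) bool { return peer != "partitioned-node" }

// confirmLeadership returns true only if a majority of the cluster
// (including the leader itself) acknowledges the leader's current term.
func confirmLeadership(peers []string) bool {
	acks := 1 // the leader counts itself
	for _, p := range peers {
		if heartbeatOK(p) {
			acks++
		}
	}
	return acks > (len(peers)+1)/2
}

func readConsistent(peers []string, kv map[string]string, key string) (string, error) {
	if !confirmLeadership(peers) {
		return "", fmt.Errorf("not confirmed as leader: cannot serve linearizable read")
	}
	return kv[key], nil
}

func main() {
	kv := map[string]string{"x": "42"}
	v, err := readConsistent([]string{"node-b", "node-c"}, kv, "x")
	fmt.Println(v, err) // "42 <nil>" once a quorum has acknowledged
}
```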
Just to tie in with some other results from literature, this very problem was one of the things Flexible Paxos showed it could handle. The key realization in FPaxos is that you have two quorums: one for leader election and one for replication. The only requirement is that these quorums intersect, and while a majority quorum is guaranteed to do so, it is not the only configuration.
For example, one could require that every node participate in leader election. The winner of this election could be the sole node serving requests - now it is safe for this node to serve reads locally, because it knows for a new leader to step up the leadership quorum would need to include itself. (Of course, the tradeoff is that if this node went down, you could not elect a new leader!)
The point of FPaxos is that this is an engineering tradeoff you get to make.
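For simple size-based quorums, the FPaxos intersection requirement reduces to a one-line check; the example below mirrors the configuration sketched above (everyone votes in elections, a single node replicates and serves reads).

```go
// Sketch of the Flexible Paxos requirement for size-based quorums: an
// election quorum of size qe and a replication quorum of size qr over n
// nodes are guaranteed to intersect iff qe + qr > n.
package main

import "fmt"

func quorumsIntersect(n, qe, qr int) bool {
	return qe+qr > n
}

func main() {
	n := 5
	fmt.Println(quorumsIntersect(n, 3, 3)) // true: classic majority quorums
	fmt.Println(quorumsIntersect(n, 5, 1)) // true: everyone elects, one node replicates and serves reads
	fmt.Println(quorumsIntersect(n, 2, 3)) // false: unsafe, the quorums may not overlap
}
```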
The leader doesn't have to talk to a majority for each read request. Instead, as it continuously heartbeats with its peers, it maintains a staleness measure: how long has it been since it received an okay from a quorum? When serving a read, the leader checks this staleness measure and, if it is too high, returns a StalenessExceeded error. This gives the calling system the chance to connect to another host.
It may be better to push that staleness check to the calling systems: let the low-level Raft system have higher availability (in CAP terms) and let the calling systems decide at what staleness level to fail over. This can be done in various ways. You could have the calling systems heartbeat to the Raft system, but my favorite is to return the staleness measure in the response. This can be improved further if the client includes its timestamp in the request, the Raft server echoes it back in the response, and the client adds the round-trip time to the Raft staleness. (NB: always use the nano clock for measuring time differences, because it doesn't go backwards the way the system clock does.)
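A minimal sketch of that bookkeeping, using Go's monotonic clock in place of "the nano clock"; the field names and the threshold are illustrative, not from a specific implementation.

```go
// Sketch of the staleness bookkeeping described above. Go's time.Now()
// carries a monotonic reading, so time.Since cannot go backwards the way
// the wall clock can.
package main

import (
	"fmt"
	"time"
)

type leaderState struct {
	lastQuorumAck time.Time // updated whenever a heartbeat round reaches a quorum
}

type readResponse struct {
	value     string
	staleness time.Duration // how long since the leader last heard from a quorum
}

func (l *leaderState) read(kv map[string]string, key string, maxStale time.Duration) (readResponse, error) {
	stale := time.Since(l.lastQuorumAck)
	if stale > maxStale {
		return readResponse{}, fmt.Errorf("StalenessExceeded: %v since last quorum contact", stale)
	}
	return readResponse{value: kv[key], staleness: stale}, nil
}

func main() {
	l := &leaderState{lastQuorumAck: time.Now()}
	kv := map[string]string{"x": "42"}

	start := time.Now() // client side: measure the round-trip time
	resp, err := l.read(kv, "x", 500*time.Millisecond)
	rtt := time.Since(start)

	// The client's view of staleness is the leader's figure plus the round trip.
	fmt.Println(resp.value, resp.staleness+rtt, err)
}
```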
I'm not sure whether a timeout configuration can solve this problem:
2 x heartbeat interval <= election timeout
which means that when a network partition happens, writes on leader A will fail because A is in the minority: it cannot get acknowledgements back from a majority of the nodes, and it will step back to being a follower.
After that, leader B is elected; it catches up on the latest changes from at least one of its followers, and then clients can read from and write to leader B.
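If you do go down this road, the constraint itself is trivial to validate at startup; here is a tiny sketch (the names are illustrative, not from a specific implementation).

```go
// Sketch of validating the relationship suggested above:
// 2 x heartbeat interval <= election timeout.
package main

import (
	"fmt"
	"time"
)

func validateTimeouts(heartbeat, election time.Duration) error {
	if 2*heartbeat > election {
		return fmt.Errorf("election timeout %v must be at least twice the heartbeat interval %v", election, heartbeat)
	}
	return nil
}

func main() {
	fmt.Println(validateTimeouts(150*time.Millisecond, 300*time.Millisecond)) // <nil>
	fmt.Println(validateTimeouts(200*time.Millisecond, 300*time.Millisecond)) // error
}
```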
Question
The leader has to talk to a majority of nodes for every read request
Answer: No.
Explanation
Let's understand it with code example from HashiCorp's raft implementation.
There are 2 timeouts involved (their names are self-explanatory, but links are included below for the detailed definitions):
LeaderLease timeout[1]
Election timeout[2]
Example values are 500ms and 1000ms respectively[3]
A required condition for a node to start is: LeaderLease timeout < Election timeout [4,5]
Once a node becomes Leader, it checks whether it is heartbeating with a quorum of followers[6, 7]. If the heartbeats stop, this is tolerated until the LeaderLease timeout[8]. If the Leader cannot contact a quorum of nodes within the LeaderLease timeout, it has to step down and become a Follower[9]
Hence, for the example given in the question, Node-A must step down as Leader before Node-B becomes Leader. Since Node-A knows it's not the Leader before Node-B becomes Leader, Node-A will not serve the read or write request.
[1]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L141
[2]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L179
[3]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L230
[4]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L272
[5]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L275
[6]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L456
[7]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L762
[8]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L891
[9]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L894
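Condensed into a sketch, the idea looks like this; note this is only an illustration of the lease check referenced in [6]-[9], not the actual hashicorp/raft code, and all names here are mine. The constants mirror the example values from [3].

```go
// Condensed sketch of the lease check described above: if the leader has
// not been able to contact a quorum within leaderLeaseTimeout, it steps
// down to follower.
package main

import (
	"fmt"
	"time"
)

const (
	leaderLeaseTimeout = 500 * time.Millisecond
	electionTimeout    = 1000 * time.Millisecond // must be > leaderLeaseTimeout
)

type node struct {
	isLeader          bool
	lastQuorumContact time.Time
}

// checkLease is run periodically by the leader.
func (n *node) checkLease() {
	if n.isLeader && time.Since(n.lastQuorumContact) > leaderLeaseTimeout {
		n.isLeader = false // step down; reads and writes are refused from here on
	}
}

func main() {
	n := &node{isLeader: true, lastQuorumContact: time.Now().Add(-600 * time.Millisecond)}
	n.checkLease()
	fmt.Println("still leader:", n.isLeader) // false: the lease expired
}
```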
My question is mostly based on the following article:
https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index
The article advises against having multiple shards per node for two reasons:
Each shard is essentially a Lucene index; it consumes file handles, memory, and CPU resources
Each search request will touch a copy of every shard in the index. Contention arises and performance decreases when the shards are competing for the same hardware resources
The article advocates the use of rolling indices for indices that see many writes and fewer reads.
Questions:
Do the problems of resource consumption by Lucene indices arise if the old indices are left open?
Do the problems of contention arise when searching over a large time range involving many indices and hence many shards?
How does searching many small indices compare to searching one large one?
I should mention that in our particular case, there is only one ES node though of course generally applicable answers will be more useful to SO readers.
It's very difficult to spit out general best practices and guidelines when it comes to cluster sizing as it depends on so many factors. If you ask five ES experts, you'll get ten different answers.
After several years of tinkering and fiddling around with ES, I've found that what works best for me is always to start small (one node, however many indices your app needs, and one shard per index), load a representative data set (ideally your full data set), and load test to death. Your load testing scenarios should represent the real maximum load you're experiencing (or expecting) in your production environment during peak hours.
Increase the capacity of your cluster (add shards, add nodes, tune knobs, etc.) until your load test passes, and make sure to increase your capacity by a few more percent in order to allow for future growth. You don't want your production to be fine now, you want it to be fine a year from now. Of course, that depends on how fast your data will grow, and it's very unlikely that you can predict with 100% certainty what will happen a year from now. For that reason, as soon as my load test passes, if I expect large exponential data growth, I usually increase the capacity by 50% more, knowing that I will have to revisit my cluster topology within a few months or a year.
So to answer your questions:
Yes, if old indices are left open, they will consume resources.
Yes, the more indices you search, the more resources you will need in order to go through every shard of every index. Be careful with aliases spanning many, many rolling indices (especially on a single node)
This is too broad to answer, as it again depends on the amount of data we're talking about and on what kind of query you're sending, whether it uses aggregation, sorting and/or scripting, etc
Do the problems of resource consumption by Lucene indices arise if the old indices are left open?
Yes.
Do the problems of contention arise when searching over a large time range involving many indices and hence many shards?
Yes.
How does searching many small indices compare to searching one large one?
When ES searches an index it will pick up one copy of each shard (be it replica or primary) and asks that copy to run the query on its own set of data. Searching a shard will use one thread from the search threadpool the node has (the threadpool is per node). One thread basically means one CPU core. If your node has 8 cores then at any given time the node can search concurrently 8 shards.
Imagine you have 100 shards on that node and your query wants to search all of them. ES will initiate the search, and all 100 shards will compete for the 8 cores, so some shards will have to wait some amount of time (microseconds, milliseconds, etc.) to get their share of those 8 cores. Having many shards means fewer documents on each and thus potentially a faster response time from each. But then the node that initiated the request needs to gather all the shards' responses and aggregate the final result, so the response will only be ready when the slowest shard finally responds with its set of results.
On the other hand, if you have a big index with very few shards, there is not as much contention for those CPU cores, but because each shard has a lot of work to do individually, it can take more time to return its individual result.
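The contention model is easy to picture with a toy sketch; this is not Elasticsearch code, just an illustration of many shard searches competing for a fixed-size search threadpool, with the final response only ready once the slowest shard has answered.

```go
// Toy model of shard/threadpool contention: 100 shard searches competing
// for 8 search threads on one node.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const shards, searchThreads = 100, 8

	sem := make(chan struct{}, searchThreads) // the per-node search threadpool
	var wg sync.WaitGroup
	start := time.Now()

	for i := 0; i < shards; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}            // wait for a free search thread
			time.Sleep(time.Millisecond) // pretend to search this shard's slice of the data
			<-sem                        // release the thread
		}()
	}

	wg.Wait() // the coordinating node can only respond after the slowest shard
	fmt.Printf("gathered %d shard responses in %v\n", shards, time.Since(start))
}
```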
When choosing the number of shards, many aspects need to be considered. But for a rough guideline, yes, 30GB per shard is a good limit. This won't work for everyone and for every use case, though, and the article fails to mention that. If, for example, your index uses parent/child relationships, those 30GB per shard might be too much and the response time of a single shard could be too slow.
You took this out of context: "The article advises against having multiple shards per node". No, the article advises one to think about how the indices' shards are structured beforehand. One important step here is testing. Please test your data before deciding how many shards you need.
You mentioned in the post "rolling indices", and I assume time-based indices. In this case, one question is about the retention period (for how long you need the data). Based on the answer to this question you can determine how many indices you'll have. Knowing how many indices you'll have gives you the total number of shards you'll have.
Also, with rolling indices, you need to take care of deleting the expired indices. Have a look at Curator for this.
I have the following problem:
A leader server creates objects which have a start time and end time. The start time and end time are set when an object gets created.
The Start time of the object is set to the current time on the leader node, and the End time is set to Start time + Delta time
A thread wakes up regularly and checks whether the End time of any of the objects is earlier than the current time (hence the object has expired); if so, the object needs to be deleted
All this works fine, as long as things are running smoothly on leader node. If the leader node goes down, one of the follower node becomes the new leader. (There will be replication among leader and follower node (RAFT algorithm))
Now on the new leader, the time could be very different from the previous leader. Hence the computation in step 3 could be misleading.
One way to solve this problem, is to keep the clocks of nodes (leader and followers) in sync (as much as possible).
But I am wondering if there is any other way to resolve this problem of "expiry" with distributed nodes?
Further Information:
Will be using the RAFT protocol for message passing and state replication
It will have known bounds on the delay of messages between processes
Leader and follower failures will be tolerated (as per RAFT protocol)
Message loss is assumed not to occur (RAFT ensures this)
The operation on objects is to check if they are alive. Objects will be enqueued by a client.
There will be strong consistency among processes (RAFT provides this)
I've seen expiry done in two different ways. Both of these methods guarantee that time will not regress, as can happen when synchronizing clocks via NTP or otherwise using the system clock. In particular, both methods use the chip clock for strictly increasing time. (System.nanoTime in Javaland.)
These methods are only for expiry: time does not regress, but it is possible that time can appear to go slower.
First Method
The first method works because you are using a raft cluster (or a similar protocol). It works by broadcasting an ever-increasing clock from the leader to the replicas.
Each peer maintains what we'll call the cluster clock that runs at near real time. The leader periodically broadcasts the clock value via raft.
When a peer receives this clock value it records it, along with the current chip clock value. When the peer is elected leader it can determine the duration since the last clock value by comparing its current chip clock with the last recorded chip clock value.
Bonus 1: Instead of having a new transition type, the cluster clock value may be attached to every transition, and during quiet periods the leader makes no-op transitions just to move the clock forward. If you can, attach these to the raft heartbeat mechanism.
Bonus 2: I maintain some systems where time is increased for every transition, even within the same time quantum. In other words, every transition has a unique timestamp. For this to work without moving time forward too quickly your clock mechanism must have a granularity that can cover your expected transition rate. Milliseconds only allow for 1,000 tps, microseconds allow for 1,000,000 tps, etc.
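Here is a minimal sketch of the first method, using Go's monotonic clock as the "chip clock"; the names are mine, and the raft broadcast itself is elided.

```go
// Sketch of the cluster-clock idea above. Each peer remembers the last
// broadcast cluster-clock value together with its own monotonic reading at
// the time; on becoming leader it resumes the clock from there.
package main

import (
	"fmt"
	"time"
)

type peer struct {
	lastClusterClock time.Duration // last value replicated by the old leader
	recordedAt       time.Time     // this peer's monotonic clock when it was received
}

// onClusterClock is invoked when a cluster-clock transition is applied.
func (p *peer) onClusterClock(value time.Duration) {
	p.lastClusterClock = value
	p.recordedAt = time.Now() // monotonic: never goes backwards
}

// clusterNow is what this peer uses once it is elected leader.
func (p *peer) clusterNow() time.Duration {
	return p.lastClusterClock + time.Since(p.recordedAt)
}

func main() {
	p := &peer{}
	p.onClusterClock(42 * time.Second) // replicated from the previous leader
	time.Sleep(10 * time.Millisecond)  // time passes, the old leader dies, p is elected
	fmt.Println(p.clusterNow())        // roughly 42.01s, regardless of wall-clock skew
}
```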
Second Method
Each peer merely records its chip clock when it receives each object and stores it along with each object. This guarantees that peers will never expire an object before the leader, because the leader records the time stamp and then sends the object over a network. This creates a strict happens-before relationship.
This second method is susceptible, however, to server restarts. Many chips and processing environments (e.g. the JVM) will reset the chip-clock to a random value on startup. The first method does not have this problem, but is more expensive.
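For completeness, a sketch of the second method under the same assumptions (monotonic clock as the chip clock, names illustrative):

```go
// Sketch of the second method: each peer stamps every object with its own
// monotonic clock on receipt and expires it only against that local reading.
package main

import (
	"fmt"
	"time"
)

type object struct {
	receivedAt time.Time     // this peer's monotonic clock at receipt
	ttl        time.Duration // replicated alongside the object
}

func (o object) expired() bool {
	return time.Since(o.receivedAt) > o.ttl
}

func main() {
	o := object{receivedAt: time.Now(), ttl: 50 * time.Millisecond}
	fmt.Println(o.expired()) // false
	time.Sleep(60 * time.Millisecond)
	fmt.Println(o.expired()) // true: expired locally, necessarily after the leader stamped it
}
```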
If you know your nodes are synchronized to some absolute time, within some epsilon, the easy solution is probably to just bake the epsilon into your garbage collection scheme. Normally with NTP, the epsilon is somewhere around 1ms. With a protocol like PTP, it would be well below 1ms.
Absolute time doesn't really exist in distributed systems, though. It can be a bottleneck to try to depend on it, since it implies that all the nodes need to communicate. One way of avoiding it, and synchronization in general, is to keep a relative sequence of events using a vector clock or an interval tree clock. This avoids the need to synchronize on absolute time as state. Since the sequences describe related events, the implication is that only nodes with related events need to communicate.
So, with garbage collection, objects could be marked stale using node sequence numbers. Then, instead of the garbage collector thread checking liveness, the object could either be collected as the sequence number increments, or just marked stale and collected asynchronously.
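One possible reading of that suggestion, sketched with a plain per-node sequence number; all names and the maxAge policy below are illustrative assumptions, not part of the answer above.

```go
// Sketch of sequence-number-based expiry: objects are stamped with the
// node's logical sequence at creation and collected once the sequence has
// advanced by a configured amount, with no reference to wall-clock time.
package main

import "fmt"

type store struct {
	seq       uint64            // incremented on every local event
	createdAt map[string]uint64 // object -> sequence at creation
	maxAge    uint64            // how many events an object may survive
}

func (s *store) put(key string) {
	s.seq++
	s.createdAt[key] = s.seq
}

func (s *store) tick() { s.seq++ } // any other event also advances the sequence

// collect removes objects whose creation sequence is too far behind.
func (s *store) collect() {
	for k, at := range s.createdAt {
		if s.seq-at > s.maxAge {
			delete(s.createdAt, k) // stale: garbage-collect
		}
	}
}

func main() {
	s := &store{createdAt: map[string]uint64{}, maxAge: 2}
	s.put("a")
	s.tick()
	s.tick()
	s.tick()
	s.collect()
	_, alive := s.createdAt["a"]
	fmt.Println(alive) // false: "a" outlived its allowed number of events
}
```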
I am trying to perform leader election. These days I am thinking of using a key-value store to realize that, but I am not quite sure whether the idea is reliable with regard to scalability and consistency. The real deployment will have thousands of nodes, and the election should take place without any central authority or service like ZooKeeper.
Now, my question is:
Can I use a key-value store (preferably C-A tunable, like Riak) to perform the leader election? What are the possible pros/cons of using a KV store for leader election?
Thanks!
EDIT:
I am not interested in the bully algorithm approach anymore.
A key-value store that does not guarantee consistency (like Riak) is a bad way to do this because you could get two nodes that both think (with reason!) that they are the new leader. A key-value store that guarantees consistency won't guarantee availability in the event of problems, and availability is going to be compromised exactly when you've got problems that could cause the loss of nodes.
The way that I would suggest doing this for thousands of nodes is to go from a straight peer to peer arrangement with thousands of nodes to a hierarchical arrangement. So have a master and several groups. Each incoming node is assigned to a group, which assigns it to a subgroup, which assigns it to a sub-sub group until you find yourself in a sufficiently small peer group. Then master election is only held between the leaders of the groups, and the winner gets promoted from being the leader of that group. If the leader of a group goes away (possibly because of promotion), a master election between the leaders of its subgroups elects the new leader. And so on.
If the peer group gets to be too large, say 26, then its master randomly splits it into 5 smaller groups of 5 peers each, with randomly assigned leaders. Similarly if a peer group gets too small, say 3, then it can petition its leader to be merged with someone else. If the leader notices that it has too few followers, say 3, then it can tell one of them to promote its subgroups to full groups, and to join one of those groups. You can play with those numbers, depending on how much redundancy you need.
This will lead to more elections, but you'll have massively reduced overhead per election. This should be a very significant overall win. For a start randomly confused nodes won't immediately start polling thousands of peers, generating huge spikes in network traffic.
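A sketch of just the group-splitting step, using the example numbers above (26 as the split threshold, 5 target groups); everything else here is illustrative.

```go
// Sketch of splitting an oversized peer group into smaller groups with
// randomly assigned leaders; elections then only involve group leaders.
package main

import (
	"fmt"
	"math/rand"
)

type group struct {
	leader string
	peers  []string
}

// split shuffles the peers of an oversized group and divides them into
// `parts` smaller groups, each led by a randomly chosen member.
func split(g group, maxSize, parts int) []group {
	if len(g.peers) <= maxSize {
		return []group{g}
	}
	rand.Shuffle(len(g.peers), func(i, j int) { g.peers[i], g.peers[j] = g.peers[j], g.peers[i] })
	out := make([]group, 0, parts)
	for i := 0; i < parts; i++ {
		chunk := g.peers[i*len(g.peers)/parts : (i+1)*len(g.peers)/parts]
		out = append(out, group{leader: chunk[0], peers: chunk})
	}
	return out
}

func main() {
	peers := make([]string, 27)
	for i := range peers {
		peers[i] = fmt.Sprintf("node-%02d", i)
	}
	for _, sub := range split(group{peers: peers}, 26, 5) {
		fmt.Println(sub.leader, len(sub.peers))
	}
}
```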