I'm learning Raft, and I already know the basic mechanism of Raft.
When a Leader is elected, it is responsible for bringing the Followers' logs in line with its own. When updating a Follower, it searches backwards for the first matching <index, term> pair and then overwrites the Follower's log with the entries that follow it.
How does Raft guarantee that the logs of the Leader and the Follower are the same before the matched <index, term>? Can this case happen:
Leader
Entry : 1 2 3 4 5 6 7 8 9 10
Term : 1 1 1 2 2 3 3 3 3 3
Follower
Entry : 1 2 3 4 5 6 7
Term : 1 1 1 1 2 3 3
This property of the Raft algorithm is called Log Matching.
If two logs contain an entry with the same index and term, then the
logs are identical in all entries up through the given index
This holds because:
When sending an AppendEntries RPC, the leader includes the index and
term of the entry in its log that immediately precedes the new
entries. If the follower does not find an entry in its log with the
same index and term, then it refuses the new entries. The consistency
check acts as an induction step: the initial empty state of the logs
satisfies the Log Matching Property, and the consistency check
preserves the Log Matching Property whenever logs are extended. As a
result, whenever AppendEntries returns successfully, the leader knows
that the follower’s log is identical to its own log up through the new
entries.
Source https://raft.github.io/raft.pdf
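To make the induction step concrete, here is a minimal sketch of how a follower might apply that consistency check (Python, 0-based indexing, made-up names; a sketch under those assumptions, not the paper's reference implementation):

```python
# log is stored as a list of (term, command) tuples, 0-based.
def handle_append_entries(log, prev_log_index, prev_log_term, entries):
    """Return (success, new_log). prev_log_index == -1 means "before the first entry"."""
    # Consistency check: the follower must already hold the entry that
    # immediately precedes the new ones, with the matching term.
    if prev_log_index >= 0:
        if prev_log_index >= len(log) or log[prev_log_index][0] != prev_log_term:
            return False, log   # refuse; the leader retries with an earlier prev_log_index
    # Accept: drop any conflicting suffix and append the leader's entries.
    # (The full protocol only truncates on an actual conflict, which matters for
    # stale or reordered RPCs, but the induction argument is the same.)
    return True, log[:prev_log_index + 1] + list(entries)
```

With the logs in the question (1-based indices), the leader's probe at index 4 (term 2) is eventually refused because the follower's entry 4 has term 1; backing up to index 3 (term 1) succeeds, and by the induction above everything before that point is already identical.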
The language is not really important here (I will translate to C#); what I'm after is a queue or waiting-list position algorithm.
So I have effectively three queues in a store, Staff 1, Staff 2, Staff 3.
People in the queue can choose to book with an individual staff member or take the first available one. So the queue would look something like:
Staff 1
Staff 3
First Available
Staff 3
First Available
Staff 2
First Available
Staff 1
First Available
Staff 3
So for the next person who comes into the store how would I calculate their queue position if they selected
a) A staff member (1,2 or 3)
b) First available
Given that at each tick of a constant interval every staff-linked queue simultaneously gets one entry shorter, we can express the waiting time for a new client as follows:
If the client wishes to be served by a specific member of staff, the waiting time (or rank) corresponds to the length of the queue linked with that particular member of staff.
If the client wishes to be served by whichever staff member becomes available first, they should be added to the shortest staff-linked queue; their staff assignment can therefore be made immediately, and the waiting time is then determined as in the previous point.
If you have 3 staff members, you would implement this with three FIFO queues. Put clients in the queues according to the above considerations. It is easy to report the length of the relevant queue, which is the rank for that client: 0 means they get served when the next interval starts.
At each passing of the fixed interval, you pull one entry from each of the queues; these represent the clients being served by each of the staff members.
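Since the language doesn't matter, here is a minimal Python sketch of that bookkeeping (class and method names are made up for illustration):

```python
from collections import deque

class WaitingList:
    def __init__(self, staff_count=3):
        self.queues = [deque() for _ in range(staff_count)]

    def join(self, client, staff=None):
        """Add a client. staff is a 0-based index, or None for "first available".
        Returns (assigned_staff, rank); rank 0 means served at the next interval."""
        if staff is None:
            # First available: bind the client to the shortest queue right away.
            staff = min(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        rank = len(self.queues[staff])
        self.queues[staff].append(client)
        return staff, rank

    def tick(self):
        """One interval passes: each staff member serves the head of their queue."""
        return [q.popleft() if q else None for q in self.queues]

wl = WaitingList()
print(wl.join("Ann", staff=0))   # (0, 0): served at the next interval
print(wl.join("Bob"))            # first available -> shortest queue, here (1, 0)
print(wl.join("Cat", staff=0))   # (0, 1): one client already ahead in that queue
```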
I have 3 nodes, each of them has 5 shards and 2 replicas.
Two of the nodes are way slower than the last one.
On a simple "match_all" search query, two nodes answer in 3 seconds, and the third one in 200 ms.
They all have been created at the same time, with the same configuration. They contain roughly the same amount of documents.
Do you have any idea what may cause such a difference in response time?
I'll update my question if any information is missing.
In raft paper section 5.4.2
If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. There could be a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.
The author mentioned that to avoid the situation above:
To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader's current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property.
But wouldn't the same problem still occur?
Given the following situation that the author provided
When S5 is elected leader, it only looks at its own log, whose last entry is (term 3, index 2), and this entry is going to overwrite the term-2 entries on all the followers.
How does making a leader look only at its own log solve the problem?
Read the caption of that figure (Figure 8 in the paper). Both (d) and (e) are possible resolutions of the log state produced by (a), (b), and (c). The problem is that even though in (c) the entry (term 2, index 2) was replicated to a majority of the cluster, it could still be overwritten when S5 is elected in (d). So the solution is to only allow nodes to commit entries from their own term. In other words, replicating an entry on a majority of nodes does not equal commitment. In (c), entry (term 2, index 2) is replicated to a majority of the cluster, but because it is not from the leader's current term (which is at least 4), it is not committed. But in (e), once the leader replicates an entry from its current term (4), that prevents the situation in (d) from occurring, since S5 can no longer be elected leader.
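A minimal sketch of that commit rule as a leader might apply it (Python, made-up names; a sketch, not the paper's reference code): the leader only advances its commit index to an entry that carries its own current term, even if older entries already sit on a majority.

```python
def advance_commit_index(log_terms, current_term, commit_index, match_index, cluster_size):
    """log_terms[i-1] is the term of the entry at 1-based index i; match_index holds,
    for each other server, the highest index known to be replicated there."""
    for n in range(len(log_terms), commit_index, -1):         # try the highest index first
        replicas = 1 + sum(1 for m in match_index if m >= n)  # the leader counts itself
        on_majority = replicas > cluster_size // 2
        # The guard from Section 5.4.2: only entries from the leader's
        # current term are committed by counting replicas.
        if on_majority and log_terms[n - 1] == current_term:
            return n   # entries up to n are committed; older ones follow by Log Matching
    return commit_index

# Figure 8(c): S1 leads in term 4 with log terms [1, 2, 4]; the term-2 entry at index 2
# is on a majority (S1, S2, S3), yet commitIndex stays at 1 (assuming index 1 was
# already committed back in term 1).
print(advance_commit_index([1, 2, 4], current_term=4,
                           commit_index=1, match_index=[2, 2, 1, 1], cluster_size=5))  # -> 1
```

In (e), once the term-4 entry at index 3 reaches a majority, the same check commits index 3, and indices 1 and 2 become committed along with it.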
After S1 replicates entry 4, whose term is higher than 2 and 3, S5 can no longer be elected leader, because of Raft's leader-election rule:
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
So, in my opinion, the appended log entry 4 in (e) implicitly promotes all the entries before it, because what we care about now is the term of the last entry, not entry 2 any more.
This is just like what the proposer does in Phase 2 of Paxos:
If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
That is to say, it proposes the learned value 2 with a higher proposal number.
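To make that election rule concrete, here is a minimal sketch of the up-to-date comparison a voter might apply (Python, made-up names; logs are represented only by their per-entry terms):

```python
def candidate_log_is_up_to_date(my_log_terms, last_log_index, last_log_term):
    """my_log_terms: this voter's log as a list of terms (1-based indices); the
    candidate sends the index and term of its last entry in RequestVote."""
    my_last_term = my_log_terms[-1] if my_log_terms else 0
    my_last_index = len(my_log_terms)
    if last_log_term != my_last_term:
        return last_log_term > my_last_term    # the later term wins
    return last_log_index >= my_last_index     # same term: the longer log wins

# Figure 8(e): after S1 replicates the term-4 entry, S2's log terms are e.g. [1, 2, 4].
# S5 (log terms [1, 3]) requests a vote with (last_index=2, last_term=3) and is refused.
print(candidate_log_is_up_to_date([1, 2, 4], last_log_index=2, last_log_term=3))  # False
```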
I think both situations in Figure 8, (d) and (e), are legal in Raft because the paper says:
To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas.
In Figure 8(d), the entries with term 2 are not in the local log of leader S5 and have not been committed to the state machine, so it is OK to overwrite them with entries of term 3. Only entries from the leader's current term are eligible to be considered committed by counting replicas.
If we allowed an entry from a previous term to be committed by counting replicas, then after (c) the term-2 entry would be committed. After that, if S5 (which holds the term-3 entry) were elected leader, it would overwrite that committed entry, and thus S5 and S1 would execute different commands. To prevent that, Raft does not allow the term-2 entry to be committed this way, so the commands executed on all state machines stay consistent.
I think the question is about the case where the leader crashes after a few followers have already applied a log entry to their state machines.
In (c) in the figure, the Leader has copied the term-2 entry to more than half of the nodes, then updates its commitIndex to 2 and sends a heartbeat. Before the Leader crashes, only S2 receives the heartbeat and applies the term-2 entry to its state machine. According to the paper, S5 can become the new leader with votes from S3 and S4 and then try to append its term-3 entry to S2 through S4. But this operation should not be allowed, because S2 has already applied the entry at index 2 to its state machine.
It seems Raft does not cover this situation.
I'm hoping that someone can just confirm my understanding of how the resource manager works...
If I've got a 4-node RAC with 2 consumer groups and 2 services, and the services send each consumer group to particular nodes only, i.e. consumer group 1 ALWAYS gets sent to nodes 1 and 2 and consumer group 2 ALWAYS gets sent to nodes 3 and 4.
If I've got a tiered resource plan such as:
Group Name | L0 | L1 | max
Group 1 | 75% | 0 | 80%
Group 2 | 0 | 75% | 80%
Am I right in saying that, since group 1 is on nodes 1 and 2 and group 2 is on nodes 3 and 4, each group will have 75% of the resources available on its respective nodes, and both will be limited to 80% on the nodes they run on?
i.e. resources are constrained and calculated on a per-node basis, not per cluster.
So even if a connection in group 1 on node 1 is using 80% of the resources, another group 1 connection on node 2 will still have up to 80% available to it, not 0%.
And similarly, if group 1 is using its allocated maximum, group 2 will still get its full share on nodes 3 and 4, since group 1 (which has higher priority) isn't running on those nodes.
I've had a response from Oracle Support:
Resource management's limits are applied per node except PARALLEL_TARGET_PERCENTAGE, so for your example, you are right.
Connections in consumer group 2 only ever hit node 2 (due to the services); group 2 will get a minimum of 75% of resources on the 2nd node, and potentially 100% if no max limit has been set, or 80% if the max limit has been set.
Suppose I have n computers. Each of them has a set of integers. Each computer will not have the same set.
i.e. computer 1 has {1,2,3,4}, computer 2 has {4, 5,10,20,21}, computer 3 has {-10,3,5} and so on.
I want to replicate this data so that all computers will have all the integers, i.e. all of them will have {-10, 1, 2, 3, 4, 5, 10, 20, 21}.
I want to minimize the number of messages that each computer sends and also minimize the time (i.e. avoid a serial approach where computer 1 first communicates with everyone and gets the data it is missing, then computer 2 does the same, and so on).
What is an efficient way of doing this?
Thanks.
A minimal approach would be: all computers send their info to just one (master) computer and get back the result.
For reliability, you could use at least two computers as masters.
Assumptions:
Total of n computers.
One of the computers is considered the master.
Algorithm:
All computers send their input info to the master (total n-1 messages).
The master processes the info.
The master sends the result info to all computers (total n-1 messages).
Reliability:
Total failure of the system under this algorithm can only occur if all the masters fail.
Efficiency:
With 1 master, total messages: 2 * (n-1)
With 2 masters, total messages: 2 * 2 * (n-1)
With 3 masters, total messages: 3 * 2 * (n-1)
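A minimal local sketch of that star pattern, with the network sends stubbed out as plain function calls (names are made up):

```python
def replicate_via_master(node_sets):
    """node_sets: one set of integers per computer; node 0 plays the master.
    Models 2*(n-1) messages: n-1 sends to the master, n-1 result sends back."""
    master = set(node_sets[0])
    for s in node_sets[1:]:                  # n-1 messages: node i -> master
        master |= s
    return [set(master) for _ in node_sets]  # n-1 messages: master -> node i

nodes = [{1, 2, 3, 4}, {4, 5, 10, 20, 21}, {-10, 3, 5}]
print(replicate_via_master(nodes)[2])        # the full union, on every node
```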
If all the computers are on the same network, you could use UDP sockets with the SO_BROADCAST option.
This way, when one computer does a 'send', all the other computers 'recv' the message and update their sets as necessary.
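A minimal sketch of that idea with Python sockets (the port and the JSON payload format are arbitrary choices for this sketch, and real use would need retries or acknowledgements, since UDP broadcast is unreliable):

```python
import json
import socket

PORT = 50000   # arbitrary port chosen for this sketch

def broadcast_integers(integers):
    """Broadcast this node's set of integers to the local subnet."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.sendto(json.dumps(sorted(integers)).encode(), ("<broadcast>", PORT))
    s.close()

def receive_once(known, timeout=1.0):
    """Receive one broadcast (if any arrives in time) and merge it into our set."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    s.settimeout(timeout)
    try:
        data, _addr = s.recvfrom(65535)
        known.update(json.loads(data.decode()))
    except socket.timeout:
        pass
    finally:
        s.close()
    return known
```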
Here's one way of doing it in 2*n - 2 messages. Model the machines as nodes in a linked list, numbered from 1 to n.
Let node 1 send all of its data in one message to node 2.
Node 2 remembers the message node 1 sent, performs a union of its content with node 1's content and sends the unified message to Node 3. Then Node 2 waits for a response from Node 3.
Node 3 does the same as above and so on until we get to node 'n'. Node 'n' now has the complete set.
Node 'n' already knows what message node 'n - 1' sent it, so it sends the diff back to node 'n - 1'.
Node 'n - 1' performs the union as above. Since it has remembered the message of node 'n - 2' (from step 2 above), it can send the diff back to node 'n - 2',
and so on.
I think it is not hard to show that the above results in 2 * (n - 1) messages being sent over the network.
I think it can also be proven that 2n - 2 messages are necessary by considering the case where each node holds a unique element; it should be a short exercise in mathematical induction.
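A minimal local simulation of that two-pass chain (sends are modeled as plain set operations; names are made up):

```python
def replicate_along_chain(node_sets):
    """node_sets: one set per node, in chain order. Simulates the forward pass
    (unions travel towards the last node) and the backward pass (diffs travel
    back), for 2*(n-1) messages in total."""
    n = len(node_sets)
    sent_forward = [set() for _ in range(n)]        # what each node forwarded

    # Forward pass: each node merges what its predecessor sent, then forwards the union.
    for i in range(n):
        if i > 0:
            node_sets[i] |= sent_forward[i - 1]     # 1 message: node i-1 -> node i
        sent_forward[i] = set(node_sets[i])         # the union accumulated so far
    # The last node now holds the complete set.

    # Backward pass: node i returns to node i-1 only what the forwarded message lacked.
    for i in range(n - 1, 0, -1):
        diff = node_sets[i] - sent_forward[i - 1]   # 1 message: node i -> node i-1
        node_sets[i - 1] |= diff
    return node_sets

nodes = [{1, 2, 3, 4}, {4, 5, 10, 20, 21}, {-10, 3, 5}]
print(replicate_along_chain(nodes))                 # every node ends with the full union
```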
Well, there's already a system that does this, is widely adopted, well documented, widely available and, while it's perhaps not perfect (for assorted definitions of perfect), it's practical.
It's called RSync.
You can start here: http://www.samba.org/~tridge/phd_thesis.pdf