Determining expiry - distributed nodes - without syncing the clocks - time

I have the following problem:
A leader server creates objects which have a start time and end time. The start time and end time are set when an object gets created.
The Start time of the object is set to current time on the leader node, and end time is set to Start time + Delta Time
A thread wakes up regularly and checks if the End time for any of the objects are lesser than the current time (hence the object has expired) - if yes then the object needs to be deleted
All this works fine, as long as things are running smoothly on leader node. If the leader node goes down, one of the follower node becomes the new leader. (There will be replication among leader and follower node (RAFT algorithm))
Now on the new leader, the time could be very different from the previous leader. Hence the computation in step 3 could be misleading.
One way to solve this problem, is to keep the clocks of nodes (leader and followers) in sync (as much as possible).
But I am wondering if there is any other way to resolve this problem of "expiry" with distributed nodes?
Further Information:
Will be using RAFT protocol for message passing and state
replication
It will have known bounds on the delay of message between processes
Leader and follower failures will be tolerated (as per RAFT protocol)
Message loss is assumed not to occur (RAFT ensures this)
The operation on objects is to check if they are alive. Objects will be enqueued by a client.
There will be strong consistency among processes (RAFT provides this)

I've seen expiry done in two different ways. Both of these methods guarantee that time will not regress, as what can happen if synchrnozing clocks via NTP or otherwise using the system clock. In particular, both methods utilize the chip clock for strictly increasing time. (System.nanoTime in Javaland.)
These methods are only for expiry: time does not regress, but it is possible that time can go appear to go slower.
First Method
The first method works because you are using a raft cluster (or a similar protocol). It works by broadcasting an ever-increasing clock from the leader to the replicas.
Each peer maintains what we'll call the cluster clock that runs at near real time. The leader periodically broadcasts the clock value via raft.
When a peer receives this clock value it records it, along with the current chip clock value. When the peer is elected leader it can determine the duration since the last clock value by comparing its current chip clock with the last recorded chip clock value.
Bonus 1: Instead of having a new transition type, the cluster clock value may be attached to every transition, and during quiet periods the leader makes no-op transitions just to move the clock forward. If you can, attach these to the raft heartbeat mechanism.
Bonus 2: I maintain some systems where time is increased for every transition, even within the same time quantum. In other words, every transition has a unique timestamp. For this to work without moving time forward too quickly your clock mechanism must have a granularity that can cover your expected transition rate. Milliseconds only allow for 1,000 tps, microseconds allow for 1,000,000 tps, etc.
Second Method
Each peer merely records its chip clock when it receives each object and stores it along with each object. This guarantees that peers will never expire an object before the leader, because the leader records the time stamp and then sends the object over a network. This creates a strict happens-before relationship.
This second method is susceptible, however, to server restarts. Many chips and processing environments (e.g. the JVM) will reset the chip-clock to a random value on startup. The first method does not have this problem, but is more expensive.

If you know your nodes are synchronized to some absolute time, within some epsilon, the easy solution is probably to just bake the epsilon into your garbage collection scheme. Normally with NTP, the epsilon is somewhere around 1ms. With a protocol like PTP, it would be well below 1ms.
Absolute time doesn't really exist in distributed systems though. It can be bottleneck to try to depend on it, since it implies that all the nodes need communicate. One way of avoiding it, and synchronization in general, is to keep a relative sequence of events using a vector clock, or an interval tree clock. This avoids the need to synchronize on absolute time as state. Since the sequences describe related events, the implication is that only nodes with related events need to communicate.
So, with garbage collection, objects could be marked stale using node sequence numbers. Then, instead of the garbage collector thread checking liveness, the object could either be collected as the sequence number increments, or just marked stale and collected asynchronously.

Related

What if vector clock update reaches much before the actual update?

In a normal Vector clock algorithm, the vector clock is piggybacked along with the request itself.
What if the vector clock gets updated much before the actual request comes ?
As in the vector clock and requests are updated independently and the request gets delayed for some reason.
There is lots of ambiguity in the question, but let me try...
Each node has its own version clock, representing every node in the system. A node can change state (version clock) in two cases:
A request from another node arrives. This request has a version clock from the other node. All regular version clock logic is applied here.
A request comes from outside, e.g. a user input on a given node. In that case, clearly, the request comes without any vector info. Technically, these are two different types of request and they are processed differently.
You could imagine that every node has two apis: one for external requests and one for requests from other nodes.
For the second (external) request, the node would update its own component of version clock stored on the node itself. That's it. Nothing more, nothing less. As an optimization, it could stop updating the clock until it sends data to any other node - own component needs to be updates AT LEAST once in between inter-node communication.
Now when this node sends updates to other nodes, this is where the version clock info is sent along. And the receiver do proper logic around version clock management.
You need strict ordering, but a request, vector clock or even parts of a request can arrive separately. Just like when you send a message on TCP, it is broken down to many packets and each packet arrives separately. On other end you wait till all parts are there before processing them.
So, in your question, on the receiving side, you have to wait until the request and the associated vector clock has arrived. They must becoming one after other with nothing else in-between on the same channel (connection), so that the receiver knows which request is related to which vector clock. Once all of it is received, it can process the request and the clock in an atomic update.

Switching nodes in High Availability mode

I repeatedly switch data nodes in high availability mode, the terminal continues prompting that the transaction is occupied, and then resumes after a few minutes. Why does this occur?
Regarding the questions about switching data nodes repeatedly, prompting the transactions are occupied and so on, the mechanism is as follows. After a data node is down, it takes some time to synchronize the transaction status between itself and the control node. The transaction timeout is about two minutes. In addition, There is some time spent on data recovery, which is related to the specific data volume. If you switch back to this node again in a short time, you will be prompted that the transaction is occupied, and it will be restored in about two minutes. In actual production, the probability of a node breaks down, especially some different nodes hangs continuously within a few minutes is extremely small. Thus, for high-availability tests, it is recommended to set the interval between simulated hangs more than two minutes. It is better to have interval time longer than 5 mins for simulating the breakdown of next node.

Solana - leader validator and incrementing field

As I understand it, Solana will elect a leader each round and there will be multiple validators handling the transactions independently. The leader will then consolidate all the transactions.
From this understanding, I'm curious how Solana actually handles programs which increment a field. So lets say we have this counter field, which increases by 1 each time the program is called. What happens if 10 different users calls this program at the same time, how will this work if the 10 transactions are handled by the ten validators independently. For example at the start of the round, counter=50 and during the round, ten different validators handles the transactions separately so each validator will increase the counter=51. When the leader gets back all the txns, it will say counter=51, what happens in this scenario?
I feel like there is something missing in my assumptions.
So my understanding here seems to be incorrect. It is actually the leader who executes the transactions and the validators who are verifying the transactions.
Source
Page 2 - Section 3 - https://solana.com/solana-whitepaper.pdf
As shown in Figure 1, at any given time a system node is designated as
Leader to generate a Proof of History sequence, providing the network global
read consistency and a verifiable passage of time. The Leader sequences user
messages and orders them such that they can be efficiently processed by other
nodes in the system, maximizing throughput. It executes the transactions
on the current state that is stored in RAM and publishes the transactions
and a signature of the final state to the replications nodes called Verifiers.
Verifiers execute the same transactions on their copies of the state, and publish their computed signatures of the state as confirmations. The published
confirmations serve as votes for the consensus algorithm.
The "recent blockhash" is another important part of this. A transaction references a recent blockhash, which is part of the Proof of History sequence. If two transactions reference the same blockhash, they are counted as duplicates by the network, even if they come from two different users.
More information can be found at https://docs.solana.com/developing/programming-model/transactions#recent-blockhash
There is only one PoH generator(Block producer) at a time. other nodes are just validating.
I cannot comment to Jon C but the answer is wrong. you can use the same recent blockhash otherwise there is no way solana can handle 50000 tps when block time is around 0.4 sec.

How to account for clock offsets in a distributed system?

Background
I have a system consisting of several distributed services, each of which is continuously generating events and reporting these to a central service.
I need to present a unified timeline of the events, where the ordering in the timeline corresponds to the moment event occurred. The frequency of event occurrence and the network latency is such that I cannot simply use time of arrival at the central collector to order the events.
E.g. in the following scenario:
E1 needs to be rendered in the timeline above E2, despite arriving at the collector afterwards, which means the events need to come with timestamp metadata. This is where the problem arises.
Problem
Due to constraints on how the environment is set up, it is not possible to ensure that the local time services on each machine are reliably aware of current UTC time. I can assume that each machine can accurately gauge relative time, i.e. the clock speeds are close enough to make measurement of short timespans identical, but problems like NTP misconfiguration/partitioning make it impossible to guarantee that every machine agrees on the current UTC time.
This means that a naive approach of simply generating a local timestamp for each event as it occurs, then ordering events using that will not work: every machine has its own opinion of what universal time is.
So the question is: how can I recover an ordering for events generated in a distributed system where the clocks do not agree?
Approaches I've considered
Most solutions I find online go down the path of trying to synchronize all the clocks, which is not possible for me since:
I don't control the machines in question
The reason the clocks are out of sync in the first place is due to network flakiness, which I can't fix
My own idea was to query some kind of central time service every time an event is generated, then stamp that event with the retrieved time minus network flight time. This gets hairy, because I have to add another service to the system and ensure its availability (I'm back to square zero if the other services can't reach this one). I was hoping there is some clever way to do this that doesn't require me to centralize timekeeping in this way.
A simple solution, somewhat inspired by your own at the end, is to periodically ping what I'll call the time-source server. In the ping include the service's chip clock; the time-source echos that and includes its timestamp. The service can then deduce the round-trip-time and guess that the time-source's clock was at the timestamp roughly round-trip-time/2 nanoseconds ago. You can then use this as an offset to the local chip clock to determine a globalish time.
You don't have to use a different service for this; the Collector server will do. The important part is that you don't have to ask call the time-source server at every request; it removes it from the critical path.
If you don't want a sawtooth function for the time, you can smooth the time difference
Congratulations, you've rebuilt NTP!

How do Raft guarantee consistency when network partition occurs?

Suppose a network partition occurs and the leader A is in minority. Raft will elect a new leader B but A thinks it's still the leader for some time. And we have two clients. Client 1 writes a key/value pair to B, then Client 2 reads the key from A before A steps down. Because A still believes it's the leader, it will return stale data.
The original paper says:
Second, a leader must check whether it has been deposed
before processing a read-only request (its information
may be stale if a more recent leader has been elected).
Raft handles this by having the leader exchange heartbeat
messages with a majority of the cluster before responding
to read-only requests.
Isn't it too expensive? The leader has to talk to majority nodes for every read request?
I'm surprised there's so much ambiguity in the answers, as this is quite well known:
Yes, to get linearizable reads from Raft you must round-trip through the quorum.
There are no shortcuts here. In fact, both etcd and Consul committed an error in their implementations of Raft and caused linearizability violations. The implementors erroneously believed (as did many people, including myself) that if a node thought of itself as a leader, it was the leader.
Raft does not guarantee this at all. A node can be a leader and not learn of its loss of leadership because of the very network partition that caused someone else to step up in the first place. Because clock error is taken as unbounded in distributed systems literature, no amount of waiting can solve this race condition. New leaders cannot simply "wait it out" and then decide "okay, the old leader must have realized it by now". This is just typical lease lock stuff - you can't use clocks with unbounded error to make distributed decisions.
Jepsen covered this error detail, but to quote the conclusion:
[There are] three types of reads, for varying performance/correctness needs:
Anything-goes reads, where any node can respond with its last known value. Totally available, in the CAP sense, but no guarantees of monotonicity. Etcd does this by default, and Consul terms this “stale”.
Mostly-consistent reads, where only leaders can respond, and stale reads are occasionally allowed. This is what etcd currently terms “consistent”, and what Consul does by default.
Consistent reads, which require a round-trip delay so the leader can confirm it is still authoritative before responding. Consul now terms this consistent.
Just to tie in with some other results from literature, this very problem was one of the things Flexible Paxos showed it could handle. The key realization in FPaxos is that you have two quorums: one for leader election and one for replication. The only requirement is that these quorums intersect, and while a majority quorum is guaranteed to do so, it is not the only configuration.
For example, one could require that every node participate in leader election. The winner of this election could be the sole node serving requests - now it is safe for this node to serve reads locally, because it knows for a new leader to step up the leadership quorum would need to include itself. (Of course, the tradeoff is that if this node went down, you could not elect a new leader!)
The point of FPaxos is that this is an engineering tradeoff you get to make.
The leader doesn't have to talk to a majority for each read request. Instead, as it continuously heartbeats with its peers it maintains a staleness measure: how long it has been since it has received an okay from a quorum? The leader will check this staleness measure and return a StalenessExceeded error. This gives the calling system the chance to connect to another host.
It may be better to push that staleness check to the calling systems; let the low-level raft system have higher Availability (in CAP terms) and let the calling systems decide at what staleness level to fail over. This can be done in various ways. You could have the calling systems heartbeat to the raft system, but my favorite is to return the staleness measure in the response. This last can be improved when the client includes its timestamp in the request, the raft server echos it back in the response and the client adds the round trip time to raft staleness. (NB. Always use the nano clock in measuring time differences because it doesn't go backwards like the system clock does.)
Not sure whether timeout configure can solve this problem:
2 x heartbeat interval <= election timeout
which means when network partition happens leader A is single leader and write will fail because leader A locates in the minority and leader A can not get echo back from majority of the node and step back as a follower.
After that leader B is selected, it can catch up the latest change from at least one of the followers and then client can perform read and write on leader B.
Question
The leader has to talk to majority nodes for every read request
Answer: No.
Explaination
Let's understand it with code example from HashiCorp's raft implementation.
There are 2 timeouts involved: (their names are self explanatory but link has been included to read detailed definition.)
LeaderLease timeout[1]
Election timeout[2]
Example of their values are 500ms & 1000ms respectively[3]
Must condition for node to start is: LeaderLease timeout < Election timeout [4,5]
Once a node becomes Leader, it is checked "whether it is heartbeating with quorum of followers or not"[6, 7]. If heartbeat stops then its tolerated till LeaderLease timeout[8]. If Leader is not able to contact quorum of nodes for LeaderLease timeout then Leader node has to become Follower[9]
Hence for example given in question, Node-A must step down as Leader before Node-B becomes Leader. Since Node-A knows its not a Leader before Node-B becomes Leader, Node-A will not serve the read or write request.
[1]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L141
[2]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L179
[3]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L230
[4]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L272
[5]https://github.com/hashicorp/raft/blob/9ecdba6a067b549fe5149561a054a9dd831e140e/config.go#L275
[6]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L456
[7]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L762
[8]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L891
[9]https://github.com/hashicorp/raft/blob/ba082378c3436b5fc9af38c40587f2d9ee59cccf/raft.go#L894

Resources