PBFT view-change: What happens to committed operations after the valid snapshot? - algorithm

PBFT says that if the timer of backup i expires in view v then it starts a view change for v+1 by multicasting <view-change, v+1, n, C, P, i> where n is the sequence number of the last stable checkpoint s and P is a set containing a set Pm for each request m that prepared at i with a sequence number higher than n.
Checkpoints are taken periodically, so there can be requests prepared at i with a sequence number higher than n that are already committed. We don't want these to be included in P, as they are already committed.
So how does PBFT handle that?

I think the protocol is run again for those messages. When a view change happens, all the nodes would be at the same checkpoint.
When the "new" primary for view v + 1 receives 2f valid view-change messages, it multicasts a new-view message. In that message, it includes:
V: the set of valid view-change messages received.
P: the set of pre-prepare messages for unprocessed requests. These messages are calculated as follows:
From the last stable checkpoint, you get the sequence number of the last executed request. This value is min-s.
Take the largest sequence number of all prepared messages you have received. This value is max-s.
The primary then generates one pre-prepare message for each sequence number between min-s and max-s.
Each node keeps a log of the messages that are not yet covered by a checkpoint, so requests it has already executed don't need to be processed again.
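The min-s / max-s computation above can be sketched roughly as follows. This is a simplified illustration, not a full PBFT implementation: the ViewChange class, the NULL_DIGEST placeholder and the build_pre_prepares function are hypothetical names, signatures and the checkpoint proof C are left out, and sequence numbers with no prepared request are assumed to be filled with a no-op.

```python
from dataclasses import dataclass, field

NULL_DIGEST = "null"  # hypothetical placeholder digest for a no-op request


@dataclass
class ViewChange:
    """Simplified <view-change, v+1, n, C, P, i>; the proof C is omitted."""
    replica: int
    stable_seq: int  # n: sequence number of the sender's last stable checkpoint
    prepared: dict = field(default_factory=dict)  # P: seq -> (view, digest)


def build_pre_prepares(new_view, view_changes):
    """Return {seq: (new_view, digest)} for every seq in (min_s, max_s]."""
    # min-s: latest stable checkpoint reported in the view-change messages.
    min_s = max(vc.stable_seq for vc in view_changes)
    # max-s: highest sequence number of any prepared request that was reported.
    prepared_seqs = [seq for vc in view_changes for seq in vc.prepared]
    max_s = max(prepared_seqs, default=min_s)

    pre_prepares = {}
    for seq in range(min_s + 1, max_s + 1):
        candidates = [vc.prepared[seq] for vc in view_changes if seq in vc.prepared]
        if candidates:
            # Re-propose the digest that prepared in the highest view.
            _, digest = max(candidates, key=lambda c: c[0])
            pre_prepares[seq] = (new_view, digest)
        else:
            # Nothing prepared for this number: fill the gap with a no-op.
            pre_prepares[seq] = (new_view, NULL_DIGEST)
    return pre_prepares
```

A request that was already committed at some replica simply reappears here under its old sequence number; a replica that has executed it can recognise that from its log and skip re-execution.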

Related

Kafka job assignment & completion

I have a requirement where I receive a record in a Kafka stream and divide this single record into n records based on some logic. These n records are pushed to another stream and processed in parallel. I need to know when all n jobs have completed so that I can send a response back. How can I achieve this?
I think the simplest way to do it is to set up a consumer that subscribes to that stream and processes it.
If I understand correctly, you want to know that in fact "N messages were processed". AFAIK, there's not a universal solution for this, but here's what I would do.
(Note: My solution assumes that you have a single consumer consuming the second stream.)
To know that N records have been processed, you have to know what N is. So when you divide the single record into N records, store that N in the records themselves (e.g. if your records are JSON objects, add a "n": 5 key-value pair). Then, in the consumer(s) that process the 2nd stream (the one that contains N records), with each message you consume you do the following:
consume message and process message
increment a counter (let's call it K) - this could be an in-memory or a persistent storage-backed counter
compare K to N
if equal, all messages have been processed (at this point you might want to exit)
otherwise, there are more messages to be processed; continue consuming
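A minimal sketch of the counting steps above, in plain Python with no Kafka client involved. The job_id field, split_record and CompletionTracker are hypothetical names added for illustration; keying the counter by job_id is an assumption that lets several original records be in flight at once, and the counter here is in-memory only.

```python
from collections import defaultdict


def split_record(job_id, parts):
    """Fan one record out into N sub-records, each carrying N itself."""
    n = len(parts)
    return [{"job_id": job_id, "n": n, "payload": part} for part in parts]


class CompletionTracker:
    """Counts processed sub-records (K) per job and compares K to N."""

    def __init__(self):
        self._done = defaultdict(int)  # job_id -> K

    def record_processed(self, message):
        """Call after a sub-record is processed; True once K == N."""
        job_id = message["job_id"]
        self._done[job_id] += 1
        if self._done[job_id] == message["n"]:
            del self._done[job_id]  # job finished, drop its counter
            return True
        return False


tracker = CompletionTracker()
for message in split_record("order-1", ["a", "b", "c"]):
    if tracker.record_processed(message):
        print("all sub-records of", message["job_id"], "processed")
```

With more than one consumer on the second stream, the counter would have to live in shared storage rather than in process memory.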

Expired Tuples in Apache Storm Tumbling Window

I have implemented a tumbling window (count based) of size 100. On running the topology, I see that the count of new tuples (inputWindow.get) and the count of expired tuples (inputWindow.getExpired) are both 100. I have set a message timeout of 600 seconds. With this timeout, I had expected no tuple to expire. What could be the reason for tuples expiring?
I have set the bolt as
bolt.withTumblingWindow(Count.of(100))
The bolt has parallelism_hint of 120
builder.setBolt("bolt", bolt.withTumblingWindow(Count.of(100)), 120).shuffleGrouping("spout")
I think maybe you're misunderstanding what expired tuples are. Maybe it would have been more friendly to call them "evicted tuples".
They are tuples that have been evicted from the current window, but were present in the last window. They are not tuples whose message timeouts have expired, though of course they may have also expired in this sense.
So let's say you receive 200 tuples. Your first window will be tuples 0-99, with no expired tuples. Your second window will be tuples 100-199, where tuples 0-99 are expired.
The reason this is useful is in the case of sliding windows, where the windows are not disjoint. In that case you may get e.g. a window that is 0-99, then 50-149, then 100-199. There it can be helpful if you get told "tuples 0-49 are no longer in the window" rather than having to compute this yourself.
For more information on this, take a look at the class controlling windows at https://github.com/apache/storm/blob/925422a5b5ad1c3329a2c2b44db460ae94f70806/storm-client/src/jvm/org/apache/storm/windowing/WindowManager.java
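To make the difference between "new" and "expired" concrete, here is a small simulation in plain Python. It is not Storm code: count_windows is a hypothetical helper, and it only emits full windows, which is a simplification made to match the example above.

```python
def count_windows(tuples, window_length, sliding_interval):
    """Yield (window, new, expired) for count-based windows.

    new     = tuples in this window that were not in the previous one
    expired = tuples in the previous window that are no longer in this one
    A tumbling window is the case sliding_interval == window_length.
    """
    previous = []
    for end in range(window_length, len(tuples) + 1, sliding_interval):
        window = tuples[end - window_length:end]
        prev_set, cur_set = set(previous), set(window)
        new = [t for t in window if t not in prev_set]
        expired = [t for t in previous if t not in cur_set]
        yield window, new, expired
        previous = window


stream = list(range(200))

# Tumbling window of 100: the second window expires all of the first.
for window, new, expired in count_windows(stream, 100, 100):
    print(window[0], window[-1], len(new), len(expired))

# Sliding window of 100 that advances by 50: only 50 tuples expire per step.
for window, new, expired in count_windows(stream, 100, 50):
    print(window[0], window[-1], len(new), len(expired))
```

In the tumbling case every window after the first reports 100 new and 100 expired tuples, which matches what the question observed.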

Schedule sending messages to consumers at different rate

I'm looking for the best algorithm for message scheduling. By message scheduling I mean a way to send messages on the bus when we have many consumers at different rates.
Example:
Suppose that we have data D1 to Dn:
. D1 is sent to many consumers: C1 every 5 ms, C2 every 19 ms, C3 every 30 ms, Cn every Rn ms
. Dn is sent to C1 every 10 ms, C2 every 31 ms, Cn every 50 ms
What is the best algorithm to schedule these actions with the best performance (CPU, memory, IO)?
Regards
I can think of quite a few options, each with their own costs and benefits. It really comes down to exactly what your needs are -- what really defines "best" for you. I've pseudocoded a couple possibilities below to hopefully help you get started.
Option 1: Execute the following every time unit (in your example, millisecond)
func callEachMs
    time = getCurrentTime()
    for each datum
        for each customer
            if time % datum.customer.rate == 0
                sendMsg()
This has the advantage of requiring no consistently stored memory -- you just check at each time unit whether you should be sending a message. This can also deal with messages that weren't sent at time == 0 -- just store the time the message was initially sent modulo the rate, and replace the conditional with if time % datum.customer.rate == datum.customer.firstMsgTimeMod.
A downside to this method is it is completely reliant on always being called at a rate of 1 ms. If there's lag caused by another process on a CPU and it misses a cycle, you may miss sending a message altogether (as opposed to sending it a little late).
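Option 1 can be sketched in Python roughly like this. The due_messages function and the rates mapping of (datum, consumer) pairs to periods in milliseconds are hypothetical names; the caller is assumed to invoke it once per millisecond.

```python
def due_messages(now_ms, rates):
    """Return the (datum, consumer) pairs whose period divides now_ms."""
    return [pair for pair, period in rates.items() if now_ms % period == 0]


rates = {
    ("D1", "C1"): 5,
    ("D1", "C2"): 19,
    ("Dn", "C1"): 10,
}

print(due_messages(10, rates))  # pairs due at t = 10 ms
print(due_messages(11, rates))  # nothing is due at t = 11 ms
```

The second call shows the downside described above: if the tick at 10 ms is skipped, the check at 11 ms finds nothing, so those sends are lost rather than delayed.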
Option 2: Maintain a list of lists of tuples, where each entry represents the tasks that need to be done at that millisecond. Make your list at least as long as the longest rate divided by the time unit (if your longest rate is 50 ms and you're going by ms, your list must be at least 50 long). When you start your program, place the first time a message will be sent into the queue. And then each time you send a message, update the next time you'll send it in that list.
func buildList(&list)
    for each datum
        for each customer
            if list.size < datum.customer.rate + 1
                list.resize(datum.customer.rate + 1)
            list[datum.customer.rate].push_back(tuple(datum.name, customer.name))
func callEachMs(&list)
    for each (datum.name, customer.name) in list[0]
        sendMsg()
        list[datum.customer.rate].push_back(tuple(datum.name, customer.name))
    list.pop_front()
    list.push_back(empty list)
This has the advantage of avoiding the many unnecessary modulus calculations option 1 required. However, that comes with the cost of increased memory usage. This implementation would also not be efficient if there's a large disparity in the rate of your various messages (although you could modify this to deal with algorithms with longer rates more efficiently). And it still has to be called every millisecond.
Finally, you'll have to think very carefully about what data structure you use, as this will make a huge difference in its efficiency. Because you pop from the front and push from the back at every iteration, and the list is a fixed size, you may want to implement a circular buffer to avoid unneeded moving of values. For the lists of tuples, since they're only ever iterated over (random access isn't needed), and there are frequent additions, a singly-linked list may be your best solution.
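A rough Python sketch of Option 2, using collections.deque as the rotating list. ScheduleWheel and the rates mapping are hypothetical names, periods are assumed to be whole numbers of ticks (at least 1), and tick() is assumed to be called once per millisecond starting at t = 0.

```python
from collections import deque


class ScheduleWheel:
    """One slot per future tick; slot 0 holds what is due right now."""

    def __init__(self, rates):
        self.rates = dict(rates)  # (datum, consumer) -> period in ticks
        size = max(self.rates.values()) + 1
        self.slots = deque([] for _ in range(size))
        for pair, period in self.rates.items():
            # First send happens one full period after start-up.
            self.slots[period].append(pair)

    def tick(self):
        """Advance one tick and return the pairs that must be sent now."""
        due = self.slots.popleft()
        self.slots.append([])
        for pair in due:
            # After the rotation, index period - 1 is `period` ticks from now.
            self.slots[self.rates[pair] - 1].append(pair)
        return due


wheel = ScheduleWheel({("D1", "C1"): 2, ("D1", "C2"): 3})
for t in range(7):
    print(t, wheel.tick())
```

The work per tick is proportional to the number of messages actually due, at the cost of one slot per tick up to the longest period.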
Obviously, there are many more ways that you could do this, but hopefully these ideas can get you started. Also, keep in mind that the nature of the system you're running this on could have a strong effect on which method works better, or whether you want to do something else entirely. For example, both methods require that they can be reliably called at a certain rate. I also haven't described parallelized implementations, which may be the best option if your application supports them.
As Helium_1s2 described, there is a second way, based on what I call a schedule table. This is what I use now, but this solution has its limits.
Suppose that we have one datum to send and two consumers, C1 and C2.
We must extract our schedule table and identify the repeating transmission cycle and the value of the IDLE MINIMUM PERIOD. Looping on the smallest unit of time (1 ms, 1 ns, 1 min or 1 h, depending on the case) is wasteful, because it is not always the best period, and we can optimize this loop as follows.
For example (C1 every 6 and C2 every 9), we notice that there is a cycle which repeats from 0 to 18, with a minimal difference between two consecutive send events equal to 3.
So:
HCF(6,9) = 3 = IDLE MINIMUM PERIOD
LCM(6,9) = 18 = transmission cycle length
LCM/HCF = 6 = size of our schedule table
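The schedule table has one entry per idle period in the cycle (t = 0, 3, 6, 9, 12, 15), each listing the consumers due at that time,
and the sending loop looks like: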
i = 0;
while (1) {
    send(ScheduleTable[i]);           // send whatever is due in this slot
    i = (i + 1) % TABLE_SIZE;         // TABLE_SIZE = LCM / HCF; wrap at end of cycle
    sleep(IDLE_MINIMUM_PERIOD);       // free the CPU for the idle minimum period
}
The problem with this method is that the array grows as the LCM grows, which is the case with bad combinations such as rates that are prime numbers.
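The table construction can be sketched in Python as below. build_schedule_table is a hypothetical helper; it assumes integer periods and that every consumer sends at t = 0.

```python
from functools import reduce
from math import gcd


def build_schedule_table(periods):
    """periods: {consumer: period}. Return (idle_period, table).

    table[k] lists the consumers that are due at time k * idle_period.
    """
    values = list(periods.values())
    idle = reduce(gcd, values)                                # HCF of all periods
    cycle = reduce(lambda a, b: a * b // gcd(a, b), values)   # LCM of all periods
    table = []
    for slot in range(cycle // idle):
        t = slot * idle
        table.append([c for c, p in periods.items() if t % p == 0])
    return idle, table


idle, table = build_schedule_table({"C1": 6, "C2": 9})
print(idle, len(table))   # idle period 3, table of 6 slots
for slot, consumers in enumerate(table):
    print(slot * idle, consumers)

# Pairwise-coprime periods blow the table up: 7 * 11 * 13 = 1001 slots.
print(len(build_schedule_table({"C1": 7, "C2": 11, "C3": 13})[1]))
```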

Would this simple consensus algorithm work?

In order to convince oneself that the complications of standard algorithms such as Paxos and Raft are necessary, one must understand why simpler solutions aren't satisfactory. Suppose that, in order to reach consensus w.r.t a stream of events in a cluster of N machines (i.e., implement a replicated time-growing log), the following algorithm is proposed:
Whenever a machine wants to append a message to the log, it broadcasts the tuple (msg, rnd, prev), where msg is the message, rnd is a random number, and prev is the ID of the last message on the log.
When a machine receives a tuple, it inserts msg as a child of prev, forming a tree.
If a node has more than one child, only the one with highest rnd is considered valid; the path of valid messages through the tree is the main chain.
If a message is part of the main chain, and it is old enough, it is considered decided/final.
If a machine attempts to submit a message and, after some time, it isn't present on the main chain, that means another machine broadcast a message at roughly the same time, so the machine re-broadcasts its message until it is there.
Looks simple, efficient and resilient to crashes. Would this algorithm work?
I think you have a problem if a machine sends two tuples in sequence and the first gets lost (packet loss, corruption, or whatever).
In that case, let's say machine 1 has a prev element ID of 10 and sends two more tuples, (msg, rnd, 10) = 11 and (msg, rnd, 11) = 12, to machine 2.
Machine 2 only receives (msg, rnd, 11) but does not have prev ID 11 in its tree.
Machine 3 receives both, so it inserts them into the main tree.
At this point you would have a desync between the distributed trees.
I propose that machine x send an ack to the sender for each packet after inserting it into its tree, with the sender waiting for the ack before sending the next one.
This way the sender needs to resend the previous message to the machines that failed to ack within a given timeframe.
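The scenario can be reproduced with a small Python model of the proposed tree. Everything here is an assumption made for illustration: the Replica class, explicit message IDs (the question does not say how IDs are derived), and the choice to reject a tuple whose prev is unknown.

```python
class Replica:
    """Keeps the message tree and derives the main chain from it."""

    ROOT = 0  # hypothetical ID of the empty log

    def __init__(self):
        self.children = {self.ROOT: []}  # msg_id -> list of (rnd, child_id)

    def receive(self, msg_id, rnd, prev):
        """Insert msg_id under prev; return False if prev is unknown."""
        if prev not in self.children:
            return False  # parent never arrived, the message cannot be attached
        self.children[prev].append((rnd, msg_id))
        self.children[msg_id] = []
        return True

    def main_chain(self):
        """Follow the child with the highest rnd from the root downwards."""
        chain, node = [], self.ROOT
        while self.children[node]:
            _, node = max(self.children[node])
            chain.append(node)
        return chain


machine2, machine3 = Replica(), Replica()
for machine in (machine2, machine3):
    machine.receive(10, 0.5, Replica.ROOT)  # both already agree on message 10

machine3.receive(11, 0.3, 10)
machine3.receive(12, 0.7, 11)
# Machine 2 never sees message 11, so message 12 has no parent in its tree.
accepted = machine2.receive(12, 0.7, 11)

print(accepted, machine2.main_chain(), machine3.main_chain())
```

The two replicas end up with different main chains, which is the desync described above; an ack from each receiver would let the sender notice the gap and resend message 11.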

GV$PERSISTENT_QUEUES Fields

I am wondering what the fields of the Oracle view GV$PERSISTENT_QUEUES really mean.
The Documentation:
Column                 Datatype  Description
ENQUEUED_MSGS          NUMBER    Number of messages enqueued
DEQUEUED_MSGS          NUMBER    Number of messages dequeued
                                 Note: This column will not be incremented until all the subscribers of the message have dequeued the message and its retention time has elapsed.
...
ENQUEUED_EXPIRY_MSGS   NUMBER    Number of messages enqueued with expiry
ENQUEUED_DELAY_MSGS    NUMBER    Number of messages enqueued with delay
MSGS_MADE_EXPIRED      NUMBER    Number of messages expired by time manager
MSGS_MADE_READY        NUMBER    Number of messages made ready by time manager
...
ENQUEUE_TRANSACTIONS   NUMBER    Number of enqueue transactions
DEQUEUE_TRANSACTIONS   NUMBER    Number of dequeue transactions
Oracle Documentation (11.2)
My Questions:
How can the number of dequeued messages be larger than the number of enqueued messages?
If messages with a certain delay get added to the queue, do they get counted at ENQUEUED_MSGS and ENQUEUED_DELAY_MSGS?
If a message with a certain delay gets delivered after the delay, will it get counted at DEQUEUED_MSGS and MSGS_MADE_READY?
If so, how can MSGS_MADE_READY be larger than ENQUEUED_DELAY_MSGS?
What do the fields ENQUEUED_EXPIRY_MSGS and MSGS_MADE_EXPIRED mean?
What's the difference between ENQUEUED_MSGS and ENQUEUE_TRANSACTIONS, same with dequeueing?
Thank you in advance for help!
I am pretty sure I have found the answers to most of the questions above.
DEQUEUED_MSGS can be greater than ENQUEUED_MSGS after a database reboot. Queue entries that are still in the queue table remain there. After the reboot, those entries get dequeued and are added to the number of dequeued messages, but they are not added to the number of enqueued messages.
The field ENQUEUED_MSGS is the total number of messages that were enqueued into the queue.
The field ENQUEUED_DELAY_MSGS is the total number of messages enqueued with a delay.
ENQUEUED_MSGS - ENQUEUED_DELAY_MSGS = all messages that were enqueued without a delay
The same holds for DEQUEUED_MSGS (all) and MSGS_MADE_READY (only with delay).
I don't know yet what ENQUEUE_TRANSACTIONS and DEQUEUE_TRANSACTIONS mean (maybe DEQUEUE_TRANSACTIONS describes the number of dequeues of one message in a multi-consumer queue), but I won't use those fields.
