Expired Tuples in Apache Storm Tumbling Window - apache-storm

I have implemented a Tumbling Window (Count based) of size 100. On running the topology, I see that the count of new tuples (inputWindow.get) and the count of expired tuples (inputWindow.getExpired) are both 100. I have set message time out of 600seconds. With this time timeout, I had expected no tuple to expire. What could be the reason for tuples expiring?
I have set the bolt as
bolt.withTumblingWindow(Count.of(100))
The bolt has parallelism_hint of 120
builder.setBolt("bolt", bolt.withTumblingWindow(Count.of(100)), 120).shuffleGrouping("spout")

I think maybe you're misunderstanding what expired tuples are. Maybe it would have been more friendly to call them "evicted tuples".
They are tuples that have been evicted from the current window, but were present in the last window. They are not tuples whose message timeouts have expired, though of course they may have also expired in this sense.
So let's say you receive 200 tuples. You first window will be tuple 0-99, with no expired tuples. Your second window will be tuple 100-199, where tuple 0-99 are expired.
The reason this is useful is in the case of sliding windows, where the windows are not disjoint. In that case you may get e.g. a window that is 0-99, then 50-149, then 99-199. There it can be helpful if you get told "tuples 0-49 are no longer in the window" rather than having to compute this yourself.
For more information on this, take a look at the class controlling windows at https://github.com/apache/storm/blob/925422a5b5ad1c3329a2c2b44db460ae94f70806/storm-client/src/jvm/org/apache/storm/windowing/WindowManager.java

Related

Multiple Micro batch Storm Topology

First of all sincere apologies if my question is duplicate, I tried searching but couldn’t find relevant answer to my question
First of all sincere apologies, if i asking something very basic , as I am a beginner in Storm.
And also if my question is duplicate, As i tried searching but couldn’t find relevant answer to my question
Please advise on my below use case.
My USE Case :
I have a Spout reading data from one internal messaging mechanism, as its receiving & emitting tuples with very high frequency(100s/second).
Now every apart from data, every tuple also has a frequency(int) (as there can be total 4-5 type of frequency).
Now I need to design a Bolt to batch/Pool all tuples and only emit periodically on frequency, with a feature of emitting only latest tuple (in case of duplicate received before next batch), as we have a string-based key in tuple data to identify a duplicate.
e.g.
So all tuple with 25 seconds as frequency will be pooled together and will be emitted by Bolt on every 25 seconds (in case of duplicate tuple received within 25 seconds only latest one will be considered).
Similar like all tuple with 10 minutes as frequency will be pooled together and will be emitted by Bolt on every 10 min interval (in case of duplicate tuple received within 10 min only latest one will be considered).
** Now since we can have a 4-5 type of frequencies (e.g. 10 sec, 25 sec, 10 min, 20 min etc. , these are as configured), and every tuple should be clubbed into an appropriate batch and emitted (as exampled above).
Fyi. For Bolt grouping, I have used "fieldsGrouping" as below configuration.
*.fieldsGrouping("FILTERING_BOLT",new Fields(PUBLISHING_FREQUENCY));*
Please help or advise on, what's the best approach for my use case, as just couldn't think of implementing anything to handle flowing of concurrent tuples and managing Storm's internal parallelism.
It sounds like you want windowing bolts https://storm.apache.org/releases/2.0.0-SNAPSHOT/Windowing.html. Probably you want a tumbling window (i.e. no overlap between window intervals)
Windowing bolts let you set an interval they should emit windows at (e.g. every 10 seconds), and then the bolt will buffer up all tuples it receives for the previous 10 seconds before calling an execute method you supply.
The structure I think you want is something like e.g.
spout -> splitter -> 5 second window bolt
-> 10 second window bolt
The splitter should receive the tuples, examine the frequency field and send the tuple on to the right window bolt. You make it do this by declaring a stream for each frequency type.
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare("5-sec-stream", ...);
declarer.declare("10-sec-stream", ...);
}
public void execute(Tuple input) {
if (frequencyIsFive(input)) {
collector.emit("5-sec-stream", new Values(input.getValues()))
}
//more cases here
}
Then when declaring your topology you do
topologyBuilder.setBolt("splitter", new SplitterBolt())
.shuffleGrouping("spout")
topologyBuilder.setBolt("5-second-window", new YourWindowingBolt())
.globalGrouping("splitter", "5-sec-stream")
to make all the 5 second tuples go to the 5 second windowing bolt.
See https://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html for more information on this, particularly the parts about streams and groupings.
There's a simple example of a windowing topology at https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/SlidingWindowTopology.java.
One thing you may want to be aware of is Storm's tuple timeout. If you need a window of e.g. 10 minutes, you need to bump the tuple timeout significantly from the default of 30 seconds, so the tuples don't time out while waiting in queue. You can do this by setting e.g. conf.setMessageTimeoutSecs(15*60) when configuring your topology. You want there to be a bit of leeway between the window intervals and the tuple timeout, because you want to avoid your tuples timing out as much as possible.

perf_event_open: handling last recorded sample

When counting for events based on a specific sampling period, how to handle the last recorded sample when the last counter value of the leader is less than the sampling period.
Update:
I have checked the value of type which is a member of struct perf_event_header. For the last recorded sample this value is zero and according to perf_event.h header file, it does not seem that the value of zero has a corresponding sample record type!
To put my question in other words: How does perf_event API deal with the case when the workload finishes execution but the group leader counter value is less than the value of the sampling period? Is the data discarded at this case?
How does perf_event API deal with the case when the workload finishes execution but the group leader counter value is less than the value of the sampling period?
Nothing happens. If the event count is not reached yet, no sample is written.
You should consider that samples are typically statistical information.
If you really need to know you could possibly use some form of ptrace and manually read the counter value before the thread terminates.
If you read a perf_event_header with a type == 0, I would be concerned. I don't think that should ever happen.
Edit:
As per the manpage, I believe you cannot read the remaining value from that particular event because sampling and counting events are exclusive.
Events come in two flavors: counting and sampled. A counting event
one that is used for counting the aggregate number of events that.
In general, counting event results are gathered with a
read(2) call. A sampling event periodically writes measurements to a buffer
that can then be accessed via mmap(2).

Efficiently implementing Birman-Schiper-Stephenson(BSS) protocol's delay queue

I am using the Birman-Schiper-Stephenson protocol of distributed system with the current assumption that peer set of any node doesn't change. As the protocol dictates, the messages which have come out of causal order to a node have to be put in a 'delay queue'. My problem is with the organisation of the delay queue where we must implement some kind of order with the messages. After deciding the order we will have to make a 'Wake-Up' protocol which would efficiently search the queue after the current timestamp is modified to find out if one of the delayed messages can be 'woken-up' and accepted.
I was thinking of segregating the delayed messages into bins based on the points of difference of their vector-timestamps with the timestamp of this node. But the number of bins can be very large and maintaining them won't be efficient.
Please suggest some designs for such a queue(s).
Sorry about the delay -- didn't see your question until now. Anyhow, if you look at Isis2.codeplex.com you'll see that in Isis2, I have a causalsend implementation that employs the same vector timestamp scheme we described in the BSS paper. What I do is to keep my messages in a partial order, sorted by VT, and then when a delivery occurs I can look at the delayed queue and deliver off the front of the queue until I find something that isn't deliverable. Everything behind it will be undeliverable too.
But in fact there is a deeper insight here: you actually never want to allow the queue of delayed messages to get very long. If the queue gets longer than a few messages (say, 50 or 100) you run into the problem that the guy with the queue could be holding quite a few bytes of data and may start paging or otherwise running slowly. So it becomes a self-perpetuating cycle in which because he has a queue, he is very likely to be dropping messages and hence enqueuing more and more. Plus in any case from his point of view, the urgent thing is to recover that missed message that caused the others to be out of order.
What this adds up to is that you need a flow control scheme in which the amount of pending asynchronous stuff is kept small. But once you know the queue is small, searching every single element won't be very costly! So this deeper perspective says flow control is needed no matter what, and then because of flow control (if you have a flow control scheme that works) the queue is small, and because the queue is small, the search won't be costly!

Searching an algorithm similar to producer-consumer

I would like to ask if someone would have an idea on the best(fastest) algorithm for the following scenario:
X processes generate a list of very large files. Each process generates one file at a time
Y processes are being notified that a file is ready. Each Y process has its own queue to collect the notifications
At a given time 1 X process will notify 1 Y process through a Load Balancer that has the Round Rubin algorithm
Each file has a size and naturally, bigger files will keep both X and Y more busy
Limitations
Once a file gets on a Y process it would be impractical to remove it and move it to another Y process.
I can't think of other limitations at the moment.
Disadvantages to this approach
sometimes X falls behind(files are no longer pushed). It's not really impacted by the queueing system and no matter if I change it it will still have slow/good times.
sometimes Y falls behind(a lot of files gather in the queues). Again, the same thing like before.
1 Y process is busy with a very large file. It also has several small files in its queue that could be taken on by other Y processes.
The notification itself is through HTTP and seems somehow unreliable sometimes. Notifications fail and debugging has not revealed anything.
There are some more details that would help to see the picture more clearly.
Y processes are DB threads/jobs
X processes are web apps
Once files reach the X processes, these would also burn resources from the DB side by querying it. It has an impact on the producing part
Now I considered the following approach:
X will produce files like it has before but will not notify Y. It will hold a buffer (table) to populate the file list
Y will constantly search for files in the buffer and retrieve them itself and store them in its own queue.
Now would this change be practical? Like I said, each Y process has its own queue, it doesn't seem to be efficient to keep it anymore. If so, then I'm still undecided on the next bit:
How to decide which files to fetch
I've read through the knapsack problem and I think that has application if I would have the entire list of files from the beginning which I don't. Actually, I do have the list and the size of each file but I wouldn't know when each file would be ready to be taken.
I've gone through the producer-consumer problem but that centers around a fixed buffer and optimising that but in this scenario the buffer is unlimited and I don't really care if it is large or small.
The next best option would be a greedy approach where each Y process locks on the smallest file and takes it. At first it does appear to be the fastest approach and I'm currently building a simulation to verify that but a second opinion would be fantastic.
Update
Just to be sure that everyone gets the big picture, I'm linking here a fast-done diagram.
Jobs are independent from Processes. They will run at a speed and process how many files are possible.
When a Job finishes with a file it will send a HTTP request to the LB
Each process queues requests (files) coming from the LB
The LB works on a round robin rule
Diagram
The current LB idea is not good
The load balancer as you've described it is a bad idea because it's needlessly required to predict the future, which you are saying is impossible. Moreover, round-robin is a terrible scheduling strategy when jobs have varying lengths.
Just have consumers notify the LB when they're idle. When a new job arrives from a producer, it selects the first idle consumer and sends the job there. When there are no idle consumers, the producer's job is queued in the LB waiting for a free consumer to appear.
This way consumers will always be optimally busy.
You say "Having one queue to serve 100 apps (for example) would be inefficient." This is a huge leap of intuition that's probably wrong. A work queue that's only handling file names can be fast. You need it only to be 100 times faster (because you infer there are 100 consumers) than the average "very large file" handling operation. File handling is normally 10th of seconds or seconds. A queue handler based, say, on an Apache mod or Redis for two random choices, could pretty easily serve 10,000 requests per second. This is a factor of 10 away from being a bottleneck.
If you select from idle consumers on a FIFO basis, the behavior will be round-robin when all jobs are equal length.
If the LB absolutelly cannot queue work
Then let Ty(t) be the total future time needed to complete the work in the queue of consumer y at the current epoch t. The LB's goal is to make Ty(t) values equal for all y and t. This is the ideal.
To get as close as possible to the ideal, it needs an internal model to compute these Ty(t) values. When a new job arrives from a producer at epoch t, it finds consumer y with the the minimum Ty(t) value, assigns the job to this y, and adjusts the model accordingly. This is a variation of the "least time remaining" scheduling strategy, which is optimal for this situation.
The model must inevitably be an approximation. The quality of the approximation will determine its usefulness.
A standard approach (e.g. from OS scheduling), will be to maintain a pair [t, T]_y for each consumer y. T is the estimate of Ty(t) that was computed at the past epoch t. Thus at a later epoch t+d, we can estimate Ty(t+d) as max(T-t,0). The max is because for d>t, the estimated job time has expired, so the consumer should be complete.
The LB uses whatever information it can get to update the model. Examples are estimates of time a job will require (from your description probably based on file size and other characteristics), notification that the consumer has actually finished a job (LB decreases T by the esimated duration of the completed job and updates t), assignment of a new job (LB increases T by the estimated duration of the new job and updates t), and intermediate progress updates of estimated time remaining from consumers during long jobs.
If the information available to the LB is detailed, you will want to replace the total time T in the [t, T]_y pair with a more complete model of the work queued at y: for example a list of estimated job durations, where the head of the list is the one currently being executed.
The more accurate the LB model, the less likely a consumer will starve when work is available, which is what you are trying to avoid.

A priority queue which allows efficient priority update?

UPDATE: Here's my implementation of Hashed Timing Wheels. Please let me know if you have an idea to improve the performance and concurrency. (20-Jan-2009)
// Sample usage:
public static void main(String[] args) throws Exception {
Timer timer = new HashedWheelTimer();
for (int i = 0; i < 100000; i ++) {
timer.newTimeout(new TimerTask() {
public void run(Timeout timeout) throws Exception {
// Extend another second.
timeout.extend();
}
}, 1000, TimeUnit.MILLISECONDS);
}
}
UPDATE: I solved this problem by using Hierarchical and Hashed Timing Wheels. (19-Jan-2009)
I'm trying to implement a special purpose timer in Java which is optimized for timeout handling. For example, a user can register a task with a dead line and the timer could notify a user's callback method when the dead line is over. In most cases, a registered task will be done within a very short amount of time, so most tasks will be canceled (e.g. task.cancel()) or rescheduled to the future (e.g. task.rescheduleToLater(1, TimeUnit.SECOND)).
I want to use this timer to detect an idle socket connection (e.g. close the connection when no message is received in 10 seconds) and write timeout (e.g. raise an exception when the write operation is not finished in 30 seconds.) In most cases, the timeout will not occur, client will send a message and the response will be sent unless there's a weird network issue..
I can't use java.util.Timer or java.util.concurrent.ScheduledThreadPoolExecutor because they assume most tasks are supposed to be timed out. If a task is cancelled, the cancelled task is stored in its internal heap until ScheduledThreadPoolExecutor.purge() is called, and it's a very expensive operation. (O(NlogN) perhaps?)
In traditional heaps or priority queues I've learned in my CS classes, updating the priority of an element was an expensive operation (O(logN) in many cases because it can only be achieved by removing the element and re-inserting it with a new priority value. Some heaps like Fibonacci heap has O(1) time of decreaseKey() and min() operation, but what I need at least is fast increaseKey() and min() (or decreaseKey() and max()).
Do you know any data structure which is highly optimized for this particular use case? One strategy I'm thinking of is just storing all tasks in a hash table and iterating all tasks every second or so, but it's not that beautiful.
How about trying to separate the handing of the normal case where things complete quickly from the error cases?
Use both a hash table and a priority queue. When a task is started it gets put in the hash table and if it finishes quickly it gets removed in O(1) time.
Every one second you scan the hash table and any tasks that have been a long time, say .75 seconds, get moved to the priority queue. The priority queue should always be small and easy to handle. This assumes that one second is much less than the timeout times you are looking for.
If scanning the hash table is too slow, you could use two hash tables, essentially one for even-numbered seconds and one for odd-numbered seconds. When a task gets started it is put in the current hash table. Every second move all the tasks from the non-current hash table into the priority queue and swap the hash tables so that the current hash table is now empty and the non-current table contains the tasks started between one and two seconds ago.
There options are a lot more complicated than just using a priority queue, but are pretty easily implemented should be stable.
To the best of my knowledge (I wrote a paper about a new priority queue, which also reviewed past results), no priority queue implementation gets the bounds of Fibonacci heaps, as well as constant-time increase-key.
There is a small problem with getting that literally. If you could get increase-key in O(1), then you could get delete in O(1) -- just increase the key to +infinity (you can handle the queue being full of lots of +infinitys using some standard amortization tricks). But if find-min is also O(1), that means delete-min = find-min + delete becomes O(1). That's impossible in a comparison-based priority queue because the sorting bound implies (insert everything, then remove one-by-one) that
n * insert + n * delete-min > n log n.
The point here is that if you want a priority-queue to support increase-key in O(1), then you must accept one of the following penalties:
Not be comparison based. Actually, this is a pretty good way to get around things, e.g. vEB trees.
Accept O(log n) for inserts and also O(n log n) for make-heap (given n starting values). This sucks.
Accept O(log n) for find-min. This is entirely acceptable if you never actually do find-min (without an accompanying delete).
But, again, to the best of my knowledge, no one has done the last option. I've always seen it as an opportunity for new results in a pretty basic area of data structures.
Use Hashed Timing Wheel - Google 'Hashed Hierarchical Timing Wheels' for more information. It's a generalization of the answers made by people here. I'd prefer a hashed timing wheel with a large wheel size to hierarchical timing wheels.
Some combination of hashes and O(logN) structures should do what you ask.
I'm tempted to quibble with the way you're analyzing the problem. In your comment above, you say
Because the update will occur very very frequently. Let's say we are sending M messages per connection then the overall time becomes O(MNlogN), which is pretty big. – Trustin Lee (6 hours ago)
which is absolutely correct as far as it goes. But most people I know would concentrate on the cost per message, on the theory that as you app has more and more work to do, obviously it's going to require more resources.
So if your application has a billion sockets open simultaneously (is that really likely?) the insertion cost is only about 60 comparisons per message.
I'll bet money that this is premature optimization: you haven't actually measured the bottlenecks in you system with a performance analysis tool like CodeAnalyst or VTune.
Anyway, there's probably an infinite number of ways of doing what you ask, once you just decide that no single structure will do what you want, and you want some combination of the strengths and weaknesses of different algorithms.
One possiblity is to divide the socket domain N into some number of buckets of size B, and then hash each socket into one of those (N/B) buckets. In that bucket is a heap (or whatever) with O(log B) update time. If an upper bound on N isn't fixed in advance, but can vary, then you can create more buckets dynamically, which adds a little complication, but is certainly doable.
In the worst case, the watchdog timer has to search (N/B) queues for expirations, but I assume the watchdog timer is not required to kill idle sockets in any particular order!
That is, if 10 sockets went idle in the last time slice, it doesn't have to search that domain for the one that time-out first, deal with it, then find the one that timed-out second, etc. It just has to scan the (N/B) set of buckets and enumerate all time-outs.
If you're not satisfied with a linear array of buckets, you can use a priority queue of queues, but you want to avoid updating that queue on every message, or else you're back where you started. Instead, define some time that's less than the actual time-out. (Say, 3/4 or 7/8 of that) and you only put the low-level queue into the high-level queue if it's longest time exceeds that.
And at the risk of stating the obvious, you don't want your queues keyed on elapsed time. The keys should be start time. For each record in the queues, elapsed time would have to be updated constantly, but the start time of each record doesn't change.
There's a VERY simple way to do all inserts and removes in O(1), taking advantage of the fact that 1) priority is based on time and 2) you probably have a small, fixed number of timeout durations.
Create a regular FIFO queue to hold all tasks that timeout in 10 seconds. Because all tasks have identical timeout durations, you can simply insert to the end and remove from the beginning to keep the queue sorted.
Create another FIFO queue for tasks with 30-second timeout duration. Create more queues for other timeout durations.
To cancel, remove the item from the queue. This is O(1) if the queue is implemented as a linked list.
Rescheduling can be done as cancel-insert, as both operations are O(1). Note that tasks can be rescheduled to different queues.
Finally, to combine all the FIFO queues into a single overall priority queue, have the head of every FIFO queue participate in a regular heap. The head of this heap will be the task with the soonest expiring timeout out of ALL tasks.
If you have m number of different timeout durations, the complexity for each operation of the overall structure is O(log m). Insertion is O(log m) due to the need to look up which queue to insert to. Remove-min is O(log m) for restoring the heap. Cancelling is O(1) but worst case O(log m) if you're cancelling the head of a queue. Because m is a small, fixed number, O(log m) is essentially O(1). It does not scale with the number of tasks.
Your specific scenario suggests a circular buffer to me. If the max. timeout is 30 seconds and we want to reap sockets at least every tenth of a second, then use a buffer of 300 doubly-linked lists, one for each tenth of a second in that period. To 'increaseTime' on an entry, remove it from the list it's in and add it to the one for its new tenth-second period (both constant-time operations). When a period ends, reap anything left over in the current list (maybe by feeding it to a reaper thread) and advance the current-list pointer.
You've got a hard-limit on the number of items in the queue - there is a limit to TCP sockets.
Therefore the problem is bounded. I suspect any clever data structure will be slower than using built-in types.
Is there a good reason not to use java.lang.PriorityQueue? Doesn't remove() handle your cancel operations in log(N) time? Then implement your own waiting based on the time until the item on the front of the queue.
I think storing all the tasks in a list and iterating through them would be best.
You must be (going to) run the server on some pretty beefy machine to get to the limits where this cost will be important?

Resources