Dataflow Distinct operation not scaling - performance

I have a linear pipeline with a "final" stage that outputs around 200k elements (short strings) per second.
However, when I add a Distinct operation after that stage (myPCollection.apply(Distinct.<String>create());), the throughput of the stage right before Distinct drops to around 80k elements per second.
I am processing a bounded collection without a maximum number of workers, so I would expect Dataflow to automatically raise the number of workers to match the workload. Not only does this not happen, but when I manually start the pipeline with many workers (20+), it gets automatically downscaled to a few workers.
How can I make Dataflow upscale the worker pool so that this Distinct operation doesn't dramatically reduce the processing rate of my pipeline?

It may be interesting to look at the implementation of Distinct.
As you can see, it groups elements first, and later it picks up the first element. I've filed a bug to improve this behavior.
In the current implementation all elements are first grouped, which requires writing them to persistent storage, and later picked up. If you have any elements that occur many times (i.e. a hot key), you will have a bottleneck on how much data you can write out.
As a trick, you can add a DoFn that deduplicates elements before you write them out. Something like this:
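import java.util.HashSet;
import java.util.Set;
import org.apache.beam.sdk.transforms.DoFn;

class MapperDedupFn extends DoFn<String, String> {
  // Elements already emitted by this DoFn instance (local to one worker, not global).
  private final Set<String> seenElements = new HashSet<>();

  @ProcessElement
  public void processElement(@Element String element, OutputReceiver<String> receiver) {
    // Set.add returns false when the element was already present, so duplicates are dropped here.
    if (!seenElements.add(element)) return;
    receiver.output(element);
  }
}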
You should be able to stick this before the Distinct function, and hopefully have better performance.
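One caveat with the trick above is that the HashSet grows with the number of distinct elements a worker sees. A minimal sketch of a bounded variant follows, in plain Java without Beam so it can be run on its own. The class name BoundedSeenSet, the preDedup helper and the maxEntries parameter are hypothetical and not part of Beam or of the answer above. Because old entries get evicted, some duplicates can slip through, which is acceptable only because Distinct still runs afterwards.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: remembers at most maxEntries recently seen elements. */
class BoundedSeenSet {
    private final Map<String, Boolean> seen;

    BoundedSeenSet(final int maxEntries) {
        // accessOrder = true turns the map into an LRU cache; the eldest entry is evicted when full.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns true if the element is not currently remembered, and remembers it. */
    boolean firstSighting(String element) {
        return seen.put(element, Boolean.TRUE) == null;
    }

    /** Drops locally remembered duplicates; anything that slips through is left to Distinct. */
    static List<String> preDedup(List<String> input, int maxEntries) {
        BoundedSeenSet set = new BoundedSeenSet(maxEntries);
        List<String> out = new ArrayList<>();
        for (String s : input) {
            if (set.firstSighting(s)) {
                out.add(s);
            }
        }
        return out;
    }
}
```

Inside a DoFn, firstSighting would take the place of the HashSet check; a suitable value for maxEntries depends on worker memory and has to be found by testing.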

Related

Tuning my Apache Storm serializer for performance

I am new to Java and Apache Storm and I want to know how I can make things go faster!
I set up a Storm cluster with 2 physical machines with 8 cores each. The cluster is working perfectly fine. I set up the following test topology in order to measure performance:
builder.setSpout("spout", new RandomNumberSpoutSingle(sizeOfArray), 10);
builder.setBolt("null", new NullBolt(), 4).allGrouping("spout");
RandomNumberSpoutSingle creates an Array like so:
ArrayList<Integer> array = new ArrayList<Integer>();
I fill it with sizeOfArray integers. This array, combined with an ID, builds my tuple.
Now I measure how many tuples per second arrive at the bolt with allGrouping (I look at the Storm GUI's "transferred" value).
If I set sizeOfArray = 1024, about 173000 tuples/s get pushed. Since 1 tuple should be about 4*1024 bytes, around 675 MB/second get moved.
Am I correct so far?
Now my question is: Is Storm/Kryo capable of moving more? How can I tune this? Are there settings I ignored?
I want to serialize more tuples per second! If I use local shuffling, the values skyrocket because nothing has to be serialized, but I need the tuples on all workers.
Neither CPU, Memory nor network are fully occupied.
I think you got the math about right. I am not sure, though, whether the Java overhead for the non-primitive Integer type is included in serialization, which would add some more bytes to the equation. I am also not sure this is the best way of analyzing Storm performance, as that is usually measured in tuples per second rather than in bandwidth.
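For reference, the arithmetic from the question can be written out as a small sketch. It counts only 4 bytes per int and ignores tuple IDs, boxing and serialization overhead, so it is a lower bound on what is actually moved. The class and method names are made up for illustration.

```java
class TupleBandwidth {
    /** Payload bytes per second, counting only 4 bytes per int and ignoring all overhead. */
    static long payloadBytesPerSecond(long tuplesPerSecond, int intsPerTuple) {
        return tuplesPerSecond * intsPerTuple * Integer.BYTES;
    }

    /** Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes). */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }
}
```

With 173000 tuples/s and 1024 ints per tuple this gives 708608000 bytes/s, which is about 675.8 MiB/s, in line with the figure in the question.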
Storm has built in serialization for primitive types, strings, byte arrays, ArrayList, HashMap, and HashSet (source). When I program Java for maximum performance I try to stick with primitive types as much as possible. Would it be feasible to use int[] instead of ArrayList<Integer>? I would expect to gain some performance from that, if it is possible in your setup.
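If switching to int[] is an option, the conversion itself is short. The sketch below only shows how to turn an existing ArrayList<Integer> into a primitive array; the class name is hypothetical, and whether int[] needs to be registered with Storm's Kryo configuration in your setup is an assumption you would have to verify.

```java
import java.util.List;

class PrimitiveArrays {
    /** Unboxes a list of Integer into a primitive int[] of the same length and order. */
    static int[] toIntArray(List<Integer> boxed) {
        return boxed.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

Filling an int[] directly in the spout would avoid the boxing step altogether.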
Considering the above types which storm is able to serialize out-of-the-box I would most likely shy away from trying to improve serialization performance. I assume kryo is pretty optimized and that it will be very hard to achieve anything faster here. I am also not sure if serialization is the real bottleneck here or rather something in your topology setup (see below).
I would look at other tunables which are related to intra- and inter-worker communication. A good overview can be found here. In one topology for which performance is critical, I am using the following setup code to adjust this kind of parameter. What works best in your case needs to be found out via testing.
int topology_executor_receive_buffer_size = 32768; // intra-worker messaging, default: 32768
int topology_transfer_buffer_size = 2048; // inter-worker messaging, default: 1000
int topology_producer_batch_size = 10; // intra-worker batch, default: 1
int topology_transfer_batch_size = 20; // inter-worker batch, default: 1
int topology_batch_flush_interval_millis = 10; // flush tuple creation ms, default: 1
double topology_stats_sample_rate = 0.001; // calculate metrics every 1000 messages, default: 0.05
conf.put("topology.executor.receive.buffer.size", topology_executor_receive_buffer_size);
conf.put("topology.transfer.buffer.size", topology_transfer_buffer_size);
conf.put("topology.producer.batch.size", topology_producer_batch_size);
conf.put("topology.transfer.batch.size", topology_transfer_batch_size);
conf.put("topology.batch.flush.interval.millis", topology_batch_flush_interval_millis);
conf.put("topology.stats.sample.rate", topology_stats_sample_rate);
As you have noticed, performance greatly increases when Storm is able to use intra-worker processing, so I would always suggest making use of that if possible. Are you sure you need allGrouping? If not, I would suggest using shuffleGrouping, which will actually use local communication if Storm thinks it is appropriate, unless topology.disable.loadaware.messaging is set to true. I am not sure if allGrouping will use local communication for those components which are on the same worker.
Another thing I wonder about is the configuration of your topology: you have 10 spouts and 4 consumer bolts. Unless the bolts consume incoming tuples much faster than they are created, it might be advisable to use an equal number for both components. From how you describe your process it seems you use acking and failing, because you wrote that you assign an ID to your tuples. If guaranteed processing of individual tuples is not an absolute requirement, performance can probably be gained by switching to unanchored tuples. Acking and failing does produce some overhead, so I would expect a higher tuple throughput if it is turned off.
And lastly, you can also experiment with the maximum number of pending tuples (configured via the method .setMaxSpoutPending of the spouts). I am not sure what Storm uses as the default, but in my experience setting it a little higher than what the bolts can ingest downstream delivers higher throughput. Look at the capacity metric and the number of transferred tuples in the Storm UI.

Multiple Micro batch Storm Topology

First of all, sincere apologies if I am asking something very basic, as I am a beginner in Storm.
Apologies also if my question is a duplicate; I tried searching but couldn't find a relevant answer.
Please advise on my below use case.
My use case:
I have a spout reading data from an internal messaging mechanism; it receives and emits tuples at a very high rate (hundreds per second).
Apart from the data, every tuple also carries a frequency (int); there can be 4-5 distinct frequencies in total.
I need to design a bolt that pools all tuples and emits them periodically according to their frequency, emitting only the latest tuple when duplicates arrive before the next batch. A string-based key in the tuple data identifies duplicates.
e.g.
All tuples with a frequency of 25 seconds are pooled together and emitted by the bolt every 25 seconds (if duplicates are received within those 25 seconds, only the latest one is kept).
Similarly, all tuples with a frequency of 10 minutes are pooled together and emitted by the bolt every 10 minutes (if duplicates are received within those 10 minutes, only the latest one is kept).
Since there can be 4-5 configured frequencies (e.g. 10 sec, 25 sec, 10 min, 20 min), every tuple should be placed into the appropriate batch and emitted as in the examples above.
FYI, for the bolt grouping I have used fieldsGrouping with the configuration below.
.fieldsGrouping("FILTERING_BOLT", new Fields(PUBLISHING_FREQUENCY));
Please help or advise on the best approach for my use case; I couldn't think of an implementation that handles concurrently flowing tuples and works with Storm's internal parallelism.
It sounds like you want windowing bolts: https://storm.apache.org/releases/2.0.0-SNAPSHOT/Windowing.html. You probably want a tumbling window (i.e. no overlap between window intervals).
Windowing bolts let you set an interval they should emit windows at (e.g. every 10 seconds), and then the bolt will buffer up all tuples it receives for the previous 10 seconds before calling an execute method you supply.
The structure I think you want is something like
spout -> splitter -> 5 second window bolt
                  -> 10 second window bolt
The splitter should receive the tuples, examine the frequency field and send the tuple on to the right window bolt. You make it do this by declaring a stream for each frequency type.
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // One named stream per frequency type.
    declarer.declareStream("5-sec-stream", ...);
    declarer.declareStream("10-sec-stream", ...);
}
public void execute(Tuple input) {
    if (frequencyIsFive(input)) {
        // Forward the tuple's values unchanged on the matching stream.
        collector.emit("5-sec-stream", input.getValues());
    }
    //more cases here
}
Then when declaring your topology you do
topologyBuilder.setBolt("splitter", new SplitterBolt())
    .shuffleGrouping("spout");
topologyBuilder.setBolt("5-second-window", new YourWindowingBolt())
    .globalGrouping("splitter", "5-sec-stream");
to make all the 5 second tuples go to the 5 second windowing bolt.
See https://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html for more information on this, particularly the parts about streams and groupings.
There's a simple example of a windowing topology at https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/SlidingWindowTopology.java.
One thing you may want to be aware of is Storm's tuple timeout. If you need a window of e.g. 10 minutes, you need to bump the tuple timeout significantly from the default of 30 seconds, so the tuples don't time out while waiting in queue. You can do this by setting e.g. conf.setMessageTimeoutSecs(15*60) when configuring your topology. You want there to be a bit of leeway between the window intervals and the tuple timeout, because you want to avoid your tuples timing out as much as possible.
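The "only the latest tuple per key" requirement from the question can then be handled inside the window bolt's execute method, over the list of tuples in the window. Below is a minimal sketch of that step in plain Java, without Storm types so that it runs on its own. The class name and the use of key/payload string pairs as stand-ins for tuples are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LatestPerKey {
    /**
     * Takes the (key, payload) pairs of one window in arrival order and keeps,
     * for every key, only the payload that arrived last.
     */
    static Map<String, String> latestPerKey(List<Map.Entry<String, String>> windowInArrivalOrder) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Map.Entry<String, String> item : windowInArrivalOrder) {
            // A later arrival with the same key overwrites the earlier one.
            latest.put(item.getKey(), item.getValue());
        }
        return latest;
    }
}
```

In a real bolt the values of the resulting map would be emitted once per window, so each key appears at most once per interval.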

Schedule sending messages to consumers at different rate

I'm looking for the best algorithm for message scheduling. By message scheduling I mean a way to send messages on a bus when there are many consumers, each receiving at a different rate.
Example:
Suppose that we have data D1 to Dn:
. D1 is sent to consumer C1 every 5 ms, C2 every 19 ms, C3 every 30 ms, Cn every Rn ms
. Dn is sent to C1 every 10 ms, C2 every 31 ms, Cn every 50 ms
What is the best algorithm to schedule these actions with the best performance (CPU, memory, I/O)?
Regards
I can think of quite a few options, each with their own costs and benefits. It really comes down to exactly what your needs are -- what really defines "best" for you. I've pseudocoded a couple possibilities below to hopefully help you get started.
Option 1: Execute the following every time unit (in your example, millisecond)
func callEachMs
    time = getCurrentTime()
    for each datum
        for each customer
            if time % datum.customer.rate == 0
                sendMsg()
This has the advantage of requiring no persistent state -- you just check at each time unit whether you should be sending a message. This can also deal with messages that weren't first sent at time == 0 -- just store the time the message was initially sent modulo the rate, and replace the conditional with if time % datum.customer.rate == datum.customer.firstMsgTimeMod.
A downside to this method is it is completely reliant on always being called at a rate of 1 ms. If there's lag caused by another process on a CPU and it misses a cycle, you may miss sending a message altogether (as opposed to sending it a little late).
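A minimal Java sketch of option 1 follows. The Subscription class, its field names and the "datum->consumer" string format are hypothetical; the method only decides what is due at a given millisecond and leaves the actual sending, and the job of calling it every millisecond, to the caller.

```java
import java.util.ArrayList;
import java.util.List;

class ModuloScheduler {
    /** Hypothetical subscription: send datum to consumer every periodMs milliseconds. */
    static final class Subscription {
        final String datum;
        final String consumer;
        final long periodMs;

        Subscription(String datum, String consumer, long periodMs) {
            this.datum = datum;
            this.consumer = consumer;
            this.periodMs = periodMs;
        }
    }

    /** Returns "datum->consumer" for every subscription whose period divides timeMs. */
    static List<String> dueAt(long timeMs, List<Subscription> subscriptions) {
        List<String> due = new ArrayList<>();
        for (Subscription s : subscriptions) {
            if (timeMs % s.periodMs == 0) {
                due.add(s.datum + "->" + s.consumer);
            }
        }
        return due;
    }
}
```

As noted above, a skipped call means a skipped message: if the tick for time 95 never runs, the sends due at 95 are lost rather than delayed.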
Option 2: Maintain a list of lists of tuples, where each entry represents the tasks that need to be done at that millisecond. Make your list at least as long as the longest rate divided by the time unit (if your longest rate is 50 ms and you're going by ms, your list must be at least 50 long). When you start your program, place the first time a message will be sent into the queue. And then each time you send a message, update the next time you'll send it in that list.
func buildList(&list)
    for each datum
        for each customer
            if list.size <= datum.customer.rate
                list.resize(datum.customer.rate + 1)
            list[datum.customer.rate].push_back(tuple(datum.name, customer.name))
func callEachMs(&list)
    for each (datum.name, customer.name) in list[0]
        sendMsg()
        list[datum.customer.rate].push_back((datum.name, customer.name))
    list.pop_front()
    list.push_back(empty list)
This has the advantage of avoiding the many unnecessary modulus calculations option 1 required. However, that comes with the cost of increased memory usage. This implementation would also not be efficient if there's a large disparity in the rate of your various messages (although you could modify this to deal with algorithms with longer rates more efficiently). And it still has to be called every millisecond.
Finally, you'll have to think very carefully about what data structure you use, as this will make a huge difference in its efficiency. Because you pop from the front and push from the back at every iteration, and the list is a fixed size, you may want to implement a circular buffer to avoid unneeded moving of values. For the lists of tuples, since they're only ever iterated over (random access isn't needed), and there are frequent additions, a singly-linked list may be your best solution.
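Option 2 with a circular buffer could look roughly like the sketch below. The names RingScheduler and Task are hypothetical, periods are expressed in ticks and assumed to be at least 1, and ArrayList is used for the per-slot lists for brevity where the text above suggests linked lists. The buffer has one slot more than the longest period so a re-queued task never lands in the slot currently being drained.

```java
import java.util.ArrayList;
import java.util.List;

class RingScheduler {
    /** Hypothetical task: a name plus a period in ticks (must be >= 1). */
    static final class Task {
        final String name;
        final int periodTicks;

        Task(String name, int periodTicks) {
            this.name = name;
            this.periodTicks = periodTicks;
        }
    }

    private final List<List<Task>> slots = new ArrayList<>();
    private int cursor = 0; // index of the slot that belongs to the current tick

    RingScheduler(List<Task> tasks) {
        int maxPeriod = 0;
        for (Task t : tasks) {
            maxPeriod = Math.max(maxPeriod, t.periodTicks);
        }
        for (int i = 0; i <= maxPeriod; i++) {
            slots.add(new ArrayList<>());
        }
        for (Task t : tasks) {
            slots.get(t.periodTicks).add(t); // first send happens after one full period
        }
    }

    /** Call once per tick; returns the names due now and re-queues each one period ahead. */
    List<String> tick() {
        List<Task> due = slots.get(cursor);
        slots.set(cursor, new ArrayList<>());
        List<String> sent = new ArrayList<>();
        for (Task t : due) {
            sent.add(t.name);
            slots.get((cursor + t.periodTicks) % slots.size()).add(t);
        }
        cursor = (cursor + 1) % slots.size(); // advance instead of pop_front/push_back
        return sent;
    }
}
```

Advancing the cursor replaces the pop_front/push_back pair from the pseudocode, so no elements are moved between ticks.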
Obviously, there are many more ways that you could do this, but hopefully these ideas can get you started. Also, keep in mind that the nature of the system you're running this on could have a strong effect on which method works better, or whether you want to do something else entirely. For example, both methods require that they can be reliably called at a certain rate. I also haven't described parallelized implementations, which may be the best option if your application supports them.
As Helium_1s2 described, there is a second way, based on what I call a schedule table. This is what I use now, but this solution has its limits.
Suppose that we have one datum to send and two consumers, C1 and C2:
As you can see, we must extract our schedule table and identify the repeating transmission cycle and the value of the IDLE MINIMUM PERIOD. In fact, it is wasteful to loop on the smallest unit of time (1 ms, 1 ns, 1 min or 1 h, depending on the case), because that is not always the best period; we can optimize the loop as follows.
For example one (C1 every 6 and C2 every 9), we note that there is a cycle which repeats from 0 to 18, with a minimal difference between two consecutive send events equal to 3.
So:
HCF(6,9) = 3 = IDLE MINIMUM PERIOD
LCM(6,9) = 18 = transmission cycle length
LCM/HCF = 6 = size of our schedule table
And the schedule table is :
and the sending loop looks like:
i = 0;
while (1) {
    send(ScheduleTable[i]);
    i = (i + 1) % TABLE_SIZE; // TABLE_SIZE = LCM / HCF entries; wrap around at the end of the cycle
    sleep(IDLE_MINIMUM_PERIOD); // free the CPU for the idle minimum period
}
The problem with this method is that the array grows as the LCM grows, which happens with bad combinations such as rates that are prime numbers.
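The table construction described above can be sketched as follows. The class and method names are hypothetical, and consumers are identified by their index in the periods array (0 for C1, 1 for C2). Entry i of the table lists the consumers to serve at time i * HCF.

```java
import java.util.ArrayList;
import java.util.List;

class ScheduleTable {
    static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    static long lcm(long a, long b) {
        return a / gcd(a, b) * b;
    }

    /** Builds LCM/HCF entries; entry i holds the consumer indices due at time i * HCF. */
    static List<List<Integer>> build(long[] periods) {
        long step = 0;  // HCF of all periods = idle minimum period
        long cycle = 1; // LCM of all periods = transmission cycle length
        for (long p : periods) {
            step = gcd(step, p);
            cycle = lcm(cycle, p);
        }
        List<List<Integer>> table = new ArrayList<>();
        for (long t = 0; t < cycle; t += step) {
            List<Integer> due = new ArrayList<>();
            for (int c = 0; c < periods.length; c++) {
                if (t % periods[c] == 0) {
                    due.add(c);
                }
            }
            table.add(due);
        }
        return table;
    }
}
```

For the periods 6 and 9 this produces six entries, for times 0, 3, 6, 9, 12 and 15: both consumers at 0, C1 at 6 and 12, C2 at 9, and empty slots at 3 and 15. The empty slots illustrate the cost of this approach when the periods share few factors.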

How do I implement this topology in Storm?

I'm new to Storm, so be gentle :-)
I want to implement a topology that is similar to the RollingTopWords topology in the Storm examples. The idea is to count the frequency of words emitted. Basically, the spouts emit words at random, the first level bolts count the frequency and pass them on. The twist is that I want the bolts to pass on the frequency of a word only if its frequency in one of the bolts exceeded a threshold. So, for example, if the word "Nathan" passed the threshold of 5 occurrences within a time window on one bolt then all bolts would start passing "Nathan"'s frequency onwards.
What I thought of doing is having another layer of bolts which would have the list of words which have passed a threshold. They would then receive the words and frequencies from the previous layer of bolts and pass them on only if they appear in the list. Obviously, this list would have to be synchronized across the whole layer of bolts.
Is this a good idea? What would be the best way of implementing it?
Update: What I'm hoping to achieve is a situation where communication is minimized, i.e. each node in my use case is simulated by a spout and an attached bolt which does the local counting. I'd like that bolt to emit only words that have passed a threshold, either in the bolt itself or in another one. So every bolt will have to have a list of words that have passed the threshold. There will be a central repository that will hold the list of words over the threshold and will communicate with the bolts to pass that information.
What would be the best way of implementing that?
That shouldn't be too complicated. Just don't emit the words until you reach the threshold, and in the meantime keep their counts stored in a HashMap. That is just one if-else statement.
About the synchronization - I don't think you need it, because with this kind of problem (counting words) you want one and only one task to receive a specific word. The one task that receives the word (e.g. "Nathan") will be the only one emitting its frequency. For that you should use fields grouping.
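The counting logic described above fits in a few lines. The sketch below is plain Java without Storm types; ThresholdCounter and observe are hypothetical names, and resetting the counts at the end of a time window is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

class ThresholdCounter {
    private final Map<String, Integer> counts = new HashMap<>();
    private final int threshold;

    ThresholdCounter(int threshold) {
        this.threshold = threshold;
    }

    /** Counts one occurrence; returns the count only once it has reached the threshold. */
    OptionalInt observe(String word) {
        int count = counts.merge(word, 1, Integer::sum);
        return count >= threshold ? OptionalInt.of(count) : OptionalInt.empty();
    }
}
```

In a bolt, execute would call observe for each incoming word and emit only when the result is present. With fields grouping on the word, each word's count lives in exactly one task.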

A priority queue which allows efficient priority update?

UPDATE: Here's my implementation of Hashed Timing Wheels. Please let me know if you have an idea to improve the performance and concurrency. (20-Jan-2009)
// Sample usage:
public static void main(String[] args) throws Exception {
    Timer timer = new HashedWheelTimer();
    for (int i = 0; i < 100000; i ++) {
        timer.newTimeout(new TimerTask() {
            public void run(Timeout timeout) throws Exception {
                // Extend another second.
                timeout.extend();
            }
        }, 1000, TimeUnit.MILLISECONDS);
    }
}
UPDATE: I solved this problem by using Hierarchical and Hashed Timing Wheels. (19-Jan-2009)
I'm trying to implement a special-purpose timer in Java which is optimized for timeout handling. For example, a user can register a task with a deadline and the timer notifies the user's callback method when the deadline has passed. In most cases, a registered task will be done within a very short amount of time, so most tasks will be canceled (e.g. task.cancel()) or rescheduled to the future (e.g. task.rescheduleToLater(1, TimeUnit.SECONDS)).
I want to use this timer to detect an idle socket connection (e.g. close the connection when no message is received in 10 seconds) and write timeouts (e.g. raise an exception when the write operation is not finished in 30 seconds). In most cases the timeout will not occur: the client will send a message and the response will be sent unless there's a weird network issue.
I can't use java.util.Timer or java.util.concurrent.ScheduledThreadPoolExecutor because they assume most tasks are supposed to be timed out. If a task is cancelled, the cancelled task is stored in its internal heap until ScheduledThreadPoolExecutor.purge() is called, and it's a very expensive operation. (O(NlogN) perhaps?)
In the traditional heaps or priority queues I learned about in my CS classes, updating the priority of an element was an expensive operation (O(logN) in many cases) because it can only be achieved by removing the element and re-inserting it with a new priority value. Some heaps, like the Fibonacci heap, have O(1) decreaseKey() and min() operations, but what I need at least is fast increaseKey() and min() (or decreaseKey() and max()).
Do you know any data structure which is highly optimized for this particular use case? One strategy I'm thinking of is just storing all tasks in a hash table and iterating all tasks every second or so, but it's not that beautiful.
How about trying to separate the handling of the normal case, where things complete quickly, from the error cases?
Use both a hash table and a priority queue. When a task is started it gets put in the hash table and if it finishes quickly it gets removed in O(1) time.
Every second you scan the hash table, and any tasks that have been running a long time, say .75 seconds, get moved to the priority queue. The priority queue should always be small and easy to handle. This assumes that one second is much less than the timeout times you are looking for.
If scanning the hash table is too slow, you could use two hash tables, essentially one for even-numbered seconds and one for odd-numbered seconds. When a task gets started it is put in the current hash table. Every second move all the tasks from the non-current hash table into the priority queue and swap the hash tables so that the current hash table is now empty and the non-current table contains the tasks started between one and two seconds ago.
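A minimal sketch of the two-table variant follows. The class name TwoGenerationTracker and the use of string task ids are assumptions; the priority queue itself is left to the caller, who receives the promoted tasks from rotate() and is expected to call it once per second.

```java
import java.util.HashSet;
import java.util.Set;

class TwoGenerationTracker {
    private Set<String> current = new HashSet<>();  // tasks started in the current period
    private Set<String> previous = new HashSet<>(); // tasks started in the period before

    void start(String taskId) {
        current.add(taskId);
    }

    /** Returns true if the task was still on the fast path, i.e. not yet promoted. */
    boolean finish(String taskId) {
        return current.remove(taskId) || previous.remove(taskId);
    }

    /**
     * Call once per period. Returns the tasks that survived a full period in the
     * non-current table; the caller moves them into its priority queue.
     */
    Set<String> rotate() {
        Set<String> promoted = previous;
        previous = current;
        current = new HashSet<>();
        return promoted;
    }
}
```

Tasks that finish quickly only ever touch a hash set, so the common case stays O(1) and the priority queue only sees the slow ones.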
These options are a lot more complicated than just using a priority queue, but they are pretty easily implemented and should be stable.
To the best of my knowledge (I wrote a paper about a new priority queue, which also reviewed past results), no priority queue implementation gets the bounds of Fibonacci heaps, as well as constant-time increase-key.
There is a small problem with getting that literally. If you could get increase-key in O(1), then you could get delete in O(1) -- just increase the key to +infinity (you can handle the queue being full of lots of +infinitys using some standard amortization tricks). But if find-min is also O(1), that means delete-min = find-min + delete becomes O(1). That's impossible in a comparison-based priority queue because the sorting bound implies (insert everything, then remove one-by-one) that
n * insert + n * delete-min > n log n.
The point here is that if you want a priority-queue to support increase-key in O(1), then you must accept one of the following penalties:
Not be comparison based. Actually, this is a pretty good way to get around things, e.g. vEB trees.
Accept O(log n) for inserts and also O(n log n) for make-heap (given n starting values). This sucks.
Accept O(log n) for find-min. This is entirely acceptable if you never actually do find-min (without an accompanying delete).
But, again, to the best of my knowledge, no one has done the last option. I've always seen it as an opportunity for new results in a pretty basic area of data structures.
Use Hashed Timing Wheel - Google 'Hashed Hierarchical Timing Wheels' for more information. It's a generalization of the answers made by people here. I'd prefer a hashed timing wheel with a large wheel size to hierarchical timing wheels.
Some combination of hashes and O(logN) structures should do what you ask.
I'm tempted to quibble with the way you're analyzing the problem. In your comment above, you say
Because the update will occur very very frequently. Let's say we are sending M messages per connection then the overall time becomes O(MNlogN), which is pretty big. – Trustin Lee (6 hours ago)
which is absolutely correct as far as it goes. But most people I know would concentrate on the cost per message, on the theory that as your app has more and more work to do, obviously it's going to require more resources.
So if your application has a billion sockets open simultaneously (is that really likely?) the insertion cost is only about 60 comparisons per message.
I'll bet money that this is premature optimization: you haven't actually measured the bottlenecks in your system with a performance analysis tool like CodeAnalyst or VTune.
Anyway, there's probably an infinite number of ways of doing what you ask, once you just decide that no single structure will do what you want, and you want some combination of the strengths and weaknesses of different algorithms.
One possibility is to divide the socket domain N into some number of buckets of size B, and then hash each socket into one of those (N/B) buckets. In that bucket is a heap (or whatever) with O(log B) update time. If an upper bound on N isn't fixed in advance, but can vary, then you can create more buckets dynamically, which adds a little complication, but is certainly doable.
In the worst case, the watchdog timer has to search (N/B) queues for expirations, but I assume the watchdog timer is not required to kill idle sockets in any particular order!
That is, if 10 sockets went idle in the last time slice, it doesn't have to search that domain for the one that timed out first, deal with it, then find the one that timed out second, etc. It just has to scan the (N/B) set of buckets and enumerate all time-outs.
If you're not satisfied with a linear array of buckets, you can use a priority queue of queues, but you want to avoid updating that queue on every message, or else you're back where you started. Instead, define some time that's less than the actual time-out (say, 3/4 or 7/8 of that), and only put the low-level queue into the high-level queue if its longest time exceeds that.
And at the risk of stating the obvious, you don't want your queues keyed on elapsed time. The keys should be start time. For each record in the queues, elapsed time would have to be updated constantly, but the start time of each record doesn't change.
There's a VERY simple way to do all inserts and removes in O(1), taking advantage of the fact that 1) priority is based on time and 2) you probably have a small, fixed number of timeout durations.
Create a regular FIFO queue to hold all tasks that timeout in 10 seconds. Because all tasks have identical timeout durations, you can simply insert to the end and remove from the beginning to keep the queue sorted.
Create another FIFO queue for tasks with 30-second timeout duration. Create more queues for other timeout durations.
To cancel, remove the item from the queue. This is O(1) if the queue is implemented as a linked list.
Rescheduling can be done as cancel-insert, as both operations are O(1). Note that tasks can be rescheduled to different queues.
Finally, to combine all the FIFO queues into a single overall priority queue, have the head of every FIFO queue participate in a regular heap. The head of this heap will be the task with the soonest expiring timeout out of ALL tasks.
If you have m number of different timeout durations, the complexity for each operation of the overall structure is O(log m). Insertion is O(log m) due to the need to look up which queue to insert to. Remove-min is O(log m) for restoring the heap. Cancelling is O(1) but worst case O(log m) if you're cancelling the head of a queue. Because m is a small, fixed number, O(log m) is essentially O(1). It does not scale with the number of tasks.
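A simplified sketch of the per-duration FIFO idea is shown below. It departs from the description above in two labeled ways: cancellation is lazy (a flag on the task, skipped when it reaches the head) instead of an O(1) linked-list unlink, and the next deadline is found by scanning the heads of the m queues instead of keeping them in a heap. The names FifoTimeouts and Task are hypothetical, and the sketch assumes schedule is called with non-decreasing values of now, which is what keeps each queue sorted.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class FifoTimeouts {
    /** Hypothetical task handle with an absolute deadline. */
    static final class Task {
        final String id;
        final long deadline;
        boolean cancelled;

        Task(String id, long deadline) {
            this.id = id;
            this.deadline = deadline;
        }
    }

    // One FIFO queue per timeout duration; each queue is sorted by deadline by construction.
    private final Map<Long, Deque<Task>> queues = new HashMap<>();

    Task schedule(String id, long now, long timeout) {
        Task task = new Task(id, now + timeout);
        queues.computeIfAbsent(timeout, k -> new ArrayDeque<>()).addLast(task);
        return task;
    }

    /** Lazy cancellation: the task is dropped when it reaches the head of its queue. */
    void cancel(Task task) {
        task.cancelled = true;
    }

    /** Returns the live task with the earliest deadline, or null if there is none. */
    Task peekNext() {
        Task best = null;
        for (Deque<Task> queue : queues.values()) {
            while (!queue.isEmpty() && queue.peekFirst().cancelled) {
                queue.pollFirst();
            }
            Task head = queue.peekFirst();
            if (head != null && (best == null || head.deadline < best.deadline)) {
                best = head;
            }
        }
        return best;
    }
}
```

Rescheduling is then a cancel followed by a new schedule call, possibly into a different queue, as described above. With lazy cancellation, cancelled tasks stay in memory until they reach the head of their queue.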
Your specific scenario suggests a circular buffer to me. If the max. timeout is 30 seconds and we want to reap sockets at least every tenth of a second, then use a buffer of 300 doubly-linked lists, one for each tenth of a second in that period. To 'increaseTime' on an entry, remove it from the list it's in and add it to the one for its new tenth-second period (both constant-time operations). When a period ends, reap anything left over in the current list (maybe by feeding it to a reaper thread) and advance the current-list pointer.
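The circular buffer described above could be sketched like this. TimeoutWheel and its method names are hypothetical, entries are identified by string ids, and a HashSet per slot plus an id-to-slot map stands in for the doubly-linked lists while keeping removal constant-time on average. A timeout is given in ticks and must be between 1 and the number of slots.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TimeoutWheel {
    private final List<Set<String>> slots = new ArrayList<>();
    private final Map<String, Integer> slotOf = new HashMap<>(); // id -> slot it currently sits in
    private int cursor = 0;

    TimeoutWheel(int slotCount) {
        for (int i = 0; i < slotCount; i++) {
            slots.add(new HashSet<>());
        }
    }

    /** Schedules or reschedules id to expire ticksFromNow ticks from now (1..slotCount). */
    void schedule(String id, int ticksFromNow) {
        cancel(id); // move rather than duplicate if the id is already scheduled
        int slot = (cursor + ticksFromNow) % slots.size();
        slots.get(slot).add(id);
        slotOf.put(id, slot);
    }

    void cancel(String id) {
        Integer slot = slotOf.remove(id);
        if (slot != null) {
            slots.get(slot).remove(id);
        }
    }

    /** Advances one tick and returns (and forgets) everything left in the new current slot. */
    Set<String> advance() {
        cursor = (cursor + 1) % slots.size();
        Set<String> expired = slots.get(cursor);
        slots.set(cursor, new HashSet<>());
        for (String id : expired) {
            slotOf.remove(id);
        }
        return expired;
    }
}
```

For the scenario above the wheel would have 300 slots and advance() would be called every tenth of a second; each message received on a socket would call schedule again with the full timeout.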
You've got a hard-limit on the number of items in the queue - there is a limit to TCP sockets.
Therefore the problem is bounded. I suspect any clever data structure will be slower than using built-in types.
Is there a good reason not to use java.util.PriorityQueue? Doesn't remove() handle your cancel operations in log(N) time? Then implement your own waiting based on the time until the item on the front of the queue.
I think storing all the tasks in a list and iterating through them would be best.
You must be (going to) run the server on some pretty beefy machine to get to the limits where this cost will be important?

Resources