Input data buffer for multiple downstream operators in parallel in Flink - memory-management

I want to know how Flink handles branching out the result of one operator to multiple downstream operators. If I have an operator A that is connected to 3 other operators B, C and D in parallel, does Flink keep only one copy of the result and send it out to all 3 operators? Assuming that's the case, is the result removed once it has been sent to all 3 operators or is there a garbage collection process?
I haven't found anything concrete about this topic in the Flink documentation. Any relevant resource regarding this topic is highly appreciated.

Before sending record A to operators B, C, and D, Flink serializes record A into bytes. Those bytes are then copied into the buffer pools of the sub-partitions for the different downstream operators, so there will be multiple copies of record A's bytes.
The original record A object is garbage-collected by the JVM later.
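A minimal Python sketch of that behaviour (this is not Flink's API; pickle stands in for Flink's serializers, and the buffer pools are plain bytearrays):

```python
import pickle

def broadcast_record(record, num_subpartitions):
    """Serialize the record once, then copy the bytes into one buffer
    per downstream sub-partition (one per receiving operator)."""
    payload = pickle.dumps(record)                 # record A -> bytes, done once
    buffer_pools = [bytearray() for _ in range(num_subpartitions)]
    for pool in buffer_pools:
        pool.extend(payload)                       # one byte copy per operator
    return buffer_pools

pools = broadcast_record({"id": 1, "value": "a"}, 3)
# Each downstream operator (B, C, D) deserializes its own copy:
records = [pickle.loads(bytes(p)) for p in pools]
```

Once `payload` has been copied into the buffers, nothing references the original object any more and the JVM (or here, the Python runtime) can reclaim it.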

Related

How does Apache Storm parallelism work?

I am new to Apache Storm and am wondering how the parallelism hint works.
For example: we have one stream containing two tuples <4> and <6>, one spout with only one task per executor, and one bolt that performs some operation on the tuples with a parallelism hint of 2, so we have two executors of this bolt, namely A and B. Regarding this, I have 3 questions.
Considering the above scenario, is it possible that the tuple containing value 4 is processed by A and the tuple containing value 6 is processed by B?
If processing happens in the manner mentioned in question (1), won't that affect operations in which sequence matters?
If processing does not happen in this manner, i.e. both tuples go to the same executor, then what is the benefit of parallelism?
Considering the above scenario, is it possible that the tuple containing value 4 is processed by A and the tuple containing value 6 is processed by B?
Yes.
If processing happens in the manner mentioned in question (1), won't that affect operations in which sequence matters?
It depends. You most likely have control over the sequence of the tuples in your spout. If sequence matters, it is advisable either to reduce parallelism or to use fields grouping, to make sure tuples that depend on each other go to the same executor. If sequence does not matter, use shuffleGrouping or localOrShuffleGrouping to benefit from parallel processing.
If processing does not happen in this manner, i.e. both tuples go to the same executor, then what is the benefit of parallelism?
If both tuples go to the same executor, there is no benefit, obviously.
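The difference between the two groupings can be sketched in a few lines of Python (this is an illustrative simulation, not Storm's API; "executors" are just lists here):

```python
from itertools import cycle

def shuffle_grouping(tuples, num_executors):
    """Round-robin distribution: tuples spread evenly across executors,
    so per-key ordering across executors is not preserved."""
    executors = [[] for _ in range(num_executors)]
    targets = cycle(range(num_executors))
    for t in tuples:
        executors[next(targets)].append(t)
    return executors

def fields_grouping(tuples, num_executors, key_index):
    """Hash on the grouping field: the same key always lands on the same
    executor, which preserves the relative order of tuples for that key."""
    executors = [[] for _ in range(num_executors)]
    for t in tuples:
        executors[hash(t[key_index]) % num_executors].append(t)
    return executors

stream = [("word", 4), ("word", 6)]
shuffle_grouping(stream, 2)     # -> [[("word", 4)], [("word", 6)]]
fields_grouping(stream, 2, 0)   # both tuples end up on one executor
```

With shuffle grouping the two tuples go to different executors (the scenario from question 1); with fields grouping on the key, they stay together, which is why fields grouping is the answer when sequence matters.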

How to handle duplication of events caused by Storm's replay in case of fail() from one of the bolts

I have a Topology with Spout S, and 3 bolts - A, B, C.
Bolt A reads from Spout S. Bolt A then splits the data into Bolts B and C (based on some filter). Bolts B and C have their respective data sinks.
If I use Storm's anchoring and anchor the tuple at Bolt A, and later Bolt B acks successfully but Bolt C does a fail(), will Storm's replay at Spout S cause duplication of events at Bolt B, and thus in the data sink at B?
If so, what is the way to avoid that while still using storm's reliability feature of anchoring?
Storm's anchoring feature only supports at-least-once processing; there is no built-in support for handling duplicates in case of failure. Depending on your application semantics, this may or may not be an issue.
For example, if you perform an idempotent operation later on, duplicates are not a problem (an example of an idempotent operation is updating a key-value store: if you do two put operations because of a duplicate, the state of the key-value store is still the same).
If you have non-idempotent operations and duplicates are an issue, you could try to take care of this on your own, but that is pretty difficult to get right.
As an alternative, you could use the Trident API instead of the low-level API; Trident does provide exactly-once guarantees.
Or, as a last resort, use a different system that provides exactly-once semantics out of the box.
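The idempotent-sink argument above can be made concrete with a toy key-value store (illustrative Python, not a real Cassandra or Storm sink):

```python
class KeyValueStore:
    """In-memory stand-in for a key-value sink. put() is idempotent:
    replaying the same (key, value) pair leaves the state unchanged."""
    def __init__(self):
        self.state = {}

    def put(self, key, value):
        self.state[key] = value

sink = KeyValueStore()
sink.put("user:42", "clicked")
sink.put("user:42", "clicked")   # duplicate delivery after a Storm replay
# state is identical to what a single delivery would have produced
```

Contrast this with a non-idempotent sink such as a counter increment: there, a replayed tuple would change the result, which is exactly the case where at-least-once semantics become a problem.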

What happens internally when we join two DStreams grouped by keys?

I'm new to Spark (spark-streaming in Python) and, if I have understood correctly, a DStream is a sequence of RDDs.
Imagine that we have in our code:
ssc = StreamingContext(sc, 5)
So every 5s a new RDD is generated and appended to the DStream.
Imagine I have two DStreams DS1 and DS2 (batched every 5s). My code is:
DGS1 = DS1.groupByKey()
DGS2 = DS2.groupByKey()
FinalStream = DGS1.join(DGS2)
What happens internally (at the RDD level) when I call groupByKey and join?
Thank you!
When you use groupByKey and join, you cause a shuffle. Consider the following scenario:
Assume you have a stream of incoming RDDs (a DStream) whose elements are tuples of (String, Int). What you want is to group them by key (a word, in this example). But all the keys aren't locally available in the same executor; they are potentially spread between many workers which have previously done work on the said RDD.
What Spark has to do now is say "I need all tuples whose key equals X to go to worker 1, and all tuples whose key equals Y to go to worker 2, etc.", so that all values of a given key end up on a single worker node, which can then continue to do more work on each RDD, now of type (String, Iterator[Int]) as a result of the grouping.
A join is similar in its behavior to a groupByKey, as it has to have all matching keys available on the same worker in order to pair them across the two streams of RDDs.
Behind the scenes, Spark has to do a couple of things for this to work:
Repartitioning of the data: since all keys may not be available on a single worker.
Data serialization/deserialization and compression: since Spark potentially has to transfer data across nodes, it has to be serialized and later deserialized.
Disk I/O: caused by shuffle spills, since a single worker may not be able to hold all the data in memory.
For more, see this introduction to shuffling.
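The repartitioning step can be sketched as a hash partitioner in plain Python (an illustrative simulation of the shuffle, not Spark's implementation; "workers" are just dicts here):

```python
from collections import defaultdict

def group_by_key(partitions, num_workers):
    """Route every (word, count) pair to the worker that owns its key,
    so that all values for a given key end up on exactly one node."""
    workers = [defaultdict(list) for _ in range(num_workers)]
    for partition in partitions:
        for word, count in partition:
            workers[hash(word) % num_workers][word].append(count)
    return workers

# Two input partitions, as they might sit on two different executors:
partitions = [[("spark", 1), ("storm", 2)], [("spark", 3)]]
workers = group_by_key(partitions, 2)
# All values for "spark" now live on a single worker, as [1, 3] --
# the (String, Iterator[Int]) shape mentioned above.
```

A join does the same routing for both inputs, so matching keys from the two streams meet on the same worker and can be paired.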

Storm parallel understanding

I have already read related material about Storm parallelism but some things remain unclear. Suppose we take tweet processing as an example: generally, we retrieve a stream of tweets, count the number of words in each tweet, and write the numbers to a local file.
My question is how to understand the parallelism value of spouts and bolts. Within builder.setSpout and builder.setBolt we can assign the parallelism value. But in the tweet word-counting case, is it correct that only one spout should be set? Would more than one spout just be copies of the first spout, with identical tweets flowing into each of them? If that is the case, what is the value of setting more than one spout?
Another unclear point is how work is assigned to bolts. Is the parallel mechanism achieved by Storm finding a currently available bolt to process the next emitted tuple? I revised the basic tweet-counting code so that the final counting results are written into a specific directory; however, all results are actually combined into one file on nimbus. So after processing the data on the supervisors, all results are sent back to nimbus. If this is true, what is the communication mechanism between nimbus and the supervisors?
I really want to figure out these problems! Any help is appreciated!
Setting the parallelism of a spout larger than one requires that the user code does different things in different instances. Otherwise (as you mentioned already), the data is just sent through the topology multiple times. For example, you can have a list of ports you want to listen to (or a list of different Kafka topics). You then need to ensure that different instances listen to different ports or topics. This can be achieved in the open(...) method by looking at topology metadata such as the instance's own task ID and the degree of parallelism (dop). As each instance has a unique ID, you can partition your ports/topics such that each instance picks different ports/topics from the overall list.
About parallelism: this depends on the connection pattern you use when plugging your topology together. For example, using shuffleGrouping results in a round-robin distribution of emitted tuples to the consuming bolt instances. In this case, Storm does not "look" at whether a bolt instance is available for processing; tuples are simply transferred and buffered at the receiver if necessary.
Furthermore, Nimbus and the supervisors only exchange metadata. There is no dataflow (i.e., flow of tuples) between them.
In some cases, such as a Kafka consumer group, you get queue behavior: if one consumer reads a message from the queue, the other consumers will read different messages.
This distributes the read load across all workers.
In those cases you can have multiple spouts reading from the queue.
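The partition-by-task-ID idea from the answer above can be sketched as follows (a hypothetical helper, not Storm's API; in a real spout it would be called from open(...) with values taken from the TopologyContext):

```python
def topics_for_instance(all_topics, task_id, dop):
    """Give each spout instance a disjoint slice of the topic/port list,
    based on its own task ID and the degree of parallelism (dop)."""
    return [t for i, t in enumerate(all_topics) if i % dop == task_id]

topics = ["tweets-0", "tweets-1", "tweets-2", "tweets-3"]
topics_for_instance(topics, 0, 2)   # -> ["tweets-0", "tweets-2"]
topics_for_instance(topics, 1, 2)   # -> ["tweets-1", "tweets-3"]
```

Because every topic index maps to exactly one task ID, no two spout instances read the same topic, so no tweet enters the topology twice.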

Storm data structures - map vs separated values?

I'm using Storm to parse and save data from Kafka. The data comes in as some identifiers followed by a map<string,string> of varying size. After some munging, the end goal is Cassandra.
Should I send the data as one block of tuples or split up the map and send each piece separately?
A tuple should represent a "unit of work" for the next bolt in the stream. If you think of your map as a single entity that gets processed as a single, albeit complex, object, then the map should be emitted as a single tuple. If you want different bolts to independently process different map attributes, then break the map into subsequently processable subsets of attributes and emit multiple tuples.
It depends on the size of the tuple you want to send.
Every tuple you emit in Storm is a serialized message transmitted from one executor to another. You should also take the performance of Netty and the LMAX Disruptor into consideration, since they are used in recent versions of Storm for inter-worker and intra-worker communication, respectively. That is, settings like
Config.TOPOLOGY_RECEIVER_BUFFER_SIZE
Config.TOPOLOGY_TRANSFER_BUFFER_SIZE
Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE
Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE
should be taken into account. You could take a look at Understanding the Internal Message Buffers of Storm for more details.
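The two emission strategies look like this in outline (an illustrative Python sketch, not Storm's emit API; field names are made up):

```python
def emit_whole(identifier, attributes):
    """One tuple carrying the full map: a single larger serialized
    message, processed as one unit of work downstream."""
    return [(identifier, attributes)]

def emit_split(identifier, attributes):
    """One tuple per map entry: more, smaller messages that different
    bolts can process independently."""
    return [(identifier, key, value) for key, value in attributes.items()]

row = ("sensor-7", {"temp": "21", "humidity": "60"})
emit_whole(*row)   # 1 tuple
emit_split(*row)   # 2 tuples, one per map entry
```

Splitting multiplies the number of messages crossing the executor buffers configured above, so the "unit of work" question and the transfer-cost question should be answered together.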
