How to go about parallelizing my processing using tbb::parallel_for and tbb::flow?

I have a source of files that I need to process.
From each file, my code generates a variable number of data objects, let's call it N.
I have K number of processing objects that can be used to process the N data objects.
I'm thinking of doing the following using tbb::flow:
Create a function_node with concurrency K and put my K processing objects into a concurrent_queue.
Use input_node to read file, generate the N data objects, and try_put each into the function_node.
The function_node body dequeues a processing object, uses it to process a data object, then returns the processing object back to the concurrent_queue when done.
Another way I can think of is possibly like so:
Create a function_node with serial concurrency.
Use an input_node to read a file, generate the N data objects, put them into a collection, and send it over to the function_node.
At the function_node, partition the N objects into K ranges and use each of the K processing objects to process each range concurrently - not sure if it is possible to customize parallel_for for this purpose.
The advantage of the first method is probably lower latency because I can start sending data objects through the dataflow the moment they are generated rather than have to wait for all N data objects to be generated.
What do you think is the best way to go about parallelizing this processing?

Yes, you are right that the first method has the advantage of not waiting for all of the data objects before starting to process them. However, it has another advantage as well: it does not wait for the processing of all the data objects passed to parallel_for to complete. This becomes especially visible if the processing speed varies per data object and/or per processing object.
Also, it seems enough to have a buffer_node followed by a (perhaps reserving) join_node, instead of a concurrent_queue, to store the processing objects for reuse. In this case, the function_node would return the processing object back to the buffer_node once it finishes processing the data object. So, the graph will look like the following:
input_node -> input_port<0>(join_node);
buffer_node -> input_port<1>(join_node);
join_node -> function_node;
function_node -> buffer_node;
In this case, the concurrency of the function_node can be left unlimited, as it would automatically be limited by the number of processing objects (available tokens) that exist in the graph.
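The token-recycling idea is not TBB-specific. As a rough illustration (plain Java with made-up names, not TBB code), a pool of K processing objects circulating through a blocking queue bounds the effective concurrency the same way the recycled tokens in the graph above do:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the token pattern: K processor objects circulate through a
// queue; a data item is only processed once it is paired with a free
// processor, so effective concurrency never exceeds K.
public class TokenPool {
    // Hypothetical processing object standing in for the real one.
    public static class Processor {
        int process(int dataObject) { return dataObject * 2; }
    }

    public static List<Integer> processAll(List<Integer> data, int k)
            throws InterruptedException {
        BlockingQueue<Processor> pool = new LinkedBlockingQueue<>();
        for (int i = 0; i < k; i++) pool.put(new Processor());

        ExecutorService exec = Executors.newCachedThreadPool();
        List<Future<Integer>> results = new ArrayList<>();
        for (int d : data) {
            results.add(exec.submit(() -> {
                Processor p = pool.take();   // "join" with a free token
                try { return p.process(d); }
                finally { pool.put(p); }     // return token to the buffer
            }));
        }
        List<Integer> out = new ArrayList<>();
        for (Future<Integer> f : results) {
            try { out.add(f.get()); }
            catch (ExecutionException e) { throw new RuntimeException(e); }
        }
        exec.shutdown();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(processAll(List.of(1, 2, 3, 4), 2)); // [2, 4, 6, 8]
    }
}
```

The queue plays the role of the buffer_node: taking a processor corresponds to the reserving join, and putting it back corresponds to the edge from the function_node back to the buffer_node.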
Also, note that generating data objects from different files can be done in parallel as well. If you see a benefit from that, consider using a function_node instead of the input_node, as the latter is always serial. However, in this case, use a join_node with the queueing policy, since function_node is not reservable.
Also, please consider using tbb::parallel_pipeline instead, as you seem to have a classic pipelining scheme of processing.

Related

Apache Flink relating/caching data options

This is a very broad question; I'm new to Flink and looking into the possibility of using it as a replacement for our current analytics engine.
The scenario is: data is collected from various equipment and received as a JSON-encoded string with the format {"location.attribute": value, "TimeStamp": value}.
For example, a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be run over the process parameters, but the output needs to include a relation to the traceability code, for example {"location.alarm": value, "location.traceability": value, "TimeStamp": value}.
What method does Flink use for caching values, in this case the current traceability code, whilst running analysis over other parameters received at a later time?
I'm mainly just looking for the area to research, as so far I've been unable to find any examples of this kind of scenario. Perhaps it's not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
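To make that guarantee concrete, here is a minimal plain-Java sketch (not Flink's actual implementation; all class and method names are made up) of deterministic key-based routing with one state map per partition, holding the current traceability code per location:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of keyed routing plus per-key state: every event with
// the same key lands in the same partition, so that partition's local
// map can safely hold the "current traceability code" for its keys.
public class KeyedState {
    // Deterministic: the same key always maps to the same partition.
    public static int partitionFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // One state map per partition, acting as a sharded key/value store.
    private final Map<String, String>[] stateByPartition;

    @SuppressWarnings("unchecked")
    public KeyedState(int parallelism) {
        stateByPartition = new Map[parallelism];
        for (int i = 0; i < parallelism; i++) {
            stateByPartition[i] = new HashMap<>();
        }
    }

    // Remember the latest traceability code seen for this location.
    public void onTraceabilityCode(String location, String code) {
        stateByPartition[partitionFor(location, stateByPartition.length)]
                .put(location, code);
    }

    // Enrich a later alarm event with the cached traceability code.
    public String enrich(String location, String alarm) {
        String code = stateByPartition[partitionFor(location, stateByPartition.length)]
                .getOrDefault(location, "unknown");
        return "{\"" + location + ".alarm\":\"" + alarm + "\",\""
                + location + ".traceability\":\"" + code + "\"}";
    }

    public static void main(String[] args) {
        KeyedState ks = new KeyedState(2);
        ks.onTraceabilityCode("lineA", "TC-42");
        // prints {"lineA.alarm":"overtemp","lineA.traceability":"TC-42"}
        System.out.println(ks.enrich("lineA", "overtemp"));
    }
}
```

In Flink itself you would not manage the maps yourself: keyBy does the routing, and a ValueState inside a ProcessFunction plays the role of the per-key map entry.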
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, and implement this as a join of the stream with itself.

Network shuffle in streaming

So, keyBy or groupBy causes a network shuffle that repartitions the stream. It is said to be pretty expensive, since it involves network communication along with serialization and deserialization, etc.
For an example, if I run the following operators:
map(Mapper1).keyBy(0).map(Mapper2)
with a parallelism of 2, I would get something like this:
Mapper1(1) -\-/- Mapper2(1)
X
Mapper1(2) -/-\- Mapper2(2)
And in the end all records with the same key within the Mapper1 are assigned to the same partition in Mapper2.
My question is:
I want to know what happens during keyBy or groupBy in streaming. Is every processed element serialized and deserialized by every subtask? How can I compare the cost of keyBy or groupBy with that of another operation?
Also, I am familiar with the concept of partitioner in batch systems, but I am getting a bit confused when I am trying to apply that in streaming.
Thank you !
Apache Flink buffers the outgoing records of a task and then sends them to the next task for processing. setBufferTimeout is a job-level parameter which can be configured via the StreamExecutionEnvironment; the default value for this timeout is 100 ms. After this time, the buffers are sent automatically even if they are not full.
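As a simplified illustration of that flush-on-full-or-timeout behaviour (a sketch only, not Flink's actual network stack; all names are made up), a buffer is handed downstream either when it fills up or when the oldest record in it has waited longer than the timeout:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of flush-on-full-or-timeout buffering: records are
// appended to a buffer that is flushed downstream either when it fills
// up or when the configured timeout elapses, whichever comes first.
public class TimedBuffer {
    private final int capacity;
    private final long timeoutMillis;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();
    private long firstRecordAt = -1;

    public TimedBuffer(int capacity, long timeoutMillis) {
        this.capacity = capacity;
        this.timeoutMillis = timeoutMillis;
    }

    public void add(String record, long nowMillis) {
        if (buffer.isEmpty()) firstRecordAt = nowMillis;
        buffer.add(record);
        if (buffer.size() >= capacity) flush(); // full: send immediately
    }

    // Called periodically; flushes a non-empty buffer whose oldest record
    // has waited longer than the timeout (cf. setBufferTimeout's 100 ms).
    public void tick(long nowMillis) {
        if (!buffer.isEmpty() && nowMillis - firstRecordAt >= timeoutMillis) flush();
    }

    private void flush() {
        flushed.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    public List<List<String>> flushedBatches() { return flushed; }

    public static void main(String[] args) {
        TimedBuffer b = new TimedBuffer(3, 100);
        b.add("a", 0);
        b.add("b", 10);  // buffer not full yet
        b.tick(50);      // only 50 ms elapsed: no flush
        b.tick(120);     // timeout passed: flush [a, b]
        System.out.println(b.flushedBatches()); // [[a, b]]
    }
}
```

This trade-off is exactly what setBufferTimeout tunes: a larger timeout means fuller buffers and higher throughput, a smaller one means lower latency.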
Also the following links are really helpful to understand the details:
https://flink.apache.org/2019/06/05/flink-network-stack.html
https://flink.apache.org/2019/07/23/flink-network-stack-2.html

How to convert nested list into list?

I am trying to replace a for loop which takes the list held by each object in a collection and accumulates all of them into a single list:
/*List<Rbs> cellList = new ArrayList<>();
for (CommonPrep commonPrep : commonTest.values()) {
    cellList.addAll(commonPrep.getRbsList());
}*/
List<Rbs> cellList =
commonTest.values()
.stream()
.flatMap(commonPrep -> commonPrep.getRbsList().stream())
.collect(Collectors.toList());
Is there a better way of doing it than using two streams (as I'm currently doing)?
A slightly different variant.
List<Rbs> cellList =
commonTest.values()
.stream()
.map(CommonPrep::getRbsList)
.flatMap(Collection::stream)
.collect(Collectors.toList());
As you can see, you'll eventually call the stream method once in .values().stream() and N times inside the flatMap (due to the return value of the function passed to flatMap), where N is the number of elements returned from the map operation.
In fact, each intermediate operation also returns a Stream<T>, so there are even more stream objects than what I've mentioned above.
There's no way to avoid it. Plus, creating a stream is generally a cheap operation, so you shouldn't be concerned about it.
Thus, unless you're experiencing performance issues, it's better not to think about how many Stream objects are created; focus instead on writing your query to achieve what you want and let the library handle the rest.
Even if you were experiencing performance issues, trying to avoid creating new streams on each function or method call would not work, as all intermediate operations return a new stream on each invocation, and some operations accepting behavioral parameters also require new streams, as in the function passed to flatMap.
I very much doubt you're currently experiencing performance issues due to the creation of stream objects. Anyhow, you can always consider going parallel when you're experiencing performance problems.
It's important to understand several factors before even attempting to go parallel. You can read the answers here for things to consider.

Aggregating processor or aggregating reader

I have a requirement like this: I read items from a DB, if possible in a paging way, where the page size corresponds to the later "batch size"; I do some processing steps, like filtering, etc.; then I want to accumulate the items and send them to a REST service in batches, e.g. n of them at once instead of one by one.
Parallelising it on the step level is what I am doing, but I am not sure how to get the batching to work. Do I need to implement a reader that returns a list and a processor that receives a list? If so, I have read that you will not get a proper count of the number of items processed.
I am trying to find a way to do it in the most appropriate Spring Batch way instead of hacking a fix. I also assume that I need to keep state in the reader, and I wondered if there is a better way not to.
You cannot have something like an aggregating processor. Every single item that is read is processed as a single item.
However, you can implement a reader that groups items and forwards them as a whole group. To get an idea of how this could be done, have a look at my answer to the question Spring Batch Processor, or Dean Clark's answer to Spring Batch - How to process multiple records at the same time in the processor?
Both use Spring Batch's SingleItemPeekableItemReader.
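Independently of Spring Batch's actual interfaces, the grouping idea can be sketched as a reader whose read() returns a whole group of up to n records (hypothetical, simplified code; a real implementation would wrap a delegate ItemReader and use peeking to decide group boundaries):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of an aggregating reader: the framework still sees one "item"
// per read() call, but each item is a whole group of up to n records,
// so a downstream writer can post them to the REST service in batches.
public class GroupingReader<T> {
    private final Iterator<T> source;
    private final int groupSize;

    public GroupingReader(Iterator<T> source, int groupSize) {
        this.source = source;
        this.groupSize = groupSize;
    }

    // Returns the next group, or null when the source is exhausted
    // (mirroring the read() contract of a Spring Batch ItemReader).
    public List<T> read() {
        if (!source.hasNext()) return null;
        List<T> group = new ArrayList<>(groupSize);
        while (source.hasNext() && group.size() < groupSize) {
            group.add(source.next());
        }
        return group;
    }

    public static void main(String[] args) {
        GroupingReader<Integer> r =
                new GroupingReader<>(List.of(1, 2, 3, 4, 5).iterator(), 2);
        List<Integer> g;
        while ((g = r.read()) != null) System.out.println(g);
        // prints [1, 2], [3, 4], [5]
    }
}
```

Note that with this approach the step's item count reflects the number of groups, not the number of underlying records, which is the accounting caveat mentioned in the question.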

Why do we use the TIBCO Mapper activity?

The TIBCO documentation says:
The Mapper activity adds a new process variable to the process definition. This variable can be a simple datatype, a TIBCO ActiveEnterprise schema, an XML schema, or a complex structure.
So my question is: does the TIBCO Mapper really only perform this simple function? We can also create process variables in the process definition (by right-clicking on the process definition). I looked for it on Google, but nobody clearly explains why to use this activity; I have also tried YouTube, and there is only one video, which does not explain it clearly either. I am looking for an example of how it is used in large organizations, ideally a real-world example. Thanks in advance.
The term "process variable" is a bit overloaded I guess:
The process variables that you define in the Process properties are stateful. You can use (read) their values anywhere in the process and you can change their values during the process using the Assign task (yellow diamond with a black equals sign).
The mapper activity produces a new output variable of that task, which you can only use (read) in activities that are downstream from it. You cannot change its value after the mapper activity, as with any other activity's output.
The mapper activity is mainly useful to perform complex and reusable data mappings in it rather than in the mappers of other activities. For example, you have a process that has to map its input data into a different data structure and then has to both send this via a JMS message and log it to a file. The mapper allows you to perform the mapping only once rather than doing it twice (both in the Send JMS and Write to File activity).
You'll find that in real world projects, the mapper activity is quite often used to perform data mapping independently of other activities, it just gives a nicer structure to the processes. In contrast the Process Variables defined in the Process properties together with the Assign task are used much less frequently.
Here's a very simple example, where you use the mapper activity once to set a process variable (here the filename) and then use it in two different following activities (create CSV File and Write File). Obviously, the mapper activity becomes more interesting if the mapping is not as trivial as here (though even in this simple example, you only have one place to change how the filename is generated rather than two):
Mapper Activity
First use of the filename variable in Create File
Second use of the filename variable in Write File
Process Variable/Assign Activity Vs Mapper Activity
The primary purpose of an Assign task is to store a variable at the process level. Any variable in an Assign task can be modified N times in a process. A Mapper, in contrast, is specifically used for introducing a new variable; we cannot change the same Mapper variable multiple times in a project.
Memory is allocated to a process variable when the process instance is created, but in the case of the Mapper activity, memory is allocated only when the Mapper activity is executed in the process instance.
A process variable is allocated a single slot of memory which is updated/modified throughout the process instance execution, i.e. N Assign activities will access the same memory allocated to the variable, whereas using N Mappers for the same schema will allocate N slots of memory.
An Assign activity can be used to accumulate the output of a TIBCO activity inside a group.