Difference between threaded port and @parallel annotation in IBM InfoSphere Streams

I am trying to understand the difference between a threaded port and the @parallel annotation in IBM InfoSphere Streams. I have searched many places but couldn't find an answer. As I understand it, both of them make an operator threaded, but I am not sure when and where to use each, and whether they can be used together for a performance boost. Could someone please illustrate their usage with examples?
Thanks.

Threaded ports are primarily about pipeline parallelism. When you specify a threaded port on an operator's config clause, you're telling the runtime to execute that operator with a different thread from the thread that executes the upstream operator. We call this pipeline parallelism because the operators in the pipeline are able to execute simultaneously. If the operators in your pipeline are computationally expensive enough, this can improve throughput.
The @parallel annotation is primarily about data parallelism. When you apply the @parallel(width=N) annotation to an operator invocation, Streams will replicate that operator N times. Your running application will have N copies of that operator, each receiving a different subset of the overall stream of tuples. We call this data parallelism because we're processing different data (tuples, in the case of Streams) simultaneously by replicating the operator. When you have an operator that is computationally expensive, and it's okay to process incoming tuples out of order, @parallel can improve throughput.
In practice, using the @parallel annotation will sometimes inject threaded ports into your application in order to make sure that the replicated operators actually execute in parallel. As a side effect, this also introduces some pipeline parallelism.
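To make this concrete, here is a minimal SPL sketch showing both mechanisms together (hedged: ExpensiveWork is a hypothetical stand-in for any computationally expensive operator, and the file name, queue size, and width are arbitrary):

```
composite Main {
    graph
        stream<rstring line> Lines = FileSource() {
            param
                file: "input.txt";
                format: line;
        }

        // Pipeline parallelism: the threaded port runs this operator on its
        // own thread, decoupled from FileSource by a 1000-tuple queue.
        stream<rstring line> Buffered = Functor(Lines) {
            config
                threadedPort: queue(Lines, Sys.Wait, 1000);
        }

        // Data parallelism: @parallel replicates ExpensiveWork 4 times,
        // each replica receiving a subset of the tuples.
        @parallel(width = 4)
        stream<rstring line> Results = ExpensiveWork(Buffered) {}
}
```

So yes, the two can be combined for a performance boost: the threaded port keeps the source and the downstream stage pipelined, while @parallel spreads the expensive work across replicas.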
This application demonstrates both threaded ports and @parallel: streamsx.demo.logwatch. It is the source code for the application developed in the Optimizing Streams Applications presentation.

Related

Apache Storm Message Passing Implementation (MPI)

According to the MPI implementation of Storm, the workers manage connections to other workers and maintain a mapping from task to task. Also, transferring takes in a task id and a tuple, which it serializes and puts onto a "transfer queue".
The question is whether there is a way to organise scheduling such that certain tasks of an operator communicate with only certain tasks of the following operator at a given time, according to the application's topology (could ZeroMQ possibly do something like this?).
Q : "If there is a way to organise scheduling, such that certain tasks of an operator communicate to only certain tasks of the following operator at a given time according to the application’s topology ( could ZeroMQ possibly do something like this? )."
Obviously could,it does allow smart & flexible creation of signalling/messaging meta-plane(s) infrastructure(s) for the distributed-computing, improving itself in doing this for about the last 12+ years.
The URL attached in @HristoIlliev's comment details that Apache Storm itself reports already using a ZeroMQ layer for its own services (as of ver. 0.8.0; unfortunately, almost all implementation source-code links there are already dead):
The implementation for distributed mode uses ZeroMQ
The implementation for local mode uses in-memory Java queues (so that it's easy to use Storm locally without needing to get ZeroMQ installed)
...
Tasks listen on an in-memory ZeroMQ port for messages from the virtual port
So the topology-related part of your question touches a decision that has already been made on this subject in the "outer" Apache Storm architecture:
Tasks are responsible for message routing. A tuple is emitted either to a direct stream (where the task id is specified) or a regular stream. In direct streams, the message is only sent if that bolt subscribes to that direct stream. In regular streams, the stream grouping functions are used to determine the task ids to send the tuple to.
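In Storm's Java API, those routing choices are expressed as stream groupings when the topology is wired up. A hedged sketch (the spout and bolt classes are hypothetical):

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout(), 2);

        // Regular stream: a grouping function picks the target task id.
        // fieldsGrouping sends tuples with the same "userId" to the same task.
        builder.setBolt("counter", new CountBolt(), 4)
               .fieldsGrouping("events", new Fields("userId"));

        // Direct stream: the emitter chooses the consuming task explicitly,
        // via collector.emitDirect(taskId, tuple) inside the upstream bolt.
        builder.setBolt("auditor", new AuditBolt(), 2)
               .directGrouping("counter");

        // (Submission via LocalCluster/StormSubmitter omitted in this sketch.)
    }
}
```

This is the sense in which "tasks are responsible for message routing": the grouping declared here determines, per tuple, which task ids the emitting task sends to.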
MPI has done the same for the HPC-focused computing ecosystem ever since FORTRAN jobs started to run on the first distributed HPC infrastructures. Because most HPC problems were "simply" scaled onto larger hardware footprints, MPI's focus was on the efficiency of such uniform scaling; it never visited the opposite corner of adaptive, almost ad-hoc message-passing setups and layered topologies of specialised ZeroMQ Scalable Formal Communication Pattern archetypes. Each tool thus focuses on different factors.
If you feel you want to read a bit more on ZeroMQ, this answer might help you quickly understand the core underlying concepts.

Spark: Tackle performance-intensive commands like collect(), groupByKey(), reduceByKey()

I know that some Spark actions, like collect(), cause performance issues.
This is called out in the documentation:
To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
And from one more related SE question: Spark runs out of memory when grouping by key
I have come to know that groupByKey() and reduceByKey() may cause out-of-memory errors if parallelism is not set properly.
I did not find enough guidance on which other transformations and actions have to be used with caution.
Are these three the only commands to be handled carefully? I have doubts about the commands below too:
aggregateByKey()
sortByKey()
persist() / cache()
It would be great if you could provide information on expensive commands (those that operate globally across partitions rather than within a single partition, or that are otherwise slow), which have to be handled with extra care.
You have to consider three types of operations:
transformations implemented using only mapPartitions(WithIndex), like filter, map, flatMap, etc. This is typically the safest group. The biggest issue you are likely to encounter is extensive spilling to disk.
transformations which require a shuffle. This includes obvious suspects like the different variants of combineByKey (groupByKey, reduceByKey, aggregateByKey) or join, and less obvious ones like sortBy, distinct or repartition. Without context (data distribution, exact reduction function, partitioner, resources) it is hard to tell whether a particular transformation will be problematic. There are two main factors:
network traffic and disk IO - any operation which is not performed in memory will be at least an order of magnitude slower.
skewed data distribution - if the distribution is highly skewed, a shuffle can fail or subsequent operations may suffer from suboptimal resource allocation.
operations which require passing data to and from the driver. Typically this covers actions like collect or take, and creating a distributed data structure from a local one (parallelize).
Other members of this category are broadcasts (including automatic broadcast joins) and accumulators. The total cost depends, of course, on the particular operation and the amount of data.
While some of these operations can be expensive, none is particularly bad by itself (including the much-demonized groupByKey). Obviously it is better to avoid network traffic and additional disk IO, but in practice you cannot avoid them in any complex application.
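To illustrate the shuffle and driver-interaction categories above, here is a hedged Java sketch (the input path is hypothetical): reduceByKey pre-aggregates on each partition before shuffling, and take pulls only a few elements to the driver where collect would pull everything.

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Loads master/app settings from system properties (e.g. spark-submit).
        JavaSparkContext sc = new JavaSparkContext();

        JavaPairRDD<String, Integer> pairs = sc
                .textFile("hdfs:///events.csv") // hypothetical input
                .mapToPair(line -> new Tuple2<>(line.split(",")[0], 1));

        // Shuffle with map-side combine: each partition pre-aggregates its
        // counts, so only partial sums cross the network. groupByKey would
        // instead ship every single value to the reducer's partition.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);

        // Driver-side action: take(10) fetches a handful of elements;
        // counts.collect() would pull the whole RDD into driver memory.
        for (Tuple2<String, Integer> t : counts.take(10)) {
            System.out.println(t._1() + " -> " + t._2());
        }

        sc.stop();
    }
}
```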
Regarding cache, you may find Spark: Why do i have to explicitly tell what to cache? useful.

Spring Batch: Which ItemReader implementation to use for high volume & low latency

Use case: Read 10 million rows [10 columns] from database and write to a file (csv format).
Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason?
Which would be better performing (fast) in the above use case?
Would the selection be different in case of a single-process vs multi-process approach?
In case of a multi-threaded approach using TaskExecutor, which one would be better & simple?
To process that kind of data, you're probably going to want to parallelize it if that is possible (the only thing preventing it would be if the output file needed to retain an order from the input). Assuming you are going to parallelize your processing, you are then left with two main options for this type of use case (from what you have provided):
Multithreaded step - This will process a chunk per thread until complete. This allows for parallelization in a very easy way (simply adding a TaskExecutor to your step definition; see the sketch after this list). With this, you do lose restartability out of the box, because you will need to turn off state persistence on either of the ItemReaders you have mentioned (there are ways around this, such as flagging records in the database as having been processed).
Partitioning - This breaks up your input data into partitions that are processed by step instances in parallel (master/slave configuration). The partitions can be executed locally via threads (via a TaskExecutor) or remotely via remote partitioning. In either case, you gain restartability (each step processes its own data, so there is no stepping on state from partition to partition) along with parallelization.
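As a sketch of option 1 (hedged: Spring Batch 4-style Java config; the in-memory reader is just a stand-in for the JDBC reader, which would need the saveState(false) state-persistence trade-off described above):

```java
import java.util.Arrays;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.support.ListItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
@EnableBatchProcessing
public class MultithreadedStepConfig {

    @Bean
    public Step multithreadedStep(StepBuilderFactory steps) {
        return steps.get("multithreadedStep")
                // Each chunk of 1000 items is read/processed/written on its own thread.
                .<String, String>chunk(1000)
                // Stand-in reader (not thread-safe!); the real job would use a
                // JdbcCursorItemReader/JdbcPagingItemReader with saveState(false).
                .reader(new ListItemReader<>(Arrays.asList("a", "b", "c")))
                .writer(items -> System.out.println("wrote " + items.size() + " items"))
                // Adding a TaskExecutor is all it takes to multithread the step...
                .taskExecutor(new SimpleAsyncTaskExecutor("step-"))
                // ...bounded here to 8 chunks in flight at once.
                .throttleLimit(8)
                .build();
    }
}
```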
I did a talk on processing data in parallel with Spring Batch. Specifically, the example I present is a remote partitioned job. You can view it here: https://www.youtube.com/watch?v=CYTj5YT7CZU
To your specific questions:
Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason? - Either of these two options can be tuned to meet many performance needs. It really depends on the database you're using, driver options available as well as processing models you can support. Another consideration is, do you need restartability?
Which would be better performing (fast) in the above use case? - Again, it depends on the processing model you choose.
Would the selection be different in case of a single-process vs multi-process approach? - This goes to how you manage jobs more so than what Spring Batch can handle. The question is, do you want to manage partitioning external to the job (passing in the data description to the job as parameters) or do you want the job to manage it (via partitioning).
In case of a multi-threaded approach using TaskExecutor, which one would be better & simple? - I won't deny that remote partitioning adds a level of complexity that local partitioning and multithreaded steps don't have.
I'd start with the basic step definition. Then try a multithreaded step. If that doesn't meet your needs, then move to local partitioning, and finally remote partitioning if needed. Keep in mind that Spring Batch was designed to make that progression as painless as possible. You can go from a regular step to a multithreaded step with only configuration updates. To go to partitioning, you need to add a single new class (a Partitioner implementation) and some configuration updates.
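That single new class is an implementation of the Partitioner interface. A hedged sketch that slices a numeric id range into gridSize buckets (in a real job, minId/maxId would come from MIN/MAX queries against the source table):

```java
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class IdRangePartitioner implements Partitioner {

    private final long minId; // hypothetical: SELECT MIN(id) FROM source_table
    private final long maxId; // hypothetical: SELECT MAX(id) FROM source_table

    public IdRangePartitioner(long minId, long maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long rangeSize = (maxId - minId + gridSize) / gridSize; // ceiling division
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            // Each worker step later reads these bounds (e.g. via @StepScope
            // late binding) and queries only its own slice of rows.
            ctx.putLong("fromId", minId + i * rangeSize);
            ctx.putLong("toId", Math.min(minId + (i + 1) * rangeSize - 1, maxId));
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    }
}
```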
One final note. Most of this has talked about parallelizing the processing of this data. Spring Batch's FlatFileItemWriter is not thread safe. Your best bet would be to write to multiple files in parallel, then aggregate them afterwards if speed is your number one concern.
You should profile this in order to make a choice. In plain JDBC I would start with something that (see the sketch after this list):
prepares statements with ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY. Several JDBC drivers "simulate" cursors on the client side unless you use those two options, and for large result sets you don't want that, as it will probably lead you to an OutOfMemoryError while the JDBC driver buffers the entire data set in memory. Using these options increases the chance that you get server-side cursors and have the results "streamed" to you bit by bit, which is what you want for large result sets. Note that some JDBC drivers always "simulate" cursors on the client side, so this tip might be useless for your particular DBMS.
sets a reasonable fetch size to minimize the impact of network roundtrips. 50-100 is often a good starting value for profiling. As the fetch size is only a hint, this might also be useless for your particular DBMS.
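In code, that plain-JDBC starting point looks roughly like this (a hedged sketch; the connection URL and table name are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CursorReadSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass")) {
            // Some drivers (e.g. PostgreSQL) only stream results with autocommit off.
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM big_table",       // hypothetical table
                    ResultSet.TYPE_FORWARD_ONLY,     // no scrolling back
                    ResultSet.CONCUR_READ_ONLY)) {   // read-only cursor
                ps.setFetchSize(100); // hint: ~100 rows per network roundtrip
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // map the row and append it to the CSV output here
                    }
                }
            }
        }
    }
}
```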
JdbcCursorItemReader seems to cover both things, but as said before, these settings are not guaranteed to give you the best performance in every DBMS, so I would start with it and then, if performance is inadequate, try JdbcPagingItemReader.
I don't think simple processing with JdbcCursorItemReader will be slow for your data set size, unless you have very strict performance requirements. If you really need to parallelize, using JdbcPagingItemReader might be easier, but the interfaces of those two are very similar, so I would not count on it.
Anyway, profile.

Akka actor model vs Java usage in the following scenario

I want to know the applicability of the Akka Actor model.
I know it is useful in cases where a huge number of Actor instances are created and destroyed, e.g. a call server where every incoming call creates an actor instance, communicates with a few other actors, and gets killed after the call is over.
Is it also useful in the following scenario:
A server has a few processing elements (10~50) implemented as Actors. The lifetime of these processing elements is infinite. Some of them do not maintain state, and a few maintain state. The processing elements process messages and pass them to other actors in a fixed manner. The system receives a huge number of messages from outside; each message passes through the processing elements and then leaves the system.
My gut feeling is that we cannot get any advantage from using the Akka Actor model, or even from implementing this server in Scala, because the use case Akka is designed for does not apply here. If scaling up meant that processing elements were added dynamically, then it would be applicable.
For fixed topologies, I think that if I implement it in Java it is going to be more beneficial in terms of raw performance, because the 'immutability' emphasis of Scala leads to more copies and so reduces performance. So I believe I had better stick to Java.
Is my understanding correct? In a nutshell, I want to know why I should leave Java and use Scala/Akka for the application scenario above. My target is to process 1 million messages per second.
If this question is still relevant...
Scala vs. Java
Scala makes developers more productive.
Immutability reduces debugging effort almost to zero.
The GC copes well with discarded immutable objects.
Akka Actors vs. other means
Akka has a dispatcher that distributes all tasks across a fixed thread pool. This allows available resources to be consumed evenly. This approach is much better than fixed worker threads, because processing resources are given to tasks rather than to DataFlow nodes.
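A hedged sketch of that point, using Akka's classic Java API: the long-lived stages below own no threads of their own; the dispatcher schedules whichever actor currently has mail onto its shared pool.

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class PipelineSketch {

    // A fixed, long-lived processing element that forwards downstream.
    static class Stage extends AbstractActor {
        private final ActorRef next; // null marks the last stage

        Stage(ActorRef next) { this.next = next; }

        static Props props(ActorRef next) {
            return Props.create(Stage.class, () -> new Stage(next));
        }

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(String.class, msg -> {
                        // "Process" the message, then pass it along. While idle,
                        // this actor consumes no thread: the dispatcher only
                        // schedules it when a message arrives in its mailbox.
                        if (next != null) next.tell(msg.toUpperCase(), getSelf());
                        else System.out.println("out: " + msg);
                    })
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("pipeline");
        ActorRef sink = system.actorOf(Stage.props(null), "sink");
        ActorRef head = system.actorOf(Stage.props(sink), "head");
        head.tell("hello", ActorRef.noSender());
    }
}
```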
DataFlow implementation
There is the SynapseGrid library, which is built on top of Akka Actors and allows easy construction of DataFlow systems distributed over fixed, immortal Actors. It can even draw the DataFlow diagram (in .dot format) of the whole system.
(The library is more convenient to be used with Scala.)

Storm as a replacement for Multi-threaded Consumer/Producer approach to process high volumes?

We have an existing setup where upstream systems send messages to us on a message queue and we process these messages. The content is XML and we simply unmarshal it. This unmarshalling step is followed by a write to the DB (to put the relevant values into the relevant columns).
The system is set to interface with many more upstream systems, and our volumes are going to increase to a peak of 40 million messages per day.
Our current processing approach is to have listeners on the queues, plus multiple producer and consumer threads that do the unmarshalling and the subsequent DB write.
My question : Can this process fit into the Storm use case scenario?
I mean, can MQ be my spout, with two bolts: one to unmarshal, whose output then feeds the next bolt, which does the write to the DB?
If yes, what benefit can I derive? Is it a goodbye to the cumbersome multi-threaded producer/consumer pattern of code?
If it is as simple as the above, where and why would one still resort to the conventional multi-threaded producer/consumer approach?
My point being: is there a data volume or frequency at which Storm starts to shine compared with the conventional approach?
PS: I'm very new to this, trying to get the hang of it, and want to ascertain whether this line of thinking is right.
This scenario can definitely fit into a Storm topology. The spouts can pull from MQ, and the bolts can handle the unmarshalling and subsequent processing.
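Wired up in Storm's Java API, that looks roughly like this (a hedged sketch; the spout and bolt classes are hypothetical stand-ins for your MQ consumer, unmarshaller, and DB writer):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class IngestTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: pulls the XML messages off the message queue.
        builder.setSpout("mq-spout", new MqSpout(), 2);

        // Bolt 1: unmarshals the XML; its output stream feeds the next bolt.
        builder.setBolt("unmarshal", new UnmarshalBolt(), 8)
               .shuffleGrouping("mq-spout");

        // Bolt 2: writes the extracted values to the relevant DB columns.
        builder.setBolt("db-writer", new DbWriterBolt(), 8)
               .shuffleGrouping("unmarshal");

        Config conf = new Config();
        conf.setNumWorkers(4); // scale out by raising this and the parallelism hints
        StormSubmitter.submitTopology("ingest", conf, builder.createTopology());
    }
}
```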
The major benefit over the conventional multi-threaded pattern is the ability to add more worker nodes as the load increases; this is not so easy with traditional producer/consumer designs.
As for a specific data volume number, that is a very broad question, since it depends on a large number of factors (hardware etc.).
