I'm confused about the number of tasks that can work in parallel in Flink.
Can someone explain to me:
What is parallelism in a distributed system, and how does it relate to Flink terminology?
In Flink, does a parallelism of 2 mean that 2 tasks work in parallel?
In Flink, if 2 operators work separately but the parallelism of each of them is 1, does that count as parallel computation?
Is it true that in a KeyedStream, the maximum parallelism is the number of keys?
Is the current CEP engine in Flink able to work in more than 1 task?
Thank you.
Flink uses the term parallelism in a pretty standard way -- it refers to running multiple copies of the same computation simultaneously on multiple processors, but with different data. When we speak of parallelism with respect to Flink, it can apply to an operator that has parallel instances, or it can apply to a pipeline or job (composed of several operators).
In Flink it is possible for several operators to work separately and concurrently. E.g., in this job
source ---> map ---> sink
the source, map, and sink could all be running simultaneously in separate processors, but we wouldn't call that parallel computation. (Distributed, yes.)
In a typical Flink deployment, the number of task slots equals the parallelism of the job, and each slot is executing one complete parallel slice of the application. Each parallel instance of an operator chain will correspond to a task. So in the simple example above, the source, map, and sink can all be chained together and run in a single task. If you deploy this job with a parallelism of two, then there will be two tasks. But you could disable the chaining, and run each operator in its own task, in which case you'd be using six tasks to run the job with a parallelism of two.
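For example, here's a minimal sketch of that deployment choice (a recent Flink release is assumed; fromSequence needs Flink 1.12+, and the pipeline itself is illustrative):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2); // two parallel slices of the pipeline

// With chaining (the default), source -> map -> sink form one operator chain,
// so this job runs as 2 tasks (one chain per parallel slice).
env.fromSequence(0, 999)                          // parallel source
   .map(n -> "line-" + n).returns(Types.STRING)   // map
   .print();                                      // sink

// env.disableOperatorChaining(); // uncomment to give each operator its own
//                                // task: 3 operators x parallelism 2 = 6 tasks
env.execute("chaining-demo");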
Yes, with a KeyedStream, the number of distinct keys is an upper bound on the parallelism.
CEP can run in parallel if it is operating on a KeyedStream (in which case, the pattern matching is being done independently for each key).
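As a minimal hedged sketch of that (the Event type, its accessors, and the pattern are assumptions, not part of the question):

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;

// Assume a POJO Event with getId() and getValue(), and an existing stream:
DataStream<Event> events = ...; // e.g., from a Kafka source

Pattern<Event, ?> spike = Pattern.<Event>begin("high")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) { return e.getValue() > 10.0; }
        })
        .next("low")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) { return e.getValue() < 5.0; }
        });

// Keying the stream first lets the matching run with parallelism > 1:
// the pattern state is kept and matched independently per key.
PatternStream<Event> matches = CEP.pattern(events.keyBy(Event::getId), spike);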
I am new to Flink and I'm trying to understand a few things. I've got a theory which I am trying to confirm. It goes like this:
1. Parallelism refers to how many parallel "machines" (these could be threads or different machines, as I understand it; correct me if I'm wrong) will run my job.
2. By default, Flink will partition the stream in a round-robin manner to take advantage of the job's parallelism.
3. If the programmer defines a partitioning strategy (for example with keyBy), then this strategy will be followed instead of the default round-robin.
4. If the parallelism is set to 1, then partitioning the stream will not have any effect on processing speed, as the whole stream will end up being processed by the same machine. In this case, the only benefit of partitioning a stream (with keyBy) is that the stream can be processed in a keyed context.
5. keyBy guarantees that elements with the same key (the same group) will be processed by the same "machine", but it doesn't mean that this machine will only process elements of that group. It could process elements from other groups as well, but it processes each group as if it were the only one, independently from the others.
6. Setting a parallelism of 3 while the maximum number of partitions that my partitioning strategy can spawn is 2 is kind of meaningless, as only 2 of the 3 "machines" will end up processing the two partitions.
Can somebody tell me if those points are correct? Correct me if I'm wrong please.
Thank you in advance for your time
I think you've got it. To expand on point 6: If your job uses a keyBy to do repartitioning, as in
source
.keyBy(...)
.window(...)
.sinkTo(...)
then in a case where the source is a Kafka topic with only 2 partitions, the source operator will only have 2 active instances, but for the window and sink all 3 instances will have meaningful work to do (assuming there are enough distinct keys).
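Here's a hedged sketch of such a job using the newer KafkaSource API (the broker address, topic name, key extraction, and reduce logic are all illustrative assumptions):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(3); // 3 slices, even though the topic has only 2 partitions

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setTopics("events") // assumed topic with 2 partitions
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka")
   // Only 2 of the 3 source instances will read anything, but the keyBy
   // redistributes records so all 3 window/sink instances can be kept busy.
   .keyBy(line -> line.split(",")[0])
   .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
   .reduce((a, b) -> a + "," + b)
   .print(); // stand-in for the sink

env.execute("repartitioning-demo");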
Also, while we don't talk about it much, there's also horizontal parallelism you can exploit. For example, in the job outlined above, the source task will run in one Java thread, and the task with the window and sink will run in another thread. (These are separate tasks because the keyBy forces a network shuffle.) If you give each task slot enough hardware resources, then these tasks will be able to run more-or-less independently (there's a bit of coupling, since they're in the same JVM).
I am new to Apache Storm and am wondering how the parallelism hint works.
For example: we have one stream containing two tuples, <4> and <6>, a spout with only one task per executor, and a bolt that performs some operation on the tuples and has a parallelism hint of 2, so we have two executors of this bolt, namely A and B. Regarding this, I have 3 questions:
1. Considering the above scenario, is it possible that the tuple containing value 4 is processed by A and the tuple containing value 6 is processed by B?
2. If processing is done in this manner, i.e., as described in question (1), won't it impact operations in which sequence matters?
3. If processing is not done in this manner, meaning both tuples go to the same executor, what is the benefit of parallelism?
Considering the above scenario, is it possible that the tuple containing value 4 is processed by A and the tuple containing value 6 is processed by B?
Yes.
If processing is done in this manner, i.e., as described in question (1), won't it impact operations in which sequence matters?
It depends. You most likely have control over the sequence of the tuples in your spout. If sequence matters, it is advisable to either reduce the parallelism or use a fields grouping, to make sure that tuples which depend on each other go to the same executor. If sequence does not matter, use shuffleGrouping or localOrShuffleGrouping to get the benefits of parallel processing.
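For instance (a sketch; the spout and bolt classes are hypothetical placeholders):

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("numbers", new NumberSpout()); // hypothetical spout

// fieldsGrouping: tuples with the same "key" field always reach the same
// executor, so per-key ordering is preserved.
builder.setBolt("ordered", new OrderSensitiveBolt(), 2) // hypothetical bolt
       .fieldsGrouping("numbers", new Fields("key"));

// shuffleGrouping: tuples are spread across both executors for maximum
// parallelism, with no ordering guarantee between executors.
builder.setBolt("unordered", new StatelessBolt(), 2) // hypothetical bolt
       .shuffleGrouping("numbers");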
If processing is not done in this manner, meaning both tuples go to the same executor, what is the benefit of parallelism?
If both tuples go to the same executor, there is no benefit, obviously.
I'm developing a Flink toy application on my local machine before deploying the real one on a real cluster.
Now I have to determine how many nodes the cluster needs, but I'm still a bit confused about how many nodes I should count on to execute my application.
For example, if I have the following code (from the doc):
DataStream<String> lines = env.addSource(new FlinkKafkaConsumer<>()...);
DataStream<Event> events = lines.map((line) -> parse(line));
DataStream<Statistics> stats = events
    .keyBy("id")
    .timeWindow(Time.seconds(10))
    .apply(new MyWindowAggregationFunction());
stats.addSink(new RollingSink(path));
Does this mean that operations "on the same line" are executed on the same node? (It sounds a bit strange to me.)
Some things I'd like to confirm:
If the answer to the previous question is yes, and I set the parallelism to 1, can I establish how many nodes I need by counting how many operations I have to perform?
If I set the parallelism to N but have fewer than N nodes available, will Flink automatically scale the processing across the available nodes?
I don't think my throughput and data load are relevant here; the load is not heavy.
If you haven't already, I recommend reading https://ci.apache.org/projects/flink/flink-docs-release-1.3/concepts/runtime.html, which explains how the Flink runtime is organized.
Each task manager (worker node) has some number of task slots (at least one), and a Flink cluster needs exactly as many task slots as the highest parallelism used in the job. So if the entire job has a parallelism of one, then a single node is sufficient. If the parallelism is N and fewer than N task slots are available, the job can't be executed.
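As a back-of-the-envelope sketch (the numbers are illustrative, not taken from the question):

// A job whose highest parallelism is 4 needs at least 4 task slots, e.g.
// two task managers each configured with taskmanager.numberOfTaskSlots: 2.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4); // job-wide default; individual operators can override it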
The Flink community is working on dynamic rescaling, but as of version 1.3, it's not yet available.
Let's say I have two RDDs with sizes M1 and M2, distributed equally into p partitions.
I'm interested in knowing (theoretically / approximately) the cost of operations such as filter, map, leftOuterJoin, ++, reduceByKey, etc.
Thanks for the help.
To measure the cost of execution, it is important to understand how Spark execution works.
In a nutshell, when you execute a set of transformations on your RDDs, Spark creates an execution plan (a.k.a. the DAG) and groups the transformations into stages, which are executed once you trigger an action.
Operations like map/filter/flatMap are grouped together to form one stage, since they do not incur a shuffle, while operations like join and reduceByKey create new stages because they involve moving data across executors. Spark executes an action as a sequence of stages (executed sequentially, or in parallel if they are independent of each other), and each stage is executed as a number of parallel tasks, where the number of tasks running at a time depends on the partitions of the RDD and the resources available.
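For example, a hedged sketch in the Java RDD API (the data and functions are illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName("stages").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaPairRDD<String, Integer> counts = sc
        .parallelize(Arrays.asList("a", "b", "a", "c"))
        .filter(s -> !s.isEmpty())          // narrow: stays in stage 1
        .mapToPair(s -> new Tuple2<>(s, 1)) // narrow: stays in stage 1
        .reduceByKey(Integer::sum);         // wide: the shuffle starts stage 2

counts.collect(); // the action triggers execution of the whole DAG
sc.close();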
The best way to measure the cost of your operations is to look at the Spark UI (by default at localhost:4040 if you are running in local mode). You'll find several tabs at the top of the page; clicking on one takes you to a page showing the corresponding metrics.
Here is what I do to measure the performance:
Cost of a Job => Sum of costs of executing all its stages.
Cost of a Stage => Mean of the costs of executing the parallel tasks in the stage.
Cost of a Task => By default, a task consumes one CPU core. Memory consumption is shown in the UI and depends upon the size of your partition.
It is really difficult to derive metrics for each individual transformation within a stage, since Spark combines these transformations and executes them together on each partition of the RDD.
I am trying to learn about the parallelism and scalability features offered by Storm, and I have read the following article: http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html. I am confused about whether Storm supports data parallelism or task parallelism. What I understand (I may be wrong) is that Storm supports task parallelism, since the degree of parallelism is restricted by the number of tasks in the topology. If this is the case, how can it be used for large-scale parallel data processing, which requires data parallelism?
Any help would be greatly appreciated. Thanks :)
Storm does not follow textbook terminology. In fact, Storm supports data, task, and pipelined parallelism.
If you have an operator and assign it a parallelism larger than one (parallelism_hint), you get as many threads as specified by the parameter, each executing the same code on different data, i.e., you get data parallelism. You can further set the parameter number_of_tasks (which must be >= parallelism_hint) to split the input data into number_of_tasks partitions/substreams (i.e., more partitions than executors). Some executor threads then need to process multiple partitions/substreams (called tasks in Storm). This does not increase the parallelism (only, perhaps, the concurrency), but it does allow the number of executors to be changed at runtime.
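In code, that looks roughly like this (component names and classes are hypothetical; the pattern follows the Storm parallelism docs):

import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new SentenceSpout()); // hypothetical spout

// parallelism_hint = 2 executors, number_of_tasks = 4, so each executor runs
// 2 tasks. A later rebalance can spread the 4 tasks over up to 4 executors,
// but never more than the fixed number of tasks.
builder.setBolt("split", new SplitBolt(), 2) // hypothetical bolt
       .setNumTasks(4)
       .shuffleGrouping("spout");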
As you have multiple spouts and bolts in your topology, and all those spouts and bolts are executed in different threads and even on different machines, you have task parallelism here (not to be confused with Storm's usage of the term task!). As there are producer/consumer relationships between spouts and bolts, you also get pipeline parallelism here, which is a special form of task parallelism. Another form of task parallelism in Storm is the ability to run multiple topologies at the same time.