I just wanted to know the actual relevance of using tasks in Storm with respect to output or performance, since tasks have nothing to do with parallelism. Will choosing more than 1 task for a component change the output in any way, and if so, what will the flow be? Or, if I choose a number of tasks greater than the number of executors, how does that affect the flow or the output? (Here I am just considering the basic word count example.)
It would be very helpful if anybody could explain this to me, with or without an example.
For example, say:
I have a topology with 3 bolts and 1 spout, and I have specified only 2 worker ports, which means that all 4 components (1 spout and 3 bolts) will run on those workers only. Now I have set 2 executors for the 1st bolt, which means 2 threads of that bolt will run in parallel. If I now set the number of tasks to 3, what difference will that make, whether in output or performance?
And if I have specified fields grouping, will the grouping be preserved across the different executors? (Please correct me if I'm wrong.)
Did you read this article? https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
To pick up your example: if you set #tasks=3 and specify 2 executors using fieldsGrouping, the data will be partitioned into 3 substreams (= #tasks). Two substreams go to one executor and the third to the other executor. However, using 3 tasks and 2 executors allows you to increase the number of executors to 3 later using the rebalance command.
As long as you do not want to increase the number of executors during execution, #tasks should be equal to #executors (i.e., just don't specify #tasks).
For your example (if you don't want to change the parallelism at runtime), you will most likely get an imbalanced workload across the two executors (one executor processes 33% of the data, the other 66%). However, this is only a problem in this special case and not in general. If you instead assume you have 4 tasks, each executor processes 2 substreams and no imbalance occurs.
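For reference, here is a minimal sketch of how such a setup is declared with Storm's TopologyBuilder; WordSpout and WordCountBolt are hypothetical classes, the parallelism hint sets the number of executors, and setNumTasks sets the number of tasks:

import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();

// Spout with 1 executor (and, by default, 1 task).
builder.setSpout("word-spout", new WordSpout(), 1);

// Bolt with 2 executors (threads) but 3 tasks; fieldsGrouping on "word"
// guarantees the same word always goes to the same task.
builder.setBolt("count-bolt", new WordCountBolt(), 2)
       .setNumTasks(3)
       .fieldsGrouping("word-spout", new Fields("word"));

Config conf = new Config();
conf.setNumWorkers(2);   // the 3 executors are spread over 2 worker processes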
I have a Spark cluster (Dataproc) with a master and 4 workers (2 preemptible). In my code I have something like this:
JavaRDD<MyData> rdd_data = javaSparkContext.parallelize(myArray);
rdd_data.foreachPartition(partitionOfRecords -> {
    while (partitionOfRecords.hasNext()) {
        MyData d = partitionOfRecords.next();
        LOG.info("my data: " + d.getId().toString());
    }
});
myArray is composed of 1200 MyData objects.
I don't understand why Spark uses only 2 cores and divides my array into 2 partitions instead of using all 16 cores.
Do I need to set the number of partitions?
Thanks in advance for any help.
Generally it's always a good idea to specify the number of partitions as the second argument to parallelize, since the optimal slicing of your dataset should really be independent of the particular shape of the cluster you're using, and Spark can at best use the current sizes of executors as a "hint".
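For example, a minimal sketch of that change for the code above (48 is just an illustrative partition count, roughly 3x the 16 cores in this cluster):

// Slice the collection explicitly instead of relying on defaultParallelism.
JavaRDD<MyData> rdd_data = javaSparkContext.parallelize(myArray, 48);

// Sanity check: prints 48 regardless of how many executors
// dynamic allocation has granted so far.
LOG.info("partitions: " + rdd_data.getNumPartitions());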
What you're seeing here is that Spark will default to asking taskScheduler for current number of executor cores to use as the defaultParallelism, combined with the fact that in Dataproc Spark dynamic allocation is enabled. Dynamic allocation is important because otherwise a single job submitted to a cluster might just specify max executors even if it sits idle and then it will prevent other jobs from being able to use those idle resources.
So on Dataproc, if you're using default n1-standard-4, Dataproc configures 2 executors per machine and gives each executor 2 cores. The value of spark.dynamicAllocation.minExecutors should be 1, so your default job, upon startup without doing any work, would sit on 1 executor with 2 cores. Then taskScheduler will report that 2 cores are currently reserved in total, and therefore defaultParallelism will be 2.
If you had a large cluster and you were already running a job for a while (say, you have a map phase that runs for longer than 60 seconds), you'd expect dynamic allocation to have taken all available resources, so the next step of the job that uses defaultParallelism would then presumably be 16, which is the total number of cores on your cluster (or possibly 14, if 2 are consumed by an appmaster).
In practice, you probably want to parallelize into a larger number of partitions than the total cores available anyway. Then if there's any skew in how long each element takes to process, you get nice balancing: fast tasks finish and their executors start taking on new partitions while the slow ones are still running, instead of everything waiting for a single slowest partition to finish. It's common to choose a number of partitions anywhere from 2x the number of available cores to 100x or more.
Here's another related StackOverflow question: spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit
Let's say I have 8 task managers with 16 task slots. If I submit a job using the Jobmanager UI and set the parallelism to 8, do I only utilise 8 task slots?
What if I have 8 task managers with 8 slots, and submit the same job with a parallelism of 8? Is it exactly the same thing? Or is there a difference in the way the data is processed?
Thank you.
The total number of task slots in a Flink cluster defines the maximum parallelism, but the number of slots used may exceed the actual parallelism. Consider, for example, a job in which every operator runs at the base parallelism except for the sink, which has a parallelism of one.
If this job is run with a parallelism of two in a cluster with 2 task managers, each offering 3 slots, the scheduler will use 5 task slots.
However, if the base parallelism is increased to six, the scheduler will spread the subtasks across all six slots (note that the sink remains at a parallelism of one in this example).
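As a rough sketch of how a job like that might be declared (the source, mapper, and sink classes are hypothetical; only the parallelism settings matter here):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(6);                  // base parallelism for all operators

env.addSource(new MySource())           // 6 parallel subtasks
   .map(new MyMapper())                 // 6 parallel subtasks
   .addSink(new MySink())
   .setParallelism(1);                  // the sink is pinned to a single subtask

env.execute("slot-example");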
See Flink's Distributed Runtime Environment for more information.
According to the official documentation:
How many instances to create for a spout/bolt. A task runs on a thread with zero or more other tasks for the same spout/bolt. The number of tasks for a spout/bolt is always the same throughout the lifetime of a topology, but the number of executors (threads) for a spout/bolt can change over time. This allows a topology to scale to more or less resources without redeploying the topology or violating the constraints of Storm (such as a fields grouping guaranteeing that the same value goes to the same task)
My questions are:
Under what circumstances would I choose to run multiple tasks in one executor?
If I do use multiple tasks in one executor, what might be reasons to choose a different number of tasks per executor for my spout than for my bolt (such as 2 tasks per bolt executor but only 1 task per spout executor)?
I thought https://stackoverflow.com/a/47714449/8845188 was a fine answer, but I'll try to reword it as examples:
The number of tasks for a component (e.g. spout or bolt) is set in stone when you submit the topology, while the number of executors can be changed without redeploying the topology. The number of executors is always less than or equal to the number of tasks for a component.
Question 1
You wouldn't normally have a reason to choose running e.g. 2 tasks in 1 executor, but if you currently have a low load but expect a high load later, you may choose to submit the topology with a high number of tasks but a low number of executors. You could of course just submit the topology with as many executors as you expect to need, but using many threads when you only need a few is inefficient due to context switching and/or potential resource contention.
For example, let's say you submit your topology so the spout has 4 tasks and 4 executors (one per executor). When your load increases, you can't scale further because 4 is the maximum number of executors you can have. You now have to redeploy the topology in order to scale with the load.
Let's say instead you submit your topology so the spout has 32 tasks and 4 executors (8 per). When the load increases, you can increase the number of executors to 32, even though you started out with only 4. You can do this scaling up without redeploying the topology.
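For instance, that scale-up could be done with Storm's rebalance command (the topology and component names here are just placeholders for whatever you used when submitting):

# Grow the spout from 4 to 32 executors without redeploying the topology;
# the 32 tasks get redistributed across the new threads.
storm rebalance mytopology -e myspout=32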
Question 2
Let's say your topology has a spout A, and a bolt B. Let's say bolt B does some heavyweight work (e.g. can do 10 tuples per executor per second), while the spout is lightweight (e.g. can do 1000 tuples per executor per second). Let's say your load is initially 20 messages per second into the topology, but you expect that to grow.
In this case it makes sense that you might configure your spout with 1 executor and 1 task, since it's likely to be idle most of the time. At the same time you want to configure your bolt with a high number of tasks so you can scale the number of executors for it, and at least 2-3 executors to start.
Config#TOPOLOGY_TASKS -> How many tasks to create per component.
A task performs the actual data processing and is run within its parent executor’s thread of execution. Each spout or bolt that you implement in your code executes as many tasks across the cluster.
The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads <= #tasks.
By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread (which is usually what you want anyways).
Also be aware that:
The number of executor threads can be changed after the topology has been started.
The number of tasks of a topology is static.
There is another situation in which having more tasks than executors can make sense.
Let's suppose you have 2 tasks of the same bolt running on a single executor (thread), and that the bolt calls a relatively long-running (maybe 1 second) database subroutine whose result is needed before proceeding further.
Case 1 - Your database call runs on the executor thread, so the thread blocks while it waits, and you gain nothing by running 2 tasks.
Case 2 - You refactor your database call to run on a newly spawned thread. In this case your main executor thread does not hang; it can start processing the second bolt task while the spawned thread fetches data from the database.
Unless you introduce your own parallelism within the component like this, I see no performance gain and no reason to run multiple tasks, apart from the maintenance reasons mentioned in the other answers.
I am exploring Apache Storm. I know that there is no way of determining what tasks get mapped to which node. I wanted to know if there is any way to even guess which executors are grouped together. For instance, consider a linear chain topology with 1 spout and 2 bolts:
Spout -> Bolt1 -> Bolt2
If there is a 3 node cluster, and numworkers = 3, with combined parallelism = 9 (3 spouts + 2 x 3 bolts), is there any way of determining how executors are grouped? I have read that the default scheduler distributes the load evenly in a round robin manner. Does it mean that all the workers will have one instance each of:
S -> B1 -> B2 executors?
For the default scheduler, you are right. If you have 3 workers, each worker will get assigned one instance of your Spout, Bolt1, and Bolt2.
The order in which the default scheduler assigns executors to workers is round robin, as you stated correctly. In more detail, the round robin assignment for each logical operator happens for all of its executors before the scheduler considers the next logical operator. However, the order of the logical operators themselves is not fixed. See the code here for more details: https://github.com/apache/storm/tree/0.9.x-branch/storm-core/src/clj/backtype/storm/scheduler
If you want to influence this behavior, you can provide a custom scheduler. See an example here: https://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/
I have a question related to Storm functionality. Suppose I have a spout which reads a CSV file and emits records chunk by chunk; that is, it emits 100 records at a time to the bolt.
My question is whether a single chunk, when received by the bolt, will be sent to only one executor or will be divided among different executors for the sake of parallelism.
Note : The bolt has 5 executors.
What do you mean by "it emits 100 records at a time"? Does it mean that a single tuple contains 100 CSV lines, or do you emit 100 tuples (each containing a single CSV line) in a single nextTuple() call?
In the first case, Storm cannot parallelize the 100 lines within a single tuple. Storm can only send different tuples to different executors.
In the second case, Storm will send the 100 tuples to different executors (depending, of course, on the connection pattern you have chosen).
One side remark: it is considered bad practice to emit multiple tuples in a single call to nextTuple(). If nextTuple() blocks for any reason, the spout thread is blocked and cannot (for example) react to incoming acks. Best practice is to emit a single tuple per call to nextTuple(); if no tuple is available to be emitted, return without emitting rather than blocking until one becomes available.
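A minimal sketch of that pattern inside a spout (the pendingLines queue and the use of the line itself as the message id are hypothetical details):

@Override
public void nextTuple() {
    String line = pendingLines.poll();      // non-blocking read from an in-memory queue
    if (line == null) {
        return;                             // nothing to emit right now; don't block
    }
    // Emit exactly one tuple per call, anchored with a message id so acks/fails work.
    collector.emit(new Values(line), line);
}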
Executor = thread.
If you do not explicitly configure the number of tasks (instances), Storm will run 1 task per executor by default. So in practice there are 5 different instances of the bolt running, handled by 5 different threads (1 thread handling 1 task).
So ideally the tuples you emit will be processed by 5 different threads simultaneously.