How to get lists of tasks associated with each executor in apache storm? - apache-storm

I am exploring Apache Storm. I need to know how I can get the list of tasks associated with each executor on a node in Apache Storm. For instance, consider a simple topology with 1 spout and 2 bolts:
Spout -> Bolt1 -> Bolt2
If there is a 3-node cluster and numWorkers = 3, is there any way to determine how tasks are associated and grouped with the executors?
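One way to observe this assignment at runtime (a sketch of mine, not from the original thread) is to log the information Storm hands each component via TopologyContext in prepare(); the bolt class name below is purely illustrative:

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class AssignmentLoggingBolt extends BaseRichBolt {
    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // The task this executor thread is preparing, and the component it belongs to:
        System.out.println("task " + context.getThisTaskId()
                + " belongs to component " + context.getThisComponentId());
        // All task ids running inside this worker JVM:
        System.out.println("tasks in this worker: " + context.getThisWorkerTasks());
        // Topology-wide mapping from task id to component id:
        System.out.println("task -> component: " + context.getTaskToComponent());
    }

    @Override
    public void execute(Tuple tuple) {
        // no-op; this bolt only logs its assignment in prepare()
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams
    }
}

With the default of one task per executor, each executor runs exactly one task id, so the per-worker task list already reveals the executor grouping; the Storm UI shows the same information per component.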

Related

Utilization imbalance in Storm Bolt Executors

I am running the Rolling Count Benchmark from this set of benchmarks. Here is the relevant piece of code:
spout = new FileReadSpout(BenchmarkUtils.ifAckEnabled(config));
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(SPOUT_ID, spout, spoutNum);
builder.setBolt(SPLIT_ID, new WordCount.SplitSentence(), spBoltNum)
.localOrShuffleGrouping(SPOUT_ID);
builder.setBolt(COUNTER_ID, new RollingCountBolt(windowLength, emitFreq), rcBoltNum)
.fieldsGrouping(SPLIT_ID, new Fields(WordCount.SplitSentence.FIELDS));
I have a three-node setup with a total of 96 cores, with spBoltNum = 6 and rcBoltNum = 6. After a run, I see a significant imbalance in the capacity metric reported for the executors of the split bolt, even though each node has 2 split bolt executors. I see the following numbers for capacity:
For split bolt executors on
Node 1 ~ 0.95
Node 2 ~ 0.7
Node 3 ~ 0.25
I do not understand this imbalance in utilization. As the grouping for the split bolt is localOrShuffleGrouping, I was expecting the capacity reported for each executor to be more or less equal. What am I missing here?
What is your spoutNum? I would assume it is 1, as FileReadSpout reads a local file (if I am not mistaken).
As your split bolt connects to the spout via localOrShuffleGrouping, some instances will be node-local to FileReadSpout and some remote. localOrShuffleGrouping prefers to send to local instances and only sends over the network if the local consumer is overloaded. Thus, your local split bolt executors get much more data than the remote ones.
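If even utilization matters more than avoiding network transfer, one option (my suggestion, not part of the original answer) is to replace localOrShuffleGrouping with shuffleGrouping, which distributes tuples uniformly over all split bolt executors regardless of locality:

builder.setBolt(SPLIT_ID, new WordCount.SplitSentence(), spBoltNum)
       .shuffleGrouping(SPOUT_ID);  // uniform distribution, at the cost of network transfer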

Apache Storm: Assigning executors to slots

I am exploring Apache Storm. I know that there is no way of determining what tasks get mapped to which node. I wanted to know if there is any way to even guess which executors are grouped together. For instance, consider a linear chain topology with 1 spout and 2 bolts:
Spout -> Bolt1 -> Bolt2
If there is a 3-node cluster and numWorkers = 3, with a combined parallelism of 9 (3 spouts + 2 x 3 bolts), is there any way to determine how executors are grouped? I have read that the default scheduler distributes the load evenly in a round-robin manner. Does that mean that all the workers will have one instance each of:
S -> B1 -> B2 executors?
For the default scheduler, you are right. If you have 3 workers, each worker will get assigned one instance of your Spout, Bolt1, and Bolt2.
The order in which the default scheduler assigns executors to workers is round robin, as you correctly stated. In more detail, the round-robin assignment for each logical operator happens for all of its executors before the scheduler considers the next logical operator. However, the order of the logical operators themselves is not fixed. See the code here for more details: https://github.com/apache/storm/tree/0.9.x-branch/storm-core/src/clj/backtype/storm/scheduler
If you want to influence this behavior, you can provide a custom scheduler. See an example here: https://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/
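For reference, a minimal custom scheduler skeleton (the class name is hypothetical) could look like the sketch below; it implements backtype.storm.scheduler.IScheduler and simply delegates to the default EvenScheduler, which is where custom assignment logic would go:

import java.util.Map;
import backtype.storm.scheduler.Cluster;
import backtype.storm.scheduler.EvenScheduler;
import backtype.storm.scheduler.IScheduler;
import backtype.storm.scheduler.Topologies;

public class MyScheduler implements IScheduler {
    @Override
    public void prepare(Map conf) {
        // no preparation needed for this sketch
    }

    @Override
    public void schedule(Topologies topologies, Cluster cluster) {
        // Custom logic would inspect cluster.getAvailableSlots() and assign
        // executors to slots via cluster.assign(...); here we just fall back
        // to the default round-robin behavior:
        new EvenScheduler().schedule(topologies, cluster);
    }
}

To activate it, the class has to be on Nimbus' classpath and registered via the storm.scheduler property in storm.yaml.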

Apache Storm: Is the topology replicated on at least one worker on a Supervisor Node?

I have just started learning about Apache Storm. One thing that I cannot understand is whether the entire topology is replicated on at least one worker process on a supervisor node. If that is the case, is a component in the topology that is very compute intensive (and possibly performs better when executed on a single machine by itself) a potential bottleneck? If not, I assume Nimbus in a way "distributes" parts of the topology across the cluster. How does it know how to optimally "distribute" the topology?
Storm does not replicate a topology. If you deploy a topology, all executor threads are distributed evenly over all workers (using a round-robin scheduling mechanism). The number of worker processes a topology uses can be configured via Config.setNumWorkers(int).
If you have a compute-intensive bolt and you want to ensure that it is deployed to its own worker, you would need to implement a custom scheduler. See here for more details: https://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/
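For completeness, the worker count is set on the topology configuration at submission time; a minimal sketch (the topology name and the builder's contents are placeholders):

Config conf = new Config();
conf.setNumWorkers(3);  // run this topology in 3 worker JVMs
StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());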

Apache Storm - Map topology with Storm cluster

I have read many sites related to Storm, but I still cannot map a topology onto a Storm cluster properly.
Please help me understand this.
In a Storm cluster there are terms like
Supervisor
Worker node
Worker process
Workers
Slots
Executor
Tasks
In a topology, there are
Spout
Bolt
Also, it is possible to configure
numWorkers
parallelism
So could someone please relate all of these things to help me?
I want to know whether each spout/bolt acts as an executor, or whether it is a task.
If a parallelism hint is given, which entity's count increases?
If numWorkers is set, which entity's count is that?
I want to understand how all of these things map onto the Storm cluster.
I have already worked on a project, so I know what a topology is.
Physical Cluster Setup:
The term node usually refers to a physical machine (or a VM) in your cluster. On each node, a supervisor runs in its own JVM. A supervisor has worker slots. This is a logical configuration and tells how many workers can be started by the supervisor. Each worker (if started) runs in its own JVM (thus, some people call it a worker process). In summary: on a node there is one supervisor JVM and up to number-of-worker-slots worker JVMs. Therefore, the node a worker JVM is running on can be called a worker node. While the supervisor is running all the time, workers are started if needed, i.e., if topologies are deployed, and stopped when a topology is killed. Within a worker, executors run as threads (i.e., each executor maps to its own thread).
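The number of worker slots is a plain configuration entry: each port listed under supervisor.slots.ports in the supervisor's storm.yaml corresponds to one slot (the ports shown are Storm's defaults):

supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703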
Logical Topology Setup:
Topologies are built from Spouts (also called sources, i.e., operators with no incoming data stream) and Bolts (regular operators with at least one incoming data stream and any number of outgoing data streams -- if there is no outgoing data stream, a Bolt is also called a sink). For each Spout/Bolt you can configure two parameters:
the number of tasks
the dop (degree of parallelism, called parallelism_hint), i.e., the number of executors you want to have for a Spout/Bolt
Tasks are logical units of work (i.e., something passive). Let's assume you use a fieldsGrouping connection pattern. Then the data stream is partitioned into number-of-tasks many sub-streams. Tasks are assigned to executors, i.e., each executor processes one or multiple tasks. This implies that you cannot have fewer tasks than executors (i.e., than the parallelism); otherwise, there would be a thread without any work to do.
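Both parameters are set when assembling the topology; a minimal sketch (the component names and MyBolt are placeholders) with 6 tasks spread over 3 executors, i.e., 2 tasks per executor:

builder.setBolt("counter", new MyBolt(), 3)  // parallelism_hint = 3 executors
       .setNumTasks(6)                       // 6 tasks, 2 per executor
       .fieldsGrouping("split", new Fields("word"));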
See the Storm documentation for further details (https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html). Furthermore, there are many other questions on SO about tasks/executors in Storm.
Last but not least, you can configure the number of workers (numWorkers) for a topology. This parameter indicates how many workers should be started to run the topology. The overall number of executors for a topology is the sum of dops over all Spouts/Bolts. All executors will be evenly distributed over all available worker JVMs.
Furthermore, a single worker can only run executors of a single topology. This is done for fault-tolerance reasons, i.e., topologies are isolated from each other. At the same time, a worker itself can run any number of executors.

How are bolts and spouts shared among workers?

Let's say that I have 2 spouts and 3 bolts in a Storm cluster and there are two worker nodes. Will these spouts and bolts be shared among the workers (for example, the first worker has 1 spout and 2 bolts, the second has 1 spout and 1 bolt), or does each worker have 2 spouts and 3 bolts, which ends up with 4 spouts and 6 bolts in the whole cluster?
Spouts and bolts are shared across the whole cluster (i.e., across the workers).
If you have 2 spouts and 3 bolts for 2 workers, they will be balanced between your 2 workers.
You can use the UI (started via storm ui) to visualize that :).
In Storm, a supervisor has multiple worker (process) slots. By default, Storm uses the even scheduler to schedule executors (the threads that execute spout/bolt logic) on the available worker slots. You can find the code for the different scheduler implementations here.
