spout sends tuples just to a subset of bolt instances - apache-storm

I'm just starting with storm. I have a simple topology of a spout (1 executor) and a bolt (4 executors). The spout and the bolt are connected via "shuffleGrouping".
From what I can see, the spout sends tuples only to a subset of bolt executors: the ones running on the same host as the spout.
Is this expected? Is there a way to spread the load across all bolt executors no matter where they run?

Yes, it's expected. One way to work around it is to use the same spout source: for example, if Storm is integrated with Kafka, you can create two topologies and configure the spout of each topology with the same topic and the same ZooKeeper host.

Sorry for digging up this old post. I had the same problem with a topology and found this post, so I want to share my research and a solution.
I upgraded a topology from Apache Storm 1.0.2 to 2.2.0. Since then, the spout has been sending tuples only to the local bolt executors despite the shuffle grouping mode.
According to this post, that seems to be expected, but there is a workaround:
topology.disable.loadaware.messaging: true
See also org.apache.storm.daemon.GrouperFactory to understand how this configuration is used.
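If you prefer to set this per topology rather than cluster-wide in storm.yaml, the same key can be put into the topology Config before submitting. Below is a minimal sketch, assuming a Storm 2.x dependency; TestWordSpout is the sample spout that ships with Storm, and PrintBolt is a trivial stand-in for a real bolt:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    public class ShuffleAcrossHostsTopology {

        // Trivial bolt that just prints each tuple it receives (illustrative only).
        public static class PrintBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                System.out.println("Got: " + tuple.getValues());
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // This bolt emits nothing.
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new TestWordSpout(), 1);
            builder.setBolt("bolt", new PrintBolt(), 4).shuffleGrouping("spout");

            Config conf = new Config();
            // Turn off load-aware messaging so the shuffle grouping spreads tuples
            // across all bolt executors instead of favoring the local ones.
            conf.put("topology.disable.loadaware.messaging", true);

            StormSubmitter.submitTopology("shuffle-demo", conf, builder.createTopology());
        }
    }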

Related

NiFi from Hadoop to Kafka with exactly once guarantee

Is it possible for NiFi to read from HDFS (or Hive) and publish data rows to Kafka with an exactly-once delivery guarantee?
Publishing to Kafka from NiFi is an at-least-once guarantee, because a failure could occur after Kafka has already received the message but before NiFi receives the response; this could be due to a network issue, or NiFi crashing and restarting at that exact moment.
In any of those cases, the flow file would be put back in the original queue before the PublishKafka processor (i.e., the session was never committed), and so it would be tried again.
Due to the threading model, where different threads may execute the processor, it can't be guaranteed that the same thread that originally did the publishing will be the thread that does the retry, and therefore NiFi can't make use of the "idempotent producer" concept.
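For context, the "idempotent producer" referred to above is a feature of the plain Kafka client: a single producer setting that deduplicates broker-side retries, but only retries issued by the same producer instance, which is exactly what NiFi's threading model cannot guarantee. A minimal sketch with the Java client, where the broker address and topic name are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class IdempotentProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");  // placeholder broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            // Deduplicates retried sends, but only from this single producer instance.
            props.put("enable.idempotence", "true");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key", "value"));
            }
        }
    }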

Storm Pacemaker with upgraded KafkaSpout

I had a question regarding the usage of Pacemaker. We have a Storm cluster currently running 1.0.2 and are in the process of migrating it to 1.2.2. We also use KafkaSpout to consume data from Kafka topics.
Now, since this release is for Kafka 0.10+, most of the load would be taken off ZK, since the offsets won't be stored in ZK.
Considering this, does it make sense for us to also start looking at Pacemaker to reduce load further on ZK?
Our cluster has 70+ supervisors and around 70 workers with a few unused slots. Also, we have around 9100+ executors/tasks running.
Another question I have is regarding the heartbeats: who sends them to whom? From what I have read, workers and supervisors send their heartbeats to ZK, which is what Pacemaker alleviates. How about the tasks? Do they also send heartbeats? If yes, is it to ZK or somewhere else? There's a config called task.heartbeat.frequency.secs which has led me to some more confusion.
The reason I ask is that if the task-level heartbeats aren't being sent to ZK, then it's pretty evident that Pacemaker won't be needed, because with no offsets being committed to ZK the load would already be reduced dramatically. Is my assessment correct, or would Pacemaker still be a feasible option? Any leads would be appreciated.
Pacemaker is an optional Storm daemon designed to process heartbeats from workers; it is implemented as an in-memory store. You could use it if ZK becomes a bottleneck as the Storm cluster scales up.
Supervisors report heartbeats to Nimbus to signal that they are alive; these are used for fault tolerance, their frequency is set via supervisor.heartbeat.frequency.secs, and they are stored in ZK.
Workers heartbeat to their supervisor; the frequency is set via worker.heartbeat.frequency.secs. These heartbeats are stored in the local file system.
task.heartbeat.frequency.secs: how often a task (executor) should heartbeat its status to the master (Nimbus). It never took effect in Storm and has been deprecated in favor of the RPC heartbeat reporting in Storm 2.0.
This heartbeat states which executors are assigned to which worker and is stored in ZK.
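If ZK does turn out to be the bottleneck, enabling Pacemaker is mostly a matter of pointing the cluster state store at the Pacemaker daemon in storm.yaml. The snippet below is a sketch based on the Storm Pacemaker documentation; the host is a placeholder and the exact key names should be verified against your release:

pacemaker.servers: ["pacemaker-host.example.com"]
storm.cluster.state.store: "org.apache.storm.pacemaker.pacemaker_state_factory"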

How many Storm topologies can be deployed on a single Storm server?

Currently I am able to deploy 24 topologies, but after that Storm runs out of workers. Can anyone suggest how to increase the number of workers in Storm?
The number of topologies that can be deployed to a cluster ultimately depends on how much CPU/RAM is available and how much CPU/RAM is required by your workers.
You can increase the number of ports in supervisor.slots.ports to have more workers on a given machine. If you define 10 ports, you can have 10 workers.
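For example, a storm.yaml entry like the following (the port numbers simply extend the conventional 6700-6703 defaults) gives each supervisor 10 worker slots:

supervisor.slots.ports: [6700, 6701, 6702, 6703, 6704, 6705, 6706, 6707, 6708, 6709]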
See also Setting up a Storm cluster.

How does Storm assign tasks to workers?

How does Storm assign tasks to its workers? How does load balancing work?
Storm assigns tasks to workers when you submit the topology via "storm jar ..."
A typical Storm cluster will have many Supervisors (aka Storm nodes). Each Supervisor node (server) will run many worker processes. The number of workers per Supervisor is determined by how many ports you assign with supervisor.slots.ports.
When the topology is submitted via "storm jar", the Storm platform determines which workers will host each of your spouts and bolts (aka tasks). The number of workers and executors that will host your topology depends on the "parallelism" you set during development, when the topology is submitted, or when you change it on a live running topology using "storm rebalance".
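To make those knobs concrete, here is a minimal sketch of setting the worker count, the executor counts (parallelism hints), and the task count at submission time; MySpout and MyBolt are hypothetical placeholders for your own components:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismExample {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // MySpout and MyBolt are placeholders for your own components.
            builder.setSpout("spout", new MySpout(), 2);   // 2 executors for the spout
            builder.setBolt("bolt", new MyBolt(), 4)       // 4 executors for the bolt...
                   .setNumTasks(8)                         // ...running 8 tasks between them
                   .shuffleGrouping("spout");

            Config conf = new Config();
            conf.setNumWorkers(3);  // request 3 worker processes across the cluster

            StormSubmitter.submitTopology("parallelism-demo", conf, builder.createTopology());
        }
    }

The worker and executor counts can later be adjusted on a running topology, e.g. storm rebalance parallelism-demo -n 4 -e bolt=8.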
Michael Noll has a great breakdown of Parallelism, Workers and Tasks in his blog post here: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/#example-of-a-running-topology

Does Storm DRPCTopology have an inbuilt queue?

I am trying to set up a Storm topology to get updates from social networks, process them, and write them to a backend. I thought about putting the data on a Kafka queue and letting a Kafka spout read from the queue. But after reading about DRPCTopology, it looks like I just need to send data to the DRPC server and it handles forwarding to the spouts. Does the DRPC server have an inbuilt queue? If so, for my use case can I use that instead of the Kafka spout?
Q: Does the DRPC server have an inbuilt queue?
Yes, the DRPC server that ships with Storm uses an internal ConcurrentLinkedQueue.
Q: So for my use case, can I use that instead of the Kafka spout?
Only if you expect relatively low volume, as the ConcurrentLinkedQueue will consume memory with no way to spill to disk the way Kafka does.
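To make the difference concrete: DRPC is a synchronous request/response path, so the caller blocks until the topology returns a result and nothing is buffered durably. A rough sketch of the client side, assuming a DRPC server at a placeholder host on the default port 3772 and a DRPC function named "process-update" registered by your topology (check the DRPCClient constructor against your Storm version):

    import java.util.Map;
    import org.apache.storm.utils.DRPCClient;
    import org.apache.storm.utils.Utils;

    public class DrpcClientExample {
        public static void main(String[] args) throws Exception {
            Map<String, Object> conf = Utils.readStormConfig();  // defaults merged with storm.yaml
            DRPCClient client = new DRPCClient(conf, "drpc-host", 3772);  // placeholder host
            // Blocks until the DRPC topology returns a result; nothing is queued durably.
            String result = client.execute("process-update", "some social update");
            System.out.println(result);
            client.close();
        }
    }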
