How does Apache Storm manage its spouts?

Let's take the example of a topology that has more than one Kafka spout (or any other kind of spout). The questions I have are:
Which spout consumes messages first, and how is this determined?
Does each spout spin up its own thread?
Do bolts need to be thread-safe if each spout has its own thread?
Is there a way to give one spout priority over another?
If there is back pressure behind one spout, is that back pressure transmitted to the other spouts as well?

Related

Complete latency for topology and spouts always zero in UI

The complete latency for the topology and spout always shows as zero, and so does the acker. The acked counts for the bolts look fine. I am using Storm 1.1.1. My topology reads a text file and classifies the text using Naive Bayes in a distributed Apache Storm environment.
I think you need to anchor the tuples in order to be able to track the latency:
http://storm.apache.org/releases/1.0.6/Guaranteeing-message-processing.html
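For reference, here is a minimal sketch of what anchoring looks like inside a bolt's execute method (the field name "text" and the pass-through logic are only illustrative):

// Passing the input tuple as the first argument to emit() anchors the new
// tuple to it, so Storm can track the tuple tree and report complete latency.
@Override
public void execute(Tuple input) {
    collector.emit(input, new Values(input.getStringByField("text"))); // anchored emit
    collector.ack(input); // ack the input only after the anchored emit
}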

Parallelism in Spouts

I am new to Storm and just getting my head around the concept of spouts and how to achieve parallelism in them.
I have defined a spout A with 3 tasks and 3 executors, and 1 bolt (let's not worry about the bolt). Let's assume each spout task
is assigned a dedicated worker. That means there are 3 spout instances ready to receive a stream. A message or stream (say X) enters the topology. How is this handled by the spout?
a. Will all the spouts receive the message X? If yes, then all 3 spouts will process it and the same message is processed multiple times, right?
b. In the above case, who decides which spout should receive this message?
c. Is it possible to balance the load across the spouts?
d. Should there be only one spout in the topology?
P.S.: Consider this a general spout, not to be confused with the Kafka spout.
Storm is just a framework; your questions are really determined by the implementation of the spout code. So, sadly, there is no way to reason about a "general spout". We have to discuss a specific one.
Let's take the Kafka spout as an example. Basically, it is no different from a normal Kafka consumer. The Kafka spout contains logic to distribute partitions across the different spout tasks, and load balancing is handled at that point: each partition is consumed by exactly one spout task, so no message is consumed more than once.
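As an illustration, here is a minimal sketch of declaring a Kafka spout with 3 executors and 3 tasks using the storm-kafka-client API (the broker address and topic name are placeholders):

// Kafka partitions are divided among the 3 spout tasks, so each partition
// has exactly one consumer and no message is delivered to two tasks.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout",
        new KafkaSpout<>(KafkaSpoutConfig.builder("broker:9092", "my-topic").build()),
        3)              // parallelism hint: 3 executors
       .setNumTasks(3); // 3 tasks, one per executor here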

Storm message failed

Recently I ran into a really strange problem. The Storm cluster has 3 machines. The topology structure is Kafka Spout A -> Bolt B -> Bolt C. I ack every tuple in every bolt, even when an exception is thrown inside the bolt (in the bolt's execute method I try/catch all exceptions and ack the tuple in the finally block).
But here is the strange thing. Looking at the spout's log, on one machine all tuples are acked by the spout, but on the other 2 machines almost all tuples fail, and after 60 seconds each failed tuple is replayed again and again. 'Almost' means that at the beginning all tuples failed on those 2 machines; after a while, a small number of tuples were acked on them.
The tuples are clearly failing because of a timeout, but I really don't know why they time out. According to the logs I've printed, I'm sure every tuple is acked at the end of the execute method in every bolt. So I want to know why some of the tuples fail on those 2 machines.
Is there anything I can do to find out what's wrong with the topology or the Storm cluster? Many thanks, and hoping for your reply.
Your problem is related to how the KafkaSpout handles back pressure in the Storm topology.
You can control the KafkaSpout's back pressure by setting the maxSpoutPending value in the topology configuration:
Config config = new Config();
config.setMaxSpoutPending(200);    // at most 200 tuples awaiting ack at any time
config.setMessageTimeoutSecs(100); // fail a tuple only after 100 seconds
StormSubmitter.submitTopology("testtopology", config, builder.createTopology());
maxSpoutPending is the number of tuples that can be pending acknowledgement in your topology at any given time. Setting this property tells the KafkaSpout not to consume any more data from Kafka while the unacknowledged tuple count has reached the maxSpoutPending value.
Also, make sure you tune your bolts to be as lightweight as possible so that tuples get acknowledged before they time out.

Where does Apache Storm store tuples before a node is available to process them?

I am reading up on Apache Storm to evaluate whether it is suited to our real-time processing needs.
One thing that I couldn't figure out so far is: where does Storm store tuples while the next node is not available to process them? For example, say spout A produces 1000 tuples per second, but the next level of bolts (which process spout A's output) can only collectively consume 500 tuples per second. What happens to the remaining tuples? Does Storm have a disk-based buffer (or something else) to account for this?
Storm uses internal in-memory message queues. Thus, if a bolt cannot keep up, messages are buffered there.
Before Storm 1.0.0 those queues could grow without bound (i.e., you would eventually get an out-of-memory exception and your worker would die). To protect against data loss, you need to make sure the spout can re-read the data (see https://storm.apache.org/releases/1.0.0/Guaranteeing-message-processing.html).
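As a minimal sketch of such a replayable spout (readFromSource() is a hypothetical helper standing in for your actual source), it keeps every emitted message until it is acked so that fail() can re-emit it:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Emitting with a message id places the tuple under Storm's ack framework;
// pending messages are dropped on ack() and re-emitted on fail().
public class ReplayableSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Map<Long, String> pending = new ConcurrentHashMap<>();
    private long nextId = 0;

    @Override
    public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String msg = readFromSource(); // hypothetical: returns one message or null
        if (msg == null) return;
        long id = nextId++;
        pending.put(id, msg);
        collector.emit(new Values(msg), id); // message id enables ack/fail tracking
    }

    @Override
    public void ack(Object id) {
        pending.remove(id); // fully processed, safe to forget
    }

    @Override
    public void fail(Object id) {
        collector.emit(new Values(pending.get(id)), id); // replay the failed message
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("msg"));
    }

    private String readFromSource() { return null; } // placeholder for a real source
}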
You can also use the max.spout.pending parameter to limit the number of tuples in flight and tackle this problem.
As of Storm 1.0.0, backpressure is supported (see https://storm.apache.org/2016/04/12/storm100-released.html). This allows a bolt to notify its upstream producers to slow down if a queue grows too large (and to speed up again once the queue drains). In your spout-bolt example, the spout would slow down its emit rate in this case.
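A minimal sketch of enabling this, assuming the topology.backpressure.enable key that Storm 1.x uses for automatic backpressure (the pending cap is an optional extra):

Config conf = new Config();
conf.put("topology.backpressure.enable", true); // Storm 1.0+ automatic backpressure
conf.setMaxSpoutPending(500);                   // optional cap on in-flight tuples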
Typically, Storm spouts read from some persistent store and track the completion of tuples to determine when it is safe to remove or ack a message in that store. Vanilla Storm itself does not persist tuples; tuples are replayed from the source in the event of a failure.
I have to agree with others that you should check out Heron. Stream processing frameworks have advanced significantly since the inception of Storm.

Remote Bolts unable to Ack messages

We have a multi-node Storm cluster and we are using Storm's guaranteed message processing to make sure that events are processed exactly once.
Our topology consists of a spout and multiple bolts (50). We have set the number of ackers equal to the number of workers. From our logs we can see that the bolts are acking, but the spout may not be receiving those acks. We do notice that the spout receives acks from bolts that are on the same node. We believe the issue is either network connectivity or acks timing out.
Any idea what could cause the remote bolts' acks not to reach the spout, and how we can debug this? Any suggestions are appreciated.
