Let's say I have an Apache Storm topology processing some tuples. I ack most of them but sometimes, due to an error, they are not processed and therefore not acked.
What happens to these 'lost' tuples? Does Storm fail them automatically, or should I do that explicitly every time?
From Storm's docs:
http://storm.apache.org/releases/1.2.2/Guaranteeing-message-processing.html
By failing the tuple explicitly, the spout tuple can be replayed faster than if you waited for the tuple to time out (30 seconds by default).
Every tuple you process must be acked or failed. Storm uses memory to track each tuple, so if you don't ack/fail every tuple, the task will eventually run out of memory.
What happens to these 'lost' tuples? Does Storm fail them automatically?
Yes, Storm fails them automatically once the tuple timeout expires, but it is better to fail them explicitly so the spout can replay them sooner.
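The usual pattern in a bolt is to ack on success and fail explicitly on error. A minimal runnable sketch of that pattern, using a hypothetical stand-in `Collector` class instead of Storm's real `OutputCollector` so it runs standalone:

```java
import java.util.ArrayList;
import java.util.List;

public class AckFailSketch {
    // Stand-in for Storm's OutputCollector; a real bolt would call
    // collector.ack(tuple) / collector.fail(tuple) on the Storm object.
    static class Collector {
        final List<String> acked = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
        void ack(String tuple)  { acked.add(tuple); }
        void fail(String tuple) { failed.add(tuple); }
    }

    // Ack on success; fail explicitly on error so the spout can replay
    // the tuple immediately instead of waiting out the 30s timeout.
    static void execute(String tuple, Collector collector) {
        try {
            if (tuple.isEmpty()) {
                throw new IllegalArgumentException("bad tuple");
            }
            // ... real processing would happen here ...
            collector.ack(tuple);
        } catch (Exception e) {
            collector.fail(tuple);  // triggers a fast replay by the spout
        }
    }

    public static void main(String[] args) {
        Collector c = new Collector();
        execute("good", c);
        execute("", c);
        System.out.println("acked=" + c.acked + " failed=" + c.failed.size());
    }
}
```

Either way, every tuple ends up explicitly acked or failed, so nothing has to wait for the timeout.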
Related
I am a beginner with Storm. Storm's creator came up with a very impressive method for checking every bolt in a topology, which uses XOR.
But I started wondering why he did not just use a counter. When a bolt executes successfully, the counter would decrement by one, so when the counter reaches 0, the whole task is complete.
Thanks
I believe one can reason why counters are not only inefficient but an incorrect acker-tracking mechanism in an always-running topology.
A Storm tuple tree can itself be a complex DAG. When a bolt receives acks from multiple downstream sources, what is it to do with the counters? Should it increment them? Should it decrement them? In what order?
Storm tuples have random 64-bit message ids, and counters would be finite. A topology runs forever, emitting billions of tuples; how would you map the 673686557th tuple to a counter id? With XOR, you only have a single value to maintain and broadcast per spout tuple.
XOR is a hardware instruction that executes extremely efficiently. Per-tuple counters would need storage proportional to the number of in-flight tuples, would have overflow problems, and would defeat the original requirement of a solution with low space overhead.
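The XOR idea can be seen in a few lines: the acker keeps one 64-bit value per spout tuple, XORs in an edge id when a tuple is anchored and again when it is acked, and the tree is complete exactly when the value returns to zero, regardless of the order in which acks arrive. A toy simulation (not Storm's actual acker code):

```java
import java.util.Random;

public class XorAckerSketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        long ackVal = 0L;            // the acker's single 64-bit state for this tree
        long[] edgeIds = new long[5];

        // Emitting an anchored tuple XORs its random edge id into the state.
        for (int i = 0; i < edgeIds.length; i++) {
            edgeIds[i] = rnd.nextLong();
            ackVal ^= edgeIds[i];
        }

        // Acks can arrive in any order; each XORs the same id in again.
        ackVal ^= edgeIds[3];
        ackVal ^= edgeIds[0];
        ackVal ^= edgeIds[4];
        ackVal ^= edgeIds[1];
        System.out.println("all acked yet? " + (ackVal == 0)); // false: one ack missing
        ackVal ^= edgeIds[2];
        System.out.println("all acked yet? " + (ackVal == 0)); // true: tree complete
    }
}
```

Note that a counter would have needed to know the final fan-out in advance, while the XOR value works even as new children keep being anchored mid-flight.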
One of our bolts breaks a large tuple down into its children and then emits these children as tuples. There can sometimes be 10,000 children.
This bombardment of tuples chokes our topology.
Is there any cap/ceiling on the number of tuples that can be generated from one tuple in a bolt?
We need to send these children further down the topology so that the state of these children can be updated according to the state of the parent.
There is a cap where Storm's algorithm for tracking tuples breaks down, but that cap is at the point where you start to see common collisions between 64-bit random values. So no, there effectively isn't a cap.
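To put a rough number on that: with random 64-bit ids, the chance of any collision among n simultaneously in-flight tuples is approximately n²/2⁶⁵ (a birthday-bound estimate), and even that only matters if a collision lands inside the same tuple tree. Even with a billion tuples in flight at once:

```java
public class CollisionBound {
    public static void main(String[] args) {
        double n = 1e9;                             // tuples in flight at the same time
        double space = Math.pow(2, 64);             // distinct 64-bit ids
        double pCollision = (n * n) / (2 * space);  // birthday approximation n^2 / 2^65
        System.out.printf("p(any collision) ~= %.4f%n", pCollision); // ~0.027
    }
}
```

So even under an absurdly pessimistic load, the chance of any id pair colliding stays below 3%, and the chance of one actually corrupting a tree's XOR value is far smaller still.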
What you might run into is that it takes too long to process all the child tuples, so the whole tuple tree hits the tuple timeout. Either you can increase the timeout, or you can detour over e.g. Kafka so the entire processing doesn't have to happen within a single tuple tree's processing time.
A setup like
Topology A: source -> splitter -> Kafka
Topology B: Kafka -> processing
lets you process each child individually instead of having to handle all 10k tuples within the message timeout of the parent.
Please elaborate if you meant something else by your topology being choked.
Now there is a problem that puzzles me. How should I count the execution times of bolts and spouts in Storm? I have tried using a ConcurrentHashMap (considering multithreading), but it can't be done across multiple machines. Can you help me solve this problem?
Judging from your question, I think you are trying to keep track of the number of tuples executed, not the amount of time a bolt or spout takes to execute one tuple.
You can use metrics with Graphite for visualisation; it gives you time-series data.
A database can also be used for the same purpose.
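Within one worker (one JVM) a plain thread-safe counter is enough; the missing piece is that each worker only sees its own count, so the per-worker values have to be pushed to one place (Graphite, a database) and summed there. A sketch of the per-worker half, with the external-reporting step left as a comment since the sink is your choice:

```java
import java.util.concurrent.atomic.AtomicLong;

public class ExecuteCounter {
    private final AtomicLong executed = new AtomicLong();

    // Call from the bolt's execute(); safe across executor threads
    // within one worker, but NOT visible to workers on other machines.
    public void increment() { executed.incrementAndGet(); }

    // Periodically push this per-worker value to an external store
    // (Graphite, a database, ...) and aggregate the totals there.
    public long snapshot() { return executed.get(); }

    public static void main(String[] args) throws InterruptedException {
        ExecuteCounter counter = new ExecuteCounter();
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) counter.increment(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) counter.increment(); });
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("executed in this worker: " + counter.snapshot()); // 2000
    }
}
```

This is why a ConcurrentHashMap alone cannot solve the problem: it is correct per JVM, but the cross-machine aggregation has to happen in an external system.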
Recently I ran into a really strange problem. The Storm cluster has 3 machines. The topology structure is: Kafka spout A -> bolt B -> bolt C. I ack all tuples in every bolt, even though exceptions may be thrown inside a bolt (in each bolt's execute method I try/catch all exceptions and finally ack the tuple).
But here is the strange thing. Looking at the spout's log, on one machine all tuples are acked by the spout, but on the other 2 machines almost all tuples fail, and after 60 seconds each tuple is replayed again and again. 'Almost' means that at the beginning all tuples failed on those 2 machines; after a while, a small number of tuples were acked there.
The tuples clearly fail because of the timeout, but I really don't know why they time out. According to the logs I've printed, I am sure every tuple is acked at the end of the execute method in every bolt. So I want to know why some of the tuples fail on those 2 machines.
Is there anything I can do to find out what's wrong with the topology or the Storm cluster? Many thanks, and hoping for your reply.
Your problem is related to the handling of backpressure by the KafkaSpout in your Storm topology.
You can control the KafkaSpout's backpressure by setting the maxSpoutPending value in the topology configuration:
Config config = new Config();
config.setMaxSpoutPending(200);     // at most 200 unacked tuples in flight per spout task
config.setMessageTimeoutSecs(100);  // fail (and replay) tuples not fully acked within 100s
StormSubmitter.submitTopology("testtopology", config, builder.createTopology());
maxSpoutPending is the number of spout tuples that can be pending acknowledgement (per spout task) at a given time. Setting this property tells the KafkaSpout not to consume any more data from Kafka while the unacknowledged tuple count is at the maxSpoutPending value.
Also, make sure you tune your bolts to be as lightweight as possible so that tuples get acknowledged before they time out.
I am reading up on Apache Storm to evaluate whether it is suited for our real-time processing needs.
One thing that I couldn't figure out until now is: where does it store tuples while the next node is not available to process them? For example, let's say spout A produces 1000 tuples per second, but the next level of bolts (which process spout A's output) can only collectively consume 500 tuples per second. What happens to the other tuples? Does it have a disk-based buffer (or something else) to account for this?
Storm uses internal in-memory message queues. Thus, if a bolt cannot keep up with processing, the messages are buffered there.
Before Storm 1.0.0 those queues could grow without bound (i.e., you would get an out-of-memory exception and your worker would die). To protect against data loss, you need to make sure that the spout can re-read the data (see https://storm.apache.org/releases/1.0.0/Guaranteeing-message-processing.html).
You can also use the "max.spout.pending" parameter to limit the number of tuples in flight and tackle this problem.
As of Storm 1.0.0, backpressure is supported (see https://storm.apache.org/2016/04/12/storm100-released.html). This allows a bolt to notify its upstream producers to slow down if a queue grows too large (and to speed up again once the queue drains). In your spout-bolt example, the spout would slow down its emission rate in this case.
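The effect of a bounded queue between a fast producer and a slow consumer can be seen with plain Java (a toy stand-in for Storm's internal queues, not Storm code): the producer blocks as soon as the queue is full, so it is automatically throttled to the consumer's rate instead of exhausting memory.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackpressureSketch {
    public static void main(String[] args) throws InterruptedException {
        // Capacity 10 plays the role of max.spout.pending: a hard cap on in-flight tuples.
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(10);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    queue.put(i);  // blocks while the queue is full -> backpressure
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    queue.take();
                    Thread.sleep(1);  // the consumer is the bottleneck
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start(); consumer.start();
        producer.join(); consumer.join();
        System.out.println("remaining in queue: " + queue.size()); // 0
    }
}
```

An unbounded queue in the same scenario would simply grow by 500 tuples per second, which is exactly the pre-1.0.0 out-of-memory failure mode described above.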
Typically, Storm spouts read from some persistent store and track the completion of tuples to determine when it's safe to remove or ack a message in that store. Vanilla Storm itself does not persist tuples; tuples are replayed from the source in the event of a failure.
I have to agree with others that you should check out Heron. Stream-processing frameworks have advanced significantly since Storm's inception.