I have a Storm topology where I have to send output to Kafka as well as update a value in Redis. For this I have a KafkaBolt as well as a RedisBolt.
Below is what my topology looks like -
tp.setSpout("kafkaSpout", kafkaSpout, 3);
tp.setBolt("EvaluatorBolt", evaluatorBolt, 6).shuffleGrouping("kafkaSpout");
tp.setBolt("ResultToRedisBolt", ResultsToRedisBolt, 3).shuffleGrouping("EvaluatorBolt", "ResultStream");
tp.setBolt("ResultToKafkaBolt", ResultsToKafkaBolt, 3).shuffleGrouping("EvaluatorBolt","ResultStream");
The problem is that both of the end bolts (Redis and Kafka) listen to the same stream from the preceding bolt (ResultStream), so each can fail independently of the other. What I really need is to update the value in Redis only if the result was successfully published to Kafka. Is there a way to have an output stream from a KafkaBolt that carries the messages successfully published to Kafka? I could then subscribe my RedisBolt to that stream and act accordingly.
It is not currently possible, unless you modify the bolt code. You would likely be better off changing your design slightly, since doing extra processing after the tuple is written to Kafka has some drawbacks. If you write the tuple to Kafka and you fail to write to Redis, you will get duplicates in Kafka, since the processing will start over at the spout.
It might be better, depending on your use case, to write the result to Kafka, and then have another topology read the result from Kafka and write to Redis.
If you still need to be able to emit new tuples from the bolt, it should be pretty easy to implement. The bolt recently got the ability to add a custom Producer callback, so we could extend that mechanism.
See the discussion at https://github.com/apache/storm/pull/2790#issuecomment-411709331 for context.
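The ordering the question asks for (touch Redis only once the Kafka publish has been confirmed) can be modelled without any Storm or Kafka classes. The following is a minimal plain-Java sketch, not the KafkaBolt API: PublishThenUpdateModel, publish and updateStore are hypothetical stand-ins for the producer call and the Redis update.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java model: the store update runs only after a confirmed publish.
// 'publish' and 'updateStore' are hypothetical stand-ins, not Storm/Kafka APIs.
class PublishThenUpdateModel {
    // Returns true only when the publish succeeded and the update ran
    // without throwing; returns false in every other case.
    static boolean process(String result,
                           Function<String, CompletableFuture<Boolean>> publish,
                           Consumer<String> updateStore) {
        return publish.apply(result)
                .thenApply(published -> {
                    if (published) {
                        // Reached only after the publish was confirmed.
                        updateStore.accept(result);
                    }
                    return published;
                })
                // A failed publish (or a failed update) is reported as false.
                .exceptionally(error -> false)
                .join();
    }
}
```

Note that this only models the ordering. It does not remove the drawback described above: if the update fails after a successful publish and the tuple is replayed from the spout, the result is published a second time.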
Our Apache Storm topology consumes messages from Kafka using KafkaSpout and, after a lot of mapping, reducing, enrichment and aggregation, finally inserts data into Cassandra. There is a second Kafka input on which we receive user queries for data; if the topology finds a response, it sends it to a third Kafka topic. Now we want to write an E2E test using JUnit in which we programmatically insert data into the topology, then insert a user query message, and assert that the response received on the third topic is correct.
To achieve this, we thought of starting EmbeddedKafka and CassandraUnit, replacing the actual Kafka and Cassandra with them, and then starting the topology within this single JUnit test.
But this approach doesn't fit well with JUnit because it makes the tests too bulky. Starting Kafka, Cassandra and the topology is time-consuming and uses a lot of resources. Is there anything in Apache Storm that supports the kind of testing we are planning to write?
There are a number of options you have here, depending on what kind of slowdown you can live with:
As you mentioned, you can start Kafka, Cassandra and the topology. This is the slowest option, and the "most realistic".
Start Kafka and Cassandra once, and reuse them for all the tests. You can do the same with the Storm LocalCluster. It is likely faster to clear Kafka/Cassandra between each test (e.g. deleting all topics) instead of restarting them.
Replace the Kafka spouts/bolts and Cassandra bolt with stubs in test. Storm has a number of tools built in for stubbing bolts and spouts, e.g. the FixedTupleSpout, FeederSpout, the tracked topology and completable topology functionality in LocalCluster. This way you can insert some fixed tuples into the topology, and do asserts about which tuples were sent to the Cassandra bolt stub. There are examples of some of this functionality here and here
Finally you can of course unit test individual bolts. This is the fastest kind of test. You can use Testing.testTuple to create test tuples to pass to the bolt.
My topology looks like this :
Data_Enrichment_Persistence_Topology
So basically the problem I am trying to solve here is that every time an issue occurs in the Stop or Load service bolts and a tuple fails, the tuple is replayed and the spout re-emits it. This makes the Cassandra bolt reprocess the tuple and rewrite the data.
I cannot make the tuples in the Load and Stop bolts unanchored, as I need them to be replayed in case of any failure. However, I only want the upper workflow to be replayed.
I am using a KafkaSpout to emit data (it is emitting on the "default" stream). I am not sure how to duplicate the streams at the KafkaSpout's emit level.
If I can duplicate the streams, a replay on either of the two will only re-emit the message on that particular stream at the spout level, leaving the other stream untouched, right?
TIA!
You need to use two output streams in your spout, one for each downstream path. Furthermore, you emit each tuple to both streams, using a different message ID for each.
Thus, if one fails, you can replay the tuple on just that stream.
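The bookkeeping behind this can be sketched in plain Java. This is a model of the idea only, with no Storm classes: TwoStreamSpoutModel, the stream names cassandraStream and serviceStream, and the "stream:recordKey" message ID format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a spout that emits every record on two streams,
// each with its own message ID. No Storm classes are used.
class TwoStreamSpoutModel {
    // One emitted tuple: which stream it went to and what it carried.
    static final class Emission {
        final String stream;
        final String payload;
        Emission(String stream, String payload) {
            this.stream = stream;
            this.payload = payload;
        }
    }

    // Hypothetical stream names, one per downstream path.
    static final String CASSANDRA_STREAM = "cassandraStream";
    static final String SERVICE_STREAM = "serviceStream";

    // Message ID -> emission that has not been acked yet.
    private final Map<String, Emission> pending = new HashMap<>();
    // Everything emitted so far, in order (stands in for the output collector).
    final List<Emission> emitted = new ArrayList<>();

    // Emit the record once per stream; the message ID encodes the stream.
    void nextRecord(String recordKey, String payload) {
        emit(CASSANDRA_STREAM, recordKey, payload);
        emit(SERVICE_STREAM, recordKey, payload);
    }

    private void emit(String stream, String recordKey, String payload) {
        Emission e = new Emission(stream, payload);
        pending.put(stream + ":" + recordKey, e);
        emitted.add(e);
    }

    // The tuple tree for this message ID completed: forget it.
    void ack(String messageId) {
        pending.remove(messageId);
    }

    // The tuple tree for this message ID failed: replay on its own stream only.
    void fail(String messageId) {
        Emission e = pending.get(messageId);
        if (e != null) {
            emitted.add(e);
        }
    }

    // Counts how many emissions in the list went to the given stream.
    static long countOn(List<Emission> emissions, String stream) {
        return emissions.stream().filter(e -> e.stream.equals(stream)).count();
    }
}
```

Because each message ID belongs to exactly one stream, a failure reported for "serviceStream:r1" re-emits only on serviceStream, and the Cassandra path never sees the record again.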
This sounds like a stupid question, but it does solve certain problems if it's possible.
Say my topology has only 1 spout and 1 bolt. Of course, the spout is upstream of the bolt. Is it possible for the bolt to define a stream such that the data emitted to this stream is received by other instances of the same bolt?
I am not sure what you mean by "other instance of the bolt". However, it seems you want to define a cyclic topology, and yes, this is possible in Storm. Of course, you need to be careful not to spin tuples through the cycle forever...
Nothing special is needed to do this. Just subscribe to the bolt's output stream as you would to any other:
builder.setSpout("spout", new MySpout());
builder.setBolt("bolt", new MyBolt())
.shuffleGrouping("spout")
.shuffleGrouping("bolt");
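One way to keep tuples from spinning through the cycle forever is to carry a hop counter in each tuple and stop re-emitting once it reaches zero. The following plain-Java sketch models that idea without Storm classes; LoopingBoltModel, LoopTuple and hopsLeft are hypothetical names, and the queue stands in for the bolt's input.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Plain-Java model of a bolt that feeds its own output back into itself.
// A hop counter carried in each tuple guarantees the cycle terminates.
class LoopingBoltModel {
    static final class LoopTuple {
        final String value;
        final int hopsLeft;
        LoopTuple(String value, int hopsLeft) {
            this.value = value;
            this.hopsLeft = hopsLeft;
        }
    }

    // Runs the loop until no tuple is in flight and returns how many times
    // the bolt executed.
    static int run(LoopTuple fromSpout) {
        Deque<LoopTuple> inFlight = new ArrayDeque<>();
        inFlight.add(fromSpout);
        int executions = 0;
        while (!inFlight.isEmpty()) {
            LoopTuple t = inFlight.poll();
            executions++;
            // Re-emit to the bolt's own input only while hops remain.
            if (t.hopsLeft > 0) {
                inFlight.add(new LoopTuple(t.value, t.hopsLeft - 1));
            }
        }
        return executions;
    }
}
```

A tuple that enters with hopsLeft set to 3 is executed four times in total: once on arrival from the spout and once for each of the three trips around the cycle.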
I am trying to put Kafka data through Storm into HDFS and Hive. I am working with Hortonworks. I therefore have the following structure, as seen (slightly modified) in many tutorials (http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/):
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", kafkaSpout);
builder.setBolt("hdfs-bolt", hdfsBolt).globalGrouping("kafka-spout");
builder.setBolt("parse-bolt", new ParseBolt()).globalGrouping("kafka-spout");
builder.setBolt("hive-bolt", hiveBolt).globalGrouping("parse-bolt");
I send the kafka-spout data directly to the hdfs-bolt, which works when I use only the hdfs-bolt. When I add the parse-bolt to parse the Kafka data and emit it to the hive-bolt, the whole system goes crazy. Even when I send just one single message over Kafka, this message is duplicated endlessly by the kafka-spout and written to HDFS over and over.
If there is an error in the parse-bolt, shouldn't the hdfs-bolt still work normally? I am new to the topic; can someone spot a simple beginner's mistake? I am grateful for any advice.
Are you acking the messages at the end of both bolts' execute methods?
When both bolts read from the same stream of your kafka-spout, the messages are anchored to the same spout, each with a unique messageId. So even if only your parse-bolt's tuple fails, it is replayed at the spout, because it is anchored there. The replay results in another tuple with a different messageId but the same content being sent to all the bolts subscribed to that stream, in your case the parse-bolt and the hdfs-bolt.
Remember that the replay happens at the spout, and hence everything subscribed to that stream from the spout will receive redundant messages.
As I understand things, ZooKeeper will persist tuples emitted by bolts so if a bolt crashes (or a computer with the bolt crashes, or the entire cluster crashes), the tuple emitted by the bolt will not be lost. Once everything is restarted, the tuples will be fetched from ZooKeeper, and everything will continue on as if nothing bad ever happened.
What I don't yet understand is if the same thing is true for spouts. If a spout emits a tuple (i.e., the emit() function within a spout is executed), and the computer the spout is running on crashes shortly thereafter, will that tuple be resurrected by ZooKeeper? Or do we need Kafka in order to guarantee this?
P.S. I understand that the tuple emitted by the spout must be assigned a unique ID in the call to emit().
P.P.S. I see sample code in books that uses something like ConcurrentHashMap<UUID, Values> to track which spouted tuples have not yet been acked. Is this somehow automatically persisted with ZooKeeper? If not, then I shouldn't really be doing that, should I? What should I be doing instead? Using Kafka?
Florian Hussonnois answered my question thoroughly and clearly in this storm-user thread. This was his answer:
Actually, the tuples aren't persisted into "zookeeper". If your
"spout" emits a tuple with a unique id, it will be automatically
tracked internally by storm (i.e. the ackers). Thus, if the emitted
tuple fails because of a bolt failure, Storm invokes the
method 'fail' on the originating spout task with the unique id as argument.
It's then up to you to re-emit the failed tuple.
In sample code, spouts use a Map to track which tuples have not yet
been fully processed by your entire topology, in order to be able to
re-emit them in case of a bolt failure.
However, if the failure doesn't come from a bolt but from your spout,
the in-memory Map will be lost and your topology will not be able to
re-emit failed tuples.
For such a scenario you can rely on Kafka. In fact, the Kafka Spout
stores its read offset in zookeeper. That way, if a spout task
goes down, it will be able to read its offset from zookeeper after
restarting.
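Why a persisted offset is enough to recover can be shown with a small plain-Java model. This is a sketch under assumptions, not a description of the actual Kafka spout implementation: OffsetTrackerModel is a hypothetical class, and it assumes the spout persists the lowest offset that has not been acked yet, so that a restarted task resumes from there.

```java
import java.util.TreeSet;

// Plain-Java model of offset tracking in a Kafka-backed spout.
// Assumption: the persisted offset is the lowest one not yet acked.
class OffsetTrackerModel {
    // Offsets emitted but not yet acked.
    private final TreeSet<Long> pending = new TreeSet<>();
    // Next offset to read from the partition.
    private long nextToRead;

    OffsetTrackerModel(long startOffset) {
        this.nextToRead = startOffset;
    }

    // Read the next record and remember that it is in flight.
    long emitNext() {
        long offset = nextToRead++;
        pending.add(offset);
        return offset;
    }

    // The tuple for this offset was fully processed.
    void ack(long offset) {
        pending.remove(offset);
    }

    // The offset that is safe to persist: everything below it is acked.
    long committableOffset() {
        return pending.isEmpty() ? nextToRead : pending.first();
    }
}
```

If offsets 0, 1 and 2 have been emitted and only 0 and 2 acked, the committable offset is 1. A task restarted from that value reads offset 1 again, so the un-acked record is not lost even though the in-memory set of pending tuples is gone. Offset 2 is also read again, which is why downstream processing has to tolerate replays.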