I just want to know whether we can write conditional bolts in Storm.
If I have three bolts, the 1st bolt does its work and the 2nd one checks whether bolt 1's execution is done; only if it completed correctly should the 3rd bolt start working. Can this be done? If yes, let me know how, and if no, say why.
Can't what you are trying to do be accomplished by linking the bolts together using a different stream id and emitting to this stream id from one bolt to the other when done?
e.g.
Bolt 1 receives its data from wherever it receives it.
Bolt 2 also receives its data from wherever it receives it, but does not start working.
Bolt 1 finishes working and emits a tuple saying "finished" to Bolt 2.
Bolt 2 sees this tuple and starts working.
You can distinguish different streams in the bolt using:
tuple.getSourceStreamId()
which returns a String with the name of the stream id this tuple was emitted to.
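A minimal sketch of that gating logic, written as plain Java with no Storm classes so it can be run on its own. The class name, the stream names "data" and "control", and the buffering behaviour are assumptions made for illustration; in a real bolt the streamId argument would come from tuple.getSourceStreamId().

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of Bolt 2's gating logic; not Storm API code.
final class GatedBolt {
    // Hypothetical stream names chosen for this sketch.
    static final String DATA_STREAM = "data";
    static final String CONTROL_STREAM = "control";

    private final List<String> buffered = new ArrayList<>();
    private final List<String> processed = new ArrayList<>();
    private boolean upstreamFinished = false;

    // Called once per incoming tuple.
    void execute(String streamId, String payload) {
        if (CONTROL_STREAM.equals(streamId)) {
            // Bolt 1 reported "finished": open the gate and drain the buffer.
            upstreamFinished = true;
            for (String p : buffered) {
                process(p);
            }
            buffered.clear();
        } else if (upstreamFinished) {
            process(payload);
        } else {
            // Bolt 1 is not done yet, so hold the data back.
            buffered.add(payload);
        }
    }

    // Stand-in for the real work of Bolt 2.
    private void process(String payload) {
        processed.add(payload.toUpperCase());
    }

    List<String> processed() { return processed; }

    int pending() { return buffered.size(); }
}
```

Buffering in memory like this only suits small amounts of data; tuples held in the buffer are also not acked until they are processed, so the message timeout still applies to them.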
I'm fairly new to Storm, and recently changed my bolts to implement IRichBolt
instead of BaseBasicBolt, meaning I am now in charge of acking and failing
a tuple according to my own logic.
My topology looks like this:
Bolt A emits the same tuple to Bolts B and C, each of which persists data to Cassandra.
These operations are not idempotent, and include an update to two different counter column families.
I am only interested in failing the tuple and replaying it on certain exceptions from Cassandra (not read/write timeouts, only QueryConsistency or Validation exceptions).
The problem is that if bolt B fails, the same tuple is replayed from the spout and is emitted again to bolt C, which has already persisted its data, creating false data.
I've tried to understand how exactly acking is done (from reading: http://www.slideshare.net/andreaiacono/storm-44638254) but failed to understand
what happens in the situation I described above.
The only ways I have found to solve this correctly are either to create another spout with the same input source (Spout 1 -> bolt A -> bolt B, and Spout 1' -> Bolt A' -> Bolt C'), or to combine Bolts B and C into one and persist the data for both column families in the same Batch Statement.
Is this correct or am I missing something? And is there another possible solution to properly ack these tuples?
Thanks.
You didn't say how long you want to wait before retrying a failed update in bolt B or C, but instead of outright failing the tuple in bolt B, you could add some more streams. Add a scorpion-tail output stream from bolt B back to the same bolt B. If an update in bolt B fails, write the tuple to the scorpion-tail output stream so it comes right back as input into bolt B again, just from a second stream. You could enrich the tuple with a timestamp so that your processing logic in bolt B for the new stream can look at the last attempt time and, if not enough time has passed, write the tuple out to the scorpion-tail stream again. Of course you'd do the same thing for bolt C.
If you want to wait a long time to retry the tuple (long in Storm terms), you could replace those scorpion-tail streams with Kafka topics along with the requisite spouts.
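The timestamp check described for the scorpion-tail stream might be sketched as follows. This is plain Java with no Storm classes; RetryPolicy, the Action names, and the minimum delay are hypothetical and only illustrate the decision a bolt would make for each tuple arriving on the retry stream.

```java
// Plain-Java model of the retry decision; not Storm API code.
final class RetryPolicy {
    enum Action { ATTEMPT_UPDATE, SEND_BACK_TO_RETRY_STREAM }

    // Hypothetical back-off chosen by the topology author.
    private final long minDelayMillis;

    RetryPolicy(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // lastAttemptMillis is the timestamp stored in the tuple when the update last failed.
    Action decide(long lastAttemptMillis, long nowMillis) {
        long waited = nowMillis - lastAttemptMillis;
        return waited >= minDelayMillis
                ? Action.ATTEMPT_UPDATE
                : Action.SEND_BACK_TO_RETRY_STREAM;
    }
}
```

For example, new RetryPolicy(1000).decide(0, 500) asks for the tuple to go around the retry stream again, while decide(0, 1000) lets the update be attempted. A tuple that keeps circulating this way still has to finish within the topology's message timeout if it is anchored.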
My topology looks like this:
[Image: Data_Enrichment_Persistence_Topology]
So basically the problem I am trying to solve here is that every time an issue occurs in the Stop or Load service bolts and a tuple fails, it is replayed and the spout re-emits it. This makes the Cassandra bolt reprocess the tuple and rewrite data.
I cannot make the tuples in the load and stop bolts unanchored, as I need them to be replayed in case of any failure. However, I only want the upper workflow to be replayed.
I am using a KafkaSpout to emit data (it is emitting on the "default" stream). I am not sure how to duplicate the streams at the Kafka Spout's emit level.
If I can duplicate the streams, a replay on either of the two will only re-emit the message on that particular stream at the spout level, leaving the other stream untouched, right?
TIA!
You need to use two output streams in your Spout -- one for each downstream path. Furthermore, you emit each tuple to both streams (using a different message ID for each).
Thus, if one fails, you can replay this tuple to just this stream.
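A rough model of the bookkeeping such a spout would need, in plain Java with no Storm classes. It assumes a custom spout whose ack and fail callbacks receive the message ID it emitted with; TwoStreamTracker, the stream names, and the ID format are all made up for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of a spout's pending-message bookkeeping; not Storm API code.
final class TwoStreamTracker {
    // Hypothetical stream names for the two downstream paths.
    static final String STREAM_A = "persist";
    static final String STREAM_B = "services";

    // One pending entry per (record, stream) pair, keyed by its own message ID.
    private final Map<String, String[]> pending = new HashMap<>();

    // Registers the record on both streams and returns the two message IDs.
    String[] emit(String recordId, String payload) {
        String idA = recordId + "@" + STREAM_A;
        String idB = recordId + "@" + STREAM_B;
        pending.put(idA, new String[] {STREAM_A, payload});
        pending.put(idB, new String[] {STREAM_B, payload});
        return new String[] {idA, idB};
    }

    // A fully processed message no longer needs to be remembered.
    void ack(String messageId) {
        pending.remove(messageId);
    }

    // Returns {stream, payload} to re-emit, so only the failed stream is replayed.
    // Returns null if the message ID is not pending.
    String[] fail(String messageId) {
        return pending.get(messageId);
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Because each stream has its own message ID, each gets its own tuple tree, and a failure on one path says nothing about the other.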
This sounds like a stupid question, but it does solve certain problems if it's possible.
Say my topology has only 1 spout and 1 bolt. Of course, the spout is upstream of the bolt. Is it possible for the bolt to define a stream such that the data emitted to this stream is received by other instances of the same bolt?
I am not sure what you mean by "other instance of the bolt". However, it seems you want to define a cyclic topology, and yes, this is possible in Storm. Of course, you need to be careful not to spin tuples through the cycle forever...
There is nothing special you need to do. Just connect to the bolt's output stream as you would to any other one:
builder.setSpout("spout", new MySpout());
builder.setBolt("bolt", new MyBolt())
.shuffleGrouping("spout")
.shuffleGrouping("bolt");
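One way to keep tuples from cycling forever is to carry a hop counter in the tuple and stop re-emitting once it reaches a limit. The sketch below models only that check in plain Java; CycleGuard and its limit are hypothetical and not part of Storm.

```java
// Plain-Java model of a hop-count guard for a cyclic topology; not Storm API code.
final class CycleGuard {
    // Hypothetical upper bound on how often a tuple may re-enter the bolt.
    private final int maxHops;

    CycleGuard(int maxHops) {
        this.maxHops = maxHops;
    }

    // True if a tuple that has already looped hopsSoFar times may loop again.
    boolean mayLoopAgain(int hopsSoFar) {
        return hopsSoFar < maxHops;
    }

    // Simulates one tuple that always wants to loop; returns how many
    // times the bolt processes it before the guard stops the cycle.
    int simulate() {
        int passes = 0;
        int hops = 0;
        while (true) {
            passes++;               // the bolt processes the tuple
            if (!mayLoopAgain(hops)) {
                break;              // limit reached: do not emit back to self
            }
            hops++;                 // emit back to self with an incremented counter
        }
        return passes;
    }
}
```

In a real bolt the counter would be one of the tuple's fields, incremented on every emit to the looping stream.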
I've 1 spout and 3 bolts in a topology sharing a single stream declared originally using declarer.declareStream(s1,...) in the declareOutputFields() method of the spout.
The spout emits to the stream s1, and all downstream bolts also emit Values to the same stream s1. The bolts also declare the same stream s1 in their declareOutputFields().
Is there any problem with that? What is the correct way to do it? Please provide sufficient references.
I don't see any problem with your design, except that it is unnecessary unless you have a specific reason for it. According to the Storm documentation:
Saying declarer.shuffleGrouping("1") subscribes to the default stream
on component "1" and is equivalent to declarer.shuffleGrouping("1",
DEFAULT_STREAM_ID).
Thus, if your bolts and spouts do not need to emit more than one stream, there is really no need to specify the stream ID yourself. You can just use the default stream ID.
We have a fairly simple Storm topology with one headache.
One of our bolts can either find the data it is processing to be valid, in which case everything carries on down the stream as normal, or it can find it to be invalid but fixable, in which case we need to send it for some additional processing.
We tried making this step part of the topology with a separate bolt and stream.
declarer.declareStream(NORMAL_STREAM, getStreamFields());
declarer.declareStream(ERROR_STREAM, getErrorStreamFields());
Followed by something like the following at the end of the execute method.
if(errorOutput != null) {
collector.emit(ERROR_STREAM, input, errorOutput);
}
else {
collector.emit(NORMAL_STREAM, input, output);
}
collector.ack(input);
This does work; however, it has the breaking side effect of causing all of the tuples that do not go down this error path to fail and get re-sent by the spout endlessly.
I think this is because the error bolt cannot send acks for messages it doesn't receive, but the acker waits for all the bolts in a topology to ack before sending the ack back to the spout. At the very least, taking out the error processing bolt causes everything to get acked back to the spout correctly.
What is the best way to achieve something like this?
It's possible that the error bolt is slower than you suspect, causing a backup on error_stream which, in turn, causes a backup into your first bolt and finally causing tuples to start timing out. When a tuple times out, it gets resent by the spout.
Try:
Increasing the timeout config (topology.message.timeout.secs),
Limiting the number of inflight tuples from the spout (topology.max.spout.pending) and/or
Increasing the parallelism count for your bolts
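As a sketch, the first two settings can be expressed as entries in the topology configuration map. The example below uses a plain java.util.Map so it runs without Storm on the classpath; the values are illustrative only, not recommendations, and TuningSketch is a made-up name. Storm's Config class extends HashMap<String, Object>, so the same keys can be put into it directly.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the two config settings; values are illustrative assumptions.
final class TuningSketch {
    static Map<String, Object> tunedConf() {
        Map<String, Object> conf = new HashMap<>();
        // Give slow paths more time before a tuple is considered failed.
        conf.put("topology.message.timeout.secs", 120);
        // Cap the number of un-acked tuples each spout task may have in flight.
        conf.put("topology.max.spout.pending", 500);
        return conf;
    }
}
```

Parallelism is not a config key of this kind; it is normally given as the parallelism hint when the bolt is added to the topology builder.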