I'm new to Storm, so be gentle :-)
I want to implement a topology that is similar to the RollingTopWords topology in the Storm examples. The idea is to count the frequency of words emitted. Basically, the spouts emit words at random, the first level bolts count the frequency and pass them on. The twist is that I want the bolts to pass on the frequency of a word only if its frequency in one of the bolts exceeded a threshold. So, for example, if the word "Nathan" passed the threshold of 5 occurrences within a time window on one bolt then all bolts would start passing "Nathan"'s frequency onwards.
What I thought of doing is having another layer of bolts which would have the list of words which have passed a threshold. They would then receive the words and frequencies from the previous layer of bolts and pass them on only if they appear in the list. Obviously, this list would have to be synchronized across the whole layer of bolts.
Is this a good idea? What would be the best way of implementing it?
Update: What I'm hoping to achieve is a situation where communication is minimized, i.e. each node in my use case is simulated by a spout and an attached bolt which does the local counting. I'd like that bolt to emit only words that have passed a threshold, either in the bolt itself or in another one. So every bolt will have to have a list of words that have passed the threshold. There will be a central repository that holds the list of words over the threshold and communicates with the bolts to pass that information along.
What would be the best way of implementing that?
That shouldn't be too complicated. Just don't emit the words until you reach the threshold, and in the meantime keep them stored in a HashMap. That is just one if-else statement.
About the synchronization: I don't think you need it, because for this kind of problem (counting words) you want one and only one task to receive a specific word. The one task that receives the word (e.g. "Nathan") will be the only one emitting its frequency. For that you should use fields grouping.
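To make that concrete, here is a minimal sketch of such a counting bolt (the class name, the "word" field, the threshold of 5, and the spout id "word-spout" are my assumptions, not anything fixed by Storm):

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ThresholdCountBolt extends BaseRichBolt {
    private static final long THRESHOLD = 5;           // assumed threshold
    private final Map<String, Long> counts = new HashMap<>();
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        long count = counts.merge(word, 1L, Long::sum);
        if (count >= THRESHOLD) {                      // the one "if": emit only past the threshold
            collector.emit(tuple, new Values(word, count));
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

And the fields grouping, so each word always lands on the same task:

builder.setBolt("counter", new ThresholdCountBolt(), 4)
       .fieldsGrouping("word-spout", new Fields("word"));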
First of all, sincere apologies if I am asking something very basic, as I am a beginner in Storm, and also if my question is a duplicate; I tried searching but couldn't find a relevant answer.
Please advise on my use case below.
My use case:
I have a Spout reading data from an internal messaging mechanism; it is receiving and emitting tuples at a very high frequency (hundreds per second).
Apart from the data, every tuple also has a frequency (an int); there can be 4-5 types of frequency in total.
Now I need to design a Bolt to batch/pool all tuples and emit them only periodically, per frequency, emitting only the latest tuple when a duplicate is received before the next batch goes out (we have a string-based key in the tuple data to identify duplicates).
e.g.
So all tuples with 25 seconds as their frequency will be pooled together and emitted by the Bolt every 25 seconds (if a duplicate arrives within those 25 seconds, only the latest one is kept).
Similarly, all tuples with 10 minutes as their frequency will be pooled together and emitted by the Bolt every 10 minutes (if a duplicate arrives within those 10 minutes, only the latest one is kept).
Now, since we can have 4-5 types of frequencies (e.g. 10 sec, 25 sec, 10 min, 20 min, etc., as configured), every tuple should be clubbed into the appropriate batch and emitted, as in the examples above.
FYI, for Bolt grouping I have used fieldsGrouping, with the configuration below:
.fieldsGrouping("FILTERING_BOLT", new Fields(PUBLISHING_FREQUENCY));
Please help or advise on the best approach for my use case; I just couldn't come up with an implementation that handles the flow of concurrent tuples while managing Storm's internal parallelism.
It sounds like you want windowing bolts: https://storm.apache.org/releases/2.0.0-SNAPSHOT/Windowing.html. Probably you want a tumbling window (i.e. no overlap between window intervals).
Windowing bolts let you set an interval they should emit windows at (e.g. every 10 seconds), and then the bolt will buffer up all tuples it receives for the previous 10 seconds before calling an execute method you supply.
The structure I think you want is something like this:
spout -> splitter -> 5 second window bolt
                  -> 10 second window bolt
The splitter should receive the tuples, examine the frequency field and send the tuple on to the right window bolt. You make it do this by declaring a stream for each frequency type.
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("5-sec-stream", new Fields(...));
    declarer.declareStream("10-sec-stream", new Fields(...));
}

public void execute(Tuple input) {
    if (frequencyIsFive(input)) {
        // emit on the 5-second stream, anchored to the input tuple
        collector.emit("5-sec-stream", input, input.getValues());
    }
    //more cases here
    collector.ack(input);
}
Then when declaring your topology you do
topologyBuilder.setBolt("splitter", new SplitterBolt())
               .shuffleGrouping("spout");
topologyBuilder.setBolt("5-second-window", new YourWindowingBolt())
               .globalGrouping("splitter", "5-sec-stream");
to make all the 5 second tuples go to the 5 second windowing bolt.
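For the windowing bolt itself, here is a minimal sketch of a tumbling-window bolt that also handles the "keep only the latest duplicate" requirement. I'm assuming the dedup key lives in a field called "key" and the output fields are "key" and "data"; adjust to your tuple layout:

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.windowing.TupleWindow;

public class DedupWindowBolt extends BaseWindowedBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(TupleWindow window) {
        // keep only the latest tuple per key within this window
        Map<String, Tuple> latest = new LinkedHashMap<>();
        for (Tuple t : window.get()) {
            latest.put(t.getStringByField("key"), t);
        }
        for (Tuple t : latest.values()) {
            collector.emit(t, t.getValues());
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("key", "data"));   // assumed output fields
    }
}

You attach the window length when wiring the topology:

topologyBuilder.setBolt("5-second-window",
        new DedupWindowBolt().withTumblingWindow(BaseWindowedBolt.Duration.seconds(5)))
    .globalGrouping("splitter", "5-sec-stream");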
See https://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html for more information on this, particularly the parts about streams and groupings.
There's a simple example of a windowing topology at https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/SlidingWindowTopology.java.
One thing you may want to be aware of is Storm's tuple timeout. If you need a window of e.g. 10 minutes, you need to bump the tuple timeout significantly from the default of 30 seconds, so the tuples don't time out while waiting in queue. You can do this by setting e.g. conf.setMessageTimeoutSecs(15*60) when configuring your topology. You want there to be a bit of leeway between the window intervals and the tuple timeout, because you want to avoid your tuples timing out as much as possible.
My topology uses the default KafkaSpout implementation. In some very controlled testing, I noticed the spout was failing tuples even though none of my bolts were failing any tuples and I was certain all messages were being fully processed well within my configured timeout.
I also noticed that (due to some sub-classing structure with my bolts), one of my bolts was ack-ing tuples twice. When I fixed this, the spout stopped failing tuples.
Sorry that this is more of a sanity check than a question, but does this make sense? I don't see why ack-ing the same tuple instance twice would cause the Spout to register timeouts, but it seems it did in my case?
It does make sense.
Storm tracks all of the acks (direct and indirect) for a tuple emitted by a spout in an odd but effective manner. I'm not sure of the exact algorithm, but it entails repeatedly XOR'ing what was originally the spout-emitted tuple ID with the IDs of subsequent anchored tuples. Each of those subsequent IDs is XOR'ed twice: once when the tuple is anchored and once when the tuple is acked. When the result of the XORs is all zeros, the assumption is that each anchor was matched by an ack and the original spout-emitted tuple has finished processing.
By ack'ing some tuples more than once, you made it seem that some of the spout-emitted tuples had not finished completely (because an odd number of XOR's will never zero out).
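A toy illustration of that bookkeeping (plain Java, not Storm's actual internals; the IDs are made up):

long ackVal = 0;
long rootId = 42L, childId = 7L;

ackVal ^= rootId;             // spout emits the root tuple
ackVal ^= rootId ^ childId;   // a bolt acks the root and anchors one child
ackVal ^= childId;            // the next bolt acks the child
System.out.println(ackVal);   // 0 -> the tuple tree counts as fully processed

ackVal ^= childId;            // duplicate ack: childId is now XOR'ed an odd number of times
System.out.println(ackVal);   // non-zero -> looks unfinished, so the tuple eventually times out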
In my topology the tuples do not split in any way; one action is eventually carried out for each tuple that comes in.
I can still fail them if they run into some sort of exception that they might overcome if replayed by my KafkaSpout. Though I don't know how my spout knows which tuple to replay when they're not anchored, in testing it seems to replay the right one. Is this expected? Does the KafkaSpout implementation track tuples/messages in some way I'm not aware of? Am I possibly anchoring and not realizing it (my bolts extend BaseRichBolt)? Or am I just mistaken that it replays the correct one?
But if manually failing does work, then I believe the only benefit I get from anchoring is that my tuple will be replayed when it times out -- which I'm not sure is worth the overhead of anchoring.
Am I correct about this? Is there some other significant benefit to anchoring in this case?
BaseRichBolt does not do any anchoring automatically (BaseBasicBolt would do this). Thus, the behavior you describe should only work if you have a simple Spout -> Bolt topology. For deeper topologies, i.e., Spout -> Bolt1 -> Bolt2 with no anchoring in Bolt1, failing tuples in Bolt2 cannot work.
With KafkaSpout, each emitted tuple gets a MessageId assigned, so the fault-tolerance mechanism is activated. Thus, each tuple must get acked in the first Bolt receiving it; otherwise, the tuples eventually time out. Tuples emitted in Bolt1 should get anchored; otherwise, those tuples are not tracked, cannot be failed (neither manually in Bolt2 nor via timeout), and cannot get replayed in case of failure.
Thus, anchoring is a pure fault-tolerance mechanism. You should actually always anchor tuples, because anchoring by itself does not enable fault-tolerance; assigning MessageIds in the Spout does enable it. If a Bolt processes a tuple that does not have an ID assigned, the anchoring call will do "nothing", and the overhead of an additional method call is tiny. Therefore, adding anchoring code is usually a good choice, because the Bolt can then be used with or without fault-tolerance enabled (depending on whether the Spout assigns message IDs or not). If you omit the anchoring code, fault-tolerance will break in this Bolt and downstream tuples cannot be recovered on failure.
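For reference, this is what the difference looks like inside a BaseRichBolt's execute() (the transform() helper is hypothetical):

@Override
public void execute(Tuple input) {
    // anchored emit: the "input" argument ties the new tuple to input's tuple tree;
    // if the spout assigned no MessageId, this extra tracking is effectively a no-op
    collector.emit(input, new Values(transform(input)));

    // unanchored emit (for comparison): downstream failures cannot trigger a replay
    // collector.emit(new Values(transform(input)));

    collector.ack(input);
}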
I'm using Apache Storm for parallel processing. I'd like to detect when the tuple is on its last replay count so that if it fails again then the tuple can be moved to a dead letter queue.
Is there a way to find the replay count from within the Bolt? I'm not able to find such a field within the tuple.
The reason I'm looking for the last replay count is to iron out our topology so that it is more resilient to failures caused by bugs and downstream service outages. When the bug/downstream issue has been resolved, the tuples can be reprocessed from the dead letter queue. However, I'd like to place the tuples on the dead letter queue only on their last and final replay.
There are multiple possible answers to this question:
Do you use the low-level Java API to define your topology? If yes, see here: Storm: Is it possible to limit the number of replays on fail (Anchoring)?
You can also use transactional topologies. The documentation is here: https://storm.apache.org/documentation/Transactional-topologies.html
Limiting the number of replays implies counting the number of replays, and that is a prerequisite to get this done. However, Storm does not natively support a dead letter queue or anything similar. You would need to use a reliable external distributed storage system (maybe Kafka) and put the tuple there if the replay count exceeds your threshold. In your spout, you then need to check periodically for tuples in this external storage. If they have been stored there "long enough" (whatever that means in your application), the spout can try re-processing them.
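As a sketch of the counting half, you could track fail counts inside your spout. Everything here is assumed scaffolding: Storm does not expose a replay count, and deadLetterStore/requeue are hypothetical helpers you would have to implement:

private final Map<Object, Integer> failCounts = new HashMap<>();
private static final int MAX_REPLAYS = 3;   // assumed limit

@Override
public void fail(Object msgId) {
    int n = failCounts.merge(msgId, 1, Integer::sum);
    if (n >= MAX_REPLAYS) {
        failCounts.remove(msgId);
        deadLetterStore.put(msgId);   // hypothetical: e.g. write the message to a Kafka topic
    } else {
        requeue(msgId);               // hypothetical: re-emit on a later nextTuple()
    }
}

@Override
public void ack(Object msgId) {
    failCounts.remove(msgId);
}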
I would like to process tuples in batches, for which I am thinking of using the Trident API. However, there are no operations that I perform on batches here; every tuple is processed individually. All I need is exactly-once semantics, so that every tuple is processed only once, and this is the only reason to use Trident.
I want to store the information of which tuples have been processed, so that when a batch is replayed, a tuple that has already been processed is not executed again.
The Trident topology offers a persistentAggregate() method, but it takes an aggregation operation, and I don't have any aggregation to perform on a set of tuples, as every tuple is processed individually.
The functions each tuple undergoes are too small to be worth executing one tuple at a time, so I am looking to process tuples in batches to save computing resources and time.
Now, how do I write a topology that consumes tuples as batches but still doesn't perform any batch operations (like word count)?
Looks like what you need is partitionPersist.
It should be provided with a state (or a state factory), the fields to persist, and an updater.
For development purposes, check MemoryMapState - it's basically an in-memory hashmap.
For production you can use, say, Cassandra - check out the examples at https://github.com/hmsonline/storm-cassandra-cql
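A rough sketch of the wiring (the spout, MyFunction, MyStateUpdater, and the field names are all assumptions; MemoryMapState lives in org.apache.storm.trident.testing):

TridentTopology topology = new TridentTopology();
topology.newStream("my-spout", spout)
        .each(new Fields("key"), new MyFunction(), new Fields("result"))
        .partitionPersist(new MemoryMapState.Factory(),   // swap in a Cassandra-backed StateFactory for production
                          new Fields("key", "result"),
                          new MyStateUpdater());           // your StateUpdater records which tuples were processed
StormTopology built = topology.build();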