How would you emit Storm data after a period of time has elapsed? - apache-storm

For example, let's say you were using Storm to aggregate web visit start and end dates. A session starts with the first visit from a user and ends after 30 minutes of inactivity from that same user. This data is streamed into Storm in real time as it's collected. How would you tell Storm to emit data after those 30 minutes of inactivity?

I am not sure, but you could look at the TOPOLOGY_TICK_TUPLE_FREQ_SECS setting in Storm. As described in this article:
Tick tuples: It’s common to require a bolt to “do something” at a fixed interval, like flush writes to a database. Many people have been using variants of a ClockSpout to send these ticks. The problem with a ClockSpout is that you can’t internalize the need for ticks within your bolt, so if you forget to set up your bolt correctly within your topology it won’t work correctly. 0.8.0 introduces a new “tick tuple” config that lets you specify the frequency at which you want to receive tick tuples via the “topology.tick.tuple.freq.secs” component-specific config, and then your bolt will receive a tuple from the __system component and __tick stream at that frequency.
You can also find sample code that configures a spout or bolt to receive tick tuples at a specific interval.
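As a rough illustration of how tick tuples could drive the 30-minute timeout, here is a Storm-free sketch of the bookkeeping a bolt might do. The class SessionTracker and its methods are hypothetical names, not part of Storm. The assumption is that the bolt returns a map containing topology.tick.tuple.freq.secs from getComponentConfiguration, calls onVisit for ordinary tuples, and calls expire whenever a tuple arrives from the __system component on the __tick stream, emitting whatever sessions come back.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Storm class): tracks open sessions per user and
// closes those that have been idle for at least timeoutMs.
class SessionTracker {
    // A finished session: user plus first and last visit time (epoch millis).
    static final class Session {
        final String user;
        final long start;
        final long end;

        Session(String user, long start, long end) {
            this.user = user;
            this.start = start;
            this.end = end;
        }
    }

    private final long timeoutMs;
    // user -> {firstVisit, lastVisit}
    private final Map<String, long[]> open = new HashMap<>();

    SessionTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Call for every ordinary (non-tick) tuple. Returns a session to emit if
    // this visit arrives after the previous session already timed out,
    // otherwise null.
    Session onVisit(String user, long timestampMs) {
        long[] s = open.get(user);
        if (s == null) {
            open.put(user, new long[] {timestampMs, timestampMs});
            return null;
        }
        if (timestampMs - s[1] >= timeoutMs) {
            Session done = new Session(user, s[0], s[1]);
            open.put(user, new long[] {timestampMs, timestampMs});
            return done;
        }
        s[1] = Math.max(s[1], timestampMs);
        return null;
    }

    // Call for every tick tuple. Returns the sessions that have been idle
    // long enough and removes them from the open set.
    List<Session> expire(long nowMs) {
        List<Session> closed = new ArrayList<>();
        Iterator<Map.Entry<String, long[]>> it = open.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, long[]> e = it.next();
            long[] s = e.getValue();
            if (nowMs - s[1] >= timeoutMs) {
                closed.add(new Session(e.getKey(), s[0], s[1]));
                it.remove();
            }
        }
        return closed;
    }
}
```

With this approach a session is emitted at most one tick interval after it times out, so the tick frequency bounds how late the emit can be. The state lives in memory only, so this sketch also assumes visits for one user are always routed to the same bolt task (for example with a fields grouping on the user).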

Related

Is there any Java API to know when topology is ready for reading first message from Spout

Our Apache Storm topology reads messages from Kafka using KafkaSpout and, after a lot of mapping, reducing, enrichment, aggregation and so on, finally inserts data into Cassandra. There is a second Kafka input on which we receive user queries for data; if the topology finds a response, it sends it to a third Kafka topic. We now want to write an E2E test using JUnit in which we programmatically insert data into the topology, then insert a user query message, and assert on the third topic that the response to our query is correct.
To achieve this, we thought of starting EmbeddedKafka and CassandraUnit, replacing the actual Kafka and Cassandra with them, and then starting the topology in the context of this single JUnit test.
Before we start the actual test, we create the topology and submit it to a LocalCluster in the @Before method. The topology starts on a different thread, the @Before method returns, and our test starts executing. At that point the topology is not ready yet, because it takes some time before it can process anything. Is there any Java API that can tell us when the topology is ready for processing (meaning ready to read the first message from the spout)?
This depends on what you mean when you say "ready for processing".
If you enable time simulation for your LocalCluster, you can use Time.advanceClusterTime to advance time in steps. If you call this method after submitting a topology, it will only return once the cluster is mostly idling. See e.g. https://github.com/apache/storm/blob/8f49e06998abb4dfc50f51d78b6784ebd04844fb/storm-core/test/jvm/org/apache/storm/integration/TopologyIntegrationTest.java#L233.
If you're willing to replace your spouts with stubs (e.g. FixedTupleSpout), you can use Testing.completeTopology to wait until the topology has finished processing all the tuples you set up the stub to emit.
Another way to wait until the topology has processed some tuples is to put some messages in Kafka, start your topology, and then have your testing thread poll Cassandra to see if the messages you expect have made it through. This way, you can set a timeout in your testing thread and have the test fail if the condition is not met within some number of seconds. You could use a utility like Awaitility for this (https://github.com/awaitility/awaitility), or just write your own polling logic.
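If you go the "write your own polling logic" route, a minimal sketch could look like the following. Poller and waitUntil are hypothetical names, and the condition you pass in (for example a Cassandra query checking for the expected row) is up to your test.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper: polls a condition until it holds or a timeout expires.
class Poller {
    // Returns true as soon as the condition holds, or false if timeoutMs
    // passes first. The condition is checked at least once.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a JUnit test you would then assert on the return value, for example asserting that waitUntil returns true within 30 seconds for a check that the expected response has arrived, so the test fails cleanly instead of hanging.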
If you mean something else by "ready for processing", please describe in more detail what you mean.

Input Data rate in Apache Storm

I am reading text data from a file and processing it with Apache Storm to produce results. I want to experiment with different input data rates. How can I change the input data rate in Apache Storm in this setting? Also, is the input data rate defined as:
Number of tuples emitted by the spout / Time
By default, Storm will pull tuples out of the spout as fast as possible. You can interact with this via a few settings:
topology.max.spout.pending defines how many tuples can be emitted into the topology before Storm will throttle the spout and wait for some of the tuples to be acked. By default this is uncapped.
topology.sleep.spout.wait.strategy.time.ms defines how many milliseconds Storm will pause between calls to nextTuple on the spout, if a call to nextTuple produces no output. This is 1ms by default.
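As a small sketch of how these two settings could be collected for an experiment: Storm's Config class is a HashMap of such keys, so a plain map is used here to stay dependency-free. SpoutRateSettings and its methods are hypothetical names, and the concrete values are placeholders to vary between runs. Note that topology.max.spout.pending only has an effect for tuples the spout emits with a message ID, since unanchored tuples are never acked.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: builds the throttling-related part of a topology config
// and computes the observed input rate.
class SpoutRateSettings {
    // Returns a map with the two settings described above.
    static Map<String, Object> build(int maxSpoutPending, int sleepMs) {
        Map<String, Object> conf = new HashMap<>();
        // Max number of un-acked tuples per spout task before the spout is throttled.
        conf.put("topology.max.spout.pending", maxSpoutPending);
        // Pause in milliseconds after a nextTuple call that emitted nothing.
        conf.put("topology.sleep.spout.wait.strategy.time.ms", sleepMs);
        return conf;
    }

    // Input rate as tuples emitted by the spout per second.
    static double tuplesPerSecond(long tuplesEmitted, long elapsedMs) {
        return tuplesEmitted * 1000.0 / elapsedMs;
    }
}
```

Neither setting fixes an exact rate. If you need a specific number of tuples per second, the spout itself has to pace its emits, for example by returning from nextTuple without emitting until enough time has passed.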

Apache Storm 0.10.0 : Could not get my custom metrics every timeBucketSizeInSecs

I register a custom metric in my bolt with code like context.registerMetric("et", _executedTuple, 2). The metric just counts the number of tuples the bolt has emitted. I also register a metrics consumer in my topology.
However, I only receive the executedTuple value every ten seconds, while I expected the metric to be sent every 2 seconds (timeBucketSizeInSecs).
Does anyone know how to solve this problem?

Disable a spout for a particular duration of time

We have a requirement to disable a spout for a specific interval (9:00 p.m. to 9:00 a.m.) every day. Currently we have written code in the spout that checks whether the current time lies in that interval and, if so, does nothing, but with this approach the nextTuple method is still called continuously. Is there a better way to do this (using config etc.)?
There is no better way. Even though the spout is called over and over again, Storm applies a sleep penalty whenever a nextTuple() call emits no output, so a busy-wait situation is avoided.
If you want to tune the waiting penalty, you can implement your own ISpoutWaitStrategy and register it for a topology via the parameter topology.spout.wait.strategy (see default.yaml).
What Matthias has suggested will work well. Alternatively, you can consider deactivating the topology for this duration. The Nimbus client can be used to deactivate the topology programmatically. nextTuple isn't called on a spout while the topology is deactivated. However, this turns off all the spouts of the topology, which you may not want.
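For the time check inside the spout, the part that is easy to get wrong is that the 9:00 p.m. to 9:00 a.m. interval crosses midnight. A minimal sketch of such a check follows; QuietWindow is a hypothetical name, and the assumption is that nextTuple simply returns without emitting while contains(LocalTime.now()) is true, which lets Storm's wait strategy do the sleeping.

```java
import java.time.LocalTime;

// Hypothetical helper: a daily time window [start, end) that may cross midnight.
class QuietWindow {
    private final LocalTime start;
    private final LocalTime end;

    QuietWindow(LocalTime start, LocalTime end) {
        this.start = start;
        this.end = end;
    }

    // True if now lies in the window. The start is inclusive and the end is
    // exclusive. A window with start equal to end is treated as empty.
    boolean contains(LocalTime now) {
        if (start.equals(end)) {
            return false;
        }
        if (start.isBefore(end)) {
            // Same-day window, e.g. 09:00 to 17:00.
            return !now.isBefore(start) && now.isBefore(end);
        }
        // Window crossing midnight, e.g. 21:00 to 09:00.
        return !now.isBefore(start) || now.isBefore(end);
    }
}
```

LocalTime.now() uses the JVM's default time zone, so the worker machines need a consistent zone setting for the window to mean the same thing everywhere.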

Making storm spouts wait for bolts to be ready

Right now Storm Spouts have an open method to configure them and Bolts have a prepare method. Is there any way to make all the Spout instances wait for all the prepare methods on the Bolts listening to them to finish?
I have a case where I would like to pass some config info to the bolts on the fly (since this config info changes all the time). I've read in some places that we should use Zookeeper or an in-memory key-value storage like redis to do this. My worry though is, what happens if the Bolts aren't ready to process data from Spouts yet, and the Spouts start emitting tuples? Is there a way to make the Spouts wait for an update from the Bolts saying they're ready?
I found a slightly more elegant solution for this (I think). The problem was that certain bolts needed config info in order to process incoming tuples. I made use of Storm's ability to replay tuples: my bolts now listen for config updates from one spout and for data tuples from the other. As long as a bolt hasn't received an update, it keeps failing the data tuples, and the spout replays them after a configurable amount of time.
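The decision logic of that approach can be sketched without any Storm classes. ConfigGate and Outcome are hypothetical names; the assumption is that the bolt's execute method calls onConfigUpdate for tuples from the config spout and onData for tuples from the data spout, then maps FAIL to collector.fail(tuple) and ACK to collector.ack(tuple). Replay only happens if the data spout emits tuples with message IDs and re-emits them in its fail method.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BiConsumer;

// Hypothetical helper: refuses to process data until a config has arrived.
class ConfigGate<C> {
    enum Outcome { ACK, FAIL }

    // Latest config seen so far; null until the first update arrives.
    private final AtomicReference<C> current = new AtomicReference<>();

    // Call when a tuple from the config spout arrives.
    void onConfigUpdate(C config) {
        current.set(config);
    }

    // Call when a data tuple arrives. Runs the handler with the current
    // config, or returns FAIL so the caller can fail the tuple for replay.
    Outcome onData(String payload, BiConsumer<C, String> handler) {
        C config = current.get();
        if (config == null) {
            return Outcome.FAIL;
        }
        handler.accept(config, payload);
        return Outcome.ACK;
    }
}
```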
Yes, you can use Redis to store your configuration then read it from the prepare method.
The prepare method is invoked by the worker process, which starts processing tuples only after prepare has finished. In fact, I think no tuple is emitted until all components of a worker process are ready. http://nathanmarz.github.io/storm/doc-0.8.1/index.html
Finally, you can have an additional spout that looks for configuration changes. If a newer configuration is available, it is sent to your bolts via named streams.
You don't have to worry about this. Storm framework loads Bolt before Spout. Storm loads the bolts in reverse order. Bolts towards the end of the topology are loaded before the bolts in the middle of the topology and in the end, Spout gets loaded.
