We have a use case where we do not want to run a Storm topology continuously. Instead, there is a set of inputs (10K+) that should be processed at a specified time. The spout emits these inputs, and they are processed by the rest of the bolts in the topology. Once all the inputs are processed, there is nothing left to emit from nextTuple in my spout.
At that point we want the topology to go to sleep, and to restart the process every night at 12:00 a.m.
Is there any property I can set in the Storm config to run the topology once a day, sleep after processing is done, and start again at the specified time?
I'm not aware of a feature like what you're asking for. Storm isn't a batch processing system; it's meant to run continuously. Consider whether Storm is a good fit for this use case.
That said, you should be able to implement what you want. You could put an "I'm done" message at the end of your spout's input. When the spout hits that message and all other pending messages are acked, it could use the Nimbus client to kill or deactivate the topology (depending on whether you want to kill or deactivate), see https://stackoverflow.com/a/37134473/8845188. The final step would then be to use your favorite scheduling software to resubmit/reactivate the topology every day at midnight.
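For illustration, here is a rough sketch of the kill/deactivate step using the Thrift Nimbus client that ships with Storm; the topology name and the 30-second drain window are placeholders, not values from the linked answer:

import org.apache.storm.generated.KillOptions;
import org.apache.storm.generated.Nimbus;
import org.apache.storm.utils.NimbusClient;
import org.apache.storm.utils.Utils;

import java.util.Map;

public class TopologyStopper {
    // Deactivate (pause) or kill the named topology via Nimbus.
    public static void stop(String topologyName, boolean kill) throws Exception {
        Map conf = Utils.readStormConfig();                     // reads storm.yaml / defaults
        NimbusClient nimbus = NimbusClient.getConfiguredClient(conf);
        try {
            Nimbus.Client client = nimbus.getClient();
            if (kill) {
                KillOptions opts = new KillOptions();
                opts.set_wait_secs(30);                         // let in-flight tuples drain
                client.killTopologyWithOpts(topologyName, opts);
            } else {
                client.deactivate(topologyName);                // nextTuple stops being called
            }
        } finally {
            nimbus.close();
        }
    }
}

Your scheduler then only needs to run storm activate <name> (or resubmit the jar, if you killed the topology) at midnight.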
Related
Our Apache Storm topology listens for messages from Kafka using KafkaSpout and, after a lot of mapping/reducing/enrichment/aggregation, finally inserts data into Cassandra. There is another Kafka input on which we receive user queries for data; if the topology finds a response, it sends it to a third Kafka topic. Now we want to write an E2E test using JUnit in which we can programmatically insert data into the topology and then, by inserting a user query message, assert at the third point that the response received for our query is correct.
To achieve this, we thought of starting EmbeddedKafka and CassandraUnit, replacing the actual Kafka and Cassandra with them, and then starting the topology in the context of this single JUnit test.
Before our actual test starts (in a @Before method), we create the topology and submit it to a LocalCluster. This starts the topology on a different thread, returns from @Before, and begins executing our test. By that time the topology is not ready, because it takes some time to become ready for processing. Is there any Java API which can tell us when the topology is ready for processing (i.e., ready to read the first message from the spout)?
This depends on what you mean when you say "ready for processing".
If you enable time simulation for your LocalCluster, you can use Time.advanceClusterTime to advance time in steps. If you call this method after submitting a topology, it will only return once the cluster is mostly idle. See e.g. https://github.com/apache/storm/blob/8f49e06998abb4dfc50f51d78b6784ebd04844fb/storm-core/test/jvm/org/apache/storm/integration/TopologyIntegrationTest.java#L233.
If you're willing to replace your spouts with stubs (e.g. FixedTupleSpout), you can use Testing.completeTopology to wait until the topology has finished processing all the tuples you set up the stub to emit.
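As a minimal sketch of that approach: the component ids "word-spout" and "count-bolt" and the mocked values below are placeholders, and this assumes you swap the real spout's output for fixed test data via MockedSources (which internally uses FixedTupleSpout):

import org.apache.storm.Config;
import org.apache.storm.ILocalCluster;
import org.apache.storm.Testing;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.testing.CompleteTopologyParam;
import org.apache.storm.testing.MkClusterParam;
import org.apache.storm.testing.MockedSources;
import org.apache.storm.testing.TestJob;
import org.apache.storm.tuple.Values;

import java.util.List;
import java.util.Map;

public class CompleteTopologyExample {
    public void runTest(final StormTopology topology) {
        MkClusterParam clusterParam = new MkClusterParam();
        clusterParam.setSupervisors(1);

        Testing.withSimulatedTimeLocalCluster(clusterParam, new TestJob() {
            @Override
            public void run(ILocalCluster cluster) throws Exception {
                // Replace the real spout's output with fixed test data
                MockedSources mocked = new MockedSources();
                mocked.addMockData("word-spout", new Values("storm"), new Values("test"));

                CompleteTopologyParam params = new CompleteTopologyParam();
                params.setMockedSources(mocked);
                params.setStormConf(new Config());

                // Blocks until the mocked tuples have been fully processed
                Map result = Testing.completeTopology(cluster, topology, params);

                // Inspect what reached a downstream bolt and assert on it
                List tuples = Testing.readTuples(result, "count-bolt");
            }
        });
    }
}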
Another way to wait for the topology to have processed some tuples is to put some messages in Kafka, start your topology, and then have your testing thread poll Cassandra to see whether the messages you expect have made it through. This way, you can set a timeout in your testing thread and have the test fail if the condition is not met within some number of seconds. You could use a utility like Awaitility for this (https://github.com/awaitility/awaitility), or just write your own polling logic.
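A small sketch of the polling approach with Awaitility; cassandraRowCount is a hypothetical helper standing in for whatever query you run against the embedded Cassandra:

import static org.awaitility.Awaitility.await;

import java.util.concurrent.TimeUnit;

public class CassandraPolling {
    // Placeholder: count rows in the embedded Cassandra via the driver of your choice
    static long cassandraRowCount(String table) {
        return 0L;
    }

    // Fails the test if the expected rows do not show up within 60 seconds
    public static void waitForRows(String table, long expectedCount) {
        await()
            .atMost(60, TimeUnit.SECONDS)
            .pollInterval(1, TimeUnit.SECONDS)
            .until(() -> cassandraRowCount(table) >= expectedCount);
    }
}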
If you mean something else by "ready for processing", please describe in more detail what you mean.
I am reading the Storm Applied book. I found the following code snippet in the book:
LocalCluster lc = new LocalCluster();
lc.submitTopology("GitHub-commit-count-topology", config, topology);
Utils.sleep(TEN_MINUTES);
lc.killTopology("GitHub-commit-count-topology");
lc.shutdown();
So this code will submit the topology for execution, wait a fixed 10 minutes, and then kill the topology. But this seems odd. How can I instead say: submit the topology, wait for it to complete, and once it has completed, kill it and shut down?
Like in Akka Streams, where we get a Future[Done] and just wait on that future to complete (rather than waiting a fixed 10 minutes).
You can do this with https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/Testing.java#L376.
The reason this isn't used in some cases is that it requires every spout in the topology to implement the CompletableSpout interface https://github.com/apache/storm/blob/4137328b75c06771f84414c3c2113e2d1c757c08/storm-client/src/jvm/org/apache/storm/testing/CompletableSpout.java.
Most Storm spouts never reach a point where they're "done" (since it's a stream processing framework, not a batch processing framework), so there's no way to tell when the topology is finished. For example, if you're consuming messages from a Kafka topic, the producers may at any point add more messages to the topic, so how will the consumer determine it is finished consuming?
CompletableSpout exists mostly to ease testing, because it's then possible for a spout to say whether it is done. The completeTopology method I linked can then use this extra feature to tell whether all spouts in the topology are "done", and can stop the topology after that.
If the spout you're using in a test doesn't implement CompletableSpout (which most spouts don't), there's no way to tell when the topology is finished in general. In many cases you can still do better than the example you linked, e.g. if my topology is supposed to write 10 messages to a queue in the test, I can make the test end once 10 messages have been written to the queue.
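Applied to the book's snippet, a rough sketch of that idea could look like the following; countOutputMessages is a placeholder for however your test observes the topology's output (a queue consumer, a database query, etc.), and the 60-second timeout is arbitrary:

import org.apache.storm.LocalCluster;

public class BoundedWait {
    // Placeholder: however the test observes the topology's output
    static int countOutputMessages() {
        return 0;
    }

    public static void waitForOutputThenStop(LocalCluster cluster) throws Exception {
        long deadline = System.currentTimeMillis() + 60_000;   // overall test timeout
        while (countOutputMessages() < 10) {                   // expected number of messages
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError("Topology did not produce 10 messages in time");
            }
            Thread.sleep(500);
        }
        cluster.killTopology("GitHub-commit-count-topology");
        cluster.shutdown();
    }
}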
To relate this to Akka Streams: I'm not really familiar with them, but looking at the introductory documentation, you could consider CompletableSpouts to be similar to bounded Sources (e.g. a Source(1 to 100)), while "normal" spouts are unbounded Sources (e.g. a Source.repeat(1)).
I ran a Storm topology on a Storm cluster. Later the topology was killed, but it is not getting removed from the list of topologies, so I can't rerun a topology with the same name.
Isn't there a way to remove the killed topology from the list?
When you kill the topology you normally set a timeout for how long you want to wait for currently emitted tuples to be processed. I think the default is 30 seconds. After that the topology should be removed from the list. If you don't want to wait, you can just specify a timeout of 0 seconds, and the topology will be removed immediately.
When you run the kill command from the Storm UI or the command line, Storm will first deactivate the topology's spouts for the duration of the topology's message timeout, to allow all messages currently being processed to finish. Storm will then shut down the workers and clean up their state.
So maybe your topology still has messages that need to be processed, which is why it has not been removed yet.
Yet another way to kill a topology is to run storm kill from the command line. This worked for me when a topology hung in the "KILLED" state and was shown in the list for hours.
storm kill yourTopologyName -w 5
We have a requirement to disable the spout for a specific interval (9:00 p.m. to 9:00 a.m.) every day. Currently we have written code in the spout that checks whether the current time lies in that interval and, if so, does nothing; but this approach still calls the nextTuple method continuously. Is there any better way to do this (using config, etc.)?
There is no better way. And even if the spout is called over and over again, Storm will apply a sleep penalty if no output is emitted on a nextTuple() call; thus, a busy-wait situation is avoided.
If you want to tune the waiting penalty, you can implement your own ISpoutWaitStrategy and register it for a topology via the parameter topology.spout.wait.strategy (see defaults.yaml).
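For illustration, a minimal sketch of a custom wait strategy, assuming the ISpoutWaitStrategy interface from Storm 1.x storm-core (prepare/emptyEmit); the backoff numbers are arbitrary:

import org.apache.storm.spout.ISpoutWaitStrategy;

import java.util.Map;

public class BackoffWaitStrategy implements ISpoutWaitStrategy {
    @Override
    public void prepare(Map conf) {
        // read any custom settings from the topology config if needed
    }

    @Override
    public void emptyEmit(long streak) {
        // streak = number of consecutive nextTuple() calls that emitted nothing;
        // back off longer the longer the spout has been idle, capped at 1 second
        long sleepMs = Math.min(1000L, 1L << Math.min(streak, 10));
        try {
            Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

It would then be registered before submitting the topology, e.g. conf.put("topology.spout.wait.strategy", BackoffWaitStrategy.class.getName());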
What Matthias has suggested will work well. Alternatively, you can also consider deactivating the topology for this duration. The Nimbus client can be used to programmatically deactivate the topology; nextTuple won't be called on the spout while the topology is deactivated. However, this turns off all the spouts of the topology, which you may not want.
Right now Storm Spouts have an open method to configure them and Bolts have a prepare method. Is there any way to make all the Spout instances wait for all the prepare methods on the Bolts listening to them to finish?
I have a case where I would like to pass some config info to the bolts on the fly (since this config info changes all the time). I've read in some places that we should use Zookeeper or an in-memory key-value storage like redis to do this. My worry though is, what happens if the Bolts aren't ready to process data from Spouts yet, and the Spouts start emitting tuples? Is there a way to make the Spouts wait for an update from the Bolts saying they're ready?
I found a slightly more elegant solution for this (I think). The problem was that certain bolts needed config info in order to process incoming tuples. I made use of Storm's ability to replay tuples, so now my bolts listen for updates from one spout and for tuples from the other. As long as I don't receive updates, I keep failing the tuples and have the spout replay them after a configurable amount of time.
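For illustration, a rough sketch of such a bolt; the component id "config-spout" and the field name "config" are placeholders, not names from the original setup:

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

import java.util.Map;

public class ConfigAwareBolt extends BaseRichBolt {
    private OutputCollector collector;
    private volatile Map<String, String> currentConfig;   // null until the first update arrives

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        if ("config-spout".equals(tuple.getSourceComponent())) {
            // update tuple: remember the new configuration
            currentConfig = (Map<String, String>) tuple.getValueByField("config");
            collector.ack(tuple);
        } else if (currentConfig == null) {
            // no configuration yet: fail the data tuple so the spout replays it later
            collector.fail(tuple);
        } else {
            // process the data tuple using currentConfig, then ack
            collector.ack(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // declare output streams here if this bolt emits anything downstream
    }
}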
Yes, you can use Redis to store your configuration and then read it from the prepare method.
The prepare method is invoked by the worker process, which starts processing tuples after it finishes. In fact, I think no tuple is emitted until all components of a worker process are ready. http://nathanmarz.github.io/storm/doc-0.8.1/index.html
Finally, you can have an additional spout which looks for configuration changes. Then, if a newer configuration is available, it is sent to your bolts via named streams.
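A minimal sketch of such a configuration spout, assuming a hypothetical fetchConfigFromRedis lookup and a named stream called "config-updates":

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.Map;

public class ConfigSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Map<String, String> lastSeen;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        Map<String, String> current = fetchConfigFromRedis();       // placeholder lookup
        if (current != null && !current.equals(lastSeen)) {
            lastSeen = current;
            collector.emit("config-updates", new Values(current));  // named stream
        } else {
            Utils.sleep(1000);   // nothing new, avoid spinning
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("config-updates", new Fields("config"));
    }

    // Placeholder: read the latest configuration from Redis (or Zookeeper, etc.)
    private Map<String, String> fetchConfigFromRedis() {
        return null;
    }
}

Bolts then subscribe to this stream in addition to their normal input, e.g. with allGrouping("config-spout", "config-updates") so every bolt instance receives the update.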
You don't have to worry about this. The Storm framework loads bolts before spouts: Storm loads the bolts in reverse order, so bolts towards the end of the topology are loaded before the bolts in the middle, and the spout gets loaded last.