I am reading text data from a file and processing it with Apache Storm to produce results. I want to experiment with different input data rates. How can I change the input data rate in Apache Storm in this setting? Also, is the input data rate defined as:
(number of tuples emitted by the spout) / time?
By default, Storm will pull tuples out of the spout as fast as possible. You can interact with this via a few settings:
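topology.max.spout.pending defines how many tuples a spout task may have pending (emitted into the topology but not yet acked or failed) before Storm throttles the spout and waits for some of them to be acked. It only takes effect for tuples emitted with a message ID. By default this is uncapped.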
topology.sleep.spout.wait.strategy.time.ms defines how many milliseconds Storm will pause between calls to nextTuple on the spout, if a call to nextTuple produces no output. This is 1ms by default.
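If you want a specific target rate rather than "as fast as possible", one option is to pace emissions inside the spout itself. The class below is only a sketch in plain Java with no Storm dependency; the name EmitPacer and its methods are made up for illustration. It enforces a minimum spacing between emits, so the achieved rate stays at or below the target.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: paces emissions to a target rate (tuples per second). */
class EmitPacer {
    private final long intervalNanos;  // minimum spacing between two emits
    private long nextAllowedNanos;     // earliest time the next emit may happen

    EmitPacer(double tuplesPerSecond, long startNanos) {
        if (tuplesPerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        this.intervalNanos = (long) (TimeUnit.SECONDS.toNanos(1) / tuplesPerSecond);
        this.nextAllowedNanos = startNanos;
    }

    /** Returns true if one tuple may be emitted at nowNanos, and reserves that slot. */
    boolean tryAcquire(long nowNanos) {
        if (nowNanos < nextAllowedNanos) {
            return false;              // too early: the caller should not emit
        }
        nextAllowedNanos = nowNanos + intervalNanos;
        return true;
    }
}
```

In a spout, nextTuple() would call pacer.tryAcquire(System.nanoTime()) and simply return without emitting when it gets false; Storm then applies the sleep wait strategy described above. Counting the emitted tuples and dividing by the elapsed time gives the rate as defined in the question.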
Related
We have a use case where we do not want to run a Storm topology continuously. Instead, there is a set of inputs (10K+) that should be processed at a specified time. The spout emits these inputs, and they are processed by the rest of the bolts in the topology. Once all the inputs are processed, there is nothing left to emit from nextTuple in my spout.
At that point we want our topology to go to sleep and restart the process every night at 12:00 am.
Is there any property in the Storm config to run the topology once a day, sleep after processing is done, and start again at the specified time?
I'm not aware of a feature like what you're asking for. Storm isn't a batch processing system, it's meant to be running continuously. Consider if Storm is a great fit for this use case.
That said, you should be able to implement what you want. You could put in an "I'm done" message at the end of your spout input. When the spout hits that message and all other pending messages are acked, it could use the Nimbus client to kill or deactivate the topology (depending on whether you want to kill or deactivate), see https://stackoverflow.com/a/37134473/8845188. Then the final step would be using your favorite scheduling software to resubmit/reactivate the topology every day at midnight.
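The "all input read and everything acked" condition can be tracked with a small helper inside the spout. This is a sketch in plain Java, not Storm API; DrainTracker and its method names are hypothetical, and the actual kill/deactivate call through the Nimbus client is left out.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: knows when a finite input is fully emitted and fully acked. */
class DrainTracker {
    private final Set<Object> pending = new HashSet<>(); // emitted but not yet acked
    private boolean endOfInput = false;                   // set when the "I'm done" marker is read

    /** Call next to the emit, with the same message ID. */
    void emitted(Object msgId) { pending.add(msgId); }

    /** Call from the spout's ack(msgId). A failed ID stays pending until it is replayed and acked. */
    void acked(Object msgId) { pending.remove(msgId); }

    /** Call when the spout reads the "I'm done" marker. */
    void reachedEndOfInput() { endOfInput = true; }

    /** True once the marker was seen and no emitted tuple is still pending. */
    boolean isDrained() { return endOfInput && pending.isEmpty(); }
}
```

The spout would check isDrained() after each ack and, once it returns true, ask Nimbus to kill or deactivate the topology as described in the linked answer. This assumes every tuple is emitted with a message ID, since otherwise the spout never receives acks.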
Say I deploy a topology with 2 workers; the topology has 1 spout and 1 bolt with 2 tasks. My understanding is that one worker will run the spout executor and one bolt executor, and the other worker will run the other bolt executor.
Is my understanding correct?
If my understanding is correct, here is my question. Say the bolt is implemented in Python. Since Storm transfers data to and from multi-lang bolts via stdout/stdin, how can the spout send data to a bolt located on the other host if the 2 workers run on different hosts?
A little more clarification on your question: Storm uses several types of queues to transfer data/tuples between the components of a topology.
For example:
1) Intra-worker communication in Storm (inter-thread on the same Storm node): LMAX Disruptor
2) Inter-worker communication (node-to-node across the network): ZeroMQ or Netty
3) Inter-topology communication: nothing built into Storm, you must take care of this yourself with e.g. a messaging system such as Kafka/RabbitMQ, a database, etc.
For further reference:
http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
To give a more detailed answer:
Storm will send the data to both bolt executors. For the bolt executor in the same worker as the spout, this happens in memory; for the other bolt executor, it goes over the network. Afterwards, each bolt instance delivers the input to a locally running Python process. Thus, the stdout/stdin delivery you describe happens locally on each machine. The data is transferred to each bolt before the hand-off from Java to Python happens.
Thus, the stdout/stdin bridge is used within each bolt, not between spout and bolt.
I ran a test myself. Storm properly delivers spout-emitted data to bolts on different hosts.
I am running a simple topology which has a simple spout that emits tuples with two fields and a bolt which just acks in the execute method. These run on two machines. With this setup and default configuration values, I get 10 ms for the complete latency, while both execute and process latency are 0.005 ms. I have disabled logging as well. What could be the issue? Storm version is 1.0.
If you run your topology on several machines, try using localOrShuffleGrouping() on your bolts. It prefers bolt tasks in the same worker process as the emitter, which removes unnecessary network traffic and delay.
For example, let's say you were using Storm to aggregate web visit start and end dates. A session starts with the first visit from a user and ends after 30 minutes of inactivity from that same user. This data is streamed into Storm in real time as it's collected. How would you tell Storm to emit data after those 30 minutes of inactivity?
I am not sure, but you can look at the TOPOLOGY_TICK_TUPLE_FREQ_SECS property in Storm. As found in this article:
Tick tuples: It’s common to require a bolt to “do something” at a fixed interval, like flush writes to a database. Many people have been using variants of a ClockSpout to send these ticks. The problem with a ClockSpout is that you can’t internalize the need for ticks within your bolt, so if you forget to set up your bolt correctly within your topology it won’t work correctly. 0.8.0 introduces a new “tick tuple” config that lets you specify the frequency at which you want to receive tick tuples via the “topology.tick.tuple.freq.secs” component-specific config, and then your bolt will receive a tuple from the __system component and __tick stream at that frequency.
You can also find sample code there for configuring spouts or bolts to receive tick tuples at a specific interval.
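To make the idea concrete for the 30-minute session case, here is a rough sketch of the bookkeeping a bolt could do: update a per-user session on every visit tuple, and close idle sessions whenever a tick tuple arrives. It is plain Java without Storm classes; SessionTracker, onVisit and onTick are hypothetical names, and timestamps are assumed to be epoch milliseconds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: per-user sessions that close after a period of inactivity. */
class SessionTracker {
    /** One session: user plus first and last visit timestamps. */
    static final class Session {
        final String user;
        final long start;
        long end;
        Session(String user, long start) { this.user = user; this.start = start; this.end = start; }
    }

    private final long timeoutMillis;
    private final Map<String, Session> open = new HashMap<>();

    SessionTracker(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    /** Call for every regular visit tuple. */
    void onVisit(String user, long timestampMillis) {
        Session s = open.get(user);
        if (s == null) {
            open.put(user, new Session(user, timestampMillis));
        } else {
            s.end = Math.max(s.end, timestampMillis);
        }
    }

    /** Call on each tick tuple; returns and forgets sessions idle for at least the timeout. */
    List<Session> onTick(long nowMillis) {
        List<Session> closed = new ArrayList<>();
        for (Iterator<Session> it = open.values().iterator(); it.hasNext(); ) {
            Session s = it.next();
            if (nowMillis - s.end >= timeoutMillis) {
                closed.add(s);
                it.remove();
            }
        }
        return closed;
    }
}
```

In the bolt's execute method, a tuple from the __system component on the __tick stream would trigger onTick and the returned sessions would be emitted downstream; any other tuple would go to onVisit. The tick frequency bounds how late a session is closed, and this only works if all visits of one user reach the same bolt task (for example via a fields grouping on the user ID).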
Right now Storm Spouts have an open method to configure them and Bolts have a prepare method. Is there any way to make all the Spout instances wait for all the prepare methods on the Bolts listening to them to finish?
I have a case where I would like to pass some config info to the bolts on the fly (since this config info changes all the time). I've read in some places that we should use ZooKeeper or an in-memory key-value store like Redis to do this. My worry, though, is what happens if the bolts aren't ready to process data from the spouts yet and the spouts start emitting tuples. Is there a way to make the spouts wait for an update from the bolts saying they're ready?
I found a slightly more elegant solution for this (I think). The problem was that certain bolts needed config info in order to process incoming tuples. I made use of Storm's ability to replay tuples, so now my bolts listen for updates from one spout and for tuples from the other. As long as I haven't received an update, I keep failing the tuples and have the spout replay them after a configurable amount of time.
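The gating logic in the bolt can be very small. The following is only an illustration in plain Java; ConfigGate is a hypothetical name, and the type parameter C stands for whatever object carries the configuration.

```java
import java.util.Objects;

/** Hypothetical helper: holds the latest configuration and reports whether one has arrived. */
class ConfigGate<C> {
    private C current; // stays null until the first update arrives

    /** Call when a tuple from the config-update spout arrives. */
    void onConfigUpdate(C config) {
        current = Objects.requireNonNull(config, "config");
    }

    /** False until the first configuration has been received. */
    boolean isReady() { return current != null; }

    /** Returns the latest configuration; only valid once isReady() is true. */
    C config() {
        if (current == null) {
            throw new IllegalStateException("no configuration received yet");
        }
        return current;
    }
}
```

In execute, a data tuple would be failed while isReady() is false and processed and acked otherwise. This relies on the data spout emitting with message IDs and re-emitting in its fail method, since Storm only notifies the spout of the failure and the replay itself is up to the spout.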
Yes, you can use Redis to store your configuration then read it from the prepare method.
The prepare method is invoked by the worker process, which starts processing tuples only after prepare has finished. Actually, I think that no tuple is emitted until all components of a worker process are ready. http://nathanmarz.github.io/storm/doc-0.8.1/index.html
Finally, you can have an additional spout that looks for configuration changes. If a newer configuration is available, it is sent to your bolts via named streams.
You don't have to worry about this. The Storm framework loads bolts before spouts, and it loads the bolts in reverse order: bolts towards the end of the topology are loaded before the bolts in the middle of the topology, and the spout is loaded last.