I have started to use Storm recently but could not find any resources on the net about the fault tolerance of the global grouping option.
According to my understanding of the documentation: in a topology where Bolt A subscribes to Bolt B with global grouping, the task of Bolt A receives the tuples emitted by all tasks of Bolt B. Because global grouping is used, there is only one task of Bolt A in the topology.
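(To make the setup concrete, this is roughly how I wire it up; the spout/bolt class names are just placeholders, and older Storm releases use the backtype.storm packages instead of org.apache.storm:)

```java
import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new MySpout(), 2);                        // placeholder spout
builder.setBolt("boltB", new BoltB(), 4).shuffleGrouping("spout");  // placeholder bolt, 4 tasks
// globalGrouping: the entire stream from all boltB tasks goes to a single boltA task
builder.setBolt("boltA", new BoltA()).globalGrouping("boltB");
```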
The question is as follows: what happens if we store some historical data of the stream within Bolt A and the worker process that contains the task of Bolt A dies? In other words, will the data stored in this bolt be lost?
Thanks in advance
Once all the downstream tasks have acked a tuple, it means they have successfully processed the message and it need not be replayed after a shutdown. If you are keeping any state in memory, you should store it in a persistent store. A message should be acked only once the state change it caused has been persisted.
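A minimal sketch of that pattern, assuming the actual write to the external store happens in a hypothetical persistStateChange() helper:

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

import java.util.Map;

public class PersistingBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // connect to your persistent store (DB, key-value store, ...) here
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            persistStateChange(tuple);   // write the state change to the external store first
            collector.ack(tuple);        // ack only after the write has succeeded
        } catch (Exception e) {
            collector.fail(tuple);       // not persisted: let the spout replay the tuple
        }
    }

    private void persistStateChange(Tuple tuple) throws Exception {
        // placeholder for the actual write to your DB / file system
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no downstream stream in this sketch
    }
}
```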
Related
For example, tuple A is currently being processed on server B. Suddenly, server B is shut down by my crazy colleague. Will the topology process tuple A again on another server?
If you enable fault-tolerance (and the tuple was not acked), then yes.
What API are you using? For the low-level API, you enable fault-tolerance by assigning IDs to the tuples you emit in your spouts.
See https://storm.apache.org/releases/1.0.2/Guaranteeing-message-processing.html for more details.
For Trident, it depends on what spout you are using: https://storm.apache.org/releases/1.0.2/Trident-spouts.html
In my topology I have a spout with a socket opened on port 5555 to receive messages.
If I have 10 supervisors in my Storm cluster, will each of them be listening on its own port 5555?
In the end, to which supervisor should I send messages?
Multiple comments here:
Storm uses a pull-based model for data ingestion via spouts. If you open a socket, you will block the spout until data is available (and this is bad; see this SO question for more details: Why should I not loop or block in Spout.nextTuple())
About Spout deployment (Supervisors):
first, it depends on the parallelism of your spout (i.e., parallelism_hint; the default value is one)
second, supervisors do not execute spout code: supervisors start worker JVMs that execute the spouts/bolts (see the topology config parameter number_of_workers)
third, Storm uses a load-balanced round-robin scheduler; thus, it might happen that two spout executors are scheduled to the same worker JVM (or to different workers on the same host); in that case, you will get a port conflict (only one executor will be able to open the port)
Data distribution should not matter in this case: if you really go with push, you can choose any host to send the data to; Storm does not care. Of course, if you need some kind of key-based partitioning, you might want to send data from a single partition to a single spout instance; as an alternative, just forward the data within the spout and use fieldsGrouping to get your partitions for the consuming bolt. However, if you use pull-based data ingestion in the spout, you can ensure that each spout pulls data from certain partitions and the problem resolves naturally.
To sum up: using push-based data ingestion might be a bad idea.
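If you do stick with the push/socket approach anyway, the usual workaround is to let a separate reader thread own the socket and have nextTuple() only drain an in-memory queue, so it never blocks. A sketch (the reader thread, queue size, and message IDs are illustrative, not anything Storm provides):

```java
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.LinkedBlockingQueue;

public class SocketSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private LinkedBlockingQueue<String> queue;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.queue = new LinkedBlockingQueue<>(10_000);
        // Hypothetical reader thread: owns the socket on port 5555 and enqueues incoming lines.
        new Thread(() -> readSocketInto(queue), "socket-reader").start();
    }

    @Override
    public void nextTuple() {
        String line = queue.poll();          // non-blocking
        if (line == null) {
            Utils.sleep(1);                  // nothing available; avoid busy-spinning
            return;
        }
        collector.emit(new Values(line), UUID.randomUUID()); // message ID enables ack/fail tracking
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }

    private void readSocketInto(LinkedBlockingQueue<String> q) {
        // placeholder: accept connections on port 5555 and q.offer(...) each received line
    }
}
```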
As part of the development of streamparse, we have a BatchingBolt that processes tuples in batches. It's intended for use with things like databases that are more performant when you send things in batches.
I've recently proposed switching our BatchingBolt implementation over from a timer/thread approach to using tick tuples; however, one of my fellow devs pointed out that with our current approach the final batch will definitely get processed when a topology is shut down (and is in the inactive state), whereas that behaviour isn't explicitly documented anywhere for tick tuples.
Therefore, my question is this: does Storm continue sending tick tuples to bolts after a kill/deactivate has been issued, while it is in the waiting/inactive period? The topology lifecycle docs don't make it clear.
http://mail-archives.apache.org/mod_mbox/storm-user/201506.mbox/%3CCAF5108ijGpdMeax1LaKQ1MG6MSZQF=YM=vO8AacmN0RUiNfNkQ#mail.gmail.com%3E
AFAIK, "setup-tick!" is called from start of executor (which schedules tick timer for each executor), and tick tuples will be emitted unless worker is going to be shutdown.
In short, your fellow dev is correct.
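For context, this is the standard Java-side way a batching bolt opts into tick tuples; the frequency value and the batch helpers below are illustrative. The point of the thread above is that these ticks keep arriving until the worker itself shuts down:

```java
import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

import java.util.HashMap;
import java.util.Map;

public class TickBatchingBolt extends BaseBasicBolt {

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<>();
        // Ask Storm to send this bolt a tick tuple every 5 seconds (example value).
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
        return conf;
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (isTick(tuple)) {
            flushBatch();        // placeholder: write the accumulated batch to the database
        } else {
            addToBatch(tuple);   // placeholder: buffer the tuple
        }
    }

    private boolean isTick(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    private void flushBatch() { /* placeholder */ }
    private void addToBatch(Tuple tuple) { /* placeholder */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}
```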
As I understand things, ZooKeeper will persist tuples emitted by bolts so if a bolt crashes (or a computer with the bolt crashes, or the entire cluster crashes), the tuple emitted by the bolt will not be lost. Once everything is restarted, the tuples will be fetched from ZooKeeper, and everything will continue on as if nothing bad ever happened.
What I don't yet understand is if the same thing is true for spouts. If a spout emits a tuple (i.e., the emit() function within a spout is executed), and the computer the spout is running on crashes shortly thereafter, will that tuple be resurrected by ZooKeeper? Or do we need Kafka in order to guarantee this?
P.S. I understand that the tuple emitted by the spout must be assigned a unique ID in the call to emit().
P.P.S. I see sample code in books that uses something like ConcurrentHashMap<UUID, Values> to track which spouted tuples have not yet been acked. Is this somehow automatically persisted with ZooKeeper? If not, then I shouldn't really be doing that, should I? What should I be doing instead? Using Kafka?
Florian Hussonnois answered my question thoroughly and clearly in this storm-user thread. This was his answer:
Actually, the tuples are not persisted in ZooKeeper. If your spout emits a tuple with a unique id, it is automatically tracked internally by Storm (i.e., by the ackers). Thus, if the emitted tuple fails because of a bolt failure, Storm invokes the 'fail' method on the originating spout task with the unique id as the argument. It is then up to you to re-emit the failed tuple.

In sample code, spouts use a Map to track which tuples have been fully processed by the entire topology, so that they can re-emit them in case of a bolt failure.

However, if the failure doesn't come from a bolt but from your spout, the in-memory Map will be lost and your topology will not be able to re-emit failed tuples.

For such a scenario you can rely on Kafka. In fact, the Kafka spout stores its read offsets in ZooKeeper. That way, if a spout task goes down, it can read its offsets back from ZooKeeper after restarting.
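A minimal sketch of the pattern he describes (the pending map and the message source are illustrative; they are not something Storm provides or persists for you):

```java
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class ReplayingSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    // In-memory only: this map is lost if the spout's worker dies (hence the Kafka suggestion).
    private ConcurrentHashMap<UUID, Values> pending;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.pending = new ConcurrentHashMap<>();
    }

    @Override
    public void nextTuple() {
        String msg = pollSource();          // hypothetical message source
        if (msg == null) return;
        UUID id = UUID.randomUUID();
        Values values = new Values(msg);
        pending.put(id, values);
        collector.emit(values, id);         // the unique message ID lets the ackers track this tuple tree
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);              // fully processed by the whole topology
    }

    @Override
    public void fail(Object msgId) {
        Values values = pending.get(msgId);
        if (values != null) {
            collector.emit(values, msgId);  // re-emit the failed tuple ourselves
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("msg"));
    }

    private String pollSource() { return null; } // placeholder
}
```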
I'm trying to figure out how to restore the state of a Storm bolt instance during failover. I can persist the state externally (DB or file system); however, once the bolt instance is restarted, I need to point to the specific state of that bolt instance to recover it.
The prepare method of a bolt receives a context, documented here http://nathanmarz.github.io/storm/doc/backtype/storm/task/TopologyContext.html
What is not clear to me is: is there any piece of this context that uniquely identifies the specific bolt instance, so I can understand which persistent state to point to? Is that ID preserved during failover? Alternatively, is there any variable/object I can set for the specific bolt instance that is preserved during failover? Any help appreciated!
br
Sib
P.S.
New to stackoverflow so pls bear with me...
You can probably look at Trident. It's basically an abstraction built on top of Storm. The documentation says:
Trident has first-class abstractions for reading from and writing to stateful sources. The state can either be internal to the topology – e.g., kept in-memory and backed by HDFS – or externally stored in a database like Memcached or Cassandra
In case of any failover, it says:
Trident manages state in a fault-tolerant way so that state updates are idempotent in the face of retries and failures.
You can go through the documentation for any further clarification.
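For a feel of what that looks like, here is a small sketch in the style of the Trident word-count example from the documentation (wordSpout is a placeholder Trident spout emitting a "word" field; MemoryMapState is an in-memory, testing-oriented state, and swapping its factory for a Memcached/Cassandra-backed one keeps the state in an external store with the same idempotent-update guarantees on retries):

```java
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

TridentTopology topology = new TridentTopology();
topology.newStream("word-counts", wordSpout)          // wordSpout: placeholder spout
        .groupBy(new Fields("word"))
        // Trident manages this state fault-tolerantly; updates are idempotent across retries.
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
```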
Thanks (and credit) to the Storm user group!
http://mail-archives.apache.org/mod_mbox/storm-user/201312.mbox/%3C74083558E4509844944FF5CF2BA7B20F1060FD0E#ESESSMB305.ericsson.se%3E
In the original (core) Storm, both spouts and bolts are stateless. Storm can restart failed nodes, but it takes some effort to restore their state. There are two solutions that I can think of:
If a message fails to be processed, Storm will replay it from the ROOT of the topology (the spout), and the replay logic has to be implemented by the user. So in this case I would put more state information (e.g., the ID of some external state storage and the ID of this task) in the messages.
Or you can use Trident. It provides a txid for each transaction and simplifies the storage process.
I'm OK with the first solution because my app doesn't require transactional operations and I have a better understanding of the original Storm (Storm generates simpler logs than Trident does).
You can use the task ID.
Task ids are assigned at topology creation and are static. If a task dies/restarts or gets reassigned somewhere else, it will still have the same id.
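A brief sketch of how the task ID can be used to key the persisted state (the restore call is hypothetical):

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

import java.util.Map;

public class RecoverableBolt extends BaseRichBolt {
    private OutputCollector collector;
    private String stateKey;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // The task id is assigned at topology creation and stays the same across restarts,
        // so component id + task id can serve as a stable key for the external state.
        this.stateKey = context.getThisComponentId() + "-" + context.getThisTaskId();
        restoreStateFrom(stateKey);   // placeholder: load previous state from your DB / file system
    }

    @Override
    public void execute(Tuple tuple) {
        // ... update in-memory state, persist it under stateKey, then ack ...
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }

    private void restoreStateFrom(String key) { /* placeholder */ }
}
```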