How to restore bolt state during failover - apache-storm

I'm trying to figure out how to restore the state of a Storm bolt instance during failover. I can persist the state externally (DB or file system), but once the bolt instance is restarted I need to point to the specific state of that bolt instance in order to recover it.
The prepare method of a bolt receives a context, documented here http://nathanmarz.github.io/storm/doc/backtype/storm/task/TopologyContext.html
What is not clear to me is whether any piece of this context uniquely identifies the specific bolt instance, so I can tell which persisted state to point to. Is that ID preserved across failover? Alternatively, is there any variable/object I can set for the specific bolt instance that is preserved across failover? Any help appreciated!
br
Sib
P.S.
New to stackoverflow so pls bear with me...

You could look at Trident. It's basically an abstraction built on top of Storm. The documentation says:
Trident has first-class abstractions for reading from and writing to stateful sources. The state can either be internal to the topology – e.g., kept in-memory and backed by HDFS – or externally stored in a database like Memcached or Cassandra
In case of failover, it says:
Trident manages state in a fault-tolerant way so that state updates are idempotent in the face of retries and failures.
You can go through the documentation for any further clarification.
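For a feel of what that looks like, here is a minimal sketch along the lines of the Trident word-count example (Storm 1.x org.apache.storm package names; older releases use backtype.storm, and the stream name "spout1" and the test spout are arbitrary). MemoryMapState is an in-memory stand-in for an external store such as Memcached or Cassandra:

import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentStateSketch {
    public static TridentTopology build() {
        // Test spout emitting a single "word" field per tuple.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("word"), 3,
                new Values("apple"), new Values("orange"), new Values("apple"));

        TridentTopology topology = new TridentTopology();
        // Trident owns the per-word counts. MemoryMapState stands in for an external
        // store; updates made through persistentAggregate are applied idempotently
        // when tuples are replayed after a failure, per the guarantee quoted above.
        TridentState wordCounts = topology
                .newStream("spout1", spout)
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        return topology;
    }
}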

Tx (and credit) to Storm user group!
http://mail-archives.apache.org/mod_mbox/storm-user/201312.mbox/%3C74083558E4509844944FF5CF2BA7B20F1060FD0E#ESESSMB305.ericsson.se%3E

In original Storm, both spouts and bolts are stateless. Storm manages to restart failed nodes, but it takes some effort to restore their state. There are two solutions I can think of:
If a message fails to process, Storm will replay it from the ROOT of the topology, and the replay logic has to be implemented by the user. So in this case I would put more state information (e.g. the ID of some external state storage and the id of this task) in the messages.
Or you can use Trident. It provides a txid for each transaction and simplifies the storage process.
I'm OK with the first solution because my app doesn't require transactional operations and I have a better understanding of original Storm (Storm generates simpler logs than Trident does).

You can use the task ID.
Task ids are assigned at topology creation and are static. If a task dies/restarts or gets reassigned somewhere else, it will still have the same id.
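As a rough sketch of how that could be combined with external persistence (org.apache.storm package names; loadState/saveState are hypothetical helpers for whatever DB or file system you use):

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class RestorableBolt extends BaseRichBolt {
    private OutputCollector collector;
    private String stateKey;
    private Map<String, Long> counters;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Task ids are assigned at topology creation and survive restarts/reassignment,
        // so they make a stable key for externally persisted state.
        this.stateKey = context.getThisComponentId() + "-" + context.getThisTaskId();
        this.counters = loadState(stateKey);   // restore whatever was persisted before the failover
    }

    @Override
    public void execute(Tuple input) {
        counters.merge(input.getString(0), 1L, Long::sum);
        saveState(stateKey, counters);         // persist before acking
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }

    // Hypothetical persistence helpers: back these with your DB or file system.
    private Map<String, Long> loadState(String key) { return new HashMap<>(); }
    private void saveState(String key, Map<String, Long> state) { }
}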

Related

Sharing local store between different KafkaStreams

Given: two KafkaStreams instances, each with its own DSL topology. A local state store is added to one of the topologies. What is the optimal way for the second KafkaStreams instance to update the local store in the first one?
I could think of adding a processor to the KafkaStreams instance that owns the local store. This processor would have (1) a static task list populated by the second KafkaStreams instance, and (2) a Punctuator that processes tasks from that list.
Unfortunately, this design doesn't provide any fault-tolerance guarantee.
Any better approach?
Local state of an application should only be updated by the application itself.
Not sure what exactly you want to achieve. One way to "update" state from another Kafka Streams instance might be via a topic: instance A creates a table from the topic, and instance B writes into the topic when it wants to update A's table state (see the sketch below).
Hope this helps. If not, maybe update your question to give more details about what you want to achieve.
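A rough sketch of that topic-based pattern, assuming made-up names (topic "shared-updates", application id "instance-a", broker at localhost:9092):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;

public class SharedStateViaTopic {

    // Instance A: materializes the topic into a local table.
    public static KafkaStreams instanceA() {
        StreamsBuilder builder = new StreamsBuilder();
        KTable<String, String> table =
                builder.table("shared-updates", Consumed.with(Serdes.String(), Serdes.String()));
        // ... use 'table' in joins / further processing ...

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "instance-a");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        return new KafkaStreams(builder.build(), props);
    }

    // Instance B: "updates" A's table state simply by producing to the same topic.
    public static void updateFromInstanceB(String key, String value) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("shared-updates", key, value));
        }
    }
}

Because the table in instance A is sourced from the topic, it is also rebuilt from that topic after a failure.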

Turn recovery on after first message

I have a persistent actor which receives many messages. The first message is CREATE (a case class) and the following messages are UPDATEs (case classes). So if it receives CREATE, it should not need to go to persistence and run recovery, because the storage is empty for this actor. That's wasted effort from my perspective.
Is there any way to skip recovery for a particular input message (the first one, which is CREATE)?
A persistent actor will always have to hit the database, because there is no other way to know whether it existed before: it could have been created in a previous instance of the application that was since stopped, or on a different node in a cluster.
In general, a good pattern for performance is to keep the actor in memory after it has been hit the first time, as that allows responses to be as fast as possible. The most common way to do this is Cluster Sharding (which you can read more about in the docs here: https://doc.akka.io/docs/akka/current/cluster-sharding.html?language=scala#cluster-sharding).
I have never heard of anyone seeing the hit for an empty persistent actor as a performance problem, and I'm not sure it is possible to solve in a general way. If you have such a problem and can somehow know the actor was never created before, you cannot express that with Akka Persistence; you would have to build a special solution for it yourself.
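For reference, a minimal sketch of such a CREATE/UPDATE persistent actor (written in Java here, with made-up class and field names; the question's case classes map onto the Create/Update message types). Recovery always runs on start; for a persistenceId that has never persisted anything it simply replays nothing:

import java.io.Serializable;

import akka.persistence.AbstractPersistentActor;

public class EntityActor extends AbstractPersistentActor {

    // Hypothetical message/event types standing in for the question's case classes.
    public static final class Create implements Serializable {
        public final String name;
        public Create(String name) { this.name = name; }
    }
    public static final class Update implements Serializable {
        public final String value;
        public Update(String value) { this.value = value; }
    }

    private final String entityId;
    private String state;

    public EntityActor(String entityId) { this.entityId = entityId; }

    @Override
    public String persistenceId() { return "entity-" + entityId; }

    // Always runs on start; for an id that was never created this replays nothing.
    @Override
    public Receive createReceiveRecover() {
        return receiveBuilder()
                .match(Create.class, c -> state = c.name)
                .match(Update.class, u -> state = u.value)
                .build();
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(Create.class, c -> persist(c, evt -> state = evt.name))
                .match(Update.class, u -> persist(u, evt -> state = evt.value))
                .build();
    }
}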

When I use Storm, if one server crashes (e.g. is shut down), will the topology process the tuple that was being processed on that host again?

For example, tuple A is being processed on server B. Suddenly, server B is shut down by my crazy colleague. Will the topology process tuple A again on another server?
If you enable fault tolerance (and the tuple was not acked), then yes.
What API are you using? For the low-level API, you enable fault tolerance by assigning IDs to the tuples you emit in your spouts (a sketch follows this answer).
See https://storm.apache.org/releases/1.0.2/Guaranteeing-message-processing.html for more details.
For Trident, it depends on what spout you are using: https://storm.apache.org/releases/1.0.2/Trident-spouts.html
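A minimal sketch of the low-level approach (org.apache.storm package names; nextLine() is a hypothetical input source):

import java.util.Map;
import java.util.UUID;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class ReliableSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String line = nextLine();
        if (line != null) {
            // Passing a message id turns on tracking for this tuple tree. If a worker
            // dies before the tree is fully acked, fail() is eventually called and the
            // tuple can be re-emitted.
            collector.emit(new Values(line), UUID.randomUUID().toString());
        }
    }

    @Override
    public void ack(Object msgId) { /* fully processed: safe to forget / commit */ }

    @Override
    public void fail(Object msgId) { /* not fully processed: re-queue for replay */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }

    private String nextLine() { return null; /* hypothetical input source */ }
}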

Storm Global Grouping Fault Tolerance

I have started using Storm recently but could not find any resources on the net about the fault tolerance of the global grouping option.
According to my understanding of the documents: when running a topology with a bolt (Bolt A) that uses global grouping, all tuples from the tasks of Bolt B are routed into a single task of Bolt A. Since it uses the global grouping option, there is only one task of Bolt A in the topology.
The question is: what will happen if we store some historical data of the stream within Bolt A and the worker process that contains the task of Bolt A dies? In other words, will the data stored in this bolt get lost?
Thanks in advance
Once all the downstream tasks have acked the tuple, it means they have successfully processed the message, and it need not be replayed after a shutdown. If you are keeping any state in memory, you should also keep it in a persistent store, and the message should only be acked once the state change it caused has been persisted (see the sketch below).
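A rough sketch of that persist-then-ack pattern for the single Bolt A task (HistoryStore is a hypothetical wrapper around your persistent store):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class GlobalHistoryBolt extends BaseRichBolt {
    private OutputCollector collector;
    private List<String> history;
    private String taskKey;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.taskKey = "bolt-a-" + context.getThisTaskId();
        this.history = HistoryStore.load(taskKey);   // reload history after a worker death
    }

    @Override
    public void execute(Tuple input) {
        try {
            history.add(input.getString(0));
            HistoryStore.save(taskKey, history);      // make the change durable first
            collector.ack(input);                     // only then ack, so failures are replayed
        } catch (Exception e) {
            collector.fail(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }

    // Hypothetical persistent store.
    static class HistoryStore {
        static List<String> load(String key) { return new ArrayList<>(); }
        static void save(String key, List<String> value) { }
    }
}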

Does Storm keep sending tick tuples to bolts when a topology is deactivated?

As part of the development of streamparse, we have a BatchingBolt that processes tuples in batches. It's intended for use with things like databases that are more performant when you send things in batches.
I've recently proposed switching our BatchingBolt implementation over from a timer/thread approach to tick tuples; however, one of my fellow devs pointed out that with our current approach the final batch will definitely get processed when a topology is shut down (and is in the inactive state), whereas that behavior isn't explicitly documented anywhere for tick tuples.
Therefore, my question is this: does Storm continue sending tick tuples to bolts after a kill/deactivate has been issued, while the topology is in the waiting/inactive period? The topology lifecycle docs don't make it clear.
http://mail-archives.apache.org/mod_mbox/storm-user/201506.mbox/%3CCAF5108ijGpdMeax1LaKQ1MG6MSZQF=YM=vO8AacmN0RUiNfNkQ#mail.gmail.com%3E
AFAIK, "setup-tick!" is called at executor start (it schedules a tick timer for each executor), and tick tuples will be emitted unless the worker is about to be shut down.
In short, your fellow dev is correct.
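For context, here is a sketch of a tick-driven batching bolt (the flush target and the 10-second frequency are arbitrary). It does not by itself settle whether a final tick arrives during the deactivate window, which is the question above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// BaseBasicBolt auto-acks each tuple; a reliable implementation would use
// BaseRichBolt and ack only when the batch is flushed.
public class TickFlushingBolt extends BaseBasicBolt {
    private final List<Tuple> batch = new ArrayList<>();

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);   // ask Storm for a tick every 10s
        return conf;
    }

    private static boolean isTick(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (isTick(tuple)) {
            flush();                // e.g. one bulk write to the database
        } else {
            batch.add(tuple);
        }
    }

    private void flush() { batch.clear(); /* write the batch somewhere */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
}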
