Retry topology if Zookeeper stops working - apache-storm

The Zookeeper used by Storm has stopped working, and because of this the topologies stop working. Do we have any mechanism so that Zookeeper will restart automatically?

You will have to define some supervision for Zookeeper. Try daemontools or Puppet.
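For example, a minimal daemontools run script could look like the following sketch (the service directory, user name and install path are assumptions; adjust them to your installation):

    #!/bin/sh
    # /service/zookeeper/run -- daemontools run script (sketch; paths and user are assumed)
    # supervise restarts this process automatically whenever it exits.
    exec 2>&1
    exec setuidgid zookeeper /opt/zookeeper/bin/zkServer.sh start-foreground

With daemontools, svscan watches /service and keeps the process up; a Puppet-managed service definition achieves the same effect.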

What do you mean by "Zookeeper stopped working"? Did you set up Zookeeper in reliable distributed mode? If yes, Zookeeper should be available all the time and Storm topologies should keep running.
However, if one of your ZK nodes dies, you need to start up a new one manually.
See "Set up a Zookeeper cluster" in https://storm.apache.org/documentation/Setting-up-a-Storm-cluster.html
See also https://storm.apache.org/documentation/images/storm-cluster.png from https://storm.apache.org/tutorial.html
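In reliable (replicated) mode, each Zookeeper server gets a zoo.cfg that lists every member of the ensemble, plus a myid file in its data directory. A minimal sketch, with hostnames that are only placeholders:

    # zoo.cfg -- 3-node replicated ensemble (hostnames are assumptions)
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888

With three servers the ensemble stays available as long as two of them (a majority) are up.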

Related

How to keep Zookeeper running on EC2

I installed Zookeeper and Kafka on my EC2 instances, and they work well.
However, I wondered how to keep Zookeeper running on EC2.
I think if I make a real-time streaming service using Kafka, Zookeeper has to stay in a running state.
But it shuts down when I close the CLI.
How can I keep Zookeeper running on EC2? Is it possible?
Yes, until Zookeeper is fully removed from Kafka (target late 2022), it stores important topic and broker information.
Ideally, you'd use a process supervisor like SystemD to (re)start, stop, and monitor the process. If you apt/yum install Confluent Platform, then it'll come with SystemD scripts for both Kafka and Zookeeper, so you wouldn't need to write your own.
And you'd use zookeeper-server-start -daemon zookeeper.properties to make it run in the background.
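If you go the SystemD route instead of the -daemon flag, a minimal unit file could look like this sketch (the install paths are assumptions; under SystemD the process should run in the foreground so SystemD can supervise and restart it):

    # /etc/systemd/system/zookeeper.service -- sketch; paths are assumptions
    [Unit]
    Description=Apache Zookeeper (bundled with Kafka)
    After=network.target

    [Service]
    Type=simple
    ExecStart=/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
    ExecStop=/opt/kafka/bin/zookeeper-server-stop.sh
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Then sudo systemctl enable --now zookeeper starts it at boot and restarts it on failure.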
Or you can just use Amazon MSK and not worry about the infrastructure.

Does Storm support an HA host, e.g. for the Nimbus host?

If so, from which version does Storm support this? I want to know because I want to upgrade my Storm version (my current version is 0.10.0).
High availability for Nimbus was introduced in Storm 1.0.0 (see https://storm.apache.org/2016/04/12/storm100-released.html)
However, even for prior Storm versions, missing HA for Nimbus was not a critical issue, because a failing Nimbus does not affect running topologies. The only problem while Nimbus is down is that no interaction with the cluster is possible from outside (e.g., submitting new topologies).
Workers are HA too, i.e., supervisors can restart failing workers. Supervisors are not HA -- however, the tasks they host will be redistributed automatically to other supervisors if one supervisor fails.
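If you do upgrade to 1.0.0 or later, Nimbus HA is configured by listing several Nimbus hosts under nimbus.seeds in storm.yaml (which replaces the old nimbus.host setting). A sketch with placeholder hostnames:

    # storm.yaml (Storm 1.0.0+) -- hostnames are assumptions
    storm.zookeeper.servers:
      - "zk1.example.com"
      - "zk2.example.com"
      - "zk3.example.com"
    nimbus.seeds: ["nimbus1.example.com", "nimbus2.example.com"]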

What happens if the namenode and the ZooKeeper fail together

What happens if the NameNode and ZooKeeper fail together? Is this possible? Also, do the various QJM JournalNodes keep copies of each other's edit logs?
If the Zookeeper servers are installed on other nodes (not on the NameNode), Zookeeper brings the standby NameNode to the active state.
If you have installed more than one Zookeeper server, for example three, and one of them fails, an election takes place and a new Zookeeper leader is chosen.
A Zookeeper quorum is used to tolerate Zookeeper failures. Zookeeper replicates its data to the other nodes in the quorum. In case of a failure, an election occurs and a new node is appointed leader, which directs clients to the standby NameNode.
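For reference, automatic NameNode failover in HDFS is driven by the ZKFailoverController and is enabled with settings along these lines (hostnames are placeholders):

    <!-- hdfs-site.xml: let the ZKFailoverController trigger failover -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>

    <!-- core-site.xml: the Zookeeper quorum used for failover (hostnames are assumptions) -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>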

Running Storm nimbus and supervisor on the same physical node in cluster mode

I have a Storm cluster of 2 physical nodes right now. I'm running storm nimbus on node-1 and storm supervisor on node-2. It looks like all my topologies are running on node-2 (the supervisor node) only. Should I run a supervisor on node-1 as well?
Thanks
You could, but I wouldn't recommend it.
In Storm's current design, nimbus is a single point of failure (there are plans to address this), but running a supervisor on the same node as nimbus makes it more likely that something bad might happen to the nimbus node, which would be catastrophic for your Storm cluster.
Further, part of Storm's design is that the workers and the supervisor nodes should be able to die and Storm should be able to recover. If you use your node-1 as a supervisor in addition to it being the nimbus server, you lose some of that flexibility.
Finally, as your cluster grows, your nimbus server will have plenty to do on its own, and you want it to operate quickly; it can become a bottleneck if you don't give it adequate resources.
If you want topologies to run on node-1, then yes, you should run the Supervisor process on node-1 as well. The Nimbus helps to coordinate work among Supervisors, but does not execute a topology's Workers itself. For more details, see http://storm.incubator.apache.org/documentation/Tutorial.html
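As a sketch, with two nodes you would end up running something like this (in production each daemon should be kept alive by a process supervisor rather than started in the background by hand):

    # node-1: master daemon plus a worker-hosting supervisor
    storm nimbus &
    storm supervisor &

    # node-2: worker-hosting supervisor only
    storm supervisor &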

Need help regarding storm

1) What happens if Nimbus fails? Can we convert some other node into a Nimbus?
2) Where is the output of a topology stored? When a bolt emits a tuple, where is it stored?
3) What happens if Zookeeper fails?
Nimbus is itself a failure-tolerant process, which means it doesn't store its state in memory but in an external database (Zookeeper). So if Nimbus crashes (an unlikely scenario), on the next start it will resume processing just where it stopped. Nimbus usually must be set up to be monitored by an external monitoring system, such as Monit, which will check the Nimbus process state periodically and restart it if any problem occurs. I suggest you read the Storm project's wiki for further information.
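As an illustration, a minimal Monit check for the Nimbus process might look like this sketch (the match pattern and the start/stop commands are assumptions and depend on how you launch Storm, e.g. via a systemd unit named storm-nimbus):

    # /etc/monit/conf.d/storm-nimbus -- sketch; pattern and commands are assumptions
    check process storm-nimbus matching "daemon.nimbus"
      start program = "/bin/systemctl start storm-nimbus"
      stop program  = "/bin/systemctl stop storm-nimbus"

Monit will then restart the process whenever the check finds it missing.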
Nimbus is the master node of a Storm cluster, and it isn't possible to have multiple Nimbus nodes. (Update: the Storm community is now (as of 5/2014) actively working on making the Nimbus daemon fault tolerant in a failover manner, by having multiple Nimbuses heartbeating each other.)
The tuple is "stored" in the tuple tree, and it is passed to the next bolt in the topology execution chain as topology execution progresses. As for physical storage, tuples are probably stored in an in-memory structure and serialized as necessary to be distributed among the cluster's nodes. The complete Storm cluster's state itself is stored in Zookeeper. Storm doesn't concern itself with persistent storage of a topology's or a bolt's output -- it is your job to persist the results of the processing.
Same as for Nimbus, Zookeeper in a real, production Storm cluster must be configured for reliability, and for Zookeeper that means having an odd number of Zookeeper nodes running on different servers. You can find more information on configuring a Zookeeper production cluster in the Zookeeper Administrator's Guide. If Zookeeper were to fail (although a highly unlikely scenario in a properly configured Zookeeper cluster), the Storm cluster wouldn't be able to continue processing, since the entire cluster's state is stored in Zookeeper.
Regarding question 1), this bug report and subsequent comment from Storm author and maintainer Nathan Marz clarifies the issue:
Storm is not designed for having topologies partially running. When you bring down the master, it is unable to reassign failed workers. We are working on Nimbus failover. Nimbus is fault-tolerant to the process restarting, which has made it fault-tolerant enough for our and most people's use cases.
