Storm/ZooKeeper transaction logs extremely large - apache-storm

I'm using a ZooKeeper cluster (3 machines) for my Storm cluster (4 machines). The problem is that, because of the topologies deployed on the Storm cluster, the ZooKeeper transaction logs grow extremely large and fill the ZooKeeper disk. What is really strange is that those logs are not divided into multiple files; instead I have one big transaction log file on every ZooKeeper machine, so the autopurge settings in my ZooKeeper configuration have no effect on those files.
Is there a way to solve this problem from the ZooKeeper side, or can I change the way Storm uses ZooKeeper to minimize the size of those logs?
Note: I'm using ZooKeeper 3.6.4 and Storm 0.9.6.

I was able to resolve this problem by using Pacemaker to process heartbeats from workers instead of ZooKeeper. That allowed me to avoid the disk writes ZooKeeper makes to maintain consistency and to use an in-memory store instead. In order to be able to use Pacemaker I upgraded to Storm 1.0.2.
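For reference, the Pacemaker setup boils down to a couple of storm.yaml entries. The hostname below is a placeholder, and the keys follow the Storm 1.x Pacemaker documentation; check them against your exact Storm version:

```yaml
# storm.yaml sketch (on Nimbus, the supervisors, and the Pacemaker host itself).
# Hostname is a placeholder; verify the keys against your Storm version's Pacemaker docs.
pacemaker.host: "pacemaker.example.com"
pacemaker.port: 6699
# Keep worker heartbeats in Pacemaker's in-memory store instead of ZooKeeper:
storm.cluster.state.store: "org.apache.storm.pacemaker.pacemaker_state_factory"
```

The Pacemaker daemon itself is then launched on its host (in Storm 1.x, via bin/storm pacemaker), and worker heartbeats stop hitting the ZooKeeper transaction log.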

Related

WildFly 11 - High Availability - Single deployment on slave

I have two servers in HA mode. I'd like to know whether it is possible to deploy an application only on the slave server, and if so, how to configure this in JGroups. I need to run a specific program that accesses the master database, but I would not like to run it on the master server, to avoid overhead there.
JGroups itself does not know much about WildFly and the deployments; it only creates a communication channel between nodes. I don't know where you get the notion of master/slave, but JGroups always has a single* node marked as coordinator. You can check the membership through Channel.getView() (see the sketch below).
However, you still need to deploy the app on both nodes and just make it inactive if this is not its target node.
*) Provided there's no split-brain partition or similar rare/temporary issue
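A minimal sketch of the coordinator check with the plain JGroups API (the cluster name is hypothetical; inside WildFly you would normally obtain the channel the server already manages through the jgroups subsystem rather than create your own):

```java
import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.View;

public class CoordinatorCheck {
    public static void main(String[] args) throws Exception {
        // Standalone example; the cluster name is a placeholder.
        JChannel channel = new JChannel();   // default protocol stack
        channel.connect("my-cluster");

        View view = channel.getView();
        // By convention the first member of the view is the coordinator.
        Address coordinator = view.getMembers().get(0);
        boolean iAmCoordinator = coordinator.equals(channel.getAddress());

        System.out.println("View: " + view + ", coordinator here: " + iAmCoordinator);
        // An app deployed on both nodes could use such a check to stay
        // inactive on every node except the one that should run it.

        channel.close();
    }
}
```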

Change a single-node Hadoop installation to a cluster, keeping persisted data

I'm going to do a Hadoop POC in a production environment. The POC consists of:
1. Receive lots of (real life) events
2. Accumulate them until the set of events reaches a sufficient size
3. Persist the set of events in a single file in HDFS
In case the POC is successful, I want to install a cluster environment, but I need to keep the data persisted in the single-node installation (the POC).
So the question is: how difficult is it to migrate data already persisted in single-node HDFS to a real HDFS cluster environment?
Thanks in advance (and sorry for my bad English)
Regards
You don't need to migrate anything.
If you're running Hadoop in pseudo-distributed mode, all you need to do is add DataNodes that point at your existing NameNode, and that's it!
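As a rough sketch, each new DataNode only needs its fs.defaultFS to point at the existing NameNode (hostname and port below are placeholders; copy the value from your current installation), after which you start the DataNode daemon on it:

```xml
<!-- core-site.xml on each new DataNode machine.
     Hostname/port are placeholders; use the same fs.defaultFS as your existing node. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
```

Then start the DataNode (hadoop-daemon.sh start datanode on Hadoop 2, hdfs --daemon start datanode on Hadoop 3) and it registers itself with the NameNode; optionally add the new host to the NameNode's slaves/workers file so cluster-wide start scripts pick it up.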
I would like to point out
Persist the set of events in a single file in HDFS
I'm not sure about making "a single file", but I suggest you do periodic checkpointing. What if the stream fails? How do you catch dropped events? Spark, Flume, Kafka Connect, NiFi, etc. can let you do this.
And if all you're doing is streaming events and want to store them for a variable time period, then Kafka is better suited to that use case. You don't necessarily need Hadoop. Push events to Kafka, and consume them where it makes sense, for example into a search engine or a database (Hadoop is not a database).
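To make that last point concrete, here is a minimal sketch of pushing events to Kafka from Java; the broker address, topic name, and payload are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {
    public static void main(String[] args) {
        // Broker list and topic are placeholders for your environment.
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each incoming event is appended to the "events" topic; downstream
            // consumers (an HDFS sink, a search engine, a database loader) read
            // it whenever and however it makes sense for them.
            producer.send(new ProducerRecord<>("events", "some-key", "{\"payload\":\"...\"}"));
        }
    }
}
```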

How to install Kafka in a Hadoop cluster

I want to install the latest release of Kafka on my Ubuntu Hadoop cluster, which contains 1 master node and 4 data nodes.
Here are my questions:
Should Kafka be installed on all the machines or only on the NameNode machine?
What about ZooKeeper? Should it be installed on all the machines or only on the NameNode machine?
Please also share the documentation required to install Kafka and ZooKeeper in a 5-node Hadoop cluster.
The architecture depends strictly on your requirements and on what you have: how powerful your machines are, how much data they need to process, how many consumers the Kafka instances need to feed, and so on. In theory you can have one Kafka instance and one ZooKeeper, but that won't be fault-tolerant: if either fails, you can lose data.
You can find more information about ZooKeeper multi-server clusters here.
What I would do first is try to analyze:
- how much data they need to process,
- how much data they need to "ingest",
- how powerful your machines are,
- how many consumers you are going to need,
- how reliable your machines are
These are just a few factors to consider before starting to build the infrastructure. A rough estimate based on "just" 5 machines, assuming they are all equally powerful and have a good amount of memory (e.g., 32 GB each), is that you need at least a couple of Kafka nodes and at least 3 machines for ZooKeeper (2N + 1), so that the ensemble can handle one of them failing.
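For the ZooKeeper part, a three-node ensemble is just the same zoo.cfg on each of the three machines; hostnames and paths below are placeholders:

```
# zoo.cfg (identical on zk1, zk2, zk3; each machine also needs a myid file
# in dataDir containing its own id: 1, 2 or 3)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

The Kafka brokers then point their zookeeper.connect setting at that ensemble.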

Need help regarding Storm

1) What happens if Nimbus fails? Can we convert some other node into a Nimbus?
2) Where is the output of a topology stored? When a bolt emits a tuple, where is it stored?
3) What happens if ZooKeeper fails?
Nimbus is itself a fault-tolerant process, which means it doesn't store its state in memory but in an external store (ZooKeeper). So if Nimbus crashes (an unlikely scenario), on the next start it will resume processing right where it stopped. Nimbus usually must be set up to be monitored by an external monitoring system, such as Monit, which will check the Nimbus process state periodically and restart it if any problem occurs. I suggest you read the Storm project's wiki for further information.
Nimbus is the master node of a Storm cluster, and it isn't possible to have multiple Nimbus nodes. (Update: the Storm community is now (as of 5/2014) actively working on making the Nimbus daemon fault-tolerant in a failover manner, by having multiple Nimbuses heartbeating each other.)
The tuple is "stored" in the tuple tree, and it is passed to the next bolt in the topology execution chain as topology execution progresses. As for physical storage, tuples are probably stored in an in-memory structure and serialized as necessary to be distributed among the cluster's nodes. The complete Storm cluster state itself is stored in ZooKeeper. Storm doesn't concern itself with persistent storage of a topology's or a bolt's output -- it is your job to persist the results of the processing.
As with Nimbus, ZooKeeper in a real, production Storm cluster must be configured for reliability, and for ZooKeeper that means having an odd number of ZooKeeper nodes running on different servers. You can find more information on configuring a ZooKeeper production cluster in the ZooKeeper Administrator's Guide. If ZooKeeper were to fail (although that is a highly unlikely scenario in a properly configured ZooKeeper cluster), the Storm cluster wouldn't be able to continue processing, since all of the cluster's state is stored in ZooKeeper.
Regarding question 1), this bug report and the subsequent comment from Storm author and maintainer Nathan Marz clarify the issue:
Storm is not designed for having topologies partially running. When you bring down the master, it is unable to reassign failed workers. We are working on Nimbus failover. Nimbus is fault-tolerant to the process restarting, which has made it fault-tolerant enough for our and most people's use cases.

Can I access ZooKeeper from an AWS Elastic MapReduce job

I'm new to Hadoop, and running under AWS Elastic MapReduce.
I need cluster-wide atomic counters in Hadoop, and it was suggested that I use ZooKeeper for this.
I believe ZooKeeper is part of the Hadoop stack (right?). How would I access it from an Elastic MapReduce job in order to set and update a cluster-wide counter?
I believe ZooKeeper is part of the Hadoop stack (right?)
ZooKeeper (ZK) is not part of the Hadoop stack. It's a Top Level Project (TLP) under Apache and is independent of Hadoop. So, first ZK has to be installed on EC2. Here are the instructions for that.
How would I access it from an Elastic MapReduce job in order to set and update a cluster-wide counter?
Once installed, ZK can be used to generate a cluster-wide counter using the ZK API. Here (1 and 2) are discussions of the approach with its pros and cons. Here are some alternatives to ZK for the same requirement.
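As an illustration of the ZK API approach (a sketch, not taken from the linked discussions), a cluster-wide counter can be kept in a single znode and incremented with ZooKeeper's versioned setData, retrying on conflicts; the znode path is a placeholder and error handling is minimal:

```java
import java.nio.ByteBuffer;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkCounter {
    private final ZooKeeper zk;
    private final String path;

    public ZkCounter(ZooKeeper zk, String path) throws Exception {
        this.zk = zk;
        this.path = path;  // e.g. "/event-counter" -- placeholder path
        try {
            zk.create(path, ByteBuffer.allocate(8).putLong(0L).array(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException ignore) {
            // counter znode already exists -- fine
        }
    }

    /** Atomically increments the counter using versioned setData (optimistic locking). */
    public long increment() throws Exception {
        while (true) {
            Stat stat = new Stat();
            long next = ByteBuffer.wrap(zk.getData(path, false, stat)).getLong() + 1;
            try {
                zk.setData(path, ByteBuffer.allocate(8).putLong(next).array(), stat.getVersion());
                return next;
            } catch (KeeperException.BadVersionException e) {
                // another client updated the counter concurrently; re-read and retry
            }
        }
    }
}
```

Apache Curator's DistributedAtomicLong recipe wraps essentially the same pattern if you'd rather not hand-roll it.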
You can, as Praveen Sripati answers.
But I want to clarify some points:
Keep in mind that ZK has a limited write rate (~300 requests per second).
Clients can see stale data (ZK doesn't guarantee read consistency across replicas).
I suggest using a dedicated sequence generator server, which will generate sequences for you (and this service can use ZK or whatever it wants). One example of such a service: https://github.com/kasabi/H1
