How to keep ZooKeeper running on EC2 - amazon-ec2

I installed ZooKeeper and Kafka on my EC2 instances, and they work well.
However, I'm wondering how to keep ZooKeeper running on EC2.
I think that if I build a real-time streaming service with Kafka, ZooKeeper has to stay running.
But it shuts down when I close the CLI session.
How can I keep ZooKeeper running on EC2? Is that possible?

Yes. Until ZooKeeper is fully removed from Kafka (targeted for late 2022), it stores important topic and broker information.
Ideally, you'd use a process supervisor like SystemD to (re)start, stop, and monitor the process. If you apt/yum install Confluent Platform, then it'll come with SystemD scripts for both Kafka and Zookeeper, so you wouldn't need to write your own.
And you'd use zookeeper-server-start -daemon zookeeper.properties to make it run in the background.
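As a rough sketch (paths and the service name here are assumptions; the Confluent packages ship their own units, so prefer those if you installed that way), a hand-rolled systemd unit could look like this:

    # /etc/systemd/system/zookeeper.service (illustrative; adjust paths to your install)
    [Unit]
    Description=Apache ZooKeeper
    After=network.target

    [Service]
    ExecStart=/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
    ExecStop=/opt/kafka/bin/zookeeper-server-stop.sh
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

    # then register and start it
    sudo systemctl daemon-reload
    sudo systemctl enable --now zookeeper

Once enabled, it survives reboots and systemd restarts the process if it dies, so closing your SSH session no longer stops ZooKeeper.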
Or you can just use Amazon MSK and not worry about the infrastructure.

Related

Use a shell script to consume a message from Kafka to check service availability. What is the best way to do this?

I have 100 servers, with Kafka deployed on some of the hosts, and the Kafka service is started by the kafka user. I want to use a shell script to find out which machines run Kafka and consume a message to ensure that the service is available.
"Find out the Kafka machine"
I'm not sure what this means.
"Ensure that the service is available"
Ideally, you'd use something like Nagios or Consul for this, not a consumer. Otherwise, kcat -L or kcat -C are two popular CLI ways to check or consume from brokers that don't require extra dependencies.
Beyond that, your commands wouldn't be checking a specific broker, only the cluster.
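For example, assuming a broker at broker1:9092 and a topic named my-topic (both placeholders), a quick smoke test could look like:

    # Print cluster metadata: brokers, topics, partitions, leaders
    kcat -b broker1:9092 -L

    # Read the last message from the topic and exit (consume check)
    kcat -b broker1:9092 -t my-topic -C -o -1 -c 1 -e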

Remote access to HDFS on Kubernetes

I am trying to set up HDFS on minikube (for now) and later on a DEV Kubernetes cluster so I can use it with Spark. I want Spark to run locally on my machine so I can run in debug mode during development, which means it needs access to my HDFS on K8s.
I have already set up 1 namenode deployment and a datanode statefulset (3 replicas) and those work fine when I am using HDFS from within the cluster. I am using a headless service for the datanodes and a cluster-ip service for the namenode.
The problem starts when I try to expose HDFS. I was thinking of using an Ingress for that, but that only exposes port 80 outside the cluster and maps paths to different services inside the cluster, which is not what I'm looking for. As far as I understand, my local Spark jobs (or HDFS client) talk to the namenode, which replies with an address for each block of data. That address, though, is something like 172.17.0.x:50010, and of course my local machine can't reach those.
Is there any way I make this work? Thanks in advance!
I know this question is about just getting it to run in a dev environment, but HDFS is very much a work in progress on K8s, so I wouldn't by any means run it in production (as of this writing). It's quite tricky to get it working on a container orchestration system because:
You are talking about a lot of data and a lot of nodes (namenodes/datanodes) that are not meant to start/stop in different places in your cluster.
You have the risk of having a constantly unbalanced cluster if you are not pinning your namenodes/datanodes to a K8s node (which defeats the purpose of having a container orchestration system)
If you run your namenodes in HA mode and for any reason they die and restart, you run the risk of corrupting the namenode metadata, which would make you lose all your data. It's also risky if you have a single namenode and you don't pin it to a K8s node.
You can't scale up and down easily without ending up with an unbalanced cluster. Running an unbalanced cluster defeats one of the main purposes of HDFS.
If you look at DC/OS they were able to make it work on their platform, so that may give you some guidance.
In K8s you basically need to create services for all your namenode ports and all your datanode ports. Your client needs to be able to find every namenode and datanode so that it can read/write from them. Also, some ports cannot go through an Ingress because they are layer 4 (TCP) ports, for example the IPC port 8020 on the namenode and 50020 on the datanodes.
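For a dev/minikube setup, one possible sketch (the deployment name and ports come from the question; the service name is made up) is to publish the namenode RPC port with a NodePort service and make the namenode hand back resolvable hostnames rather than pod IPs:

    # Expose the namenode RPC/IPC port outside the cluster via NodePort
    kubectl expose deployment namenode --name=namenode-external \
      --type=NodePort --port=8020 --target-port=8020

    # kubectl expose does not cover a StatefulSet's individual pods, so each
    # datanode usually needs its own Service for the data-transfer port (50010),
    # and hdfs-site.xml needs dfs.client.use.datanode.hostname=true so clients
    # resolve datanodes by name instead of the 172.17.0.x pod IPs.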
Hope it helps!

Storm-ZooKeeper transactional logs extremely large

I'm using a ZooKeeper cluster (3 machines) for my Storm cluster (4 machines). The problem is that, because of the topologies deployed on the Storm cluster, the ZooKeeper transactional logs grow extremely large and fill up the ZooKeeper disk. What is really strange is that those logs are not divided into multiple files; instead I have one big transactional file on every ZooKeeper machine, so the autopurge setting in my ZooKeeper configuration has no effect on those files.
Is there a way to solve this problem from the ZooKeeper side, or can I change the way Storm uses ZooKeeper to minimize the size of those logs?
Note: I'm using ZooKeeper 3.6.4 and Storm 0.9.6.
I was able to resolve this problem by using Pacemaker to process heartbeats from workers instead of ZooKeeper. That allowed me to avoid writing to the ZooKeeper disk to maintain consistency and to use an in-memory store instead. In order to use Pacemaker, I upgraded to Storm 1.0.2.
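For reference, a rough sketch of the relevant storm.yaml settings in Storm 1.x (the hostname is a placeholder; check the Pacemaker documentation for your exact version):

    # storm.yaml: send worker heartbeats to Pacemaker instead of ZooKeeper
    storm.cluster.state.store: "org.apache.storm.pacemaker.pacemaker_state_factory"
    pacemaker.host: "pacemaker-host.example.com"
    pacemaker.port: 6699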

Retry topology if Zookeeper stops working

The ZooKeeper used by Storm stopped working, and because of this the topologies stop working. Is there any mechanism so that ZooKeeper will start automatically?
You will have to define some supervision over ZooKeeper; try daemontools or Puppet.
What do you mean by "Zookeeper stopped working"? Did you set up ZooKeeper in reliable distributed mode? If yes, ZooKeeper should be available all the time and Storm topologies should keep running.
However, if one of your ZK nodes dies, you need to start up a new one manually.
See "Setup up a Zookeeper cluster" in https://storm.apache.org/documentation/Setting-up-a-Storm-cluster.html
See also https://storm.apache.org/documentation/images/storm-cluster.png from https://storm.apache.org/tutorial.html
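For completeness, a minimal replicated zoo.cfg for a three-node ensemble looks roughly like this (hostnames and dataDir are placeholders); each node additionally needs a myid file matching its server number:

    # zoo.cfg, identical on all three nodes
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888

    # on the node that is server.1:
    echo 1 > /var/lib/zookeeper/myid

With process supervision (systemd, daemontools, etc.) on each node, a crashed ZooKeeper process restarts on its own; only a permanently lost machine needs to be replaced manually.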

Running Hadoop/Storm tasks on Apache Marathon

I recently came across Apache Mesos and successfully deployed my Storm topology over Mesos.
I want to try running Storm topology/Hadoop jobs over Apache Marathon (had issues running Storm directly on Apache Mesos using mesos-storm framework).
I couldn't find any tutorial/article that lists the steps to launch Hadoop/Spark tasks from Apache Marathon.
It would be great if anyone could provide any help or information on this topic (possibly a JSON job definition for Marathon for launching a Storm/Hadoop job).
Thanks a lot
Thanks for your reply. I went ahead and deployed a Storm-Docker cluster on Apache Mesos with Marathon. For service discovery I used HAProxy. This setup allows services (Nimbus, ZooKeeper, etc.) to talk to each other via ports, so adding multiple instances of a service is not a problem: the cluster will find them by port and load-balance requests between all instances of a service. The following GitHub project has the Marathon recipes and Docker images: https://github.com/obaidsalikeen/storm-marathon
Marathon is intended for long-running services, so you could use it to start your JobTracker or Spark scheduler, but you're better off launching actual batch jobs like Hadoop/Spark tasks on a batch framework such as Chronos (https://github.com/airbnb/chronos). Marathon will restart tasks when they complete/fail, whereas Chronos (a distributed cron with dependencies) lets you set up scheduled jobs and complex workflows.
While a little outdated, the following tutorial gives a good example.
http://mesosphere.com/docs/tutorials/etl-pipelines-with-chronos-and-hadoop/
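Since the question asked for a JSON definition: purely as an illustration (the host, command path, and resource numbers are made up), a long-running service such as Storm Nimbus could be submitted to Marathon's REST API like this:

    # POST a minimal app definition to Marathon
    curl -X POST http://marathon.example.com:8080/v2/apps \
      -H 'Content-Type: application/json' \
      -d '{
            "id": "/storm-nimbus",
            "cmd": "/opt/storm/bin/storm nimbus",
            "cpus": 1,
            "mem": 2048,
            "instances": 1
          }'

Batch-style Hadoop/Spark jobs would instead be defined as scheduled jobs in Chronos, as described above.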
