Is there Docker orchestration for a Hadoop cluster?

I was looking at Rancher (an orchestration engine for Docker). I don't think it has built-in support for setting up Hadoop.

Take a look at the latest version of Rancher: it has a catalog function that includes Hadoop deployment out of the box. This is definitely in the 0.49 release of Rancher.

One source of information would be "Docker Releases Orchestration Tool Kit", which mentions Docker Machine, Docker Swarm and, more importantly, Mesosphere, which is built on top of the Swarm API:
Mesosphere’s technology is the only way for an organization to run a Docker Swarm workload in a highly elastic way on the same cluster as other types of workloads.
For example, you can run Cassandra, Kafka, Storm, Hadoop and Docker Swarm workloads alongside each other on a single Mesosphere cluster, all sharing the same resources.

Related

How to provision a Hadoop ecosystem cluster with OpenShift?

We are searching for a viable way to provision a Hadoop ecosystem cluster with OpenShift (based on Docker). We are looking to build up a cluster using the services of the Hadoop ecosystem, i.e. HDFS, YARN, Spark, Hive, HBase, ZooKeeper, etc.
My team has been using Hortonworks HDP on on-premise hardware but will now switch to an OpenShift-based infrastructure. Hortonworks Cloudbreak does not seem to be suitable for OpenShift-based infrastructures. I have found this article that describes the integration of YARN into OpenShift, but it seems no further information is available.
What is the easiest way to provision a Hadoop ecosystem cluster on OpenShift? Manually adding all the services feels error-prone and hard to administer. I have stumbled upon Docker images for these separate services, but that is not comparable to the automated provisioning you get with a platform like Hortonworks HDP. Any guidance is appreciated.
If you install OpenStack within OpenShift, Sahara allows provisioning of OpenStack Hadoop clusters.
Alternatively, Cloudbreak is Hortonworks' tool for provisioning container-based cloud deployments.
Both provide Ambari, giving you the same interface for cluster administration as HDP.
FWIW, I personally don't see the point of putting Hadoop in containers. Your DataNodes are locked to specific disks. There's no improvement in running several smaller ResourceManagers on a single host. Plus, with YARN, you'd be running containers within containers. And for the NameNode, you must have a replicated FsImage and EditLog, because the container could be placed on any host.
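To make the disk point concrete, here is a minimal, hypothetical sketch of pinning HDFS DataNodes to dedicated storage on Kubernetes/OpenShift via a StatefulSet; the image name, mount path, and storage size are placeholders, not a recommended production layout:

    kubectl apply -f - <<'EOF'
    # Hypothetical StatefulSet: each DataNode replica gets its own
    # persistent volume, so its blocks stay on one dedicated disk.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: hdfs-datanode
    spec:
      serviceName: hdfs-datanode
      replicas: 3
      selector:
        matchLabels:
          app: hdfs-datanode
      template:
        metadata:
          labels:
            app: hdfs-datanode
        spec:
          containers:
          - name: datanode
            image: example/hadoop-datanode:2.7.1   # placeholder image
            volumeMounts:
            - name: dfs-data
              mountPath: /hadoop/dfs/data          # dfs.datanode.data.dir
      volumeClaimTemplates:
      - metadata:
          name: dfs-data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi                       # placeholder size
    EOF

Note this only addresses the storage pinning that containerized DataNodes require; it does nothing for the NameNode HA and YARN-in-containers concerns above.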

Multi-node Hadoop in Kubernetes

I have already installed Minikube, the single-node Kubernetes cluster. I just want some help on how to deploy a multi-node Hadoop cluster inside this Kubernetes node. I need a starting point, please!
For clarification, do you want Hadoop to leverage Kubernetes components to run jobs, or do you just want it to run as a Kubernetes pod?
Unfortunately, I could not find an example of Hadoop built as a Kubernetes scheduler. You can probably still run it similarly to the Spark example.
Update: Spark now ships with better integration for Kubernetes. Information can be found here.
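As a sketch of that newer integration (Spark 2.3+): a job can be submitted straight at a Kubernetes API server, with executors launched as pods. The API server address and container image below are placeholders you must supply:

    # Submit the bundled SparkPi example to a Kubernetes cluster.
    # <api-server-host> and <your-spark-image> are placeholders.
    bin/spark-submit \
      --master k8s://https://<api-server-host>:6443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.container.image=<your-spark-image> \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar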

How to set up a POC environment with DC/OS, Kafka and ElasticSearch on two nodes with Docker Swarm or Kubernetes containers?

The instructions for installing Mesosphere DC/OS on AWS use a CloudFormation template whose minimum configuration indicates:
You have the option of 1 or 3 Mesos master nodes.
5 private Mesos agent nodes are the default.
1 public Mesos agent node is the default.
For our POC, so as not to incur too much up-front cost, is it possible to do this all with two nodes? One for DC/OS and the other containerized with ElasticSearch and Kafka?
If not, what would be a good configuration for this type of architecture?
DC/OS does not run on Docker Swarm or Kubernetes. But you can run a development docker-in-docker local deployment on Linux (or in a VM on Mac/Windows): dcos-docker.
You could then install ElasticSearch and Kafka on top of DC/OS (see the sketch below).
You could also use dcos-vagrant to run a multi-VM DC/OS local dev cluster.
Warning: if you need a VM, the current Vagrant v1.9.1 has a crippling CentOS network bug. dcos-vagrant includes a monkey-patch workaround; dcos-docker does not.
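Once DC/OS is up (whichever local option you choose), installing the two services is a couple of CLI calls. A minimal sketch, assuming the dcos CLI is installed and pointed at your cluster; package names come from the DC/OS Universe catalog and have changed over time, so verify them first:

    # Verify the current package names in the Universe catalog,
    # since they have changed over time (e.g. "elastic" vs "elasticsearch").
    dcos package search kafka
    dcos package search elastic

    # Install both services with their default configurations.
    dcos package install kafka
    dcos package install elastic

    # Watch the scheduler and broker/node tasks come up.
    dcos task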

Using Hadoop and Spark on Docker containers

I want to use big data analytics for my work. I have already implemented all the Docker parts, creating containers within containers. However, I am new to big data, and I have come to understand that using Hadoop for HDFS, with Spark instead of MapReduce on top, is the best approach for websites and applications when speed matters (is it?). Will this work in my Docker containers? It'd be very helpful if someone could direct me somewhere to learn more.
You can try playing with the Cloudera QuickStart Docker image to get started. Take a look at https://hub.docker.com/r/cloudera/quickstart/. It supports a single-node deployment of Cloudera's Hadoop platform along with Cloudera Manager, and it includes Spark as well.
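For reference, the run command from Cloudera's QuickStart documentation looked roughly like this (flags and published ports may vary by image version):

    # Start the single-node QuickStart container; the fixed hostname and
    # --privileged flag are required by the image. Port 8888 serves Hue;
    # add -p 7180:7180 if you enable Cloudera Manager.
    docker run --hostname=quickstart.cloudera --privileged=true \
      -t -i -p 8888:8888 \
      cloudera/quickstart /usr/bin/docker-quickstart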

Multi-node Hadoop cluster with Docker

I am in the planning phase of a multi-node Hadoop cluster in a Docker-based environment, so it should be based on a lightweight, easy-to-use virtualized system.
The current architecture (according to its documentation) contains 1 master and 3 slave nodes. The host machine uses the HDFS filesystem and KVM for virtualization.
The whole cloud is managed by Cloudera Manager. There are several Hadoop modules installed on this cluster. There is also a NodeJS data upload service.
This time I have to make the architecture Docker-based.
I have read several tutorials and formed some opinions, but I also have open questions.
A. What do you think, is https://github.com/Lewuathe/docker-hadoop-cluster a good base for my project? I have also found an official image, but it is single-node.
B. How will the system requirements change if I want to run all of this in a single container? That would be great, because this architecture should work in different locations, so changes could easily be transferred between them. Synchronization between these so-called clones would be important.
C. Do you have any other ideas, maybe some best practices?
As of September 2016 there is no quick answer.
https://github.com/Lewuathe/docker-hadoop-cluster does not seem like a good start, since a solution would need to be universal enough to cover your option B.
Keep an eye on https://github.com/sequenceiq/hadoop-docker and https://github.com/kiwenlau/hadoop-cluster-docker; a quick smoke test with the former is sketched below.
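While you evaluate those, the sequenceiq image gives a quick single-container smoke test. A sketch based on its README of the time (the tag may have moved on):

    # Start a single-node Hadoop container with an interactive shell;
    # /etc/bootstrap.sh starts HDFS and YARN inside the container.
    docker run -it sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash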
To address your question C, you may want to check out BlueData's software platform: http://www.bluedata.com/blog/2015/06/docker-containers-big-data-clusters
It's designed to run multi-node Hadoop clusters in a Docker-based environment and there is a free version available for download (you can also run it in an AWS EC2 instance).
Actually, this work has already been done for you:
https://hub.docker.com/r/cloudera/clusterdock/
It includes a pre-packaged multi-node CDH cluster, with Cloudera Manager as an optional component for cluster management and more.
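A sketch of bringing that up, following the Cloudera blog post that introduced clusterdock; the helper-script URL and the clusterdock_run wrapper are taken from that post and may have changed since:

    # Load the clusterdock helper functions into the current shell,
    # then start a pre-built multi-node CDH cluster.
    source /dev/stdin <<< "$(curl -sL http://tiny.cloudera.com/clusterdock.sh)"
    clusterdock_run ./bin/start_cluster cdh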
