Run a Hadoop cluster on Docker containers

I want to run a multi-node Hadoop cluster, with each node inside a Docker container on a different host. This image - https://github.com/sequenceiq/hadoop-docker - works well for starting Hadoop in pseudo-distributed mode. What is the easiest way to modify it so that each node runs in a different container on a separate EC2 host?

I did this with two containers running the master and slave nodes on two different Ubuntu hosts, using Weave for the networking between the containers. I installed Hadoop the same way it is installed across separate physical hosts. The two images, with the commands to run Hadoop on them, are on my Docker Hub account (div4); a rough sketch of the host wiring follows the links:
https://registry.hub.docker.com/u/div4/hadoop_master/
https://registry.hub.docker.com/u/div4/hadoop_slave/
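For reference, here is a minimal sketch of how the two hosts could be wired together, assuming a recent Weave Net CLI and that the images' default start commands bring up the Hadoop daemons (the exact run commands are on the Docker Hub pages above):

```bash
# On host A (master) -- sketch only, assuming Weave Net's CLI.
weave launch                       # start the Weave router on this host
eval $(weave env)                  # route docker commands through the Weave proxy
docker run -d --name hadoop-master div4/hadoop_master

# On host B (slave) -- peer with host A so both containers share one network.
weave launch <host-A-IP>           # placeholder: a reachable IP of host A
eval $(weave env)
docker run -d --name hadoop-slave div4/hadoop_slave
```

Once the routers are peered, the two containers can reach each other as if they were on a single network.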

The people at SequenceIQ have since created a new project called Cloudbreak that is designed to work with different cloud providers and spin up Hadoop clusters on them easily. You just enter your credentials and, as far as I can see, it then works the same way for every provider.
So for EC2 this is now probably the easiest solution (especially because of the nice GUI):
https://github.com/sequenceiq/cloudbreak-deployer

Related

Install Hadoop in OpenStack

I'm new to big data and I have a question about installing Hadoop.
Currently I use an image on VirtualBox, but I would like to create a cluster on OpenStack. At first I thought I would just need to instantiate a Hadoop image on OpenStack, or spin up several instances and use the Hadoop Docker image.
But then I found several examples using OpenStack Sahara. Given that I already have an OpenStack deployment shared with several people, is it possible to create a Hadoop cluster without going through OpenStack Sahara, or is that not recommended?
Not sure about "Sahara Openstack", but you can surely create Hadoop cluster using VM nodes on openstack.
Single node installation guide
http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#
Yes, it's possible to create a Hadoop cluster on an OpenStack cloud without using OpenStack Sahara. You can launch three virtual machines on OpenStack and assign floating IPs to them.
One can be used as the master and the other two as slaves. Follow the Hadoop multinode installation steps on these virtual machines and connect them using the SSH configuration described in any Hadoop multinode setup guide; a rough sketch of the key steps follows below.
You can also write an automated shell script for launching Hadoop on OpenStack.
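As a rough sketch of the key steps on those three VMs (user names, hostnames, and the port are placeholders; follow a complete multinode guide for the full configuration):

```bash
# Sketch only -- assumes Hadoop 2.x is unpacked at $HADOOP_HOME on every VM and
# that /etc/hosts (or DNS) maps master/slave1/slave2 to the floating IPs.

# 1. Passwordless SSH from the master to every node (including itself).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id hadoopuser@master          # placeholder user and hostnames
ssh-copy-id hadoopuser@slave1
ssh-copy-id hadoopuser@slave2

# 2. Point every node at the master in core-site.xml (copy the same file to the slaves).
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

# 3. List the worker nodes on the master (the file is named "slaves" in Hadoop 2.x).
printf "slave1\nslave2\n" > $HADOOP_HOME/etc/hadoop/slaves

# 4. Format HDFS once, then start the daemons from the master.
$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
```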

Is it possible to start a multi-physical-node Hadoop cluster using Docker?

I've been searching for a way to start Docker on multiple physical machines and connect the containers into a Hadoop cluster; so far I have only found ways to start a cluster locally on one machine. Is there a way to do this?
You can very well provision a multinode Hadoop cluster with Docker.
Please look at the posts below, which will give you some insight into doing it:
http://blog.sequenceiq.com/blog/2014/06/19/multinode-hadoop-cluster-on-docker/
Run a Hadoop cluster on Docker containers

Fabric scripts to install Hadoop on a cluster of machines?

I am beginning to install Hadoop on a cluster. I have SSH access to these machines and I have already installed Fabric on them. I was wondering if someone has already written a fabfile to install and deploy Hadoop to a cluster easily.
I found this project [0], but it is written for deploying on AWS instances. I was looking for something where I can just fill in the IPs of my machines and then execute a set of fab commands to bring up the cluster.
[0] http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#ec2-deployment-with-fabric-script
I'm AlexJF, the author of the scripts you linked.
The scripts you reference can also be used outside EC2. You just need to configure, as you requested, the list of hosts and other settings at the top of fabfile.py. Be sure to set EC2 = False (which happens to be the default).
You'll then have several useful commands available to you; a rough sketch of the workflow is below.
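For example, a rough sketch of the workflow (the task name below is a placeholder, not necessarily one the fabfile actually defines; run fab --list to see the real ones):

```bash
# Sketch only -- assumes Fabric 1.x, which fab-based guides of that era target.
pip install "fabric<2"

# Download fabfile.py from the guide linked above, then edit the settings block
# at the top of the file: keep EC2 = False and fill in the IPs of your machines.

fab --list        # show the tasks the fabfile actually provides
fab setup         # placeholder task name: run the cluster setup task it defines
```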

Multiple datanodes on a single machine in Hadoop 2.7.1

I am working with Hadoop HDFS 2.7.1. I have set up a single-node cluster with one datanode, but now I need to set up three datanodes on the same machine. I have tried various methods found on the internet but am unable to start a Hadoop cluster with three datanodes on the same machine. Please help.
You can run a multi-node cluster on a single machine using Docker containers. The team at SequenceIQ, a company that was recently acquired by Hortonworks, has even prepared Docker images that you can download; see here (a rough outline of the container wiring follows the link):
http://blog.sequenceiq.com/blog/2014/06/19/multinode-hadoop-cluster-on-docker/
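In outline, the approach on a single machine is one container per Hadoop node on a user-defined Docker network (the image names below are placeholders; the linked post has the actual images and start scripts):

```bash
# Sketch only: one container per node on a single machine, assuming a Docker
# version with user-defined networks. <master-image>/<slave-image> stand in for
# images like the ones in the linked post, configured to find the host "master".
docker network create hadoop-net

docker run -d --name master --hostname master --network hadoop-net <master-image>
docker run -d --name slave1 --hostname slave1 --network hadoop-net <slave-image>
docker run -d --name slave2 --hostname slave2 --network hadoop-net <slave-image>

# Containers on the same user-defined network resolve each other by name, so the
# datanode containers can register with the namenode container at "master".
```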

For a Docker-container-based implementation, does it make sense to run a Kafka server and a ZooKeeper server inside the same container?

I'm trying to implement a Kafka cluster consisting of three Kafka server nodes and three ZooKeeper server nodes using Docker containers. Which of the following is preferred, or if neither, what is the preferred way?
three Docker containers, each hosting a Kafka/ZooKeeper server pair
six Docker containers, three for the Kafka servers and three for the ZooKeeper servers
I'm asking this because it seems to me that a three-node ZooKeeper cluster only survives a single-node failure, whereas a three-node Kafka cluster could potentially survive a two-node failure (you may have to set the topic replication factor to 3). So is it better to run them in different containers, if creating new containers isn't too costly? Speaking of which, how costly is it to start a new Docker container?
If I am advised to run one server per container, is it preferable to build a tailored Docker image for every kind of server (in this case, one image for Kafka and another for ZooKeeper), or one unified image for all the different servers? I'm guessing it doesn't make sense to create two separate images just for Kafka and ZooKeeper, but what if I have all kinds of different clusters and servers, say Elasticsearch, to simulate? At what point would it start to make sense to create different Docker images within a single project?
If I had the time to do it, I would build two different images, one for Kafka and one for ZooKeeper, and I'd write a docker-compose file to launch the cluster; a rough sketch of such a file is below the link.
So six different containers.
https://docs.docker.com/compose/
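For illustration, a rough sketch of what such a compose file could look like with one server per container (the image names and environment variables are examples from commonly used public images, not a tested configuration; check the documentation of whichever images you actually pick):

```bash
# Sketch: write a minimal docker-compose.yml (3 ZooKeeper + 3 Kafka containers)
# and bring the cluster up. Image names and environment variables are assumptions.
cat > docker-compose.yml <<'EOF'
version: "2"
services:
  zoo1:
    image: zookeeper
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888
  zoo2:
    image: zookeeper
    environment:
      ZOO_MY_ID: 2
      ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888
  zoo3:
    image: zookeeper
    environment:
      ZOO_MY_ID: 3
      ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888

  kafka1:
    image: wurstmeister/kafka
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ADVERTISED_HOST_NAME: kafka1
      KAFKA_ZOOKEEPER_CONNECT: zoo1:2181,zoo2:2181,zoo3:2181
    depends_on: [zoo1, zoo2, zoo3]
  kafka2:
    image: wurstmeister/kafka
    environment:
      KAFKA_BROKER_ID: 2
      KAFKA_ADVERTISED_HOST_NAME: kafka2
      KAFKA_ZOOKEEPER_CONNECT: zoo1:2181,zoo2:2181,zoo3:2181
    depends_on: [zoo1, zoo2, zoo3]
  kafka3:
    image: wurstmeister/kafka
    environment:
      KAFKA_BROKER_ID: 3
      KAFKA_ADVERTISED_HOST_NAME: kafka3
      KAFKA_ZOOKEEPER_CONNECT: zoo1:2181,zoo2:2181,zoo3:2181
    depends_on: [zoo1, zoo2, zoo3]
EOF

docker-compose up -d
```

Keeping a separate image (or off-the-shelf image) per server type keeps each container small and lets you scale or replace the Kafka and ZooKeeper nodes independently.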
