I'm new to big data, and I have a question about installing Hadoop.
Currently I use an image on VirtualBox, but I would like to create a cluster on OpenStack. At first I thought I just needed to instantiate a Hadoop image on OpenStack, or launch several instances and use the Hadoop Docker image.
But then I found several examples using OpenStack Sahara. Given that I already share an OpenStack deployment with several other people, is it possible to create a Hadoop cluster without going through OpenStack Sahara? Or is that not recommended?
Not sure about "OpenStack Sahara", but you can certainly create a Hadoop cluster using VM nodes on OpenStack.
Single node installation guide
http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#
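In case that link goes stale: the core of a single-node (pseudo-distributed) setup is just two small config files plus formatting the NameNode. Below is a minimal sketch, assuming Hadoop 2.x is unpacked under /opt/hadoop, JAVA_HOME is set, and passwordless ssh to localhost already works; all paths are placeholders for your own install.

```
# Minimal single-node Hadoop config sketch (Hadoop 2.x assumed).
import os
import subprocess

HADOOP_HOME = "/opt/hadoop"                      # hypothetical install path
CONF_DIR = os.path.join(HADOOP_HOME, "etc/hadoop")

core_site = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
</configuration>"""

hdfs_site = """<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>"""

with open(os.path.join(CONF_DIR, "core-site.xml"), "w") as f:
    f.write(core_site)
with open(os.path.join(CONF_DIR, "hdfs-site.xml"), "w") as f:
    f.write(hdfs_site)

# Format the NameNode once, then bring up HDFS.
subprocess.run([os.path.join(HADOOP_HOME, "bin/hdfs"), "namenode", "-format", "-force"], check=True)
subprocess.run([os.path.join(HADOOP_HOME, "sbin/start-dfs.sh")], check=True)
```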
Yes, it's possible to create a Hadoop cluster on an OpenStack cloud without using OpenStack Sahara. You can launch 3 virtual machines on OpenStack and assign a floating IP to each of them.
One can be used as the master and the other 2 as slaves. You can follow the Hadoop multi-node installation steps on these virtual machines and connect them via the SSH configuration described in any Hadoop multi-node setup guide.
You can also write an automated shell script for launching Hadoop on OpenStack.
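As a rough sketch of that automation idea, here is the same thing using the openstacksdk Python client rather than a shell script. The cloud, image, flavor, network and keypair names are placeholders for whatever exists in your OpenStack project, not values from this answer.

```
# Sketch: launch one master and two slave VMs for a Hadoop cluster on OpenStack.
# Requires: pip install openstacksdk, plus a matching clouds.yaml entry.
import openstack

conn = openstack.connect(cloud="mycloud")        # hypothetical clouds.yaml entry

for name in ["hadoop-master", "hadoop-slave-1", "hadoop-slave-2"]:
    server = conn.create_server(
        name=name,
        image="ubuntu-20.04",                    # placeholder image name
        flavor="m1.large",                       # placeholder flavor
        network="private-net",                   # placeholder tenant network
        key_name="hadoop-key",                   # placeholder keypair
        wait=True,
        auto_ip=True,                            # allocate and attach a floating IP
    )
    print(name, server.public_v4)

# Next step: install Java + Hadoop on each VM and wire them together over SSH,
# following a Hadoop multi-node setup guide.
```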
Related
I've installed Hadoop via a Helm chart on my MicroK8s Kubernetes cluster.
I would like to know how to create a Dask cluster across my different machines on this Hadoop cluster. I tried following the tutorials on the Dask website, but I keep getting errors because it looks for a local YARN/Hadoop installation. How do I point to the Hadoop on Kubernetes so I can create the cluster?
If you want to launch Dask on YARN, we recommend using https://yarn.dask.org
However, if you are already using Kubernetes, you might consider https://kubernetes.dask.org, which is more commonly used today.
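For reference, a minimal sketch of both options; the package names come from the Dask projects linked above, but the exact arguments (the packaged environment, the worker pod spec) are assumptions, and the dask-kubernetes API has changed across versions, so treat this as illustrative only.

```
from dask.distributed import Client

# Option 1: Dask on YARN via dask-yarn ("environment.tar.gz" is a packaged
# Python environment, e.g. built with conda-pack).
from dask_yarn import YarnCluster
yarn_cluster = YarnCluster(environment="environment.tar.gz",
                           worker_vcores=2,
                           worker_memory="4GiB")
yarn_client = Client(yarn_cluster)

# Option 2: Dask on Kubernetes via dask-kubernetes ("worker-spec.yaml" is a
# pod spec for the Dask workers); usually simpler if you are on MicroK8s anyway.
from dask_kubernetes import KubeCluster
k8s_cluster = KubeCluster.from_yaml("worker-spec.yaml")
k8s_cluster.scale(3)
k8s_client = Client(k8s_cluster)
```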
We are searching for a viable way to provision a Hadoop ecosystem cluster with OpenShift (based on Docker). We want to build up a cluster using the services of the Hadoop ecosystem, i.e. HDFS, YARN, Spark, Hive, HBase, ZooKeeper, etc.
My team has been using Hortonworks HDP on on-premise hardware but will now switch to an OpenShift-based infrastructure. Hortonworks Cloudbreak does not seem to be suitable for OpenShift-based infrastructures. I have found an article that describes the integration of YARN into OpenShift, but it seems there is no further information available.
What is the easiest way to provision a Hadoop ecosystem cluster on OpenShift? Manually adding all the services feels error-prone and hard to administer. I have stumbled upon Docker images for the separate services, but that is not comparable to the automated provisioning you get with a platform like Hortonworks HDP. Any guidance is appreciated.
If you install OpenStack within OpenShift, Sahara allows provisioning of Hadoop clusters on OpenStack.
Alternatively, Cloudbreak is Hortonworks' tool for provisioning container-based cloud deployments.
Both provide Ambari, giving you the same interface for cluster administration as HDP.
FWIW, I personally don't see the point of putting Hadoop in containers. Your DataNodes are locked to specific disks, and there's no improvement in running several smaller ResourceManagers on a single host. Plus, for YARN, you'd be running containers within containers. And for the NameNode, you must have a replicated FsImage + EditLog, because the container could be placed on any system.
I've been searching for a way to start Docker containers on multiple physical machines and connect them into a Hadoop cluster; so far I have only found ways to start a cluster locally on one machine. Is there a way to do this?
You can very well provision a multi-node Hadoop cluster with Docker.
Please look at the posts below, which will give you some insight into doing it:
http://blog.sequenceiq.com/blog/2014/06/19/multinode-hadoop-cluster-on-docker/
Run a hadoop cluster on docker containers
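To sketch the multi-host idea with the Docker SDK for Python: have the machines join a Docker Swarm first (so an attachable overlay network can span hosts), then start the master container on one host and worker containers on the others. The image name and hostnames below are placeholders, not something taken from the linked posts.

```
# Sketch: attach Hadoop containers on different physical machines to one overlay
# network. Assumes the machines have already joined a Docker Swarm
# ("docker swarm init" / "docker swarm join") and that some Hadoop image exists.
import docker

# On the manager host: create an attachable overlay network once.
manager = docker.from_env()
manager.networks.create("hadoop-net", driver="overlay", attachable=True)

manager.containers.run(
    "myorg/hadoop",                  # placeholder image
    name="hadoop-master",
    hostname="hadoop-master",
    network="hadoop-net",
    detach=True,
)

# On each worker host (the overlay network is visible swarm-wide):
worker = docker.from_env()
worker.containers.run(
    "myorg/hadoop",                  # placeholder image
    name="hadoop-slave-1",
    hostname="hadoop-slave-1",
    network="hadoop-net",
    detach=True,
)
```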
I am working with Hadoop HDFS 2.7.1. I have set up a single-node cluster with one DataNode, but now I need to set up three DataNodes on the same machine. I have tried various methods available on the internet but am unable to start a Hadoop cluster with three DataNodes on the same machine. Please help me.
You can run a multi-node cluster on a single machine using Docker containers. The guys at SequenceIQ, a company that was recently acquired by Hortonworks, even prepared Docker images that you can download. See here:
http://blog.sequenceiq.com/blog/2014/06/19/multinode-hadoop-cluster-on-docker/
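If you would rather avoid Docker, a common alternative (not covered in the linked post) is to give each extra DataNode its own config directory with distinct ports and data directories, then start each instance against that config. Below is a rough sketch that generates the per-instance hdfs-site.xml overrides; the paths and port offsets are examples, you should copy the rest of your existing $HADOOP_CONF_DIR into each directory first, and each instance also needs its own HADOOP_PID_DIR/HADOOP_LOG_DIR.

```
# Sketch: generate config dirs for 3 DataNode instances on one machine by
# overriding the data dir and the three DataNode ports per instance.
import os

TEMPLATE = """<configuration>
  <property><name>dfs.datanode.data.dir</name><value>/hadoop/data/dn{n}</value></property>
  <property><name>dfs.datanode.address</name><value>0.0.0.0:{port}</value></property>
  <property><name>dfs.datanode.http.address</name><value>0.0.0.0:{http}</value></property>
  <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:{ipc}</value></property>
</configuration>"""

for n in range(1, 4):
    conf_dir = f"/hadoop/conf-dn{n}"                 # example location
    os.makedirs(conf_dir, exist_ok=True)
    with open(os.path.join(conf_dir, "hdfs-site.xml"), "w") as f:
        f.write(TEMPLATE.format(n=n, port=50010 + n * 100,
                                http=50075 + n * 100, ipc=50020 + n * 100))
    # Then start each instance with its own config, e.g.:
    print(f"hadoop-daemon.sh --config {conf_dir} start datanode")
```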
I want to run a multi-node Hadoop cluster, with each node inside a Docker container on a different host. This image - https://github.com/sequenceiq/hadoop-docker - works well for starting Hadoop in pseudo-distributed mode. What is the easiest way to modify this so that each node runs in a different container on a separate EC2 host?
I did this with two containers running master and slave nodes on two different Ubuntu hosts. I handled the networking between the containers using Weave. I have uploaded the container images to the Docker Hub account div4. I installed Hadoop the same way it is installed on separate physical hosts. The two images, with the commands to run Hadoop on them, are here:
https://registry.hub.docker.com/u/div4/hadoop_master/
https://registry.hub.docker.com/u/div4/hadoop_slave/.
The people from SequenceIQ have created a new project called Cloudbreak that is designed to work with different cloud providers and create Hadoop clusters on them easily. You just have to enter your credentials, and then it works the same way for all providers, as far as I can see.
So for EC2, this will now probably be the easiest solution (especially because of the nice GUI):
https://github.com/sequenceiq/cloudbreak-deployer