Dask on Hadoop Kubernetes - hadoop

I've installed Hadoop via a helm chart on my microk8s kubernetes cluster.
I would like to know how to create a Dask cluster across my different machines on this Hadoop cluster. I tried following the tutorials on the Dask websites, but I keep getting errors because Dask is looking for a local YARN/Hadoop installation. How do I point it to the Hadoop on Kubernetes so I can create the cluster?

If you want to launch Dask on Yarn we recommend using https://yarn.dask.org
However, if you are using Kubernetes already you might consider https://kubernetes.dask.org, which is more commonly used today.
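If you stay with YARN, one plausible cause of the "looking for the local yarn/hadoop" errors is that the client machine has no Hadoop configuration naming the remote cluster. Hadoop clients generally read `core-site.xml` and `yarn-site.xml` from the directory given by `HADOOP_CONF_DIR`. Below is a minimal sketch that writes those two files. The service hostnames and the NameNode port are assumptions for illustration; the real values depend on how your helm chart exposes its services, and whether this resolves the errors depends on that as well.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def write_hadoop_conf(conf_dir, namenode_host, resourcemanager_host, namenode_port=9000):
    """Write core-site.xml and yarn-site.xml into conf_dir and return conf_dir.

    namenode_port=9000 is an assumed default; check the value your chart uses.
    """
    files = {
        "core-site.xml": {
            "fs.defaultFS": f"hdfs://{namenode_host}:{namenode_port}",
        },
        "yarn-site.xml": {
            "yarn.resourcemanager.hostname": resourcemanager_host,
        },
    }
    for filename, props in files.items():
        with open(os.path.join(conf_dir, filename), "w") as fh:
            fh.write(render_site_xml(props))
    return conf_dir


if __name__ == "__main__":
    # Hypothetical in-cluster service names; replace with your own.
    conf_dir = write_hadoop_conf(
        tempfile.mkdtemp(),
        namenode_host="hadoop-hdfs-nn.default.svc.cluster.local",
        resourcemanager_host="hadoop-yarn-rm.default.svc.cluster.local",
    )
    # Hadoop clients started from this process will look here for config.
    os.environ["HADOOP_CONF_DIR"] = conf_dir
    print(conf_dir)
```

The in-cluster hostnames only resolve from inside Kubernetes, so the client would either need to run as a pod in the same cluster or use addresses that are reachable from outside it.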

Related

Install Hadoop in openstack

I'm new to big data, and I have a question about installing Hadoop.
Currently I use an image on VirtualBox, but I would like to create a cluster on OpenStack. At first I thought I just needed to instantiate a Hadoop image on OpenStack, or launch several instances and use the Hadoop Docker image.
But I found several examples that use OpenStack Sahara. Given that I already have an OpenStack deployment shared with several people, is it possible to create a Hadoop cluster without going through OpenStack Sahara? Or is that not recommended?
Not sure about "Sahara Openstack", but you can surely create a Hadoop cluster using VM nodes on OpenStack.
Single node installation guide
http://tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#
Yes, it's possible to create a Hadoop cluster on an OpenStack cloud without using OpenStack Sahara. You can launch 3 virtual machines on OpenStack and assign a floating IP to each of them.
One can be used as the master and the other 2 as slaves. You can follow the Hadoop multi-node installation steps on these virtual machines and connect them using the SSH configuration described in the Hadoop multi-node setup guide.
You can also write an automated shell script for launching Hadoop on OpenStack.
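As a rough illustration of what such automation has to produce for one master and two slaves, here is a minimal sketch that renders the `/etc/hosts` entries and the worker list. The hostnames and IP addresses are hypothetical. Hadoop 3 reads the worker list from `etc/hadoop/workers` (the file is called `slaves` in Hadoop 2).

```python
def render_cluster_files(master_ip, worker_ips):
    """Return (etc_hosts_text, workers_text) for a small Hadoop cluster.

    Hostnames follow a made-up convention: hadoop-master, hadoop-worker-N.
    """
    hosts = [(master_ip, "hadoop-master")]
    hosts += [
        (ip, f"hadoop-worker-{i}") for i, ip in enumerate(worker_ips, start=1)
    ]
    # Lines to append to /etc/hosts on every node so names resolve.
    etc_hosts = "\n".join(f"{ip}\t{name}" for ip, name in hosts) + "\n"
    # One worker hostname per line, as expected by the workers/slaves file.
    workers = "\n".join(name for _, name in hosts[1:]) + "\n"
    return etc_hosts, workers


if __name__ == "__main__":
    # Hypothetical private addresses of the three OpenStack instances.
    etc_hosts, workers = render_cluster_files(
        "10.0.0.10", ["10.0.0.11", "10.0.0.12"]
    )
    print(etc_hosts)
    print(workers)
```

The generated text still has to be copied to each VM, and passwordless SSH from the master to the other nodes has to be set up as the multi-node guide describes.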

How to provision a Hadoop ecosystem cluster with OpenShift?

We are searching for a viable way to provision a Hadoop ecosystem cluster with OpenShift (based on Docker). We want to build a cluster using the services of the Hadoop ecosystem, i.e. HDFS, YARN, Spark, Hive, HBase, ZooKeeper etc.
My team has been using Hortonworks HDP on on-premise hardware but will now switch to an OpenShift-based infrastructure. Hortonworks Cloudbreak does not seem to be suitable for OpenShift-based infrastructures. I have found this article that describes the integration of YARN into OpenShift, but there seems to be no further information available.
What is the easiest way to provision a Hadoop ecosystem cluster on OpenShift? Manually adding all the services feels error-prone and hard to administer. I have stumbled upon the Docker images of these separate services, but it is not comparable to the automated provisioning you get with a platform like Hortonworks HDP. Any guidance is appreciated.
If you install OpenStack within OpenShift, Sahara allows provisioning of OpenStack Hadoop clusters.
Alternatively, Cloudbreak is Hortonworks' tool for provisioning container-based cloud deployments.
Both provide Ambari, giving you the same interface for cluster administration as HDP.
FWIW, I personally don't see the reason for putting Hadoop in containers. Your datanodes are locked to specific disks. There's no improvement in running several smaller ResourceManagers on a single host. Plus, for YARN, you'd be running containers within containers. And for the namenode, you must have a replicated Fsimage + Editlog because the container could be placed on any system.

Multi-Node Hadoop in kubernetes

I have already installed minikube, the single-node Kubernetes cluster. I would like help deploying a multi-node Hadoop cluster inside this Kubernetes node. I need a starting point, please!
For clarification, do you want hadoop to leverage k8s components to run jobs or do you just want it to run as a k8s pod?
Unfortunately I could not find an example of hadoop built as a Kubernetes scheduler. You can probably still run it similar to the spark example.
Update: Spark now ships with better integration for Kubernetes. See the Spark documentation on running on Kubernetes for details.
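For the Spark route, a submission against a Kubernetes cluster uses a `k8s://` master URL and a container image setting. The sketch below only assembles the `spark-submit` command line so it can be inspected before running. The API server address, image name, and jar path are placeholders, not values from this thread.

```python
import shlex


def build_spark_submit(api_server, image, app_resource, main_class,
                       executors=2, name="spark-pi"):
    """Assemble a spark-submit argument list for cluster mode on Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_resource,
    ]


if __name__ == "__main__":
    cmd = build_spark_submit(
        api_server="https://192.168.49.2:8443",   # placeholder API server URL
        image="example/spark:latest",             # placeholder image name
        app_resource="local:///opt/spark/examples/jars/spark-examples.jar",  # placeholder path
        main_class="org.apache.spark.examples.SparkPi",
    )
    # Print a shell-safe version of the command rather than executing it.
    print(shlex.join(cmd))
```

The API server URL for a given cluster can be read from `kubectl cluster-info`. Note that this runs Spark jobs directly on Kubernetes and does not by itself provide HDFS or YARN.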

Is it possible to start a multi-physical-node Hadoop cluster using Docker?

I've been searching for a way to start Docker on multiple physical machines and connect them into a Hadoop cluster. So far I have only found ways to start a cluster locally on 1 machine. Is there a way to do this?
You can very well provision a multinode hadoop cluster with docker.
Please look at some posts below which will give you some insights on doing it:
http://blog.sequenceiq.com/blog/2014/06/19/multinode-hadoop-cluster-on-docker/
Run a hadoop cluster on docker containers

Running mahout using hadoop on Amazon's EMR/EC2

I want to migrate my current local Hadoop cluster to Amazon. In this cluster I am using services like Mahout, HBase, and Hive. I have two options on Amazon: either plain EC2 instances or an Elastic MapReduce cluster. I would like suggestions on which is the better option for moving a cluster with these requirements.
I always suggest that people go for EMR. It is managed and will cost a bit more than plain EC2, but running a managed service like EMR saves the headache and time you would otherwise spend configuring and managing the clusters.
Mahout can easily be run like a custom jar.
A Hive cluster can also be launched within minutes.
Similarly for HBase: Amazon has recently added support for creating HBase clusters on EMR.
See other views here.
