Using Hadoop and Spark on Docker containers - hadoop

I want to use Big Data Analytics for my work. I have already implemented all the docker stuff creating containers within containers. I am new to Big Data however and I have come to know that using Hadoop for HDFS and using Spark instead of MapReduce on Hadoop itself is the best way for websites and applications when speed matters (is it?). Will this work on my Docker containers? It'd be very helpful if someone could direct me somewhere to learn more.

You can try playing with Cloudera QuickStart Docker Image to get started. Please take a look at https://hub.docker.com/r/cloudera/quickstart/. This docker image supports single-node deployment of Cloudera's Hadoop platform, and Cloudera Manager. Also this docker image supports spark too.

Related

Install Druid on AWS EMR

Just started exploring Druid we dint find any blog on link to install Druid on AWS, Is there any chance to install Druid on AWS EMR ? If so if there are any per-defined Cloud Formation to execute it will be help full for my R&D on Druid.
its pretty straighforward to setup a basic single cluster druid
launch EMR with a single node master, like r3.4xlarge
download imply tar (comes with druid and pivot), https://docs.imply.io/on-prem/quickstart
tar -xzf imply-3.1.8.1.tar.gz cd imply-3.1.8.1
bin/supervise -c conf/supervise/quickstart.conf
If you are looking for a full cluster deploy, EMR is not the right tool.
If you know EKS / kubernetes, I think the easiest way to get started is using Helm
https://github.com/helm/charts/tree/master/incubator/druid
Other option is to look for Imply Cloud
They also solid documentation around Druid. Druid'd own documentation is pretty intense. I found imply to be better for beginners.
https://docs.imply.io/cloud/
Although for POC, a single r3.4xlarge or i3.4xlarge having some 200G storage is good enough
The most likely reason why you would not find much documentation is that the two things have a different nature.
Druid is meant to be long lived and statefull, where the EMR hadoop variant is meant to spin up and down in a more ephemeral manner. As such the combination is somewhat awkward.
Consider using a different hadoop distribution like HDP. Of course you can easily deploy it on AWS if needed, or on your own hardware if you want to minimize infra costs.
Disclaimer: I am an employee of Cloudera, the distributor of HDP, which is currently the most common hadoop platform under Druid.

How to provision a Hadoop ecosystem cluster with OpenShift?

We are searching a viable way for provisioning a Hadoop ecosystem cluster with OpenShift (based on Docker). We look to build up a cluster using the services of the Hadoop ecosystem, i.e. HDFS, YARN, Spark, Hive, HBase, ZooKeeper etc.
My team has been using Hortonworks HDP for on-premise hardware but will now switch into a OpenShift-based infrastructure. Hortonworks Cloudbreak seems not to be suitable for OpenShift-based infrastructures. I have found this article that describes the integration of YARN into OpenShift but it seems like there are no further information available.
What is the easiest way to provision a Hadoop ecosystem cluster on OpenShift? Manually adding all the services feels error-prone and hard to administer. I have stumbled upon the Docker images of these separate services, but it is not comparable to the automated provisioning you get with a platform like Hortonworks HDP. Any guidance is appreciated.
If you install Openstack within Openshift, Sahara allows provisioning of Openstack Hadoop clusters
Alternatively, Cloudbreak is Hortonwork's tool for provisioning container based cloud deployments
Both provides Ambari, allowing you the same interface for cluster administration as HDP.
FWIW, I personally don't find the reason for putting Hadoop in containers. Your datanodes are locked to specific disks. There's no improvement in running several smaller ResourceManagers on a single host. Plus, for YARN, you'd be running containers within containers. And for the namenode, you must have a replicated Fsimage + Editlog because the container could be placed on any system

Multi-node Hadoop cluster with Docker

I am in planning phase of a multi-node Hadoop cluster in a Docker based environment. So it should be based on a lightweight easy to use virtualized system.
Current architecture (regarding to documentation) contains 1 master and 3 slave nodes. This host machine uses HDFS filesystem and KVM for virtualization.
The whole cloud is managed by Cloudera Manager. There are several Hadoop modules installed on this cluster. There is also a NodeJS data upload service.
This time I should make architecture Docker based.
I have read several tutorials and have some opinions, but also open questions.
A. What do you think, is https://github.com/Lewuathe/docker-hadoop-cluster a good base for my project? I have found also an official image, but it is single-node.
B. How will system requirements change if I would like to make this in a single container? It would be great, because this architecture should work in different locations, so changes can be easily transferred between these locations. Synchronization between these so called clones would be important.
C. Do you have some other ideas, maybe best practices?
As of September 2016 there is no quick answer.
https://github.com/Lewuathe/docker-hadoop-cluster does not seem like a good start, as it should be universal for your B. option
Keep an eye on https://github.com/sequenceiq/hadoop-docker and https://github.com/kiwenlau/hadoop-cluster-docker
To address your question C., you may want to check out BlueData's software platform: http://www.bluedata.com/blog/2015/06/docker-containers-big-data-clusters
It's designed to run multi-node Hadoop clusters in a Docker-based environment and there is a free version available for download (you can also run it in an AWS EC2 instance).
This work has already been done for you, actually:
https://hub.docker.com/r/cloudera/clusterdock/
It includes a pre-packaged multi-node CDH cluster, with Cloudera Manager as an optional component for cluster management et al.

Did hortan sandbox can use as a single node Hadoop cluster

I like to study about Hadoop multinode setup and installation, by referring the above tutorial I understand that single node cluster environment can be used as node for the multinode cluster
http://bigdatahandler.com/hadoop-hdfs/hadoop-multi-node-cluster-setup/
Currently I am learning Hadoop using Horton sandbox, can we use a sandbox system as a single node environment?
If not what is the difference between sandbox and traditional Hadoop cluster installation
The sandbox images (from Hortonworks and Cloudera) provide the user with a pre-configured development environment with all the usual tools already available and installed (pig, hive etc.). Since the image is a single "system" it is set-up such that the hadoop cluster is single-node: i.e. everything - HDFS, Hadoop map-reduce etc. - is local to that image. That is a massive benefit, as anyone who has set up a hadoop cluster will tell you! It allows you to get up-and-running with very little operational overhead.
What these sandboxes do not provide, however, is realistic cluster behaviour as you have only one node. But there other possibilities - tools such as Vagrant and Docker - that would allow you to do this (I have not tried it myself).
The big data handler link you shared seems to be about combining several of these standalone, inherently single-node "clusters" so that you have something more realistic. But I would guess setting this up so that YARN, Zookeeper and other services are not duplicated comes with a not insignificant challenge.

How to dockerize individual apps inside Hadoop in multi tenant environment?

I have seen HortonWorks put the full Hadoop inside a docker that allows to install Hadoop in different environments. But how about the individual apps inside Hadoop that run on YARN? Especially in a multi-tenant environment, this would be useful.
Appreciate any thoughts on how to achieve this.

Resources