How to provision a Hadoop ecosystem cluster with OpenShift? - hadoop

We are searching a viable way for provisioning a Hadoop ecosystem cluster with OpenShift (based on Docker). We look to build up a cluster using the services of the Hadoop ecosystem, i.e. HDFS, YARN, Spark, Hive, HBase, ZooKeeper etc.
My team has been using Hortonworks HDP for on-premise hardware but will now switch into a OpenShift-based infrastructure. Hortonworks Cloudbreak seems not to be suitable for OpenShift-based infrastructures. I have found this article that describes the integration of YARN into OpenShift but it seems like there are no further information available.
What is the easiest way to provision a Hadoop ecosystem cluster on OpenShift? Manually adding all the services feels error-prone and hard to administer. I have stumbled upon the Docker images of these separate services, but it is not comparable to the automated provisioning you get with a platform like Hortonworks HDP. Any guidance is appreciated.

If you install Openstack within Openshift, Sahara allows provisioning of Openstack Hadoop clusters
Alternatively, Cloudbreak is Hortonwork's tool for provisioning container based cloud deployments
Both provides Ambari, allowing you the same interface for cluster administration as HDP.
FWIW, I personally don't find the reason for putting Hadoop in containers. Your datanodes are locked to specific disks. There's no improvement in running several smaller ResourceManagers on a single host. Plus, for YARN, you'd be running containers within containers. And for the namenode, you must have a replicated Fsimage + Editlog because the container could be placed on any system

Related

Problems applying AMBARI to existing system

I'm going to apply AMBARI to my system.
But my system already has hadoop.
How do I add existing Hadoop clusters to my new AMBARI environment
Sorry for my English.
Ambari can only manage clusters that it provisioned. Your pre-existing hadoop cluster was not provisioned with Ambari so it cannot be managed by Ambari.
Ambari is designed around a Stack concept where each stack consists of several services. A stack definition is what allows Ambari to install, manage and monitor the services in the cluster.
You can not do right now because already hadoop is installed in the system and you want to apply AMBARI over that for managing the hadoop cluster that's not possible.
Detailed description about the Apache Ambari :---
The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
Ambari enables System Administrators to:
Provision a Hadoop Cluster
Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
Ambari handles configuration of Hadoop services for the cluster.
Manage a Hadoop Cluster
Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
Monitor a Hadoop Cluster
Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
Ambari leverages Ambari Metrics System for metrics collection.
Ambari leverages Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc).

How to intergrate hadoop using ambari without HDP?

I have a hadoop cluster with apache hadoop 2.0.7.
I want to know how to integrate Ambari with the apache hadoop without the HDP(HortonWorks).
Actually, If I use HDP the solution is easy. but , I don't want to use the in my situation.
Do you have an any Idea?
Ambari relies on 'Stack' definitions to describe what services the Hadoop cluster consists of. Hortonworks defined a custom Ambari stack, its called HDP.
You could define your own stack and use any services and respective versions that you wanted. See the ambari wiki for more information about defining stacks and services.
That being said, I don't think it's possible to use your pre-existing installation of Hadoop with Ambari. Ambari is used to provision and manage hadoop clusters. It keeps track of the state of each of its stacks services, and the states of each services components. Since your cluster is already provisioned it would be difficult (maybe impossible) to add it to an Ambari instance.

AWS EMR Hadoop Administration

We are currently using Apache Hadoop (Vanilla Version) in our org. We are planning to migrate to AWS EMR. I'm trying to understand how AWS EMR Hadoop works internally (not how to use it), I'm mainly interested in Hadoop administration steps and how master and slave communicates and various configuration configurations. I already checked the AWS EMR documentation but I don't see detailed comparison.
Can someone recommend me a link/tutorial for migrating to AWS EMR from an Apache Hadoop.
During EMR cluster creation, it will ask you to specify Master and Node. a default settings will provision 1 master and two nodes for you. You can also specify what all applications you want to be in the cluster (e.g.: hadoop, hive, spark, zeppelin, hue, etc.).
Once the cluster is created, it will provision all the services. you can click on these services and access them via web, or using ssh into the master. For e.g: to access the ambari interface, go to the service within EMR and click it. a new window will be launched with the ambari monitoring service interface.
Installing these applications is very easy. all you have to do is specify all the services while cluster creation.
Amazon Elastic MapReduce uses a mostly standard implementation of Hadoop and associated tools.
See: AMI Versions Supported in Amazon EMR
The benefits of using EMR are in the automated deployment of instances. For example, launching a cluster with an appropriate AMI means that software is already loaded on each instance and HDFS is configured across the core nodes.
The Master and Slave (Core/Task) nodes communicate in exactly the normal way that they communicate in any Hadoop cluster. However, only one Master is supported (with no backup Master).
When migrating to EMR, check that you are using compatible versions of software (eg Hadoop, Hive, Pig, Impala, etc). Also consider using Amazon S3 for storage of data instead of HDFS, especially for storing source data, since data on S3 persists even after the EMR cluster is terminated.
Technically, Hadoop provided with EMR, can be few releases back. You should check EMR release notes for detailed application provided with each version. EMR takes care application provisioning, setup and configuration. Based on EC2 instance type, Hadoop (and other application configuration) will change. You can override default settings using configure application.
Other than this Hadoop you have on premises and EMR should be the same.

Did hortan sandbox can use as a single node Hadoop cluster

I like to study about Hadoop multinode setup and installation, by referring the above tutorial I understand that single node cluster environment can be used as node for the multinode cluster
http://bigdatahandler.com/hadoop-hdfs/hadoop-multi-node-cluster-setup/
Currently I am learning Hadoop using Horton sandbox, can we use a sandbox system as a single node environment?
If not what is the difference between sandbox and traditional Hadoop cluster installation
The sandbox images (from Hortonworks and Cloudera) provide the user with a pre-configured development environment with all the usual tools already available and installed (pig, hive etc.). Since the image is a single "system" it is set-up such that the hadoop cluster is single-node: i.e. everything - HDFS, Hadoop map-reduce etc. - is local to that image. That is a massive benefit, as anyone who has set up a hadoop cluster will tell you! It allows you to get up-and-running with very little operational overhead.
What these sandboxes do not provide, however, is realistic cluster behaviour as you have only one node. But there other possibilities - tools such as Vagrant and Docker - that would allow you to do this (I have not tried it myself).
The big data handler link you shared seems to be about combining several of these standalone, inherently single-node "clusters" so that you have something more realistic. But I would guess setting this up so that YARN, Zookeeper and other services are not duplicated comes with a not insignificant challenge.

How to deploy ambari for an existing hadoop cluster

As I mention in this title, can I skip the step of install hadoop cluster for that cluster already exist and which in service?
Ambari relies on 'Stack' definitions to describe what services the Hadoop cluster consists of. Hortonworks defined a custom Ambari stack, its called HDP.
You could define your own stack and use any services and respective versions that you wanted. See the ambari wiki for more information about defining stacks and services.
That being said, I don't think it's possible to use your pre-existing installation of Hadoop with Ambari. Ambari is used to provision and manage hadoop clusters. It keeps track of the state of each of its stacks services, and the states of each services components. Since your cluster is already provisioned it would be difficult (maybe impossible) to add it to an Ambari instance.
One of the minimum requierments of installing Ambari is removing the pre-existing installations of tools mentioned here.It is not mentioned to remove any pre-existing hadoop installation.

Resources