Where to install Java on multi-node hadoop cluster? - hadoop

In a multi-node Hadoop cluster with multiple slave nodes, one master node, and one client node, on which nodes does Java need to be installed?
Also, does Hadoop need to be installed only on the client node? I get confused by sites that say Java must be installed first but do not say on which node to install it.

Java is a prerequisite for running Hadoop. You need to install Java on all the machines, including the client.
As for the client configuration: the client machine does not need to run any Hadoop daemons. It only needs the Hadoop client binaries and the cluster's configuration so that it can communicate with the Hadoop cluster.
Check the links below for more:
Hadoop Client Node Configuration
https://pravinchavan.wordpress.com/2013/06/18/submitting-hadoop-job-from-client-machine/

Java is a prerequisite for running Hadoop. It should be installed on all master and slave nodes.
You can refer to the Hadoop multi-node cluster setup documentation for more details.

The JDK should be installed on all the nodes, as it is the primary requirement for Hadoop to work.
Make sure you install the same version of Java on all the nodes.
Oracle Java is preferred over OpenJDK.
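
To check the "same version on all the nodes" advice, something like the following Python sketch may help. It assumes you have already collected the text printed by `java -version` on each node (that command writes to stderr); how you collect it, for example over SSH, is left out. The function names and host names are hypothetical.

```python
import re
from collections import Counter


def parse_java_version(version_output):
    """Extract the quoted version string from `java -version` output."""
    match = re.search(r'version "([^"]+)"', version_output)
    if match is None:
        raise ValueError("no Java version found in output")
    return match.group(1)


def mismatched_nodes(outputs_by_host):
    """Return {host: version} for hosts that differ from the most common version."""
    versions = {
        host: parse_java_version(output)
        for host, output in outputs_by_host.items()
    }
    # Treat the version seen on the most hosts as the expected one.
    expected = Counter(versions.values()).most_common(1)[0][0]
    return {host: v for host, v in versions.items() if v != expected}


# Hypothetical sample data standing in for real per-node output.
sample = {
    "master": 'java version "1.8.0_181"',
    "slave1": 'java version "1.8.0_181"',
    "slave2": 'openjdk version "1.7.0_95"',
}
print(mismatched_nodes(sample))
```

An empty result means every node reported the same version string.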

Related

YARN-Specify Which Application to be Run on Which Nodemanager

I have a Hadoop YARN cluster with one ResourceManager and 6 NodeManagers. I want to run both Flink and Spark applications on the cluster, so I have two main questions about YARN:
In the case of Spark, should I install and configure Spark on the ResourceManager and on each NodeManager? When I submit a Spark application to YARN, does a Spark cluster (master and slaves) need to be running in addition to the YARN ResourceManager and NodeManagers?
Can I configure YARN so that Flink runs only on certain NodeManagers?
Thanks
For the first question, that depends on whether you're using a packaged Hadoop distribution (Cloudera CDH or Hortonworks HDP, for example) or not. The distros will likely take care of this. If you're not using a distribution, you need to decide whether you want to run Spark on YARN or Spark standalone. With Spark on YARN, no separate Spark master and slaves need to be running; YARN allocates the containers for the application.
For the second question, you can target specific NodeManagers if you are using the Capacity Scheduler with the node labels feature enabled, on Hadoop 2.6 or higher.
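
As a rough illustration of the node-label approach, the Python sketch below only builds the `yarn rmadmin` command lines that create a label and attach it to chosen NodeManagers; it does not run them. It assumes node labels are already enabled in yarn-site.xml (`yarn.node-labels.enabled`) and that the target queue has been given access to the label in the Capacity Scheduler configuration. The label name `flink`, the host names, and the helper name are hypothetical, and the exact options should be checked against the documentation for your Hadoop version.

```python
def node_label_commands(label, hosts):
    """Build `yarn rmadmin` invocations that create a label and assign it to hosts."""
    # -replaceLabelsOnNode takes a space-separated list of host=label pairs.
    mapping = " ".join("{}={}".format(host, label) for host in hosts)
    return [
        ["yarn", "rmadmin", "-addToClusterNodeLabels", label],
        ["yarn", "rmadmin", "-replaceLabelsOnNode", mapping],
    ]


# Hypothetical label and NodeManager host names.
for command in node_label_commands("flink", ["nm1", "nm2"]):
    print(" ".join(command))
```

Each inner list can be passed to `subprocess.run` on a machine with YARN admin rights once you have reviewed it.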

Apache ambari installation

I'm trying to install the Ambari server and agents, and I have a question about Ambari.
Whenever I install it, it ends up linked to Hortonworks.
I have my own Hadoop cluster on Ubuntu 16.0. Will Ambari only work with HDP, or is it possible to make it work with custom-built clusters as well?
If possible, please share detailed documentation.
It's not clear where you downloaded Ambari from, but it sounds like you used the Hortonworks version of it rather than the one from https://ambari.apache.org directly.
Ambari works with the concept of stacks. Each stack has a set of services and components. HDP is such a stack, but there are others, or you can even define your own, so yes, you can manage your own Hadoop installation components, but that really would be not much different from what Hortonworks already provides.
Besides, the HDP services and components have been tested to work together more thoroughly than an off-the-shelf Hadoop installation.
If you don't want HDP components, there is also the Apache Bigtop project, which provides installation packages for many Hadoop-related services.
Ambari expects Java and Hadoop to be installed in a certain way. I'm not sure how easy it is to set up for an existing Hadoop install.

Adding Apache NiFi to existing Hortonworks HDP Cluster

I have a 6-node-cluster running Hortonworks HDP 2.5.3 and Ambari 2.4.2.0
I want to install Apache NiFi on this cluster. When looking through the documentation, the following line jumped out at me:
1.1. Interoperability Requirements
You cannot install HDF on a system where HDP is already installed.
I wonder how I can install NiFi on my cluster. I would like to manage it with Ambari too, if possible.
Should I just go ahead, install the standalone version of NiFi, and change the port to something other than 8080, which is in use by Ambari? The problem is that I'd have to install it on every node, and this process is not automated.
Currently you can only install one stack into a given Ambari instance, and there is an HDP stack which does not include NiFi, and an HDF stack which includes NiFi, Kafka, Storm, and Ranger. So you need a second Ambari instance where you can install the HDF stack. You also can't share nodes between two Ambaris because there can only be one Ambari agent running on a node.
There might be enhancements in future Ambari releases to improve this situation, but for now if you are limited to using your 6 HDP nodes then you would have to install/manage NiFi manually using the RPM or TAR.
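
For the manual install, the port clash with Ambari mentioned in the question can be handled by editing `nifi.web.http.port` in NiFi's conf/nifi.properties on each node. A minimal Python sketch of that edit follows; it works on the file's text, the helper name is hypothetical, and 9090 is just an example port, not a recommendation.

```python
def set_property(properties_text, key, value):
    """Return properties text with `key` set to `value`, appending it if absent."""
    lines = []
    replaced = False
    for line in properties_text.splitlines():
        # Only touch the active assignment, not commented-out lines.
        if line.strip().startswith(key + "="):
            lines.append("{}={}".format(key, value))
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        lines.append("{}={}".format(key, value))
    return "\n".join(lines) + "\n"


# Hypothetical excerpt of conf/nifi.properties.
original = "nifi.web.http.host=\nnifi.web.http.port=8080\n"
print(set_property(original, "nifi.web.http.port", "9090"))
```

Reading the file, applying `set_property`, and writing it back can then be scripted per node, which partly addresses the lack of automation.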
As of HDP 2.6.1 it is possible to install HDF components on an HDP cluster. See https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_installing-hdf-and-hdp/content/ch_install-ambari.html
As of the latest HDP 3.0, you can add HDF 3.2 to the cluster and run NiFi alongside it.

Existing Cluster monitoring by Hortonworks Ambari

I have an existing 10-node cluster on RHEL 6.6 that was set up with plain Apache Hadoop configuration XMLs. Now I want to check the cluster status with Ambari. Is it possible to install Hortonworks Ambari only to monitor the cluster, without using it to install Hadoop?
No, Ambari must provision the cluster it's monitoring.
Ambari is designed around a Stack concept where each stack consists of several services. A stack definition is what allows Ambari to install, manage and monitor the services in the cluster.
In order to use Ambari with the Hadoop core that you built, you would have to provide your own Ambari stack definition.
Specifically, in your case, your existing Hadoop installation would not have the alerts.json descriptors that Ambari uses to provide alerts for any given service.

How to install Apache Spark on HortonWorks HDP 2.2 (built using Ambari)

I successfully built a 5 node cluster of HortonWorks HDP 2.2 using Ambari.
However I don't see Apache Spark in the installed services list.
I did some research and found that Ambari does not install certain components, such as Hue. (Spark was not in that list, but I guess it's not installed.)
How do I do a manual install of Apache spark on my 5 node HDP 2.2?
Or should I delete my cluster and perform a fresh install without using Ambari?
Hortonworks support for Spark is arriving but not fully complete (details and blog).
Instructions for how to integrate Spark with HDP can be found here.
You could build your own Ambari Stack for Spark. I recently did just that, but I cannot share that code :(
What I can do is share a tutorial I did on how to do any stack for Ambari, including Spark. There are many interesting issues with Spark that need to be addressed and are not covered through the tutorial. Anyways hope it helps. http://bit.ly/1HDBgS6
There is also a guide from the Ambari people here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=38571133.
1) Ambari 1.7x does not install Accumulo, Hue, Ranger, or Solr services for the HDP 2.2 Stack.
For installing Accumulo, Hue, Knox, Ranger, and Solr services, install HDP manually.
2) Apache Spark 1.2.0 on YARN with HDP 2.2: here.
3) Spark and Hadoop: Working Together:
Standalone deployment: With the standalone deployment one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MR. The user can then run arbitrary Spark jobs on her HDFS data. Its simplicity makes this the deployment of choice for many Hadoop 1.x users.
Hadoop Yarn deployment: Hadoop users who have already deployed or are planning to deploy Hadoop Yarn can simply run Spark on YARN without any pre-installation or administrative access required. This allows users to easily integrate Spark in their Hadoop stack and take advantage of the full power of Spark, as well as of other components running on top of Spark.
Spark In MapReduce : For the Hadoop users that are not running YARN yet, another option, in addition to the standalone deployment, is to use SIMR to launch Spark jobs inside MapReduce. With SIMR, users can start experimenting with Spark and use its shell within a couple of minutes after downloading it! This tremendously lowers the barrier of deployment, and lets virtually everyone play with Spark.
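
For the Hadoop YARN deployment option above, submitting an application might look roughly like the Python sketch below. It only builds the `spark-submit` command and environment; it assumes a recent Spark release where `--master yarn` with `--deploy-mode` is accepted (older 1.x releases used `yarn-cluster` and `yarn-client` as master values), and that `HADOOP_CONF_DIR` points at the cluster's client configuration. The function name, application path, and configuration directory are hypothetical.

```python
import os


def spark_on_yarn_command(app_path, hadoop_conf_dir, deploy_mode="cluster"):
    """Build the spark-submit command and environment for a YARN submission."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    # spark-submit locates the ResourceManager through the Hadoop client configuration.
    env = dict(os.environ, HADOOP_CONF_DIR=hadoop_conf_dir)
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode, app_path]
    return cmd, env


# Hypothetical application and configuration directory.
cmd, env = spark_on_yarn_command("app.py", "/etc/hadoop/conf")
print(" ".join(cmd))
```

The pair can be handed to `subprocess.run(cmd, env=env)` on the machine where Spark is unpacked; no Spark daemons need to be running on the cluster nodes for this mode.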

Resources