Spark clustering with yarn - hadoop

I would like to make spark clustering with yarn.
Do i need
installing hadoop master and slaves with yarn config?
installing hadoop master/slaves and yarn master/slaves separately?
If 1 is ok, I'm going to work with this docker image(link). Is it suitable for this?

Installing hadoop master and slave with yarn config is sufficient in order to run spark over yarn but then you also need to make sure that spark version you are downloading supports yarn. once installed spark should be able to access yarn configurations and required jar files related to yarn are also in path of spark.

Related

Install ambari on existing single node hadoop

I have single node hadoop setup on Ubuntu(personal dev env.).
Now want to install Ambari.
Question: Can we install Ambari on existing hadoop set up if Yes than assist me.
No, Ambari requires you first install Ambari agents, then Ambari server to monitor the agents + install / configure Hadoop.
In theory, if you installed Hadoop in the same way that Ambari expects, it might work, but adding any node into Ambari will throw lots of warnings if the node is running any extra process other than the agent

How to get HDFS and YARN version programmatically?

I'm writing a spark program that download different jars from maven based on the environment it runs on, each for a different version of Hadoop distribution (e.g. CDH, HDP, MapR).
This is necessary because some low-level APIs of HDFS and YARN are not shared between these distributions. However, I cannot find any public API of HDFS and YARN that tells their version.
Is it possible to do it only in Java? Or I have to run an external shell to know it?
In Java org.apache.hadoop.util.VersionInfo.getVersion() should work.
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/util/VersionInfo.html
For the CLIs, you can use:
$ hadoop version
$ hdfs version
$ yarn version

Building Oozie 4.2.0 with Spark on YARN support

What I am trying to achieve is to build and install Oozie 4.2.0 that will enable me to submit Spark jobs to a YARN cluster.
I build the distro by executing: oozie-4.2.0/bin/mkdistro.sh -Puber -Phadoop-2 -DskipTests. That created oozie-4.2.0-distro.tar.gz package and inside I can find oozie-4.2.0-sharelib.tar.gz. However, many tutorials online state that I should use oozie-4.2.0-sharelib-yarn.tar.gz in order to use YARN. Such a file is not contained in the distro package. How can I make the build process output the YARN version of sharelibs?
I tried to continue with the non-YARN version, but when submitting the example Spark job (and adjusting the HDFS and YARN addresses in job.properties along with master property from local[*] to yarn) I got an error:
Error: Could not load YARN classes. This copy of Spark may not have
been compiled with YARN support.
Oozie 4.2 does not include OOZIE-2271 that added the spark_yarn dependency to the sharelib when compiling against the hadoop-2 profile.
Try to build distro with Oozie 4.3. Alternatively, you can try to backport OOZIE-2271 and build Oozie yourself.
See spark-yarn_2.10 in this commit:
https://github.com/apache/oozie/commit/e6b5c95efb492a70087377db45524e06f803459e

How to find cdh version hadoop

When connecting to Hadoop cluster, how can I know which version of Hadoop this cluster is running? In particular this is important for proper configuration of libraries when compiling and packaging Hadoop Java jobs with Maven.
The simplest way if you have ssh access to hadoop node is by running command
$ hadoop version
If you are looking for CDH version then check /usr/lib/hadoop/cloudera/cdh_version.properties
In cdh, in the cluster I am using, there is not any cdh_version.properties (or I couldn't find it)
If your cluster uses "Parcels", you could check which version of cdh is used by doing:
/opt/cloudera/parcels
And you could see the version as the name of the folder:
CDH-5.5.1-1.cdh5.5.1.p0.11
Note: I know that this is a not a general rule for getting which cdh version is used. I am trying to show an alternative way that it worked to me.
We can check the installed version with the help of following command:
cat /usr/lib/hadoop/cloudera/cdh_version.properties
Hope this may help you.

Cloudera Installation Error I want to know can we cloudera manager for Hadoop single node Cluster on ubuntu?

I am using ubuntu 12.04 64bit, I installed and ran sample hadoop programs with single node successfully.
I am getting the following error while installing cloudera manager on my ubuntu
Refreshing repository metadata failed. See
/var/log/cloudera-manager-installer/2.refresh-repo.log for details.
Click OK to revert this installation.
I want to know can we install Cloudera for Hadoop's Single node cluster on ubuntu. Please response me that Is it possible to install cloudera manager for single node or not. Or else Am i want to create multiple nodes for using cloudera with my hadooop
Yes, CM can run in a single node.
This error is because CM can not use apt-get install to get the packages. Which tutorial do you follow?
However, you can manually add the cloudera repo. See this thread.

Resources