Building Oozie 4.2.0 with Spark on YARN support

I am trying to build and install Oozie 4.2.0 so that I can submit Spark jobs to a YARN cluster.
I built the distro by executing: oozie-4.2.0/bin/mkdistro.sh -Puber -Phadoop-2 -DskipTests. That created the oozie-4.2.0-distro.tar.gz package, and inside it I can find oozie-4.2.0-sharelib.tar.gz. However, many tutorials online state that I should use oozie-4.2.0-sharelib-yarn.tar.gz in order to use YARN. No such file is contained in the distro package. How can I make the build process output the YARN version of the sharelib?
I tried to continue with the non-YARN version, but when submitting the example Spark job (after adjusting the HDFS and YARN addresses in job.properties and changing the master property from local[*] to yarn) I got an error:
Error: Could not load YARN classes. This copy of Spark may not have
been compiled with YARN support.

Oozie 4.2 does not include OOZIE-2271, which added the spark-yarn dependency to the sharelib when compiling against the hadoop-2 profile.
Try building the distro from Oozie 4.3 instead. Alternatively, you can backport OOZIE-2271 and build Oozie yourself.
See spark-yarn_2.10 in this commit:
https://github.com/apache/oozie/commit/e6b5c95efb492a70087377db45524e06f803459e
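One way to tell whether a given build picked up the dependency is to list the sharelib archive and look for a spark-yarn jar. This is a minimal sketch, not part of the Oozie tooling: has_spark_yarn is a hypothetical helper, and it assumes the jar file name contains "spark-yarn", as the artifact name spark-yarn_2.10 in the commit suggests.

```shell
#!/bin/sh
# has_spark_yarn: succeed if the given sharelib tarball lists a spark-yarn jar.
# Assumption: the jar file name contains "spark-yarn"
# (for example spark-yarn_2.10-<version>.jar).
has_spark_yarn() {
    # List the archive members and look for a matching jar name.
    tar -tzf "$1" | grep 'spark-yarn.*\.jar$' >/dev/null
}

# Usage (archive name taken from the question; adjust to your build output):
# if has_spark_yarn oozie-4.2.0-sharelib.tar.gz; then
#     echo "spark-yarn jar present"
# else
#     echo "spark-yarn jar missing"
# fi
```

If the jar is missing, the sharelib was built without the change described above.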

Related

Error while installing Kylo-specific services for NiFi

I am trying to install Kylo 0.8.4.
After installing NiFi, there is a step to install the Kylo-specific components using the command:
sudo ./install-kylo-components.sh /opt /opt/kylo kylo kylo
but I am getting the following error:
Creating symlinks for NiFi version 1.4.0.jar compatible nars
ERROR: spark-submit not on path. Has spark been installed?
I do have Spark installed.
Any help would be appreciated.
The script calls which spark-submit to check whether Spark is available. If it is, the script uses spark-submit --version to determine the installed Spark version.
The error indicates that spark-submit is not on the system path. Can you run which spark-submit on the command line and check the result? It should print the full path to the spark-submit executable.
If spark-submit is not on the system path, you can fix that by adding the location of your Spark installation to the PATH variable in your .bash_profile file.
As a next step, you can also verify the installed version of Spark by running spark-submit --version.
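The same check can be reproduced by hand. This is a minimal sketch, assuming a POSIX shell. It uses command -v rather than which, so it is not the script's exact code. on_path is a hypothetical helper, and /opt/spark in the comments is a placeholder, not a path from the question.

```shell
#!/bin/sh
# on_path: succeed if the named command can be found through PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Usage:
# if on_path spark-submit; then
#     spark-submit --version
# else
#     echo "spark-submit not on PATH" >&2
#     # Assumption: Spark is installed under /opt/spark. Add a line like the
#     # following to ~/.bash_profile, using your real installation path:
#     # export PATH="$PATH:/opt/spark/bin"
# fi
```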

How to get HDFS and YARN version programmatically?

I'm writing a Spark program that downloads different JARs from Maven based on the environment it runs in, one for each Hadoop distribution (e.g. CDH, HDP, MapR).
This is necessary because some low-level APIs of HDFS and YARN are not shared between these distributions. However, I cannot find any public API of HDFS or YARN that reports its version.
Is it possible to do this in Java alone? Or do I have to run an external shell command to find out?
In Java, org.apache.hadoop.util.VersionInfo.getVersion() should work.
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/util/VersionInfo.html
For the CLIs, you can use:
$ hadoop version
$ hdfs version
$ yarn version
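If shelling out is acceptable, the version string can be taken from the first line of the CLI output. This is a sketch, assuming that line has the form "Hadoop <version>". hadoop_version_from_output is a hypothetical helper, and the sample string in the usage comment is illustrative input, not output captured from a real cluster.

```shell
#!/bin/sh
# hadoop_version_from_output: read `hadoop version` output on stdin and
# print the version token from the first line.
# Assumption: the first line looks like "Hadoop <version>".
hadoop_version_from_output() {
    awk 'NR == 1 && $1 == "Hadoop" { print $2 }'
}

# Usage:
# hadoop version | hadoop_version_from_output
# printf 'Hadoop 2.6.0-cdh5.5.1\n' | hadoop_version_from_output   # illustrative input
```

A vendor suffix in the version token, where present, could then be used to pick the matching JAR.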

Spark clustering with yarn

I would like to set up a Spark cluster with YARN.
Do I need to:
1. install Hadoop master and slaves with YARN config, or
2. install Hadoop master/slaves and YARN master/slaves separately?
If 1 is OK, I'm going to work with this Docker image (link). Is it suitable for this?
Installing Hadoop master and slaves with YARN configured is sufficient to run Spark on YARN, but you also need to make sure that the Spark build you download supports YARN. Once installed, Spark must be able to access the YARN configuration, and the YARN-related JAR files must be on Spark's classpath.
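Spark finds the cluster through the client-side Hadoop configuration directory named by HADOOP_CONF_DIR or YARN_CONF_DIR. A quick sanity check before submitting, as a sketch: has_yarn_conf is a hypothetical helper, and the /etc/hadoop/conf path and the placeholders in the usage comment are assumptions.

```shell
#!/bin/sh
# has_yarn_conf: succeed if the given directory contains yarn-site.xml.
has_yarn_conf() {
    [ -n "$1" ] && [ -f "$1/yarn-site.xml" ]
}

# Usage:
# export HADOOP_CONF_DIR=/etc/hadoop/conf   # assumption: adjust to your install
# if has_yarn_conf "$HADOOP_CONF_DIR"; then
#     spark-submit --master yarn --class <main-class> <application-jar>
# else
#     echo "yarn-site.xml not found in $HADOOP_CONF_DIR" >&2
# fi
```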

How to install Apache Phoenix on Ambari 1.7 with HBase?

I'm new to Hadoop. I want to install Phoenix with HBase, and I have installed a Hadoop cluster using Ambari 1.7 on Ubuntu. I'm not able to find any tutorial for this.
If you build up your own Hadoop stack:
https://phoenix.apache.org/download.html
https://phoenix.apache.org/installation.html
If you use, for example, IBM Open Platform (which is free, by the way):
https://developer.ibm.com/hadoop/blog/2015/10/21/installing-apache-phoenix-ibm-open-platform-apache-hadoop-4-1/
HBase should be available as a service under the Add Service button on the home page.
To install Phoenix, I used this link:
http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2-trunk/bk_installing_manually_book/content/upgrade-22-7-a.html
Basically, run yum install phoenix on each node and then create soft links to the Phoenix server JAR file.

How to find the CDH version of Hadoop

When connecting to a Hadoop cluster, how can I find out which version of Hadoop the cluster is running? This is particularly important for configuring libraries correctly when compiling and packaging Hadoop Java jobs with Maven.
If you have SSH access to a Hadoop node, the simplest way is to run the command
$ hadoop version
If you are looking for the CDH version, check /usr/lib/hadoop/cloudera/cdh_version.properties
On the CDH cluster I am using, there is no cdh_version.properties file (or I couldn't find it).
If your cluster uses "Parcels", you can check which version of CDH is in use by looking in:
/opt/cloudera/parcels
The version appears in the name of the folder:
CDH-5.5.1-1.cdh5.5.1.p0.11
Note: I know this is not a general rule for finding which CDH version is in use. I am just showing an alternative that worked for me.
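The version can be pulled out of such a folder name with plain parameter expansion. This is a sketch, assuming names of the form CDH-<version>-<build suffix>, as in the example above. cdh_version_from_parcel is a hypothetical helper.

```shell
#!/bin/sh
# cdh_version_from_parcel: print the CDH version embedded in a parcel folder name.
# Assumption: the name looks like CDH-<version>-<rest>.
cdh_version_from_parcel() {
    name=${1#CDH-}                  # drop the leading "CDH-"
    printf '%s\n' "${name%%-*}"     # keep everything before the next "-"
}

# Usage:
# cdh_version_from_parcel CDH-5.5.1-1.cdh5.5.1.p0.11   # prints 5.5.1
```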
We can check the installed version with the following command:
cat /usr/lib/hadoop/cloudera/cdh_version.properties
Hope this may help you.
