How to get HDFS and YARN version programmatically? - hadoop

I'm writing a Spark program that downloads different jars from Maven depending on the environment it runs on, one for each Hadoop distribution (e.g. CDH, HDP, MapR).
This is necessary because some low-level HDFS and YARN APIs are not shared between these distributions. However, I cannot find any public HDFS or YARN API that reports their version.
Is it possible to do this in Java alone, or do I have to run an external shell command to find out?

In Java, org.apache.hadoop.util.VersionInfo.getVersion() should work.
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/util/VersionInfo.html
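For the original question, a minimal sketch of how this looks; note that detecting the distribution from the version-string suffix (e.g. -cdh) is a vendor convention rather than a guaranteed API, so treat that part as an assumption (class name is illustrative):

import org.apache.hadoop.util.VersionInfo;

public class HadoopVersion {
    public static void main(String[] args) {
        // VersionInfo reads the build metadata baked into the Hadoop jars on the classpath
        String version = VersionInfo.getVersion();      // e.g. "2.0.0-cdh4.0.0"
        System.out.println("Hadoop version: " + version);
        System.out.println("Build details:  " + VersionInfo.getBuildVersion());

        // Vendor builds usually tag the version string; this is a convention, not an API
        if (version.contains("cdh")) {
            System.out.println("Looks like a CDH (Cloudera) build");
        } else if (version.contains("mapr")) {
            System.out.println("Looks like a MapR build");
        }
    }
}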
For the CLIs, you can use:
$ hadoop version
$ hdfs version
$ yarn version

Related

Spark clustering with yarn

I would like to set up Spark clustering with YARN.
Do I need to:
1. install Hadoop master and slaves with a YARN config, or
2. install Hadoop master/slaves and YARN master/slaves separately?
If 1 is enough, I'm going to work with this docker image (link). Is it suitable for this?
Installing a Hadoop master and slaves with a YARN config is sufficient to run Spark over YARN, but you also need to make sure that the Spark build you download supports YARN. Once installed, Spark should be able to access the YARN configuration, and the YARN-related jar files must also be on Spark's classpath.
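As a quick smoke test that Spark can talk to YARN, here is a minimal sketch; it assumes a Spark 2.x build with YARN support on the classpath and HADOOP_CONF_DIR or YARN_CONF_DIR pointing at the cluster config (on Spark 1.x the master string would be yarn-client rather than yarn; class and app names are illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class YarnSmokeTest {
    public static void main(String[] args) {
        // "yarn" as a master only resolves when HADOOP_CONF_DIR or YARN_CONF_DIR
        // points at the cluster's configuration directory
        SparkConf conf = new SparkConf().setAppName("yarn-smoke-test").setMaster("yarn");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            long n = sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
            System.out.println("count = " + n); // prints 4 if the YARN setup works
        }
    }
}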

Building Oozie 4.2.0 with Spark on YARN support

What I am trying to achieve is to build and install Oozie 4.2.0 that will enable me to submit Spark jobs to a YARN cluster.
I built the distro by executing: oozie-4.2.0/bin/mkdistro.sh -Puber -Phadoop-2 -DskipTests. That created the oozie-4.2.0-distro.tar.gz package, and inside it I can find oozie-4.2.0-sharelib.tar.gz. However, many tutorials online state that I should use oozie-4.2.0-sharelib-yarn.tar.gz in order to use YARN. No such file is contained in the distro package. How can I make the build process output the YARN version of the sharelibs?
I tried to continue with the non-YARN version, but when submitting the example Spark job (after adjusting the HDFS and YARN addresses in job.properties and changing the master property from local[*] to yarn) I got this error:
Error: Could not load YARN classes. This copy of Spark may not have
been compiled with YARN support.
Oozie 4.2 does not include OOZIE-2271, which added the spark_yarn dependency to the sharelib when compiling against the hadoop-2 profile.
Try building the distro with Oozie 4.3. Alternatively, you can backport OOZIE-2271 and build Oozie yourself.
See spark-yarn_2.10 in this commit:
https://github.com/apache/oozie/commit/e6b5c95efb492a70087377db45524e06f803459e
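The gist of the backport is adding the Spark-on-YARN artifact to the sharelib's Spark pom; a rough sketch of the dependency involved (the version property is illustrative — see the commit above for the exact change):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>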

How to find cdh version hadoop

When connecting to a Hadoop cluster, how can I tell which version of Hadoop the cluster is running? In particular, this is important for properly configuring libraries when compiling and packaging Hadoop Java jobs with Maven.
The simplest way, if you have ssh access to a Hadoop node, is to run the command
$ hadoop version
If you are looking for the CDH version, then check /usr/lib/hadoop/cloudera/cdh_version.properties
In the CDH cluster I am using, there is no cdh_version.properties (or I couldn't find it).
If your cluster uses "Parcels", you can check which CDH version is in use by looking in:
/opt/cloudera/parcels
The version appears as the name of the folder:
CDH-5.5.1-1.cdh5.5.1.p0.11
Note: I know this is not a general rule for determining which CDH version is in use. I am just showing an alternative that worked for me.
You can check the installed version with the following command:
cat /usr/lib/hadoop/cloudera/cdh_version.properties
Hope this helps.
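Since the question mentions Maven: once you know the exact version string, you can pin the matching client artifact against Cloudera's public repository; a hedged sketch (the version shown is just an example — match it to what the cluster reports):

<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.0.0-cdh4.0.0</version>  <!-- match the version reported by the cluster -->
</dependency>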

How to have lzo compression in hadoop mapreduce?

I want to use LZO to compress map output, but I can't get it to run. The version of Hadoop I'm using is 0.20.2. I set:
conf.set("mapred.compress.map.output", "true");
conf.set("mapred.map.output.compression.codec",
"org.apache.hadoop.io.compress.LzoCodec");
When I run the jar file in Hadoop, it throws an exception saying it cannot write the map output.
Do I have to install LZO?
What do I have to do to use LZO?
LZO's licence (GPL) is incompatible with Hadoop's (Apache), so it cannot be bundled with Hadoop. You need to install LZO separately on the cluster.
The following steps were tested on Cloudera's Demo VM (CentOS 6.2, x64), which comes with the full stack of CDH 4.2.0 and CM Free Edition installed, but they should work on any Red Hat-based Linux.
The installation consists of the following steps:
Installing LZO
sudo yum install lzop
sudo yum install lzo-devel
Installing ANT
sudo yum install ant ant-nodeps ant-junit java-devel
Downloading the source
git clone https://github.com/twitter/hadoop-lzo.git
Compiling Hadoop-LZO
ant compile-native tar
For further instructions and troubleshooting see https://github.com/twitter/hadoop-lzo
Copying the Hadoop-LZO jar to the Hadoop libs
sudo cp build/hadoop-lzo*.jar /usr/lib/hadoop/lib/
Moving native code to Hadoop native libs
sudo mv build/hadoop-lzo-0.4.17-SNAPSHOT/lib/native/Linux-amd64-64/ /usr/lib/hadoop/lib/native/
cp /usr/lib/hadoop/lib/native/Linux-amd64-64/libgplcompression.* /usr/lib/hadoop/lib/native/
Correct the version number to match the version you cloned
When working with a real cluster (as opposed to a pseudo-cluster), you need to rsync these files to the rest of the machines:
rsync /usr/lib/hadoop/lib/ to all hosts.
You can dry-run this first with -n
Login to Cloudera Manager
Select from Services: mapreduce1->Configuration
Client->Compression
Add to Compression Codecs:
com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec
Search "valve"
Add to MapReduce Service Configuration Safety Valve
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
mapred.child.env="JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64/"
Add to MapReduce Service Environment Safety Valve
HADOOP_CLASSPATH=/usr/lib/hadoop/lib/*
That's it.
Your MapReduce jobs that use TextInputFormat should work seamlessly with .lzo files. However, if you choose to index the LZO files to make them splittable (using com.hadoop.compression.lzo.DistributedLzoIndexer), you will find that the indexer writes a .index file next to each .lzo file. This is a problem because TextInputFormat will interpret these as part of the input. In this case you need to change your MR jobs to use LzoTextInputFormat.
As for Hive, as long as you don't index the LZO files, the change is likewise transparent. If you start indexing (to take advantage of better data distribution), you will need to update the input format to LzoTextInputFormat. If you use partitions, you can do this per partition.
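With hadoop-lzo installed as above, the snippet from the question works once the codec class is corrected; a minimal sketch using the MRv1-era property names from the question (newer releases spell them mapreduce.map.output.compress and mapreduce.map.output.compress.codec; class and job names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LzoJobSetup {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // The codec class lives in com.hadoop.compression.lzo (from hadoop-lzo),
        // not in org.apache.hadoop.io.compress as the question assumed
        conf.setBoolean("mapred.compress.map.output", true);
        conf.set("mapred.map.output.compression.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        Job job = new Job(conf, "lzo-example");
        // Only needed if the .lzo inputs have been indexed for splitting:
        // job.setInputFormatClass(com.hadoop.mapreduce.LzoTextInputFormat.class);
    }
}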

How to start hadoop in a certain version

We are using hadoop-2.0.0-cdh4.0.0, and we start the namenode using hadoop namenode. How do I start the Hadoop processes in either 0.20 mode or 0.23 mode?
You'll need to install and configure those versions of Hadoop if you want to use them - there is no 'switch' that allows 2.0.0 (or any other Hadoop version) to run in a lower version / mode.
