Apache hadoop-tools installation in Cloudera

I have a Cloudera 5.14 development environment. I want to install Apache hadoop-tools (link) in the Cloudera distribution.
Specifically, I need hadoop-resourceestimator (link).
There is no documentation available on how to install it.
Any leads will be highly appreciated.

AFAIK CDH 5.14.x is based on the old Hadoop 2.6.0, which does not have the resourceestimator tool.
It is available, but not supported, in CDH 6 ("not supported" is not the same as "not available"). You can find resourceestimator in the CDH 6.x distribution:
-rw-r--r-- 1 root root 71105 Dec 6 03:13 /opt/cloudera/parcels/CDH/jars/hadoop-resourceestimator-3.0.0-cdh6.0.x-SNAPSHOT.jar
and you're free to use it, but Cloudera Support won't provide any help.
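If you want to try it anyway, the jar can be launched through the generic hadoop CLASSNAME runner. A minimal sketch, assuming the entry-point class from the upstream Hadoop 3.x resourceestimator (org.apache.hadoop.resourceestimator.service.ResourceEstimatorServer) is also present in the CDH build; adjust the jar path to your parcel version:
# Hedged sketch: the entry-point class and jar path are assumptions, not Cloudera-documented.
$ export HADOOP_CLASSPATH=/opt/cloudera/parcels/CDH/jars/hadoop-resourceestimator-3.0.0-cdh6.0.x-SNAPSHOT.jar
$ hadoop org.apache.hadoop.resourceestimator.service.ResourceEstimatorServer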

Related

java.lang.NoSuchMethodError: org.apache.hive.common.util.ShutdownHookManager.addShutdownHook

I'm trying to build a cube on Kylin with Spark as the engine type. The cluster contains the following tools:
OS image: 1.0-debian9
Apache Spark 2.4.4 (changed from 1.6.2)
Apache Hadoop 2.7.4
Apache Hive 1.2.1
I'm getting this error while building a cube:
java.lang.NoSuchMethodError: org.apache.hive.common.util.ShutdownHookManager.addShutdownHook(Ljava/lang/Runnable;)V
at org.apache.hive.hcatalog.common.HiveClientCache.createShutdownHook(HiveClientCache.java:221)
at org.apache.hive.hcatalog.common.HiveClientCache.<init>(HiveClientCache.java:153)
at org.apache.hive.hcatalog.common.HiveClientCache.<init>(HiveClientCache.java:97)
at org.apache.hive.hcatalog.common.HCatUtil.getHiveMetastoreClient(HCatUtil.java:553)
at org.apache.hive.hcatalog.mapreduce.InitializeInput.getInputJobInfo(InitializeInput.java:104)
at org.apache.hive.hcatalog.mapreduce.InitializeInput.setInput(InitializeInput.java:88)
at org.apache.hive.hcatalog.mapreduce.HCatInputFormat.setInput(HCatInputFormat.java:95)
at org.apache.hive.hcatalog.mapreduce.HCatInputFormat.setInput(HCatInputFormat.java:51)
at org.apache.kylin.source.hive.HiveMRInput$HiveTableInputFormat.configureJob(HiveMRInput.java:80)
at org.apache.kylin.engine.mr.steps.FactDistinctColumnsJob.setupMapper(FactDistinctColumnsJob.java:126)
at org.apache.kylin.engine.mr.steps.FactDistinctColumnsJob.run(FactDistinctColumnsJob.java:104)
at org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:131)
at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I checked the Hive and Hadoop library jar directories to see if there are any redundant jars, and I found two versions of every type of jar, for example hive-common-1.2.1.jar and hive-common.jar.
I tried moving either of them to a different location and resuming the cube build, but I got the same error. Any help on this would be greatly appreciated.
This is not a supported use case for Dataproc: if you need Spark 2.4.4, you should use Dataproc 1.4 or 1.5 instead of Dataproc 1.0, which comes with Spark 1.6.2.
Aside from this, the ShutdownHookManager.addShutdownHook(Ljava/lang/Runnable;)V method was added in Hive 2.3.0, but Spark uses a fork of Hive 1.2.1; that's why you need a Kylin version that supports Hive 1.2.1.
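To see which addShutdownHook signatures your hive-common jar actually exposes, you can disassemble the class with javap. A quick check, assuming the jar sits in /usr/lib/hive/lib:
# List the compiled addShutdownHook overloads; the jar path is an assumption.
$ javap -classpath /usr/lib/hive/lib/hive-common.jar \
    org.apache.hive.common.util.ShutdownHookManager | grep addShutdownHook
If the one-argument (Ljava/lang/Runnable;)V overload is missing from the output, you are on the pre-2.3.0 Hive fork.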
Regarding the duplicate jars, the versionless hive-common.jar is not a duplicate; it's a symbolic link to the versioned jar. You can verify this by listing it:
$ ls -al /usr/lib/hive/lib/hive-common.jar
lrwxrwxrwx 1 root root 21 Nov 9 09:20 /usr/lib/hive/lib/hive-common.jar -> hive-common-2.3.6.jar
I changed the Hive version to 2.1.0 and it worked for me. I settled on this Hive version by checking the Kylin download page and looking at what other cloud platforms, such as AWS EMR and Microsoft Azure HDInsight, ship for the Kylin 2.6.4 release.
Thanks, @Igor Dvorzhak, for your valuable suggestions.

How to check the hadoop distribution used in my cluster?

How can I tell whether my cluster has been set up using Hortonworks, Cloudera, or a plain installation of the Hadoop components?
Also, how can I find the port numbers of the various services?
It is difficult to identify the Hadoop distribution from port numbers, since the Apache, Hortonworks, and Cloudera distros use different port numbers.
One option is to check for cluster management service agents: Cloudera Manager's agent startup script is /etc/init.d/cloudera-scm-agent, and the Hortonworks Ambari agent startup script is /etc/init.d/ambari-agent. A vanilla Apache Hadoop server will not have any such agents.
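A quick way to test for those agent scripts (a sketch, using the paths above):
$ ls /etc/init.d/ | grep -E 'cloudera-scm-agent|ambari-agent'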
Another option is to check the Hadoop classpath; the command below prints it:
`hadoop classpath`
Most Hadoop distributions include the distro name in the classpath. If the classpath doesn't contain either of the keywords below, the setup is a plain Apache installation (a small check script follows the list):
hdp - (Hortonworks)
cdh - (Cloudera)
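Putting that together, a small sketch that classifies the installation from the classpath:
# Hedged sketch: classify the distribution by grepping the classpath for distro markers.
if hadoop classpath | grep -q hdp; then
    echo "Hortonworks (HDP)"
elif hadoop classpath | grep -q cdh; then
    echo "Cloudera (CDH)"
else
    echo "Apache / plain installation"
fi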
The simplest way is to run the hadoop version command; the output shows which version of Hadoop you have and also which distribution and distribution version you are running. If you find words like cdh or hdp, cdh stands for Cloudera and hdp for Hortonworks.
For example, I am running Cloudera, and the hadoop version command shows it: the first line gives the Hadoop version followed by the distribution and its version.
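A representative run on a CDH cluster looks like the following; the exact build numbers are an assumption and will differ on your cluster:
$ hadoop version
Hadoop 2.6.0-cdh5.14.2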
Hope this will help.
The hdfs version command will also give you the version of Hadoop and its distribution.

Can open-source HBase work on the Cloudera distribution of Hadoop?

I have a Cloudera distribution installed as a 5-node cluster. I do not want to use the HBase parcel that comes with Cloudera;
instead, I want to use only HDFS from the Cloudera setup and an open-source version of HBase.
So my question is: will this work, or will I have to install a plain open-source version of Apache Hadoop for HDFS and then run the open-source version of Apache HBase on top of it?
As long as the Hadoop version of your cluster matches the Hadoop client version that your HBase release was built against, it should all work.
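One quick way to compare the two (a sketch, assuming a tarball HBase install with HBASE_HOME set) is to look at the Hadoop client jars HBase ships and compare them with the cluster's Hadoop version:
$ hadoop version | head -1
$ ls $HBASE_HOME/lib/hadoop-common-*.jar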

Apache Sqoop - addtowar.sh not found

I just downloaded the Sqoop installation file sqoop-1.99.3-bin-hadoop100.tar.gz, but I am not able to find the file addtowar.sh in it. I am following the installation instructions from here: https://sqoop.apache.org/docs/1.99.1/Installation.html. The following is the listing of the bin directory.
hduser@system:~/sqoop-1.99.3-bin-hadoop100/bin$ ls -ltr
total 8
-rwxr-xr-x 1 hduser2 hadoop 1361 Oct 18 2013 sqoop-sys.sh
-rwxr-xr-x 1 hduser2 hadoop 3439 Oct 18 2013 sqoop.sh
Am I missing something here, or are the installation instructions not updated properly?
You should refer to the docs for the version you are using.
For 1.99.3, refer to the link below:
http://sqoop.apache.org/docs/1.99.3/Installation.html
I don't have a direct answer, but I have been tracking this down, and it seems addtowar.sh has been removed (I am also using 1.99.3) in favor of adding the Hadoop jar directories to the common.loader line in catalina.properties. However, I cannot get this to work.
Definitely follow the 1.99.3 documentation:
http://sqoop.apache.org/docs/1.99.3/Installation.html
But that documentation fails to mention that you need to add all of the Hadoop libraries to the common.loader variable in catalina.properties.
To get the sqoop client working, I had to add the following to catalina.properties:
common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/Users/bone/tools/hadoop/share/hadoop/common/*.jar,/Users/bone/tools/hadoop/share/hadoop/yarn/lib/*.jar,/Users/bone/tools/hadoop/share/hadoop/mapreduce/*.jar,/Users/bone/tools/hadoop/share/hadoop/tools/lib/*.jar,/Users/bone/tools/hadoop/share/hadoop/common/lib/*.jar
In my case, /Users/bone/tools/hadoop was a complete install of hadoop-2.4.0.
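After editing catalina.properties, restarting the server and opening a client shell is a quick way to confirm the classpath change took effect; a sketch, assuming the 1.99.3 tarball layout:
$ bin/sqoop.sh server start
$ bin/sqoop.sh client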

Install Hue without Cloudera

Has anyone tried/succeeded in installing Hue on Hadoop without Cloudera?
I have gotten to a point where I can reliably reproduce a Hadoop cluster with HBase and Hive and can set it all up in about 15 minutes. I'd love to have Hue along with all this without having to go back and redo my setup with Cloudera.
Check out slides #19 and #5: Hue is getting everywhere and is compatible with Hadoop 0.20 / 1.2.0 / 2.2.0: http://gethue.com/hue-goes-to-paris-hug-france/
Hue has tarball releases that you are free to install. You can also simply clone the source code (Hue is open source and Apache licensed) from GitHub: https://github.com/cloudera/hue and build the branch you want.
There is upstream documentation, as well as a CDH-specific version.
Hue is also packaged in BigTop (and so is based on vanilla Hadoop).
Hue is a web server (Django-based) that acts as a view on top of Hadoop, so Hue just needs to be installed and then configured by adding the hosts of the NameNode, JobTracker, ResourceManager, Oozie, HiveServer, etc. to its hue.ini.
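For example, the HDFS and YARN pointers live under the [hadoop] section of hue.ini. A minimal sketch, where the hostnames and ports are assumptions for a pseudo-distributed setup:
[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      # Assumed NameNode endpoints; match them to your core-site.xml and hdfs-site.xml.
      fs_defaultfs=hdfs://localhost:8020
      webhdfs_url=http://localhost:50070/webhdfs/v1
  [[yarn_clusters]]
    [[[default]]]
      # Assumed ResourceManager endpoints.
      resourcemanager_host=localhost
      resourcemanager_port=8032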
Also, as detailed on gethue.com/releases, the version you need might depend on your Hive version.
Note that without Cloudera's distribution your mileage might vary, but feel free to chime in on the Hue user list or gethue.com ;)
We are also looking at improving the Hue setup with Amazon AWS/EMR!
To build and run Hue 3.6.0 with Apache Hadoop 2.4.1:
$ git clone https://github.com/cloudera/hue.git (Note: releases/tag/release-3.6.0 is unstable; it's better to build from the latest master. I built from the Aug 7 commit 87d6b2da1, which is stable.)
$ cd hue
Edit the Maven POM ($ vi maven/pom.xml) as follows (see the sketch after these steps):
change hadoop.version to 2.4.1
replace hadoop-core with hadoop-common
set the hadoop-test version to 1.2.1
Remove the files that need Hadoop MR1:
$ rm desktop/libs/hadoop/java/src/main/java/org/apache/hadoop/mapred/ThriftJobTrackerPlugin.java
$ rm desktop/libs/hadoop/java/src/main/java/org/apache/hadoop/thriftfs/ThriftJobTrackerPlugin.java
Build Hue: $ make apps
Configure Hue: $ vi desktop/conf/pseudo-distributed.ini
Run the Hue server in dev mode: $ build/env/bin/hue runserver 0.0.0.0:8000
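The pom.xml edits in the steps above amount to something like this (a sketch; the surrounding file differs between commits):
<!-- maven/pom.xml, in <properties>: bump the Hadoop version -->
<hadoop.version>2.4.1</hadoop.version>
<!-- replace the hadoop-core dependency with hadoop-common -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
</dependency>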
Follow the Hue manual installation steps from the Hortonworks documentation; it will take you step by step through doing it manually.
Quote: "...without Cloudera's distribution your mileage might vary...."
Indeed, it will vary A LOT! It would seem that the following is quite true:
Per the install guide:
http://cloudera.github.io/hue/docs-2.0.1/manual.html#_install_hue
NOTE:
Hue requires the Hadoop contained in Cloudera’s Distribution including Apache Hadoop (CDH), version 3 update 4 or later.
I've tried it and have run into walls with Hue trying to connect to Hive, Pig, and Oozie.
At this stage - from my experience at least - Hue will NOT run on a standard Apache Hadoop installation using standard Apache tools like Hive and Pig. It must be a vintage of Cloudera’s Distribution.
If anyone has any other (positive) experiences installing Hue outside of the Cloudera’s Distribution, I'd be quite interested to hear about them...
