installing Cloudera Impala without installing CDH? - hadoop

I'm using plain Apache Hadoop with Hive. I need to install Apache Impala to integrate with Hive and Kudu. I'm trying to build it from the source at this link, following these build instructions. That page says it covers how to "build Impala from source and how to configure and run Impala in a single node development environment."
Can I use the Apache Impala source build in a production environment with plain Apache Hadoop (not Cloudera)? I also saw this page from another Stack Overflow question. The answer there recommends installing from packages, but those packages depend on the CDH builds of HBase, Hadoop, etc.
Can anyone show how to install Apache Impala in a production environment on plain Apache Hadoop, without Cloudera's Hadoop?
Thanks.

Related

Updating individual CDH Components in a Community Edition via '1 Click Installer'

Can someone let me know if it is possible to update an individual CDH component from 5.7 to 5.13 via the "1 Click Installer" for Community Edition?
For example, let's say I want to update only hadoop-hdfs-datanode to the latest version on a server. If I run sudo apt-get install hadoop-hdfs-datanode, it also updates the other CDH components running on that node (like resource-manager, node-manager, etc.).
As discussed here, if I try to upgrade hadoop-yarn-resourcemanager, it upgrades almost all the CDH Hadoop components:
support#platform1:~$ sudo apt-get install hadoop-yarn-resourcemanager
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
hadoop hadoop-0.20-mapreduce hadoop-client hadoop-conf-pseudo hadoop-hdfs
hadoop-hdfs-datanode hadoop-hdfs-journalnode hadoop-hdfs-namenode
hadoop-hdfs-secondarynamenode hadoop-hdfs-zkfc hadoop-mapreduce
hadoop-mapreduce-historyserver hadoop-yarn hadoop-yarn-nodemanager
The following packages will be upgraded:
hadoop hadoop-0.20-mapreduce hadoop-client hadoop-conf-pseudo hadoop-hdfs
hadoop-hdfs-datanode hadoop-hdfs-journalnode hadoop-hdfs-namenode
hadoop-hdfs-secondarynamenode hadoop-hdfs-zkfc hadoop-mapreduce
hadoop-mapreduce-historyserver hadoop-yarn hadoop-yarn-nodemanager
hadoop-yarn-resourcemanager
15 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.
"it is updating other CDH component also running in that node"
I doubt it is upgrading everything on the node; more likely it is upgrading just the services that depend on the Hadoop client being upgraded.
If you were to install Hadoop all by itself, it includes HDFS, MapReduce, YARN, and the Hadoop client libraries. Therefore, it makes sense that upgrading the datanode package would try to grab those, but not HBase, Hive, Pig, Spark, Oozie, etc. packages.
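One way to check this before touching the node is to run the install in simulation mode (`apt-get install --simulate <package>`) and look at which packages it lists. The following is a minimal sketch, not a definitive tool: the function name `packages_to_upgrade` is hypothetical, and the sample text is the upgrade listing from the question above.

```python
import re

# Matches the summary line that ends the package listing.
SUMMARY = re.compile(r"^\d+ upgraded, \d+ newly installed")


def packages_to_upgrade(apt_output):
    """Return the package names listed under 'will be upgraded'."""
    packages = []
    in_section = False
    for raw in apt_output.splitlines():
        line = raw.strip()
        if line == "The following packages will be upgraded:":
            in_section = True
            continue
        if in_section:
            # The section ends at a blank line, the next header, or the summary.
            if not line or line.endswith(":") or SUMMARY.match(line):
                break
            packages.extend(line.split())
    return packages


# Sample taken from the apt-get output quoted in the question.
sample = """\
The following packages will be upgraded:
hadoop hadoop-0.20-mapreduce hadoop-client hadoop-conf-pseudo hadoop-hdfs
hadoop-hdfs-datanode hadoop-hdfs-journalnode hadoop-hdfs-namenode
hadoop-hdfs-secondarynamenode hadoop-hdfs-zkfc hadoop-mapreduce
hadoop-mapreduce-historyserver hadoop-yarn hadoop-yarn-nodemanager
hadoop-yarn-resourcemanager
15 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.
"""

upgraded = packages_to_upgrade(sample)
# Anything that is not a core Hadoop package would show up here.
outside_hadoop = [p for p in upgraded if not p.startswith("hadoop")]
print(len(upgraded), outside_hadoop)
```

For the listing in the question, every package in the upgrade set is a `hadoop*` package, which matches the point that only the core Hadoop pieces are pulled along.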
Essentially, you need to ensure all your Hadoop client libraries are the same version. CDH itself hasn't moved off of Hadoop 2.6.0 between those releases, although it has added patches to that base release, so it might be fine to upgrade.
However, let's take HBase as an example. Its documentation says that Hadoop 2.6.0, 2.7.0, and 2.8.x are not supported; Hadoop 3.x is not tested; only 2.6.1+ or 2.7.1+ are supported.
It goes on to say:
In distributed mode, it is critical that the version of Hadoop that is out on your cluster match what is under HBase... Make sure you replace the jar in HBase across your whole cluster. Hadoop version mismatch issues have various manifestations but often all look like its hung
All component upgrades should be followed through, and Cloudera makes the effort to ensure all components of a single release work together, not mixed across releases.
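As a rough illustration of the compatibility rule quoted from the HBase documentation, the check could be sketched as below. This encodes only the versions mentioned in this answer and nothing else; the function name `hadoop_support_status` is hypothetical, and the rule describes Apache releases, so a patched vendor build such as CDH's 2.6.0 may behave differently.

```python
import re


def hadoop_support_status(version):
    """Classify a Hadoop version using only the rule quoted in the answer."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return "unknown"
    major, minor, patch = (int(part) for part in match.groups())
    if major == 3:
        return "not tested"
    if major == 2 and minor in (6, 7):
        # 2.6.0 and 2.7.0 are excluded; 2.6.1+ and 2.7.1+ are supported.
        return "supported" if patch >= 1 else "not supported"
    if major == 2 and minor == 8:
        return "not supported"
    # Versions the quoted rule does not mention.
    return "unknown"


for v in ("2.6.0", "2.6.1", "2.7.3", "2.8.0", "3.0.0"):
    print(v, hadoop_support_status(v))
```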

Missing spark-env.sh if I installed pyspark with pip

I installed pyspark 2.2.0 with pip, but I don't see a file named spark-env.sh nor the conf directory. I would like to define variables like SPARK_WORKER_CORES in this file. How should I proceed?
I am using Mac OSX El Capitan, python 2.7.
PySpark from PyPI (i.e. installed with pip or conda) does not contain the full PySpark functionality; it is only intended for use with a Spark installation in an already existing cluster, in which case you might want to avoid downloading the whole Spark distribution. From the docs:
The Python packaging for Spark is not intended to replace all of the
other use cases. This Python packaged version of Spark is suitable for
interacting with an existing cluster (be it Spark standalone, YARN, or
Mesos) - but does not contain the tools required to setup your own
standalone Spark cluster. You can download the full version of Spark
from the Apache Spark downloads page.
So, what you should do is download Spark as said above (PySpark is an essential component of it).
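Once the full distribution is downloaded and SPARK_HOME points at it, a quick check like the following can confirm whether conf/spark-env.sh exists. This is only a sketch: `find_spark_env` is a hypothetical helper, and it assumes the usual layout in which the full distribution ships conf/spark-env.sh.template, which you copy to conf/spark-env.sh before adding variables such as SPARK_WORKER_CORES.

```python
import os
from pathlib import Path


def find_spark_env(spark_home):
    """Return the path to conf/spark-env.sh under spark_home, or None."""
    candidate = Path(spark_home) / "conf" / "spark-env.sh"
    return candidate if candidate.is_file() else None


spark_home = os.environ.get("SPARK_HOME")
if spark_home is None:
    print("SPARK_HOME is not set")
else:
    found = find_spark_env(spark_home)
    print(found if found else "no conf/spark-env.sh under SPARK_HOME")
```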

how to install Spark and Hadoop from tarball separately [Cloudera]

I want to install the Cloudera distribution of Hadoop and Spark from tarballs.
I have already set up Hadoop in pseudo-distributed mode on my local machine and successfully ran a YARN example.
I have downloaded the latest CDH 5.3.x tarballs from here.
But the folder structure of the Spark tarball downloaded from Cloudera is different from the one on the Apache website. This may be because Cloudera provides its own separately maintained version.
So far I have not found any documentation on installing Spark separately from Cloudera's tarball.
Could someone help me understand how to do it?
Spark can be extracted to any directory. You just need to run the ./bin/spark-submit command (available in the extracted Spark directory) with the required parameters to submit a job. To start the interactive Spark shell, use ./bin/spark-shell.
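If you want to script the submission, the command line can be assembled as in the sketch below. The helper name `spark_submit_command`, the directory /opt/spark, and the script name my_job.py are all hypothetical; `--master` is a standard spark-submit option, and the other parameters you need depend on your job.

```python
import os


def spark_submit_command(spark_dir, app, master=None, app_args=()):
    """Build the argument list for bin/spark-submit in an extracted Spark dir."""
    cmd = [os.path.join(spark_dir, "bin", "spark-submit")]
    if master:
        cmd += ["--master", master]
    cmd.append(app)          # the application script or jar
    cmd.extend(app_args)     # arguments passed through to the application
    return cmd


# Hypothetical paths, for illustration only.
cmd = spark_submit_command("/opt/spark", "my_job.py", master="yarn")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to launch the job.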

How to download CDH4 setup manually

How can I download the CDH4 setup manually?
I mean I want to download the setup without using apt-get from the Ubuntu command prompt.
CDH4 is Cloudera's distribution of Apache Hadoop. CDH is a collection of Apache Hadoop and several components of its ecosystem.
Assuming that you are asking for the source, each component can be downloaded as a tarball from the following location:
CDH Packaging and Tarball Information

hbase 0.94.11 and hadoop version

I have a Hadoop cluster running version 1.2.1, and I recently downloaded HBase 0.94.11 to try it out. I was able to set up HBase to run in distributed mode, but when I checked the web GUI status, it stated that the Hadoop version is 1.0.4. I noticed that this is because HBase uses the hadoop-core-1.0.4.jar file that comes bundled with it. So my question is: should I replace this jar file with hadoop-core-1.2.1.jar so that HBase uses the latest hadoop-core jar? And does it matter?
You don't have to do that if 1.0.4 works for you. The newer version may introduce other problems, and simply replacing hadoop-core.jar is unsafe. If you want to upgrade HBase, please follow the official guide.
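To see which hadoop-core jar an HBase install actually bundles before deciding anything, a small check like this can help. It is a sketch under assumptions: the helper name `bundled_hadoop_core_versions` is hypothetical, and it assumes the bundled jars sit in a lib directory under the HBase home and are named like the hadoop-core-1.0.4.jar file from the question.

```python
import re
from pathlib import Path

# Matches names such as hadoop-core-1.0.4.jar and captures the version.
JAR_NAME = re.compile(r"^hadoop-core-(\d+(?:\.\d+)*)\.jar$")


def bundled_hadoop_core_versions(hbase_home):
    """Return the versions of hadoop-core jars found under <hbase_home>/lib."""
    lib_dir = Path(hbase_home) / "lib"
    versions = []
    for jar in sorted(lib_dir.glob("hadoop-core-*.jar")):
        match = JAR_NAME.match(jar.name)
        if match:
            versions.append(match.group(1))
    return versions
```

Calling `bundled_hadoop_core_versions` with your HBase directory and comparing the result with the cluster's Hadoop version shows whether the two differ, as they do in the question (1.0.4 bundled versus 1.2.1 on the cluster).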
Hope it helps.
