I want to know which CDH version is currently the most widely used, and the versions of all the software it bundles. For example, for CDH 5.6, which versions of MapReduce, Hive, Impala, Sqoop, etc. are in the package?
The most used? You're not going to be able to find that information unless Cloudera collected it and published the CDH versions their clients use.
Click the respective CDH version on this page for version information:
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball.html
For example, for CDH 5.6, Hadoop is at 2.6.0 and Sqoop is at 1.4.6.
Related
I am trying to build a Java program using the Hadoop 3.2 client. Will it work with Hadoop 2.x clusters, or is that not supported? Thank you for sharing your experience.
With Hadoop and most Apache-licensed projects, compatibility is only guaranteed across minor versions within the same major version. So you should not expect a 3.2 client to work with a 2.x Hadoop cluster.
Cloudera's blog post "Upgrading your clusters and workloads from Apache Hadoop 2 to Apache Hadoop 3", written by Suma Shivaprasad, also mentions the following:
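As a rough illustration of that rule of thumb, the check below compares only the major version of the client and the cluster. The function names `major_version` and `likely_compatible` are made up for this sketch, and it assumes plain dotted version strings such as "3.2.0". It is a heuristic, not something Hadoop itself provides.

```python
def major_version(version: str) -> int:
    """Return the leading numeric component of a dotted version string."""
    return int(version.strip().split(".")[0])


def likely_compatible(client_version: str, cluster_version: str) -> bool:
    """Heuristic only: treat versions as compatible when the major numbers match."""
    return major_version(client_version) == major_version(cluster_version)


if __name__ == "__main__":
    # A 3.2 client against a 2.8 cluster crosses a major version boundary.
    print(likely_compatible("3.2.0", "2.8.0"))
    # Two 2.x versions share the same major version.
    print(likely_compatible("2.6.0", "2.8.0"))
```

A matching major version does not prove that a given client and cluster will work together; the check only flags the cases where no compatibility should be expected in the first place.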
Compatibility with Hadoop 2
Wire compatibility
Hadoop 3 preserves wire compatibility with Hadoop 2 clients
Distcp/WebHDFS compatibility is preserved
API compatibility
Hadoop 3 doesn’t preserve full API level compatibility due to the following changes
Classpath – Dependency version bumps like guava
Removal of deprecated APIs and tools
Shell script rewrites
Incompatible bug fixes
But also states:
Migrating Workloads
MapReduce applications
MapReduce is fully binary compatible and workloads should run as is without any changes required.
We are currently using hadoop-2.8.0 on a 10-node cluster and are planning to upgrade to the latest release, hadoop-3.0.0.
I want to know whether there will be any issue if we use hadoop-3.0.0 with an older version of Spark and other components such as Hive, Pig and Sqoop.
The latest Hive version does not support Hadoop 3.0. It seems that Hive may be built on Spark or other compute engines in the future.
I am relatively new to installing Spark on clusters managed by Ambari. Recently, I was given the task of installing Spark 2.1.0 on a cluster that has Ambari pre-installed, with Spark 1.6.2 and HDFS & YARN 2.7.3.
My task is to get Spark 2.1.0 installed, since it is the newest version and has better compatibility with RSpark and more. I searched the internet for a couple of days and only found installation guides for either AWS or Spark 2.1.0 alone,
such as the following:
http://data-flair.training/blogs/install-deploy-run-spark-2-x-multi-node-cluster-step-by-step-guide/
and http://spark.apache.org/docs/latest/building-spark.html.
But none of them mention how different versions of Spark interfere with each other. Since I need to keep this cluster running, I would like to know about any potential risks to the cluster.
Is there some proper way to do this installation? Thanks a lot!
If you want to have your SPARK2 installation managed by Ambari then SPARK2 must be provisioned by Ambari.
HDP 2.5.3 does NOT support Spark 2.1.0. It does, however, come with a technical preview of Spark 2.0.0.
Your options are:
Install Spark 2.1.0 manually and not have it managed by Ambari
Use Spark 2.0.0 instead of Spark 2.1.0 which is provided by HDP 2.5.3
Use a different stack, e.g. IBM Open Platform (IOP) 4.3. It is slated for release in 2017 and will ship with Spark 2.1.0 support. You can get started using it today with the technical preview release.
Upgrade to HDP 2.6, which supports Spark 2.1.
Extend the HDP 2.5 stack to support Spark 2.1.0. You can see how to customize and extend Ambari stacks on the wiki. This would let you use Spark 2.1.0 and have it managed by Ambari. However, it would be a lot of work to implement, and since you're new to Ambari it would be rather difficult.
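To make the version constraint behind these options concrete, here is a small lookup sketch. The table holds only the two pairings stated in this answer (HDP 2.5.3 with the Spark 2.0.0 technical preview, HDP 2.6 with Spark 2.1). The names `HDP_SPARK2` and `stack_ships_spark` are hypothetical, and the authoritative list is in the HDP release notes, not in this snippet.

```python
# Spark 2 line shipped with each HDP release, taken only from the answer above.
HDP_SPARK2 = {
    "2.5.3": "2.0.0",  # technical preview
    "2.6": "2.1",
}


def stack_ships_spark(hdp_version: str, spark_version: str) -> bool:
    """Return True if the wanted Spark version matches what the stack ships.

    Only as many version components as the table entry has are compared,
    so "2.1.0" matches a stack that is listed as shipping "2.1".
    """
    shipped = HDP_SPARK2.get(hdp_version)
    if shipped is None:
        return False  # unknown stack: not covered by this sketch
    shipped_parts = shipped.split(".")
    wanted_parts = spark_version.split(".")
    return wanted_parts[: len(shipped_parts)] == shipped_parts


if __name__ == "__main__":
    print(stack_ships_spark("2.5.3", "2.1.0"))
    print(stack_ships_spark("2.6", "2.1.0"))
```

If the lookup fails for the stack you are on, the remaining choices are the ones listed above: install manually, settle for the shipped version, switch or upgrade the stack, or extend it.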
I am trying to work with Impala on my Linux box. Mine is not a Cloudera distribution. I installed Hadoop, Hive, HBase and other components individually.
Here are the versions:
Hadoop - 1.0.4
HBase - 0.94.8
Hive - 0.9.0
Impala - 1.2.3
I installed Impala using rpm, as mine is a Red Hat Linux box.
I am not able to configure the Impala server on my machine (in fact, I cannot even find the site.xml files).
From the research I did, I learned that Impala only works with Hadoop 2.x. Is that true? If so, I need to migrate to 2.x rather than waste time on 1.x.
Could someone confirm the same? Thanks in advance.
I suggest using the latest CDH 4.x:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_prereqs.html
After reading this article...
http://blog.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/
If I were to make a brand new installation of Hadoop to work with, is 0.23 still the version that has all the features? Or is there a better version out there now that captures all the features and performance? There are so many guides out there that use 0.20, which makes it seem as if 1.0 is not to be trusted...
Here is a guide I have followed at least three times to install and run on single node and two-node clusters and Michael does a pretty good job of keeping it current:
Running Hadoop on Ubuntu Linux (Single-Node Cluster)
Running Hadoop on Ubuntu Linux (Multi-Node Cluster)
This guide uses Hadoop 1.0.3, released in May 2012. The latest stable release as of this writing is 1.1.2, but if you want to do a first install to test and get familiar with the system, a guide like the one above can help; you can then upgrade to the latest release once you have a reference point.
Check the Hadoop documentation for the status of the different releases. As of now 1.0.4 is the stable release.
I came across this tutorial for setting up a single-node cluster on Ubuntu 12.04:
http://preciselyconcise.com/apis_and_installations/hadoop_installation.php. I followed the tutorial and successfully installed Hadoop 1.1.2 on my Linux system.