Impala on Hadoop 1.0.4 - hadoop

I am trying to work with Impala on my Linux box. It is not a Cloudera distribution; I installed Hadoop, Hive, HBase and the other components individually.
Here are the versions:
Hadoop - 1.0.4
HBase - 0.94.8
Hive - 0.9.0
Impala - 1.2.3
I installed Impala from the RPM, as this is a Red Hat Linux box.
I am not able to configure the Impala server (in fact, I cannot even find the site.xml files) on my machine.
From the research I did, I learned that Impala only works with Hadoop 2.x. Is that true? If so, I need to migrate to 2.x rather than waste time on 1.x.
Could someone confirm this? Thanks in advance.

I suggest using the latest CDH 4.x. See the Impala prerequisites:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_prereqs.html
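
Before attempting an Impala install, it can help to check which Hadoop line the machine is on by reading the first line of `hadoop version` output (for example `Hadoop 1.0.4`). The following is a minimal Python sketch of that check. The function names are hypothetical, and the minimum major version of 2 is an assumption taken from the question's own research, not a confirmed requirement; the linked prerequisites page is the authority.

```python
import re


def hadoop_major_version(version_output):
    """Return the major version from text such as 'Hadoop 1.0.4', or None."""
    match = re.search(r"Hadoop\s+(\d+)\.(\d+)", version_output)
    return int(match.group(1)) if match else None


def meets_assumed_impala_requirement(version_output, minimum_major=2):
    """Check against the ASSUMED 'Hadoop 2.x or later' requirement."""
    major = hadoop_major_version(version_output)
    return major is not None and major >= minimum_major


# The version string reported in the question.
print(meets_assumed_impala_requirement("Hadoop 1.0.4"))
```

With the Hadoop 1.0.4 string from the question, the check returns False, which matches the advice to move to a Hadoop 2.x based distribution.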

Related

How compatible is Hadoop 3.0.0 with older versions of Hive, Pig, Sqoop and Spark?

We are currently using hadoop-2.8.0 on a 10-node cluster and are planning to upgrade to the latest hadoop-3.0.0.
I want to know whether there will be any issues if we use hadoop-3.0.0 with an older version of Spark and other components such as Hive, Pig and Sqoop.
The latest Hive version does not support Hadoop 3.0. It seems that Hive may run on Spark or other compute engines in the future.

Where to find CDH and all its software versions?

I want to know which CDH version is currently used the most, and the versions of all the software it includes. For example, if it is CDH 5.6, what are the versions of MapReduce, Hive, Impala, Sqoop, etc. in that package?
The most used? You're not going to be able to find that information unless Cloudera collected it and published the CDH versions their clients use.
Click the respective CDH version here for version information:
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball.html
For example, for CDH 5.6, Hadoop is at 2.6.0 and Sqoop is at 1.4.6.
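
If you need these numbers in a script, one option is to keep them in a small lookup table. The sketch below is illustrative only: the table and function names are hypothetical, and it contains just the two CDH 5.6 values quoted above. Any other entries would have to be copied from the release notes page.

```python
# Component versions per CDH release. Only the values quoted in this
# answer are filled in; everything else must come from the release notes.
CDH_COMPONENT_VERSIONS = {
    "5.6": {"hadoop": "2.6.0", "sqoop": "1.4.6"},
}


def component_version(cdh_release, component):
    """Return the recorded version string, or None if it is not in the table."""
    return CDH_COMPONENT_VERSIONS.get(cdh_release, {}).get(component.lower())


print(component_version("5.6", "Hadoop"))
```

A missing release or component returns None rather than a guess, so gaps in the table stay visible.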

Installing Spark 2.1.0 with Ambari 2.4.2.0

I am relatively new to cluster installations of Spark with Ambari. Recently, I was given the task of installing Spark 2.1.0 on a cluster that has Ambari pre-installed with Spark 1.6.2 and HDFS & YARN 2.7.3.
My task is to have Spark 2.1.0 installed, since it is the newest version and has better compatibility with RSpark and more. I searched the internet for a couple of days and only found installation guides for either AWS or Spark 2.1.0 alone,
such as the following:
http://data-flair.training/blogs/install-deploy-run-spark-2-x-multi-node-cluster-step-by-step-guide/
and http://spark.apache.org/docs/latest/building-spark.html.
But none of them mention interference between different versions of Spark. Since I need to keep this cluster running, I would like to know about potential threats to the cluster.
Is there a proper way to do this installation? Thanks a lot!
If you want to have your SPARK2 installation managed by Ambari then SPARK2 must be provisioned by Ambari.
HDP 2.5.3 does NOT support Spark 2.1.0; it does, however, come with a technical preview of Spark 2.0.0.
Your options are:
Install Spark 2.1.0 manually and not have it managed by Ambari.
Use Spark 2.0.0, which is provided by HDP 2.5.3, instead of Spark 2.1.0.
Use a different stack, e.g. IBM Open Platform (IOP) 4.3. Slated for release in 2017, it will ship with Spark 2.1.0 support. You can get started with it today using the technical preview release.
Upgrade to HDP 2.6, which supports Spark 2.1.
Extend the HDP 2.5 stack to support Spark 2.1.0. You can see how to customize and extend Ambari stacks on the wiki. This would let you use Spark 2.1.0 and have it managed by Ambari. However, it would be a lot of work to implement, and since you're new to Ambari it would be rather difficult.

What version of Hadoop to install and run?

After reading this article...
http://blog.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/
If I were to make a brand new installation of Hadoop to work with, is 0.23 still the version that has all the features today? Or is there a better version out there now that has all the features and performance improvements? There are so many guides out there that use 0.20 that it seems as if 1.0 is not to be trusted.
Here is a guide I have followed at least three times to install and run single-node and two-node clusters; Michael does a pretty good job of keeping it current:
Running Hadoop on Ubuntu Linux (Single-Node Cluster)
Running Hadoop on Ubuntu Linux (Multi-Node Cluster)
This uses Hadoop version 1.0.3, released in May 2012. The latest stable release as of this writing is 1.1.2, but for a first install to test and become familiar with the system, a guide like the one above may help; you can then upgrade to the latest release once you have a reference point.
Check the Hadoop documentation for the status of the different releases. As of now 1.0.4 is the stable release.
I came across this tutorial for setting up a single-node cluster on Ubuntu 12.04:
http://preciselyconcise.com/apis_and_installations/hadoop_installation.php
I followed the tutorial and successfully installed Hadoop 1.1.2 on my Linux system.

Setting up Nutch 1.3 and Hadoop 0.20.2

I have a multi-node cluster running on UEC (Ubuntu Enterprise Cloud) and I thought it would be a good idea to set up Nutch on it.
However, I found this tutorial unhelpful: http://wiki.apache.org/nutch/NutchHadoopTutorial. It is for Nutch release 0.8 and does not cover the latest versions. Can somebody tell me how I can configure Nutch 1.3 with Hadoop?
