What version of hadoop to install and run? - hadoop

After reading this article...
http://blog.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/
If I were to make a brand new installation of hadoop to work with... is it still 0.23 today that has all the features? Or is there a better version that is out there now that has everything and captures all features and performance? There are so many guides out there that use 0.20... makes it seem as if 1.0 is not to be trusted...

Here is a guide I have followed at least three times to install and run on single node and two-node clusters and Michael does a pretty good job of keeping it current:
Running Hadoop on Ubuntu Linux (Single-Node Cluster)
Running Hadoop on Ubuntu Linux (Multi-Node Cluster)
This uses version Hadoop version 1.0.3 released in May 2012; The latest stable as of this writing is 1.1.2, but if you want to do a first install to test and become familiar a guide like the one above may help you familiarize with the system and then upgrade to the latest-one once you have a reference point.

Check the Hadoop documentation for the status of the different releases. As of now 1.0.4 is the stable release.

I came across this tutorial for setting up a single node cluster in ubuntu 12.04.
http://preciselyconcise.com/apis_and_installations/hadoop_installation.php. I followed the tutorial and i successfully installed hadoop 1.1.2 on my linux system.

Related

Master and Slave system OS version

I'm trying to create my own hadoop clister. My all data nodes have installed ubuntu 18 and Name node is having ubuntu 14.
Is it mandatory that Name node and Data nodes should have same version of OS .. ?
It is recommended to have the same major version at least to avoid kernel vulnerabilities. If you come across these low level issues, they are very difficult to debug.
As #piyush-p said, it's not recommended but as long as you are running the same Java version across all the hosts you should be okay. You probably won't want to
do this if you are using a commercial distribution of Hadoop (HDP, Cloudera) as their
respective setup tools (Ambari, Cloudera Manager) will probably disallow this.
See HDP Support for mix of OS Releases within a cluster for more details.

are there different ways how to install a cloudera hadoop packages?

can I only install packages via RPM? (RedHat Package Management)
Im using Cloudera and I heard a couple times about CDH Parcel Services but im not sure, if i can do that with this too? or is there another mechanism?
best regards
If you're using Debian or Ubuntu, then you'd use DEB packages, not RPM.
Parcels should work
So would compiling code from source.
Just because you're running Cloudera doesn't make the system any less of a regular Linux machine
Parcels is Cloudera's way of installing their distribution. You install Cloudera Manager, and it will install all of the components using parcels (although I think you have a choice). This is done through a GUI (or API). This is probably the easiest way to go about it.
If you are just learning, the Quick Start VM is not a bad way to get started.
I have done installation of Cloudera using Parcels and it is easier than package installation. Parcels are quite conveniently picked up by Cloudera Manager for installation purposes. Almost everything is ready once parcel installation is done.

installation for spark 2.1.0 with Ambari 2.4.2.0

I am relatively new on cluster installations for Spark along with Ambari. Recently, I got a task for installing Spark 2.1.0 on a cluster which pre-installed Ambari with Spark 1.6.2 with HDFS & YARN 2.7.3.
My task is to have Spark 2.1.0 installed since it is the newest version with better compacity with RSpark and more. I searched over the internet for couple days, only found some installation guide on either AWS or Spark 2.1.0 alone.
such as following:
http://data-flair.training/blogs/install-deploy-run-spark-2-x-multi-node-cluster-step-by-step-guide/
and http://spark.apache.org/docs/latest/building-spark.html.
But none of them mentioning the interference of different versions of Spark. Since I need to keep this cluster running, I would like to know some potential threat for the cluster.
Is there some proper way to do this installation? Thanks a lot!
If you want to have your SPARK2 installation managed by Ambari then SPARK2 must be provisioned by Ambari.
HDP 2.5.3 does NOT support Spark 2.1.0, it does however come with a technical preview of Spark 2.0.0.
Your options are:
Install Spark 2.1.0 manually and not have it managed by Ambari
Use Spark 2.0.0 instead of Spark 2.1.0 which is provided by HDP 2.5.3
Use a different stack. ie. IBM Open Platform (IOP) 4.3, slated to release in 2017, it will ship with Spark 2.1.0 support. You can get started using it today with the technical preview release.
Upgrade HDP (2.6) which supports Spark 2.1.
Extend the HDP 2.5 stack to support Spark 2.1.0. You can see how to customize and extend ambari stacks on the wiki. This would let you used Spark 2.1.0 and have it managed by ambari. However, this would be a lot of work to implement and being that you're new to Ambari it would be rather difficult.

Suitable hadoop framework for ubuntu

I want to start working with Hadoop and BigData. I need an easy graphical interface to start. I try Hue but I couldn't get it configured.
Please help me to choose my suitable Hadoop.
I use Ubuntu 14.04.
I think Cloudera,sandbox(by hortonworks) is a easy way.Hard way is installation to Ubuntu.Also i have ubuntu 14.04 and Hadoop(hive,pig),Apache spark exist and i dont need open virtual machine.
There are 3 major Hadoop distributions that you can start with.
Cloudera
Hortonworks
MapR
Each one of them has a UI installer and manager. I think the best for you would be though, to use the virtual environment that these vendors provide.
The Hortonworks Developer Sandbox is an image including Hue as UI to get started. However, the downloadable sandbox image is based on CentOS.
If you want to install a Hortonworks Distribution on Ubuntu, you need to run an Ambari installation (Downloads - Hortonworks Hadoop). Be aware that Hue is not included into the default Ambari installation, but Hue can be installed easily separately. To run properly, Hue on Hortonworks still needs Python 2.6.x.
There are some distributions like Cloudera or Hortonworks but their package needs high machine configuration. For example RAM + 16GB and sometimes it's not possible for the user. In addition, they include some Hadoop related project that user doesn't need at all. If you want to enter this field seriously I strongly recommend installing Hadoop on your own. Doing that you do some configuration and will get familiar with many Hadoop concepts.
You can start using this install tutorial.

Impala on Hadoop 1.0.4

I am trying to work on impala in my linux box. Mine is not a cloudera distribution. I installed Hadoop, Hive, HBase and other components individually.
Here are the versions
Hadoop - 1.0.4
HBase - 0.94.8
Hive - 0.9.0
Impala - 1.2.3
I installed impala using rpm as mine is a redhat linux box.
I am not able to configure the impala servrer (indeed not able to find site.xml's) in my machine.
In the research I did, I came to know that impala will only work with Hadoop 2.x. Is it true? If it is correct, I need to migrate to 2.x rather than wasting time on 1.x.
Could someone confirm the same? Thanks in advance.
I suggest to use latest CDH 4.x
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_prereqs.html

Resources