How can I install Hortonworks' HDP? - hadoop

I am new to this and I want to know how to install the solution provided by Hortonworks, HDP (http://hortonworks.com/products/data-center/hdp/), with the following setup: I have 2 virtual machines plus a local machine to work with, and I want to use the 2 VMs as the master node and worker node once I configure Apache Spark.
But my question is: what do I have to do to install HDP correctly? Do I install the solution on my local machine and configure Apache Spark to use those 2 virtual machines as the master node and worker node? Or must I install HDP on all 3 machines?
I repeat that I am new to this, and any answer or comment you could give would be very helpful.
Thank you so much!

If you are trying to deploy HDP on a cluster (multi-node environment), use Apache Ambari to install HDP and other services such as Spark.
I have tried it in a CentOS environment; below are the steps and links for installation.
Get the Ambari repo file with the command below (placing it in /etc/yum.repos.d/ so yum can find it):
wget -O /etc/yum.repos.d/ambari.repo http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.2.2.0/ambari.repo
Install Ambari by using the command below
yum install ambari-server
Start the server by using the command below
ambari-server start
Now you can set up your cluster by going to a web browser and opening http://<ambari-server-host>:8080.
Now you can select the desired HDP version. Add the other nodes and services and deploy your cluster.
Detailed installation steps can be found in the Hortonworks Ambari installation documentation.
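Putting the commands above together, a minimal end-to-end sketch on CentOS 7 might look like this (note that ambari-server setup, which is not shown in the steps above, configures the JDK and the embedded database and must run before the first start; --silent just accepts the defaults):
# Minimal sketch, run as root on CentOS 7; repo URL/version as quoted above.
wget -O /etc/yum.repos.d/ambari.repo http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.2.2.0/ambari.repo
yum install -y ambari-server
ambari-server setup --silent   # one-time setup: JDK and embedded PostgreSQL
ambari-server start
# Then open http://<ambari-server-host>:8080 (default login: admin/admin),
# select the HDP version, register the other nodes, and add services such as Spark.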

Related

Setup Ambari with HDP and HDP-UTILS rpms for making a local repository

I'm trying to install Ambari Server 1.7 on an Oracle Linux 6 machine, but it turned out that it's not open source anymore: the public repository can't be accessed.
I've got an older version of Ambari's tar.gz file. After I successfully installed the Ambari server, when I built the Hadoop cluster, it was directed to the public repositories for HDP and HDP-UTILS, which are no longer accessible.
(http://public-repo-1.hortonworks.com/HDP/centos6/2.x/GA/2.2.0.0)
(http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/centos6)
So, I need those rpm files to make a local repository to build the cluster. I've looked for the files all over the internet but can't find them anywhere; does anyone still have them?
(HDP-2.2.4.2-centos6-rpm.tar.gz)
(HDP-UTILS-1.1.0.20-centos6.tar.gz)
Thank you.
Ambari itself is still open source. HDP has been behind Cloudera's paywall for a while.
You can use Apache Bigtop to deploy Hadoop clusters and get a public/free distribution of Ambari, but Ambari itself is no longer supported or developed, so I would not suggest using it to deploy Hadoop clusters (and I am not even sure Bigtop supports Oracle Linux 6).
You can use any Ambari stack you want, too; you don't need those specific HDP / HDP-UTILS yum repositories, but from what I've found there aren't any that are as publicly documented.
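For completeness, if you do track down the tarballs, turning them into a local yum repository that Ambari can point at is straightforward. A minimal sketch (the web root, use of httpd, and host name are assumptions):
# Build a local HDP/HDP-UTILS yum repo from the tarballs (CentOS 6, run as root).
yum install -y createrepo httpd
mkdir -p /var/www/html/hdp
tar -xzf HDP-2.2.4.2-centos6-rpm.tar.gz -C /var/www/html/hdp
tar -xzf HDP-UTILS-1.1.0.20-centos6.tar.gz -C /var/www/html/hdp
createrepo /var/www/html/hdp   # generate repodata/ so yum clients can consume it
service httpd start            # serve the rpms over http
# In Ambari, override the HDP and HDP-UTILS base URLs with http://<this-host>/hdp/...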

Differences: Single-node and Multi-node

I'm trying to install Hadoop in a virtual machine, and I found a tutorial explaining how to do that in a multi-node cluster.
So my question is: what's the difference between a single-node and a multi-node cluster?
Thanks in advance :)
Standalone (local) mode: by default, Hadoop is configured to run in a non-distributed or standalone mode, as a single Java process. There are no daemons running and everything runs in a single JVM instance; HDFS is not used.
Pseudo-distributed mode (a single-node cluster): the Hadoop daemons run on one local machine, thus simulating a cluster on a small scale. Different Hadoop daemons run in different JVM instances, but on a single machine, and HDFS is used instead of the local FS.
Multi-node (fully distributed) cluster: the daemons are spread across several machines, with the master daemons (NameNode, ResourceManager) on the master node(s) and the worker daemons (DataNode, NodeManager) on the slave nodes. The sketch below shows the configuration change behind pseudo-distributed mode.
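To see what actually changes between these modes, here is a minimal sketch (assuming a Hadoop 2.x tarball install with $HADOOP_HOME set and passwordless ssh to localhost) of moving from standalone to pseudo-distributed mode; it is the fs.defaultFS setting that swaps the local file system for HDFS:
# Point Hadoop at a local HDFS instead of the local file system (Hadoop 2.x layout assumed).
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
hdfs namenode -format   # one-time format of the single-node HDFS
start-dfs.sh            # starts NameNode, DataNode and SecondaryNameNode locally
jps                     # should now list those daemons, each in its own JVM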
And you can set up your work environment as follows:
Step 1 - Download the latest version of VMware Player and install it on your laptop/desktop. You can also install VMware Tools, which will be very useful when working with the guest OS.
Step 2 - Once Step 1 is completed, download the Cloudera QuickStart VM from
http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo
Step 3 - Open the VMware Player program and click on “Open a Virtual Machine”. Go to the directory “cloudera-quickstart-vm-4.4.0-1-vmware” and select cloudera-quickstart-vm-4.4.0-1-vmware. This will create a virtual machine instance in VMware Player.
Step 4 - Click “Power on” to start the Cloudera demo VM.
You are good to go
Good luck

Hadoop cluster configuration with Ubuntu Master and Windows slave

Hi, I am new to Hadoop.
Hadoop Version (2.2.0)
Goals:
Setup Hadoop standalone - Ubuntu 12 (Completed)
Setup Hadoop standalone - Windows 7 (cygwin being used for only sshd) (Completed)
Setup cluster with Ubuntu Master and Windows 7 slave (This is mostly for learning purposes and setting up a env for development) (Stuck)
Setup, in relation to the questions below:
Master running on Ubuntu with hadoop 2.2.0
Slaves running on Windows 7 with a self compiled version from hadoop 2.2.0 source. I am using cygwin only for the sshd
Passwordless login is set up and I am able to log in both ways using ssh from outside Hadoop. Since my Ubuntu and Windows machines have different usernames, I have set up a config file in the .ssh folder which maps hosts to users.
Questions:
1. In a cluster, does the username on the master need to be the same as on the slave? The reason I ask is that, after configuring the cluster, when I try to use start-dfs.sh the logs say that it was able to ssh into the slave nodes but could not find the location "/home/xxx/hadoop/bin/hadoop-daemon.sh" on the slave. The "xxx" is my master username, not the slave one. Also, since my slave is a pure Windows install, it lives under C:/hadoop/... Does the master look at the env variable $HADOOP_HOME to check where the install is on the slave? Are there any other env variables I need to set?
2. My goal was to use the Windows Hadoop build on the slave, since Hadoop officially supports Windows now. But is it better to run the Linux build under Cygwin to accomplish this? The question comes up because I see that start-dfs.sh tries to execute hadoop-daemon.sh and not some *.cmd.
3. If this setup works out in future, a possible follow-up question is whether Pig, Mahout, etc. will run in this kind of setup, as I have not seen builds of Pig or Mahout for Windows. Do these components need to be present only on the master node, or on the slave nodes too? I saw two ways of running Mahout when experimenting with standalone mode: first using the mahout script, which I was able to use on Linux, and second using the yarn jar command, passing in the Mahout jar, while using the Windows version. If Mahout/Pig (when using the provided sh script) assume that the slaves already have the jars in place, then the Ubuntu + Windows combo does not seem to work. Please advise.
As I mentioned, this is more of an experiment than an implementation plan. Our final environment will be completely on Linux. Thank you for your suggestions.
You may have more success going with a more standard way of deploying Hadoop. Try using Ubuntu VMs for the master and slaves.
You can also try a pseudo-distributed deployment, in which all of the processes run on a single VM, and thus avoid the need to even consider multiple OSes.
I have only worked with the same username. In general, SSH allows logging in with a different login name via the -l option, but this might get tricky. You have to list your slaves in the slaves file.
At least in the manual (https://hadoop.apache.org/docs/r0.19.1/cluster_setup.html#Slaves) I did not find anything about adding usernames. It might be worth trying to add -l login_name to the slave node entry in the slaves conf file and seeing if it works.
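Alternatively, the per-host usernames can live in the ~/.ssh/config file the asker already mentioned; start-dfs.sh effectively just runs ssh <slave-host>, so SSH itself can resolve the right user. A minimal sketch of such an entry on the master (the host name, address, and user below are assumptions):
# ~/.ssh/config on the master; "winslave" must match the name listed in the slaves file.
Host winslave
    HostName 192.168.1.50      # assumed address of the Windows slave
    User winhadoopuser         # assumed username on the Windows box
    IdentityFile ~/.ssh/id_rsa
With this in place, a plain ssh winslave logs in as winhadoopuser without any -l flag.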

Install Hue without Cloudera

Has anyone tried/succeeded in installing Hue on Hadoop without Cloudera?
I have gotten to a point where I can reliably reproduce a hadoop cluster with hbase and hive and can set it all up in about 15 minutes. I'd love to have Hue along with all this without having to go back and redo my setup with Cloudera.
Check out slides #19 & #5; Hue is getting everywhere and is compatible with Hadoop 0.20 / 1.2.0 / 2.2.0: http://gethue.com/hue-goes-to-paris-hug-france/
Hue has tarball releases that you are free to install. You can also simply clone the source code (Hue is open source and Apache-licensed) from GitHub: https://github.com/cloudera/hue and build the branch you want.
Upstream documentation is available from the Hue project, and there is a CDH-specific version from Cloudera.
Hue is also packaged in Bigtop (and so is based on vanilla Hadoop).
Hue is a web server (Django-based) which acts as a view on top of Hadoop. So Hue just needs to be installed and then configured by adding the hosts of the NameNode, JobTracker, Resource Manager, Oozie, HiveServer, etc. in its hue.ini, as sketched below.
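For instance, a minimal hue.ini sketch for a vanilla Hadoop 2.x cluster might look like this (the host names and ports are placeholders, not values from this thread; the section layout follows Hue's shipped hue.ini):
[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      # NameNode RPC endpoint and WebHDFS URL (assumed host/ports)
      fs_defaultfs=hdfs://namenode-host:8020
      webhdfs_url=http://namenode-host:50070/webhdfs/v1
  [[yarn_clusters]]
    [[[default]]]
      # ResourceManager host and its REST API (assumed host/ports)
      resourcemanager_host=resourcemanager-host
      resourcemanager_api_url=http://resourcemanager-host:8088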
Also, as detailed on gethue.com/releases, the version you need might depend on your Hive version.
Notice that without Cloudera's distribution your mileage might vary, but feel free to chime in on the Hue user list or gethue.com ;)
We are also looking at improving the Hue setup with Amazon AWS/EMR!
To build and run Hue 3.6.0 with Apache Hadoop 2.4.1:
$ git clone https://github.com/cloudera/hue.git
(Note: releases/tag/release-3.6.0 is unstable; it's better to build from the latest master. I built from Aug 7, 87d6b2da1 - it's stable.)
$ cd hue
$ vi maven/pom.xml
change hadoop.version to 2.4.1
replace hadoop-core with hadoop-common
set hadoop-test version to 1.2.1
Remove the files which need Hadoop MR1:
$ rm desktop/libs/hadoop/java/src/main/java/org/apache/hadoop/mapred/ThriftJobTrackerPlugin.java
$ rm desktop/libs/hadoop/java/src/main/java/org/apache/hadoop/thriftfs/ThriftJobTrackerPlugin.java
Build Hue:
$ make apps
Configure Hue:
$ vi desktop/conf/pseudo-distributed.ini
Run the Hue server in dev mode:
$ build/env/bin/hue runserver 0.0.0.0:8000
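If the build succeeds, the Hue dev server should then be reachable in a browser at http://localhost:8000, i.e. the address and port passed to runserver above.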
Follow the Hue manual installation steps from the Hortonworks documentation; it takes you step by step through how to do it manually.
Quote: "...without Cloudera's distribution your mileage might vary...."
Indeed, it will vary A LOT! It would seem that the following is quite true.
Per the install guide:
http://cloudera.github.io/hue/docs-2.0.1/manual.html#_install_hue
NOTE:
Hue requires the Hadoop contained in Cloudera’s Distribution including Apache Hadoop (CDH), version 3 update 4 or later.
I've tried it and have run into walls with Hue trying to connect to Hive, Pig and Oozie.
At this stage - from my experience at least - Hue will NOT run on a standard Apache Hadoop installation using standard Apache tools like Hive and Pig. It must be a vintage of Cloudera’s Distribution.
If anyone has any other (positive) experiences installing Hue outside of the Cloudera’s Distribution, I'd be quite interested to hear about them...

CDH4 installation using tarball

I have been struggling to install CDH via tarball; there is no document that describes the steps or guides you through them. I do have root access on the server and wish to install CDH4 via tarball in pseudo-distributed mode. Can anyone help? On the same server, Apache Hadoop is also installed, and I want to install CDH4 without affecting the existing Apache Hadoop.
It will not work, because in the end CDH4 will use the same ports that your existing Apache Hadoop is using. It will work if you shut down your existing Hadoop cluster and then start your CDH4 cluster. Or else change all the port numbers for the namenode, secondary namenode, jobtracker, tasktracker and datanode, and their respective web UI ports, which is kind of tedious. It would also be helpful if you provided some error logs, so I can highlight what exactly the problem is.
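If you do want both stacks running at once, the port overrides go into CDH4's own configuration directory so the existing Apache Hadoop files stay untouched. A minimal sketch (the replacement port numbers and the $CDH_HOME path are arbitrary assumptions):
# Give CDH4 a non-default NameNode RPC port so it can coexist with Apache Hadoop.
cat > $CDH_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9020</value>  <!-- default RPC port is 8020 -->
  </property>
</configuration>
EOF
# Likewise move dfs.namenode.http-address (default 50070) and dfs.datanode.address
# (default 50010) in hdfs-site.xml, and the jobtracker/tasktracker ports in
# mapred-site.xml, before starting the CDH4 daemons.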
