Deploy Mahout jobs on a cluster - hadoop

I'm new to Hadoop/Mahout. I understand the concepts, but I'm having issues deploying Mahout jobs to an already set up cluster of computers.
I have used Mahout on a single computer, but what should I do to get it up and running on an already formed Hadoop cluster?
I have a cluster with Hadoop 0.20.2 installed, and Mahout 0.9, which bundles Hadoop 1.2.1. What jars should I copy so that I can run code that contains Mahout calls, or what else should I do to make it work on the Hadoop cluster?
Any suggestion/example/tutorial would be great.
Thanks

An important link for your problem:
https://mahout.apache.org/users/clustering/k-means-commandline.html
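For orientation, here is a minimal sketch of the distributed k-means invocation that page documents, run from a machine whose Hadoop client points at your cluster; all paths and parameter values below are placeholders, not taken from the question:

    # Point Mahout at the cluster's Hadoop configuration (path is an assumption)
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    # Distributed k-means over vectors already stored in HDFS; paths and values are hypothetical
    bin/mahout kmeans \
      -i /user/me/vectors \
      -c /user/me/initial-clusters \
      -o /user/me/kmeans-output \
      -k 10 -x 20 -ow -cl

In that setup the mahout script submits a self-contained job jar, so the worker nodes do not need Mahout installed (see also the last answer below).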

Related

How to set up Spark on multi-node Hadoop cluster?

I would like to install Hadoop HDFS and Spark on a multi-node cluster.
I was able to successfully install and configure Hadoop on the multi-node cluster. I have also installed and configured Spark on the master node.
Do I have to configure Spark on the slave nodes as well?
You don't need to. You're done. You did more than was necessary to submit Spark applications to Hadoop YARN (which I conclude is the cluster manager).
Spark is a library for distributed computations on massive datasets and as such it belongs solely to your Spark applications (not any cluster you may use).
Time to spark-submit Spark applications!
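For example, a minimal sketch of submitting the bundled SparkPi example to YARN from the master node; the config path and the examples-jar location are assumptions that depend on your Spark distribution:

    # Tell spark-submit where the cluster's Hadoop/YARN configuration lives (path is an assumption)
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    # Submit to YARN in cluster mode; nothing Spark-specific needs to be installed on the slaves
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 3 \
      examples/jars/spark-examples_*.jar 100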

How to install Apache Spark on HortonWorks HDP 2.2 (built using Ambari)

I successfully built a 5-node cluster of HortonWorks HDP 2.2 using Ambari.
However, I don't see Apache Spark in the installed services list.
I did some research and found that Ambari does not install certain components such as Hue (Spark was not in that list, but I guess it's not installed either).
How do I do a manual install of Apache Spark on my 5-node HDP 2.2 cluster?
Or should I delete my cluster and perform a fresh install without using Ambari?
Hortonworks support for Spark is arriving but not fully complete (details and blog).
Instructions for how to integrate Spark with HDP can be found here.
You could build your own Ambari Stack for Spark. I recently did just that, but I cannot share that code :(
What I can do is share a tutorial I wrote on how to build any stack for Ambari, including Spark. There are many interesting issues with Spark that need to be addressed and are not covered by the tutorial. Anyway, hope it helps. http://bit.ly/1HDBgS6
There is also a guide from the Ambari people here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=38571133.
1) Ambari 1.7.x does not install the Accumulo, Hue, Ranger, or Solr services for the HDP 2.2 stack. To install the Accumulo, Hue, Knox, Ranger, and Solr services, install HDP manually.
2) Apache Spark 1.2.0 on YARN with HDP 2.2: here.
3) Spark and Hadoop: Working Together:
Standalone deployment: With the standalone deployment one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MR. The user can then run arbitrary Spark jobs on her HDFS data. Its simplicity makes this the deployment of choice for many Hadoop 1.x users.
Hadoop Yarn deployment: Hadoop users who have already deployed or are planning to deploy Hadoop Yarn can simply run Spark on YARN without any pre-installation or administrative access required. This allows users to easily integrate Spark in their Hadoop stack and take advantage of the full power of Spark, as well as of other components running on top of Spark.
Spark in MapReduce (SIMR): For Hadoop users that are not running YARN yet, another option, in addition to the standalone deployment, is to use SIMR to launch Spark jobs inside MapReduce. With SIMR, users can start experimenting with Spark and use its shell within a couple of minutes of downloading it! This tremendously lowers the barrier of deployment, and lets virtually everyone play with Spark.
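To make the first two options concrete, here is a rough sketch of how job submission differs between them; the hostnames, jar paths, and the Spark 1.2-era YARN flag are illustrative assumptions:

    # Standalone deployment: Spark's own master/workers run beside Hadoop MapReduce
    # (started beforehand with sbin/start-master.sh and sbin/start-slaves.sh)
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://master-host:7077 \
      lib/spark-examples-*.jar 100

    # YARN deployment: no Spark daemons to manage; YARN allocates the executors
    export HADOOP_CONF_DIR=/etc/hadoop/conf   # assumed HDP client-config path
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn-cluster \
      --num-executors 3 \
      lib/spark-examples-*.jar 100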

Should oozie be installed on all the hadoop nodes inside a single hadoop cluster?

I am running Oozie over Hadoop 1.0.3. I wanted to find out whether Oozie has to be installed on all the Hadoop nodes inside a single cluster? Is it sufficient to install it on the (Hadoop) master node only? I searched through the Oozie documentation, but could not find the answer to my question.
Thank you,
Mohsin.
Oozie need not be installed on all the nodes in a cluster. It can be installed on a dedicated machine or along with any other framework. Check this guide for a quick installation of Oozie.
Note that Oozie has a client and a server component. The server component has a scheduler and also a workflow engine, and the workflow engine uses hPDL (Hadoop Process Definition Language) for defining workflows.
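As a sketch of what that looks like in practice, only the machine running the Oozie server (plus any machine with the Oozie client) needs Oozie installed; the server URL, properties file, and job id below are placeholders:

    # Submit and start a workflow from the Oozie client; the actions themselves run on the Hadoop cluster
    oozie job -oozie http://oozie-server:11000/oozie \
      -config /path/to/job.properties -run

    # Check the status of a running workflow (the job id is hypothetical)
    oozie job -oozie http://oozie-server:11000/oozie \
      -info 0000001-140101000000000-oozie-oozi-W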

Running Mahout on Hadoop Cluster

I am a Mahout/Hadoop beginner.
I am trying to run the Mahout examples given in the "Mahout in Action" book. I am able to run the examples in Eclipse without Hadoop.
Can you please let me know how to run the same examples on a Hadoop cluster?
This wiki page lists the different algorithms implemented in Mahout and how to run them. Many of them take the option below as an argument:
-xm "execution method: sequential or mapreduce"
Mahout's requirements mention that it works on Hadoop 0.20.0+. See this tutorial on how to set up Hadoop on a single node and on a multi-node cluster on Ubuntu.
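As a sketch of moving from Eclipse to the cluster, the stock synthetic-control k-means example can be submitted with hadoop jar from any node that has a Hadoop client; the data file, HDFS path, and Mahout version below are assumptions, not taken from the book:

    # Upload the sample data set to the HDFS directory the example reads by default
    hadoop fs -mkdir testdata
    hadoop fs -put synthetic_control.data testdata/

    # Submit Mahout's self-contained examples job jar to the cluster
    hadoop jar $MAHOUT_HOME/mahout-examples-0.9-job.jar \
      org.apache.mahout.clustering.syntheticcontrol.kmeans.Job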

Deploying Mahout on hadoop cluster

I want to run Mahout's K-Means example on a Hadoop cluster of 5 machines. Which Mahout jar files do I need to keep on all the nodes in order for K-Means to be executed in a distributed manner?
Thanks.
-Venkiram
If you really just want to run the built-in K-Means, or other jobs with static drivers, the answer is 'none'. The Mahout 'job' jars are self-contained Hadoop job jars. If you submit a job to the cluster with 'hadoop jar', it will work without any other jars.
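A minimal sketch of that, assuming the Mahout 0.x job jar is present only on the machine you submit from; every path and parameter value here is a placeholder:

    # Distributed k-means via the driver class inside the self-contained job jar;
    # the worker nodes need no Mahout jars, the job jar travels with the job
    hadoop jar mahout-core-0.9-job.jar \
      org.apache.mahout.clustering.kmeans.KMeansDriver \
      -i /user/me/vectors \
      -c /user/me/initial-clusters \
      -o /user/me/kmeans-output \
      -k 10 -x 20 -ow -cl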
