Deploying Mahout on hadoop cluster - hadoop

I want to run Mahout's K-Means example on a Hadoop cluster of 5 machines. Which Mahout jar files do I need to keep on all the nodes so that K-Means is executed in a distributed manner?
Thanks.
-Venkiram

If you really just want to run the built-in K-Means, or other jobs with static drivers, the answer is 'none'. The Mahout 'job' jars are self-contained Hadoop job jars. If you submit one to the cluster with 'hadoop jar', it will work without any other jars.
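As a rough illustration of that submission step, the sketch below assembles a `hadoop jar` command line in Python. The jar file name and the choice of the synthetic-control K-Means example driver as the main class are assumptions made for illustration; substitute the job jar and driver class from your own Mahout build.

```python
import shlex


def hadoop_jar_command(job_jar, main_class, *args):
    """Build the argv for submitting a self-contained job jar with `hadoop jar`."""
    return ["hadoop", "jar", job_jar, main_class, *args]


# Assumed names: adjust both to match your Mahout distribution.
cmd = hadoop_jar_command(
    "mahout-examples-0.9-job.jar",
    "org.apache.mahout.clustering.syntheticcontrol.kmeans.Job",
)

# Print the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine that has the Hadoop client configured for the cluster; nothing needs to be copied to the worker nodes beforehand.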

Related

Spark cluster - read/write on hadoop

I would like to read data from Hadoop, process it in Spark, and write the results to Hadoop and Elasticsearch. I have a few worker nodes to do this.
Is a Spark standalone cluster sufficient, or do I need to set up a Hadoop cluster to use YARN or Mesos?
If standalone cluster mode is sufficient, does the jar file have to be placed on every node, unlike in YARN or Mesos mode?
First of all, strictly speaking you do not read from or write to Hadoop itself. HDFS, a component of the Hadoop ecosystem, is what stores the data and handles reads and writes.
Now, coming to your question:
Yes, it is possible to read data from HDFS, process it in the Spark engine, and then write the output back to HDFS.
YARN, Mesos, and Spark standalone are all cluster managers. You can use any one of them to manage the resources in your cluster, and that choice is independent of where your data is stored. But since you want to read and write data from/to HDFS, you need HDFS on the cluster, so it is simplest to install Hadoop on all your nodes, which gives you HDFS on all of them as well. Whether you then use YARN, Mesos, or Spark standalone is your choice; all of them work with HDFS. I myself use Spark standalone for cluster management.
It is not clear which jar files you are talking about, but I assume you mean the Spark jars. In that case, yes, you need to set the path to the Spark jars on each node, so that the paths are consistent when Spark runs.
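To make the point about interchangeable cluster managers concrete, here is a minimal sketch that builds a `spark-submit` command for each manager while keeping the same HDFS input and output paths. All host names, ports, paths, and the application file `process.py` are hypothetical placeholders, and the plain `yarn` master value assumes a reasonably recent Spark release.

```python
import shlex

# Master URL forms accepted by spark-submit; hosts and ports are placeholders.
MASTER_URLS = {
    "standalone": "spark://master-host:7077",
    "yarn": "yarn",
    "mesos": "mesos://mesos-host:5050",
}


def spark_submit_command(manager, app, input_path, output_path):
    """Build a spark-submit argv; only the --master value depends on the manager."""
    return ["spark-submit", "--master", MASTER_URLS[manager], app, input_path, output_path]


# The HDFS paths stay the same regardless of which manager runs the job.
for manager in MASTER_URLS:
    cmd = spark_submit_command(
        manager,
        "process.py",
        "hdfs://namenode-host:8020/data/input",
        "hdfs://namenode-host:8020/data/output",
    )
    print(shlex.join(cmd))
```

Only the `--master` argument changes between the three commands, which is why the choice of manager can be made separately from the decision to store data in HDFS.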

MapReduce 2 without YARN

Granted that YARN is the better option for running MapReduce 2, is it possible to run MR2 without YARN?
I tried using MR2, but it runs with YARN.
MRv2 is actually YARN! So no, you can't run MapReduce 2 jobs without YARN!
Official documentation:
Apache Hadoop NextGen MapReduce (YARN)
MapReduce has undergone a complete overhaul in hadoop-0.23 and we now
have, what we call, MapReduce 2.0 (MRv2) or YARN.
The fundamental idea of MRv2 is to split up the two major
functionalities of the JobTracker, resource management and job
scheduling/monitoring, into separate daemons. The idea is to have a
global ResourceManager (RM) and per-application ApplicationMaster
(AM). An application is either a single job in the classical sense of
Map-Reduce jobs or a DAG of jobs.

Deploy Mahout jobs on a cluster

I'm new to Hadoop/Mahout. I understand the concepts, but I'm having trouble deploying Mahout jobs to an already configured cluster of computers.
I have used Mahout on a single computer, but what should I do to get it up and running on an existing Hadoop cluster?
I have a cluster with Hadoop 0.20.2 installed, and Mahout 0.9, which bundles Hadoop 1.2.1. Which jars should I copy so that I can run code that contains Mahout calls, or what else should I do to make it work on the Hadoop cluster?
Any suggestion/example/tutorial would be great.
Thanks
This link is relevant to your problem:
https://mahout.apache.org/users/clustering/k-means-commandline.html

Can an oozie instance run jobs on multiple hadoop clusters at the same time?

I have a development Hadoop cluster available for running test jobs, as well as a production cluster. My question is: can I use a single Oozie instance to kick off workflow jobs on multiple clusters?
What are the gotchas? I'm assuming I can just reconfigure the job tracker, namenode, and fs location properties for my workflow depending on which cluster I want the job to run on.
Assuming the clusters are all running the same distribution and version of Hadoop, you should be able to.
As you note, you'll need to adjust the jobtracker and namenode values in your Oozie actions.
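One way to picture the per-cluster adjustment is a small generator for `job.properties` files, sketched below. The property names `nameNode` and `jobTracker` follow the convention used in Oozie's bundled examples and only take effect if the workflow refers to them as `${nameNode}` and `${jobTracker}`; the host names, ports, and application path are hypothetical.

```python
# Hypothetical endpoints for the two clusters.
CLUSTERS = {
    "dev": {"name_node": "hdfs://dev-nn:8020", "job_tracker": "dev-jt:8021"},
    "prod": {"name_node": "hdfs://prod-nn:8020", "job_tracker": "prod-jt:8021"},
}


def job_properties(name_node, job_tracker, app_path):
    """Render a job.properties body that points a workflow at one cluster."""
    lines = [
        f"nameNode={name_node}",
        f"jobTracker={job_tracker}",
        # ${nameNode} is left for Oozie to resolve when the job is submitted.
        f"oozie.wf.application.path=${{nameNode}}{app_path}",
    ]
    return "\n".join(lines) + "\n"


# Same workflow, one properties file per target cluster.
for name, cfg in CLUSTERS.items():
    print(f"# {name}")
    print(job_properties(cfg["name_node"], cfg["job_tracker"], "/user/me/workflows/example"))
```

Keeping the workflow definition identical and varying only the properties file makes it harder for the two clusters' configurations to drift apart.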

Running Mahout on Hadoop Cluster

I am a Mahout/Hadoop beginner.
I am trying to run the Mahout examples given in the book "Mahout in Action". I am able to run the examples in Eclipse without Hadoop.
Can you please let me know how to run the same examples on a Hadoop cluster?
This wiki page lists the different algorithms implemented in Mahout and how to run them. Many of them take the following argument:
-xm "execution method: sequential or mapreduce"
The Mahout requirements mention that it works on Hadoop 0.20.0+. See this tutorial on how to set up Hadoop on a single node and on a multi-node cluster on Ubuntu.
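To show where the `-xm` argument fits, the sketch below builds a `mahout kmeans` command line and lets the caller switch between the two execution methods. The HDFS directory names and the values for `-k` and `-x` are made-up placeholders, and the sketch assumes the `mahout` launcher script is on the PATH.

```python
import shlex


def mahout_kmeans_command(input_dir, clusters_dir, output_dir, k, max_iter, method="mapreduce"):
    """Build a `mahout kmeans` argv with an explicit execution method (-xm)."""
    if method not in ("sequential", "mapreduce"):
        raise ValueError("method must be 'sequential' or 'mapreduce'")
    return [
        "mahout", "kmeans",
        "-i", input_dir,      # input vectors
        "-c", clusters_dir,   # initial clusters
        "-o", output_dir,     # output directory
        "-k", str(k),         # number of clusters
        "-x", str(max_iter),  # maximum number of iterations
        "-xm", method,        # execution method
    ]


# Placeholder paths and parameters for a run on the cluster.
cmd = mahout_kmeans_command("kmeans/input", "kmeans/clusters", "kmeans/output", 5, 10)
print(shlex.join(cmd))
```

Passing `method="sequential"` instead produces the same command with `-xm sequential`, which runs the algorithm in a single process rather than as MapReduce jobs.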

Resources