How to run Mahout jobs on Spark Engine? - hadoop

Currently I’m doing some document similarity analysis using the Mahout RowSimilarity job. This can easily be done by running the command ‘mahout rowsimilarity…’ from the console. However, I noticed that this job can also be run on the Spark engine. I would like to know how I can run this job on Spark.

You can use Spark's MLlib as an alternative to Mahout. All the algorithms in MLlib run in distributed mode (comparable to MapReduce on Hadoop).
Mahout 0.10 also provides job execution on Spark.
More detail at this link:
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
Steps to set up Spark with Mahout:
1. Go to the directory where you unpacked Spark and type sbin/start-all.sh to start Spark locally.
2. Open a browser and point it to http://localhost:8080/ to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with spark://).
3. Define the following environment variables:
export MAHOUT_HOME=[directory into which you checked out Mahout]
export SPARK_HOME=[directory where you unpacked Spark]
export MASTER=[url of the Spark master]
4. Finally, change to the directory where you unpacked Mahout and type bin/mahout spark-shell. You should see the shell starting and get the prompt mahout>. Check the FAQ for further troubleshooting.
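Putting those steps together, a minimal session could look like the following sketch (the directories and the spark:// URL are placeholders; copy the real master URL from the web UI at http://localhost:8080/):
# start Spark locally (assuming it was unpacked in ~/spark)
cd ~/spark
sbin/start-all.sh
# point the Mahout shell at this Spark master (placeholder values)
export MAHOUT_HOME=~/mahout
export SPARK_HOME=~/spark
export MASTER=spark://localhost:7077
# launch the Mahout Spark shell
cd $MAHOUT_HOME
bin/mahout spark-shell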

Please visit the link. It uses the new Mahout 0.10 and runs the jobs on a Spark server.

Related

Is there a way to load the install-interpreter.sh file in EMR in order to load 3rd party interpreters?

I have an Apache Zeppelin notebook running and I'm trying to load the jdbc and/or postgres interpreter to my notebook in order to write to a postgres DB from Zeppelin.
The main documentation for loading new interpreters (here) tells me to run the command below to get other interpreters:
./bin/install-interpreter.sh --all
However, when I run this command in EMR terminal, I find that the EMR cluster does not come with an install-interpreter.sh executable file.
What is the recommended path?
1. Should I find the install-interpreter.sh file and load that to the EMR cluster under ./bin/?
2. Is there an EMR configuration on start time that would enable the install-interpreter.sh file?
Currently all tutorials and documentation assume that you can run the install-interpreter.sh file.
The solution is to not run the command below from the root directory (i.e. ./):
./bin/install-interpreter.sh --all
Instead, on EMR, run the command above from the Zeppelin installation directory, which on the EMR cluster is /usr/lib/zeppelin.
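For example, a rough sketch on the EMR master node (the --name flag and the restart step are assumptions based on the stock Zeppelin install-interpreter.sh usage; the exact restart command depends on the EMR release):
cd /usr/lib/zeppelin
sudo ./bin/install-interpreter.sh --name jdbc
# or install everything:
sudo ./bin/install-interpreter.sh --all
# restart Zeppelin afterwards so the new interpreter is picked up
# (e.g. sudo stop zeppelin && sudo start zeppelin, or sudo systemctl restart zeppelin)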

Create hdfs when using integrated spark build

I'm working with Windows and trying to set up Spark.
Previously I installed Hadoop in addition to Spark, edited the config files, ran hadoop namenode -format and away we went.
I'm now trying to achieve the same by using the bundled version of Spark that is pre-built with Hadoop - spark-1.6.1-bin-hadoop2.6.tgz
So far it's been a much cleaner, simpler process; however, I no longer have access to the command that creates the HDFS, the config files for HDFS are no longer present, and there is no 'hadoop' in any of the bin folders.
There wasn't a Hadoop folder in the Spark install, so I created one for winutils.exe.
It feels like I've missed something. Do the pre-built versions of spark not include hadoop? Is this functionality missing from this variant or is there something else that I'm overlooking?
Thanks for any help.
By saying that Spark is built with Hadoop, it is meant that Spark is built with the dependencies of Hadoop, i.e. with the clients for accessing Hadoop (or HDFS, to be more precise).
Thus, if you use a version of Spark which is built for Hadoop 2.6, you will be able to access the HDFS filesystem of a cluster running Hadoop 2.6 via Spark.
It doesn't mean that Hadoop is part of the package, or that downloading it installs Hadoop as well. You have to install Hadoop separately.
If you download a Spark release without Hadoop support, you'll need to include the Hadoop client libraries in all the applications you write that are supposed to access HDFS (via textFile, for instance).
I am also using the same Spark build on Windows 10. What I did was create a C:\winutils\bin directory and put winutils.exe there, then create the environment variable HADOOP_HOME=C:\winutils. If you have set all the environment variables and PATH entries such as SPARK_HOME, HADOOP_HOME, etc., then it should work.
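As a rough sketch of that setup in a Windows command prompt (the Spark directory below is just an assumed unpack location for spark-1.6.1-bin-hadoop2.6.tgz, and setx only affects newly opened shells):
mkdir C:\winutils\bin
REM copy winutils.exe into C:\winutils\bin, then:
setx HADOOP_HOME C:\winutils
setx SPARK_HOME C:\spark-1.6.1-bin-hadoop2.6
REM also add %HADOOP_HOME%\bin and %SPARK_HOME%\bin to PATH (e.g. via System Properties)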

what should be the correct flow of data in hadoop and mahout?

I am working with hadoop, hive and mahout technology.
I am processing some data with a mapreduce job in hadoop for recommendation purposes in mahout.
I want to know the correct workflow of the above model, i.e. when Hadoop processes the data and stores it in HDFS, how will Mahout get and use this data, and after Mahout processes the data, where will Mahout put the recommendation output?
Note: I am working with Hadoop for processing the data, and my colleague is working with Mahout on a different machine.
Hope you got my question correctly.
If you want Mahout to take its input from HDFS, you have to do the following steps.
First, copy the input file to HDFS with the command:
hadoop dfs -copyFromLocal input /
Then run the Mahout recommendation command, which takes its input from HDFS and saves its output to HDFS.
Assuming your JAVA_HOME is set appropriately and Mahout was installed properly, we're ready to run the job. Enter the following command:
$ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i hdfs://localhost:9000/inputfile -o hdfs://localhost:9000/output --numRecommendations 25
Running the command will execute a series of jobs, the final product of which will be an output file deposited in the directory specified in the command. The output file will contain two columns: the userID and an array of itemIDs and scores.
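Once the job finishes, you can inspect the result directly in HDFS, for example as below (the part-r-00000 file name is the usual MapReduce naming convention and may differ in your run):
hadoop dfs -ls hdfs://localhost:9000/output
hadoop dfs -cat hdfs://localhost:9000/output/part-r-00000 | head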
It all depends on how Mahout is configured to run. Mahout can run in local mode or in distributed mode; this is controlled by the MAHOUT_LOCAL variable:
MAHOUT_LOCAL - set to anything other than an empty string to force Mahout to run locally, even if HADOOP_CONF_DIR and HADOOP_HOME are set.
For example, if MAHOUT_LOCAL is not configured and you execute any Mahout algorithm, you will see the following in the console:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop,
When running in distributed mode, Mahout treats all paths as HDFS paths, so after Mahout has processed your data, the final output will be stored in HDFS.
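As a quick sketch of toggling between the two modes:
# force Mahout to run locally (input/output on the local filesystem)
export MAHOUT_LOCAL=true
# unset it again so Mahout picks up HADOOP_CONF_DIR/HADOOP_HOME and runs on HDFS
unset MAHOUT_LOCAL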

configure hive with hadoop

I have configured Hadoop 2.2.0 as a single-node cluster (and was able to run the example jar).
Now I need to make Hive run queries using this Hadoop installation.
Should I set the mapred.job.tracker property to yarn.resourcemanager.resource-tracker.address?
I tried that, but I can't see the data loaded into Hive tables in HDFS.
I don't have enough reputation points to add a comment, so trying to help via an answer.
What are the daemons currently running for Hadoop? Use ps -eaf | grep "java" to check.
Do you see the JobTracker running or the ResourceManager?
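A quick way to list the running daemons is the JDK's jps tool (a minimal check, assuming the JDK bin directory is on your PATH):
jps
# on a working Hadoop 2.x single-node setup you would expect to see
# NameNode, DataNode, ResourceManager and NodeManager listed (and no JobTracker)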
Also, can you elaborate on the steps you performed to install Hive?
I have a screencast, Installing Apache Hive, that walks you through installing Hive. Next, you can follow my blog post Apache Hive - Getting Started. Hope this helps.

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a hadoop cluster with CDH4.2.0 installed and ran into this error. It's been fixed in later versions of hadoop but I don't have access to update the cluster. Is there a way to tell hadoop to use this jar when running my job through the command line arguments like
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket. Or must hadoop be completely recompiled with this new jar? I'm new to hadoop so I don't know all the command line options that are possible.
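For reference, extra jars and properties are normally passed through Hadoop's generic options, something like the sketch below; the main class name is hypothetical, the mapreduce.job.user.classpath.first property is an assumption that may not exist or help on CDH4.2.0, and this only works if the job's main class uses ToolRunner/GenericOptionsParser:
hadoop jar MyJob.jar com.example.MyJob \
  -libjars /path/to/hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar \
  -D mapreduce.job.user.classpath.first=true \
  /input /output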
I'm not sure how that would work, as when you're executing the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The ticket says the issue only occurs when you're using MR2, so unless you really need YARN you're probably better off using the MR1 library to run your map/reduce job.
