what should be the correct flow of data in hadoop and mahout? - hadoop

I am working with hadoop, hive and mahout technology.
I am processing some data with a mapreduce job in hadoop for recommendation purposes in mahout.
I want to know the correct workflow of above model, i.e when hadoop processes the data and stores it in HDFS, then how will mahout use this data and how will mahout get this data and after mahout processes the data, where will mahout put this recommended data?
Note: I am working with hadoop for processing the data and my colleague is working with mahout on a different machine .
Hope u got my question correctly.

If you want to take input from hadoop hdfs in mahout then you have to do following steps-
first copy input file to hdfs by command
hadoop dfs -copyFromLocal input /
Then run the mahout command for recommendation which take input from hdfs and save the output in hdfs
Assuming your JAVA_HOME is appropriately set and Mahout was installed properly we’re ready to configure our syntax. Enter the following command:
$ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i hdfs://localhost:9000/inputfile -o hdfs://localhost:9000/output --numRecommendations 25
Running the command will execute a series of jobs the final product of which will be an output file deposited to the directory specified in the command syntax. The output file will contain two columns: the userID and an array of itemIDs and scores.

It all depends on how Mahout is configured to run. Mahout can run in local mode or distributed mode. We need to set the "MAHOUT_LOCAL" variable.
MAHOUT_LOCAL set to anything other than an empty string to force
mahout to run locally even if
HADOOP_CONF_DIR and HADOOP_HOME are set
For example, If we don't configure MAHOUT_LOCAL and tries to execute any Mahout algorithm, Then you can see below in the console.
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop,
When running in distributed mode, Mahout treats all the paths as HDFS path's. So even after Mahout processing your data, final output will be stored in HDFS.

Related

How to run Mahout jobs on Spark Engine?

Currently I’m doing some document similarity analysis using Mahout RowSimilarity Job. This can be easily done be running command ‘mahout rowsimilarity…’ from the console. However I noticed that this Job is also supported to be run on Spark engine. I wonder to know how I can run this Job on Spark Engine.
You can use MLlib alternate of mahout in spark. All library in MLlib are processing in distributed mode(Map-reduce in Hadoop).
In Mahout 0.10 provide job execution with spark.
More detail Link
http://mahout.apache.org/users/sparkbindings/play-with-shell.html
step to setup spark with mahout.
1 Goto the directory where you unpacked Spark and type sbin/start-all.sh to locally start Spark
2 Open a browser, point it to http://localhost:8080/ to check whether Spark successfully started. Copy the url of the spark master at the top of the page (it starts with spark://)
3 Define the following environment variables:
export MAHOUT_HOME=[directory into which you checked out Mahout]
export SPARK_HOME=[directory where you unpacked Spark]
export MASTER=[url of the Spark master]
4 Finally, change to the directory where you unpacked Mahout and type bin/mahout spark-shell, you should see the shell starting and get the prompt mahout>. Check FAQ for further troubleshooting.
Please visit link.It uses new mahout 0.10 and works uses spark server.

Configuring pig relation with Hadoop

I'm having troubles understanding the relation between Hadoop and Pig.
I understand Pig's purpose is to hide the MapReduce pattern behind a scripting language, Pig Latin.
What I don't understand is how Hadoop and Pig are linked. So far, the only installation procedures seem to assume that pig is run on the same machine as the main hadoop node.
Indeed, it uses the hadoop configuration files.
Is this because pig only translates the scripts into mapreduce code and send them to hadoop ?
If that's the case, how could I configure Pig in order to make it send the scripts to a distant server ?
If not, does it mean we always need to have hadoop running within pig ?
Pig can run in two modes:
Local mode. In this mode Hadoop cluster is not used at all. All processes run in single JVM and files are read from the local filesystem. To run Pig in local mode, use the command:
pig -x local
MapReduce Mode. In this mode Pig converts scripts to MapReduce jobs and run them on Hadoop cluster. It is the default mode.
Cluster can be local or remote. Pig uses the HADOOP_MAPRED_HOME environment variable to find Hadoop installation on local machine (see Installing Pig).
If you want to connect to remote cluster, you should specify cluster parameters in the pig.properties file. Example for MRv1:
fs.default.name=hdfs://namenode_address:8020/
mapred.job.tracker=jobtracker_address:8021
You can also specify remote cluster address at the command line:
pig -fs namenode_address:8020 -jt jobtracker_address:8021
Hence, you can install Pig to any machine and connect to remote cluster. Pig includes Hadoop client, therefore you don't have to install Hadoop to use Pig.

Output Folders for Amazon EMR

I want to jun a custom jar, whose main class a chain of map reduce jobs, with the output of the first job going as the input of the second jar, and so on.
What do I set in FileOutputFormat.setOutputPath("what path should be here?");
If I specify -outputdir in the argument, I get the error FileAlraedy exists. If I don't specify, then I do not know where the ouput will land. I want to be able to see the ouput from every job of the chained mapreduce jobs.
Thanks in adv. Pls help!
You are likely getting the "FileAlraedy exists" error because that output directory exists prior to the job you are running. Make sure to delete the directories that you specify as output for your Hadoop jobs; otherwise you will not be able to run those jobs.
Good practice is to take output from command line as it will increase flexibility of your code And you will compile your jar only once provided the changes are related to your path.
for EMR if you launch your cluster and compile your jar
For eg.
dfs_ip_folder=HDFS_IP_DIR
dfs_op_folder=HDFS_OP_DIR
hadoop jar hadoop-examples-*.jar wordcount ${dfs_ip_folder} ${dfs_op_folder}
Note : you have to create dfs_ip_folder and store input data inside it.
dfs_op_folder will be created automatically on HDFS not on local file system
To access the HDFS op folder either you can copy it to local file system or you can do cat.
eg.
hadoop fs -cat ${dfs_op_folder}/<file_name>
hadoop fs -copyToLocal ${dfs_op_folder} ${your_local_input_dir_path}

PIG automatically connected with default HDFS, how?

I just started learning Hadoop and PIG (from last two days!) for one of my future project.
For experiments I've installed Hadoop (HDFS on default localhost:9000) as pseudo distributed mode and PIG (map-reduce mode).
When I initialized PIG by typing ./bin/pig command it launched GRUNT command line and I got message that pig connected with HDFS (localhost:9000), later I could successfully able to access HDFS thru pig.
I was expecting to perform some manual configuration for PIG to access HDFS (as per various internet articles).
My question is, from where PIG identified default HDFS configuration (localhost:9000)? I checked pig.properties but I didn't find anything there. I need this info as I might change default HDFS configuration in future.
BTW, I have HADOOP_HOME and PIG_HOME defined in my OS PATH variable.
When installing Pig (I assume v0.10.0) you have to tell how it will connect to the HDFS.
I don't know how you did this but generally this is done by adding the hadoop conf dir path to the PIG_CLASSPATH environment variable. You can also set HADOOP_CONF_DIR as well.
If you are starting the grunt shell Pig will locate the directory of the Hadoop configuration XMLs, and takes the value of fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml) , i.e: the location of the Namenode and JobTracker.
For reference you may have a look at the Pig shell script to see how env. variables are collected and evaluated.
PIG can connects to underlying HDFS in the 3 ways
1-
Pig uses HADOOP_HOME for finding the HADOOP client to Run.
your HADOOP_HOME should have been already setup in your bash_profile
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
2-
or else there might be possibility that your HADOOP_CONF_DIR has already been setup which contains the xml file for the hadoop configuration
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
3-And if these are not setup you can also connect to underlying hdfs
by changing the pig.properties which is present under PIG_HOME/conf dir

hadoop - Where are input/output files stored in hadoop and how to execute java file in hadoop?

Suppose I write a java program and i want to run it in Hadoop, then
where should the file be saved?
how to access it from hadoop?
should i be calling it by the following command? hadoop classname
what is the command in hadoop to execute the java file?
The simplest answers I can think of to your questions are:
1) Anywhere
2,3,4)$HADOOP_HOME/bin/hadoop jar [path_to_your_jar_file]
A similar question was asked here Executing helloworld.java in apache hadoop
It may seem complicated, but it's simpler than you might think!
Compile your map/reduce classes, and your main class into a jar. Let's call this jar myjob.jar.
This jar does not need to include the Hadoop libraries, but it should include any other dependencies you have.
Your main method should set up and run your map/reduce job, here is an example.
Put this jar on any machine with the hadoop command line utility installed.
Run your main method using the hadoop command line utility:
hadoop jar myjob.jar
Hope that helps.
where should the file be saved?
The data should be saved in "hdfs". You will want to probably load it into the cluster from your data source using something like Apache Flume. The file can be placed anywhere but most home is /user/hadoop/
how to access it from hadoop?
SSH into the hadoop cluster headnode like a standard linux server.
To list your hadoop root hdfs
hadoop fs -ls /
should i be calling it by the following command? hadoop classname
You should be using the hadoop command to access your data and run your programs, try hadoop help
what is the command in hadoop to execute the java file?
hadoop -jar MyJar.jar com.mycompany.MainDriver arg[0] arg[1] ...

Resources