I am using HDP Mahout version 0.8 and have set MAHOUT_LOCAL="". When I run mahout, I see a message saying MAHOUT_LOCAL is not set, running on Hadoop, but my program is not writing its output to the HDFS directory.
Can anyone tell me how to make my mahout program take input from HDFS and write output to HDFS?
Did you add $MAHOUT_HOME/bin and $HADOOP_HOME/bin to your PATH?
For example on Linux:
export PATH=$PATH:$MAHOUT_HOME/bin/:$HADOOP_HOME/bin/
export HADOOP_CONF_DIR=$HADOOP_HOME/conf/
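Once HADOOP_CONF_DIR points at your cluster configuration, a quick sanity check (an illustrative command, not part of the original answer) is to list the HDFS root and confirm the client is talking to HDFS rather than the local file system:
hadoop fs -ls /    # should list HDFS directories such as /user, not your local root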
Then, almost all of Mahout's commands take the options -i (input) and -o (output).
For example:
mahout seqdirectory -i <input_path> -o <output_path> -chunk 64
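As a minimal sketch (the /user/mahout/... paths here are hypothetical; adjust them to your cluster), you would copy the input into HDFS, run the command with HDFS paths, and then check the output directory on HDFS:
# hypothetical HDFS paths; with MAHOUT_LOCAL unset, Mahout resolves them against HDFS
hadoop fs -put ./docs /user/mahout/docs
mahout seqdirectory -i /user/mahout/docs -o /user/mahout/docs-seq -chunk 64
hadoop fs -ls /user/mahout/docs-seq    # the output should appear here, on HDFS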
Assuming you have built your Mahout jar so that it takes input from and writes output to HDFS, do the following:
From hadoop bin directory:
./hadoop jar /home/kuntal/Kuntal/BIG_DATA/mahout-recommender.jar mia.recommender.RecommenderIntro --tempDir /home/kuntal/Kuntal/BIG_DATA --recommenderClassName org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
# Specify the input/output arguments if required:
-Dmapred.input.dir=./ratingsLess.txt -Dmapred.output.dir=/input/output
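Putting the pieces together, and assuming the driver class parses generic options via ToolRunner (an assumption, not something the original command confirms), the -D flags go right after the class name and before the other arguments:
./hadoop jar /home/kuntal/Kuntal/BIG_DATA/mahout-recommender.jar mia.recommender.RecommenderIntro \
    -Dmapred.input.dir=./ratingsLess.txt \
    -Dmapred.output.dir=/input/output \
    --tempDir /home/kuntal/Kuntal/BIG_DATA \
    --recommenderClassName org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender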
Please check this:
http://chimpler.wordpress.com/2013/02/20/playing-with-the-mahout-recommendation-engine-on-a-hadoop-cluster/
Related
I am running a hadoop distcp command as below:
hadoop distcp src-loc target-loc
I want to know the size of the data copied by running this command.
I am planning to run the command on Qubole.
Any help is appreciated
Run the following command:
hadoop dfs -dus -h target-loc
225.2 G target-loc
It will print a human-readable summary of the total size of target-loc.
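On newer Hadoop releases -dus is deprecated; the equivalent command is:
hdfs dfs -du -s -h target-loc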
I'm running a Hadoop job and the output is displayed on the console.
Is there a way for me to redirect the output to a file? I tried the command below to redirect the output, but it does not work.
hduser#vagrant:/usr/local/hadoop$ hadoop jar share/hadoop/mapreduce/hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output>joboutput
You can redirect the error stream to a file, since that is where the Hadoop job writes its console output. That is, use:
hadoop jar share/hadoop/mapreduce/hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output 2>joboutput
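If you want everything the job prints, stdout included, collected in the same file, redirect both streams (plain shell redirection, nothing Hadoop-specific):
hadoop jar share/hadoop/mapreduce/hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output > joboutput 2>&1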
If you are running the examples from the Hadoop documentation (https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html), the output will be written to
/user/hduser/gutenberg-output
on HDFS and not to the local file system.
You can see the output via
hadoop fs -text /user/hduser/gutenberg-output/*
And to dump that output to a local file:
hadoop fs -text /user/hduser/gutenberg-output/* > local.txt
The -text option will decompress the data so you get textual output in case you have some type of compression enabled.
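Alternatively, if you just want a local copy of the raw output directory, hadoop fs -get works too; unlike -text, it copies the files as stored, so they stay compressed if compression is enabled (the local target path here is just an example):
hadoop fs -get /user/hduser/gutenberg-output ./gutenberg-output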
I'm a beginner in Hadoop. I want to view the fsimage and edit logs in Hadoop. I have searched many blogs, but nothing is clear. Can anyone please tell me the step-by-step procedure to view the edit log / fsimage file in Hadoop?
My version: Apache Hadoop 1.2.1
My installation directory is /home/students/hadoop-1.2.1
Here are the steps I have tried, based on some blogs:
Ex.1. $ hdfs dfsadmin -fetchImage /tmp
Ex.2. hdfs oiv -i /tmp/fsimage_0000000000000001386 -o /tmp/fsimage.txt
Nothing works for me; it just says that hdfs is not a directory or a file.
For the edit log, navigate to
/var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current
then run
ls -l
to see the full name of the edits file you want to convert, and then run
hdfs oev -i editFileName -o /home/youraccount/Desktop/edits_your.xml -p XML
For the fsimage:
hdfs oiv -i fsimage -o /home/youraccount/Desktop/fsimage_your.xml
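Depending on the Hadoop version, oiv may default to a different output processor, so it can help to name one explicitly; the fsimage path below simply combines the directory and file name mentioned above and is only illustrative:
hdfs oiv -p XML -i /var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current/fsimage_0000000000000001386 -o /home/youraccount/Desktop/fsimage_your.xml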
Go to the bin directory of your Hadoop installation and try executing the same commands from there.
The command:
hadoop jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest –d advert-train.csv –ds advert-info –t 100 -o advert-model
Generates the error:
org.apache.commons.cli2.OptionException: Unexpected –d while processing Options
That doesn't seem possible. I looked at the source code and -d is a required option.
hadoop version
returns
Hadoop 2.3.0-cdh5.0.0
The files advert-train.csv and advert-info both exist in my default HDFS directory /users/cloudera
Detailed instructions for running random forests in Mahout can be found here:
https://mahout.apache.org/users/classification/partial-implementation.html
I was able to run this example on Cloudera CDH 5.0 with no problem. I think the problem may be due to your configuration, or to the fact that you need to specify the other parameters as well. I just used the mahout command in Cloudera to run the example. In your case the command would be:
mahout org.apache.mahout.classifier.df.mapreduce.BuildForest
-Dmapred.max.split.size=1874231 -d advert-train.csv -ds advert-info
-sl 5 -p -t 100 -o advert-model
in which,
-Dmapred.max.split.size tells Hadoop the maximum size of each partition, which should be around 1/10 of the size of your dataset
-sl specifies the number of variables randomly selected at each tree node
-p tells Mahout to use the partial implementation
The rest of the parameters should be fine.
I've written a Hadoop program which requires a certain layout within HDFS, and afterwards I need to get the files out of HDFS. It works on my single-node Hadoop setup, and I'm eager to get it working on tens of nodes within Elastic MapReduce.
What I've been doing is something like this:
./elastic-mapreduce --create --alive
JOBID="j-XXX" # output from creation
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp s3://bucket-id/XXX /XXX"
./elastic-mapreduce -j $JOBID --jar s3://bucket-id/jars/hdeploy.jar --main-class com.ranjan.HadoopMain --arg /XXX
This is asynchronous, but when the job's completed, I can do this
./elastic-mapreduce -j $JOBID --ssh "hadoop fs -cp /XXX s3://bucket-id/XXX-output"
./elastic-mapreduce -j $JOBID --terminate
While this sort of works, it's clunky and not what I'd like. Is there a cleaner way to do this?
Thanks!
You can use distcp, which will copy the files as a MapReduce job:
# download from s3
$ hadoop distcp s3://bucket/path/on/s3/ /target/path/on/hdfs/
# upload to s3
$ hadoop distcp /source/path/on/hdfs/ s3://bucket/path/on/s3/
This makes use of your entire cluster to copy in parallel from s3.
(note: the trailing slashes on each path are important to copy from directory to directory)
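Applied to the workflow in the question, the two --ssh copy steps could simply swap hadoop fs -cp for distcp (same bucket and placeholder paths as in the question, shown only as a sketch):
./elastic-mapreduce -j $JOBID --ssh "hadoop distcp s3://bucket-id/XXX/ /XXX/"
./elastic-mapreduce -j $JOBID --ssh "hadoop distcp /XXX/ s3://bucket-id/XXX-output/"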
@mat-kelcey, does distcp expect the files in S3 to have a minimum permission level? For some reason I have to set the permission levels of the files to "Open/Download" and "View Permissions" for "Everyone" for the files to be accessible from within the bootstrap or the step scripts.