Mahout K-Means: No input clusters found error - hadoop

I am running Mahout 0.9 on the most recent version of Hadoop. For the K-Means algorithm, I transformed the input data into vectors as required.
I executed the following command to run K-Means:
mahout kmeans -i /user/ubuntu/Test/Vec/tfidf-vectors/ -c /user/ubuntu/Test/init-cluster -o /user/ubuntu/Test/Result -x 10 -k 2 -ow -cl
/user/ubuntu/Test/init-cluster is an empty folder, because I have already provided the -k parameter.
Interestingly, according to the log information below, Mahout deletes the cluster folder (/user/ubuntu/Test/init-cluster):
15/11/27 17:13:31 INFO common.HadoopUtil: Deleting /user/ubuntu/Test/init-cluster
In the end, Mahout gives:
Exception in thread "main" java.lang.IllegalStateException:
No input clusters found in /user/ubuntu/Test/init-cluster/part-randomSeed.
Check your -c argument
Any idea how to fix it?
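For context: when -k is supplied, Mahout deletes the -c directory and repopulates it with k seed vectors sampled at random from the input (the part-randomSeed file), so the deletion in the log is normal. A minimal sanity check, assuming the paths above, is to confirm that the input directory actually contains non-empty vector files, since an empty input leaves part-randomSeed without clusters and produces exactly this exception:
# list the tf-idf vectors and their total size; an empty or missing
# input gives the random-seed step nothing to sample
hadoop fs -ls /user/ubuntu/Test/Vec/tfidf-vectors/
hadoop fs -du -s /user/ubuntu/Test/Vec/tfidf-vectors/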

Related

retrieve size of data copied with hadoop distcp

I am running a hadoop distcp command as below:
hadoop distcp src-loc target-loc
I want to know the size of the data copied by running this command.
I am planning to run the command on Qubole.
Any help is appreciated
Run the following command:
hadoop dfs -dus -h target-loc
225.2 G target-loc
It will print a human-readable size summary for target-loc.
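On newer Hadoop 2 installations, where the dus switch is deprecated, an equivalent form (assuming the same target path) is:
# -du -s totals the size of the tree; -h renders it human-readable
hdfs dfs -du -s -h target-loc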

Hadoop2 client on Windows for a Linux Cluster

We have a Linux Hadoop cluster, but for a variety of reasons some Windows clients connect and push data to it.
In Hadoop 1 we were able to run Hadoop via Cygwin.
However, in Hadoop 2, as stated on the website, Cygwin is neither required nor supported.
Questions
What exactly has changed? Why would a client (only) not run under Cygwin, or could it? Apart from paths, what other considerations are at play?
Apart from the property below for job submissions, is there anything else that needs to be considered for a Windows client interacting with a Linux cluster?
conf.set("mapreduce.app-submission.cross-platform", "true");
Extracting hadoop-2.6.0-cdh5.5.2 and running it from Cygwin with the right configuration under $HADOOP_HOME/etc yields classpath-formation and class-not-found issues. For instance, the following run:
hdfs dfs -ls
Error: Could not find or load main class org.apache.hadoop.fs.FsShell
Looking at the classpath, it appears to contain Cygwin paths, so I attempted to convert them to Windows paths so that the jars can be looked up.
In $HADOOP_HOME/etc/hdfs.sh, locate the dfs command and change it to:
elif [ "$COMMAND" = "dfs" ] ; then
if $cygwin; then
CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi
CLASS=org.apache.hadoop.fs.FsShell
This results in the following:
16/04/07 16:01:05 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:386)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:362)
16/04/07 16:01:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Warning: fs.defaultFs is not set when running "ls" command.
Found 15 items
-ls: Fatal internal error
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
at org.apache.hadoop.util.Shell.run(Shell.java:478)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:831)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:814)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1100)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:582)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getOwner(RawLocalFileSystem.java:565)
at org.apache.hadoop.fs.shell.Ls.adjustColumnWidths(Ls.java:139)
at org.apache.hadoop.fs.shell.Ls.processPaths(Ls.java:110)
at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:373)
at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:98)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:305)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:362)
Given the above, my question is: should I keep trying to fix this so that I can reuse my existing client .sh scripts, or should I just convert them to .bat?
The problem is that Cygwin needs to return Windows paths rather than Cygwin paths. Also, winutils.exe needs to be installed in the path as described here.
Simply fix the scripts to return the actual Windows paths and disable a few commands that don't run under Cygwin:
#!/bin/bash
# fix $HADOOP_HOME/bin/hdfs
sed -i -e "s/bin=/#bin=/g" $HADOOP_HOME/bin/hdfs
sed -i -e "s#DEFAULT_LIBEXEC_DIR=\"\$bin\"/../libexec#DEFAULT_LIBEXEC_DIR=\"\$HADOOP_HOME\\\libexec\"#g" $HADOOP_HOME/bin/hdfs
sed -i "/export CLASSPATH=$CLASSPATH/i CLASSPATH=\`cygpath -p -w \"\$CLASSPATH\"\`" $HADOOP_HOME/bin/hdfs
# fix $HADOOP_HOME/libexec/hdfs-config.sh
sed -i -e "s/bin=/#bin=/g" $HADOOP_HOME/libexec/hdfs-config.sh
sed -i -e "s#DEFAULT_LIBEXEC_DIR=\"\$bin\"/../libexec#DEFAULT_LIBEXEC_DIR=\"\$HADOOP_HOME\\\libexec\"#g" $HADOOP_HOME/libexec/hdfs-config.sh
# fix $HADOOP_HOME/libexec/hadoop-config.sh
sed -i "/HADOOP_DEFAULT_PREFIX=/a HADOOP_PREFIX=" $HADOOP_HOME/libexec/hadoop-config.sh
sed -i "/export HADOOP_PREFIX/i HADOOP_PREFIX=\`cygpath -p -w \"\$HADOOP_PREFIX\"\`" $HADOOP_HOME/libexec/hadoop-config.sh
# fix $HADOOP_HOME/bin/hadoop
sed -i -e "s/bin=/#bin=/g" $HADOOP_HOME/bin/hadoop
sed -i -e "s#DEFAULT_LIBEXEC_DIR=\"\$bin\"/../libexec#DEFAULT_LIBEXEC_DIR=\"\$HADOOP_HOME\\\libexec\"#g" $HADOOP_HOME/bin/hadoop
sed -i "/export CLASSPATH=$CLASSPATH/i CLASSPATH=\`cygpath -p -w \"\$CLASSPATH\"\`" $HADOOP_HOME/bin/hadoop

Mahout random forest example, command line parameter for data not recognized

The command:
hadoop jar /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/mahout/mahout-examples-0.8-cdh5.0.0-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest –d advert-train.csv –ds advert-info –t 100 -o advert-model
Generates the error:
org.apache.commons.cli2.OptionException: Unexpected –d while processing Options
That doesn't seem possible. I looked at the source code and -d is a required option.
hadoop version
returns
Hadoop 2.3.0-cdh5.0.0
The files advert-train.csv and advert-info both exist in my default HDFS directory /users/cloudera
A detailed instruction to run the random forest in mahout can be found here:
https://mahout.apache.org/users/classification/partial-implementation.html
I was able to run this example in Cloudera CDH 5.0 with no problem. I think the problem may be due to the configuration, or the fact that you need to specify the other parameters as well. I just used the mahout command in Cloudera to run the example. In your case the command would be:
mahout org.apache.mahout.classifier.df.mapreduce.BuildForest
-Dmapred.max.split.size=1874231 -d advert-train.csv -ds advert-info
-sl 5 -p -t 100 -o advert-model
in which:
-Dmapred.max.split.size tells Hadoop the maximum size of each partition, which should be around 1/10 of the size of your dataset
-sl specifies the number of variables randomly selected at each tree node
-p tells Mahout to use the partial implementation
The rest of the parameters should be fine.
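Two further notes. First, the failing command was typed with en-dashes (–d) where ASCII hyphens (-d) are required; commons-cli2 treats –d as an unexpected bare token, which by itself would explain the OptionException, so retype the options with plain hyphens. Second, the linked guide builds the dataset descriptor (advert-info here) with the Describe tool before running BuildForest; a sketch, in which the "N L" schema is only a placeholder for the real attribute layout of advert-train.csv, would be:
# generate the dataset descriptor; replace "N L" with the actual
# column layout (N = numeric, C = categorical, L = label)
mahout org.apache.mahout.classifier.df.tools.Describe \
  -p advert-train.csv -f advert-info -d N L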

How to make mahout interact with hadoop HDFS

I am using Mahout 0.8 on HDP. I have set MAHOUT_LOCAL="". When I run Mahout, I see the message HADOOP LOCAL NOT SET, RUNNING ON HADOOP, but my program is not writing its output to the HDFS directory.
Can anyone tell me how to make my mahout program take input from HDFS and write output to HDFS?
Did you add $MAHOUT_HOME/bin and $HADOOP_HOME/bin to the PATH?
For example on Linux:
export PATH=$PATH:$MAHOUT_HOME/bin/:$HADOOP_HOME/bin/
export HADOOP_CONF_DIR=$HADOOP_HOME/conf/
Then note that almost all of Mahout's commands use the options -i (input) and -o (output).
For example:
mahout seqdirectory -i <input_path> -o <output_path> -chunk 64
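To confirm that the output actually landed on HDFS rather than on the local filesystem, a quick check (assuming the same output path as above) is:
# if the path lists here, Mahout wrote to HDFS, not to local disk
hadoop fs -ls <output_path>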
Assuming you have built your Mahout jar so that it reads input from and writes output to HDFS, do the following from the Hadoop bin directory:
./hadoop jar /home/kuntal/Kuntal/BIG_DATA/mahout-recommender.jar mia.recommender.RecommenderIntro --tempDir /home/kuntal/Kuntal/BIG_DATA --recommenderClassName org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
# specify the input/output arguments if required:
-Dmapred.input.dir=./ratingsLess.txt -Dmapred.output.dir=/input/output
Please check this:
http://chimpler.wordpress.com/2013/02/20/playing-with-the-mahout-recommendation-engine-on-a-hadoop-cluster/

Run the Mahout in Action Last.fm example

I have a problem when running the Last.fm clustering example from the Mahout in Action book.
The issue occurs when I run this command:
bin/mahout kmeans -i /user/local/Mia/lastfm_vector/ -o /user/local/Mia/lastfm_topics -c /user/local/Mia/lastfm_centroids -k 2000 -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.01 -x 20 -cl
I checked out the source code from git:
https://github.com/tdunning/MiA.git
The error is: Exception in thread "main" java.io.FileNotFoundException: File does not exist: /user/local/Mia/lastfm_vector
Could anyone provide me with the lastfm_vector input file so I can run this example? I need it for my studies.
Thanks & regards!
