I am running the following commands:
./mahout trainnb
-i ${WORK_DIR}/20news-train-vectors -el
-o ${WORK_DIR}/model
-li ${WORK_DIR}/labelindex
-ow
./mahout testnb
-i ${WORK_DIR}/20news-test-vectors
-m ${WORK_DIR}/model
-l ${WORK_DIR}/labelindex
-ow -o ${WORK_DIR}/20news-testing
On running the last command, the map task runs to 100%, but on the reduce task I am getting the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Label not found: 10002
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
I am following the example from http://www.packtpub.com/article/implementing-the-na%C3%AFve-bayes-classifier-in-mahout. I also ran seqdumper on the labelindex and can see the keys and values in it.
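For reference, the seqdumper check I ran was roughly the following (the labelindex path is assumed from the commands above):
./mahout seqdumper -i ${WORK_DIR}/labelindex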
I am using Hadoop 2.2 and Mahout 1.0, and the whole environment is set up on Amazon EC2.
Please help me out. Am I doing something wrong?
I think Mahout is not compatible with your Hadoop version; you should download Hadoop 1.1.0 or 1.2.0 instead.
This will probably fix your problem.
I guess you have your files on the local filesystem. I also had this problem, and I fixed it by moving the files into HDFS.
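As a minimal sketch of what that looks like, assuming the same ${WORK_DIR} layout as in the question (the HDFS target directory below is hypothetical):
# Push the locally generated inputs into HDFS so the MapReduce-based testnb job can read them.
HDFS_DIR=/user/$(whoami)/20news        # hypothetical HDFS working directory, adjust to your setup
hadoop fs -mkdir -p $HDFS_DIR
hadoop fs -put ${WORK_DIR}/20news-train-vectors $HDFS_DIR/
hadoop fs -put ${WORK_DIR}/20news-test-vectors  $HDFS_DIR/
hadoop fs -put ${WORK_DIR}/labelindex           $HDFS_DIR/
hadoop fs -put ${WORK_DIR}/model                $HDFS_DIR/
hadoop fs -ls $HDFS_DIR                         # confirm the files are visible before re-running
Then point -i, -m and -l at the HDFS copies when you re-run trainnb/testnb.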
I am trying to cluster a sample dataset which is in CSV file format, but when I run the command below,
user#ubuntu:/usr/local/mahout/trunk$ bin/mahout kmeans -i /root/Mahout/temp/parsedtext-seqdir-sparse-kmeans/tfidf-vectors/ -c /root/Mahout/temp/parsedtext-kmeans-clusters -o /root/Mahout/reuters21578/root/Mahout/temp/parsedtext-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 2 -k 1 -ow --clustering -cl
I am getting the following error, saying there are no input clusters available and to check the -c cluster argument. Can anybody help me, please?
Here is the error which I got for the above command:
16/05/11 16:09:15 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /root/Mahout/temp/parsedtext-kmeans-clusters/part-randomSeed. Check your -c argument.
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:213)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:147)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:110)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Let me copy the error message for you:
No input clusters found in /root/Mahout/temp/parsedtext-kmeans-clusters/part-randomSeed. Check your -c argument.
Have you considered checking or removing your -c argument?
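If you want to see what is actually (not) there, a quick check along these lines may help; the paths are copied from the command above, and with -k given Mahout is supposed to generate the part-randomSeed seed clusters inside the -c directory itself:
# Inspect the -c path referenced by the error (use plain ls instead if you run Mahout locally).
hadoop fs -ls /root/Mahout/temp/parsedtext-kmeans-clusters
hadoop fs -ls /root/Mahout/temp/parsedtext-kmeans-clusters/part-randomSeed
# If the directory exists but is empty or stale, removing it and re-running kmeans with -k
# should let Mahout regenerate the random seed clusters there.
hadoop fs -rm -r /root/Mahout/temp/parsedtext-kmeans-clusters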
But Mahout k-means is really low-quality. Use something else. apt-get install elki and try that instead, it's much faster.
I ran Giraph 1.0.0 with Hadoop 2.2.0 using the PageRank benchmark example here.
I got this error:
Exception in thread "main" java.lang.IllegalArgumentException: checkLocalJobRunnerConfiguration: When using LocalJobRunner, must have only one worker since only 1 task at a time!
at org.apache.giraph.job.GiraphJob.checkLocalJobRunnerConfiguration(GiraphJob.java:151)
at org.apache.giraph.job.GiraphJob.run(GiraphJob.java:225)
at org.apache.giraph.benchmark.GiraphBenchmark.run(GiraphBenchmark.java:90)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.giraph.benchmark.PageRankBenchmark.main(PageRankBenchmark.java:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
When I changed the number of workers to 1, I got:
Exception in thread "main" java.lang.IllegalArgumentException:
checkLocalJobRunnerConfiguration: When using LocalJobRunner, you
cannot run in split master / worker mode since there is only 1 task at
a time! at
org.apache.giraph.job.GiraphJob.checkLocalJobRunnerConfiguration(GiraphJob.java:157)
at org.apache.giraph.job.GiraphJob.run(GiraphJob.java:225) at
org.apache.giraph.benchmark.GiraphBenchmark.run(GiraphBenchmark.java:90)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at
org.apache.giraph.benchmark.PageRankBenchmark.main(PageRankBenchmark.java:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606) at
org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Any solutions?
Hi, I assume you are not running on a cluster? If I run it in our demo VMs, I get the same error.
You can disable the split master/worker behaviour in giraph-site.xml:
giraph.SplitMasterWorker=false
If you just want to disable this for a one-shot execution, you can also pass it as a command-line parameter to your program:
-ca giraph.SplitMasterWorker=false
For instance, I run a demo for my Big Data lecture like this:
#!/bin/bash
yarn jar /root/giraph-0.0.1-SNAPSHOT-jar-with-dependencies.jar org.apache.giraph.GiraphRunner at.jku.tk.steinbauer.bigdata.giraph.MaxInDegreeComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/hue/graph/tinygraph.txt -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/hue/graph/degree -w 1 -ca giraph.SplitMasterWorker=false
I am trying to copy some files from an S3 bucket to the HDFS of my EMR cluster, but I am getting the following error:
Exception in thread "main" java.lang.RuntimeException: Error running job
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:771)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:580)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://10.87.26.26:9000/tmp/33e4f3b9-d29a-49e8-9706-ea70e07e3ff2/files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:491)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:508)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:392)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:751)
... 9 more
The command I am using is:
./elastic-mapreduce --jobflow j-12345678 --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3n://my-bucket/data/,--dest,hdfs:///data/in,--srcPattern,xyz01-1-1*ped*' --step-name "Copy input files to HDFS" --wait-for-steps
I tried to run the sample word-count job, to check if there is any issue with HDFS, but it ran fine.
Can anyone please help me with this? If any more info is needed, please let me know and I will update the description.
Usually it's the --srcPattern '<regex>' argument. You can also use hadoop fs -cp s3://src/file1.something /my/output/path/ to test with a single file and adjust your regex. Also, starting the pattern with .* (any character, zero or more times) should relax the matching.
It would be great to know if regex non-matches get logged and where.
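As a rough sketch of that approach (the object name below is hypothetical; the job flow ID and paths are copied from the question, and --srcPattern is generally reported to be matched against the full source path):
# Hypothetical single-file test: confirm that one object which should match can be copied at all.
hadoop fs -cp s3n://my-bucket/data/xyz01-1-1_example.ped /tmp/srcpattern-test/
# Then relax the pattern by anchoring it with .* on both sides and re-run the step.
./elastic-mapreduce --jobflow j-12345678 --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3n://my-bucket/data/,--dest,hdfs:///data/in,--srcPattern,.*xyz01-1-1.*ped.*' --step-name "Copy input files to HDFS" --wait-for-steps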
I have a Hadoop 2.2.0 cluster running with Mahout 0.8. Is it compatible? Whenever I run this command:
bin/mahout recommenditembased --input mydata.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
it gives me this error:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
at org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
at org.apache.mahout.cf.taste.hadoop.preparation.PreparePreferenceMatrixJob.run(PreparePreferenceMatrixJob.java:75)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.run(RecommenderJob.java:158)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(RecommenderJob.java:312)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:194)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Or am I wrong? Any info would be helpful.
No, it does not work with Hadoop 2.x; someone else got the same error message as you.
It seems that at the very least it would require a recompile.
And more people are having the same problem: how can I compile/use Mahout for Hadoop 2.0?
About an hour ago, Mahout officially added support for Hadoop 2.x in the master branch (see MAHOUT-1329).
Check out the code here https://github.com/apache/mahout and recompile using:
mvn clean package -Dhadoop2.version=2.2.0
Try and see if that works.
You can get the source code from GitHub (https://github.com/apache/mahout) and run the following command:
mvn -Dhadoop2.version=2.2.0 -DskipTests clean install
Then, you can find the release package in distribution/target/
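Put together, the build flow described above might look like this (a sketch; the Hadoop version and output location follow the commands just shown):
# Clone and build Mahout against Hadoop 2.2.0, then look for the release package.
git clone https://github.com/apache/mahout.git
cd mahout
mvn -Dhadoop2.version=2.2.0 -DskipTests clean install
ls distribution/target/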
Using the Mahout seq2sparse command, I managed to successfully create the following folders in HDFS:
df-count
dictionary.file-0
frequency.file-0
tf-vectors
tfidf-vectors
tokenized-documents
wordcount
After that, when I run the trainnb command with the following syntax,
mahout trainnb -i tweet-vectors -el -li labelindex -o model -ow -c
I get the following error. Does anyone know a resolution?
Exception in thread "main" java.lang.IllegalStateException: hdfs://machineinfo:8020/user/hhhh/tweetvectors/df-count
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:115)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:106)
at com.google.common.collect.Iterators$8.transform(Iterators.java:860)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:597)
at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:122)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:180)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:94)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.main(TrainNaiveBayesJob.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:194)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: java.io.FileNotFoundException: File does not exist: /user/hhhh/tweet-vectors/df-count
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchLocatedBlocks(DFSClient.java:2006)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1975)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1967)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:735)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:165)
at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1499)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:63)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:110)
... 22 more
It seems mahout cannot see the file /user/hhhh/tweet-vectors/df-count in HDFS.
First, try hadoop dfs -ls /user/hhhh/tweet-vectors/df-count to verify the file exists.
If it doesn't exist, there's your problem. If it does exist, check if it is a file or a directory. mahout seems to be looking for a file, not a directory.
If it exists and it is a file, then verify that Mahout is connecting to the same Hadoop namenode instance where the file is stored.
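A quick way to do those checks (the path is taken from the stack trace; fs.defaultFS is the Hadoop 2.x key for the default filesystem, older setups use fs.default.name):
# Check whether the path exists and which namenode the client talks to by default.
hadoop fs -ls /user/hhhh/tweet-vectors/df-count
hdfs getconf -confKey fs.defaultFS
# Compare the reported filesystem URI with the one in the error (hdfs://machineinfo:8020/...);
# if they differ, trainnb is talking to a different namenode than the one holding the file.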