hadoop streaming error, mapreduce with python - hadoop

I'm a newbie to the Hadoop environment. Do you have any idea how to solve this error, or what might be the reason behind it?
hduser@intel-HP-Pavilion-g6-Notebook-PC:~/hduser/hadoop$ sudo ./bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -file /home/hduser/map.py -mapper /home/hduser/map.py -file /home/hduser/red.py -reducer /home/hduser/red.py -input /home/hduser/tmp/cddb.txt -output /home/hduser/op1
packageJobJar: [/home/hduser/map.py, /home/hduser/red.py] [] /tmp/streamjob7455767556382290755.jar tmpDir=null
13/06/20 12:43:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/20 12:43:55 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/20 12:43:55 INFO mapred.FileInputFormat: Total input paths to process : 1
13/06/20 12:43:55 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
13/06/20 12:43:56 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
13/06/20 12:43:56 INFO streaming.StreamJob: Running job: job_local_0001
13/06/20 12:43:56 INFO streaming.StreamJob: Job running in-process (local Hadoop)
13/06/20 12:43:56 INFO util.ProcessTree: setsid exited with exit code 0
13/06/20 12:43:56 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@e2081
13/06/20 12:43:56 INFO mapred.MapTask: numReduceTasks: 1
13/06/20 12:43:56 INFO mapred.MapTask: io.sort.mb = 100
13/06/20 12:43:56 INFO mapred.MapTask: data buffer = 79691776/99614720
13/06/20 12:43:56 INFO mapred.MapTask: record buffer = 262144/327680
13/06/20 12:43:56 INFO streaming.PipeMapRed: PipeMapRed exec [/home/hduser/hduser/hadoop/./map.py]
13/06/20 12:43:56 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
13/06/20 12:43:57 INFO streaming.StreamJob: map 0% reduce 0%
13/06/20 12:44:02 INFO mapred.LocalJobRunner: file:/home/hduser/tmp/cddb.txt:0+1205
13/06/20 12:44:03 INFO streaming.StreamJob: map 100% reduce 0%
13/06/20 12:48:11 INFO streaming.PipeMapRed: Records R/W=9/1
13/06/20 12:48:11 INFO streaming.PipeMapRed: MRErrorThread done
13/06/20 12:48:11 INFO streaming.PipeMapRed: mapRedFinished
13/06/20 12:48:11 INFO mapred.MapTask: Starting flush of map output
13/06/20 12:48:11 INFO mapred.MapTask: Finished spill 0
13/06/20 12:48:11 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/06/20 12:48:11 INFO mapred.LocalJobRunner: Records R/W=9/1
13/06/20 12:48:11 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/06/20 12:48:11 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1c84be9
13/06/20 12:48:11 INFO mapred.LocalJobRunner:
13/06/20 12:48:11 INFO mapred.Merger: Merging 1 sorted segments
13/06/20 12:48:11 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1356 bytes
13/06/20 12:48:11 INFO mapred.LocalJobRunner:
13/06/20 12:48:11 INFO streaming.PipeMapRed: PipeMapRed exec [/home/hduser/hduser/hadoop/./red.py]
13/06/20 12:48:11 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
13/06/20 12:48:11 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
Traceback (most recent call last):
  File "/home/hduser/hduser/hadoop/./red.py", line 30, in <module>
    main()
  File "/home/hduser/hduser/hadoop/./red.py", line 19, in main
    for similarity, group in groupby(data, itemgetter(0), reverse=True):
TypeError: groupby() takes at most 2 arguments (3 given)
13/06/20 12:48:11 INFO streaming.PipeMapRed: MRErrorThread done
13/06/20 12:48:11 INFO streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:529)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
13/06/20 12:48:11 WARN mapred.LocalJobRunner: job_local_0001
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:529)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
13/06/20 12:48:12 INFO streaming.StreamJob: Job running in-process (local Hadoop)
13/06/20 12:48:12 ERROR streaming.StreamJob: Job not successful. Error: NA
13/06/20 12:48:12 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
I'm using Hadoop 1.0.4 and wrote the MapReduce job in Python (using Hadoop Streaming).

The error is obvious:
Traceback (most recent call last):
  File "/home/hduser/hduser/hadoop/./red.py", line 30, in <module>
    main()
  File "/home/hduser/hduser/hadoop/./red.py", line 19, in main
    for similarity, group in groupby(data, itemgetter(0), reverse=True):
TypeError: groupby() takes at most 2 arguments (3 given)
groupby() only accepts two arguments: an iterable and an optional key function; it has no reverse parameter. Here is the documentation for itertools.groupby.
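A minimal sketch of the fix, assuming data in red.py is an in-memory list of tuples whose first field is the similarity (the sample values below are hypothetical): sort the list in descending order first, then call groupby() with just the iterable and a key function:

from itertools import groupby
from operator import itemgetter

data = [(0.9, 'x'), (0.7, 'y'), (0.9, 'z')]  # hypothetical reducer input

# groupby() expects its input already sorted by the grouping key,
# so do the (descending) sort up front instead of passing reverse=True.
data.sort(key=itemgetter(0), reverse=True)
for similarity, group in groupby(data, key=itemgetter(0)):
    print(similarity, list(group))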

Related

Second mapper not getting called - MultipleInputs

My job gets stuck at "map 50% reduce 0%" once the first mapper (Reducemapper2) completes. I have tried to debug a lot and googled it as well, but I'm not able to figure out the reason. Below is the driver class.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Reducedriver {
    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: Worddrivernewapi <input path1> <inputpath2> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Reducesideexample");
        job.setJarByClass(Reducedriver.class);
        job.setJobName("Reducedriver");
        Path path1 = new Path(args[0]);
        Path path2 = new Path(args[1]);
        MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Reducemapper1.class);
        MultipleInputs.addInputPath(job, path2, TextInputFormat.class, Reducemapper2.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        //job.setMapperClass(Reducemapper1.class);
        job.setPartitionerClass(Reducepartitioner.class);
        //job.setSortComparatorClass(Reducesortcomparator.class);
        job.setGroupingComparatorClass(Reducegroupcomparator.class);
        job.setReducerClass(Reducereducer.class);
        //job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(ReduceWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Could someone help me figure out the issue?
This is pseudo-distributed mode with a capacity of 2 mappers and 2 reducers; I have had multiple successful runs at this capacity before.
Log for a single mapper (JobTracker log):
2015-05-16 11:10:56,630 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2015-05-16 11:10:57,126 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2015-05-16 11:10:57,288 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2015-05-16 11:10:57,309 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@42f93a98
2015-05-16 11:10:57,484 INFO org.apache.hadoop.mapred.MapTask: Processing split: hdfs://localhost:9000/user/hduser/test/mapmainfile.dat:0+40
2015-05-16 11:10:57,512 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2015-05-16 11:10:57,591 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2015-05-16 11:10:57,592 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2015-05-16 11:10:57,607 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded
2015-05-16 11:10:57,666 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2015-05-16 11:10:57,669 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
From the terminal:
15/05/16 11:10:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/05/16 11:10:50 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:10:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/05/16 11:10:50 WARN snappy.LoadSnappy: Snappy native library not loaded
15/05/16 11:10:50 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:10:51 INFO mapred.JobClient: Running job: job_201505161109_0001
15/05/16 11:10:52 INFO mapred.JobClient: map 0% reduce 0%
15/05/16 11:11:04 INFO mapred.JobClient: map 100% reduce 0%
When I tried to debug through localhost, I could see that the first mapper completes and the map progress stops at 50%.
LocalJobRunner log:
15/05/16 11:36:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/16 11:36:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/05/16 11:36:08 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/05/16 11:36:08 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:36:08 WARN snappy.LoadSnappy: Snappy native library not loaded
15/05/16 11:36:08 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:36:08 INFO mapred.JobClient: Running job: job_local815502428_0001
15/05/16 11:36:09 INFO mapred.LocalJobRunner: Waiting for map tasks
15/05/16 11:36:09 INFO mapred.LocalJobRunner: Starting task: attempt_local815502428_0001_m_000000_0
15/05/16 11:36:09 INFO util.ProcessTree: setsid exited with exit code 0
15/05/16 11:36:09 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@11507b87
15/05/16 11:36:09 INFO mapred.MapTask: Processing split: file:/home/hduser/hadoop/myexamples/mainmapdatafile.dat:0+137
15/05/16 11:36:09 INFO mapred.MapTask: io.sort.mb = 100
15/05/16 11:36:09 INFO mapred.MapTask: data buffer = 79691776/99614720
15/05/16 11:36:09 INFO mapred.MapTask: record buffer = 262144/327680
15/05/16 11:36:09 INFO mapred.JobClient: map 0% reduce 0%
15/05/16 11:36:18 INFO mapred.LocalJobRunner:
15/05/16 11:36:18 INFO mapred.JobClient: map 6% reduce 0%
15/05/16 11:36:27 INFO mapred.LocalJobRunner:
15/05/16 11:36:28 INFO mapred.JobClient: map 12% reduce 0%
15/05/16 11:36:36 INFO mapred.LocalJobRunner:
15/05/16 11:36:37 INFO mapred.JobClient: map 18% reduce 0%
15/05/16 11:36:45 INFO mapred.LocalJobRunner:
15/05/16 11:36:46 INFO mapred.JobClient: map 25% reduce 0%
15/05/16 11:36:51 INFO mapred.LocalJobRunner:
15/05/16 11:36:52 INFO mapred.JobClient: map 31% reduce 0%
15/05/16 11:36:57 INFO mapred.LocalJobRunner:
15/05/16 11:36:58 INFO mapred.JobClient: map 37% reduce 0%
15/05/16 11:37:03 INFO mapred.LocalJobRunner:
15/05/16 11:37:04 INFO mapred.JobClient: map 43% reduce 0%
15/05/16 11:37:09 INFO mapred.LocalJobRunner:
15/05/16 11:37:10 INFO mapred.JobClient: map 50% reduce 0%
15/05/16 11:37:12 INFO mapred.MapTask: Starting flush of map output
15/05/16 11:37:12 INFO mapred.MapTask: Starting flush of map output
15/05/16 11:37:18 INFO mapred.LocalJobRunner:

hadoop streaming job failed Unable to load realm info from SCDynamicStore env: ruby\r: No such file or directory

While running a Hadoop Streaming job with Ruby scripts as my mapper and reducer, I get the following error.
packageJobJar: [summarymapper.rb, wcreducer.rb, /var/lib/hadoop/hadoop-unjar6514686449101598265/] [] /var/folders/md/0ww65qrx1_n1nlhrr7hrs8d00000gn/T/streamjob9165241112855689376.jar tmpDir=null
14/06/25 19:54:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/06/25 19:54:35 WARN snappy.LoadSnappy: Snappy native library not loaded
14/06/25 19:54:35 INFO mapred.FileInputFormat: Total input paths to process : 1
14/06/25 19:54:35 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop/mapred/local]
14/06/25 19:54:35 INFO streaming.StreamJob: Running job: job_201406251944_0005
14/06/25 19:54:35 INFO streaming.StreamJob: To kill this job, run:
14/06/25 19:54:35 INFO streaming.StreamJob: /Users/oladotunopasina/hadoop-1.2.1/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201406251944_0005
14/06/25 19:54:35 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201406251944_0005
14/06/25 19:54:36 INFO streaming.StreamJob: map 0% reduce 0%
14/06/25 19:55:18 INFO streaming.StreamJob: map 100% reduce 100%
14/06/25 19:55:18 INFO streaming.StreamJob: To kill this job, run:
14/06/25 19:55:18 INFO streaming.StreamJob: /Users/oladotunopasina/hadoop-1.2.1/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201406251944_0005
14/06/25 19:55:18 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201406251944_0005
14/06/25 19:55:18 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201406251944_0005_m_000001
14/06/25 19:55:18 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
On checking the log file produced, I see this:
stderr logs
2014-06-25 19:54:38.332 java[8468:1003] Unable to load realm info from SCDynamicStore
env: ruby\r: No such file or directory
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:394)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I have tried the suggestions in this thread (Hadoop environment variables), but still have no success. Kindly help.
I solved this problem by re-saving my .rb files on the Mac. It seems the version I downloaded was saved as a PC (DOS-format) file; "\r" is a hidden carriage-return character that was present in the mapper and reducer scripts.
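If re-saving by hand is inconvenient, a minimal sketch of the same fix in Python (the file names are taken from the packageJobJar line above; this is equivalent to running dos2unix on the scripts):

# Rewrite the scripts with Unix line endings so the shebang line
# no longer ends in a stray "\r" (which makes env look for "ruby\r").
for name in ("summarymapper.rb", "wcreducer.rb"):
    with open(name, "rb") as f:
        data = f.read()
    with open(name, "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))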

Mahout - Exception: Java Heap space

I'm trying to convert some texts to Mahout sequence files using:
mahout seqdirectory -i Lastfm-ArtistTags2007 -o seqdirectory
But all I get is an OutOfMemoryError, as shown here:
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/mahout/mahout-examples-0.9-job.jar
14/04/07 16:44:34 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[Lastfm-ArtistTags2007], --keyPrefix=[], --method=[mapreduce], --output=[seqdirectoryjps], --startPhase=[0], --tempDir=[temp]}
14/04/07 16:44:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/07 16:44:35 INFO input.FileInputFormat: Total input paths to process : 4
14/04/07 16:44:35 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/07 16:44:35 INFO mapred.JobClient: Running job: job_local407267609_0001
14/04/07 16:44:35 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/07 16:44:35 INFO mapred.LocalJobRunner: Starting task: attempt_local407267609_0001_m_000000_0
14/04/07 16:44:35 INFO util.ProcessTree: setsid exited with exit code 0
14/04/07 16:44:35 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6ad3ad65
14/04/07 16:44:35 INFO mapred.MapTask: Processing split: Paths:/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/README.txt:0+2472,/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/ArtistTags.dat:0+71652722,/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/tags.txt:0+1739746,/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/artists.txt:0+327051
14/04/07 16:44:35 INFO compress.CodecPool: Got brand-new compressor
14/04/07 16:44:35 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/07 16:44:35 WARN mapred.LocalJobRunner: job_local407267609_0001
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:119)
at org.apache.mahout.text.WholeFileRecordReader.nextKeyValue(WholeFileRecordReader.java:118)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
14/04/07 16:44:36 INFO mapred.JobClient: map 0% reduce 0%
14/04/07 16:44:36 INFO mapred.JobClient: Job complete: job_local407267609_0001
14/04/07 16:44:36 INFO mapred.JobClient: Counters: 0
14/04/07 16:44:36 INFO driver.MahoutDriver: Program took 1749 ms (Minutes: 0.02915)
I am using Mahout 0.9, Hadoop 1.2.1, and OpenJDK Java 7u25. Setting MAHOUT_HEAPSIZE to 4096 did not help. The text files can be found here: http://static.echonest.com/Lastfm-ArtistTags2007.tar.gz
Currently the spawned job executes via the local job runner, so execution happens only on the node from which you fired the job. Specify the JobTracker address by setting the property mapred.job.tracker in your mapred-site.xml in order to make the execution distributed. Running in distributed mode might resolve your OutOfMemory issue.
If you look at the environment variable HADOOP_CONF_DIR, its value is empty; set it with export HADOOP_CONF_DIR=/etc/hadoop/conf. Also make sure the property mapred.job.tracker in /etc/hadoop/conf/mapred-site.xml points to your JobTracker.
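For reference, a minimal mapred-site.xml sketch for the property both answers mention (localhost:9001 is a placeholder; point it at your actual JobTracker):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>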

Debugging a Tutorial Hadoop Pipes-Project

I am working through this tutorial
and got to the very last part (with some small changes).
Now I am stuck with an error message I can't make sense of.
damian#damian-ThinkPad-T61:~/hadoop-1.1.2$ bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input dft1 -output dft1-out -program bin/word_count
13/06/09 20:17:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/09 20:17:01 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/06/09 20:17:01 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/09 20:17:01 INFO mapred.FileInputFormat: Total input paths to process : 1
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Creating word_count in /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin-work-1867423021697266227 with rwxr-xr-x
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Cached bin/word_count as /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin/word_count
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Cached bin/word_count as /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin/word_count
13/06/09 20:17:02 INFO mapred.JobClient: Running job: job_local_0001
13/06/09 20:17:02 INFO util.ProcessTree: setsid exited with exit code 0
13/06/09 20:17:02 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4200d3
13/06/09 20:17:02 INFO mapred.MapTask: numReduceTasks: 1
13/06/09 20:17:02 INFO mapred.MapTask: io.sort.mb = 100
13/06/09 20:17:02 INFO mapred.MapTask: data buffer = 79691776/99614720
13/06/09 20:17:02 INFO mapred.MapTask: record buffer = 262144/327680
13/06/09 20:17:02 WARN mapred.LocalJobRunner: job_local_0001
java.lang.NullPointerException
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:103)
at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:68)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
13/06/09 20:17:03 INFO mapred.JobClient: map 0% reduce 0%
13/06/09 20:17:03 INFO mapred.JobClient: Job complete: job_local_0001
13/06/09 20:17:03 INFO mapred.JobClient: Counters: 0
13/06/09 20:17:03 INFO mapred.JobClient: Job Failed: NA
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
Does anyone see where the error hides? What is a straightforward way for debugging Hadoop Pipes programs?
Thanks!
The exception:
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:103)
is caused by the following lines in the source:
// Add token to the environment if security is enabled
Token<JobTokenIdentifier> jobToken = TokenCache.getJobToken(conf.getCredentials());
// This password is used as shared secret key between this application and
// child pipes process
byte[] password = jobToken.getPassword();
The actual NPE is thrown in the final line because jobToken is null.
As you're using local mode (local job tracker and local file system), I'm not sure that security should be 'enabled'. Do you have either of the following properties configured in your core-site.xml or hdfs-site.xml configuration files (if so, what are their values)?
hadoop.security.authentication
hadoop.security.authorization
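For reference, a sketch of what the non-secure defaults would look like in core-site.xml (these are the stock default values; anything else suggests security has been switched on):

<property>
  <name>hadoop.security.authentication</name>
  <value>simple</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>false</value>
</property>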
Possibly because your cluster is running in local mode. Do you have the following property in your mapred-site.xml file?
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>
    Let the MapReduce jobs run with the yarn framework.
  </description>
</property>
If you don't have this property, your cluster will run in local mode by default. I used to have exactly the same problem in local mode; after I added this property, the cluster ran in distributed mode and the problem was gone.
HTH,
Shumin

Mahout RecommenderJob not converging

This is my first SO post, so please let me know if I've missed anything important. I am a Mahout/Hadoop beginner and am trying to put together a distributed recommendation engine.
In order to simulate working on a remote cluster, I have set up hadoop on my machine to communicate with a Ubuntu VM (using VirtualBox), also located on my machine, which has hadoop installed on it. This setup seems to be working fine and I am now trying to run Mahout's 'RecommenderJob' on a (very!) small trial dataset as a test.
The input consists of a .csv file (saved on the Hadoop DFS) containing around 50 user preferences in the format userID,itemID,preference. The command I am running is:
hadoop jar /Users/MyName/src/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=/user/MyName/Recommendations/input/TestRatings.csv -Dmapred.output.dir=/user/MyName/Recommendations/output -s SIMILARITY_PEARSON_CORELLATION
where TestRatings.csv is the file containing the preferences and output is the desired output directory.
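For concreteness, a hypothetical few lines in that format (these values are invented for illustration, not taken from the actual TestRatings.csv):

1,101,4.5
1,102,3.0
2,101,5.0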
At first the job looks like it's running fine, and I get the following output:
12/12/11 12:26:21 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --maxPrefsPerUser=[10], --maxPrefsPerUserInItemSimilarity=[1000], --maxSimilaritiesPerItem=[100], --minPrefsPerUser=[1], --numRecommendations=[10], --similarityClassname=[SIMILARITY_PEARSON_CORELLATION], --startPhase=[0], --tempDir=[temp]}
12/12/11 12:26:21 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[/user/Naaman/Delphi/input/TestRatings.csv], --maxPrefsPerUser=[1000], --minPrefsPerUser=[1], --output=[temp/preparePreferenceMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}
12/12/11 12:26:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/12/11 12:26:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/12/11 12:26:22 INFO input.FileInputFormat: Total input paths to process : 1
12/12/11 12:26:22 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/11 12:26:22 INFO mapred.JobClient: Running job: job_local_0001
12/12/11 12:26:22 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/12/11 12:26:22 INFO mapred.MapTask: io.sort.mb = 100
12/12/11 12:26:22 INFO mapred.MapTask: data buffer = 79691776/99614720
12/12/11 12:26:22 INFO mapred.MapTask: record buffer = 262144/327680
12/12/11 12:26:22 INFO mapred.MapTask: Starting flush of map output
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new compressor
12/12/11 12:26:22 INFO mapred.MapTask: Finished spill 0
12/12/11 12:26:22 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/12/11 12:26:22 INFO mapred.LocalJobRunner:
12/12/11 12:26:22 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/12/11 12:26:22 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/12/11 12:26:22 INFO mapred.ReduceTask: ShuffleRamManager: MemoryLimit=1491035776, MaxSingleShuffleLimit=372758944
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging on-disk files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging in memory files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread waiting: Thread for merging on-disk files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for polling Map Completion Events
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
12/12/11 12:26:23 INFO mapred.JobClient: map 100% reduce 0%
12/12/11 12:26:28 INFO mapred.LocalJobRunner: reduce > copy >
12/12/11 12:26:31 INFO mapred.LocalJobRunner: reduce > copy >
12/12/11 12:26:37 INFO mapred.LocalJobRunner: reduce > copy >
But then the last three lines repeat indefinitely (I left it overnight...), with the two lines:
12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
repeating every twelve rows.
I'm not sure whether there's something wrong with my input, or whether the tiny size of the trial data is messing things up. Any help and/or advice on the best way to go about this would be much appreciated.
p.s. I was trying to follow the instructions from https://www.box.com/s/041rdjeh7sny128r2uki
This is really a Hadoop or cluster issue: the reducer is waiting on mapper output that is never coming. Look for earlier failures in the mapping phase.
