Mahout RecommenderJob not converging - hadoop

This is my first SO post, so please let me know if I've missed anything important. I am a Mahout/Hadoop beginner trying to put together a distributed recommendation engine.
To simulate working on a remote cluster, I have set up Hadoop on my machine to communicate with an Ubuntu VM (using VirtualBox), also located on my machine, which has Hadoop installed on it. This setup seems to be working fine, and I am now trying to run Mahout's 'RecommenderJob' on a (very!) small trial dataset as a test.
The input consists of a .csv file (saved on the Hadoop DFS) containing around 50 user preferences in the format userID,itemID,preference. The command I am running is:
hadoop jar /Users/MyName/src/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=/user/MyName/Recommendations/input/TestRatings.csv -Dmapred.output.dir=/user/MyName/Recommendations/output -s SIMILARITY_PEARSON_CORELLATION
where TestRatings.csv is the file containing the preferences and output is the desired output directory.
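For illustration, a few made-up lines in the same format (not the actual test data):
1,101,5.0
1,102,3.0
2,101,2.5
2,103,4.0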
At first the job looks like it's running fine, and I get the following output:
12/12/11 12:26:21 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --maxPrefsPerUser=[10], --maxPrefsPerUserInItemSimilarity=[1000], --maxSimilaritiesPerItem=[100], --minPrefsPerUser=[1], --numRecommendations=[10], --similarityClassname=[SIMILARITY_PEARSON_CORELLATION], --startPhase=[0], --tempDir=[temp]}
12/12/11 12:26:21 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[/user/Naaman/Delphi/input/TestRatings.csv], --maxPrefsPerUser=[1000], --minPrefsPerUser=[1], --output=[temp/preparePreferenceMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}
12/12/11 12:26:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/12/11 12:26:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/12/11 12:26:22 INFO input.FileInputFormat: Total input paths to process : 1
12/12/11 12:26:22 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/11 12:26:22 INFO mapred.JobClient: Running job: job_local_0001
12/12/11 12:26:22 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/12/11 12:26:22 INFO mapred.MapTask: io.sort.mb = 100
12/12/11 12:26:22 INFO mapred.MapTask: data buffer = 79691776/99614720
12/12/11 12:26:22 INFO mapred.MapTask: record buffer = 262144/327680
12/12/11 12:26:22 INFO mapred.MapTask: Starting flush of map output
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new compressor
12/12/11 12:26:22 INFO mapred.MapTask: Finished spill 0
12/12/11 12:26:22 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/12/11 12:26:22 INFO mapred.LocalJobRunner:
12/12/11 12:26:22 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/12/11 12:26:22 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/12/11 12:26:22 INFO mapred.ReduceTask: ShuffleRamManager: MemoryLimit=1491035776, MaxSingleShuffleLimit=372758944
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging on-disk files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging in memory files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread waiting: Thread for merging on-disk files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for polling Map Completion Events
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
12/12/11 12:26:23 INFO mapred.JobClient: map 100% reduce 0%
12/12/11 12:26:28 INFO mapred.LocalJobRunner: reduce > copy >
12/12/11 12:26:31 INFO mapred.LocalJobRunner: reduce > copy >
12/12/11 12:26:37 INFO mapred.LocalJobRunner: reduce > copy >
But then the last three lines repeat indefinitely (I left it overnight...), with the two lines:
12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
repeating every twelve rows.
I'm not sure whether there's something wrong with my input, or whether the tiny size of the trial data is messing things up. Any help and/or advice on the best way to go about this would be much appreciated.
P.S. I was trying to follow the instructions from https://www.box.com/s/041rdjeh7sny128r2uki

This is really a Hadoop or cluster issue. The reducer is waiting on mapper output that is not coming. Look for earlier failures in the mapping phase.

Related

Mahout - Exception: Java Heap space

I'm trying to convert some texts to Mahout sequence files using:
mahout seqdirectory -i Lastfm-ArtistTags2007 -o seqdirectory
But all I get is an OutOfMemoryError, as shown here:
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/mahout/mahout-examples-0.9-job.jar
14/04/07 16:44:34 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[Lastfm-ArtistTags2007], --keyPrefix=[], --method=[mapreduce], --output=[seqdirectoryjps], --startPhase=[0], --tempDir=[temp]}
14/04/07 16:44:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/07 16:44:35 INFO input.FileInputFormat: Total input paths to process : 4
14/04/07 16:44:35 WARN snappy.LoadSnappy: Snappy native library not loaded
14/04/07 16:44:35 INFO mapred.JobClient: Running job: job_local407267609_0001
14/04/07 16:44:35 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/07 16:44:35 INFO mapred.LocalJobRunner: Starting task: attempt_local407267609_0001_m_000000_0
14/04/07 16:44:35 INFO util.ProcessTree: setsid exited with exit code 0
14/04/07 16:44:35 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6ad3ad65
14/04/07 16:44:35 INFO mapred.MapTask: Processing split: Paths:/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/README.txt:0+2472,/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/ArtistTags.dat:0+71652722,/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/tags.txt:0+1739746,/home/giuliano/cook/lastfm/Lastfm-ArtistTags2007/artists.txt:0+327051
14/04/07 16:44:35 INFO compress.CodecPool: Got brand-new compressor
14/04/07 16:44:35 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/07 16:44:35 WARN mapred.LocalJobRunner: job_local407267609_0001
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:119)
at org.apache.mahout.text.WholeFileRecordReader.nextKeyValue(WholeFileRecordReader.java:118)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
14/04/07 16:44:36 INFO mapred.JobClient: map 0% reduce 0%
14/04/07 16:44:36 INFO mapred.JobClient: Job complete: job_local407267609_0001
14/04/07 16:44:36 INFO mapred.JobClient: Counters: 0
14/04/07 16:44:36 INFO driver.MahoutDriver: Program took 1749 ms (Minutes: 0.02915)
I am using Mahout 0.9, Hadoop 1.2.1 and OpenJDK Java 7u25.
Setting MAHOUT_HEAPSIZE to 4096 did not help, and the text files can be found here: http://static.echonest.com/Lastfm-ArtistTags2007.tar.gz
Currently the spawned job is executed by the local job runner, so execution happens only on the node from which you fired the job. Specify the JobTracker address by setting the property mapred.job.tracker in your mapred-site.xml in order to make the execution distributed.
Execution in distributed mode might solve your OutOfMemory issue.
If you look at the environment variable HADOOP_CONF_DIR, its value is empty; set it using the following command: export HADOOP_CONF_DIR=/etc/hadoop/conf. Also make sure that the property mapred.job.tracker in /etc/hadoop/conf/mapred-site.xml points to your JobTracker.
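A minimal sketch of that entry in mapred-site.xml (assuming a JobTracker listening on localhost:9001; substitute your actual host and port):
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>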

Mapreduce throwing OutOfMemoryError for large input file

Hi, I have a MapReduce jar that runs perfectly fine for small input files. By small I mean sample input files I've created with fewer than 10 lines of input. But when I try to run MapReduce on an input file of size 1.8 GB, I get an OutOfMemoryError. I'm not sure what I'm supposed to be doing.
Is there any way I can limit the number of tasks being spawned, and have fewer tasks run for longer durations?
Around 20 tasks are spawned on the large input file before I get this error. Here's part of the log generated for the first two tasks.
13/12/13 12:00:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
13/12/13 12:00:22 INFO mapreduce.Job: Running job: job_local1170901099_0001
13/12/13 12:00:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
13/12/13 12:00:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
13/12/13 12:00:22 INFO mapred.LocalJobRunner: Waiting for map tasks
13/12/13 12:00:22 INFO mapred.LocalJobRunner: Starting task: attempt_local1170901099_0001_m_000000_0
13/12/13 12:00:22 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
13/12/13 12:00:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : null
13/12/13 12:00:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/chaitanya.nadig/friendship.txt:0+134217728
13/12/13 12:00:22 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/12/13 12:00:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/12/13 12:00:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/12/13 12:00:23 INFO mapred.MapTask: soft limit at 83886080
13/12/13 12:00:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/12/13 12:00:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/12/13 12:00:23 INFO mapreduce.Job: Job job_local1170901099_0001 running in uber mode : false
13/12/13 12:00:23 INFO mapreduce.Job: map 0% reduce 0%
13/12/13 12:00:24 INFO mapred.MapTask: Starting flush of map output
13/12/13 12:00:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1170901099_0001_m_000001_0
13/12/13 12:00:24 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
13/12/13 12:00:24 INFO mapred.Task: Using ResourceCalculatorProcessTree : null
13/12/13 12:00:24 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/chaitanya.nadig/friendship.txt:134217728+134217728
13/12/13 12:00:24 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/12/13 12:00:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/12/13 12:00:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/12/13 12:00:24 INFO mapred.MapTask: soft limit at 83886080
13/12/13 12:00:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/12/13 12:00:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/12/13 12:00:25 INFO mapred.MapTask: Starting flush of map output
This is the tail of the log which is generated when the error occurs.
13/12/13 12:00:43 INFO mapred.MapTask: Starting flush of map output
13/12/13 12:00:43 INFO mapred.Task: Task:attempt_local1170901099_0001_m_000020_0 is done. And is in the process of committing
13/12/13 12:00:43 INFO mapred.LocalJobRunner: map
13/12/13 12:00:43 INFO mapred.Task: Task 'attempt_local1170901099_0001_m_000020_0' done.
13/12/13 12:00:43 INFO mapred.LocalJobRunner: Finishing task: attempt_local1170901099_0001_m_000020_0
13/12/13 12:00:43 INFO mapred.LocalJobRunner: Map task executor complete.
13/12/13 12:00:43 WARN mapred.LocalJobRunner: job_local1170901099_0001
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:403)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at org.apache.hadoop.io.Text.setCapacity(Text.java:266)
at org.apache.hadoop.io.Text.append(Text.java:236)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:238)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
13/12/13 12:00:44 INFO mapreduce.Job: map 100% reduce 0%
13/12/13 12:00:44 INFO mapreduce.Job: Job job_local1170901099_0001 failed with state FAILED due to: NA
13/12/13 12:00:44 INFO mapreduce.Job: Counters: 22
File System Counters
FILE: Number of bytes read=27635962
FILE: Number of bytes written=28018656
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5338170260
HDFS: Number of bytes written=0
HDFS: Number of read operations=25
HDFS: Number of large read operations=0
HDFS: Number of write operations=1
Map-Reduce Framework
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=6
Input split bytes=122
Combine input records=0
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=5
Total committed heap usage (bytes)=530186240
File Input Format Counters
Bytes Read=118909386
This answer is late, but posting it in case it helps someone else. The problem was that the file I was trying to process was corrupted. I got a different copy of the file, ran my MR job on it, and everything worked fine.
My first impulse would be to ask what your startup parameters are. Typically, when you run MapReduce and experience an out-of-memory error, you would use something like the following as your startup params:
-Dmapred.map.child.java.opts=-Xmx1G -Dmapred.reduce.child.java.opts=-Xmx1G
The key here is that these two amounts are cumulative, so the amounts you specify added together should not come close to exceeding the memory available on your system after you start MapReduce.
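For example, a full invocation might look like this (the jar name, class name, and paths are placeholders, not from the original question):
hadoop jar myjob.jar com.example.MyJob -Dmapred.map.child.java.opts=-Xmx1G -Dmapred.reduce.child.java.opts=-Xmx1G input output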
Might be late, but I solved this by setting the following parameter to 0.2:
mapred.job.shuffle.input.buffer.percent
This tells the reducer JVM to ask for only 20% of the heap (0.2) for the shuffle space, rather than the default 70% (0.7). You are getting the "Out of heap space" error because the shuffle is asking the JVM for memory that is not available to it; rather than spilling, it just throws the exception. If you ask for only 20%, chances are you will get the memory, and once you exceed the allotted memory the spilling logic comes into the picture.
Of course, the downside is slowness.
You can also calculate at run-time the amount of memory available and then reset the buffer.
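A sketch of the corresponding entry in mapred-site.xml (0.2 is the value suggested above; 0.7 is the default):
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.2</value>
</property>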

Hadoop streaming error, MapReduce with Python

I'm a newbie to the Hadoop environment. Do you have any idea how to solve this error, or what the reason behind it might be?
hduser@intel-HP-Pavilion-g6-Notebook-PC:~/hduser/hadoop$ sudo ./bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar -file /home/hduser/map.py -mapper /home/hduser/map.py -file /home/hduser/red.py -reducer /home/hduser/red.py -input /home/hduser/tmp/cddb.txt -output /home/hduser/op1
packageJobJar: [/home/hduser/map.py, /home/hduser/red.py] [] /tmp/streamjob7455767556382290755.jar tmpDir=null
13/06/20 12:43:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/20 12:43:55 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/20 12:43:55 INFO mapred.FileInputFormat: Total input paths to process : 1
13/06/20 12:43:55 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
13/06/20 12:43:56 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
13/06/20 12:43:56 INFO streaming.StreamJob: Running job: job_local_0001
13/06/20 12:43:56 INFO streaming.StreamJob: Job running in-process (local Hadoop)
13/06/20 12:43:56 INFO util.ProcessTree: setsid exited with exit code 0
13/06/20 12:43:56 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@e2081
13/06/20 12:43:56 INFO mapred.MapTask: numReduceTasks: 1
13/06/20 12:43:56 INFO mapred.MapTask: io.sort.mb = 100
13/06/20 12:43:56 INFO mapred.MapTask: data buffer = 79691776/99614720
13/06/20 12:43:56 INFO mapred.MapTask: record buffer = 262144/327680
13/06/20 12:43:56 INFO streaming.PipeMapRed: PipeMapRed exec [/home/hduser/hduser/hadoop/./map.py]
13/06/20 12:43:56 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
13/06/20 12:43:57 INFO streaming.StreamJob: map 0% reduce 0%
13/06/20 12:44:02 INFO mapred.LocalJobRunner: file:/home/hduser/tmp/cddb.txt:0+1205
13/06/20 12:44:03 INFO streaming.StreamJob: map 100% reduce 0%
13/06/20 12:48:11 INFO streaming.PipeMapRed: Records R/W=9/1
13/06/20 12:48:11 INFO streaming.PipeMapRed: MRErrorThread done
13/06/20 12:48:11 INFO streaming.PipeMapRed: mapRedFinished
13/06/20 12:48:11 INFO mapred.MapTask: Starting flush of map output
13/06/20 12:48:11 INFO mapred.MapTask: Finished spill 0
13/06/20 12:48:11 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/06/20 12:48:11 INFO mapred.LocalJobRunner: Records R/W=9/1
13/06/20 12:48:11 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/06/20 12:48:11 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1c84be9
13/06/20 12:48:11 INFO mapred.LocalJobRunner:
13/06/20 12:48:11 INFO mapred.Merger: Merging 1 sorted segments
13/06/20 12:48:11 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1356 bytes
13/06/20 12:48:11 INFO mapred.LocalJobRunner:
13/06/20 12:48:11 INFO streaming.PipeMapRed: PipeMapRed exec [/home/hduser/hduser/hadoop/./red.py]
13/06/20 12:48:11 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
13/06/20 12:48:11 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
Traceback (most recent call last):
File "/home/hduser/hduser/hadoop/./red.py", line 30, in <module>
main()
File "/home/hduser/hduser/hadoop/./red.py", line 19, in main
for similarity, group in groupby(data, itemgetter(0), reverse=True):
TypeError: groupby() takes at most 2 arguments (3 given)
13/06/20 12:48:11 INFO streaming.PipeMapRed: MRErrorThread done
13/06/20 12:48:11 INFO streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:529)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
13/06/20 12:48:11 WARN mapred.LocalJobRunner: job_local_0001
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:137)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:529)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
13/06/20 12:48:12 INFO streaming.StreamJob: Job running in-process (local Hadoop)
13/06/20 12:48:12 ERROR streaming.StreamJob: Job not successful. Error: NA
13/06/20 12:48:12 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
I'm using Hadoop 1.0.4, and wrote the map and reduce steps in Python (Hadoop Streaming is used).
The error is obvious:
Traceback (most recent call last):
File "/home/hduser/hduser/hadoop/./red.py", line 30, in <module>
main()
File "/home/hduser/hduser/hadoop/./red.py", line 19, in main
for similarity, group in groupby(data, itemgetter(0), reverse=True):
TypeError: groupby() takes at most 2 arguments (3 given)
groupby only accepts 2 arguments. Here is the documentation for groupby.
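itertools.groupby takes only an iterable and an optional key function; the reverse=True keyword belongs to sorting, not grouping. A minimal sketch of the likely fix (the sample data is hypothetical, standing in for the reducer's input):
from itertools import groupby
from operator import itemgetter

# hypothetical (similarity, item) records
data = [(0.9, 'a'), (0.5, 'b'), (0.9, 'c')]

# groupby only merges consecutive equal keys, so sort first;
# the descending order goes into sort(), not groupby()
data.sort(key=itemgetter(0), reverse=True)
for similarity, group in groupby(data, itemgetter(0)):
    print(similarity, list(group))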

Debugging a Tutorial Hadoop Pipes-Project

I am working through this tutorial
and got to the very last part (with some small changes).
Now I am stuck with an error message I can't make sense of.
damian#damian-ThinkPad-T61:~/hadoop-1.1.2$ bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input dft1 -output dft1-out -program bin/word_count
13/06/09 20:17:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/09 20:17:01 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/06/09 20:17:01 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/09 20:17:01 INFO mapred.FileInputFormat: Total input paths to process : 1
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Creating word_count in /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin-work-1867423021697266227 with rwxr-xr-x
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Cached bin/word_count as /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin/word_count
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Cached bin/word_count as /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin/word_count
13/06/09 20:17:02 INFO mapred.JobClient: Running job: job_local_0001
13/06/09 20:17:02 INFO util.ProcessTree: setsid exited with exit code 0
13/06/09 20:17:02 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4200d3
13/06/09 20:17:02 INFO mapred.MapTask: numReduceTasks: 1
13/06/09 20:17:02 INFO mapred.MapTask: io.sort.mb = 100
13/06/09 20:17:02 INFO mapred.MapTask: data buffer = 79691776/99614720
13/06/09 20:17:02 INFO mapred.MapTask: record buffer = 262144/327680
13/06/09 20:17:02 WARN mapred.LocalJobRunner: job_local_0001
java.lang.NullPointerException
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:103)
at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:68)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
13/06/09 20:17:03 INFO mapred.JobClient: map 0% reduce 0%
13/06/09 20:17:03 INFO mapred.JobClient: Job complete: job_local_0001
13/06/09 20:17:03 INFO mapred.JobClient: Counters: 0
13/06/09 20:17:03 INFO mapred.JobClient: Job Failed: NA
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
Does anyone see where the error hides? What is a straightforward way to debug Hadoop Pipes programs?
Thanks!
The exception:
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:103)
Is caused by the following lines in the source:
//Add token to the environment if security is enabled
Token<JobTokenIdentifier> jobToken = TokenCache.getJobToken(conf
.getCredentials());
// This password is used as shared secret key between this application and
// child pipes process
byte[] password = jobToken.getPassword();
The actual NPE is thrown in the final line, as jobToken is null.
As you're using local mode (local job tracker and local file system), I'm not sure that security should be 'enabled'. Do you have either of the following properties configured in your core-site.xml or hdfs-site.xml configuration files (and if so, what are their values)?
hadoop.security.authentication
hadoop.security.authorization
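For reference, the stock insecure defaults look like this (a sketch of the default values, not taken from the poster's configuration):
<property>
  <name>hadoop.security.authentication</name>
  <value>simple</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>false</value>
</property>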
Possibly because your cluster is running in local mode. Do you have the following property in your mapred-site.xml file?
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>
    Let the MapReduce jobs run with the yarn framework.
  </description>
</property>
If you don't have this property, your cluster will run in local mode by default. I used to have exactly the same problem in local mode. After I added this property, the cluster ran in distributed mode and the problem was gone.
HTH,
Shumin

Hadoop - Reducer is waiting for Mapper inputs?

As explained in the title, when I execute my Hadoop program (and debug it in local mode) the following happens:
1. All 10 csv lines in my test data are handled correctly in the Mapper, the Partitioner and the RawComparator (OutputKeyComparatorClass) that is called after the map step. But the OutputValueGroupingComparatorClass's and the ReduceClass's functions do NOT get executed afterwards.
2. My application looks like the following. (Due to space constraints I omit the implementation of the classes I used as configuration parameters, until somebody has an idea that involves them.)
public class RetweetApplication {
    public static int DEBUG = 1;
    static String INPUT = "/home/ema/INPUT-H";
    static String OUTPUT = "/home/ema/OUTPUT-H " + (new Date()).toString();

    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(RetweetApplication.class);
        if (DEBUG > 0) {
            conf.set("mapred.job.tracker", "local");
            conf.set("fs.default.name", "file:///");
            conf.set("dfs.replication", "1");
        }
        FileInputFormat.setInputPaths(conf, new Path(INPUT));
        FileOutputFormat.setOutputPath(conf, new Path(OUTPUT));
        //conf.setOutputKeyClass(Text.class);
        //conf.setOutputValueClass(Text.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setMapperClass(RetweetMapper.class);
        conf.setPartitionerClass(TweetPartitioner.class);
        conf.setOutputKeyComparatorClass(TwitterValueGroupingComparator.class);
        conf.setOutputValueGroupingComparator(TwitterKeyGroupingComparator.class);
        conf.setReducerClass(RetweetReducer.class);
        conf.setOutputFormat(TextOutputFormat.class);
        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
3. I get the following console output:
12/05/22 03:51:05 INFO mapred.MapTask: io.sort.mb = 100
12/05/22 03:51:05 INFO mapred.MapTask: data buffer = 79691776/99614720
12/05/22 03:51:05 INFO mapred.MapTask: record buffer = 262144/327680
12/05/22 03:51:06 INFO mapred.JobClient: map 0% reduce 0%
12/05/22 03:51:11 INFO mapred.LocalJobRunner: file:/home/ema/INPUT-H/tweets:0+967
12/05/22 03:51:12 INFO mapred.JobClient: map 39% reduce 0%
12/05/22 03:51:14 INFO mapred.LocalJobRunner: file:/home/ema/INPUT-H/tweets:0+967
12/05/22 03:51:15 INFO mapred.MapTask: Starting flush of map output
12/05/22 03:51:15 INFO mapred.MapTask: Finished spill 0
12/05/22 03:51:15 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/05/22 03:51:15 INFO mapred.JobClient: map 79% reduce 0%
12/05/22 03:51:17 INFO mapred.LocalJobRunner: file:/home/ema/INPUT-H/tweets:0+967
12/05/22 03:51:17 INFO mapred.LocalJobRunner: file:/home/ema/INPUT-H/tweets:0+967
12/05/22 03:51:17 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/05/22 03:51:17 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@35eed0
12/05/22 03:51:17 INFO mapred.ReduceTask: ShuffleRamManager: MemoryLimit=709551680, MaxSingleShuffleLimit=177387920
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging on-disk files
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread waiting: Thread for merging on-disk files
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging in memory files
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for polling Map Completion Events
12/05/22 03:51:18 INFO mapred.JobClient: map 100% reduce 0%
12/05/22 03:51:23 INFO mapred.LocalJobRunner: reduce > copy >
The last three lines ("Need another 1 map output(s)...", "Scheduled 0 outputs..." and "reduce > copy >") repeat endlessly from this point.
4. A lot of threads are still active after the mapper has seen every tuple:
RetweetApplication (1) [Remote Java Application]
OpenJDK Client VM[localhost:5002]
Thread [main] (Running)
Thread [Thread-2] (Running)
Daemon Thread [communication thread] (Running)
Thread [MapOutputCopier attempt_local_0001_r_000000_0.0] (Running)
Thread [MapOutputCopier attempt_local_0001_r_000000_0.1] (Running)
Thread [MapOutputCopier attempt_local_0001_r_000000_0.2] (Running)
Thread [MapOutputCopier attempt_local_0001_r_000000_0.4] (Running)
Thread [MapOutputCopier attempt_local_0001_r_000000_0.3] (Running)
Daemon Thread [Thread for merging on-disk files] (Running)
Daemon Thread [Thread for merging in memory files] (Running)
Daemon Thread [Thread for polling Map Completion Events] (Running)
Is there any reason why Hadoop expects more output from the mapper (see the repeated lines in the log) than I put into the input directory? As already mentioned, I verified in the debugger that ALL inputs are properly processed in the mapper/partitioner/etc.
UPDATE
With the help of Chris (see comments) I found out that my program was NOT started in local mode as I expected: the isLocal variable in the ReduceTask class is set to false, though it should be true.
It is absolutely unclear to me why this happens, since the three options that have to be set to enable standalone mode were set the right way. Surprisingly, though the local setting was ignored, the "read from normal disk" setting wasn't, which is very strange IMHO, because I thought local mode and the file:/// protocol were coupled.
While debugging ReduceTask I set the isLocal variable to true by evaluating isLocal=true in my debug view, and then tried to execute the rest of the program. It did not work out; this is the stacktrace:
12/05/22 14:28:28 INFO mapred.LocalJobRunner:
12/05/22 14:28:28 INFO mapred.Merger: Merging 1 sorted segments
12/05/22 14:28:28 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1956 bytes
12/05/22 14:28:28 INFO mapred.LocalJobRunner:
12/05/22 14:28:29 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:29 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.
12/05/22 14:28:30 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 0 time(s).
12/05/22 14:28:31 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 1 time(s).
12/05/22 14:28:32 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 2 time(s).
12/05/22 14:28:33 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 3 time(s).
12/05/22 14:28:34 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 4 time(s).
12/05/22 14:28:35 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 5 time(s).
12/05/22 14:28:36 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 6 time(s).
12/05/22 14:28:37 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 7 time(s).
12/05/22 14:28:38 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 8 time(s).
12/05/22 14:28:39 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 9 time(s).
12/05/22 14:28:39 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:39 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.
12/05/22 14:28:39 WARN mapred.LocalJobRunner: job_local_0001
java.net.ConnectException: Call to master/127.0.0.1:9001 failed on connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
at org.apache.hadoop.ipc.Client.call(Client.java:1071)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:446)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
at org.apache.hadoop.ipc.Client.call(Client.java:1046)
... 17 more
12/05/22 14:28:39 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:39 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.
12/05/22 14:28:39 INFO mapred.JobClient: Job complete: job_local_0001
12/05/22 14:28:39 INFO mapred.JobClient: Counters: 20
12/05/22 14:28:39 INFO mapred.JobClient: File Input Format Counters
12/05/22 14:28:39 INFO mapred.JobClient: Bytes Read=967
12/05/22 14:28:39 INFO mapred.JobClient: FileSystemCounters
12/05/22 14:28:39 INFO mapred.JobClient: FILE_BYTES_READ=14093
12/05/22 14:28:39 INFO mapred.JobClient: FILE_BYTES_WRITTEN=47859
12/05/22 14:28:39 INFO mapred.JobClient: Map-Reduce Framework
12/05/22 14:28:39 INFO mapred.JobClient: Map output materialized bytes=1960
12/05/22 14:28:39 INFO mapred.JobClient: Map input records=10
12/05/22 14:28:39 INFO mapred.JobClient: Reduce shuffle bytes=0
12/05/22 14:28:39 INFO mapred.JobClient: Spilled Records=10
12/05/22 14:28:39 INFO mapred.JobClient: Map output bytes=1934
12/05/22 14:28:39 INFO mapred.JobClient: Total committed heap usage (bytes)=115937280
12/05/22 14:28:39 INFO mapred.JobClient: CPU time spent (ms)=0
12/05/22 14:28:39 INFO mapred.JobClient: Map input bytes=967
12/05/22 14:28:39 INFO mapred.JobClient: SPLIT_RAW_BYTES=82
12/05/22 14:28:39 INFO mapred.JobClient: Combine input records=0
12/05/22 14:28:39 INFO mapred.JobClient: Reduce input records=0
12/05/22 14:28:39 INFO mapred.JobClient: Reduce input groups=0
12/05/22 14:28:39 INFO mapred.JobClient: Combine output records=0
12/05/22 14:28:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
12/05/22 14:28:39 INFO mapred.JobClient: Reduce output records=0
12/05/22 14:28:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
12/05/22 14:28:39 INFO mapred.JobClient: Map output records=10
12/05/22 14:28:39 INFO mapred.JobClient: Job Failed: NA
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at uni.kassel.macek.rtprep.RetweetApplication.main(RetweetApplication.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Since this stacktrace shows that port 9001 is used during execution, I guess that somehow the XML configuration file overrides the settings made in Java code (which I use for testing). That is strange, since I have read over and over on the internet that Java overrides the XML configuration. If nobody knows how to correct this, I'll try to simply erase all configuration XMLs. Perhaps that solves the problem...
NEW UPDATE
Renaming Hadoop's conf folder solved the problem of the waiting copier, and the program now executes to the end. Sadly, execution no longer waits for my debugger, although HADOOP_OPTS is set correctly.
SUMMARY: It's only a configuration issue: XML may (for some configuration parameters) override Java. If somebody knew how I can get debugging to run again, it would be perfect, but for now I'm just glad I don't see this stacktrace anymore! ;)
Thank you Chris for your time and efforts!
Sorry I didn't see this before, but you appear to have two important configuration properties set to final in your conf XML files, as denoted by the following log statements:
12/05/22 14:28:29 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:29 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.
This means that your job is unable to actually run in local mode: it starts in local mode, but the reducer reads the serialized job configuration, determines that it is not in local mode, and tries to fetch map outputs via the task tracker ports.
You said your fix was to rename the conf folder. This takes Hadoop back to the default configuration, where these two properties are not marked as 'final'.
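For reference, a final-marked property in a site XML file looks like the sketch below (the value shown is illustrative); removing the <final>true</final> element, or the whole entry, lets per-job settings such as conf.set("mapred.job.tracker", "local") take effect again:
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
  <final>true</final>
</property>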
