java.lang.NullPointerException at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.close - hadoop

I am running two map-reduce pairs. The output of first map-reduce is being used as the input for the next map-reduce. In order to do that I have given the job.setOutputFormatClass(SequenceFileOutputFormat.class). While running the following Driver class:
package org;
import org.apache.commons.configuration.ConfigurationFactory;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.VectorWritable;
public class Driver1 extends Configured implements Tool
{
public int run(String[] args) throws Exception
{
if(args.length !=3) {
System.err.println("Usage: MaxTemperatureDriver <input path> <outputpath>");
System.exit(-1);
}
//ConfFactory WorkFlow=new ConfFactory(new Path("/input.txt"),new Path("/output.txt"),TextInputFormat.class,VarLongWritable.class,Text.class,VarLongWritable.class,VectorWritable.class,SequenceFileOutputFormat.class);
Job job = new Job();
Job job1=new Job();
job.setJarByClass(Driver1.class);
job.setJobName("Max Temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
job.setMapperClass(UserVectorMapper.class);
job.setReducerClass(UserVectorReducer.class);
job.setOutputKeyClass(VarLongWritable.class);
job.setOutputValueClass(VectorWritable.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job1.setJarByClass(Driver1.class);
//job.setJobName("Max Temperature");
job1.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.addInputPath(job1, new Path("output/part-r-00000"));
FileOutputFormat.setOutputPath(job1,new Path(args[2]));
job1.setMapperClass(ItemToItemPrefMapper.class);
//job1.setReducerClass(UserVectorReducer.class);
job1.setOutputKeyClass(VectorWritable.class);
job1.setOutputValueClass(VectorWritable.class);
job1.setOutputFormatClass(SequenceFileOutputFormat.class);
System.exit(job.waitForCompletion(true) && job1.waitForCompletion(true) ? 0:1);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
Driver1 driver = new Driver1();
int exitCode = ToolRunner.run(driver, args);
System.exit(exitCode);
}
}
I am getting the following runtime log.
15/02/24 20:00:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/24 20:00:49 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/02/24 20:00:49 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/02/24 20:00:49 INFO input.FileInputFormat: Total input paths to process : 1
15/02/24 20:00:49 WARN snappy.LoadSnappy: Snappy native library not loaded
15/02/24 20:00:49 INFO mapred.JobClient: Running job: job_local1723586736_0001
15/02/24 20:00:49 INFO mapred.LocalJobRunner: Waiting for map tasks
15/02/24 20:00:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1723586736_0001_m_000000_0
15/02/24 20:00:49 INFO util.ProcessTree: setsid exited with exit code 0
15/02/24 20:00:49 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#1185f32
15/02/24 20:00:49 INFO mapred.MapTask: Processing split: file:/home/smaiti/workspace/recommendationsy/data.txt:0+1979173
15/02/24 20:00:50 INFO mapred.MapTask: io.sort.mb = 100
15/02/24 20:00:50 INFO mapred.MapTask: data buffer = 79691776/99614720
15/02/24 20:00:50 INFO mapred.MapTask: record buffer = 262144/327680
15/02/24 20:00:50 INFO mapred.JobClient: map 0% reduce 0%
15/02/24 20:00:50 INFO mapred.MapTask: Starting flush of map output
15/02/24 20:00:51 INFO mapred.MapTask: Finished spill 0
15/02/24 20:00:51 INFO mapred.Task: Task:attempt_local1723586736_0001_m_000000_0 is done. And is in the process of commiting
15/02/24 20:00:51 INFO mapred.LocalJobRunner:
15/02/24 20:00:51 INFO mapred.Task: Task 'attempt_local1723586736_0001_m_000000_0' done.
15/02/24 20:00:51 INFO mapred.LocalJobRunner: Finishing task: attempt_local1723586736_0001_m_000000_0
15/02/24 20:00:51 INFO mapred.LocalJobRunner: Map task executor complete.
15/02/24 20:00:51 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#9cce9
15/02/24 20:00:51 INFO mapred.LocalJobRunner:
15/02/24 20:00:51 INFO mapred.Merger: Merging 1 sorted segments
15/02/24 20:00:51 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 2074779 bytes
15/02/24 20:00:51 INFO mapred.LocalJobRunner:
15/02/24 20:00:51 INFO mapred.Task: Task:attempt_local1723586736_0001_r_000000_0 is done. And is in the process of commiting
15/02/24 20:00:51 INFO mapred.LocalJobRunner:
15/02/24 20:00:51 INFO mapred.Task: Task attempt_local1723586736_0001_r_000000_0 is allowed to commit now
15/02/24 20:00:51 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1723586736_0001_r_000000_0' to output
15/02/24 20:00:51 INFO mapred.LocalJobRunner: reduce > reduce
15/02/24 20:00:51 INFO mapred.Task: Task 'attempt_local1723586736_0001_r_000000_0' done.
15/02/24 20:00:51 INFO mapred.JobClient: map 100% reduce 100%
15/02/24 20:00:51 INFO mapred.JobClient: Job complete: job_local1723586736_0001
15/02/24 20:00:51 INFO mapred.JobClient: Counters: 20
15/02/24 20:00:51 INFO mapred.JobClient: File Output Format Counters
15/02/24 20:00:51 INFO mapred.JobClient: Bytes Written=1012481
15/02/24 20:00:51 INFO mapred.JobClient: File Input Format Counters
15/02/24 20:00:51 INFO mapred.JobClient: Bytes Read=1979173
15/02/24 20:00:51 INFO mapred.JobClient: FileSystemCounters
15/02/24 20:00:51 INFO mapred.JobClient: FILE_BYTES_READ=6033479
15/02/24 20:00:51 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5264031
15/02/24 20:00:51 INFO mapred.JobClient: Map-Reduce Framework
15/02/24 20:00:51 INFO mapred.JobClient: Reduce input groups=943
15/02/24 20:00:51 INFO mapred.JobClient: Map output materialized bytes=2074783
15/02/24 20:00:51 INFO mapred.JobClient: Combine output records=0
15/02/24 20:00:51 INFO mapred.JobClient: Map input records=100000
15/02/24 20:00:51 INFO mapred.JobClient: Reduce shuffle bytes=0
15/02/24 20:00:51 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
15/02/24 20:00:51 INFO mapred.JobClient: Reduce output records=943
15/02/24 20:00:51 INFO mapred.JobClient: Spilled Records=200000
15/02/24 20:00:51 INFO mapred.JobClient: Map output bytes=1874777
15/02/24 20:00:51 INFO mapred.JobClient: Total committed heap usage (bytes)=415760384
15/02/24 20:00:51 INFO mapred.JobClient: CPU time spent (ms)=0
15/02/24 20:00:51 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
15/02/24 20:00:51 INFO mapred.JobClient: SPLIT_RAW_BYTES=118
15/02/24 20:00:51 INFO mapred.JobClient: Map output records=100000
15/02/24 20:00:51 INFO mapred.JobClient: Combine input records=0
15/02/24 20:00:51 INFO mapred.JobClient: Reduce input records=100000
15/02/24 20:00:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/02/24 20:00:51 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/02/24 20:00:51 INFO input.FileInputFormat: Total input paths to process : 1
15/02/24 20:00:51 INFO mapred.JobClient: Running job: job_local735350013_0002
15/02/24 20:00:51 INFO mapred.LocalJobRunner: Waiting for map tasks
15/02/24 20:00:51 INFO mapred.LocalJobRunner: Starting task: attempt_local735350013_0002_m_000000_0
15/02/24 20:00:51 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#1a970
15/02/24 20:00:51 INFO mapred.MapTask: Processing split: file:/home/smaiti/workspace/recommendationsy/output/part-r-00000:0+1004621
15/02/24 20:00:51 INFO mapred.MapTask: io.sort.mb = 100
15/02/24 20:00:51 INFO mapred.MapTask: data buffer = 79691776/99614720
15/02/24 20:00:51 INFO mapred.MapTask: record buffer = 262144/327680
15/02/24 20:00:51 INFO mapred.MapTask: Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader#9cc591
java.lang.NullPointerException
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.close(SequenceFileRecordReader.java:101)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:496)
at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1776)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:778)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/24 20:00:51 INFO mapred.LocalJobRunner: Map task executor complete.
15/02/24 20:00:51 WARN mapred.LocalJobRunner: job_local735350013_0002
java.lang.Exception: java.lang.ClassCastException: class org.apache.mahout.math.VectorWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.ClassCastException: class org.apache.mahout.math.VectorWritable
at java.lang.Class.asSubclass(Class.java:3208)
at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:795)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:964)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:673)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/24 20:00:52 INFO mapred.JobClient: map 0% reduce 0%
15/02/24 20:00:52 INFO mapred.JobClient: Job complete: job_local735350013_0002
15/02/24 20:00:52 INFO mapred.JobClient: Counters: 0
The first exception that I am getting is this:
java.lang.NullPointerException
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.close(SequenceFileRecordReader.java:101)
Please help.

This is mainly because Hadoop is confused while Serializing the data.
Make sure to
You should set Input and output file format class to both the reducers.
Check that Inputformat of second class is OutputFormat of first class.
It might be possible that intermediate file format is different from what the reducer is expecting to read.
Maintain consistent FileFormats across your program.

Related

Map Reduce File Output Counter is zero

I am writing Map Reduce code for Inverted Indexing of a file which contains each line as "Doc_id Title Document Contents".
I am not able to figure out why File output format counter is zero although map reduce jobs are successfully completed without any Exception.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class InvertedIndex {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, Text> {
private Text word = new Text();
private Text docID_Title = new Text();
//RemoveStopWords is a different class
static RemoveStopWords rmvStpWrd = new RemoveStopWords();
//Stemmer is a different class
Stemmer stemmer = new Stemmer();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
rmvStpWrd.makeStopWordList();
StringTokenizer itr = new StringTokenizer(value.toString().replaceAll(" [^\\p{L}]", " "));
//fetching id of the document
String id = null;
String title = null;
if(itr.hasMoreTokens())
id = itr.nextToken();
//fetching title of the document
if(itr.hasMoreTokens())
title = itr.nextToken();
String ID_TITLE = id + title;
if(id!=null)
docID_Title.set(ID_TITLE);
while (itr.hasMoreTokens()) {
/*manipulation of tokens:
* First we remove stop words
* Then Stem the words
*/
String temp = itr.nextToken().toLowerCase();
if(RemoveStopWords.isStopWord(temp)) {
continue;
}
else {
//now the word is not a stop word
//we will stem it
char[] a;
stemmer.add((a = temp.toCharArray()), a.length);
stemmer.stem();
temp = stemmer.toString();
word.set(temp);
context.write(word, docID_Title);
}
}//end while
}//end map
}//end mapper
public static class IntSumReducer
extends Reducer<Text,Text,Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
//to iterate over the values
Iterator<Text> itr = values.iterator();
String old = itr.next().toString();
int freq = 1;
String next = null;
boolean isThere = true;
StringBuilder stringBuilder = new StringBuilder();
while(itr.hasNext()) {
//freq counts number of times a word comes in a document
freq = 1;
while((isThere = itr.hasNext())) {
next = itr.next().toString();
if(old == next)
freq++;
else {
//the loop break when we get different docID_Title for the word(key)
break;
}
//if more data is there
if(isThere) {
old = old +"_"+ freq;
stringBuilder.append(old);
stringBuilder.append(" | ");
old = next;
context.write(key, new Text(stringBuilder.toString()));
stringBuilder.setLength(0);
}
else {
//for the last key
freq++;
old = old +"_"+ freq;
stringBuilder.append(old);
stringBuilder.append(" | ");
old = next;
context.write(key, new Text(stringBuilder.toString()));
}//end else
}//end while
}//end while
}//end reduce
}//end reducer
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "InvertedIndex");
job.setJarByClass(InvertedIndex.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}//end main
}//end InvertexIndex
This is the output I am getting:
16/10/03 15:34:21 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/10/03 15:34:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/10/03 15:34:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/10/03 15:34:22 INFO input.FileInputFormat: Total input paths to process : 1
16/10/03 15:34:22 INFO mapreduce.JobSubmitter: number of splits:1
16/10/03 15:34:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local507694567_0001
16/10/03 15:34:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/10/03 15:34:22 INFO mapreduce.Job: Running job: job_local507694567_0001
16/10/03 15:34:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/10/03 15:34:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/10/03 15:34:22 INFO mapred.LocalJobRunner: Waiting for map tasks
16/10/03 15:34:22 INFO mapred.LocalJobRunner: Starting task: attempt_local507694567_0001_m_000000_0
16/10/03 15:34:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/10/03 15:34:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/sonu/ss.txt:0+1002072
16/10/03 15:34:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/10/03 15:34:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/10/03 15:34:23 INFO mapred.MapTask: soft limit at 83886080
16/10/03 15:34:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/10/03 15:34:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/10/03 15:34:23 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/10/03 15:34:23 INFO mapreduce.Job: Job job_local507694567_0001 running in uber mode : false
16/10/03 15:34:23 INFO mapreduce.Job: map 0% reduce 0%
16/10/03 15:34:24 INFO mapred.LocalJobRunner:
16/10/03 15:34:24 INFO mapred.MapTask: Starting flush of map output
16/10/03 15:34:24 INFO mapred.MapTask: Spilling map output
16/10/03 15:34:24 INFO mapred.MapTask: bufstart = 0; bufend = 2206696; bufvoid = 104857600
16/10/03 15:34:24 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25789248(103156992); length = 425149/6553600
16/10/03 15:34:24 INFO mapred.MapTask: Finished spill 0
16/10/03 15:34:24 INFO mapred.Task: Task:attempt_local507694567_0001_m_000000_0 is done. And is in the process of committing
16/10/03 15:34:24 INFO mapred.LocalJobRunner: map
16/10/03 15:34:24 INFO mapred.Task: Task 'attempt_local507694567_0001_m_000000_0' done.
16/10/03 15:34:24 INFO mapred.LocalJobRunner: Finishing task: attempt_local507694567_0001_m_000000_0
16/10/03 15:34:24 INFO mapred.LocalJobRunner: map task executor complete.
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Starting task: attempt_local507694567_0001_r_000000_0
16/10/03 15:34:25 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:25 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/10/03 15:34:25 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#5d0e7307
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=333971456, maxSingleShuffleLimit=83492864, mergeThreshold=220421168, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/10/03 15:34:25 INFO reduce.EventFetcher: attempt_local507694567_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/10/03 15:34:25 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local507694567_0001_m_000000_0 decomp: 2 len: 6 to MEMORY
16/10/03 15:34:25 INFO reduce.InMemoryMapOutput: Read 2 bytes from map-output for attempt_local507694567_0001_m_000000_0
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->2
16/10/03 15:34:25 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
16/10/03 15:34:25 INFO mapred.Merger: Merging 1 sorted segments
16/10/03 15:34:25 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merged 1 segments, 2 bytes to disk to satisfy reduce memory limit
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merging 1 files, 6 bytes from disk
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/10/03 15:34:25 INFO mapred.Merger: Merging 1 sorted segments
16/10/03 15:34:25 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/10/03 15:34:25 INFO mapred.Task: Task:attempt_local507694567_0001_r_000000_0 is done. And is in the process of committing
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO mapred.Task: Task attempt_local507694567_0001_r_000000_0 is allowed to commit now
16/10/03 15:34:25 INFO output.FileOutputCommitter: Saved output of task 'attempt_local507694567_0001_r_000000_0' to hdfs://localhost:9000/user/sonu/output/_temporary/0/task_local507694567_0001_r_000000
16/10/03 15:34:25 INFO mapred.LocalJobRunner: reduce > reduce
16/10/03 15:34:25 INFO mapred.Task: Task 'attempt_local507694567_0001_r_000000_0' done.
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Finishing task: attempt_local507694567_0001_r_000000_0
16/10/03 15:34:25 INFO mapred.LocalJobRunner: reduce task executor complete.
16/10/03 15:34:25 INFO mapreduce.Job: map 100% reduce 100%
16/10/03 15:34:25 INFO mapreduce.Job: Job job_local507694567_0001 completed successfully
16/10/03 15:34:25 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=17342
FILE: Number of bytes written=571556
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2004144
HDFS: Number of bytes written=0
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=53
Map output records=106288
Map output bytes=2206696
Map output materialized bytes=6
Input split bytes=103
Combine input records=106288
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=6
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=12
Total committed heap usage (bytes)=562036736
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1002072
File Output Format Counters
Bytes Written=0

How do i know if my hadoop mapreduce application is running in distributed mode

I'm very new in hadoop mapreduce, however i install the multinode cluster but i still get a sequential excution.
How can i work out if my program is running on the other machines in the cluster or not?
This is the result of execution :
Picked up _JAVA_OPTIONS: -Xmx1g
16/06/07 14:49:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/07 14:49:19 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/06/07 14:49:19 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/06/07 14:49:21 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/06/07 14:49:21 INFO input.FileInputFormat: Total input paths to process : 3
16/06/07 14:49:22 INFO mapreduce.JobSubmitter: number of splits:3
16/06/07 14:49:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1881318657_0001
16/06/07 14:49:24 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/06/07 14:49:24 INFO mapreduce.Job: Running job: job_local1881318657_0001
16/06/07 14:49:24 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/06/07 14:49:24 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/06/07 14:49:24 INFO mapred.LocalJobRunner: Waiting for map tasks
16/06/07 14:49:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000000_0
16/06/07 14:49:24 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 14:49:24 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia.txt:0+1172207
16/06/07 14:49:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 14:49:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 14:49:24 INFO mapred.MapTask: soft limit at 83886080
16/06/07 14:49:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 14:49:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 14:49:24 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 14:49:25 INFO mapreduce.Job: Job job_local1881318657_0001 running in uber mode : false
16/06/07 14:49:25 INFO mapreduce.Job: map 0% reduce 0%
16/06/07 14:49:31 INFO mapred.LocalJobRunner: map > map
16/06/07 14:49:31 INFO mapreduce.Job: map 22% reduce 0%
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-2.9889415942690763E-9
-2.9889415942690763E-9
-2.9889415942690763E-9
-2.9287384547432996E-9
-2.898469757139896E-9
-2.898469757139896E-9
-2.880377562441664E-9
-2.880377562441664E-9
-2.880377562441664E-9
-2.8430632294667886E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
931
16/06/07 15:00:44 INFO mapred.LocalJobRunner: map > map
16/06/07 15:00:44 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:00:44 INFO mapred.MapTask: Spilling map output
16/06/07 15:00:44 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:00:44 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:00:46 INFO mapred.MapTask: Finished spill 0
16/06/07 15:00:46 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000000_0 is done. And is in the process of committing
16/06/07 15:00:47 INFO mapred.LocalJobRunner: map
16/06/07 15:00:47 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000000_0' done.
16/06/07 15:00:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000000_0
16/06/07 15:00:47 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000001_0
16/06/07 15:00:48 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:00:48 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia1.txt:0+1172207
16/06/07 15:00:48 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 15:00:48 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 15:00:48 INFO mapred.MapTask: soft limit at 83886080
16/06/07 15:00:48 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 15:00:48 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 15:00:48 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 15:00:48 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:01:47 INFO mapred.LocalJobRunner: map > map
16/06/07 15:01:48 INFO mapreduce.Job: map 56% reduce 0%
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.001716001136338E-9
-2.997252637652067E-9
-2.997252637652067E-9
-2.9593407930592893E-9
-2.9178102507568847E-9
-2.9178102507568847E-9
-2.9178102507568847E-9
-2.8542232742481287E-9
-2.8542232742481287E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8222418341121026E-9
-2.8222418341121026E-9
907
16/06/07 15:11:30 INFO mapred.LocalJobRunner: map > map
16/06/07 15:11:30 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:11:30 INFO mapred.MapTask: Spilling map output
16/06/07 15:11:30 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:11:30 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:11:30 INFO mapred.MapTask: Finished spill 0
16/06/07 15:11:30 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000001_0 is done. And is in the process of committing
16/06/07 15:11:30 INFO mapred.LocalJobRunner: map
16/06/07 15:11:30 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000001_0' done.
16/06/07 15:11:30 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000001_0
16/06/07 15:11:30 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000002_0
16/06/07 15:11:30 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:11:30 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia2.txt:0+1172207
16/06/07 15:11:30 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:11:31 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 15:11:31 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 15:11:31 INFO mapred.MapTask: soft limit at 83886080
16/06/07 15:11:31 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 15:11:31 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 15:11:31 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 15:11:37 INFO mapred.LocalJobRunner: map > map
16/06/07 15:11:38 INFO mapreduce.Job: map 89% reduce 0%
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.0090989883906007E-9
-2.9474075636124447E-9
-2.9474075636124447E-9
-2.9474075636124447E-9
-2.9388849943338927E-9
-2.9388849943338927E-9
-2.8915704649620403E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
925
16/06/07 15:20:19 INFO mapred.LocalJobRunner: map > map
16/06/07 15:20:19 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:20:19 INFO mapred.MapTask: Spilling map output
16/06/07 15:20:19 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:20:19 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:20:20 INFO mapred.MapTask: Finished spill 0
16/06/07 15:20:20 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000002_0 is done. And is in the process of committing
16/06/07 15:20:22 INFO mapred.LocalJobRunner: map
16/06/07 15:20:22 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000002_0' done.
16/06/07 15:20:22 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000002_0
16/06/07 15:20:22 INFO mapred.LocalJobRunner: map task executor complete.
16/06/07 15:20:22 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:20:23 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/06/07 15:20:23 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_r_000000_0
16/06/07 15:20:24 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:20:24 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#7f5be2d5
16/06/07 15:20:25 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=668309888, maxSingleShuffleLimit=167077472, mergeThreshold=441084544, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/06/07 15:20:25 INFO reduce.EventFetcher: attempt_local1881318657_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/06/07 15:20:28 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000002_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:29 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000002_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->14157
16/06/07 15:20:30 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000001_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:30 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000001_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 2, commitMemory -> 14157, usedMemory ->28314
16/06/07 15:20:30 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000000_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:30 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000000_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 3, commitMemory -> 28314, usedMemory ->42471
16/06/07 15:20:30 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/06/07 15:20:30 INFO mapred.LocalJobRunner: 3 / 3 copied.
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: finalMerge called with 3 in-memory map-outputs and 0 on-disk map-outputs
16/06/07 15:20:30 INFO mapred.Merger: Merging 3 sorted segments
16/06/07 15:20:30 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 42435 bytes
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merged 3 segments, 42471 bytes to disk to satisfy reduce memory limit
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merging 1 files, 42471 bytes from disk
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/06/07 15:20:30 INFO mapred.Merger: Merging 1 sorted segments
16/06/07 15:20:30 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 42455 bytes
16/06/07 15:20:30 INFO mapred.LocalJobRunner: 3 / 3 copied.
16/06/07 15:20:33 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:33 INFO mapreduce.Job: map 100% reduce 67%
16/06/07 15:20:36 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:38 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/06/07 15:20:42 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:42 INFO mapreduce.Job: map 100% reduce 100%
16/06/07 15:20:44 INFO mapred.Task: Task:attempt_local1881318657_0001_r_000000_0 is done. And is in the process of committing
16/06/07 15:20:44 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:44 INFO mapred.Task: Task attempt_local1881318657_0001_r_000000_0 is allowed to commit now
16/06/07 15:20:45 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1881318657_0001_r_000000_0' to hdfs://master:9000/output2/_temporary/0/task_local1881318657_0001_r_000000
16/06/07 15:20:45 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:45 INFO mapred.Task: Task 'attempt_local1881318657_0001_r_000000_0' done.
16/06/07 15:20:45 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_r_000000_0
16/06/07 15:20:45 INFO mapred.LocalJobRunner: reduce task executor complete.
16/06/07 15:20:45 INFO mapreduce.Job: Job job_local1881318657_0001 completed successfully
16/06/07 15:20:46 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=177067554
FILE: Number of bytes written=179551452
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=10549863
HDFS: Number of bytes written=42438
HDFS: Number of read operations=37
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=42453
Map output materialized bytes=42483
Input split bytes=557
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=42483
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=227283
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=2477260800
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=42438
peace
By the job ID. Your's says: job_local1881318657_0001 running in uber mode : false. Which is a local job. If you ran on a cluster it would just be the job and the identifiers of the app master and attempts.
You need to check the JobTracker ( default port 50030) and explore the job id details mentioned in the above logs.
You can monitor the jobs at:
localhost:8088

Cannot run the job on hadoop cluster. only runs using LocalJobRunner

I have submitted a MR job using hadoop jar command with the following command on CDH5 Beta 2
hadoop jar ./hadoop-examples-0.0.1-SNAPSHOT.jar com.aravind.learning.hadoop.mapred.join.ReduceSideJoinDriver tech_talks/users.csv tech_talks/ratings.csv tech_talks/output/ReduceSideJoinDriver/
I've also tried providing the fs name and job tracker url explicitly as below without any success
hadoop jar ./hadoop-examples-0.0.1-SNAPSHOT.jar com.aravind.learning.hadoop.mapred.join.ReduceSideJoinDriver -Dfs.default.name=hdfs://abc.com:8020 -Dmapreduce.job.tracker=x.x.x.x:8021 tech_talks/users.csv tech_talks/ratings.csv tech_talks/output/ReduceSideJoinDriver/
The job runs successfully but is using the LocalJobRunner instead of submitting to the cluster. The output is written to HDFS and is correct. Not sure what I am doing wrong here so appreciate your input. I've also tried explicitly specifying the fs and job tracker as below but have the same result
14/04/16 20:35:44 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/04/16 20:35:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/04/16 20:35:45 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/04/16 20:35:45 INFO input.FileInputFormat: Total input paths to process : 2
14/04/16 20:35:45 INFO mapreduce.JobSubmitter: number of splits:2
14/04/16 20:35:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1427968352_0001
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/staging/ird21427968352/.staging/job_local1427968352_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/staging/ird21427968352/.staging/job_local1427968352_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/local/localRunner/ird2/job_local1427968352_0001/job_local1427968352_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/local/localRunner/ird2/job_local1427968352_0001/job_local1427968352_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/04/16 20:35:46 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/04/16 20:35:46 INFO mapreduce.Job: Running job: job_local1427968352_0001
14/04/16 20:35:46 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/04/16 20:35:46 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14/04/16 20:35:46 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/16 20:35:46 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:46 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:46 INFO mapred.MapTask: Processing split: hdfs://...:8020/user/ird2/tech_talks/ratings.csv:0+4388258
14/04/16 20:35:46 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/04/16 20:35:46 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/04/16 20:35:46 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/04/16 20:35:46 INFO mapred.MapTask: soft limit at 83886080
14/04/16 20:35:46 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/04/16 20:35:46 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/04/16 20:35:47 INFO mapreduce.Job: Job job_local1427968352_0001 running in uber mode : false
14/04/16 20:35:47 INFO mapreduce.Job: map 0% reduce 0%
14/04/16 20:35:48 INFO mapred.LocalJobRunner:
14/04/16 20:35:48 INFO mapred.MapTask: Starting flush of map output
14/04/16 20:35:48 INFO mapred.MapTask: Spilling map output
14/04/16 20:35:48 INFO mapred.MapTask: bufstart = 0; bufend = 6485388; bufvoid = 104857600
14/04/16 20:35:48 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 24860980(99443920); length = 1353417/6553600
14/04/16 20:35:49 INFO mapred.MapTask: Finished spill 0
14/04/16 20:35:49 INFO mapred.Task: Task:attempt_local1427968352_0001_m_000000_0 is done. And is in the process of committing
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map
14/04/16 20:35:49 INFO mapred.Task: Task 'attempt_local1427968352_0001_m_000000_0' done.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:49 INFO mapred.MapTask: Processing split: hdfs://...:8020/user/ird2/tech_talks/users.csv:0+186304
14/04/16 20:35:49 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/04/16 20:35:49 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/04/16 20:35:49 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/04/16 20:35:49 INFO mapred.MapTask: soft limit at 83886080
14/04/16 20:35:49 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/04/16 20:35:49 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/04/16 20:35:49 INFO mapred.LocalJobRunner:
14/04/16 20:35:49 INFO mapred.MapTask: Starting flush of map output
14/04/16 20:35:49 INFO mapred.MapTask: Spilling map output
14/04/16 20:35:49 INFO mapred.MapTask: bufstart = 0; bufend = 209667; bufvoid = 104857600
14/04/16 20:35:49 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26192144(104768576); length = 22253/6553600
14/04/16 20:35:49 INFO mapred.MapTask: Finished spill 0
14/04/16 20:35:49 INFO mapred.Task: Task:attempt_local1427968352_0001_m_000001_0 is done. And is in the process of committing
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map
14/04/16 20:35:49 INFO mapred.Task: Task 'attempt_local1427968352_0001_m_000001_0' done.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map task executor complete.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Waiting for reduce tasks
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_r_000000_0
14/04/16 20:35:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:49 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#5116331d
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
14/04/16 20:35:49 INFO reduce.EventFetcher: attempt_local1427968352_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
14/04/16 20:35:49 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1427968352_0001_m_000001_0 decomp: 220797 len: 220801 to MEMORY
14/04/16 20:35:49 INFO reduce.InMemoryMapOutput: Read 220797 bytes from map-output for attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 220797, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->220797
14/04/16 20:35:49 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1427968352_0001_m_000000_0 decomp: 7162100 len: 7162104 to MEMORY
14/04/16 20:35:49 INFO reduce.InMemoryMapOutput: Read 7162100 bytes from map-output for attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 7162100, inMemoryMapOutputs.size() -> 2, commitMemory -> 220797, usedMemory ->7382897
14/04/16 20:35:49 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
14/04/16 20:35:49 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
14/04/16 20:35:49 INFO mapred.Merger: Merging 2 sorted segments
14/04/16 20:35:49 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 7382885 bytes
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merged 2 segments, 7382897 bytes to disk to satisfy reduce memory limit
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merging 1 files, 7382899 bytes from disk
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
14/04/16 20:35:50 INFO mapred.Merger: Merging 1 sorted segments
14/04/16 20:35:50 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 7382889 bytes
14/04/16 20:35:50 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:50 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
14/04/16 20:35:50 INFO mapreduce.Job: map 100% reduce 0%
14/04/16 20:35:51 INFO mapred.Task: Task:attempt_local1427968352_0001_r_000000_0 is done. And is in the process of committing
14/04/16 20:35:51 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:51 INFO mapred.Task: Task attempt_local1427968352_0001_r_000000_0 is allowed to commit now
14/04/16 20:35:51 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1427968352_0001_r_000000_0' to hdfs://...:8020/user/ird2/tech_talks/output/ReduceSideJoinDriver/_temporary/0/task_local1427968352_0001_r_000000
14/04/16 20:35:51 INFO mapred.LocalJobRunner: reduce > reduce
14/04/16 20:35:51 INFO mapred.Task: Task 'attempt_local1427968352_0001_r_000000_0' done.
14/04/16 20:35:51 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_r_000000_0
14/04/16 20:35:51 INFO mapred.LocalJobRunner: reduce task executor complete.
14/04/16 20:35:52 INFO mapreduce.Job: map 100% reduce 100%
14/04/16 20:35:52 INFO mapreduce.Job: Job job_local1427968352_0001 completed successfully
14/04/16 20:35:52 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=14767932
FILE: Number of bytes written=29952985
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=13537382
HDFS: Number of bytes written=2949787
HDFS: Number of read operations=28
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=343919
Map output records=343919
Map output bytes=6695055
Map output materialized bytes=7382905
Input split bytes=272
Combine input records=0
Combine output records=0
Reduce input groups=5564
Reduce shuffle bytes=7382905
Reduce input records=343919
Reduce output records=5564
Spilled Records=687838
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=92
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=1416101888
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=4574562
File Output Format Counters
Bytes Written=2949787
Driver code
public class ReduceSideJoinDriver extends Configured implements Tool
{
#Override
public int run(String[] args) throws Exception
{
if (args.length != 3)
{
System.err.printf("Usage: %s [generic options] <input> <output>\n", getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Path usersFile = new Path(args[0]);
Path ratingsFile = new Path(args[1]);
Job job = Job.getInstance(getConf(), "Aravind - Reduce Side Join");
job.getConfiguration().setStrings(usersFile.getName(), "user");
job.getConfiguration().setStrings(ratingsFile.getName(), "rating");
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(TagAndRecord.class);
TextInputFormat.addInputPath(job, usersFile);
TextInputFormat.addInputPath(job, ratingsFile);
TextOutputFormat.setOutputPath(job, new Path(args[2]));
job.setMapperClass(ReduceSideJoinMapper.class);
job.setReducerClass(ReduceSideJoinReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String args[]) throws Exception
{
int exitCode = ToolRunner.run(new Configuration(), new ReduceSideJoinDriver(), args);
System.exit(exitCode);
}
}
Make sure you have valid following configuration files in hadoop classpath. By default configuration files are taken from the directory /etc/hadoop/conf. This activity should be performed a part of hadoop client node setup.
mapred-site.xml
yarn-site.xml
core-site.xml
If the above mentioned configuration files are empty. You got to pupulate the above files with right properties. Population can be achieved in two ways
In Cloudera Manager when click on service yarn, in action portion, there is an option Deploy client configuration along with start,stop etc. Use that option to deploy the client configuration.
Sometimes above option maynot work if the node is not managed by CM and yarn gateway is not configured on the node. use the option Download client configuration instead of deploy client Configuration. Extract the downloaded zip configuration file(above files) and copy those files to the location /etc/hadoop/conf manually.
For executing the jar either hadoop or yarn can be used.
Apparently, you can only submit a hadoop job from the node designated as the gateway node. Everything is working once I submitted the job from the gateway node.

Hadoop MapReduce: running two jobs inside one Configured/Tool produces no output on 2nd job

I need to run 2 map reduce jobs such that the 2nd takes as input the output from the first job. I'd like to do this within a single invocation, where MyClass extends Configured and implements Tool.
I've written the code, and it works as long as I don't run the two jobs within the same invocation (this works):
hadoop jar myjar.jar path.to.my.class.MyClass -i input -o output -m job1
hadoop jar myjar.jar path.to.my.class.MyClass -i dummy -o output -m job2
But this doesn't:
hadoop jar myjar.jar path.to.my.class.MyClass -i input -o output -m all
(-m stands for "mode")
In this case, the output of the first job does not make it to the mappers of the 2nd job (I figured this out by debugging), but I can't figure out why.
I've seen other posts on chaining, but they are for the "old" mapred api. And I need to run 3rd party code between the jobs, so I don't know if ChainMapper/ChainReducer will work for my use case.
Using hadoop version 1.0.3, AWS Elastic MapReduce distribution.
Code:
import java.io.IOException;
import org.apache.commons.cli.BasicParser;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.OptionBuilder;
import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MyClass extends Configured implements Tool {
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new HBasePrep(), args);
System.exit(res);
}
#Override
public int run(String[] args) throws Exception {
CommandLineParser parser = new BasicParser();
Options allOptions = setupOptions();
Configuration conf = getConf();
String[] argv_ = new GenericOptionsParser(conf, args).getRemainingArgs();
CommandLine cmdLine = parser.parse(allOptions, argv_);
boolean doJob1 = true;
boolean doJob2 = true;
if (cmdLine.hasOption('m')) {
String mode = cmdLine.getOptionValue('m');
if ("job1".equals(mode)) {
doJob2 = false;
} else if ("job2".equals(mode)){
doJob1 = false;
}
}
Path outPath = new Path(cmdLine.getOptionValue("output"), "job1out");
Job job = new Job(conf, "HBase Prep For Data Build");
Job job2 = new Job(conf, "HBase SessionIndex load");
if (doJob1) {
conf = job.getConfiguration();
String[] values = cmdLine.getOptionValues("input");
if (values != null && values.length > 0) {
for (String input : values) {
System.out.println("input:" + input);
FileInputFormat.addInputPaths(job, input);
}
}
job.setJarByClass(HBasePrep.class);
job.setMapperClass(SessionMapper.class);
MultipleOutputs.setCountersEnabled(job, false);
MultipleOutputs.addNamedOutput(job, "sessionindex", TextOutputFormat.class, Text.class, Text.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
job.setOutputFormatClass(HFileOutputFormat.class);
HTable hTable = new HTable(conf, "session");
// Auto configure partitioner and reducer
HFileOutputFormat.configureIncrementalLoad(job, hTable);
FileOutputFormat.setOutputPath(job, outPath);
if (!job.waitForCompletion(true)) {
return 1;
}
// Load generated HFiles into table
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(outPath, hTable);
FileSystem fs = FileSystem.get(outPath.toUri(), conf);
fs.delete(new Path(outPath, "cf"), true); # i delete this because after the hbase build load, it is left an empty directory which causes problems later
}
/////////////////////////////////////////////
// SECOND JOB //
/////////////////////////////////////////////
if (doJob2) {
conf = job2.getConfiguration();
System.out.println("-- job 2 input path : " + outPath.toString());
FileInputFormat.setInputPaths(job2, outPath.toString());
job2.setJarByClass(HBasePrep.class);
job2.setMapperClass(SessionIndexMapper.class);
MultipleOutputs.setCountersEnabled(job2, false);
job2.setMapOutputKeyClass(ImmutableBytesWritable.class);
job2.setMapOutputValueClass(KeyValue.class);
job2.setOutputFormatClass(HFileOutputFormat.class);
HTable hTable = new HTable(conf, "session_index_by_hour");
// Auto configure partitioner and reducer
HFileOutputFormat.configureIncrementalLoad(job2, hTable);
outPath = new Path(cmdLine.getOptionValue("output"), "job2out");
System.out.println("-- job 2 output path: " + outPath.toString());
FileOutputFormat.setOutputPath(job2, outPath);
if (!job2.waitForCompletion(true)) {
return 2;
}
// Load generated HFiles into table
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(outPath, hTable);
}
return 0;
}
public static class SessionMapper extends
Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
private MultipleOutputs<ImmutableBytesWritable, KeyValue> multiOut;
#Override
public void setup(Context context) throws IOException {
multiOut = new MultipleOutputs<ImmutableBytesWritable, KeyValue>(context);
}
#Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
...
context.write(..., ...); # this is called mutiple times
multiOut.write("sessionindex", new Text(...), new Text(...), "sessionindex");
}
}
public static class SessionIndexMapper extends
Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
#Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(new ImmutableBytesWritable(...), new KeyValue(...));
}
}
private static Options setupOptions() {
Option input = createOption("i", "input",
"input file(s) for the Map step", "path", Integer.MAX_VALUE,
true);
Option output = createOption("o", "output",
"output directory for the Reduce step", "path", 1, true);
Option mode = createOption("m", "mode",
"what mode ('all', 'job1', 'job2')", "-mode-", 1, false);
return new Options().addOption(input).addOption(output)
.addOption(mode);
}
public static Option createOption(String name, String longOpt, String desc,
String argName, int max, boolean required) {
OptionBuilder.withArgName(argName);
OptionBuilder.hasArgs(max);
OptionBuilder.withDescription(desc);
OptionBuilder.isRequired(required);
OptionBuilder.withLongOpt(longOpt);
return OptionBuilder.create(name);
}
}
Output (single invocation):
input:s3n://...snip...
13/12/09 23:08:43 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/09 23:08:43 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/12/09 23:08:43 INFO compress.CodecPool: Got brand-new compressor
13/12/09 23:08:43 INFO mapred.JobClient: Default number of map tasks: null
13/12/09 23:08:43 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 2
13/12/09 23:08:43 INFO mapred.JobClient: Default number of reduce tasks: 1
13/12/09 23:08:43 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
13/12/09 23:08:43 INFO mapred.JobClient: Setting group to hadoop
13/12/09 23:08:43 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 23:08:43 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/12/09 23:08:43 WARN lzo.LzoCodec: Could not find build properties file with revision hash
13/12/09 23:08:43 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
13/12/09 23:08:43 WARN snappy.LoadSnappy: Snappy native library is available
13/12/09 23:08:43 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/09 23:08:44 INFO mapred.JobClient: Running job: job_201312062235_0044
13/12/09 23:08:45 INFO mapred.JobClient: map 0% reduce 0%
13/12/09 23:09:09 INFO mapred.JobClient: map 100% reduce 0%
13/12/09 23:09:27 INFO mapred.JobClient: map 100% reduce 100%
13/12/09 23:09:32 INFO mapred.JobClient: Job complete: job_201312062235_0044
13/12/09 23:09:32 INFO mapred.JobClient: Counters: 42
13/12/09 23:09:32 INFO mapred.JobClient: MyCounter1
13/12/09 23:09:32 INFO mapred.JobClient: ValidCurrentDay=3526
13/12/09 23:09:32 INFO mapred.JobClient: Job Counters
13/12/09 23:09:32 INFO mapred.JobClient: Launched reduce tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=19693
13/12/09 23:09:32 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 23:09:32 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/09 23:09:32 INFO mapred.JobClient: Rack-local map tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: Launched map tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15201
13/12/09 23:09:32 INFO mapred.JobClient: File Output Format Counters
13/12/09 23:09:32 INFO mapred.JobClient: Bytes Written=1979245
13/12/09 23:09:32 INFO mapred.JobClient: FileSystemCounters
13/12/09 23:09:32 INFO mapred.JobClient: S3N_BYTES_READ=51212
13/12/09 23:09:32 INFO mapred.JobClient: FILE_BYTES_READ=400417
13/12/09 23:09:32 INFO mapred.JobClient: HDFS_BYTES_READ=231
13/12/09 23:09:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=859881
13/12/09 23:09:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2181624
13/12/09 23:09:32 INFO mapred.JobClient: File Input Format Counters
13/12/09 23:09:32 INFO mapred.JobClient: Bytes Read=51212
13/12/09 23:09:32 INFO mapred.JobClient: MyCounter2
13/12/09 23:09:32 INFO mapred.JobClient: ASCII=3526
13/12/09 23:09:32 INFO mapred.JobClient: StatsUnaggregatedMapEventTypeCurrentDay
13/12/09 23:09:32 INFO mapred.JobClient: adProgress0=343
13/12/09 23:09:32 INFO mapred.JobClient: asset=562
13/12/09 23:09:32 INFO mapred.JobClient: podComplete=612
13/12/09 23:09:32 INFO mapred.JobClient: adProgress100=247
13/12/09 23:09:32 INFO mapred.JobClient: adProgress25=247
13/12/09 23:09:32 INFO mapred.JobClient: click=164
13/12/09 23:09:32 INFO mapred.JobClient: adProgress50=247
13/12/09 23:09:32 INFO mapred.JobClient: adCall=244
13/12/09 23:09:32 INFO mapred.JobClient: adProgress75=247
13/12/09 23:09:32 INFO mapred.JobClient: podStart=613
13/12/09 23:09:32 INFO mapred.JobClient: Map-Reduce Framework
13/12/09 23:09:32 INFO mapred.JobClient: Map output materialized bytes=400260
13/12/09 23:09:32 INFO mapred.JobClient: Map input records=3526
13/12/09 23:09:32 INFO mapred.JobClient: Reduce shuffle bytes=400260
13/12/09 23:09:32 INFO mapred.JobClient: Spilled Records=14104
13/12/09 23:09:32 INFO mapred.JobClient: Map output bytes=2343990
13/12/09 23:09:32 INFO mapred.JobClient: Total committed heap usage (bytes)=497549312
13/12/09 23:09:32 INFO mapred.JobClient: CPU time spent (ms)=10120
13/12/09 23:09:32 INFO mapred.JobClient: Combine input records=0
13/12/09 23:09:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=231
13/12/09 23:09:32 INFO mapred.JobClient: Reduce input records=7052
13/12/09 23:09:32 INFO mapred.JobClient: Reduce input groups=246
13/12/09 23:09:32 INFO mapred.JobClient: Combine output records=0
13/12/09 23:09:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=519942144
13/12/09 23:09:32 INFO mapred.JobClient: Reduce output records=7052
13/12/09 23:09:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3076526080
13/12/09 23:09:32 INFO mapred.JobClient: Map output records=7052
13/12/09 23:09:32 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job1out/_SUCCESS
13/12/09 23:09:32 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job1out/sessionindex-m-00000
1091740526
-- job 2 input path : /path/job1out
-- job 2 output path: /path/job2out
13/12/09 23:09:32 INFO mapred.JobClient: Default number of map tasks: null
13/12/09 23:09:32 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 2
13/12/09 23:09:32 INFO mapred.JobClient: Default number of reduce tasks: 1
13/12/09 23:09:33 INFO mapred.JobClient: Setting group to hadoop
13/12/09 23:09:33 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 23:09:33 INFO mapred.JobClient: Running job: job_201312062235_0045
13/12/09 23:09:34 INFO mapred.JobClient: map 0% reduce 0%
13/12/09 23:09:51 INFO mapred.JobClient: map 100% reduce 0%
13/12/09 23:10:03 INFO mapred.JobClient: map 100% reduce 33%
13/12/09 23:10:06 INFO mapred.JobClient: map 100% reduce 100%
13/12/09 23:10:11 INFO mapred.JobClient: Job complete: job_201312062235_0045
13/12/09 23:10:11 INFO mapred.JobClient: Counters: 27
13/12/09 23:10:11 INFO mapred.JobClient: Job Counters
13/12/09 23:10:11 INFO mapred.JobClient: Launched reduce tasks=1
13/12/09 23:10:11 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13533
13/12/09 23:10:11 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 23:10:11 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/09 23:10:11 INFO mapred.JobClient: Launched map tasks=1
13/12/09 23:10:11 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=12176
13/12/09 23:10:11 INFO mapred.JobClient: File Output Format Counters
13/12/09 23:10:11 INFO mapred.JobClient: Bytes Written=0
13/12/09 23:10:11 INFO mapred.JobClient: FileSystemCounters
13/12/09 23:10:11 INFO mapred.JobClient: FILE_BYTES_READ=173
13/12/09 23:10:11 INFO mapred.JobClient: HDFS_BYTES_READ=134
13/12/09 23:10:11 INFO mapred.JobClient: FILE_BYTES_WRITTEN=57735
13/12/09 23:10:11 INFO mapred.JobClient: File Input Format Counters
13/12/09 23:10:11 INFO mapred.JobClient: Bytes Read=0
13/12/09 23:10:11 INFO mapred.JobClient: Map-Reduce Framework
13/12/09 23:10:11 INFO mapred.JobClient: Map output materialized bytes=16
13/12/09 23:10:11 INFO mapred.JobClient: Map input records=0
13/12/09 23:10:11 INFO mapred.JobClient: Reduce shuffle bytes=16
13/12/09 23:10:11 INFO mapred.JobClient: Spilled Records=0
13/12/09 23:10:11 INFO mapred.JobClient: Map output bytes=0
13/12/09 23:10:11 INFO mapred.JobClient: Total committed heap usage (bytes)=434634752
13/12/09 23:10:11 INFO mapred.JobClient: CPU time spent (ms)=2270
13/12/09 23:10:11 INFO mapred.JobClient: Combine input records=0
13/12/09 23:10:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=134
13/12/09 23:10:11 INFO mapred.JobClient: Reduce input records=0
13/12/09 23:10:11 INFO mapred.JobClient: Reduce input groups=0
13/12/09 23:10:11 INFO mapred.JobClient: Combine output records=0
13/12/09 23:10:11 INFO mapred.JobClient: Physical memory (bytes) snapshot=423612416
13/12/09 23:10:11 INFO mapred.JobClient: Reduce output records=0
13/12/09 23:10:11 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3058089984
13/12/09 23:10:11 INFO mapred.JobClient: Map output records=0
13/12/09 23:10:11 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job2out/_SUCCESS
13/12/09 23:10:11 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory /path/job2out. Does it contain files in subdirectories that correspond to column family names?

"Starting flush of map output" takes very long time in hadoop map task

I execute a map task on a small file (3-4 MB), but map output is relatively large (150 MB). After showing Map 100%, it takes long time to finish the spill. Please suggest how can I reduce this period. Following are some sample logs...
13/07/10 17:45:31 INFO mapred.MapTask: Starting flush of map output
13/07/10 17:45:32 INFO mapred.JobClient: map 98% reduce 0%
13/07/10 17:45:34 INFO mapred.LocalJobRunner:
13/07/10 17:45:35 INFO mapred.JobClient: map 100% reduce 0%
13/07/10 17:45:37 INFO mapred.LocalJobRunner:
13/07/10 17:45:40 INFO mapred.LocalJobRunner:
13/07/10 17:45:43 INFO mapred.LocalJobRunner:
13/07/10 17:45:46 INFO mapred.LocalJobRunner:
13/07/10 17:45:49 INFO mapred.LocalJobRunner:
13/07/10 17:45:52 INFO mapred.LocalJobRunner:
13/07/10 17:45:55 INFO mapred.LocalJobRunner:
13/07/10 17:45:58 INFO mapred.LocalJobRunner:
13/07/10 17:46:01 INFO mapred.LocalJobRunner:
13/07/10 17:46:04 INFO mapred.LocalJobRunner:
13/07/10 17:46:07 INFO mapred.LocalJobRunner:
13/07/10 17:46:10 INFO mapred.LocalJobRunner:
13/07/10 17:46:13 INFO mapred.LocalJobRunner:
13/07/10 17:46:16 INFO mapred.LocalJobRunner:
13/07/10 17:46:19 INFO mapred.LocalJobRunner:
13/07/10 17:46:22 INFO mapred.LocalJobRunner:
13/07/10 17:46:25 INFO mapred.LocalJobRunner:
13/07/10 17:46:28 INFO mapred.LocalJobRunner:
13/07/10 17:46:31 INFO mapred.LocalJobRunner:
13/07/10 17:46:34 INFO mapred.LocalJobRunner:
13/07/10 17:46:37 INFO mapred.LocalJobRunner:
13/07/10 17:46:40 INFO mapred.LocalJobRunner:
13/07/10 17:46:43 INFO mapred.LocalJobRunner:
13/07/10 17:46:46 INFO mapred.LocalJobRunner:
13/07/10 17:46:49 INFO mapred.LocalJobRunner:
13/07/10 17:46:52 INFO mapred.LocalJobRunner:
13/07/10 17:46:55 INFO mapred.LocalJobRunner:
13/07/10 17:46:58 INFO mapred.LocalJobRunner:
13/07/10 17:47:01 INFO mapred.LocalJobRunner:
13/07/10 17:47:04 INFO mapred.LocalJobRunner:
13/07/10 17:47:07 INFO mapred.LocalJobRunner:
13/07/10 17:47:10 INFO mapred.LocalJobRunner:
13/07/10 17:47:13 INFO mapred.LocalJobRunner:
13/07/10 17:47:16 INFO mapred.LocalJobRunner:
13/07/10 17:47:19 INFO mapred.LocalJobRunner:
13/07/10 17:47:22 INFO mapred.LocalJobRunner:
13/07/10 17:47:25 INFO mapred.LocalJobRunner:
13/07/10 17:47:28 INFO mapred.LocalJobRunner:
13/07/10 17:47:31 INFO mapred.LocalJobRunner:
13/07/10 17:47:34 INFO mapred.LocalJobRunner:
13/07/10 17:47:37 INFO mapred.LocalJobRunner:
13/07/10 17:47:40 INFO mapred.LocalJobRunner:
13/07/10 17:47:43 INFO mapred.LocalJobRunner:
13/07/10 17:47:45 INFO mapred.MapTask: Finished spill 0
13/07/10 17:47:45 INFO mapred.Task: Task:attempt_local_0003_m_000000_0 is done. And is in the process of commiting
13/07/10 17:47:45 INFO mapred.LocalJobRunner:
13/07/10 17:47:45 INFO mapred.Task: Task 'attempt_local_0003_m_000000_0' done.
...............................
...............................
...............................
13/07/10 17:47:52 INFO mapred.JobClient: Counters: 22
13/07/10 17:47:52 INFO mapred.JobClient: File Output Format Counters
13/07/10 17:47:52 INFO mapred.JobClient: Bytes Written=13401245
13/07/10 17:47:52 INFO mapred.JobClient: FileSystemCounters
13/07/10 17:47:52 INFO mapred.JobClient: FILE_BYTES_READ=18871098
13/07/10 17:47:52 INFO mapred.JobClient: HDFS_BYTES_READ=7346566
13/07/10 17:47:52 INFO mapred.JobClient: FILE_BYTES_WRITTEN=35878426
13/07/10 17:47:52 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=18621307
13/07/10 17:47:52 INFO mapred.JobClient: File Input Format Counters
13/07/10 17:47:52 INFO mapred.JobClient: Bytes Read=2558288
13/07/10 17:47:52 INFO mapred.JobClient: Map-Reduce Framework
13/07/10 17:47:52 INFO mapred.JobClient: Reduce input groups=740000
13/07/10 17:47:52 INFO mapred.JobClient: Map output materialized bytes=13320006
13/07/10 17:47:52 INFO mapred.JobClient: Combine output records=740000
13/07/10 17:47:52 INFO mapred.JobClient: Map input records=71040
13/07/10 17:47:52 INFO mapred.JobClient: Reduce shuffle bytes=0
13/07/10 17:47:52 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
13/07/10 17:47:52 INFO mapred.JobClient: Reduce output records=740000
13/07/10 17:47:52 INFO mapred.JobClient: Spilled Records=1480000
13/07/10 17:47:52 INFO mapred.JobClient: Map output bytes=119998400
13/07/10 17:47:52 INFO mapred.JobClient: CPU time spent (ms)=0
13/07/10 17:47:52 INFO mapred.JobClient: Total committed heap usage (bytes)=1178009600
13/07/10 17:47:52 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
13/07/10 17:47:52 INFO mapred.JobClient: Combine input records=7499900
13/07/10 17:47:52 INFO mapred.JobClient: Map output records=7499900
13/07/10 17:47:52 INFO mapred.JobClient: SPLIT_RAW_BYTES=122
13/07/10 17:47:52 INFO mapred.JobClient: Reduce input records=740000
Map Task Source code:
public class GsMR2MapThree extends Mapper<Text, Text, LongWritable,DoubleWritable>{
private DoubleWritable distGexpr = new DoubleWritable();
private LongWritable m2keyOut = new LongWritable();
int trMax,tstMax;
protected void setup(Context context) throws java.io.IOException, java.lang.InterruptedException {
Configuration conf =context.getConfiguration();
tstMax = conf.getInt("mtst", 10);
trMax = conf.getInt("mtr", 10);
}
public void map(Text key, Text values, Context context) throws IOException, InterruptedException {
String line = values.toString();
double Tij=0.0,TRij=0.0, dist=0;
int i=0,j;
long m2key=0;
String[] SLl = new String[]{};
Configuration conf =context.getConfiguration();
m2key = Long.parseLong(key.toString());
StringTokenizer tokenizer = new StringTokenizer(line);
j=0;
while (tokenizer.hasMoreTokens()) {
String test = tokenizer.nextToken();
if(j==0){
Tij = Double.parseDouble(test);
}
else if(j==1){
TRij = Double.parseDouble(test);
}
else if(j==2){
SLl = StringUtils.split(conf.get(test),",");
}
j++;
}
//Map input ends
//Distance Measure function
dist = (long)Math.pow( (Tij - TRij), 2);
//remove gid from key
m2key = m2key / 100000;
//Map2 <key,value> emit starts
for(i=0; i<SLl.length;i++){
long m2keyNew = (Integer.parseInt(SLl[i])*(trMax*tstMax))+m2key;
m2keyOut.set(m2keyNew);
distGexpr.set(dist);
context.write(m2keyOut,distGexpr);
}
//<key,value> emit done
}
}
Sample Map Input: The last variable in each line get an integer array from broadcast variables. Each line will produce around 100-200 output records.
10100014 1356.3238 1181.63 gs-4-56
10100026 3263.1167 3192.4131 gs-3-21
10100043 1852.0 1926.3962 gs-4-76
10100062 1175.5925 983.47125 gs-3-19
10100066 606.59125 976.26625 gs-8-23
Sample Map Output:
10101 8633.0
10102 1822.0
10103 13832.0
10104 2726470.0
10105 1172991.0
10107 239367.0
10109 5410384.0
10111 7698352.0
10112 6.417
I suppose you have solved that (2 years after posting the original message), but just for anyone who steps into the same problem, I will try providing some suggestions.
Judging from your counters, I understand that you already use compression (since the number of map output materialized bytes is different to the number of map output bytes), which is a good thing. You can further compress the output of the mapper, by using the variable-length encoded VLongWritable class, as the map output key type. (There used to be a VDoubleWritable class, too, if I am not mistaken, but it must have been deprecated by now).
In the for loop, in which you emit the output, there is no need to set the distGexpr variable each time. It is always the same, so set it just before the for loop. You can also store a long with the product of trMax*tstMax outside the loop and not calculate it on each iteration.
If possible, make your input key LongWritable (from the previous job), so that you can save the Long.parseLong() and the Text.toString() invocations.
If possible (depending on your reducer), use a combiner, to reduce the size of spilled bytes.
I could not find a way to skip that Integer.parseInt() call within the for loop, but it would save some time if you could initially load SLl as int[].

Resources