Infinite loop in org.apache.hadoop.mapred.TaskTracker - hadoop

I am running a simple Hadoop application that collects information from a 64MB file. The map phase finishes quickly, in roughly 2-5 minutes, and the reduce phase also runs quickly up to 16%, but then it just stops.
This is the program output:
11/12/20 17:53:46 INFO input.FileInputFormat: Total input paths to process : 1
11/12/20 17:53:46 INFO mapred.JobClient: Running job: job_201112201749_0001
11/12/20 17:53:47 INFO mapred.JobClient: map 0% reduce 0%
11/12/20 17:54:06 INFO mapred.JobClient: map 4% reduce 0%
11/12/20 17:54:09 INFO mapred.JobClient: map 15% reduce 0%
11/12/20 17:54:12 INFO mapred.JobClient: map 28% reduce 0%
11/12/20 17:54:15 INFO mapred.JobClient: map 40% reduce 0%
11/12/20 17:54:18 INFO mapred.JobClient: map 53% reduce 0%
11/12/20 17:54:21 INFO mapred.JobClient: map 64% reduce 0%
11/12/20 17:54:24 INFO mapred.JobClient: map 77% reduce 0%
11/12/20 17:54:27 INFO mapred.JobClient: map 89% reduce 0%
11/12/20 17:54:30 INFO mapred.JobClient: map 98% reduce 0%
11/12/20 17:54:33 INFO mapred.JobClient: map 100% reduce 0%
11/12/20 17:54:54 INFO mapred.JobClient: map 100% reduce 8%
11/12/20 17:54:57 INFO mapred.JobClient: map 100% reduce 16%
In the data node log, I see the same message repeated over and over. Here is where it starts:
2011-12-20 17:54:51,353 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201112201749_0001_r_000000_0 0.083333336% reduce > copy (1 of 4 at 9.01 MB/s) >
2011-12-20 17:54:51,507 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 127.0.1.1:50060, dest: 127.0.0.1:44367, bytes: 75623263, op: MAPRED_SHUFFLE, cliID: attempt_201112201749_0001_m_000000_0, duration: 2161793492
2011-12-20 17:54:54,389 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201112201749_0001_r_000000_0 0.16666667% reduce > copy (2 of 4 at 14.42 MB/s) >
2011-12-20 17:54:57,433 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201112201749_0001_r_000000_0 0.16666667% reduce > copy (2 of 4 at 14.42 MB/s) >
2011-12-20 17:55:40,359 INFO org.mortbay.log: org.mortbay.io.nio.SelectorManager$SelectSet#53d3cf JVM BUG(s) - injecting delay3 times
2011-12-20 17:55:40,359 INFO org.mortbay.log: org.mortbay.io.nio.SelectorManager$SelectSet#53d3cf JVM BUG(s) - recreating selector 3 times, canceled keys 72 times
2011-12-20 17:57:51,518 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201112201749_0001_r_000000_0 0.16666667% reduce > copy (2 of 4 at 14.42 MB/s) >
2011-12-20 17:57:57,536 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201112201749_0001_r_000000_0 0.16666667% reduce > copy (2 of 4 at 14.42 MB/s) >
2011-12-20 17:58:03,554 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201112201749_0001_r_000000_0 0.16666667% reduce > copy (2 of 4 at 14.42 MB/s) >
...
Here is the code
package com.bluedolphin;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class MyJob {
    private final static LongWritable one = new LongWritable(1);
    private final static Text word = new Text();

    public static class MyMapClass extends Mapper<LongWritable, Text, Text, LongWritable> {
        public void map(LongWritable key,
                        Text value,
                        Context context) throws IOException, InterruptedException {
            String[] citation = value.toString().split(",");
            word.set(citation[0]);
            context.write(word, one);
        }
    }

    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private LongWritable result = new LongWritable();

        public void reduce(
                Text key,
                Iterator<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: myjob <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "patent citation");
        job.setJarByClass(MyJob.class);
        job.setMapperClass(MyMapClass.class);
        // job.setCombinerClass(MyReducer.class);
        // job.setReducerClass(MyReducer.class);
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
I don't know how to troubleshoot this further.
Thanks in advance.

I figured out the solution: in the reduce method signature, I should have used Iterable, not Iterator. Because of that, the reduce method was never actually called. It runs fine now, but I still don't know the internal reason why it hung before.
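For reference, a minimal sketch of the corrected reducer; the only change is Iterable in place of Iterator. Without it the method merely overloads Reducer.reduce instead of overriding it, so the framework never invokes it:

public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private LongWritable result = new LongWritable();

    @Override  // with the Iterator signature this annotation would not even compile
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Adding @Override is a cheap safeguard: the compiler rejects the signature mismatch instead of silently ignoring the method.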

A couple of things I noticed in your code:
Since you are updating the Text object "word" in map and the LongWritable object "result" in reduce on each call to map and reduce respectively, you probably shouldn't declare them final (although I don't see this as a problem in this case, since the objects only change their state).
Your code looks similar to the trivial word count; the only difference is that you emit just one value per record in map. You could simply eliminate reduce (i.e., run a map-only job), as sketched below, and see whether you get what you expect from map.
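A minimal sketch of that map-only test; note that the posted driver already contains essentially these lines (reducer commented out, zero reduce tasks), so this is just the relevant excerpt rather than a new setup:

// run without a reduce phase: map output goes straight to the output format
job.setMapperClass(MyMapClass.class);
job.setNumReduceTasks(0);
// with zero reducers, the map output key/value classes are the job's final output classes
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);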

I also hit this infinite loop during the reduce phase. After struggling for a day, the solution turned out to be adjusting the /etc/hosts file.
It seems that the entry "127.0.1.1 your_machine's_name" confuses Hadoop. One symptom of this is being unable to reach slave:50060, the TaskTracker on the slave machine, from the master machine.
Once the "127.0.1.1 your_machine's_name" entry is deleted and "your_machine's_name" is appended to the "127.0.0.1 localhost" entry, my problem was gone.
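For illustration, assuming the machine's hostname is mybox (a placeholder), the change looks roughly like this in /etc/hosts:

# before
127.0.0.1   localhost
127.0.1.1   mybox

# after: drop the 127.0.1.1 line and append the hostname to the localhost entry
127.0.0.1   localhost mybox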
I hope this observation can be helpful.

Related

Map Reduce File Output Counter is zero

I am writing MapReduce code for inverted indexing of a file in which each line is "Doc_id Title Document Contents".
I am not able to figure out why the File Output Format counter is zero, even though the map and reduce jobs complete successfully without any exception.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, Text> {

        private Text word = new Text();
        private Text docID_Title = new Text();
        //RemoveStopWords is a different class
        static RemoveStopWords rmvStpWrd = new RemoveStopWords();
        //Stemmer is a different class
        Stemmer stemmer = new Stemmer();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            rmvStpWrd.makeStopWordList();
            StringTokenizer itr = new StringTokenizer(value.toString().replaceAll(" [^\\p{L}]", " "));
            //fetching id of the document
            String id = null;
            String title = null;
            if (itr.hasMoreTokens())
                id = itr.nextToken();
            //fetching title of the document
            if (itr.hasMoreTokens())
                title = itr.nextToken();
            String ID_TITLE = id + title;
            if (id != null)
                docID_Title.set(ID_TITLE);
            while (itr.hasMoreTokens()) {
                /*manipulation of tokens:
                 * First we remove stop words
                 * Then Stem the words
                 */
                String temp = itr.nextToken().toLowerCase();
                if (RemoveStopWords.isStopWord(temp)) {
                    continue;
                }
                else {
                    //now the word is not a stop word
                    //we will stem it
                    char[] a;
                    stemmer.add((a = temp.toCharArray()), a.length);
                    stemmer.stem();
                    temp = stemmer.toString();
                    word.set(temp);
                    context.write(word, docID_Title);
                }
            }//end while
        }//end map
    }//end mapper

    public static class IntSumReducer
            extends Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            //to iterate over the values
            Iterator<Text> itr = values.iterator();
            String old = itr.next().toString();
            int freq = 1;
            String next = null;
            boolean isThere = true;
            StringBuilder stringBuilder = new StringBuilder();
            while (itr.hasNext()) {
                //freq counts number of times a word comes in a document
                freq = 1;
                while ((isThere = itr.hasNext())) {
                    next = itr.next().toString();
                    if (old == next)
                        freq++;
                    else {
                        //the loop break when we get different docID_Title for the word(key)
                        break;
                    }
                    //if more data is there
                    if (isThere) {
                        old = old + "_" + freq;
                        stringBuilder.append(old);
                        stringBuilder.append(" | ");
                        old = next;
                        context.write(key, new Text(stringBuilder.toString()));
                        stringBuilder.setLength(0);
                    }
                    else {
                        //for the last key
                        freq++;
                        old = old + "_" + freq;
                        stringBuilder.append(old);
                        stringBuilder.append(" | ");
                        old = next;
                        context.write(key, new Text(stringBuilder.toString()));
                    }//end else
                }//end while
            }//end while
        }//end reduce
    }//end reducer

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "InvertedIndex");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }//end main
}//end InvertedIndex
This is the output I am getting:
16/10/03 15:34:21 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/10/03 15:34:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/10/03 15:34:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/10/03 15:34:22 INFO input.FileInputFormat: Total input paths to process : 1
16/10/03 15:34:22 INFO mapreduce.JobSubmitter: number of splits:1
16/10/03 15:34:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local507694567_0001
16/10/03 15:34:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/10/03 15:34:22 INFO mapreduce.Job: Running job: job_local507694567_0001
16/10/03 15:34:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/10/03 15:34:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/10/03 15:34:22 INFO mapred.LocalJobRunner: Waiting for map tasks
16/10/03 15:34:22 INFO mapred.LocalJobRunner: Starting task: attempt_local507694567_0001_m_000000_0
16/10/03 15:34:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/10/03 15:34:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/sonu/ss.txt:0+1002072
16/10/03 15:34:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/10/03 15:34:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/10/03 15:34:23 INFO mapred.MapTask: soft limit at 83886080
16/10/03 15:34:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/10/03 15:34:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/10/03 15:34:23 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/10/03 15:34:23 INFO mapreduce.Job: Job job_local507694567_0001 running in uber mode : false
16/10/03 15:34:23 INFO mapreduce.Job: map 0% reduce 0%
16/10/03 15:34:24 INFO mapred.LocalJobRunner:
16/10/03 15:34:24 INFO mapred.MapTask: Starting flush of map output
16/10/03 15:34:24 INFO mapred.MapTask: Spilling map output
16/10/03 15:34:24 INFO mapred.MapTask: bufstart = 0; bufend = 2206696; bufvoid = 104857600
16/10/03 15:34:24 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25789248(103156992); length = 425149/6553600
16/10/03 15:34:24 INFO mapred.MapTask: Finished spill 0
16/10/03 15:34:24 INFO mapred.Task: Task:attempt_local507694567_0001_m_000000_0 is done. And is in the process of committing
16/10/03 15:34:24 INFO mapred.LocalJobRunner: map
16/10/03 15:34:24 INFO mapred.Task: Task 'attempt_local507694567_0001_m_000000_0' done.
16/10/03 15:34:24 INFO mapred.LocalJobRunner: Finishing task: attempt_local507694567_0001_m_000000_0
16/10/03 15:34:24 INFO mapred.LocalJobRunner: map task executor complete.
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Starting task: attempt_local507694567_0001_r_000000_0
16/10/03 15:34:25 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:25 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/10/03 15:34:25 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#5d0e7307
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=333971456, maxSingleShuffleLimit=83492864, mergeThreshold=220421168, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/10/03 15:34:25 INFO reduce.EventFetcher: attempt_local507694567_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/10/03 15:34:25 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local507694567_0001_m_000000_0 decomp: 2 len: 6 to MEMORY
16/10/03 15:34:25 INFO reduce.InMemoryMapOutput: Read 2 bytes from map-output for attempt_local507694567_0001_m_000000_0
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->2
16/10/03 15:34:25 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
16/10/03 15:34:25 INFO mapred.Merger: Merging 1 sorted segments
16/10/03 15:34:25 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merged 1 segments, 2 bytes to disk to satisfy reduce memory limit
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merging 1 files, 6 bytes from disk
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/10/03 15:34:25 INFO mapred.Merger: Merging 1 sorted segments
16/10/03 15:34:25 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/10/03 15:34:25 INFO mapred.Task: Task:attempt_local507694567_0001_r_000000_0 is done. And is in the process of committing
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO mapred.Task: Task attempt_local507694567_0001_r_000000_0 is allowed to commit now
16/10/03 15:34:25 INFO output.FileOutputCommitter: Saved output of task 'attempt_local507694567_0001_r_000000_0' to hdfs://localhost:9000/user/sonu/output/_temporary/0/task_local507694567_0001_r_000000
16/10/03 15:34:25 INFO mapred.LocalJobRunner: reduce > reduce
16/10/03 15:34:25 INFO mapred.Task: Task 'attempt_local507694567_0001_r_000000_0' done.
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Finishing task: attempt_local507694567_0001_r_000000_0
16/10/03 15:34:25 INFO mapred.LocalJobRunner: reduce task executor complete.
16/10/03 15:34:25 INFO mapreduce.Job: map 100% reduce 100%
16/10/03 15:34:25 INFO mapreduce.Job: Job job_local507694567_0001 completed successfully
16/10/03 15:34:25 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=17342
FILE: Number of bytes written=571556
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2004144
HDFS: Number of bytes written=0
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=53
Map output records=106288
Map output bytes=2206696
Map output materialized bytes=6
Input split bytes=103
Combine input records=106288
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=6
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=12
Total committed heap usage (bytes)=562036736
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1002072
File Output Format Counters
Bytes Written=0

The MapReduce program is not giving me any output. Could somebody have a look into it?

I am not getting any output from this program. When I run this MapReduce program, I get no result.
Inputfile: dict1.txt
apple,seo
apple,sev
dog,kukura
dog,kutta
cat,bilei
cat,billi
Output I want :
apple seo|sev
dog kukura|kutta
cat bilei|billi
Mapper class code :
package com.accure.Dict;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DictMapper extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

    private Text word = new Text();

    public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString(), ",");
        while (itr.hasMoreTokens()) {
            System.out.println(key);
            word.set(itr.nextToken());
            output.collect(key, word);
        }
    }
}
Reducer code :
package com.accure.Dict;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DictReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String translations = "";
        while (values.hasNext()) {
            translations += "|" + values.next().toString();
        }
        result.set(translations);
        output.collect(key, result);
    }
}
Driver code :
package com.accure.driver;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

import com.accure.Dict.DictMapper;
import com.accure.Dict.DictReducer;

public class DictDriver {

    public static void main(String[] args) throws Exception {
        // TODO Auto-generated method stub
        JobConf conf = new JobConf();
        conf.setJobName("wordcount_pradosh");
        System.setProperty("HADOOP_USER_NAME", "accure");
        conf.set("fs.default.name", "hdfs://host2.hadoop.career.com:54310/");
        conf.set("hadoop.job.ugi", "accuregrp");
        conf.set("mapred.job.tracker", "host2.hadoop.career.com:54311");
        /*mapper and reduce class */
        conf.setMapperClass(DictMapper.class);
        conf.setReducerClass(DictReducer.class);
        /*This particular jar file has your classes*/
        conf.setJarByClass(DictMapper.class);
        Path inputPath = new Path("/myCareer/pradosh/input");
        Path outputPath = new Path("/myCareer/pradosh/output" + System.currentTimeMillis());
        /*input and output directory path */
        FileInputFormat.setInputPaths(conf, inputPath);
        FileOutputFormat.setOutputPath(conf, outputPath);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        /*output key and value class*/
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        /*input and output format */
        conf.setInputFormat(KeyValueTextInputFormat.class); /*Here the file is a text file*/
        conf.setOutputFormat(TextOutputFormat.class);
        JobClient.runJob(conf);
    }
}
output log :
14/04/02 08:33:38 INFO mapred.JobClient: Running job: job_201404010637_0011
14/04/02 08:33:39 INFO mapred.JobClient: map 0% reduce 0%
14/04/02 08:33:58 INFO mapred.JobClient: map 50% reduce 0%
14/04/02 08:33:59 INFO mapred.JobClient: map 100% reduce 0%
14/04/02 08:34:21 INFO mapred.JobClient: map 100% reduce 16%
14/04/02 08:34:23 INFO mapred.JobClient: map 100% reduce 100%
14/04/02 08:34:25 INFO mapred.JobClient: Job complete: job_201404010637_0011
14/04/02 08:34:25 INFO mapred.JobClient: Counters: 29
14/04/02 08:34:25 INFO mapred.JobClient: Job Counters
14/04/02 08:34:25 INFO mapred.JobClient: Launched reduce tasks=1
14/04/02 08:34:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=33692
14/04/02 08:34:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/04/02 08:34:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/04/02 08:34:25 INFO mapred.JobClient: Launched map tasks=2
14/04/02 08:34:25 INFO mapred.JobClient: Data-local map tasks=2
14/04/02 08:34:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=25327
14/04/02 08:34:25 INFO mapred.JobClient: File Input Format Counters
14/04/02 08:34:25 INFO mapred.JobClient: Bytes Read=92
14/04/02 08:34:25 INFO mapred.JobClient: File Output Format Counters
14/04/02 08:34:25 INFO mapred.JobClient: Bytes Written=0
14/04/02 08:34:25 INFO mapred.JobClient: FileSystemCounters
14/04/02 08:34:25 INFO mapred.JobClient: FILE_BYTES_READ=6
14/04/02 08:34:25 INFO mapred.JobClient: HDFS_BYTES_READ=336
14/04/02 08:34:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=169311
14/04/02 08:34:25 INFO mapred.JobClient: Map-Reduce Framework
14/04/02 08:34:25 INFO mapred.JobClient: Map output materialized bytes=12
14/04/02 08:34:25 INFO mapred.JobClient: Map input records=6
14/04/02 08:34:25 INFO mapred.JobClient: Reduce shuffle bytes=12
14/04/02 08:34:25 INFO mapred.JobClient: Spilled Records=0
14/04/02 08:34:25 INFO mapred.JobClient: Map output bytes=0
14/04/02 08:34:25 INFO mapred.JobClient: Total committed heap usage (bytes)=246685696
14/04/02 08:34:25 INFO mapred.JobClient: CPU time spent (ms)=2650
14/04/02 08:34:25 INFO mapred.JobClient: Map input bytes=61
14/04/02 08:34:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=244
14/04/02 08:34:25 INFO mapred.JobClient: Combine input records=0
14/04/02 08:34:25 INFO mapred.JobClient: Reduce input records=0
14/04/02 08:34:25 INFO mapred.JobClient: Reduce input groups=0
14/04/02 08:34:25 INFO mapred.JobClient: Combine output records=0
14/04/02 08:34:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=392347648
14/04/02 08:34:25 INFO mapred.JobClient: Reduce output records=0
14/04/02 08:34:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2173820928
14/04/02 08:34:25 INFO mapred.JobClient: Map output records=0
When reading the input you set the input format to KeyValueTextInputFormat.
This format expects a tab byte as the separator between key and value. In your input, the key and value are separated by ",", so the whole line ends up as the key and the value is empty.
This is why it never enters the loop below in your mapper:
while (itr.hasMoreTokens()) {
    System.out.println(key);
    word.set(itr.nextToken());
    output.collect(key, word);
}
You should tokenize the key yourself and use the first split as the key and the second split as the value.
This is evidenced in the logs: Map input records=6 but Map output records=0.
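A sketch of the mapper change described above, reusing the question's DictMapper field word. Since KeyValueTextInputFormat finds no tab in the line, the whole line arrives as the key and the value is empty, so the key is split on "," here:

public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // the entire input line is in 'key'; 'value' is empty because there is no tab separator
    String[] parts = key.toString().split(",", 2);
    if (parts.length == 2) {
        word.set(parts[1].trim());
        output.collect(new Text(parts[0].trim()), word);
    }
}

Alternatively, the old-API KeyValueTextInputFormat reads its separator from the key.value.separator.in.input.line property (to the best of my recollection for Hadoop 1.x), so setting that to "," in the JobConf should let the original mapper work unchanged.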

Hadoop MapReduce: running two jobs inside one Configured/Tool produces no output on 2nd job

I need to run 2 map reduce jobs such that the 2nd takes as input the output from the first job. I'd like to do this within a single invocation, where MyClass extends Configured and implements Tool.
I've written the code, and it works as long as I don't run the two jobs within the same invocation (this works):
hadoop jar myjar.jar path.to.my.class.MyClass -i input -o output -m job1
hadoop jar myjar.jar path.to.my.class.MyClass -i dummy -o output -m job2
But this doesn't:
hadoop jar myjar.jar path.to.my.class.MyClass -i input -o output -m all
(-m stands for "mode")
In this case, the output of the first job does not make it to the mappers of the 2nd job (I figured this out by debugging), but I can't figure out why.
I've seen other posts on chaining, but they are for the "old" mapred api. And I need to run 3rd party code between the jobs, so I don't know if ChainMapper/ChainReducer will work for my use case.
Using hadoop version 1.0.3, AWS Elastic MapReduce distribution.
Code:
import java.io.IOException;

import org.apache.commons.cli.BasicParser;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.OptionBuilder;
import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyClass extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new HBasePrep(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        CommandLineParser parser = new BasicParser();
        Options allOptions = setupOptions();
        Configuration conf = getConf();
        String[] argv_ = new GenericOptionsParser(conf, args).getRemainingArgs();
        CommandLine cmdLine = parser.parse(allOptions, argv_);
        boolean doJob1 = true;
        boolean doJob2 = true;
        if (cmdLine.hasOption('m')) {
            String mode = cmdLine.getOptionValue('m');
            if ("job1".equals(mode)) {
                doJob2 = false;
            } else if ("job2".equals(mode)) {
                doJob1 = false;
            }
        }
        Path outPath = new Path(cmdLine.getOptionValue("output"), "job1out");
        Job job = new Job(conf, "HBase Prep For Data Build");
        Job job2 = new Job(conf, "HBase SessionIndex load");
        if (doJob1) {
            conf = job.getConfiguration();
            String[] values = cmdLine.getOptionValues("input");
            if (values != null && values.length > 0) {
                for (String input : values) {
                    System.out.println("input:" + input);
                    FileInputFormat.addInputPaths(job, input);
                }
            }
            job.setJarByClass(HBasePrep.class);
            job.setMapperClass(SessionMapper.class);
            MultipleOutputs.setCountersEnabled(job, false);
            MultipleOutputs.addNamedOutput(job, "sessionindex", TextOutputFormat.class, Text.class, Text.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(KeyValue.class);
            job.setOutputFormatClass(HFileOutputFormat.class);
            HTable hTable = new HTable(conf, "session");
            // Auto configure partitioner and reducer
            HFileOutputFormat.configureIncrementalLoad(job, hTable);
            FileOutputFormat.setOutputPath(job, outPath);
            if (!job.waitForCompletion(true)) {
                return 1;
            }
            // Load generated HFiles into table
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(outPath, hTable);
            FileSystem fs = FileSystem.get(outPath.toUri(), conf);
            fs.delete(new Path(outPath, "cf"), true); // I delete this because after the HBase bulk load it is left as an empty directory, which causes problems later
        }
        /////////////////////////////////////////////
        //                SECOND JOB                //
        /////////////////////////////////////////////
        if (doJob2) {
            conf = job2.getConfiguration();
            System.out.println("-- job 2 input path : " + outPath.toString());
            FileInputFormat.setInputPaths(job2, outPath.toString());
            job2.setJarByClass(HBasePrep.class);
            job2.setMapperClass(SessionIndexMapper.class);
            MultipleOutputs.setCountersEnabled(job2, false);
            job2.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job2.setMapOutputValueClass(KeyValue.class);
            job2.setOutputFormatClass(HFileOutputFormat.class);
            HTable hTable = new HTable(conf, "session_index_by_hour");
            // Auto configure partitioner and reducer
            HFileOutputFormat.configureIncrementalLoad(job2, hTable);
            outPath = new Path(cmdLine.getOptionValue("output"), "job2out");
            System.out.println("-- job 2 output path: " + outPath.toString());
            FileOutputFormat.setOutputPath(job2, outPath);
            if (!job2.waitForCompletion(true)) {
                return 2;
            }
            // Load generated HFiles into table
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(outPath, hTable);
        }
        return 0;
    }

    public static class SessionMapper extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

        private MultipleOutputs<ImmutableBytesWritable, KeyValue> multiOut;

        @Override
        public void setup(Context context) throws IOException {
            multiOut = new MultipleOutputs<ImmutableBytesWritable, KeyValue>(context);
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            ...
            context.write(..., ...); // this is called multiple times
            multiOut.write("sessionindex", new Text(...), new Text(...), "sessionindex");
        }
    }

    public static class SessionIndexMapper extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new ImmutableBytesWritable(...), new KeyValue(...));
        }
    }

    private static Options setupOptions() {
        Option input = createOption("i", "input",
                "input file(s) for the Map step", "path", Integer.MAX_VALUE,
                true);
        Option output = createOption("o", "output",
                "output directory for the Reduce step", "path", 1, true);
        Option mode = createOption("m", "mode",
                "what mode ('all', 'job1', 'job2')", "-mode-", 1, false);
        return new Options().addOption(input).addOption(output)
                .addOption(mode);
    }

    public static Option createOption(String name, String longOpt, String desc,
            String argName, int max, boolean required) {
        OptionBuilder.withArgName(argName);
        OptionBuilder.hasArgs(max);
        OptionBuilder.withDescription(desc);
        OptionBuilder.isRequired(required);
        OptionBuilder.withLongOpt(longOpt);
        return OptionBuilder.create(name);
    }
}
Output (single invocation):
input:s3n://...snip...
13/12/09 23:08:43 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/09 23:08:43 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/12/09 23:08:43 INFO compress.CodecPool: Got brand-new compressor
13/12/09 23:08:43 INFO mapred.JobClient: Default number of map tasks: null
13/12/09 23:08:43 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 2
13/12/09 23:08:43 INFO mapred.JobClient: Default number of reduce tasks: 1
13/12/09 23:08:43 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
13/12/09 23:08:43 INFO mapred.JobClient: Setting group to hadoop
13/12/09 23:08:43 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 23:08:43 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/12/09 23:08:43 WARN lzo.LzoCodec: Could not find build properties file with revision hash
13/12/09 23:08:43 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
13/12/09 23:08:43 WARN snappy.LoadSnappy: Snappy native library is available
13/12/09 23:08:43 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/09 23:08:44 INFO mapred.JobClient: Running job: job_201312062235_0044
13/12/09 23:08:45 INFO mapred.JobClient: map 0% reduce 0%
13/12/09 23:09:09 INFO mapred.JobClient: map 100% reduce 0%
13/12/09 23:09:27 INFO mapred.JobClient: map 100% reduce 100%
13/12/09 23:09:32 INFO mapred.JobClient: Job complete: job_201312062235_0044
13/12/09 23:09:32 INFO mapred.JobClient: Counters: 42
13/12/09 23:09:32 INFO mapred.JobClient: MyCounter1
13/12/09 23:09:32 INFO mapred.JobClient: ValidCurrentDay=3526
13/12/09 23:09:32 INFO mapred.JobClient: Job Counters
13/12/09 23:09:32 INFO mapred.JobClient: Launched reduce tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=19693
13/12/09 23:09:32 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 23:09:32 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/09 23:09:32 INFO mapred.JobClient: Rack-local map tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: Launched map tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15201
13/12/09 23:09:32 INFO mapred.JobClient: File Output Format Counters
13/12/09 23:09:32 INFO mapred.JobClient: Bytes Written=1979245
13/12/09 23:09:32 INFO mapred.JobClient: FileSystemCounters
13/12/09 23:09:32 INFO mapred.JobClient: S3N_BYTES_READ=51212
13/12/09 23:09:32 INFO mapred.JobClient: FILE_BYTES_READ=400417
13/12/09 23:09:32 INFO mapred.JobClient: HDFS_BYTES_READ=231
13/12/09 23:09:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=859881
13/12/09 23:09:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2181624
13/12/09 23:09:32 INFO mapred.JobClient: File Input Format Counters
13/12/09 23:09:32 INFO mapred.JobClient: Bytes Read=51212
13/12/09 23:09:32 INFO mapred.JobClient: MyCounter2
13/12/09 23:09:32 INFO mapred.JobClient: ASCII=3526
13/12/09 23:09:32 INFO mapred.JobClient: StatsUnaggregatedMapEventTypeCurrentDay
13/12/09 23:09:32 INFO mapred.JobClient: adProgress0=343
13/12/09 23:09:32 INFO mapred.JobClient: asset=562
13/12/09 23:09:32 INFO mapred.JobClient: podComplete=612
13/12/09 23:09:32 INFO mapred.JobClient: adProgress100=247
13/12/09 23:09:32 INFO mapred.JobClient: adProgress25=247
13/12/09 23:09:32 INFO mapred.JobClient: click=164
13/12/09 23:09:32 INFO mapred.JobClient: adProgress50=247
13/12/09 23:09:32 INFO mapred.JobClient: adCall=244
13/12/09 23:09:32 INFO mapred.JobClient: adProgress75=247
13/12/09 23:09:32 INFO mapred.JobClient: podStart=613
13/12/09 23:09:32 INFO mapred.JobClient: Map-Reduce Framework
13/12/09 23:09:32 INFO mapred.JobClient: Map output materialized bytes=400260
13/12/09 23:09:32 INFO mapred.JobClient: Map input records=3526
13/12/09 23:09:32 INFO mapred.JobClient: Reduce shuffle bytes=400260
13/12/09 23:09:32 INFO mapred.JobClient: Spilled Records=14104
13/12/09 23:09:32 INFO mapred.JobClient: Map output bytes=2343990
13/12/09 23:09:32 INFO mapred.JobClient: Total committed heap usage (bytes)=497549312
13/12/09 23:09:32 INFO mapred.JobClient: CPU time spent (ms)=10120
13/12/09 23:09:32 INFO mapred.JobClient: Combine input records=0
13/12/09 23:09:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=231
13/12/09 23:09:32 INFO mapred.JobClient: Reduce input records=7052
13/12/09 23:09:32 INFO mapred.JobClient: Reduce input groups=246
13/12/09 23:09:32 INFO mapred.JobClient: Combine output records=0
13/12/09 23:09:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=519942144
13/12/09 23:09:32 INFO mapred.JobClient: Reduce output records=7052
13/12/09 23:09:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3076526080
13/12/09 23:09:32 INFO mapred.JobClient: Map output records=7052
13/12/09 23:09:32 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job1out/_SUCCESS
13/12/09 23:09:32 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job1out/sessionindex-m-00000
1091740526
-- job 2 input path : /path/job1out
-- job 2 output path: /path/job2out
13/12/09 23:09:32 INFO mapred.JobClient: Default number of map tasks: null
13/12/09 23:09:32 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 2
13/12/09 23:09:32 INFO mapred.JobClient: Default number of reduce tasks: 1
13/12/09 23:09:33 INFO mapred.JobClient: Setting group to hadoop
13/12/09 23:09:33 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 23:09:33 INFO mapred.JobClient: Running job: job_201312062235_0045
13/12/09 23:09:34 INFO mapred.JobClient: map 0% reduce 0%
13/12/09 23:09:51 INFO mapred.JobClient: map 100% reduce 0%
13/12/09 23:10:03 INFO mapred.JobClient: map 100% reduce 33%
13/12/09 23:10:06 INFO mapred.JobClient: map 100% reduce 100%
13/12/09 23:10:11 INFO mapred.JobClient: Job complete: job_201312062235_0045
13/12/09 23:10:11 INFO mapred.JobClient: Counters: 27
13/12/09 23:10:11 INFO mapred.JobClient: Job Counters
13/12/09 23:10:11 INFO mapred.JobClient: Launched reduce tasks=1
13/12/09 23:10:11 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13533
13/12/09 23:10:11 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 23:10:11 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/09 23:10:11 INFO mapred.JobClient: Launched map tasks=1
13/12/09 23:10:11 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=12176
13/12/09 23:10:11 INFO mapred.JobClient: File Output Format Counters
13/12/09 23:10:11 INFO mapred.JobClient: Bytes Written=0
13/12/09 23:10:11 INFO mapred.JobClient: FileSystemCounters
13/12/09 23:10:11 INFO mapred.JobClient: FILE_BYTES_READ=173
13/12/09 23:10:11 INFO mapred.JobClient: HDFS_BYTES_READ=134
13/12/09 23:10:11 INFO mapred.JobClient: FILE_BYTES_WRITTEN=57735
13/12/09 23:10:11 INFO mapred.JobClient: File Input Format Counters
13/12/09 23:10:11 INFO mapred.JobClient: Bytes Read=0
13/12/09 23:10:11 INFO mapred.JobClient: Map-Reduce Framework
13/12/09 23:10:11 INFO mapred.JobClient: Map output materialized bytes=16
13/12/09 23:10:11 INFO mapred.JobClient: Map input records=0
13/12/09 23:10:11 INFO mapred.JobClient: Reduce shuffle bytes=16
13/12/09 23:10:11 INFO mapred.JobClient: Spilled Records=0
13/12/09 23:10:11 INFO mapred.JobClient: Map output bytes=0
13/12/09 23:10:11 INFO mapred.JobClient: Total committed heap usage (bytes)=434634752
13/12/09 23:10:11 INFO mapred.JobClient: CPU time spent (ms)=2270
13/12/09 23:10:11 INFO mapred.JobClient: Combine input records=0
13/12/09 23:10:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=134
13/12/09 23:10:11 INFO mapred.JobClient: Reduce input records=0
13/12/09 23:10:11 INFO mapred.JobClient: Reduce input groups=0
13/12/09 23:10:11 INFO mapred.JobClient: Combine output records=0
13/12/09 23:10:11 INFO mapred.JobClient: Physical memory (bytes) snapshot=423612416
13/12/09 23:10:11 INFO mapred.JobClient: Reduce output records=0
13/12/09 23:10:11 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3058089984
13/12/09 23:10:11 INFO mapred.JobClient: Map output records=0
13/12/09 23:10:11 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job2out/_SUCCESS
13/12/09 23:10:11 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory /path/job2out. Does it contain files in subdirectories that correspond to column family names?

long hadoop run, stuck at reduce>reduce

I have a Hadoop job that basically just aggregates over keys. Here is its code
(the mapper is the identity mapper):
public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> results, Reporter reporter) throws IOException {
    String res = new String("");
    while (values.hasNext()) {
        res += values.next().toString();
    }
    Text outputValue = new Text("<all><id>" + key.toString() + "</id>" + res + "</all>");
    results.collect(key, outputValue);
}
It gets stuck at this point:
12/11/26 06:19:23 INFO mapred.JobClient: Running job: job_201210240845_0099
12/11/26 06:19:24 INFO mapred.JobClient: map 0% reduce 0%
12/11/26 06:19:37 INFO mapred.JobClient: map 20% reduce 0%
12/11/26 06:19:40 INFO mapred.JobClient: map 80% reduce 0%
12/11/26 06:19:41 INFO mapred.JobClient: map 100% reduce 0%
12/11/26 06:19:46 INFO mapred.JobClient: map 100% reduce 6%
12/11/26 06:19:55 INFO mapred.JobClient: map 100% reduce 66%
I ran it locally and saw this:
12/11/26 06:06:48 INFO mapred.LocalJobRunner:
12/11/26 06:06:48 INFO mapred.Merger: Merging 5 sorted segments
12/11/26 06:06:48 INFO mapred.Merger: Down to the last merge-pass, with 5 segments left of total size: 82159206 bytes
12/11/26 06:06:48 INFO mapred.LocalJobRunner:
12/11/26 06:06:54 INFO mapred.LocalJobRunner: reduce > reduce
12/11/26 06:06:55 INFO mapred.JobClient: map 100% reduce 66%
12/11/26 06:06:57 INFO mapred.LocalJobRunner: reduce > reduce
12/11/26 06:07:00 INFO mapred.LocalJobRunner: reduce > reduce
12/11/26 06:07:03 INFO mapred.LocalJobRunner: reduce > reduce
...
a lot of reduce > reduce ...
...
In the end, it finished the work. I want to ask:
1) What does it do in this reduce > reduce stage?
2) How can I improve this?
Looking at the percentages, 0-33% is the shuffle, 34-65% is the sort, and 66-100% is the actual reduce function.
Everything looks fine in your code, but I'll take a stab in the dark:
You are creating and recreating the string res over and over. Every time you get a new value, Java creates a new string object, then creates another string object to hold the concatenation. As you can see, this can get out of hand when the string gets pretty big. Try using a StringBuffer instead. Edit: StringBuilder is better than StringBuffer.
Whether or not this is the problem, you should change this to improve performance.
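A sketch of the same reduce rewritten with StringBuilder, keeping the question's old-API signature:

public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> results, Reporter reporter) throws IOException {
    // build the output once instead of allocating a new String on every concatenation
    StringBuilder res = new StringBuilder();
    res.append("<all><id>").append(key.toString()).append("</id>");
    while (values.hasNext()) {
        res.append(values.next().toString());
    }
    res.append("</all>");
    results.collect(key, new Text(res.toString()));
}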
Using StringBuilder solves it. It improves the run time from 30 min to 30 sec. I didn't think it would make such a difference. Thanks a lot.

why is my sequence file being read twice in my hadoop mapper class?

I have a SequenceFile with 1264 records; each key is unique to its record. My problem is that my mapper seems to be reading this file twice, or the file is being read twice. As a sanity check, I wrote a little utility class to read the SequenceFile, and indeed there are only 1264 records (using SequenceFile.Reader).
In my reducer, I should get only 1 record per Iterable; however, when I iterate over the Iterable, I get 2 records per key (always 2 per key, never 1 or 3 or anything else).
The logging output of my Job is below. I am not sure why "Total input paths to process" is 2. When I run my Job, I tried -Dmapred.input.dir=/data and also -Dmapred.input.dir=/data/part-r-00000, but the total paths to process is still 2.
Any ideas are appreciated.
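For what it's worth, the kind of counting utility mentioned above might look roughly like this (Hadoop 1.x SequenceFile API; the path comes from the command line, so the exact file name is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class CountSequenceFileRecords {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. /data/part-r-00000
        FileSystem fs = path.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        // instantiate key/value objects of whatever types the file declares
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        long count = 0;
        while (reader.next(key, value)) {
            count++;
        }
        reader.close();
        System.out.println("records = " + count);
    }
}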
12/03/01 05:28:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/03/01 05:28:30 INFO input.FileInputFormat: Total input paths to process : 2
12/03/01 05:28:31 INFO mapred.JobClient: Running job: job_local_0001
12/03/01 05:28:31 INFO input.FileInputFormat: Total input paths to process : 2
12/03/01 05:28:31 INFO mapred.MapTask: io.sort.mb = 100
12/03/01 05:28:31 INFO mapred.MapTask: data buffer = 79691776/99614720
12/03/01 05:28:31 INFO mapred.MapTask: record buffer = 262144/327680
12/03/01 05:28:31 INFO mapred.MapTask: Starting flush of map output
12/03/01 05:28:31 INFO mapred.MapTask: Finished spill 0
12/03/01 05:28:31 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/03/01 05:28:31 INFO mapred.LocalJobRunner:
12/03/01 05:28:31 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
12/03/01 05:28:31 INFO mapred.MapTask: io.sort.mb = 100
12/03/01 05:28:31 INFO mapred.MapTask: data buffer = 79691776/99614720
12/03/01 05:28:31 INFO mapred.MapTask: record buffer = 262144/327680
12/03/01 05:28:31 INFO mapred.MapTask: Starting flush of map output
12/03/01 05:28:31 INFO mapred.MapTask: Finished spill 0
12/03/01 05:28:31 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
12/03/01 05:28:31 INFO mapred.LocalJobRunner:
12/03/01 05:28:31 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
12/03/01 05:28:31 INFO mapred.LocalJobRunner:
12/03/01 05:28:31 INFO mapred.Merger: Merging 2 sorted segments
12/03/01 05:28:31 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 307310 bytes
12/03/01 05:28:31 INFO mapred.LocalJobRunner:
12/03/01 05:28:32 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/03/01 05:28:32 INFO mapred.LocalJobRunner:
12/03/01 05:28:32 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/03/01 05:28:32 INFO mapred.JobClient: map 100% reduce 0%
12/03/01 05:28:32 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to results
12/03/01 05:28:32 INFO mapred.LocalJobRunner: reduce > reduce
12/03/01 05:28:32 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
12/03/01 05:28:33 INFO mapred.JobClient: map 100% reduce 100%
12/03/01 05:28:33 INFO mapred.JobClient: Job complete: job_local_0001
12/03/01 05:28:33 INFO mapred.JobClient: Counters: 12
12/03/01 05:28:33 INFO mapred.JobClient: FileSystemCounters
12/03/01 05:28:33 INFO mapred.JobClient: FILE_BYTES_READ=1320214
12/03/01 05:28:33 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1275041
12/03/01 05:28:33 INFO mapred.JobClient: Map-Reduce Framework
12/03/01 05:28:33 INFO mapred.JobClient: Reduce input groups=1264
12/03/01 05:28:33 INFO mapred.JobClient: Combine output records=0
12/03/01 05:28:33 INFO mapred.JobClient: Map input records=2528
12/03/01 05:28:33 INFO mapred.JobClient: Reduce shuffle bytes=0
12/03/01 05:28:33 INFO mapred.JobClient: Reduce output records=2528
12/03/01 05:28:33 INFO mapred.JobClient: Spilled Records=5056
12/03/01 05:28:33 INFO mapred.JobClient: Map output bytes=301472
12/03/01 05:28:33 INFO mapred.JobClient: Combine input records=0
12/03/01 05:28:33 INFO mapred.JobClient: Map output records=2528
12/03/01 05:28:33 INFO mapred.JobClient: Reduce input records=2528
My mapper class is very simple. It reads in a text file and appends "m" to each line.
public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private static final Log _log = LogFactory.getLog(MyMapper.class);

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String s = (new StringBuilder()).append(value.toString()).append("m").toString();
        context.write(key, new Text(s));
        _log.debug(key.toString() + " => " + s);
    }
}
My reducer class is also very simple. It simply appends "r" to the line.
public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    private static final Log _log = LogFactory.getLog(MyReducer.class);

    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Iterator<Text> it = values.iterator(); it.hasNext();) {
            Text txt = it.next();
            String s = (new StringBuilder()).append(txt.toString()).append("r").toString();
            context.write(key, new Text(s));
            _log.debug(key.toString() + " => " + s);
        }
    }
}
My Job class is as follows.
public class MyJob extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new MyJob(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Path input = new Path(conf.get("mapred.input.dir"));
        Path output = new Path(conf.get("mapred.output.dir"));
        System.out.println("input = " + input);
        System.out.println("output = " + output);
        Job job = new Job(conf, "dummy job");
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        job.setJarByClass(MyJob.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
My input data looks like the following.
T, T
T, T
T, T
F, F
F, F
F, F
F, F
T, F
F, T
After running my Job, I get output like the following.
0 T, Tmr
0 T, Tmr
6 T, Tmr
6 T, Tmr
12 T, Tmr
12 T, Tmr
18 F, Fmr
18 F, Fmr
24 F, Fmr
24 F, Fmr
30 F, Fmr
30 F, Fmr
36 F, Fmr
36 F, Fmr
42 T, Fmr
42 T, Fmr
48 F, Tmr
48 F, Tmr
Did I do something wrong with setting up my Job? I tried the following way to run my Job, and in this approach the file only gets read once. Why is this? The System.out.println(inpath) and System.out.println(outpath) values are identical! Help?
public class MyJob2 {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: MyJob2 <in> <out>");
            System.exit(2);
        }
        String sInput = args[0];
        String sOutput = args[1];
        Path input = new Path(sInput);
        Path output = new Path(sOutput);
        System.out.println("input = " + input);
        System.out.println("output = " + output);
        Job job = new Job(conf, "dummy job");
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        job.setJarByClass(MyJob2.class);
        int result = job.waitForCompletion(true) ? 0 : 1;
        System.exit(result);
    }
}
I got help from the Hadoop mailing list. My problem was with the line below:
FileInputFormat.addInputPath(job, input);
This line simply appends the input path back onto the configuration, which already contains it from -Dmapred.input.dir, so the same file shows up as two input paths. After commenting this line out, the input file is read only once. In fact, I also commented out the other line,
FileOutputFormat.setOutputPath(job, output);
and everything still works.
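Put differently, a sketch of the working run() given the fix described above (the paths come entirely from the -D options already present in the configuration; nothing else changes):

@Override
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    // -Dmapred.input.dir / -Dmapred.output.dir already put the paths into conf,
    // so calling FileInputFormat.addInputPath here would list the same file twice
    Job job = new Job(conf, "dummy job");
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setJarByClass(MyJob.class);
    return job.waitForCompletion(true) ? 0 : 1;
}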
I've had a similar problem, but for a different reason: Linux apparently created a hidden copy of my input file (~input.txt), so that's a second way of getting this error.
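Either way, a quick sanity check is to list what the job will actually see as input and look for stray second files such as editor backups (the /data path below is just the example directory from the question):

hadoop fs -ls /data        # HDFS input: count the files FileInputFormat will pick up
ls -la                     # local input directories: shows hidden and backup copies like ~input.txt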

Resources