Getting an error while implementing a simple sorting program in MapReduce with zero reduce tasks

I tried implementing a sorting program in MapReduce so that I get the sorted output right after the map phase, with the sorting done internally by the Hadoop framework. To do this, I set the number of reduce tasks to zero, as no reduction was required. When I executed the program, I kept getting a checksum error. I am not able to figure out what to do next. It should be possible to run the program on my netbook, as the sorting works fine when I set the number of reduce tasks to one. Please help!
For reference, here is the entire code I have written to perform the sorting:
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
/**
*
* @author root
*/
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.io.*;
import java.util.*;
import java.io.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.*;
import org.apache.hadoop.conf.*;
public class word extends Configured implements Tool
{
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
private static IntWritable one=new IntWritable(1);
private Text word=new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter report) throws IOException
{
String line=value.toString();
StringTokenizer token=new StringTokenizer(line," .,?!");
String wordToken=null;
while(token.hasMoreTokens())
{
wordToken=token.nextToken();
output.collect(new Text(wordToken), one);
}
}
}
public int run(String args[])throws Exception
{
//Configuration conf=getConf();
JobConf job=new JobConf(word.class);
job.setInputFormat(TextInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setOutputFormat(TextOutputFormat.class);
job.setMapperClass(Map.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
JobClient.runJob(job);
return 0;
}
public static void main(String args[])throws Exception
{
int exitCode=ToolRunner.run(new word(), args);
System.exit(exitCode);
}
}
Here is the checksum error I got on executing this program:
12/03/25 10:26:42 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
12/03/25 10:26:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/03/25 10:26:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/25 10:26:44 INFO mapred.FileInputFormat: Total input paths to process : 1
12/03/25 10:26:45 INFO mapred.JobClient: Running job: job_local_0001
12/03/25 10:26:45 INFO mapred.FileInputFormat: Total input paths to process : 1
12/03/25 10:26:45 INFO mapred.MapTask: numReduceTasks: 0
12/03/25 10:26:45 INFO fs.FSInputChecker: Found checksum error: b[0, 26]=610a630a620a640a650a740a790a780a730a670a7a0a680a730a
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/NetBeansProjects/projectAll/output/regionMulti/individual/part-00000 at 0
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
12/03/25 10:26:45 WARN mapred.LocalJobRunner: job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/NetBeansProjects/projectAll/output/regionMulti/individual/part-00000 at 0
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
12/03/25 10:26:46 INFO mapred.JobClient: map 0% reduce 0%
12/03/25 10:26:46 INFO mapred.JobClient: Job complete: job_local_0001
12/03/25 10:26:46 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at sortLog.run(sortLog.java:59)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at sortLog.main(sortLog.java:66)
Java Result: 1
BUILD SUCCESSFUL (total time: 4 seconds)

Have a look at org.apache.hadoop.mapred.MapTask, around line 600 in 0.20.2:
// get an output object
if (job.getNumReduceTasks() == 0) {
output =
new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
} else {
output = new NewOutputCollector(taskContext, job, umbilical, reporter);
}
If you set the number of reduce tasks to zero, the map output is written directly to the output. Otherwise the NewOutputCollector is used, which relies on the MapOutputBuffer to do the spilling, sorting, combining and partitioning.
So with zero reducers no sort takes place, even though Tom White suggests otherwise in Hadoop: The Definitive Guide.
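To make the difference concrete, here is a toy model in plain Java. It does not use Hadoop's real classes; the class and method names (MapOutputPaths, directOutput, bufferedOutput) are made up for illustration, and it only mimics the visible effect of the two collector paths described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: mimics the observable difference between the two
// collector paths, not Hadoop's actual implementation.
class MapOutputPaths {

    // Models the direct path (zero reduce tasks): records are written
    // in the order the mapper emitted them.
    static List<String> directOutput(List<String> emitted) {
        return new ArrayList<>(emitted);
    }

    // Models the buffered path (one or more reduce tasks): keys are
    // sorted before they reach the reducer.
    static List<String> bufferedOutput(List<String> emitted) {
        List<String> sorted = new ArrayList<>(emitted);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("c", "a", "b");
        System.out.println("0 reducers:  " + directOutput(emitted));
        System.out.println("1+ reducers: " + bufferedOutput(emitted));
    }
}
```

With zero reducers the keys come out in emission order; only the buffered path gives sorted keys.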

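I have faced the same problem (checksum error concerning file part-00000 at 0). I solved it by renaming the file to something other than part-00000. This likely works because the local file system looks for a hidden .part-00000.crc checksum file next to the input file; after renaming, the stale checksum file no longer matches by name and is ignored.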

So if you need at least one reducer to make the internal sorting happen, then you can use the IdentityReducer.
You may also want to see this discussion:
hadoop: difference between 0 reducer and identity reducer?

Related

All task attempts are done but the job failed in MapReduce

I work with 8 map tasks and 1 reduce task. Although all of the map task attempts completed successfully, the MapReduce job failed. My example code is from Hadoop Beginner's Guide (Garry Turkington) and is meant to demonstrate skipping bad data. The main idea of the program is to test task failure in MapReduce: even though the source file contains data that causes a failure ("skiptext" in the example), MapReduce should be able to complete the job successfully. But my job did not finish and was reported as failed. What should I do?
The full source code is:
import java.io.IOException;
import org.apache.hadoop.conf.* ;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.* ;
import org.apache.hadoop.mapred.* ;
import org.apache.hadoop.mapred.lib.* ;
public class SkipData
{
public static class MapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text, LongWritable>
{
private final static LongWritable one = new
LongWritable(1);
private Text word = new Text("totalcount");
public void map(LongWritable key, Text value,
OutputCollector<Text, LongWritable> output,
Reporter reporter) throws IOException
{
String line = value.toString();
if (line.equals("skiptext"))
throw new RuntimeException("Found skiptext") ;
output.collect(word, one);
}
}
public static void main(String[] args) throws Exception
{
Configuration config = new Configuration() ;
JobConf conf = new JobConf(config, SkipData.class);
conf.setJobName("SkipData");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(LongWritable.class);
conf.setMapperClass(MapClass.class);
conf.setCombinerClass(LongSumReducer.class);
conf.setReducerClass(LongSumReducer.class);
FileInputFormat.setInputPaths(conf,args[0]) ;
FileOutputFormat.setOutputPath(conf, new
Path(args[1])) ;
JobClient.runJob(conf);
}
}
The full error console is:
18/02/28 21:12:58 INFO mapreduce.Job: Job job_local724352166_0001 failed with state FAILED due to: NA
18/02/28 21:12:58 WARN mapred.LocalJobRunner: job_local724352166_0001
java.lang.Exception: java.lang.RuntimeException: Found skiptext
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.RuntimeException: Found skiptext
at mapredpack.SkipTest$MapClass.map(SkipTest.java:23)
at mapredpack.SkipTest$MapClass.map(SkipTest.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/02/28 21:12:58 DEBUG security.UserGroupInformation: PrivilegedAction as:naychi (auth:SIMPLE) from:org.apache.hadoop.mapreduce.Job.getCounters(Job.java:758)
18/02/28 21:12:59 DEBUG security.UserGroupInformation: PrivilegedAction as:naychi (auth:SIMPLE) from:org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:331)
18/02/28 21:12:59 INFO mapreduce.Job: Counters: 23
File System Counters
FILE: Number of bytes read=29905
FILE: Number of bytes written=2020669
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=128005127
HDFS: Number of bytes written=0
HDFS: Number of read operations=80
HDFS: Number of large read operations=0
HDFS: Number of write operations=7
Map-Reduce Framework
Map input records=1542671
Map output records=1542669
Map output bytes=29310711
Map output materialized bytes=135
Input split bytes=686
Combine input records=1161148
Combine output records=5
Spilled Records=5
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=8601
Total committed heap usage (bytes)=3840933888
File Input Format Counters
Bytes Read=23163911
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at mapredpack.SkipTest.main(SkipTest.java:58)
18/02/28 21:12:59 DEBUG ipc.Client: stopping client from cache: org.apache.hadoop.ipc.Client@2e55dd0c
18/02/28 21:12:59 DEBUG ipc.Client: removing client from cache: org.apache.hadoop.ipc.Client@2e55dd0c
18/02/28 21:12:59 DEBUG ipc.Client: stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@2e55dd0c
18/02/28 21:12:59 DEBUG ipc.Client: Stopping client
18/02/28 21:12:59 DEBUG ipc.Client: IPC Client (1313916817) connection to localhost/127.0.0.1:9000 from naychi: closed
18/02/28 21:12:59 DEBUG ipc.Client: IPC Client (1313916817) connection to localhost/127.0.0.1:9000 from naychi: stopped, remaining connections 0
It looks like the code is working as designed. A skiptext line was found, and the mapper is written to throw a task-ending exception in that case. This is a common coding technique to force people to implement logic at a certain point: put a throw new RuntimeException() where the code needs to be modified, and the developer is forced to look at that part of the code.
Look at the code and decide what you want to happen for a skiptext line. Is there additional logic you need to implement in place of the exception? If so, replace the thrown exception with the correct behavior.
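As one possible replacement behavior, here is a minimal sketch in plain Java (no Hadoop types) that ignores skiptext lines instead of throwing. Silently dropping the record is only an assumed policy, and the names SkipLineSketch and countEmittable are invented for this example; in the real mapper the same check would sit before output.collect(word, one).

```java
import java.util.Arrays;
import java.util.List;

// Sketch under an assumed policy: bad records are dropped, not fatal.
class SkipLineSketch {

    // Counts the lines that would be emitted, skipping "skiptext"
    // lines rather than throwing a RuntimeException.
    static long countEmittable(List<String> lines) {
        long count = 0;
        for (String line : lines) {
            if (line.equals("skiptext")) {
                continue; // assumed policy: ignore the bad record
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "skiptext", "b");
        System.out.println("emittable lines: " + countEmittable(lines));
    }
}
```

Whether dropping the record is acceptable depends on the job; logging or counting the skipped lines may be preferable.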

Map Reduce File Output Counter is zero

I am writing MapReduce code for inverted indexing of a file in which each line has the form "Doc_id Title Document Contents".
I cannot figure out why the File Output Format counter is zero, although the MapReduce job completes successfully without any exception.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class InvertedIndex {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, Text> {
private Text word = new Text();
private Text docID_Title = new Text();
//RemoveStopWords is a different class
static RemoveStopWords rmvStpWrd = new RemoveStopWords();
//Stemmer is a different class
Stemmer stemmer = new Stemmer();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
rmvStpWrd.makeStopWordList();
StringTokenizer itr = new StringTokenizer(value.toString().replaceAll(" [^\\p{L}]", " "));
//fetching id of the document
String id = null;
String title = null;
if(itr.hasMoreTokens())
id = itr.nextToken();
//fetching title of the document
if(itr.hasMoreTokens())
title = itr.nextToken();
String ID_TITLE = id + title;
if(id!=null)
docID_Title.set(ID_TITLE);
while (itr.hasMoreTokens()) {
/*manipulation of tokens:
* First we remove stop words
* Then Stem the words
*/
String temp = itr.nextToken().toLowerCase();
if(RemoveStopWords.isStopWord(temp)) {
continue;
}
else {
//now the word is not a stop word
//we will stem it
char[] a;
stemmer.add((a = temp.toCharArray()), a.length);
stemmer.stem();
temp = stemmer.toString();
word.set(temp);
context.write(word, docID_Title);
}
}//end while
}//end map
}//end mapper
public static class IntSumReducer
extends Reducer<Text,Text,Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
//to iterate over the values
Iterator<Text> itr = values.iterator();
String old = itr.next().toString();
int freq = 1;
String next = null;
boolean isThere = true;
StringBuilder stringBuilder = new StringBuilder();
while(itr.hasNext()) {
//freq counts number of times a word comes in a document
freq = 1;
while((isThere = itr.hasNext())) {
next = itr.next().toString();
if(old == next)
freq++;
else {
//the loop break when we get different docID_Title for the word(key)
break;
}
//if more data is there
if(isThere) {
old = old +"_"+ freq;
stringBuilder.append(old);
stringBuilder.append(" | ");
old = next;
context.write(key, new Text(stringBuilder.toString()));
stringBuilder.setLength(0);
}
else {
//for the last key
freq++;
old = old +"_"+ freq;
stringBuilder.append(old);
stringBuilder.append(" | ");
old = next;
context.write(key, new Text(stringBuilder.toString()));
}//end else
}//end while
}//end while
}//end reduce
}//end reducer
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "InvertedIndex");
job.setJarByClass(InvertedIndex.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}//end main
}//end InvertedIndex
This is the output I am getting:
16/10/03 15:34:21 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/10/03 15:34:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/10/03 15:34:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/10/03 15:34:22 INFO input.FileInputFormat: Total input paths to process : 1
16/10/03 15:34:22 INFO mapreduce.JobSubmitter: number of splits:1
16/10/03 15:34:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local507694567_0001
16/10/03 15:34:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/10/03 15:34:22 INFO mapreduce.Job: Running job: job_local507694567_0001
16/10/03 15:34:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/10/03 15:34:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/10/03 15:34:22 INFO mapred.LocalJobRunner: Waiting for map tasks
16/10/03 15:34:22 INFO mapred.LocalJobRunner: Starting task: attempt_local507694567_0001_m_000000_0
16/10/03 15:34:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/10/03 15:34:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/sonu/ss.txt:0+1002072
16/10/03 15:34:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/10/03 15:34:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/10/03 15:34:23 INFO mapred.MapTask: soft limit at 83886080
16/10/03 15:34:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/10/03 15:34:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/10/03 15:34:23 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/10/03 15:34:23 INFO mapreduce.Job: Job job_local507694567_0001 running in uber mode : false
16/10/03 15:34:23 INFO mapreduce.Job: map 0% reduce 0%
16/10/03 15:34:24 INFO mapred.LocalJobRunner:
16/10/03 15:34:24 INFO mapred.MapTask: Starting flush of map output
16/10/03 15:34:24 INFO mapred.MapTask: Spilling map output
16/10/03 15:34:24 INFO mapred.MapTask: bufstart = 0; bufend = 2206696; bufvoid = 104857600
16/10/03 15:34:24 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25789248(103156992); length = 425149/6553600
16/10/03 15:34:24 INFO mapred.MapTask: Finished spill 0
16/10/03 15:34:24 INFO mapred.Task: Task:attempt_local507694567_0001_m_000000_0 is done. And is in the process of committing
16/10/03 15:34:24 INFO mapred.LocalJobRunner: map
16/10/03 15:34:24 INFO mapred.Task: Task 'attempt_local507694567_0001_m_000000_0' done.
16/10/03 15:34:24 INFO mapred.LocalJobRunner: Finishing task: attempt_local507694567_0001_m_000000_0
16/10/03 15:34:24 INFO mapred.LocalJobRunner: map task executor complete.
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Starting task: attempt_local507694567_0001_r_000000_0
16/10/03 15:34:25 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:25 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/10/03 15:34:25 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@5d0e7307
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=333971456, maxSingleShuffleLimit=83492864, mergeThreshold=220421168, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/10/03 15:34:25 INFO reduce.EventFetcher: attempt_local507694567_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/10/03 15:34:25 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local507694567_0001_m_000000_0 decomp: 2 len: 6 to MEMORY
16/10/03 15:34:25 INFO reduce.InMemoryMapOutput: Read 2 bytes from map-output for attempt_local507694567_0001_m_000000_0
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->2
16/10/03 15:34:25 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
16/10/03 15:34:25 INFO mapred.Merger: Merging 1 sorted segments
16/10/03 15:34:25 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merged 1 segments, 2 bytes to disk to satisfy reduce memory limit
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merging 1 files, 6 bytes from disk
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/10/03 15:34:25 INFO mapred.Merger: Merging 1 sorted segments
16/10/03 15:34:25 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/10/03 15:34:25 INFO mapred.Task: Task:attempt_local507694567_0001_r_000000_0 is done. And is in the process of committing
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO mapred.Task: Task attempt_local507694567_0001_r_000000_0 is allowed to commit now
16/10/03 15:34:25 INFO output.FileOutputCommitter: Saved output of task 'attempt_local507694567_0001_r_000000_0' to hdfs://localhost:9000/user/sonu/output/_temporary/0/task_local507694567_0001_r_000000
16/10/03 15:34:25 INFO mapred.LocalJobRunner: reduce > reduce
16/10/03 15:34:25 INFO mapred.Task: Task 'attempt_local507694567_0001_r_000000_0' done.
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Finishing task: attempt_local507694567_0001_r_000000_0
16/10/03 15:34:25 INFO mapred.LocalJobRunner: reduce task executor complete.
16/10/03 15:34:25 INFO mapreduce.Job: map 100% reduce 100%
16/10/03 15:34:25 INFO mapreduce.Job: Job job_local507694567_0001 completed successfully
16/10/03 15:34:25 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=17342
FILE: Number of bytes written=571556
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2004144
HDFS: Number of bytes written=0
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=53
Map output records=106288
Map output bytes=2206696
Map output materialized bytes=6
Input split bytes=103
Combine input records=106288
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=6
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=12
Total committed heap usage (bytes)=562036736
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1002072
File Output Format Counters
Bytes Written=0

The MapReduce program is not giving me any output. Could somebody have a look at it?

When I run this MapReduce program, I do not get any output.
Input file: dict1.txt
apple,seo
apple,sev
dog,kukura
dog,kutta
cat,bilei
cat,billi
Output I want:
apple seo|sev
dog kukura|kutta
cat bilei|billi
Mapper class code :
package com.accure.Dict;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class DictMapper extends MapReduceBase implements Mapper<Text,Text,Text,Text> {
private Text word = new Text();
public void map(Text key,Text value,OutputCollector<Text,Text> output,Reporter reporter) throws IOException{
StringTokenizer itr = new StringTokenizer(value.toString(),",");
while (itr.hasMoreTokens())
{
System.out.println(key);
word.set(itr.nextToken());
output.collect(key, word);
}
}
}
Reducer code :
package com.accure.Dict;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class DictReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
private Text result = new Text();
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text,Text> output,Reporter reporter) throws IOException {
String translations = "";
while(values.hasNext()){
translations += "|" + values.next().toString();
}
result.set(translations);
output.collect(key,result);
}
}
Driver code :
package com.accure.driver;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import com.accure.Dict.DictMapper;
import com.accure.Dict.DictReducer;
public class DictDriver {
public static void main(String[] args) throws Exception{
// TODO Auto-generated method stub
JobConf conf=new JobConf();
conf.setJobName("wordcount_pradosh");
System.setProperty("HADOOP_USER_NAME","accure");
conf.set("fs.default.name","hdfs://host2.hadoop.career.com:54310/");
conf.set("hadoop.job.ugi","accuregrp");
conf.set("mapred.job.tracker","host2.hadoop.career.com:54311");
/*mapper and reduce class */
conf.setMapperClass(DictMapper.class);
conf.setReducerClass(DictReducer.class);
/*This particular jar file has your classes*/
conf.setJarByClass(DictMapper.class);
Path inputPath= new Path("/myCareer/pradosh/input");
Path outputPath=new Path("/myCareer/pradosh/output"+System.currentTimeMillis());
/*input and output directory path */
FileInputFormat.setInputPaths(conf,inputPath);
FileOutputFormat.setOutputPath(conf,outputPath);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Text.class);
/*output key and value class*/
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
/*input and output format */
conf.setInputFormat(KeyValueTextInputFormat.class); /*Here the file is a text file*/
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
}
}
output log :
14/04/02 08:33:38 INFO mapred.JobClient: Running job: job_201404010637_0011
14/04/02 08:33:39 INFO mapred.JobClient: map 0% reduce 0%
14/04/02 08:33:58 INFO mapred.JobClient: map 50% reduce 0%
14/04/02 08:33:59 INFO mapred.JobClient: map 100% reduce 0%
14/04/02 08:34:21 INFO mapred.JobClient: map 100% reduce 16%
14/04/02 08:34:23 INFO mapred.JobClient: map 100% reduce 100%
14/04/02 08:34:25 INFO mapred.JobClient: Job complete: job_201404010637_0011
14/04/02 08:34:25 INFO mapred.JobClient: Counters: 29
14/04/02 08:34:25 INFO mapred.JobClient: Job Counters
14/04/02 08:34:25 INFO mapred.JobClient: Launched reduce tasks=1
14/04/02 08:34:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=33692
14/04/02 08:34:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/04/02 08:34:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/04/02 08:34:25 INFO mapred.JobClient: Launched map tasks=2
14/04/02 08:34:25 INFO mapred.JobClient: Data-local map tasks=2
14/04/02 08:34:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=25327
14/04/02 08:34:25 INFO mapred.JobClient: File Input Format Counters
14/04/02 08:34:25 INFO mapred.JobClient: Bytes Read=92
14/04/02 08:34:25 INFO mapred.JobClient: File Output Format Counters
14/04/02 08:34:25 INFO mapred.JobClient: Bytes Written=0
14/04/02 08:34:25 INFO mapred.JobClient: FileSystemCounters
14/04/02 08:34:25 INFO mapred.JobClient: FILE_BYTES_READ=6
14/04/02 08:34:25 INFO mapred.JobClient: HDFS_BYTES_READ=336
14/04/02 08:34:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=169311
14/04/02 08:34:25 INFO mapred.JobClient: Map-Reduce Framework
14/04/02 08:34:25 INFO mapred.JobClient: Map output materialized bytes=12
14/04/02 08:34:25 INFO mapred.JobClient: Map input records=6
14/04/02 08:34:25 INFO mapred.JobClient: Reduce shuffle bytes=12
14/04/02 08:34:25 INFO mapred.JobClient: Spilled Records=0
14/04/02 08:34:25 INFO mapred.JobClient: Map output bytes=0
14/04/02 08:34:25 INFO mapred.JobClient: Total committed heap usage (bytes)=246685696
14/04/02 08:34:25 INFO mapred.JobClient: CPU time spent (ms)=2650
14/04/02 08:34:25 INFO mapred.JobClient: Map input bytes=61
14/04/02 08:34:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=244
14/04/02 08:34:25 INFO mapred.JobClient: Combine input records=0
14/04/02 08:34:25 INFO mapred.JobClient: Reduce input records=0
14/04/02 08:34:25 INFO mapred.JobClient: Reduce input groups=0
14/04/02 08:34:25 INFO mapred.JobClient: Combine output records=0
14/04/02 08:34:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=392347648
14/04/02 08:34:25 INFO mapred.JobClient: Reduce output records=0
14/04/02 08:34:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2173820928
14/04/02 08:34:25 INFO mapred.JobClient: Map output records=0
You are setting the input format to KeyValueTextInputFormat.
This format expects a separator byte between key and value, which is a tab by default. In your input the key and value are separated by ",", so the whole line becomes the key and the value is empty.
This is why execution never enters the loop below in your mapper:
while (itr.hasMoreTokens())
{
System.out.println(key);
word.set(itr.nextToken());
output.collect(key, word);
}
You should tokenize your key and take the first token as the key and the second token as the value.
The logs show the symptom: Map input records=6 but Map output records=0.
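The splitting behavior described above can be imitated in a few lines of plain Java. This is a sketch, not Hadoop's code; the names KeyValueSplitSketch and split are invented here. It splits at the first separator and falls back to "whole line as key, empty value" when the separator is absent. As far as I know, the old API also lets you change the separator through the key.value.separator.in.input.line property, which would be an alternative to tokenizing the key yourself; treat that as something to verify against your Hadoop version.

```java
// Sketch of how a key/value line is split at the first separator.
class KeyValueSplitSketch {

    // Returns {key, value}. When the separator does not occur, the
    // whole line is the key and the value is the empty string.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] byTab = split("apple,seo", '\t');
        String[] byComma = split("apple,seo", ',');
        System.out.println("tab:   key=" + byTab[0] + " value=" + byTab[1]);
        System.out.println("comma: key=" + byComma[0] + " value=" + byComma[1]);
    }
}
```

With the tab separator the value is empty, so a StringTokenizer over the value has no tokens and the mapper emits nothing.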

Hadoop mapper over-consumption of memory (heap)

I wrote a simple hash join program in hadoop map reduce. The idea is the following:
A small table is distributed to every mapper using DistributedCache provided by hadoop framework. The large table is distributed over the mappers with the split size being 64M.
The setup code of the mapper builds a hash map by reading every line of this small table. In the map method, every key is looked up (get) in the hash map, and if the key exists, the joined record is written out. There is no need for a reducer at this point. This is the code we use:
public class Map extends Mapper<LongWritable, Text, Text, Text> {
private HashMap<String, String> joinData = new HashMap<String, String>();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String textvalue = value.toString();
String[] tokens;
tokens = textvalue.split(",");
if (tokens.length == 2) {
String joinValue = joinData.get(tokens[0]);
if (null != joinValue) {
context.write(new Text(tokens[0]), new Text(tokens[1] + ","
+ joinValue));
}
}
}
public void setup(Context context) {
try {
Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
.getConfiguration());
if (null != cacheFiles && cacheFiles.length > 0) {
String line;
String[] tokens;
BufferedReader br = new BufferedReader(new FileReader(
cacheFiles[0].toString()));
try {
while ((line = br.readLine()) != null) {
tokens = line.split(",");
if (tokens.length == 2) {
joinData.put(tokens[0], tokens[1]);
}
}
System.exit(0);
} finally {
br.close();
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
While testing this code, our small table was 32M and the large table was 128M, on a cluster with one master and 2 slave nodes.
This code fails with the above inputs when I have 256M of heap. I set -Xmx256m in mapred.child.java.opts in the mapred-site.xml file. When I increase it to 300m it proceeds very slowly, and with 512m it reaches its maximum throughput.
I don't understand where my mapper is consuming so much memory. With the inputs given above and with this mapper code, I don't expect heap usage to ever reach 256M, yet it fails with a Java heap space error.
I would be thankful for any insight into why the mapper is consuming so much memory.
EDIT:
13/03/11 09:37:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/03/11 09:37:33 INFO input.FileInputFormat: Total input paths to process : 1
13/03/11 09:37:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/03/11 09:37:33 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/11 09:37:34 INFO mapred.JobClient: Running job: job_201303110921_0004
13/03/11 09:37:35 INFO mapred.JobClient: map 0% reduce 0%
13/03/11 09:39:12 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000000_0, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:40:43 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_0, Status : FAILED
org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: File /usr/home/hadoop/hadoop-1.0.3/libexec/../logs/userlogs/job_201303110921_0004/attempt_201303110921_0004_m_000001_0/log.tmp already exists
at org.apache.hadoop.io.SecureIOUtils.insecureCreateForWrite(SecureIOUtils.java:130)
at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:157)
at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
at org.apache.hadoop.mapred.Child$4.run(Child.java:257)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
attempt_201303110921_0004_m_000001_0: Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: Java heap space
attempt_201303110921_0004_m_000001_0: at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
attempt_201303110921_0004_m_000001_0: at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
attempt_201303110921_0004_m_000001_0: at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
attempt_201303110921_0004_m_000001_0: at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
attempt_201303110921_0004_m_000001_0: at org.apache.hadoop.mapred.Child$3.run(Child.java:141)
attempt_201303110921_0004_m_000001_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201303110921_0004_m_000001_0: log4j:WARN Please initialize the log4j system properly.
13/03/11 09:42:18 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_1, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:43:48 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_2, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:45:09 INFO mapred.JobClient: Job complete: job_201303110921_0004
13/03/11 09:45:09 INFO mapred.JobClient: Counters: 7
13/03/11 09:45:09 INFO mapred.JobClient: Job Counters
13/03/11 09:45:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=468506
13/03/11 09:45:09 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/11 09:45:09 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/03/11 09:45:09 INFO mapred.JobClient: Launched map tasks=6
13/03/11 09:45:09 INFO mapred.JobClient: Data-local map tasks=6
13/03/11 09:45:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/03/11 09:45:09 INFO mapred.JobClient: Failed map tasks=1
It's hard to say for sure where the memory is going, but here are a few pointers:
You're creating 2 Text objects for every line of your input. Instead, create 2 Text objects once as class variables in your Mapper, and for each line just call text.set(...). This is a common usage pattern in Map/Reduce code and can save quite a bit of memory overhead.
You should consider using the SequenceFile format for your input, which would avoid the need to parse the lines with textvalue.split; you would instead have this data directly available as an array. I've read several times that doing string splits like this can be quite intensive, so you should avoid them as much as possible if memory is really an issue. You can also think about using KeyValueTextInputFormat if, as in your example, you only care about key/value pairs.
If that isn't enough, I would advise looking at this link, especially part 7, which gives you a very simple method to profile your application and see what gets allocated where.
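The first pointer (reusing the output objects) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: MutableText and Collector are hypothetical stand-ins for Hadoop's Text and for the task context, written so the example runs without Hadoop on the classpath, and JoinMapper mirrors the join logic from the question rather than being a drop-in Mapper. Reuse is safe here because the stand-in Collector copies the contents at write time.

```java
import java.util.HashMap;
import java.util.Map;

class ReuseOutputObjectsSketch {

    /** Hypothetical stand-in for org.apache.hadoop.io.Text: a mutable holder. */
    static final class MutableText {
        private String value = "";
        void set(String v) { this.value = v; }
        @Override public String toString() { return value; }
    }

    /** Hypothetical stand-in for the task context: copies the contents at write time. */
    static final class Collector {
        private final StringBuilder out = new StringBuilder();
        void write(MutableText key, MutableText value) {
            out.append(key).append('\t').append(value).append('\n');
        }
        String contents() { return out.toString(); }
    }

    static final class JoinMapper {
        final Map<String, String> joinData = new HashMap<>();
        // Allocated once per mapper instance instead of once per record.
        private final MutableText outKey = new MutableText();
        private final MutableText outValue = new MutableText();

        void map(String line, Collector collector) {
            String[] tokens = line.split(",");
            if (tokens.length != 2) return;
            String joinValue = joinData.get(tokens[0]);
            if (joinValue == null) return;
            // Overwrite the reused holders rather than creating new ones.
            outKey.set(tokens[0]);
            outValue.set(tokens[1] + "," + joinValue);
            collector.write(outKey, outValue);
        }
    }

    public static void main(String[] args) {
        JoinMapper mapper = new JoinMapper();
        mapper.joinData.put("k1", "small"); // hypothetical small-table row
        Collector collector = new Collector();
        mapper.map("k1,big", collector);     // matches, gets written
        mapper.map("k2,ignored", collector); // no match, skipped
        System.out.print(collector.contents());
    }
}
```

In a real Mapper the two holders would be Text fields and the write would go through context.write(outKey, outValue).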

Hadoop: Reduce-side join gets stuck at map 100% reduce 100% and never finishes

I'm a beginner with Hadoop. I'm trying to run a
reduce-side join example, but it gets stuck at map 100% and reduce 100%
and never finishes. The progress output, logs, code, sample data and
configuration files are below:
Progress:
12/10/02 15:48:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/02 15:48:06 WARN snappy.LoadSnappy: Snappy native library not loaded
12/10/02 15:48:06 INFO mapred.FileInputFormat: Total input paths to process : 2
12/10/02 15:48:07 INFO mapred.JobClient: Running job: job_201210021515_0007
12/10/02 15:48:08 INFO mapred.JobClient: map 0% reduce 0%
12/10/02 15:48:26 INFO mapred.JobClient: map 66% reduce 0%
12/10/02 15:48:35 INFO mapred.JobClient: map 100% reduce 0%
12/10/02 15:48:38 INFO mapred.JobClient: map 100% reduce 22%
12/10/02 15:48:47 INFO mapred.JobClient: map 100% reduce 100%
Logs from Reduce task:
2012-10-02 15:48:28,018 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1f53935
2012-10-02 15:48:28,179 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=668126400, MaxSingleShuffleLimit=167031600
2012-10-02 15:48:28,202 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Thread started: Thread for merging on-disk files
2012-10-02 15:48:28,202 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Thread started: Thread for merging in memory files
2012-10-02 15:48:28,203 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-10-02 15:48:28,207 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-10-02 15:48:28,207 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Need another 3 map output(s) where 0 is already in progress
2012-10-02 15:48:28,208 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-10-02 15:48:33,209 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-10-02 15:48:33,596 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-10-02 15:48:38,606 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-10-02 15:48:39,239 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
2012-10-02 15:48:39,239 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
2012-10-02 15:48:39,241 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
2012-10-02 15:48:39,242 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 0 files left.
2012-10-02 15:48:39,242 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 3 files left.
2012-10-02 15:48:39,285 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted segments
2012-10-02 15:48:39,285 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 10500 bytes
2012-10-02 15:48:39,314 INFO org.apache.hadoop.mapred.ReduceTask: Merged 3 segments, 10500 bytes to disk to satisfy reduce memory limit
2012-10-02 15:48:39,318 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1 files, 10500 bytes from disk
2012-10-02 15:48:39,319 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
2012-10-02 15:48:39,320 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
2012-10-02 15:48:39,322 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 10496 bytes
Java Code:
public class DataJoin extends Configured implements Tool {
public static class MapClass extends DataJoinMapperBase {
protected Text generateInputTag(String inputFile) {//specify tag
String datasource = inputFile.split("-")[0];
return new Text(datasource);
}
protected Text generateGroupKey(TaggedMapOutput aRecord) {//takes a tagged record (of type TaggedMapOutput)and returns the group key for joining
String line = ((Text) aRecord.getData()).toString();
String[] tokens = line.split(",", 2);
String groupKey = tokens[0];
return new Text(groupKey);
}
protected TaggedMapOutput generateTaggedMapOutput(Object value) {//wraps the record value into a TaggedMapOutput type
TaggedWritable retv = new TaggedWritable((Text) value);
retv.setTag(this.inputTag);//inputTag: result of generateInputTag
return retv;
}
}
public static class Reduce extends DataJoinReducerBase {
protected TaggedMapOutput combine(Object[] tags, Object[] values) {//combination of the cross product of the tagged records with the same join (group) key
if (tags.length != 2) return null;
String joinedStr = "";
for (int i=0; i<tags.length; i++) {
if (i > 0) joinedStr += ",";
TaggedWritable tw = (TaggedWritable) values[i];
String line = ((Text) tw.getData()).toString();
if (line == null)
return null;
String[] tokens = line.split(",", 2);
joinedStr += tokens[1];
}
TaggedWritable retv = new TaggedWritable(new Text(joinedStr));
retv.setTag((Text) tags[0]);
return retv;
}
}
public static class TaggedWritable extends TaggedMapOutput {//tagged record
private Writable data;
public TaggedWritable() {
this.tag = new Text("");
this.data = null;
}
public TaggedWritable(Writable data) {
this.tag = new Text("");
this.data = data;
}
public Writable getData() {
return data;
}
@Override
public void write(DataOutput out) throws IOException {
this.tag.write(out);
out.writeUTF(this.data.getClass().getName());
this.data.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
this.tag.readFields(in);
String dataClz = in.readUTF();
if ((this.data == null) || !this.data.getClass().getName().equals(dataClz)) {
try {
this.data = (Writable) ReflectionUtils.newInstance(Class.forName(dataClz), null);
System.out.printf(dataClz);
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
this.data.readFields(in);
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
JobConf job = new JobConf(conf, DataJoin.class);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("DataJoin");
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(TaggedWritable.class);
job.set("mapred.textoutputformat.separator", ",");
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(),
new DataJoin(),
args);
System.exit(res);
}
}
Sample data:
file 1: apat.txt(1 line) 4373932,1983,8446,1981,"NL","",16025,2,65,436,1,19,108,49,1,0.5289,0.6516,9.8571,4.1481,0.0109,0.0093,0,0
file 2: cite.txt(100 lines)
4373932,3641235
4373932,3720760
4373932,3853987
4373932,3900558
4373932,3939350
4373932,3941876
4373932,3992631
4373932,3996345
4373932,3998943
4373932,3999948
4373932,4001400
4373932,4011219
4373932,4025310
4373932,4036946
4373932,4058732
4373932,4104029
4373932,4108972
4373932,4160016
4373932,4160018
4373932,4160019
4373932,4160818
4373932,4161515
4373932,4163779
4373932,4168146
4373932,4169137
4373932,4181650
4373932,4187075
4373932,4197361
4373932,4199599
4373932,4200436
4373932,4201763
4373932,4207075
4373932,4208479
4373932,4211766
4373932,4215102
4373932,4220450
4373932,4222744
4373932,4225783
4373932,4231750
4373932,4234563
4373932,4235869
4373932,4238195
4373932,4238395
4373932,4248854
4373932,4251514
4373932,4258130
4373932,4248965
4373932,4252783
4373932,4254097
4373932,4259313
4373932,4272505
4373932,4272506
4373932,4277437
4373932,4279992
4373932,4283382
4373932,4294817
4373932,4296201
4373932,4297273
4373932,4298687
4373932,4302534
4373932,4314026
4373932,4318707
4373932,4318846
4373932,3773625
4373932,3935074
4373932,3951748
4373932,3992516
4373932,3996344
4373932,3997657
4373932,4011308
4373932,4016250
4373932,4018884
4373932,4056724
4373932,4067959
4373932,4069352
4373932,4097586
4373932,4098876
4373932,4130462
4373932,4152411
4373932,4153675
4373932,4174384
4373932,4222743
4373932,4254096
4373932,4256834
4373932,4284412
4373932,4323647
4373932,3985867
4373932,4166105
4373932,4278653
4373932,4194877
4373932,4202815
4373932,4286959
4373932,4302536
4373932,4020151
4373932,4115535
4373932,4152412
4373932,4177253
4373932,4223002
4373932,4225485
4373932,4261968
Configurations:
core-site.xml
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
mapred-site.xml
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
hdfs-site.xml
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
I've searched for an answer and tried changes to the code and to the configuration in the (mapred/core/hdfs)-site.xml files, but I'm lost. I run this code in pseudo-distributed mode. The join key is the same in both files. If I cut cite.txt down to 99 lines or fewer, the job runs fine, but with 100 lines or more it gets stuck as shown in the logs. Please help me figure out the problem. I would appreciate an explanation.
Best regards,
HaiLong
Please check your Reduce class.
I faced a similar problem, which turned out to be a very silly mistake. Maybe this will help you solve the issue:
while (values.hasNext()) {
String val = values.next().toString();
.....
}
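You need to call .next() inside the loop. Without it the iterator never advances, hasNext() keeps returning true, and the loop never ends.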
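For reference, a small self-contained sketch of the loop shape, using a plain java.util.Iterator in place of the reducer's values iterator. The class and method names are made up for illustration, and the sample numbers are simply two values from the question's cite.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class IteratorLoopSketch {

    /**
     * Drains the iterator. Each next() call advances it, so hasNext()
     * eventually returns false and the loop terminates.
     */
    static List<String> drain(Iterator<?> values) {
        List<String> seen = new ArrayList<>();
        while (values.hasNext()) {
            String val = values.next().toString(); // omitting next() here would loop forever
            seen.add(val);
        }
        return seen;
    }

    public static void main(String[] args) {
        Iterator<Integer> values = Arrays.asList(3641235, 3720760).iterator();
        System.out.println(drain(values));
    }
}
```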

Resources