Hadoop mapper over-consumption of memory (heap)

I wrote a simple hash-join program in Hadoop MapReduce. The idea is the following:
a small table is distributed to every mapper using the DistributedCache provided by the Hadoop framework, and the large table is distributed over the mappers with a split size of 64M.
The setup code of the mapper builds a HashMap by reading every line of the small table. In the mapper code, every key is looked up (get) in the HashMap, and if the key exists it is written out. There is no need for a reducer at this point. This is the code we use:
public class Map extends Mapper<LongWritable, Text, Text, Text> {

    private HashMap<String, String> joinData = new HashMap<String, String>();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String textvalue = value.toString();
        String[] tokens = textvalue.split(",");
        if (tokens.length == 2) {
            String joinValue = joinData.get(tokens[0]);
            if (null != joinValue) {
                context.write(new Text(tokens[0]),
                        new Text(tokens[1] + "," + joinValue));
            }
        }
    }
    public void setup(Context context) {
        try {
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
                    .getConfiguration());
            if (null != cacheFiles && cacheFiles.length > 0) {
                String line;
                String[] tokens;
                BufferedReader br = new BufferedReader(new FileReader(
                        cacheFiles[0].toString()));
                try {
                    while ((line = br.readLine()) != null) {
                        tokens = line.split(",");
                        if (tokens.length == 2) {
                            joinData.put(tokens[0], tokens[1]);
                        }
                    }
                } finally {
                    br.close();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
While testing this code, our small table was 32M, the large table was 128M, and the cluster had one master and two slave nodes.
This code fails with the above inputs when the mapper has a 256M heap (set via -Xmx256m in mapred.child.java.opts in mapred-site.xml). When I increase it to 300m it proceeds very slowly, and with 512m it reaches its maximum throughput.
I don't understand where my mapper is consuming so much memory. With the inputs given above and the mapper code shown, I don't expect the heap usage to ever reach 256M, yet the task fails with a "Java heap space" error.
I would be thankful for any insight into why the mapper is consuming so much memory.
EDIT:
13/03/11 09:37:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/03/11 09:37:33 INFO input.FileInputFormat: Total input paths to process : 1
13/03/11 09:37:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/03/11 09:37:33 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/11 09:37:34 INFO mapred.JobClient: Running job: job_201303110921_0004
13/03/11 09:37:35 INFO mapred.JobClient: map 0% reduce 0%
13/03/11 09:39:12 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000000_0, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:40:43 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_0, Status : FAILED
org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: File /usr/home/hadoop/hadoop-1.0.3/libexec/../logs/userlogs/job_201303110921_0004/attempt_201303110921_0004_m_000001_0/log.tmp already exists
at org.apache.hadoop.io.SecureIOUtils.insecureCreateForWrite(SecureIOUtils.java:130)
at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:157)
at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
at org.apache.hadoop.mapred.Child$4.run(Child.java:257)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
attempt_201303110921_0004_m_000001_0: Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: Java heap space
attempt_201303110921_0004_m_000001_0: at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
attempt_201303110921_0004_m_000001_0: at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
attempt_201303110921_0004_m_000001_0: at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
attempt_201303110921_0004_m_000001_0: at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
attempt_201303110921_0004_m_000001_0: at org.apache.hadoop.mapred.Child$3.run(Child.java:141)
attempt_201303110921_0004_m_000001_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201303110921_0004_m_000001_0: log4j:WARN Please initialize the log4j system properly.
13/03/11 09:42:18 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_1, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:43:48 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_2, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:45:09 INFO mapred.JobClient: Job complete: job_201303110921_0004
13/03/11 09:45:09 INFO mapred.JobClient: Counters: 7
13/03/11 09:45:09 INFO mapred.JobClient: Job Counters
13/03/11 09:45:09 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=468506
13/03/11 09:45:09 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/11 09:45:09 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/03/11 09:45:09 INFO mapred.JobClient: Launched map tasks=6
13/03/11 09:45:09 INFO mapred.JobClient: Data-local map tasks=6
13/03/11 09:45:09 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/03/11 09:45:09 INFO mapred.JobClient: Failed map tasks=1

It's hard to say for sure where the memory consumption is going, but here are a few pointers:
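As a back-of-envelope check on how a 32M file can exhaust a 256M heap, consider the per-object overhead of String and HashMap entries. The line length and overhead figures below are illustrative assumptions (they vary by JVM and pointer size), not measurements:

```java
public class HeapEstimate {
    public static void main(String[] args) {
        long fileBytes = 32L * 1024 * 1024;   // 32 MB cache file
        long entries   = fileBytes / 20;      // assume ~20-byte "key,value\n" lines
        long perEntry  = 2 * 40               // two String object headers (assumed)
                       + 2 * 20               // char[] data: 2 bytes/char, ~10 chars each
                       + 36;                  // HashMap.Entry + table slot (assumed)
        System.out.println(entries * perEntry / (1024 * 1024) + " MB"); // prints "249 MB"
    }
}
```

Even with these rough numbers, the 32 MB file expands to roughly the entire 256 MB heap once stored as String keys and values, which is consistent with the failure the question describes.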
You're creating two Text objects for every line of your input. You should instead use two Text objects initialized once as class variables in your Mapper, and for each line just call text.set(...). This is a common Map/Reduce usage pattern and can save quite a bit of memory overhead.
You should consider using the SequenceFile format for your input, which would avoid the need to parse the lines with textvalue.split; the data would instead be available directly as typed key/value records. I've read several times that doing string splits like this can be quite expensive, so you should avoid it as much as possible if memory is really an issue. You can also think about using KeyValueTextInputFormat if, as in your example, you only care about key/value pairs.
If that isn't enough, I would advise looking at this link, especially part 7, which gives you a very simple method to profile your application and see what gets allocated where.
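The reuse pattern from the first pointer would look roughly like this applied to the question's mapper (a sketch; the join logic is otherwise unchanged):

```java
import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, Text> {

    private HashMap<String, String> joinData = new HashMap<String, String>();
    // Allocated once per task and overwritten per record, instead of
    // creating two new Text objects for every matching line.
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split(",");
        if (tokens.length == 2) {
            String joinValue = joinData.get(tokens[0]);
            if (joinValue != null) {
                outKey.set(tokens[0]);                     // reuse, don't allocate
                outValue.set(tokens[1] + "," + joinValue);
                context.write(outKey, outValue);
            }
        }
    }
}
```

This mainly reduces allocation churn (and thus GC pressure) per record; the HashMap itself still dominates the steady-state heap.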

Related

The MapReduce program is not giving me any output. Could somebody have a look into it?

When I run this MapReduce program, I am not getting any result.
Input file: dict1.txt
apple,seo
apple,sev
dog,kukura
dog,kutta
cat,bilei
cat,billi
Output I want:
apple seo|sev
dog kukura|kutta
cat bilei|billi
Mapper class code:
package com.accure.Dict;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DictMapper extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

    private Text word = new Text();

    public void map(Text key, Text value, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString(), ",");
        while (itr.hasMoreTokens()) {
            System.out.println(key);
            word.set(itr.nextToken());
            output.collect(key, word);
        }
    }
}
Reducer code:
package com.accure.Dict;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DictReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        String translations = "";
        while (values.hasNext()) {
            translations += "|" + values.next().toString();
        }
        result.set(translations);
        output.collect(key, result);
    }
}
Driver code:
package com.accure.driver;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

import com.accure.Dict.DictMapper;
import com.accure.Dict.DictReducer;

public class DictDriver {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.setJobName("wordcount_pradosh");
        System.setProperty("HADOOP_USER_NAME", "accure");
        conf.set("fs.default.name", "hdfs://host2.hadoop.career.com:54310/");
        conf.set("hadoop.job.ugi", "accuregrp");
        conf.set("mapred.job.tracker", "host2.hadoop.career.com:54311");

        /* mapper and reducer class */
        conf.setMapperClass(DictMapper.class);
        conf.setReducerClass(DictReducer.class);
        /* this particular jar file has your classes */
        conf.setJarByClass(DictMapper.class);

        Path inputPath = new Path("/myCareer/pradosh/input");
        Path outputPath = new Path("/myCareer/pradosh/output" + System.currentTimeMillis());
        /* input and output directory paths */
        FileInputFormat.setInputPaths(conf, inputPath);
        FileOutputFormat.setOutputPath(conf, outputPath);

        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        /* output key and value classes */
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        /* input and output formats */
        conf.setInputFormat(KeyValueTextInputFormat.class); /* here the input is a text file */
        conf.setOutputFormat(TextOutputFormat.class);

        JobClient.runJob(conf);
    }
}
Output log:
14/04/02 08:33:38 INFO mapred.JobClient: Running job: job_201404010637_0011
14/04/02 08:33:39 INFO mapred.JobClient: map 0% reduce 0%
14/04/02 08:33:58 INFO mapred.JobClient: map 50% reduce 0%
14/04/02 08:33:59 INFO mapred.JobClient: map 100% reduce 0%
14/04/02 08:34:21 INFO mapred.JobClient: map 100% reduce 16%
14/04/02 08:34:23 INFO mapred.JobClient: map 100% reduce 100%
14/04/02 08:34:25 INFO mapred.JobClient: Job complete: job_201404010637_0011
14/04/02 08:34:25 INFO mapred.JobClient: Counters: 29
14/04/02 08:34:25 INFO mapred.JobClient: Job Counters
14/04/02 08:34:25 INFO mapred.JobClient: Launched reduce tasks=1
14/04/02 08:34:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=33692
14/04/02 08:34:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/04/02 08:34:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/04/02 08:34:25 INFO mapred.JobClient: Launched map tasks=2
14/04/02 08:34:25 INFO mapred.JobClient: Data-local map tasks=2
14/04/02 08:34:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=25327
14/04/02 08:34:25 INFO mapred.JobClient: File Input Format Counters
14/04/02 08:34:25 INFO mapred.JobClient: Bytes Read=92
14/04/02 08:34:25 INFO mapred.JobClient: File Output Format Counters
14/04/02 08:34:25 INFO mapred.JobClient: Bytes Written=0
14/04/02 08:34:25 INFO mapred.JobClient: FileSystemCounters
14/04/02 08:34:25 INFO mapred.JobClient: FILE_BYTES_READ=6
14/04/02 08:34:25 INFO mapred.JobClient: HDFS_BYTES_READ=336
14/04/02 08:34:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=169311
14/04/02 08:34:25 INFO mapred.JobClient: Map-Reduce Framework
14/04/02 08:34:25 INFO mapred.JobClient: Map output materialized bytes=12
14/04/02 08:34:25 INFO mapred.JobClient: Map input records=6
14/04/02 08:34:25 INFO mapred.JobClient: Reduce shuffle bytes=12
14/04/02 08:34:25 INFO mapred.JobClient: Spilled Records=0
14/04/02 08:34:25 INFO mapred.JobClient: Map output bytes=0
14/04/02 08:34:25 INFO mapred.JobClient: Total committed heap usage (bytes)=246685696
14/04/02 08:34:25 INFO mapred.JobClient: CPU time spent (ms)=2650
14/04/02 08:34:25 INFO mapred.JobClient: Map input bytes=61
14/04/02 08:34:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=244
14/04/02 08:34:25 INFO mapred.JobClient: Combine input records=0
14/04/02 08:34:25 INFO mapred.JobClient: Reduce input records=0
14/04/02 08:34:25 INFO mapred.JobClient: Reduce input groups=0
14/04/02 08:34:25 INFO mapred.JobClient: Combine output records=0
14/04/02 08:34:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=392347648
14/04/02 08:34:25 INFO mapred.JobClient: Reduce output records=0
14/04/02 08:34:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2173820928
14/04/02 08:34:25 INFO mapred.JobClient: Map output records=0
When reading the input you are setting the input format to KeyValueTextInputFormat.
By default this format expects a tab byte as the separator between key and value. In your input the key and value are separated by ",", so the whole line goes in as the key and the value is empty.
That is why execution never enters the loop below in your mapper:
while (itr.hasMoreTokens()) {
    System.out.println(key);
    word.set(itr.nextToken());
    output.collect(key, word);
}
You should either configure the separator or tokenize the key yourself, taking the first token as the key and the second token as the value.
This is evidenced in the logs: Map input records=6 but Map output records=0.
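If you want to keep KeyValueTextInputFormat, the old-API record reader also lets you change the separator byte; in Hadoop 1.x this is, to my knowledge, the key.value.separator.in.input.line property, so a driver fragment along these lines should make "," the delimiter:

```java
// In the driver, before submitting the job: make KeyValueTextInputFormat
// split each line on the first ',' instead of the default tab, so that
// "apple,seo" arrives in the mapper as key "apple" and value "seo".
conf.set("key.value.separator.in.input.line", ",");
conf.setInputFormat(KeyValueTextInputFormat.class);
```

With the separator configured this way, the mapper's key/value pair is already split, and the StringTokenizer loop will see a non-empty value.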

Hadoop MapReduce: running two jobs inside one Configured/Tool produces no output on 2nd job

I need to run 2 map reduce jobs such that the 2nd takes as input the output from the first job. I'd like to do this within a single invocation, where MyClass extends Configured and implements Tool.
I've written the code, and it works as long as I don't run the two jobs within the same invocation (this works):
hadoop jar myjar.jar path.to.my.class.MyClass -i input -o output -m job1
hadoop jar myjar.jar path.to.my.class.MyClass -i dummy -o output -m job2
But this doesn't:
hadoop jar myjar.jar path.to.my.class.MyClass -i input -o output -m all
(-m stands for "mode")
In this case, the output of the first job does not make it to the mappers of the 2nd job (I figured this out by debugging), but I can't figure out why.
I've seen other posts on chaining, but they are for the "old" mapred API. And I need to run third-party code between the jobs, so I don't know whether ChainMapper/ChainReducer will work for my use case.
Using hadoop version 1.0.3, AWS Elastic MapReduce distribution.
Code:
import java.io.IOException;
import org.apache.commons.cli.BasicParser;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.OptionBuilder;
import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MyClass extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyClass(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        CommandLineParser parser = new BasicParser();
        Options allOptions = setupOptions();
        Configuration conf = getConf();
        String[] argv_ = new GenericOptionsParser(conf, args).getRemainingArgs();
        CommandLine cmdLine = parser.parse(allOptions, argv_);
        boolean doJob1 = true;
        boolean doJob2 = true;
        if (cmdLine.hasOption('m')) {
            String mode = cmdLine.getOptionValue('m');
            if ("job1".equals(mode)) {
                doJob2 = false;
            } else if ("job2".equals(mode)) {
                doJob1 = false;
            }
        }
        Path outPath = new Path(cmdLine.getOptionValue("output"), "job1out");
        Job job = new Job(conf, "HBase Prep For Data Build");
        Job job2 = new Job(conf, "HBase SessionIndex load");
        if (doJob1) {
            conf = job.getConfiguration();
            String[] values = cmdLine.getOptionValues("input");
            if (values != null && values.length > 0) {
                for (String input : values) {
                    System.out.println("input:" + input);
                    FileInputFormat.addInputPaths(job, input);
                }
            }
            job.setJarByClass(MyClass.class);
            job.setMapperClass(SessionMapper.class);
            MultipleOutputs.setCountersEnabled(job, false);
            MultipleOutputs.addNamedOutput(job, "sessionindex", TextOutputFormat.class,
                    Text.class, Text.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(KeyValue.class);
            job.setOutputFormatClass(HFileOutputFormat.class);
            HTable hTable = new HTable(conf, "session");
            // Auto configure partitioner and reducer
            HFileOutputFormat.configureIncrementalLoad(job, hTable);
            FileOutputFormat.setOutputPath(job, outPath);
            if (!job.waitForCompletion(true)) {
                return 1;
            }
            // Load generated HFiles into table
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(outPath, hTable);
            FileSystem fs = FileSystem.get(outPath.toUri(), conf);
            // I delete this because after the HBase bulk load it is left as an
            // empty directory, which causes problems later
            fs.delete(new Path(outPath, "cf"), true);
        }
        /////////////////////////////////////////////
        //                SECOND JOB                //
        /////////////////////////////////////////////
        if (doJob2) {
            conf = job2.getConfiguration();
            System.out.println("-- job 2 input path : " + outPath.toString());
            FileInputFormat.setInputPaths(job2, outPath.toString());
            job2.setJarByClass(MyClass.class);
            job2.setMapperClass(SessionIndexMapper.class);
            MultipleOutputs.setCountersEnabled(job2, false);
            job2.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job2.setMapOutputValueClass(KeyValue.class);
            job2.setOutputFormatClass(HFileOutputFormat.class);
            HTable hTable = new HTable(conf, "session_index_by_hour");
            // Auto configure partitioner and reducer
            HFileOutputFormat.configureIncrementalLoad(job2, hTable);
            outPath = new Path(cmdLine.getOptionValue("output"), "job2out");
            System.out.println("-- job 2 output path: " + outPath.toString());
            FileOutputFormat.setOutputPath(job2, outPath);
            if (!job2.waitForCompletion(true)) {
                return 2;
            }
            // Load generated HFiles into table
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(outPath, hTable);
        }
        return 0;
    }
    public static class SessionMapper extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

        private MultipleOutputs<ImmutableBytesWritable, KeyValue> multiOut;

        @Override
        public void setup(Context context) throws IOException {
            multiOut = new MultipleOutputs<ImmutableBytesWritable, KeyValue>(context);
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            ...
            context.write(..., ...); // this is called multiple times
            multiOut.write("sessionindex", new Text(...), new Text(...), "sessionindex");
        }
    }

    public static class SessionIndexMapper extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new ImmutableBytesWritable(...), new KeyValue(...));
        }
    }
    private static Options setupOptions() {
        Option input = createOption("i", "input",
                "input file(s) for the Map step", "path", Integer.MAX_VALUE, true);
        Option output = createOption("o", "output",
                "output directory for the Reduce step", "path", 1, true);
        Option mode = createOption("m", "mode",
                "what mode ('all', 'job1', 'job2')", "-mode-", 1, false);
        return new Options().addOption(input).addOption(output).addOption(mode);
    }

    public static Option createOption(String name, String longOpt, String desc,
            String argName, int max, boolean required) {
        OptionBuilder.withArgName(argName);
        OptionBuilder.hasArgs(max);
        OptionBuilder.withDescription(desc);
        OptionBuilder.isRequired(required);
        OptionBuilder.withLongOpt(longOpt);
        return OptionBuilder.create(name);
    }
}
Output (single invocation):
input:s3n://...snip...
13/12/09 23:08:43 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/09 23:08:43 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/12/09 23:08:43 INFO compress.CodecPool: Got brand-new compressor
13/12/09 23:08:43 INFO mapred.JobClient: Default number of map tasks: null
13/12/09 23:08:43 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 2
13/12/09 23:08:43 INFO mapred.JobClient: Default number of reduce tasks: 1
13/12/09 23:08:43 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
13/12/09 23:08:43 INFO mapred.JobClient: Setting group to hadoop
13/12/09 23:08:43 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 23:08:43 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/12/09 23:08:43 WARN lzo.LzoCodec: Could not find build properties file with revision hash
13/12/09 23:08:43 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
13/12/09 23:08:43 WARN snappy.LoadSnappy: Snappy native library is available
13/12/09 23:08:43 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/09 23:08:44 INFO mapred.JobClient: Running job: job_201312062235_0044
13/12/09 23:08:45 INFO mapred.JobClient: map 0% reduce 0%
13/12/09 23:09:09 INFO mapred.JobClient: map 100% reduce 0%
13/12/09 23:09:27 INFO mapred.JobClient: map 100% reduce 100%
13/12/09 23:09:32 INFO mapred.JobClient: Job complete: job_201312062235_0044
13/12/09 23:09:32 INFO mapred.JobClient: Counters: 42
13/12/09 23:09:32 INFO mapred.JobClient: MyCounter1
13/12/09 23:09:32 INFO mapred.JobClient: ValidCurrentDay=3526
13/12/09 23:09:32 INFO mapred.JobClient: Job Counters
13/12/09 23:09:32 INFO mapred.JobClient: Launched reduce tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=19693
13/12/09 23:09:32 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 23:09:32 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/09 23:09:32 INFO mapred.JobClient: Rack-local map tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: Launched map tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15201
13/12/09 23:09:32 INFO mapred.JobClient: File Output Format Counters
13/12/09 23:09:32 INFO mapred.JobClient: Bytes Written=1979245
13/12/09 23:09:32 INFO mapred.JobClient: FileSystemCounters
13/12/09 23:09:32 INFO mapred.JobClient: S3N_BYTES_READ=51212
13/12/09 23:09:32 INFO mapred.JobClient: FILE_BYTES_READ=400417
13/12/09 23:09:32 INFO mapred.JobClient: HDFS_BYTES_READ=231
13/12/09 23:09:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=859881
13/12/09 23:09:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2181624
13/12/09 23:09:32 INFO mapred.JobClient: File Input Format Counters
13/12/09 23:09:32 INFO mapred.JobClient: Bytes Read=51212
13/12/09 23:09:32 INFO mapred.JobClient: MyCounter2
13/12/09 23:09:32 INFO mapred.JobClient: ASCII=3526
13/12/09 23:09:32 INFO mapred.JobClient: StatsUnaggregatedMapEventTypeCurrentDay
13/12/09 23:09:32 INFO mapred.JobClient: adProgress0=343
13/12/09 23:09:32 INFO mapred.JobClient: asset=562
13/12/09 23:09:32 INFO mapred.JobClient: podComplete=612
13/12/09 23:09:32 INFO mapred.JobClient: adProgress100=247
13/12/09 23:09:32 INFO mapred.JobClient: adProgress25=247
13/12/09 23:09:32 INFO mapred.JobClient: click=164
13/12/09 23:09:32 INFO mapred.JobClient: adProgress50=247
13/12/09 23:09:32 INFO mapred.JobClient: adCall=244
13/12/09 23:09:32 INFO mapred.JobClient: adProgress75=247
13/12/09 23:09:32 INFO mapred.JobClient: podStart=613
13/12/09 23:09:32 INFO mapred.JobClient: Map-Reduce Framework
13/12/09 23:09:32 INFO mapred.JobClient: Map output materialized bytes=400260
13/12/09 23:09:32 INFO mapred.JobClient: Map input records=3526
13/12/09 23:09:32 INFO mapred.JobClient: Reduce shuffle bytes=400260
13/12/09 23:09:32 INFO mapred.JobClient: Spilled Records=14104
13/12/09 23:09:32 INFO mapred.JobClient: Map output bytes=2343990
13/12/09 23:09:32 INFO mapred.JobClient: Total committed heap usage (bytes)=497549312
13/12/09 23:09:32 INFO mapred.JobClient: CPU time spent (ms)=10120
13/12/09 23:09:32 INFO mapred.JobClient: Combine input records=0
13/12/09 23:09:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=231
13/12/09 23:09:32 INFO mapred.JobClient: Reduce input records=7052
13/12/09 23:09:32 INFO mapred.JobClient: Reduce input groups=246
13/12/09 23:09:32 INFO mapred.JobClient: Combine output records=0
13/12/09 23:09:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=519942144
13/12/09 23:09:32 INFO mapred.JobClient: Reduce output records=7052
13/12/09 23:09:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3076526080
13/12/09 23:09:32 INFO mapred.JobClient: Map output records=7052
13/12/09 23:09:32 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job1out/_SUCCESS
13/12/09 23:09:32 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job1out/sessionindex-m-00000
1091740526
-- job 2 input path : /path/job1out
-- job 2 output path: /path/job2out
13/12/09 23:09:32 INFO mapred.JobClient: Default number of map tasks: null
13/12/09 23:09:32 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 2
13/12/09 23:09:32 INFO mapred.JobClient: Default number of reduce tasks: 1
13/12/09 23:09:33 INFO mapred.JobClient: Setting group to hadoop
13/12/09 23:09:33 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 23:09:33 INFO mapred.JobClient: Running job: job_201312062235_0045
13/12/09 23:09:34 INFO mapred.JobClient: map 0% reduce 0%
13/12/09 23:09:51 INFO mapred.JobClient: map 100% reduce 0%
13/12/09 23:10:03 INFO mapred.JobClient: map 100% reduce 33%
13/12/09 23:10:06 INFO mapred.JobClient: map 100% reduce 100%
13/12/09 23:10:11 INFO mapred.JobClient: Job complete: job_201312062235_0045
13/12/09 23:10:11 INFO mapred.JobClient: Counters: 27
13/12/09 23:10:11 INFO mapred.JobClient: Job Counters
13/12/09 23:10:11 INFO mapred.JobClient: Launched reduce tasks=1
13/12/09 23:10:11 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13533
13/12/09 23:10:11 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 23:10:11 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/09 23:10:11 INFO mapred.JobClient: Launched map tasks=1
13/12/09 23:10:11 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=12176
13/12/09 23:10:11 INFO mapred.JobClient: File Output Format Counters
13/12/09 23:10:11 INFO mapred.JobClient: Bytes Written=0
13/12/09 23:10:11 INFO mapred.JobClient: FileSystemCounters
13/12/09 23:10:11 INFO mapred.JobClient: FILE_BYTES_READ=173
13/12/09 23:10:11 INFO mapred.JobClient: HDFS_BYTES_READ=134
13/12/09 23:10:11 INFO mapred.JobClient: FILE_BYTES_WRITTEN=57735
13/12/09 23:10:11 INFO mapred.JobClient: File Input Format Counters
13/12/09 23:10:11 INFO mapred.JobClient: Bytes Read=0
13/12/09 23:10:11 INFO mapred.JobClient: Map-Reduce Framework
13/12/09 23:10:11 INFO mapred.JobClient: Map output materialized bytes=16
13/12/09 23:10:11 INFO mapred.JobClient: Map input records=0
13/12/09 23:10:11 INFO mapred.JobClient: Reduce shuffle bytes=16
13/12/09 23:10:11 INFO mapred.JobClient: Spilled Records=0
13/12/09 23:10:11 INFO mapred.JobClient: Map output bytes=0
13/12/09 23:10:11 INFO mapred.JobClient: Total committed heap usage (bytes)=434634752
13/12/09 23:10:11 INFO mapred.JobClient: CPU time spent (ms)=2270
13/12/09 23:10:11 INFO mapred.JobClient: Combine input records=0
13/12/09 23:10:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=134
13/12/09 23:10:11 INFO mapred.JobClient: Reduce input records=0
13/12/09 23:10:11 INFO mapred.JobClient: Reduce input groups=0
13/12/09 23:10:11 INFO mapred.JobClient: Combine output records=0
13/12/09 23:10:11 INFO mapred.JobClient: Physical memory (bytes) snapshot=423612416
13/12/09 23:10:11 INFO mapred.JobClient: Reduce output records=0
13/12/09 23:10:11 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3058089984
13/12/09 23:10:11 INFO mapred.JobClient: Map output records=0
13/12/09 23:10:11 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job2out/_SUCCESS
13/12/09 23:10:11 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory /path/job2out. Does it contain files in subdirectories that correspond to column family names?

Long Hadoop run, stuck at reduce > reduce

I have a Hadoop job that basically just aggregates over keys. Its code (the mapper is the identity mapper):
public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> results, Reporter reporter) throws IOException {
    String res = new String("");
    while (values.hasNext()) {
        res += values.next().toString();
    }
    Text outputValue = new Text("<all><id>" + key.toString() + "</id>" + res + "</all>");
    results.collect(key, outputValue);
}
It gets stuck at this point:
12/11/26 06:19:23 INFO mapred.JobClient: Running job: job_201210240845_0099
12/11/26 06:19:24 INFO mapred.JobClient: map 0% reduce 0%
12/11/26 06:19:37 INFO mapred.JobClient: map 20% reduce 0%
12/11/26 06:19:40 INFO mapred.JobClient: map 80% reduce 0%
12/11/26 06:19:41 INFO mapred.JobClient: map 100% reduce 0%
12/11/26 06:19:46 INFO mapred.JobClient: map 100% reduce 6%
12/11/26 06:19:55 INFO mapred.JobClient: map 100% reduce 66%
I ran it locally and saw this:
12/11/26 06:06:48 INFO mapred.LocalJobRunner:
12/11/26 06:06:48 INFO mapred.Merger: Merging 5 sorted segments
12/11/26 06:06:48 INFO mapred.Merger: Down to the last merge-pass, with 5 segments left of total size: 82159206 bytes
12/11/26 06:06:48 INFO mapred.LocalJobRunner:
12/11/26 06:06:54 INFO mapred.LocalJobRunner: reduce > reduce
12/11/26 06:06:55 INFO mapred.JobClient: map 100% reduce 66%
12/11/26 06:06:57 INFO mapred.LocalJobRunner: reduce > reduce
12/11/26 06:07:00 INFO mapred.LocalJobRunner: reduce > reduce
12/11/26 06:07:03 INFO mapred.LocalJobRunner: reduce > reduce
...
a lot of reduce > reduce ...
...
In the end, it finished the work. I want to ask:
1) What does it do in this reduce > reduce stage?
2) How can I improve this?
Looking at the percentages: 0-33% is shuffle, 34-65% is sort, and 66-100% is the actual reduce function.
Everything looks fine in your code, but I'll take a stab in the dark:
You are creating and re-creating the string res over and over. Every time you get a new value, Java creates a new String object, then creates yet another String object to hold the concatenation. As you can see, this can get out of hand as the string grows. Try using a StringBuffer instead. Edit: StringBuilder is better than StringBuffer.
Whether or not this is the problem, you should make this change to improve performance.
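For reference, the reducer from the question rewritten with StringBuilder might look like this (a sketch; the output format is unchanged):

```java
public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> results, Reporter reporter) throws IOException {
    // Append into one mutable buffer instead of re-allocating an
    // ever-growing String on every concatenation (O(n) total instead of O(n^2)).
    StringBuilder res = new StringBuilder();
    while (values.hasNext()) {
        res.append(values.next().toString());
    }
    results.collect(key,
            new Text("<all><id>" + key.toString() + "</id>" + res + "</all>"));
}
```

StringBuilder is preferred over StringBuffer here because a reduce call is single-threaded, so StringBuffer's synchronization would be pure overhead.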
Using StringBuilder solved it. It improved the run time from 30 minutes to 30 seconds. I didn't think it would make such a difference. Thanks a lot.

Hadoop: Reduce-side join get stuck at map 100% reduce 100% and never finish

I'm a beginner with Hadoop. These days I'm trying to run a reduce-side join example, but it gets stuck at map 100% reduce 100% and never finishes. The progress, logs, code, sample data and configuration files are below:
Progress:
12/10/02 15:48:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/02 15:48:06 WARN snappy.LoadSnappy: Snappy native library not loaded
12/10/02 15:48:06 INFO mapred.FileInputFormat: Total input paths to process : 2
12/10/02 15:48:07 INFO mapred.JobClient: Running job: job_201210021515_0007
12/10/02 15:48:08 INFO mapred.JobClient: map 0% reduce 0%
12/10/02 15:48:26 INFO mapred.JobClient: map 66% reduce 0%
12/10/02 15:48:35 INFO mapred.JobClient: map 100% reduce 0%
12/10/02 15:48:38 INFO mapred.JobClient: map 100% reduce 22%
12/10/02 15:48:47 INFO mapred.JobClient: map 100% reduce 100%
Logs from Reduce task:
2012-10-02 15:48:28,018 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1f53935
2012-10-02 15:48:28,179 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=668126400, MaxSingleShuffleLimit=167031600
2012-10-02 15:48:28,202 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Thread started: Thread for merging on-disk files
2012-10-02 15:48:28,202 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Thread started: Thread for merging in memory files
2012-10-02 15:48:28,203 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-10-02 15:48:28,207 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-10-02 15:48:28,207 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Need another 3 map output(s) where 0 is already in progress
2012-10-02 15:48:28,208 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-10-02 15:48:33,209 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-10-02 15:48:33,596 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-10-02 15:48:38,606 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201210021515_0007_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2012-10-02 15:48:39,239 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
2012-10-02 15:48:39,239 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
2012-10-02 15:48:39,241 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
2012-10-02 15:48:39,242 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 0 files left.
2012-10-02 15:48:39,242 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 3 files left.
2012-10-02 15:48:39,285 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted segments
2012-10-02 15:48:39,285 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 10500 bytes
2012-10-02 15:48:39,314 INFO org.apache.hadoop.mapred.ReduceTask: Merged 3 segments, 10500 bytes to disk to satisfy reduce memory limit
2012-10-02 15:48:39,318 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1 files, 10500 bytes from disk
2012-10-02 15:48:39,319 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
2012-10-02 15:48:39,320 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
2012-10-02 15:48:39,322 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 10496 bytes
Java Code:
public class DataJoin extends Configured implements Tool {
public static class MapClass extends DataJoinMapperBase {
protected Text generateInputTag(String inputFile) {//specify tag
String datasource = inputFile.split("-")[0];
return new Text(datasource);
}
protected Text generateGroupKey(TaggedMapOutput aRecord) {//takes a tagged record (of type TaggedMapOutput)and returns the group key for joining
String line = ((Text) aRecord.getData()).toString();
String[] tokens = line.split(",", 2);
String groupKey = tokens[0];
return new Text(groupKey);
}
protected TaggedMapOutput generateTaggedMapOutput(Object value) {//wraps the record value into a TaggedMapOutput type
TaggedWritable retv = new TaggedWritable((Text) value);
retv.setTag(this.inputTag);//inputTag: result of generateInputTag
return retv;
}
}
public static class Reduce extends DataJoinReducerBase {
protected TaggedMapOutput combine(Object[] tags, Object[] values) {//combination of the cross product of the tagged records with the same join (group) key
if (tags.length != 2) return null;
String joinedStr = "";
for (int i=0; i<tags.length; i++) {
if (i > 0) joinedStr += ",";
TaggedWritable tw = (TaggedWritable) values[i];
String line = ((Text) tw.getData()).toString();
if (line == null)
return null;
String[] tokens = line.split(",", 2);
joinedStr += tokens[1];
}
TaggedWritable retv = new TaggedWritable(new Text(joinedStr));
retv.setTag((Text) tags[0]);
return retv;
}
}
public static class TaggedWritable extends TaggedMapOutput {//tagged record
private Writable data;
public TaggedWritable() {
this.tag = new Text("");
this.data = null;
}
public TaggedWritable(Writable data) {
this.tag = new Text("");
this.data = data;
}
public Writable getData() {
return data;
}
@Override
public void write(DataOutput out) throws IOException {
this.tag.write(out);
out.writeUTF(this.data.getClass().getName());
this.data.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
this.tag.readFields(in);
String dataClz = in.readUTF();
if ((this.data == null) || !this.data.getClass().getName().equals(dataClz)) {
try {
this.data = (Writable) ReflectionUtils.newInstance(Class.forName(dataClz), null);
System.out.printf(dataClz);
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
this.data.readFields(in);
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
JobConf job = new JobConf(conf, DataJoin.class);
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("DataJoin");
job.setMapperClass(MapClass.class);
job.setReducerClass(Reduce.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(TaggedWritable.class);
job.set("mapred.textoutputformat.separator", ",");
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(),
new DataJoin(),
args);
System.exit(res);
}
}
Sample data:
file 1: apat.txt(1 line) 4373932,1983,8446,1981,"NL","",16025,2,65,436,1,19,108,49,1,0.5289,0.6516,9.8571,4.1481,0.0109,0.0093,0,0
file 2: cite.txt(100 lines)
4373932,3641235
4373932,3720760
4373932,3853987
4373932,3900558
4373932,3939350
4373932,3941876
4373932,3992631
4373932,3996345
4373932,3998943
4373932,3999948
4373932,4001400
4373932,4011219
4373932,4025310
4373932,4036946
4373932,4058732
4373932,4104029
4373932,4108972
4373932,4160016
4373932,4160018
4373932,4160019
4373932,4160818
4373932,4161515
4373932,4163779
4373932,4168146
4373932,4169137
4373932,4181650
4373932,4187075
4373932,4197361
4373932,4199599
4373932,4200436
4373932,4201763
4373932,4207075
4373932,4208479
4373932,4211766
4373932,4215102
4373932,4220450
4373932,4222744
4373932,4225783
4373932,4231750
4373932,4234563
4373932,4235869
4373932,4238195
4373932,4238395
4373932,4248854
4373932,4251514
4373932,4258130
4373932,4248965
4373932,4252783
4373932,4254097
4373932,4259313
4373932,4272505
4373932,4272506
4373932,4277437
4373932,4279992
4373932,4283382
4373932,4294817
4373932,4296201
4373932,4297273
4373932,4298687
4373932,4302534
4373932,4314026
4373932,4318707
4373932,4318846
4373932,3773625
4373932,3935074
4373932,3951748
4373932,3992516
4373932,3996344
4373932,3997657
4373932,4011308
4373932,4016250
4373932,4018884
4373932,4056724
4373932,4067959
4373932,4069352
4373932,4097586
4373932,4098876
4373932,4130462
4373932,4152411
4373932,4153675
4373932,4174384
4373932,4222743
4373932,4254096
4373932,4256834
4373932,4284412
4373932,4323647
4373932,3985867
4373932,4166105
4373932,4278653
4373932,4194877
4373932,4202815
4373932,4286959
4373932,4302536
4373932,4020151
4373932,4115535
4373932,4152412
4373932,4177253
4373932,4223002
4373932,4225485
4373932,4261968
Configurations:
core-site.xml
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
mapred-site.xml
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
hdfs-site.xml
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
I've googled for an answer and made some changes to the code and to the (mapred/core/hdfs)-site.xml configuration files, but I'm lost. I run this code in pseudo-distributed mode. The join key is the same in both files. If I trim cite.txt to 99 lines or fewer, it runs fine, but from 100 lines up it gets stuck as the logs above show. Please help me figure out the problem. I appreciate your explanation.
Best regards,
HaiLong
Please check your Reduce class.
I faced a similar problem, which turned out to be a very silly mistake. Maybe this will help you solve the issue:
while (values.hasNext()) {
String val = values.next().toString();
.....
}
You need to add the .next() call: without values.next() in the loop body, the iterator never advances, hasNext() stays true forever, and the reducer spins at 100% without ever finishing.
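A plain-Java sketch of the pitfall (no Hadoop dependencies; consume is a hypothetical stand-in for a reducer body iterating over its values):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IteratorDemo {
    // The buggy version tested hasNext() in the loop condition but never
    // called next() in the body, so the condition stayed true forever.
    static int consume(Iterator<String> values) {
        int count = 0;
        while (values.hasNext()) {
            String val = values.next().toString(); // next() advances the iterator
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> vals = Arrays.asList("x", "y", "z");
        System.out.println(consume(vals.iterator())); // prints 3 and terminates
    }
}
```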

Getting error while implementing a simple sorting program in Mapreduce with zero reduce nodes

I tried implementing a sorting program in MapReduce such that I have just the sorted output after the map phase, with the sorting done internally by the Hadoop framework. For this, I tried to set the number of reduce tasks to zero, as no reduction was required. But when I tried executing the program, I kept getting a checksum error. I am not able to figure out what to do next. It's surely possible to run the program on my netbook, because the sorting works fine when I set the number of reduce tasks to one. Please help!
For your reference, here's the entire code that I have written to perform the sorting:
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
/**
*
* @author root
*/
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.io.*;
import java.util.*;
import java.io.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.*;
import org.apache.hadoop.conf.*;
public class word extends Configured implements Tool
{
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
private static IntWritable one=new IntWritable(1);
private Text word=new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter report) throws IOException
{
String line=value.toString();
StringTokenizer token=new StringTokenizer(line," .,?!");
String wordToken=null;
while(token.hasMoreTokens())
{
wordToken=token.nextToken();
output.collect(new Text(wordToken), one);
}
}
}
public int run(String args[])throws Exception
{
//Configuration conf=getConf();
JobConf job=new JobConf(word.class);
job.setInputFormat(TextInputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setOutputFormat(TextOutputFormat.class);
job.setMapperClass(Map.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
JobClient.runJob(job);
return 0;
}
public static void main(String args[])throws Exception
{
int exitCode=ToolRunner.run(new word(), args);
System.exit(exitCode);
}
}
Here is the checksum error I got on executing this program:
12/03/25 10:26:42 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
12/03/25 10:26:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/03/25 10:26:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/25 10:26:44 INFO mapred.FileInputFormat: Total input paths to process : 1
12/03/25 10:26:45 INFO mapred.JobClient: Running job: job_local_0001
12/03/25 10:26:45 INFO mapred.FileInputFormat: Total input paths to process : 1
12/03/25 10:26:45 INFO mapred.MapTask: numReduceTasks: 0
12/03/25 10:26:45 INFO fs.FSInputChecker: Found checksum error: b[0, 26]=610a630a620a640a650a740a790a780a730a670a7a0a680a730a
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/NetBeansProjects/projectAll/output/regionMulti/individual/part-00000 at 0
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
12/03/25 10:26:45 WARN mapred.LocalJobRunner: job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/NetBeansProjects/projectAll/output/regionMulti/individual/part-00000 at 0
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
12/03/25 10:26:46 INFO mapred.JobClient: map 0% reduce 0%
12/03/25 10:26:46 INFO mapred.JobClient: Job complete: job_local_0001
12/03/25 10:26:46 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at sortLog.run(sortLog.java:59)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at sortLog.main(sortLog.java:66)
Java Result: 1
BUILD SUCCESSFUL (total time: 4 seconds)
So have a look at org.apache.hadoop.mapred.MapTask, around line 600 in 0.20.2:
// get an output object
if (job.getNumReduceTasks() == 0) {
output =
new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
} else {
output = new NewOutputCollector(taskContext, job, umbilical, reporter);
}
If you set the number of reduce tasks to zero, the map output is written directly to the final output via NewDirectOutputCollector. Only the NewOutputCollector path uses the so-called MapOutputBuffer, which does the spilling, sorting, combining and partitioning.
So when you set no reducer, no sort takes place, even though Tom White suggests otherwise in The Definitive Guide.
I have faced the same problem (the checksum error concerning file part-00000 at 0). I solved it by renaming the file to any name other than part-00000.
So if you need at least one reducer to make the internal sorting happen, then you can use the IdentityReducer.
You may also want to see this discussion:
hadoop: difference between 0 reducer and identity reducer?
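A sketch of that suggestion against the old 0.20 mapred API (a configuration fragment, not compiled here; class names taken from the question):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Keep the framework's sort/merge by running one pass-through reducer.
// With setNumReduceTasks(0) the map output bypasses the MapOutputBuffer
// entirely, so a single IdentityReducer is the cheapest way to get
// sorted output.
JobConf job = new JobConf(word.class);
job.setReducerClass(IdentityReducer.class); // emits each (key, value) unchanged
job.setNumReduceTasks(1);
```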

Resources