I want to compare two text files line by line to see whether they are equal or not. How can I do it using Hadoop MapReduce programming?
static int i = 0;
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
i++; // used as a line number
output.collect(new Text(line), new IntWritable(i));
}
I tried to map each line to its line number. But how can I reduce it and compare it with the other file?
Comparing two text files is equivalent to joining the two files in MapReduce programming. To join two text files you use two mappers that emit the same keys; in your case the key can be the line offset and the value the line itself. MultipleInputs.addInputPath() is used to attach a separate mapper to each input file.
Please find below a detailed program for comparing two text files in MapReduce using Java.
The arguments to the program are file 1, file 2, and the output directory.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class CompareTwoFiles {
public static class Map extends
Mapper<LongWritable, Text, LongWritable, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
public static class Map2 extends
Mapper<LongWritable, Text, LongWritable, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
public static class Reduce extends
Reducer<LongWritable, Text, LongWritable, Text> {
@Override
public void reduce(LongWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
String[] lines = new String[2];
int i = 0;
for (Text text : values) {
lines[i] = text.toString();
i++;
}
if (lines[0].equals(lines[1])) {
context.write(key, new Text("same"));
} else {
context.write(key,
new Text(lines[0] + " vs " + lines[1]));
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://localhost:8020");
Job job = new Job(conf);
job.setJarByClass(CompareTwoFiles.class);
job.setJobName("Compare Two Files and Identify the Difference");
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, new Path(args[0]),
TextInputFormat.class, Map.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
TextInputFormat.class, Map2.class);
job.waitForCompletion(true);
}
}
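For reference, a typical way to run the job (the jar name and paths here are placeholders, not from the original post) is:
hadoop jar comparetwofiles.jar CompareTwoFiles /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/compare_out
The output directory then contains a part-r-00000 file with one record per byte offset, marked either "same" or "<line from file 1> vs <line from file 2>". Keep in mind that TextInputFormat keys are byte offsets, not line numbers, so the two files only pair up line-for-line while all preceding lines have identical lengths.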
I have this in Main...
job.setMapperClass(AverageIntMapper.class);
job.setCombinerClass(AverageIntCombiner.class);
job.setReducerClass(AverageIntReducer.class);
The Combiner has different code, but it is being ignored entirely: the output the Reducer receives is the output straight from the Mapper.
I understand that a Combiner may not always run, but I thought that only applied when the Combiner is the same as the Reducer. I don't really see the point of being able to write a custom Combiner if the framework can simply skip it.
If that's not supposed to happen, what could be the reason the Combiner is not being used?
Code...
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class AverageInt {
public static class AverageIntMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String n_string = value.toString();
context.write(new Text("Value"), new Text(n_string));
}
}
public static class AverageIntCombiner extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
int count = 0;
for(IntWritable value : values) {
int temp = Integer.parseInt(value.toString());
sum += value.get();
count += 1;
}
String sum_count = Integer.toString(sum) + "," + Integer.toString(count);
context.write(key, new Text(sum_count));
}
}
public static class AverageIntReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
int total = 0;
int count = 0;
for(Text value : values) {
String temp = value.toString();
String[] split = temp.split(",");
total += Integer.parseInt(split[0]);
count += Integer.parseInt(split[1]);
}
Double average = (double)total/count;
context.write(key, new Text(average.toString()));
}
}
public static void main(String[] args) throws Exception {
if(args.length != 2) {
System.err.println("Usage: AverageInt <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(AverageInt.class);
job.setJobName("Average");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(AverageIntMapper.class);
job.setCombinerClass(AverageIntCombiner.class);
job.setReducerClass(AverageIntReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
If you look at what your mapper is emitting:
public void map(LongWritable key, Text value, Context context)
It's sending two Text objects, but while you've declared the combiner class itself correctly, its reduce method has:
public void reduce(Text key, Iterable<IntWritable> values, Context context)
It should be:
public void reduce(Text key, Iterable<Text> values, Context context)
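Because the value type differs, that method never overrides Reducer.reduce(), so the inherited identity implementation runs and the map output reaches the reducer unchanged. For completeness, a corrected combiner might look like the sketch below (adding @Override turns this kind of mismatch into a compile-time error):
public static class AverageIntCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        int count = 0;
        for (Text value : values) {
            // each map output value is a single number serialized as text
            sum += Integer.parseInt(value.toString().trim());
            count += 1;
        }
        // emit "sum,count" so the reducer can compute the average
        context.write(key, new Text(sum + "," + count));
    }
}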
I have written the program below.
I ran it without TotalOrderPartitioner and it ran fine, so I don't think there is any issue with the Mapper or Reducer classes as such.
But when I include the code for TotalOrderPartitioner, i.e. write the partition file and then put it in the DistributedCache, I get the following error. I'm really clueless how to go about it.
[train@sandbox TOTALORDERPARTITIONER]$ hadoop jar totalorderpart.jar average.AverageJob counties totpart
//counties is input directory and totpart is output directory
16/01/18 04:14:00 INFO input.FileInputFormat: Total input paths to process : 4
16/01/18 04:14:00 INFO partition.InputSampler: Using 6 samples
16/01/18 04:14:00 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/01/18 04:14:00 INFO compress.CodecPool: Got brand-new compressor [.deflate]
java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.Text
    at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.append(SequenceFile.java:1380)
    at org.apache.hadoop.mapreduce.lib.partition.InputSampler.writePartitionFile(InputSampler.java:340)
    at average.AverageJob.run(AverageJob.java:132)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at average.AverageJob.main(AverageJob.java:146)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
My code
package average;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.StringUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class AverageJob extends Configured implements Tool {
public enum Counters {MAP, COMBINE, REDUCE};
public static class AverageMapper extends Mapper<LongWritable, Text, Text, Text> {
private Text mapOutputKey = new Text();
private Text mapOutputValue = new Text();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = StringUtils.split(value.toString(), '\\', ',');
mapOutputKey.set(words[1].trim());
StringBuilder moValue = new StringBuilder();
moValue.append(words[9].trim()).append(",1");
mapOutputValue.set(moValue.toString());
context.write(mapOutputKey, mapOutputValue);
context.getCounter(Counters.MAP).increment(1);
}
}
public static class AverageCombiner extends Reducer<Text, Text, Text, Text> {
private Text combinerOutputValue = new Text();
@Override
protected void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int count=0;
long sum=0;
for(Text value: values)
{
String[] strValues = StringUtils.split(value.toString(), ',');
sum+= Long.parseLong(strValues[0]);
count+= Integer.parseInt(strValues[1]);
}
combinerOutputValue.set(sum + "," + count);
context.write(key, combinerOutputValue);
context.getCounter(Counters.COMBINE).increment(1);
}
}
public static class AverageReducer extends Reducer<Text, Text, Text, DoubleWritable> {
private DoubleWritable reduceOutputKey = new DoubleWritable();
@Override
protected void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int count=0;
double sum=0;
for(Text value: values)
{
String[] strValues = StringUtils.split(value.toString(), ',');
sum+= Double.parseDouble(strValues[0]);
count+= Integer.parseInt(strValues[1]);
}
reduceOutputKey.set(sum/count);
context.write(key, reduceOutputKey);
context.getCounter(Counters.REDUCE).increment(1);
}
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.setJarByClass(getClass());
Path in = new Path(args[0]);
Path out = new Path(args[1]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
job.setMapperClass(AverageMapper.class);
job.setCombinerClass(AverageCombiner.class);
job.setPartitionerClass(TotalOrderPartitioner.class);
job.setReducerClass(AverageReducer.class);
job.setNumReduceTasks(6);
InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<Text, Text>(0.2, 6, 5);
InputSampler.writePartitionFile(job, sampler);
String partitionFile = TotalOrderPartitioner.getPartitionFile(conf);
URI partitionUri = new URI(partitionFile + "#" + TotalOrderPartitioner.DEFAULT_PATH);
job.addCacheFile(partitionUri);
return job.waitForCompletion(true)?0:1;
}
public static void main(String[] args) {
int result=0;
try
{
result = ToolRunner.run(new Configuration(), new AverageJob(), args);
System.exit(result);
}
catch (Exception e)
{
e.printStackTrace();
}
}
}
TotalOrderPartitioner does not run its sampling on the output of the Mapper but on the input dataset. Your input format has LongWritable keys and Text values, yet you call RandomSampler claiming that your input has Text keys and Text values. This is the mismatch InputSampler finds when it runs, hence the message
wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.Text
Meaning that it expected Text keys (based on your parametrization) but found LongWritable instead.
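To see the constraint in code: InputSampler draws its samples from the job's InputFormat but writes the partition file using job.getMapOutputKeyClass(), so the two key types must agree. A minimal sketch of one way to satisfy that (the KeyValueTextInputFormat and the separator property are assumptions about how the input could be re-read, not part of the original job) is:
// Sketch only: re-read the input with Text keys so the sampled keys match
// the Text map-output keys that TotalOrderPartitioner will partition on.
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);

InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.2, 6, 5);
InputSampler.writePartitionFile(job, sampler);
The mapper would then have to be reworked to accept Text keys, and the sampled keys should reflect the same field the mapper actually emits; an alternative is a small preliminary job that writes a SequenceFile keyed by the map output key and sampling that file instead.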
I'm following the tutorial at http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html and this is my code
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.StringTokenizer;
import java.util.Iterator;
public class WordCount {
public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
private Text word = new Text();
private final IntWritable one = new IntWritable(1);
@Override
public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
String line = val.toString();
StringTokenizer tokenizer = new StringTokenizer(line.toLowerCase());
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> value, Context context) throws IOException, InterruptedException {
int sum = 0;
while (value.hasNext()) {
IntWritable val = (IntWritable) value.next();
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration config = new Configuration();
Job job = Job.getInstance(config, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(WordCountMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/user/Icarus/words.txt"));
FileOutputFormat.setOutputPath(job, new Path("/user/Icarus/words.out"));
job.waitForCompletion(true);
}
}
But when I run it, instead of calculating the word frequencies I get this:
bye 1
goodbye 1
hadoop 1
hadoop 1
hello 1
hello 1
hello 1
world 1
I must have missed something very trivial, but I can't figure out what. Help please.
The root cause of this problem is that you are not defining reduce() with the exact signature Hadoop requires. The signature should be as below (see the Reducer documentation):
protected void reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException
Since your reduce() does not match that signature, Hadoop falls back to the default identity implementation in the Reducer base class, which writes its input out unchanged.
That is why you are getting the map output as the reduce output.
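For reference, the inherited default is essentially the following (paraphrasing the Reducer base class), which is why the reduce output equals the map output:
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
        throws IOException, InterruptedException {
    // default behaviour: pass every (key, value) pair through unchanged
    for (VALUEIN value : values) {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }
}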
For this problem, I can suggest two solutions.
First: try the code below.
public static class WordCountReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
Second: the second solution is quite simple.
Instead of defining your own reducer class, set the reducer to one of the library sum reducers, which do the same thing as the code above. Since your mapper emits IntWritable counts, IntSumReducer is the one that matches (LongSumReducer would require the mapper to emit LongWritable values). So drop the WordCountReducer class and add:
job.setReducerClass(IntSumReducer.class);
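IntSumReducer (and LongSumReducer) live in org.apache.hadoop.mapreduce.lib.reduce, so the matching import is needed as well:
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;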
Hope it helps!
The output I am expecting is the count of every word in the input file. But my output is the whole input file, as it is.
I am using extends Mapper<LongWritable, Text, Text, IntWritable> for mapper class and Reducer<Text, IntWritable, Text, IntWritable> for reducer class.
Here is my code
driver.java
public class driver extends Configured implements Tool{
public int run(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setMapperClass(mapper.class);
job.setReducerClass(reducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
//JobClient.runJob((JobConf) conf);
//System.exit(job.waitForCompletion(true) ? 0 : 1);
return 0;
}
public static void main(String[] args) throws Exception
{
long start = System.currentTimeMillis();
//int res = ToolRunner.run(new Configuration(), new driver(),args);
int res = ToolRunner.run(new Configuration(), new driver(),args);
long stop = System.currentTimeMillis();
System.out.println ("Time: " + (stop-start));
System.exit(res);
}
}
mapper.java
public class mapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
//hadoop supported data types
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
//map method that performs the tokenizer job and framing the initial key value pairs
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
reducer.java
public class reducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
//reduce method accepts the Key Value pairs from mappers, do the aggregation based on keys and produce the final out put
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
{
int sum = 0;
while (values.hasNext())
{
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
You are mixing up the new and old MapReduce APIs. I think you tried to write the WordCount program against the new API but took snippets from the old one (an old blog post, perhaps). You can find the problem yourself if you just add the @Override annotation to both the map and reduce methods.
See how their signatures changed after the API evolution: the new-API map() takes a Context instead of OutputCollector and Reporter, and the new-API reduce() takes an Iterable<IntWritable> and a Context instead of an Iterator, OutputCollector, and Reporter.
You just wrote two new methods with the older signatures, so they don't override anything and are never called. Your job effectively does nothing with them, because the methods that actually run are the defaults in Mapper and Reducer, which are identity operations: they write each input key/value pair straight to the output.
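For reference, the new-API methods that would actually be overridden look like this (a sketch with the bodies elided; note Iterable instead of Iterator, and a single Context in place of OutputCollector and Reporter):
public static class mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // tokenize the line and context.write(word, one), as in your current body
    }
}

public static class reducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // sum the values and context.write(key, new IntWritable(sum))
    }
}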
Anyway, you should follow basic coding conventions (for example, class names such as driver, mapper, and reducer should start with a capital letter).
Try this:
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WordCount {
public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
System.out.println(line);
while (tokenizer.hasMoreTokens()) {
value.set(tokenizer.nextToken());
output.collect(value, new IntWritable(1));
}
}
}
public static class Reduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception,IOException {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("WordCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("/home/user17/test.txt"));
FileOutputFormat.setOutputPath(conf, new Path("hdfs://localhost:9000/out2"));
JobClient.runJob(conf);
}
}
Build the jar and execute the command below on the command line:
hadoop jar WordCount.jar WordCount /inputfile /outputfile
Please run this code if you are facing problems with yours. It contains the mapper, the reducer, and the main function.
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()){
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
After that, create a jar of this code, say wordcount.jar, saved in your home directory (/home/user/wordcount.jar), and run the following command:
hadoop jar wordcount.jar WordCount /inputfile /outputfile
This will create the output directory /outputfile under the root directory of HDFS. View your result with
hadoop dfs -cat /outputfile/part-00000
This will successfully run your wordcount program.
I have been trying to execute some code that would list only the words that exist in multiple files. So far I have taken the WordCount example and, thanks to Chris White, I managed to compile it. I tried reading here and there to get the code to work, but all I am getting is a blank page with no data. The mapper is supposed to collect each word with its corresponding location; the reducer is supposed to collect the common words. Any thoughts as to what might be the problem? The code is:
package org.myorg;
import java.io.IOException;
import java.util.*;
import java.lang.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private Text outvalue=new Text();
private String filename = null;
public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException
{
if (filename == null)
{
filename = ((FileSplit) reporter.getInputSplit()).getPath().getName();
}
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
word.set(tokenizer.nextToken());
outvalue.set(filename);
output.collect(word, outvalue);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
private Text src = new Text();
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException
{
int sum = 0;
//List<Text> list = new ArrayList<Text>();
while (values.hasNext()) // I believe this would have all locations of the same word in different files?
{
sum += values.next().get();
src =values.next().get();
}
output.collect(key, src);
//while(values.hasNext())
//{
//Text value = values.next();
//list.add(new Text(value));
//System.out.println(value.toString());
//}
//System.out.println(values.toString());
//for(Text value : list)
//{
//System.out.println(value.toString());
//}
}
}
public static void main(String[] args) throws Exception
{
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setInputFormat(KeyValueTextInputFormat.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
//conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Am I missing anything?
Much obliged...
My Hadoop version: 0.20.203
First of all, it seems you're using the old Hadoop API (mapred); a word of advice would be to use the new Hadoop API (mapreduce), which is compatible with 0.20.203.
In the new API, here is a wordcount that will work
import java.io.IOException;
import java.lang.InterruptedException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
/**
* The map class of WordCount.
*/
public static class TokenCounterMapper
extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
/**
* The reducer class of WordCount
*/
public static class TokenCounterReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
/**
* The main entry point.
*/
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "Example Hadoop 0.20.1 WordCount");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenCounterMapper.class);
job.setReducerClass(TokenCounterReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Then, we build this file and pack the result into a jar file:
mkdir classes
javac -classpath /path/to/hadoop-0.20.203/hadoop-0.20.203-core.jar:/path/to/hadoop-0.20.203/lib/commons-cli-1.2.jar -d classes WordCount.java && jar -cvf wordcount.jar -C classes/ .
Finally, we run the jar file in standalone mode of Hadoop
echo "hello world bye world" > /tmp/in/0.txt
echo "hello hadoop goodebye hadoop" > /tmp/in/1.txt
hadoop jar wordcount.jar org.packagename.WordCount /tmp/in /tmp/out
In the reducer, maintain a set of the values observed (the filenames emitted in the mapper). If, after you have consumed all the values, the size of that set is 1, then the word is used in only one file.
public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
private TreeSet<Text> files = new TreeSet<Text>();
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException
{
files.clear();
while (values.hasNext())
{
Text file = values.next();
if (!files.contains(file))
{
// make a copy of the value, as Hadoop re-uses the Text object
files.add(new Text(file));
}
}
if (files.size() == 1) {
output.collect(key, files.first());
}
files.clear();
}
}