Duplicate "keys" in map-reduce output? - hadoop

As we all know, either this
public static class SReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException
{
StringBuilder sb = new StringBuilder();
while (values.hasNext())
{
sb.append(values.next().toString());
}
output.collect(key, new Text(sb.toString()));
}
}
or
public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException
{
boolean start = true;
StringBuilder sb = new StringBuilder();
while (values.hasNext())
{
if(!start)
{
sb.append(" ");
}
start = false;
sb.append(values.next().toString());
}
output.collect(key, new Text(sb.toString()));
}
}
this is the kind of reducer function we use to eliminate duplicate "values" in the output. But what should I do to eliminate duplicate "keys"? Any ideas?
Thanks.
PS: more info: in my <key, values> pairs, the keys contain links and the values contain words. In my output each word occurs only once, but I get many duplicate links.

In the Reducer, there will be one call to reduce() for each unique key that the Reducer receives. It will receive all values for that key. But if you only care about the keys, and only care about unique keys, well, just ignore the values entirely. You will get exactly one reduce() per key; do whatever you want with that (non-duplicated) key.
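For instance, a minimal sketch in the same old-API style as the snippets above (the class name and the empty placeholder value are mine): since reduce() fires exactly once per unique key, writing the key once here is enough to get de-duplicated links.
public static class DedupKeyReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text>
{
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException
    {
        // reduce() runs once per unique key (link), so each link is emitted exactly once;
        // the values (words) are ignored entirely
        output.collect(key, new Text(""));
    }
}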

Related

What exactly is output of mapper and reducer function

This is a follow-up question to Extracting rows containing specific value using mapReduce and hadoop
Mapper function
public static class MapForWordCount extends Mapper<Object, Text, Text, IntWritable>{
private IntWritable saleValue = new IntWritable();
private Text rangeValue = new Text();
public void map(Object key, Text value, Context con) throws IOException, InterruptedException
{
String line = value.toString();
String[] words = line.split(",");
for(String word: words )
{
if(words[3].equals("40")){
saleValue.set(Integer.parseInt(words[0]));
rangeValue.set(words[3]);
con.write( rangeValue , saleValue );
}
}
}
}
Reducer function
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable>
{
private IntWritable result = new IntWritable();
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException
{
for(IntWritable value : values)
{
result.set(value.get());
con.write(word, result);
}
}
}
Output obtained is
40 105
40 105
40 105
40 105
EDIT 1 :
But the Expected output is
40 102
40 104
40 105
What am I doing wrong ?
What exactly is happening here in mapper and reducer function ?
In the context of the original question - you don't need the loop, neither in the mapper nor in the reducer, as it is what duplicates the entries:
public static class MapForWordCount extends Mapper<Object, Text, Text, IntWritable>{
private IntWritable saleValue = new IntWritable();
private Text rangeValue = new Text();
public void map(Object key, Text value, Context con) throws IOException, InterruptedException
{
String line = value.toString();
String[] words = line.split(",");
if(words[3].equals("40")){
saleValue.set(Integer.parseInt(words[0]));
rangeValue.set(words[3]);
con.write(rangeValue , saleValue );
}
}
}
And in the reducer, as suggested by @Serhiy in the original question, you need only one line of code:
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable>
{
private IntWritable result = new IntWritable();
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException
{
con.write(word, null);
}
}
Regarding "Edit 1" - I will leave it as a trivial exercise :)
What exactly is happening
You are consuming lines of comma-delimited text, splitting on the commas, and filtering out some values. con.write() should only be called once per line if all you are doing is extracting those values.
The framework will group all the "40" keys that the mapper outputs and form a list of all the values that were written with that key, and that is what the reducer iterates over.
You should probably try this for your map function.
// Set the values to write
saleValue.set(Integer.parseInt(words[0]));
rangeValue.set(words[3]);
// Filter out only the 40s
if(words[3].equals("40")) {
// Write out "(40, saleValue)" words.length times
for(String word: words )
{
con.write( rangeValue , saleValue );
}
}
If you don't want duplicate values for the length of the split string, then get rid of the for loop.
All your reducer is doing is just printing out what it received from the mapper.
Mapper output would be something like this :
<word,count>
Reducer output would be like this :
<unique word, its total count>
Eg: A line is read and all words in it are counted and put in a <key,value> pair:
<40,1>
<140,1>
<50,1>
<40,1> ..
here 40,50,140, .. are all keys and the value is the count of number of occurrences of that key in a line. This happens in the mapper.
Then, these key-value pairs are sent to the reducer, where similar keys are all reduced to a single key and all the values associated with that key are summed to give the value of the key-value pair. So, the result of the reducer would be something like:
<40,10>
<50,5>
...
In your case, the reducer isn't doing anything. The unique values/words found by the mapper are just given out as the output.
Ideally, you are supposed to reduce & get an output like : "40,150" was found 5 times on the same line.
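For reference, a minimal sketch (the class name is illustrative) of the word-count style reducer described above, in the same new-API style as the question's code; the framework delivers something like <40, [1, 1, 1, 1]> and the reducer collapses it into a single <40, total> pair:
public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException
    {
        int sum = 0;
        // the framework has already grouped all values for this key;
        // summing them gives one <key, total> pair instead of repeated lines
        for (IntWritable value : values)
        {
            sum += value.get();
        }
        result.set(sum);
        con.write(key, result);
    }
}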

How to effectively reduce the length of input to mapper

My data has 20 fields in the schema. Only the first three fields are important to me as far as my MapReduce program is concerned. How can I decrease the size of the input to the mapper so that only the first three fields are received?
1,2,3,4,5,6,7,8...20 columns in schema.
I want only 1,2,3 in the mapper, to process as offset and value.
NOTE: I can't use Pig, as some other MapReduce logic is already implemented in MapReduce.
You need a custom RecordReader to do this :
public class TrimmedRecordReader implements RecordReader<LongWritable, Text> {
private LineRecordReader lineReader;
private LongWritable lineKey;
private Text lineValue;
public TrimmedRecordReader(JobConf job, FileSplit split) throws IOException {
lineReader = new LineRecordReader(job, split);
lineKey = lineReader.createKey();
lineValue = lineReader.createValue();
}
public boolean next(LongWritable key, Text value) throws IOException {
if (!lineReader.next(lineKey, lineValue)) {
return false;
}
String[] fields = lineValue.toString().split(",");
if (fields.length < 3) {
throw new IOException("Invalid record received");
}
key.set(lineKey.get());
value.set(fields[0] + "," + fields[1] + "," + fields[2]);
return true;
}
public LongWritable createKey() {
return lineReader.createKey();
}
public Text createValue() {
return lineReader.createValue();
}
public long getPos() throws IOException {
return lineReader.getPos();
}
public void close() throws IOException {
lineReader.close();
}
public float getProgress() throws IOException {
return lineReader.getProgress();
}
}
It should be pretty self-explanatory; it is just a wrapper around LineRecordReader.
Unfortunately, to invoke it you need to extend the InputFormat too. The following is enough:
public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {
public RecordReader<LongWritable, Text> getRecordReader(InputSplit input,
JobConf job, Reporter reporter) throws IOException {
reporter.setStatus(input.toString());
return new TrimmedRecordReader(job, (FileSplit) input);
}
}
Just don't forget to set it in the driver.
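For example, a sketch only, assuming the old mapred API used above and a JobConf-based driver (the driver class name and argument indices are illustrative):
JobConf conf = new JobConf(MyDriver.class);              // MyDriver is a placeholder
conf.setInputFormat(TrimmedTextInputFormat.class);       // plug in the trimming input format
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
// ... set the mapper, reducer and output key/value classes as usual ...
JobClient.runJob(conf);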
You can also implement a custom input format in MapReduce to read the required fields alone.
Just for reference, the following blog post explains how to read text as paragraphs:
http://blog.minjar.com/post/54759039969/mapreduce-custom-input-formats-reading

Read values wrapped in Hadoop ArrayWritable

I am new to Hadoop and Java. My mapper outputs Text and ArrayWritable. I am having trouble reading the ArrayWritable values: I am unable to cast the .get() values to int. The mapper and reducer code are attached. Can someone please help me correct my reducer code so that it reads the ArrayWritable values?
public static class Temp2Mapper extends Mapper<LongWritable, Text, Text, ArrayWritable>{
private static final int MISSING=9999;
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String line = value.toString();
String date = line.substring(07,14);
int maxTemp,minTemp,avgTemp;
IntArrayWritable carrier = new IntArrayWritable();
IntWritable innercarrier[] = new IntWritable[3];
maxTemp=Integer.parseInt(line.substring(39,45));
minTemp=Integer.parseInt(line.substring(47,53));
avgTemp=Integer.parseInt(line.substring(63,69));
if (maxTemp!= MISSING)
innercarrier[0]=new IntWritable(maxTemp); // maximum Temperature
if (minTemp!= MISSING)
innercarrier[1]=new IntWritable(minTemp); //minimum temperature
if (avgTemp!= MISSING)
innercarrier[2]=new IntWritable(avgTemp); // average temperature of 24 hours
carrier.set(innercarrier);
context.write(new Text(date), carrier); // Output Text and ArrayWritable
}
}
public static class Temp2Reducer
extends Reducer<Text, ArrayWritable, Text, IntWritable>{
@Override
public void reduce(Text key, Iterable<ArrayWritable> values, Context context)
throws IOException, InterruptedException {
int max = Integer.MIN_VALUE;
int[] arr= new int[3];
for (ArrayWritable val : values) {
arr = (Int) val.get(); // Error: cannot cast Writable to int
max = Math.max(max, arr[0]);
}
context.write( key, new IntWritable(max) );
}
}
The ArrayWritable#get method returns an array of Writable.
You can't cast an array of Writable to int. What you can do is:
iterate over this array,
cast each item of the array (which will be of type Writable) to IntWritable,
use the IntWritable#get method to get the int value.
for (ArrayWritable val: values) {
for (Writable writable: val.get()) { // iterate
IntWritable intWritable = (IntWritable)writable; // cast
int value = intWritable.get(); // get
// do your thing with int value
}
}
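Put together, the poster's reducer could look roughly like this (a sketch only: it fixes the cast as described, assumes slot 0 of each array holds the maximum temperature as in the mapper, and skips slots left null when a reading was MISSING):
public static class Temp2Reducer extends Reducer<Text, ArrayWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<ArrayWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (ArrayWritable val : values) {
            Writable[] arr = val.get();                  // array of Writable
            if (arr.length > 0 && arr[0] != null) {      // slot 0 = maximum temperature
                max = Math.max(max, ((IntWritable) arr[0]).get());  // cast, then get
            }
        }
        context.write(key, new IntWritable(max));
    }
}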

How can I emit key values at the end of the whole file processing?

The mapper reads lines from a file... How can I emit key values at the end, after the whole file has been scanned, and not per line?
Using the new mapreduce API, you can override the Mapper.cleanup(Context) method and use Context.write(K, V) as you normally would in the map method.
@Override
protected void cleanup(Context context) {
context.write(new Text("key"), new Text("value"));
}
With the old mapred API you can override the close() method - but you'll need to store a reference to the OutputCollector passed to the map method:
private OutputCollector cachedCollector = null;
public void map(LongWritable key, Text value, OutputCollector outputCollector, Reporter reporter) {
if (cachedCollector == null) {
cachedCollector = outputCollector;
}
// ...
}
public void close() {
cachedCollector.collect(outputKey, outputValue);
}
Do you have one key value for the whole file, or multiple?
If it is case #1:
Use WholeFileInputFormat. You will receive the full file content as a single record. You can split this into records, process them all, and emit the final key/value at the end of your processing.
Case #2:
Use the same FileInputFormat. Store all key values in temporary storage. At the end, go through your temporary storage and emit whatever key/values you want, suppressing the ones you don't.
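A rough sketch of that second case under the new API (imports omitted; the class name and buffering logic are purely illustrative, and it assumes everything buffered fits in the mapper's memory):
public static class BufferingMapper extends Mapper<LongWritable, Text, Text, Text> {
    // temporary in-memory storage; nothing is written during map()
    private final Map<String, String> buffered = new HashMap<String, String>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // placeholder logic: decide what to keep while scanning the file
        buffered.put(value.toString(), "seen");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit (or suppress) whatever was collected, once the whole split has been read
        for (Map.Entry<String, String> e : buffered.entrySet()) {
            context.write(new Text(e.getKey()), new Text(e.getValue()));
        }
    }
}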
Another alternative to Chris's answer could be to override the run() method of the Mapper class (new API):
public static class Map extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {
//map method here
// Override the run()
@Override
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
// Have your last <key,value> emitted here
context.write(lastOutputKey, lastOutputValue);
cleanup(context);
}
}
And in order to ensure that each mapper gets one file to process, you have to create your own version of FileInputFormat and override isSplitable(), like this:
class NonSplittableFileInputFormat extends TextInputFormat {
    // extending the new-API TextInputFormat so the record reader is inherited
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}

"Type mismatch in key from map: expected org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.LongWritable" -Every thing looks right

I am trying to write a simple map reduce program to find the largest prime number using the new API (0.20.2). This is how my Map and Reduce classes look…
public class PrimeNumberMap extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
public void map (LongWritable key, Text Kvalue,Context context) throws IOException,InterruptedException
{
Integer value = new Integer(Kvalue.toString());
if(isNumberPrime(value))
{
context.write(new IntWritable(value), new IntWritable(new Integer(key.toString())));
}
}
boolean isNumberPrime(Integer number)
{
if (number == 1) return false;
if (number == 2) return true;
for (int counter = 2; counter <= number / 2; counter++)
{
if(number%counter ==0 )
return false;
}
return true;
}
}
public class PrimeNumberReduce extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
public void reduce ( IntWritable primeNo, Iterable<IntWritable> Values,Context context) throws IOException ,InterruptedException
{
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : Values)
{
maxValue= Math.max(maxValue, value.get());
}
//output.collect(primeNo, new IntWritable(maxValue));
context.write(primeNo, new IntWritable(maxValue)); }
}
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException{
if (args.length ==0)
{
System.err.println(" Usage:\n\tPrimenumber <input Directory> <output Directory>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(Main.class);
job.setJobName("Prime");
// Creating job configuration object
FileInputFormat.addInputPath(job, new Path (args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
String star ="*********************************************";
System.out.println(star+"\n Prime number computer \n"+star);
System.out.println(" Application started ... keeping fingers crossed :/ ");
System.exit(job.waitForCompletion(true)?0:1);
}
}
I am still getting an error regarding a key type mismatch in the map:
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1034)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:595)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:668)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1109)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-06-13 14:27:21,116 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
Can someone please suggest what is wrong? I have tried all hooks and crooks.
You've not configured the Mapper or Reducer classes in your main block, so the default Mapper is being used - known as the identity mapper - where each pair it receives as input is written out unchanged (hence the LongWritable as the output key):
job.setMapperClass(PrimeNumberMap.class);
job.setReducerClass(PrimeNumberReduce.class);
The mapper should be defined as below,
public class PrimeNumberMap extends Mapper<**IntWritable**, Text, IntWritable, IntWritable> {
instead of
public class PrimeNumberMap extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
As mentioned in the comment before, you should have the mapper and reducer defined:
job.setMapperClass(PrimeNumberMap.class);
job.setReducerClass(PrimeNumberReduce.class);
Please refer to Hadoop: The Definitive Guide, 3rd edition, Chapter 2, page 24.
I am new to Hadoop MapReduce programming.
When mapping I use IntWritable, but I reduce the values in IntWritable format and convert the result to double before writing it as a DoubleWritable in context.write().
It fails when running.
My way of handling the int in the map and the double in the reduce is:
Mapper(LongWritable, Text, Text, DoubleWritable)
Reducer(Text, DoubleWritable, Text, DoubleWritable)
job.setOutputValueClass(DoubleWritable.class)
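An alternative, if the map output value really needs to stay IntWritable while the reducer writes DoubleWritable, is to declare the two types separately in the driver; a minimal sketch of that configuration (assuming Text keys as above):
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);    // what the mapper emits
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);    // what the reducer emits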
