custom partitioner to send single key to multiple reducers? - hadoop

If I have only one key, can I avoid it being sent to a single reducer and instead distribute it across multiple reducers?
I understand that I might then need a second MapReduce job to combine the reducer outputs.
Is this a good approach? Or please let me know if there is a better way.

I was in a similar situation once. What I did is something like this:

private final int numberOfReduceCalls = 5;
private final IntWritable outKey = new IntWritable();
private final Random random = new Random();

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // use a random integer within a limit, so the single key is spread
    // over numberOfReduceCalls reducers
    outKey.set(random.nextInt(numberOfReduceCalls));
    context.write(outKey, value);
}
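A custom Partitioner gets you the same effect while keeping the original keys, which is closer to what the title asks for. A minimal sketch, not from the original answer; the class name and the hot-key literal are hypothetical:

import java.util.Random;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HotKeyPartitioner extends Partitioner<Text, Text> {

    private final Random random = new Random();

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // hypothetical hot key: scatter it over all reducers; every other key
        // falls through to the usual hash-based routing
        if ("hotKey".equals(key.toString())) {
            return random.nextInt(numPartitions);
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Either way the partial results for the hot key end up in several output files, so a small second job (or a client-side merge) is still needed to combine them, as the question anticipates.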

Related

the last reducer is very slow in MapReduce

The last reduce task is very slow; the other reduce tasks finish quickly.
The numbers of my map and reduce tasks are as follows: 18784 map tasks and 1500 reduce tasks.
The average time for each reduce task is about 1 minute 26 seconds, but the last reduce task takes about 2 hours.
I have tried changing the number of reduce tasks and reducing the size of the job, but nothing changed.
As for my partitioner:
public int getPartition(Object key, Object value, int numPartitions) {
    String keyStr = key.toString();
    // hash the string form of the key's hash code, then fold into a partition id
    int partId = String.valueOf(keyStr.hashCode()).hashCode();
    partId = Math.abs(partId % numPartitions);
    partId = Math.max(partId, 0);
    return partId;
    // return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}
I had a similar experience. In my case it was because a single reducer was processing all the data, which happens when the data is skewed. Compare the counters of the reducers that have already finished with those of the one that is taking a long time; you will likely see that far more data is being handled by the slow reducer.
You might want to look into this: Hadoop handling data skew in reducer
Most probably you are facing a data skew problem: either your keys are not well distributed, or your getPartition is creating the issue. It's not clear to me why you are creating a string from the hash code of the key string and then taking the hash code of that new string. My suggestion is to first try the default partitioner and then look at the distribution of your keys.
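For reference, Hadoop's default HashPartitioner amounts to the logic in the commented-out line of your getPartition:

// equivalent of org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
public int getPartition(Object key, Object value, int numPartitions) {
    // mask the sign bit so the result is non-negative, then bucket by modulo
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}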
In fact, when you process a large amount of data you should set a Combiner class. And if you want to change the output encoding, you should adjust the reduce function accordingly.
For example:
public class GramModelReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private LongWritable result = new LongWritable();

    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // re-encode the key as GB18030 before writing it out
        context.write(new Text(key.toString().getBytes("GB18030")), result);
    }
}
class GramModelCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {

    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable val : values) {
            sum += val.get();
        }
        // partial sums only; the encoding change happens in the reducer
        context.write(key, new LongWritable(sum));
    }
}
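To wire these up, the combiner is registered in the driver next to the reducer. A minimal sketch, assuming a Job instance named job:

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setCombinerClass(GramModelCombiner.class);
job.setReducerClass(GramModelReducer.class);

Note that a combiner's output types must match the map output types, which is why the encoding change above lives only in the reducer.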

storing and printing values in reducer

Can anybody help me find the error in the following code? I need to store the values in an ArrayList and then use it for further processing, but this code reads one value, prints it, stores it in the ArrayList, and then prints it again from my second print statement in the loop over the ArrayList.
What I want is to store all elements in the ArrayList first and only then print them.
Please help!
Thanks
public class tempreducer extends Reducer<LongWritable, Text, IntWritable, Text> {

    public void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        System.out.println("reducer");
        ArrayList<String> vArrayList = new ArrayList<String>();
        for (Text v : values) {
            String line = v.toString();
            System.out.println(line);
            vArrayList.add(line);
        }
        for (int i = 0; i < vArrayList.size(); ++i) {
            System.out.println("value" + vArrayList.get(i));
        }
    }
}
Hadoop MapReduce passes one
key - list of values
pair to reduce() for every key you have, so each reduce() call only sees the values of a single key.
If you want to print all values of all keys in one place, emit the same constant key from your mapper, something like:
output.collect(new Text("hadoop"), value);
The above will bring all values into one iterator.
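In the new (org.apache.hadoop.mapreduce) API the same trick looks roughly like this; the constant key is of course arbitrary:

private static final Text CONSTANT_KEY = new Text("hadoop");

@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // every record goes out under the same key, so one reduce() call sees them all
    context.write(CONSTANT_KEY, value);
}

The reducer's input key type then has to change to Text to match.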

map emitted keys change inside reducer/combiner

I need to do two separate matrix multiplications (S*A and P*A) inside my mappers and emit the results of both. I know I can do that easily with two MapReduce jobs, but in order to save running time I need to do them in one job. So what I do is that after performing both multiplications, I put both outputs in the context object, but with different keys so that I can distinguish them inside the reducer:
LongWritable one = new LongWritable();
one.set(1);
context.write(one, partialSA);

LongWritable two = new LongWritable();
two.set(2);
context.write(two, partialPA);
In reduce(), I just need to add all partialSA matrices together and add all partialPA matrices together too. The problem is that if I use a combiner, the keys I receive inside the combiner are 0 and 1 instead of 1 and 2! And if I don't use a combiner, inside the reducer I receive 0 and 1 as keys instead of 1 and 2.
Why is this happening? What is the problem?
Here is the exact cleanup function of my mapper:
public void cleanup(Context context) throws IOException, InterruptedException {
    LongWritable one = new LongWritable();
    one.set(1);
    LongWritable two = new LongWritable();
    two.set(2);
    context.write(one, partialSA);
    context.write(two, partialPA);
}
Here is the reducer() code:
public void reduce(LongWritable key, Iterable<MatrixWritable> values, Context context)
        throws IOException, InterruptedException {
    System.out.println("*** In reduce() **** " + key.get());
    Iterator<MatrixWritable> itr = values.iterator();
    if (key.get() == 1) {
        while (itr.hasNext()) {
            SA.addMatrices(itr.next());
        }
    } else if (key.get() == 2) {
        while (itr.hasNext()) {
            PA.addMatrices(itr.next());
        }
    }
}

How to manipulate reduce() output and store it in another file?

I have just started learning Hadoop. I would like to take the output of my reduce() and do some manipulations on it. I am working with the new API and have tried using JobControl, but it doesn't seem to work with the new API.
Any way out?
Not sure what you are trying to do. Do you want to send different kinds of output to different output formats? Check This. If you want to filter out or do manipulations on the values from the map, reduce is the best place to do this.
You can make use of ChainReducer to create a job of the form [MAP+ / REDUCE MAP*], i.e. several maps followed by a reducer, then another series of maps that work on the output of the reducer. The final output is the output of the last mapper in the series.
Alternatively, you can have multiple jobs that run sequentially, where the output of the previous job's reducer is the input to the next. But this causes unnecessary I/O in case you are not interested in the intermediate output.
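A minimal sketch of the sequential-jobs option (a fragment of a driver's main(); the mapper/reducer classes and paths are placeholders):

Configuration conf = new Configuration();

Job first = Job.getInstance(conf, "first pass");
first.setJarByClass(Driver.class);          // placeholder driver class
first.setMapperClass(FirstMapper.class);    // placeholder mapper/reducer
first.setReducerClass(FirstReducer.class);
FileInputFormat.addInputPath(first, new Path("/input"));
FileOutputFormat.setOutputPath(first, new Path("/intermediate"));

if (first.waitForCompletion(true)) {
    Job second = Job.getInstance(conf, "second pass");
    second.setJarByClass(Driver.class);
    second.setMapperClass(SecondMapper.class);
    second.setReducerClass(SecondReducer.class);
    // the second job reads exactly what the first job wrote
    FileInputFormat.addInputPath(second, new Path("/intermediate"));
    FileOutputFormat.setOutputPath(second, new Path("/final"));
    System.exit(second.waitForCompletion(true) ? 0 : 1);
}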
Do whatever you want inside the reducer: create an FSDataOutputStream and write the output through it.
For example:
public static class TokenCounterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // note: reduce() runs once per key, so open the stream in setup()
        // (or use a per-key path) if you expect more than one key
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataOutputStream out = fs.create(new Path("/path/to/your/file"));
        // do the manipulation and write it down to the file
        out.write(......);
        out.close();
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Why is MultipleOutputs not working for this Map Reduce program?

I have a Mapper class that emits a Text key and an IntWritable value which could be 1, 2, or 3. Depending on the value I have to write three different files with different keys. Right now I am getting a single file output with no records in it.
Also, is there any good MultipleOutputs example (with explanation) you could point me to?
My driver class had this code:
MultipleOutputs.addNamedOutput(job, "name", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "attributes", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "others", TextOutputFormat.class, Text.class, IntWritable.class);
My reducer class is:
public static class Reduce extends Reducer<Text, IntWritable, Text, NullWritable> {

    private MultipleOutputs mos;

    public void setup(Context context) {
        mos = new MultipleOutputs(context);
    }

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        String CheckKey = values.toString();
        if ("1".equals(CheckKey)) {
            mos.write("name", key, new IntWritable(1));
        } else if ("2".equals(CheckKey)) {
            mos.write("attributes", key, new IntWritable(2));
        } else if ("3".equals(CheckKey)) {
            mos.write("others", key, new IntWritable(3));
        }
        /* for (IntWritable val : values) {
            sum += val.get();
        } */
        // context.write(key, null);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
P.S. I am new to Hadoop/MapReduce programming.
ArrayList<Integer> l = new ArrayList<Integer>();
l.add(1);
System.out.println(l.toString());

results in "[1]", not 1, so values.toString() will not give you "1".
Apart from that, I just tried to print an Iterable directly and it just gave a reference, so that is definitely your problem. If you want to iterate over the values, do it as in the example below:

Iterator<IntWritable> valueIterator = values.iterator();
while (valueIterator.hasNext()) {
    IntWritable val = valueIterator.next();
    // work with val here
}

Note that you can only iterate once!
Your problem statement is muddled. What do you mean by "depending on the values"? The reducer gets an Iterable of values, not a single value. Something tells me you need to move the multiple-output code in your reducer inside the loop you have commented out for taking the sum.
Or perhaps you don't need a reducer at all and can take care of this in the map phase. If you are using the reduce phase only to end up with exactly 4 files via a single reduce task, you can also achieve that by flipping the key and value in your map phase and forgetting about MultipleOutputs altogether: you would end up with only 3 working reduce tasks, one for each of your int values. To get the 4th file you can output two copies of each record in the map call, using a special key to mark the output meant for the normal file rather than one of the three special files. Normally I would not recommend such a course of action, as it severely bounds the level of parallelism you can achieve in the reduce phase when the number of distinct keys is small.
You should also add some anomalous-data handling at the end of your 'if' ladder that increments a counter or the like if you encounter a value that is not one of the three you are expecting.
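Putting the advice together, a minimal sketch of what the corrected reducer could look like; this is not the poster's final code, and the counter group/name are made up:

public static class Reduce extends Reducer<Text, IntWritable, Text, NullWritable> {

    private MultipleOutputs mos;

    @Override
    public void setup(Context context) {
        mos = new MultipleOutputs(context);
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // iterate the values; each value decides which named output the key goes to
        for (IntWritable val : values) {
            switch (val.get()) {
                case 1: mos.write("name", key, val); break;
                case 2: mos.write("attributes", key, val); break;
                case 3: mos.write("others", key, val); break;
                default:
                    // anomalous value: count it instead of silently dropping it
                    context.getCounter("Reduce", "unexpected-value").increment(1);
            }
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

Since nothing is written through context here, wrapping the job's output format with LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) in the driver avoids the empty default part files.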
