storing and printing values in reducer - hadoop

Can anybody help me find the error in the following code? I need to store the values in an ArrayList and then use it for further processing, but this code reads one value, prints it, stores it in the ArrayList, and then prints it again from the second loop over the ArrayList.
I want to store all the elements in the ArrayList first and only then print them.
Please help!
Thanks
public class tempreducer extends Reducer<LongWritable,Text,IntWritable,Text> {
    public void reduce(LongWritable key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
        System.out.println("reducer");
        ArrayList<String> vArrayList = new ArrayList<String>();
        // First loop: print each value of this key and store it
        for (Text v : values) {
            String line = v.toString();
            System.out.println(line);
            vArrayList.add(line);
        }
        // Second loop: print everything stored for this key
        for (int i = 0; i < vArrayList.size(); ++i) {
            System.out.println("value" + vArrayList.get(i));
        }
    }
}

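Hadoop MapReduce calls reduce() once for every key you have, passing it
key - list of values
for that key. With one value per key, each call sees only a single value, which is why both loops print one value at a time.
If you want to print all the values of all keys together, have your mapper emit a single constant key, something like
output.collect(new Text("hadoop"), value);
This brings all the values into one iterator.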
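To make the grouping concrete, here is a minimal plain-Java sketch (no Hadoop involved) that imitates the shuffle step. The class ShuffleSketch and its groupByKey method are hypothetical names invented for this illustration; the string keys "0", "10", "20" stand in for the distinct LongWritable keys of the question. It only shows that the number of groups, and so the number of reduce() calls, equals the number of distinct keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSketch {
    // Hypothetical stand-in for the shuffle: collect the values of each key
    // into one list, with keys kept in sorted order.
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] pair : pairs) {
            groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "b", "c");
        List<String[]> distinctKeys = new ArrayList<>();
        List<String[]> constantKey = new ArrayList<>();
        int offset = 0;
        for (String v : values) {
            // One distinct key per value (assumed, like a per-line offset)
            distinctKeys.add(new String[] {String.valueOf(offset), v});
            // The same key for every value, as the answer suggests
            constantKey.add(new String[] {"hadoop", v});
            offset += 10;
        }
        // Three groups: a reducer would be called three times, one value each
        System.out.println(groupByKey(distinctKeys).size()); // prints 3
        // One group: a reducer would be called once with all three values
        System.out.println(groupByKey(constantKey).get("hadoop")); // prints [a, b, c]
    }
}
```

Keep in mind that a single constant key sends every value to one reduce call, so this only suits data small enough for one reducer.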

Related

hadoop - total line of input files

I have an input file that contains:
id value
1e 1
2e 1
...
2e 1
3e 1
4e 1
I would like to find the total number of distinct ids in my input file. So in my main class I have declared a set, so that when I read the input file I can insert each id into it:
MainDriver.java
public static Set list = new HashSet();
and in my map:
// Apply regex to find the id
...
// Insert id to the list
MainDriver.list.add(regex.group(1)); // add 1e, 2e, 3e ...
and in my reduce, I try to use the list as:
public void reduce(WritableComparable key, Iterator values,
        OutputCollector output, Reporter reporter) throws IOException
{
    ...
    output.collect(key, new IntWritable(MainDriver.list.size()));
}
So I expect the value written to the output file to be 4 in this case, but it actually prints 0.
I have verified that regex.group(1) extracts a valid id, so I have no clue why the size of my list is 0 in the reduce phase.
The mappers and reducers run on separate JVMs (and often separate machines altogether) both from each other and from the driver program, so there is no common instance of your list Set variable that all of those methods can concurrently read and write to.
One way in MapReduce to count the number of keys is:
1. Emit (id, 1) from your mapper.
2. (Optionally) sum the 1s for each mapper using a combiner to minimize network and reducer I/O.
3. In the reducer:
   - In setup(), initialize a class-scope numeric variable (int or long, presumably) to 0.
   - In reduce(), increment the counter and ignore the values.
   - In cleanup(), emit the counter value now that all keys have been processed.
4. Run the job with a single reducer, so all the keys go to the same JVM where a single count can be made.
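The reducer steps above can be sketched in plain Java without Hadoop. The class DistinctKeyCountSketch and its run driver are hypothetical; the methods setup, reduce and cleanup only imitate the order in which a single reducer task would see its keys, and cleanup returns the count instead of writing it to a context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DistinctKeyCountSketch {
    // Class-scope counter, as in step 3 of the answer
    private long distinctKeys;

    void setup() {
        distinctKeys = 0;
    }

    // Called once per key; the values are deliberately ignored
    void reduce(String key, Iterable<Integer> values) {
        distinctKeys++;
    }

    // A real reducer would emit the counter here; this sketch returns it
    long cleanup() {
        return distinctKeys;
    }

    // Hypothetical driver imitating one reducer task fed by the shuffle
    static long run(List<String> ids) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String id : ids) {
            grouped.computeIfAbsent(id, k -> new ArrayList<>()).add(1); // mapper emits (id, 1)
        }
        DistinctKeyCountSketch reducer = new DistinctKeyCountSketch();
        reducer.setup();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            reducer.reduce(entry.getKey(), entry.getValue());
        }
        return reducer.cleanup();
    }

    public static void main(String[] args) {
        // Ids from the question's sample input, where 2e appears twice
        System.out.println(run(Arrays.asList("1e", "2e", "2e", "3e", "4e"))); // prints 4
    }
}
```

The count is only correct because one reducer instance sees every key, which is the reason for the single-reducer requirement in step 4.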
This is basically ignoring the advantage of using MapReduce in the first place.
Correct me if I'm wrong, but it appears you can map your output from your Mapper by "id", and then in your Reducer you receive something like Text key, Iterator values as the parameters.
You can then just sum up values and output output.collect(key, <total value>);
Example (apologies for using Context rather than OutputCollector, but the logic is the same):
public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Constant value emitted once for every occurrence of an id
    private final Text countOne = new Text("1");
    private final Text id = new Text();

    public void map(LongWritable key, Text value,
            Context context) throws IOException, InterruptedException {
        id.set(regex.group(1)); // do whatever you do
        context.write(id, countOne);
    }
}

public static class MyReducer extends Reducer<Text, Text, Text, IntWritable> {
    private final IntWritable totalCount = new IntWritable();

    public void reduce(Text key, Iterable<Text> values,
            Context context) throws IOException, InterruptedException {
        // Count how many times this id was emitted
        int cnt = 0;
        for (Text value : values) {
            cnt++;
        }
        totalCount.set(cnt);
        context.write(key, totalCount);
    }
}

custom partitioner to send single key to multiple reducers?

If I have only one key, can I avoid it being sent to a single reducer and distribute it across multiple reducers instead?
I understand that I might then need a second MapReduce program to combine the reducer outputs.
Is this a good approach, or is there a better way?
I was in a similar situation once. What I did was something like this:
int numberOfReduceCalls = 5;
IntWritable outKey = new IntWritable();
Random random = new Random();

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // use a random integer within a limit
    outKey.set(random.nextInt(numberOfReduceCalls));
    context.write(outKey, value);
}
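Here is a minimal plain-Java sketch of the two stages this implies, assuming the per-key work is a sum (any associative and commutative reduction would fit the same pattern). The class SaltedKeySketch and the methods firstStage and secondStage are hypothetical names; the first stage imitates reducers that each total the values sharing one random key, and the second stage imitates the follow-up job that merges the partial totals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class SaltedKeySketch {
    // Stage 1: spread the values over numBuckets random keys and sum each bucket
    static Map<Integer, Long> firstStage(List<Long> values, int numBuckets, Random random) {
        Map<Integer, Long> partialSums = new HashMap<>();
        for (long value : values) {
            partialSums.merge(random.nextInt(numBuckets), value, Long::sum);
        }
        return partialSums;
    }

    // Stage 2: combine the partial sums into the final result
    static long secondStage(Map<Integer, Long> partialSums) {
        long total = 0;
        for (long partial : partialSums.values()) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Long> values = new ArrayList<>();
        for (long i = 1; i <= 100; i++) {
            values.add(i);
        }
        Map<Integer, Long> partialSums = firstStage(values, 5, new Random());
        // The bucket assignment is random, but the combined total is not
        System.out.println(secondStage(partialSums)); // prints 5050
    }
}
```

The second stage handles at most numBuckets records, so it is cheap compared with the first.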

map emitted keys change inside reducer/combiner

I need to do two separate matrix multiplications (S*A and P*A) inside my mappers and emit the results of both. I know I can do that easily with two MapReduce jobs, but in order to save running time I need to do them in one job. So after doing both multiplications I write both outputs to the context object, but with different keys so that I can distinguish them inside the reducer:
LongWritable One = new LongWritable();
One.set(1);
context.write(One, partialSA);
LongWritable two = new LongWritable();
two.set(2);
context.write(two, partialPA);
In reduce, I just need to add all the partialSA matrices together and add all the partialPA matrices together. The problem is that if I use a combiner, the keys I receive inside the combiner are 0 and 1 instead of 1 and 2. And if I don't use a combiner, I receive 0 and 1 as keys inside the reducer instead of 1 and 2.
Why is this happening? What is the problem?
Here is the exact cleanup function of my mapper:
public void cleanup(Context context) throws IOException, InterruptedException {
    LongWritable one = new LongWritable();
    one.set(1);
    LongWritable two = new LongWritable();
    two.set(2);
    context.write(one, partialSA);
    context.write(two, partialPA);
}
Here is the reduce() code:
public void reduce(LongWritable key, Iterable<MatrixWritable> values, Context context) throws IOException, InterruptedException {
    System.out.println("*** In reduce() **** " + key.get());
    Iterator<MatrixWritable> itr = values.iterator();
    if (key.get() == 1) {
        while (itr.hasNext()) {
            SA.addMatrices(itr.next());
        }
    } else if (key.get() == 2) {
        while (itr.hasNext()) {
            PA.addMatrices(itr.next());
        }
    }
}

Why is MultipleOutputs not working for this Map Reduce program?

I have a Mapper class that emits a Text key and an IntWritable value, which can be 1, 2, or 3. Depending on the value, I have to write to three different files with different keys. I am getting a single output file with no records in it.
Also, is there a good MultipleOutputs example (with an explanation) you could point me to?
My driver class has this code:
MultipleOutputs.addNamedOutput(job, "name", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "attributes", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "others", TextOutputFormat.class, Text.class, IntWritable.class);
My reducer class is:
public static class Reduce extends Reducer<Text, IntWritable, Text, NullWritable> {
    private MultipleOutputs mos;

    public void setup(Context context) {
        mos = new MultipleOutputs(context);
    }

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        String CheckKey = values.toString();
        if ("1".equals(CheckKey)) {
            mos.write("name", key, new IntWritable(1));
        }
        else if ("2".equals(CheckKey)) {
            mos.write("attributes", key, new IntWritable(2));
        }
        else if ("3".equals(CheckKey)) {
            mos.write("others", key, new IntWritable(3));
        }
        /* for (IntWritable val : values) {
            sum += val.get();
        }*/
        //context.write(key, null);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
P.S. I am new to Hadoop/MapReduce programming.
ArrayList<Integer> l = new ArrayList<Integer>();
l.add(1);
System.out.println(l.toString());
results in "[1]", not "1", so
values.toString()
will not give "1".
Apart from that, I just tried to print an Iterable and it only gave a reference, so that is definitely your problem. If you want to iterate over the values, do it as in the example below:
Iterator<IntWritable> valueIterator = values.iterator();
while (valueIterator.hasNext()) {
    IntWritable value = valueIterator.next();
    // compare value.get() with 1, 2 or 3 here
}
Note that you can only iterate once!
Your problem statement is muddled. What do you mean, "depending on the values"? The reducer gets an Iterable of values, not a single value. Something tells me that you need to move the multiple output code in your reducer inside the loop you have commented out for taking the sum.
Or perhaps you don't need a reducer at all and can take care of this in the map phase. If you are using the reduce phase to end up with exactly 4 files by using a single reduce task, then you can also achieve what you want by flipping the key and value in your map phase and forgetting about MultipleOutputs altogether, because you'll end up with only 3 working reduce tasks, one for each of your int values. To get the 4th one you can output two copies of the record in each map call using a special key to indicate that the output is meant for the normal file, not one of the three special files. Normally I would not recommend such a course of action as you have severe bounds on the level of parallelism you can achieve in the reduce phase when the number of keys is small.
You should also add some anomalous-data handling to the end of your 'if' ladder that increments a counter or something similar if you encounter a value that is not one of the three you are expecting.
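The advice from both answers can be sketched in plain Java: inspect each value inside the loop instead of calling toString() on the whole Iterable, and give unexpected values their own branch. The class RouteByValueSketch, the methods outputNameFor and route, and the "unexpected" label are hypothetical; the names "name", "attributes" and "others" are the named outputs from the question. In a real reducer the counting line would be the place for the mos.write(...) call.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class RouteByValueSketch {
    // Decide the named output for a single value
    static String outputNameFor(int value) {
        switch (value) {
            case 1: return "name";
            case 2: return "attributes";
            case 3: return "others";
            default: return "unexpected"; // anomalous data ends up here
        }
    }

    // Walk the values once and count how many go to each output
    static Map<String, Integer> route(Iterable<Integer> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int value : values) {
            counts.merge(outputNameFor(value), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Iterable<Integer> values = Arrays.asList(1, 2, 2, 3, 7);
        // prints {attributes=2, name=1, others=1, unexpected=1}
        System.out.println(route(values));
    }
}
```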

Hadoop MapReduce: return sorted list of words in a text file

So my task is to return an alphabetically sorted list of all the words contained in a text file while keeping duplicates.
{To be or not to be} -> {be be not or to to}
My idea is to use each word as both the key and the value. This way, because Hadoop sorts the keys, the words will automatically be sorted alphabetically. In the reduce phase I simply append all the words with the same key (so basically identical words) to one single Text value.
public class WordSort {
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                // transform to lower case
                String lower = word.toString().toLowerCase();
                context.write(new Text(lower), new Text(lower));
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Join all occurrences of this word into one line
            String result = "";
            for (Text value : values) {
                result += value.toString() + " ";
            }
            context.write(key, new Text(result));
        }
    }
}
However my problem is, how do I simply return the value in my output file? At the moment I have this:
be be be
not not
or or
to to to
So in every line I have the key first and then the values, but I just want to return the values so that I get this:
be be
not
or
to to
Is this even possible or do I have to just delete one entry from the value of each word?
Disclaimer: I'm not a Hadoop user, but I do a lot of Map/Reduce with CouchDB.
If you just need the keys, why don't you emit an empty value?
Moreover, it sounds like you don't want to reduce them at all, since you want to get a key for every occurrence.
I just tried this with the MaxTemperature example from Hadoop: The Definitive Guide, and the code below worked:
context.write(null, new Text(result));
