I have just started learning Hadoop. I would like to use the output of my reduce() and do some manipulations on it. I am working on the new API and have tried using JobControl, but it doesn't seem to work with the new API.
Any way out?
Not sure what you are trying to do. Do you want to send different kinds of output to different output formats? If you want to filter out or manipulate the values coming from the map, the reducer is the best place to do this.
You can make use of ChainReducer to create a job of the form [MAP+ / REDUCE MAP*], i.e. one or more mappers, followed by a reducer, followed by another series of mappers that work on the output of the reducer. The final output is the output of the last mapper in the series.
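For reference, here is a minimal driver sketch using the new-API chain classes from org.apache.hadoop.mapreduce.lib.chain; the mapper and reducer class names (TokenizerMapper, IntSumReducer, PostProcessMapper) are hypothetical placeholders for your own classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain example");
        job.setJarByClass(ChainDriver.class);

        // MAP+ : one or more mappers run before the reducer
        ChainMapper.addMapper(job, TokenizerMapper.class, LongWritable.class,
                Text.class, Text.class, IntWritable.class, new Configuration(false));

        // REDUCE : the single reducer in the chain
        ChainReducer.setReducer(job, IntSumReducer.class, Text.class,
                IntWritable.class, Text.class, IntWritable.class, new Configuration(false));

        // MAP* : mappers that post-process the reducer's output
        ChainReducer.addMapper(job, PostProcessMapper.class, Text.class,
                IntWritable.class, Text.class, IntWritable.class, new Configuration(false));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}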
Alternatively, you can run multiple jobs sequentially, with the output of the previous job's reducer as the input to the next. But this causes unnecessary I/O in case you are not interested in the intermediate output.
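A minimal sketch of that sequential pattern, assuming two already-configured Job instances job1 and job2 and a hypothetical intermediate path:

// job 1 writes its reduce output to an intermediate directory ...
Path intermediate = new Path("/tmp/intermediate"); // hypothetical path
FileOutputFormat.setOutputPath(job1, intermediate);
if (!job1.waitForCompletion(true)) {
    System.exit(1);
}

// ... which job 2 then reads as its input
FileInputFormat.addInputPath(job2, intermediate);
FileOutputFormat.setOutputPath(job2, new Path(args[1]));
System.exit(job2.waitForCompletion(true) ? 0 : 1);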
Do whatever you want inside the reducer: create an FSDataOutputStream and write the output through it.
For example:
public static class TokenCounterReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataOutputStream out = fs.create(new Path("/path/to/your/file"));
        // do the manipulation and write it down to the file
        // out.write(......);
        out.close(); // close the stream when done writing

        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
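One caveat worth noting: reduce() runs once per key, so fs.create() as written re-creates (and truncates) the file on every call. If the side file should accumulate data across keys, a safer pattern is to open the stream once in setup(), keep it in a field, and close it in cleanup().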
For example, a typical WordCount MapReduce job might produce output that reads:
hello 3
world 4
again 1
I want to format the output slightly differently so that it would show this instead:
3 hello
4 world
1 again
I've read a lot of posts about sorting by the value, and the answers suggest a second MapReduce job on the output of the first one. However, I don't need to sort by the value, and it's possible that multiple keys have the same value; I don't want them to be lumped together.
Is there an easy way to simply switch the order the key/values are printed? It seems like it should be simple.
Two options to consider in order of ease are:
Switch the Key/Value in the Reduce
Modify the output from the reduce to switch the key and value. For example, the reducer in Hadoop's example WordCount job would change to:
public static class IntSumReducer
        extends Reducer<Text, IntWritable, IntWritable, Text> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(result, key); // key and value switched here
    }
}
Here the context.write(result, key); has changed to switch the key and value. Note that the reducer's output types are swapped as well, which is why the class declaration above becomes Reducer<Text, IntWritable, IntWritable, Text>; the driver must be updated to match, as sketched below.
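A sketch of the corresponding driver change, assuming the standard WordCount driver:

// output types must match the swapped reduce output
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);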
Use a second map-only job
You can use the InverseMapper provided by Hadoop to run a map-only job (zero reducers) that switches the key and value. So you would just have a second job and only need to write the driver, which would look something like this:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Switch inputs");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(InverseMapper.class);
    job.setNumReduceTasks(0); // map-only job
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Note that you would want the first job to write its output using SequenceFileOutputFormat, and use SequenceFileInputFormat as the input format for the second job.
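In the first job's driver, that might look like this (a sketch, assuming the first job's Job object is named job1):

// first job: write word/count pairs as a sequence file
job1.setOutputFormatClass(SequenceFileOutputFormat.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);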
As far as I understand, the reduce task has three phases: shuffle, sort, and the actual reduce invocation.
So usually in a Hadoop job's output we see something like:
map 0% reduce 0%
map 20% reduce 0%
.
.
.
map 90% reduce 10%
.
.
.
So I assume that the reduce tasks start before all the maps are finished, and this behavior is controlled by the slow-start configuration (mapreduce.job.reduce.slowstart.completedmaps).
What I don't yet understand is when the setup method of the reducer is actually called.
In my use case, I have some files to parse in the setup method. The file is about 60MB in size and is picked up from the distributed cache. While the file is being parsed, there is another set of data from the configuration that can update a just-parsed record. After parsing and the possible update, the records are stored in a HashMap for fast lookups. So I would like this method to be invoked as soon as possible, ideally while the mappers are still doing their thing.
Is it possible to do this? Or is that what already happens?
Thanks
setup() is called right before the reducer is able to read the first key/value pair from the stream, which is effectively after all mappers have run and all the merging for the given reducer partition is finished.
As explained in the Hadoop docs, the setup() method is called once at the start of the task. It should be used for instantiating resources/variables or reading configurable parameters, which in turn can be used in the reduce() method. Think of it like a constructor.
Here is an example reducer:
class ExampleReducer extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {

    private int runId;
    private ObjectMapper objectMapper;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        this.runId = Integer.valueOf(conf.get("stackoverflow_run_id"));
        this.objectMapper = new ObjectMapper();
    }

    @Override
    protected void reduce(ImmutableBytesWritable keyFromMap, Iterable<ImmutableBytesWritable> valuesFromMap, Context context)
            throws IOException, InterruptedException {
        // your code
        String json = objectMapper.writeValueAsString(/* your object */);
        // your code
        context.write(new ImmutableBytesWritable(somekey.getBytes()), put); // somekey and put come from your own code
    }
}
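For the distributed-cache use case in the question, a minimal sketch of the setup() parsing might look like this (assuming the file was registered with job.addCacheFile() and a hypothetical tab-separated record format; imports omitted as in the snippets above):

private Map<String, String> lookup = new HashMap<>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // files registered with job.addCacheFile() are visible here
    URI[] cacheFiles = context.getCacheFiles();
    Path cached = new Path(cacheFiles[0]);
    FileSystem fs = FileSystem.get(context.getConfiguration());
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(cached)))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2); // hypothetical "key<TAB>value" records
            lookup.put(parts[0], parts[1]);
        }
    }
}

Note that, as the answer above says, this still only runs after the shuffle and merge for the reducer's partition have completed; there is no way to have the reducer's setup() run while the mappers are still working.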
If I have only one key, can I avoid it being sent to a single reducer, and instead distribute the work across multiple reducers?
I understand that I might then need a second MapReduce job to combine the reducer outputs.
Is this a good approach? Or is there a better way?
I was in a similar situation once. What I did is something like this:
int numberOfReduceCalls = 5;
IntWritable outKey = new IntWritable();
Random random = new Random();

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // use a random integer within a limit as the key,
    // so the records for the single real key spread over several reducers
    outKey.set(random.nextInt(numberOfReduceCalls));
    context.write(outKey, value);
}
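One thing to watch, presumably: numberOfReduceCalls should match the number of reduce tasks configured on the job (job.setNumReduceTasks(...)), otherwise some of the random keys will collide on the same reducer or some reducers will sit idle. The second, combining job can then map all of the partial outputs back to a single constant key.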
I am new to Hadoop and currently working with it. I have a small query.
I have around 10 files in the input folder which I need to pass to my MapReduce program. I want the file name in my mapper, as my file name contains the time at which the file was created. I saw people using FileSplit to get the file name in the mapper. But if, say, my input files contain millions of lines, then every time the mapper code is called it will get the file name and extract the time from it, which is obviously repeated, time-consuming work for the same file. Once I get the time in the mapper, I should not have to assign the time from the file name again and again.
How can I achieve this?
You could use the Mapper's setup method to get the filename, as setup is guaranteed to run only once, before map() is first called, like this:
public class MapperRSJ extends Mapper<LongWritable, Text, CompositeKeyWritableRSJ, Text> {

    String filename;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit fsFileSplit = (FileSplit) context.getInputSplit();
        filename = fsFileSplit.getPath().getName();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // process each key value pair; the filename field is already set
    }
}
I have a Mapper class that emits a Text key and an IntWritable value which can be 1, 2, or 3. Depending on the value, I have to write three different files with different keys. I am getting a single file output with no records in it.
Also, is there any good MultipleOutputs example (with explanation) you could point me to?
My driver class has this code:
MultipleOutputs.addNamedOutput(job, "name", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "attributes", TextOutputFormat.class, Text.class, IntWritable.class);
MultipleOutputs.addNamedOutput(job, "others", TextOutputFormat.class, Text.class, IntWritable.class);
My reducer class is:
public static class Reduce extends Reducer<Text, IntWritable, Text, NullWritable> {

    private MultipleOutputs mos;

    public void setup(Context context) {
        mos = new MultipleOutputs(context);
    }

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        String CheckKey = values.toString();
        if ("1".equals(CheckKey)) {
            mos.write("name", key, new IntWritable(1));
        } else if ("2".equals(CheckKey)) {
            mos.write("attributes", key, new IntWritable(2));
        } else if ("3".equals(CheckKey)) {
            mos.write("others", key, new IntWritable(3));
        }
        /* for (IntWritable val : values) {
            sum += val.get();
        } */
        // context.write(key, null);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
P.S. I am new to Hadoop/MapReduce programming.
ArrayList<Integer> l = new ArrayList<Integer>();
l.add(1);
System.out.println(l.toString());
results in "[1]", not 1, so
values.toString()
will not give "1".
Apart from that, I just tried to print an Iterable and it just gave a reference, so that is definitely your problem. If you want to iterate over the values, do it as in the example below:
Iterator<IntWritable> valueIterator = values.iterator();
while (valueIterator.hasNext()) {
    IntWritable value = valueIterator.next();
    // inspect value.get() here
}
Note that you can only iterate once!
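Putting that together, a minimal corrected sketch of the reducer (assuming, as in the question, the named outputs "name", "attributes", and "others", and that each value should be routed individually):

public static class Reduce extends Reducer<Text, IntWritable, Text, NullWritable> {

    private MultipleOutputs<Text, NullWritable> mos;

    @Override
    public void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // route each value individually, instead of calling toString() on the Iterable
        for (IntWritable val : values) {
            switch (val.get()) {
                case 1: mos.write("name", key, val); break;
                case 2: mos.write("attributes", key, val); break;
                case 3: mos.write("others", key, val); break;
                default: break; // unexpected value
            }
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}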
Your problem statement is muddled. What do you mean by "depending on the values"? The reducer gets an Iterable of values, not a single value. Something tells me that you need to move the multiple-output code in your reducer inside the loop you have commented out for taking the sum.
Or perhaps you don't need a reducer at all and can take care of this in the map phase. If you are using the reduce phase just to end up with exactly four files via a single reduce task, then you can also achieve what you want by flipping the key and value in your map phase and forgetting about MultipleOutputs altogether, because you'll end up with only three working reduce tasks, one for each of your int values. To get the fourth one, you can output two copies of the record in each map call, using a special key to indicate that the output is meant for the normal file rather than one of the three special files. Normally I would not recommend such a course of action, as it puts severe bounds on the level of parallelism you can achieve in the reduce phase when the number of keys is small.
You should also add some anomalous-data handling at the end of your 'if' ladder, incrementing a counter or something if you encounter a value that is not one of the three you are expecting.
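For example (a sketch; the counter group and name are arbitrary):

// at the end of the if/else-if ladder:
else {
    // value was not 1, 2, or 3; record it for post-job inspection
    context.getCounter("Anomalies", "UnexpectedValue").increment(1);
}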