MultipleOutputFormat does only one iteration in Reduce step - hadoop

Here is my Reducer. The reducer takes EdgeWritable and NullWritable as input.
An EdgeWritable holds 4 integers, say <71, 74, 7, 2000>:
the communication is from 71 (FromID) to 74 (ToID) in month 7 (July) of year 2000.
The mapper sends 10787 such records to the reducer, but the reducer outputs only one.
I need to output 44 files, one for each of the 44 months in the period Oct 1998 to July 2002, named in the format "out" + month + year. For example, the July 2002 records should go to the file out72002.
I have debugged the code: after one iteration it writes one output and stops without taking the next record. Please suggest how I should use MultipleOutputs. Thanks.
public class MultipleOutputReducer extends Reducer<EdgeWritable, NullWritable, IntWritable, IntWritable>{
private MultipleOutputs<IntWritable,IntWritable> multipleOutputs;
protected void setup(Context context) throws IOException, InterruptedException{
multipleOutputs = new MultipleOutputs<IntWritable, IntWritable>(context);
}
@Override
public void reduce(EdgeWritable key, Iterable<NullWritable> val, Context context) throws IOException, InterruptedException {
int year = key.get(3).get();
int month= key.get(2).get();
int to = key.get(1).get();
int from = key.get(0).get();
//if(year >= 1997 && year <= 2001){
if((month >= 9 && year >= 1997) || (month <= 6 && year <= 2001)){
multipleOutputs.write(new IntWritable(from), new IntWritable(to), "out"+month+year );
}
//}
}
@Override
public void cleanup(Context context) throws IOException, InterruptedException{
multipleOutputs.close();
}
}
Driver
public class TimeSlicingDriver extends Configured implements Tool{
static final SimpleDateFormat sdf = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z");
public int run(String[] args) throws Exception {
if(args.length != 2){
System.out.println("Enter <input path> <output path>");
System.exit(-1);
}
Configuration setup = new Configuration();
//setup.set("Input Path", args[0]);
Job job = new Job(setup, "Time Slicing");
//job.setJobName("Time Slicing");
job.setJarByClass(TimeSlicingDriver.class);
job.setMapperClass(TimeSlicingMapper.class);
job.setReducerClass(MultipleOutputReducer.class);
//MultipleOutputs.addNamedOutput(setup, "output", org.apache.hadoop.mapred.TextOutputFormat.class, EdgeWritable.class, NullWritable.class);
job.setMapOutputKeyClass(EdgeWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
/** Set the input file path and output file path */
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true)?0:1;
}
}

You are not iterating over your Iterable "val"; because of that, the body of your reduce method runs only once per key group, no matter how many records the group contains.
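A minimal sketch of the corrected method (my reconstruction; it assumes you want one output line per record in the group, and reuses the names from the code above):

@Override
public void reduce(EdgeWritable key, Iterable<NullWritable> val, Context context) throws IOException, InterruptedException {
    int year = key.get(3).get();
    int month = key.get(2).get();
    int to = key.get(1).get();
    int from = key.get(0).get();
    // consume the whole group: one pass per record the mapper emitted for this key
    for (NullWritable ignored : val) {
        if ((month >= 9 && year >= 1997) || (month <= 6 && year <= 2001)) {
            multipleOutputs.write(new IntWritable(from), new IntWritable(to), "out" + month + year);
        }
    }
}

One more tip: when everything is written through MultipleOutputs base paths like "out" + month + year, the default part-r-* files come out empty; adding LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) to the driver suppresses them.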

Related

Hadoop not all values get assembled for one key

I have some data that I would like to aggregate by key in the Mapper and then do something with all the values that belong to a key in the Reducer. For example, if I have:
key = 1, val = 1,
key = 1, val = 2,
key = 1, val = 3
I would like to get key=1, val=[1,2,3] in my Reducer.
The thing is, I get something like
key = 1, val=[1,2]
key = 1, val=[3]
Why is that so?
I thought that all the values for one specific key would be assembled in one reducer, but now it seems there can be several (key, val[]) pairs, since there can be multiple reducers. Is that so?
Should I set the number of reducers to 1?
I'm new to Hadoop, so this confuses me.
Here's the code
public class SomeJob {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException
{
Job job = new Job();
job.setJarByClass(SomeJob.class);
FileInputFormat.addInputPath(job, new Path("/home/pera/data/input/some.csv"));
FileOutputFormat.setOutputPath(job, new Path("/home/pera/data/output"));
job.setMapperClass(SomeMapper.class);
job.setReducerClass(SomeReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.waitForCompletion(true);
}
}
public class SomeMapper extends Mapper<LongWritable, Text, Text, Text>{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String parts[] = line.split(";");
context.write(new Text(parts[0]), new Text(parts[4]));
}
}
public class SomeReducer extends Reducer<Text, Text, Text, Text>{
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String properties = "";
for(Text value : values)
{
properties += value + " ";
}
context.write(key, new Text(properties));
}
}
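For what it's worth: the shuffle guarantees that all values for one key arrive in a single reduce() call of a single reducer, even when many reducers run, so forcing a single reducer should not be necessary. To rule it out anyway, a one-line experiment in the driver would be:

job.setNumReduceTasks(1); // route every key group through the same, single reducer

If the values still arrive split across calls, the keys are probably not byte-identical (e.g. stray whitespace in parts[0]), and trimming them in the mapper may help.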

Reducer receives identical value multiple times instead of expected input

While writing a map-reduce job in my local Hadoop environment, I ran into the problem that the reducer did not receive the values I expected. I abstracted the problem down to the following:
I create an arbitrary input file with 10 lines so that the map method is executed 10 times. In the mapper I keep an invocation count and write this count to the output as the value, with 0 as the key if the count is even and 1 as the key if it is odd, i.e. the following (key, value) pairs:
(1,1), (0,2), (1,3), (0,4), (1,5), etc.
I would expect to receive two calls to the Reducer with
0 > [2,4,6,8,10]
1 > [1,3,5,7,9]
but I get two calls with
0 > [2,2,2,2,2]
1 > [1,1,1,1,1]
instead. It seems I receive the first value that was written in the mapper, repeated as many times as the key occurs (if I reverse the counter, I receive the values 10 and 9 instead of 2 and 1). From my understanding this is not the expected behaviour (?), but I cannot figure out what I am doing wrong.
I use the following Mapper and reducer:
public class TestMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
int count = 0;
@Override
protected void map(LongWritable keyUnused, Text valueUnused, Context context) throws IOException, InterruptedException {
count += 1;
context.write(new IntWritable(count % 2), new IntWritable(count));
System.err.println((count % 2) + "|" + count);
}
}
public class TestReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable>{
@Override
protected void reduce(IntWritable key, Iterable<IntWritable> valueItr, Context context) throws IOException, InterruptedException {
List<IntWritable> values = Lists.newArrayList(valueItr);
System.err.println(key + "|" + values);
}
}
I run the hadoop job with a local test runner as described for example in the book "Hadoop: The Definitive Guide" (O'Reilly):
public class TestDriver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s [generic options] <input> <output>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Job jobConf = Job.getInstance(getConf());
jobConf.setJarByClass(getClass());
jobConf.setJobName("TestJob");
jobConf.setMapperClass(TestMapper.class);
jobConf.setReducerClass(TestReducer.class);
FileInputFormat.addInputPath(jobConf, new Path(args[0]));
FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
jobConf.setOutputKeyClass(IntWritable.class);
jobConf.setOutputValueClass(IntWritable.class);
return jobConf.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
System.exit(ToolRunner.run(new TestDriver(), args));
}
}
packaged in a jar and run with 'hadoop jar test.jar infile.txt /tmp/testout'.
Hadoop reuses the value object while streaming over the reducer values.
So in order to capture all of your different values, you need to copy them:
@Override
protected void reduce(IntWritable key, Iterable<IntWritable> valueItr, Context context) throws IOException, InterruptedException {
List<IntWritable> values = Lists.newArrayList();
for (IntWritable writable : valueItr) {
// copy the value: Hadoop hands back the same IntWritable instance on every iteration
values.add(new IntWritable(writable.get()));
}
System.err.println(key + "|" + values);
}
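Equivalently, Hadoop's WritableUtils can make the copy for you (a sketch; clone() deep-copies a Writable through its serialization):

for (IntWritable writable : valueItr) {
    // clone the reused instance before storing it
    values.add(WritableUtils.clone(writable, context.getConfiguration()));
}

This also explains the original symptom: Lists.newArrayList(valueItr) stores ten references to the one IntWritable instance that Hadoop keeps rewriting.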

Writing a value to file without moving to reducer

I have an input of records like this,
a|1|Y,
b|0|N,
c|1|N,
d|2|Y,
e|1|Y
Now, in the mapper, I have to check the value of the third column. If it is 'Y', that record has to be written directly to the output file without going to the reducer; the 'N' records have to go to the reducer for further processing.
So,
a|1|Y,
d|2|Y,
e|1|Y
should not go to the reducer, but
b|0|N,
c|1|N
should go to the reducer and then to the output file.
How can I do this?
What you can probably do is use MultipleOutputs to separate out the 'Y' and 'N' records into two different files from the mappers.
Next, you run separate jobs on the two newly generated 'Y' and 'N' data sets.
For the 'Y' set, set the number of reducers to 0 so that no reducers are used; for the 'N' set, do it the way you want with reducers.
Hope this helps.
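A rough sketch of what the splitting job's mapper could look like (my illustration, not from the original answer; SplitMapper and the named output "ytype" are made-up names, and the driver would need MultipleOutputs.addNamedOutput(job, "ytype", TextOutputFormat.class, LongWritable.class, Text.class)):

public static class SplitMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private MultipleOutputs<LongWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<LongWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] parts = value.toString().split("\\|");
        // startsWith tolerates the trailing comma in the sample records ("a|1|Y,")
        if (parts.length > 2 && parts[2].startsWith("Y")) {
            mos.write("ytype", key, value); // 'Y' records go straight to their own files
        } else {
            context.write(key, value); // 'N' records take the normal path to the reducer
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}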
See if this works,
public class Xxxx {
public static class MyMapper extends
Mapper<LongWritable, Text, LongWritable, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
FileSystem fs = FileSystem.get(context.getConfiguration());
Random r = new Random();
FileSplit split = (FileSplit)context.getInputSplit();
String fileName = split.getPath().getName();
FSDataOutputStream out = fs.create(new Path(fileName + "-m-" + r.nextInt()));
String parts[];
String line = value.toString();
String[] splits = line.split(",");
for(String s : splits) {
parts = s.split("\\|");
if(parts[2].equals("Y")) {
out.writeBytes(line);
}else {
context.write(key, value);
}
}
out.close();
fs.close();
}
}
public static class MyReducer extends
Reducer<LongWritable, Text, LongWritable, Text> {
public void reduce(LongWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
for(Text t : values) {
context.write(key, t);
}
}
}
/**
* @param args
* @throws IOException
* @throws InterruptedException
* @throws ClassNotFoundException
*/
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// TODO Auto-generated method stub
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://localhost:9000");
conf.set("mapred.job.tracker", "localhost:9001");
Job job = new Job(conf, "Xxxx");
job.setJarByClass(Xxxx.class);
Path outPath = new Path("/output_path");
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
FileInputFormat.addInputPath(job, new Path("/input.txt"));
FileOutputFormat.setOutputPath(job, outPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
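Two cautions about this direct-write pattern (my observations, not part of the original answer): FileSystem.get() returns a cached instance shared across the JVM, so calling fs.close() inside map() can break other tasks that are still using it; and because the file is created inside map(), you end up with one output file per input line. The MultipleOutputs route sketched above avoids both problems.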
In your map function, you will get the input line by line. Split it using | as the delimiter (with the String.split() method, to be exact). Note that split() takes a regular-expression string, and | is a regex metacharacter, so it has to be escaped:
String[] line = value.toString().split("\\|");
Access the third element of this array as line[2].
Then, using a simple if/else statement, emit the records with the 'N' value to the reducer for further processing.
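Spelled out as a sketch, only the 'N' records are emitted; the 'Y' branch simply never calls context.write, so those records skip the reducer:

String[] line = value.toString().split("\\|");
// startsWith tolerates the trailing comma in the sample records
if (line.length > 2 && line[2].startsWith("N")) {
    context.write(key, value); // only 'N' records continue to the reducer
}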

How to pass variable between two map reduce jobs

I have chained two MapReduce jobs. Job1 has only one reducer, in which I compute a float value. I want to use this value in the reducer of Job2. This is my main method setup.
public static String GlobalVriable;
public static void main(String[] args) throws Exception {
int runs = 0;
for (; runs < 10; runs++) {
String inputPath = "part-r-000" + nf.format(runs);
String outputPath = "part-r-000" + nf.format(runs + 1);
MyProgram.MR1(inputPath);
MyProgram.MR2(inputPath, outputPath);
}
}
public static void MR1(String inputPath)
throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
conf.set("var1","");
Job job = new Job(conf, "This is job1");
job.setJarByClass(MyProgram.class);
job.setMapperClass(MyMapper1.class);
job.setReducerClass(MyReduce1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
job.waitForCompletion(true);
GlobalVriable = conf.get("var1"); // I am getting NULL here
}
public static void MR2(String inputPath, String outputPath)
throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "This is job2");
...
}
public static class MyReduce1 extends
Reducer<Text, FloatWritable, Text, FloatWritable> {
public void reduce(Text key, Iterable<FloatWritable> values, Context context)
throws IOException, InterruptedException {
float s = 0;
for (FloatWritable val : values) {
s += val.get();
}
String sum = Float.toString(s);
context.getConfiguration().set("var1", sum);
}
}
As you can see, I need to iterate the entire program multiple times. Job1 computes a single number from the input. Since it is just a single number, over a lot of iterations, I don't want to write it to HDFS and read it back. Is there a way to share the value computed in MyReduce1 and use it in the reducer of Job2?
UPDATE: I have tried passing the value using conf.set and conf.get; the value is not being passed.
Here's how to pass back a float value via a counter. (Note that conf.set in a reducer only changes that task's local copy of the Configuration; the change is never shipped back to the driver, which is why conf.get returns null there. Counter values, by contrast, are aggregated back to the client.)
First, in the first reducer, transform the float value into a long by multiplying by 1000 (to keep 3 digits of precision, for example) and put the result into a counter:
public void cleanup(Context context) {
// floatValue is the field in which reduce() accumulated the result
long result = (long) (floatValue * 1000);
context.getCounter("Result","Result").increment(result);
}
In the driver class, retrieve the long value and transform it back to a float:
public static void MR1(String inputPath)
throws IOException, InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "This is job1");
job.setJarByClass(MyProgram.class);
job.setMapperClass(MyMapper1.class);
job.setReducerClass(MyReduce1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
job.waitForCompletion(true);
long result = job.getCounters().findCounter("Result","Result").getValue();
float value = ((float)result) / 1000;
}
You could use ZooKeeper for this; it's great for exactly this kind of inter-job coordination and message passing.
Can't you just change the return type of MR1 to int (or whatever data type is appropriate) and return the number you computed:
int myNumber = MyProgram.MR1(inputPath);
Then add a parameter to MR2 and call it with your computed number:
MyProgram.MR2(inputPath, outputPath, myNumber);
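If you take that route, the driver can hand the number to Job2 through its Configuration (a sketch; "var1" is an arbitrary key name). Unlike the reducer-side conf.set in the question, a driver-side set happens before the job is submitted, so the tasks do see it:

public static void MR2(String inputPath, String outputPath, float myNumber)
throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    conf.setFloat("var1", myNumber); // set before the Job is created, so it ships with the job
    Job job = new Job(conf, "This is job2");
    ...
}

In the Job2 reducer, read it back with context.getConfiguration().getFloat("var1", 0f).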

Type mismatch in key from map, using SequenceFileInputFormat correctly

I am trying to run the recommender example from chapter 6 (listings 6.1-6.4) of the book Mahout in Action. There are two mapper/reducer pairs. Here is the code:
Mapper - 1
public class WikipediaToItemPrefsMapper extends
Mapper<LongWritable,Text,VarLongWritable,VarLongWritable> {
private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
@Override
public void map(LongWritable key,
Text value,
Context context)
throws IOException, InterruptedException {
String line = value.toString();
Matcher m = NUMBERS.matcher(line);
m.find();
VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
VarLongWritable itemID = new VarLongWritable();
while (m.find()) {
itemID.set(Long.parseLong(m.group()));
context.write(userID, itemID);
}
}
}
Reducer - 1
public class WikipediaToUserVectorReducer extends
Reducer<VarLongWritable,VarLongWritable,VarLongWritable,VectorWritable> {
@Override
public void reduce(VarLongWritable userID,
Iterable<VarLongWritable> itemPrefs,
Context context)
throws IOException, InterruptedException {
Vector userVector = new RandomAccessSparseVector(
Integer.MAX_VALUE, 100);
for (VarLongWritable itemPref : itemPrefs) {
userVector.set((int)itemPref.get(), 1.0f);
}
//LongWritable userID_lw = new LongWritable(userID.get());
context.write(userID, new VectorWritable(userVector));
//context.write(userID_lw, new VectorWritable(userVector));
}
}
The reducer outputs a userID and a userVector, and it looks like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}, provided FileInputFormat and TextInputFormat are used in the driver.
I want to use another pair of mapper-reducer to process this data further:
Mapper - 2
public class UserVectorToCooccurenceMapper extends
Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable> {
@Override
public void map(VarLongWritable userID,
VectorWritable userVector,
Context context)
throws IOException, InterruptedException {
Iterator<Vector.Element> it = userVector.get().iterateNonZero();
while (it.hasNext()) {
int index1 = it.next().index();
Iterator<Vector.Element> it2 = userVector.get().iterateNonZero();
while (it2.hasNext()) {
int index2 = it2.next().index();
context.write(new IntWritable(index1),
new IntWritable(index2));
}
}
}
}
Reducer - 2
public class UserVectorToCooccurenceReducer extends
Reducer<IntWritable, IntWritable, IntWritable, VectorWritable> {
@Override
public void reduce(IntWritable itemIndex1,
Iterable<IntWritable> itemIndex2s,
Context context)
throws IOException, InterruptedException {
Vector cooccurrenceRow = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
for (IntWritable intWritable : itemIndex2s) {
int itemIndex2 = intWritable.get();
cooccurrenceRow.set(itemIndex2, cooccurrenceRow.get(itemIndex2) + 1.0);
}
context.write(itemIndex1, new VectorWritable(cooccurrenceRow));
}
}
This is the driver I am using:
public final class RecommenderJob extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job_preferenceValues = new Job (getConf());
job_preferenceValues.setJarByClass(RecommenderJob.class);
job_preferenceValues.setJobName("job_preferenceValues");
job_preferenceValues.setInputFormatClass(TextInputFormat.class);
job_preferenceValues.setOutputFormatClass(SequenceFileOutputFormat.class);
FileInputFormat.setInputPaths(job_preferenceValues, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(job_preferenceValues, new Path(args[1]));
job_preferenceValues.setMapOutputKeyClass(VarLongWritable.class);
job_preferenceValues.setMapOutputValueClass(VarLongWritable.class);
job_preferenceValues.setOutputKeyClass(VarLongWritable.class);
job_preferenceValues.setOutputValueClass(VectorWritable.class);
job_preferenceValues.setMapperClass(WikipediaToItemPrefsMapper.class);
job_preferenceValues.setReducerClass(WikipediaToUserVectorReducer.class);
job_preferenceValues.waitForCompletion(true);
Job job_cooccurence = new Job (getConf());
job_cooccurence.setJarByClass(RecommenderJob.class);
job_cooccurence.setJobName("job_cooccurence");
job_cooccurence.setInputFormatClass(SequenceFileInputFormat.class);
job_cooccurence.setOutputFormatClass(TextOutputFormat.class);
SequenceFileInputFormat.setInputPaths(job_cooccurence, new Path(args[1]));
FileOutputFormat.setOutputPath(job_cooccurence, new Path(args[2]));
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
job_cooccurence.setOutputKeyClass(IntWritable.class);
job_cooccurence.setOutputValueClass(VectorWritable.class);
job_cooccurence.setMapperClass(UserVectorToCooccurenceMapper.class);
job_cooccurence.setReducerClass(UserVectorToCooccurenceReducer.class);
job_cooccurence.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
ToolRunner.run(new Configuration(), new RecommenderJob(), args);
}
}
The error that I get is:
java.io.IOException: Type mismatch in key from map: expected org.apache.mahout.math.VarLongWritable, received org.apache.hadoop.io.IntWritable
In the course of Googling for a fix, I found that my issue is similar to this question. But the difference is that I am already using SequenceFileInputFormat and SequenceFileOutputFormat, correctly I believe. I also see that org.apache.mahout.cf.taste.hadoop.item.RecommenderJob does more or less something similar. From my understanding, and per the Yahoo! tutorial:
SequenceFileOutputFormat rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and present the data to the next Mapper in the same manner as it was emitted by the previous Reducer.
What am I doing wrong? I would really appreciate some pointers; I spent the day trying to fix this and got nowhere :(
Your second mapper has the following signature:
public class UserVectorToCooccurenceMapper extends
Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable>
But you define the following in your driver code:
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
The reducer is expecting <IntWritable, IntWritable> as input, so you should just amend your driver code to:
job_cooccurence.setMapOutputKeyClass(IntWritable.class);
job_cooccurence.setMapOutputValueClass(IntWritable.class);
