Output not produced in Hadoop

I am trying to find the average for each widget using MapReduce. The job completes successfully, but no output is produced when I run hadoop fs -cat user/vagrant/example-1/part-r-00000
public static class MaxWidgetReducer
        extends Reducer<Text, FloatWritable, FloatWritable, NullWritable> {
    public void reduce(Text k, Iterable<FloatWritable> vals, Context context)
            throws IOException, InterruptedException {
        Float totalPrice = 0.0f;
        Float avgPrice = 0.0f;
        Integer count = null;
        for (FloatWritable w : vals) {
            totalPrice = (totalPrice + w.get());
            count++;
        }
        avgPrice = (totalPrice)/(count);
        context.write(new FloatWritable(avgPrice), NullWritable.get());
    }
}

I strongly suggest that you use a try/catch block in both the mapper and the reducer, so you can tell whether an exception is being thrown while processing your data. Also try casting w.get() to float before adding that value to the total price.
Cheers.
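It may also be worth checking the counter in the reducer above: `Integer count = null` followed by `count++` unboxes a null reference, which throws a NullPointerException on the first value. Whether that is what happens in this particular job depends on whether the reducer is ever invoked, so treat the following as a minimal plain-Java sketch of the averaging logic rather than a confirmed fix. The names `AverageSketch`, `average`, and `nullCounterThrows` are hypothetical, and the Hadoop types are left out so the snippet runs on its own.

```java
import java.util.Arrays;
import java.util.List;

public class AverageSketch {
    // Averages the values with primitive accumulators; returns 0 for empty input.
    public static float average(Iterable<Float> vals) {
        float total = 0.0f;
        int count = 0; // primitive counter, starts at zero instead of null
        for (float v : vals) {
            total += v;
            count++;
        }
        return count == 0 ? 0.0f : total / count;
    }

    // Reproduces the failure mode: incrementing a null Integer unboxes null.
    public static boolean nullCounterThrows() {
        Integer count = null;
        try {
            count++;
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Float> prices = Arrays.asList(10.0f, 20.0f, 30.0f);
        System.out.println(average(prices));
        System.out.println(nullCounterThrows());
    }
}
```

In the real reducer the same change would amount to declaring the counter as `int count = 0;`.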

Related

Reducer receives identical value multiple times instead of expected input

While writing a map-reduce job in my local hadoop environment I ran into the problem that the Reducer did not receive the values I expected. I abstracted the problem down to the following:
I create an arbitrary input file with 10 lines to have the map method executed 10 times. In the mapper I create an invocation count and write this count as value to the output with 0 as key if the value is even and 1 as key if the value is odd, i.e. the following (key, value) pairs:
(1,1), (0,2), (1,3), (0,4), (1,5), etc.
I would expect to receive two calls to the Reducer with
0 > [2,4,6,8,10]
1 > [1,3,5,7,9]
but I get two calls with
0 > [2,2,2,2,2]
1 > [1,1,1,1,1]
instead. It seems I receive the first value that was written in the mapper, repeated once for each occurrence of the key (if I reverse the counter, I receive values 10 and 9 instead of 2 and 1). From my understanding this is not the expected behaviour (?), but I cannot figure out what I am doing wrong.
I use the following Mapper and reducer:
public class TestMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    int count = 0;
    @Override
    protected void map(LongWritable keyUnused, Text valueUnused, Context context) throws IOException, InterruptedException {
        count += 1;
        context.write(new IntWritable(count % 2), new IntWritable(count));
        System.err.println((count % 2) + "|" + count);
    }
}
public class TestReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> valueItr, Context context) throws IOException, InterruptedException {
        List<IntWritable> values = Lists.newArrayList(valueItr);
        System.err.println(key + "|" + values);
    }
}
I run the hadoop job with a local test runner as described for example in the book "Hadoop: The Definitive Guide" (O'Reilly):
public class TestDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s [generic options] <input> <output>\n",
                    getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        Job jobConf = Job.getInstance(getConf());
        jobConf.setJarByClass(getClass());
        jobConf.setJobName("TestJob");
        jobConf.setMapperClass(TestMapper.class);
        jobConf.setReducerClass(TestReducer.class);
        FileInputFormat.addInputPath(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
        jobConf.setOutputKeyClass(IntWritable.class);
        jobConf.setOutputValueClass(IntWritable.class);
        return jobConf.waitForCompletion(true) ? 0 : 1;
    }
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TestDriver(), args));
    }
}
packaged in a jar and run with 'hadoop jar test.jar infile.txt /tmp/testout'.
Hadoop is reusing the value object while streaming the reducer values.
So in order to capture all of your different values, you need to copy:
@Override
protected void reduce(IntWritable key, Iterable<IntWritable> valueItr, Context context) throws IOException, InterruptedException {
    List<IntWritable> values = Lists.newArrayList();
    for (IntWritable writable : valueItr) {
        // Copy the value: the iterator hands back the same IntWritable instance each time.
        values.add(new IntWritable(writable.get()));
    }
    System.err.println(key + "|" + values);
}
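The effect of object reuse can be reproduced without Hadoop. The sketch below is a plain-Java simulation under stated assumptions: `Box` is a hypothetical stand-in for a mutable Writable such as IntWritable, and `reusing` is a hypothetical iterable that overwrites one shared instance on every step. Collecting references yields the same value repeated, while copying preserves each value. In this simulation the repeated value is the last one written; which value shows up in a real job depends on when the list is inspected and how the framework feeds the iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReuseSketch {
    // Hypothetical stand-in for a mutable Writable such as IntWritable.
    public static final class Box {
        int value;
        Box(int value) { this.value = value; }
    }

    // Yields the same Box instance on every step, overwriting its contents.
    public static Iterable<Box> reusing(int... data) {
        Box shared = new Box(0);
        return () -> new Iterator<Box>() {
            int i = 0;
            public boolean hasNext() { return i < data.length; }
            public Box next() { shared.value = data[i++]; return shared; }
        };
    }

    // Stores references only: every list entry points at the one shared Box.
    public static List<Integer> collectReferences(Iterable<Box> it) {
        List<Box> refs = new ArrayList<>();
        for (Box b : it) refs.add(b);
        List<Integer> out = new ArrayList<>();
        for (Box b : refs) out.add(b.value);
        return out;
    }

    // Copies each value into a fresh Box before storing it.
    public static List<Integer> collectCopies(Iterable<Box> it) {
        List<Box> copies = new ArrayList<>();
        for (Box b : it) copies.add(new Box(b.value));
        List<Integer> out = new ArrayList<>();
        for (Box b : copies) out.add(b.value);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectReferences(reusing(2, 4, 6)));
        System.out.println(collectCopies(reusing(2, 4, 6)));
    }
}
```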

Can I get a Partition number of Hadoop?

I am a Hadoop newbie.
I want to include the partition number in the output file.
First, I wrote a custom partitioner.
public static class MyPartitioner extends Partitioner<Text, LongWritable> {
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        int numOfChars = key.toString().length();
        return numOfChars % numReduceTasks;
    }
}
It works, but I want to show the partition numbers 'visually' in the Reducer output.
How can I get the partition number?
Below is my reducer source.
public static class MyReducer extends Reducer<Text, LongWritable, Text, Text> {
    private Text textList = new Text();
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        String list = new String();
        for (LongWritable value : values) {
            list = new String(list + "\t" + value.toString());
        }
        textList.set(list);
        context.write(key, textList);
    }
}
I want to append the partition number to each entry in 'list'. It will be '0' or '1'.
list = new String(list + "\t" + value.toString() + "\t" + ??);
It would be great if someone helps me.
+
Thanks to the answer, I tried a solution, but it didn't work and I think I did something wrong.
Below is the modified MyPartitioner.
public static class MyPartitioner extends Partitioner<Text, LongWritable> {
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        int numOfChars = key.toString().length();
        return numOfChars % numReduceTasks;
        private int bring_num = 0;
        public void configure(JobConf job) {
            bring_num = jobConf.getInt(numOfChars & numReduceTasks);
        }
    }
}
Add the below code to the Reducer class to get the partition number in a class variable which can be later used in the reducer method.
String partition;
protected void setup(Context context) throws IOException,
        InterruptedException {
    Configuration conf = context.getConfiguration();
    partition = conf.get("mapred.task.partition");
}
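Because each reducer task handles exactly one partition, the value read in setup() should match what the partitioner computed for every key that task sees. A minimal plain-Java sketch of that arithmetic, and of tagging each value with the partition as the question asks, might look like this. The class and method names (`PartitionSketch`, `partitionFor`, `tagValues`) are hypothetical, and plain String and long values stand in for Text and LongWritable.

```java
public class PartitionSketch {
    // Same rule as MyPartitioner.getPartition: key length modulo reducer count.
    public static int partitionFor(String key, int numReduceTasks) {
        return key.length() % numReduceTasks;
    }

    // Builds the tab-separated list from the question, tagging each value
    // with the partition number.
    public static String tagValues(long[] values, int partition) {
        StringBuilder list = new StringBuilder();
        for (long value : values) {
            list.append('\t').append(value).append('\t').append(partition);
        }
        return list.toString();
    }

    public static void main(String[] args) {
        int p = partitionFor("apple", 2);
        System.out.println("apple" + tagValues(new long[] {3L, 7L}, p));
    }
}
```

Using a StringBuilder here also avoids allocating a new String on every loop iteration, which the original reducer does.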

Have two reducers with one map in an M/R program

I have a question. How can I have a MapReduce job with one mapper and two reducers, where both reducers take their input from the map output and each reducer has its own output?
Also, can a mapper have two or more inputs?
public static class dpred extends Reducer<Text, DoubleWritable, Text, DoubleWritable>
{
    public void reduce1(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException
    {
        double beta = 17.62;
        DoubleWritable result1 = new DoubleWritable();
        double mul = 1;
        double res = 1;
        for (DoubleWritable val : values) {
            // System.out.println(val.get());
            mul *= val.get();
        }
        res = beta*mul;
        result1.set(res);
        context.write(key, result1);
    }
    ///////////////////////////////////////////////////////////
    public void reduce2(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException
    {
        double landa = 243.12;
        double sum = 0;
        double res = 0;
        DoubleWritable result2 = new DoubleWritable();
        for (DoubleWritable val : values) {
            // System.out.println(val.get());
            landa += val.get();
        }
        // System.out.println(sum);
        result2.set(landa);
        context.write(key, result2);
    }
}
If the operation is this simple, you could consider doing two context.write() calls in one reduce function (with MultipleOutputs it's possible to write them to different files if you want).
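As a rough illustration of that suggestion, both results can be computed in a single pass over the values. This matters because the values iterable in a reducer is generally meant to be consumed once. The sketch below is plain Java with a hypothetical class and method name (`TwoOutputsSketch`, `reduceBoth`); the constants 17.62 and 243.12 come from the question, and the returned array stands in for the two context.write() calls.

```java
public class TwoOutputsSketch {
    static final double BETA = 17.62;   // constant from reduce1 in the question
    static final double LANDA = 243.12; // constant from reduce2 in the question

    // Computes both results in one pass over the values.
    // Index 0: BETA times the product of the values (reduce1).
    // Index 1: LANDA plus the sum of the values (reduce2).
    public static double[] reduceBoth(double[] values) {
        double product = 1.0;
        double sum = 0.0;
        for (double v : values) {
            product *= v;
            sum += v;
        }
        return new double[] { BETA * product, LANDA + sum };
    }

    public static void main(String[] args) {
        double[] results = reduceBoth(new double[] {2.0, 3.0});
        // In a real reducer, each result would go to its own context.write()
        // call, or to a named output via MultipleOutputs.
        System.out.println(results[0]);
        System.out.println(results[1]);
    }
}
```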

How to pass variable between two map reduce jobs

I have chained two Map reduce jobs. The Job1 will have only one reducer and I am computing a float value. I want to use this value in my reducer of Job2. This is my main method setup.
public static String GlobalVriable;
public static void main(String[] args) throws Exception {
    int runs = 0;
    for (; runs < 10; runs++) {
        String inputPath = "part-r-000" + nf.format(runs);
        String outputPath = "part-r-000" + nf.format(runs + 1);
        MyProgram.MR1(inputPath);
        MyProgram.MR2(inputPath, outputPath);
    }
}
public static void MR1(String inputPath)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    conf.set("var1","");
    Job job = new Job(conf, "This is job1");
    job.setJarByClass(MyProgram.class);
    job.setMapperClass(MyMapper1.class);
    job.setReducerClass(MyReduce1.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    FileInputFormat.addInputPath(job, new Path(inputPath));
    job.waitForCompletion(true);
    GlobalVriable = conf.get("var1"); // I am getting NULL here
}
public static void MR2(String inputPath, String outputPath)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "This is job2");
    ...
}
public static class MyReduce1 extends
        Reducer<Text, FloatWritable, Text, FloatWritable> {
    public void reduce(Text key, Iterable<FloatWritable> values, Context context)
            throws IOException, InterruptedException {
        float s = 0;
        for (FloatWritable val : values) {
            s += val.get();
        }
        String sum = Float.toString(s);
        context.getConfiguration().set("var1", sum);
    }
}
As you can see, I need to iterate the entire program multiple times. My Job1 computes a single number from the input. Since it is just a single number and there are a lot of iterations, I don't want to write it to HDFS and read it back. Is there a way to share the value computed in MyReduce1 and use it in the reducer of Job2?
UPDATE: I have tried passing the value using conf.set & conf.get. The value is not being passed.
Here's how to pass back a float value via a counter ...
First, in the first reducer, transform the float value into a long by multiplying by 1000 (to maintain 3 digits of precision, for example) and putting the result into a counter:
public void cleanup(Context context) {
    long result = (long) (floatValue * 1000);
    context.getCounter("Result","Result").increment(result);
}
In the driver class, retrieve the long value and transform it back to a float:
public static void MR1(String inputPath)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "This is job1");
    job.setJarByClass(MyProgram.class);
    job.setMapperClass(MyMapper1.class);
    job.setReducerClass(MyReduce1.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    FileInputFormat.addInputPath(job, new Path(inputPath));
    job.waitForCompletion(true);
    long result = job.getCounters().findCounter("Result","Result").getValue();
    float value = ((float)result) / 1000;
}
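The fixed-point round trip used by the counter approach can be checked in isolation. The sketch below is plain Java with hypothetical names (`CounterEncodingSketch`, `encode`, `decode`); it only mirrors the two conversions shown above and does not touch the Hadoop counter API. Note that the cast to long truncates toward zero, so digits beyond the third decimal place are dropped rather than rounded.

```java
public class CounterEncodingSketch {
    static final int SCALE = 1000; // keeps three decimal digits, as in the answer

    // Reducer side: scale the float and truncate it to a long counter value.
    public static long encode(float value) {
        return (long) (value * SCALE);
    }

    // Driver side: turn the counter value back into a float.
    public static float decode(long counterValue) {
        return ((float) counterValue) / SCALE;
    }

    public static void main(String[] args) {
        long stored = encode(3.14159f);
        System.out.println(stored);
        System.out.println(decode(stored));
    }
}
```

Since the counter is only incremented once, in cleanup(), this relies on Job1 running a single reducer, as the question states.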
You could use ZooKeeper for this. It's great for any inter-job coordination or message passing like this.
Can't you just change the return type of MR1 to int (or whatever data type is appropriate) and return the number you computed:
int myNumber = MyProgram.MR1(inputPath);
Then add a parameter to MR2 and call it with your computed number:
MyProgram.MR2(inputPath, outputPath, myNumber);

Type mismatch in key from map, using SequenceFileInputFormat correctly

I am trying to run a recommender example from chapter 6 (listings 6.1 to 6.4) of the ebook Mahout in Action. There are two mapper/reducer pairs. Here is the code:
Mapper - 1
public class WikipediaToItemPrefsMapper extends
        Mapper<LongWritable,Text,VarLongWritable,VarLongWritable> {
    private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
    @Override
    public void map(LongWritable key,
                    Text value,
                    Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Matcher m = NUMBERS.matcher(line);
        m.find();
        VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
        VarLongWritable itemID = new VarLongWritable();
        while (m.find()) {
            itemID.set(Long.parseLong(m.group()));
            context.write(userID, itemID);
        }
    }
}
Reducer - 1
public class WikipediaToUserVectorReducer extends
        Reducer<VarLongWritable,VarLongWritable,VarLongWritable,VectorWritable> {
    @Override
    public void reduce(VarLongWritable userID,
                       Iterable<VarLongWritable> itemPrefs,
                       Context context)
            throws IOException, InterruptedException {
        Vector userVector = new RandomAccessSparseVector(
                Integer.MAX_VALUE, 100);
        for (VarLongWritable itemPref : itemPrefs) {
            userVector.set((int)itemPref.get(), 1.0f);
        }
        //LongWritable userID_lw = new LongWritable(userID.get());
        context.write(userID, new VectorWritable(userVector));
        //context.write(userID_lw, new VectorWritable(userVector));
    }
}
The reducer outputs a userID and a userVector, which looks like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}, provided FileInputFormat and TextInputFormat are used in the driver.
I want to use another pair of mapper-reducer to process this data further:
Mapper - 2
public class UserVectorToCooccurenceMapper extends
        Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable> {
    @Override
    public void map(VarLongWritable userID,
                    VectorWritable userVector,
                    Context context)
            throws IOException, InterruptedException {
        Iterator<Vector.Element> it = userVector.get().iterateNonZero();
        while (it.hasNext()) {
            int index1 = it.next().index();
            Iterator<Vector.Element> it2 = userVector.get().iterateNonZero();
            while (it2.hasNext()) {
                int index2 = it2.next().index();
                context.write(new IntWritable(index1),
                        new IntWritable(index2));
            }
        }
    }
}
Reducer - 2
public class UserVectorToCooccurenceReducer extends
        Reducer<IntWritable,IntWritable,IntWritable,VectorWritable> {
    @Override
    public void reduce(IntWritable itemIndex1,
                       Iterable<IntWritable> itemIndex2s,
                       Context context)
            throws IOException, InterruptedException {
        Vector cooccurrenceRow = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
        for (IntWritable intWritable : itemIndex2s) {
            int itemIndex2 = intWritable.get();
            cooccurrenceRow.set(itemIndex2, cooccurrenceRow.get(itemIndex2) + 1.0);
        }
        context.write(itemIndex1, new VectorWritable(cooccurrenceRow));
    }
}
This is the driver I am using:
public final class RecommenderJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job_preferenceValues = new Job (getConf());
        job_preferenceValues.setJarByClass(RecommenderJob.class);
        job_preferenceValues.setJobName("job_preferenceValues");
        job_preferenceValues.setInputFormatClass(TextInputFormat.class);
        job_preferenceValues.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(job_preferenceValues, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job_preferenceValues, new Path(args[1]));
        job_preferenceValues.setMapOutputKeyClass(VarLongWritable.class);
        job_preferenceValues.setMapOutputValueClass(VarLongWritable.class);
        job_preferenceValues.setOutputKeyClass(VarLongWritable.class);
        job_preferenceValues.setOutputValueClass(VectorWritable.class);
        job_preferenceValues.setMapperClass(WikipediaToItemPrefsMapper.class);
        job_preferenceValues.setReducerClass(WikipediaToUserVectorReducer.class);
        job_preferenceValues.waitForCompletion(true);
        Job job_cooccurence = new Job (getConf());
        job_cooccurence.setJarByClass(RecommenderJob.class);
        job_cooccurence.setJobName("job_cooccurence");
        job_cooccurence.setInputFormatClass(SequenceFileInputFormat.class);
        job_cooccurence.setOutputFormatClass(TextOutputFormat.class);
        SequenceFileInputFormat.setInputPaths(job_cooccurence, new Path(args[1]));
        FileOutputFormat.setOutputPath(job_cooccurence, new Path(args[2]));
        job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
        job_cooccurence.setMapOutputValueClass(VectorWritable.class);
        job_cooccurence.setOutputKeyClass(IntWritable.class);
        job_cooccurence.setOutputValueClass(VectorWritable.class);
        job_cooccurence.setMapperClass(UserVectorToCooccurenceMapper.class);
        job_cooccurence.setReducerClass(UserVectorToCooccurenceReducer.class);
        job_cooccurence.waitForCompletion(true);
        return 0;
    }
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new RecommenderJob(), args);
    }
}
The error that I get is:
java.io.IOException: Type mismatch in key from map: expected org.apache.mahout.math.VarLongWritable, received org.apache.hadoop.io.IntWritable
While Googling for a fix, I found that my issue is similar to this question. The difference is that I am already using SequenceFileInputFormat and SequenceFileOutputFormat, I believe correctly. I also see that org.apache.mahout.cf.taste.hadoop.item.RecommenderJob does more or less the same thing. My understanding, based on the Yahoo tutorial, is:
SequenceFileOutputFormat rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and presents the data to the next Mapper in the same manner as it was emitted by the previous Reducer.
What am I doing wrong? Will really appreciate some pointers from someone.. I spent the day trying to fix this and got nowhere :(
Your second mapper has the following signature:
public class UserVectorToCooccurenceMapper extends
Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable>
But you define the following in your driver code:
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
The reducer is expecting <IntWritable, IntWritable> as input, so you should just amend your driver code to:
job_cooccurence.setMapOutputKeyClass(IntWritable.class);
job_cooccurence.setMapOutputValueClass(IntWritable.class);
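To see why the mismatch surfaces as a runtime error, it helps to picture the check the framework performs when the mapper emits a key: the runtime class of the key is compared with the class configured through setMapOutputKeyClass. The sketch below is a simplified plain-Java stand-in for that comparison, not Hadoop's actual code. `TypeCheckSketch`, `collect`, and `accepts` are hypothetical names, and Long and Integer stand in for VarLongWritable and IntWritable.

```java
import java.io.IOException;

public class TypeCheckSketch {
    private final Class<?> mapOutputKeyClass;

    public TypeCheckSketch(Class<?> mapOutputKeyClass) {
        this.mapOutputKeyClass = mapOutputKeyClass;
    }

    // Rejects any key whose runtime class differs from the configured one.
    public void collect(Object key) throws IOException {
        if (key.getClass() != mapOutputKeyClass) {
            throw new IOException("Type mismatch in key from map: expected "
                    + mapOutputKeyClass.getName()
                    + ", received " + key.getClass().getName());
        }
    }

    // Convenience wrapper: true if the key passes the check.
    public static boolean accepts(Class<?> configured, Object key) {
        try {
            new TypeCheckSketch(configured).collect(key);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Configured for Long keys, but the "mapper" emits an Integer.
        System.out.println(accepts(Long.class, Integer.valueOf(1)));
        // Configuration and emitted type agree.
        System.out.println(accepts(Integer.class, Integer.valueOf(1)));
    }
}
```

The configured map output classes describe what the mapper writes, not what it reads, which is why they must match the last two type parameters of the Mapper.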

Resources