Adding multiple Reducers using ChainReducer throws an Exception - hadoop

I have already read previous posts related to this but didn't find anything meaningful.
My use case is:
Aggregate impression and click data.
Separate clicked and non-clicked data into different files.
I have written a mapper and reducer for this, but the reducer's output contains both clicked and non-clicked records, and everything goes into the same file. I want to separate the data so that clicked records end up in one file and non-clicked records in another.
Error :
java.lang.IllegalStateException: Reducer has been already set
at org.apache.hadoop.mapreduce.lib.chain.Chain.checkReducerAlreadySet(Chain.java:662)
Code
Configuration conf = new Configuration();
conf.set("mapreduce.output.fileoutputformat.compress", "true");
conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
Job job = Job.getInstance(conf, "IMPRESSION_CLICK_COMBINE_JOB");
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setReducerClass(ImpressionClickReducer.class);
FileInputFormat.setInputDirRecursive(job, true);
// FileInputFormat.addInputPath(job, new Path(args[0]));
// job.setMapperClass(ImpressionMapper.class);
Path p = new Path(args[2]);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(p)) {
    fs.delete(p, true);
}
/**
* Here directory of impressions will be present
*/
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, ImpressionMapper.class);
/**
* Here directory of clicks will be present
*/
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, ClickMapper.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.setNumReduceTasks(10);
job.setPartitionerClass(TrackerPartitioner.class);
ChainReducer.setReducer(job, ImpressionClickReducer.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainReducer.addMapper(job, ImpressionClickMapper.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
//Below mentioned line is giving Error
//ChainReducer.setReducer(job, ImpressionAndClickReducer.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
job.waitForCompletion(true);

ChainReducer is used to chain Map tasks after the Reducer; you can only call setReducer() once (see checkReducerAlreadySet in Chain.java, which appears in your stack trace).
From the Javadocs:
The ChainReducer class allows to chain multiple Mapper classes after a
Reducer within the Reducer task.
Using the ChainMapper and the ChainReducer classes it is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. An immediate benefit of this pattern is a dramatic reduction in disk IO.
So the idea is you set a single Reducer and then chain Map operations after that.
It sounds like you actually want to use MultipleOutputs. The Hadoop Javadocs provide an example of how to use it. With this you can define more than one output, and it's up to you which output each key/value pair gets written to.
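To make that concrete, here is a minimal sketch of a reducer that splits clicked and non-clicked records with MultipleOutputs. The named outputs `clicked`/`nonclicked`, the `CLICK` value prefix, and the class name are assumptions for illustration, not the asker's actual schema.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ImpressionClickSplitReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    // Routing rule kept as a pure helper so it is easy to test in isolation.
    static String outputNameFor(boolean clicked) {
        return clicked ? "clicked" : "nonclicked";
    }

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Assumption: clicked records carry a "CLICK" prefix in the value.
            boolean clicked = value.toString().startsWith("CLICK");
            mos.write(outputNameFor(clicked), key, value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush both named outputs
    }
}
```

On the driver side each name must be registered before submission, e.g. `MultipleOutputs.addNamedOutput(job, "clicked", TextOutputFormat.class, Text.class, Text.class);` and likewise for `"nonclicked"`.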

Related

AvroMultipleOutputs creates empty file no error in logs

Trying to write output into two different named output files using
AvroMultipleOutputs, but getting empty files and no errors in the logs. The counters show the correct number of records. This also
works fine when writing to a single file.
Avro version 1.7.1
Code
Job job = new Job(config, "AVRO_MULTITEST");
job.setJarByClass(AvroMultiWriter.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
job.setMapperClass(AvroMultiWriteMapper.class);
job.setNumReduceTasks(0);
AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setMapOutputValueSchema(job, Schema.create(Schema.Type.STRING));
AvroMultipleOutputs.setCountersEnabled(job, true);
AvroMultipleOutputs.addNamedOutput(job,"F1",
AvroKeyValueOutputFormat.class, Schema.create
(Schema.Type.STRING),Schema.create(Schema.Type.STRING));
AvroMultipleOutputs.addNamedOutput(job,"F2",
AvroKeyValueOutputFormat.class, Schema.create
(Schema.Type.STRING),Schema.create(Schema.Type.STRING));
LazyOutputFormat.setOutputFormatClass(job, AvroKeyValueOutputFormat.class);
Job Counter
mapred.JobClient: org.apache.avro.mapreduce.AvroMultipleOutputs
mapred.JobClient: F1=3
mapred.JobClient: F2=3
Have you tried calling multipleOutputs.close() in the cleanup() method of the mapper class? If the AvroMultipleOutputs instance is never closed, its buffered records are never flushed, which would explain correct counters but empty files.
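In the new (mapreduce) API that would look roughly like the sketch below. The routing rule is invented for illustration; "F1"/"F2" are the names registered in the driver above.

```java
import java.io.IOException;

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AvroMultiWriteMapper
        extends Mapper<LongWritable, Text, AvroKey<CharSequence>, AvroValue<CharSequence>> {

    private AvroMultipleOutputs amos;

    // Hypothetical routing rule, pure so it can be tested on its own.
    static String routeFor(String record) {
        return record.length() % 2 == 0 ? "F1" : "F2";
    }

    @Override
    protected void setup(Context context) {
        amos = new AvroMultipleOutputs(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        amos.write(routeFor(record),
                new AvroKey<CharSequence>(record),
                new AvroValue<CharSequence>(record));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Without this, buffered records are never flushed: the counters
        // look right but the output files stay empty.
        amos.close();
    }
}
```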

MultipleOutputs with AWS EMR S3

I have a job where I write the output using multiple outputs. The program produces output correctly on a normal Hadoop cluster.
But when I use an AWS cluster and give s3n paths for the multiple outputs as shown below, I do not get any output at the specified s3n paths. Can anyone help me with this?
Configuration config3 = new Configuration();
JobConf conf3 = new JobConf(config3, t1debugJob.class);
conf3.setJobName("PJob3.7 scalability test correct 60r");
conf3.setOutputKeyClass(Text.class);
conf3.setOutputValueClass(Text.class);
conf3.setMapOutputKeyClass(StockKey.class);
conf3.setMapOutputValueClass(Text.class);
conf3.setPartitionerClass(CustomPartitionerStage3.class);
conf3.setOutputValueGroupingComparator(StockKeyGroupingComparator.class);
conf3.setOutputKeyComparatorClass(StockKeySortComparator.class);
conf3.setReducerClass(dt1Amazon.class);
//conf3.setNumMapTasks(10);
conf3.setNumReduceTasks(30);
conf3.setInputFormat(TextInputFormat.class);
conf3.setOutputFormat(TextOutputFormat.class);
MultipleInputs.addInputPath(conf3, new Path(other_args.get(2)),TextInputFormat.class, PMap3aPos.class);
MultipleInputs.addInputPath(conf3, new Path(other_args.get(1)),TextInputFormat.class, PMap3b.class);
MultipleOutputs.addNamedOutput(conf3,"s3n://gs3test/output/MIDPairspos/pairfile", TextOutputFormat.class, LongWritable.class, Text.class);
MultipleOutputs.addNamedOutput(conf3,"s3n://gs3test/output/MIDpos/idfile", TextOutputFormat.class, LongWritable.class, Text.class);
FileOutputFormat.setOutputPath(conf3, new Path(other_args.get(3)));
JobClient.runJob(conf3);
Thanks!
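One thing worth checking, hedged since the full job isn't shown: `MultipleOutputs.addNamedOutput` takes a plain alphanumeric identifier, not a path, so an `s3n://` URI is not a valid named-output name; the S3 destination belongs in the job's output path. A driver sketch of that split (the output names "pairfile"/"idfile" are hypothetical, and the validation helper is an illustrative reimplementation of the check, not the library's own method):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class MultipleOutputsDriverSketch {

    // Illustrative reimplementation of the name rule addNamedOutput
    // enforces: letters and digits only, so URIs are rejected.
    static boolean isValidNamedOutput(String name) {
        return name.matches("[A-Za-z0-9]+");
    }

    public static void main(String[] args) {
        JobConf conf3 = new JobConf(MultipleOutputsDriverSketch.class);
        // Named outputs are identifiers, not locations:
        MultipleOutputs.addNamedOutput(conf3, "pairfile",
                TextOutputFormat.class, LongWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(conf3, "idfile",
                TextOutputFormat.class, LongWritable.class, Text.class);
        // The S3 destination goes into the job's output path instead:
        FileOutputFormat.setOutputPath(conf3, new Path("s3n://gs3test/output/"));
    }
}
```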

chaining mapreduce jobs in hadoop

I'm new to Hadoop. My task is to find the employee with the maximum salary.
In my first map class I split the words and emit the key and value like this:
outputcollector.collect("salary", salary);
In my reducer I find the maximum salary and set the output like this:
outputcollector.collect("max salary", maxsalary);
Now I want to use the output of this reducer in another mapper.
I have constructed a chain like this:
JobConf mapAConf = new JobConf(false);
ChainMapper.addMapper(conf, mymap.class, LongWritable.class, Text.class, Text.class, IntWritable.class, true, mapAConf);
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(conf, myreduce.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, reduceConf);
JobConf mapCConf = new JobConf(false);
ChainReducer.addMapper(conf, LastMapper.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, mapCConf);
But the reducer is not getting executed. Any help on this?
You need to create and set the same JobConf for both the mapper and the reducer of a single Map/Reduce job.
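A minimal sketch of that arrangement, with the old-API identity classes standing in for mymap, myreduce and LastMapper: one job-wide JobConf is passed to every addMapper/setReducer call, with fresh per-stage confs, and the whole chain runs as one job.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainJobSketch {

    // Build the shared conf: every stage is registered against the SAME
    // job-wide JobConf; only the per-stage confs are created fresh.
    static JobConf buildChainConf() {
        JobConf conf = new JobConf(ChainJobSketch.class);
        conf.setJobName("chain-sketch");

        ChainMapper.addMapper(conf, IdentityMapper.class,
                LongWritable.class, Text.class, LongWritable.class, Text.class,
                true, new JobConf(false));
        ChainReducer.setReducer(conf, IdentityReducer.class,
                LongWritable.class, Text.class, LongWritable.class, Text.class,
                true, new JobConf(false));
        ChainReducer.addMapper(conf, IdentityMapper.class,
                LongWritable.class, Text.class, LongWritable.class, Text.class,
                true, new JobConf(false));
        return conf;
    }

    public static void main(String[] args) throws Exception {
        // Input/output paths omitted here; the chain runs as ONE job.
        JobClient.runJob(buildChainConf());
    }
}
```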

hadoop mapreduce job on Cassandra result is wrong

I have a 7-node Cassandra (1.1.1) and Hadoop (1.03) cluster (a TaskTracker is installed on every Cassandra node),
and my column family uses the wide-row pattern: one row contains about 200k columns (max about 300k).
My problem is that when we use Hadoop to run analytic jobs (counting the number of occurrences of a word), the result I receive is wrong (much lower than I expected for the test records).
There is also something strange when monitoring the job tracker: the map progress indicator is wrong (in my image below), and the number of "Map input records" is not the same when I rerun the job on the same data.
Here is my job initialization code:
Job job = new Job(conf);
job.setJobName(this.jobname);
job.setJarByClass(BannerCount.class);
job.setMapperClass(BannerViewMapper.class);
job.setReducerClass(BannerClickReducer.class);
FileSystem fs = FileSystem.get(conf);
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "192.168.23.114,192.168.23.115,192.168.23.116,192.168.23.117,192.168.23.121,192.168.23.122,192.168.23.123");
ConfigHelper.setInputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);
ConfigHelper.setRangeBatchSize(job.getConfiguration(), 500);
SlicePredicate predicate = new SlicePredicate();
SliceRange sliceRange = new SliceRange();
sliceRange.setStart(ByteBufferUtil.EMPTY_BYTE_BUFFER);
sliceRange.setFinish(ByteBufferUtil.EMPTY_BYTE_BUFFER);
sliceRange.setCount(200000);
predicate.setSlice_range(sliceRange);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
String outPathString = "BannerViewResultV3" + COLUMN_FAMILY;
if (fs.exists(new Path(outPathString)))
fs.delete(new Path(outPathString), true);
FileOutputFormat.setOutputPath(job, new Path(outPathString));
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setNumReduceTasks(28);
job.waitForCompletion(true);
return 1;

Migrating Data from HBase to FileSystem. (Writing Reducer output to Local or Hadoop filesystem)

My purpose is to migrate the data from HBase tables to flat (say, CSV-formatted) files.
I am using
TableMapReduceUtil.initTableMapperJob(tableName, scan,
GetCustomerAccountsMapper.class, Text.class, Result.class,
job);
for scanning through the HBase table, and TableMapper for the Mapper.
My challenge is in forcing the Reducer to dump the row values (which are normalized in a flattened format) to the local (or HDFS) file system.
My problem is that I can neither see the Reducer's logs nor find any files at the path I specified in the Reducer.
It's my 2nd or 3rd MR job and my first serious one. After trying hard for two days, I am still clueless about how to achieve my goal.
It would be great if someone could point me in the right direction.
Here is my reducer code -
public void reduce(Text key, Iterable<Result> rows, Context context)
throws IOException, InterruptedException {
FileSystem fs = LocalFileSystem.getLocal(new Configuration());
Path dir = new Path("/data/HBaseDataMigration/" + tableName+"_Reducer" + "/" + key.toString());
FSDataOutputStream fsOut = fs.create(dir,true);
for (Result row : rows) {
try {
String normRow = NormalizeHBaserow(
Bytes.toString(key.getBytes()), row, tableName);
fsOut.writeBytes(normRow);
//context.write(new Text(key.toString()), new Text(normRow));
} catch (BadHTableResultException ex) {
throw new IOException(ex);
}
}
fsOut.flush();
fsOut.close();
}
My Configuration for Reducer Output
Path out = new Path(args[0] + "/" + tableName+"Global");
FileOutputFormat.setOutputPath(job, out);
Thanks in Advance - Panks
Why not reduce into HDFS and, once finished, use the HDFS CLI to export the file:
hadoop fs -get /user/hadoop/file localfile
If you do want to handle it in the reduce phase, take a look at this article on OutputFormat on InfoQ.
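Following that advice, here is a sketch of a reducer that writes through the normal output path via context.write instead of opening files by hand (manual FSDataOutputStream writes from reduce() bypass the OutputFormat, so failed or speculative task attempts can clobber each other's files). CsvExportReducer and the normalize placeholder are invented for illustration; the asker's NormalizeHBaserow is not shown in the question.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CsvExportReducer extends Reducer<Text, Result, Text, Text> {

    // Placeholder for the asker's NormalizeHBaserow helper (not shown in
    // the question); here it just joins the key and table name.
    static String normalizeRow(String key, Result row, String tableName) {
        return key + "," + tableName;
    }

    @Override
    protected void reduce(Text key, Iterable<Result> rows, Context context)
            throws IOException, InterruptedException {
        for (Result row : rows) {
            // Let the job's OutputFormat (and FileOutputFormat.setOutputPath)
            // decide where the part files land.
            context.write(key, new Text(normalizeRow(key.toString(), row, "myTable")));
        }
    }
}
```

After the job finishes, `hadoop fs -get <outputPath>/part-r-00000 localfile.csv` pulls the result down, as suggested above.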
