chaining mapreduce jobs in hadoop - hadoop

I'm new to Hadoop. My task is to find an employee who has max salary.
In my firstmap class I split the word and put the key and value like this -
outputcollector.collect("salary",salary);
In my reducer I found the maximum salary and had set the output like this
outputcollector.collect("max salary",maxsalary);
Now I want to use the output from this reducer in another mapper.
I have constructed a chain like this
JobConf mapAConf = new JobConf(false);
ChainMapper.addMapper(conf, mymap.class, LongWritable.class, Text.class, Text.class, IntWritable.class, true, mapAConf);
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(conf, myreduce.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, reduceConf);
JobConf mapCConf = new JobConf(false);
ChainReducer.addMapper(conf, LastMapper.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, mapCConf);
But the reducer is not getting executed. Any help on this?

You need to create & set the same JobConf for both mapper & reducer for a single Map/Reduce Job.

Related

Adding multiple Reducer using ChainReducer throwing Exception

I have already read previous posts related to this but didn't get anything meaningful.
My use case is :
Aggregate Impression and click data
Separate Clicked and non-clicked data in different files.
I have written mapper and reducer for that but that reducer's output is data containing clicked & non-clicked and it is going in same file. I want to separate that data so clicked data should be present in one file and non clicked should be present in other file.
Error :
java.lang.IllegalStateException: Reducer has been already set
at org.apache.hadoop.mapreduce.lib.chain.Chain.checkReducerAlreadySet(Chain.java:662)
Code
Configuration conf = new Configuration();
conf.set("mapreduce.output.fileoutputformat.compress", "true");
conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
Job job = Job.getInstance(conf, "IMPRESSION_CLICK_COMBINE_JOB");
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setReducerClass(ImpressionClickReducer.class);
FileInputFormat.setInputDirRecursive(job, true);
// FileInputFormat.addInputPath(job, new Path(args[0]));
// job.setMapperClass(ImpressionMapper.class);
Path p = new Path(args[2]);
FileSystem fs = FileSystem.get(conf);
fs.exists(p);
fs.delete(p, true);
/**
* Here directory of impressions will be present
*/
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, ImpressionMapper.class);
/**
* Here directory of clicks will be present
*/
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, ClickMapper.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.setNumReduceTasks(10);
job.setPartitionerClass(TrackerPartitioner.class);
ChainReducer.setReducer(job, ImpressionClickReducer.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainReducer.addMapper(job, ImpressionClickMapper.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
//Below mentioned line is giving Error
//ChainReducer.setReducer(job, ImpressionAndClickReducer.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
job.waitForCompletion(true);
ChainReducer is used to chain Map tasks after the Reducer, you can only call setReducer() once (See the code here).
From the Javadocs:
The ChainReducer class allows to chain multiple Mapper classes after a
Reducer within the Reducer task.
Using the ChainMapper and the ChainReducer classes is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And immediate benefit of this pattern is a dramatic reduction in disk IO.
So the idea is you set a single Reducer and then chain Map operations after that.
It sounds like you actually want to use MultipleOutputs. The Hadoop Javadocs provide an example on how to use it. With this you can define more than one output and its down to you which output key/values get written to.

AvroMultipleOutputs creates empty file no error in logs

Trying to write output into two different named output file using
AvroMultipleOutputs but getting an empty file and no error in the logs. Counter shows correct number of records. Also this
works fine when writing to a single file.
Avro version 1.7.1
Code
Job job = new Job(config, "AVRO_MULTITEST");
job.setJarByClass(AvroMultiWriter.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
job.setMapperClass(AvroMultiWriteMapper.class);
job.setNumReduceTasks(0);
AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
AvroJob.setMapOutputValueSchema(job, Schema.create(Schema.Type.STRING));
AvroMultipleOutputs.setCountersEnabled(job, true);
AvroMultipleOutputs.addNamedOutput(job,"F1",
AvroKeyValueOutputFormat.class, Schema.create
(Schema.Type.STRING),Schema.create(Schema.Type.STRING));
AvroMultipleOutputs.addNamedOutput(job,"F2",
AvroKeyValueOutputFormat.class, Schema.create
(Schema.Type.STRING),Schema.create(Schema.Type.STRING));
LazyOutputFormat.setOutputFormatClass(job, AvroKeyValueOutputFormat.class);
Job Counter
mapred.JobClient: org.apache.avro.mapreduce.AvroMultipleOutputs
mapred.JobClient: F1=3
mapred.JobClient: F2=3
Have you tried calling multipleOutputs.close() in the close() method of the mapper class?

MultipleOutputs with AWS EMR S3

I have this job where I write the output using multiple outputs. This program gives output with normal Hadoop Cluster.
But when I use AWS cluster, and give s3n paths for the multiple outputs as shown below, I do not get any output in the s3n path specified. Can anyone help me with this?
Configuration config3 = new Configuration();
JobConf conf3 = new JobConf(config3, t1debugJob.class);
conf3.setJobName("PJob3.7 scalability test correct 60r");
conf3.setOutputKeyClass(Text.class);
conf3.setOutputValueClass(Text.class);
conf3.setMapOutputKeyClass(StockKey.class);
conf3.setMapOutputValueClass(Text.class);
conf3.setPartitionerClass(CustomPartitionerStage3.class);
conf3.setOutputValueGroupingComparator(StockKeyGroupingComparator.class);
conf3.setOutputKeyComparatorClass(StockKeySortComparator.class);
conf3.setReducerClass(dt1Amazon.class);
//conf3.setNumMapTasks(10);
conf3.setNumReduceTasks(30);
conf3.setInputFormat(TextInputFormat.class);
conf3.setOutputFormat(TextOutputFormat.class);
MultipleInputs.addInputPath(conf3, new Path(other_args.get(2)),TextInputFormat.class, PMap3aPos.class);
MultipleInputs.addInputPath(conf3, new Path(other_args.get(1)),TextInputFormat.class, PMap3b.class);
MultipleOutputs.addNamedOutput(conf3,"s3n://gs3test/output/MIDPairspos/pairfile", TextOutputFormat.class, LongWritable.class, Text.class);
MultipleOutputs.addNamedOutput(conf3,"s3n://gs3test/output/MIDpos/idfile", TextOutputFormat.class, LongWritable.class, Text.class);
FileOutputFormat.setOutputPath(conf3, new Path(other_args.get(3)));
JobClient.runJob(conf3);
Thanks!

mahout kmeans clustering : showing error

I was trying to cluster data in mahout. An error is showing.
here is the error
java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.mahout.clustering.classify.ClusterClassificationMapper.populateClusterModels(ClusterClassificationMapper.java:129)
at org.apache.mahout.clustering.classify.ClusterClassificationMapper.setup(ClusterClassificationMapper.java:74)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
13/03/07 19:29:31 INFO mapred.JobClient: map 0% reduce 0%
13/03/07 19:29:31 INFO mapred.JobClient: Job complete: job_local_0010
13/03/07 19:29:31 INFO mapred.JobClient: Counters: 0
java.lang.InterruptedException: Cluster Classification Driver Job failed processing E:/Thesis/Experiments/Mahout dataset/input
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:135)
at org.apache.mahout.clustering.kmeans.KMeansDriver.clusterData(KMeansDriver.java:260)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:152)
at com.ifm.dataclustering.SequencePrep.<init>(SequencePrep.java:95)
at com.ifm.dataclustering.App.main(App.java:8)
here is my code
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path vector_path = new Path("E:/Thesis/Experiments/Mahout dataset/input/vector_input");
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, vector_path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector outputVec : vector) {
vec.set(outputVec);
writer.append(new Text(outputVec.getName()), vec);
}
writer.close();
// create initial cluster
Path cluster_path = new Path("E:/Thesis/Experiments/Mahout dataset/clusters/part-00000");
SequenceFile.Writer cluster_writer = new SequenceFile.Writer(fs, conf, cluster_path, Text.class, Kluster.class);
// number of cluster k
int k=4;
for(i=0;i<k;i++) {
NamedVector outputVec = vector.get(i);
Kluster cluster = new Kluster(outputVec, i, new EuclideanDistanceMeasure());
// System.out.println(cluster);
cluster_writer.append(new Text(cluster.getIdentifier()), cluster);
}
cluster_writer.close();
// set cluster output path
Path output = new Path("E:/Thesis/Experiments/Mahout dataset/output");
HadoopUtil.delete(conf, output);
KMeansDriver.run(conf, new Path("E:/Thesis/Experiments/Mahout dataset/input"), new Path("E:/Thesis/Experiments/Mahout dataset/clusters"),
output, new EuclideanDistanceMeasure(), 0.001, 10,
true, 0.0, false);
SequenceFile.Reader output_reader = new SequenceFile.Reader(fs,new Path("E:/Thesis/Experiments/Mahout dataset/output/" + Kluster.CLUSTERED_POINTS_DIR+ "/part-m-00000"), conf);
IntWritable key = new IntWritable();
WeightedVectorWritable value = new WeightedVectorWritable();
while (output_reader.next(key, value)) {
System.out.println(value.toString() + " belongs to cluster "
+ key.toString());
}
reader.close();
}
The paths to your input/output data seem incorrect. The MapReduce job runs on a cluster. Thus the data is read from HDFS and not from your local hard disk.
The error message:
java.lang.InterruptedException: Cluster Classification Driver Job failed processing E:/Thesis/Experiments/Mahout dataset/input
at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterMR(ClusterClassificationDriver.java:276)
gives you a hint about the incorrect path.
Before running the job, make sure that you fist upload the input data to HDFS:
hadoop fs -mkdir input
hadoop fs -copyFromLocal E:\\file input
...
then instead of:
new Path("E:/Thesis/Experiments/Mahout dataset/input")
you should use the HDFS path:
new Path("input")
or
new Path("/user/<username>/input")
EDIT:
Use FileSystem#exists(Path path) In order to check, whether a Path is valid or not.

hadoop mapreduce job on Cassandra result is wrong

I have 7 node cassandra (1.1.1) and hadoop (1.03) cluster ( tasktracker install same on every cassandra node).
and my column family use wide row pattern. 1 row contains about 200k columns (max about 300k).
My problem is when we use Hadoop to run analytic jobs ( count numbers of occurrence of a word) the result i received is wrong ( result is too lower as I expected in test records)
there 's one strange when we monitoring on job tracker is map progress task indicate wrong ( in my image below ) , And number of "Map input records" when i rerun job ( same data) is not same.
here is my init job code:
Job job = new Job(conf);
job.setJobName(this.jobname);
job.setJarByClass(BannerCount.class);
job.setMapperClass(BannerViewMapper.class);
job.setReducerClass(BannerClickReducer.class);
FileSystem fs = FileSystem.get(conf);
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "192.168.23.114,192.168.23.115,192.168.23.116,192.168.23.117,192.168.23.121,192.168.23.122,192.168.23.123");
ConfigHelper.setInputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);
ConfigHelper.setRangeBatchSize(job.getConfiguration(), 500);
SlicePredicate predicate = new SlicePredicate();
SliceRange sliceRange = new SliceRange();
sliceRange.setStart(ByteBufferUtil.EMPTY_BYTE_BUFFER);
sliceRange.setFinish(ByteBufferUtil.EMPTY_BYTE_BUFFER);
sliceRange.setCount(200000);
predicate.setSlice_range(sliceRange);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
String outPathString = "BannerViewResultV3" + COLUMN_FAMILY;
if (fs.exists(new Path(outPathString)))
fs.delete(new Path(outPathString), true);
FileOutputFormat.setOutputPath(job, new Path(outPathString));
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setNumReduceTasks(28);
job.waitForCompletion(true);
return 1;

Resources