I have a job where I write the output using multiple outputs. This program produces output correctly on a normal Hadoop cluster.
But when I use an AWS cluster and give s3n paths for the multiple outputs as shown below, I do not get any output at the specified s3n paths. Can anyone help me with this?
Configuration config3 = new Configuration();
JobConf conf3 = new JobConf(config3, t1debugJob.class);
conf3.setJobName("PJob3.7 scalability test correct 60r");
conf3.setOutputKeyClass(Text.class);
conf3.setOutputValueClass(Text.class);
conf3.setMapOutputKeyClass(StockKey.class);
conf3.setMapOutputValueClass(Text.class);
conf3.setPartitionerClass(CustomPartitionerStage3.class);
conf3.setOutputValueGroupingComparator(StockKeyGroupingComparator.class);
conf3.setOutputKeyComparatorClass(StockKeySortComparator.class);
conf3.setReducerClass(dt1Amazon.class);
//conf3.setNumMapTasks(10);
conf3.setNumReduceTasks(30);
conf3.setInputFormat(TextInputFormat.class);
conf3.setOutputFormat(TextOutputFormat.class);
MultipleInputs.addInputPath(conf3, new Path(other_args.get(2)),TextInputFormat.class, PMap3aPos.class);
MultipleInputs.addInputPath(conf3, new Path(other_args.get(1)),TextInputFormat.class, PMap3b.class);
MultipleOutputs.addNamedOutput(conf3,"s3n://gs3test/output/MIDPairspos/pairfile", TextOutputFormat.class, LongWritable.class, Text.class);
MultipleOutputs.addNamedOutput(conf3,"s3n://gs3test/output/MIDpos/idfile", TextOutputFormat.class, LongWritable.class, Text.class);
FileOutputFormat.setOutputPath(conf3, new Path(other_args.get(3)));
JobClient.runJob(conf3);
Thanks!
Related
I'm running an MR job on the EMR master host.
My input file is in S3 and the output is set to a table in Hive via HCatalog.
The job runs successfully and I do see the reducers' output rows, but when I look at the new partition folders in S3 I can only see the 0-byte MR _SUCCESS file and no actual data files.
Note: when the reducer stage starts I do see files being written to S3 in a temp folder, but it seems the final step throws the files away somewhere.
I don't see any errors in the MR logs.
Relevant MR driver code:
Job job = Job.getInstance();
job.setJobName("Build Events");
job.setJarByClass(LoggersApp.class);
job.getConfiguration().set("fs.defaultFS", "s3://my-bucket");
// set input paths
Path[] inputPaths = "file on s3";
FileInputFormat.setInputPaths(job, inputPaths);

// set input/output formats
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(HCatOutputFormat.class);

_configureOutputTable(job);

private void _setReducer(Job job) {
    job.setReducerClass(Reducer.class);
    job.setOutputValueClass(DefaultHCatRecord.class);
}

private void _configureOutputTable(Job job) throws IOException {
    OutputJobInfo jobInfo = OutputJobInfo.create(_cli.getOptionValue("hive-dbname"),
            _cli.getOptionValue("output-table"), null);
    HCatOutputFormat.setOutput(job, jobInfo);
    HCatSchema schema = HCatOutputFormat.getTableSchema(job.getConfiguration());
    HCatFieldSchema partitionDate = new HCatFieldSchema("date",
            TypeInfoFactory.stringTypeInfo, null);
    HCatFieldSchema partitionBatchId = new HCatFieldSchema("batch_id",
            TypeInfoFactory.stringTypeInfo, null);
    schema.append(partitionDate);
    schema.append(partitionBatchId);
    HCatOutputFormat.setSchema(job, schema);
}
Any help?
I have already read previous posts related to this but didn't find anything meaningful.
My use case is:
Aggregate impression and click data.
Separate clicked and non-clicked data into different files.
I have written a mapper and a reducer for that, but the reducer's output contains both clicked and non-clicked records and it all goes into the same file. I want to separate the data so that clicked records end up in one file and non-clicked records in another.
Error :
java.lang.IllegalStateException: Reducer has been already set
at org.apache.hadoop.mapreduce.lib.chain.Chain.checkReducerAlreadySet(Chain.java:662)
Code
Configuration conf = new Configuration();
conf.set("mapreduce.output.fileoutputformat.compress", "true");
conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
Job job = Job.getInstance(conf, "IMPRESSION_CLICK_COMBINE_JOB");
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setReducerClass(ImpressionClickReducer.class);
FileInputFormat.setInputDirRecursive(job, true);
// FileInputFormat.addInputPath(job, new Path(args[0]));
// job.setMapperClass(ImpressionMapper.class);
Path p = new Path(args[2]);
FileSystem fs = FileSystem.get(conf);
fs.exists(p);
fs.delete(p, true);
/**
* Here directory of impressions will be present
*/
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, ImpressionMapper.class);
/**
* Here directory of clicks will be present
*/
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, ClickMapper.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
job.setNumReduceTasks(10);
job.setPartitionerClass(TrackerPartitioner.class);
ChainReducer.setReducer(job, ImpressionClickReducer.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainReducer.addMapper(job, ImpressionClickMapper.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
// The line below is the one that gives the error
//ChainReducer.setReducer(job, ImpressionAndClickReducer.class, Text.class, Text.class, Text.class, Text.class, new Configuration(false));
job.waitForCompletion(true);
ChainReducer is used to chain Map tasks after the Reducer; you can only call setReducer() once (see the code here).
From the Javadocs:
The ChainReducer class allows to chain multiple Mapper classes after a
Reducer within the Reducer task.
Using the ChainMapper and the ChainReducer classes is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And immediate benefit of this pattern is a dramatic reduction in disk IO.
So the idea is you set a single Reducer and then chain Map operations after that.
It sounds like you actually want to use MultipleOutputs. The Hadoop Javadocs provide an example of how to use it. With this you can define more than one output, and it's up to you which output the key/values get written to.
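For this particular case, a minimal sketch of a reducer built around the new-API MultipleOutputs might look like the following; the "clicked"/"nonclicked" output names and the isClicked() check are placeholders for however your records actually mark a click:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// In the driver, declare one named output per target file (names must be simple
// alphanumeric words, not paths):
// MultipleOutputs.addNamedOutput(job, "clicked", TextOutputFormat.class, Text.class, Text.class);
// MultipleOutputs.addNamedOutput(job, "nonclicked", TextOutputFormat.class, Text.class, Text.class);

public class ImpressionClickReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            if (isClicked(value)) {
                mos.write("clicked", key, value);     // lands in clicked-r-nnnnn
            } else {
                mos.write("nonclicked", key, value);  // lands in nonclicked-r-nnnnn
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // required, otherwise the named output files are never flushed
    }

    // Hypothetical check; replace with whatever distinguishes click records.
    private boolean isClicked(Text value) {
        return value.toString().contains("CLICK");
    }
}

The named output files are still written under the directory you pass to FileOutputFormat.setOutputPath(), alongside the regular part files.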
I'm new to Hadoop. My task is to find the employee with the maximum salary.
In my first map class I split the input and collect the key and value like this:
outputcollector.collect("salary",salary);
In my reducer I find the maximum salary and set the output like this:
outputcollector.collect("max salary",maxsalary);
Now I want to use the output of this reducer in another mapper.
I have constructed a chain like this:
JobConf mapAConf = new JobConf(false);
ChainMapper.addMapper(conf, mymap.class, LongWritable.class, Text.class, Text.class, IntWritable.class, true, mapAConf);
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(conf, myreduce.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, reduceConf);
JobConf mapCConf = new JobConf(false);
ChainReducer.addMapper(conf, LastMapper.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, mapCConf);
But the reducer is not getting executed. Any help on this?
You need to create and set the same JobConf for both the mapper and the reducer of a single Map/Reduce job.
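For reference, here is a minimal old-API driver along those lines, where every ChainMapper/ChainReducer call receives the same job-level JobConf (mymap, myreduce and LastMapper are the classes from the question; the paths and job name are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class MaxSalaryChainDriver {
    public static void main(String[] args) throws Exception {
        // One job-level JobConf that every chain call below receives.
        JobConf conf = new JobConf(MaxSalaryChainDriver.class);
        conf.setJobName("max-salary-chain");

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // First mapper of the chain: (LongWritable, Text) -> (Text, IntWritable)
        ChainMapper.addMapper(conf, mymap.class,
                LongWritable.class, Text.class, Text.class, IntWritable.class,
                true, new JobConf(false));

        // The single reducer of the chain: (Text, IntWritable) -> (Text, IntWritable)
        ChainReducer.setReducer(conf, myreduce.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                true, new JobConf(false));

        // Mapper that runs after the reducer; its input types must match
        // the reducer's output types.
        ChainReducer.addMapper(conf, LastMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                true, new JobConf(false));

        JobClient.runJob(conf);
    }
}

If the job still runs map-only, also double-check that nothing else (for example a zero reduce-task count) is overriding the chain's reducer.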
I am trying to use Amazon S3 storage with EMR. However, when I run my code I get multiple errors like:
java.lang.IllegalArgumentException: This file system object (hdfs://10.254.37.109:9000) does not support access to the request path 's3n://energydata/input/centers_200_10k_norm.csv' You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:384)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:129)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:429)
at edu.stanford.cs246.hw2.KMeans$CentroidMapper.setup(KMeans.java:112)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
In main I set my input and output paths like this, and I put s3n://energydata/input/centers_200_10k_norm.csv into the configuration property CFILE, which I retrieve in the mapper and reducer:
FileSystem fs = FileSystem.get(conf);
conf.set(CFILE, inPath); //inPath in this case is s3n://energydata/input/centers_200_10k_norm.csv
FileInputFormat.addInputPath(job, new Path(inputDir));
FileOutputFormat.setOutputPath(job, new Path(outputDir));
The specific place where the error above occurs is in my mapper and reducer, where I try to access CFILE (s3n://energydata/input/centers_200_10k_norm.csv). This is how I try to get the path:
FileSystem fs = FileSystem.get(context.getConfiguration());
Path cFile = new Path(context.getConfiguration().get(CFILE));
DataInputStream d = new DataInputStream(fs.open(cFile)); ---->Error
s3n://energydata/input/centers_200_10k_norm.csv is one of the input arguments to the program, and when I launched my EMR job I specified my input and output directories to be s3n://energydata/input and s3n://energydata/output.
I tried doing what was suggested in file path in hdfs but I'm still getting the error. Any help would be appreciated.
Thanks!
Try this instead; Path.getFileSystem() resolves the file system from the path's own scheme (s3n here) instead of returning the cluster's default file system the way FileSystem.get(conf) does:
Path cFile = new Path(context.getConfiguration().get(CFILE));
FileSystem fs = cFile.getFileSystem(context.getConfiguration());
DataInputStream d = new DataInputStream(fs.open(cFile));
Thanks. I actually fixed it by using the following code:
String uriStr = "s3n://energydata/centroid/";
URI uri = URI.create(uriStr);
FileSystem fs = FileSystem.get(uri, context.getConfiguration());
Path cFile = new Path(context.getConfiguration().get(CFILE));
DataInputStream d = new DataInputStream(fs.open(cFile));
My purpose is to migrate data from HBase tables to flat (say, CSV-formatted) files.
I am using
TableMapReduceUtil.initTableMapperJob(tableName, scan,
GetCustomerAccountsMapper.class, Text.class, Result.class,
job);
to scan through the HBase table, with a TableMapper as the mapper.
My challenge is in forcing the reducer to dump the row values (normalized into a flattened format) to the local (or HDFS) file system.
My problem is that I can neither see the reducer's logs nor find any files at the path I specified in the reducer.
It's my 2nd or 3rd MR job and my first serious one. After trying hard for two days, I am still clueless about how to achieve my goal.
It would be great if someone could show me the right direction.
Here is my reducer code:
public void reduce(Text key, Iterable<Result> rows, Context context)
        throws IOException, InterruptedException {
    FileSystem fs = LocalFileSystem.getLocal(new Configuration());
    Path dir = new Path("/data/HBaseDataMigration/" + tableName + "_Reducer/" + key.toString());
    FSDataOutputStream fsOut = fs.create(dir, true);

    for (Result row : rows) {
        try {
            String normRow = NormalizeHBaserow(
                    Bytes.toString(key.getBytes()), row, tableName);
            fsOut.writeBytes(normRow);
            //context.write(new Text(key.toString()), new Text(normRow));
        } catch (BadHTableResultException ex) {
            throw new IOException(ex);
        }
    }
    fsOut.flush();
    fsOut.close();
}
My Configuration for Reducer Output
Path out = new Path(args[0] + "/" + tableName+"Global");
FileOutputFormat.setOutputPath(job, out);
Thanks in Advance - Panks
Why not reduce into HDFS and, once the job has finished, use hadoop fs to export the file:
hadoop fs -get /user/hadoop/file localfile
If you do want to handle it in the reduce phase, take a look at this article on OutputFormat on InfoQ.
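As a sketch of that first approach, the reducer could simply emit each normalized row and let the job's OutputFormat write it to HDFS; NormalizeHBaserow() and tableName below are the question's own helper and field, everything else is standard:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlattenToHdfsReducer extends Reducer<Text, Result, NullWritable, Text> {

    @Override
    protected void reduce(Text key, Iterable<Result> rows, Context context)
            throws IOException, InterruptedException {
        for (Result row : rows) {
            // Flatten the HBase row with the question's helper and emit it;
            // TextOutputFormat then writes each value as one line of the part file.
            String normRow = NormalizeHBaserow(Bytes.toString(key.getBytes()), row, tableName);
            context.write(NullWritable.get(), new Text(normRow));
        }
    }
}

Each reducer then produces a part-r-nnnnn file under the path given to FileOutputFormat.setOutputPath(), which hadoop fs -get (as above) can copy to the local file system.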