Hadoop Map-Reduce: need to combine two mappers with one common Reducer - hadoop

I need to implement the following functionality using Hadoop Map-Reduce:
1) One mapper reads its input from one source, and another mapper reads its input from a different source.
2) The output of both mappers must be passed into a single reducer for further processing.
Is there any way to meet the above requirement in Hadoop Map-Reduce?

MultipleInputs.addInputPath is what you are looking for. This is how your configuration would look. Make sure both AnyMapper1 and AnyMapper2 emit the same output key/value types expected by MergeReducer (a sketch of such mappers follows the configuration below):
JobConf conf = new JobConf(Merge.class);
conf.setJobName("merge");
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);
conf.setReducerClass(MergeReducer.class);
conf.setOutputFormat(TextOutputFormat.class);
MultipleInputs.addInputPath(conf, inputDir1, SequenceFileInputFormat.class, AnyMapper1.class);
MultipleInputs.addInputPath(conf, inputDir2, TextInputFormat.class, AnyMapper2.class);
FileOutputFormat.setOutputPath(conf, outputPath);
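For illustration only, a minimal old-API sketch of what the two mappers might look like, assuming MergeReducer expects IntWritable keys and Text values as configured above; the input types for the SequenceFile source and the parsing logic are placeholders:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper over the SequenceFile input (each class goes in its own file);
// adjust the input key/value types to whatever inputDir1 actually contains.
public class AnyMapper1 extends MapReduceBase
        implements Mapper<IntWritable, Text, IntWritable, Text> {
    public void map(IntWritable key, Text value,
                    OutputCollector<IntWritable, Text> output, Reporter reporter)
            throws IOException {
        // ... transform the record as needed ...
        output.collect(key, value);
    }
}

// Hypothetical mapper over the text input; TextInputFormat supplies
// LongWritable byte offsets and Text lines.
public class AnyMapper2 extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<IntWritable, Text> output, Reporter reporter)
            throws IOException {
        // ... parse the line into an IntWritable key and a Text value ...
        output.collect(new IntWritable(1), line);
    }
}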

Alternatively, you can create a custom Writable, populate it in each mapper, and then retrieve the custom Writable object in the reducer to perform the necessary business logic. A sketch follows below.
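A minimal sketch of such a custom Writable, assuming a hypothetical record carrying a source tag and a payload; the class and field names are illustrative only:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical value type carrying a tag for the originating source plus a payload.
public class TaggedRecordWritable implements Writable {
    private Text sourceTag = new Text();
    private Text payload = new Text();

    public void set(String tag, String value) {
        sourceTag.set(tag);
        payload.set(value);
    }

    public Text getSourceTag() { return sourceTag; }
    public Text getPayload()   { return payload; }

    @Override
    public void write(DataOutput out) throws IOException {
        sourceTag.write(out);   // serialize fields in a fixed order
        payload.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sourceTag.readFields(in);  // deserialize in the same order
        payload.readFields(in);
    }
}

In the reducer you can then branch on getSourceTag() to tell which mapper a value came from before doing the merge.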

Related

How to write output in parquet fileformat in a MapReduce job?

I am looking to write MapReduce output in the Parquet file format using the parquet-mr library, something like below:
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(ParquetOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[1]));
ParquetOutputFormat.setOutputPath(job, new Path(args[2]));
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
job.submit();
However, I keep getting errors like this:
2018-02-23 09:32:58,325 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException: writeSupportClass should not be null
at org.apache.parquet.Preconditions.checkNotNull(Preconditions.java:38)
at org.apache.parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:350)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:293)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:548)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:622)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
I understand that writeSupportClass needs to be passed/set as something like
ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
but can I ask how to specify the schema, and how to implement ProtoWriteSupport or any of the other WriteSupport classes out there? What methods do I need to implement, and are there any examples of doing this correctly?
If it helps, my MR job's output should look like the following and be stored in Parquet format:
Text  IntWritable
a     100
Try ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
ProtoWriteSupport<T extends MessageOrBuilder> is the implementation of WriteSupport for writing Protocol Buffers.
Check the Javadoc for the list of nested default classes available.
See also the CDH tutorial on using the Parquet file format with MapReduce, Hive, HBase, and Pig.
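The answer above points at ProtoWriteSupport; as an alternative illustration (not from the original answers), here is roughly what a hand-rolled WriteSupport for the Text/IntWritable output could look like. All class and field names below are assumptions:

import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Hypothetical WriteSupport writing one (word, count) pair per record, matching
// the "Text IntWritable" output described in the question.
public class WordCountWriteSupport extends WriteSupport<WordCountWriteSupport.Record> {

    // Simple record type the reducer would emit as its output value
    // (the output key of ParquetOutputFormat is typically Void/null).
    public static class Record {
        public final String word;
        public final int count;
        public Record(String word, int count) { this.word = word; this.count = count; }
    }

    // Parquet schema: a required UTF8 string and a required int32.
    private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
            "message wordcount { required binary word (UTF8); required int32 count; }");

    private RecordConsumer recordConsumer;

    @Override
    public WriteContext init(Configuration configuration) {
        return new WriteContext(SCHEMA, new HashMap<String, String>());
    }

    @Override
    public void prepareForWrite(RecordConsumer recordConsumer) {
        this.recordConsumer = recordConsumer;
    }

    @Override
    public void write(Record record) {
        recordConsumer.startMessage();
        recordConsumer.startField("word", 0);
        recordConsumer.addBinary(Binary.fromString(record.word));
        recordConsumer.endField("word", 0);
        recordConsumer.startField("count", 1);
        recordConsumer.addInteger(record.count);
        recordConsumer.endField("count", 1);
        recordConsumer.endMessage();
    }
}

In the driver this would be wired up with job.setOutputFormatClass(ParquetOutputFormat.class); and ParquetOutputFormat.setWriteSupportClass(job, WordCountWriteSupport.class);, and the reducer would emit (null, new WordCountWriteSupport.Record("a", 100)).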

How do I make the mapper process the entire file from HDFS

This is the code where I read a file that contains HL7 messages and iterate through them using the HAPI iterator (from http://hl7api.sourceforge.net):
File file = new File("/home/training/Documents/msgs.txt");
InputStream is = new FileInputStream(file);
is = new BufferedInputStream(is);
Hl7InputStreamMessageStringIterator iter = new Hl7InputStreamMessageStringIterator(is);
I want to do this inside the map function. Obviously I need to prevent splitting in the InputFormat so that the entire file is read at once as a single value and converted to a String (the file size is 7 KB), because, as you know, HAPI can only parse complete messages.
I am newbie to all of this so please bear with me.
You will need to implement your own FileInputFormat subclass:
It must override the isSplitable() method to return false, which means the number of mappers will be equal to the number of input files: one input file per mapper.
You also need to implement the getRecordReader() method. The RecordReader it returns is exactly the class where you need to put your parsing logic from above. A sketch of this approach is shown below.
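A minimal sketch of this approach with the new (mapreduce) API, reading the whole file into a single Text value; the class names are placeholders and error handling is kept brief:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical input format that hands each file to exactly one mapper as one record.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split: one mapper per file
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }
}

class WholeFileRecordReader extends RecordReader<NullWritable, Text> {
    private FileSplit split;
    private Configuration conf;
    private final Text value = new Text();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.split = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        // Read the entire (small) file into memory as one record.
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
}

Inside your map() you can then wrap the value's bytes in a ByteArrayInputStream and feed it to Hl7InputStreamMessageStringIterator, exactly as in the standalone code above.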
If you do not want your data file to be split, or you want a single mapper to process the entire file (so that one file is processed by only one mapper), extending the MapReduce InputFormat and overriding the isSplitable() method to return false will do it.
For reference (not based on your code):
https://gist.github.com/sritchie/808035
As the input comes from a text file, you can override the isSplitable() method of FileInputFormat. With this, one mapper will process the whole file.
@Override
protected boolean isSplitable(JobContext context, Path file) {
    return false;
}

Hadoop Multiple Outputs with CQL3

I need to output the results of a MR job to multiple CQL3 column families.
In my reducer, I specify the CF using MultipleOutputs, but all the results are written to the one CF defined in the job's OutputCQL statement.
Job definition:
...
job.setOutputFormatClass(CqlOutputFormat.class);
ConfigHelper.setOutputKeyspace(job.getConfiguration(), "keyspace1");
MultipleOutputs.addNamedOutput(job, "CF1", CqlOutputFormat.class, Map.class, List.class);
MultipleOutputs.addNamedOutput(job, "CF2", CqlOutputFormat.class, Map.class, List.class);
CqlConfigHelper.setOutputCql(job.getConfiguration(), "UPDATE keyspace1.CF1 SET value = ? ");
...
Reducer class setup:
mos = new MultipleOutputs(context);
Reduce method (pseudocode):
keys = new LinkedHashMap<>();
keys.put("key", ByteBufferUtil.bytes("rowKey"));
keys.put("name", ByteBufferUtil.bytes("columnName"));
List<ByteBuffer> variables = new ArrayList<>();
variables.add(ByteBufferUtil.bytes("columnValue"));
mos.write("CF2", keys, variables);
The problem is that my reducer ignores the CF I specify in mos.write() and instead just runs the outputCQL. So in the example above, everything is written to CF1.
I've tried using a prepared statement to inject the CF into the outputCQL, along the lines of "UPDATE keyspace1.? SET value = ?", but I don't think it's possible to use a placeholder for the CF like this.
Is there any way I can overwrite the outputCQL inside the reducer class?
So the simple answer is that you cannot output results from an MR job to multiple CFs. However, needing to do this actually highlights a flaw in the approach rather than a missing feature in Hadoop.
Instead of processing a bunch of records and trying to produce two different result sets in one pass, a better approach is to arrive at the desired result sets iteratively. Basically, this means having multiple jobs iterate over the results of previous jobs until the desired results are achieved.
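For illustration only (this is not from the original answer, and it assumes the Hadoop 2 Job API plus the ConfigHelper.setOutputColumnFamily helper), the workaround amounts to configuring one job per column family, each with its own output CQL:

// Hypothetical driver fragment: one job per target column family, run one after another.
Job jobCf1 = Job.getInstance(conf, "write-cf1");
jobCf1.setOutputFormatClass(CqlOutputFormat.class);
ConfigHelper.setOutputKeyspace(jobCf1.getConfiguration(), "keyspace1");
ConfigHelper.setOutputColumnFamily(jobCf1.getConfiguration(), "CF1");
CqlConfigHelper.setOutputCql(jobCf1.getConfiguration(), "UPDATE keyspace1.CF1 SET value = ?");
// ... set mapper/reducer/input for the CF1 pass ...
jobCf1.waitForCompletion(true);

Job jobCf2 = Job.getInstance(conf, "write-cf2");
jobCf2.setOutputFormatClass(CqlOutputFormat.class);
ConfigHelper.setOutputKeyspace(jobCf2.getConfiguration(), "keyspace1");
ConfigHelper.setOutputColumnFamily(jobCf2.getConfiguration(), "CF2");
CqlConfigHelper.setOutputCql(jobCf2.getConfiguration(), "UPDATE keyspace1.CF2 SET value = ?");
// ... set mapper/reducer/input for the CF2 pass, possibly reading the first job's output ...
jobCf2.waitForCompletion(true);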

How to use Snappy in Hadoop in Container format

I have to use Snappy to compress the map output as well as the final map-reduce output. Furthermore, the output should be splittable.
From what I have read online, to make Snappy write splittable output, it has to be used inside a container-like format.
Can you please suggest how to go about it? I tried finding some examples online but could not find one. I am using Hadoop v0.20.203.
For the job output:
conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(conf, true);
conf.set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
For map output
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.set("mapred.map.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
In the new API, the OutputFormat is set on the Job rather than on the Configuration, and because new Job(conf) copies the configuration, the codec has to be set through the job's own configuration.
The first part will then be:
Job job = new Job(conf);
...
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(job, true);
conf.set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");

How to read a Hadoop Sequence file as input to a Hadoop job?

I have a Sequence file whose key-value pairs are of type "org.apache.hadoop.typedbytes.TypedBytesWritable". I have to provide this file as the input to a Hadoop job and process it in the map only; I do not have to do anything that needs a reduce.
1) How will I specify the FileInputFormat as a Sequence file?
2) What will be the signature of the map function?
3) How will I get output from the map instead of the reduce?
1) How will I specify the FileInputFormat as a Sequence file?
Set SequenceFileAsBinaryInputFormat as the input format (see the source of the SequenceFileAsBinaryInputFormat class for details).
Here is the code:
JobConf conf = new JobConf(getConf(), getClass());
conf.setInputFormat(SequenceFileAsBinaryInputFormat.class);
2) What will be the signature of the map function?
The map will be invoked with BytesWritable as both the key and value types.
3) How will I get output from the map instead of the reduce?
Set the mapred.reduce.tasks property to 0. The output of the map will be the final output of the job.
Also, take a look at SequenceFileAsTextInputFormat; with it, the map will be invoked with Text as both the key and value types. A map-only job sketch using SequenceFileAsBinaryInputFormat is shown below.
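A minimal old-API sketch of such a map-only job (class names are placeholders; the mapper here simply re-emits the raw bytes):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class MapOnlySequenceJob {

    // Map signature: BytesWritable in, BytesWritable out (the raw bytes of the TypedBytesWritable pairs).
    public static class PassThroughMapper extends MapReduceBase
            implements Mapper<BytesWritable, BytesWritable, BytesWritable, BytesWritable> {
        public void map(BytesWritable key, BytesWritable value,
                        OutputCollector<BytesWritable, BytesWritable> output, Reporter reporter)
                throws IOException {
            // ... process the raw bytes as needed ...
            output.collect(key, value);
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MapOnlySequenceJob.class);
        conf.setJobName("map-only-sequence");
        conf.setInputFormat(SequenceFileAsBinaryInputFormat.class);
        conf.setMapperClass(PassThroughMapper.class);
        conf.setOutputKeyClass(BytesWritable.class);
        conf.setOutputValueClass(BytesWritable.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setNumReduceTasks(0);  // map-only: the map output becomes the job output
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}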
