How to write output in Parquet file format in a MapReduce job?

I want to write MapReduce output in Parquet file format using the parquet-mr library, with a job setup like the following:
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(ParquetOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[1]));
ParquetOutputFormat.setOutputPath(job, new Path(args[2]));
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
job.submit();
However, I keep getting errors like this:
2018-02-23 09:32:58,325 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException: writeSupportClass should not be null
at org.apache.parquet.Preconditions.checkNotNull(Preconditions.java:38)
at org.apache.parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:350)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:293)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:548)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:622)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
I understand that writeSupportClass needs to be passed/set as something like
ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
but how do I specify the schema and implement ProtoWriteSupport or one of the other WriteSupport classes? Which methods do I need to implement, and are there any examples of doing this correctly?
If it helps, my MR job's output, which should be stored in Parquet format, looks like this:
Text IntWritable
a 100

Try ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
ProtoWriteSupport<T extends MessageOrBuilder>
Implementation of WriteSupport for writing Protocol Buffers.
Check the Javadoc for the list of nested default classes available.
See also the CDH tutorial on using the Parquet file format with MapReduce, Hive, HBase, and Pig.
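
For the (Text, IntWritable) output described in the question, a custom WriteSupport may not be needed: parquet-mr ships an example Group-based write support that takes a schema from the job configuration. The following is a minimal sketch under that assumption, not a definitive implementation. The class names, the schema and its field names (key, value), and the whitespace-separated input format are all hypothetical; it assumes the parquet-hadoop and Hadoop 2+ client libraries are on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.example.ExampleOutputFormat;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageTypeParser;

public class TextIntToParquet {
    // Hypothetical schema for the (Text, IntWritable) pairs in the question.
    static final String SCHEMA =
        "message pair { required binary key (UTF8); required int32 value; }";

    // Hypothetical mapper: parses lines such as "a 100" into (Text, IntWritable).
    public static class ParseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().trim().split("\\s+");
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
            }
        }
    }

    // ParquetOutputFormat ignores the key, so the reducer emits (null, Group).
    public static class ParquetReducer extends Reducer<Text, IntWritable, Void, Group> {
        private SimpleGroupFactory factory;

        @Override
        protected void setup(Context context) {
            // Reads back the schema that the driver stored in the configuration.
            factory = new SimpleGroupFactory(
                GroupWriteSupport.getSchema(context.getConfiguration()));
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            Group row = factory.newGroup()
                .append("key", key.toString())
                .append("value", sum);
            context.write(null, row);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "text-int-to-parquet");
        job.setJarByClass(TextIntToParquet.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(ParquetReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        // ExampleOutputFormat already wires in GroupWriteSupport,
        // so setWriteSupportClass does not need to be called.
        job.setOutputFormatClass(ExampleOutputFormat.class);
        ExampleOutputFormat.setSchema(job, MessageTypeParser.parseMessageType(SCHEMA));
        ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        ParquetOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If the records are already Protocol Buffers or Avro objects, ProtoWriteSupport or the Avro equivalent would probably be a better fit than the example Group classes used here.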

Related

Sequence file reading issue using spark Java

I am trying to read a sequence file generated by Hive using Spark. When I try to access the file, I get org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException:
I have tried the usual workarounds, such as making the class serializable, but the issue persists. The code snippet is below; please let me know what I am missing.
Is the BytesWritable data type causing the issue, or is it something else?
JavaPairRDD<BytesWritable, Text> fileRDD = javaCtx.sequenceFile("hdfs://path_to_the_file", BytesWritable.class, Text.class);
List<String> result = fileRDD.map(new Function<Tuple2<BytesWritable, Text>, String>() {
    public String call(Tuple2<BytesWritable, Text> row) {
        return row._2.toString() + "\n";
    }
}).collect();
Here is what was needed to make it work:
Because we use HBase to store our data and this reducer writes its result to an HBase table, Hadoop reports that it does not know how to serialize our data, so we need to help it. Inside setUp, set the io.serializations property.
You can do the same in Spark:
conf.setStrings("io.serializations", new String[]{hbaseConf.get("io.serializations"), MutationSerialization.class.getName(), ResultSerialization.class.getName()});
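
Separately from the io.serializations setting, one common source of "Task not serializable" with anonymous Function classes is that an anonymous inner class keeps a reference to its enclosing object, which then has to be serializable as well. Whether that is the cause in the question's job is an assumption. The sketch below reproduces the effect with plain Java serialization only; SerFn, Driver, and Appender are hypothetical names, and no Spark API is used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class ClosureCaptureDemo {
    /** Hypothetical stand-in for a serializable function interface. */
    public interface SerFn extends Serializable {
        String apply(String s);
    }

    /** Deliberately NOT Serializable, like a typical driver class. */
    public static class Driver {
        private final String suffix;

        public Driver(String suffix) {
            this.suffix = suffix;
        }

        /** The anonymous class reads Driver.this.suffix, so it captures the Driver. */
        public SerFn anonymous() {
            return new SerFn() {
                @Override
                public String apply(String s) {
                    return s + suffix;
                }
            };
        }

        /** A static nested class holds only the data it needs. */
        public static SerFn standalone(String suffix) {
            return new Appender(suffix);
        }
    }

    public static class Appender implements SerFn {
        private final String suffix;

        public Appender(String suffix) {
            this.suffix = suffix;
        }

        @Override
        public String apply(String s) {
            return s + suffix;
        }
    }

    /** Returns false if Java serialization rejects the object graph. */
    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Driver driver = new Driver("\n");
        System.out.println("anonymous:  " + canSerialize(driver.anonymous()));
        System.out.println("standalone: " + canSerialize(Driver.standalone("\n")));
    }
}
```

Moving the function into a static nested or top-level class is the usual way to avoid dragging the enclosing object into the serialized task.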

How to test hadoop mapreduce with hdfs?

I am using MRUnit to write unit tests for my mapreduce jobs.
However, I am having trouble bringing HDFS into the mix. My MR job needs a file from HDFS. How do I mock out the HDFS part in an MRUnit test case?
Edit:
I know that I can specify inputs and expected output for my MR code in the test infrastructure. However, that is not what I want. My MR job needs to read another file, stored in HDFS, that holds the domain data it needs. How do I mock out this file?
I tried using Mockito, but it didn't work. The reason was that FileSystem.open() returns an FSDataInputStream, which implements several interfaces besides extending java.io.InputStream, and mocking all of them was too painful. So I hacked it in my code by doing the following:
if (System.getProperty("junit_running") != null)
{
    inputStream = this.getClass().getClassLoader().getResourceAsStream("domain_data.txt");
    br = new BufferedReader(new InputStreamReader(inputStream));
} else {
    Path pathToRegionData = new Path("/domain_data.txt");
    LOG.info("checking for existence of region assignment file at path: " + pathToRegionData.toString());
    if (!fileSystem.exists(pathToRegionData))
    {
        LOG.error("domain file does not exist at path: " + pathToRegionData.toString());
        throw new IllegalArgumentException("region assignments file does not exist at path: " + pathToRegionData.toString());
    }
    inputStream = fileSystem.open(pathToRegionData);
    br = new BufferedReader(new InputStreamReader(inputStream));
}
This solution is not ideal because I had to put test specific code in my production code. I am still waiting to see if there is an elegant solution out there.
Please follow this small tutorial for MRUnit:
https://github.com/malli3131/HadoopTutorial/blob/master/MRUnit/Tutorial
In an MRUnit test case, we supply the data inside the testMapper() and testReducer() methods, so an MRUnit test needs no input from HDFS. Only real MapReduce jobs require data inputs from HDFS.
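
For the side file of domain data that the question describes, one way to keep test-specific branches out of production code is to hide the "open the file" step behind a small interface and pass in a different implementation from the test. This is a minimal sketch of that idea in plain Java, not an MRUnit feature; DomainDataLoader and StreamSource are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DomainDataLoader {
    /**
     * Hypothetical seam for opening the domain data.
     * In production this could be () -> fileSystem.open(pathToRegionData),
     * since FSDataInputStream is an InputStream.
     */
    public interface StreamSource {
        InputStream open() throws IOException;
    }

    /** Reads every line from the source; knows nothing about HDFS or JUnit. */
    public static List<String> loadLines(StreamSource source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(source.open(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Test-style usage: an in-memory stream stands in for the HDFS file.
        StreamSource fake = () -> new ByteArrayInputStream(
            "region-a\nregion-b\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(loadLines(fake));
    }
}
```

A test can also pass a source that reads domain_data.txt from the classpath, which keeps the junit_running system property out of the mapper.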

Hadoop MapReduce: need to combine two mappers with one common reducer

I need to implement the following functionality using Hadoop MapReduce:
1) One mapper reads its input from one source, and another mapper reads its input from a different source.
2) The output of both mappers must be passed to a single reducer for further processing.
Is there any way to do this in Hadoop MapReduce?
MultipleInputs.addInputPath is what you are looking for. This is what your configuration would look like. Make sure both AnyMapper1 and AnyMapper2 emit the same output key and value types that MergeReducer expects.
JobConf conf = new JobConf(Merge.class);
conf.setJobName("merge");
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(Text.class);
conf.setReducerClass(MergeReducer.class);
conf.setOutputFormat(TextOutputFormat.class);
MultipleInputs.addInputPath(conf, inputDir1, SequenceFileInputFormat.class, AnyMapper1.class);
MultipleInputs.addInputPath(conf, inputDir2, TextInputFormat.class, AnyMapper2.class);
FileOutputFormat.setOutputPath(conf, outputPath);
You can also create a custom Writable: populate it in each mapper, then read the custom Writable object in the reducer and apply the necessary business logic.
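
A custom Writable for this case might carry a tag saying which mapper produced the record, so the reducer can tell the two sources apart. The sketch below is an assumption about how that could look. To stay runnable without Hadoop on the classpath it does not declare implements org.apache.hadoop.io.Writable, but write and readFields have the same signatures as that interface; TaggedRecord and its fields are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// In a real job, declare: public class TaggedRecord implements Writable
public class TaggedRecord {
    private int sourceTag;        // e.g. 1 for AnyMapper1, 2 for AnyMapper2
    private String payload = "";

    /** No-arg constructor: Hadoop creates Writables reflectively. */
    public TaggedRecord() {
    }

    public TaggedRecord(int sourceTag, String payload) {
        this.sourceTag = sourceTag;
        this.payload = payload;
    }

    /** Serializes the fields in a fixed order. */
    public void write(DataOutput out) throws IOException {
        out.writeInt(sourceTag);
        out.writeUTF(payload); // writeUTF is limited to 65535 encoded bytes
    }

    /** Reads the fields back in exactly the order write() emitted them. */
    public void readFields(DataInput in) throws IOException {
        sourceTag = in.readInt();
        payload = in.readUTF();
    }

    public int getSourceTag() {
        return sourceTag;
    }

    public String getPayload() {
        return payload;
    }

    /** Helper for this sketch only: serialize and deserialize in memory. */
    public static TaggedRecord roundTrip(TaggedRecord record) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            record.write(new DataOutputStream(bytes));
            TaggedRecord copy = new TaggedRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TaggedRecord copy = roundTrip(new TaggedRecord(2, "from second mapper"));
        System.out.println(copy.getSourceTag() + " " + copy.getPayload());
    }
}
```

Both mappers would then emit TaggedRecord as their value type, and the reducer would branch on getSourceTag().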

Use MRUnit and AVRO together

I have created a Mapper and a Reducer that use Avro for the input, the map output, and the reduce output. When creating an MRUnit test, I get the following stack trace:
java.lang.NullPointerException
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.mrunit.mock.MockOutputCollector.deepCopy(MockOutputCollector.java:74)
at org.apache.hadoop.mrunit.mock.MockOutputCollector.collect(MockOutputCollector.java:110)
at org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper$MockMapContext.write(MockMapContextWrapper.java:119)
at org.apache.avro.mapreduce.AvroMapper.writePair(AvroMapper.java:22)
at com.bol.searchrank.phase.day.DayMapper.doMap(DayMapper.java:29)
at com.bol.searchrank.phase.day.DayMapper.doMap(DayMapper.java:1)
at org.apache.avro.mapreduce.AvroMapper.map(AvroMapper.java:16)
at org.apache.avro.mapreduce.AvroMapper.map(AvroMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mrunit.mapreduce.MapDriver.run(MapDriver.java:200)
at org.apache.hadoop.mrunit.mapreduce.MapReduceDriver.run(MapReduceDriver.java:207)
at com.bol.searchrank.phase.day.DayMapReduceTest.shouldProduceAndCountTerms(DayMapReduceTest.java:39)
The driver is initialized as follows (I have created an Avro MapReduce API implementation):
driver = new MapReduceDriver<AvroWrapper<Pair<Utf8, LiveTrackingLine>>, NullWritable, AvroKey<Utf8>, AvroValue<Product>, AvroWrapper<Pair<Utf8, Product>>, NullWritable>().withMapper(new DayMapper()).withReducer(new DayReducer());
Adding a configuration object that sets io.serializations does not help:
Configuration configuration = new Configuration();
configuration.setStrings("io.serializations", new String[] {
AvroSerialization.class.getName()
});
driver = new MapReduceDriver<AvroWrapper<Pair<Utf8, LiveTrackingLine>>, NullWritable, AvroKey<Utf8>, AvroValue<Product>, AvroWrapper<Pair<Utf8, Product>>, NullWritable>().withMapper(new DayMapper()).withReducer(new DayReducer()).withConfiguration(configuration);
I use Hadoop and MRUnit 0.20.2-cdh3u2 from Cloudera and Avro MapRed 1.6.3.
You are getting an NPE because the SerializationFactory cannot find an acceptable class implementing Serialization in io.serializations.
MRUnit had several bugs related to serializations other than Writable, including MRUNIT-45, MRUNIT-70, MRUNIT-77, and MRUNIT-86 at https://issues.apache.org/jira/browse/MRUNIT. These bugs involved either the conf not being passed correctly to the SerializationFactory constructor or the code requiring a default constructor on the key or value, which all Writables have. All of these fixes appear in Apache MRUnit 0.9.0-incubating, which will be released sometime this week.
Cloudera's 0.20.2-cdh3u2 MRUnit is close to Apache MRUnit 0.5.0-incubating. I think that your code may still be a problem even in 0.9.0-incubating. Please email your full code example to mrunit-user@incubator.apache.org and the Apache MRUnit project will be happy to take a look at it.
This will compile now that MRUNIT-99 relaxes the restriction on the K2 type parameter so that it no longer has to be Comparable.

How to use Snappy in Hadoop in Container format

I have to use Snappy to compress both the map output and the final MapReduce output. The output should also be splittable.
From what I have read online, to make Snappy produce splittable output it has to be used inside a container format.
Can you please suggest how to go about it? I tried finding some examples online, but could not find one. I am using Hadoop v0.20.203.
Thanks.
Piyush
For the job output:
conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(conf, true);
conf.set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
For the map output:
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.set("mapred.map.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
In the new API, the output format settings are applied to the Job rather than to the configuration.
The first part then becomes:
Job job = new Job(conf);
...
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(job, true);
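// Job copies conf when it is constructed, so set the codec on the job's own configuration.
job.getConfiguration().set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");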
