Use MRUnit and AVRO together - hadoop

I have created a Mapper & Reducer which use AVRO for input, map-output en reduce output. When creating a MRUnit test i get the following stacktrace:
java.lang.NullPointerException
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.mrunit.mock.MockOutputCollector.deepCopy(MockOutputCollector.java:74)
at org.apache.hadoop.mrunit.mock.MockOutputCollector.collect(MockOutputCollector.java:110)
at org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper$MockMapContext.write(MockMapContextWrapper.java:119)
at org.apache.avro.mapreduce.AvroMapper.writePair(AvroMapper.java:22)
at com.bol.searchrank.phase.day.DayMapper.doMap(DayMapper.java:29)
at com.bol.searchrank.phase.day.DayMapper.doMap(DayMapper.java:1)
at org.apache.avro.mapreduce.AvroMapper.map(AvroMapper.java:16)
at org.apache.avro.mapreduce.AvroMapper.map(AvroMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mrunit.mapreduce.MapDriver.run(MapDriver.java:200)
at org.apache.hadoop.mrunit.mapreduce.MapReduceDriver.run(MapReduceDriver.java:207)
at com.bol.searchrank.phase.day.DayMapReduceTest.shouldProduceAndCountTerms(DayMapReduceTest.java:39)
The driver is initialized as follows (i have created a Avro MapReduce API implementation):
driver = new MapReduceDriver<AvroWrapper<Pair<Utf8, LiveTrackingLine>>, NullWritable, AvroKey<Utf8>, AvroValue<Product>, AvroWrapper<Pair<Utf8, Product>>, NullWritable>().withMapper(new DayMapper()).withReducer(new DayReducer());
Adding a configuration object with io.serialization won't help:
Configuration configuration = new Configuration();
configuration.setStrings("io.serializations", new String[] {
AvroSerialization.class.getName()
});
driver = new MapReduceDriver<AvroWrapper<Pair<Utf8, LiveTrackingLine>>, NullWritable, AvroKey<Utf8>, AvroValue<Product>, AvroWrapper<Pair<Utf8, Product>>, NullWritable>().withMapper(new DayMapper()).withReducer(new DayReducer()).withConfiguration(configuration);
I use Hadoop & MRUnit 0.20.2-cdh3u2 from Cloudera and Avro MapRed 1.6.3.

You are getting a NPE because the SerializationFactory is not finding an acceptable class implementing Serialization in io.serializations.
MRUnit had several bugs related to serializations besides Writable including MRUNIT-45, MRUNIT-70, MRUNIT-77, MRUNIT-86 at https://issues.apache.org/jira/browse/MRUNIT. These bugs involved the conf not getting passed to the SerializationFactory constructor correctly or the code required a default constructor from the Key or Value which all Writables have. All of these fixes appear in Apache MRUnit 0.9.0-incubating which will get released sometime this week.
Cloudera's 0.20.2-cdh3u2 MRUnit is close to Apache MRUnit 0.5.0-incubating. I think that your code may still be a problem even in 0.9.0-incubating, please email your full code example to mrunit-user#incubator.apache.org and the Apache MRUnit project will be happy to take a look at it
This will compile now MRUNIT-99 relaxes the restriction on K2 type parameter to not have to be Comparable

Related

How to write output in parquet fileformat in a MapReduce job?

I am looking to write MapReduce output in parquet fileformat using parquet-mr library as something like below :
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(ParquetOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[1]));
ParquetOutputFormat.setOutputPath(job, new Path(args[2]));
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
job.submit();
However, I keep getting errors like these :
2018-02-23 09:32:58,325 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException: writeSupportClass should not be null
at org.apache.parquet.Preconditions.checkNotNull(Preconditions.java:38)
at org.apache.parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:350)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:293)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:548)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:622)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
I understand that writeSupportClass needs to be passed/set as something like
ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
but can I ask how can specify schema,implement ProtoWriteSupport or any other WriteSupport classes out there? What methods do I need to implement and are there any examples of doing this in a correct way?
If it helps, my MR job's output should look like & stored in parquet format:
Text INTWRITABLE
a 100
Try ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
ProtoWriteSupport<T extends MessageOrBuilder>
Implementation of WriteSupport for writing Protocol Buffers.
Check Javadoc for list of nested default classes available.
The CDH Tutorial on using parquet file format with MapReduce, Hive, HBase, and Pig.

Sequence file reading issue using spark Java

i am trying to read the sequence file generated by hive using spark. When i try to access the file , i am facing org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException:
I have tried the workarounds for this issue like making the class serializable, still i face the issue. I am writing the code snippet here , please let me know what i am missing here.
Is it because of the BytesWritable data type or something else which is causing the issue.
JavaPairRDD<BytesWritable, Text> fileRDD = javaCtx.sequenceFile("hdfs://path_to_the_file", BytesWritable.class, Text.class);
List<String> result = fileRDD.map(new Function<Tuple2<BytesWritables,Text>,String>(){
public String call (Tuple2<BytesWritable,Text> row){
return row._2.toString()+"\n";
}).collect();
}
Here is what was needed to make it work
Because we use HBase to store our data and this reducer outputs its result to HBase table, Hadoop is telling us that he doesn’t know how to serialize our data. That is why we need to help it. Inside setUp set the io.serializations variable
You can do it in spark accordingly
conf.setStrings("io.serializations", new String[]{hbaseConf.get("io.serializations"), MutationSerialization.class.getName(), ResultSerialization.class.getName()});

Parquet-MR AvroParquetWriter - how to convert data to Parquet (with Specific Mapping)

I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis and on client machines which is why I try to write my own standalone conversion tool with a (Avro|Parquet|JSON) switch instead of using Drill or Spark or other tools as converters as I probably would if this was a one time job. I'm basing the whole thing on Avro because this seems like the easiest way to get conversion to Parquet and JSON under one hood.
I used Specific Mapping to profit from static type checking, wrote an IDL, converted that to a schema.avsc, generated classes and set up a sample conversion with specific constructor, but now I'm stuck configuring the writers. All Avro-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.
AvroParquetWriter has only one none-deprecated Constructor, with this signature:
AvroParquetWriter(
Path file,
WriteSupport<T> writeSupport,
CompressionCodecName compressionCodecName,
int blockSize,
int pageSize,
boolean enableDictionary,
boolean enableValidation,
WriterVersion writerVersion,
Configuration conf
)
Most of the parameters are not hard to figure out but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter I see GenericData model pop up a few times but only one line mentioning SpecificData: GenericData model = SpecificData.get();.
So I have a few questions:
1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it by means of that SpecificData.get() method? The comment "Utilities for generated Java classes and interfaces." over 'SpecificData.class` seems to suggest that but how exactly should I proceed?
2) What's going on in the AvroParquetWriter constructor, is there an example or some documentation to be found somewhere?
3) More specifically: the signature of the WriteSupport method asks for 'Schema avroSchema' and 'GenericData model'. What does GenericData model refer to? Maybe I'm not seeing the forest because of all the trees here...
To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:
DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
The Parquet equivalent currently looks like this:
AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);
but this is not more than a beginning and is modeled after the examples I found, using the deprecated constructor, so will have to change anyway.
Thanks,
Thomas
[0] Hadoop - The definitive Guide, O'Reilly, https://gist.github.com/hammer/76996fb8426a0ada233e, http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter
Try AvroParquetWriter.builder :
MyData obj = ... // should be avro Object
ParquetWriter<Object> pw = AvroParquetWriter.builder(file)
.withSchema(obj.getSchema())
.build();
pw.write(obj);
pw.close();
Thanks.

How to test hadoop mapreduce with hdfs?

I am using MRUnit to write unit tests for my mapreduce jobs.
However, I am having trouble including hdfs into that mix. My MR job needs a file from hdfs. How do I mock out the hdfs part in MRUnit test case?
Edit:
I know that I can specify inputs/exepctedOutput for my MR code in the test infrastructure. However, that is not what I want. My MR job needs to read another file that has domain data to do the job. This file is in HDFS. How do I mock out this file?
I tried using mockito but it didnt work. The reason was that FileSystem.open() returns a FSDataInputStream which inherits from other interfaces besides java.io.Stream. It was too painful to mock out all the interfaces. So, I hacked it in my code by doing the following
if (System.getProperty("junit_running") != null)
{
inputStream = this.getClass().getClassLoader().getResourceAsStream("domain_data.txt");
br = new BufferedReader(new InputStreamReader(inputStream));
} else {
Path pathToRegionData = new Path("/domain_data.txt");
LOG.info("checking for existence of region assignment file at path: " + pathToRegionData.toString());
if (!fileSystem.exists(pathToRegionData))
{
LOG.error("domain file does not exist at path: " + pathToRegionData.toString());
throw new IllegalArgumentException("region assignments file does not exist at path: " + pathToRegionData.toString());
}
inputStream = fileSystem.open(pathToRegionData);
br = new BufferedReader(new InputStreamReader(inputStream));
}
This solution is not ideal because I had to put test specific code in my production code. I am still waiting to see if there is an elegant solution out there.
Please follow the this small tutorial for MRUnit.
https://github.com/malli3131/HadoopTutorial/blob/master/MRUnit/Tutorial
In MRUnit test case, we supply the data inside the testMapper() and testReducer() methods. So there is no need of input from HDFS for MRUnit Job. Only MapReduce jobs require data inputs from HDFS.

Hadoop new API - Set OutputFormat

I'm trying to set the OutputFormat of my job to MapFileOutputFormat using:
jobConf.setOutputFormat(MapFileOutputFormat.class);
I get this error: mapred.output.format.class is incompatible with new reduce API mode
I suppose I should use the set setOutputFormatClass() of the new Job class but the problem is that when I try to do this:
job.setOutputFormatClass(MapFileOutputFormat.class);
it expects me to use this class: org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat.
In hadoop 1.0.X there is no such class. It only exists in earlier versions (e.g 0.x)
How can I solve this problem ?
Thank you!
This problem has no decently easily implementable solution.
I gave up and used Sequence files which fit my requirements too.
Have you tried the following?
import org.apache.hadoop.mapreduce.lib.output;
...
LazyOutputFormat.setOutputFormatClass(job, MapFileOutputFormat.class);

Resources