Sequence file reading issue using spark Java - hadoop

I am trying to read a sequence file generated by Hive using Spark. When I try to access the file, I get org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException:
I have tried the usual workarounds for this issue, such as making the class serializable, but I still see the error. I am including the code snippet below; please let me know what I am missing.
Is the BytesWritable data type causing the issue, or something else?
JavaPairRDD<BytesWritable, Text> fileRDD = javaCtx.sequenceFile("hdfs://path_to_the_file", BytesWritable.class, Text.class);
List<String> result = fileRDD.map(new Function<Tuple2<BytesWritable, Text>, String>() {
    public String call(Tuple2<BytesWritable, Text> row) {
        return row._2.toString() + "\n";
    }
}).collect();

Here is what was needed to make it work.
Because we use HBase to store our data and this reducer outputs its result to an HBase table, Hadoop is telling us that it doesn't know how to serialize our data. That is why we need to help it. Inside setup(), set the io.serializations variable.
You can do the equivalent in Spark:
conf.setStrings("io.serializations", new String[]{hbaseConf.get("io.serializations"), MutationSerialization.class.getName(), ResultSerialization.class.getName()});

Related

Parquet-MR AvroParquetWriter - how to convert data to Parquet (with Specific Mapping)

I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis and on client machines, which is why I am trying to write my own standalone conversion tool with an (Avro|Parquet|JSON) switch instead of using Drill, Spark or other tools as converters, as I probably would if this were a one-time job. I'm basing the whole thing on Avro because this seems like the easiest way to get conversion to Parquet and JSON under one hood.
I used Specific Mapping to profit from static type checking, wrote an IDL, converted that to a schema.avsc, generated classes and set up a sample conversion with specific constructor, but now I'm stuck configuring the writers. All Avro-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.
AvroParquetWriter has only one non-deprecated constructor, with this signature:
AvroParquetWriter(
    Path file,
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize,
    int pageSize,
    boolean enableDictionary,
    boolean enableValidation,
    WriterVersion writerVersion,
    Configuration conf
)
Most of the parameters are not hard to figure out but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter I see GenericData model pop up a few times but only one line mentioning SpecificData: GenericData model = SpecificData.get();.
So I have a few questions:
1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it support it by means of that SpecificData.get() method? The comment "Utilities for generated Java classes and interfaces." above SpecificData seems to suggest that, but how exactly should I proceed?
2) What's going on in the AvroParquetWriter constructor, is there an example or some documentation to be found somewhere?
3) More specifically: the signature of the WriteSupport method asks for Schema avroSchema and GenericData model. What does GenericData model refer to? Maybe I'm not seeing the forest for the trees here...
To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:
DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
The Parquet equivalent currently looks like this:
AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);
but this is no more than a starting point; it is modeled after the examples I found and uses the deprecated constructor, so it will have to change anyway.
Thanks,
Thomas
[0] Hadoop - The definitive Guide, O'Reilly, https://gist.github.com/hammer/76996fb8426a0ada233e, http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter
Try AvroParquetWriter.builder:
MyData obj = ... // should be an Avro object
ParquetWriter<Object> pw = AvroParquetWriter.builder(file)
        .withSchema(obj.getSchema())
        .build();
pw.write(obj);
pw.close();
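For the Specific Mapping part of the question, here is a hedged sketch of how the builder is commonly used with a generated class. MyData stands for the class generated from the question's IDL/schema; withDataModel(SpecificData.get()) is the usual way to point parquet-avro at the specific (generated) data model, and the org.apache.parquet package names assume a recent parquet-avro release (older versions used the parquet.avro package), so verify against the version you build against:
import org.apache.avro.specific.SpecificData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetConversionSketch {
    // Sketch only: MyData is the hypothetical class generated from the question's schema.
    public static void write(MyData record) throws java.io.IOException {
        try (ParquetWriter<MyData> writer =
                AvroParquetWriter.<MyData>builder(new Path("data.parquet"))
                        .withSchema(MyData.getClassSchema())       // schema of the generated class
                        .withDataModel(SpecificData.get())         // specific (generated) objects, not GenericRecord
                        .withCompressionCodec(CompressionCodecName.SNAPPY)
                        .build()) {
            writer.write(record);
        }
    }
}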
Thanks.

How to test hadoop mapreduce with hdfs?

I am using MRUnit to write unit tests for my mapreduce jobs.
However, I am having trouble bringing HDFS into the mix. My MR job needs a file from HDFS. How do I mock out the HDFS part in an MRUnit test case?
Edit:
I know that I can specify inputs/expected output for my MR code in the test infrastructure. However, that is not what I want. My MR job needs to read another file that has domain data to do the job. This file is in HDFS. How do I mock out this file?
I tried using Mockito but it didn't work. The reason is that FileSystem.open() returns an FSDataInputStream, which implements several other interfaces besides extending java.io.InputStream. It was too painful to mock them all out. So I hacked around it in my code by doing the following:
if (System.getProperty("junit_running") != null) {
    inputStream = this.getClass().getClassLoader().getResourceAsStream("domain_data.txt");
    br = new BufferedReader(new InputStreamReader(inputStream));
} else {
    Path pathToRegionData = new Path("/domain_data.txt");
    LOG.info("checking for existence of region assignment file at path: " + pathToRegionData.toString());
    if (!fileSystem.exists(pathToRegionData)) {
        LOG.error("domain file does not exist at path: " + pathToRegionData.toString());
        throw new IllegalArgumentException("region assignments file does not exist at path: " + pathToRegionData.toString());
    }
    inputStream = fileSystem.open(pathToRegionData);
    br = new BufferedReader(new InputStreamReader(inputStream));
}
This solution is not ideal because I had to put test-specific code in my production code. I am still hoping there is a more elegant solution out there.
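One possible way to avoid the test-only branch, sketched here under the assumption that the mapper reads the path from its Configuration in setup(): resolve the FileSystem from the path itself, so a test can point the property at a local file:// URI while production uses an hdfs:// URI.
// Sketch: "domain.data.path" is a hypothetical config property set by the driver.
// In a unit test it can point at file:///..., in production at hdfs:///domain_data.txt.
Configuration conf = context.getConfiguration();
Path pathToRegionData = new Path(conf.get("domain.data.path", "/domain_data.txt"));
FileSystem fs = pathToRegionData.getFileSystem(conf);   // LocalFileSystem for file://, DFS for hdfs://
if (!fs.exists(pathToRegionData)) {
    throw new IllegalArgumentException("domain file does not exist at path: " + pathToRegionData);
}
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pathToRegionData)));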
Please follow this small MRUnit tutorial:
https://github.com/malli3131/HadoopTutorial/blob/master/MRUnit/Tutorial
In an MRUnit test case, we supply the data inside the testMapper() and testReducer() methods, so there is no need for input from HDFS in an MRUnit job. Only real MapReduce jobs require data inputs from HDFS.
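For illustration, a minimal sketch of supplying in-memory test data to the driver (MyMapper and the key/value types are hypothetical placeholders):
// Sketch only: MRUnit feeds the mapper in-memory records, so no HDFS is involved.
MapDriver<LongWritable, Text, Text, IntWritable> mapDriver =
        MapDriver.newMapDriver(new MyMapper());

mapDriver.withInput(new LongWritable(0), new Text("some input line"))
         .withOutput(new Text("some"), new IntWritable(1))
         .runTest();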

Hive setup()-like functionality similar to Mapper setup()?

I want to replace a Hadoop job with Hive. My challenge is that in Hadoop I use setup() to build a kdtree by reading reference data (points of interest) from the distributed cache. I then use the kdtree in map() to evaluate the distance of the target data against the kdtree.
In Hive, I want to use a UDF with an evaluate() method to determine the distance, but I don't know how to set up the kdtree with the reference data. Is this possible?
I probably don't have the entire answer, so I'm just going to throw out some ideas that might be of help.
You can add files to the distributed cache in hive using ADD FILE ...
Hive 11+ (I think) should let you access the distributed cache in GenericUDF.initialize:
https://issues.apache.org/jira/browse/HIVE-1016 which references...
https://issues.apache.org/jira/browse/HIVE-3628
So when you initialize the UDF, you might be able to build your kdtree by accessing the file you added in the distributed cache.
As climbage says, the ADD FILE command adds the file to the distributed cache.
You can access the distributed cache in your UDF simply by opening a file in the current directory,
i.e. open(new File(System.getProperty("user.dir") + "/myfile"));
You can use a ConstantObjectInspector to access the filename in the initialize method of GenericUDF, where you can open the file and read into memory into your data structure.
The distributed_map UDF of Brickhouse does something similar ( https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java )
Something like
public ObjectInspector initialize(ObjectInspector[] inspArr) {
    ConstantObjectInspector fileNameInsp = (ConstantObjectInspector) inspArr[0];
    String fileName = fileNameInsp.getWritableConstantValue().toString();
    FileInputStream inFile = new FileInputStream("./" + fileName);
    doStuff(inFile);
    .....
}
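A fuller sketch of the same pattern applied to the original kdtree question. Everything here is illustrative: the UDF takes the cached file name as a constant first argument plus lat/lon columns, loads the reference points once in initialize(), and a brute-force loop stands in for the real kdtree query.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ConstantObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;

// Sketch only: nearest_distance(file_name, lat, lon) -> distance to the closest reference point.
public class NearestDistanceUDF extends GenericUDF {

    private List<double[]> referencePoints;   // stand-in for the kdtree
    private PrimitiveObjectInspector latInsp;
    private PrimitiveObjectInspector lonInsp;

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // args[0] must be a constant file name, added to the distributed cache via ADD FILE.
        String fileName = ((ConstantObjectInspector) args[0]).getWritableConstantValue().toString();
        latInsp = (PrimitiveObjectInspector) args[1];
        lonInsp = (PrimitiveObjectInspector) args[2];

        referencePoints = new ArrayList<double[]>();
        try {
            BufferedReader br = new BufferedReader(new FileReader("./" + fileName));
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split(",");
                referencePoints.add(new double[] {
                        Double.parseDouble(parts[0]), Double.parseDouble(parts[1]) });
            }
            br.close();
        } catch (IOException e) {
            throw new UDFArgumentException("could not read " + fileName + ": " + e.getMessage());
        }
        return PrimitiveObjectInspectorFactory.javaDoubleObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        double lat = PrimitiveObjectInspectorUtils.getDouble(args[1].get(), latInsp);
        double lon = PrimitiveObjectInspectorUtils.getDouble(args[2].get(), lonInsp);
        double best = Double.MAX_VALUE;
        for (double[] p : referencePoints) {              // the real code would query the kdtree here
            best = Math.min(best, Math.hypot(p[0] - lat, p[1] - lon));
        }
        return best;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "nearest_distance";
    }
}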

Use MRUnit and AVRO together

I have created a Mapper & Reducer which use Avro for input, map output and reduce output. When running an MRUnit test I get the following stack trace:
java.lang.NullPointerException
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.mrunit.mock.MockOutputCollector.deepCopy(MockOutputCollector.java:74)
at org.apache.hadoop.mrunit.mock.MockOutputCollector.collect(MockOutputCollector.java:110)
at org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper$MockMapContext.write(MockMapContextWrapper.java:119)
at org.apache.avro.mapreduce.AvroMapper.writePair(AvroMapper.java:22)
at com.bol.searchrank.phase.day.DayMapper.doMap(DayMapper.java:29)
at com.bol.searchrank.phase.day.DayMapper.doMap(DayMapper.java:1)
at org.apache.avro.mapreduce.AvroMapper.map(AvroMapper.java:16)
at org.apache.avro.mapreduce.AvroMapper.map(AvroMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mrunit.mapreduce.MapDriver.run(MapDriver.java:200)
at org.apache.hadoop.mrunit.mapreduce.MapReduceDriver.run(MapReduceDriver.java:207)
at com.bol.searchrank.phase.day.DayMapReduceTest.shouldProduceAndCountTerms(DayMapReduceTest.java:39)
The driver is initialized as follows (I have created an Avro MapReduce API implementation):
driver = new MapReduceDriver<AvroWrapper<Pair<Utf8, LiveTrackingLine>>, NullWritable, AvroKey<Utf8>, AvroValue<Product>, AvroWrapper<Pair<Utf8, Product>>, NullWritable>().withMapper(new DayMapper()).withReducer(new DayReducer());
Adding a configuration object with io.serializations doesn't help:
Configuration configuration = new Configuration();
configuration.setStrings("io.serializations", new String[] {
AvroSerialization.class.getName()
});
driver = new MapReduceDriver<AvroWrapper<Pair<Utf8, LiveTrackingLine>>, NullWritable, AvroKey<Utf8>, AvroValue<Product>, AvroWrapper<Pair<Utf8, Product>>, NullWritable>().withMapper(new DayMapper()).withReducer(new DayReducer()).withConfiguration(configuration);
I use Hadoop & MRUnit 0.20.2-cdh3u2 from Cloudera and Avro MapRed 1.6.3.
You are getting a NPE because the SerializationFactory is not finding an acceptable class implementing Serialization in io.serializations.
MRUnit had several bugs related to serializations other than Writable, including MRUNIT-45, MRUNIT-70, MRUNIT-77 and MRUNIT-86 (see https://issues.apache.org/jira/browse/MRUNIT). These bugs involved the conf not getting passed to the SerializationFactory constructor correctly, or code that required a default constructor on the key or value class, which all Writables have. All of these fixes appear in Apache MRUnit 0.9.0-incubating, which will be released sometime this week.
Cloudera's 0.20.2-cdh3u2 MRUnit is close to Apache MRUnit 0.5.0-incubating. I think your code may still be a problem even in 0.9.0-incubating; please email your full code example to mrunit-user@incubator.apache.org and the Apache MRUnit project will be happy to take a look at it.
This will compile now that MRUNIT-99 relaxes the restriction on the K2 type parameter so that it no longer has to be Comparable.
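For completeness, a hedged sketch of the kind of configuration that is typically needed once on a release with those serialization fixes, keeping the default Writable serializers (NullWritable is used as a value type) alongside the Avro one; this is illustrative rather than a confirmed fix for the code above:
Configuration conf = new Configuration();
conf.setStrings("io.serializations",
        conf.get("io.serializations"),        // keep WritableSerialization for NullWritable
        AvroSerialization.class.getName());
driver.withConfiguration(conf);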

Using Distributed Cache with Pig on Elastic Map Reduce

I am trying to run my Pig script (which uses UDFs) on Amazon's Elastic Map Reduce.
I need to use some static files from within my UDFs.
I do something like this in my UDF:
public class MyUDF extends EvalFunc<DataBag> {
    public DataBag exec(Tuple input) {
        ...
        FileReader fr = new FileReader("./myfile.txt");
        ...
    }
    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("s3://path/to/myfile.txt#myfile.txt");
        return list;
    }
}
I have stored the file in my S3 bucket at /path/to/myfile.txt.
However, on running my Pig job, I see an exception:
Got an exception java.io.FileNotFoundException: ./myfile.txt (No such file or directory)
So, my question is: how do I use distributed cache files when running pig script on amazon's EMR?
EDIT: I figured out that pig-0.6, unlike pig-0.9, does not have a function called getCacheFiles(). Amazon does not support pig-0.9, so I need to figure out a different way to get the distributed cache working in 0.6.
I think adding this extra arg to the Pig command line call should work (with s3 or s3n, depending on where your file is stored):
-cacheFile s3n://bucket_name/file_name#cache_file_name
You should be able to add that in the "Extra Args" box when creating the Job flow.
