Parquet-MR AvroParquetWriter - how to convert data to Parquet (with Specific Mapping)

I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis and on client machines, which is why I'm trying to write my own standalone conversion tool with an (Avro|Parquet|JSON) switch instead of using Drill, Spark or other tools as converters, as I probably would if this were a one-time job. I'm basing the whole thing on Avro because this seems like the easiest way to get conversion to Parquet and JSON under one hood.
I used Specific Mapping to profit from static type checking, wrote an IDL, converted that to a schema.avsc, generated classes and set up a sample conversion with specific constructor, but now I'm stuck configuring the writers. All Avro-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.
AvroParquetWriter has only one non-deprecated constructor, with this signature:
AvroParquetWriter(
    Path file,
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize,
    int pageSize,
    boolean enableDictionary,
    boolean enableValidation,
    WriterVersion writerVersion,
    Configuration conf
)
Most of the parameters are not hard to figure out but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter I see GenericData model pop up a few times but only one line mentioning SpecificData: GenericData model = SpecificData.get();.
So I have a few questions:
1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it by means of that SpecificData.get() method? The comment "Utilities for generated Java classes and interfaces." above SpecificData.class seems to suggest that, but how exactly should I proceed?
2) What's going on in the AvroParquetWriter constructor, is there an example or some documentation to be found somewhere?
3) More specifically: the signature of the WriteSupport method asks for 'Schema avroSchema' and 'GenericData model'. What does GenericData model refer to? Maybe I can't see the forest for the trees here...
To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:
DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
The Parquet equivalent currently looks like this:
AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);
but this is no more than a beginning and is modeled after the examples I found that use the deprecated constructor, so it will have to change anyway.
Thanks,
Thomas
[0] Hadoop - The definitive Guide, O'Reilly, https://gist.github.com/hammer/76996fb8426a0ada233e, http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter

Try AvroParquetWriter.builder:
MyData obj = ... // should be an Avro object (here: an instance of the generated MyData class)
ParquetWriter<Object> pw = AvroParquetWriter.builder(file)
        .withSchema(obj.getSchema())
        .build();
pw.write(obj);
pw.close();
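To flesh this out for the specific-mapping case in the question, here is a minimal sketch, assuming the newer org.apache.parquet artifacts (older releases use the parquet.avro package), the generated class MyData, and a hypothetical record instance myDataRecord. withDataModel(SpecificData.get()) makes the specific mapping explicit, matching the SpecificData.get() default the question spotted in the source:
import org.apache.avro.specific.SpecificData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
// Sketch only: MyData is the Avro-generated specific class, myDataRecord one of its instances.
Path parquetOutput = new Path("mydata.parquet");
ParquetWriter<MyData> parquetWriter = AvroParquetWriter.<MyData>builder(parquetOutput)
        .withSchema(MyData.getClassSchema())               // schema of the generated class
        .withDataModel(SpecificData.get())                 // use Avro's specific mapping
        .withCompressionCodec(CompressionCodecName.SNAPPY) // optional
        .build();
try {
    parquetWriter.write(myDataRecord);
} finally {
    parquetWriter.close();
}
With the builder, the WriteSupport from the nine-argument constructor is created internally, so you should not need to deal with it directly.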
Thanks.

Related

what is the type of flatbuffer in call to UnPackTo

I just started to understand how FlatBuffers work. The documentation is good. In the section on usage in C++ I see the following example:
// Autogenerated class from table Monster.
MonsterT monsterobj;
// Deserialize from buffer into object.
UnPackTo(&monsterobj, flatbuffer);
// Update object directly like a C++ class instance.
cout << monsterobj->name; // This is now a std::string!
monsterobj->name = "Bob"; // Change the name.
// Serialize into new flatbuffer.
FlatBufferBuilder fbb;
Pack(fbb, &monsterobj);
My question is: what is the type of flatbuffer? Nowhere in the documentation is it mentioned. Is it the binary buffer, either read from a file or received over the network?
This is the link from which I copied the above sample code:
https://google.github.io/flatbuffers/flatbuffers_guide_use_cpp.html
That documentation looks out of date, it should probably be GetMonster(flatbuffer)->UnPackTo(&monsterobj) where flatbuffer is a pointer to the bytes containing the binary FlatBuffer representation.
The above, however, is part of the "object API", which you should only be using if convenience is more important than performance. Read about the base API here: https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html

Model evaluation in Stanford NER

I'm doing a project with the NER module from Stanford CoreNLP and I'm currently having some issues with the evaluation of the model.
I'm using the API to call the functionality from inside a java program instead of using the command line arguments and so far I've managed to train the model from several training files (in a tab-separated format; 2 columns with token and annotation/answer) and to serialize it to a file which was pretty easy.
Now I'm trying to evaluate the model I've trained on some test files (precision, recall, F1) and I'm kinda stuck there. First of all, what format should the test files be in? I'm assuming they should be the same as the training files (tab-separated), which would be the logical thing. I've looked through the JavaDoc documentation for information on how to use the classify method and also had a look at NERDemo.java. I've managed to get the classifyToString method to work, but that doesn't really help me with the evaluation. I've found the classifyAndWriteAnswers(String testFile, DocumentReaderAndWriter<IN> readerWriter, boolean outputScores) method, which I assume would give me the precision and recall scores if I set outputScores to true.
However, I can't manage to get this to work. Which DocumentReaderAndWriter should I use as the second argument?
This is what I've got right now:
public static void evaluate(CRFClassifier classifier, File testFile) {
    try {
        classifier.classifyAndWriteAnswers(testFile.getPath(), new PlainTextDocumentReaderAndWriter(), true);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
This is what I get:
Unchecked call to 'classifyAndWriteAnswers(String, DocumentReaderAndWriter<IN>, boolean)' as a member of raw type 'edu.stanford.nlp.ie.AbstractSequenceClassifier'
Also, do I pass the path to the test file as the first argument or rather the file itself loaded into a String? Some help would be greatly appreciated.
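Not a verified answer, but a hedged sketch of how that call is often parameterized: typing the classifier as CRFClassifier<CoreLabel> removes the unchecked-call warning, the first argument is the path to the test file (not its contents), and makeReaderAndWriter() (assuming your CoreNLP version offers it on AbstractSequenceClassifier) builds the reader/writer from the same properties used for training, which suits a tab-separated column format better than PlainTextDocumentReaderAndWriter:
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import java.io.File;
import java.io.IOException;
public static void evaluate(CRFClassifier<CoreLabel> classifier, File testFile) {
    try {
        // outputScores = true should make it print precision/recall/F1 per entity class.
        classifier.classifyAndWriteAnswers(testFile.getPath(),
                classifier.makeReaderAndWriter(), true);
    } catch (IOException e) {
        e.printStackTrace();
    }
}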

Mongoengine Django Rest Framework - Serializer Error - ReferenceField is not JSON serializable

Everything works great until the ObjectId value of the ReferenceField no longer points to a valid document. Then the ObjectId is left as the value, and the JSON encoder doesn't know how to serialize it.
How do I deal with invalid ReferenceFields?
E.g.
class Food(Document):
    name = StringField()
    owner = ReferenceField("Person")
class Person(Document):
    first_name = StringField()
    last_name = StringField()
...
p = Person(...)
apple = Food(name="apple", owner=p)
p.delete()  # might be the wrong method, but you get the idea
At this point, attempting to fetch a list of foods via the REST API will fail with the "is not JSON serializable" error, since apple.owner no longer points to an owner that exists.
Since you are using DRF with mongoengine, you must be using django-rest-framework-mongoengine.
Apparently, it's a bug in django-rest-framework-mongoengine. Check this open issue on GitHub, which was reported recently regarding the same problem:
https://github.com/umutbozkurt/django-rest-framework-mongoengine/issues/91
One way is to write your own JSONEncoder for this. This link might help.
Another option is to use the json_util library of Pymongo. They provide explicit BSON conversion to and from json.
As per json-util docs:
This module provides two helper methods dumps and loads that wrap the
native json methods and provide explicit BSON conversion to and from
json. This allows for specialized encoding and decoding of BSON
documents into Mongo Extended JSON's Strict mode. This lets you encode
/ decode BSON documents to JSON even when they use special BSON types.

Save image (via ImageWriter / FileImageOutputStream) to the filesystem without use of a File object

As a learning task I am converting my software I use every day to NIO, with the somewhat arbitrary objective of having zero remaining instances of java.io.File.
I have been successful in every case except one. It seems an ImageWriter can only write to a FileImageOutputStream which requires a java.io.File.
Path path = Paths.get(inputFileName);
InputStream is = Files.newInputStream(path, StandardOpenOption.READ);
BufferedImage bi = ImageIO.read(is);
...
Iterator<ImageWriter> iter = ImageIO.getImageWritersBySuffix("jpg");
ImageWriter writer = iter.next();
ImageWriteParam param = writer.getDefaultWriteParam();
File outputFile = new File(outputFileName);
ImageOutputStream ios = new FileImageOutputStream(outputFile);
IIOImage iioi = new IIOImage(bi, null, null);
writer.setOutput(ios);
writer.write(null, iioi, param);
...
Is there a way to do this with a java.nio.file.Path? The java 8 api doc for ImageWriter only mentions FileImageOutputStream.
I understand there might only be a symbolic value to doing this, but I was under the impression that NIO is intended to provide a complete alternative to java.io.File.
A RandomAccessFile, constructed with just a String for a filename, can be supplied to the FileImageOutputStream constructor.
This doesn't "use NIO" any more than just using the File in the first place, but it doesn't require File to be used directly.
For direct support of Path (or to "use NIO"), the FileImageOutputStream (or RandomAccessFile) could be extended, or a type deriving from the ImageOutputStream interface created, but... how much work is it worth?
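A minimal sketch of that suggestion, assuming outputFileName is the String from the question's snippet and writer is the ImageWriter already obtained there:
import java.io.RandomAccessFile;
import javax.imageio.stream.FileImageOutputStream;
import javax.imageio.stream.ImageOutputStream;
// Constructed from the file name String, so no java.io.File appears anywhere.
RandomAccessFile raf = new RandomAccessFile(outputFileName, "rw");
ImageOutputStream ios = new FileImageOutputStream(raf);
writer.setOutput(ios);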
The intended way to instantiate an ImageInputStream or ImageOutputStream in the javax.imageio API, is through the ImageIO.createImageInputStream() and ImageIO.createImageOutputStream() methods.
You will see that both of these methods take an Object as their parameter. Internally, ImageIO will use a service lookup mechanism, and delegate the creation to a provider able to create a stream based on the parameter. By default, there are providers for File, RandomAccessFile and InputStream.
But the mechanism is extendable. See the API doc for the javax.imageio.spi package for a starting point. If you like, you can create a provider that takes a java.nio.file.Path and creates a FileImageOutputStream based on it, or alternatively create your own implementation using some more fancy NIO backing (i.e. SeekableByteChannel).
Here's source code for a sample provider and stream I created to read images from a byte array, that you could use as a starting point.
(Of course, I have to agree with @user2864740's thoughts on the cost/benefit of doing this, but as you are doing this for the sake of learning, it might make sense.)
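To make the stream-provider route concrete, here is a hedged sketch reusing bi and outputFileName from the question's snippet: Files.newOutputStream() yields a plain OutputStream, and ImageIO.createImageOutputStream() wraps it through the default OutputStream provider, so no java.io.File is involved.
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
Path outputPath = Paths.get(outputFileName);
try (OutputStream os = Files.newOutputStream(outputPath);
     ImageOutputStream ios = ImageIO.createImageOutputStream(os)) {
    ImageWriter writer = ImageIO.getImageWritersBySuffix("jpg").next();
    writer.setOutput(ios);
    writer.write(null, new IIOImage(bi, null, null), writer.getDefaultWriteParam());
    writer.dispose();
    // Closing the streams flushes the cache-backed ImageOutputStream to the underlying OutputStream.
}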

Write data that can be read by ProtobufPigLoader from Elephant Bird

For a project of mine, I want to analyse around 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the "elephant bird" library. However, it is not totally clear to me how to write a file to HDFS so that it can be consumed by the ProtobufPigLoader class.
This is what I have:
Pig script:
register ../fs-c/lib/*.jar -- this includes the elephant bird library
register ../fs-c/*.jar
raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');
Import tool (parts of it):
def getWriter(filenamePath: Path): ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true)
  val writer = new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
  return writer
}
val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish()
writer.close()
The import tool runs fine. I had a few problems with the ProtobufPigLoader because I cannot use the hadoop-lzo compression library, and without a fix (see here) ProtobufPigLoader doesn't work. The problem now is that DUMP raw_data; returns Unable to open iterator for alias raw_data and ILLUSTRATE raw_data; returns No (valid) input data found!
To me, it looks like the data written by ProtobufBlockWriter cannot be read by ProtobufPigLoader. But what should I use instead? How do I write data from an external tool to HDFS so that it can be processed by ProtobufPigLoader?
Alternative question: what should I use instead? How do I write fairly large objects to Hadoop to consume them with Pig? The objects are not very complex, but contain a large list of sub-objects (a repeated field in Protobuf).
I want to avoid any text format or JSON because they are simply too large for my data. I expect they would bloat the data by a factor of 2 or 3 (lots of integers, and lots of byte strings that I would need to encode as Base64).
I also want to avoid normalizing the data so that the id of the main object is attached to each of the sub-objects (this is what is done now), because that also blows up the space consumption and makes joins necessary in later processing.
Updates:
I didn't use the generated protobuf loader classes, but the reflection-based loader instead.
The protobuf classes are in a jar that is registered. DESCRIBE correctly shows the types.
