Write data that can be read by ProtobufPigLoader from Elephant Bird

For a project of mine, I want to analyse around 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the Elephant Bird library. However, it is not clear to me how to write a file to HDFS so that it can be consumed by the ProtobufPigLoader class.
This is what I have:
Pig script:
register ../fs-c/lib/*.jar; -- this includes the Elephant Bird library
register ../fs-c/*.jar;
raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');
Import tool (parts of it):
def getWriter(filenamePath: Path): ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true) // true = overwrite an existing file
  val writer = new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
  writer
}
val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish()
writer.close()
The import tool runs fine. I had a few problems with the ProtobufPigLoader because I cannot use the hadoop-lzo compression library, and without a fix (see here) ProtobufPigLoader does not work. The actual problem is that DUMP raw_data; returns "Unable to open iterator for alias raw_data" and ILLUSTRATE raw_data; returns "No (valid) input data found!".
To me, it looks like the data written by ProtobufBlockWriter cannot be read by the ProtobufPigLoader. But what should I use instead? How do I write data to HDFS from an external tool so that it can be processed by ProtobufPigLoader?
Alternative question: What should I use instead? How do I write fairly large objects to Hadoop so that they can be consumed with Pig? The objects are not very complex, but each contains a large list of sub-objects (a repeated field in Protobuf).
I want to avoid any text format or JSON because they are simply too large for my data. I expect they would bloat the data by a factor of 2 or 3 (lots of integers, and lots of byte strings that I would need to encode as Base64).
I also want to avoid normalizing the data so that the id of the main object is attached to each of the sub-objects (this is what is done now), because this also blows up the space consumption and makes joins necessary in later processing.
Updates:
I did not generate protobuf loader classes; I use the reflection-based loader instead.
The protobuf classes are in a jar that is registered. DESCRIBE correctly shows the types.

Related

Is it possible to read a protobufs encoded message without using code generation?

When I am decoding protobuf encoded data, I use parseFrom() method which is available in the code generated by protoc.
What I want to know is, is there a way to load protocol buffers data into some kind of a generic object, and read data from it using field names or tag numbers, without using code generation?
This is available in Avro with GenericRecord. What I want to know is, whether there is a similar capability in protobuf too.
Found it
Compile the proto file into a desc file
protoc --descriptor_set_out=point.desc --include_imports point.proto
Load the desc file.
InputStream input = new FileInputStream("point.desc");
DescriptorProtos.FileDescriptorSet descriptorSet = DescriptorProtos.FileDescriptorSet.parseFrom(input);
DescriptorProtos.FileDescriptorProto fileDescriptorProto = descriptorSet.getFile(0);
Descriptors.Descriptor messageDescriptor = Descriptors.FileDescriptor
    .buildFrom(fileDescriptorProto, new Descriptors.FileDescriptor[0])
    .findMessageTypeByName("Point");
Use the loaded descriptor to decode data.
DynamicMessage dynamicMessage = DynamicMessage.parseFrom(messageDescriptor, encodedBytes);
int x = (int) dynamicMessage.getField(messageDescriptor.findFieldByName("x"));

How to initialize PYFMI models in parallel?

I am using pyfmi to run simulations with EnergyPlus. I noticed that initializing the individual EnergyPlus models takes quite some time, so I am hoping to find a way to initialize the models in parallel. I tried the Python library multiprocessing without success. If it matters, I am on Ubuntu 16.10 and use Python 3.6.
Here is what I want to get done in serial:
fmus = {}
for id in id_list:
    chdir(fmu_path + str(id))
    fmus[id] = load_fmu('f_' + str(id) + '.fmu', fmu_path + str(id))
    fmus[id].initialize(start_time, final_time)
The result is a dictionary with ids as key and the models as value: {id1:FMUModelCS1,id2:FMUModelCS1}
The purpose is to look up the models by their key later and run simulations.
Here is my attempt with multiprocessing:
def ep_intialization(id, start_time, final_time):
    chdir(fmu_path + str(id))
    model = load_fmu('f_' + str(id) + '.fmu', fmu_path + str(id))
    model.initialize(start_time, final_time)
    return {id: model}

data = ((id, start_time, final_time) for id in id_list)

if __name__ == '__main__':
    pool = Pool(processes=cpus)
    pool.starmap(ep_intialization, data)
    pool.close()
    pool.join()
I can see the processes of the models in my system monitor, but then the script raises an error because the models are not picklable:
MaybeEncodingError: Error sending result: '[{id2: <pyfmi.fmi.FMUModelCS1 object at 0x561eaf851188>}]'. Reason: 'TypeError('self._fmu,self.callBackFunctions,self.callbacks,self.context,self.variable_list cannot be converted to a Python object for pickling',)'
But I cannot imagine that there is no way to initialize the models in parallel. Other frameworks/libraries than threading/multiprocessing are also welcome.
I saw this answer but it seems that it focuses on the simulations after initialization.
The answer below the one you refer to seems to explain what the problem with multiprocessing and FMU instantiation is.
I tried pathos, as suggested in this answer, but ran into the same problem:
from pyfmi import load_fmu
from os import chdir
from pathos.multiprocessing import Pool

def ep_intialization(id):
    chdir('folder' + str(id))
    model = load_fmu('BouncingBall.fmu')
    model.initialize(0, 10)
    return {id: model}  # the model object has to be pickled to get back to the parent

id_list = [1, 2]
cpus = 2
data = ((id) for id in id_list)
pool = Pool(cpus)
out = pool.map(ep_intialization, data)
This gives:
MaybeEncodingError: Error sending result: '[{1: <pyfmi.fmi.FMUModelME2 object at 0x564e0c529290>}]'. Reason: 'TypeError('self._context,self._fmu,self.callBackFunctions,self.callbacks cannot be converted to a Python object for pickling',)'
Here is another idea:
I suppose the instantiation is slow because EnergyPlus links plenty of libraries into the FMU. If the components you are modelling all have the same interface (input, output, parameters), you can probably use a single FMU with an additional parameter that switches between the models.
This would be much more efficient: You would only have to instantiate a single FMU and could call it in parallel with different parameters and inputs.
Example:
I have never worked with EnergyPlus, but maybe the following example will illustrate the approach:
You have three variants of a building and you are only interested in the total heat flux over the entire surface area of the building as a function of the "weather" (whatever that means; maybe a lot of variables).
Put all three buildings into a single EnergyPlus model and build an if or case clause around them (pseudo code):
if (id_building == 1) {
    [model building one]
} else if (id_building == 2) {
    [model building two]
}
[...]
Define the "weather" or whatever you need as an input variable for the FMU and define id_building also as a parameter. Define the overall heat flux as output variable.
This would allow you to choose the building before starting the simulation.
The two requirements are:
EnergyPlus Syntax allows if or case structures.
All your models work with the same interface (in our example we have weather as in and a flux as out variables)
There is a dirty workaround for the second requirement: Just define all the variables all your models need and only use what you need in the respective if block.

Parquet-MR AvroParquetWriter - how to convert data to Parquet (with Specific Mapping)

I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis and on client machines which is why I try to write my own standalone conversion tool with a (Avro|Parquet|JSON) switch instead of using Drill or Spark or other tools as converters as I probably would if this was a one time job. I'm basing the whole thing on Avro because this seems like the easiest way to get conversion to Parquet and JSON under one hood.
I used Specific Mapping to profit from static type checking, wrote an IDL, converted that to a schema.avsc, generated classes and set up a sample conversion with specific constructor, but now I'm stuck configuring the writers. All Avro-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.
AvroParquetWriter has only one non-deprecated constructor, with this signature:
AvroParquetWriter(
    Path file,
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize,
    int pageSize,
    boolean enableDictionary,
    boolean enableValidation,
    WriterVersion writerVersion,
    Configuration conf
)
Most of the parameters are not hard to figure out but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter I see GenericData model pop up a few times but only one line mentioning SpecificData: GenericData model = SpecificData.get();.
So I have a few questions:
1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it, by means of that SpecificData.get() method? The comment "Utilities for generated Java classes and interfaces." above SpecificData.class seems to suggest so, but how exactly should I proceed?
2) What's going on in the AvroParquetWriter constructor, is there an example or some documentation to be found somewhere?
3) More specifically: the signature of the WriteSupport method asks for 'Schema avroSchema' and 'GenericData model'. What does GenericData model refer to? Maybe I'm not seeing the forest for the trees here...
To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:
DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
The Parquet equivalent currently looks like this:
AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);
but this is not more than a beginning and is modeled after the examples I found, using the deprecated constructor, so will have to change anyway.
Thanks,
Thomas
[0] Hadoop - The definitive Guide, O'Reilly, https://gist.github.com/hammer/76996fb8426a0ada233e, http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter
Try AvroParquetWriter.builder:
MyData obj = ... // should be an Avro object
ParquetWriter<Object> pw = AvroParquetWriter.builder(file)
    .withSchema(obj.getSchema())
    .build();
pw.write(obj);
pw.close();
Thanks.

Save image (via ImageWriter / FileImageOutputStream) to the filesystem without use of a File object

As a learning task I am converting my software I use every day to NIO, with the somewhat arbitrary objective of having zero remaining instances of java.io.File.
I have been successful in every case except one. It seems an ImageWriter can only write to a FileImageOutputStream which requires a java.io.File.
Path path = Paths.get(inputFileName);
InputStream is = Files.newInputStream(path, StandardOpenOption.READ);
BufferedImage bi = ImageIO.read(is);
...
Iterator<ImageWriter> iter = ImageIO.getImageWritersBySuffix("jpg");
ImageWriter writer = iter.next();
ImageWriteParam param = writer.getDefaultWriteParam();
File outputFile = new File(outputFileName);
ImageOutputStream ios = new FileImageOutputStream(outputFile);
IIOImage iioi = new IIOImage(bi, null, null);
writer.setOutput(ios);
writer.write(null, iioi, param);
...
Is there a way to do this with a java.nio.file.Path? The java 8 api doc for ImageWriter only mentions FileImageOutputStream.
I understand there might only be a symbolic value to doing this, but I was under the impression that NIO is intended to provide a complete alternative to java.io.File.
A RandomAccessFile, constructed with just a String for a filename, can be supplied to the FileImageOutputStream constructor.
This doesn't "use NIO" any more than just using the File in the first place, but it doesn't require File to be used directly.
For direct support of Path (or to "use NIO"), the FileImageOutputStream (or RandomAccessFile) could be extended, or a new type implementing the ImageOutputStream interface could be created, but how much work is it worth?
The intended way to instantiate an ImageInputStream or ImageOutputStream in the javax.imageio API, is through the ImageIO.createImageInputStream() and ImageIO.createImageOutputStream() methods.
You will see that both of these methods take an Object as their parameter. Internally, ImageIO will use a service lookup mechanism and delegate the creation to a provider able to create a stream based on the parameter. By default, there are providers for File, RandomAccessFile and InputStream (OutputStream on the output side).
But the mechanism is extensible. See the API doc for the javax.imageio.spi package for a starting point. If you like, you can create a provider that takes a java.nio.file.Path and creates a FileImageOutputStream based on it, or alternatively create your own implementation using some fancier NIO backing (e.g. SeekableByteChannel).
Here's source code for a sample provider and stream I created to read images from a byte array, that you could use as a starting point.
(Of course, I have to agree with @user2864740's thoughts on the cost/benefit of doing this, but as you are doing this for the sake of learning, it might make sense.)
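Short of writing a custom provider, the built-in OutputStream provider already gets you a File-free call site. Here is a minimal sketch, assuming a JPEG writer plugin is available (it is in standard JDKs). The names PathImageWriter and writeJpeg are made up for illustration. Note that with the default cache setting ImageIO may still spool through a temporary file internally, so the sketch switches to the in-memory cache, which is a JVM-wide setting.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

// Hypothetical helper, not part of javax.imageio.
final class PathImageWriter {

    static void writeJpeg(BufferedImage image, Path target) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersBySuffix("jpg");
        if (!writers.hasNext()) {
            throw new IOException("No JPEG ImageWriter available");
        }
        ImageWriter writer = writers.next();

        // Global setting: use a memory cache instead of a temp-file cache.
        ImageIO.setUseCache(false);

        // The stream comes from NIO; ImageIO wraps it via its OutputStream provider.
        try (OutputStream out = Files.newOutputStream(target);
             ImageOutputStream ios = ImageIO.createImageOutputStream(out)) {
            writer.setOutput(ios);
            writer.write(null, new IIOImage(image, null, null),
                         writer.getDefaultWriteParam());
        } finally {
            writer.dispose();
        }
    }
}
```

Usage would be something like PathImageWriter.writeJpeg(bi, Paths.get(outputFileName)). The trade-off is that the whole encoded image is buffered in memory until the stream is closed, which is fine for ordinary photos but may matter for very large images.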

Hive setup()-like functionality similar to Mapper setup()?

I want to replace a Hadoop job with Hive. My challenge is in Hadoop, I'm using setup() to build a kdtree by reading in reference data (points of interest) from the distributed cache. I then use the kdtree in map() to evaluate distance of the target data against the kdtree.
In Hive, I wanted to use a UDF with an evaluate() method to determine the distance, but I don't know how to set up the kdtree with the reference data. Is this possible?
I probably don't have the entire answer, so I'm just going to throw out some ideas that might be of help.
You can add files to the distributed cache in hive using ADD FILE ...
Hive 11+ (I think) should let you access the distributed cache in GenericUDF.initialize
https://issues.apache.org/jira/browse/HIVE-1016 which references...
https://issues.apache.org/jira/browse/HIVE-3628
So when you initialize the UDF, you might be able to build your kdtree by accessing the file you added in the distributed cache.
As climbage says, the ADD FILE command adds the file to the distributed cache.
You can access the distributed cache in your UDF simply by opening a file in the current working directory,
i.e. open(new File(System.getProperty("user.dir") + "/myfile"));
You can use a ConstantObjectInspector to access the filename in the initialize method of GenericUDF, where you can open the file and read into memory into your data structure.
The distributed_map UDF of Brickhouse does something similar ( https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java )
Something like
public ObjectInspector initialize(ObjectInspector[] inspArr) {
    ConstantObjectInspector fileNameInsp = (ConstantObjectInspector) inspArr[0];
    String fileName = fileNameInsp.getWritableConstantValue().toString();
    FileInputStream inFile = new FileInputStream("./" + fileName);
    doStuff( inFile );
    .....
}
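To make the doStuff part a bit more concrete, here is a rough, Hive-independent sketch of the "load once, query per row" helper that initialize could build and evaluate could then call. Everything in it is an assumption for illustration: the class name ReferencePoints, the "x,y" one-point-per-line file format, and the linear scan, which only stands in for a real kd-tree lookup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper holding the reference data for one task.
final class ReferencePoints {
    private final List<double[]> points = new ArrayList<>();

    // Reads lines of the assumed form "x,y"; blank lines are skipped.
    static ReferencePoints load(Path file) throws IOException {
        ReferencePoints ref = new ReferencePoints();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split(",");
            ref.points.add(new double[] {
                Double.parseDouble(parts[0].trim()),
                Double.parseDouble(parts[1].trim())
            });
        }
        return ref;
    }

    // Files shipped with ADD FILE are opened relative to the working directory.
    static ReferencePoints loadFromWorkingDir(String fileName) throws IOException {
        return load(Paths.get(".", fileName));
    }

    int size() {
        return points.size();
    }

    // Linear scan over all points; a kd-tree would replace this loop.
    double nearestDistance(double x, double y) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] p : points) {
            best = Math.min(best, Math.hypot(p[0] - x, p[1] - y));
        }
        return best;
    }
}
```

The idea would be to keep the ReferencePoints instance in a field of the UDF, fill it in initialize using the constant file name, and call nearestDistance from evaluate, so the file is read once per task rather than once per row.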
