Google Protocol Buffers - Storing messages into file - protocol-buffers

I'm using Google Protocol Buffers to serialize equity market data (i.e. timestamp, bid, and ask fields).
I can store one message into a file and deserialize it without issue.
How can I store multiple messages into a single file? Not sure how I can separate the messages. I need to be able to append new messages to the file on the fly.

I would recommend using the writeDelimitedTo(OutputStream) and parseDelimitedFrom(InputStream) methods on Message objects. writeDelimitedTo writes the length of the message before the message itself; parseDelimitedFrom then uses that length to read only one message and no farther. This allows multiple messages to be written to a single OutputStream to then be parsed separately. For more information, see https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/MessageLite#writeDelimitedTo(java.io.OutputStream)
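The framing described above (a length, then the message bytes) can be sketched in plain Python. This is only an illustration under stated assumptions: it assumes the length prefix is a base-128 varint, the payloads are placeholder byte strings standing in for serialized messages, and the helper names (write_delimited, read_delimited, read_all) are hypothetical, not part of the protobuf API.

```python
import io


def write_varint(out, value):
    """Write a non-negative integer as a base-128 varint."""
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.write(bytes([byte | 0x80]))  # more bytes follow
        else:
            out.write(bytes([byte]))         # last byte
            return


def read_varint(stream):
    """Read one varint; return None on a clean end of stream."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("stream ended inside a varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7


def write_delimited(out, payload):
    """Write the payload length as a varint, then the payload itself."""
    write_varint(out, len(payload))
    out.write(payload)


def read_delimited(stream):
    """Read exactly one length-prefixed payload, or None at end of stream."""
    size = read_varint(stream)
    if size is None:
        return None
    payload = stream.read(size)
    if len(payload) != size:
        raise EOFError("truncated message")
    return payload


def read_all(stream):
    """Read every length-prefixed payload until the stream is exhausted."""
    messages = []
    while True:
        payload = read_delimited(stream)
        if payload is None:
            return messages
        messages.append(payload)
```

To append on the fly, open the file in binary append mode ("ab") and call write_delimited once per serialized message.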

From the docs:
http://code.google.com/apis/protocolbuffers/docs/techniques.html#streaming
Streaming Multiple Messages
If you want to write multiple messages to a single file or stream, it
is up to you to keep track of where one message ends and the next
begins. The Protocol Buffer wire format is not self-delimiting, so
protocol buffer parsers cannot determine where a message ends on their
own. The easiest way to solve this problem is to write the size of
each message before you write the message itself. When you read the
messages back in, you read the size, then read the bytes into a
separate buffer, then parse from that buffer. (If you want to avoid
copying bytes to a separate buffer, check out the CodedInputStream
class (in both C++ and Java) which can be told to limit reads to a
certain number of bytes.)

Protobuf does not include a terminator per outermost record, so you need to do that yourself. The simplest approach is to prefix the data with the length of the record that follows. Personally, I tend to write a string header (the tag for an arbitrary field number, as a length-delimited field), then the length as a "varint". This means the entire document is then itself a valid protobuf, and could be consumed as an object with a "repeated" element. However, a fixed-length (typically 32-bit little-endian) marker would do just as well. With any such storage, the file is appendable, as you require.
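A rough sketch of the field-header approach, in plain Python. The field number 1 is an arbitrary choice, the payloads are placeholder bytes standing in for serialized records, and the helper names are hypothetical. Because each record is written as a length-delimited field, the whole file could also be parsed as one wrapper message with a repeated field 1 (for example a hypothetical message Log { repeated Quote quotes = 1; }).

```python
import io

FIELD_NUMBER = 1                 # arbitrary field number (assumption)
TAG = (FIELD_NUMBER << 3) | 2    # wire type 2 = length-delimited


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(stream):
    """Decode one varint; return None on a clean end of stream."""
    result, shift = 0, 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("stream ended inside a varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7


def append_record(stream, payload):
    """Write field header, varint length, then the record bytes."""
    stream.write(encode_varint(TAG))
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_records(stream):
    """Read back every record written by append_record."""
    records = []
    while True:
        tag = decode_varint(stream)
        if tag is None:
            return records
        if tag != TAG:
            raise ValueError("unexpected field header: %d" % tag)
        size = decode_varint(stream)
        if size is None:
            raise EOFError("missing length after field header")
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated record")
        records.append(payload)
```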

If you're looking for a C++ solution, Kenton Varda submitted a patch to protobuf around August 2015 that adds support for writeDelimitedTo() and readDelimitedFrom() calls that will serialize/deserialize a sequence of proto messages to/from a file in a way that's compatible with the Java version of these calls. Unfortunately this patch hasn't been approved yet, so if you want the functionality you'll need to merge it yourself.
Another option: Google has open-sourced protobuf file reading/writing code through other projects. The or-tools library, for example, contains the classes RecordReader and RecordWriter, which serialize/deserialize a proto stream to a file.
If you would like stand-alone versions of these classes that have almost no external dependencies, I have a fork of or-tools that contains only these classes. See: https://github.com/moof2k/recordio
Reading and writing with these classes is straightforward:
File* file = File::Open("proto.log", "w");
RecordWriter writer(file);
writer.WriteProtocolMessage(msg1);
writer.WriteProtocolMessage(msg2);
...
writer.Close();

An easier way is to base64 encode each message and store it as a record per line.

Related

How to open protocol buffer file

For example I have protocol buffer file compressed in snappy-format
file.pbuf.sn
How can I view the file's content? Which programs are recommended for working with protocol buffer files?
There are two separate steps here:
un-snappy the file container
process the contents that are presumably protobuf
If you're trying to do this through code, then obviously each will depend on your target language/platform/etc. Presumably "snappy" tools are available from Google (who created "snappy", IIRC).
Once you have the contents, it depends on whether it is a .proto schema, binary data contents, JSON data contents, or some combination. If you have a schema for the data, then run it through "protoc" or the language/platform-specific tool of your choice to get the generated code that matches the schema. Then you can run either binary or JSON data through that generated code to get a populated object model.
If you don't have a schema: if it is JSON, you should be able to understand the data via the names. Just run it through your chosen JSON tooling.
If it is binary data without a schema, things are tougher. Protobuf data doesn't include names, and the same bytes can be interpreted in multiple ways (so: the same bytes can have come from multiple source values). So you'll have to reverse-engineer the meaning of each field. "protoc" has a schema-less decode mode (--decode_raw) that might help with this, as does https://protogen.marcgravell.com/decode
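To give a feel for what a schema-less decode can and cannot recover, here is a minimal Python sketch that walks the top-level fields of a binary protobuf payload. It assumes well-formed input, handles only wire types 0, 1, 2 and 5, and the function name decode_raw is just a label for this sketch. It reports field numbers and raw values only; names and exact types (for example, whether a length-delimited field is a string, bytes, or a nested message) still have to be reverse-engineered.

```python
def read_varint(data, pos):
    """Decode a varint starting at data[pos]; return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_raw(data):
    """Return a list of (field_number, wire_type, raw_value) tuples."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:            # varint (ints, bools, enums)
            value, pos = read_varint(data, pos)
        elif wire_type == 1:          # fixed 64-bit (double, fixed64)
            value = data[pos:pos + 8]
            pos += 8
        elif wire_type == 2:          # length-delimited (string, bytes, nested)
            size, pos = read_varint(data, pos)
            value = data[pos:pos + size]
            pos += size
        elif wire_type == 5:          # fixed 32-bit (float, fixed32)
            value = data[pos:pos + 4]
            pos += 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        if pos > len(data):
            raise ValueError("truncated field %d" % field_number)
        fields.append((field_number, wire_type, value))
    return fields
```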

IBM Filenet p8 concurrently reading document content

I want to read a document's content from FileNet P8 in parallel to reduce my reading time. The complication is that I write the content into an OutputStream. Is there any way, or any API, to parallelize my reads into an OutputStream? I am asking because I expect IBM to have provided some way to do this.
For example, if my file is 1 GB, sequential reads are going to be a performance hit.
I think from a Document instance there's only one API to retrieve the content - accessContentStream which gives you an object of InputStream. However, for reading huge files there's a new util class called ExtendedInputStream which you might be interested in.
An ExtendedInputStream is an input stream that can retrieve content at arbitrary positions within the stream. The ExtendedInputStream class includes methods that can read a certain number of bytes from the stream or read an unspecified number of bytes. The stream keeps track of the last byte position that was read. You can specify a position in the input stream to get to a later or earlier position within the stream.
More details at :
https://www.ibm.com/support/knowledgecenter/SSGLW6_5.2.1/com.ibm.p8.ce.dev.java.doc/com/filenet/api/util/ExtendedInputStream.html
Edit:
ExtendedInputStream was introduced in v5.2.1 and is not available if you are using an older version of P8.

Ruby PStore file too large

I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2 GB from what I can see), I am no longer able to write the file to disk and I receive the following error:
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably a limitation of IO but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. Probably the definitive solution should be to switch to a proper database in the backend, but because of some limitations of the specific Ruby (Sketchup) I am using this is not always possible.
I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file with those chars like rojj-datafile.pstore and add the entry there. Any records with a name starting 'rojj' go in that file.
A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore.
Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the pstores.
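The bucketing rule above can be sketched as a small path-building function. It is shown in Python for brevity, and the same logic carries over to Ruby. The function name bucket_path, the lower-casing of the prefix, and the default prefix length of 4 are assumptions made for this sketch; names containing characters that are unsafe in file names would need extra handling.

```python
from pathlib import PurePosixPath


def bucket_path(name, prefix_len=4, suffix="-datafile.pstore"):
    """Map a record name to the bucket file that should hold it.

    The first character becomes a directory and the first `prefix_len`
    characters become the file name prefix, e.g. r/rojj-datafile.pstore.
    """
    prefix = name[:prefix_len].lower()  # lower-casing is an assumption
    if not prefix:
        raise ValueError("name must not be empty")
    return PurePosixPath(prefix[0]) / (prefix + suffix)
```

With this rule, every record whose name starts with "rojj" lands in the same file, so each individual store stays well below the size at which writes start failing.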

How to decode a binary file which must be decoded using an external binary in one shot?

I have a large number of input files in a proprietary binary format. I need to turn them into rows for further processing. Each file must be decoded in one shot by an external binary (i.e. files must not be concatenated or split).
Options that I'm aware of:
[1] Force single file load, extend RecordReader, use DistributedCache to run the decoder via RecordReader
[2] Force single file load, RecordReader returns single file, use hadoop streaming to decode each file
It looks however like [2] will not work since pig will concatenate records before sending them to the STREAM operator (i.e. it will send multiple records).
[1] seems doable, just a little more work.
Is there a better way?
Option 1 seems like the most viable of the options you mentioned. In addition to extending RecordReader, you should extend the appropriate InputFormat and override isSplitable() to return false.

Should I use a binary or a text file for storing protobuf messages?

Using Google protobuf, I am saving my serialized message data to a file, with several messages in each file. We have both C++ and Python versions of the code, so I need to use protobuf functions that are available in both languages. I have experimented with using SerializeToArray and SerializeAsString, and there seem to be the following unfortunate conditions:
SerializeToArray: As suggested in one answer, the best way to use this is to prefix each message with its data size. This would work great for C++, but in Python it doesn't look like this is possible - am I wrong?
SerializeAsString: This generates a serialized string equivalent to its binary counterpart, which I can save to a file. But what happens if one of the characters in the serialization result is \n - how do we find line endings, or the ending of messages for that matter?
Update:
Please allow me to rephrase slightly. As I understand it, I cannot write binary data in C++ because then our Python application cannot read the data, since it can only parse string serialized messages. Should I then instead use SerializeAsString in both C++ and Python? If yes, then is it best practice to store such data in a text file rather than a binary file? My gut feeling is binary, but as you can see this doesn't look like an option.
We have had great success base64 encoding the messages and using a simple \n to separate them. This will of course depend a lot on your use case - we need to store the messages in "log" files. The encoding/decoding naturally adds overhead, but this has not even remotely been an issue for us.
The advantage of keeping these messages as line-separated text has so far been invaluable for maintenance and debugging. Need to figure out how many messages are in a file? wc -l. Need to find the Nth message? head ... | tail. Need to figure out what's wrong with a record on a remote system that you can only reach through 2 VPNs and a Citrix solution? Copy and paste the message and mail it to the programmer.
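A minimal sketch of this one-record-per-line layout in Python. The payloads are placeholder bytes standing in for serialized messages, and the helper names are hypothetical. Base64 output never contains a newline, so a \n byte inside the binary message cannot be mistaken for a record separator.

```python
import base64
import io


def append_message(text_stream, payload):
    """Write one message as a single base64-encoded line."""
    text_stream.write(base64.b64encode(payload).decode("ascii") + "\n")


def read_messages(text_stream):
    """Decode every non-empty line back into the original bytes."""
    return [
        base64.b64decode(line.strip())
        for line in text_stream
        if line.strip()
    ]
```

With a real file, opening it in text append mode ("a") keeps the log appendable.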
The best practice for concatenating messages in this way is to prepend each message with its size. That way you read in the size (try a 32-bit int or something), then read that number of bytes into a buffer and deserialize it. Then read the next size, and so on.
The same goes for writing: you first write out the size of the message, then the message itself.
See Streaming Multiple Messages on the protobuf documentation for more information.
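The size-prefix scheme works the same way in Python as in C++, as long as the file is opened in binary mode. Below is a minimal sketch that assumes a 32-bit little-endian unsigned length prefix. The helper names are hypothetical and the payloads are placeholder bytes; with generated protobuf classes they would typically come from SerializeToString() and be fed back into ParseFromString().

```python
import io
import struct

HEADER = struct.Struct("<I")  # 32-bit little-endian unsigned length (assumption)


def write_message(stream, payload):
    """Write the payload size, then the payload itself."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)


def read_messages(stream):
    """Yield each payload in turn until the stream is exhausted."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER.size:
            raise EOFError("truncated length prefix")
        (size,) = HEADER.unpack(header)
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated message")
        yield payload
```

Because the reader relies on the size rather than on any separator character, a \n byte inside a message is harmless. Opening the file with mode "ab" lets new messages be appended later.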
Protobuf is a binary format, so reading and writing should be done as binary, not text.
If you don't want binary format, you should consider using something other than protobuf (there are lots of textual data formats, such as XML, JSON, CSV); just using text abstractions is not enough.

Resources