How to open protocol buffer file - protocol-buffers

For example I have protocol buffer file compressed in snappy-format
file.pbuf.sn
how can I view the file's content? Which programms are recommended to work with protocol buffers files?

There's two separate steps here:
un-snappy the file container
process the contents that are presumably protobuf
If you're trying to do this through code, then obviously each will depend on your target language/platform/etc. Presumably "snappy" tools are available from Google (who created "snappy", IIRC).
Once you have he contents, it depends whether it is a .proto schema, binary data contents, JSON data contents, or some combination. If you have a schema for the data, then run it through "protoc" or the language/platform-specific tool of your choice to get the generated code that matches the schema. Then you can run either binary or JASON data through that generated code to get a populated object model.
If you don't have a schema: if it is JSON you should be able to understand the data via the names. Just run it through your chosen JSON tooling
If it is binary data without a schema, things are tougher. Protobuf data doesn't include names and the same values can be encoded in multiple ways (so: the same bytes can have come from multiple sources values). So you'll have to reverse-engineer the meaning of each field. "Protoc" has a schema-less decode mode that might help with this, as does https://protogen.marcgravell.com/decode

Related

How to create a partially modifiable binary file format?

I'm creating my custom binary file extension.
I use the RIFF standard for encoding data. And it seems to work pretty well.
But there are some additional requirements:
Binary files could be large up to 500 MB.
Real-time saving data into the binary file in intervals when data on the application has changed.
Application could run on the browser.
The problem I face is when I want to save data it needs to read everything from memory and rewrite the whole binary file.
This won't be a problem when data is small. But when it's getting larger, the Real-time saving feature seems to be unscalable.
So main requirement of this binary file could be:
Able to partially read the binary file (Cause file is huge)
Able to partially write changed data into the file without rewriting the whole file.
Streaming protocol like .m3u8 is not an option, We can't split it into chunks and point it using separate URLs.
Any guidance on how to design a binary file system that scales in this scenario?
There is an answer from a random user that has been deleted here.
It seems great to me.
You can claim your answer back and I'll delete this one.
He said:
If we design the file to be support addition then we able to add whatever data we want without needing to rewrite the whole file.
This idea gives me a very great starting point.
So I can append more and more changes at the end of the file.
Then obsolete old chunks of data in the middle of the file.
I can then reuse these obsolete data slots later if I want to.
The downside is that I need to clean up the obsolete slot when I have a chance to rewrite the whole file.

Ruby PStore file too large

I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2GB from what I can see) I am not able to write the file to disk anymore and I receive the following error;
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably a limitation of IO but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. Probably the definitive solution should be to switch to a proper database in the backend, but because of some limitations of the specific Ruby (Sketchup) I am using this is not always possible.
I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file with those chars like rojj-datafile.pstore and add the entry there. Any records with a name starting 'rojj' go in that file.
A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore.
Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the pstores.

WebHDFS and SequenceFiles

Is it true WebHDFS does not support SequenceFiles?
I can't find anything that says it does. I have the usual small file problem and believe SequenceFiles would work well enough, but I need to use WebHDFS. I need to create and then append to a SequenceFile via WebHDFS.
I think it's true. There is no web API to append to a sequence file.
However you can append binary data, and if your sequence file is not block-compressed, you should be able to format your data on the client with relatively little effort. You can do it by running your input through a sequence file writer on the client, and then using the output for uploading (either the whole file, or a slice representing the delta since last append).
You can read more about sequence file format here.

Excel import and export to SQL

I have a requirement where I need to upload xsl (excel 2003) into SQL 2008R2 database. I am using ORCHAD stuff for scheduling.
I am using HTTPPOSTEDFILEBASE filestream to convert into byte array and store in database.
After storing background scheduler picks up the task and process the data stored. I need to create objects from data in excel and send for processing. I am struck at decoding byte array :(
What is the best way to handle this kinda requirement? any libraries which I make use.
My web app is build with MVC3, EF4.1,repository pattern, Autofaq.
I have not used the HTTPPOSTEDFILEBASE class, but you could:
Convert the file to a byte stream
Save it as the appropriate byte/blob type in your database (store the extension in a separate field)
Retrieve bytes and add the appropriate extension to the file stream
Treat as a normal file...
But I'm actually wondering if your requirements even demand this. Why are you storing the file in the first place? If you are only using the file data to shape your business object (that I'm guessing gets saved somewhere), you could perform that data extraction, shaping, and persistence before you store the file as raw bytes so you never have to reconstitute the file for that purpose.

Google Protocol Buffers - Storing messages into file

I'm using google protocol buffer to serialize equity market data (ie. timestamp, bid,ask fields).
I can store one message into a file and deserialize it without issue.
How can I store multiple messages into a single file? Not sure how I can separate the messages. I need to be able to append new messages to the file on the fly.
I would recommend using the writeDelimitedTo(OutputStream) and parseDelimitedFrom(InputStream) methods on Message objects. writeDelimitedTo writes the length of the message before the message itself; parseDelimitedFrom then uses that length to read only one message and no farther. This allows multiple messages to be written to a single OutputStream to then be parsed separately. For more information, see https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/MessageLite#writeDelimitedTo(java.io.OutputStream)
From the docs:
http://code.google.com/apis/protocolbuffers/docs/techniques.html#streaming
Streaming Multiple Messages
If you want to write multiple messages to a single file or stream, it
is up to you to keep track of where one message ends and the next
begins. The Protocol Buffer wire format is not self-delimiting, so
protocol buffer parsers cannot determine where a message ends on their
own. The easiest way to solve this problem is to write the size of
each message before you write the message itself. When you read the
messages back in, you read the size, then read the bytes into a
separate buffer, then parse from that buffer. (If you want to avoid
copying bytes to a separate buffer, check out the CodedInputStream
class (in both C++ and Java) which can be told to limit reads to a
certain number of bytes.)
Protobuf does not include a terminator per outermost record, so you need to do that yourself. The simplest approach is to prefix the data with the length of the record that follows. Personally, I tend to use the approach of writing a string-header (for an arbitrary field number), then the length as a "varint" - this means the entire document is then itself a valid protobuf, and could be consumed as an object with a "repeated" element, however, just a fixed-length (typically 32-bit little-endian) marker would do just as well. With any such storage, it is appendable as you require.
If you're looking for a C++ solution, Kenton Varda submitted a patch to protobuf around August 2015 that adds support for writeDelimitedTo() and readDelimitedFrom() calls that will serialize/deserialize a sequence of proto messages to/from a file in a way that's compatible with the Java version of these calls. Unfortunately this patch hasn't been approved yet, so if you want the functionality you'll need to merge it yourself.
Another option is Google has open sourced protobuf file reading/writing code through other projects. The or-tools library, for example, contains the classes RecordReader and RecordWriter that serialize/deserialize a proto stream to a file.
If you would like stand-alone versions of these classes that have almost no external dependencies, I have a fork of or-tools that contains only these classes. See: https://github.com/moof2k/recordio
Reading and writing with these classes is straightforward:
File* file = File::Open("proto.log", "w");
RecordWriter writer(file);
writer.WriteProtocolMessage(msg1);
writer.WriteProtocolMessage(msg2);
...
writer.Close();
An easier way is to base64 encode each message and store it as a record per line.

Resources