Both are language-neutral and platform-neutral data exchange libraries. I wonder what the differences between them are and which library is a good fit for which situations.
They are intended for two different problems. Protobuf is designed to create a common "on the wire" or "disk" format for data.
Arrow is designed to create a common "in memory" format for the data.
Of course, the next question is: what does this mean?
In Protobuf, if an application wants to work with the data, it first deserializes the data into some kind of "in memory" representation. This must be done because the Protobuf format is not easily compatible with CPU instructions. For example, Protobuf packs unsigned integers into varints. These have a variable number of bytes, and the wire type of each field is crammed into the 3 least significant bits of its tag. You cannot take two unsigned integers and just add them without first converting them to some kind of "in memory" representation.
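To make that concrete, here is a minimal varint sketch in plain Python (no protobuf library involved; the helper names are my own):

    def encode_varint(value):
        """Encode an unsigned integer as a protobuf-style varint (little-endian base-128)."""
        out = bytearray()
        while True:
            byte = value & 0x7F          # low 7 bits
            value >>= 7
            if value:
                out.append(byte | 0x80)  # continuation bit set: more bytes follow
            else:
                out.append(byte)
                return bytes(out)

    def decode_varint(data):
        """Decode a varint back into a normal integer."""
        result, shift = 0, 0
        for byte in data:
            result |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        return result

    print(encode_varint(1).hex())    # '01'   -> one byte
    print(encode_varint(300).hex())  # 'ac02' -> two bytes; not a form the CPU can add directly
    # To add two encoded values you must decode them into plain integers first:
    print(decode_varint(encode_varint(300)) + decode_varint(encode_varint(5)))  # 305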
Now, protoc does generate code for every language to convert to an "in memory" representation for that language. However, this "in memory" representation is not common. You cannot take a Protobuf message, deserialize it in C# (using protoc-generated code) and then process those in-memory bytes in Java without doing some kind of C#-to-Java marshalling of the data.
Arrow, on the other hand, solves this problem. If you have an Arrow table in C# you can map that memory into a different language and start processing it without doing any kind of language-to-language marshalling of data. This zero-copy approach allows for efficient hand-off between languages. Python has been employing tricks like this (e.g. the array protocol) for a while now and it works great for data analysis.
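As a rough illustration of that hand-off, here is a small sketch assuming pyarrow is installed (the file name is made up); the same mapped bytes could just as well be opened by the Java or C# Arrow libraries:

    import pyarrow as pa

    # Build a small Arrow table in one process/language...
    table = pa.table({"id": [1, 2, 3], "price": [9.5, 12.0, 3.25]})

    # ...write it out in the Arrow IPC file format...
    with pa.OSFile("houses.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # ...and memory-map it back without copying the buffers. Any Arrow
    # implementation can map the same bytes and compute on them directly.
    with pa.memory_map("houses.arrow") as source:
        mapped = pa.ipc.open_file(source).read_all()
    print(mapped["price"])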
However, Arrow is not always the greatest format for over-the-wire transmission because it can be inefficient. Those varints I mentioned before help Protobuf cut down on message size. Also, Protobuf tags each field so it can save space when there are many optional fields. In fact, Arrow uses Protobuf & gRPC for over-the-wire transmission of metadata in Arrow Flight (an RPC framework).
Hello everyone. I am getting my hands dirty with gRPC and protocol buffers and came across an article which mentions that the binary Protobuf file for a message is 5 times smaller than its JSON counterpart, but also that this level of compression can only be achieved if transmitting via gRPC. I can't seem to understand this particular comment, "compression possible when transmitting via gRPC", because my understanding was that Protocol Buffers is a serialization format which works irrespective of gRPC. Is this understanding flawed? What does this mean? Here is the link to the article and the screenshot.
https://www.datascienceblog.net/post/programming/essential-protobuf-guide-python/
You are correct - Protocol Buffers "provide a serialization format for packets of typed, structured data that are up to a few megabytes in size" and have many uses (e.g. serialising data to disk, network transmission, passing data between applications, etc.).
The article you reference is somewhat misleading when it says
we can only achieve this level of compression if we are transmitting binary Protobuf using the gRPC protocol
The reasoning behind the statement becomes a bit clearer when you consider the following paragraph:
If gRPC is not an option, a common pattern is to encode the binary Protobuf data using the base64-encoding. Although this encoding irrevocably increases the size of the payload by 33%, it is still much smaller than the corresponding REST payload.
So this appears to be focusing on transmitting data over HTTP, which is a text-based protocol (HTTP/2 changes this somewhat).
Because Protobuf is a binary format, it is often encoded before being transmitted via HTTP (technically it could be transferred in binary using something like application/octet-stream). The article mentions base64-encoding, and using this increases its size.
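A quick back-of-the-envelope check of that 33% figure, using nothing but the standard library (the 90-byte payload is just a stand-in for some serialized Protobuf):

    import base64

    payload = bytes(90)                 # stand-in for 90 bytes of serialized Protobuf
    encoded = base64.b64encode(payload)

    print(len(payload))   # 90
    print(len(encoded))   # 120 -> ~33% larger, since base64 spends 4 text chars per 3 bytes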
If you are not using HTTP (e.g. writing to disk, direct TCP/IP link, Websockets, MQTT etc) then this does not apply.
Having said that I believe that the example may be overstating the benefits of Protobuf a little because a lot of HTTP traffic is compressed (this article reported a 9% difference in their test).
I agree with Brits' answer.
Some separate points. It's not quite accurate to talk about GPB compressing data; it merely goes some way to minimally encode data (e.g. integers). If you send a GPB message that has 10 strings all the same, it won't compress that down in the same way you might expect zip to.
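To illustrate the difference, here is a rough sketch that hand-encodes a repeated string field the way it appears on the wire (tag | length | bytes per element) and compares it with a general-purpose compressor; the exact compressed size will vary:

    import zlib

    # Hand-rolled wire encoding of a repeated string field (field number 1, wire type 2):
    # each element is stored verbatim as tag | length | bytes -- no de-duplication.
    s = b"the quick brown fox"
    tag = bytes([0x0A])                       # field 1, length-delimited
    element = tag + bytes([len(s)]) + s       # single-byte length is enough here
    message = element * 10                    # 10 identical strings

    print(len(message))                 # 210 bytes: grows linearly with the repetition
    print(len(zlib.compress(message)))  # far smaller (roughly 40 bytes): the repeats collapse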
GPB is not the most efficient encoder of data out there. ASN.1 uPER is even better, being able to exploit knowledge that can be put into an ASN.1 schema but cannot be expressed in a GPB .proto file. For example, in ASN.1 you can constrain the value of an integer to, say, 1000..1003. In ASN.1 uPER that's 2 bits (there are 4 possible values). In GPB, that's at least two bytes of encoded data.
If I use the same .proto file across several machines (ARM, x86, amd64, etc.) with implementations written in different languages (C++, Python, Java, etc.), will the same message result in the exact same byte sequence when serialized across those different configurations?
I would like to use these bytes for hashing to ensure that the same message, when generated on a different platform, would end up with the exact same hash.
"Often, but not quite always"
The reasons you might get variance include:
it is only a "should", not a "must" that fields are written in numerical sequential order - citation, emphasis mine:
when a message is serialized its known fields should be written sequentially by field number
and it is not demanded that fields are thus ordered (it is a "must" that deserializers be able to handle out-of-order fields); this can apply especially when discussing unexpected/extension fields; if two serializations choose different field orders, the bytes will be different
protobuf can be constructed by merging two partial messages, which will by necessity cause out-of-order fields, but when re-serializing an object deserialized from a merged message, it may become normalized (sequential)
the "varint" encoding allows some small subtle ambiguity... the number 1 would usually be encoded as 0x01, but it could also be encoded as 0x8100 or 0x818000 or 0x81808080808000 - the specification doesn't actually demand (AFAIK) that the shortest version be used; I am not aware of any implementation that actually outputs this kind of subnormal form, though :)
some options are designed to be forward- and backward- compatible; in particular, the [packed=true] option on repeated primitive values can be safely toggled at any time, and libraries are expected to cope with it; if you originally serialized it in one way, and now you're serializing it with the other option: the result can be different; a side-effect of this is that a specific library could also simply choose to use the alternate representation, especially if it knows it will be smaller; if two libraries make different decisions here - different bytes
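For the varint point above, a minimal decoder sketch (plain Python, my own helper name) shows why those subnormal forms all mean the same number while hashing differently:

    def decode_varint(data):
        """Little-endian base-128 decode, ignoring whether the shortest form was used."""
        result, shift = 0, 0
        for byte in data:
            result |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        return result

    for encoding in (b"\x01", b"\x81\x00", b"\x81\x80\x00"):
        print(encoding.hex(), "->", decode_varint(encoding))
    # 01 -> 1, 8100 -> 1, 818000 -> 1 : same value, different bytes (and different hashes)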
In most common cases, yes: it'll be reliable and repeatable. But this is not an actual guarantee.
The bytes should be compatible, though - it'll still have the same semantics - just not the same bytes. It shouldn't matter what language, framework, library, runtime, or processor you use. If it does: that's a bug.
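One related note, hedged because it is not a cross-platform guarantee either: if I recall the Python API correctly, SerializeToString accepts a deterministic flag that stabilises map ordering, but only within a single library build, not across languages or releases. The house_pb2 module and House field below are hypothetical stand-ins for your own generated code:

    import hashlib
    from house_pb2 import House   # hypothetical protoc-generated class

    msg = House(num_bedrooms=3)   # hypothetical field
    # deterministic=True fixes map ordering for *this* library version only;
    # the protobuf docs do not promise byte stability across implementations.
    data = msg.SerializeToString(deterministic=True)
    print(hashlib.sha256(data).hexdigest())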
I am doing some research on the best data exchange format for my company. For the moment I am comparing Protocol Buffers and Apache Avro.
Requests are exchanged between components in our architecture, but only one at a time. My impression is that Avro is much bigger than Protocol Buffers when transporting requests one by one. In the Avro file the schema is always present, and our request has a lot of optional fields, so our schema is very big even though our data are small.
But I don't know if I missed something. It's written everywhere that Avro is smaller, but for us it seems we would have to put a thousand requests in one file before the Protocol Buffers and Avro sizes become equal.
Did I miss something, or are my thoughts correct?
Thanks
It's not at all surprising that two serialization formats would produce basically equal sizes. These aren't compression algorithms, they're just structure. For any decent format, the vast majority of your data is going to be your data; the structure around it (which is the part that varies depending on serialization format) ought to be negligible. The size of your data simply doesn't change regardless of the serialization format around it.
Note also that anyone who claims that one format is always smaller than another is either lying or doesn't know what they're talking about. Every format has strengths and weaknesses, so the "best" format totally depends on the use case. It's important to test each format using your own data to find out what is best for you -- and it sounds like you are doing just that, which is great! If Protobuf and Avro came out equal size in your test, then you should choose based on other factors. You might want to test encoding/decoding speed, for example.
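A rough harness for that kind of test might look like the sketch below; fastavro is one possible Avro library, and request_pb2 / Request are hypothetical stand-ins for your own protoc-generated code:

    import io, timeit
    import fastavro                       # one possible Avro library
    from request_pb2 import Request       # hypothetical protoc-generated class

    schema = fastavro.parse_schema({
        "type": "record", "name": "Request",
        "fields": [{"name": "num_bedrooms", "type": "int"}],
    })

    def encode_avro(record):
        buf = io.BytesIO()
        fastavro.schemaless_writer(buf, schema, record)   # single record, no embedded schema
        return buf.getvalue()

    record = {"num_bedrooms": 3}          # the same data in both shapes
    msg = Request(num_bedrooms=3)

    print(timeit.timeit(lambda: encode_avro(record), number=10_000))
    print(timeit.timeit(lambda: msg.SerializeToString(), number=10_000))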
Do any of Thrift, Protobuf, Avro, etc. support querying the resulting compact data? Or would something like a Thrift server first have to decode the compact data before being able to query it?
Background, since there might be an entirely different answer to my use case that I'm not seeing:
I've sketched out a custom data structure on paper (akin to a trie), which will contain tens or hundreds of millions of key-value pairs. The whole thing needs to be in RAM, so it needs to be compact.
For this reason I'm probably skipping the normal KV stores, since there's just too much overhead in their encoding. They can't optimize for the specialized shape of my structure. (Redis has the least overhead per key AFAIK, but it isn't enough: 100+ bytes per key.)
I've looked into Thrift, Protobuf, Avro, and MessagePack, which would all allow me to encode the data into a nice compact structure, taking advantage of the specific opportunities of my data structure (encoding keys as 1 or 2 bytes, bit-packing, fixed-length values, etc.).
However, it's completely unclear to me whether any of these protocols/techniques will allow me to query the compacted data structure as is, or whether the data structure has to be decoded before querying. If the latter, well, then this whole exercise hasn't been of much use to me.
As an alternative, I've thought of looking at other programming languages (C/C++ probably, although I've never dabbled in them) that would allow me very tight memory control over structs (as opposed to Node/JavaScript, which is extremely bad at that).
Anyone?
They need to be decoded for querying
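To make that concrete with Protobuf as the example: the raw bytes cannot be queried directly, you first have to parse them into an in-memory object. The trie_pb2 module and its field names below are hypothetical stand-ins for whatever you would generate:

    from trie_pb2 import Entry               # hypothetical protoc-generated class

    blob = open("entry.bin", "rb").read()    # the compact serialized bytes (made-up file)

    # You cannot ask the raw bytes for a field; you must parse first,
    # which materialises a full in-memory object for the message.
    entry = Entry()
    entry.ParseFromString(blob)
    print(entry.value)                       # hypothetical field name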
Our business deals with houses and over the years we have created several business objects to represent them. We also receive lots of data from outside sources, and send data to external consumers. Every one of these represents a house in a different way, and we spend a lot of time and energy translating one format into another. I'm looking for some general patterns or best practices on how to deal with this situation. How can I write a universal data translator that is flexible, extensible, and fast?
Background: A house generally has 30-40 attributes such as size, number of bedrooms, roof type, construction material, siding material, etc. These are typically represented as key/value pairs. A typical translation problem is that one vendor will represent the number of bedrooms as a single key/value pair: NumBedrooms=3, while a different vendor will have a key/value pair per bedroom: Bedroom=master, Bedroom=small, Bedroom=small.
There's nothing particularly hard about the translation, but we spend a lot of time and energy writing and testing translations. How can I optimize this?
Thanks
(My environment is .Net)
The best place to start is by creating an "internal representation", which is the representation that your processing will always use. Then create translators from and to "external representations" as needed. I'd imagine that this is what you are already doing, but it should be mentioned for completeness. The optimization comes from being able to selectively write importers and exporters only when you need them.
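As a minimal sketch of that pattern (in Python rather than .NET for brevity, with invented names), using the bedroom example from the question:

    from dataclasses import dataclass, field

    @dataclass
    class House:                      # the internal representation
        num_bedrooms: int = 0
        attributes: dict = field(default_factory=dict)

    def from_vendor_a(pairs):         # vendor A: NumBedrooms=3
        return House(num_bedrooms=int(dict(pairs)["NumBedrooms"]))

    def from_vendor_b(pairs):         # vendor B: Bedroom=master, Bedroom=small, ...
        return House(num_bedrooms=sum(1 for key, _ in pairs if key == "Bedroom"))

    print(from_vendor_a([("NumBedrooms", "3")]))
    print(from_vendor_b([("Bedroom", "master"), ("Bedroom", "small"), ("Bedroom", "small")]))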
A good implementation strategy is to externalize the transformation if you can. If you can get your inputs and outputs into XML documents, then you can write XSLT transforms between your internal and external representations. The goal is to be able to set up a pipeline of transformations from an input XML document to your internal representation. If everything is represented in XML and using a common protocol (say... hmm... HTTP), then the process can be controlled using configuration. BTW - this is essentially the Pipes and Filters design pattern.
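XSLT itself won't fit in a few lines, but the shape of the Pipes and Filters idea is essentially staged transformation; a toy sketch with invented stage names:

    from functools import reduce

    def pipeline(*stages):
        """Compose transformation stages left-to-right (Pipes and Filters)."""
        return lambda doc: reduce(lambda d, stage: stage(d), stages, doc)

    # Each stage is just a function from document to document; configuration can
    # decide which stages to chain for a given vendor.
    normalize_bedrooms = lambda d: {**d, "NumBedrooms": d.get("NumBedrooms", len(d.get("Bedroom", [])))}
    drop_vendor_fields = lambda d: {k: v for k, v in d.items() if not k.startswith("X-")}

    vendor_b_import = pipeline(normalize_bedrooms, drop_vendor_fields)
    print(vendor_b_import({"Bedroom": ["master", "small", "small"], "X-Internal": 1}))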
Take a look at Yahoo pipes, Apache Cocoon, XML pipeline, and NetKernel for inspiration.
My employer back in the 90s faced this problem. We had a standard format we converted the customers' data to and from, as D.Shawley suggests.
I went further and designed a simple format-description language; we described our standard format in that language and then, for a new dataset, we'd write up its format too. Then a program would take both descriptions and convert the data from one format to the other, with automatic type conversions, safety checks, etc. (This came in handy for some other operations as well, not just these initial/final conversions.)
The particulars probably won't help you -- chances are you deal with completely different kinds of data. You can likely profit from the general principle, though. The "data definition language" needn't necessarily be a fancy thing with a parser and scanner; you might define it directly with a data structure in IronPython, say.
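As a toy version of that "format description as a data structure" idea (the field names and converters are invented for illustration):

    # Each vendor format is described as data: vendor key -> (internal key, type conversion).
    VENDOR_A_FORMAT = {
        "NumBedrooms": ("num_bedrooms", int),
        "SizeSqFt":    ("size", float),
        "RoofType":    ("roof_type", str),
    }

    def convert(record, description):
        """Map a vendor record onto internal field names with automatic type conversion."""
        out = {}
        for vendor_key, (internal_key, cast) in description.items():
            if vendor_key in record:
                out[internal_key] = cast(record[vendor_key])
        return out

    print(convert({"NumBedrooms": "3", "RoofType": "slate"}, VENDOR_A_FORMAT))
    # {'num_bedrooms': 3, 'roof_type': 'slate'}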