Hello everyone, I am getting my hands dirty with gRPC and Protocol Buffers and have come across an article which mentions that the binary Protobuf for a message is 5 times smaller than its JSON counterpart, but the article also says that this level of compression can only be achieved if transmitting via gRPC. I can't seem to understand this particular comment, "compression possible when transmitting via gRPC", because my understanding was that Protocol Buffers is a serialization format which works irrespective of gRPC. Or is this understanding flawed? What does it mean? Here is the link to the article and the screenshot.
https://www.datascienceblog.net/post/programming/essential-protobuf-guide-python/
You are correct - Protocol Buffers "provide a serialization format for packets of typed, structured data that are up to a few megabytes in size" and have many uses (e.g. serialising data to disk, network transmission, passing data between applications, etc.).
The article you reference is somewhat misleading when it says
we can only achieve this level of compression if we are transmitting binary Protobuf using the gRPC protocol
The reasoning behind the statement becomes a bit clearer when you consider the following paragraph:
If gRPC is not an option, a common pattern is to encode the binary Protobuf data using the base64-encoding. Although this encoding irrevocably increases the size of the payload by 33%, it is still much smaller than the corresponding REST payload.
So this appears to be focusing on transmitting data over HTTP, which is a text-based protocol (HTTP/2 changes this somewhat).
Because Protobuf is a binary format, it is often encoded before being transmitted via HTTP (technically it could be transferred in binary using something like application/octet-stream). The article mentions base64 encoding, and using this increases its size.
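To make that 33% concrete, here is a minimal sketch of the encoding step (the `payload` below just stands in for the bytes of some already-serialized Protobuf message):

```python
import base64
import os

# stand-in for a 300-byte serialized Protobuf message
payload = os.urandom(300)
encoded = base64.b64encode(payload)

print(len(payload))   # 300
print(len(encoded))   # 400 -> the roughly 33% size increase the article mentions
```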
If you are not using HTTP (e.g. writing to disk, direct TCP/IP link, Websockets, MQTT etc) then this does not apply.
Having said that, I believe the example may be overstating the benefits of Protobuf a little, because a lot of HTTP traffic is compressed (this article reported a 9% difference in their test).
I agree with Brits' answer.
Some separate points. It's not quite accurate to talk about GPB compressing data; it merely goes some way towards encoding data minimally (e.g. integers). If you send a GPB message that has 10 strings all the same, it won't compress that down in the way you might expect zip to.
GPB is not the most efficient encoder of data out there. ASN.1 uPER is even better, being able to exploit knowledge that can be put into an ASN.1 schema but cannot be expressed in a GPB .proto file. For example, in ASN.1 you can constrain the value of an integer to, say, 1000..1003. In ASN.1 uPER that's 2 bits (there are 4 possible values). In GPB, that's at least two bytes of encoded data.
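As a rough illustration of that last point, here is a sketch of Protobuf's unsigned varint encoding (no ZigZag), showing that every value in the 1000..1003 range takes two bytes before the field tag is even counted; this is hand-rolled for illustration, not the official library:

```python
def encode_varint(value: int) -> bytes:
    """Protobuf-style varint: 7 data bits per byte, least-significant group
    first, with the high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

for v in (1000, 1001, 1002, 1003):
    print(v, encode_varint(v).hex(), len(encode_varint(v)), "bytes")
# each value encodes to 2 bytes (e.g. 1000 -> e807), whereas ASN.1 uPER can
# squeeze the same constrained range into 2 bits
```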
Related
Context
I have read about Protos and am aware that there are several advantages as compared to sending/storing data in serialized XML/JSON form, viz.
Automatically generated classes
Backward and forward compatibility
Cross-language / cross-platform compatibility, etc.
But I have also read that Protos are faster and smaller compared to JSON/XML and want to know the reasoning behind it, so I have asked a few questions below.
For the queries below, assume that the data being sent is in serialized byte form, whether as Protos, XML, or JSON.
Question(s)
Are they faster and smaller because the fields being sent carry field numbers instead of field names? Is that the only reason, or are there other reasons?
I have also read some details about the encoding in Protos; does the way the encoding is done also make it smaller and faster? Was that not possible in other formats (XML/JSON)?
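To illustrate the first question with a hand-rolled sketch (this is not the official protobuf library, just the wire-format rules applied by hand): a JSON document repeats the key name in every message, while Protobuf sends a one-byte tag derived from the field number.

```python
import json

def encode_varint(value: int) -> bytes:
    # Protobuf-style varint: 7 data bits per byte, least-significant group first
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def encode_uint_field(field_number: int, value: int) -> bytes:
    tag = (field_number << 3) | 0          # wire type 0 = varint
    return encode_varint(tag) + encode_varint(value)

json_bytes = json.dumps({"user_id": 150}).encode()
proto_bytes = encode_uint_field(1, 150)    # field 1, value 150

print(len(json_bytes))    # 16 bytes: the key name travels with every message
print(len(proto_bytes))   # 3 bytes: tag 0x08 plus varint 0x96 0x01
```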
Both are language-neutral and platform-neutral data exchange libraries. I wonder what the differences between them are and which library is good for which situations.
They are intended for two different problems. Protobuf is designed to create a common "on the wire" or "disk" format for data.
Arrow is designed to create a common "in memory" format for the data.
Of course, the next question, is what does this mean?
In Protobuf, if an application wants to work with the data, it first deserializes the data into some kind of "in memory" representation. This must be done because the Protobuf format is not easily compatible with CPU instructions. For example, Protobuf packs unsigned integers into varints. These have a variable number of bytes, and the wire type of the field is crammed into the 3 least significant bits of the field's tag. You cannot take two unsigned integers and just add them without first converting them to some kind of "in memory" representation.
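A minimal sketch of that wire-format detail, assuming a message that contains a single unsigned-integer field; the point is that the raw bytes have to be unpacked before the value is usable by the CPU:

```python
# field 1, wire type 0 (varint), value 150 -- three bytes on the wire
raw = bytes([0x08, 0x96, 0x01])

tag = raw[0]
field_number = tag >> 3        # -> 1
wire_type = tag & 0x07         # -> 0, i.e. varint

# decode the varint payload: 7 data bits per byte, least-significant group first
value, shift = 0, 0
for b in raw[1:]:
    value |= (b & 0x7F) << shift
    shift += 7
    if not (b & 0x80):
        break

print(field_number, wire_type, value)   # 1 0 150
```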
Now, protoc does have libraries for every language to convert to an "in memory" representation for those languages. However, this "in memory" representation is not common. You cannot take a Protobuf message, deserialize it into C# (using protoc-generated code), and then process those in-memory bytes in Java without doing some kind of C#-to-Java marshalling of the data.
Arrow, on the other hand, solves this problem. If you have an Arrow table in C#, you can map that memory into a different language and start processing it without doing any kind of language-to-language marshalling of data. This zero-copy approach allows for an efficient hand-off between languages. Python has been employing tricks like this (e.g. the array protocol) for a while now, and it works great for data analysis.
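As a small sketch of what that means in practice (assuming the pyarrow and numpy packages), the values of an Arrow array sit in a flat buffer that another runtime, here NumPy, can view without copying:

```python
import numpy as np
import pyarrow as pa

arr = pa.array([1, 2, 3, 4], type=pa.int64())

# buffers() exposes the raw memory: [validity bitmap (may be None), data buffer]
validity, data = arr.buffers()

# the same bytes can be viewed as a NumPy array without copying them
view = np.frombuffer(data, dtype=np.int64, count=len(arr))
print(view)   # [1 2 3 4]
```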
However, Arrow is not always the greatest format for over-the-wire transmission because it can be inefficient. Those varints I mentioned before help Protobuf cut down on message size. Also, Protobuf tags each field so it can save space when there are many optional fields. In fact, Arrow uses Protobuf & gRPC for over-the-wire transmission of metadata in Arrow Flight (an RPC framework).
I am doing some research into the best data exchange format for my company. At the moment I am comparing Protocol Buffers and Apache Avro.
Requests are exchanged between components in our architecture, but only one at a time. My impression is that Avro is much bigger than Protocol Buffers when transporting messages one by one: in an Avro file the schema is always present, and our request has a lot of optional fields, so our schema is very big even though our data are small.
But I don't know if I have missed something. It's written everywhere that Avro is smaller, but for us it seems we would have to put a thousand requests in one file for the Protocol Buffers and Avro sizes to be equal.
Have I missed something, or are my thoughts correct?
Thanks
It's not at all surprising that two serialization formats would produce basically equal sizes. These aren't compression algorithms, they're just structure. For any decent format, the vast majority of your data is going to be your data; the structure around it (which is the part that varies depending on serialization format) ought to be negligible. The size of your data simply doesn't change regardless of the serialization format around it.
Note also that anyone who claims that one format is always smaller than another is either lying or doesn't know what they're talking about. Every format has strengths and weaknesses, so the "best" format totally depends on the use case. It's important to test each format using your own data to find out what is best for you -- and it sounds like you are doing just that, which is great! If Protobuf and Avro came out equal size in your test, then you should choose based on other factors. You might want to test encoding/decoding speed, for example.
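For what it's worth, one way to see the overhead the question describes is to compare an Avro container file (which embeds the schema) with a schemaless encoding of the same single record. This sketch assumes the fastavro package and a made-up Request schema:

```python
import io
import fastavro

# made-up schema standing in for the real request type
schema = fastavro.parse_schema({
    "type": "record",
    "name": "Request",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "comment", "type": ["null", "string"], "default": None},
    ],
})
record = {"id": 42, "comment": None}

container = io.BytesIO()
fastavro.writer(container, schema, [record])       # header + embedded schema + record

bare = io.BytesIO()
fastavro.schemaless_writer(bare, schema, record)   # just the record body

print(len(container.getvalue()), len(bare.getvalue()))
# with a single record the container file is dominated by the schema and header;
# spreading that fixed cost over many records is what makes Avro look small
```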
Do any of Thrift, Protobuf, Avro, etc. support querying the resulting compact data? Or would something like a Thrift server first have to decode the compact data before being able to query it?
Background: there might be an entirely different answer to my use case that I'm not seeing.
I've sketched out a custom data structure on paper (akin to a trie), which will contain tens or hundreds of millions of key-value pairs. The whole thing needs to be in RAM, so it needs to be compact.
For this reason I'm probably skipping the normal KV-stores, since there's just too much overhead in their encoding; they can't optimize for the specialized case of my structure. (Redis has the least overhead per key AFAIK, but it isn't enough: 100+ bytes per key.)
I've looked into Thrift, Protobuf, Avro, and MessagePack, all of which would let me encode the data into a nice compact structure while taking care of the specific opportunities of my data structure (encoding keys as 1 or 2 bytes, bit-packing, fixed-length values, etc.).
However, it's completely unclear to me whether any of these protocols/techniques will allow me to query the compacted data structure as-is, or whether the data structure has to be decoded before querying. If the latter, well, then this whole exercise hasn't been of much use to me.
As an alternative, I've thought about looking at other programming languages (probably C/C++, although I've never dabbled in them) that would allow me very tight memory control over structs (as opposed to Node/JavaScript, which is extremely bad at that).
Anyone?
They need to be decoded for querying
I have a server-side program that generates JSON for a client. A few of my colleagues have suggested using zip/gzip compression to reduce the amount of data being sent over the wire. However, when I tested it against one of my average JSON messages, both actually increased the amount of data being sent. It wasn't until I sent an unusually large response that the zipping kicked in and was useful.
So I started poking around stackoverflow, and I eventually found LZO, which, when tested, did exactly what I wanted it to do. However, I can't seem to find documentation of the run time of the algorithm, and I'm not quite good enough to sit down with the code and figure it out myself :)
tl;dr? RUN TIME OF LZO?
I'm going to ignore your question about the runtime of LZO (answer: almost certainly fast enough) and discuss the underlying problem.
You are exchanging JSON data structures over the wire and want to reduce your bandwidth. At the moment you are considering general-purpose compression algorithms like DEFLATE and LZO. However, any compression algorithm based on Lempel-Ziv techniques works best on large amounts of data. These algorithms work by building up a dictionary of frequently occurring sequences of data, so that they can encode a reference to the dictionary instead of the whole sequence when it repeats. The bigger the dictionary, the better the compression ratio. For very small amounts of data, like individual data packets, the technique is useless: there isn't time to build up the dictionary, and there isn't time for lots of repeats to appear.
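You can see this effect in a couple of lines of Python: gzip's fixed header/trailer and the fresh dictionary per message can outweigh any savings on a short JSON packet, while a large, repetitive payload compresses well (the exact numbers depend on your data):

```python
import gzip
import json

small = json.dumps({"id": 1, "name": "alice", "active": True}).encode()
large = json.dumps(
    [{"id": i, "name": "alice", "active": True} for i in range(1000)]
).encode()

print(len(small), len(gzip.compress(small)))   # the "compressed" packet may be bigger
print(len(large), len(gzip.compress(large)))   # with repetition, the dictionary pays off
```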
If you are using JSON to encode a wire protocol, then your packets are very likely stereotyped, with similar structures and a small number of common keys. So I suggest investigating Google's Protocol Buffers which are designed specifically for this use case.
Seconding the suggestion to avoid LZO and any other type of generic/binary-data compression algorithm.
Your other options are basically:
Google's Protocol Buffers
Apache Thrift
MessagePack
The best choice depends on your server/language setup, your speed-vs-compression requirements, and personal preference. I'd probably go with MessagePack myself, but you won't go wrong with Protocol Buffers either.