Why are protos faster than JSON/XML? - protocol-buffers

Context
I have read about Protos and am aware that they have several advantages over sending/storing data in serialized XML/JSON form, viz.
Automatically generated classes
Backward and forward compatibility
Cross-language / cross-platform compatibility, etc.
But I have also read that Protos are faster and smaller than JSON/XML, and I want to know the reasoning behind that, so I have asked a few questions below.
For the queries below, assume that the data being sent is in serialized byte form in all cases, whether Protobuf, XML, or JSON.
Question(s)
Are they faster and smaller because the fields being sent carry field numbers instead of field names? Is that the only reason, or are there other reasons?
I have also read some details about encoding in Protos; does the way the encoding is done also make them smaller and faster? Was that not possible in other formats (XML/JSON)?
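To make the comparison concrete, here is a rough sketch of what I mean (the field name, the assumed .proto definition, and the hand-assembled wire bytes are made up purely for illustration): JSON has to spell out the field name and the digits as text, while the Protobuf wire format uses a one-byte tag carrying the field number plus a varint for the value.

    import json

    # The same logical record, one integer field, in both formats.
    json_payload = json.dumps({"searchResultCount": 150}).encode("utf-8")

    # Hand-assembled Protobuf wire bytes for an assumed field definition:
    #     int32 search_result_count = 1;
    # 0x08 is the tag byte: (field number 1 << 3) | wire type 0 (varint).
    # 0x96 0x01 is 150 encoded as a base-128 varint.
    proto_payload = bytes([0x08, 0x96, 0x01])

    print(len(json_payload), "bytes as JSON")       # 26 bytes
    print(len(proto_payload), "bytes as Protobuf")  # 3 bytes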

Related

can protocol buffer be used without gRPC?

Hello everyone, I am getting my hands dirty with gRPC and protocol buffers and have come across an article which mentions that the binary protobuf file for a message is 5 times smaller than its JSON counterpart, but it also says that this level of compression can only be achieved when transmitting via gRPC. I can't seem to understand the particular comment "compression possible when transmitting via gRPC", because my understanding was that protocol buffers are a serialization format which works irrespective of gRPC. Is this understanding flawed? What does it mean? Here is the link to the article and the screenshot.
https://www.datascienceblog.net/post/programming/essential-protobuf-guide-python/
You are correct - Protocol buffers "provide a serialization format for packets of typed, structured data that are up to a few megabytes in size" and have many uses (e.g. serialising data to disk, network transmission, passing data between applications, etc.).
The article you reference is somewhat misleading when it says
we can only achieve this level of compression if we are transmitting binary Protobuf using the gRPC protocol
The reasoning behind the statement becomes a bit clearer when you consider the following paragraph:
If gRPC is not an option, a common pattern is to encode the binary Protobuf data using the base64-encoding. Although this encoding irrevocably increases the size of the payload by 33%, it is still much smaller than the corresponding REST payload.
So this appears to be focused on transmitting data over HTTP, which is a text-based protocol (HTTP/2 changes this somewhat).
Because Protobuf is a binary format, it is often encoded before being transmitted via HTTP (technically it could be transferred in binary form using something like application/octet-stream). The article mentions base64 encoding, and using this increases its size.
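A minimal sketch of that overhead (the 90-byte payload is just a stand-in for some serialized Protobuf message): base64 maps every 3 input bytes to 4 output characters, which is where the roughly 33% growth comes from.

    import base64

    raw = bytes(range(90))       # stand-in for a 90-byte binary Protobuf payload
    b64 = base64.b64encode(raw)  # what a text-based HTTP body would carry

    print(len(raw), "raw bytes ->", len(b64), "base64 bytes")  # 90 -> 120, ~33% larger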
If you are not using HTTP (e.g. writing to disk, direct TCP/IP link, Websockets, MQTT etc) then this does not apply.
Having said that, I believe the example may be overstating the benefits of Protobuf a little, because a lot of HTTP traffic is compressed (this article reported a 9% difference in their test).
I agree with Brits' answer.
Some separate points. It's not quite accurate to talk about GPB compressing data; it merely goes some way to minimally encode data (e.g. integers). If you send a GPB message that has 10 strings all the same, it won't compress that down in the same way you might expect zip to.
GPB is not the most efficient encoder of data out there. ASN.1 uPER is even better, being able to exploit knowledge that can be put into an ASN.1 schema but cannot be expressed in a GPB .proto file. For example, in ASN.1 you can constrain the value of an integer to, say, 1000..1003. In ASN.1 uPER that's 2 bits (there are 4 possible values). In GPB, that's at least two bytes of encoded data.
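To put rough numbers on that, here is a quick sketch of a base-128 varint encoder (not any particular library's code): every value in the range 1000..1003 takes two bytes on the wire before you even count the field tag, whereas uPER's constrained encoding needs only 2 bits.

    def varint(n: int) -> bytes:
        """Encode a non-negative integer as a protobuf-style base-128 varint."""
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)  # continuation bit set: more bytes follow
            else:
                out.append(byte)
                return bytes(out)

    for value in (1000, 1001, 1002, 1003):
        print(value, "->", varint(value).hex())  # e807, e907, ea07, eb07 - 2 bytes each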

Most efficient storage format for HDFS data

I have to store a lot of data on dedicated storage servers in HDFS. This is some kind of archive for historic data. The data being stored is row-oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, and there are also a few Floats, Shorts, ArrayLists and a Map.
The idea is that the data will be scanned from time to time using MapReduce or Spark job.
Currently I am storing them as SequenceFiles with NullWritable as keys and custom WritableComparable class as values. This custom class has all of these fields defined.
I would like to achieve two goals. One is to optimize the size of the data, as it is getting really big; I have to add new servers every few weeks and the costs are constantly growing. The other is to make it easier to add new fields - in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
I tried to achieve this by using an EnumMap inside this class. It gave quite good results, as it allows new fields to be added easily and the size of the data has been reduced by 20% (the reason is that many fields in a record are often empty). But the code I wrote looks awful, and it gets even uglier when I try to add Lists and Maps to this EnumMap as well. It's OK for data of the same type, but trying to combine all of the fields is a nightmare.
So I thought of some other popular formats. I have tried Avro and Parquet, but the size of the data is almost exactly the same as the SequenceFiles with the custom class before trying Enums. So it solves the problem of adding new fields without the need to rewrite old data, but I feel there is more potential to optimize the size of the data.
One more thing I am still going to check is, of course, the time it takes to load the data (this will also tell me whether it's OK to use bzip2 compression or I have to go back to gzip because of performance), but before I proceed with this I was wondering if someone could suggest another solution or a hint.
Thanks in advance for all comments.
Most of your approach seems good. I just decided to add some of my thoughts in this answer.
The data being stored is row-oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, and there are also a few Floats, Shorts, ArrayLists and a Map.
None of the types you have mentioned here are any more complex than the data types supported by Spark, so I wouldn't bother changing the data types in any way.
achieve two goals. One is to optimize the size of the data, as it is getting really big; I have to add new servers every few weeks and the costs are constantly growing.
By adding servers, are you also adding compute? Storage should be relatively cheap, and I suspect you are paying for compute with those servers that you don't really need; you should only be paying to store and retrieve data. Consider a simple object store like S3, which only charges you for storage space and gives a free quota of access requests (GET/PUT/POST) - I believe about 1000 requests are free, and it costs only ~$10 for a terabyte of storage per month.
The other is to make it easier to add new fields - in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
If you have a use case where you will be writing to the files more often than reading, I'd recommend not storing the files on HDFS; it is better suited to write-once, read-many applications. That said, I'd recommend using Parquet to start, since I think you will need a file format that allows slicing and dicing the data. Avro is also a good choice, as it supports schema evolution. But it's better to use Avro if you have complex structures where you need to specify the schema and make it easier to serialize/deserialize with Java objects.
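As a rough illustration of the schema-evolution point, here is a sketch using the fastavro library (the ArchiveRow record and its fields are invented for the example; the Java Avro library behaves the same way): data written with an old schema can still be read with a newer schema that adds a field carrying a default, so no old files need rewriting.

    import io
    from fastavro import writer, reader, parse_schema

    schema_v1 = parse_schema({
        "type": "record", "name": "ArchiveRow",
        "fields": [{"name": "id", "type": "long"},
                   {"name": "value", "type": ["null", "string"], "default": None}],
    })
    # v2 adds a field with a default, so files written with v1 stay readable.
    schema_v2 = parse_schema({
        "type": "record", "name": "ArchiveRow",
        "fields": [{"name": "id", "type": "long"},
                   {"name": "value", "type": ["null", "string"], "default": None},
                   {"name": "source", "type": "string", "default": "unknown"}],
    })

    buf = io.BytesIO()
    writer(buf, schema_v1, [{"id": 1, "value": "a"}, {"id": 2, "value": None}])
    buf.seek(0)
    for row in reader(buf, reader_schema=schema_v2):
        print(row)  # old rows come back with source='unknown'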
One more thing I am still going to check is, of course, the time it takes to load the data (this will also tell me whether it's OK to use bzip2 compression or I have to go back to gzip because of performance)
Bzip2 has the highest compression, but it is also the slowest, so I'd recommend it if the data isn't used/queried frequently. Gzip has compression comparable to bzip2's, but is slightly faster. Also consider Snappy compression, as it balances performance and storage and supports splittable files for certain file types (Parquet or Avro), which is useful for MapReduce jobs.
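For what it's worth, here is a minimal PySpark sketch of that combination (the path and columns are placeholders, not taken from your setup): write the archive as Snappy-compressed Parquet and let later jobs read only the columns they need.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("archive").getOrCreate()

    # Placeholder data; in practice this would be the existing rows loaded from HDFS.
    df = spark.createDataFrame([(1, "a", 0.5), (2, "b", 1.5)], ["id", "value", "score"])

    # Snappy-compressed Parquet: splittable, columnar, cheap to scan later.
    df.write.mode("append").option("compression", "snappy").parquet("hdfs:///archive/rows")

    # Later jobs only pay for the columns they actually read.
    spark.read.parquet("hdfs:///archive/rows").select("id", "score").show()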

Is the string/bytestring resulting from protobuf serialization of the same record value always the same? [duplicate]

If I use the same .proto file across several machines (arm, x86, amd64, etc.) with implementations written in different languages (C++, Python, Java, etc.), will the same message result in the exact same byte sequence when serialized across those different configurations?
I would like to use these bytes for hashing to ensure that the same message, when generated on a different platform, would end up with the exact same hash.
"Often, but not quite always"
The reasons you might get variance include:
it is only a "should", not a "must" that fields are written in numerical sequential order - citation, emphasis mine:
when a message is serialized its known fields should be written sequentially by field number
and it is not demanded that fields are thus ordered (it is a "must" that deserializers be able to handle out-of-order fields); this can apply especially when discussing unexpected/extension fields; if two serializations choose different field orders, the bytes will be different
protobuf can be constructed by merging two partial messages, which will by necessity cause out-of-order fields, but when re-serializing an object deserialized from a merged message, it may become normalized (sequential)
the "varint" encoding allows some small subtle ambiguity... the number 1 would usually be encoded as 0x01, but it could also be encoded as 0x8100 or 0x818000 or 0x81808080808000 - the specification doesn't actually demand (AFAIK) that the shortest version be used; I am not aware of any implementation that actually outputs this kind of subnormal form, though :)
some options are designed to be forward- and backward- compatible; in particular, the [packed=true] option on repeated primitive values can be safely toggled at any time, and libraries are expected to cope with it; if you originally serialized it in one way, and now you're serializing it with the other option: the result can be different; a side-effect of this is that a specific library could also simply choose to use the alternate representation, especially if it knows it will be smaller; if two libraries make different decisions here - different bytes
In most common cases, yes: it'll be reliable and repeatable. But this is not an actual guarantee.
The bytes should be compatible, though - it'll still have the same semantics - just not the same bytes. It shouldn't matter what language, framework, library, runtime, or processor you use. If it does: that's a bug.
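To make the varint point concrete, here is a small decoding sketch (not taken from any particular protobuf library): the padded forms all carry the same value, so a conforming parser accepts every one of them even though serializers in practice emit the shortest.

    def decode_varint(data: bytes) -> int:
        """Decode a single protobuf-style base-128 varint."""
        value, shift = 0, 0
        for b in data:
            value |= (b & 0x7F) << shift
            if not (b & 0x80):  # continuation bit clear: this was the last byte
                return value
            shift += 7
        raise ValueError("truncated varint")

    for raw in (b"\x01", b"\x81\x00", b"\x81\x80\x00", b"\x81\x80\x80\x80\x80\x80\x00"):
        print(raw.hex(), "->", decode_varint(raw))  # every form decodes to 1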

Avro size too big?

I am doing some research on the best data exchange format for my company. At the moment I am comparing Protocol Buffers and Apache Avro.
Requests are exchanged between components in our architecture, but only one at a time. My impression is that Avro is much bigger than Protocol Buffers when transporting messages one by one. In the Avro file the schema is always present, and our request has a lot of optional fields, so our schema is very big even though our data is small.
But I don't know if I have missed something; it is written everywhere that Avro is smaller, yet for us it seems we would have to put a thousand requests into one file before the Protocol Buffers and Avro sizes become equal.
Have I missed something, or is my impression correct?
Thanks
It's not at all surprising that two serialization formats would produce basically equal sizes. These aren't compression algorithms, they're just structure. For any decent format, the vast majority of your data is going to be your data; the structure around it (which is the part that varies depending on serialization format) ought to be negligible. The size of your data simply doesn't change regardless of the serialization format around it.
Note also that anyone who claims that one format is always smaller than another is either lying or doesn't know what they're talking about. Every format has strengths and weaknesses, so the "best" format totally depends on the use case. It's important to test each format using your own data to find out what is best for you -- and it sounds like you are doing just that, which is great! If Protobuf and Avro came out equal size in your test, then you should choose based on other factors. You might want to test encoding/decoding speed, for example.
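If it helps, this is the kind of tiny harness I mean, sketched with JSON as the only built-in example (the Protobuf/Avro callables in the comments are placeholders for your own encode/decode functions):

    import json
    import time

    def measure(name, encode, decode, record, runs=10_000):
        """Report serialized size and a rough round-trip time for one serializer."""
        payload = encode(record)
        start = time.perf_counter()
        for _ in range(runs):
            decode(encode(record))
        elapsed = time.perf_counter() - start
        print(f"{name}: {len(payload)} bytes, {elapsed:.3f}s for {runs} round trips")

    # Placeholder record; substitute one of your real requests.
    record = {"id": 42, "name": "example", "optional_fields": ["a", "b"]}

    measure("json", lambda r: json.dumps(r).encode("utf-8"), json.loads, record)
    # measure("protobuf", my_proto_encode, my_proto_decode, record)  # your own callables
    # measure("avro",     my_avro_encode,  my_avro_decode,  record)  # your own callables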

Do any of thrift, protobuf, avro, etc. support querying encoded data directly?

Do any of Thrift, Protobuf, Avro, etc. support querying the resulting compact data? Or would something like a Thrift server first have to decode the compact data before being able to query it?
Background, since there might be an entirely different answer to my use case that I'm not seeing:
I've sketched out a custom data structure on paper (akin to a trie) which will contain tens or hundreds of millions of key-value pairs. The whole thing needs to be in RAM, so it needs to be compact.
For this reason I'm probably skipping the normal KV stores, since there's just too much overhead in encoding. They can't optimize for the specialized shape of the structure. (Redis has the least overhead per key AFAIK, but it isn't enough: 100+ bytes per key.)
I've looked into Thrift, Protobuf, Avro, and MessagePack, which would all allow me to encode the data into a nice compact structure, taking advantage of the specific opportunities of my data structure (encoding keys as 1 or 2 bytes, bit-packing, fixed-length values, etc.).
However, it's completely unclear to me whether any of these protocols/techniques would allow me to query the compacted data structure as-is, or whether the data structure has to be decoded before querying. If the latter, then this whole exercise hasn't been of much use to me.
As an alternative, I've thought of looking at other programming languages (C/C++ probably, although I've never dabbled in them) that would allow me very tight memory control over structs (as opposed to Node/JavaScript, which is extremely bad at that).
Anyone?
They need to be decoded for querying
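A sketch of what that means in practice (the kv_pb2 module and Entry message are hypothetical, i.e. whatever protoc would generate from your own .proto file): the serialized bytes are not addressable by field, so every lookup starts with a full parse.

    # kv_pb2 is the hypothetical module protoc would generate from a kv.proto
    # containing: message Entry { string key = 1; bytes value = 2; }
    from kv_pb2 import Entry

    def lookup(serialized: bytes, wanted_key: str):
        entry = Entry()
        entry.ParseFromString(serialized)  # the full decode happens here, every time
        return entry.value if entry.key == wanted_key else None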

Resources