Avro size too big?

I'm doing some research into the best data exchange format for my company. At the moment I'm comparing Protocol Buffers and Apache Avro.
Requests are exchanged between components in our architecture, but only one at a time. My impression is that Avro is much bigger than Protocol Buffers when transporting requests one by one. In an Avro file the schema is always present, and our request has a lot of optional fields, so our schema is very big even though our data are small.
But I don't know if I missed something. It's written everywhere that Avro is smaller, but for us it seems we would have to put a thousand requests into one file for the Protocol Buffers and Avro sizes to be equal.
Did I miss something, or is my impression correct?
Thanks

It's not at all surprising that two serialization formats would produce basically equal sizes. These aren't compression algorithms; they're just structure. For any decent format, the vast majority of the encoded output is going to be your data, and the structure around it (which is the part that varies between serialization formats) ought to be negligible. The size of your data simply doesn't change regardless of the serialization format around it.
Note also that anyone who claims that one format is always smaller than another is either lying or doesn't know what they're talking about. Every format has strengths and weaknesses, so the "best" format totally depends on the use case. It's important to test each format using your own data to find out what is best for you -- and it sounds like you are doing just that, which is great! If Protobuf and Avro came out equal size in your test, then you should choose based on other factors. You might want to test encoding/decoding speed, for example.
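One quick way to see where the bytes go in a test like yours (a rough sketch, assuming Java and a made-up Request schema): Avro's container-file format embeds the full schema JSON in every file, which dominates when a file holds a single small record, whereas the raw binary encoding carries only the field values. Comparing the two directly makes the "structure is negligible" point concrete.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroSizeCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema standing in for your real request schema.
        String schemaJson = "{\"type\":\"record\",\"name\":\"Request\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"comment\",\"type\":[\"null\",\"string\"],\"default\":null}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 42L);
        record.put("comment", "hello");

        // Raw binary encoding: only the field values are written, no schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();

        System.out.println("record payload: " + out.size() + " bytes");
        System.out.println("schema JSON:    " + schemaJson.length() + " bytes");
    }
}
```

If single requests are what you ship, you can send just the binary payload and agree on the schema out of band, which generally lands Avro and Protobuf in the same ballpark.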

Related

Most efficient storage format for HDFS data

I have to store a lot of data on dedicated storage servers in HDFS. This is a kind of archive for historical data. The data being stored is row oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, and there are also a few Floats, Shorts, ArrayLists and a Map.
The idea is that the data will be scanned from time to time using a MapReduce or Spark job.
Currently I am storing them as SequenceFiles with NullWritable as keys and a custom WritableComparable class as values. This custom class has all of these fields defined.
I would like to achieve two goals. One is to optimize the size of the data, as it is getting really big; I have to add new servers every few weeks and the costs are constantly growing. The other is to make it easier to add new fields - in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
I tried to achieve this by using an EnumMap inside this class. It gave quite good results: it allows adding new fields easily, and it reduced the size of the data by 20% (the reason is that a lot of fields in a record are often empty). But the code I wrote looks awful, and it gets even uglier when I try to add Lists and Maps to the EnumMap. It's fine for data of the same type, but trying to combine all of the fields is a nightmare.
So I thought about other popular formats. I have tried Avro and Parquet, but the size of the data is almost exactly the same as the SequenceFiles with the custom class, before the Enum experiment. So that solves the problem of adding new fields without needing to rewrite old data, but I feel like there is more potential to optimize the size of the data.
The one more thing I am going to check is of course the time it takes to load the data (this will also tell me whether it's ok to use bzip2 compression or whether I have to go back to gzip because of performance), but before I proceed I was wondering if someone could suggest another solution or a hint.
Thanks in advance for all comments.
Most of your approach seems good. I just decided to add some of my thoughts in this answer.
The data being stored is row oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, and there are also a few Floats, Shorts, ArrayLists and a Map.
None of the types you have mentioned here are any more complex than the data types supported by Spark, so I wouldn't bother changing the data types in any way.
achieve two goals - one is to optimize the size of the data, as it is getting really big; I have to add new servers every few weeks and the costs are constantly growing.
By adding servers, are you also adding compute? Storage should be relatively cheap, and I'm wondering if you are adding compute with your servers, which you don't really need. You should only be paying to store and retrieve data. Consider a simple object store like S3 that only charges you for storage space and gives a free quota of access requests (GET/PUT/POST) - I believe about 1000 requests are free and it costs only ~$10 for a terabyte of storage per month.
The other thing is to make it easier to add new fields - in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
If you have a use case where you will be writing to the files more often than reading them, I'd recommend not storing the files on HDFS. It is better suited for write-once, read-many applications. That said, I'd recommend starting with Parquet, since I think you will need a file format that allows slicing and dicing the data. Avro is also a good choice, as it supports schema evolution. It's a better fit if you have complex structures where you need to specify the schema and want easy serialization/deserialization to and from Java objects.
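Since schema evolution is one of the stated goals, here is a minimal sketch of how Avro handles it (hypothetical Row schemas; the key point is that a newly added field needs a default so that data written with the old schema stays readable under the new one):

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical "old" schema the archived data was written with.
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");
        // "New" schema with an added field; the default makes old data readable.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"score\",\"type\":\"float\",\"default\":0.0}]}");

        // Write one record with the old schema...
        GenericRecord oldRow = new GenericData.Record(writerSchema);
        oldRow.put("id", 7L);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(oldRow, enc);
        enc.flush();

        // ...and read it back with the new schema: "score" comes back as the default.
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord newRow =
            new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, dec);
        System.out.println(newRow);  // {"id": 7, "score": 0.0}
    }
}
```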
The one more thing I am going to check is of course the time it takes to load the data (this will also tell me whether it's ok to use bzip2 compression or whether I have to go back to gzip because of performance)
Bzip2 has the highest compression but is also the slowest, so I'd recommend it only if the data isn't used or queried frequently. Gzip's compression is comparable to bzip2's, but it is somewhat faster. Also consider Snappy compression, which strikes a balance between performance and storage and supports splittable files for certain file formats (Parquet or Avro), which is useful for MapReduce jobs.
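If you do benchmark the codecs from Spark, the Parquet compression codec is just a write option, so comparing gzip and snappy on a sample of the archive is cheap. A sketch, with made-up paths:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetCompressionSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("archive-repack").getOrCreate();

        // Hypothetical input: a sample of the existing archive, however you currently load it.
        Dataset<Row> rows = spark.read().parquet("hdfs:///archive/sample");

        // Write the same sample twice with different codecs, then compare sizes and run time.
        rows.write().mode("overwrite").option("compression", "snappy").parquet("hdfs:///tmp/sample-snappy");
        rows.write().mode("overwrite").option("compression", "gzip").parquet("hdfs:///tmp/sample-gzip");

        spark.stop();
    }
}
```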

Do any of thrift, protobuf, avro, etc. support querying encoded data directly?

Do any of Thrift, Protobuf, Avro, etc. support querying the resulting compact data directly? Or would something like a Thrift server first have to decode the compact data before being able to query it?
Background, since there might be an entirely different answer to my use case that I'm not seeing:
I've sketched out a custom data structure on paper (akin to a trie), which will contain tens or hundreds of millions of key-value pairs. The whole thing needs to be in RAM, so it needs to be compact.
For this reason I'm probably skipping the normal KV stores, since there's just too much overhead in encoding. They can't optimize for the specialized case of my structure. (Redis has the least overhead per key AFAIK, but it isn't enough: 100+ bytes per key.)
I've looked into Thrift, Protobuf, Avro and MessagePack, which would all allow me to encode the data into a nice compact structure, taking care of the specific opportunities of my data structure (encoding keys as 1 or 2 bytes, bit packing, fixed-length values, etc.).
However, it's completely unclear to me whether any of these protocols/techniques will allow me to query the compacted data structure as is, or whether the data structure has to be decoded before querying. If the latter, then this whole exercise hasn't been of much use to me.
As an alternative, I've thought about looking at other programming languages (probably C/C++, although I've never dabbled with them) that would allow me very tight memory control over structs (as opposed to Node/JavaScript, which is extremely bad at that).
Anyone?
They need to be decoded for querying
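To make that concrete with Protobuf as an example (a sketch; Entry is a hypothetical message class generated by protoc from a .proto you would define yourself): the encoded bytes are opaque until you parse them into an in-memory object, and only then can you read individual fields.

```java
// A minimal sketch, assuming a generated protobuf class; "Entry" and its fields
// are hypothetical, produced by protoc from a .proto of your own.
import com.example.generated.Entry;  // hypothetical generated message

public class DecodeBeforeQuery {
    public static void main(String[] args) throws Exception {
        byte[] wire = Entry.newBuilder().setKey("k1").setValue(42L).build().toByteArray();

        // The encoded bytes are opaque: to look at any field you must parse first,
        // which materializes a full in-memory object.
        Entry entry = Entry.parseFrom(wire);
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}
```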

Reverse "jpeg" compression algorithm?

I have to write a tool that manages very large data sets (well, large for an ordinary workstation). I basically need something that works the opposite of the JPEG format: the dataset needs to be intact on disk, where it can be arbitrarily large, but it needs to be lossily compressed when it gets read into memory, and only the sub-part used at any given time needs to be uncompressed on the fly. I have started looking at IPP (Intel Integrated Performance Primitives), but it's not really clear yet whether I can use it for what I need to do.
Can anyone point me in the right direction?
Thank you.
Given the nature of your data, it seems you are handling some kind of raw samples.
So the easiest and most generic "lossy" technique will be to drop the lower bits, reducing precision to the level you want.
Note that you really need to "drop the lower bits", which is quite different from "rounding to the next power of 10". Computers work in base 2, and you want all your lower bits to be "00000" for compression to perform as well as possible. This method assumes that the selected compression algorithm will make use of the predictable zero-bit pattern.
Another method, more complex and more specific, could be to convert your values as an index into a table. The advantage is that you can "target" precision where you want it. The obvious drawback is that the table will be specific to a distribution pattern.
On top of that, you may also store not the value itself, but the delta of the value with its preceding one if there is any kind of relation between them. This will help compression too.
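A small illustration of the two ideas above, dropping low-order bits and delta-coding neighbouring values (shown in Java for brevity; the same bit twiddling applies in C++):

```java
public class LossyPrepSketch {
    // Zero out the lowest `bitsToDrop` mantissa bits of a float so that the
    // compressor sees long runs of predictable zero bits.
    static float dropLowerBits(float value, int bitsToDrop) {
        int bits = Float.floatToRawIntBits(value);
        int mask = ~((1 << bitsToDrop) - 1);   // e.g. bitsToDrop=12 keeps 11 of the 23 mantissa bits
        return Float.intBitsToFloat(bits & mask);
    }

    // Replace each sample with its difference from the previous one; small,
    // correlated deltas compress much better than the raw values.
    static int[] deltaEncode(int[] samples) {
        int[] deltas = new int[samples.length];
        int previous = 0;
        for (int i = 0; i < samples.length; i++) {
            deltas[i] = samples[i] - previous;
            previous = samples[i];
        }
        return deltas;
    }

    public static void main(String[] args) {
        System.out.println(dropLowerBits(3.14159265f, 12));  // slightly less precise pi
        int[] series = {1000, 1003, 1001, 1007};
        System.out.println(java.util.Arrays.toString(deltaEncode(series)));  // [1000, 3, -2, 6]
    }
}
```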
For the data to be compressed, you will need to "group" it into packets of an appropriate size, such as 64KB. On a single field, no compression algorithm will give you suitable results. This, in turn, means that each time you want to access a field you need to decompress the whole packet, so tune the packet size depending on what you want to do with the data. Sequential access is easier to deal with in such circumstances.
Regarding the compression algorithm: since these data are going to be "live", you need something very fast, so that accessing the data has very little latency impact.
There are several open-source options for that use. For easier license management, I would recommend a BSD-licensed one. Since you use C++, the following look suitable:
http://code.google.com/p/snappy/
and
http://code.google.com/p/lz4/

Good compression algorithm for small chunks of data? (around 2k in size)

I have a system where one machine generates small chunks of data in the form of objects containing arrays of integers and longs. These chunks get passed to another server, which in turn distributes them elsewhere.
I want to compress these objects so that the memory load on the pass-through server is reduced. I understand that compression algorithms like deflate need to build up a dictionary, so something like that wouldn't really work on data this small.
Are there any algorithms that could compress data like this efficiently?
If not, another thing I could do is batch these chunks into arrays of objects and compress the array once it gets to be a certain size. But I am reluctant to do this because I would have to change interfaces in an existing system. Compressing them individually would not require any interface changes, the way this is all set up.
Not that I think it matters, but the target system is Java.
Edit: Would Elias gamma coding be the best for this situation?
Thanks
If you think that reducing your data packets to their entropy level is the best that can be done, you can try simple Huffman compression.
For an early look at how well this would compress, you can pass a packet through Huff0:
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
It is a simple order-0 Huffman encoder, so the result will be representative.
For more specific ideas on how to efficiently exploit the characteristics of your data, it would help to describe a bit what the packets contain and how they are generated (as you have done in the comments: ints (4 bytes?) and longs (8 bytes?)), and then to provide one or a few samples.
It sounds like you're currently looking at general-purpose compression algorithms. The most effective way to compress small chunks of data is to build a special-purpose compressor that knows the structure of your data.
The important thing is that you need to match the coding you use with the distribution of values you expect from your data: to get a good result from Elias gamma coding, you need to make sure the values you code are smallish positive integers...
If different integers within the same block are not completely independent (e.g., if your arrays represent a time series), you may be able to use this to improve your compression (e.g., the differences between successive values in a time series tend to be smallish signed integers). However, because each block needs to be independently compressed, you will not be able to take this kind of advantage of differences between successive blocks.
If you're worried that your compressor might turn into an "expander", you can add an initial flag to indicate whether the data is compressed or uncompressed. Then, in the worst case where your data doesn't fit your compression model at all, you can always punt and send the uncompressed version; your worst-case overhead is the size of the flag...
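A minimal sketch of that flag idea, using java.util.zip.Deflater purely as a stand-in for whichever coder you end up choosing:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class MaybeCompress {
    private static final byte RAW = 0, DEFLATED = 1;

    // Returns a 1-byte flag followed by either the deflated payload or,
    // if compression did not help, the original bytes unchanged.
    static byte[] pack(byte[] chunk) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(chunk);
        deflater.finish();

        byte[] buffer = new byte[chunk.length];   // only useful if the output is smaller
        int compressedLength = deflater.deflate(buffer);
        boolean helped = deflater.finished() && compressedLength < chunk.length;
        deflater.end();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(helped ? DEFLATED : RAW);
        out.write(helped ? buffer : chunk, 0, helped ? compressedLength : chunk.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[2048];            // a ~2k chunk of, say, ints and longs
        System.out.println("packed size: " + pack(chunk).length);
    }
}
```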
Elias Gamma Coding might actually increase the size of your data.
You already have upper bounds on your numbers (whatever fits into a 4- or probably 8-byte int/long). This method encodes the length of your numbers, followed by your number (probably not what you want). If you get many small values, it might make things smaller. If you also get big values, it will probably increase the size (the 8-byte unsigned max value would become almost twice as big).
Look at the entropy of your data packets. If it's close to the maximum, compression will be useless. Otherwise, try different general-purpose compressors, though I'm not sure whether the time spent compressing and decompressing is worth the size reduction.
I would take a close look at the options of your compression library, for instance deflateSetDictionary() and the flag Z_FILTERED in http://www.zlib.net/manual.html. If you can distribute - or hardwire into the source code - an agreed dictionary to both sender and receiver ahead of time, and if that dictionary is representative of real data, you should get decent compression savings. Oops - in Java, look at java.util.zip.Deflater.setDictionary() and FILTERED.
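A sketch of that preset-dictionary idea with java.util.zip (the dictionary and packet bytes here are made-up stand-ins for whatever is actually representative of your chunks); note that the receiver has to supply the same dictionary when the Inflater asks for it:

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionarySketch {
    public static void main(String[] args) throws Exception {
        // Stand-in dictionary: byte sequences that show up in most chunks.
        byte[] dictionary = "shared header and other bytes common to most chunks".getBytes("UTF-8");
        byte[] packet = "shared header and other bytes common to most chunks, plus this chunk's payload".getBytes("UTF-8");

        // Sender side: set the agreed dictionary before compressing.
        Deflater deflater = new Deflater();
        deflater.setDictionary(dictionary);
        deflater.setInput(packet);
        deflater.finish();
        byte[] compressed = new byte[256];
        int compressedLength = deflater.deflate(compressed);
        deflater.end();

        // Receiver side: inflate() signals that the dictionary is needed.
        Inflater inflater = new Inflater();
        inflater.setInput(compressed, 0, compressedLength);
        byte[] restored = new byte[256];
        int n = inflater.inflate(restored);
        if (n == 0 && inflater.needsDictionary()) {
            inflater.setDictionary(dictionary);
            n = inflater.inflate(restored);
        }
        inflater.end();
        System.out.println(new String(restored, 0, n, "UTF-8"));
    }
}
```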

Compression Algorithm for Small Amounts of Data

I have a server-side program that generates JSON for a client. A few of my colleagues have suggested using zip/gzip compression in order to reduce the amount of data sent over the wire. However, when I tested them against one of my average JSON messages, both actually increased the amount of data being sent. It wasn't until I sent an unusually large response that the zipping kicked in and was useful.
So I started poking around stackoverflow, and I eventually found LZO, which, when tested, did exactly what I wanted it to do. However, I can't seem to find documentation of the run time of the algorithm, and I'm not quite good enough to sit down with the code and figure it out myself :)
tl;dr? RUN TIME OF LZO?
I'm going to ignore your question about the runtime of LZO (answer: almost certainly fast enough) and discuss the underlying problem.
You are exchanging JSON data structures over the wire and want to reduce your bandwidth. At the moment you are considering general-purpose compression algorithms like DEFLATE and LZO. However, any compression algorithm based on Lempel-Ziv techniques works best on large amounts of data. These algorithms work by building up a dictionary of frequently occurring sequences of data, so that they can encode a reference to the dictionary instead of the whole sequence when it repeats. The bigger the dictionary, the better the compression ratio. For very small amounts of data, like individual data packets, the technique is useless: there isn't time to build up the dictionary, and there isn't time for lots of repeats to appear.
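This is easy to observe directly: deflate a small JSON message and the output, bookkeeping included, can come out no smaller than the input. A quick sketch with a made-up message:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class SmallPayloadDeflate {
    public static void main(String[] args) {
        // A small, fairly typical JSON message (made up for illustration).
        byte[] json = "{\"id\":17,\"ok\":true,\"name\":\"abc\"}".getBytes(StandardCharsets.UTF_8);

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(json);
        deflater.finish();
        byte[] buffer = new byte[json.length + 64];
        int compressedLength = deflater.deflate(buffer);
        deflater.end();

        // For payloads this small the deflate stream is often no smaller,
        // and gzip framing would add roughly another 18 bytes on top.
        System.out.println("original:   " + json.length + " bytes");
        System.out.println("compressed: " + compressedLength + " bytes");
    }
}
```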
If you are using JSON to encode a wire protocol, then your packets are very likely stereotyped, with similar structures and a small number of common keys. So I suggest investigating Google's Protocol Buffers which are designed specifically for this use case.
Seconding the suggestion to avoid LZO and any other type of generic/binary-data compression algorithm.
Your other options are basically:
Google's Protocol Buffers
Apache Thrift
MessagePack
The best choice depends on your server/language setup, your speed-vs-compression requirements, and personal preference. I'd probably go with MessagePack myself, but you won't go wrong with Protocol Buffers either.
