I have a server-side program that generates JSON for a client. A few of my colleagues have suggested using zip/gzip compression in order to reduce the amount of data being sent over the wire. However, when I tested them against one of my average JSON messages, they both actually increased the amount of data being sent. It wasn't until I sent an unusually large response that the zipping kicked in and was useful.
So I started poking around stackoverflow, and I eventually found LZO, which, when tested, did exactly what I wanted it to do. However, I can't seem to find documentation of the run time of the algorithm, and I'm not quite good enough to sit down with the code and figure it out myself :)
tl;dr? RUN TIME OF LZO?
I'm going to ignore your question about the runtime of LZO (answer: almost certainly fast enough) and discuss the underlying problem.
You are exchanging JSON data structures over the wire and want to reduce your bandwidth. At the moment you are considering general-purpose compression algorithms like DEFLATE and LZO. However, any compression algorithm based on Lempel-Ziv techniques works best on large amounts of data. These algorithms work by building up a dictionary of frequently occurring sequences of data, so that they can encode a reference to the dictionary instead of the whole sequence when it repeats. The bigger the dictionary, the better the compression ratio. For very small amounts of data, like individual data packets, the technique is useless: there isn't time to build up the dictionary, and there isn't time for lots of repeats to appear.
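To see the effect on a concrete message, here is a minimal sketch using the JDK's java.util.zip (the sample JSON string is invented). For a payload this size, the gzip header, trailer, and checksum typically outweigh whatever the barely-started dictionary manages to save:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipSizeDemo {
    public static void main(String[] args) throws Exception {
        // A hypothetical "average" JSON message -- small and not very repetitive.
        String json = "{\"id\":42,\"name\":\"widget\",\"price\":9.99,\"inStock\":true}";
        byte[] raw = json.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        byte[] compressed = bos.toByteArray();

        // For payloads this small, "compressed" usually ends up larger than "raw":
        // the ~18 bytes of gzip header/trailer plus a barely-started dictionary
        // eat up whatever redundancy deflate removes.
        System.out.println("raw: " + raw.length + " bytes, gzipped: " + compressed.length + " bytes");
    }
}
```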
If you are using JSON to encode a wire protocol, then your packets are very likely stereotyped, with similar structures and a small number of common keys. So I suggest investigating Google's Protocol Buffers which are designed specifically for this use case.
Seconding the suggestion to avoid LZO and any other type of generic/binary-data compression algorithm.
Your other options are basically:
Google's Protocol Buffers
Apache Thrift
MessagePack
The best choice depends on your server/language setup, your speed-vs-compression requirements, and personal preference. I'd probably go with MessagePack myself, but you won't go wrong with Protocol Buffers either.
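If it helps to see why these schema-based formats win for stereotyped packets, here is a rough Java sketch of the underlying idea, hand-rolled and not tied to any of the libraries above (the message fields are invented for illustration): because both sides agree on the structure ahead of time, essentially only the values travel over the wire, never the key names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SchemaEncodingSketch {
    // Hypothetical message: {"userId":12345,"score":87,"country":"DE"}
    // As JSON that's ~40 bytes, most of it key names and punctuation.

    static byte[] encode(int userId, int score, String country) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        // The schema is implicit: both sides know field order and types,
        // so only the values are written (here: 4 + 1 + 1 + 2 = 8 bytes).
        out.writeInt(userId);
        out.writeByte(score);
        byte[] cc = country.getBytes(StandardCharsets.US_ASCII);
        out.writeByte(cc.length);
        out.write(cc);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(encode(12345, 87, "DE").length + " bytes"); // 8
    }
}
```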
I'm doing some research on the best data exchange format for my company. At the moment I'm comparing Protocol Buffers and Apache Avro.
Requests are exchanged between components in our architecture, but only one at a time. My impression is that Avro is much bigger than Protocol Buffers when transporting them one by one. In an Avro file the schema is always present, and our request has a lot of optional fields, so our schema is very big even though our data are small.
But I don't know if I'm missing something: it's written everywhere that Avro is smaller, yet for us it seems we would have to put a thousand requests in one file before the Protocol Buffers and Avro sizes become equal.
Am I missing something, or is my impression correct?
Thanks
It's not at all surprising that two serialization formats would produce basically equal sizes. These aren't compression algorithms, they're just structure. For any decent format, the vast majority of your data is going to be your data; the structure around it (which is the part that varies depending on serialization format) ought to be negligible. The size of your data simply doesn't change regardless of the serialization format around it.
Note also that anyone who claims that one format is always smaller than another is either lying or doesn't know what they're talking about. Every format has strengths and weaknesses, so the "best" format totally depends on the use case. It's important to test each format using your own data to find out what is best for you -- and it sounds like you are doing just that, which is great! If Protobuf and Avro came out equal size in your test, then you should choose based on other factors. You might want to test encoding/decoding speed, for example.
Do any of Thrift, Protobuf, Avro, etc. support querying the resulting compact data? Or would something like a Thrift server first have to decode the compact data before being able to query it?
Background, since there might be an entirely different answer to my use case that I'm not seeing:
I've sketched out a custom data structure on paper (akin to a trie) which will contain tens or hundreds of millions of key-value pairs. The whole thing needs to fit in RAM, so it needs to be compact.
For this reason I'm probably skipping the normal KV stores, since there's just too much overhead in their encoding; they can't optimize for the specialized case of my structure. (Redis has the least overhead per key afaik, but it isn't enough: 100+ bytes per key.)
I've looked into Thrift, Protobuf, Avro, and MessagePack, which would all let me encode the data into a nice compact structure, taking advantage of the specific opportunities in my data structure (encoding keys as 1 or 2 bytes, bit-packing, fixed-length values, etc.).
However, it's completely unclear to me whether any of these protocols/techniques would allow me to query the compacted data structure as-is, or whether the data structure has to be decoded before querying. If the latter, then this whole exercise hasn't been of much use to me.
As an alternative, I've thought of looking at other programming languages (probably C/C++, although I've never dabbled with them) that would give me very tight memory control over structs (as opposed to Node/JavaScript, which is extremely bad at that).
Anyone?
They need to be decoded for querying
I have to write a tool that manages very large data sets (well, large for ordinary workstations). Basically I need something that works the opposite of the JPEG format: I need the data set to be intact on disk, where it can be arbitrarily large, but then it needs to be lossily compressed when it gets read into memory, and only the sub-part used at any given time needs to be decompressed on the fly. I have started looking at IPP (Intel Integrated Performance Primitives), but it's not really clear yet whether I can use it for what I need to do.
Can anyone point me in the right direction?
Thank you.
Given the nature of your data, it seems you are handling some kind of raw samples.
So the easiest and most generic lossy technique is to drop the lower bits, reducing precision to the level you want.
Note that you really do need to "drop the lower bits", which is quite different from "round to the nearest power of 10". Computers work in base 2, and you want all your lower bits to be zero for compression to perform as well as possible. This method assumes that the selected compression algorithm will make use of the predictable zero-bit pattern.
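As a concrete sketch of that bit-dropping step, assuming the samples are doubles (the number of bits to drop is just an example and controls how much precision you give up):

```java
public class PrecisionDrop {
    // Zero out the lowest `bits` bits of a double's mantissa so that repeated
    // bit patterns appear and a downstream lossless compressor can exploit them.
    // This is lossy: precision is reduced. Keep 0 < bits <= 52 so the sign and
    // exponent are left intact.
    static double dropLowerBits(double value, int bits) {
        long raw = Double.doubleToRawLongBits(value);
        long mask = ~((1L << bits) - 1);   // e.g. bits=32 keeps only the top 20 mantissa bits
        return Double.longBitsToDouble(raw & mask);
    }

    public static void main(String[] args) {
        double sample = 3.141592653589793;
        // Prints the original value and a slightly less precise version of it.
        System.out.println(sample + " -> " + dropLowerBits(sample, 32));
    }
}
```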
Another method, more complex and more specific, could be to convert your values as an index into a table. The advantage is that you can "target" precision where you want it. The obvious drawback is that the table will be specific to a distribution pattern.
On top of that, you may store not the value itself but the delta between the value and the preceding one, if there is any kind of relation between them. This helps compression too.
For the data to be compressed, you will need to "group" it into packets of an appropriate size, such as 64 KB. No compression algorithm will give you suitable results on a single field. This, in turn, means that each time you want to access a field, you need to decompress the whole packet, so tune the packet size depending on what you want to do with the data. Sequential access is easier to deal with in such circumstances.
Regarding the compression algorithm: since these data are going to be "live", you need something very fast, so that accessing the data has a very small latency impact.
There are several open-source alternatives out there for that purpose. For easier license management, I would recommend a BSD-licensed one. Since you use C++, the following look suitable:
http://code.google.com/p/snappy/
and
http://code.google.com/p/lz4/
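Putting those suggestions together (delta-code, group into a block, compress the block with something fast), here is a rough sketch in Java rather than C++, using the JDK's Deflater purely as a stand-in for a faster codec like Snappy or LZ4; the 4-byte big-endian layout is just one possible choice:

```java
import java.util.Arrays;
import java.util.zip.Deflater;

public class DeltaBlockCompress {
    // Delta-code a block of int samples, pack them as bytes, then compress the
    // whole block at once. Compressing field-by-field gives the compressor
    // nothing to work with; a block of ~64 KB of similar small deltas
    // compresses far better.
    static byte[] compressBlock(int[] samples) {
        byte[] buf = new byte[samples.length * 4];
        int prev = 0;
        for (int i = 0; i < samples.length; i++) {
            int delta = samples[i] - prev;   // small if neighbouring samples are correlated
            prev = samples[i];
            buf[4 * i]     = (byte) (delta >>> 24);
            buf[4 * i + 1] = (byte) (delta >>> 16);
            buf[4 * i + 2] = (byte) (delta >>> 8);
            buf[4 * i + 3] = (byte) delta;
        }
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);  // favour low latency
        deflater.setInput(buf);
        deflater.finish();
        byte[] out = new byte[buf.length + buf.length / 64 + 64]; // room for the no-gain worst case
        int len = deflater.deflate(out);
        deflater.end();
        return Arrays.copyOf(out, len);
    }
}
```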
I have a system where one machine generates small chunks of data in the form of objects containing arrays of integers and longs. These chunks get passed to another server, which in turn distributes them elsewhere.
I want to compress these objects so that the memory load on the pass-through server is reduced. I understand that compression algorithms like deflate need to build up a dictionary, so something like that wouldn't really work on data this small.
Are there any algorithms that could compress data like this efficiently?
If not, another thing I could do is batch these chunks into arrays of objects and compress the array once it gets to be a certain size. But I am reluctant to do this because I would have to change interfaces in an existing system. Compressing them individually would not require any interface changes, the way this is all set up.
Not that I think it matters, but the target system is Java.
Edit: Would Elias gamma coding be the best for this situation?
Thanks
If you think that reducing your data packet to its entropy level is the best you can do, you can try simple Huffman compression.
For an early look at how well this would compress, you can pass a packet through Huff0:
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
It is a simple order-0 Huffman encoder, so the result will be representative.
For more specific ideas on how to efficiently exploit the characteristics of your data, it would help to describe a bit what data the packets contain and how they are generated (as you have done in the comments, so they are ints (4 bytes?) and longs (8 bytes?)), and then provide one or a few samples.
It sounds like you're currently looking at general-purpose compression algorithms. The most effective way to compress small chunks of data is to build a special-purpose compressor that knows the structure of your data.
The important thing is that you need to match the coding you use with the distribution of values you expect from your data: to get a good result from Elias gamma coding, you need to make sure the values you code are smallish positive integers...
If different integers within the same block are not completely independent (e.g., if your arrays represent a time series), you may be able to use this to improve your compression (e.g., the differences between successive values in a time series tend to be smallish signed integers). However, because each block needs to be independently compressed, you will not be able to take this kind of advantage of differences between successive blocks.
If you're worried that your compressor might turn into an "expander", you can add an initial flag to indicate whether the data is compressed or uncompressed. Then, in the worst case where your data doesn't fit your compression model at all, you can always punt and send the uncompressed version; your worst-case overhead is the size of the flag...
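Putting those pieces together, a special-purpose compressor along these lines might look like the following sketch (plain Java, no libraries). The zigzag-delta-varint scheme is only one reasonable way to encode "smallish signed integers", and the raw-fallback branch behind the flag byte is left out for brevity:

```java
import java.io.ByteArrayOutputStream;

public class SmallBlockCodec {
    // Encode an int[] block as: one flag byte, then delta + zigzag + varint values.
    // Flag 1 = encoded; flag 0 would mean "raw fallback" (not shown) for blocks
    // that don't fit the model, so the worst case costs only one extra byte.
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(1);                                // flag: varint-encoded
        int prev = 0;
        for (int v : values) {
            int delta = v - prev;                    // time-series style: deltas stay small
            prev = v;
            int zz = (delta << 1) ^ (delta >> 31);   // zigzag: small magnitude -> small unsigned
            while ((zz & ~0x7F) != 0) {              // varint: 7 bits per byte, high bit = "more"
                out.write((zz & 0x7F) | 0x80);
                zz >>>= 7;
            }
            out.write(zz);
        }
        return out.toByteArray();
    }
}
```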
Elias Gamma Coding might actually increase the size of your data.
You already have upper bounds on your numbers (whatever fits into a 4- or probably 8-byte int/long). This method encodes the length of your numbers, followed by your number (probably not what you want). If you get many small values, it might make things smaller. If you also get big values, it will probably increase the size (the 8-byte unsigned max value would become almost twice as big).
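To put numbers on that: an Elias gamma code for a positive value n takes 2*floor(log2(n))+1 bits, so a value near Long.MAX_VALUE costs 125 bits versus 64 raw. A quick way to check the cost against your own value distribution (the helper is mine):

```java
public class GammaCost {
    // Number of bits Elias gamma coding uses for a positive value n:
    // floor(log2(n)) zero bits, followed by the floor(log2(n))+1 bits of n itself.
    static int gammaBits(long n) {
        if (n <= 0) throw new IllegalArgumentException("gamma codes only positive values");
        int k = 63 - Long.numberOfLeadingZeros(n);   // floor(log2(n))
        return 2 * k + 1;
    }

    public static void main(String[] args) {
        System.out.println(gammaBits(1));              // 1 bit
        System.out.println(gammaBits(1000));           // 19 bits
        System.out.println(gammaBits(Long.MAX_VALUE)); // 125 bits vs. 64 for a raw long
    }
}
```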
Look at the entropy of your data packets. If it's close to the maximum, compression will be useless. Otherwise, try different general-purpose compressors, though I'm not sure the time spent compressing and decompressing is worth the size reduction.
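A quick way to estimate that is an order-0 byte entropy calculation, which gives a rough bound on what a general-purpose compressor could do with a single packet (this sketch is mine, not from any library):

```java
public class PacketEntropy {
    // Order-0 byte entropy in bits per byte (maximum 8.0). If this comes out
    // close to 8, a general-purpose compressor has essentially nothing to remove.
    static double bitsPerByte(byte[] packet) {
        int[] counts = new int[256];
        for (byte b : packet) counts[b & 0xFF]++;
        double entropy = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / packet.length;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }
}
```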
I would have a close look at the options of your compression library, for instance deflateSetDictionary() and the flag Z_FILTERED in http://www.zlib.net/manual.html. If you can distribute - or hardwire in the source code - an agreed dictionary to both sender and receiver ahead of time, and if that dictionary is representative of real data, you should get decent compression savings. Oops - in Java look at java.util.zip.Deflater.setDictionary() and FILTERED.
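Here is a minimal Java sketch of that preset-dictionary approach; the dictionary contents are invented, and in practice you would build them from the keys and values that actually recur in your messages:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionaryDemo {
    // Both sides must agree on this out of band (hardwired or shipped once).
    static final byte[] DICT =
        "{\"id\":\"name\":\"price\":\"inStock\":true,false,".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setStrategy(Deflater.FILTERED);   // corresponds to zlib's Z_FILTERED
        deflater.setDictionary(DICT);              // must be set before the first deflate()
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 64];
        int len = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, len);
    }

    static byte[] decompress(byte[] compressed, int originalLength) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] buf = new byte[originalLength];
        int n = inflater.inflate(buf);
        if (n == 0 && inflater.needsDictionary()) { // zlib asks for the dictionary here
            inflater.setDictionary(DICT);
            n = inflater.inflate(buf);
        }
        inflater.end();
        return Arrays.copyOf(buf, n);
    }
}
```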
Does anyone know of a lossless compression algorithm that produces headerless output?
For example, one that does not store the Huffman tree used to compress the data? I'm not talking about hard-coded Huffman trees; I'd like to know whether there is any algorithm that can compress and decompress input without storing any metadata in its output. Or is this theoretically impossible?
Of course it is possible. Among others, the LZ family of compressors doesn't need to output anything apart from the compressed data itself, as the dictionary is built online as compression (or decompression) progresses. You have a lot of reference implementations of those LZ-type algorithms, for example LZMA, a component of 7-Zip.
Adaptive Huffman coding does exactly that. More generally, the term adaptive coding is used to describe entropy codes with this property. Some dictionary codes have this property too, e.g. run-length encoding (RLE) and Lempel-Ziv-Welch (LZW).
Run Length Encoding would be one example
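For instance, a trivial byte-oriented run-length codec carries nothing in the stream except (count, value) pairs, so the decoder needs no header or metadata at all; the exact format below is just an illustration:

```java
import java.io.ByteArrayOutputStream;

public class HeaderlessRle {
    // Encode as (runLength, value) byte pairs. No header, no metadata:
    // the decoder needs nothing beyond the compressed bytes themselves.
    static byte[] encode(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < data.length) {
            int run = 1;
            while (i + run < data.length && data[i + run] == data[i] && run < 255) run++;
            out.write(run);
            out.write(data[i]);
            i += run;
        }
        return out.toByteArray();
    }

    static byte[] decode(byte[] encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length; i += 2) {
            int run = encoded[i] & 0xFF;
            for (int j = 0; j < run; j++) out.write(encoded[i + 1]);
        }
        return out.toByteArray();
    }
}
```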
LZO springs to mind. It's used in OpenVPN, with great results.
Why are you looking for compression algorithms with headerless compressed output?
Perhaps (a) you have a system like 2-way telephony that needs low-latency streaming compression/decompression.
The adaptive coding category of compression algorithms mentioned by Zach Scrivena, and the LZ family of dictionary compression algorithms mentioned by Diego Sevilla and Javier, are excellent for this kind of application. Practical implementations of these algorithms usually do have a byte or two of metadata at the beginning (making them useless for (b) applications), but that has little or no effect on latency.
Perhaps (b) you are mainly interested in cryptography, and you have heard that compress-before-encrypt gives some improved security properties, as long as the compressed text does not have a fixed metadata header that acts as a "crib".
Modern encryption algorithms aren't (as far as we know) vulnerable to such "cribs", but if you're paranoid you might be interested in "bijective compression" (a, b, c, etc.).
It's not possible to detect errors in transmission (flipped bits, inserted bits, deleted bits, etc.) when a receiver gets such compressed output (making these algorithms not especially useful for (a) applications).
Perhaps (c) you are interested in headerless compression for some other reason. Sounds fascinating -- what is that reason?