What is the best lossless compression algorithm for random data

I need to compress a stream of random data like [25,94,182,3,254, ...]. The number of values is close to 4 million. I currently only get a 1.4x ratio with Huffman coding, and the LZW algorithm I tried takes too much time to compress. I am hoping to find an efficient compression method that still gives a high compression ratio, at least 3x.
Is there another algorithm that would be able to compress this random data better?

It depends on the distribution of the RNG. A compression ratio of 1.4x suggests that the data is not uniform, or that the RNG is not a good one. Huffman and arithmetic coding are practically the only options*, since a good RNG produces no other correlation between successive entries.
*To be precise, the best compression scheme has to be a 0-order statistical coder that can allocate a variable number of bits to each symbol in order to reach the Shannon entropy
H(x) = -Sigma_{i=1}^{N} P(x_i) log_2 P(x_i)
The theoretical optimum is achieved by arithmetic coding, but other encodings can come close if the symbol probabilities happen to suit them. Arithmetic coding can allocate less than one bit per symbol, whereas Huffman or Golomb coding needs at least one bit per symbol (or per symbol group).
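As a quick sanity check, you can measure the order-0 entropy of your stream and see what ratio is even possible before picking a coder. Here is a minimal Java sketch (the 4-million-element byte stream is simulated; the class and method names are just for illustration):

import java.util.Random;

public class EntropyEstimate {
    // Order-0 Shannon entropy in bits per symbol for byte-valued data.
    static double entropyBitsPerSymbol(int[] data) {
        long[] counts = new long[256];
        for (int v : data) {
            counts[v]++;                              // histogram of symbol frequencies
        }
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) continue;
            double p = (double) c / data.length;
            h -= p * (Math.log(p) / Math.log(2));     // -sum p * log2(p)
        }
        return h;
    }

    public static void main(String[] args) {
        int[] data = new int[4_000_000];
        Random rng = new Random(42);
        for (int i = 0; i < data.length; i++) {
            data[i] = rng.nextInt(256);               // stand-in for the real stream
        }
        double h = entropyBitsPerSymbol(data);
        // 8 / h is the best ratio any 0-order coder (Huffman, arithmetic) can reach.
        System.out.printf("entropy = %.3f bits/symbol, best 0-order ratio = %.2fx%n",
                          h, 8.0 / h);
    }
}

If this reports close to 8 bits/symbol, no 0-order coder will get anywhere near 3x.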

Related

Exploit batching (drawing many samples at once) for random sampling

Suppose I have large buffers of uniformly random bytes (entropy source).
I want to use it to draw many samples (e.g. 10^7 at a time) from a fixed (rational) probability distribution over a finite set (e.g., 8 elements).
I need
a theoretical guarantee that the specified distribution is reproduced exactly.
to be reasonably efficient with the random bits. E.g., if the Shannon entropy H of my distribution (over 8 symbols) is around 2.3, I would like to use at most 3 bits from my stream on average to draw a sample. Even better would be, say, within 20% of the Shannon limit.
to sample quickly. At the very least 100 Mbyte/sec on "one core of a standard processor".
reasonable RAM usage (not counting stored sampling results) below, say, 200MB
I do not care about the runtime of pre-computations that need to be done once per distribution.
There are very many algorithms and implementations to choose from and I'm having trouble comparing them in terms of trading off entropy consumption, speed and memory. I found an overview in this SO question. There are also many papers comparing algorithms (e.g. arXiv:1502.02539v6) and new algorithms being proposed (e.g. the "Fast Loaded Dice Roller", arXiv:2003.03830v2).
Knuth and Yao show that any optimal (in terms of entropy consumption) algorithm (that spits out one sample at a time) consumes between H and H+2 bits of entropy. By drawing multiple samples (i.e., sampling from the product distribution), one can get closer to the Shannon limit of using H bits per sample on average. This is sometimes called "batching".
My first instinct would thus be to use, say, the available C implementation of the "Fast Loaded Dice Roller" after "packing" my symbols to get a distribution over a range of integers that fit into one (or a few) bytes. However, the descriptions of these algorithms don't seem to focus on "batching". I wonder, perhaps other methods can be more efficient (in either entropy consumption, speed or memory needs) by making use of my large batch sizes?
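To make the "packing" step concrete, here is a sketch (not a full sampler) that assumes the distribution is given as integer weights over a common denominator. Pairs of symbols are packed into one super-symbol drawn from the product distribution, which is then handed to whatever exact sampler you choose (e.g. the FLDR C implementation). By the Knuth and Yao bound quoted above, an optimal sampler for the packed distribution costs at most 2H + 2 bits per draw, i.e. H + 1 bits per original symbol, and packing k symbols per draw brings this down to H + 2/k.

/**
 * Sketch: pack k = 2 symbols of a rational distribution into one draw from
 * the product distribution before handing it to an exact sampler (e.g. FLDR).
 * Weights are integers; probabilities are weights[i] / sum(weights).
 */
public class PackedDistribution {
    /** Weights of the product distribution over ordered pairs (i, j). */
    static long[] packPairs(long[] weights) {
        int n = weights.length;
        long[] packed = new long[n * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                packed[i * n + j] = weights[i] * weights[j];   // P(i,j) = P(i) * P(j)
            }
        }
        return packed;
    }

    /** Decode one packed super-symbol back into its two original symbols. */
    static int[] unpackPair(int superSymbol, int n) {
        return new int[] { superSymbol / n, superSymbol % n };
    }

    public static void main(String[] args) {
        long[] w = { 30, 14, 8, 5, 3, 2, 1, 1 };     // made-up 8-symbol distribution
        long[] packed = packPairs(w);
        // 'packed' has 64 entries summing to 64 * 64; feed it to the sampler
        // and call unpackPair() on every draw to recover two original samples.
        System.out.println("packed alphabet size: " + packed.length);
    }
}

Packing more than two symbols per draw follows the same pattern, at the cost of an exponentially larger table, which is why the memory budget caps how far batching alone can take you.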

Compression algorithms for nearly uniform data

I've seen questions on compression algorithms around SE, but none quite fit what I'm looking for. Clearly truly uniformly distributed data cannot be compressed, but how close can we get?
My (probably incorrect) thoughts: I would imagine that by transforming the data (normalizing in some way?), you could accentuate the non-uniformity aspects of nearly uniform data and then use that transformed set to compress, perhaps along with the inverse transform or its parameters. But maybe I'm totally wrong and they all perform equally terribly as the data approaches uniformity?
When I look at lists of (lossless) compression algorithms, I don't see them ranked by how effective they are against certain types of data, at least not in any concrete terms. Does anyone know of a source that dives into this?
As background, I have an application where the data set is not independent, but nevertheless appears to be nearly uniform (most of the symbols have very low frequencies, and none of them have very high frequencies). So I was wondering if there are algorithms that can exploit the sampling dependence even if the data frequencies are mostly low. Then of course it would be more helpful to have a source that detailed exactly why some compression algorithms might perform better at this than others, if such a thing existed.
The short answer is no. Such a thing both does not and cannot exist.
The long answer involves information theory.
What matters to a compression algorithm is not how hard it is to say the thing you are specifying. It is how many equally likely things could you have said instead, but didn't. That is, if you have M things you might have said that were equally likely, you must send a signal long enough that it specifies which of the M you said. And that requires log_2(M) bits to make it clear which one you actually said.
In the case of a stream of independent symbols, each with a known probability, we can figure out how many messages could be sent with equal likelihood, and thereby put a lower bound on how efficiently a message can be compressed. That lower bound is the entropy in bits per symbol sent. Arithmetic coding essentially reaches this bound, and Huffman coding comes within one bit per symbol of it (it reaches it exactly only when the symbol probabilities are powers of 1/2).
In order to do better than Huffman coding, we must find some additional structure to our messages. For example language often has correlations where "h" is likely to follow "t". Or in images, the color of a pixel tends to be similar to the color of a nearby pixel. Any such structure reduces the number of equally likely messages we could have sent, and opens up the possibility of a better compression algorithm.
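As a toy illustration (with made-up data): if neighbouring samples are correlated, a simple delta transform before the entropy coder can dramatically lower the order-0 entropy the coder sees.

public class DeltaEntropyDemo {
    static double entropyBitsPerSymbol(int[] data) {
        java.util.Map<Integer, Long> counts = new java.util.HashMap<>();
        for (int v : data) counts.merge(v, 1L, Long::sum);
        double h = 0.0;
        for (long c : counts.values()) {
            double p = (double) c / data.length;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    public static void main(String[] args) {
        // Made-up "structured" data: a slowly wandering signal in 0..255.
        java.util.Random rng = new java.util.Random(1);
        int[] raw = new int[1_000_000];
        int x = 128;
        for (int i = 0; i < raw.length; i++) {
            x = Math.max(0, Math.min(255, x + rng.nextInt(5) - 2));
            raw[i] = x;
        }
        // Delta transform: each symbol becomes its difference from the previous one.
        int[] delta = new int[raw.length];
        delta[0] = raw[0];
        for (int i = 1; i < raw.length; i++) delta[i] = raw[i] - raw[i - 1];
        System.out.printf("raw: %.2f bits/symbol, deltas: %.2f bits/symbol%n",
                          entropyBitsPerSymbol(raw), entropyBitsPerSymbol(delta));
    }
}

The raw values spread over most of the 0..255 range, while the deltas concentrate on five values, so an entropy coder applied after the transform does far better than one applied to the raw stream.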
However, you've not described any such structure. So 0-order entropy coding (Huffman, or slightly better, arithmetic coding) is the best you can do. And if the symbol probabilities are close to each other, it won't give you very much.
Sorry.

Depth Image compression up to a maximum permitted error

Articles on image compression often focus on generating the best possible image quality (PSNR) given a fixed compression ratio. I'm curious about getting the best possible compression ratio given a maximum permissible per-pixel error. My instinct is to greedily remove the smallest coefficients in the transformed data, keeping track of the error I've caused, until I can't remove any more without exceeding the maximum error. But I can't find any papers to confirm this approach. Can anyone point me to a reference about this problem?
edit
Let me give some more details. I'm trying to compress depth images from 3D scanners, not regular images. Color is not a factor. Depth images tend to have large smooth patches, but accurate discontinuities are important. Some pixels will be empty - outside the scanner's range or low confidence level - and not require compression.
The algorithm will need to run fast - optimally at 30 fps like the Microsoft Kinect, or at least somewhere in the 100 millisecond area. The algorithm will be included in a library I distribute. I prefer to minimize dependencies, so compression schemes that I can implement myself in a reasonably small amount of code are preferable.
This answer won't satisfy your request for references, but it's too long to post as a comment.
First, depth buffer compression for computer generated imagery may apply to your case. Usually this compression is done at the hardware level with a transparent interface, so it's typically designed to be simple and fast. Given this, it may be worth your while to search for depth buffer compression.
One of the major issues you're going to have with transform-based compressors (DCTs, wavelets, etc.) is that there's no easy way to find compact coefficients that meet your hard maximum-error criterion. (The problem you end up with looks a lot like linear programming. Wavelets can have localized behavior in most of their basis vectors, which can help somewhat, but it's still rather inconvenient.) To achieve the accuracy you desire you may need to add another refinement step, but this will also add more computation time and complexity, and will introduce another layer of imperfect entropy coding, leading to a loss of compression efficiency.
What you want is more akin to lossless compression than lossy compression. In this light, one approach would be to simply throw away the bits under your error threshold: if your maximum allowable error is X and your depths are represented as integers, integer-divide your depths by X and then apply lossless compression.
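A minimal sketch of that quantize-then-lossless idea, assuming 16-bit integer depths and using java.util.zip.Deflater for the lossless stage (class and method names are illustrative):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class QuantizeThenDeflate {
    /**
     * Integer-divide each depth by the maximum allowed error, then DEFLATE
     * the result. The decoder multiplies back by maxError, so the per-pixel
     * reconstruction error is at most maxError - 1 with truncating division.
     */
    static byte[] compress(short[] depths, int maxError) {
        byte[] quantized = new byte[depths.length * 2];
        for (int i = 0; i < depths.length; i++) {
            int q = (depths[i] & 0xFFFF) / maxError;    // coarsen to the error budget
            quantized[2 * i]     = (byte) (q >> 8);
            quantized[2 * i + 1] = (byte) q;
        }
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(quantized);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }
}

The smooth patches in a depth image become long runs of identical quantized values, which DEFLATE handles well; a stronger lossless coder can be swapped in later without changing the error guarantee.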
Another issue you're facing is the representation of depth -- depending on your circumstances it may be a floating point number, an integer, it may be in a projective coordinate system, or even more bizarre.
Given these restrictions, I recommend a scheme like BTPC, as it allows for a more easily adapted wavelet-like scheme where errors are more clearly localized and easier to understand and account for. Additionally, BTPC has shown itself to be robust across many types of images, with a good ability to handle continuous gradients and sharp edges with low loss of fidelity -- exactly the sorts of traits you're looking for.
Since BTPC is predictive, it doesn't matter particularly how your depth format is stored -- you just need to modify your predictor to take your coordinate system and numeric type (integer vs. floating) into account.
Since BTPC doesn't do terribly much math, it can run pretty fast on general CPUs, too, although it may not be as easy to vectorize as you'd like. (It sounds like you're possibly doing low level optimized game programming, so this may be a serious consideration for you.)
If you're looking for something simpler to implement I'd recommend a "filter" type of approach (similar to PNG) with a Golomb-Rice coder strapped on. Rather than coding the deltas perfectly to end up with lossless compression, you can code to a "good enough" degree. The advantage of doing this as compared to a quantize-then-lossless-encode style compressor is that you can potentially maintain more continuity.
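A sketch of what that filter approach can look like, under simple assumptions (integer depths, a plain left-neighbour predictor, error tolerance `tolerance`); the residuals would then go to a Golomb-Rice coder, omitted here. The key detail is that the encoder predicts from the value the decoder will reconstruct, so the error never exceeds the tolerance even over long runs:

public class NearLosslessFilter {
    /**
     * JPEG-LS-style near-lossless residuals with a left predictor. Each
     * residual is quantized so the reconstruction error stays within
     * +/- tolerance; feed the residuals to a Golomb-Rice coder afterwards.
     */
    static int[] encodeRow(int[] row, int tolerance) {
        int step = 2 * tolerance + 1;
        int[] residuals = new int[row.length];
        int reconstructed = 0;                       // the decoder also starts from 0
        for (int i = 0; i < row.length; i++) {
            int error = row[i] - reconstructed;      // predict from the decoded value
            int q = (error >= 0) ? (error + tolerance) / step
                                 : -((tolerance - error) / step);
            residuals[i] = q;
            reconstructed += q * step;               // exactly what the decoder computes
        }
        return residuals;
    }

    static int[] decodeRow(int[] residuals, int tolerance) {
        int step = 2 * tolerance + 1;
        int[] row = new int[residuals.length];
        int reconstructed = 0;
        for (int i = 0; i < residuals.length; i++) {
            reconstructed += residuals[i] * step;
            row[i] = reconstructed;
        }
        return row;
    }
}

A real implementation would use a 2-D predictor (left, above, or a median of both) and handle the "no data" pixels separately, but the error bound comes entirely from the residual quantization shown here.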
"greedily remove the smallest coefficients" reminds me of SVD compression, where you use the data associated with the first k largest eigenvalues to approximate the data. The rest of the eigenvalues that are small don't hold significant information and can be discarded.
Large k -> high quality, low compression
Small k -> lower quality, high compression
(disclaimer: I have no idea what I'm talking about here, but it might help)
edit:
here is a better illustration of SVD compression
I am not aware of any references for the problem you have proposed.
However, one direction I can think of is using optimization techniques to select the best coefficients. Techniques like genetic algorithms, hill climbing, and simulated annealing can be used in this regard.
Given that I have experience with genetic algorithms, I can suggest the following process. If you are not familiar with genetic algorithms, I recommend reading the Wikipedia page on them.
Your problem can be thought of as selecting a subset of coefficients that gives the minimum reconstruction error. Say there are N coefficients. It is easy to establish that there are 2^N subsets. Each subset can be represented by a string of N bits. For example, for N=5, the string 11101 represents the subset containing every coefficient except coefficient 4. With genetic algorithms it is possible to find an optimal bit string. The objective function can be chosen as the absolute error between the reconstructed and the original signals. Note, however, that the error is zero when all the coefficients are kept.
To get around this problem, you may choose to modulate the objective function with an appropriate function that discourages objective-function values near zero and is monotonically increasing after a threshold. A function like |log(\epsilon + f)| may suffice.
If what I propose seems interesting to you, do let me know. I have an implementation of a genetic algorithm, but it is tailored to my needs and you might not be in a position to adapt it for this problem. I am willing to work with you on this problem, as it seems interesting to explore.
Do let me know.
I think you are pretty close to the solution, but there is an issue that I think you should pay some attention to.
Different wavelet coefficients correspond to basis functions with different scales (and shifts), so the error introduced by eliminating a particular coefficient depends not only on its value but also on its position (especially its scale). The weight of a coefficient should therefore be something like w(c) = amp(c) * F(scale, shift), where amp(c) is the amplitude of the coefficient and F is a function that depends on the data being compressed. Once the weights are determined this way, the problem reduces to the knapsack problem, which can be solved in many ways (for example, reorder the coefficients and eliminate the smallest ones until you hit the threshold error on a pixel affected by the corresponding basis function). The hard part is determining F(scale, shift). You can do it in the following way: if the data you are compressing is relatively stable (for example surveillance video), you can estimate F as the average probability of getting an unacceptable error when eliminating the component with the given scale and shift from the wavelet decomposition. To do that, perform an SVD (or PCA) decomposition on historical data and compute F(scale, shift) as a sum of scalar products of the wavelet basis function with the eigenvectors, weighted by the eigenvalues:
F(scale, shift) = sum_i eValue(i) * <w(scale, shift), eVector(i)>, where eValue(i) is the eigenvalue corresponding to eigenvector eVector(i), and w(scale, shift) is the wavelet basis function with the given scale and shift.
Iteratively evaluating different sets of coefficients will not help your goal of being able to compress frames as quickly as they are generated, and will not help you to keep complexity low.
Depth maps are different from intensity maps in several ways that can help you.
Large areas of "no data" can be handled very efficiently by run-length encoding.
Measurement error in intensity images is constant across the image after fixed-noise has been subtracted, but depth maps from both Kinects and stereo vision systems have errors that increase as an inverse function of depth. If these are the scanners you are targeting then you can use lossier compression for closer pixels - because the errors your lossy function introduces are independent of sensor error, the total error won't be increased until your lossy function's error is greater than the sensor error.
A team at Microsoft had a lot of success with a very low-loss algorithm that relied heavily on run-length encoding (see paper here), beating out JPEG 2000 with better compression and excellent performance; however, part of their success seemed to stem from the relatively crude depth maps their sensor produces. If you are targeting Kinects, you may find it hard to improve on their method.
I think you are looking for something like the JPEG-LS algorithm, which tries to limit the maximum per-pixel error. It is, however, mainly designed for compression of natural or medical images and is not well suited to depth images (which are smoother).
The term "near-lossless compression" refers to a lossy algorithm for which each reconstructed image sample differs from the corresponding original image sample by not more than a pre-specified value, the (usually small) "loss". Lossless compression corresponds to loss = 0. (See the link to the original reference.)
I'd try preprocessing the image, then compressing with a general method, such as PNG.
Preprocessing for PNG (first read this)
// Snap each pixel to its left or upper neighbour when within the error
// budget, so PNG's filters and DEFLATE see long runs of identical values.
for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++)
        if (x > 0 && Math.abs(A[y][x - 1] - A[y][x]) < threshold)
            A[y][x] = A[y][x - 1];
        else if (y > 0 && Math.abs(A[y - 1][x] - A[y][x]) < threshold)
            A[y][x] = A[y - 1][x];

Good compression algorithm for small chunks of data? (around 2k in size)

I have a system where one machine generates small chunks of data in the form of objects containing arrays of integers and longs. These chunks get passed to another server, which in turn distributes them elsewhere.
I want to compress these objects so the memory load on the pass-through server is reduced. I understand that compression algorithms like deflate need to build a dictionary so something like that wouldn't really work on data this small.
Are there any algorithms that could compress data like this efficiently?
If not, another thing I could do is batch these chunks into arrays of objects and compress the array once it gets to be a certain size. But I am reluctant to do this because I would have to change interfaces in an existing system. Compressing them individually would not require any interface changes, the way this is all set up.
Not that I think it matters, but the target system is Java.
Edit: Would Elias gamma coding be the best for this situation?
Thanks
If you think that reducing your data packet to its entropy level is the best you can do, you can try simple Huffman compression.
For an early look at how well this would compress, you can pass a packet through Huff0 :
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
It is a simple order-0 Huffman encoder, so the result should be representative.
For more specific ideas on how to efficiently use the characteristics of your data, it would help to describe a bit what the packets contain and how they are generated (as you have done in the comments, so they are ints (4 bytes?) and longs (8 bytes?)), and then provide one or a few samples.
It sounds like you're currently looking at general-purpose compression algorithms. The most effective way to compress small chunks of data is to build a special-purpose compressor that knows the structure of your data.
The important thing is that you need to match the coding you use with the distribution of values you expect from your data: to get a good result from Elias gamma coding, you need to make sure the values you code are smallish positive integers...
If different integers within the same block are not completely independent (e.g., if your arrays represent a time series), you may be able to use this to improve your compression (e.g., the differences between successive values in a time series tend to be smallish signed integers). However, because each block needs to be independently compressed, you will not be able to take this kind of advantage of differences between successive blocks.
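For example, here is a sketch of such a special-purpose coder for one chunk, assuming the ints form a slowly varying sequence (a pure illustration, not the poster's actual format): take differences, zigzag-map them into small non-negative integers, then write them as variable-length bytes.

import java.io.ByteArrayOutputStream;

public class DeltaVarintCodec {
    /** Delta + zigzag + varint encoding of one small block of ints. */
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int v : values) {
            int delta = v - previous;                 // small if values vary slowly
            previous = v;
            int zz = (delta << 1) ^ (delta >> 31);    // zigzag: small magnitude -> small code
            while ((zz & ~0x7F) != 0) {               // varint: 7 bits per byte, MSB = "more"
                out.write((zz & 0x7F) | 0x80);
                zz >>>= 7;
            }
            out.write(zz);
        }
        return out.toByteArray();
    }
}

If the deltas really are small, most values come out as a single byte instead of four; if they aren't, this degrades gracefully to at most five bytes per int, which is where the compressed/uncompressed flag mentioned below becomes useful.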
If you're worried that your compressor might turn into an "expander", you can add an initial flag to indicate whether the data is compressed or uncompressed. Then, in the worst case where your data doesn't fit your compression model at all, you can always punt and send the uncompressed version; your worst-case overhead is the size of the flag...
Elias Gamma Coding might actually increase the size of your data.
You already have upper bounds on your numbers (whatever fits into a 4-byte int or 8-byte long). Elias gamma encodes each number's bit length in unary, followed by the number itself in binary, so its cost grows with the magnitude of the value (probably not what you want). If you mostly have small values, it will make things smaller; if you also get big values, it will probably increase the size (a value near the 8-byte maximum becomes almost twice as long).
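A tiny sketch to make the size behaviour concrete (illustrative only; a real coder would pack bits rather than build strings):

public class EliasGammaDemo {
    /** Elias gamma code of n >= 1 as a bit string: unary length prefix, then the value in binary. */
    static String gamma(long n) {
        String binary = Long.toBinaryString(n);
        return "0".repeat(binary.length() - 1) + binary;
    }

    public static void main(String[] args) {
        System.out.println(gamma(5));                          // 00101 -> 5 bits
        System.out.println(gamma(Long.MAX_VALUE).length());    // 125 bits vs. 64 raw
    }
}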
Look at the entropy of your data packets. If it's close to the maximum, compression will be useless. Otherwise, try different general-purpose compressors, though I'm not sure whether the time spent compressing and decompressing is worth the size reduction.
I would have a close look at the options of your compression library, for instance deflateSetDictionary() and the flag Z_FILTERED in http://www.zlib.net/manual.html. If you can distribute - or hardwire in the source code - an agreed dictionary to both sender and receiver ahead of time, and if that dictionary is representative of real data, you should get decent compression savings. Oops - in Java look at java.util.zip.Deflater.setDictionary() and FILTERED.
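A sketch of that preset-dictionary route in Java; the dictionary contents below are a placeholder, and in practice both sides must ship the exact same bytes, chosen to contain patterns that really occur in your packets:

import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionaryExample {
    // Agreed on ahead of time by sender and receiver (placeholder contents).
    static final byte[] SHARED_DICT =
            "replace with bytes representative of real packets".getBytes();

    static byte[] compress(byte[] packet) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(SHARED_DICT);
        deflater.setInput(packet);
        deflater.finish();
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed, int maxLength) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[maxLength];
        int n = inflater.inflate(out);
        if (n == 0 && inflater.needsDictionary()) {   // zlib signals that the preset dictionary is needed
            inflater.setDictionary(SHARED_DICT);
            n = inflater.inflate(out);
        }
        inflater.end();
        return java.util.Arrays.copyOf(out, n);
    }
}

This keeps the per-packet interface unchanged while letting every packet benefit from a shared "history", which is usually where most of the savings on 2 KB chunks come from.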

What are some alternatives to a bit array?

I have an information retrieval application that creates bit arrays on the order of 10s of million bits. The number of "set" bits in the array varies widely, from all clear to all set. Currently, I'm using a straight-forward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes.
My plan is to look at the cardinality of the first N bits, then make a decision about what data structure to use for the remainder. Clearly some data structures are better for very sparse bit arrays, and others when roughly half the bits are set (when most bits are set, I can use negation to treat it as a sparse set of zeroes).
What structures might be good at each extreme?
Are there any in the middle?
Here are a few constraints or hints:
The bits are set only once, and in index order.
I need 100% accuracy, so something like a Bloom filter isn't good enough.
After the set is built, I need to be able to efficiently iterate over the "set" bits.
The bits are randomly distributed, so run-length–encoding algorithms aren't likely to be much better than a simple list of bit indexes.
I'm trying to optimize memory utilization, but speed still carries some weight.
Something with an open source Java implementation is helpful, but not strictly necessary. I'm more interested in the fundamentals.
Unless the data is truly random and has a symmetric 1/0 distribution, then this simply becomes a lossless data compression problem and is very analogous to CCITT Group 3 compression used for black and white (i.e.: Binary) FAX images. CCITT Group 3 uses a Huffman Coding scheme. In the case of FAX they are using a fixed set of Huffman codes, but for a given data set, you can generate a specific set of codes for each data set to improve the compression ratio achieved. As long as you only need to access the bits sequentially, as you implied, this will be a pretty efficient approach. Random access would create some additional challenges, but you could probably generate a binary search tree index to various offset points in the array that would allow you to get close to the desired location and then walk in from there.
Note: The Huffman scheme still works well even if the data is random, as long as the 1/0 distribution is not perfectly even. That is, the less even the distribution, the better the compression ratio.
Finally, if the bits are truly random with an even distribution, then, well, according to Mr. Claude Shannon, you are not going to be able to compress it any significant amount using any scheme.
I would strongly consider using range encoding in place of Huffman coding. In general, range encoding can exploit asymmetry more effectively than Huffman coding, but this is especially so when the alphabet size is so small. In fact, when the "native alphabet" is simply 0s and 1s, the only way Huffman can get any compression at all is by combining those symbols -- which is exactly what range encoding will do, more effectively.
Maybe it's too late for you, but there is a very fast and memory-efficient library for sparse bit arrays (lossless) and other data types, based on tries: look at Judy arrays.
Thanks for the answers. This is what I'm going to try for dynamically choosing the right method:
I'll collect all of the first N hits in a conventional bit array, and choose one of three methods, based on the symmetry of this sample.
If the sample is highly asymmetric, I'll simply store the indexes of the set bits (or maybe the distances to the next set bit) in a list.
If the sample is highly symmetric, I'll keep using a conventional bit array.
If the sample is moderately symmetric, I'll use a lossless compression method like the Huffman coding suggested by InSciTekJeff.
The boundaries between the asymmetric, moderate, and symmetric regions will depend on the time required by the various algorithms balanced against the space they need, where the relative value of time versus space would be an adjustable parameter. The space needed for Huffman coding is a function of the symmetry, and I'll profile that with testing. Also, I'll test all three methods to determine the time requirements of my implementation.
It's possible (and actually I'm hoping) that the middle compression method will always be better than the list or the bit array or both. Maybe I can encourage this by choosing a set of Huffman codes adapted for higher or lower symmetry. Then I can simplify the system and just use two methods.
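A sketch of that selection logic (the thresholds and names are made up; the "entropy-coded" branch stands in for whichever compressed representation wins the profiling):

import java.util.BitSet;

public class AdaptiveBitStore {
    // Made-up thresholds; in practice they come from profiling time vs. space.
    static final double SPARSE_LIMIT = 0.05;   // minority bit below 5%: store indexes
    static final double DENSE_LOW    = 0.40;   // between 40% and 60% set: plain bit array
    static final double DENSE_HIGH   = 0.60;

    /** Decide how to store the remaining bits from the first n observed. */
    static String chooseRepresentation(BitSet firstBits, int n) {
        double density = (double) firstBits.cardinality() / n;
        double asymmetry = Math.min(density, 1.0 - density);
        if (asymmetry < SPARSE_LIMIT) {
            return "index-list";       // positions (or gaps) of the minority bit value
        } else if (density >= DENSE_LOW && density <= DENSE_HIGH) {
            return "bit-array";        // ~1 bit per entry is already near optimal
        } else {
            return "entropy-coded";    // Huffman/range-code runs of the minority bit
        }
    }
}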
One more compression thought:
If the bit array is not crazy long, you could try applying the Burrows-Wheeler transform before using any repetition encoding, such as Huffman. A naive implementation would take O(n^2) memory during (de)compression and O(n^2 log n) time to decompress - there are almost certainly shortcuts to be had, as well. But if there's any sequential structure to your data at all, this should really help the Huffman encoding out.
You could also apply that idea to one block at a time to keep the time/memory usage more practical. Using one block at time could allow you to always keep most of the data structure compressed if you're reading/writing sequentially.
Straightforward lossless compression is the way to go. To make it searchable you will have to compress relatively small blocks and create an index into the array of blocks. This index can contain the bit offset of the starting bit in each block.
Quick combinatoric proof that you can't really save much space:
Suppose you have an arbitrary subset of n/2 bits set to 1 out of n total bits. You have (n choose n/2) possibilities. Using Stirling's formula, this is roughly 2^n / sqrt(n) * sqrt(2/pi). If every possibility is equally likely, then there's no way to give more likely choices shorter representations. So we need log_2(n choose n/2) bits, which is about n - (1/2) log_2(n) bits.
That's not a very good savings of memory. For example, if you're working with n=2^20 (1 meg), then you can only save about 10 bits. It's just not worth it.
Having said all that, it also seems very unlikely that any really useful data is truly random. In case there's any more structure to your data, there's probably a more optimistic answer.
