Compression algorithms for nearly uniform data - algorithm

I've seen questions on compression algorithms around SE, but none quite fit what I'm looking for. Clearly truly uniformly distributed data cannot be compressed, but how close can we get?
My (probably incorrect) thoughts: I would imagine that by transforming the data (normalizing in some way?), you could accentuate the non-uniformity aspects of nearly uniform data and then use that transformed set to compress, perhaps along with the inverse transform or its parameters. But maybe I'm totally wrong and they all perform equally terribly as the data approaches uniformity?
When I look at lists of (lossless) compression algorithms, I don't see them ranked by how effective they are against certain types of data, at least not in any concrete terms. Does anyone know of a source that dives into this?
As background, I have an application where the data set is not independent, but nevertheless appears to be nearly uniform (most of the symbols have very low frequencies, and none of them have very high frequencies). So I was wondering if there are algorithms that can exploit the sampling dependence even if the data frequencies are mostly low. Then of course it would be more helpful to have a source that detailed exactly why some compression algorithms might perform better at this than others, if such a thing existed.

The short answer is no. Such a thing both does not and cannot exist.
The long answer involves information theory.
What matters to a compression algorithm is not how hard it is to say the thing you are specifying. It is how many equally likely things you could have said instead, but didn't. That is, if you have M things you might have said that were equally likely, you must send a signal long enough that it specifies which of the M you said. And that requires log_2(M) bits to make it clear which one you actually said.
In the case of a stream of independent symbols, each with a known probability, we can figure out how many messages could be sent with equal likelihood, and thereby put a lower bound on how efficiently a message can be compressed. That lower bound is the entropy, in bits per symbol sent. Huffman coding comes within one bit per symbol of this bound, and reaches it exactly when every symbol probability is a power of 1/2.
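As a rough illustration of the entropy bound, here is a minimal Python sketch that computes the empirical entropy of a symbol stream under the independence assumption. The function name `entropy_bits_per_symbol` and the two sample strings are made up for the example.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(data):
    """Empirical Shannon entropy, in bits per symbol, assuming independent symbols."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Four equally likely symbols: log_2(4) = 2 bits per symbol, nothing to gain.
uniform = "abcd" * 250
# Skewed frequencies: fewer bits per symbol are needed on average.
skewed = "a" * 700 + "b" * 100 + "c" * 100 + "d" * 100

print(entropy_bits_per_symbol(uniform))
print(entropy_bits_per_symbol(skewed))
```

The closer the symbol frequencies are to each other, the closer the result gets to log_2 of the alphabet size, which is the cost of not compressing at all.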
In order to do better than Huffman coding, we must find some additional structure to our messages. For example language often has correlations where "h" is likely to follow "t". Or in images, the color of a pixel tends to be similar to the color of a nearby pixel. Any such structure reduces the number of equally likely messages we could have sent, and opens up the possibility of a better compression algorithm.
However, you've not described such a structure. So an entropy coder such as Huffman coding is about the best you can do. And if the symbol probabilities are close to each other, it won't give you very much.
Sorry.

Related

Uncertainty versus randomness

I would like to know the difference between uncertainty and randomness in a mathematical sense. I tried to find it but I got confused, as some people say they are the same. Can anyone provide the logical reasoning behind that? If they are not the same, then please explain why.
Don't get too hung up on it.
People use different words in different situations.
It's not so much that they have different meanings, as that their meanings are situation-dependent.
Randomness is just a fuzzy general term meaning something is random.
In statistics, uncertainty is used to mean that some property of a distribution, such as its mean, is itself unknown but can be given a distribution.
For example, suppose you want to know the average weight of all people.
You could find it out exactly if you could go around to all people, get their weight, add it all up, and divide by the number of people.
But that's too hard to do, so suppose you just pick 10 people at random and get their average weight, and pretend it's the same as the average of everybody.
That's called the sample mean, but you know it isn't accurate.
It has what is called a standard error, meaning it has uncertainty.
In fact, if you were to do that experiment many times over with different people, you would get a different sample mean every time, and those sample means would themselves form a bell-shaped distribution, the standard deviation of which would be called the standard error, representing its uncertainty.
In general, if you increased the number of people you look at by a factor of 100, you can reduce the standard error, the uncertainty, by a factor of 10.
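The factor-of-100 versus factor-of-10 relationship can be checked with a small simulation. This is only a sketch: the population of weights is synthetic (mean 70 kg, standard deviation 15 kg are assumptions), and `standard_error_of_mean` is a name invented here.

```python
import random
import statistics

def standard_error_of_mean(population, sample_size, trials=2000, seed=0):
    """Estimate the standard error by repeating the sampling experiment many times."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.choices(population, k=sample_size))
             for _ in range(trials)]
    # The spread of the sample means is the standard error.
    return statistics.stdev(means)

rng = random.Random(1)
# Hypothetical weights in kg; any population would do.
population = [rng.gauss(70, 15) for _ in range(100_000)]

se_small = standard_error_of_mean(population, 10)
se_large = standard_error_of_mean(population, 1000)
print(se_small, se_large, se_small / se_large)
```

The printed ratio should come out close to 10, since the standard error shrinks with the square root of the sample size.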
I bet you can tell that people who take polls for a living care about this stuff very much.
EDIT for the downvoter: In case the downvote is because this doesn't look like a stackoverflow question or answer,
I've made a point of advocating the random pausing method of profiling.
Profiling in large part is perceived to be about measuring (statistically) the time that programming constructs are responsible for.
Often people are inhibited from using that method because they are afraid the results have too much uncertainty.
This post gets very specific about what that uncertainty actually is.
It shows that the bogey-man fear of uncertainty has the effect of preventing people from finding really substantial speedups in their code.
So naïveté about statistics is definitely a serious programming problem.
My view looks at a scenario using three different coloured balls:
I love some of the answers given here. My own view, based on my current research, is that these are two distinct terms. Uncertainty refers to not knowing in advance which ball could be selected when a person, for instance, is given a chance to select one ball from three different coloured balls.
This remains true when each ball has an equal chance of being selected, i.e. equal probabilities. However, things soon get complex when each ball has its own distinct probability. Chances are that the one with the highest probability will be selected. This seems especially true in algorithm development, which would almost always select the highest probability, compromising the meaning of randomness.
Having said all of this - I believe these concepts remain confusing which has just made me realise the time I need to dedicate on clearly distinguishing between the two to make sure my current research is not confusing. My own predicament is that I need to work on stochastic vs deterministic views. Based on the current view stochastic would be more uncertain than random whereas deterministic would be more probability based i.e. knowing for certain that the highest probability would be chosen; but this seems very far from the truth.
It seems as if uncertainty holds until just before a ball is selected/touched and loses its meaning as soon as the ball is picked, which should result in its probability being revised. I personally think the terms have theoretical differences which perhaps allows them to be used interchangeably.
Uncertainty in math and science typically means there is a lack of facts, or the facts are unobtainable. Weather forecasting is a great example of uncertainty.
Randomness has many definitions. Commonly it's used in probability / statistics as a measure or quantification of uncertainty. So in my weather example, a 30% chance of rain is a measure of uncertainty. The more general definition (which also applies to math / science) is unpredictable, or lack of order.
There is definitely a fuzzy distinction between the two.
According to the Bayesian interpretation of probability, uncertainty and randomness are just two names for the same thing.
If an experiment is random, then it is uncertain to you. If something is uncertain to you, then it has the randomness property.

Depth Image compression up to a maximum permitted error

Articles on image compression often focus on generating the best possible image quality (PSNR) given a fixed compression ratio. I'm curious about getting the best possible compression ratio given a maximum permissible per-pixel error. My instinct is to greedily remove the smallest coefficients in the transformed data, keep track of the error I've caused, until I can't remove any more without passing the maximum error. But I can't find any papers to confirm it. Can anyone point me to a reference about this problem?
edit
Let me give some more details. I'm trying to compress depth images from 3D scanners, not regular images. Color is not a factor. Depth images tend to have large smooth patches, but accurate discontinuities are important. Some pixels will be empty - outside the scanner's range or low confidence level - and not require compression.
The algorithm will need to run fast - optimally at 30 fps like the Microsoft Kinect, or at least somewhere in the 100 millisecond area. The algorithm will be included in a library I distribute. I prefer to minimize dependencies, so compression schemes that I can implement myself in a reasonably small amount of code are preferable.
This answer won't satisfy your request for references, but it's too long to post as a comment.
First, depth buffer compression for computer generated imagery may apply to your case. Usually this compression is done at the hardware level with a transparent interface, so it's typically designed to be simple and fast. Given this, it may be worth your while to search for depth buffer compression.
One of the major issues you're going to have with transform-based compressors (DCTs, wavelets, etc.) is that there's no easy way to find compact coefficients that meet your hard maximum error criterion. (The problem you end up with looks a lot like linear programming. Wavelets can have localized behavior in most of their basis vectors, which can help somewhat, but it's still rather inconvenient.) To achieve the accuracy you desire you may need to add another refinement step, but this will also add computation time and complexity, and will introduce another layer of imperfect entropy coding, leading to a loss of compression efficiency.
What you want is more akin to lossless compression than lossy compression. In this light, one approach would be to simply throw away the bits under your error threshold: if your maximum allowable error is X and your depths are represented as integers, integer-divide your depths by X and then apply lossless compression.
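A minimal sketch of the "throw away the bits under the threshold, then compress losslessly" idea, assuming non-negative integer depths that fit in 16 bits after quantisation. `zlib` stands in for whatever lossless compressor you prefer, the function names are invented for the example, and the scanline is synthetic.

```python
import struct
import zlib

def compress_depths(depths, max_error):
    """Quantise non-negative integer depths with step `max_error`, then deflate."""
    quantised = [d // max_error for d in depths]
    raw = struct.pack(f"<{len(quantised)}H", *quantised)
    return zlib.compress(raw, 9)

def decompress_depths(blob, count, max_error):
    """Invert compress_depths; each value is restored to the middle of its bin."""
    quantised = struct.unpack(f"<{count}H", zlib.decompress(blob))
    return [q * max_error + max_error // 2 for q in quantised]

# Synthetic scanline: two smooth ramps with a sharp discontinuity (values in mm).
depths = [1000 + x // 3 for x in range(320)] + [2500 + x // 5 for x in range(320)]
max_error = 8

blob = compress_depths(depths, max_error)
restored = decompress_depths(blob, len(depths), max_error)
worst = max(abs(a - b) for a, b in zip(depths, restored))
print(len(blob), worst)
```

Restoring to the middle of each bin keeps the worst-case error at about half the step, so the hard limit is respected with room to spare.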
Another issue you're facing is the representation of depth -- depending on your circumstances it may be a floating point number, an integer, it may be in a projective coordinate system, or even more bizarre.
Given these restrictions, I recommend a scheme like BTPC as it allows for a more easily adapted wavelet-like scheme where errors are more clearly localized and easier to understand and account for. Additionally, BTPC has shown a great resistance to many types of images and a good ability to handle continuous gradients and sharp edges with low loss of fidelity -- exactly the sorts of traits you're looking for.
Since BTPC is predictive, it doesn't matter particularly how your depth format is stored -- you just need to modify your predictor to take your coordinate system and numeric type (integer vs. floating) into account.
Since BTPC doesn't do terribly much math, it can run pretty fast on general CPUs, too, although it may not be as easy to vectorize as you'd like. (It sounds like you're possibly doing low level optimized game programming, so this may be a serious consideration for you.)
If you're looking for something simpler to implement I'd recommend a "filter" type of approach (similar to PNG) with a Golomb-Rice coder strapped on. Rather than coding the deltas perfectly to end up with lossless compression, you can code to a "good enough" degree. The advantage of doing this as compared to a quantize-then-lossless-encode style compressor is that you can potentially maintain more continuity.
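One way to code deltas to a "good enough" degree is to quantise each prediction residual so that the reconstruction never drifts more than a fixed amount from the original. The sketch below is an assumption-laden illustration, not the answer's exact method: it uses the previous reconstructed sample as the predictor, and the function names are invented. The small integers it produces are what would then be handed to a Golomb-Rice coder.

```python
def encode_near_lossless(row, max_error):
    """Predict each sample from the previous *reconstructed* sample and
    quantise the residual so the reconstruction error never exceeds max_error."""
    step = 2 * max_error + 1
    residuals, previous = [], 0
    for value in row:
        error = value - previous
        # Round the residual to the nearest multiple of `step`.
        if error >= 0:
            q = (error + max_error) // step
        else:
            q = -((-error + max_error) // step)
        residuals.append(q)
        previous += q * step  # track exactly what the decoder will see
    return residuals

def decode_near_lossless(residuals, max_error):
    """Rebuild the row by accumulating the dequantised residuals."""
    step = 2 * max_error + 1
    row, previous = [], 0
    for q in residuals:
        previous += q * step
        row.append(previous)
    return row

row = [500, 501, 503, 502, 502, 900, 901, 899]
codes = encode_near_lossless(row, 2)
decoded = decode_near_lossless(codes, 2)
print(codes)
print(decoded)
```

Because the encoder predicts from the reconstructed value rather than the original one, the error does not accumulate along the row, and sharp discontinuities survive within the same bound as smooth areas.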
"greedily remove the smallest coefficients" reminds me of SVD compression, where you use the components associated with the k largest singular values to approximate the data. The remaining singular values are small, don't hold significant information, and can be discarded.
Large k -> high quality, low compression
Small k -> lower quality, high compression
(disclaimer: I have no idea what I'm talking here but it might help)
edit:
here is a better illustration of SVD compression
I am not aware of any references in case of the problem you have proposed.
However, one direction I can think of is using optimization techniques to select the best coefficients. Techniques like genetic algorithms, hill climbing, and simulated annealing can be used in this regard.
Given that I have experience in genetic algorithms, I can suggest the following process. If you are not familiar with genetic algorithms, I recommend you read the wiki page on genetic algorithms.
Your problem can be thought of as selecting a subset of coefficients which gives the minimum reconstruction error. Say there are N coefficients. It is easy to establish that there are 2^N subsets. Each subset can be represented by a string of N binary digits. For example, for N=5, the string 11101 represents the subset that contains all the coefficients except coeff4. With genetic algorithms it is possible to find an optimum bit string. The objective function can be chosen as the absolute error between the reconstructed and the original signals. However, I am aware that you can get an error of zero when all the coefficients are taken.
To get around this problem, you may choose to modulate the objective function with an appropriate function which discourages objective function values near zero and is monotonically increasing after a threshold. A function like | log( \epsilon + f ) | may suffice.
If what I propose seems interesting to you, do let me know. I have an implementation of a genetic algorithm with me. But it is tailored to my needs and you might not be in a position to adapt it for this problem. I am willing to work with you on this problem as the problem seems interesting to explore.
Do let me know.
I think you are pretty close to the solution, but there is an issue that I think you should pay some attention to.
Different wavelet coefficients correspond to functions with different scale (and shift), so the error introduced by eliminating a particular coefficient depends not only on its value but also on its position (especially its scale). The weight of a coefficient should therefore be something like w(c) = amp(c) * F(scale, shift), where amp(c) is the amplitude of the coefficient and F is a function that depends on the data being compressed. Once you determine the weights like that, the problem reduces to the knapsack problem, which can be solved in many ways (for example, sort the coefficients and eliminate the smallest one until you reach the threshold error on a pixel affected by the corresponding function). The hard part is determining F(scale, shift). You can do it in the following way. If the data you are compressing is relatively stable (for example surveillance video), you could estimate F as the mean probability of getting an unacceptable error when eliminating the component with the given scale and shift from the wavelet decomposition. So you could perform an SVD (or PCA) decomposition on historical data and calculate F(scale, shift) as a weighted sum (with weights equal to the eigenvalues) of the scalar products of the component with the given scale and shift and the eigenvectors:
F(scale, shift) = sum_i eValue(i) * (w(scale, shift) * eVector(i)), where eValue(i) is the eigenvalue corresponding to the eigenvector eVector(i), and w(scale, shift) is the wavelet function with the given scale and shift.
Iteratively evaluating different sets of coefficients will not help your goal of being able to compress frames as quickly as they are generated, and will not help you to keep complexity low.
Depth maps are different from intensity maps in several ways that can help you.
Large areas of "no data" can be handled very efficiently by run-length encoding.
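A minimal run-length encoding sketch for a scanline with "no data" gaps. The `NO_DATA` marker value of 0 and the function names are assumptions made for the example.

```python
from itertools import groupby

def run_length_encode(pixels):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(pixels)]

def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in pairs for _ in range(length)]

NO_DATA = 0  # assumed marker for empty pixels
row = [NO_DATA] * 40 + [812, 812, 813] + [NO_DATA] * 25
pairs = run_length_encode(row)
print(pairs)  # [(0, 40), (812, 2), (813, 1), (0, 25)]
```

The 65 empty pixels collapse into two pairs, while the measured depths cost about as much as before, which is why this pairs well with a separate coder for the valid samples.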
Measurement error in intensity images is constant across the image after fixed-noise has been subtracted, but depth maps from both Kinects and stereo vision systems have errors that increase as an inverse function of depth. If these are the scanners you are targeting then you can use lossier compression for closer pixels - because the errors your lossy function introduces are independent of sensor error, the total error won't be increased until your lossy function's error is greater than the sensor error.
A team at Microsoft had a lot of success with a very low-loss algorithm that relied heavily on run-length encoding (see paper here), beating out JPEG 2000 with better compression and excellent performance; however, part of their success seemed to stem from the relatively crude depth maps their sensor produces. If you are targeting Kinects, you may find it hard to improve on their method.
I think you are looking for something like the JPEG-LS algorithm, which tries to limit the maximum per-pixel error. However, it is mainly designed for compression of natural or medical images and is not well suited to depth images (which are smoother).
The term "near-lossless compression" refers to a lossy algorithm for which each reconstructed image sample differs from the corresponding original image sample by not more than a pre-specified value, the (usually small) "loss." Lossless compression corresponds to loss=0. (link to the original reference)
I'd try preprocessing the image, then compressing with a general method, such as PNG.
Preprocessing for PNG (first read this)
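    # A is a 2D array of depth values (0-indexed). Snap each pixel to its
    # left or upper neighbour when the difference is below the threshold.
    (1...height).each do |y|
      (1...width).each do |x|
        if (A[y][x-1] - A[y][x]).abs < threshold
          A[y][x] = A[y][x-1]
        elsif (A[y-1][x] - A[y][x]).abs < threshold
          A[y][x] = A[y-1][x]
        end
      end
    end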

Why is average so popular when measuring application performance

When measuring application performance (response time, for example) it's so easy to come across averages (means). ab, httperf and a bunch of other utilities report the mean and standard deviation. But from a theoretical point of view it doesn't make a lot of sense to me. Here is why.
The mean is good at describing a symmetrically distributed population, because for a symmetric (unimodal) distribution the mean coincides with the median and the mode. But response times are not distributed symmetrically. They are more like exponential. In this case average tells us nothing.
It's more convenient to work with percentile values, which tell us what response time we could afford in what percentage of responses.
Am I missing something, or is the mean popular just because it's very simple to calculate?
All kinds of tools get their features not necessarily from what makes sense, but from users' expectations.
You're absolutely right that the distributions are non-negative and heavily skewed, and that percentiles would be more informative.
Alternatively, modelling the response times with a distribution more like lognormal or chi-square would be a little better.
Yes, you are missing something.
The whole point of descriptive statistics is to present a few numbers to describe (or represent or model or ...) a large number of numbers. They aid the comprehension of large datasets, the extraction of information from data, the approximate comparison of datasets whose exact comparison is large and bewildering to the limitations of the human mind.
But no single descriptive statistic is always fit for all purposes, and no one is dictating to you that you must or should or ought to use the mean. If it doesn't suit your purposes, use something else.
As it happens, you are quite wrong to write "They are more like exponential. In this case average tells us nothing." For an exponential distribution with rate parameter lambda the mean is simply 1/lambda, so the mean tells you everything about an exponential distribution.
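To make that concrete: if response times really were exponential (an assumption that real measurements may not satisfy), every percentile would follow from the mean through the quantile formula x_p = -ln(1 - p) / lambda. The mean response time of 0.2 s below is a made-up figure, and `exponential_percentile` is a name invented for the sketch.

```python
import math

def exponential_percentile(rate, p):
    """Value below which a fraction p of an Exponential(rate) population falls."""
    return -math.log(1.0 - p) / rate

mean_response = 0.2            # hypothetical mean response time in seconds
rate = 1.0 / mean_response     # lambda = 1 / mean

for p in (0.5, 0.9, 0.99):
    print(p, exponential_percentile(rate, p))
```

If measured percentiles differ noticeably from these derived ones, that is a sign the data is not exponential and the percentiles should be reported directly.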
I'm not an expert in statistics, but I believe average values are used so much because those are the values that help to measure the scalability of a system.
You need to consider your average values first to know how your system needs to behave under certain workloads, and those need to be predictable; you usually are not very interested in outliers, at least not at first.
Of course you need to look into your min values and the peak values to know the moment your system is going to hit a bottleneck, but the average values show you, as I said, correct and predictable behavior.

How does bootstrapping improve the quality of a phylogenetic reconstruction?

My understanding of bootstrapping is that you:
1. Build a "tree" using some algorithm from a matrix of sequences (nucleotides, let's say).
2. Store that tree.
3. Perturb the matrix from 1, and rebuild the tree.
My question is: what is the purpose of 3 from a sequence bioinformatics perspective? I can try to "guess" that, by changing characters in the original matrix, you can remove artifacts in the data. But I have a problem with that guess: I am not sure why removal of such artifacts is necessary. A sequence alignment is supposed to deal with artifacts by finding long lengths of similarity, by its very nature.
Bootstrapping, in phylogenetics as elsewhere, doesn't improve the quality of whatever you're trying to estimate (a tree in this case). What it does do is give you an idea of how confident you can be about the result you get from your original dataset. A bootstrap analysis answers the question "If I repeated this experiment many times, using a different sample each time (but of the same size), how often would I expect to get the same result?" This is usually broken down by edge ("How often would I expect to see this particular edge in the inferred tree?").
Sampling Error
More precisely, bootstrapping is a way of approximately measuring the expected level of sampling error in your estimate. Most evolutionary models have the property that, if your dataset had an infinite number of sites, you would be guaranteed to recover the correct tree and correct branch lengths*. But with a finite number of sites this guarantee disappears. What you infer in these circumstances can be considered to be the correct tree plus sampling error, where the sampling error tends to decrease as you increase the sample size (number of sites). What we want to know is how much sampling error we should expect for each edge, given that we have (say) 1000 sites.
What We Would Like To Do, But Can't
Suppose you used an alignment of 1000 sites to infer the original tree. If you somehow had the ability to sequence as many sites as you wanted for all your taxa, you could extract another 1000 sites from each and perform this tree inference again, in which case you would probably get a tree that was similar but slightly different to the original tree. You could do this again and again, using a fresh batch of 1000 sites each time; if you did this many times, you would produce a distribution of trees as a result. This is called the sampling distribution of the estimate. In general it will have highest density near the true tree. Also it becomes more concentrated around the true tree if you increase the sample size (number of sites).
What does this distribution tell us? It tells us how likely it is that any given sample of 1000 sites generated by this evolutionary process (tree + branch lengths + other parameters) will actually give us the true tree -- in other words, how confident we can be about our original analysis. As I mentioned above, this probability-of-getting-the-right-answer can be broken down by edge -- that's what "bootstrap probabilities" are.
What We Can Do Instead
We don't actually have the ability to magically generate as many alignment columns as we want, but we can "pretend" that we do, by simply regarding the original set of 1000 sites as a pool of sites from which we draw a fresh batch of 1000 sites with replacement for each replicate. This generally produces a distribution of results that is different from the true 1000-site sampling distribution, but for large site counts the approximation is good.
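The resampling step can be sketched as follows. The toy alignment and the function name `bootstrap_alignment` are made up for the example, and the tree inference that would be run on each replicate is not shown.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns (sites) with replacement, keeping the site count."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Apply the same column choice to every taxon so sites stay aligned.
    return ["".join(seq[c] for c in columns) for seq in alignment]

# Toy alignment: one string per taxon, all the same length.
alignment = ["ACGTACGTAC",
             "ACGTTCGTAC",
             "ACGAACGTTC"]

rng = random.Random(42)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate would then be passed to the same tree-building method as the original data, and the fraction of replicates in which a given edge appears is that edge's bootstrap support.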
* That is assuming that the dataset was in fact generated according to this model -- which is something that we cannot know for certain, unless we're doing a simulation. Also some models, like uncorrected parsimony, actually have the paradoxical quality that under some conditions, the more sites you have, the lower the probability of recovering the correct tree!
Bootstrapping is a general statistical technique that has applications outside of bioinformatics. It is a flexible means of coping with small samples, or samples from a complex population (which I imagine is the case in your application.)

Compression algorithms for a sequence of integers

Are there any good compression algorithms for a large sequence of integers (A/D converter data)? There is a similar question,
but the data is different in my case. It can be negative or positive and changes like wave data.
EDIT1:sample data added
Please refer to this file for a data sample
Generally, if you have some knowledge about the signal, use it to predict the next value based on the previous ones. Then compress the difference between the predicted and the real value.
If the prediction is good, the differences will be small and will compress well.
Anything more specific is unlikely to be possible without seeing the data and knowing about its physical nature.
update:
If the prediction is really good and uses all knowledge about dependencies, the differences are likely to be independent and something like arithmetic encoding would work for them.
You want a Delta Encode and then you want to apply a RLE or a Golomb Code. The Golomb Code can be as good as a Huffman Code.
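A minimal sketch of delta encoding followed by a Rice code (the Golomb code with a power-of-two parameter). Several things here are assumptions for illustration: the zigzag mapping used to handle negative deltas, the parameter `k`, the sample values, and the use of a string of "0"/"1" characters instead of packed bits.

```python
def zigzag(n):
    """Map signed integers to non-negative ones: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return 2 * n if n >= 0 else -2 * n - 1

def unzigzag(u):
    """Inverse of zigzag."""
    return u // 2 if u % 2 == 0 else -(u + 1) // 2

def rice_encode(samples, k):
    """Delta-encode, then write each delta as unary quotient + k-bit remainder."""
    bits, previous = [], 0
    for s in samples:
        u = zigzag(s - previous)
        previous = s
        q, r = u >> k, u & ((1 << k) - 1)
        bits.append("1" * q + "0" + (format(r, f"0{k}b") if k else ""))
    return "".join(bits)

def rice_decode(bits, count, k):
    """Read `count` Rice codes and undo the delta encoding."""
    samples, previous, pos = [], 0, 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the "0" that terminates the unary part
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        previous += unzigzag((q << k) | r)
        samples.append(previous)
    return samples

samples = [0, 3, 7, 12, 15, 14, 10, 4, -3, -9, -12, -11]  # wave-like toy data
encoded = rice_encode(samples, 2)
print(len(encoded), "bits for", len(samples), "samples")
```

Choosing `k` close to log_2 of the typical delta magnitude keeps the unary parts short; a poor choice still decodes correctly but wastes bits.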
Nearly any standard compression algorithm for byte strings can be applied; after all, any file of data can be interpreted as a sequence of signed integers. Is there something special about your particular integers that you think will make them amenable to some more-specific algorithm? You mention wave data; maybe take a look at FLAC which is designed for audio data; if your data has similar characteristics those techniques may be valuable.
You could diff the data then apply RLE on suitable subregions (i.e. between inflection points).
