Depth image compression up to a maximum permitted error

Articles on image compression often focus on generating the best possible image quality (PSNR) given a fixed compression ratio. I'm curious about getting the best possible compression ratio given a maximum permissible per-pixel error. My instinct is to greedily remove the smallest coefficients in the transformed data, keeping track of the error I've caused, until I can't remove any more without exceeding the maximum error. But I can't find any papers confirming that this is the right approach. Can anyone point me to a reference on this problem?
Edit:
Let me give some more details. I'm trying to compress depth images from 3D scanners, not regular images. Color is not a factor. Depth images tend to have large smooth patches, but accurate discontinuities are important. Some pixels will be empty - outside the scanner's range or below its confidence threshold - and will not require compression.
The algorithm will need to run fast - ideally at 30 fps like the Microsoft Kinect, or at least somewhere in the 100 millisecond range per frame. The algorithm will be included in a library I distribute. I prefer to minimize dependencies, so compression schemes that I can implement myself in a reasonably small amount of code are preferable.

This answer won't satisfy your request for references, but it's too long to post as a comment.
First, depth buffer compression for computer generated imagery may apply to your case. Usually this compression is done at the hardware level with a transparent interface, so it's typically designed to be simple and fast. Given this, it may be worth your while to search for depth buffer compression.
One of the major issues you're going to have with transform-based compressors (DCTs, wavelets, etc.) is that there's no easy way to find compact coefficients that meet your hard maximum-error criterion. (The problem you end up with looks a lot like linear programming. Wavelets can have localized behavior in most of their basis vectors, which helps somewhat, but it's still rather inconvenient.) To achieve the accuracy you desire you may need to add another refinement step, but this also adds computation time and complexity, and introduces another layer of imperfect entropy coding, costing compression efficiency.
What you want is more akin to lossless compression than lossy compression. In this light, one approach would be to simply throw away the bits under your error threshold: if your maximum allowable error is X and your depths are represented as integers, integer-divide your depths by X and then apply lossless compression.
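To make that concrete, here is a minimal sketch of the quantize-then-lossless idea in Python, assuming integer depths in a numpy array and using zlib as a stand-in for whatever lossless coder you prefer (the function names are mine). It uses a step of 2*X + 1 with rounding, which keeps the per-pixel error within X while producing roughly half as many distinct symbols as a plain division by X:

    import zlib
    import numpy as np

    def compress_depth(depth, max_err):
        # Quantize so |original - reconstructed| <= max_err, then compress losslessly.
        step = 2 * max_err + 1
        q = ((depth.astype(np.int64) + max_err) // step).astype(np.uint16)
        return zlib.compress(q.tobytes()), depth.shape

    def decompress_depth(blob, shape, max_err):
        step = 2 * max_err + 1
        q = np.frombuffer(zlib.decompress(blob), dtype=np.uint16).reshape(shape)
        return q.astype(np.int64) * step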
Another issue you're facing is the representation of depth -- depending on your circumstances it may be a floating point number, an integer, it may be in a projective coordinate system, or even more bizarre.
Given these restrictions, I recommend a scheme like BTPC (Binary Tree Predictive Coding), as it gives a more easily adapted, wavelet-like scheme in which errors are more clearly localized and easier to understand and account for. Additionally, BTPC has shown great resilience across many types of images and a good ability to handle continuous gradients and sharp edges with low loss of fidelity -- exactly the sort of traits you're looking for.
Since BTPC is predictive, it doesn't matter particularly how your depth format is stored -- you just need to modify your predictor to take your coordinate system and numeric type (integer vs. floating) into account.
Since BTPC doesn't do terribly much math, it can run pretty fast on general CPUs, too, although it may not be as easy to vectorize as you'd like. (It sounds like you're possibly doing low level optimized game programming, so this may be a serious consideration for you.)
If you're looking for something simpler to implement I'd recommend a "filter" type of approach (similar to PNG) with a Golomb-Rice coder strapped on. Rather than coding the deltas perfectly to end up with lossless compression, you can code to a "good enough" degree. The advantage of doing this as compared to a quantize-then-lossless-encode style compressor is that you can potentially maintain more continuity.
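Here is a rough single-pass sketch of that filter-plus-Golomb-Rice idea, with a simple left/top predictor and the prediction residuals quantized to "good enough" before coding. The function names and the fixed Rice parameter k are my own choices; a real coder would use a smarter predictor (like PNG's Paeth or JPEG-LS's gradient predictor) and adapt k as it goes:

    import numpy as np

    def rice_encode(values, k):
        # Golomb-Rice code: unary quotient, then k raw remainder bits.
        out = []
        for v in values:
            m = (v << 1) if v >= 0 else -(v << 1) - 1   # interleave signs
            q, r = m >> k, m & ((1 << k) - 1)
            code = "1" * q + "0"
            if k:
                code += format(r, "0{}b".format(k))
            out.append(code)
        return "".join(out)

    def encode_near_lossless(depth, max_err, k=4):
        # Predict from already-reconstructed neighbours so the decoder stays in
        # sync and the per-pixel error never exceeds max_err.
        step = 2 * max_err + 1
        h, w = depth.shape
        recon = np.zeros((h, w), dtype=np.int64)
        qres = []
        for y in range(h):
            for x in range(w):
                pred = recon[y, x - 1] if x > 0 else (recon[y - 1, x] if y > 0 else 0)
                r = int(depth[y, x]) - int(pred)
                q = (r + max_err) // step if r >= 0 else -((-r + max_err) // step)
                recon[y, x] = pred + q * step
                qres.append(q)
        return rice_encode(qres, k)

Quantizing the residual rather than the raw value is one way to preserve the continuity mentioned above.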

"greedily remove the smallest coefficients" reminds me of SVD compression, where you use the data associated with the first k largest eigenvalues to approximate the data. The rest of the eigenvalues that are small don't hold significant information and can be discarded.
Large k -> high quality, low compression
Small k -> lower quality, high compression
(disclaimer: I have no idea what I'm talking about here, but it might help)
Edit: here is a better illustration of SVD compression.
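For what it's worth, a rank-k (truncated SVD) approximation is only a few lines with numpy; the helper names below are hypothetical. Note that you store k*(rows + cols + 1) numbers instead of rows*cols, and that the truncation controls an overall (Frobenius-norm) error rather than a per-pixel maximum, which is what the question actually asks for:

    import numpy as np

    def svd_compress(img, k):
        # Keep only the k largest singular values and their vectors.
        U, s, Vt = np.linalg.svd(img.astype(float), full_matrices=False)
        return U[:, :k], s[:k], Vt[:k, :]

    def svd_reconstruct(U, s, Vt):
        return (U * s) @ Vt        # equivalent to U @ diag(s) @ Vt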

I am not aware of any references in case of the problem you have proposed.
However, one direction I can think of is using optimization techniques to select the best coefficients. Techniques like genetic algorithms, hill climbing, and simulated annealing can be used in this regard.
Given that I have experience with genetic algorithms, I can suggest the following process. If you are not familiar with genetic algorithms, I recommend reading the Wikipedia page on them.
Your problem can be thought of as selecting the subset of coefficients that gives the minimum reconstruction error. Say there are N coefficients; it is easy to see that there are 2^N subsets. Each subset can be represented by a string of N binary digits. For example, for N=5,
the string 11101 represents the subset containing all coefficients except coefficient 4. With genetic algorithms it is possible to find an optimal bit string. The objective function can be chosen as the absolute error between the reconstructed and the original signals. Note, however, that you get an error of zero when all the coefficients are kept.
To get around this problem, you may choose to modulate the objective function with an appropriate function that discourages objective values near zero and is monotonically increasing after a threshold. A function like |log(epsilon + f)| may suffice.
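A minimal sketch of how the fitness evaluation could look, under my own reading of the above (the bit string selects coefficients, the inverse transform is whatever transform you used, and I have swapped the log-modulated objective for a hard feasibility check against the questioner's per-pixel bound; all names are hypothetical):

    import numpy as np

    def fitness(bits, coeffs, inverse_transform, original, max_err):
        # Lower is better: infeasible subsets are rejected outright,
        # feasible ones are scored by how many coefficients they keep.
        kept = np.where(bits == 1, coeffs, 0.0)
        err = np.abs(inverse_transform(kept) - original).max()
        if err > max_err:
            return np.inf
        return bits.sum()

    # population = [np.random.randint(0, 2, size=N) for _ in range(100)]
    # ...then apply standard selection, crossover and mutation to the bit strings.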
If what I propose seems interesting to you, do let me know. I have an implementation of a genetic algorithm with me, but it is tailored to my needs and you might not be in a position to adapt it for this problem. I am willing to work with you on this, as the problem seems interesting to explore.
Do let me know.

I think you are pretty close to a solution, but there is an issue I think you should pay some attention to.
Different wavelet coefficients correspond to basis functions with different scales (and shifts), so the error introduced by eliminating a particular coefficient depends not only on its value but also on its position (especially its scale). The weight of a coefficient should therefore be something like w(c) = amp(c) * F(scale, shift), where amp(c) is the amplitude of the coefficient and F is a function that depends on the data being compressed. Once you determine weights like that, the problem reduces to the knapsack problem, which can be solved in many ways (for example, reorder the coefficients and eliminate the smallest ones until you hit the threshold error on a pixel affected by the corresponding basis function).
The hard part is determining F(scale, shift). You can do it in the following way: if the data you are compressing is relatively stable (for example surveillance video), you could estimate F as the average probability of producing an unacceptable error when the component with a given scale and shift is eliminated from the wavelet decomposition. You could perform an SVD (or PCA) decomposition on historical data and calculate F(scale, shift) as a sum, weighted by the eigenvalues, of the scalar products of the wavelet basis function at that scale and shift with the eigenvectors:
F(scale, shift) = sum over i of eValue(i) * (w(scale, shift) . eVector(i))
where eValue(i) is the eigenvalue corresponding to eigenvector eVector(i), and w(scale, shift) is the wavelet basis function with the given scale and shift.
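A greedy version of that reduction, sketched under the assumption that the weights (amplitudes times F) have already been computed as described; recomputing the full inverse transform for every candidate is wasteful, and in practice you would only check the pixels covered by the basis function being dropped, which is exactly where the localization argument pays off:

    import numpy as np

    def greedy_prune(coeffs, weights, inverse_transform, original, max_err):
        # Try to zero coefficients in order of increasing weight, keeping the
        # per-pixel error within max_err; undo any elimination that breaks it.
        pruned = coeffs.copy()
        for idx in np.argsort(weights.ravel()):
            saved = pruned.flat[idx]
            if saved == 0:
                continue
            pruned.flat[idx] = 0
            if np.abs(inverse_transform(pruned) - original).max() > max_err:
                pruned.flat[idx] = saved
        return pruned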

Iteratively evaluating different sets of coefficients will not help your goal of being able to compress frames as quickly as they are generated, and will not help you to keep complexity low.
Depth maps are different from intensity maps in several ways that can help you.
Large areas of "no data" can be handled very efficiently by run-length encoding.
Measurement error in intensity images is roughly constant across the image once fixed-pattern noise has been subtracted, but depth maps from both Kinects and stereo vision systems have errors that grow with distance (roughly quadratically in depth for these triangulation-based sensors). If these are the scanners you are targeting, then you can use lossier compression for more distant pixels: because the error your lossy function introduces is independent of the sensor error, the total error won't increase appreciably until your lossy function's error exceeds the sensor error.
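One way to exploit this is to make the permitted error a function of measured depth and feed that per-pixel budget into whichever quantizer you use. A toy sketch, with placeholder constants that you would fit to your sensor's actual noise model:

    def allowed_error(depth_m, base_err_m=0.001, growth=0.003):
        # Error budget that grows roughly quadratically with distance,
        # mimicking triangulation-sensor noise (constants are made up).
        return base_err_m + growth * depth_m ** 2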
A team at Microsoft had a lot of success with a very low-loss algorithm that relied heavily on run-length encoding (see paper here), beating JPEG 2000 with better compression and excellent performance; however, part of their success seemed to stem from the relatively crude depth maps their sensor produces. If you are targeting Kinects, you may find it hard to improve on their method.

I think you are looking for something like the JPEG-LS algorithm, which tries to limit the maximum per-pixel error. It is, however, mainly designed for compression of natural or medical images, and is not as well suited to depth images (which are smoother).
The term "near-lossless compression" refers to a lossy algorithm for which each reconstructed image sample differs from the corresponding original image sample by not more than a pre-specified value, the (usually small) "loss." Lossless compression corresponds to loss=0.link to the original reference

I'd try preprocessing the image, then compressing with a general method, such as PNG.
Preprocessing for PNG (first read this)
    # Snap each pixel to its left or top neighbour when the difference is within
    # the threshold; the long runs this creates are what PNG's filters + DEFLATE like.
    for y in range(height):
        for x in range(width):
            if x > 0 and abs(A[y][x - 1] - A[y][x]) < threshold:
                A[y][x] = A[y][x - 1]
            elif y > 0 and abs(A[y - 1][x] - A[y][x]) < threshold:
                A[y][x] = A[y - 1][x]

Related

Feature Scaling required or not

I am working with a sample data set to learn clustering. This data set contains the number of occurrences of various keywords.
Since all the features are occurrence counts of different keywords, is it OK not to scale the values and use them as they are?
I read a couple of articles on the internet where it is emphasized that scaling is important because it adjusts the relative weight of the frequencies. Since most of the frequencies are 0 (95%+), z-score scaling will change the shape of the distribution, which I feel could be a problem since I am changing the nature of the data.
I am thinking of not changing the values at all to avoid this. Will that affect the quality of the results I get from the clustering?
As it was already noted, the answer heavily depends on an algorithm being used.
If you're using distance-based algorithms with the (usually default) Euclidean distance (for example, k-means or k-NN), they will rely more on features with a bigger range simply because a "typical difference" in values of that feature is bigger.
Non-distance-based models can be affected, too. One might think that linear models do not fall into this category, since scaling (and translating, if needed) is a linear transformation, so if it makes results better the model should be able to learn it, right? It turns out the answer is no. The reason is that nobody uses vanilla linear models; they're always used with some sort of regularization that penalizes overly large weights. This can prevent your linear model from learning the scaling from the data.
There are models that are independent of the feature scale. For example, tree-based algorithms (decision trees and random forests) are not affected. A node of a tree partitions your data into 2 sets by comparing a feature (the one which splits the dataset best) to a threshold value. There's no regularization for the threshold (though one should keep the height of the tree small), so it's not affected by different scales.
That being said, it's usually advised to standardize (subtract mean and divide by standard deviation) your data.
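A minimal standardization sketch, with one caveat aimed at this data set: z-scoring turns your many zeros into non-zero values, so if preserving sparsity matters you may prefer scaling each feature by its maximum (or standard deviation) without centering:

    import numpy as np

    def standardize(X):
        # Z-score each column; constant columns are left unscaled.
        mean, std = X.mean(axis=0), X.std(axis=0)
        std[std == 0] = 1.0
        return (X - mean) / std

    def scale_max_abs(X):
        # Alternative that keeps zeros at zero (useful for sparse counts).
        m = np.abs(X).max(axis=0).astype(float)
        m[m == 0] = 1.0
        return X / m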
It probably depends on the classification algorithm. I'm only familiar with SVMs. Please see Ch. 2.2 for an explanation of scaling.
The type of feature (count of words) doesn't matter. The feature ranges should be more or less similar. If the count of e.g. "dignity" is 10 and the count of "have" is 100000000 in your texts, then (at least with an SVM) the results from such features will be less accurate than if you scaled both counts to a similar range.
The cases where no scaling is needed are those where the data is implicitly scaled already, e.g. when the features are pixel values in an image: the data is already scaled to the range 0-255.
*Distance-based algorithms need scaling.
*There is no need for scaling in tree-based algorithms.
But it is good to scale your data and train the model; if possible, compare the model accuracy and other evaluation metrics before and after scaling, and use whichever works best.
This is as per my knowledge.

Floating point algorithms with potential for performance optimization

For a university lecture I am looking for floating point algorithms with known asymptotic runtime, but potential for low-level (micro-)optimization. This means optimizations such as minimizing cache misses and register spillages, maximizing instruction level parallelism and taking advantage of SIMD (vector) instructions on new CPUs. The optimizations are going to be CPU-specific and will make use of applicable instruction set extensions.
The classic textbook example for this is matrix multiplication, where great speedups can be achieved by simply reordering the sequence of memory accesses (among other tricks). Another example is FFT. Unfortunately, I am not allowed to choose either of these.
Anyone have any ideas, or an algorithm/method that could use a boost?
I am only interested in algorithms where a per-thread speedup is conceivable. Parallelizing problems by multi-threading them is fine, but not the scope of this lecture.
Edit 1: I am taking the course, not teaching it. In the past years, there were quite a few projects that succeeded in surpassing the current best implementations in terms of performance.
Edit 2: This paper lists (from page 11 onwards) seven classes of important numerical methods and some associated algorithms that use them. At least some of the mentioned algorithms are candidates, it is however difficult to see which.
Edit 3: Thank you everyone for your great suggestions! We proposed to implement the exposure fusion algorithm (paper from 2007) and our proposal was accepted. The algorithm creates HDR-like images and consists mainly of small kernel convolutions followed by weighted multiresolution blending (on the Laplacian pyramid) of the source images. Interesting for us is the fact that the algorithm is already implemented in the widely used Enfuse tool, which is now at version 4.1. So we will be able to validate and compare our results with the original and also potentially contribute to the development of the tool itself. I will update this post in the future with the results if I can.
The simplest possible example: accumulation of a sum. Unrolling with multiple accumulators plus vectorization allows a speedup of (ADD latency) * (SIMD vector width) on typical pipelined architectures (if the data is in cache; because there's no data reuse, it typically won't help if you're reading from memory), which can easily be an order of magnitude. A cute thing to note: this also decreases the average error of the result! The same techniques apply to any similar reduction operation.
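The dependency-breaking pattern, transcribed into Python purely to show its shape (the actual speedup requires a compiled language, where each accumulator maps onto a SIMD lane or an independent register; the function name is mine):

    def sum_four_accumulators(xs):
        # Four independent partial sums break the serial ADD dependency chain.
        a0 = a1 = a2 = a3 = 0.0
        n4 = len(xs) - len(xs) % 4
        for i in range(0, n4, 4):
            a0 += xs[i]
            a1 += xs[i + 1]
            a2 += xs[i + 2]
            a3 += xs[i + 3]
        return (a0 + a1) + (a2 + a3) + sum(xs[n4:])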
A few classics from image/signal processing:
convolution with small kernels (especially small 2D convolutions like a 3x3 or 5x5 kernel; a naive reference loop is sketched after this list). In some sense this is cheating, because convolution is matrix multiplication and is intimately related to the FFT, but in reality the nitty-gritty algorithmic techniques of high-performance small-kernel convolutions are quite different from either.
erode and dilate.
what imaging people call a "gamma correction"; this is really evaluation of an exponential function (maybe with a piecewise linear segment near zero). Here you can take advantage of the fact that image data is often entirely in a nice bounded range like [0,1], and that sub-ulp accuracy is rarely needed, to use much cheaper function approximations (low-order piecewise minimax polynomials are common).
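A naive reference loop for the small-kernel convolution case, as a baseline to optimize against (zero padding; strictly speaking this computes cross-correlation, so flip the kernel for a true convolution):

    import numpy as np

    def convolve3x3(img, kernel):
        # The optimization targets are the memory access pattern, register
        # blocking of the 3x3 window, and SIMD across x.
        h, w = img.shape
        padded = np.pad(img.astype(float), 1)
        out = np.zeros((h, w))
        for y in range(h):
            for x in range(w):
                out[y, x] = (padded[y:y + 3, x:x + 3] * kernel).sum()
        return out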
Stephen Canon's image processing examples would each make for instructive projects. Taking a different tack, though, you might look at certain amenable geometry problems:
Closest pair of points in moderately high dimension---say 50000 or so points in 16 or so dimensions. This may have too much in common with matrix multiplication for your purposes. (Take the dimension too much higher and dimensionality reduction silliness starts mattering; much lower and spatial data structures dominate. Brute force, or something simple using a brute-force kernel, is what I would want to use for this.)
Variation: For each point, find the closest neighbour.
Variation: Red points and blue points; find the closest red point to each blue point.
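A brute-force baseline for the red/blue variation, as a sketch of the kernel you would then block for cache and vectorize (names are mine; the all-pairs distance computation is the part that behaves like a small matrix multiplication):

    import numpy as np

    def nearest_red_for_each_blue(red, blue):
        # red: (m, d), blue: (n, d).  Returns, for each blue point, the index
        # of its closest red point, via an O(n*m*d) all-pairs distance table.
        d2 = ((blue[:, None, :] - red[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)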
Welzl's smallest containing circle algorithm is fairly straightforward to implement, and the really costly step (check for points outside the current circle) is amenable to vectorisation. (I suspect you can kill it in two dimensions with just a little effort.)
Be warned that computational geometry stuff is usually more annoying to implement than it looks at first; don't just grab a random paper without understanding what degenerate cases exist and how careful your programming needs to be.
Have a look at other linear algebra problems, too. They're also hugely important. Dense Cholesky factorisation is a natural thing to look at here (much more so than LU factorisation) since you don't need to mess around with pivoting to make it work.
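For reference, an unblocked Cholesky factorization is short enough to sketch; the column update inside the loop is the hot spot that a tuned version would block for cache and vectorize (this assumes a symmetric positive-definite input and does no error checking):

    import numpy as np

    def cholesky_lower(A):
        # Returns lower-triangular L with A = L @ L.T (Cholesky-Banachiewicz).
        n = A.shape[0]
        L = np.zeros((n, n))
        for j in range(n):
            L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
            L[j + 1:, j] = (A[j + 1:, j] - L[j + 1:, :j] @ L[j, :j]) / L[j, j]
        return L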
There is a free benchmark called c-ray.
It is a small ray-tracer for spheres designed to be a benchmark for floating-point performance.
A few random stackshots show that it spends nearly all its time in a function called ray_sphere that determines if a ray intersects a sphere and if so, where.
They also show some opportunities for larger speedup, such as:
It does a linear search through all the spheres in the scene to try to find the nearest intersection. That represents a possible area for speedup, by doing a quick test to see if a sphere is farther away than the best seen so far, before doing all the 3-d geometry math.
It does not try to exploit similarity from one pixel to the next. This could gain a huge speedup.
So if all you want to look at is chip-level performance, it could be a decent example.
However, it also shows how there can be much bigger opportunities.

Edge detection : Any performance evaluation technique?

I am working on edge detection in images and would like to evaluate the performance of the algorithm. If anyone could give me a reference or a method on how to proceed, it would be really helpful. :)
I do not have ground truth, and the data set includes color as well as grayscale images.
Thank you.
Create a synthetic data set with known edges, for example by 3D rendering, by compositing 2D images with precise masks (as may be obtained in royalty free photosets), or by introducing edges directly (thin/faint lines). Remember to add some confounding non-edges that look like edges, of a type appropriate for what you're tuning for.
Use your (non-synthetic) data set. Run the reference algorithms that you want to compare against. Also produce combinations of the reference algorithms, for example by voting (majority, at least K out of N, etc.). Calculate stats on your algo vs. the reference algos, in terms of (a) the number of points your algo classifies as edge that each reference algo, or the combination, does not classify as edge (false positives), and (b) the number of points the reference algos classify as edge that your algo does not (false negatives). You can also calculate a rank-correlation-type number across algos by looking at each point and seeing which algos do (or don't) classify it as an edge.
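A minimal sketch of those comparison stats for two binary edge maps (names are mine). Note that strict pixel-wise matching punishes edges that are off by a pixel, so in practice you may want to dilate the reference map by a pixel or two before counting, as common edge benchmarks do:

    import numpy as np

    def compare_edge_maps(candidate, reference):
        # Boolean arrays, True = edge.  Returns false positives, false
        # negatives, and an F1 score for quick ranking.
        tp = np.logical_and(candidate, reference).sum()
        fp = np.logical_and(candidate, ~reference).sum()
        fn = np.logical_and(~candidate, reference).sum()
        precision = tp / max(tp + fp, 1)
        recall = tp / max(tp + fn, 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        return fp, fn, f1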
Create ground truth manually. Use reference edge-finding algos as a starting point, then fix up by hand. Probably valuable to do for a small number of images in any case.
Good luck!
For comparisons, quantitative measures like the ones @Alex I explained are best. To do so, you need to define what is "correct" with a ground truth set and a way to consistently determine whether a given image is correct or, on a more granular level, how correct it is (some number like a percentage). @Alex I gave a way to do that.
Another option that is often used in graphics research where there is no ground truth is user studies. Usually less desirable as they are time consuming and often more costly. However, if it is a qualitative improvement that you are after or if a quantitative measurement is just too hard to do, a user study is an appropriate solution.
By a user study I mean polling people on how good a result is given the input image. You could give them a scale to rate things on and randomly give them samples from both your results and the results of another algorithm.
And of course, if you still want more ideas, be sure to check out edge detection papers to see how they measured their results (I'd actually look here first as they've already gone through this same process and determined what was best for them: google scholar).

Bimodal distribution characterization algorithm?

What algorithms can be used to characterize an expected clearly bimodal distribution, say a mixture of 2 normal distributions with well separated peaks, in an array of samples? Something that spits out 2 means, 2 standard deviations, and some sort of robustness estimate, would be the desired result.
I am interested in an algorithm that can be implemented in any programming language (for an embedded controller), not an existing C or Python library or stat package.
Would it be easier if I knew that the two modal means differ by a ratio of approximately 3:1 +- 50%, the standard deviations are "small" relative to the peak separation, but the pair of peaks could be anywhere in a 100:1 range?
There are two separate possibilities here. One is that you have a single distribution that is bimodal. The other is that you are observing data from two different distributions. The usual way to estimate the latter is with something called, unsurprisingly, a mixture model.
Your options for estimation are a maximum likelihood approach, or Markov chain Monte Carlo methods if you want to take a Bayesian view of the problem. If you state your assumptions in a bit more detail I'd be willing to help figure out what objective function you'd want to maximize.
These types of models can be computationally intensive, so I am not sure you'd want to attempt the full statistical approach on an embedded controller. A hack might be a better fit. If the peaks are in fact well separated, I think it would be easier to identify the two peaks, split your data between them, and estimate the mean and standard deviation of each distribution independently.
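A rough sketch of that hack, assuming the two modes really are well separated (it picks the two tallest histogram bins as the peaks and splits at the emptiest bin between them; the returned fraction of the smaller cluster is only a crude stand-in for a robustness estimate):

    import numpy as np

    def split_bimodal(samples, bins=64):
        # samples: 1-D numpy array of measurements.
        hist, edges = np.histogram(samples, bins=bins)
        lo, hi = sorted(np.argsort(hist)[-2:])        # bins of the two tallest peaks
        cut = edges[lo + np.argmin(hist[lo:hi + 1])]  # valley between them
        a, b = samples[samples < cut], samples[samples >= cut]
        return (a.mean(), a.std()), (b.mean(), b.std()), min(len(a), len(b)) / len(samples)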

FFT Algorithm: What goes IN/OUT? (re: real-time pitch detection)

I am attempting to extract pitch data from an audio stream. From what I can see, it looks as though FFT is the best algorithm to use.
Rather than digging straight into the math, could someone help me understand what this FFT algorithm does?
Please don't say something obvious like 'FFT extracts frequency data from a raw signal.' I need the next level of detail.
What do I pass in, and what do I get out?
Once I understand the interface clearly, this will help me to understand the implementation.
I take it I need to pass in an audio buffer and tell it how many bytes to use for each computation (say, the most recent 1024 bytes from this buffer), and maybe I need to specify the range of pitches I want it to detect. Now what is it going to pass back? An array of frequency bins? What are these?
(Edit:) I have found a C++ algorithm to use (if I can only understand it)
Performous extracts pitch from the microphone. Also the code is open source. Here is a description of what the algorithm does, from the guy that coded it.
PCM input (with buffering)
FFT (1024 samples at a time, remove 200 samples from front of the buffer afterwards)
Reassignment method (against the previous FFT that was 200 samples earlier)
Filtering of peaks (this part could be done much better or even left out)
Combining peaks into sets of harmonics (we call the combination a tone)
Temporal filtering of tones (update the set of tones detected earlier instead of simply using the newly detected ones)
Pick the best vocal tone (frequency limits, weighting, could use the harmonic array also but I don't think we do)
But could someone help me understand how this works? What is it that is getting sent from the FFT to the Reassignment method?
The FFT is just one building block in the process, and it may not be the best approach for pitch detection. Read up on pitch detection and decide which algorithm you want to use first (this will depend on what exactly you are trying to measure the pitch of - speech, a single musical instrument, other types of sound, etc.). Get this right before getting into low-level details such as the FFT (some, but not all, pitch detection algorithms use the FFT internally).
There are numerous similar questions on SO already, e.g. Real-time pitch detection using FFT and Pitch detection using FFT for trumpet, and there is good overview material on Wikipedia etc - read these and then decide whether you still want to roll your own FFT-based solution or perhaps use an existing library which is suitable for your particular application.
There is an element of choice here. The most straightforward variant to implement takes 2^n complex numbers in and produces 2^n complex numbers out, so maybe you should start with that.
In the special case of a DCT (discrete cosine transform), typically what goes in is 2^n samples (often floats), and out come 2^n values, often floats too. The DCT is like an FFT that takes only real values and analyses the function in terms of cosines.
It is smart (but commonly skipped) to define a struct to handle the complex values. Traditionally FFTs are done in place, but it works fine if you don't.
It can be useful to instantiate a class that contains a work buffer for the FFT (if you don't want to do the FFT in-place), and reuse that for several FFTs.
In goes N samples of PCM (purely real complex numbers). Out come N bins of the frequency domain (each bin corresponding to a 1/N slice of the sample rate). Each bin is a complex number. Rather than real and imaginary parts, these values are generally best handled in polar form (absolute value and argument). The absolute value tells you the amount of sound near the bin's center frequency, while the argument tells you the phase (where in its cycle the sine wave is).
Most often coders only use the magnitude (absolute value) and throw away the phase angle (argument).
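A minimal numpy illustration of that in/out contract (the sample rate and the random buffer are placeholders):

    import numpy as np

    fs = 44100                                  # sample rate in Hz (assumed)
    samples = np.random.randn(1024)             # stand-in for 1024 PCM samples

    spectrum = np.fft.rfft(samples)             # real input -> N/2 + 1 complex bins
    magnitude = np.abs(spectrum)                # strength of the signal near each bin
    phase = np.angle(spectrum)                  # phase, which the reassignment method uses
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs)   # bin k sits near k*fs/N Hz

    peak = freqs[magnitude[1:].argmax() + 1]    # skip the DC bin
    print("strongest component near %.1f Hz" % peak)

Note that with 1024 samples at 44.1 kHz each bin is about 43 Hz wide, which is far too coarse for pitch on its own; that is why pipelines like the one above follow the FFT with phase-based refinement (the reassignment method) rather than just picking the biggest bin.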
