Are there any machine learning algorithms or prediction models that can help me compress exponentially distributed data? I have already encoded the file using Golomb codes, which definitely saves a lot of space, but it is not enough; I need further compression. PAQ8L does not compress it enough.
Please ask for the file if needed.
Exponentially distributed, for example:
{a,b,b,a,a,b,c,c,a,a,b,a,a,b,a,c,b,a,b,d}
I don't think it's theoretically possible. The Golomb code is already optimal for geometrically distributed data.
As mentioned in other posts, PAQ* algorithms use context mixing. That only helps if you know more about the data than just "exponentially distributed".
I think the Golomb code is still optimal if only the exponential distribution is known about the data.
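For reference, here is a minimal sketch of the Rice code, the special case of the Golomb code whose parameter is a power of two (M = 2^k). The function names are made up for illustration, and the output is a string of '0'/'1' characters rather than packed bytes, which keeps the idea visible but is not how a real encoder would store it.

```python
def rice_encode(values, k):
    """Encode non-negative integers as a '0'/'1' string with Rice parameter k."""
    bits = []
    for n in values:
        q, r = n >> k, n & ((1 << k) - 1)
        bits.append("1" * q + "0")                    # quotient in unary
        if k:
            bits.append(format(r, "0{}b".format(k)))  # remainder in k bits
    return "".join(bits)


def rice_decode(bits, k, count):
    """Decode `count` integers from a bit string produced by rice_encode."""
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":                       # read the unary quotient
            q += 1
            pos += 1
        pos += 1                                      # skip the terminating '0'
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append((q << k) | r)
    return values


# The sample from the question, assuming the mapping a=0, b=1, c=2, d=3.
sample = [0, 1, 1, 0, 0, 1, 2, 2, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 1, 3]
encoded = rice_encode(sample, 0)
```

With k = 0 the 20 sample symbols take 36 bits, against 40 bits for a fixed 2-bit code. The gain is small because the sample is short and only mildly skewed.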
I've seen questions on compression algorithms around SE, but none quite fit what I'm looking for. Clearly truly uniformly distributed data cannot be compressed, but how close can we get?
My (probably incorrect) thoughts: I would imagine that by transforming the data (normalizing in some way?), you could accentuate the non-uniformity aspects of nearly uniform data and then use that transformed set to compress, perhaps along with the inverse transform or its parameters. But maybe I'm totally wrong and they all perform equally terribly as the data approaches uniformity?
When I look at lists of (lossless) compression algorithms, I don't see them ranked by how effective they are against certain types of data, at least not in any concrete terms. Does anyone know of a source that dives into this?
As background, I have an application where the data set is not independent, but nevertheless appears to be nearly uniform (most of the symbols have very low frequencies, and none of them have very high frequencies). So I was wondering if there are algorithms that can exploit the sampling dependence even if the data frequencies are mostly low. Then of course it would be more helpful to have a source that detailed exactly why some compression algorithms might perform better at this than others, if such a thing existed.
The short answer is no. Such a thing both does not and cannot exist.
The long answer involves information theory.
What matters to a compression algorithm is not how hard it is to say the thing you are specifying. It is how many equally likely things you could have said instead, but didn't. That is, if you have M things you might have said that were equally likely, you must send a signal long enough that it specifies which of the M you said. And that requires log_2(M) bits to make it clear which one you actually said.
In the case of a stream of independent symbols, each with a known probability, we can figure out how many messages could be sent with equal likelihood, and thereby put a lower bound on how efficiently a message can be compressed. That lower bound is the entropy, measured in bits per symbol sent. Huffman coding comes within one bit per symbol of this bound (and meets it exactly when every probability is a power of 1/2); arithmetic coding can get arbitrarily close.
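The entropy bound is easy to estimate from a sample. The following is a small sketch, assuming the symbols are independent and that the observed frequencies stand in for the true probabilities; the function name is only illustrative.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Empirical zeroth-order entropy of a sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Four equally frequent symbols need the full 2 bits each: nothing to gain.
uniform = entropy_bits_per_symbol("abcd" * 10)
# A skewed source needs fewer bits per symbol on average.
skewed = entropy_bits_per_symbol("aaab" * 10)
```

The closer the symbol frequencies are to uniform, the closer the result is to log_2 of the alphabet size, and the less any symbol-by-symbol code can save.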
In order to do better than Huffman coding, we must find some additional structure to our messages. For example language often has correlations where "h" is likely to follow "t". Or in images, the color of a pixel tends to be similar to the color of a nearby pixel. Any such structure reduces the number of equally likely messages we could have sent, and opens up the possibility of a better compression algorithm.
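One rough way to check whether such structure exists is to compare the plain symbol entropy with the entropy of each symbol given the one before it. This is only a sketch with illustrative function names, and on short samples the conditional estimate is biased low, so treat a small gap with suspicion.

```python
import math
from collections import Counter


def order0_entropy(seq):
    """Entropy of the symbol frequencies alone, in bits per symbol."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())


def order1_entropy(seq):
    """Empirical H(X_i | X_{i-1}): bits per symbol once the previous symbol is known."""
    pairs = Counter(zip(seq, seq[1:]))   # counts of (previous, current)
    prev = Counter(seq[:-1])             # counts of each symbol as a predecessor
    n = len(seq) - 1
    return -sum(c / n * math.log2(c / prev[a]) for (a, _), c in pairs.items())


# Perfectly uniform frequencies, but each symbol fully determines the next one.
seq = "abcd" * 50
h0 = order0_entropy(seq)
h1 = order1_entropy(seq)
```

Here the frequencies are exactly uniform (2 bits per symbol), yet the conditional entropy is zero, so a coder that models the dependence could compress the sequence almost completely.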
However, you've not described such a structure. So Huffman coding (or arithmetic coding) is about the best you can do. And if the symbol probabilities are close to each other, it won't give you very much.
Sorry.
I am looking for an algorithm that can search for similar images in a large collection.
I'm currently using a SURF implementation in OpenCL.
At first I used the KNN search algorithm to compare every image's interest points to the rest of the collection, but tests revealed that it doesn't scale well. I've also tried a Hadoop implementation of KNN-Join, which takes a lot of temporary space in HDFS, far too much compared to the amount of input data. In fact, the pairwise distance approach isn't really appropriate because of the dimension of my input vectors (64).
I have heard of Locality-Sensitive Hashing and wondered whether there is a free implementation, or whether it's worth implementing myself. Maybe there's another algorithm I am not aware of?
IIRC, FLANN (Fast Library for Approximate Nearest Neighbors) is a good compromise:
http://people.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
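As for the locality-sensitive hashing idea raised in the question, here is a minimal sketch of one common variant, random-hyperplane hashing. This is not FLANN's method and every name in it is made up for illustration. It buckets vectors by angular similarity; if the descriptors are normalised to unit length, that ranks neighbours the same way Euclidean distance does.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector, hyperplanes):
    """Pack the side of each hyperplane the vector falls on into one integer."""
    key = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        key = (key << 1) | (dot >= 0)
    return key


def build_index(vectors, hyperplanes):
    """Group vector indices by hash key; only same-bucket vectors get compared."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[lsh_key(vec, hyperplanes)].append(idx)
    return buckets


planes = make_hyperplanes(num_bits=16, dim=64)
rng = random.Random(1)
v = [rng.uniform(-1.0, 1.0) for _ in range(64)]
index = build_index([v, [2 * x for x in v], [-x for x in v]], planes)
```

A query is hashed the same way and compared exactly only against the vectors in its bucket. In practice several independent tables are used so that near neighbours split by one hyperplane are still found in another table.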
I'm looking for compression/decompression algorithms that give decent compression (2-4x) on regular English text, and that let me decompress the data almost as fast as I can get it out of main memory (~10Gbps). What's the current state of the art in fast decompression algorithms (perhaps vectorized code that uses multiple cores)?
In particular, I'm looking at the paper "Fast Integer compression using SIMD instructions" and wondering if similar algorithms have been used in any system.
Look at LZO and lz4. Try them on your data and see how they perform.
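A small harness makes "try them on your data" concrete. The sketch below uses zlib from the Python standard library purely as a stand-in, because LZO and lz4 bindings are separate packages that I don't want to assume here; the `measure` helper is hypothetical, and you would pass in the compress/decompress functions of whichever library you test.

```python
import time
import zlib


def measure(data, compress, decompress, repeats=5):
    """Return (compression ratio, best decompression throughput in MB/s)."""
    packed = compress(data)
    assert decompress(packed) == data          # sanity check: lossless round trip
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decompress(packed)
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)                     # guard against a zero timer reading
    return len(data) / len(packed), len(data) / best / 1e6


# Placeholder input; use a representative chunk of your own text instead.
text = b"the quick brown fox jumps over the lazy dog. " * 2000
ratio, mb_per_s = measure(text, zlib.compress, zlib.decompress)
```

Numbers measured from Python include interpreter overhead and say nothing about multi-core scaling, so use them only to compare candidates against each other on the same data.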
A Golomb code can be as good as a Huffman code, and is very simple and fast.
BWT + entropy coding (for instance Huffman coding) is quite fast (complexity O(n)) but needs two passes.
Are there any good compression algorithms for a large sequence of integers (A/D converter data)? There is a similar question.
But the data is different in my case. It can be negative or positive, and it changes like wave data.
EDIT1:sample data added
Please refer to this file for a data sample
Generally, if you have some knowledge about the signal, use it to predict the next value based on the previous ones. Then compress the difference between the predicted and the real value.
If the prediction is good, the differences will be small and they will compress well.
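As a sketch of the idea, here is a fixed second-order predictor that assumes the signal continues along the straight line through the last two samples. The choice of predictor and the names are my own assumptions; the right predictor depends on the physical nature of the signal.

```python
def predict(history):
    """Extrapolate the next sample from the straight line through the last two."""
    if not history:
        return 0
    if len(history) == 1:
        return history[0]
    return 2 * history[-1] - history[-2]


def residuals(samples):
    """Difference between each sample and its prediction from earlier samples."""
    return [x - predict(samples[:i]) for i, x in enumerate(samples)]


def reconstruct(res):
    """Invert residuals(): rebuild samples by re-running the same predictor."""
    samples = []
    for r in res:
        samples.append(r + predict(samples))
    return samples


ramp = [0, 10, 20, 30, 40, 38, 36]
res = residuals(ramp)   # [0, 10, 0, 0, 0, -12, 0]
```

The residuals are non-zero only where the slope changes, and the decoder recovers the original exactly because it runs the same predictor on the samples it has already rebuilt. The slicing in `residuals` is quadratic and only there for clarity.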
Anything more specific is unlikely to be possible without seeing the data and knowing about its physical nature.
update:
If the prediction is really good and uses all knowledge about the dependencies, the differences are likely to be independent, and something like arithmetic coding would work for them.
You want to delta encode and then apply RLE or a Golomb code. The Golomb code can be as good as a Huffman code.
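Since the data can be negative as well as positive, the deltas need to be mapped to non-negative integers before a Golomb code can be applied. A common trick is the "zigzag" mapping, sketched here with illustrative function names; the Golomb or RLE stage itself is left out.

```python
def delta_encode(samples):
    """Replace each sample by its difference from the previous one."""
    prev, out = 0, []
    for x in samples:
        out.append(x - prev)
        prev = x
    return out


def delta_decode(deltas):
    """Invert delta_encode() with a running sum."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out


def zigzag(n):
    """Map 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z):
    """Invert zigzag()."""
    return z >> 1 if z % 2 == 0 else -((z + 1) >> 1)


wave = [100, 102, 101, 99, 99]
codes = [zigzag(d) for d in delta_encode(wave)]   # [200, 4, 1, 3, 0]
```

Small deltas of either sign become small non-negative codes, which is the shape a Golomb code handles well. The first value carries the full sample and may be worth storing separately.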
Nearly any standard compression algorithm for byte strings can be applied; after all, any file of data can be interpreted as a sequence of signed integers. Is there something special about your particular integers that you think will make them amenable to some more-specific algorithm? You mention wave data; maybe take a look at FLAC which is designed for audio data; if your data has similar characteristics those techniques may be valuable.
You could diff the data then apply RLE on suitable subregions (i.e. between inflection points).
Does anyone know of a lossless compression algorithm that produces headerless output?
For example, one that does not store the Huffman tree used to compress the data? I am not talking about hard-coded Huffman trees; I would like to know if there is any algorithm that can compress and decompress input without storing any metadata in its output. Or is this even theoretically impossible?
Of course it is possible. Among others, the LZ family of compressors doesn't need to output anything apart from the compressed data itself, as the dictionary is built on-line as compression (or decompression) progresses. There are many reference implementations of these LZ-type algorithms, for example LZMA, a component of 7zip.
Adaptive Huffman coding does exactly that. More generally, the term adaptive coding is used to describe entropy codes with this property. Some dictionary codes have this property too, e.g. run-length encoding (RLE) and Lempel-Ziv-Welch (LZW).
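To show how a decoder can manage without any stored table, here is a minimal LZW sketch. Both sides start from the same 256 single-byte entries and grow the dictionary by the same rule, so the output is only the list of codes. The function names are illustrative, and real implementations also pack the codes into variable-width bit fields and cap the dictionary size.

```python
def lzw_compress(data):
    """Turn bytes into a list of integer codes; no dictionary is emitted."""
    table = {bytes([i]): i for i in range(256)}
    codes, current = [], b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)      # learn the new phrase
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the bytes, re-deriving the dictionary from the codes alone."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):               # phrase the encoder just created
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


message = b"TOBEORNOTTOBEORTOBEORNOT"
packed = lzw_compress(message)
```

The decoder is always one dictionary entry behind the encoder, which is why it needs the special case for a code it has not defined yet.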
Run Length Encoding would be one example
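A minimal sketch of run-length encoding, with illustrative function names, to show that the output consists of nothing but (symbol, run length) pairs:

```python
from itertools import groupby


def rle_encode(data):
    """Collapse each run of equal symbols into a (symbol, length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]


def rle_decode(pairs):
    """Expand (symbol, length) pairs back into the original string."""
    return "".join(symbol * length for symbol, length in pairs)


pairs = rle_encode("aaabccdddd")   # [('a', 3), ('b', 1), ('c', 2), ('d', 4)]
```

It only pays off when the input actually contains long runs; on data without them the pairs take more space than the input.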
LZO springs to mind. It's used in OpenVPN, with great results.
Why are you looking for compression algorithms with headerless compressed output?
Perhaps (a) you have a system like 2-way telephony that needs low-latency streaming compression/decompression.
The adaptive coding category of compression algorithms mentioned by Zach Scrivena and the LZ family of dictionary compression algorithms mentioned by Diego Sevilla and Javier are excellent for this kind of application. Practical implementations of these algorithms usually do have a byte or two of metadata at the beginning (making them useless for (b) applications), but that has little or no effect on latency.
Perhaps (b) you are mainly interested in cryptography, and you have heard that compress-before-encrypt gives some improved security properties, as long as the compressed text does not have a fixed metadata header that could serve as a "crib".
Modern encryption algorithms aren't (as far as we know) vulnerable to such "cribs", but if you're paranoid you might be interested in "bijective compression" (a, b, c, etc.).
It's not possible to detect errors in transmission (flipped bits, inserted bits, deleted bits, etc.) when a receiver gets such compressed output (making these algorithms not especially useful for (a) applications).
Perhaps (c) you are interested in headerless compression for some other reason. Sounds fascinating -- what is that reason?