Fast decompression algorithms - performance

I'm looking for compression/decompression algorithms that can give decent compression (2-4x on regular English text) and yet let me decompress the data almost as fast as I can read it out of main memory (~10Gbps). What's the current state of the art in fast decompression algorithms (perhaps vectorized code that uses multiple cores)?
In particular, I'm looking at the paper "Fast Integer Compression Using SIMD Instructions" and wondering if similar algorithms have been used in any system.

Look at LZO and LZ4. Try them on your data and see how they perform.

A Golomb code can be nearly as good as a Huffman code, and it is very simple and fast.

BWT + entropy coding (for instance, Huffman coding) is quite fast (complexity O(n)) but requires two passes.
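For intuition, here is a toy BWT in Python; it is a minimal sketch that sorts all rotations explicitly, which is O(n^2 log n), whereas the fast implementations alluded to above build a suffix array instead:

```python
def bwt(text: str) -> str:
    """Toy Burrows-Wheeler transform via sorted rotations.

    Quadratic-time for clarity; real implementations reach linear
    time with suffix arrays.  '\x00' marks end-of-string.
    """
    s = text + "\x00"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # clusters similar characters, which the entropy coder exploits
```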

Related

Text Compression Algorithm

I am just wondering if someone could suggest an algorithm that compresses Unicode text to 10-20 percent of its original size.
I've actually read about the Lempel-Ziv compression algorithm, which reduces text to about 60% of its original size, but I've heard that there are algorithms that reach the 10-20% level.
If you are considering only text compression, then the very first algorithm to try is the entropy-based encoding called Huffman coding:
Huffman Coding
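For illustration, a minimal Huffman table builder in Python (a sketch built on the standard heapq module, not any particular library's API):

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Build a prefix-code table {symbol: bitstring} from the
    symbol frequencies observed in `text`."""
    heap = [[freq, [sym, ""]] for sym, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]          # prepend the branch bit
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heapq.heappop(heap)[1:])

codes = huffman_code("this is an example")
bits = "".join(codes[c] for c in "this is an example")  # frequent symbols get short codes
```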
Then there is LZW compression, which uses dictionary encoding: previously seen sequences of letters are assigned codes, reducing the size of the file.
LZW compression
I think the two above are sufficient for encoding text data efficiently, and they are easy to implement.
Note: Do not expect good compression on all files. If the data is random, with no pattern, then no compression algorithm can compress it at all. The compression ratio depends on the symbols appearing in the file, not only on the algorithm used.
LZ-like coders are not any good for text compression.
The best one for direct use with Unicode would be LZMA, though, as it has position-alignment options (http://www.7-zip.org/sdk.html).
But for the best compression, I'd suggest converting the Unicode text to a bytewise format, e.g. UTF-8, and then using an algorithm with known good results on text, e.g. BWT (http://libbsc.com) or PPMd (http://compression.ru/ds/ppmdj1.rar).
Some preprocessing can also be applied to improve the results of text compression (see http://xwrt.sourceforge.net/).
And there are compressors with an even better ratio than the suggested ones (mostly PAQ derivatives), but they're also much slower.
Here I tested various representations of a Russian translation of Witten's "Modeling for text compression":
file                  size    7z     rar4   paq8px69
modeling_win1251.txt  156091  50227  42906  36254
modeling_utf16.txt    312184  52523  50311  38497
modeling_utf8.txt     238883  53793  44231  37681
modeling_bocu.txt     165313  53073  44624  38768
modeling_scsu.txt     156261  50499  42984  36485
It shows that a longer input doesn't necessarily mean better overall compression, and that SCSU, although useful, isn't really the best representation of Unicode text (the win1251 codepage is one, too).
PAQ is the new reigning champion of text compression... There are a few different flavors, and information about them can be found here.
There are three flavors that I recommend:
ZPAQ - a future-facing container for PAQ algorithms (created to make the future of PAQ easier)
PAQ8PX/PAQ8KX - the most powerful; works with EXE and WAV files as well
PAQ8PF - faster (both compression and decompression) and mostly intended for TXT files
You have to build them yourself from source; fortunately, someone made a GUI, FrontPAQ, that packages the two best binaries into one.
Once you have a functional binary, it's simple to use; the documentation can be found here.
Note: I am aware this is a very old question, but I wish to include relevant modern data. I came here looking for an answer to the same question and found a newer, more powerful one.

Compression algorithms for a sequence of integers

Are there any good compression algorithms for a large sequence of integers (A/D converter data)? There is a similar question, but the data is different in my case: it can be negative or positive and varies like wave data.
EDIT1: sample data added.
Please refer to this file for a data sample
Generally, if you have some knowledge about the signal, use it to predict the next value based on the previous ones, then compress the difference between the predicted and actual values.
If the prediction is good, the differences will be small and will compress well.
Anything more specific is unlikely to be possible without seeing the data and knowing its physical nature.
update:
If the prediction is really good and uses all available knowledge about the dependencies, the differences are likely to be independent, and something like arithmetic coding would work well on them.
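A minimal sketch of that predict-then-encode idea in Python; the order-2 predictor (linear extrapolation) is just an assumption that the waveform is locally linear, which often suits ADC data:

```python
def residuals(samples, order=2):
    """Replace each sample with its prediction error.

    order=1 is plain delta coding; order=2 extrapolates linearly
    from the two previous samples.  The decoder reverses this by
    running the same predictor and adding the residual back.
    """
    out = list(samples[:order])              # first samples kept verbatim
    for i in range(order, len(samples)):
        if order == 1:
            pred = samples[i - 1]
        else:
            pred = 2 * samples[i - 1] - samples[i - 2]
        out.append(samples[i] - pred)
    return out

# A smooth wave yields small residuals, which entropy-code well:
print(residuals([10, 12, 14, 15, 15, 14, 12]))  # [10, 12, 0, -1, -1, -1, -1]
```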
You want to delta-encode, then apply RLE or a Golomb code. A Golomb code can be as good as a Huffman code.
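A tiny sketch of the delta + Golomb idea in Python, using a Rice code (a Golomb code with m = 2**k, which keeps the bit twiddling trivial) plus a zigzag map to fold signed deltas into non-negative integers; the choice k = 2 here is arbitrary:

```python
def zigzag(n: int) -> int:
    """Fold signed values to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1

def rice_encode(n: int, k: int) -> str:
    """Rice code: the quotient n >> k in unary, then the low k bits."""
    bits = "1" * (n >> k) + "0"              # unary part, '0'-terminated
    if k:
        bits += format(n & ((1 << k) - 1), f"0{k}b")
    return bits

deltas = [0, 1, -1, 3, -2]                   # e.g. output of a delta encoder
stream = "".join(rice_encode(zigzag(d), 2) for d in deltas)
print(stream)                                # small deltas -> short codewords
```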
Nearly any standard compression algorithm for byte strings can be applied; after all, any file of data can be interpreted as a sequence of signed integers. Is there something special about your particular integers that you think will make them amenable to a more specific algorithm? You mention wave data; maybe take a look at FLAC, which is designed for audio data. If your data has similar characteristics, those techniques may be valuable.
You could diff the data, then apply RLE on suitable subregions (i.e. between inflection points).

data compression - machine learning for exponential distribution

Are there any machine learning algorithms or prediction models that can help me compress exponentially distributed data? I have already encoded the file using Golomb codes, which definitely saves tons of space, but this is not enough; I need more compression. PAQ8L does not compress it enough.
Please ask for the file if needed.
Exponentially distributed --
{a,b,b,a,a,b,c,c,a,a,b,a,a,b,a,c,b,a,b,d}
I don't think it's theoretically possible. The Golomb code is already optimal for geometrically distributed data.
As mentioned in other posts, PAQ* algorithms use context mixing. That means you need to know more about the data than just "exponentially distributed".
I think the Golomb code is still optimal if the exponential distribution is the only thing known about the data.
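For completeness: to my knowledge, the classical optimality result (Gallager and van Voorhis) picks the Golomb parameter m as the smallest integer with (1-p)^m + (1-p)^(m+1) <= 1 for a geometric source P(n) = (1-p)^n * p. A quick sketch:

```python
def optimal_golomb_m(p: float) -> int:
    """Smallest m satisfying (1-p)**m + (1-p)**(m+1) <= 1, the usual
    optimality condition for Golomb coding a geometric source."""
    q = 1.0 - p
    m = 1
    while q ** m + q ** (m + 1) > 1.0:
        m += 1
    return m

print(optimal_golomb_m(0.5))   # 1 (a plain unary code is optimal here)
print(optimal_golomb_m(0.05))  # a flatter distribution wants a larger m
```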

What are canonical examples of parallel computation?

I am writing a paper to test a new application that will demonstrate the benefits of parallelized computation (compared to the traditional serialized version of this application). I want to use the canonical examples for parallel computation in my paper.
My first example is the parallel computation of pi. I would ideally like an example where each iteration is very time consuming (because of the additional overhead associated with parallelizing); my first thought is a Bayesian simulation with MCMC and Gibbs sampling.
What other problems are typically discussed in this context? What are good examples of large, embarrassingly parallel problems?
Just a few more:
Multiplying matrices
Inverting matrices
FFT
String matching
Rendering 3d scenes (via scan line conversion or ray tracing)
One example of an embarrassingly parallel problem I've used in the past is visualizing the Mandelbrot set: each pixel can be computed independently.
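A minimal sketch of that in Python with multiprocessing; the viewport, resolution, and iteration limit below are arbitrary choices:

```python
from multiprocessing import Pool

def escape_time(c: complex, max_iter: int = 256) -> int:
    """Iterations before z -> z*z + c escapes |z| > 2."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

def render_row(y: int, width: int = 800, height: int = 600) -> list:
    """One scan line; rows share nothing, so they parallelize trivially."""
    return [escape_time(complex(-2.0 + 3.0 * x / width,
                                -1.5 + 3.0 * y / height))
            for x in range(width)]

if __name__ == "__main__":
    with Pool() as pool:                     # one worker per core by default
        image = pool.map(render_row, range(600))
```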
Conway's Life is interesting as well, in that each value of the "next" board can be computed independently, but will depend on the relevant bits of the "current" board being done already.
I would suggest that canonical examples of parallel computation and embarrassingly parallel problems are, if not completely, then nearly, disjoint sets. To put it another way, people working in parallel computation aren't terribly excited about embarrassingly parallel problems; we call them that because we'd be embarrassed to be working on them.
I'd be looking, if I were you, at these (a not entirely original list):
linear algebra on large dense matrices, both direct and iterative approaches;
linear algebra on huge sparse matrices
branch and bound approaches to linear programming (and related) problems;
sequence matching for bioinformatics (outside my field, I may have mis-expressed this);
continuous optimisation.
I expect there are many more.
EDIT: You may be interested in this list of problems which have been selected for benchmarking the next generation of European (academic) supercomputers. It will give you some idea of where that niche is heading.
Molecular dynamics simulations allow you to change the size of the problem until your computing resources are exhausted (i.e. 256 particles vs. 256,000,000 particles). It's truly a "canonical" example if you run the MD simulations under NVT conditions ;-)
My favorite example is Monte Carlo simulation.
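For instance, estimating pi by throwing random darts at the unit square is embarrassingly parallel, since every sample is independent; a quick multiprocessing sketch (the sample counts are arbitrary):

```python
import random
from multiprocessing import Pool

def count_hits(n: int) -> int:
    """Count darts that land inside the quarter unit circle."""
    rng = random.Random()                    # independent per-process RNG
    return sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
               for _ in range(n))

if __name__ == "__main__":
    per_task, tasks = 1_000_000, 8
    with Pool() as pool:                     # samples never interact
        hits = sum(pool.map(count_hits, [per_task] * tasks))
    print(4.0 * hits / (per_task * tasks))   # converges to ~3.1416
```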
Word counting seems to be the canonical example for MapReduce.
http://en.wikipedia.org/wiki/MapReduce#Example
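A toy single-process rendition of that example in Python (the real framework distributes both phases across machines; here the "shuffle" grouping step is folded into the reducer):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
print(reduce_phase(chain.from_iterable(map_phase(d) for d in docs)))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```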
Finding collisions in hash functions using Paul C. van Oorschot and Michael J. Wiener's method (PDF) comes up often in various cryptographic settings.
I used the Mandelbrot set demo to explain to my mom what parallel programming is about : http://www.ateji.com/px/demo.html
All the examples you mention are mostly heavy data-parallel codes. You'll probably also want to mention task-oriented codes, such as servers responding to many requests in parallel, and data-flow or stream-programming examples (MapReduce is a good representative of this class).

Where can I find a lossless compression algorithm, which produces headerless outputs?

Does anyone know of a lossless compression algorithm that produces headerless output?
For example, one that does not store the Huffman tree used to compress the data? I am not talking about hard-coded Huffman trees; I'd like to know whether there is any algorithm that can compress and decompress input without storing metadata in its output. Or is this theoretically impossible?
Of course it is possible. Among others, the LZ family of compressors doesn't need to output anything apart from the compressed data itself, as the dictionary is built online as compression (or decompression) progresses. There are many reference implementations of LZ-type algorithms, for example LZMA, a component of 7-Zip.
Adaptive Huffman coding does exactly that. More generally, the term adaptive coding is used to describe entropy codes with this property. Some dictionary codes have this property too, e.g. run-length encoding (RLE) and Lempel-Ziv-Welch (LZW).
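To make the "no stored table" point concrete, here is a minimal LZW sketch in Python. Both sides start from the same 256 single-byte entries and grow the dictionary in lockstep, so the decoder reconstructs it from the code stream alone; no header is ever written:

```python
def lzw_encode(data: bytes) -> list:
    """LZW: the dictionary grows as the input is scanned, never transmitted."""
    table = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = len(table)           # learn the new phrase
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

def lzw_decode(codes: list) -> bytes:
    """Rebuild the identical dictionary from the codes themselves."""
    table = {i: bytes([i]) for i in range(256)}
    it = iter(codes)
    w = table[next(it)]
    out = [w]
    for k in it:
        entry = table[k] if k in table else w + w[:1]   # KwKwK corner case
        out.append(entry)
        table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out)

msg = b"TOBEORNOTTOBEORTOBEORNOT"
assert lzw_decode(lzw_encode(msg)) == msg    # round-trips with no metadata
```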
Run Length Encoding would be one example.
LZO springs to mind. It's used in OpenVPN, with great results.
Why are you looking for compression algorithms with headerless compressed output?
Perhaps (a) you have a system like 2-way telephony that needs low-latency streaming compression/decompression.
The adaptive coding category of compression algorithms mentioned by Zach Scrivena, and the LZ family of dictionary compression algorithms mentioned by Diego Sevilla and Javier, are excellent for this kind of application. Practical implementations of these algorithms usually do have a byte or two of metadata at the beginning (making them useless for (b) applications), but that has little or no effect on latency.
Perhaps (b) you are mainly interested in cryptography, and you've heard that compress-before-encrypt gives some improved security properties, as long as the compressed text does not have a fixed metadata header "crib". Modern encryption algorithms aren't (as far as we know) vulnerable to such "cribs", but if you're paranoid you might be interested in "bijective compression" (a, b, c, etc.).
It's not possible to detect errors in transmission (flipped bits, inserted bits, deleted bits, etc.) when a receiver gets such compressed output (making these algorithms not especially useful for (a) applications).
Perhaps (c) you are interested in headerless compression for some other reason. Sounds fascinating -- what is that reason?
