I'm doing some research on Huffman coding. There are several variants, and I can't find their use cases in real applications.
Huffman (classic):
- ? (pass the tree with 0 for a node and 1 for a leaf, as in "Efficient way of storing Huffman tree"; sketched below)
Canonical variant:
- JPEG
- PNG
Fixed variant:
- Deflate
- HPACK (HTTP/2)
Adaptive variant:
- ?
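To be concrete, the tree-passing scheme I mean is roughly this (a minimal sketch in Python; the Node class and the 8-bit symbol field are just my own illustration):

    class Node:
        def __init__(self, symbol=None, left=None, right=None):
            self.symbol = symbol          # set only on leaves
            self.left = left
            self.right = right

    def serialize(node, bits):
        """Pre-order walk: 0 for an internal node, 1 plus an 8-bit symbol for a leaf."""
        if node.symbol is not None:
            bits.append('1' + format(node.symbol, '08b'))
        else:
            bits.append('0')
            serialize(node.left, bits)
            serialize(node.right, bits)

    def deserialize(bits, pos=0):
        """Inverse of serialize(); returns (node, next position)."""
        if bits[pos] == '1':
            return Node(symbol=int(bits[pos + 1:pos + 9], 2)), pos + 9
        left, pos = deserialize(bits, pos + 1)
        right, pos = deserialize(bits, pos)
        return Node(left=left, right=right), pos

    # A tiny example tree for the symbols 'a', 'b', 'c'
    tree = Node(left=Node(symbol=ord('a')),
                right=Node(left=Node(symbol=ord('b')),
                           right=Node(symbol=ord('c'))))
    out = []
    serialize(tree, out)
    encoded = ''.join(out)
    rebuilt, _ = deserialize(encoded)     # no frequencies stored, just the shape and the symbols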
Do you know of any other use cases?
Thank you.
I'm not sure what distinction you're trying to draw between "canonical" and "fixed". All implementations I know of use canonical Huffman codes, since that avoids transmitting information about the code that is irrelevant: only the code lengths need to be sent. Both Deflate and JPEG use both fixed and dynamic Huffman codes, where fixed codes are defined a priori and dynamic codes depend on the data being compressed.
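To make that concrete, here is a sketch (my own illustration, not code from any particular codec) of how canonical codes can be rebuilt from nothing but a list of code lengths, which is essentially what Deflate stores for its dynamic codes:

    def canonical_codes(code_lengths):
        """code_lengths: dict mapping symbol -> bit length (a length of 0 means unused).
        Assigns codes the canonical way: shorter codes first, ties broken by symbol order."""
        ordered = sorted((length, sym) for sym, length in code_lengths.items() if length > 0)
        codes = {}
        code = 0
        prev_len = ordered[0][0]
        for length, sym in ordered:
            code <<= (length - prev_len)          # append zeros whenever the length grows
            codes[sym] = format(code, '0{}b'.format(length))
            code += 1
            prev_len = length
        return codes

    # Example: the only thing the encoder has to transmit is {symbol: length}
    print(canonical_codes({'a': 2, 'b': 1, 'c': 3, 'd': 3}))
    # -> {'b': '0', 'a': '10', 'c': '110', 'd': '111'}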
PNG uses Deflate, as do many, many other applications. Some examples are gzip, zip, HTTP (with the gzip or zlib "deflate" encodings), PDF, FITS, CDF, HDF. I have no idea how many more.
You just used Deflate to read this web page. There! You did it again. You do this every day, many many times.
Brotli is another compression format that uses Huffman codes.
I'm sure there are a thousand others, many of which are proprietary.
What algorithms are designed to compress static data? For example, I have an input string of "Hello world!". I want to make a library that will JIT-compile a set of compression and decompression functions for that "Hello world!" string. What algorithms are out there that I can learn from? The closest thing I have found so far is the term "tailed compression", but I can't find any actual algorithms or code for this.
For static (fixed, known at the outset) content, you can look at "off-line algorithms". A classification of "data compression via textual substitution" was published in 1982 by J.A. Storer and T.G. Szymanski; see in particular "Off-Line Compression: The Macro Model".
DEFLATE supports the use of preset dictionaries. These dictionaries (up to 32 KB, matching DEFLATE's window size) are used as a reference for deduplicating your data.
Very high compression ratios can be achieved on short data strings with recurring patterns by choosing a decent dictionary (a concatenation of sample data is often a good start).
You can use dicflate to experiment with it.
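If you want to try the idea without extra tools, Python's zlib exposes DEFLATE preset dictionaries via the zdict argument. A minimal sketch (the sample message and dictionary below are made up for illustration):

    import zlib

    dictionary = b"Hello world! Hello there! Hello compression!"   # e.g. concatenated sample data
    message = b"Hello world! Hello compression!"

    # Without a preset dictionary
    plain = zlib.compress(message, 9)

    # With a preset dictionary
    comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                            9, zlib.Z_DEFAULT_STRATEGY, zdict=dictionary)
    with_dict = comp.compress(message) + comp.flush()

    # The decompressor must be given the same dictionary
    decomp = zlib.decompressobj(zdict=dictionary)
    assert decomp.decompress(with_dict) == message

    print(len(message), len(plain), len(with_dict))   # compare the sizes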
I am just wondering if someone could point me to an algorithm that compresses Unicode text to 10-20 percent of its original size?
Actually, I've read about the Lempel-Ziv compression algorithm, which reduces the size of text to about 60% of the original, but I've heard that there are some algorithms with this kind of performance.
If you are considering only text compression, then the very first algorithm to look at is the entropy-based encoding called Huffman encoding:
Huffman Coding
Then there is LZW compression, which uses dictionary encoding: previously seen sequences of letters are assigned codes to reduce the size of the file:
LZW compression
I think the above two are sufficient for encoding text data efficiently and are easy to implement.
Note: Do not expect good compression on all files. If the data is random with no pattern, then no compression algorithm can give you any compression at all. The percentage of compression depends on the symbols appearing in the file, not only on the algorithm used.
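To illustrate the dictionary idea, here is a minimal LZW encoder sketch (my own, not a production codec); a decoder rebuilds the same dictionary from the codes as it reads them, so no table has to be stored:

    def lzw_encode(data: bytes):
        """Start with all single-byte entries, add each newly seen sequence,
        and emit integer codes instead of the raw bytes."""
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256
        current = b""
        out = []
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in dictionary:
                current = candidate
            else:
                out.append(dictionary[current])
                dictionary[candidate] = next_code
                next_code += 1
                current = bytes([byte])
        if current:
            out.append(dictionary[current])
        return out

    codes = lzw_encode(b"TOBEORNOTTOBEORTOBEORNOT")
    print(len(codes), codes)   # fewer codes than input bytes once patterns start repeating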
LZ-like coders are not any good for text compression.
The best one for direct use with Unicode would be LZMA, though, as it has position-alignment options (http://www.7-zip.org/sdk.html).
But for the best compression, I'd suggest converting Unicode texts to a bytewise format, e.g. UTF-8, and then using an algorithm with known good results on texts, e.g. BWT (http://libbsc.com) or PPMd (http://compression.ru/ds/ppmdj1.rar).
Also, some preprocessing can be applied to improve the results of text compression (see http://xwrt.sourceforge.net/).
And there are some compressors with an even better ratio than the suggested ones (mostly PAQ derivatives), but they're also much slower.
Here I tested various representations of a Russian translation of Witten's "Modeling for text compression":
File                    Original   7z      rar4    paq8px69
modeling_win1251.txt    156091     50227   42906   36254
modeling_utf16.txt      312184     52523   50311   38497
modeling_utf8.txt       238883     53793   44231   37681
modeling_bocu.txt       165313     53073   44624   38768
modeling_scsu.txt       156261     50499   42984   36485
It shows that longer input doesn't necessarily mean better overall compression, and that SCSU, although useful, isn't really the best representation of Unicode text (the win1251 codepage is one, too).
PAQ is the new reigning champion of text compression... There are a few different flavors, and information about them can be found here.
There are three flavors that I recommend:
ZPAQ - A future-facing container for PAQ algorithms (created to make the future of PAQ easier)
PAQ8PX/PAQ8KX - The most powerful; works with EXE and WAV files as well
PAQ8PF - Faster (both compression and decompression) and mostly intended for TXT files
You have to build them yourself from source; fortunately, someone made a GUI, FrontPAQ, that packages the two best binaries into one.
Once you have a functional binary, it's simple to use; the documentation can be found here.
Note: I am aware this is a very old question, but I wish to include relevant modern data. I came looking for an answer to the same question, yet found a newer, more powerful answer.
Are there any good compression algorithms for a large sequence of integers (A/D converter data)? There is a similar question,
but the data is different in my case: it can be negative or positive and changes like wave data.
EDIT 1: sample data added.
Please refer to this file for a data sample
Generally, if you have some knowledge about the signal, use it to predict the next value based on previous ones, then compress the difference between the predicted and the real value.
If the prediction is good, the differences will be small and will compress well.
Anything more specific is unlikely to be possible without seeing the data and knowing its physical nature.
Update:
If the prediction is really good and uses all knowledge about the dependencies, the differences are likely to be independent, and something like arithmetic coding would work well for them.
You want to delta-encode and then apply RLE or a Golomb code. A Golomb code can be as good as a Huffman code.
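A rough sketch of the delta + Golomb idea (my own illustration; the Rice parameter k and the sample signal are arbitrary choices, and a real coder would tune k to the residual magnitudes):

    def zigzag(n):
        """Map signed residuals to unsigned values: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ..."""
        return 2 * n if n >= 0 else -2 * n - 1

    def rice_encode(values, k):
        """Rice code (Golomb with divisor 2**k, k >= 1): unary quotient + k-bit remainder."""
        bits = []
        for v in values:
            q, r = v >> k, v & ((1 << k) - 1)
            bits.append('1' * q + '0' + format(r, '0{}b'.format(k)))
        return ''.join(bits)

    samples = [0, 3, 7, 12, 15, 14, 10, 5, 1, -2, -4, -3]        # wave-like data
    deltas = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]
    encoded = rice_encode([zigzag(d) for d in deltas], k=2)
    print(len(encoded), "bits for", len(samples), "samples")

A matching decoder reads the unary quotient and k remainder bits for each value, then undoes the zig-zag and delta steps.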
Nearly any standard compression algorithm for byte strings can be applied; after all, any file of data can be interpreted as a sequence of signed integers. Is there something special about your particular integers that you think will make them amenable to some more-specific algorithm? You mention wave data; maybe take a look at FLAC which is designed for audio data; if your data has similar characteristics those techniques may be valuable.
You could diff the data then apply RLE on suitable subregions (i.e. between inflection points).
I am told that Huffman coding is used as a lossless data compression algorithm, but I am also told that real data compression software does not employ Huffman coding, because if the symbol frequencies are not skewed enough, the compressed file could be even larger than the original file.
This leaves me wondering are there any real-world application of Huffman coding?
Huffman is widely used in all the mainstream compression formats that you might encounter - from GZIP, PKZIP (winzip etc) and BZIP2, to image formats such as JPEG and PNG.
All compression schemes have pathological data-sets that cannot be meaningfully compressed; the archive formats I listed above simply 'store' such files uncompressed when they are encountered.
Newer arithmetic and range coding schemes are often avoided because of patent issues, meaning Huffman remains the work-horse of the compression industry.
See the Wikipedia article on the subject:
Huffman coding today is often used as a "back-end" to some other compression method. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by Huffman coding.
There are quite a lot of real-world applications of Huffman Encoding. ZIP is perhaps the most widely used compression tool that uses Huffman Encoding as its basis. The latest of the most efficient lossless compression algorithms, Brotli Compression, released by Google last month also uses Huffman Coding. Apart from that, Brotli also uses LZ77 and a few other fundamental lossless compression algorithms. Refer to Brotli.
When one considers compression algorithms, there are often benefits and disadvantages to each. It is the nature of compression that, given a set of input, there exist better and worse compression algorithms for that data.
Huffman is really, really good at some things, most notably data whose ordering repeats a lot and which uses a subset of the character space, for example English-language text files. English tends to have the same letters followed by the same other letters.
If your professor or book gave you the impression that Huffman is not used, they are wrong. For example, almost all communications to and from the internet are at some point Huffman encoded. (A number of communication protocols use it.) Most image files (JPEGs) are Huffman encoded. Most music files (MP3s) are Huffman encoded. There are many other examples.
One reason Huffman is used is because it can be "discovered" via a slightly different algorithm called adaptive Huffman. As you read the file, you learn the Huffman code and "compress as you go". This is a simplified overview, but you get the idea.
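Here is a deliberately naive sketch of the "compress as you go" idea (my own illustration: it rebuilds a Huffman code from the running symbol counts after every symbol, which is far slower than the real FGK/Vitter adaptive Huffman algorithms, but it shows why no code table needs to be transmitted):

    import heapq

    def build_code(counts):
        """Deterministic Huffman code over all byte values from a table of counts."""
        heap = [(count, sym, (sym,)) for sym, count in enumerate(counts)]
        heapq.heapify(heap)
        code = {sym: '' for sym in range(len(counts))}
        while len(heap) > 1:
            c1, i1, syms1 = heapq.heappop(heap)
            c2, i2, syms2 = heapq.heappop(heap)
            for s in syms1:
                code[s] = '0' + code[s]
            for s in syms2:
                code[s] = '1' + code[s]
            heapq.heappush(heap, (c1 + c2, min(i1, i2), syms1 + syms2))
        return code

    def adaptive_encode(data):
        counts = [1] * 256                  # every symbol starts at 1 so it always has a code
        bits = []
        for b in data:
            bits.append(build_code(counts)[b])
            counts[b] += 1                  # the decoder makes the same update
        return ''.join(bits)

    def adaptive_decode(bits, n_symbols):   # n_symbols passed separately just to keep the sketch short
        counts = [1] * 256
        out = bytearray()
        pos = 0
        for _ in range(n_symbols):
            decoder = {v: k for k, v in build_code(counts).items()}
            buf = ''
            while buf not in decoder:       # the code is prefix-free, so the first match is the symbol
                buf += bits[pos]
                pos += 1
            out.append(decoder[buf])
            counts[decoder[buf]] += 1
        return bytes(out)

    msg = b"abracadabra abracadabra"
    enc = adaptive_encode(msg)
    assert adaptive_decode(enc, len(msg)) == msg    # no code table was ever transmitted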
To solve the "use the best algorithm for the situation" problem, zip files allow a number of different compression methods to be used, depending on which one is best for a given file.
A very widespread application is the encoding of strings in HPACK, the header compression technique of HTTP/2.
The RFC directly provides a Huffman code table that is optimized for compressing HTTP headers.
Huffman coding is used to convert fixed-length codes into variable-length codes, which results in lossless compression. Variable-length codes may be further compressed using JPEG and MPEG techniques to get the desired compression ratio.
Does any of you know a lossless compression algorithm that produces headerless output?
For example, one that does not store the Huffman tree used to compress it? I am not talking about hard-coded Huffman trees; I would like to know whether there is any algorithm that can compress and decompress input without storing some metadata in its output. Or is this theoretically impossible?
Of course it is possible. Among others, the LZ family of compressors doesn't need to output anything apart from the compressed data itself, as the dictionary is built on-line as compression (or decompression) progresses. You have a lot of reference implementations of those LZ-type algorithms, for example LZMA, a component of 7zip.
Adaptive Huffman coding does exactly that. More generally, the term adaptive coding is used to describe entropy codes with this property. Some dictionary codes have this property too, e.g. run-length encoding (RLE) and Lempel-Ziv-Welch (LZW).
Run-length encoding would be one example.
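For instance, a byte-oriented RLE needs no stored metadata at all; a throwaway sketch (the (count, byte) pair format is just an assumption for illustration):

    def rle_encode(data: bytes) -> bytes:
        """Emit (count, byte) pairs; runs are capped at 255 so each pair fits in two bytes."""
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            out += bytes([run, data[i]])
            i += run
        return bytes(out)

    def rle_decode(data: bytes) -> bytes:
        """No header needed: the stream alone says how to expand itself."""
        out = bytearray()
        for count, value in zip(data[0::2], data[1::2]):
            out += bytes([value]) * count
        return bytes(out)

    sample = b"aaaaaabbbcccccccccd"
    assert rle_decode(rle_encode(sample)) == sample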
LZO springs to mind. It's used in OpenVPN, with great results.
Why are you looking for compression algorithms with headerless compressed output?
Perhaps (a) you have a system like 2-way telephony that needs low-latency streaming compression/decompression.
The adaptive coding category of compression algorithms mentioned by Zach Scrivena, and the LZ family of dictionary compression algorithms mentioned by Diego Sevilla and Javier, are excellent for this kind of application. Practical implementations of these algorithms usually do have a byte or two of metadata at the beginning (making them useless for (b) applications), but that has little or no effect on latency.
Perhaps (b) you are mainly interested in cryptography, and you hear that compress-before-encrypt gives some improved security properties, as long as the compressed text does not have a fixed metadata header "crib". Modern encryption algorithms aren't (as far as we know) vulnerable to such "cribs", but if you're paranoid you might be interested in "bijective compression" (a, b, c, etc.).
It's not possible to detect errors in transmission (flipped bits, inserted bits, deleted bits, etc.) when a receiver gets such compressed output (making these algorithms not especially useful for (a) applications).
Perhaps (c) you are interested in headerless compression for some other reason. Sounds fascinating -- what is that reason?