Algorithms for only compressing static data?

What algorithms are designed to compress static data? For example, I have an input string of "Hello world!" and I want to make a library that will JIT-compile a set of compression and decompression functions for that "Hello world!" string. What algorithms are out there that I can learn from? The closest thing I have found so far is the term "Tailed Compression", but I can't find any actual algorithms or code for it.

For static (fixed, known at the outset) content, you can look at "off-line algorithms". One classification was published in 1982 by J. A. Storer and T. G. Szymanski in "Data Compression via Textual Substitution"; see in particular the part on "Off-Line Compression: The Macro Model".

DEFLATE supports the use of preset dictionaries: a dictionary of up to 32 KB that is used as a reference for deduplicating your data.
Very high compression ratios can be achieved on short data strings with recurring patterns by choosing a decent dictionary (a concatenation of sample data is often a good start).
You can use dicflate to experiment with it.
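As a rough sketch, Python's zlib module exposes DEFLATE preset dictionaries through the zdict parameter; the dictionary and sample string below are made up for illustration:

import zlib

# A preset dictionary: a concatenation of strings representative of the
# data you expect to compress. Both sides must use the exact same bytes.
PRESET = b"Hello world! Hello there! Goodbye world!"

def compress(data: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=PRESET)
    return c.compress(data) + c.flush()

def decompress(blob: bytes) -> bytes:
    d = zlib.decompressobj(zdict=PRESET)
    return d.decompress(blob) + d.flush()

msg = b"Hello world!"
packed = compress(msg)
assert decompress(packed) == msg
print(len(msg), "->", len(packed))  # gains grow as inputs share more with the dictionary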

Related

Get length statistics over very large textual file

I have a very large file (1.5M rows) containing a JSON dictionary in each row.
Each row contains a parsed Wikipedia article.
For example:
{"title": "article title", "summary": "this is a summary of around 500 words", "article": "This is the whole article with more than 3K words"}
{"title": "article2 title", "summary2": "this is another summary of around 500 words", "article": "This is another whole article with more than 3K words"}
Note that the file itself is not valid JSON.
I want to compute some statistics on these texts, e.g. mean number of sentences, mean number of words, compression ratio, etc. However, everything I try takes ages.
What is the fastest way to go about this? For reference, at the moment I am using spaCy for word and sentence tokenization, but I am open to more approximate solutions, e.g. using regexes, if they are the only way.
If you want to achieve high performance, then you should probably process the lines in parallel using multiple threads, and each line's metrics should be extracted with SIMD-friendly code. It is also probably a good idea to simplify the parsing by using specialized code that works only on this problem rather than a general parsing tool (like regular expressions, unless the target engine is capable of producing very fast, linear-time, JIT-compiled code).
The multithreading part is certainly the easiest one, since the computation appears to be mostly embarrassingly parallel: each thread computes the target metrics on a chunk of lines and then performs a parallel reduction (i.e. a sum) of those metrics.
Each line can be parsed relatively quickly using simdjson. Since the JSON documents are small and the structure appears to be simple and always the same, you can use a regular expression such as "article" *: *"((?:[^"\\]+|\\")*)" to pull out the article text (note that you may need to escape the backslashes depending on the language used). However, the best strategy is probably to parse the JSON document yourself so you can extract the wanted string much more efficiently, for example by searching with a SIMD-friendly loop for a very specific key pattern/string like a double quote followed by article, and then parsing the rest with a more robust (but slower) method.
Similar strategies apply to counting words. A fast over-approximation is to count the number of spaces that are directly followed by a non-space character. The string encoding matters for parsing speed, since decoding UTF-8 strings is generally pretty slow. One fast solution is to simply discard non-ASCII characters if the target language is English or mostly uses ASCII. If that is not the case, you can use a SIMD-aware UTF-8 decoding library (or a hand-written algorithm in the worst case). Working on small chunks of about 1 KB can help use the CPU cache more efficiently (and helps auto-vectorization if you use a compiled native language like C or C++).
If you are not very familiar with SIMD instructions or low-level parsing strategies/algorithms (like this one), note that there are fast parsing libraries, such as Hyperscan, that perform basic operations like this efficiently.
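A rough, standard-library-only sketch of the chunked, parallel approach (no SIMD here; the file name articles.jsonl, the chunk size, and the punctuation-based sentence count are illustrative assumptions, and the regex accepts any escaped character rather than only escaped quotes):

import re
from multiprocessing import Pool

# Pulls the "article" value out of one JSON line without a full JSON parse.
ARTICLE_RE = re.compile(r'"article" *: *"((?:[^"\\]+|\\.)*)"')

def stats_for_chunk(lines):
    words = sentences = chars = 0
    for line in lines:
        m = ARTICLE_RE.search(line)
        text = m.group(1) if m else ""
        chars += len(text)
        # Crude approximations: a word starts where a space is followed by a
        # non-space character; a sentence ends at '.', '!' or '?'.
        words += sum(1 for a, b in zip(text, text[1:]) if a == " " and b != " ")
        sentences += sum(text.count(c) for c in ".!?")
    return words, sentences, chars

def chunks(path, size=10_000):
    # Yield lists of lines so each worker gets a decent amount of work.
    with open(path, encoding="utf-8") as f:
        buf = []
        for line in f:
            buf.append(line)
            if len(buf) == size:
                yield buf
                buf = []
        if buf:
            yield buf

if __name__ == "__main__":
    with Pool() as pool:
        totals = pool.imap_unordered(stats_for_chunk, chunks("articles.jsonl"))
        words, sentences, chars = map(sum, zip(*totals))
    print(words, sentences, chars)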

Text Compression Algorithm

I am just wondering if someone could point me to an algorithm that compresses Unicode text to 10-20 percent of its original size?
I've actually read about the Lempel-Ziv compression algorithm, which reduces the size of text to about 60% of the original, but I've heard that there are algorithms that reach this kind of performance.
If you are considering only text compression, then the very first algorithm to look at is an entropy-based encoding called Huffman coding:
Huffman Coding
Then there is LZW compression, which uses dictionary encoding: previously seen sequences of letters are assigned codes to reduce the size of the file.
LZW compression
I think the above two are sufficient for encoding text data efficiently, and they are easy to implement.
Note: Do not expect good compression on all files. If data is random with no pattern, then no compression algorithm can give you any compression at all. The percentage of compression depends on the symbols appearing in the file, not only on the algorithm used.
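To make the Huffman part concrete, here is a minimal Python sketch that builds a prefix code from symbol frequencies using a heap; a real codec would also have to store or agree on the code table, which is omitted here:

import heapq
from collections import Counter

def huffman_codes(text):
    # Build a prefix-free code table {symbol: bit string} from symbol frequencies.
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries are (frequency, tie-breaker, tree); a tree is a symbol or a pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):          # internal node
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf: a symbol
            codes[tree] = prefix
    walk(heap[0][2])
    return codes

sample = "this is an example of huffman coding"
codes = huffman_codes(sample)
bits = "".join(codes[c] for c in sample)
print(len(bits), "bits instead of", 8 * len(sample))   # frequent symbols get short codes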
LZ-like coders are not any good for text compression.
The best one for direct use with Unicode would be LZMA, though, as it has position-alignment options (http://www.7-zip.org/sdk.html).
But for the best compression, I'd suggest converting Unicode text to a bytewise format, e.g. UTF-8, and then using an algorithm with known good results on text, e.g. BWT (http://libbsc.com) or PPMd (http://compression.ru/ds/ppmdj1.rar).
Some preprocessing can also be applied to improve the results of text compression (see http://xwrt.sourceforge.net/).
And there are some compressors with an even better ratio than the suggested ones (mostly paq derivatives), but they're also much slower.
Here I tested various representations of a Russian translation of Witten's "Modeling for text compression":
                       size     7z      rar4    paq8px69
modeling_win1251.txt   156091   50227   42906   36254
modeling_utf16.txt     312184   52523   50311   38497
modeling_utf8.txt      238883   53793   44231   37681
modeling_bocu.txt      165313   53073   44624   38768
modeling_scsu.txt      156261   50499   42984   36485
It shows that a longer input doesn't necessarily mean better overall compression, and that SCSU, although useful, isn't really the best representation of Unicode text (a plain codepage such as win1251 is a representation too, and here it does slightly better).
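If you want to run that kind of comparison yourself, here is a quick sketch with Python's built-in lzma module (modeling.txt is a placeholder for whatever Unicode text you test, and the numbers will of course differ from the table above):

import lzma

text = open("modeling.txt", encoding="utf-8").read()   # any Unicode text to test

for name, encoding in [("utf-8", "utf-8"), ("utf-16", "utf-16-le")]:
    raw = text.encode(encoding)                         # bytewise representation
    packed = lzma.compress(raw, preset=9)
    print(f"{name:8} {len(raw):8} -> {len(packed)}")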
PAQ is the new reigning champion of text compression... There are a few different flavors, and information about them can be found here.
There are three flavors that I recommend:
ZPAQ - a future-facing container for PAQ algorithms (created to make the future of PAQ easier)
PAQ8PX/PAQ8KX - the most powerful; works with EXE and WAV files as well
PAQ8PF - faster (both compression and decompression) and mostly intended for TXT files
You have to build them yourself from source; fortunately, someone made a GUI, FrontPAQ, that packages the two best binaries into one.
Once you have a functional binary, it is simple to use; the documentation can be found here.
Note: I am aware this is a very old question, but I wish to include relevant modern data. I came looking for an answer to the same question, yet have found a newer, more powerful answer.

Fast decompression algorithms

I'm looking for compression/decompression algorithms that can give a decent 2-4x compression on regular English text, yet let me decompress the data almost as fast as I can read it out of main memory (~10 Gbps). What's the current state of the art in fast decompression algorithms (perhaps vectorized code that uses multiple cores)?
In particular, I'm looking at the paper Fast Integer Compression using SIMD Instructions
and wondering whether similar algorithms have been used in any system.
Look at LZO and lz4. Try them on your data and see how they perform.
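For example, a rough timing harness in Python (assuming the third-party lz4 binding is installed; corpus.txt stands in for your own English text, and the Python call overhead understates the raw C-library speed):

import time
import lz4.frame

data = open("corpus.txt", "rb").read()
packed = lz4.frame.compress(data)
print("ratio: %.2fx" % (len(data) / len(packed)))

start = time.perf_counter()
for _ in range(100):
    lz4.frame.decompress(packed)          # decompress repeatedly to get a stable timing
elapsed = time.perf_counter() - start
print("decompression: %.2f GB/s" % (100 * len(data) / elapsed / 1e9))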
A Golomb code can be as good as a Huffman code and is very simple and fast.
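For illustration, a tiny Rice code (Golomb with a power-of-two parameter, the variant usually chosen for speed); the bit strings here are only for readability, a real coder would pack the bits:

def rice_encode(n, k):
    # Rice code: quotient n >> k in unary (q ones then a zero), remainder in k plain bits.
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, "b").zfill(k) if k else "")

def rice_decode(bits, k):
    # Inverse of rice_encode; bits is a '0'/'1' string holding exactly one code.
    q = bits.index("0")                      # length of the unary run of ones
    r = int(bits[q + 1:q + 1 + k] or "0", 2)
    return (q << k) | r

for n in range(8):
    code = rice_encode(n, 2)
    assert rice_decode(code, 2) == n
    print(n, code)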
BWT + entropy coding (for instance Huffman coding) is quite fast (complexity O(n)) but needs two passes.

What are the real-world applications of huffman coding?

I am told that Huffman coding is used as a lossless data compression algorithm, but I am also told that real data compression software does not employ Huffman coding, because if the symbol distribution is not skewed enough, the compressed file could be even larger than the original file.
This leaves me wondering: are there any real-world applications of Huffman coding?
Huffman is widely used in all the mainstream compression formats that you might encounter - from GZIP, PKZIP (winzip etc) and BZIP2, to image formats such as JPEG and PNG.
All compression schemes have pathological data-sets that cannot be meaningfully compressed; the archive formats I listed above simply 'store' such files uncompressed when they are encountered.
Newer arithmetic and range coding schemes are often avoided because of patent issues, meaning Huffman remains the work-horse of the compression industry.
See the Wikipedia article on the subject:
Huffman coding today is often used as a "back-end" to some other compression method. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by Huffman coding.
There are quite a lot of real-world applications of Huffman encoding. ZIP is perhaps the most widely used compression tool that uses Huffman encoding as its basis. One of the latest and most efficient lossless compression algorithms, Brotli, released by Google last month, also uses Huffman coding. Apart from that, Brotli uses LZ77 and a few other fundamental lossless compression techniques. Refer to Brotli.
When one considers compression algorithms, there are often benefits and disadvantages to each. It is the nature of compression that, for a given set of input, there exist better and worse compression algorithms for that data.
Huffman is really, really good at some things, most notably data that repeats a lot and uses only a subset of the character space, for example English-language text files. English text tends to have the same letters followed by the same other letters.
If your professor or book gave you the impression that Huffman is not used, they are wrong. For example, almost all communications to and from the internet are at some point Huffman encoded (a number of communication protocols use it). Most image files (JPEGs) are Huffman encoded. Most music files (MP3s) are Huffman encoded. There are many other examples.
One reason Huffman is used is that it can be "discovered" via a slightly different algorithm called adaptive Huffman coding: as you read the file, you learn the Huffman code and "compress as you go". This is a simplified overview, but you get the idea.
To solve the "use the best algorithm for the situation" problem, ZIP files allow a number of different compression methods to be used, depending on which one is best for a given file.
A very widespread application is the encoding of strings in HPACK, the header compression technique of HTTP/2.
The RFC directly provides a Huffman code table that is optimized for compressing HTTP headers.
Huffman coding converts fixed-length codes into variable-length codes, which results in lossless compression. In formats such as JPEG and MPEG, it is combined with other (lossy) techniques to reach the desired compression ratio.

Where can I find a lossless compression algorithm, which produces headerless outputs?

Does anyone know of a lossless compression algorithm that produces headerless output?
For example, one that does not store the Huffman tree used to compress it? I am not talking about hard-coded Huffman trees; I would like to know whether there is any algorithm that can compress and decompress input without storing some metadata in its output. Or is this even theoretically impossible?
Of course it is possible. Among others, the LZ family of compressors doesn't need to output anything apart from the compressed data itself, as the dictionary is built on-line as compression (or decompression) progresses. There are plenty of reference implementations of those LZ-type algorithms, for example LZMA, a component of 7-Zip.
Adaptive Huffman coding does exactly that. More generally, the term adaptive coding is used to describe entropy codes with this property. Some dictionary codes have this property too, e.g. run-length encoding (RLE) and Lempel-Ziv-Welch (LZW).
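A compact LZW sketch in Python shows why no header is needed: both sides start from the same fixed 256-entry byte dictionary and grow it in lockstep, so the output is nothing but codes:

def lzw_compress(data):
    table = {bytes([i]): i for i in range(256)}   # fixed initial dictionary, never transmitted
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = len(table)                # dictionary grows as data is seen
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out                                    # just a list of integer codes, no header

def lzw_decompress(codes):
    table = {i: bytes([i]) for i in range(256)}   # rebuilt from nothing but the codes
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        entry = table[code] if code in table else w + w[:1]   # the classic special case
        out.append(entry)
        table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out)

sample = b"TOBEORNOTTOBEORTOBEORNOT"
assert lzw_decompress(lzw_compress(sample)) == sample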
Run Length Encoding would be one example
LZO springs to mind; it's used in OpenVPN, with great results.
Why are you looking for compression algorithms with headerless compressed output?
Perhaps (a) you have a system like 2-way telephony that needs low-latency streaming compression/decompression.
The adaptive coding category of compression algorithms mentioned by Zach Scrivena, and the LZ family of dictionary compression algorithms mentioned by Diego Sevilla and Javier, are excellent for this kind of application.
Practical implementations of these algorithms usually do have a byte or two of metadata at the beginning (making them useless for (b) applications), but that has little or no effect on latency.
Perhaps (b) you are mainly interested in cryptography, and you have heard that compress-before-encrypt gives some improved security properties, as long as the compressed text does not contain a fixed metadata header "crib".
Modern encryption algorithms aren't (as far as we know) vulnerable to such "cribs", but if you're paranoid you might be interested in
"bijective compression" (a, b, c, etc.).
It's not possible to detect errors in transmission (flipped bits, inserted bits, deleted bits, etc.) when a receiver gets such compressed output (making these algorithms not especially useful for (a) applications).
Perhaps (c) you are interested in headerless compression for some other reason. Sounds fascinating -- what is that reason?
