Compression of Binary file - algorithm
We are given a binary file of length n, where each bit is independently one with probability 1/3 and zero otherwise. We want to construct a compression method whose expected output length is less than 10 percent more than Shannon's lower bound (for all n large enough).
I've computed the lower bound as 0.918 bits per input bit. I tried using tuples of size 2, and Huffman coding gives me an expected length of 1.88 bits per pair (i.e., about 0.94 bits per input bit). Am I going in the right direction?
What if we want to get within a 3% margin?
The Shannon entropy bound is 0.918 output bits per input bit.
If you just write the bits you're given, you'll spend 1 output bit per input bit.
This is already less than 10% more than the bound (1/0.918 ≈ 1.089, about 8.9% overhead), so for the 10% target no compression is required.
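For the 3% margin, the pairing idea from the question is already sufficient: Huffman coding pairs costs 17/9 ≈ 1.889 bits per pair, i.e. about 0.944 bits per input bit, roughly 2.9% above the 0.918 bound. Here is a minimal sketch (Python; the huffman_rate helper is written just for this illustration) that computes the expected rate for blocks of k bits:

    import heapq, itertools, math

    p1 = 1 / 3  # probability that a bit is one
    H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))  # ~0.9183 bits/bit

    def huffman_rate(k):
        """Expected output bits per input bit when Huffman-coding blocks of k bits."""
        probs = [p1 ** sum(b) * (1 - p1) ** (k - sum(b))
                 for b in itertools.product((0, 1), repeat=k)]
        heapq.heapify(probs)
        cost = 0.0
        while len(probs) > 1:
            a, b = heapq.heappop(probs), heapq.heappop(probs)
            cost += a + b  # expected code length equals the sum of all merge weights
            heapq.heappush(probs, a + b)
        return cost / k

    for k in (1, 2, 3):
        print(k, huffman_rate(k), huffman_rate(k) / H - 1)  # overhead shrinks with k

Larger blocks shrink the overhead further, and an arithmetic coder (see the next answer) approaches the bound directly.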
You can use an arithmetic coder or a range coder.
There are explanations with code for arithmetic coding, and open-source implementations of range coders, available online.
I personally recommend the range coder, because it runs fastest and has never been patented (the patents on arithmetic coding have already expired anyway).
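For a concrete (if unoptimized) illustration of the interval idea behind both coders, here is a toy arithmetic coder in Python that uses exact fractions; real range coders do the same narrowing with fixed-width integers and renormalization. All names are invented for this sketch:

    from fractions import Fraction
    from math import ceil

    P1 = Fraction(1, 3)  # probability that a source bit is one

    def encode(bits):
        # narrow the interval [low, low + width) once per source bit
        low, width = Fraction(0), Fraction(1)
        for b in bits:
            if b == 0:
                width *= 1 - P1
            else:
                low += width * (1 - P1)
                width *= P1
        # emit the shortest dyadic rational inside the final interval
        k = 0
        while Fraction(1, 2 ** k) > width:
            k += 1
        n = ceil(low * 2 ** k)  # smallest multiple of 2**-k that is >= low
        return [int(c) for c in format(n, f"0{k}b")] if k else []

    def decode(code_bits, nbits):
        x = Fraction(int("".join(map(str, code_bits)) or "0", 2), 2 ** len(code_bits))
        low, width, out = Fraction(0), Fraction(1), []
        for _ in range(nbits):
            split = low + width * (1 - P1)
            if x < split:  # the point falls in the "zero" sub-interval
                out.append(0)
                width *= 1 - P1
            else:
                out.append(1)
                low = split
                width *= P1
        return out

For n input bits the code length is about 0.918·n plus a couple of bits of constant overhead, so the overhead fraction vanishes as n grows, meeting any fixed percentage margin.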
Related
MSE giving negative results in High-Level Synthesis
I am trying to calculate the mean squared error in Vitis HLS. I am using hls::pow(...,2) and dividing by n, but all I receive is a negative value, for example -0.004, which does not make sense to me. Could anyone point out the problem or explain what is happening? Also, calculating the mean squared error using hls::pow does not give the same results as (a - b) * (a - b). For information, I am using ap_fixed<> types and not normal float or double precision. Thanks in advance!
It sounds like an overflow and/or underflow issue, meaning that the values reach the sign bit and are interpreted as negative while actually just being very large. Have you tried tuning the representation precision or the different saturation/rounding options of the fixed-point class? This tuning will depend on the data you're processing. For example, if you handle data that you know will range between -128.5 and 1023.4, you might need very few fractional bits, say 3 or 4, leaving the rest for the integer part (which might need roughly log2((1023+128)^2) ≈ 21 bits).

Alternatively, if n is very large, you can try a moving average and calculate the mean in small "chunks" of length m < n.

P.S. Taking the absolute value of a - b and storing it into an ap_ufixed before the multiplication can already give you one extra bit, but it adds an instruction/operation/logic to the algorithm (which might not be a problem if the design is pipelined, but requires space if the size of ap_ufixed is very large).
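As a quick illustration of that failure mode, here is a hedged Python sketch that mimics a wrapping two's-complement fixed-point type, a stand-in for something like ap_fixed<16,8> with the default wrap-on-overflow behavior (widths chosen small on purpose to force the effect):

    def wrap_fixed(x, w=16, f=8):
        """Quantize x to a w-bit two's-complement value with f fractional bits,
        wrapping on overflow (the default ap_fixed behavior)."""
        n = int(round(x * (1 << f))) & ((1 << w) - 1)
        if n >= 1 << (w - 1):  # sign bit set: the value wraps around to negative
            n -= 1 << w
        return n / (1 << f)

    a = [100.0] * 40
    b = [0.0] * 40
    acc = 0.0
    for x, y in zip(a, b):
        # both the square and the running sum wrap in this narrow format
        acc = wrap_fixed(acc + wrap_fixed((x - y) * (x - y)))
    print(acc / len(a))  # negative, although every squared error is positive

Widening the accumulator (or switching the type to a saturating overflow mode) makes the result sensible again, which is exactly the tuning suggested above.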
Compress Random 32-bit Integers: How close can we get to Shannon Entropy?
I've developed a lossless compression algorithm that compresses 32-bit integers (of unknown frequency/probability) to 31.95824 bits per integer (it works a lot better for smaller values, just as most compression algorithms do). Obviously it isn't possible to compress uniformly-distributed random data to become smaller than its uncompressed size. Therefore my question is, which lossless compression algorithms get closest to the Shannon Entropy of 32 bits per integer for pseudorandom data, assuming 32-bit integers? Essentially, I'm looking for a table which includes compression algorithms and their respective bits-per-integer value for positive, compressed, 32-bit integers.
When you say "it works a lot better for smaller values", I presume that you have a transformation from the 32-bit integer to a variable-bit-length representation that is optimized for some non-uniform expected distribution of values. Then that same transformation applied to a uniform distribution of 32-bit values will necessarily take more than 32 bits on average. How much more depends on how non-uniform a distribution you started with. So the answer is, of course you can get to 32 bits exactly by doing nothing at all to the number. But then you are not optimized for the application implied by the non-uniform distribution you designed for.
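To make that concrete, here is a tiny Python check using a 7-bits-per-byte varint as one example of a transformation optimized for small values (an assumption; the asker's scheme may differ): on uniformly random 32-bit integers it averages about 39.5 bits per value, well above 32.

    # Average cost of a LEB128-style varint over all 2**32 values,
    # computed per length band instead of looping over every value.
    bounds = [0, 2 ** 7, 2 ** 14, 2 ** 21, 2 ** 28, 2 ** 32]
    total_bytes = sum(n * (bounds[n] - bounds[n - 1]) for n in range(1, 6))
    print(total_bytes * 8 / 2 ** 32)  # ~39.5 bits: the transform expands uniform data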
The identity function requires precisely 32 bits per 32-bit integer, which is pretty hard to beat. (There are many other length-preserving bijections, if you insist on changing the data stream.) It's not obvious to me what other criteria you might be employing to recommend an algorithm which does worse than that.

Perhaps you believe that the input stream is not truly a uniform sample; rather, it is restricted to (or significantly biased towards) a subset of the universe, but you do not a priori know what the subset is. In that case, the entropy of the stream is less than 32 bits per integer (if there is an upper bound on the size of the subset which is reasonably less than the size of the universe) and you might be able to actually compress the input stream.

It's worth noting that unless messages are fixed-length, the length of the message needs to be taken into account in the computation of entropy, both in the numerator and the denominator. For very long messages that can mostly be ignored, but if messages are short, the cost of message delimiters (or explicit length indicators) can be significant. (Otherwise, "compressing" to 103% of original size is a somewhat humptydumptyesque definition of "to compress".)
This is exactly what Quantile Compression (https://github.com/mwlon/quantile-compression/) was built to do: lossless compression of numbers drawn from a numerical distribution. I'm not aware of any other algorithms that do this. You can see its results versus the theoretical optimum in the readme. It also works on floats and timestamps! I'm not sure what your distribution is, but real-world distributions often take only a few bits per number with it. It works by encoding each number in the sequence as a Huffman code for a coarse numeric range and then an offset for the exact position within that range.
Lossless Compression of Random Data
tl;dr I recently started listening to a security podcast, and heard the following sentence (paraphrasing):

    One of the good hallmarks of a cryptographically strong random number is its lack of compressibility

which immediately got me thinking: can random data be losslessly compressed? I started reading, and found this Wikipedia article. A quoted block is below:

    In particular, files of random data cannot be consistently compressed by any conceivable lossless data compression algorithm: indeed, this result is used to define the concept of randomness in algorithmic complexity theory.

I understand the pigeonhole principle, so I'm assuming I'm way wrong here somewhere, but what am I missing?

IDEA: Assume you have an asymmetric variable-length encryption method by which you could convert any N-bit number into either an (N-16)-bit number or an (N+16)-bit number. Is this possible? If we had an asymmetric algorithm that could make the data either 16 bits bigger or 16 bits smaller, then I think I can come up with an algorithm for reliably producing lossless compression.

Lossless compression algorithm for arbitrary data: break the initial data into chunks of a given size, then use a "key" and attempt to compress each chunk as follows:

    function compress(data)
        compressedData = []
        chunks = data.splitBy(chunkSize)
        foreach chunk in chunks
            encryptedChunk = encrypt(chunk, key)
            if encryptedChunk.Length <= chunk.Length - 16   // arbitrary amount
                compressedData.append(0)                    // 1 bit, not an integer
                compressedData.append(encryptedChunk)
            else
                compressedData.append(1)                    // 1 bit, not an integer
                compressedData.append(chunk)
        end foreach
        return compressedData
    end function

For decompression, if you know the chunk size, then for each chunk that begins with 0, perform the asymmetric decryption and append the result to the ongoing array; if the chunk begins with a 1, simply append the data as-is.

If the encryption method produces the 16-bit-smaller value even 1/16 as often as the 16-bit-larger value, then this will work, right? Each chunk is either 1 bit bigger or 15 bits smaller.

One other consideration is that the "key" used by the compression algorithm can be either fixed or perhaps appended to the beginning of the compressed data. The same consideration applies to the chunk size.
There are 2^(N−16) possible (N−16)-bit sequences, and 2^N possible N-bit sequences. Consequently, no more than one in every 2^16 N-bit sequences can be losslessly compressed to N−16 bits. So it will happen a lot less frequently than 1/16 of the time; it will happen at most 1/65536 of the time.

As your reasoning indicates, the remaining N-bit sequences could be expanded to N+1 bits; there is no need to waste an additional 15 bits encoding them. All the same, the probability of a random N-bit sequence being in the set of (N−16)-bit compressible sequences is so small that the average compression (or expected compression) will continue to be 1.0 (at best).
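A quick sanity check of that bound in Python: even if a full 1/65536 of all N-bit chunks shrank by 15 bits (flag bit included) and every other chunk grew by just 1 bit, the expected length still exceeds N.

    N = 4096  # chunk size in bits; any value works, it cancels out below
    p_shrink = 1 / 65536  # the most optimistic possible fraction of compressible chunks
    expected = p_shrink * (N - 15) + (1 - p_shrink) * (N + 1)
    print(expected - N)  # ~ +0.99975 bits of expansion per chunk, on average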
Integer Time Series compression
Is there a well-known, documented algorithm for (positive) integer stream / time series compression that would:

- have variable bit length
- work on deltas

My input data is a stream of temperature measurements from a sensor (more specifically a TMP36 read out by an Arduino). It is physically impossible that big jumps occur between measurements (time constant of the sensor). I therefore think my compression algorithm should work on deltas (set a base on stream start and then only the difference to the next value).

Because gaps are limited, I want variable bit length: differences lower than 4 fit on 2 bits, lower than 8 on 3 bits, and so on. But there is a dilemma between signalling in-stream the bit size of the next delta, versus just working on, say, 3-bit deltas and signalling the size only when it is bigger, for instance. Any idea which algorithm solves that one?
Use variable-length integers to code the deltas between values, and feed that to zlib to do the compression.
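A minimal sketch of that pipeline in Python (the helpers are spelled out for illustration; the sign fold is the 2v / 2(-v)-1 mapping described in the next answer):

    import random
    import zlib

    def fold(v):
        # signed delta -> non-negative integer: 0,-1,1,-2,2 -> 0,1,2,3,4
        return 2 * v if v >= 0 else -2 * v - 1

    def varint(u):
        # 7 data bits per byte; the high bit marks "more bytes follow"
        out = bytearray()
        while u >= 0x80:
            out.append((u & 0x7F) | 0x80)
            u >>= 7
        out.append(u)
        return bytes(out)

    def compress(samples):
        # the first "delta" is the base value itself
        deltas = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]
        raw = b"".join(varint(fold(d)) for d in deltas)
        return zlib.compress(raw, 9)

    # a smooth fake temperature trace: steps of at most one count per sample
    random.seed(1)
    readings = [200]
    for _ in range(999):
        readings.append(readings[-1] + random.choice((-1, 0, 0, 1)))
    print(len(compress(readings)), "bytes for", len(readings), "samples")

The varints keep each delta to one byte, and zlib then squeezes the repetitive byte stream further; for very short streams the zlib header overhead can dominate, so measure on realistic lengths.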
First of all, there are different formats in existence. One thing I would do first is get rid of the sign. A sign is usually a distraction when thinking about compression. I usually use the scheme where every positive value v becomes 2*v and every negative value becomes 2*(-v)-1. So 0 = 0, -1 = 1, 1 = 2, -2 = 3, 2 = 4, ... Since with that scheme you have nothing like 0b11111111 = -1, the leading one-bits are gone.

Now you can think about how to compress those symbols / numbers. One thing you can do is create a representative sample and use it to train a static Huffman code. This should be possible within your on-chip constraints. Another, simpler approach is using Huffman codes for bit lengths and writing the value bits to the stream: so 0 = bit length 0, 1 (i.e. -1 before folding) = bit length 1, 2 and 3 = bit length 2, and so on. By using Huffman codes to describe the bit length you get quite compact literals.

I usually use a mixture: I encode the most frequent symbols / values as raw values, and the less frequent numbers as bit length + bit pattern of the actual value. This way you stay compact and do not have to deal with excessive tables (there are only 64 possible bit-length symbols for 64-bit values).

There are also other schemes, such as a leading bit where, for example, the highest bit of every byte marks whether another byte follows: as long as the bit is set, there will be another byte for the integer; if it is zero, it is the last byte of the value.

I usually train a static Huffman code for such purposes. It's easy, and you can even turn the encoding and decoding into generated source code (simply create if/switch statements and write your tables as arrays in your code).
You can use integer compression methods with delta or delta-of-delta encoding, as used in TurboPFor Integer Compression. Gamma coding can also be used if the deltas have very small values.
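For illustration, here is a small Python sketch of Elias gamma coding applied to sign-folded deltas (helper names invented for the example); a slowly varying temperature trace yields mostly one- and three-bit codes:

    def gamma(n):
        """Elias gamma code for n >= 1: floor(log2 n) zeros, then n in binary."""
        b = format(n, "b")
        return "0" * (len(b) - 1) + b

    def fold(v):
        # signed -> positive, shifted by 1 so gamma's n >= 1 requirement holds
        return (2 * v if v >= 0 else -2 * v - 1) + 1

    samples = [200, 200, 201, 201, 202, 201, 201]  # made-up sensor readings
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    bits = "".join(gamma(fold(d)) for d in deltas)
    print(bits, len(bits), "bits for", len(deltas), "deltas")  # 14 bits for 6 deltas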
The current state of the art for this problem is Quantile Compression. It compresses numerical sequences such as integers and typically achieves a 35% higher compression ratio than other approaches. It has delta encoding as a built-in feature.

CLI example:

    cargo run --release compress \
      --csv my.csv \
      --col-name my_col \
      --level 6 \
      --delta-order 1 \
      out.qco

Rust API example:

    let my_nums: Vec<i64> = ...
    let compressor = Compressor::<i64>::from_config(CompressorConfig {
      compression_level: 6,
      delta_encoding_order: 1,
    });
    let bytes: Vec<u8> = compressor.simple_compress(&my_nums);
    println!("compressed down to {} bytes", bytes.len());

It does this by describing each number with a Huffman code for a range (a [lower, upper] bound) followed by an exact offset into that range. By strategically choosing the ranges based on your data, it comes close to the Shannon entropy of the data distribution.

Since your data comes from a temperature sensor, your data should be very smooth, and you may even consider delta orders higher than 1 (e.g. delta order 2 is "delta-of-deltas").
Encode an array of integers to a short string
Problem: I want to compress an array of non-negative integers of non-fixed length (but it should be 300 to 400 elements), containing mostly 0s, some 1s, a few 2s. Although unlikely, it is also possible to have bigger numbers. For example, here is an array of 360 elements:

    0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
    0,0,4,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,2,0,0,0,
    0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Goal: The goal is to compress an array like this into the shortest possible encoding using letters and numbers. Ideally, something like: sd58x7y

What I've tried: I tried to use "delta encoding", and use zeroes to denote any value higher than 1. For example: {0,0,1,0,0,0,2,0,1} would be denoted as: 2,3,0,1. To decode it, one would read from left to right and write down "2 zeroes, one, 3 zeroes, one, 0 zeroes, one (this would add to the previous one, and thus give a two), 1 zero, one".

To eliminate the need for delimiters (commas) and thus save more space, I tried to use only one alphanumeric character to denote delta values of 0 to 35 (using 0 to y), while leaving the letter z as "35 PLUS the next character". I think this is called "variable bit" encoding or something like that. For example, if there are 40 zeroes in a row, I'd encode it as "z5".

That's as far as I got... the resulting string is still very long (it would be about 20 characters in the above example). I would ideally want something like 8 characters or even shorter. Thanks for your time; any help or inspiration would be greatly appreciated!
Since your example contains long runs of zeroes, your first step (which it appears you have already taken) could be to use run-length encoding (RLE) to compress them. The output from this step would be a list of integers, starting with a run-length count of zeroes, then alternating between that and the non-zero values (a zero run length of 0 indicating successive non-zero values).

Second, you can encode your integers in a small number of bits, using a class of methods called universal codes. These methods generally compress small integers using a smaller number of bits than larger integers, and also provide the ability to encode integers of any size (which is pretty spiffy...). You can tune the encoding to improve compression based on the exact distribution you expect.

You may also want to look into how JPEG-style encoding works. After DCT and quantization, the JPEG entropy encoding problem seems similar to yours.

Finally, if you want to go for maximum compression, you might want to look up arithmetic coding, which can compress your data arbitrarily close to the statistical minimum entropy.

The above methods compress to a stream of raw bits. In order to convert them to a string of letters and numbers, you will need to add another encoding step which converts the raw bits to such a string. As one commenter points out, you may want to look into base64 representation; or (for maximum efficiency with whatever alphabet is available) you could try using arithmetic compression "in reverse". A sketch of the whole pipeline is shown below.

Additional notes on compression in general: the "shortest possible encoding" depends greatly on the exact properties of your data source. Effectively, any given compression technique describes a statistical model of the kind of data it compresses best. Also, once you set up an encoding based on the kind of data you expect, if you try to use it on data unlike the kind you expect, the result may be an expansion rather than a compression. You can limit this expansion by providing an alternative, uncompressed format to be used in such cases.
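Here is a rough Python sketch of that pipeline: run-length encode the zero runs, write every integer with an Elias gamma code (just one possible universal code), then pack the bits and base64 them. All helper names are invented for the example. On the full 360-element array a scheme like this produces a few dozen characters, and the entropy argument in the next answer shows even an optimal coder would need about 25 base64 characters (146 bits at 6 bits per character), so the hoped-for 8 characters is unattainable.

    import base64

    def gamma(n):
        """Elias gamma code for n >= 1: floor(log2 n) zeros, then n in binary."""
        b = format(n, "b")
        return "0" * (len(b) - 1) + b

    def encode(arr):
        # run-length encode: alternate (zero-run length + 1, nonzero value);
        # the +1 keeps every encoded integer >= 1, as gamma requires
        pieces, run = [], 0
        for v in arr:
            if v == 0:
                run += 1
            else:
                pieces.append(gamma(run + 1))
                pieces.append(gamma(v))
                run = 0
        pieces.append(gamma(run + 1))  # trailing run of zeros
        s = "".join(pieces)
        s += "0" * (-len(s) % 8)       # pad to a whole number of bytes
        raw = bytes(int(s[i:i + 8], 2) for i in range(0, len(s), 8))
        return base64.b64encode(raw).decode()

    print(encode([0, 0, 1, 0, 0, 0, 2, 0, 1]))  # the small example from the question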
In your data you have:

- 14 1s (3.89% of the data)
- 4 2s (1.11%)
- one each of 3, 4 and 5 (0.28% each)
- 339 0s (94.17%)

Assuming that your numbers are independent of each other and you do not have any other information, the entropy of your data is 0.407 bits per number, that is 146.42 bits overall (18.3 bytes). So it is impossible to encode it in 8 bytes.
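For reference, a quick way to reproduce that figure in Python (empirical Shannon entropy of the symbol frequencies; the function name is just for this sketch):

    import math
    from collections import Counter

    def empirical_entropy_bits(xs):
        """Shannon entropy in bits per symbol of the empirical distribution."""
        n = len(xs)
        return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

    # 339 zeros, 14 ones, 4 twos, and one each of 3, 4, 5 -- as in the question
    data = [0] * 339 + [1] * 14 + [2] * 4 + [3, 4, 5]
    h = empirical_entropy_bits(data)
    print(h, h * len(data) / 8)  # ~0.407 bits per number, ~18.3 bytes total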