What’s the most compression we can hope for on a file that contains 1000 bits (Huffman algorithm)?

How much can a file that contains 1000 bits, where 1 appears with 10% probability and 0 with 90% probability, be compressed with a Huffman code?

Maybe a factor of two.
But only if you do not include the overhead of sending the description of the Huffman code along with the data. For 1000 bits, that overhead will dominate the problem and determine your maximum compression ratio. I find that for a sample that small, 125 bytes, general-purpose compressors only get it down to around 100 to 120 bytes, because of that overhead.
A custom Huffman code just for this on bytes from such a stream gives a factor of 2.10, assuming the other side already knows the code. The best you could hope for is the entropy, e.g. with an arithmetic code, which gives 2.13.
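For reference, the 2.13 figure is just the entropy limit for a bit that is 1 with probability 0.1, which you can compute directly (a minimal sketch in Python):

from math import log2

p = 0.10                                    # probability of a 1 bit
h = -(p * log2(p) + (1 - p) * log2(1 - p))  # entropy, ~0.469 bits per bit
print(f"entropy: {h:.4f} bits/bit")
print(f"ideal compression factor: {1 / h:.2f}")  # ~2.13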


Extending the Goertzel algorithm to 24 kHz, 32 kHz and 48 kHz in Python

I'm learning to implement the Goertzel algorithm to detect DTMF tones in recorded wave files. I got an implementation in Python from here. It supports audio sampled at 8 kHz and 16 kHz, and I would like to extend it to support audio files sampled at 24 kHz, 32 kHz and 48 kHz.
From the code I got from the link above, I see that the author has set the following precondition parameters/constants:
self.MAX_BINS = 8
if pfreq == 16000:
    self.GOERTZEL_N = 210
    self.SAMPLING_RATE = 16000
else:
    self.GOERTZEL_N = 92
    self.SAMPLING_RATE = 8000
According to this article, before one can do the actual Goertzel, two of the preliminary calculations are:
Decide on the sampling rate.
Choose the block size, N
So the author has clearly set the block size to 210 for 16k-sampled inputs and 92 for 8k-sampled inputs. Now, I would like to understand:
how the author arrived at these block sizes?
what the block size would be for 24k, 32k and 48k samples?
The block size determines the frequency resolution/selectivity and the time it takes to gather a block of samples.
The bandwidth of your detector is about Fs/N, and of course the time it takes to gather a block is N/Fs.
For equivalent performance, you should keep the ratio between Fs and N roughly the same, so that both of those measurements remain unchanged.
It is also important, though, to adjust your block size to be as close as possible to a multiple of the periods (wavelengths) of the tones you want to detect. The Goertzel algorithm is basically a quick way to calculate a few selected DFT bins, and this adjustment puts the frequencies you want to see near the centers of those bins.
Optimizing the block size according to that last point is probably why Fs/N is not exactly the same in the code you have for the 8 kHz and 16 kHz sampling rates.
You could redo this optimization for the other sampling rates you want to support, but performance will really be equivalent to what you already have if you just use N = 210 * Fs / 16000.
You can find a detailed description of the block size choice here: http://www.telfor.rs/telfor2006/Radovi/10_S_18.pdf
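As an illustration (not the author's code), here is a small Python sketch that applies that scaling rule and then searches nearby values of N for the one that puts the standard DTMF frequencies closest to the centers of their DFT bins, per the point about bin alignment above:

DTMF = (697, 770, 852, 941, 1209, 1336, 1477, 1633)  # the 8 DTMF tones, Hz

def bin_error(n, fs):
    # Worst-case distance of any DTMF tone from the center of its DFT bin.
    return max(abs(f * n / fs - round(f * n / fs)) for f in DTMF)

for fs in (8000, 16000, 24000, 32000, 48000):
    base = round(210 * fs / 16000)  # keep Fs/N the same as the 16 kHz case
    tuned = min(range(base - 10, base + 11), key=lambda n: bin_error(n, fs))
    print(f"Fs={fs:5d} Hz: scaled N={base:3d}, tuned N={tuned:3d}, "
          f"bandwidth ~{fs / tuned:.1f} Hz, block ~{1000 * tuned / fs:.1f} ms")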

Why are Bytes talked about in powers of 2?

2^10 = 1 KB,
2^20 = 1 MB,
etc.
Except a byte is 8 bits, so I do not understand why we use powers of 2 as an explanation. Talking about bits in powers of 2 I can completely understand, but with bytes I am totally lost. Many textbooks and online resources talk about it this way; what am I missing here?
By the way, I understand that 2^10 = 1024, which is approximately 10^3 = 1000. What I don't understand is why we justify the use of these prefixes with bytes using powers of 2.
I'll ask the question you're really asking: Why don't we just use powers of 10?
To which we'll respond: why should we use powers of 10? Because the lifeforms using the computers happen to have 10 fingers?
Computers break everything down to 1s and 0s.
1024 in binary = 10000000000 (2^10), which is a nice round number.
1000 in binary = 1111101000 (not an even power of 2).
If you are actually working with a computer at a low level (i.e. looking at the raw memory), it is much easier to think using numbers that are represented as round numbers in the form they are stored in.
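You can check this directly in Python:

print(bin(1024))  # 0b10000000000 -- one followed by ten zeros, round in binary
print(bin(1000))  # 0b1111101000  -- not round in binary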
From your question, I think that you understand about powers of two and measuring bytes. If not, the other answers explain that.
Is your question why we don't use bits rather than bytes, since bits are truly binary?
The reason that memory, disk space, etc. are described in bytes rather than bits has to do with the word addressability of early computers. The bit, nibble and byte came about as workable amounts of memory in simple computers. The first computers had actual wires that linked the various bits together; 8-bit addressability was a significant step forward.
Bytes instead of bits is just a historical convention. Network measurements are in (mega)bits for similar historical reasons.
Wikipedia has some interesting details.
The reason is that you do not only use bytes to store numbers, but also to address the memory bytes that store numbers (or even other addresses). With 1 byte you have 256 possible addresses, so you can access 256 different bytes. Using only 200 of them, for example, just because it is a rounder number, would be a waste of address space.
This example assumes 8-bit addresses for simplicity; modern PCs usually have 64-bit addresses.
By the way, in the context of hard drives, the advertised capacity is often a round decimal number, e.g. 1 TB, because drives address their space differently. Powers of 2 are used in most memory types, like RAM, flash drives/SSDs and cache memory; in those cases the figures are sometimes rounded, e.g. 1024 KB reported as 1 MB.
There are actually two different sets of names for powers of 2 and powers of 10. Powers of ten are known as kilobytes, megabytes and gigabytes, while powers of two are called kibibytes, mebibytes and gibibytes. Most people just use the former names in both cases.
Okay, so I figured my own question out. 2^3 bits = 2^0 bytes. So if we have 2^13 bits and want to convert to bytes, then 2^13 bits * (1 byte / 2^3 bits) = 2^10 bytes, which is a kilobyte. With this conversion it makes much more sense to me why we represent bytes in powers of 2.
We can do the same thing with powers of ten: 10^1 ones = 10^0 tens. Then if we want to convert 10^25 ones to tens, we get 10^25 ones * (10^0 tens / 10^1 ones) = 10^24 tens, as expected.
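A quick sanity check of that conversion in Python:

bits = 2 ** 13
num_bytes = bits // 2 ** 3        # 8 bits per byte
kilobytes = num_bytes // 2 ** 10  # 1024 bytes per (binary) kilobyte
print(num_bytes, kilobytes)       # 1024 1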
I am not sure I get what you are exactly asking, but:
2^10 bits = 1 Kbit
2^10 bytes = 1 KB = (2^3)(2^10) bits = 2^13 bits
These are two different numbers of bits, and you should not confuse them with each other.
I think the part you are hung up on is the conversion from bytes to KB to MB, etc. We all know the conversion, but let me clarify:
1024 bytes is a kilobyte. 1024 kilobytes is a megabyte, etc.
As far as the machines go, they don't care about this conversion! They just store it as x bytes. Honestly, I'm not sure the machine even cares about bytes; it may just deal with bits.
While I'm not entirely sure, I think the 1024 rate is a somewhat arbitrary choice made by some human. It's close to the 1000 used in the metric system. I thought the same thing as you did, like "this has nothing to do with binary!". As one of the other answers says, it's nothing more than "easy to work with".

What's the reason behind ZigZag encoding in Protocol Buffers and Avro?

ZigZag requires a lot of overhead to write/read numbers. Actually I was stunned to see that it doesn't just write int/long values as they are, but does a lot of additional scrambling. There's even a loop involved:
https://github.com/mardambey/mypipe/blob/master/avro/lang/java/avro/src/main/java/org/apache/avro/io/DirectBinaryEncoder.java#L90
I can't find in the Protocol Buffers docs or the Avro docs, nor work out for myself, what the advantage of scrambling numbers like that is. Why is it better to have positive and negative numbers alternate after encoding?
Why are they not just written in little-endian or big-endian (network) order, which would only require reading them into memory and possibly reversing the byte order? What do we buy by paying this performance cost?
It is a variable-length 7-bit encoding. Each byte carries 7 bits of the value; the high bit of a byte is set to 1 if more bytes follow and to 0 on the last byte, which is how the decoder can tell how many bytes were used to encode the value. Byte order is always little-endian, regardless of the machine architecture.
It is an encoding trick that permits writing as few bytes as needed to encode the value. So an 8-byte long with a value between -64 and 63 takes only one byte. That is the common case; the full range provided by a long is very rarely used in practice.
Packing the data tightly without the overhead of a gzip-style compression method was the design goal. The same encoding is used in the .NET Framework. The processor overhead needed to encode and decode the value is inconsequential: already much lower than that of a compression scheme, it is a very small fraction of the I/O cost.
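To make the mechanics concrete, here is a minimal Python sketch (not the library code) of the two pieces: the ZigZag mapping that interleaves positive and negative numbers, and the 7-bit varint that then drops the leading zero bytes:

def zigzag_encode(n: int) -> int:
    # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    return (n << 1) ^ (n >> 63)  # arithmetic shift of a 64-bit long

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def varint_encode(z: int) -> bytes:
    # 7 bits per byte, least significant group first;
    # a high bit of 1 means another byte follows.
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

print(varint_encode(zigzag_encode(-64)).hex())  # 7f -- a single byte

Without ZigZag, a small negative number like -1 has all of its high bits set in two's complement, so the varint would need the maximum ten bytes; the interleaving keeps small-magnitude values small in either sign.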

Compress to fixed length

I am looking for an algorithm that can compress binary data, not to the smallest size, but to a specified size. For example, if I have uncompressed data that comes in various sizes, 1.2, 1.3, 1.4 ... KB, I would specify "compress to 1 KB", and assuming the data could be compressed to .65, .74, .83 KB, the algorithm would stop at 1 KB and return this standard size, leaving some entropy in the data. I could not pinpoint an algorithm for that. Does one exist?
You can ZIP and pad with zeroes, but in some cases the data is highly random, and even very efficient compression algorithms cannot compress it because there is no correlation in the data. You then get no compression at all, so guaranteeing compression down to a specific size is not possible.
It is not possible.
Proof: some data can't be losslessly compressed at all (a counting/pigeonhole argument: there are fewer short outputs than possible inputs), so that case fails your requirement of a fixed size, assuming that size is smaller than the original data.
If you can accept lossy compression, it is very possible, and it happens all the time for some video formats (a fixed-size container per unit of time; the compression adjusts to maximize usage).
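If "compress when it fits, fail otherwise" is acceptable, the ZIP-and-pad idea above is easy to sketch (hypothetical helper names, using Python's zlib):

import zlib

def compress_to_fixed(data: bytes, target: int) -> bytes:
    # Deflate, then zero-pad up to the target size. This can always fail
    # for incompressible input, as the answers above explain.
    packed = zlib.compress(data, 9)
    if len(packed) > target:
        raise ValueError("data does not compress to the target size")
    return packed + b"\x00" * (target - len(packed))

def decompress_fixed(block: bytes) -> bytes:
    # A decompress object stops at the end of the zlib stream, so the
    # trailing zero padding lands in .unused_data and is ignored.
    return zlib.decompressobj().decompress(block)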

Improve this compression algorithm?

I am compressing 8-bit bytes, and the algorithm works only if the number of unique bytes found in the data is 128 or less.
I take all the unique bytes. At the start I store a table containing each unique byte once. If there are 120 of them, I store 120 bytes.
Then, instead of storing each item in 8 bits, I store each item in 7 bits, one after another. Those 7 bits hold the item's position in the table.
Question: how can I avoid storing those 120 bytes at the start, by storing the possible tables in my code?
What you are trying to do is a special case of Huffman coding where you only consider which bytes are unique, not their frequencies, hence giving each byte a fixed-length code. You can do better: use the frequencies to give the bytes variable-length codes with Huffman coding and get more compression.
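For example, here is a minimal sketch of computing Huffman code lengths from byte frequencies (an illustration of that suggestion, not production code):

import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    # Merge the two least frequent groups repeatedly; every merge adds
    # one bit to the code length of each symbol in the merged groups.
    freq = Counter(data)
    heap = [(n, i, [sym]) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freq}
    tiebreak = len(heap)  # unique counter so tuples never compare the lists
    while len(heap) > 1:
        n1, _, syms1 = heapq.heappop(heap)
        n2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (n1 + n2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths  # a lone symbol keeps length 0; real coders special-case it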
But if you intend to keep the same algorithm, then consider this: don't store the 120 bytes; store 256 bits (32 bytes) where bit i is 1 if byte value i is present. That gives you all the information you need: you use the bits to recover which values occur in the file and reconstruct the mapping table on the other side.
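A sketch of that bitmap in Python (note both sides must agree on the table order, e.g. sorted by byte value):

def build_bitmap(data: bytes) -> bytes:
    # 256-bit (32-byte) presence bitmap: bit i is set iff byte value i occurs.
    bitmap = bytearray(32)
    for b in set(data):
        bitmap[b >> 3] |= 1 << (b & 7)
    return bytes(bitmap)

def read_table(bitmap: bytes) -> list:
    # Recover the table of present byte values, in sorted order.
    return [i for i in range(256) if bitmap[i >> 3] & (1 << (i & 7))]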
I don't know the exact algorithm, but the idea is probably that you cannot: the algorithm has to store those values so that it can write a shorter code for every byte in the data.
There is one way you could avoid writing those 120 bytes: when you know the contents of those bytes beforehand. For example, when you know that whatever you are going to send will only ever contain those byte values, you can make the table known on both sides and simply store everything except those 120 bytes.
