In simple terms, how is compression commonly implemented? - algorithm

So I've been thinking lately about how compression might be implemented, and what I've postulated so far is that it might be using a sort of HashTable of 'byte signature' keys with memory location values where that 'byte signature' should be replaced upon expansion of the compressed item in question.
Is this far from the truth?
How is compression typically implemented? No need for a page worth of answer, just in simple terms is fine.

Compressing algorithms try to find repeated subsequences to replace them with a shorter representation.
Let's take the 25 byte long string Blah blah blah blah blah! (200 bit) from An Explanation of the Deflate Algorithm for example.
Naive approach
A naive approach would be to encode every character with a code word of the same length. We have 7 different characters and thus need codes with the length of ceil(ld(7)) = 3. Our code words can than look like these:
000 → "B"
001 → "l"
010 → "a"
011 → "h"
100 → " "
101 → "b"
110 → "!"
111 → not used
Now we can encode our string as follows:
000 001 010 011 100 101 001 010 011 100 101 001 010 011 100 101 001 010 110
B l a h _ b l a h _ b l a h _ b l a !
That would just need 25·3 bit = 75 bit for the encoded word plus 7·8 bit = 56 bit for the dictionary, thus 131 bit (65.5%)
Or for sequences:
00 → "lah b"
01 → "B"
10 → "lah!"
11 → not used
The encoded word:
01 00 00 00 00 10
B lah b lah b lah b lah b lah!
Now we just need 6·2 bit = 12 bit for the encoded word and 10·8 bit = 80 bit plus 3·8 bit = 24 bit for the length of each word, thus 116 bit (58.0%).
Huffman code approach
The Huffman code is used to encode more frequent characters/substrings with shorter code than less frequent ones:
5 × "l", "a", "h"
4 × " ", "b"
1 × "B", "!"
// or for sequences
4 × "lah b"
1 × "B", "lah!"
A possible Huffman code for that is:
0 → "l"
10 → "a"
110 → "h"
1110 → " "
11110 → "b"
111110 → "B"
111111 → "!"
Or for sequences:
0 → "lah b"
10 → "B"
11 → "lah!"
Now our Blah blah blah blah blah! can be encoded to:
111110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 1110 11110 0 10 110 111111
B l a h _ b l a h _ b l a h _ b l a h _ b l a h !
Or for sequences:
10 0 0 0 0 11
B lah b lah b lah b lah b lah!
Now out first code just needs 78 bit or 8 bit instead of 25·8 = 200 bit like our initial string has. But we still need to add the dictionary where our characters/sequences are stored. For our per-character example we would need 7 additional bytes (7·8 bit = 56 bit) and our per-sequence example would need again 7 bytes plus 3 bytes for the length of each sequence (thus 59 bit). That would result in:
56 + 78 = 134 bit (67.0%)
59 + 8 = 67 bit (33.5%)
The actual numbers may not be correct. Please feel free to edit/correct it.

Check this wiki page...
Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.
Another kind of compression, called lossy data compression or perceptual coding, is possible if some loss of fidelity is acceptable. Generally, a lossy data compression will be guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in color. JPEG image compression works in part by "rounding off" some of this less-important information. Lossy data compression provides a way to obtain the best fidelity for a given amount of compression. In some cases, transparent (unnoticeable) compression is desired; in other cases, fidelity is sacrificed to reduce the amount of data as much as possible.
Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.
However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. Attempts to compress data that has been compressed already will therefore usually (text files usually can be compressed more after being compressed, due to fewer symbols), result in an expansion, as will attempts to compress all but the most trivially encrypted data.
In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, like for example always removing the last byte of a file, will always compress a file up to the point where it is empty.
An example of lossless vs. lossy compression is the following string:
25.888888888
This string can be compressed as:
25.[9]8
Interpreted as, "twenty five point 9 eights", the original string is perfectly recreated, just written in a smaller form. In a lossy system, using
26
instead, the original data is lost, at the benefit of a smaller file size.

Lossless compression algorithms translate each possible input into distinct outputs, in such a way that more common inputs translate to shorter outputs. It's mathematically impossible for all possible inputs to be compressed -- otherwise, you'd have multiple inputs A and B compressing to the same form, so when you decompress it, do you get back to A or back to B? In practice, most useful information has some redundancy and this redundancy fits certain patterns; hence the data can usefully be compressed because the cases that expand when you compress them don't naturally arise.
Lossy compression, for example, that used in JPEG or MP3 compression, works by approximating the input data by some signal that can be expressed in fewer bits than the original. When you decompress it, you don't get the original, but you usually get something close enough.

In VERY simple terms, a common form of compression is a http://en.wikipedia.org/wiki/Dictionary_coder. This involves replacing longer repeated strings with shorter ones.
For example if you have a file that looks like this:
"Monday Night","Baseball","7:00pm"
"Tuesday Night","Baseball","7:00pm"
"Monday Night","Softball","8:00pm"
"Monday Night","Softball","8:00pm"
"Monday Night","Baseball","5:00pm"
It would be roughly 150 characters, but if you where to do a simple substitution as follows:
A="Monday Night",B="Tuesday Night",C="Baseball",D="Softball",E="7:00pm",F="8:00pm",G=5:00pm"
Then the same content could be encoded as:
A,C,E
B,C,E
A,D,F
A,D,F
A,C,G
Using on 25 characters! A clever observer could also see how to easily reduce this further to 15 characters if we assumed some more things about the format of the file. Obviously there is the overhead of the substitution key, but often very large files have a lot of these substitutions. This can be a very efficient way to compress large files or data structures and still allow them to be "somewhat" human readable.

Rosetta Code has an entry on Huffman Coding, as does an earlier blog entry of mine.

Related

Sequence of binary input file for NIST

I've developed an RNG program, for now, I want to test whether my numbers are random. Thus, I decided to use NIST Test Suite.
I'm still confused about the input file format, they say: "The user may want to construct as many files of arbitrary length as desired. Files should contain binary sequences stored as either ASCII characters consisting of zeroes and ones, or
as binary data where each byte contains eight bits worth of 0’s and 1’s"
My python RNG program will return a sequence of numbers line by line as:
69
11
68
55
33
20
75
96
How can I convert them to the proper input file for NIST?
Your first random number is 69, which is 1000101 in binary. You can either put that in your test file as the ASCII string "1000101" or as seven bits in a binary file 1000101... The ASCII option is probably easier, but the file will be eight times the size. With either case you might have to be careful with leading zeros in binary, I am not sure what NIST wants without reading a lot more of SP 800-22 than I currently have time for.
The appropriate input for NIST would be a set of binary sequences. You can simply convert your integers into binary numbers, write them to a file and use the file as input for the NIST program. Converting to binary and feeding them to NIST does not necessarily mean your RNG will pass all the tests there. Let's try to answer the following question.
How many bits do you produce for each integer?
For example, let's say your RNG generates integers between 0 to 5 (uniform distribution, all values are equiprobable). As representing 5 would require at least 3 bits, we will use 3 bits for each integer.
0: 000
1: 001
2: 010
3: 011
4: 100
5: 101
Look at the first (most significant) bit for each of the numbers. Four of them are 0, and the remaining two are 1. So, whenever you pick a random integer from 0 to 5, the probability of the first bit being 0 is higher than it being 1. Remember that, for an RNG, we need p(0)=p(1)=0.5 for each of the bits.
Now, if an RNG produced values from 0 to 7 (uniformly), we could convert each of them to 3 bits and maintain p(0)=p(1)=0.5 at all the indices. Why is that? Because we have all 23 different values (i.e., 0 to 23-1), it does not face any bias at any index (equal number of zeros and ones).
The above discussion leads us to the conclusion that, if you have integer values coming out of an RNG and they range from 0 to 2n-1, and each of them are equiprobable, you can convert them to n bits and concatenate them for NIST evaluation. If those conditions do not hold (e.g., the number of outcomes is not a power of 2), one way is to settle for the maximum power of 2 that can be accommodated in the output range of your RNG and discard the rest of the values.

What would be the contents of canonical Huffman encoded bit stream?

Let's consider the following example with symbol- code length - canonical code data.
A - 2 - 00
B - 2 - 01
D - 2 - 10
C - 3 - 110
E - 3 - 111
I was wondering what would be the contents of encoded bit stream? Is it 00 01 10 110 111 (basically all codes) or 2,2,2,3,3 in binary equivalent as corresponding code lengths? I wanted to add here that some resources say just transmit code as encoded bit stream and few other resources talk about throwing code away from encoded bit stream and transmit only code length data.
Encoded bitstream
The code is:
00 01 10 110 111
Note that if we sent the code of 2,2,2,3,3, then it would be impossible to decide if the input was AAACC or BBBEE (or many other equivalent choices).
Because Huffman codes are a prefix code it means that we can unambiguously decode the bitstream despite not knowing where the spaces are.
In other words, when given the output 000110110111, we can uniquely decode it as ABDCE.
Transmitting code table
I think the confusion may be because you need to possess two things to decode the bitstream:
The coded bitstream
The lookup table
These two things are often coded in very different ways.
In many cases the lookup table is fixed in advance so does not need to be transmitted.
However, if the probabilities can change, then we need to tell the recipient what code table to use. In this case we can just transmit the lengths of each code word and this gives enough information for the receiver to construct the canonical Huffman code. Alternatives are also possible, for example we can send the number of each code word length followed by the values. This alternative is used by JPEG and explained more below.
Example
The JPEG image codec uses Huffman tables. Normally some default tables are used, but it is possible to optimize the size of images by transmitting a custom Huffman code. A tutorial about this is here.
Another description of the way of transmitting the Huffman table is here. The code lengths are sent (as bytes) followed by the code values (again as bytes).
Code to read it (taken from the link) is:
// Next sixteen bytes are the counts for each code length
u8 counts[16];
for (i = 0; i < 16; i++) {
counts[i] = fgetc(fp);
ctr++;
}
// Remaining bytes are the data values to be mapped
// Build the Huffman map of (length, code) -> value
for (i = 0; i < 16; i++) {
for (j = 0; j < counts[i]; j++) {
huffData[table][huffKey(i + 1, code)] = fgetc(fp);
code++;
ctr++;
}
code <<= 1;
}
What you are asking is how to send a description of the code to the receiver, so that the receiver knows how to decode the following code values.
There are many ways of varying levels of sophistication, depending on how much effort you want to put into compressing the description of the code. Peter de Rivaz describes a simple approach used by JPEG, which is to send 16 counts of the number of codes of each length, followed by the byte values of each of those symbols. So for your code that would be (in hex):
00 03 02 00 00 00 00 00 00 00 00 00 00 00 00 00 41 42 43 44 45
That's not terribly compact, and it can't represent one of the possible codes, which is 256 8-bit codes, since you are limited to a count of 255 for each length.
The first thing you can do is cut off the code lengths when you have a complete code. It is easy to calculate how many code patterns are left, in which case you can simply end it when there are none left. Follow that with the symbols. You then have:
00 03 02 41 42 43 44 45
We don't need eight bits for each count, since they are limited by the constraints on those counts. For example, you can't have more than two one-bit codes. So we could code these in fewer bits, e.g. n+1 bits for n codes. So two bits, three, bits, and so on until the code is complete. For your code, now in binary:
00 011 0010
followed by the bytes 41 42 43 44 45, offset in the bit stream appropriately. Now the list of counts takes nine bits instead of 24. Since we know that there can only be 256 symbols, we can cap off the number of bits for each count at nine, allowing for the count 256, solving the previous problem of not being able to represent the flat code. Then if the code is limited to 16 bits in length (as it is for JPEG), the largest number of bytes needed for the counts is 14.5, less than the original 16. Often the counts will end before 14.5 bytes.
You can get even more sophisticated, noting that at each code length, you have a limit on the possible count of codes of that length due to the shorter code lengths using up patterns. Then the number of bits for each count can be variable, based on how many possible values there are. Then the counts description would be:
00 011 10, then the eight-bit values 41 42 43 44 45
Since we have no preceding patterns used up for lengths one and two, those still need to be two and three bits respectively. However we now have only three possibilities left for length three: the counts 0, 1, or 2. A count of 3 would oversubscribe the code. So we can use two bits for that last one. It is now seven bits instead of nine, and this greatly reduces the number of bits in the counts for codes that use longer code lengths.
An entirely different scheme is the one used by the deflate format (used in zip, gzip, zlib, png, etc.). There the number of code lengths to follow is sent first, followed by the code length of each symbol in order up to the last one. The symbols themselves are implied by the code length location. That results in lots of zeros, to represent symbols that are not present. So for your code there would be a 70 to go up to symbol 69 ("E"), followed by 65 zeros, then 2 2 2 3 3. That seems awfully long, and it is. deflate then run-length and Huffman codes that list of lengths, to compress it. The long strings of zeros get compressed to a few bits, and the short lengths are also just a few bits each. So then you have to first send a description of the code lengths code lengths code (!) so that you can decode that.
You can read the deflate specification for more information on that scheme. brotli uses a similar scheme, with more sophistication still.

Lossless compression of an ordered series of 29 digits (each 0 to 5 Likert scale)

I have a survey with 29 questions, each with a 5-point Likert scale (0=None of the time; 4=Most of the time). I'd like to compress the total set of responses to a small number of alpha or alphanumeric characters, adding a check digit to the end.
So, the set of responses 00101244231023110242231421211 would get turned into something like A2CR7HW4. This output would be part of a printout that a non-techie user would enter on a website as a shortcut to entering the entire string. I'd want to avoid ambiguous characters, such as 0,O,D,I,l,5,S, leaving me with 21 or 22 characters to use (uppercase only). Alternatively, I could just stick with capital alpha only and use all 26 characters.
I'm thinking to convert each pair of digits to a letter (5^2=25, so the whole alphabet is adequate). That would reduce the sequence to 15 characters, which is still longish to type without errors.
Any other suggestions on how to minimize the length of the output?
EDIT: BTW, for context, the survey asks 29 questions about mental health symptoms, generating a predictive risk for 4 psychiatric conditions. Need a code representing all responses.
If the five answers are all equally likely, then the best you can do is ceiling(29 * log(5) / log(n)) symbols, where n is the number of symbols in your alphabet. (The base of the logarithm doesn't matter, so long as they're both the same.)
So for your 22 symbols, the best you can do is 16. For 26 symbols, the best is 15, as you described for 25. If you use 49 characters (e.g. some subset of the upper and lower case characters and the digits), you can get down to 12. The best you'll be able to do with printable ASCII characters would be 11, using 70 of the 94 characters.
The only way to make it smaller would be if the responses are not all equally likely and are heavily skewed. Though if that's the case, then there's probably something wrong with the survey.
First, choose a set of permissible characters, i.e.
characters = "ABC..."
Then, prefix the input-digits with a 1 and interpret it as a quinary number:
100101244231023110242231421211
Now, convert this quinary number to a number in base-"strlen(characters)", i.e. base26 if 26 characters are to be used:
02 23 18 12 10 24 04 19 00 15 14 20 00 03 17
Then, use these numbers as index in "characters", and you have your encoding:
CVSMKWETAPOUADR
For decoding, just reverse the steps.
Are you doing this in a specific language?
If you want to be really thrifty about it you might want to consider encoding the data at bit level.
Since there are only 5 possible answers per question you could do this with only 3 bits:
000
001
010
011
100
Your end result would be a string of bits, at 3-bits per answer so a total of 87 bits or 10 and a bit bytes.
EDIT - misread the question slightly, there are 5 possible answers not 4, my mistake.
The only problem now is that for 4 of your 5 answers you're wasting a bit...you ain't gonna benefit much from going to this much trouble I wouldn't say but it's worth considering.
EDIT:
I've been playing about with it and it's difficult to work out a mechanism that allows you to use both 2 and 3 bit values.
Since your output would be a 97 bit binary value you'd need ot be able make the distinction between 2 and 3 bits values when converting back to the original values.
If you're working with a larger number of values there are some methods you could use, like having a reserved bit for each values that can be used to sort of type a value and give it some meaning. But working with so few bits as it is, it's hard to shave anything off.
Your output at 97 bits could be padded out to 128 bits, which would give you 4 32-bit values if you wanted to simplify it. this 128 bit value would be like a unique fingerprint representing a specific set of answers. There are many ways you can represnt 128 bits.
But in the end borking at bit-level is about as good as it gets when it comes to actual compression and encoding of data...if you can express 5 unique values in less than 3 bits I'd be suitably impressed.

algorithm: gigantic number of very sparse bit arrays, which encoding to use

I've got a special need and the most important concerns are:
in-memory
very low memory footprint
speed
Here's my "problem": I need to store, in-memory, a huge number of very sparse bit arrays. Those bitsets are "append only" and are to be used mostly for intersections. By huge, I mean as high as 200 000 bit arrays.
The range shall be between [0...16 000 000] for each bitset.
I ran some pre-test with "only" 10 673 bit arrays containing some actual data I've got and got the following results:
1% of the bit arrays ( 106 bit arrays) Hamming weight: at most 1 bit set
5% of the bit arrays ( 534 bit arrays) Hamming weight: at most 4 bits set
10% of the bit arrays ( 1068 bit arrays) Hamming weight: at most 8 bits set
15% of the bit arrays ( 1603 bit arrays) Hamming weight: at most 12 bits set
20% of the bit arrays ( 2137 bit arrays) Hamming weight: at most 17 bits set
25% of the bit arrays ( 2671 bit arrays) Hamming weight: at most 22 bits set
30% of the bit arrays ( 3206 bit arrays) Hamming weight: at most 28 bits set
35% of the bit arrays ( 3740 bit arrays) Hamming weight: at most 35 bits set
40% of the bit arrays ( 4274 bit arrays) Hamming weight: at most 44 bits set
45% of the bit arrays ( 4809 bit arrays) Hamming weight: at most 55 bits set
50% of the bit arrays ( 5343 bit arrays) Hamming weight: at most 67 bits set
55% of the bit arrays ( 5877 bit arrays) Hamming weight: at most 83 bits set
60% of the bit arrays ( 6412 bit arrays) Hamming weight: at most 103 bits set
65% of the bit arrays ( 6946 bit arrays) Hamming weight: at most 128 bits set
70% of the bit arrays ( 7480 bit arrays) Hamming weight: at most 161 bits set
75% of the bit arrays ( 8015 bit arrays) Hamming weight: at most 206 bits set
80% of the bit arrays ( 8549 bit arrays) Hamming weight: at most 275 bits set
85% of the bit arrays ( 9083 bit arrays) Hamming weight: at most 395 bits set
90% of the bit arrays ( 9618 bit arrays) Hamming weight: at most 640 bits set
95% of the bit arrays (10152 bit arrays) Hamming weight: at most 1453 bits set
96% of the bit arrays (10259 bit arrays) Hamming weight: at most 1843 bits set
97% of the bit arrays (10366 bit arrays) Hamming weight: at most 2601 bits set
98% of the bit arrays (10473 bit arrays) Hamming weight: at most 3544 bits set
99% of the bit arrays (10580 bit arrays) Hamming weight: at most 4992 bits set
100% of the bit arrays (10687 bit arrays) Hamming weight: at most 53153 bits set
Seen the numbers involved, I obviously need to use compressed bit arrays and that is not an issue: it shall stay easy to deal with seen that the bit arrays are "append only".
The bit array bits that are on are kinda grouped, but not totally. So you'll tend to have several bits on in the same area (but usually not one after another, making RLE kinda not great for bits that are on).
My question is what kind of compression to use?
Now I don't know if I should put my first approach here or in an answer to my own question.
Basically I imagined a "worst case" scenario using a very dumb encoding:
1 bit: if on, the following 5 bits determine how many bits are needed to compute the 'skip', if off, optimization: the following 5 bits determine how many bits are too be taken literally (that is 'on' or 'off', without skipping) [this would only be switched to when determined to be more efficient than the other representation, so when it kicks in, it shall always be an optimization (size-wise)]
5 bits: how many bits we can skip before the next bit on
x bits: skip
Here's an example: a bit array has 3 bit set, the first bit being at 3 098 137, the second at 3 098 141 and the third at 3 098 143.
+-- now we won't skip
|
| +-- 3 because we need 3 bits to store "6" (from 3 098 138 to 3 098 143)
| | +--- 3 098 141 is on
22 3 098 137 | 3 | +- 3 098 143 is on
1 10110 1011110100011000011001 0 00011 000101 etc.
First bit on tells we're going to skip bits.
5 next bits (always 5) tells how many bits we need to tell how many bits we'll skip
22 bits telling to skip to 3 098 137
one bit off telling now we're not skipping bits
5 next bits (always 5) tells how many bits we'll read "as is"
6 bits: off, off, off, on, off, on meaning 3 098 141 and 3 098 143 are on
etc.
Seen the amazing sparsity of these bit arrays, this seems quite size-efficient.
So using that encoding, I took my sample data and computed a "worst case" scenario (I haven't written the algo yet, I'd rather have a few from here inputs first): basically I considered that not only the "size optimization" would never kick in and, also, that the 5 bits would always be set to their maximum value (24 bits), which of course cannot happen.
I did it just to have a very crude approximation of what the "worst of the worst" case could be.
I was very pleasantly surprised:
Worst case scenario:
108 913 290 bits needed for the 10 687 very sparse bit arrays
12.9 MB (13 295 KB)
The data being actual data and all the data being similar, I know that, if worse comes to worse, I could store my 200 000 bit arrays in about 240 MB, which is fine.
I'm pretty sure the actual encoding will comes to way less than that but as I haven't actually written it yet, I can only (very easily) compute the "worst case" which is why I only show that one.
Any hints / ideas as to how to make this more size-efficient (remembering these are super-sparse bit arrays, that there shall be hundreds thousands of them, that they must be in memory, and that they shall be "append only")?
About my 'append-only' case
Basically I've got one growing "expanse" (the range, but "expanse" is the actual term as I understand it) and a lot of bit arrays that have a few bit sets. When the range goes from, say, 0 to 1 000 000, all the bit arrays goes from 0 to 1 000 000 to. When the range grows to 1 000 001, then all the bit arrays are growing too, all by one bit. But most of these bit arrays will have a '0' appended at their end, while about 4 to 8 of the bit arrays will have a '1' appended at their end. However I cannot predict in advance which of the bit arrays will have a 0 or a 1 appended.
So I've got a lot of bit arrays that have all the same size, that are all very sparse (< 0.5% of their bits set) and that are all "growing" as the range growth (so they're all always growing at the same rate).
Judy arrays are great. But I read about them a few years ago and that stuff was "above my head". Judy arrays are a C-only 20KLOC lib and I'm definitely not re-implementing that. But they're amazing.
So I guess I need to add I'd like all this to stay relatively simple, which is not that far-fetched seen the special "append only" property of my very sparse bit arrays.
You didn't say what programming language you want to use. It sounds like you don't want Judy because it's "C-only"... if you are using C# then you could use my Compact Patricia Trie instead. Is is almost 4500 LOC (commented) and uses similar ideas to Judy, but the size and speed of each trie are not ideal due to limitations of .NET. It is not optimized for computing intersections either, but such an algorithm could be added. The article about CP Tries does not emphasize this point, but it can store sets (sparse bit arrays) much more compactly than dictionaries (the graphs in the article show the size and speed of dictionaries, not sets).
The best case is a dense cluster of bits. With 50% occupancy (every other bit set), it requires less than 8 bits per key (less than 4 bits per integer). (correction: less than 8 bits, not more.)
If you only need an approximate representation of the data, use a Bloom filter.
By the way, what do you mean by "append only"? Does it mean that you only add keys, or that each key you add is greater than the keys you added before?
Update: Since you are only adding larger keys, you should probably design a special algorithm just for your case. IMO, when designing a custom algorithm, you should make it as simple as possible. So here's my idea, which assumes the keys of different bitsets are uncorrelated (therefore there is no benefit of attempting to compress data between different bitsets):
A bitset is represented by a sorted array of 32-bit slots. Because it's sorted, you can use binary search to find keys. Each slot consists of a 24-bit "prefix" and 8 bits of "flags". Each slot represents a region of 8 keys. The "flags" tell you which of the 8 keys in the region are present in the bitset, and the "prefix" tells you which region we're talking about, by specifying bits 3 to 26 of the key. For example, if the following bits are "1" in the bitset:
1, 3, 4, 1094, 8001, 8002, 8007, 8009
...then the bitset is represented by an array of 4 slots (16 bytes):
Prefix: 0, 136, 1000, 1001
Flags: 0x15, 0x40, 0x86, 0x02
The first slot represents 1, 3, 4 (notice that bits 1, 3 and 4 are set in the number 0x15); the second slot represents 1094 (136 * 8 + 6); the third slot represents 8001, 8002, and 8007; the fourth slot represents 8009. Does this make sense?
I don't know if this is as compact as your idea. But I think you'll get faster queries and faster modifications, and it will be fairly easy to implement.
You may use binary tree for bit array.
Say, you have array with range of [M..N].
Store it in such a manner:
Choose some number encoding for [0...ram size], like Fibonacci, Golomb or Rice code (you may choose most suitable representation after profiling your program with actual data).
If array is empty (have no bits set), store it as number 0.
If array is full (have all bits set), store it as number 1.
Else split it in two parts: A in [M..(M+N)/2-1] and B in [(M+N)/2..N]
Generate representations of P0 and P1 using this algorithm recursively.
Get length of P0 (in bits or other units length may be whole number of) and store it as a number (you may need to add 1 if length may be 1, e.g. you store 0 as single bit 0).
Store P0 then P1.
In this case, if limits are common, operations of intersection an union are trivial recursions:
Intersection:
If array A is empty, store 0.
If array A is full, store copy of B
Else split arrays, make intersections of both halves, store length of first half, then both halves.
This algorithm may deal with bits (if you need them to be most compact) and bytes/words (if bit operations are so slow).
Also you may add specical encodings for arrays with single bit set, all arrays with size less than some limit (8 elements for example) to decrease level of recursion.
Drawback is that without some hacks adding/removing element to/from array is complex operation (as complex as intersection/union operations).
For example, array with single 0xAB bit set should be stored in array of 0..0xFF as (pseudocode for):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1, EMPTY, 13, 1, EMPTY, 9, 1, EMPTY, 5, 1, EMPTY, 1, EMPTY, FULL, EMPTY, EMPTY, EMPTY
| AA | AB |
|A8..A9| AA .. AB |
| A8 .. AB |AC..AF|
|A0..A7| A8 .. AF |
| A0 .. AF |B0..BF|
|80..9F| A0 .. BF |
| 80 .. BF |C0..FF|
| 0..7F| 80 .. FF |
EMPTY and FULL are codes for empty and full arrays, numbers are lengths in elements (should be replaced with actual lengthts in bytes, bits or so)
Ff you do not need fast single bit check, you may use most simple approach:
Just store distances between set bits using codes: fibonacci, rice, golomb, levenshtein, elias etc. or invent another one.
Note, that in order to get minimal code length, you should use code with code lengths as close as possible to -log p/log 2, where p is probability of that code. You may use huffman code for that.
For example, use elias gamma code, so array like this:
0 1 0000 1 1 000 1 0 1 000000000000000000 1 000000000000000000
2 5 1 4 2 19 18 (distance)
Should be encoded as:
010 00101 1 00100 010 000010011 000010010
2 5 1 4 2 19 18 (distance code explained)
And mostly compact for array with uniform bits distribution would be arithmetic encoding, but it is very CPU time counsumpting. Becaouse you'll have to read and write such arrays bit by bit with no fast skipping available.
You may look into compressed bitmaps. A common strategy is to use word-aligned run-length encoding.
C++ implementation:
https://github.com/lemire/EWAHBoolArray
Java implementation:
https://github.com/lemire/javaewah
Reference:
Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010.
http://arxiv.org/abs/0901.3751
Even if they aren't exactly what you're looking for, it's worth checking out Judy trees. Judy is a heavily optimized library for ordered maps, and one configuration is specifically designed as a bitset rather than a map. I don't think intersection is one of the operations natively optimized for, though...
The general idea is to use a tree with a fixed number of address bits per level, and take advantage of the sparseness at each level. This results in quite good compression even in the worst case, and fast query performance as well. I believe an intersection operation would be relatively straightforward and potentially very fast.
At any rate, it's always a good idea to steal from the best!
Considering you are going to do a bunch of intersection tests anyway, maybe you should try storing all of the bitvectors in parallel. One sparse, 16M entry list. Each entry in that list contains a list of which of the 200k input bitvectors has a '1' at that location. It looks like you expect to have only about 5 bits set per input vector, or 1M total entries? Taking a straw-man linked list implementation for the toplevel and the buckets, and a worst case of no intersections at all (thus 1M buckets with 1 element each) you could store it all in 32MB.
You might be interested in Binary Decision Diagrams (BDD), and more precisely Zero-suppressed Binary Decision Diagram (ZBDD).
They are used to represent sets in a compressed way. Unlike other compressed forms, operations (such as set intersections, or insertions of elements - your "append only" thing?) work directly on the compressed form.

Decoding letters ('a' .. 'z') from a bit sequence without waste

I seek an algorithm that will let me represent an incoming sequence of bits as letters ('a' .. 'z' ), in a minimal matter such that the stream of bits can be regenerated from the letters, without ever holding the entire sequence in memory.
That is, given an external bit source (each read returns a practically random bit), and user input of a number of bits, I would like to print out the minimal number of characters that can represent those bits.
Ideally there should be a parameterization - how much memory versus maximum bits before some waste is necessary.
Efficiency Goal - The same number of characters as the base-26 representation of the bits.
Non-solutions:
If sufficient storage was present, store the entire sequence and use a big-integer MOD 26 operation.
Convert every 9 bits to 2 characters - This seems suboptimal, wasting 25% of information capacity of the letters output.
If you assign a different number of bits per letter, you should be able to exactly encode the bits in the twenty-six letters allowed without wasting any bits. (This is a lot like a Huffman code, only with a pre-built balanced tree.)
To encode bits into letters: Accumulate bits until you match exactly one of the bit codes in the lookup table. Output that letter, clear the bit buffer, and keep going.
To decode letters into bits: For each letter, output the bit sequence in the table.
Implementing in code is left as an exercise to the reader. (Or to me, if I get bored later.)
a 0000
b 0001
c 0010
d 0011
e 0100
f 0101
g 01100
h 01101
i 01110
j 01111
k 10000
l 10001
m 10010
n 10011
o 10100
p 10101
q 10110
r 10111
s 11000
t 11001
u 11010
v 11011
w 11100
x 11101
y 11110
z 11111
Convert each block of 47 bits to a base 26 number of 10 digits. This gives you more than 99.99% efficiency.
This method, as well as others like Huffman, needs a padding mechanism to support variable-length input. This introduces some inefficiency which is less significant with longer inputs.
At the end of the bit stream, append an extra 1 bit. This must be done in all cases, even when the length of the bit stream is a multiple of 47. Any high-order letters of "zero" value can be skipped in the last block of encoded output.
When decoding the letters, a truncated final block can be filled out with "zero" letters and converted to a 47-bit base 2 representation. The final 1 bit is not data, but marks the end of the bit stream.
Could Huffman coding be what you're looking for? It's a compression algorithm, which pretty much represents any information with a minimum of wasted bits.
Zero waste would be log_2(26) bits per letter. As pointed out earlier, you can get to 4.7 by reading 47 bits and converting them to 10 letters. However, you can get to 4.67 by converting every 14 bits into 3 characters. This has the advantage that it fits into an integer. If you have storage space and run time is important, you can create a lookup table with 17,576 entries mapping the possible 14 bits into 3 letters. Otherwise, you can do mod and div operations to compute the 3 letters.
number of letters number of bits bits/letter
1 4 4
2 9 4.5
3 14 4.67
4 18 4.5
5 23 4.6
6 28 4.67
7 32 4.57
8 37 4.63
9 42 4.67
10 47 4.7
Any solution you use is going to be space-inefficient because 26 is not a power of 2. As far as an algorithm goes, I'd rather use a lookup table than an on-the-fly calculation for each series of 9 bits. Your lookup table would 512 entries long.
If you want the binary footprint of each letter to have the same size, the optimal solution would be given by Arithmetic Encoding. However, it will not reach your goal of a mean representation of 4.5 bits/char. Given 26 different characters (not including space etc) 4.7 would be the best you can reach without using variable-length encoding (Huffman, for instance. See Jaegers's answer) or other compression algoritms.
A suboptimal, although simpler, solution could be to find a feasible number of characters to fit into a big integer. For instance, if you form a 32-bit integer out of every 6 charachter chunk (which is possible as 26^6 < 2^32), you use 5.33 bits/char. You can actually even fit 13 letters into a 64 bit integer (4.92 bits/char). This is quite close to the optimal solution, and still rather easy to implement. Using bigger ints than 64 bits can be tricky due to missing native support in many progamming languages.
If you want even better compression rates for text, you should definitely also look into dictionary-based compression algorithms, such as LZW or Deflate.

Resources