I seek an algorithm that will let me represent an incoming sequence of bits as letters ('a'..'z') in a minimal manner, such that the stream of bits can be regenerated from the letters, without ever holding the entire sequence in memory.
That is, given an external bit source (each read returns a practically random bit), and user input of a number of bits, I would like to print out the minimal number of characters that can represent those bits.
Ideally there should be a parameterization: how much memory is used versus how many bits can be handled before some waste becomes necessary.
Efficiency Goal - The same number of characters as the base-26 representation of the bits.
Non-solutions:
If sufficient storage were present, store the entire sequence and use a big-integer mod 26 operation.
Convert every 9 bits to 2 characters - This seems suboptimal: only 512 of the 676 possible two-letter pairs are used, wasting about 25% of the information capacity of the letters output.
If you assign a different number of bits per letter, you should be able to exactly encode the bits in the twenty-six letters allowed without wasting any bits. (This is a lot like a Huffman code, only with a pre-built balanced tree.)
To encode bits into letters: Accumulate bits until you match exactly one of the bit codes in the lookup table. Output that letter, clear the bit buffer, and keep going.
To decode letters into bits: For each letter, output the bit sequence in the table.
Implementing in code is left as an exercise to the reader (or to me, if I get bored later); a rough sketch follows the table.
a 0000
b 0001
c 0010
d 0011
e 0100
f 0101
g 01100
h 01101
i 01110
j 01111
k 10000
l 10001
m 10010
n 10011
o 10100
p 10101
q 10110
r 10111
s 11000
t 11001
u 11010
v 11011
w 11100
x 11101
y 11110
z 11111
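A rough sketch of those two loops in Python (names are mine; the bit source is assumed to be any iterable of 0/1 values):

# Sketch of the prefix-code scheme above. CODES is the table as listed;
# greedy matching works because no code word is a prefix of another.
CODES = {'a': '0000', 'b': '0001', 'c': '0010', 'd': '0011',
         'e': '0100', 'f': '0101', 'g': '01100', 'h': '01101',
         'i': '01110', 'j': '01111', 'k': '10000', 'l': '10001',
         'm': '10010', 'n': '10011', 'o': '10100', 'p': '10101',
         'q': '10110', 'r': '10111', 's': '11000', 't': '11001',
         'u': '11010', 'v': '11011', 'w': '11100', 'x': '11101',
         'y': '11110', 'z': '11111'}
DECODE = {bits: letter for letter, bits in CODES.items()}

def encode(bit_source):                  # bit_source: iterable of 0/1
    buf = ''
    for bit in bit_source:
        buf += str(bit)
        if buf in DECODE:                # matched exactly one code word
            yield DECODE[buf]
            buf = ''
    # a partial buffer at end-of-stream still needs a padding rule

def decode(letters):
    for letter in letters:
        yield from CODES[letter]         # emit the code word's bits

print(''.join(encode([0, 0, 0, 0, 0, 1, 1, 0, 0])))  # -> ag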
Convert each block of 47 bits to a base 26 number of 10 digits. This gives you more than 99.99% efficiency.
This method, as well as others like Huffman, needs a padding mechanism to support variable-length input. This introduces some inefficiency which is less significant with longer inputs.
At the end of the bit stream, append an extra 1 bit. This must be done in all cases, even when the length of the bit stream is a multiple of 47. Any high-order letters of "zero" value can be skipped in the last block of encoded output.
When decoding the letters, a truncated final block can be filled out with "zero" letters and converted to a 47-bit base 2 representation. The final 1 bit is not data, but marks the end of the bit stream.
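A sketch of how this could look in Python (my names, not the answerer's code; I read the scheme as treating the first stream bit of each 47-bit block as the low-order bit, so the appended 1-sentinel always sits above the data and the last block's high-order letters come out as "zero" letters):

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def to_letters(n):                       # 10 base-26 digits, high first
    s = ''
    for _ in range(10):
        s = ALPHABET[n % 26] + s
        n //= 26
    return s

def encode(bits):                        # bits: a str like "0110..."
    bits += '1'                          # sentinel, appended in all cases
    groups = []
    for i in range(0, len(bits), 47):
        block = bits[i:i + 47]
        # first stream bit = low-order bit, so the sentinel sits above
        # the data and the final block's value stays small
        n = sum(1 << j for j, b in enumerate(block) if b == '1')
        groups.append(to_letters(n))
    groups[-1] = groups[-1].lstrip('a')  # skip high-order "zero" letters
    return ''.join(groups)

def decode(text):
    bits = ''
    for i in range(0, len(text), 10):
        group = text[i:i + 10].rjust(10, 'a')   # refill skipped zeros
        n = 0
        for ch in group:
            n = n * 26 + ALPHABET.index(ch)
        bits += format(n, '047b')[::-1]         # back to stream bit order
    return bits[:bits.rindex('1')]       # drop the sentinel and padding

assert decode(encode('101')) == '101'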
Could Huffman coding be what you're looking for? It's a compression technique that represents information with close to the minimum possible number of bits when the symbol probabilities are known.
Zero waste would be log_2(26) ≈ 4.70 bits per letter. As pointed out earlier, you can get to 4.7 by reading 47 bits and converting them to 10 letters. Alternatively, you can get 4.67 by converting every 14 bits into 3 characters; this has the advantage that it fits into an integer. If you have storage space and run time is important, you can create a lookup table with 16,384 (2^14) entries mapping the possible 14-bit values into 3 letters. Otherwise, you can do mod and div operations to compute the 3 letters (see the sketch after the table below).
number of letters   number of bits   bits/letter
        1                  4             4
        2                  9             4.5
        3                 14             4.67
        4                 18             4.5
        5                 23             4.6
        6                 28             4.67
        7                 32             4.57
        8                 37             4.63
        9                 42             4.67
       10                 47             4.7
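A sketch of the div/mod variant mentioned above (Python, names mine):

# Every 14 bits -> 3 letters (2^14 = 16384 <= 26^3 = 17576) and back.
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def letters3(n):                         # 0 <= n < 16384
    a, r = divmod(n, 26 * 26)
    b, c = divmod(r, 26)
    return ALPHABET[a] + ALPHABET[b] + ALPHABET[c]

def bits14(s):                           # inverse: 3 letters -> 14-bit value
    n = 0
    for ch in s:
        n = n * 26 + ALPHABET.index(ch)
    return n

assert bits14(letters3(12345)) == 12345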
Any solution you use is going to be space-inefficient, because 26 is not a power of 2. As far as an algorithm goes, I'd rather use a lookup table than an on-the-fly calculation for each series of 9 bits. Your lookup table would be 512 entries long.
If you want the binary footprint of each letter to have the same size, the optimal solution would be given by arithmetic encoding. However, it will not reach your goal of a mean representation of 4.5 bits/char. Given 26 different characters (not including space etc.), 4.7 would be the best you can reach without using variable-length encoding (Huffman, for instance; see Jaegers's answer) or other compression algorithms.
A suboptimal, although simpler, solution could be to find a feasible number of characters to fit into a big integer. For instance, if you form a 32-bit integer out of every 6-character chunk (which is possible as 26^6 < 2^32), you use 5.33 bits/char. You can actually even fit 13 letters into a 64-bit integer (4.92 bits/char). This is quite close to the optimal solution, and still rather easy to implement. Using bigger ints than 64 bits can be tricky due to missing native support in many programming languages.
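A tiny sketch of the 64-bit variant (Python, names mine):

# 13 lowercase letters fit in a 64-bit integer, since 26^13 < 2^64.
def pack13(s):
    n = 0
    for ch in s:
        n = n * 26 + (ord(ch) - ord('a'))
    return n

def unpack13(n):
    out = ''
    for _ in range(13):
        n, d = divmod(n, 26)
        out = chr(ord('a') + d) + out
    return out

assert unpack13(pack13('thirteenchars')) == 'thirteenchars'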
If you want even better compression rates for text, you should definitely also look into dictionary-based compression algorithms, such as LZW or Deflate.
Related
I have a survey with 29 questions, each with a 5-point Likert scale (0=None of the time; 4=Most of the time). I'd like to compress the total set of responses to a small number of alpha or alphanumeric characters, adding a check digit to the end.
So, the set of responses 00101244231023110242231421211 would get turned into something like A2CR7HW4. This output would be part of a printout that a non-techie user would enter on a website as a shortcut to entering the entire string. I'd want to avoid ambiguous characters, such as 0,O,D,I,l,5,S, leaving me with 21 or 22 characters to use (uppercase only). Alternatively, I could just stick with capital alpha only and use all 26 characters.
I'm thinking to convert each pair of digits to a letter (5^2=25, so the whole alphabet is adequate). That would reduce the sequence to 15 characters, which is still longish to type without errors.
Any other suggestions on how to minimize the length of the output?
EDIT: BTW, for context, the survey asks 29 questions about mental health symptoms, generating a predictive risk for 4 psychiatric conditions. Need a code representing all responses.
If the five answers are all equally likely, then the best you can do is ceiling(29 * log(5) / log(n)) symbols, where n is the number of symbols in your alphabet. (The base of the logarithm doesn't matter, so long as they're both the same.)
So for your 22 symbols, the best you can do is 16. For 26 symbols, the best is 15, as you described for 25. If you use 49 characters (e.g. some subset of the upper and lower case characters and the digits), you can get down to 12. The best you'll be able to do with printable ASCII characters would be 11, using 70 of the 94 characters.
The only way to make it smaller would be if the responses are not all equally likely and are heavily skewed. Though if that's the case, then there's probably something wrong with the survey.
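Checking those counts (Python):

from math import ceil, log
for n in (22, 25, 26, 49, 70):
    print(n, ceil(29 * log(5) / log(n)))   # -> 16, 15, 15, 12, 11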
First, choose a set of permissible characters, i.e.
characters = "ABC..."
Then, prefix the input digits with a 1 (so that leading zeros survive the conversion) and interpret the result as a quinary (base-5) number:
100101244231023110242231421211
Now, convert this quinary number to a number in base-"strlen(characters)", i.e. base26 if 26 characters are to be used:
02 23 18 12 10 24 04 19 00 15 14 20 00 03 17
Then, use these numbers as index in "characters", and you have your encoding:
CXSMKYETAPOUADR
For decoding, just reverse the steps.
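A sketch of the whole procedure in Python (names mine; the leading 1 is what keeps leading-zero responses from being lost):

def encode(responses, characters='ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
    n = int('1' + responses, 5)          # prefix 1, read as base 5
    base, out = len(characters), ''
    while n:
        n, d = divmod(n, base)
        out = characters[d] + out        # re-emit in the target base
    return out

def decode(code, characters='ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
    n = 0
    for ch in code:
        n = n * len(characters) + characters.index(ch)
    digits = ''
    while n:
        n, d = divmod(n, 5)
        digits = str(d) + digits
    return digits[1:]                    # drop the leading 1

print(encode('00101244231023110242231421211'))   # -> CXSMKYETAPOUADR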
Are you doing this in a specific language?
If you want to be really thrifty about it you might want to consider encoding the data at bit level.
Since there are only 5 possible answers per question you could do this with only 3 bits:
000
001
010
011
100
Your end result would be a string of bits, at 3-bits per answer so a total of 87 bits or 10 and a bit bytes.
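A minimal pack/unpack sketch of that layout (Python, names mine):

def pack(answers):                       # answers: ints in 0..4
    n = 0
    for i, a in enumerate(answers):
        n |= a << (3 * i)                # 3 bits per answer
    return n                             # 29 answers -> 87 bits

def unpack(n, count=29):
    return [(n >> (3 * i)) & 0b111 for i in range(count)]

assert unpack(pack([0, 4, 2] + [0] * 26))[:3] == [0, 4, 2]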
EDIT - misread the question slightly, there are 5 possible answers not 4, my mistake.
The only problem now is that for 4 of your 5 answers you're wasting a bit... I wouldn't say you're going to benefit much from going to this much trouble, but it's worth considering.
EDIT:
I've been playing about with it and it's difficult to work out a mechanism that allows you to use both 2 and 3 bit values.
Since your output would be an 87-bit binary value, you'd need to be able to make the distinction between 2- and 3-bit values when converting back to the original values.
If you're working with a larger number of values, there are some methods you could use, like having a reserved bit for each value that can be used to sort of type a value and give it some meaning. But working with so few bits as it is, it's hard to shave anything off.
Your output at 87 bits could be padded out to 128 bits, which would give you 4 32-bit values if you wanted to simplify it. This 128-bit value would be like a unique fingerprint representing a specific set of answers. There are many ways you can represent 128 bits.
But in the end, working at bit level is about as good as it gets when it comes to actual compression and encoding of data... if you can express 5 unique values in less than 3 bits, I'd be suitably impressed.
I was wondering if there's a way to compress 20 or so large numbers (~10^8) into a string of a reasonable length. For instance, if the numbers were stored as hex and concatenated, it'd be at least 160 characters long. I wonder if there's a smart way to compress the numbers in and get them back out. I was thinking about having a sequence 0-9 as reference and letting one part of the input string be a number <1024. That number is to be converted to binary, which serves as a mask, i.e. indicating which digits exist in the number. It's still not clear where to go from here.
Are there any better alternatives?
Thanks
If these large numbers are of the same size in bytes, and if you always know the count of those numbers, there is an easy way to do it. You simply have an array of your bytes, and instead of reading them out as integers, you read them out as characters. Are you trying to obfuscate your values or just pack them to be easily transferred?
When I'm compacting a lot of values into one reversible String, I usually go with base 64 conversion. This can really cut off quite a lot of the length from a String, but note that it may take up just as much memory in representing it.
Example
This number in decimal:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
is the following in Base 64:
Yki8xQRRVqd403ldXJUT8Ungkh/A3Th2TMtNlpwLPYVgct2eE8MAn0bs4o/fv1bmo4oUNQa/9WtZ8gRE7IG+UHX+LniaQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
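One way to produce such a Base64 form in Python (a sketch; the test number is mine, not byte-for-byte the one above):

import base64

def int_to_b64(n):
    raw = n.to_bytes((n.bit_length() + 7) // 8, 'big')
    return base64.b64encode(raw).decode('ascii')

def b64_to_int(s):
    return int.from_bytes(base64.b64decode(s), 'big')

n = 10 ** 240                            # a similarly huge number
assert b64_to_int(int_to_b64(n)) == n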
Why you can't do this to an extreme level
Think about it for a second. Let's say you've got a number 10 bits long, and you want to represent that number with 5 bits: a 50% compression scheme. First, we work out how many possible values you can represent with 10 bits, which is...
2^10 = 1024
Okay, that's fine. How many values can we express with 5 bits:
2^5 = 32
So, you can only display 32 different numbers with 5 bits, whereas you can display 1024 numbers with 10 bits. For compression to work, there needs to be some mapping between the compressed value and the extracted value. Let's try and make that mapping happen..
Normal - Compressed
0 0
1 1
2 2
.. ...
31 31
32 ??
33 ??
34 ??
... ...
1023 ??
There is no mapping for most of the numbers that can be represented by the expanded value.
This is known as the Pigeonhole Principle: in this example our value for n is greater than our value for m, hence we would need to map some compressed values to more than one normal value, which makes unambiguous decompression impossible. (Thank you Oli for reminding me.)
You need to be much more descriptive about what you mean by "string" and "~10^8". Can your "string" contain any sequence of bytes? Or is it restricted to a subset of possible bytes? If so, how exactly is it restricted? What are the limits on your "large numbers"? What do they represent?
Numbers up to 10^8 can be represented in 27 bits. 20 of them would be 540 bits, which could be stored in a string of 68 bytes, if any sequence of bytes is permitted. If the contents of a string are limited, it will take more bits. If your range of numbers is larger, it will take more bits.
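A sketch of that packing (Python; the exact layout, 27 bits per number big-endian, is my choice):

MASK27 = (1 << 27) - 1                   # numbers below 10^8 fit in 27 bits

def pack(nums):                          # 20 numbers -> 540 bits -> 68 bytes
    big = 0
    for n in nums:
        big = (big << 27) | n
    return big.to_bytes(68, 'big')

def unpack(data, count=20):
    big = int.from_bytes(data, 'big')
    out = [0] * count
    for i in range(count - 1, -1, -1):   # peel off the low 27 bits at a time
        out[i] = big & MASK27
        big >>= 27
    return out

nums = [99999999, 0, 12345678] + [1] * 17
assert unpack(pack(nums)) == nums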
store all numbers as strings to a marisa trie: https://code.google.com/p/marisa-trie/
Base64 the resulting trie dictionary
It depends of course a lot on your input. But it is a possibility to build a (very) compact representation this way.
I want to pack two 32-bit numbers into a 12-character string.
example
123 456 -> abcdef000000
The choice [0-9a-z] gives 36 symbols. With 12 such symbols, you get 36^12 = 4.7e18 different strings.
Three 32-bit numbers are the same as one 96-bit number, which has 2^96 = 7.9e28 different values.
There are more values than you can represent with that alphabet in 12 positions, so no, you cannot achieve the packing you want.
If you use [0-9a-zA-Z] you have 62 symbols. 62^12 = 3.2e21 - still short.
Given a number of symbols S and string length L, you can figure out how many bits of info you can store by solving S^L >= 2^B. For your case, B = 96. Let's use the Base64 alphabet, so S = 64 (this also makes the math easier). Then:
64^L >= 2^96
take log2 of both sides
L * log2(64) >= 96 * log2(2)
L * 6 >= 96
L >= 16
So with a Base64 alphabet, you can store 96 bits in 16 symbols.
Assuming you mean 3 numbers:
Whether you can get them in the string depends on the language. You can't have ANY limitations on the character set, though. A 32 bit number is 4 bytes, three of them make 12 bytes. Zeros are a possibility--any language that won't let you store a zero byte in the string won't allow this. However, there are languages that don't care--Delphi comes to mind. It stores strings with length codes rather than terminators and thus you can put anything in a string.
Assuming you mean 2 numbers:
[0-9][A-Z] is 36 symbols. 36^6 < 2^32, it won't work.
[0-9][A-Z][a-z] is 62 symbols. 62^6 > 2^32, you can do it.
On the other hand, if you're talking strings that hold wide characters of some type even the three numbers can easily be packed in, although you'll have to use some non-ASCII characters to do the job. Pick any 256 symbols and you're good to go. You can even pick 2048 symbols (you'll find more than that if you look at the east Asian languages) and pack each number into three characters.
You are essentially wanting to pack three 4-byte numbers into one 12-byte string. Yes, you can do this, but you will have to use all 256 ANSI symbols (some of which are unprintable, like #13, #7, #10), and it won't even be a "pack": the first 4 bytes will hold the first number, the second 4 bytes the second, and the third 4 bytes the third.
If you want to pack TWO 4-byte numbers into a READABLE (printable) string of length 12, you need to be able to stuff 4 bytes into 6 "digits" of a string; thankfully 12 is divisible by 2, so you don't have to play with uneven pack lengths. Solving 256^4 <= X^6 gives X as 41, meaning that if you use a Base64 alphabet (as in the comments to the previous answer) you are capable of having such a string.
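A sketch with a 64-symbol alphabet, 6 symbols per 32-bit number (Python, names mine):

SYMBOLS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

def enc32(n):                            # 0 <= n < 2^32 -> 6 symbols
    out = ''
    for _ in range(6):
        out = SYMBOLS[n % 64] + out
        n //= 64
    return out

def dec32(part):
    n = 0
    for ch in part:
        n = n * 64 + SYMBOLS.index(ch)
    return n

def pack(a, b):
    return enc32(a) + enc32(b)           # 12 printable characters

def unpack(s):
    return dec32(s[:6]), dec32(s[6:])

assert unpack(pack(123, 456)) == (123, 456)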
Could you please suggest an error detection scheme for detecting
one possible bit flip in the first 32 bytes of a 33-byte message using
no more than 8 bits of additional data?
Could Pearson hashing be a solution?
Detecting a single bit-flip in any message requires only one extra bit, independent of the length of the message: simply xor together all the bits in the message and tack that on the end. If any single bit flips, the parity bit at the end won't match up.
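A minimal sketch of that (Python):

def parity(msg: bytes) -> int:
    p = 0
    for byte in msg:
        p ^= byte                        # fold all bytes together
    return bin(p).count('1') & 1         # xor of every bit in the message

msg = bytearray(b'x' * 32)
p = parity(msg)
msg[5] ^= 0b00010000                     # flip a single bit
assert parity(msg) != p                  # the flip is detected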
If you're asking to detect which bit flipped, that can't be done, and a simple argument shows it: the extra eight bits can represent up to 256 classes of 32-byte messages, but the zero message and the 256 messages with one on bit each must all be in different classes. Thus, there are 257 messages which must be distinctly classified, and only 256 classes.
You can detect one bit flip with just one extra bit in any length message (as stated by @Daniel Wagner). The parity bit can, simply put, indicate whether the total number of 1-bits is odd or even. Obviously, if the number of flipped bits is even, then the parity check will not catch it, so you cannot detect 2-bit errors.
Now, for a more accessible understanding of why you can't error-correct 32 bytes (256 bits) with just 8 bits, please read about the Hamming code (like used in ECC memory). Such a scheme uses special error-correcting parity bits (henceforth called "EC parity") that only encode the parity of a subset of the total number of bits. For every 2^m - 1 total bits, you need to use m EC bits. These represent each possible different mask following the pattern "x bits on, x bits off" where x is a power of 2. Thus, the larger the number of bits at once, the better the data/parity bit ratio you get. For example, 7 total bits would allow encoding only 4 data bits after losing 3 EC bits, but 31 total bits can encode 26 data bits after losing 5 EC bits.
Now, to really understand this probably will take an example. Consider the following sets of masks. The first two rows are to be read top down, indicating the bit number (the "Most Significant Bit" I've labeled MSB):
MSB LSB
| |
v v
33222222 22221111 11111100 0000000|0
10987654 32109876 54321098 7654321|0
-------- -------- -------- -------|-
1: 10101010 10101010 10101010 1010101|0
2: 11001100 11001100 11001100 1100110|0
3: 11110000 11110000 11110000 1111000|0
4: 11111111 00000000 11111111 0000000|0
5: 11111111 11111111 00000000 0000000|0
The first thing to notice is that the binary values for 0 to 31 are represented in each column going from right to left (reading the bits in rows 1 through 5). This means that each vertical column is different from each other one (the important part). I put a vertical extra line between bit numbers 0 and 1 for a particular reason: Column 0 is useless because it has no bits set in it.
To perform error-correcting, we will bitwise-AND the received data bits against each EC bit's predefined mask, then compare the resulting parity to the EC bit. For any calculated parities discovered to not match, find the column in which only those bits are set. For example, if error-correcting bits 1, 4, and 5 are wrong when calculated from the received data value, then column #25--containing 1s in only those masks--must be the incorrect bit and can be corrected by flipping it. If only a single error-correcting bit is wrong, then the error is in that error-correcting bit. Here's an analogy to help you understand why this works:
There are 32 identical boxes, with one containing a marble. Your task is to locate the marble using just an old-style scale (the kind with two balanced platforms to compare the weights of different objects) and you are only allowed 5 weighing attempts. The solution is fairly easy: you put 16 boxes on each side of the scale and the heavier side indicates which side the marble is on. Discarding the 16 boxes on the lighter side, you then weigh 8 and 8 boxes keeping the heavier, then 4 and 4, then 2 and 2, and finally locate the marble by comparing the weights of the last 2 boxes 1 to 1: the heaviest box contains the marble. You have completed the task in only 5 weighings of 32, 16, 8, 4, and 2 boxes.
Similarly, our bit patterns have divided up the boxes in 5 different groups. Going backwards, the fifth EC bit determines whether an error is on the left side or the right side. In our scenario with bit #25, it is wrong, so we know that the error bit is on the left side of the group (bits 16-31). In our next mask for EC bit #4 (still stepping backward), we only consider bits 16-31, and we find that the "heavier" side is the left one again, so we have narrowed down the bits 24-31. Following the decision tree downward and cutting the number of possible columns in half each time, by the time we reach EC bit 1 there is only 1 possible bit left--our "marble in a box".
Note: The analogy is useful, though not perfect: 1-bits are not represented by marbles--the erroring bit location is represented by the marble.
Now, some playing around with these masks and thinking how to arrange things will reveal that there is a problem: If we try to make all 31 bits data bits, then we need 5 more bits for EC. But how, then, will we tell if the EC bits themselves are wrong? Just a single EC bit wrong will incorrectly tell us that some data bit needs correction, and we'll wrongly flip that data bit. The EC bits have to somehow encode for themselves! The solution is to position the parity bits inside of the data, in columns from the bit patterns above where only one bit is set. This way, any data bit being wrong will trigger two EC bits to be wrong, making it so that if only one EC bit is wrong, we know it is wrong itself instead of it signifying a data bit is wrong. The columns that satisfy the one-bit condition are 1, 2, 4, 8, and 16. The data bits will be interleaved between these starting at position 2. (Remember, we are not using position 0 as it would never provide any information--none of our EC bits would be set at all).
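A sketch of that positioning trick for the 31-bit code (Python, names mine). The check bits at positions 1, 2, 4, 8, 16 are chosen so that the XOR of the positions of all 1-bits is zero; the same XOR over a corrupted word is then exactly the flipped position:

def syndrome(word):                      # word[0] is the unused column 0
    s = 0
    for pos in range(1, len(word)):
        if word[pos]:
            s ^= pos
    return s

def hamming_encode(data_bits):           # up to 26 data bits -> 31-bit word
    word = [0] * 32
    slots = [p for p in range(1, 32) if p & (p - 1)]   # non-powers of 2
    for pos, bit in zip(slots, data_bits):
        word[pos] = bit
    s = syndrome(word)
    for m in range(5):
        if s & (1 << m):
            word[1 << m] = 1             # zero out the syndrome
    return word

def hamming_correct(word):
    s = syndrome(word)
    if s:
        word[s] ^= 1                     # s is exactly the flipped position
    return word

w = hamming_encode([1, 0, 1] + [0] * 23)
w[13] ^= 1                               # inject a single-bit error
assert hamming_correct(w) == hamming_encode([1, 0, 1] + [0] * 23)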
Finally, adding one more bit for overall parity will allow detecting 2-bit errors and reliably correcting 1-bit errors, as we can then compare the EC bits to it: if the EC bits say something is wrong, but the parity bit says otherwise, we know there are 2 bits wrong and cannot perform correction. We can use the discarded bit #0 as our parity bit! In fact, now we are encoding the following pattern:
0: 11111111 11111111 11111111 11111111
This gives us a final total of 6 Error-Checking and Correcting (ECC) bits. Extending the scheme of using different masks indefinitely looks like this:
32 bits - 6 ECC bits = 26 data
64 bits - 7 ECC bits = 57 data
128 bits - 8 ECC bits = 120 data
256 bits - 9 ECC bits = 247 data
512 bits - 10 ECC bits = 502 data
Now, if we are sure that we only will get a 1-bit error, we can dispense with the #0 parity bit, so we have the following:
31 bits - 5 ECC bits = 26 data
63 bits - 6 ECC bits = 57 data
127 bits - 7 ECC bits = 120 data
255 bits - 8 ECC bits = 247 data
511 bits - 9 ECC bits = 502 data
This is no change because we don't get any more data bits. Oops! 32 bytes (256 bits) as you requested cannot be error-corrected with a single byte, even if we know we can have only a 1-bit error at worst, and we know the ECC bits will be correct (allowing us to move them out of the data region and use them all for data). We need one more bit than we have: we must slide up to the next range, a 511-bit frame with 9 ECC bits, and then leave 246 of the possible 502 data bits unused to get our 256 data bits. Eight ECC bits can distinguish only the no-error case plus at most 255 error positions, exactly what Daniel told you.
Summary: You need 33 bytes + 1 bit to detect which bit flipped in the first 32 bytes.
Note: if you are going to send 64 bytes, then you're under the 32:1 ratio, as you can error-correct that in just 10 bits. But note that in real-world applications, the "frame size" of your ECC can't keep going up indefinitely, for a few reasons: 1) The number of bits being worked with at once may be much smaller than the frame size, leading to gross inefficiencies (think ECC RAM). 2) The chance of being able to accurately correct a bit gets less and less, since the larger the frame, the greater the chance it will have more errors, and 2 errors defeats error-correction ability, while 3 or more can defeat even error-detection ability. 3) Once an error is detected, the larger the frame size, the larger the size of the corrupted piece that must be retransmitted.
If you need to use a whole byte instead of a bit, and you only need to detect errors, then the standard solution is to use a cyclic redundancy check (CRC). There are several well-known 8-bit CRCs to choose from.
A typical fast implementation of a CRC uses a table with 256 entries to handle a byte of the message at a time. For the case of an 8 bit CRC this is a special case of Pearson's algorithm.
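For illustration, a table-driven CRC-8 sketch (Python; the polynomial 0x07 is my pick among the well-known 8-bit CRCs):

TABLE = []
for byte in range(256):                  # precompute one entry per byte value
    crc = byte
    for _ in range(8):
        crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    TABLE.append(crc)

def crc8(msg: bytes) -> int:
    crc = 0
    for b in msg:
        crc = TABLE[crc ^ b]             # one table lookup per message byte
    return crc

msg = bytearray(b'a' * 33)
check = crc8(msg[:32])                   # 8 bits covering the first 32 bytes
msg[3] ^= 1                              # flip one bit
assert crc8(msg[:32]) != check           # any single-bit error is detected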
I've got a special need and the most important concerns are:
in-memory
very low memory footprint
speed
Here's my "problem": I need to store, in-memory, a huge number of very sparse bit arrays. Those bitsets are "append only" and are to be used mostly for intersections. By huge, I mean as high as 200 000 bit arrays.
The range shall be between [0...16 000 000] for each bitset.
I ran some pre-tests with "only" 10 687 bit arrays containing some actual data I've got and got the following results:
1% of the bit arrays ( 106 bit arrays) Hamming weight: at most 1 bit set
5% of the bit arrays ( 534 bit arrays) Hamming weight: at most 4 bits set
10% of the bit arrays ( 1068 bit arrays) Hamming weight: at most 8 bits set
15% of the bit arrays ( 1603 bit arrays) Hamming weight: at most 12 bits set
20% of the bit arrays ( 2137 bit arrays) Hamming weight: at most 17 bits set
25% of the bit arrays ( 2671 bit arrays) Hamming weight: at most 22 bits set
30% of the bit arrays ( 3206 bit arrays) Hamming weight: at most 28 bits set
35% of the bit arrays ( 3740 bit arrays) Hamming weight: at most 35 bits set
40% of the bit arrays ( 4274 bit arrays) Hamming weight: at most 44 bits set
45% of the bit arrays ( 4809 bit arrays) Hamming weight: at most 55 bits set
50% of the bit arrays ( 5343 bit arrays) Hamming weight: at most 67 bits set
55% of the bit arrays ( 5877 bit arrays) Hamming weight: at most 83 bits set
60% of the bit arrays ( 6412 bit arrays) Hamming weight: at most 103 bits set
65% of the bit arrays ( 6946 bit arrays) Hamming weight: at most 128 bits set
70% of the bit arrays ( 7480 bit arrays) Hamming weight: at most 161 bits set
75% of the bit arrays ( 8015 bit arrays) Hamming weight: at most 206 bits set
80% of the bit arrays ( 8549 bit arrays) Hamming weight: at most 275 bits set
85% of the bit arrays ( 9083 bit arrays) Hamming weight: at most 395 bits set
90% of the bit arrays ( 9618 bit arrays) Hamming weight: at most 640 bits set
95% of the bit arrays (10152 bit arrays) Hamming weight: at most 1453 bits set
96% of the bit arrays (10259 bit arrays) Hamming weight: at most 1843 bits set
97% of the bit arrays (10366 bit arrays) Hamming weight: at most 2601 bits set
98% of the bit arrays (10473 bit arrays) Hamming weight: at most 3544 bits set
99% of the bit arrays (10580 bit arrays) Hamming weight: at most 4992 bits set
100% of the bit arrays (10687 bit arrays) Hamming weight: at most 53153 bits set
Given the numbers involved, I obviously need to use compressed bit arrays, and that is not an issue: they shall stay easy to deal with, given that the bit arrays are "append only".
The bit array bits that are on are kinda grouped, but not totally. So you'll tend to have several bits on in the same area (but usually not one after another, making RLE kinda not great for bits that are on).
My question is what kind of compression to use?
Now I don't know if I should put my first approach here or in an answer to my own question.
Basically I imagined a "worst case" scenario using a very dumb encoding:
1 bit: if on, the following 5 bits determine how many bits are needed to encode the 'skip'; if off, the following 5 bits determine how many bits are to be taken literally (that is, 'on' or 'off', without skipping). [The literal mode is an optimization: it would only be switched to when determined to be more efficient than the other representation, so when it kicks in, it shall always be an optimization (size-wise).]
5 bits: how many bits we can skip before the next bit on
x bits: skip
Here's an example: a bit array has 3 bits set, the first bit being at 3 098 137, the second at 3 098 141 and the third at 3 098 143.
+-- now we won't skip
|
| +-- 6 because six bits will be read literally (from 3 098 138 to 3 098 143)
| | +--- 3 098 141 is on
22 3 098 137 | 6 | +- 3 098 143 is on
1 10110 1011110100011000011001 0 00110 000101 etc.
First bit on tells we're going to skip bits.
5 next bits (always 5) tells how many bits we need to tell how many bits we'll skip
22 bits telling to skip to 3 098 137
one bit off telling now we're not skipping bits
5 next bits (always 5) tells how many bits we'll read "as is"
6 bits: off, off, off, on, off, on meaning 3 098 141 and 3 098 143 are on
etc.
Given the amazing sparsity of these bit arrays, this seems quite size-efficient.
So using that encoding, I took my sample data and computed a "worst case" scenario (I haven't written the algo yet, I'd rather have a few inputs from here first): basically I considered that not only would the "size optimization" never kick in, but also that the 5-bit fields would always be set to their maximum value (24, since 24 bits are enough to address the 16 000 000 range), which of course cannot happen.
I did it just to have a very crude approximation of what the "worst of the worst" case could be.
I was very pleasantly surprised:
Worst case scenario:
108 913 290 bits needed for the 10 687 very sparse bit arrays
12.9 MB (13 295 KB)
The data being actual data and all the data being similar, I know that, if worst comes to worst, I could store my 200 000 bit arrays in about 240 MB, which is fine.
I'm pretty sure the actual encoding will come to way less than that, but as I haven't actually written it yet, I can only (very easily) compute the "worst case", which is why I only show that one.
Any hints / ideas as to how to make this more size-efficient (remembering these are super-sparse bit arrays, that there shall be hundreds of thousands of them, that they must be in memory, and that they shall be "append only")?
About my 'append-only' case
Basically I've got one growing "expanse" (the range, but "expanse" is the actual term as I understand it) and a lot of bit arrays that each have a few bits set. When the range goes from, say, 0 to 1 000 000, all the bit arrays go from 0 to 1 000 000 too. When the range grows to 1 000 001, then all the bit arrays grow too, all by one bit. But most of these bit arrays will have a '0' appended at their end, while about 4 to 8 of the bit arrays will have a '1' appended at their end. However, I cannot predict in advance which of the bit arrays will have a 0 or a 1 appended.
So I've got a lot of bit arrays that all have the same size, that are all very sparse (< 0.5% of their bits set) and that are all "growing" as the range grows (so they're all always growing at the same rate).
Judy arrays are great. But I read about them a few years ago and that stuff was "above my head". Judy arrays are a C-only 20KLOC lib and I'm definitely not re-implementing that. But they're amazing.
So I guess I need to add that I'd like all this to stay relatively simple, which is not that far-fetched given the special "append only" property of my very sparse bit arrays.
You didn't say what programming language you want to use. It sounds like you don't want Judy because it's "C-only"... if you are using C# then you could use my Compact Patricia Trie instead. It is almost 4500 LOC (commented) and uses similar ideas to Judy, but the size and speed of each trie are not ideal due to limitations of .NET. It is not optimized for computing intersections either, but such an algorithm could be added. The article about CP Tries does not emphasize this point, but it can store sets (sparse bit arrays) much more compactly than dictionaries (the graphs in the article show the size and speed of dictionaries, not sets).
The best case is a dense cluster of bits: with 50% occupancy (every other bit set), it requires less than 8 bits per key (less than 4 bits per integer).
If you only need an approximate representation of the data, use a Bloom filter.
By the way, what do you mean by "append only"? Does it mean that you only add keys, or that each key you add is greater than the keys you added before?
Update: Since you are only adding larger keys, you should probably design a special algorithm just for your case. IMO, when designing a custom algorithm, you should make it as simple as possible. So here's my idea, which assumes the keys of different bitsets are uncorrelated (therefore there is no benefit of attempting to compress data between different bitsets):
A bitset is represented by a sorted array of 32-bit slots. Because it's sorted, you can use binary search to find keys. Each slot consists of a 24-bit "prefix" and 8 bits of "flags". Each slot represents a region of 8 keys. The "flags" tell you which of the 8 keys in the region are present in the bitset, and the "prefix" tells you which region we're talking about, by specifying bits 3 to 26 of the key. For example, if the following bits are "1" in the bitset:
1, 3, 4, 1094, 8001, 8002, 8007, 8009
...then the bitset is represented by an array of 4 slots (16 bytes):
Prefix: 0, 136, 1000, 1001
Flags: 0x1A, 0x40, 0x86, 0x02
The first slot represents 1, 3, 4 (notice that bits 1, 3 and 4 are set in the number 0x1A); the second slot represents 1094 (136 * 8 + 6); the third slot represents 8001, 8002, and 8007; the fourth slot represents 8009. Does this make sense?
I don't know if this is as compact as your idea. But I think you'll get faster queries and faster modifications, and it will be fairly easy to implement.
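A sketch of the slot idea (Python, names mine; it assumes keys are added in increasing order, per the append-only case):

import bisect

class SlotBitset:
    def __init__(self):
        self.prefixes = []               # sorted region numbers (key >> 3)
        self.flags = []                  # one byte of presence bits per region

    def add(self, key):                  # keys must arrive in increasing order
        prefix, bit = key >> 3, 1 << (key & 7)
        if self.prefixes and self.prefixes[-1] == prefix:
            self.flags[-1] |= bit        # same region as the last slot
        else:
            self.prefixes.append(prefix)
            self.flags.append(bit)

    def __contains__(self, key):
        i = bisect.bisect_left(self.prefixes, key >> 3)
        return (i < len(self.prefixes) and self.prefixes[i] == key >> 3
                and self.flags[i] & (1 << (key & 7)) != 0)

s = SlotBitset()
for k in (1, 3, 4, 1094, 8001, 8002, 8007, 8009):
    s.add(k)
assert 8007 in s and 8008 not in s
assert s.flags == [0x1A, 0x40, 0x86, 0x02]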
You may use a binary tree for the bit array.
Say you have an array with range [M..N].
Store it in this manner:
Choose some number encoding for [0...ram size], like Fibonacci, Golomb or Rice codes (you may choose the most suitable representation after profiling your program with actual data).
If the array is empty (has no bits set), store it as the number 0.
If the array is full (has all bits set), store it as the number 1.
Else split it in two parts: A in [M..(M+N)/2-1] and B in [(M+N)/2..N].
Generate the representations P0 and P1 of A and B using this algorithm recursively.
Get the length of P0 (in bits, or whatever unit gives whole-number lengths) and store it as a number (you may need to add 1 if the length can be 1, e.g. you store 0 as the single bit 0).
Store P0, then P1.
In this case, if the limits are common, the intersection and union operations are trivial recursions:
Intersection:
If array A is empty, store 0.
If array A is full, store a copy of B.
Else split both arrays, make intersections of both halves, then store the length of the first half, then both halves.
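A rough in-memory sketch of that recursion (Python; tuples stand in for the serialized form, and EMPTY/FULL are my markers):

EMPTY, FULL = 'empty', 'full'            # leaf markers for a whole subrange

def intersect(a, b):
    if a is EMPTY or b is EMPTY:
        return EMPTY
    if a is FULL:
        return b
    if b is FULL:
        return a
    left, right = intersect(a[0], b[0]), intersect(a[1], b[1])
    if left is EMPTY and right is EMPTY:
        return EMPTY                     # re-normalize collapsed halves
    if left is FULL and right is FULL:
        return FULL
    return (left, right)

a = (EMPTY, (FULL, EMPTY))               # [0..3] with only bit 2 set
b = (EMPTY, FULL)                        # bits 2 and 3 set
assert intersect(a, b) == a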
This algorithm can deal with bits (if you need them to be most compact) or with bytes/words (if bit operations are too slow).
You may also add special encodings for arrays with a single bit set, or for all arrays smaller than some limit (8 elements, for example), to decrease the recursion depth.
The drawback is that, without some hacks, adding/removing an element to/from the array is a complex operation (as complex as the intersection/union operations).
For example, an array covering 0..0xFF with only bit 0xAB set would be stored as (pseudocode):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1, EMPTY, 13, 1, EMPTY, 9, 1, EMPTY, 5, 1, EMPTY, 1, EMPTY, FULL, EMPTY, EMPTY, EMPTY
| AA | AB |
|A8..A9| AA .. AB |
| A8 .. AB |AC..AF|
|A0..A7| A8 .. AF |
| A0 .. AF |B0..BF|
|80..9F| A0 .. BF |
| 80 .. BF |C0..FF|
| 0..7F| 80 .. FF |
EMPTY and FULL are codes for empty and full arrays; the numbers are lengths in elements (they should be replaced with actual lengths in bytes, bits or so).
If you do not need fast single-bit checks, you may use the simplest approach:
Just store the distances between set bits using codes: Fibonacci, Rice, Golomb, Levenshtein, Elias, etc., or invent another one.
Note that in order to get minimal code length, you should use a code with code lengths as close as possible to -log2(p), where p is the probability of that code word. You may use a Huffman code for that.
For example, with the Elias gamma code, an array like this:
0 1 0000 1 1 000 1 0 1 000000000000000000 1 000000000000000000
2 5 1 4 2 19 18 (distance)
would be encoded as:
010 00101 1 00100 010 000010011 000010010
2 5 1 4 2 19 18 (distance code explained)
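A sketch of the gamma-coded gap encoding (Python, names mine; the first distance is counted from just before position 0, as in the example):

def gamma_encode(n):                     # Elias gamma code, n >= 1
    b = bin(n)[2:]                       # binary, no leading zeros
    return '0' * (len(b) - 1) + b        # length-1 zeros, then the value

def encode_gaps(positions):              # sorted positions of the set bits
    out, prev = '', -1
    for p in positions:
        out += gamma_encode(p - prev)    # distance to the next set bit
        prev = p
    return out

# the first five set bits of the array above sit at 1, 6, 7, 11, 13
# (gaps 2, 5, 1, 4, 2):
print(encode_gaps([1, 6, 7, 11, 13]))    # -> 010 00101 1 00100 010 (joined)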
The most compact encoding for an array with a uniform bit distribution would be arithmetic encoding, but it is very CPU-time-consuming, because you'll have to read and write such arrays bit by bit, with no fast skipping available.
You may look into compressed bitmaps. A common strategy is to use word-aligned run-length encoding.
C++ implementation:
https://github.com/lemire/EWAHBoolArray
Java implementation:
https://github.com/lemire/javaewah
Reference:
Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010.
http://arxiv.org/abs/0901.3751
Even if they aren't exactly what you're looking for, it's worth checking out Judy trees. Judy is a heavily optimized library for ordered maps, and one configuration is specifically designed as a bitset rather than a map. I don't think intersection is one of the operations natively optimized for, though...
The general idea is to use a tree with a fixed number of address bits per level, and take advantage of the sparseness at each level. This results in quite good compression even in the worst case, and fast query performance as well. I believe an intersection operation would be relatively straightforward and potentially very fast.
At any rate, it's always a good idea to steal from the best!
Considering you are going to do a bunch of intersection tests anyway, maybe you should try storing all of the bitvectors in parallel. One sparse, 16M entry list. Each entry in that list contains a list of which of the 200k input bitvectors has a '1' at that location. It looks like you expect to have only about 5 bits set per input vector, or 1M total entries? Taking a straw-man linked list implementation for the toplevel and the buckets, and a worst case of no intersections at all (thus 1M buckets with 1 element each) you could store it all in 32MB.
You might be interested in Binary Decision Diagrams (BDD), and more precisely Zero-suppressed Binary Decision Diagram (ZBDD).
They are used to represent sets in a compressed way. Unlike other compressed forms, operations (such as set intersections, or insertions of elements - your "append only" thing?) work directly on the compressed form.