Algorithm: gigantic number of very sparse bit arrays, which encoding to use?

I've got a special need and the most important concerns are:
in-memory
very low memory footprint
speed
Here's my "problem": I need to store, in-memory, a huge number of very sparse bit arrays. Those bitsets are "append only" and are to be used mostly for intersections. By huge, I mean as high as 200 000 bit arrays.
The range shall be between [0...16 000 000] for each bitset.
I ran a pre-test with "only" 10 673 bit arrays containing some actual data I've got, and got the following results:
1% of the bit arrays ( 106 bit arrays) Hamming weight: at most 1 bit set
5% of the bit arrays ( 534 bit arrays) Hamming weight: at most 4 bits set
10% of the bit arrays ( 1068 bit arrays) Hamming weight: at most 8 bits set
15% of the bit arrays ( 1603 bit arrays) Hamming weight: at most 12 bits set
20% of the bit arrays ( 2137 bit arrays) Hamming weight: at most 17 bits set
25% of the bit arrays ( 2671 bit arrays) Hamming weight: at most 22 bits set
30% of the bit arrays ( 3206 bit arrays) Hamming weight: at most 28 bits set
35% of the bit arrays ( 3740 bit arrays) Hamming weight: at most 35 bits set
40% of the bit arrays ( 4274 bit arrays) Hamming weight: at most 44 bits set
45% of the bit arrays ( 4809 bit arrays) Hamming weight: at most 55 bits set
50% of the bit arrays ( 5343 bit arrays) Hamming weight: at most 67 bits set
55% of the bit arrays ( 5877 bit arrays) Hamming weight: at most 83 bits set
60% of the bit arrays ( 6412 bit arrays) Hamming weight: at most 103 bits set
65% of the bit arrays ( 6946 bit arrays) Hamming weight: at most 128 bits set
70% of the bit arrays ( 7480 bit arrays) Hamming weight: at most 161 bits set
75% of the bit arrays ( 8015 bit arrays) Hamming weight: at most 206 bits set
80% of the bit arrays ( 8549 bit arrays) Hamming weight: at most 275 bits set
85% of the bit arrays ( 9083 bit arrays) Hamming weight: at most 395 bits set
90% of the bit arrays ( 9618 bit arrays) Hamming weight: at most 640 bits set
95% of the bit arrays (10152 bit arrays) Hamming weight: at most 1453 bits set
96% of the bit arrays (10259 bit arrays) Hamming weight: at most 1843 bits set
97% of the bit arrays (10366 bit arrays) Hamming weight: at most 2601 bits set
98% of the bit arrays (10473 bit arrays) Hamming weight: at most 3544 bits set
99% of the bit arrays (10580 bit arrays) Hamming weight: at most 4992 bits set
100% of the bit arrays (10687 bit arrays) Hamming weight: at most 53153 bits set
Given the numbers involved, I obviously need to use compressed bit arrays, and that is not an issue: it should stay easy to deal with, given that the bit arrays are "append only".
The bit array bits that are on are kinda grouped, but not totally. So you'll tend to have several bits on in the same area (but usually not one after another, making RLE kinda not great for bits that are on).
My question is what kind of compression to use?
Now I don't know if I should put my first approach here or in an answer to my own question.
Basically I imagined a "worst case" scenario using a very dumb encoding:
1 bit: if on, the following 5 bits determine how many bits are needed to encode the 'skip'; if off, optimization: the following 5 bits determine how many bits are to be taken literally (that is, 'on' or 'off', without skipping) [this mode would only be switched to when determined to be more efficient than the other representation, so when it kicks in, it shall always be a size optimization]
5 bits: how many bits are needed to encode the skip
x bits: the skip itself (how many bits to skip before the next bit that is on)
Here's an example: a bit array has 3 bits set, the first bit being at 3 098 137, the second at 3 098 141 and the third at 3 098 143.
1 10110 1011110100011000011001 0 00011 000101 etc.

1                      --> now we're going to skip
10110                  --> 22: the skip value needs 22 bits
1011110100011000011001 --> skip to 3 098 137 (that bit is on)
0                      --> now we won't skip
00011                  --> 3, because we need 3 bits to store "6" (from 3 098 138 to 3 098 143)
000101                 --> 3 098 141 and 3 098 143 are on
The first bit, on, tells us we're going to skip bits.
The next 5 bits (always 5) tell how many bits we need to say how many bits we'll skip.
22 bits telling to skip to 3 098 137.
One bit, off, telling that now we're not skipping bits.
The next 5 bits (always 5) tell how many bits we'll read "as is".
6 bits: off, off, off, on, off, on, meaning 3 098 141 and 3 098 143 are on.
etc.
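To make the format concrete, here's a minimal Python sketch of the skip-record part (names are mine; the literal mode and any cap on the 5-bit field are left out, so this is only the unoptimized variant):

def encode_skips(positions):
    # positions: sorted indices of the bits that are on
    out, prev = [], -1
    for p in positions:
        skip = p - prev - 1                        # bits to skip before the next on bit
        width = max(skip.bit_length(), 1)
        out.append('1')                            # flag: a skip record follows
        out.append(format(width, '05b'))           # 5 bits: width of the skip field
        out.append(format(skip, '0' + str(width) + 'b'))  # the skip itself
        prev = p
    return ''.join(out)

# encode_skips([3098137]) starts with '1', then '10110' (= 22), then the 22-bit skip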
Given the amazing sparsity of these bit arrays, this seems quite size-efficient.
So using that encoding, I took my sample data and computed a "worst case" scenario (I haven't written the algorithm yet, I'd rather have a few inputs from here first): basically I considered that not only would the "size optimization" never kick in but, also, that the 5-bit field would always be set to its maximum value (24 bits), which of course cannot happen.
I did it just to have a very crude approximation of what the "worst of the worst" case could be.
I was very pleasantly surprised:
Worst case scenario:
108 913 290 bits needed for the 10 687 very sparse bit arrays
12.9 MB (13 295 KB)
The data being actual data, and all the data being similar, I know that, if worst comes to worst, I could store my 200 000 bit arrays in about 240 MB, which is fine.
I'm pretty sure the actual encoding will come to way less than that, but as I haven't actually written it yet, I can only (very easily) compute the "worst case", which is why I only show that one.
Any hints / ideas as to how to make this more size-efficient (remembering these are super-sparse bit arrays, that there shall be hundreds of thousands of them, that they must be in memory, and that they shall be "append only")?
About my 'append-only' case
Basically I've got one growing "expanse" (the range, but "expanse" is the actual term as I understand it) and a lot of bit arrays that have a few bits set. When the range goes from, say, 0 to 1 000 000, all the bit arrays go from 0 to 1 000 000 too. When the range grows to 1 000 001, then all the bit arrays grow too, all by one bit. But most of these bit arrays will have a '0' appended at their end, while about 4 to 8 of the bit arrays will have a '1' appended at their end. However, I cannot predict in advance which of the bit arrays will have a 0 or a 1 appended.
So I've got a lot of bit arrays that all have the same size, that are all very sparse (< 0.5% of their bits set) and that are all "growing" as the range grows (so they're all always growing at the same rate).
Judy arrays are great. But I read about them a few years ago and that stuff was "above my head". Judy is a C-only, 20 KLOC library and I'm definitely not re-implementing that. But they're amazing.
So I guess I need to add that I'd like all this to stay relatively simple, which is not that far-fetched given the special "append only" property of my very sparse bit arrays.

You didn't say what programming language you want to use. It sounds like you don't want Judy because it's "C-only"... if you are using C#, then you could use my Compact Patricia Trie instead. It is almost 4500 LOC (commented) and uses similar ideas to Judy, but the size and speed of each trie are not ideal due to limitations of .NET. It is not optimized for computing intersections either, but such an algorithm could be added. The article about CP Tries does not emphasize this point, but it can store sets (sparse bit arrays) much more compactly than dictionaries (the graphs in the article show the size and speed of dictionaries, not sets).
The best case is a dense cluster of bits: with 50% occupancy (every other bit set), it requires less than 8 bits per key (less than 4 bits per integer).
If you only need an approximate representation of the data, use a Bloom filter.
By the way, what do you mean by "append only"? Does it mean that you only add keys, or that each key you add is greater than the keys you added before?
Update: Since you are only adding larger keys, you should probably design a special algorithm just for your case. IMO, when designing a custom algorithm, you should make it as simple as possible. So here's my idea, which assumes the keys of different bitsets are uncorrelated (therefore there is no benefit of attempting to compress data between different bitsets):
A bitset is represented by a sorted array of 32-bit slots. Because it's sorted, you can use binary search to find keys. Each slot consists of a 24-bit "prefix" and 8 bits of "flags". Each slot represents a region of 8 keys. The "flags" tell you which of the 8 keys in the region are present in the bitset, and the "prefix" tells you which region we're talking about, by specifying bits 3 to 26 of the key. For example, if the following bits are "1" in the bitset:
1, 3, 4, 1094, 8001, 8002, 8007, 8009
...then the bitset is represented by an array of 4 slots (16 bytes):
Prefix: 0, 136, 1000, 1001
Flags: 0x1A, 0x40, 0x86, 0x02
The first slot represents 1, 3, 4 (notice that bits 1, 3 and 4 are set in the number 0x1A); the second slot represents 1094 (136 * 8 + 6); the third slot represents 8001, 8002, and 8007; the fourth slot represents 8009. Does this make sense?
I don't know if this is as compact as your idea. But I think you'll get faster queries and faster modifications, and it will be fairly easy to implement.
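To make the slot scheme concrete, here is a minimal Python sketch (my names, not the CP Trie API): a sorted slot array with binary search for membership, plus a linear merge for intersection. Since your keys only grow, add() degenerates to appending at the end, which suits the "append only" constraint:

import bisect

class SlotBitset:
    def __init__(self):
        self.prefixes = []                 # sorted region numbers (key >> 3)
        self.flags = []                    # parallel 8-bit masks

    def add(self, key):
        prefix, bit = key >> 3, 1 << (key & 7)
        i = bisect.bisect_left(self.prefixes, prefix)
        if i < len(self.prefixes) and self.prefixes[i] == prefix:
            self.flags[i] |= bit           # region already present: set one more flag
        else:
            self.prefixes.insert(i, prefix)
            self.flags.insert(i, bit)

    def __contains__(self, key):
        i = bisect.bisect_left(self.prefixes, key >> 3)
        return (i < len(self.prefixes) and self.prefixes[i] == key >> 3
                and self.flags[i] & (1 << (key & 7)) != 0)

def intersect(a, b):
    out, i, j = SlotBitset(), 0, 0
    while i < len(a.prefixes) and j < len(b.prefixes):   # linear merge of sorted slots
        if a.prefixes[i] < b.prefixes[j]:
            i += 1
        elif a.prefixes[i] > b.prefixes[j]:
            j += 1
        else:
            common = a.flags[i] & b.flags[j]
            if common:
                out.prefixes.append(a.prefixes[i])
                out.flags.append(common)
            i += 1
            j += 1
    return out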

You may use a binary tree for the bit array.
Say you have an array with range [M..N].
Store it in such a manner:
Choose some number encoding for [0...ram size], like Fibonacci, Golomb or Rice codes (you can choose the most suitable representation after profiling your program with actual data).
If the array is empty (has no bits set), store it as the number 0.
If the array is full (has all bits set), store it as the number 1.
Else split it in two parts: A in [M..(M+N)/2-1] and B in [(M+N)/2..N].
Generate the representations P0 and P1 of A and B using this algorithm recursively.
Get the length of P0 (in bits, or in whatever unit gives whole-number lengths) and store it as a number (you may need to add 1, since the length may be 1, e.g. if you store 0 as the single bit 0).
Store P0, then P1.
In this case, if the limits are common, the operations of intersection and union are trivial recursions:
Intersection:
If array A is empty, store 0.
If array A is full, store a copy of B.
Else split both arrays, make intersections of both halves, store the length of the first half, then both halves.
This algorithm can deal with bits (if you need them to be most compact) and with bytes/words (if bit operations are too slow).
Also you may add special encodings for arrays with a single bit set, or for all arrays smaller than some limit (8 elements, for example), to decrease the depth of recursion.
The drawback is that, without some hacks, adding/removing an element to/from the array is a complex operation (as complex as the intersection/union operations).
For example, an array with the single bit 0xAB set should be stored in an array of [0..0xFF] as (pseudocode):
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1, EMPTY, 13, 1, EMPTY, 9, 1, EMPTY, 5, 1, EMPTY, 1, EMPTY, FULL, EMPTY, EMPTY, EMPTY
| AA | AB |
|A8..A9| AA .. AB |
| A8 .. AB |AC..AF|
|A0..A7| A8 .. AF |
| A0 .. AF |B0..BF|
|80..9F| A0 .. BF |
| 80 .. BF |C0..FF|
| 0..7F| 80 .. FF |
EMPTY and FULL are codes for empty and full arrays; the numbers are lengths in elements (they should be replaced with actual lengths in bytes, bits or so).
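Here is a rough Python sketch of the recursion, using nested tuples instead of a serialized bit string with length prefixes (the serialization and the number encoding are omitted; names are mine):

EMPTY, FULL = 'EMPTY', 'FULL'

def encode(bits, lo, hi):
    # bits: the set of positions that are on; returns the representation of [lo..hi]
    count = sum(1 for b in bits if lo <= b <= hi)
    if count == 0:
        return EMPTY
    if count == hi - lo + 1:
        return FULL
    mid = (lo + hi) // 2
    return (encode(bits, lo, mid), encode(bits, mid + 1, hi))

def intersect(p, q):
    if p == EMPTY or q == EMPTY:
        return EMPTY
    if p == FULL:
        return q
    if q == FULL:
        return p
    a, b = intersect(p[0], q[0]), intersect(p[1], q[1])
    if a == EMPTY and b == EMPTY:
        return EMPTY                       # re-normalize so results stay canonical
    if a == FULL and b == FULL:
        return FULL
    return (a, b)

# encode({0xAB}, 0, 0xFF) nests down, exactly as in the layout above,
# to a single FULL leaf at [0xAB..0xAB]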
If you do not need fast single-bit checks, you may use the simplest approach:
Just store the distances between set bits, using codes: Fibonacci, Rice, Golomb, Levenshtein, Elias etc., or invent another one.
Note that, in order to get minimal code length, you should use a code with code lengths as close as possible to -log p / log 2, where p is the probability of that codeword. You may use a Huffman code for that.
For example, using the Elias gamma code, an array like this:
0 1 0000 1 1 000 1 0 1 000000000000000000 1 000000000000000000
2 5 1 4 2 19 18 (distance)
Should be encoded as:
010 00101 1 00100 010 000010011 000010010
2 5 1 4 2 19 18 (distance code explained)
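Elias gamma coding itself is only a few lines; a sketch, with '0'/'1' strings standing in for real bit I/O:

def gamma_encode(n):                       # n >= 1
    b = format(n, 'b')
    return '0' * (len(b) - 1) + b          # N-1 zeros, then the N-bit value

def gamma_decode(stream):
    out, i = [], 0
    while i < len(stream):
        z = 0
        while stream[i] == '0':            # count the leading zeros
            z += 1
            i += 1
        out.append(int(stream[i:i + z + 1], 2))
        i += z + 1
    return out

# gamma_encode(19) == '000010011'; gamma_decode('01000101') == [2, 5]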
And the most compact option for an array with uniform bit distribution would be arithmetic encoding, but it is very CPU-time-consuming, because you'll have to read and write such arrays bit by bit, with no fast skipping available.

You may look into compressed bitmaps. A common strategy is to use word-aligned run-length encoding.
C++ implementation:
https://github.com/lemire/EWAHBoolArray
Java implementation:
https://github.com/lemire/javaewah
Reference:
Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010.
http://arxiv.org/abs/0901.3751

Even if they aren't exactly what you're looking for, it's worth checking out Judy trees. Judy is a heavily optimized library for ordered maps, and one configuration is specifically designed as a bitset rather than a map. I don't think intersection is one of the operations natively optimized for, though...
The general idea is to use a tree with a fixed number of address bits per level, and take advantage of the sparseness at each level. This results in quite good compression even in the worst case, and fast query performance as well. I believe an intersection operation would be relatively straightforward and potentially very fast.
At any rate, it's always a good idea to steal from the best!

Considering you are going to do a bunch of intersection tests anyway, maybe you should try storing all of the bitvectors in parallel: one sparse, 16M-entry list, where each entry contains a list of which of the 200k input bitvectors has a '1' at that location. It looks like you expect to have only about 5 bits set per input vector, or 1M total entries? Taking a straw-man linked-list implementation for the top level and the buckets, and a worst case of no intersections at all (thus 1M buckets with 1 element each), you could store it all in 32 MB.
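A sketch of this parallel layout in Python, with a dict and lists standing in for the straw-man linked lists (names are illustrative):

from collections import defaultdict

index = defaultdict(list)                  # bit position -> ids of vectors with a '1' there

def append_bit(position, vector_ids):
    # when the range grows by one, record the ~4-8 vectors that got a '1'
    index[position].extend(vector_ids)

def intersect(id_a, id_b):
    # positions where both vectors have a '1'
    return sorted(pos for pos, ids in index.items() if id_a in ids and id_b in ids)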

You might be interested in Binary Decision Diagrams (BDD), and more precisely Zero-suppressed Binary Decision Diagram (ZBDD).
They are used to represent sets in a compressed way. Unlike other compressed forms, operations (such as set intersections, or insertions of elements - your "append only" thing?) work directly on the compressed form.


Most efficient bit format to represent small unsigned integers

I have to deal with sequences of a lot of small numbers, about a million, and I have to put as many as possible (more is better) in 4KB. Obviously that's just too little space to put all of them. Also, while this is a specific scenario, I'd love an answer as general as possible.
The numbers don't follow any pattern, but here is what a small script has to say about their distribution:
407037 times 1
165000 times 2
85389 times 3
52257 times 4
34749 times 5
23567 times 6
15892 times 7
11183 times 8
7636 times 9
5402 times 10
3851 times 11
2664 times 12
2023 times 13
1547 times 14
1113 times 15
... many more lines ...
1 times 62
62 is the biggest number I have, so let's set the maximum number we care about at 64. If the method is easily adaptable to accommodate for bigger max numbers, that would be better.
Here is a sample of the numbers:
20
1
1
1
13
1
5
1
15
1
3
4
3
2
2
A naive way to do this would just be to use 6 bits per number, but I think we can do better.
EDIT: adding a bit of info following discussion in comments.
I also have 2KB of ram and a dozen cycles on a microprocessor to decode each number. I need to store, sequentially, from the first number, as many numbers as I can.
EDIT: see graybeard's comment and my followup too.
The correct way to do this would be range coding, Huffman, or Shannon-Fano coding, which you can read about in any of the digital-communication blogs on the net, so I'm not explaining those here.
I can suggest a custom method, which is really simple, and you can compare whether it lets you store more numbers than the other methods.
I see that there are no 0's in your script. So just decrease each number by 1 (while decoding, add 1 to the decoded result). Then use either 4 or 7 bits to encode each number: after the shift, all numbers up to 8 fit in 3 bits. If the number is n <= 8, set the first bit to 0 and let the next 3 bits represent the number. Else, if n > 8, set the first bit to 1 and represent the number in the 6 bits from there.
Though note that in Huffman or Shannon-Fano coding, a few of the representations can be over 20 bits long.
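A sketch of that 4-or-7-bit scheme in Python, with '0'/'1' strings standing in for real bit I/O:

def encode(numbers):                       # each number in 1..64
    bits = []
    for n in numbers:
        n -= 1                             # no zeros occur, so shift to 0..63
        if n < 8:
            bits.append('0' + format(n, '03b'))   # flag 0 + 3 bits
        else:
            bits.append('1' + format(n, '06b'))   # flag 1 + 6 bits
    return ''.join(bits)

def decode(bits):
    out, i = [], 0
    while i < len(bits):
        width = 4 if bits[i] == '0' else 7
        out.append(int(bits[i + 1:i + width], 2) + 1)
        i += width
    return out

# encode([20, 1]) == '10100110000'; decode('10100110000') == [20, 1]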
To provide a correct answer, one needs to know: is the decoder size also limited, or is there no limit on the decoder size?
If there is no limit for the decoder (just a limit for the data), I suggest you use a range coder or Huffman coding. A range coder has better compression, but makes extensive use of arithmetic operations.
However, both decoders use memory for code and for statistical tables. So, perhaps, the better answer is to create something simpler (a custom compressor) with simple, compact code and without any tables. As something easy and code-compact, I can propose the run-1 algorithm. This algorithm is not very efficient for your data (a range coder or Huffman would be better), but it has a trivial, compact decoder without any tables.
The idea: a sequence can contain zero or more 1-bits, and a 0-bit is used as the symbol separator. For example, if we encode the following sequence with run-1:
1, 1, 2, 1, 5
There will be bit sequence:
0-0-10-0-11110
To decode, you just count the number of consecutive 1-bits, add 1, and return the value as the decoded number.
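The whole run-1 decoder is a few lines; a sketch with a character string standing in for the bit stream:

def run1_decode(bits):
    out, run = [], 0
    for b in bits:
        if b == '1':
            run += 1                       # extend the current run of 1s
        else:
            out.append(run + 1)            # a 0 terminates the symbol
            run = 0
    return out

# run1_decode('0010011110') == [1, 1, 2, 1, 5]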
Maybe something slightly better than straight Huffman can be attempted by combining it with run-length coding.
If you count the successive identical elements, you can rewrite your sequence as a pairs of (value, count). Every such pair appears with some probability and you can use Huffman coding on these. (I don't mean to code the values and the counts separately, but the pairs as a whole).
Your sample yields
(20, 1), (1, 3), (13, 1), (1, 1), (5, 1), (1, 1), (15, 1), (1, 1), (3, 1), (4, 1), (3, 1), (2, 2)
The singletons will be (practically) coded as before, and there are more opportunities for compression of longer runs.
You can limit the maximum count(s) that are supported; if the actual count exceeds the limit, it is no big deal to insert several pairs.
The very first step is to compute a histogram of the count values to see if there are enough repetitions for this approach to be worthwhile.
Alternatively, you can try Huffman coding on the deltas (signed differences between successive values). If there are many repetitions, the frequency of 0 will be much higher, lowering the entropy. Obviously, run-length coding of the deltas is also possible.
I took the distribution you listed and tried an exponential fit; the result was decently good (fit plot omitted).
More importantly, the fit was reasonably close to p(x) ~= 2^-x. This suggests a very simple coding, known as "unary coding": to encode the number k, output k-1 zeroes, followed by a 1. If your numbers exactly fit the p(x) ~= 2^-x distribution, that would give you an expected code length of 2 bits. Since your numbers appear to be heavier-tailed than that (otherwise it would be vanishingly unlikely to see a 62 in only a million numbers), you won't quite achieve that. Still, given the simplicity of the coding and the ease of decoding (twelve cycles should be sufficient), you should consider trying it out.
You might also look into other universal codes, such as Elias Delta. Golomb coding would be optimal, but decoding it is an involved process.

Compress many numbers into a string

I was wondering if there's a way to compress 20 or so large numbers (~10^8) into a string of reasonable length. For instance, if the numbers were stored as hex and concatenated, it'd be at least 160 characters long. I wonder if there's a smart way to compress the numbers in and get them back out. I was thinking about having a sequence 0-9 as reference and letting one part of the input string be a number < 1024. That number is to be converted to binary, which serves as a mask, i.e. indicating which digits exist in the number. It's still not clear where to go from here.
Are there any better alternatives?
Thanks
If these large numbers are of the same size in bytes, and if you always know the count of those numbers, there is an easy way to do it. You simply have an array of your bytes and, instead of reading them out as integers, you read them out as characters. Are you trying to obfuscate your values or just pack them to be easily transferred?
When I'm compacting a lot of values into one reversible String, I usually go with Base64 conversion. This can really cut off quite a lot of the length from a String, but note that it may take up just as much memory in representing it.
Example
This number in decimal:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
is the following in Base 64:
Yki8xQRRVqd403ldXJUT8Ungkh/A3Th2TMtNlpwLPYVgct2eE8MAn0bs4o/fv1bmo4oUNQa/9WtZ8gRE7IG+UHX+LniaQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Why you can't do this to an extreme level
Think about it for a second. Let's say you've got a number 10 bits long, and you want to represent that number with 5 bits, a 50% compression scheme. First, we work out how many possible values you can represent with 10 bits, which is:
2^10 = 1024
Okay, that's fine. How many values can we express with 5 bits:
2^5 = 32
So, you can only distinguish 32 different values with 5 bits, whereas you can distinguish 1024 values with 10 bits. For compression to work, there needs to be some mapping between the compressed value and the extracted value. Let's try to make that mapping happen:
Normal - Compressed
0 0
1 1
2 2
.. ...
31 31
32 ??
33 ??
34 ??
... ...
1023 ??
There is no mapping for most of the numbers that can be represented by the expanded value.
This is known as the Pigeonhole Principle, and in this example our value for n is greater than our value for m, hence we would need to map some compressed values to more than one normal value, which makes things incredibly complex. (Thank you, Oli, for reminding me.)
You need to be much more descriptive about what you mean by "string" and "~10^8". Can your "string" contain any sequence of bytes? Or is it restricted to a subset of possible bytes? If so, how exactly is it restricted? What are the limits on your "large numbers"? What do they represent?
Numbers up to 10^8 can be represented in 27 bits. 20 of them would be 540 bits, which could be stored in a string of 68 bytes, if any sequence of bytes is permitted. If the contents of the string are limited, it will take more bits. If your range of numbers is larger, it will take more bits.
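For instance, a Python sketch of packing twenty 27-bit numbers into 68 bytes:

def pack(numbers):                         # each number < 2**27
    acc = 0
    for n in numbers:
        acc = (acc << 27) | n              # append 27 bits per number
    return acc.to_bytes((27 * len(numbers) + 7) // 8, 'big')   # 20 numbers -> 68 bytes

def unpack(data, count=20):
    acc = int.from_bytes(data, 'big')
    out = [(acc >> (27 * k)) & ((1 << 27) - 1) for k in range(count)]
    return out[::-1]                       # k = 0 is the last number packed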
Store all numbers as strings in a marisa trie: https://code.google.com/p/marisa-trie/
Base64 the resulting trie dictionary.
It depends a lot on your input, of course. But it is possible to build a (very) compact representation this way.

Error detection code for 33 bytes, detecting bit flipped in first 32 bytes

Could you please suggest an error detection scheme for detecting
one possible bit flip in the first 32 bytes of a 33-byte message using
no more than 8 bits of additional data?
Could Pearson hashing be a solution?
Detecting a single bit-flip in any message requires only one extra bit, independent of the length of the message: simply xor together all the bits in the message and tack that on the end. If any single bit flips, the parity bit at the end won't match up.
If you're asking to detect which bit flipped, that can't be done, and a simple argument shows it: the extra eight bits can represent up to 256 classes of 32-byte messages, but the zero message and the 256 messages with one on bit each must all be in different classes. Thus, there are 257 messages which must be distinctly classified, and only 256 classes.
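The parity computation is tiny; a sketch:

def parity_bit(message: bytes) -> int:
    x = 0
    for b in message:
        x ^= b                             # xor all bytes together
    x ^= x >> 4
    x ^= x >> 2
    x ^= x >> 1                            # fold the remaining byte down to one bit
    return x & 1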
You can detect one bit flip with just one extra bit in any length of message (as stated by @Daniel Wagner). The parity bit can, simply put, indicate whether the total number of 1-bits is odd or even. Obviously, if the number of bits that are wrong is even, then the parity bit will fail, so you cannot detect 2-bit errors.
Now, for a more accessible understanding of why you can't error-correct 32 bytes (256 bits) with just 8 bits, please read about the Hamming code (like used in ECC memory). Such a scheme uses special error-correcting parity bits (henceforth called "EC parity") that only encode the parity of a subset of the total number of bits. For every 2^m - 1 total bits, you need to use m EC bits. These represent each possible different mask following the pattern "x bits on, x bits off" where x is a power of 2. Thus, the larger the number of bits at once, the better the data/parity bit ratio you get. For example, 7 total bits would allow encoding only 4 data bits after losing 3 EC bits, but 31 total bits can encode 26 data bits after losing 5 EC bits.
Now, to really understand this probably will take an example. Consider the following sets of masks. The first two rows are to be read top down, indicating the bit number (the "Most Significant Byte" I've labeled MSB):
MSB LSB
| |
v v
33222222 22221111 11111100 0000000|0
10987654 32109876 54321098 7654321|0
-------- -------- -------- -------|-
1: 10101010 10101010 10101010 1010101|0
2: 11001100 11001100 11001100 1100110|0
3: 11110000 11110000 11110000 1111000|0
4: 11111111 00000000 11111111 0000000|0
5: 11111111 11111111 00000000 0000000|0
The first thing to notice is that the binary values for 0 to 31 are represented in each column going from right to left (reading the bits in rows 1 through 5). This means that each vertical column is different from each other one (the important part). I put a vertical extra line between bit numbers 0 and 1 for a particular reason: Column 0 is useless because it has no bits set in it.
To perform error-correcting, we will bitwise-AND the received data bits against each EC bit's predefined mask, then compare the resulting parity to the EC bit. For any calculated parities discovered to not match, find the column in which only those bits are set. For example, if error-correcting bits 1, 4, and 5 are wrong when calculated from the received data value, then column #25--containing 1s in only those masks--must be the incorrect bit and can be corrected by flipping it. If only a single error-correcting bit is wrong, then the error is in that error-correcting bit. Here's an analogy to help you understand why this works:
There are 32 identical boxes, with one containing a marble. Your task is to locate the marble using just an old-style scale (the kind with two balanced platforms to compare the weights of different objects) and you are only allowed 5 weighing attempts. The solution is fairly easy: you put 16 boxes on each side of the scale and the heavier side indicates which side the marble is on. Discarding the 16 boxes on the lighter side, you then weigh 8 and 8 boxes keeping the heavier, then 4 and 4, then 2 and 2, and finally locate the marble by comparing the weights of the last 2 boxes 1 to 1: the heaviest box contains the marble. You have completed the task in only 5 weighings of 32, 16, 8, 4, and 2 boxes.
Similarly, our bit patterns have divided up the boxes in 5 different groups. Going backwards, the fifth EC bit determines whether an error is on the left side or the right side. In our scenario with bit #25, it is wrong, so we know that the error bit is on the left side of the group (bits 16-31). In our next mask for EC bit #4 (still stepping backward), we only consider bits 16-31, and we find that the "heavier" side is the left one again, so we have narrowed down the bits 24-31. Following the decision tree downward and cutting the number of possible columns in half each time, by the time we reach EC bit 1 there is only 1 possible bit left--our "marble in a box".
Note: The analogy is useful, though not perfect: 1-bits are not represented by marbles--the erroring bit location is represented by the marble.
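Here is a sketch of that mask-and-compare check for a single 32-bit word, assuming for the moment that the five EC bits are stored out of band and are themselves correct:

MASKS = [0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0, 0xFF00FF00, 0xFFFF0000]  # rows 1-5 above

def parity32(x):
    x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1
    return x & 1

def ec_bits(word):
    return [parity32(word & m) for m in MASKS]   # one parity per mask

def correct(received, expected_ec):
    # each mismatched EC bit contributes one bit of the erroring column number
    column = sum(1 << i for i, m in enumerate(MASKS)
                 if parity32(received & m) != expected_ec[i])
    return received ^ (1 << column) if column else received

# flipping bit 25 makes exactly EC bits 1, 4 and 5 mismatch, as in the text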
Now, some playing around with these masks and thinking how to arrange things will reveal that there is a problem: If we try to make all 31 bits data bits, then we need 5 more bits for EC. But how, then, will we tell if the EC bits themselves are wrong? Just a single EC bit wrong will incorrectly tell us that some data bit needs correction, and we'll wrongly flip that data bit. The EC bits have to somehow encode for themselves! The solution is to position the parity bits inside of the data, in columns from the bit patterns above where only one bit is set. This way, any data bit being wrong will trigger two EC bits to be wrong, making it so that if only one EC bit is wrong, we know it is wrong itself instead of it signifying a data bit is wrong. The columns that satisfy the one-bit condition are 1, 2, 4, 8, and 16. The data bits will be interleaved between these starting at position 2. (Remember, we are not using position 0 as it would never provide any information--none of our EC bits would be set at all).
Finally, adding one more bit for overall parity will allow detecting 2-bit errors and reliably correcting 1-bit errors, as we can then compare the EC bits to it: if the EC bits say something is wrong, but the parity bit says otherwise, we know there are 2 bits wrong and cannot perform correction. We can use the discarded bit #0 as our parity bit! In fact, now we are encoding the following pattern:
0: 11111111 11111111 11111111 11111111
This gives us a final total of 6 Error-Checking and Correcting (ECC) bits. Extending the scheme of using different masks indefinitely looks like this:
32 bits - 6 ECC bits = 26 data
64 bits - 7 ECC bits = 57 data
128 bits - 8 ECC bits = 120 data
256 bits - 9 ECC bits = 247 data
512 bits - 10 ECC bits = 502 data
Now, if we are sure that we only will get a 1-bit error, we can dispense with the #0 parity bit, so we have the following:
31 bits - 5 ECC bits = 26 data
63 bits - 6 ECC bits = 57 data
127 bits - 7 ECC bits = 120 data
255 bits - 8 ECC bits = 247 data
511 bits - 9 ECC bits = 502 data
This is no change because we don't get any more data bits. Oops! 32 bytes (256 bits) as you requested cannot be error-corrected with a single byte, even if we know we can have only a 1-bit error at worst, and we know the ECC bits will be correct (allowing us to move them out of the data region and use them all for data). We need TWO more bits than we have--one must slide up to the next range of 512 bits, then leave out 246 data bits to get our 256 data bits. So that's one more ECC bit AND one more data bit (as we only have 255, exactly what Daniel told you).
Summary: You need 33 bytes + 1 bit to detect which bit flipped in the first 32 bytes.
Note: if you are going to send 64 bytes, then you're under the 32:1 ratio, as you can error-correct that in just 10 bits. But note that in real-world applications, the "frame size" of your ECC can't keep going up indefinitely, for a few reasons: 1) The number of bits being worked with at once may be much smaller than the frame size, leading to gross inefficiencies (think ECC RAM). 2) The chance of being able to accurately correct a bit gets less and less, since the larger the frame, the greater the chance it will have more errors, and 2 errors defeat error-correction ability, while 3 or more can defeat even error-detection ability. 3) Once an error is detected, the larger the frame size, the larger the size of the corrupted piece that must be retransmitted.
If you need to use a whole byte instead of a bit, and you only need to detect errors, then the standard solution is to use a cyclic redundancy check (CRC). There are several well-known 8-bit CRCs to choose from.
A typical fast implementation of a CRC uses a table with 256 entries to handle a byte of the message at a time. For the case of an 8 bit CRC this is a special case of Pearson's algorithm.
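A sketch of that table-driven approach; the polynomial 0x07 (CRC-8/ATM) is an arbitrary choice here, and any well-known 8-bit CRC works the same way:

def make_table(poly=0x07):                 # x^8 + x^2 + x + 1, chosen for illustration
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
        table.append(crc)
    return table

TABLE = make_table()

def crc8(message: bytes) -> int:
    crc = 0
    for b in message:
        crc = TABLE[crc ^ b]               # one table lookup per message byte
    return crc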

Is this a clever or stupid way to do an integer divide function?

I'm a Computer Science major, interested in how assembly languages handle an integer divide function. It seems that simply adding up to the numerator, while giving both the quotient and the mod, is way too impractical, so I came up with another way to divide using bit shifting, subtracting, and 2 lookup tables.
Basically, the function takes the denominator and makes "blocks" based on the highest power of 2. So dividing by 15 makes binary blocks of 4, dividing by 5 makes binary blocks of 3, etc. Then generate the first 2^block_size multiples of the denominator. For each multiple, write the value AFTER the first block into the lookup table, keyed by the value of the first block.
Example: multiples of 5 in binary - block size 3 (octal); the first (lowest) block is bracketed:
000 000 [101] - 5 maps to 0
000 001 [010] - 2 maps to 1
000 001 [111] - 7 maps to 1
000 010 [100] - 4 maps to 2
000 011 [001] - 1 maps to 3
000 011 [110] - 6 maps to 3
000 100 [011] - 3 maps to 4
000 101 [000] - 0 maps to 5
So the actual procedure involves getting the first block, bit-shifting the number past the first block, and subtracting the value that the block maps to. If the resulting number comes out to 0, then it's perfectly divisible, and if the value becomes negative, it's not.
If you add another enumeration look up table, where you map the values to a counter as they come in, you can calculate the result of the division!
Example: Multiples of 5 again
5 maps to 1
2 maps to 2
7 maps to 3
4 maps to 4
1 maps to 5
6 maps to 6
3 maps to 7
0 maps to 8
Then all that's left is mapping every block to the counter-table, and you have your answer.
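Putting the two tables together for the divide-by-5 example, a Python sketch of the whole procedure (exact divisions only):

REMAINDER = {5: 0, 2: 1, 7: 1, 4: 2, 1: 3, 6: 3, 3: 4, 0: 5}   # first table
QUOTIENT  = {5: 1, 2: 2, 7: 3, 4: 4, 1: 5, 6: 6, 3: 7, 0: 8}   # counter table

def divide_by_5(n):
    q, place = 0, 1
    while n > 0:
        block = n & 0b111                  # grab the low 3-bit block
        q += QUOTIENT[block] * place       # map the block through the counter table
        place *= 8
        n = (n >> 3) - REMAINDER[block]    # shift past the block, subtract its mapping
        if n < 0:
            return None                    # not evenly divisible
    return q

# divide_by_5(35) == 7; divide_by_5(36) is None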
There are a few problems with this method.
If the number isn't perfectly divisible, then the function returns junk.
For high integer values this won't work, because the blocks get truncated at the end of a 32-bit or 64-bit integer.
It's about 100 times slower than the standard division in C.
If the denominator shares a factor with the block base (e.g. an even denominator with power-of-two blocks), then some blocks must map to multiple values, and you need even more tables. This can be solved with prime factorization, but all the methods I've read about for easy/quick prime factorization involve dividing, defeating the purpose of this.
So I have 2 questions: First, is there an algorithm similar to this out there already? I've looked around, and I can't seem to find any like it. Second, How do actual assembly languages handle Integer division?
Sorry if there are any formatting mistakes; this is my first time posting to Stack Overflow.
Sorry I answered so late. OK, first, regarding the commenters on your question: they think you are trying to do what the assembly mnemonics DIV or IDIV achieve by using different instructions in assembly. To me it seems you want to know how the op-codes selected by DIV and IDIV achieve division in hardware. To my knowledge, Intel uses the SRT algorithm (which uses a lookup table) and AMD uses the Goldschmidt algorithm. I think what you are doing is similar to SRT. You can take a look at both of them here:
http://en.wikipedia.org/wiki/Division_%28digital%29

Decoding letters ('a' .. 'z') from a bit sequence without waste

I seek an algorithm that will let me represent an incoming sequence of bits as letters ('a'..'z'), in a minimal manner, such that the stream of bits can be regenerated from the letters, without ever holding the entire sequence in memory.
That is, given an external bit source (each read returns a practically random bit), and user input of a number of bits, I would like to print out the minimal number of characters that can represent those bits.
Ideally there should be a parameterization - how much memory versus maximum bits before some waste is necessary.
Efficiency Goal - The same number of characters as the base-26 representation of the bits.
Non-solutions:
If sufficient storage was present, store the entire sequence and use a big-integer MOD 26 operation.
Convert every 9 bits to 2 characters - This seems suboptimal, wasting 25% of information capacity of the letters output.
If you assign a different number of bits per letter, you should be able to exactly encode the bits in the twenty-six letters allowed without wasting any bits. (This is a lot like a Huffman code, only with a pre-built balanced tree.)
To encode bits into letters: Accumulate bits until you match exactly one of the bit codes in the lookup table. Output that letter, clear the bit buffer, and keep going.
To decode letters into bits: For each letter, output the bit sequence in the table.
Implementing in code is left as an exercise to the reader. (Or to me, if I get bored later.)
a 0000
b 0001
c 0010
d 0011
e 0100
f 0101
g 01100
h 01101
i 01110
j 01111
k 10000
l 10001
m 10010
n 10011
o 10100
p 10101
q 10110
r 10111
s 11000
t 11001
u 11010
v 11011
w 11100
x 11101
y 11110
z 11111
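A sketch of that exercise in Python, building the table above and matching greedily (prefix-freeness guarantees the buffer matches at most one code):

CODES = {c: format(i, '04b') for i, c in enumerate('abcdef')}        # 4-bit codes
CODES.update({c: format(i + 12, '05b')
              for i, c in enumerate('ghijklmnopqrstuvwxyz')})        # 5-bit codes
BITS_TO_LETTER = {v: k for k, v in CODES.items()}

def bits_to_letters(bits):
    out, buf = [], ''
    for b in bits:
        buf += b                           # accumulate until the buffer is a code
        if buf in BITS_TO_LETTER:
            out.append(BITS_TO_LETTER[buf])
            buf = ''
    return ''.join(out)

def letters_to_bits(letters):
    return ''.join(CODES[c] for c in letters)

# bits_to_letters('00000001') == 'ab'; letters_to_bits('z') == '11111'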
Convert each block of 47 bits to a base 26 number of 10 digits. This gives you more than 99.99% efficiency.
This method, as well as others like Huffman, needs a padding mechanism to support variable-length input. This introduces some inefficiency which is less significant with longer inputs.
At the end of the bit stream, append an extra 1 bit. This must be done in all cases, even when the length of the bit stream is a multiple of 47. Any high-order letters of "zero" value can be skipped in the last block of encoded output.
When decoding the letters, a truncated final block can be filled out with "zero" letters and converted to a 47-bit base 2 representation. The final 1 bit is not data, but marks the end of the bit stream.
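The 47-bit block conversion itself is a simple base change; a sketch that ignores the end-of-stream padding just described:

def block_to_letters(n):                   # 0 <= n < 2**47 < 26**10
    letters = []
    for _ in range(10):
        n, r = divmod(n, 26)
        letters.append(chr(ord('a') + r))
    return ''.join(reversed(letters))

def letters_to_block(s):
    n = 0
    for c in s:
        n = n * 26 + (ord(c) - ord('a'))
    return n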
Could Huffman coding be what you're looking for? It's a compression algorithm, which pretty much represents any information with a minimum of wasted bits.
Zero waste would be log_2(26) bits per letter. As pointed out earlier, you can get to 4.7 by reading 47 bits and converting them to 10 letters. However, you can get to 4.67 by converting every 14 bits into 3 characters. This has the advantage that it fits into an integer. If you have storage space and run time is important, you can create a lookup table with 16,384 entries mapping each possible 14-bit value to 3 letters. Otherwise, you can do mod and div operations to compute the 3 letters.
number of letters   number of bits   bits/letter
                1                4   4
                2                9   4.5
                3               14   4.67
                4               18   4.5
                5               23   4.6
                6               28   4.67
                7               32   4.57
                8               37   4.63
                9               42   4.67
               10               47   4.7
Any solution you use is going to be space-inefficient, because 26 is not a power of 2. As far as an algorithm goes, I'd rather use a lookup table than an on-the-fly calculation for each series of 9 bits. Your lookup table would be 512 entries long.
If you want the binary footprint of each letter to have the same size, the optimal solution would be given by arithmetic encoding. However, it will not reach your goal of a mean representation of 4.5 bits/char. Given 26 different characters (not including space etc.), 4.7 would be the best you can reach without using variable-length encoding (Huffman, for instance; see Jaegers's answer) or other compression algorithms.
A suboptimal, although simpler, solution could be to find a feasible number of characters to fit into a big integer. For instance, if you form a 32-bit integer out of every 6-character chunk (which is possible as 26^6 < 2^32), you use 5.33 bits/char. You can actually even fit 13 letters into a 64-bit integer (4.92 bits/char). This is quite close to the optimal solution, and still rather easy to implement. Using integers bigger than 64 bits can be tricky due to missing native support in many programming languages.
If you want even better compression rates for text, you should definitely also look into dictionary-based compression algorithms, such as LZW or Deflate.
