Confused about prefix free and uniquely decodable with respect to binary code - matrix

A code is the assignment of a unique string of characters (a
codeword) to each character in an alphabet.
A code in which the codewords contain only zeroes and ones is
called a binary code.
All ASCII codewords have the same length. This ensures that an
important property called the prefix property holds true for the
ASCII code.
The encoding of a string of characters from an alphabet (the
cleartext) is the concatenation of the codewords corresponding to
the characters of the cleartext, in order, from left to right. A code
is uniquely decodable if the encoding of every possible cleartext
using that code is unique.
Based on the above information I was trying to do some exercises:
Considering the following matrix:
   Code1   Code2   Code3   Code4
A  0       0       1       1
B  100     1       01      01
C  10      00      001     001
D  11      11      0001    000
The confusions:
Are all of the above assignments considered codes, since each assigns a unique string of characters to each letter?
I understand that code 1 and code 2 are prefix free since they do not have equal length. Having said that, if you have a look at code 4, the codewords for letters D and C each consist of 3 digits. Would code 4 be considered prefix free too?
Is code 3 the only uniquely decodable code?

I think you have misunderstood the prefix property - it isn't mainly about length (but enforcing the same length n on each code point will make the code prefix-free - you cannot have unique codes otherwise).
Rather, it is about being able to uniquely identify each code point so that a decoder can greedily take the first translation that matches. In the case of fixed length, the decoder knows that it has to read n digits.
In the case of variable length code like Code1, you don't know upon reading 10 if that can be translated to C or if it is the first two digits of the three-digit B - 10 is a prefix of 100. The same holds true for Code2: 0 is a prefix for 00 and 1 is a prefix of 11.
Consider reading the sequence 100 one digit at a time:
Code1:
Read 1
; "1" does not match any code - Remember the 1 and continue.
Read 0
; "10" matches reduction "C" - or is this the beginning of a "B"? Darn!
Read 0
; Ok, this was either "CA" or "B" - but there is no way of knowing which one.
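Here is a minimal Python sketch (just an illustration, not part of the original exercise) that checks whether a code table is prefix-free and brute-forces every possible decoding of a short encoded string:

def is_prefix_free(code):
    """True if no codeword is a (proper) prefix of another codeword."""
    words = list(code.values())
    return not any(w != v and v.startswith(w) for w in words for v in words)

def decodings(bits, code, prefix=""):
    """Enumerate every way 'bits' can be split into codewords of 'code'."""
    if not bits:
        yield prefix
    for letter, word in code.items():
        if bits.startswith(word):
            yield from decodings(bits[len(word):], code, prefix + letter)

code1 = {"A": "0", "B": "100", "C": "10", "D": "11"}
print(is_prefix_free(code1))          # False: "10" (C) is a prefix of "100" (B)
print(list(decodings("100", code1)))  # ['B', 'CA'] - two readings, exactly the ambiguity traced above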
Hope this helps you forward!

Related

hpack encoding integer significance

After reading this, https://httpwg.org/specs/rfc7541.html#integer.representation
I am confused about quite a few things, although I seem to have the overall gist of the idea.
For one, What are the 'prefixes' exactly/what is their purpose?
For two:
C.1.1. Example 1: Encoding 10 Using a 5-Bit Prefix
The value 10 is to be encoded with a 5-bit prefix.
10 is less than 31 (2^5 - 1) and is represented using the 5-bit prefix.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 0 | 1 | 0 | 1 | 0 | 10 stored on 5 bits
+---+---+---+---+---+---+---+---+
What are the leading Xs? What is the starting 0 for?
>>> bin(10)
'0b1010'
>>>
Typing this in the python IDE, you see almost the same output... Why does it differ?
This is when the number fits within the number of prefix bits though, making it seemingly simple.
C.1.2. Example 2: Encoding 1337 Using a 5-Bit Prefix
The value I=1337 is to be encoded with a 5-bit prefix.
1337 is greater than 31 (2^5 - 1).
The 5-bit prefix is filled with its max value (31).
I = 1337 - (2^5 - 1) = 1306.
I (1306) is greater than or equal to 128, so the while loop body executes:
I % 128 == 26
26 + 128 == 154
154 is encoded in 8 bits as: 10011010
I is set to 10 (1306 / 128 == 10)
I is no longer greater than or equal to 128, so the while loop terminates.
I, now 10, is encoded in 8 bits as: 00001010.
The process ends.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 1 | 1 | 1 | 1 | 1 | Prefix = 31, I = 1306
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1306>=128, encode(154), I=1306/128
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 10<128, encode(10), done
+---+---+---+---+---+---+---+---+
The octet-like diagram shows three different numbers being produced... Since the numbers are produced throughout the loop, how do you replicate this octet-like diagram within an integer? What is the actual final result - the diagram, or "I" being 10, i.e. 00001010?
def f(a, b):
    if a < 2**b - 1:
        print(a)
    else:
        c = 2**b - 1
        remain = a - c
        print(c)
        if remain >= 128:
            while 1:
                e = remain % 128
                g = e + 128
                remain = remain / 128
                if remain >= 128:
                    continue
                else:
                    print(remain)
                    c += int(remain)
                    print(c)
                    break
As I'm trying to figure this out, I wrote a quick Python implementation of it. It seems that I am left with a few useless variables, one being g, which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
For one, What are the 'prefixes' exactly/what is their purpose?
Integers are used in a few places in HPACK messages, and often they have leading bits that cannot be used for the actual integer, so there will often be a few leading bits that are unavailable to the integer itself. They are represented by the X. For the purposes of this calculation it doesn't matter what those Xs are: they could be 000, or 111, or 010, or... etc. Also, there will not always be 3 Xs - that is just an example. There could be only one leading X, or two, or four... etc.
For example, to look up a previous HPACK decoded header, we use 6.1. Indexed Header Field Representation which starts with a leading 1, followed by the table index value. Therefore that 1 is the X in the previous example. We have 7 bits (instead of only 5 bits in the original example in your question). If the table index value is less than 127 we can represent it directly in those 7 bits. If it's 127 or more then we need to do some extra work (we'll come back to this).
If it's a new value we want to add to the table (to reuse in future requests), but we already have that header name in the table (so it's just a new value for that name we want as a new entry) then we use 6.2.1. Literal Header Field with Incremental Indexing. This has 2 bits at the beginning (01 - which are the Xs), and we only have 6-bits this time to represent the index of the name we want to reuse. So in this case there are two Xs.
So don't worry about there being 3 Xs - that's just an example. In the above examples there was one X (as first bit had to be 1), and two Xs (as first two bits had to be 01) respectively. The Integer Representation section is telling you how to handle any prefixed integer, whether prefixed by 1, 2, 3... etc unusable "X" bits.
What are the leading Xs? What is the starting 0 for?
The leading Xs are discussed above. The starting 0 is just because, in this example we have 5-bits to represent the integers and only need 4-bits. So we pad it with 0. If the value to encode was 20 it would be 10100. If the value was 40, we couldn't fit it in 5-bits so need to do something else.
Typing this in the python IDE, you see almost the same output... Why does it differ?
Python uses 0b to show it's a binary number. It doesn't bother showing any leading zeros. So 0b1010 is the same as 0b01010 and also the same as 0b00001010.
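If you want the leading zeros shown for a fixed width, you can ask Python for them explicitly:
>>> format(10, '05b')
'01010'
>>> format(10, '08b')
'00001010'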
This is when the number fits within the number of prefix bits though, making it seemingly simple.
Exactly. If you need more than the number of bits you have, you don't have space for it. You can't just use more bits as HPACK will not know whether you are intending to use more bits (so should look at next byte) or if it's just a straight number (so only look at this one byte). It needs a signal to know that. That signal is using all 1s.
So to encode 40 in 5 bits, we need to use 11111 to say "it's not big enough", overflow to next byte. 11111 in binary is 31, so we know it's bigger than that, so we'll not waste that, and instead use it, and subtract it from the 40 to give 9 left to encode in the next byte. A new additional byte gives us 8 new bits to play with (well actually only 7 as we'll soon discover, as the first bit is used to signal a further overflow). This is enough so we can use 00001001 to encode our 9. So our complex number is represented in two bytes: XXX11111 and 00001001.
If we want to encode a value bigger than can fit in the prefix bits, AND the left-over is bigger than the 127 that would fit into the available 7 bits of the second byte, then we can't use this overflow mechanism with just two bytes. Instead we use another "overflow, overflow" mechanism using three bytes:
For this "overflow, overflow" mechanism, we set the first byte bits to 1s as usual for an overflow (XXX11111) and then set the first bit of the second byte to 1. This leaves 7 bits available to encode the value, plus the next 8 bits in the third byte we're going to have to use (actually only 7 bits of the third byte, because again it uses the first bit to indicate another overflow).
There are various ways they could have gone about this using the second and third bytes. What they decided to do was encode this as two numbers: the 128 mod, and the 128 multiplier.
1337 = 31 + (128 * 10) + 26
So that means the first byte is set to 31 as per the previous example, the second byte is set to 26 (which is 0011010 in 7 bits) plus the leading 1 to show we're using the overflow overflow method (so 10011010, i.e. 154), and the third byte is set to 10 (or 00001010).
So 1337 is encoded in three bytes: XXX11111 10011010 00001010 (with X set to whatever those flag bits need to be).
Using the 128 mod and multiplier is quite efficient and means this large number (and in fact any number up to 16,383 beyond the prefix value) can be represented in three bytes, which is, not coincidentally, the maximum integer that can be represented in the 7 + 7 = 14 usable bits. But it does take a bit of getting your head around!
If it's bigger than 16,383 then we need to do another round of overflow in a similar manner.
All this seems horrendously complex but is actually relatively simple, and efficient, to code up. Computers can do this pretty easily and quickly.
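For example, here is a rough Python sketch of the encoding side (an illustration of the scheme described above, not code from the RFC). It returns the prefix value followed by any continuation bytes; a real encoder would also OR the first value into whatever X flag bits precede the prefix:

def encode_prefixed_int(value, prefix_bits):
    max_prefix = (1 << prefix_bits) - 1       # e.g. 31 for a 5-bit prefix
    if value < max_prefix:
        return [value]                        # fits entirely in the prefix
    out = [max_prefix]                        # prefix filled with all 1s
    value -= max_prefix
    while value >= 128:
        out.append((value % 128) + 128)       # low 7 bits, continuation bit set
        value //= 128
    out.append(value)                         # final byte, continuation bit clear
    return out

print(encode_prefixed_int(10, 5))     # [10]
print(encode_prefixed_int(1337, 5))   # [31, 154, 10] - matching the example above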
It seems that i am left with a few useless variables, one being g
You are not printing this value in the if branch - only the left-over value in the else. You need to print both.
which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
Exactly, it's because the first bit (value 128) needs to be set as per explanation above, to show we are continuing/overflowing into needing a third byte.

Checking if two strings are equal after removing a subset of characters from both

I recently came across this problem:
You are given two strings, s1 and s2, comprised entirely of lowercase letters 'a' through 'r', and need to process a series of queries. Each query provides a subset of lowercase English letters from 'a' through 'r'. For each query, determine whether s1 and s2, when restricted only to the letters in the query, are equal.
s1 and s2 can contain up to 10^5 characters, and there are up to 10^5 queries.
For instance, if s1 is "aabcd" and s2 is "caabd", and you are asked to process a query with the subset "ac", then s1 becomes "aac" while s2 becomes "caa". These don't match, so the query would return false.
I was able to solve this in O(N^2) time by doing the following: For each query, I checked if s1 and s2 would be equal by iterating through both strings, one character at a time, skipping the characters that do not lie within the subset of allowed characters, and checking to see if the "allowed" characters from both s1 and s2 match. If at some point, the characters don't match, then the strings are not equal. Otherwise, the s1 and s2 are equal when restricted only to letters in the query. Each query takes O(N) time to process, and there are N queries, for a total of O(N^2) time.
However, I was told that there was a way to solve this faster in O(N). Does anyone know how this might be done?
The first obvious speedup is to ensure your set membership test is O(1). To do that, there's a couple of options:
Represent every letter as a single bit -- now every character is an 18-bit value with only one bit set. The set of allowed characters is now a mask with these bits ORed together and you can test membership of a character with a bitwise-AND;
Alternatively, you can have an 18-value array and index it by character (c - 'a' would give a value between 0 and 17). The test for membership is then basically the cost of an array lookup (and you can save operations by not doing the subtraction -- instead just make the array larger and index directly by character).
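A tiny sketch of the first of those options, assuming letters 'a' through 'r' mapped to bits 0-17:

def letter_bit(c):
    return 1 << (ord(c) - ord('a'))     # 'a' -> bit 0, ..., 'r' -> bit 17

query_mask = 0
for c in "ac":                          # allowed letters for one query
    query_mask |= letter_bit(c)

print(bool(query_mask & letter_bit('c')))   # True:  'c' is in the set
print(bool(query_mask & letter_bit('b')))   # False: 'b' is not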
Thought experiment
The next potential speedup is to recognize that any character which does not appear exactly the same number of times in both strings will instantly be a failed match. You can count all character frequencies in both strings with a histogram which can be done in O(N) time. In this way, you can prune the search space if such a character were to appear in the query, and you can test for this in constant time.
Of course, that won't help for a real stress-test which will guarantee that all possible letters have a frequency matched in both strings. So, what do you do then?
Well, you extend the above premise by recognizing that for any position of character x in string 1 and the position of that character in string 2 that would be a valid match (i.e. the same number of x's appear in both strings up to those respective positions), the total count of any other character up to those positions must also be equal. For any character where that is not true, it cannot possibly be compatible with character x.
Concept
Let's start by thinking about this in terms of a technique known as memoization where you can leverage precomputed or partially-computed information and get a whole lot out of it. So consider two strings like this:
a b a a c d e | e a b a c a d
What useful thing can you do here? Well, why not store the partial sums of counts for each letter:
          a b a a c d e | e a b a c a d
          --------------|--------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(b) = 0 1 1 1 1 1 1 | 0 0 1 1 1 1 1
freq(c) = 0 0 0 0 1 1 1 | 0 0 0 0 1 1 1
freq(d) = 0 0 0 0 0 1 1 | 0 0 0 0 0 0 1
freq(e) = 0 0 0 0 0 0 1 | 1 1 1 1 1 1 1
This uses a whole lot of memory, but don't worry -- we'll deal with that later. Instead, take the time to absorb what we're actually doing.
Looking at the table above, we have the running character count totals for both strings at every position in those strings.
Now let's see how our matching rules work by showing an example of a matching query "ab" and a non-matching query "acd":
For "ab":
          a b a a c d e | e a b a c a d
          --------------|--------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(b) = 0 1 1 1 1 1 1 | 0 0 1 1 1 1 1
          ^ ^ ^ ^           ^ ^ ^   ^
We scan the frequency arrays until we locate one of our letters in the query. The locations I have marked with ^ above. If you remove all the unmarked columns, you'll see the remaining columns match on both sides. So this is a match.
For "acd":
          a b a a c d e | e a b a c a d
          --------------|--------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(c) = 0 0 0 0 1 1 1 | 0 0 0 0 1 1 1
freq(d) = 0 0 0 0 0 1 1 | 0 0 0 0 0 0 1
          ^   ^ # # ^       ^   ^ # # ^
Here, all columns are matching except those marked with #.
Putting it together
All right, so you can see how this works, but you may be wondering about the runtime, because those examples above seem to be doing even more scanning than you were doing before!
Here's where things get interesting, and where our character frequency counts really kick in.
First, observe what we're actually doing on those marked columns. For any one character that we care about (for example, the 'a'), we're looking for only the positions in both strings where its count matches, and then we're comparing these two columns to see what other values match. This gives us a set of all other characters that are valid when used with 'a'. And of course, 'a' itself is also valid.
And what was our very first optimization? A bitset -- an 18-bit value that represents valid characters. You can now use this. For the two columns in each string, you set a 1 for characters with matching counts and a 0 for characters with non-matching counts. If you process every single pair of matching 'a' values in this manner, what you get is a bunch of sets of characters that work with 'a'. And you can keep a running "master" set that represents the intersection of these -- you just intersect it with each intermediate set you calculate, which is a single bitwise-AND.
By the time you reach the end of both strings, you have performed an O(N) scan, examining 18 rows each time you encountered an 'a'. The result is the set of all characters that work with 'a'.
Now repeat for the other characters, one at a time. Each pass is an O(N) scan as above, and you wind up with the set of all other characters that work with the one you're processing.
After processing all these rows you now have an array containing 18 values representing the set of all characters that work with any one character. The operation took O(18N) time and O(18N) storage.
Query
Since you now have an array where for each character you have all possible characters that work with it, you simply look up each character in the query, and you intersect their sets (again, with bitwise-AND). A further intersection is required by using the set of all characters present in the query. That prunes off all characters that are not relevant.
This leaves you with a value which, for all values in the query, represents the set of all values that can result in a matching string. So if this value is then equal to the query then you have a match. Otherwise, you don't.
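As a small sketch, assuming mask is the 18-entry array of per-character compatibility sets described above (bit i of each entry standing for letter 'a' + i), the whole query test boils down to a few bitwise operations:

def query_matches(mask, query):
    q_set = 0
    for ch in query:                        # the query's own character set
        q_set |= 1 << (ord(ch) - ord('a'))
    result = q_set
    for ch in set(query):                   # intersect the per-character sets
        result &= mask[ord(ch) - ord('a')]
    return result == q_set                  # every queried letter must survive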
This is the part that is now fast. It has essentially reduced your query tests to constant time. However, the original indexing did cost us a lot of memory. What can be done about that?
Dynamic Programming
Is it really necessary to allocate all that storage for our frequency arrays?
Well actually, no. Laying the counts out in tabular form was useful as a visual tool to explain the method conceptually, but a lot of those values are never actually needed, and it made the method seem more complicated than it is.
The good news is we can compute our master sets at the same time as computing character counts, without needing to store any frequency arrays. Remember that when we're computing the counts we use a histogram, which is as simple as having one small 18-value array where you say count[c] += 1 (if c is a character or an index derived from that character). So we could save a ton of memory if we just do the following:
Processing the set (mask) of all compatible characters for character c:
Initialize the mask for character c to all 1s (mask[c] = (1 << 18) - 1) -- this represents that all characters are currently compatible. Initialize a character histogram (count) to all zeros.
Walk through string 1 until you reach character c. For every character x you encounter along the way, increase its count in the histogram (count[x]++).
Walk through string 2 until you reach character c. For every character x you encounter along the way, decrease its count in the histogram (count[x]--).
Construct a 'good' set where any character that currently has a zero-count has a 1-bit, otherwise 0-bit. Intersect this with the current mask for c (using bitwise-AND): mask[c] &= good
Continue from step 2 until you have reached the end of both strings. If you reach the end of one of the strings prematurely, then the character count does not match and so you set the mask for this character to zero: mask[c] = 0
Repeat from 1 for every character, until all characters are done.
Above, we basically have the same time complexity of O(18N), except we now need absolutely minimal extra storage -- only one small array of counts (reused for each character) and one array of masks.
Combining techniques like the above to solve seemingly complex combinatorial problems really fast is commonly referred to as Dynamic Programming. We reduced the problem down to a truth table representing all possible characters that work with any single character. The time complexity remains linear with respect to the length of the strings, and only scales by the number of possible characters.
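A rough Python sketch of that mask-building pass, following the steps above and assuming both strings contain only the letters 'a' through 'r':

def build_masks(s1, s2, n_letters=18):
    full = (1 << n_letters) - 1
    a = [ord(ch) - ord('a') for ch in s1]
    b = [ord(ch) - ord('a') for ch in s2]
    mask = [full] * n_letters                    # start: everything compatible
    for c in range(n_letters):
        count = [0] * n_letters
        i = j = 0
        while True:
            found1 = found2 = False
            while i < len(a):                    # walk string 1 up to the next c
                x = a[i]; i += 1; count[x] += 1
                if x == c:
                    found1 = True
                    break
            while j < len(b):                    # walk string 2 up to the next c
                x = b[j]; j += 1; count[x] -= 1
                if x == c:
                    found2 = True
                    break
            if found1 != found2:                 # one string ran out of c's first
                mask[c] = 0
                break
            if not found1:                       # no more c in either string
                break
            good = 0                             # letters whose counts agree here
            for x in range(n_letters):
                if count[x] == 0:
                    good |= 1 << x
            mask[c] &= good
    return mask

# With s1 = "aabcd" and s2 = "caabd" (the question's example), the mask for 'a'
# excludes 'c', so any query containing both 'a' and 'c' fails, while "ab" matches.
masks = build_masks("aabcd", "caabd")
print(bool(masks[0] & 1 << 2))   # False: 'c' (bit 2) is not compatible with 'a'
print(bool(masks[0] & 1 << 1))   # True:  'b' (bit 1) is compatible with 'a'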
Here is the algorithm above hacked together in C++: https://godbolt.org/z/PxzYvGs8q
Let RESTRICT(s,q) be the restriction of string s to the letters in the set q.
If q contains more than two letters, then the full string RESTRICT(s,q) can be reconstructed from all the strings RESTRICT(s,qij) where qij is a pair of letters in q.
Therefore, RESTRICT(s1,q) = RESTRICT(s2,q) if and only if RESTRICT(s1,qij) = RESTRICT(s2,qij) for all pairs qij in q.
Since you are restricted to 18 letters, there are only 153 letter pairs, or only 171 fundamental single- or double-letter queries.
If you precalculate the results of these 171 queries, then you can answer any other more complicated query just by combining their results, without inspecting the string at all.
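A straightforward sketch of this idea (building the table by brute force costs roughly 171 passes over the strings, but that happens only once; answering a query never touches the strings again):

from itertools import combinations

LETTERS = "abcdefghijklmnopqr"

def restrict(s, letters):
    return ''.join(c for c in s if c in letters)

def build_pair_table(s1, s2):
    # 18 single-letter entries + 153 pair entries = 171 precomputed results
    table = {}
    for a in LETTERS:
        table[a] = restrict(s1, a) == restrict(s2, a)
    for a, b in combinations(LETTERS, 2):
        table[a + b] = restrict(s1, a + b) == restrict(s2, a + b)
    return table

def answer_query(table, q):
    q = ''.join(sorted(set(q)))
    if len(q) == 1:
        return table[q]
    return all(table[a + b] for a, b in combinations(q, 2))

table = build_pair_table("aabcd", "caabd")
print(answer_query(table, "ac"))   # False, as in the question's example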

Converting to and from a number system that doesn't have a zero digit

Consider Microsoft Excel's column-numbering system. Columns are "numbered" A, B, C, ... , Y, Z, AA, AB, AC, ... where A is 1.
The column system is similar to the base-10 numbering system that we're familiar with in that when any digit has its maximum value and is incremented, its value is set to the lowest possible digit value and the digit to its left is incremented, or a new digit is added at the minimum value. The difference is that there isn't a digit that represents zero in the letter numbering system. So if the "digit alphabet" contained ABC or 123, we could count like this:
(base 3 with zeros added for comparison)
        base 3 no 0   base 3 with 0   base 10 with 0
        -----------   -------------   --------------
  -          -              0                0
  A          1              1                1
  B          2              2                2
  C          3             10                3
  AA        11             11                4
  AB        12             12                5
  AC        13             20                6
  BA        21             21                7
  BB        22             22                8
  BC        23            100                9
  CA        31            101               10
  CB        32            102               11
  CC        33            110               12
  AAA      111            111               13
Converting from the zeroless system to our base 10 system is fairly simple; it's still a matter of multiplying each digit's value by the power of the base for its position and adding it to the total. So in the case of AAA with the alphabet ABC, it's equivalent to (1*3^2) + (1*3^1) + (1*3^0) = 9 + 3 + 1 = 13.
I'm having trouble converting inversely, though. With a zero-based system, you can use a greedy algorithm moving from largest to smallest digit and grabbing whatever fits. This will not work for a zeroless system, however. For example, converting the base-10 number 10 to the base-3 zeroless system: Though 9 (the third digit slot: 3^2) would fit into 10, this would leave no possible configuration of the final two digits since their minimum values are 1*3^1 = 3 and 1*3^0 = 1 respectively.
Realistically, my digit alphabet will contain A-Z, so I'm looking for a quick, generalized conversion method that can do this without trial and error or counting up from zero.
Edit
The accepted answer by n.m. is primarily a string-manipulation-based solution.
For a purely mathematical solution see kennytm's links:
What is the algorithm to convert an Excel Column Letter into its Number?
How to convert a column number (eg. 127) into an excel column (eg. AA)
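For reference, a minimal arithmetic sketch in the spirit of those links, for an A-Z alphabet (bijective base 26); the function names are just illustrative:

def to_column(n):
    """1 -> 'A', 26 -> 'Z', 27 -> 'AA', 28 -> 'AB', ..."""
    s = ""
    while n > 0:
        n, r = divmod(n - 1, 26)    # the "- 1" is what absorbs the missing zero digit
        s = chr(ord('A') + r) + s
    return s

def from_column(s):
    n = 0
    for c in s:
        n = n * 26 + (ord(c) - ord('A') + 1)
    return n

print(to_column(28), from_column("AB"))   # AB 28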
Convert to base-3-with-zeroes first (digits 0AB), and from there, convert to base-3-without-zeroes (ABC), using these string substitutions:
A0 => 0C
B0 => AC
C0 => BC
Each substitution either removes a zero, or pushes one to the left. In the end, discard leading zeroes.
It is also possible, as an optimisation, to process longer strings of zeros at once:
A000...000 = 0BBB...BBC
B000...000 = ABBB...BBC
C000...000 = BBBB...BBC
Generalizable to any base.
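A rough Python sketch of the whole procedure for base 3 (digits A=1, B=2, C=3), just to illustrate the substitutions above:

def to_zeroless_base3(n):
    # Step 1: ordinary base-3 digits, most significant first, written as 0/A/B.
    digits = ""
    while n:
        digits = "0AB"[n % 3] + digits
        n //= 3
    # Step 2: apply A0 -> 0C, B0 -> AC, C0 -> BC until no zero remains,
    # then drop any leading zeros.
    subs = {"A0": "0C", "B0": "AC", "C0": "BC"}
    changed = True
    while changed:
        changed = False
        for old, new in subs.items():
            if old in digits:
                digits = digits.replace(old, new, 1)
                changed = True
    return digits.lstrip("0")

print([to_zeroless_base3(n) for n in range(1, 14)])
# ['A', 'B', 'C', 'AA', 'AB', 'AC', 'BA', 'BB', 'BC', 'CA', 'CB', 'CC', 'AAA']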

Confusion regarding genetic algorithms

My book (Artificial Intelligence: A Modern Approach) says that genetic algorithms begin with a set of k randomly generated states, called the population. Each state is represented as a string over a finite alphabet - most commonly, a string of 0s and 1s. For example, an 8-queens state must specify the positions of 8 queens, each in a column of 8 squares, and so requires 8 * log2(8) = 24 bits. Alternatively the state could be represented as 8 digits, each in the range from 1 to 8.
[ http://en.wikipedia.org/wiki/Eight_queens_puzzle ]
I don't understand the expression 8 * log2(8) = 24 bits. Why log2(8)? And what are these 24 bits supposed to be for?
If we take the first example on the wikipedia page, the solution can be encoded as [2,4,6,8,3,1,7,5]: the first digit gives the row number for the queen in column A, the second for the queen in column B, and so on. Now instead of starting the row numbering at 1, we will start at 0. The solution is then encoded as [1,3,5,7,2,0,6,4]. Any position can be encoded this way.
We have only digits between 0 and 7, so if we write them in binary, 3 bits (= log2(8)) are enough:
000 -> 0
001 -> 1
...
110 -> 6
111 -> 7
A position can be encoded using 8 times 3 digits, e.g. from [1,3,5,7,2,0,6,4] we get [001,011,101,111,010,000,110,100] or more briefly 001011101111010000110100 : 24 bits.
Going the other way, the bitstring 000010001011100101111110 decodes as 000.010.001.011.100.101.111.110, then [0,2,1,3,4,5,7,6], and gives [1,3,2,4,5,6,8,7]: queen in column A is on row 1, queen in column B is on row 3, etc.
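A tiny sketch of that encoding and decoding (not from the book), 8 positions at 3 bits each:

def encode(rows):                        # rows are 0..7, e.g. [1, 3, 5, 7, 2, 0, 6, 4]
    return ''.join(format(r, '03b') for r in rows)

def decode(bits):
    return [int(bits[i:i + 3], 2) for i in range(0, 24, 3)]

bits = encode([1, 3, 5, 7, 2, 0, 6, 4])
print(bits)           # 001011101111010000110100
print(decode(bits))   # [1, 3, 5, 7, 2, 0, 6, 4]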
The number of bits needed to store the possible squares (8 possibilities, 0-7) is log2(8) = 3. Note that 111 in binary is 7 in decimal. You have to specify the square for 8 columns, so you need 3 bits 8 times.

String similarity: how exactly does Bitap work?

I'm trying to wrap my head around the Bitap algorithm, but am having trouble understanding the reasons behind the steps of the algorithm.
I understand the basic premise of the algorithm, which is (correct me if i'm wrong):
Two strings: PATTERN (the desired string)
TEXT (the String to be perused for the presence of PATTERN)
Two indices: i (currently processing index in PATTERN), 1 <= i < PATTERN.SIZE
j (arbitrary index in TEXT)
Match state S(x): S(PATTERN(i)) = S(PATTERN(i-1)) && PATTERN[i] == TEXT[j], S(0) = 1
In english terms, PATTERN.substring(0,i) matches a substring of TEXT if the previous substring PATTERN.substring(0, i-1) was successfully matched and the character at PATTERN[i] is the same as the character at TEXT[j].
What I don't understand is the bit-shifting implementation of this. The official paper detailing this algorithm basically lays it out, but I can't seem to visualize what's supposed to go on. The algorithm specification is only the first 2 pages of the paper, but I'll highlight the important parts:
Here is the bit-shifting version of the concept:
Here is T[text] for a sample search string:
And here is a trace of the algorithm.
Specifically, I don't understand what the T table signifies, and the reason behind ORing an entry in it with the current state.
I'd be grateful if anyone can help me understand what exactly is going on
T is slightly confusing because you would normally number positions in the
pattern from left to right:
0 1 2 3 4
a b a b c
...whereas bits are normally numbered from right to left.
But writing the
pattern backwards above the bits makes it clear:
  bit: 4 3 2 1 0
       c b a b a
T[a] = 1 1 0 1 0
       c b a b a
T[b] = 1 0 1 0 1
       c b a b a
T[c] = 0 1 1 1 1
       c b a b a
T[d] = 1 1 1 1 1
Bit n of T[x] is 0 if x appears in position n, or 1 if it does not.
Equivalently, you can think of this as saying that if the current character
in the input string is x, and you see a 0 in position n of T[x], then you
can only possibly be matching the pattern if the match started n characters
previously.
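A small sketch that builds this table for the pattern "ababc", using the 0-means-match convention above (the alphabet is limited to a-d purely for illustration):

pattern = "ababc"
m = len(pattern)
all_ones = (1 << m) - 1

T = {}
for ch in "abcd":
    bits = all_ones
    for i, p in enumerate(pattern):
        if p == ch:
            bits &= ~(1 << i)       # clear bit i: ch occurs at pattern position i
    T[ch] = bits
    print(ch, format(bits, "05b"))
# a 11010
# b 10101
# c 01111
# d 11111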
Now to the matching procedure. A 0 in bit n of the state means that we started matching the pattern n characters ago (where 0 is the current character). Initially, nothing matches.
[start]
1 1 1 1 1
As we consume characters trying to match, the state is shifted left (which shifts a zero into the bottom bit, bit 0) and OR-ed with the table entry for the current character. The first character is a; shifting left and OR-ing in T[a] gives:
a
1 1 1 1 0
The 0 bit that was shifted in is preserved, because a current character of a can begin a match of the pattern. For any other character, the bit would have been set to 1.
The fact that bit 0 of the state is now 0 means that we started matching the pattern on
the current character; continuing, we get:
a b
1 1 1 0 1
...because the 0 bit has been shifted left - think of it as saying that we started matching the pattern 1 character ago - and T[b] has a 0 in the same position, telling us that seeing a b in the current position is good if we started matching 1 character ago.
a b d
1 1 1 1 1
d can't match anywhere; all the bits get set back to 1.
a b d a
1 1 1 1 0
As before.
a b d a b
1 1 1 0 1
As before.
b d a b a
1 1 0 1 0
a is good if the match started either 2 characters ago or on the current character.
d a b a b
1 0 1 0 1
b is good if the match started either 1 or 3 characters ago. The 0 in bit 3 means
that we've almost matched the whole pattern...
a b a b a
1 1 0 1 0
...but the next character is a, which is no good if the match started 4 characters
ago. However, shorter matches might still be good.
b a b a b
1 0 1 0 1
Still looking good.
a b a b c
0 1 1 1 1
Finally, c is good if the match started 4 characters before. The fact that
a 0 has made it all the way to the top bit means that we have a match.
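Putting the trace into code, a minimal exact-match version might look like this (T built as in the earlier snippet; the text and pattern are the ones from the trace):

def bitap_search(text, pattern, alphabet="abcd"):
    m = len(pattern)
    all_ones = (1 << m) - 1
    T = {ch: all_ones for ch in alphabet}
    for i, p in enumerate(pattern):
        T[p] &= ~(1 << i)

    state = all_ones                              # all 1s: nothing matched yet
    for j, ch in enumerate(text):
        state = ((state << 1) | T.get(ch, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:         # a 0 reached the top bit: full match
            return j - m + 1                      # index where the match starts
    return -1

print(bitap_search("abdabababc", "ababc"))        # 5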
Sorry for not allowing anyone else to answer, but I'm pretty sure I've figured it out now.
The concept essential for grokking the algorithm is the representation of match states (defined in the original post) in binary. The article in the original post explains it formally; I'll try my hand at doing so colloquially:
Let's have STR, which is a String created with characters from a given alphabet.
Let's represent STR with a set of binary digits: STR_BINARY. The algorithm requires this representation to be backwards (so the first letter corresponds to the last digit, the second letter to the second-to-last digit, etc.).
Let's assume RANDOM refers to a String with random characters from the same alphabet STR is created from.
In STR_BINARY, a 0 at a given index indicates that RANDOM matches STR from STR[0] to STR[(index of the letter in STR that the 0 in STR_BINARY corresponds to)]. Empty spaces count as matches. A 1 indicates that RANDOM does not match STR inside those same boundaries.
The algorithm becomes simpler to learn once this is understood.

Resources