How to interpret using Most Significant Bit of single byte - Pcapng time format "if_tsresol" - bit

I'm writing a library to parse a file format (pcapng) documented here. There's one particular section I'm having trouble with, it's an option that defines how the long epoch/time variable should be parsed. The option if_tsresol is defined by a single octlet/byte, according to the spec:
The if_tsresol option identifies the resolution of timestamps. If the
Most Significant Bit is equal to zero, the remaining bits indicates
the resolution of the timestamp as a negative power of 10 (e.g. 6
means microsecond resolution, timestamps are the number of
microseconds since 1/1/1970). If the Most Significant Bit is equal to
one, the remaining bits indicates the resolution as as negative power
of 2 (e.g. 10 means 1/1024 of second). If this option is not present,
a resolution of 10^-6 is assumed (i.e. timestamps have the same
resolution of the standard 'libpcap' timestamps).
Example: '6'.
I'm happy with how either case works. E.g. for the first, if the value is '0' it represents seconds since 1970, '3' means milliseconds since 1970, '9' would mean microseconds etc.
I'm also happy with the second case, 10 would mean it's 1/1024 of a second, (i.e. nearly milliseconds), 20 would mean 1/1048576 of a second (i.e. nearly microsecond).
My confusion is over the "Most Significant Bit", and how to choose between these two. My understanding is the Most Significant Bit is effectively the left-most bit in a bit sequence.
So, taking the first example, a value of '3' has the bits 00000011, so the most significant bit is equal to zero right?
However, this first significant bit won't stop being zero until the whole byte is 128 or greater, in which case the second 'option' is meaningless as we're expecting a number from 10-30.
Should I be ignoring the first bit in the calculation of the byte itself? So:
00001010 = 10 which is 10 (option 1)
10001010 = 138 which is 10 (option 2)
and
00000110 = 6 which is 6 (option 1)
10000110 = 134 which is 6 (option 2)

Should I be ignoring the first bit in the calculation of the byte itself?
Yes. As the pcapng spec says:
If the Most Significant Bit is equal to one, the remaining bits indicates the resolution as as negative power of 2 (e.g. 10 means 1/1024 of second).
so the first bit is a flag and the rest of the bits are the value, and you ignore the first bit when doing the calculation.

Related

Fixed Point Multiplication for FFT

I’m writing a Radix-2 DIT FFT algorithm in VHDL, which requires some fractional multiplication of input data by Twiddle Factor (TF). I use Fixed Point arithmetic’s to achieve that, with every word being 16 bit long, where 1 bit is a sign bit and the rest is distributed between integer and fraction. Therefore my dilemma:
I have no idea, in what range my input data will be, so if I just decide that 4 bits go to integer and the rest 11 bits to fraction, in case I get integer numbers higher than 4 bits = 15 decimal, I’m screwed. The same applies if I do 50/50, like 7 bits to integer and the rest to fraction. If I get numbers, which are very small, I’m screwed because of truncation or rounding, i.e:
Let’s assume I have an integer "3"(0000 0011) on input and TF of "0.7071" ( 0.10110101 - 8 bit), and let’s assume, for simplicity, my data is 8 bit long, therefore:
3x0.7071 = 2.1213
3x0.7071 = 0000 0010 . 0001 1111 = 2.12109375 (for 16 bits).
Here comes the trick - I need to up/down round or truncate 16 bits to 8 bits, therefore, I get 0000 0010, i.e 2 - the error is way too high.
My questions are:
How would you solve this problem of range vs precision if you don’t know the range of your input data AND you would have numbers represented in fixed point?
Should I make a process, which decides after every multiplication where to put the comma? Wouldn’t it make the multiplication slower?
Xilinx IP Core has 3 different ways for Fixed Number Arithmetic’s – Unscaled (similar to what I want to do, just truncate in case overflow happens), Scaled fixed point (I would assume, that in that case it decides after each multiplication, where the comma should be and what should be rounded) and Block Floating Point(No idea what it is or how it works - would appreciate an explanation). So how does this IP Core decide where to put the comma? If the decision is made depending on the highest value in my dataset, then in case I have just 1 high peak and the rest of the data is low, the error will be very high.
I will appreciate any ideas or information on any known methods.
You don't need to know the fixed-point format of your input. You can safely treat it as normalized -1 to 1 range or full integer-range.
The reason is that your output will have the same format as the input. Or, more likely for FFT, a known relationship like 3 bits increase, which would the output has 3 more integer bits than the input.
It is the core user's burden to know where the decimal point will end up, you have to document the change to dynamic range of course.

Data Compression : Arithmetic coding unclear

Can anyone please explain arithmetic encoding for data compression with implementation details ? I have surfed through internet and found mark nelson's post but the implementation's technique is indeed unclear to me after trying for many hours.
Mark nelson's explanation on arithmetic coding can be located at
http://marknelson.us/1991/02/01/arithmetic-coding-statistical-modeling-data-compression/
The main idea with arithmetic compression is its the capability to code a probability using the exact amount of data length required.
This amount of data is known, proven by Shannon, and can be calculated simply by using the following formula : -log2(p)
For example, if p=50%, then you need 1 bit.
And if p=25%, you need 2 bits.
That's simple enough for probabilities which are power of 2 (and in this special case, huffman coding could be enough). But what if the probability is 63% ? Then you need -log2(0.63) = 0.67 bits. Sounds tricky...
This property is especially important if your probability is high. If you can predict something with a 95% accuracy, then you only need 0.074 bits to represent a good guess. Which means you are going to compress a lot.
Now, how to do that ?
Well, it's simpler than it sounds. You will divide your range depending on probabilities. For example, if you have a range of 100, 2 possible events, and a probability of 95% for the 1st one, then the first 95 values will say "Event 1", and the last 5 remaining values will say "Event 2".
OK, but on computers, we are accustomed to use powers of 2. For example, with 16 bits, you have a range of 65536 possible values. Just do the same : take the 1st 95% of the range (which is 62259) to say "Event 1", and the rest to say "Event 2". You obviously have a problem of "rounding" (precision), but as long as you have enough values to distribute, it does not matter too much. Furthermore, you are not constrained to 2 events, you could have a myriad of events. All that matters is that values are allocated depending on the probabilities of each event.
OK, but now i have 62259 possible values to say "Event 1", and 3277 to say "Event 2". Which one should i choose ?
Well, any of them will do. Wether it is 1, 30, 5500 or 62256, it still means "Event 1".
In fact, deciding which value to select will not depend on the current guess, but on the next ones.
Suppose i'm having "Event 1". So now i have to choose any value between 0 and 62256. On next guess, i have the same distribution (95% Event 1, 5% Event 2). I will simply allocate the distribution map with these probabilities. Except that this time, it is distributed over 62256 values. And we continue like this, reducing the range of values with each guess.
So in fact, we are defining "ranges", which narrow with each guess. At some point, however, there is a problem of accuracy, because very little values remain.
The idea, is to simply "inflate" the range again. For example, each time the range goes below 32768 (2^15), you output the highest bit, and multiply the rest by 2 (effectively shifting the values by one bit left). By continuously doing like this, you are outputting bits one by one, as they are being settled by the series of guesses.
Now the relation with compression becomes obvious : when the range are narrowed swiftly (ex : 5%), you output a lot of bits to get the range back above the limit. On the other hand, when the probability is very high, the range narrow very slowly. You can even have a lot of guesses before outputting your first bits. That's how it is possible to compress an event to "a fraction of a bit".
I've intentionally used the terms "probability", "guess", "events" to keep this article generic. But for data compression, you just to replace them with the way you want to model your data. For example, the next event can be the next byte; in this case, you have 256 of them.
Maybe this script could be useful to build a better mental model of arithmetic coder: gen_map.py. Originally it was created to facilitate debugging of arithmetic coder library and simplify generation of unit tests for it. However it creates nice ASCII visualizations that also could be useful in understanding arithmetic coding.
A small example. Imagine we have an alphabet of 3 symbols: 0, 1 and 2 with probabilities 1/10, 2/10 and 7/10 correspondingly. And we want to encode sequence [1, 2]. Script will give the following output (ignore -b N option for now):
$ ./gen_map.py -b 6 -m "1,2,7" -e "1,2"
000000111111|1111|111222222222222222222222222222222222222222222222
------011222|2222|222000011111111122222222222222222222222222222222
---------011|2222|222-------------00011111122222222222222222222222
------------|----|-------------------------00111122222222222222222
------------|----|-------------------------------01111222222222222
------------|----|------------------------------------011222222222
==================================================================
000000000000|0000|000000000000000011111111111111111111111111111111
000000000000|0000|111111111111111100000000000000001111111111111111
000000001111|1111|000000001111111100000000111111110000000011111111
000011110000|1111|000011110000111100001111000011110000111100001111
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
001100110011|0011|001100110011001100110011001100110011001100110011
010101010101|0101|010101010101010101010101010101010101010101010101
First 6 lines (before ==== line) represent a range from 0.0 to 1.0 which is recursively subdivided on intervals proportional to symbol probabilities. Annotated first line:
[1/10][ 2/10 ][ 7/10 ]
000000111111|1111|111222222222222222222222222222222222222222222222
Then we subdivide each interval again:
[ 0.1][ 0.2 ][ 0.7 ]
000000111111|1111|111222222222222222222222222222222222222222222222
[ 0.7 ][.1][ 0.2 ][ 0.7 ]
------011222|2222|222000011111111122222222222222222222222222222222
[.1][ .2][ 0.7 ]
---------011|2222|222-------------00011111122222222222222222222222
Note, that some intervals are not subdivided. That happens when there is not enough space to represent every subinterval within given precision (which is specified by -b option).
Each line corresponds to a symbol from the input (in our case - sequence [1, 2]). By following subintervals for each input symbol we'll get a final interval that we want to encode with minimal amount of bits. In our case it's a first 2 subinterval on a second line:
[ This one ]
------011222|2222|222000011111111122222222222222222222222222222222
Following 7 lines (after ====) represent the same interval 0.0 to 1.0, but subdivided according to binary notation. Each line is a bit of output and by choosing between 0 and 1 you choose left or right half-subinterval. For example bits 01 corresponds to subinterval [0.25, 05) on a second line:
[ This one ]
000000000000|0000|111111111111111100000000000000001111111111111111
The idea of arithmetic coder is to output bits (0 or 1) until the corresponding interval will be entirely inside (or equal to) the interval determined by the input sequence. In our case it's 0011. The ~~~~ line shows where we have enough bits to unambiguously identify the interval we want.
Vertical lines formed by | symbol show the range of bit sequences (rows) that could be used to encode the input sequence.
First of all thanks for introducing me to the concept of arithmetic compression!
I can see that this method has the following steps:
Creating mapping: Calculate the fraction of occurrence for each letter which gives a range size for each alphabet. Then order them and assign actual ranges from 0 to 1
Given a message calculate the range (pretty straightforward IMHO)
Find the optimal code
The third part is a bit tricky. Use the following algorithm.
Let b be the optimal representation. Initialize it to empty string (''). Let x be the minimum value and y the maximum value.
double x and y: x=2*x, y=2*y
If both of them are greater than 1 append 1 to b. Go to step 1.
If both of them are less than 1, append 0 to b. Go to step 1.
If x<1, but y>1, then append 1 to b and stop
b essentially contains the fractional part of the number you are transmitting. Eg. If b=011, then the fraction corresponds to 0.011 in binary.
What part of implementation do you not understand?

Error detection code for 33 bytes, detecting bit flipped in first 32 bytes

Could you please suggest an error detection scheme for detecting
one possible bit flip in the first 32 bytes of a 33-byte message using
no more than 8 bits of additional data?
Could Pearson hashing be a solution?
Detecting a single bit-flip in any message requires only one extra bit, independent of the length of the message: simply xor together all the bits in the message and tack that on the end. If any single bit flips, the parity bit at the end won't match up.
If you're asking to detect which bit flipped, that can't be done, and a simple argument shows it: the extra eight bits can represent up to 256 classes of 32-byte messages, but the zero message and the 256 messages with one on bit each must all be in different classes. Thus, there are 257 messages which must be distinctly classified, and only 256 classes.
You can detect one bit flip with just one extra bit in any length message (as stated by #Daniel Wagner). The parity bit can, simply put, indicate whether the total number of 1-bits is odd or even. Obviously, if the number of bits that are wrong is even, then the parity bit will fail, so you cannot detect 2-bit errors.
Now, for a more accessible understanding of why you can't error-correct 32 bytes (256 bits) with just 8 bits, please read about the Hamming code (like used in ECC memory). Such a scheme uses special error-correcting parity bits (henceforth called "EC parity") that only encode the parity of a subset of the total number of bits. For every 2^m - 1 total bits, you need to use m EC bits. These represent each possible different mask following the pattern "x bits on, x bits off" where x is a power of 2. Thus, the larger the number of bits at once, the better the data/parity bit ratio you get. For example, 7 total bits would allow encoding only 4 data bits after losing 3 EC bits, but 31 total bits can encode 26 data bits after losing 5 EC bits.
Now, to really understand this probably will take an example. Consider the following sets of masks. The first two rows are to be read top down, indicating the bit number (the "Most Significant Byte" I've labeled MSB):
MSB LSB
| |
v v
33222222 22221111 11111100 0000000|0
10987654 32109876 54321098 7654321|0
-------- -------- -------- -------|-
1: 10101010 10101010 10101010 1010101|0
2: 11001100 11001100 11001100 1100110|0
3: 11110000 11110000 11110000 1111000|0
4: 11111111 00000000 11111111 0000000|0
5: 11111111 11111111 00000000 0000000|0
The first thing to notice is that the binary values for 0 to 31 are represented in each column going from right to left (reading the bits in rows 1 through 5). This means that each vertical column is different from each other one (the important part). I put a vertical extra line between bit numbers 0 and 1 for a particular reason: Column 0 is useless because it has no bits set in it.
To perform error-correcting, we will bitwise-AND the received data bits against each EC bit's predefined mask, then compare the resulting parity to the EC bit. For any calculated parities discovered to not match, find the column in which only those bits are set. For example, if error-correcting bits 1, 4, and 5 are wrong when calculated from the received data value, then column #25--containing 1s in only those masks--must be the incorrect bit and can be corrected by flipping it. If only a single error-correcting bit is wrong, then the error is in that error-correcting bit. Here's an analogy to help you understand why this works:
There are 32 identical boxes, with one containing a marble. Your task is to locate the marble using just an old-style scale (the kind with two balanced platforms to compare the weights of different objects) and you are only allowed 5 weighing attempts. The solution is fairly easy: you put 16 boxes on each side of the scale and the heavier side indicates which side the marble is on. Discarding the 16 boxes on the lighter side, you then weigh 8 and 8 boxes keeping the heavier, then 4 and 4, then 2 and 2, and finally locate the marble by comparing the weights of the last 2 boxes 1 to 1: the heaviest box contains the marble. You have completed the task in only 5 weighings of 32, 16, 8, 4, and 2 boxes.
Similarly, our bit patterns have divided up the boxes in 5 different groups. Going backwards, the fifth EC bit determines whether an error is on the left side or the right side. In our scenario with bit #25, it is wrong, so we know that the error bit is on the left side of the group (bits 16-31). In our next mask for EC bit #4 (still stepping backward), we only consider bits 16-31, and we find that the "heavier" side is the left one again, so we have narrowed down the bits 24-31. Following the decision tree downward and cutting the number of possible columns in half each time, by the time we reach EC bit 1 there is only 1 possible bit left--our "marble in a box".
Note: The analogy is useful, though not perfect: 1-bits are not represented by marbles--the erroring bit location is represented by the marble.
Now, some playing around with these masks and thinking how to arrange things will reveal that there is a problem: If we try to make all 31 bits data bits, then we need 5 more bits for EC. But how, then, will we tell if the EC bits themselves are wrong? Just a single EC bit wrong will incorrectly tell us that some data bit needs correction, and we'll wrongly flip that data bit. The EC bits have to somehow encode for themselves! The solution is to position the parity bits inside of the data, in columns from the bit patterns above where only one bit is set. This way, any data bit being wrong will trigger two EC bits to be wrong, making it so that if only one EC bit is wrong, we know it is wrong itself instead of it signifying a data bit is wrong. The columns that satisfy the one-bit condition are 1, 2, 4, 8, and 16. The data bits will be interleaved between these starting at position 2. (Remember, we are not using position 0 as it would never provide any information--none of our EC bits would be set at all).
Finally, adding one more bit for overall parity will allow detecting 2-bit errors and reliably correcting 1-bit errors, as we can then compare the EC bits to it: if the EC bits say something is wrong, but the parity bit says otherwise, we know there are 2 bits wrong and cannot perform correction. We can use the discarded bit #0 as our parity bit! In fact, now we are encoding the following pattern:
0: 11111111 11111111 11111111 11111111
This gives us a final total of 6 Error-Checking and Correcting (ECC) bits. Extending the scheme of using different masks indefinitely looks like this:
32 bits - 6 ECC bits = 26 data
64 bits - 7 ECC bits = 57 data
128 bits - 8 ECC bits = 120 data
256 bits - 9 ECC bits = 247 data
512 bits - 10 ECC bits = 502 data
Now, if we are sure that we only will get a 1-bit error, we can dispense with the #0 parity bit, so we have the following:
31 bits - 5 ECC bits = 26 data
63 bits - 6 ECC bits = 57 data
127 bits - 7 ECC bits = 120 data
255 bits - 8 ECC bits = 247 data
511 bits - 9 ECC bits = 502 data
This is no change because we don't get any more data bits. Oops! 32 bytes (256 bits) as you requested cannot be error-corrected with a single byte, even if we know we can have only a 1-bit error at worst, and we know the ECC bits will be correct (allowing us to move them out of the data region and use them all for data). We need TWO more bits than we have--one must slide up to the next range of 512 bits, then leave out 246 data bits to get our 256 data bits. So that's one more ECC bit AND one more data bit (as we only have 255, exactly what Daniel told you).
Summary:: You need 33 bytes + 1 bit to detect which bit flipped in the first 32 bytes.
Note: if you are going to send 64 bytes, then you're under the 32:1 ratio, as you can error correct that in just 10 bits. But it's that in real world applications, the "frame size" of your ECC can't keep going up indefinitely for a few reasons: 1) The number of bits being worked with at once may be much smaller than the frame size, leading to gross inefficiencies (think ECC RAM). 2) The chance of being able to accurately correct a bit gets less and less, since the larger the frame, the greater the chance it will have more errors, and 2 errors defeats error-correction ability, while 3 or more can defeat even error-detection ability. 3) Once an error is detected, the larger the frame size, the larger the size of the corrupted piece that must be retransmitted.
If you need to use a whole byte instead of a bit, and you only need to detect errors, then the standard solution is to use a cyclic redundancy check (CRC). There are several well-known 8-bit CRCs to choose from.
A typical fast implementation of a CRC uses a table with 256 entries to handle a byte of the message at a time. For the case of an 8 bit CRC this is a special case of Pearson's algorithm.

Encoding / Error Correction Challenge

Is it mathematically feasible to encode and initial 4 byte message into 8 bytes and if one of the 8 bytes is completely dropped and another is wrong to reconstruct the initial 4 byte message? There would be no way to retransmit nor would the location of the dropped byte be known.
If one uses Reed Solomon error correction with 4 "parity" bytes tacked on to the end of the 4 "data" bytes, such as DDDDPPPP, and you end up with DDDEPPP (where E is an error) and a parity byte has been dropped, I don't believe there's a way to reconstruct the initial message (although correct me if I am wrong)...
What about multiplying (or performing another mathematical operation) the initial 4 byte message by a constant, then utilizing properties of an inverse mathematical operation to determine what byte was dropped. Or, impose some constraints on the structure of the message so every other byte needs to be odd and the others need to be even.
Alternatively, instead of bytes, it could also be 4 decimal digits encoded in some fashion into 8 decimal digits where errors could be detected & corrected under the same circumstances mentioned above - no retransmission and the location of the dropped byte is not known.
I'm looking for any crazy ideas anyone might have... Any ideas out there?
EDIT:
It may be a bit contrived, but the situation that I'm trying to solve is one where you have, let's say, a faulty printer that prints out important numbers onto a form, which are then mailed off to a processing firm which uses OCR to read the forms. The OCR isn't going to be perfect, but it should get close with only digits to read. The faulty printer could be a bigger problem, where it may drop a whole number, but there's no way of knowing which one it'll drop, but they will always come out in the correct order, there won't be any digits swapped.
The form could be altered so that it always prints a space between the initial four numbers and the error correction numbers, ie 1234 5678, so that one would know whether a 1234 initial digit was dropped or a 5678 error correction digit was dropped, if that makes the problem easier to solve. I'm thinking somewhat similar to how they verify credit card numbers via algorithm, but in four digit chunks.
Hopefully, that provides some clarification as to what I'm looking for...
In the absence of "nice" algebraic structure, I suspect that it's going to be hard to find a concise scheme that gets you all the way to 10**4 codewords, since information-theoretically, there isn't a lot of slack. (The one below can use GF(5) for 5**5 = 3125.) Fortunately, the problem is small enough that you could try Shannon's greedy code-construction method (find a codeword that doesn't conflict with one already chosen, add it to the set).
Encode up to 35 bits as a quartic polynomial f over GF(128). Evaluate the polynomial at eight predetermined points x0,...,x7 and encode as 0f(x0) 1f(x1) 0f(x2) 1f(x3) 0f(x4) 1f(x5) 0f(x6) 1f(x7), where the alternating zeros and ones are stored in the MSB.
When decoding, first look at the MSBs. If the MSB doesn't match the index mod 2, then that byte is corrupt and/or it's been shifted left by a deletion. Assume it's good and shift it back to the right (possibly accumulating multiple different possible values at a point). Now we have at least seven evaluations of a quartic polynomial f at known points, of which at most one is corrupt. We can now try all possibilities for the corruption.
EDIT: bmm6o has advanced the claim that the second part of my solution is incorrect. I disagree.
Let's review the possibilities for the case where the MSBs are 0101101. Suppose X is the array of bytes sent and Y is the array of bytes received. On one hand, Y[0], Y[1], Y[2], Y[3] have correct MSBs and are presumed to be X[0], X[1], X[2], X[3]. On the other hand, Y[4], Y[5], Y[6] have incorrect MSBs and are presumed to be X[5], X[6], X[7].
If X[4] is dropped, then we have seven correct evaluations of f.
If X[3] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 3, and six correct evaluations.
If X[5] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 5, and six correct evaluations.
There are more possibilities besides these, but we never have fewer than six correct evaluations, which suffices to recover f.
I think you would need to study what erasure codes might offer you. I don't know any bounds myself, but maybe some kind of MDS code might achieve this.
EDIT: After a quick search I found RSCode library and in the example it says that
In general, with E errors, and K erasures, you will need
* 2E + K bytes of parity to be able to correct the codeword
* back to recover the original message data.
So looks like Reed-Solomon code is indeed the answer and you may actually get recovery from one erasure and one error in 8,4 code.
Parity codes work as long as two different data bytes aren't affected by error or loss and as long as error isn't equal to any data byte while a parity byte is lost, imho.
Error correcting codes can in general handle erasures, but in the literature the position of the erasure is assumed known. In most cases, the erasure will be introduced by the demodulator when there is low confidence that the correct data can be retrieved from the channel. For instance, if the signal is not clearly 0 or 1, the device can indicate that the data was lost, rather than risking the introduction of an error. Since an erasure is essentially an error with a known position, they are much easier to fix.
I'm not sure what your situation is where you can lose a single value and you can still be confident that the remaining values are delivered in the correct order, but it's not a situation classical coding theory addresses.
What algorithmist is suggesting above is this: If you can restrict yourself to just 7 bits of information, you can fill the 8th bit of each byte with alternating 0 and 1, which will allow you to know the placement of the missing byte. That is, put a 0 in the high bit of bytes 0, 2, 4, 6 and a 1 in the high bits of the others. On the receiving end, if you only receive 7 bytes, the missing one will have been dropped from between bytes whose high bits match. Unfortunately, that's not quite right: if the erasure and the error are adjacent, you can't know immediately which byte was dropped. E.g., high bits 0101101 could result from dropping the 4th byte, or from an error in the 4th byte and dropping the 3rd, or from an error in the 4th byte and dropping the 5th.
You could use the linear code:
1 0 0 0 0 1 1 1
0 1 0 0 1 0 1 1
0 0 1 0 1 1 0 1
0 0 0 1 1 1 1 0
(i.e. you'll send data like (a, b, c, d, b+c+d, a+c+d, a+b+d, a+b+c) (where addition is implemented with XOR, since a,b,c,d are elements of GF(128))). It's a linear code with distance 4, so it can correct a single-byte error. You can decode with syndrome decoding, and since the code is self-dual, the matrix H will be the same as above.
In the case where there's a dropped byte, you can use the technique above to determine which one it is. Once you've determined that, you're essentially decoding a different code - the "punctured" code created by dropping that given byte. Since the punctured code is still linear, you can use syndrome decoding to determine the error. You would have to calculate the parity-check matrix for each of the shortened codes, but you can do this ahead of time. The shortened code has distance 3, so it can correct any single-byte errors.
In the case of decimal digits, assuming one goes with first digit odd, second digit even, third digit odd, etc - with two digits, you get 00-99, which can be represented in 3 odd/even/odd digits (125 total combinations) - 00 = 101, 01 = 103, 20 = 181, 99 = 789, etc. So one encodes two sets of decimal digits into 6 total digits, then the last two digits signify things about the first sets of 2 digits or a checksum of some sort... The next to last digit, I suppose, could be some sort of odd/even indicator on each of the initial 2 digit initial messages (1 = even first 2 digits, 3 = odd first two digits) and follow the pattern of being odd. Then, the last digit could be the one's place of a sum of the individual digits, that way if a digit was missing, it would be immediately apparent and could be corrected assuming the last digit was correct. Although, it would throw things off if one of the last two digits were dropped...
It looks to be theoretically possible if we assume 1 bit error in wrong byte. We need 3 bits to identify dropped byte and 3 bits to identify wrong byte and 3 bits to identify wrong bit. We have 3 times that many extra bits.
But if we need to identify any number of bits error in wrong byte, it comes to 30 bits. Even that looks to be possible with 32 bits, although 32 is a bit too close for my comfort.
But I don't know hot to encode to get that. Try turbocode?
Actually, as Krystian said, when you correct a RS code, both the message AND the "parity" bytes will be corrected, as long as you have v+2e < (n-k) where v is the number of erasures (you know the position) and e is the number of errors. This means that if you only have errors, you can correct up to (n-k)/2 errors, or (n-k-1) erasures (about the double of the number of errors), or a mix of both (see Blahut's article: Transform techniques for error control codes and A universal Reed-Solomon decoder).
What's even nicer is that you can check that the correction was successful: by checking that the syndrome polynomial only contains 0 coefficients, you know that the message+parity bytes are both correct. You can do that before to check if the message needs any correction, and also you can do the check after the decoding to check that both the message and the parity bytes were completely repaired.
The bound v+2e < (n-k) is optimal, you cannot do better (that's why Reed-Solomon is called an optimal error correction code). In fact it's possible to go beyond this limit using bruteforce approaches, up to a certain point (you can gain 1 or 2 more symbols for each 8 symbols) using list decoding, but it's still a domain in its infancy, I don't know of any practical implementation that works.

Decoding letters ('a' .. 'z') from a bit sequence without waste

I seek an algorithm that will let me represent an incoming sequence of bits as letters ('a' .. 'z' ), in a minimal matter such that the stream of bits can be regenerated from the letters, without ever holding the entire sequence in memory.
That is, given an external bit source (each read returns a practically random bit), and user input of a number of bits, I would like to print out the minimal number of characters that can represent those bits.
Ideally there should be a parameterization - how much memory versus maximum bits before some waste is necessary.
Efficiency Goal - The same number of characters as the base-26 representation of the bits.
Non-solutions:
If sufficient storage was present, store the entire sequence and use a big-integer MOD 26 operation.
Convert every 9 bits to 2 characters - This seems suboptimal, wasting 25% of information capacity of the letters output.
If you assign a different number of bits per letter, you should be able to exactly encode the bits in the twenty-six letters allowed without wasting any bits. (This is a lot like a Huffman code, only with a pre-built balanced tree.)
To encode bits into letters: Accumulate bits until you match exactly one of the bit codes in the lookup table. Output that letter, clear the bit buffer, and keep going.
To decode letters into bits: For each letter, output the bit sequence in the table.
Implementing in code is left as an exercise to the reader. (Or to me, if I get bored later.)
a 0000
b 0001
c 0010
d 0011
e 0100
f 0101
g 01100
h 01101
i 01110
j 01111
k 10000
l 10001
m 10010
n 10011
o 10100
p 10101
q 10110
r 10111
s 11000
t 11001
u 11010
v 11011
w 11100
x 11101
y 11110
z 11111
Convert each block of 47 bits to a base 26 number of 10 digits. This gives you more than 99.99% efficiency.
This method, as well as others like Huffman, needs a padding mechanism to support variable-length input. This introduces some inefficiency which is less significant with longer inputs.
At the end of the bit stream, append an extra 1 bit. This must be done in all cases, even when the length of the bit stream is a multiple of 47. Any high-order letters of "zero" value can be skipped in the last block of encoded output.
When decoding the letters, a truncated final block can be filled out with "zero" letters and converted to a 47-bit base 2 representation. The final 1 bit is not data, but marks the end of the bit stream.
Could Huffman coding be what you're looking for? It's a compression algorithm, which pretty much represents any information with a minimum of wasted bits.
Zero waste would be log_2(26) bits per letter. As pointed out earlier, you can get to 4.7 by reading 47 bits and converting them to 10 letters. However, you can get to 4.67 by converting every 14 bits into 3 characters. This has the advantage that it fits into an integer. If you have storage space and run time is important, you can create a lookup table with 17,576 entries mapping the possible 14 bits into 3 letters. Otherwise, you can do mod and div operations to compute the 3 letters.
number of letters number of bits bits/letter
1 4 4
2 9 4.5
3 14 4.67
4 18 4.5
5 23 4.6
6 28 4.67
7 32 4.57
8 37 4.63
9 42 4.67
10 47 4.7
Any solution you use is going to be space-inefficient because 26 is not a power of 2. As far as an algorithm goes, I'd rather use a lookup table than an on-the-fly calculation for each series of 9 bits. Your lookup table would 512 entries long.
If you want the binary footprint of each letter to have the same size, the optimal solution would be given by Arithmetic Encoding. However, it will not reach your goal of a mean representation of 4.5 bits/char. Given 26 different characters (not including space etc) 4.7 would be the best you can reach without using variable-length encoding (Huffman, for instance. See Jaegers's answer) or other compression algoritms.
A suboptimal, although simpler, solution could be to find a feasible number of characters to fit into a big integer. For instance, if you form a 32-bit integer out of every 6 charachter chunk (which is possible as 26^6 < 2^32), you use 5.33 bits/char. You can actually even fit 13 letters into a 64 bit integer (4.92 bits/char). This is quite close to the optimal solution, and still rather easy to implement. Using bigger ints than 64 bits can be tricky due to missing native support in many progamming languages.
If you want even better compression rates for text, you should definitely also look into dictionary-based compression algorithms, such as LZW or Deflate.

Resources