Use of Reed-Solomon error correction algorithm with 4-state barcodes

I have a combined piece of information that requires a minimum of 35 bits.
Using a 4-state barcode, each bar represents 2 bits, so the above-mentioned information can be translated into 18 bars.
I would like to add some strong error correction to this barcode, so that if it's somehow damaged, it can be corrected. One such approach is Reed-Solomon error correction.
My goal is to add as strong an error correction as possible, but on the other hand I have a size limitation on the barcode. If I understand the Reed-Solomon algorithm correctly, m∙k has to be at least the size of my message, i.e. 35 in my case.
Based on the Reed-Solomon Interactive Demo, I can go with (m, n, t, k) being (4, 15, 3, 9), which would allow me to encode a message of up to 4∙9 = 36 bits. This would lead to a code word of 4∙15 = 60 bits, or 30 bars, but the error correction ratio t / n would be just 20.0%.
The next option is to go with (m, n, t, k) being (5, 31, 12, 7), which would allow me to encode a message of up to 5∙7 = 35 bits. This would lead to a code word of 5∙31 = 155 bits, or 78 bars, and the error correction ratio t / n would be ~38.7%.
The first scenario requires a barcode with 30 bars, which is nice, but 20.0% error correction is not as great as desired. The second scenario offers excellent error correction of 38.7%, but the barcode would need 78 bars, which is too many.
Is there some other approach or a different method that would offer great error correction and a reasonable barcode length?

You could use a shortened code word such as (5, 19, 6, 7): 31.6% correction ratio, 95 bits, 48 bars. One advantage of a shortened code word is a reduced chance of mis-correction if the decoder is allowed to correct the maximum of 6 errors. If any of the 6 error locations is outside the range of valid locations, that indicates that there are more than 6 errors. The probability of mis-correction is about (19/31)^6 = 5.3%.
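To compare the candidate layouts side by side, here is a minimal sketch that recomputes the figures quoted above. The function name `rs_layout` and the `bits_per_bar` parameter are made up for illustration; it assumes t = (n - k) / 2 correctable symbols and 2 bits per bar, as in the question.

```python
import math

def rs_layout(m, n, k, bits_per_bar=2):
    """Summarize a Reed-Solomon layout: m bits per symbol, n symbols total, k data symbols."""
    t = (n - k) // 2                       # number of correctable symbol errors
    code_bits = m * n
    return {
        "data_bits": m * k,
        "code_bits": code_bits,
        "bars": math.ceil(code_bits / bits_per_bar),
        "t": t,
        "ratio": t / n,
    }

for params in [(4, 15, 9), (5, 31, 7), (5, 19, 7)]:
    info = rs_layout(*params)
    print(params, info["bars"], "bars,", f"{info['ratio']:.1%} correction ratio")

# Rough mis-correction estimate for the shortened (19 of 31 positions) code
print(f"{(19 / 31) ** 6:.1%}")
```

Other shortened lengths between 15 and 31 symbols can be tried the same way to see where the bar count stops being acceptable.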

Related

Lossless compression of an ordered series of 29 digits (each 0 to 5 Likert scale)

I have a survey with 29 questions, each with a 5-point Likert scale (0=None of the time; 4=Most of the time). I'd like to compress the total set of responses to a small number of alpha or alphanumeric characters, adding a check digit to the end.
So, the set of responses 00101244231023110242231421211 would get turned into something like A2CR7HW4. This output would be part of a printout that a non-techie user would enter on a website as a shortcut to entering the entire string. I'd want to avoid ambiguous characters, such as 0,O,D,I,l,5,S, leaving me with 21 or 22 characters to use (uppercase only). Alternatively, I could just stick with capital alpha only and use all 26 characters.
I'm thinking to convert each pair of digits to a letter (5^2=25, so the whole alphabet is adequate). That would reduce the sequence to 15 characters, which is still longish to type without errors.
Any other suggestions on how to minimize the length of the output?
EDIT: BTW, for context, the survey asks 29 questions about mental health symptoms, generating a predictive risk for 4 psychiatric conditions. Need a code representing all responses.
If the five answers are all equally likely, then the best you can do is ceiling(29 * log(5) / log(n)) symbols, where n is the number of symbols in your alphabet. (The base of the logarithm doesn't matter, so long as they're both the same.)
So for your 22 symbols, the best you can do is 16. For 26 symbols, the best is 15, as you described for 25. If you use 49 characters (e.g. some subset of the upper and lower case characters and the digits), you can get down to 12. The best you'll be able to do with printable ASCII characters would be 11, using 70 of the 94 characters.
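The counts above can be checked with exact integer arithmetic instead of logarithms. This is a small sketch; `min_symbols` is a name invented here, and it assumes all 5^29 response sets must be representable.

```python
def min_symbols(questions, choices, alphabet_size):
    """Smallest length L such that alphabet_size**L >= choices**questions."""
    total = choices ** questions
    length, capacity = 0, 1
    while capacity < total:
        capacity *= alphabet_size
        length += 1
    return length

for n in (22, 26, 49, 70, 94):
    print(n, "symbols ->", min_symbols(29, 5, n), "characters")
```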
The only way to make it smaller would be if the responses are not all equally likely and are heavily skewed. Though if that's the case, then there's probably something wrong with the survey.
First, choose a set of permissible characters, i.e.
characters = "ABC..."
Then, prefix the input-digits with a 1 and interpret it as a quinary number:
100101244231023110242231421211
Now, convert this quinary number to a number in base-"strlen(characters)", i.e. base26 if 26 characters are to be used:
02 23 18 12 10 24 04 19 00 15 14 20 00 03 17
Then, use these numbers as index in "characters", and you have your encoding:
CVSMKWETAPOUADR
For decoding, just reverse the steps.
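The steps above could be sketched in Python roughly as follows. The 26-letter `ALPHABET` is only an assumed symbol set, and the check digit from the question is not included.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed symbol set; swap in any unambiguous one

def encode(responses, alphabet=ALPHABET):
    """Prefix a 1, read the digits as a base-5 number, rewrite it in base len(alphabet)."""
    value = int("1" + responses, 5)      # the leading 1 preserves leading zeros
    base = len(alphabet)
    symbols = []
    while value:
        value, digit = divmod(value, base)
        symbols.append(alphabet[digit])
    return "".join(reversed(symbols))

def decode(code, alphabet=ALPHABET):
    """Reverse encode(): rebuild the integer, write it in base 5, drop the leading 1."""
    base = len(alphabet)
    value = 0
    for ch in code:
        value = value * base + alphabet.index(ch)
    digits = []
    while value:
        value, digit = divmod(value, 5)
        digits.append(str(digit))
    return "".join(reversed(digits))[1:]

responses = "00101244231023110242231421211"
code = encode(responses)
print(code, len(code), decode(code) == responses)
```

Because the result is just an integer written in another base, switching to a 22-character alphabet only requires changing `ALPHABET`.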
Are you doing this in a specific language?
If you want to be really thrifty about it you might want to consider encoding the data at bit level.
Since there are only 5 possible answers per question you could do this with only 3 bits:
000
001
010
011
100
Your end result would be a string of bits, at 3 bits per answer, so a total of 87 bits, or just under 11 bytes.
EDIT - misread the question slightly, there are 5 possible answers not 4, my mistake.
The only problem now is that for 4 of your 5 answers you're wasting a bit...you ain't gonna benefit much from going to this much trouble I wouldn't say but it's worth considering.
EDIT:
I've been playing about with it and it's difficult to work out a mechanism that allows you to use both 2 and 3 bit values.
Since your output would be an 87 bit binary value, you'd need to be able to make the distinction between 2 and 3 bit values when converting back to the original values.
If you're working with a larger number of values there are some methods you could use, like having a reserved bit for each value that can be used to sort of type a value and give it some meaning. But working with so few bits as it is, it's hard to shave anything off.
Your output at 87 bits could be padded out to 128 bits, which would give you 4 32-bit values if you wanted to simplify it. This 128 bit value would be like a unique fingerprint representing a specific set of answers. There are many ways you can represent 128 bits.
But in the end working at bit-level is about as good as it gets when it comes to actual compression and encoding of data...if you can express 5 unique values in less than 3 bits I'd be suitably impressed.

Most efficient bit format to represent small unsigned integers

I have to deal with sequences of a lot of small numbers, about a million, and I have to put as many as possible (more is better) in 4KB. Obviously that's just too little space to put all of them. Also, while this is a specific scenario, I'd love an answer as general as possible.
The numbers don't follow any pattern, but here is what a small script has to say about their distribution:
407037 times 1
165000 times 2
85389 times 3
52257 times 4
34749 times 5
23567 times 6
15892 times 7
11183 times 8
7636 times 9
5402 times 10
3851 times 11
2664 times 12
2023 times 13
1547 times 14
1113 times 15
... many more lines ...
1 times 62
62 is the biggest number I have, so let's set the maximum number we care about at 64. If the method is easily adaptable to accommodate for bigger max numbers, that would be better.
Here is a sample of the numbers:
20
1
1
1
13
1
5
1
15
1
3
4
3
2
2
A naive way to do this would just be to use 6 bits per number, but I think we can do better.
EDIT: adding a bit of info following discussion in comments.
I also have 2KB of ram and a dozen cycles on a microprocessor to decode each number. I need to store, sequentially, from the first number, as many numbers as I can.
EDIT: see graybeard's comment and my followup too.
The correct way to do this would be range coding, Huffman or Shannon-Fano coding, which are described in many digital-communication resources online, so I won't explain them here.
I can suggest a custom method that is really simple; you can compare it with the other methods to see whether it lets you store more numbers.
I see that there are no 0's in your data. So just decrease each number by 1 (and add 1 to the decoded result when decoding). Use either 4 or 7 bits to encode each number. After the decrement, every number up to 8 fits in 3 bits. If the number is n <= 8, set the first bit to 0 and store n - 1 in the next 3 bits. Otherwise, if n > 8, set the first bit to 1 and store n - 1 in the next 6 bits.
By contrast, in Huffman or Shannon-Fano coding a few of the codes can be over 20 bits long.
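A rough sketch of this 4-or-7-bit scheme, using a string of '0'/'1' characters for readability rather than real bit packing (the function names are invented for the example):

```python
def encode_4_or_7(numbers):
    """'0' + 3 bits for values 1..8, '1' + 6 bits for values 9..64."""
    chunks = []
    for n in numbers:
        v = n - 1                          # the data has no zeros, so store n - 1
        if v < 8:
            chunks.append("0" + format(v, "03b"))
        else:
            chunks.append("1" + format(v, "06b"))
    return "".join(chunks)

def decode_4_or_7(bits):
    """Read the flag bit, then 3 or 6 payload bits, and add the 1 back."""
    numbers, i = [], 0
    while i < len(bits):
        width = 6 if bits[i] == "1" else 3
        numbers.append(int(bits[i + 1:i + 1 + width], 2) + 1)
        i += 1 + width
    return numbers

sample = [20, 1, 1, 1, 13, 1, 5, 1, 15, 1, 3, 4, 3, 2, 2]
packed = encode_4_or_7(sample)
print(len(packed), "bits instead of", 6 * len(sample))
```

On the 15-number sample from the question this takes 69 bits instead of 90, since only three of the values exceed 8.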
To give a correct answer, I need to know: is the decoder size also limited, or is there no limit on the decoder size?
If there is no limit on the decoder (just a limit on the data), I suggest you use a range coder or Huffman coding. A range coder compresses better, but uses arithmetic operations extensively.
However, both decoders use memory for code and for statistical tables. So perhaps a better answer is to create something simpler (a custom compressor) with simple, compact code and no tables. As a simple and compact option, I can propose the run-1 algorithm. This algorithm is not very efficient for your data (a range coder or Huffman is better), but it has a trivial, compact decoder without any tables.
The idea: each symbol is a run of zero or more bit_1, with bit_0 used as the symbol separator. For example, if we would like to encode the following sequence with run-1:
1, 1, 2, 1, 5
There will be bit sequence:
0-0-10-0-11110
To decode, you just count the number of consecutive bit_1, add 1, and return the value as the decoded number.
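As a sketch, the run-1 encoder and decoder fit in a few lines (again using a '0'/'1' string instead of packed bits; the names are made up here):

```python
def run1_encode(numbers):
    """Write n as (n - 1) one-bits followed by a zero-bit separator."""
    return "".join("1" * (n - 1) + "0" for n in numbers)

def run1_decode(bits):
    """Count one-bits until a zero-bit, then emit count + 1."""
    numbers, run = [], 0
    for bit in bits:
        if bit == "1":
            run += 1
        else:
            numbers.append(run + 1)
            run = 0
    return numbers

print(run1_encode([1, 1, 2, 1, 5]))   # 0010011110
```

The decoder needs only a counter, which is why it suits a target with 2KB of RAM and a dozen cycles per number.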
Something slightly better than straight Huffman coding can be attempted by combining it with run-length coding.
If you count the successive identical elements, you can rewrite your sequence as pairs of (value, count). Every such pair appears with some probability and you can use Huffman coding on these. (I don't mean to code the values and the counts separately, but the pairs as a whole.)
Your sample yields
(20, 1), (1, 3), (13, 1), (1, 1), (5, 1), (1, 1), (15, 1), (1, 1), (3, 1), (4, 1), (3, 1), (2, 2)
The singletons will be (practically) coded as before, and there are more opportunities for compression of longer runs.
You can limit the maximum count(s) that are supported; if the actual count exceeds the limit, it is no big deal to insert several pairs.
The very first step is to compute a histogram of the count values to see if there are enough repetitions for this approach to be worthwhile.
Alternatively, you can try Huffman coding on the deltas (signed differences between successive values). If there are many repetitions, the frequency of 0 will be much higher, lowering the entropy. Obviously, run-length coding of the deltas is also possible.
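To judge whether run-length pairs are worthwhile, the pairs and the histogram of run counts can be computed with the standard library. This sketch uses the 15-number sample from the question; `run_length_pairs` is just an illustrative name.

```python
from collections import Counter
from itertools import groupby

def run_length_pairs(numbers):
    """Collapse runs of identical values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(numbers)]

sample = [20, 1, 1, 1, 13, 1, 5, 1, 15, 1, 3, 4, 3, 2, 2]
pairs = run_length_pairs(sample)
count_histogram = Counter(count for _, count in pairs)

print(pairs)
print(count_histogram)   # how often each run length occurs
```

If almost every count is 1 on the real data, the pairs add little over coding the values alone.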
I took the distribution you listed, and tried an exponential fit. The result was decently good.
More importantly, the fit was reasonably close to p(x) ~= 2^-x. This suggests a very simple coding, known as "unary coding": to encode the number k, output k-1 zeroes, followed by a 1. If your numbers exactly fit the p(x) ~= 2^-x distribution, that would give you an expected code length of 2 bits. Since your numbers appear to be heavier-tailed than that (otherwise it would be vanishingly unlikely to see a 62 in only a million numbers), you won't quite achieve that. Still, given the simplicity of the coding and the ease of decoding (twelve cycles should be sufficient), you should consider trying it out.
You might also look into other universal codes, such as Elias Delta. Golomb coding would be optimal, but decoding it is an involved process.
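One way to compare candidate codes before committing to one is to compute their average length over the histogram and set that against the entropy. The sketch below uses only the 15 histogram rows shown in the question, so the tail is missing and the printed figures are indicative at best; the helper names are invented here.

```python
import math

def entropy_bits(hist):
    """Shannon entropy in bits per number for a {value: count} histogram."""
    total = sum(hist.values())
    return -sum(c / total * math.log2(c / total) for c in hist.values() if c)

def average_bits(hist, code_length):
    """Average code length when value n costs code_length(n) bits."""
    total = sum(hist.values())
    return sum(c * code_length(n) for n, c in hist.items()) / total

def unary_length(n):
    return n                      # n - 1 padding bits plus one terminator

def split_4_or_7_length(n):
    return 4 if n <= 8 else 7     # the flag-bit scheme suggested earlier in the thread

# Only the rows listed in the question; the elided tail is NOT included.
partial = {1: 407037, 2: 165000, 3: 85389, 4: 52257, 5: 34749, 6: 23567,
           7: 15892, 8: 11183, 9: 7636, 10: 5402, 11: 3851, 12: 2664,
           13: 2023, 14: 1547, 15: 1113}

for name, fn in [("unary", unary_length), ("4-or-7", split_4_or_7_length)]:
    avg = average_bits(partial, fn)
    print(name, round(avg, 3), "bits/number,", int(4096 * 8 / avg), "numbers in 4KB")
print("entropy", round(entropy_bits(partial), 3), "bits/number")
```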

Testing TIFF data gives ?conflicting? bit depths in MATLAB

I'm trying to write a function in Matlab that reads in TIFF images from various cameras and restores them to their correct data values for analysis. These cameras are from a variety of brands, and, so far, store either 12 or 14 bit data into 16 bit output. I've been reading them in using imread, and I was told that dividing by either 16 or 4 would convert the data back to its original form. Unfortunately, that was when the function was only intended for one brand of camera specifically, which nicely scales data to 16 bit at time of capture so that such a transformation would work.
Since I'd like to keep the whole image property detection thing as automated as possible, I've done some digging in the data for a couple different cameras, and I'm running into an issue that I must be completely clueless about. I've determined (so far) that the pictures will always be stored in one of two ways: such that the previous method will work (they multiply the original data out to fill the 16 bits), or they just stuff the data in directly and append zeroes to the front or back for any vacant bits. I decided to see if I could detect which was which and have been using the following two methods. The images I test should easily have values that fill up the full range from zero to saturation (though sometimes not quite), and are fairly large resolution, so in theory these methods should work:
I start by reading in the image data:
Mframe = imread('signal.tif');
This method attempts to detect the number of bits that ever get used:
bits = 0;
for i = 1:16
Bframe = bitget(Mframe,i);
bits = bits + max(max(Bframe));
end
And this method attempts to find if there has been a scaling operation done:
Mframe = imread('signal.tif');
Dframe = diff(Mframe);
mindiff = min(min(nonzeros(Dframe)));
As a 3rd check I always look at the maximum value of my input image:
maxval = max(max(Mframe));
Please check my understanding here:
The value of maxval should be at 65532 in the case of a 16 bit image containing any saturation.
If the 12 or 14 bit data has been scaled to 16 bit, it should return maxval of 65532, a mindiff of 16 or 4 respectively, and bits as 16.
If the 12 or 14 bit data was stored directly with leading/trailing zeros, it can't return a maxval of 65532, mindiff should not return 16 or 4 (though it IS remotely possible), and bits should show as 12 or 14 respectively.
If an image is actually not reaching saturation, it can't return a maxval of 65532, mindiff should still act as described for the two cases above, and bits could possibly return as one lower than it otherwise would.
Am I correct in the above? If not please show me what I'm not understanding (I'm definitely not a computer scientist), because I seem to be getting data that conflicts with this.
Only one case appears to work just like I expect. I know the data to be 12 bit, and my testing shows maxval near 65532, mindiff of 16, and bits as 15. I can conclude that this image is not saturated and is a 12 bit scaled to 16 bit.
Another case for a different brand I know to have 12 bit output, and testing an image that I know isn't quite saturated gives me maxval of 61056, mindiff of 16, and bits as 12. ???
Yet another case, for yet again another brand, is known to have 14 bit output, and when I test an image I know to be saturated it gives me maxval of 65532, mindiff of 4, and bits as 15. ???
So very confused.
Well, after a lot of digging I finally figured it all out. I wrote some code to help me understand the differences between the different files and discovered that a couple of the cameras had "signatures" of sorts in them. I'm contacting the manufacturers for more information, but one in particular appears to be a timestamp that always occurs in the first 2 pixels.
Anyhow, I wrote the following code to fix the two issues I found and now everything is working peachy:
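Mframe = imread('signal.tiff');
minval = min(min(Mframe));
% Smallest nonzero step between vertically adjacent pixels
mindiff = min(min(nonzeros(diff(Mframe))));
fixbit = log2(double(mindiff));
if rem(fixbit,2) % Correct Brand A Issues (odd exponent, e.g. 8 or 2)
    fixbit = fixbit + 1; % bitget is 1-based, so this selects the bit worth mindiff
    Bframe = bitget(Mframe,fixbit);
    [x,y] = find(Bframe==1);
    for i=1:length(x)
        % Push the stray value up to the next valid step
        Mframe(x(i),y(i)) = Mframe(x(i),y(i)) + mindiff;
    end
end
for i=1:4 % Correct Brand B Timestamp
    Bframe = bitget(Mframe,i);
    if any(any(Bframe))
        % Overwrite the two timestamp pixels with the image minimum
        Mframe(1,1) = minval; Mframe(1,2) = minval;
    end
end
bits = 0; % initialise the counter before accumulating
for i = 1:16 % Get actual bit depth
    Bframe = bitget(Mframe,i);
    bits = bits + max(max(Bframe));
end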
As for the Brand A issues, that camera appears to have bad data in just a few pixels of every frame (not the same every time) where a value appears in a pixel that is a one bit lower difference than should be possible from the pixel below it. For example, in a 12 bit picture the minimum difference should be 16 and a 14 bit picture should have a minimum difference of 4, but they have values that are 8 and 2 lower than the pixel below them. Don't know why that's happening, but it was fairly simple to gloss over.
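For readers without MATLAB, the two checks can be sketched in plain Python on a tiny synthetic frame. The data and function names are made up for illustration, and the second function imitates diff followed by nonzeros on unsigned data, where (as far as I understand MATLAB's integer arithmetic) negative differences saturate to 0 and are then discarded.

```python
def bits_used(pixels):
    """Count how many of the 16 bit positions are set in at least one pixel."""
    seen = 0
    for row in pixels:
        for value in row:
            seen |= value
    return bin(seen & 0xFFFF).count("1")

def min_positive_row_diff(pixels):
    """Smallest positive difference between vertically adjacent pixels, or None."""
    best = None
    for upper, lower in zip(pixels, pixels[1:]):
        for a, b in zip(upper, lower):
            d = b - a
            if d > 0 and (best is None or d < best):
                best = d
    return best

raw12 = [[0, 5, 4095], [1, 7, 100]]                    # synthetic 12-bit samples
scaled = [[v * 16 for v in row] for row in raw12]      # same data scaled to 16 bit

print(bits_used(raw12), min_positive_row_diff(raw12))    # 12 1
print(bits_used(scaled), min_positive_row_diff(scaled))  # 12 16
```

Note that on this synthetic data the bit count is 12 in both cases, because multiplying by 16 leaves the low four bits at zero; it is the minimum difference (and the maximum value) that separates scaled from unscaled storage.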

Error detection code for 33 bytes, detecting bit flipped in first 32 bytes

Could you please suggest an error detection scheme for detecting
one possible bit flip in the first 32 bytes of a 33-byte message using
no more than 8 bits of additional data?
Could Pearson hashing be a solution?
Detecting a single bit-flip in any message requires only one extra bit, independent of the length of the message: simply xor together all the bits in the message and tack that on the end. If any single bit flips, the parity bit at the end won't match up.
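A minimal sketch of that parity bit in Python (the function name is invented for the example):

```python
def parity_bit(message: bytes) -> int:
    """XOR of every bit in the message: 1 if the number of 1-bits is odd."""
    folded = 0
    for byte in message:
        folded ^= byte            # XOR the bytes together first
    folded ^= folded >> 4         # then fold the 8 remaining bits into one
    folded ^= folded >> 2
    folded ^= folded >> 1
    return folded & 1

message = bytes(range(32))
stored = parity_bit(message)

corrupted = bytearray(message)
corrupted[17] ^= 0x08             # flip one bit somewhere in the message
print(parity_bit(bytes(corrupted)) != stored)   # True: the flip is detected
```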
If you're asking to detect which bit flipped, that can't be done, and a simple argument shows it: the extra eight bits can represent up to 256 classes of 32-byte messages, but the zero message and the 256 messages with one on bit each must all be in different classes. Thus, there are 257 messages which must be distinctly classified, and only 256 classes.
You can detect one bit flip with just one extra bit in a message of any length (as stated by @Daniel Wagner). The parity bit, simply put, indicates whether the total number of 1-bits is odd or even. Obviously, if the number of bits that are wrong is even, the parity check will still pass, so you cannot detect 2-bit errors.
Now, for a more accessible understanding of why you can't error-correct 32 bytes (256 bits) with just 8 bits, please read about the Hamming code (like used in ECC memory). Such a scheme uses special error-correcting parity bits (henceforth called "EC parity") that only encode the parity of a subset of the total number of bits. For every 2^m - 1 total bits, you need to use m EC bits. These represent each possible different mask following the pattern "x bits on, x bits off" where x is a power of 2. Thus, the larger the number of bits at once, the better the data/parity bit ratio you get. For example, 7 total bits would allow encoding only 4 data bits after losing 3 EC bits, but 31 total bits can encode 26 data bits after losing 5 EC bits.
Now, to really understand this will probably take an example. Consider the following set of masks. The first two rows are to be read top down, indicating the bit number (I have labeled the most significant bit MSB and the least significant bit LSB):
   MSB                              LSB
   |                                  |
   v                                  v
   33222222 22221111 11111100 0000000|0
   10987654 32109876 54321098 7654321|0
   -------- -------- -------- -------|-
1: 10101010 10101010 10101010 1010101|0
2: 11001100 11001100 11001100 1100110|0
3: 11110000 11110000 11110000 1111000|0
4: 11111111 00000000 11111111 0000000|0
5: 11111111 11111111 00000000 0000000|0
The first thing to notice is that the binary values for 0 to 31 are represented in each column going from right to left (reading the bits in rows 1 through 5). This means that each vertical column is different from each other one (the important part). I put a vertical extra line between bit numbers 0 and 1 for a particular reason: Column 0 is useless because it has no bits set in it.
To perform error-correcting, we will bitwise-AND the received data bits against each EC bit's predefined mask, then compare the resulting parity to the EC bit. For any calculated parities discovered to not match, find the column in which only those bits are set. For example, if error-correcting bits 1, 4, and 5 are wrong when calculated from the received data value, then column #25--containing 1s in only those masks--must be the incorrect bit and can be corrected by flipping it. If only a single error-correcting bit is wrong, then the error is in that error-correcting bit. Here's an analogy to help you understand why this works:
There are 32 identical boxes, with one containing a marble. Your task is to locate the marble using just an old-style scale (the kind with two balanced platforms to compare the weights of different objects) and you are only allowed 5 weighing attempts. The solution is fairly easy: you put 16 boxes on each side of the scale and the heavier side indicates which side the marble is on. Discarding the 16 boxes on the lighter side, you then weigh 8 and 8 boxes keeping the heavier, then 4 and 4, then 2 and 2, and finally locate the marble by comparing the weights of the last 2 boxes 1 to 1: the heaviest box contains the marble. You have completed the task in only 5 weighings of 32, 16, 8, 4, and 2 boxes.
Similarly, our bit patterns have divided up the boxes in 5 different groups. Going backwards, the fifth EC bit determines whether an error is on the left side or the right side. In our scenario with bit #25, it is wrong, so we know that the error bit is on the left side of the group (bits 16-31). In our next mask for EC bit #4 (still stepping backward), we only consider bits 16-31, and we find that the "heavier" side is the left one again, so we have narrowed down the bits 24-31. Following the decision tree downward and cutting the number of possible columns in half each time, by the time we reach EC bit 1 there is only 1 possible bit left--our "marble in a box".
Note: The analogy is useful, though not perfect: 1-bits are not represented by marbles--the erroring bit location is represented by the marble.
Now, some playing around with these masks and thinking about how to arrange things will reveal that there is a problem: if we try to make all 31 bits data bits, then we need 5 more bits for EC. But how, then, will we tell if the EC bits themselves are wrong? Just a single EC bit wrong will incorrectly tell us that some data bit needs correction, and we'll wrongly flip that data bit. The EC bits have to somehow encode for themselves! The solution is to position the parity bits inside of the data, in columns from the bit patterns above where only one bit is set. This way, any data bit being wrong will trigger at least two EC bits to be wrong, so that if only one EC bit is wrong, we know that it is itself wrong rather than signifying that a data bit is wrong. The columns that satisfy the one-bit condition are 1, 2, 4, 8, and 16. The data bits will be interleaved between these, starting at position 3. (Remember, we are not using position 0 as it would never provide any information--none of our EC bits would be set at all).
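The layout just described (EC bits at positions 1, 2, 4, 8 and 16, data everywhere else in 1..31) can be sketched as below. Checking each mask's parity is equivalent to XOR-ing together the positions of all 1-bits, which is what `hamming_syndrome` does; the function names are invented for this example and the overall parity bit at position 0 is left out.

```python
def hamming_syndrome(code):
    """XOR of the positions of all 1-bits; 0 means every mask's parity checks out."""
    syndrome = 0
    for position in range(1, len(code)):
        if code[position]:
            syndrome ^= position
    return syndrome

def hamming_encode(data_bits, n=31):
    """Place data at non-power-of-two positions 1..n, then fill the EC positions."""
    data_positions = [p for p in range(1, n + 1) if p & (p - 1)]
    if len(data_bits) != len(data_positions):
        raise ValueError(f"expected {len(data_positions)} data bits")
    code = [0] * (n + 1)                  # index 0 is left unused
    for position, bit in zip(data_positions, data_bits):
        code[position] = bit
    syndrome = hamming_syndrome(code)     # EC positions are still 0 here
    ec_position = 1
    while ec_position <= n:
        code[ec_position] = 1 if syndrome & ec_position else 0
        ec_position <<= 1
    return code

def hamming_correct(code):
    """Flip the bit named by the syndrome (assumes at most one flipped bit)."""
    syndrome = hamming_syndrome(code)
    fixed = list(code)
    if 0 < syndrome < len(code):
        fixed[syndrome] ^= 1
    return fixed, syndrome

word = hamming_encode([1, 0] * 13)        # 26 data bits
damaged = list(word)
damaged[25] ^= 1                          # EC bits 1, 4 and 5 now disagree
fixed, where = hamming_correct(damaged)
print(where, fixed == word)               # 25 True
```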
Finally, adding one more bit for overall parity will allow detecting 2-bit errors and reliably correcting 1-bit errors, as we can then compare the EC bits to it: if the EC bits say something is wrong, but the parity bit says otherwise, we know there are 2 bits wrong and cannot perform correction. We can use the discarded bit #0 as our parity bit! In fact, now we are encoding the following pattern:
0: 11111111 11111111 11111111 11111111
This gives us a final total of 6 Error-Checking and Correcting (ECC) bits. Extending the scheme of using different masks indefinitely looks like this:
32 bits - 6 ECC bits = 26 data
64 bits - 7 ECC bits = 57 data
128 bits - 8 ECC bits = 120 data
256 bits - 9 ECC bits = 247 data
512 bits - 10 ECC bits = 502 data
Now, if we are sure that we only will get a 1-bit error, we can dispense with the #0 parity bit, so we have the following:
31 bits - 5 ECC bits = 26 data
63 bits - 6 ECC bits = 57 data
127 bits - 7 ECC bits = 120 data
255 bits - 8 ECC bits = 247 data
511 bits - 9 ECC bits = 502 data
This is no change because we don't get any more data bits. Oops! 32 bytes (256 bits) as you requested cannot be error-corrected with a single byte, even if we know we can have only a 1-bit error at worst, and we know the ECC bits will be correct (allowing us to move them out of the data region and use them all for data). We need TWO more bits than we have--one must slide up to the next range of 512 bits, then leave out 246 data bits to get our 256 data bits. So that's one more ECC bit AND one more data bit (as we only have 255, exactly what Daniel told you).
Summary: You need 33 bytes + 1 bit to detect which bit flipped in the first 32 bytes.
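The same conclusion follows from a counting condition: r check bits give 2^r syndromes, and these must cover "no error" plus every position that can flip. A small sketch of that condition (the function and parameter names are made up for the example):

```python
def min_check_bits(data_bits, check_bits_can_fail=True):
    """Smallest r with 2**r >= (number of positions that may flip) + 1."""
    r = 0
    while True:
        positions = data_bits + (r if check_bits_can_fail else 0)
        if 2 ** r >= positions + 1:
            return r
        r += 1

print(min_check_bits(256))                             # 9
print(min_check_bits(256, check_bits_can_fail=False))  # 9
print(min_check_bits(512))                             # 10
```

Even when the check bits themselves are assumed safe, 256 data bits need 257 distinct syndromes, so 8 bits are one short.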
Note: if you are going to send 64 bytes, then you're under the 32:1 ratio, as you can error correct that in just 10 bits. But in real-world applications, the "frame size" of your ECC can't keep going up indefinitely, for a few reasons: 1) The number of bits being worked with at once may be much smaller than the frame size, leading to gross inefficiencies (think ECC RAM). 2) The chance of being able to accurately correct a bit gets less and less, since the larger the frame, the greater the chance it will have more errors, and 2 errors defeat error-correction ability, while 3 or more can defeat even error-detection ability. 3) Once an error is detected, the larger the frame size, the larger the size of the corrupted piece that must be retransmitted.
If you need to use a whole byte instead of a bit, and you only need to detect errors, then the standard solution is to use a cyclic redundancy check (CRC). There are several well-known 8-bit CRCs to choose from.
A typical fast implementation of a CRC uses a table with 256 entries to handle a byte of the message at a time. For the case of an 8 bit CRC this is a special case of Pearson's algorithm.
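As an illustration, here is a sketch of an 8-bit CRC in both bitwise and table-driven form. The polynomial 0x07 with a zero initial value is an assumed choice (one of several common 8-bit CRCs), not something the question specifies. The table-driven loop has the Pearson shape h = T[h xor byte].

```python
POLY = 0x07  # x^8 + x^2 + x + 1; an assumed, commonly used 8-bit polynomial

def crc8_bitwise(data, crc=0):
    """MSB-first CRC-8, processing one bit at a time."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ POLY) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

# 256-entry table: the CRC of every possible single byte
TABLE = [crc8_bitwise(bytes([i])) for i in range(256)]

def crc8_table(data, crc=0):
    """Same CRC, one table lookup per message byte."""
    for byte in data:
        crc = TABLE[crc ^ byte]
    return crc

message = bytes(range(33))
check = crc8_table(message)
damaged = bytearray(message)
damaged[5] ^= 0x10                        # single bit flip in the first 32 bytes
print(crc8_table(bytes(damaged)) != check)   # True
```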

Encoding / Error Correction Challenge

Is it mathematically feasible to encode an initial 4 byte message into 8 bytes such that, if one of the 8 bytes is completely dropped and another is wrong, the initial 4 byte message can still be reconstructed? There would be no way to retransmit, nor would the location of the dropped byte be known.
If one uses Reed Solomon error correction with 4 "parity" bytes tacked on to the end of the 4 "data" bytes, such as DDDDPPPP, and you end up with DDDEPPP (where E is an error) and a parity byte has been dropped, I don't believe there's a way to reconstruct the initial message (although correct me if I am wrong)...
What about multiplying (or performing another mathematical operation) the initial 4 byte message by a constant, then utilizing properties of an inverse mathematical operation to determine what byte was dropped. Or, impose some constraints on the structure of the message so every other byte needs to be odd and the others need to be even.
Alternatively, instead of bytes, it could also be 4 decimal digits encoded in some fashion into 8 decimal digits where errors could be detected & corrected under the same circumstances mentioned above - no retransmission and the location of the dropped byte is not known.
I'm looking for any crazy ideas anyone might have... Any ideas out there?
EDIT:
It may be a bit contrived, but the situation that I'm trying to solve is one where you have, let's say, a faulty printer that prints out important numbers onto a form, which are then mailed off to a processing firm which uses OCR to read the forms. The OCR isn't going to be perfect, but it should get close with only digits to read. The faulty printer could be a bigger problem, where it may drop a whole number, but there's no way of knowing which one it'll drop, but they will always come out in the correct order, there won't be any digits swapped.
The form could be altered so that it always prints a space between the initial four numbers and the error correction numbers, ie 1234 5678, so that one would know whether a 1234 initial digit was dropped or a 5678 error correction digit was dropped, if that makes the problem easier to solve. I'm thinking somewhat similar to how they verify credit card numbers via algorithm, but in four digit chunks.
Hopefully, that provides some clarification as to what I'm looking for...
In the absence of "nice" algebraic structure, I suspect that it's going to be hard to find a concise scheme that gets you all the way to 10**4 codewords, since information-theoretically, there isn't a lot of slack. (The one below can use GF(5) for 5**5 = 3125.) Fortunately, the problem is small enough that you could try Shannon's greedy code-construction method (find a codeword that doesn't conflict with one already chosen, add it to the set).
Encode up to 35 bits as a quartic polynomial f over GF(128). Evaluate the polynomial at eight predetermined points x0,...,x7 and encode as 0f(x0) 1f(x1) 0f(x2) 1f(x3) 0f(x4) 1f(x5) 0f(x6) 1f(x7), where the alternating zeros and ones are stored in the MSB.
When decoding, first look at the MSBs. If the MSB doesn't match the index mod 2, then that byte is corrupt and/or it's been shifted left by a deletion. Assume it's good and shift it back to the right (possibly accumulating multiple different possible values at a point). Now we have at least seven evaluations of a quartic polynomial f at known points, of which at most one is corrupt. We can now try all possibilities for the corruption.
EDIT: bmm6o has advanced the claim that the second part of my solution is incorrect. I disagree.
Let's review the possibilities for the case where the MSBs are 0101101. Suppose X is the array of bytes sent and Y is the array of bytes received. On one hand, Y[0], Y[1], Y[2], Y[3] have correct MSBs and are presumed to be X[0], X[1], X[2], X[3]. On the other hand, Y[4], Y[5], Y[6] have incorrect MSBs and are presumed to be X[5], X[6], X[7].
If X[4] is dropped, then we have seven correct evaluations of f.
If X[3] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 3, and six correct evaluations.
If X[5] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 5, and six correct evaluations.
There are more possibilities besides these, but we never have fewer than six correct evaluations, which suffices to recover f.
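The case analysis can be automated by trying every possible dropped position and keeping those where at most one tag bit disagrees. This is only a sketch of the bookkeeping, assuming exactly one dropped byte and at most one corrupted byte; `drop_hypotheses` is a name made up here.

```python
def drop_hypotheses(received_msbs, sent_length=8):
    """List (dropped_index, corrupted_index) pairs consistent with the received tag bits.

    corrupted_index is None when the tags alone do not point at a corrupted byte.
    """
    hypotheses = []
    for dropped in range(sent_length):
        kept = [i for i in range(sent_length) if i != dropped]
        mismatches = [i for i, bit in zip(kept, received_msbs) if bit != i % 2]
        if len(mismatches) <= 1:
            hypotheses.append((dropped, mismatches[0] if mismatches else None))
    return hypotheses

print(drop_hypotheses([0, 1, 0, 1, 1, 0, 1]))
# [(3, 4), (4, None), (5, 4)]
```

Each surviving hypothesis fixes which evaluation points the seven received values belong to, after which the polynomial can be fitted and checked.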
I think you would need to study what erasure codes might offer you. I don't know any bounds myself, but maybe some kind of MDS code might achieve this.
EDIT: After a quick search I found RSCode library and in the example it says that
In general, with E errors, and K erasures, you will need
* 2E + K bytes of parity to be able to correct the codeword
* back to recover the original message data.
So looks like Reed-Solomon code is indeed the answer and you may actually get recovery from one erasure and one error in 8,4 code.
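The quoted rule is easy to check for a given code size. The sketch below treats the dropped byte as an erasure, which presumes its position has already been identified by some other means; `rs_can_correct` is a hypothetical helper, not part of the RSCode library.

```python
def rs_can_correct(n, k, errors, erasures):
    """True if 2E + K fits within the n - k parity symbols."""
    return 2 * errors + erasures <= n - k

print(rs_can_correct(8, 4, errors=1, erasures=1))   # True: needs 3 of 4 parity bytes
print(rs_can_correct(8, 4, errors=2, erasures=1))   # False: would need 5
```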
Parity codes work as long as two different data bytes aren't affected by error or loss and as long as error isn't equal to any data byte while a parity byte is lost, imho.
Error correcting codes can in general handle erasures, but in the literature the position of the erasure is assumed known. In most cases, the erasure will be introduced by the demodulator when there is low confidence that the correct data can be retrieved from the channel. For instance, if the signal is not clearly 0 or 1, the device can indicate that the data was lost, rather than risking the introduction of an error. Since an erasure is essentially an error with a known position, they are much easier to fix.
I'm not sure what your situation is where you can lose a single value and you can still be confident that the remaining values are delivered in the correct order, but it's not a situation classical coding theory addresses.
What algorithmist is suggesting above is this: If you can restrict yourself to just 7 bits of information, you can fill the 8th bit of each byte with alternating 0 and 1, which will allow you to know the placement of the missing byte. That is, put a 0 in the high bit of bytes 0, 2, 4, 6 and a 1 in the high bits of the others. On the receiving end, if you only receive 7 bytes, the missing one will have been dropped from between bytes whose high bits match. Unfortunately, that's not quite right: if the erasure and the error are adjacent, you can't know immediately which byte was dropped. E.g., high bits 0101101 could result from dropping the 4th byte, or from an error in the 4th byte and dropping the 3rd, or from an error in the 4th byte and dropping the 5th.
You could use the linear code:
1 0 0 0 0 1 1 1
0 1 0 0 1 0 1 1
0 0 1 0 1 1 0 1
0 0 0 1 1 1 1 0
(i.e. you'll send data like (a, b, c, d, b+c+d, a+c+d, a+b+d, a+b+c) (where addition is implemented with XOR, since a,b,c,d are elements of GF(128))). It's a linear code with distance 4, so it can correct a single-byte error. You can decode with syndrome decoding, and since the code is self-dual, the matrix H will be the same as above.
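A sketch of this code with syndrome decoding for a single wrong byte follows. Symbols are 7-bit integers and addition is XOR; because the rows contain only 0s and 1s, a single error e at position j produces a syndrome equal to e times column j, so the decoder looks for the column whose pattern matches the nonzero syndrome entries. Function names are invented for the example.

```python
H = [
    (1, 0, 0, 0, 0, 1, 1, 1),
    (0, 1, 0, 0, 1, 0, 1, 1),
    (0, 0, 1, 0, 1, 1, 0, 1),
    (0, 0, 0, 1, 1, 1, 1, 0),
]  # generator and parity-check matrix at once (the code is self-dual)

def encode(a, b, c, d):
    """(a, b, c, d) followed by the four XOR check symbols."""
    return [a, b, c, d, b ^ c ^ d, a ^ c ^ d, a ^ b ^ d, a ^ b ^ c]

def syndrome(word):
    """One XOR sum per row of H; all zeros for a valid code word."""
    result = []
    for row in H:
        s = 0
        for flag, symbol in zip(row, word):
            if flag:
                s ^= symbol
        result.append(s)
    return result

def correct_single_error(word):
    """Repair at most one wrong symbol; raise if the syndrome fits no single error."""
    s = syndrome(word)
    if not any(s):
        return list(word)
    values = {v for v in s if v}
    pattern = tuple(1 if v else 0 for v in s)
    if len(values) == 1:
        error = values.pop()
        for j in range(8):
            if tuple(row[j] for row in H) == pattern:
                fixed = list(word)
                fixed[j] ^= error
                return fixed
    raise ValueError("more than one symbol is wrong")

sent = encode(5, 100, 0, 127)
received = list(sent)
received[6] ^= 0x2A
print(correct_single_error(received) == sent)   # True
```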
In the case where there's a dropped byte, you can use the technique above to determine which one it is. Once you've determined that, you're essentially decoding a different code - the "punctured" code created by dropping that given byte. Since the punctured code is still linear, you can use syndrome decoding to determine the error. You would have to calculate the parity-check matrix for each of the punctured codes, but you can do this ahead of time. The punctured code has distance 3, so it can correct any single-byte error.
In the case of decimal digits, assuming one goes with first digit odd, second digit even, third digit odd, etc - with two digits, you get 00-99, which can be represented in 3 odd/even/odd digits (125 total combinations) - 00 = 101, 01 = 103, 20 = 181, 99 = 789, etc. So one encodes two sets of decimal digits into 6 total digits, then the last two digits signify things about the first sets of 2 digits or a checksum of some sort... The next to last digit, I suppose, could be some sort of odd/even indicator on each of the initial 2 digit initial messages (1 = even first 2 digits, 3 = odd first two digits) and follow the pattern of being odd. Then, the last digit could be the one's place of a sum of the individual digits, that way if a digit was missing, it would be immediately apparent and could be corrected assuming the last digit was correct. Although, it would throw things off if one of the last two digits were dropped...
It looks to be theoretically possible if we assume 1 bit error in wrong byte. We need 3 bits to identify dropped byte and 3 bits to identify wrong byte and 3 bits to identify wrong bit. We have 3 times that many extra bits.
But if we need to identify any number of bits error in wrong byte, it comes to 30 bits. Even that looks to be possible with 32 bits, although 32 is a bit too close for my comfort.
But I don't know how to encode to get that. Try turbo codes?
Actually, as Krystian said, when you correct an RS code, both the message AND the "parity" bytes will be corrected, as long as v + 2e <= (n-k), where v is the number of erasures (whose positions you know) and e is the number of errors. This means that if you only have errors, you can correct up to (n-k)/2 errors, or up to (n-k) erasures (twice the number of errors), or a mix of both (see Blahut's article: Transform techniques for error control codes and A universal Reed-Solomon decoder).
What's even nicer is that you can check that the correction was successful: if the syndrome polynomial contains only 0 coefficients, you know that the message and parity bytes are both correct. You can run this check beforehand to see whether the message needs any correction, and again after decoding to confirm that both the message and the parity bytes were completely repaired.
The bound v + 2e <= (n-k) is optimal, you cannot do better (that's why Reed-Solomon is called an optimal error correction code). In fact it's possible to go beyond this limit using brute-force approaches, up to a certain point (you can gain 1 or 2 more symbols for each 8 symbols) using list decoding, but it's still a domain in its infancy, and I don't know of any practical implementation that works.

Resources