Related
From the point of view of very low level programming, how is performed the comparison between two numbers?
Using one byte, unsigned numbers 0, 1 and 255 are written:
0 -----> 00000000
1 -----> 00000001
255 ---> 11111111
Now, what happens during the comparison between these numbers?
Using my vision as a human having learned basic programming, I could imagine the following algorithm about == implementation:
b = 0
while b < 8:
if first_number[b] != second_number[b]:
return False
b += 1
return True
Basically this is like comparing each bit step by step, and stop before the end if two bits are different.
Thus we note that the comparison stops at the first iteration compared 0 and 255, while it stops at the last if 0 and 1 are compared.
The first comparison would be 8 times faster than the second.
In practice, I doubt that is the case. But is this theoretically true?
If not, how does the computer work?
A comparison between integers is tipically implemented by the cpu as a subtraction, whose result sign contains information about which number is bigger.
While a naive implementation of subtraction executes one bit at a time (because every bit needs to know the carry of the preceding one), tipical implementation use a carry-lookahead circuit that allows the calculation of more result bits at the same time.
So, the answer is: no, every comparison takes almost the same time for every possible input.
Hardware is fundamentally different from the dominant programming paradigms in that all logic gates (or circuits in general) always do their work independently, in parallel, at all times. There is no such thing as "do this, then do that", only "do this here, feed the result into the circuit over there". If there's a circuit on the chip with input A and output B, then the circuit always, continuously, updates B in accordance with the current values of A — regardless of whether the result is needed right now "in the big picture".
Your pseudo code algorithm doesn't even begin to map to logic gates well. Instead, a comparator looks like this in Verilog (ignoring that there's a built-in == operator):
assign not_equal = (a[0] ^ b[0]) | (a[1] ^ b[1]) | ...;
Where each XOR is a separate logic gate and hence works independently from the others. The results are "reduced" with a logical or, i.e. the output is 1 if any of the XORs produces a 1 (this too does some work in parallel, but the critical path is longer than one gate). Furthermore, all these gates exist in silicon regardless of the specific bit values, and the signal has to propagate through about (1 + log w) gates for a w-bit integer. This propagation delay is again independent of the intermediate and final results.
On some CPU families, equality comparison is implemented by subtracting the two numbers and comparing the result to zero (using a circuit as described above), but the same principle applies. An adder/subtracter doesn't get slower or faster depending on the values.
Not to mention that instructions in a CPU can't take less than one clock cycle anyway, so even if the hardware would finish more quickly, the next instruction still wouldn't start until the next tick.
Now, some hardware operations can take a variable amount of time, but that's because they are state machines, i.e. sequential logic. Technically one could implement the moral equivalent of your algorithm with a state machine, but nobody does that, it's harder to implement than the naive, un-optimized combinatorial circuit above, and less efficient to boot.
State machine circuits are circuits with memory: They store their current state and always compute the outputs (depending on the current state) and the next state (depending on current state and inputs) each clock cycle. On some inputs they may go through N states until they produce an output, and N+x on other inputs. ALU operations generally don't do that though. Pipeline stalls, branch mispredictions, and cache misses are common reasons one instruction takes longer than usual in some circumstances. Properly reasoning about these in a way that helps programmers write faster code is hard though: You have to take into account all the tricky and quirks of real hardware, and there's a lot of those. Empirical evidence, i.e. benchmarking a real black box CPU, is vital.
When it gets down to the assembly the cmp instruction is used regardless of the contents of the variables.
So there is no performance difference.
Several Google Maps products have the notion of polylines, which in terms of underlying data is basically just a sequence of lat/lng points that might for example manifest in a line drawn on a map. The Google Map developer libraries make use of an encoded polyline format that churns out an ASCII string representing the points making up the polyline. This encoded format is then typically decoded with a built in function of the Google libraries or a function written by a third party that implements the decoding algorithm.
The algorithm for encoding polyline points is described in the Encoded Polyline Algorithm Format document. What is not described is the rationale for implementing the algorithm this way, and the significance of each of the individual steps. I'm interested to know whether the thinking/purpose behind implementing the algorithm this way is publicly described anywhere. Two example questions:
Do some of the steps have a quantifiable impact on compression and how does this impact vary as a function of the delta between points?
Is the summing of values with ASCII 63 a compatibility hack of some sort?
But just in general, a description to go along with the algorithm explaining why the algorithm is implemented the way it is.
Update: This blog post from James Snook also has the 'valid ascii' range argument and reads logically for other steps I wondered. E.g. the left shifting before storing which makes place for the negative bit as the first bit.
Some explanations I found, not sure if everything is 100% correct.
One double value is stored in multiple 5 bits chunks and 0x20 (binary '0010 0000') is used as indication that the next 5 bit entry belongs to the current double.
0x1f (binary '0001 1111') is used as bit mask to throw away other bits
I expect that 5 bits are used because the delta of lat or lons are in this range. So that every double value takes only 5 bits on average when done for a lot of examples (but not verified yet).
Now, compression is done by assuming nearby double values are very close and creating the difference is nearly 0, so that the results fits in a few bytes. Then this result is stored in a dynamic fashion: store 5 bits and if the value is longer mark with 0x20 and store the next 5 bits and so on. So I guess you can tweak the compression if you try 6 or 4 bits but I guess 5 is a practically reasonable choice.
Now regarding the magic 63, this is 0x3f and binary 0011 1111. I'm not sure why they add it. I thought that adding 63 will give some 'better' asci characters (e.g. allowed in XML or in URL) as we skip e.g. 62 which is > but 63 which is ? is really better? At least the first ascii chars are not displayable and have to be avoided. Note that if one would use 64 then one would hit the ascii char 127 for the maximum value of 31 (31+64+32) and this char is not defined in html4. Or is because of a signed char is going from -128 to 127 and we need to store the negative numbers as positive, thus adding the maximum possible negative number?
Just for me: here is a link to an official Java implementation with Apache License
An old idea, but ever since then I couldn't get around finding some reasonably good way to solve the problem it raised. So I "invented" (see below) a very compact, and in my opinion, reasonably well performing PRNG, but I can't get to figure out algorithms to build suitable seed values for it at large bit depths. My current solution is simply brute-forcing, it's running time is O(n^3).
The generator
My idea came from XOR taps (essentially LFSRs) some old 8bit machines used for sound generation. I fiddled with XOR as a base on a C64, tried to put together opcodes, and experienced with the result. The final working solution looked like this:
asl
adc #num1
eor #num2
This is 5 bytes on the 6502. With a well chosen num1 and num2, in the accumulator it iterates over all 256 values in a seemingly random order, that is, it looks reasonably random when used to fill the screen (I wrote a little 256b demo back then on this). There are 40 suitable num1 & num2 pairs for this, all giving decent looking sequences.
The concept can be well generalized, if expressed in pure C, it may look like this (BITS being the bit depth of the sequence):
r = (((r >> (BITS-1)) & 1U) + (r << 1) + num1) ^ num2;
r = r & ((1U<<BITS)-1U);
This C code is longer since it is generalized, and even if one would use the full depth of an unsigned integer, C wouldn't have the necessary carry logic to transfer the high bit of the shift to the add operation.
For some performance analysis and comparisons, see below, after the question(s).
The problem / question(s)
The core problem with the generator is finding suitable num1 and num2 which would make it iterate over the whole possible sequence of a given bit depth. At the end of this section I attach my code which just brute-forces it. It will finish in reasonable time for up to 12 bits, you may wait for all 16 bits (there are 5736 possible pairs for that by the way, acquired with an overnight full search a while ago), and you may get a few 20 bits if you are patient. But O(n^3) is really nasty...
(Who will get to find the first full 32bit sequence?)
Other interesting questions which arise:
For both num1 and num2 only odd values are able to produce full sequences. Why? This may not be hard (simple logic, I guess), but I never reasonably proved it.
There is a mirroring property along num1 (the add value), that is, if 'a' with a given 'b' num2 gives a full sequence, then the 2 complement of 'a' (in the given bit depth) with the same num2 is also a full sequence. I only observed this happening reliably with all the full generations I calculated.
A third interesting property is that for all the num1 & num2 pairs the resulting sequences seem to form proper circles, that is, at least the number zero seems to be always part of a circle. Without this property my brute force search would die in an infinite loop.
Bonus: Was this PRNG already known before? (and I just re-invented it)?
And here is the brute force search's code (C):
#define BITS 16
#include "stdio.h"
#include "stdlib.h"
int main(void)
{
unsigned int r;
unsigned int c;
unsigned int num1;
unsigned int num2;
unsigned int mc=0U;
num1=1U; /* Only odd add values produce useful results */
do{
num2=1U; /* Only odd eor values produce useful results */
do{
r= 0U;
c=~0U;
do{
r=(((r>>(BITS-1)) & 1U)+r+r+num1)^num2;
r&=(1U<<(BITS-1)) | ((1U<<(BITS-1))-1U); /* 32bit safe */
c++;
}while (r);
if (c>=mc){
mc=c;
printf("Count-1: %08X, Num1(adc): %08X, Num2(eor): %08X\n", c, num1, num2);
}
num2+=2U;
num2&=(1U<<(BITS-1)) | ((1U<<(BITS-1))-1U);
}while(num2!=1U);
num1+=2U;
num1&=((1U<<(BITS-1))-1U); /* Do not check complements */
}while(num1!=1U);
return 0;
}
This, to show it is working, after each iteration will output the pair found if it's sequence length is equal or longer than the previous. Modify the BITS constant for sequences of other depths.
Seed hunting
I did some graphing relating to the seeds. Here is a nice image showing all the 9bit sequence lengths:
The white dots are the full length sequences, X axis is for num1 (add), Y axis is for num2 (xor), the brighter the dot, the longer the sequence. Other bit depth look very similar in pattern: they all seem to be broken up to sixteen major tiles with two patterns repeating with mirroring. The similarity of the tiles is not complete, for example above a diagonal from the up-left corner to the bottom-right is clearly visible while it's opposite is absent, but for the full-length sequences this property seems to be reliable.
Relying on this it is possible to reduce the work even more than by the previous assumptions, but that's still O(n^3)...
Performance analysis
As of current the longest sequences possible to be generated are 24bits: on my computer it takes at about 5 hours to brute-force a full 24bit sequence for this. This is still just so-so for real PRNG tests such as Diehard, so as of now I rather gone by an own approach.
First it's important to understand the role of the generator. This by no means would be a very good generator for it's simplicity, it's goal is rather to produce decent numbers blazing fast. On this region not needing multiply / divide operations, a Galois LFSR can produce similar performance. So my generator is of any use if it is capable to outperform this one.
The test I performed were all of 16bit generators. I chose this depth since it gives an useful sequence length while the numbers may still be broken up in two 8bit parts making it possible to present various bit-exact graphs for visual analysis.
The core of the tests were looking for correlations along previous and currently generated numbers. For this I used X:Y plots where the previous generation was the Y, the current the X, both broken up to low / high parts as above mentioned for two graphs. I created a program capable of plotting these stepped in real time so to also make it possible to roughly examine how the numbers follow each other, how the graphs fill up. Here obviously only the end results are shown as the generators ran through their full 2^16 or 2^16-1 (Galois) cycle.
The explanation of the fields:
The images consist 8x2 256x256 graphs making the total image size 2048x512 (check them at original size).
The top left graph just confirms that indeed a full sequence was plotted, it is simply an X = r % 256; Y = r / 256; plot.
The bottom left graph shows every second number only plotted the same way as the top, just confirming that the numbers occur reasonably randomly.
From the second graph the top row are the high byte correlation graphs. The first of them uses the previous generation, the next skips one number (so uses 2nd previous generation), and so on until the 7th previous generation.
From the second the bottom row are the low byte correlation graphs, organized the same way as above.
Galois generator, 0xB400 tap set
This is the generator found in the Wikipedia Galois example. It's performance is not the worst, but it is still definitely not really good.
Galois generator, 0xA55A tap set
One of the decent Galois "seeds" I found. Note that the low part of the 16bit numbers seem to be a lot better than the above, however I couldn't find any Galois "seed" which would fuzz up the high byte.
My generator, 0x7F25 (adc), 0x00DB (eor) seed
This is the best of my generators where the high byte of the EOR value is zero. Limiting the high byte is useful on 8bit machines since then this calculation can be omitted for smaller code and faster execution if the loss of randomness performance is affordable.
My generator, 0x778B (adc), 0x4A8B (eor) seed
This is one of the very good quality seeds by my measurements.
To find seeds with good correlation, I built a small program which would analyse them to some degree, the same way for Galois and mine. The "good quality" examples were pinpointed by that program, and then I tested several of them and selected one from those.
Some conclusions:
The Galois generator seems to be more rigid than mine. On all the correlation graphs definite geometrical patterns are observable (some seeds produce "checkerboard" patterns, not shown here) even if it is not composed of lines. My generator also shows patterns, but with more generations they grow less defined.
A portion of the Galois generator's result which include the bits in the high byte seems to be inherently rigid which property seems to be absent from my generator. This is a weak assumption yet probably needing some more research (to see if this is always so with the Galois generator and not with mine on other bit combinations).
The Galois generator lacks zero (maximal period being 2^16-1).
As of now it is impossible to generate a good set of seeds for my generator above 20 bits.
Later I might get in this subject deeper seeking to test the generator with Diehard, but as of now the lack of the ability of generating large enough seeds for it makes it impossible.
This is some form of a non-linear shift feedback register. I don't know if it has been used as such, but it resembles linear shift feedback registers somewhat. Read this Wikipedia page as an introduction to LSFRs. They are used frequently in pseudo random number generation.
However, your pseudo random number generator is inherently bad in that there is a linear correlation between the highest order bit of a previously generated number and the lowest order bit of a number generated next. You shift the highest bit B out, and then the lowest order bit of the new number will be the XOR or B, the lowest order bit of the additive constant num1 and the lowest order bit of the XORed constant num2, because binary addition is equivalent to exclusive or at the lowest order bit. Most likely your PRNG has other similar deficiencies. Creating good PRNGs is hard.
However, I must admit that the C64 code is pleasingly compact!
Can anyone please explain arithmetic encoding for data compression with implementation details ? I have surfed through internet and found mark nelson's post but the implementation's technique is indeed unclear to me after trying for many hours.
Mark nelson's explanation on arithmetic coding can be located at
http://marknelson.us/1991/02/01/arithmetic-coding-statistical-modeling-data-compression/
The main idea with arithmetic compression is its the capability to code a probability using the exact amount of data length required.
This amount of data is known, proven by Shannon, and can be calculated simply by using the following formula : -log2(p)
For example, if p=50%, then you need 1 bit.
And if p=25%, you need 2 bits.
That's simple enough for probabilities which are power of 2 (and in this special case, huffman coding could be enough). But what if the probability is 63% ? Then you need -log2(0.63) = 0.67 bits. Sounds tricky...
This property is especially important if your probability is high. If you can predict something with a 95% accuracy, then you only need 0.074 bits to represent a good guess. Which means you are going to compress a lot.
Now, how to do that ?
Well, it's simpler than it sounds. You will divide your range depending on probabilities. For example, if you have a range of 100, 2 possible events, and a probability of 95% for the 1st one, then the first 95 values will say "Event 1", and the last 5 remaining values will say "Event 2".
OK, but on computers, we are accustomed to use powers of 2. For example, with 16 bits, you have a range of 65536 possible values. Just do the same : take the 1st 95% of the range (which is 62259) to say "Event 1", and the rest to say "Event 2". You obviously have a problem of "rounding" (precision), but as long as you have enough values to distribute, it does not matter too much. Furthermore, you are not constrained to 2 events, you could have a myriad of events. All that matters is that values are allocated depending on the probabilities of each event.
OK, but now i have 62259 possible values to say "Event 1", and 3277 to say "Event 2". Which one should i choose ?
Well, any of them will do. Wether it is 1, 30, 5500 or 62256, it still means "Event 1".
In fact, deciding which value to select will not depend on the current guess, but on the next ones.
Suppose i'm having "Event 1". So now i have to choose any value between 0 and 62256. On next guess, i have the same distribution (95% Event 1, 5% Event 2). I will simply allocate the distribution map with these probabilities. Except that this time, it is distributed over 62256 values. And we continue like this, reducing the range of values with each guess.
So in fact, we are defining "ranges", which narrow with each guess. At some point, however, there is a problem of accuracy, because very little values remain.
The idea, is to simply "inflate" the range again. For example, each time the range goes below 32768 (2^15), you output the highest bit, and multiply the rest by 2 (effectively shifting the values by one bit left). By continuously doing like this, you are outputting bits one by one, as they are being settled by the series of guesses.
Now the relation with compression becomes obvious : when the range are narrowed swiftly (ex : 5%), you output a lot of bits to get the range back above the limit. On the other hand, when the probability is very high, the range narrow very slowly. You can even have a lot of guesses before outputting your first bits. That's how it is possible to compress an event to "a fraction of a bit".
I've intentionally used the terms "probability", "guess", "events" to keep this article generic. But for data compression, you just to replace them with the way you want to model your data. For example, the next event can be the next byte; in this case, you have 256 of them.
Maybe this script could be useful to build a better mental model of arithmetic coder: gen_map.py. Originally it was created to facilitate debugging of arithmetic coder library and simplify generation of unit tests for it. However it creates nice ASCII visualizations that also could be useful in understanding arithmetic coding.
A small example. Imagine we have an alphabet of 3 symbols: 0, 1 and 2 with probabilities 1/10, 2/10 and 7/10 correspondingly. And we want to encode sequence [1, 2]. Script will give the following output (ignore -b N option for now):
$ ./gen_map.py -b 6 -m "1,2,7" -e "1,2"
000000111111|1111|111222222222222222222222222222222222222222222222
------011222|2222|222000011111111122222222222222222222222222222222
---------011|2222|222-------------00011111122222222222222222222222
------------|----|-------------------------00111122222222222222222
------------|----|-------------------------------01111222222222222
------------|----|------------------------------------011222222222
==================================================================
000000000000|0000|000000000000000011111111111111111111111111111111
000000000000|0000|111111111111111100000000000000001111111111111111
000000001111|1111|000000001111111100000000111111110000000011111111
000011110000|1111|000011110000111100001111000011110000111100001111
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
001100110011|0011|001100110011001100110011001100110011001100110011
010101010101|0101|010101010101010101010101010101010101010101010101
First 6 lines (before ==== line) represent a range from 0.0 to 1.0 which is recursively subdivided on intervals proportional to symbol probabilities. Annotated first line:
[1/10][ 2/10 ][ 7/10 ]
000000111111|1111|111222222222222222222222222222222222222222222222
Then we subdivide each interval again:
[ 0.1][ 0.2 ][ 0.7 ]
000000111111|1111|111222222222222222222222222222222222222222222222
[ 0.7 ][.1][ 0.2 ][ 0.7 ]
------011222|2222|222000011111111122222222222222222222222222222222
[.1][ .2][ 0.7 ]
---------011|2222|222-------------00011111122222222222222222222222
Note, that some intervals are not subdivided. That happens when there is not enough space to represent every subinterval within given precision (which is specified by -b option).
Each line corresponds to a symbol from the input (in our case - sequence [1, 2]). By following subintervals for each input symbol we'll get a final interval that we want to encode with minimal amount of bits. In our case it's a first 2 subinterval on a second line:
[ This one ]
------011222|2222|222000011111111122222222222222222222222222222222
Following 7 lines (after ====) represent the same interval 0.0 to 1.0, but subdivided according to binary notation. Each line is a bit of output and by choosing between 0 and 1 you choose left or right half-subinterval. For example bits 01 corresponds to subinterval [0.25, 05) on a second line:
[ This one ]
000000000000|0000|111111111111111100000000000000001111111111111111
The idea of arithmetic coder is to output bits (0 or 1) until the corresponding interval will be entirely inside (or equal to) the interval determined by the input sequence. In our case it's 0011. The ~~~~ line shows where we have enough bits to unambiguously identify the interval we want.
Vertical lines formed by | symbol show the range of bit sequences (rows) that could be used to encode the input sequence.
First of all thanks for introducing me to the concept of arithmetic compression!
I can see that this method has the following steps:
Creating mapping: Calculate the fraction of occurrence for each letter which gives a range size for each alphabet. Then order them and assign actual ranges from 0 to 1
Given a message calculate the range (pretty straightforward IMHO)
Find the optimal code
The third part is a bit tricky. Use the following algorithm.
Let b be the optimal representation. Initialize it to empty string (''). Let x be the minimum value and y the maximum value.
double x and y: x=2*x, y=2*y
If both of them are greater than 1 append 1 to b. Go to step 1.
If both of them are less than 1, append 0 to b. Go to step 1.
If x<1, but y>1, then append 1 to b and stop
b essentially contains the fractional part of the number you are transmitting. Eg. If b=011, then the fraction corresponds to 0.011 in binary.
What part of implementation do you not understand?
Is it mathematically feasible to encode and initial 4 byte message into 8 bytes and if one of the 8 bytes is completely dropped and another is wrong to reconstruct the initial 4 byte message? There would be no way to retransmit nor would the location of the dropped byte be known.
If one uses Reed Solomon error correction with 4 "parity" bytes tacked on to the end of the 4 "data" bytes, such as DDDDPPPP, and you end up with DDDEPPP (where E is an error) and a parity byte has been dropped, I don't believe there's a way to reconstruct the initial message (although correct me if I am wrong)...
What about multiplying (or performing another mathematical operation) the initial 4 byte message by a constant, then utilizing properties of an inverse mathematical operation to determine what byte was dropped. Or, impose some constraints on the structure of the message so every other byte needs to be odd and the others need to be even.
Alternatively, instead of bytes, it could also be 4 decimal digits encoded in some fashion into 8 decimal digits where errors could be detected & corrected under the same circumstances mentioned above - no retransmission and the location of the dropped byte is not known.
I'm looking for any crazy ideas anyone might have... Any ideas out there?
EDIT:
It may be a bit contrived, but the situation that I'm trying to solve is one where you have, let's say, a faulty printer that prints out important numbers onto a form, which are then mailed off to a processing firm which uses OCR to read the forms. The OCR isn't going to be perfect, but it should get close with only digits to read. The faulty printer could be a bigger problem, where it may drop a whole number, but there's no way of knowing which one it'll drop, but they will always come out in the correct order, there won't be any digits swapped.
The form could be altered so that it always prints a space between the initial four numbers and the error correction numbers, ie 1234 5678, so that one would know whether a 1234 initial digit was dropped or a 5678 error correction digit was dropped, if that makes the problem easier to solve. I'm thinking somewhat similar to how they verify credit card numbers via algorithm, but in four digit chunks.
Hopefully, that provides some clarification as to what I'm looking for...
In the absence of "nice" algebraic structure, I suspect that it's going to be hard to find a concise scheme that gets you all the way to 10**4 codewords, since information-theoretically, there isn't a lot of slack. (The one below can use GF(5) for 5**5 = 3125.) Fortunately, the problem is small enough that you could try Shannon's greedy code-construction method (find a codeword that doesn't conflict with one already chosen, add it to the set).
Encode up to 35 bits as a quartic polynomial f over GF(128). Evaluate the polynomial at eight predetermined points x0,...,x7 and encode as 0f(x0) 1f(x1) 0f(x2) 1f(x3) 0f(x4) 1f(x5) 0f(x6) 1f(x7), where the alternating zeros and ones are stored in the MSB.
When decoding, first look at the MSBs. If the MSB doesn't match the index mod 2, then that byte is corrupt and/or it's been shifted left by a deletion. Assume it's good and shift it back to the right (possibly accumulating multiple different possible values at a point). Now we have at least seven evaluations of a quartic polynomial f at known points, of which at most one is corrupt. We can now try all possibilities for the corruption.
EDIT: bmm6o has advanced the claim that the second part of my solution is incorrect. I disagree.
Let's review the possibilities for the case where the MSBs are 0101101. Suppose X is the array of bytes sent and Y is the array of bytes received. On one hand, Y[0], Y[1], Y[2], Y[3] have correct MSBs and are presumed to be X[0], X[1], X[2], X[3]. On the other hand, Y[4], Y[5], Y[6] have incorrect MSBs and are presumed to be X[5], X[6], X[7].
If X[4] is dropped, then we have seven correct evaluations of f.
If X[3] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 3, and six correct evaluations.
If X[5] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 5, and six correct evaluations.
There are more possibilities besides these, but we never have fewer than six correct evaluations, which suffices to recover f.
I think you would need to study what erasure codes might offer you. I don't know any bounds myself, but maybe some kind of MDS code might achieve this.
EDIT: After a quick search I found RSCode library and in the example it says that
In general, with E errors, and K erasures, you will need
* 2E + K bytes of parity to be able to correct the codeword
* back to recover the original message data.
So looks like Reed-Solomon code is indeed the answer and you may actually get recovery from one erasure and one error in 8,4 code.
Parity codes work as long as two different data bytes aren't affected by error or loss and as long as error isn't equal to any data byte while a parity byte is lost, imho.
Error correcting codes can in general handle erasures, but in the literature the position of the erasure is assumed known. In most cases, the erasure will be introduced by the demodulator when there is low confidence that the correct data can be retrieved from the channel. For instance, if the signal is not clearly 0 or 1, the device can indicate that the data was lost, rather than risking the introduction of an error. Since an erasure is essentially an error with a known position, they are much easier to fix.
I'm not sure what your situation is where you can lose a single value and you can still be confident that the remaining values are delivered in the correct order, but it's not a situation classical coding theory addresses.
What algorithmist is suggesting above is this: If you can restrict yourself to just 7 bits of information, you can fill the 8th bit of each byte with alternating 0 and 1, which will allow you to know the placement of the missing byte. That is, put a 0 in the high bit of bytes 0, 2, 4, 6 and a 1 in the high bits of the others. On the receiving end, if you only receive 7 bytes, the missing one will have been dropped from between bytes whose high bits match. Unfortunately, that's not quite right: if the erasure and the error are adjacent, you can't know immediately which byte was dropped. E.g., high bits 0101101 could result from dropping the 4th byte, or from an error in the 4th byte and dropping the 3rd, or from an error in the 4th byte and dropping the 5th.
You could use the linear code:
1 0 0 0 0 1 1 1
0 1 0 0 1 0 1 1
0 0 1 0 1 1 0 1
0 0 0 1 1 1 1 0
(i.e. you'll send data like (a, b, c, d, b+c+d, a+c+d, a+b+d, a+b+c) (where addition is implemented with XOR, since a,b,c,d are elements of GF(128))). It's a linear code with distance 4, so it can correct a single-byte error. You can decode with syndrome decoding, and since the code is self-dual, the matrix H will be the same as above.
In the case where there's a dropped byte, you can use the technique above to determine which one it is. Once you've determined that, you're essentially decoding a different code - the "punctured" code created by dropping that given byte. Since the punctured code is still linear, you can use syndrome decoding to determine the error. You would have to calculate the parity-check matrix for each of the shortened codes, but you can do this ahead of time. The shortened code has distance 3, so it can correct any single-byte errors.
In the case of decimal digits, assuming one goes with first digit odd, second digit even, third digit odd, etc - with two digits, you get 00-99, which can be represented in 3 odd/even/odd digits (125 total combinations) - 00 = 101, 01 = 103, 20 = 181, 99 = 789, etc. So one encodes two sets of decimal digits into 6 total digits, then the last two digits signify things about the first sets of 2 digits or a checksum of some sort... The next to last digit, I suppose, could be some sort of odd/even indicator on each of the initial 2 digit initial messages (1 = even first 2 digits, 3 = odd first two digits) and follow the pattern of being odd. Then, the last digit could be the one's place of a sum of the individual digits, that way if a digit was missing, it would be immediately apparent and could be corrected assuming the last digit was correct. Although, it would throw things off if one of the last two digits were dropped...
It looks to be theoretically possible if we assume 1 bit error in wrong byte. We need 3 bits to identify dropped byte and 3 bits to identify wrong byte and 3 bits to identify wrong bit. We have 3 times that many extra bits.
But if we need to identify any number of bits error in wrong byte, it comes to 30 bits. Even that looks to be possible with 32 bits, although 32 is a bit too close for my comfort.
But I don't know hot to encode to get that. Try turbocode?
Actually, as Krystian said, when you correct a RS code, both the message AND the "parity" bytes will be corrected, as long as you have v+2e < (n-k) where v is the number of erasures (you know the position) and e is the number of errors. This means that if you only have errors, you can correct up to (n-k)/2 errors, or (n-k-1) erasures (about the double of the number of errors), or a mix of both (see Blahut's article: Transform techniques for error control codes and A universal Reed-Solomon decoder).
What's even nicer is that you can check that the correction was successful: by checking that the syndrome polynomial only contains 0 coefficients, you know that the message+parity bytes are both correct. You can do that before to check if the message needs any correction, and also you can do the check after the decoding to check that both the message and the parity bytes were completely repaired.
The bound v+2e < (n-k) is optimal, you cannot do better (that's why Reed-Solomon is called an optimal error correction code). In fact it's possible to go beyond this limit using bruteforce approaches, up to a certain point (you can gain 1 or 2 more symbols for each 8 symbols) using list decoding, but it's still a domain in its infancy, I don't know of any practical implementation that works.