Most efficient bit format to represent small unsigned integers - algorithm

I have to deal with sequences of a lot of small numbers, about a million, and I have to put as many as possible (more is better) in 4KB. Obviously that's just too little space to put all of them. Also, while this is a specific scenario, I'd love an answer as general as possible.
The numbers don't follow any pattern, but here is what a small script has to say about their distribution:
407037 times 1
165000 times 2
85389 times 3
52257 times 4
34749 times 5
23567 times 6
15892 times 7
11183 times 8
7636 times 9
5402 times 10
3851 times 11
2664 times 12
2023 times 13
1547 times 14
1113 times 15
... many more lines ...
1 times 62
62 is the biggest number I have, so let's set the maximum number we care about at 64. If the method is easily adaptable to accommodate bigger maximum numbers, that would be better.
Here is a sample of the numbers:
20
1
1
1
13
1
5
1
15
1
3
4
3
2
2
A naive way to do this would just be to use 6 bits per number, but I think we can do better.
EDIT: adding a bit of info following discussion in comments.
I also have 2KB of ram and a dozen cycles on a microprocessor to decode each number. I need to store, sequentially, from the first number, as many numbers as I can.
EDIT: see graybeard's comment and my followup too.

The standard ways to do this are range coding, Huffman coding, or Shannon-Fano coding, which you can find in any digital-communications reference, so I won't explain those here.
Instead, I can suggest a custom method that is really simple; you can compare it against the other methods and see whether it lets you store more numbers.
I see that there are no 0s in your data, so first decrease each number by 1 (and add 1 back when decoding). Then use either 4 or 7 bits per number. After the shift, all values up to 7 (original numbers up to 8) fit in 3 bits. If the number is n <= 8, write a 0 flag bit followed by the value in 3 bits. Otherwise (n > 8), write a 1 flag bit followed by the value in 6 bits.
By contrast, in Huffman or Shannon-Fano coding a few of the codewords can be longer than 20 bits.
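A minimal sketch of this 4-or-7-bit scheme in Python (the function names and the bit-list representation are mine, just to pin the idea down):

def encode(numbers):
    bits = []
    for n in numbers:                 # n is in 1..64
        v = n - 1                     # shift to 0..63, since 0 never occurs
        if v < 8:
            bits += [0] + [(v >> i) & 1 for i in (2, 1, 0)]           # 4 bits total
        else:
            bits += [1] + [(v >> i) & 1 for i in (5, 4, 3, 2, 1, 0)]  # 7 bits total
    return bits

def decode(bits):
    out, i = [], 0
    while i < len(bits):
        if bits[i] == 0:
            v = (bits[i+1] << 2) | (bits[i+2] << 1) | bits[i+3]
            i += 4
        else:
            v = 0
            for b in bits[i+1:i+7]:
                v = (v << 1) | b
            i += 7
        out.append(v + 1)             # undo the shift
    return out

print(decode(encode([20, 1, 1, 1, 13])))   # [20, 1, 1, 1, 13]

Going by the listed counts (roughly 795,000 of the ~1,000,000 numbers are 8 or less), about 80% of the values would take 4 bits and the rest 7, for an average of roughly 4.6 bits per number versus 6 bits for the naive fixed-width code.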

To give a correct answer, I need to know: is the decoder size also limited, or is only the data size limited?
If there is no limit on the decoder (only on the data), I suggest a range coder or Huffman coding. A range coder gives better compression, but uses arithmetic operations heavily.
However, both decoders need memory for code and for statistical tables. So perhaps a better answer is to create something simpler (a custom compressor) with small, simple code and no tables. As a simple, code-compact option, I can propose the run-1 algorithm. It is not very efficient for your data (a range coder or Huffman would compress better), but it has a trivial, compact decoder with no tables at all.
The idea: each symbol is written as zero or more 1 bits, with a 0 bit as the symbol separator. For example, if we would like to encode the following sequence with run-1:
1, 1, 2, 1, 5
the bit sequence will be:
0-0-10-0-11110
To decode, just count the consecutive 1 bits, add 1, and return that value as the decoded number.
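A tiny sketch of run-1 encoding and decoding in Python (names are mine; on the microprocessor the decoder is just a counter over the incoming bits):

def run1_encode(numbers):
    bits = []
    for n in numbers:
        bits += [1] * (n - 1) + [0]     # n-1 ones, then a 0 as the separator
    return bits

def run1_decode(bits):
    out, run = [], 0
    for b in bits:
        if b:
            run += 1
        else:
            out.append(run + 1)         # count the ones, add 1
            run = 0
    return out

print(run1_encode([1, 1, 2, 1, 5]))     # [0, 0, 1, 0, 0, 1, 1, 1, 1, 0]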

You may be able to do slightly better than straight Huffman by combining it with run-length coding.
If you count successive identical elements, you can rewrite your sequence as pairs of (value, count). Every such pair appears with some probability, and you can Huffman-code these pairs. (I don't mean coding the values and the counts separately, but the pairs as a whole.)
Your sample yields
(20, 1), (1, 3), (13, 1), (1, 1), (5, 1), (1, 1), (15, 1), (1, 1), (3, 1), (4, 1), (3, 1), (2, 2)
The singletons will be (practically) coded as before, and there are more opportunities for compression of longer runs.
You can limit the maximum count(s) that are supported; if the actual count exceeds the limit, it is no big deal to insert several pairs.
The very first step is to compute a histogram of the count values to see whether there are enough repetitions for this approach to be worthwhile.
Alternatively, you can try Huffman coding on the deltas (signed differences between successive values). If there are many repetitions, the frequency of 0 will be much higher, lowering the entropy. Obviously, run-length coding of the deltas is also possible.
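Going back to the (value, count) idea, here is a quick way (a sketch using Python's standard library; the names are mine) to form the pairs and their histogram, which is what you would feed into a Huffman code builder:

from collections import Counter
from itertools import groupby

def rle_pairs(values):
    # e.g. [20, 1, 1, 1, 13, ...] -> [(20, 1), (1, 3), (13, 1), ...]
    return [(v, len(list(g))) for v, g in groupby(values)]

sample = [20, 1, 1, 1, 13, 1, 5, 1, 15, 1, 3, 4, 3, 2, 2]
pairs = rle_pairs(sample)
histogram = Counter(pairs)               # pair frequencies for the Huffman builder
print(pairs)
print(histogram.most_common(3))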

I took the distribution you listed and tried an exponential fit; the result was decently good.
More importantly, the fit was reasonably close to p(x) ~= 2^-x. This suggests a very simple coding, known as "unary coding": to encode the number k, output k-1 zeroes, followed by a 1. If your numbers exactly fit the p(x) ~= 2^-x distribution, that would give you an expected code length of 2 bits. Since your numbers appear to be heavier-tailed than that (otherwise it would be vanishingly unlikely to see a 62 in only a million numbers), you won't quite achieve that. Still, given the simplicity of the coding and the ease of decoding (twelve cycles should be sufficient), you should consider trying it out.
You might also look into other universal codes, such as Elias Delta. Golomb coding would be optimal, but decoding it is an involved process.
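For concreteness, here is Elias gamma coding, a simpler relative of the Elias delta code mentioned above, sketched in Python (the string-of-bits representation is just for illustration): for a value n it writes floor(log2 n) zeros followed by n in binary.

def elias_gamma_encode(n):
    # n >= 1
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":            # count the leading zeros
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

print(elias_gamma_encode(5))                        # 00101
print(elias_gamma_decode("1" + "010" + "00101"))    # [1, 2, 5]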

Related

Fixed Point Multiplication for FFT

I'm writing a Radix-2 DIT FFT algorithm in VHDL, which requires some fractional multiplication of input data by a Twiddle Factor (TF). I use fixed-point arithmetic to achieve that, with every word being 16 bits long, where 1 bit is the sign bit and the rest is distributed between integer and fraction. Hence my dilemma:
I have no idea what range my input data will be in, so if I just decide that 4 bits go to the integer part and the remaining 11 bits to the fraction, then as soon as I get integers bigger than 4 bits can hold (15 decimal), I'm screwed. The same applies if I split it roughly 50/50, say 7 bits to the integer part and the rest to the fraction. And if the numbers are very small, I'm screwed because of truncation or rounding, i.e.:
Let's assume I have the integer "3" (0000 0011) on input and a TF of "0.7071" (0.10110101, 8 bits), and let's assume, for simplicity, that my data is 8 bits long. Therefore:
3x0.7071 = 2.1213
3x0.7071 = 0000 0010 . 0001 1111 = 2.12109375 (for 16 bits).
Here comes the trick: I need to round (up or down) or truncate the 16 bits to 8 bits, so I get 0000 0010, i.e. 2 - the error is way too high.
My questions are:
How would you solve this problem of range vs precision if you don’t know the range of your input data AND you would have numbers represented in fixed point?
Should I make a process, which decides after every multiplication where to put the comma? Wouldn’t it make the multiplication slower?
Xilinx's IP core has 3 different ways of handling fixed-point arithmetic: Unscaled (similar to what I want to do, just truncate in case overflow happens), Scaled fixed point (I would assume that in this case it decides after each multiplication where the comma should be and what should be rounded), and Block Floating Point (no idea what it is or how it works - I would appreciate an explanation). So how does this IP core decide where to put the comma? If the decision is made depending on the highest value in my dataset, then in case I have just one high peak and the rest of the data is low, the error will be very high.
I will appreciate any ideas or information on any known methods.
You don't need to know the fixed-point format of your input. You can safely treat it as normalized -1 to 1 range or full integer-range.
The reason is that your output will have the same format as the input - or, more likely for an FFT, a known relationship to it, such as 3 bits of growth, meaning the output has 3 more integer bits than the input.
It is the core user's burden to know where the decimal point will end up; you just have to document the change in dynamic range, of course.
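A tiny numeric sketch (my own example, reusing the 3 x 0.7071 numbers from the question, not the IP core's behaviour) of why the multiplier never needs to know where the binary point is: the raw integer result is the same regardless, only its interpretation changes, and the fraction bits of the two inputs simply add.

def as_real(raw, frac_bits):
    # interpret a raw integer as fixed point with frac_bits fraction bits
    return raw / (1 << frac_bits)

tf_raw = 0b10110101           # 0.7071 stored with 8 fraction bits (decimal 181)
x_raw = 3 << 8                # 3.0 stored with 8 fraction bits (decimal 768)

product_raw = x_raw * tf_raw  # 139008; fraction bits add: 8 + 8 = 16

print(as_real(product_raw, 16))   # 2.12109375, matching the worked example in the question

# If the same raw inputs were documented as having 4 fraction bits each, the hardware
# multiply would be identical; you would simply read the result with 8 fraction bits.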

Data Compression : Arithmetic coding unclear

Can anyone please explain arithmetic encoding for data compression with implementation details? I have searched the internet and found Mark Nelson's post, but the implementation technique is still unclear to me after trying for many hours.
Mark Nelson's explanation of arithmetic coding can be found at
http://marknelson.us/1991/02/01/arithmetic-coding-statistical-modeling-data-compression/
The main idea of arithmetic compression is its capability to code a probability using almost exactly the amount of data that is theoretically required.
This amount of data is known, proven by Shannon, and can be calculated simply by using the following formula : -log2(p)
For example, if p=50%, then you need 1 bit.
And if p=25%, you need 2 bits.
That's simple enough for probabilities which are powers of 2 (and in this special case, Huffman coding could be enough). But what if the probability is 63%? Then you need -log2(0.63) = 0.67 bits. Sounds tricky...
This property is especially important if your probability is high. If you can predict something with a 95% accuracy, then you only need 0.074 bits to represent a good guess. Which means you are going to compress a lot.
Now, how to do that ?
Well, it's simpler than it sounds. You will divide your range depending on probabilities. For example, if you have a range of 100, 2 possible events, and a probability of 95% for the 1st one, then the first 95 values will say "Event 1", and the last 5 remaining values will say "Event 2".
OK, but on computers, we are accustomed to use powers of 2. For example, with 16 bits, you have a range of 65536 possible values. Just do the same : take the 1st 95% of the range (which is 62259) to say "Event 1", and the rest to say "Event 2". You obviously have a problem of "rounding" (precision), but as long as you have enough values to distribute, it does not matter too much. Furthermore, you are not constrained to 2 events, you could have a myriad of events. All that matters is that values are allocated depending on the probabilities of each event.
OK, but now I have 62259 possible values to say "Event 1", and 3277 to say "Event 2". Which one should I choose?
Well, any of them will do. Whether it is 1, 30, 5500 or 62256, it still means "Event 1".
In fact, deciding which value to select will not depend on the current guess, but on the next ones.
Suppose I'm having "Event 1". So now I have to choose any value between 0 and 62258. On the next guess, I have the same distribution (95% Event 1, 5% Event 2). I will simply allocate the distribution map with these probabilities, except that this time it is distributed over 62259 values. And we continue like this, reducing the range of values with each guess.
So in fact, we are defining "ranges" which narrow with each guess. At some point, however, there is a problem of accuracy, because very few values remain.
The idea is to simply "inflate" the range again. For example, each time the range goes below 32768 (2^15), you output the highest bit and multiply the rest by 2 (effectively shifting the values one bit to the left). By continuously doing this, you are outputting bits one by one, as they are settled by the series of guesses.
Now the relation to compression becomes obvious: when the range narrows sharply (e.g. a 5% event), you output a lot of bits to get the range back above the limit. On the other hand, when the probability is very high, the range narrows very slowly, and you may even process a lot of guesses before outputting your first bits. That's how it is possible to compress an event down to "a fraction of a bit".
I've intentionally used the terms "probability", "guess" and "event" to keep this explanation generic. For data compression, you just replace them with the way you want to model your data. For example, the next event could be the next byte; in this case, you have 256 of them.
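To make the renormalization concrete, here is a toy whole-message encoder in Python in the spirit of the explanation above (and of classic implementations such as the one in Mark Nelson's article): 16-bit registers and a fixed two-event model where Event 1 gets 62259/65536 of the range. One detail the prose glosses over is the case where the range straddles the midpoint without settling a bit; the "pending" counter below handles it. The names are mine and the matching decoder is omitted.

TOP = 1 << 16                # 16-bit range, the 65536 of the example above
HALF = TOP // 2              # 32768: renormalization threshold
QUARTER = TOP // 4
P1 = 62259                   # share of the range given to Event 1 (~95%)

def encode(events):          # events is a list of 1s and 2s
    low, high = 0, TOP - 1
    out, pending = [], 0

    def emit(bit):
        nonlocal pending
        out.append(bit)
        out.extend([1 - bit] * pending)   # flush bits deferred by the straddling case
        pending = 0

    for e in events:
        span = high - low + 1
        split = low + span * P1 // TOP    # boundary between Event 1 and Event 2
        if e == 1:
            high = split - 1              # keep the lower ~95% of the range
        else:
            low = split                   # keep the upper ~5%
        while True:                       # renormalize: emit settled bits, re-inflate range
            if high < HALF:
                emit(0)
            elif low >= HALF:
                emit(1)
                low -= HALF
                high -= HALF
            elif low >= QUARTER and high < 3 * QUARTER:
                pending += 1              # straddling the middle: defer the decision
                low -= QUARTER
                high -= QUARTER
            else:
                break
            low = 2 * low
            high = 2 * high + 1
    pending += 1                          # flush enough bits to pin a value in [low, high]
    emit(0 if low < QUARTER else 1)
    return out

print(encode([1] * 20 + [2]))             # many likely events cost only a handful of bits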
Maybe this script can help you build a better mental model of an arithmetic coder: gen_map.py. It was originally created to facilitate debugging of an arithmetic coder library and to simplify generation of unit tests for it, but it produces nice ASCII visualizations that are also useful for understanding arithmetic coding.
A small example. Imagine we have an alphabet of 3 symbols: 0, 1 and 2, with probabilities 1/10, 2/10 and 7/10 respectively, and we want to encode the sequence [1, 2]. The script gives the following output (ignore the -b N option for now):
$ ./gen_map.py -b 6 -m "1,2,7" -e "1,2"
000000111111|1111|111222222222222222222222222222222222222222222222
------011222|2222|222000011111111122222222222222222222222222222222
---------011|2222|222-------------00011111122222222222222222222222
------------|----|-------------------------00111122222222222222222
------------|----|-------------------------------01111222222222222
------------|----|------------------------------------011222222222
==================================================================
000000000000|0000|000000000000000011111111111111111111111111111111
000000000000|0000|111111111111111100000000000000001111111111111111
000000001111|1111|000000001111111100000000111111110000000011111111
000011110000|1111|000011110000111100001111000011110000111100001111
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
001100110011|0011|001100110011001100110011001100110011001100110011
010101010101|0101|010101010101010101010101010101010101010101010101
The first 6 lines (before the ==== line) represent the range from 0.0 to 1.0, recursively subdivided into intervals proportional to the symbol probabilities. Annotated first line:
[1/10][ 2/10 ][ 7/10 ]
000000111111|1111|111222222222222222222222222222222222222222222222
Then we subdivide each interval again:
[ 0.1][ 0.2 ][ 0.7 ]
000000111111|1111|111222222222222222222222222222222222222222222222
[ 0.7 ][.1][ 0.2 ][ 0.7 ]
------011222|2222|222000011111111122222222222222222222222222222222
[.1][ .2][ 0.7 ]
---------011|2222|222-------------00011111122222222222222222222222
Note that some intervals are not subdivided. That happens when there is not enough space to represent every subinterval within the given precision (which is specified by the -b option).
Each line corresponds to a symbol from the input (in our case, the sequence [1, 2]). By following the subintervals for each input symbol we get the final interval that we want to encode with a minimal number of bits. In our case it's the '2' subinterval inside the first line's '1' interval, shown on the second line:
[ This one ]
------011222|2222|222000011111111122222222222222222222222222222222
The lines after the ==== line represent the same interval from 0.0 to 1.0, but subdivided according to binary notation. Each line is one bit of output, and by choosing between 0 and 1 you choose the left or right half-subinterval. For example, the bits 01 correspond to the subinterval [0.25, 0.5) on the second line:
[ This one ]
000000000000|0000|111111111111111100000000000000001111111111111111
The idea of the arithmetic coder is to output bits (0 or 1) until the corresponding interval lies entirely inside (or is equal to) the interval determined by the input sequence. In our case that's 0011. The ~~~~ line shows where we have enough bits to unambiguously identify the interval we want.
The vertical lines formed by the | symbol show the range of bit sequences (rows) that could be used to encode the input sequence.
First of all thanks for introducing me to the concept of arithmetic compression!
I can see that this method has the following steps:
Creating the mapping: calculate the fraction of occurrences of each symbol, which gives a range size for each symbol of the alphabet. Then order them and assign actual ranges from 0 to 1.
Given a message, calculate the range (pretty straightforward IMHO).
Find the optimal code
The third part is a bit tricky. Use the following algorithm.
Let b be the optimal representation, initialized to the empty string (''). Let x be the minimum value of the range and y the maximum value.
1. Double x and y: x = 2*x, y = 2*y.
2. If both of them are greater than 1, append 1 to b, subtract 1 from both, and go to step 1.
3. If both of them are less than 1, append 0 to b and go to step 1.
4. If x < 1 but y > 1, append 1 to b and stop.
b essentially contains the fractional part of the number you are transmitting. E.g., if b = 011, the fraction corresponds to 0.011 in binary.
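A direct transcription of those steps into Python (a sketch; the >= comparisons also cover boundary cases the steps above don't spell out):

def interval_to_bits(x, y):
    # x and y delimit the half-open interval [x, y) inside [0, 1)
    b = ""
    while True:
        x, y = 2 * x, 2 * y              # step 1: double both ends
        if x >= 1:                       # step 2: both ends in the upper half
            b += "1"
            x, y = x - 1, y - 1
        elif y <= 1:                     # step 3: both ends in the lower half
            b += "0"
        else:                            # step 4: x < 1 < y, so 0.b1 lies inside
            return b + "1"               #         the original interval

print(interval_to_bits(0.16, 0.3))       # '01', i.e. 0.25, which lies inside [0.16, 0.3)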
What part of implementation do you not understand?

Is this a clever or stupid way to do an integer divide function?

I'm a Computer Science major, interested in how assembly languages handle an integer divide function. It seems that simply adding up to the numerator, while giving both the quotient and the remainder, is way too impractical, so I came up with another way to divide using bit shifting, subtracting, and 2 lookup tables.
Basically, the function takes the denominator and makes "blocks" based on the number of bits needed to represent it. So dividing by 15 makes binary blocks of 4 bits, dividing by 5 makes binary blocks of 3 bits, etc. Then generate the first 2^(block size) multiples of the denominator. For each multiple, write the bits AFTER the first block into the lookup table, keyed by the value of the first block.
Example: Multiples of 5 in binary - block size 3 (octal)
000 000 **101** - 5 maps to 0
000 001 **010** - 2 maps to 1
000 001 **111** - 7 maps to 1
000 010 **100** - 4 maps to 2
000 011 **001** - 1 maps to 3
000 011 **110** - 6 maps to 3
000 100 **011** - 3 maps to 4
000 101 **000** - 0 maps to 5
So the actual procedure involves taking the first (lowest) block, shifting the number right past that block, and subtracting the value that the block maps to. If the result comes out to 0, it's perfectly divisible; if the value becomes negative, it's not; otherwise, repeat the procedure on the result.
If you add another enumeration look up table, where you map the values to a counter as they come in, you can calculate the result of the division!
Example: Multiples of 5 again
5 maps to 1
2 maps to 2
7 maps to 3
4 maps to 4
1 maps to 5
6 maps to 6
3 maps to 7
0 maps to 8
Then all that's left is mapping every block to the counter-table, and you have your answer.
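If I've read the procedure correctly, here is a sketch of it in Python, hard-coded for divisor 5 with 3-bit blocks (the two tables match the octal example above; processing one block per iteration and treating the counter values as base-8 digits of the quotient is my reading of the description, not something it spells out):

REST_TABLE    = {5: 0, 2: 1, 7: 1, 4: 2, 1: 3, 6: 3, 3: 4, 0: 5}   # low block -> bits after it
COUNTER_TABLE = {5: 1, 2: 2, 7: 3, 4: 4, 1: 5, 6: 6, 3: 7, 0: 8}   # low block -> quotient digit

def div5(n):
    # return n // 5 if n is a non-negative multiple of 5, else None
    quotient, place = 0, 1
    while n > 0:
        low = n & 0b111                     # the first (lowest) 3-bit block
        n = (n >> 3) - REST_TABLE[low]      # shift past the block, subtract the mapped value
        if n < 0:
            return None                     # went negative: not divisible
        quotient += COUNTER_TABLE[low] * place
        place *= 8                          # next block is the next base-8 digit
    return quotient

print(div5(35), div5(50), div5(37))         # 7 10 None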
There are a few problems with this method.
If the number isn't perfectly divisible, then the function returns junk.
For large integer values this won't work, because a block size (say 5 bits) that doesn't evenly divide 32 or 64 gets truncated at the end of a 32-bit or 64-bit integer.
It's about 100 times slower than the standard division in C.
If the denominator is even (i.e. it shares a factor with the power-of-two block base), then your blocks must map to multiple values, and you need even more tables. This can be solved with prime factorization, but all the methods I've read about for easy/quick prime factorization involve dividing, defeating the purpose of this.
So I have 2 questions: First, is there an algorithm similar to this out there already? I've looked around, and I can't seem to find any like it. Second, How do actual assembly languages handle Integer division?
Sorry if there are any formatting mistakes; this is my first time posting to Stack Overflow.
Sorry I'm answering so late. OK, first, regarding the commenters on your question: they think you are trying to do what the assembly mnemonics DIV or IDIV achieve by using different instructions in assembly. To me it seems you want to know how the op-codes selected by DIV and IDIV achieve division in hardware. To my knowledge, Intel uses the SRT algorithm (which uses a lookup table) and AMD uses the Goldschmidt algorithm. I think what you are doing is similar to SRT. You can take a look at both of them here:
http://en.wikipedia.org/wiki/Division_%28digital%29
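For intuition, the simplest hardware-style approach is plain restoring (shift-and-subtract) division, which retires one quotient bit per step; SRT dividers refine this idea with a quotient-digit lookup table so they can retire more bits per cycle. This sketch is my own illustration, not code from either algorithm's literature:

def restoring_divide(dividend, divisor, bits=32):
    # unsigned restoring division: one quotient bit per iteration
    assert divisor > 0
    remainder, quotient = 0, 0
    for i in range(bits - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)   # bring down the next bit
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder

print(restoring_divide(100, 7))   # (14, 2)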

Algorithms: random unique string

I need to generate string that meets the following requirements:
it should be a unique string;
string length should be 8 characters;
it should contain 2 digits;
all symbols (non-digital characters) should be upper case.
I will store them in a data base after generation (they will be assigned to other entities).
My intention is to do something like this:
Generate 2 random values from 0 to 9—they will be used for digits in the string;
generate 6 random values from 0 to 25 and add them to 64—they will be used as 6 symbols;
concatenate everything into one string;
check if the string already exists in the data base; if not—repeat.
My concern with regard to that algorithm is that it doesn't guarantee a result in finite time (if there are already A LOT of values in the data base).
Question: could you please give advice on how to improve this algorithm to be more deterministic?
Thanks.
it should be unique string;
string length should be 8 characters;
it should contains 2 digits;
all symbols (non-digital characters) - should be upper case.
Assuming:
requirements #2 and #3 are exact (exactly 8 chars, exactly 2 digits) and not a minimum
the "symbols" in requirement #4 are the 26 capital letters A through Z
you would like an evenly-distributed random string
Then your proposed method has two issues. One is that the letters A - Z are ASCII 65 - 90, not 64 - 89. The other is that it doesn't distribute the numbers evenly within the possible string space. That can be remedied by doing the following:
Generate two different integers between 0 and 7, and sort them.
Generate 2 random numbers from 0 to 9.
Generate 6 random letters from A to Z.
Use the two different integers in step #1 as positions, and put the 2 numbers in those positions.
Put the 6 random letters in the remaining positions.
There are 28 possibilities for the two different integers ((8*8 - 8 duplicates) / 2 orderings), 26^6 possibilities for the letters, and 100 possibilities for the numbers, the total number of valid combinations being Ncomb = 864964172800 ≈ 8.64 x 10^11.
edit: If you want to avoid the database for storage, but still guarantee both uniqueness of strings and have them be cryptographically secure, your best bet is a cryptographically random bijection from a counter between 0 and Nmax <= Ncomb to a subset of the space of possible output strings. (Bijection meaning there is a one-to-one correspondence between the output string and the input counter.)
This is possible with Feistel networks, which are commonly used in hash functions and symmetric ciphers (DES is the classic example). You'd probably want to choose Nmax = 2^39, which is the largest power of 2 <= Ncomb, and use a 39-bit Feistel network with a constant key you keep secret. You then plug your counter into the Feistel network, and out comes another 39-bit number X, which you then transform into the corresponding string as follows:
Repeat the following step 6 times:
Take X mod 26, generate a capital letter, and set X = X / 26.
Take X mod 100 to generate your two digits, and set X = X / 100.
X will now be between 0 and 17 inclusive (2^39 / 26^6 / 100 = 17.796...). Map this number to two unique digit positions (probably easiest using a lookup table, since we're only talking 28 possibilities; if you had more, use Floyd's algorithm for generating a unique permutation, and use the variable-base technique of mod + integer divide instead of generating a random number).
Follow the random approach above, but use the numbers generated by this algorithm instead.
Alternatively, use 40-bit numbers, and if the output of your Feistel network is >= Ncomb, increment the counter and try again. This covers the entire string space, at the cost of rejecting invalid numbers and having to re-execute the algorithm. (But you don't need a database to do this.)
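A sketch of the counter-to-string idea in Python, using the 40-bit variant with rejection: a toy 4-round balanced Feistel network with a hash-based round function standing in for a proper keyed PRF. The key, round count, round function and all names here are placeholders of mine, not a vetted construction.

import hashlib
import string

HALF_BITS = 20
HALF_MASK = (1 << HALF_BITS) - 1
KEY = b"keep-this-secret"
ROUNDS = 4

def round_fn(half, rnd):
    data = KEY + bytes([rnd]) + half.to_bytes(3, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:3], "big") & HALF_MASK

def feistel(counter):
    # bijective mapping on 40-bit values
    left, right = counter >> HALF_BITS, counter & HALF_MASK
    for r in range(ROUNDS):
        left, right = right, left ^ round_fn(right, r)
    return (left << HALF_BITS) | right

def counter_to_code(counter):
    x = feistel(counter)
    letters = []
    for _ in range(6):
        letters.append(string.ascii_uppercase[x % 26])
        x //= 26
    digits = x % 100
    x //= 100
    if x >= 28:
        return None                     # rejected: caller increments the counter and retries
    positions = [(i, j) for i in range(8) for j in range(i + 1, 8)][x]   # the 28 digit placements
    letter_iter = iter(letters)
    chars = []
    for pos in range(8):
        if pos == positions[0]:
            chars.append(str(digits // 10))
        elif pos == positions[1]:
            chars.append(str(digits % 10))
        else:
            chars.append(next(letter_iter))
    return "".join(chars)

print(counter_to_code(12345))           # deterministic, "random-looking", unique per counter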
But this isn't something to get into unless you know what you're doing.
Are these user passwords? If so, there are a couple of things you need to take into account:
You must avoid 0/O and I/1, which can easily be mistaken for each other.
You must avoid too many consecutive letters, which might spell out a rude word.
As far as 2 is concerned, you can avoid the problem by using LLNLLNLL as your pattern (L = letter, N = number).
If you need 1 million passwords out of a pool of 2.5 billion, you will certainly get clashes in your database, so you have to deal with them gracefully. But a simple retry is enough, if your random number generator is robust.
I don't see anything in your requirements that states that the string needs to be random. You could just do something like the following pseudocode:
for letters in ( 'AAAAAA' .. 'ZZZZZZ' ) {
    for numbers in ( 00 .. 99 ) {
        string = letters + numbers
    }
}
This will create unique strings eight characters long, with two digits and six upper-case letters.
If you need randomly-generated strings, then you need to keep some kind of record of which strings have been previously generated, so you're going to have to hit a DB (or keep them all in memory, or write them to a textfile) and check against that list.
I think you're safe well into your tens of thousands of such ID's, and even after that you're most likely alright.
Now if you want some determinism, you can always force a password after a certain number of failures. Say after 50 failures, you select a password at random and increment a part of it by 1 until you get a free one.
I'm willing to bet money though that you'll never see the extra functionality kick in during your life time :)
Do it the other way around: generate one big random number that you will split up to obtain the individual characters:
long bigRandom = ...;
int firstDigit = bigRandom % 10;
int secondDigit = ( bigRandom / 10 ) % 10;
and so on.
Then you only store the random number in your database and not the string. Since there's a one-to-one relationship between the string and the number, this doesn't really make a difference.
However, when you try to insert a new value and it's already in the database, you can easily find the smallest unallocated number greater than the originally generated number, and use that instead of the one you generated.
What you gain from this method is that you're guaranteed to find an available code relatively quickly, even when most codes are already allocated.
For one thing, your list of requirements doesn't state that the string has to be random, so you might consider something like a database index.
If 'random' is a requirement, you can do a few improvements.
Store the string as a number in the database. Not sure how much this improves performance.
Do not store used strings at all. You can employ the 'index' approach above, but convert the integer number to a string in a seemingly random fashion (e.g., employing a bit shift). Without much research, nobody will notice the pattern.
E.g., if we have sequence 1, 2, 3, 4, ... and use cyclic binary shift right by 1 bit, it'll be turned into 4, 1, 5, 2, ... (assuming we have 3 bits only)
It doesn't have to be a shift too, it can be a permutation or any other 'randomization'.
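For example, the 3-bit cyclic shift mentioned above (a sketch; the width and names are mine):

def rotate_right(value, bits=3):
    # cyclic right shift: 1 -> 4, 2 -> 1, 3 -> 5, 4 -> 2 for 3-bit values
    return (value >> 1) | ((value & 1) << (bits - 1))

def rotate_left(value, bits=3):
    # the inverse permutation, for turning a stored value back into the index
    return ((value << 1) | (value >> (bits - 1))) & ((1 << bits) - 1)

print([rotate_right(i) for i in (1, 2, 3, 4)])   # [4, 1, 5, 2]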
The problem with your approach is that while you have few records you are very unlikely to get collisions, but as the number of records grows the chance increases, until it becomes more likely than not that you'll get a collision, and eventually you will be hitting multiple collisions before you get a 'valid' result. Every attempt requires a table scan to determine whether the code is valid, and the whole thing turns into a mess.
The simplest solution is to precalculate your codes.
Start with the first code 00AAAA, and increment to generate 00AAAB, 00AAAC, ... 99ZZZZ. Insert them into a table in random order. When you need a new code, retrieve the top unused record from the table (then mark it as used). It's not a huge table, as pointed out above - only a few million records.
You don't need to calculate any random numbers and generate strings for each user (already done)
You don't need to check whether anything has already been used, just get the next available
No chance of getting multiple collisions before finding something usable.
If you ever need more 'codes', just generate some more 'random' strings and append them to the table.

Encoding / Error Correction Challenge

Is it mathematically feasible to encode an initial 4-byte message into 8 bytes such that, if one of the 8 bytes is completely dropped and another is wrong, the initial 4-byte message can still be reconstructed? There would be no way to retransmit, nor would the location of the dropped byte be known.
If one uses Reed Solomon error correction with 4 "parity" bytes tacked on to the end of the 4 "data" bytes, such as DDDDPPPP, and you end up with DDDEPPP (where E is an error) and a parity byte has been dropped, I don't believe there's a way to reconstruct the initial message (although correct me if I am wrong)...
What about multiplying (or performing another mathematical operation) the initial 4 byte message by a constant, then utilizing properties of an inverse mathematical operation to determine what byte was dropped. Or, impose some constraints on the structure of the message so every other byte needs to be odd and the others need to be even.
Alternatively, instead of bytes, it could also be 4 decimal digits encoded in some fashion into 8 decimal digits where errors could be detected & corrected under the same circumstances mentioned above - no retransmission and the location of the dropped byte is not known.
I'm looking for any crazy ideas anyone might have... Any ideas out there?
EDIT:
It may be a bit contrived, but the situation that I'm trying to solve is one where you have, let's say, a faulty printer that prints out important numbers onto a form, which are then mailed off to a processing firm which uses OCR to read the forms. The OCR isn't going to be perfect, but it should get close with only digits to read. The faulty printer could be a bigger problem, where it may drop a whole number, but there's no way of knowing which one it'll drop, but they will always come out in the correct order, there won't be any digits swapped.
The form could be altered so that it always prints a space between the initial four numbers and the error correction numbers, ie 1234 5678, so that one would know whether a 1234 initial digit was dropped or a 5678 error correction digit was dropped, if that makes the problem easier to solve. I'm thinking somewhat similar to how they verify credit card numbers via algorithm, but in four digit chunks.
Hopefully, that provides some clarification as to what I'm looking for...
In the absence of "nice" algebraic structure, I suspect that it's going to be hard to find a concise scheme that gets you all the way to 10**4 codewords, since information-theoretically, there isn't a lot of slack. (The one below can use GF(5) for 5**5 = 3125.) Fortunately, the problem is small enough that you could try Shannon's greedy code-construction method (find a codeword that doesn't conflict with one already chosen, add it to the set).
Encode up to 35 bits as a quartic polynomial f over GF(128). Evaluate the polynomial at eight predetermined points x0,...,x7 and encode as 0f(x0) 1f(x1) 0f(x2) 1f(x3) 0f(x4) 1f(x5) 0f(x6) 1f(x7), where the alternating zeros and ones are stored in the MSB.
When decoding, first look at the MSBs. If the MSB doesn't match the index mod 2, then that byte is corrupt and/or it's been shifted left by a deletion. Assume it's good and shift it back to the right (possibly accumulating multiple different possible values at a point). Now we have at least seven evaluations of a quartic polynomial f at known points, of which at most one is corrupt. We can now try all possibilities for the corruption.
EDIT: bmm6o has advanced the claim that the second part of my solution is incorrect. I disagree.
Let's review the possibilities for the case where the MSBs are 0101101. Suppose X is the array of bytes sent and Y is the array of bytes received. On one hand, Y[0], Y[1], Y[2], Y[3] have correct MSBs and are presumed to be X[0], X[1], X[2], X[3]. On the other hand, Y[4], Y[5], Y[6] have incorrect MSBs and are presumed to be X[5], X[6], X[7].
If X[4] is dropped, then we have seven correct evaluations of f.
If X[3] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 3, and six correct evaluations.
If X[5] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 5, and six correct evaluations.
There are more possibilities besides these, but we never have fewer than six correct evaluations, which suffices to recover f.
I think you would need to study what erasure codes might offer you. I don't know any bounds myself, but maybe some kind of MDS code might achieve this.
EDIT: After a quick search I found RSCode library and in the example it says that
In general, with E errors, and K erasures, you will need
* 2E + K bytes of parity to be able to correct the codeword
* back to recover the original message data.
So it looks like a Reed-Solomon code is indeed the answer, and you may actually get recovery from one erasure and one error with an (8,4) code.
Parity codes work as long as two different data bytes aren't affected by error or loss and as long as error isn't equal to any data byte while a parity byte is lost, imho.
Error correcting codes can in general handle erasures, but in the literature the position of the erasure is assumed known. In most cases, the erasure will be introduced by the demodulator when there is low confidence that the correct data can be retrieved from the channel. For instance, if the signal is not clearly 0 or 1, the device can indicate that the data was lost, rather than risking the introduction of an error. Since an erasure is essentially an error with a known position, they are much easier to fix.
I'm not sure what your situation is where you can lose a single value and you can still be confident that the remaining values are delivered in the correct order, but it's not a situation classical coding theory addresses.
What algorithmist is suggesting above is this: If you can restrict yourself to just 7 bits of information, you can fill the 8th bit of each byte with alternating 0 and 1, which will allow you to know the placement of the missing byte. That is, put a 0 in the high bit of bytes 0, 2, 4, 6 and a 1 in the high bits of the others. On the receiving end, if you only receive 7 bytes, the missing one will have been dropped from between bytes whose high bits match. Unfortunately, that's not quite right: if the erasure and the error are adjacent, you can't know immediately which byte was dropped. E.g., high bits 0101101 could result from dropping the 4th byte, or from an error in the 4th byte and dropping the 3rd, or from an error in the 4th byte and dropping the 5th.
You could use the linear code:
1 0 0 0 0 1 1 1
0 1 0 0 1 0 1 1
0 0 1 0 1 1 0 1
0 0 0 1 1 1 1 0
(I.e. you'll send data like (a, b, c, d, b+c+d, a+c+d, a+b+d, a+b+c), where addition is implemented with XOR, since a, b, c, d are elements of GF(128).) It's a linear code with distance 4, so it can correct a single-byte error. You can decode with syndrome decoding, and since the code is self-dual, the parity-check matrix H is the same as the matrix above.
In the case where there's a dropped byte, you can use the technique above to determine which one it is. Once you've determined that, you're essentially decoding a different code - the "punctured" code created by dropping that given byte. Since the punctured code is still linear, you can use syndrome decoding to determine the error. You would have to calculate the parity-check matrix for each of the shortened codes, but you can do this ahead of time. The shortened code has distance 3, so it can correct any single-byte error.
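A sketch of the encoder and syndrome computation for this generator matrix, with bytes combined by XOR; locating and correcting the error from the syndrome, and building the parity-check matrices of the punctured codes, are left out.

def encode_block(a, b, c, d):
    return [a, b, c, d,
            b ^ c ^ d,
            a ^ c ^ d,
            a ^ b ^ d,
            a ^ b ^ c]

def syndrome(r):
    # because the code is self-dual, the check rows are the rows of G itself
    return [r[0] ^ r[5] ^ r[6] ^ r[7],
            r[1] ^ r[4] ^ r[6] ^ r[7],
            r[2] ^ r[4] ^ r[5] ^ r[7],
            r[3] ^ r[4] ^ r[5] ^ r[6]]

word = encode_block(0x12, 0x34, 0x56, 0x78)
print(syndrome(word))        # [0, 0, 0, 0] for an undamaged codeword
word[2] ^= 0x40              # corrupt one byte
print(syndrome(word))        # nonzero syndrome flags the error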
In the case of decimal digits, assuming one goes with first digit odd, second digit even, third digit odd, etc - with two digits, you get 00-99, which can be represented in 3 odd/even/odd digits (125 total combinations) - 00 = 101, 01 = 103, 20 = 181, 99 = 789, etc. So one encodes two sets of decimal digits into 6 total digits, then the last two digits signify things about the first sets of 2 digits or a checksum of some sort... The next to last digit, I suppose, could be some sort of odd/even indicator on each of the initial 2 digit initial messages (1 = even first 2 digits, 3 = odd first two digits) and follow the pattern of being odd. Then, the last digit could be the one's place of a sum of the individual digits, that way if a digit was missing, it would be immediately apparent and could be corrected assuming the last digit was correct. Although, it would throw things off if one of the last two digits were dropped...
It looks to be theoretically possible if we assume a 1-bit error in the wrong byte. We need 3 bits to identify the dropped byte, 3 bits to identify the wrong byte, and 3 bits to identify the wrong bit; we have more than 3 times that many extra bits.
But if we need to identify any number of bit errors in the wrong byte, it comes to 30 bits. Even that looks to be possible with 32 bits, although 32 is a bit too close for my comfort.
But I don't know how to encode to get that. Try a turbo code?
Actually, as Krystian said, when you correct an RS code, both the message AND the "parity" bytes will be corrected, as long as you have 2e + v <= (n-k), where v is the number of erasures (whose positions you know) and e is the number of errors. This means that if you only have errors, you can correct up to (n-k)/2 errors, or up to (n-k) erasures (about double the number of errors), or a mix of both (see Blahut's articles: Transform techniques for error control codes and A universal Reed-Solomon decoder).
What's even nicer is that you can check that the correction was successful: if the syndrome polynomial contains only zero coefficients, you know that the message and parity bytes are both correct. You can do that check before decoding, to see whether the message needs any correction at all, and again after decoding, to confirm that both the message and the parity bytes were completely repaired.
The bound 2e + v <= (n-k) is optimal; you cannot do better (that's why Reed-Solomon is called an optimal error-correction code). In fact it is possible to go beyond this limit up to a certain point using brute-force approaches or list decoding (you can gain 1 or 2 more symbols for each 8 symbols), but that is still a domain in its infancy, and I don't know of any practical implementation that works.

Resources