Basic Reed-Solomon Error Correction Question - algorithm

Does Reed-Solomon error correction work in an instance where there is a dropped byte (or multiple dropped bytes)? For example, let's say it's a (12,8) Reed Solomon code, so theoretically it should be able to correct 2 errors (or 4 erasures if the position is known). But, what happens if only 11 (or 10) bytes are received and one doesn't know which byte(s) were dropped? Will Reed-Solomon error correction work?
Thanks,
Ben

RS decoding for erasures requires the position of the symbols "dropped" or lost. The kind of error you're talking about is due to phase distortion.

You can make it work by simply cycling through the possible positions where the character might be missing and letting it try to correct your result, so let's say you received 10 characters:
1234567890
Have it correct the following values:
??1234567890
?1?234567890
?12?34567890
:
1??234567890
1?2?34567890
:
1234567890??
Each attempt will probably give you some result, and most of them will not be the one you want. But I would expect exactly one result to need the minimal number of additional modifications, and that is the one to use as the most likely correct answer.
For example, if you correct the first three candidates of the example above, you might get the following results:
    v
361274567890
917234567890
312734569897
:       ^  ^
For the first and third case, you have additional corrections made beyond filling in the two blanks (marked with v and ^), whereas in the second case you have only the missing positions filled in and the other characters match the uncorrected input. Therefore, I would choose answer 2 as the most likely to be correct one.
Clearly, the chances that this works depend on whether there are other errors. Unfortunately I'm not able to give you a rigorous set of conditions under which this method will work for sure.
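The search over possible drop positions can be sketched in Python. This is only an illustration under assumptions: `erasure_candidates`, `extra_corrections` and `show` are hypothetical helper names, and the actual Reed-Solomon erasure decoding is left to whatever decoder you already have. The sketch only enumerates the padded candidates and scores a decoded result by how many symbols changed outside the erased slots.

```python
from itertools import combinations

def erasure_candidates(received, n):
    """Yield (positions, word) for every way of padding `received` to length n.

    Erased slots are marked with None so the word can be handed to an
    erasure-capable decoder (not included in this sketch).
    """
    missing = n - len(received)
    for positions in combinations(range(n), missing):
        symbols = iter(received)
        word = [None if i in positions else next(symbols) for i in range(n)]
        yield positions, word

def extra_corrections(word, decoded):
    """Count symbols the decoder changed outside the erased slots."""
    return sum(1 for w, d in zip(word, decoded) if w is not None and w != d)

def show(word):
    """Render a candidate with '?' in the erased slots."""
    return "".join("?" if s is None else s for s in word)

candidates = list(erasure_candidates("1234567890", 12))
print(len(candidates))            # 66 placements of two missing symbols
for _, word in candidates[:3]:
    print(show(word))
```

With a real decoder you would decode every candidate and keep the one with the smallest `extra_corrections` score, treating a tie as "cannot decide".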
Another thing you can do if your message is long enough is to use an interleaving technique to basically have multiple orthogonal RS codes cover your data. That way, if one fails, you might be able to recover with another one. This method is for example used on compact discs (CDs), where it is called CIRC.

No, Reed-Solomon can't automatically correct instances where there are missing bits because, like most other FEC algorithms, it was only designed to correct substitution errors such as bit-flips, not insertions or deletions. If you know the position of the missing bits, you can pad your received signal at those positions so that RS can then work normally.
However, if you don't know the position, you will need to use another algorithm that supports bit-insertion or bit-deletion such as Marker Codes and Watermark Codes.
Also note that RS can be used not only for erasures but also to process noisy bits, using Forney syndromes.

Related

Does Reed-Solomon Error algorithm allow correction only if error occur on input data part?

The Reed-Solomon algorithm adds additional data to the input, so that potential errors (of a particular size/quantity) in the damaged input can be corrected back to the original state. Correct? Does this algorithm also protect the added data, which is not part of the input but is used by the algorithm? If not, what happens if the error occurs in that non-input part?
An important aspect is that Reed-Solomon (RS) codes are cyclic: the set of codewords is closed under cyclic shifts.
A consequence is that no particular part of a code word is more protected or less protected.
An RS code has an error-correction capability of t = (n-k)/2, where n is the code length (generally expressed in bytes) and k is the length of the information part.
If the total number of errors (in both parts) is at most t, the RS decoder will be able to correct them (more precisely, up to t erroneous bytes in the general case). If it is higher, the errors cannot be corrected (but may still be detected, which is another story).
The location of the errors, either in the information part or in the added part, has no influence on the error-correction capability.
EDIT: the rule t = (n-k)/2 that I mentioned is valid for Reed-Solomon codes. This rule is not generally correct for BCH codes, where t <= (n-k)/2. However, with respect to your question, this does not change the answer: these families of codes have a given correction capability, corresponding to the minimum distance between codewords, and the decoders can then correct t errors, whatever the position of the errors in the codeword.
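As a small worked example of the rule above, the following Python sketch computes the capability of an (n, k) Reed-Solomon code. The function names are made up for illustration, and the combined condition 2*errors + erasures <= n - k is the usual bound for decoders that handle errors and erasures together.

```python
def correction_capability(n, k):
    """Return (max_errors, max_erasures) for an (n, k) Reed-Solomon code."""
    parity = n - k
    return parity // 2, parity

def is_correctable(n, k, errors, erasures=0):
    """An error costs two parity symbols, an erasure costs one."""
    return 2 * errors + erasures <= n - k

print(correction_capability(12, 8))      # (2, 4)
print(is_correctable(12, 8, errors=2))   # True, wherever the errors are
print(is_correctable(12, 8, errors=3))   # False
```

Note that neither function takes the error positions as input, which reflects the point made above: the information part and the added part are protected equally.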
As long as only half or less of the added data is in error, then errors that are only in the added data can be corrected.
With the appended data, the data + appended data form what is called a codeword, one that meets the rules for a codeword. Note there are two basic types of Reed-Solomon code, the "original view" and the "BCH view". What constitutes a valid codeword depends on which type of Reed-Solomon code is being used. Link to the Wiki article that explains this:
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction
For an erasure-only code, the location of all errors is determined by other means, and in this case, even if all of the appended data is known to be bad, it can be corrected (or regenerated).

What does Cyclic Code mean? Are CRC and Reed-Solomon Cyclic codes?

We have an assignment where we must make a program in Python that compresses a txt file with LZ-78, then encodes the compressed file with a "cyclic code", and after that sends it as a JSON file to a receiver. I can't find an exact clarification of what the professor means by cyclic code.
I searched the web and I found about CRC and Reed-Solomon but I'm not sure if these two are the correct codes to use, so can you please explain to me if these are okay for me to use or if I need something different.
I'm not sure if it helps, but for some teams he specified that he wanted them to use Reed-Muller.
what does cyclic code mean?
That every valid codeword can be rotated (left or right), and the result will be another valid codeword. CRCs (at least ones that don't post-complement the CRC), BCH codes, and BCH-type Reed-Solomon codes are cyclic codes. Original-view Reed-Solomon codes are not cyclic unless a specific set of evaluation values, the successive powers of the field primitive element alpha, is used.
Encoding and decoding normally don't directly exploit the cyclic nature of cyclic codes, other than as a possible method (reverse cycling as opposed to a lookup table) to correct single burst errors.
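To make the rotation property concrete, here is a small Python check on a toy cyclic code. The choice of code is my own assumption for illustration: the binary (7,4) code generated by g(x) = x^3 + x + 1, which divides x^7 + 1 over GF(2). Polynomials are stored as integers, one bit per coefficient.

```python
def clmul(a, b):
    """Carry-less multiplication, i.e. polynomial multiplication over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

N = 7          # codeword length in bits
G = 0b1011     # g(x) = x^3 + x + 1

# Every codeword is message(x) * g(x) for one of the 16 four-bit messages.
codewords = {clmul(message, G) for message in range(16)}

def rotate_left(word, n=N):
    """Rotate an n-bit word left by one position."""
    return ((word << 1) | (word >> (n - 1))) & ((1 << n) - 1)

# Rotating any codeword yields another codeword of the same code.
print(all(rotate_left(c) in codewords for c in codewords))   # True
```

The same check fails for a code whose generator does not divide x^n + 1, which is the algebraic condition behind the word "cyclic".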
https://en.wikipedia.org/wiki/Cyclic_code
https://en.wikipedia.org/wiki/BCH_code
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction
Reed Muller is a class of older codes that are not cyclic.
https://en.wikipedia.org/wiki/Reed%E2%80%93Muller_code
http://www-math.ucdenver.edu/~wcherowi/courses/m7823/reedmuller.pdf
http://www.mcs.csueastbay.edu/~malek/Class/Reed-Muller.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.208.440&rep=rep1&type=pdf
Due to the conflict between "cyclic" and "Reed Muller", you should probably ask the professor for clarification.

Patterns in a key generation algorithm

I want to reverse-engineer a key generation algorithm which starts from a 4-byte ID, and the output is a 4-byte key. This seems to not be impossible or very difficult, because some patterns can be observed. In the following picture are the inputs and outputs of the algorithm for 8 situations:
As it can be seen, if the bytes from inputs are matching, also the outputs are matching, but with some exceptions (the red marking in the image).
So I think there are some simple arithmetic/binary operations done, and the mismatch could come from a carry of an addition operation.
Until now I ran a C program with some simple operations on the least significant byte of the inputs, with up to 4 variable parameters (0..255, all combinations) and compared with the output LSB, but without success.
Could you please advise me on what else I could try? And do you think what I'm trying to do is possible?
Thank you very much!

How does a finite state machine perform division?

I am taking a course on models of computation and currently we are doing finite state machines. One of my tasks is to draw an FSM that performs division by 3; to simplify the model, the machine only accepts numbers that are multiples of 3. I am not sure how exactly this works, especially since I imagine an FSM putting out only single binary values. Could you guys give examples (division by 2 or 4) or hints on how to approach this?
This is what you need, I think (sorry about the bad picture). The 'E' represents epsilon/lambda/no-output. The label of the edges denotes 'input/output'. For each symbol read there is also a corresponding output which may be lambda (no output).
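Since the picture is not reproduced here, the following Python sketch shows one possible machine of this kind. It is an assumption on my part, not necessarily the machine in the picture: a Mealy machine that reads the binary number most significant bit first, keeps the running remainder (0, 1 or 2) as its state, and emits one quotient bit per input bit instead of using lambda outputs.

```python
# (state, input bit) -> (next state, output bit); the state is the running remainder.
TRANSITIONS = {
    (r, b): ((2 * r + b) % 3, (2 * r + b) // 3)
    for r in range(3)
    for b in (0, 1)
}

def divide_by_3(bits):
    """Run the machine on a binary string and return the quotient bits."""
    state = 0
    output = []
    for ch in bits:
        state, quotient_bit = TRANSITIONS[(state, int(ch))]
        output.append(str(quotient_bit))
    if state != 0:
        # Only state 0 is accepting: the input was not a multiple of 3.
        raise ValueError("input is not a multiple of 3")
    return "".join(output)

print(divide_by_3("1100"))   # 12 / 3 -> "0100", i.e. 4
```

This is just binary long division: shifting in a bit doubles the remainder and adds the bit, and the output bit says whether 3 could be subtracted.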

How to generate a verification code/number?

I'm working on an application where users have to make a call and type a verification number with the keypad of their phone.
I would like to be able to detect if the number they type is correct or not. The phone system does not have access to a list of valid numbers, but instead, it will validate the number against an algorithm (like a credit card number).
Here are some of the requirements :
It must be difficult to type a valid random code
It must be difficult to have a valid code if I make a typo (transposition of digits, wrong digit)
I must have a reasonable number of possible combinations (let's say 1M)
The code must be as short as possible, to avoid errors from the user
Given these requirements, how would you generate such a number?
EDIT :
#Haaked: The code has to be numerical because the user types it on their phone.
#matt b: On the first step, the code is displayed on a Web page, the second step is to call and type in the code. I don't know the user's phone number.
Followup : I've found several algorithms to check the validity of numbers (See this interesting Google Code project : checkDigits).
After some research, I think I'll go with the ISO 7064 Mod 97,10 formula. It seems pretty solid as it is used to validate IBAN (International Bank Account Number).
The formula is very simple:
Take a number : 123456
Apply the following formula to obtain the 2 digits checksum : mod(98 - mod(number * 100, 97), 97) => 76
Concat number and checksum to obtain the code => 12345676
To validate a code, verify that mod(code, 97) == 1
Test :
mod(12345676, 97) = 1 => GOOD
mod(21345676, 97) = 50 => BAD !
mod(12345678, 97) = 3 => BAD !
Apparently, this algorithm catches most of the errors.
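The formula above translates directly into Python. This is a minimal sketch that works on integers, so codes with leading zeros would need string handling on top; the function names are mine, not taken from the checkDigits project.

```python
def checksum(number):
    """Two-digit check value, using the formula quoted above."""
    return (98 - (number * 100) % 97) % 97

def encode(number):
    """Append the two check digits to the number."""
    return number * 100 + checksum(number)

def is_valid(code):
    """A code is valid when it leaves remainder 1 modulo 97."""
    return code % 97 == 1

print(encode(123456))        # 12345676
print(is_valid(12345676))    # True
print(is_valid(21345676))    # False (transposed digits)
print(is_valid(12345678))    # False (wrong digit)
```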
Another interesting option was the Verhoeff algorithm. It has only one verification digit and is more difficult to implement (compared to the simple formula above).
For 1M combinations you'll need 6 digits. To make sure that there aren't any accidentally valid codes, I suggest 9 digits with a 1/1000 chance that a random code works. I'd also suggest using another digit (10 total) to perform an integrity check. As far as distribution patterns, random will suffice and the check digit will ensure that a single error will not result in a correct code.
Edit: Apparently I didn't fully read your request. Using a credit card number, you could perform a hash on it (MD5 or SHA1 or something similar). You then truncate at an appropriate spot (for example 9 characters) and convert to base 10. Then you add the check digit(s) and this should more or less work for your purposes.
You want to segment your code. Part of it should be a 16-bit CRC of the rest of the code.
If all you want is a verification number then just use a sequence number (assuming you have a single point of generation). That way you know you are not getting duplicates.
Then you prefix the sequence with a CRC-16 of that sequence number AND some private key. You can use anything for the private key, as long as you keep it private. Make it something big, at least a GUID, but it could be the text of War and Peace from Project Gutenberg. It just needs to be secret and constant. Having a private key prevents people from being able to forge a key, but using a 16-bit CRC makes it easier to break.
To validate you just split the number into its two parts, and then take a CRC-16 of the sequence number and the private key.
If you want to obscure the sequential portion more, then split the CRC in two parts. Put 3 digits at the front and 2 at the back of the sequence (zero pad so the length of the CRC is consistent).
This method allows you to start with smaller keys too. The first 10 keys will be 6 digits.
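A rough Python sketch of this scheme, under a few assumptions of mine: the CRC is the 16-bit CRC-CCITT provided by `binascii.crc_hqx`, it is written as five zero-padded decimal digits in front of the sequence number, and `SECRET` is a placeholder for the real private key. The variant that splits the CRC around the sequence is not shown.

```python
import binascii

SECRET = b"replace-with-a-long-private-value"   # placeholder private key

def crc_digits(sequence):
    """CRC-16 of sequence number + private key, as 5 zero-padded digits."""
    data = str(sequence).encode("ascii") + SECRET
    return f"{binascii.crc_hqx(data, 0):05d}"

def make_code(sequence):
    """Prefix the sequence number with its keyed CRC."""
    return crc_digits(sequence) + str(sequence)

def verify(code):
    """Split the code into CRC and sequence and recompute the CRC."""
    if len(code) < 6 or not (code.isascii() and code.isdigit()):
        return False
    crc, sequence = code[:5], code[5:]
    return crc == crc_digits(sequence)

code = make_code(42)
print(len(make_code(7)))     # 6 digits for single-digit sequence numbers
print(verify(code))          # True
```

As the answer says, a 16-bit check is easy to brute-force: a caller guessing codes for a given sequence number succeeds after at most 65536 tries.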
Does it have to be only numbers? You could create a random number between 1 and 1M (I'd suggest even higher though) and then Base32 encode it. The next thing you need to do is Hash that value (using a secret salt value) and base32 encode the hash. Then append the two strings together, perhaps separated by the dash.
That way, you can verify the incoming code algorithmically. You just take the left side of the code, hash it using your secret salt, and compare that value to the right side of the code.
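Here is one way this could look in Python. Treat it as a sketch with assumptions: the "hash with a secret salt" is implemented as an HMAC-SHA256 truncated to 5 bytes, which is a swapped-in but common way of keying a hash, and `SALT` and the function names are placeholders.

```python
import base64
import hashlib
import hmac

SALT = b"hypothetical-secret-salt"   # placeholder, keep the real one private

def b32(data):
    """Base32-encode bytes and drop the '=' padding."""
    return base64.b32encode(data).decode("ascii").rstrip("=")

def tag_for(left):
    """Keyed hash of the left part, truncated to 5 bytes (8 Base32 characters)."""
    digest = hmac.new(SALT, left.encode("utf-8"), hashlib.sha256).digest()
    return b32(digest[:5])

def make_code(number):
    """Build 'LEFT-RIGHT' from a number that fits in 4 bytes."""
    left = b32(number.to_bytes(4, "big"))
    return f"{left}-{tag_for(left)}"

def verify(code):
    """Recompute the right side from the left side and compare."""
    left, dash, right = code.partition("-")
    expected = tag_for(left).encode("utf-8")
    return bool(dash) and hmac.compare_digest(expected, right.encode("utf-8"))

code = make_code(123456)
print(code)
print(verify(code))    # True
```

Base32 uses letters, so this only fits if the code does not have to be typed on a numeric keypad.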
I must have a reasonable number of possible combinations (let's say 1M)
The code must be as short as possible, to avoid errors from the user
Well, if you want it to have at least one million combinations, then you need at least six digits. Is that short enough?
When you are creating the verification code, do you have access to the caller's phone number?
If so I would use the caller's phone number and run it through some sort of hashing function so that you can guarantee that the verification code you gave to the caller in step 1 is the same one that they are entering in step 2 (to make sure they aren't using a friend's validation code or they simply made a very lucky guess).
About the hashing, I'm not sure if it's possible to take a 10 digit number and come out with a hash result that would be < 10 digits (I guess you'd have to live with a certain amount of collision) but I think this would help ensure the user is who they say they are.
Of course this won't work if the phone number used in step 1 is different than the one they are calling from in step 2.
Assuming you already know how to detect which key the user hit, this should be doable reasonably easily. In the security world, there is the notion of a "one time" password. This is sometimes referred to as a "disposable password." Normally these are restricted to the (easily typable) ASCII values: [a-zA-Z0-9] and a bunch of easily typable symbols like comma, period, semicolon, and parentheses. In your case, though, you'd probably want to limit the range to [0-9] and possibly include * and #.
I am unable to explain all the technical details of how these one-time codes are generated (or work) adequately. There is some intermediate math behind it, which I'd butcher without first reviewing it myself. Suffice it to say that you use an algorithm to generate a stream of one-time passwords. No matter how many previous codes you know, the subsequent one should be impossible to guess! In your case, you'll simply use each password on the list as the user's random code.
Rather than fail at explaining the details of the implementation myself, I'll direct you to a 9-page article where you can read up on it yourself: https://www.grc.com/ppp.htm
It sounds like you have the unspoken requirement that it must be quickly determined, via algorithm, that the code is valid. This would rule out you simply handing out a list of one time pad numbers.
There are several ways people have done this in the past.
Make a public key and private key. Encode the numbers 0-999,999 using the private key, and hand out the results. You'll need to throw in some random numbers to make the result come out to the longer version, and you'll have to convert the result from base 64 to base 10. When you get a number entered, convert it back to base 64, apply the private key, and see if the interesting numbers are under 1,000,000 (discard the random numbers).
Use a reversible hash function
Use the first million numbers from a PRNG seeded at a specific value. The "checking" function can get the seed and know that the next million values are good. It can either generate them each time and check them one by one when a code is received, or store them all in a sorted table on program startup and then use binary search (a maximum of about 20 comparisons), since one million integers do not take a whole lot of space.
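The seeded-generator option could be sketched like this in Python. The seed, the 9-digit code width and the function names are illustrative assumptions, and `random.Random` is not a cryptographically secure generator, so anyone who learns the seed can reproduce every code.

```python
import random
from bisect import bisect_left

def build_table(seed, count, digits=9):
    """Generate `count` distinct codes from a seeded PRNG and sort them."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(10 ** digits), count))

def is_valid(table, code):
    """Binary search: about 20 comparisons for one million entries."""
    index = bisect_left(table, code)
    return index < len(table) and table[index] == code

table = build_table(seed=2024, count=1000)   # use 1_000_000 in practice
print(is_valid(table, table[0]))             # True
print(is_valid(table, -1))                   # False
```

Because the table is rebuilt from the seed, the checking side only needs to store the seed and the count, not the list of codes itself.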
There are a bunch of other options, but these are common and easy to implement.
-Adam
You linked to the check digits project, and using the "encode" function seems like a good solution. It says:
encode may throw an exception if 'bad' data (e.g. non-numeric) is passed to it, while verify only returns true or false. The idea here is that encode normally gets its data from 'trusted' internal sources (a database key for instance), so it should be pretty unusual, in fact exceptional, that bad data is being passed in.
So it sounds like you could pass the encode function a database key (5 digits, for instance) and you could get a number out that would meet your requirements.
