What algorithm to use to calculate a check digit? - algorithm

What algorithm to use to calculate a check digit for a list of digits?
The length of the list is between 8 and 12 digits.
see also:
How to generate a verification code/number?

The Luhn algorithm is good enough for the credit card industry...

As RichieHindle points out, the Luhn algorithm is pretty good. It will detect (but not correct) any one error or transposition (except a transposition of 0 and 9).
You could also consider the algorithm for ISBN check digits, although for old-style ISBN, the check digit is sometimes "X", which may be a problem for you if you're using integer fields. New-style ISBN doesn't seem to have that problem. Wikipedia doesn't go in to the theoretical properties of the system, but I remember studying ISBN numbers in my coding theory course long ago, so I think they are pretty good :-)

I know it is a bit late (according to post dates), but first time I needed a check number algorithm was last week.
So I checked more algorithms and IMHO the best solution seems to be the Damm algorithm.
It is simple to implementation and detect most of tested errors. With default digit check table all single digit errors, all English language mishearing errors, all adjacent transposition errors, and almost all jump transpositions errors are detectable.
For me there was only a single problem, since I need to calculate check digit not only from numbers but also from characters. Unfortunately for me, there was a given rule, that the last character must be a digit; or better to say, the characters were assigned by third party authority and only fixed amount of numbers were used as manufacturer number.
There are many ways how to transcribe characters to number, but the error detection will always be lower, comparing to when only numbers are used.
For these cases you can use the ISO_6346 specification.
When there is no such limitation, use the tables for different size and assign characters and number to table values.
EDIT: updated/fixed description, added reason for digit check number for characters, and added tables for different base sizes.

Luhn algorithm
Check Digit Algorithm
Check Digit Algorithms Tutor
ISIN check digit algorithm

Verhoeff, there is nothing better IMO.

Related

Genetic Algorithm and Substitution Cipher

I want to write a genetic algorithm that decodes a string encoded with a substitution cipher. The input will be a string of lowercase characters from a to z and space characters, which do not get encoded. For example,
uyd zjglk brsmh osc tjewn spdr uyd xqia fsv
is a valid encoding of
the quick brown fox jumps over the lazy dog
Notice that the space character does not get encoded.
Genes will be one-to-one, random character mappings.
To determine a gene's (or mapping's) fitness, the string to be decoded is applied this map, and the number of recognized English words in the result is counted.
The algorithm terminates when all the words in the input string are valid English words.
I do not want to use other techniques, such as frequency analysis.
Will this work? What can be said about performance?
Counting the number of valid words gives a fitness landscape that is very "plateau-y".
In your example string, every individual will be assigned an integral fitness value between 0 and 9, inclusive, with the vast majority being at the low end of that range. This means if you generate an initial population, it's likely that all of them will have a fitness of zero. This means you can't have meaningful selection pressure, and the whole thing looks quite a lot like a random walk. You'll occasionally stumble upon something that gets a word right, and at that point, the population will shift towards that individual.
Given enough time, (and assuming your words are short enough to have some hope of randomly finding one every once in a while), you will eventually find the string. Genetic algorithms with sensible (i.e., ergodic) operators will always find the optimal solution if you let them run far enough into the land of super-exponential time. However, it's very unlikely that a GA would be a very good way of solving the problem.
A genetic algorithm often has "recombination" as well as "mutation" to create a new generation from the previous one. You may want to consider this -- if you have two particular substitution ciphers in your generation and when you look at the parts of them that create english words, it may be possible to combine the non-conflicting parts of the two ciphers that create english words, and make a cipher that creates even more english words than either of the two original ciphers that you "mated." If you don't do this, then the genetic algorithm may take longer.
Also, you may want to alter your choice of "fitness" function to something more complex than simply how many english words the cipher makes. Intuitively, if there is an encrypted word that is fairly long (say 5 or more letters) and has some repeated letter(s), then if you succeed in translating this to an english word, it's probably typically much better evidence that this part of the cipher is correct, as opposed to if you have two or three different 2-letter words that translate to english.
As for the "will it work / what about performance", I agree with the general consensus that your genetic algorithm is basically a structured way to do random guessing, and initially it will probably often be hard to ensure your population of fit individuals have some individuals that are making good progress toward the correct solution, simply because there can be many ciphers that give incorrect english words, e.g. if you have a lot of 3-letter words with 3 distinct letters. So you will either need a huge population size (at least in the beginning), or you'll have to restart the algorithm if you determine that your population is not getting any fitter (because they are all stuck near local optima that give a moderate number of english words, but they're totally off-track from the correct solution).
For genetic algorithm you need a way to get next generation. Either you invent some way to cross two permutations into a third one or you just make random modifications of most successful permutations. The latter gives you essentially local search algorithm based on random walk, which is not too efficient in terms of time, but may converge.
The former won't do any good at all. For different permutations you may get non-zero word count even if they don't share a single correct letter pair. In short, substitution cypher is too nonlinear, so that your algorithm becomes a series of random guesses, something like bogosort. You may evaluate not a number of words, but something like "likelihood" of letter chains, but it will be pretty much a kind of frequency analysis.

Fuzzy Matching Numbers

I've been working with Double Metaphone and Caverphone2 for String comparisons and they work good on things like names, addresses, etc (Caverphone2 is working best for me). However, they produce way too many false positives when you get to numeric values, such as phone numbers, ip addresses, credit card numbers, etc.
So I've looked at the Luhn and Verhoeff algorithms and they describe essentially what I want, but not quite. They seem good at validation, but do not appear to be built for fuzzy matching. Is there anything that behaves like Luhn and Verhoeff, which could detected single-digit errors and transposition errors involving two adjacent digits, for encoding and comparison purposes similar to the fuzzy string algorithms?
I'd like to encode a number, then compare it to 100,000 other numbers to find closely identical matches. So something like 7041234 would match against 7041324 as a possible transcription error, but something like 4213704 would not.
Levenshtein and friends may be good for finding the distance between to specific strings or numbers. However if you want to build a spelling corrector you don't want to run through your entire word database at every query.
Peter Norvig wrote a very nice article on a simple "fuzzy matching" spelling correcter based on some of the technology behind google spelling suggestions.
If your dictionary has N entries, and the average word has length L, the "Brute force Levenshtein" approach would take time O(N*L^3). Peter Norvig's approach instead generates all words within a certain edit distance from the input, and looks them up in the dictionary. Hence it achieves O(L^k), where k is the furthest edit distance considered.

Programming Logic: Finding the smallest equation to a large number

I do not know a whole lot about math, so I don't know how to begin to google what I am looking for, so I rely on the intelligence of experts to help me understand what I am after...
I am trying to find the smallest string of equations for a particular large number. For example given the number
"39402006196394479212279040100143613805079739270465446667948293404245721771497210611414266254884915640806627990306816"
The smallest equation is 64^64 (that I know of) . It contains only 5 bytes.
Basically the program would reverse the math, instead of taking an expression and finding an answer, it takes an answer and finds the most simplistic expression. Simplistic is this case means smallest string, not really simple math.
Has this already been created? If so where can I find it? I am looking to take extremely HUGE numbers (10^10000000) and break them down to hopefully expressions that will be like 100 characters in length. Is this even possible? are modern CPUs/GPUs not capable of doing such big calculations?
Edit:
Ok. So finding the smallest equation takes WAY too much time, judging on answers. Is there anyway to bruteforce this and get the smallest found thus far?
For example given a number super super large. Sometimes taking the sqaureroot of number will result in an expression smaller than the number itself.
As far as what expressions it would start off it, well it would naturally try expressions that would the expression the smallest. I am sure there is tons of math things I dont know, but one of the ways to make a number a lot smaller is powers.
Just to throw another keyword in your Google hopper, see Kolmogorov Complexity. The Kolmogorov complexity of a string is the size of the smallest Turing machine that outputs the string, given an empty input. This is one way to formalize what you seem to be after. However, calculating the Kolmogorov complexity of a given string is known to be an undecidable problem :)
Hope this helps,
TJ
There's a good program to do that here:
http://mrob.com/pub/ries/index.html
I asked the question "what's the point of doing this", as I don't know if you're looking at this question from a mathemetics point of view, or a large number factoring point of view.
As other answers have considered the factoring point of view, I'll look at the maths angle. In particular, the problem you are describing is a compressibility problem. This is where you have a number, and want to describe it in the smallest algorithm. Highly random numbers have very poor compressibility, as to describe them you either have to write out all of the digits, or describe a deterministic algorithm which is only slightly smaller than the number itself.
There is currently no general mathemetical theorem which can determine if a representation of a number is the smallest possible for that number (although a lower bound can be discovered by understanding shannon's information theory). (I said general theorem, as special cases do exist).
As you said you don't know a whole lot of math, this is perhaps not a useful answer for you...
You're doing a form of lossless compression, and lossless compression doesn't work on random data. Suppose, to the contrary, that you had a way of compressing N-bit numbers into N-1-bit numbers. In that case, you'd have 2^N values to compress into 2^N-1 designations, which is an average of 2 values per designation, so your average designation couldn't be uncompressed. Lossless compression works well on relatively structured data, where data we're likely to get is compressed small, and data we aren't going to get actually grows some.
It's a little more complicated than that, since you're compressing partly by allowing more information per character. (There are a greater number of N-character sequences involving digits and operators than digits alone.) Still, you're not going to get lossless compression that, on the average, is better than just writing the whole numbers in binary.
It looks like you're basically wanting to do factoring on an arbitrarily large number. That is such a difficult problem that it actually serves as the cornerstone of modern-day cryptography.
This really appears to be a mathematics problem, and not programming or computer science problem. You should ask this on https://math.stackexchange.com/
While your question remains unclear, perhaps integer relation finding is what you are after.
EDIT:
There is some speculation that finding a "short" form is somehow related to the factoring problem. I don't believe that is true unless your definition requires a product as the answer. Consider the following pseudo-algorithm which is just sketch and for which no optimization is attempted.
If "shortest" is a well-defined concept, then in general you get "short" expressions by using small integers to large powers. If N is my integer, then I can find an integer nearby that is 0 mod 4. How close? Within +/- 2. I can find an integer within +/- 4 that is 0 mod 8. And so on. Now that's just the powers of 2. I can perform the same exercise with 3, 5, 7, etc. We can, for example, easily find the nearest integer that is simultaneously the product of powers of 2, 3, 5, 7, 11, 13, and 17, call it N_1. Now compute N-N_1, call it d_1. Maybe d_1 is "short". If so, then N_1 (expressed as power of the prime) + d_1 is the answer. If not, recurse to find a "short" expression for d_1.
We can also pick integers that are maybe farther away than our first choice; even though the difference d_1 is larger, it might have a shorter form.
The existence of an infinite number of primes means that there will always be numbers that cannot be simplified by factoring. What you're asking for is not possible, sorry.

Given a number series, finding the Check Digit Algorithm...?

Suppose I have a series of index numbers that consists of a check digit. If I have a fair enough sample (Say 250 sample index numbers), do I have a way to extract the algorithm that has been used to generate the check digit?
I think there should be a programmatic approach atleast to find a set of possible algorithms.
UPDATE: The length of a index number is 8 Digits including the check digit.
No, not in the general case, since the number of possible algorithms is far more than what you may think. A sample space of 250 may not be enough to do proper numerical analysis.
For an extreme example, let's say your samples are all 15 digits long. You would not be able to reliably detect the algorithm if it changed the behaviour for those greater than 15 characters.
If you wanted to be sure, you should reverse engineer the code that checks the numbers for validity (if available).
If you know that the algorithm is drawn from a smaller subset than "every possible algorithm", then it might be possible. But algorithms may be only half the story - there's also the case where multipliers, exponentiation and wrap-around points change even using the same algorithm.
paxdiablo is correct, and you can't guess the algorithm without making any other assumption (or just having the whole sample space - then you can define the algorithm by a look up table).
However, if the check digit is calculated using some linear formula dependent on the "data digits" (which is a very common case, as you can see in the wikipedia article), given enough samples you can use Euler elimination.

Recombine Number to Equal Math Formula

I've been thinking about a math/algorithm problem and would appreciate your input on how to solve it!
If I have a number (e.g. 479), I would like to recombine its digits or combination of them to a math formula that matches the original number. All digits should be used in their original order, but may be combined to numbers (hence, 479 allows for 4, 7, 9, 47, 79) but each digit may only be used once, so you can not have something like 4x47x9 as now the number 4 was used twice.
Now an example just to demonstrate on how I think of it. The example is mathematically incorrect because I couldn't come up with a good example that actually works, but it demonstrates input and expected output.
Example Input: 29485235
Example Output: 2x9+48/523^5
As I said, my example does not add up (2x9+48/523^5 doesn't result in 29485235) but I wondered if there is an algorithm that would actually allow me to find such a formula consisting of the source number's digits in their original order which would upon calculation yield the original number.
On the type of math used, I'd say parenthesis () and Add/Sub/Mul/Div/Pow/Sqrt.
Any ideas on how to do this? My thought was on simply brute forcing it by chopping the number apart by random and doing calculations hoping for a matching result. There's gotta be a better way though?
Edit: If it's any easier in non-original order, or you have an idea to solve this while ignoring some of the 'conditions' described above, it would still help tremendously to understand how to go about solving such a problem.
For numbers up to about 6 digits or so, I'd say brute-force it according to the following scheme:
1) Split your initial value into a list (array, whatever, according to language) of numbers. Initially, these are the digits.
2) For each pair of numbers, combine them together using one of the operators. If the result is the target number, then return success (and print out all the operations performed on your way out). Otherwise if it's an integer, recurse on the new, smaller list consisting of the number you just calculated, and the numbers you didn't use. Or you might want to allow non-integer intermediate results, which will make the search space somewhat bigger. The binary operations are:
Add
subtract
multiply
divide
power
concatenate (which may only be used on numbers which are either original digits, or have been produced by concatenation).
3) Allowing square root bloats the search space to infinity, since it's a unary operator. So you will need a way to limit the number of times it can be applied, and I'm not sure what that will be (loss of precision as the answer approaches 1, maybe?). This is another reason to allow only integer intermediate values.
4) Exponentiation will rapidly cause overflows. 2^(9^(4^8)) is far too large to store all the digits directly [although in base 2 it's pretty obvious what they are ;-)]. So you'll either have to accept that you might miss solutions with large intermediate values, or else you'll have to write a bunch of code to do your arithmetic in terms of factors. These obviously don't interact very well with addition, so you might have to do some estimation. For example, just by looking at the magnitude of the number of factors we see that 2^(9^(4^8)) is nowhere near (2^35), so there's no need to calculate (2^(9^(4^8)) + 5) / (2^35). It can't possibly be 29485235, even if it were an integer (which it certainly isn't - another way to rule out this particular example). I think handling these numbers is harder than the rest of the problem put together, so perhaps you should limit yourself to single-digit powers to begin with, and perhaps to results which fit in a 64bit integer, depending what language you are using.
5) I forgot to exclude the trivial solution for any input, of just concatenating all the digits. That's pretty easy to handle, though, just maintain a parameter through the recursion which tells you whether you have performed any non-concatenation operations on the route to your current sub-problem. If you haven't, then ignore the false match.
My estimate of 6 digits is based on the fact that it's fairly easy to write a Countdown solver that runs in a fraction of a second even when there's no solution. This problem is different in that the digits have to be used in order, but there are more operations (Countdown does not permit exponentiation, square root, or concatenation, or non-integer intermediate results). Overall I think this problem is comparable, provided you resolve the square root and overflow issues. If you can solve one case in a fraction of a second, then you can brute force your way through a million candidates in reasonable time (assuming you don't mind leaving your PC on).
By 10 digits, brute force appears impossible, because you have to consider 10 billion cases, each with a significant amount of recursion required. So I guess you'll hit the limit of brute force somewhere between the two.
Note also that my simple algorithm at the top still has a lot of redundancy - it doesn't stop you doing (4,7,9,1) -> (47,9,1) -> (47,91), and then later also doing (4,7,9,1) -> (4,7,91) -> (47,91). So unless you work out where those duplicates are going to occur and avoid them, you'll attempt (47,91) twice. Obviously that's not much work when there's only 2 numbers in the list, but when there are 7 numbers in the list, you probably do not want to e.g. add 4 of them together in 6 different ways and then solve the resulting 4-number problem 6 times. Cleverness here is not required for the Countdown game, but for all I know in this problem it might make the difference between brute-forcing 8 digits, and brute-forcing 9 digits, which is quite significant.
Numbers like that, as I recall, are exceedingly rare, if extant. Some numbers can be expressed by their component digits in a different order, such as, say, 25 (5²).
Also, trying to brute-force solutions is hopeless, at best, given that the number of permutations increase extremely rapidly as the numbers grow in digits.
EDIT: Partial solution.
A partial solution solving some cases would be to factorize the number into its prime factors. If its prime factors are all the same, and the exponent and factor are both present in the digits of the number (such as is the case with 25) you have a specific solution.
Most numbers that do fall into these kinds of patterns will do so either with multiplication or pow() as their major driving force; addition simply doesn't increase it enough.
Short of building a neural network that replicates Carol Voorderman I can't see anything short of brute force working - humans are quite smart at seeing patterns in problems such as this but encoding such insight is really tough.

Resources