Algorithm for one sign checksum - algorithm

I am desperate in the search for an algorithm to create a checksum that is a maximum of two characters long and can recognize the confusion of characters in the input sequence. When testing different algorithms, such as Luhn, CRC24 or CRC32, the checksums were always longer than two characters. If I reduce the checksum to two or even one character, then no longer all commutations are recognized.
Does any of you know an algorithm that meets my needs? I already have a name with which I can continue my search. I would be very grateful for your help.

Taking that your data is alphanumeric, you want to detect all the permutations (in the perfect case), and you can afford to use the binary checksum (i.e. full 16 bits), my guess is that you should probably go with CRC-16 (as already suggested by #Paul Hankin in the comments), as it is more information-dense compared to check-digit algorithms like Luhn or Damm, and is more "generic" when it comes to possible types of errors.
Maybe something like CRC-CCITT (CRC-16-CCITT), you can give it a try here, to see how it works for you.

Related

if i convert a file's contents into a single large number and express it as a mathematical expression, does it mean I have compressed the file?

assuming the mathematical expression has less number of characters than the original number.
example-
20880467999847912034355032910578 can be expressed as (23^23 +10)
this looks like a good compression method. Will it work for compressing large files?
UPDATE- i didn't mean converting a file into a large binary number. lets say i have a text file and i replace all the characters in it with their ascii values. now i have a large number in the decimal number system. i can express it as a mathematical expression like in the example above.
The notion you're looking for is Kolmogorov complexity - it's a measure of how algorithmically incompressible a number is. See this wiki article for a rigorous definition and examples of such numbers.
If you take the contents of a file as a large binary number, and find an expression which evaluates to that number and can be stored more compactly than the number itself, then yes, you have compressed the file.
Unfortunately, for most files, you'll never find such an expression.
Simple logic (see the link posted by #OliCharlesworth) should convince you that it's impossible to find such an expression for all or even most files. Even for files which might have a suitable expression, finding it will be very, very difficult. If you want to convince yourself of this, try this challenge:
Take the following ASCII string:
"Holy Kolmogorov complexity, Batman! Compress this sucker down good and you'll get a pretty penny, my fine lad!"
Interpreted as a binary number, with the high-order digits coming first, that is: 2280899635869589768629811602006623364651019118009864206881173103187172975244099647369151382436996220022807793898568915685059542016541775658916080587423284053601554008368389985872997499032440860090224967472423163775276043175694884234152335588829534778866153948275745.
Try to find a polynomial which evaluates to that number. All the numbers used must be integral, and the total number of decimal digits appearing in the polynomial must be less than 80. If you succeed, I will send you a small cash prize by PayPal.
Yes, by definition. You have correctly defined compression as representing something larger with something smaller.
How do you propose to do this? How often will that work? There's the rub.

Patterns in a key generation algorithm

I want to reverse-engineer a key generation algorithm which starts from a 4-byte ID, and the output is a 4-byte key. This seems to not be impossible or very difficult, because some patterns can be observed. In the following picture are the inputs and outputs of the algorithm for 8 situations:
As it can be seen, if the bytes from inputs are matching, also the outputs are matching, but with some exceptions (the red marking in the image).
So I think there are some simple arithmetic/binary operations done, and the mismatch could come from a carry of an addition operation.
Until now I ran a C program with some simple operations on the least significant byte of the inputs, with up to 4 variable parameters (0..255, all combinations) and compared with the output LSB, but without success.
Could you please advise me, what else could I try? And what do you think, it's possible what I'm trying to do?
Thank you very much!

Improvement to Algorithm

I was asked this question in an interview.
If you had two numbers represented in the binary form and stored as a string. How would you perform simple addition. This was the easy part. (my solution: run through the shortest one and keep track of carry, repeat for the remaining)
The difficult part was when he asked me:
how would you use hardware to make the process faster.
Any suggestion SO community?
I'd say, convert them to proper integers, and use the hardware (ALU) to perform the addition, then convert the result back to a string if needed.
Converting the numbers to an integer variable and letting the CPU do the addition immediately springs to mind. You can then divide the number back into bits if you so choose to.

Best algorithm for hashing number values?

When dealing with a series of numbers, and wanting to use hash results for security reasons, what would be the best way to generate a hash value from a given series of digits? Examples of input would be credit card numbers, or bank account numbers. Preferred output would be a single unsigned integer to assist in matching purposes.
My feeling is that most of the string implementations appear to have low entropy when run against such a short range of characters and because of that, the collision rate might be higher than when run against a larger sample.
The target language is Delphi, however answers from other languages are welcome if they can provide a mathmatical basis which can lead to an optimal solution.
The purpose of this routine will be to determine if a previously received card/account was previously processed or not. The input file could have multiple records against a database of multiple records so performance is a factor.
With security questions all the answers lay on a continuum from most secure to most convenient. I'll give you two answers, one that is very secure, and one that is very convenient. Given that and the explanation of each you can choose the best solution for your system.
You stated that your objective was to store this value in lieu of the actual credit card so you could later know if the same credit card number is used again. This means that it must contain only the credit card number and maybe a uniform salt. Inclusion of the CCV, expiration date, name, etc. would render it useless since it the value could be different with the same credit card number. So we will assume you pad all of your credit card numbers with the same salt value that will remain uniform for all entries.
The convenient solution is to use a FNV (As Zebrabox and Nick suggested). This will produce a 32 bit number that will index quickly for searches. The downside of course is that it only allows for at max 4 billion different numbers, and in practice will produce collisions much quicker then that. Because it has such a high collision rate a brute force attack will probably generate enough invalid results as to make it of little use.
The secure solution is to rely on SHA hash function (the larger the better), but with multiple iterations. I would suggest somewhere on the order of 10,000. Yes I know, 10,000 iterations is a lot and it will take a while, but when it comes to strength against a brute force attack speed is the enemy. If you want to be secure then you want it to be SLOW. SHA is designed to not have collisions for any size of input. If a collision is found then the hash is considered no longer viable. AFAIK the SHA-2 family is still viable.
Now if you want a solution that is secure and quick to search in the DB, then I would suggest using the secure solution (SHA-2 x 10K) and then storing the full hash in one column, and then take the first 32 bits and storing it in a different column, with the index on the second column. Perform your look-up on the 32 bit value first. If that produces no matches then you have no matches. If it does produce a match then you can compare the full SHA value and see if it is the same. That means you are performing the full binary comparison (hashes are actually binary, but only represented as strings for easy human reading and for transfer in text based protocols) on a much smaller set.
If you are really concerned about speed then you can reduce the number of iterations. Frankly it will still be fast even with 1000 iterations. You will want to make some realistic judgment calls on how big you expect the database to get and other factors (communication speed, hardware response, load, etc.) that may effect the duration. You may find that your optimizing the fastest point in the process, which will have little to no actual impact.
Also, I would recommend that you benchmark the look-up on the full hash vs. the 32 bit subset. Most modern database system are fairly fast and contain a number of optimizations and frequently optimize for us doing things the easy way. When we try to get smart we sometimes just slow it down. What is that quote about premature optimization . . . ?
This seems to be a case for key derivation functions. Have a look at PBKDF2.
Just using cryptographic hash functions (like the SHA family) will give you the desired distribution, but for very limited input spaces (like credit card numbers) they can be easily attacked using brute force because this hash algorithms are usually designed to be as fast as possible.
UPDATE
Okay, security is no concern for your task. Because you have already a numerical input, you could just use this (account) number modulo your hash table size. If you process it as string, you might indeed encounter a bad distribution, because the ten digits form only a small subset of all possible characters.
Another problem is probably that the numbers form big clusters of assigned (account) numbers with large regions of unassigned numbers between them. In this case I would suggest to try highly non-linear hash function to spread this clusters. And this brings us back to cryptographic hash functions. Maybe good old MD5. Just split the 128 bit hash in four groups of 32 bits, combine them using XOR, and interpret the result as a 32 bit integer.
While not directly related, you may also have a look at Benford's law - it provides some insight why numbers are usually not evenly distributed.
If you need security, use a cryptographically secure hash, such as SHA-256.
I needed to look deeply into hash functions a few months ago. Here are some things I found.
You want the hash to spread out hits evenly and randomly throughout your entire target space (usually 32 bits, but could be 16 or 64-bits.) You want every character of the input to have and equally large effect on the output.
ALL the simple hashes (like ELF or PJW) that simply loop through the string and xor in each byte with a shift or a mod will fail that criteria for a simple reason: The last characters added have the most effect.
But there are some really good algorithms available in Delphi and asm. Here are some references:
See 1997 Dr. Dobbs article at burtleburtle.net/bob/hash/doobs.html
code at burtleburtle.net/bob/c/lookup3.c
SuperFastHash Function c2004-2008 by Paul Hsieh (AKA HsiehHash)
www.azillionmonkeys.com/qed/hash.html
You will find Delphi (with optional asm) source code at this reference:
http://landman-code.blogspot.com/2008/06/superfasthash-from-paul-hsieh.html
13 July 2008
"More than a year ago Juhani Suhonen asked for a fast hash to use for his
hashtable. I suggested the old but nicely performing elf-hash, but also noted
a much better hash function I recently found. It was called SuperFastHash (SFH)
and was created by Paul Hsieh to overcome his 'problems' with the hash functions
from Bob Jenkins. Juhani asked if somebody could write the SFH function in basm.
A few people worked on a basm implementation and posted it."
The Hashing Saga Continues:
2007-03-13 Andrew: When Bad Hashing Means Good Caching
www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html
2007-03-29 Andrew: Breaking SuperFastHash
floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/
2008-03-03 Austin Appleby: MurmurHash 2.0
murmurhash.googlepages.com/
SuperFastHash - 985.335173 mb/sec
lookup3 - 988.080652 mb/sec
MurmurHash 2.0 - 2056.885653 mb/sec
Supplies c++ code MurmurrHash2.cpp and aligned-read-only implementation -
MurmurHashAligned2.cpp
//========================================================================
// Here is Landman's MurmurHash2 in C#
//2009-02-25 Davy Landman does C# implimentations of SuperFashHash and MurmurHash2
//landman-code.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00%2B01%3A00&updated-max=2010-01-01T00%3A00%3A00%2B01%3A00&max-results=2
//
//Landman impliments both SuperFastHash and MurmurHash2 4 ways in C#:
//1: Managed Code 2: Inline Bit Converter 3: Int Hack 4: Unsafe Pointers
//SuperFastHash 1: 281 2: 780 3: 1204 4: 1308 MB/s
//MurmurHash2 1: 486 2: 759 3: 1430 4: 2196
Sorry if the above turns out to look like a mess. I had to just cut&paste it.
At least one of the references above gives you the option of getting out a 64-bit hash, which would certainly have no collisions in the space of credit card numbers, and could be easily stored in a bigint field in MySQL.
You do not need a cryptographic hash. They are much more CPU intensive. And the purpose of "cryptographic" is to stop hacking, not to avoid collisions.
If performance is a factor I suggest to take a look at a CodeCentral entry of Peter Below. It performs very well for large number of items.
By default it uses P.J. Weinberger ELF hashing function. But others are also provided.
By definition, a cryptographic hash will work perfectly for your use case. Even if the characters are close, the hash should be nicely distributed.
So I advise you to use any cryptographic hash (SHA-256 for example), with a salt.
For a non cryptographic approach you could take a look at the FNV hash it's fast with a low collision rate.
As a very fast alternative, I've also used this algorithm for a few years and had few collision issues however I can't give you a mathematical analysis of it's inherent soundness but for what it's worth here it is
=Edit - My code sample was incorrect - now fixed =
In c/c++
unsigned int Hash(const char *s)
{
int hash = 0;
while (*s != 0)
{
hash *= 37;
hash += *s;
s++;
}
return hash;
}
Note that '37' is a magic number, so chosen because it's prime
Best hash function for the natural numbers let
f(n)=n
No conflicts ;)

How to find "equivalent" texts?

I want to find (not generate) 2 text strings such that, after removing all non letters and ucasing, one string can be translated to the other by simple substitution.
The motivation for this comes from a project I known of that is testing methods for attacking cyphers via probability distributions. I'd like to find a large, coherent plain text that, once encrypted with a simple substitution cypher, can be decrypted to something else that is also coherent.
This ends up as 2 parts, find the longest such strings in a corpus, and get that corpus.
The first part seems to me to be amiable to some sort of attack with a B-tree keyed off the string after a substitution that makes the sequence of first occurrences sequential.
HELLOWORLDTHISISIT
1233454637819a9b98
A little optimization based on knowing the maximum value and length of the string based on each depth of the tree and the rest is just coding.
The Other part would be quite a bit more involved; how to generate a large corpus of text to search? some kind of internet spider would seem to be the ideal approach as it would have access to the largest amount of text but how to strip it to just the text?
The question is; Any ideas on how to do this better?
Edit: the cipher that was being used is an insanely basic 26 letter substitution cipher.
p.s. this is more a thought experiment then a probable real project for me.
There are 26! different substitution ciphers. That works out to a bit over 88 bits of choice:
>>> math.log(factorial(26), 2)
88.381953327016262
The entropy of English text is something like 2 bits per character at least. So it seems to me you can't reasonably expect to find passages of more than 45-50 characters that are accidentally equivalent under substitution.
For the large corpus, there's the Gutenberg Project and Wikipedia, for a start. You can download an dump of all the English Wikipedia's XML files from their website.
I think you're asking a bit much to generate a substitution that is also "coherent". That is an AI problem for the encryption algorithm to figure out what text is coherent. Also, the longer your text is the more complicated it will be to create a "coherent" result... quickly approaching a point where you need a "key" as long as the text you are encrypting. Thus defeating the purpose of encrypting it at all.

Resources