Checksum algorithm reverse engineering - algorithm

I've been analyzing some SPI EEPROM memory and trying to find out which checksum algorithm has been used.
For example, I've got this data:
14567D9h with checksum 187h. Assuming it's a normal 16-bit checksum I get 86h - no match, but after adding 101h it magically becomes 187h - a match.
Another example:
8ADh with checksum B5h. This one is normal - the 16-bit checksum gives exactly B5h (perfect match).
I've checked this with 28 samples I was able to intercept. For some values I have to add 101h to the sum to get the checksum, and for others the plain sum already matches.
A parity check doesn't fit - if you want, I can share some more data, all gathered and calculated in one Excel file. After a few days of brainstorming with my friend we haven't come up with anything :/
Maybe there is some additional part of the algorithm which I haven't found out yet?
CRC and tons of other algorithms were checked - only the 16-bit checksum gave any promising results.
Thanks for your help in advance!
copy of my spreadsheet: https://drive.google.com/file/d/0B2FO0-Y1n-ySMUZ2VTVkME9tdm8/view?usp=sharing
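
Not an answer, but to make the hypothesis concrete, here is a minimal C sketch of the 16-bit byte-sum check being described, including the 0x101 offset mentioned above. The byte split of the example value and the function name are my assumptions, not something confirmed by the data.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Sketch of the working hypothesis only, not a confirmed algorithm:
 * a plain 16-bit sum of the record's data bytes, which for some records
 * apparently has to be offset by 0x101 to match the stored checksum. */
static uint16_t sum16(const uint8_t *data, size_t len)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

int main(void)
{
    /* assumes the second example value 8ADh is stored as the bytes 08 AD */
    const uint8_t record[] = { 0x08, 0xAD };
    unsigned s = sum16(record, sizeof record);

    printf("sum        = %03Xh\n", s);                      /* B5h, matches the stored checksum */
    printf("sum + 101h = %03Xh\n", (s + 0x101u) & 0xFFFFu); /* the offset seen on other records */
    return 0;
}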

From what I know, CRC is used for files to help identify file corruption.
The size of the CRC is fixed while the size of the file is not, and the file's size is much bigger.
In other words, the CRC is not reversible, simply because it is a many-to-one relation.

Related

Flash ECC algorithm on STM32L1xx

How does the flash ECC algorithm (Flash Error Correction Code) implemented on STM32L1xx work?
Background:
I want to do multiple incremental writes to a single word in program flash of a STM32L151 MCU without doing a page erase in between. Without ECC, one could set bits incrementally, e.g. first 0x00, then 0x01, then 0x03 (STM32L1 erases bits to 0 rather than to 1), etc. As the STM32L1 has 8 bits of ECC per word, this method doesn't work. However, if we knew the ECC algorithm, we could easily find a short sequence of values that could be written incrementally without violating the ECC.
We could simply try different sequences of values and see which ones work (one such sequence is 0x00000001, 0x00000101, 0x00030101, 0x03030101), but if we don't know the ECC algorithm, we can't check whether the sequence violates the ECC, in which case error correction wouldn't work if bits were corrupted.
[Edit] The functionality should be used to implement a simple file system using the STM32L1's internal program memory. Chunks of data are tagged with a header, which contains a state. Multiple chunks can reside on a single page. The state can change over time (first 'new', then 'used', then 'deleted', etc.). The number of states is small, but it would make things significantly easier if we could overwrite a previous state without having to erase the whole page first.
Thanks for any comments! As there are no answers so far, I'll summarize what I have found out (empirically and based on comments to this answer):
According to the STM32L1 datasheet "The whole non-volatile memory embeds the error correction code (ECC) feature.", but the reference manual doesn't state anything about ECC in program memory.
The datasheet is in line with what we can find out empirically when successively writing multiple words to the same program memory location without erasing the page in between. In such cases some sequences of values work while others don't.
The following are my personal conclusions, based on empirical findings, limited research and comments from this thread. It's not based on official documentation. Don't build any serious work on it (I won't either)!
It seems that the ECC is calculated and persisted per 32-bit word. If so, the ECC must be at least 7 bits long.
The ECC of each word is probably written to the same nonvolatile mem as the word itself. Therefore the same limitations apply. I.e. between erases, only additional bits can be set. As stark pointed out, we can only overwrite words in program mem with values that:
Only set additional bits but don't clear any bits
Have an ECC that also only sets additional bits compared to the previous ECC.
If we write a value that only sets additional bits, but the ECC would need to clear bits (and therefore cannot be written correctly), then:
If the ECC is wrong by one bit, the error is corrected by the ECC algorithm and the written value can be read correctly. However, ECC wouldn't work anymore if another bit failed, because ECC can only correct single-bit errors.
If the ECC is wrong by more than one bit, the ECC algorithm cannot correct the error and the read value will be wrong.
We cannot (easily) find out empirically which sequences of values can be written correctly and which can't. If a sequence of values can be written and read back correctly, we wouldn't know whether this is due to the automatic correction of single-bit errors. This aspect is the whole reason this question asks for the actual algorithm.
The ECC algorithm itself seems to be undocumented. Hamming code is a commonly used algorithm for ECC, and in AN4750 they write that Hamming code is actually used for error correction in SRAM. This algorithm may or may not be used for the STM32L1's program memory.
The STM32L1 reference manual doesn't seem to explicitly forbid multiple writes to program memory without an erase, but there is no documentation stating the opposite either. In order not to rely on undocumented functionality, we will refrain from using it in our products and find workarounds.
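
Since the algorithm is undocumented, the following is only a sketch for comparison, not ST's actual code: the textbook extended-Hamming (SECDED) construction for a 32-bit word needs 6 parity bits plus one overall parity bit, so it would fit the 8-bit ECC field mentioned above. The function name and bit layout are my own choices.

#include <stdint.h>
#include <stdio.h>

/* Textbook (39,32) extended-Hamming / SECDED sketch -- NOT the documented
 * STM32L1 algorithm, just an illustration of how a 7- to 8-bit ECC per
 * 32-bit word could plausibly be built.
 * Data bits are scattered over the non-power-of-two positions 3,5,6,7,9,...
 * of a 1-indexed codeword; parity bit i covers every position whose index
 * has bit i set; a final bit holds the overall parity (double-error detect). */
static uint8_t hamming_secded32(uint32_t data)
{
    uint64_t codeword = 0;
    int pos = 1, bit = 0;

    /* place the 32 data bits at the non-power-of-two positions */
    while (bit < 32) {
        if (pos & (pos - 1)) {                 /* pos is not a power of two */
            if ((data >> bit) & 1u)
                codeword |= 1ull << pos;
            bit++;
        }
        pos++;
    }

    /* six Hamming parity bits */
    uint8_t ecc = 0;
    for (int i = 0; i < 6; i++) {
        unsigned parity = 0;
        for (int p = 1; p < pos; p++)
            if ((p >> i) & 1)
                parity ^= (unsigned)((codeword >> p) & 1u);
        ecc |= (uint8_t)(parity << i);
    }

    /* overall parity over data and parity bits */
    unsigned overall = 0;
    for (uint32_t d = data; d; d &= d - 1) overall ^= 1u;
    for (uint8_t e = ecc; e; e = (uint8_t)(e & (e - 1))) overall ^= 1u;
    ecc |= (uint8_t)(overall << 6);

    return ecc;                                /* 7 significant bits */
}

int main(void)
{
    /* a value can only be overwritten in place if both the value and its
     * ECC change in the "programmable" direction (see the discussion above) */
    printf("ECC(0x00000001) = 0x%02X\n", (unsigned)hamming_secded32(0x00000001u));
    printf("ECC(0x00000101) = 0x%02X\n", (unsigned)hamming_secded32(0x00000101u));
    printf("ECC(0x00030101) = 0x%02X\n", (unsigned)hamming_secded32(0x00030101u));
    return 0;
}

If the real algorithm were of this general form, running the candidate write sequence from the question through such a function would show which steps require ECC bits to be cleared; since it is only an assumption, it can't replace the empirical tests described below.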
Interesting question.
First I have to say that even if you find out the ECC algorithm, you can't rely on it, as it's not documented and can be changed at any time without notice.
But finding out the algorithm seems possible with a reasonable number of tests.
I would try to build tests which start with a constant value and then clear only one bit.
If you read the value back and it is still the start value, your single cleared bit couldn't change all the necessary bits in the ECC (i.e. the ECC correction undid it).
Like:
for <bitIdx> = 0 to 31
    erase cell
    write start value, e.g. 0xFFFFFFFF
    clear bit <bitIdx> in the cell (i.e. write startValue & ~(1 << bitIdx))
    read the cell back and compare with the expected value
next
If you find a start value for which this test works for all bits, then that start value probably has an ECC with all bits set.
Edit: This should be true for any ECC, as every ECC always needs a difference of at least two bits to reliably detect and repair one defective bit.
As the first bit difference is in the value itself, the second change needs to be in the hidden ECC bits, and the number of hidden bits is very limited.
If you repeat this test with different start values, you should be able to gather enough data to prove which error correction is used.

Algorithm for a one-character checksum

I am desperately searching for an algorithm to create a checksum that is at most two characters long and can detect transposed characters in the input sequence. When testing different algorithms, such as Luhn, CRC24 or CRC32, the checksums were always longer than two characters. If I reduce the checksum to two or even one character, then not all transpositions are detected anymore.
Do any of you know an algorithm that meets my needs? Even just a name that lets me continue my search would help. I would be very grateful for your help.
Given that your data is alphanumeric, you want to detect all the permutations (in the perfect case), and you can afford to use a binary checksum (i.e. the full 16 bits), my guess is that you should probably go with CRC-16 (as already suggested by @Paul Hankin in the comments), as it is more information-dense than check-digit algorithms like Luhn or Damm, and more "generic" when it comes to the possible types of errors.
Maybe something like CRC-CCITT (CRC-16-CCITT); give it a try and see how it works for you.
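
For reference, here is a minimal bitwise C sketch of CRC-16-CCITT using the common "CCITT-FALSE" parameters (polynomial 0x1021, initial value 0xFFFF, no reflection, no final XOR); other CCITT variants differ only in the initial value and bit order, so check which parameter set your online calculator uses.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Bit-by-bit CRC-16-CCITT ("CCITT-FALSE"): poly 0x1021, init 0xFFFF. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);   /* next byte into the top bits */
        for (int b = 0; b < 8; b++)                  /* shift out one bit at a time */
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    const char *msg = "123456789";          /* standard check string; expect 0x29B1 */
    printf("CRC-16-CCITT(\"%s\") = 0x%04X\n", msg,
           (unsigned)crc16_ccitt((const uint8_t *)msg, strlen(msg)));
    return 0;
}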

Patterns in a key generation algorithm

I want to reverse-engineer a key generation algorithm which starts from a 4-byte ID and outputs a 4-byte key. This doesn't seem to be impossible or even very difficult, because some patterns can be observed. The following picture shows the inputs and outputs of the algorithm for 8 cases:
As can be seen, when bytes of the inputs match, the corresponding output bytes also match, with some exceptions (the red markings in the image).
So I think some simple arithmetic/bitwise operations are involved, and the mismatches could come from the carry of an addition.
So far I have run a C program trying some simple operations on the least significant byte of the inputs, with up to 4 variable parameters (0..255, all combinations), and compared the results with the output LSB, but without success.
Could you please advise me what else I could try? And do you think what I'm trying to do is possible?
Thank you very much!
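
Not a full answer, but one way to continue is to brute-force a small parameterised family of byte operations against the captured ID/key pairs. The sketch below is purely hypothetical: the sample arrays are placeholders that would have to be filled in from the table in the picture, and the candidate form ((a*id + b) ^ c) & 0xFF is just one example family; the more sample pairs you add, the fewer (a, b, c) candidates survive.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholder samples: replace with the least significant bytes of the
 * real IDs and keys from the captured table. */
#define NUM_SAMPLES 2
static const uint8_t id_lsb[NUM_SAMPLES]  = { 0x00, 0x00 };
static const uint8_t key_lsb[NUM_SAMPLES] = { 0x00, 0x00 };

int main(void)
{
    /* enumerate one example family of "simple operations" on the LSB */
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            for (int c = 0; c < 256; c++) {
                int ok = 1;
                for (size_t i = 0; i < NUM_SAMPLES && ok; i++)
                    ok = (((a * id_lsb[i] + b) ^ c) & 0xFF) == key_lsb[i];
                if (ok)
                    printf("candidate: a=%d b=%d c=%d\n", a, b, c);
            }
    return 0;
}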

What methods can I use to analyse and guess 4-bit checksum algorithm?

[Background Story]
I am working with a 5-year-old user identification system, and I am trying to add IDs to the database. The problem I have is that the system that reads the ID numbers requires some sort of checksum, and no one working here now has ever worked with it, so no one knows how it works.
I have access to the list of existing IDs, which already have correct checksums. Also, as the checksum only has 16 possible values, I can create any ID I want and run it through the authentication system up to 16 times until I get the correct checksum (but this is quite time-consuming).
[Question]
What methods can I use to help guess the checksum algorithm used for some data?
I have tried a few simple methods such as XORing and summing, but these have not worked.
So my question is: if I have data (in hexadecimal) like this:
data checksum
00029921 1
00013481 B
00026001 3
00004541 8
What methods can I use to work out what sort of checksum is used?
i.e. should I try sequential numbers such as 00029921, 00029922, 00029923, ... or 00029911, 00029921, 00029931, ...? If I do this, what patterns should I look for in the changing checksum?
Similarly, would comparing swapped digits tell me anything useful about the checksum?
i.e. 00013481 and 00031481
Is there anything else that could tell me something useful? What about inverting one bit, or maybe one hex digit?
I am assuming that this will be a common checksum algorithm, but I don't know where to start in testing it.
I have read the following links, but I am not sure if I can apply any of this to my case, as I don't think mine is a CRC.
stackoverflow.com/questions/149617/how-could-i-guess-a-checksum-algorithm
stackoverflow.com/questions/2896753/find-the-algorithm-that-generates-the-checksum
cosc.canterbury.ac.nz/greg.ewing/essays/CRC-Reverse-Engineering.html
[ANSWER]
I have now downloaded a much larger list of data, and it turned out to be simpler than I was expecting, but for completeness, here is what I did.
data:
00024901 A
00024911 B
00024921 C
00024931 D
00042811 A
00042871 0
00042881 1
00042891 2
00042901 A
00042921 C
00042961 0
00042971 1
00042981 2
00043021 4
00043031 5
00043041 6
00043051 7
00043061 8
00043071 9
00043081 A
00043101 3
00043111 4
00043121 5
00043141 7
00043151 8
00043161 9
00043171 A
00044291 E
From these, I could see that when just one digit was increased by some amount, the checksum also increased by the same amount, as in:
00024901 A
00024911 B
Also, swapping two digits did not change the checksum:
00024901 A
00042901 A
This means that the weight (for these two positions at least) must be the same.
Finally, the checksum for 00000000 was A, so I calculated the sum of the digits plus 0xA, mod 16:
checksum = ((Σ xi) + 0xA) mod 16
And this matched for all the values I had. Just to check that there was nothing sneaky going on with the first 3 digits that never changed in my data, I made up and tested some numbers as Eric suggested, and those all worked with this too!
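
For completeness, the derived rule (sum of the hex digits plus 0xA, taken mod 16) can be checked in a few lines of C against the samples above; the helper name is mine.

#include <stdio.h>

/* Checksum rule derived above: (sum of hex digits + 0xA) mod 16. */
static unsigned checksum4(const char *id)
{
    unsigned sum = 0xA;                            /* checksum of 00000000 is A */
    for (; *id; id++)
        sum += (*id <= '9') ? (unsigned)(*id - '0')
                            : (unsigned)(*id - 'A' + 10);
    return sum & 0xF;                              /* mod 16 */
}

int main(void)
{
    /* a few of the samples listed above: ID and its expected checksum */
    const char *ids[]         = { "00029921", "00013481", "00024901", "00043071" };
    const unsigned expected[] = { 0x1, 0xB, 0xA, 0x9 };

    for (int i = 0; i < 4; i++)
        printf("%s -> %X (expected %X)\n", ids[i], checksum4(ids[i]), expected[i]);
    return 0;
}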
Many checksums I've seen use simple weighted values based on the position of the digits. For example, if the weights are 3, 5, 7, the checksum might be 3*c[0] + 5*c[1] + 7*c[2], then mod 10 for the result (in your case mod 16, since you have a 4-bit checksum); a short sketch of this idea follows at the end of this answer.
To check if this might be the case, I suggest that you feed some simple values into your system to get an answer:
1000000 = ?
0100000 = ?
0010000 = ?
... etc. If there are simple weights based on position, this may reveal it. Even if the algorithm is something different, feeding in nice, simple values and looking for patterns may be enlightening. As Matti suggested, you/we will likely need to see more samples before decoding the pattern.
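
For illustration, here is a minimal sketch of such a positionally weighted digit checksum, using the example weights 3, 5, 7 and mod 10 from above; the function name and test input are mine, and a real system may use any weights and modulus (mod 16 in the asker's case).

#include <stdio.h>

/* Generic positionally weighted digit checksum: sum(weight[i] * digit[i]) mod m. */
static int weighted_checksum(const char *digits, const int *weights, int n, int mod)
{
    int sum = 0;
    for (int i = 0; i < n && digits[i]; i++)
        sum += weights[i] * (digits[i] - '0');
    return sum % mod;
}

int main(void)
{
    const int weights[] = { 3, 5, 7 };
    /* 3*1 + 5*2 + 7*3 = 34, so the checksum is 34 mod 10 = 4 */
    printf("checksum(\"123\") = %d\n", weighted_checksum("123", weights, 3, 10));
    return 0;
}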

Best algorithm for hashing number values?

When dealing with a series of numbers, and wanting to use hash results for security reasons, what would be the best way to generate a hash value from a given series of digits? Examples of input would be credit card numbers or bank account numbers. The preferred output would be a single unsigned integer to assist in matching.
My feeling is that most of the string implementations appear to have low entropy when run against such a short range of characters, and because of that the collision rate might be higher than when run against a larger sample.
The target language is Delphi; however, answers from other languages are welcome if they can provide a mathematical basis which can lead to an optimal solution.
The purpose of this routine will be to determine whether a previously received card/account was already processed or not. The input file could have multiple records matched against a database of multiple records, so performance is a factor.
With security questions all the answers lie on a continuum from most secure to most convenient. I'll give you two answers, one that is very secure and one that is very convenient. Given that, and the explanation of each, you can choose the best solution for your system.
You stated that your objective was to store this value in lieu of the actual credit card so you could later know if the same credit card number is used again. This means that it must contain only the credit card number and maybe a uniform salt. Inclusion of the CVV, expiration date, name, etc. would render it useless, since the value could be different for the same credit card number. So we will assume you pad all of your credit card numbers with the same salt value that will remain uniform for all entries.
The convenient solution is to use an FNV hash (as Zebrabox and Nick suggested). This will produce a 32-bit number that will index quickly for searches. The downside of course is that it only allows for at most 4 billion different values, and in practice will produce collisions much sooner than that. Because it has such a high collision rate, a brute force attack will probably generate enough invalid results to make it of little use.
The secure solution is to rely on a SHA hash function (the larger the better), but with multiple iterations. I would suggest somewhere on the order of 10,000. Yes, I know, 10,000 iterations is a lot and it will take a while, but when it comes to strength against a brute force attack, speed is the enemy. If you want to be secure then you want it to be SLOW. SHA is designed not to have collisions for any size of input. If a collision is found then the hash is considered no longer viable. AFAIK the SHA-2 family is still viable.
Now if you want a solution that is secure and quick to search in the DB, then I would suggest using the secure solution (SHA-2 x 10K) and then storing the full hash in one column, and then take the first 32 bits and storing it in a different column, with the index on the second column. Perform your look-up on the 32 bit value first. If that produces no matches then you have no matches. If it does produce a match then you can compare the full SHA value and see if it is the same. That means you are performing the full binary comparison (hashes are actually binary, but only represented as strings for easy human reading and for transfer in text based protocols) on a much smaller set.
If you are really concerned about speed then you can reduce the number of iterations. Frankly it will still be fast even with 1,000 iterations. You will want to make some realistic judgment calls on how big you expect the database to get and other factors (communication speed, hardware response, load, etc.) that may affect the duration. You may find that you're optimizing the fastest point in the process, which will have little to no actual impact.
Also, I would recommend that you benchmark the look-up on the full hash vs. the 32-bit subset. Most modern database systems are fairly fast, contain a number of optimizations, and frequently optimize for the case where we do things the easy way. When we try to get smart we sometimes just slow things down. What is that quote about premature optimization...?
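
To make the "secure but still searchable" scheme concrete, here is a C sketch using OpenSSL's SHA256() (the question targets Delphi, but C keeps it consistent with the other snippets in this thread). The salt string, test card number and iteration count are placeholders; the 32-bit prefix is what you would store in the indexed column alongside the full digest.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>                    /* link with -lcrypto */

/* Sketch of the scheme described above: SHA-256 over (card number + uniform
 * salt), re-hashed many times, with the first 32 bits of the final digest
 * used as the indexed lookup key.  Salt and iteration count are placeholders. */
#define ITERATIONS 10000

static void iterated_sha256(const char *card, const char *salt,
                            unsigned char digest[SHA256_DIGEST_LENGTH])
{
    char buf[256];
    unsigned char tmp[SHA256_DIGEST_LENGTH];
    int len = snprintf(buf, sizeof buf, "%s%s", card, salt);

    SHA256((const unsigned char *)buf, (size_t)len, digest);   /* first pass */
    for (int i = 1; i < ITERATIONS; i++) {                     /* re-hash    */
        memcpy(tmp, digest, sizeof tmp);
        SHA256(tmp, sizeof tmp, digest);
    }
}

int main(void)
{
    unsigned char digest[SHA256_DIGEST_LENGTH];
    uint32_t prefix;

    /* 4111111111111111 is a well-known test card number, not real data */
    iterated_sha256("4111111111111111", "uniform-salt-placeholder", digest);
    memcpy(&prefix, digest, sizeof prefix);      /* first 32 bits -> indexed column */
    printf("32-bit lookup key: 0x%08X\n", (unsigned)prefix);
    /* store the full digest in a second column and compare it on prefix hits */
    return 0;
}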
This seems to be a case for key derivation functions. Have a look at PBKDF2.
Just using cryptographic hash functions (like the SHA family) will give you the desired distribution, but for very limited input spaces (like credit card numbers) they can be easily attacked by brute force, because these hash algorithms are usually designed to be as fast as possible.
UPDATE
Okay, security is no concern for your task. Because you already have a numerical input, you could just use this (account) number modulo your hash table size. If you process it as a string, you might indeed encounter a bad distribution, because the ten digits form only a small subset of all possible characters.
Another problem is probably that the numbers form big clusters of assigned (account) numbers with large regions of unassigned numbers between them. In this case I would suggest trying a highly non-linear hash function to spread out these clusters. And this brings us back to cryptographic hash functions. Maybe good old MD5: just split the 128-bit hash into four groups of 32 bits, combine them using XOR, and interpret the result as a 32-bit integer.
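
A minimal sketch of that MD5 fold, using OpenSSL's MD5(); the input value is just a placeholder.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>                 /* link with -lcrypto */

/* Fold a 128-bit MD5 digest into a 32-bit hash by XORing its four 32-bit
 * groups, as described above.  MD5 is used purely as a mixing function
 * here, not for security. */
static uint32_t md5_folded(const char *number)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    uint32_t words[4];

    MD5((const unsigned char *)number, strlen(number), digest);
    memcpy(words, digest, sizeof words);    /* 16 bytes -> four 32-bit groups */
    return words[0] ^ words[1] ^ words[2] ^ words[3];
}

int main(void)
{
    printf("hash = 0x%08X\n", (unsigned)md5_folded("1234567890123456"));
    return 0;
}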
While not directly related, you may also have a look at Benford's law - it provides some insight why numbers are usually not evenly distributed.
If you need security, use a cryptographically secure hash, such as SHA-256.
I needed to look deeply into hash functions a few months ago. Here are some things I found.
You want the hash to spread out hits evenly and randomly throughout your entire target space (usually 32 bits, but it could be 16 or 64 bits). You want every character of the input to have an equally large effect on the output.
ALL the simple hashes (like ELF or PJW) that simply loop through the string and XOR in each byte with a shift or a mod will fail that criterion for a simple reason: the characters added last have the most effect.
But there are some really good algorithms available in Delphi and asm. Here are some references:
See 1997 Dr. Dobbs article at burtleburtle.net/bob/hash/doobs.html
code at burtleburtle.net/bob/c/lookup3.c
SuperFastHash Function c2004-2008 by Paul Hsieh (AKA HsiehHash)
www.azillionmonkeys.com/qed/hash.html
You will find Delphi (with optional asm) source code at this reference:
http://landman-code.blogspot.com/2008/06/superfasthash-from-paul-hsieh.html
13 July 2008
"More than a year ago Juhani Suhonen asked for a fast hash to use for his
hashtable. I suggested the old but nicely performing elf-hash, but also noted
a much better hash function I recently found. It was called SuperFastHash (SFH)
and was created by Paul Hsieh to overcome his 'problems' with the hash functions
from Bob Jenkins. Juhani asked if somebody could write the SFH function in basm.
A few people worked on a basm implementation and posted it."
The Hashing Saga Continues:
2007-03-13 Andrew: When Bad Hashing Means Good Caching
www.team5150.com/~andrew/blog/2007/03/hash_algorithm_attacks.html
2007-03-29 Andrew: Breaking SuperFastHash
floodyberry.wordpress.com/2007/03/29/breaking-superfasthash/
2008-03-03 Austin Appleby: MurmurHash 2.0
murmurhash.googlepages.com/
SuperFastHash - 985.335173 mb/sec
lookup3 - 988.080652 mb/sec
MurmurHash 2.0 - 2056.885653 mb/sec
Supplies C++ code MurmurHash2.cpp and an aligned-read-only implementation, MurmurHashAligned2.cpp
Here is Landman's MurmurHash2 in C#:
2009-02-25 Davy Landman does C# implementations of SuperFastHash and MurmurHash2
landman-code.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00%2B01%3A00&updated-max=2010-01-01T00%3A00%3A00%2B01%3A00&max-results=2
Landman implements both SuperFastHash and MurmurHash2 four ways in C#:
1: Managed Code  2: Inline Bit Converter  3: Int Hack  4: Unsafe Pointers
SuperFastHash  1: 281 MB/s  2: 780 MB/s  3: 1204 MB/s  4: 1308 MB/s
MurmurHash2    1: 486 MB/s  2: 759 MB/s  3: 1430 MB/s  4: 2196 MB/s
Sorry if the above turns out to look like a mess. I had to just cut&paste it.
At least one of the references above gives you the option of getting out a 64-bit hash, which would certainly have no collisions in the space of credit card numbers, and could be easily stored in a bigint field in MySQL.
You do not need a cryptographic hash. They are much more CPU intensive. And the purpose of "cryptographic" is to stop hacking, not to avoid collisions.
If performance is a factor, I suggest taking a look at a CodeCentral entry by Peter Below. It performs very well for a large number of items.
By default it uses P.J. Weinberger ELF hashing function. But others are also provided.
By definition, a cryptographic hash will work perfectly for your use case. Even if the characters are close, the hash should be nicely distributed.
So I advise you to use any cryptographic hash (SHA-256 for example), with a salt.
For a non-cryptographic approach, you could take a look at the FNV hash; it's fast with a low collision rate.
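
For reference, here is a minimal 32-bit FNV-1a sketch with the standard offset basis and prime; the test input is just a placeholder.

#include <stdint.h>
#include <stdio.h>

/* 32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime
 * (offset basis 2166136261, prime 16777619). */
static uint32_t fnv1a_32(const char *s)
{
    uint32_t hash = 2166136261u;
    for (; *s; s++) {
        hash ^= (uint8_t)*s;
        hash *= 16777619u;
    }
    return hash;
}

int main(void)
{
    printf("FNV-1a(\"1234567890123456\") = 0x%08X\n",
           (unsigned)fnv1a_32("1234567890123456"));
    return 0;
}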
As a very fast alternative, I've also used this algorithm for a few years and had few collision issues; however, I can't give you a mathematical analysis of its inherent soundness, but for what it's worth, here it is.
=Edit - My code sample was incorrect - now fixed =
In c/c++
unsigned int Hash(const char *s)
{
    unsigned int hash = 0;            /* unsigned avoids signed-overflow issues */
    while (*s != 0)
    {
        hash *= 37;                   /* multiply by a small prime...           */
        hash += (unsigned char)*s;    /* ...and mix in the next character       */
        s++;
    }
    return hash;
}
Note that '37' is a magic number, so chosen because it's prime
Best hash function for the natural numbers: let f(n) = n.
No conflicts ;)
