Probability of a collision using 32 bit CRC of a unique 32 byte array

Probability of a collision using 32 bit CRC of a unique 32 byte array - probability

I am trying to figure out if using 32 bit CRC will produce collision on 32 byte array.
BackGround
My system reads some configuration whenever it boots up from an external flash. I store the SHA256 hash of the last know configuration and when ever I read the configuration I calculate the SHA256 hash and compare it. If the two hash are different then the data is different.
I need to take that SHA256 and make it into a 32bit hash for another part of the system (due to some legacy code restrictions).
Questions
Will there be a high number of collision if I compute the 32 bit CRC on the 32 byte hash from SHA256?
I calculate the probability of collision to be 0. Can you let me know if this is correct?
The number of sample K is always 2 in my problem (I think) because I am calculating 32 bit CRC on two 32 bytes byte array (SHA256 byte array).
see calculation here

That's correct, if by "0" you mean that very small number. That small number is the probability that you would get a 32-bit CRC from random data that accidentally matches what you were expecting. It is simply 2-32.

Related

Is there any bit level error detection algorithm that use minimum extra bits?

I have a 32-bit number that is created by encoding some data, I want to be more confident that the data (a max 32-bit number) is not changed when decoding it, so I am going to add some error detection bits.
I need to keep the data as short as possible, so I can only add a few bits for error detection, in some cases just 1 bit.
I'm looking for an algorithm that detects more bit changes and needs fewer extra bits.
I was thinking of calculating a checksum or CRC and just dropping extra bits or maybe xor the result to make it shorter but I'm not sure if the error detection remains good enough.
Thanks in advance for any help.

A 1-bit CRC, with polynomial x+1 would simply be the parity of your 32 message bits. That will detect any one-bit error in the resulting 32 bits. For a 2-bit CRC, you can use x2+1. You can define a CRC of any length. See Koopman's list for good CRC polynomials for CRCs of degree 3 and higher.

How to uniquely represent 99,999 bits as a byte, word, or double word

I have 99,999 bit flags that I need to represent uniquely with 32 bits or less. Any of the bits can be set and I need to know if the set bits differ from a comparable set of bits. I am considering using CRC to store a unique value hash but I am not sure if collisions will be a problem. Ideally, less than 500 of these bits will be set at any given time, but they will not be know ahead of time.
Is there suitable hash or other algorithm to uniquely represent these bits?

NO!
Without some other information about those bit flags to identify that certain combinations are impossible, this cannot be done. If all combinations are possible, then you will need to use 99,999 bits to store your 99,999 bit flags.
Edit:
Based on the background information that this is to reduce network usage and the expectation is that only about 500 of the bits are set, there are techniques that can be used, but none are a simple hash, and none are efficient enough to store in 32 bits. I would start by looking at Arithmetic Coding. This uses a probability distribution of the characters that you want to send (0.5% 1, 99.5% 0) to compress data. By my computations, you can "expect" a compression of about 22 times. But, for signals that are considered rare, you will pay the price by needing to transmit a signal larger than your starting 99,999 bits.

Sorting 100 unique numbers by using 40bytes of memory

I've been asked a good programming problem:
In the input I've got 100 unique numbers from 0-255(1 byte). I can only read one number at a time and only once. I've got 40 bytes of memory which I can use. The goal is to sort all numbers and print them in the output. I know for sure that the uniqueness of the numbers is very important.
Any ideas?

32 bytes give you 256 bits, just enough to maintain a bit map of which of the 256 possible byte values are seen in the input. One additional byte is used to store the input value. Read each value, mark it in the bitmap, then discard. Once you've read all 100 input values, simply write out the value associated with the bits you set in the bit map.
Then ask what you are supposed to do with the other 7 bytes :)

Since your numbers are unique and they are only 1-byte long, they have to be within 0 to 255. Treat your 40 bytes of storage as a long bit vector. As you read each number, set the appropriate bit in this 320-bit bit-vector. When you're done reading the input, turn around and scan through this bit-vector, printing the number corresponding to each set bit.
In response to #JavaNewb, here is some more detail. First, since a byte contains 8 bits, it can assume only one of 256 possible values, namely, 0 through 255. Armed with this little factoid, you look at the 40-byte storage array you have. This array turns out to have 40 bytes * 8 bits/byte = 320 bits. Since the problem states that each of the 100 1-byte numbers are unique, you know that you will see a particular number (which can range from 0 through 255) at most once. Each time you see a number, you set the corresponding bit in the 40-byte array. For instance, if you encounter the number 50, you set bit number 2 in byte number 6. A number N corresponds to bit N%8 in byte N/8. You are guaranteed to never encounter a set bit in this array since that would imply the existence of duplicates in the 100 numbers. After you've read in all the numbers, you look at the 40-byte array. Each bit that is set in this array corresponds to one of the 100 numbers you read in. By traversing this 40-byte array from the 0th bit in the 0th byte all the way to the 7th bit in the 31st byte, you will by:
Capturing all the numbers that were read in
Observing them in a sorted order
All you have to do now is print the numbers corresponding to the set bits as you traverse the 40-byte array.

Error detection code for 33 bytes, detecting bit flipped in first 32 bytes

Could you please suggest an error detection scheme for detecting
one possible bit flip in the first 32 bytes of a 33-byte message using
no more than 8 bits of additional data?
Could Pearson hashing be a solution?

Detecting a single bit-flip in any message requires only one extra bit, independent of the length of the message: simply xor together all the bits in the message and tack that on the end. If any single bit flips, the parity bit at the end won't match up.
If you're asking to detect which bit flipped, that can't be done, and a simple argument shows it: the extra eight bits can represent up to 256 classes of 32-byte messages, but the zero message and the 256 messages with one on bit each must all be in different classes. Thus, there are 257 messages which must be distinctly classified, and only 256 classes.

You can detect one bit flip with just one extra bit in any length message (as stated by #Daniel Wagner). The parity bit can, simply put, indicate whether the total number of 1-bits is odd or even. Obviously, if the number of bits that are wrong is even, then the parity bit will fail, so you cannot detect 2-bit errors.
Now, for a more accessible understanding of why you can't error-correct 32 bytes (256 bits) with just 8 bits, please read about the Hamming code (like used in ECC memory). Such a scheme uses special error-correcting parity bits (henceforth called "EC parity") that only encode the parity of a subset of the total number of bits. For every 2^m - 1 total bits, you need to use m EC bits. These represent each possible different mask following the pattern "x bits on, x bits off" where x is a power of 2. Thus, the larger the number of bits at once, the better the data/parity bit ratio you get. For example, 7 total bits would allow encoding only 4 data bits after losing 3 EC bits, but 31 total bits can encode 26 data bits after losing 5 EC bits.
Now, to really understand this probably will take an example. Consider the following sets of masks. The first two rows are to be read top down, indicating the bit number (the "Most Significant Byte" I've labeled MSB):
MSB LSB
| |
v v
33222222 22221111 11111100 0000000|0
10987654 32109876 54321098 7654321|0
-------- -------- -------- -------|-
1: 10101010 10101010 10101010 1010101|0
2: 11001100 11001100 11001100 1100110|0
3: 11110000 11110000 11110000 1111000|0
4: 11111111 00000000 11111111 0000000|0
5: 11111111 11111111 00000000 0000000|0
The first thing to notice is that the binary values for 0 to 31 are represented in each column going from right to left (reading the bits in rows 1 through 5). This means that each vertical column is different from each other one (the important part). I put a vertical extra line between bit numbers 0 and 1 for a particular reason: Column 0 is useless because it has no bits set in it.
To perform error-correcting, we will bitwise-AND the received data bits against each EC bit's predefined mask, then compare the resulting parity to the EC bit. For any calculated parities discovered to not match, find the column in which only those bits are set. For example, if error-correcting bits 1, 4, and 5 are wrong when calculated from the received data value, then column #25--containing 1s in only those masks--must be the incorrect bit and can be corrected by flipping it. If only a single error-correcting bit is wrong, then the error is in that error-correcting bit. Here's an analogy to help you understand why this works:
There are 32 identical boxes, with one containing a marble. Your task is to locate the marble using just an old-style scale (the kind with two balanced platforms to compare the weights of different objects) and you are only allowed 5 weighing attempts. The solution is fairly easy: you put 16 boxes on each side of the scale and the heavier side indicates which side the marble is on. Discarding the 16 boxes on the lighter side, you then weigh 8 and 8 boxes keeping the heavier, then 4 and 4, then 2 and 2, and finally locate the marble by comparing the weights of the last 2 boxes 1 to 1: the heaviest box contains the marble. You have completed the task in only 5 weighings of 32, 16, 8, 4, and 2 boxes.
Similarly, our bit patterns have divided up the boxes in 5 different groups. Going backwards, the fifth EC bit determines whether an error is on the left side or the right side. In our scenario with bit #25, it is wrong, so we know that the error bit is on the left side of the group (bits 16-31). In our next mask for EC bit #4 (still stepping backward), we only consider bits 16-31, and we find that the "heavier" side is the left one again, so we have narrowed down the bits 24-31. Following the decision tree downward and cutting the number of possible columns in half each time, by the time we reach EC bit 1 there is only 1 possible bit left--our "marble in a box".
Note: The analogy is useful, though not perfect: 1-bits are not represented by marbles--the erroring bit location is represented by the marble.
Now, some playing around with these masks and thinking how to arrange things will reveal that there is a problem: If we try to make all 31 bits data bits, then we need 5 more bits for EC. But how, then, will we tell if the EC bits themselves are wrong? Just a single EC bit wrong will incorrectly tell us that some data bit needs correction, and we'll wrongly flip that data bit. The EC bits have to somehow encode for themselves! The solution is to position the parity bits inside of the data, in columns from the bit patterns above where only one bit is set. This way, any data bit being wrong will trigger two EC bits to be wrong, making it so that if only one EC bit is wrong, we know it is wrong itself instead of it signifying a data bit is wrong. The columns that satisfy the one-bit condition are 1, 2, 4, 8, and 16. The data bits will be interleaved between these starting at position 2. (Remember, we are not using position 0 as it would never provide any information--none of our EC bits would be set at all).
Finally, adding one more bit for overall parity will allow detecting 2-bit errors and reliably correcting 1-bit errors, as we can then compare the EC bits to it: if the EC bits say something is wrong, but the parity bit says otherwise, we know there are 2 bits wrong and cannot perform correction. We can use the discarded bit #0 as our parity bit! In fact, now we are encoding the following pattern:
0: 11111111 11111111 11111111 11111111
This gives us a final total of 6 Error-Checking and Correcting (ECC) bits. Extending the scheme of using different masks indefinitely looks like this:
32 bits - 6 ECC bits = 26 data
64 bits - 7 ECC bits = 57 data
128 bits - 8 ECC bits = 120 data
256 bits - 9 ECC bits = 247 data
512 bits - 10 ECC bits = 502 data
Now, if we are sure that we only will get a 1-bit error, we can dispense with the #0 parity bit, so we have the following:
31 bits - 5 ECC bits = 26 data
63 bits - 6 ECC bits = 57 data
127 bits - 7 ECC bits = 120 data
255 bits - 8 ECC bits = 247 data
511 bits - 9 ECC bits = 502 data
This is no change because we don't get any more data bits. Oops! 32 bytes (256 bits) as you requested cannot be error-corrected with a single byte, even if we know we can have only a 1-bit error at worst, and we know the ECC bits will be correct (allowing us to move them out of the data region and use them all for data). We need TWO more bits than we have--one must slide up to the next range of 512 bits, then leave out 246 data bits to get our 256 data bits. So that's one more ECC bit AND one more data bit (as we only have 255, exactly what Daniel told you).
Summary:: You need 33 bytes + 1 bit to detect which bit flipped in the first 32 bytes.
Note: if you are going to send 64 bytes, then you're under the 32:1 ratio, as you can error correct that in just 10 bits. But it's that in real world applications, the "frame size" of your ECC can't keep going up indefinitely for a few reasons: 1) The number of bits being worked with at once may be much smaller than the frame size, leading to gross inefficiencies (think ECC RAM). 2) The chance of being able to accurately correct a bit gets less and less, since the larger the frame, the greater the chance it will have more errors, and 2 errors defeats error-correction ability, while 3 or more can defeat even error-detection ability. 3) Once an error is detected, the larger the frame size, the larger the size of the corrupted piece that must be retransmitted.

If you need to use a whole byte instead of a bit, and you only need to detect errors, then the standard solution is to use a cyclic redundancy check (CRC). There are several well-known 8-bit CRCs to choose from.
A typical fast implementation of a CRC uses a table with 256 entries to handle a byte of the message at a time. For the case of an 8 bit CRC this is a special case of Pearson's algorithm.

Expected collisions for perfect 32bit crc

I'm trying to determine how my crc compares to an "ideal" 32bit crc.
So I ran my crc over 1 million completely random samples of data and collected the amount of collisions, I want to compare this number to the number of collisions I could expect from the "ideal" crc.
Does anyone know how to calculate the expected collision for an "ideal" 32bit crc?

Compare your own CRC with 0x1EDC6F41 as your "ideal" reference.
Having said that, there is no ideal 32-bit CRC. Different polynomials have different collision characteristics depending on the length of data hashed. However, a paper by Castagnoli in 1993 found what is considered the best 32-bit CRC value over the broadest range of data lengths, which is 0x1EDC6F41. This polynomial is used by some network protocols like iSCSI and also the x86 CRC32 instruction.

This explains beautifully the "Birthday Problem" and all about predicting the collision probability CRC32 Hash Collision Probability

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio