FNV-1a hash - XORing 4 bytes at a time - algorithm

Are there any adverse effects if during calculation of FNV-1a hash, 4 bytes are xor'ed at a time rather than just one?

Yes, there is a problem. The algorithm does an XOR on each byte and then multiplies to "mix" that byte with the rest of the value. If you did an XOR on four bytes at a time, the last four bytes of the value you're hashing would have an overwhelming effect on the result.
Basically, the calculation was designed to mix in one byte at a time. If you mix in four bytes at a time, you'll change the distribution of the values that it produces.
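To make the per-byte mixing concrete, here is a minimal sketch of the standard 32-bit FNV-1a loop in Python (the offset basis and prime are the published FNV-1a constants):

FNV_OFFSET_32 = 0x811C9DC5   # published 32-bit offset basis
FNV_PRIME_32  = 0x01000193   # published 32-bit FNV prime

def fnv1a_32(data: bytes) -> int:
    h = FNV_OFFSET_32
    for b in data:
        h ^= b                                 # mix in exactly one byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF    # multiply to diffuse it through the hash
    return h

print(hex(fnv1a_32(b"hello")))

XORing four bytes at once would skip three of the intervening multiplications, which is exactly the loss of mixing described above.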

Related

How does Rabin-Karp choose breakpoints in variable-length chunking?

I understand the Rabin-Karp algorithm and its usage in string searching. What I don't quite understand is how it can dynamically slice a file into variable-length chunks.
It's said to calculate the hash of a small window of data bytes (ex: 48 bytes) at every single byte offset, and the chunk boundaries—called breakpoints—are whenever the last N (ex: 13) bits of the hash are zero. This gives you an average block size of 2^N = 2^13 = 8192 = 8 KB.
Questions:
Does the Rabin-Karp rolling hash start from the first 48 bytes, then roll forward one byte at a time?
If so, is it too much to calculate for a large file, even with a simple hash function?
Given unpredictable data, how is it possible for the last N bits of the hash to be zero within the large chunk size limit?
Yes, the sliding window is fixed-size and moves forward byte by byte.
The rolling hash is O(n) overall: at each step it only adds the contribution of the incoming byte (with a shift) and subtracts the contribution of the byte leaving the window, which is the core idea of the Rabin hash.
It depends on the hash function, actually. The distribution of the chunk sizes may differ. To reduce chunk-size variability, the Two Thresholds, Two Divisors algorithm (TTTD) was proposed. You can also find some more recent advances from academic research papers in this thread.
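As an illustration of the mechanics described above (a sketch only, with arbitrary toy constants, not any particular production chunker), here is content-defined chunking in Python with a simple polynomial rolling hash over a 48-byte window, cutting wherever the low 13 bits of the hash are zero:

import os

WINDOW = 48                  # sliding window size in bytes
MASK = (1 << 13) - 1         # 13 low bits zero -> expected average chunk ~ 8 KB
BASE = 257                   # toy polynomial base
MOD = (1 << 61) - 1          # large prime modulus for the toy hash

def chunk_boundaries(data: bytes):
    if len(data) < WINDOW:
        return [len(data)]
    pow_w = pow(BASE, WINDOW - 1, MOD)   # weight of the byte that leaves the window
    h = 0
    for b in data[:WINDOW]:              # hash of the first window
        h = (h * BASE + b) % MOD
    boundaries = []
    for i in range(WINDOW, len(data)):
        if (h & MASK) == 0:              # breakpoint: low 13 bits are zero
            boundaries.append(i)
        # roll forward one byte: drop data[i - WINDOW], add data[i]
        h = ((h - data[i - WINDOW] * pow_w) * BASE + data[i]) % MOD
    boundaries.append(len(data))         # final (possibly short) chunk
    return boundaries

print(chunk_boundaries(os.urandom(100_000))[:5])

Note that each step does a constant amount of work, which is why scanning every byte offset of a large file stays linear overall.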

QR code generation algorithm implementation case analysis

I'm implementing a QR code generation algorithm as explained on thonky.com and I'm trying to understand one of the cases:
As stated in this page and this page, I can deduce that if the code is protected with error correction level M, and the chosen mask is No. 0, the first 5 bits of the format string (non-XORed) are '00000', and because of this the whole 15-bit string is zeros.
The next step is to remove all leading zeros, which are, again, all of them. It means that there's nothing to XOR the generator polynomial string (10100110111) with, thus giving us a final string of 15 zeros, which means that the final (XORed) string will simply be the mask string (101010000010010).
I'm seeking confirmation that my logic is right.
Thank you all very much in advance for the help.
Your logic is correct.
Regarding "remove all leading zeroes": the actual process could be described as appending 10 zero bits to the 5 data bits and treating the 15 bits as 15 single-bit coefficients of a polynomial, then dividing that polynomial by the 11-bit generator polynomial, which leaves a 10-bit remainder polynomial that is subtracted from the 5-data-bit + 10-zero-bit polynomial. Since this is binary math, add and subtract are both XOR operations, and since the 10 appended bits are zeros, the process can just append the 10 remainder bits to the 5 data bits.
As commented above, rather than actually implementing a BCH encode function, since there are only 32 possible format strings, you can just do a table lookup.
https://www.thonky.com/qr-code-tutorial/format-version-tables
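To make the arithmetic above concrete, here is a sketch of that format-string computation in Python: 5 data bits (EC level + mask number), a BCH remainder computed with the generator 10100110111, then an XOR with the fixed mask 101010000010010. For EC level M ('00') and mask 0, the data bits are all zero, so the result is just the mask string, confirming the reasoning in the question:

GENERATOR   = 0b10100110111         # 11-bit generator polynomial
FORMAT_MASK = 0b101010000010010     # fixed 15-bit format mask

def format_bits(data5: int) -> int:
    # Append 10 zero bits, then reduce modulo the generator polynomial
    # (binary polynomial division, so subtraction is XOR).
    rem = data5 << 10
    for shift in range(14, 9, -1):
        if rem & (1 << shift):
            rem ^= GENERATOR << (shift - 10)
    # Append the 10-bit remainder to the 5 data bits, then apply the mask.
    return ((data5 << 10) | rem) ^ FORMAT_MASK

print(format(format_bits(0b00000), '015b'))   # EC level M, mask 0 -> 101010000010010

A 32-entry lookup table indexed by the 5 data bits, as suggested above, gives the same results without the division.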

How can one byte be more significant than another?

The difference between little endian and big endian was explained to me like this: "In little endian the least significant byte goes into the low address, and in big endian the most significant byte goes into the low address". What exactly makes one byte more significant than the other?
In the number 222, you could regard the first 2 as most significant because it represents the value 200; the second 2 is less significant because it represents the value 20; and the third 2 is the least significant digit because it represents the value 2.
So, although the digits are the same, the magnitude of the number they represent is used to determine the significance of a digit.
It is the same as when a value is rounded to a number of significant figures ("S.F." or "SF"): 123.321 to 3SF is 123, to 2SF it is 120, to 4SF it is 123.3. That terminology has been used since before electronics were invented.
In any positional numeric system, each digit has a different weight in creating the overall value of the number.
Consider the number 51354 (in base 10): the first 5 is more significant than the second 5, as it stands for 5 multiplied by 10000, while the second 5 is just 5 multiplied by 10.
In computers, numbers are generally fixed-width: for example, a 16-bit unsigned integer can be thought of as a sequence of exactly 16 binary digits, with the leftmost one being unambiguously the most significant (it is worth exactly 32768, more than any other bit in the number), and the rightmost the least significant (it is worth just one).
As long as integers are in the CPU registers we don't really need to care about their representation - the CPU will happily perform operations on them as required. But when they are saved to memory (which generally is some random-access bytes store), they need to be represented as bytes in some way.
If we consider "normal" computers, representing a number (bigger than one byte) as a sequence of bytes essentially means representing it in base 256, with each byte being a base-256 digit, and each base-256 digit being more or less significant than the others.
Let's see an example: take the value 54321 as a 16-bit integer. If you write it in base 256, it'll be two base-256 digits: the digit 0xD4 (which is worth 0xD4 multiplied by 256) and the digit 0x31 (which is worth 0x31 multiplied by 1). It's clear that the first one is more significant than the second one, as the leftmost "digit" is worth 256 times more than the one at its right.
Now, little endian machines will write in memory the least significant digit first, big endian ones will do the opposite.
Incidentally, there's a nice relationship between binary, hexadecimal and base-256: 4 bits map straight to a hexadecimal digit, and 2 hexadecimal digits map straight to a byte. For this reason you can also see that 54321 is, in binary,
1101010000110001 = 0xD431
can be split straight into two groups of 8 bits
11010100 00110001
which are the 0xD4 and 0x31 above. So you can see as well that the most significant byte is the one that contains the most significant bits.
Here I'm using the corresponding hexadecimal values to represent each base-256 digit, as there's no good way to represent them symbolically. I could use their ASCII character values, but 0xD4 is outside ASCII, and 0x31 is the character '1', which would only add confusion.
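As a quick illustration of the two byte orders (using Python's struct module as a stand-in for memory layout), storing 54321 as a 16-bit unsigned integer:

import struct

value = 54321                             # 0xD431
print(struct.pack('>H', value).hex())     # big endian:    d431 (most significant byte at the low address)
print(struct.pack('<H', value).hex())     # little endian: 31d4 (least significant byte at the low address)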

How to generate 256-bit random number within a range on embedded system

I need to generate a cryptographically secure random number that is 256 bits long, within a specific range. I use a microcontroller equipped with a random number generator (the producer boasts that it's a true random number generator, based on thermal noise).
The upper limit of the number to be generated is given as a byte array. My question is: will it be secure to get the random number byte by byte, performing:
n[i] = rand[i] mod limit[i]
where n[i] is the i-th byte of my number, etc.
The standard method, using all the bits from the RNG is:
number <- random()
while (number outside range)
number <- random()
endwhile
return number
There are some tweaks possible if the required range is less than half the size of the RNG output, but I assume that is not the case here: it would reduce the output size by one or more bits. Given that, the while loop will normally be entered only once or twice, if at all.
Comparing byte arrays is reasonably simple, and usually speedy providing you compare the most significant bytes first. If the most significant bytes differ, then there is no need to compare less significant bytes at all. We can tell that 7,###,###,### is larger than 5,###,###,### without knowing what digits the # stand for.
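A sketch of that loop, assuming the limit arrives as a 32-byte big-endian array and using Python's secrets module as a stand-in for the hardware RNG:

import secrets

def random_below(limit: bytes) -> bytes:
    # Uniform random 32-byte value strictly less than `limit` (big endian).
    assert len(limit) == 32 and any(limit)
    while True:
        candidate = secrets.token_bytes(32)   # stand-in for the hardware RNG
        # Equal-length big-endian byte arrays compare like the numbers they
        # represent: most significant byte first, stop at the first difference.
        if candidate < limit:
            return candidate

limit = bytes.fromhex('7f' + 'ff' * 31)       # hypothetical upper limit
print(random_below(limit).hex())

Because rejected candidates are simply discarded, the accepted values remain uniform over the target range.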

Best way to represent numbers of unbounded length?

What's the most optimal (space efficient) way to represent integers of unbounded length?
(The numbers range from zero to positive-infinity)
Some sample number inputs can be found here (each number is shown on its own line).
Is there a compression algorithm that is specialized in compressing numbers?
You've basically got two alternatives for variable-length integers:
Use 1 bit of every k as an end terminator. That's the way Google protobuf does it, for example (in their case, one bit from every byte, so there are 7 useful bits in every byte).
Output the bit-length first, and then the bits. That's how ASN.1 works, except for OIDs which are represented in form 1.
If the numbers can be really big, option 2 is better, although it's more complicated and you have to apply it recursively, since you may have to output the length of the length, then the length, and then the number. A common technique is to use option 1 (bit markers) for the length field.
For smallish numbers, option 1 is better (a sketch of it follows below). Consider the case where most numbers would fit in 64 bits. The overhead of storing them at 7 bits per byte is 1/7; with eight bytes, you'd represent 56 bits. Using even the 7/8 representation for the length would also represent 56 bits in eight bytes: one length byte and seven data bytes. Any number shorter than 48 bits would benefit from the self-terminating code.
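A sketch of option 1, the protobuf-style varint layout (7 payload bits per byte, with the high bit of every byte except the last acting as the continuation marker):

def encode_varint(n: int) -> bytes:
    # 7 payload bits per byte; high bit set means "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit set
        else:
            out.append(byte)          # last byte: continuation bit clear
            return bytes(out)

def decode_varint(data: bytes) -> int:
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            break
    return n

assert decode_varint(encode_varint(300)) == 300
print(encode_varint(300).hex())   # ac02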
"Truly random numbers" of unbounded length are, on average, infinitely long, so that's probably not what you've got. More likely, you have some idea of the probability distribution of number sizes, and could choose between the above options.
Note that none of these "compress" (except relative to the bloated ASCII-decimal format). The asymptote of (log n)/n is 0, so as the numbers get bigger, the size of the size of the number tends to occupy no (relative) space. But it still needs to be represented somehow, so the total representation will always be a bit bigger than log2 of the number.
You cannot compress per se, but you can encode, which may be what you're looking for. You have files with sequences of ASCII decimal digits separated by line feeds. You should simply Huffman encode the characters. You won't do much better than about 3.5 bits per character.
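To see where a figure of roughly 3.3 to 3.5 bits per character comes from, a quick sketch that measures the empirical character entropy of such digits-plus-newline text (the sample list here is made up for illustration); a Huffman code stays within one bit per symbol of this value:

import math
from collections import Counter

def bits_per_char(text: str) -> float:
    counts = Counter(text)
    total = len(text)
    # Shannon entropy of the character distribution, in bits per character.
    return -sum(c / total * math.log2(c / total) for c in counts.values())

sample = "\n".join(str(n) for n in [5, 42, 1234567, 99, 300, 8675309]) + "\n"
print(round(bits_per_char(sample), 2))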
