Why is FNV-1a hashing data one byte at a time? - performance

I noticed that some implementations of C++ std::hash use the FNV-1a algorithm, but this is a language-agnostic question.
One thing I find strange is that the algorithm processes one byte of data per loop iteration:
hash = FNV_offset_basis
for each byte_of_data to be hashed
    hash = hash XOR byte_of_data
    hash = hash × FNV_prime
return hash
I wonder why the algorithm does not process 4 or 8 bytes of data per iteration and fall back to per-byte processing for the last few bytes (if the length of the range is not a multiple of 4 or 8).
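For reference, here is a minimal Python sketch of the per-byte loop above, using the standard 64-bit FNV-1a constants:

FNV_OFFSET_BASIS = 0xcbf29ce484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001b3              # standard 64-bit FNV prime

def fnv1a_64(data: bytes) -> int:
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte                                  # XOR in one byte of input
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF   # multiply, keep low 64 bits
    return h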

Related

3 Byte output hashing algorithm

So for a project that I am working on, I am trying to get a hashing algorithm, but I don't know anything about hashing algorithms. The final outcome I would like to achieve is inputting a 6-byte value and getting 3 unique bytes as my output.
My other alternative is an algorithm that inputs a 2-byte value and outputs 1 unique byte.
Is this possible?
Edit: I would need this in C if possible, or pseudocode.
Most hash functions can take an arbitrary number of bytes as input, since they are by nature compression functions. As for the output, you can just take the first 3 bytes. Any cryptographically safe hash function will output bytes that are suitable for this.
For example, in Python it would be:
from hashlib import sha256
s = sha256(<your bytes>)
output = s.digest()[:3]
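A runnable variant of the same idea, with a hypothetical 6-byte input substituted for the placeholder:

from hashlib import sha256

data = b"\x01\x02\x03\x04\x05\x06"  # example 6-byte input, made up for illustration
output = sha256(data).digest()[:3]  # first 3 bytes of the 32-byte digest
print(output.hex())                 # 6 hex characters = 3 bytes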

Lossless Compression of Random Data

tl;dr
I recently started listening to a security podcast, and heard the following sentence (paraphrasing)
One of the good hallmarks of a cryptographically strong random number is its lack of compressibility
Which immediately got me thinking: can random data be losslessly compressed? I started reading and found this Wikipedia article. A quoted block is below:
In particular, files of random data cannot be consistently compressed by any conceivable lossless data compression algorithm: indeed, this result is used to define the concept of randomness in algorithmic complexity theory.
I understand the pigeonhole principle, so I'm assuming I'm way wrong here somewhere, but what am I missing?
IDEA:
Assume you have an asymmetric variable-length encryption method by which you could convert any N-bit number into either an (N−16)-bit number or an (N+16)-bit number. Is this possible?
If we had an asymmetric algorithm that could make the data, say, 16 bits bigger or 16 bits smaller, then I think I can come up with an algorithm for reliably producing lossless compression.
Lossless Compression Algorithm for Arbitrary Data
Break the initial data into chunks of a given size. Then use a "key" and attempt to compress each chunk as follows.
function compress(data)
    compressedData = []
    chunks = data.splitBy(chunkSize)
    foreach chunk in chunks
        encryptedChunk = encrypt(chunk, key)
        if (encryptedChunk.Length <= chunk.Length - 16) // arbitrary amount
            compressedData.append(0) // 1 bit, not an integer
            compressedData.append(encryptedChunk)
        else
            compressedData.append(1) // 1 bit, not an integer
            compressedData.append(chunk)
    end foreach
    return compressedData
end function
And for decompression, if you know the chunk size: for each chunk that begins with a 0, invert the asymmetric encryption and append the result to the ongoing array; for each chunk that begins with a 1, simply append the data as-is. If the encryption method produces the 16-bit-smaller value even 1/16 as often as the 16-bit-larger value, then this will work, right? Each chunk is either 1 bit bigger or 15 bits smaller.
One other consideration is that the "key" used by the compression algorithm can either be fixed or perhaps appended to the beginning of the compressed data. The same consideration applies to the chunk size.
There are 2^(N−16) possible (N−16)-bit sequences, and 2^N possible N-bit sequences. Consequently, no more than one in every 2^16 N-bit sequences can be losslessly compressed to N−16 bits. So it will happen far less frequently than 1/16 of the time: at most 1/65536 of the time.
As your reasoning indicates, the remaining N-bit sequences could be expanded to N+1 bits; there is no need to waste an additional 15 bits encoding them. All the same, the probability of a random N-bit sequence being in the set of (N−16)-bit compressible sequences is so small that the average compression (or expected compression) will continue to be 1.0 (at best).
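To put a number on that, a quick back-of-the-envelope check in Python (the chunk size N is an illustrative assumption):

# Expected bits emitted per N-bit chunk under the scheme above, assuming
# (generously) that a full 2**-16 fraction of chunks shrinks by 16 bits.
# A shrunken chunk costs N - 16 + 1 bits; every other chunk costs N + 1.
p = 2.0 ** -16
N = 4096  # illustrative chunk size in bits (assumption)
expected = p * (N - 15) + (1 - p) * (N + 1)
print(expected - N)  # ~ +0.99976: on average each chunk grows by almost a bit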

What is the range of possible sha1hash results?

What are the lowest and highest possible returns from SHA-1? (Bearing in mind that SHA-1 results are actually five 32-bit values rather than one true 160-bit value.)
To create a secure hash, the output of the hash must be indistinguishable from random. Many pseudo-random number generators and key derivation methods actually use a hash as the final calculation.
So the "highest" result consists of all zero's, the lowest consists of all ones. That is, if you interpret the result to be an unsigned integer of course. The chances of exactly getting those values is almost zero of course, as SHA-1 results should be evenly distributed. But the change of a number starting with 8 ones is still 1/2^8 == 1/256, which is certainly not insignificant.
Note that the result of SHA-1 should be interpreted as a bit string. Most runtimes don't have a very useful bit-string representation and use an octet string (aka byte array) instead. I would consider it very annoying if a SHA-1 implementation returned shorts instead of bytes. You don't want to annoy the user with differences between little-endian and big-endian representations, and most other primitives expect their input represented as bytes.
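To see the full range concretely, here is a small Python sketch that interprets the 20-byte digest as a single unsigned 160-bit integer (the input string is arbitrary):

from hashlib import sha1

digest = sha1(b"example input").digest()  # 20 bytes = 160 bits
value = int.from_bytes(digest, "big")     # 0 <= value < 2**160
print(len(digest), value.bit_length())    # bit_length is at most 160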

Proving a perfect hash function over a fixed length input

I have seen the answers on here stating to use gperf; however, I would prefer to roll my own based on a proof that I create for the domain of strings with a length of <= 200. Based on the calculations I have from Wolfram, I get ~7.9 x 10^374 total permutations. Therefore my line of thinking is that if I have a 2048-bit hash function (~3.2 x 10^616 possible values), I should be able to handle the entire universe of strings that I need to process. My question is: how can I prove that the hash implementation I end up producing will be perfect, given the constraint of the universe of all strings of length 200 or less?
Strings with a length of 200 characters only have 200 * 8 = 1600 bits. If a 2048-bit hash is OK for your purpose, you could just use the string bits themselves as a perfect hash. The identity hash function is perfect, as it maps each input to a distinct hash value (obviously, because there is no real mapping).
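A minimal Python sketch of that identity hash, assuming 8-bit characters; the length prefix is my own addition, there so that inputs differing only in leading zero bytes cannot collide:

def identity_hash(s: bytes) -> int:
    # <= 200 input bytes -> <= 1600 bits, well under a 2048-bit budget.
    assert len(s) <= 200
    # The length prefix keeps e.g. b"a" and b"\x00a" distinct even though
    # their raw big-endian integer values would be equal (an assumption
    # about the encoding, not something the answer specifies).
    return int.from_bytes(bytes([len(s)]) + s, "big")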

How would you sort 1 million 32-bit integers in 2MB of RAM?

Please provide code examples in a language of your choice.
Update:
No constraints set on external storage.
Example: Integers are received/sent via network. There is a sufficient space on local disk for intermediate results.
Split the problem into pieces small enough to fit into available memory, then use merge sort to combine them.
Sorting a million 32-bit integers in 2MB of RAM using Python by Guido van Rossum
1 million 32-bit integers = 4 MB of memory.
You should sort them using some algorithm that uses external storage. Mergesort, for example.
You need to provide more information. What extra storage is available? Where are you supposed to store the result?
Otherwise, the most general answer:
1. Load the first half of the data into memory (2MB), sort it by any method, and output it to a file.
2. Load the second half of the data into memory (2MB), sort it by any method, and keep it in memory.
3. Use a merge algorithm to merge the two sorted halves and output the complete sorted data set to a file.
This Wikipedia article on external sorting has some useful information.
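A compact Python sketch of those three steps; the file names, and spilling both halves to disk rather than keeping one in memory, are simplifying assumptions:

import heapq
import struct

def spill_sorted(ints, path):
    # Steps 1 and 2: sort a half that fits in memory, write it to disk.
    ints.sort()
    with open(path, "wb") as f:
        for v in ints:
            f.write(struct.pack("<i", v))

def read_run(path):
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(4), b""):
            yield struct.unpack("<i", chunk)[0]

def merge_runs(out_path):
    # Step 3: stream-merge the two sorted runs; memory use stays constant.
    with open(out_path, "wb") as out:
        for v in heapq.merge(read_run("run0.bin"), read_run("run1.bin")):
            out.write(struct.pack("<i", v))

Calling spill_sorted once per half and then merge_runs produces the complete sorted file.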
Dual tournament sort with polyphase merge:
#!/usr/bin/env python
import random
# 'sort' is a custom module supplying Pickle and Polyphase (not the stdlib)
from sort import Pickle, Polyphase

nrecords = 1000000
available_memory = 2000000  # number of bytes
# NOTE: this doesn't count memory required by the Python interpreter
record_size = 24  # (20 + 4) number of bytes per element in a Python list
heap_size = available_memory / record_size

p = Polyphase(compare=lambda x, y: cmp(y, x),  # descending order
              file_maker=Pickle,
              verbose=True,
              heap_size=heap_size,
              max_files=4 * (nrecords / heap_size + 1))

# put records
maxel = 1000000000
for _ in xrange(nrecords):
    p.put(random.randrange(maxel))

# get sorted records
last = maxel
for n, el in enumerate(p.get_all()):
    if el > last:  # elements must be in descending order
        print "not sorted %d: %d %d" % (n, el, last)
        break
    last = el

assert nrecords == (n + 1)  # check all records read
Um, store them all in a file.
Memory map the file (you said there was only 2M of RAM; let's assume the address space is large enough to memory map a file).
Sort them using the file backing store as if it were real memory now!
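One way the mmap idea could look in Python: an in-place heapsort that touches the mapping only through 4-byte reads and writes, so Python-level memory stays small while the OS pages the file in and out (the little-endian unsigned-int file layout is an assumption):

import mmap
import struct

def get(mm, i):
    return struct.unpack_from("<I", mm, 4 * i)[0]

def put(mm, i, v):
    struct.pack_into("<I", mm, 4 * i, v)

def sort_mmapped(path, count):
    # In-place heapsort over the mapping: O(1) extra memory in Python,
    # with the file itself serving as the array.
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 4 * count)
        def sift_down(root, end):
            while 2 * root + 1 < end:
                child = 2 * root + 1
                if child + 1 < end and get(mm, child) < get(mm, child + 1):
                    child += 1
                if get(mm, root) < get(mm, child):
                    r, c = get(mm, root), get(mm, child)
                    put(mm, root, c)
                    put(mm, child, r)
                    root = child
                else:
                    return
        for start in range(count // 2 - 1, -1, -1):  # heapify
            sift_down(start, count)
        for end in range(count - 1, 0, -1):          # repeatedly extract the max
            top, last = get(mm, 0), get(mm, end)
            put(mm, 0, last)
            put(mm, end, top)
            sift_down(0, end)
        mm.close()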
Here's a valid and fun solution.
Load half the numbers into memory. Heap-sort them in place and write the output to a file. Repeat for the other half. Use an external sort (basically a merge sort that takes file I/O into account) to merge the two files.
Aside:
Make heap sort faster in the face of slow external storage:
Start constructing the heap before all the integers are in memory.
Start putting the integers back into the output file while heap sort is still extracting elements.
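The first point of that aside, sketched with Python's heapq; the read/write callables are placeholders, and truly overlapping I/O with extraction would need buffering that this sketch omits:

import heapq

def heapsort_streaming(read_values, write_value):
    heap = []
    for v in read_values():  # the heap is built while input is still arriving
        heapq.heappush(heap, v)
    while heap:              # minimums stream to the output as they are extracted
        write_value(heapq.heappop(heap))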
As people above mention, 1 million 32-bit ints take 4 MB.
To fit as many numbers as possible into as little space as possible, you can use the types int, short and char in C++. You could be slick (but have odd, dirty code) by doing several kinds of casting to stuff things everywhere.
Here it is off the edge of my seat.
Anything that is less than 2^8 (0 - 255) gets stored as a char (1-byte data type).
Anything that is at least 2^8 and less than 2^16 (256 - 65535) gets stored as a short (2-byte data type).
The rest of the values are put into an int (4-byte data type).
You would want to specify where the char section starts and ends, where the short section starts and ends, and where the int section starts and ends.
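A hedged Python sketch of that sectioning idea (Python instead of C++ for brevity; returning the section boundaries stands in for "specifying where each section starts and ends"):

import struct

def pack_by_width(values):
    chars = [v for v in values if v < 2**8]
    shorts = [v for v in values if 2**8 <= v < 2**16]
    ints = [v for v in values if v >= 2**16]
    blob = (struct.pack("<%dB" % len(chars), *chars)
            + struct.pack("<%dH" % len(shorts), *shorts)
            + struct.pack("<%dI" % len(ints), *ints))
    # Byte offsets where the short and int sections begin.
    return blob, (len(chars), len(chars) + 2 * len(shorts))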
No example, but bucket sort has relatively low complexity and is easy enough to implement.
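Since no example is given, here is one hedged way bucketing could respect the 2 MB limit: scatter the values by their top byte into 256 temporary files, then sort each bucket in memory (the file names and the assumption of roughly uniform input are mine):

import struct

def bucket_sort_external(read_ints, out):
    # Scatter pass: route each unsigned 32-bit value by its top byte.
    files = [open("bucket%03d.bin" % i, "w+b") for i in range(256)]
    for v in read_ints():
        files[v >> 24].write(struct.pack("<I", v))
    # Gather pass: each bucket (~4000 ints for uniform input) sorts in RAM.
    for f in files:
        f.seek(0)
        data = f.read()
        for v in sorted(struct.unpack("<%dI" % (len(data) // 4), data)):
            out.write(struct.pack("<I", v))
        f.close()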

Resources