std::unordered_map with 5 uint32_t as keys - performance

I am writing a checkers engine that fails catastrophically when there's a hash collision. The simplest (and sufficient for my needs) way to fix this would be to use the entire position as the hash key. A position is determined by five uint32_t bitboards and a binary player indicator (which is also a uint32_t). What's the best way to have std::unordered_map take a 6-tuple of uint32_t as a key?
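A minimal sketch of one common approach (my code, not from the thread): wrap the six words in a key struct with operator==, and give std::unordered_map a custom hasher, e.g. FNV-1a run over the key bytes. The names Position and PositionHash are placeholders.

#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct Position {
    uint32_t bb[5];   // the five bitboards
    uint32_t player;  // binary player indicator
    bool operator==(const Position& o) const {
        for (int i = 0; i < 5; ++i)
            if (bb[i] != o.bb[i]) return false;
        return player == o.player;
    }
};

struct PositionHash {
    std::size_t operator()(const Position& p) const {
        // 64-bit FNV-1a over the six words, one byte at a time
        std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
        auto mix = [&h](uint32_t w) {
            for (int i = 0; i < 4; ++i)
                h = (h ^ ((w >> (8 * i)) & 0xFFu)) * 0x100000001b3ULL;  // FNV prime
        };
        for (uint32_t w : p.bb) mix(w);
        mix(p.player);
        return static_cast<std::size_t>(h);
    }
};

// usage: std::unordered_map<Position, int, PositionHash> transpositionTable;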

Related

Using unordered_map with key only to store pointers (dismiss value)

I'm implementing an algorithm that checks nodes in a mesh for a certain value. To record which nodes I have already checked, I'd like to use an unordered_map with the pointer to the node as the key. I can then simply use umap.find(pointer) to see if a node was already checked and skip it. This way I can accomplish the whole pass in O(n) time.
However, I don't actually need to store a value in the map; the key itself is enough information. Is std::unordered_map even the right solution then? If so, what should I put in the "value" field to maximize performance? I have a 32-bit embedded system, so I thought of just putting uint32_t or uint_fast32_t there.
tl;dr:
Is std::unordered_map the right tool to store keys without values?
Will the native hash function work well for pointers? Or would you suggest a different hashing algorithm?
What do I put as "value" for the map if using std::unordered_map to optimize for performance?
Is std::unordered_map the right tool to store keys without values?
I would use a std::unordered_set in these situations.
Will the native hash function work well for pointers?
Yes. It is most likely just a cast from pointer to std::size_t.
What do I put as "value" for the map if using std::unordered_map to optimize for performance?
If you use a std::unordered_set instead, there is no value, only the pointers.
Is std::unordered_map the right tool to store keys without values?
No - std::unordered_set is the one to use when you don't have distinct keys and values.
Will the native hash function work well for pointers? Or would you suggest a different hashing algorithm?
The "native" compiler-supplied hash function probably casts the pointer value to size_t - a kind of identity hash. That may or may not work well depending on the compromises your Standard Library has chosen. GCC and clang use prime numbers of buckets in the hash table, so it will work fine. Visual C++ (and many non-Standard hash table implementations) use powers of two (i.e. 128, 256, 512...). Powers of two are used because it's very fast to map them on to buckets - just AND with a bitwise mask (127, 255, 511) to retain however-many less-significant bits you need. The problem with doing that with pointers is that often the pointed-to objects have some alignment, so they may all be multiples of e.g. 4 or 8. A multiple of 8 always has the three least significant bits set to 0: those bits don't contribute to the randomised placement of the value in a bucket. Instead, only every 8th bucket will receive any share of the elements being hashed. If you have an implementation like this, then you're probably better off using a better hash function. At the least, you could say bit-shift the pointer values right by enough to remove the known zeros.
What do I put as "value" for the map if using std::unordered_map to optimize for performance?
Again, you should use a std::unordered_set, so you don't have to worry about a value.
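A minimal sketch combining the two suggestions above (an unordered_set with a hasher that shifts out the alignment zeros); the Node type and the 3-bit shift for 8-byte alignment are assumptions, not from the question:

#include <cstddef>
#include <cstdint>
#include <unordered_set>

struct Node { /* mesh node fields */ };

struct NodePtrHash {
    std::size_t operator()(const Node* p) const {
        auto v = reinterpret_cast<std::uintptr_t>(p);
        return static_cast<std::size_t>(v >> 3);  // drop the known-zero alignment bits
    }
};

// usage:
// std::unordered_set<const Node*, NodePtrHash> checked;
// if (checked.insert(node).second) { /* first visit: process the node */ }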

A good hashing function for a non-uniform sequence of uniformly distributed 4-bit values?

I have a very specific problem:
I have uniformly random values spread on a 15x50 grid and the sample I want to hash corresponds to a square of 5x5 cells centered around any possible grid position.
The number of samples can thus vary from 25 (away from borders, the most common case) down through 20 or 15 (near a border) to a minimum of 9 (in a corner).
So even though the cell values are random, the location introduces a deterministic variation in the sequence length.
The hash table size is a small number, typically between 20 and 50.
The function will operate on a large set of randomly generated grids (a few hundred to a few thousand), and might be called a few thousand times per grid. The positions on the grid can be considered random.
I would like a function that could spread the 15x50 possible samples as evenly as possible.
I have tried the following pseudo-code:
int32 hash = 0;
int i = 0; // i could take any initial value, but fixing one keeps the function deterministic
foreach (value in block)
{
    hash ^= (value << (i % 28));
    i++;
}
hash %= table_size;
but the results, though not grossly imbalanced, do not seem very smooth to me. Maybe it's because the sample is too small, but the circumstances make it difficult to run the code on a bigger sample, and I would rather not have to write a complete test harness if someone computer-savvy has an answer ready for me :).
I am not sure pairing the values two by two and using a general-purpose byte-hashing strategy would be the best solution, especially since the number of values might be odd.
I have thought of using a 17th value to represent off-grid cells, but that seems to introduce a bias (the sequences from cells near a border would contain a lot of "off-grid" values).
I am not sure either what would be the best way to test the efficiency of various solutions (how many grids shall I generate to have an idea of the performances, for instance).
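(Sketch, not from the thread: one rough way to test smoothness without a full harness is a chi-squared check of bucket counts against a uniform spread; the table size 37 and the stand-in RNG "hash" below are placeholders to be replaced with the real hash of a 5x5 block.)

#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t tableSize = 37;  // placeholder table size
    const int trials = 100000;
    std::mt19937 rng(12345);
    std::vector<long> buckets(tableSize, 0);
    for (int t = 0; t < trials; ++t) {
        uint32_t h = rng();            // stand-in: replace with the hash of one block
        ++buckets[h % tableSize];
    }
    double mean = double(trials) / tableSize, chi2 = 0.0;
    for (long b : buckets)
        chi2 += (b - mean) * (b - mean) / mean;
    // for a uniform hash, chi2 should land near tableSize - 1 (the degrees of freedom)
    std::printf("chi^2 = %.1f (expect ~%zu)\n", chi2, tableSize - 1);
}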
http://www.partow.net/programming/hashfunctions/
Here are a few different hash functions from experts in various fields. The functions are designed for 8-bit values, but I am sure you can extend them to your case. I don't know which to suggest, but I think any of them should work better than your current idea.
The problem with the approach you propose is that the values are cyclic in the field 2^n: if you take mod 64 at the end, for example, you throw most of the mixing away, and only the few values whose shifted bits land in the low 6 bits contribute to the final result.
Despite your scepticism I would just shove them through a standard hash function.
If they are well randomised (and relatively independent - you don't say) to begin with, you probably don't need to do too much work. Fowler-Noll-Vo (FNV) is a good candidate in these circumstances.
FNV operates on a series of 8-bit inputs, and your input is (logically) 4-bit.
I would start without even bothering to pack them 'two by two' as you describe.
If you feel like trying that, just logically pad odd-length series with the message length (reduced to a 4-bit value, obviously).
I wouldn't expect that packing to improve the hash. It may save you a tiny number of cycles because it swaps a relatively expensive * for a << and a |.
Try both and report back!
Here are implementations of packed and 'normal' versions of FNV1a in C:
#include <inttypes.h>
#include <stddef.h>

static const uint32_t sFNVOffsetBasis = 2166136261u;
static const uint32_t sFNVPrime = 16777619u;

/* Packs pairs of 4-bit values into single bytes before hashing; an odd
   trailing value is paired with the (bias-adjusted) message length. */
uint32_t FNV1aPacked4Bit(const uint8_t *const pBytes, const size_t pSize) {
    uint32_t rHash = sFNVOffsetBasis;
    size_t i;
    for (i = 0; i + 1 < pSize; i += 2) {
        rHash = rHash ^ (pBytes[i] | (pBytes[i + 1] << 4));
        rHash = rHash * sFNVPrime;
    }
    if (pSize % 2) { /* Length is odd: the loop skipped the last element. */
        rHash = rHash ^ (pBytes[pSize - 1] | ((pSize & 0x1E) << 3));
        rHash = rHash * sFNVPrime;
    }
    return rHash;
}

uint32_t FNV1a(const uint8_t *const pBytes, const size_t pSize) {
    uint32_t rHash = sFNVOffsetBasis;
    for (size_t i = 0; i < pSize; ++i) {
        rHash = (rHash ^ pBytes[i]) * sFNVPrime;
    }
    return rHash;
}
NB: I've edited the packed version to skip the bottom bit when mixing in the length. Obviously the bottom bit of an odd length is 100% biased to 1. I don't know how the length is distributed; it may be wiser to mix it in at the start rather than the end.

Use cases of std::multimap

I don't quite get the purpose of this data structure. What's the difference between std::multimap<K, V> and std::map<K, std::vector<V>>? The same goes for std::multiset: it could just be std::map<K, int>, where the int counts the number of occurrences of K. Am I missing something about the uses of these structures?
A counter-example seems to be in order.
Consider a PhoneEntry in an AddressList grouped by name.
struct AddressListCompare {
    bool operator()(const PhoneEntry& p1, const PhoneEntry& p2) const {
        return p1.name < p2.name;
    }
};

std::multiset<PhoneEntry, AddressListCompare> addressList;

addressList.insert(PhoneEntry("Cpt.G", "123-456", "Cellular"));
addressList.insert(PhoneEntry("Cpt.G", "234-567", "Work"));

// Getting the entries
addressList.equal_range(PhoneEntry("Cpt.G")); // All numbers
This would not be feasible with a set+count. Your Object+count approach seems to be faster if this behavior is not required. For instance the multiset::count() member states
"Complexity: logarithmic in size +
linear in count."
You could use make the substitutions that you suggest, and extract similar behavior. But the interfaces would be very different than when dealing with regular standard containers. A major design theme of these containers is that they share as much interface as possible, making them as interchangeable as possible so that the appropriate container can be chosen without having to change the code that uses it.
For instance, std::map<K, std::vector<V>> would have iterators that dereference to std::pair<K, std::vector<V>> instead of std::pair<K, V>. std::map<K, std::vector<V>>::count() wouldn't return the correct result, failing to account for the duplicates in the vector. Of course you could change your code to do the extra steps needed to correct for this, but now you are interfacing with the container in a much different way. You can't later drop in unordered_map or some other map implementation to see if it performs better.
In a broader sense, you are breaking the container abstraction by handling container implementation details in your code rather than having a container that handles its own business.
It's entirely possible that your compiler's implementation of std::multimap is really just a wrapper around std::map<K, std::vector<V>>. Or it might not be. It could be more efficient and friendly to object pool allocation (which vectors are not).
Using std::map<K, int> instead of std::multiset is the same case: count() would not return the expected value, iterators would not iterate over the duplicates, and they would dereference to std::pair<K, int> instead of directly to K.
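A small illustration of this interface difference (my example, not from the original answer):

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    std::multimap<std::string, std::string> mm;
    mm.insert({"Cpt.G", "123-456"});
    mm.insert({"Cpt.G", "234-567"});
    std::cout << mm.count("Cpt.G") << '\n';  // 2: duplicates are real elements

    std::map<std::string, std::vector<std::string>> mv;
    mv["Cpt.G"].push_back("123-456");
    mv["Cpt.G"].push_back("234-567");
    std::cout << mv.count("Cpt.G") << '\n';  // 1: duplicates hide inside the vector
}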
A multimap or multiset allows you to have elements with duplicate keys.
i.e. a set is an unordered group of elements that are all unique, in the sense that {A,B,C} == {B,C,A}

Hashing of pointer values

Sometimes you need to take a hash of a pointer; not the object the pointer points to, but the pointer itself. A lot of the time, folks just punt and use the pointer value as an integer, chop off some high bits to make it fit, maybe shift out known-zero bits at the bottom. Thing is, pointer values aren't necessarily well-distributed across the address space; in fact, if your allocator is doing its job, there's an excellent chance they're all clustered close together.
So, my question is, has anyone developed hash functions that are good for this? Take a 32- or 64-bit value that's maybe got 12 bits of entropy in it somewhere and spread it evenly across a 32-bit number space.
This page lists several methods that might be of use. One of them, due to Knuth, is as simple as multiplying (in 32 bits) by 2654435761, but "Bad hash results are produced if the keys vary in the upper bits." In the case of pointers, that's a rare enough situation.
Here are some more algorithms, including performance tests.
It seems that the magic words are "integer hashing".
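For reference, a sketch of the Knuth multiplicative scheme mentioned above (the bucketBits parameter and the truncation to 32 bits are my assumptions):

#include <cstdint>

// Multiply by 2654435761 (~2^32 / golden ratio) and keep the high bits,
// which are the best-mixed, as the bucket index.
inline uint32_t knuth_pointer_hash(const void* p, unsigned bucketBits) {
    auto v = static_cast<uint32_t>(reinterpret_cast<std::uintptr_t>(p));
    uint32_t h = v * 2654435761u;
    return h >> (32 - bucketBits);  // index into 2^bucketBits buckets
}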
They'll likely exhibit locality, yes - but in the lower bits, which means objects will be distributed through the hashtable. You'll only see collisions if a pointer's address is a multiple of the hashtable's length from another pointer.
If you know the lowest possible pointer address (which is often the case if you're working within a large buffer), just convert the pointer to an integer by subtracting the lowest possible pointer value; eg. that could be the buffer's base address.
Remember: a pointer subtracted from a pointer yields an offset (an integer).
So don't "chop off" bits; it's much better to convert the pointer to an offset.
This results in an offset value much smaller than the pointer value.
It may also help to shift the offset right by two (i.e. divide by 4) in some cases before hashing it.
The problem with pointers is often that small blocks of memory are likely to be allocated at the same address (e.g. a block is freed and another block takes its place).
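A sketch of the offset idea (assuming all nodes live in one contiguous buffer; the names are placeholders):

#include <cstddef>

struct Node { /* ... */ };

// Hash the element index into the buffer rather than the raw pointer value.
inline std::size_t node_offset_hash(const Node* p, const Node* base) {
    return static_cast<std::size_t>(p - base);  // small, dense values
}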
Why not just use an existing hash function?

What is a good Hash Function?

What is a good Hash function? I saw a lot of hash function and applications in my data structures courses in college, but I mostly got that it's pretty hard to make a good hash function. As a rule of thumb to avoid collisions my professor said that:
function Hash(key)
    return key mod PrimeNumber
end
(mod is the % operator in C and similar languages)
with the prime number being the size of the hash table. I get that this is a somewhat good function for avoiding collisions, and a fast one, but how can I make a better one? Are there better hash functions for string keys than for numeric keys?
There's no such thing as a "good hash function" for universal hashes (ed. yes, I know there's such a thing as "universal hashing" but that's not what I meant). Depending on the context, different criteria determine the quality of a hash. Two people already mentioned SHA. This is a cryptographic hash, and it isn't at all good for hash tables, which are probably what you mean.
Hash tables have very different requirements. But still, finding a good hash function universally is hard because different data types expose different information that can be hashed. As a rule of thumb it is good to consider all the information a type holds equally. This is not always easy, or even possible. For reasons of statistics (and hence collisions), it is also important to generate a good spread over the problem space, i.e. all possible objects. This means that when hashing numbers between 100 and 1050, it's no good to let the most significant digit play a big part in the hash, because for ~90% of the objects this digit will be 0. It's far more important to let the last three digits determine the hash.
Similarly, when hashing strings it's important to consider all characters – except when it's known in advance that the first three characters of all strings will be the same; considering these then is a waste.
This is actually one of the cases where I advise to read what Knuth has to say in The Art of Computer Programming, vol. 3. Another good read is Julienne Walker's The Art of Hashing.
For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.
http://www.azillionmonkeys.com/qed/hash.html
If you care about cryptographically secure or anything else more advanced, then YMMV. If you just want a kick ass general purpose hash function for a hash table lookup, then this is what you're looking for.
There are two major purposes of hashing functions:
to disperse data points uniformly into n bits.
to securely identify the input data.
It's impossible to recommend a hash without knowing what you're using it for.
If you're just making a hash table in a program, then you don't need to worry about how reversible or hackable the algorithm is... SHA-1 or AES is completely unnecessary for this, you'd be better off using a variation of FNV. FNV achieves better dispersion (and thus fewer collisions) than a simple prime mod like you mentioned, and it's more adaptable to varying input sizes.
If you're using the hashes to hide and authenticate public information (such as hashing a password, or a document), then you should use one of the major hashing algorithms vetted by public scrutiny. The Hash Function Lounge is a good place to start.
This is an example of a good one and also an example of why you would never want to write one.
It is a Fowler / Noll / Vo (FNV) Hash which is equal parts computer science genius and pure voodoo:
unsigned fnv_hash_1a_32 ( void *key, int len ) {
    unsigned char *p = key;
    unsigned h = 0x811c9dc5; /* 32-bit FNV offset basis */
    int i;
    for ( i = 0; i < len; i++ )
        h = ( h ^ p[i] ) * 0x01000193; /* 32-bit FNV prime */
    return h;
}

unsigned long long fnv_hash_1a_64 ( void *key, int len ) {
    unsigned char *p = key;
    unsigned long long h = 0xcbf29ce484222325ULL; /* 64-bit FNV offset basis */
    int i;
    for ( i = 0; i < len; i++ )
        h = ( h ^ p[i] ) * 0x100000001b3ULL; /* 64-bit FNV prime */
    return h;
}
Edit:
Landon Curt Noll recommends on his site the FNV-1a algorithm over the original FNV-1 algorithm: the improved algorithm better disperses the last byte into the hash. I adjusted the code accordingly.
I'd say that the main rule of thumb is not to roll your own. Try to use something that has been thoroughly tested, e.g., SHA-1 or something along those lines.
A good hash function has the following properties:
Given a hash of a message it is computationally infeasible for an attacker to find another message such that their hashes are identical.
It is computationally infeasible to find a pair of messages, m and m', such that h(m) = h(m').
The two cases are not the same. In the first case, there is a pre-existing hash that you're trying to find a collision for. In the second case, you're trying to find any two messages that collide. The second task is significantly easier due to the birthday "paradox": for an n-bit hash, a collision can be found in roughly 2^(n/2) attempts, rather than the 2^n needed to match a specific hash.
Where performance is not that great an issue, you should always use a secure hash function. There are very clever attacks that can be performed by forcing collisions in a hash. If you use something strong from the outset, you'll secure yourself against these.
Don't use MD5 or SHA-1 in new designs. Most cryptographers, me included, would consider them broken. The principal source of weakness in both of these designs is that the second property, which I outlined above, does not hold for these constructions. If an attacker can generate two messages, m and m', that both hash to the same value, they can use these messages against you. SHA-1 and MD5 also suffer from message-extension attacks, which can fatally weaken your application if you're not careful.
A more modern hash such as Whirlpool is a better choice. It does not suffer from these message-extension attacks, and it uses the same mathematics AES uses to prove security against a variety of attacks.
Hope that helps!
What you're saying here is that you want one that has collision resistance. Try using SHA-2. Or try using a (good) block cipher in a one-way compression function (I've never tried that before), like AES in Miyaguchi-Preneel mode. The problem with that is that you need to:
1) have an IV. Try using the first 256 bits of the fractional parts of Khinchin's constant or something like that.
2) have a padding scheme. Easy. Borrow it from a hash like MD5 or SHA-3 (Keccak [pronounced 'ket-chak']).
If you don't care about security (a few others have said this), look at FNV or lookup2 by Bob Jenkins (actually I'm the first one to recommend lookup2). Also try MurmurHash; it's fast (check this: 0.16 cpb).
A good hash function should
be bijective where possible, so as not to lose information, and have the fewest collisions
cascade as much and as evenly as possible, i.e. each input bit should flip every output bit with probability 0.5 and without obvious patterns.
if used in a cryptographic context there should not exist an efficient way to invert it.
A prime number modulus does not satisfy any of these points. It is simply insufficient. It is often better than nothing, but it's not even fast. Multiplying by an unsigned integer and taking a power-of-two modulus distributes the values just as well (that is, not well at all), but with only about 2 CPU cycles it is much faster than the 15 to 40 cycles a prime modulus will take (yes, integer division really is that slow).
To create a hash function that is fast and distributes the values well the best option is to compose it from fast permutations with lesser qualities like they did with PCG for random number generation.
Useful permutations, among others, are:
multiplication with an uneven integer
binary rotations
xorshift
Following this recipe we can create our own hash function, or we can take splitmix, which is tested and well accepted.
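For example, the splitmix64 finalizer is exactly such a composition of xorshifts and odd-constant multiplications (a sketch using the published constants, not the full generator):

#include <cstdint>

inline uint64_t splitmix64_mix(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;  // xorshift, then odd multiply
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;  // again, with different constants
    x ^= x >> 31;                              // final xorshift
    return x;
}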
If cryptographic qualities are needed, I would highly recommend using a function from the SHA family, which is well tested and standardised, but for educational purposes this is how you would make one:
First you take a good non-cryptographic hash function, then you apply a one-way function like exponentiation on a prime field, or k applications of (n*(n+1)/2) mod 2^k interspersed with an xorshift, where k is the number of bits in the resulting hash.
I highly recommend the SMhasher GitHub project https://github.com/rurban/smhasher which is a test suite for hash functions. The fastest state-of-the-art non-cryptographic hash functions without known quality problems are listed here: https://github.com/rurban/smhasher#summary.
Different application scenarios have different design requirements for hash algorithms, but a good hash function should have the following three points:
Collision Resistance: try to avoid conflicts. If it is difficult to find two inputs that hash to the same output, the hash function is collision-resistant.
Tamper Resistance: if even one byte of the input is changed, the hash value should be very different.
Computational Efficiency: computing the hash should be fast; a hash table is a structure that trades space for time.
In 2022, we can choose the SHA-2 family for secure hashing; SHA-3 is safer but has a greater performance cost. A safer approach still is to add a salt and combine hash functions.
