How does Go calculate a hash value for keys in a map?

How does Go calculate a hash for keys in a map? Is it truly unique and is it available for use in other structures?
I imagine it's easy for primitive keys like int or immutable string but it seems nontrivial for composite structures.

The language spec doesn't say, which means that it's free to change at any time, or differ between implementations.
The hash algorithm varies somewhat between types and platforms. As of now:
On x86 (32- or 64-bit), if the CPU supports AES instructions, the runtime uses aeshash, a hash built on AES primitives; otherwise it uses a function "inspired by" xxHash and cityhash, but different from either. There are different variants for 32-bit and 64-bit systems.
Most types use a simple hash of their memory contents, but floating-point types have code to ensure that 0 and -0 hash equally (since they compare equally) and that NaNs hash randomly (since two NaNs are never equal). Since complex types are built from floats, their hashes are composed from the hashes of their two floating-point parts. An interface's hash is the hash of the value stored in the interface, not of the interface header itself.
All of this stuff is in private functions, so no, you can't access Go's internal hash for a value in your own code.

The Go map implementation uses a hash called aeshash. It's not AES, but it uses the aesenc assembly instruction to compute hashes. This hash isn't exported for use in the standard library.
The hash itself is written in assembly, and can be found in the runtime package source.

Since Go 1.14, the Go standard library provides the hash/maphash package. The hash functions in this package aren't guaranteed to be the same ones used by Go maps (though it appears that they are, which makes sense); they are guaranteed to be good functions for implementing hashmaps and the like.
hash/maphash only operates on strings or byte slices, so it's still up to you to figure out how to serialize a composite data structure into bytes for hashing purposes.
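For illustration, here is a minimal sketch of using hash/maphash on a composite value by serializing its fields manually; the point type and the little-endian field encoding are assumptions for the example, not anything the package prescribes:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/maphash"
)

type point struct{ X, Y int32 }

// hashPoint serializes the struct's fields into a byte buffer and
// feeds it to maphash. Hashes are only comparable when produced
// with the same Seed.
func hashPoint(seed maphash.Seed, p point) uint64 {
	var h maphash.Hash
	h.SetSeed(seed)
	var buf [8]byte
	binary.LittleEndian.PutUint32(buf[0:4], uint32(p.X))
	binary.LittleEndian.PutUint32(buf[4:8], uint32(p.Y))
	h.Write(buf[:])
	return h.Sum64()
}

func main() {
	seed := maphash.MakeSeed()
	fmt.Println(hashPoint(seed, point{1, 2}) == hashPoint(seed, point{1, 2})) // true
	fmt.Println(hashPoint(seed, point{1, 2}) == hashPoint(seed, point{2, 1})) // almost certainly false
}
```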

Related

Using unordered_map with key only to store pointers (dismiss value)

I'm implementing an algorithm that checks nodes in a mesh for a certain value. To keep track of which nodes I have already checked, I'd like to use an unordered_map with a pointer to the node as the key. I can then simply use umap.find(pointer) to see whether the node was already checked and skip it, accomplishing the whole pass in O(n) time.
However, I don't actually need to store a value in the map; the key itself is enough information. Is std::unordered_map even the right solution then? If so, what should I put in the "value" field to maximize performance? I'm on a 32-bit embedded system, so I thought of just putting uint32_t or uint_fast32_t there.
tl;dr:
Is std::unordered_map the right tool to store keys without values?
Will the native hash function work well for pointers? Or would you suggest a different hashing algorithm?
What do I put as "value" for the map if using std::unordered_map to optimize for performance?
Is std::unordered_map the right tool to store keys without values?
I would use a std::unordered_set in these situations.
Will the native hash function work well for pointers?
Yes. It is most likely just a cast from pointer to std::size_t.
What do I put as "value" for the map if using std::unordered_map to optimize for performance?
If you use a std::unordered_set instead, there is no value, only the pointers.
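For comparison with the Go question above, the same pattern in Go is a map keyed by pointer with zero-size struct{} values, the idiomatic stand-in for a set; Node is a hypothetical type for the sketch:

```go
package main

import "fmt"

// Node is a stand-in for the mesh node type from the question.
type Node struct{ Value int }

func main() {
	// struct{} occupies zero bytes, so no per-entry value is stored.
	checked := make(map[*Node]struct{})

	n := &Node{Value: 7}
	checked[n] = struct{}{} // mark the node as visited

	if _, ok := checked[n]; ok {
		fmt.Println("already checked, skipping")
	}
}
```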
Is std::unordered_map the right tool to store keys without values?
No - std::unordered_set is the one to use when you don't have distinct keys and values.
Will the native hash function work well for pointers? Or would you suggest a different hashing algorithm?
The "native" compiler-supplied hash function probably casts the pointer value to size_t - a kind of identity hash. That may or may not work well depending on the compromises your Standard Library has chosen. GCC and clang use prime numbers of buckets in the hash table, so it will work fine. Visual C++ (and many non-Standard hash table implementations) use powers of two (i.e. 128, 256, 512...). Powers of two are used because it's very fast to map them on to buckets - just AND with a bitwise mask (127, 255, 511) to retain however-many less-significant bits you need. The problem with doing that with pointers is that often the pointed-to objects have some alignment, so they may all be multiples of e.g. 4 or 8. A multiple of 8 always has the three least significant bits set to 0: those bits don't contribute to the randomised placement of the value in a bucket. Instead, only every 8th bucket will receive any share of the elements being hashed. If you have an implementation like this, then you're probably better off using a better hash function. At the least, you could say bit-shift the pointer values right by enough to remove the known zeros.
What do I put as "value" for the map if using std::unordered_map to optimize for performance?
Again, you should use an std::unordered_set, so you don't have to worry about a value.

how to add a signature to protobuf messages?

Is there a common way to sign protobuf messages? What I can imagine is to add a data field and a signature field to a message, use SerializeToArray (in C++) or ToByteArray (in C#) to get the raw bytes, calculate a hash with MD5 or SHA-256 etc., and assign the hash value to the 'sign' field. But I don't know whether the raw bytes differ between languages, or between proto2 and proto3.
The approach you describe is fine for integrity-validation purposes, as long as your hashing algorithm is strong enough. If you need anything stronger than an integrity checksum, you should use a true cryptographic signature (with public and private keys), as anyone can otherwise hash their own arbitrary payload, defeating the point.
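Here is a hedged sketch of the signature approach in Go, with a placeholder byte slice standing in for the serializer's output; real code would sign the actual bytes returned by SerializeToArray/ToByteArray:

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

func main() {
	// Placeholder for the serialized "data" portion of the message.
	payload := []byte("serialized protobuf bytes go here")

	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}

	// Sign the serialized bytes and put the result in the message's
	// signature field before sending.
	sig := ed25519.Sign(priv, payload)

	// The receiver extracts the data bytes and verifies with the
	// public key; unlike a bare hash, this can't be forged without
	// the private key.
	fmt.Println(ed25519.Verify(pub, payload, sig)) // true
}
```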
You also seem to be asking about determinism. The raw bytes in protobuf are not entirely deterministic; there are multiple valid ways of representing the same payload, including:
reordering fields (numerical order is a "should", not a "must")
including or omitting zeros (different between proto2 and proto3)
packed vs sequential "repeated" encoding
the reality that "map" is usually backed by some platform-specific inbuilt map/dictionary type, which commonly do not define order, so in theory it can vary every time
not really an issue in practice, but in theory you can encode a varint with an arbitrary length (up to 10 bytes) simply by including unnecessary groups of zero bytes, similar to saying in text (JSON, etc.) that 42, 042, 0042 and 0000000042 all represent the same integer; nobody does that, but it would be valid
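If you do need stable bytes to hash or sign, some runtimes offer a deterministic-marshaling mode that at least pins down map ordering within a single library version. As an example, the Go protobuf runtime has a Deterministic option; note its documentation warns the output is still not canonical across languages or releases, so treat this as a per-stack mitigation rather than a guarantee:

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	// structpb.Struct is backed by a map, so its wire order is
	// normally unspecified; it stands in for any map-bearing message.
	msg, err := structpb.NewStruct(map[string]interface{}{
		"b": 2.0, "a": 1.0, "c": 3.0,
	})
	if err != nil {
		panic(err)
	}

	// Deterministic marshaling sorts map entries, giving repeatable
	// bytes from this binary, suitable for hashing or signing.
	opts := proto.MarshalOptions{Deterministic: true}
	b1, _ := opts.Marshal(msg)
	b2, _ := opts.Marshal(msg)
	fmt.Println(string(b1) == string(b2)) // true
}
```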

Algorithm to hash a string to a dynamic number of characters

I'm looking for a way to hash a string to a dynamic number of characters. I don't want to trim an existing hash (such as SHA), but to generate a hash for which you can specify the number of output characters. It should also work if the input is shorter than the requested output. It doesn't need to be cryptographic; it only needs to guarantee the same hash for the same input. I've been going through the hash functions on Wikipedia, but they all seem to have either a fixed length or a length that depends on the input.
What you are looking for might be extendable-output functions (XOFs)!
Those hash functions don't have a predefined output length and may use sponge functions in their construction.
The SHA-3 family consists of four cryptographic hash functions, [...], and two extendable-output functions (XOFs), called SHAKE128 and SHAKE256.
You can try out both at https://emn178.github.io/online-tools/. For the output bits, choose your desired number of characters.
For a Java implementation, see the Bouncy Castle crypto library, which supports both algorithms: https://www.bouncycastle.org/specifications.html
But be aware of collisions if the hash length is too small.
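To make the XOF idea concrete, here is a small sketch using the SHAKE128 implementation from golang.org/x/crypto/sha3, requesting several output lengths from the same input:

```go
package main

import (
	"fmt"

	"golang.org/x/crypto/sha3"
)

func main() {
	input := []byte("hello")

	// SHAKE128 is an XOF: you get exactly as many output bytes as
	// you size the destination slice for.
	for _, n := range []int{4, 16, 64} {
		out := make([]byte, n)
		sha3.ShakeSum128(out, input)
		fmt.Printf("%2d bytes: %x\n", n, out)
	}
}
```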

why is hash output fixed in length?

Hash functions always produce a fixed-length output regardless of the input (e.g. MD5 >> 128 bits, SHA-256 >> 256 bits), but why?
I know that is how the designers designed them, but why did they make the output the same length every time?
So that it can be stored in a consistent fashion? To make comparison easier? To keep things less complicated?
Because that is what the definition of a hash is. Refer to Wikipedia:
A hash function is any function that can be used to map digital data of arbitrary size to digital data of fixed size.
If your question relates to why it is useful for a hash to be a fixed size there are multiple reasons (non-exhaustive list):
Hashes typically encode a larger (often arbitrary size) input into a smaller size, generally in a lossy way, i.e. unlike compression functions, you cannot reconstruct the input from the hash value by "reversing" the process.
Having a fixed size output is convenient, especially for hashes designed to be used as a lookup key.
You can predictably (pre)allocate storage for hash values and index them in a contiguous memory segment such as an array.
For hashes of "native word sizes", e.g. 16, 32 and 64 bit integer values, you can do very fast equality and ordering comparisons.
Any algorithm working with hash values can use a single set of fixed size operations for generating and handling them.
You can predictably combine hashes produced with different hash functions in e.g. a bloom filter.
You don't need to waste any space to encode how big the hash value is.
There do exist special hash functions that are capable of producing an output hash of an arbitrary, caller-specified length, such as so-called sponge functions.
As you can see, fixed-size output is part of the standard definition. Truncation to a different length is also covered by the standard:
Some applications may require a hash function with a message digest length different than those provided by the hash functions in this Standard. In such cases, a truncated message digest may be used, whereby a hash function with a larger message digest length is applied to the data to be hashed, and the resulting message digest is truncated by selecting an appropriate number of the leftmost bits.
Often it's because you want to use the hash value, or some part of it, to quickly store and look up values in a fixed-size array. (This is how a non-resizable hashtable works, for example.)
And why use a fixed-size array instead of some other, growable data structure (like a linked list or binary tree)? Because accessing them tends to be both theoretically and practically fast: provided that the hash function is good and the fraction of occupied table entries isn't too high, you get O(1) lookups (vs. O(log n) lookups for tree-based data structures or O(n) for lists) on average. And these accesses are fast in practice: after calculating the hash, which usually takes linear time in the size of the key with a low hidden constant, there's often just a bit shift, a bit mask and one or two indirect memory accesses into a contiguous block of memory that (a) makes good use of cache and (b) pipelines well on modern CPUs because few pointer indirections are needed.
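A toy version of that lookup path, sketched in Go with FNV-1a (chosen only because it is in the standard library): a fixed 64-bit hash is masked into a power-of-two table with a single AND:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

func main() {
	const tableSize = 64 // power of two, so a mask replaces the modulo

	for _, key := range []string{"alpha", "beta", "gamma"} {
		h := fnv.New64a()
		h.Write([]byte(key))

		// Fixed-size hash -> bucket index: one AND, no division.
		bucket := h.Sum64() & (tableSize - 1)
		fmt.Printf("%-6s -> bucket %d\n", key, bucket)
	}
}
```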

Hashtables/Dictionaries that use floats/doubles

I read somewhere about data structures similar to hashtables and dictionaries, but instead of using ints they were using floats/doubles, etc.
Does anyone know what they are?
If you mean using floats/doubles as keys in your hash, that's easy. For example, in .NET, it's just using Dictionary<double,MyValueType>.
If you're talking about having the hash be based off a double instead of an int....
Technically, you can have any element as your internal hash. Normally, this is done using an int or long, since these are fast, and the hashing algorithm is easy to compute.
However, the hash is really just a BitArray at heart, so anything would work. There really isn't much advantage to making it something other than an int or long, other than potentially allowing a larger set of hash values (i.e., if you go to an 8-byte or larger type for your hash).
You mean as keys? That strikes me as tricky.
If you're using them as arbitrary keys, they're no better than integers.
If you expect to calculate a floating-point value and use it to look something up in a hash table, you're living very dangerously. Floating point numbers do not have infinite precision, and calculating the same thing in two slightly different ways can result in very tiny differences in the result. Hash keys rely on getting the exact same thing every time, so you'd have to be careful to round, and round in exactly the same way at all times. This is trickier than it sounds, by the way.
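Go's maps make the extreme case concrete: NaN keys compare unequal to themselves and hash randomly (as noted in the first answer above), so a NaN-keyed entry can never be found again:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	m := make(map[float64]int)

	// Each insert with a NaN key creates a fresh, unreachable entry,
	// because NaN != NaN and NaNs hash randomly.
	m[math.NaN()] = 1
	m[math.NaN()] = 2
	fmt.Println(len(m)) // 2

	_, ok := m[math.NaN()]
	fmt.Println(ok) // false: the entries can never be looked up again
}
```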
So, what would you do with floating-point hashes?
A hash algorithm is, in general terms, just a function that produces a smaller output from a larger input. Good hash functions have interesting properties like a large change in output for a small change in the input, and an assurance that they produce every possible output value for some input.
It's not hard to write a simple polynomial type hash function that outputs a floating-point value, rather than an integer value, but it's difficult to ensure that the resulting hash function has the desired properties without getting into the details of the particular floating-point representation used.
At least part of the reason that hash functions are nearly always implemented in integer arithmetic is because proving various properties about an integer calculation is easier than doing the same for a floating point calculation.
It's fairly easy to prove that some (sum of prime factors) modulo (another prime) must, necessarily, produce every possible output for some input. Doing the same for a calculation with a bunch of floating-point fractions would be a drag.
Add to that the relative difficulty of storing and transmitting floating-point values without corruption, and it's just not worth it.
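For contrast, the kind of simple integer polynomial hash alluded to above takes only a few lines; a sketch in Go, with an arbitrarily chosen base and prime modulus:

```go
package main

import "fmt"

// polyHash is a classic base-31 polynomial hash reduced modulo a
// prime, computed entirely in integer arithmetic.
func polyHash(s string) uint32 {
	const mod = 1_000_000_007 // an arbitrary large prime
	var h uint64
	for i := 0; i < len(s); i++ {
		h = (h*31 + uint64(s[i])) % mod
	}
	return uint32(h)
}

func main() {
	fmt.Println(polyHash("hello"), polyHash("world"))
}
```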
Your question history shows that you use .NET, so I'll answer in that context.
If you want a dictionary that is type-aware, such that you can specify that it should use floats or doubles for the keys or values, use System.Collections.Generic.Dictionary<T, U>: http://msdn.microsoft.com/en-us/library/xfhwa508.aspx
If you want a dictionary that is type-blind, such that you can mix floats and doubles for keys and values, use System.Collections.Hashtable: http://msdn.microsoft.com/en-us/library/system.collections.hashtable.aspx
