Determining Perfect Hash Lookup Table for Pearson Hash - algorithm

I'm developing a programming language in which I'm storing objects as hash tables. The hash function I'm using is Pearson hashing, which depends on a 256-byte lookup table (a permutation of the values 0 through 255). Here's the function:
char pearson(const char* name, const unsigned char* lookup)
{
    unsigned char index = '\0';
    while(*name)
    {
        index = lookup[index ^ (unsigned char)*name];
        name++;
    }
    return index;
}
My question is: given a fixed set of fewer than 256 member names, how can one determine a lookup table such that pearson() returns unique values in a contiguous range starting at '\0'? In other words, I need an algorithm to create a lookup table for a perfect hash. This would let objects take up no more space than the number of their members. The table will be built at compile time, so speed isn't a huge concern, but faster would be better. It would be easy to brute force this, but I think (hope) there's a better way.
Here's an example: given member variables 'foo', 'bar', and 'baz' in a class, I want to determine a lookup such that:
pearson('foo',lookup) == (char) 0
pearson('bar',lookup) == (char) 1
pearson('baz',lookup) == (char) 2
Note that the order doesn't matter, so the following result would also be acceptable:
pearson('foo',lookup) == (char) 2
pearson('bar',lookup) == (char) 0
pearson('baz',lookup) == (char) 1
In an ideal world, all names that aren't in the table would return a value greater than 2, because this would let me skip a check and possibly even avoid storing the member names. I don't think that's possible, though, so I'll have to add an extra check for table membership. Given this, it would probably save time not to initialize lookup-table entries that aren't used: collisions don't matter, because if a name collides and fails the check, it isn't in the object at all, so the collision never needs to be resolved; only the error needs to be handled.
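For concreteness, the brute force mentioned above could be sketched as below. This is a minimal sketch only: find_table and its attempt cap are hypothetical, the 32-bit mask limits it to at most 32 names, and the expected number of attempts grows very quickly with the number of names, as the first answer explains.

#include <stdlib.h>

/* pearson() as defined in the question, repeated so this compiles alone */
static unsigned char pearson(const char *name, const unsigned char *lookup)
{
    unsigned char index = 0;
    while (*name)
        index = lookup[index ^ (unsigned char)*name++];
    return index;
}

/* Fisher-Yates shuffle of the 256-entry permutation */
static void shuffle(unsigned char *t)
{
    for (int i = 255; i > 0; i--) {
        int j = rand() % (i + 1);
        unsigned char tmp = t[i];
        t[i] = t[j];
        t[j] = tmp;
    }
}

/* Randomly permute the table until every name hashes to a distinct value
 * below n. Returns 1 on success, 0 if it gives up. n must be <= 32 here,
 * because a 32-bit mask records which hash values are already taken. */
int find_table(const char **names, int n, unsigned char *lookup)
{
    for (int i = 0; i < 256; i++)
        lookup[i] = (unsigned char)i;

    for (long attempt = 0; attempt < 100000000L; attempt++) {
        shuffle(lookup);
        unsigned seen = 0;
        int ok = 1;
        for (int i = 0; i < n && ok; i++) {
            unsigned char h = pearson(names[i], lookup);
            if (h >= n || (seen & (1u << h)))
                ok = 0;           /* out of range, or value already taken */
            else
                seen |= 1u << h;
        }
        if (ok)
            return 1;             /* lookup now realizes a minimal perfect hash */
    }
    return 0;
}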

I strongly doubt that you will be able to find a solution with brute force if the number of member names is too high. Thanks to the birthday paradox, the probability that no collision exists (i.e., that no two hashes are the same) is approximately 1:5000 for 64 member names and 1:850,000,000 for 96. Given the structure of your hash function (it's derived from a cryptographic construction that is designed to "mix" things well), I don't expect that an algorithm exists that solves your problem (but I would definitely be interested in such a beast).
Your ideal world is an illusion (as you expected): there are 256 characters you could append to 'foo', and no two of them give the new word the same hash. Since there are only 256 possible hash values, those extensions cover every value, so you can always append a character to 'foo' such that its hash equals any chosen value, including the hashes of 'foo', 'bar' or 'baz'.
Why don't you use an existing library like CMPH?

If I understand you correctly, what you need is a sorted, duplicate-free array that you can binary search. If the key is in the array, its index is the "hash"; otherwise, you get the size of the array. It is O(log n) per lookup compared to the table's O(1), but that is good enough for a small number of elements - at most 256 in your case.
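A minimal sketch of that idea in C (member_index is a hypothetical helper; the names array is assumed to be sorted ahead of time):

#include <string.h>

/* Returns the index of name in the sorted array names[0..n-1], or n if
 * absent. The index serves as the "hash"; membership comes for free. */
int member_index(const char *name, const char **names, int n)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, names[mid]);
        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return n;   /* not a member */
}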


Ruby: Help improving hashing algorithm

I am still relatively new to Ruby as a language, but I know there are a lot of convenience methods built into it. I am trying to generate a "hash" to check against in a low-level block-chain verifier, and I am wondering if there are any convenience methods I could use to make this hashing algorithm more efficient. I think I can make it more efficient by utilizing Ruby's max integer size, but I'm not sure.
Below is the current code which takes in a string to hash, unpacks it into an array of UTF-8 values, does computationally intensive math to each one of those values, adds up all of those values after the math is done to them, takes that value modulo 65,536, and then returns the hex representation of that value.
def generate_hash(string)
  unpacked_string = string.unpack('U*')
  sum = 0
  unpacked_string.each do |x|
    sum += (x**2000) * ((x + 2)**21) - ((x + 5)**3)
  end
  new_val = sum % 65_536 # Gives a number from 0 to 65,535
  new_val.to_s(16)
end
On very large block-chains there is a very large performance hit which I am trying to get around. Any help would be great!
First and foremost, it is extremely unlikely that you are going to create anything that is more efficient than simply using String#hash. This is a case of you trying to build a better mousetrap.
Honestly, your hashing algorithm is very inefficient. The entire point of a hash is to be a fast, low-overhead way of quickly getting a "unique" (as unique as possible) integer to represent any object to avoid comparing by values.
Using that as a premise, if you start doing any type of intense computation in a hash algorithm, it is already counter-productive. Once you start implementing modulo and pow functions, it is inefficient.
Usually best practice involves taking a value(s) of the object that can be represented as integers, and performing bit operations on them, typically with prime numbers to help reduce hash collisions.
def hash
  h = value1 ^ 393
  h += value2 ^ 17
  h
end
In your example, you are for some reason forcing the hash to the max value of a 16-bit unsigned integer, when typically 32 bits are used - although if you are comparing on the Ruby side, this would be 31 bits due to how Ruby masks Fixnum values. Fixnum was deprecated on the Ruby side, as it should have been, but internally the same threshold still exists between how Fixnum and Bignum values are handled. The Integer class simply provides one interface on the Ruby side, as those two really should never have been exposed outside of the C code.
In your specific example using strings, I would simply symbolize them. This gives a quick and efficient way to determine whether two strings are equal with hardly any overhead, and comparing two symbols is exactly like comparing two integers. There is a caveat to this method if you are comparing a vast number of strings: once a symbol is created, it lives for the life of the program. Any further string equal to it will return the same symbol, but you cannot reclaim the symbol's memory (just a few bytes) for as long as the program runs. That's not good if you use this method to compare thousands and thousands of unique strings.

Bloom Filter char based

I am new to Bloom filters. I understand how to implement a Bloom filter with a bit array, where we hash a value x with k hash functions and set each resulting bit-array index to 1.
But I am wondering how to implement a Bloom filter with a char array, especially if the input is a string. One way I can think of is adding up the ASCII values of the string's characters, hashing that value, and setting the corresponding index of the char array to some value (I am also not sure what value to set, since it can't just be 0 or 1 as with a bit array), but the probability of false positives would be very high. Can someone give me some ideas to get started? (I do not need actual code, but I would really appreciate some insight on what hash functions to use and how to map them into a char array.)
You can use some hashing algorithm which will convert that to an integer hash and then consider each bit of it as part of the bit array or char array.
hash(S) = S[0]*p^0 + S[1]*p^1 + ... + S[n-1]*p^(n-1)
You can apply two such hashes (with different parameters) to reduce the chance of false positives; that will give you reasonable behavior.
The choice of p should be a prime greater than the number of characters in the alphabet.
This will give you a better result than simple ASCII-value addition.
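A minimal C sketch of that polynomial hash (the choice p = 131 and the reduction modulo 2^32 via unsigned overflow are illustrative assumptions, not prescribed by the answer):

/* hash(S) = S[0]*p^0 + S[1]*p^1 + ... computed mod 2^32 via unsigned
 * wraparound; p = 131 is a prime larger than the ASCII alphabet. */
unsigned poly_hash(const char *s)
{
    unsigned h = 0, pk = 1;   /* pk tracks p^i */
    while (*s) {
        h += (unsigned char)*s++ * pk;
        pk *= 131u;
    }
    return h;
}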
Note that the hash functions used should be independent and uniformly distributed.
Speed is another criterion, which is why standard cryptographic hashes (like SHA-1) are not a good choice.
One standard hashing method is MurmurHash, which you can try and compare against the results you expect.
To be clear on how you would go about implementing it: you can pick multiple hash functions, like murmur, fnv1a, or even the simple one I presented, and get one value from each hash. Put them in the appropriate positions, and that will work as your Bloom filter. Because you are using several different hash functions, the false-positive probability depends on all of them together, which gives a better result.
For example:
You want to hash stackoverflow. Now you use 3 hash functions which give you the numbers 11, 45 and 17. You would keep a map where you put these values:
{
  11: 1,
  45: 1,
  17: 1
}
Now you hash another string the same way and get the values 11, 15 and 97, so you update the map to:
{
  11: 1,
  15: 1,
  17: 1,
  45: 1,
  97: 1
}
Note: I have used a map here, but it could just as well be a bit array where you set bits; in the case of stackoverflow, the 11th, 17th and 45th bits would be set to 1. This map is what lets you answer the query of whether an element is present.
At query time you do the same thing: compute the hash values and check whether they are all set. If yes, there is a high chance the element is there (not a certainty, since it may be a false positive); if not, it is definitely not there.
Suppose you now want to check whether the string "abcd" is present. You apply the same 3 hash functions and get 11, 99 and 55. You check whether all three exist; since 55 is not set, the string "abcd" is not there.
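Putting the pieces together, a bit-array Bloom filter along these lines might look like the following sketch. The filter size and the k = 3 seeded polynomial hashes are illustrative assumptions; a real implementation would use stronger hashes like murmur or fnv1a, as suggested above.

#define FILTER_BITS 1024   /* illustrative filter size */

typedef struct { unsigned char bits[FILTER_BITS / 8]; } bloom;

/* Polynomial hash of s with base p, reduced to a bit position. */
static unsigned poly_hash(const char *s, unsigned p)
{
    unsigned h = 0, pk = 1;
    while (*s) {
        h += (unsigned char)*s++ * pk;
        pk *= p;
    }
    return h % FILTER_BITS;
}

static const unsigned primes[3] = { 131, 137, 139 };  /* k = 3 hash functions */

void bloom_add(bloom *b, const char *s)
{
    for (int i = 0; i < 3; i++) {
        unsigned bit = poly_hash(s, primes[i]);
        b->bits[bit / 8] |= (unsigned char)(1u << (bit % 8));
    }
}

/* 1 = maybe present (could be a false positive), 0 = definitely absent */
int bloom_query(const bloom *b, const char *s)
{
    for (int i = 0; i < 3; i++) {
        unsigned bit = poly_hash(s, primes[i]);
        if (!(b->bits[bit / 8] & (1u << (bit % 8))))
            return 0;
    }
    return 1;
}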

Is there any probabilistic data structure that reduces the space complexity of a large number of counters?

Basically I need to keep track of a large number of counters. I can increment or decrement each counter by name. The simplest way to do so is to use a hash table, using counter_name as key and its corresponding count as the value for that key.
The counters don't need to be 100% accurate, approximate values for count are fine. So I'm wondering if there is any probabilistic data structure that can reduce the space complexity of N counters to lower than O(N), kinda similar to how HyperLogLog reduces the memory requirement of counting N items by giving only an approximate result. Any ideas?
In my opinion, the thing you are looking for is a count-min sketch. Reading a stream of elements a1, a2, a3, ..., an, where there can be a lot of repeated elements, it can at any time answer the question: how many ai elements have you seen so far? Basically, your unique counter names map onto the sketch's counters, and the count-min sketch lets you adjust its parameters to trade memory for accuracy.
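For illustration, a minimal count-min sketch in C might look like this. The depth, width, and seeded FNV-1a row hash are illustrative choices; also note that with decrements the minimum estimator loses its usual upper-bound guarantee, and the count-sketch (median) variant is the safer choice if counts can go heavily negative.

#define CMS_DEPTH 4      /* rows (independent hash functions) */
#define CMS_WIDTH 1024   /* counters per row */

typedef struct { long counts[CMS_DEPTH][CMS_WIDTH]; } cms;

/* Seeded FNV-1a, standing in for a pairwise-independent hash family. */
static unsigned row_hash(const char *s, unsigned seed)
{
    unsigned h = 2166136261u ^ (seed * 16777619u);
    while (*s)
        h = (h ^ (unsigned char)*s++) * 16777619u;
    return h % CMS_WIDTH;
}

void cms_update(cms *c, const char *name, long delta)   /* e.g. +1 or -1 */
{
    for (int d = 0; d < CMS_DEPTH; d++)
        c->counts[d][row_hash(name, d)] += delta;
}

long cms_estimate(const cms *c, const char *name)
{
    long est = c->counts[0][row_hash(name, 0)];
    for (int d = 1; d < CMS_DEPTH; d++) {
        long v = c->counts[d][row_hash(name, d)];
        if (v < est)
            est = v;   /* take the minimum over all rows */
    }
    return est;
}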
P.S. I described some other popular probabilistic data structures here.
Stefan Haustein's correct that the names are likely to take more space than the counters, and you may be able to prioritise certain names as he suggests, but failing that you can consider how best to store the names. If they're fairly short (e.g. 8 characters or less), you might consider using a closed hashing table that stores them directly in the buckets. If they're long, you could store them contiguously (NUL terminated) in a block of memory, and in the hash table store the offset into that block of their first character.
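A rough sketch of that bucket layout (all field names and sizes here are illustrative, not prescribed by this answer):

/* One bucket of a closed-hashing table. Short names live inline; longer
 * names are stored NUL-terminated in a shared block and referenced by
 * the offset of their first character. */
struct bucket {
    union {
        char          short_name[8];  /* names of up to 7 chars, NUL-padded */
        unsigned long offset;         /* offset into the shared name block */
    } key;
    unsigned char is_inline;          /* which union member is in use */
    unsigned char counter;            /* room for the counter discussed next */
};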
For the counter itself, you can save space by using a probabilistic approach as follows:
#include <cstdlib>  // for rand()

template <typename T, typename Q = unsigned>
class Approx_Counter
{
  public:
    Approx_Counter() : n_(0) { }

    Approx_Counter& operator++()
    {
        if (n_ < 2 || rand() % (operator Q()) == 0)
            ++n_;
        return *this;
    }

    operator Q() const { return n_ < 2 ? Q(n_) : Q(1) << n_; }

  private:
    T n_;
};
Then you can use e.g. Approx_Counter<unsigned char, unsigned long>. Swap out rand() for a C++11 generator if you care.
The idea's simple:
when n_ is 0, ++ has definitely not been invoked
when n_ is 1, ++ has definitely been invoked exactly once
when n_ >= 2, it indicates ++ has probably been invoked about 2^n_ times
To keep that last implication in line with the number of ++ invocations actually made, each invocation has a 1 in 2^n_ chance of actually incrementing n_ again.
Just make sure your rand() or substitute returns values much larger than the largest counter value you want to track, otherwise you'll get rand() % (operator Q()) == 0 too often and increment inappropriately.
That said, having a smaller counter doesn't help much if you have pointers or offsets to it, so you'll want to squeeze the counter into the bucket too, another reason to prefer your own closed hashing implementation if you genuinely need to tighten up memory usage but want to stick with a hash table (a trie is another possibility).
The above is still O(N) in counter space, just with a smaller constant. For genuinely sub-O(N) options, you need to consider whether and how keys are related, such that incrementing one counter might reasonably impact multiple keys. You've given us no insight into that in your question to date.
The names probably take up more space than the counters.
How about having a fixed number of counters and only keep the ones with the highest counts, plus some kind of LRU mechanism to allow new counters to rise to the top? I guess it really depends on your use case...

Can I identify a hash algorithm based on the initial key and output hash?

If I have both the initial key and the hash that was created, is there any way to determine what hashing algorithm was used?
For example:
Key: higher
Hash: df072c8afcf2385b8d34aab3362020d0
Algorithm: ?
By looking at the length, you can decide which algorithms to try. MD5 and MD2 produce 16-byte digests. SHA-1 produces 20 bytes of output. Etc. Then perform each hash on the input and see if it matches the output. If so, that's your algorithm.
Of course, if more than the "key" was hashed, you'll need to know that too. And depending on the application, hashes are often applied iteratively. That is, the output of the hash is hashed again, and that output is hashed… often thousands of times. So if you know in advance how many iterations were performed, that can help too.
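A sketch of that trial-and-compare loop, assuming OpenSSL's EVP interface is available (the candidate list is illustrative and easily extended):

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>

/* Hash key with each candidate algorithm and compare the hex digest. */
int identify(const char *key, const char *target_hex)
{
    const EVP_MD *candidates[] = { EVP_md5(), EVP_sha1(), EVP_sha256() };
    const char   *names[]      = { "MD5", "SHA-1", "SHA-256" };
    unsigned char md[EVP_MAX_MD_SIZE];
    char          hex[2 * EVP_MAX_MD_SIZE + 1];
    unsigned int  len;

    for (size_t i = 0; i < sizeof candidates / sizeof *candidates; i++) {
        if (!EVP_Digest(key, strlen(key), md, &len, candidates[i], NULL))
            continue;
        for (unsigned int j = 0; j < len; j++)
            sprintf(hex + 2 * j, "%02x", md[j]);
        if (strcmp(hex, target_hex) == 0) {
            printf("match: %s\n", names[i]);
            return 1;
        }
    }
    return 0;   /* no match: maybe salted, iterated, or an uncommon algorithm */
}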
There's nothing besides the length in the output of a cryptographic hash that would help narrow down the algorithm that produced it.
Well, given that there are a finite number of popular hash algorithms, maybe what you propose is not so ridiculous.
But suppose I asked you this: if I have an input and an output, can I determine the function?
Generally speaking, no, you cannot determine the inner-workings of any function simply from knowing one input and one output, without any additional information.
// very, very basic illustration
if (unknownFunction(2) == 4) {
    // what does unknownFunction do?
    // return x + 2?
    // or return x * 2?
    // or return Math.Pow(x, 2)?
    // or return Math.Pow(x, 3) - 4?
    // etc.
}
The hash contains only hexadecimal characters (each character encodes 4 bits), and the total count is 32 characters, so this is a 128-bit hash.
Standard hashing algorithms that fit these specs are HAVAL, MD2, MD4, MD5 and RIPEMD-128.
The highest probability is that MD5 was used. However,
md5("higher") != df072c8afcf2385b8d34aab3362020d0
so the highest probability is that some salt was used - and the most likely algorithm still remains MD5.
Didn't match any of the common hashing algorithms:
http://www.fileformat.info/tool/hash.htm?text=higher
Perhaps a salt was added prior to hashing...
No, other than trying out a bunch that you know and seeing if any match.

What is a good Hash Function?

What is a good hash function? I saw a lot of hash functions and applications in my data structures courses in college, but I mostly learned that it's pretty hard to make a good one. As a rule of thumb to avoid collisions, my professor said:
function Hash(key)
    return key mod PrimeNumber
end
(mod is the % operator in C and similar languages)
with the prime number being the size of the hash table. I get that this is a somewhat good function for avoiding collisions, and a fast one, but how can I make a better one? Are there better hash functions for string keys as opposed to numeric keys?
There's no such thing as a “good hash function” for universal hashes (ed. yes, I know there's such a thing as “universal hashing” but that's not what I meant). Depending on the context different criteria determine the quality of a hash. Two people already mentioned SHA. This is a cryptographic hash and it isn't at all good for hash tables which you probably mean.
Hash tables have very different requirements. But still, finding a good hash function universally is hard because different data types expose different information that can be hashed. As a rule of thumb it is good to consider all information a type holds equally. This is not always easy or even possible. For reasons of statistics (and hence collision), it is also important to generate a good spread over the problem space, i.e. all possible objects. This means that when hashing numbers between 100 and 1050 it's no good to let the most significant digit play a big part in the hash because for ~ 90% of the objects, this digit will be 0. It's far more important to let the last three digits determine the hash.
Similarly, when hashing strings it's important to consider all characters – except when it's known in advance that the first three characters of all strings will be the same; considering these then is a waste.
This is actually one of the cases where I advise to read what Knuth has to say in The Art of Computer Programming, vol. 3. Another good read is Julienne Walker's The Art of Hashing.
For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.
http://www.azillionmonkeys.com/qed/hash.html
If you care about cryptographically secure or anything else more advanced, then YMMV. If you just want a kick ass general purpose hash function for a hash table lookup, then this is what you're looking for.
There are two major purposes of hashing functions:
to disperse data points uniformly into n bits.
to securely identify the input data.
It's impossible to recommend a hash without knowing what you're using it for.
If you're just making a hash table in a program, then you don't need to worry about how reversible or hackable the algorithm is... SHA-1 or AES is completely unnecessary for this, you'd be better off using a variation of FNV. FNV achieves better dispersion (and thus fewer collisions) than a simple prime mod like you mentioned, and it's more adaptable to varying input sizes.
If you're using the hashes to hide and authenticate public information (such as hashing a password, or a document), then you should use one of the major hashing algorithms vetted by public scrutiny. The Hash Function Lounge is a good place to start.
This is an example of a good one and also an example of why you would never want to write one.
It is a Fowler / Noll / Vo (FNV) Hash which is equal parts computer science genius and pure voodoo:
unsigned fnv_hash_1a_32 ( void *key, int len ) {
    unsigned char *p = key;
    unsigned h = 0x811c9dc5;               /* 32-bit FNV offset basis */
    int i;

    for ( i = 0; i < len; i++ )
        h = ( h ^ p[i] ) * 0x01000193;     /* xor, then multiply by the FNV prime */

    return h;
}

unsigned long long fnv_hash_1a_64 ( void *key, int len ) {
    unsigned char *p = key;
    unsigned long long h = 0xcbf29ce484222325ULL;  /* 64-bit FNV offset basis */
    int i;

    for ( i = 0; i < len; i++ )
        h = ( h ^ p[i] ) * 0x100000001b3ULL;       /* 64-bit FNV prime */

    return h;
}
Edit:
Landon Curt Noll recommends, on his site, the FNV-1a algorithm over the original FNV-1: the improved algorithm better disperses the last byte into the hash. I adjusted the code accordingly.
I'd say that the main rule of thumb is not to roll your own. Try to use something that has been thoroughly tested, e.g., SHA-1 or something along those lines.
A good hash function has the following properties:
Given the hash of a message, it is computationally infeasible for an attacker to find another message with the same hash (second-preimage resistance).
It is computationally infeasible to find any pair of messages m and m' such that h(m) = h(m') (collision resistance).
The two cases are not the same. In the first case, there is a pre-existing hash that you're trying to find a collision for. In the second case, you're trying to find any two messages that collide. The second task is significantly easier due to the birthday "paradox."
Where performance is not that great an issue, you should always use a secure hash function. There are very clever attacks that can be performed by forcing collisions in a hash. If you use something strong from the outset, you'll secure yourself against these.
Don't use MD5 or SHA-1 in new designs. Most cryptographers, me included, would consider them broken. The principal source of weakness in both designs is that the second property, outlined above, does not hold for these constructions. If an attacker can generate two messages, m and m', that both hash to the same value, they can use these messages against you. SHA-1 and MD5 also suffer from message extension attacks, which can fatally weaken your application if you're not careful.
A more modern hash such as Whirlpool is a better choice. It does not suffer from these message extension attacks, and it uses the same mathematics AES uses to prove security against a variety of attacks.
Hope that helps!
What you're asking for is a hash with collision resistance. Try SHA-2. Or try using a (good) block cipher in a one-way compression function (I've never tried that before), like AES in Miyaguchi-Preneel mode. The problem with that is that you need to:
1) have an IV. Try using the first 256 bits of the fractional part of Khinchin's constant or something like that.
2) have a padding scheme. Easy. Borrow it from a hash like MD5 or SHA-3 (Keccak, pronounced 'ket-chak').
If you don't care about security (a few others said this), look at FNV or lookup2 by Bob Jenkins (actually I'm the first one who recommends lookup2). Also try MurmurHash; it's fast (check this: 0.16 cpb).
A good hash function should
be bijective where possible, so as not to lose information, and have the fewest collisions
cascade as much and as evenly as possible, i.e. each input bit should flip every output bit with probability 0.5 and without obvious patterns.
if used in a cryptographic context there should not exist an efficient way to invert it.
A prime number modulus does not satisfy any of these points. It is simply insufficient. It is often better than nothing, but it's not even fast. Multiplying by an unsigned integer and taking a power-of-two modulus distributes the values just as well - that is, not well at all - but with only about 2 CPU cycles it is much faster than the 15 to 40 a prime modulus will take (yes, integer division really is that slow).
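For example, Knuth-style multiplicative hashing into a power-of-two table needs one multiply and one shift; the constant 2654435769 is 2^32 divided by the golden ratio (a sketch, not tuned for any particular use):

/* Multiplicative hashing into a table of size 2^k (0 < k <= 32):
 * multiply by a golden-ratio constant and keep the top k bits. */
unsigned mult_hash(unsigned x, unsigned k)
{
    return (x * 2654435769u) >> (32 - k);
}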
To create a hash function that is fast and distributes the values well the best option is to compose it from fast permutations with lesser qualities like they did with PCG for random number generation.
Useful permutations, among others, are:
multiplication with an uneven integer
binary rotations
xorshift
Following this recipe we can create our own hash function, or we can take splitmix, which is tested and well accepted.
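For reference, splitmix64's output mixer is exactly such a composition of xorshifts alternating with multiplications by odd 64-bit constants (sketched here as a standalone integer mixer):

#include <stdint.h>

/* The output mixer of splitmix64: xorshifts interleaved with
 * multiplications by odd 64-bit constants. */
uint64_t splitmix64_mix(uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}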
If cryptographic qualities are needed, I would highly recommend using a function of the SHA family, which is well tested and standardised, but for educational purposes this is how you would make one:
First you take a good non-cryptographic hash function, then you apply a one-way function such as exponentiation on a prime field, or k applications of (n*(n+1)/2) mod 2^k interspersed with an xorshift, where k is the number of bits in the resulting hash.
I highly recommend the SMhasher GitHub project https://github.com/rurban/smhasher which is a test suite for hash functions. The fastest state-of-the-art non-cryptographic hash functions without known quality problems are listed here: https://github.com/rurban/smhasher#summary.
Different application scenarios have different design requirements for hash algorithms, but a good hash function should have the following three properties:
Collision resistance: it should be hard to find two inputs that hash to the same output.
Tamper resistance: if even one byte changes, the hash value should be very different.
Computational efficiency: hashing should be fast, since a hash table trades time against space.
In 2022, we can choose the SHA-2 family for security-sensitive uses; SHA-3 is safer but carries a greater performance cost. A safer approach still is to add a salt and combine hash functions.
