Why are the hash codes generated by this function not unique? - vb6

I'm testing the VB function below that I got from a Google search. I plan to use it to generate hash codes for quick string comparison. However, there are occasions in which two different strings have the same hash code. For example, these strings
"122Gen 1 heap size (.NET CLR Memory w3wp):mccsmtpteweb025.20833333333333E-02"
"122Gen 2 heap size (.NET CLR Memory w3wp):mccsmtpteweb015.20833333333333E-02"
have the same hash code of 237117279.
Please tell me:
- What is wrong with the function?
- How can I fix it?
Thank you
martin
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (dest As Any, src As Any, ByVal bytes As Long)

Private Function HashCode(Key As String) As Long
    On Error GoTo ErrorGoTo

    Dim lastEl As Long, i As Long

    ' copy ansi codes into an array of long
    lastEl = (Len(Key) - 1) \ 4
    ReDim codes(lastEl) As Long
    ' this also converts from Unicode to ANSI
    CopyMemory codes(0), ByVal Key, Len(Key)

    ' XOR the ANSI codes of all characters
    For i = 0 To lastEl - 1
        HashCode = HashCode Xor codes(i)
    Next

ErrorGoTo:
    Exit Function
End Function

I'm betting there are more than just "occasions" when two strings generate the same hash using your function. In fact, it probably happens more often than you think.
A few things to realize:
First, there will be hash collisions. It happens. Even with really, really big spaces like MD5 (128 bits) there are still two strings that can generate the same resulting hash. You have to deal with those collisions by creating buckets.
Second, a long integer isn't really a big hash space. You're going to get more collisions than you would if you used more bits.
Thirdly, there are libraries available to you in Visual Basic (like .NET's System.Security.Cryptography namespace) that will do a much better job of hashing than most mere mortals.
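To make the "buckets" point concrete, here is a minimal separate-chaining sketch in C (not from the original answer; the table size and the names bucket_put, bucket_contains and hash_string are made up for illustration). Colliding strings simply land in the same bucket, and a full string comparison decides which entry you actually have:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024              /* table size; tune as needed */

typedef struct node {
    char        *key;              /* the original string, kept for exact comparison */
    struct node *next;             /* next entry that hashed into the same bucket    */
} node;

static node *table[NBUCKETS];

/* Any reasonable string hash will do here; collisions are expected and handled below. */
static unsigned long hash_string(const char *s)
{
    unsigned long h = 5381;                    /* djb2-style mixing */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Insert a key; duplicates are detected by strcmp, never by the hash alone. */
void bucket_put(const char *key)
{
    unsigned long idx = hash_string(key) % NBUCKETS;
    for (node *n = table[idx]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return;                            /* already present */

    node *fresh = malloc(sizeof *fresh);
    fresh->key = malloc(strlen(key) + 1);
    strcpy(fresh->key, key);
    fresh->next = table[idx];
    table[idx] = fresh;
}

/* Look a key up: the hash narrows the search to one bucket, strcmp confirms the match. */
int bucket_contains(const char *key)
{
    unsigned long idx = hash_string(key) % NBUCKETS;
    for (node *n = table[idx]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}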

The two Strings have the same characters. (Note the '2' and the '1' that are flip-flopped)
That is why the hash value is the same.
Make sure that the hash function is taking into account the order of the characters.

Hash functions do not guarantee uniqueness of hash values. If the input value range (judging by your sample strings) is larger than the output value range (e.g. a 32-bit integer), then uniqueness is physically impossible.

If the biggest problem is that it doesn't account for the position of the bytes, you could fix it like this:
Private Function HashCode(Key As String) As Long
    On Error GoTo ErrorGoTo

    Dim lastEl As Long, i As Long

    ' copy ansi codes into an array of long
    lastEl = (Len(Key) - 1) \ 4
    ReDim codes(lastEl) As Long
    ' this also converts from Unicode to ANSI
    CopyMemory codes(0), ByVal Key, Len(Key)

    ' XOR the ANSI codes of all characters
    For i = 0 To lastEl - 1
        HashCode = HashCode Xor (codes(i) + i)
    Next

ErrorGoTo:
    Exit Function
End Function
The only difference is that it adds each element's position to its value before the XOR, so the position of the characters now affects the result.

No hash function can guarantee uniqueness. There are ~4 billion 32-bit integers, so even the best hash function will generate duplicates when presented with ~4 billion and 1 strings (and most likely long before).
Moving to 64-bit or even 128-bit hashes isn't really the solution, though it reduces the probability of a collision.
If you want a better hash function you could look at the cryptographic hashes, but it would be better to reconsider your algorithm and decide if you can deal with the collisions some other way.
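To put a rough number on "long before" (a standard birthday-bound estimate, not part of the original answer): hashing k random strings into an n-bit space gives a collision probability of roughly

P ≈ 1 − e^(−k² / 2^(n+1))

so a 32-bit hash has about a 50% chance of at least one collision after only ~77,000 strings, and a 64-bit hash after roughly 5 billion.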

The System.Security.Cryptography namespace contains multiple classes which can do hashing for you (such as MD5) which will probably hash them better than you could yourself and will take much less effort.
You don't always have to reinvent the wheel.

Simple XOR is a bad hash: you'll find lots of strings which collide. The hash doesn't depend on the order of the letters in the string, for one thing.
Try using the FNV hash http://isthe.com/chongo/tech/comp/fnv/
This is really simple to implement. It multiplies the hash by a prime after each XOR, so the same letters in a different order will produce a different hash.
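For concreteness, here is a minimal FNV-1a sketch in C, not part of the original answer; the offset basis and prime are the published 32-bit FNV constants, and the demo strings are arbitrary. Unlike the XOR-only hash above, reordering groups of characters changes the result:

#include <stdio.h>

/* FNV-1a, 32-bit: XOR in each byte, then multiply by the FNV prime. */
unsigned int fnv1a32(const char *s)
{
    unsigned int h = 2166136261u;          /* FNV offset basis (0x811c9dc5) */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;                    /* FNV prime (0x01000193) */
    }
    return h;
}

int main(void)
{
    /* Swapping two 4-byte groups collides under a pure XOR hash,
       but FNV-1a gives two different results. */
    printf("%08x\n", fnv1a32("abcdefgh"));
    printf("%08x\n", fnv1a32("efghabcd"));
    return 0;
}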

Hash functions are not meant to return distinct values for distinct strings. However, a good hash function should return different values for strings that look alike. Hash functions are used for many purposes, including searching in large collections. If the hash function is good and returns values in the range [0, N-1], then a large collection of M objects will be divided into N buckets, each holding about M/N elements. That way you only need to search an array of M/N elements instead of an array of M elements.
But if you only have 2 strings, it is not faster to compute their hash values! It is better to just compare the two strings.
An interesting hash function could be:
unsigned int hash(const char* name) {
    unsigned mul = 1;
    unsigned val = 0;
    while (name[0] != 0) {
        val += mul * ((unsigned)name[0]);
        mul *= 7; // you could use an arbitrary prime number, but test the hash dispersion afterwards
        name++;
    }
    return val;
}

I fixed the syntax highlighting for him.
Also, for those who weren't sure about the environment or were suggesting a more-secure hash: it's Classic (pre-.Net) VB, because .Net would require parentheses for the call to CopyMemory.
IIRC, there aren't any secure hashes built in for Classic VB. There's not much out there on the web either, so this may be his best bet.

I don't quite see the environment you work in. Is this .Net code? If you really want good hash codes, I would recommend looking into cryptographic hashes (proven algorithms) instead of trying to write your own.
Btw, could you edit your post and paste the code in as a Code Sample (see toolbar)? This would make it easier to read.

"Don't do that."
Writing your own hash function is a big mistake, because your language certainly already has an implementation of SHA-1, which is a perfectly good hash function. If you only need 32 bits (instead of the 160 that SHA-1 provides), just use the last 32 bits of SHA-1.

This particular hash function XORs all of the characters in a string. Unfortunately XOR is both associative and commutative:
(a XOR b) XOR c = a XOR (b XOR c), and a XOR b = b XOR a
So any strings made up of the same pieces in a different order will result in the same hash code. The two strings provided are the same except for the location of two characters, therefore they have the same hash code.
You may need to find a better algorithm; MD5 would be a good choice.

The XOR operation is commutative; that is, when XORing all the chars in a string, the order of the chars does not matter. All anagrams of a string will produce the same XOR hash.
In your example, your second string can be generated from your first by swapping the "1" after "...Gen " with the first "2" following it.
There is nothing wrong with your function. All useful hashing functions will sometimes generate collisions, and your program must be prepared to resolve them.
A collision occurs when an input hashes to a value already identified with an earlier input. If a hashing algorithm could not generate collisions, the hash values would need to be as large as the input values. Such a hashing algorithm would be of limited use compared to just storing the input values.
-Al.

There's a visual basic implementation of MD5 hashing here
http://www.bullzip.com/md5/vb/md5-visual-basic.htm

Related

Ruby: Help improving hashing algorithm

I am still relatively new to ruby as a language, but I know there are a lot of convenience methods built into the language. I am trying to generate a "hash" to check against in a low level block-chain verifier and I am wondering if there are any "convenience methods" that I could use to make this hashing algorithm more efficient. I think I can make this more efficient by utilizing ruby's max integer size, but I'm not sure.
Below is the current code which takes in a string to hash, unpacks it into an array of UTF-8 values, does computationally intensive math to each one of those values, adds up all of those values after the math is done to them, takes that value modulo 65,536, and then returns the hex representation of that value.
def generate_hash(string)
  unpacked_string = string.unpack('U*')
  sum = 0
  unpacked_string.each do |x|
    sum += (x**2000) * ((x + 2)**21) - ((x + 5)**3)
  end
  new_val = sum % 65_536 # Gives a number from 0 to 65,535
  new_val.to_s(16)
end
On very large block-chains there is a very large performance hit which I am trying to get around. Any help would be great!
First and foremost, it is extremely unlikely that you are going to create anything that is more efficient than simply using String#hash. This is a case of you trying to build a better mousetrap.
Honestly, your hashing algorithm is very inefficient. The entire point of a hash is to be a fast, low-overhead way of quickly getting a "unique" (as unique as possible) integer to represent any object to avoid comparing by values.
Using that as a premise, if you start doing any type of intense computation in a hash algorithm, it is already counter-productive. Once you start implementing modulo and pow functions, it is inefficient.
Usually best practice involves taking a value(s) of the object that can be represented as integers, and performing bit operations on them, typically with prime numbers to help reduce hash collisions.
def hash
  h = value1 ^ 393
  h += value2 ^ 17
  h
end
In your example, you are for some reason forcing the hash into the range of a 16-bit unsigned integer, when typically 32 bits are used; if you are comparing on the Ruby side, this would effectively be 31 bits due to how Ruby masks Fixnum values. Fixnum was deprecated on the Ruby side, as it should have been, but internally the same threshold exists between how a Bignum and a Fixnum are handled. The Integer class simply provides one interface on the Ruby side, as those two really should never have been exposed outside of the C code.
In your specific example using strings, I would simply symbolize them. This guarantees a quick and efficient way to determine whether two strings are equal with hardly any overhead, and comparing 2 symbols is exactly the same as comparing 2 integers. There is a caveat to this method if you are comparing a vast number of strings: once a symbol is created, it is alive for the life of the program. Any additional string that equals it will return the same symbol, but you cannot reclaim the memory of the symbol (just a few bytes) for as long as the program runs. Not good if you use this method to compare thousands and thousands of unique strings.

The difference and use of strings and string arrays?

Okay, so for all I know a string is basically an array of characters. So why would there be string arrays in VB? And what are the differences between them?
Just the basics, the way they operate, is what I'm interested in.
At times it is very useful to think of a String as an array of characters. It can also be useful to think of it as an array of bytes at times too - and this is of course not the same thing at all.
See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) for better understanding of the differences between bytes and the characters held by Strings (UTF-16LE) as well as other character encodings commonly used.
But all of that aside, a String is really a higher level abstraction that you should not think of as an array of any kind.
After all, by that sort of logic an Integer or Long is an array as well.
So, considering that a String is meant to be viewed as a primitive scalar value type, the purpose of String arrays should be pretty clear. Arrays of Strings have pretty much the same sorts of uses as arrays of any other data type.
The fact that you have operations you can perform on Strings that root around inside them (substring operations) isn't much different conceptually than the operations that operate on the data inside any other simple type.
Say you need to store a list of names; it might be 100 names or 200 names, it depends from case to case. What will you do?
A String array can solve such a case.
Try this:
Dim Names() As String
ReDim Names(3) As String

Names(0) = "First"
Names(1) = "Second"
Names(2) = "Third"
Names(3) = "Fourth"

Dim l As Long
For l = LBound(Names) To UBound(Names)
    MsgBox Names(l)
Next

Guessing the hash function?

I'd like to know which algorithm is employed. I strongly assume it's something simple and hopefully common. There's no lag in generating the results, for instance.
Input: any string
Output: 5 hex characters (0-F)
I have access to as many keys and results as I wish, but I don't know how exactly I could harness this to attack the function. Is there any method? If I knew any functions that converted to 5-chars to start with then I might be able to brute force for a salt or something.
I know for example that:
a=06a07
b=bfbb5
c=63447
(in case you have something in mind)
In normal use it converts random 32-char strings into 5-char strings.
The only way to derive a hash function from data is through brute force, perhaps combined with some cleverness. There are an infinite number of hash functions, and the good ones perform what is essentially one-way encryption, so it's a question of trial and error.
It's practically irrelevant that your function converts 32-character strings into 5-character hashes; the output is probably truncated. For fun, here are some perfectly legitimate examples, the last 3 of which are cryptographically terrible:
- Use the MD5 hashing algorithm, which generates a 32-character hex digest, and use the 10th through the 14th characters.
- Use the SHA-1 algorithm and take the last 5 characters.
- If the input string is alphabetic, use the simple substitution A=1, B=2, C=3, ... and take the first 5 digits.
- Find each character on your keyboard, measure its distance from the left edge in millimeters, and use every other digit, in reverse order, starting with the last one.
- Create a stackoverflow user whose name is the 32-character string, divide 113 by the corresponding user ID number, and take the first 5 digits after the decimal. (But don't tell 'em I told you to do it!)
Depending on what you need this for, if you have access to as many keys and results as you wish, you might want to try a rainbow table approach. 5 hex chars is only about 1 million combinations. You should be able to brute-force generate a map of strings that match all of the resulting hashes in no time. Then you don't need to know the original string, just an equivalent string that generates the same hash, or you can brute-force entry by iterating over the ~1 million input strings.
Following on from a comment I just made to Pontus Gagge, suppose the hash algorithm is as follows:
- Append some long, constant string to the input.
- Compute the SHA-256 hash of the result.
- Output the last 5 chars of the hash.
Then I'm pretty sure there's no computationally feasible way from your chosen-plaintext attack to figure out what the hashing function is. To even prove that SHA-256 is in use (assuming it's a good hash function, which as far as we currently know it is), I think you'd need to know the long string, which is only stored inside the "black box".
That said, if I knew any published 20-bit hash functions, then I'd be checking those first. But I don't know any: all the usual non-crypto string hashing functions are 32 bit, because that's the expected size of an integer type. You should perhaps compare your results to those of CRC, PJW, and BUZ hash on the same strings, as well as some variants of DJB hash with different primes, and any string hash functions built in to well-known programming languages, like java.lang.String.hashCode. It could be that the 5 output chars are selected from the 8 hex chars generated by one of those.
Beyond that (and any other well-known string hashes you can find), I'm out of ideas. To cryptanalyse a black box hash, you start by looking for correlations between the bits of the input and the bits of the output. This gives you clues what functions might be involved in the hash. But that's a huge subject and not one I'm familiar with.
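If you want to try the "compare against well-known string hashes" idea, a small C sketch along these lines can help. The choice of djb2 and a Java-style polynomial hash, and the truncation to the low 20 bits, are only guesses for illustration, not a claim about what the black box actually does:

#include <stdio.h>

/* djb2: h = h*33 + c, starting from 5381. */
static unsigned int djb2(const char *s)
{
    unsigned int h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Java-style polynomial hash (h = h*31 + c); for ASCII input this matches
   the low 32 bits of java.lang.String.hashCode. Unsigned arithmetic is used
   here to get well-defined wraparound in C. */
static unsigned int java_hash(const char *s)
{
    unsigned int h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

int main(void)
{
    const char *probe = "a";   /* the question reports a -> 06a07 */
    /* Compare the low 20 bits (5 hex digits) of each candidate hash. */
    printf("djb2 : %05x\n", djb2(probe) & 0xFFFFF);
    printf("java : %05x\n", java_hash(probe) & 0xFFFFF);
    return 0;
}

If the low 20 bits never line up with the observed outputs across a batch of probe strings, you can rule that candidate out quickly.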
This sounds mildly illicit.
Not to rain on your parade or anything, but if the implementors have done their work right, you wouldn't notice lags beyond a few tens of milliseconds on modern CPUs even with strong cryptographic hashes, and knowing the algorithm won't help you if they have used salt correctly. If you don't have access to the code or binaries, your only hope is a trivial mistake, whether caused by technical limitations or carelessness.
There is an uncountable infinity of potential (hash) functions for any given set of inputs and outputs, and if you have no clue better than an upper bound on their computational complexity (from the lag you detect), you have a very long search ahead of you...

Hashtables/Dictionaries that use floats/doubles

I read somewhere about other data structures similar to hashtables and dictionaries, but instead of using ints they were using floats/doubles, etc.
Anyone knows what they are?
If you mean using floats/doubles as keys in your hash, that's easy. For example, in .NET, it's just using Dictionary<double,MyValueType>.
If you're talking about having the hash be based off a double instead of an int....
Technically, you can have any element as your internal hash. Normally, this is done using an int or long, since these are fast, and the hashing algorithm is easy to compute.
However, the hash is really just a BitArray at heart, so anything would work. There really isn't much advantage to making this something other than an int or long, other than potentially allowing a larger set of hash values (ie: if you go to an 8 byte or larger type for your hash).
You mean as keys? That strikes me as tricky.
If you're using them as arbitrary keys, they're no better than integers.
If you expect to calculate a floating-point value and use it to look something up in a hash table, you're living very dangerously. Floating point numbers do not have infinite precision, and calculating the same thing in two slightly different ways can result in very tiny differences in the result. Hash keys rely on getting the exact same thing every time, so you'd have to be careful to round, and round in exactly the same way at all times. This is trickier than it sounds, by the way.
So, what would you do with floating-point hashes?
A hash algorithm is, in general terms, just a function that produces a smaller output from a larger input. Good hash functions have interesting properties like a large change in output for a small change in the input, and an assurance that they produce every possible output value for some input.
It's not hard to write a simple polynomial type hash function that outputs a floating-point value, rather than an integer value, but it's difficult to ensure that the resulting hash function has the desired properties without getting into the details of the particular floating-point representation used.
At least part of the reason that hash functions are nearly always implemented in integer arithmetic is because proving various properties about an integer calculation is easier than doing the same for a floating point calculation.
It's fairly easy to prove that some (sum of prime factors) modulo (another prime) must, necessarily, produce every possible output for some input. Doing the same for a calculation with a bunch of floating-point fractions would be a drag.
Add to that the relative difficulty of storing and transmitting floating-point values without corruption, and it's just not worth it.
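As a tiny illustration of the rounding hazard described above (assuming IEEE 754 doubles, which is what most platforms use): two ways of computing the "same" value store slightly different doubles, so any hash built from their bits would send them to different buckets:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Read the raw bit pattern of a double, as a bit-based key hash might. */
static uint64_t bits_of(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);   /* well-defined way to inspect the bits */
    return u;
}

int main(void)
{
    double a = 0.1 + 0.2;   /* the "same" value, computed one way... */
    double b = 0.3;         /* ...and computed another way           */

    /* Mathematically equal, but the stored doubles differ slightly... */
    printf("a == b      : %s\n", a == b ? "yes" : "no");
    /* ...so their bit patterns, and hence any hash built on them, differ too. */
    printf("bits differ : %s\n", bits_of(a) != bits_of(b) ? "yes" : "no");
    return 0;
}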
Your question history shows that you use .Net, so I'll answer in that context.
If you want a Dictionary that is type aware, such that you can specify it should use floats or doubles for the keys or values, use System.Collections.Generic.Dictionary<T, U> http://msdn.microsoft.com/en-us/library/xfhwa508.aspx
If you want a Dictionary that is type blind, such that you can use floats AND doubles for keys and values, use System.Collections.Hashtable http://msdn.microsoft.com/en-us/library/system.collections.hashtable.aspx

What is a good Hash Function?

What is a good Hash function? I saw a lot of hash functions and applications in my data structures courses in college, but I mostly got the impression that it's pretty hard to make a good hash function. As a rule of thumb to avoid collisions my professor said that:
function Hash(key)
    return key mod PrimeNumber
end
(mod is the % operator in C and similar languages)
with the prime number being the size of the hash table. I get that it is a somewhat good function for avoiding collisions and a fast one, but how can I make a better one? Are there better hash functions for string keys as opposed to numeric keys?
There's no such thing as a “good hash function” for universal hashes (ed. yes, I know there's such a thing as “universal hashing” but that's not what I meant). Depending on the context different criteria determine the quality of a hash. Two people already mentioned SHA. This is a cryptographic hash and it isn't at all good for hash tables which you probably mean.
Hash tables have very different requirements. But still, finding a good hash function universally is hard because different data types expose different information that can be hashed. As a rule of thumb it is good to consider all information a type holds equally. This is not always easy or even possible. For reasons of statistics (and hence collision), it is also important to generate a good spread over the problem space, i.e. all possible objects. This means that when hashing numbers between 100 and 1050 it's no good to let the most significant digit play a big part in the hash because for ~ 90% of the objects, this digit will be 0. It's far more important to let the last three digits determine the hash.
Similarly, when hashing strings it's important to consider all characters – except when it's known in advance that the first three characters of all strings will be the same; considering these then is a waste.
This is actually one of the cases where I advise to read what Knuth has to say in The Art of Computer Programming, vol. 3. Another good read is Julienne Walker's The Art of Hashing.
For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.
http://www.azillionmonkeys.com/qed/hash.html
If you care about cryptographically secure or anything else more advanced, then YMMV. If you just want a kick ass general purpose hash function for a hash table lookup, then this is what you're looking for.
There are two major purposes of hashing functions:
- to disperse data points uniformly into n bits.
- to securely identify the input data.
It's impossible to recommend a hash without knowing what you're using it for.
If you're just making a hash table in a program, then you don't need to worry about how reversible or hackable the algorithm is... SHA-1 or AES is completely unnecessary for this, you'd be better off using a variation of FNV. FNV achieves better dispersion (and thus fewer collisions) than a simple prime mod like you mentioned, and it's more adaptable to varying input sizes.
If you're using the hashes to hide and authenticate public information (such as hashing a password, or a document), then you should use one of the major hashing algorithms vetted by public scrutiny. The Hash Function Lounge is a good place to start.
This is an example of a good one and also an example of why you would never want to write one.
It is a Fowler / Noll / Vo (FNV) Hash which is equal parts computer science genius and pure voodoo:
unsigned fnv_hash_1a_32 ( void *key, int len ) {
    unsigned char *p = key;
    unsigned h = 0x811c9dc5;
    int i;

    for ( i = 0; i < len; i++ )
        h = ( h ^ p[i] ) * 0x01000193;

    return h;
}

unsigned long long fnv_hash_1a_64 ( void *key, int len ) {
    unsigned char *p = key;
    unsigned long long h = 0xcbf29ce484222325ULL;
    int i;

    for ( i = 0; i < len; i++ )
        h = ( h ^ p[i] ) * 0x100000001b3ULL;

    return h;
}
Edit:
Landon Curt Noll recommends on his site the FNV-1a algorithm over the original FNV-1 algorithm: the improved algorithm better disperses the last byte in the hash. I adjusted the algorithm accordingly.
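For anyone who wants to try the functions above, here is a minimal usage sketch (the sample string is arbitrary); the key is passed as raw bytes together with its length:

#include <stdio.h>
#include <string.h>

/* Declarations matching the definitions in the answer above. */
unsigned fnv_hash_1a_32 ( void *key, int len );
unsigned long long fnv_hash_1a_64 ( void *key, int len );

int main(void)
{
    char buf[] = "122Gen 1 heap size (.NET CLR Memory w3wp)";
    /* Hash the string's bytes, excluding the terminating NUL. */
    printf("32-bit: %08x\n",    fnv_hash_1a_32(buf, (int)strlen(buf)));
    printf("64-bit: %016llx\n", fnv_hash_1a_64(buf, (int)strlen(buf)));
    return 0;
}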
I'd say that the main rule of thumb is not to roll your own. Try to use something that has been thoroughly tested, e.g., SHA-1 or something along those lines.
A good hash function has the following properties:
- Given the hash of a message, it is computationally infeasible for an attacker to find another message with an identical hash.
- It is computationally infeasible to find any pair of messages, m and m', such that h(m) = h(m').
The two cases are not the same. In the first case, there is a pre-existing hash that you're trying to find a collision for. In the second case, you're trying to find any two messages that collide. The second task is significantly easier due to the birthday "paradox."
Where performance is not that great an issue, you should always use a secure hash function. There are very clever attacks that can be performed by forcing collisions in a hash. If you use something strong from the outset, you'll secure yourself against these.
Don't use MD5 or SHA-1 in new designs. Most cryptographers, me included, would consider them broken. The principal source of weakness in both of these designs is that the second property, which I outlined above, does not hold for these constructions. If an attacker can generate two messages, m and m', that both hash to the same value they can use these messages against you. SHA-1 and MD5 also suffer from message extension attacks, which can fatally weaken your application if you're not careful.
A more modern hash such as Whirlpool is a better choice. It does not suffer from these message extension attacks and uses the same mathematics as AES to prove security against a variety of attacks.
Hope that helps!
What you're saying here is that you want one that has collision resistance. Try using SHA-2. Or try using a (good) block cipher in a one-way compression function (never tried that before), like AES in Miyaguchi-Preneel mode. The problem with that is that you need to:
1) have an IV. Try using the first 256 bits of the fractional parts of Khinchin's constant or something like that.
2) have a padding scheme. Easy. Borrow it from a hash like MD5 or SHA-3 (Keccak [pronounced 'ket-chak']).
If you don't care about the security (a few others said this), look at FNV or lookup2 by Bob Jenkins (actually I'm the first one who recommends lookup2). Also try MurmurHash, it's fast (check this: .16 cpb).
A good hash function should
- be bijective where possible so as not to lose information, and have the least collisions
- cascade as much and as evenly as possible, i.e. each input bit should flip every output bit with probability 0.5 and without obvious patterns.
- if used in a cryptographic context, there should not exist an efficient way to invert it.
A prime number modulus does not satisfy any of these points. It is simply insufficient. It is often better than nothing, but it's not even fast. Multiplying by an unsigned integer and taking a power-of-two modulus distributes the values just as well (that is, not well at all), but with only about 2 CPU cycles it is much faster than the 15 to 40 cycles a prime modulus will take (yes, integer division really is that slow).
To create a hash function that is fast and distributes the values well the best option is to compose it from fast permutations with lesser qualities like they did with PCG for random number generation.
Useful permutations, among others, are:
- multiplication by an odd integer
- binary rotations
- xorshift
Following this recipe we can create our own hash function or we take splitmix which is tested and well accepted.
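As a concrete instance of that recipe, here is the splitmix64 finalizer (constants and shift amounts as published by Sebastiano Vigna), which is composed entirely of xorshifts and multiplications by odd constants and makes a respectable general-purpose integer hash:

#include <stdint.h>

/* splitmix64 finalizer: alternating xorshift and odd-constant multiplication. */
uint64_t splitmix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;                    /* golden-ratio increment  */
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;   /* xorshift, then multiply */
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

Each step is an invertible permutation of the 64-bit state (an odd multiplier has a modular inverse, and an xorshift can be undone), so no information is lost until the result is truncated, which is exactly the "compose fast permutations" idea above.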
If cryptographic qualities are needed I would highly recommend using a function of the SHA family, which is well tested and standardised, but for educational purposes this is how you would make one:
First take a good non-cryptographic hash function, then apply a one-way function like exponentiation on a prime field, or k applications of (n*(n+1)/2) mod 2^k interspersed with an xorshift, where k is the number of bits in the resulting hash.
I highly recommend the SMhasher GitHub project https://github.com/rurban/smhasher which is a test suite for hash functions. The fastest state-of-the-art non-cryptographic hash functions without known quality problems are listed here: https://github.com/rurban/smhasher#summary.
Different application scenarios have different design requirements for hash algorithms, but a good hash function should have the following three properties:
- Collision resistance: try to avoid collisions. If it is difficult to find two inputs that hash to the same output, the hash function is collision-resistant.
- Tamper resistance: if even one byte of the input is changed, the hash value should be very different.
- Computational efficiency: the hash should be cheap to compute; a hash table is a structure that trades space for lookup time, and a slow hash defeats that trade-off.
In 2022, we can choose the SHA-2 family for secure hashing; SHA-3 is safer but comes with a greater performance cost. A safer approach still is to add a salt and combine hashing with encryption.
