How to add extra information to a hash code? - algorithm

Given an array of bytes, there are several well-known good algorithms for calculating a hash code, such as FNV or MD5. (Not talking about cryptography here, just general purpose hash codes.)
Suppose what you have is an array of bytes plus one extra piece of information, a small integer (which is not located next to the array in memory), and you want a hash code based on the whole lot. What's the best way to do this? For example, one could take the hash code of the array and add it to the small integer, or exclusive-or them. But is there a better way?

I think the easiest and most efficient way is to initialize the hash accumulator with your small value and then compute the hash in the ordinary way.
The following example illustrates this approach by computing a hash from an int and a C-style string:
uint32_t hash(const char *str, uint32_t x) {
char c;
while((c = *str++) != 0)
x = ((x << 5) | (x >> (32 - 5))) + c;
return x ^ (x >> 16);
}
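For comparison, here is a rough Python port of the same idea. It is only a sketch: the names rotl32 and seeded_hash are made up for this example, the input is taken as bytes rather than a NUL-terminated string, and bytes are treated as unsigned (the C version uses plain char, which may be signed on some platforms).

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32

def seeded_hash(data: bytes, extra: int) -> int:
    """Rotate-and-add hash whose accumulator starts at the extra integer."""
    x = extra & MASK32              # seed the accumulator with the small value
    for byte in data:
        x = (rotl32(x, 5) + byte) & MASK32
    return x ^ (x >> 16)            # fold the high half into the low half

print(seeded_hash(b"abc", 1) != seeded_hash(b"abc", 2))  # True
```

Because the extra integer is the starting state, it influences every later step instead of being added or XORed in at the very end.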

Related

What hash function "hacks" do you use?

Hash function can be dependent on data. For example (from this article) if your data are all strings and almost all of them are of different lengths then a simple string length could be a very good hash function (not very realistic I know). Or for example real numbers from 0 to 1 could have a simple hash function:
value * sizeOfHashTable
I am interested if you use such hash functions that are tailor made around your inputs? Any more examples?
As you correctly noted, a hash function depends on the data being hashed.
A common approach to designing a good hash function is to satisfy three conditions:
The function must be cheap to compute. It can be better to use a mediocre hash that is quick to compute, if the time saved on hashing exceeds the time lost to imbalanced buckets or longer probe paths.
The function must have a good (pseudorandom) distribution on the test dataset. It helps if the hash function exhibits the avalanche effect, where changing a single bit of the input changes about half the bits of the output.
For external input data, the hash function must be "universal", i.e. it must resist deliberate attempts to generate collisions.
My favorite hash function is the following. Before first use, the table S_block must be initialized with random values. It is a good idea to do this on each program run.
const unsigned int S_block[256] = { ... };
#define NLF(h, c) (S_block[(unsigned char)(c + h)] ^ c)
unsigned int hash(const char *key) {
unsigned int h = 0x1F351F35;
char c;
while(c = *key++)
h = ((h << 7) | (h >> (32 - 7))) + NLF(h, c);
h ^= h >> 16;
return h ^ (h >> 8);
}
As a practical example, see the variation of this function used in my program emcSSH. The file htable.c contains a variant suitable for the double hashing algorithm.
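A rough Python port of the function above, in case it helps to experiment with it. This is a sketch under a few assumptions: make_s_block and sbox_hash are names invented here, the table is filled from the operating system's random source on each run, and input bytes are treated as unsigned (in the C code a signed char would sign-extend bytes above 127).

```python
import random

MASK32 = 0xFFFFFFFF

def make_s_block(rng=None):
    """Build a table of 256 random 32-bit values (fresh on every run by default)."""
    rng = rng or random.SystemRandom()
    return [rng.getrandbits(32) for _ in range(256)]

def sbox_hash(key: bytes, s_block) -> int:
    """Rotate-add hash with a table lookup as the nonlinear step."""
    h = 0x1F351F35
    for c in key:
        nlf = s_block[(c + h) & 0xFF] ^ c          # same role as the NLF macro
        h = (((h << 7) | (h >> 25)) + nlf) & MASK32
    h ^= h >> 16
    return h ^ (h >> 8)

table = make_s_block()
print(hex(sbox_hash(b"example key", table)))
```

Since the table changes between runs, the hash values are not stable across runs, which is the point when the input comes from an untrusted source.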

What is an effective and efficient hashcode algorithm for a histogram?

I have a histogram that is a vector/list of numbers. What is an easy and efficient algorithm for obtaining a hashcode of such a histogram? The hash code just needs to split the images on the hash value and not to compare images.
This application has no concerns on security, so cryptographic functions are unnecessarily slow.
The way to hash a list is to combine the hashes for each item. Java implements the hash function for a list like so:
public int hashCode() {
int hashCode = 1;
for (E e : this)
hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
return hashCode;
}
Notable properties:
The hash code for every empty list is exactly 1.
The hash code for lists with different numbers of elements are very likely to be different.
The hash code for lists with the same number of elements is more likely to collide. The function is order-sensitive, though: [1,2] hashes to 994 and [2,1] to 1024, so merely reordering the elements usually changes the hash code.
Collisions are a bit of a drawback, but not as big as you might think at first. Hash tables implement a fallback: they check for hash code equality first and full equality second. If two colliding lists differ near the front, this fallback check is quick. In the worst case, it takes a number of comparisons equal to the length of the lists.
All in all, this would be a pretty good hash function for your use case, even if you use the numeric value of each histogram entry for its hash code. The problem you really want to avoid with hash functions is common-divisibility, meaning you want outputs from your hash function to fall into different buckets of a hash table. The Wikipedia article covers the properties of a good hash function if you want more information.
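Porting the Java loop to Python makes it easy to try specific inputs. This is only a sketch: java_list_hash is a name made up here, it assumes every element is an integer in the 32-bit range (for which Java's Integer.hashCode() is the value itself), and it emulates Java's int wraparound with a mask.

```python
def java_list_hash(items):
    """Mimic Java's List.hashCode() for a list of 32-bit integers."""
    h = 1
    for e in items:
        h = (31 * h + e) & 0xFFFFFFFF      # wrap around like a Java int
    # Reinterpret the 32-bit pattern as a signed value, as Java would report it.
    return h - (1 << 32) if h >= (1 << 31) else h

print(java_list_hash([]))       # 1
print(java_list_hash([1, 2]))   # 994
print(java_list_hash([2, 1]))   # 1024
```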
To obtain a better hash code for a list of numbers, we should look at a better hash code for an individual number, specifically this answer.
int hash(int[] list) {
    int hashCode = 0;
    // Start at index 0 so the first bin is included.
    for (int i = 0; i < list.length; i++) {
        hashCode = hashCode + list[i];
        // Java has no unsigned int, so use the unsigned shift >>> instead.
        hashCode = ((hashCode >>> 16) ^ hashCode) * 0x45d9f3b;
        hashCode = ((hashCode >>> 16) ^ hashCode) * 0x45d9f3b;
        hashCode = ((hashCode >>> 16) ^ hashCode);
    }
    return hashCode;
}
I think that's a good adaptation, but I'm not an expert.
Regarding the efficiency of overflow, it's not a major slowdown unless you have to handle exceptions for it. In Java, arithmetic will never throw an overflow exception, instead just wrapping it around to the min or max value. There is no real drawback to having a negative hashcode, as long as your implementation of a hashtable supports it.
I may not be understanding your problem correctly, but hash maps already exist in MATLAB; they just have a different name:
containers.Map

reflexive hash?

Is there a class of hash algorithms, whether theoretical or practical, such that an algorithm in the class might be considered 'reflexive' according to the definition given below:
hash1 = algo1 ( "input text 1" )
hash1 = algo1 ( "input text 1" + hash1 )
The + operator might be concatenation or any other specified operation to combine the output (hash1) back into the input ("input text 1") so that the algorithm (algo1) will produce exactly the same result. i.e. collision on input and input+output.
The + operator must combine the entirety of both inputs and the algo may not discard part of the input.
The algorithm must produce 128 bits of entropy in the output.
It may, but need not, be cryptographically hard to reverse the output back to one or both possible inputs.
I am not a mathematician, but a good answer might include a proof of why such a class of algorithms cannot exist. This is not an abstract question, however. I am genuinely interested in using such an algorithm in my system, if one does exist.
Sure, here's a trivial one:
def algo1(input):
    sum = 0
    for i in input:
        sum += ord(i)
    return chr(sum % 256) + chr(-sum % 256)
Concatenate the result and the "hash" doesn't change. It's pretty easy to come up with something similar when you can reverse the hash.
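A quick way to convince yourself is to run the check directly. The snippet below is a Python 3 restatement of the function above (renamed variables so the built-ins sum and input are not shadowed), followed by the reflexivity test; the sample text is just the one from the question.

```python
def algo1(data: str) -> str:
    """Two-character 'hash': the character sum mod 256 and its negation mod 256."""
    total = sum(ord(ch) for ch in data)
    return chr(total % 256) + chr(-total % 256)

text = "input text 1"
digest = algo1(text)

# The two output characters add up to 0 or 256, so appending them leaves
# the sum unchanged modulo 256.
assert algo1(text + digest) == digest
```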
Yes, you can get this effect with a CRC.
What you need to do is:
1. Implement an algorithm that will find a sequence of N input bits leading from one given state (of the N-bit CRC accumulator) to another.
2. Compute the CRC of your input in the normal way. Note the final state (call it A).
3. Using the function implemented in (1), find a sequence of bits that leads from A to A. This sequence is your hash code. You can now append it to the input.
[Initial state] >- input string -> [A] >- hash -> [A] ...
Here is one way to find the hash. (Note: there is an error in the numbers in the CRC32 example, but the algorithm works.)
And here's an implementation in Java. Note: I've used a 32-bit CRC (smaller than the 128 bits you specify) because that's implemented in the standard library, but with third-party library code you can easily extend it to larger hashes.
public static byte[] hash(byte[] input) {
CRC32 crc = new CRC32();
crc.update(input);
int reg = ~ (int) crc.getValue();
return delta(reg, reg);
}
public static void main(String[] args) {
byte[] prefix = "Hello, World!".getBytes(Charsets.UTF_8);
System.err.printf("%s => %s%n", Arrays.toString(prefix), Arrays.toString(hash(prefix)));
byte[] suffix = hash(prefix);
byte[] combined = ArrayUtils.addAll(prefix, suffix);
System.err.printf("%s => %s%n", Arrays.toString(combined), Arrays.toString(hash(combined)));
}
private static byte[] delta(int from, int to) {
ByteBuffer buf = ByteBuffer.allocate(8);
buf.order(ByteOrder.LITTLE_ENDIAN);
buf.putInt(from);
buf.putInt(to);
for (int i = 8; i-- > 4;) {
int e = CRCINVINDEX[buf.get(i) & 0xff];
buf.putInt(i - 3, buf.getInt(i - 3) ^ CRC32TAB[e]);
buf.put(i - 4, (byte) (e ^ buf.get(i - 4)));
}
return Arrays.copyOfRange(buf.array(), 0, 4);
}
private static final int[] CRC32TAB = new int[0x100];
private static final int[] CRCINVINDEX = new int[0x100];
static {
CRC32 crc = new CRC32();
for (int b = 0; b < 0x100; ++ b) {
crc.update(~b);
CRC32TAB[b] = 0xFF000000 ^ (int) crc.getValue();
CRCINVINDEX[CRC32TAB[b] >>> 24] = b;
crc.reset();
}
}
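The same construction can be sketched in Python on top of zlib.crc32. This is my own restatement rather than a line-by-line translation of the Java: delta and reflexive_crc_hash are names chosen for this example, and the table is built directly from the reflected CRC-32 polynomial that zlib uses.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial, as used by zlib

def _make_table():
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ POLY if c & 1 else c >> 1
        table.append(c)
    return table

TABLE = _make_table()
# The top byte of every table entry is distinct, so it identifies the entry.
INV = {entry >> 24: index for index, entry in enumerate(TABLE)}

def delta(start: int, end: int) -> bytes:
    """Return 4 bytes that drive the CRC register from start to end."""
    indices = []
    reg = end
    for _ in range(4):                       # walk backwards from the end state
        i = INV[reg >> 24]
        indices.append(i)
        reg = ((reg ^ TABLE[i]) << 8) & 0xFFFFFFFF
    out = bytearray()
    reg = start
    for i in reversed(indices):              # replay forwards from the start state
        out.append(i ^ (reg & 0xFF))
        reg = (reg >> 8) ^ TABLE[i]
    return bytes(out)

def reflexive_crc_hash(data: bytes) -> bytes:
    reg = zlib.crc32(data) ^ 0xFFFFFFFF      # undo zlib's final bit inversion
    return delta(reg, reg)                   # bytes that lead from A back to A

prefix = b"Hello, World!"
suffix = reflexive_crc_hash(prefix)
assert reflexive_crc_hash(prefix + suffix) == suffix
```

Appending the suffix leaves the CRC register where it was, so hashing the combined input produces the same four bytes again.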
Building on ephemiat's answer, I think you can do something like this:
Pick your favorite symmetric key block cipher (e.g.: AES) . For concreteness, let's say that it operates on 128-bit blocks. For a given key K, denote the encryption function and decryption function by Enc(K, block) and Dec(K, block), respectively, so that block = Dec(K, Enc(K, block)) = Enc(K, Dec(K, block)).
Divide your input into an array of 128-bit blocks (padding as necessary). You can either choose a fixed key K or make it part of the input to the hash. In the following, we'll assume that it's fixed.
def hash(input):
state = arbitrary 128-bit initialization vector
for i = 1 to len(input) do
state = state ^ Enc(K, input[i])
return concatenate(state, Dec(K, state))
This function returns a 256-bit hash. It should be not too hard to verify that it satisfies the "reflexivity" condition with one caveat -- the inputs must be padded to a whole number of 128-bit blocks before the hash is adjoined. In other words, instead of hash(input) = hash(input + hash(input)) as originally specified, we have hash(input) = hash(input' + hash(input)) where input' is just the padded input. I hope this isn't too onerous.
Well, I can tell you that you won't get a proof of nonexistence. Here's an example:
operator+(a,b): compute a 64-bit hash of a, a 64-bit hash of b, and concatenate the bitstrings, returning an 128-bit hash.
algo1: for some 128-bit value, ignore the last 64 bits and compute some hash of the first 64.
Informally, any algo1 that yields the first operand of + as its first step will do. Maybe not as interesting a class as you were looking for, but it fits the bill. And it's not without real-world instances either. Lots of password hashing algorithms truncate their input.
I'm pretty sure that such a "reflexive hash" function (if it did exist in more than the trivial sense) would not be a useful hash function in the normal sense.
For an example of a "trivial" reflexive hash function:
int hash(Object obj) { return 0; }

Lists Hash function

I'm trying to make a hash function so I can tell if two lists with the same size contain the same elements.
For example, this is what I want:
f((1 2 3))=f((1 3 2))=f((2 1 3))=f((2 3 1))=f((3 1 2))=f((3 2 1)).
Any idea how I can approach this problem? I've tried taking the sum of squares of all elements, but it turned out that there are collisions, for example f((2 2 5))=33=f((1 4 4)), which is wrong as the lists are not the same.
I'm looking for a simple approach if there is any.
Sort the list and then:
hash = 0
list.each do |current_element|
  hash = (37 * hash + current_element) % MAX_HASH_VALUE
end
You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.
Possibly you have N, k such that you could still make this work. Good luck.
If you allow occasional collisions, you have lots of options, from simple things like your suggestion (adding the squares of the elements) or XORing the elements together, to more complicated things like sorting them, printing them to a string, and computing the MD5 of the result. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).
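The "sort, print to a string, hash, then verify" variant might look roughly like this in Python. It is a sketch that assumes the elements are integers; multiset_hash and same_elements are names made up for the example, and only the first 32 bits of the MD5 digest are kept.

```python
import hashlib

def multiset_hash(items):
    """Order-insensitive 32-bit hash: sort, serialize, then take part of an MD5."""
    canonical = ",".join(str(x) for x in sorted(items))
    digest = hashlib.md5(canonical.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big")

def same_elements(a, b):
    """Use the hash as a cheap filter, then confirm with a real comparison."""
    if multiset_hash(a) != multiset_hash(b):
        return False                  # different hashes imply different lists
    return sorted(a) == sorted(b)     # equal hashes still need checking

print(same_elements([1, 2, 3], [3, 1, 2]))   # True
print(same_elements([2, 2, 5], [1, 4, 4]))   # False
```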
So you are looking for something that provides these properties:
1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1
2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.
Knuth's Multiplicative Method
In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write hash function. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.
hash(i)=i*2654435761 mod 2^32
Since 2654435761 and 2^32 have no common factors, the multiplication produces a complete mapping of keys to hash results with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary only in the upper bits: as in all multiplications, variations in the upper digits do not influence the lower digits of the result.
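Both claims are easy to check with a few lines of Python. The sketch below uses names of my own choosing (knuth_hash, knuth_unhash) and relies on pow(x, -1, m) for the modular inverse, which needs Python 3.8 or later.

```python
KNUTH = 2654435761
MOD = 1 << 32

def knuth_hash(i: int) -> int:
    return (i * KNUTH) % MOD

# The multiplier is odd, so it has an inverse modulo 2^32 and the mapping
# can be undone, which is another way of saying there is no overlap.
KNUTH_INV = pow(KNUTH, -1, MOD)

def knuth_unhash(h: int) -> int:
    return (h * KNUTH_INV) % MOD

assert knuth_unhash(knuth_hash(123456)) == 123456

# Keys that differ only in their upper bits share the same low bits of the hash.
a, b = 42, 42 + (1 << 24)
assert knuth_hash(a) & 0xFF == knuth_hash(b) & 0xFF
```

The second assertion is the weakness described above: if a table index is taken from the low bits, such keys land in the same bucket, so the upper bits of the product are the better choice for indexing.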
Robert Jenkins' 96 bit Mix Function
Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.
All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.
The following source is the mixing part of the hash function.
int mix(int a, int b, int c)
{
a=a-b; a=a-c; a=a^(c >>> 13);
b=b-c; b=b-a; b=b^(a << 8);
c=c-a; c=c-b; c=c^(b >>> 13);
a=a-b; a=a-c; a=a^(c >>> 12);
b=b-c; b=b-a; b=b^(a << 16);
c=c-a; c=c-b; c=c^(b >>> 5);
a=a-b; a=a-c; a=a^(c >>> 3);
b=b-c; b=b-a; b=b^(a << 10);
c=c-a; c=c-b; c=c^(b >>> 15);
return c;
}
You can read details from here
If all the elements are numbers and they have a maximum, this is not too complicated, you sort those elements and then you put them together one after the other in the base of your maximum+1.
Hard to describe in words...
For example, if your maximum is 9 (that makes it easy to understand), you'd have :
f(2 3 9 8) = f(3 8 9 2) = 2389
If your maximum were 99, you'd have:
f(16 2 76 8) = (0)2081676
In your example with 2,2 and 5, if you know you would never get anything higher than 5, you could "compose" the result in base 6, so that would be :
f(2 2 5) = 2*6^2 + 2*6 + 5 = 89
f(1 4 4) = 1*6^2 + 4*6 + 4 = 64
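In Python this could be written as below. It is a sketch that assumes non-negative integers no larger than the given maximum; positional_hash is a name invented for the example. Note that it only separates lists of the same length, since leading zeros vanish, which matches the question.

```python
def positional_hash(items, maximum):
    """Sort the items and read them as digits of a number in base maximum + 1."""
    base = maximum + 1
    value = 0
    for x in sorted(items):
        value = value * base + x
    return value

print(positional_hash([2, 3, 9, 8], 9))       # 2389
print(positional_hash([16, 2, 76, 8], 99))    # 2081676
print(positional_hash([2, 2, 5], 5))          # 89
print(positional_hash([1, 4, 4], 5))          # 64
```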
Combining hash values is hard, I've found this way (no explanation, though perhaps someone would recognize it) within Boost:
template <class T>
void hash_combine(size_t& seed, T const& v)
{
seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
It should be fast since there is only shifting, additions and xor taking place (apart from the actual hashing).
However, the requirement that the order of the list does not influence the end result means that you first have to sort it, which is an O(N log N) operation, so it may not fit.
Also, since it's impossible to provide a collision-free hash function without more stringent bounds, you'll still have to actually compare the sorted lists whenever the hashes are equal.
I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.
[...] but it turned out that there are collisions
These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem appropriate to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the general case, it cannot, because there are more potential inputs than potential outputs.
Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.
As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as if they were the prime powers in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is a hash value.
For example:
2 1 3
sorted becomes
1 2 3
Treating this as prime powers gives
2^1 * 3^2 * 5^3
which evaluates to
2 * 9 * 125 = 2250
giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that do not overflow the maximum hash value when computed.
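A small Python sketch of this scheme, assuming non-negative integer elements. The helper names first_primes and prime_power_hash are made up here, and the optional modulus parameter stands in for "the maximum hash value".

```python
def first_primes(n):
    """Return the first n primes by trial division (fine for short lists)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def prime_power_hash(items, modulus=None):
    """Use the sorted items as exponents of 2, 3, 5, ... and multiply."""
    value = 1
    for prime, exponent in zip(first_primes(len(items)), sorted(items)):
        value *= prime ** exponent
    return value % modulus if modulus else value

print(prime_power_hash([2, 1, 3]))   # 2250
```

Without a modulus the result is unique for each sorted list, but it grows very quickly; once a modulus is applied, collisions become possible again.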
A naïve approach to solving your essential problem (comparing lists in an order-insensitive manner) is to convert all lists being compared to a set (set in Python or HashSet in Java). This is more effective than making a hash function since a perfect hash seems essential to your problem. For almost any other approach collisions are inevitable depending on input.

Non colliding hash algorithm for strings up to 255 characters

I am looking for a hash-algorithm, to create as close to a unique hash of a string (max len = 255) as possible, that produces a long integer (DWORD).
I realize that 26^255 >> 2^32, but also know that the number of words in the English language is far less than 2^32.
The strings I need to 'hash' would be mostly single words or some simple construct using two or three words.
The answer:
One of the FNV variants should meet your requirements. They're fast, and produce fairly evenly distributed outputs. (Answered by Arachnid)
See here for a previous iteration of this question (and the answer).
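For reference, the 32-bit FNV-1a variant is only a few lines. The sketch below is in Python, uses the standard 32-bit offset basis and prime, and hashes bytes, so a string has to be encoded first (UTF-8 is an assumption on my part).

```python
FNV_OFFSET_BASIS = 0x811C9DC5   # 32-bit FNV offset basis
FNV_PRIME = 0x01000193          # 32-bit FNV prime

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h

print(hex(fnv1a_32("a".encode("utf-8"))))   # 0xe40c292c
```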
One technique is to use a well-known hash algorithm (say, MD5 or SHA-1) and use only the first 32 bits of the result.
Be aware that the risk of hash collisions increases faster than you might expect. For information on this, read about the Birthday Paradox.
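To put rough numbers on that, the usual birthday approximation p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)) can be evaluated directly. The Python sketch below assumes the hash behaves like a uniformly random function; collision_probability is a name chosen for this example.

```python
import math

def collision_probability(n: int, bits: int = 32) -> float:
    """Approximate chance of at least one collision among n random hashes."""
    # -expm1(-x) equals 1 - exp(-x) but is more accurate for small x.
    return -math.expm1(-n * (n - 1) / (2 * 2 ** bits))

for n in (1_000, 10_000, 77_163, 200_000):
    print(n, round(collision_probability(n), 4))
```

With a 32-bit hash, 10,000 strings already give a bit over a 1% chance of some collision, and the chance reaches about 50% at roughly 77,000 strings.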
Ronny Pfannschmidt did a test with common English words yesterday and didn't encounter any collisions among the 10000 words he tested with the Python string hash function. I haven't tested it myself, but that algorithm is very simple and fast, and seems to be optimized for common words.
Here is the implementation:
static long
string_hash(PyStringObject *a)
{
register Py_ssize_t len;
register unsigned char *p;
register long x;
if (a->ob_shash != -1)
return a->ob_shash;
len = Py_SIZE(a);
p = (unsigned char *) a->ob_sval;
x = *p << 7;
while (--len >= 0)
x = (1000003*x) ^ *p++;
x ^= Py_SIZE(a);
if (x == -1)
x = -2;
a->ob_shash = x;
return x;
}
H(key) = [GetHash(key) + 1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))] % hashsize
MSDN article on HashCodes
Java's String.hashCode() can be easily viewed here; its algorithm is
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
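Evaluated with Horner's rule and 32-bit wraparound, the formula becomes a short loop. The Python sketch below uses a made-up name, java_string_hash, and assumes the string contains only characters from the Basic Multilingual Plane, since Java works on UTF-16 code units.

```python
def java_string_hash(s: str) -> int:
    """Compute s[0]*31^(n-1) + ... + s[n-1] with Java int overflow semantics."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF    # wrap around like a Java int
    # Reinterpret as signed, since Java's hashCode() returns a signed int.
    return h - (1 << 32) if h >= (1 << 31) else h

print(java_string_hash("hello"))   # 99162322
```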

Resources