What hash function "hacks" do you use? - data-structures

A hash function can depend on the data. For example (from this article), if your data are all strings and almost all of them have different lengths, then the string length alone could be a very good hash function (not very realistic, I know). Or, for example, real numbers from 0 to 1 could use a simple hash function:
value * sizeOfHashTable
I am interested in whether you use hash functions that are tailor-made for your inputs. Any more examples?

As you correctly noted, the hash function depends on the data being hashed.
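For instance, the value * sizeOfHashTable idea from the question could be sketched like this in Python. The function name and the clamping of 1.0 into the last bucket are my own assumptions, not something the question specifies:

```python
def bucket_for_unit_value(value: float, table_size: int) -> int:
    """Map a value in [0, 1] to a bucket index in [0, table_size - 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    # int() truncates toward zero; min() keeps value == 1.0 inside the table,
    # since 1.0 * table_size would otherwise index one past the end.
    return min(int(value * table_size), table_size - 1)
```

This only spreads keys evenly if the values themselves are roughly uniform over the interval.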
A common approach to designing a good hash function is to satisfy three conditions:
The function must be cheap to compute. It can be better to use a mediocre hash that is quick to compute: the time saved on hashing may outweigh the time lost to imbalanced buckets or longer probe paths.
The function must have a good (pseudorandom) distribution on a test dataset. It is a good idea to aim for the "avalanche effect", where changing a single bit of the input changes about half the bits of the output value.
For external input data, the hash function must be "universal", i.e. resistant to deliberate attempts to generate collisions.
My favorite hash function is the following. Before first use, the table S_block must be initialized with random values. It is a good idea to do this on each program run.
const unsigned int S_block[256] = { ... };
/* Non-linear step: S-box lookup indexed by the low byte of (c + h) */
#define NLF(h, c) (S_block[(unsigned char)((c) + (h))] ^ (c))
unsigned int hash(const char *key) {
    unsigned int h = 0x1F351F35;
    char c;
    while ((c = *key++) != 0)
        h = ((h << 7) | (h >> (32 - 7))) + NLF(h, c);  /* rotate left by 7, then mix */
    h ^= h >> 16;
    return h ^ (h >> 8);
}
For a practical example, see the variation of this function in my program emcSSH: the file htable.c contains a version suited to the double hashing algorithm.
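To check the avalanche property on a function like this, one can flip each input bit in turn and count how many output bits change. Below is a rough Python sketch of that experiment. It ports the C code above as I read it (the port should agree with the C version for 7-bit ASCII keys; bytes above 127 may differ where char is signed), and the helper names and the fixed seed are my own assumptions:

```python
import random

MASK32 = 0xFFFFFFFF

def make_s_block(seed=None):
    """Build 256 random 32-bit words; pass a seed only when repeatability is wanted."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(256)]

def sblock_hash(key: bytes, s_block) -> int:
    h = 0x1F351F35
    for c in key:
        rotated = ((h << 7) | (h >> 25)) & MASK32  # rotate left by 7 bits
        nlf = s_block[(c + h) & 0xFF] ^ c          # same as the NLF macro
        h = (rotated + nlf) & MASK32               # emulate 32-bit wraparound
    h ^= h >> 16
    return h ^ (h >> 8)

def mean_flipped_bits(key: bytes, s_block) -> float:
    """Average number of output bits that change per single-bit input flip."""
    base = sblock_hash(key, s_block)
    total = 0
    for i in range(len(key) * 8):
        flipped = bytearray(key)
        flipped[i // 8] ^= 1 << (i % 8)            # flip exactly one input bit
        changed = base ^ sblock_hash(bytes(flipped), s_block)
        total += bin(changed).count("1")
    return total / (len(key) * 8)
```

For a 32-bit hash, a mean close to 16 flipped bits is what the avalanche criterion asks for.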

Related

How to add extra information to a hash code?

Given an array of bytes, there are several well-known good algorithms for calculating a hash code, such as FNV or MD5. (Not talking about cryptography here, just general purpose hash codes.)
Suppose what you have is an array of bytes plus one extra piece of information, a small integer (which is not located next to the array in memory), and you want a hash code based on the whole lot. What's the best way to do this? For example, one could take the hash code of the array and add it to the small integer, or exclusive-or them. But is there a better way?
I think the easiest and most efficient way is to initialize the hash accumulator with your small value and then compute the hash in the ordinary way.
The following example illustrates this approach, computing a hash from an int and a C-style string:
uint32_t hash(const char *str, uint32_t x) {
    char c;
    while ((c = *str++) != 0)
        x = ((x << 5) | (x >> (32 - 5))) + c;
    return x ^ (x >> 16);
}
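The same idea carries over to FNV, which the question mentions. A minimal Python sketch, assuming 32-bit FNV-1a and assuming the small integer is fed in as four little-endian bytes before the array (both choices are mine, for illustration):

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime
MASK32 = 0xFFFFFFFF

def fnv1a_32(data: bytes, state: int = FNV_OFFSET_BASIS) -> int:
    """32-bit FNV-1a, optionally continuing from an existing state."""
    for byte in data:
        state ^= byte
        state = (state * FNV_PRIME) & MASK32
    return state

def hash_with_extra(data: bytes, extra: int) -> int:
    """Hash a byte array together with one small integer."""
    # Mix the integer into the accumulator first, then carry on over the array.
    state = fnv1a_32((extra & MASK32).to_bytes(4, "little"))
    return fnv1a_32(data, state)
```

Because the integer goes through the same mixing steps as the data, it affects every later step, unlike adding or XOR-ing it onto the finished hash.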

What is an effective and efficient hashcode algorithm for a histogram?

I have a histogram that is a vector/list of numbers. What is an easy and efficient algorithm for obtaining a hash code of such a histogram? The hash code only needs to partition the images by hash value, not to compare images.
This application has no security concerns, so cryptographic functions would be unnecessarily slow.
The way to hash a list is to combine the hashes for each item. Java implements the hash function for a list like so:
public int hashCode() {
    int hashCode = 1;
    for (E e : this)
        hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
    return hashCode;
}
Notable properties:
The hash code for every empty list is exactly 1.
The hash codes of lists with different numbers of elements are very likely to be different.
Lists with the same number of elements are more likely to collide. The result does depend on element order, though: with integer elements, [1,2] hashes to 994 and [2,1] hashes to 1024.
Collisions are a drawback, but not as big a one as you might think at first. Hash tables implement a fallback: they check hash code equality first and full equality second. If two colliding lists differ near the front, this fallback check is quick. In the worst case, it takes a number of comparisons equal to the length of the lists.
All in all, this would be a pretty good hash function for your use case, even if you use the numeric value of each histogram entry for its hash code. The problem you really want to avoid with hash functions is common-divisibility, meaning you want outputs from your hash function to fall into different buckets of a hash table. The Wikipedia article covers the properties of a good hash function if you want more information.
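To experiment with this recurrence outside Java, here is a small Python sketch. It assumes every entry is an integer whose hash code is its own value (as with Java's Integer), and it emulates Java's 32-bit signed wraparound by hand; the function names are mine:

```python
def to_int32(value: int) -> int:
    """Wrap an arbitrary Python int to Java's signed 32-bit range."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

def java_list_hash(items) -> int:
    """Same recurrence as the hashCode() above: h = 31*h + element hash."""
    h = 1
    for item in items:
        h = to_int32(31 * h + item)
    return h
```

For example, java_list_hash([1, 2]) is 31*(31*1 + 1) + 2 = 994, while java_list_hash([2, 1]) is 31*(31*1 + 2) + 1 = 1024.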
To obtain a better hash code for a list of numbers, we should look at a better hash code for an individual number, specifically this answer.
// Java has no unsigned int, so use int with the unsigned shift >>>
int hash(int[] list) {
    int hashCode = 0;
    for (int i = 0; i < list.length; i++) {  // start at 0 so the first entry is included
        hashCode = hashCode + list[i];
        hashCode = ((hashCode >>> 16) ^ hashCode) * 0x45d9f3b;
        hashCode = ((hashCode >>> 16) ^ hashCode) * 0x45d9f3b;
        hashCode = ((hashCode >>> 16) ^ hashCode);
    }
    return hashCode;
}
I think that's a good adaptation, but I'm not an expert.
Regarding the cost of overflow: it is not a major slowdown unless you have to handle exceptions for it. In Java, integer arithmetic never throws an overflow exception; the result simply wraps around. There is no real drawback to a negative hash code, as long as your hash table implementation supports it.
I may not be understanding your problem correctly, but hash maps already exist in MATLAB; they just have a different name:
containers.Map

How to generate a seed from an xy coordinate

I've been working on a Perlin noise script but have been having problems creating simple pseudo-random values.
I need to be able to create a seed value from an xy coordinate, but x+y has obvious problems with recurring values. The coordinates also go into negative space, so x^y doesn't work.
Sorry if this has been already answered somewhere else but either I didn't understand or couldn't find it.
Do you want to assign a repeatable random number to each x,y pair?
Using a linear combination (or, in general, any simple function) of x and y as the seed will produce artifacts in the distribution, at least unless the function is very complex.
Try this; I had the same problem and it worked for me:
//seeded random for JS - integer in [0, 2^32)
function irnd2()
{
    var a=1664525;
    var c=1013904223;
    var m=4294967296;
    rnd2.r=(rnd2.r*a+c)%m; //the state is kept as a property of rnd2
    return rnd2.r;
}
//seeded random for JS - double in [0,1)
function rnd2()
{
    var a=1664525;
    var c=1013904223;
    var m=4294967296;
    rnd2.r=(rnd2.r*a+c)%m;
    return rnd2.r/m;
}
rnd2.r=192837463; //default seed
//seed function
function seed2(s)
{
    s=s>0?s:-s;
    rnd2.r=192837463^s;
}
//my smart seed from 2 integers
function myseed(x,y)
{
    seed2(x);//x is integer
    var sx=irnd2();//sx is integer
    seed2(y);//y is integer
    var sy=irnd2();//sy is integer
    seed2(sx^sy);//combine both values with binary xor
}
To use it:
myseed(x,y);
irnd2();
In this way you can obtain a good uncorrelated random sequence.
I use it in JS, but it should also work in other languages, provided the argument of seed and the return value of rnd are integers.
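As a rough illustration of how this carries over to another language, here is a Python sketch of the same scheme. It is not bit-identical to the JS version (JS truncates the XOR operands to signed 32-bit values, Python does not), and the function names are mine:

```python
LCG_A = 1664525
LCG_C = 1013904223
LCG_M = 2 ** 32
BASE_SEED = 192837463

def lcg_step(state: int) -> int:
    """One step of the linear congruential generator used above."""
    return (state * LCG_A + LCG_C) % LCG_M

def seed_from(s: int) -> int:
    """Counterpart of seed2(): fold a (possibly negative) integer into a start state."""
    return BASE_SEED ^ abs(s)

def coordinate_seed(x: int, y: int) -> int:
    """Counterpart of myseed(): scramble x and y separately, then combine with XOR."""
    sx = lcg_step(seed_from(x))
    sy = lcg_step(seed_from(y))
    return seed_from(sx ^ sy)
```

Two properties are worth knowing before relying on it: because of the absolute value, (-3, 5) gets the same seed as (3, 5), and because XOR is symmetric, (x, y) and (y, x) share a seed and every point with x == y maps to the same one.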
You need to better define the problem to get an optimal answer.
If your x and y values are relatively small, you could place them into the high and low portions of an integer (assuming the seed in your language is an integer), e.g. for a 32-bit platform:
int seed = (x << 16) + y;
If the seed value is not allowed to be negative (I didn't fully understand what you meant by "negative space" in your question, whether you were referring to geography or the seed value), you can take the absolute value of the seed.
If you meant that the coordinates can have negative values, your best course of action depends on whether you want the same seed for a coordinate and for its inverse.
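A sketch of the packing idea in Python, assuming both coordinates fit in 16 bits and that a coordinate and its inverse should get different seeds. Masking keeps the two's-complement bit pattern of negative values; the function name is mine:

```python
def pack_xy(x: int, y: int) -> int:
    """Pack two signed 16-bit coordinates into one unsigned 32-bit seed."""
    if not (-0x8000 <= x <= 0x7FFF and -0x8000 <= y <= 0x7FFF):
        raise ValueError("coordinates must fit in 16 bits")
    # x goes into the high half, y into the low half; the masks stop a
    # negative y from spilling into the bits reserved for x.
    return ((x & 0xFFFF) << 16) | (y & 0xFFFF)
```

Within that range every coordinate pair gets a distinct seed. Neighbouring points still get neighbouring seeds, so the result may need further scrambling before it is used as noise.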
Take the absolute value of both x and y first; then x^y will work fine. One of the easiest ways to create a pseudo-random source is with time. You might try multiplying x^y by the current system time; this has an extremely low chance of generating recurring seed values, though the seeds then change from run to run, so it only suits cases where repeatable values are not required.
If you know the range of values you have, you could simply cast x and y as strings padded with zeroes, append the two strings, then run the resulting string through a hash function.
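A minimal Python sketch of the padded-string approach. The field width of six characters (including the sign) and the use of zlib.crc32 as the hash function are assumptions picked for illustration; any string hash would do:

```python
import random
import zlib

def padded_key(x: int, y: int, width: int = 6) -> str:
    """Format both coordinates as signed, zero-padded fields and join them."""
    return f"{x:+0{width}d}{y:+0{width}d}"

def seed_from_padded_xy(x: int, y: int, width: int = 6) -> int:
    """Hash the padded coordinate string down to a 32-bit seed."""
    return zlib.crc32(padded_key(x, y, width).encode("ascii"))

# Usage: the same coordinate always yields the same pseudo-random stream.
rng = random.Random(seed_from_padded_xy(3, -7))
value = rng.random()
```

The explicit sign keeps negative coordinates distinct from positive ones, and the fixed width keeps pairs such as (1, 23) and (12, 3) from producing the same string.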
In C#, adapted and improved from alexroat's answer. Just set Random.seed = MyUtils.GetSeedXY(x, y) and you're good to go.
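public static class MyUtils
{
    static int seed2(int _s)
    {
        // 64-bit arithmetic so that s * a + c cannot overflow before the modulo.
        // Note: Math.Abs throws OverflowException for int.MinValue.
        long s = 192837463 ^ System.Math.Abs(_s);
        const long a = 1664525;
        const long c = 1013904223;
        const long m = 4294967296;
        return unchecked((int) ((s * a + c) % m));
    }
    public static int GetSeedXY(int x, int y)
    {
        // Different multipliers keep (x, y) and (y, x) from sharing a seed
        int sx = seed2(x * 1947);
        int sy = seed2(y * 2904);
        return seed2(sx ^ sy);
    }
}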

reflexive hash?

Is there a class of hash algorithms, whether theoretical or practical, such that an algorithm in the class might be considered 'reflexive' according to the definition given below:
hash1 = algo1 ( "input text 1" )
hash1 = algo1 ( "input text 1" + hash1 )
The + operator might be concatenation or any other specified operation to combine the output (hash1) back into the input ("input text 1") so that the algorithm (algo1) will produce exactly the same result. i.e. collision on input and input+output.
The + operator must combine the entirety of both inputs and the algo may not discard part of the input.
The algorithm must produce 128 bits of entropy in the output.
It may, but need not, be cryptographically hard to reverse the output back to one or both possible inputs.
I am not a mathematician, but a good answer might include a proof of why such a class of algorithms cannot exist. This is not an abstract question, however. I am genuinely interested in using such an algorithm in my system, if one does exist.
Sure, here's a trivial one:
def algo1(input):
    total = 0  # avoid shadowing the built-in sum()
    for i in input:
        total += ord(i)
    # The two output characters add up to 0 modulo 256
    return chr(total % 256) + chr(-total % 256)
Concatenate the result to the input and the "hash" doesn't change. It's pretty easy to come up with something similar when you can reverse the hash.
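A quick way to convince yourself is to run the check directly. This is a self-contained Python 3 sketch of the same function (names lightly changed) followed by the reflexivity test:

```python
def algo1(text: str) -> str:
    """Two-character checksum: the character sum mod 256 and its negation mod 256."""
    total = sum(ord(ch) for ch in text)
    return chr(total % 256) + chr(-total % 256)

message = "input text 1"
digest = algo1(message)
# The two digest characters sum to 0 modulo 256, so appending them
# leaves the running sum modulo 256 (and therefore the digest) unchanged.
assert algo1(message + digest) == digest
```

It satisfies the reflexivity condition but obviously not the 128-bit requirement; it only shows that the condition is not contradictory in itself.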
Yes, you can get this effect with a CRC.
What you need to do is:
1. Implement an algorithm that will find a sequence of N input bits leading from one given state (of the N-bit CRC accumulator) to another.
2. Compute the CRC of your input in the normal way. Note the final state (call it A).
3. Using the function implemented in (1), find a sequence of bits that leads from A to A. This sequence is your hash code. You can now append it to the input.
[Initial state] >- input string -> [A] >- hash -> [A] ...
Here is one way to find the hash. (Note: there is an error in the numbers in the CRC32 example, but the algorithm works.)
And here's an implementation in Java. Note: I've used a 32-bit CRC (smaller than the 128 bits you specify) because that's implemented in the standard library, but with third-party library code you can easily extend it to larger hashes.
public static byte[] hash(byte[] input) {
CRC32 crc = new CRC32();
crc.update(input);
int reg = ~ (int) crc.getValue();
return delta(reg, reg);
}
public static void main(String[] args) {
byte[] prefix = "Hello, World!".getBytes(Charsets.UTF_8);
System.err.printf("%s => %s%n", Arrays.toString(prefix), Arrays.toString(hash(prefix)));
byte[] suffix = hash(prefix);
byte[] combined = ArrayUtils.addAll(prefix, suffix);
System.err.printf("%s => %s%n", Arrays.toString(combined), Arrays.toString(hash(combined)));
}
private static byte[] delta(int from, int to) {
ByteBuffer buf = ByteBuffer.allocate(8);
buf.order(ByteOrder.LITTLE_ENDIAN);
buf.putInt(from);
buf.putInt(to);
for (int i = 8; i-- > 4;) {
int e = CRCINVINDEX[buf.get(i) & 0xff];
buf.putInt(i - 3, buf.getInt(i - 3) ^ CRC32TAB[e]);
buf.put(i - 4, (byte) (e ^ buf.get(i - 4)));
}
return Arrays.copyOfRange(buf.array(), 0, 4);
}
private static final int[] CRC32TAB = new int[0x100];
private static final int[] CRCINVINDEX = new int[0x100];
static {
CRC32 crc = new CRC32();
for (int b = 0; b < 0x100; ++ b) {
crc.update(~b);
CRC32TAB[b] = 0xFF000000 ^ (int) crc.getValue();
CRCINVINDEX[CRC32TAB[b] >>> 24] = b;
crc.reset();
}
}
Building on ephemiat's answer, I think you can do something like this:
Pick your favorite symmetric key block cipher (e.g.: AES) . For concreteness, let's say that it operates on 128-bit blocks. For a given key K, denote the encryption function and decryption function by Enc(K, block) and Dec(K, block), respectively, so that block = Dec(K, Enc(K, block)) = Enc(K, Dec(K, block)).
Divide your input into an array of 128-bit blocks (padding as necessary). You can either choose a fixed key K or make it part of the input to the hash. In the following, we'll assume that it's fixed.
def hash(input):
state = arbitrary 128-bit initialization vector
for i = 1 to len(input) do
state = state ^ Enc(K, input[i])
return concatenate(state, Dec(K, state))
This function returns a 256-bit hash. It should be not too hard to verify that it satisfies the "reflexivity" condition with one caveat -- the inputs must be padded to a whole number of 128-bit blocks before the hash is adjoined. In other words, instead of hash(input) = hash(input + hash(input)) as originally specified, we have hash(input) = hash(input' + hash(input)) where input' is just the padded input. I hope this isn't too onerous.
Well, I can tell you that you won't get a proof of nonexistence. Here's an example:
operator+(a,b): compute a 64-bit hash of a, a 64-bit hash of b, and concatenate the bitstrings, returning a 128-bit hash.
algo1: for some 128-bit value, ignore the last 64 bits and compute some hash of the first 64.
Informally, any algo1 that recovers the first operand of + as its first step will do. Maybe not as interesting a class as you were looking for, but it fits the bill. And it's not without real-world instances either: lots of password hashing algorithms truncate their input.
I'm pretty sure that such a "reflexive hash" function (if it did exist in more than the trivial sense) would not be a useful hash function in the normal sense.
For an example of a "trivial" reflexive hash function:
int hash(Object obj) { return 0; }

Non colliding hash algorithm for strings up to 255 characters

I am looking for a hash algorithm that creates as close to a unique hash of a string (max length = 255) as possible and produces a long integer (DWORD).
I realize that 26^255 >> 2^32, but also know that the number of words in the English language is far less than 2^32.
The strings I need to 'hash' would be mostly single words or some simple construct using two or three words.
The answer:
One of the FNV variants should meet your requirements. They're fast, and produce fairly evenly distributed outputs. (Answered by Arachnid)
See here for a previous iteration of this question (and the answer).
One technique is to use a well-known hash algorithm (say, MD5 or SHA-1) and use only the first 32 bits of the result.
Be aware that the risk of hash collisions increases faster than you might expect. For information on this, read about the Birthday Paradox.
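To put rough numbers on that, the usual birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) can be evaluated directly. The sketch below assumes an ideal, uniformly distributed 32-bit hash (N = 2^32), which real hash functions only approximate; the function name is mine:

```python
import math

def collision_probability(n_keys: int, hash_bits: int = 32) -> float:
    """Approximate chance that at least two of n_keys share a hash value."""
    buckets = 2 ** hash_bits
    # expm1 keeps precision when the exponent is close to zero
    return -math.expm1(-n_keys * (n_keys - 1) / (2 * buckets))
```

Under that assumption, 10,000 keys already have a bit over a 1% chance of containing a collision, and the chance passes 50% at roughly 77,000 keys, far below the 2^32 possible values.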
Ronny Pfannschmidt ran a test with common English words yesterday and did not encounter any collisions among the 10000 words he tested with the Python string hash function. I haven't tested it myself, but that algorithm is very simple and fast, and seems to be optimized for common words.
Here is the implementation:
static long
string_hash(PyStringObject *a)
{
register Py_ssize_t len;
register unsigned char *p;
register long x;
if (a->ob_shash != -1)
return a->ob_shash;
len = Py_SIZE(a);
p = (unsigned char *) a->ob_sval;
x = *p << 7;
while (--len >= 0)
x = (1000003*x) ^ *p++;
x ^= Py_SIZE(a);
if (x == -1)
x = -2;
a->ob_shash = x;
return x;
}
H(key) = [GetHash(key) + 1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))] % hashsize
MSDN article on HashCodes
Java's String.hashCode() can be easily viewed here; its algorithm is
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
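Here ^ means exponentiation, and the sum is taken in 32-bit signed arithmetic. A small Python sketch of the same computation in Horner form, assuming the string only contains characters from the Basic Multilingual Plane (Java hashes UTF-16 code units, which then coincide with Python's code points); the function name is mine:

```python
def java_string_hash(s: str) -> int:
    """Evaluate s[0]*31^(n-1) + ... + s[n-1] with Java's int wraparound."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep only the low 32 bits
    # Reinterpret the 32-bit pattern as a signed value, as Java's int does
    return h - 0x100000000 if h >= 0x80000000 else h
```

For example, "abc" gives 97*31^2 + 98*31 + 99 = 96354.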
