How to create a unique hash that will match any string's permutations - algorithm

Given a string abcd, how can I create a unique hashing method that will hash those 4 characters to match bcad or any other permutation of the letters abcd?
Currently I have this code
long hashString(string a) {
    long hashed = 0;
    for (int i = 0; i < a.length(); i++) {
        hashed += a[i] * 7; // multiplied by a prime to make the hash more unique?
    }
    return hashed;
}
Now this will not work because "ad" will hash to the same value as "bc" (since 'a' + 'd' == 'b' + 'c').
I know you can make it more unique by multiplying the position of the letter by the letter itself, hashed += a[i] * i, but then the string will no longer hash to the same value as its permutations.
Is it possible to create a hash that achieves this?
Edit
Some have suggested sorting the strings before hashing them, which is a valid answer, but sorting takes O(n log n) time and I am looking for a hash function that runs in O(n) time.
I am looking to do this in O(1) memory.

Create an array of 26 integers, corresponding to letters a-z. Initialize it to 0. Scan the string from beginning to end, and increment the array element corresponding to the current letter. Note that up to this point the algorithm has O(n) time complexity and O(1) space complexity (since the array size is a constant).
Finally, hash the contents of the array using your favorite hash function.
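For illustration, a minimal C++ sketch of this approach; the choice of FNV-1a as the final mixing step is ours, and any decent hash over the 26 counts would do:

#include <string>
#include <cstdint>

// Hash of the letter histogram: order-independent, O(n) time, O(1) space.
// Assumes lowercase a-z input; the final mix is FNV-1a over the counts.
uint64_t anagramHash(const std::string& s) {
    uint32_t counts[26] = {0};
    for (char c : s)
        counts[c - 'a']++;              // one pass to build the histogram

    uint64_t h = 14695981039346656037ULL;   // FNV-1a offset basis
    for (uint32_t c : counts) {
        h ^= c;
        h *= 1099511628211ULL;              // FNV-1a prime
    }
    return h;
}

With this, anagramHash("abcd") == anagramHash("bcad") by construction.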

The basic thing you can do is sort the strings before applying the hash function. So, to compute the hash of "adbc" or "dcba" you instead compute the hash of "abcd".
If you want to make sure that there are no collisions in your hash function, then the only way is to have the hash result be a string. There are many more strings than there are 32-bit (or 64-bit) integers, so collisions are inevitable (though unlikely with a good hash function).
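For concreteness, a sketch of the sort-then-hash idea in C++ (std::hash stands in for whichever hash function you already use; note the sort itself costs O(n log n), which the question's edit rules out):

#include <string>
#include <algorithm>
#include <functional>

// Canonicalize by sorting, then hash: all permutations map to one value.
size_t permutationHash(std::string s) {
    std::sort(s.begin(), s.end());
    return std::hash<std::string>{}(s);
}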

Easiest way to understand: sort the letters in the string, and then hash the resulting string.
Some variations on your original idea also work, like:
long hashString(string a) {
    long hashed = 0;
    for (int i = 0; i < a.length(); i++) {
        long t = a[i] * 16777619L; // 16777619 is the 32-bit FNV prime; widen to long to avoid overflow
        hashed += t ^ (t >> 8);
    }
    return hashed;
}

I suppose you need a hash such that two anagrams will hash to the same value. I'd suggest you sort them first and use any common hash function such as MD5. I wrote the following code in Scala:
import java.security.MessageDigest
def hash(s: String) = {
  MessageDigest.getInstance("MD5").digest(s.sorted.getBytes)
}
Note that in Scala:
scala> "hello".sorted
res0: String = ehllo
scala> "cinema".sorted
res1: String = aceimn

Synopsis: store a histogram of the letters in the hash value.
Step 1: compute a histogram of the letters (since a histogram uniquely identifies the letters in the string without regard to the order of the letters).
int histogram[26] = {0};   // must be zero-initialized
for (int i = 0; i < a.length(); i++)
    histogram[a[i] - 'a']++;
Step 2: pack the histogram into the hash value. You have several options here. Which option to choose depends on what sort of limitations you can put on the strings.
If you knew that each letter would appear no more than 3 times, then it takes 2 bits to represent the count, so you could create a 52-bit hash that's guaranteed to be unique.
If you're willing to use a 128-bit hash, then you've got 5 bits for 24 letters, and 4 bits for 2 letters (e.g. q and z). The 128-bit hash allows each letter to appear 31 times (15 times for q and z).
But if you want a fixed sized hash, say 16-bit, then you need to pack the histogram into those 16 bits in a way that reduces collisions. The easiest way to do that is to create a 26 byte message (one byte for each entry in the histogram, allowing each letter to appear up to 255 times). Then take the 16-bit CRC of the message, using your favorite CRC generator.
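As a concrete illustration of the first option, a hypothetical sketch in C++ packing the 2-bit counts into a 52-bit value:

#include <string>
#include <cstdint>

// Packs a 2-bit count per letter into the low 52 bits of a uint64_t.
// Collision-free *only* while no letter appears more than 3 times.
uint64_t packedHistogramHash(const std::string& s) {
    int histogram[26] = {0};
    for (char c : s)
        histogram[c - 'a']++;

    uint64_t hash = 0;
    for (int i = 0; i < 26; i++)
        hash |= (uint64_t)(histogram[i] & 3) << (2 * i);
    return hash;
}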

Related

Hash function required for custom data structure containing 12 integers

I have a custom structure that holds 12 integer values: x1, y1, x2, y2, x3, y3, x4, y4, x5, y5, x6, y6.
The range of the numbers is between 1 and 5 inclusive, and every structure is guaranteed to hold a different combination, i.e. no two structures have the same values for all of x1, y1, ..., x6, y6.
I need a good hash function to perform O(1) operations.
The requirement is to find a structure with specific x1, y1, ..., x6, y6 values.
Right now I am using the following:
struct Hash_6
{
    size_t operator () (const Node& n) const
    {
        int result = 17;
        result = 31 * result + n.x1;
        result = 31 * result + n.x2;
        result = 31 * result + n.x3;
        result = 31 * result + n.x4;
        result = 31 * result + n.x5;
        result = 31 * result + n.x6;
        result = 31 * result + n.y1;
        result = 31 * result + n.y2;
        result = 31 * result + n.y3;
        result = 31 * result + n.y4;
        result = 31 * result + n.y5;
        result = 31 * result + n.y6;
        return result;
    }
};
I want to know if there is a better, more efficient hash function out there which I could use for this specific case.
If the values are always between one and five inclusive, then you can get a unique hash within a 32-bit value.
That's because five (the values) to the power of twelve (the number of variables) is 244,140,625, a value that can be represented in 28 bits.
Hence your hash function becomes (pseudo-code):
def hasher(s):
    res = s.x1 - 1
    for val in s.x2, s.x3, s.x4, s.x5, s.x6, s.y1, s.y2, s.y3, s.y4, s.y5, s.y6:
        res = res * 5 + val - 1
    return res
With your constraints, you get a unique value out of that hash function.
If you wanted to use that hash for bucket selection (such as used in a set or dictionary), you would probably want to reduce it with a modulus to a more suitable value (introducing collisions as part of the process).
But it's unclear whether you need a hash for identification (leave it as is) or bucketing (reduce it). If the latter, and the values are reasonably evenly distributed, that would be along the lines of:
bucket_to_use = hasher(item) modulo num_buckets
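A C++ rendering of that pseudo-code might look like this (assuming the Node struct from the question):

// Perfect hash for Node: treats the 12 values (each 1..5) as digits of a
// base-5 number, yielding a unique value below 5^12 = 244,140,625.
size_t hasher(const Node& n) {
    const int vals[12] = { n.x1, n.x2, n.x3, n.x4, n.x5, n.x6,
                           n.y1, n.y2, n.y3, n.y4, n.y5, n.y6 };
    size_t res = 0;
    for (int v : vals)
        res = res * 5 + (v - 1);   // map 1..5 to the base-5 digit 0..4
    return res;
}

Since 5^12 fits in 28 bits, the result is unique per struct; reduce it modulo the bucket count only if you need bucketing.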

What is an effective and efficient hashcode algorithm for a histogram?

I have a histogram that is a vector/list of numbers. What is an easy and efficient algorithm for obtaining a hash code of such a histogram? The hash code just needs to split the images into buckets by hash value, not to compare images.
This application has no concerns on security, so cryptographic functions are unnecessarily slow.
The way to hash a list is to combine the hashes for each item. Java implements the hash function for a list like so:
public int hashCode() {
    int hashCode = 1;
    for (E e : this)
        hashCode = 31 * hashCode + (e == null ? 0 : e.hashCode());
    return hashCode;
}
Notable properties:
The hash code for every empty list is exactly 1.
Hash codes for lists with different numbers of elements are very likely to be different.
Hash codes for lists with the same number of elements are more likely to collide. Note that lists with the same elements in a different order do not generally collide here (the multiplier makes the hash order-sensitive), but unrelated lists can: for example, the lists [0, 31] and [1, 0] unfortunately share the hash code 992.
This is a bit of a drawback, but not as big as you might think at first. Hash tables implement a fallback where they check for hash-code equality first and full equality second. If the difference between two colliding lists occurs near the front, this fallback check is quick. In the worst case, it only takes a number of comparisons equal to the length of the lists.
All in all, this would be a pretty good hash function for your use case, even if you use the numeric value of each histogram entry as its hash code. The thing you really want to avoid with hash functions is common divisibility, meaning you want the outputs of your hash function to fall into different buckets of a hash table. The Wikipedia article covers the properties of a good hash function if you want more information.
To obtain a better hash code for a list of numbers, we should look at a better hash code for an individual number, specifically this answer.
unsigned int hash(unsigned int[] list) {
    unsigned int hashCode = 0;
    for (int i = 0; i < list.length; i++) {   // start at 0 so the first element is included
        hashCode = hashCode + list[i];
        hashCode = ((hashCode >> 16) ^ hashCode) * 0x45d9f3b;
        hashCode = ((hashCode >> 16) ^ hashCode) * 0x45d9f3b;
        hashCode = ((hashCode >> 16) ^ hashCode);
    }
    return hashCode;
}
I think that's a good adaptation, but I'm not an expert.
Regarding the efficiency of overflow: it's not a major slowdown unless you have to handle exceptions for it. In Java, arithmetic never throws an overflow exception; it just wraps around (modulo 2^32). There is no real drawback to a negative hash code, as long as your hash table implementation supports it.
I may not be understanding your problem correctly, but hash maps already exist in MATLAB; they just have a different name:
containers.Map

Best way to resize a hash table

I am creating my own hash table implementation for educational purposes.
What would be the best way to increase a hash table size?
I currently double the hash array size.
The hashing function I'm using is: key mod arraysize.
The problem with this is that if the keys are: 2, 4, 6, 8, then the array size will just keep increasing.
What is the best way of overcoming this issue? Is there a better way of increasing a hash table size? Would changing my hashing function help?
NOTE: My keys are all integers!
Hash tables often avoid this problem by making sure that the hash table size is a prime number. When you resize the table, double the size and then round up to the first prime number larger than that. Doing this avoids the clustering problems similar to what you describe.
Now, it does take a little bit of time to find the next prime number, but not a whole lot. When compared to the time involved in rehashing the hash table's contents, finding the next prime number takes almost no time at all. See Optimizing the wrong thing for a description.
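A minimal sketch of that resize rule in C++; trial division is enough here, since the next prime is never far away:

// Returns the first prime strictly greater than twice the current size.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

int nextTableSize(int currentSize) {
    int n = currentSize * 2 + 1;   // double, then search upward
    while (!isPrime(n)) n++;
    return n;
}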
OpenJDK uses powers of 2 for the capacity of a HashMap, which will lead to a lot of collisions if the keys are all multiples of a power of two. It prevents this by applying another hash function on top of the key's hashCode:
/**
 * Applies a supplemental hash function to a given hashCode, which defends
 * against poor quality hash functions. This is critical because HashMap
 * uses power-of-two length hash tables, that otherwise encounter collisions
 * for hashCodes that do not differ in lower bits.
 * Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
If you try to implement your own hash table, here are some tips:
Choose a prime number for the table size if you use mod for the hash function.
Use quadratic probing to find the final position for collisions: h(x,i) = (Hash(x) + i*i) mod TableSize for the i-th collision.
Double the size to the nearest prime number when the hash table gets half full, which you will rarely need to do if your collision function suits your input.
Here is an elegant implementation of quadratic probing:
// find a position to place the key
int findPos(int key, YourHashTable h)
{
    int curPos;
    int collisionNum = 0;
    curPos = key % h.TableSize;
    // while we keep finding collisions
    while (h[curPos] != null && h[curPos] != key)
    {
        // f(i) = i*i = f(i-1) + 2*i - 1
        curPos += 2 * ++collisionNum - 1;
        // replace the expensive mod with a subtraction
        if (curPos >= h.TableSize)
            curPos -= h.TableSize;
    }
    return curPos;
}
Hashing and hash functions are a complex topic, fortunately with lots of online resources.
It is not clear how you determine the array size in the first place.
In the Java HashMap implementation, the size of the underlying array is always a power of 2. This has the slight advantage that you don't need to compute the modulo, but can compute the array index as index = hashValue & (array.length-1) (which is equivalent to a modulo operation when array.length is a power of 2).
Additionally, the HashMap uses some "magic function" to reduce the number of hash collisions for the case that several hash values only differ by a constant factor, as in your example.
The actual size of the array is then determined by a "load factor". (You can even specify this as a constructor parameter of HashMap). When the number of array entries that are occupied exceeds loadFactor * array.length, then the length of the array will be doubled.
This load factor allows a certain trade-off: when the load factor is high (0.9 or so), hash collisions become more likely. When it is low (0.3 or so), hash collisions are less likely, but there is a lot of "wasted" space, because only a few entries of the array are actually occupied at any point in time.
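As a sketch, the two mechanisms described above reduce to a masked index computation and a growth trigger on insert (the names here are illustrative, not from HashMap's source):

#include <cstddef>

// Valid only when capacity is a power of two: equivalent to hashValue % capacity.
size_t indexFor(size_t hashValue, size_t capacity) {
    return hashValue & (capacity - 1);
}

// If true, double the capacity and rehash every entry.
bool shouldResize(size_t occupied, size_t capacity, double loadFactor = 0.75) {
    return occupied > loadFactor * capacity;
}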

clustering words based on their char set

Say there is a word set and I would like to cluster the words based on their char bag (multiset). For example
{tea, eat, abba, aabb, hello}
will be clustered into
{{tea, eat}, {abba, aabb}, {hello}}.
abba and aabb are clustered together because they have the same char bag, i.e. two a and two b.
To make it efficient, a naive way I can think of is to convert each word into a char-count series; for example, abba and aabb will both be converted to a2b2, and tea/eat will be converted to a1e1t1. That way I can build a dictionary and group words with the same key.
Two issues here: first, I have to sort the chars to build the key; second, the string key looks awkward and its performance is not as good as char/int keys.
Is there a more efficient way to solve the problem?
For detecting anagrams you can use a hashing scheme based on the product of prime numbers: A -> 2, B -> 3, C -> 5, etc. This gives "abba" == "aabb" == 36 (but a different letter-to-prime mapping will be better).
See my answer here.
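A sketch of the prime-product scheme in C++; the product overflows 64 bits for longer words, so in practice you would need a big-integer type or modular arithmetic:

#include <string>
#include <cstdint>

// Product-of-primes anagram code: "abba" and "aabb" both give 2*3*3*2 = 36.
uint64_t primeProductHash(const std::string& s) {
    static const uint64_t primes[26] = {
         2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101 };
    uint64_t h = 1;
    for (char c : s)
        h *= primes[c - 'a'];
    return h;
}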
Since you are going to sort words, I assume all character ASCII values are in the range 0-255. Then you can do a counting sort over the words.
The counting sort takes time proportional to the length of the input word, and reconstructing the sorted string from the counts takes O(wordLen). You cannot make this step less than O(wordLen), because you have to iterate over the string at least once: there is no predefined order, and you cannot make any assumptions about a word without looking at all of its characters. Traditional (comparison-based) sorting implementations give you O(n lg n), but non-comparison ones give you O(n).
Iterate over all the words of the list and sort each one using this counting sort. Keep a map from each sorted word to the list of known words it corresponds to. Adding an element to a list takes constant time, so overall the complexity of the algorithm is O(n * avgWordLength).
Here is a sample implementation
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterGen {

    static String sortWord(String w) {
        int freq[] = new int[256];
        for (char c : w.toCharArray()) {
            freq[c]++;
        }
        StringBuilder sortedWord = new StringBuilder();
        // reconstruction is at most O(wordLen)
        for (int i = 0; i < freq.length; ++i) {
            for (int j = 0; j < freq[i]; ++j) {
                sortedWord.append((char) i);
            }
        }
        return sortedWord.toString();
    }

    static Map<String, List<String>> cluster(List<String> words) {
        Map<String, List<String>> allClusters = new HashMap<String, List<String>>();
        for (String word : words) {
            String sortedWord = sortWord(word);
            List<String> cluster = allClusters.get(sortedWord);
            if (cluster == null) {
                cluster = new ArrayList<String>();
            }
            cluster.add(word);
            allClusters.put(sortedWord, cluster);
        }
        return allClusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")));
        System.out.println(cluster(Arrays.asList("moon", "bat", "meal", "tab", "male")));
    }
}
Returns
{aabb=[abba, aabb], ehllo=[hello], aet=[tea, eat]}
{abt=[bat, tab], aelm=[meal, male], mnoo=[moon]}
Using an alphabet of x characters and a maximum word length of y, you can create hashes of (x + y) bits such that every anagram has a unique hash. A value of 1 for a bit means there is another of the current letter, a value of 0 means to move on to the next letter. Here's an example showing how this works:
Let's say we have a 7 letter alphabet(abcdefg) and a maximum word length of 4. Every word hash will be 11 bits. Let's hash the word "fade": 10001010100
The first bit is 1, indicating there is an a present. The second bit indicates that there are no more a's. The third bit indicates that there are no more b's, and so on. Another way to think about this is the number of ones in a row represents the number of that letter, and the total zeroes before that string of ones represents which letter it is.
Here is the hash for "dada": 11000110000
It's worth noting that because there is a one-to-one correspondence between possible hashes and possible anagrams, this is the smallest possible hash guaranteed to give unique hashes for any input, which eliminates the need to check everything in your buckets when you are done hashing.
I'm well aware that using large alphabets and long words will result in a large hash size. This solution is geared towards guaranteeing unique hashes in order to avoid comparing strings. If you can design an algorithm to compute this hash in constant time(given you know the values of x and y) then you'll be able to solve the entire grouping problem in O(n).
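A hypothetical C++ sketch of this encoding for the 26-letter alphabet; a 64-bit value then covers words of up to 38 letters (26 separator bits plus one bit per letter occurrence):

#include <string>
#include <cstdint>

// Unary histogram encoding: for each letter a..z, emit one '1' bit per
// occurrence, then a '0' separator. Unique per anagram class while the
// total bit count (26 + word length) fits in 64 bits.
uint64_t unaryAnagramHash(const std::string& s) {
    int counts[26] = {0};
    for (char c : s)
        counts[c - 'a']++;

    uint64_t h = 0;
    for (int i = 0; i < 26; i++) {
        for (int j = 0; j < counts[i]; j++)
            h = (h << 1) | 1;   // one '1' per occurrence of this letter
        h <<= 1;                // '0' separator moves on to the next letter
    }
    return h;
}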
I would do this in two steps, first sort all your words according to their length and work on each subset separately(this is to avoid lots of overlaps later.)
The next step is harder and there are many ways to do it. One of the simplest would be to assign every letter a number (a = 1, b = 2, etc.) and add up the values for each word, thereby assigning each word an integer. Then you can sort the words according to this integer value, which drastically cuts the number you have to compare.
Depending on your data set you may still have a lot of overlaps ("bad" and "cac" would generate the same integer hash), so you may want to set a threshold: if you have too many words in one bucket, repeat the previous step with another hash (just assign different numbers to the letters). Unless someone has looked at your code and designed a word list to mess you up, this should cut the overlaps to almost none.
Keep in mind that this approach will be efficient when you are expecting small numbers of words to be in the same char bag. If your data is a lot of long words that only go into a couple char bags, the number of comparisons you would do in the final step would be astronomical, and in this case you would be better off using an approach like the one you described - one that has no possible overlaps.
One thing I've done that's similar to this, but allows for collisions, is to sort the letters, then get rid of duplicates. So in your example, you'd have buckets for "aet", "ab", and "ehlo".
Now, as I say, this allows for collisions. So "rod" and "door" both end up in the same bucket, which may not be what you want. However, the collisions will be a small set that is easily and quickly searched.
So once you have the string for a bucket, you'll notice you can convert it into a 32-bit integer (at least for ASCII). Each letter in the string becomes a bit in a 32-bit integer. So "a" is the first bit, "b" is the second bit, etc. All (English) words make a bucket with a 26-bit identifier. You can then do very fast integer compares to find the bucket a new words goes into, or find the bucket an existing word is in.
Count the frequency of characters in each of the strings, then build a hash key from the frequency table. For example, for the strings aczda and aacdz we get 20110000000000000000000001 (2 a's, 0 b's, 1 c, 1 d, ..., 1 z). Using a hash table we can partition all these strings into buckets in O(N).
26-bit integer as a hash function
If your alphabet isn't too large, for instance, just lower case English letters, you can define this particular hash function for each word: a 26 bit integer where each bit represents whether that English letter exists in the word. Note that two words with the same char set will have the same hash.
Then just add them to a hash table. It will automatically be clustered by hash collisions.
It will take O(max length of the word) to calculate a hash, and insertion into a hash table is constant time. So the overall complexity is O(max length of a word * number of words)
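A sketch of this 26-bit set hash in C++:

#include <string>
#include <cstdint>

// One bit per distinct letter: "rod" and "door" map to the same value,
// clustering words by character *set* (duplicate letters are ignored).
uint32_t charSetHash(const std::string& word) {
    uint32_t bits = 0;
    for (char c : word)
        bits |= 1u << (c - 'a');
    return bits;
}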

Most common substring of length X

I have a string s and I want to search for the substring of length X that occurs most often in s. Overlapping substrings are allowed.
For example, if s="aoaoa" and X=3, the algorithm should find "aoa" (which appears 2 times in s).
Does an algorithm exist that does this in O(n) time?
You can do this using a rolling hash in O(n) time (assuming good hash distribution). A simple rolling hash would be the xor of the characters in the string, you can compute it incrementally from the previous substring hash using just 2 xors. (See the Wikipedia entry for better rolling hashes than xor.) Compute the hash of your n-x+1 substrings using the rolling hash in O(n) time. If there were no collisions, the answer is clear - if collisions happen, you'll need to do more work. My brain hurts trying to figure out if that can all be resolved in O(n) time.
Update:
Here's a randomized O(n) algorithm. You can find the top hash in O(n) time by scanning the hashtable (keeping it simple, assume no ties). Find one X-length string with that hash (keep a record in the hashtable, or just redo the rolling hash). Then use an O(n) string searching algorithm to find all occurrences of that string in s. If you find the same number of occurrences as you recorded in the hashtable, you're done.
If not, that means you have a hash collision. Pick a new random hash function and try again. If your hash function has log(n)+1 bits and is pairwise independent [Prob(h(s) == h(t)) <= 1/2^{log(n)+1} if s != t], then the probability that the most frequent X-length substring in s has a collision with the <= n other length-X substrings of s is at most 1/2. So if there is a collision, pick a new random hash function and retry; you will need only a constant expected number of tries before you succeed.
Now we only need a randomized pairwise independent rolling hash algorithm.
Update2:
Actually, you need 2log(n) bits of hash to avoid all (n choose 2) collisions because any collision may hide the right answer. Still doable, and it looks like hashing by general polynomial division should do the trick.
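For reference, a minimal C++ sketch of a polynomial rolling hash over every length-X window; the base b and modulus m are illustrative fixed constants, whereas the randomized scheme above would draw them at random:

#include <string>
#include <vector>
#include <cstdint>

// Hashes all length-X windows of s in O(n) total time using
// h' = (h - s[i]*b^(X-1)) * b + s[i+X], everything mod m.
// Assumes 1 <= X <= s.size().
std::vector<uint64_t> windowHashes(const std::string& s, size_t X) {
    const uint64_t b = 257, m = 1000000007ULL;
    uint64_t h = 0, bPow = 1;                   // bPow ends as b^(X-1) mod m
    for (size_t i = 0; i < X; i++) {
        h = (h * b + (unsigned char)s[i]) % m;
        if (i + 1 < X) bPow = bPow * b % m;
    }
    std::vector<uint64_t> hashes = {h};
    for (size_t i = X; i < s.size(); i++) {
        h = (h + m - (unsigned char)s[i - X] * bPow % m) % m;  // drop oldest char
        h = (h * b + (unsigned char)s[i]) % m;                 // add newest char
        hashes.push_back(h);
    }
    return hashes;
}

For s = "aoaoa" and X = 3 the windows are "aoa", "oao", "aoa", so the first and third hashes match.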
I don't see an easy way to do this in strictly O(n) time, unless X is fixed and can be considered a constant. If X is a parameter to the algorithm, then most simple ways of doing this will actually be O(n*X), as you will need to do comparison operations, string copies, hashes, etc., on a substring of length X at every iteration.
(I'm imagining, for a minute, that s is a multi-gigabyte string, and that X is some number over a million, and not seeing any simple ways of doing string comparison, or hashing substrings of length X, that are O(1), and not dependent on the size of X)
It might be possible to avoid string copies during scanning, by leaving everything in place, and to avoid re-hashing the entire substring -- perhaps by using an incremental hash algorithm where you can add a byte at a time, and remove the oldest byte -- but I don't know of any such algorithms that wouldn't result in huge numbers of collisions that would need to be filtered out with an expensive post-processing step.
Update
Keith Randall points out that this kind of hash is known as a rolling hash. It still remains, though, that you would have to store the starting string position for each match in your hash table, and then verify after scanning the string that all of your matches were true. You would need to sort the hashtable, which could contain n-X entries, based on the number of matches found for each hash key, and verify each result -- probably not doable in O(n).
It should be O(n*m) where m is the average length of a string in the list. For very small values of m the algorithm will approach O(n):
Build a hashtable mapping each substring to its count.
Iterate over your collection of strings, updating the hashtable accordingly, and keep the current most prevalent count in an integer variable separate from the hashtable.
Done.
Naive solution in Python
from collections import defaultdict
from operator import itemgetter
def naive(s, X):
    freq = defaultdict(int)
    for i in range(len(s) - X + 1):
        freq[s[i:i+X]] += 1
    return max(freq.iteritems(), key=itemgetter(1))
print naive("aoaoa", 3)
# -> ('aoa', 2)
In plain English
Create mapping: substring of length X -> how many times it occurs in the s string
for i in range(len(s) - X + 1):
    freq[s[i:i+X]] += 1
Find a pair in the mapping with the largest second item (frequency)
max(freq.iteritems(), key=itemgetter(1))
Here is a version I did in C. Hope that it helps.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *string = "aoaoa", *maxstring = NULL, *tmpstr = NULL, *tmpstr2 = NULL;
    unsigned int n = 3, i = 0, j = 0, matchcount = 0, maxcount = 0;

    for (i = 0; i <= (strlen(string) - n); i++) {
        tmpstr = (char *)malloc(n + 1);
        strncpy(tmpstr, string + i, n);
        tmpstr[n] = '\0';               /* terminate inside the buffer, not past it */
        for (j = 0; j <= (strlen(string) - n); j++) {
            tmpstr2 = (char *)malloc(n + 1);
            strncpy(tmpstr2, string + j, n);
            tmpstr2[n] = '\0';
            if (!strcmp(tmpstr, tmpstr2))
                matchcount++;
            free(tmpstr2);              /* release each inner buffer */
        }
        if (matchcount > maxcount) {
            free(maxstring);            /* drop the previous best, if any */
            maxstring = tmpstr;
            maxcount = matchcount;
        } else {
            free(tmpstr);
        }
        matchcount = 0;
    }
    printf("max string: \"%s\", count: %u\n", maxstring, maxcount);
    free(maxstring);
    return 0;
}
You can build a tree of sub-strings. The idea is to organise your sub-strings like a telephone book. You then look up the sub-string and increase its count by one.
In your example above, the tree will have sections (nodes) starting with the letters: 'a' and 'o'. 'a' appears three times and 'o' appears twice. So those nodes will have a count of 3 and 2 respectively.
Next, under the 'a' node a sub-node of 'o' will appear corresponding to the sub-string 'ao'. This appears twice. Under the 'o' node 'a' also appears twice.
We carry on in this fashion until we reach the end of the string.
A representation of the tree for 'abac' might be (nodes on the same level are separated by a comma, sub-nodes are in brackets, counts appear after the colon).
a:2(b:1(a:1(c:1())),c:1()),b:1(a:1(c:1())),c:1()
If the tree is drawn out it will be a lot more obvious! What this all says, for example, is that the string 'aba' appears once, and the string 'a' appears twice, etc. But storage is greatly reduced and, more importantly, retrieval is greatly sped up (compare this to keeping a list of sub-strings).
To find out which sub-string is most repeated, do a depth first search of the tree, every time a leaf node is reached, note the count, and keep a track of the highest one.
The running time is probably something like O(log(n)) not sure, but certainly better than O(n^2).
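A minimal C++ sketch of such a count trie, restricted to the question's fixed window length X (each root-to-depth-X path is one substring, and the count at depth X is its number of occurrences):

#include <string>
#include <map>
#include <memory>

struct TrieNode {
    int count = 0;
    std::map<char, std::unique_ptr<TrieNode>> next;
};

// Insert every length-X window of s into the trie.
void insertWindows(TrieNode& root, const std::string& s, size_t X) {
    for (size_t i = 0; i + X <= s.size(); i++) {
        TrieNode* node = &root;
        for (size_t j = 0; j < X; j++) {
            auto& child = node->next[s[i + j]];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->count++;   // one more occurrence of this length-X substring
    }
}

Building this takes O(n * X) time and space rather than strictly O(n), so it is linear only if X is treated as a constant.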
Python-3 Solution:
from collections import Counter

string = "aoaoa"  # example input
K = 3             # K is the substring length

substrings = [string[i:j] for i in range(len(string))
              for j in range(i + 1, len(string) + 1)
              if len(string[i:j]) == K]
# now find the most common value in this list
# you can do this natively, but I prefer using collections
most_frequent = Counter(substrings).most_common(1)[0][0]
print(most_frequent)
Here is the native way to get the most common (for those that are interested):
most_occurrences = 0
current_most = ""
for sub in substrings:
    frequency = substrings.count(sub)
    if frequency > most_occurrences:
        most_occurrences = frequency
        current_most = sub
print(f"{current_most}, Occurrences: {most_occurrences}")
See also: Extract K length substrings (GeeksforGeeks): https://www.geeksforgeeks.org/python-extract-k-length-substrings/
LZW algorithm does this
This is exactly what the Lempel-Ziv-Welch (LZW, used in the GIF image format) compression algorithm does. It finds prevalent repeated byte sequences and replaces them with something shorter.
LZW on Wikipedia
There's no way to do this in O(n).
Feel free to downvote me if you can prove me wrong on this one, but I've got nothing.
