Imagine I have an MD5 sum that was calculated from an array of N 64-byte elements. I want to replace the element at an arbitrary index in the source array with a new element. Then, instead of recalculating the MD5 sum by re-running the whole array through an MD5 function, I would like to "subtract" the old element from the result and "add" the new piece of data to it.
To be a bit more clear, here's some pseudo-Scala:
class Block {
  var summary: MD5Result

  // The key reason behind this question is that the elements might not be
  // loaded. With a large array, it can get expensive to load everything just
  // to update a single thing.
  var data: Array[Option[Element]]

  def replaceElement(index: Int, newElement: Element): Unit = {
    // we can know the element that we're replacing
    val oldElement = data(index) match {
      case Some(x) => x
      case None    => loadData(index) // <- this is expensive
    }
    // update the MD5 using this magic function
    summary = replaceMD5(summary, index, oldElement, newElement)
    // keep the source array in sync with the new summary
    data(index) = Some(newElement)
  }
}
Is replaceMD5 implementable? While all signs point to "this is breaking a (weak) cryptographic hash," the actual MD5 algorithm seems to support doing this (but I might be missing something obvious).
I think I better understand what you want to do now. My solution below assumes nothing about MD5 computation, but involves a tradeoff between IO and storing a large number of MD5 hashes. Instead of computing the simple MD5 hash of the entire dataset, it computes a different MD5 hash that nevertheless should have the same important property: that any change to any element (drastically) changes it.
At the outset, decide on a block size b such that:
you can afford to read b values from disk (or whatever IO you're talking about) per change of element, and
you can afford to keep 2n/b MD5 hashes in memory.
Create a binary tree of MD5 hashes. Each leaf in this tree will be the MD5 hash of a size-b block. Each internal node is the MD5 hash of its two children. We will use the hash of the root of this tree as "the" MD5 hash.
When element i changes, read in the b elements in block RoundDown(i/b), compute the new MD5 hash for this, and then propagate the changes up the tree (this will take at most log2(n) steps).
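To make the tree concrete, here is a minimal sketch in Java. It assumes the number of blocks is a power of two and that the per-block (leaf) MD5 hashes are computed elsewhere; the HashTree name and the 1-based heap layout are illustrative choices, not part of the answer itself.

import java.security.MessageDigest;

class HashTree {
    // 1-based heap layout: node i has children 2*i and 2*i + 1;
    // leaves (the per-block MD5 hashes) occupy indices leafCount .. 2*leafCount - 1.
    private final byte[][] nodes;
    private final int leafCount;

    HashTree(byte[][] leafHashes) throws Exception {
        leafCount = leafHashes.length; // assumed to be a power of two
        nodes = new byte[2 * leafCount][];
        System.arraycopy(leafHashes, 0, nodes, leafCount, leafCount);
        for (int i = leafCount - 1; i >= 1; i--)
            nodes[i] = md5(nodes[2 * i], nodes[2 * i + 1]);
    }

    // Replace one block's hash, then fix up its O(log n) ancestors.
    void updateBlock(int blockIndex, byte[] newLeafHash) throws Exception {
        int i = leafCount + blockIndex;
        nodes[i] = newLeafHash;
        for (i /= 2; i >= 1; i /= 2)
            nodes[i] = md5(nodes[2 * i], nodes[2 * i + 1]);
    }

    byte[] root() { return nodes[1]; } // "the" MD5 hash of the whole dataset

    private static byte[] md5(byte[] left, byte[] right) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(left);
        md.update(right);
        return md.digest();
    }
}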
Related
Is there any hashing algorithm which supports this type of hashing?
const hashGen = new HashAlgorithm();
// Imagine the hash has an array-like data structure
hashGen.push(1); // It appends the element to the end of the array
hashGen.push(2);
hashGen.push(3);
hashGen.shift(); // removes the first pushed element, in this case it will be `1`
const hashGen2 = new HashAlgorithm();
hashGen2.push(2);
hashGen2.push(3);
const hashA = hashGen.generate(); // Generates the Hash value
const hashB = hashGen2.generate();
// hashA and hashB should be equal
I know I can use an array to keep track of the pushes. Then on generate I can use some standard hashing algorithm to create the hash.
But in my case I will be calling the generate function a lot of times, after pushing just one number or after removing one element.
I don't want to recalculate the hash of the whole array each time. I just want to recalculate the hash for the changed part (if that makes sense).
I don't need the algorithm to be secure.
Let me know if that's even possible and if any algorithm does that.
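For what it's worth, one classic structure with exactly these properties is a polynomial rolling hash, sketched below in Java. The hash of elements e1..en is e1*B^(n-1) + e2*B^(n-2) + ... + en modulo 2^64 (via long overflow); push appends in O(1) and shift removes the oldest element in O(1). It is not secure, which the question allows. The base B, the class name, and the method names mirroring the question are all illustrative choices, not an established API.

class RollingHash {
    private static final long B = 1000003L;        // arbitrary odd base (an assumption)
    private static final long B_INV = inverse(B);  // B * B_INV == 1 (mod 2^64)

    private final java.util.ArrayDeque<Long> elems = new java.util.ArrayDeque<>();
    private long hash = 0; // e1*B^(n-1) + ... + en, with long overflow acting as mod 2^64
    private long wOld = 0; // B^(n-1), the weight of the oldest element

    void push(long x) {                // append to the end, O(1)
        hash = hash * B + x;
        wOld = elems.isEmpty() ? 1 : wOld * B;
        elems.addLast(x);
    }

    void shift() {                     // remove the first pushed element, O(1)
        long first = elems.removeFirst();
        hash -= first * wOld;
        wOld *= B_INV;                 // divide the weight by B, modulo 2^64
    }

    long generate() { return hash; }   // same contents => same hash

    // Modular inverse of an odd number modulo 2^64 (Newton's iteration).
    private static long inverse(long b) {
        long x = b;                    // correct to 3 bits since b is odd
        for (int i = 0; i < 5; i++) x *= 2 - b * x; // precision doubles each step
        return x;
    }
}

With this, push(1); push(2); push(3); shift(); yields the same generate() value as a fresh instance after push(2); push(3).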
I'm reading this post: https://linux.thai.net/~thep/datrie/. At the beginning of the section Double-Array Trie, it says:
The tripple-array structure for implementing trie appears to be well defined,
but is still not practical to keep in a single file.
The next/check pool may be able to keep in a single array of integer couples,
but the base array does not grow in parallel to the pool,
and is therefore usually split.
What does "the base array ... is therefore usually split" mean, and why?
I'd also like to understand the benefits of using a double-array trie instead of a triple-array trie.
I can partially answer your question.
In a triple-array trie we have three arrays: base, next, and check. The base array contains the distinct states of the trie. The next array stores states many times: once as the starting state of a transition, and again each time another state transitions into them. The check array records which state owns each transition.
One way to model a triple-array trie is a structure with three arrays: base, next, and check. This is a basic implementation:
trie {
    base: array<S>;
    next: array<S>;
    check: array<S>;
}
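To sketch how the three arrays cooperate during a lookup (illustrative Java; the convention that the transition for character c out of state s lives at index base[s] + c follows the datrie article):

class TripleArrayTrie {
    final int[] base, next, check;

    TripleArrayTrie(int[] base, int[] next, int[] check) {
        this.base = base; this.next = next; this.check = check;
    }

    // The transition record for (s, c) lives at t = base[s] + c.
    // It belongs to s only if check[t] == s; its destination is next[t].
    int step(int s, char c) {
        int t = base[s] + c;
        if (t >= 0 && t < check.length && check[t] == s)
            return next[t]; // follow the transition
        return -1;          // no transition from s on c
    }
}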
Because next and check hold meaningful data about the same transition (the destination state and its ownership) at the same index, we can model these data as a pair. So the data structure has two arrays: the base array and the pair array, containing the next and check data in one place:
trie {
    base: array<S>;
    transition: array<Pair>;
}

Pair {
    next: S;
    check: S;
}
We could also have this implementation:
trie {
    transition: array<Triple>;
}

Triple {
    base: S;
    next: S;
    check: S;
}
This is a bad implementation: it looks like the first one, but the base data is duplicated for each transition.
In the second implementation, where base is split from next and check, we can retrieve the next and check data at the same time, and we don't duplicate the base info as in the third.
In the double-array trie, next takes the name base, and the original base array is dropped because it is not really necessary; one less array to store and manage is something valuable.
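For comparison, the double-array lookup needs only base and check, because the slot index itself serves as the next state (again an illustrative Java sketch; the root-state convention is my choice):

class DoubleArrayTrie {
    final int[] base, check;

    DoubleArrayTrie(int[] base, int[] check) {
        this.base = base; this.check = check;
    }

    // Candidate next state for (s, c) is t = base[s] + c;
    // check[t] == s confirms the transition exists.
    int step(int s, char c) {
        int t = base[s] + c;
        if (t >= 0 && t < check.length && check[t] == s)
            return t;  // the slot index doubles as the next state
        return -1;
    }

    // Walk a whole string from the root; -1 means the prefix is absent.
    int walk(String word) {
        int s = 1;     // a common convention for the root state
        for (int i = 0; i < word.length() && s != -1; i++)
            s = step(s, word.charAt(i));
        return s;
    }
}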
I'm implementing a simple Bloom filter as an exercise.
Bloom filters require multiple hash functions, which for practical purposes I don't have.
Assuming I want to have 3 hash functions, isn't it enough to take the murmur3 hash of the object I'm checking membership for, add +1, +2, +3 to it (for the 3 different hashes), and hash each of those again?
As the murmur3 function has a very good avalanche effect (it really spreads out results), wouldn't this be reasonable for all practical purposes?
Pseudo-code:
function generateHashes(obj) {
    long hash = murmur3_hash(obj);
    long hash1 = murmur3_hash(hash + 1);
    long hash2 = murmur3_hash(hash + 2);
    long hash3 = murmur3_hash(hash + 3);
    return (hash1, hash2, hash3);
}
If not, what would be a simple, useful approach to this? I'd like to have a solution that would allow me to easily scale to more hash functions if need be.
AFAIK, the usual approach is to not actually use multiple hash functions. Rather, hash once and split the resulting hash into 2, 3, or however many parts you want for your Bloom filter. So, for example, create a hash of 128 bits and split it into 2 hashes of 64 bits each.
https://github.com/Claudenw/BloomFilter/wiki/Bloom-Filters----An-overview
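A sketch of that split in Java, with MD5 standing in for any 128-bit hash (murmur3_128 would work the same way):

import java.nio.ByteBuffer;
import java.security.MessageDigest;

class SplitHash {
    // Hash once, then carve the 128-bit digest into two 64-bit values.
    static long[] twoHashes(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data); // 16 bytes
        ByteBuffer buf = ByteBuffer.wrap(digest);
        return new long[] { buf.getLong(), buf.getLong() };            // the two halves
    }
}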
The hash functions of a Bloom filter should be independent and random enough. MurmurHash is great for this purpose, so your approach is correct, and you can generate as many new hashes your way as you need. For educational purposes it is fine.
But in the real world, running a hash function multiple times is slow, so the usual approach is to derive additional hashes without actually recalculating the hash.
To correct #memo: this is not done by splitting the hash into multiple parts, as the width of the hash should remain constant (and you can't split a 64-bit hash into more than 64 parts ;) ). The approach is to get two independent hashes and combine them.
function generateHashes(obj) {
    // initialization phase
    long h1 = murmur3_hash(obj);
    long h2 = murmur3_hash(h1);
    int k = 3; // number of desired hash functions
    long[] hash = new long[k];
    // generation phase
    for (int i = 0; i < k; i++) {
        hash[i] = h1 + i * h2;
    }
    return hash;
}
As you see, this way creating a new hash is a simple multiply-add operation.
It would not be a good approach. Let me try to explain. A Bloom filter allows you to test whether an element most likely belongs to a set, or whether it absolutely doesn't. In other words, false positives may occur, but false negatives won't.
Reference: https://sc5.io/posts/what-are-bloom-filters-and-why-are-they-useful/
Let us consider an example:
You have an input string 'foo' and we pass it to the multiple hash functions. The murmur3 hash gives the output K, and the subsequent hashes on this hash value give x, y and z.
Now assume you have another string 'bar' and, as it happens, its murmur3 hash is also K. The remaining hash values? They will also be x, y and z, because in your proposed approach the subsequent hash functions depend not on the input, but on the output of the first hash function.
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
As explained in the link, the purpose is to perform a probabilistic search in a set. If we search for 'foo' or for 'bar', we would say that it is 'likely' that both of them are present, so the percentage of false positives will increase.
In other words, this Bloom filter will behave like a simple hash function. The 'Bloom' aspect of it will not come into the picture, because only the first hash function determines the outcome of a search.
Hope I was able to explain sufficiently. Let me know in the comments if you have more follow-up queries. Would be happy to assist.
We're learning about hash tables in my data structures and algorithms class, and I'm having trouble understanding separate chaining.
I know the basic premise: each bucket has a pointer to a Node that contains a key-value pair, and each Node contains a pointer to the next (potential) Node in the current bucket's mini linked list. This is mainly used to handle collisions.
Now, suppose for simplicity that the hash table has 5 buckets. Suppose I wrote the following lines of code in my main after creating an appropriate hash table instance.
myHashTable["rick"] = "Rick Sanchez";
myHashTable["morty"] = "Morty Smith";
Let's imagine whatever hashing function we're using just so happens to produce the same bucket index for both string keys rick and morty. Let's say that bucket index is index 0, for simplicity.
So at index 0 in our hash table, we have two nodes with values of Rick Sanchez and Morty Smith, in whatever order we decide to put them in (the first pointing to the second).
When I want to display the corresponding value for rick, which is Rick Sanchez per our code here, the hashing function will produce the bucket index of 0.
How do I decide which node needs to be returned? Do I loop through the nodes until I find the one whose key matches rick?
To resolve hash table conflicts, that is, to put or get an item whose hash value collides with another one, you end up falling back to the data structure that backs each bucket of the hash table; this is generally a linked list. A collision is the worst case for the hash table, and it costs an O(n) walk to get to the correct item in the linked list. That is, a loop, as you said, that searches for the item with the matching key. But in the cases where the bucket is backed by something like a balanced search tree, it can take O(log n) time, as in the Java 8 implementation.
As JEP 180: Handle Frequent HashMap Collisions with Balanced Trees says:
The principal idea is that once the number of items in a hash bucket
grows beyond a certain threshold, that bucket will switch from using a
linked list of entries to a balanced tree. In the case of high hash
collisions, this will improve worst-case performance from O(n) to
O(log n).
This technique has already been implemented in the latest version of
the java.util.concurrent.ConcurrentHashMap class, which is also slated
for inclusion in JDK 8 as part of JEP 155. Portions of that code will
be re-used to implement the same idea in the HashMap and LinkedHashMap
classes.
I strongly suggest always looking at some existing implementation. To name one, you could look at the Java 7 implementation. That will improve your code-reading skills, which is almost more important, and something you do more often, than writing code. I know it is more effort, but it will pay off.
For example, take a look at the Hashtable.get method from Java 7:
public synchronized V get(Object key) {
    Entry<?,?> tab[] = table;
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % tab.length;
    for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
        if ((e.hash == hash) && e.key.equals(key)) {
            return (V)e.value;
        }
    }
    return null;
}
Here we see that if ((e.hash == hash) && e.key.equals(key)) is the test that finds the correct item with the matching key.
And here is the full source code: Hashtable.java
I have a hashtable. For instance, I have two entities like
john = { 1stname: john, 2ndname: johnson },
eric = { 1stname: eric, 2ndname: ericson }
Then I put them in hashtable:
ht["john"] = john;
ht["eric"] = eric;
Let's imagine there is a collision and the hashtable uses chaining to fix it. As a result, there should be a linked list containing these two entities.
How does the hashtable understand which entity should be returned for a key? The hash values are the same, and it knows nothing about the entities' structure. For instance, if I write var val = ht["john"];, how does the hashtable (having only the key value and its hash) find out that the value should be the john record and not eric?
I think what you are confused about is what is stored at each location in the hashtable's bucket list. It seems like you assume that only the value is being stored. In fact, the data in each list node is a tuple (key, value).
Once you ask for ht['john'], the hashtable finds the list associated with hash('john') and, if the list is not empty, searches for the key 'john' in it. If the key is found as the first element of a tuple, then the value (the second element of the tuple) is returned. If the key is not found, the element is not in the hashtable.
To summarize, the key's hash is used to quickly identify the cell in which the element should be stored if present. Actual key equality is tested to decide whether the key exists or not.
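A minimal sketch of such a bucket in Java, to make the (key, value) point concrete (the Bucket and Node names are only illustrative):

class Bucket<K, V> {
    static class Node<K, V> {
        final K key;          // the key is stored alongside the value
        V value;
        Node<K, V> next;
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    Node<K, V> head;

    // Walk the chain comparing keys; the hash only chose the bucket,
    // key equality decides which node wins.
    V get(K key) {
        for (Node<K, V> n = head; n != null; n = n.next)
            if (n.key.equals(key)) return n.value;
        return null; // key not present in this bucket
    }
}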
Is this what you are asking for? I have already put this in the comments, but it seems to me you did not follow the link.
Collision Resolution in the Hashtable Class
Recall that when inserting an item into or retrieving an item from a hash table, a collision can occur. When inserting an item, an open slot must be found. When retrieving an item, the actual item must be found if it is not in the expected location. Earlier we briefly examined two collision resolution strategies:
Linear probing
Quadratic probing
The Hashtable class uses a different technique referred to as rehashing. (Some sources refer to rehashing as double hashing.)
Rehashing works as follows: there is a set of different hash functions, H1 ... Hn, and when inserting or retrieving an item from the hash table, initially the H1 hash function is used. If this leads to a collision, H2 is tried instead, and onwards up to Hn if needed. The previous section showed only one hash function, which is the initial hash function (H1). The other hash functions are very similar to this function, differing only by a multiplicative factor. In general, the hash function Hk is defined as:
Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)))] % hashsize
Mathematical Note: With rehashing it is important that each slot in the hash table is visited exactly once when hashsize probes are made. That is, for a given key you don't want Hi and Hj to hash to the same slot in the hash table. With the rehashing formula used by the Hashtable class, this property is maintained if the result of (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))) and hashsize are relatively prime. (Two numbers are relatively prime if they share no common factors.) These two numbers are guaranteed to be relatively prime if hashsize is a prime number.
Rehashing provides better collision avoidance than either linear or quadratic probing.
sources here
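A sketch of that probe sequence in Java (getHash stands in for the GetHash(key) result, assumed non-negative as in the Hashtable class; the class and method names are illustrative):

class RehashingProbe {
    // Slot visited by hash function H(k+1) for a key, following the formula above;
    // k = 0 gives the initial hash GetHash(key) % hashsize.
    static int slot(int getHash, int k, int hashsize) {
        long increment = 1 + (((getHash >> 5) + 1) % (hashsize - 1));
        return (int) ((getHash + k * increment) % hashsize);
    }
}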