Hashing algorithm which has an array-like API

Is there any hashing algorithm which supports this type of usage?
const hashGen = new HashAlgorithm();
// Imagine hash has an array like data structure
hashGen.push(1); // It appends element to end of array
hashGen.push(2);
hashGen.push(3);
hashGen.shift(); // removes the first pushed element, in this case `1`
const hashGen2 = new HashAlgorithm();
hashGen2.push(2);
hashGen2.push(3);
const hashA = hashGen.generate(); // Generates the Hash value
const hashB = hashGen2.generate();
// hashA and hashB should be equal
I know I can use an array to keep track of the pushed elements, and then on generate() use some standard hashing algorithm to create the hash.
But in my case I will be calling the generate function many times, after pushing just one number or after removing just one element.
I don't want to recalculate the hash of the whole array each time; I just want to recalculate the hash for the changed part (if that makes sense).
I don't need the algorithm to be secure.
Let me know if that's even possible and whether any algorithm does that.
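What you describe sounds like a rolling (polynomial) hash, which can be updated incrementally. Below is a minimal Java sketch of that idea; the modulus, base and the class/method names are my own choices rather than any particular library, and the hash is not cryptographically secure.
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of a rolling polynomial hash: push() appends an element and
// shift() drops the oldest one, each updating the hash in O(1) instead of
// rescanning the whole sequence.
class RollingHash {
    private static final long MOD = 2_147_483_647L;  // 2^31 - 1, a Mersenne prime (arbitrary choice)
    private static final long BASE = 1_000_003L;     // arbitrary base smaller than MOD

    private final Deque<Long> elements = new ArrayDeque<>(); // kept so shift() knows what to remove
    private long hash = 0;                                    // current hash of the sequence

    public void push(long value) {
        // every element already present moves up one power of BASE; the new one gets weight 1
        hash = (hash * BASE + Math.floorMod(value, MOD)) % MOD;
        elements.addLast(value);
    }

    public void shift() {
        // the oldest element carries weight BASE^(size-1); subtract its contribution
        long oldest = elements.removeFirst();
        long weight = modPow(BASE, elements.size(), MOD);     // size has already shrunk by one
        hash = Math.floorMod(hash - (Math.floorMod(oldest, MOD) * weight) % MOD, MOD);
    }

    public long generate() {
        return hash;
    }

    private static long modPow(long base, long exp, long mod) {
        long result = 1, b = base % mod;
        while (exp > 0) {
            if ((exp & 1) == 1) result = result * b % mod;
            b = b * b % mod;
            exp >>= 1;
        }
        return result;
    }
}
With this, pushing 1, 2, 3 and then calling shift() produces the same hash as pushing only 2 and 3, which is exactly the equality asked for above.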

Bloom filters and its multiple hash functions

I'm implementing a simple Bloom filter as an exercise.
Bloom filters require multiple hash functions, which for practical purposes I don't have.
Assuming I want to have 3 hash functions, isn't it enough to just take the hash of the object I'm checking membership for (with murmur3), then add +1, +2 and +3 to it (for the 3 different hashes) and hash each of those again?
As the murmur3 function has a very good avalanche effect (it really spreads out results), wouldn't this for all practical purposes be reasonable?
Pseudo-code:
function generateHashes(obj) {
    long hash = murmur3_hash(obj);
    long hash1 = murmur3_hash(hash + 1);
    long hash2 = murmur3_hash(hash + 2);
    long hash3 = murmur3_hash(hash + 3);
    return (hash1, hash2, hash3);
}
If not, what would be a simple, useful approach to this? I'd like a solution that would allow me to easily scale to more hash functions if need be.
AFAIK, the usual approach is to not actually use multiple hash functions. Rather, hash once and split the resulting hash into 2, 3, or however many parts you want for your Bloom filter. So for example, create a 128-bit hash and split it into two 64-bit hashes.
https://github.com/Claudenw/BloomFilter/wiki/Bloom-Filters----An-overview
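For illustration, here is a minimal Java sketch of that "hash once, then split" idea. It uses MD5 only because it ships with the JDK and produces 128 bits; a 128-bit MurmurHash3 variant would be used the same way.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: compute one 128-bit digest and slice it into two 64-bit hashes.
class SplitHash {
    static long[] twoHashes(String key) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8)); // 16 bytes = 128 bits
        ByteBuffer buf = ByteBuffer.wrap(digest);
        return new long[] { buf.getLong(), buf.getLong() };    // first and second 64-bit halves
    }
}
Narrower slices (for example four 32-bit values) can be cut from the same digest in the same way.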
The hash functions of a Bloom filter should be independent and random enough. MurmurHash is great for this purpose, so your approach is correct, and you can generate as many new hashes your way as you like. For educational purposes it is fine.
But in the real world, running the hash function multiple times is slow, so the usual approach is to create ad-hoc hashes without actually recalculating the hash.
To correct #memo: this is done not by splitting the hash into multiple parts, as the width of the hash should remain constant (and you can't split a 64-bit hash into more than 64 parts ;) ). The approach is to take two independent hashes and combine them.
function generateHashes(obj) {
    // initialization phase
    long h1 = murmur3_hash(obj);
    long h2 = murmur3_hash(h1);
    int k = 3;                  // number of desired hash functions
    long[] hash = new long[k];
    // generation phase
    for (int i = 0; i < k; i++) {
        hash[i] = h1 + i * h2;  // i-th derived hash is a multiply-add of the two base hashes
    }
    return hash;
}
As you can see, creating a new hash this way is a simple multiply-and-add operation.
It would not be a good approach. Let me try to explain. A Bloom filter allows you to test whether an element most likely belongs to a set, or whether it absolutely doesn't. In other words, false positives may occur, but false negatives won't.
Reference: https://sc5.io/posts/what-are-bloom-filters-and-why-are-they-useful/
Let us consider an example:
You have an input string 'foo' and you pass it to the multiple hash functions. The murmur3 hash gives the output K, and the subsequent hashes on this hash value give x, y and z.
Now assume you have another string 'bar' and, as it happens, its murmur3 hash is also K. The remaining hash values? They will also be x, y and z, because in your proposed approach the subsequent hash functions depend not on the input, but on the output of the first hash function.
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
As explained in the link, the purpose is to perform a probabilistic search in a set. If we search for 'foo' or for 'bar', we would say that it is 'likely' that both of them are present. So the percentage of false positives will increase.
In other words, this Bloom filter will behave like a simple hash function. The 'Bloom' aspect of it will not come into the picture, because only the first hash function determines the outcome of the search.
Hope I was able to explain sufficiently. Let me know in the comments if you have any more follow-up questions. I'd be happy to assist.

Hashtable underlying placeholder?

I am trying to understand the HashTable data structure. I understand that in a HashTable we first use a hash function to convert a key to a hash code, and then the modulo operator to convert the hash code to an integer index, which is used to get the location in the HashTable where the data is placed.
At a high level, is the flow like this?
Key -> Hash Function -> Hash code -> Modulo operator -> integer index -> Store in HashTable
Since the key is stored based on the index emitted by the modulo operator, my doubt is: what is the underlying data structure which is used to hold the actual data? Is it an array, since an array can be accessed using an index?
Can anyone help me understand this?
Though it completely depends on the implementation, I would agree that the underlying data structure is an array combined with linked lists, since an array is convenient for accessing elements at low cost, while the linked lists are necessary to handle hash collisions.
Here is an example of how it is implemented in the Java OpenJDK Hashtable.
Initially it creates an array with the initial capacity:
table = new Entry<?,?>[initialCapacity];
It checks the capacity threshold every time a new element is added. When the threshold is reached, it performs rehashing and creates a new array which is double the size of the old array:
int newCapacity = (oldCapacity << 1) + 1;
if (newCapacity - MAX_ARRAY_SIZE > 0) {
    if (oldCapacity == MAX_ARRAY_SIZE)
        // Keep running with MAX_ARRAY_SIZE buckets
        return;
    newCapacity = MAX_ARRAY_SIZE;
}
Entry<?,?>[] newMap = new Entry<?,?>[newCapacity];
modCount++;
threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
table = newMap;
A Hashtable Entry forms a linked list. It is used in case of hash collisions, since the index for two different keys can turn out to be the same, and the required value is then found by walking the linked list:
private static class Entry<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Entry<K,V> next;
    // ...
}
You may want to check other, simpler implementations of hash tables for a better understanding.
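To make the array-plus-linked-list idea concrete, here is a minimal hypothetical sketch in Java (not the OpenJDK code, and with no rehashing); the class is my own and its Entry just mirrors the fields shown above.
// Sketch of a tiny chained hash table: an array of buckets, each bucket a linked list of entries.
class SimpleHashtable<K, V> {
    static class Entry<K, V> {
        final int hash;
        final K key;
        V value;
        Entry<K, V> next;
        Entry(int hash, K key, V value, Entry<K, V> next) {
            this.hash = hash; this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[11]; // fixed-size bucket array

    public void put(K key, V value) {
        int hash = key.hashCode();
        int index = (hash & 0x7FFFFFFF) % table.length;              // hash code -> modulo -> bucket index
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.hash == hash && e.key.equals(key)) {
                e.value = value;                                     // key already present: overwrite
                return;
            }
        }
        table[index] = new Entry<>(hash, key, value, table[index]);  // collision: prepend to the chain
    }

    public V get(Object key) {
        int hash = key.hashCode();
        int index = (hash & 0x7FFFFFFF) % table.length;
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.hash == hash && e.key.equals(key)) {
                return e.value;                                      // found in this bucket's chain
            }
        }
        return null;                                                 // no entry for this key
    }
}
This mirrors the Key -> hash code -> modulo -> integer index -> bucket flow from the question.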

Backing out data from an MD5 checksum

Imagine you have an MD5 sum that was calculated from an array of N 64-byte elements. I want to replace an element at an arbitrary index in the source array with a new element. Then, instead of recalculating the MD5 sum by re-running the whole array through an MD5 function, I would like to "subtract" the old element from the result and "add" the new piece of data to it.
To be a bit more clear, here's some pseudo-Scala:
class Block {
  var summary: MD5Result
  // The key reason behind this question is that the elements might not be
  // loaded. With a large array, it can get expensive to load everything just to
  // update a single thing.
  var data: Array[Option[Element]]

  def replaceElement(block: Block, index: Integer, newElement: Element) = {
    // we can know the element that we're replacing
    val oldElement = block.data(index) match {
      case Some(x) => x
      case None => loadData(index) // <- this is expensive
    }
    // update the MD5 using this magic function
    summary = replaceMD5(summary, index, oldElement, newElement)
  }
}
Is replaceMD5 implementable? While all signs point to "this is breaking a (weak) cryptographic hash," the actual MD5 algorithm seems to support doing this (but I might be missing something obvious).
I think I better understand what you want to do now. My solution below assumes nothing about how MD5 is computed, but it involves a tradeoff between I/O and storing a larger number of MD5 hashes. Instead of computing the single MD5 hash of the entire dataset, it computes a different MD5 hash that nevertheless has the same important property: any change to any element (drastically) changes it.
At the outset, decide on a block size b such that
you can afford to read b values from disk (or whatever IO you're talking about) per change of element, and
you can afford to keep 2n/b MD5 hashes in memory.
Create a binary tree of MD5 hashes. Each leaf in this tree will be the MD5 hash of a size-b block. Each internal node is the MD5 hash of its two children. We will use the hash of the root of this tree as "the" MD5 hash.
When element i changes, read in the b elements of block RoundDown(i/b), compute the new MD5 hash for that block, and then propagate the change up the tree (this will take at most log2(n) steps).
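Here is a minimal Java sketch of that tree-of-hashes idea, assuming the blocks are already available as byte arrays and that there is at least one block; the class and method names are illustrative, not from the answer.
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: a binary tree of MD5 hashes stored heap-style in a flat array.
// Node i has children 2i+1 and 2i+2; the leaves occupy the last leafCount slots.
// Replacing one block re-hashes only the path from its leaf to the root.
class Md5HashTree {
    private final byte[][] nodes;   // nodes[0] is the root hash, used as "the" MD5 of the dataset
    private final int leafCount;    // number of size-b blocks

    Md5HashTree(byte[][] blocks) throws NoSuchAlgorithmException {
        leafCount = blocks.length;
        nodes = new byte[2 * leafCount - 1][];
        for (int k = 0; k < leafCount; k++) {
            nodes[leafCount - 1 + k] = md5(blocks[k]);                  // leaf = hash of its block
        }
        for (int i = leafCount - 2; i >= 0; i--) {
            nodes[i] = md5(concat(nodes[2 * i + 1], nodes[2 * i + 2])); // internal node = hash of its children
        }
    }

    byte[] rootHash() {
        return nodes[0];
    }

    // Called after block k has been replaced: re-hash the leaf, then walk up to the root.
    void updateBlock(int k, byte[] newBlock) throws NoSuchAlgorithmException {
        int i = leafCount - 1 + k;
        nodes[i] = md5(newBlock);
        while (i > 0) {
            i = (i - 1) / 2;        // parent index
            nodes[i] = md5(concat(nodes[2 * i + 1], nodes[2 * i + 2]));
        }
    }

    private static byte[] md5(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(data);
    }

    private static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }
}
Only the changed block plus a logarithmic number of small hashes are reread or recomputed per update, which matches the tradeoff described above.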

Count frequency of items in array - without two for loops

Need to know: is there a way to count the frequency of items in an array without using two loops? This is without knowing the size of the array. If I knew the size of the array, I could use a switch without looping, but I need something more versatile than that. I think modifying quicksort may give better results.
Array[n];
TwoDArray[n][2];
The first loop goes over Array[], while the second loop finds the element and increases its count in the two-d array:
int max = 0;                          // number of distinct values found so far
for (int i = 0; i < Array.length; i++) {
    boolean found = false;
    for (int j = 0; j < max; j++) {
        if (TwoDArray[j][0] == Array[i]) {
            TwoDArray[j][1]++;
            found = true;
            break;
        }
    }
    if (!found) {
        TwoDArray[max][0] = Array[i];
        TwoDArray[max][1] = 1;
        max++;
    }
}
If you can comment or provide a better solution, it would be very helpful.
Use a map or hash table to implement this. Insert the key as the array item and the value as its frequency.
Alternatively, you can use an array too if the range of the array elements is not too large: increase the count at the index corresponding to each array element.
I would build a map keyed by the item in the array, with a value that is the count of that item. One pass over the array builds the map that contains the counts: for each item, look its count up in the map, increment the count, and put the new count back into the map.
The map put and get operations can be constant time (e.g., if you use a hash map implementation with a good hash function and properly sized backing store). This means you can compute the frequencies in time proportional to the number of elements in your array.
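A minimal Java sketch of that single pass (the class and method names are illustrative):
import java.util.HashMap;
import java.util.Map;

// One pass over the array: the map holds item -> frequency.
class FrequencyCounter {
    static Map<Integer, Integer> countFrequencies(int[] array) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int value : array) {
            counts.merge(value, 1, Integer::sum);   // look up the current count and increment it
        }
        return counts;
    }
}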
I'm not saying this is better than using a map or hash table (especially not when there are lots of duplicates, though in that case you can get close to O(n) sorting with certain techniques, so this is not too bad either); it's just an alternative (a sketch follows the steps below):
Sort the array
Use a (single) for-loop to iterate through the sorted array
If you find the same element as the previous one, increment the current count
If you find a different element, store the previous element and its count and set the count to 1
At the end of the loop, store the previous element and its count
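A minimal Java sketch of those steps, printing each element with its count (again the names are my own):
import java.util.Arrays;

// Sort first, then count runs of equal elements in a single pass.
class SortAndCount {
    static void printFrequencies(int[] array) {
        if (array.length == 0) return;
        int[] sorted = array.clone();
        Arrays.sort(sorted);                              // equal values become adjacent
        int previous = sorted[0];
        int count = 1;
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] == previous) {
                count++;                                  // same element as the previous one
            } else {
                System.out.println(previous + " -> " + count);
                previous = sorted[i];                     // different element: store it and reset the count
                count = 1;
            }
        }
        System.out.println(previous + " -> " + count);    // emit the final element and its count
    }
}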

Hadoop : Number of input records for reducer

Is there any way by which each reducer process could determine the number of elements or records it has to process?
Short answer: ahead of time, no. The reducer has no knowledge of how many values are backed by the iterable. The only way you can do this is to count as you iterate, but you can't then re-iterate over the iterable again.
Long answer: backing the iterable is actually a sorted byte array of the serialized key/value pairs. The reducer has two comparators: one to sort the key/value pairs in key order, and a second to determine the boundary between keys (known as the key grouper). Typically the key grouper is the same as the key ordering comparator.
When iterating over the values for a particular key, the underlying context examines the next key in the array and compares it to the previous key using the grouping comparator. If the comparator determines they are equal, iteration continues; otherwise iteration for this particular key ends. So you can see that you cannot determine ahead of time how many values you will be passed for any particular key.
You can actually see this in action if you create a composite key, say a Text/IntWritable pair. In the compareTo method, sort first by the Text and then by the IntWritable field. Next, create a Comparator to be used as the group comparator, which only considers the Text part of the key. Now, as you iterate over the values in the reducer, you should be able to observe the IntWritable part of the key changing with each iteration.
Some code I've used before to demonstrate this scenario can be found on this pastebin.
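For illustration only (this is not the pastebin code), a rough sketch of such a composite key, with a grouping comparator that looks only at the Text part, might look like this; the class names are made up:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Composite key: a Text "natural" key plus an IntWritable secondary field.
public class TextIntPair implements WritableComparable<TextIntPair> {
    private final Text first = new Text();
    private final IntWritable second = new IntWritable();

    public Text getFirst() { return first; }
    public IntWritable getSecond() { return second; }

    @Override public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    // Full sort order: by the Text part first, then by the IntWritable part.
    @Override public int compareTo(TextIntPair o) {
        int cmp = first.compareTo(o.first);
        return cmp != 0 ? cmp : second.compareTo(o.second);
    }

    // Grouping comparator that only considers the Text part, so all values whose keys share
    // the same Text arrive in one reduce() call while the IntWritable part keeps changing.
    public static class GroupComparator extends WritableComparator {
        public GroupComparator() { super(TextIntPair.class, true); }
        @Override public int compare(WritableComparable a, WritableComparable b) {
            return ((TextIntPair) a).getFirst().compareTo(((TextIntPair) b).getFirst());
        }
    }
}
The driver would then register it with job.setGroupingComparatorClass(TextIntPair.GroupComparator.class).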
Your reducer class must extend the MapReduce Reducer class:
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
and then must implement the reduce method using the KEYIN/VALUEIN arguments specified in the extended Reducer class:
reduce(KEYIN key, Iterable<VALUEIN> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
The values associated with a given key can be counted via:
int count = 0;
Iterator<VALUEIN> it = values.iterator();
while (it.hasNext()) {
    it.next();
    count++;
}
Though I'd propose doing this counting alongside your other processing, so as not to make two passes through your value set.
EDIT
Here's an example of a vector of vectors that will dynamically grow as you add to it (so you won't have to statically declare your arrays, and hence don't need the size of the values set). This will work best for non-regular data (i.e. the number of columns is not the same for every row in your input CSV file), but will have the most overhead.
Vector<Vector<String>> table = new Vector<Vector<String>>();
Iterator<Text> it = values.iterator();
while (it.hasNext()) {
    Text t = it.next();
    String[] cols = t.toString().split(",");
    int i = 0;
    Vector<String> row = new Vector<String>();           // new vector will be our row
    while (i < cols.length && StringUtils.isNotEmpty(cols[i])) {
        row.addElement(cols[i++]);                       // add a new column for every value in the csv row
    }
    table.addElement(row);
}
Then you can access the Mth column of the Nth row via
table.get(N).get(M);
Now, if you knew the number of columns would be fixed, you could modify this to use a Vector of arrays, which would probably be a little faster and more space efficient.
