I am trying to understand the HashTable data structure. I understand that in a hash table we first use a hash function to convert a key to a hash code, and then use the modulo operator to convert the hash code to an integer index, which gives the location in the hash table where the data is placed.
At a high level, is the flow like this?
Key -> Hash Function -> Hash code -> Modulo operator -> integer index -> Store in HashTable
Since the key is stored based on the index emitted by the modulo operator, my question is: what is the underlying data structure used to hold the actual data? Is it an array, since an array can be accessed by index?
Can anyone help me understand this?
Though it completely depends on the implementation, I would agree that the underlying data structure is an array of linked lists, since an array is convenient for accessing elements at low cost, while the linked lists are necessary to handle hash collisions.
Here is an example of how it is implemented in the Java OpenJDK Hashtable.
Initially it creates an array with the initial capacity:
table = new Entry<?,?>[initialCapacity];
It checks the capacity threshold every time a new element is added. When the threshold is reached, it rehashes and creates a new array roughly double the size of the old one (the loop that re-links the existing entries into the new array is elided below):
int newCapacity = (oldCapacity << 1) + 1;
if (newCapacity - MAX_ARRAY_SIZE > 0) {
    if (oldCapacity == MAX_ARRAY_SIZE)
        // Keep running with MAX_ARRAY_SIZE buckets
        return;
    newCapacity = MAX_ARRAY_SIZE;
}
Entry<?,?>[] newMap = new Entry<?,?>[newCapacity];
modCount++;
threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
table = newMap;
Hashtable's Entry forms a linked list. It is used in case of hash collisions: the index for two different keys can be the same, and the required value is then found by walking the linked list.
private static class Entry<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Entry<K,V> next;
    // ...
}
You may want to check other, simpler implementations of hash tables for a better understanding.
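For instance, here is a minimal sketch of the array-of-linked-lists idea (my own illustrative code, not the JDK's; names like SimpleHashTable are made up):
import java.util.Objects;

// Minimal illustrative hash table: an array of buckets, where each bucket
// is a singly linked list of (key, value) nodes.
class SimpleHashTable<K, V> {
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = (Node<K, V>[]) new Node[16];

    // Key -> hash code -> non-negative value -> modulo -> array index.
    private int indexFor(K key) {
        return (key.hashCode() & 0x7FFFFFFF) % buckets.length;
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) { n.value = value; return; } // key already present: overwrite
        }
        buckets[i] = new Node<>(key, value, buckets[i]); // empty bucket or collision: prepend to the chain
    }

    public V get(K key) {
        for (Node<K, V> n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) return n.value; // compare stored keys, not just hashes
        }
        return null;
    }
}
A real implementation, like the JDK code above, would also track the size and rehash into a larger array once the load factor threshold is exceeded.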
We're learning about hash tables in my data structures and algorithms class, and I'm having trouble understanding separate chaining.
I know the basic premise: each bucket has a pointer to a Node that contains a key-value pair, and each Node contains a pointer to the next (potential) Node in the current bucket's mini linked list. This is mainly used to handle collisions.
Now, suppose for simplicity that the hash table has 5 buckets. Suppose I wrote the following lines of code in my main after creating an appropriate hash table instance.
myHashTable["rick"] = "Rick Sanchez";
myHashTable["morty"] = "Morty Smith";
Let's imagine whatever hashing function we're using just so happens to produce the same bucket index for both string keys rick and morty. Let's say that bucket index is index 0, for simplicity.
So at index 0 in our hash table, we have two nodes with values of Rick Sanchez and Morty Smith, in whatever order we decide to put them in (the first pointing to the second).
When I want to display the corresponding value for rick, which is Rick Sanchez per our code here, the hashing function will produce the bucket index of 0.
How do I decide which node needs to be returned? Do I loop through the nodes until I find the one whose key matches rick?
To resolve hash table conflicts, that is, to put or get an item whose hash value collides with another one's, you fall back to the data structure that backs each bucket of the hash table implementation; this is generally a linked list. A collision is the worst case for the hash table structure, and you end up with an O(n) operation to reach the correct item in the linked list. That's it: a loop, as you said, that searches for the item with the matching key. But in cases where a bucket is backed by a data structure like a balanced tree, the search can take O(log n) time, as in the Java 8 implementation.
As JEP 180: Handle Frequent HashMap Collisions with Balanced Trees says:
The principal idea is that once the number of items in a hash bucket grows beyond a certain threshold, that bucket will switch from using a linked list of entries to a balanced tree. In the case of high hash collisions, this will improve worst-case performance from O(n) to O(log n).
This technique has already been implemented in the latest version of the java.util.concurrent.ConcurrentHashMap class, which is also slated for inclusion in JDK 8 as part of JEP 155. Portions of that code will be re-used to implement the same idea in the HashMap and LinkedHashMap classes.
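For reference, in the JDK 8 HashMap source this threshold appears as a set of constants (values as of JDK 8):
static final int TREEIFY_THRESHOLD = 8;     // chain length at which a bucket becomes a tree
static final int UNTREEIFY_THRESHOLD = 6;   // tree size at which a bucket reverts to a list
static final int MIN_TREEIFY_CAPACITY = 64; // minimum table size before any bucket is treeified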
I strongly suggest always looking at an existing implementation; to name one, you could look at the Java 7 implementation. That will improve your code-reading skills, something that is almost more important than writing code, and that you certainly do more often. I know it takes more effort, but it will pay off.
For example, take a look at the Hashtable.get method from Java 7:
public synchronized V get(Object key) {
    Entry<?,?> tab[] = table;
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % tab.length;
    for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
        if ((e.hash == hash) && e.key.equals(key)) {
            return (V)e.value;
        }
    }
    return null;
}
Here we see that the check if ((e.hash == hash) && e.key.equals(key)) is what finds the correct item with the matching key.
And here is the full source code: Hashtable.java
I have a hash table. For instance, I have two entities like
john = { 1stname: john, 2ndname: johnson },
eric = { 1stname: eric, 2ndname: ericson }
Then I put them in the hash table:
ht["john"] = john;
ht["eric"] = eric;
Let's imagine there is a collision, and the hash table uses chaining to fix it. As a result there should be a linked list with these two entities, the first pointing to the second.
How does the hash table understand which entity should be returned for a key? The hash values are the same, and it knows nothing about the entities' structure. For instance, if I write var val = ht["john"];, how does the hash table (having only the key and its hash) find out that the value should be the john record and not eric?
I think what you are confused about is what is stored at each location in the hash table's bucket lists. It seems like you assume that only the value is stored. In fact, the data in each list node is a tuple (key, value).
Once you ask for ht['john'], the hash table finds the list associated with hash('john'), and if the list is not empty, it searches for the key 'john' in the list. If the key is found as the first element of a tuple, then the value (the second element of the tuple) is returned. If the key is not found, the element is not in the hash table.
To summarize, the key's hash is used to quickly identify the cell in which the element should be stored, if present; actual key equality is then tested to decide whether the key exists.
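As a tiny sketch of that lookup (illustrative names, not any particular library's code):
// Each chain node stores the key AND the value.
class Node {
    String key;
    Object value;
    Node next;
}

static Object lookup(Node bucketHead, String key) {
    for (Node n = bucketHead; n != null; n = n.next) {
        if (n.key.equals(key))
            return n.value; // "john" matches the node that stored key "john", not eric's
    }
    return null; // key not present in the hash table
}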
Is this what you are asking about? I have already put this in the comments, but it seems to me you did not follow the link.
Collision Resolution in the Hashtable Class
Recall that when inserting an item into or retrieving an item from a hash table, a collision can occur. When inserting an item, an open slot must be found. When retrieving an item, the actual item must be found if it is not in the expected location. Earlier we briefly examined two collision resolution strategies:
Linear probing
Quadratic probing
The Hashtable class uses a different technique referred to as rehashing. (Some sources refer to rehashing as double hashing.)
Rehashing works as follows: there is a set of different hash functions, H1 ... Hn, and when inserting or retrieving an item from the hash table, initially the H1 hash function is used. If this leads to a collision, H2 is tried instead, and so on up to Hn if needed. The previous section showed only one hash function, which is the initial hash function (H1). The other hash functions are very similar to this function, differing only by a multiplicative factor. In general, the hash function Hk is defined as:
Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)))] % hashsize
Mathematical Note: With rehashing it is important that each slot in the hash table is visited exactly once when hashsize probes are made. That is, for a given key you don't want Hi and Hj to hash to the same slot in the hash table. With the rehashing formula used by the Hashtable class, this property is maintained if the result of (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))) and hashsize are relatively prime. (Two numbers are relatively prime if they share no common factors.) These two numbers are guaranteed to be relatively prime if hashsize is a prime number.
Rehashing provides better collision avoidance than either linear or quadratic probing.
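As a rough sketch of how the k-th probe could be computed (my own illustration, not the actual .NET source; getHash stands in for GetHash(key)):
static int probeSlot(int getHash, int k, int hashsize) {
    long h = getHash & 0x7FFFFFFFL;                    // initial hash, made non-negative
    long step = 1 + (((h >> 5) + 1) % (hashsize - 1)); // stride in [1, hashsize - 1]
    return (int) ((h + k * step) % hashsize);          // slot visited on the k-th probe
}
Because any stride in [1, hashsize - 1] is relatively prime to a prime hashsize, probing with k = 0, 1, 2, ... visits every slot exactly once.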
Source: here
I have stream data coming in, and I am maintaining it by pushing items one by one into a heap (priority queue); the resulting heap looks like:
[(a,1), (b,2), (c, 7), (d, 2), ...]
I need to update the items (e.g., change (a,1) to (a,2), or delete (c,7)) continuously over time. To efficiently find and remove items in the heap, I want to construct a hash table that stores the location of every item in the heap.
That way, each time I want to update an item, I can use the hash table to find it and make the change in the heap easily, simultaneously updating the position of every moved item in the hash table.
The same question has been asked in this post: How to implement O(1) deletion on min-heap with hashtable, with C++ code as follows:
template<typename state, typename CmpKey, class dataStructure>
bool AStarOpenClosed<state, CmpKey, dataStructure>::HeapifyUp(unsigned int index)
{
    if (index == 0) return false;
    int parent = (index - 1) / 2;
    CmpKey compare;
    if (compare(elements[theHeap[parent]], elements[theHeap[index]]))
    {
        // Perform normal heap operations
        unsigned int tmp = theHeap[parent];
        theHeap[parent] = theHeap[index];
        theHeap[index] = tmp;
        // Update the element location in the hash table
        elements[theHeap[parent]].openLocation = parent;
        elements[theHeap[index]].openLocation = index;
        HeapifyUp(parent);
        return true;
    }
    return false;
}
I have little experience with C++. Can anyone help explain the idea, or provide a Python version of such an implementation?
My understanding is that the first item in your pair serves as the key, and the second item as the data payload. Then I would propose an approach the other way around, somewhat similar to this answer, but simpler.
Let the hashtable be your primary data structure for data storage, and the min-heap an auxiliary data structure for maintaining the current smallest key in your data set. The operations then look like this (a sketch follows the list):
Insert new item: add the data into both hashtable and min-heap.
Update the value for the given key: update the value in the hashtable only.
Delete the item with the given key: delete the entry with the given key from the hashtable only.
Access the smallest key: if the element at the top of the heap is not found in the hashtable, drop it; repeat, until the top key is present in the hashtable.
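Here is a minimal sketch of that approach (in Java for concreteness, though Python's heapq and dict would follow the same shape; all names are my own):
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// The hashtable is the source of truth; the heap is only a hint for the minimum.
class LazyMinTable {
    private final Map<String, Integer> values = new HashMap<>();         // key -> payload
    private final PriorityQueue<String> minKeys = new PriorityQueue<>(); // keys, smallest first

    void insert(String key, int value) { values.put(key, value); minKeys.add(key); }

    void update(String key, int value) { values.put(key, value); } // heap untouched

    void delete(String key) { values.remove(key); } // heap untouched: stale entry is dropped lazily

    String smallestKey() {
        // Drop heap entries whose key is no longer in the hashtable.
        while (!minKeys.isEmpty() && !values.containsKey(minKeys.peek())) {
            minKeys.poll();
        }
        return minKeys.peek(); // null if everything was deleted
    }
}
Deletion and update become O(1) hashtable operations; the cost of cleaning up stale heap entries is deferred until smallestKey() is called. Note that re-inserting a deleted key leaves a duplicate in the heap, which is harmless here, since both entries pass the containsKey check for the same key.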
If I have a hash table File[string] _subs and want to access its values in a sorted way, is there a better way than simply this:
auto ssubs = new File[_subs.length]; // preallocate sorted subs
size_t ix = 0;
foreach (sub; _subs) {
    ssubs[ix++] = sub; // set new reference to sub
}
ssubs.sort!((a, b) => (a.timeLastModified > b.timeLastModified));
return ssubs;
I would suggest skipping the foreach loop and using .values, like so:
auto ssubs = _subs.values.sort!((a, b) => (a.timeLastModified > b.timeLastModified));
I think it's better not to add dependencies between elements (as a loop with a running index does) when they don't need them; the reason is that it's easier to make the code parallel without them.
If you find yourself doing that operation often, consider trying a RedBlackTree from std.container. It's the usual sortedness vs. runtime complexity tradeoff: whereas accessing and inserting elements into an unordered hashmap (hashset) like an AA is O(1), the same operations on an ordered map (set) like a RedBlackTree are O(log(n)).
Is there any way by which each reducer process can determine the number of elements or records it has to process?
Short answer: ahead of time, no; the reducer has no knowledge of how many values back the iterable. The only way you can do this is to count as you iterate, but you can't then re-iterate over the iterable again.
Long answer: backing the iterable is actually a sorted byte array of the serialized key/value pairs. The reducer has two comparators: one to sort the key/value pairs in key order, and a second to determine the boundary between keys (known as the key grouper). Typically the key grouper is the same as the key ordering comparator.
When iterating over the values for a particular key, the underlying context examines the next key in the array and compares it to the previous key using the grouping comparator. If the comparator determines they are equal, iteration continues; otherwise, iteration for this particular key ends. So you can see that you cannot determine ahead of time how many values you will be passed for any particular key.
You can actually see this in action if you create a composite key, say a Text/IntWritable pair. In the compareTo method, sort first by the Text and then by the IntWritable field. Next, create a Comparator to be used as the group comparator which considers only the Text part of the key. Now, as you iterate over the values in the reducer, you should be able to observe the IntWritable part of the key changing with each iteration.
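For illustration, such a composite key and group comparator might look like the sketch below (my own names, not the pastebin code):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Composite key: full sort order is (text, number); grouping ignores the number.
public class TextIntPair implements WritableComparable<TextIntPair> {
    final Text first = new Text();
    final IntWritable second = new IntWritable();

    @Override
    public int compareTo(TextIntPair o) { // sort comparator: Text first, then IntWritable
        int cmp = first.compareTo(o.first);
        return cmp != 0 ? cmp : second.compareTo(o.second);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }
}

// Group comparator: two keys reach the same reduce() call iff their Text parts match.
class FirstGroupingComparator extends WritableComparator {
    FirstGroupingComparator() {
        super(TextIntPair.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((TextIntPair) a).first.compareTo(((TextIntPair) b).first);
    }
}
Registered via job.setGroupingComparatorClass(FirstGroupingComparator.class), the reducer then receives one reduce() call per distinct Text, while the IntWritable part of the key changes as you step through the values.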
Some code I've used before to demonstrate this scenario can be found on this pastebin.
Your reducer class must extend the MapReduce Reducer class:
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
and then must implement the reduce method using the KEYIN/VALUEIN arguments specified in the extended Reducer class:
reduce(KEYIN key, Iterable<VALUEIN> values,
       org.apache.hadoop.mapreduce.Reducer.Context context)
The values associated with a given key can be counted via
int count = 0;
Iterator<VALUEIN> it = values.iterator();
while (it.hasNext()) {
    it.next();
    count++;
}
Though I'd propose doing this counting alongside your other processing, so as not to make two passes through your value set.
EDIT
Here's an example vector of vectors that will dynamically grow as you add to it (so you won't have to statically declare your arrays, and hence don't need the size of the value set). This will work best for non-regular data (i.e., the number of columns is not the same for every row in your input CSV file), but will have the most overhead.
Vector<Vector<String>> table = new Vector<Vector<String>>();
Iterator<Text> it = values.iterator();
while (it.hasNext()) {
    Text t = it.next();
    String[] cols = t.toString().split(",");
    Vector<String> row = new Vector<String>(); // new vector will be our row
    for (String col : cols) {
        if (StringUtils.isNotEmpty(col)) {
            row.addElement(col); // here we're adding a new column for every value in the csv row
        }
    }
    table.addElement(row);
}
Then you can access the Mth column of the Nth row via
table.get(N).get(M);
Now, if you knew the number of columns was fixed, you could modify this to use a Vector of arrays, which would probably be a little faster and more space-efficient.