Python - hash heap implementation - data-structures

I have a stream of data coming in, and I maintain it by pushing the items one by one into a heap (priority queue). The resulting heap looks like:
[(a,1), (b,2), (c, 7), (d, 2), ...]
I need to update the items continuously over time (e.g., change (a,1) to (a,2), or delete (c,7)). To efficiently find and remove an item in the heap, I want to build a hash table that stores the location of every item in the heap.
That way, each time I want to update an item, I can use the hash table to find it and change it in the heap, while simultaneously updating the positions of the affected items in the hash table.
The same question has been asked in this post: How to implement O(1) deletion on min-heap with hashtable, with C++ code as follows:
template<typename state, typename CmpKey, class dataStructure>
bool AStarOpenClosed<state, CmpKey, dataStructure>::HeapifyUp(unsigned int index)
{
    if (index == 0) return false;
    int parent = (index-1)/2;
    CmpKey compare;
    if (compare(elements[theHeap[parent]], elements[theHeap[index]]))
    {
        // Perform normal heap operations
        unsigned int tmp = theHeap[parent];
        theHeap[parent] = theHeap[index];
        theHeap[index] = tmp;
        // Update the element location in the hash table
        elements[theHeap[parent]].openLocation = parent;
        elements[theHeap[index]].openLocation = index;
        HeapifyUp(parent);
        return true;
    }
    return false;
}
I have little experience with C++. Can anyone help explain the idea, or provide a Python version of such an implementation?
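For readers who want the idea of the C++ snippet in Python, here is a minimal sketch, not a translation of the original class. It assumes each item is a (key, priority) pair ordered by priority, and the names IndexedMinHeap, push, update, remove and pop_min are made up for illustration. The point it shows is the same as in HeapifyUp: every swap inside the heap also rewrites the two affected positions in the hash table.

```python
class IndexedMinHeap:
    """Binary min-heap of [priority, key] entries plus a dict key -> heap index."""

    def __init__(self):
        self._heap = []  # list of [priority, key]
        self._pos = {}   # hash table: key -> current index in self._heap

    def __len__(self):
        return len(self._heap)

    def _swap(self, i, j):
        h = self._heap
        h[i], h[j] = h[j], h[i]
        # Keep the hash table in sync with the heap after every swap.
        self._pos[h[i][1]] = i
        self._pos[h[j][1]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self._heap[i][0] >= self._heap[parent][0]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self._heap)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self._heap[child][0] < self._heap[smallest][0]:
                    smallest = child
            if smallest == i:
                break
            self._swap(i, smallest)
            i = smallest

    def push(self, key, priority):
        if key in self._pos:          # assumption: pushing a known key updates it
            self.update(key, priority)
            return
        self._heap.append([priority, key])
        self._pos[key] = len(self._heap) - 1
        self._sift_up(len(self._heap) - 1)

    def update(self, key, priority):
        i = self._pos[key]            # O(1) lookup instead of scanning the heap
        self._heap[i][0] = priority
        self._sift_up(i)
        self._sift_down(self._pos[key])

    def remove(self, key):
        i = self._pos[key]
        last = len(self._heap) - 1
        if i != last:
            self._swap(i, last)       # move the doomed entry to the end
        self._heap.pop()
        del self._pos[key]
        if i < len(self._heap):       # restore heap order for the moved entry
            moved_key = self._heap[i][1]
            self._sift_up(i)
            self._sift_down(self._pos[moved_key])

    def pop_min(self):
        priority, key = self._heap[0]
        self.remove(key)
        return key, priority
```

Finding an item takes O(1) through the dict; update and remove still cost O(log n) because the heap order has to be restored afterwards.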

My understanding is that the first item in your pair serves as the key, and the second item as the data payload. Then I would propose an approach the other way around, somewhat similar to this answer, but simpler.
Let the hashtable be your primary data structure for data storage and the min-heap be an auxiliary data structure for maintaining the current smallest key in your data set.
Insert new item: add the data into both hashtable and min-heap.
Update the value for the given key: update the value in the hashtable only.
Delete the item with the given key: delete the entry with the given key from the hashtable only.
Access the smallest key: if the element at the top of the heap is not found in the hashtable, drop it; repeat, until the top key is present in the hashtable.
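The four operations above can be sketched in Python with a dict and the standard heapq module. This is only an illustration under the answer's assumption that the heap is ordered by key; the class name LazyMinKeyStore and its method names are hypothetical.

```python
import heapq


class LazyMinKeyStore:
    """Dict as primary storage, min-heap of keys as an auxiliary index."""

    def __init__(self):
        self._data = {}  # key -> value (the source of truth)
        self._heap = []  # keys only; may contain stale (deleted) keys

    def insert(self, key, value):
        if key not in self._data:
            heapq.heappush(self._heap, key)
        self._data[key] = value

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(key)
        self._data[key] = value  # the heap is not touched

    def delete(self, key):
        del self._data[key]      # the stale key stays in the heap for now

    def get(self, key):
        return self._data[key]

    def min_key(self):
        # Drop keys from the top of the heap until one is still in the dict.
        while self._heap and self._heap[0] not in self._data:
            heapq.heappop(self._heap)
        if not self._heap:
            raise KeyError("store is empty")
        return self._heap[0]
```

The trade-off is that deleted keys keep occupying heap space until they reach the top, so the heap can be larger than the dict.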

Related

Hashtable underlying place holder?

I am trying to understand the hash table data structure. I understand that a hash table first uses a hash function to convert a key to a hash code, and then uses the modulo operator to convert the hash code to an integer index, which gives the location in the hash table where the data is placed.
At a high level, is the flow like this?
Key -> Hash Function -> Hash code -> Modulo operator -> integer index -> Store in HashTable
Since the key is stored at the index produced by the modulo operator, my question is: what is the underlying data structure that holds the actual data? Is it an array, since an array can be accessed by index?
Can anyone help me understand this?
Though it depends entirely on the implementation, I would agree that the underlying data structure is typically an array of linked lists: the array gives cheap access to elements by index, while the linked lists handle hash collisions.
Here is an example of how it is implemented in Java's OpenJDK Hashtable.
Initially it creates array with initial capacity:
table = new Entry<?,?>[initialCapacity];
It checks the capacity threshold every time a new element is added. When the threshold is reached, it rehashes and creates a new array of roughly double the size of the old one (2 * oldCapacity + 1):
int newCapacity = (oldCapacity << 1) + 1;
if (newCapacity - MAX_ARRAY_SIZE > 0) {
    if (oldCapacity == MAX_ARRAY_SIZE)
        // Keep running with MAX_ARRAY_SIZE buckets
        return;
    newCapacity = MAX_ARRAY_SIZE;
}
Entry<?,?>[] newMap = new Entry<?,?>[newCapacity];
modCount++;
threshold = (int)Math.min(newCapacity * loadFactor, MAX_ARRAY_SIZE + 1);
table = newMap;
Each Hashtable Entry forms a linked list. It is used in case of hash collisions: when two different keys map to the same index, the required value is found by walking the linked list.
private static class Entry<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Entry<K,V> next;
You may want to look at other, simpler hash table implementations for a better understanding.
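As one such simpler implementation, here is a minimal Python sketch of the same layout: an array of buckets, an index computed as hash modulo capacity, and growth to 2n + 1 buckets once a load threshold is crossed. It is not the OpenJDK code. The class name ChainedHashTable, the default capacity of 11 and the load factor of 0.75 are assumptions chosen for illustration, and each bucket is a plain Python list standing in for the linked list.

```python
class ChainedHashTable:
    """Array of buckets; each bucket holds the [key, value] entries that collide."""

    def __init__(self, initial_capacity=11, load_factor=0.75):
        self._table = [[] for _ in range(initial_capacity)]  # the underlying array
        self._load_factor = load_factor
        self._count = 0

    @staticmethod
    def _index(key, capacity):
        # key -> hash code -> non-negative value -> modulo -> array index
        return (hash(key) & 0x7FFFFFFF) % capacity

    def put(self, key, value):
        bucket = self._table[self._index(key, len(self._table))]
        for entry in bucket:
            if entry[0] == key:      # key already present: overwrite its value
                entry[1] = value
                return
        bucket.append([key, value])  # new key: chain it onto the bucket
        self._count += 1
        if self._count > self._load_factor * len(self._table):
            self._rehash()

    def _rehash(self):
        # Grow to 2n + 1 buckets and redistribute every entry,
        # because the index depends on the capacity.
        new_capacity = len(self._table) * 2 + 1
        new_table = [[] for _ in range(new_capacity)]
        for bucket in self._table:
            for key, value in bucket:
                new_table[self._index(key, new_capacity)].append([key, value])
        self._table = new_table

    def get(self, key, default=None):
        bucket = self._table[self._index(key, len(self._table))]
        for k, v in bucket:
            if k == key:
                return v
        return default
```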

Hash Tables and Separate Chaining: How do you know which value to return from the bucket's list?

We're learning about hash tables in my data structures and algorithms class, and I'm having trouble understanding separate chaining.
I know the basic premise: each bucket has a pointer to a Node that contains a key-value pair, and each Node contains a pointer to the next (potential) Node in the current bucket's mini linked list. This is mainly used to handle collisions.
Now, suppose for simplicity that the hash table has 5 buckets. Suppose I wrote the following lines of code in my main after creating an appropriate hash table instance.
myHashTable["rick"] = "Rick Sanchez";
myHashTable["morty"] = "Morty Smith";
Let's imagine whatever hashing function we're using just so happens to produce the same bucket index for both string keys rick and morty. Let's say that bucket index is index 0, for simplicity.
So at index 0 in our hash table, we have two nodes with values of Rick Sanchez and Morty Smith, in whatever order we decide to put them in (the first pointing to the second).
When I want to display the corresponding value for rick, which is Rick Sanchez per our code here, the hashing function will produce the bucket index of 0.
How do I decide which node needs to be returned? Do I loop through the nodes until I find the one whose key matches rick?
To resolve a hash table collision, that is, to put or get an item whose hash value collides with another one, you fall back to the data structure that backs each bucket in the hash table implementation; this is generally a linked list. A collision is the worst case for a hash table: if every key lands in the same bucket, getting to the correct item is an O(n) walk through the linked list. So yes, it is a loop, as you said, that searches for the item with the matching key. If the bucket is backed by a structure like a balanced tree instead, the search can take O(log n) time, as in the Java 8 HashMap implementation.
As JEP 180: Handle Frequent HashMap Collisions with Balanced Trees says:
The principal idea is that once the number of items in a hash bucket
grows beyond a certain threshold, that bucket will switch from using a
linked list of entries to a balanced tree. In the case of high hash
collisions, this will improve worst-case performance from O(n) to
O(log n).
This technique has already been implemented in the latest version of
the java.util.concurrent.ConcurrentHashMap class, which is also slated
for inclusion in JDK 8 as part of JEP 155. Portions of that code will
be re-used to implement the same idea in the HashMap and LinkedHashMap
classes.
I strongly suggest always looking at an existing implementation; the Java 7 implementation is one example. Doing so improves your code-reading skills, which matter at least as much as writing code, since you read code more often than you write it. It takes more effort, but it pays off.
For example, take a look at the Hashtable.get method from Java 7:
public synchronized V get(Object key) {
    Entry<?,?> tab[] = table;
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % tab.length;
    for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
        if ((e.hash == hash) && e.key.equals(key)) {
            return (V)e.value;
        }
    }
    return null;
}
Here we see that the check if ((e.hash == hash) && e.key.equals(key)) is what finds the entry with the matching key.
And here is the full source code: HashTable.java
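The rick and morty scenario from the question can be replayed in a few lines of Python. This is a minimal sketch in which the collision is forced by hand: both nodes are placed in bucket 0 directly, and the names Node and bucket_get are hypothetical.

```python
class Node:
    """One entry in a bucket's linked list."""

    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


def bucket_get(head, key):
    """Walk the bucket's list and return the value whose key matches."""
    node = head
    while node is not None:
        if node.key == key:  # the stored key tells the colliding entries apart
            return node.value
        node = node.next
    return None              # key is not in this bucket


# Five buckets; pretend both keys hashed to index 0.
buckets = [None] * 5
buckets[0] = Node("rick", "Rick Sanchez", Node("morty", "Morty Smith"))
```

This is why each node stores the key as well as the value: the bucket index alone cannot distinguish rick from morty.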

Which container to use for given situation?

I am working on a problem and need to do the following.
I want to add pairs (p1,q1), (p2,q2), ..., (pn,qn) in such a way that:
(i) A duplicate pair is stored only once (like in a set).
(ii) I keep a count of how many times each pair has been added. For example, the pair (7,2)
is present in the set only once, but if I add it 3 times its count is 3.
Which container is efficient for this problem in C++?
A small example would be great!
Please ask if anything about my problem is unclear, and sorry for my bad English.
How about a std::map<Key, Value> that maps your pairs (Key) to their counts (Value)? As you insert a pair, increment its counter.
#include <map>
#include <utility>
// T1 and T2 stand for the element types of your pairs.
std::map<std::pair<T1, T2>, size_t> pairs_to_count;
std::pair<T1, T2> p1 = /* some value */;
std::pair<T1, T2> p2 = /* some other value */;
pairs_to_count[p1]++;
pairs_to_count[p1]++;
pairs_to_count[p2]++;
pairs_to_count[p2]++;
pairs_to_count[p2]++;
In this code, operator[] automatically adds a key to the map if it does not exist yet. At that moment, it initializes the key's corresponding value to zero. But as you insert, even the first time, that value is incremented.
So already after the first insertion, the count of 1 correctly reflects the number of insertions. That value gets incremented as you insert more.
Later, retrieving the count is a matter of calling operator[] again to get the value associated with a given key. Note that operator[] also inserts a zero entry for a key that is absent, so use find() or count() if a lookup must not modify the map.
size_t const p2_count = pairs_to_count[p2]; // equals 3
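For comparison only, since the question asks about C++: the same counting pattern can be sketched in Python with collections.Counter, where tuples play the role of the pairs. The sample pairs below are made up for illustration.

```python
from collections import Counter

pair_counts = Counter()

# Each pair is stored once as a key; its value counts how often it was added.
for pair in [(7, 2), (7, 2), (1, 5), (7, 2)]:
    pair_counts[pair] += 1

distinct_pairs = len(pair_counts)    # number of different pairs seen
times_seen = pair_counts[(7, 2)]     # how often (7, 2) was added
```

Unlike std::map::operator[], reading a missing key from a Counter returns 0 without inserting it.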

optimizing the data structure implementation

There is a stream of random characters coming like 'a''b''c''a'... and so on. At any given point in time when I query I need to get the first non repeating character. For example, for the input "abca", 'b' should be returned since a is repeated and the first non repeating character is 'b'.
There needs to be two methods, one for inserting and one for querying.
My solution is to have a linkedList to store the incoming stream characters. While I get the next character, I just compare with all the current characters and if present I will not insert into the end of linkedlist, else I will insert at the end. By this approach, the query will take O(1) since I will get the first element on the linkedlist and insert will take O(n) since I need to compare from the first element till the last element in the worst case.
Is there any better performing way?
Either you haven't explained your algorithm well or it won't return the correct result. In the example a b a, would your algorithm return a (because it is the first element in the linked list)?
Anyway, here is a modification that improves performance. The idea is to use a hash map from characters to (doubly) linked list nodes. This map can be used to determine whether a character has already been inserted and to reach the required node quickly. We allow a null value as the map target (instead of a list node) to mark a character that has already occurred more than once.
The insertion method works as follows:
Check if the map contains the current character (O(1)). If not, add it to the end of the list and add a reference to the new node to the map (O(1)).
If the character is already in the map: check if the pointed-to node is null (O(1)). If so, just ignore it. If it is not, remove the pointed-to node from the list and update the reference to a null value (O(1)).
Overall, this is an O(1) operation.
The query works as in your previous solution.
Here is a C# implementation. It's basically a 1:1 translation of the above explanation:
class StreamAnalyzer
{
    LinkedList<char> characterList = new LinkedList<char>();
    Dictionary<char, LinkedListNode<char>> characterMap
        = new Dictionary<char, LinkedListNode<char>>();

    public void AddCharacter(char c)
    {
        LinkedListNode<char> referencedNode;
        if (characterMap.TryGetValue(c, out referencedNode))
        {
            if(referencedNode != null)
            {
                characterList.Remove(referencedNode);
                characterMap[c] = null;
            }
        }
        else
        {
            var node = new LinkedListNode<char>(c);
            characterList.AddLast(node);
            characterMap.Add(c, node);
        }
    }

    public char? GetFirstNonRepeatingCharacter()
    {
        if (characterList.First == null)
            return null;
        else
            return characterList.First.Value;
    }
}
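The same approach can be sketched in Python. This version is an adaptation rather than a line-by-line port: collections.OrderedDict already combines a hash map with a doubly linked list, so it stands in for the list plus dictionary pair, and a separate set takes over the role of the null marker for characters seen more than once. The method names are chosen for illustration.

```python
from collections import OrderedDict


class StreamAnalyzer:
    def __init__(self):
        self._unique = OrderedDict()  # characters seen exactly once, in arrival order
        self._repeated = set()        # characters seen more than once

    def add_character(self, c):
        if c in self._repeated:
            return                    # already known to repeat: ignore
        if c in self._unique:
            del self._unique[c]       # second occurrence: unlink it in O(1)
            self._repeated.add(c)
        else:
            self._unique[c] = None    # first occurrence: append to the end

    def first_non_repeating(self):
        # The first remaining key is the oldest character seen only once.
        return next(iter(self._unique), None)
```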

Hadoop : Number of input records for reducer

Is there any way for each reducer process to determine the number of elements or records it has to process?
Short answer - ahead of time no, the reducer has no knowledge of how many values are backed by the iterable. The only way you can do this is to count as you iterate, but you can't then re-iterate over the iterable again.
Long answer - backing the iterable is actually a sorted byte array of the serialized key / value pairs. The reducer has two comparators - one to sort the key/value pairs in key order, then a second to determine the boundary between keys (known as the key grouper). Typically the key grouper is the same as the key ordering comparator.
When iterating over the values for a particular key, the underlying context examines the next key in the array and compares it to the previous key using the grouping comparator. If the comparator determines they are equal, iteration continues. Otherwise, iteration for this particular key ends. So you can see that you cannot determine ahead of time how many values you will be passed for any particular key.
You can actually see this in action if you create a composite key, say a Text/IntWritable pair. In the compareTo method, sort first by the Text field, then by the IntWritable field. Next, create a Comparator to be used as the group comparator, which only considers the Text part of the key. Now, as you iterate over the values in the reducer, you should be able to observe the IntWritable part of the key changing with each iteration.
Some code I've used before to demonstrate this scenario can be found on this pastebin
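The boundary detection described above can be imitated outside Hadoop with a few lines of Python. This is a simulation of the idea only and does not use any Hadoop API: the function count_per_group, the group_key callback (standing in for the grouping comparator) and the sample records are all hypothetical.

```python
def count_per_group(sorted_records, group_key):
    """Yield (group, count) from (key, value) records already sorted by key.

    A group's count is only known once the next record belongs to a
    different group, or the input ends.
    """
    records = iter(sorted_records)
    try:
        first_key, _ = next(records)
    except StopIteration:
        return                      # no records at all
    current = group_key(first_key)
    count = 1
    for key, _value in records:
        group = group_key(key)
        if group == current:        # same group: keep iterating
            count += 1
        else:                       # boundary found: the previous group is complete
            yield current, count
            current, count = group, 1
    yield current, count


# Composite keys (text, number); grouping looks at the text part only.
records = [(("a", 1), "x"), (("a", 2), "y"), (("b", 1), "z")]
counts = list(count_per_group(records, lambda key: key[0]))
```

Because the stream is consumed in a single pass, the count for a group is available only after its last value has gone by, which mirrors why the reducer cannot know it up front.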
Your reducer class must extend the MapReduce Reducer class:
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
and must then implement the reduce method using the KEYIN/VALUEIN type arguments specified for the extended Reducer class:
reduce(KEYIN key, Iterable<VALUEIN> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
The values associated with a given key can be counted via
int count = 0;
Iterator<VALUEIN> it = values.iterator();
while (it.hasNext()) {
    it.next(); // advance the iterator; the method is next(), not Next()
    count++;
}
Though I'd propose doing this counting alongside your other processing, so as not to make two passes through your value set.
EDIT
Here's an example vector of vectors that grows dynamically as you add to it (so you won't have to statically declare your arrays, and hence don't need the size of the values set). This works best for non-regular data (i.e., the number of columns is not the same for every row in your input CSV file), but has the most overhead.
Vector<Vector<String>> table = new Vector<>();
Iterator<Text> it = values.iterator();
while (it.hasNext()) {
    Text t = it.next();
    String[] cols = t.toString().split(",");
    Vector<String> row = new Vector<>(); // new vector will be our row
    // Add a new column for every value in the CSV row; stop at the end
    // of the array or at the first empty value.
    for (int i = 0; i < cols.length && StringUtils.isNotEmpty(cols[i]); i++) {
        row.addElement(cols[i]);
    }
    table.addElement(row);
}
Then you can access the Mth column of the Nth row via
table.get(N).get(M);
Now, if you knew the number of columns was fixed, you could modify this to use a Vector of arrays, which would probably be a little faster and more space efficient.
