Hashtable and the bucket array - algorithm

I read that a hash table has a bucket array, but I don't understand what that bucket array contains.
Does it contain the hash index? The entry (key/value pair)? Both?
This image is not very clear to me:
(reference)
So, what is a bucket array?

The array index is mostly equivalent to the hash value (well, the hash value mod the size of the array), so there's no need to store that in the array at all.
As to what the actual array contains, there are a few options:
If we use separate chaining:
A reference to a linked-list of all the elements that have that hash value. So:
LinkedList<E>[]
A linked-list node (i.e. the head of the linked-list) - similar to the first option, but we instead just start off with the linked-list straight away without wasting space by having a separate reference to it. So:
LinkedListNode<E>[]
If we use open addressing, we're simply storing the actual element. If there's another element with the same hash value, we use some reproducible technique to find a place for it (e.g. we just try the next position). So:
E[]
There may be a few other options, but the above are the best-known, with separate chaining being the most popular (to my knowledge).
* I'm assuming some familiarity with generics and Java/C#/C++ syntax - E here is simply the type of the element we're storing, LinkedList<E> means a LinkedList storing elements of type E. X[] is an array containing elements of type X.
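As a rough illustration of the separate-chaining option, here is a minimal C++ sketch; the Entry and ChainedTable names, the string-to-int mapping, and the fixed 16 buckets are assumptions made purely for the example (rehashing and memory management are omitted):

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Entry {
    std::string key;
    int value;
    Entry* next;   // next entry in the same bucket's chain
};

struct ChainedTable {
    std::vector<Entry*> buckets = std::vector<Entry*>(16, nullptr);   // the bucket array

    void insert(const std::string& key, int value) {
        std::size_t i = std::hash<std::string>{}(key) % buckets.size();   // which bucket
        buckets[i] = new Entry{key, value, buckets[i]};                   // prepend to that bucket's list
    }
};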

What goes into the bucket array depends a lot on what is stored in the hash table, and also on the collision resolution strategy.
When you use linear probing or another open addressing technique, your bucket table stores keys or key-value pairs, depending on the use of your hash table *.
When you use a separate chaining technique, then your bucket array stores pairs of keys and the headers of your chaining structure (e.g. linked lists).
The important thing to remember about the bucket array is that it establishes a mapping between a hash code and a group of zero or more keys. In other words, given a hash code and a bucket array, you can find out, in constant time, what are the possible keys associated with this hash code (enumerating the candidate keys may be linear, but finding the first one needs to be constant time in order to meet hash tables' performance guarantee of amortized constant time insertions and constant-time searches on average).
* If your hash table is used for checking membership (i.e. it represents a set of keys), then the bucket array stores keys; otherwise, it stores key-value pairs.
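For the open-addressing (linear probing) case mentioned above, a lookup can be sketched as follows; the Slot type, the string-to-int mapping, and the assumption that the table always keeps at least one empty slot are mine, not part of the answer:

#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

struct Slot {
    bool used = false;   // whether this slot currently holds an entry
    std::string key;
    int value = 0;
};

std::optional<int> find(const std::vector<Slot>& table, const std::string& key) {
    std::size_t i = std::hash<std::string>{}(key) % table.size();
    while (table[i].used) {                 // probe until an empty slot ends the search
        if (table[i].key == key)
            return table[i].value;
        i = (i + 1) % table.size();         // reproducible technique: just try the next position
    }
    return std::nullopt;                    // key is not in the table
}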

In practice, it's a linked list of the entries that have been computed (by hashing the key) to go into that bucket.

In a hash table there are, most of the time, collisions: that is, different elements that have the same hash value. Elements with the same hash value are stored in one bucket, so for each hash value you have a bucket containing all elements that have this hash value.

A bucket is a linked list of key-value pairs. The hash index tells you "which bucket", and the key in the key-value pair tells you "which entry in that bucket".
Also check out
hashing in Java -- structure & access time; I've given more details there.
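To make the "which bucket / which entry" distinction concrete, here is a lookup sketch reusing the illustrative Entry/ChainedTable types from the separate-chaining example earlier on this page (again an assumption-laden sketch, not library code):

int* find(ChainedTable& table, const std::string& key) {
    std::size_t i = std::hash<std::string>{}(key) % table.buckets.size();   // hash index: which bucket
    for (Entry* e = table.buckets[i]; e != nullptr; e = e->next)            // key: which entry in that bucket
        if (e->key == key)
            return &e->value;
    return nullptr;   // no entry with this key
}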

Related

Create new hash table from existing hash table

Suppose we have a hash table with 2^16 keys and values. Each key can be represented as a bit string (e.g., 0000, 0000, 0000, 0000). Now we want to construct a new hash table. The key of the new hash table is still a bit string (e.g., 0000, ****, ****, ****). The corresponding value would be the average of all values in the old hash table when * takes 0 or 1. For instance, the value of 0000, ****, ****, **** will be the average of the 2^12 values in the old hash table from 0000, 0000, 0000, 0000 to 0000, 1111, 1111, 1111. Intuitively, we need to perform C(16, 4) * 2^16 operations to construct the new hash table. What's the most efficient way to construct the new hash table?
The hash table here is not helping you at all, although it isn't much of a hindrance either.
Hash tables cannot, by their nature, cluster keys by the key prefix. In order to provide good hash distribution, keys need to be distributed as close to uniformly as possible between hash values.
If you will need later to process keys in some specific ordering, you might consider an ordered associative mapping, such as a balanced binary tree or some variant of a trie. On the other hand, the advantage of processing keys in order needs to be demonstrated in order to justify the additional overhead of ordered mapping.
In this case, every key needs to be visited, which means the ordered mapping and the hash mapping will both be O(n), assuming linear time traverse and constant time processing, both reasonable assumptions. However, during the processing each result value needs two accumulated intermediaries, basically a running total and a count. (There is an algorithm for "on-line" computation of the mean of a series, but it also requires two intermediate values, running mean and count. So although it has advantages, reducing storage requirements isn't one of them.)
You can use the output hash table to store one of the intermediate values for each output value, but you need somewhere to put the other one. That might be another hash table of the same size, or something similar; in any case, there is an additional storage cost.
If you could traverse the original hash table in prefix order, you could reduce this storage cost to a constant, since the two temporary values can be recycled every time you reach a new prefix. So that's a savings, but I doubt whether it's sufficient to justify the overhead of an ordered associative mapping, which also includes increased storage requirements.
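As a rough sketch of that single-pass approach for the one pattern in the example (top four bits fixed, the remaining twelve wildcarded), assuming the old table maps 16-bit keys to double values; the function and variable names are illustrative only:

#include <array>
#include <cstdint>
#include <unordered_map>

std::array<double, 16> averagesByTopNibble(const std::unordered_map<std::uint16_t, double>& old) {
    std::array<double, 16> sum{};    // running totals, one per output key
    std::array<int, 16> count{};     // the second intermediate value mentioned above
    for (const auto& [key, value] : old) {
        int prefix = key >> 12;      // the fixed top 4 bits select the output bucket
        sum[prefix] += value;
        count[prefix] += 1;
    }
    std::array<double, 16> avg{};
    for (int i = 0; i < 16; ++i)
        if (count[i] != 0)
            avg[i] = sum[i] / count[i];
    return avg;
}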

Iterating over unordered_map C++

Is it true that keys inserted in a particular order in an unordered_map will come out in the same order when iterating over the map using an iterator?
Like for example: if we insert (4,3), (2, 5), (6, 7) in B.
And iterate like:
for (auto it = B.begin(); it != B.end(); it++) {
    cout << (it->first);
}
will it give us 4, 2, 6, or may the keys come in any order?
From the cplusplus.com page about the begin member function of unordered_map (link):
Notice that an unordered_map object makes no guarantees on which specific element is considered its first element.
So no, there is no guarantee the elements will be iterated over in the order they were inserted.
FYI, you can iterate over an unordered_map more simply:
for (auto& it : B) {
    // Do stuff
    cout << it.first;
}
Adding to the answer provided by @Aimery:
Unordered map is an associative container that contains key-value pairs with unique keys. Search, insertion, and removal of elements have average constant-time complexity.
Internally, the elements are not sorted in any particular order but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its key. This allows fast access to individual elements, since once the hash is computed, it refers to the exact bucket the element is placed into.
See the ref. from https://en.cppreference.com/w/cpp/container/unordered_map.
According to an answer Sumod Mathilakath gave on Quora:
If you prefer to keep intermediate data in sorted order, use std::map<key,value> instead of std::unordered_map. It will sort on the key by default using std::less<>, so you will get results in ascending order.
std::unordered_map is an implementation of the hash table data structure, so it will arrange the elements internally according to the hash value used by std::unordered_map. But std::map is usually a red-black tree implementation.
See the ref. from What will be order of key in unordered_map in c++ and why?.
So, I think that makes the answer clearer.
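A small demo of that difference, reusing the keys and values from the question; the output order of the unordered_map loop is unspecified, while the std::map loop always prints the keys in sorted order:

#include <iostream>
#include <map>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> B{{4, 3}, {2, 5}, {6, 7}};
    for (auto& kv : B) std::cout << kv.first << ' ';   // some unspecified order
    std::cout << '\n';

    std::map<int, int> M{{4, 3}, {2, 5}, {6, 7}};
    for (auto& kv : M) std::cout << kv.first << ' ';   // always 2 4 6 (sorted by key)
    std::cout << '\n';
}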

Hashing Access time with multi variable key

Suppose a dictionary has 2 variable keys instead of 1, like
dictionary[3,5] = Something
dictionary[1,2] = Something
dictionary[3,1] = Something
Would the search time still be O(1)? In case I need to find whether dictionary[1,5] exists, would it yield constant time?
Thanks in advance.
When you do a lookup in a hash table, the cost involved is the cost of
hashing the item to look up, and
comparing that item against (an expected O(1) number of) other entries in the table.
We can write the expected cost of a hash table lookup as O(hash-cost + compare-cost).
In your case, the cost of hashing a pair instead of a single element is still O(1) - just hash each element of the pair and apply some hash combination step to the two values. Similarly, the cost of comparing two pairs is also O(1) (assuming each individual element of the pair can be compared in constant time). As a result, a lookup will still be (expected) constant time.
The above argument generalizes to any fixed-size tuple as a key. You typically have to worry about the cost of hashing and comparing keys when they have variable length, as would be the case if you were hashing strings with no length restriction.
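A sketch of what that hash-combination step could look like in C++; the PairHash functor and its mixing constant are illustrative choices (in the spirit of boost::hash_combine), not the only correct ones:

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const {
        std::size_t h1 = std::hash<int>{}(p.first);
        std::size_t h2 = std::hash<int>{}(p.second);
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));   // O(1) combination of the two hashes
    }
};

int main() {
    std::unordered_map<std::pair<int, int>, std::string, PairHash> dictionary;
    dictionary[{3, 5}] = "Something";
    dictionary[{1, 2}] = "Something";
    dictionary[{3, 1}] = "Something";
    bool exists = dictionary.count({1, 5}) > 0;   // still an expected O(1) lookup
    return exists ? 0 : 1;
}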
Yes. This is not new. Usually, you can have a dictionary with string keys. If you see a string as an array of characters, you have a list of chars as the key. So, in the same situation, you can say your dictionary works in O(1) too (if the length of the string is constant).

Programming: find the first unique string in a file in just 1 pass

Given a very long list of Product Names, find the first product name which is unique (occurs exactly once). You can only iterate once over the file.
I am thinking of taking a hashmap and storing the (key, count) pairs in a doubly linked list;
basically a linked hashmap.
Can anyone optimize this or give a better approach?
Since you can only iterate the list once, you have to store
each string that occurs exactly once, because it could be the output
their relative position within the list
each string that occurs more than once (or its hash, if you're not afraid of collisions)
Notably, you don't have to store the relative positions of strings that occur more than once.
You need
efficient storage of the set of strings. A hash set is a good candidate, but a trie could offer better compression depending on the set of strings.
efficient lookup by value. This rules out a bare list. A hash-set is the clear winner, but a trie also performs well. You can store the leaves of the trie in a hash set.
efficient lookup of the minimum. This asks for a linked list.
Conclusion:
Use a linked hash-set for the set of strings, and a flag indicating if they're unique. If you're fighting for memory, use a linked trie. If a linked trie is too slow, store the trie leaves in a hash map for look-up. Include only the unique strings in the linked list.
In total, your nodes could look like: Node:{Node[] trieEdges, Node trieParent, String inEdge, Node nextUnique, Node prevUnique}; Node firstUnique, Node[] hashMap
If you strive for ease of implementation, you can have two hash-sets instead (one linked).
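One way to sketch that in C++ is a std::list of still-unique strings in first-occurrence order plus a std::unordered_map from every string seen to its list position (or to end() once it repeats); the function name and the whitespace-separated input format are assumptions for the example:

#include <iostream>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

std::string firstUnique(std::istream& in) {
    std::list<std::string> uniques;   // strings seen exactly once so far, in order of first occurrence
    std::unordered_map<std::string, std::list<std::string>::iterator> pos;
    std::string s;
    while (in >> s) {
        auto it = pos.find(s);
        if (it == pos.end()) {
            uniques.push_back(s);
            pos[s] = std::prev(uniques.end());   // remember where this candidate sits
        } else if (it->second != uniques.end()) {
            uniques.erase(it->second);           // seen a second time: no longer a candidate
            it->second = uniques.end();          // mark as "occurs more than once"
        }
    }
    return uniques.empty() ? std::string{} : uniques.front();   // first string that occurred exactly once
}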
The following algorithm solves it in O(N+M) time.
where
N=number of strings
M=total number of characters put together in all strings.
The steps are as follows:
1. Create a hash value for each string.
2. XOR all the hash values together and find the one which didn't have a pair.
XOR has the useful property that a xor a = 0 and b xor 0 = b.
Tips to generate the hash value for a string:
Use a base-27 number system, and give 'a' a value of 1, 'b' a value of 2, and so on up to 'z', which gets 26. So if the string is "abc", we compute the hash value as:
H = 3*(27^0) + 2*(27^1) + 1*(27^2) = 786
You could use the modulus operator to make hash values small enough to fit in 32-bit integers. If you do that, keep an eye out for collisions, which are basically two different strings that get the same hash value due to the modulus operation.
Mostly I guess you won't be needing it.
So compute the hash for each string, then start from the first hash and keep xor-ing; the result will hold the hash value of the string which didn't have a pair.
Caution: This is useful only when the other strings occur in pairs. Still, this is a good idea to start with; that's why I answered it.
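A small sketch of the base-27 hash and the XOR pass, with the same caveat: this only isolates the unpaired string's hash when every other string occurs in pairs, and the function names here are mine:

#include <cstdint>
#include <string>
#include <vector>

std::uint32_t base27Hash(const std::string& s) {
    std::uint32_t h = 0;
    for (char c : s)
        h = h * 27 + static_cast<std::uint32_t>(c - 'a' + 1);   // 'a' -> 1, ..., 'z' -> 26
    return h;   // base27Hash("abc") == 786, as in the worked example; wraps modulo 2^32, so collisions are possible
}

std::uint32_t unpairedHash(const std::vector<std::string>& names) {
    std::uint32_t acc = 0;
    for (const std::string& s : names)
        acc ^= base27Hash(s);   // paired hashes cancel out (a xor a = 0)
    return acc;                 // hash value of the string that had no pair
}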
Using a linked hashmap is obvious enough. Otherwise, you could use a TreeMap style data structure where the strings are ordered by count. So as soon as you are done reading the input, the root of your tree is unique if a unique string exists. Unlike a linked hash map, insertion takes at most O(log n) as opposed to O(n). You can read up on TreeMaps for insight on how to augment a basic TreeMap into what you need. Also in your linked hashmap you may have to travel O(n) to find your first unique key. With a TreeMap style data structure, your look up is O(1) -- the root. Even if more unique keys exist, the first one you encountered will be the root. The subsequent ones will be children of the root.

Designing small comparable objects

Intro
Consider you have a list of key/value pairs:
(0,a) (1,b) (2,c)
You have a function, that inserts a new value between two current pairs, and you need to give it a key that keeps the order:
(0,a) (0.5,z) (1,b) (2,c)
Here the new key was chosen as the average of the keys of the bounding pairs.
The problem is that your list may have millions of inserts. If these inserts are all put close to each other, you may end up with keys such as 2^(-1000000), which are not easily storable in any standard or special number class.
The problem
How can you design a system for generating keys that:
Gives the correct result (larger/smaller than) when compared to all the rest of the keys.
Takes up only O(log n) memory (where n is the number of items in the list).
My tries
First I tried different number classes, like fractions and even polynomials, but I could always find examples where the key size would grow linearly with the number of inserts.
Then I thought about saving pointers to a number of other keys, and saving the lower/greater-than relationship, but that would always require at least O(sqrt(n)) memory and time for comparison.
Extra info: Ideally the algorithm shouldn't break when pairs are deleted from the list.
I agree with snowlord. A tree would be ideal in this case. A red-black tree would prevent things from getting unbalanced. If you really need keys, though, I'm pretty sure you can't do better than using the average of the keys on either side of the value you need to insert. That will increase your key length by 1 bit each time. What I recommend is renormalizing the keys periodically. Every x inserts, or whenever you detect keys being generated too close together, renumber everything from 1 to n.
Edit:
You don't need to compare keys if you're inserting by position instead of key. The compare function for the red-black tree would just use the order in the conceptual list, which lines up with in-order in the tree. If you're inserting in position 4 in the list, insert a node at position 4 in the tree (using in-ordering). If you're inserting after a certain node (such as "a"), it's the same. You might have to use your own implementation if whatever language/library you're using requires a key.
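A tiny sketch of the two pieces described above, averaging for ordinary inserts plus the periodic renumbering from 1 to n; the Pair type and the spacing of 1 are arbitrary choices for the example:

#include <string>
#include <vector>

struct Pair { double key; std::string value; };

// New key halfway between the bounding keys; each such insert costs roughly one
// bit of precision, which is why renumbering is needed every so often.
double keyBetween(double lo, double hi) { return (lo + hi) / 2.0; }

// Periodic renormalization: walk the list in order and hand out evenly spaced keys.
void renormalize(std::vector<Pair>& list) {
    double k = 1.0;
    for (Pair& p : list)
        p.key = k++;
}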
I don't think you can avoid getting size O(n) keys without reassigning the key during operation.
As a practical solution I would build an inverted search tree, with pointers from the children to the parents, where each pointer is marked whether it is coming from a left or right child. To compare two elements you need to find the closest common ancestor, where the path to the elements diverges.
Reassigning keys is then rebalancing of the tree, you can do that by some rotation that doesn't change the order.
