Using a hash table to represent more specific data structures? - data-structures

I know that hash tables can be used to store:
Graphs: each vertex is associated with its adjacent vertices.
Sparse matrices: a DOK (dictionary of keys) mapping (row, column)-pairs to the value of the elements.
Sets: each element is associated with nil.
Multisets (or Bags): each element is associated with its multiplicity.
What other data structures (for instance, in this list) may be efficiently represented by a hash table?
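For example, a multiset maps naturally onto a hash table; here is a minimal Java sketch (the Multiset class name and its methods are just illustrative):
import java.util.HashMap;
import java.util.Map;

// A minimal multiset (bag) backed by a hash table: element -> multiplicity.
class Multiset<E> {
    private final Map<E, Integer> counts = new HashMap<>();

    void add(E e) {
        counts.merge(e, 1, Integer::sum);   // increment the element's multiplicity
    }

    int count(E e) {
        return counts.getOrDefault(e, 0);   // 0 if the element is absent
    }
}
A set works the same way, except each key maps to a dummy value (which is exactly how java.util.HashSet is built on HashMap).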

Related

Intersection between two collections

I was asked this interview question: what is the best and most efficient way to get the intersection between two collections, one very big and the other small? Java Collections, basically.
With no other information, what you can do is save time on the n x m comparisons, as follows:
1. Let big and small be the collections of size n and m.
2. Let intersection be the intersecting collection (empty for now).
3. Let hashes be the collection of hashes of all elements in small.
4. For each object in big, let h be its hash integer.
5. For each hash value in the hashes collection, if hash = h, compare object with the element of small whose hash is hash. If they are equal, add object to intersection.
So, the idea is to compare hashes instead of objects and only compare objects if their hashes coincide.
Note that the additional collection for the hashes of small is acceptable because of the size of this supposedly small collection. Notice also that the algorithm computes n + m hash values and comparatively few object comparisons.
Here is the code in Smalltalk
set := small asSet.
^big select: [:o | set includes: o]
The Smalltalk code is very compact because the message includes: sent to set works as described in step 5 above. It first compares hashes and then objects if needed. The select: is also a very compact way to express the selection in step 5.
UPDATE
If we enumerate all the comparisons between the elements of both collections we would have to consider n x m pairs of objects, which would account for a complexity of order O(nm) (big-O notation). On the other hand, if we put the small collection into a hashed one (as I did in the Smalltalk example) the inner testing that happens every time we have to check if small includes an object of big will have a complexity of O(1). And given that hashing the small collection is O(m), the total complexity of this method would be O(n + m).
Let's call the two collections large and small, respectively.
The Java Collection is an interface - you cannot instantiate it directly - you have to use one of its concrete implementations. For this problem, you can use Sets. A Set has only unique elements, and it has a method contains(Object o). Its subinterface, SortedSet, keeps its elements in ascending order by default.
So copy small into a Set. It's now got no duplicate values. Copy large into a second Set, and this way we can use its contains() method. Create a third Set called intersection, to hold the intersection results.
for each element in small:
    if large.contains(element):
        intersection.add(element)
At the end of the run through small, you'll have the intersection of all objects in both original collections, with no duplicates. If you want it ordered, copy it into a SortedSet and it'll then be in ascending order.
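A rough Java sketch of those steps (class and method names here are just illustrative):
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

class Intersection {
    // Copy the large collection into a hash set once, then probe it for each
    // element of the small collection: O(n + m) expected overall.
    static <T> Set<T> of(Collection<T> large, Collection<T> small) {
        Set<T> largeSet = new HashSet<>(large);   // de-duplicates large, O(n)
        Set<T> result = new HashSet<>();          // result set also removes duplicates
        for (T element : small) {                 // O(m) iterations
            if (largeSet.contains(element)) {     // O(1) expected lookup
                result.add(element);
            }
        }
        return result;
    }
}
If you want the result ordered, copy it into a java.util.TreeSet (a SortedSet) at the end.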

Hashtable and the bucket array

I read that in a hash table we have a bucket array, but I don't understand what that bucket array contains.
Does it contain the hash index? The entry (key/value pair)? Both?
This image is not very clear to me:
(reference)
So, what is a bucket array?
The array index is mostly equivalent to the hash value (well, the hash value mod the size of the array), so there's no need to store that in the array at all.
As to what the actual array contains, there are a few options:
If we use separate chaining:
A reference to a linked-list of all the elements that have that hash value. So:
LinkedList<E>[]
A linked-list node (i.e. the head of the linked-list) - similar to the first option, but we instead just start off with the linked-list straight away without wasting space by having a separate reference to it. So:
LinkedListNode<E>[]
If we use open addressing, we're simply storing the actual element. If there's another element with the same hash value, we use some reproducible technique to find a place for it (e.g. we just try the next position). So:
E[]
There may be a few other options, but the above are the best known, with separate chaining being the most popular (to my knowledge).
* I'm assuming some familiarity with generics and Java/C#/C++ syntax - E here is simply the type of the element we're storing, LinkedList<E> means a LinkedList storing elements of type E. X[] is an array containing elements of type X.
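To make the separate-chaining option concrete, here is a minimal Java sketch (the class and field names are illustrative, not any particular library's internals):
// The bucket array holds the heads of singly linked lists; each list chains
// together all elements whose keys hash to that bucket.
class ChainedHashSet<E> {
    private static class Node<E> {
        final E value;
        Node<E> next;
        Node(E value, Node<E> next) { this.value = value; this.next = next; }
    }

    @SuppressWarnings("unchecked")
    private final Node<E>[] buckets = (Node<E>[]) new Node[16];   // the bucket array

    private int indexFor(E e) {
        return Math.floorMod(e.hashCode(), buckets.length);       // hash value mod table size
    }

    boolean contains(E e) {
        for (Node<E> n = buckets[indexFor(e)]; n != null; n = n.next) {
            if (n.value.equals(e)) return true;                   // compare within the chain
        }
        return false;
    }

    void add(E e) {
        int i = indexFor(e);
        if (!contains(e)) buckets[i] = new Node<>(e, buckets[i]); // prepend to the chain
    }
}
This corresponds to the LinkedListNode<E>[] option above; a real implementation would also resize the bucket array as it fills up.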
What goes into the bucket array depends a lot on what is stored in the hash table, and also on the collision resolution strategy.
When you use linear probing or another open addressing technique, your bucket table stores keys or key-value pairs, depending on the use of your hash table *.
When you use a separate chaining technique, then your bucket array stores pairs of keys and the headers of your chaining structure (e.g. linked lists).
The important thing to remember about the bucket array is that it establishes a mapping between a hash code and a group of zero or more keys. In other words, given a hash code and a bucket array, you can find out, in constant time, what are the possible keys associated with this hash code (enumerating the candidate keys may be linear, but finding the first one needs to be constant time in order to meet hash tables' performance guarantee of amortized constant time insertions and constant-time searches on average).
* If your hash table is used for checking membership (i.e. it represents a set of keys), then the bucket array stores keys; otherwise, it stores key-value pairs.
In practice, it's a linked list of the entries that have been computed (by hashing the key) to go into that bucket.
In a hash table, collisions happen most of the time. A collision is when different elements have the same hash value. Elements with the same hash value are stored in one bucket, so for each hash value you have a bucket containing all elements that have this hash value.
A bucket is a linked list of key-value pairs. The hash index tells you which bucket, and the key in the key-value pair tells you which entry in that bucket.
Also check out hashing in Java -- structure & access time, where I've given more details.

In a hash tree, are non-leaf nodes direct hashes of data, or are they hashes of sub-hashes?

I am looking at the Wikipedia article on hash trees, and I am slightly confused by its diagram.
A leaf node obviously contains the hash of the underlying data.
Are leaf nodes in hash trees different from non-leaf nodes? Do non-leaf nodes contain hashes of data, or hashes of hashes?
Given this diagram:
Which of these is Hash 1 a hash of?
Hash 1-0 + Hash 1-1
Data block 002 + Data block 003
Or are hash trees fundamentally different depending on the application (rsync, P2P networks, Git, etc)?
This is what wiki article says:
Nodes further up in the tree are the hashes of their respective children. For example, in the picture, hash 0 is the result of hashing hash 0-0 and then hash 0-1. That is, hash 0 = hash( hash 0-0 || hash 0-1 ), where || denotes concatenation.
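In code, that parent-hash computation looks roughly like the following Java sketch (assuming SHA-256; real systems differ in the hash function and in exactly how children are combined):
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class MerkleNode {
    // hash(parent) = hash( hash(left child) || hash(right child) )
    static byte[] parentHash(byte[] leftChildHash, byte[] rightChildHash)
            throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        digest.update(leftChildHash);    // e.g. hash 1-0
        digest.update(rightChildHash);   // e.g. hash 1-1
        return digest.digest();          // e.g. hash 1
    }
}
So in the diagram, Hash 1 is a hash of Hash 1-0 + Hash 1-1, not of the data blocks directly; only the leaves hash the underlying data.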
But I truly believe that a developer may customize the tree and algorithm, use different hash functions and so on, optimizing it for different data or speed or memory or whatever.

What is the main implementation idea behind sparse hash table?

Why does the Google sparsehash open-source library have two implementations: a dense hashtable and a sparse one?
The dense hashtable is your ordinary textbook hashtable implementation.
The sparse hashtable stores only the elements that have actually been set, divided over a number of arrays. To quote from the comments in the implementation of sparse tables:
// The idea is that a table with (logically) t buckets is divided
// into t/M *groups* of M buckets each. (M is a constant set in
// GROUP_SIZE for efficiency.) Each group is stored sparsely.
// Thus, inserting into the table causes some array to grow, which is
// slow but still constant time. Lookup involves doing a
// logical-position-to-sparse-position lookup, which is also slow but
// constant time. The larger M is, the slower these operations are
// but the less overhead (slightly).
To know which elements of the arrays are set, a sparse table includes a bitmap:
// To store the sparse array, we store a bitmap B, where B[i] = 1 iff
// bucket i is non-empty. Then to look up bucket i we really look up
// array[# of 1s before i in B]. This is constant time for fixed M.
so that each element incurs an overhead of only 1 bit (in the limit).
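As an illustration of that bitmap trick, here is a rough Java sketch of a single group (illustrative only; the real sparsehash implementation is C++ and considerably more involved). It assumes M <= 64 so one long can hold the bitmap:
// One sparse group of M logical buckets: the bitmap records which buckets are
// occupied, and the dense array stores only the occupied ones, in order.
class SparseGroup<E> {
    private long bitmap;                          // bit i set iff logical bucket i is non-empty
    private Object[] dense = new Object[0];

    @SuppressWarnings("unchecked")
    E get(int i) {
        if ((bitmap & (1L << i)) == 0) return null;            // bucket i is empty
        int pos = Long.bitCount(bitmap & ((1L << i) - 1));     // # of 1s before i in the bitmap
        return (E) dense[pos];
    }

    void set(int i, E value) {
        int pos = Long.bitCount(bitmap & ((1L << i) - 1));
        if ((bitmap & (1L << i)) != 0) {                       // already occupied: overwrite
            dense[pos] = value;
            return;
        }
        Object[] grown = new Object[dense.length + 1];         // insert: grow the dense array
        System.arraycopy(dense, 0, grown, 0, pos);
        grown[pos] = value;
        System.arraycopy(dense, pos, grown, pos + 1, dense.length - pos);
        dense = grown;
        bitmap |= 1L << i;
    }
}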
Sparse hashtables are a memory-efficient way of mapping keys to values (an overhead of 1-2 bits per key). Bloom filters can give you even fewer bits per key, but they don't attach values to keys other than outside/probably-inside, which is slightly less than one bit of information.

Superset Search

I'm looking for an algorithm to solve the following in a reasonable amount of time.
Given a set of sets, find all such sets that are subsets of a given set.
For example, if you have a set of search terms like ["stack overflow", "foo bar", ...], then given a document D, find all search terms whose words all appear in D.
I have found two solutions that are adequate:
Use a list of bit vectors as an index. To query for a given superset, create a bit vector for it, and then iterate over the list performing a bitwise OR for each vector in the list. If the result is equal to the search vector, the search set is a superset of the set represented by the current vector. This algorithm is O(n) where n is the number of sets in the index, and bitwise OR is very fast. Insertion is O(1). Caveat: to support all words in the English language, the bit vectors will need to be several million bits long, and there will need to exist a total order for the words, with no gaps.
Use a prefix tree (trie). Sort the sets before inserting them into the trie. When searching for a given set, sort it first. Iterate over the elements of the search set, activating nodes that match if they are either children of the root node or of a previously activated node. All paths, through activated nodes to a leaf, represent subsets of the search set. The complexity of this algorithm is O(a log a + ab) where a is the size of the search set and b is the number of indexed sets.
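To make the first (bit vector) strategy concrete, here is a rough Java sketch using java.util.BitSet; it assumes the word-to-bit mapping is built elsewhere:
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class BitVectorIndex {
    private final List<BitSet> indexedSets = new ArrayList<>();

    void insert(BitSet set) {                       // O(1) insertion
        indexedSets.add(set);
    }

    // Return every indexed set that is a subset of the query set.
    List<BitSet> subsetsOf(BitSet query) {
        List<BitSet> result = new ArrayList<>();
        for (BitSet candidate : indexedSets) {      // O(n) scan over the index
            BitSet union = (BitSet) query.clone();
            union.or(candidate);                    // query OR candidate
            if (union.equals(query)) {              // OR left query unchanged => candidate is a subset
                result.add(candidate);
            }
        }
        return result;
    }
}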
What's your solution?
The prefix trie sounds like something I'd try if the sets were sparse compared to the total vocabulary. Don't forget that if the suffix set of two different prefixes is the same, you can share the subgraph representing the suffix set (this can be achieved by hash-consing rather than arbitrary DFA minimization), giving a DAG rather than a tree. Try ordering your words least or most frequent first (I'll bet one or the other is better than some random or alphabetic order).
For a variation on your first strategy, where you represent each set by a very large integer (bit vector), use a sparse ordered set/map of integers (a trie on the sequence of bits which skips runs of consecutive 0s) - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452 (implemented in http://www.scala-lang.org/docu/files/api/scala/collection/immutable/IntMap.html).
If your reference set (of sets) is fixed, and you want to find for many of those sets which ones contain others, I'd compute the immediate containment relation (a directed acyclic graph with a path from a->b iff b is contained in a, and without the redundant arcs a->c where a->b and b->c). The branching factor is no more than the number of elements in a set. The vertices reachable from the given set are exactly those that are subsets of it.
First I would construct 2 data structures, S and E.
S is an array of sets (the array S holds the N subsets).
S[0] = set(element1, element2, ...)
S[1] = set(element1, element2, ...)
...
S[N] = set(element1, element2, ...)
E is a map (keyed by the element's hash) of lists. Each list contains the S-indices where the element appears.
// O( S_total_elements ) = O(n) operation
E[element1] = list(S1, S6, ...)
E[element2] = list(S3, S4, S8, ...)
...
Now, 2 new structures, set L and array C.
Store all the elements of D that exist in E into L (an O(n) operation).
C is an array of counters, indexed by S-index.
// count subset's elements that are in E
foreach e in L:
    foreach idx in E[e]:
        C[idx] = C[idx] + 1
Finally,
for i in C:
    if C[i] == S[i].Count():
        // S[i] subset exists in D
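Put together, a rough Java sketch of this counting approach (names are just illustrative):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SubsetFinder {
    // Return the indices of every set in S whose elements all appear in D.
    static List<Integer> subsetsContainedIn(List<Set<String>> S, Set<String> D) {
        // E: element -> indices of the sets in S containing that element
        Map<String, List<Integer>> E = new HashMap<>();
        for (int i = 0; i < S.size(); i++) {
            for (String element : S.get(i)) {
                E.computeIfAbsent(element, k -> new ArrayList<>()).add(i);
            }
        }

        int[] C = new int[S.size()];                // one counter per indexed set
        for (String element : D) {                  // the elements of D that exist in E
            for (int idx : E.getOrDefault(element, List.of())) {
                C[idx]++;
            }
        }

        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < C.length; i++) {
            if (C[i] == S.get(i).size()) {          // every element of S[i] was found in D
                result.add(i);
            }
        }
        return result;
    }
}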
Can you build an index for your documents? i.e. a mapping from each word to those documents containing that word. Once you've built that, lookup should be pretty quick and you can just do set intersection to find the documents matching all words.
Here's Wiki on full text search.
EDIT: Ok, I got that backwards.
You could convert your document to a set (if your language has a set datatype), do the same with your searches. Then it becomes a simple matter of testing whether one is a subset of the other.
Behind the scenes, this is effectively the same idea: it would probably involve building a hash table for the document, hashing the queries, and checking each word in the query in turn. This would be O(nm) where n is the number of searches and m the average number of words in a search.
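In Java, for instance, this boils down to something like the following sketch (splitting the document on whitespace is just for illustration):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class QueryMatcher {
    // Return every query whose words all appear in the document.
    static List<Set<String>> matching(String document, List<Set<String>> queries) {
        Set<String> docWords = new HashSet<>(Arrays.asList(document.split("\\s+")));
        List<Set<String>> hits = new ArrayList<>();
        for (Set<String> query : queries) {
            if (docWords.containsAll(query)) {      // subset test via hash lookups
                hits.add(query);
            }
        }
        return hits;
    }
}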
