How do I use union find data structure to group Strings? - data-structures

I have been using Union-Find (Disjoint set) for a lot of graph problems and know how this works. But I have almost always used this data structure with integers or numbers. While solving this leetcode problem I need to group strings and I am thinking of using Union-Find for this. But I do not know how to use this with strings. Looking for suggestions.

TLDR: Use the same union-find code you would use for integers, but store each element's parent in a hash map instead of an array. This approach generalizes to any data type that can be used as a hash-map key, not just strings; i.e. in the code below, the two unordered_maps could be keyed by something other than strings.
#include <string>
#include <unordered_map>
using namespace std;

class UnionFind {
public:
    string find(string s) {
        // Lazily register an element the first time we see it.
        if (!parent.count(s)) { parent[s] = s; sz[s] = 1; }
        string stringForPathCompression = s;
        while (parent[s] != s) s = parent[s];
        // Path compression: point every node on the path we just walked
        // directly at the root, which reduces the time complexity of future finds.
        while (stringForPathCompression != s) {
            string temp = parent[stringForPathCompression];
            parent[stringForPathCompression] = s;
            stringForPathCompression = temp;
        }
        return s;
    }
    void unify(string s1, string s2) {
        string rootS1 = find(s1), rootS2 = find(s2);
        if (rootS1 == rootS2) return;
        // Union by size: we attach the smaller component to the bigger component,
        // thus preserving the bigger component and reducing the time complexity.
        if (sz[rootS1] < sz[rootS2]) parent[rootS1] = rootS2, sz[rootS2] += sz[rootS1];
        else                         parent[rootS2] = rootS1, sz[rootS1] += sz[rootS2];
    }
private:
    // If we were storing numbers in our union find, both of these hash maps could be arrays.
    unordered_map<string, int> sz;        // size of the component rooted at each string
    unordered_map<string, string> parent;
};

Union-Find doesn't really care what kind of data is in the objects. You can decide in your main code which strings belong together, and then union their representative values.
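For illustration, here is a minimal usage sketch of the UnionFind class above (the strings and groupings are made up; in practice you would call unify() for whichever pairs your problem says belong together):

#include <iostream>

int main() {
    UnionFind uf;
    uf.unify("tars", "rats");
    uf.unify("rats", "arts");
    uf.find("star");                                            // registers "star" as its own group
    std::cout << (uf.find("tars") == uf.find("arts")) << "\n";  // 1: same group
    std::cout << (uf.find("tars") == uf.find("star")) << "\n";  // 0: different groups
    return 0;
}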


Saving the in-order visit of a binary tree in an array

The question is the following:
is there an algorithm which, given a binary tree T and an array, stores in the array the result of the corresponding in-order visit of the tree?
Pseudo-code of a "normal" in-order visit:
inOrder(x){ // x is a node of a binary tree
    if(x != NIL){ // the node is not null
        inOrder(x.left)
        print(x.key)
        inOrder(x.right)
    }
}
// Function calling inOrder
printInOrder(T){ // T is a binary tree
    inOrder(T.root) // T.root is the root of the tree
}
Example:
Given the following tree
        5
      /   \
     3     8
    / \   /
   2   7 1
the algorithm above should output 2 3 7 5 1 8.
I'm sure this can be achieved and it shouldn't be too hard, but I'm currently struggling with this problem.
Writing to an array (instead of printing) means you need to keep track of which index to write at in the array. If you need to do this without any mutable state other than the array itself, you need to pass the current index as an argument, and return the new current index.
The code below is written in static single assignment form so even the local variables are not mutated. If that isn't required then the code can be simplified a bit. I'm assuming that the array length is known; if it needs to be computed, that is a separate problem.
inOrder(x, arr, i) {
    if(x == NIL) {
        return i
    } else {
        i2 = inOrder(x.left, arr, i)
        arr[i2] = x.key
        i3 = inOrder(x.right, arr, i2 + 1)
        return i3
    }
}
getArrayInOrder(T, n) {
    arr = new array of length n
    inOrder(T.root, arr, 0)
    return arr
}
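For concreteness, here is a runnable C++ version of the same index-threading idea; the Node struct and the sample tree in main() are illustrative assumptions, not part of the original answer:

#include <iostream>
#include <vector>

struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    Node(int k) : key(k) {}
};

// Returns the next free index, exactly like the pseudocode above.
int inOrderFill(const Node* x, std::vector<int>& arr, int i) {
    if (x == nullptr) return i;
    int i2 = inOrderFill(x->left, arr, i);
    arr[i2] = x->key;
    return inOrderFill(x->right, arr, i2 + 1);
}

std::vector<int> getArrayInOrder(const Node* root, int n) {
    std::vector<int> arr(n);
    inOrderFill(root, arr, 0);
    return arr;
}

int main() {
    // The example tree from the question; in-order is 2 3 7 5 1 8.
    Node n2(2), n7(7), n1(1), n3(3), n8(8), n5(5);
    n3.left = &n2; n3.right = &n7;
    n8.left = &n1;
    n5.left = &n3; n5.right = &n8;
    for (int v : getArrayInOrder(&n5, 6)) std::cout << v << ' ';
}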
First, about arrays: in order to populate an array you need to know its length beforehand. If you don't need to specify the length when instantiating it (depending on the language you use), then it's not really an array; it's a dynamic data structure whose size is grown automatically by the language implementation.
Now I assume you don't know the size of the tree beforehand. If you do know the size, you can instantiate an array of that size. Assuming you don't, you need to go with a dynamic data structure like ArrayList in Java.
So, at each print(x.key) in your code, just append x.key to the list (like list.add(x.key)). After the traversal is complete you can convert the list to an array; a sketch of this idea is shown after this answer.
You could use an iterative version of the traversal, too.
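For instance, a minimal C++ sketch of the append-to-a-dynamic-list approach, with std::vector playing the role of ArrayList (the Node struct is a hypothetical stand-in for whatever tree node type you actually have):

#include <vector>

struct Node { int key; Node* left; Node* right; };

// Append keys in-order; the vector grows as needed, so no size is required up front.
void inOrderCollect(const Node* x, std::vector<int>& out) {
    if (x == nullptr) return;
    inOrderCollect(x->left, out);
    out.push_back(x->key);
    inOrderCollect(x->right, out);
}

Calling inOrderCollect(T.root, result) on an empty vector leaves the in-order sequence in result, which can then be copied into a fixed-size array if one is required.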
One simple solution for the recursive approach is to use a single-element array to track the index, like:
void inOrder(x, int[] idx, int[] arr):
    if x != NIL:
        inOrder(x.left, idx, arr)
        arr[idx[0]++] = x.key
        inOrder(x.right, idx, arr)
although I'm sure there may be other ways that can become cumbersome (maybe). I prefer the iterative version anyway.
If your language / use case allows putting ints into the array, you could store the current index in the array itself (in arr[0] below). I'm going backwards because that's simpler in that case:
inOrder(x, arr){
    if(x != NIL){
        inOrder(x.right, arr)
        arr[--arr[0]] = x.key
        inOrder(x.left, arr)
    }
}
saveInOrder(T, n){
    arr = new int[n]
    arr[0] = n
    inOrder(T.root, arr)
    return arr
}

How to quick search in huge range-value pairs?

For example we have some range-value pairs:
<<1, 2>, 65>
<<3, 37>, 75>
<<45, 159>, 12>
<<160, 200>, 23>
<<210, 255>, 121>
And these ranges are disjoint.
Given the integer 78, the corresponding range is <45, 159>, so output the value 12.
As the ranges may be very large, I currently use a map to store these range-value pairs.
Every search scans the entire map, which is O(n).
Are there any good ways other than binary search?
Binary search is definitely your best option. O(log n) is hard to beat!
If you don't want to hand-code the binary search, and you're using C++, you can use the std::map::upper_bound function, which implements binary search:
std::map<std::pair<int, int>, int> my_map;
int val = 78; // value to search for
// Search with {val, INT_MAX} (INT_MAX from <climits>) so that a range starting
// exactly at val is not skipped: the pair keys compare lexicographically.
auto it = my_map.upper_bound(std::make_pair(val, INT_MAX));
if (it != my_map.begin()) it--;
if (it != my_map.end() && it->first.first <= val && it->first.second >= val) {
    /* Found it, value is it->second */
} else {
    /* Didn't find it */
}
std::map::upper_bound is guaranteed to run in logarithmic time, because the map is inherently sorted (and usually implemented as a red-black tree). It returns the element after the element you're searching for, and hence the iterator that it returns is decremented to obtain a useful iterator.
How about a binary search tree?
You'll probably need a self-balancing search tree: AVL trees and red-black trees are a good start, but you should probably stick with something that already exists in your ecosystem. For .NET that would be a SortedDictionary, probably with a custom IComparer, but that's just an example...
For C++ you should use an ordered map (std::map) with a custom comparator (see the std::map link below).
Links:
http://en.wikipedia.org/wiki/Binary_search_tree
http://en.wikipedia.org/wiki/AVL_tree
http://en.wikipedia.org/wiki/Red-black_tree
http://msdn.microsoft.com/en-us/library/f7fta44c%28v=vs.110%29.aspx
http://msdn.microsoft.com/en-us/library/8ehhxeaf(v=vs.110).aspx
http://www.cplusplus.com/reference/map/map/
Again for C++, I would do something like the following (written without a compiler, so be careful):
class Comparer
{
public:
    // Ranges are disjoint, so "one sorts before two" iff one ends before two starts.
    bool operator()(pair<int,int> const & one, pair<int,int> const & two) const
    {
        return one.second < two.first;
    }
};
then you define the map as:
map<pair<int,int>, int, Comparer> myMap;
myMap.insert(make_pair(make_pair(-3, 9), 1));
int num = myMap[make_pair(7, 7)]; // finds the range containing 7; note that operator[]
                                  // inserts a default value when no range matches,
                                  // so find() may be preferable for pure lookups

Find K most frequent words from billions of given words [duplicate]

Input: a positive integer K and a big text. The text can actually be viewed as a word sequence, so we don't have to worry about how to break it down into words.
Output: The most frequent K words in the text.
My thinking is like this.
1) Use a hash table to record every word's frequency while traversing the whole word sequence. In this phase, the key is the word and the value is its frequency. This takes O(n) time.
2) Sort the (word, word-frequency) pairs, with word-frequency as the key. This takes O(n*lg(n)) time with a normal sorting algorithm.
3) After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n + n*lg(n) + K); since K is surely smaller than N, it is actually O(n*lg(n)).
We can improve this. Actually, we just want the top K words; the other words' frequencies are of no concern to us. So we can use "partial heap sorting". Instead of steps 2) and 3), we do:
2') Build a heap of (word, word-frequency) pairs with word-frequency as the key. It takes O(n) time to build the heap.
3') Extract the top K words from the heap. Each extraction is O(lg(n)), so the total time is O(K*lg(n)).
To summarize, this solution costs O(n + K*lg(n)) time.
This is just my thought. I haven't found a way to improve step 1).
I hope some information retrieval experts can shed more light on this question.
This can be done in O(n) time
Solution 1:
Steps:
Count the words and hash them, which will end up in a structure like this:
var hash = {
    "I"      : 13,
    "like"   : 3,
    "meow"   : 3,
    "geek"   : 3,
    "burger" : 2,
    "cat"    : 1,
    "foo"    : 100,
    ...
}
Traverse the hash and find the most frequently used word (in this case "foo", 100), then create an array of that size.
Then traverse the hash again and use each word's number of occurrences as an array index: if there is nothing at that index yet, create a list there, otherwise append the word to it. We end up with an array like:
index: 0     1      2         3                   ...  100
       [[ ], [cat], [burger], [like, meow, geek], ..., [foo]]
Then just traverse the array from the end, and collect the k words.
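A small C++ sketch of this bucket idea (the word list and k in main() are made up for illustration; buckets are indexed by count, so the top k words are collected from the high end):

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string> topKByBuckets(const std::vector<std::string>& words, int k) {
    std::unordered_map<std::string, int> freq;
    int maxCount = 0;
    for (const auto& w : words) maxCount = std::max(maxCount, ++freq[w]);

    // buckets[c] holds all words that occur exactly c times.
    std::vector<std::vector<std::string>> buckets(maxCount + 1);
    for (const auto& [word, count] : freq) buckets[count].push_back(word);

    // Walk from the highest count downwards and collect k words.
    std::vector<std::string> result;
    for (int c = maxCount; c >= 1 && (int)result.size() < k; --c)
        for (const auto& w : buckets[c])
            if ((int)result.size() < k) result.push_back(w);
    return result;
}

int main() {
    std::vector<std::string> words = {"foo", "cat", "foo", "meow", "foo", "meow"};
    for (const auto& w : topKByBuckets(words, 2)) std::cout << w << '\n'; // foo, meow
}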
Solution 2:
Steps:
Same as above
Use a min-heap and keep its size at k. For each word in the hash, compare its count with the heap's minimum: if the count is greater than the minimum and the heap already holds k entries, remove the minimum and insert the new entry; if the heap holds fewer than k entries, just insert it (a sketch follows below).
After traversing the hash, we just convert the min-heap to an array and return it.
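A hedged C++ sketch of the size-k min-heap step (it assumes the frequency map from step 1 has already been built; names like freq and topKByHeap are illustrative):

#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string> topKByHeap(const std::unordered_map<std::string, int>& freq, int k) {
    using Entry = std::pair<int, std::string>;           // (count, word)
    // std::greater turns priority_queue into a min-heap on the count.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& [word, count] : freq) {
        heap.push({count, word});
        if ((int)heap.size() > k) heap.pop();            // drop the current minimum
    }
    std::vector<std::string> result;                     // least frequent of the k comes first
    while (!heap.empty()) { result.push_back(heap.top().second); heap.pop(); }
    return result;
}

This runs in O(m*lg(k)) for m distinct words, which is attractive when k is much smaller than m.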
You're not going to get generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.
If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
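As a very rough, single-process illustration of that partitioning step (the shard count m and the helper name countByShard are assumptions for the sketch; a real map/reduce job would distribute this across machines):

#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// "Map" phase: route each word to one of m shards by hashing the word,
// so every occurrence of a given word lands on the same shard.
std::vector<std::unordered_map<std::string, long long>>
countByShard(const std::vector<std::string>& words, int m) {
    std::vector<std::unordered_map<std::string, long long>> shards(m);
    std::hash<std::string> h;
    for (const auto& w : words) shards[h(w) % m][w] += 1;
    return shards;
}
// "Reduce" phase: because shards hold disjoint words, each shard's top-k list
// can be computed independently and the per-shard lists merged at the end.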
A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and a O(n+k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.
The optimization here comes, again, after we run through the list, inserting into the hash table. We can use the median-of-medians algorithm to select the Kth largest element (by frequency). This algorithm is provably O(n).
After selecting the Kth largest element, we partition the list around it just as in quicksort. This is obviously also O(n). Anything on the "left" side of the pivot (the side with the larger frequencies) is in our group of K elements, so we're done (we can simply throw away everything else as we go along).
So this strategy is:
Go through each word and insert it into a hash table: O(n)
Select the Kth largest element (by frequency): O(n)
Partition around that element: O(n)
If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).
The O(n) time bound is optimal within a constant factor because we must examine each word at least once.
The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.
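In C++ the selection-and-partition strategy above can be sketched with std::nth_element (typically an introselect with average linear time, standing in here for median-of-medians); this is an illustrative sketch, not the answerer's own code:

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::pair<std::string, int>>
topKBySelection(const std::unordered_map<std::string, int>& freq, size_t k) {
    std::vector<std::pair<std::string, int>> items(freq.begin(), freq.end());
    if (k >= items.size()) return items;
    // Partition so the k most frequent items occupy the first k slots (average O(n)).
    std::nth_element(items.begin(), items.begin() + k, items.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });
    items.resize(k);
    // Optional ranking step: O(k * lg(k)).
    std::sort(items.begin(), items.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return items;
}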
If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.
Edit:
By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.
This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.
You can cut down the time further by partitioning using the first letter of the words, then partitioning the largest multi-word set using the next character, until you have k single-word sets. You would use a sort of 256-way tree with lists of partial/complete words at the leaves. You would need to be very careful not to cause string copies everywhere.
This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].
If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.
You have a bug in your description: Counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so probably should just optimize how the hash is built.
Your problem is the same as this one:
http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
Use a trie and a min-heap to solve it efficiently.
If what you're after is the list of the k most frequent words in your text for any practical k and for any natural language, then the complexity of your algorithm is not relevant.
Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.
As a side note, the complexity of the dummy algorithm (1. count all 2. sort the counts 3. take the best) is O(n+m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n).
Practically, the long step is counting.
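A minimal sketch of that sampling idea, counting only every stride-th word (the stride is an arbitrary illustrative parameter; sampling whole pages or random offsets works in the same spirit):

#include <string>
#include <unordered_map>
#include <vector>

// Count only every `stride`-th word (stride >= 1). The resulting frequencies are
// estimates, but for natural-language text the most frequent words surface quickly.
std::unordered_map<std::string, long long>
sampleCounts(const std::vector<std::string>& words, size_t stride) {
    std::unordered_map<std::string, long long> freq;
    for (size_t i = 0; i < words.size(); i += stride) freq[words[i]]++;
    return freq;
}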
Utilize a memory-efficient data structure to store the words.
Use a MaxHeap to find the top K frequent words.
Here is the code
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;
public class TopKFrequentItems {
private int maxSize;
private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;
public TopKFrequentItems(int k) {
this.maxSize = k;
this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}
private Comparator<TrieEntry> maxHeapComparator() {
return new Comparator<TrieEntry>() {
@Override
public int compare(TrieEntry o1, TrieEntry o2) {
return o1.frequency - o2.frequency;
}
};
}
public void add(String word) {
this.trie.insert(word);
}
public List<TopK> getItems() {
for (TrieEntry trieEntry : this.trie.getAll()) {
if (this.maxHeap.size() < this.maxSize) {
this.maxHeap.add(trieEntry);
} else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
this.maxHeap.remove();
this.maxHeap.add(trieEntry);
}
}
List<TopK> result = new ArrayList<TopK>();
for (TrieEntry entry : this.maxHeap) {
result.add(new TopK(entry));
}
return result;
}
public static class TopK {
public String item;
public int frequency;
public TopK(String item, int frequency) {
this.item = item;
this.frequency = frequency;
}
public TopK(TrieEntry entry) {
this(entry.word, entry.frequency);
}
@Override
public String toString() {
return String.format("TopK [item=%s, frequency=%s]", item, frequency);
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + frequency;
result = prime * result + ((item == null) ? 0 : item.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
TopK other = (TopK) obj;
if (frequency != other.frequency)
return false;
if (item == null) {
if (other.item != null)
return false;
} else if (!item.equals(other.item))
return false;
return true;
}
}
}
Here is the unit tests
@Test
public void test() {
TopKFrequentItems stream = new TopKFrequentItems(2);
stream.add("hell");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hero");
stream.add("hero");
stream.add("hero");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("home");
stream.add("go");
stream.add("go");
assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}
For more details, refer to this test case.
Use a hash table to record every word's frequency while traversing the whole word sequence; in this phase the key is the word and the value is its frequency (O(n) time). This is the same as everyone explained above.
While inserting into the hash map, keep a TreeSet (specific to Java; there are implementations in every language) of size 10 (k = 10) holding the top 10 frequent words. While its size is less than 10, keep adding entries. Once the size reaches 10 and a newly inserted element is greater than the minimum (i.e. the first) element, remove that minimum and insert the new element.
To restrict the size of the TreeSet, see this link.
Suppose we have a word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold". And K=2.
as you mentioned "partitioning using the first letter of words", we got
("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold")
"then partitioning the largest multi-word set using the next character until you have k single-word sets."
it will partition ("boy", "big", "bad") and ("com", "come", "cold"), but the first partition ("ad", "ad") is missed, even though "ad" is actually the most frequent word.
Perhaps I misunderstand your point. Can you please explain your partitioning process in more detail?
I believe this problem can be solved by an O(n) algorithm: we could do the sorting on the fly. In other words, the sorting here is a sub-problem of the traditional sorting problem, since only one counter gets incremented by one every time we access the hash table. Initially the list is sorted, because all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency, as follows: every time we increment a counter, we check its index in the ranked array and check whether its count now exceeds its predecessor in the list; if so, we swap the two elements. This gives a solution that is at most O(n), where n is the number of words in the original text.
I was struggling with this as well and got inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>), where a word sits in the set at position X, X being the word's current count. In general, here's how it works:
for each word, store it as part of a map of its occurrences: Map<String, Integer>;
then, based on the new count, remove the word from the previous count's set and add it to the new count's set.
The drawback of this is that the list may get big; this can be optimized by using a TreeMap<Integer, Set<String>>, but that adds some overhead. Ultimately we can use a mix of HashMap or our own data structure.
The code:
public class WordFrequencyCounter {
private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
List<Set<String>> reverseCounters = new ArrayList<Set<String>>();
private static class MutableCounter {
int i = 1;
}
public List<String> countMostFrequentWords(String text, int max) {
int lastPosition = 0;
int length = text.length();
for (int i = 0; i < length; i++) {
char c = text.charAt(i);
if (c <= WORD_SEPARATOR_MAX) {
if (i != lastPosition) {
String word = text.substring(lastPosition, i);
MutableCounter counter = counters.get(word);
if (counter == null) {
counter = new MutableCounter();
counters.put(word, counter);
} else {
Set<String> strings = reverseCounters.get(counter.i);
strings.remove(word);
counter.i ++;
}
addToReverseLookup(counter.i, word);
}
lastPosition = i + 1;
}
}
List<String> ret = new ArrayList<String>();
int count = 0;
for (int i = reverseCounters.size() - 1; i >= 0; i--) {
Set<String> strings = reverseCounters.get(i);
for (String s : strings) {
ret.add(s);
System.out.print(s + ":" + i);
count++;
if (count == max) break;
}
if (count == max) break;
}
return ret;
}
private void addToReverseLookup(int count, String word) {
while (count >= reverseCounters.size()) {
reverseCounters.add(new HashSet<String>());
}
Set<String> strings = reverseCounters.get(count);
strings.add(word);
}
}
I just found another solution for this problem, but I am not sure it is right.
Solution:
Use a hash table to record all words' frequencies: T(n) = O(n).
Choose the first k elements of the hash table and store them in a buffer (of space k): T(n) = O(k).
Each time, we first find the current minimum element of the buffer, and compare it with the remaining (n - k) elements of the hash table one by one. If an element of the hash table is greater than the buffer's minimum, drop the current minimum and add that element to the buffer. Finding the minimum in the buffer each time takes T(n) = O(k), and traversing the whole hash table takes T(n) = O(n - k), so the whole time complexity for this process is T(n) = O((n - k) * k).
After traversing the whole hash table, the result is in the buffer.
The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since k is generally much smaller than n, this solution's time complexity is T(n) = O(kn), which is linear when k is really small. Is that right? I am really not sure.
Try to think of a special data structure to approach this kind of problem. In this case, a special kind of tree such as a trie stores strings in a specific, very efficient way. Alternatively, build your own solution around counting words: I guess this TB of data would be in English, and we have around 600,000 words in general use, so it would be possible to store only those words and count which strings are repeated; this solution would also need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
This is an interesting idea to research, and I could find this paper related to Top-K: https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
Also there is an implementation of it here.
Simplest code to get the occurrences of the most frequently used word:
function strOccurence(str){
    var arr = str.split(" ");
    var length = arr.length, temp = {}, max;
    while(length--){
        if(temp[arr[length]] == undefined && arr[length].trim().length > 0){
            temp[arr[length]] = 1;
        } else if(arr[length].trim().length > 0){
            temp[arr[length]] = temp[arr[length]] + 1;
        }
    }
    console.log(temp);
    // index words by their counts (words with equal counts overwrite each other)
    max = [];
    for(var i in temp){
        max[temp[i]] = i;
    }
    console.log(max[max.length - 1]); // the most frequent word
    // if you want the second highest
    console.log(max[max.length - 2]);
}
In these situations, I recommend using Java's built-in features, since they are already well tested and stable. In this problem, I find the repetitions of the words using a HashMap, then push the results into an array of objects, sort the objects with Arrays.sort(), and print the top k words and their repetitions.
import java.io.*;
import java.lang.reflect.Array;
import java.util.*;
public class TopKWordsTextFile {
static class SortObject implements Comparable<SortObject>{
private String key;
private int value;
public SortObject(String key, int value) {
super();
this.key = key;
this.value = value;
}
@Override
public int compareTo(SortObject o) {
//descending order
return o.value - this.value;
}
}
public static void main(String[] args) {
HashMap<String,Integer> hm = new HashMap<>();
int k = 1;
try {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));
String line;
while ((line = br.readLine()) != null) {
// process the line.
//System.out.println(line);
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++){
if(hm.containsKey(tokens[i])){
//If the key already exists
Integer prev = hm.get(tokens[i]);
hm.put(tokens[i],prev+1);
}else{
//If the key doesn't exist
hm.put(tokens[i],1);
}
}
}
//Close the input
br.close();
//Print all words with their repetitions. You can use 3 for printing top 3 words.
k = hm.size();
// Get a set of the entries
Set set = hm.entrySet();
// Get an iterator
Iterator i = set.iterator();
int index = 0;
// Display elements
SortObject[] objects = new SortObject[hm.size()];
while(i.hasNext()) {
Map.Entry e = (Map.Entry)i.next();
//System.out.print("Key: "+e.getKey() + ": ");
//System.out.println(" Value: "+e.getValue());
String tempS = (String) e.getKey();
int tempI = (int) e.getValue();
objects[index] = new SortObject(tempS,tempI);
index++;
}
System.out.println();
//Sort the array
Arrays.sort(objects);
//Print top k
for(int j=0; j<k; j++){
System.out.println(objects[j].key+":"+objects[j].value);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.
C++11 implementation of the above idea:
class Solution {
public:
    vector<int> topKFrequent(vector<int>& nums, int k) {
        unordered_map<int,int> map;
        for(int num : nums){
            map[num]++;
        }
        vector<int> res;
        // We use a priority queue (max-heap) and keep the (size - k) smallest
        // elements in the queue; everything we pop is a top-k frequent element.
        // pair<first, second>: first is the frequency, second is the number itself.
        priority_queue<pair<int,int>> pq;
        for(auto it = map.begin(); it != map.end(); it++){
            pq.push(make_pair(it->second, it->first));
            // once the size is bigger than (size - k), we pop the top value,
            // which is one of the top k frequent element values
            if((int)pq.size() > (int)map.size() - k){
                res.push_back(pq.top().second);
                pq.pop();
            }
        }
        return res;
    }
};

How to calculate Hash value of a Tree

What is the best way to calculate the hash value of a Tree?
I need to compare the similarity between several trees in O(1). Now, I want to precalculate the hash values and compare them when needed. But then I realized, hashing a tree is different than hashing a sequence. I wasn't able to come up with a good hash function.
What is the best way to calculate hash value of a tree?
Note: I will implement the function in C/C++.
Well, hashing a tree means representing it in a unique way so that we can distinguish other trees from this tree using a simple representation or number. In a normal polynomial hash we use base conversion: we convert a string or sequence into a number in a specific prime base, modulo another large prime. We can hash a tree using the same technique.
Now fix the root of the tree at any vertex. Let root = 1, and let:
B = the base in which we want to convert;
P[i] = the i-th power of B (B^i);
level[i] = the depth of the i-th vertex (its distance from the root);
child[i] = the total number of vertices in the subtree of the i-th vertex, including i;
degree[i] = the number of nodes adjacent to vertex i.
Now the contribution of the i-th vertex to the hash value is:
hash[i] = ((P[level[i]] + degree[i]) * child[i]) % modVal
And the hash value of the entire tree is the sum of all vertices' hash values:
(hash[1] + hash[2] + .... + hash[n]) % modVal
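A small C++ sketch of that formula (B and modVal are arbitrary illustrative constants; the tree is given as an adjacency list over vertices 1..n and rooted at vertex 1, as assumed above):

#include <functional>
#include <vector>

// Sums ((B^level[i] + degree[i]) * child[i]) over all vertices, modulo MOD,
// following the per-vertex contribution described above.
long long hashTree(const std::vector<std::vector<int>>& adj) {
    const long long B = 131, MOD = 1000000007LL;
    long long total = 0;

    std::function<int(int, int, long long)> dfs = [&](int u, int parent, long long power) {
        int subtree = 1;                                  // child[u], including u itself
        for (int v : adj[u])
            if (v != parent) subtree += dfs(v, u, power * B % MOD);
        long long degree = (long long)adj[u].size();      // adjacent vertices of u
        total = (total + (power + degree) % MOD * subtree) % MOD;
        return subtree;
    };
    dfs(1, 0, 1);                                          // root is at level 0, so B^0 = 1
    return total;
}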
If we use this definition of tree equivalence:
T1 is equivalent to T2 iff
all paths to leaves of T1 exist exactly once in T2, and
all paths to leaves of T2 exist exactly once in T1
Hashing a sequence (a path) is straightforward. If h_tree(T) is a hash of all paths-to-leaves of T, where the order of the paths does not alter the outcome, then it is a good hash for the whole of T, in the sense that equivalent trees will produce equal hashes, according to the above definition of equivalence. So I propose:
h_path(path) = an order-dependent hash of all elements in the path.
Requires O(|path|) time to calculate,
but child nodes can reuse the calculation of their
parent node's h_path in their own calculations.
h_tree(T) = an order-independent hashing of all its paths-to-leaves.
Can be calculated in O(|L|), where L is the number of leaves
In pseudo-c++:
struct node {
    int path_hash; // path-to-root hash; only used for building tree_hash
    int tree_hash; // takes children into account; use this to compare trees
    int content;
    vector<node> children;
    int update_hash(int parent_path_hash = 1) {
        path_hash = parent_path_hash * PRIME1 + content;     // order-dependent
        tree_hash = path_hash;
        for (node& n : children) {                           // by reference, so the children's cached hashes are updated too
            tree_hash += n.update_hash(path_hash) * PRIME2;  // order-independent
        }
        return tree_hash;
    }
};
After building two trees, update their hashes and compare away. Equivalent trees should have the same hash, different trees not so much. Note that the path and tree hashes that I am using are rather simplistic, and chosen rather for ease of programming than for great collision resistance...
Child hashes should be successively multiplied by a prime number & added. Hash of the node itself should be multiplied by a different prime number & added.
Cache the hash of the tree overall -- I prefer to cache it outside the AST node, if I have a wrapper object holding the AST.
public class RequirementsExpr {
    protected RequirementsAST ast;
    protected int hash = -1;

    public int hashCode() {
        if (hash == -1)
            this.hash = ast.hashCode();
        return hash;
    }
}

public class RequirementsAST {
    protected int nodeType;
    protected Object data;
    // -
    protected RequirementsAST down;
    protected RequirementsAST across;

    public int hashCode() {
        int nodeHash = nodeType;
        nodeHash = (nodeHash * 17) + (data != null ? data.hashCode() : 0);
        nodeHash *= 23; // prime A.

        int childrenHash = 0;
        for (RequirementsAST child = down; child != null; child = child.getAcross()) {
            childrenHash *= 41; // prime B.
            childrenHash += child.hashCode();
        }
        int result = nodeHash + childrenHash;
        return result;
    }
}
The result of this is that child/descendant nodes in different positions are always multiplied in by different factors, and the node itself is always multiplied in by a different factor from any possible child/descendant nodes.
Note that other primes should also be used in building the nodeHash of the node data, itself. This helps avoid eg. different values of nodeType colliding with different values of data.
Within the limits of 32-bit hashing, this scheme overall gives a very high chance of uniqueness for any differences in tree-structure (eg, transposing two siblings) or value.
Once calculated (over the entire AST) the hashes are highly efficient.
I would recommend converting the tree to a canonical sequence and hashing the sequence. (The details of the conversion depend on your definition of equivalence. For example, if the trees are binary search trees and the equivalence relation is structural, then the conversion could be to enumerate the tree in preorder, as the structure of binary search trees can be recovered from the preorder enumeration.)
Thomas's answer boils down at first glance to associating a multivariable polynomial with each tree and evaluating the polynomial at a particular location. There are two steps that, at the moment, have to be assumed on faith; the first is that the map doesn't send inequivalent trees to the same polynomial, and the second is that the evaluation scheme doesn't introduce too many collisions. I can't evaluate the first step presently, though there are reasonable definitions of equivalence that permit reconstruction from a two-variable polynomial. The second is not theoretically sound but could be made so via Schwartz--Zippel.
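As a concrete illustration of hashing a canonical sequence (a hedged sketch under the assumption of structural equivalence for plain binary trees: serialize in preorder with explicit null markers, then hash that sequence; the constants are arbitrary FNV-style values):

#include <cstdint>

struct TreeNode {
    int key;
    TreeNode* left = nullptr;
    TreeNode* right = nullptr;
};

// Preorder serialization with a marker for null children makes the sequence
// canonical for structural equality, so structurally equal trees hash equally.
std::uint64_t canonicalHash(const TreeNode* t, std::uint64_t h = 1469598103934665603ULL) {
    const std::uint64_t PRIME = 1099511628211ULL;
    if (t == nullptr) return (h ^ 0x9e3779b9ULL) * PRIME;   // null marker
    h = (h ^ (std::uint64_t)t->key) * PRIME;                // visit the node
    h = canonicalHash(t->left, h);                          // then its left subtree
    return canonicalHash(t->right, h);                      // then its right subtree
}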

Efficient algorithm to remove any map that is contained in another map from a collection of maps

I have a set (s) of unique maps (Java HashMaps currently) and wish to remove from it any map that is completely contained in some other map in the set (i.e. remove m from s if m.entrySet() is a subset of n.entrySet() for some other n in s).
I have an n^2 algorithm, but it's too slow. Is there a more efficient way to do this?
Edit:
the set of possible keys is small, if that helps.
Here is an inefficient reference implementation:
public void removeSubmaps(Set<Map> s) {
    Set<Map> toRemove = new HashSet<Map>();
    for (Map a : s) {
        for (Map b : s) {
            if (a == b) continue; // don't compare a map with itself
            if (a.entrySet().containsAll(b.entrySet()))
                toRemove.add(b);
        }
    }
    s.removeAll(toRemove);
}
Not sure I can make this anything other than an n^2 algorithm, but I have a shortcut that might make it faster. Make a list of your maps with the length of each map and sort it. A proper subset of a map must be shorter than or equal to the map you're comparing against, so there's never any need to compare with a map higher in the list.
Here's another stab at it.
Decompose all your maps into a list of (key, value, map number) triples. Sort the list by key and value. Go through the list and, for each group of matching key/value entries, create all pairs of the map numbers in the group; these are all potential subset relations. When you have the final list of pairs, sort it by map numbers. Go through this second list and count the number of occurrences of each pair; if the count matches the size of one of the maps in the pair, you've found a subset.
Edit: My original interpretation of the problem was incorrect; here is a new answer based on my re-read of the question.
You can create a custom hash function for HashMap which returns the product of the hash values of its entries. Sort the list of hash values and, starting from the biggest value, find all divisors among the smaller hash values; these are possible subsets of this hashmap. Use set.containsAll() to confirm before marking them for removal.
This effectively transforms the problem into the mathematical problem of finding possible divisors in a collection, and you can apply all the common divisor-search optimizations.
Complexity is O(n^2), but if many hashmaps are subsets of others, the actual time spent can be a lot better, approaching O(n) in the best case (if all hashmaps are subsets of one). Even in the worst case, the division check would be a lot faster than set.containsAll(), which is itself O(n^2) where n is the number of items in a hashmap.
You might also want to create a simple hash function for hashmap entry objects that returns smaller numbers, to improve multiplication/division performance.
Here's a subquadratic (O(N**2 / log N)) algorithm for finding maximal sets from a set of sets: An Old Sub-Quadratic Algorithm for Finding Extremal Sets.
But if you know your data distribution, you can do much better in average case.
This is what I ended up doing. It works well in my situation, as there is usually some value that is only shared by a small number of maps. Kudos to Mark Ransom for pushing me in this direction.
In prose: index the maps by key/value pair, so that each key/value pair is associated with a set of maps. Then, for each map: find the smallest set associated with one of its key/value pairs (this set is typically small for my data). Each of the maps in that set is a potential 'supermap'; no other map could be a supermap, as it would not contain this key/value pair. Search this set for a supermap. Finally, remove all the identified submaps from the original set.
private <K, V> void removeSubmaps(Set<Map<K, V>> maps) {
// index the maps by key/value
List<Map<K, V>> mapList = toList(maps);
Map<K, Map<V, List<Integer>>> values = LazyMap.create(HashMap.class, ArrayList.class);
for (int i = 0, uniqueRowsSize = mapList.size(); i < uniqueRowsSize; i++) {
Map<K, V> row = mapList.get(i);
Integer idx = i;
for (Map.Entry<K, V> entry : row.entrySet())
values.get(entry.getKey()).get(entry.getValue()).add(idx);
}
// find submaps
Set<Map<K, V>> toRemove = Sets.newHashSet();
for (Map<K, V> submap : mapList) {
// find the smallest set of maps with a matching key/value
List<Integer> smallestList = null;
for (Map.Entry<K, V> entry : submap.entrySet()) {
List<Integer> list = values.get(entry.getKey()).get(entry.getValue());
if (smallestList == null || list.size() < smallestList.size())
smallestList = list;
}
// compare with each of the maps in that set
for (int i : smallestList) {
Map<K, V> map = mapList.get(i);
if (isSubmap(submap, map))
toRemove.add(submap);
}
}
maps.removeAll(toRemove);
}
private <K,V> boolean isSubmap(Map<K, V> submap, Map<K,V> map){
if (submap.size() >= map.size())
return false;
for (Map.Entry<K,V> entry : submap.entrySet()) {
V other = map.get(entry.getKey());
if (other == null)
return false;
if (!other.equals(entry.getValue()))
return false;
}
return true;
}
