Expected tree for a given array
Given an array of n elements, create a tree such that the root-to-leaf path labels represent all arrangements of the array elements of length n.
For example, if array = {1, 2, 3}, then the tree should have 6 root-to-leaf paths, namely {1,2,3}, {1,3,2}, {2,1,3}, {2,3,1}, {3,1,2}, {3,2,1}. Here 1 <= n <= 10^9 and there can be duplicates too.
Since the list might contain duplicates, you need to dedupe them and keep counts. To keep it less confusing, let us use letters instead of numbers.
Now, let us say {a, a, b, b, b} is your input.
Convert it into a data structure that looks like {a:2, b:3} (you can do this by using a map or using tuples)
Algorithm:
Initialization: Create a root node
Input: data = {a:2, b:3} and node
For node create k = size(data) children.
Assign each key in data to each new edge (in this case a and b).
For each child, call this function recursively with the corresponding edge's count decremented by one. Example: for the function call corresponding to a, you will pass on data = {a:1, b:3}, and for b, data = {a:2, b:2}.
Whenever a count becomes 0, remove the corresponding key from data.
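Here is a minimal Python sketch of that recursion; the Node, build and paths names are mine, just for illustration:

from collections import Counter

class Node:
    def __init__(self, label=None):
        self.label = label        # edge label leading into this node (None for the root)
        self.children = []

def build(node, counts):
    """Give `node` one child per distinct remaining element, then recurse with that count decremented."""
    for key in list(counts):
        child = Node(key)
        node.children.append(child)
        remaining = dict(counts)
        remaining[key] -= 1
        if remaining[key] == 0:
            del remaining[key]    # whenever a count reaches 0, drop the key
        build(child, remaining)

def paths(node, prefix=()):
    """Yield the label sequence of every root-to-leaf path."""
    if not node.children:
        yield prefix
    for child in node.children:
        yield from paths(child, prefix + (child.label,))

root = Node()
build(root, Counter("aabbb"))     # the {a: 2, b: 3} example from above
print(len(list(paths(root))))     # 10 = 5! / (2! * 3!) distinct arrangements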
Related
I need to think of a data structure, which supports the following operations efficiently:
1) Add an integer x
2) Delete an integer with maximum frequency (if more than one element has the same maximum frequency, delete all of them).
I am thinking of implementing a segment tree where each node stores the index of its child having largest frequency.
Any ideas or suggestions on how to approach this problem or how should it be implemented would be kindly appreciated.
We can use a combination of data structures: a hash_map to maintain the frequency mappings, where the key is the integer and the value is a pointer to a "frequency" node representing the frequency value and the set of integers having that frequency. The frequency nodes are maintained in a list ordered by frequency value.
The Frequency node can be defined as
class Freq {
    int frequency;
    Set<Integer> values_with_frequency;
    Freq prev;
    Freq next;
}
The elements HashMap would then contain entries of the form
Entry<Integer, Freq>
So, for a snapshot of the dataset such as
a, b, c, b, d, d, a, e, a, f, b (where the letters denote integers), this is how the data structure would look:
c -----------> (1, [c, e, f])
               |
               |
e --------------
               |
               |
f --------------

a -----------> (3, [a, b])
               |
               |
b ---------------

d ------------> (2, [d])
The Freq nodes would be maintained in a linked list, say freq_nodes, sorted by the frequency value. Note that, as explained below, there wouldn't be any log(n) operation needed for keeping the list sorted on the add/delete operations.
The way the add(x) and delete_max_freq() operations could be implemented is as follows:
add(x):
If x is not found in the elements map, check whether the first element of freq_nodes is a Freq object with frequency 1. If so, add x to its values_with_frequency set. Otherwise, create a new Freq object with 1 as the frequency value and x as the (for now only) member of its wrapped set values_with_frequency, and insert it at the front of freq_nodes.
Otherwise (i.e. if x is already in the elements map), follow the pointer in x's entry to its Freq object in freq_nodes and remove x from that object's values_with_frequency set, noting x's current frequency, elements.get(x).frequency (hold this value in, say, F). If the set values_with_frequency is rendered empty by this removal, delete the corresponding node from the freq_nodes linked list. Finally, if the next Freq node in freq_nodes has frequency F+1, just add x to that node's values_with_frequency field; otherwise create a new Freq node, as was done above for the case where no frequency-1 node existed.
Finally, point x's entry in the elements map at this Freq node.
Note that this whole add(x) operation is going to be O(1) in time.
Here's an example of a sequence of add() operations with the subsequent state of the data structure.
add(a)
a -> N1 : freq_nodes : |N1 (1, {a}) | ( N1 is actually a Freq object)
add(b)
a -> N1 : freq_nodes : |N1 (1, {a, b}) |
b -> N1
add(a)
At this point ‘a’ points to N1; however, its current frequency is 2, so we need to insert a node N2 next to N1 in the DLL, after removing ‘a’ from N1’s values_with_frequency set {a, b}:
a -> N2 : freq_nodes : |N1 (1, {b}) | --> |N2 (2, {a}) |
b -> N1
The interesting thing to note here is that any time we increase the frequency of an existing element from F to say F+1, we need to do the following
if (next node has a higher frequency than F+1, or we have reached the end of the list):
    create a new Freq node with frequency equal to F+1 (as is done above)
    and insert it next to the current node
else:
    add ‘a’ (the input to the add() operation) to the values_with_frequency set of the next node
The delete_max_freq() operation would just involve removing the last entry of the linked list freq_nodes, and iterating over the keys in the wrapped set values_with_frequency to remove the corresponding keys from the elements map. This operation would take O(k) time where k is the number of elements with maximum frequency.
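For concreteness, here is a compact Python sketch of this design: a dict from integer to its Freq node, plus a hand-rolled doubly linked list of Freq nodes ordered by frequency. All class and method names here are mine, not from the answer above.

class Freq:
    def __init__(self, frequency):
        self.frequency = frequency
        self.values = set()            # values_with_frequency
        self.prev = self.next = None

class FreqStructure:
    def __init__(self):
        self.elements = {}             # integer -> Freq node it currently belongs to
        self.head = None               # lowest frequency
        self.tail = None               # highest frequency

    def _insert_after(self, node, new):
        # link `new` right after `node` (or at the head if node is None)
        new.prev, new.next = node, (node.next if node else self.head)
        if new.next: new.next.prev = new
        else: self.tail = new
        if node: node.next = new
        else: self.head = new

    def _remove(self, node):
        if node.prev: node.prev.next = node.next
        else: self.head = node.next
        if node.next: node.next.prev = node.prev
        else: self.tail = node.prev

    def add(self, x):                  # O(1): only neighbouring Freq nodes are touched
        cur = self.elements.get(x)
        f = cur.frequency if cur else 0
        after = cur
        if cur:
            cur.values.discard(x)
            if not cur.values:         # Freq node emptied: unlink it
                after = cur.prev
                self._remove(cur)
        nxt = after.next if after else self.head
        if nxt is None or nxt.frequency != f + 1:
            nxt = Freq(f + 1)
            self._insert_after(after, nxt)
        nxt.values.add(x)
        self.elements[x] = nxt

    def delete_max_freq(self):         # O(k), k = number of elements with the max frequency
        if not self.tail:
            return set()
        node = self.tail
        self._remove(node)
        for x in node.values:
            del self.elements[x]
        return node.values

ds = FreqStructure()
for x in [7, 7, 3, 3, 5]:
    ds.add(x)
print(ds.delete_max_freq())            # {3, 7} -- both occur twice, the current maximum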
Assuming "efficient" refers to the way the complexity of those operations scale, big-O style, I'd consider something consisting of:
a hashmap with the integers as keys and their frequencies as values
a tree structure (possibly a binary search tree, e.g.) where its nodes have a number representing a frequency and a hashset of numbers which have that frequency.
When a number is inserted:
1. Look up the number in the hashmap to find its frequency. (O(1))
2. Look up the frequency in the tree (O(log N)). Remove the number from its collection (O(1)). If the collection is empty, remove the frequency from the tree (O(log N)).
3. Increment the number's frequency. Set that value in the hashmap (O(1)).
4. Look up its new frequency in the tree (O(log N)). If it's there, add the number to the collection there (O(1)). If not, add a new node with the number in its collection (O(log N)).
When deleting items with the maximum frequency:
1. Remove the highest-valued node from the tree (O(log N)).
2. For each number in that node's collection, remove that number's entry from the hashmap (O(1) for each number removed).
If you have N numbers to add and remove, your worst-case scenario should be O(N log N) regardless of the actual distribution of frequencies or the order in which numbers are added and removed.
If you know of any assumptions you can make about the numbers being added, it's possible you could make further enhancements like using an indexed array rather than an ordered tree. But if your inputs are fairly unbounded, this seems like a pretty good structure to handle all the operations you want without getting into O(n²) territory.
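Here is a sketch of this variant in Python; the third-party sortedcontainers package stands in for the balanced tree (that dependency is my assumption, not part of the standard library), giving ordered-key lookups and removals:

from sortedcontainers import SortedDict   # assumed third-party dependency

class MaxFreqSet:
    def __init__(self):
        self.freq = {}                     # number -> its current frequency
        self.by_freq = SortedDict()        # frequency -> set of numbers with that frequency

    def add(self, x):
        f = self.freq.get(x, 0)
        if f:
            bucket = self.by_freq[f]
            bucket.discard(x)
            if not bucket:
                del self.by_freq[f]        # drop empty frequency node
        self.freq[x] = f + 1
        self.by_freq.setdefault(f + 1, set()).add(x)

    def delete_max_freq(self):
        if not self.by_freq:
            return set()
        _, bucket = self.by_freq.popitem(-1)   # highest frequency
        for x in bucket:
            del self.freq[x]
        return bucket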
My thoughts:
You will need 2 maps.
Map 1: Integer as key with frequency as value.
Map 2: Have a map of frequencies as keys and list of integers as values.
Add Integer: Add the integer to map 1 and get its frequency. Add it to the list for that frequency key in map 2.
Delete Integer: We can obviously maintain the maximum frequency in a variable across these operations. Now remove the key from map 2 which has this max frequency and decrement max frequency.
So, adding and deleting performance should be O(1) on average.
In the above scheme, map 1 may still contain integers whose recorded frequency is stale after a delete from map 2. In that case, when the same integer gets added again, we do an on-demand update in map 1: if its current frequency in map 1 no longer exists as a key in map 2, it was deleted, and we reset the frequency to 1.
Implementation:
import java.util.*;

class Foo {
    Map<Integer, Integer> map1;
    Map<Integer, Set<Integer>> map2;
    int max_freq;

    Foo() {
        map1 = new HashMap<>();
        map2 = new HashMap<>();
        map2.put(0, new HashSet<>());
        max_freq = 0;
    }

    public void add(int x) {
        map1.putIfAbsent(x, 0);
        int curr_f = map1.get(x);
        if (!map2.containsKey(curr_f)) {
            map1.put(x, 1);
        } else {
            map1.merge(x, 1, Integer::sum);
        }
        map2.putIfAbsent(map1.get(x), new HashSet<>());
        map2.get(map1.get(x) - 1).remove(x); // remove from previous frequency list
        map2.get(map1.get(x)).add(x);        // add to current frequency list
        max_freq = Math.max(max_freq, map1.get(x));
        printState();
    }

    public List<Integer> delete() {
        List<Integer> ls = new ArrayList<>(map2.get(max_freq));
        map2.remove(max_freq);
        max_freq--;
        while (max_freq > 0 && map2.get(max_freq).size() == 0) max_freq--;
        printState();
        return ls;
    }

    public void printState() {
        System.out.println(map1.toString());
        System.out.println("Maximum frequency: " + max_freq);
        for (Map.Entry<Integer, Set<Integer>> m : map2.entrySet()) {
            System.out.println(m.getKey() + " " + m.getValue().toString());
        }
        System.out.println("----------------------------------------------------");
    }
}
Demo: https://ideone.com/tETHKV
Note: The call to delete() is amortized.
I am looking at this challenge:
Given a tree with N nodes and N-1 edges. Each edge on the tree is labelled by a string of lowercase letters from the Latin alphabet. Given Q queries, consisting of two nodes u and v, check if it is possible to make a palindrome string which uses all the characters that belong to the string labelled on the edges in the path from node u to node v.
Characters can be used in any order.
N is of the order of 10^5 and Q is of the order of 10^6
Input:
N=3
u=1 v=3 weight=bc
u=1 v=2 weight=aba
Q=4
u=1 v=2
u=2 v=3
u=3 v=1
u=3 v=3
Output:
YES
YES
NO
NO
What I thought was to compute the LCA of the 2 nodes in O(1) after precomputation, using a sparse table for range minimum queries over an Euler tour, then walk the path from the LCA to node u and from the LCA to node v and accumulate the character frequencies. If the total number of characters is odd, we check that exactly one character has an odd frequency; if the total is even, we check that every character has an even frequency. But this process will surely time out because Q can be up to 10^6.
Is there anyone with a better algorithm?
Preparation Step
Prepare your data structure as follows:
For each node get the path to the root, get all letters on the path, and only retain a letter when it occurs an odd number of times on that path. Finally encode that string with unique letters as a bit pattern, where bit 0 is set when there is an "a", bit 1 is set when there is a "b", ... bit 25 is set when there is a "z". Store this pattern with the node.
This preprocessing can be done with a depth-first recursive procedure, where the current node's pattern is passed down to the children, which can apply the edge's information to that pattern to create their own pattern. So this preprocessing can run in linear time in terms of the total number of characters in the tree, or more precisely O(N+S), where S represents that total number of characters.
Query Step
When a query is done perform the bitwise XOR on the two involved patterns. If the result is 0 or it has only one bit set, return "YES", else return "NO". So the query will not visit any other nodes than just the two ones that are given, look up the two patterns and perform their XOR and make the bit test. All this happens in constant time for one query.
The last query given in the question shows that the result should be "NO" when the two nodes are the same node. This is a boundary case, as it is debatable whether an empty string is a palindrome or not. The above XOR algorithm would return "YES", so you would need a specific test for this boundary case, and return "NO" instead.
Explanation
This works because if we look at the paths both nodes have to the root, they may share a part of their path. The characters on that common path should not be considered, and the XOR will make sure they aren't. Where the paths differ, we actually have the edges on the path from the one node to the other. There we see the characters that should contribute to a palindrome.
If a character appears an even number of times in those edges, it poses no problem for creating a palindrome. The XOR makes sure those characters "disappear".
If a character appears an odd number of times, all but one can mirror each other like in the even case. The remaining one can only be used in an odd-length palindrome, and only in the centre position of it. So there can only be one such character. This translates to the test that the XOR result is allowed to have 1 bit set (but not more).
Implementation
Here is an implementation in JavaScript. The example run uses the input as provided in the question. I did not bother to turn the query results from boolean to NO/YES:
function prepare(edges) {
    // edges: array of [u, v, weight] triplets
    // Build adjacency list from the list of edges
    let adjacency = {};
    for (let [u, v, weight] of edges) {
        // convert weight to pattern, as we don't really need to
        // store the strings
        let pattern = 0;
        for (let i = 0; i < weight.length; i++) {
            let ascii = weight.charCodeAt(i) - 97;
            pattern ^= 1 << ascii; // toggle bit that corresponds to letter
        }
        if (v in adjacency && u in adjacency) throw "Cycle detected!";
        if (!(v in adjacency)) adjacency[v] = {};
        if (!(u in adjacency)) adjacency[u] = {};
        adjacency[u][v] = pattern;
        adjacency[v][u] = pattern;
    }
    // Prepare the consolidated path-pattern for each node
    let patterns = {}; // This is the information to return
    function dfs(u, parent, pathPattern) {
        patterns[u] = pathPattern;
        for (let v in adjacency[u]) {
            // recurse into the "children" (the parent is not revisited)
            if (v !== parent) dfs(v, u, adjacency[u][v] ^ pathPattern);
        }
    }
    // Start a DFS from an arbitrary node as root
    dfs(edges[0][0], null, 0);
    return patterns;
}

function query(nodePatterns, u, v) {
    if (u === v) return false; // Boundary case.
    let pattern = nodePatterns[u] ^ nodePatterns[v];
    // "smart" test to verify that at most 1 bit is set
    return pattern === (pattern & -pattern);
}

// Example:
let edges = [[1, 3, "bc"], [1, 2, "aba"]];
let queries = [[1, 2], [2, 3], [3, 1], [3, 3]];
let nodePatterns = prepare(edges);
for (let [u, v] of queries) {
    console.log(u, v, query(nodePatterns, u, v));
}
First of all, let's choose a root. Now imagine that each edge points to the node which is deeper in the tree. Instead of having strings on edges, put each string on the vertex that its edge points to; now every vertex except the root carries a string. For each vertex, calculate and store the count of each letter in its string.
From now on we'll handle each letter separately.
Using DFS, calculate for each node v the number of occurrences of each letter on the vertices on the path from v to the root. You'll also need the LCA, so you may precompute RMQ, or find the LCA in O(log n) if you like. Let Letters[v][c] be the number of occurrences of letter c on the path from v to the root. Then, to find the number of occurrences of letter c on the path from u to v, just use Letters[v][c] + Letters[u][c] - 2 * Letters[LCA(v, u)][c]. You can check the count of a single letter in O(1) (or O(log n) if you're not using RMQ), so in 26 * O(1) you can check every possible letter.
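A small Python sketch of this per-letter idea follows. It substitutes a naive parent-walking LCA for the RMQ/sparse-table version, so queries here cost O(depth * 26) rather than O(26); the function names are mine.

from collections import defaultdict

def preprocess(edges, root=1):
    """Store, for every node, the per-letter counts on its path to the root."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    parent, depth = {root: 0}, {root: 0}
    letters = {root: [0] * 26}          # Letters[v][c]: count of letter c from v up to the root
    stack = [root]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in parent:
                parent[v], depth[v] = u, depth[u] + 1
                cnt = letters[u][:]
                for ch in w:
                    cnt[ord(ch) - 97] += 1
                letters[v] = cnt
                stack.append(v)
    return parent, depth, letters

def lca(u, v, parent, depth):
    # naive parent-walking LCA; replace with RMQ/sparse table for the stated bounds
    while depth[u] > depth[v]: u = parent[u]
    while depth[v] > depth[u]: v = parent[v]
    while u != v: u, v = parent[u], parent[v]
    return u

def can_make_palindrome(u, v, parent, depth, letters):
    if u == v:
        return False                    # the question expects NO when both nodes coincide
    a = lca(u, v, parent, depth)
    odd = sum((letters[u][c] + letters[v][c] - 2 * letters[a][c]) % 2 for c in range(26))
    return odd <= 1                     # at most one letter may have an odd count

parent, depth, letters = preprocess([(1, 3, "bc"), (1, 2, "aba")])
for u, v in [(1, 2), (2, 3), (3, 1), (3, 3)]:
    print("YES" if can_make_palindrome(u, v, parent, depth, letters) else "NO")
# YES, YES, NO, NO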
My data has a large number of sets (a few million). Each set contains between a few members and several tens of thousands of integers. Many of those sets are subsets of larger sets (there are many such super-sets). I'm trying to assign each subset to its largest superset.
Can anyone please recommend an algorithm for this type of task?
There are many algorithms for generating all possible subsets of a set, but that type of approach is time-prohibitive given my data size (e.g. this paper or SO question).
Example of my data-set:
A {1, 2, 3}
B {1, 3}
C {2, 4}
D {2, 4, 9}
E {3, 5}
F {1, 2, 3, 7}
Expected answer: B and A are subsets of F (it's not important that B is also a subset of A); C is a subset of D; E remains unassigned.
Here's an idea that might work:
Build a table that maps each number to a sorted list of sets, sorted first by size with the largest first, and with ties broken arbitrarily by some canonical order (say, alphabetically by set name). So in your example, you'd have a table that maps 1 to [F, A, B], 2 to [F, A, D, C], 3 to [F, A, B, E] and so on. This can be implemented to take O(n log n) time, where n is the total size of the input.
For each set in the input:
fetch the lists associated with each entry in that set. So for A, you'd get the lists associated with 1, 2, and 3. The total number of list fetches you'll issue over the runtime of the whole algorithm is O(n), so the runtime so far is O(n log n + n), which is still O(n log n).
Now walk down the lists simultaneously. If a set is the first entry in every one of those lists, then it's the largest set that contains the input set: output that association and continue with the next input set. If not, discard the smallest item among the heads of the lists and try again. Implementing this last bit is tricky, but you can store the heads of all the lists in a heap and get (IIRC) something like O(n log k) overall runtime, where k is the maximum size of any individual set, so you can bound it at O(n log n) in the worst case.
So if I got everything straight, the runtime of the algorithm is overall O(n log n), which seems like probably as good as you're going to get for this problem.
Here is a python implementation of the algorithm:
from collections import defaultdict, deque
import heapq

def LargestSupersets(setlists):
    '''Computes, for each item in the input, the largest superset in the same input.

    setlists: A list of lists, each of which represents a set of items. Items must be hashable.
    '''
    # First, build a table that maps each element in any input setlist to a list of records
    # of the form (-size of setlist, index of setlist), one for each setlist that contains
    # the corresponding element
    element_to_entries = defaultdict(list)
    for idx, setlist in enumerate(setlists):
        entry = (-len(setlist), idx)  # cheesy way to make an entry that sorts properly -- largest first
        for element in setlist:
            element_to_entries[element].append(entry)

    # Within each entry, sort so that larger items come first, with ties broken arbitrarily by
    # the set's index
    for entries in element_to_entries.values():
        entries.sort()

    # Now build up the output by going over each setlist and walking over the entries list for
    # each element in the setlist. Since the entries list for each element is sorted largest to
    # smallest, the first entry we find that is in every entry set we pulled will be the largest
    # element of the input that contains each item in this setlist. We are guaranteed to eventually
    # find such an element because, at the very least, the item we're iterating on itself is in
    # each entries list.
    output = []
    for idx, setlist in enumerate(setlists):
        num_elements = len(setlist)
        buckets = [element_to_entries[element] for element in setlist]

        # We implement the search for an item that appears in every list by maintaining a heap and
        # a queue. We have the invariants that:
        #   1. The queue contains the num_elements smallest items across all the buckets, in order
        #   2. The heap contains the smallest item from each bucket that has not already passed through
        #      the queue.
        smallest_entries_heap = []
        smallest_entries_deque = deque([], num_elements)
        for bucket_idx, bucket in enumerate(buckets):
            smallest_entries_heap.append((bucket[0], bucket_idx, 0))
        heapq.heapify(smallest_entries_heap)

        while (len(smallest_entries_deque) < num_elements or
               smallest_entries_deque[0] != smallest_entries_deque[num_elements - 1]):
            # First extract the next smallest entry in the queue ...
            (smallest_entry, bucket_idx, element_within_bucket_idx) = heapq.heappop(smallest_entries_heap)
            smallest_entries_deque.append(smallest_entry)
            # ... then add the next-smallest item from the bucket that we just removed an element from
            if element_within_bucket_idx + 1 < len(buckets[bucket_idx]):
                new_element = buckets[bucket_idx][element_within_bucket_idx + 1]
                heapq.heappush(smallest_entries_heap, (new_element, bucket_idx, element_within_bucket_idx + 1))

        output.append((idx, smallest_entries_deque[0][1]))
    return output
Note: don't trust my writeup too much here. I just thought of this algorithm right now, I haven't proved it correct or anything.
So you have millions of sets, with thousands of elements each. Just representing that dataset takes billions of integers. In your comparisons you'll quickly get to trillions of operations without even breaking a sweat.
Therefore I'll assume that you need a solution which will distribute across a lot of machines. Which means that I'll think in terms of https://en.wikipedia.org/wiki/MapReduce. A series of them.
1. Read the sets in, mapping them to k:v pairs of i: s where i is an element of the set s.
2. Receive a key that is an integer i, along with the list of sets containing it. Map them off to pairs (s1, s2): i where s1 <= s2 are both sets that include i. Do not omit to map each set to be paired with itself!
3. For each pair (s1, s2), count the size k of the intersection and send off the pairs s1: k and s2: k. (Only send the second if s1 and s2 are different.)
4. For each set s, receive the set of supersets. If s is maximal, send off s: s. Otherwise send off t: s for every t that is a strict superset of s.
5. For each set s, receive the set of subsets, with s in the list only if it is maximal. If s is maximal, send off t: s for every t that is a subset of s.
6. For each set, we receive the set of maximal sets that it is a subset of. (There may be many.)
For each set we receive the set of maximal sets that it is a subset of. (There may be many.)
There are a lot of steps for this, but at its heart it requires repeated comparisons between pairs of sets with a common element for each common element. Potentially that is O(n * n * m) where n is the number of sets and m is the number of distinct elements that are in many sets.
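As a single-machine illustration of that core step (pairing up sets that share an element and comparing each pair), here is a small Python sketch. The function name is mine, and instead of the distributed steps 4-6 it simply records the largest superset seen for each set, which is enough for an in-memory example:

from collections import defaultdict
from itertools import combinations

def largest_supersets_by_pairing(named_sets):
    """Compare only pairs of sets that share at least one element."""
    element_to_names = defaultdict(set)
    for name, s in named_sets.items():
        for e in s:
            element_to_names[e].add(name)

    best = {}                                    # subset name -> largest superset name seen so far
    for names in element_to_names.values():      # all sets sharing this element
        for a, b in combinations(sorted(names), 2):
            sa, sb = named_sets[a], named_sets[b]
            small, big = (a, b) if len(sa) <= len(sb) else (b, a)
            if named_sets[small] <= named_sets[big]:     # subset test (intersection size == |small|)
                if small not in best or len(named_sets[big]) > len(named_sets[best[small]]):
                    best[small] = big
    return best

data = {"A": {1, 2, 3}, "B": {1, 3}, "C": {2, 4},
        "D": {2, 4, 9}, "E": {3, 5}, "F": {1, 2, 3, 7}}
print(largest_supersets_by_pairing(data))   # {'B': 'F', 'A': 'F', 'C': 'D'}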
Here is a simple suggestion for an algorithm that might give better results based on your numbers (n = 10^6 to 10^7 sets with m = 2 to 10^5 members, a lot of super/subsets). Of course it depends a lot on your data. Generally speaking, the complexity is much worse than for the other proposed algorithms. Maybe you could process only the sets with fewer than X members (e.g. 1000) this way, and use the other proposed methods for the rest.
1. Sort the sets by their size.
2. Remove the first (smallest) set and start comparing it against the others from behind (largest set first).
3. Stop as soon as you find a superset and create a relation. Just remove the set if no superset was found.
4. Repeat 2. and 3. for all but the last set (see the sketch below).
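A short Python sketch of these steps; the function name is mine and the data is the example from the question:

def assign_supersets(named_sets):
    """Smallest sets first, each compared against the remaining sets from the largest down."""
    items = sorted(named_sets.items(), key=lambda kv: len(kv[1]))   # step 1
    assignment = {}
    for i, (name, s) in enumerate(items[:-1]):                      # all but the last set
        for other_name, other in reversed(items[i + 1:]):           # step 2: largest first
            if s <= other:                                          # step 3: subset test
                assignment[name] = other_name
                break                                               # first hit is the largest superset
    return assignment

data = {
    "A": {1, 2, 3}, "B": {1, 3}, "C": {2, 4},
    "D": {2, 4, 9}, "E": {3, 5}, "F": {1, 2, 3, 7},
}
print(assign_supersets(data))   # {'B': 'F', 'C': 'D', 'A': 'F'}; D and E stay unassigned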
If you're using Excel, you could structure it as follows:
1) Create a cartesian plot as a two-way table that has all your data sets as titles on both the side and the top
2) In a separate tab, create a row for each data set in the first column, along with a second column that counts the number of entries (e.g. F has 4), and then stack FIND(",") and MID formulas across the sheet to split out all the entries within each data set. Use the counter in the second column to do COUNTIF(">0"). Each variable you find can be your starting point in a subsequent FIND until it runs out of variables and just returns a blank.
3) Go back to your cartesian plot, and bring over the separate entries you just generated for your column titles (e.g. F is 1, 2, 3, 7). Use an AND statement to check that each entry in your left-hand column is in your top-row data set, using an OFFSET to your separate area and utilizing your counter as the width for the OFFSET.
I want to implement a function which returns the cartesian product of a set, repeated a given number of times. For example:
input: {a, b}, 2
output:
aa
ab
bb
ba
input: {a, b}, 3
aaa
aab
aba
abb
baa
bab
bba
bbb
However, the only way I can implement it is to first do the cartesian product for 2 sets ("ab", "ab"), then, from that output, add the same set again. Here is pseudo-code:
function product(A, B):
    result = []
    for i in A:
        for j in B:
            result.append([i,j])
    return result

function product1(chars, count):
    result = product(chars, chars)
    for i in range(2, count):
        result = product(result, chars)
    return result
What I want is to compute the last set directly, without computing all of the sets before it. Is this possible? A solution which gives me a similar result but isn't a cartesian product is also acceptable.
I don't have a problem reading most general-purpose programming languages, so if you need to post code you can do it in any language you feel comfortable with.
Here's a recursive algorithm that builds S^n without building S^(n-1) "first". Imagine an infinite k-ary tree where |S| = k. Label with the elements of S each of the edges connecting any parent to its k children. An element of S^m can be thought of as any path of length m from the root. The set S^m, in that way of thinking, is the set of all such paths. Now the problem of finding S^n is a problem of enumerating all paths of length n - and we can name a path by considering the sequence of edge labels from beginning to end. We want to directly generate S^n without first enumerating all of S^(n-1), so a depth-first search modified to find all nodes at depth n seems appropriate. This is essentially how the below algorithm works:
// collection to hold generated output
members = []

// recursive function to explore product space
Products(set[1...n], length, current[1...m])
    // if the product we're working on is of the
    // desired length then record it and return
    if m = length then
        members.append(current)
        return

    // otherwise we add each possible value to the end
    // and generate all products of the desired length
    // with the new vector as a prefix
    for i = 1 to n do
        current.addLast(set[i])
        Products(set, length, current)
        current.removeLast()

// reset the result collection and request the set be generated
members = []
Products([a, b], 3, [])
Now, a breadth-first approach is no less efficient than a depth-first one, and if you think about it, it would be no different from exactly what you're already doing. Indeed, any approach that generates S^n must necessarily generate S^(n-1) at least once, since every element of S^(n-1) appears as a prefix of some solution in S^n.
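For completeness, here is the same depth-first idea as runnable Python (the function name is mine); the standard library's itertools.product does the same job directly:

from itertools import product

def repeated_product(chars, count):
    """Yield every length-`count` tuple over `chars`, depth-first, without materialising S^(count-1)."""
    def recurse(current):
        if len(current) == count:
            yield tuple(current)
            return
        for c in chars:
            current.append(c)
            yield from recurse(current)
            current.pop()
    yield from recurse([])

print(["".join(p) for p in repeated_product("ab", 3)])
# ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba', 'bbb']
# Equivalent one-liner using the standard library:
print(["".join(p) for p in product("ab", repeat=3)])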
How would you print all paths in a tree? Here the condition is that we don't only want paths starting from the root, or paths within a subtree.
For example:
        2
       / \
      8   10
     / \  /
    5   6 11
So the program should return:
2-8
2-10
2-8-5
2-8-6
8-5
8-6
2-10-11
10-11
5-8-2-10-11
5-8-2-10
and so on...
One approach is to find the LCA between every distinct pair of nodes and then print the path from the LCA to both nodes (reversed on the first node's side and in order on the second node's side). But the complexity here would be O(n^3). Is there a more efficient solution?
If you are only interested in the result, not in the algorithm, create the nodes and relations in neo4j with
merge (n2:node{n:2})-[:down]->(n8:node{n:8})-[:down]->(:node{n:5})
merge (n2)-[:down]->(:node{n:10})-[:down]->(:node{n:11})
merge (n8)-[:down]->(:node{n:6})
then query
match p=(a)-[r:down *]-(b) return nodes(p)
Assuming your tree has distinct nodes, you can:
Create a map with an int as key and a vector as value. The key stands for each node you encounter, and the vector stores all the nodes you will traverse under that node.
Pass this map by value to each node. You can have a function like:
void printAllPaths(node *proot, map<int, vector<int> > m)
Whenever you encounter a new node n, do the following:
a) For each key k in the map,
b) add n to the value vector of k,
c) print each key followed by its value vector,
d) and also insert a new key n into the map with an empty vector as value.
Note: If your tree has duplicate nodes, a multimap will help you keep track. The C++ STL will serve you well in this case. (A sketch of these steps follows.)
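Here is a brief Python sketch of that procedure (class and function names are mine); note that it enumerates ancestor-to-descendant paths of the example tree:

class TreeNode:
    def __init__(self, val, children=None):
        self.val = val
        self.children = children or []

def print_downward_paths(root):
    """DFS that carries every partial path ending at the parent, extends it with the current node, and prints it."""
    def dfs(node, paths_to_here):
        extended = [p + [node.val] for p in paths_to_here]
        for p in extended:
            print("-".join(map(str, p)))
        extended.append([node.val])      # the current node also starts its own paths
        for child in node.children:
            dfs(child, extended)
    dfs(root, [])

tree = TreeNode(2, [TreeNode(8, [TreeNode(5), TreeNode(6)]),
                    TreeNode(10, [TreeNode(11)])])
print_downward_paths(tree)
# 2-8, 2-8-5, 8-5, 2-8-6, 8-6, 2-10, 2-10-11, 10-11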