Family Tree Algorithm

I'm working on putting together a problem set for an intro-level CS course and came up with a question that, on the surface, seems very simple:
You are given a list of people with the names of their parents, their birth dates, and their death dates. You are interested in finding out who, at some point in their lifetime, was a parent, a grandparent, a great-grandparent, etc. Devise an algorithm to label each person with this information as an integer (0 means the person never had a child, 1 means that the person was a parent, 2 means that the person was a grandparent, etc.)
For simplicity, you can assume that the family graph is a DAG whose undirected version is a tree.
The interesting challenge here is that you can't just look at the shape of the tree to determine this information. For example, I have 8 great-great-grandparents, but since none of them were alive when I was born, in their lifetimes none of them were great-great-grandparents.
The best algorithm I can come up with for this problem runs in time O(n^2), where n is the number of people. The idea is simple - start a DFS from each person, finding the furthest descendant down in the family tree that was born before that person's death date. However, I'm pretty sure that this is not the optimal solution to the problem. For example, if the graph is just two parents and their n children, then the problem can be solved trivially in O(n). What I'm hoping for is some algorithm that either beats O(n^2) or whose runtime is parameterized over the shape of the graph, making it fast for wide graphs with a graceful degradation to O(n^2) in the worst case.
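For reference, a minimal Python sketch of that O(n^2) baseline (the dict fields 'birth', 'death', 'children' and 'name' are assumed names, not part of the problem statement):

def label(person):
    # Depth of the deepest descendant born before this person died.
    def deepest(node):
        best = 0
        for child in node['children']:
            # Prune: any descendant of a too-late child is born later still.
            if child['birth'] < person['death']:
                best = max(best, 1 + deepest(child))
        return best
    return deepest(person)

# One O(n) DFS per person gives the O(n^2) total:
# labels = {p['name']: label(p) for p in people}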

Update: This is not the best solution I have come up with, but I've left it because there are so many comments relating to it.
You have a set of events (birth/death), parental state (no descendants, parent, grandparent, etc) and life state (alive, dead).
I would store my data in structures with the following fields:
mother
father
generations
is_alive
may_have_living_ancestor
Sort your events by date, and then for each event take one of the following two courses of logic:
Birth:
    Create new person with a mother, father, 0 generations, who is alive and
    may have a living ancestor.
    For each parent:
        If generations increased, then recursively increase generations for
        all living ancestors whose generations increased. While doing that,
        set the may_have_living_ancestor flag to false for anyone for whom it
        is discovered that they have no living ancestors. (You only iterate
        into a person's ancestors if you increased their generations, and if
        they still could have living ancestors.)
Death:
    Emit the person's name and generations.
    Set their is_alive flag to false.
The worst case is O(n^2) if everyone has a lot of living ancestors. However, in general you've got the sorting preprocessing step, which is O(n log n), and then you're O(n * avg number of living ancestors), which means that the total time tends to be O(n log n) in most populations. (I hadn't counted the sorting pre-step properly; thanks to @Alexey Kukanov for the correction.)
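A Python sketch of the sweep, with the may_have_living_ancestor pruning omitted for brevity (dict field names are assumptions; updating a dead ancestor's counter is harmless because their label was already emitted at death):

def on_birth(name, mother, father):
    child = {'name': name, 'mother': mother, 'father': father,
             'generations': 0, 'is_alive': True}
    for parent in (mother, father):
        bump(parent, 1)
    return child

def bump(person, gen):
    # Recursively raise 'generations', but only where it actually increased.
    if person is None or gen <= person['generations']:
        return
    person['generations'] = gen
    bump(person['mother'], gen + 1)
    bump(person['father'], gen + 1)

def on_death(person):
    print(person['name'], person['generations'])   # the label is final at death
    person['is_alive'] = False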

I thought of this this morning, then found that @Alexey Kukanov had similar thoughts. But mine is more fleshed out and has some more optimization, so I'll post it anyways.
This algorithm is O(n * (1 + generations)), and will work for any dataset. For realistic data this is O(n).
Run through all records and generate objects representing people, which include date of birth, links to parents, and links to children, plus several more uninitialized fields (the time of last death among self and ancestors, and an array of the dates at which they had 0, 1, 2, ... surviving generations).
Go through all people and recursively find and store the time of last death. If you visit a person again, return the memoized value. Each person is calculated at most once, and the first calculation generates at most 2 more calls (one per parent), giving a total of O(n) work to initialize this data.
Go through all people and recursively generate a record of when they first added a generation. These records only need to go up to the latest of when the person or their last ancestor died. It is O(1) to calculate when you had 0 generations. Then for each recursive call to a child you need to do O(generations) work to merge that child's data into yours. Each person is visited when encountered in the data structure, and can be called once from each parent, for O(n) calls and total expense O(n * (generations + 1)).
Go through all people and figure out how many generations were alive at their death. This is again O(n * (generations + 1)) if implemented with a linear scan.
The sum total of all of these operations is O(n * (generations + 1)).
For realistic data sets, this will be O(n) with a fairly small constant.
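A condensed Python sketch of steps 2-4 (field names are assumptions; the truncation of the generation records at the last-death date, which is what guarantees the O(n * (1 + generations)) bound, is left out to keep the merge readable, so this version is O(n * depth) on contrived data):

def last_death(p, memo):
    # Step 2: latest death among this person and all of their ancestors.
    if p['name'] not in memo:
        memo[p['name']] = max([p['death']] +
                              [last_death(q, memo) for q in p['parents']])
    return memo[p['name']]

def first_gen_dates(p, memo):
    # Step 3: dates[g] = earliest birth of a generation-(g+1) descendant.
    if p['name'] not in memo:
        dates = []
        for c in p['children']:
            child = [c['birth']] + first_gen_dates(c, memo)
            merged = [min(a, b) for a, b in zip(dates, child)]
            merged += dates[len(child):] + child[len(dates):]
            dates = merged
        memo[p['name']] = dates
    return memo[p['name']]

def label(p, memo):
    # Step 4: how many generations had a member born before p's death.
    dates = first_gen_dates(p, memo)
    g = 0
    while g < len(dates) and dates[g] < p['death']:
        g += 1
    return g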

My suggestion:
in addition to the values described in the problem statement, each personal record will have two fields: a child counter and a dynamically growing vector (in the C++/STL sense) which will keep the earliest birthday in each generation of a person's descendants.
use a hash table to store the data, with the person name being the key. The time to build it is linear (assuming a good hash function, the map has amortized constant time for inserts and finds).
for each person, detect and save the number of children. It's also done in linear time: for each personal record, find the record for its parents and increment their counters. This step can be combined with the previous one: if a record for a parent is not found, it is created and added, while details (dates etc) will be added when found in the input.
traverse the map, and put references to all personal records with no children into a queue. Still O(N).
for each element taken out of the queue:
add the birthday of this person into descendant_birthday[0] for both parents (grow that vector if necessary). If this field is already set, change it only if the new date is earlier.
For all descendant_birthday[i] dates available in the vector of the current record, follow the same rule as above to update descendant_birthday[i+1] in parents' records.
decrement parents' child counters; if it reaches 0, add the corresponding parent's record into the queue.
the cost of this step is O(C*N), with C being the biggest value of "family depth" for the given input (i.e. the size of the longest descendant_birthday vector). For realistic data it can be capped by some reasonable constant without correctness loss (as others already pointed out), and so does not depend on N.
traverse the map one more time, and "label each person" with the biggest i for which descendant_birthday[i] is still earlier than the death date; also O(C*N).
Thus for realistic data the solution for the problem can be found in linear time. Though for contrived data like that suggested in @btilly's comment, C can be big, and even of the order of N in degenerate cases. It can be resolved either by putting a cap on the vector size or by extending the algorithm with step 2 of @btilly's solution.
A hash table is a key part of the solution if parent-child relations in the input data are provided through names (as written in the problem statement). Without hashes, building the relation graph would require O(N log N). Most other suggested solutions seem to assume that the relationship graph already exists.
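A Python sketch of the above, assuming the hash table, parent links and child counters from the earlier steps are already built, and that each record's 'descendant_birthday' starts as an empty list (all names are illustrative):

from collections import deque

def label_everyone(people):
    # Seed the queue with everyone who has no children.
    queue = deque(n for n, p in people.items() if p['child_count'] == 0)
    while queue:
        p = people[queue.popleft()]
        for parent_name in p['parents']:
            parent = people[parent_name]
            # p's own birthday goes into slot 0; p's slot i feeds slot i+1.
            updates = [p['birth']] + p['descendant_birthday']
            db = parent['descendant_birthday']
            for i, d in enumerate(updates):
                if i == len(db):
                    db.append(d)              # grow the vector
                else:
                    db[i] = min(db[i], d)     # keep only the earliest date
            parent['child_count'] -= 1
            if parent['child_count'] == 0:
                queue.append(parent_name)
    # Final pass: the label is the number of slots dated before the death.
    return {n: sum(1 for d in p['descendant_birthday'] if d < p['death'])
            for n, p in people.items()}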

Create a list of people, sorted by birth_date. Create another list of people, sorted by death_date. You can travel logically through time, popping people from these lists, in order to get a list of the events as they happened.
For each Person, define an is_alive field. This'll be FALSE for everyone at first. As people are born and die, update this record accordingly.
Define another field for each person, called has_a_living_ancestor, initialized to FALSE for everyone at first. At birth, x.has_a_living_ancestor will be set to x.mother.is_alive || x.mother.has_a_living_ancestor || x.father.is_alive || x.father.has_a_living_ancestor. So, for most people (but not everyone), this will be set to TRUE at birth.
The challenge is to identify occasions when has_a_living_ancestor can be set to FALSE. Each time a person is born, we do a DFS up through the ancestors, but only those ancestors for which ancestor.has_a_living_ancestor || ancestor.is_alive is true.
During that DFS, if we find an ancestor that has no living ancestors, and is now dead, then we can set has_a_living_ancestor to FALSE. This does mean, I think, that sometimes has_a_living_ancestor will be out of date, but it will hopefully be caught quickly.
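One possible way to render that upward DFS in Python (a sketch with assumed field names; whether it keeps the flags fresh enough is exactly the open question in the last sentence above):

def refresh_flags(newborn):
    # Walk up only through ancestors that might still matter.
    def visit(a):
        if a is None or not (a['is_alive'] or a['has_a_living_ancestor']):
            return
        visit(a['mother'])
        visit(a['father'])
        # After the parents are settled, recompute this ancestor's flag.
        a['has_a_living_ancestor'] = any(
            p is not None and (p['is_alive'] or p['has_a_living_ancestor'])
            for p in (a['mother'], a['father']))
    visit(newborn['mother'])
    visit(newborn['father'])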

The following is an O(n log n) algorithm that works for graphs in which each child has at most one parent (EDIT: this algorithm does not extend to the two-parent case with O(n log n) performance). It is worth noting that I believe the performance can be improved to O(n log(max level label)) with extra work.
One parent case:
For each node x, in reverse topological order, create a binary search tree T_x that is strictly increasing both in date of birth and in number of generations removed from x. (T_x contains the first-born child c1 in the subgraph of the ancestry graph rooted at x, along with the next earliest born child c2 in this subgraph such that c2's 'great grandparent level' is strictly greater than that of c1, along with the next earliest born child c3 in this subgraph such that c3's level is strictly greater than that of c2, etc.) To create T_x, we merge the previously-constructed trees T_w, where w is a child of x (they are previously constructed because we are iterating in reverse topological order).
If we are careful with how we perform the merges, we can show that the total cost of such merges is O(n log n) for the entire ancestry graph. The key idea is to note that after each merge, at most one node of each level survives in the merged tree. We associate with each tree T_w a potential of h(w) log n, where h(w) is equal to the length of the longest path from w to a leaf.
When we merge the child trees T_w to create T_x, we 'destroy' all of the trees T_w, releasing all of the potential that they store for use in building the tree T_x; and we create a new tree T_x with (log n)(h(x)) potential. Thus, our goal is to spend at most O((log n)(sum_w(h(w)) - h(x) + constant)) time to create T_x from the trees T_w so that the amortized cost of the merge will be only O(log n). This can be achieved by choosing the tree T_w such that h(w) is maximal as a starting point for T_x and then modifying T_w to create T_x. After such a choice is made for T_x, we merge each of the other trees, one by one, into T_x with an algorithm that is similar to the standard algorithm for merging two binary search trees.
Essentially, the merging is accomplished by iterating over each node y in T_w, searching for y's predecessor z by birth date, and then inserting y into T_x if it is more levels removed from x than z; then, if y was inserted into T_x, we search for the node in T_x of the lowest level that is strictly greater than y's level, and splice out the intervening nodes to maintain the invariant that T_x is ordered strictly both by birth date and level. This costs O(log n) for each node in T_w, and there are at most O(h(w)) nodes in T_w, so the total cost of merging all trees is O((log n) * sum_w h(w)), summing over all children w except for the child w' such that h(w') is maximal.
We store the level associated with each element of T_x in an auxiliary field of each node in the tree. We need this value so that we can figure out the actual level of x once we've constructed T_x. (As a technical detail, we actually store the difference of each node's level with that of its parent in T_x so that we can quickly increment the values for all nodes in the tree. This is a standard BST trick.)
That's it. We simply note that the initial potential is 0 and the final potential is positive so the sum of the amortized bounds is an upper bound on the total cost of all merges across the entire tree. We find the label of each node x once we create the BST T_x by binary searching for the latest element in T_x that was born before x died at cost O(log n).
To improve the bound to O(n log(max level label)), you can lazily merge the trees, only merging the first few elements of the tree as necessary to provide the solution for the current node. If you use a BST that exploits locality of reference, such as a splay tree, then you can achieve the above bound.
Hopefully, the above algorithm and analysis are at least clear enough to follow. Just comment if you need any clarification.

I have a hunch that obtaining for each person a mapping (generation -> date the first descendant in that generation is born) would help.
Since the dates must be strictly increasing, we would be able to use binary search (or a neat data structure) to find the most distant living descendant in O(log n) time.
The problem is that merging these lists (at least naively) is O(number of generations) so this could get to be O(n^2) in the worst case (consider A and B are parents of C and D, who are parents of E and F...).
I still have to work out how the best case works, and try to identify the worst cases better (and see if there is a workaround for them).
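In Python terms, with the mapping stored as a list first_birth, where first_birth[g] is the date the first generation-(g+1) descendant was born (strictly increasing, as noted above), the lookup and the naive merge look like this (a sketch; names are illustrative):

from bisect import bisect_left

def generations_before(first_birth, death_date):
    # Strictly increasing dates, so binary search counts the
    # generations that already had a member before this death.
    return bisect_left(first_birth, death_date)

def merge(a, b):
    # The naive merge: O(number of generations).
    return [min(x, y) for x, y in zip(a, b)] + a[len(b):] + b[len(a):]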

We recently implemented a relationship module in one of our projects, in which we had everything in a database, and I think the algorithm was roughly 2n·O(m) (where m is the maximum branching factor). I multiplied the operations by two over n because in the first round we create the relationship graph and in the second round we visit every Person. We have stored a bidirectional relationship between every two nodes. While navigating, we only use one direction to travel, but we have two sets of operations: one traverses only children, the other traverses only parents.
Person {
    String Name;

    // all relations where this is FromPerson
    Relation[] FromRelations;

    // all relations where this is ToPerson
    Relation[] ToRelations;

    DateTime birthDate;
    DateTime? deathDate;
}

Relation {
    Person FromPerson;
    Person ToPerson;
    RelationType Type;
}

enum RelationType {
    Father,
    Son,
    Daughter,
    Mother
}
This kind of looks like a bidirectional graph. But in this case, first you build the list of all Persons, then you can build the list of relations and set up FromRelations and ToRelations between each node. Then all you have to do is, for every Person, navigate the ToRelations of type (Son, Daughter) only. And since you have the dates, you can calculate everything.
I don't have time to check the correctness of the code, but this will give you an idea of how to do it.
void LabelPerson(Person p){
    int n = GetLevelOfChildren(p, p.birthDate, p.deathDate);
    // label based on n...
}

int GetLevelOfChildren(Person p, DateTime bd, DateTime? ed){
    List<int> depths = new List<int>();
    foreach(Relation r in p.ToRelations.Where(
        x => x.Type == RelationType.Son || x.Type == RelationType.Daughter))
    {
        Person child = r.ToPerson;
        // only count children born before p's death (ed == null: p is alive)
        if(ed == null || child.birthDate <= ed.Value){
            depths.Add(1 + GetLevelOfChildren(child, bd, ed));
        }
    }
    if(depths.Count == 0)
        return 0;
    return depths.Max();
}

Here's my stab:
class Person
{
    Person[] Parents;
    string Name;
    DateTime DOB;
    DateTime DOD;
    int Generations = 0;

    void Increase(DateTime dob, int generations)
    {
        // current person was alive when the caller was born
        if (dob < DOD)
            Generations = Math.Max(Generations, generations);
        // recurse regardless: a grandparent may outlive a parent
        foreach (Person p in Parents)
            p.Increase(dob, generations + 1);
    }

    void Calculate()
    {
        foreach (Person p in Parents)
            p.Increase(DOB, 1);
    }
}

// run for everyone
Person[] people = InitializeList(); // create objects from information
foreach (Person p in people)
    p.Calculate();

There's a relatively straightforward O(n log n) algorithm that sweeps the events chronologically with the help of a suitable top tree.
You really shouldn't assign homework that you can't solve yourself.

Related

Why is the time complexity O(lgN) for an operation in weighted union find algorithm?

So everywhere I see the weighted union-find algorithm, they use this approach:
Maintain an array pointing to the parent node of a particular node
Maintain an array denoting the size of the tree a node is in
For union (p,q) merge the smaller tree with larger
Time complexity here is O(lg N)
Now an optimization on this is to flatten the trees, i.e., whenever I am calculating the root of a particular node, set all nodes on that path to point to that root.
Time complexity of this is O(lg* N)
This I can understand, but what I don't get is why they don't start off with an array/hashset wherein nodes point to the root (instead of the immediate parent node). That would bring the time complexity down to O(1).
I am going to assume that the time complexity you are asking for is the time to check whether 2 nodes belong to the same set.
The key is in how sets are joined: specifically, you take the root of one set (the smaller one) and have it point to the root of the other set. Let the two sets have p and q as roots respectively; |p| represents the size of the set rooted at p if p is a root, while in general it is the number of items whose path to the root goes through p (which is 1 + all its children).
We can without loss of generality assume that |p| <= |q| (otherwise we just exchange their names). We then have that |p ∪ q| = |p| + |q| >= 2|p|. This shows us that each subtree in the data structure can be at most half as big as its parent, so given N items the structure can have depth at most 1 + lg N = O(lg N).
If the two chosen items are as far as possible from their roots, it takes O(lg N) operations to find the root of each of their sets, since you only need O(1) operations to move up one layer, and then O(1) operations to compare the two roots.
This cost also applies to each union operation itself, since you need to figure out which two roots to merge. The reason we don't just have all nodes point directly to the root is several-fold. First, we would need to change all the nodes in a set every time we perform a union. Second, we only have edges pointing from the nodes toward the root and not the other way, so we would have to look through all nodes to find the ones we would need to change. Third, we have good optimizations (such as path compression) that help in this kind of step and still work. Finally, you could do such a flattening step at the end if you really need to, but it would cost O(N lg N) time to perform, which is comparable to how long it would take to run the entire algorithm itself without the shortcut optimization.
You are correct that the solution you suggest will bring down the time complexity of a Find operation to O(1). However, doing so will make a Union operation slower.
Imagine that you use an array/hashtable to remember the representative (or the root, as you call it) of each node. When you perform a union operation between two nodes x and y, you would either need to update all the nodes with the same representative as x to have y's representative, or vice versa. This way, union runs in O(min{|Sx|, |Sy|}), where Sx is the set of nodes with the same representative as x. These sets can be significantly larger than log n.
The weighted union algorithm, on the other hand, takes O(log n) for both Find and Union.
So it's a trade-off. If you expect to do many Find operations, but few Union operations, you should use the solution you suggest. If you expect to do many of each, you can use the weighted union algorithm to avoid excessively slow operations.
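To make the trade-off concrete, here is a Python sketch of both structures (weighted quick-union as described above, and the "point every node at its root" variant the question proposes); this is an illustration, not a tuned implementation:

class WeightedUnionFind:
    # Union by size: Find and Union are both O(log n); the commented line
    # is where path compression would go to approach O(lg* n).
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            # self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, p, q):
        rp, rq = self.find(p), self.find(q)
        if rp == rq:
            return
        if self.size[rp] > self.size[rq]:
            rp, rq = rq, rp
        self.parent[rp] = rq            # smaller tree under larger root
        self.size[rq] += self.size[rp]

class QuickFind:
    # The proposed variant: O(1) Find, but Union must rewrite a whole set.
    # As written the scan is O(n); with per-set member lists it would be
    # O(min(|Sx|, |Sy|)), as discussed above.
    def __init__(self, n):
        self.root = list(range(n))

    def find(self, x):
        return self.root[x]

    def union(self, p, q):
        rp, rq = self.root[p], self.root[q]
        if rp != rq:
            for i, r in enumerate(self.root):
                if r == rp:
                    self.root[i] = rq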

Why is the number of sub-trees gained from a range tree query is O(log(n))?

I'm trying to figure out this data structure, but I don't understand how we can tell there are O(log(n)) subtrees that represent the answer to a query.
Thanks!
If we make the assumption that the above is a purely functional binary tree [wiki], so where the nodes are immutable, then we can make a "copy" of this tree such that only elements with a value larger than x1 and lower than x2 are in the tree.
Let us start with a very simple case to illustrate the point. Imagine that we simply do not have any bounds; then we can simply return the entire tree. So instead of constructing a new tree, we return a reference to the root of the tree. We can thus, without any bounds, return a tree in O(1), given that the tree is not edited (at least not as long as we use the subtree).
The above case is of course quite simple. We simply make a "copy" (not really a copy since the data is immutable, we can just return the tree) of the entire tree. So let us aim to solve a more complex problem: we want to construct a tree that contains all elements larger than a threshold x1. Basically we can define a recursive algorithm for that:
the cut version of None (or whatever represents a null reference, or a reference to an empty tree) is None;
if the node's value is smaller than (or equal to) the threshold, we return the cut version of the right subtree; and
if the node's value is greater than the threshold, we return a node that has the same right subtree, and as left child the cut version of the left child.
So in pseudo-code it looks like:
def treelarger(some_node, min):
    if some_node is None:
        return None
    if some_node.value > min:
        return Node(treelarger(some_node.left, min),
                    some_node.value, some_node.right)
    else:
        return treelarger(some_node.right, min)
This algorithm thus runs in O(h), with h the height of the tree, since for each case (except the first one) we recurse into one (not both) of the children, and it ends when we reach a node without children (or at least without a subtree in the direction we need to cut).
We thus do not make a complete copy of the tree. We reuse a lot of nodes from the old tree. We only construct a new "surface"; most of the "volume" is part of the old binary tree. Although the tree itself contains O(n) nodes, we construct at most O(h) new ones. We can optimize the above such that, if the cut version of a subtree is unchanged, we do not create a new node. But that does not matter much in terms of time complexity: we generate at most O(h) new nodes, and the total number of nodes either decreases or stays the same.
In case of a complete tree, the height of the tree h scales with O(log n), and thus this algorithm will run in O(log n).
Then how can we generate a tree with elements between two thresholds? We can easily rewrite the above into an algorithm treesmaller that generates a subtree that contains all elements that are smaller:
def treesmaller(some_node, max):
    if some_node is None:
        return None
    if some_node.value < max:
        return Node(some_node.left, some_node.value,
                    treesmaller(some_node.right, max))
    else:
        return treesmaller(some_node.left, max)
so roughly speaking there are two differences:
we change the condition from some_node.value > min to some_node.value < max; and
we recurse on the right subchild in case the condition holds, and on the left if it does not hold.
Now the conclusions we draw from the previous algorithm are also conclusions that can be applied to this algorithm, since again it only introduces O(h) new nodes, and the total number of nodes can only decrease.
Although we can construct an algorithm that takes the two thresholds concurrently into account, we can simply reuse the above algorithms to construct a subtree containing only elements within range: we first pass the tree to the treelarger function, and then that result through a treesmaller (or vice versa).
Since in both algorithms, we introduce O(h) new nodes, and the height of the tree can not increase, we thus construct at most O(2 h) and thus O(h) new nodes.
Given the original tree was a complete tree, then it thus holds that we create O(log n) new nodes.
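Composing the two functions gives the range query; a small sketch assuming an immutable Node record matching the pseudo-code above:

from collections import namedtuple

Node = namedtuple('Node', ['left', 'value', 'right'])

def treerange(some_node, x1, x2):
    # First cut away everything <= x1, then everything >= x2.
    return treesmaller(treelarger(some_node, x1), x2)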
Consider the search for the two endpoints of the range. This search will continue until finding the lowest common ancestor of the two leaf nodes that span your interval. At that point, the search branches with one part zigging left and one part zagging right. For now, let's just focus on the part of the query that branches to the left, since the logic is the same but reversed for the right branch.
In this search, it helps to think of each node as not representing a single point, but rather a range of points. The general procedure, then, is the following:
If the query range fully subsumes the range represented by this node, stop searching in x and begin searching the y-subtree of this node.
If the query range is purely in range represented by the right subtree of this node, continue the x search to the right and don't investigate the y-subtree.
If the query range overlaps the left subtree's range, then it must fully subsume the right subtree's range. So process the right subtree's y-subtree, then recursively explore the x-subtree to the left.
In all cases, we add at most one y-subtree in for consideration and then recursively continue exploring the x-subtree in only one direction. This means that we essentially trace out a path down the x-tree, adding in at most one y-subtree per step. Since the tree has height O(log n), the overall number of y-subtrees visited this way is O(log n). And then, including the number of y-subtrees visited in the case where we branched right at the top, we get another O(log n) subtrees for a total of O(log n) total subtrees to search.
Hope this helps!

How to use balanced binary trees to solve this challenge?

            1(70)
           /     \
          /       \
      2(40)       5(10)
      /    \          \
     /      \          \
  3(60)    4(80)      6(20)
   /  \
  /    \
7(30)  8(50)
This is for an online challenge (not a live contest). I don't need someone to solve it for me, just a push in the right direction. Trying to learn.
Each node has a unique ID, and no two people have the same salary. Person #1 has a $70 salary, person #7 has a $30 salary, for example. The tree structure denotes who supervises whom. The question is: who has the kth lowest salary among a person's subordinates?
For example, I choose person #2. Who is 2nd lowest among the subordinates? #2's subordinates are 3, 4, 7, 8. The 2nd lowest salary is $50, belonging to person #8.
There are many queries, so the structure must be efficient.
I thought about this problem and researched data structures. A binary tree seems like a good idea, but I need help.
For example, I think the ideal structure would look like this for person #2:
      2(40)
      /    \
     /      \
  7(30)    3(60)
           /    \
          /      \
       8(50)    4(80)
Every node below #2 is a subordinate of #2, and every left branch holds lower salaries than the right. If I store how many nodes are in each subtree, I can get the kth lowest.
For example: from #2, the left branch has 1 node and the right branch has 3 nodes. So for the 2nd lowest, 2 - 1 means I now want the 1st lowest in the right branch.
Move to #3; the 1st lowest points to #8 with $50, which is correct.
My question:
Is this approach as I describe it a good one? Is it a valid approach?
I am having trouble figuring out how to construct this kind of tree. I think I can build them recursively, but it's hard to figure out how to combine all children into a new tree sorted by salary. I need some light help.
Here's a solution that uses O(n log^2 n + q log n) time and O(n log^2 n) space (not the best on the latter count, but probably good enough given the limits).
Implement a purely functional sorted list (as an augmented binary search tree) with the following operations and some way to iterate.
EmptyList() -> returns the empty list
Insert(list, key) -> returns the list where |key| has been inserted into |list|
Length(list) -> returns the length of the list
Get(list, k) -> returns the element at index |k| in |list|
On top of these operations, implement an operation
Merge(list1, list2) -> returns the union of |list1| and |list2|
by inserting the elements of the shorter list into the longer.
Now do the obvious thing: traverse the employee hierarchy from leaves to root, setting the ordered list for each employee to the appropriate merge of her subordinate lists, and answer the queries.
Analysis (sketch)
Each query takes O(log n) time. The interesting part of the analysis pertains to the preprocessing.
The cost of preprocessing is dominated by the cost of calling Insert(), specifically from Merge(), since there are n other insertions. Each insertion takes O(log n) time and costs O(log n) space (measuring in words).
What keeps the preprocessing from being quadratic is an implicit heavy path decomposition. Every time we merge two lists, neither list is merged subsequently. Since the shorter list is inserted into the longer, every time a key is inserted into a list, that list is at least twice as long as the list into which that key was previously inserted. It follows that each key is the subject of at most lg n insertions, which suffices to establish a bound of O(n log n) insertions overall and thus the claimed resource bounds.
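A Python sketch of the preprocessing, with plain sorted lists standing in for the purely functional BST. Note the copy before reuse: plain lists are not persistent, so the child's list must be duplicated to stay queryable for that child, which is precisely the cost the persistent structure avoids. Names ('children', 'salary') are illustrative.

from bisect import insort

def build(root, children, salary):
    lists = {}          # employee -> sorted salaries of all subordinates
    def visit(u):
        merged = []
        for v in children.get(u, []):
            visit(v)
            child = list(lists[v])            # copy (see note above)
            insort(child, salary[v])          # v itself is u's subordinate
            if len(child) > len(merged):
                merged, child = child, merged # insert shorter into longer
            for x in child:
                insort(merged, x)
        lists[u] = merged
    visit(root)
    return lists

def query(lists, u, k):
    return lists[u][k - 1]   # kth (1-based) lowest among u's subordinates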
Here is one possible solution. For each node, we will construct an array of all children's values of that node and keep this in sorted order. The result we are looking for is a dictionary of the form
{ 1 : [10, 20, 30, 40, 50, 60, 80],
  2 : [30, 50, 60, 80],
  ...
}
Once we have this, to query any node for the ith lowest salary, just take the ith element of the array. The total time to do all the queries is O(q), where q is the number of queries.
How do we construct this? Assuming you have a pointer to the root node, you can recursively construct the sorted salaries for each child and store those values in the result. Make a copy of each child's array, and insert that child's own salary into the copied array, using binary search to find the position, since each array is sorted. Now you have k sorted arrays; merge them to get a sorted array. Merging two arrays can be done in linear time: simply loop, each time picking the smaller of the two arrays' first elements.
For the case where each node has 2 children, merging the two children's arrays is O(n) time. Inserting the salary of each node into its corresponding array is O(log n) per node, since we use binary search. Copying the children's arrays is O(n), and there are n nodes, so we have O(n^2) total time for pre-processing.
Total run time is O(n^2 + q).
What if we cannot assume each node has at most 2 children? Then to merge the arrays, use a heap-based k-way merge. This runs in O(n log k), where k is the number of arrays to merge, since we pop from the heap once per element and a heap operation takes O(log k) with k arrays. k <= n, so we can simplify this to O(n log n). The total running time is unchanged.
The space complexity of this solution is O(n^2).
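A Python sketch of this construction, using heapq.merge for the k-way merge (identifier names are assumptions):

import bisect
import heapq

def build_salary_arrays(root, children, salary):
    result = {}
    def visit(u):
        arrays = []
        for v in children.get(u, []):
            visit(v)
            with_child = list(result[v])            # copy the child's array
            bisect.insort(with_child, salary[v])    # binary-search insert of v itself
            arrays.append(with_child)
        result[u] = list(heapq.merge(*arrays))      # k-way merge of sorted arrays
    visit(root)
    return result

# result[node][i - 1] then answers "ith lowest salary among node's subordinates".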
The question has two parts: first, finding the specified person, and then finding the kth subordinate.
Since the tree is not ordered by id, finding the specified person by id requires walking the whole tree until the specified id is found. To speed this up, we can build a hash map that allows us to find a person's node by id in O(1) time, at the cost of O(n) space and set-up time.
Then, to find the subordinate with the kth lowest salary, we need to search the subtree. Since it's not ordered by salary, we have to scan the whole subtree to find the kth lowest salary. This can be done using an array or a heap (putting the subtree nodes into the array or heap). This second part takes O(m log k) time using a heap that keeps the lowest k items, where m is the number of subordinates, and requires O(k) space. This should be acceptable if m (the number of subordinates of the specified person) and k are small.
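A sketch of that second part in Python, keeping the k lowest salaries seen so far in a size-k max-heap (heapq is a min-heap, hence the negation; attribute names are assumptions):

import heapq

def kth_lowest(person, k):
    heap = []                                  # stores -salary; size capped at k
    stack = list(person.subordinates)
    while stack:
        p = stack.pop()
        stack.extend(p.subordinates)
        if len(heap) < k:
            heapq.heappush(heap, -p.salary)
        elif -heap[0] > p.salary:              # smaller than current kth lowest
            heapq.heapreplace(heap, -p.salary)
    return -heap[0]                            # the kth lowest salary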

Special Augmented Red-Black Tree

I'm looking for some help on a specific augmented Red-Black Binary Tree. My goal is to make every single operation run in O(log(n)) in the worst case. The nodes of the tree will have an integer as their key. This integer cannot be negative, and the tree should be sorted by a simple compare function on this integer. Additionally, each node will also store another value: its power. (Note that this has nothing to do with mathematical exponents.) Power is a floating point value. Both power and key are always non-negative. The tree must be able to provide these operations in O(log(n)) runtime:
insert(key, power): Insert into the tree. The node in the tree should also store the power, and any other variables needed to augment the tree in such a way that all other operations are also O(log(n)). You can assume that there is no node in the tree which already has the same key.
get(key): Return the power of the node identified by the key.
delete(key): Delete the node with the given key (assume that the key does exist in the tree prior to the delete).
update(key,power): Update the power at the node given by key.
Here is where it gets interesting:
highestPower(key1, key2): Return the maximum power of all nodes with key k in the range key1 <= k <= key2. That is, all keys from key1 to key2, inclusive on both ends.
powerSum(key1, key2): Return the sum of the powers of all nodes with key k in the range key1 <= k <= key2. That is, all keys from key1 to key2, inclusive on both ends.
The main thing I would like to know is what extra variables I should store at each node. Then I need to work out how to use each of these in the above functions so that the tree stays balanced and all operations can run in O(log(n)). My original thought was to store the following:
highestPowerLeft: The highest power of all child nodes to the left of this node.
highestPowerRight: The highest power of all child nodes to the right of this node.
powerSumLeft: The sum of the powers of all child nodes to the left of this node.
powerSumRight: The sum of the powers of all child nodes to the right of this node.
Would just this extra information work? If so, I'm not sure how to deal with it in the functions that are required. Frankly, my knowledge of Red-Black Trees isn't great because I feel like every explanation of them gets convoluted really fast, and all the rotations and things confuse the hell out of me. Thanks to anyone willing to attempt to help here; I know what I'm asking is far from simple.
A very interesting problem! For the sum, your proposed method should work (it should be enough to only store the sum of the powers to the left of the current node, though; this technique is called prefix sum). For the max, it doesn't work, since if both max values are equal, that value is outside of your interval, so you have no idea what the max value in your interval is. My only idea is to use a segment tree (in which the leaves are the nodes of your red-black tree), which lets you answer the question "what is the maximal value within the given range?" in logarithmic time, and also lets you update individual values in logarithmic time. However, since you need to insert new values into it, you need to keep it balanced as well.
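To give the flavour of that suggestion, here is a Python sketch of such a segment tree over a fixed key universe [0, size). This is a simplification: the answer's point is that a real solution must also stay balanced under insertion of new keys. It answers both highestPower and powerSum on a key range in O(log n):

class SegTree:
    def __init__(self, size):
        self.n = size
        self.mx = [0.0] * (2 * size)     # range maxima
        self.sm = [0.0] * (2 * size)     # range sums

    def update(self, key, power):        # also serves insert/delete (power = 0)
        i = key + self.n
        self.mx[i] = self.sm[i] = power
        i //= 2
        while i:
            self.mx[i] = max(self.mx[2 * i], self.mx[2 * i + 1])
            self.sm[i] = self.sm[2 * i] + self.sm[2 * i + 1]
            i //= 2

    def query(self, key1, key2):
        # Returns (highestPower, powerSum) over key1 <= k <= key2.
        lo, hi = key1 + self.n, key2 + self.n + 1
        best, total = 0.0, 0.0
        while lo < hi:
            if lo & 1:
                best = max(best, self.mx[lo]); total += self.sm[lo]; lo += 1
            if hi & 1:
                hi -= 1; best = max(best, self.mx[hi]); total += self.sm[hi]
            lo //= 2; hi //= 2
        return best, total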

Are there sorting algorithms that respect final position restrictions and run in O(n log n) time?

I'm looking for a sorting algorithm that honors a min and max range for each element¹. The problem domain is a recommendations engine that combines a set of business rules (the restrictions) with a recommendation score (the value). If we have a recommendation we want to promote (e.g. a special product or deal) or an announcement we want to appear near the top of the list (e.g. "This is super important, remember to verify your email address to participate in an upcoming promotion!") or near the bottom of the list (e.g. "If you liked these recommendations, click here for more..."), they will be curated with certain position restrictions in place. For example, this should always be the top position, these should be in the top 10, or middle 5 etc. This curation step is done ahead of time and remains fixed for a given time period and for business reasons must remain very flexible.
Please don't question the business purpose, UI or input validation. I'm just trying to implement the algorithm in the constraints I've been given. Please treat this as an academic question. I will endeavor to provide a rigorous problem statement, and feedback on all other aspects of the problem is very welcome.
So if we were sorting chars, our data would have a structure of
struct {
    char value;
    Integer minPosition;
    Integer maxPosition;
}
Where minPosition and maxPosition may be null (unrestricted). If this were called on an algorithm where all position restrictions were null, or all minPositions were 0 or less and all maxPositions were equal to or greater than the size of the list, then the output would just be the chars in ascending order.
This algorithm would only reorder two elements if the minPosition and maxPosition of both elements would not be violated by their new positions. An insertion-based algorithm which promotes items to the top of the list and reorders the rest has obvious problems, in that every later element would have to be revalidated after each iteration; in my head, that rules out such algorithms for having O(n^3) complexity, but I won't rule them out without considering evidence to the contrary, if presented.
In the output list, certain elements will be out of order with regard to their value, if and only if the set of position constraints dictates it. These outputs are still valid.
A valid list is any list where all elements are in a position that does not conflict with their constraints.
An optimal list is a list which cannot be reordered to more closely match the natural order without violating one or more position constraint. An invalid list is never optimal. I don't have a strict definition I can spell out for 'more closely matching' between one ordering or another. However, I think it's fairly easy to let intuition guide you, or choose something similar to a distance metric.
Multiple optimal orderings may exist if multiple inputs have the same value. You could make an argument that the above paragraph is therefore incorrect, because either one can be reordered to the other without violating constraints and therefore neither can be optimal. However, any rigorous distance function would treat these lists as identical, with the same distance from the natural order and therefore reordering the identical elements is allowed (because it's a no-op).
I would call such outputs the correct, sorted order which respects the position constraints, but several commentators pointed out that we're not really returning a sorted list, so let's stick with 'optimal'.
For example, the following are input lists (in the form <char>(<minPosition>:<maxPosition>), where Z(1:1) indicates a Z that must be at the front of the list and M(-:-) indicates an M that may be in any position in the final list, and the natural order (sorted by value only) is A...M...Z) and their optimal orders.
Input order
A(1:1) D(-:-) C(-:-) E(-:-) B(-:-)
Optimal order
A B C D E
This is a trivial example to show that the natural order prevails in a list with no constraints.
Input order
E(1:1) D(2:2) C(3:3) B(4:4) A(5:5)
Optimal order
E D C B A
This example is to show that a fully constrained list is output in the same order it is given. The input is already a valid and optimal list. The algorithm should still run in O(n log n) time for such inputs. (Our initial solution is able to short-circuit any fully constrained list to run in linear time; I added the example both to drive home the definitions of optimal and valid and because some swap-based algorithms I considered handled this as the worst case.)
Input order
E(1:1) C(-:-) B(1:5) A(4:4) D(2:3)
Optimal Order
E B D A C
E is constrained to 1:1, so it is first in the list even though it has the lowest value. A is similarly constrained to 4:4, so it is also out of natural order. B has essentially identical constraints to C and may appear anywhere in the final list, but B will be before C because of value. D may be in positions 2 or 3, so it appears after B because of natural ordering but before C because of its constraints.
Note that the final order is correct despite being wildly different from the natural order (which is still A,B,C,D,E). As explained in the previous paragraph, nothing in this list can be reordered without violating the constraints of one or more items.
Input order
B(-:-) C(2:2) A(-:-) A(-:-)
Optimal order
A(-:-) C(2:2) A(-:-) B(-:-)
C remains unmoved because it already in its only valid position. B is reordered to the end because its value is less than both A's. In reality, there will be additional fields that differentiate the two A's, but from the standpoint of the algorithm, they are identical and preserving OR reversing their input ordering is an optimal solution.
Input order
A(1:1) B(1:1) C(3:4) D(3:4) E(3:4)
Undefined output
This input is invalid for two reasons: 1) A and B are both constrained to position 1 and 2) C, D, and E are constrained to a range that can only hold 2 elements. In other words, the ranges 1:1 and 3:4 are over-constrained. However, the consistency and legality of the constraints are enforced by UI validation, so it's officially not the algorithm's problem if they are incorrect, and the algorithm can return a best-effort ordering OR the original ordering in that case. Passing an input like this to the algorithm may be considered undefined behavior; anything can happen. So, for the rest of the question...
All input lists will have elements that are initially in valid positions.
The sorting algorithm itself can assume the constraints are valid and an optimal order exists.²
We've currently settled on a customized selection sort (with runtime complexity of O(n^2)) and reasonably proved that it works for all inputs whose position restrictions are valid and consistent (e.g. not overbooked for a given position or range of positions).
Is there a sorting algorithm that is guaranteed to return the optimal final order and run in better than O(n^2) time complexity?³
I feel that a library standard sorting algorithm could be modified to handle these constraints by providing a custom comparator that accepts the candidate destination position for each element. This would be equivalent to the current position of each element, so maybe modifying the value-holding class to include the current position of the element and doing the extra accounting in the comparison (.equals()) and swap methods would be sufficient.
However, the more I think about it, an algorithm that runs in O(n log n) time could not work correctly with these restrictions. Intuitively, such algorithms are based on running n comparisons log n times. The log n is achieved by leveraging a divide and conquer mechanism, which only compares certain candidates for certain positions.
In other words, input lists with valid position constraints (i.e. counterexamples) exist for any O(n log n) sorting algorithm where a candidate element would be compared with an element (or range in the case of Quicksort and variants) with/to which it could not be swapped, and therefore would never move to the correct final position. If that's too vague, I can come up with a counter example for mergesort and quicksort.
In contrast, an O(n^2) sorting algorithm makes exhaustive comparisons and can always move an element to its correct final position.
To ask an actual question: Is my intuition correct when I reason that an O(n log n) sort is not guaranteed to find a valid order? If so, can you provide more concrete proof? If not, why not? Is there other existing research on this class of problem?
1: I've not been able to find a set of search terms that points me in the direction of any concrete classification of such sorting algorithm or constraints; that's why I'm asking some basic questions about the complexity. If there is a term for this type of problem, please post it up.
2: Validation is a separate problem, worthy of its own investigation and algorithm. I'm pretty sure that the existence of a valid order can be proven in linear time:
Allocate array of tuples of length equal to your list. Each tuple is an integer counter k and a double value v for the relative assignment weight.
Walk the list, adding the fractional value of each element's position constraint to the corresponding range and incrementing its counter by 1 (e.g. range 2:5 on a list of 10 adds 0.4 to each of 2, 3, 4, and 5 on our tuple list, incrementing the counter of each as well)
Walk the tuple list and
If no entry has value v greater than the sum of the series from 1 to k of 1/k, a valid order exists.
If there is such a tuple, the position it is in is over-constrained; throw an exception, log an error, use the doubles array to correct the problem elements etc.
Edit: This validation algorithm itself is actually O(n^2). Worst case, every element has the constraints 1:n, and you end up walking your list of n tuples n times. This is still irrelevant to the scope of the question, because in the real problem domain the constraints are enforced once and don't change.
Determining that a given list is in valid order is even easier. Just check each element's current position against its constraints.
3: This is admittedly a little bit premature optimization. Our initial use for this is for fairly small lists, but we're eyeing expansion to longer lists, so if we can optimize now we'd get small performance gains now and large performance gains later. And besides, my curiosity is piqued and if there is research out there on this topic, I would like to see it and (hopefully) learn from it.
On the existence of a solution: You can view this as a bipartite digraph with one set of vertices (U) being the k values, and the other set (V) the k ranks (1 to k), and an arc from each vertex in U to its valid ranks in V. Then the existence of a solution is equivalent to the maximum matching being a bijection. One way to check for this is to add a source vertex with an arc to each vertex in U, and a sink vertex with an arc from each vertex in V. Assign each edge a capacity of 1, then find the max flow. If it's k then there's a solution, otherwise not.
http://en.wikipedia.org/wiki/Maximum_flow_problem
--edit-- O(k^3) solution: First sort to find the sorted rank of each vertex (1 to k). Next, consider your values and ranks as 2 sets of k vertices, U and V, with weighted edges from each vertex in U to all of its legal ranks in V. The weight to assign each edge is the distance from the vertex's rank in sorted order. E.g., if U is 10 to 20, then the natural rank of 10 is 1. An edge from value 10 to rank 1 would have a weight of zero, to rank 3 would have a weight of 2. Next, assume all missing edges exist and assign them infinite weight. Lastly, find the "MINIMUM WEIGHT PERFECT MATCHING" in O(k^3).
http://www-math.mit.edu/~goemans/18433S09/matching-notes.pdf
This does not take advantage of the fact that the legal ranks for each element in U are contiguous, which may help get the running time down to O(k^2).
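For the existence check specifically, here is a small Python sketch using Kuhn's augmenting-path matching instead of an explicit max-flow solver (same yes/no answer; O(k^3) worst case):

def valid_order_exists(constraints, k):
    # constraints: list of (min_rank, max_rank), 1-based, None = unbounded.
    legal = [range((lo or 1), (hi or k) + 1) for lo, hi in constraints]
    match = {}                               # rank -> element index
    def augment(u, visited):
        for r in legal[u]:
            if r not in visited:
                visited.add(r)
                if r not in match or augment(match[r], visited):
                    match[r] = u
                    return True
        return False
    # A valid order exists iff every element can be matched to a rank.
    return all(augment(u, set()) for u in range(k))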
Here is what a coworker and I have come up with. I think it's an O(n^2) solution that returns a valid, optimal order if one exists, and a closest-possible effort if the initial ranges were over-constrained. I just tweaked a few things about the implementation and we're still writing tests, so there's a chance it doesn't work as advertised. This over-constrained condition is detected fairly easily when it occurs.
To start, things are simplified if you normalize your inputs to have all non-null constraints. In linear time, that is:
for each item in input
    if an item doesn't have a minimum position, set it to 1
    if an item doesn't have a maximum position, set it to the length of your list
The next goal is to construct a list of ranges, each containing all of the candidate elements that have that range, ordered by the remaining capacity of the range, ascending (so ranges with the fewest remaining spots come first), then by start position of the range, then by end position of the range. This can be done by creating a set of such ranges, then sorting them in O(n log n) time with a simple comparator.
For the rest of this answer, a range will be a simple object like so
class Range<T> implements Collection<T> {
    int startPosition;
    int endPosition;
    Collection<T> items;

    public int remainingCapacity() {
        return endPosition - startPosition + 1 - items.size();
    }

    // implement Collection<T> methods, passing through to the items collection

    public void add(T item) {
        // Validity checking here exposes some simple cases of over-constraining.
        // We'll catch these cases with the tricky stuff later anyway, so don't choke.
        items.add(item);
    }
}
If an element A has range 1:5, construct a range(1,5) object and add A to its elements. This range has remaining capacity of 5 - 1 + 1 - 1 (max - min + 1 - size) = 4. If an element B has range 1:5, add it to your existing range, which now has capacity 3.
Then it's a relatively simple matter of picking the best element that fits each position 1 => k in turn. Iterate your ranges in their sorted order, keeping track of the best eligible element, with the twist that you stop looking once you've reached a range whose remaining elements can't fit into its remaining positions. This is equivalent to the simple check range.max - current position + 1 > range.size (which can probably be simplified, but I think it's most understandable in this form). Remove each element from its range as it is selected. Remove each range from your list as it is emptied (optional; iterating an empty range will yield no candidates). That's a poor explanation, so let's do one of the examples from the question. Note that C(-:-) has been updated to the sanitized C(1:5), as described above.
Input order
E(1:1) C(1:5) B(1:5) A(4:4) D(2:3)
Built ranges (min:max)<remaining capacity>[elements]
(1:1)0[E] (4:4)0[A] (2:3)1[D] (1:5)3[C,B]
Find best for 1
    Consider (1:1); best element from its list is E
    Consider further ranges?
        range.max - current position + 1 > range.size ?
        range.max = 1; current position = 1; range.size = 1;
        1 - 1 + 1 > 1 = false; do not consider subsequent ranges
    Remove E from range, add to output list
Find best for 2; current range list is:
(4:4)0[A] (2:3)1[D] (1:5)3[C,B]
    Consider (4:4); skip it because it is not eligible for position 2
    Consider (2:3); best element is D
    Consider further ranges?
        3 - 2 + 1 > 1 = true; check next range
    Consider (1:5); best element is B
    End of range list; remove B from range, add to output list
An added simplifying factor is that the capacities do not need to be updated or the ranges reordered. An item is only removed if the rest of the higher-sorted ranges would not be disturbed by doing so. The remaining capacity is never checked after the initial sort.
Find best for 3; output is now E, B; current range list is:
(4:4)0[A] (2:3)1[D] (1:5)3[C]
    Consider (4:4); skip it because it is not eligible for position 3
    Consider (2:3); best element is D
    Consider further ranges?
        same as previous check, but current position is now 3
        3 - 3 + 1 > 1 = false; don't check next range
    Remove D from range, add to output list
Find best for 4; output is now E, B, D; current range list is:
(4:4)0[A] (1:5)3[C]
    Consider (4:4); best element is A
    Consider further ranges?
        4 - 4 + 1 > 1 = false; don't check next range
    Remove A from range, add to output list
Output is now E, B, D, A and there is one element left to be checked, so it gets appended to the end. This is the output list we desired to have.
This build process is the longest part. At its core, it's a straightforward O(n^2) selection sorting algorithm. The range constraints only work to shorten the inner loop, and there is no loopback or recursion; but the worst case (I think) is still sum_{i=1}^{n} (n - i), which is n^2/2 - n/2.
The detection step comes into play by not excluding a candidate range when the current position is beyond the end of that range's max position. You have to track the range your best candidate came from in order to remove it, so when you do the removal, just check whether the position you're extracting the candidate for is greater than that range's endPosition.
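Putting it together, a condensed Python sketch of the whole procedure (normalization, bucketed ranges, the selection loop, and the detection step just described; names are illustrative, and the detection simply raises rather than producing the best-effort ordering discussed above):

def constrained_sort(items, n):
    # items: (value, min_pos, max_pos), 1-based positions, None/0 = open.
    buckets = {}
    for value, lo, hi in items:
        lo, hi = (lo or 1), (hi or n)              # normalization step
        buckets.setdefault((lo, hi), []).append(value)
    for b in buckets.values():
        b.sort()                                   # smallest value = best candidate
    # Fewest remaining spots first, then start position, then end position.
    order = sorted(buckets,
                   key=lambda k: (k[1] - k[0] + 1 - len(buckets[k]), k[0], k[1]))
    out = []
    for pos in range(1, n + 1):
        best = None
        for lo, hi in order:
            b = buckets[(lo, hi)]
            if not b or lo > pos:
                continue                           # emptied, or not yet eligible
            if pos > hi:
                raise ValueError('over-constrained input')   # detection step
            if best is None or b[0] < buckets[best][0]:
                best = (lo, hi)
            if not (hi - pos + 1 > len(b)):
                break                              # this range must drain now; stop
        if best is None:
            raise ValueError('no eligible element for position %d' % pos)
        out.append(buckets[best].pop(0))
    return out

Run on the question's example E(1:1) C(-:-) B(1:5) A(4:4) D(2:3) with n = 5, this yields E, B, D, A, C, matching the walkthrough above.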
I have several other counter-examples that foiled my earlier algorithms, including a nice example that shows several over-constraint detections on the same input list and also how the final output is as close to optimal as the constraints will allow. In the meantime, please post any optimizations you can see, and especially any counter-examples where this algorithm makes an objectively incorrect choice (i.e. arrives at an invalid or suboptimal output when one exists).
I'm not going to accept this answer, because I specifically asked if it could be done in better than O(n^2). I haven't wrapped my head around the constraint-satisfaction approach in @DaveGalvin's answer yet, and I've never done a maximum flow problem, but I thought this might be helpful for others to look at.
Also, I discovered the best way to come up with valid test data is to start with a valid list and randomize it: for 0 -> i, create a random value and constraints such that min < i < max. (Again, posting it because it took me longer than it should have to come up with and others might find it helpful.)
Not likely*. I assume you mean an average run time of O(n log n), in-place, non-stable, off-line. Most sorting algorithms that improve on bubble sort's average run time of O(n^2), like Timsort, rely on the assumption that comparing 2 elements in a subset will produce the same result in the superset. A slower variant of Quicksort would be a good approach for your range constraints. The worst case won't change, but the average case will likely decrease, and the algorithm will have the extra constraint that a valid sort exists.
Is ... O(n log n) sort is not guaranteed to find a valid order?
All popular sort algorithms I am aware of are guaranteed to find an order so long as their constraints are met. Formal analysis (concrete proof) is on each sort algorithm's Wikipedia page.
Is there other existing research on this class of problem?
Yes; there are many journals like IJCSEA with sorting research.
*but that depends on your average data set.
