Assume that the oil company keeps, for each person who works in the
company, a record with their name, salary, age, and date of
birth. You can assume that no two employees share the same value in any
field. a. The accounting department wants to store the records of
all employees such that the following operations can be performed:
a./b./c./d. < not relevant >
e. Compute the average salary of the 10% of employees that earn the highest salaries.
Suggest an efficient data structure that allows an efficient implementation
(in worst-case time complexity and space complexity) of each of the above
operations.
I was given the above problem, and opted to use an AVL tree with a few additional fields for each node, such as
i. Key = Salary
ii. Size = The number of nodes in left+right subtrees + 1
iii. Sum = The sum of salaries of all descendants + the salary of the node
My solution was to keep going to the right child until we reach a node that has node.Size = n/10. Every descendant of this node (let's call it k) will have a higher salary than the parent of k, by the AVL tree properties. Hence, we have a subtree with n/10 nodes with the highest salaries in the AVL tree, and all we have to do is return Sum/Size.
But the official solution (which also used an AVL tree with the fields I suggested) used the AVL tree properties to calculate the rank of the node as follows:
To compute the average salary of the 10% of employees that earn the
highest salaries. We first find the node v that has a rank
9n/10 , then we sum all keys that are larger than key(v),
using the sum field in a set of nodes and divide the sum by
n/10 . To find v we use the rank select recursive algorithm:
The rank of the root is given by r = size(left(root)). If
r = 9n/10 we are done. Otherwise, recursively search in the
left or the right subtree according to whether r > 9n/10 (search in the left subtree) or r < 9n/10 (search in the right subtree for an element with rank 9n/10 − r).
Then we sum the values of all keys that are larger than v using key(x)+
sum(right(x)) field where x is a node on the path from v to the root
(including v) and v is in the left subtree of x.
Their solution seems messy and I don't quite understand it, but the idea of using the AVL tree properties to find the rank of a node sounds very useful, and I would like to ask the following questions:
i. Does my solution produce the desired result?
ii. Can anyone explain a little bit better the procedure to find the rank of a given element of an AVL (and the additional augmented fields that will be required to do so)?
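For reference, here is a minimal sketch of the Size and Sum augmentation discussed above. It assumes a plain, unbalanced binary search tree in place of an AVL tree (rotations are left out; they would have to recompute both fields for the rotated nodes), and the names `Node`, `insert`, `rank`, `sum_of_largest`, and `average_top_tenth` are hypothetical. `rank` counts the keys smaller than a given key, and `sum_of_largest` adds up the k largest keys along a single root-to-leaf path.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1      # number of nodes in this subtree
        self.total = key   # sum of the keys in this subtree


def size(node):
    return node.size if node else 0


def total(node):
    return node.total if node else 0


def insert(node, key):
    """Plain BST insert that maintains size and total (no rebalancing)."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    node.total = node.key + total(node.left) + total(node.right)
    return node


def rank(root, key):
    """Number of keys strictly smaller than key (key must be present)."""
    smaller = 0
    node = root
    while node:
        if key < node.key:
            node = node.left
        elif key > node.key:
            smaller += size(node.left) + 1   # left subtree and node itself
            node = node.right
        else:
            return smaller + size(node.left)
    raise KeyError(key)


def sum_of_largest(root, k):
    """Sum of the k largest keys, visiting one root-to-leaf path."""
    acc = 0
    node = root
    while node and k > 0:
        if size(node.right) >= k:
            node = node.right                # all k largest are on the right
        else:
            acc += total(node.right) + node.key
            k -= size(node.right) + 1
            node = node.left
    return acc


def average_top_tenth(root):
    k = max(1, size(root) // 10)
    return sum_of_largest(root, k) / k
```

With a balanced tree, each of these walks touches O(log n) nodes, which is where the bound in the quoted solution comes from.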
I'm new to algorithms and was hoping someone could explain why the maximum number of id[] array entries that can be changed in one call to union using quick-find is n-1? Preferably in Layman's terms.
Well, you know at least one entry isn't changing, right? It's the entry of the element whose equivalence class you are merging into: it already holds the target id (the relation is reflexive, after all). That leaves at most n-1 others.
If this isn't obvious, re-read the description of the algorithm.
Another way of looking at it: the union-find data structure maintains an array where each index holds the id of that node's "parent" in a forest (a disjoint union of trees) if it has one, otherwise it holds its own id. A new id is written into the array if and only if the algorithm adds a new edge or changes an existing edge. The maximum number of edges you can have in a forest on n nodes is n - 1, which occurs in the case that the forest is a tree including every node.
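To make the n-1 bound concrete, here is a small sketch of the usual quick-find union that counts how many entries it rewrites. The function name `union`, its return value, and the example size are assumptions for illustration only.

```python
def union(ids, p, q):
    """Quick-find union: relabel every entry equal to ids[p] with ids[q].

    Returns the number of array entries that were changed.
    """
    pid, qid = ids[p], ids[q]
    if pid == qid:
        return 0
    changed = 0
    for i in range(len(ids)):
        if ids[i] == pid:
            ids[i] = qid
            changed += 1
    return changed


n = 6
ids = list(range(n))
# Merge elements 0..n-2 into one component; element n-1 stays alone.
for i in range(1, n - 1):
    union(ids, i, 0)
# Now n-1 entries share one id. Merging that component into the last
# element rewrites all n-1 of them; only ids[n-1] keeps its value.
worst = union(ids, 0, n - 1)
```

The last call is the worst case: every entry except the one that already holds the target id is rewritten.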
I've been playing around with the forward-backward algorithm to find the most efficient (determined by a cost function dependent on how a current state differs from the next state) path to go from State 1 to State N. In the picture below, a short version of the problem can be seen with just 3 States and 2 Nodes per State. I do forward-backward algorithm on that and find the best path like normal. The red bits in the pictures are the paths checked during forward propagation bit in the code.
Now the interesting bit, I now want to find the best 3-State Length path (as before) but now only Nodes in the first State are known. The other 4 are now free-floating and can be considered to be in any State (State 2 or State 3). I want to know if you guys have a good idea of how to do this.
Picture: http://i.imgur.com/JrQ2tul.jpg
Note: Bear in mind the original problem consists of around 25 States and 100 Nodes per State. So, you'll know the State of around 100 Nodes in State 1 but the other 24*100 Nodes are Stateless. In this case, I want to find a 25-State length path (with minimum cost).
Addendum: Someone pointed out that a better algorithm would be Viterbi's algorithm. So here is a problem with more variables thrown in. Can you guys explain how that would be implemented? The same rules apply: the path should start from one of the Nodes in State 1 (Node a or Node b). Also, the cost function using the norm doesn't make sense in this case since we only have one property (Size of node), but in the actual problem I'm expecting a lot more properties.
A variation of Dijkstra's algorithm might be faster for your problem than the forward-backward algorithm, because it does not analyze all nodes at once. Dijkstra is a DP algorithm after all.
Let a node be specified by
    Node:
        Predecessor : Node
        Total cost : Number
        Visited nodes : Set of nodes (e.g. a hash set or other performant set)
Initialize the algorithm with
    open set : ordered (by total cost) set of nodes = set of possible start nodes (set visitedNodes to the one-element set with the current node)
               ( = {a, b} in your example)
Then execute the algorithm:
    do
        n := pop element with the lowest total cost from open set
        if (n.visitedNodes.count == stepTarget)
            we're done, backtrace the path from this node
        else
            for each n2 in available nodes
                if not n2 in n.visitedNodes
                    push copy of n2 to open set (the same node might appear multiple times in the set):
                        .totalCost := n.totalCost + norm(n2 - n)
                        .visitedNodes := n.visitedNodes u { n2 } //u = set union
                        .predecessor := n
            next
    loop
If calculating the norm is expensive, you might want to calculate it on demand and store it in a map.
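The search above can be sketched in Python roughly as follows. This is only an illustration under assumptions: each node carries a single numeric property (so the norm reduces to an absolute difference), the path is stored as a tuple instead of a predecessor link, and the names `cheapest_path`, `values`, `starts`, and `step_target` are hypothetical.

```python
import heapq
from itertools import count


def cheapest_path(values, starts, step_target):
    """Best-first search over partial paths.

    values: dict node name -> numeric property (assumed 1-D stand-in for
            the real feature vector; step cost is the absolute difference).
    starts: names of the nodes whose state is known (State 1).
    step_target: number of nodes the finished path must contain.
    """
    tie = count()  # tie-breaker so the heap never has to compare sets
    open_set = [(0.0, next(tie), s, frozenset([s]), (s,)) for s in starts]
    heapq.heapify(open_set)
    while open_set:
        cost, _, node, visited, path = heapq.heappop(open_set)
        if len(path) == step_target:
            # Costs are non-negative, so the first complete path popped
            # is a cheapest one.
            return cost, list(path)
        for nxt in values:
            if nxt not in visited:
                step = abs(values[nxt] - values[node])
                heapq.heappush(
                    open_set,
                    (cost + step, next(tie), nxt, visited | {nxt}, path + (nxt,)),
                )
    return None


example = {"a": 1.0, "b": 5.0, "c": 2.0, "d": 6.0, "e": 2.5, "f": 9.0}
result = cheapest_path(example, ["a", "b"], 3)
```

Because every partial path is kept as a separate entry, the open set can grow very quickly for 25 steps and 2500 nodes, so some pruning would probably be needed in practice.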
Assume that you are a party consultant hired to prepare and host a company party. Every employee in the company is part of a B-tree style hierarchy and is assigned a party rank value. To keep employees from feeling inhibited in the presence of their direct supervisor, a supervisor and their direct employees will not both be invited. However, either one of the two groups can be invited.
Design an algorithm to produce a guest list of the largest party rank sum.
My solution is
A supervisor will contain a field for the sum of the party ranks of the direct employees
Execute a bottom-up breadth-first search to access the lowest supervisor sub-tree in the tree. For each supervisor, calculate the differential between the supervisor party rank and the sum of the direct employees. If the employee party rank sum is greater than the supervisor ranking, all affected employees will be added to the guest list.
If the differential between supervisor and employee rankings is less than or equal to zero, move up one level and execute the comparison described above for the next level sub-tree.
Continue up level by level until the head of the company is analyzed, and print out the party rank sum and guest list.
My analysis for the run time is O(n log n -1) due to
log n-1 - time to descend to lowest sub-tree
n - maximum number of comparisons
I came up with this at an interview, but couldn't help but feel I missed something. Am I right on the analysis and steps?
I would compute, for each person in the hierarchy in a bottom-up fashion, two numbers:
If that person DID NOT attend, how many of his transitive subordinates could attend.
If that person DID attend, how many of his transitive subordinates (including himself) could attend.
For each person, this is easy to calculate given the two numbers for each immediate subordinate (in O(B) time, where B is the number of subordinates). Just try both ways for the person and use the appropriate number for each subordinate.
So with a bottom-up walk I think that's O(n) time in total.
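A minimal sketch of that bottom-up walk, adapted to maximize the party rank sum from the question rather than the head count. The input shape (a `children` dict and a `rank` dict) and the names `best_guest_list`, `skip`, and `take` are assumptions made for the example.

```python
def best_guest_list(children, rank, root):
    """Return (best_sum, guests) for the hierarchy rooted at root.

    children: dict person -> list of direct subordinates.
    rank: dict person -> party rank value.
    """
    # Pre-order listing; reversing it visits subordinates before supervisors.
    order, stack = [], [root]
    while stack:
        p = stack.pop()
        order.append(p)
        stack.extend(children.get(p, []))

    skip, take = {}, {}   # best sums when p is excluded / included
    for p in reversed(order):
        kids = children.get(p, [])
        skip[p] = sum(max(skip[c], take[c]) for c in kids)
        take[p] = rank[p] + sum(skip[c] for c in kids)

    # Walk back down to recover one optimal guest list.
    guests, stack = [], [(root, True)]
    while stack:
        p, may_attend = stack.pop()
        attends = may_attend and take[p] > skip[p]
        if attends:
            guests.append(p)
        stack.extend((c, not attends) for c in children.get(p, []))
    return max(skip[root], take[root]), sorted(guests)


org = {"ceo": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
ranks = {"ceo": 10, "a": 3, "b": 20, "c": 4, "d": 5, "e": 1}
best = best_guest_list(org, ranks, "ceo")
```

Each person is handled a constant number of times plus once per subordinate, which matches the O(n) estimate.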
I'm working on putting together a problem set for an intro-level CS course and came up with a question that, on the surface, seems very simple:
You are given a list of people with the names of their parents, their birth dates, and their death dates. You are interested in finding out who, at some point in their lifetime, was a parent, a grandparent, a great-grandparent, etc. Devise an algorithm to label each person with this information as an integer (0 means the person never had a child, 1 means that the person was a parent, 2 means that the person was a grandparent, etc.)
For simplicity, you can assume that the family graph is a DAG whose undirected version is a tree.
The interesting challenge here is that you can't just look at the shape of the tree to determine this information. For example, I have 8 great-great-grandparents, but since none of them were alive when I was born, in their lifetimes none of them were great-great-grandparents.
The best algorithm I can come up with for this problem runs in time O(n^2), where n is the number of people. The idea is simple - start a DFS from each person, finding the furthest descendant down in the family tree that was born before that person's death date. However, I'm pretty sure that this is not the optimal solution to the problem. For example, if the graph is just two parents and their n children, then the problem can be solved trivially in O(n). What I'm hoping for is some algorithm that either beats O(n^2) or whose runtime is parameterized over the shape of the graph in a way that makes it fast for wide graphs, with a graceful degradation to O(n^2) in the worst case.
Update: This is not the best solution I have come up with, but I've left it because there are so many comments relating to it.
You have a set of events (birth/death), parental state (no descendants, parent, grandparent, etc) and life state (alive, dead).
I would store my data in structures with the following fields:
mother
father
generations
is_alive
may_have_living_ancestor
Sort your events by date, and then for each event take one of the following two courses of logic:
Birth:
    Create new person with a mother, father, 0 generations, who is alive and may
    have a living ancestor.
    For each parent:
        If generations increased, then recursively increase generations for
        all living ancestors whose generations increased. While doing that,
        set the may_have_living_ancestor flag to false for anyone for whom it is
        discovered that they have no living ancestors. (You only iterate into
        a person's ancestors if you increased their generations, and if they
        still could have living ancestors.)
Death:
    Emit the person's name and generations.
    Set their is_alive flag to false.
The worst case is O(n*n) if everyone has a lot of living ancestors. However, in general you've got the sorting preprocessing step, which is O(n log(n)), and then you're O(n * avg no of living ancestors), which means that the total time tends to be O(n log(n)) in most populations. (I hadn't counted the sorting prestep properly, thanks to @Alexey Kukanov for the correction.)
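A simplified sketch of the event sweep, under several assumptions: dates are plain integers, a death is processed before a birth that falls on the same date, and the is_alive and may_have_living_ancestor pruning flags are left out. Instead, it tracks for everyone the depth of the descendant chain born so far and snapshots that value at death. The names `label_generations`, `depth`, and `label` are hypothetical.

```python
def label_generations(people):
    """people: dict name -> (list of parent names, birth date, death date)."""
    depth = {name: 0 for name in people}   # longest descendant chain so far
    label = {}
    events = []
    for name, (_, birth, death) in people.items():
        events.append((death, 0, name))    # kind 0 = death, sorts first on ties
        events.append((birth, 1, name))    # kind 1 = birth
    events.sort()

    for _, kind, name in events:
        if kind == 0:
            label[name] = depth[name]      # emit the value reached in life
            continue
        # Birth: push the new depth upward, but only while it increases.
        stack = [(p, 1) for p in people[name][0] if p in people]
        while stack:
            ancestor, d = stack.pop()
            if depth[ancestor] < d:
                depth[ancestor] = d
                stack.extend(
                    (gp, d + 1) for gp in people[ancestor][0] if gp in people
                )
    return label


family = {
    "A": ([], 1900, 1980),
    "B": (["A"], 1925, 1990),
    "C": (["B"], 1950, 2020),
    "D": (["C"], 1985, 2050),
}
labels = label_generations(family)
```

In this example A dies before D is born, so A is labeled 2 rather than 3. Without the pruning flags the walk may also pass through ancestors who have no living ancestors left, so this version does not improve on the worst case.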
I thought of this this morning, then found that @Alexey Kukanov had similar thoughts. But mine is more fleshed out and has some more optimization, so I'll post it anyway.
This algorithm is O(n * (1 + generations)), and will work for any dataset. For realistic data this is O(n).
Run through all records and generate objects representing people which include date of birth, links to parents, and links to children, and several more uninitialized fields. (Time of last death between self and ancestors, and an array of dates that they had 0, 1, 2, ... surviving generations.)
Go through all people and recursively find and store the time of last death. If a person is requested again, return the memoized record. Each person is encountered once directly, and the first time their value is calculated it generates at most 2 more calls (one to each parent). This gives a total of O(n) work to initialize this data.
Go through all people and recursively generate a record of when they first added a generation. These records only need go to the maximum of when the person or their last ancestor died. It is O(1) to calculate when you had 0 generations. Then for each recursive call to a child you need to do O(generations) work to merge that child's data in to yours. Each person gets called when you encounter them in the data structure, and can be called once from each parent for O(n) calls and total expense O(n * (generations + 1)).
Go through all people and figure out how many generations were alive at their death. This is again O(n * (generations + 1)) if implemented with a linear scan.
The sum total of all of these operations is O(n * (generations + 1)).
For realistic data sets, this will be O(n) with a fairly small constant.
My suggestion:
additionally to the values described in the problem statement, each personal record will have two fields: child counter and a dynamically growing vector (in C++/STL sense) which will keep the earliest birthday in each generation of a person's descendants.
use a hash table to store the data, with the person name being the key. The time to build it is linear (assuming a good hash function, the map has amortized constant time for inserts and finds).
for each person, detect and save the number of children. It's also done in linear time: for each personal record, find the record for its parents and increment their counters. This step can be combined with the previous one: if a record for a parent is not found, it is created and added, while details (dates etc) will be added when found in the input.
traverse the map, and put references to all personal records with no children into a queue. Still O(N).
for each element taken out of the queue:
add the birthday of this person into descendant_birthday[0] for both parents (grow that vector if necessary). If this field is already set, change it only if the new date is earlier.
For all descendant_birthday[i] dates available in the vector of the current record, follow the same rule as above to update descendant_birthday[i+1] in parents' records.
decrement parents' child counters; if it reaches 0, add the corresponding parent's record into the queue.
the cost of this step is O(C*N), with C being the biggest value of "family depth" for the given input (i.e. the size of the longest descendant_birthday vector). For realistic data it can be capped by some reasonable constant without correctness loss (as others already pointed out), and so does not depend on N.
traverse the map one more time, and "label each person" with the biggest i for which descendant_birthday[i] is still earlier than the death date; also O(C*N).
Thus for realistic data the solution for the problem can be found in linear time. Though for contrived data like suggested in @btilly's comment, C can be big, and even of the order of N in degenerate cases. It can be resolved either by putting a cap on the vector size or by extending the algorithm with step 2 of @btilly's solution.
A hash table is a key part of the solution if parent-child relations in the input data are provided through names (as written in the problem statement). Without hashes, it would require O(N log N) to build a relation graph. Most other suggested solutions seem to assume that the relationship graph already exists.
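The steps of this suggestion can be sketched as follows. This is an illustration under assumptions: the input is already a dict keyed by name, dates are plain comparable values, and the names `label_people`, `child_count`, and `earliest` are hypothetical stand-ins for the child counter and the descendant_birthday vector.

```python
from collections import deque


def label_people(records):
    """records: dict name -> (list of parent names, birth date, death date)."""
    child_count = {name: 0 for name in records}
    for parents, _, _ in records.values():
        for p in parents:
            if p in records:
                child_count[p] += 1

    # earliest[name][i] = earliest birth among descendants i+1 generations down
    earliest = {name: [] for name in records}
    queue = deque(n for n, c in child_count.items() if c == 0)
    while queue:
        name = queue.popleft()
        parents, birth, _ = records[name]
        mine = [birth] + earliest[name]    # shifted one generation for a parent
        for p in parents:
            if p not in records:
                continue
            theirs = earliest[p]
            for i, date in enumerate(mine):
                if i == len(theirs):
                    theirs.append(date)    # grow the vector when needed
                elif date < theirs[i]:
                    theirs[i] = date       # keep the earlier date
            child_count[p] -= 1
            if child_count[p] == 0:
                queue.append(p)            # all children of p are processed

    labels = {}
    for name, (_, _, death) in records.items():
        # Entries are increasing, so the ones before death form a prefix.
        labels[name] = sum(1 for d in earliest[name] if d < death)
    return labels


family = {
    "A": ([], 1900, 1980),
    "B": (["A"], 1925, 1990),
    "C": (["B"], 1950, 2020),
    "D": (["C"], 1985, 2050),
}
labels = label_people(family)
```

The inner loop over `mine` is the O(C) factor mentioned above; capping the vector length would bound it.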
Create a list of people, sorted by birth_date. Create another list of people, sorted by death_date. You can travel logically through time, popping people from these lists, in order to get a list of the events as they happened.
For each Person, define an is_alive field. This'll be FALSE for everyone at first. As people are born and die, update this record accordingly.
Define another field for each person, called has_a_living_ancestor, initialized to FALSE for everyone at first. At birth, x.has_a_living_ancestor will be set to x.mother.is_alive || x.mother.has_a_living_ancestor || x.father.is_alive || x.father.has_a_living_ancestor. So, for most people (but not everyone), this will be set to TRUE at birth.
The challenge is to identify occasions when has_a_living_ancestor can be set to FALSE. Each time a person is born, we do a DFS up through the ancestors, but only those ancestors for which ancestor.has_a_living_ancestor || ancestor.is_alive is true.
During that DFS, if we find an ancestor that has no living ancestors, and is now dead, then we can set has_a_living_ancestor to FALSE. This does mean, I think, that sometimes has_a_living_ancestor will be out of date, but it will hopefully be caught quickly.
The following is an O(n log n) algorithm that works for graphs in which each child has at most one parent (EDIT: this algorithm does not extend to the two-parent case with O(n log n) performance). It is worth noting that I believe the performance can be improved to O(n log(max level label)) with extra work.
One parent case:
For each node x, in reverse topological order, create a binary search tree T_x that is strictly increasing both in date of birth and in number of generations removed from x. (T_x contains the first born child c1 in the subgraph of the ancestry graph rooted at x, along with the next earliest born child c2 in this subgraph such that c2's 'great grandparent level' is strictly greater than that of c1, along with the next earliest born child c3 in this subgraph such that c3's level is strictly greater than that of c2, etc.) To create T_x, we merge the previously-constructed trees T_w where w is a child of x (they are previously-constructed because we are iterating in reverse topological order).
If we are careful with how we perform the merges, we can show that the total cost of such merges is O(n log n) for the entire ancestry graph. The key idea is to note that after each merge, at most one node of each level survives in the merged tree. We associate with each tree T_w a potential of h(w) log n, where h(w) is equal to the length of the longest path from w to a leaf.
When we merge the child trees T_w to create T_x, we 'destroy' all of the trees T_w, releasing all of the potential that they store for use in building the tree T_x; and we create a new tree T_x with (log n)(h(x)) potential. Thus, our goal is to spend at most O((log n)(sum_w(h(w)) - h(x) + constant)) time to create T_x from the trees T_w so that the amortized cost of the merge will be only O(log n). This can be achieved by choosing the tree T_w such that h(w) is maximal as a starting point for T_x and then modifying T_w to create T_x. After such a choice is made for T_x, we merge each of the other trees, one by one, into T_x with an algorithm that is similar to the standard algorithm for merging two binary search trees.
Essentially, the merging is accomplished by iterating over each node y in T_w, searching for y's predecessor z by birth date, and then inserting y into T_x if it is more levels removed from x than z; then, if z was inserted into T_x, we search for the node in T_x of the lowest level that is strictly greater than z's level, and splice out the intervening nodes to maintain the invariant that T_x is ordered strictly both by birth date and level. This costs O(log n) for each node in T_w, and there are at most O(h(w)) nodes in T_w, so the total cost of merging all trees is O((log n)(sum_w(h(w))), summing over all children w except for the child w' such that h(w') is maximal.
We store the level associated with each element of T_x in an auxiliary field of each node in the tree. We need this value so that we can figure out the actual level of x once we've constructed T_x. (As a technical detail, we actually store the difference of each node's level with that of its parent in T_x so that we can quickly increment the values for all nodes in the tree. This is a standard BST trick.)
That's it. We simply note that the initial potential is 0 and the final potential is positive so the sum of the amortized bounds is an upper bound on the total cost of all merges across the entire tree. We find the label of each node x once we create the BST T_x by binary searching for the latest element in T_x that was born before x died at cost O(log n).
To improve the bound to O(n log(max level label)), you can lazily merge the trees, only merging the first few elements of the tree as necessary to provide the solution for the current node. If you use a BST that exploits locality of reference, such as a splay tree, then you can achieve the above bound.
Hopefully, the above algorithm and analysis is at least clear enough to follow. Just comment if you need any clarification.
I have a hunch that obtaining for each person a mapping (generation -> date the first descendant in that generation is born) would help.
Since the dates must be strictly increasing, we would be able to use binary search (or a neat data structure) to find the most distant living descendant in O(log n) time.
The problem is that merging these lists (at least naively) is O(number of generations) so this could get to be O(n^2) in the worst case (consider A and B are parents of C and D, who are parents of E and F...).
I still have to work out how the best case works and try to identify the worst cases better (and see if there is a workaround for them)
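Assuming the per-person mapping is already available as a list, the lookup step could look like this small sketch. The function name and the list layout (index i holds the birth date of the first descendant in generation i+1) are assumptions for illustration.

```python
from bisect import bisect_left


def label_from_first_births(first_birth_by_generation, death_date):
    """Count the generations whose first member was born before death_date.

    first_birth_by_generation must be strictly increasing.
    """
    # bisect_left returns how many entries are strictly below death_date.
    return bisect_left(first_birth_by_generation, death_date)
```

Building the lists is still the expensive part; this only covers the O(log n) query.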
We recently implemented a relationship module in one of our projects in which we had everything in a database, and I think the algorithm was at best 2n * O(m) (m is the maximum branching factor). I multiplied the operations by two because in the first round we create the relationship graph and in the second round we visit every Person. We store a bidirectional relationship between every two nodes. While navigating, we only travel in one direction. We have two sets of operations: one traverses only children, the other traverses only parents.
class Person {
    String Name;
    // all relations where
    // this is FromPerson
    Relation[] FromRelations;
    // all relations where
    // this is ToPerson
    Relation[] ToRelations;
    DateTime birthDate;
    DateTime? deathDate;
}
class Relation
{
    Person FromPerson;
    Person ToPerson;
    RelationType Type;
}
enum RelationType
{
    Father,
    Son,
    Daughter,
    Mother
}
This looks somewhat like a bidirectional graph. In this case, you first build the list of all Person objects, and then you build the list of relations and set up FromRelations and ToRelations between the nodes. After that, for every Person, you only have to navigate the ToRelations of type Son or Daughter. And since you have the dates, you can calculate everything.
I don't have time to check the correctness of the code, but this will give you an idea of how to do it.
// needs: using System; using System.Collections.Generic; using System.Linq;
void LabelPerson(Person p){
    int n = GetLevelOfChildren(p, p.birthDate, p.deathDate);
    // label based on n...
}
int GetLevelOfChildren(Person p, DateTime bd, DateTime? ed){
    List<int> depths = new List<int>();
    foreach(Relation r in p.ToRelations.Where(
        x => x.Type == RelationType.Son || x.Type == RelationType.Daughter))
    {
        Person child = r.ToPerson;
        // Count a child only if it was born before the ancestor died;
        // a null death date means the ancestor is still alive.
        if(ed == null || child.birthDate <= ed.Value){
            depths.Add(1 + GetLevelOfChildren(child, bd, ed));
        }
    }
    if(depths.Count == 0)
        return 0;
    return depths.Max();
}
Here's my stab:
class Person
{
    Person[] Parents;
    string Name;
    DateTime DOB;
    DateTime DOD;
    int Generations = 0;

    void Increase(DateTime dob, int generations)
    {
        // current person is alive when caller was born
        if (dob < DOD)
            Generations = Math.Max(Generations, generations);
        // keep walking up: a more distant ancestor may still be alive
        foreach (Person p in Parents)
            p.Increase(dob, generations + 1);
    }

    // public so that it can be called from outside the class
    public void Calculate()
    {
        foreach (Person p in Parents)
            p.Increase(DOB, 1);
    }
}

// run for everyone
Person[] people = InitializeList(); // create objects from information
foreach (Person p in people)
    p.Calculate();
There's a relatively straightforward O(n log n) algorithm that sweeps the events chronologically with the help of a suitable top tree.
You really shouldn't assign homework that you can't solve yourself.