Efficient evaluation of formulas - algorithm

Here is the problem I ran into. I have a list of evaluators, I_1, I_2, ..., which have dependencies among each other, something like I_1 -> I_2 (read: I_2 depends on I_1's result). There are no cyclic dependencies.
Each of these shares the interface bool eval(), double value(). Say I_1->eval() updates the result of I_1, which can then be returned by I_1->value(). The boolean returned by eval() tells me whether the result has changed, and if so, every I_j that depends on I_1 needs to be updated.
Now say I_1 has an updated result: how do I run as few eval()s as possible to keep all the I_j up to date?

I just have a nested loop like this:
first do a tree-walk from I_1, marking it and all descendants as out-of-date
make a list of those descendants
anything_changed = true
while anything_changed:
    anything_changed = false
    for each formula in the descendant list:
        if no predecessors of that formula in the descendant list are out of date:
            re-evaluate the formula and mark it as no longer out of date
            anything_changed = true
Look, it's crude but correct.
So what if its big-O is roughly quadratic?
If the number of formulas is not too large, and/or the cost of evaluating each one is not too small, and/or this is not done at high frequency, performance should not be an issue.
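A minimal Python sketch of that loop, assuming each evaluator exposes eval() plus lists of its dependents and predecessors (those attribute names are mine, not part of the stated interface):

def update_from(changed):
    # mark the changed node and all of its descendants as out of date
    stale = set()
    stack = [changed]
    while stack:
        node = stack.pop()
        if node not in stale:
            stale.add(node)
            stack.extend(node.dependents)

    pending = stale - {changed}          # the changed node itself is already up to date
    while pending:
        for node in list(pending):
            # a formula is safe to evaluate once none of its predecessors is still stale
            if not any(pred in pending for pred in node.predecessors):
                node.eval()
                pending.discard(node)

Each pass through the inner loop evaluates at least one formula (there are no cycles), so it terminates; it is quadratic in the worst case, exactly as described above.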

If I could, I'd add links from a parent to its dependent children, so the update then becomes:
change_value()
{
    evaluate new_value based on all parents
    if (value != new_value)
    {
        value = new_value
        for each child
            child->change_value()
    }
}
Of course, you'd need to cope with the case where Child(n) is the parent of Child(m).
Actually, thinking about it, it might just work, but it won't be a minimal set of calls to change_value.

You need something like a breadth-first search from I_1, omitting the descendants of nodes whose eval() reported no change, and taking into account that you should not evaluate a node until you have evaluated all the nodes it directly depends on. One way to arrange this is to keep a count of unevaluated direct dependencies on each node, and to decrement that count on every node that depends on a node you have just evaluated. At each stage, if there are still nodes that need evaluating, at least one of them does not depend on any unevaluated node; otherwise you could produce an arbitrarily long chain of unevaluated nodes by travelling from each node to a node it depends on, and so on, which is impossible because there are no cycles in the dependency graph.
There is pseudo-code for breadth first search at https://en.wikipedia.org/wiki/Breadth-first_search.

An efficient solution would be to have two relations. If I_2 depends on I_1 you would have I_1 --influences--> I_2 and I_2 --depends on--> I_1 as relations.
You basically need to be able to efficiently calculate the number of out-of-date evaluations that I_X depends on (let's call that number D(I_X)).
Then, you do the following:
Do a BFS with the --influences--> relation, storing all reachable I_X
Store the reachable I_X in a data structure that sorts them according to their D(I_X), e.g. a priority queue
// finding the D(I_X) could be integrated into the BFS and requires little additional calculation time
while (still nodes to update):
    Pop and re-evaluate the I_X with the lowest D(I_X) value (e.g. the first I_X from the queue) (*)
    Update the D(I_Y) value for all I_Y with I_X --influences--> I_Y (i.e. lower it by 1)
    Update the sorting/queue to reflect the new D(I_Y) values
(*) The first element should always have D(I_X) == 0; otherwise you might have a circular dependency.
The algorithm above spends quite a bit of time finding the nodes to update and ordering them, but gains the advantage that it only re-evaluates every I_X once.
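A hedged Python sketch of that procedure. Since the popped node always has D == 0, a plain queue of ready nodes stands in for the priority queue; the influences and eval attributes are assumptions about the node interface, not part of the original post:

from collections import deque

def update(changed_root):
    # BFS along --influences--> to collect every node that may need re-evaluation
    affected = set()
    queue = deque([changed_root])
    while queue:
        node = queue.popleft()
        for succ in node.influences:
            if succ not in affected:
                affected.add(succ)
                queue.append(succ)

    # D(x) = number of predecessors of x that are themselves awaiting re-evaluation
    d = {node: 0 for node in affected}
    for node in affected:
        for succ in node.influences:
            if succ in affected:
                d[succ] += 1

    ready = deque(node for node in affected if d[node] == 0)
    while ready:
        node = ready.popleft()
        node.eval()                      # each affected node is re-evaluated exactly once
        for succ in node.influences:
            if succ in affected:
                d[succ] -= 1
                if d[succ] == 0:
                    ready.append(succ)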

Related

Pacman AI - Minimax Application - Avoiding repeated game tree states

In the context of a project following the UC Berkeley Pacman AI project (its second part), I want to implement the minimax algorithm, without alpha-beta pruning, for an adversarial agent, in a layout small enough that recursion is not a problem.
Having defined the problem as a 2-player (we assume only 1 ghost), turn-taking, zero-sum game with perfect information, applying the recursive algorithm would be pretty trivial. However, since many different strategies can end up in the same game state (defined as a tuple of pacman's position, the ghost's position, the food's position and the player currently playing), I wanted to find a way to avoid recomputing all those states.
I searched and read some things about transposition tables. However, I am not really sure how to use such a method, and what I thought I should implement was the following:
Each time a state, not yet visited, is expanded, add it to a 'visited' set. If the state has already been expanded, then if it's the max player's turn (pacman) return a +inf value (which would normally never be chosen by the min player), if it's min's turn return -inf accordingly.
The problem with this idea, I think, and the reason why it works for some layouts but not others, is that when I hit a node, all the children of which have already been expanded, the only values I have to choose from are +/- infinities. This causes an infinite value to propagate upwards and be selected, while in fact it is possible that this game state leads to a loss. I think, I have understood the problem, but I can't seem to find a way to get around it.
Is there any other method I could use to avoid computing repeated game states? Is there a standard approach to this that I am not aware of?
Here is some pseudocode:
def maxPlayer(currentState, visitedSet):
    if not isTerminalState(currentState):
        scores = []
        for nextState, action in currentState.generateMaxSuccessors():
            if nextState not in visitedSet:
                mark nextState as visited
                scores = scores + [minPlayer(nextState, visitedSet)]
        if scores is not empty:
            return max(scores)
        else:
            return +inf  # The problem is HERE!
    else:
        return evalFnc(currentState)
end maxPlayer

def minPlayer(currentState, visitedSet):
    if not isTerminalState(currentState):
        scores = []
        for nextState, action in currentState.generateMinSuccessors():
            if nextState not in visitedSet:
                mark nextState as visited
                scores = scores + [maxPlayer(nextState, visitedSet)]
        if scores is not empty:
            return min(scores)
        else:
            return -inf  # The problem is also HERE!
    else:
        return evalFnc(currentState)
end minPlayer
Note that the first player to play is max and I choose the action that has the highest score. Nothing changes if I take into account infinite values or not, there are still instances of the game where the agent loses, or loops infinitely.
I think the main shortcoming in your approach is that you consider already visited states as undesirable targets for the opponent to move to. Instead of returning an infinity value, you should retrieve the value that was computed at the time when that state was first visited.
Practically this means you should use a map (of state->value) instead of a set (of state).
Only in the case where the value of the first visit is not yet computed (because the recursive call leads to a visit of an ancestor state) would you need to use a reserved value. But let that value be undefined/null/None, so that it will not be treated like the other numerical results, but will be excluded from possible paths, even when backtracking.
As a side note, I would perform the lookup & marking of states at the start of the function -- on the current state -- instead of inside the loop on the neighboring states.
Here is how one of the two functions would then look:
def maxPlayer(currentState, evaluatedMap):
    if currentState in evaluatedMap:
        return evaluatedMap.get(currentState)
    evaluatedMap.set(currentState, undefined)
    if not isTerminalState(currentState):
        bestScore = undefined
        scores = []
        for nextState in currentState.generateMaxSuccessors():
            value = minPlayer(nextState, evaluatedMap)
            if value != undefined:
                scores.append(value)
        if scores is not empty:
            bestScore = max(scores)
    else:
        bestScore = evalFnc(currentState)
    evaluatedMap.set(currentState, bestScore)
    return bestScore
end maxPlayer
The value undefined will be used during the time that a state is visited, but its value has not yet been determined (because of pending recursive calls). If a state is such that the current player has no valid moves (is "stuck"), then that state will permanently get the value undefined, in other cases, the value undefined will eventually get replaced with a true score.
The problem I was having was ultimately related to the definition of a 'game state' and how 'repeated states' had to be handled.
In fact, consider the game state tree and a particular game state x, which is identified by the following:
The position of pacman.
The number and position of food pellets on the grid.
The position and the direction of the ghost (the direction is taken into account because the ghost is considered unable to make a half turn).
Now suppose you start going down a certain branch of the tree and at some point you visit the node x. Assuming it had not already been visited before and it is not a terminal state for the game, this node should be added to the set of visited nodes.
Now suppose that once you're done with this particular branch of the tree, you start exploring a different one. After a certain, undetermined number of steps you get once again to a node identified as x. This is where the problem with the code in the question lies.
In fact, while the game state as defined is exactly the same, the path followed to get to this state is not (since we are currently on a new, different branch than the original one). Obviously, considering the state as visited, or reusing the utility calculated on the previous branch, is incorrect and produces unexpected results.
The solution to this problem is, simply, to have a separate set of visited nodes for each branch of the tree. This way the situation described above is avoided. From there on, there are two strategies that can be considered:
The first one consists of considering looping through already visited states as a worst-case scenario for pacman and an optimal strategy for the ghost (which is obviously not strictly true). Taking this into account, repeated states in the same branch of the tree are treated as a kind of 'terminal' state that returns -inf as a utility.
The second approach consists of making use of a transposition table. This is however not trivial to implement: if a node is not already in the dictionary, initialize it at infinity to show that it is currently being computed and should not be recomputed if visited later. When reaching a terminal state, while recursing back up through the nodes, store in the dictionary the difference in the game score between each node and the corresponding terminal state. If, while traversing a branch, you visit a node that is already in the dictionary, return the current game score (which depends on the path you took to get to this node and can change from one branch to the other) plus the value in the dictionary (which is the gain (or loss) in score from getting from this node to the terminal state, and which is always the same).
In more practical terms, the first approach is really simple to implement: it suffices to copy the set every time you pass it as an argument to the next player (so that values in different branches won't affect each other), as sketched below. This makes the algorithm significantly slower, and alpha-beta pruning should be applied even for very small, simple mazes (1 food pellet and maybe 7x7 mazes); in any other case Python will either complain about recursion or simply take too long to solve (more than a few minutes). It is however correct.
The second approach is more complicated. I have no formal proof of correctness, although intuitively it seems to work. It is significantly faster and also compatible with alpha-beta pruning.
The corresponding pseudo code is easy to derive from the explanation.
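For completeness, a minimal Python sketch of the first approach (per-branch visited set, repeats scored as -inf). The helpers generate_max_successors, generate_min_successors, is_terminal and evaluate are hypothetical names standing in for the project's own functions:

import math

def max_player(state, visited):
    # initial call: max_player(start_state, {start_state})
    if state.is_terminal():
        return evaluate(state)
    scores = []
    for next_state, action in state.generate_max_successors():
        if next_state in visited:
            scores.append(-math.inf)          # looping counts as a loss for pacman
        else:
            # the union makes a fresh copy, so sibling branches keep separate histories
            scores.append(min_player(next_state, visited | {next_state}))
    return max(scores) if scores else evaluate(state)

def min_player(state, visited):
    if state.is_terminal():
        return evaluate(state)
    scores = []
    for next_state, action in state.generate_min_successors():
        if next_state in visited:
            scores.append(-math.inf)
        else:
            scores.append(max_player(next_state, visited | {next_state}))
    return min(scores) if scores else evaluate(state)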

How to perform minimum splits to satisfy special set ordering?

I'm trying to create an algorithm to solve the following problem:
Input is an unsorted list of sets containing pairs (key, value) of ints. The first of each pair is positive and unique within the set.
I want to find an algorithm to split the input sets so the sets can be ordered such that for each key the value is nondecreasing in the set order.
There is a trivial solution, which is to split the sets into their individual elements and sort them; I'd like something more efficient in terms of the number of sets which are split.
Are there any similar problems you have encountered and/or techniques you can suggest?
Does the optimal (minimum number of splits) solution sound like it is possible in polynomial time?
Edit: In the example the "<=" operator indicates a constraint on the sets as a whole whereby for each key value (100, 101, 102) the corresponding values are equal to or greater than the values in previous sets (or omitted from the set). I.e extracting the values for each key using the order from the output sets gives:
Key 100 {0, 1}
Key 101 {2, 3}
Key 102 {10, 15}
A*
I propose using A* to find an optimal solution. Build the order of split sets incrementally from left to right, minimizing the number of sets required to achieve this.
A* visits states based on some heuristic estimate of the total cost. I propose that a state is described by the totality of all the pairs already included in the order as we have it so far. If all values for every key are different, then you can represent this information rather concisely by simply storing the last value for each key. Otherwise you'll have to somehow take care of equal values, so you know which ones were already included and which ones were not. For every state you maintain some representation of the best order leading to it, but that may get updated along the way while the state remains the same.
The heuristic should be an estimate of the total cost of the path from the beginning through the current state to the goal. It may be too low, but must never be too high. In our case, the heuristic should count the number of (possibly split) sets included in the order so far, and add to that the number of (unsplit) sets still waiting for insertion. As the remaining sets may need splitting, this might be too low, but as you can never have less sets than those still waiting for insertion, it is a suitable heuristic.
Now you have some priority queue of states, ordered by the value of this heuristic. You extract minimal items from it, and know that the moment you extract a state from the queue, the cost up to that state can not decrease any more, so the path up to that state is optimal. Now you examine what other states can be reached from this: which other pairs can be next in the order of split sets? For each remaining set which has pairs that are ready to be included, you create a new subsequent state, taking all the pairs from the set which are ready. The cost so far increases by one. If you manage to take a whole set, without splitting, then the estimate for the remaining cost decreases by one.
For this new state, you check whether it is already present in your priority queue. If it is, and its previous cost was higher than the one just computed, then you update its cost and the optimal path leading to it. Make sure the priority key changes its position accordingly (“decrease key”). If the state wasn't present in the queue before, then add it to the queue.
Dijkstra
Come to think of it, this is the same as running Dijkstra's algorithm with the number of splits as cost. And as each edge has either cost zero or cost one, you can implement this even easier, without any priority queue at all. Instead, you can use two sets, called S₀ and S₁, where all elements from S₀ require the same number of splits, and all elements from S₁ require one more split. Roughly sketched in pseudocode:
S₀ = ∅ (empty set)
S₁ = ∅
add initial state (no pairs added yet, all sets remain to be added) to S₀
while True:
    while S₀ ≠ ∅:
        x = take and remove any element from S₀
        if x is the target state (all pairs included in the order) then
            return the path information associated with it
        for (r: those sets which remain to be added in state x):
            if we can take r as a whole then
                let y be the state obtained by taking r as the next set in the order
                if y is in S₁, remove it
                add y to S₀
            else if we can add only some elements from r then
                let y be the state obtained by taking as many elements from r as possible
                if y is not in S₀, add it to S₁
    S₀ = S₁
    S₁ = ∅
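For reference, the two-set scheme above is equivalent to the textbook 0-1 BFS with a double-ended queue. A hedged, problem-agnostic Python sketch, where successors(state) yields (next_state, cost) pairs with cost 0 (a whole set taken) or 1 (a split), and is_target recognizes the finished order; both are placeholders for the problem-specific logic:

from collections import deque

def zero_one_bfs(start, successors, is_target):
    dist = {start: 0}
    dq = deque([(0, start)])
    while dq:
        d, state = dq.popleft()
        if d > dist[state]:
            continue                      # stale queue entry
        if is_target(state):
            return d                      # minimum number of splits
        for nxt, cost in successors(state):
            nd = d + cost
            if nxt not in dist or nd < dist[nxt]:
                dist[nxt] = nd
                # cost-0 moves keep the same distance, so they go to the front
                (dq.appendleft if cost == 0 else dq.append)((nd, nxt))
    return None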

Clojure DAG (Bayesian Network)

I would like to build a Bayesian Network in Clojure, since I haven't found any similar project.
I have studied a lot of BN theory, but I still can't see how to implement the network (I am not what people call a "guru" for anything, but especially not for functional programming).
I do know that a BN is nothing more than a DAG and a lot of probability tables (one for each node), but now I have no clue how to implement the DAG.
My first idea was a huge set (the DAG) with some little maps (the nodes of the DAG); every map should have a name (probably a :key), a probability table (another map?), a vector of parents and finally a vector of non-descendants.
Now I don't know how to implement the references to the parents and non-descendants (what I should put in the two vectors).
I guess that a pointer would be perfect, but Clojure lacks them; I could put the :name of the other nodes in the vector, but that is going to be slow, isn't it?
I was thinking that instead of a vector I could use more sets; that way it would be faster to find the descendants of a node.
There is a similar problem for the probability table, where I still need some reference to the other nodes.
Finally, I also would like to learn the BN (build the network starting from the data); this means that I will change both the probability tables and the edges and nodes a lot.
Should I use mutable types, or would they only increase the complexity?
This is not a complete answer, but here is a possible encoding for the example network from the wikipedia article. Each node has a name, a list of successors (children) and a probability table:
(defn node [name children fn]
{:name name :children children :table fn})
Also, here are little helper functions for building true/false probabilities:
;; builds a true/false probability map
(defn tf [true-prob] #(if % true-prob (- 1.0 true-prob)))
The above function returns a closure, which, when given a true value (resp. false value), returns the probability of the event X=true (for the X probability variable we are encoding).
Since the network is a DAG, we can reference nodes directly from one another (exactly like the pointers you mentioned) without having to care about circular references. We just build the graph in topological order:
(let [gw (node "grass wet" [] (fn [& {:keys [sprinkler rain]}]
                                (tf (cond (and sprinkler rain) 0.99
                                          sprinkler 0.9
                                          rain 0.8
                                          :else 0.0))))
      sk (node "sprinkler" [gw]
               (fn [& {:keys [rain]}] (tf (if rain 0.01 0.4))))
      rn (node "rain" [sk gw]
               (constantly (tf 0.2)))]
  (def dag {:nodes {:grass-wet gw :sprinkler sk :rain rn}
            :joint (fn [g s r]
                     (*
                      (((:table gw) :sprinkler s :rain r) g)
                      (((:table sk) :rain r) s)
                      (((:table rn)) r)))}))
The probability table of each node is given as a function of the states of the parent nodes and returns the probability for true and false values. For example,
((:table (:grass-wet dag)) :sprinkler true :rain false)
... returns {:true 0.9, :false 0.09999999999999998}.
The resulting joint function combines probabilities according to this formula:
P(G,S,R) = P(G|S,R).P(S|R).P(R)
And ((:joint dag) true true true) returns 0.0019800000000000004.
Indeed, each value returned by ((:table <x>) <args>) is a closure around an if, which returns the probability given the state of the probability variable. We call each closure with the respective true/false value to extract the appropriate probability, and multiply them.
Here, I am cheating a little because I suppose that the joint function should be computed by traversing the graph (a macro could help, in the general case). This also feels a little messy, notably regarding nodes' states, which are not necessarily only true and false: you would most likely use a map in the general case.
In general, the way to compute the joint distribution of a BN is
prod( P(node | parents of node) )
To achieve this, you need a list of nodes where each node contains
node name
list of parents
probability table
list of children
The probability table is maybe easiest to handle when flat, with each row corresponding to a parent configuration and each column corresponding to a value of the node. This assumes you are using a record to hold all of the values. The value of the node can be contained within the node also.
Nodes with no parents have only one row.
Each row should be normalized after which P(node|parents) = table[row,col]
You don't really need the list of children but having it could make topological sorting easier. A DAG must be capable of being topologically sorted.
The biggest problem arises as the number of cells in the probability table is the product of all of the dimensions of the parents and self. I handled this in C++ using a sparse table using row mapping.
Querying the DAG is a different matter, and the best method for doing this depends on size and whether an approximate answer is sufficient. There isn't enough room to cover them here. Searching for Murphy and the Bayes Net Toolbox might be helpful.
I realize you are specifically looking for an implementation but, with a little work, you can roll your own.
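A hedged sketch (in Python, for brevity) of the flat-table idea and the product formula above; the field names and the dict-keyed-by-parent-configuration layout are my own choices, not prescribed by the answer:

class Node:
    def __init__(self, name, parents, values, table):
        self.name = name        # node name
        self.parents = parents  # list of parent Node objects
        self.values = values    # possible values of this node
        # flat CPT: one row per configuration of parent values,
        # one column per value of this node, each row normalized
        self.table = table      # dict: tuple of parent values -> list of probabilities

    def p(self, value, parent_values):
        row = self.table[tuple(parent_values)]
        return row[self.values.index(value)]

def joint(nodes, assignment):
    # P(assignment) = prod over nodes of P(node value | parent values)
    result = 1.0
    for node in nodes:
        parent_values = [assignment[p.name] for p in node.parents]
        result *= node.p(assignment[node.name], parent_values)
    return result

A node with no parents simply has a single row keyed by the empty tuple.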
You may try to go even flatter and have several maps indexed by node ids: one map for probability tables, one for parents and one for non-descendants (I'm no BN expert: what's this, how is it used, etc.? It feels like something that could be recomputed from the parents table^W relation^W map).

Family Tree Algorithm

I'm working on putting together a problem set for an intro-level CS course and came up with a question that, on the surface, seems very simple:
You are given a list of people with the names of their parents, their birth dates, and their death dates. You are interested in finding out who, at some point in their lifetime, was a parent, a grandparent, a great-grandparent, etc. Devise an algorithm to label each person with this information as an integer (0 means the person never had a child, 1 means that the person was a parent, 2 means that the person was a grandparent, etc.)
For simplicity, you can assume that the family graph is a DAG whose undirected version is a tree.
The interesting challenge here is that you can't just look at the shape of the tree to determine this information. For example, I have 8 great-great-grandparents, but since none of them were alive when I was born, in their lifetimes none of them were great-great-grandparents.
The best algorithm I can come up with for this problem runs in time O(n^2), where n is the number of people. The idea is simple - start a DFS from each person, finding the furthest descendant down in the family tree that was born before that person's death date. However, I'm pretty sure that this is not the optimal solution to the problem. For example, if the graph is just two parents and their n children, then the problem can be solved trivially in O(n). What I'm hoping for is some algorithm that either beats O(n^2) or whose runtime is parameterized over the shape of the graph, making it fast for wide graphs with a graceful degradation to O(n^2) in the worst case.
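For concreteness, a hedged Python sketch of that O(n^2) baseline (one DFS per person); the children, birth and death fields are assumptions about how the records are stored:

def label(person):
    # deepest generation of descendants born before this person's death
    def depth(p):
        best = 0
        for child in p.children:
            if child.birth < person.death:
                best = max(best, 1 + depth(child))
        return best
    return depth(person)

Calling label once per person is what makes the whole pass quadratic.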
Update: This is not the best solution I have come up with, but I've left it because there are so many comments relating to it.
You have a set of events (birth/death), parental state (no descendants, parent, grandparent, etc) and life state (alive, dead).
I would store my data in structures with the following fields:
mother
father
generations
is_alive
may_have_living_ancestor
Sort your events by date, and then for each event take one of the following two courses of logic:
Birth:
    Create a new person with a mother, father, 0 generations, who is alive and may have a living ancestor.
    For each parent:
        If generations increased, then recursively increase generations for all living ancestors whose generations increased. While doing that, set the may_have_living_ancestor flag to false for anyone for whom it is discovered that they have no living ancestors. (You only iterate into a person's ancestors if you increased their generations, and if they still could have living ancestors.)
Death:
    Emit the person's name and generations.
    Set their is_alive flag to false.
The worst case is O(n*n) if everyone has a lot of living ancestors. However in general you've got the sorting preprocessing step, which is O(n log(n)), and then you're O(n * avg number of living ancestors), which means that the total time tends to be O(n log(n)) in most populations. (I hadn't counted the sorting prestep properly, thanks to @Alexey Kukanov for the correction.)
I thought of this this morning, then found that @Alexey Kukanov had similar thoughts. But mine is more fleshed out and has some more optimization, so I'll post it anyway.
This algorithm is O(n * (1 + generations)), and will work for any dataset. For realistic data this is O(n).
Run through all records and generate objects representing people which include date of birth, links to parents, and links to children, and several more uninitialized fields. (Time of last death between self and ancestors, and an array of dates that they had 0, 1, 2, ... surviving generations.)
Go through all people and recursively find and store the time of last death. If you call for the same person again, return the memoized record. Each person is encountered once needing the calculation, and the first time you calculate them you generate at most 2 more calls, one to each parent. This gives a total of O(n) work to initialize this data.
Go through all people and recursively generate a record of when they first added a generation. These records only need to go up to the maximum of when the person or their last ancestor died. It is O(1) to calculate when you had 0 generations. Then for each recursive call from a child you need to do O(generations) work to merge that child's data into yours. Each person gets called when you encounter them in the data structure, and can be called once from each parent, for O(n) calls and total expense O(n * (generations + 1)).
Go through all people and figure out how many generations were alive at their death. This is again O(n * (generations + 1)) if implemented with a linear scan.
The sum total of all of these operations is O(n * (generations + 1)).
For realistic data sets, this will be O(n) with a fairly small constant.
My suggestion:
In addition to the values described in the problem statement, each personal record will have two fields: a child counter and a dynamically growing vector (in the C++/STL sense) which will keep the earliest birthday in each generation of the person's descendants.
use a hash table to store the data, with the person name being the key. The time to build it is linear (assuming a good hash function, the map has amortized constant time for inserts and finds).
for each person, detect and save the number of children. It's also done in linear time: for each personal record, find the record for its parents and increment their counters. This step can be combined with the previous one: if a record for a parent is not found, it is created and added, while details (dates etc) will be added when found in the input.
traverse the map, and put references to all personal records with no children into a queue. Still O(N).
for each element taken out of the queue:
add the birthday of this person into descendant_birthday[0] for both parents (grow that vector if necessary). If this field is already set, change it only if the new date is earlier.
For all descendant_birthday[i] dates available in the vector of the current record, follow the same rule as above to update descendant_birthday[i+1] in parents' records.
decrement parents' child counters; if it reaches 0, add the corresponding parent's record into the queue.
the cost of this step is O(C*N), with C being the biggest value of "family depth" for the given input (i.e. the size of the longest descendant_birthday vector). For realistic data it can be capped by some reasonable constant without correctness loss (as others already pointed out), and so does not depend on N.
traverse the map one more time, and "label each person" with the biggest i for which descendant_birthday[i] is still earlier than the death date; also O(C*N).
Thus for realistic data the solution to the problem can be found in linear time. Though for contrived data like that suggested in @btilly's comment, C can be big, and even of the order of N in degenerate cases. It can be resolved either by putting a cap on the vector size or by extending the algorithm with step 2 of @btilly's solution.
A hash table is a key part of the solution in case parent-child relations in the input data are provided through names (as written in the problem statement). Without hashes, it would require O(N log N) to build the relation graph. Most other suggested solutions seem to assume that the relationship graph already exists.
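A hedged Python sketch of those steps (hash map, child counters, a queue of childless people, and a per-person vector of earliest birthdays per descendant generation). The record layout is an assumption, every named parent is assumed to have a record, and the final label follows the question's convention (0 = never a parent, 1 = parent, 2 = grandparent, ...):

from collections import deque

def label_generations(people):
    # people: dict name -> record with 'birth', 'death' (comparable dates) and 'parents' (list of names)
    children_left = {name: 0 for name in people}
    for rec in people.values():
        for parent in rec['parents']:
            children_left[parent] += 1

    # desc_bday[name][i] = earliest birth date among descendants (i+1) generations down
    desc_bday = {name: [] for name in people}
    queue = deque(name for name, c in children_left.items() if c == 0)
    while queue:
        name = queue.popleft()
        rec = people[name]
        for parent in rec['parents']:
            # this person feeds slot 0 of the parent; their own vector shifts down one generation
            dates = [rec['birth']] + desc_bday[name]
            pd = desc_bday[parent]
            for i, d in enumerate(dates):
                if i == len(pd):
                    pd.append(d)
                elif d < pd[i]:
                    pd[i] = d
            children_left[parent] -= 1
            if children_left[parent] == 0:
                queue.append(parent)

    # label = largest i+1 such that some descendant i+1 generations down was born before the death date
    labels = {}
    for name, rec in people.items():
        labels[name] = 0
        for i, d in enumerate(desc_bday[name]):
            if d < rec['death']:
                labels[name] = i + 1
    return labels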
Create a list of people, sorted by birth_date. Create another list of people, sorted by death_date. You can travel logically through time, popping people from these lists, in order to get a list of the events as they happened.
For each Person, define an is_alive field. This'll be FALSE for everyone at first. As people are born and die, update this record accordingly.
Define another field for each person, called has_a_living_ancestor, initialized to FALSE for everyone at first. At birth, x.has_a_living_ancestor will be set to x.mother.is_alive || x.mother.has_a_living_ancestor || x.father.is_alive || x.father.has_a_living_ancestor. So, for most people (but not everyone), this will be set to TRUE at birth.
The challenge is to identify occasions when has_a_living_ancestor can be set to FALSE. Each time a person is born, we do a DFS up through the ancestors, but only those ancestors for which ancestor.has_a_living_ancestor || ancestor.is_alive is true.
During that DFS, if we find an ancestor that has no living ancestors, and is now dead, then we can set has_a_living_ancestor to FALSE. This does mean, I think, that sometimes has_a_living_ancestor will be out of date, but it will hopefully be caught quickly.
The following is an O(n log n) algorithm that works for graphs in which each child has at most one parent (EDIT: this algorithm does not extend to the two-parent case with O(n log n) performance). It is worth noting that I believe the performance can be improved to O(n log(max level label)) with extra work.
One parent case:
For each node x, in reverse topological order, create a binary search tree T_x that is strictly increasing both in date of birth and in number of generations removed from x. (T_x contains the first born child c1 in the subgraph of the ancestry graph rooted at x, along with the next earliest born child c2 in this subgraph such that c2's 'great grandparent level' is strictly greater than that of c1, along with the next earliest born child c3 in this subgraph such that c3's level is strictly greater than that of c2, etc.) To create T_x, we merge the previously-constructed trees T_w where w is a child of x (they are previously-constructed because we are iterating in reverse topological order).
If we are careful with how we perform the merges, we can show that the total cost of such merges is O(n log n) for the entire ancestry graph. The key idea is to note that after each merge, at most one node of each level survives in the merged tree. We associate with each tree T_w a potential of h(w) log n, where h(w) is equal to the length of the longest path from w to a leaf.
When we merge the child trees T_w to create T_x, we 'destroy' all of the trees T_w, releasing all of the potential that they store for use in building the tree T_x; and we create a new tree T_x with (log n)(h(x)) potential. Thus, our goal is to spend at most O((log n)(sum_w(h(w)) - h(x) + constant)) time to create T_x from the trees T_w so that the amortized cost of the merge will be only O(log n). This can be achieved by choosing the tree T_w such that h(w) is maximal as a starting point for T_x and then modifying T_w to create T_x. After such a choice is made for T_x, we merge each of the other trees, one by one, into T_x with an algorithm that is similar to the standard algorithm for merging two binary search trees.
Essentially, the merging is accomplished by iterating over each node y in T_w, searching for y's predecessor z by birth date, and then inserting y into T_x if it is more levels removed from x than z; then, if z was inserted into T_x, we search for the node in T_x of the lowest level that is strictly greater than z's level, and splice out the intervening nodes to maintain the invariant that T_x is ordered strictly both by birth date and level. This costs O(log n) for each node in T_w, and there are at most O(h(w)) nodes in T_w, so the total cost of merging all trees is O((log n)(sum_w(h(w))), summing over all children w except for the child w' such that h(w') is maximal.
We store the level associated with each element of T_x in an auxiliary field of each node in the tree. We need this value so that we can figure out the actual level of x once we've constructed T_x. (As a technical detail, we actually store the difference of each node's level with that of its parent in T_x so that we can quickly increment the values for all nodes in the tree. This is a standard BST trick.)
That's it. We simply note that the initial potential is 0 and the final potential is positive so the sum of the amortized bounds is an upper bound on the total cost of all merges across the entire tree. We find the label of each node x once we create the BST T_x by binary searching for the latest element in T_x that was born before x died at cost O(log n).
To improve the bound to O(n log(max level label)), you can lazily merge the trees, only merging the first few elements of the tree as necessary to provide the solution for the current node. If you use a BST that exploits locality of reference, such as a splay tree, then you can achieve the above bound.
Hopefully, the above algorithm and analysis is at least clear enough to follow. Just comment if you need any clarification.
I have a hunch that obtaining for each person a mapping (generation -> date the first descendant in that generation is born) would help.
Since the dates must be strictly increasing, we would be able to use binary search (or a neat data structure) to find the most distant living descendant in O(log n) time.
The problem is that merging these lists (at least naively) is O(number of generations) so this could get to be O(n^2) in the worst case (consider A and B are parents of C and D, who are parents of E and F...).
I still have to work out how the best case works and try to identify the worst cases better (and see if there is a workaround for them)
We recently implemented a relationship module in one of our projects, in which we had everything in a database, and I think the algorithm was at best 2·n·O(m) (m is the max branch factor). I multiplied the operations by two times N because in the first round we create the relationship graph and in the second round we visit every Person. We store a bidirectional relationship between every two nodes. While navigating, we only travel in one direction, but we have two sets of operations: one traverses only children, the other traverses only parents.
Person {
    String Name;
    // all relations where this is FromPerson
    Relation[] FromRelations;
    // all relations where this is ToPerson
    Relation[] ToRelations;
    DateTime birthDate;
    DateTime? deathDate;
}

Relation {
    Person FromPerson;
    Person ToPerson;
    RelationType Type;
}

enum RelationType {
    Father,
    Son,
    Daughter,
    Mother
}
This kind of looks like a bidirectional graph. But in this case, first you build the list of all Persons, and then you can build the list of relations and set up FromRelations and ToRelations between each node. Then all you have to do is, for every Person, navigate only the ToRelations of type (Son, Daughter). And since you have the dates, you can calculate everything.
I don't have time to check the correctness of the code, but this will give you an idea of how to do it.
void LabelPerson(Person p){
    int n = GetLevelOfChildren(p, p.birthDate, p.deathDate);
    // label based on n...
}

int GetLevelOfChildren(Person p, DateTime bd, DateTime? ed){
    List<int> depths = new List<int>();
    foreach(Relation r in p.ToRelations.Where(
        x => x.Type == Son || x.Type == Daughter))
    {
        Person child = r.ToPerson;
        // only count children born before the original person's death
        // (or all of them if there is no death date)
        if(ed == null || child.birthDate <= ed.Value){
            depths.Add(1 + GetLevelOfChildren(child, bd, ed));
        }
    }
    if(depths.Count == 0)
        return 0;
    return depths.Max();
}
Here's my stab:
class Person
{
    Person[] Parents;
    string Name;
    DateTime DOB;
    DateTime DOD;
    int Generations = 0;

    void Increase(DateTime dob, int generations)
    {
        // current person was alive when the caller was born
        if (dob < DOD)
            Generations = Math.Max(Generations, generations);
        foreach (Person p in Parents)
            p.Increase(dob, generations + 1);
    }

    void Calculate()
    {
        foreach (Person p in Parents)
            p.Increase(DOB, 1);
    }
}

// run for everyone
Person[] people = InitializeList(); // create objects from information
foreach (Person p in people)
    p.Calculate();
There's a relatively straightforward O(n log n) algorithm that sweeps the events chronologically with the help of a suitable top tree.
You really shouldn't assign homework that you can't solve yourself.

Finding the width of a directed acyclic graph... with only the ability to find parents

I'm trying to find the width of a directed acyclic graph... as represented by an arbitrarily ordered list of nodes, without even an adjacency list.
The graph/list is for a parallel GNU Make-like workflow manager that uses files as its criteria for execution order. Each node has a list of source files and target files. We have a hash table in place so that, given a file name, the node which produces it can be determined. In this way, we can figure out a node's parents by examining the nodes which generate each of its source files using this table.
That is the ONLY ability I have at this point, without changing the code severely. The code has been in public use for a while, and the last thing we want to do is to change the structure significantly and have a bad release. And no, we don't have time to test rigorously (I am in an academic environment). Ideally we're hoping we can do this without doing anything more dangerous than adding fields to the node.
I'll be posting a community-wiki answer outlining my current approach and its flaws. If anyone wants to edit that, or use it as a starting point, feel free. If there's anything I can do to clarify things, I can answer questions or post code if needed.
Thanks!
EDIT: For anyone who cares, this will be in C. Yes, I know my pseudocode is in some horribly botched Python look-alike. I'm sort of hoping the language doesn't really matter.
I think the "width" you're considering here isn't really what you want - the width depends on how you assign levels to each node where you have some choice. You noticed this when you were deciding whether to assign all sources to level 0 or all sinks to the max level.
Instead, you just want to count the number of nodes and divide by the "critical path length", which is the longest path in the dag. This gives the average parallelism for the graph. It depends only on the graph itself, and it still gives you an indication of how wide the graph is.
To compute the critical path length, just do what you're doing - the critical path length is the maximum level you end up assigning.
In my opinion, when you're doing this type of last-minute development, it's best to keep the new structures separate from the ones you are already using. At this point, if I were pressed for time I would go for a simpler solution.
Create an adjacency matrix for the graph using the parent data (should be easy)
Perform a topological sort using this matrix. (or even use tsort if pressed for time)
Now that you have a topological sort, create an array level, one element for each node.
For each node:
If the node has no parents set its level to 0
Otherwise set it to the maximum of its parents' levels + 1.
Find the maximum level width.
The question is as Keith Randall asked, is this the right measurement you need?
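A hedged Python sketch of those steps, using adjacency lists built from the parent lookups rather than a matrix; dag_width and parents_of are made-up names, and the level rule is max over parents as discussed above:

from collections import defaultdict, deque

def dag_width(nodes, parents_of):
    # build child lists and in-degrees from the parent relation
    children = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for p in parents_of(n):
            children[p].append(n)
            indegree[n] += 1

    # topological sort (Kahn), assigning level = 1 + max(level of parents)
    level = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)
    while queue:
        n = queue.popleft()
        for c in children[n]:
            level[c] = max(level[c], level[n] + 1)
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)

    # the width is the size of the most populated level
    per_level = defaultdict(int)
    for n in nodes:
        per_level[level[n]] += 1
    return max(per_level.values())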
Here's what I (Platinum Azure, the original author) have so far.
Preparations/augmentations:
Add "children" field to linked list ("DAG") node
Add "level" field to "DAG" node
Add "children_left" field to "DAG" node. This is used to make sure that all children are examined before a parent is examined (in a later stage of the algorithm).
Algorithm:
Find the number of immediate children for all nodes; also, determine leaves by adding nodes with children==0 to list.
for l in L:
    l.children = 0
for l in L:
    l.level = 0
    for p in l.parents:
        ++p.children
Leaves = []
for l in L:
    l.children_left = l.children
    if l.children == 0:
        Leaves.append(l)
Assign every node a "reverse depth" level. Normally by depth, I mean topologically sort and assign depth=0 to nodes with no parents. However, I'm thinking I need to reverse this, with depth=0 corresponding to leaves. Also, we want to make sure that no node is added to the queue without all its children "looking at it" first (to determine its proper "depth level").
max_level = 0
while !Leaves.empty():
    l = Leaves.pop()
    for p in l.parents:
        --p.children_left
        if p.children_left == 0:
            /* we only want to append parents with for-sure correct levels */
            Leaves.append(p)
        p.level = Max(p.level, l.level + 1)
        if p.level > max_level:
            max_level = p.level
Now that every node has a level, simply create an array and then go through the list once more to count the number of nodes in each level.
level_count = new int[max_level+1]
for l in L:
    ++level_count[l.level]
width = Max(level_count)
So that's what I'm thinking so far. Is there a way to improve on it? It's linear time all the way, but it's got like five or six linear scans and there will probably be a lot of cache misses and the like. I have to wonder if there isn't a way to exploit some locality with a better data structure-- without actually changing the underlying code beyond node augmentation.
Any thoughts?
