How to find genetic connection between two nodes

How to find genetic connection between two nodes - algorithm

One feature in our current project is to find out relationship of two nodes. One node can be generated by two other nodes together. To put it simple, I take them as a family tree, as below:
A B
/ \ |
C D E
| \ /
F G
I'm going to write a function to decide if two nodes are genetic connected, for example:
is_genetic_connected("D", "F"); // returns true because they have common ancestor: "A"
is_genetic_connected("E", "F"); // returns false
I'm not sure if we can apply LCA here, or is there any other good algorithm for such problem?

Related

Longest common path between k graphs

I was looking at interview problems and come across this one, failed to find a liable solution.
Actual question was asked on Leetcode discussion.
Given multiple school children and the paths they took from their school to their homes, find the longest most common path (paths are given in order of steps a child takes).
Example:
child1 : a -> g -> c -> b -> e
child2 : f -> g -> c -> b -> u
child3 : h -> g -> c -> b -> x
result = g -> c -> b
Note: There could be multiple children.The input was in the form of steps and childID. For example input looked like this:
(child1, a)
(child2, f)
(child1, g)
(child3, h)
(child1, c)
...
Some suggested longest common substring can work but it will not example -
1 a-b-c-d-e-f-g
2 a-b-c-x-y-f-g
3 m-n-o-p-f-g
4 m-x-o-p-f-g
1 and 2 will give abc, 3 and 4 give pfg
now ans will be none but ans is fg
it's like graph problem, how can we find longest common path between k graphs ?

You can construct a directed graph g with an edge a->b present if and only if it is present in all individual paths, then drop all nodes with degree zero.
The graph g will have have no cycles. If it did, the same cycle would be present in all individual paths, and a path has no cycles by definition.
In addition, all in-degrees and out-degrees will be zero or one. For example, if a node a had in-degree greater than one, there would be two edges representing two students arriving at a from two different nodes. Such edges cannot appear in g by construction.
The graph will look like a disconnected collection of paths. There may be multiple paths with maximum length, or there may be none (an empty path if you like).
In the Python code below, I find all common paths and return one with maximum length. I believe the whole procedure is linear in the number of input edges.
import networkx as nx
path_data = """1 a-b-c-d-e-f-g
2 a-b-c-x-y-f-g
3 m-n-o-p-f-g
4 m-x-o-p-f-g"""
paths = [line.split(" ")[1].split("-") for line in path_data.split("\n")]
num_paths = len(paths)
# graph h will include all input edges
# edge weight corresponds to the number of students
# traversing that edge
h = nx.DiGraph()
for path in paths:
for (i, j) in zip(path, path[1:]):
if h.has_edge(i, j):
h[i][j]["weight"] += 1
else:
h.add_edge(i, j, weight=1)
# graph g will only contain edges traversed by all students
g = nx.DiGraph()
g.add_edges_from((i, j) for i, j in h.edges if h[i][j]["weight"] == num_paths)
def longest_path(g):
# assumes g is a disjoint collection of paths
all_paths = list()
for node in g.nodes:
path = list()
if g.in_degree[node] == 0:
while True:
path.append(node)
try:
node = next(iter(g[node]))
except:
break
all_paths.append(path)
if not all_paths:
# handle the "empty path" case
return []
return max(all_paths, key=len)
print(longest_path(g))
# ['f', 'g']

Approach 1: With Graph construction
Consider this example:
1 a-b-c-d-e-f-g
2 a-b-c-x-y-f-g
3 m-n-o-p-f-g
4 m-x-o-p-f-g
Draw a directed weighted graph.
I am a lazy person. So, I have not drawn the direction arrows but believe they are invisibly there. Edge weight is 1 if not marked on the arrow.
Give the length of longest chain with each edge in the chain having Maximum Edge Weight MEW.
MEW is 4, our answer is FG.
Say AB & BC had edge weight 4, then ABC should be the answer.
The below example, which is the case of MEW < #children, should output ABC.
1 a-b-c-d-e-f-g
2 a-b-c-x-y-f-g
3 m-n-o-p-f-h
4 m-x-o-p-f-i
If some kid is like me, the kid will keep roaming multiple places before reaching home. In such cases, you might see MEW > #children and the solution would become complicated. I hope all the children in our input are obedient and they go straight from school to home.
Approach 2: Without Graph construction
If luckily the problem mentions that the longest common piece of path should be present in the paths of all the children i.e. strictly MEW == #children then you can solve by easier way. Below picture should give you clue on what to do.
Take the below example
1 a-b-c-d-e-f-g
2 a-b-c-x-y-f-g
3 m-n-o-p-f-g
4 m-x-o-p-f-g
Method 1:
Get longest common graph for first two: a-b-c, f-g (Result 1)
Get longest common graph for last two: p-f-g (Result 2)
Using Result 1 & 2 we get: f-g (Final Result)
Method 2:
Get longest common graph for first two: a-b-c, f-g (Result 1)
Take Result 1 and next graph i.e. m-n-o-p-f-g: f-g (Result 2)
Take Result 2 and next graph i.e. m-x-o-p-f-g: f-g (Final Result)
The beauty of the approach without graph construction is that even if kids roam same pieces of paths multiple times, we get the right solution.
If you go a step ahead, you could combine the approaches and use approach 1 as a sub-routine in approach 2.

Set Simplification

Say I have two sets, set1 = {a,b,c,d,e,f} and set2 = {a,b,c,d,e,g}. Rather than expressing these explicitly, I want to create something like
common = {a,b,c,d,e}
set1 = common + f
set2 = common + g
If we wanted to represent {a,b,c,h}, we could represent it as common - d - e + h.
My goal is basically to be able to generate the optimal common portion to be used. With only one common section this isn't too challenging, but I need to allow more than one (but not unlimited, or the benefits gained would be trivial).
By optimal, I mean "least number of elements expressed". So in the above example, it "costs" 5 (number of elements) to make the common variable. Then sets 1 and 2 both cost 2 (one to reference common, one to add the extra element), totalling 7. Without the substitution, these would cost 12 to store (6 elements each). Similarly, in subtracting an element from a referenced would "cost" 1.
Another example,
{a,b,c,d}, {a,c,d,e}, {e,f,g,h} and {e,f}
could be
common1 = {a,c,d}
common2 = {e,f,g}
set1 = common1 + b
set2 = common1 + e
set3 = common2 + h
set4 = common2 - g
By allowing multiple common portions this becomes a lot more challenging. Is there a name for this type of problem, or something similar? It seems like it could be related to compression, but I haven't been able to find too many resources on where to start with this.
Some other details that may be relevent:
Being allowed to reference multiple common portions to represent one set can be valid, but isn't required.
For my use case, the sets will typically be around 20 elements and around 10 different sets.

You could find all atomic sets, that is all sets that are never not seen apart.
{a,b,c,d,e,f,g,h} | {a,b,c,d} = {a,b,c,d},{e,f,g,h}
{a,b,c,d},{e,f,g,h} | {a,c,d,e} = {a,b,c,d},{e,f,g,h}
{a,c,d},{b},{e},{f,g,h} | {e,f,g,h} = {a,c,d},{b},{e},{f,g,h}
{a,c,d},{b},{e},{f,g,h} | {e,f} = {a,c,d},{b},{e},{f},{g,h}
{a,b,c,d} = {a,c,d},{b}
{a,c,d,e} = {a,c,d},{e}
{e,f,g,h} = {e},{f},{g,h}
{e,f} = {e},{f}
This is a little closer but it doesnt solve the minimal breakdown.
I dont think you can find the minimal because i suspect that it is NP-Hard. If you consider a set S and create a graph where each possible subset of S is a node G. Now give a node weight according to the length of the subset, and draw an edge between each node that corresponds to the amount of change. {abc} -> {a} has a weight of 2. {bcd} -> {abe} has a weight of 4. Now to find a minimal solution to the common set problem you need to find a minimal weight spanning tree that covers each of the sets you are interested in. If you find that you can use this to build a minimal common set -- these would be equivelent. Finding minimum weight tree in a node weighted graph is called the Node-Weighed Steiner Tree Problem. A Node weighted Steiner Tree Problem can be shown equivalent to the Steiner Tree Problem. The Steiner Tree problem can be show to be NP-Hard. So I strongly suspect the problem you are trying to solve is NP-Hard.
http://theory.cs.uni-bonn.de/info5/steinerkompendium/node15.html
http://theory.cs.uni-bonn.de/info5/steinerkompendium/node17.html

minimum weight vertex cover of a tree

There's an existing question dealing with trees where the weight of a vertex is its degree, but I'm interested in the case where the vertices can have arbitrary weights.
This isn't homework but it is one of the questions in the algorithm design manual, which I'm currently reading; an answer set gives the solution as
Perform a DFS, at each step update Score[v][include], where v is a vertex and include is either true or false;
If v is a leaf, set Score[v][false] = 0, Score[v][true] = wv, where wv is the weight of vertex v.
During DFS, when moving up from the last child of the node v, update Score[v][include]:
Score[v][false] = Sum for c in children(v) of Score[c][true] and Score[v][true] = wv + Sum for c in children(v) of min(Score[c][true]; Score[c][false])
Extract actual cover by backtracking Score.
However, I can't actually translate that into something that works. (In response to the comment: what I've tried so far is drawing some smallish graphs with weights and running through the algorithm on paper, up until step four, where the "extract actual cover" part is not transparent.)
In response Ali's answer: So suppose I have this graph, with the vertices given by A etc. and the weights in parens after:
A(9)---B(3)---C(2)
\ \
E(1) D(4)
The right answer is clearly {B,E}.
Going through this algorithm, we'd set values like so:
score[D][false] = 0; score[D][true] = 4
score[C][false] = 0; score[C][true] = 2
score[B][false] = 6; score[B][true] = 3
score[E][false] = 0; score[E][true] = 1
score[A][false] = 4; score[A][true] = 12
Ok, so, my question is basically, now what? Doing the simple thing and iterating through the score vector and deciding what's cheapest locally doesn't work; you only end up including B. Deciding based on the parent and alternating also doesn't work: consider the case where the weight of E is 1000; now the correct answer is {A,B}, and they're adjacent. Perhaps it is not supposed to be confusing, but frankly, I'm confused.

There's no actual backtracking done (or needed). The solution uses dynamic programming to avoid backtracking, since that'd take exponential time. My guess is "backtracking Score" means the Score contains the partial results you would get by doing backtracking.
The cover vertex of a tree allows to include alternated and adjacent vertices. It does not allow to exclude two adjacent vertices, because it must contain all of the edges.
The answer is given in the way the Score is recursively calculated. The cost of not including a vertex, is the cost of including its children. However, the cost of including a vertex is whatever is less costly, the cost of including its children or not including them, because both things are allowed.
As your solution suggests, it can be done with DFS in post-order, in a single pass. The trick is to include a vertex if the Score says it must be included, and include its children if it must be excluded, otherwise we'd be excluding two adjacent vertices.
Here's some pseudocode:
find_cover_vertex_of_minimum_weight(v)
find_cover_vertex_of_minimum_weight(left children of v)
find_cover_vertex_of_minimum_weight(right children of v)
Score[v][false] = Sum for c in children(v) of Score[c][true]
Score[v][true] = v weight + Sum for c in children(v) of min(Score[c][true]; Score[c][false])
if Score[v][true] < Score[v][false] then
add v to cover vertex tree
else
for c in children(v)
add c to cover vertex tree

It actually didnt mean any thing confusing and it is just Dynamic Programming, you seems to almost understand all the algorithm. If I want to make it any more clear, I have to say:
first preform DFS on you graph and find leafs.
for every leaf assign values as the algorithm says.
now start from leafs and assign values to each leaf parent by that formula.
start assigning values to parent of nodes that already have values until you reach the root of your graph.
That is just it, by backtracking in your algorithm it means that you assign value to each node that its child already have values. As I said above this kind of solving problem is called dynamic programming.
Edit just for explaining your changes in the question. As you you have the following graph and answer is clearly B,E but you though this algorithm just give you B and you are incorrect this algorithm give you B and E.
A(9)---B(3)---C(2)
\ \
E(1) D(4)
score[D][false] = 0; score[D][true] = 4
score[C][false] = 0; score[C][true] = 2
score[B][false] = 6 this means we use C and D; score[B][true] = 3 this means we use B
score[E][false] = 0; score[E][true] = 1
score[A][false] = 4 This means we use B and E; score[A][true] = 12 this means we use B and A.
and you select 4 so you must use B and E. if it was just B your answer would be 3. but as you find it correctly your answer is 4 = 3 + 1 = B + E.
Also when E = 1000
A(9)---B(3)---C(2)
\ \
E(1000) D(4)
it is 100% correct that the answer is B and A because it is wrong to use E just because you dont want to select adjacent nodes. with this algorithm you will find the answer is A and B and just by checking you can find it too. suppose this covers :
C D A = 15
C D E = 1006
A B = 12
Although the first two answer have no adjacent nodes but they are bigger than last answer that have adjacent nodes. so it is best to use A and B for cover.

Finding the minimum number of calls on a tree

I was asked this question in an interview and struggled to answer it correctly in the time allotted. Nonetheless, I thought it was an interesting problem, and I hadn't seen it before.
Suppose you have a tree where the root can call (on the phone) each of it's children, when a child receives the call, he can call each of his children, etc. The problem is that each call must be done in a number of rounds, and we need to minimize the number of rounds it takes to make the calls. For example, suppose you have the following tree:
A
/ \
/ \
B D
|
|
C
One solution is for A to call D in round one, A to call B in round two, and B to call C in round three. The optimal solution is for A to call B in round one, and A to call D and B to call C in round two.
Note that A cannot call both B and D in the same round, nor can any node call more than one of its children in the same round. However, multiple nodes with a different parent can call simultaneously. For example, given the tree:
A
/ | \
/ | \
B C D
/\ |
/ \ |
E F G
We can have a sequence (where - separates rounds), such as:
A B - B E, A D - B F, A C, D G
(A calls B first round, B calls E and A calls D second, ...)
I'm assuming some type of dynamic programming can be used, but I'm not sure which direction to take this in. My initial inclination is to use DFS to order the longest path from the root to leaves in decreasing order, but when it comes to the nodes actually making calls, I'm not sure how we can achieve optimality given any tree, not how we can output the paths that the optimal calls would make (i.e. in the first example we could output
A B - B C, A D

I think something like this could get the optimal solution:
suppose the value of 'calls' for each of leaves is 1
for each node get the value of calls for all of his children and rank them according to their 'calls' value
consider rank of each child as 'ranks'
to compute the value of 'calls' for each node loop over his children (after computing their ranks) and find the maximum value of 'calls' + 'ranks'
'calls' value of the root node is the answer
It's sorta dynamic programming on trees and you can implement it recursively like this:
int f(node v)
{
int s = 0;
for each u in v.children
{
d[u] = f(u)
}
sort d and rank its values in r (r for the maximum u would be 1)
for each u in v.children
{
s = max(s, d[u] + r[u] + 1)
}
return s
}
Good Luck!

Developing an Algorithm for Tree Mutations

I have to come up with an efficient algorithm that takes a tree in this format:
?
/ \
? ?
/ \ / \
G A A A
and fills in the question mark nodes with the values that provide the least amount of mutations. The values can only be {A, C, T, G}. The tree will always have this same shape and amount of nodes. Also, it will always have the leaf nodes filled in and the remaining nodes will be question marks that need to be filled.
For instance, the tree on the right is correct and has less mutations than the one on the left.
A A
/ \ / \
G G A A
/ \ / \ / \ / \
G A A A G A A A
A mutation occurs when a parent node differs from a child node. So, the above left tree contains five mutations and the above right has one.
Can someone help me out by providing psuedocode? Thanks.

This looks like dynamic programming from the bottom of the tree up. For each node you want to work out the least cost solution that leaves that node marked A, C, T, or G, for each of these possibilities. You work this out by using previously calculated costs for each possibility for the nodes immediately below that node. The code just to work out the cost might be a bit like this.
LeastCost(node, colourHere)
{
foreach colour
leastLeft[colour] = LeastCost(leftChild, colour)
leastRight[colour] = LeastCost(rightChild, colour)
best = infinity
foreach combination
cost = leastLeft[combination.leftColour] +
leastRight[combination.rightColour]
if (combination.leftColour != colourHere)
cost++;
if (combination.rightColour != colourHere)
cost++;
if (cost < best)
cost = best;
return cost
}
To return the best answer as well as the best cost you need to keep track of the combination corresponding to the best answer as well. Come to think about it, you can save time by working out the answers for all four colours at each node at the same time.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio