I have a set of students (referred to as items in the title for generality). Amongst these students, some have a reputation for being rambunctious. We are told about a set of hate relationships of the form 'i hates j'. 'i hates j' does not imply 'j hates i'. We are supposed to arrange the students in rows (front most row numbered 1) in a way such that if 'i hates j' then i should be put in a row that is strictly lesser numbered than that of j (in other words: in some row that is in front of j's row) so that i doesn't throw anything at j (Turning back is not allowed). What would be an efficient algorithm to find the minimum number of rows needed (each row need not have the same number of students)?
We will make the following assumptions:
1) If we model this as a directed graph, there are no cycles in the graph. The most basic cycle would be: if 'i hates j' is true, 'j hates i' is false. Because otherwise, I think the ordering would become impossible.
2) Every student in the group is at least hated by one other student OR at least hates one other student. Of course, there would be students who are both hated by some and who in turn hate other students. This means that there are no stray students who don't form part of the graph.
Update: I have already thought of constructing a directed graph with i --> j if 'i hates j and doing topological sorting. However, since the general topological sort would suit better if I had to line all the students in a single line. Since there is a variation of the rows here, I am trying to figure out how to factor in the change into topological sort so it gives me what I want.
When you answer, please state the complexity of your solution. If anybody is giving code and you don't mind the language, then I'd prefer Java but of course any other language is just as fine.
JFYI This is not for any kind of homework (I am not a student btw :)).
It sounds to me that you need to investigate topological sorting.
This problem is basically another way to put the longest path in a directed graph problem. The number of rows is actually number of nodes in path (number of edges + 1).
Assuming the graph is acyclic, the solution is topological sort.
Acyclic is a bit stronger the your assumption 1. Not only A -> B and B -> A is invalid. Also A -> B, B -> C, C -> A and any cycle of any length.
HINT: the question is how many rows are needed, not which student in which row. The answer to the question is the length of the longest path.
It's from a project management theory (or scheduling theory, I don't know the exact term). There the task is about sorting jobs (vertex is a job, arc is a job order relationship).
Obviously we have some connected oriented graph without loops. There is an arc from vertex a to vertex b if and only if a hates b. Let's assume there is a source (without incoming arcs) and destination (without outgoing arcs) vertex. If that is not the case, just add imaginary ones. Now we want to find length of a longest path from source to destination (it will be number of rows - 1, but mind the imaginary verteces).
We will define vertex rank (r[v]) as number of arcs in a longest path between source and this vertex v. Obviously we want to know r[destination]. Algorithm for finding rank:
0) r_0[v] := 0 for all verteces v
repeat
t) r_t[end(j)] := max( r_{t-1}[end(j)], r_{t-1}[start(j)] + 1 ) for all arcs j
until for all arcs j r_{t+1}[end(j)] = r_t[end(j)] // i.e. no changes on this iteration
On each step at least one vertex increases its rank. Therefore in this form complexity is O(n^3).
By the way, this algorithm also gives you student distribution among rows. Just group students by their respective ranks.
Edit: Another code with the same idea. Possibly it is better understandable.
# Python
# V is a list of vertex indices, let it be something like V = range(N)
# source has index 0, destination has index N-1
# E is a list of edges, i.e. tuples of the form (start vertex, end vertex)
R = [0] * len(V)
do:
changes = False
for e in E:
if R[e[1]] < R[e[0]] + 1:
changes = True
R[e[1]] = R[e[0]] + 1
while changes
# The answer is derived from value of R[N-1]
Of course this is the simplest implementation. It can be optimized, and time estimate can be better.
Edit2: obvious optimization - update only verteces adjacent to those that were updated on the previous step. I.e. introduce a queue with verteces whose rank was updated. Also for edge storing one should use adjacency lists. With such optimization complexity would be O(N^2). Indeed, each vertex may appear in the queue at most rank times. But vertex rank never exceeds N - number of verteces. Therefore total number of algorithm steps will not exceed O(N^2).
Essentailly the important thing in assumption #1 is that there must not be any cycles in this graph. If there are any cycles you can't solve this problem.
I would start by seating all of the students that do not hate any other students in the back row. Then you can seat the students who hate these students in the next row and etc.
The number of rows is the length of the longest path in the directed graph, plus one. As a limit case, if there is no hate relationship everyone can fit on the same row.
To allocate the rows, put everyone who is not hated by anyone else on the row one. These are the "roots" of your graph. Everyone else is put on row N + 1 if N is the length of the longest path from any of the roots to that person (this path is of length one at least).
A simple O(N^3) algorithm is the following:
S = set of students
for s in S: s.row = -1 # initialize row field
rownum = 0 # start from first row below
flag = true # when to finish
while (flag):
rownum = rownum + 1 # proceed to next row
flag = false
for s in S:
if (s.row != -1) continue # already allocated
ok = true
foreach q in S:
# Check if there is student q who will sit
# on this or later row who hates s
if ((q.row == -1 or q.row = rownum)
and s hated by q) ok = false; break
if (ok): # can put s here
s.row = rownum
flag = true
Simple answer = 1 row.
Put all students in the same row.
Actually that might not solve the question as stated - lesser row, rather than equal row...
Put all students in row 1
For each hate relation, put the not-hating student in a row behind the hating student
Iterate till you have no activity, or iterate Num(relation) times.
But I'm sure there are better algorithms - look at acyclic graphs.
Construct a relationship graph where i hates j will have a directed edge from i to j. So end result is a directed graph. It should be a DAG otherwise no solutions as it's not possible to resolve circular hate relations ship.
Now simply do a DFS search and during the post node callbacks, means the once the DFS of all the children are done and before returning from the DFS call to this node, simply check the row number of all the children and assign the row number of this node as row max row of the child + 1. Incase if there is some one who doesn't hate anyone basically node with no adjacency list simply assign him row 0.
Once all the nodes are processed reverse the row numbers. This should be easy as this is just about finding the max and assigning the row numbers as max-already assigned row numbers.
Here is the sample code.
postNodeCb( graph g, int node )
{
if ( /* No adj list */ )
row[ node ] = 0;
else
row[ node ] = max( row number of all children ) + 1;
}
main()
{
.
.
for ( int i = 0; i < NUM_VER; i++ )
if ( !visited[ i ] )
graphTraverseDfs( g, i );`enter code here`
.
.
}
Related
Write a function in main.cpp, which creates a random graph of a certain size as follows. The function takes two parameters. The first parameter is the number of vertices n. The second parameter p (1 >= p >= 0) is the probability that an edge exists between a pair of nodes. In particular, after instantiating a graph with n vertices and 0 edges, go over all possible vertex pairs one by one, and for each such pair, put an edge between the vertices with probability p.
How to know if an edge exists between two vertices .
Here is the full question
PS: I don't need the code implementation
The problem statement clearly says that the first input parameter is the number of nodes and the second parameter is the probability p that an edge exists between any 2 nodes.
What you need to do is as follows (Updated to amend a mistake that was pointed out by #user17732522):
1- Create a bool matrix (2d nested array) of size n*n initialized with false.
2- Run a loop over the rows:
- Run an inner loop over the columns:
- if row_index != col_index do:
- curr_p = random() // random() returns a number between 0 and 1 inclusive
- if curr_p <= p: set matrix[row_index][col_index] = true
else: set matrix[row_index][col_index] = false
- For an undirected graph, also set matrix[col_index][row_index] = true/false based on curr_p
Note: Since we are setting both cells (both directions) in the matrix in case of a probability hit, we could potentially set an edge 2 times. This doesn't corrupt the correctnes of the probability and isn't much additional work. It helps to keep the code clean.
If you want to optimize this solution, you could run the loop such that you only visit the lower-left triangle (excluding the diagonal) and just mirror the results you get for those cells to the upper-right triangle.
That's it.
Your friends are planning an expedition to a small town deep in the Canadian north
next winter break. They’ve researched all the travel options and have drawn up a directed
graph whose nodes represent intermediat destinations and edges represent the reoads betweeen
them.
In the course of this, they’ve also learned that extreme weather causes roads in this part of
the world to become quite slow in the winter and may cause large travel delays. They’ve
found an excellent travel Web site that can accurately predict how fast they’ll be able to
travel along the roads; however, the speed of travel depends on the time of the year. More
precisely, the Web site answers queries of the following form: given an edge e = (u, v)
connecting two sites u and v, and given a proposed starting time t from location u, the
site will return a value fe(t), the predicted arrival time at v. The web site guarantees that
1
fe(t) > t for every edge e and every time t (you can’t travel backward in time), and that
fe(t) is a monotone increasing function of t (that is, you do not arrive earlier by starting
later). Other than that, the functions fe may be arbitrary. For example, in areas where the
travel time does not vary with the season, we would have fe(t) = t + e, wheree is the
time needed to travel from the beginning to the end of the edge e.
Your friends want to use the Web site to determine the fastest way to travel through the
directed graph from their starting point to their intended destination. (You should assume
that they start at time 0 and that all predictions made by the Web site are completely
correct.) Give a polynomial-time algorithm to do this, where we treat a single query to
the Web site (based on a specific edge e and a time t) as taking a single computational step.
def updatepath(node):
randomvalue = random.randint(0,3)
print(node,"to other node:",randomvalue)
for i in range(0,n):
distance[node][i] = distance[node][i] + randomvalue
def minDistance(dist,flag_array,n):
min_value = math.inf
for i in range(0,n):
if dist[i] < min_value and flag_array[i] == False:
min_value = dist[i]
min_index = i
return min_index
def shortest_path(graph, src,n):
dist = [math.inf] * n
flag_array = [False] * n
dist[src] = 0
for cout in range(n):
#find the node index that have min cost
u = minDistance(dist,flag_array,n)
flag_array[u] = True
updatepath(u)
for i in range(n):
if graph[u][i] > 0 and flag_array[i]==False and dist[i] > dist[u] + graph[u][i]:
dist[i] = dist[u] + graph[u][i]
path[i] = u
return dist
I applied Dijkstra algorithm but it is not correct ? What would i change in my algorithm to work it for dynamic changing edge.
Well, Key points are that function is monotonically increasing. There is an algorithm which exploits this property and it is called A*.
Accumulated cost: Your prof wants you to use two distances one is accumulated cost(this is simple the cost from previous added to the cost/time needed to move to the next node).
Heuristic cost: This is some predicted cost.
Disjkstra approach would not work because you are working with heuristic cost/predicted and accumulated cost.
Monotonically increasing means h(A) <= h(A) + f(A..B).It simply says that if you move from node A to node B then the cost should not be less than the previous node (in this case A) and this is heuristic + accumulated. If this property holds then the first path which A* chooses is always the path to goal and it never needs to backtrack.
Note: The power of this algorithm is totally base on how you predict value.
If you underestimate the value that will be corrected with accumulated value but if you overestimate the value it will chose wrong path.
Algorithm:
Create a Min Priority queue.
insert initial city in q.
while(!pq.isEmpty() && !Goalfound)
Node min = pq.delMin() //this should return you a cities to which your
distance(heuristic+accumulated is minial).
put all succesors of min in pq // all cities which you can reach, you
can better make a list of visited
cities s that queue will be
efficient by not placing same
element twice.
Keep doing this and at the end you will either reach goal or your queue will be empty
Extra
Here i implemented a 8-puzzle-solve using A*, it can give you an idea about how costs are defined and ho it works.
`
private void solve(MinPQ<Node> pq, HashSet<Node> closedList) {
while(!(pq.min().getBoad().isGoal(pq.min().getBoad()))){
Node e = pq.delMin();
closedList.add(e);
for(Board boards: e.getBoad().neighbors()){
Node nextNode = new Node(boards,e,e.getMoves()+1);
if(!equalToPreviousNode(nextNode,e.getPreviousNode()))
pq.insert(nextNode);
}
}
Node collection = pq.delMin();
while(!(collection.getPreviousNode() == null)){
this.getB().add(collection.getBoad());
collection =collection.getPreviousNode();
}
this.getB().add(collection.getBoad());
System.out.println(pq.size());
}
A link to full code is here.
I was asked on a programming test to solve a problem where I had to find the shortest distance from each node on a rectangular graph to one of a set of possible destinations on the graph. I was able to create a solution that passed all the tests but I'm fairly certain there is a more efficient algorithm out there.
C11--C12--C13--C14
| | | |
FGB--C22--C23--C24
| | | |
C31--C32--C33--C34
| | | |
C41--FGB--C43--C44
| | | |
C51--C52--C53--C54
| | | |
C61--C62--C63--FGB
So for example, on the graph above, say each "FGB" represents a Five Guys (because it's delicious). And each "Cxx" represents a customer. It was essentially, "How far is each customer from the nearest Five Guys?" So C11 is 1 away, C12 is 2 away, etc. All edges are weight=1.
Is Floyd-Warshall what I'm looking for? I don't really care about all pairs.
Any thoughts or reference materials you might point me to? Much appreciated.
There's a simple solution that works in two passes.
The firt pass is forward, row by row. For every node, you evaluate incrementally the distance to the nearest target to the left of above.
D(node)= if target node => 0 else min(D(left neighbor) + 1, D(top neighbor) + 1)
The second pass is backward. The final distance is evaluated as
D(node)= min(D(node), D(right neighbor) + 1, D(bottom neighbor) + 1)
At the same time that you record a new value, you can record the location of the corresponding target.
(When a neighbor does not exist, the distance is infinite by convention.)
For an arbitrary graph size, with an arbitrary number of target nodes (FGBs), the following BFS algorithm will be the most efficient:
nodes = set of all nodes
currentNodes = set of target nodes (FGB)
for each node in nodes:
dist[node] = infinity
for each node in currentNodes:
dist[node] = 0
while currentNodes is not empty:
newNodes := []
for each node in currentNodes:
for each neighbor of node:
if dist[neighbor] > dist[node] + 1 then:
dist[neighbor] := dist[node] + 1
newNodes.add(neighbor)
currentNodes := newNodes
The for each node loop will in total visit every node exactly once.
The for each neighbor loop will iterate at the most 4 times per node, given the graph is a rectangular one.
This means the inner if condition will be executed near 4n times, i.e. this is O(n). Depending on the used data structure the number can be 4n-4√n as the boundary nodes have fewer neighbors, but this is still O(n).
Note that if you have fewer than 4 target nodes (FGBs) it will be faster (though not significant in big-O terms) to use the rectangular properties of the graph to calculate the distances. You could do that with this formula, where m is the number of target nodes (FGBs):
dist[node at [i,j]] := min( abs(i-FGB[0].column)+abs(j-FGB[0].row),
abs(i-FGB[1].column)+abs(j-FGB[1].row),
...
abs(i-FGB[m-1].column)+abs(j-FGB[m-1].row)
)
This has a time complexity of O(n.m), which for a m limited to a certain constant still is O(n) and might be faster than the more generic solution.
An algorithm could make use of this, and depending on the value of m, could choose which method to apply.
In response to #yves-daoust (I know I'm not suppose to respond to other answers with a whole new answer, but this is too long for the "response to answer" dialog box).
So the whole thing would look something like this?
// O(n)
for each node:
set node distance to infinity
// O(m)
for each FGB in FGB coordinate list
directly set corresponding node distance to infinity
// O(n)
for each i in row
for each j in column
if left and up exist
set node(i,j) to min(node(i,j), left+1, up+1)
else if left exists
set node(i,j) to min(node(i,j), left+1)
else if up exists
set node(i,j) to min(node(i,j), up+1)
// O(n)
for each i in row (reverse)
for each j in column (reverse)
if right and down exist
set node(i,j) to min(node(i,j), right+1, down+1)
else if right exists
set node(i,j) to min(node(i,j), right+1)
else if down exists
set node(i,j) to min(node(i,j), down+1)
The whole thing is O(3n+m) = O(n+m). Is that right?
There's an existing question dealing with trees where the weight of a vertex is its degree, but I'm interested in the case where the vertices can have arbitrary weights.
This isn't homework but it is one of the questions in the algorithm design manual, which I'm currently reading; an answer set gives the solution as
Perform a DFS, at each step update Score[v][include], where v is a vertex and include is either true or false;
If v is a leaf, set Score[v][false] = 0, Score[v][true] = wv, where wv is the weight of vertex v.
During DFS, when moving up from the last child of the node v, update Score[v][include]:
Score[v][false] = Sum for c in children(v) of Score[c][true] and Score[v][true] = wv + Sum for c in children(v) of min(Score[c][true]; Score[c][false])
Extract actual cover by backtracking Score.
However, I can't actually translate that into something that works. (In response to the comment: what I've tried so far is drawing some smallish graphs with weights and running through the algorithm on paper, up until step four, where the "extract actual cover" part is not transparent.)
In response Ali's answer: So suppose I have this graph, with the vertices given by A etc. and the weights in parens after:
A(9)---B(3)---C(2)
\ \
E(1) D(4)
The right answer is clearly {B,E}.
Going through this algorithm, we'd set values like so:
score[D][false] = 0; score[D][true] = 4
score[C][false] = 0; score[C][true] = 2
score[B][false] = 6; score[B][true] = 3
score[E][false] = 0; score[E][true] = 1
score[A][false] = 4; score[A][true] = 12
Ok, so, my question is basically, now what? Doing the simple thing and iterating through the score vector and deciding what's cheapest locally doesn't work; you only end up including B. Deciding based on the parent and alternating also doesn't work: consider the case where the weight of E is 1000; now the correct answer is {A,B}, and they're adjacent. Perhaps it is not supposed to be confusing, but frankly, I'm confused.
There's no actual backtracking done (or needed). The solution uses dynamic programming to avoid backtracking, since that'd take exponential time. My guess is "backtracking Score" means the Score contains the partial results you would get by doing backtracking.
The cover vertex of a tree allows to include alternated and adjacent vertices. It does not allow to exclude two adjacent vertices, because it must contain all of the edges.
The answer is given in the way the Score is recursively calculated. The cost of not including a vertex, is the cost of including its children. However, the cost of including a vertex is whatever is less costly, the cost of including its children or not including them, because both things are allowed.
As your solution suggests, it can be done with DFS in post-order, in a single pass. The trick is to include a vertex if the Score says it must be included, and include its children if it must be excluded, otherwise we'd be excluding two adjacent vertices.
Here's some pseudocode:
find_cover_vertex_of_minimum_weight(v)
find_cover_vertex_of_minimum_weight(left children of v)
find_cover_vertex_of_minimum_weight(right children of v)
Score[v][false] = Sum for c in children(v) of Score[c][true]
Score[v][true] = v weight + Sum for c in children(v) of min(Score[c][true]; Score[c][false])
if Score[v][true] < Score[v][false] then
add v to cover vertex tree
else
for c in children(v)
add c to cover vertex tree
It actually didnt mean any thing confusing and it is just Dynamic Programming, you seems to almost understand all the algorithm. If I want to make it any more clear, I have to say:
first preform DFS on you graph and find leafs.
for every leaf assign values as the algorithm says.
now start from leafs and assign values to each leaf parent by that formula.
start assigning values to parent of nodes that already have values until you reach the root of your graph.
That is just it, by backtracking in your algorithm it means that you assign value to each node that its child already have values. As I said above this kind of solving problem is called dynamic programming.
Edit just for explaining your changes in the question. As you you have the following graph and answer is clearly B,E but you though this algorithm just give you B and you are incorrect this algorithm give you B and E.
A(9)---B(3)---C(2)
\ \
E(1) D(4)
score[D][false] = 0; score[D][true] = 4
score[C][false] = 0; score[C][true] = 2
score[B][false] = 6 this means we use C and D; score[B][true] = 3 this means we use B
score[E][false] = 0; score[E][true] = 1
score[A][false] = 4 This means we use B and E; score[A][true] = 12 this means we use B and A.
and you select 4 so you must use B and E. if it was just B your answer would be 3. but as you find it correctly your answer is 4 = 3 + 1 = B + E.
Also when E = 1000
A(9)---B(3)---C(2)
\ \
E(1000) D(4)
it is 100% correct that the answer is B and A because it is wrong to use E just because you dont want to select adjacent nodes. with this algorithm you will find the answer is A and B and just by checking you can find it too. suppose this covers :
C D A = 15
C D E = 1006
A B = 12
Although the first two answer have no adjacent nodes but they are bigger than last answer that have adjacent nodes. so it is best to use A and B for cover.
Update 2011-12-28: Here's a blog post with a less vague description of the problem I was trying to solve, my work on it, and my current solution: Watching Every MLB Team Play A Game
I'm trying to solve a kind of strange pathfinding challenge. I have an acyclic directional graph, and every edge has a distance value. And I want to find a shortest path. Simple, right? Well, there are a couple of reasons I can't just use Dijkstra's or A*.
I don't care at all what the starting node of my path is, nor the ending node. I just need a path that includes exactly 10 nodes. But:
Each node has an attribute, let's say it's color. Each node has one of 20 different possible colors.
The path I'm trying to find is the shortest path with exactly 10 nodes, where each node is a different color. I don't want any of the nodes in my path to have the same color as any other node.
It'd be nice to be able to force my path to have one value for one of the attributes ("at least one node must be blue", for instance), but that's not really necessary.
This is a simplified example. My full data set actually has three different attributes for each node that must all be unique, and I have 2k+ nodes each with an average of 35 outgoing edges. Since getting a perfect "shortest path" may be exponential or factorial time, an exhaustive search is really not an option. What I'm really looking for is some approximation of a "good path" that meets the criterion under #3.
Can anyone point me towards an algorithm that I might be able to use (even modified)?
Some stats on my full data set:
Total nodes: 2430
Total edges: 86524
Nodes with no incoming edges: 19
Nodes with no outgoing edges: 32
Most outgoing edges: 42
Average edges per node: 35.6 (in each direction)
Due to the nature of the data, I know that the graph is acyclic
And in the full data set, I'm looking for a path of length 15, not 10
It is the case when the question actually contains most of the answer.
Do a breadth-first search starting from all root nodes. When the number of parallelly searched paths exceeds some limit, drop the longest paths. Path length may be weighed: last edges may have weight 10, edges passed 9 hops ago - weight 1. Also it is possible to assign lesser weight to all paths having the preferred attribute or paths going through the weakly connected nodes. Store last 10 nodes in the path to the hash table to avoid duplication. And keep somewhere the minimum sum of the last 9 edge lengths along with the shortest path.
If the number of possible values is low, you can use the Floyd algorithm with a slight modification: for each path you store a bitmap that represents the different values already visited. (In your case the bitmap will be 20 bits wide per path.
Then when you perform the length comparison, you also AND your bitmaps to check whether it's a valid path and if it is, you OR them together and store that as the new bitmap for the path.
Have you tried a straight-forward approach and failed? From your description of the problem, I see no reason a simple greedy algorithm like depth-first search might be just fine:
Pick a start node.
Check the immediate neighbors, are there any nodes that are ok to append to the path? Expand the path with one of them and repeat the process for that node.
If you fail, backtrack to the last successful state and try a new neighbor.
If you run out of neighbors to check, this node cannot be the start node of a path. Try a new one.
If you have 10 nodes, you're done.
Good heuristics for picking a start node is hard to give without any knowledge about how the attributes are distributed, but it is possible that it is beneficial to nodes with high degree first.
It looks like a greedy depth first search will be your best bet. With a reasonable distribution of attribute values, I think finding a single valid sequence is E[O(1)] time, that is expected constant time. I could probably prove that, but it might take some time. The proof would use the assumption that there is a non-zero probability that a valid next segment of the sequence could be found at every step.
The greedy search would backtracking whenever the unique attribute value constraint is violated. The search stops when a 15 segment path is found. If we accept my hunch that each sequence can be found in E[O(1)], then it is a matter of determining how many parallel searches to undertake.
For those who want to experiment, here is a (postgres) sql script to generate some fake data.
SET search_path='tmp';
-- DROP TABLE nodes CASCADE;
CREATE TABLE nodes
( num INTEGER NOT NULL PRIMARY KEY
, color INTEGER
-- Redundant fields to flag {begin,end} of paths
, is_root boolean DEFAULT false
, is_terminal boolean DEFAULT false
);
-- DROP TABLE edges CASCADE;
CREATE TABLE edges
( numfrom INTEGER NOT NULL REFERENCES nodes(num)
, numto INTEGER NOT NULL REFERENCES nodes(num)
, cost INTEGER NOT NULL DEFAULT 0
);
-- Generate some nodes, set color randomly
INSERT INTO nodes (num)
SELECT n
FROM generate_series(1,2430) n
WHERE 1=1
;
UPDATE nodes SET COLOR= 1+TRUNC(20*random() );
-- (partial) cartesian product nodes*nodes. The ordering guarantees a DAG.
INSERT INTO edges(numfrom,numto,cost)
SELECT n1.num ,n2.num, 0
FROM nodes n1 ,nodes n2
WHERE n1.num < n2.num
AND random() < 0.029
;
UPDATE edges SET cost = 1+ 1000 * random();
ALTER TABLE edges
ADD PRIMARY KEY (numfrom,numto)
;
ALTER TABLE edges
ADD UNIQUE (numto,numfrom)
;
UPDATE nodes no SET is_root = true
WHERE NOT EXISTS (
SELECT * FROM edges ed
WHERE ed.numfrom = no.num
);
UPDATE nodes no SET is_terminal = true
WHERE NOT EXISTS (
SELECT * FROM edges ed
WHERE ed.numto = no.num
);
SELECT COUNT(*) AS nnode FROM nodes;
SELECT COUNT(*) AS nedge FROM edges;
SELECT color, COUNT(*) AS cnt FROM nodes GROUP BY color ORDER BY color;
SELECT COUNT(*) AS nterm FROM nodes no WHERE is_terminal = true;
SELECT COUNT(*) AS nroot FROM nodes no WHERE is_root = true;
WITH zzz AS (
SELECT numto, COUNT(*) AS fanin
FROM edges
GROUP BY numto
)
SELECT zzz.fanin , COUNT(*) AS cnt
FROM zzz
GROUP BY zzz.fanin
ORDER BY zzz.fanin
;
WITH zzz AS (
SELECT numfrom, COUNT(*) AS fanout
FROM edges
GROUP BY numfrom
)
SELECT zzz.fanout , COUNT(*) AS cnt
FROM zzz
GROUP BY zzz.fanout
ORDER BY zzz.fanout
;
COPY nodes(num,color,is_root,is_terminal)
TO '/tmp/nodes.dmp';
COPY edges(numfrom,numto, cost)
TO '/tmp/edges.dmp';
The problem may be solving by dynamic programming as follows. Let's start by formally defining its solution.
Given a DAG G = (V, E), let C the be set of colors of vertices visited so far and let w[i, j] and c[i] be respectively the weight (distance) associated to edge (i, j) and the color of a vertex i. Note that w[i, j] is zero if the edge (i, j) does not belong to E.
Now define the distance d for going from vertex i to vertex j taking into account C as
d[i, j, C] = w[i, j] if i is not equal to j and c[j] does not belong to C
= 0 if i = j
= infinite if i is not equal to j and c[j] belongs to C
We are now ready to define our subproblems as follows:
A[i, j, k, C] = shortest path from i to j that uses exactly k edges and respects the colors in C so that no two vertices in the path are colored using the same color (one of the colors in C)
Let m be the maximum number of edges permitted in the path and assume that the vertices are labeled 1, 2, ..., n. Let P[i,j,k] be the predecessor vertex of j in the shortest path satisfying the constraints from i to j. The following algorithm solves the problem.
for k = 1 to m
for i = 1 to n
for j = 1 to n
A[i,j,k,C] = min over x belonging to V {d[i,x,C] + A[x,j,k-1,C union c[x]]}
P[i,j,k] = the vertex x that minimized A[i,j,k,C] in the previous statement
Set the initial conditions as follows:
A[i,j,k,C] = 0 for k = 0
A[i,j,k,C] = 0 if i is equal to j
A[i,j,k,C] = infinite in all of the other cases
The overall computational complexity of the algorithm is O(m n^3); taking into account that in your particular case m = 14 (since you want exactly 15 nodes), it follows that m = O(1) so that the complexity actually is O(n^3). To represent the set C use an hash table so that insertion and membership testing require O(1) on average. Note that in the algorithm the operation C union c[x] is actually an insert operation in which you add the color of vertex x into the hash table for C. However, since you are inserting just an element, the set union operation leads to exactly the same result (if the color is not in the set, it is added; otherwise, it is simply discarded and the set does not change). Finally, to represent the DAG, use the adjacency matrix.
Once the algorithm is done, to find the minimum shortest path among all possible vertices i and j, simply find the minimum among the values A[i,j,m,C]. Note that if this value is infinite, then no valid shortest path exists. If a valid shortest path exists, then you can actually determine it by using the P[i,j,k] values and tracing backwards through predecessor vertices. For instance, starting from a = P[i,j,m] the last edge on the shortest path is (a,j), the previous edge is given by b = P[i,a,m-1] and its is (b,a) and so on.