I have a graph that has two distinct classes of nodes, class A nodes and class B nodes.
Class A nodes are not connected to any other A nodes and class B nodes aren’t connected to any other B nodes, but B nodes are connected to A nodes and vice versa. Some B nodes are connected to lots of A nodes and most A nodes are connected to lots of B nodes.
I want to eliminate as many of the A nodes as possible from the graph.
I must keep all of the B nodes, and they must still be connected to at least one A node (preferably only one A node).
I can eliminate an A node when it has no B nodes connected only to it. Are there any algorithms that could find an optimal, or at least close to optimal, solution for which A nodes I can remove?
Old, Incorrect Answer, But Start Here
First, you need to recognize that you have a bipartite graph. That is, you can colour the nodes red and blue such that no edges connect a red node to a red node or a blue node to a blue node.
Next, recognize that you're trying to solve a vertex cover problem. From Wikipedia:
In the mathematical discipline of graph theory, a vertex cover (sometimes node cover) of a graph is a set of vertices such that each edge of the graph is incident to at least one vertex of the set. The problem of finding a minimum vertex cover is a classical optimization problem in computer science and is a typical example of an NP-hard optimization problem that has an approximation algorithm.
Since you have a special graph, it's reasonable to think that maybe the NP-hard doesn't apply to you. This thought brings us to Kőnig's theorem which relates the maximum matching problem to the minimum vertex cover problem. Once you know this, you can apply the Hopcroft–Karp algorithm to solve the problem in O(|E|√|V|) time, though you'll probably need to jigger it a bit to ensure you keep all the B nodes.
New, Correct Answer
It turns out this jiggering is the creation of a "constrained bipartitate graph vertex cover problem", which asks us if there is a vertex cover that uses less than a A-nodes and less than b B-nodes. The problem is NP-complete, so that's a no go. The jiggering was hard than I thought!
But using less than the minimum number of nodes isn't the constraint we want. We want to ensure that the minimum number of A-nodes is used and the maximum number of B-nodes.
Kőnig's theorem, above, is a special case of the maximum flow problem. Thinking about the problem in terms of flows brings us pretty quickly to minimum-cost flow problems.
In these problems we're given a graph whose edges have specified capacities and unit costs of transport. The goal is to find the minimum cost needed to move a supply of a given quantity from an arbitrary set of source nodes to an arbitrary set of sink nodes.
It turns out your problem can be converted into a minimum-cost flow problem. To do so, let us generate a source node that connects to all the A nodes and a sink node that connects to all the B nodes.
Now, let us make the cost of using a Source->A edge equal to 1 and give all other edges a cost of zero. Further, let us make the capacity of the Source->A edges equal to infinity and the capacity of all other edges equal to 1.
This looks like the following:
The red edges have Cost=1, Capacity=Inf. The blue edges have Cost=0, Capacity=1.
Now, solving the minimum flow problem becomes equivalent to using as few red edges as possible. Any red edge that isn't used allocates 0 flow to its corresponding A node and that node can be removed from the graph. Conversely, each B node can only pass 1 unit of flow to the sink, so all B nodes must be preserved in order for the problem to be solved.
Since we've recast your problem into this standard form, we can leverage existing tools to get a solution; namely, Google's Operation Research Tools.
Doing so gives the following answer to the above graph:
The red edges are unused and the black edges are used. Note that if a red edge emerges from the source the A-node it connects to generates no black edges. Note also that each B-node has at least one in-coming black edge. This satisfies the constraints you posed.
We can now detect the A-nodes to be removed by looking for Source->A edges with zero usage.
Source Code
The source code necessary to generate the foregoing figures and associated solutions is as follows:
#!/usr/bin/env python3
#Documentation: https://developers.google.com/optimization/flow/mincostflow
#Install dependency: pip3 install ortools
from __future__ import print_function
from ortools.graph import pywrapgraph
import matplotlib.pyplot as plt
import networkx as nx
import random
import sys
def GenerateGraph(Acount,Bcount):
assert Acount>5
assert Bcount>5
G = nx.DiGraph() #Directed graph
source_node = Acount+Bcount
sink_node = source_node+1
for a in range(Acount):
for i in range(random.randint(0.2*Bcount,0.3*Bcount)): #Connect to 10-20% of the Bnodes
b = Acount+random.randint(0,Bcount-1) #In the half-open range [0,Bcount). Offset from A's indices
G.add_edge(source_node, a, capacity=99999, unit_cost=1, usage=1)
G.add_edge(a, b, capacity=1, unit_cost=0, usage=1)
G.add_edge(b, sink_node, capacity=1, unit_cost=0, usage=1)
G.node[a]['type'] = 'A'
G.node[b]['type'] = 'B'
G.node[source_node]['type'] = 'source'
G.node[sink_node]['type'] = 'sink'
G.node[source_node]['supply'] = Bcount
G.node[sink_node]['supply'] = -Bcount
return G
def VisualizeGraph(graph, color_type):
gcopy = graph.copy()
for p, d in graph.nodes(data=True):
if d['type']=='source':
source = p
if d['type']=='sink':
sink = p
Acount = len([1 for p,d in graph.nodes(data=True) if d['type']=='A'])
Bcount = len([1 for p,d in graph.nodes(data=True) if d['type']=='B'])
if color_type=='usage':
edge_color = ['black' if d['usage']>0 else 'red' for u,v,d in graph.edges(data=True)]
elif color_type=='unit_cost':
edge_color = ['red' if d['unit_cost']>0 else 'blue' for u,v,d in graph.edges(data=True)]
Ai = 0
Bi = 0
pos = dict()
for p,d in graph.nodes(data=True):
if d['type']=='source':
pos[p] = (0, Acount/2)
elif d['type']=='sink':
pos[p] = (3, Bcount/2)
elif d['type']=='A':
pos[p] = (1, Ai)
Ai += 1
elif d['type']=='B':
pos[p] = (2, Bi)
Bi += 1
nx.draw(graph, pos=pos, edge_color=edge_color, arrows=False)
plt.show()
def GenerateMinCostFlowProblemFromGraph(graph):
start_nodes = []
end_nodes = []
capacities = []
unit_costs = []
min_cost_flow = pywrapgraph.SimpleMinCostFlow()
for node,neighbor,data in graph.edges(data=True):
min_cost_flow.AddArcWithCapacityAndUnitCost(node, neighbor, data['capacity'], data['unit_cost'])
supply = len([1 for p,d in graph.nodes(data=True) if d['type']=='B'])
for p, d in graph.nodes(data=True):
if (d['type']=='source' or d['type']=='sink') and 'supply' in d:
min_cost_flow.SetNodeSupply(p, d['supply'])
return min_cost_flow
def ColorGraphEdgesByUsage(graph, min_cost_flow):
for i in range(min_cost_flow.NumArcs()):
graph[min_cost_flow.Tail(i)][min_cost_flow.Head(i)]['usage'] = min_cost_flow.Flow(i)
def main():
"""MinCostFlow simple interface example."""
# Define four parallel arrays: start_nodes, end_nodes, capacities, and unit costs
# between each pair. For instance, the arc from node 0 to node 1 has a
# capacity of 15 and a unit cost of 4.
Acount = 20
Bcount = 20
graph = GenerateGraph(Acount, Bcount)
VisualizeGraph(graph, 'unit_cost')
min_cost_flow = GenerateMinCostFlowProblemFromGraph(graph)
# Find the minimum cost flow between node 0 and node 4.
if min_cost_flow.Solve() != min_cost_flow.OPTIMAL:
print('Unable to find a solution! It is likely that one does not exist for this input.')
sys.exit(-1)
print('Minimum cost:', min_cost_flow.OptimalCost())
ColorGraphEdgesByUsage(graph, min_cost_flow)
VisualizeGraph(graph, 'usage')
if __name__ == '__main__':
main()
Despite this is an old question, I see it has not been correctly answered yet.
An analogous question to this one has also been answered earlier in this post.
The problem you are presenting here is indeed the Minimum Set Cover Problem, which is one of the well-known NP-hard problems. From the Wikipedia, the Minimum Set Cover Problem can be formulated as:
Given a set of elements {1,2,...,n} (called the universe) and a collection S of m sets whose union equals the universe, the set cover problem is to identify the smallest sub-collection of S whose union equals the universe. For example, consider the universe U={1,2,3,4,5} and the collection of sets S={{1,2,3},{2,4},{3,4},{4,5}}. Clearly the union of S is U. However, we can cover all of the elements with the following, smaller number of sets: {{1,2,3},{4,5}}.
In your formulation, B nodes represent the elements in the universe, A nodes represent the sets and edges between A nodes and B nodes determine which elements (B nodes) belong to each set (A node). Then, the minimum set cover is equivalent to the minimum number of A nodes so that they are connected to all B nodes. Consequently, the maximum number of A nodes which can be removed from the graph while being connected to every B node are those which do not belong to the minimum set cover.
Since it is NP-hard, there is no polinomial time algorithm for computing the optimum, but a simple greedy algorithm can efficiently provide approximate solutions with tight bounds to the optimum. From the Wikipedia:
There is a greedy algorithm for polynomial time approximation of set covering that chooses sets according to one rule: at each stage, choose the set that contains the largest number of uncovered elements.
Related
The idea is to construct a list of the nodes in the undirected graph ordered by their degrees.
Graph is given in the form {node: (set of its neighbours) for node in the graph}
The code raises KeyError exception at the line "graph[neighbor].remove(node)". It seems like the node have already been deleted from the set, but I don't see where.
Can anyone please point out at the mistake?
Edit: This list of nodes is used in the simulation of the targeted attack in order of values of nodes in the graph. So, after an attack on the node with the biggest degree, it is removed from the node, and the degrees of the remaining nodes should be recalculated accordingly.
def fast_targeted_order(graph):
"""returns an orderedv list of the nodes in the graph in decreasing
order of their degrees"""
number_of_nodes = len(graph)
# initialise a list of sets of every possible degree
degree_sets = [set() for dummy_ind in range(number_of_nodes)]
#group nodes in sets according to their degrees
for node in graph:
degree = len(graph[node])
degree_sets[degree] |= {node}
ordered_nodes = []
#starting from the set of nodes with the maximal degree
for degree in range(number_of_nodes - 1, -1, -1):
#copy the set to avoid raising the exception "set size changed
during the execution
copied_degree_set = degree_sets[degree].copy()
while degree_sets[degree]:
for node in copied_degree_set:
degree_sets[degree] -= {node}
for neighbor in graph[node]:
neighbor_degree = len(graph[neighbor])
degree_sets[neighbor_degree] -= {neighbor}
degree_sets[neighbor_degree - 1] |= {neighbor}
graph[neighbor].remove(node)
ordered_nodes.append(node)
graph.pop(node)
return ordered_nodes
My previous answer (now deleted) was incorrect, the issue was not in using set, but in deleting items in any sequence during iteration through the same sequence.
Python tutorial for version 3.1 clearly warns:
It is not safe to modify the sequence being iterated over in the loop
(this can only happen for mutable sequence types, such as lists). If
you need to modify the list you are iterating over (for example, to
duplicate selected items) you must iterate over a copy.
However, tutorial for Python 3.5. (which I use) only advises:
If you need to modify the sequence you are iterating over while inside
the loop (for example to duplicate selected items), it is
recommended that you first make a copy.
It appears that this operation is still very unpredictable in Python 3.5, producing different results with the same input.
From my point of view, the previous version of the tutorial is preferred to the current one.
#PetarPetrovic and #jdehesa, thanks for the valuable advice.
Working solution:
def fast_targeted_order(ugraph):
"""
input: undirected graph in the form {node: set of node's neighbors)
returns an ordered list of the nodes in V in decresing order of their degrees
"""
graph = copy_graph(ugraph)
number_of_nodes = len(graph)
degrees_dict = {degree: list() for degree in range(number_of_nodes)}
for node in graph:
degree = len(graph[node])
degrees_dict[degree].append(node)
ordered_degrees = OrderedDict(sorted(degrees_dict.items(),
key = lambda key_value: key_value[0],
reverse = True))
ordered_nodes = []
for degree, nodes in ordered_degrees.items():
nodes_copy = nodes[:]
for node in nodes_copy:
if node in nodes:
for neighbor in graph[node]:
neighbor_degree = len(graph[neighbor])
ordered_degrees[neighbor_degree].remove(neighbor)
if neighbor_degree:
ordered_degrees[neighbor_degree - 1].append(neighbor)
graph[neighbor].remove(node)
graph.pop(node)
ordered_degrees[degree].remove(node)
ordered_nodes.append(node)
return ordered_nodes
In order to train myself both in Python and graph theory, I tried to implement the Dijkstra algo using Python 3, and submitted it against several online judges, to see if it was correct.
It works well in many cases, but not always.
For example, I am stuck with this one: the test case works fine and I also have tried custom test cases of my own, but when I submit the following solution, the judge keeps telling me "wrong answer", and the expected result is very different from my output, indeed.
Notice that the judge tests it against quite a complex graph (10000 nodes with 100000 edges), while all the cases I tried before never exceeded 20 nodes and around 20-40 edges.
Here is my code.
Given al an adjacency list in the following form:
al[n] = [(a1, w1), (a2, w2), ...]
where
n is the node id;
a1, a2, etc. are its adjacent nodes and w1, w2, etc. the respective weights for the given edge;
and supposing that maximum distance never exceeds 1 billion, I implemented Dijkstra's algorithm this way:
import queue
distance = [1000000000] * (N+1) # this is the array where I store the shortest path between 1 and each other node
distance[1] = 0 # starting from node 1 with distance 0
pq = queue.PriorityQueue()
pq.put((0, 1)) # same as above
visited = [False] * (N+1)
while not pq.empty():
n = pq.get()[1]
if visited[n]:
continue
visited[n] = True
for edge in al[n]:
if distance[edge[0]] > distance[n] + edge[1]:
distance[edge[0]] = distance[n] + edge[1]
pq.put((distance[edge[0]], edge[0]))
Could you please help me understand wether my implementation is flawed, or if I simply ran into some bugged online judge?
Thank you very much.
UPDATE
As requested, I'm providing the snippet I use to populate the adjacency list al for the linked problem.
N,M = input().split()
N,M = int(N), int(M)
al = [[] for n in range(N+1)]
for m in range(M):
try:
a,b,w = input().split()
a,b,w = int(a), int(b), int(w)
al[a].append((b, w))
al[b].append((a, w))
except:
pass
(Please don't mind the ugly "except: pass", I was using it just for debugging purposes... :P)
Primary problem in interpreting the question:
According to your parsing code, you are treating the input data as an undirected graph, i.e. each edge from A to B also is an edge from B to A.
Is seems like this premise is not valid and it should instead be a directed graph, i.e. you have to remove this line:
al[b].append((a, w)) # no back reference!
Previous problem, now already fixed in the code:
Currently, you are using the never-changing weight of the edges in your queue:
pq.put((edge[1], edge[0]))
This way, the nodes always end up at the same position in the queue, no matter at what stage of the algorithm and how far the path to reach that node actually is.
Instead, you should use the new distance to the target node edge[0], i.e. distance[edge[0]] as the priority in the queue:
pq.put((distance[edge[0]], edge[0]))
There's an existing question dealing with trees where the weight of a vertex is its degree, but I'm interested in the case where the vertices can have arbitrary weights.
This isn't homework but it is one of the questions in the algorithm design manual, which I'm currently reading; an answer set gives the solution as
Perform a DFS, at each step update Score[v][include], where v is a vertex and include is either true or false;
If v is a leaf, set Score[v][false] = 0, Score[v][true] = wv, where wv is the weight of vertex v.
During DFS, when moving up from the last child of the node v, update Score[v][include]:
Score[v][false] = Sum for c in children(v) of Score[c][true] and Score[v][true] = wv + Sum for c in children(v) of min(Score[c][true]; Score[c][false])
Extract actual cover by backtracking Score.
However, I can't actually translate that into something that works. (In response to the comment: what I've tried so far is drawing some smallish graphs with weights and running through the algorithm on paper, up until step four, where the "extract actual cover" part is not transparent.)
In response Ali's answer: So suppose I have this graph, with the vertices given by A etc. and the weights in parens after:
A(9)---B(3)---C(2)
\ \
E(1) D(4)
The right answer is clearly {B,E}.
Going through this algorithm, we'd set values like so:
score[D][false] = 0; score[D][true] = 4
score[C][false] = 0; score[C][true] = 2
score[B][false] = 6; score[B][true] = 3
score[E][false] = 0; score[E][true] = 1
score[A][false] = 4; score[A][true] = 12
Ok, so, my question is basically, now what? Doing the simple thing and iterating through the score vector and deciding what's cheapest locally doesn't work; you only end up including B. Deciding based on the parent and alternating also doesn't work: consider the case where the weight of E is 1000; now the correct answer is {A,B}, and they're adjacent. Perhaps it is not supposed to be confusing, but frankly, I'm confused.
There's no actual backtracking done (or needed). The solution uses dynamic programming to avoid backtracking, since that'd take exponential time. My guess is "backtracking Score" means the Score contains the partial results you would get by doing backtracking.
The cover vertex of a tree allows to include alternated and adjacent vertices. It does not allow to exclude two adjacent vertices, because it must contain all of the edges.
The answer is given in the way the Score is recursively calculated. The cost of not including a vertex, is the cost of including its children. However, the cost of including a vertex is whatever is less costly, the cost of including its children or not including them, because both things are allowed.
As your solution suggests, it can be done with DFS in post-order, in a single pass. The trick is to include a vertex if the Score says it must be included, and include its children if it must be excluded, otherwise we'd be excluding two adjacent vertices.
Here's some pseudocode:
find_cover_vertex_of_minimum_weight(v)
find_cover_vertex_of_minimum_weight(left children of v)
find_cover_vertex_of_minimum_weight(right children of v)
Score[v][false] = Sum for c in children(v) of Score[c][true]
Score[v][true] = v weight + Sum for c in children(v) of min(Score[c][true]; Score[c][false])
if Score[v][true] < Score[v][false] then
add v to cover vertex tree
else
for c in children(v)
add c to cover vertex tree
It actually didnt mean any thing confusing and it is just Dynamic Programming, you seems to almost understand all the algorithm. If I want to make it any more clear, I have to say:
first preform DFS on you graph and find leafs.
for every leaf assign values as the algorithm says.
now start from leafs and assign values to each leaf parent by that formula.
start assigning values to parent of nodes that already have values until you reach the root of your graph.
That is just it, by backtracking in your algorithm it means that you assign value to each node that its child already have values. As I said above this kind of solving problem is called dynamic programming.
Edit just for explaining your changes in the question. As you you have the following graph and answer is clearly B,E but you though this algorithm just give you B and you are incorrect this algorithm give you B and E.
A(9)---B(3)---C(2)
\ \
E(1) D(4)
score[D][false] = 0; score[D][true] = 4
score[C][false] = 0; score[C][true] = 2
score[B][false] = 6 this means we use C and D; score[B][true] = 3 this means we use B
score[E][false] = 0; score[E][true] = 1
score[A][false] = 4 This means we use B and E; score[A][true] = 12 this means we use B and A.
and you select 4 so you must use B and E. if it was just B your answer would be 3. but as you find it correctly your answer is 4 = 3 + 1 = B + E.
Also when E = 1000
A(9)---B(3)---C(2)
\ \
E(1000) D(4)
it is 100% correct that the answer is B and A because it is wrong to use E just because you dont want to select adjacent nodes. with this algorithm you will find the answer is A and B and just by checking you can find it too. suppose this covers :
C D A = 15
C D E = 1006
A B = 12
Although the first two answer have no adjacent nodes but they are bigger than last answer that have adjacent nodes. so it is best to use A and B for cover.
Update 2011-12-28: Here's a blog post with a less vague description of the problem I was trying to solve, my work on it, and my current solution: Watching Every MLB Team Play A Game
I'm trying to solve a kind of strange pathfinding challenge. I have an acyclic directional graph, and every edge has a distance value. And I want to find a shortest path. Simple, right? Well, there are a couple of reasons I can't just use Dijkstra's or A*.
I don't care at all what the starting node of my path is, nor the ending node. I just need a path that includes exactly 10 nodes. But:
Each node has an attribute, let's say it's color. Each node has one of 20 different possible colors.
The path I'm trying to find is the shortest path with exactly 10 nodes, where each node is a different color. I don't want any of the nodes in my path to have the same color as any other node.
It'd be nice to be able to force my path to have one value for one of the attributes ("at least one node must be blue", for instance), but that's not really necessary.
This is a simplified example. My full data set actually has three different attributes for each node that must all be unique, and I have 2k+ nodes each with an average of 35 outgoing edges. Since getting a perfect "shortest path" may be exponential or factorial time, an exhaustive search is really not an option. What I'm really looking for is some approximation of a "good path" that meets the criterion under #3.
Can anyone point me towards an algorithm that I might be able to use (even modified)?
Some stats on my full data set:
Total nodes: 2430
Total edges: 86524
Nodes with no incoming edges: 19
Nodes with no outgoing edges: 32
Most outgoing edges: 42
Average edges per node: 35.6 (in each direction)
Due to the nature of the data, I know that the graph is acyclic
And in the full data set, I'm looking for a path of length 15, not 10
It is the case when the question actually contains most of the answer.
Do a breadth-first search starting from all root nodes. When the number of parallelly searched paths exceeds some limit, drop the longest paths. Path length may be weighed: last edges may have weight 10, edges passed 9 hops ago - weight 1. Also it is possible to assign lesser weight to all paths having the preferred attribute or paths going through the weakly connected nodes. Store last 10 nodes in the path to the hash table to avoid duplication. And keep somewhere the minimum sum of the last 9 edge lengths along with the shortest path.
If the number of possible values is low, you can use the Floyd algorithm with a slight modification: for each path you store a bitmap that represents the different values already visited. (In your case the bitmap will be 20 bits wide per path.
Then when you perform the length comparison, you also AND your bitmaps to check whether it's a valid path and if it is, you OR them together and store that as the new bitmap for the path.
Have you tried a straight-forward approach and failed? From your description of the problem, I see no reason a simple greedy algorithm like depth-first search might be just fine:
Pick a start node.
Check the immediate neighbors, are there any nodes that are ok to append to the path? Expand the path with one of them and repeat the process for that node.
If you fail, backtrack to the last successful state and try a new neighbor.
If you run out of neighbors to check, this node cannot be the start node of a path. Try a new one.
If you have 10 nodes, you're done.
Good heuristics for picking a start node is hard to give without any knowledge about how the attributes are distributed, but it is possible that it is beneficial to nodes with high degree first.
It looks like a greedy depth first search will be your best bet. With a reasonable distribution of attribute values, I think finding a single valid sequence is E[O(1)] time, that is expected constant time. I could probably prove that, but it might take some time. The proof would use the assumption that there is a non-zero probability that a valid next segment of the sequence could be found at every step.
The greedy search would backtracking whenever the unique attribute value constraint is violated. The search stops when a 15 segment path is found. If we accept my hunch that each sequence can be found in E[O(1)], then it is a matter of determining how many parallel searches to undertake.
For those who want to experiment, here is a (postgres) sql script to generate some fake data.
SET search_path='tmp';
-- DROP TABLE nodes CASCADE;
CREATE TABLE nodes
( num INTEGER NOT NULL PRIMARY KEY
, color INTEGER
-- Redundant fields to flag {begin,end} of paths
, is_root boolean DEFAULT false
, is_terminal boolean DEFAULT false
);
-- DROP TABLE edges CASCADE;
CREATE TABLE edges
( numfrom INTEGER NOT NULL REFERENCES nodes(num)
, numto INTEGER NOT NULL REFERENCES nodes(num)
, cost INTEGER NOT NULL DEFAULT 0
);
-- Generate some nodes, set color randomly
INSERT INTO nodes (num)
SELECT n
FROM generate_series(1,2430) n
WHERE 1=1
;
UPDATE nodes SET COLOR= 1+TRUNC(20*random() );
-- (partial) cartesian product nodes*nodes. The ordering guarantees a DAG.
INSERT INTO edges(numfrom,numto,cost)
SELECT n1.num ,n2.num, 0
FROM nodes n1 ,nodes n2
WHERE n1.num < n2.num
AND random() < 0.029
;
UPDATE edges SET cost = 1+ 1000 * random();
ALTER TABLE edges
ADD PRIMARY KEY (numfrom,numto)
;
ALTER TABLE edges
ADD UNIQUE (numto,numfrom)
;
UPDATE nodes no SET is_root = true
WHERE NOT EXISTS (
SELECT * FROM edges ed
WHERE ed.numfrom = no.num
);
UPDATE nodes no SET is_terminal = true
WHERE NOT EXISTS (
SELECT * FROM edges ed
WHERE ed.numto = no.num
);
SELECT COUNT(*) AS nnode FROM nodes;
SELECT COUNT(*) AS nedge FROM edges;
SELECT color, COUNT(*) AS cnt FROM nodes GROUP BY color ORDER BY color;
SELECT COUNT(*) AS nterm FROM nodes no WHERE is_terminal = true;
SELECT COUNT(*) AS nroot FROM nodes no WHERE is_root = true;
WITH zzz AS (
SELECT numto, COUNT(*) AS fanin
FROM edges
GROUP BY numto
)
SELECT zzz.fanin , COUNT(*) AS cnt
FROM zzz
GROUP BY zzz.fanin
ORDER BY zzz.fanin
;
WITH zzz AS (
SELECT numfrom, COUNT(*) AS fanout
FROM edges
GROUP BY numfrom
)
SELECT zzz.fanout , COUNT(*) AS cnt
FROM zzz
GROUP BY zzz.fanout
ORDER BY zzz.fanout
;
COPY nodes(num,color,is_root,is_terminal)
TO '/tmp/nodes.dmp';
COPY edges(numfrom,numto, cost)
TO '/tmp/edges.dmp';
The problem may be solving by dynamic programming as follows. Let's start by formally defining its solution.
Given a DAG G = (V, E), let C the be set of colors of vertices visited so far and let w[i, j] and c[i] be respectively the weight (distance) associated to edge (i, j) and the color of a vertex i. Note that w[i, j] is zero if the edge (i, j) does not belong to E.
Now define the distance d for going from vertex i to vertex j taking into account C as
d[i, j, C] = w[i, j] if i is not equal to j and c[j] does not belong to C
= 0 if i = j
= infinite if i is not equal to j and c[j] belongs to C
We are now ready to define our subproblems as follows:
A[i, j, k, C] = shortest path from i to j that uses exactly k edges and respects the colors in C so that no two vertices in the path are colored using the same color (one of the colors in C)
Let m be the maximum number of edges permitted in the path and assume that the vertices are labeled 1, 2, ..., n. Let P[i,j,k] be the predecessor vertex of j in the shortest path satisfying the constraints from i to j. The following algorithm solves the problem.
for k = 1 to m
for i = 1 to n
for j = 1 to n
A[i,j,k,C] = min over x belonging to V {d[i,x,C] + A[x,j,k-1,C union c[x]]}
P[i,j,k] = the vertex x that minimized A[i,j,k,C] in the previous statement
Set the initial conditions as follows:
A[i,j,k,C] = 0 for k = 0
A[i,j,k,C] = 0 if i is equal to j
A[i,j,k,C] = infinite in all of the other cases
The overall computational complexity of the algorithm is O(m n^3); taking into account that in your particular case m = 14 (since you want exactly 15 nodes), it follows that m = O(1) so that the complexity actually is O(n^3). To represent the set C use an hash table so that insertion and membership testing require O(1) on average. Note that in the algorithm the operation C union c[x] is actually an insert operation in which you add the color of vertex x into the hash table for C. However, since you are inserting just an element, the set union operation leads to exactly the same result (if the color is not in the set, it is added; otherwise, it is simply discarded and the set does not change). Finally, to represent the DAG, use the adjacency matrix.
Once the algorithm is done, to find the minimum shortest path among all possible vertices i and j, simply find the minimum among the values A[i,j,m,C]. Note that if this value is infinite, then no valid shortest path exists. If a valid shortest path exists, then you can actually determine it by using the P[i,j,k] values and tracing backwards through predecessor vertices. For instance, starting from a = P[i,j,m] the last edge on the shortest path is (a,j), the previous edge is given by b = P[i,a,m-1] and its is (b,a) and so on.
I’ve implemented and used A* several times and thought I knew everything there was to know about A*…. Until I encountered this example:
The graph consists of 4 nodes and 6 directed weighted edges. The heuristic is denoted per node by H=…. The heuristic is clearly admissible, so I don't see any problems with that.
The problem is to find the route from start to goal with the minimal total cost. The correct solution is the route taking the edges with the costs 36 and 18.
My implementation of A* performs the following steps(omitting any operations related to the closed list):
The startnode is {G = 0, H = 200, -> F = 200} and is selected as ‘current node’
All its neighbours are added to the openlist = {{G=5, H=100, F=105}, {G=36, H=100, F=136}}.
The new ‘current node’ is selected, which is the node in the open list with smallest F, which is the node with F = 105, the upper node in the image.
The neighbours of that node are added to the openlist, which then has the elements { {G=36, H=100,F=136}, {G=58,H=0,F=58}}.
Again a new current node is selected, which is the goal node, so the algorithm terminates and the route with the costs 5 and 53 is selected.
So the implementation produces the wrong result. What in these steps shouldn’t have happened?
For a heuristic to be admissible, it must be bounded from above by the actual cost to the goal. Your heuristic is not admissible and that's why you're getting the wrong answer. See, for instance, the Wikipedia article on A*.