Set Simplification - algorithm

Say I have two sets, set1 = {a,b,c,d,e,f} and set2 = {a,b,c,d,e,g}. Rather than expressing these explicitly, I want to create something like
common = {a,b,c,d,e}
set1 = common + f
set2 = common + g
If we wanted to represent {a,b,c,h}, we could represent it as common - d - e + h.
My goal is basically to be able to generate the optimal common portion to be used. With only one common section this isn't too challenging, but I need to allow more than one (but not unlimited, or the benefits gained would be trivial).
By optimal, I mean "least number of elements expressed". So in the above example, it "costs" 5 (number of elements) to make the common variable. Then sets 1 and 2 each cost 2 (one to reference common, one to add the extra element), totalling 9. Without the substitution, these would cost 12 to store (6 elements each). Similarly, subtracting an element from a referenced common set would "cost" 1.
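The cost measure described above can be sketched in a few lines of Python. This is only an illustration under one assumption that the question leaves open: each set references at most one common portion (or none, if listing it explicitly is cheaper). The function name `representation_cost` is made up for this sketch.

```python
def representation_cost(commons, sets):
    """Total number of elements/references written down.

    Each common portion costs its size. Each set is either listed
    explicitly (cost = its size) or written as one reference to a common
    portion plus one unit per element added or removed.
    Assumption: a set references at most one common portion.
    """
    total = sum(len(c) for c in commons)
    for s in sets:
        best = len(s)  # explicit listing
        for c in commons:
            # 1 for the reference, plus one per element in the symmetric difference
            best = min(best, 1 + len(s ^ c))
        total += best
    return total


set1 = set("abcdef")
set2 = set("abcdeg")
common = set("abcde")

print(representation_cost([], [set1, set2]))        # explicit: 6 + 6
print(representation_cost([common], [set1, set2]))  # 5 + 2 + 2
```

A search over candidate common portions could use a function like this as its objective.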
Another example,
{a,b,c,d}, {a,c,d,e}, {e,f,g,h} and {e,f}
could be
common1 = {a,c,d}
common2 = {e,f,g}
set1 = common1 + b
set2 = common1 + e
set3 = common2 + h
set4 = common2 - g
By allowing multiple common portions this becomes a lot more challenging. Is there a name for this type of problem, or something similar? It seems like it could be related to compression, but I haven't been able to find too many resources on where to start with this.
Some other details that may be relevant:
Being allowed to reference multiple common portions to represent one set can be valid, but isn't required.
For my use case, the sets will typically be around 20 elements and around 10 different sets.

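You could find all atomic sets, that is, all groups of elements that are never seen apart. Start with the full universe as a single group and split the groups by each input set in turn:
{a,b,c,d,e,f,g,h} | {a,b,c,d} = {a,b,c,d},{e,f,g,h}
{a,b,c,d},{e,f,g,h} | {a,c,d,e} = {a,c,d},{b},{e},{f,g,h}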
{a,c,d},{b},{e},{f,g,h} | {e,f,g,h} = {a,c,d},{b},{e},{f,g,h}
{a,c,d},{b},{e},{f,g,h} | {e,f} = {a,c,d},{b},{e},{f},{g,h}
{a,b,c,d} = {a,c,d},{b}
{a,c,d,e} = {a,c,d},{e}
{e,f,g,h} = {e},{f},{g,h}
{e,f} = {e},{f}
This is a little closer, but it doesn't solve the minimal breakdown.
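A minimal Python sketch of the atomic-set idea, assuming the input is a list of Python sets: two elements land in the same atomic set exactly when they occur in the same input sets. The function name `atomic_sets` is made up for this sketch.

```python
def atomic_sets(sets):
    """Group elements that occur in exactly the same input sets."""
    groups = {}
    for element in set().union(*sets):
        # The "signature" of an element is the set of input sets containing it.
        signature = frozenset(i for i, s in enumerate(sets) if element in s)
        groups.setdefault(signature, set()).add(element)
    return list(groups.values())


sets = [set("abcd"), set("acde"), set("efgh"), set("ef")]
atoms = atomic_sets(sets)
print(sorted("".join(sorted(a)) for a in atoms))
# → ['acd', 'b', 'e', 'f', 'gh']
```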
I don't think you can find the minimum efficiently, because I suspect that the problem is NP-hard. Consider a set S and create a graph G in which each possible subset of S is a node. Give each node a weight equal to the size of its subset, and draw an edge between each pair of nodes, weighted by the amount of change between them: {abc} -> {a} has a weight of 2, and {bcd} -> {abe} has a weight of 4. To find a minimal solution to the common set problem, you need to find a minimum-weight tree that spans each of the sets you are interested in. If you find one, you can use it to build a minimal common set, so the two problems would be equivalent. Finding a minimum-weight tree in a node-weighted graph is called the Node-Weighted Steiner Tree Problem. The Node-Weighted Steiner Tree Problem can be shown to be equivalent to the Steiner Tree Problem, and the Steiner Tree Problem can be shown to be NP-hard. So I strongly suspect the problem you are trying to solve is NP-hard.
http://theory.cs.uni-bonn.de/info5/steinerkompendium/node15.html
http://theory.cs.uni-bonn.de/info5/steinerkompendium/node17.html

Related

Removing unnecessary nodes in graph

I have a graph that has two distinct classes of nodes, class A nodes and class B nodes.
Class A nodes are not connected to any other A nodes and class B nodes aren’t connected to any other B nodes, but B nodes are connected to A nodes and vice versa. Some B nodes are connected to lots of A nodes and most A nodes are connected to lots of B nodes.
I want to eliminate as many of the A nodes as possible from the graph.
I must keep all of the B nodes, and they must still be connected to at least one A node (preferably only one A node).
I can eliminate an A node when it has no B nodes connected only to it. Are there any algorithms that could find an optimal, or at least close to optimal, solution for which A nodes I can remove?
Old, Incorrect Answer, But Start Here
First, you need to recognize that you have a bipartite graph. That is, you can colour the nodes red and blue such that no edges connect a red node to a red node or a blue node to a blue node.
Next, recognize that you're trying to solve a vertex cover problem. From Wikipedia:
In the mathematical discipline of graph theory, a vertex cover (sometimes node cover) of a graph is a set of vertices such that each edge of the graph is incident to at least one vertex of the set. The problem of finding a minimum vertex cover is a classical optimization problem in computer science and is a typical example of an NP-hard optimization problem that has an approximation algorithm.
Since you have a special graph, it's reasonable to think that maybe the NP-hard doesn't apply to you. This thought brings us to Kőnig's theorem which relates the maximum matching problem to the minimum vertex cover problem. Once you know this, you can apply the Hopcroft–Karp algorithm to solve the problem in O(|E|√|V|) time, though you'll probably need to jigger it a bit to ensure you keep all the B nodes.
New, Correct Answer
It turns out this jiggering is the creation of a "constrained bipartite graph vertex cover problem", which asks us if there is a vertex cover that uses fewer than a A-nodes and fewer than b B-nodes. The problem is NP-complete, so that's a no go. The jiggering was harder than I thought!
But using less than the minimum number of nodes isn't the constraint we want. We want to ensure that the minimum number of A-nodes is used and the maximum number of B-nodes.
Kőnig's theorem, above, is a special case of the maximum flow problem. Thinking about the problem in terms of flows brings us pretty quickly to minimum-cost flow problems.
In these problems we're given a graph whose edges have specified capacities and unit costs of transport. The goal is to find the minimum cost needed to move a supply of a given quantity from an arbitrary set of source nodes to an arbitrary set of sink nodes.
It turns out your problem can be converted into a minimum-cost flow problem. To do so, let us generate a source node that connects to all the A nodes and a sink node that connects to all the B nodes.
Now, let us make the cost of using a Source->A edge equal to 1 and give all other edges a cost of zero. Further, let us make the capacity of the Source->A edges equal to infinity and the capacity of all other edges equal to 1.
This looks like the following:
The red edges have Cost=1, Capacity=Inf. The blue edges have Cost=0, Capacity=1.
Now, solving the minimum-cost flow problem becomes equivalent to using as few red edges as possible. Any red edge that isn't used allocates 0 flow to its corresponding A node and that node can be removed from the graph. Conversely, each B node can only pass 1 unit of flow to the sink, so all B nodes must be preserved in order for the problem to be solved.
Since we've recast your problem into this standard form, we can leverage existing tools to get a solution; namely, Google's Operations Research tools (OR-Tools).
Doing so gives the following answer to the above graph:
The red edges are unused and the black edges are used. Note that if a red edge emerges from the source the A-node it connects to generates no black edges. Note also that each B-node has at least one in-coming black edge. This satisfies the constraints you posed.
We can now detect the A-nodes to be removed by looking for Source->A edges with zero usage.
Source Code
The source code necessary to generate the foregoing figures and associated solutions is as follows:
#!/usr/bin/env python3
#Documentation: https://developers.google.com/optimization/flow/mincostflow
#Install dependencies: pip3 install ortools networkx matplotlib
#Note: pywrapgraph is the interface of older ortools releases; newer releases may need ortools.graph.python.min_cost_flow instead
from ortools.graph import pywrapgraph
import matplotlib.pyplot as plt
import networkx as nx
import random
import sys

def GenerateGraph(Acount,Bcount):
    assert Acount>5
    assert Bcount>5
    G = nx.DiGraph() #Directed graph
    source_node = Acount+Bcount
    sink_node = source_node+1
    for a in range(Acount):
        for i in range(random.randint(int(0.2*Bcount),int(0.3*Bcount))): #Connect to 20-30% of the B nodes
            b = Acount+random.randint(0,Bcount-1) #Index in the half-open range [0,Bcount). Offset from A's indices
            G.add_edge(source_node, a, capacity=99999, unit_cost=1, usage=1)
            G.add_edge(a, b, capacity=1, unit_cost=0, usage=1)
            G.add_edge(b, sink_node, capacity=1, unit_cost=0, usage=1)
            G.nodes[a]['type'] = 'A'
            G.nodes[b]['type'] = 'B'
    G.nodes[source_node]['type'] = 'source'
    G.nodes[sink_node]['type'] = 'sink'
    G.nodes[source_node]['supply'] = Bcount
    G.nodes[sink_node]['supply'] = -Bcount
    return G

def VisualizeGraph(graph, color_type):
    Acount = len([1 for p,d in graph.nodes(data=True) if d['type']=='A'])
    Bcount = len([1 for p,d in graph.nodes(data=True) if d['type']=='B'])
    if color_type=='usage':
        edge_color = ['black' if d['usage']>0 else 'red' for u,v,d in graph.edges(data=True)]
    elif color_type=='unit_cost':
        edge_color = ['red' if d['unit_cost']>0 else 'blue' for u,v,d in graph.edges(data=True)]
    #Lay the nodes out in four columns: source, A nodes, B nodes, sink
    Ai = 0
    Bi = 0
    pos = dict()
    for p,d in graph.nodes(data=True):
        if d['type']=='source':
            pos[p] = (0, Acount/2)
        elif d['type']=='sink':
            pos[p] = (3, Bcount/2)
        elif d['type']=='A':
            pos[p] = (1, Ai)
            Ai += 1
        elif d['type']=='B':
            pos[p] = (2, Bi)
            Bi += 1
    nx.draw(graph, pos=pos, edge_color=edge_color, arrows=False)
    plt.show()

def GenerateMinCostFlowProblemFromGraph(graph):
    min_cost_flow = pywrapgraph.SimpleMinCostFlow()
    #One arc per graph edge, carrying its capacity and unit cost
    for node,neighbor,data in graph.edges(data=True):
        min_cost_flow.AddArcWithCapacityAndUnitCost(node, neighbor, data['capacity'], data['unit_cost'])
    #Only the source and sink have a non-zero supply
    for p, d in graph.nodes(data=True):
        if (d['type']=='source' or d['type']=='sink') and 'supply' in d:
            min_cost_flow.SetNodeSupply(p, d['supply'])
    return min_cost_flow

def ColorGraphEdgesByUsage(graph, min_cost_flow):
    #Copy the solved flow on each arc back onto the graph's edges
    for i in range(min_cost_flow.NumArcs()):
        graph[min_cost_flow.Tail(i)][min_cost_flow.Head(i)]['usage'] = min_cost_flow.Flow(i)

def main():
    """Build a random A/B graph, solve the min-cost flow problem, and plot the result."""
    Acount = 20
    Bcount = 20
    graph = GenerateGraph(Acount, Bcount)
    VisualizeGraph(graph, 'unit_cost')
    min_cost_flow = GenerateMinCostFlowProblemFromGraph(graph)
    # Find the minimum cost flow from the source to the sink.
    if min_cost_flow.Solve() != min_cost_flow.OPTIMAL:
        print('Unable to find a solution! It is likely that one does not exist for this input.')
        sys.exit(-1)
    print('Minimum cost:', min_cost_flow.OptimalCost())
    ColorGraphEdgesByUsage(graph, min_cost_flow)
    VisualizeGraph(graph, 'usage')

if __name__ == '__main__':
    main()
Although this is an old question, I see it has not been correctly answered yet.
An analogous question to this one has also been answered earlier in this post.
The problem you are presenting here is indeed the Minimum Set Cover Problem, which is one of the well-known NP-hard problems. From Wikipedia, the Minimum Set Cover Problem can be formulated as:
Given a set of elements {1,2,...,n} (called the universe) and a collection S of m sets whose union equals the universe, the set cover problem is to identify the smallest sub-collection of S whose union equals the universe. For example, consider the universe U={1,2,3,4,5} and the collection of sets S={{1,2,3},{2,4},{3,4},{4,5}}. Clearly the union of S is U. However, we can cover all of the elements with the following, smaller number of sets: {{1,2,3},{4,5}}.
In your formulation, B nodes represent the elements of the universe, A nodes represent the sets, and edges between A nodes and B nodes determine which elements (B nodes) belong to each set (A node). The minimum set cover is then the smallest set of A nodes that together are connected to all B nodes. Consequently, the A nodes that can be removed from the graph while keeping every B node connected are exactly those that do not belong to the minimum set cover.
Since the problem is NP-hard, no polynomial-time algorithm for computing the optimum is known, but a simple greedy algorithm can efficiently provide approximate solutions with tight bounds relative to the optimum. From Wikipedia:
There is a greedy algorithm for polynomial time approximation of set covering that chooses sets according to one rule: at each stage, choose the set that contains the largest number of uncovered elements.
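A minimal Python sketch of that greedy rule, assuming the graph is given as a dictionary that maps each A node to the set of B nodes it is connected to. The function name and the node labels `A1`..`A4` are made up for this example; the data is the universe from the Wikipedia quote above.

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset that covers the most uncovered elements.

    universe: iterable of elements (the B nodes)
    subsets:  dict mapping a name (an A node) to the set of elements it covers
    Returns the list of chosen names; every other A node can be removed.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("some elements cannot be covered by any subset")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


subsets = {"A1": {1, 2, 3}, "A2": {2, 4}, "A3": {3, 4}, "A4": {4, 5}}
keep = greedy_set_cover({1, 2, 3, 4, 5}, subsets)
print(keep)                               # → ['A1', 'A4']
print(sorted(set(subsets) - set(keep)))   # → ['A2', 'A3']
```

The result is an approximation: the greedy choice is not guaranteed to find the minimum cover on every input.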

minimum weight vertex cover of a tree

There's an existing question dealing with trees where the weight of a vertex is its degree, but I'm interested in the case where the vertices can have arbitrary weights.
This isn't homework but it is one of the questions in the algorithm design manual, which I'm currently reading; an answer set gives the solution as
1. Perform a DFS, at each step update Score[v][include], where v is a vertex and include is either true or false;
2. If v is a leaf, set Score[v][false] = 0, Score[v][true] = wv, where wv is the weight of vertex v.
3. During DFS, when moving up from the last child of the node v, update Score[v][include]:
   Score[v][false] = Sum for c in children(v) of Score[c][true] and Score[v][true] = wv + Sum for c in children(v) of min(Score[c][true]; Score[c][false])
4. Extract actual cover by backtracking Score.
However, I can't actually translate that into something that works. (In response to the comment: what I've tried so far is drawing some smallish graphs with weights and running through the algorithm on paper, up until step four, where the "extract actual cover" part is not transparent.)
In response to Ali's answer: So suppose I have this graph (A is connected to B and E; B is connected to C and D), with the vertices given by A etc. and the weights in parens after:
A(9)---B(3)---C(2)
 \       \
 E(1)    D(4)
The right answer is clearly {B,E}.
Going through this algorithm, we'd set values like so:
score[D][false] = 0; score[D][true] = 4
score[C][false] = 0; score[C][true] = 2
score[B][false] = 6; score[B][true] = 3
score[E][false] = 0; score[E][true] = 1
score[A][false] = 4; score[A][true] = 12
Ok, so, my question is basically, now what? Doing the simple thing and iterating through the score vector and deciding what's cheapest locally doesn't work; you only end up including B. Deciding based on the parent and alternating also doesn't work: consider the case where the weight of E is 1000; now the correct answer is {A,B}, and they're adjacent. Perhaps it is not supposed to be confusing, but frankly, I'm confused.
There's no actual backtracking done (or needed). The solution uses dynamic programming to avoid backtracking, since that'd take exponential time. My guess is "backtracking Score" means the Score contains the partial results you would get by doing backtracking.
A vertex cover of a tree may include both alternating and adjacent vertices. What it may not do is exclude two adjacent vertices, because it must cover all of the edges.
The answer lies in the way the Score is recursively calculated. The cost of not including a vertex is the cost of including all of its children. The cost of including a vertex, however, is its own weight plus, for each child, whichever is cheaper: including that child or not including it, because both are allowed.
As your solution suggests, the scores can be computed with a DFS in post-order. Extracting the cover then takes a second, top-down pass: the root takes whichever option is cheaper; a child of an excluded vertex must be included, otherwise we'd be excluding two adjacent vertices; and a child of an included vertex takes whichever of its own two options is cheaper.
Here's some pseudocode:
compute_scores(v)
    for c in children(v)
        compute_scores(c)
    Score[v][false] = Sum for c in children(v) of Score[c][true]
    Score[v][true] = v weight + Sum for c in children(v) of min(Score[c][true]; Score[c][false])
extract_cover(v, include)
    if include then
        add v to vertex cover
    for c in children(v)
        if not include or Score[c][true] < Score[c][false] then
            extract_cover(c, true)
        else
            extract_cover(c, false)
compute_scores(root)
extract_cover(root, Score[root][true] < Score[root][false])
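For readers who want to run this, here is a minimal Python sketch of the score computation followed by a top-down extraction of the cover. It assumes the tree is given as a dictionary from each node to its list of children, and it extracts the cover in a separate pass so that a child's decision can depend on whether its parent was included. The function name and the data layout are choices made for this sketch.

```python
def min_weight_vertex_cover(children, weight, root):
    """Return (cost, cover) for a rooted tree.

    children: dict node -> list of child nodes (leaves may be omitted)
    weight:   dict node -> non-negative weight
    """
    # Visit parents before children, then process in reverse (post-order).
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    score = {}
    for v in reversed(order):
        kids = children.get(v, [])
        excluded = sum(score[c][True] for c in kids)
        included = weight[v] + sum(min(score[c].values()) for c in kids)
        score[v] = {False: excluded, True: included}

    # Top-down: a child of an excluded vertex is forced into the cover;
    # a child of an included vertex takes its cheaper option.
    cover = set()
    stack = [(root, score[root][True] < score[root][False])]
    while stack:
        v, include = stack.pop()
        if include:
            cover.add(v)
        for c in children.get(v, []):
            child_in = (not include) or score[c][True] < score[c][False]
            stack.append((c, child_in))
    return min(score[root].values()), cover


children = {"A": ["B", "E"], "B": ["C", "D"]}
weight = {"A": 9, "B": 3, "C": 2, "D": 4, "E": 1}
print(min_weight_vertex_cover(children, weight, "A"))
```

On the example tree from the question this yields a cost of 4 with the cover {B, E}; raising the weight of E to 1000 yields a cost of 12 with {A, B}.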
It didn't actually mean anything confusing; it is just dynamic programming, and you seem to understand almost all of the algorithm. To make it clearer:
First perform a DFS on your graph and find the leaves.
For every leaf, assign values as the algorithm says.
Now start from the leaves and assign values to each leaf's parent using that formula.
Keep assigning values to the parents of nodes that already have values until you reach the root of your graph.
That is all there is to it. By "backtracking", the algorithm means that you assign a value to each node whose children already have values. As I said above, this way of solving a problem is called dynamic programming.
Edit, to address the changes in your question: you have the following graph, and the answer is clearly B and E, but you thought this algorithm only gives you B. That is incorrect; this algorithm gives you B and E.
A(9)---B(3)---C(2)
 \       \
 E(1)    D(4)
score[D][false] = 0; score[D][true] = 4
score[C][false] = 0; score[C][true] = 2
score[B][false] = 6 (this means we use C and D); score[B][true] = 3 (this means we use B)
score[E][false] = 0; score[E][true] = 1
score[A][false] = 4 (this means we use B and E); score[A][true] = 12 (this means we use B and A)
You select 4, so you must use B and E. If it were just B, your answer would be 3, but as you correctly found, your answer is 4 = 3 + 1 = B + E.
Also, when E = 1000:
A(9)---B(3)---C(2)
 \       \
 E(1000) D(4)
it is 100% correct that the answer is B and A, because it is wrong to use E just because you don't want to select adjacent nodes. With this algorithm you will find that the answer is A and B, and you can confirm it just by checking. Consider these covers:
C D A = 15
B E = 1003
A B = 12
Although the first two covers have no adjacent nodes, they cost more than the last one, which does have adjacent nodes. So it is best to use A and B for the cover.

Pathfinding while forcing unique node attributes -- which algorithm should I use?

Update 2011-12-28: Here's a blog post with a less vague description of the problem I was trying to solve, my work on it, and my current solution: Watching Every MLB Team Play A Game
I'm trying to solve a kind of strange pathfinding challenge. I have an acyclic directional graph, and every edge has a distance value. And I want to find a shortest path. Simple, right? Well, there are a couple of reasons I can't just use Dijkstra's or A*.
1. I don't care at all what the starting node of my path is, nor the ending node. I just need a path that includes exactly 10 nodes. But:
2. Each node has an attribute, let's say it's color. Each node has one of 20 different possible colors.
3. The path I'm trying to find is the shortest path with exactly 10 nodes, where each node is a different color. I don't want any of the nodes in my path to have the same color as any other node.
4. It'd be nice to be able to force my path to have one value for one of the attributes ("at least one node must be blue", for instance), but that's not really necessary.
This is a simplified example. My full data set actually has three different attributes for each node that must all be unique, and I have 2k+ nodes each with an average of 35 outgoing edges. Since getting a perfect "shortest path" may be exponential or factorial time, an exhaustive search is really not an option. What I'm really looking for is some approximation of a "good path" that meets the criterion under #3.
Can anyone point me towards an algorithm that I might be able to use (even modified)?
Some stats on my full data set:
Total nodes: 2430
Total edges: 86524
Nodes with no incoming edges: 19
Nodes with no outgoing edges: 32
Most outgoing edges: 42
Average edges per node: 35.6 (in each direction)
Due to the nature of the data, I know that the graph is acyclic
And in the full data set, I'm looking for a path of length 15, not 10
This is a case where the question actually contains most of the answer.
Do a breadth-first search starting from all root nodes. When the number of paths being searched in parallel exceeds some limit, drop the longest paths. Path length may be weighted: the most recent edges may have weight 10, and edges passed 9 hops ago weight 1. It is also possible to assign a lower weight to all paths that have the preferred attribute, or to paths going through weakly connected nodes. Store the last 10 nodes of each path in a hash table to avoid duplication. And keep track of the minimum sum of the last 9 edge lengths, along with the corresponding shortest path.
If the number of possible values is low, you can use the Floyd-Warshall algorithm with a slight modification: for each path you store a bitmap that represents the different values already visited. (In your case the bitmap will be 20 bits wide per path.)
Then when you perform the length comparison, you also AND your bitmaps to check whether it's a valid path and if it is, you OR them together and store that as the new bitmap for the path.
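The bitmap check can be sketched in Python with plain integers as bit sets. This is only an illustration of the AND/OR step, not a full shortest-path implementation; it assumes colors are numbered from 0 and that the two partial paths being joined do not both count the shared middle node. The helper names are made up.

```python
def color_mask(colors):
    """Encode a collection of color numbers (0-based) as an integer bitmap."""
    mask = 0
    for c in colors:
        mask |= 1 << c
    return mask


def try_join(mask_a, mask_b):
    """Return the combined bitmap, or None if the two paths share a color."""
    if mask_a & mask_b:      # AND: any color used by both paths?
        return None
    return mask_a | mask_b   # OR: colors used by the joined path


left = color_mask([0, 1])    # a path using colors 0 and 1
right = color_mask([2])      # a path using color 2
print(try_join(left, right))             # → 7
print(try_join(left, color_mask([1])))   # → None
```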
Have you tried a straightforward approach and failed? From your description of the problem, I see no reason why a simple greedy algorithm like depth-first search wouldn't be just fine:
Pick a start node.
Check the immediate neighbors, are there any nodes that are ok to append to the path? Expand the path with one of them and repeat the process for that node.
If you fail, backtrack to the last successful state and try a new neighbor.
If you run out of neighbors to check, this node cannot be the start node of a path. Try a new one.
If you have 10 nodes, you're done.
A good heuristic for picking a start node is hard to give without any knowledge about how the attributes are distributed, but it may be beneficial to try nodes with high degree first.
It looks like a greedy depth-first search will be your best bet. With a reasonable distribution of attribute values, I think finding a single valid sequence is E[O(1)] time, that is, expected constant time. I could probably prove that, but it might take some time. The proof would use the assumption that there is a non-zero probability that a valid next segment of the sequence could be found at every step.
The greedy search would backtrack whenever the unique attribute value constraint is violated. The search stops when a 15-segment path is found. If we accept my hunch that each sequence can be found in E[O(1)], then it is a matter of determining how many parallel searches to undertake.
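The depth-first search with backtracking described in the two answers above can be sketched as follows. This is a minimal sketch that finds one valid path and ignores edge distances; the data layout (a successor dictionary and a color dictionary) and the function name are assumptions made for the example.

```python
def find_path(graph, color, length):
    """Return one path with `length` nodes of pairwise distinct colors, or None.

    graph: dict node -> iterable of successor nodes (assumed acyclic)
    color: dict node -> color, with an entry for every node
    """
    def extend(path, used):
        if len(path) == length:
            return path
        for nxt in graph.get(path[-1], ()):
            if color[nxt] not in used:
                found = extend(path + [nxt], used | {color[nxt]})
                if found:
                    return found
        return None  # dead end: the caller backtracks and tries another neighbor

    for start in color:  # try every node as a start node
        found = extend([start], {color[start]})
        if found:
            return found
    return None


graph = {1: [2, 3], 2: [4], 3: [4], 4: []}
color = {1: "red", 2: "red", 3: "green", 4: "blue"}
print(find_path(graph, color, 3))  # → [1, 3, 4]
```

To approximate a short path, one could visit neighbors in order of increasing edge cost and keep the best of several runs.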
For those who want to experiment, here is a (postgres) sql script to generate some fake data.
SET search_path='tmp';
-- DROP TABLE nodes CASCADE;
CREATE TABLE nodes
( num INTEGER NOT NULL PRIMARY KEY
, color INTEGER
-- Redundant fields to flag {begin,end} of paths
, is_root boolean DEFAULT false
, is_terminal boolean DEFAULT false
);
-- DROP TABLE edges CASCADE;
CREATE TABLE edges
( numfrom INTEGER NOT NULL REFERENCES nodes(num)
, numto INTEGER NOT NULL REFERENCES nodes(num)
, cost INTEGER NOT NULL DEFAULT 0
);
-- Generate some nodes, set color randomly
INSERT INTO nodes (num)
SELECT n
FROM generate_series(1,2430) n
WHERE 1=1
;
UPDATE nodes SET COLOR= 1+TRUNC(20*random() );
-- (partial) cartesian product nodes*nodes. The ordering guarantees a DAG.
INSERT INTO edges(numfrom,numto,cost)
SELECT n1.num ,n2.num, 0
FROM nodes n1 ,nodes n2
WHERE n1.num < n2.num
AND random() < 0.029
;
UPDATE edges SET cost = 1+ 1000 * random();
ALTER TABLE edges
ADD PRIMARY KEY (numfrom,numto)
;
ALTER TABLE edges
ADD UNIQUE (numto,numfrom)
;
-- A root has no incoming edges
UPDATE nodes no SET is_root = true
WHERE NOT EXISTS (
SELECT * FROM edges ed
WHERE ed.numto = no.num
);
-- A terminal has no outgoing edges
UPDATE nodes no SET is_terminal = true
WHERE NOT EXISTS (
SELECT * FROM edges ed
WHERE ed.numfrom = no.num
);
SELECT COUNT(*) AS nnode FROM nodes;
SELECT COUNT(*) AS nedge FROM edges;
SELECT color, COUNT(*) AS cnt FROM nodes GROUP BY color ORDER BY color;
SELECT COUNT(*) AS nterm FROM nodes no WHERE is_terminal = true;
SELECT COUNT(*) AS nroot FROM nodes no WHERE is_root = true;
WITH zzz AS (
SELECT numto, COUNT(*) AS fanin
FROM edges
GROUP BY numto
)
SELECT zzz.fanin , COUNT(*) AS cnt
FROM zzz
GROUP BY zzz.fanin
ORDER BY zzz.fanin
;
WITH zzz AS (
SELECT numfrom, COUNT(*) AS fanout
FROM edges
GROUP BY numfrom
)
SELECT zzz.fanout , COUNT(*) AS cnt
FROM zzz
GROUP BY zzz.fanout
ORDER BY zzz.fanout
;
COPY nodes(num,color,is_root,is_terminal)
TO '/tmp/nodes.dmp';
COPY edges(numfrom,numto, cost)
TO '/tmp/edges.dmp';
The problem may be solved by dynamic programming as follows. Let's start by formally defining its solution.
Given a DAG G = (V, E), let C be the set of colors of vertices visited so far, and let w[i, j] and c[i] be, respectively, the weight (distance) associated with edge (i, j) and the color of vertex i. Note that w[i, j] is zero if the edge (i, j) does not belong to E.
Now define the distance d for going from vertex i to vertex j taking into account C as
d[i, j, C] = w[i, j]    if i is not equal to j and c[j] does not belong to C
           = 0          if i = j
           = infinite   if i is not equal to j and c[j] belongs to C
We are now ready to define our subproblems as follows:
A[i, j, k, C] = shortest path from i to j that uses exactly k edges and respects the colors in C so that no two vertices in the path are colored using the same color (one of the colors in C)
Let m be the maximum number of edges permitted in the path and assume that the vertices are labeled 1, 2, ..., n. Let P[i,j,k] be the predecessor vertex of j in the shortest path satisfying the constraints from i to j. The following algorithm solves the problem.
for k = 1 to m
    for i = 1 to n
        for j = 1 to n
            A[i,j,k,C] = min over x belonging to V {d[i,x,C] + A[x,j,k-1,C union c[x]]}
            P[i,j,k] = the vertex x that minimized A[i,j,k,C] in the previous statement
Set the initial conditions as follows:
A[i,j,k,C] = 0 for k = 0
A[i,j,k,C] = 0 if i is equal to j
A[i,j,k,C] = infinite in all of the other cases
The overall computational complexity of the algorithm is O(m n^3); taking into account that in your particular case m = 14 (since you want exactly 15 nodes), it follows that m = O(1), so the complexity actually is O(n^3). To represent the set C, use a hash table so that insertion and membership testing require O(1) on average. Note that in the algorithm the operation C union c[x] is actually an insert operation in which you add the color of vertex x to the hash table for C. However, since you are inserting just one element, the set union operation leads to exactly the same result (if the color is not in the set, it is added; otherwise, it is simply discarded and the set does not change). Finally, to represent the DAG, use the adjacency matrix.
Once the algorithm is done, to find the shortest path among all possible vertices i and j, simply find the minimum among the values A[i,j,m,C]. Note that if this value is infinite, then no valid path exists. If a valid shortest path exists, you can determine it by using the P[i,j,k] values and tracing backwards through predecessor vertices. For instance, starting from a = P[i,j,m], the last edge on the shortest path is (a,j); the previous edge is given by b = P[i,a,m-1] and is (b,a), and so on.

Seating people in a movie theater

This is based on an article I read about puzzles and interview questions asked by large software companies, but it has a twist...
General question:
What is an algorithm to seat people in a movie theater so that they sit directly beside their friends but not beside their enemies?
Technical question:
Given an N by M grid, fill the grid with N * M - 1 items. Each item has an associated Boolean value for each of the other N * M - 2 items. In each row of N, items directly beside other items should have a positive association value for each other. Columns, however, do not matter, i.e. an item can be "enemies" with the item in front of it. Note: If item A has a positive association value for B, then B also has a positive association value for A. It works the same way for negative association values. An item is guaranteed to have a positive association with at least one other item. Also, you have access to all of the items and their association values before you start placing them in the grid.
Comments:
I have been researching this problem and thinking about it since yesterday. From what I have found, it reminds me of the bin packing problem with some added requirements. In some free time I attempted to implement it, but large groups of "enemies" were sitting next to each other. I am sure that most situations will have to have at least one pair of enemies sitting next to each other, but my solution was far from optimal. It actually looked as if I had just randomized it.
As far as my implementation went, I made N = 10, M = 10, the number of items = 99, and gave EACH item an array of size 99 holding a randomized Boolean value for its friendship with the corresponding item number. This means that each item also had a friendship value for itself; I just ignored that value.
I plan on trying to reimplement this again later and I will post the code. Can anyone figure out a "good" way to do this to minimize seating clashes between enemies?
This problem is NP-Hard.
Define L = {(G,n,m) | there is a legal seating for G in an n×m matrix, where (u,v) is in E if u is a friend of v}. L is a formal definition of this problem as a language.
Proof:
We will show Hamiltonian-Path ≤ (p) 2-path ≤ (p) This-problem in 2 steps [Hamiltonian Path and 2-path are defined below], and thus conclude that this problem is NP-Hard.
(1) We will show that finding two paths that together cover all vertices without using any vertex twice is NP-Hard [let's call such a pair of paths a 2-path, and this problem the 2-path problem].
A reduction from the Hamiltonian Path problem:
input: a graph G=(V,E)
output: a graph G'=(V',E) where V' = V U {u₀}.
Correctness:
    if G has a Hamiltonian path v₁→v₂→...→vₙ, then G' has a 2-path:
    v₁→v₂→...→vₙ, u₀
    if G' has a 2-path, then since u₀ is isolated from the rest of the vertices, there is a
    path v₁→...→vₙ, which is Hamiltonian in G.
Thus: G has a Hamiltonian path ⇔ G' has a 2-path, and thus the 2-path problem is NP-Hard.
(2) We will now show that our problem [L] is also NP-Hard:
We will show a reduction from the 2-path problem, defined above.
input: a graph G=(V,E)
output: (G,|V|+1,1) [a long row with |V|+1 seats].
Correctness:
    If G has a 2-path, then we can seat the people and use the 1-seat gap
    as a 'buffer' between the two paths. This will be a legal perfect seating,
    since if v₁ is sitting next to v₂, then v₁→v₂ is in the path, and thus
    (v₁,v₂) is in E, so v₁,v₂ are friends.
    If (G,|V|+1,1) has a legal seating [v₁,...,vₖ,buffer,vₖ₊₁,...,vₙ], there is a 2-path in G:
    v₁→...→vₖ, vₖ₊₁→...→vₙ
Conclusion: This problem is NP-Hard, so there is no known polynomial-time solution for it.
Exponential solution:
You might want to use a backtracking solution, which basically creates all subsets of E with size |V|-2 or less and checks which is best.
static best <- infinity
least_enemies(G,used):
    if |used| <= |V|-2:
        val <- evaluate(used)
        best <- min(best,val)
    if |used| == |V|-2:
        return
    for each edge e in E-used: //E without used
        least_enemies(G,used + e)
Here we assume evaluate(used) gives the 'score' for this solution. If the solution is completely illegal [i.e. a vertex appears twice], evaluate(used)=infinity. An optimization can of course be made by trimming these cases. To get the actual seating, we can store the currently best solution.
(*) There are probably better solutions; this is just a simple possible solution to start with. The main aim of this answer is proving that this problem is NP-Hard.
EDIT: simpler solution:
Create a graph G'=(V U { u₀ } ,E U {(u₀,v),(v,u₀) | for each v in V}) [u₀ is a junk vertex for the buffer] and a weight function for edges:
w((u,v)) = 1 if u is a friend of v
w((u,v)) = 2 if u is an enemy of v
w((u₀,v)) = w((v,u₀)) = 0
Now you've got yourself a classic TSP, which can be solved in O(|V|^2 * 2^|V|) using dynamic programming.
Note that this solution [using TSP] is for a single-row theatre, but it might be a good lead for finding a solution to the general case.
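A minimal Python sketch of that dynamic program (the Held-Karp bitmask technique) for the single-row case. Because the junk vertex has weight 0 on all of its edges, the tour through it is equivalent to the cheapest path that visits every person once, which is what the sketch computes directly. The function name, the 0-based numbering of people, and the representation of friendships as a set of frozenset pairs are assumptions made for this example; it is only practical for small groups (roughly 20 people or fewer).

```python
def seat_one_row(n, friends):
    """Return (total_weight, order) for the cheapest single-row seating.

    n:       number of people, numbered 0..n-1 (n >= 1)
    friends: set of frozenset({u, v}) pairs that are friends
    Adjacent friends weigh 1 and adjacent enemies weigh 2, as above.
    """
    def w(u, v):
        return 1 if frozenset((u, v)) in friends else 2

    INF = float("inf")
    full = (1 << n) - 1
    # cost[mask][last]: cheapest row seating exactly the people in `mask`
    # and ending with `last`. Starting anywhere is free (junk-vertex edge).
    cost = [[INF] * n for _ in range(1 << n)]
    prev = [[None] * n for _ in range(1 << n)]
    for v in range(n):
        cost[1 << v][v] = 0
    for mask in range(1 << n):
        for last in range(n):
            if cost[mask][last] == INF:
                continue
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue  # already seated
                new_mask = mask | (1 << nxt)
                c = cost[mask][last] + w(last, nxt)
                if c < cost[new_mask][nxt]:
                    cost[new_mask][nxt] = c
                    prev[new_mask][nxt] = last

    last = min(range(n), key=lambda v: cost[full][v])
    total = cost[full][last]
    order, mask = [], full
    while last is not None:  # walk the predecessor links back to the start
        order.append(last)
        mask, last = mask ^ (1 << last), prev[mask][last]
    return total, order[::-1]


friends = {frozenset(p) for p in [(0, 1), (1, 2), (2, 3)]}
total, order = seat_one_row(4, friends)
print(total, order)
```

The number of adjacent enemy pairs in the result is `total - (n - 1)`, since every adjacent pair contributes at least 1.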
One algorithm used for large "search spaces" such as this is simulated annealing.

Dynamic programming question

I am stuck on one of my algorithm homework problems. Can anyone give me a hint to solve it? Here is the question:
Consider a chain-structured computation represented by a weighted graph G = (V, E), where V = {v_1, v_2, ..., v_n} and E = {(v_i, v_{i+1}) such that 1 <= i <= n-1}. We are also given a chain of m identical processors P = {P_1, ..., P_m} (i.e., there exists a communication link between P_k and P_{k+1} for 1 <= k <= m-1).
The set of vertices V represents computation modules, and the set of edges E represents communication between two modules. Each node v_i is assigned a weight w_i denoting the execution time of the module on a single processor. Each edge (v_i, v_{i+1}) is assigned a weight c_i denoting the amount of communication time between the two modules if they are assigned to two different processors. If multiple modules are assigned to the same processor, the modules assigned to the same processor must be consecutive. Suppose modules v_a, v_{a+1}, ..., v_b are assigned to processor P_k. Then the time taken by P_k, denoted by T_k, is the time to compute the assigned modules plus the time to communicate with the neighboring processors. Hence, T_k = w_a + ... + w_b + c_{a-1} + c_b. Note here that c_{a-1} = 0 if a = 1 and c_b = 0 if b = n.
The objective of the problem is to find an assignment of V to P such that max_{1<=k<=m} T_k is minimized, where we assume that each processor must take at least one module. (This assumption can be relaxed by adding m dummy modules with zero weight on computational and communication time.)
Develop a dynamic programming algorithm to solve this problem in polynomial time (i.e., O(mn)).
I tried to find the minimum execution time for each Pk and then find the max, but I doubt my solution is dynamic programming since there is no recursive formula. Please give me some hints!
Thanks!
I think you might be able to modify the Viterbi algorithm to solve this problem.
okay. this is easy.
Decompose your problem into a function you need to minimise, say F(j,k), which gives the minimum cost of assigning the first j nodes to the first k processors.
Then derive your formula like this, where i is the number of nodes assigned to the first k-1 processors, so that the kth processor takes nodes i+1..j:
F(j,k) = min[i=k-1..j-1]( max(F(i,k-1), w[i+1]+...+w[j]+c[i]+c[j]) )
c[0] = c[n] = 0
F(0,0) = 0
F(j,0) = inf for j > 0
F(0,k) = inf for k > 0
The answer is F(n,m).
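One way to turn this recurrence into code is sketched below in Python. The sketch assumes the k-th processor takes modules i+1..j after the first k-1 processors took the first i modules, uses prefix sums for the module weights, and treats the communication cost before the first module and after the last one as zero. The function name and the 0-based input lists are choices made for this example. As written it runs in O(m n^2) time, not the O(mn) that the assignment asks for.

```python
def min_max_load(w, c, m):
    """Minimise the largest processor time T_k.

    w: execution times of the n modules, in chain order
    c: the n-1 communication times between consecutive modules
    m: number of processors, each taking at least one module (m <= n)
    """
    n = len(w)
    prefix = [0]
    for x in w:
        prefix.append(prefix[-1] + x)
    # cut[i] = communication time across the boundary after the first i modules
    cut = [0] + list(c) + [0]  # cut[0] = cut[n] = 0

    INF = float("inf")
    # F[j][k] = best achievable maximum when the first j modules use k processors
    F = [[INF] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                # Processor k takes modules i+1..j
                load = prefix[j] - prefix[i] + cut[i] + cut[j]
                F[j][k] = min(F[j][k], max(F[i][k - 1], load))
    return F[n][m]


print(min_max_load([2, 3, 4], [1, 5], 2))  # → 8
```

In the example, splitting after the first module gives loads 2+1 = 3 and 3+4+1 = 8, which beats splitting after the second module (loads 10 and 9).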
