Algorithm to detect and remove least number of inconsistent facts (probably in PROLOG)?

How do you program the following algorithm?
Imagine a list of "facts" like this, where the letters represent variables bound to numeric values:
x = 1
y = 2
z = 3
a = 1
b = 2
c = 3
a = x
b = z
c = z
These "facts" clearly must not all be "true": it's not possible for b=z while b=2 and z=3. If either b=z or b=2 were removed, all facts would be consistent. If z=3 and either b=z or c=z were removed, then all facts would be consistent but there would be one fewer fact than above. This set contains many such consistent subsets. For instance, a=1, b=2, c=3 is a consistent subset, and so are many others.
Two consistent subsets are larger than any other consistent subset in this example:
x = 1
y = 2
z = 3
a = 1
b = 2
c = 3
a = x
c = z
and
x = 1
y = 2
z = 3
a = 1
c = 3
a = x
b = z
c = z
Using an appropriate programming language (I'm thinking PROLOG, but maybe I'm wrong), how would you process a large set containing consistent and inconsistent facts, and then output the largest possible subset of consistent facts (or multiple subsets, as in the example above)?

This is very closely related to the NP-hard multiway cut problem. In the (unweighted) multiway cut problem, we have an undirected graph and a set of terminal vertices. The goal is to remove as few edges as possible so that each terminal vertex lies in its own connected component.
For this problem, we can interpret each variable and each constant as a vertex, and each equality as an edge from its left-hand side to its right-hand side. The terminal vertices are those associated with constants.
For only two terminals, the multiway cut problem is the polynomial-time solvable s-t minimum cut problem. We can use minimum cuts to get a polynomial-time 2-approximation to the multiway cut problem, by finding the cheapest cut separating two terminals, deleting the edges involved, and then recursing on the remaining connected components. Several approximation algorithms with better ratios have been proposed in the theoretical literature on multiway cut.
Practically speaking, multiway cut arises in applications to computer vision, so I would expect that there has been some work on obtaining exact solutions. I don't know what's out there, though.
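To make this concrete, here is a minimal sketch of the isolating-cut heuristic, a close relative of the recursive scheme just described that carries the classic (2 - 2/k)-approximation guarantee. It assumes networkx and unit capacities; the function name and the "sink" label are illustrative, not an established API:

import networkx as nx

def isolating_cut_heuristic(G, terminals):
    # For each terminal, find a minimum cut separating it from all other
    # terminals (merged into one super-sink), then keep every isolating
    # cut except the largest; the union is a valid multiway cut.
    cuts = []
    for t in terminals:
        H = G.copy()
        for u in terminals:
            if u != t:
                H.add_edge(u, "sink", capacity=float("inf"))
        cut_value, (reach, _) = nx.minimum_cut(H, t, "sink")
        cuts.append({frozenset(e) for e in G.edges
                     if (e[0] in reach) != (e[1] in reach)})
    cuts.remove(max(cuts, key=len))  # dropping the largest cut gives 2 - 2/k
    return set().union(*cuts)

# The question's facts, with the constants 1, 2, 3 as terminal vertices:
G = nx.Graph()
G.add_edges_from([("x", 1), ("y", 2), ("z", 3), ("a", 1), ("b", 2),
                  ("c", 3), ("a", "x"), ("b", "z"), ("c", "z")], capacity=1)
print(isolating_cut_heuristic(G, [1, 2, 3]))  # e.g. {frozenset({'b', 'z'})}

On this instance the heuristic happens to be optimal: it drops a single fact (b=z or b=2), matching the two maximum consistent subsets shown in the question.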

Prolog could serve as a convenient implementation language, but some thought about algorithms suggests that a specialized approach may be advantageous.
Among statements of these kinds (equalities between two variables, or between one variable and one constant), the only inconsistency that can arise is a path connecting two distinct constants.
Thus, once we find all "inconsistency" paths connecting pairs of distinct constants, it is necessary and sufficient to find a set of edges (original equalities) whose removal disconnects all of those paths.
It is tempting to hope a greedy algorithm will do well here: always select for removal an edge common to the largest number of remaining "inconsistency" paths. Thus I propose:
1) Find all simple paths P connecting two different constants (without passing through any third constant), and construct a linkage structure between these paths and their edges.
2) Count the frequencies of edges E appearing along those "inconsistency" paths P.
3) Find a sufficient number of edges to remove by pursuing a greedy strategy, removing the next edge that appears most frequently and updating the counts of edges in remaining paths accordingly.
4) Given that upper bound on edges necessary to remove (to leave a consistent subset of statements), apply a backtracking strategy to determine whether any smaller number of edges would suffice.
As applied to the example in the Question, there prove to be exactly two "inconsistency" paths:
2 -- b -- z -- 3
2 -- b -- z -- c -- 3
Removing either of the two edges, 2 -- b or b -- z, common to both of these paths suffices to disconnect both "inconsistency" paths (removing all inconsistencies among remaining statements).
Moreover it is evident that no other single edge's removal would suffice to accomplish that.
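A minimal sketch of steps 1) through 3) above (the greedy upper bound only, without step 4's backtracking refinement), assuming networkx for the path enumeration:

import itertools
from collections import Counter

import networkx as nx

def greedy_repair(facts, constants):
    # Steps 1-3: returns a set of edges (facts) whose removal breaks every
    # inconsistency path. An upper bound, not necessarily a minimum.
    G = nx.Graph(facts)
    # Step 1: simple paths between distinct constants avoiding a third constant.
    paths = []
    for s, t in itertools.combinations(constants, 2):
        for p in nx.all_simple_paths(G, s, t):
            if not any(v in constants for v in p[1:-1]):
                paths.append([frozenset(e) for e in zip(p, p[1:])])
    removed = set()
    while paths:
        # Steps 2-3: remove the edge appearing in the most remaining paths.
        counts = Counter(e for p in paths for e in p)
        edge, _ = counts.most_common(1)[0]
        removed.add(edge)
        paths = [p for p in paths if edge not in p]
    return removed

facts = [("x", "1"), ("y", "2"), ("z", "3"), ("a", "1"), ("b", "2"),
         ("c", "3"), ("a", "x"), ("b", "z"), ("c", "z")]
print(greedy_repair(facts, {"1", "2", "3"}))  # one of the two single-edge fixes

On the question's example this finds exactly the two inconsistency paths listed above and removes one shared edge (2 -- b or b -- z).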

Related

Select the most elements that do not overlap so that the sum of their size is maximized

I'm trying to find an algorithm for the following problem.
Say I have a number of objects A, B, C,...
I have a list of valid combinations of these objects. Each combination is of length 2 or 4.
E.g. AF, CE, CEGH, ADFG, and so on.
For combinations of two objects, e.g. AF, the length of the combination is 2; for combinations of four objects, e.g. CEGH, the length of the combination is 4.
I can only pick non-overlapping combinations, i.e. I cannot pick AF and ADFG because both require objects 'A' and 'F'. I can pick combinations AF and CEGH because they do not require common objects.
If my solution consists of only the two combinations AF and CEGH, then my objective is the sum of the length of the combinations, which is 2 + 4 = 6.
Given a list of objects and their valid combinations, how do I pick the most valid combinations that don't overlap with each other so that I maximize the sum of the lengths of the combinations? I do not want to formulate it as an IP as I am working with a problem instance with 180 objects and 10 million valid combinations and solving an IP using CPLEX is prohibitively slow. Looking for some other elegant way to solve it. Can I perhaps convert this to a network? And solve it using a max-flow algorithm? Or a Dynamic program? Stuck as to how to go about solving this problem.
My first attempt at showing this problem to be NP-hard was wrong, as it did not take into account the fact that only combinations of size 2 or 4 were allowed. However, using Jim D.'s suggestion to reduce from 3-dimensional matching (3DM), we can show that the problem is nevertheless NP-hard.
I'll show that the natural decision problem form of your problem ("Given a set O of objects, and a set C of combinations of either 2 or 4 objects from O, and an integer m, does there exist a subset D of C such that all sets in D are pairwise disjoint, and the union of all sets in D has size at least m?") is NP-hard. Clearly the optimisation problem (i.e., your original problem, where we seek an actual subset of combinations that maximises m above) is at least as hard as this problem. (To see that the optimisation problem is not "much" harder than the decision problem, notice that you could first find the maximum m value for which a solution exists using a binary search on m in which you solve a decision problem at each step, and then, once this maximal m value has been found, solving a series of decision problems in which each combination in turn is removed: if the solution after removing some particular combination is still "YES", then it may also be left out of all future problem instances, while if the solution becomes "NO", then it is necessary to keep this combination in the solution.)
Given an instance (X, Y, Z, T, k) of 3DM, where X, Y and Z are sets that are pairwise disjoint from each other, T is a subset of X*Y*Z (i.e., a set of ordered triples with first, second and third components from X, Y and Z, respectively) and k is an integer, our task is to determine whether there is any subset U of T such that |U| >= k and all triples in U are pairwise disjoint (i.e., to answer the question, "Are there at least k non-overlapping triples in T?"). To turn any such instance of 3DM into an instance of your problem, all we need to do is create a fresh 4-combination from each triple in T, by adding a distinct dummy value to each. The set of objects in the constructed instance of your problem will consist of the union of X, Y, Z, and the |T| dummy values we created. Finally, set m to k.
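The construction is mechanical; a small sketch, where the ("dummy", i) tuples stand in for the fresh dummy values:

def reduce_3dm_to_combinations(T, k):
    # Each triple (x, y, z) becomes a 4-combination with a fresh dummy
    # element, so overlaps among 4-combinations mirror overlaps among triples.
    combinations = [{x, y, z, ("dummy", i)} for i, (x, y, z) in enumerate(T)]
    return combinations, k  # the target m equals k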
Suppose that the answer to the original 3DM instance is "YES", i.e., there are at least k non-overlapping triples in T. Then each of the k triples in such a solution corresponds to a 4-combination in the input C to your problem, and no two of these 4-combinations overlap, since by construction their 4th elements are all distinct, and by assumption the original triples are pairwise disjoint. Thus there are at least m = k non-overlapping 4-combinations in the instance of your problem, so the solution for that problem must also be "YES".
In the other direction, suppose that the solution to the constructed instance of your problem is "YES", i.e., there are at least m non-overlapping 4-combinations in C. We can simply take the first 3 elements of each of the 4-combinations (throwing away the fourth) to produce a set of k = m non-overlapping triples in T, so the answer to the original 3DM instance must also be "YES".
We have shown that a YES-answer to either problem implies a YES-answer to the other, and therefore that a NO-answer to either implies a NO-answer to the other; the two problems are equivalent. The instance of your problem can clearly be constructed in polynomial time and space. It follows that your problem is NP-hard.
You can reduce this problem to the maximum weighted clique problem, which is, unfortunately, NP-hard.
Build a graph such that every combination is a vertex with weight equal to the length of the combination, and connect vertices if the corresponding combinations do not share any object (i.e. if you can pick both of them at the same time). Then, a solution is valid if and only if it is a clique in that graph.
A simple search on google brings up a lot of approximation algorithms for this problem, such as this one.
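If you want to experiment with this reduction on a small instance, here is a hedged sketch using networkx; note that max_weight_clique is exact but exponential-time, so this is only practical for tiny inputs, not the 10-million-combination case:

import networkx as nx

def compatibility_graph(combinations):
    # One vertex per combination, weighted by its length; edges join
    # combinations that share no object.
    G = nx.Graph()
    for i, c in enumerate(combinations):
        G.add_node(i, weight=len(c))
    for i, a in enumerate(combinations):
        for j in range(i + 1, len(combinations)):
            if not a & combinations[j]:
                G.add_edge(i, j)
    return G

combos = [set("AF"), set("CE"), set("CEGH"), set("ADFG")]
G = compatibility_graph(combos)
clique, weight = nx.max_weight_clique(G, weight="weight")  # exact, exponential
print([combos[i] for i in clique], weight)  # e.g. [{'A','F'}, {'C','E','G','H'}] 6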

Match every point in two different sized sets with minimum total line length

I have two sets of points plotted in a coordinate system. Each point in a set must be matched to at least one point at the other set, in a way that the sum of the length of the lines drawn by joining those points should be as low as possible. To make it clear, line drawing is just an abstraction, the actual output is just the pairs of points that must be matched.
I've seen this question about a similar problem, except that in my case there's no single-link restriction since the sets may have different sizes. Is there any kind of problem that describes this situation? More specifically, what algorithm could I use to solve this, assuming each set may have a maximum of 10 points?
Algorithm
You can model this as a network flow problem.
By having a source of 1 at each point in the first set, and a sink of 1 at each point in the second set, plus an extra node 'dest' for any left-over capacity, any valid flow will always connect every point.
Make edges between the points with cost according to the distance between the points.
So far we have a network whose solution will be the lowest cost matching of set 1 to set 2 (i.e. each point will have a single link).
To allow multiple links you can simply make the following additions:
add 0 weight edges between each point in set2 and 'dest' (this allows points in set 2 to be multiply connected)
add 0 weight edges between 'dest' and each point in set1 (this allows points in set 1 to be multiply connected)
Example Python code using Networkx
import networkx as nx
import random

G = nx.DiGraph()
set1 = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']
set2 = ['a', 'b', 'c']

# Assume set1 is the larger set (or swap the sets)
assert len(set1) >= len(set2)

# 'dest' absorbs the excess supply of the larger set
G.add_node('dest', demand=len(set1) - len(set2))

for person in set1:
    G.add_node(person, demand=-1)         # each point in set1 supplies one unit
    G.add_edge('dest', person, weight=0)  # lets points in set1 be multiply connected
    for project in set2:
        cost = random.randint(1, 10)      # assign appropriate costs here
        G.add_edge(person, project, weight=cost)  # edge taken if person does this project

for project in set2:
    G.add_node(project, demand=1)         # each point in set2 consumes one unit
    G.add_edge(project, 'dest', weight=0) # lets points in set2 be multiply connected

flowdict = nx.min_cost_flow(G)
for person in set1:
    for project, flow in flowdict[person].items():
        if flow:
            print(person, '->', project)
You can use a discrete optimization approach (Integer Programming).
We have two sets A, of size X, and B, of size Y. This means a maximum of X*Y links, each described by a boolean variable: L(i,j) = L(Y*i+j) is 1 if nodes A(i) and B(j) are linked, 0 if not. If X = Y = 10, we can write link L(7,3) as L73.
We can rewrite the problem like this:
Node A(i) has at least one link: X (say, ten) criteria with i from 0 to X-1, each of them comprised of Y components:
L(i,0)+L(i,1)+L(i,2)+...+L(i,Y-1) >= 1
Node B(j) has at least one link, and there are Y criteria made up of X components:
L(0,j)+L(1,j)+L(2,j)+...+L(X-1,j) >= 1
The minimal cost requirement becomes:
cost = C(0,0)*L(0,0) + C(0,1)*L(0,1) + ... + C(9,9)*L(9,9)
With these conventions, we can easily build the matrices for an ILP problem, that can be passed to our favorite ILP solving package or library (C, Java, Python, even PHP).
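As a sketch of those matrices in code, here is the same formulation written with the PuLP library (an arbitrary choice of ILP front-end; any solver package works the same way):

import pulp

def min_cost_matching_ilp(cost):
    # cost[i][j] is the distance between A(i) and B(j).
    X, Y = len(cost), len(cost[0])
    prob = pulp.LpProblem("matching", pulp.LpMinimize)
    L = [[pulp.LpVariable(f"L{i}_{j}", cat="Binary") for j in range(Y)]
         for i in range(X)]
    # Objective: total link cost.
    prob += pulp.lpSum(cost[i][j] * L[i][j] for i in range(X) for j in range(Y))
    # Every A(i) has at least one link.
    for i in range(X):
        prob += pulp.lpSum(L[i][j] for j in range(Y)) >= 1
    # Every B(j) has at least one link.
    for j in range(Y):
        prob += pulp.lpSum(L[i][j] for i in range(X)) >= 1
    prob.solve()
    return [(i, j) for i in range(X) for j in range(Y)
            if pulp.value(L[i][j]) > 0.5]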
====
A self-contained "greedy" algorithm which is not guaranteed to find a minimum, but is reasonably quick and should give reasonable results unless you feed it a pathological data set, is:
- connect all points in the smaller set, each to its nearest point in the other set.
- connect all unconnected points remaining in the larger set, each to its nearest point in the first set, whether it's already connected or not.
As an optimization, you can then enumerate the points in the larger data set: if one of them (say A) is singly connected to a multiply connected point in the first data set (say B), and B is not A's nearest neighbour (say C is), you can switch the link from A-B to A-C. This takes care of one of the simplest problems that may arise from the "greediness" of the algorithm.
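A minimal sketch of the two greedy passes (without the link-switching optimization), assuming plain coordinate tuples:

from math import dist

def greedy_match(small, large):
    # Pass 1: every point in the smaller set links to its nearest point
    # in the larger set. Pass 2: every still-unconnected point in the
    # larger set links to its nearest point in the smaller set.
    pairs = [(p, min(large, key=lambda q: dist(p, q))) for p in small]
    used = {q for _, q in pairs}
    pairs += [(min(small, key=lambda q: dist(p, q)), p)
              for p in large if p not in used]
    return pairs

print(greedy_match([(0, 0), (5, 0)], [(1, 0), (4, 0), (6, 0)]))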

Efficient algorithm for eliminating nodes in "graph"?

Suppose I have a graph with 2^N - 1 nodes, numbered 1 to 2^N - 1. Node i "depends on" node j if all the bits in the binary representation of j that are 1 are also 1 in the binary representation of i. So, for instance, if N=3, then node 7 depends on all other nodes. Node 6 depends on nodes 4 and 2.
The problem is eliminating nodes. I can eliminate a node if no other nodes depend on it. No nodes depend on 7; so I can eliminate 7. After eliminating 7, I can eliminate 6, 5, and 3, etc. What I'd like is to find an efficient algorithm for listing all the possible unique elimination paths. (that is, 7-6-5 is the same as 7-5-6, so we only need to list one of the two). I have a dumb algorithm already, but I think there must be a better way.
I have three related questions:
Does this problem have a general name?
What's the best way to solve it?
Is there a general formula for the number of unique elimination paths?
Edit: I should note that a node cannot depend on itself, by definition.
Edit2: Let S = {s_1, s_2, s_3,...,s_m} be the set of all m valid elimination paths. s_i and s_j are "equivalent" (for my purposes) iff the two eliminations s_i and s_j would lead to the same graph after elimination. I suppose to be clearer I could say that what I want is the set of all unique graphs resulting from valid elimination steps.
Edit3: Note that elimination paths may be different lengths. For N=2, the 5 valid elimination paths are (),(3),(3,2),(3,1),(3,2,1). For N=3, there are 19 unique paths.
Edit4: Re: my application - the application is in statistics. Given N factors, there are 2^N - 1 possible terms in statistical model (see http://en.wikipedia.org/wiki/Analysis_of_variance#ANOVA_for_multiple_factors) that can contain the main effects (the factors alone) and various (2,3,... way) interactions between the factors. But an interaction can only be present in a model if all sub-interactions (or main effects) are present. For three factors a, b, and c, for example, the 3 way interaction a:b:c can only be in present if all the constituent two-way interactions (a:b, a:c, b:c) are present (and likewise for the two-ways). Thus, the model a + b + c + a:b + a:b:c would not be allowed. I'm looking for a quick way to generate all valid models.
It seems easier to think about this in terms of sets: you are looking for families of subsets of {1, ..., N} such that for each set in the family, all of its subsets are also present. Each such family is determined by its inclusion-wise maximal sets, no one of which may contain another. Families of pairwise incomparable sets are called Sperner families (or antichains), so each family you want corresponds to a Sperner family together with all subsets of its members. Possibly known algorithms for enumerating Sperner families or antichains in general are useful; without knowing what you actually want to do with them, it's hard to tell.
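For very small N, this set formulation can be checked directly by brute force over all candidate families; a sketch (exponential in 2^N, so an illustration only):

from itertools import combinations

def downward_closed_families(n):
    # Enumerate every family of nonempty subsets of {0,...,n-1} that is
    # closed under taking nonempty subsets -- i.e., every distinct graph
    # reachable by some elimination path.
    universe = [frozenset(c) for r in range(1, n + 1)
                for c in combinations(range(n), r)]
    families = []
    for mask in range(2 ** len(universe)):
        fam = {s for i, s in enumerate(universe) if mask >> i & 1}
        if all(frozenset(c) in fam
               for s in fam
               for r in range(1, len(s))
               for c in combinations(s, r)):
            families.append(fam)
    return families

print(len(downward_closed_families(2)))  # 5, matching Edit3
print(len(downward_closed_families(3)))  # 19, matching Edit3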
Thanks to @FalkHüffner's answer, I saw that what I wanted to do was equivalent to finding monotonic Boolean functions of N arguments. If you look at the figure on the Wikipedia page for Dedekind numbers (http://en.wikipedia.org/wiki/Dedekind_number), the figure expresses the problem graphically. There is an algorithm for generating monotonic Boolean functions (http://www.mathpages.com/home/kmath094.htm), and it is quite simple to implement.
For my purposes, I use the algorithm, then eliminate the first column and last row of the resulting binary arrays. Starting from the top row down, each row has a 1 in the ith column if one can eliminate the ith node.
Thanks!
You can build a "heap", in which at depth X are all the nodes with X zeros in their binary representation.
Then, starting from the bottom layer, connect each item to a random parent at the layer above, until you get a single-component graph.
Note that this graph is a tree, i.e., each node except for the root has exactly one parent.
Then, traverse the tree (starting from the root) and count the total number of paths in it.
UPDATE:
The method above is bad, because you cannot just pick a random parent for a given item - you have a limited number of items from which you can pick a "legal" parent... But I'm leaving this method here for other people to give their opinion (perhaps it is not "that bad").
In any case, why don't you take your graph, extract a spanning tree (you can use Prim's or Kruskal's algorithm for finding a minimum spanning tree), and then count the number of paths in it?

Check if given string can be created by a set of characters cut out from magazine article

"Observe that when you cut a character out of a magazine, the character on the reverse side of the page is also removed. Give an algorithm to determine whether you can generate a given string by pasting cutouts from a given magazine. Assume that you are given a function that will identify the character and its position on the reverse side of the page for any given character position."
How can I do it?
I can do some initial pruning, so that if a needed character can only be picked up in one way, it's taken before handing the sub-problem to a dynamic programming technique. But what comes after this initial pruning?
What is the time and space complexity?
As @LiKao suggested, this can be solved using max flow. To construct the network we make two "layers" of vertices: one with all the distinct characters in the input string and one with each position on the page. Make an edge with capacity 1 from a character to a position if that position has that character on one side. Make edges of capacity 1 from each position to the sink, and make edges from the source to each character with capacity equal to the multiplicity of that character in the input string.
For example, let's say we're searching for the word "FOO" on a page with four positions:
pos 1 2 3 4
front F C O Z
back O O K Z
We then generate the corresponding network, ignoring position 4 since it does not provide any of the required characters.
Now we only need to determine whether there is a flow from the source to the sink of value length("FOO") = 3 or more.
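A sketch of that network using networkx (the node labels are illustrative); maximum_flow_value does the rest:

import networkx as nx
from collections import Counter

def can_cut_out(word, positions):
    # source -> characters (capacity = multiplicity in word),
    # characters -> positions showing them (capacity 1),
    # positions -> sink (capacity 1).
    G = nx.DiGraph()
    G.add_nodes_from(["source", "sink"])
    need = Counter(word)
    for ch, mult in need.items():
        G.add_edge("source", ("char", ch), capacity=mult)
    for i, (front, back) in enumerate(positions):
        for ch in {front, back}:
            if ch in need:
                G.add_edge(("char", ch), ("pos", i), capacity=1)
        G.add_edge(("pos", i), "sink", capacity=1)
    return nx.maximum_flow_value(G, "source", "sink") == len(word)

# The "FOO" example: (front, back) for each of the four positions.
print(can_cut_out("FOO", [("F", "O"), ("C", "O"), ("O", "K"), ("Z", "Z")]))  # True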
You can use dynamic programming directly.
We are given string s with n letters. We are given a set of pieces P = {p_1, ..., p_k}. Each piece has one letter in the front p_i.f and one in the back p_i.b.
Denote with f(j, p) the function that returns true if it is feasible to create substring s_1...s_j using pieces in p \subseteq P, and false otherwise.
The following recurrence holds:
f(n, P) = f(n-1, P-p_1) | f(n-1, P-p_2) | ... | f(n-1, P-p_k)
In plain English, the feasibility of s using all pieces in P depends on the feasibility of the substring s_1...s_n-1 given one less piece, where we try removing each possible piece (of course, in practice we do not have to remove all pieces one by one; we only need to remove those pieces for which p_i.f == s_n || p_i.b == s_n).
The initial condition is that f(1, P-p_1) = f(1, P-p_2) = ... = true, assuming that we have already checked a priori (in linear time) that there are enough letters in P to cover all the letters in s.
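A direct, memoised transcription of this recurrence (exponential in the number of pieces in the worst case, so an illustration of the recurrence rather than an efficient solution):

from functools import lru_cache

def feasible(s, pieces):
    # pieces is a tuple of (front, back) pairs; remaining is the frozenset
    # of indices of pieces not yet used.
    @lru_cache(maxsize=None)
    def f(j, remaining):
        if j < 0:
            return True
        return any(f(j - 1, remaining - {i})
                   for i in remaining
                   if s[j] in pieces[i])
    return f(len(s) - 1, frozenset(range(len(pieces))))

print(feasible("FOO", (("F", "O"), ("C", "O"), ("O", "K"), ("Z", "Z"))))  # True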
While this problem can be formulated as a maxflow problem as shown in the accepted answer, it is simpler and more efficient to formulate it as a maximum-cardinality matching problem in a bipartite graph. Maxflow algorithms like Dinic's are slower than special-case algorithms like the Hopcroft–Karp algorithm.
The bipartite graph is formed by connecting every character position of the given string to each cutout that shows that character on either side. We then run Hopcroft–Karp. At the end, we simply check whether the cardinality of the matching equals the length of the string.
For a working implementation (in Scala) using JGraphT, see my GitHub.
I'd like to come up with a more efficient DP solution, since Skiena's book has this problem in the DP section, but so far haven't found any.
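For reference, the same matching formulation sketched with networkx rather than JGraphT (bipartite.maximum_matching implements Hopcroft–Karp):

import networkx as nx

def can_cut_out_matching(word, positions):
    # Left side: string positions; right side: cutouts. An edge means
    # the cutout shows the needed character on either side.
    G = nx.Graph()
    left = [("str", i) for i in range(len(word))]
    G.add_nodes_from(left)
    for i, ch in enumerate(word):
        for j, sides in enumerate(positions):
            if ch in sides:
                G.add_edge(("str", i), ("cut", j))
    matching = nx.bipartite.maximum_matching(G, top_nodes=left)
    return sum(v in matching for v in left) == len(word)

print(can_cut_out_matching("FOO", [("F", "O"), ("C", "O"), ("O", "K"), ("Z", "Z")]))  # True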

Eliminating symmetry from graphs

I have an algorithmic problem in which I have derived a transfer matrix between a lot of states. The next step is to exponentiate it, but it is very large, so I need to do some reductions on it. Specifically it contains a lot of symmetry. Below are some examples on how many nodes can be eliminated by simple observations.
My question is whether there is an algorithm to efficiently eliminate symmetry in digraphs, similarly to the way I've done it manually below.
In all cases the initial vector has the same value for all nodes.
In the first example we see that b, c, d and e all receive values from a and one of each other. Hence they will always contain an identical value, and we can merge them.
In this example we quickly spot that the graph looks identical from the point of view of a, b, c and d. Also, for their respective side nodes, it doesn't matter which inner node each is attached to. Hence we can reduce the graph down to only two states.
Update: Some people were understandably unsure what was meant by "state transfer matrix". The idea here is that you can split a combinatorial problem up into a number of state types for each n in your recurrence. The matrix then tells you how to get from n-1 to n.
Usually you are only interested in the value of one of your states, but you need to calculate the others as well, so you can always get to the next level. In some cases, however, multiple states are symmetrical, meaning they will always have the same value. Obviously it's quite a waste to calculate all of these, so we want to reduce the graph until all nodes are "unique".
Below is an example of the transfer matrix for the reduced graph in example 1.
[S_a(n)]   [1 1 1]   [S_a(n-1)]
[S_f(n)] = [1 0 0] * [S_f(n-1)]
[S_B(n)]   [4 0 1]   [S_B(n-1)]
Any suggestions or references to papers are appreciated.
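Once the graph has been reduced, the exponentiation step itself is routine; a numpy sketch for the reduced matrix of example 1, assuming the all-ones initial vector mentioned above:

import numpy as np

# Transfer matrix for the reduced graph in example 1, as given above.
M = np.array([[1, 1, 1],
              [1, 0, 0],
              [4, 0, 1]])

def states_after(n):
    # n-step state vector; every node starts with the same value (here 1).
    return np.linalg.matrix_power(M, n) @ np.ones(3)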
Brendan McKay's nauty (http://cs.anu.edu.au/~bdm/nauty/) is the best tool I know of for computing automorphisms of graphs. It may be too expensive to compute the whole automorphism group of your graph, but you might be able to reuse some of the algorithms described in McKay's paper "Practical Graph Isomorphism" (linked from the nauty page).
I'll just add an extra answer building on what userOVER9000 suggested, in case anybody else is interested.
The below is an example of using nauty on Example 2, through the dreadnaut tool.
$ ./dreadnaut
Dreadnaut version 2.4 (64 bits).
> n=8 d g -- Starting a new 8-node digraph
0 : 1 3 4; -- Entering edge data
1 : 0 2 5;
2 : 3 1 6;
3 : 0 2 7;
4 : 0;
5 : 1;
6 : 2;
7 : 3;
> cx -- Calling nauty
(1 3)(5 7)
level 2: 6 orbits; 5 fixed; index 2
(0 1)(2 3)(4 5)(6 7)
level 1: 2 orbits; 4 fixed; index 4
2 orbits; grpsize=8; 2 gens; 6 nodes; maxlev=3
tctotal=8; canupdates=1; cpu time = 0.00 seconds
> o -- Output "orbits"
0:3; 4:7;
Notice it suggests joining nodes 0:3 which are a:d in Example 2 and 4:7 which are e:h.
The nauty algorithm is not well documented, but the authors describe it as having exponential worst-case and roughly n^2 average running time.
Computing symmetries seems to be a bit of a second order problem. Taking just a,b,c and d in your second graph, the symmetry would have to be expressed
a(b,c,d) = b(a,d,c)
and all its permutations, or some such. Consider a second subgraph a', b', c', d' added to it. Again, we have the symmetries, but parameterised differently.
For computing people (rather than math people), could we express the problem like so?
Each graph node contains a set of letters. At each iteration, all of the letters in each node are copied to its neighbours by the arrows (some arrows take more than one iteration and can be treated as a pipe of anonymous nodes).
We are trying to find efficient ways of determining things such as:
* what letters each set/node contains after N iterations;
* for each node, the N after which its set no longer changes;
* which sets of nodes wind up containing the same sets of letters (an equivalence class).
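A small simulation of that letter-copying formulation (adjacency given as a plain dict; nodes whose sets coincide once the process stabilises are candidates for merging):

def propagate(adj, steps):
    # adj maps each node to the nodes its letters are copied to;
    # every node starts with one unique letter (itself).
    letters = {v: {v} for v in adj}
    for _ in range(steps):
        nxt = {v: set(s) for v, s in letters.items()}
        for v, outs in adj.items():
            for w in outs:
                nxt[w] |= letters[v]
        letters = nxt
    return letters

# For a 4-cycle, every node ends up with the same set, i.e. one class.
print(propagate({0: [1], 1: [2], 2: [3], 3: [0]}, 4))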
