Minimal k-subset intersection - set

Given n subsets (T1, T2, ..., Tn) such that each subset has elements from [1 to p] (where p may or may not be given). The user can specify any k (with k <= n), and I need to find k of these n subsets such that their intersection Ta1 ∩ Ta2 ∩ ... ∩ Tak has the minimum possible cardinality.
Original problem: Finding k test cases out of n randomly generated test cases whose line coverage is closest to that of all n test cases on some code. I used gcov to determine which test cases fail to cover which lines.
Say a program has 6 lines (for simplicity) and we run 3 tests on it. After running test 1, T1={1,2,3}, i.e., lines 1-3 were not covered/executed by test 1. We separately run tests 2 and 3 to get T2={4,5,6} and T3={2,4}. Now the user can give any value of k, for which we have to return k out of the n test cases whose combined coverage is closest to that of all the test cases.
When we run tests 1 and 2 together, test 1 covers 4,5,6 but not 1,2,3, and test 2 covers 1,2,3 but not 4,5,6. Hence together they cover 1,2,3,4,5,6 and every line is covered at least once. T1 ∩ T2, and thus T1 ∩ T2 ∩ T3, both yield ∅.
Now, for k=1, the best answer is T3 (cardinality = 2);
for k=2, the best answer is T1 and T2, since T1 ∩ T2 yields ∅ (together they cover everything). Finally, for k=3, choose all of T1, T2 and T3, as the intersection is ∅ (zero cardinality).
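As a baseline, the problem statement can be checked by brute force over all k-subsets. This is only a sketch for small n (it tries every combination and is exponential in general); the function name `min_intersection` and the list-of-sets input format are assumptions made for illustration, not part of the question.

```python
from itertools import combinations

def min_intersection(uncovered, k):
    """Return (indices, common) for the k sets whose intersection is smallest.

    uncovered[i] is the set of lines that test i fails to cover.
    Ties are broken in favour of the first combination found.
    """
    best = None
    for combo in combinations(range(len(uncovered)), k):
        common = set.intersection(*(uncovered[i] for i in combo))
        if best is None or len(common) < len(best[1]):
            best = (combo, common)
    return best

# The example from the question: T1, T2, T3 (indices 0, 1, 2).
tests = [{1, 2, 3}, {4, 5, 6}, {2, 4}]
for k in (1, 2, 3):
    print(k, min_intersection(tests, k))
```

For the example this selects T3 for k=1, T1 and T2 for k=2, and all three sets for k=3, matching the answers worked out by hand.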

Related

Find if permutation is possible

Given a permutation of natural integers from 1 to N, inclusive. Initially, the permutation is 1, 2, 3, ..., N. We are also given M pairs of integers, where the i-th is (Li,Ri). In a single turn we can choose any of these pairs (let's say with the index j) and arbitrarily shuffle the elements of our permutation on the positions from Lj to Rj, inclusive (the positions are 1-based). We are not limited in the number of turns and you can pick any pair more than once.
The goal is to obtain the permutation P, that is given. If it's possible, output "Possible", otherwise output "Impossible".
Example : Let N=7 and M=4 and array be [3 1 2 4 5 7 6] and queries are :
1 2
4 4
6 7
2 3
Here answer is Possible.
Treat each pair as an interval, compute the union of intervals as a list of non-overlapping intervals, and then test, for each i, whether the value at position i of the permutation either is i or is in the same non-overlapping interval as i.
This works because, if we have a <= b <= c <= d with pairs (a, c) and (b, d), then by repeatedly invoking (a, c) and (b, d), we can get any permutation that we could get with (a, d). Conversely, (a, d) enables any permutation that we could get with (a, c) and (b, d). Once the list of pairs is non-overlapping, it's clear that we can move element i to position j != i if and only if i and j are in the same interval.
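A minimal sketch of this interval-merging check follows. The function name `can_reach` and the input format (a list for the target permutation, a list of 1-based `(L, R)` tuples) are assumptions for illustration. Two intervals are merged only when they share at least one position, which is the a <= b <= c <= d condition above.

```python
def can_reach(perm, pairs):
    """perm: target permutation of 1..N; pairs: list of 1-based (L, R) intervals."""
    # Merge intervals that share at least one position.
    merged = []
    for left, right in sorted(pairs):
        if merged and left <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])

    n = len(perm)
    # block[pos] identifies the merged interval containing pos;
    # positions outside every interval form their own singleton block.
    block = list(range(n + 1))
    for left, right in merged:
        for pos in range(left, right + 1):
            block[pos] = left

    # The value at each position must come from the same block as the position.
    return all(block[pos] == block[perm[pos - 1]] for pos in range(1, n + 1))

example = can_reach([3, 1, 2, 4, 5, 7, 6], [(1, 2), (4, 4), (6, 7), (2, 3)])
print("Possible" if example else "Impossible")
```

On the example from the question the merged intervals are [1, 3], [4, 4] and [6, 7], and the check prints "Possible".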

Find cardinality of set

I have faced the following problem recently:
We have a sequence A of M consecutive integers, beginning at A[1] = 1:
1,2,...M (example: M = 8 , A = 1,2,3,4,5,6,7,8 )
We have the set T consisting of all possible subsequences made from L_T consecutive terms of A.
(example L_T = 3 , subsequences are {1,2,3},{2,3,4},{3,4,5},...). Let's call the elements of T "tiles".
We have the set S consisting of all possible subsequences of A that have length L_S. ( example L_S = 4, subsequences like {1,2,3,4} , {1,3,7,8} ,...{4,5,7,8} ).
We say that an element s of S can be "covered" by K "tiles" of T if there exist K tiles in T such that the union of their sets of terms contains the terms of s as a subset. For example, the subsequence {1,2,3} can be covered with 2 tiles of length 2 ({1,2} and {3,4}), while the subsequence {1,3,5} cannot be covered with 2 tiles of length 2, but can be covered with 2 tiles of length 3 ({1,2,3} and {4,5,6}).
Let C be the subset of elements of S that can be covered by K tiles of T.
Find the cardinality of C given M, L_T, L_S, K.
Any ideas would be appreciated how to tackle this problem.
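Assume M is divisible by the tile length L_T (written T below for short, not to be confused with the set T in the question), so that we have an integer number M/T of non-overlapping tiles covering all elements of the initial set (otherwise the statement is currently unclear).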
First, let us count F (P): it will be almost the number of subsequences of length L_S which can be covered by no more than P tiles, but not exactly that.
Formally, F (P) = choose (M/T, P) * choose (P*T, L_S).
We start by choosing exactly P covering tiles: the number of ways is choose (M/T, P).
When the tiles are fixed, we have exactly P * T distinct elements available, and there are choose (P*T, L_S) ways to choose a subsequence.
Well, this approach has a flaw.
Note that, when we chose a tile but did not use its elements at all, we in fact counted some subsequences more than once.
For example, if we fixed three tiles numbered 2, 6 and 7, but used only 2 and 7, we counted the same subsequences again and again when we fixed three tiles numbered 2, 7 and whatever.
The problem described above can be countered by a variation of the inclusion-exclusion principle.
Indeed, for a subsequence which uses only Q tiles out of P selected tiles, it is counted choose (M/T - Q, P - Q) times instead of only once: Q of the P choices are fixed, but the other P - Q are chosen arbitrarily from the remaining M/T - Q tiles.
Define G (P) as the number of subsequences of length L_S which use elements of exactly P tiles.
Then, F (P) is the sum for Q from 0 to P of the products G (Q) * choose (M/T - Q, P - Q).
Working from P = 0 upwards, we can calculate all the values of G by calculating the values of F.
For example, we get G (2) from knowing F (2), G (0) and G (1), and also the equation connecting F (2) with G (0), G (1) and G (2).
After that, the answer is simply sum for P from 0 to K of the values G (P).
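The recurrence can be sketched in a few lines. This is a sketch under the assumption made in this answer (M divisible by the tile length, tiles being the M/T non-overlapping blocks), not a solution for arbitrary overlapping tiles; the function name `count_covered` is hypothetical.

```python
from math import comb

def count_covered(M, T, L_S, K):
    """Count length-L_S subsequences of 1..M that touch at most K of the M/T blocks."""
    n_tiles = M // T
    G = []  # G[P] = number of subsequences touching exactly P blocks
    for P in range(n_tiles + 1):
        # F(P): choose P blocks, then choose L_S elements among their P*T elements.
        F = comb(n_tiles, P) * comb(P * T, L_S)
        # Remove subsequences that touch fewer than P blocks; each one touching
        # exactly Q blocks was counted once per superset of size P.
        overcount = sum(G[Q] * comb(n_tiles - Q, P - Q) for Q in range(P))
        G.append(F - overcount)
    return sum(G[: K + 1])

print(count_covered(6, 2, 3, 2))
```

For M = 6, T = 2, L_S = 3 there are 20 subsequences in total, 8 of which take one element from each of the three blocks, so 12 can be covered by K = 2 blocks.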

coloring tree with minimum sum of colors

The problem is to color the vertices of a tree with natural numbers (adjacent vertices must get different numbers) such that the sum of the numbers (colors) assigned to the vertices is minimum.
Is the number of colors needed to do that bounded?
I think 3 colors are enough to do that. How to prove it?
It's not. Describe a rooted tree algebraically as follows. V is a one-node tree. E(t1, t2) is a tree consisting of t1 and t2 and an edge from t1's root to t2's root, rooted at t2's root. The following tree t3 requires four colors to attain the minimum, 156.
t3 = E(t2, E(t2, E(t2, E(t2, t2))))
t2 = E(t1, E(t1, E(t1, E(t1, t1))))
t1 = E(t0, E(t0, E(t0, E(t0, t0))))
t0 = V
I can prove that this construction generalizes, and thus that no fixed number of colors suffices to attain the minimum for all trees.
Theorem For all d ≥ k ≥ 3, the following inductively constructed tree T(d, k) requires at least k colors. T(d, 1) is the one-vertex tree. For i > 1, T(d, i) is the tree with d leaves attached to each vertex of T(d, i - 1).
Proof By induction on k. The base case k = 3 is essentially your example where 3 colors are necessary for optimality. For k > 3, consider a coloring of T(d, k) that uses only k - 1 colors. We show how to use color k to improve it. If some internal vertex has color 1, then we improve by changing its color to k and changing the colors of its d > k - 1 adjacent leaves to 1. If no internal vertex has color 1, and some leaf has a color other than 1, change the leaf to 1. If we haven't improved yet, all leaves have color 1 and all internal vertices have color > 1. Removing all the leaves and decrementing the labels, we have a coloring of T(d, k - 1), which we can improve by the inductive hypothesis.
data Tree = V | E Tree Tree
  deriving (Eq, Show)
-- element i of (otherMinimums xs) is the minimum of xs with element i left out
otherMinimums [x, y] = [y, x]
otherMinimums (x:xs) = minimum xs : map (min x) (otherMinimums xs)
-- element c-1 of (color m t) is the minimum sum for t when its root gets color c
color m V = [1..m]
color m (E t1 t2) =
  let c1 = color m t1
      c2 = color m t2
  in zipWith (+) (otherMinimums c1) c2
t3 = E t2 $ E t2 $ E t2 $ E t2 $ t2
t2 = E t1 $ E t1 $ E t1 $ E t1 $ t1
t1 = E t0 $ E t0 $ E t0 $ E t0 $ t0
t0 = V
Results:
> color 3 t3
[157,158,163]
> color 4 t3
[157,158,159,156]
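For readers who don't read Haskell, the same dynamic program can be sketched in Python. The representation here is an assumption chosen for brevity (a tree is just the list of its child subtrees), and `root_costs` is a hypothetical name; the recurrence is the one used above: the cost of a subtree with root color c is c plus, for each child, the cheapest coloring of that child whose root color differs from c.

```python
def root_costs(tree, m):
    """tree: list of child subtrees. Returns costs[c-1] = minimum color sum of
    the subtree when its root gets color c, using colors 1..m (m >= 2)."""
    costs = list(range(1, m + 1))
    for child in tree:
        child_costs = root_costs(child, m)
        for c in range(m):
            # The child's root may take any color except the parent's.
            costs[c] += min(child_costs[j] for j in range(m) if j != c)
    return costs

# The same trees as above: each t_i is a copy of t_{i-1} with four more
# copies of t_{i-1} hanging off its root.
t0 = []
t1 = [t0] * 4
t2 = t1 + [t1] * 4
t3 = t2 + [t2] * 4

print(root_costs(t3, 3))
print(root_costs(t3, 4))
```

The two printed vectors reproduce the results shown above: with 3 colors the best total is 157, while allowing a fourth color brings it down to 156.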
First, 2 colors are enough for any tree. To prove that, you can just color the tree level by level in alternating colors.
Second, coloring level by level is the only valid method of 2-coloring. It can be proved by induction on the levels. Fix the color of the root node. Then all its children should have the different color, children of the children — first color, and so on.
Third, to choose the optimal coloring, just check the two possible layouts: when the root node has the color 0, and when it has the color 1, respectively.
For a tree, you can use only 2 colors: one for nodes with odd depth and a second color for nodes with even depth.
EDIT:
The previous answer was wrong because I didn't understand the problem.
As shown by Wobble, the number of colors needed is not bounded.
The number of colours needed to minimise the sum for a tree with n nodes is bounded by O(log n).
This was covered by E. Kubicka in her 1989 paper
http://dl.acm.org/citation.cfm?id=75430

What is the meaning of "from distinct vertex chains" in this nearest neighbor algorithm?

The following pseudo-code is from the first chapter of an online preview version of The Algorithm Design Manual (page 7 from this PDF).
The example is of a flawed algorithm, but I still really want to understand it:
[...] A different idea might be to repeatedly connect the closest pair of
endpoints whose connection will not create a problem, such as
premature termination of the cycle. Each vertex begins as its own
single vertex chain. After merging everything together, we will end up
with a single chain containing all the points in it. Connecting the
final two endpoints gives us a cycle. At any step during the execution
of this closest-pair heuristic, we will have a set of single vertices
and vertex-disjoint chains available to merge. In pseudocode:
ClosestPair(P)
    Let n be the number of points in set P.
    For i = 1 to n − 1 do
        d = ∞
        For each pair of endpoints (s, t) from distinct vertex chains
            if dist(s, t) ≤ d then sm = s, tm = t, and d = dist(s, t)
        Connect (sm, tm) by an edge
    Connect the two endpoints by an edge
Please note that sm and tm should be read as s_m and t_m (s and t with subscript m).
First of all, I don't understand what "from distinct vertex chains" would mean. Second, i is used as a counter in the outer loop, but i itself is never actually used anywhere! Could someone smarter than me please explain what's really going on here?
This is how I see it, after explanation of Ernest Friedman-Hill (accepted answer):
So the example from the same book (Figure 1.4).
I've added names to the vertices to make it clear
So at first all the vertices are single-vertex chains, so we connect the pairs A-D, B-E and C-F (one per iteration), because the distances between them are the smallest.
At the second step we have 3 chains, and the distance between A-D and B-E is the same as between B-E and C-F, so we connect, let's say, A-D with B-E and are left with two chains: A-D-E-B and C-F.
At the third step the only way to connect them is through B and C, because B-C is shorter than B-F, A-F and A-C (remember, we consider only the endpoints of chains). So we now have one chain: A-D-E-B-C-F.
At the last step we connect two endpoints (A and F) to get a cycle.
1) The description states that every vertex always belongs either to a "single-vertex chain" (i.e., it's alone) or it belongs to one other chain; a vertex can only belong to one chain. The algorithm says that at each step you consider every possible pair of vertices that are each an endpoint of the chain they belong to and don't already belong to the same chain, and connect the closest such pair. Sometimes they'll be singletons; sometimes one or both will already belong to a non-trivial chain, so you'll join two chains.
2) You repeat the loop n - 1 times, so that you eventually select every vertex; but yes, the actual iteration count isn't used for anything. All that matters is that you run the loop enough times.
Though the question is already answered, here's a Python implementation of the closest pair heuristic. It starts with every point as its own chain, then successively merges chains to build one long chain containing all points.
This algorithm does build a path, but it is not a sequence of robot arm movements, because the arm's starting point is unknown.
import matplotlib.pyplot as plot
import math
import random
def draw_arrow(axis, p1, p2, rad):
    """draw an arrow connecting point 1 to point 2"""
    axis.annotate("",
                  xy=p2,
                  xytext=p1,
                  arrowprops=dict(arrowstyle="-", linewidth=0.8, connectionstyle="arc3,rad=" + str(rad)),)
def closest_pair(points):
    distance = lambda c1p, c2p: math.hypot(c1p[0] - c2p[0], c1p[1] - c2p[1])
    chains = [[points[i]] for i in range(len(points))]
    edges = []
    for i in range(len(points)-1):
        dmin = float("inf")  # infinitely big distance
        # test each chain against each other chain
        for chain1 in chains:
            for chain2 in [item for item in chains if item is not chain1]:
                # test each chain1 endpoint against each of chain2 endpoints
                for c1ind in [0, len(chain1) - 1]:
                    for c2ind in [0, len(chain2) - 1]:
                        dist = distance(chain1[c1ind], chain2[c2ind])
                        if dist < dmin:
                            dmin = dist
                            # remember endpoints as closest pair
                            chain2link1, chain2link2 = chain1, chain2
                            point1, point2 = chain1[c1ind], chain2[c2ind]
        # connect two closest points
        edges.append((point1, point2))
        chains.remove(chain2link1)
        chains.remove(chain2link2)
        if len(chain2link1) > 1:
            chain2link1.remove(point1)
        if len(chain2link2) > 1:
            chain2link2.remove(point2)
        linkedchain = chain2link1
        linkedchain.extend(chain2link2)
        chains.append(linkedchain)
    # connect first endpoint to the last one
    edges.append((chains[0][0], chains[0][len(chains[0])-1]))
    return edges
data = [(0.3, 0.2), (0.3, 0.4), (0.501, 0.4), (0.501, 0.2), (0.702, 0.4), (0.702, 0.2)]
# random.seed()
# data = [(random.uniform(0.01, 0.99), 0.2) for i in range(60)]
edges = closest_pair(data)
# draw path
figure = plot.figure()
axis = figure.add_subplot(111)
plot.scatter([i[0] for i in data], [i[1] for i in data])
nedges = len(edges)
for i in range(nedges - 1):
    draw_arrow(axis, edges[i][0], edges[i][1], 0)
# draw last - curved - edge
draw_arrow(axis, edges[nedges-1][0], edges[nedges-1][1], 0.3)
plot.show()
TLDR: Skip to the section "Clarified description of ClosestPair heuristic" below if already familiar with the question asked in this thread and the answers contributed thus far.
Remarks: I started the Algorithm Design Manual recently and the ClosestPair heuristic example bothered me because of what I felt was a lack of clarity. It looks like others have felt similarly. Unfortunately, the answers provided on this thread didn't quite do it for me; they all seemed a bit too vague and hand-wavy. But the answers did help nudge me in the direction of what I feel is the correct interpretation of Skiena's description.
Problem statement and background: From page 5 of the book for those who don't have it (3rd edition):
Skiena first details how the NearestNeighbor heuristic is incorrect, using the following image to help illustrate his case:
The figure on top illustrates a problem with the approach employed by the NearestNeighbor heuristic, with the bottom figure being the optimal solution. Clearly a different approach is needed to find this optimal solution. Cue the ClosestPair heuristic and the reason for this question.
Book description: The following description of the ClosestPair heuristic is outlined in the book:
Maybe what we need is a different approach for the instance that proved to be a bad instance for the nearest-neighbor heuristic. Always walking to the closest point is too restrictive, since that seems to trap us into making moves we didn't want.
A different idea might repeatedly connect the closest pair of endpoints whose connection will not create a problem, such as premature termination of the cycle. Each vertex begins as its own single vertex chain. After merging everything together, we will end up with a single chain containing all the points in it. Connecting the final two endpoints gives us a cycle. At any step during the execution of this closest-pair heuristic, we will have a set of single vertices and the end of vertex-disjoint chains available to merge. The pseudocode that implements this description appears below.
Clarified description of ClosestPair heuristic
It may help to first "zoom back" a bit and answer the basic question of what we are trying to find in graph theory terms:
What is the shortest closed trail?
That is, we want to find a sequence of edges (e_1, e_2, ..., e_{n-1}) for which there is a sequence of vertices (v_1, v_2, ..., v_n) where v_1 = v_n and all edges are distinct. The edges are weighted, where the weight for each edge is simply the distance between vertices that comprise the edge--we want to minimize the overall weight of whatever closed trails exist.
Practically speaking, the ClosestPair heuristic gives us one of these distinct edges for every iteration of the outer for loop in the pseudocode (lines 3-10), where the inner for loop (lines 5-9) ensures the distinct edge being selected at each step, (s_m, t_m), is comprised of vertices coming from the endpoints of distinct vertex chains; that is, s_m comes from the endpoint of one vertex chain and t_m from the endpoint of another distinct vertex chain. The inner for loop simply ensures we consider all such pairs, minimizing the distance between potential vertices in the process.
Note (ties in distance between vertices): One potential source of confusion is that no sort of "processing order" is specified in either for loop. How do we determine the order in which to compare endpoints and, furthermore, the vertices of those endpoints? It doesn't matter. The nature of the inner for loop makes it clear that, in the case of ties, the most recently encountered vertex pairing with minimal distance is chosen.
Good instance of ClosestPair heuristic
Recall what happened in the bad instance of applying the NearestNeighbor heuristic (observe the newly added vertex labels):
The total distance covered was absurd because we kept jumping back and forth over 0.
Now consider what happens when we use the ClosestPair heuristic. We have n = 7 vertices; hence, the pseudocode indicates that the outer for loop will be executed 6 times. As the book notes, each vertex begins as its own single vertex chain (i.e., each point is a singleton where a singleton is a chain with one endpoint). In our case, given the figure above, how many times will the inner for loop execute? Well, how many ways are there to choose a 2-element subset of an n-element set (i.e., the 2-element subsets represent potential vertex pairings)? There are n choose 2 such subsets: C(n, 2) = n(n - 1) / 2.
Since n = 7 in our case, there's a total of 21 possible vertex pairings to investigate. The nature of the figure above makes it clear that (C, D) and (D, E) are the only possible outcomes from the first iteration since the smallest possible distance between vertices in the beginning is 1 and dist(C, D) = dist(D, E) = 1. Which vertices are actually connected to give the first edge, (C, D) or (D, E), is unclear since there is no processing order. Let's assume we encounter vertices D and E last, thus resulting in (D, E) as our first edge.
Now there are 5 more iterations to go and 6 vertex chains to consider: A, B, C, (D, E), F, G.
Note (each iteration eliminates a vertex chain): Each iteration of the outer for loop in the ClosestPair heuristic results in the elimination of a vertex chain. The outer for loop iterations continue until we are left with a single vertex chain comprised of all vertices, where the last step is to connect the two endpoints of this single vertex chain by an edge. More precisely, for a graph G comprised of n vertices, we start with n vertex chains (i.e., each vertex begins as its own single vertex chain). Each iteration of the outer for loop results in connecting two vertices of G in such a way that these vertices come from distinct vertex chains; that is, connecting these vertices results in merging two distinct vertex chains into one, thus decrementing by 1 the total number of vertex chains left to consider. Repeating such a process n - 1 times for a graph that has n vertices results in being left with n - (n - 1) = 1 vertex chain, a single chain containing all the points of G in it. Connecting the final two endpoints gives us a cycle.
One possible depiction of how each iteration looks is as follows:
ClosestPair outer for loop iterations
1: connect D to E # -> dist: 1, chains left (6): A, B, C, (D, E), F, G
2: connect D to C # -> dist: 1, chains left (5): A, B, (C, D, E), F, G
3: connect E to F # -> dist: 3, chains left (4): A, B, (C, D, E, F), G
4: connect C to B # -> dist: 4, chains left (3): A, (B, C, D, E, F), G
5: connect F to G # -> dist: 8, chains left (2): A, (B, C, D, E, F, G)
6: connect B to A # -> dist: 16, single chain: (A, B, C, D, E, F, G)
Final step: connect A and G
Hence, the ClosestPair heuristic does the right thing in this example where previously the NearestNeighbor heuristic did the wrong thing:
Bad instance of ClosestPair heuristic
Consider what the ClosestPair algorithm does on the point set in the figure below (it may help to first try imagining the point set without any edges connecting the vertices):
How can we connect the vertices using ClosestPair? We have n = 6 vertices; thus, the outer for loop will execute 6 - 1 = 5 times, where our first order of business is to investigate the distance between vertices of C(6, 2) = 15 total possible pairs. The figure above helps us see that dist(A, D) = dist(B, E) = dist(C, F) = 1 - ɛ are the only possible options in the first iteration since 1 - ɛ is the shortest distance between any two vertices. We arbitrarily choose (A, D) as the first pairing.
Now there are 4 more iterations to go and 5 vertex chains to consider: (A, D), B, C, E, F. One possible depiction of how each iteration looks is as follows:
ClosestPair outer for loop iterations
1: connect A to D # --> dist: 1-ɛ, chains left (5): (A, D), B, C, E, F
2: connect B to E # --> dist: 1-ɛ, chains left (4): (A, D), (B, E), C, F
3: connect C to F # --> dist: 1-ɛ, chains left (3): (A, D), (B, E), (C, F)
4: connect D to E # --> dist: 1+ɛ, chains left (2): (A, D, E, B), (C, F)
5: connect B to C # --> dist: 1+ɛ, single chain: (A, D, E, B, C, F)
Final step: connect A and F
Note (correctly considering the endpoints to connect from distinct vertex chains): Iterations 1-3 depicted above are fairly uneventful in the sense that we have no other meaningful options to consider. Even once we have the distinct vertex chains (A, D), (B, E), and (C, F), the next choice is similarly uneventful and arbitrary. There are four possibilities given that the smallest possible distance between vertices on the fourth iteration is 1 + ɛ: (A, B), (D, E), (B, C), (E, F). The distance between vertices for all of the points above is 1 + ɛ. The choice of (D, E) is arbitrary. Any of the other three vertex pairings would have worked just as well. But notice what happens during iteration 5--our possible choices for vertex pairings have been tightly narrowed. Specifically, the vertex chains (A, D, E, B) and (C, F), which have endpoints (A, B) and (C, F), respectively, allow for only four possible vertex pairings: (A, C), (A, F), (B, C), (B, F). Even if it may seem obvious, it is worth explicitly noting that neither D nor E were viable vertex candidates above--neither vertex is included in the endpoint, (A, B), of the vertex chain of which they are vertices, namely (A, D, E, B). There is no arbitrary choice at this stage. There are no ties in the distance between vertices in the pairs above. The (B, C) pairing results in the smallest distance between vertices: 1 + ɛ. Once vertices B and C have been connected by an edge, all iterations have been completed and we are left with a single vertex chain: (A, D, E, B, C, F). Connecting A and F gives us a cycle and concludes the process.
The total distance traveled across (A, D, E, B, C, F) is as follows:
dist(A, D) + dist(D, E) + dist(E, B) + dist(B, C) + dist(C, F) + dist(F, A) = (1 - ɛ) + (1 + ɛ) + (1 - ɛ) + (1 + ɛ) + (1 - ɛ) + √((1 - ɛ)^2 + (2 + 2ɛ)^2)
The distance above evaluates to 5 - ɛ + √(5ɛ^2 + 6ɛ + 5) as opposed to the total distance traveled by just going around the boundary (the right-hand figure in the image above where all edges are colored in red): 6 + 2ɛ. As ɛ -> 0, the ClosestPair tour approaches 5 + √5 ≈ 7.24 > 6, where 6 was the necessary amount of travel. Hence, we end up traveling about (5 + √5 - 6) / 6 ≈ 20.6% farther than is necessary by using the ClosestPair heuristic in this case.
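The arithmetic can be checked numerically. The coordinates below are an assumption reconstructed from the distances stated in this answer (vertical neighbours 1 - ɛ apart, horizontal neighbours 1 + ɛ apart, with A, B, C on the top row and D, E, F below them); the helper names are hypothetical.

```python
from math import dist, sqrt

def layout(eps):
    """Assumed coordinates: two rows of three points, consistent with the text."""
    w, h = 1 + eps, 1 - eps
    return {"A": (0, h), "B": (w, h), "C": (2 * w, h),
            "D": (0, 0), "E": (w, 0), "F": (2 * w, 0)}

def tour_length(points, order):
    """Length of the closed tour that visits the labels in the given order."""
    return sum(dist(points[a], points[b])
               for a, b in zip(order, order[1:] + order[:1]))

eps = 1e-6
pts = layout(eps)
closest_pair_tour = tour_length(pts, list("ADEBCF"))  # tour built by ClosestPair
boundary_tour = tour_length(pts, list("ABCFED"))      # going around the boundary
extra = closest_pair_tour / boundary_tour - 1
print(closest_pair_tour, boundary_tour, extra)
```

For small ɛ the extra fraction approaches (√5 - 1) / 6, i.e. roughly 20.6%.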

Algorithm to "transfer water from a set of bottles to another one" (metaphorically speaking)

Ok, I have a problem. I have a set "A" of bottles of various sizes, all full of water.
Then I have another set "B" of bottles of various sizes, all empty.
I want to transfer the water from A to B, knowing that the total capacity of each set is the same. (i.e.: Set A contains the same amount of water as set B).
This is of course trivial in itself: just take the first bottle in A and pour it into the first bottle in B until that one is full. Then, if the bottle from A still has water in it, go on with the second bottle in B, etc.
However, I want to minimize the total number of pours (the action of pouring from a bottle into another, each action counts 1, independently from how much water it involves)
I'd like to find a greedy algorithm to do this, or if not possible at least an efficient one. However, efficiency is secondary to correctness of the algorithm (I don't want a suboptimal solution).
Of course this problem is just a metaphor for a real problem in a computer program to manage personal expenses.
Bad news: this problem is NP-hard by a reduction from subset sum. Given numbers x1, …, xn, S, the object of subset sum is to determine whether or not some subset of the x_i sums to S. We make A-bottles with capacities x1, …, xn and B-bottles with capacities S and (x1 + … + xn - S) and determine whether n pours are sufficient.
Good news: any greedy strategy (i.e., choose any nonempty A, choose any unfilled B, pour until we have to stop) is a 2-approximation (i.e., uses at most twice as many pours as optimal). The optimal solution uses at least max(|A|, |B|) pours, and greedy uses at most |A| + |B|, since every time greedy does a pour, either an A is drained or a B is filled and does not need to be poured out of or into again.
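One such greedy strategy can be sketched as follows, assuming positive integer capacities with equal totals. The name `greedy_pours` and the choice to take bottles in list order are illustrative assumptions; any other order gives the same guarantee.

```python
def greedy_pours(A, B):
    """Return a list of (a_index, b_index, amount) pours that empties A into B."""
    water = list(A)   # water left in each A-bottle
    room = list(B)    # free capacity left in each B-bottle
    i = j = 0
    pours = []
    while i < len(water) and j < len(room):
        amount = min(water[i], room[j])
        pours.append((i, j, amount))
        water[i] -= amount
        room[j] -= amount
        # Every pour drains an A-bottle or fills a B-bottle (or both).
        if water[i] == 0:
            i += 1
        if room[j] == 0:
            j += 1
    return pours

plan = greedy_pours([5, 4], [4, 4, 1])
print(len(plan), plan)
```

On this input the sketch uses 4 pours, which sits between the lower bound max(|A|, |B|) = 3 and the upper bound |A| + |B| = 5. The optimum here is 3 (5 into 4 and 1, then 4 into 4), so greedy in this fixed order is not optimal.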
There might be an approximation scheme (a (1 + ε)-approximation for any ε > 0). I think now it's more likely that there's an inapproximability result – the usual tricks for obtaining approximation schemes don't seem to apply here.
Here are some ideas that might lead to a practical exact algorithm.
Given a solution, draw a bipartite graph with left vertices A and right vertices B and an (undirected) edge from a to b if and only if a is poured into b. If the solution is optimal, I claim that there are no cycles – otherwise we could eliminate the smallest pour in the cycle and replace the lost volume going around the cycle. For example, if I have pours
a1 -> b1: 1
a1 -> b2: 2
a2 -> b1: 3
a2 -> b3: 4
a3 -> b2: 5
a3 -> b3: 6
then I can eliminate the a1 -> b1 pour like so:
a2 -> b1: 4 (+1)
a2 -> b3: 3 (-1)
a3 -> b3: 7 (+1)
a3 -> b2: 4 (-1)
a1 -> b2: 3 (+1)
Now, since the graph has no cycle, we can count the number of edges (pours) as |A| + |B| - #(connected components). The only variable here is the number of connected components, which we want to maximize.
I claim that the greedy algorithm forms graphs that have no cycle. If we knew what the connected components of an optimal solution were, we could use a greedy algorithm on each one and get an optimal solution.
One way to tackle this subproblem would be to use dynamic programming to enumerate all subset pairs X of A and Y of B such that sum(X) == sum(Y) and then feed these into an exact cover algorithm. Both steps are of course exponential, but they might work well on real data.
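As a small-scale illustration of maximizing the number of connected components, here is a sketch that uses a bitmask dynamic program over all subsets of bottles. Note this is a swapped-in technique, not the exact-cover formulation suggested above; it assumes positive capacities with sum(A) == sum(B), and it is exponential in |A| + |B|, so it only suits small inputs. The name `min_pours` is hypothetical.

```python
def min_pours(A, B):
    """Minimum pours = |A| + |B| - (max number of balanced groups)."""
    # A-bottles count positive, B-bottles negative; a group is balanced
    # when its signed capacities sum to zero.
    items = list(A) + [-b for b in B]
    n = len(items)
    total = [0] * (1 << n)    # signed sum of each subset
    groups = [0] * (1 << n)   # max number of nested balanced subsets inside mask
    for mask in range(1, 1 << n):
        low = (mask & -mask).bit_length() - 1
        total[mask] = total[mask & (mask - 1)] + items[low]
        best = 0
        for i in range(n):
            if (mask >> i) & 1:
                best = max(best, groups[mask ^ (1 << i)])
        # If mask itself is balanced, it closes one more group.
        groups[mask] = best + (1 if total[mask] == 0 else 0)
    return n - groups[(1 << n) - 1]

print(min_pours([5, 4], [4, 4, 1]))
print(min_pours([5, 3, 2, 1], [4, 3, 2, 1, 1]))
```

Each balanced group with a A-bottles and b B-bottles can be served with a + b - 1 pours by running greedy inside it, which is where the |A| + |B| - #(connected components) count comes from.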
Here's my take:
1. Identify bottles having the exact same size in both sets. This translates to a one-to-one pour for these same-size bottles.
2. Sort the remaining bottles in A in descending order by capacity, and sort the remaining bottles in B in ascending order. Compute the number of pours you need when pouring the sorted list in A into B.
Update: After each pour in step 2, repeat step 1. (Optimization step suggested by Steve Jessop). Rinse and repeat until all water is transferred.
I think this gives the minimum number of pours:
import bisect
def pours(A, B):
    assert sum(A) == sum(B)
    count = 0
    A.sort()
    B.sort()
    while A and B:
        i = A.pop()  # largest remaining amount in A
        j = B.pop()  # largest remaining capacity in B
        if i == j:
            count += 1
        elif i > j:
            # B's bottle is full; the leftover water goes back into A
            bisect.insort(A, i-j)
            count += 1
        elif i < j:
            # A's bottle is empty; the leftover capacity goes back into B
            bisect.insort(B, j-i)
            count += 1
    return count
A=[5,4]
B=[4,4,1]
print(pours(A,B))
# gives 3
A=[5,3,2,1]
B=[4,3,2,1,1]
print(pours(A,B))
# gives 5
In English it reads:
assert that both lists have the same sum (I think the algorithm will still work if sum(A) > sum(B) or sum(A) < sum(B) is true)
take the two lists A and B and sort both of them
while A isn't empty and B isn't empty:
    take i (the largest) from A and j (the largest) from B
    if i equals j, pour i into j and count 1 pour
    if i is larger than j, pour i into j, place the i-j remainder back in A (keeping A sorted), count 1 pour
    if i is smaller than j, pour i into j, place the j-i remainder back in B (keeping B sorted), count 1 pour

Resources