Fast way to compare cyclical data - algorithm

Suppose I have the data set {A,B,C,D}, of arbitrary type, and I want to compare it to another data set. I want the comparison to be true for {A,B,C,D}, {B,C,D,A}, {C,D,A,B}, and {D,A,B,C}, but not for {A,C,B,D} or any other set that is not ordered similarly. What is a fast way to do this?
Storing them in arrays, rotating, and doing the comparison that way is an O(n^2) task, so that's not very good.
My first intuition would be to store the data as a doubled sequence like {A,B,C,D,A,B,C} and then search for the other sequence as a contiguous run within it, which is only O(n). Can this be done any faster?

There is a fast algorithm for finding the minimum rotation of a string - https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation. So you can store and compare the minimum rotation.
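For example, a simple way to use this idea (the helper names canonical_rotation and cyclic_equal are mine; this naive version scans all rotations in O(n^2), while Booth's algorithm from the linked page computes the same canonical form in O(n)):
def canonical_rotation(seq):
    """Return the lexicographically smallest rotation of seq as a tuple.
    Naive O(n^2) version for clarity; elements must be mutually comparable."""
    seq = list(seq)
    n = len(seq)
    return min(tuple(seq[i:] + seq[:i]) for i in range(n))

def cyclic_equal(a, b):
    """True if b is some rotation of a (duplicates are handled correctly)."""
    return len(a) == len(b) and canonical_rotation(a) == canonical_rotation(b)

print(cyclic_equal("ABCD", "CDAB"))   # True
print(cyclic_equal("ABCD", "ACBD"))   # False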

One option is to use a directed graph. Set up a graph with the following transitions:
A -> B
B -> C
C -> D
D -> A
All other transitions will put you in an error state. Thus, provided each member is unique (which is implied by your use of the word set), you will be able to determine membership provided you end on the same graph node on which you started.
If a value can appear multiple times in your search, you'll need a smarter set of states and transitions.
This approach is useful if you precompute a single search and then match it to many data points. It's not so useful if you have to constantly regenerate the graph. It could also be cache-inefficient if your state table is large.
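A minimal sketch of that idea in Python (build_transitions and matches_cycle are my names; unique members are assumed, as in the answer above):
def build_transitions(reference):
    """Precompute the allowed successor of each member of the reference cycle."""
    n = len(reference)
    return {reference[i]: reference[(i + 1) % n] for i in range(n)}

def matches_cycle(transitions, candidate):
    """True if candidate walks the precomputed cycle and closes back on its start."""
    n = len(candidate)
    if n != len(transitions):
        return False
    return all(transitions.get(candidate[i]) == candidate[(i + 1) % n]
               for i in range(n))

transitions = build_transitions(list("ABCD"))    # built once ...
print(matches_cycle(transitions, list("CDAB")))  # ... matched many times: True
print(matches_cycle(transitions, list("ACBD")))  # False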

Well Dr Zoidberg, if you are interested in order, as you are, then you need to store your data in a structure that preserves order and also allows for easy rotation.
In Python a list would do.
Find the smallest element of the list, then rotate each list you want to compare until its smallest element is at the beginning. Note: this is not a sort, but a rotation. With all the lists for comparison normalised this way, a straightforward list compare between any two will tell if they are the same after rotation.
>>> def rotcomp(lst1, lst2):
        while min(lst1) != lst1[0]:
            lst1 = lst1[1:] + [lst1[0]]
        while min(lst2) != lst2[0]:
            lst2 = lst2[1:] + [lst2[0]]
        return lst1 == lst2
>>> rotcomp(list('ABCD'), list('CDAB'))
True
>>> rotcomp(list('ABCD'), list('CDBA'))
False
>>>
>>> rotcomp(list('AABC'), list('ABCA'))
False
>>> def rotcomp2(lst1, lst2):
        return repr(lst1)[1:-1] in repr(lst2 + lst2)
>>> rotcomp2(list('ABCD'), list('CDAB'))
True
>>> rotcomp2(list('ABCD'), list('CDBA'))
False
>>> rotcomp2(list('AABC'), list('ABCA'))
True
>>>
WITH DUPLICATES?
If the input may contain duplicates then (as in the possible twin question linked under the question) an alternative algorithm is to check whether one list is a contiguous sub-list of the other list repeated twice.
The function rotcomp2 uses that algorithm, implemented as a textual comparison of the repr of the list contents (note that it does not check that the two lists have the same length).
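If the inputs might differ in length, the same idea needs an explicit length check. A small variant (rotcomp3 is my name, not part of the original answer):
>>> def rotcomp3(lst1, lst2):
        # as rotcomp2, plus a length check so a shorter list is not accepted
        return len(lst1) == len(lst2) and repr(lst1)[1:-1] in repr(lst2 + lst2)

>>> rotcomp3(list('AB'), list('ABCD'))
False
>>> rotcomp3(list('AABC'), list('ABCA'))
True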


How do I find the right optimisation algorithm for my problem?

Disclaimer: I'm not a professional programmer or mathematician and this is my first time encountering the field of optimisation problems. Now that that's out of the way, let's get to the problem at hand:
I have several lists, each containing various items and a number called 'mandatoryAmount':
listA (mandatoryAmountA, itemA1, itemA2, itemA3, ...)
Each item has certain values (each value is a number >= 0):
itemA1 (M, E, P, C, Al, Ac, D, Ab, S)
I have to choose a certain number of items from each list determined by 'mandatoryAmount'.
Within each list I can choose every item multiple times.
Once I have all of the items from each list, I'll add up the values of each.
For example:
totalM = listA (itemA1 (M) + itemA1 (M) + itemA3 (M)) + listB (itemB1 (M) + itemB2 (M))
The goals are:
- To have certain values (totalAl, totalAc, totalAb, totalS) reach a certain number cap while going over that cap as little as possible. Anything over that cap is wasted.
- To maximize the remaining values with different weightings each
The output should be the best possible selection of items to meet the goals stated above. I imagine the evaluation function to just add up all non-waste values times their respective weightings while subtracting all wasted stats times their respective weightings.
edit:
The total amount of items across all lists should be somewhere between 500 and 1000, the number of lists is around 10 and the mandatoryAmount for each list is between 0 and 14.
Here's some sample code that uses Python 3 and OR-Tools. Let's start by
defining the input representation and a random instance.
import collections
import random
Item = collections.namedtuple("Item", ["M", "E", "P", "C", "Al", "Ac", "D", "Ab", "S"])
List = collections.namedtuple("List", ["mandatoryAmount", "items"])
def RandomItem():
    return Item(
        random.random(),
        random.random(),
        random.random(),
        random.random(),
        random.random(),
        random.random(),
        random.random(),
        random.random(),
        random.random(),
    )

lists = [
    List(
        random.randrange(5, 10), [RandomItem() for j in range(random.randrange(5, 10))]
    )
    for i in range(random.randrange(5, 10))
]
Time to formulate the optimization as a mixed-integer program. Let's import
the solver library and initialize the solver object.
from ortools.linear_solver import pywraplp
solver = pywraplp.Solver.CreateSolver("solver", "SCIP")
Make constraints for the totals that must reach a certain cap.
AlCap = random.random()
totalAl = solver.Constraint(AlCap, solver.infinity())
AcCap = random.random()
totalAc = solver.Constraint(AcCap, solver.infinity())
AbCap = random.random()
totalAb = solver.Constraint(AbCap, solver.infinity())
SCap = random.random()
totalS = solver.Constraint(SCap, solver.infinity())
We want to maximize the other values subject to some weighting.
MWeight = random.random()
EWeight = random.random()
PWeight = random.random()
CWeight = random.random()
DWeight = random.random()
solver.Objective().SetMaximization()
Create variables and fill in the constraints. For each list there is an
equality constraint on the number of items.
associations = []
for list_ in lists:
    amount = solver.Constraint(list_.mandatoryAmount, list_.mandatoryAmount)
    for item in list_.items:
        x = solver.IntVar(0, solver.infinity(), "")
        amount.SetCoefficient(x, 1)
        totalAl.SetCoefficient(x, item.Al)
        totalAc.SetCoefficient(x, item.Ac)
        totalAb.SetCoefficient(x, item.Ab)
        totalS.SetCoefficient(x, item.S)
        solver.Objective().SetCoefficient(
            x,
            MWeight * item.M
            + EWeight * item.E
            + PWeight * item.P
            + CWeight * item.C
            + DWeight * item.D,
        )
        associations.append((item, x))

if solver.Solve() != solver.OPTIMAL:
    raise RuntimeError

solution = []
for item, x in associations:
    solution += [item] * round(x.solution_value())
print(solution)
I think David Eisenstat has the right idea with integer programming, but let's see if we can get some good solutions otherwise and perhaps provide some initial optimization. However, the fact that we can simply choose all of one item in each list may make this easier to solve than it normally would be. Basically that turns it into more of a Subset Sum problem, especially with the cap.
There are two possibilities here:
There is no solution; no selection satisfies the requirements.
There is a solution that needs to be optimized.
We really want to try to find a solution first; if we can find one (regardless of the amount of waste), that's nice.
So let's reframe the problem: we aim simply to minimize waste, but we also need to meet the minimum requirements. So let's try to put as much of the waste as we can into values we actually need.
I'm going to propose an algorithm you could use that should work "fairly well" and runs in polynomial time, though it could probably be optimized further. I'll be using K to mean mandatoryAmount, as that's a fairly customary variable name in this situation. I'll also use N for the number of lists and Z for the total number of items (across all lists).
1. Get the list of all items and sort them by the amount of each value they have (first the goal values, then the bonus values). If an item has 100A, 300C, 200B, 400D, 150E and the required values are [B, D], then the sort order would look like [400, 200, 300, 150, 100]. Repeat, but for one goal value at a time; using the same example we would have [400, 300, 150, 100] for goal D and [200, 300, 150, 100] for goal B. Create a boolean variable for optimization mode (we start by seeking a solution; once we find one, we can try to optimize it). Create a counter/hash to track unassigned items. An item cannot be unassigned more than K times (to avoid infinite loops). This isn't strictly needed, but it can work as an optimization for step 5, as it prioritizes goals you actually need.
2. For each list, keep a counter of the number of assignable slots, set to K, as well as the total number of assignable slots, set to K * N. These will be adjusted as needed along the way. You want quick O(1) lookups for: a) which list a (sorted) item belongs to, b) how many available slots that item has, c) how many times the item has been unassigned, and d) where the item sits in the sorted list.
3. General assignment. While there are slots available (total slots), go through the sorted list from highest to lowest. If the list for that item has slots available, assign as many slots as possible to that item. Update the assignable and total slots. If the result is a valid solution, record it and trip the optimization-mode flag. If slots remain unassigned, revert the previous unassignment (but do not change the unassignment count).
4. Waste optimization. Find the most wasteful item that can be unassigned (unassigned count < K). Unassign one slot of it. If in optimization mode, do not allow any of the goal values to go below their cap (skip the item if it would). Update the unassigned count for the item. Go to step 3, but start just after the wasteful item. If no assignment is made, reassign this item until its list has no remaining assignments, but do not update the unassigned count (otherwise we might end up in an invalid state).
5. Goal-value optimization. Skip if the current state is a valid solution. Find the value farthest from its goal (i.e. A/B/C/D/E above) that can be unassigned. Unassign one slot for that item. Update the unassignment count. Go to step 3, beginning the search at the start of the list (unlike step 4), and stop searching the list once you go below the value of this item (not the item itself, as others may have the same value). If no assignment is made, reassign this item until its list has no remaining assignments, but do not update the unassigned count (otherwise we might end up in an invalid state).
6. No assignments remain. Return the current state as the "best solution found".
The algorithm should end with the "best" solution this approach can come up with. Increasing the max unassignment count may improve the solution; decreasing it will speed up the algorithm. The algorithm will run until it has maxed out its unassignment counts.
This is a bit of a greedy algorithm, so I'm not sure it's optimal (in that it will always yield the best result), but it may give you some ideas as to how to approach the problem. It also feels like it should yield fairly good results, as it is basically trying to bound the results. Algorithm performance is something like O(Z^2 * K), where K is the mandatoryAmount and Z is the total number of items: each item is unassigned at most K times, and each unassignment potentially requires O(Z) checks before the item is reassigned.
As an optimization, store the sorted lists in a sorted data structure with O(log N) or better delete/next operations. That would make it practical to delete items from the assignment lists once their unassignment count reaches K (rendering them no longer assignable), giving O(Z * log(Z) * K) performance instead.
Edit:
Hmmm, the above only works within a single list (i.e. an item that is removed can only be added back to its own list, as only that list has room). To avoid this, do step 4 (remove too heavy), then step 5 (remove too light), and then go to step 3 (using step 5's rules for searching, but also disallowing adding back the too-heavy ones).
So basically we remove the heaviest one then the lightest one then we try to assign something that is as heavy as possible to make up for the lightest one we removed.

How to assign many subsets to their largest supersets?

My data has a large number of sets (a few million). Each set's size ranges from a few members to tens of thousands of integers. Many of those sets are subsets of larger sets (there are many such supersets). I'm trying to assign each subset to its largest superset.
Can anyone recommend an algorithm for this type of task?
There are many algorithms for generating all possible sub-sets of a set, but this type of approach is time-prohibitive given my data size (e.g. this paper or SO question).
Example of my data-set:
A {1, 2, 3}
B {1, 3}
C {2, 4}
D {2, 4, 9}
E {3, 5}
F {1, 2, 3, 7}
Expected answer: B and A are subsets of F (it's not important that B is also a subset of A); C is a subset of D; E remains unassigned.
Here's an idea that might work:
Build a table that maps each number to a sorted list of sets, sorted first by size with largest first and then, within each size, arbitrarily but in some canonical order (say, alphabetically by set name). So in your example, you'd have a table that maps 1 to [F, A, B], 2 to [F, A, D, C], 3 to [F, A, B, E] and so on. This can be implemented to take O(n log n) time, where n is the total size of the input.
For each set in the input:
fetch the lists associated with each entry in that set. So for A, you'd get the lists associated with 1, 2, and 3. The total number of lookups you'll issue over the runtime of the whole algorithm is O(n), so the runtime so far is O(n log n + n), which is still O(n log n).
Now walk down each of those lists simultaneously. If a set is the first entry in every list, then it's the largest set that contains the input set. Output that association and continue with the next input set. If not, then discard the smallest item among all the items at the heads of the lists and try again. Implementing this last bit is tricky, but you can store the heads of all lists in a heap and get (IIRC) something like O(n log k) overall runtime, where k is the maximum size of any individual set, so you can bound that at O(n log n) in the worst case.
So if I got everything straight, the runtime of the algorithm is overall O(n log n), which seems like probably as good as you're going to get for this problem.
Here is a python implementation of the algorithm:
from collections import defaultdict, deque
import heapq

def LargestSupersets(setlists):
    '''Computes, for each item in the input, the largest superset in the same input.

    setlists: A list of lists, each of which represents a set of items. Items must be hashable.
    '''
    # First, build a table that maps each element in any input setlist to a list of records
    # of the form (-size of setlist, index of setlist), one for each setlist that contains
    # the corresponding element
    element_to_entries = defaultdict(list)
    for idx, setlist in enumerate(setlists):
        entry = (-len(setlist), idx)  # cheesy way to make an entry that sorts properly -- largest first
        for element in setlist:
            element_to_entries[element].append(entry)

    # Within each entry, sort so that larger items come first, with ties broken arbitrarily by
    # the set's index
    for entries in element_to_entries.values():
        entries.sort()

    # Now build up the output by going over each setlist and walking over the entries list for
    # each element in the setlist. Since the entries list for each element is sorted largest to
    # smallest, the first entry we find that is in every entry set we pulled will be the largest
    # element of the input that contains each item in this setlist. We are guaranteed to eventually
    # find such an element because, at the very least, the item we're iterating on itself is in
    # each entries list.
    output = []
    for idx, setlist in enumerate(setlists):
        num_elements = len(setlist)
        buckets = [element_to_entries[element] for element in setlist]

        # We implement the search for an item that appears in every list by maintaining a heap and
        # a queue. We have the invariants that:
        #   1. The queue contains the n smallest items across all the buckets, in order
        #   2. The heap contains the smallest item from each bucket that has not already passed through
        #      the queue.
        smallest_entries_heap = []
        smallest_entries_deque = deque([], num_elements)
        for bucket_idx, bucket in enumerate(buckets):
            smallest_entries_heap.append((bucket[0], bucket_idx, 0))
        heapq.heapify(smallest_entries_heap)

        while (len(smallest_entries_deque) < num_elements or
               smallest_entries_deque[0] != smallest_entries_deque[num_elements - 1]):
            # First extract the next smallest entry in the queue ...
            (smallest_entry, bucket_idx, element_within_bucket_idx) = heapq.heappop(smallest_entries_heap)
            smallest_entries_deque.append(smallest_entry)

            # ... then add the next-smallest item from the bucket that we just removed an element from
            if element_within_bucket_idx + 1 < len(buckets[bucket_idx]):
                new_element = buckets[bucket_idx][element_within_bucket_idx + 1]
                heapq.heappush(smallest_entries_heap, (new_element, bucket_idx, element_within_bucket_idx + 1))

        output.append((idx, smallest_entries_deque[0][1]))

    return output
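For reference, running it on the data from the question (indices 0 through 5 standing for sets A through F) should give the following, where a pair (i, i) means set i has no strict superset in the input:
sets = [[1, 2, 3], [1, 3], [2, 4], [2, 4, 9], [3, 5], [1, 2, 3, 7]]
print(LargestSupersets(sets))
# expected: [(0, 5), (1, 5), (2, 3), (3, 3), (4, 4), (5, 5)]
# i.e. A and B map to F, C maps to D, and D, E, F have no strict superset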
Note: don't trust my writeup too much here. I just thought of this algorithm right now, I haven't proved it correct or anything.
So you have millions of sets, with thousands of elements each. Just representing that dataset takes billions of integers. In your comparisons you'll quickly get to trillions of operations without even breaking a sweat.
Therefore I'll assume that you need a solution which will distribute across a lot of machines. Which means that I'll think in terms of https://en.wikipedia.org/wiki/MapReduce. A series of them.
1. Read the sets in, mapping them to k:v pairs of i: s where i is an element of the set s.
2. Receive a key that is an integer i, along with the list of sets containing it. Map these off to pairs (s1, s2): i where s1 <= s2 are both sets that contain i. Don't forget to pair each set with itself!
3. For each pair (s1, s2), count the size k of the intersection and send off the pairs s1: k and s2: k. (Only send the second if s1 and s2 are different.)
4. For each set s, receive the set of its supersets. If it is maximal, send off s: s. Otherwise send off t: s for every t that is a strict superset of s.
5. For each set s, receive the set of its subsets, with s in the list only if it is maximal. If s is maximal, send off t: s for every t that is a subset of s.
6. For each set we receive the set of maximal sets that it is a subset of. (There may be many.)
There are a lot of steps for this, but at its heart it requires repeated comparisons between pairs of sets with a common element for each common element. Potentially that is O(n * n * m) where n is the number of sets and m is the number of distinct elements that are in many sets.
Here is a simple suggestion for an algorithm that might give better results based on your numbers (n = 10^6 to 10^7 sets with m = 2 to 10^5 members, and a lot of super/subsets). Of course it depends a lot on your data. Generally speaking, the complexity is much worse than for the other proposed algorithms. Maybe you could process only the sets with fewer than X members (e.g. 1000) this way and use the other proposed methods for the rest.
1. Sort the sets by their size.
2. Remove the first (smallest) set and start comparing it against the others from behind (largest set first).
3. Stop as soon as you find a superset and create the relation. Just remove the set if no superset is found.
4. Repeat steps 2 and 3 for all but the last set.
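A minimal sketch of the steps above in Python (the names assign_supersets and named_sets are mine, not from the answer; ties between equal-sized sets are broken arbitrarily):
def assign_supersets(named_sets):
    """For each set, record the first strict superset found among the larger sets,
    scanning from the largest downwards. named_sets maps a name to a Python set."""
    ordered = sorted(named_sets.items(), key=lambda kv: len(kv[1]))   # smallest first
    relations = {}
    for i, (name, members) in enumerate(ordered):
        for other_name, other_members in reversed(ordered[i + 1:]):   # largest first
            if members < other_members:                               # strict superset
                relations[name] = other_name
                break
    return relations

data = {"A": {1, 2, 3}, "B": {1, 3}, "C": {2, 4}, "D": {2, 4, 9},
        "E": {3, 5}, "F": {1, 2, 3, 7}}
print(assign_supersets(data))   # e.g. {'B': 'F', 'C': 'D', 'A': 'F'}; E stays unassigned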
If you're using Excel, you could structure it as follows:
1) Create a cartesian plot as a two-way table that has all your data sets as titles on both the side and the top
2) In a separate tab, create a row for each data set in the first column, along with a second column that will count the number of entries (ex: F has 4), and then just stack FIND(",") and MID formulas across the sheet to split out all the entries within each data set. Use the counter in the second column to do COUNTIF(">0"). Each variable you find can be your starting point in a subsequent FIND until it runs out of variables and just returns a blank.
3) Go back to your cartesian plot, and bring over the separate entries you just generated for your column titles (ex: F is 1,2,3,7). Use an AND statement to check that each entry in your left-hand column is in your top-row data set, using an OFFSET to your separate area and utilizing your counter as the width for the OFFSET.

Check if a collection of sets is pairwise disjoint

What is the most efficient way to determine whether a collection of sets is pairwise disjoint? -- i.e. verifying that the intersection between all pairs of sets is empty. How efficiently can this be done?
The sets from a collection are pairwise disjoint if, and only if, the size of their union equals the sum of their sizes (this statement applies to finite sets):
def pairwise_disjoint(sets) -> bool:
    union = set().union(*sets)
    return len(union) == sum(map(len, sets))
This could be a one-liner, but readability counts.
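A quick check (my own example, not from the answer):
print(pairwise_disjoint([{1, 2}, {3, 4}, {5}]))   # True
print(pairwise_disjoint([{1, 2}, {2, 3}, {4}]))   # False: 2 appears in two sets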
Expected linear time O(total number of elements):
def all_disjoint(sets):
    union = set()
    for s in sets:
        for x in s:
            if x in union:
                return False
            union.add(x)
    return True
This is optimal under the assumption that your input is a collection of sets represented as some kind of unordered data structure (a hash table, say), because then you need to look at every element at least once.
You can do much better by using a different representation for your sets. For example, by maintaining a global hash table that stores for each element the number of sets it is stored in, you can do all the set operations optimally and also check for disjointness in O(1).
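As an illustration of that last idea, here is a minimal sketch (the class and method names are mine, and each element is assumed to be inserted at most once per set):
from collections import defaultdict

class DisjointnessTracker:
    """Keeps, for each element, the number of sets containing it, plus a count of
    elements that appear in two or more sets, so the disjointness query is O(1)."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.shared = 0                    # elements appearing in 2+ sets

    def add(self, element):                # element was added to some set
        self.counts[element] += 1
        if self.counts[element] == 2:
            self.shared += 1

    def remove(self, element):             # element was removed from some set
        if self.counts[element] == 2:
            self.shared -= 1
        self.counts[element] -= 1

    def all_disjoint(self):
        return self.shared == 0

tracker = DisjointnessTracker()
for s in [{1, 2}, {3, 4}, {2, 5}]:
    for x in s:
        tracker.add(x)
print(tracker.all_disjoint())   # False: 2 is in two sets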
Using Python as pseudo-code, the following tests the intersection of each pair of sets only once.
def all_disjoint(sets):
    S = list(sets)
    while S:
        s = S.pop()  # remove an element
        # loop over the remaining ones
        for t in S:
            # test for intersection
            if not s.isdisjoint(t):
                return False
    return True
The number of intersection tests is the same as the number of edges in a fully connected graph with as many vertices as there are sets. The function also exits early if any pair is found not to be disjoint.

Determine conflict-free sets?

Suppose you have a bunch of sets, where each set has a couple of subsets.
Set1 = { (banana, pineapple, orange), (apple, kale, cucumber), (onion, garlic) }
Set2 = { (banana, cucumber, garlic), (avocado, tomato) }
...
SetN = { ... }
The goal now is to select one subset from each set, such that each selected subset is conflict-free with every other selected subset. For this toy-size example, a possible solution would be to select (banana, pineapple, orange) (from Set1) and (avocado, tomato) (from Set2).
A conflict would occur, if one would select the first subset of Set1 and Set2 because the banana would be contained in both subsets (which is not possible because it exists only once).
Even though there are many algorithms, I was unable to pick a suitable one. I'm somewhat stuck and would appreciate answers targeting the following questions:
1) How do I find a suitable algorithm and represent this problem in such a way that it can be processed by the algorithm?
2) What might a possible solution for this toy-size example look like (any language is fine, I just want to get the idea)?
Edit1: I was thinking about simulated annealing, too (return one possible solution). This could be of interest to minimize, e.g., the overall cost of selecting the sets. However, I could not figure out how to make an appropriate problem description that takes the 'conflicts' into account.
This problem can be formulated as a generalized exact cover problem.
Create a new atom for each set of sets (Set1, Set2, etc.) and turn your input into an instance like so:
{Set1, banana, pineapple, orange}
{Set1, apple, kale, cucumber}
{Set1, onion, garlic}
{Set2, banana, cucumber, garlic}
{Set2, avocado, tomato}
...
making the Set* atoms primary (covered exactly once) and the other atoms secondary (covered at most once). Then you can solve it with a generalization of Knuth's Algorithm X.
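Algorithm X itself is too long to reproduce here, but to make the constraint semantics concrete, here is a plain brute-force enumeration over the toy instance (this is not Algorithm X; the function name and structure are mine):
from itertools import product

sets = [
    [{"banana", "pineapple", "orange"}, {"apple", "kale", "cucumber"}, {"onion", "garlic"}],
    [{"banana", "cucumber", "garlic"}, {"avocado", "tomato"}],
]

def conflict_free(selection):
    """Each Set* atom covered exactly once (one subset per set), each item at most once."""
    seen = set()
    for subset in selection:
        if seen & subset:          # some secondary atom would be covered twice
            return False
        seen |= subset
    return True

for selection in product(*sets):   # one subset chosen from each set
    if conflict_free(selection):
        print(selection)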
Looking at the list of sets, I had the image of a maze with multiple entrances. The task is akin to tracing paths from top to bottom that are free of subset-intersections. The example in Haskell picks all entrances, and tries each path, returning those that succeed.
My understanding of how the code works (algorithm):
For each subset in the first set, pick each subset in the next set whose intersection with each of the subsets in the accumulated result is null. If there are no subsets matching the criteria, abandon that branch of the search. If there are no sets left to pick from, return that result. Call the function recursively for all chosen subsets (and the corresponding accumulated results).
import Data.List (intersect)
import Control.Monad (guard)

sets = [ [["banana", "pineapple", "orange"], ["apple", "kale", "cucumber"], ["onion", "garlic"]]
       , [["banana", "cucumber", "garlic"], ["avocado", "tomato"]]
       ]

solve sets = solve' sets [] where
    solve' [] result = [result]
    solve' (set:rest) result = do
        subset <- set
        guard (all null (map (intersect subset) result))
        solve' rest (result ++ [subset])
OUTPUT:
*Main> solve sets
[[["banana","pineapple","orange"],["avocado","tomato"]]
,[["apple","kale","cucumber"],["avocado","tomato"]]
,[["onion","garlic"],["avocado","tomato"]]]

Algorithm/Data Structure for finding combinations of minimum values easily

I have a symmetric matrix; for this example its values are (the diagonal is unused):
    A  B  C  D
A   -  6  8  4
B   6  -  6  4
C   8  6  -  2
D   4  4  2  -
I've made up the notation A.B which represents the value at grid point (A, B). Furthermore, writing A.B.C gives me the minimum grid point value like so: MIN((A,B), (A,C), (B,C)).
As another example A.B.D gives me MIN((A,B), (A,D), (B,D)).
My goal is to find the minimum values for ALL combinations of letters (not repeating) for one row at a time e.g for this example I need to find min values with respect to row A which are given by the calculations:
A.B = 6
A.C = 8
A.D = 4
A.B.C = MIN(6,8,6) = 6
A.B.D = MIN(6, 4, 4) = 4
A.C.D = MIN(8, 4, 2) = 2
A.B.C.D = MIN(6, 8, 4, 6, 4, 2) = 2
I realize that certain calculations can be reused, which becomes increasingly important as the matrix size increases, but the problem is finding the most efficient way to implement this reuse.
Can someone point me in the right direction to an efficient algorithm/data structure I can use for this problem?
You'll want to think about the lattice of subsets of the letters, ordered by inclusion. Essentially, you have a value f(S) given for every subset S of size 2 (that is, every off-diagonal element of the matrix - the diagonal elements don't seem to occur in your problem), and the problem is to find, for each subset T of size greater than two, the minimum f(S) over all S of size 2 contained in T. (And then you're interested only in sets T that contain a certain element "A" - but we'll disregard that for the moment.)
First of all, note that if you have n letters, this amounts to asking Omega(2^n) questions, roughly one for each subset. (Excluding the zero- and one-element subsets and those that don't include "A" saves you n + 1 sets and a factor of two, respectively, which is allowed for by big Omega.) So if you want to store all these answers for even moderately large n, you'll need a lot of memory. If n is large in your applications, it might be best to store some collection of pre-computed data and do some computation whenever you need a particular data point; I haven't thought about what would work best, but for example computing data only for a binary tree contained in the lattice would not necessarily help you beyond precomputing nothing at all.
With these things out of the way, let's assume you actually want all the answers computed and stored in memory. You'll want to compute these "layer by layer", that is, starting with the three-element subsets (since the two-element subsets are already given by your matrix), then four-element, then five-element, etc. This way, for a given subset T, when we're computing f(T) we will already have computed all f(S) for S strictly contained in T. There are several ways that you can make use of this, but I think the easiest might be the following: let t1 and t2 be two different elements of T that you may select however you like; let S be the subset of T that you get when you remove t1 and t2. Write S1 for S plus t1 and write S2 for S plus t2. Now every pair of letters contained in T is either fully contained in S1, or it is fully contained in S2, or it is {t1, t2}. Look up f(S1) and f(S2) in your previously computed values, then look up f({t1, t2}) directly in the matrix, and store f(T) = the minimum of these 3 numbers.
If you never select "A" for t1 or t2, then indeed you can compute everything you're interested in while not computing f for any sets T that don't contain "A". (This is possible because the steps outlined above are only interesting whenever T contains at least three elements.) Good! This leaves just one question - how to store the computed values f(T). What I would do is use a 2^(n-1)-sized array; represent each subset-of-your-alphabet-that-includes-"A" by the (n-1)-bit number where the ith bit is 1 whenever the (i+1)th letter is in that set (so 0010110, which has bits 1, 2, and 4 set, represents the subset {"A", "C", "D", "F"} out of the alphabet "A" .. "H" - note I'm counting bits starting at 0 from the right, and letters starting at "A" = 0). This way, you can actually iterate through the sets in numerical order and don't need to think about how to iterate through all k-element subsets of an n-element set. (You do need to include a special case for when the set under consideration has 0 or 1 elements, in which case you'll want to do nothing, or 2 elements, in which case you just copy the value from the matrix.)
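A sketch of that layer-by-layer computation in Python (min_pair_values and pair are my names; pair[i][j] is the matrix entry for letters i and j, letters are indexed 0..n-1 with "A" = 0, and bit i of a mask stands for letter i+1, matching the encoding described above):
def min_pair_values(pair, n):
    """f[mask] = minimum pair value over the set consisting of "A" plus the letters
    whose bits are set in mask, computed layer by layer (smaller masks first)."""
    size = 1 << (n - 1)
    f = [None] * size
    for mask in range(1, size):
        bits = [i + 1 for i in range(n - 1) if (mask >> i) & 1]
        if len(bits) == 1:
            f[mask] = pair[0][bits[0]]        # two-element set {A, x}: read the matrix
        else:
            t1, t2 = bits[0], bits[1]         # two elements of T other than A
            s1 = mask & ~(1 << (t2 - 1))      # S plus t1
            s2 = mask & ~(1 << (t1 - 1))      # S plus t2
            f[mask] = min(f[s1], f[s2], pair[t1][t2])
    return f

# The matrix from the question, with A, B, C, D = 0, 1, 2, 3 (diagonal unused):
pair = [[0, 6, 8, 4],
        [6, 0, 6, 4],
        [8, 6, 0, 2],
        [4, 4, 2, 0]]
f = min_pair_values(pair, 4)
print(f[0b011])   # A.B.C   -> 6
print(f[0b111])   # A.B.C.D -> 2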
Well, it looks simple to me, but perhaps I misunderstand the problem. I would do it like this:
let P be a pattern string in your notation X1.X2. ... .Xn, where Xi is a column in your matrix
first compute the array CS = [ (X1, X2), (X1, X3), ... (X1, Xn) ], which contains all combinations of X1 with every other element in the pattern; CS has n-1 elements, and you can easily build it in O(n)
now you must compute min(CS), i.e. find the minimum value of the matrix elements corresponding to the combinations in CS; again you can easily find the minimum in O(n)
done.
Note: since your matrix is symmetric, given P you just need to compute CS by combining the first element of P with all other elements: (X1, Xi) is equal to (Xi, X1)
If your matrix is very large, and you want to do some optimization, you may consider prefixes of P: let me explain with an example
when you have solved the problem for P = X1.X2.X3, store the result in an associative map, where X1.X2.X3 is the key
later on, when you solve a problem P' = X1.X2.X3.X7.X9.X10.X11 you search for the longest prefix of P' in your map: you can do this by starting with P' and removing one component (Xi) at a time from the end until you find a match in your map or you end up with an empty string
if you find a prefix of P' in your map, then you already know the solution for that problem, so you just have to find the solution for the problem resulting from combining the first element of the prefix with the suffix, and then compare the two results: in our example the prefix is X1.X2.X3, and so you just have to solve the problem for
X1.X7.X9.X10.X11, and then compare the two values and choose the min (don't forget to update your map with the new pattern P')
if you don't find any prefix, then you must solve the entire problem for P' (and again don't forget to update the map with the result, so that you can reuse it in the future)
This technique is essentially a form of memoization.
