I have several sets of pairs like:
a: ([1, 2], [4, 5])
b: ([1, 3])
c: ([4, 7], [1, 8])
d: ([9, 7], [1, 5])
...
No two pairs are identical, and no pair contains the same element twice. Each set can contain many pairs, and the total number of distinct elements is smallish (around 200).
From each set I take one pair. I want to choose the pairs in such a way that the total number of distinct elements used is as small as possible.
The problem is too large to try every combination; is there an algorithm or heuristic that might help me find the optimal selection (or a close approximation)?
The problem has a definite NP-complete feel about it. So here are two greedy approaches that may produce reasonable approximate answers. To figure out which is better, you should implement both and compare.
The first is bottom-up. Give each set a value of 2 if it has a pair fully selected from it, and (n+1)/n if it has n pairs partially selected from it. In each round, give each candidate element a score equal to the sum, over all sets, of the amount by which adding it would increase that set's value. Select the element with the highest score, then update the value of every set and the score of every remaining element, and continue.
This will pick elements that look like they are making progress towards covering all sets.
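Here is a minimal Python sketch of that bottom-up heuristic, assuming an untouched set contributes a baseline value of 1 (the value for n = 0 partially selected pairs isn't specified above) and stopping once every set has at least one fully selected pair; the function names are made up for illustration:

def set_value(pairs, selected):
    # value of one set under the scoring described above
    if any(a in selected and b in selected for a, b in pairs):
        return 2.0                                   # some pair is fully selected
    n = sum((a in selected) != (b in selected) for a, b in pairs)
    return (n + 1) / n if n else 1.0                 # assumed baseline of 1 for n == 0

def bottom_up(sets):
    # sets: list of lists of 2-tuples; returns the chosen elements
    elements = {e for pairs in sets for pair in pairs for e in pair}
    selected = set()

    def covered():
        return all(any(a in selected and b in selected for a, b in pairs)
                   for pairs in sets)

    while not covered():
        def gain(e):
            trial = selected | {e}
            return sum(set_value(pairs, trial) - set_value(pairs, selected)
                       for pairs in sets)
        selected.add(max(elements - selected, key=gain))
    return selected

print(bottom_up([[(1, 2), (4, 5)], [(1, 3)], [(4, 7), (1, 8)], [(9, 7), (1, 5)]]))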
The second is top-down. Start with all elements selected, and give each set a value of 1/n, where n is the number of its pairs that are fully selected. Elements that are required by every pair in some set go into the final answer. Of the remaining elements, find the one whose removal increases the total value the least, and remove it.
The idea is that we start with too large a cover and repeatedly remove the element that seems least important for covering all the sets. What we are left with is hopefully close to minimal.
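And a corresponding sketch of the top-down heuristic, assuming an element may only be removed while every set still keeps at least one fully selected pair, and treating elements that occur in every remaining pair of some set as locked into the answer:

def total_value(sets, selected):
    # sum of 1/n over sets, where n = number of pairs with both elements selected
    total = 0.0
    for pairs in sets:
        n = sum(a in selected and b in selected for a, b in pairs)
        if n == 0:
            return float('inf')          # a set would lose all its pairs: forbidden
        total += 1.0 / n
    return total

def top_down(sets):
    selected = {e for pairs in sets for pair in pairs for e in pair}
    while True:
        # elements occurring in every still-selected pair of some set are required
        locked = set()
        for pairs in sets:
            live = [p for p in pairs if p[0] in selected and p[1] in selected]
            common = set(live[0])
            for a, b in live[1:]:
                common &= {a, b}
            locked |= common
        base = total_value(sets, selected)
        best, best_incr = None, float('inf')
        for e in selected - locked:
            incr = total_value(sets, selected - {e}) - base
            if incr < best_incr:
                best, best_incr = e, incr
        if best is None:
            return selected              # nothing can be removed any more
        selected.discard(best)

print(top_down([[(1, 2), (4, 5)], [(1, 3)], [(4, 7), (1, 8)], [(9, 7), (1, 5)]]))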
I have a discontinuous list of numbers N (e.g. { 1, 2, 3, 6, 8, 10 }) and I need to progressively create random pairs of numbers from N and store them in a list in which the same pair can never appear twice.
For example, for a list of 3 different numbers there are 6 possible pairs (not counting pairs of a number with itself). For the list { 4, 8, 9 }, the possible pairs are:
(4,8) (4,9) (8,4) (8,9) (9,4) (9,8)
When the number list reaches a size of 30, for example, there are 870 possible pairs, and my current method gets less and less efficient the more pairs have already been generated.
For now, my strategy with a number list of size 30, for example, is:
N = { 3, 8, 10, 15, 16, ... } // size = 30
// Let's say I already have a list with 200 different pairs
my_pairs = { (8,16), (23, 32), (16,10), ... }
// Get two random numbers in the list
rn1 = random(N)
rn2 = random(N)
Loop through my_pairs to see if the pair (rn1,rn2) has already been generated
If there is one, we pick two new numbers rn1 & rn2 at random and retry adding them to my_pairs
If not then we add it to the list
The issue is that the more pairs there are in my_pairs, the less likely it is that a new random pair is not already in the list. So we have to try multiple random pairs and go through the whole list each time.
I could try to generate all possible pairs at the start, shuffle the list and pop one element each time I need to add a random pair to my list.
But storing all the possible pairs takes a lot of space as my number list grows (9,900 possible pairs for 100 different numbers). And I add numbers to N during the process, so I can't afford to recalculate all possible pairs every time.
Is there an algorithm for generating random unique pairs?
Maybe it would be faster using matrices, or storing my pairs in some sort of tree structure?
It depends a lot on what you want to optimize for.
If you want to keep things simple and easy to maintain, a hash set of all generated pairs sounds reasonable. The assumption here is that both checking membership and adding a new element are O(1) on average.
If you worry about space requirements, because you regularly use up to 70% of the possible pairs, then you could optimize for space. To do that, I'd first establish a mapping between each possible pair and a single integer. I'd do so in a way that allows for easy addition of more numbers to N.
      +0     +1
 0   (0,1)  (1,0)
 2   (0,2)  (2,0)
 4   (1,2)  (2,1)
 6   (0,3)  (3,0)
 8   (1,3)  (3,1)
10   (2,3)  (3,2)
Something like this would map a single integer i to a pair (a,b) of indices into your sequence N, which you could then look up in N to turn them into an actual pair of elements. You can come up with formulas for this mapping, although the conversion from i to (a,b) will entail a square root somewhere.
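As a sketch, here is one possible formula pair consistent with the table above: ordered pairs of distinct indices are grouped by their larger index b, so the pairs whose larger index is b occupy the id range [b(b-1), b(b+1)), and the reverse conversion uses an integer square root (the function names are just for illustration):

from math import isqrt

def pair_to_id(a, b):
    # map an ordered pair of distinct indices (a, b) to a single integer id
    big, small = max(a, b), min(a, b)
    flipped = a > b                       # 0 for (small, big), 1 for (big, small)
    return big * (big - 1) + 2 * small + flipped

def id_to_pair(i):
    # inverse mapping: recover (a, b) from the integer id
    big = (1 + isqrt(1 + 4 * i)) // 2     # big*(big-1) <= i < big*(big+1)
    rest = i - big * (big - 1)
    small, flipped = divmod(rest, 2)
    return (big, small) if flipped else (small, big)

# reproduces the table: 0 -> (0,1), 1 -> (1,0), 2 -> (0,2), ..., 11 -> (3,2)
assert all(pair_to_id(*id_to_pair(i)) == i for i in range(1000))

Since ids for a new, larger element of N are simply appended at the end of the range, adding numbers to N later does not disturb the existing ids.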
When you have this, the task of picking a pair from a set of arbitrary numbers becomes the task of picking an integer from a contiguous range of integers. Now you could use a bitmap to very efficiently store, for each index, whether you have already picked that one in the past. For low percentages of picked pairs that bitmap may be more memory-consuming than a hash map of only the picked values would be, but as you approach 70% of all pairs getting picked, it will be way more efficient. I would expect a typical hash map entry to consume at least 3×64 = 192 bits of storage, so the bitmap will start saving memory once 1/192 ≈ 0.52% of all values are getting picked. Growing the bitmap might still be expensive, so estimating the maximal size of N might help you allocate enough memory up front.
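For the bitmap itself, a plain sketch using one bit per pair id (assuming ids are assigned as in the mapping above, so growing N only appends new ids at the end):

class PickedBitmap:
    # one bit per pair id; grows automatically when N grows
    def __init__(self, capacity):
        self.bits = bytearray((capacity + 7) // 8)

    def _grow(self, i):
        needed = i // 8 + 1
        if needed > len(self.bits):
            self.bits.extend(bytes(needed - len(self.bits)))

    def is_picked(self, i):
        return i < len(self.bits) * 8 and (self.bits[i // 8] >> (i % 8)) & 1

    def mark(self, i):
        self._grow(i)
        self.bits[i // 8] |= 1 << (i % 8)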
If you have a costly random number generator, or worry about the worst-case time complexity of the whole thing, then you might want to avoid multiple attempts that might result in already-picked pairs. To achieve that, you would probably store the set of all picked pairs in some kind of search tree where each node also keeps track of how many leaves its subtree contains. That way you could generate a random number in a range that corresponds to the number of pairs that haven't been picked yet, and then use the information in that tree to add to the chosen value the number of already-picked indices smaller than it. I haven't worked out all the details, but I believe this can be turned into O(log n) worst-case time complexity, as opposed to the O(1) average case but O(n) or even O(∞) worst case we had before.
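One way to realize that idea is a Fenwick tree over the id range instead of an explicit search tree: it stores how many ids in each prefix are already picked, so the r-th still-unpicked id can be found by binary search over prefix counts. A rough sketch (class and method names are made up), costing O(log² m) per pick:

import random

class UnpickedSampler:
    # pick uniformly random, never-repeating ids from range(m)
    def __init__(self, m):
        self.m = m
        self.picked = 0
        self.tree = [0] * (m + 1)         # Fenwick tree of picked counts

    def _prefix(self, i):                 # number of picked ids < i
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def _add(self, i):                    # mark id i as picked
        i += 1
        while i <= self.m:
            self.tree[i] += 1
            i += i & -i

    def sample(self):
        if self.picked == self.m:
            raise ValueError("all ids picked")
        r = random.randrange(self.m - self.picked)   # want the r-th unpicked id
        lo, hi = 0, self.m - 1
        while lo < hi:                    # smallest id with r+1 unpicked ids up to it
            mid = (lo + hi) // 2
            unpicked_upto = (mid + 1) - self._prefix(mid + 1)
            if unpicked_upto >= r + 1:
                hi = mid
            else:
                lo = mid + 1
        self._add(lo)
        self.picked += 1
        return lo

s = UnpickedSampler(12)                   # 12 possible pairs for 4 numbers
print([s.sample() for _ in range(12)])    # a random permutation of 0..11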
Let's say I have an array of ~20-100 integers, for example [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (actually numbers more like [106511349 , 173316561, ...], all nonnegative 64-bit integers under 2^63, but for demonstration purposes let's use these).
And many (~50,000) smaller arrays of usually 1-20 terms to match or not match:
1=[2, 3, 8, 20]
2=[2, 3, NOT 8]
3=[2, 8, NOT 16]
4=[2, 8, NOT 16] (there will be duplicates with different list IDs)
I need to find which of these are subsets of the array being tested. A matching list must have all of the positive matches, and none of the negative ones. So for this small example, I would need to get back something like [3, 4]. List 1 fails to match because it requires 20, and list 2 fails to match because it has NOT 8. The NOT can easily be represented by using the high bit/making the number negative in those cases.
I need to do this quickly, up to 10,000 times per second. The small arrays are "fixed" (they change infrequently, like once every few seconds), while the large array is different per data item to be scanned (so 10,000 different large arrays per second).
This has become a bit of a bottleneck, so I'm looking into ways to optimize it.
I'm not sure of the best data structures or ways to represent this. One solution would be to turn it around and see which small lists we even need to consider:
2=[1, 2, 3, 4]
3=[1, 2]
8=[1, 2, 3, 4]
16=[3, 4]
20=[1]
Then we'd build up a list of lists to check, and do the full subset matching on these. However, certain terms (often the more frequent ones) are going to end up in many of the lists, so there's not much of an actual win here.
I was wondering if anyone is aware of a better algorithm for solving this sort of problem?
You could try to make a tree from the smaller arrays, since they change less frequently, such that each subtree tries to halve the number of small arrays left.
For example, do frequency analysis on the numbers in the smaller arrays. Find which number is found in closest to half of the smaller arrays. Make that the first check in the tree. In your example that would be '3', since it occurs in half the small arrays. Now that's the head node in the tree. Put all the small lists that contain 3 into the left subtree and all the other lists into the right subtree. Repeat this process recursively on each subtree. Then when a large array comes in, reverse-index it and traverse the tree to get the candidate lists.
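A rough sketch of that tree, ignoring the NOT terms for brevity and using made-up names; note that when the tested array does contain the pivot number, both branches still have to be visited, so the split mainly pays off when numbers are absent, and the returned candidates still need the full positive/negative check:

from collections import Counter

def build_tree(filters, leaf_size=4):
    # filters: dict id -> set of required numbers (NOT terms ignored in this sketch)
    if len(filters) <= leaf_size:
        return list(filters.items())
    freq = Counter(n for req in filters.values() for n in req)
    if not freq:
        return list(filters.items())
    # pick the number whose presence splits the filters most evenly
    pivot = min(freq, key=lambda n: abs(freq[n] - len(filters) / 2))
    with_pivot = {i: r for i, r in filters.items() if pivot in r}
    without = {i: r for i, r in filters.items() if pivot not in r}
    if not with_pivot or not without:
        return list(filters.items())      # pivot does not split anything: stop here
    return (pivot, build_tree(with_pivot, leaf_size), build_tree(without, leaf_size))

def candidates(node, values):
    # values: the large array as a set; returns (id, required) pairs to check fully
    if isinstance(node, list):
        return node
    pivot, if_present, if_absent = node
    out = candidates(if_absent, values)   # filters not requiring the pivot always remain
    if pivot in values:                   # filters requiring it only if the pivot is there
        out = out + candidates(if_present, values)
    return out

# '3' ends up as the head node, as described above
tree = build_tree({1: {2, 3, 8, 20}, 2: {2, 3}, 3: {2, 8}, 4: {2, 8}}, leaf_size=1)
# all four filters survive as candidates here; the full check (with NOT terms) gives [3, 4]
print(candidates(tree, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}))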
You did not state which of your arrays are sorted, if any.
Since your data is not that big, I would use a hash-map to store the entries of the source set (the one with ~20-100 integers). That would basically let you test whether an integer is present in O(1).
Then, given that 50,000 (arrays) * 20 (terms each) * 8 (bytes per term) = 8 megabytes plus hash-map overhead does not seem large for most systems either, I would use another hash-map to store the tested arrays. This way you don't have to re-test duplicates.
I realize this may be less satisfying from a CS point of view, but if you're doing a huge number of tiny tasks that don't affect each other, you might want to consider parallelizing them (multithreading). 10,000 tasks per second, comparing a different array in each task, should fit the bill; you don't give any details about what else you're doing (e.g., where all these arrays are coming from), but it's conceivable that multithreading could improve your throughput by a large factor.
First, do what you were suggesting: make a hashmap from input integer to the IDs of the filter arrays it appears in. That lets you say "input #27 is in these 400 filters" and toss those 400 into a sorted set. You then have to do an intersection of the sorted sets for each one.
Optional: make a second hashmap from each input integer to its frequency in the set of filters. When an input comes in, sort its values using the second hashmap. Then start with the least common input integer, so you have less overall work to do at each step. Also compute the frequencies for the "not" cases, so you get the most bang for your buck at each step.
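A sketch of the indexing step, counting hits per filter instead of materializing and intersecting sorted sets (the data literals below just mirror the example filters from the question, and all names are made up):

from collections import defaultdict

# filters: id -> (set of required numbers, set of forbidden numbers)
filters = {
    1: ({2, 3, 8, 20}, set()),
    2: ({2, 3}, {8}),
    3: ({2, 8}, {16}),
    4: ({2, 8}, {16}),
}

# built once, whenever the filters change
required_index = defaultdict(list)        # number -> ids of filters requiring it
forbidden_index = defaultdict(list)       # number -> ids of filters forbidding it
for fid, (req, forb) in filters.items():
    for n in req:
        required_index[n].append(fid)
    for n in forb:
        forbidden_index[n].append(fid)

def matches(values):
    # values: the large array as a set; returns ids of matching filters
    hits = defaultdict(int)
    blocked = set()
    for v in values:
        for fid in required_index.get(v, ()):
            hits[fid] += 1
        for fid in forbidden_index.get(v, ()):
            blocked.add(fid)
    return [fid for fid, (req, _) in filters.items()
            if hits[fid] == len(req) and fid not in blocked]

print(matches({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}))   # -> [3, 4]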
Finally: this could be pretty easily made into a parallel programming problem; if it's not fast enough on one machine, it seems you could put more machines on it pretty easily, if whatever it's returning is useful enough.
I have a list of numbers and I have a sum value. For instance,
list = [1, 2, 3, 5, 7, 11, 10, 23, 24, 54, 79 ]
sum = 20
I would like to generate a sequence of numbers taken from that list, such that the sequence sums up to that target. In order to help achieve this, the sequence can be of any length and repetition is allowed.
result = [2, 3, 5, 10], or result = [1, 1, 2, 3, 3, 5, 5], or result = [10, 10]
I've been doing a lot of research into this problem and have found the subset sum problem to be of interest. My problem is, in a few ways, similar to the subset sum problem in that I would like to find a subset of numbers that produces the targeted sum.
However, unlike the subset sum problem which finds all sets of numbers that sum up to the target (and so runs in exponential time if brute forcing), I only want to find one set of numbers. I want to find the first set that gives me the sum. So, in a certain sense, speed is a factor.
Additionally, I would like there to be some degree of randomness (or pseudo-randomness) to the algorithm. That is, should I run the algorithm using the same list and sum multiple times, I should get a different set of numbers each time.
What would be the best algorithm to achieve this?
Additional Notes:
What I've done so far is use a naive method where I cycle through the list, adding each value to every combination of values built so far. This obviously takes a long time and I'm currently not feeling too happy about it. I'm hoping there is a better way to do this!
If there is no sequence that gives me the exact sum, I'm satisfied with a sequence that gives me a sum that is as close as possible to the targeted sum.
As others have said, this is an NP-hard problem.
However, this doesn't mean small improvements aren't possible:
Is 1 in the list? [1,1,1,1...] is the solution. O(1) in a sorted list
Remove list elements bigger than the target sum. O(n)
Is there any list element x with (sum % x) == 0? Again, easy solution: repeat x (sum / x) times. O(n)
Are there any list elements x, y with (x % y) == 0? Remove x, since any use of x can be replaced by copies of y. O(n^2)
(maybe even: Are there any list elements x, y, z with (x % y) == z or (x + y) == z? Remove x in the first case and z in the second. O(n^3))
Before using the full recursion, check whether you can get the sum using just the smallest even and the smallest odd number.
...
The Subset Sum problem isn't about finding all subsets, but rather about determining whether some subset exists. It is a decision problem; all problems in NP are stated like this. And even this simpler problem is NP-complete.
This means that if you want an exact answer (the subset must sum exactly to some value), you won't be able to do much better than any general subset sum algorithm (which is exponential in the worst case unless P=NP).
I would attempt to reduce the problem to a brute-force search of a smaller set.
Sort the list smallest to largest.
Keep a running sum and a result list.
Repeat {
Draw randomly from the subset of list elements less than (target - sum).
Increment sum by the drawn value and append the drawn value to the result list.
} until no list element is <= (target - sum) or sum == target
If sum != target, brute-force search for small combinations from the list that match the remaining difference, possibly after removing small combinations from the result list.
This approach may fail to find valid solutions, even if they exist. It can, however, quickly find a solution or quickly fail before having to resort to a slower brute force approach using the entire set at a greater depth.
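A hedged Python sketch of this idea; the max_drop cut-off and the small_combo helper are made-up details, and "putting back a few of the drawn values" stands in for the "small combinations of result" mentioned above:

import random
from itertools import combinations_with_replacement

def small_combo(nums, target, max_len=4):
    # brute-force a short multiset from nums summing exactly to target, or None
    for length in range(1, max_len + 1):
        for combo in combinations_with_replacement(nums, length):
            if sum(combo) == target:
                return list(combo)
    return None

def random_sum(nums, target, max_drop=2):
    nums = sorted(set(nums))
    result, remaining = [], target
    while True:
        choices = [x for x in nums if x <= remaining]
        if not choices:
            break
        pick = random.choice(choices)
        result.append(pick)
        remaining -= pick
    if remaining == 0:
        return result
    # fix-up: put back a few drawn values and brute-force the enlarged deficit
    for drop in range(min(max_drop, len(result)) + 1):
        deficit = remaining + sum(result[len(result) - drop:])
        fix = small_combo(nums, deficit)
        if fix is not None:
            return result[:len(result) - drop] + fix
    return None                           # no exact solution found this run

print(random_sum([1, 2, 3, 5, 7, 11, 10, 23, 24, 54, 79], 20))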
This is a greedy approach to the problem:
Without 'randomness':
Obtain the single largest number in the set that is smaller than your desired sum; we'll name it X. Assuming the set is sorted, this is O(1) at best and O(N) at worst (e.g. if the sum is 2).
Since you can repeat the value, say c times, do so until you get as close to the sum as possible, but be careful! Essentially you are now solving another, smaller sum: you need numbers that add up to R = (sum - X * c). So find the largest number smaller than R, and check whether R - (the number you just found) == 0, or whether [R - (the number you just found)] % (any of the smaller numbers) == 0.
If R > 0 remains, form partial sums of the smaller numbers less than R (this should not take more than 5-10 computations because of the nature of this algorithm) and see whether any of them satisfies it.
If that step makes R < 0, remove one X and start the process again.
With 'randomness':
Just get X randomly! :-)
Note: This would work best if you have a few single digit numbers.
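Stripped of the refinements, the deterministic version boils down to a largest-first greedy with repetition; a tiny sketch (replace the sorted order by a random choice of X for the 'randomness' variant):

def greedy_largest_first(nums, target):
    # largest-first greedy with repetition; may only approximate the target
    result = []
    for x in sorted(set(nums), reverse=True):
        while x <= target:
            result.append(x)
            target -= x
    return result                          # whatever is left of `target` is the shortfall

print(greedy_largest_first([1, 2, 3, 5, 7, 11, 10, 23, 24, 54, 79], 20))
# -> [11, 7, 2] for this list; with 1 present, the shortfall is always 0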
I have a set of frequency values and I'd like to find the most likely subsets of values meeting the following conditions:
values in each subset should be harmonically related (approximately multiples of a given value)
the number of subsets should be as small as possible
every subset should have only a small number of missing harmonics below its highest value
E.g. [1, 2, 3, 4, 10, 20, 30] should return [1, 2, 3, 4] and [10, 20, 30] (a single set with all the values is not optimal because, even though they are all harmonically related, there would be many missing harmonics)
The brute-force method would be to compute all the possible subsets of values and compute some cost value for each, but that would take far too long.
Is there any efficient algorithm to perform this task (or something similar)?
I would reduce the problem to minimum set cover, which, although NP-hard, often is efficiently solvable in practice via integer programming. I'm assuming that it would be reasonable to decompose [1, 2, 3, 4, 8, 12, 16] as [1, 2, 3, 4] and [4, 8, 12, 16], with 4 repeating.
To solve set cover (well, to use stock integer-program solvers, anyway), we need to enumerate all of the maximal allowed subsets. If the fundamental (i.e., the given value) must belong to the set, then, for each frequency, we can enumerate its multiples in order until too many in a row are missing. If not, we try all pairs of frequencies, assume that their fundamental is their approximate greatest common divisor, and extend the subset downward and upward until too many frequencies are missing.
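A rough sketch of the enumeration step plus a plain greedy cover standing in for the integer program (max_gap and tol are made-up parameters controlling how many consecutive harmonics may be missing and how approximate a multiple may be):

def candidate_subsets(freqs, max_gap=2, tol=0.05):
    # for each value taken as a fundamental, collect its approximate multiples
    # until more than max_gap consecutive harmonics are missing
    values = sorted(freqs)
    subsets = []
    for f in values:
        members, missing, k = [], 0, 1
        while missing <= max_gap and k * f <= values[-1] * (1 + tol):
            hits = [v for v in values if abs(v - k * f) <= tol * k * f]
            if hits:
                members.extend(hits)
                missing = 0
            else:
                missing += 1
            k += 1
        if len(members) > 1:
            subsets.append(frozenset(members))
    return set(subsets)

def greedy_cover(universe, subsets):
    # greedy set cover: repeatedly take the subset covering the most uncovered values
    uncovered, chosen = set(universe), []
    while uncovered and subsets:
        best = max(subsets, key=lambda s: len(s & uncovered))
        if not best & uncovered:
            break                          # remaining values occur in no subset
        chosen.append(best)
        uncovered -= best
    return chosen + [frozenset([v]) for v in uncovered]

freqs = [1, 2, 3, 4, 10, 20, 30]
print(greedy_cover(freqs, candidate_subsets(freqs)))
# -> roughly [{1, 2, 3, 4}, {10, 20, 30}] for the example above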
Related questions:
Algorithm to Divide a list of numbers into 2 equal sum lists
divide list in two parts that their sum closest to each other
Let's assume I have a list that contains exactly 2k elements. I want to split it into two parts, each of length k, while making the sums of the parts as equal as possible.
Quick example:
[3, 4, 4, 1, 2, 1] might be split into [1, 4, 3] and [1, 2, 4], and the sum difference will be 1
Now, if the parts can have arbitrary lengths, this is a variation of the Partition problem, and we know that it's weakly NP-complete.
But does the restriction of splitting the list into equal-length parts (say, always k out of 2k) make the problem solvable in polynomial time? Any proof of that (or a proof sketch showing that it's still NP-complete)?
It is still NP-complete. Proof by reduction from PP (the unrestricted Partition problem) to QPP (the equal-parts partition problem):
Take an arbitrary PP instance: a list of length k. Add k additional elements, all equal to zero, to obtain a QPP instance of length 2k.
We want the best-performing partition in terms of PP. Find one using an algorithm for QPP and then drop the k added zero elements. Moving zeroes between the two parts changes neither this partition's sums nor those of any competing partition, so the result is still one of the best-performing unrestricted partitions of the original list of length k.