Finding all subsets of specified size - algorithm

I've been scratching my head about this for two days now and I cannot come up with a solution. What I'm looking for is a function f(s, n) such that it returns a set containing all subsets of s where the length of each subset is n.
Demo:
s={a, b, c, d}
f(s, 4)
{{a, b, c, d}}
f(s, 3)
{{a, b, c}, {a, b, d}, {a, c, d}, {b, c, d}}
f(s, 2)
{{a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}}
f(s, 1)
{{a}, {b}, {c}, {d}}
I have a feeling that recursion is the way to go here. I've been fiddling with something like
f(S, n):
for s in S:
t = f( S-{s}, n-1 )
...
But this does not seem to do the trick. I did notice that len(f(s,n)) seems to be the binomial coefficient bin(len(s), n). I guess this could be utilized somehow.
Can you help me please?

Let us call n the size of the array and k the number of elements to be put in a subset.
Let us consider the first element A[0] of the array A.
If this element is put in the subset, the problem becomes a (n-1, k-1) similar problem.
If not, it becomes a (n-1, k) problem.
This can be simply implemented in a recursive function.
We just have to pay attention to deal with the extreme cases k == 0 or k > n.
During the process, we also have to keep trace of:
n: the number of remaining elements of A to consider
k: the number of elements that remain to be put in the current subset
index: the index of the next element of A to consider
The current_subset array that memorizes the elements already selected.
Here is some simple C++ code to illustrate the algorithm.
Output
For 5 elements and subsets of size 3:
3 4 5
2 4 5
2 3 5
2 3 4
1 4 5
1 3 5
1 3 4
1 2 5
1 2 4
1 2 3
#include <iostream>
#include <vector>

// Print one subset per line
void print (const std::vector<std::vector<int>>& subsets) {
    for (auto &v: subsets) {
        for (auto &x: v) {
            std::cout << x << " ";
        }
        std::cout << "\n";
    }
}

// n: number of remaining elements of A to consider
// k: number of elements that remain to be put in the current subset
// index: index of next element of A to consider
void Get_subset_rec (std::vector<std::vector<int>>& subsets, int n, int k, int index, std::vector<int>& A, std::vector<int>& current_subset) {
    if (n < k) return;      // not enough elements left to complete the subset
    if (k == 0) {           // subset complete: store a copy
        subsets.push_back (current_subset);
        return;
    }
    // Branch 1: A[index] is not in the subset
    Get_subset_rec (subsets, n-1, k, index+1, A, current_subset);
    // Branch 2: A[index] is in the subset
    current_subset.push_back(A[index]);
    Get_subset_rec (subsets, n-1, k-1, index+1, A, current_subset);
    current_subset.pop_back();  // remove last element
}

void Get_subset (std::vector<std::vector<int>>& subsets, int subset_length, std::vector<int>& A) {
    std::vector<int> current_subset;
    Get_subset_rec (subsets, A.size(), subset_length, 0, A, current_subset);
}

int main () {
    int subset_length = 3;    // subset size
    std::vector<int> A = {1, 2, 3, 4, 5};
    std::vector<std::vector<int>> subsets;
    Get_subset (subsets, subset_length, A);
    std::cout << subsets.size() << "\n";
    print (subsets);
}

One way to solve this is by backtracking. Here's a possible algorithm in pseudo code:
def backtrack(input_set, idx, partial_res, res, n):
    if len(partial_res) == n:
        res.append(partial_res[:])
        return
    if idx == len(input_set):
        return
    partial_res.append(input_set[idx])
    backtrack(input_set, idx+1, partial_res, res, n)  # path with input_set[idx]
    partial_res.pop()
    backtrack(input_set, idx+1, partial_res, res, n)  # path without input_set[idx]
Time complexity of this approach is O(2^len(input_set)) since we make 2 branches at each element of input_set, regardless of whether the path leads to a valid result or not. The space complexity is O(len(input_set) choose n) since this is the number of valid subsets you get, as you correctly pointed out in your question.
Now, there is a way to optimize the above algorithm to reduce the time complexity to O(len(input_set) choose n) by pruning the recursive tree to paths that can lead to valid results only.
If n - len(partial_res) > len(input_set) - idx, we are sure that even if we took every remaining element in input_set[idx:] we would still be at least one short of n. So we can use this as a base case, return, and prune.
Also, if n - len(partial_res) == len(input_set) - idx, we need each and every element in input_set[idx:] to reach the required length n. Thus, we can't skip any elements, and the second branch of our recursive call becomes redundant:
backtrack(input_set, idx+1, partial_res, res, n) # path without the current element
We can skip this branch with a conditional check.
Implementing these base cases correctly reduces the time complexity of the algorithm to O(len(input_set) choose n), which is a hard limit because that's the number of subsets that there are.
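A minimal sketch of the pruned version, assuming the include/skip recursion on idx described above. The names subsets_of_size, needed and remaining are made up for this illustration and are not from the original answer.

```python
def subsets_of_size(items, n):
    """Return all size-n subsets of items (as lists), pruning dead branches."""
    res = []
    partial = []

    def backtrack(idx):
        needed = n - len(partial)        # elements still missing
        remaining = len(items) - idx     # elements still available
        if needed == 0:
            res.append(partial[:])       # store a copy of the finished subset
            return
        if needed > remaining:
            return                       # prune: cannot reach size n any more
        partial.append(items[idx])
        backtrack(idx + 1)               # path with items[idx]
        partial.pop()
        if needed < remaining:           # skipping is only useful if we have slack
            backtrack(idx + 1)           # path without items[idx]

    backtrack(0)
    return res


print(subsets_of_size(["a", "b", "c", "d"], 2))
```

With these two checks, every recursive call that is actually made lies on a path to at least one valid subset.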

subseqs 0 _ = [[]]
subseqs k [] = []
subseqs k (x:xs) = map (x:) (subseqs (k-1) xs) ++ subseqs k xs
The function looks for subsequences of (non-negative) length k in a given sequence. There are three cases:
If the length is 0: there is a single empty subsequence in any sequence.
Otherwise, if the sequence is empty: there are no subsequences of any (positive) length k.
Otherwise, there is a non-empty sequence that starts with x and continues with xs, and a positive length k. All our subsequences are of two kinds: those that contain x (they are subsequences of xs of length k-1, with x stuck at the front of each one), and those that do not contain x (they are just subsequences of xs of length k).
The algorithm is a more or less literal translation of these notes to Haskell. Notation cheat sheet:
[] an empty list
[w] a list with a single element w
x:xs a list with a head of x and a tail of xs
(x:) a function that sticks an x in front of any list
++ list concatenation
f a b c a function f applied to arguments a b and c
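For readers less familiar with Haskell, the three cases can be transliterated into Python roughly as follows. This is only a sketch: it copies list tails with slicing, so it is less efficient than the Haskell original, and it assumes k is non-negative.

```python
def subseqs(k, seq):
    """All length-k subsequences of seq, mirroring the three Haskell cases."""
    if k == 0:
        return [[]]            # exactly one empty subsequence
    if not seq:
        return []              # nothing of positive length in an empty sequence
    x, xs = seq[0], seq[1:]
    with_x = [[x] + rest for rest in subseqs(k - 1, xs)]   # map (x:) ...
    without_x = subseqs(k, xs)
    return with_x + without_x


print(subseqs(2, ["a", "b", "c"]))
```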

Here is a non-recursive python function that takes a list superset and returns a generator that produces all subsets of size k.
def subsets_k(superset, k):
    if k > len(superset):
        return
    if k == 0:
        yield []
        return
    indices = list(range(k))
    while True:
        yield [superset[i] for i in indices]
        i = k - 1
        while indices[i] == len(superset) - k + i:
            i -= 1
            if i == -1:
                return
        indices[i] += 1
        for j in range(i + 1, k):
            indices[j] = indices[i] + j - i
Testing it:
for s in subsets_k(['a', 'b', 'c', 'd', 'e'], 3):
    print(s)
Output:
['a', 'b', 'c']
['a', 'b', 'd']
['a', 'b', 'e']
['a', 'c', 'd']
['a', 'c', 'e']
['a', 'd', 'e']
['b', 'c', 'd']
['b', 'c', 'e']
['b', 'd', 'e']
['c', 'd', 'e']
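For comparison, Python's standard library already provides this enumeration as itertools.combinations, which yields tuples in lexicographic order of positions. The thin wrapper below (subsets_k_stdlib is a made-up name) converts them to lists so the results look like the output above.

```python
from itertools import combinations


def subsets_k_stdlib(superset, k):
    """Size-k subsets of superset, using the standard library."""
    return [list(c) for c in combinations(superset, k)]


for s in subsets_k_stdlib(['a', 'b', 'c', 'd', 'e'], 3):
    print(s)
```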

Related

Dynamic programming problem: maximize the sum of the value of all positions

At a recent phone interview I was asked the following dynamic programming problem but couldn't come up with an algorithm for it:
Suppose there is a path with n positions. Consider the set S = {A,B,C}. Each position on the path has an associated non-empty subset of S. For each position on the path, we can choose one element from its associated subset. For a given position i on the path, its “value” is determined by the total number of distinct elements from the positions it has access to. The positions it has access to are given by the set {i-1, i, i+1} (for i=1 it is just {1, 2} and for i=n it is just {n-1, n}). We want to maximize the sum of the “value” of all positions.
So for example, if I had n=5 and the following subsets for each position 1…5:
S1 = {A,C}, S2={A, B}, S3={A,B,C}, S4={A,C}, S5={A,B,C}
Then one such possible arrangement to maximize the sum would be [A, B, C, A, B], which would be 2 + 3 + 3 + 3 + 2 = 13.
I'd like to write an algorithm that, given a list of subsets (where the nth subset corresponds to the nth position), returns the maximum sum of the value of all positions. It should be bounded by a polynomial function of n.
Example Input: [['A', 'C'], ['A', 'B'], ['A', 'B', 'C'], ['A', 'C'], ['A', 'B', 'C']]
Expected Output: 13
Given that my phone interview is already over with and that I've still been unable to solve this problem after giving it more thought, I'd rather just see a working solution at this point. Thanks!
The key to solving the problem is to realize that, given an arrangement A with a certain score, the new score of A after appending an element z depends only on the final two elements of A.
Given an array ending with the (not necessarily distinct) elements x and y, the increase in score after appending an element z is:
1 // from z on itself
+ 1 if (z != y and z != x) else 0 // from y gaining z as neighbor
+ 1 if (z != y) else 0 // from z gaining y as neighbor
For your example, there are 4 possible arrangements with the first two positions:
Subsets:
S1 = {A, C}, S2 = {A, B},
S3 = {A, B, C}, S4 = {A, C}, S5 = {A, B, C}
After placing the first two elements:
[A, A] max score = 2
[A, B] max score = 4
[C, A] max score = 4
[C, B] max score = 4
after appending a third element (from S3), all possible 'last two' elements and the maximum score of any arrangement with those 'last two' elements:
After S3 = {A, B, C}
[A, A] max score = 5
[A, B] max score = 7
[A, C] max score = 6
[B, A] max score = 7
[B, B] max score = 5
[B, C] max score = 7
Here, for instance, the unique maximal score arrangement ending in A, A is [C, A, A], although we only care about the last two values and the score.
After all five subsets, the feasible 'last two elements' of arrangements and the maximum score of any corresponding arrangement will be:
[A, A] max score = 11
[A, B] max score = 13
[A, C] max score = 12
[C, A] max score = 13
[C, B] max score = 13
[C, C] max score = 11
so we can return 13. With extra bookkeeping throughout the algorithm, we can also reconstruct the full arrangement.
Here's the three-variable Dynamic Programming (DP) formula:
DP(index, y, z) is defined as the
maximum score for an arrangement on PathArray[0, 1, ..., index]
with final two elements [y, z], for any y in Subsets[index-1]
and z in Subsets[index]
DP(index, y, z) = max over all x in Subsets[index-2] of
(DP(index-1, x, y) + AddedScore(x, y, z))
The answer to your question is the maximum value of DP(n-1, a, b) for any valid a and b.
I've excluded the base case when the path has length 2: you should directly solve for the score of the one and two element cases.
With one element: the score is always 1.
With two elements: the score is 4 if the elements are not equal, otherwise, the score is 2.
A Python implementation:
import collections
import itertools

def dp_solve(subsets):
    if len(subsets) == 1:
        return 1

    # Score gained by appending child after (grandparent, parent)
    def added_value(grandparent, parent, child) -> int:
        return (1
                + (1 if child != grandparent and child != parent else 0)
                + (1 if child != parent else 0))

    # Maps the last two elements of an arrangement to its best score so far
    pair_dict = collections.Counter()
    for x, y in itertools.product(subsets[0], subsets[1]):
        pair_dict[x, y] = 4 if x != y else 2
    for subset in subsets[2:]:
        new_dict = collections.Counter()
        for old_key, old_value in pair_dict.items():
            grandparent, parent = old_key
            for child in subset:
                new_value = old_value + added_value(grandparent,
                                                    parent,
                                                    child)
                new_dict[parent, child] = max(new_dict[parent, child],
                                              new_value)
        pair_dict = new_dict
    return max(pair_dict.values())

my_lists = [['A', 'C'], ['A', 'B'], ['A', 'B', 'C'], ['A', 'C'], ['A', 'B', 'C']]
print(dp_solve(my_lists))  # Prints 13
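One way to sanity-check any solution on small inputs is an exhaustive search that scores every possible arrangement directly from the problem statement. The sketch below is exponential in n, so it is only meant for testing; path_score and brute_force are made-up names.

```python
from itertools import product


def path_score(arrangement):
    """Sum over positions of the number of distinct values in {i-1, i, i+1}."""
    total = 0
    for i in range(len(arrangement)):
        window = arrangement[max(0, i - 1): i + 2]   # clipped at both ends
        total += len(set(window))
    return total


def brute_force(subsets):
    """Best score over every way of picking one element per position."""
    return max(path_score(choice) for choice in product(*subsets))


example = [['A', 'C'], ['A', 'B'], ['A', 'B', 'C'], ['A', 'C'], ['A', 'B', 'C']]
print(path_score(['A', 'B', 'C', 'A', 'B']))  # 13
print(brute_force(example))                   # 13
```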
I'm relatively certain this iterative version always produces the max, though I can't prove it.
Note: this assumes that no set contains duplicates; if one does, the approach needs a slight modification.
if only one position in path, select any value from its set
else:
    starting from first position on path
    While (not at last element of path) {
        if position set only has 1 value, select it
        else if position set has unique value not in neighbors' sets (or single neighbor for ends), select it
        else select value that's not the same as prior position, prioritizing a value that's in (next position + 1)'s set (assuming that position isn't out of bounds)
        outputArray[position] = value
        position++
    }
    // at last position
    if only 1 value, select it
    else select a value different from previous
    outputArray[position] = value
outputArray should now contain values from each set that maximize distinctness among neighbors

Prolog lists with lengths of constrained length [duplicate]

This question already has answers here:
Using a constrained variable with `length/2`
(4 answers)
Closed 5 years ago.
I'm using the clpfd library
?- use_module(library(clpfd)).
true.
Then I attempt to generate all 3 lists of length K with 1 <= K <= 3.
?- K in 1 .. 3, length(C, K).
K = 1,
C = [_1302] ;
K = 2,
C = [_1302, _1308] ;
K = 3,
C = [_1302, _1308, _1314] ;
ERROR: Out of global stack
I would expect the query to terminate after K = 3. For example, the following does terminate.
?- between(1, 3, K), length(X, K).
K = 1,
X = [_3618] ;
K = 2,
X = [_3618, _3624] ;
K = 3,
X = [_3618, _3624, _3630].
Why does one terminate and the other does not?
K in 1..3 simply asserts that K is somewhere between 1 and 3, without binding it to a particular value. What you need is the indomain(K) predicate, which backtracks over all values in K's domain:
K in 1..3, indomain(K), length(C, K).
Out of stack in your example happens for the following reason: length(C, K) without any of its arguments bound generates lists of different lengths, starting with 0, then 1, 2, 3, ...
Each time it generates a solution, it tries to bind a particular value to K, that is 0, 1, 2, ...
Now, because there are constraints applied to K, any attempts to bind a value greater than 3 will fail, meaning that length(C, K) will continue trying to find alternative solutions, that is, it will keep generating lists of length 4, 5, 6, ... and so on, all of which will be discarded. This process will continue until you exhaust your stack.

Partitions of n into k parts with restrictions

I need an algorithm that produces a partition of the number n into k parts with the added restrictions that each element of the partition must be between a and b. Ideally, all possible partitions satisfying the restrictions should be equally likely. Partitions are considered the same if they have the same elements in different order.
For example, with n=10, k=3, a=2, b=4 one has only {4,4,2} and {4,3,3} as possible outcomes.
Is there a standard algorithm for such a problem? One can assume that at least one partition satisfying the restrictions always exists.
You can implement this as a recursive algorithm. Basically, the recurrence is like this:
if k == 1 and a <= n <= b, then the only partition is [n], otherwise none
otherwise, combine all the elements x from a to b with all the partitions for n-x, k-1
to prevent duplicates, also substitute the lower bound a with x
Here's some Python (aka executable pseudo-code):
def partitions(n, k, a, b):
    if k == 1 and a <= n <= b:
        yield [n]
    elif n > 0 and k > 0:
        for x in range(a, b+1):
            for p in partitions(n-x, k-1, x, b):
                yield [x] + p

print(list(partitions(10, 3, 2, 4)))
# [[2, 4, 4], [3, 3, 4]]
This could be further improved by checking (k-1)*a and (k-1)*b for the lower and upper bounds for the remaining elements, respectively, and restricting the range for x accordingly:
min_x = max(a, n - (k-1) * b)
max_x = min(b, n - (k-1) * a)
for x in range(min_x, max_x+1):
For partitions(110, 12, 3, 12) with 3,157 solutions, this reduces the number of recursive calls from 638,679 down to 24,135.
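Put together, a version with the restricted range might look like the sketch below. It is an illustration rather than the answer's exact code: partitions_pruned is a made-up name, and it uses k == 0 as the base case (instead of k == 1) so that the bounds alone decide whether the last element fits.

```python
def partitions_pruned(n, k, a, b):
    """Yield non-decreasing lists of k parts in [a, b] that sum to n."""
    if k == 0:
        if n == 0:
            yield []
        return
    # The other k-1 parts sum to between (k-1)*a and (k-1)*b,
    # which bounds the value x may take at this position.
    min_x = max(a, n - (k - 1) * b)
    max_x = min(b, n - (k - 1) * a)
    for x in range(min_x, max_x + 1):
        # Raising the lower bound to x prevents duplicate orderings
        for p in partitions_pruned(n - x, k - 1, x, b):
            yield [x] + p


print(list(partitions_pruned(10, 3, 2, 4)))
# [[2, 4, 4], [3, 3, 4]]
```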
Here's a sampling algorithm that uses conditional probability.
import collections
import random

countmemo = {}

def count(n, k, a, b):
    assert n >= 0
    assert k >= 0
    assert a >= 0
    assert b >= 0
    if k == 0:
        return 1 if n == 0 else 0
    key = (n, k, a, b)
    if key not in countmemo:
        countmemo[key] = sum(
            count(n - c, k - 1, a, c) for c in range(a, min(n, b) + 1))
    return countmemo[key]

def sample(n, k, a, b):
    partition = []
    x = random.randrange(count(n, k, a, b))
    while k > 0:
        for c in range(a, min(n, b) + 1):
            y = count(n - c, k - 1, a, c)
            if x < y:
                partition.append(c)
                n -= c
                k -= 1
                b = c
                break
            x -= y
        else:
            assert False
    return partition

def test():
    print(collections.Counter(
        tuple(sample(20, 6, 2, 5)) for i in range(10000)))

if __name__ == '__main__':
    test()
If k and b - a are not too big you can try a randomized depth-first search:
import random

def restricted_partition_rec(n, k, min, max):
    if k <= 0 or n < min:
        return []
    ps = list(range(min, max + 1))
    random.shuffle(ps)
    for p in ps:
        if p > n:
            continue
        elif p < n:
            subp = restricted_partition(n - p, k - 1, min, max)
            if subp:
                return [p] + subp
        elif k == 1:
            return [p]
    return []

def restricted_partition(n, k, min, max):
    return sorted(restricted_partition_rec(n, k, min, max), reverse=True)

print(restricted_partition(10, 3, 2, 4))
# [4, 4, 2] (one possible output)
Although I'm not sure if all the partitions have exactly the same probability in this case.

Given original string and encoded string, how to induce encoding?

Suppose I have an original string and an encoded string, like the following:
"abcd" -> "0010111111001010", then one possible solution would be that "a" matches with "0010", "b" matches with "1111", "c" matches with "1100", "d" matches with "1010".
How can I write a program that, given these two strings, figures out the possible encoding rules?
My first scratch looks like this:
fun partition(orgl, encode) =
  let
    val part = size(orgl)
    fun porpt(str, i, len) =
      if i = len - 1 then
        [substring(str, len * (len - 1), size(str) - (len - 1) * len)]
      else
        substring(str, len * i, len)::porpt(str, i + 1, len)
  in
    porpt(encode, 0, part)
  end;
But obviously it can not check whether the two substrings match the identical character, and there are many other possibilities other than proportionally partitioning the strings.
What should be the appropriate algorithms for this problem?
P.S. Only prefix code is allowed.
What I have learned has not really got into serious algorithms yet, but I did some searching about backtracking and wrote my second version of the code:
fun partition(orgl, encode) =
  let
    val part = size(orgl)
    fun backtrack(str, s, len, count, code) =
      let
        val current =
          if count = 1 then
            code@[substring(str, s, size(str) - s)]
          else
            code@[substring(str, s, len)]
      in
        if len > size(str) - s then []
        else
          if proper_prefix(0, orgl, code) then
            if count = 1 then current
            else
              backtrack(str, s + len, len, count - 1, current)
          else
            backtrack(str, s, len + 1, count, code)
      end
  in
    backtrack(encode, 0, 1, part, [])
  end;
Where the function proper_prefix would check prefix code and unique mapping. However, this function does not function correctly.
For example, when I input :
partition("abcd", "001111110101101");
The returned result is:
uncaught exception Subscript
FYI, the body of proper_prefix looks like this:
fun proper_prefix(i, orgl, nil) = true
  | proper_prefix(i, orgl, x::xs) =
    let
      fun check(j, str, nil) = true
        | check(j, str, x::xs) =
          if String.isPrefix str x then
            if str = x andalso substring(orgl, i, 1) = substring(orgl, i + j + 1, 1) then
              check(j + 1, str, xs)
            else
              false
          else
            check(j + 1, str, xs)
    in
      if check(0, x, xs) then proper_prefix(i + 1, orgl, xs)
      else false
    end;
I'd try a back-tracking approach:
Start with an empty hypothesis (i.e. set all encodings to unknown). Then process the encoded string character by character.
At every new code character, you have two options: Either append the code character to the encoding of the current source character or go to the next source character. If you encounter a source character that you already have an encoding for, check if it matches and go on. Or if it doesn't match, go back and try another option. You can also check the prefix-property during this traversal.
Your example input could be processed as follows:
Assume 'a' == '0'
Go to next source character
Assume 'b' == '0'
Violation of prefix property, go back
Assume 'a' == '00'
Go to next source character
Assume 'b' == '1'
...
This explores the range of all possible encodings. You can either return the first encoding found or all possible encodings.
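To make the search concrete, here is a rough Python sketch of such a backtracking search. It is not the answerer's code: it picks a whole candidate code word per source character rather than extending it one bit at a time, which explores the same space. The names find_encodings, search and is_prefix_free are made up, and it assumes every code word is non-empty and that the final mapping must be a prefix code.

```python
def find_encodings(original, encoded):
    """Yield every prefix-free mapping {char: bits} that turns original into encoded."""

    def is_prefix_free(codes, new):
        # new must not be a prefix of, equal to, or extended by an existing code
        return all(not c.startswith(new) and not new.startswith(c) for c in codes)

    def search(i, pos, mapping):
        if i == len(original):
            if pos == len(encoded):          # all bits consumed: valid hypothesis
                yield dict(mapping)
            return
        ch = original[i]
        if ch in mapping:                    # known character: its code must match here
            code = mapping[ch]
            if encoded.startswith(code, pos):
                yield from search(i + 1, pos + len(code), mapping)
            return
        remaining = len(original) - i - 1    # keep at least one bit per later character
        for end in range(pos + 1, len(encoded) - remaining + 1):
            code = encoded[pos:end]
            if is_prefix_free(mapping.values(), code):
                mapping[ch] = code           # assume ch == code
                yield from search(i + 1, end, mapping)
                del mapping[ch]              # go back and try a longer code

    yield from search(0, 0, {})


print(list(find_encodings("aba", "0100")))   # [{'a': '0', 'b': '10'}]
```

Taking the first result of the generator corresponds to returning the first encoding found; consuming it fully lists all of them.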
If one were to naively iterate all possible translations of abcd → 0010111111001010, this possibly leads to a blow-up. Simple iteration also appears to lead to a lot of invalid translations one would have to skip:
(a, b, c, d) → (0, 0, 1, 0111111001010) is invalid because a = b
(a, b, c, d) → (0, 0, 10, 111111001010) is invalid because a = b
(a, b, c, d) → (0, 01, 0, 111111001010) is invalid because a = c
(a, b, c, d) → (00, 1, 0, 111111001010) is one possibility
(a, b, c, d) → (0, 0, 101, 11111001010) is invalid because a = b
(a, b, c, d) → (0, 010, 1, 11111001010) is another possibility
(a, b, c, d) → (001, 0, 1, 11111001010) is another possibility
(a, b, c, d) → (0, 01, 01, 11111001010) is invalid because b = c
(a, b, c, d) → (00, 1, 01, 11111001010) is another possibility
(a, b, c, d) → (00, 10, 1, 11111001010) is another possibility
...
If all character strings contain each character exactly once, then this blow-up of results is the answer. If the same character occurs more than once, this further constrains the solution. E.g. matching abca → 111011 could generate
(a, b, c, a) → (1, 1, 1, 011) is invalid because a = b = c, a ≠ a
(a, b, c, a) → (1, 1, 10, 11) is invalid because a = b, a ≠ a
(a, b, c, a) → (1, 11, 0, 11) is invalid because a = b, a ≠ a
(a, b, c, a) → (11, 1, 0, 11) is one possibility
... (all remaining combinations would eventually prove invalid)
For a given hypothesis, you can choose the order in which to verify your constraints. Either
See if any mappings overlap. (I think this is what Nico calls the prefix property.)
See if any character that occurs more than once actually occurs in both places in the bit string.
An algorithm using this search strategy will have to find an order of checking constraints that rules out a hypothesis as soon as possible. My intuition tells me that a constraint a → β is worth investigating sooner if the bit string β is long and if it occurs many times.
Another strategy is ruling out that a particular character can map to any bit string of/above/below a certain length. For example, aaab → 1111110 rules out a mapping to any bit string of length above 2, and abcab → 1011101 rules out a mapping to any bit string of length different than 2.
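The length argument can be sketched in Python as below. This looks only at lengths (each occurrence of a character consumes the same number of bits, and the total must equal the length of the bit string), so it is a necessary condition, not a sufficient one. feasible_lengths is a made-up name, and the enumeration is exponential in the number of distinct characters, so it is only meant for short inputs.

```python
from collections import Counter
from itertools import product


def feasible_lengths(original, encoded_len):
    """For each character, the code lengths consistent with the total bit count."""
    counts = Counter(original)
    chars = sorted(counts)
    feasible = {c: set() for c in chars}
    # A character occurring m times cannot use more than encoded_len // m bits
    ranges = [range(1, encoded_len // counts[c] + 1) for c in chars]
    for lengths in product(*ranges):
        if sum(counts[c] * n for c, n in zip(chars, lengths)) == encoded_len:
            for c, n in zip(chars, lengths):
                feasible[c].add(n)
    return feasible


print(feasible_lengths("aaab", len("1111110")))
# {'a': {1, 2}, 'b': {1, 4}}
```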
For the programming part, try and think of ways to represent hypotheses. E.g.
(* For the hypothesis (a, b, c, a) → (11, 1, 0, 11) *)
(* Order signifies first occurrence *)
val someHyp1 = ([(#"a", 2), (#"b", 1), (#"c", 1)], "abca", "111011")
(* Somehow recurse over hypothesis and accumulate offsets for each character, e.g. *)
val someHyp2 = ([(#"a", 2), (#"b", 1), (#"c", 1)],
[(#"a", 0), (#"b", 2), (#"c", 3), (#"a", 4)])
And make a function that generates new hypotheses in some order, and a function that finds if a hypothesis is valid.
fun nextHypothesis (hyp, origStr, encStr) = ... (* should probably return SOME/NONE *)
fun validHypothesis (hyp, origStr, encStr) =
allStr (fn (i, c) => (* is bit string for c at its
accumulated offset in encStr? *)) origStr
(* Helper function that checks whether a predicate is true for each
character in a string. The predicate function takes both the index
and the character as argument. *)
and allStr p s =
let val len = size s
fun loop i = i >= len orelse p (i, String.sub (s, i)) andalso loop (i+1)
in loop 0 end
An improvement over this framework would be to change the order in which to explore hypotheses, since some search paths can rule out larger amounts of invalid mappings than others.

Prolog - List of sequence from f0 to fN

The question requires me to write a predicate seqList(N, L), which is satisfied when L is the list [f0, . . . , fN],
where fN = fN-1 + fN-2 + fN-3.
My code compares the head of the given list and returns true or false.
seqList(_,[]).
seqList(N,[H|T]) :-
N1 is N - 1,
seq(N,H),
seqList(N1,T).
However, it is only valid when the list is reversed,
e.g. seqList(3,[1,1,0,0]) returns true, but it should return true for
seqList(3,[0,0,1,1]). Is there any way for me to reverse the list and verify it correctly?
It seems that you want to generate N elements of a sequence f such that f(N) = f(N-1) + f(N-2) + f(N-3) where f(X) is the X-th element of the sequence list, 0-based. The three starting elements must be pre-set as part of the specification as well. You seem to be starting with [0,0,1, ...].
Using the approach from Lazy lists in Prolog?:
seqList(N,L):- N >= 3, !,
L=[0,0,1|X], N3 is N-3, take(N3, seq(0,0,1), X-[], _).
next( seq(A,B,C), D, seq(B,C,D) ):- D is A+B+C.
Now all these functions can be fused and inlined, to arrive at one recursive definition.
But you can do it directly. You just need to write down the question, to get the solution back.
question(N,L):-
Since you start with 0,0,1, ... write it down:
L = [0, 0, 1 | X],
since the three elements are given, we only need to find out N-3 more. Write it down:
N3 is N-3,
you've now reduced the problem somewhat. You now need to find N-3 elements and put them into the X list. Use a worker predicate for that. It also must know the three preceding numbers at each step:
worker( N3, 0, 0, 1, X).
So just write down what the worker must know:
worker(N, A, B, C, X):-
if N is 0, we must stop. X then is an empty list. Write it down.
N = 0, X = [] .
Add another clause, for when N is greater than 0.
worker(N, A, B, C, X):-
N > 0,
We know that the next element is the sum of the three preceding numbers. Write that down.
D is A + B + C,
the next element in the list is the top element of our argument list (the last parameter). Write it down:
X = [D | X2 ],
now there is one fewer element to add. Write it down:
N2 is N - 1,
To find the rest of the list, the three last numbers are B, C, and D. Then the rest is found by worker in exactly the same way:
worker( N2, B, C, D, X2).
That's it. The question predicate is your solution. Rename it to your liking.
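For readers who want to check the values, the same worker loop can be sketched in Python. This mirrors the predicate built above under its assumptions: the sequence starts with 0, 0, 1 and the result has N elements in total for N >= 3. seq_list is a made-up name.

```python
def seq_list(n):
    """First n elements of the sequence f(i) = f(i-1) + f(i-2) + f(i-3), starting 0, 0, 1."""
    if n < 3:
        raise ValueError("this sketch assumes n >= 3, like the Prolog version")
    result = [0, 0, 1]
    a, b, c = 0, 0, 1            # the three preceding numbers
    for _ in range(n - 3):       # n - 3 more elements to find
        d = a + b + c
        result.append(d)
        a, b, c = b, c, d
    return result


print(seq_list(7))  # [0, 0, 1, 1, 2, 4, 7]
```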
