Sum-of-Product of subsets - algorithm

Is there a name for this operation? And: is there a closed-form expression?
For a given set of n elements, and value k between 1 and n,
Take all subsets (combinations) of k items
Find the product of each subset
Find the sum of all those products
I can express this in Python, and do the calculation pretty easily:
from operator import mul
from itertools import combinations
from functools import reduce
def sum_of_product_of_subsets(list1, k):
    val = 0
    for subset in combinations(list1, k):
        val += reduce(mul, subset)
    return val
I'm just looking for the closed form expression, so as to avoid the loop in case the set size gets big.
Note this is NOT the same as this question: Sum of the product over all combinations with one element from each group -- that question is about the sum-of-products of a Cartesian product. I'm looking for the sum-of-products of the set of combinations of size k; I don't think they are the same.
To be clear, for the set (a, b, c, d):
k = 4 --> a*b*c*d
k = 3 --> b*c*d + a*c*d + a*b*d + a*b*c
k = 2 --> a*b + a*c + a*d + b*c + b*d + c*d
k = 1 --> a + b + c + d
Just looking for the expression; no need to supply the Python code specifically. (Any language would be illustrative, if you'd like to supply an example implementation.)

These are elementary symmetric polynomials. You can write them using summation signs as in Wikipedia. You can also use Vieta's formulas to get all of them at once as coefficients of a polynomial (up to signs)
(x-a_1)(x-a_2)...(x-a_k) =
x^k -
(a_1 + ... + a_k) x^(k-1) +
(a_1 a_2 + a_1 a_3 + ... + a_(k-1) a_k) x^(k-2) +
... +
(-1)^k a_1 ... a_k
By expanding (x-a_1)(x-a_2)...(x-a_k) you get a polynomial time algorithm to compute all those numbers (your original implementation runs in exponential time).
Edit: Python implementation:
from itertools import izip, chain

l = [2, 3, 4]
x = [1]
for i in l:
    x = [a + b*i for a, b in izip(chain([0], x), chain(x, [0]))]
print x
That gives you [24, 26, 9, 1], as 2*3*4=24, 2*3+2*4+3*4=26, 2+3+4=9. That last 1 is the empty product, which corresponds to k=0 in your implementation.
This should be O(N^2). Using polynomial FFT you could do O(N log^2 N), but I am too lazy to code that.

I have just run into the same problem elsewhere and I might have an easier solution.
Basically the closed form you are looking for is this one:
(1+e_1)*(1+e_2)*(1+e_3)*...*(1+e_n) - 1
where S = {e_1, e_2, ..., e_n} is the given set.
Here is why:
Let 'm' be the product of the elements of S (m = e_1*e_2*...*e_n).
If you look at the original products of elements of subsets, you can see that all of those products are divisors of 'm'.
Now apply the divisor function to 'm' (from now on called sigma(m)) with one modification: treat all the e_i as 'primes' (because we don't want them to be divided further), so sigma(e_i) = e_i + 1.
Then if you apply sigma to m:
sigma(m) = sigma(e_1*e_2*...*e_n) = 1 + [e_1 + e_2 + ... + e_n] + [e_1*e_2 + e_1*e_3 + ... + e_(n-1)*e_n] + [e_1*e_2*e_3 + e_1*e_2*e_4 + ... + e_(n-2)*e_(n-1)*e_n] + ... + [e_1*e_2*...*e_n]
This is the sum from the original problem (except for the 1 at the beginning).
Our divisor function is multiplicative, so the previous equation can be rewritten as following:
sigma(m)=(1+e_1)*(1+e_2)*(1+e_3)*...*(1+e_n)
There is one correction you need here, because of the empty subset: it is counted here (it contributes the '1' at the beginning of the first equation), but it is not present in the original problem.
So the closed form you need is:
(1+e_1)*(1+e_2)*(1+e_3)*...*(1+e_n) - 1
Sorry, I can't really code that, but the computation shouldn't take more than 2n-1 arithmetic operations.
(You can read more about the divisor function here: http://en.wikipedia.org/wiki/Divisor_function)
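For illustration, a minimal Python 3 sketch of this closed form (my own code, not part of the original answer):

def sum_of_subset_products(S):
    # (1 + e_1) * (1 + e_2) * ... * (1 + e_n) - 1: the sum of the
    # products of all non-empty subsets of S.
    total = 1
    for e in S:
        total *= 1 + e
    return total - 1

print(sum_of_subset_products([2, 3, 4]))  # 59 == 24 + 26 + 9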

Related

Sum of 2^Pi mod 1000000007 for all i where Pi is sum of numbers in ith subset of a set X

I am stuck on a problem in which I have to print the sum of 2^Pi mod 1000000007 for all i, where Pi is the sum of the numbers in the ith subset of a set X.
The length of the set can be up to 100000.
The values of the elements are in the range [0, 10^12].
Here's the link to the problem: Problem Statement
I could not find any approach other than brute force, which gets a TLE verdict.
Here's the problem statement:
Violet, Klaus and Sunny Baudelaire were given a task by Count Olaf to keep them busy while he acquires their fortune. He has given them N numbers and asked each of them to do the following:
He asked Sunny to make all possible subsets of the set of numbers.
Then he asked Klaus to find the sum of the numbers in each subset thus formed.
Finally he asked Violet to tell him the sum of 2^Pi for all i, where Pi is the sum of the numbers in the ith subset.
Since Count Olaf will be bored while listening to such a long number, he is asked to give the answer modulo 1000000007.
Can you help the Baudelaires out of this predicament?
Input Format
First line of input contains a single number N indicating the size of the set. The following line will containing N numbers that makes the set.
Output Format
Print the answer in a single line.
Input Constraints
1 ≤ N ≤ 10^5
0 ≤ a[i] ≤ 10^12
Here's My Solution:
import itertools

# Initialize the sum variable which will store the final answer
t = 0
# Input: number of elements in the array
n = input()
# Input the array
arr = map(int, raw_input().split())
# Traverse all possible combinations and update the sum variable 't'
for i in xrange(len(arr) + 1):
    for val in itertools.combinations(arr, i):
        x = sum(val)
        t = (t + 2**x) % 1000000007
# Print the final answer
print t
Here's the working algorithm which passes all testcases in time limit but I don't get the logic behind it.
from sys import stdin

mod = 10**9 + 7
n = int(stdin.readline())
ans = 1
a = map(int, stdin.readline().split())
for i in a:
    j = pow(2, i, mod)
    ans = (ans*(j+1)) % mod
print ans
You're given a set of numbers X, and are asked to compute sum(2^sum(x for x in A) for A a subset of X).
Let X be the set {x[0], x[1], ..., x[n]}, and S[i] be the sum of the powers of 2 of the sums of the subsets of x[0]...x[i]. That is, S[i] = sum(2^sum(x for x in A) for A a subset of x[0]...x[i]).
Subsets of x[0]...x[i+1] are either subsets of x[0]...x[i], or subsets of x[0]...x[i] with x[i+1] added.
Then:
S[i+1] = sum(2^sum(x for x in A) for A a subset of x[0]...x[i+1])
= sum(2^sum(x for x in A) for A a subset of x[0]...x[i])
+ sum(2^sum(x for x in A+{x[i+1]}) for A a subset of x[0]...x[i])
= sum(2^sum(x for x in A) for A a subset of x[0]...x[i])
+ sum(2^x[i+1] * 2^sum(x for x in A) for A a subset of x[0]...x[i])
= S[i]
+ 2^x[i+1] * S[i]
This gives us a linear-time method for computing the result:
A = [3, 3, 6, 1, 2]
m = 10**9 + 7
r = 1
for x in A:
    r = (r * (1 + pow(2, x, m))) % m
print r

Maximum of sums of unsorted array and each of a number of sorted arrays

Given an unsorted array
A = a_1 ... a_n
And a set of sorted Arrays
B_i = b_i_1 ... b_i_n # for i from 1 to $large_number
I would like to find the maximums from the (not yet calculated) sum arrays
C_i = (a_1 + b_i_1) ... (a_n + b_i_n)
for each i.
Is there a trick to do better than just calculating all the C_i and finding their maximums in O($large_number * n)?
Can we do better when we know that the B arrays are just shifts from an endless sequence,
e.g.
S = 0 1 4 9 16 ...
B_i = S[i:i+n]
(The above sequence has the possibly advantageous property that S_i - S_(i-1) > S_(i-1) - S_(i-2).)
There are $large_number * n entries in your first problem, so there can't be any such trick.
You can prove this with an adversary argument. Suppose you have an algorithm that solves your problem without looking at all n * $large_number entries of b. I'm going to pick a fixed a, namely (-10, -20, -30, ..., -10n). For the first $large_number * n - 1 times the algorithm looks at an entry b_(i,j), I'll answer that it's 10j, for a sum of zero. The last time it looks at an entry, I'll answer that it's 10j + 1, for a sum of 1.
If $large_number is Omega(n), your second problem requires you to look at n * $large_number entries of S, so it also can't have any such trick.
However, if you specify S, there may be something. And if $large_number <= n/2 (or whatever it is), then, all of the entries of S must be sorted, so you only have to look at the last B.
If we don't know anything more, I don't think it's possible to do better than O($large_number * n).
However, if the B arrays are just shifts of an endless sequence, we can do it in O($large_number + n):
We calculate the sum of B_0 in O(n).
Then sum(B_1) = (sum(B_0) - S[0]) + S[n]
And in general: sum(B_i) = (sum(B_(i-1)) - S[i-1]) + S[i-1+n].
So we can calculate all the other sums and their maximum in O($large_number).
This is for a general sequence - if we have some info about it, it might be possible to do better.
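For illustration, a rough Python 3 sketch of this sliding-window idea (my own code and names; it assumes, as this answer does, that what is needed for each i is the total sum of C_i = A + B_i, which equals sum(A) + sum(B_i)):

def max_window_sum(S, A, large_number):
    n = len(A)
    sum_A = sum(A)
    window = sum(S[0:n])                   # sum of B_0, computed once in O(n)
    best = sum_A + window
    for i in range(1, large_number):
        window += S[i - 1 + n] - S[i - 1]  # slide the window by one position
        best = max(best, sum_A + window)
    return best

S = [j * j for j in range(20)]             # 0 1 4 9 16 ... (example sequence)
A = [5, -2, 7]
print(max_window_sum(S, A, large_number=10))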
we know that the B arrays are just shifts from an endless sequence,
e.g.
S = 0 1 4 9 16 ...
B_i = S[i:i+n]
You can easily calculate S[i:i+n] as (sum of squares from 1 to i+n) - (sum of squares from 1 to i-1)
See https://math.stackexchange.com/questions/183316/how-to-get-to-the-formula-for-the-sum-of-squares-of-first-n-numbers
With the provided example, S_1 = 0, S_2 = 1, S_3 = 4, ...
Let f(n) = the sum of S_i for i = 1 to n = (n-1)(n)(2n-1)/6
B_i = f(i+n) - f(i-1)
You then add SUM(A) to each sum.
Another approach is to calculate the difference between B_i and B_(i-1):
That would be: S[i:i+n] - S[i-1:i+n-1] = S(i+n) - S(i-1)
That way, you can just calculate the difference of the sums of each array with the previous one. In my understanding, since SUM(C_i) = SUM(B_i) + SUM(A), SUM(A) becomes a constant that is irrelevant in finding the maximum.
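As a small Python 3 check of that closed form (my own code; the indices follow Python's 0-based slicing, so they may be shifted by one relative to the 1-based formulas above):

# With S[j] = j*j, f(t) = 0^2 + 1^2 + ... + (t-1)^2 = (t-1)*t*(2*t-1)/6,
# so the window sum sum(S[i:i+n]) equals f(i+n) - f(i).
def f(t):
    return (t - 1) * t * (2 * t - 1) // 6

S = [j * j for j in range(20)]
i, n = 3, 4
assert sum(S[i:i+n]) == f(i + n) - f(i)   # 9 + 16 + 25 + 36 == 86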

arrangement with constraint on the sum

I'm looking to construct an algorithm which gives the arrangements with repetition of n values taken with a given step S (which can be a positive real number), under the constraint that the sum of each arrangement is k, with k a positive integer.
My problem is thus to find the solutions to the equation:
x_1 + x_2 + ... + x_n = k
where
0 ≤ x_i ≤ b_i
and S (the step) is a real number with a finite decimal expansion.
For instance, if 0 ≤ x_i ≤ 50 and S = 2.5, then x_i ∈ {0, 2.5, 5, ..., 47.5, 50}.
The point here is to look only at the combinations whose sum equals k: if n is big, it is not possible to generate all the arrangements and filter them, so I would like to generate only the combinations that match the constraint directly.
I was thinking to start with n=2 for instance, and find all linear combinations that match the constraint.
ex: if xi = {0, 2.5 , 5,..., 47.5, 50} and k=100, then we only have one combination={50,50}
For n=3, we have the combination for n=2 times 3, i.e. {50,50,0},{50,0,50} and {0,50,50} plus the combinations {50,47.5,2.5} * 3! etc...
If xi = {0, 2.5 , 5,..., 37.5, 40} and k=100, then we have 0 combinations for n=2 because 2*40<100, and we have {40,40,20} times 3 for n=3... (if I'm not mistaken)
I'm a bit lost as I can't seem to find a proper way to start the algorithm, knowing that I should have the step S and b as inputs.
Do you have any suggestions?
Thanks
You can transform your problem into an integer problem by dividing everything by S: we want to find all integer sequences y_1, ..., y_n with:
(1) 0 ≤ y_i ≤ ⌊b / S⌋
(2) y_1 + ... + y_n = k / S
We can see that there is no solution if k is not a multiple of S. Once we have reduced the problem, I would suggest using a pseudopolynomial dynamic programming algorithm to solve the subset sum problem and then reconstruct the solution from it. Let f(i, j) be the number of ways to make sum j with i elements. We have the following recurrence:
f(0,0) = 1
f(0,j) = 0 forall j > 0
f(i,j) = sum_{m = 0}^{min(floor(b / S), j)} f(i - 1, j - m)
We can solve f in O(n * k / S) time by filling it row by row. Now we want to reconstruct the solution. I'm using Python-style pseudocode to illustrate the concept:
def reconstruct(i, j):
    if f(i, j) == 0:
        return
    if i == 0:
        yield []
        return
    for m in range(min(floor(b / S), j) + 1):
        for rest in reconstruct(i - 1, j - m):
            yield [m] + rest

result = list(reconstruct(n, k / S))
result will be a list of all possible combinations.
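For concreteness, here is a self-contained Python 3 sketch of the whole idea (memoizing f and then reconstructing; the function names and the example values are mine, not from the answer):

from functools import lru_cache

def enumerate_solutions(n, k, b, S):
    # All length-n lists of multiples of S in [0, b] summing to k (a sketch).
    K = round(k / S)                  # target expressed in units of S
    B = int(b // S)                   # per-variable bound in units of S
    if abs(K * S - k) > 1e-9:         # no solution if k is not a multiple of S
        return []

    @lru_cache(maxsize=None)
    def f(i, j):                      # number of ways to write j with i variables
        if i == 0:
            return 1 if j == 0 else 0
        return sum(f(i - 1, j - m) for m in range(min(B, j) + 1))

    def reconstruct(i, j):
        if f(i, j) == 0:
            return
        if i == 0:
            yield []
            return
        for m in range(min(B, j) + 1):
            for rest in reconstruct(i - 1, j - m):
                yield [m * S] + rest

    return list(reconstruct(n, K))

print(enumerate_solutions(3, 100, 50, 25))
# [[0, 50, 50], [25, 25, 50], [25, 50, 25], [50, 0, 50], [50, 25, 25], [50, 50, 0]]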
What you are describing sounds like a special case of the subset sum problem. Once you put it in those terms, you'll find that Pisinger apparently has a linear time algorithm for solving a more general version of your problem, since your weights are bounded. If you're interested in designing your own algorithm, you might start by reading Pisinger's thesis to get some ideas.
Since you are looking for all possible solutions and not just a single solution, the dynamic programming approach is probably your best bet.

Dynamic programming approximation

I am trying to calculate a function F(x,y) using dynamic programming. Functionally:
F(X,Y) = a_1 F(X-1,Y) + a_2 F(X-2,Y) + ... + a_k F(X-k,Y) + b_1 F(X,Y-1) + b_2 F(X,Y-2) + ... + b_k F(X,Y-k)
where k is a small number (k=10).
The problem is, X = 1,000,000 and Y = 1,000,000. So it is infeasible to calculate F(x,y) for every value between x = 1..1000000 and y = 1..1000000. Is there an approximate version of DP where I can avoid calculating F(x,y) for a large number of inputs and still get an accurate estimate of F(X,Y)?
A similar example is string-matching algorithms (Levenshtein distance) for two very long and similar strings (e.g. similar DNA sequences). In such cases only the scores near the diagonal are important and the far-from-diagonal entries do not contribute to the final distance. How do we avoid calculating the off-diagonal entries?
PS: Ignore the border cases (i.e. when x < k and y < k).
I'm not sure precisely how to adapt the following technique to your problem, but if you were working in just one dimension there is an O(k^3 log n) algorithm for computing the nth term of the series. This is called a linear recurrence and can be solved using matrix math, of all things. The idea is to suppose that you have a recurrence defined as
F(1) = x_1
F(2) = x_2
...
F(k) = x_k
F(n + k) = c_1 F(n) + c_2 F(n + 1) + ... + c_k F(n + k - 1)
For example, the Fibonacci sequence is defined as
F(0) = 0
F(1) = 1
F(n + 2) = 1 x F(n) + 1 x F(n + 1)
There is a way to view this computation as working on a matrix. Specifically, suppose that we have the vector x = (x_1, x_2, ..., x_k)^T. We want to find a matrix A such that
Ax = (x_2, x_3, ..., x_k, x_{k + 1})^T
That is, we begin with a vector of terms 1 ... k of the sequence, and then after multiplying by matrix A end up with a vector of terms 2 ... k + 1 of the sequence. If we then multiply that vector by A, we'd like to get
A(x_2, x_3, ..., x_k, x_{k + 1})^T = (x_3, x_4, ..., x_k, x_{k + 1}, x_{k + 2})
In short, given k consecutive terms of the series, multiplying that vector by A gives us the next term of the series.
The trick uses the fact that we can group the multiplications by A. For example, in the above case, we multiplied our original x by A to get x' (terms 2 ... k + 1), then multiplied x' by A to get x'' (terms 3 ... k + 2). However, we could have instead just multiplied x by A^2 to get x'' as well, rather than doing two different matrix multiplications. More generally, if we want to get term n of the sequence, we can compute A^n x, then inspect the appropriate element of the vector.
Here, we can use the fact that matrix multiplication is associative to compute A^n efficiently. Specifically, we can use the method of repeated squaring to compute A^n in a total of O(log n) matrix multiplications. If the matrix is k x k, then each multiplication takes time O(k^3) for a total of O(k^3 log n) work to compute the nth term.
So all that remains is actually finding this matrix A. Well, we know that we want to map from (x_1, x_2, ..., x_k) to (x_2, x_3, ..., x_{k + 1}), and we know that x_{k + 1} = c_1 x_1 + c_2 x_2 + ... + c_k x_k, so we get this matrix:
| 0 1 0 0 ... 0 |
| 0 0 1 0 ... 0 |
A = | 0 0 0 1 ... 0 |
| ... |
| c_1 c_2 c_3 c_4 ... c_k |
For more detail on this, see the Wikipedia entry on solving linear recurrences with linear algebra, or my own code that implements the above algorithm.
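To make the one-dimensional trick concrete, here is a small Python 3 sketch (my own code, using plain nested-loop matrix multiplication) that computes the nth term of a k-term linear recurrence by repeated squaring of the companion matrix:

def mat_mul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(inner)) for j in range(cols)]
            for i in range(rows)]

def mat_pow(A, e):
    # Repeated squaring: O(log e) multiplications of k x k matrices.
    k = len(A)
    result = [[1 if i == j else 0 for j in range(k)] for i in range(k)]  # identity
    while e > 0:
        if e & 1:
            result = mat_mul(result, A)
        A = mat_mul(A, A)
        e >>= 1
    return result

def kth_order_term(coeffs, initial, n):
    # n-th term (1-indexed) of F with F(i) = initial[i-1] for i <= k and
    # F(n + k) = coeffs[0] F(n) + ... + coeffs[k-1] F(n + k - 1).
    k = len(coeffs)
    if n <= k:
        return initial[n - 1]
    A = [[0] * k for _ in range(k)]          # companion matrix, as above
    for i in range(k - 1):
        A[i][i + 1] = 1
    A[k - 1] = list(coeffs)
    v = mat_mul(mat_pow(A, n - 1), [[x] for x in initial])
    return v[0][0]

# Fibonacci: F(1) = 0, F(2) = 1, F(n + 2) = F(n) + F(n + 1)
print(kth_order_term([1, 1], [0, 1], 10))  # 34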
The only question now is how you adapt this to when you're working in multiple dimensions. It's certainly possible to do so by treating the computation of each row as its own linear recurrence, then to go one row at a time. More specifically, you can compute the nth term of the first k rows each in O(k^3 log n) time, for a total of O(k^4 log n) time to compute the first k rows. From that point forward, you can compute each successive row in terms of the previous row by reusing the old values. If there are n rows to compute, this gives an O(k^4 n log n) algorithm for computing the final value that you care about. If this is small compared to the work you'd be doing before (O(n^2 k^2), I believe), then this may be an improvement. Since you're saying that n is on the order of one million and k is about ten, this does seem like it should be much faster than the naive approach.
That said, I wouldn't be surprised if there was a much faster way of solving this problem by not proceeding row by row and instead using a similar matrix trick in multiple dimensions.
Hope this helps!
Without knowing more about your specific problem, the general approach is to use a top-down dynamic programming algorithm and memoize the intermediate results. That way you will only calculate the values that will be actually used (while saving the result to avoid repeated calculations).
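A minimal Python 3 sketch of that top-down memoized approach for the recurrence above (the coefficients, the border values and the small test size are placeholders of mine):

from functools import lru_cache

k = 10
a = [0.05] * k          # placeholder coefficients a_1 ... a_k
b = [0.05] * k          # placeholder coefficients b_1 ... b_k

@lru_cache(maxsize=None)
def F(x, y):
    if x < k or y < k:  # border cases (ignored in the question); 1.0 is a placeholder
        return 1.0
    return (sum(a[i] * F(x - i - 1, y) for i in range(k))
            + sum(b[i] * F(x, y - i - 1) for i in range(k)))

print(F(50, 50))        # only the states actually reached get computed and cached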

Calculating sum of geometric series (mod m)

I have a series
S = i^m + i^(2m) + ... + i^(km) (mod m)
0 <= i < m, k may be very large (up to 100,000,000), m <= 300000
I want to find the sum. I cannot apply the geometric progression (GP) formula, because then the result would have a denominator and I would have to find a modular inverse, which may not exist (if the denominator and m are not coprime).
So I made an alternate algorithm making an assumption that these powers will make a cycle of length much smaller than k (because it is a modular equation and so I would obtain something like 2,7,9,1,2,7,9,1....) and that cycle will repeat in the above series. So instead of iterating from 0 to k, I would just find the sum of numbers in a cycle and then calculate the number of cycles in the above series and multiply them. So I first found i^m (mod m) and then multiplied this number again and again taking modulo at each step until I reached the first element again.
But when I actually coded the algorithm, for some values of i I got cycles of very large size, so the computation took a long time to terminate; hence my assumption was incorrect.
So is there any other pattern we can find out? (Basically I don't want to iterate over k.)
So please give me an idea of an efficient algorithm to find the sum.
This is the algorithm for a similar problem I encountered
You probably know that one can calculate the power of a number in logarithmic time. You can also do so for calculating the sum of the geometric series. Since it holds that
1 + a + a^2 + ... + a^(2*n+1) = (1 + a) * (1 + (a^2) + (a^2)^2 + ... + (a^2)^n),
you can recursively calculate the geometric series on the right hand to get the result.
This way you do not need division, so you can take the remainder of the sum (and of intermediate results) modulo any number you want.
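A minimal recursive Python 3 sketch of that identity (my own code, not from the answer); it computes the sum without any division:

def geo_sum(a, n, m):
    # Return (1 + a + a^2 + ... + a^n) mod m, division-free.
    if n == 0:
        return 1 % m
    if n % 2 == 1:
        # 1 + a + ... + a^(2t+1) = (1 + a) * (1 + a^2 + (a^2)^2 + ... + (a^2)^t)
        return ((1 + a) * geo_sum(a * a % m, n // 2, m)) % m
    # even n: peel off the last term a^n and fall back to the odd case
    return (geo_sum(a, n - 1, m) + pow(a, n, m)) % m

# For the original series i^m + i^(2m) + ... + i^(km) mod m, one would take
# a = pow(i, m, m) and compute (geo_sum(a, k, m) - 1) % m.
print(geo_sum(2, 3, 1000))  # 1 + 2 + 4 + 8 = 15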
As you've noted, doing the calculation for an arbitrary modulus m is difficult because many values might not have a multiplicative inverse mod m. However, if you can solve it for a carefully selected set of alternate moduli, you can combine them to obtain a solution mod m.
Factor m into p_1, p_2, p_3 ... p_n such that each p_i is a power of a distinct prime
Since each p is a distinct prime power, they are pairwise coprime. If we can calculate the sum of the series with respect to each modulus p_i, we can use the Chinese Remainder Theorem to reassemble them into a solution mod m.
For each prime power modulus, there are two trivial special cases:
If i^m is congruent to 0 mod p_i, the sum is trivially 0.
If i^m is congruent to 1 mod p_i, then the sum is congruent to k mod p_i.
For other values, one can apply the usual formula for the sum of a geometric sequence:
S = sum(j=0 to k, (i^m)^j) = ((i^m)^(k+1) - 1) / (i^m - 1)
TODO: Prove that (i^m - 1) is coprime to p_i, or find an alternate solution for when they have a nontrivial GCD. Hopefully the fact that p_i is a prime power and also a divisor of m will be of some use... If p_i is a divisor of i, the condition holds. If p_i is prime (as opposed to a prime power), then either the special case i^m = 1 applies, or (i^m - 1) has a multiplicative inverse.
If the geometric sum formula isn't usable for some p_i, you could rearrange the calculation so you only need to iterate from 1 to p_i instead of 1 to k, taking advantage of the fact that the terms repeat with a period no longer than p_i.
(Since your series doesn't contain a j=0 term, the value you want is actually S-1.)
This yields a set of congruences mod p_i, which satisfy the requirements of the CRT.
The procedure for combining them into a solution mod m is described in the above link, so I won't repeat it here.
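For completeness, a small CRT-combination helper might look like this (a sketch of my own; pow(M, -1, p) requires Python 3.8+ for the modular inverse):

def crt_combine(residues, moduli):
    # Combine x ≡ r_i (mod p_i) for pairwise-coprime p_i into x mod (product of p_i).
    x, M = 0, 1
    for r, p in zip(residues, moduli):
        # Solve x + M*t ≡ r (mod p), i.e. t ≡ (r - x) * M^(-1) (mod p)
        t = ((r - x) * pow(M, -1, p)) % p
        x += M * t
        M *= p
        x %= M
    return x

# Example: x ≡ 2 (mod 3), x ≡ 3 (mod 5), x ≡ 2 (mod 7)  ->  23 (mod 105)
print(crt_combine([2, 3, 2], [3, 5, 7]))  # 23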
This can be done via the method of repeated squaring, which is O(log(k)) time, or O(log(k)log(m)) time, if you consider m a variable.
In general, a[n] = 1 + b + b^2 + ... + b^(n-1) mod m can be computed by noting that:
a[j+k]==b^{j}a[k]+a[j]
a[2n]==(b^n+1)a[n]
The second is just a corollary of the first.
In your case, b=i^m can be computed in O(log m) time.
The following Python code implements this:
def geometric(n, b, m):
    T = 1
    e = b % m
    total = 0
    while n > 0:
        if n & 1 == 1:
            total = (e*total + T) % m
        T = ((e+1)*T) % m
        e = (e*e) % m
        n = n // 2
        # print '{} {} {}'.format(total, T, e)
    return total
This bit of magic has a mathematical reason - the operation on pairs defined as
(a,r)#(b,s)=(ab,as+r)
is associative, and the first identity above (a[j+k] == b^j a[k] + a[j]) basically means that:
(b,1)#(b,1)#... n times ... #(b,1)=(b^n,1+b+b^2+...+b^(n-1))
Repeated squaring always works when operations are associative. In this case, the # operator is O(log(m)) time, so repeated squaring takes O(log(n)log(m)).
One way to look at this is via matrix exponentiation:
[[b,1],[0,1]]^n == [[b^n, 1+b+...+b^(n-1)],[0,1]]
You can use a similar method to compute (a^n-b^n)/(a-b) modulo m because matrix exponentiation gives:
[[b,1],[0,a]]^n == [[b^n,a^(n-1)+a^(n-2)b+...+ab^(n-2)+b^(n-1)],[0,a^n]]
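As a quick Python 3 sketch of that last identity (my own code): raising the 2x2 matrix to the nth power mod m yields (a^n - b^n)/(a - b) mod m in its top-right entry, with no division:

def mat_mul2(X, Y, m):
    return [[(X[0][0]*Y[0][0] + X[0][1]*Y[1][0]) % m, (X[0][0]*Y[0][1] + X[0][1]*Y[1][1]) % m],
            [(X[1][0]*Y[0][0] + X[1][1]*Y[1][0]) % m, (X[1][0]*Y[0][1] + X[1][1]*Y[1][1]) % m]]

def alternating_sum(a, b, n, m):
    # (a^(n-1) + a^(n-2) b + ... + b^(n-1)) mod m, i.e. (a^n - b^n)/(a - b) mod m.
    M = [[b % m, 1], [0, a % m]]
    R = [[1, 0], [0, 1]]
    while n > 0:
        if n & 1:
            R = mat_mul2(R, M, m)
        M = mat_mul2(M, M, m)
        n >>= 1
    return R[0][1]

print(alternating_sum(3, 2, 4, 1000))  # (3^4 - 2^4) / (3 - 2) = 65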
Based on the approach of #braindoper a complete algorithm which calculates
1 + a + a^2 + ... +a^n mod m
looks like this in Mathematica:
geometricSeriesMod[a_, n_, m_] :=
  Module[{q = a, exp = n, factor = 1, sum = 0, temp},
    While[And[exp > 0, q != 0],
      If[EvenQ[exp],
        temp = Mod[factor*PowerMod[q, exp, m], m];
        sum = Mod[sum + temp, m];
        exp--];
      factor = Mod[Mod[1 + q, m]*factor, m];
      q = Mod[q*q, m];
      exp = Floor[exp/2];
    ];
    Return[Mod[sum + factor, m]]
  ]
Parameters:
a is the "ratio" of the series. It can be any integer (including zero and negative values).
n is the highest exponent of the series. Allowed are integers >= 0.
m is the integer modulus != 0.
Note: The algorithm performs a Mod operation after every arithmetic operation. This is essential, if you transcribe this algorithm to a language with a limited word length for integers.
