Subset-Sum in Linear Time - algorithm

This was a question on our Algorithms final exam. It's verbatim because the prof let us take a copy of the exam home.
(20 points) Let I = {r1,r2,...,rn} be a set of n arbitrary positive integers and the values in I are distinct. I is not given in any sorted order. Suppose we want to find a subset I' of I such that the total sum of all elements in I' is exactly 100*ceil(n^.5) (each element of I can appear at most once in I'). Present an O(n) time algorithm for solving this problem.
As far as I can tell, it's basically a special case of the knapsack problem, otherwise known as the subset-sum problem ... both of which are in NP and in theory impossible to solve in linear time?
So ... was this a trick question?
This SO post basically explains that a pseudo-polynomial (linear) time approximation can be done if the weights are bounded, but in the exam problem the weights aren't bounded and either way given the overall difficulty of the exam I'd be shocked if the prof expected us to know/come up with an obscure dynamic optimization algorithm.

There are two things that make this problem possible:
The input can be truncated to size O(sqrt(n)). All inputs are positive, so you can discard any number greater than the target 100*ceil(sqrt(n)), and all inputs are distinct, so at most 100*ceil(sqrt(n)) inputs matter.
The playing field has size O(sqrt(n)). Although there are O(2^sqrt(n)) ways to combine the O(sqrt(n)) inputs that matter, you don't have to care about combinations that either leave the 100*ceil(sqrt(n)) range or redundantly hit a target you can already reach.
Basically, this problem screams dynamic programming with each input being checked against each part of the 'reached number' space somehow.
The solution ends up being a matter of ensuring numbers don't reach off of themselves (by scanning in the right direction), of only looking at each number once, and of giving ourselves enough information to reconstruct the solution afterwards.
Here's some C# code that should solve the problem in the given time:
int[] FindSubsetToImpliedTarget(int[] inputs) {
    var target = 100*(int)Math.Ceiling(Math.Sqrt(inputs.Length));

    // build up how-X-was-reached table
    var reached = new int?[target+1];
    reached[0] = 0; // the empty set reaches 0
    foreach (var e in inputs) {
        // we go backwards to avoid reaching off of ourselves
        for (var i = target; i >= e; i--) {
            // keep the first way each sum was reached, so back-tracking never reuses e
            if (reached[i-e].HasValue && !reached[i].HasValue) {
                reached[i] = e;
            }
        }
    }

    // was target even reached?
    if (!reached[target].HasValue) return null;

    // build result by back-tracking via the logged reached values
    var result = new List<int>();
    for (var i = target; reached[i] != 0; i -= reached[i].Value) {
        result.Add(reached[i].Value);
    }
    return result.ToArray();
}
I haven't actually tested the above code, so beware typos and off-by-ones.

The typical DP algorithm for the subset-sum problem yields an O(n)-time algorithm here. We use dp[i][k] (boolean) to indicate whether the first i items have a subset with sum k. The transition equation is:
dp[i][k] = (dp[i-1][k-v[i]] || dp[i-1][k]),
which is O(NM), where N is the number of items and M is the target sum. Since the elements are distinct positive integers and the sum must equal 100*ceil(n^.5), we only need to consider items whose value is at most 100*ceil(n^.5) (found in one O(n) pass), and there are at most 100*ceil(n^.5) of them. So N <= 100*ceil(n^.5) and M = 100*ceil(n^.5).
The DP algorithm is O(N*M) = O(100*ceil(n^.5) * 100*ceil(n^.5)) = O(n).
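A minimal Python sketch of this table-based DP, assuming distinct positive integers as in the exam question. The function name `has_subset_with_implied_target` is made up for illustration, and the sketch only reports whether the target is reachable rather than returning the subset.

```python
import math

def has_subset_with_implied_target(values):
    """Return True if some subset of values sums to 100*ceil(sqrt(n))."""
    n = len(values)
    m = 100 * math.ceil(math.sqrt(n))
    # Values above the target can never be used. Because the values are
    # distinct positive integers, at most m of them survive this filter.
    small = [v for v in values if v <= m]
    # dp[i][k] is True when some subset of the first i kept items sums to k.
    dp = [[False] * (m + 1) for _ in range(len(small) + 1)]
    dp[0][0] = True
    for i, v in enumerate(small, start=1):
        for k in range(m + 1):
            dp[i][k] = dp[i - 1][k] or (k >= v and dp[i - 1][k - v])
    return dp[len(small)][m]
```

The table has at most (m + 1) * (m + 1) entries, which is O(n) because m is O(sqrt(n)).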

OK, the following is a simple solution in O(n) time.
Since the required sum S is of the order of O(n^0.5), an algorithm of complexity S^2 is good enough, because its effective complexity is then O(n).
Iterate once over all the elements and check whether each value is at most S. If it is, push it into a new array. This array contains at most S elements (O(n^.5)), because the values are distinct positive integers.
Sort this array in descending order in O(sqrt(n)*logn) time ( < O(n)). This is so because logn <= sqrt(n) for all natural numbers. https://math.stackexchange.com/questions/65793/how-to-prove-log-n-leq-sqrt-n-over-natural-numbers
Now this problem is a 1D knapsack problem with W = S and number of elements = S (upper bound).
Maximize the total weight of items and see if it equals S.
It can be solved using dynamic programming in O(S * S) = O(n) time (at most S items, each checked against the sums up to W = S).

Related

Best case of fractional knapsack

The worst-case running time of fractional knapsack is O(n). What is its best case, then? Is it O(1), for example when the weight limit is 16 and the first item you get already fills it? Is that right?
True, if you assume that the input is given sorted by value (per unit of weight).
But as per the definition, the algorithm is expected to take non-sorted input too. see this.
If you are considering a normal input that may or may not be sorted, then there are two approaches to solve the problem:
Sort the input. This cannot take less than O(n) even in the best case, and only reaches that bound with bubble or insertion sort on already-sorted input. That choice looks foolish, because both of these sorting algorithms have O(n^2) average and worst-case performance.
Use the weighted medians approach. That costs O(n), since finding the weighted median takes O(n). The code for this approach is given below.
Weighted median approach for fractional knapsack:
We will work with the value per unit of weight of each item in the following code. The code first finds the middle value (i.e. the median of the per-unit values, as if they were given in sorted order) and places it in its correct position, using the quicksort partition method. Once we have the middle element (call it mid), two cases need to be considered:
When the sum of the weights of all items on the right side of mid is more than W, we need to search for our answer in the right side of mid.
Otherwise, take all items on the right side of mid, sum their weights and values (call them w_right and v_right), and search for W - w_right in the left side of mid (including mid as well), adding v_right to the result.
Following is the implementation in Python (use only floating-point numbers everywhere):
Please note that I am not providing production-level code, and there are cases that will fail as well. Think about what can cause the worst case or a failure when finding the kth max in an array (when all values are the same, maybe).
def partition(weights, values, start, end):
    # Lomuto partition on value per unit of weight; the last item is the pivot.
    x = values[end] / weights[end]
    i = start
    for j in range(start, end):
        if values[j] / weights[j] < x:
            values[i], values[j] = values[j], values[i]
            weights[i], weights[j] = weights[j], weights[i]
            i += 1
    values[i], values[end] = values[end], values[i]
    weights[i], weights[end] = weights[end], weights[i]
    return i

def _find_kth(weights, values, start, end, k):
    # Put the k-th smallest ratio (1-based within start..end) at its sorted index.
    ind = partition(weights, values, start, end)
    if ind - start == k - 1:
        return ind
    if ind - start > k - 1:
        return _find_kth(weights, values, start, ind - 1, k)
    # Skip the (ind - start + 1) items at or before the pivot.
    return _find_kth(weights, values, ind + 1, end, k - (ind - start) - 1)

def find_kth(weights, values, k):
    return _find_kth(weights, values, 0, len(weights) - 1, k)

def fractional_knapsack(weights, values, w):
    if w == 0 or len(weights) == 0:
        return 0
    if len(weights) == 1:
        # Take the whole item if it fits, otherwise only the fraction that fits.
        return values[0] * min(1.0, w / weights[0])
    mid = find_kth(weights, values, len(weights) // 2)
    w1 = sum(weights[mid + 1:])
    v1 = sum(values[mid + 1:])
    if w1 > w:
        return fractional_knapsack(weights[mid + 1:], values[mid + 1:], w)
    return v1 + fractional_knapsack(weights[:mid + 1], values[:mid + 1], w - w1)
(Editing and rewriting the answer after discussion with @Shasha99, since I feel answers before 2016-12-06 are a bit deceiving)
Summary
O(1) best case is possible if the items are already sorted. Otherwise best case is O(n).
Discussion
If the items are not sorted, you need to find the best item (for the case where one item already fills the knapsack), and that alone will take O(n), since you have to check all of them. Therefore, best case O(n).
On the opposite end, you could have a knapsack where all the items fit. Searching for best would not be needed, but you need to put all of them in, so it's still O(n).
More analysis
Funnily enough, an O(n) worst case does not require the items to be sorted.
Apparently the idea comes from http://algo2.iti.kit.edu/sanders/courses/algdat03/sol12.pdf, paired with a fast median selection algorithm (weighted medians or maybe median of medians?). Thanks to @Shasha99 for finding this algorithm.
Note that plain quickselect is O(n) expected, O(n*n) worst, but if you use median-of-medians that becomes O(n) worst case. The downside is quite a complicated algorithm.
I'd be interested in a working implementation of any algorithm. More sources to (hopefully simple) algorithms also wouldn't hurt.
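As one possible answer to that request, here is a minimal sketch of the selection idea in Python. It is not the algorithm from the linked PDF: it swaps in a random pivot with a three-way split, so it runs in expected O(n) time rather than worst-case O(n). The function name `fractional_knapsack_by_selection` and the `(weight, value)` tuple layout are assumptions made for illustration, and all weights are assumed to be positive.

```python
import random

def fractional_knapsack_by_selection(items, capacity):
    """items: list of (weight, value) pairs with weight > 0."""
    items = list(items)
    total = 0.0
    while items and capacity > 0:
        pivot_weight, pivot_value = random.choice(items)
        ratio = pivot_value / pivot_weight
        # Three-way split around the pivot's value per unit of weight.
        high = [it for it in items if it[1] / it[0] > ratio]
        equal = [it for it in items if it[1] / it[0] == ratio]
        low = [it for it in items if it[1] / it[0] < ratio]
        w_high = sum(w for w, _ in high)
        if w_high > capacity:
            # The better items alone overflow the knapsack: recurse into them.
            items = high
            continue
        # All better items fit, so take them completely.
        total += sum(v for _, v in high)
        capacity -= w_high
        w_equal = sum(w for w, _ in equal)
        if w_equal >= capacity:
            # Fill the remaining capacity with items at the pivot ratio.
            return total + capacity * ratio
        total += sum(v for _, v in equal)
        capacity -= w_equal
        items = low
    return total
```

Every round discards at least the pivot's group, and a random pivot halves the remaining list in expectation, which is where the expected linear bound comes from. For example, items `[(10, 60), (20, 100), (30, 120)]` with capacity 50 take the first two items whole and two thirds of the last one.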

Calculating bigO runtime with 2D values, where one dimension has unknown length

I was working on the water-collection between towers problem, and trying to calculate the bigO of my solution for practice.
At one point, I build a 2D array of 'towers' from the user's input array of heights. This step uses a nested for loop, where the inner loop runs height many times.
Is my BigO for this step then n * maxHeight?
I've never seen any sort of BigO that used a variable like this, but then again I'm pretty new so that could be an issue of experience.
I don't feel like the height issue can be written off as a constant, because there's no reason that the height of the towers wouldn't exceed the number of towers on a regular basis.
//convert towerArray into 2D array representing towers
var multiTowerArray = [];
for (var i = 0; i < towerArray.length; i++) {
    //towerArray is the user-input array of tower heights
    multiTowerArray.push([]);
    for (var j = 0; j < towerArray[i]; j++) {
        multiTowerArray[i].push(1);
    }
}
For starters, it's totally reasonable - and not that uncommon - to give the big-O runtime of a piece of code both in terms of the number of elements in the input as well as the size of the elements in the input. For example, counting sort runs in time O(n + U), where n is the number of elements in the input array and U is the maximum value in the array. So yes, you absolutely could say that the runtime of your code is O(nU), where n is the number of elements and U is the maximum value anywhere in the array.
Another option would be to say that the runtime of your code is O(n + S), where S is the sum of all the elements in the array, since the aggregate number of times that the inner loop runs is equal to the sum of the array elements.
Generally speaking, you can express the runtime of an algorithm in terms of whatever quantities you'd like. Many graph algorithms have a runtime that depends on both number of nodes (often denoted n) and the number of edges (often denoted m), such as Dijkstra's algorithm, which can be made to run in time O(m + n log n) using a Fibonacci heap. Some algorithms have a runtime that depends on the size of the output (for example, the Aho-Corasick string matching algorithm runs in time O(m + n + z), where m and n are the lengths of the input strings and z is the number of matches). Some algorithms depend on a number of other parameters - as an example, the count-min sketch performs updates in time O(ε^{-1} log δ^{-1}), where ε and δ are parameters specified when the algorithm starts.
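To make the O(n + S) and O(nU) bounds concrete, here is a small Python sketch that mirrors the tower-building loop from the question and counts how often the inner loop body runs. The function name `build_towers` and the step counter are additions for illustration only.

```python
def build_towers(heights):
    """Build the 2D tower array and count inner-loop iterations."""
    towers = []
    inner_steps = 0
    for h in heights:            # runs n times
        column = []
        for _ in range(h):       # runs h times for this tower
            column.append(1)
            inner_steps += 1
        towers.append(column)
    return towers, inner_steps
```

For heights `[3, 0, 2, 5]` the inner loop body runs 10 times, which is exactly the sum S of the heights and at most n * U = 4 * 5 = 20.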

Greedy Attempt for covering all the numbers with the given intervals

Let S be a set of intervals (containing n number of intervals) of the natural numbers that might overlap and N be a list of numbers (containing n number of numbers).
I want to find the smallest subset (let's call it P) of S such that for each number in our list N, there exists at least one interval in P that contains it. The intervals in P are allowed to overlap.
Trivial example:
S = {[1..4], [2..7], [3..5], [8..15], [9..13]}
N = [1, 4, 5]
// so P = {[1..4], [2..7]}
I think a dynamic programming algorithm might not always work, so if anybody knows of a solution to this problem (or a similar one that it can be converted into), that would be great. I am trying to make an O(n^2) solution.
Here is one greedy approach
P = {}
for each q in N:        // O(n)
    if q in P           // O(n)
        continue
    for each i in S     // O(n)
        if q in i:      // O(n)
            P.add(i)
            break
But that is O(n^4).. Any help with creating a greedy approach that is O(n^2) would be great!
Thanks!
Update: I've been slamming at this problem and I think I have an O(n^2) solution!!
Let me know if you think I'm right!!!
N = MergeSort(N)
lower, upper = infinity, -infinity    // nothing is covered yet
P = empty set
for each q in N do
    if (q >= lower and q <= upper) = False
        max_interval = none
        for each r in S do
            if q in r then
                if max_interval = none or r.rightEndPoint > max_interval.rightEndPoint
                    max_interval = r
        P.append(max_interval)
        lower = max_interval.leftEndPoint
        upper = max_interval.rightEndPoint
        S.remove(max_interval)
I think this should work!! I'm trying to find a counterexample; but yeah!!
This problem is similar to the set cover problem, which is NP-complete (i.e., no polynomial-time algorithm is known for it). What makes it different is that intervals always cover adjacent elements (not an arbitrary subset of N), which opens the way for faster solutions.
http://en.wikipedia.org/wiki/Set_cover_problem
I think that the solution proposed by Mike is good enough. But I think I have a quite straightforward O(N^2) greedy algorithm. It starts like Mike's (moreover, I believe Mike's solution can also be improved in a similar way):
You sort your N numbers and place them, sorted, into the array ELEM. Complexity: O(N*lg N).
Using binary search, for each interval S[i] you identify the starting and ending indices of the elements in ELEM that are covered by S[i]. Say you place this pair of numbers into the array COVER. The difference between the two indices tells you how many elements you cover; for simplicity, let us place it in the array COVER_COUNT. Complexity: O(N*lg N).
You introduce an index pointer p that shows up to which element in ELEM your N is already covered. You set p = 0, meaning that all elements up to the 0-th (excluded) are initially covered (i.e., no elements). Complexity: O(1). Moreover, you introduce a boolean array IS_INCLUDED that reflects whether interval S[i] is already included in your coverage set. Complexity: O(N).
Then you start from the 0-th element in ELEM and find the interval that contains ELEM[0] and has the greatest coverage COVER_COUNT[i]. Say it is the i-th interval. You mark it as included by setting IS_INCLUDED[i] to true. Then you set p to end[i] + 1, where end[i] is the ending index in the COVER[i] pair (indeed, all elements up to end[i] are now covered). Then, knowing p, you update all elements in COVER_COUNT so that they reflect how many not-yet-covered elements each interval covers (this can easily be done in O(N) time). Then you perform the same step for ELEM[p], and continue until p >= ELEM.length. It can be observed that the overall complexity is O(N^2).
You finish in O(N^2), and IS_INCLUDED holds true for the intervals of S included in the optimal cover set.
Let me know if this solution seems reasonable to you and if I calculated everything well.
P.S. Just wanted to add that the optimality of the solution found by the algorithm can be proved by induction and contradiction. By contradiction, it is easy to show that at least one optimal solution includes the longest interval of those covering element ELEM[0]. If so, by induction we can show that for each next element in the algorithm, we can keep following the strategy of selecting the interval that covers the leftmost not-yet-covered element and is the longest with respect to the number of remaining elements covered.
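The greedy strategy from this thread can be sketched in a few lines of Python. This is a simplified O(n^2) variant, not the exact COVER/COVER_COUNT bookkeeping described above: for the leftmost uncovered number it scans all intervals and keeps the one that reaches furthest to the right. The function name `greedy_point_cover` and the `(left, right)` tuple representation of closed intervals are assumptions for illustration.

```python
def greedy_point_cover(intervals, points):
    """Return a smallest list of closed intervals covering all points, or None."""
    chosen = []
    covered_up_to = float("-inf")
    for q in sorted(points):
        if q <= covered_up_to:
            continue  # q lies inside the most recently chosen interval
        # Among the intervals containing q, pick the one reaching furthest right.
        candidates = [iv for iv in intervals if iv[0] <= q <= iv[1]]
        if not candidates:
            return None  # no interval contains q, so no cover exists
        best = max(candidates, key=lambda iv: iv[1])
        chosen.append(best)
        covered_up_to = best[1]
    return chosen
```

On the example from the question, `greedy_point_cover([(1, 4), (2, 7), (3, 5), (8, 15), (9, 13)], [1, 4, 5])` selects `(1, 4)` and then `(2, 7)`.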
I am not sure, but maybe something like this.
1) For each interval S[i], create a list of the elements of N that are contained in that interval; call it Q[i]. This takes O(n^2).
2) Then sort S by the length of Q[i], O(n*lg(n)).
3) Go through this array, excluding the elements of Q[i] from N, O(n), and from Q[i+1]...Q[n], O(n^2).
4) Repeat from step 2 while N is not empty.
It's not O(n^2), it's O(n^3), but if you can use a hashmap, I think you can improve this.

Maximum non-overlapping intervals in a interval tree

Given a list of intervals of time, I need to find the set of maximum non-overlapping intervals.
For example,
if we have the following intervals:
[0600, 0830], [0800, 0900], [0900, 1100], [0900, 1130],
[1030, 1400], [1230, 1400]
Also it is given that time have to be in the range [0000, 2400].
The maximum non-overlapping set of intervals is [0600, 0830], [0900, 1130], [1230, 1400].
I understand that maximum set packing is NP-Complete. I want to confirm if my problem (with intervals containing only start and end time) is also NP-Complete.
If so, is there a way to find an optimal solution in exponential time, but with smarter preprocessing and pruning of the data? Or is there a relatively easy-to-implement fixed-parameter tractable algorithm? I don't want to go for an approximation algorithm.
This is not an NP-complete problem. I can think of an O(n * log(n)) algorithm using dynamic programming to solve this problem.
Suppose we have n intervals. Suppose the given range is S (in your case, S = [0000, 2400]). Either suppose all intervals are within S, or eliminate all intervals not within S in linear time.
Pre-process:
Sort all intervals by their begin points. Suppose we get an array A[n] of n intervals.
This step takes O(n * log(n)) time
For the end point of each interval, find the index of the first interval in A whose begin point is at or after that end point. Suppose we get an array Next[n] of n integers.
If no such begin point exists for the end point of interval i, we may assign n to Next[i].
We can do this in O(n * log(n)) time by enumerating the n end points of all intervals and using a binary search to find each answer. Maybe a linear approach exists for this, but it doesn't matter, because the previous step already takes O(n * log(n)) time.
DP:
Let f[i] be the maximum number of non-overlapping intervals in the range [A[i].begin, S.end]. Then f[0] is the answer we want.
Also let f[n] = 0.
State transition equation:
f[i] = max{f[i+1], 1 + f[Next[i]]}
It is quite obvious that the DP step takes linear time.
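A minimal Python sketch of this DP, using `bisect` for the binary-search preprocessing. It assumes intervals are `(begin, end)` tuples with begin < end, and that an interval may start exactly where another ends. The function name `max_non_overlapping` is made up for illustration.

```python
from bisect import bisect_left

def max_non_overlapping(intervals):
    """Return the maximum number of pairwise non-overlapping intervals."""
    a = sorted(intervals)                 # sort by begin point
    n = len(a)
    begins = [begin for begin, _ in a]
    # nxt[i]: index of the first interval starting at or after a[i] ends (n if none).
    nxt = [bisect_left(begins, end) for _, end in a]
    f = [0] * (n + 1)                     # f[n] = 0 is the base case
    for i in range(n - 1, -1, -1):
        # Either skip interval i, or take it and jump to the next compatible one.
        f[i] = max(f[i + 1], 1 + f[nxt[i]])
    return f[0]
```

With the times from the question written as integers, `max_non_overlapping([(600, 830), (800, 900), (900, 1100), (900, 1130), (1030, 1400), (1230, 1400)])` evaluates to 3.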
The above solution is the one I came up with at first glance. After that, I also thought of a greedy approach, which is simpler (but not faster in the sense of big-O notation):
(With the same notation and assumptions as the DP approach above)
Pre-process: Sort all intervals by their end points. Suppose we get an array B[n] of n intervals.
Greedy:
int ans = 0, cursor = S.begin;
for(int i = 0; i < n; i++){
    if(B[i].begin >= cursor){
        ans++;
        cursor = B[i].end;
    }
}
The two solutions above are my own, but your problem is also known as the activity selection problem, which can be found on Wikipedia: http://en.wikipedia.org/wiki/Activity_selection_problem.
Also, Introduction to Algorithms discusses this problem in depth in 16.1.

Dynamic Programming - complexity

I have a homework problem that I have been trying to figure out for some time now, and I can't figure it out for the life of me.
I have a sheet of size X*Y, and a set of patterns of lesser sizes, with price values associated with them. I can cut the sheet either horizontally or vertically, and I have to find the optimized cutting pattern to get the greatest profit from the sheet.
As far as I can tell there should be (X*Y)(X+Y+#ofPatterns) recursive operations. The complexity is supposed to be exponential. Can someone please explain why?
The pseudo-code I have come up with is as follows:
Optimize( w, h ) {
    best_price = 0
    for (Pattern p : all patterns) {
        if ( p fits into this piece of cloth && p's price > best_price ) { best_price = p's price }
    }
    for (i = 1…w-1) {
        L_price = Optimize( i, h );
        R_price = Optimize( w-i, h );
        if (L_price + R_price > best_price) { update best_price }
    }
    for (i = 1…h-1) {
        T_price = Optimize( w, i );
        B_price = Optimize( w, h-i );
        if (T_price + B_price > best_price) { update best_price }
    }
    return best_price;
}
The recursive case is exponential because at the start you can choose to cut your paper at any position from 0 to the max width or from 0 to the max height, and then optionally cut the remaining pieces (recurse).
This problem sounds like a slightly more interesting case of the rod cutting problem, since it involves two dimensions.
http://www.radford.edu/~nokie/classes/360/dp-rod-cutting.html
is a good guide. Reading that should put you on the right track without blatantly answering your homework.
The relevant portion to why it is exponential when recursing:
This recursive algorithm uses the formula above and is slow
Code
-- price array p, length n
Cut-Rod(p, n)
    if n = 0 then
        return 0
    end if
    q := MinInt
    for i in 1 .. n loop
        q := max(q, p(i) + Cut-Rod(p, n-i))
    end loop
    return q
Recursion tree (shows subproblems): 4/[3,2,1,0]//[2,1,0],[1,0],0//[1,0],0,0//0
Performance: Let T(n) = number of calls to Cut-Rod(x, n), for any x
T(0) = 1
T(n) = 1 + \sum_{i=1}^{n} T(n-i) = 1 + \sum_{j=0}^{n-1} T(j)
Solution: T(n) = 2^n
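The 2^n call count can be checked empirically. Below is a small Python transcription of the naive Cut-Rod recursion with a call counter added. The helper name `count_cut_rod_calls` and the one-element list used as a counter are illustration devices, and `prices[i]` is assumed to hold the price of a piece of length i (index 0 is unused).

```python
def cut_rod(prices, n, counter):
    """Naive recursive rod cutting; counter[0] tracks the number of calls."""
    counter[0] += 1
    if n == 0:
        return 0
    best = float("-inf")
    for i in range(1, n + 1):
        best = max(best, prices[i] + cut_rod(prices, n - i, counter))
    return best

def count_cut_rod_calls(prices, n):
    """Return (best revenue, total number of cut_rod calls) for a rod of length n."""
    counter = [0]
    best = cut_rod(prices, n, counter)
    return best, counter[0]
```

Counting every call, including the ones with n = 0, `count_cut_rod_calls([0, 1, 5, 8, 9], 4)` gives a best revenue of 10 after 16 = 2^4 calls.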
When calculating the complexity of a dynamic programming algorithm, we can decompose it into two subproblems: one is calculating the number of substates; and the other is calculating the time complexity of solving a particular subproblem.
But it's true that without memoization, an algorithm that is polynomial by nature becomes exponential, since you are not reusing information that you've previously calculated. (I'm pretty sure you understand this part from your dynamic programming course.)
No matter whether you solve a dynamic programming problem using the memoization method or the bottom-up approach, the time complexity stays the same. I think the trouble you are having is that you are trying to draw the function call graph in your head. Instead, let's try to estimate the number of function calls this way.
You are saying that there are (X*Y)(X+Y+#ofPatterns) recursive calls.
Well, yes and no.
It's true that when you use memoization, there are only this many recursive calls. Once you have called and calculated a certain Optimize(w0,h0), the value is stored, and the next time another call Optimize(w1,h1) needs Optimize(w0,h0), it won't do this redundant work again. That's what makes the time complexity polynomial.
But in your current implementation, one subproblem Optimize(w0,h0) gets many redundant function calls, which means the number of recursive calls in your algorithm is not polynomial at all (for a simple example, try to draw the call graph of the recursive Fibonacci number algorithm).
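To illustrate the difference, here is a hedged Python sketch of a memoized version of the sheet-cutting recursion. It assumes patterns are given as a dict mapping `(width, height)` to a price, that a pattern "fits" when both of its dimensions are at most the piece's, and that patterns are not rotated. None of these details are stated in the question, and the name `best_price` is made up.

```python
from functools import lru_cache

def best_price(width, height, patterns):
    """Best total price obtainable from a width x height sheet."""

    @lru_cache(maxsize=None)
    def optimize(w, h):
        # Best single pattern that fits into this piece (0 if none fits).
        best = max(
            (price for (pw, ph), price in patterns.items() if pw <= w and ph <= h),
            default=0,
        )
        for i in range(1, w):      # cut into pieces of width i and w - i
            best = max(best, optimize(i, h) + optimize(w - i, h))
        for i in range(1, h):      # cut into pieces of height i and h - i
            best = max(best, optimize(w, i) + optimize(w, h - i))
        return best

    return optimize(width, height)
```

Thanks to `lru_cache`, each `(w, h)` pair is computed once, so there are at most X*Y distinct subproblems, each doing O(X + Y + #ofPatterns) work. Removing the decorator gives back the exponential behaviour described above.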

Resources