Let's say I have a huge array of doubles w[] indexed from 0 to n-1.
I also have a list of m subsets of [0; n-1]. For each subset S, I am trying to compute the sum of w[i] over all i in S.
Obviously I can compute this separately for each subset, which takes O(m * n) overall.
However, is there any faster way to do this? I'm asking from a practical standpoint, as I think you can't do better asymptotically. Is it possible to pre-process the subsets and store them in such a way that computing all the sums is faster?
Thanks!
Edit:
To give an order of magnitude, my n would be around 20 million, and m around 200.
For subsets that are dense (or nearly dense) you may be able to speed up the computation by computing a running sum of the elements. That is, create another array in parallel with w, where each element in the parallel array contains the sum of the elements of w up to that point.
To compute the sum for a dense subset, you take the starting and ending positions of its range in the parallel array, and subtract the running sum at the start from the running sum at the end. The difference between the two is (ignoring rounding errors) the sum for that subset.
For a nearly dense subset, you start by doing the same, then subtract off the values of the (relatively few) items in that range that aren't part of the set.
These may not produce exactly the same result as you'd get by naively summing the subset though. If you need better accuracy, you'd probably want to use Kahan summation for your array of running sums, and possibly preserve its error residual at each point, to be taken into account when doing the subtraction.
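Here's a minimal sketch of the running-sum idea in Python, assuming NumPy, and assuming each dense subset is described by an inclusive index range plus a short list of excluded indices (the names lo, hi, holes are just illustrative):

```python
import numpy as np

w = np.random.rand(20_000_000)                   # stand-in for the real data
prefix = np.concatenate(([0.0], np.cumsum(w)))   # prefix[i] = w[0] + ... + w[i-1]

def range_sum(lo, hi):
    """Sum of w[lo..hi] inclusive, in O(1)."""
    return prefix[hi + 1] - prefix[lo]

def nearly_dense_sum(lo, hi, holes):
    """Sum over [lo, hi] minus the few excluded indices in `holes`."""
    return range_sum(lo, hi) - w[np.asarray(holes, dtype=np.intp)].sum()
```

Building the prefix array is a one-time O(n) pass; each dense query is then O(1), and each nearly dense query costs O(1 + number of holes).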
This is from a practice problem in one of Coursera's Algorithms courses; I've been stuck for a couple of weeks.
The problem is this:
Given an array of n distinct unsorted elements x1, x2, ..., xn ∈ X with positive weights w1, w2, ..., wn ∈ W, a weighted median is an element xk for which the total weight of all elements with values less than xk is at most (total weight)/2, and the total weight of all elements with values larger than xk is also at most (total weight)/2. Observe that there are at most two weighted medians. Show how to compute all weighted medians in O(n) worst-case time.
The course mostly covered divide and conquer algorithms, so I think the key to get started on this would be to identify which of the algorithms covered can be used for this problem.
One of the algorithms covered was the RSelect algorithm in the form RSelect(array X, length n, order statistic i) which for a weighted median could be written as RSelect(array X, weights W, length n, order statistic i). My issue with this approach is that it assumes I know the median value ahead of time, which seems unlikely. There's also the issue that the pivot is chosen uniformly at random, which I don't imagine is likely to work with weights without computing every weight for every entry.
Next is the DSelect algorithm, where, using a median-of-medians approach, a pivot can be computed without randomization so we can compute a proper median. This seems like the approach that could work; where I have trouble is that it also assumes I know ahead of time the value I'm looking for.
DSelect(array A, length n, order statistic i) for an unweighted array
DSelect(array A, weights W, length n, order statistic i) for a weighted array
Am I overthinking this? Should I use DSelect assuming that I know the value of (total weight) / 2 ahead of time? I guess even if I compute it, it would add only linear time to the running time. But then it would be no different from precomputing a weighted array (combine A, W into Q where qi = xi*wi) and transforming this back into an unweighted array problem where I can use RSelect (plus some accounting for cases where there are two medians).
I've found https://archive.org/details/lineartimealgori00blei/page/n3 and https://blog.nelsonliu.me/2016/07/05/gsoc-week-6-efficient-calculation-of-weighted-medians/ which describe this problem, but their approach doesn't seem to be something covered in the course (and I'm not familiar with heaps/heapsort)
This problem can be solved with a simple variant of quickselect:
1. Calculate the sum of all weights and divide by 2 to get the target sum.
2. Choose a pivot and partition the array into larger and smaller elements.
3. Sum the weights in the smaller partition, and subtract from the total to get the sum in the other partition.
4. Go back to step 2 to process the appropriate partition with the appropriate target sum.
Just like normal quickselect, this becomes linear in the worst case if you use the (normal, unweighted) median-of-medians approach to choose a pivot.
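A rough Python sketch of this variant with a random pivot (the function and variable names are illustrative; it returns one weighted median and assumes distinct values with positive weights):

```python
import random

def weighted_median(xs, ws):
    items = list(zip(xs, ws))
    half = sum(ws) / 2.0
    w_left = 0.0     # total weight strictly below the current sub-list
    w_right = 0.0    # total weight strictly above the current sub-list
    while True:
        pivot, pivot_w = random.choice(items)
        smaller = [(x, w) for x, w in items if x < pivot]
        larger = [(x, w) for x, w in items if x > pivot]
        w_small = sum(w for _, w in smaller)
        w_large = sum(w for _, w in larger)
        if w_left + w_small <= half and w_right + w_large <= half:
            return pivot                      # weighted-median condition holds
        if w_left + w_small > half:
            items = smaller                   # the median lies among the smaller elements
            w_right += w_large + pivot_w
        else:
            items = larger                    # the median lies among the larger elements
            w_left += w_small + pivot_w
```

With a random pivot this runs in expected linear time; swapping in a median-of-medians pivot makes it worst-case linear, as noted above.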
Average-case linear performance can be achieved with Quickselect.
The random pivot can be chosen, with weighting, using a reservoir sampling algorithm. You are correct that it is O(n) to find the first pivot, but the sizes of the lists you're working with follow a geometric series, so the total cost of finding pivots still works out to be only O(n).
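For illustration, one common way to draw a single weighted random pivot in one pass is the Efraimidis–Spirakis reservoir scheme; a minimal sketch (names are illustrative):

```python
import random

def weighted_random_pivot(items, weights):
    # Keep the item with the largest key u**(1/w); this selects each item
    # with probability proportional to its weight, in a single pass.
    best_key, best_item = -1.0, None
    for x, w in zip(items, weights):
        key = random.random() ** (1.0 / w)
        if key > best_key:
            best_key, best_item = key, x
    return best_item
```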
We have M unique integers between 1 and N. In real life, N is a few million, and M is between N/10 and N/3. I need to compute the distribution of pairwise distances between the M integers.
The brute-force complexity of the problem is O(M^2), but the output is just N numbers. So the natural question is whether there is a faster algorithm. Even an algorithm as fast as O(N * sqrt(M)) would be sufficient for our purposes.
The problem appeared as a subset of the following problem. We have a large virtual square symmetric matrix, a few million by a few million elements. Some rows and columns of the matrix are masked out. We need to find how many masked-out elements are in each diagonal of the matrix. One can easily calculate how many masked-out bins intersect each diagonal. But often a masked-out row and column intersect right on the diagonal, masking out only one bin. To avoid double-counting these, we need the pairwise distribution of distances between the masked-out columns.
You can do this in O(N log N) using the Fourier transform.
The idea is that you first compute a histogram H(x) of your M integers where H(x) is the number of times the value x appears in your input (which will be either 0 or 1 if all M are distinct - but this is not essential).
Then what you want to compute is A(d), where A(d) is defined as the number of pairs of integers that are exactly d apart.
This can be computed as A(d) = sum(H(x)*H(x+d) for all x)
This type of sum is a correlation (equivalently, a convolution of H with its reverse) and can be computed efficiently by taking the Fourier transform, multiplying the result by its complex conjugate, and then computing the inverse transform. Care needs to be taken to pad the transforms appropriately.
If you use Python, this is particularly easy as you can call scipy.signal.fftconvolve to do this operation.
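For example, a short sketch along those lines, assuming the M values are distinct integers in [1, N] (the names are illustrative):

```python
import numpy as np
from scipy.signal import fftconvolve

def distance_histogram(values, n):
    """Return A(1), ..., A(n): the number of pairs exactly d apart."""
    h = np.zeros(n + 1)
    h[np.asarray(values)] = 1.0          # the histogram H(x)
    c = fftconvolve(h, h[::-1])          # c[n - d] = sum over x of H(x) * H(x + d)
    return np.rint(c[n - 1::-1]).astype(np.int64)
```

The np.rint call rounds away the small floating-point noise the FFT introduces; the zero-distance count A(0) (each value paired with itself) sits at c[n] if you need it.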
I have a specific sub-problem for which I am having trouble coming up with an optimal solution. This problem is similar to the subset sum group of problems as well as space filling problems, but I have not seen this specific problem posed anywhere. I don't necessarily need the optimal solution (as I am relatively certain it is NP-hard), but an effective and fast approximation would certainly suffice.
Problem: Given a list of positive integers, find the smallest number of disjoint subsets that together contain the entire list, where each subset sums to less than N. Obviously no integer in the original list can be greater than N.
In my application I have many lists, and I can concatenate them into columns of a matrix as long as they fit in the matrix together. For downstream purposes I would like to have as little "wasted" space as possible in the resulting ragged matrix, hence the similarity to space-filling problems.
Thus far I am employing a greedy-like approach, processing from the largest integers down and finding the largest integer that fits into the current subset under the limit N. Once the smallest integer no longer fits into the current subset, I proceed to the next subset similarly until all numbers are exhausted. This almost certainly does not find the optimal solution, but was the best I could come up with quickly.
BONUS: My application actually requires batches, where there is a limit on the number of subsets in each batch (M). Thus the larger problem is to find the fewest batches where each batch contains at most M subsets and each subset sums to less than N.
Straight from Wikipedia (with some bracketed amendments):
In the bin packing problem, objects [integers] of different volumes [values] must be packed into a finite number of bins [sets] or containers, each of volume V [summation of the subset < V], in a way that minimizes the number of bins [sets] used. In computational complexity theory, it is a combinatorial NP-hard problem.
https://en.wikipedia.org/wiki/Bin_packing_problem
As far as I can tell, this is exactly what you are looking for.
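If an exact solution isn't required, the classic first-fit-decreasing heuristic (a close cousin of the greedy approach described in the question) is a reasonable starting point. A minimal sketch, treating the limit as inclusive (pass N - 1 for a strict bound):

```python
def first_fit_decreasing(values, capacity):
    """Pack values into bins using the first-fit-decreasing heuristic.

    Uses at most roughly 11/9 of the optimal number of bins.
    """
    bins = []                                  # each bin: [remaining_capacity, items]
    for v in sorted(values, reverse=True):     # largest items first
        for b in bins:
            if b[0] >= v:                      # first bin it still fits in
                b[0] -= v
                b[1].append(v)
                break
        else:
            bins.append([capacity - v, [v]])   # open a new bin
    return [items for _, items in bins]
```

For the batched bonus variant, one simple option is to run this once and then group the resulting subsets M at a time.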
Is there a data structure representing a large set S of (64-bit) integers, that starts out empty and supports the following two operations:
insert(s) inserts the number s into S;
minmod(m) returns the number s in S such that s mod m is minimal.
An example:
insert(11)
insert(15)
minmod(7) -> the answer is 15 (which mod 7 = 1)
insert(14)
minmod(7) -> the answer is 14 (which mod 7 = 0)
minmod(10) -> the answer is 11 (which mod 10 = 1)
I am interested in minimizing the maximal total time spent on a sequence of n such operations. It is obviously possible to just maintain a list of elements for S and iterate through them for every minmod operation; then insert is O(1) and minmod is O(|S|), which would take O(n^2) time for n operations (e.g., n/2 insert operations followed by n/2 minmod operations would take roughly n^2/4 operations).
So: is it possible to do better than O(n^2) for a sequence of n operations? Maybe O(n sqrt(n)) or O(n log(n))? If this is possible, then I would also be interested to know if there are data structures that additionally admit removing single elements from S, or removing all numbers within an interval.
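For concreteness, here is a minimal sketch of that naive baseline (names are illustrative):

```python
class NaiveMinMod:
    """O(1) insert, O(|S|) minmod: scan every element for the smallest residue."""

    def __init__(self):
        self.elems = []

    def insert(self, s):
        self.elems.append(s)

    def minmod(self, m):
        # Assumes at least one element has been inserted.
        return min(self.elems, key=lambda s: s % m)
```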
Another idea, based on a balanced binary search tree as in Keith's answer.
Suppose all elements inserted so far are stored in a balanced BST, and we need to compute minmod(m). Consider our set S as a union of subsets of numbers lying in the intervals [0, m-1], [m, 2m-1], [2m, 3m-1], etc. The answer will obviously be among the minimal numbers we have in each of those intervals. So we can look up the tree for the minimal number of each interval in turn. That's easy to do: for example, if we need to find the minimal number in [a, b], we move left if the current value is greater than a, and right otherwise, keeping track of the minimal value in [a, b] we've met so far.
Now, if we suppose that m is uniformly distributed in [1, 2^64], let's calculate the expected number of interval queries we'll need.
For all m in [2^63, 2^64-1] we'll need 2 queries. The probability of this is 1/2.
For all m in [2^62, 2^63-1] we'll need 4 queries. The probability of this is 1/4.
...
The expectation is therefore the sum over k in [1, 64] of (1/2^k) * 2^k, which is 64 queries.
So, to sum up, the average minmod(m) query complexity will be O(64 * log n). In general, if m has an unknown upper bound, this will be O(log m * log n). The BST update is, as is well known, O(log n), so the overall complexity for n queries will be O(n * log m * log n).
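A sketch of that interval walk in Python, using the third-party sortedcontainers package as a stand-in for a balanced BST (the class and names are illustrative):

```python
from sortedcontainers import SortedList   # balanced-BST-like sorted structure

class MinMod:
    def __init__(self):
        self.s = SortedList()

    def insert(self, x):
        self.s.add(x)                      # O(log n)

    def minmod(self, m):
        # Walk the intervals [k*m, (k+1)*m - 1]; the minimum of each interval is
        # the smallest element >= k*m, found with one bisect. Assumes S is non-empty.
        best_res, best_val = m, None
        lo, top = 0, self.s[-1]
        while lo <= top:
            i = self.s.bisect_left(lo)     # first element >= lo
            if i == len(self.s):
                break
            x = self.s[i]
            r = x % m
            if r < best_res:
                best_res, best_val = r, x
                if r == 0:
                    break                  # can't do better than residue 0
            lo = (x // m + 1) * m          # jump past x's interval (skips empty ones)
        return best_val
```

Usage matches the example in the question: after insert(11) and insert(15), minmod(7) returns 15.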
Partial answer too big for a comment.
Suppose you implement S as a balanced binary search tree.
When you seek S.minmod(m), naively you walk the whole tree, so the cost is O(n) per query (hence the O(n^2) total mentioned in the question).
However, at a given time during the walk, you have the best (lowest) result so far. You can use this to avoid checking whole sub-trees when:
bestSoFar < leftChild mod m
and
rightChild - leftChild < m - leftChild mod m
This will only help much if a common spacing between the numbers in the set is smaller than common values of m.
Update the next morning...
Grigor has articulated my idea better and more fully, and shown how it works well for "large" m. He also shows how a "random" m is typically "large", so works well.
Grigor's algorithm is so efficient for large m that one needs to think about the risk for much smaller m.
So it is clear that you need to think about the distribution of m and optimise for different cases if need be.
For example, it might be worth simply keeping track of the minimal modulus for very small m.
But suppose m ~ 2^32? Then the search algorithm (certainly as given but also otherwise) needs to check 2^32 intervals, which may amount to searching the whole set anyway.
Given an unsorted integer array, and without making any assumptions on the numbers in the array:
Is it possible to find two numbers whose difference is minimum in O(n) time?
Edit: Difference between two numbers a, b is defined as abs(a-b)
Find smallest and largest element in the list. The difference smallest-largest will be minimum.
If you're looking for the nonnegative difference, then this is of course at least as hard as checking whether the array has two equal elements. This is called the element uniqueness problem and, without any additional assumptions (like limiting the size of the integers, or allowing operations other than comparison), requires Ω(n log n) time. It is the 1-dimensional case of finding the closest pair of points.
I don't think you can do it in O(n). The best I can come up with off the top of my head is to sort them (which is O(n log n)) and find the minimum difference of adjacent pairs in the sorted list (which adds another O(n)).
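A tiny sketch of that approach (assumes at least two numbers):

```python
def min_difference(nums):
    """O(n log n): sort, then take the smallest gap between neighbours."""
    s = sorted(nums)
    return min(b - a for a, b in zip(s, s[1:]))
```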
I think it is possible. The secret is that you don't actually have to sort the list, you just need to create a tally of which numbers exist. This may count as "making an assumption" from an algorithmic perspective, but not from a practical perspective. We know the ints are bounded by a min and a max.
So, create an array of 2-bit elements, one pair of bits for each int from INT_MIN to INT_MAX inclusive, and set all of them to 00.
Iterate through the entire list of numbers. For each number in the list, if the corresponding 2 bits are 00, set them to 01. If they're 01, set them to 10. Otherwise ignore it. This is obviously O(n).
Next, if any of the 2-bit entries is set to 10, that is your answer: the minimum distance is 0 because the list contains a repeated number. If not, scan through the array and find the minimum distance between entries that are set. Many people have already pointed out there are simple O(n) algorithms for this.
So O(n) + O(n) = O(n).
Edit: responding to comments.
Interesting points. I think you could achieve the same results without making any assumptions by finding the min/max of the list first and using a sparse array ranging from min to max to hold the data. That takes care of the INT_MIN/INT_MAX assumption, the space complexity, and the O(m) time complexity of scanning the array.
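A sketch of that min/max-bounded variant, which runs in O(n + range) time and space (names are illustrative; assumes at least two numbers):

```python
def min_difference_tally(nums):
    lo, hi = min(nums), max(nums)
    seen = [0] * (hi - lo + 1)          # 0 = absent, 1 = seen once, 2 = repeated
    for x in nums:
        if seen[x - lo] < 2:
            seen[x - lo] += 1
    if 2 in seen:
        return 0                         # a repeated value means distance 0
    best, prev = hi - lo, None
    for i, c in enumerate(seen):
        if c:
            if prev is not None:
                best = min(best, i - prev)
            prev = i
    return best
```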
The best I can think of is to counting sort the array (possibly combining equal values) and then do the sorted comparisons -- counting sort is O(n + M) (M being the range of values). This has a heavy memory requirement, however. Some form of bucket or radix sort would be intermediate in time and more efficient in space.
Sort the list with radixsort (which is O(n) for integers), then iterate and keep track of the smallest distance so far.
(I assume your integers are a fixed-width type. If they can hold arbitrarily large mathematical integers, radix sort will be O(n log n) as well.)
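A minimal sketch of that, assuming non-negative 32-bit integers (the helper names are illustrative):

```python
def radix_sort_u32(nums, bits_per_pass=8):
    """LSD radix sort over fixed-width non-negative integers."""
    mask = (1 << bits_per_pass) - 1
    for shift in range(0, 32, bits_per_pass):
        buckets = [[] for _ in range(mask + 1)]
        for x in nums:
            buckets[(x >> shift) & mask].append(x)
        nums = [x for bucket in buckets for x in bucket]
    return nums

def min_adjacent_gap(nums):
    s = radix_sort_u32(nums)
    return min(b - a for a, b in zip(s, s[1:]))
```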
It seems to be possible to sort an unbounded set of integers in O(n * sqrt(log log n)) time. After sorting, it is of course trivial to find the minimal difference in linear time.
But I can't think of any algorithm to make it faster than this.
No, not without making assumptions about the numbers/ordering.
It would be possible given a sorted list though.
I think the answer is no, and the proof is similar to the proof that you cannot sort faster than n lg n: you have to compare all of the elements, i.e. create a comparison tree, which implies an Ω(n lg n) lower bound.
EDIT. OK, if you really want to argue, then the question does not say whether it should be a Turing machine or not. With quantum computers, you can do it in linear time :)