Average number of intervals from an input in 0..N - algorithm

The question sprang up when examining the "Find the K missing numbers in this set supposed to cover [0..N]" question.
The author of the question asked for CS answers instead of equation-based answers, and his proposal was to sort the input and then iterate over it to list the K missing numbers.
While this seems fine to me, it also seems wasteful. Let's take an example:
N = 200
K = 2 (we will consider K << N)
missing elements: 53, 75
The "sorted" set can be represented as: [0, 52] U [54, 74] U [76, 200], which is way more compact than enumerating all values of the set (and allows to retrieve the missing numbers in O(K) operations, to be compared with O(N) if the set is sorted).
However this is the final result, but during the construction the list of intervals might be much larger, as we feed the elements one at a time....
Let us, therefore, introduce another variable: let I be the number of elements of the set that we fed to the structure so far. Then, we may at worst have: min((N-K)/2, I) intervals (I think...)
From which we deduce that the number of intervals reached during the construction is the maximum encountered for I in [0..N], the worst case being (N-K)/2 thus O(N).
I have however a gut feeling that if the input is random, instead of being specially crafted, we might get a much lower bound... and thus the always so tricky question:
How much intervals... in average ?

Your approach vs. the proposed one with sorting seems to be a classical trade-off of which operation is cheap and which one is expensive.
I find your notation a bit confusing, so please allow me to use my own:
Let S be the set. Let n be the number of items in the set: n = |S|. Let max be the biggest number in the set: max = max(S). Let k be the number of elements not in the set: k = |{0,...,max} \ S|.
For the sorting solution, we could very cheaply insert all n elements into S using hashing. That would take expected O(n). Then, to find the k missing elements, we sort the set in O(n log n) and determine the missing elements in O(n).
That is, the overall cost for adding n elements and then finding the k missing elements is expected O(n) + O(n log n) + O(n) = O(n log n).
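For illustration, here is a minimal Python sketch of this hash-then-sort approach (the function and variable names are mine, not from the answer):
def missing_numbers(elements):
    s = set(elements)                  # n expected-O(1) hash inserts
    ordered = sorted(s)                # O(n log n)
    missing, prev = [], -1
    for x in ordered:                  # O(n) scan for the gaps
        missing.extend(range(prev + 1, x))
        prev = x
    return missing

print(missing_numbers([0, 2, 3, 6]))   # [1, 4, 5]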
You suggest a different approach in which we represent the set as a list of dense subsets of S. How would you implement such a data structure? I suggest a sorted tree (instead of a list) so that an insert becomes efficient. Because what do you have to do for an insert of a new element e? I think you have to:
1. Find the potential candidate subset(s) in the tree where e could be added.
2. If a subset already contains e, nothing has to be done.
3. If one subset contains e+1 and another subset contains e-1, merge the two subsets and add e to the result.
4. If a subset already contains e+1, but e-1 is not contained in S, add e to that subset.
5. If a subset already contains e-1, but e+1 is not contained in S, add e to that subset.
6. Otherwise, create a new subset holding only the element e and insert it into the tree.
We can expect that finding the subsets needed for the above operations takes O(log n). The operations in steps 4 and 5 take constant time if we represent the subsets as pairs of integers (we just have to decrement the lower or increment the upper boundary). Steps 3 and 6 potentially require changing the tree structure, but we expect that to take at most O(log n), so the whole insert will not take more than O(log n).
Now with such a data structure in place, we can easily determine the k missing numbers by traversing the tree in order and collecting the numbers not covered by any of the subsets. The cost is linear in the number of nodes in the tree, which is <= n/2, so the total cost is O(n) for that.
However, if we consider again the complete sequence of operations, we get O(n log n) for the n inserts plus O(n) for finding the k missing numbers, so the overall cost is again O(n log n).
This is not better than the expected cost of the first algorithm.
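For concreteness, here is a minimal Python sketch of the interval structure described above. It uses a sorted list plus bisect in place of the balanced tree (so inserting into the list is linear in the number of intervals in the worst case), but the case analysis follows steps 1-6 above; all names are illustrative.
import bisect

class IntervalSet:
    def __init__(self):
        self.ivals = []                                  # disjoint, sorted [lo, hi] intervals

    def insert(self, e):
        i = bisect.bisect_right(self.ivals, [e, float("inf")])
        # self.ivals[i-1] is the rightmost interval starting at or before e, if any
        if i > 0 and self.ivals[i - 1][1] >= e:
            return                                       # case 2: e already covered
        left_adj = i > 0 and self.ivals[i - 1][1] == e - 1
        right_adj = i < len(self.ivals) and self.ivals[i][0] == e + 1
        if left_adj and right_adj:                       # case 3: merge the two neighbours
            self.ivals[i - 1][1] = self.ivals[i][1]
            del self.ivals[i]
        elif left_adj:                                   # case 5: extend the left neighbour
            self.ivals[i - 1][1] = e
        elif right_adj:                                  # case 4: extend the right neighbour
            self.ivals[i][0] = e
        else:                                            # case 6: new singleton interval
            self.ivals.insert(i, [e, e])

    def missing(self, n):
        """The numbers in [0, n] not covered by any interval: O(#intervals + K)."""
        out, prev_end = [], -1
        for lo, hi in self.ivals:
            out.extend(range(prev_end + 1, lo))
            prev_end = hi
        out.extend(range(prev_end + 1, n + 1))
        return out

s = IntervalSet()
for x in [0, 1, 3, 2, 7]:
    s.insert(x)
print(s.ivals)          # [[0, 3], [7, 7]]
print(s.missing(8))     # [4, 5, 6, 8]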
A third solution is to use a boolean array to represent the set and a single integer max for the biggest element in the set.
If an element e is added to the set, you set array[e] = true. You can implement the variable size of the set using table expansion, so the cost of inserting an element into the array is (amortized) constant.
To retrieve the missing elements, you just collect those elements f where array[f] == false. This takes O(max).
The overall cost for inserting n elements and finding the k missing ones is thus O(n) + O(max). However, max + 1 = n + k, so the overall cost is O(n + k).
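A minimal Python sketch of this boolean-array solution (class and method names are mine):
class BitSet:
    def __init__(self):
        self.present = []                 # present[e] is True iff e is in the set
        self.max_elem = -1

    def add(self, e):
        while len(self.present) <= e:     # grow the table; total growth over all inserts is O(max)
            self.present.append(False)
        self.present[e] = True
        self.max_elem = max(self.max_elem, e)

    def missing(self):                    # O(max) scan
        return [f for f in range(self.max_elem + 1) if not self.present[f]]

s = BitSet()
for x in [0, 1, 2, 4, 6]:
    s.add(x)
print(s.missing())                        # [3, 5]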
A fourth method, a cross-over between the third one and the one using hashing, is the following: it also uses hashing, but doesn't require sorting.
Store your set S in a hash set, and also store the maximum element in S in a variable max. To find the k missing ones, first generate a result set R containing all numbers {0,...,max}. Then iterate over S and delete every element in S from R.
The costs for that are also O(n + k).
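A minimal Python sketch of this fourth method, using a built-in set for the hashing and a set difference for the deletion step (names are illustrative):
def find_missing(stream):
    present, max_elem = set(), -1
    for e in stream:                       # n hash inserts, expected O(n)
        present.add(e)
        max_elem = max(max_elem, e)
    # Build R = {0, ..., max} and remove everything we saw: O(max) = O(n + k)
    return sorted(set(range(max_elem + 1)) - present)

print(find_missing([0, 1, 3, 5, 6]))       # [2, 4]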

Related

Find Pair with Difference less than K with O(n) complexity on average

I have an unsorted array of n positive numbers and a parameter k. I need to find out if there is a pair of numbers in the array whose difference is less than k, and I need to do so in expected O(n) time and O(n) space.
I believe it requires the use of a universal hash table but I'm not sure how, any ideas?
This answer works even on unbounded integers and floats (making some assumptions about how well-behaved the hashmap you'll be using is; the Java implementation should work, for instance):
1. Keep a hashmap<int, float> all_divided_values. For each key y, if all_divided_values[y] exists, it contains a value v from the array such that floor(v/k) = y.
2. For each value v in the original array A, if floor(v/k) is among all_divided_values's keys, output (v, all_divided_values[floor(v/k)]): they are less than k apart. Otherwise, store v in all_divided_values[floor(v/k)].
3. Once all_divided_values is filled, go through A again. For each v, test whether all_divided_values[floor(v/k) - 1] exists, and if so, output the pair (v, all_divided_values[floor(v/k) - 1]) if and only if abs(v - all_divided_values[floor(v/k) - 1]) < k.
Inserting into a hashmap is usually (with Java's HashMap, for instance) O(1) on average, so the total time is O(n). Note that technically this could fail to hold, for instance if your language's hashmap implementation does not handle collisions well.
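A minimal Python sketch of this bucketing idea, using floor(v/k) as the hash key; it assumes k > 0, works for floats as well as integers, and returns one close pair rather than all of them (the function name is mine):
import math

def find_close_pair(a, k):
    """Return a pair (x, y) from a with |x - y| < k, or None."""
    buckets = {}                                   # floor(v/k) -> one value from that bucket
    for v in a:
        key = math.floor(v / k)
        if key in buckets:                         # same bucket => difference < k
            return v, buckets[key]
        buckets[key] = v
    for v in a:                                    # second pass: check the bucket just below
        key = math.floor(v / k) - 1
        if key in buckets and abs(v - buckets[key]) < k:
            return v, buckets[key]
    return None

print(find_close_pair([10, 3, 25, 24.5], 2))       # (24.5, 25)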
Simple solution:
1- Sort the array
2- Calculate the difference between consecutive elements
a) If the difference is smaller than k return that pair
b) If no consecutive number difference yields a value smaller than k, then your array has no pair of numbers such that the difference is smaller than k.
Sorting is O(n log n), but if you have only integers of limited size, you can use counting sort, which is O(n).
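A minimal Python sketch of this sort-and-scan solution (using the built-in sort rather than counting sort):
def has_close_pair(a, k):
    b = sorted(a)                                  # O(n log n), or O(n) with counting sort
    for prev, cur in zip(b, b[1:]):                # a closest pair is always consecutive after sorting
        if cur - prev < k:
            return prev, cur
    return None

print(has_close_pair([10, 3, 25, 24.5], 2))        # (24.5, 25)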
You can look at it this way.
The problem can be modeled like this:
Consider each element (assuming integers) and convert it to an interval (A[i]-K, A[i]+K).
Now you want to check whether any two of those intervals overlap.
The interval intersection problem, without any sortedness, is not solvable in O(n) in the worst case. You need to sort the intervals, and then in O(n) you can check whether they intersect.
The same goes for your logic: sort it and find it.

Is it possible to compute the minimum of a set of numbers modulo a given number in amortized sublinear time?

Is there a data structure representing a large set S of (64-bit) integers, that starts out empty and supports the following two operations:
insert(s) inserts the number s into S;
minmod(m) returns the number s in S such that s mod m is minimal.
An example:
insert(11)
insert(15)
minmod(7) -> the answer is 15 (which mod 7 = 1)
insert(14)
minmod(7) -> the answer is 14 (which mod 7 = 0)
minmod(10) -> the answer is 11 (which mod 10 = 1)
I am interested in minimizing the maximal total time spent on a sequence of n such operations. It is obviously possible to just maintain a list of elements for S and iterate through them for every minmod operation; then insert is O(1) and minmod is O(|S|), which would take O(n^2) time for n operations (e.g., n/2 insert operations followed by n/2 minmod operations would take roughly n^2/4 operations).
So: is it possible to do better than O(n^2) for a sequence of n operations? Maybe O(n sqrt(n)) or O(n log(n))? If this is possible, then I would also be interested to know if there are data structures that additionally admit removing single elements from S, or removing all numbers within an interval.
Another idea based on a balanced binary search tree, as in Keith's answer.
Suppose all elements inserted so far are stored in a balanced BST, and we need to compute minmod(m). Consider our set S as a union of subsets of numbers lying in the intervals [0, m-1], [m, 2m-1], [2m, 3m-1], etc. The answer will obviously be among the minimal numbers we have in each of those intervals. So we can successively look up the tree to find the minimal number of each interval. It's easy to do; for example, if we need to find the minimal number in [a,b], we move left if the current value is greater than a, and right otherwise, keeping track of the minimal value in [a,b] we've met so far.
Now if we suppose that m is uniformly distributed in [1, 2^64], let's calculate the mathematical expectation of the number of queries we'll need.
For all m in [2^63, 2^64-1] we'll need 2 queries. The probability of this is 1/2.
For all m in [2^62, 2^63-1] we'll need 4 queries. The probability of this is 1/4.
...
The mathematical expectation will be sum[ 1/(2^k) * 2^k ], for k in [1,64], which is 64 queries.
So, to sum up, the average minmod(m) query complexity will be O(64 * log n). In general, if m has an unknown upper bound, this will be O(log m * log n). The BST update is, as is well known, O(log n), so the overall complexity for n operations will be O(n * log m * log n).
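A minimal Python sketch of this interval-jumping minmod query. It keeps the elements in a sorted list and uses bisect as a stand-in for the balanced BST's successor query (so insert is not truly O(log n) here), but the query logic is the one described above; names are mine.
import bisect

class MinModSet:
    def __init__(self):
        self.sorted_vals = []                        # kept sorted; stands in for a balanced BST

    def insert(self, s):
        i = bisect.bisect_left(self.sorted_vals, s)  # O(log n) search, O(n) list shift
        if i == len(self.sorted_vals) or self.sorted_vals[i] != s:
            self.sorted_vals.insert(i, s)

    def minmod(self, m):
        # Visit each non-empty interval [j*m, (j+1)*m - 1]; its minimal element is the
        # successor of the interval's left end, and that element's residue is its candidate.
        best, best_mod = None, None
        lo = 0
        while self.sorted_vals and lo <= self.sorted_vals[-1]:
            i = bisect.bisect_left(self.sorted_vals, lo)
            v = self.sorted_vals[i]                  # smallest stored value >= lo
            r = v % m
            if best_mod is None or r < best_mod:
                best, best_mod = v, r
            if best_mod == 0:
                break
            lo = (v // m + 1) * m                    # jump past the interval containing v
        return best

s = MinModSet()
s.insert(11); s.insert(15)
print(s.minmod(7))    # 15 (15 mod 7 == 1)
s.insert(14)
print(s.minmod(7))    # 14 (14 mod 7 == 0)
print(s.minmod(10))   # 11 (11 mod 10 == 1)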
Partial answer too big for a comment.
Suppose you implement S as a balanced binary search tree.
When you seek S.minmod(m), naively you walk the whole tree, so the cost is O(n) per query (O(n^2) over n operations).
However, at a given time during the walk, you have the best (lowest) result so far. You can use this to avoid checking whole sub-trees when:
bestSoFar < leftChild mod m
and
rightChild - leftChild < m - leftChild mod m
This will only help much if the typical spacing between the numbers in the set is smaller than typical values of m.
Update the next morning...
Grigor has articulated my idea better and more fully, and has shown how it works well for "large" m. He also shows that a "random" m is typically "large", so it works well.
Grigor's algorithm is so efficient for large m that one needs to think about the risk for much smaller m.
So it is clear that you need to think about the distribution of m and optimise for different cases if need be.
For example, it might be worth simply keeping track of the minimal modulus for very small m.
But suppose m ~ 2^32? Then the search algorithm (certainly as given but also otherwise) needs to check 2^32 intervals, which may amount to searching the whole set anyway.

Range query for a semigroup operator (union)

I'm looking to implement an algorithm, which is given an array of integers and a list of ranges (intervals) in that array, returns the number of distinct elements in each interval. That is, given the array A and a range [i,j] returns the size of the set {A[i],A[i+1],...,A[j]}.
Obviously, the naive approach (iterate from i to j and count, ignoring duplicates) is too slow. Range-sum techniques seem inapplicable, since A U B - B isn't always equal to A.
I've looked up Range Queries in Wikipedia, and it hints that Yao (in '82) showed an algorithm that does this for semigroup operators (which union seems to be) with linear preprocessing time and space and almost constant query time. The article, unfortunately, is not available freely.
Edit: it appears this exact problem is available at http://www.spoj.com/problems/DQUERY/
There's a rather simple algorithm which uses O(N log N) time and space for preprocessing and O(log N) time per query. First, create a persistent segment tree for answering range-sum queries (initially it should contain zeroes at all positions). Then iterate through the elements of the given array and keep track of the latest position of each number. At each iteration, create a new version of the persistent segment tree by putting a 1 at the latest position of the current element (at each iteration the position of only one element is updated, so only one position's value in the segment tree changes, and the update can be done in O(log N)). To answer a query (l, r), you just need to find the sum on the segment (l, r) in the version of the tree that was created when iterating over the r-th element of the initial array.
Hope this algorithm is fast enough.
Update: there's a small mistake in my explanation: at each step, at most two positions' values in the segment tree might change (because a 0 must be put at the previous latest position of a number when it is updated). However, this doesn't change the complexity.
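A minimal Python sketch of this persistent-segment-tree approach (array-based nodes, not tuned for the SPOJ limits; names are mine):
def build_versions(a):
    n = len(a)
    left, right, total = [0], [0], [0]          # node 0 is the all-zero empty tree

    def update(prev, lo, hi, pos, delta):
        node = len(total)                       # index of the freshly copied node
        left.append(left[prev]); right.append(right[prev])
        total.append(total[prev] + delta)
        if lo != hi:
            mid = (lo + hi) // 2
            if pos <= mid:
                left[node] = update(left[prev], lo, mid, pos, delta)
            else:
                right[node] = update(right[prev], mid + 1, hi, pos, delta)
        return node

    roots = [0] * (n + 1)                       # roots[r] = version after processing a[0..r-1]
    last = {}                                   # value -> latest position seen so far
    for i, x in enumerate(a):
        root = roots[i]
        if x in last:                           # clear the previous occurrence first
            root = update(root, 0, n - 1, last[x], -1)
        root = update(root, 0, n - 1, i, +1)
        last[x] = i
        roots[i + 1] = root

    def query(node, lo, hi, l, r):
        if node == 0 or r < lo or hi < l:
            return 0
        if l <= lo and hi <= r:
            return total[node]
        mid = (lo + hi) // 2
        return query(left[node], lo, mid, l, r) + query(right[node], mid + 1, hi, l, r)

    def distinct(l, r):                         # number of distinct values in a[l..r], 0-based
        return query(roots[r + 1], 0, n - 1, l, r)

    return distinct

distinct = build_versions([1, 1, 2, 1, 3])
print(distinct(0, 4))   # 3
print(distinct(1, 3))   # 2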
You can answer any of your queries in constant time by performing a quadratic-time precomputation:
# A is the input array, n = len(A); answers[(i, j)] holds the number of distinct values in A[i..j]
answers = {}
for i in range(n):
    S = set()                      # new empty set backed by a hash table
    C = 0
    for j in range(i, n):
        if A[j] not in S:          # a value not yet seen in A[i..j-1]
            C += 1
            S.add(A[j])
        answers[(i, j)] = C        # store C as the answer for the query associated to interval i..j
This algorithm takes quadratic time since for each interval we perform a bounded number of operations, each one taking constant time (note that the set S is backed by a hashtable), and there's a quadratic number of intervals.
If you don't have additional information about the queries (total number of queries, distribution of intervals), you cannot do essentially better, since the total number of intervals is already quadratic.
You can trade the quadratic precomputation for n linear on-the-fly computations: after receiving a query of the form A[i..j], precompute (in O(n) time) the answers for all intervals A[i..k], k >= i. This guarantees that the amortized complexity remains quadratic, and you are not forced to perform the complete quadratic precomputation up front.
Note that the obvious algorithm (the one you call obvious in the statement) is cubic, since you scan every interval completely.
Here is another approach which might be quite closely related to the segment tree. Think of the elements of the array as leaves of a full binary tree. If there are 2^n elements in the array there are n levels of that full tree. At each internal node of the tree store the union of the points that lie in the leaves beneath it. Each number in the array needs to appear once in each level (less if there are duplicates). So the cost in space is a factor of log n.
Consider a range A..B of length K. You can work out the union of points in this range by forming the union of the sets associated with leaves and nodes, picking nodes as high up the tree as possible, as long as the subtree beneath those nodes is entirely contained in the range. If you step along the range picking subtrees that are as big as possible, you will find that the size of the subtrees first increases and then decreases, and the number of subtrees required grows only with the logarithm of the size of the range: at the start, if you could only take a subtree of size 2^k, it will end on a boundary divisible by 2^(k+1), and you will have the chance of a subtree of size at least 2^(k+1) as the next step if your range is big enough.
So the number of semigroup operations required to answer a query is O(log n) - but note that the semigroup operations may be expensive as you may be forming the union of two large sets.
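A minimal Python sketch of this idea, storing a set at every internal node of a full binary tree and answering a query by unioning the O(log n) canonical node sets (names are mine; the per-query set unions can of course be expensive, as noted above):
def build(a):
    size = 1
    while size < len(a):
        size *= 2
    tree = [set() for _ in range(2 * size)]
    for i, x in enumerate(a):
        tree[size + i] = {x}
    for node in range(size - 1, 0, -1):            # each node stores the union of its children
        tree[node] = tree[2 * node] | tree[2 * node + 1]
    return tree, size

def distinct(tree, size, l, r):                    # distinct values in a[l..r], 0-based inclusive
    result = set()
    l += size; r += size + 1
    while l < r:                                   # standard canonical-node decomposition
        if l & 1:
            result |= tree[l]; l += 1
        if r & 1:
            r -= 1; result |= tree[r]
        l //= 2; r //= 2
    return len(result)

tree, size = build([1, 1, 2, 1, 3])
print(distinct(tree, size, 1, 3))                  # 2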

Prove that the running time of quick sort after modification = O(Nk)

This is a homework question, and I'm not that good at finding the complexity, but I'm trying my best!
Three-way partitioning is a modification of quicksort that partitions elements into groups smaller than, equal to, and larger than the pivot. Only the groups of smaller and larger elements need to be recursively sorted. Show that if there are N items but only k unique values (in other words there are many duplicates), then the running time of this modification to quicksort is O(Nk).
my try:
on the average case:
the three subarrays will be at these indices:
I assume that the subarray holding the duplicated items has size (n-k)
first: from 0 to (i-1)
second: from i to (i+(n-k-1))
third: from (i+n-k) to (n-1)
number of comparisons = (n-k)-1
So,
T(n) = (n-k) - 1 + sum for i from 0 to (n-k-1) of [ T(i) + T(i-k) ]
then I'm not sure how I'm gonna continue :S
It might be a very bad start though :$
Hope to find a help
First of all, you shouldn't look at the average case since the upper bound of O(nk) can be proved for the worst case, which is a stronger statement.
You should look at the maximum possible depth of recursion. In normal quicksort, the maximum depth is n. For each level, the total number of operations done is O(n), which gives O(n^2) total in the worst case.
Here, it's not hard to prove that the maximum possible depth is k (since one unique value will be removed at each level), which leads to O(nk) total.
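For reference, here is a minimal Python sketch of quicksort with three-way (Dutch national flag) partitioning, the variant the exercise describes; only the strictly-smaller and strictly-larger blocks recurse, which is what bounds the recursion depth by k.
import random

def quicksort3(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[random.randint(lo, hi)]
    lt, i, gt = lo, lo, hi          # a[lo..lt-1] < pivot, a[lt..i-1] == pivot, a[gt+1..hi] > pivot
    while i <= gt:
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1; i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1
    quicksort3(a, lo, lt - 1)       # only the strictly-smaller block...
    quicksort3(a, gt + 1, hi)       # ...and the strictly-larger block recurse

data = [3, 1, 3, 2, 1, 3, 2, 1]
quicksort3(data)
print(data)                         # [1, 1, 1, 2, 2, 3, 3, 3]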
I don't have a formal education in complexity. But if you think about it as a mathematical problem, you can prove it as a mathematical proof.
For all sorting algorithms, the best-case scenario will always be at least O(n) for n elements, because to sort n elements you have to consider each one at least once. Now, for your particular optimisation of quicksort, what you have done is simplify the issue, because now you are only sorting unique values: all the values that are the same as the pivot are already considered sorted, and by its nature quicksort guarantees that every unique value will feature as the pivot at some point, so this eliminates duplicates.
This means that for an N-sized list, quicksort must perform some operation N times (once for every position in the list), and because it is trying to sort the list, that operation amounts to finding the position of that value in the list. But because you are effectively dealing with just the unique values, and there are k of those, the algorithm performs at most k comparisons for each element. So it performs Nk operations for an N-sized list with k unique elements.
To summarise:
This algorithm eliminates checking against duplicate values.
But all sorting algorithms must look at every value in the list at least once: N operations.
For every value in the list the operation is to find its position relative to other values in the list.
Because duplicates get removed, this leaves only k values to check against.
O(Nk)

Data structure / algorithm for query: filter by A, sort by B, return N results

Imagine that you have a large set of m objects with properties A and B. What data structure can you use as index(es) (or which algorithm) to improve the performance of the following query?
find all objects where A between X and Y, order by B, return first N results;
That is, filter by range A and sort by B, but only return the first few results (say, 1000 at most). Insertions are very rare, so heavy preprocessing is acceptable. I'm not happy with the following options:
With records (or index) sorted by B: Scan the records/index in B order, return the first N where A matches X-Y. In the worst cases (few objects match the range X-Y, or the matches are at the end of the records/index) this becomes O(m), which for large data sets of size m is not good enough.
With records (or index) sorted by A: Do a binary search until the first object is found which matches the range X-Y. Scan and create an array of references to all k objects which match the range. Sort the array by B, return the first N. That's O(log m + k + k log k). If k is small then that's really O(log m), but if k is large then the cost of the sort becomes even worse than the cost of the linear scan over all m objects.
Adaptive 2/1: do a binary search for the first match of the range X-Y (using an index over A); do a binary search for the last match of the range. If the range is small, continue with algorithm 2; otherwise revert to algorithm 1. The problem here is the case where we revert to algorithm 1. Although we checked that "many" objects pass the filter, which is the good case for algorithm 1, this "many" is at most a constant (asymptotically the O(m) scan will always win over the O(k log k) sort). So we still have an O(m) algorithm for some queries.
Is there an algorithm / data structure which allows answering this query in sublinear time?
If not, what could be good compromises to achieve the necessary performance? For instance, if I don't guarantee returning the objects best ranking for their B property (recall < 1.0) then I can scan only a fraction of the B index. But could I do that while bounding the results' quality somehow?
The question you are asking is essentially a more general version of:
Q. You have a sorted list of words with a weight associated with each word, and you want all words which share a prefix with a given query q, and you want this list sorted by the associated weight.
Am I right?
If so, you might want to check this paper which discusses how to do this in O(k log n) time, where k is the number of elements in the output set desired and n is the number of records in the original input set. We assume that k > log n.
http://dhruvbird.com/autocomplete.pdf
(I am the author).
Update: A further refinement I can add is that the question you are asking is related to 2-dimensional range searching, where you want everything in a given X-range and then the top K of that set, sorted by Y.
2D range search lets you find everything in an X/Y-range (if both your ranges are known). In this case, you only know the X-range, so you would need to run the query repeatedly and binary search on the Y-range till you get K results. Each query can be performed in O(log n) time if you employ fractional cascading, and in O(log^2 n) with the naive approach. Either of these is sub-linear, so you should be okay.
Additionally, the time to list all entries would add an additional O(k) factor to your running time.
Assuming N << k < n, it can be done in O(log n + k + N log N), similar to what you suggested in option 2, but it saves some time: you don't need to sort all k elements, only N of them, which is much smaller!
The database is sorted by A.
(1) Find the first and last elements matching the range, and create a list containing the elements between them.
(2) Find the N-th biggest element using a selection algorithm (*), and with a second iteration create a new list of size N, populated with the N highest elements.
(3) Sort the last list by B.
(*) Selection algorithm: find the N-th biggest element. It is O(n) in general, or O(k) here, because the list's size is k.
Complexity:
Step 1 is trivially O(log n + k).
Step 2 is O(k) [selection], and the second iteration is also O(k), since this list has only k elements.
Step 3 is O(N log N): a simple sort of a list containing only N elements.
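A minimal Python sketch of this answer, assuming records are kept pre-sorted by A and that "first N results" means the N largest B values; heapq.nlargest stands in for the quickselect-plus-second-pass of step 2 (all names are mine):
import bisect, heapq

def top_n_by_b(records, x, y, n_results):
    """records: list of (a, b) tuples sorted by a; return n_results records with
    a in [x, y], ordered by b descending."""
    # Step 1: binary search for the slice whose A-values fall in [x, y]  -- O(log m + k)
    lo = bisect.bisect_left(records, (x, float("-inf")))
    hi = bisect.bisect_right(records, (y, float("inf")))
    matches = records[lo:hi]                        # the k matching records
    # Steps 2-3: keep the N largest by B and sort just those  -- O(k log N + N log N)
    return heapq.nlargest(n_results, matches, key=lambda rec: rec[1])

recs = sorted([(1, 9.0), (2, 3.5), (3, 7.2), (5, 1.0), (8, 6.6)])
print(top_n_by_b(recs, 2, 8, 2))    # [(3, 7.2), (8, 6.6)]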
If the number of items you want to return is small--up to about 1% of the total number of items--then a simple heap selection algorithm works well. See When theory meets practice. But it's not sub-linear.
For expected sub-linear performance, you can sort the items by A. When queried, use binary search to find the first item where A >= X, and then sequentially scan items until A > Y, using the heap selection technique I outlined in that blog post.
This should give you O(log n) for the initial search, and then O(m log k), where m is the number of items where X <= A <= Y, and k is the number of items you want returned. Yes, it will still be O(n log k) for some queries. The deciding factor will be the size of m.
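A minimal Python sketch of this binary-search-plus-heap-selection approach, keeping a bounded min-heap of the best k B-values seen during the scan (names are mine; "best" is taken to mean largest B):
import bisect, heapq

def filter_then_top_k(records, x, y, k):
    """records: list of (a, b) sorted by a; return up to k records with x <= a <= y,
    ordered by b descending."""
    i = bisect.bisect_left(records, (x, float("-inf")))   # first record with a >= x
    heap = []                                             # min-heap of (b, a) pairs, size <= k
    while i < len(records) and records[i][0] <= y:        # scan the m matching records
        a, b = records[i]
        if len(heap) < k:
            heapq.heappush(heap, (b, a))
        elif b > heap[0][0]:                              # better than the current k-th best
            heapq.heapreplace(heap, (b, a))
        i += 1
    return [(a, b) for b, a in sorted(heap, reverse=True)]

recs = sorted([(1, 9.0), (2, 3.5), (3, 7.2), (5, 1.0), (8, 6.6)])
print(filter_then_top_k(recs, 2, 8, 2))    # [(3, 7.2), (8, 6.6)]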
Set up a segment tree on A and, for each segment, precompute the top N in range. To query, break the input range into O(log m) segments and merge the precomputed results. Query time is O(N log log m + log m); space is O(m log N).
This is not really a fully fleshed out solution, just an idea. How about building a quadtree on the A and B axes? You would walk down the tree in, say, a breadth-first manner; then:
whenever you find a subtree with A-values all outside the given range [X, Y], you discard that subtree (and don't recurse);
whenever you find a subtree with A-values all inside the given range [X, Y], you add that subtree to a set S that you're building and don't recurse;
whenever you find a subtree with some A-values inside the range [X, Y] and some outside, you recurse into it.
Now you have the set S of all maximal subtrees with A-coordinates between X and Y; there are at most O(sqrt(m)) of these subtrees, which I will show below.
Some of these subtrees will contain O(m) entries (certainly they will contain O(m) entries all added together), so we can't do anything on all entries of all subtrees. We can now make a heap of the subtrees in S, so that the B-minimum of each subtree is less than the B-minimums of its children in the heap. Now extract B-minimal elements from the top node of the heap until you have N of them; whenever you extract an element from a subtree with k elements, you need to decompose that subtree into O(log(k)) subtrees not containing the recently extracted element.
Now let's consider complexity. Finding the O(sqrt(m)) subtrees will take at most O(sqrt(m)) steps (exercise for the reader, using arguments in the proof below). We should probably insert them into the heap as we find them; this will take O(sqrt(m) * log(sqrt(m))) = O(sqrt(m) * log(m)) steps. Extracting a single element from a k-element subtree in the heap takes O(sqrt(k)) time to find the element, and then inserting the O(log(sqrt(k))) = O(log(k)) subtrees back into the heap of size O(sqrt(m)) takes O(log(k) * log(sqrt(m))) = O(log(k) * log(m)) steps. We can probably be smarter using potentials, but we can at least bound k by m, so that leaves N * O(sqrt(k) + log(k)*log(m)) = O(N * (sqrt(m) + log(m)^2)) = O(N*sqrt(m)) steps for the extraction, and O(sqrt(m)*(N + log(m))) steps in total... which is sublinear in m.
Here's a proof of the bound of O(sqrt(m)) subtrees. There are several strategies for building a quadtree, but for ease of analysis, let's say that we make a binary tree; in the root node, we split the data set according to A-coordinate around the point with median A-coordinate, then one level down we split the data set according to B-coordinate around the point with median B-coordinate (that is, median for the half of the points contained in that half-tree), and continue alternating the direction per level.
The height of the tree is log(m). Now let's consider how many subtrees we need to recurse into. We only need to recurse if a subtree contains the A-coordinate X, or the A-coordinate Y, or both. At the (2*k)-th level down, there are 2^(2*k) subtrees in total. By then, each subtree has had its A-range subdivided k times already, and every time we do that, only half the trees contain the A-coordinate X. So at most 2^k subtrees contain the A-coordinate X. Similarly, at most 2^k will contain the A-coordinate Y. This means that in total we will recurse into at most 2*sum(2^k, k = 0 .. log(m)/2) = 2*(2^(log(m)/2 + 1) - 1) = O(sqrt(m)) subtrees.
Since we examine at most 2^k subtrees at the (2*k)'th level down, we can also add at most 2^k subtrees at that level to S. This gives the final result.
The outcome you describe is what most search engines are built to achieve (sorting, filtering, paging). If you haven't done so already, check out a search engine like Norch or Solr.

Resources