Possible Duplicate:
How to find the kth largest element in an unsorted array of length n in O(n)?
I'm currently sitting in front of a course assignment.
The task is to find the nth-smallest element in an array. (Without sorting it!)
I tried to understand the BFPRT algorithm, but from what I understood it is only useful if you want to calculate the median, not the "n-th smallest" element.
Another idea I had was to convert the array into a tree by attaching smaller/bigger elements to the left/right of the root node. I'm not sure, however, whether this counts as sorting.
To accelerate this I could store the number of subnodes in each node.
The complete assignment also includes that the algorithm has to be recursive.
There is also the hint to think about other data structures.
What do you think about my idea of transforming the array into a balanced tree?
Are there any other options I might have missed?
EDIT: I looked at various similar questions but was not able to completely understand the answers or apply them to my specific task.
The traditional approach to this problem (the order statistic problem) is reminiscent of quicksort. Let's say that you are looking for the k'th smallest element. Pick a (random) pivot element and partition the remaining elements into two groups (without sorting the two groups): L contains all elements that are smaller than or equal to the pivot element (except the pivot element itself), and G contains all elements that are greater than the pivot element. How large is L? If it contains exactly k - 1 elements, then the pivot element must be the k'th smallest element, and you are done. If L contains more than k - 1 elements, then the k'th smallest element must be in L; otherwise, it is in G. Now, apply the same algorithm to either L or G (if you need to use G, you must adjust k since you are no longer looking for the k'th smallest element of G, but the k'th smallest element overall).
This algorithm runs in expected O(n) time; however, there exists a clever modification of the algorithm that guarantees O(n) time in the worst case.
Edit: As @Ishtar points out, the "clever modification" is the BFPRT algorithm. Its core idea is to make sure that you never select a bad pivot element, so that the two partitions L and G never become too unbalanced. As long as one partition is guaranteed never to be more than c times larger than the other (for some arbitrary but fixed c), the running time will be O(n).
There is a quite complex algorithm that in theory runs in O(n); in practice it is a bit slower. Have a look at the median-of-medians (BFPRT) algorithm; Wikipedia also has an entry on the selection problem in general.
EDIT:
A simple pseudocode algorithm to solve the problem (k = the rank of the element we are looking for; assume for simplicity that the elements are distinct):

FindKthSmallest(Array, k)
    pivot = some pivot element of Array
    L = set of all elements of Array smaller than pivot
    R = set of all elements of Array greater than pivot
    if |L| >= k          return FindKthSmallest(L, k)
    else if |L| + 1 == k return pivot
    else                 return FindKthSmallest(R, k - |L| - 1)
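For completeness, here is a runnable C++ version of the pseudocode above (a sketch: it picks a random pivot and partitions three ways, so duplicate values are also handled, and it recurses exactly like the pseudocode):

    #include <cstdlib>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Returns the k-th smallest (1-based) element of v[lo..hi]. Expected O(n) time.
    int findKthSmallest(std::vector<int>& v, int lo, int hi, int k) {
        int pivot = v[lo + std::rand() % (hi - lo + 1)];
        // Three-way partition: [lo, lt) < pivot, [lt, gt] == pivot, (gt, hi] > pivot
        int lt = lo, gt = hi, i = lo;
        while (i <= gt) {
            if (v[i] < pivot)      std::swap(v[i++], v[lt++]);
            else if (v[i] > pivot) std::swap(v[i], v[gt--]);
            else                   ++i;
        }
        int smaller = lt - lo;    // |L|: elements smaller than the pivot
        int equal = gt - lt + 1;  // copies of the pivot value
        if (k <= smaller)         return findKthSmallest(v, lo, lt - 1, k);
        if (k <= smaller + equal) return pivot;
        return findKthSmallest(v, gt + 1, hi, k - smaller - equal);
    }

    int main() {
        std::vector<int> a = {7, 2, 9, 4, 4, 1, 8};
        std::cout << findKthSmallest(a, 0, (int)a.size() - 1, 3) << "\n"; // prints 4
    }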
I love the tournament algorithm here -- it's very intuitive and easy to understand.
http://en.wikipedia.org/wiki/Tournament_selection
Related
Given an array of positive integers, how can I find the number of increasing (or decreasing) subsequences of length 3? E.g. [1,6,3,7,5,2,9,4,8] has 24 of these, such as [3,4,8] and [6,7,9].
I've found solutions for length k, but I believe those solutions can be made more efficient since we're only looking at k = 3.
For example, a naive O(n^3) solution can be made faster by looping over elements and counting how many elements to their left are less, and how many to their right are higher, then multiplying these two counts, and adding it to a sum. This is O(n^2), which obviously doesn't translate easily into k > 3.
You can improve on this by looping over the elements: for every element, count how many elements to its left are smaller using a segment tree, which answers such a count query in O(log n), and in the same way count how many elements to its right are greater; then multiply these two counts and add the product to the sum. This is O(n log n).
You can learn more about segment tree algorithm over here:
Segment Tree Tutorial
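Here is a sketch of that idea in C++. As an implementation shortcut it uses a Fenwick (binary indexed) tree instead of a full segment tree, since only prefix counts are needed, and it coordinate-compresses the values to 1..n first:

    #include <algorithm>
    #include <iostream>
    #include <vector>

    struct Fenwick {                      // 1-based binary indexed tree of counts
        std::vector<int> t;
        Fenwick(int n) : t(n + 1, 0) {}
        void add(int i) { for (; i < (int)t.size(); i += i & -i) t[i]++; }
        int prefix(int i) const { int s = 0; for (; i > 0; i -= i & -i) s += t[i]; return s; }
    };

    long long countIncreasingTriplets(const std::vector<int>& a) {
        int n = a.size();
        std::vector<int> sorted(a);       // coordinate-compress values to 1..n
        std::sort(sorted.begin(), sorted.end());
        std::vector<int> r(n);
        for (int i = 0; i < n; i++)
            r[i] = std::lower_bound(sorted.begin(), sorted.end(), a[i]) - sorted.begin() + 1;

        std::vector<long long> lessLeft(n);
        Fenwick left(n);
        for (int i = 0; i < n; i++) {     // smaller elements to the left of i
            lessLeft[i] = left.prefix(r[i] - 1);
            left.add(r[i]);
        }
        long long total = 0;
        Fenwick right(n);
        for (int i = n - 1, seen = 0; i >= 0; i--, seen++) {
            long long greaterRight = seen - right.prefix(r[i]); // strictly greater, to the right
            total += lessLeft[i] * greaterRight;                // i as the middle element
            right.add(r[i]);
        }
        return total;
    }

    int main() {
        std::vector<int> a = {1, 6, 3, 7, 5, 2, 9, 4, 8};
        std::cout << countIncreasingTriplets(a) << "\n";  // prints 24, per the example above
    }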
For each element curr, count how many elements to its left and right have smaller and greater values.
This curr element can then form less[left] * greater[right] + greater[left] * less[right] triplets (covering both increasing and decreasing triplets).
Complexity Considerations
The straightforward approach of counting the elements on the left and right of each element yields a quadratic solution. You might be tempted to use a set or something similar to count soldiers in O(log n) time.
You can find a soldier rating in a set in O(log n); however, counting the elements before and after it will still be linear, unless you implement a BST in which each node tracks the count of its left children.
Check the solution here:
https://leetcode.com/problems/count-number-of-teams/discuss/554795/C%2B%2BJava-O(n-*-n)
Give a lower bound on the time to produce a single sorted list of n numbers that are in k groups, such that the smallest n/k numbers come first, the next smallest n/k after them, and so on.
So I have been stuck on this problem for a while and I'm really unsure how to go about it. I know how to make a decision tree, but I don't understand how I'm supposed to use one in the context of this problem. I don't fully understand the problem itself, yet it seems to be clear enough for others to solve. Any pointer in the right direction or clarification would be extremely appreciated.
Your question is difficult, because it assumes that the n numbers have been divided into k groups, with the groups themselves being ordered. I will assume here that the numbers within each group are not ordered. If the numbers were already sorted within each group, it would render the problem trivial.
The decision tree to solve your question could be built with k subtrees, one for each group, with each subtree connecting to the next subtree. The reason for this is that the groups themselves are already ordered, so we only need to sort within each group. The lower bound on the running time is the height of this tree, since in the best case we only traverse this tree along one root-to-leaf path to find the correct leaf node (and sorted list). Each subtree must distinguish all (n/k)! possible orderings of its group, so its height is at least lg((n/k)!) = Ω((n/k) lg(n/k)), which gives a total height of:
Ω(n lg(n/k))
To break down this expression:
lg((n/k)!) = Ω((n/k) lg(n/k)) is the height of each of the k subtrees
k * (n/k) lg(n/k) = n lg(n/k) is the height of the complete decision tree (there are k subtrees)
Please read this excellent PDF from the CS 401 class at the University of Illinois at Chicago which will completely explain your original problem and also show you a proof for how I arrived at the Big Omega expression I gave above.
I'm not sure what "lower bound" means in the question.
If
(...) numbers that are in k groups. Such that the smallest n/k are first and so on.
means groups are already sorted (given in proper order), then
the time to produce a single sorted list
is minimized when the numbers inside the groups are already sorted. Then the minimum time to produce a sorted list is k*(n/k - 1) + k*(n/k) + (k - 1) = O(n + k): k*(n/k - 1) comparisons to test each of the k groups for already being sorted, k*(n/k) operations to convert each group into a linked list by appending each item in order, and k - 1 operations to concatenate the groups into a single result list or array.
On the other hand, if we want the minimum time required to build the result regardless of the initial (lack of) order of the input numbers within the groups, then the answer is O(k*(n/k)*log(n/k)) + n = O(n*log(n/k)): a general sorting algorithm applied to each of the k groups of n/k items, then placing all n items into a resulting list or array.
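For the second case, here is a minimal C++ sketch of the matching upper bound (assuming, for illustration, that k divides n and group g occupies indices g*(n/k) through (g+1)*(n/k) - 1):

    #include <algorithm>
    #include <vector>

    // Sorts n numbers given in k pre-ordered groups of n/k elements each,
    // group g occupying a[g*(n/k) .. (g+1)*(n/k) - 1].
    void sortGrouped(std::vector<int>& a, int k) {
        int n = a.size(), len = n / k;
        for (int g = 0; g < k; g++)  // k independent sorts of n/k elements each
            std::sort(a.begin() + g * len, a.begin() + (g + 1) * len);
    }

Each group costs O((n/k) log(n/k)) and there are k groups, so the total is O(n log(n/k)), matching the lower bound derived above.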
Given an array of integers and some query operations.
The query operations are of 2 types:
1. Update the value at the i-th index to x.
2. Given 2 integers i and j, find the k-th minimum in that range (both i and j inclusive).
I can answer range minimum queries using a segment tree, but I could not do the same for the k-th minimum.
Can anyone help me?
Here is an O(polylog n) per query solution that does not actually assume a constant k, so k can vary between queries. The main idea is to use a segment tree where every node represents an interval of array indices and contains a multiset (balanced binary search tree) of the values in the represented array segment. The update operation is pretty straightforward:
Walk up the segment tree from the leaf (the array index you're updating). You will encounter all nodes that represent an interval of array indices that contain the updated index. At every node, remove the old value from the multiset and insert the new value into the multiset. Complexity: O(log^2 n)
Update the array itself.
We notice that every array element will be in O(log n) multisets, so the total space usage is O(n log n). With linear-time merging of multisets we can build the initial segment tree in O(n log n) as well (there's O(n) work per level).
What about queries? We are given a range [i, j] and a rank k and want to find the k-th smallest element in a[i..j]. How do we do that?
Find a disjoint coverage of the query range using the standard segment tree query procedure. We get O(log n) disjoint nodes, the union of whose multisets is exactly the multiset of values in the query range. Let's call those multisets s_1, ..., s_m (with m <= ceil(log_2 n)). Finding the s_i takes O(log n) time.
Do a select(k) query on the union of s_1, ..., s_m. See below.
So how does the selection algorithm work? There is one really simple algorithm to do this.
We have s_1, ..., s_m and k given and want to find the smallest x such that s_1.rank(x) + ... + s_m.rank(x) >= k - 1, where rank returns the number of elements smaller than x in the respective BBST (this can be implemented in O(log n) if we store subtree sizes).
Let's just use binary search to find x! We walk through the BBST of the root, doing a couple of rank queries at each step and checking whether their sum is larger than or equal to k. The predicate is monotone in x, so binary search works. The answer is then the minimum of the successors of x in any of the s_i.
Complexity: O(n log n) preprocessing and O(log^3 n) per query.
So in total we get a runtime of O(n log n + q log^3 n) for q queries. I'm sure we could get it down to O(q log^2 n) with a cleverer selection algorithm.
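To make the structure concrete, here is a condensed C++ sketch of the segment tree of multisets (naming is mine). It uses GNU pb_ds order-statistics trees as the BBSTs, builds in the simpler O(n log^2 n) way instead of by linear merging, and binary searches over the raw value range rather than over the root's BBST:

    #include <ext/pb_ds/assoc_container.hpp>   // GNU policy trees: BBSTs with order statistics
    #include <ext/pb_ds/tree_policy.hpp>
    #include <climits>
    #include <utility>
    #include <vector>

    using namespace __gnu_pbds;
    // A "multiset" with O(log n) rank queries: (value, position) pairs keep duplicates distinct.
    typedef tree<std::pair<int, int>, null_type, std::less<std::pair<int, int> >,
                 rb_tree_tag, tree_order_statistics_node_update> ranked_multiset;

    struct KthInRange {
        int n;
        std::vector<int> a;
        std::vector<ranked_multiset> seg;  // seg[v] holds the values of node v's index interval

        KthInRange(const std::vector<int>& arr) : n(arr.size()), a(arr), seg(4 * n) {
            build(1, 0, n - 1);
        }
        void build(int v, int l, int r) {  // simple O(n log^2 n) build
            for (int i = l; i <= r; i++) seg[v].insert(std::make_pair(a[i], i));
            if (l == r) return;
            int m = (l + r) / 2;
            build(2 * v, l, m);
            build(2 * v + 1, m + 1, r);
        }
        // Point update: fix the multiset in each of the O(log n) nodes covering pos.
        void update(int pos, int x) {
            int v = 1, l = 0, r = n - 1;
            while (true) {
                seg[v].erase(std::make_pair(a[pos], pos));
                seg[v].insert(std::make_pair(x, pos));
                if (l == r) break;
                int m = (l + r) / 2;
                if (pos <= m) { v = 2 * v; r = m; } else { v = 2 * v + 1; l = m + 1; }
            }
            a[pos] = x;
        }
        // How many elements of a[ql..qr] are < x (sum of rank queries over the decomposition).
        int countLess(int v, int l, int r, int ql, int qr, int x) {
            if (qr < l || r < ql) return 0;
            if (ql <= l && r <= qr)
                return (int)seg[v].order_of_key(std::make_pair(x, INT_MIN));
            int m = (l + r) / 2;
            return countLess(2 * v, l, m, ql, qr, x)
                 + countLess(2 * v + 1, m + 1, r, ql, qr, x);
        }
        // k-th smallest (1-based) in a[ql..qr]: binary search over the value range [lo, hi].
        int kth(int ql, int qr, int k, int lo, int hi) {
            while (lo < hi) {
                int mid = lo + (hi - lo) / 2;
                if (countLess(1, 0, n - 1, ql, qr, mid + 1) >= k) hi = mid; // elements <= mid
                else lo = mid + 1;
            }
            return lo;
        }
    };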
UPDATE: If we are looking for an offline algorithm that can process all queries at once, we can get O((n + q) * log n * log (q + n)) using the following algorithm:
Preprocess all queries and create a set of all values that ever occur in the array. There will be at most q + n of those.
Build a segment tree, but this time not on the array, but on the set of possible values.
Every node in the segment tree represents an interval of values and maintains a set of the positions where these values occur.
To answer a query, start at the root of the segment tree. Check how many positions in the left child of the root lie in the query interval (we can do that by doing two searches in the BBST of positions). Let that number be m. If k <= m, recurse into the left child. Otherwise recurse into the right child, with k decremented by m.
For updates, remove the position from the O(log (q + n)) nodes that cover the old value and insert it into the nodes that cover the new value.
The advantage of this approach is that we don't need subtree sizes, so we can implement this with most standard library implementations of balanced binary search trees (e.g. set<int> in C++).
We can turn this into an online algorithm by changing the segment tree out for a weight-balanced tree such as a BB[α] tree. It has logarithmic operations like other balanced binary search trees, but allows us to rebuild an entire subtree from scratch when it becomes unbalanced by charging the rebuilding cost to the operations that must have caused the imbalance.
If this is a programming contest problem, then you might be able to get away with the following O(n log(n) + q n^0.5 log(n)^1.5)-time algorithm. It is set up to use the C++ STL well and has a much better big-O constant than Niklas's (previous?) answer on account of using much less space and indirection.
Divide the array into k chunks of length n/k. Copy each chunk into the corresponding locations of a second array and sort it. To update: copy the chunk that changed into the second array and sort it again (time O((n/k) log(n/k))). To query: copy to a scratch array the at most 2(n/k - 1) elements that belong to chunks partially overlapping the query interval. Sort them. Use one of the answers to this question to select the element of the requested rank out of the union of the sorted scratch array and the fully overlapping chunks, in time O(k log(n/k)^2). The optimal setting of k in theory is (n/log(n))^0.5. It's possible to shave another log(n)^0.5 factor using the complicated algorithm of Frederickson and Johnson.
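For illustration, here is a C++ sketch of that chunked layout (names are mine). Instead of the rank-selection subroutine from the linked question, this version binary searches on the value, counting elements <= mid with one binary search per fully covered chunk; the bounds differ slightly from those above, but the sqrt-decomposition structure is the same:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Sqrt-decomposition sketch: a[] in chunks of size B, each chunk kept sorted in s[].
    struct ChunkedKth {
        int n, B;
        std::vector<int> a, s;             // s is a chunk-by-chunk sorted copy of a
        ChunkedKth(const std::vector<int>& arr)
            : n(arr.size()), B(std::max(1, (int)std::sqrt((double)arr.size()))), a(arr), s(arr) {
            for (int l = 0; l < n; l += B)
                std::sort(s.begin() + l, s.begin() + std::min(l + B, n));
        }
        void update(int pos, int x) {      // re-sort the one affected chunk: O(B log B)
            a[pos] = x;
            int l = (pos / B) * B, r = std::min(l + B, n);
            std::copy(a.begin() + l, a.begin() + r, s.begin() + l);
            std::sort(s.begin() + l, s.begin() + r);
        }
        int countLessEq(int ql, int qr, int x) const {  // # of a[i] <= x for ql <= i <= qr
            int cnt = 0;
            for (int l = 0; l < n; l += B) {
                int r = std::min(l + B, n) - 1;
                if (r < ql || l > qr) continue;
                if (ql <= l && r <= qr)    // fully covered chunk: binary search
                    cnt += std::upper_bound(s.begin() + l, s.begin() + r + 1, x)
                         - (s.begin() + l);
                else                       // partially covered chunk: scan the overlap
                    for (int i = std::max(l, ql); i <= std::min(r, qr); i++)
                        cnt += (a[i] <= x);
            }
            return cnt;
        }
        int kth(int ql, int qr, int k, int lo, int hi) { // binary search on the value
            while (lo < hi) {
                int mid = lo + (hi - lo) / 2;
                if (countLessEq(ql, qr, mid) >= k) hi = mid; else lo = mid + 1;
            }
            return lo;
        }
    };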
Perform a modification of bucket sort: create a bucket that contains the numbers in the range you want, then sort this bucket only and find the k-th minimum.
This solution can't update an element, but it at least finds the k-th element, and it may give you some ideas for a solution that supports updates. Try pointer-based B-trees.
This is O(n log n) space and O(q log^2 n) time complexity. Further below I explain how to reduce this to O(log n) per query.
So, you'll need to do the following:
1) Build a "segment tree" over the given array.
2) For every node, instead of storing one number, store a whole array. The size of that array has to equal the number of array elements (leaves) in the node's segment. That array (as you guessed) has to contain the values of the node's segment, but sorted.
3) To make such an array, merge the two arrays of the node's two children in the segment tree. But not only that: for every element of the array you have just made (by merging), you need to remember its position before insertion into the merged array (basically, which child array it comes from, and its position in it), and a pointer to the first next element that was not inserted from the same array.
4) With this structure, you can check how many numbers in some segment S are lower than a given value x. Find (with binary search) the first number in the root node's array that is >= x. Then, using the pointers you have made, you can find the answers to the same question for the two children's arrays in O(1) each. You stop descending at every node whose segment lies entirely inside or entirely outside the given segment S. The time complexity is O(log n): O(log n) to find the first element that is >= x, and O(1) for each of the O(log n) segments in the decomposition of S.
5) Do a binary search over the answer.
This was solution with O(log^2 n) per query. But you can reduce to O(log n):
1) Before doing all I wrote above, you need to transform the problem. Sort all the numbers and remember, for each, its position in the original array. These positions now form the array you are working on; call that array P.
If the bounds of the query segment are a and b, you need to find the k-th element of P that is between a and b by value (not by index). That element represents the index of your result in the original array.
2) To find that k-th element, you do some type of backtracking with complexity O(log n). You will be asking for the number of elements between index 0 and (some other index) that are between a and b by value.
3) Suppose you know the answer to such a question for some segment (0, h). Query the segments in the tree that begin at h, starting from the largest one: if the running count (for segment (0, h)) plus that segment's answer is still less than k, extend h past the segment and add its count to the running total; otherwise move on to the next smaller segment beginning at h. Keep updating h until only a single-element segment begins at h. That h is the index of the number you are looking for in the problem you have stated.
Getting the answer to such a question for some segment of the tree takes exactly O(1) time, because you already know the answer for its parent's segment, and using the pointers explained in the first algorithm you can derive the answer for the current segment in O(1).
This is an interview question. Design a class, which stores integers and provides two operations:
void insert(int k)
int getMedian()
I guess I can use a BST so that insert takes O(log N) and getMedian takes O(log N) (for getMedian I would store the number of left/right children for each node).
Now I wonder if this is the most efficient solution and there is no better one.
You can use 2 heaps, that we will call Left and Right.
Left is a Max-Heap.
Right is a Min-Heap.
Insertion is done like this:
If the new element x is smaller than the root of Left then we insert x to Left.
Else we insert x to Right.
If after the insertion Left has more than one element more than Right, then we call Extract-Max on Left and insert the result into Right.
Else if after the insertion Right has more elements than Left, then we call Extract-Min on Right and insert the result into Left.
The median is always the root of Left.
So insertion is done in O(lg n) time and getting the median is done in O(1) time.
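A compact C++ version of this scheme might look as follows (a sketch; with the balancing rule above, getMedian returns the lower median when the count is even):

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    class MedianStore {
        std::priority_queue<int> left;                                        // max-heap
        std::priority_queue<int, std::vector<int>, std::greater<int> > right; // min-heap
    public:
        void insert(int k) {
            if (left.empty() || k <= left.top()) left.push(k);
            else right.push(k);
            // Rebalance so that |left| == |right| or |left| == |right| + 1.
            if (left.size() > right.size() + 1) { right.push(left.top()); left.pop(); }
            else if (right.size() > left.size()) { left.push(right.top()); right.pop(); }
        }
        int getMedian() const { return left.top(); }  // lower median for even counts
    };

    int main() {
        MedianStore m;
        int xs[] = {5, 2, 8, 1, 9};
        for (int x : xs) m.insert(x);
        std::cout << m.getMedian() << "\n"; // prints 5
    }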
See this Stack Overflow question for a solution that involves two heaps.
Would it beat an array of integers which performs a sort at insertion time, with a sorting algorithm dedicated to integers (http://en.wikipedia.org/wiki/Sorting_algorithm), if you choose a candidate whose insertion cost is below O(log n)? Using an array, getMedian would just take the element at half the size, which is O(1). It seems possible to me to do better than log(n) + log(n).
Plus, by being a little more flexible, you can improve your performance by changing your sorting algorithm according to the properties of your input (is the input almost sorted or not...).
I am pretty much an autodidact in computer science, but that is the way I would do it: simpler is better.
You could consider a self-balancing tree, too. If the tree is fully balanced, then the root node is your median. Say the tree is one level deeper on one side; then you just need to know how many nodes are in the deeper side to pick the correct median.
Imagine that you have a large set of m objects with properties A and B. What data structure (or algorithm) can you use as an index to improve the performance of the following query?
find all objects where A between X and Y, order by B, return first N results;
That is, filter by range A and sort by B, but only return the first few results (say, 1000 at most). Insertions are very rare, so heavy preprocessing is acceptable. I'm not happy with the following options:
With records (or index) sorted by B: Scan the records/index in B order, return the first N where A matches X-Y. In the worst cases (few objects match the range X-Y, or the matches are at the end of the records/index) this becomes O(m), which for large data sets of size m is not good enough.
With records (or index) sorted by A: Do a binary search until the first object is found which matches the range X-Y. Scan and create an array of references to all k objects which match the range. Sort the array by B, return the first N. That's O(log m + k + k log k). If k is small then that's effectively O(log m), but if k is large then the cost of the sort becomes even worse than the cost of a linear scan over all m objects.
Adaptive 2/1: do a binary search for the first match of the range X-Y (using an index over A); do a binary search for the last match of the range. If the range is small, continue with algorithm 2; otherwise revert to algorithm 1. The problem here is the case where we revert to algorithm 1. Although we checked that "many" objects pass the filter, which is the good case for algorithm 1, this "many" is at most a constant (asymptotically the O(m) scan will always win over the O(k log k) sort). So we still have an O(m) algorithm for some queries.
Is there an algorithm / data structure which allows answering this query in sublinear time?
If not, what could be good compromises to achieve the necessary performance? For instance, if I don't guarantee returning the best-ranking objects for their B property (recall < 1.0), then I can scan only a fraction of the B index. But could I do that while bounding the results' quality somehow?
The question you are asking is essentially a more general version of:
Q. You have a sorted list of words with a weight associated with each word, and you want all words which share a prefix with a given query q, and you want this list sorted by the associated weight.
Am I right?
If so, you might want to check this paper which discusses how to do this in O(k log n) time, where k is the number of elements in the output set desired and n is the number of records in the original input set. We assume that k > log n.
http://dhruvbird.com/autocomplete.pdf
(I am the author).
Update: A further refinement I can add is that the question you are asking is related to 2-dimensional range searching where you want everything in a given X-range and the top-K from the previous set, sorted by the Y-range.
2D range search lets you find everything in an X/Y-range (if both your ranges are known). In this case, you only know the X-range, so you would need to run the query repeatedly and binary search on the Y-range till you get K results. Each query can be performed in O(log n) time if you employ fractional cascading, and O(log^2 n) with the naive approach. Either of these is sub-linear, so you should be okay.
Additionally, the time to list all entries would add an O(k) term to your running time.
Assuming N << k < n, it can be done in O(log n + k + N log N), similar to what you suggested in option 2, but it saves some time: you don't need to sort all k elements, only N, which is much smaller!
The database is sorted by A.
(1) Find the first and the last matching elements, and create a list containing these elements.
(2) Find the N-th biggest element by B, using a selection algorithm (*), and with a second iteration create a new list of size N, populated with the N highest elements.
(3) Sort the last list by B.
(*) Selection algorithm: find the N-th biggest element. It is O(n), or O(k) here, because the list's size is k.
Complexity:
Step 1 is trivially O(log n + k).
Step 2 is O(k) [selection], and another iteration is also O(k), since this list has only k elements.
Step 3 is O(N log N), a simple sort, and the last list contains only N elements.
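In C++, step (2) maps directly onto std::nth_element. A sketch (Record and its fields a and b are assumed names; this version keeps the N smallest elements by B, so flip the comparators if "order by B" means descending):

    #include <algorithm>
    #include <vector>

    struct Record { int a, b; };

    // records must be pre-sorted by a. Returns the first N matches of X <= a <= Y,
    // ordered by b: O(log m) search + O(k) copy + O(k) selection + O(N log N) sort.
    std::vector<Record> query(const std::vector<Record>& records, int X, int Y, int N) {
        // (1) binary search for the boundaries of the A-range
        auto lo = std::lower_bound(records.begin(), records.end(), X,
                                   [](const Record& r, int x) { return r.a < x; });
        auto hi = std::upper_bound(records.begin(), records.end(), Y,
                                   [](int y, const Record& r) { return y < r.a; });
        std::vector<Record> matches(lo, hi);  // the k matching records
        auto byB = [](const Record& x, const Record& y) { return x.b < y.b; };
        // (2) selection: move the N records with the smallest b to the front, O(k)
        if ((int)matches.size() > N) {
            std::nth_element(matches.begin(), matches.begin() + N, matches.end(), byB);
            matches.resize(N);
        }
        std::sort(matches.begin(), matches.end(), byB);  // (3) sort only N elements
        return matches;
    }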
If the number of items you want to return is small--up to about 1% of the total number of items--then a simple heap selection algorithm works well. See When theory meets practice. But it's not sub-linear.
For expected sub-linear performance, you can sort the items by A. When queried, use binary search to find the first item where A >= X, and then sequentially scan items until A > Y, using the heap selection technique I outlined in that blog post.
This should give you O(log n) for the initial search, and then O(m log k), where m is the number of items where X <= A <= Y, and k is the number of items you want returned. Yes, it will still be O(n log k) for some queries. The deciding factor will be the size of m.
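A sketch of that scan-plus-heap-selection in C++ (Item and its fields are assumed names): the max-heap holds the k best B-values seen so far, so each in-range item costs at most O(log k):

    #include <algorithm>
    #include <queue>
    #include <vector>

    struct Item { int a, b; };

    // items pre-sorted by a. Returns the k items with smallest b among those
    // with X <= a <= Y, in ascending order of b.
    std::vector<Item> firstKByB(const std::vector<Item>& items, int X, int Y, size_t k) {
        auto it = std::lower_bound(items.begin(), items.end(), X,
                                   [](const Item& r, int x) { return r.a < x; });
        auto byB = [](const Item& p, const Item& q) { return p.b < q.b; };
        std::priority_queue<Item, std::vector<Item>, decltype(byB)> heap(byB);
        for (; it != items.end() && it->a <= Y; ++it) {  // sequential scan until A > Y
            if (heap.size() < k) heap.push(*it);
            else if (it->b < heap.top().b) { heap.pop(); heap.push(*it); }
        }
        std::vector<Item> out;
        while (!heap.empty()) { out.push_back(heap.top()); heap.pop(); }
        std::reverse(out.begin(), out.end());            // ascending by b
        return out;
    }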
Set up a segment tree on A and, for each segment, precompute the top N in range. To query, break the input range into O(log m) segments and merge the precomputed results. Query time is O(N log log m + log m); space is O(m log N).
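Here is a C++ sketch of that precomputation (naming is mine; records are assumed to be stored sorted by A, so an A-range maps to an index range via binary search; the query below merges with a simple sort, whereas a small heap would give the stated O(N log log m + log m) bound):

    #include <algorithm>
    #include <iterator>
    #include <vector>

    struct TopNTree {
        int n, N;
        std::vector<std::vector<int> > node;  // node[v] = up to N smallest B-values of its interval

        TopNTree(const std::vector<int>& bByAOrder, int topN)
            : n(bByAOrder.size()), N(topN), node(4 * n) { build(1, 0, n - 1, bByAOrder); }

        void build(int v, int l, int r, const std::vector<int>& b) {
            if (l == r) { node[v].push_back(b[l]); return; }
            int m = (l + r) / 2;
            build(2 * v, l, m, b);
            build(2 * v + 1, m + 1, r, b);
            // merge the children's sorted lists, keeping only the N smallest
            std::merge(node[2 * v].begin(), node[2 * v].end(),
                       node[2 * v + 1].begin(), node[2 * v + 1].end(),
                       std::back_inserter(node[v]));
            if ((int)node[v].size() > N) node[v].resize(N);
        }
        // Collect the precomputed lists of the O(log m) covering segments.
        void collect(int v, int l, int r, int ql, int qr, std::vector<int>& out) const {
            if (qr < l || r < ql) return;
            if (ql <= l && r <= qr) { out.insert(out.end(), node[v].begin(), node[v].end()); return; }
            int m = (l + r) / 2;
            collect(2 * v, l, m, ql, qr, out);
            collect(2 * v + 1, m + 1, r, ql, qr, out);
        }
        std::vector<int> query(int ql, int qr) const {
            std::vector<int> out;
            collect(1, 0, n - 1, ql, qr, out);
            std::sort(out.begin(), out.end());  // simple merge; a heap gives the stated bound
            if ((int)out.size() > N) out.resize(N);
            return out;
        }
    };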
This is not really a fully fleshed out solution, just an idea. How about building a quadtree on the A and B axes? You would walk down the tree in, say, a breadth-first manner; then:
whenever you find a subtree with A-values all outside the given range [X, Y], you discard that subtree (and don't recurse);
whenever you find a subtree with A-values all inside the given range [X, Y], you add that subtree to a set S that you're building and don't recurse;
whenever you find a subtree with some A-values inside the range [X, Y] and some outside, you recurse into it.
Now you have the set S of all maximal subtrees with A-coordinates between X and Y; there are at most O(sqrt(m)) of these subtrees, which I will show below.
Some of these subtrees will contain O(m) entries (certainly they will contain O(m) entries all added together), so we can't do anything on all entries of all subtrees. We can now make a heap of the subtrees in S, so that the B-minimum of each subtree is less than the B-minimums of its children in the heap. Now extract B-minimal elements from the top node of the heap until you have N of them; whenever you extract an element from a subtree with k elements, you need to decompose that subtree into O(log(k)) subtrees not containing the recently extracted element.
Now let's consider complexity. Finding the O(sqrt(m)) subtrees will take at most O(sqrt(m)) steps (exercise for the reader, using arguments in the proof below). We should probably insert them into the heap as we find them; this will take O(sqrt(m) * log(sqrt(m))) = O(sqrt(m) * log(m)) steps. Extracting a single element from a k-element subtree in the heap takes O(sqrt(k)) time to find the element, then inserting the O(log(sqrt(k))) = O(log(k)) subtrees back into the heap of size O(sqrt(m)) takes O(log(k) * log(sqrt(m))) = O(log(k) * log(m)) steps. We can probably be smarter using potentials, but we can at least bound k by m, so that leaves N * O(sqrt(k) + log(k) * log(m)) = O(N * (sqrt(m) + log(m)^2)) = O(N * sqrt(m)) steps for the extraction, and O(sqrt(m) * (N + log(m))) steps in total... which is sublinear in m.
Here's a proof of the bound of O(sqrt(m)) subtrees. There are several strategies for building a quadtree, but for ease of analysis, let's say that we make a binary tree; in the root node, we split the data set according to A-coordinate around the point with median A-coordinate, then one level down we split the data set according to B-coordinate around the point with median B-coordinate (that is, median for the half of the points contained in that half-tree), and continue alternating the direction per level.
The height of the tree is log(m). Now let's consider for how many subtrees we need to recurse. We only need to recurse if a subtree contains the A-coordinate X, or it contains the A-coordinate Y, or both. At the (2*k)th level down, there are 2^(2*k) subtrees in total. By then, each subtree has had its A-range subdivided k times already, and every time we do that, only half the trees contain the A-coordinate X. So at most 2^k subtrees contain the A-coordinate X. Similarly, at most 2^k will contain the A-coordinate Y. This means that in total we will recurse into at most 2*sum(2^k, k = 0 .. log(m)/2) = 2*(2^(log(m)/2 + 1) - 1) = O(sqrt(m)) subtrees.
Since we examine at most 2^k subtrees at the (2*k)'th level down, we can also add at most 2^k subtrees at that level to S. This gives the final result.
The outcome you describe is what most search engines are built to achieve (sorting, filtering, paging). If you haven't done so already, check out a search engine like Norch or Solr.