I have been stuck on one question.
Given 1 billion numbers, we need to find the largest 1 million of them. One approach is to sort the
numbers and then take the top million from the result, which is O(n log n). Propose an algorithm that has
expected O(n) time complexity.
Is heap sort something that can do this with O(n) complexity?
The general version of the problem you're trying to solve here seems to be the following:
Given n numbers, report the largest k of them in (possibly expected) time O(n).
If you just need to find the top k elements and the ordering doesn't matter, there's a clever O(n)-time algorithm for this problem based on fast selection algorithms. As a refresher, a selection algorithm takes as input an array A and a number m, then reorders A so that the m smallest elements are in the first m slots and the remaining elements occupy the later slots. The quickselect algorithm does this in expected time linear in the array length and is fast in practice; the median-of-medians algorithm does the same in worst-case linear time but is slower in practice. While these algorithms are typically framed in terms of finding the smallest m elements, they work just as well for finding the largest m elements.
Using such an algorithm as a subroutine, here's how we can find the top k elements in time O(n) and space O(k):
Initialize a buffer of 2k elements.
Copy the first k elements of the array into the buffer.
While there are elements remaining in the array:
    Copy the next k of them into the buffer.
    Use a selection algorithm to place the k largest
    elements of the buffer in the first k slots.
    Discard the remaining elements of the buffer.
Return the contents of the buffer.
To see why this works, notice that after each iteration of the loop, we maintain the invariant that the buffer holds the k largest elements of the ones that have been seen so far (though not necessarily in sorted order). Therefore, the algorithm will identify the top k elements of the input and return them in some order.
In terms of time complexity - there's O(k) work to create the buffer, and across all iterations of the loop we do O(n) work copying elements into the buffer. Each call to the selection algorithm takes (expected) time O(k), and there are O(n / k) calls to the algorithm for a net runtime of O(n + k). Under the assumption that k < n, this gives an overall runtime of O(n), with only O(k) total space required.
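As a concrete illustration, here is a minimal Python sketch of this buffer-plus-selection approach, using a small randomized quickselect for the partial reordering (the helper names select_top_k and top_k are ours, not from the original answer):

import random

def select_top_k(buf, k):
    # Partially reorder buf in place so its k largest elements occupy buf[:k].
    # Randomized quickselect: expected time linear in len(buf).
    lo, hi = 0, len(buf) - 1
    while lo < hi:
        pivot = buf[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                      # Hoare partition, descending order
            while buf[i] > pivot:
                i += 1
            while buf[j] < pivot:
                j -= 1
            if i <= j:
                buf[i], buf[j] = buf[j], buf[i]
                i += 1
                j -= 1
        if k - 1 <= j:
            hi = j                         # the kth slot lies in the left part
        elif k - 1 >= i:
            lo = i                         # the kth slot lies in the right part
        else:
            break                          # slots j+1 .. i-1 all equal the pivot

def top_k(items, k):
    # Return the k largest elements of items, in no particular order,
    # in expected O(n) time and O(k) space.
    buf = []
    for x in items:
        buf.append(x)
        if len(buf) == 2 * k:              # buffer full: keep only the top k
            select_top_k(buf, k)
            del buf[k:]
    if len(buf) > k:                       # handle the final partial batch
        select_top_k(buf, k)
        del buf[k:]
    return buf

For example, top_k([6, 2, 4, 4, 8, 2, 4, 1, 9, 2], 5) returns the five largest values 9, 8, 6, 4, 4 in some order.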
No comparison-based sorting algorithm can do this in O(n) time. Furthermore, without additional constraints (e.g., the billion numbers being drawn from the range 1 to 1,000,000), there is no sorting algorithm at all that is going to reach O(n) here.
However, there is a simple O(n) algorithm to do this (treating the buffer size as a constant):

initialize a return buffer with 1,000,000 empty cells
for each item in your list of 1,000,000,000, do the following:
    check each of the 1,000,000 cells in order:
        if the number from the input is bigger than the number in the cell, swap them and keep going
        if you're looking at a blank cell, put the number you're holding down and stop
    if you get to the end of the buffer and are still holding a number, throw it out
Here's an example with a list of 10 things and we want the biggest 5:
Input: [6, 2, 4, 4, 8, 2, 4, 1, 9, 2]
Buffer: [-, -, -, -, -]
[6, -, -, -, -] … see a blank, drop the 6
[6, 2, -, -, -] … 2 < 6, skip, then see a blank, drop the 2
[6, 4, 2, -, -] … 4 < 6 but 4 > 2, swap out 2, see blank, drop 2
[6, 4, 4, 2, -] … 4 <= 4,6 but 4 > 2, swap out 2, see blank, drop 2
[8, 6, 4, 4, 2] … 8 > 6, swap, then swap 6 for 4, etc.
[8, 6, 4, 4, 2] … 2 <= everything, drop it on the floor
[8, 6, 4, 4, 4] … 4 <= everything but 2, swap, then drop 2 on floor
[8, 6, 4, 4, 4] … 1 <= everything, drop it on the floor
[9, 8, 6, 4, 4] … 9 > everything, swap with 8 then drop a 4 on the floor
[9, 8, 6, 4, 4] … 2 <= everything, drop it on the floor
You do 1,000,000 comparisons and potentially up to 1,000,000 swaps for every element in the input (consider input in sorted ascending order). That means you do work proportional to 1,000,000 * n, which is linear in the input size n when the buffer size is treated as a constant (in general, the approach is O(n * k) for the top k).
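A minimal Python sketch of this swap-buffer approach (the function name largest_k_swap_buffer is ours):

def largest_k_swap_buffer(items, k):
    # Keep a k-slot buffer; bubble each new item down through it.
    # O(n * k) comparisons overall, O(k) extra space.
    buf = [None] * k
    for x in items:
        for i in range(k):
            if buf[i] is None:        # blank cell: put the number down
                buf[i] = x
                break
            if x > buf[i]:            # bigger: swap and keep going
                buf[i], x = x, buf[i]
        # falling off the end while still holding a number throws it out
    return buf

Running largest_k_swap_buffer([6, 2, 4, 4, 8, 2, 4, 1, 9, 2], 5) reproduces the walkthrough above, ending with [9, 8, 6, 4, 4].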
You can generally do better than sorting. Most people would solve this problem by using a heap as the structure. Building the heap over all n values takes O(n) time. You will then have to do a million "pop" operations at O(log n) each, but you are not doing the full n pops, only n/1000 pops in this case, for O(n + (n/1000) log n) overall.
I say you can "generally" do better than sorting because most sorting algorithms in libraries are O(n log n). But there is "distribution sort", which is actually O(n + k), where k here is the number of possible values in the range you are sorting; depending on the value of k, you might do better by sorting.
Update
To incorporate the suggestion made by #pjs, create a "minimum" heap with the first million values from the billion where a pop operation removes the minimum value from the heap. Then for the next 999,000,000 values, check to see if each one is greater than the current minimum value on the heap and if so, pop the current minimum value from the heap and push the new value. When you are done, you will be left with the 1,000,000 largest values.
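A minimal Python sketch of that min-heap approach using the standard heapq module (the function name largest_k_heap is ours):

import heapq
from itertools import islice

def largest_k_heap(items, k):
    # Min-heap of the current top k: O(n log k) time, O(k) space.
    it = iter(items)
    heap = list(islice(it, k))           # seed with the first k values
    heapq.heapify(heap)                  # O(k); heap[0] is the minimum
    for x in it:
        if x > heap[0]:                  # bigger than the current minimum?
            heapq.heapreplace(heap, x)   # pop the min, push the new value
    return heap                          # the k largest values, in heap order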
I need a little help trying to figure out something:
Given a sequence A of unordered numbers (fewer than 15,000 of them), I must answer Q queries (Q <= 100,000) of the form i, j, x, y, which translate as follows:
How many numbers in the range [i, j] of A are greater than or equal to x but smaller than y? All numbers in the sequence are smaller than 5,000.
I am under the impression this requires something like O(log N) per query because of the big length of the sequence, and that got me thinking about BITs (binary indexed trees, because of the queries), but a 2D BIT is too big and requires way too much time even on the update side. So the only solution I see here should be a 1D BIT or segment trees, but I can't figure out how to work out a solution based on these data structures. I tried retaining the positions in the ordered set of numbers, but I can't figure out how to make a BIT that responds to queries of the given form.
Also the algorithm should fit in like 500ms for the given limits.
EDIT 1: The 500 ms covers all of the operations, both preprocessing and answering the queries.
EDIT 2: Here i and j are the positions of the first and last elements of the sequence A between which to look for elements bigger than x and smaller than y.
EDIT 3: Example:
Let there be 1, 3, 2, 4, 6, 3 and query 1, 4, 3, 5 so between positions 1 and 4 (inclusive) there are 2 elements (3 and 4) bigger (or equal) than 3 and smaller than 5
Thank you in advance! P.S: Sorry for the poor English!
Implement 2D-range counting by making a BIT-organized array of sorted subarrays. For example, on the input
[1, 3, 2, 4, 6, 3]
the oracle would be
[[1]
,[1, 3]
,[2]
,[1, 2, 3, 4]
,[6]
,[3, 6]
].
The space usage is O(N log N) (hopefully fine). Construction takes O(N log N) time if you're careful, or O(N log^2 N) time if not (no reason to be careful for your application, methinks).
To answer a query with maximums on sequence index and value (four of these can be used to answer the input queries), do the BIT read procedure for the maximum index, using binary search in the array to count the number of elements not exceeding the maximum value. The query time is O(log^2 N).
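A minimal Python sketch of this structure, using the simple O(N log^2 N) construction (all function names are ours):

from bisect import bisect_right

def build_bit(a):
    # tree[i] is a sorted copy of a[i - lowbit(i) .. i - 1] (1-indexed i),
    # exactly the "oracle" shown above.
    n = len(a)
    tree = [[] for _ in range(n + 1)]
    for i, v in enumerate(a, start=1):
        j = i
        while j <= n:
            tree[j].append(v)
            j += j & (-j)
    for node in tree:
        node.sort()
    return tree

def count_prefix(tree, idx, v):
    # Number of elements among a[0 .. idx-1] that are <= v: the BIT read
    # procedure with a binary search at each visited node, O(log^2 N).
    total = 0
    while idx > 0:
        total += bisect_right(tree[idx], v)
        idx -= idx & (-idx)
    return total

def count_range(tree, i, j, x, y):
    # How many positions p in [i, j] (1-indexed) have x <= a[p-1] < y?
    # Combines four prefix reads, as described above (integer values assumed).
    at_most = lambda v: count_prefix(tree, j, v) - count_prefix(tree, i - 1, v)
    return at_most(y - 1) - at_most(x - 1)

On a = [1, 3, 2, 4, 6, 3], build_bit(a) produces the oracle shown above, and count_range(tree, 1, 4, 3, 5) returns 2, matching the example in the question.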
What is a way to add up elements of an array so that their sum equals the largest element in the array?
For example, for the array [4, 6, 23, 10, 1, 3], I sort the array first, resulting in [1, 3, 4, 6, 10, 23], then pop the last element, max = 23. I'm left with [1, 3, 4, 6, 10] and need a way to find the elements that add up to 23, which are 3 + 4 + 6 + 10 = 23. The elements don't have to be adjacent; they can be at random points of the array, but they must add up to max.
I could generate the combinations of the sorted array from 2 elements up to n-1 elements, sum each, and compare it to max, but that seems inefficient. Please help!
This is exactly the subset sum problem, which is NP-Complete, but if your numbers are relatively small integers, there is an efficient pseudo-polynomial solution using Dynamic Programming:
D(i, 0) = TRUE
D(0, x) = FALSE, for x > 0
D(i, x) = D(i-1, x) OR (x >= arr[i] AND D(i-1, x - arr[i]))
If there is a solution, you need to step back in the matrix created by the DP solution, and "record" each choice you have made along the way, to get the elements used for the summation. This thread deals with how to find the actual elements in a very similar problem (known as knapsack problem), which is solved similarly: How to find which elements are in the bag, using Knapsack Algorithm [and not only the bag's value]?
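A minimal Python sketch of this DP together with the backtracking step that recovers the chosen elements (the function name is ours):

def subset_summing_to(nums, target):
    # Pseudo-polynomial DP for subset sum, O(len(nums) * target) time.
    # Returns one subset of nums summing to target, or None.
    n = len(nums)
    # D[i][x] is True iff some subset of nums[:i] sums to x
    D = [[False] * (target + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = True                      # the empty subset sums to 0
    for i in range(1, n + 1):
        for x in range(1, target + 1):
            D[i][x] = D[i - 1][x] or (x >= nums[i - 1] and D[i - 1][x - nums[i - 1]])
    if not D[n][target]:
        return None
    # Step back through the matrix, recording each choice made along the way
    chosen, x = [], target
    for i in range(n, 0, -1):
        if not D[i - 1][x]:                 # nums[i-1] had to be taken
            chosen.append(nums[i - 1])
            x -= nums[i - 1]
    return chosen

For the question's example, subset_summing_to([1, 3, 4, 6, 10], 23) returns [10, 6, 4, 3].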
Here is the problem: given an unsorted array a[n], I need to find the kth smallest number in the range [i, j], where 1 <= i <= j <= n and k <= j - i + 1.
Typically I would use quickselect to do the job, but it is not fast enough when there are many query requests with different ranges [i, j]. I can hardly figure out an algorithm that does the query in O(log n) time (preprocessing is allowed).
Any idea is appreciated.
PS
Let me make the problem easier to understand. Any kind of preprocessing is allowed, but each query needs to be done in O(log n) time. And there will be many (more than 1) queries, like find the 1st in range [3,7], or 3rd in range [10,17], or 11th in range [33, 52].
By range [i, j] I mean in the original array, not sorted or something.
For example, a[5] = {3,1,7,5,9}, query 1st in range [3,4] is 5, 2nd in range [1,3] is 5, 3rd in range [0,2] is 7.
If pre-processing is allowed and not counted towards the time complexity, just use that to construct sub-lists so that you can efficiently find the element you're looking for. As with most optimisations, this trades space for time.
Your pre-processing step is to take your original list of n numbers and create a number of new sublists.
Each of these sublists is a portion of the original, starting at index n, covering elements n through n+m, and then sorted. So your original list of:
{3, 1, 7, 5, 9}
gives you:
list[0][0] = {3}
list[0][1] = {1, 3}
list[0][2] = {1, 3, 7}
list[0][3] = {1, 3, 5, 7}
list[0][4] = {1, 3, 5, 7, 9}
list[1][0] = {1}
list[1][1] = {1, 7}
list[1][2] = {1, 5, 7}
list[1][3] = {1, 5, 7, 9}
list[2][0] = {7}
list[2][1] = {5, 7}
list[2][2] = {5, 7, 9}
list[3][0] = {5}
list[3][1] = {5,9}
list[4][0] = {9}
This isn't a cheap operation (in time or space), so you may want to maintain a "dirty" flag on the list so you only perform it the first time after you do a modifying operation (insert, delete, change).
In fact, you can use lazy evaluation for even more efficiency. Basically set all sublists to an empty list when you start and whenever you perform a modifying operation. Then, whenever you attempt to access a sublist and it's empty, calculate that sublist (and that one only) before trying to get the kth value out of it.
That ensures sublists are evaluated only when needed and cached to prevent unnecessary recalculation. For example, if you never ask for a value from the 3-through-6 sublist, it's never calculated.
The pseudo-code for creating all the sublists is basically (for loops inclusive at both ends):
for n = 0 to a.lastindex:
    create array list[n]
    for m = 0 to a.lastindex - n:
        create array list[n][m]
        for i = 0 to m:
            list[n][m][i] = a[n+i]
        sort list[n][m]
The code for lazy evaluation is a little more complex (but only a little), so I won't provide pseudo-code for that.
Then, in order to find the kth smallest number in the range i through j (where i and j are the original indexes), you simply look up lists[i][j-i][k-1], a very fast O(1) operation:
1st in range [3,4] (values 5,9):    lists[3][4-3=1][1-1=0] = 5
2nd in range [1,3] (values 1,7,5):  lists[1][3-1=2][2-1=1] = 5
3rd in range [0,2] (values 3,1,7):  lists[0][2-0=2][3-1=2] = 7
Here's some Python code which shows this in action:
orig = [3, 1, 7, 5, 9]
print(orig)
print("=====")
lists = []
for n in range(len(orig)):
    lists.append([])
    for m in range(len(orig) - n):
        # lists[n][m] holds orig[n .. n+m], sorted
        lists[-1].append(sorted(orig[n:n + m + 1]))
        print("(%d,%d)=%s" % (n, m, lists[-1][-1]))
print("=====")

# Gives xth smallest in index range y through z inclusive.
x = 1; y = 3; z = 4; print("(%d,%d,%d)=%d" % (x, y, z, lists[y][z - y][x - 1]))
x = 2; y = 1; z = 3; print("(%d,%d,%d)=%d" % (x, y, z, lists[y][z - y][x - 1]))
x = 3; y = 0; z = 2; print("(%d,%d,%d)=%d" % (x, y, z, lists[y][z - y][x - 1]))
print("=====")
As expected, the output is:
[3, 1, 7, 5, 9]
=====
(0,0)=[3]
(0,1)=[1, 3]
(0,2)=[1, 3, 7]
(0,3)=[1, 3, 5, 7]
(0,4)=[1, 3, 5, 7, 9]
(1,0)=[1]
(1,1)=[1, 7]
(1,2)=[1, 5, 7]
(1,3)=[1, 5, 7, 9]
(2,0)=[7]
(2,1)=[5, 7]
(2,2)=[5, 7, 9]
(3,0)=[5]
(3,1)=[5, 9]
(4,0)=[9]
=====
(1,3,4)=5
(2,1,3)=5
(3,0,2)=7
=====
This solution is O((log n)^2) per query. I am pretty sure it can be modified to run in O(log n). The main advantage of this algorithm over paxdiablo's is space efficiency: it needs O(n log n) space rather than O(n^2).
First, the complexity of finding the kth smallest element from two sorted arrays of lengths m and n is O(log m + log n). The complexity of finding the kth smallest element from sorted arrays of lengths a, b, c, d, ... is O(log a + log b + ...).
Now, sort the whole array and store it. Sort the first half and the second half of the array and store them, and so on. You will have 1 sorted array of length n, 2 sorted arrays of length n/2, 4 sorted arrays of length n/4, and so on. Total memory required: 1*n + 2*(n/2) + 4*(n/4) + 8*(n/8) + ... = n log n.
Once you have i and j, figure out the list of subarrays which, when concatenated, give you the range [i, j]. There will be about log n such arrays. Finding the kth smallest number among them takes O((log n)^2) time.
Example for the last paragraph:
Assume the array is of size 8 (indexed from 0 to 7). You have the following sorted lists:
A:0-7, B:0-3, C:4-7, D:0-1, E:2-3, F:4-5, G:6-7.
Now construct a tree with pointers to these arrays such that every node contains its immediate constituents. A will be root, B and C are its children and so on.
Now implement a recursive function that returns a list of arrays.
def getArrays(node, i, j):
    if i == node.min and j == node.max:
        return [node]                       # this node's array covers (i,j) exactly
    if j <= node.left.max:
        return getArrays(node.left, i, j)   # (i,j) is located within the left node
    elif i >= node.right.min:
        return getArrays(node.right, i, j)  # (i,j) is located within the right node
    else:
        # (i,j) is spread over the left and right nodes: split at the boundary
        return (getArrays(node.left, i, node.left.max)
                + getArrays(node.right, node.right.min, j))
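The piece not shown above is selecting the kth smallest across the returned arrays. As a hedged sketch, here is a simple binary-search-on-value version for integer data; it costs an extra log factor over the pairwise O(log a + log b + ...) method mentioned above, and the names are ours:

from bisect import bisect_right

def kth_smallest(arrays, k):
    # kth smallest (1-based) among several non-empty sorted integer arrays,
    # by binary searching on the answer's value.
    lo = min(a[0] for a in arrays)
    hi = max(a[-1] for a in arrays)
    while lo < hi:
        mid = (lo + hi) // 2
        # count elements <= mid across all arrays
        if sum(bisect_right(a, mid) for a in arrays) >= k:
            hi = mid        # the kth smallest is <= mid
        else:
            lo = mid + 1    # the kth smallest is > mid
    return lo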
Preprocess: Make an nxn array where the [k][r] element is the kth smallest element of the first r elements (1-indexed for convenience).
Then, given some particular range [i,j] and value for k, do the following:
Find the element at the [k][j] slot of the matrix; call this x.
Go down the i-1 column of your matrix and find how many values in it are smaller than or equal to x (treat column 0 as having 0 smaller entries). By construction, this column will be sorted (all columns will be sorted), so the count can be found in log time. Call this value s.
Find the element in the [k+s][j] slot of the matrix. This is your answer.
E.g., given 3 1 7 5 9
3 1 1 1 1
X 3 3 3 3
X X 7 5 5
X X X 7 7
X X X X 9
Now, if we're asked for the 2nd smallest in the [2,4] range (again, 1-indexing), I first find the 2nd smallest in the [1,4] range, which is 3. I then look at column 1 and see that there is 1 element less than or equal to 3. Finally, I find the 3rd smallest in the [1,4] range at the [3][4] slot, which is 5, as desired.
This takes n^2 space, and log(n) lookup time.
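A hedged Python sketch of this scheme, storing column r as the sorted first r elements (the names are ours). One adjustment: a single [k+s][j] lookup can undershoot when elements before position i are smaller than the new candidate as well, so this sketch repeats the column-count step until s stabilizes:

from bisect import bisect_right

def preprocess(a):
    # cols[r] is the sorted first r elements of a (r = 0 .. n),
    # i.e. column r of the matrix described above.
    cols = [[]]
    for v in a:
        cols.append(sorted(cols[-1] + [v]))
    return cols

def kth_in_range(cols, i, j, k):
    # kth smallest of a[i..j], all 1-indexed.
    s = 0
    while True:
        x = cols[j][k + s - 1]             # (k+s)th smallest of the first j
        s2 = bisect_right(cols[i - 1], x)  # elements before i that are <= x
        if s2 == s:
            return x
        s = s2

With a = [3, 1, 7, 5, 9], kth_in_range(preprocess(a), 2, 4, 2) returns 5, matching the worked example.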
This one does not require preprocessing but is somewhat slower than O(log N). It's significantly faster than a naive iterate-and-count, and it supports dynamic modification of the sequence.
It goes like this. Suppose the length n satisfies n = 2^x for some x. Construct a segment tree whose root node represents [0, n-1]. For each node representing [a, b] with b > a, give it two child nodes representing [a, (a+b)/2] and [(a+b)/2+1, b]. (That is, recursively divide by two.)
Then, on each node, maintain a separate binary search tree for the numbers within that segment. Each modification of the sequence therefore takes O(log N) [on the segment tree] * O(log N) [on the BST]. Queries can be done like this: let Q(a,b,x) be the rank of x within the segment [a,b]. If Q(a,b,x) can be computed efficiently, a binary search on x computes the desired answer (with an extra O(log E) factor for the binary search over values).
Q(a,b,x) can be computed as follows: find the smallest set of segments that make up [a,b], which can be done in O(log N) on the segment tree. For each such segment, query its binary search tree for the number of elements less than x. Add up these counts to get Q(a,b,x).
This should be O(log N * log E * log N). Well, not exactly what you asked for, though.
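A hedged Python sketch of the static version, with a sorted list per node standing in for the per-node BSTs (fine when the sequence is not being modified; the names are ours):

from bisect import bisect_left

def build(a):
    # tree[node] is a sorted copy of the segment that node covers;
    # node 1 is the root covering [0, n-1], children of v are 2v and 2v+1.
    tree = {}
    def go(node, lo, hi):
        if lo == hi:
            tree[node] = [a[lo]]
        else:
            mid = (lo + hi) // 2
            go(2 * node, lo, mid)
            go(2 * node + 1, mid + 1, hi)
            tree[node] = sorted(tree[2 * node] + tree[2 * node + 1])
    go(1, 0, len(a) - 1)
    return tree

def rank(tree, n, a_lo, a_hi, x):
    # Q(a_lo, a_hi, x): how many elements in positions [a_lo, a_hi]
    # are less than x. Visits O(log N) nodes, binary search in each.
    def go(node, lo, hi):
        if a_hi < lo or hi < a_lo:
            return 0                           # disjoint segment
        if a_lo <= lo and hi <= a_hi:
            return bisect_left(tree[node], x)  # fully covered segment
        mid = (lo + hi) // 2
        return go(2 * node, lo, mid) + go(2 * node + 1, mid + 1, hi)
    return go(1, 0, n - 1)

For a = [3, 1, 7, 5, 9], rank(build(a), len(a), 1, 3, 6) returns 2, since positions 1 through 3 hold {1, 7, 5} and two of those are less than 6; a binary search on x over such rank calls then yields the kth smallest.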
In O(log n) time it's not possible even to read all of the elements of the array. Since it's not sorted and no other information is provided, this is impossible.
There's no way you can do better than O(n) in both the worst and average case: you have to look at every single element.