Dividing the elements of an array in 3 groups - algorithm

I have to divide the elements of an array into 3 groups, and this needs to be done without sorting the array. Consider an example:
we have 120 unsorted values, so the smallest 40 values need to go in the first group, the next 40 in the second, and the largest 40 in the third group.
I was thinking of the median-of-medians approach but am not able to apply it to my problem. Kindly suggest an algorithm.

You can call quickselect twice on your array to do this in-place and in average case linear time. The worst case runtime can also be improved to O(n) by using the linear time median of medians algorithm to choose an optimal pivot for quickselect.
For both calls to quickselect, use k = n / 3. On your first call, use quickselect on the entire array, and on your second call, use it from arr[k..n-1] (for a 0-indexed array).
Wikipedia explanation of quickselect:
Quickselect uses the same overall approach as quicksort, choosing one
element as a pivot and partitioning the data in two based on the
pivot, accordingly as less than or greater than the pivot. However,
instead of recursing into both sides, as in quicksort, quickselect
only recurses into one side – the side with the element it is
searching for. This reduces the average complexity from O(n log n) (in
quicksort) to O(n) (in quickselect).
As with quicksort, quickselect is generally implemented as an in-place
algorithm, and beyond selecting the kth element, it also partially
sorts the data. See selection algorithm for further discussion of the
connection with sorting.
To divide the elements of the array into 3 groups, use the following algorithm written in Python in combination with quickselect:
k = n / 3
# First group smallest elements in array
quickselect(L, 0, n - 1, k) # Call quickselect on your entire array
# Then group middle elements in array
quickselect(L, k, n - 1, k) # Call quickselect on subarray
# Largest elements in array are already grouped so
# there is no need to call quickselect again
The key point of all this is that quickselect uses a subroutine called partition. Partition arranges an array into two parts: elements greater than a given pivot element and elements less than it. It thus partially sorts the array around that element and returns the pivot's new, sorted position. So by using quickselect you partially sort the array around the kth element (note that this is different from sorting the entire array), in place and in average-case linear time.
Time complexity of quickselect:
Worst-case performance: O(n²)
Best-case performance: O(n)
Average-case performance: O(n)
The runtime of quickselect is almost always linear and not quadratic, but this depends on the fact that for most arrays, simply choosing a random pivot point will almost always yield linear runtime. However, if you want to improve the worst case performance for your quickselect, you can choose to use the median of medians algorithm before each call to approximate an optimal pivot to be used for quickselect. In doing so, you will improve the worst case performance of your quickselect algorithm to O(n). This overhead probably isn't necessary but if you are dealing with large lists of randomized integers it can prevent some abnormal quadratic runtimes of your algorithm.
Here is a complete example in Python which implements quickselect and applies it twice to a reverse-sorted list of 120 integers and prints out the three resulting sublists.
from random import randint

def partition(L, left, right, pivotIndex):
    '''partition L so it's ordered around L[pivotIndex]
    also return its new sorted position in array'''
    pivotValue = L[pivotIndex]
    L[pivotIndex], L[right] = L[right], L[pivotIndex]
    storeIndex = left
    for i in xrange(left, right):
        if L[i] < pivotValue:
            L[storeIndex], L[i] = L[i], L[storeIndex]
            storeIndex = storeIndex + 1
    L[right], L[storeIndex] = L[storeIndex], L[right]
    return storeIndex

def quickselect(L, left, right, k):
    '''retrieve kth smallest element of L[left..right] inclusive
    additionally partition L so that it's ordered around kth
    smallest element'''
    if left == right:
        return L[left]
    # Randomly choose pivot and partition around it
    pivotIndex = randint(left, right)
    pivotNewIndex = partition(L, left, right, pivotIndex)
    pivotDist = pivotNewIndex - left + 1
    if pivotDist == k:
        return L[pivotNewIndex]
    elif k < pivotDist:
        return quickselect(L, left, pivotNewIndex - 1, k)
    else:
        return quickselect(L, pivotNewIndex + 1, right, k - pivotDist)

def main():
    # Setup array of 120 elements [120..1]
    n = 120
    L = range(n, 0, -1)
    k = n / 3
    # First group smallest elements in array
    quickselect(L, 0, n - 1, k)   # Call quickselect on your entire array
    # Then group middle elements in array
    quickselect(L, k, n - 1, k)   # Call quickselect on subarray
    # Largest elements in array are already grouped so
    # there is no need to call quickselect again
    print L[:k], '\n'
    print L[k:k*2], '\n'
    print L[k*2:]

if __name__ == '__main__':
    main()

I would take a look at order statistics. The kth order statistic of a statistical sample is equal to its kth-smallest value. The problem of computing the kth smallest (or largest) element of a list is called the selection problem and is solved by a selection algorithm.
You are right to think of the median-of-medians approach. However, instead of finding the median, you want to find both the 40th and the 80th smallest elements of the array (the n/3-th and 2n/3-th order statistics). Just like finding the median, finding both of them takes only linear time using a selection algorithm. Finally you go over the array and partition the elements according to these two values, which is linear time as well.
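A minimal Python sketch of that idea (three_groups is an illustrative name, and sorted() stands in here for the linear-time selection step; it assumes distinct values, as in the question):

def three_groups(values):
    n = len(values)
    ranked = sorted(values)     # replace with quickselect / median of medians for O(n)
    lo_cut, hi_cut = ranked[n // 3 - 1], ranked[2 * n // 3 - 1]
    small = [v for v in values if v <= lo_cut]
    middle = [v for v in values if lo_cut < v <= hi_cut]
    large = [v for v in values if v > hi_cut]
    return small, middle, large

# For 120 distinct values this yields three groups of 40.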
PS. If this is your exercise in an algorithm class, this might help you :)

Allocate a second array of the same size as the input array.
Scan the input array once, keeping track of the min and max values of the array,
and at the same time set all the values of the second array to 1.
Compute delta = (max - min) / 3.
Scan the array again and set the second array's entry to 2 if the number is > min + delta and < max - delta; otherwise, if it is >= max - delta, set it to 3.
As a result you will have an array that tells which group each number is in.
I am assuming that all the numbers are different from each other.
Complexity: O(2n)
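For illustration, a small Python sketch of this counting pass (group_by_range is a name I've introduced; it labels each value 1, 2 or 3 by which third of the [min, max] value range it falls in):

def group_by_range(values):
    lo, hi = min(values), max(values)      # the min and max of the input
    delta = (hi - lo) / 3.0
    groups = [1] * len(values)             # default: first group
    for i, v in enumerate(values):
        if v >= hi - delta:
            groups[i] = 3
        elif v > lo + delta:
            groups[i] = 2
    return groups

# group_by_range([7, 1, 9, 4, 2, 8]) -> [3, 1, 3, 2, 1, 3]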

Related

How to calculate the maximum median in an array

This is an algorithm question:
Input is an array with non-duplicate positive integers. Find a continuous subarray(size > 1) which has the maximum median value.
Example: input: [100, 1, 99, 2, 1000], output should be the result of (1000 + 2) / 2 = 501
I can come up with the brute-force solution: try all lengths from 2 to the array size to find the maximum median. But it seems too slow. I also tried to use two pointers on this question but am not sure when to move the left and right pointers.
Anyone has a better idea to solve this question?
tl;dr - We can show that the answer must be of length 2 or 3, after which it's linear time to check all the possibilities.
Let's say the input is A and the smallest subarray with the biggest median is a. The biggest median is either a single element or the average of a pair of elements from a. Notice that every element of a that is bigger than the largest element of the median can only be next to elements less than the smallest element of the median (otherwise such a pair could be chosen as a subarray to form a bigger median).
If either end of a had a pair of elements that didn't include an element of the median, it could be eliminated from a without affecting the median, a contradiction.
If either end of a was smaller than the smallest element of the median, eliminating it would increase the median, a contradiction.
Thus each end of a is either an element of the median or larger than the largest element of the median (because it's larger than the smallest elt of the median and not equal to the largest elt of the median).
Thus each end of a is an element of the median because otherwise, we'd have an element larger than an element of the median adjacent to an elt of the median, forming a larger median.
If a is odd then it must be of length three, since any larger odd length could have 2 removed from the end of a farthest from the median without changing the median.
If a is even then it must be of length 2 because any larger even length bookended by the elements of the median with interior elements alternating between smaller and larger than the median must have one of the median elements adjacent to a larger element than the other elt of the median, forming a larger median.
This proof outline could use some editing, but regardless, the conclusion is that the smallest array containing the largest median must be of length 2 or 3.
Given that, check every such subarray in linear time. O(n).
This is a Python implementation of an algorithm that solves the problem in O(n):
import random
import statistics

n = 50
numbers = random.sample(range(n), n)
max_m = 0
max_a = []
# Only sub-arrays of length 2 and 3 need to be checked (see below).
for i in range(2, 4):
    for j in range(0, n - i + 1):
        a = numbers[j:j + i]
        m = statistics.median(a)
        if m > max_m:
            max_m = m
            max_a = a
print(numbers)
print(max_m)
print(max_a)
This is a variation of the brute-force algorithm (O(n^3)) that only searches sub-arrays of length 2 or 3. The reason is that for every array of size greater than 3, there exists a shorter sub-array with the same or a better median. Applying this reasoning recursively, we can reduce the size of the sub-array to 2 or 3. Thus, by looking only at sub-arrays of size 2 or 3, we are guaranteed to find the sub-array with the maximum median.
The operation is the following: If, for a contiguous sub-array (at the beginning or at the end), at least half of the elements are lower than the median (or lower than both values forming the median, if this is the case), remove them to improve or at least preserve the median.
If in all sub-arrays there is always at least one more element above or equal to the median(s) than below, there will come a point where the size of the sub-array will be that of the median. In that case, it means that the complement will have more elements below the median, and thus, we can simply remove the complement and improve (or preserve) the median. Thus, we can always perform the operation. For n=3, it can happen that you need to remove 2 or 3 elements to perform the operation, which is not allowed. In this case, the result is the list itself.

How to choose the least number of weights to get a total weight in O(n) time

There are n unsorted weights, and I need to find the smallest number of weights whose total is at least W.
How do I find them in O(n)?
This problem has many solution methods:
Method 1 - Sorting - O(nlogn)
I guess that the most trivial one would be to sort in descending order and then to take the first K elements that give a sum of at least W. The time complexity will be though O(nlogn).
Method 2 - Max Heap - O(n + klogn)
Another method would be to use a max heap.
Creating the heap will take O(n), and then we extract elements until we reach a total sum of at least W. Each extraction takes O(logn), so the extraction phase is O(klogn), where k is the number of elements we had to extract from the heap.
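A minimal Python sketch of this max-heap method (min_weights_maxheap is an illustrative name; heapq only provides a min-heap, so the weights are negated):

import heapq

def min_weights_maxheap(weights, W):
    heap = [-w for w in weights]            # negate so the largest weight is on top
    heapq.heapify(heap)                     # O(n)
    total, count = 0, 0
    while heap and total < W:               # pop the largest remaining weight until we reach W
        total += -heapq.heappop(heap)
        count += 1
    return count if total >= W else None    # None means W is unreachable

# min_weights_maxheap([5, 1, 8, 3, 9], 14) -> 2 (9 + 8 >= 14)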
Method 3 - Using Min Heap - O(nlogk)
Adding this method that JimMischel suggested in the comments below.
Create a min heap with the first k elements of the list whose sum is at least W. Then, iterate over the remaining elements; if an element is greater than the minimum (the heap top), replace the top with it.
At this point we might be holding more total weight than we actually need to reach W, so we just keep extracting the minimum for as long as the remaining sum still reaches W:
find_min_set(A, W)
    currentW = 0
    heap H                                // create an empty min-heap
    for each Elem in A
        if (currentW < W)
            H.add(Elem)
            currentW += Elem
        else if (Elem > H.top())
            currentW += (Elem - H.top())
            H.pop()
            H.add(Elem)
    while (currentW - H.top() >= W)       // drop minima while the sum still reaches W
        currentW -= H.top()
        H.pop()
This method might be even faster in practice, depending on the relation between k and n. See when theory meets practice.
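For reference, a Python sketch of this min-heap variant (min_weights_minheap is an illustrative name):

import heapq

def min_weights_minheap(weights, W):
    heap, total = [], 0
    for w in weights:
        if total < W:                          # still short of W: take the weight
            heapq.heappush(heap, w)
            total += w
        elif heap and w > heap[0]:             # swap out the smallest taken weight
            total += w - heapq.heappushpop(heap, w)
    while heap and total - heap[0] >= W:       # shed minima while the sum still reaches W
        total -= heapq.heappop(heap)
    return len(heap) if total >= W else None   # None means W is unreachable

# min_weights_minheap([5, 1, 8, 3, 9], 14) -> 2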
Method 4 - O(n)
The best method I could think of will be using some kind of quickselect while keeping track of the total weight and always partitioning with the median as a pivot.
First, let's define a few things:
sum(A) - The total sum of all elements in array A.
num(A) - The number of elements in array A.
med(A) - The median of the array A.
find_min_set(A, W, T)
    // partition A:
    // L contains all the elements of A that are less than med(A)
    // R contains all the elements of A that are greater than or equal to med(A)
    L, R = partition(A, med(A))
    if (sum(R) == W)
        return T + num(R)
    if (sum(R) > W)
        return find_min_set(R, W, T)
    if (sum(R) < W)
        return find_min_set(L, W - sum(R), num(R) + T)
Call this method as find_min_set(A, W, 0).
Runtime Complexity:
Finding median is O(n).
Partitioning is O(n).
Each recursive call is taking half of the size of the array.
Summing it all up we get the following recurrence: T(n) = T(n/2) + O(n), which is the same as the average case of quickselect: O(n).
Note: When all values are unique, both the worst-case and average complexity are indeed O(n). With possible duplicate values, the average complexity is still O(n), but the worst case is O(nlogn) when using the median-of-medians method for selecting the pivot.

How to find pair with kth largest sum?

Given two sorted arrays of numbers, we want to find the pair with the kth largest possible sum. (A pair is one element from the first array and one element from the second array). For example, with arrays
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
The pairs with largest sums are
13 + 16 = 29
13 + 12 = 25
8 + 16 = 24
13 + 8 = 21
8 + 12 = 20
So the pair with the 4th largest sum is (13, 8). How to find the pair with the kth largest possible sum?
Also, what is the fastest algorithm? The arrays are already sorted, with sizes M and N.
I am already aware of the O(k log k) solution using a max-heap, given here.
It is also one of the favorite Google interview questions, and they demand an O(k) solution.
I've also read somewhere that there exists an O(k) solution, which I am unable to figure out.
Can someone explain the correct solution with pseudocode?
P.S.
Please DON'T post this link as an answer/comment. It DOESN'T contain the answer.
I start with a simple but not quite linear-time algorithm. We choose some value between array1[0]+array2[0] and array1[N-1]+array2[N-1]. Then we determine how many pair sums are greater than this value and how many of them are less. This may be done by iterating the arrays with two pointers: pointer to the first array incremented when sum is too large and pointer to the second array decremented when sum is too small. Repeating this procedure for different values and using binary search (or one-sided binary search) we could find Kth largest sum in O(N log R) time, where N is size of the largest array and R is number of possible values between array1[N-1]+array2[N-1] and array1[0]+array2[0]. This algorithm has linear time complexity only when the array elements are integers bounded by small constant.
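A small sketch of that counting step (count_pairs_at_least is a name I've introduced; both arrays are assumed sorted in ascending order):

def count_pairs_at_least(a, b, value):
    # Count pairs (x, y) with x from a, y from b and x + y >= value,
    # in O(len(a) + len(b)): as x grows, the cutoff index in b only moves left.
    count, j = 0, len(b)           # b[j:] are the elements that qualified for the previous x
    for x in a:
        while j > 0 and x + b[j - 1] >= value:
            j -= 1
        count += len(b) - j
    return count

# count_pairs_at_least([2, 3, 5, 8, 13], [4, 8, 12, 16], 21) -> 5 (sums 29, 25, 24, 21, 21)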
Previous algorithm may be improved if we stop binary search as soon as number of pair sums in binary search range decreases from O(N²) to O(N). Then we fill auxiliary array with these pair sums (this may be done with slightly modified two-pointers algorithm). And then we use quickselect algorithm to find Kth largest sum in this auxiliary array. All this does not improve worst-case complexity because we still need O(log R) binary search steps. What if we keep the quickselect part of this algorithm but (to get proper value range) we use something better than binary search?
We could estimate value range with the following trick: get every second element from each array and try to find the pair sum with rank k/4 for these half-arrays (using the same algorithm recursively). Obviously this should give some approximation for needed value range. And in fact slightly improved variant of this trick gives range containing only O(N) elements. This is proven in following paper: "Selection in X + Y and matrices with sorted rows and columns" by A. Mirzaian and E. Arjomandi. This paper contains detailed explanation of the algorithm, proof, complexity analysis, and pseudo-code for all parts of the algorithm except Quickselect. If linear worst-case complexity is required, Quickselect may be augmented with Median of medians algorithm.
This algorithm has complexity O(N). If one of the arrays is shorter than other array (M < N) we could assume that this shorter array is extended to size N with some very small elements so that all calculations in the algorithm use size of the largest array. We don't actually need to extract pairs with these "added" elements and feed them to quickselect, which makes algorithm a little bit faster but does not improve asymptotic complexity.
If k < N we could ignore all the array elements with index greater than k. In this case complexity is equal to O(k). If N < k < N(N-1) we just have better complexity than requested in OP. If k > N(N-1), we'd better solve the opposite problem: k'th smallest sum.
I uploaded simple C++11 implementation to ideone. Code is not optimized and not thoroughly tested. I tried to make it as close as possible to pseudo-code in linked paper. This implementation uses std::nth_element, which allows linear complexity only on average (not worst-case).
A completely different approach to find K'th sum in linear time is based on priority queue (PQ). One variation is to insert largest pair to PQ, then repeatedly remove top of PQ and instead insert up to two pairs (one with decremented index in one array, other with decremented index in other array). And take some measures to prevent inserting duplicate pairs. Other variation is to insert all possible pairs containing largest element of first array, then repeatedly remove top of PQ and instead insert pair with decremented index in first array and same index in second array. In this case there is no need to bother about duplicates.
OP mentions O(K log K) solution where PQ is implemented as max-heap. But in some cases (when array elements are evenly distributed integers with limited range and linear complexity is needed only on average, not worst-case) we could use O(1) time priority queue, for example, as described in this paper: "A Complexity O(1) Priority Queue for Event Driven Molecular Dynamics Simulations" by Gerald Paul. This allows O(K) expected time complexity.
Advantage of this approach is a possibility to provide first K elements in sorted order. Disadvantages are limited choice of array element type, more complex and slower algorithm, worse asymptotic complexity: O(K) > O(N).
EDIT: This does not work. I leave the answer, since apparently I am not the only one who could have this kind of idea; see the discussion below.
A counter-example is x = (2, 3, 6), y = (1, 4, 5) and k=3, where the algorithm gives 7 (3+4) instead of 8 (3+5).
Let x and y be the two arrays, sorted in decreasing order; we want to construct the K-th largest sum.
The variables are: i, the index in the first array (element x[i]); j, the index in the second array (element y[j]); and k, the "order" of the sum (k in 1..K), in the sense that S(k) = x[i] + y[j] will be the k-th greatest sum satisfying your conditions (this is the loop invariant).
Start from (i, j) equal to (0, 0): clearly, S(1) = x[0] + y[0].
for k from 1 to K-1, do:
    if x[i+1] + y[j] > x[i] + y[j+1], then i := i+1 (and j does not change); else j := j+1
To see that it works, consider that you have S(k) = x[i] + y[j]. Then S(k+1) is the greatest sum which is lower than (or equal to) S(k) and such that at least one index (i or j) changes. It is not difficult to see that exactly one of i or j should change.
If i changes, the greatest sum you can construct which is lower than S(k) is obtained by setting i = i+1, because x is decreasing and all the x[i'] + y[j] with i' < i are greater than S(k). The same holds for j, showing that S(k+1) is either x[i+1] + y[j] or x[i] + y[j+1].
Therefore, at the end of the loop you have found the K-th greatest sum.
tl;dr: If you look ahead and look behind at each iteration, you can start with the end (which is highest) and work back in O(K) time.
Although the insight underlying this approach is, I believe, sound, the code below is not quite correct at present (see comments).
Let's see: first of all, the arrays are sorted. So, if the arrays are a and b with lengths M and N, and as you have arranged them, the largest items are in slots M and N respectively, the largest pair will always be a[M]+b[N].
Now, what's the second largest pair? It's going to have perhaps one of {a[M],b[N]} (it can't have both, because that's just the largest pair again), and at least one of {a[M-1],b[N-1]}. BUT, we also know that if we choose a[M-1]+b[N-1], we can make one of the operands larger by choosing the higher number from the same list, so it will have exactly one number from the last column, and one from the penultimate column.
Consider the following two arrays: a = [1, 2, 53]; b = [66, 67, 68]. Our highest pair is 53+68. If we lose the smaller of those two, our pair is 68+2; if we lose the larger, it's 53+67. So, we have to look ahead to decide what our next pair will be. The simplest lookahead strategy is simply to calculate the sum of both possible pairs. That will always cost two additions and two comparisons for each transition (three when we need to deal with the case where the sums are equal); let's call that cost Q.
At first, I was tempted to repeat that K-1 times. BUT there's a hitch: the next largest pair might actually be the other pair we can validly make from {a[M], b[N]} and {a[M-1], b[N-1]}. So, we also need to look behind.
So, let's code (python, should be 2/3 compatible):
def kth(a,b,k):
    M = len(a)
    N = len(b)
    if k > M*N:
        raise ValueError("There are only %s possible pairs; you asked for the %sth largest, which is impossible" % (M*N, k))
    (ia,ib) = M-1,N-1 #0 based arrays
    # we need this for lookback
    nottakenindices = (0,0) # could be any value
    nottakensum = float('-inf')
    for i in range(k-1):
        optionone = a[ia]+b[ib-1]
        optiontwo = a[ia-1]+b[ib]
        biggest = max((optionone,optiontwo))
        #first deal with look behind
        if nottakensum > biggest:
            if optionone == biggest:
                newnottakenindices = (ia,ib-1)
            else: newnottakenindices = (ia-1,ib)
            ia,ib = nottakenindices
            nottakensum = biggest
            nottakenindices = newnottakenindices
        #deal with case where indices hit 0
        elif ia <= 0 and ib <= 0:
            ia = ib = 0
        elif ia <= 0:
            ib-=1
            ia = 0
            nottakensum = float('-inf')
        elif ib <= 0:
            ia-=1
            ib = 0
            nottakensum = float('-inf')
        #lookahead cases
        elif optionone > optiontwo:
            #then choose the first option as our next pair
            nottakensum,nottakenindices = optiontwo,(ia-1,ib)
            ib-=1
        elif optionone < optiontwo: # choose the second
            nottakensum,nottakenindices = optionone,(ia,ib-1)
            ia-=1
        #next two cases apply if options are equal
        elif a[ia] > b[ib]: # drop the smallest
            nottakensum,nottakenindices = optiontwo,(ia-1,ib)
            ib-=1
        else: # might be equal or not - we can choose arbitrarily if equal
            nottakensum,nottakenindices = optionone,(ia,ib-1)
            ia-=1
        #+2 - one for zero-based, one for skipping the 1st largest
        data = (i+2,a[ia],b[ib],a[ia]+b[ib],ia,ib)
        narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
        print (narrative) #this will work in both versions of python
        if ia <= 0 and ib <= 0:
            raise ValueError("Both arrays exhausted before Kth (%sth) pair reached" % data[0])
    return data, narrative
For those without python, here's an ideone: http://ideone.com/tfm2MA
At worst, we have 5 comparisons in each iteration, and K-1 iterations, which means that this is an O(K) algorithm.
Now, it might be possible to exploit information about differences between values to optimise this a little bit, but this accomplishes the goal.
Here's a reference implementation (not O(K), but will always work, unless there's a corner case with cases where pairs have equal sums):
import itertools

def refkth(a, b, k):
    # sort every (index, value) pair combination by sum, largest first, and take the kth one
    (rightia, righta), (rightib, rightb) = sorted(
        itertools.product(enumerate(a), enumerate(b)),
        key=lambda pair: pair[0][1] + pair[1][1],
        reverse=True)[k - 1]
    data = (k, righta, rightb, righta + rightb, rightia, rightib)
    narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
    print (narrative) #this will work in both versions of python
    return data, narrative
This calculates the cartesian product of the two arrays (i.e. all possible pairs), sorts the pairs by sum in descending order, and takes the kth element. The enumerate function decorates each item with its index.
The max-heap algorithm in the other question is simple, fast and correct. Don't knock it. It's really well explained too. https://stackoverflow.com/a/5212618/284795
Maybe there isn't any O(k) algorithm. That's okay; O(k log k) is almost as fast.
If the last two solutions were at (a1, b1), (a2, b2), then it seems to me there are only four candidate solutions (a1-1, b1) (a1, b1-1) (a2-1, b2) (a2, b2-1). This intuition could be wrong. Surely there are at most four candidates for each coordinate, and the next highest is among the 16 pairs (a in {a1,a2,a1-1,a2-1}, b in {b1,b2,b1-1,b2-1}). That's O(k).
(No it's not, still not sure whether that's possible.)
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
Merge the 2 arrays and note down the indexes in the sorted array. Here is what the index array looks like (starting from 1, not 0):
[1, 2, 4, 6, 8]
[3, 5, 7, 9]
Now start from the end and make tuples; sum the elements in each tuple and pick the kth largest sum.
public static List<List<Integer>> optimization(int[] nums1, int[] nums2, int k) {
    // 2 * O(n log(n))
    Arrays.sort(nums1);
    Arrays.sort(nums2);
    List<List<Integer>> results = new ArrayList<>(k);
    int endIndex = 0;
    // Find the number whose square is the first one bigger than k
    for (int i = 1; i <= k; i++) {
        if (i * i >= k) {
            endIndex = i;
            break;
        }
    }
    // The following iteration provides at most endIndex^2 elements, and both arrays are in ascending order,
    // so the k smallest pairs must be found in this iteration. To flatten the nested loop, refer to
    // 'https://stackoverflow.com/questions/7457879/algorithm-to-optimize-nested-loops'
    for (int i = 0; i < endIndex * endIndex; i++) {
        int m = i / endIndex;
        int n = i % endIndex;
        List<Integer> item = new ArrayList<>(2);
        item.add(nums1[m]);
        item.add(nums2[n]);
        results.add(item);
    }
    results.sort(Comparator.comparing(pair -> pair.get(0) + pair.get(1)));
    return results.stream().limit(k).collect(Collectors.toList());
}
Key to eliminating O(n^2):
Avoid the cartesian product (or 'cross join'-like operation) of both arrays, which means flattening the nested loop.
Downsize the iteration over the 2 arrays.
So:
Sort both arrays (Arrays.sort offers O(n log(n)) performance according to the Java docs).
Limit the iteration range to a size just big enough to cover the k smallest pairs.

Algorithm to find 100 closest stars to the origin

First let me phrase the proper question:
Q: There is a file containing more than a million points (x,y) each of which represents a star. There is a planet earth at (a,b). Now, the task is to build an algorithm that would return the 100 closest stars to earth. What would be the time and space complexities of your algorithm.
This question has been asked many times in various interviews. I tried looking up the answers but could not find a satisfactory one.
One way to do it, which I thought of, might be using a max heap of size 100: calculate the distance for each star and check if it is less than the root of the max heap. If yes, replace the root with it and call heapify.
Any other better/faster answers?
P.S: This is not a homework question.
You can actually do this in time O(n) and space O(k), where k is the number of closest points that you want, by using a very clever trick.
The selection problem is as follows: Given an array of elements and some index i, rearrange the elements of the array such that the ith element is in the right place, all elements smaller than the ith element are to the left, and all elements greater than the ith element are to the right. For example, given the array
40 10 00 30 20
If I tried to select based on index 2 (zero-indexed), one result might be
10 00 20 40 30
Since the element at index 2 (20) is in the right place, the elements to the left are smaller than 20, and the elements to the right are greater than 20.
It turns out that since this is a less strict requirement than actually sorting the array, it's possible to do this in time O(n), where n is the number of elements of the array. Doing so requires some complex algorithms like the median-of-medians algorithm, but is indeed O(n) time.
So how do you use this here? One option is to load all n elements from the file into an array, then use the selection algorithm to select the top k in O(n) time and O(n) space (here, k = 100).
But you can actually do better than this! For any constant k that you'd like, maintain a buffer of 2k elements. Load 2k elements from the file into the array, then use the selection algorithm to rearrange it so that the smallest k elements are in the left half of the array and the largest are in the right, then discard the largest k elements (they can't be any of the k closest points). Now, load k more elements from the file into the buffer and do this selection again, and repeat this until you've processed every line of the file. Each time you do a selection you discard the largest k elements in the buffer and retain the k closest points you've seen so far. Consequently, at the very end, you can select the top k elements one last time and find the top k.
What's the complexity of the new approach? Well, you're using O(k) memory for the buffer and the selection algorithm. You end up calling select on a buffer of size O(k) a total of O(n / k) times, since you call select after reading k new elements. Since select on a buffer of size O(k) takes time O(k), the total runtime here is O(n + k). If k = O(n) (a reasonable assumption), this takes time O(n), space O(k).
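A minimal Python sketch of that buffered approach (k_closest is an illustrative name, and heapq.nsmallest stands in for a true linear-time selection, so this sketch is O(n log k) rather than strictly O(n)):

import heapq

def k_closest(points, origin, k):
    # Stream the points, never holding more than 2k candidates at once.
    ax, ay = origin
    def dist2(p):                          # squared distance is enough for ranking
        return (p[0] - ax) ** 2 + (p[1] - ay) ** 2
    buf = []
    for p in points:
        buf.append(p)
        if len(buf) >= 2 * k:              # buffer full: keep the k closest, drop the rest
            buf = heapq.nsmallest(k, buf, key=dist2)
    return heapq.nsmallest(k, buf, key=dist2)   # final selection

# k_closest([(3, 4), (1, 1), (10, 10), (0, 2), (5, 5)], (0, 0), 2) -> [(1, 1), (0, 2)]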
Hope this helps!
To elaborate on the MaxHeap solution: you would build a max-heap with the first k elements from the file (k = 100 in this case).
The key for the max-heap would be the element's distance from Earth (a,b). The distance between 2 points on a 2d plane can be calculated using:
dist((x1,y1), (x2,y2)) = sqrt((x2 - x1)^2 + (y2 - y1)^2)
This would take O(k) time to construct. For every subsequent element (the remaining n - k elements), compute its distance from earth and compare it with the top of the max-heap. If the new element is closer to earth than the top of the max-heap, replace the top of the max-heap and call heapify on the new root of the heap.
This would take O((n-k)logk) time to complete.
Finally we are left with just the k closest elements in the max-heap. Extracting all of them takes another O(klogk).
Overall time complexity would be O(k + (n-k)logk + klogk).
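A small Python sketch of that bounded max-heap (k_closest_heap is an illustrative name; heapq is a min-heap, so the negated squared distance serves as the max-heap key):

import heapq

def k_closest_heap(stars, earth, k):
    ax, ay = earth
    heap = []                                  # entries are (-squared_distance, star)
    for (x, y) in stars:
        d = (x - ax) ** 2 + (y - ay) ** 2      # sqrt is not needed just for ranking
        if len(heap) < k:
            heapq.heappush(heap, (-d, (x, y)))
        elif d < -heap[0][0]:                  # closer than the current farthest kept
            heapq.heapreplace(heap, (-d, (x, y)))
    return [star for _, star in heap]

# k_closest_heap([(3, 4), (1, 1), (10, 10), (0, 2)], (0, 0), 2) -> [(0, 2), (1, 1)] (in heap order)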
It's a famous question and there have been lots of solutions for it:
http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
If you did not find it useful, there are some other resources, such as Rurk's computational geometry book.
Your algorithm is correct. Just remember that the time complexity of your program is O(n · log 100) = O(n), unless the number of closest points to find can vary.
from math import sqrt

iFile = open('./file_copd.out', 'rU')
earth = [0, 0]

## getDistance returns the distance between two stars
def getDistance(star1, star2):
    return sqrt((star1[0]-star2[0])**2 + (star1[1]-star2[1])**2)

## dict_galaxy looks like {key: [distance, [x, y]]}; the key is a sequence number assigned to each star,
## e.g. {1: [distance1, [x, y]], 2: [distance2, [x, y]]}
dict_galaxy = {}
count = 0
sour = iFile.readlines()
for line in sour:
    star = [float(v) for v in line.split(',')]   ## star is a list [x, y]
    dict_galaxy[count] = [getDistance(earth, star), star]
    count += 1

## Now sort this dictionary based on distance, and return a list of keys.
list_sorted_key = sorted(dict_galaxy, key=lambda x: dict_galaxy[x][0])
print 'is this what you want %s' % (list_sorted_key[:100])
iFile.close()

How to find k nearest neighbors to the median of n distinct numbers in O(n) time?

I can use the median of medians selection algorithm to find the median in O(n). Also, I know that after the algorithm is done, all the elements to the left of the median are less than the median and all the elements to the right are greater than the median. But how do I find the k nearest neighbors to the median in O(n) time?
If the median is n, the numbers to the left are less than n and the numbers to the right are greater than n.
However, the array is not sorted in the left or the right sides. The numbers are any set of distinct numbers given by the user.
The problem is from Introduction to Algorithms by Cormen, problem 9.3-7
No one seems to quite have this. Here's how to do it. First, find the median as described above. This is O(n). Now park the median at the end of the array, and subtract the median from every other element. Now find element k of the array (not including the last element), using the quick select algorithm again. This not only finds element k (in order), it also leaves the array so that the lowest k numbers are at the beginning of the array. These are the k closest to the median, once you add the median back in.
The median-of-medians probably doesn't help much in finding the nearest neighbours, at least for large n. True, you have each column of 5 partitioned around its median, but this isn't enough ordering information to solve the problem.
I'd just treat the median as an intermediate result, and treat the nearest neighbours as a priority queue problem...
Once you have the median from the median-of-medians, keep a note of its value.
Run the heapify algorithm on all your data - see Wikipedia - Binary Heap. In comparisons, base the result on the difference relative to that saved median value. The highest priority items are those with the lowest ABS(value - median). This takes O(n).
The first item in the array is now the median (or a duplicate of it), and the array has heap structure. Use the heap extract algorithm to pull out as many nearest-neighbours as you need. This is O(k log n) for k nearest neighbours.
So long as k is a constant, you get O(n) median of medians, O(n) heapify and O(log n) extracting, giving O(n) overall.
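A hedged Python sketch of that route (k_nearest_to_median is an illustrative name, and statistics.median_low stands in for the O(n) median of medians, so this is illustrative rather than strictly linear):

import heapq
import statistics

def k_nearest_to_median(A, k):
    med = statistics.median_low(A)                 # stand-in for an O(n) selection
    heap = [(abs(x - med), x) for x in A]          # priority = distance to the median
    heapq.heapify(heap)                            # O(n)
    return [heapq.heappop(heap)[1] for _ in range(k)]   # k extractions, O(k log n)

# k_nearest_to_median([7, 1, 9, 4, 2, 8, 5], 3) -> [5, 4, 7] (the median itself comes out first)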
med = Select(A, 1, n, n/2)      // finds the median
for i = 1 to n
    B[i] = abs(A[i] - med)
q = Select(B, 1, n, k)          // get the kth smallest difference
j = 0
for i = 1 to n
    if B[i] <= q
        C[j] = A[i]             // assign A[i], the real value, not B[i], which is only the difference from the median
        j++
return C
You can solve your problem like this:
You can find the median in O(n), e.g. using the O(n) nth_element algorithm.
You loop through all elements, substituting each with a pair:
(the absolute difference to the median, the element's value).
Once more you do nth_element with n = k. After applying this algorithm you are guaranteed to have the k smallest elements in absolute difference first in the new array. You take their indices and are DONE!
Four Steps:
Use Median of medians to locate the median of the array - O(n)
Determine the absolute difference between the median and each element in the array and store them in a new array - O(n)
Use Quickselect or Introselect to pick k smallest elements out of the new array - O(k*n)
Retrieve the k nearest neighbours by indexing the original array - O(k)
When k is small enough, the overall time complexity becomes O(n).
1. Find the median in O(n).
2. Create a new array where each element is the absolute value of the original value minus the median.
3. Find the kth smallest number in the new array in O(n).
4. The desired values are the elements whose absolute difference with the median is less than or equal to that kth smallest difference.
You could use a non-comparison sort, such as a radix sort, on the list of numbers L, then find the k closest neighbors by considering windows of k elements and examining the window endpoints. Another way of stating "find the window" is find i that minimizes abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i] - L[n/2]) (if k is odd) or abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i+1] - L[n/2]) (if k is even). Combining the cases, abs(L[(n-k)/2+i] - L[n/2]) + abs(L[(n+k)/2+i+!(k&1)] - L[n/2]). A simple, O(k) way of finding the minimum is to start with i=0, then slide to the left or right, but you should be able to find the minimum in O(log(k)).
The expression you minimize comes from transforming L into another list, M, by taking the difference of each element from the median.
m=L[n/2]
M=abs(L-m)
i minimizes M[n/2-k/2+i] + M[n/2+k/2+i].
You already know how to find the median in O(n).
If the order does not matter, selection of the k smallest can also be done in O(n).
Apply it to select the k smallest elements from the right-hand side of the median and the k largest from the left-hand side of the median.
From Wikipedia:
function findFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        if pivotNewIndex > k  // new condition
            findFirstK(list, left, pivotNewIndex-1, k)
        if pivotNewIndex < k
            findFirstK(list, pivotNewIndex+1, right, k)
Don't forget the special case where k == n: return the original list.
Actually, the answer is pretty simple. All we need to do is to select k elements with the smallest absolute differences from the median moving from m-1 to 0 and m+1 to n-1 when the median is at index m. We select the elements using the same idea we use in merging 2 sorted arrays.
If you know the index of the median, which should just be ceil(array.length/2) maybe, then it just should be a process of listing out n(x-k), n(x-k+1), ... , n(x), n(x+1), n(x+2), ... n(x+k)
where n is the array, x is the index of the median, and k is the number of neighbours you need.(maybe k/2, if you want total k, not k each side)
First select the median in O(n) time, using a standard algorithm of that complexity.
Then run through the list again, selecting the elements that are nearest to the median (by storing the best known candidates and comparing new values against these candidates, just like one would search for a maximum element).
In each step of this additional run through the list, O(k) steps are needed, and since k is constant this is O(1). So the total time needed for the additional run is O(n), as is the total runtime of the full algorithm.
Since all the elements are distinct, there can be at most 2 elements with the same difference from the median. I think it is easier to have 2 arrays, A[k] and B[k], with the index representing the absolute value of the difference from the median. Now the task is just to fill up the arrays and choose k elements by reading the first k non-empty values of the arrays, reading A[i] and B[i] before A[i+1] and B[i+1]. This can be done in O(n) time.
All the answers suggesting to subtract the median from the array would produce incorrect results. This method will find the elements closest in value, not closest in position.
For example, if the array is 1,2,3,4,5,10,20,30,40, then for k=2 the value returned would be (3,4), which is incorrect. The correct output should be (4,10), as they are the nearest neighbours.
The correct way to find the result would be using the selection algorithm to find upper and lower bound elements. Then by direct comparison find the remaining elements from the list.
