Greedy Algorithm Optimization

I have the following problem:
Let there be n projects.
Let Fi(x) equal the number of points you will obtain if you spend
x units of time working on project i.
You have T units of time to use, and you may work on any project you would like.
The goal is to maximize the number of points you will earn, and the F functions are non-decreasing.
The F functions have diminishing marginal returns; in other words, the (x+1)-th unit of time spent on a particular project yields a smaller increase in the points earned from that project than the x-th unit did.
I have come up with the following O(nlogn + Tlogn) algorithm but I am supposed to find an algorithm running in O(n + Tlogn):
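sum = 0
schedule[i] = 0 for every project i
gain[] = sort(Fi(1) for every project i)
while sum < T
    P = getMax(gain) // assume that the max gain corresponds to project "P"
    schedule[P] = schedule[P] + 1
    sum = sum + 1
    gain.sortedInsert(Fp(schedule[P] + 1) - Fp(schedule[P]))
return schedule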
That is, it takes O(nlogn) to sort the initial gain array and O(Tlogn) to run through the loop. I have thought through this problem more than I care to admit and cannot come up with an algorithm that would run in O(n + Tlogn).

For the first case, use a heap. Constructing the heap takes O(n) time, and each ExtractMax and DecreaseKey call takes O(log n) time, which gives O(n + T log n) overall.
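The heap suggestion can be sketched in Python as follows. This is only a minimal sketch under some assumptions that are not in the question: each F is passed as a Python callable with F(0) defined, the name allocate_time is made up for illustration, and the max-heap is emulated with heapq (a min-heap) by negating the gains.

```python
import heapq

def allocate_time(F, T):
    """F: list of callables, F[i](x) = points for x units on project i.
    Returns how many time units each project receives (hypothetical helper)."""
    schedule = [0] * len(F)
    # Negate the marginal gains because heapq is a min-heap.
    heap = [(-(f(1) - f(0)), i) for i, f in enumerate(F)]
    heapq.heapify(heap)                      # O(n) heap construction
    for _ in range(T):
        if not heap:
            break
        _, p = heap[0]                       # project with the largest marginal gain
        schedule[p] += 1
        next_gain = F[p](schedule[p] + 1) - F[p](schedule[p])
        heapq.heapreplace(heap, (-next_gain, p))  # O(log n) per time unit
    return schedule
```

For example, with F = [lambda x: 5 * min(x, 2), lambda x: 3 * x] and T = 4, the first two units go to project 0 and the remaining two to project 1.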
For the second case, construct an n x T table where the i-th column denotes the solution for the case T = i. The (i+1)-th column should depend only on the values in the i-th column and the function F, hence the table can be computed in O(nT) time. I have not thought through all the cases thoroughly, but this should give you a good start.


How many subproblems are there in this Activity Selection recursive breakdown?

Activity Selection: Given a set of activities A with start and end times, find a maximum subset of mutually compatible activities.
My problem
The two approaches seem to be the same, but the numSubproblems in firstApproach is exponential, while in secondApproach is O(n^2). If I were to memoize the result, then how can I memoize firstApproach?
The naive firstApproach
let max = 0
for (a: Activities):
let B = {Activities - allIncompatibleWith(a)}
let maxOfSubproblem = ActivitySelection(B)
max = max (max, maxOfSubproblem+1)
return max
1. Assume a particular activity `a` is part of the optimal solution.
2. Find the set of activities incompatible with `a`: `allIncompatibleWith(a)`.
3. Solve Activity Selection for the set of activities `{Activities - allIncompatibleWith(a)}`.
4. Loop over all activities `a in Activities` and choose the maximum.
The CLRS Section 16.1 based secondApproach
Solve for S(0, n+1)
let S(i,j) = 0
for (k: 0 to n):
let a = Activities(k)
let S(i,k) = solution for the set of activities that start after activity-i finishes and end before activity-k starts
let S(k,j) = solution for the set of activities that start after activity-k finishes and end before activity-j starts.
S(i,j) = max (S(i,k) + S(k,j) + 1)
return S(i,j)
1. Assume a particular activity `a` is part of optimal solution
2. Solve the subproblems for:
(1) activities that finish before `a` starts
(2) activities that start after `a` finishes.
Let S(i, j) refer to the activities that lie between activities i and j (start after i and end before j).
Then S(i,j) characterises the subproblems that need to be solved above:
S(i,j) = max over k of (S(i,k) + S(k,j) + 1), with the variable k looped over the j-i indices between i and j.
My analysis
firstApproach: #numSubproblems = number of subsets of the set of all activities = 2^n.
secondApproach: #numSubproblems = number of ways to choose two indices from n indices, with repetition = n*n = O(n^2).
The two approaches seem to be the same, but the numSubproblems in firstApproach is exponential, while in secondApproach it is O(n^2). What's the catch? Why are they different, even though the two approaches seem to be the same?
The two approaches seem to be the same
The two solutions are not the same. The difference is in the number of states possible in the search space. Both solutions exhibit overlapping sub-problems and optimal substructure. Without memoization, both solutions browse through the entire search space.
Solution 1
This is a backtracking solution where all subsets that are compatible with an activity are tried, and each time an activity is selected, your candidate solution is incremented by 1 and compared with the currently stored maximum. It uses no insight about the start times and end times of the activities. The major difference is that the state of your recurrence is the entire subset of activities (compatible activities) for which the solution needs to be determined (regardless of their start and finish times). If you were to memoize the solution, you would have to use a bitmask (or std::bitset in C++) to store the solution for a subset of activities. You could also use std::set or another set data structure.
Solution 2
The number of states for the sub-problems in the second solution is greatly reduced because the recurrence relation solves only for those activities which finish before the start of the current activity and those activities which start after the current activity finishes. Notice that the number of states in such a solution is determined by the number of possible values of the tuple (start time, end time). Since there are n activities, the number of states is at most n^2. If we memoize this solution, we simply need to store the solution for a given start time and end time, which automatically gives a solution for the subset of activities that fall in this range, regardless of whether they are compatible among themselves.
Memoization does not always lead to polynomial time complexity. In the first approach, you can apply memoization, but that will not reduce the time complexity to polynomial time.
What is memoization?
In simple words, memoization is nothing but a recursive (top-down) solution that stores the result of each computed sub-problem. If the same sub-problem has to be solved again, you return the stored result instead of recomputing it.
Memoization in your first recursive solution
In your case each sub-problem is finding optimal selection of activities for a subset. So the memoization (in your case) will result in storing the optimal solution for all the subsets.
No doubt memoization will give you performance enhancements by avoiding recomputation of solution on a subset of activities that has been "seen" before, but it can't (in this case) reduce the time complexity to polynomial time because you end up storing the sub-solutions for every subset (in worst case).
Where does memoization give us a real benefit?
On the other hand, consider the case where memoization is applied to the Fibonacci series: the total number of sub-solutions that you have to store is linear in the size of the input, and thus memoization drops the exponential complexity to linear.
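As a small illustration of the Fibonacci case, here is a sketch using Python's functools.lru_cache as the memo table (the function name fib is just for illustration). Each value fib(0) .. fib(n) is computed once, so only a linear number of sub-solutions is stored.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Without the cache this recursion is exponential; with it, each n is solved once.
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```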
How can you memoize the first solution
To apply memoization in the first approach, you need to store the sub-solutions. The data structure that you can use is Map<Set<Activity>, Integer>, which will store the maximum number of compatible activities for the given Set<Activity>. In Java, equals() on a java.util.Set works properly across all the implementations, so you can use it.
Your first approach will be modified like this:
// this structure memoizes the sub-solutions
Map<Set<Activity>, Integer> map;
ActivitySelection(Set<Activity> activities) {
    if (map contains activities)
        return map.getValueFor(activities);
    let max = 0
    for (a: activities):
        let B = {activities - allIncompatibleWith(a)}
        let maxOfSubproblem = ActivitySelection(B)
        max = max (max, maxOfSubproblem+1)
    map.put(activities, max)
    return max
}
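The same idea can be sketched in runnable Python, using a frozenset as the hashable key in place of Set<Activity>. The assumptions here are mine: an activity is a (start, finish) tuple, two activities are compatible when one finishes no later than the other starts, and the names compatible and activity_selection are made up for illustration. The memo table can still grow to one entry per subset, so this stays exponential in the worst case.

```python
def compatible(a, b):
    # Assumed rule: activities may touch at an endpoint but not overlap.
    return a[1] <= b[0] or b[1] <= a[0]

def activity_selection(activities, memo=None):
    """activities: frozenset of (start, finish) tuples."""
    if memo is None:
        memo = {}
    if activities in memo:
        return memo[activities]
    best = 0
    for a in activities:
        # Keep only the activities compatible with a (a itself is removed).
        rest = frozenset(b for b in activities if b != a and compatible(a, b))
        best = max(best, activity_selection(rest, memo) + 1)
    memo[activities] = best
    return best
```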
On a lighter note:
The time complexity of the second solution (CLRS 16.1) will be O(n^3) instead of O(n^2). You'll have to have 3 loops for i, j and k. The space complexity for this solution is O(n^2).
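For comparison, a bottom-up sketch of the S(i,j) recurrence shows the three nested loops. It assumes activities are (start, finish) tuples, sorts them by finish time, and adds two sentinel activities so that S(0, n+1) covers everything; the function name max_activities is hypothetical.

```python
def max_activities(activities):
    acts = sorted(activities, key=lambda a: a[1])
    inf = float("inf")
    # Sentinels: activity 0 finishes before everything, activity n+1 starts after everything.
    acts = [(-inf, -inf)] + acts + [(inf, inf)]
    m = len(acts)
    S = [[0] * m for _ in range(m)]          # O(n^2) states
    for length in range(2, m):               # loops over i, j, k give O(n^3) time
        for i in range(m - length):
            j = i + length
            for k in range(i + 1, j):
                # k must start after i finishes and end before j starts
                if acts[k][0] >= acts[i][1] and acts[k][1] <= acts[j][0]:
                    S[i][j] = max(S[i][j], S[i][k] + S[k][j] + 1)
    return S[0][m - 1]
```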

Simple Max Profit Scheduling Algo

Assume that you have an array of durations L[5,8,2] with deadlines D[13,8,7]. Each activity has an end time E[i]. You receive (or lose) an amount D[i] - E[i] for each activity, which sums to a total amount gained or lost; for this example the best total is 4. E depends on the order in which you do the activities. For example, if you do the activities in ascending order of L[i], your resulting E would be [7,15,2].
I've found that the max value occurs after you sort the L array, which runs in O(n log n). What's fascinating is that after you sort the L array, there's no need to sort the D array, because you'll end up with the same max value for any arrangement of the deadlines (I've tried this on larger sets). Is there a better way to solve this problem, so that the running time is less than O(n log n)? I've spent a couple of hours trying all sorts of linear tweaks on lengths and deadlines, and even conditional statements, to no avail. It seems to me this can be done in O(n) time, but I can't for the life of me find it.
You sort an unbounded array of integers. There are faster ways to sort integers than the ones based on just comparing their magnitude: O(n log log n) for a deterministic case and O(n sqrt(log log n)) for a randomized algorithm.
If the integers are bounded (as in, you can guarantee they won't be larger than some value), counting sort will solve the problem in O(n).
Sorting the durations is the correct answer. As @liori points out, there are different ways to sort integers, but regardless, you still need to sort the durations.
Let's look at an abstraction of the problem. Start with L[a,b,c] and D[x,y,z]. Assume that the tasks are executed in the order given, then the end times are E[a,a+b,a+b+c], and so
profit = (x - a) + (y - (a+b)) + (z - (a+b+c))
which is the same as
profit = x + y + z - 3a - 2b - c
From this, we can see that the order of the deadlines doesn't matter, but the order in which the tasks are executed is important. The duration of the first task is subtracted from the profit many times. But the duration of the last task is only subtracted from the profit once. So clearly, the tasks need to be done in order from shortest to longest.
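A short sketch makes the formula concrete: the profit is the sum of the deadlines minus the sum of the end times, and the end times are the running totals of the durations in execution order. The function names profit and max_profit are made up for illustration.

```python
from itertools import accumulate

def profit(durations_in_order, deadlines):
    # End times are prefix sums of the durations in the order they are executed.
    return sum(deadlines) - sum(accumulate(durations_in_order))

def max_profit(durations, deadlines):
    # Shortest task first, so the most-repeated term is the smallest.
    return profit(sorted(durations), deadlines)
```

With the numbers from the question, max_profit([5, 8, 2], [13, 8, 7]) gives 4, while running the tasks longest-first gives -8.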

Interview Scheduling Algorithm

I am trying to think of an algorithm that always produces the optimum solution in the best possible time to this problem:
There are n candidates for a job, and k rooms in which they have scheduled interviews at various times of the day. Interviews have a specific schedule in each room, with each interview having a specified start time (si), finish time (fi), and interview room (ri). All time units are always integers. In addition, we need to schedule pictures with the people currently being interviewed throughout the day. The pictures don't effectively take any time, but at some point in the day each interviewee must be in a picture. If we schedule a picture at time t, all people currently being interviewed will be in that picture. Taking a picture has no effect on the rest of each interview's start and end time. So the problem is this: with an unordered list of interviews, each with variables (si, fi, ri), how do you make sure every interview candidate is in a picture, while taking as few pictures as possible?
So ideally we would take pictures when there are as many people present as possible to minimize the number of pictures taken. My original idea for this was sort of a brute force, but it would be a really bad big-O runtime. It is very important to minimize the runtime of this algorithm while still returning the fewest possible photographs. That being said, if you can think of a fast greedy algorithm that doesn't perfectly solve the problem, I would like to hear that too.
I'm sure my description here was far from flawless, so if you would like me to clarify anything, feel free to leave a comment and I'll get back to you.
Start with the following observations:
At least one picture must be taken during each interview, since we cannot photograph that interviewee before they arrive or after they leave.
The set of people available to photograph changes only at the times si and fi.
After an arrival event si, if the next event j is an arrival, there is no need to take a picture between si and sj, since everyone available at si is still available at sj.
Therefore, you can let the set of available interviewees "build up" through arrival events (up to k of them) and wait to take a picture until someone is about to leave.
Thus I think the following algorithm should work:
Put the arrival and departure times into a list and sort it (times should remain tagged with "arrival" or "departure" and the interviewee's index).
Create a boolean array A of size n to keep track of whether each interviewee is available (interview is in progress).
Create a boolean array P of size n to keep track of whether each interviewee has been photographed.
Loop over the sorted time list (index variable i):
a. If an arrival of interviewee j is encountered, set A[j] to true.
b. If a departure of interviewee j is encountered, check P[j] to see if the person leaving has been photographed already. If not, take a picture now and record its effects (for all k with A[k] = true, set P[k] = true). Finally set A[j] to false.
The sort is O(n log n), the loop has 2n iterations, and checking the arrays is O(1). But since on each picture-taking event you may need to loop over A, the overall runtime is O(n^2) in the worst case (which would happen if no interviews overlapped in time).
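The event sweep can be sketched in Python as below. This is a variant rather than a literal transcription: instead of the arrays A and P it keeps one set of interviewees who are present and not yet photographed, and clears it on each picture, which avoids the loop over A. It also assumes that an interview starting at time t is included in a picture taken at time t (arrivals sort before departures on ties); the name schedule_photos is hypothetical.

```python
def schedule_photos(interviews):
    """interviews: list of (start, finish) pairs. Returns the picture times."""
    events = []
    for idx, (start, finish) in enumerate(interviews):
        events.append((start, 0, idx))    # 0 = arrival, sorts before a departure at the same time
        events.append((finish, 1, idx))   # 1 = departure
    events.sort()
    pending = set()                       # present and not yet photographed
    photos = []
    for time, kind, idx in events:
        if kind == 0:
            pending.add(idx)
        elif idx in pending:
            # Someone unphotographed is about to leave: take the picture now.
            photos.append(time)
            pending.clear()
    return photos
```

Each interviewee enters and leaves the set at most once, so the work after sorting is linear.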
Here's an O(n log n) solution:
Step 1: Separately sort the starting and finishing time of all interviews, but at the same time keep track of the places they are sorted to (i.e. the original indices and the indices after sort). This results in 4 arrays below
sst[] (sst = sorted starting time)
sft[] (sft = sorted finishing time)
sst2orig[] (sst index to original index)
sft2orig[] (sft index to original index)
Note: by definitions of the above 4 arrays,
"sst2orig[j] = i & sft2orig[k] = i" means that
interview [i] has starting time sst[j] and finishing time sft[k]
Step 2: Define a boolean array p_taken[] to represent whether the candidate of an interview has already been photographed. All elements in the array are set to false initially.
Step 3: The loop
std::vector<int> photo_time;
int last_p_not_taken_sst_index = 0;
for (int i = 0; i < (int)sft.size(); i++) {
    // ignore the candidate already photographed
    if (p_taken[sft2orig[i]]) continue;
    // Now we have found the first leaving candidate not yet photographed,
    // so we must take a photo now.
    photo_time.push_back(sft[i]);
    // We can now mark all candidates having an earlier sst[] time as
    // already photographed. So, we search for the first element in
    // sst[] that is greater than sft[i], and return its index.
    // If all elements in sst[] are smaller than sft[i], we return sst.size().
    // This can be done via a binary search.
    int k = upper_inequal_bound_index(sst, sft[i]);
    // Now we can mark all candidates with a starting time before sst[k]
    // as "photographed". This will include the one corresponding to
    // sft[i].
    for (int j = last_p_not_taken_sst_index; j < k; j++)
        p_taken[sst2orig[j]] = true;
    last_p_not_taken_sst_index = k;
}
The final answer is saved in photo_time, and the number of photos is photo_time.size().
Time Complexity:
Step 1: Sorts: O(n log n)
Step 2: initialize p_taken[]: O(n)
Step 3: We loop n times, and in each loop
3-1 check p_taken: O(1)
3-2 binary search: O(log n)
3-3 mark candidates: aggregated O(n), since we mark each candidate only once.
So, overall for step 3: O(n x ( 1 + log n) + n) = O(n log n)
Step 1 ~ 3, total: O(n log n)
Note that step 3 can be further optimized: we can shrink the binary search to exclude the range that was already searched previously. But the worst case is still O(log n) per loop. Thus the total is still O(n log n).

scheduling n people with given time of travel

This is a puzzle, but I think it could be a classical algorithm which I am unaware of:
There are n people at the bottom of a mountain, and everyone wants to go up, then down the mountain. Person i takes u[i] time to climb this mountain, and d[i] time to descend it.
However, at any given time at most 1 person can climb, and at most 1 person can descend the mountain. Find the least time for everyone to travel up and back down the mountain.
Update 1:
Well, I tried a few examples and found that it's not reducible to sorting, or to sending the fastest climbers first or vice versa. I think that to get the optimal solution we may have to try out all possible solutions, so it seems to be NP-complete.
My initial guess: (WRONG)
The solution I thought of is greedy: sort the n people by start time in ascending order. Then send the jth person up and the kth person down, where u[j] <= d[k] and d[k] is the minimum among all k persons at the top of the mountain. I am not able to prove the correctness of this.
Any other idea how to approach this?
A hint would suffice.
Try to think in the following manner: if the people are not sorted in ascending order of the time it takes them to climb the mountain, then what happens if you find a pair of adjacent people that are not in the correct order (i.e. the first one climbs longer than the second one) and swap them? Is it possible that the total time increases?
I think it is incorrect. Consider
u = [2,3]
d = [1,3]
Your algorithm gives ordering 0,1 whereas it should be 1,0.
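The counterexample can be checked with a few lines of Python. This sketch assumes that people descend in the same order in which they climbed and start descending as soon as both they and the Down path are ready; that assumption and the name total_time are mine, not part of the puzzle statement.

```python
def total_time(u, d, order):
    climb_done = 0      # time at which the Up path becomes free
    descent_done = 0    # time at which the Down path becomes free
    for i in order:
        climb_done += u[i]
        # Person i starts descending once at the top and once the Down path is free.
        descent_done = max(climb_done, descent_done) + d[i]
    return descent_done
```

Under these assumptions the ordering 0,1 takes 8 time units and the ordering 1,0 takes 7.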
I would suggest another greedy approach:
Create ordering list and add first person.
For current ordering keep track of two values:
mU - time of last person on the mountain - time of the end
mD - time of earliest time of first descending
From people who are not ordered choose the one which minimises abs(mD - d) and abs(mU - u). Then if abs(mD - d) < abs(mU - u) he should go at the beginning of ordering. Otherwise he goes at the end.
Some tweak may still be needed here, but this approach should minimise losses from cases like the one given in the example.
The following solution will only work with n <= 24.
This solution will require dynamic programming and bit-mask technique knowledge to be understood.
Observation: we can easily observe that the optimal total climb-up time is fixed; it is equal to the total climb-up time of the n people.
For the base case, if n = 1, the solution is obvious.
For n = 2, the solution is simple: just scan through all 4 possibilities and calculate the minimum down time.
For n = 3, we can see that this case is equal to the case where one person climbs up first, followed by two.
The minimum down time of the two persons can easily be pre-calculated. More importantly, these two persons can then be treated as one person whose up time is the total up time of the two, and whose down time is the minimum down time.
Storing all results for the minimum down time for the cases from n = 0 to n = 3 in an array called 'dp', and using the bit-mask technique, we represent the state for 3 persons as index 7 = 111b, so the result for the case n = 3 will be:
for(int i = 0; i < 3; i++){
    dp[7] = min(dp[(1<<i)] + dp[7^(1<<i)], dp[7]);
}
For n = 4... 24, the solution will be similar to the case n = 3.
Note: The actual formula is not as simple as the code for the case n = 3 (and it requires an approach similar to the case n = 2), but it will be very similar.
Your approach looks sensible, but it may be over-simplified, could you describe it more precisely here?
From your description, I can't make out whether you are sorting or something else; these are the heuristics that I figured you are using:
Get the fastest climbers up first, so that they start using the Down path.
Ensure there are always people at the top of the mountain, so that when the Down path becomes available, a person starts descending immediately. The way you do that is to select first those people who climb fast and descend slowly.
What if the fastest climber is also the fastest descender? That would leave the Down path idle until the second climber gets to the top; how does your algorithm ensure that this is the best order? I'm not sure that the problem reduces to a sorting problem; it looks more like a knapsack or scheduling type of problem.

Find the largest k numbers in k arrays stored across k machines

This is an interview question. I have K machines, each of which is connected to 1 central machine. Each of the K machines has an array of 4-byte numbers in a file. You can use any data structure to load those numbers into memory on those machines, and they fit. Numbers are not unique across the K machines. Find the K largest numbers in the union of the numbers across all K machines. What is the fastest I can do this?
(This is an interesting problem because it involves parallelism. As I haven't encountered parallel algorithm optimization before, it's quite amusing: you can get away with some ridiculously high-complexity steps, because you can make up for it later. Anyway, onto the answer...)
> "What is the fastest I can do this?"
The best you can do is O(K). Below I illustrate both a simple O(K log(K)) algorithm, and the more complex O(K) algorithm.
First step:
Each computer needs enough time to read every element. This means that unless the elements are already in memory, one of the two bounds on the time is O(largest array size). If for example your largest array size varies as O(K log(K)) or O(K^2) or something, no amount of algorithmic trickery will let you go faster than that. Thus the actual best running time is technically O(max(K, largestArraySize)).
Let us say the arrays have a max length of N. With the above caveat, we're allowed to assume N <= K: since each computer has to look at each of its elements at least once (O(N) preprocessing per computer), each computer can pick out its largest K elements and discard the rest (this is known as finding the kth order statistic, for which linear-time selection algorithms exist). Furthermore, we can do so for free (since it's also O(N)).
Bounds and reasonable expectations:
Let's begin by thinking of some worst-case scenarios, and estimates for the minimum amount of work necessary.
One minimum-work-necessary estimate is O(K*N/K) = O(N), because we need to look at every element at the very least. But, if we're smart, we can distribute the work evenly across all K computers (hence the division by K).
Another minimum-work-necessary estimate is O(N): if one array is larger than all elements on all other computers, we return the set.
We must output all K elements; this is at least O(K) to print them out. We can avoid this if we are content merely knowing where the elements are, in which case the O(K) bound does not necessarily apply.
Can this bound of O(N) be achieved? Let's see...
Simple approach - O(NlogN + K) = O(KlogK):
For now let's come up with a simple approach, which achieves O(NlogN + K).
Consider the data arranged like so, where each column is a computer, and each row is a number in the array:
computer: A B C D E F G
10 (o) (o)
9 o (o) (o)
8 o (o)
7 x x (x)
6 x x (x)
5 x ..........
4 x x ..
3 x x x . .
2 x x . .
1 x x .
0 x x .
You can also imagine this as a sweep-line algorithm from computational geometry, or an efficient variant of the 'merge' step from mergesort. The elements with parentheses represent the elements with which we'll initialize our potential "candidate solution" (in some central server). The algorithm will converge on the correct o responses by dumping the (x) answers for the two unselected os.
All computers start as 'active'.
Each computer sorts its elements. (parallel O(N logN))
Repeat until all computers are inactive:
Each active computer finds the next-highest element (O(1) since sorted) and gives it to the central server.
The server smartly combines the new elements with the old K elements, and removes an equal number of the lowest elements from the combined set. To perform this step efficiently, we have a global priority queue of fixed size K. We insert the new potentially-better elements, and bad elements fall out of the set. Whenever an element falls out of the set, we tell the computer which sent that element to never send another one. (Justification: This always raises the smallest element of the candidate set.)
(sidenote: Adding a callback hook to falling out of a priority queue is an O(1) operation.)
We can see graphically that this will perform at most 2K*(findNextHighest_time + queueInsert_time) operations, and as we do so, elements will naturally fall out of the priority queue. findNextHighest_time is O(1) since we sorted the arrays, so to minimize 2K*queueInsert_time, we choose a priority queue with an O(1) insertion time (e.g. a Fibonacci-heap based priority queue). This gives us an O(log(queue_size)) extraction time (we cannot have O(1) insertion and extraction); however, we never need to use the extract operation! Once we are done, we merely dump the priority queue as an unordered set, which takes O(queue_size)=O(K) time.
We'd thus have O(N log(N) + K) total running time (parallel sorting, followed by O(K)*O(1) priority queue insertions). In the worst case of N=K, this is O(K log(K)).
The better approach - O(N+K) = O(K):
However I have come up with a better approach, which achieves O(K). It is based on the median-of-median selection algorithm, but parallelized. It goes like this:
We can eliminate a set of numbers if we know for sure that there are at least K (not strictly) larger numbers somewhere among all the computers.
Each computer finds the sqrt(N)th highest element of its set, and splits the set into elements < and > it. This takes O(N) time in parallel.
The computers collaborate to combine those statistics into a new set, and find the K/sqrt(N)th highest element of that set (let's call it the 'superstatistic'), and note which computers have statistics < and > the superstatistic. This takes O(K) time.
Now consider all elements less than their computer's statistic, on computers whose statistic is less than the superstatistic. Those elements can be eliminated. This is because the elements greater than their computer's statistic, on computers whose statistic is larger than the superstatistic, are a set of K elements which are larger. (See the diagram below.)
Now, the computers with the uneliminated elements evenly redistribute their data to the computers who lost data.
Recurse: you still have K computers, but the value of N has decreased. Once N is less than a predetermined constant, use the previous algorithm I mentioned in "simple approach - O(NlogN + K)"; except in this case, it is now O(K). =)
It turns out that the reductions are O(N) total (amazingly not order K), except perhaps the final step which might be O(K). Thus this algorithm is O(N+K) = O(K) total.
Analysis and simulation of O(K) running time below. The statistics allow us to divide the world into four unordered sets, represented here as a rectangle divided into four subboxes:
| | s | <- computer
| | #=K s REDIST. | <- computer
| | s | <- computer
| K/N^.5|-----S----------| <- computer
| | s | <- computer
K | s | <- computer
| | s ELIMIN. | <- computer
| | s | <- computer
| | s | <- computer
| |_____s__________| <- computer
s=statistic, S=superstatistic
#=K -- set of K largest elements
(I'd draw the relation between the unordered sets of rows and s-column here, but it would clutter things up; see the addendum right now quickly.)
For this analysis, we will consider N as it decreases.
At a given step, we are able to eliminate the elements labelled ELIMIN; this has removed area from the rectangle representation above, reducing the problem size from K*N to K*N - (K - K/√N)(N - √N), which hilariously simplifies to K(2√N-1).
Now, the computers with the uneliminated elements redistribute their data (REDIST rectangle above) to the computers with eliminated elements (ELIMIN). This is done in parallel, where the bandwidth bottleneck corresponds to the length of the short side of REDIST (because they are outnumbered by the ELIMIN computers which are waiting for their data). Therefore the data will take as long to transfer as the long side of the REDIST rectangle (another way of thinking about it: K/√N * (N-√N) is the area, divided by K/√N data-per-time, resulting in O(N-√N) time).
Thus at each step of size N, we are able to reduce the problem size to K(2√N-1), at the cost of performing N + 3K + (N-√N) work. We now recurse. The recurrence relation which will tell us our performance is:
T(N) = 2N+3K-√N + T(2√N-1)
The decimation of the subproblem size is much faster than the normal geometric series (being √N rather than something like N/2 which you'd normally get from common divide-and-conquers). Unfortunately neither the Master Theorem nor the powerful Akra-Bazzi theorem work, but we can at least convince ourselves it is linear via a simulation:
>>> from math import sqrt
>>> def T(n,k=None):
...     return 1 if n<10 else sqrt(n)*(2*sqrt(n)-1)+3*k+T(2*sqrt(n)-1, k=k)
>>> f = (lambda x: x)
>>> (lambda n: T((10**5)*n,k=(10**5)*n)/f((10**5)*n) - T(n,k=n)/f(n))(10**30)
The function T(N) is, at large scales, a multiple of the linear function x, hence linear (doubling the input doubles the output). This method, therefore, almost certainly achieves the bound of O(N) we conjecture. Though see the addendum for an interesting possibility.
One pitfall is accidentally sorting. If we do anything which accidentally sorts our elements, we will incur a log(N) penalty at the least. Thus it is better to think of the arrays as sets, to avoid the pitfall of thinking that they are sorted.
Also we might initially think that, with the constant amount of work of 3K at each step, we would have to do 3Klog(log(N)) work in total. But the -1 has a powerful role to play in the decimation of the problem size. It is very slightly possible that the running time is actually something above linear, but definitely much smaller than even Nlog(log(log(log(N)))). For example it might be something like O(N*InverseAckermann(N)), but I hit the recursion limit when testing.
The O(K) is probably only due to the fact that we have to print them out; if we are content merely knowing where the data is, we might even be able to pull off an O(N) (e.g. if the arrays are of length O(log(K)) we might be able to achieve O(log(K)))... but that's another story.
The relation between the unordered sets is as follows. Would have cluttered things up in explanation.
/ \
(.....) > s > (.....)
(.....) > s > (.....)
(.....) > s > (.....)
/ \
(.....) > s > (.....)
(.....) > s > (.....)
(.....) > s > (.....)
Find the k largest numbers on each machine. O(n*log(k))
Combine the results (on a centralized server if k is not huge; otherwise you can merge them in a tree hierarchy across the server cluster).
Update: to make it clear, the combine step is not a sort. You just pick the top k numbers from the results. There are many ways to do this efficiently. You can use a heap for example, pushing the head of each list. Then you can remove the head from the heap and push the head from the list the element belonged to. Doing this k times gives you the result. All this is O(k*log(k)).
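The two steps can be sketched in Python as follows, simulating the machines as plain lists in one process. The function name top_k_distributed is made up for illustration, and heapq.nlargest stands in for whatever local selection each machine would really run; the combine step is the heap of list heads described above.

```python
import heapq

def top_k_distributed(machines, k):
    # Step 1 (would run in parallel, one call per machine): local top k, sorted descending.
    local = [heapq.nlargest(k, arr) for arr in machines]
    # Step 2 (central server): heap of the current head of each list.
    # Values are negated because heapq is a min-heap.
    heap = [(-lst[0], m, 0) for m, lst in enumerate(local) if lst]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        neg_value, m, pos = heapq.heappop(heap)
        result.append(-neg_value)
        if pos + 1 < len(local[m]):
            # Push the next head from the list the popped element belonged to.
            heapq.heappush(heap, (-local[m][pos + 1], m, pos + 1))
    return result
```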
Maintain a min heap of size 'k' in the centralized server.
Initially insert first k elements into the min heap.
For the remaining elements
Check(peek) for the min element in the heap (O(1))
If the min element is lesser than the current element, then remove the min element from heap and insert the current element.
Finally min heap will have 'k' largest elements
This would require O(n log k) time, where n is the total number of elements.
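A minimal sketch of this bounded min-heap in Python, where numbers stands for the stream of all elements arriving at the central server (the name k_largest_stream is hypothetical):

```python
import heapq

def k_largest_stream(numbers, k):
    if k <= 0:
        return []
    heap = []                              # min-heap holding the k largest seen so far
    for x in numbers:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif heap[0] < x:                  # peek at the minimum in O(1)
            heapq.heapreplace(heap, x)     # pop the minimum and push x in O(log k)
    return heap                            # the k largest, in heap order (not sorted)
```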
I would suggest something like this:
take the k largest numbers on each machine in sorted order O(Nk) where N is the number of element on each machine
sort each of these arrays of k elements by largest element (you will get k arrays of k elements sorted by largest element : a square matrix kxk)
take the "upper triangle" of the matrix made of these k arrays of k elements, (the k largest element will be in this upper triangle)
the central machine can now find the k largest element of these k(k+1)/2 elements
1. Let each machine find its k largest elements, copy them into a data structure (stack), sort it and pass it on to the central machine.
2. At the central machine, receive the stacks from all the machines. Find the greatest of the elements at the top of the stacks.
3. Pop the greatest element from its stack and copy it to the 'TopK list'. Leave the other stacks intact.
4. Repeat step 3 k times to get the top K numbers.
1) sort the items on every machine
2) use a k - binary heap on the central machine
a) populate the heap with first (max) element from each machine
b) extract the first element, and put back in the heap the first element from the machine that you extracted the element. (of course heapify your heap, after the element is added).
Sort will be O(Nlog(N)) where N is the max array on the machines.
O(k) - to build the heap
O(klog(k)) to extract and populate the heap k times.
Complexity is max(O(klog(k)),O(Nlog(N)))
I would think the MapReduce paradigm would be well suited to a task like this.
Every machine runs its own independent map task to find the maximum value in its array (depends on the language used) and this will probably be O(N) complexity for N numbers on each machine.
The reduce task compares the result from the individual machines' outputs to give you the largest k numbers.
