Selecting top n items from multiple sorted arrays - algorithm

What is the optimal algorithm for selecting the top n elements from multiple arrays, provided each array is sorted in the same order in which the resultant array should be?
Reading elements is very expensive, so the number of reads should be kept to an absolute minimum.

Put tuples (current_element, array_number, current_index=0) into a priority queue (for example, one based on a binary heap: a max-heap if the arrays are sorted in descending order, a min-heap otherwise), ordered by element value.
Then remove the top of the queue n times.
After each removal, increment the index in the corresponding array (if possible), read the next element, and insert the updated tuple into the queue again.
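A minimal Python sketch of this approach, assuming the arrays are sorted in ascending order and the n smallest elements are wanted (heapq is a binary min-heap; for descending arrays you would negate the values or use a max-heap):

import heapq

def top_n_sorted(arrays, n):
    # Seed the heap with the first element of every non-empty array.
    # Each entry is (value, array_number, index_within_array).
    heap = [(arr[0], i, 0) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < n:
        value, arr_no, idx = heapq.heappop(heap)
        result.append(value)
        if idx + 1 < len(arrays[arr_no]):          # advance in the same array
            nxt = arrays[arr_no][idx + 1]          # the only new read per pop
            heapq.heappush(heap, (nxt, arr_no, idx + 1))
    return result

print(top_n_sorted([[1, 4, 9], [2, 3, 20], [5, 6, 7]], 4))  # [1, 2, 3, 4]

Each pop triggers at most one new read, so roughly len(arrays) + n elements are read in total.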

Related

Merging n sorted lists

How is the problem of sorting a very large list tackled?
I suppose we divide the list, have each part processed on a separate CPU, and produce small sorted lists.
But how can we combine them to produce the final sorted list?
You can merge multiple sorted lists using a priority queue (based on a binary heap).
Fill the queue with pairs (current element of a list, or its index; list id).
At every step:
extract the pair with the minimum element from the queue
add its value to the result
get the next element of the same list (if possible)
insert the new pair into the queue again (a sketch follows below)
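This pairing of a current element with its source list is exactly what Python's heapq.merge does internally, so a minimal sketch can lean on it (a hand-rolled loop along the lines of the steps above is only needed if you want to control the reads yourself):

import heapq

def merge_sorted_lists(lists):
    # heapq.merge keeps one (current element, source list) pair per input
    # in an internal heap and repeatedly yields the minimum.
    return list(heapq.merge(*lists))

print(merge_sorted_lists([[1, 5, 9], [2, 2, 8], [0, 7]]))
# [0, 1, 2, 2, 5, 7, 8, 9]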
How huge is your list relative to available memory?
For useful clues, start from the Wikipedia page on external sorting.
The basic approach is to create a min-heap of size n, where n is the number of partitioned sorted lists obtained from the huge list.
Each node of the binary heap should hold an index / sorted-list number together with a value.
The top node of the min-heap will hold the minimum value of the huge list, and its index will tell which sorted list it comes from. Now pop the top of the min-heap, append its value to the output, add the next value from the popped list into the heap, and heapify again.
Repeat until the nodes are exhausted; also take care of the heap size when one or more of the lists become empty in the process.
Since the issue is that your list is larger than memory, I would say external sort is the solution:
https://en.wikipedia.org/wiki/External_sorting
Say we have N blocks of main memory: we can use N-1 blocks to hold portions of the lists being merged and keep the remaining block as an output buffer.
Merge the lists by the usual merge procedure, comparing the front elements, and write the result to the output buffer.
When the buffer is full, write it back to secondary memory.
Repeat these steps until all the lists are merged.
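A rough sketch of one such merge pass, assuming the sorted runs live in text files with one integer per line (the file paths and buffer size are made up for illustration, and all runs are fed to heapq.merge in a single pass rather than two at a time; only a small window of each run is ever in memory):

import heapq

def merge_runs(run_paths, out_path, buffer_size=4096):
    files = [open(p) for p in run_paths]
    try:
        # One lazy stream of integers per sorted run.
        streams = [(int(line) for line in f) for f in files]
        buffer = []                              # the single output block
        with open(out_path, "w") as out:
            for value in heapq.merge(*streams):
                buffer.append(str(value))
                if len(buffer) >= buffer_size:   # buffer full: flush to disk
                    out.write("\n".join(buffer) + "\n")
                    buffer = []
            if buffer:
                out.write("\n".join(buffer) + "\n")
    finally:
        for f in files:
            f.close()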

How can I write this algorithm so that it is stable?

Assume that we have Mystery-Sort(A), which takes an array A of
length n as input, sorts the numbers in A in non-decreasing order, and returns the sorted array.
We do not know whether the algorithm implemented by Mystery-Sort is stable.
I need a procedure that takes an array of n integers and returns the array sorted in non-decreasing order, but the procedure must be stable.
How can I achieve this? I need the pseudo-code of a stable sorting procedure Stable-Sort(A) that pre-processes and/or post-processes
the elements in A in O(n) time, makes only one call to Mystery-Sort, and returns
the sorted array in non-decreasing order.
I see this as coming in two phases: a pre-processing phase in which you record all duplicated elements together with their identifiers, and a post-processing phase in which you simply overwrite the found elements back into their original order.
You didn't specify how elements that sort as the same value can be differentiated; I'll call that the id. In the first pass, construct a table with one row per value. Iterate through the array and store the id of each element (or the entire element) in the matching row of the table; if there is already an element there, extend that row and append the current element.
At this point, if you wish, you can eliminate any row of the table with fewer than 2 elements.
For the post-processing, iterate through the sorted array. If the value you find is in the table, don't trust the order returned by Mystery-Sort; instead, simply overwrite the next elements with the ones from that row of the table. This restores their original order.
When you reach the end of the sorted array, you're done.
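A rough Python sketch of that idea. Here each element is a (key, id) pair, mystery_sort stands in for the black-box Mystery-Sort (it may or may not be stable; below it is simulated with plain sorted(), which breaks ties by id rather than by original position), and the table is an ordinary dict keyed by value:

from collections import defaultdict

def stable_sort(a, mystery_sort):
    # Pre-processing (O(n)): remember, per key, the elements in original order.
    table = defaultdict(list)
    for elem in a:
        table[elem[0]].append(elem)

    s = mystery_sort(a)        # the single call to Mystery-Sort

    # Post-processing (O(n)): for every run of equal keys, overwrite the run
    # with the elements in the recorded order, restoring stability.
    i = 0
    while i < len(s):
        key = s[i][0]
        for elem in table[key]:
            s[i] = elem
            i += 1
    return s

data = [(3, 'y'), (1, 'b'), (3, 'x'), (1, 'a')]
print(stable_sort(data, sorted))
# [(1, 'b'), (1, 'a'), (3, 'y'), (3, 'x')] -- original order kept within equal keys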

Divide the list into 2 equal parts

I have a list that contains random numbers such that Number >= 0. Now I have to divide the list into 2 equal parts (assume the list contains an even number of elements) such that all the numbers in the first list are less than or equal to the numbers in the second list. This can easily be done by any sorting mechanism in O(n log n). But I don't need the data in the two equal-length lists to be sorted; the only condition is that all elements in the first list <= all elements in the second list.
So is there a way or hack to reduce the complexity, since we don't require sorted data here?
If the problem is actually solvable (the data allows such a split), you can find the median using a selection algorithm. Once you have it, create two equally sized arrays and iterate over the original list element by element, putting each element into one of the new lists depending on whether it is bigger or smaller than the median. This should run in linear time.
#Edit: as gen-y-s pointed out, if you write the selection algorithm yourself or use a proper library, it may already partition the input list, so the second pass is not needed.
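A sketch of the partitioning step, assuming such a split exists. For brevity the median is found via sorted(); to actually meet the linear bound you would plug in a linear-time selection such as quickselect or median-of-medians:

def split_in_half(nums):
    n = len(nums)                       # assumed even
    median = sorted(nums)[n // 2 - 1]   # replace with a linear-time select

    lower, upper, equal = [], [], []
    for x in nums:                      # single linear pass over the data
        if x < median:
            lower.append(x)
        elif x > median:
            upper.append(x)
        else:
            equal.append(x)

    # Distribute the elements equal to the median to fill both halves.
    while len(lower) < n // 2:
        lower.append(equal.pop())
    upper.extend(equal)
    return lower, upper

print(split_in_half([7, 1, 5, 3, 9, 9, 2, 8]))
# ([1, 3, 2, 5], [7, 9, 9, 8])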

Maintaining sort while changing random elements

I have come across this problem where I need to efficiently remove the smallest element in a list/array. That would be fairly trivial to solve - a heap would be sufficient.
However, the issue now is that when I remove the smallest element, it would cause changes in other elements in the data structure, which may result in the ordering being changed. An example is this:
I have an array of elements:
[1,3,5,7,9,11,12,15,20,33]
When I remove "1" from the array, "5" and "12" get changed to "4" and "17" respectively.
[3,4,7,9,11,17,15,20,33]
And hence the ordering is not maintained.
However, the element that is removed will have pointers to all elements that will be changed, but there is no knowing how many elements will change or by how much.
So my question is:
What is the best way to store these elements to maximize performance when removing the smallest element from the data structure while maintaining sort? Or should I just leave it unsorted?
My current implementation just stores them unsorted in a vector, so the time complexity is O(N^2): O(N) for finding the smallest element, times N removals.
A.
If you have the list M of all changed elements of the ordered list L:
go through M, and for every element
if it is still ordered with respect to its neighbours in L, leave it be
if it is not in order with its neighbours, remove it from L
the removed elements form a list N
sort N
use some algorithm for merging ordered lists: http://en.wikipedia.org/wiki/Merge_algorithm
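A small sketch of option A, assuming the changed positions are known and that pulling the out-of-order elements out leaves the rest of L sorted:

from heapq import merge

def repair_order(L, changed_idx):
    # changed_idx: indices in L whose values were modified.
    out_of_order = set()
    for i in changed_idx:
        left_ok = i == 0 or L[i - 1] <= L[i]
        right_ok = i == len(L) - 1 or L[i] <= L[i + 1]
        if not (left_ok and right_ok):      # not ordered with its neighbours
            out_of_order.add(i)

    N = sorted(L[i] for i in out_of_order)                  # sort N
    rest = [x for i, x in enumerate(L) if i not in out_of_order]
    return list(merge(rest, N))                             # merge ordered lists

# The example from the question: 1 was removed, 5 -> 4 and 12 -> 17.
print(repair_order([3, 4, 7, 9, 11, 17, 15, 20, 33], [1, 5]))
# [3, 4, 7, 9, 11, 15, 17, 20, 33]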
B.
If you are sure that the changed elements are few and have not changed much, simply use bubble sort.
I would still go with a heap, backed by an array.
In case only a few elements change after each pop: after you perform the pop operation, perform a sift up/down for any item whose value was reduced. This is still on the order of O(n log k), where k is the size of your array and n is the number of elements that were reduced in value.
If a lot of items change in value, then you can treat this as the case where you have an unsorted array and simply build a heap from the array.
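A minimal sketch of the heap variant, using a min-heap from heapq and a hand-written sift-up for items whose value decreased (an item whose value increased would need a sift-down instead; tracking element positions is left out and replaced by index() purely for the sketch):

import heapq

def sift_up(heap, pos):
    # Restore the min-heap property after heap[pos] decreased in value.
    item = heap[pos]
    while pos > 0:
        parent = (pos - 1) // 2
        if heap[parent] <= item:
            break
        heap[pos] = heap[parent]
        pos = parent
    heap[pos] = item

heap = [1, 3, 5, 7, 9, 11, 12, 15, 20, 33]
heapq.heapify(heap)

heapq.heappop(heap)                 # remove the smallest element (1)

# Say the element 5 dropped to 4 as a side effect of the removal.
pos = heap.index(5)                 # a real structure would track positions
heap[pos] = 4
sift_up(heap, pos)

print(heapq.heappop(heap))          # 3 -- the order is still correct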

top-k selection/merge

I have n sorted lists (5 < n < 300). These lists are quite long (300000+ tuples). Selecting the top k of the individual lists is of course trivial - they are right at the head of the lists.
Example for k = 2:
top2 (L1: [ 'a': 10, 'b': 4, 'c':3 ]) = ['a':10 'b':4]
top2 (L2: [ 'c': 5, 'b': 2, 'a':0 ]) = ['c':5 'b':2]
Where it gets more interesting is when I want the combined top k across all the sorted lists.
top2(L1+L2) = ['a':10, 'c':8]
Just combining the top k of the individual lists would not necessarily give the correct result:
top2(top2(L1)+top2(L2)) = ['a':10, 'b':6]
The goal is to reduce the required space and keep the sorted lists small.
top2(topX(L1)+topX(L2)) = ['a':10, 'c':8]
The question is whether there is an algorithm to calculate the combined top k, with the correct order, while cutting off the long tail of the lists at a certain position. And if there is: how does one find the limit X where it is safe to cut?
Note: Correct counts are not important. Only the order is.
top2(magic([L1,L2])) = ['a', 'c']
This algorithm uses O(U) memory, where U is the number of unique keys. I doubt a lower memory bound can be achieved, because it is impossible to tell which keys can be discarded until all the keys have been summed.
Make a master list of (key:total_count) tuples. Simply run through each list one item at a time, keeping a tally of how many times each key has been seen.
Use any top-k selection algorithm on the master list that does not use additional memory. One simple solution is to sort the list in place.
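A sketch of the tally approach; Counter.most_common is used for the top-k step here, but any top-k selection over the master list (including an in-place sort) works just as well:

from collections import Counter

def top_k(lists, k):
    totals = Counter()
    for lst in lists:                      # run through each list once
        for key, count in lst:
            totals[key] += count           # tally per key
    return totals.most_common(k)

L1 = [('a', 10), ('b', 4), ('c', 3)]
L2 = [('c', 5), ('b', 2), ('a', 0)]
print(top_k([L1, L2], 2))                  # [('a', 10), ('c', 8)]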
If I understand your question correctly, the correct output is the top 10 items, irrespective of the list from which each came. If that's correct, then starting with the first 10 items in each list will allow you to generate the correct output (if you only want unique items in the output, but the inputs might contain duplicates, then you need 10 unique items in each list).
In the most extreme case, all the top items come from one list, and all items from the other lists are ignored. In this case, having 10 items in the one list will be sufficient to produce the correct result.
Associate an index with each of your n lists. Set it to point to the first element in each case.
Create a list-of-lists, and sort it by the indexed elements.
The indexed item on the top list in your list-of-lists is your first element.
Increment the index for the topmost list and remove that list from the list-of-lists and re-insert it based on the new value of its indexed element.
The indexed item on the top list in your list-of-lists is your next element.
Go to step 4 and repeat until done.
You didn't specify how many lists you have. If n is small, then step 4 can be done very simply (just re-sort the lists). As n grows, you may want to think about more efficient ways to re-sort an almost-sorted list-of-lists.
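A small sketch of those steps; for a modest n the re-sort in step 4 is done by simply sorting the list-of-lists again, as suggested. Note that, as the next answer points out, this treats every occurrence of a key independently rather than combining counts across lists:

def top_k_merge(lists, k):
    # Pair every list with an index pointing at its first (largest) element.
    state = [[lst, 0] for lst in lists if lst]
    out = []
    while state and len(out) < k:
        # Keep the list-of-lists ordered by the indexed element (largest first).
        state.sort(key=lambda s: s[0][s[1]][1], reverse=True)
        lst, idx = state[0]
        out.append(lst[idx])              # the indexed item of the top list
        if idx + 1 < len(lst):
            state[0][1] = idx + 1         # increment the index
        else:
            state.pop(0)                  # that list is exhausted
    return out

L1 = [('a', 10), ('b', 4), ('c', 3)]
L2 = [('c', 5), ('b', 2), ('a', 0)]
print(top_k_merge([L1, L2], 2))           # [('a', 10), ('c', 5)]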
I did not understand that if an 'a' appears in two lists, their counts must be combined. Here is a new, memory-efficient algorithm:
(New) Algorithm:
(Re-)sort each list by ID (not by count). To release memory, the list can be written back to disk. Only enough memory for the longest list is required.
Get the next lowest unprocessed ID and find the total count across all lists.
Insert the ID into a priority queue of k nodes. Use the total count as the node's priority (not the ID). This priority queue drops the lowest node if more than k nodes are inserted.
Go to step 2 until all IDs have been exhausted.
Analysis: This algorithm can be implemented using only O(k) additional memory to store the min-heap. It makes several trade-offs to accomplish this:
The lists are sorted by ID in place; the original orderings by count are lost. Otherwise O(U) additional memory is required to make a master list with ID: total_count tuples, where U is the number of unique IDs.
The next lowest ID is found in O(n) time by checking the first tuple of each list. This is repeated U times, where U is the number of unique IDs. This might be improved by using a min-heap to track the next lowest ID, which would require O(n) additional memory (and may not be faster in all cases).
Note: This algorithm assumes IDs can be compared quickly. String comparisons are not trivial; I suggest hashing string IDs to integers. They do not have to be unique hashes, but collisions must be checked so that all IDs are properly sorted/compared. Of course, this would add to the memory/time complexity.
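A rough in-memory sketch of steps 1-4 (the re-sorted lists stay in memory here rather than being written back to disk, and the k-node priority queue is a heapq min-heap capped at k entries):

import heapq

def top_k_by_id_scan(lists, k):
    # Step 1: (re-)sort each list of (id, count) tuples by ID, not by count.
    lists = [sorted(lst) for lst in lists]
    pos = [0] * len(lists)
    heap = []                                     # min-heap of (total, id), size <= k

    while True:
        # Step 2: find the next lowest unprocessed ID across all list heads.
        heads = [lst[p][0] for lst, p in zip(lists, pos) if p < len(lst)]
        if not heads:
            break                                 # step 4: all IDs exhausted
        current = min(heads)
        total = 0
        for i, lst in enumerate(lists):           # sum its count over all lists
            if pos[i] < len(lst) and lst[pos[i]][0] == current:
                total += lst[pos[i]][1]
                pos[i] += 1

        # Step 3: keep only the k largest totals seen so far.
        if len(heap) < k:
            heapq.heappush(heap, (total, current))
        else:
            heapq.heappushpop(heap, (total, current))

    return sorted(heap, reverse=True)             # largest totals first

L1 = [('a', 10), ('b', 4), ('c', 3)]
L2 = [('c', 5), ('b', 2), ('a', 0)]
print(top_k_by_id_scan([L1, L2], 2))              # [(10, 'a'), (8, 'c')]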
The perfect solution requires all tuples to be inspected at least once.
However, it is possible to get close to the perfect solution without inspecting every tuple. Discarding the "long tail" introduces a margin of error. You can use some type of heuristic to calculate when the margin of error is acceptable.
For example, if there are n=100 sorted lists and you have inspected down each list until the count is 2, the most the total count for a key could increase by is 200.
I suggest taking an iterative approach:
Tally each list until a certain lower count threshold L is reached.
Lower L to include more tuples.
Add the new tuples to the counts tallied so far.
Go to step 2 until lowering L does not change the top k counts by more than a certain percentage.
This algorithm assumes the counts for the top k keys will approach a certain value the further the long tail is traversed. You can use other heuristics instead of the fixed percentage, such as the number of new keys in the top k, how much the top k keys were shuffled, etc.
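A rough sketch of this iterative scheme, assuming each list is sorted by count descending. The stopping rule used here (stop once the set of top-k keys no longer changes between iterations) is just one instance of the percentage-based rule above; start_threshold and the shrink factor are made-up parameters:

from collections import Counter

def iterative_top_k(lists, k, start_threshold, shrink=0.5):
    totals = Counter()
    pos = [0] * len(lists)            # how far down each list we have tallied
    threshold = start_threshold
    prev_top = None

    while True:
        # Tally every tuple whose count is still >= the current threshold L.
        for i, lst in enumerate(lists):
            while pos[i] < len(lst) and lst[pos[i]][1] >= threshold:
                key, count = lst[pos[i]]
                totals[key] += count
                pos[i] += 1

        top = [key for key, _ in totals.most_common(k)]
        done = all(p == len(lst) for p, lst in zip(pos, lists))
        if top == prev_top or done:   # heuristic stopping rule
            return top
        prev_top = top
        threshold *= shrink           # lower L to include more of the tail

L1 = [('a', 10), ('b', 4), ('c', 3)]
L2 = [('c', 5), ('b', 2), ('a', 0)]
print(iterative_top_k([L1, L2], 2, start_threshold=8))   # ['a', 'c']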
There is a sane way to implement this through mapreduce:
http://www.yourdailygeekery.com/2011/05/16/top-k-with-mapreduce.html
In general, I think you are in trouble. Imagine the following lists:
['a':100, 'b':99, ...]
['c':90, 'd':89, ..., 'b':2]
and you have k=1 (i.e. you want only the top one). 'b' is the right answer, but you need to look all the way down to the end of the second list to realize that 'b' beats 'a'.
Edit:
If you have the right distribution (long, low-count tails), you might be able to do better. Let's stick with k=1 for now to make our lives easier.
The basic algorithm is to keep a hash map of the keys you've seen so far and their associated totals. Walk down the lists processing elements and updating your map.
The key observation is that a key can gain in count by at most the sum of the counts at the current processing point of each list (call that sum S). So on each step, you can prune from your hash map any keys whose total is more than S below your current maximum count element. (I'm not sure what data structure you would need to prune as you need to look up keys given a range of counts - maybe a priority queue?)
When your hash map has only one element in it, and its count is at least S, then you can stop processing the lists and return that element as the answer. If your count distribution plays nice, this early exit may actually trigger so you don't have to process all of the lists.
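A small sketch of this pruning idea for k=1, assuming each list is sorted by count descending. S is recomputed after every lock-step round; pruning is done with a plain dictionary scan (the answer leaves the right data structure open), and the early exit fires once a single candidate remains whose total is at least S:

def top_one(lists):
    # lists: each sorted by count descending.  Walk all of them in lock step.
    totals = {}
    pos = [0] * len(lists)

    while any(p < len(lst) for p, lst in zip(pos, lists)):
        # Consume one tuple from every list that still has elements.
        for i, lst in enumerate(lists):
            if pos[i] < len(lst):
                key, count = lst[pos[i]]
                totals[key] = totals.get(key, 0) + count
                pos[i] += 1

        # S bounds how much any key (seen or unseen) can still gain.
        S = sum(lst[p][1] for lst, p in zip(lists, pos) if p < len(lst))
        best_key = max(totals, key=totals.get)
        best = totals[best_key]

        # Prune candidates that can no longer catch up (linear scan for the
        # sketch; a smarter structure would avoid rescanning every round).
        totals = {k: v for k, v in totals.items() if v + S >= best}

        # Early exit: one candidate left and no unseen key can reach it.
        if len(totals) == 1 and best >= S:
            return best_key

    return max(totals, key=totals.get)

L1 = [('a', 100), ('b', 99), ('x', 1)]
L2 = [('c', 90), ('d', 89), ('b', 2)]
print(top_one([L1, L2]))    # 'b' (total 101) beats 'a' (100)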

Resources