top-k selection/merge - algorithm

I have n sorted lists (5 < n < 300). These lists are quite long (300000+ tuples). Selecting the top k of the individual lists is of course trivial - they are right at the head of the lists.
Example for k = 2:
top2 (L1: [ 'a':10, 'b':4, 'c':3 ]) = ['a':10, 'b':4]
top2 (L2: [ 'c':5, 'b':2, 'a':0 ]) = ['c':5, 'b':2]
Where it gets more interesting is when I want the combined top k across all the sorted lists.
top2(L1+L2) = ['a':10, 'c':8]
Just combining the top k of the individual lists would not necessarily give the correct result:
top2(top2(L1)+top2(L2)) = ['a':10, 'b':6]
The goal is to reduce the required space and keep the sorted lists small.
top2(topX(L1)+topX(L2)) = ['a':10, 'c':8]
The question is whether there is an algorithm that calculates the combined top k in the correct order while cutting off the long tail of the lists at a certain position. And if there is: how does one find the limit X where it is safe to cut?
Note: Correct counts are not important. Only the order is.
top2(magic([L1,L2])) = ['a', 'c']

This algorithm uses O(U) memory, where U is the number of unique keys. I doubt a lower memory bound can be achieved, because it is impossible to tell which keys can be discarded until all the keys have been summed.
Make a master list of (key:total_count) tuples. Simply run through each list one item at a time, keeping a tally of how many times each key has been seen.
Use any top-k selection algorithm on the master list that does not use additional memory. One simple solution is to sort the list in place.
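The two steps above can be sketched in Python roughly as follows. This is only an illustration: the function name `combined_top_k` and the representation of each list as `(key, count)` tuples are assumptions, and `heapq.nlargest` stands in for "any top-k selection algorithm".

```python
from collections import Counter
import heapq

def combined_top_k(lists, k):
    """Sum the counts per key across all lists, then pick the k largest totals."""
    totals = Counter()                      # the master list: key -> total count
    for lst in lists:
        for key, count in lst:
            totals[key] += count
    # nlargest only keeps k candidates while scanning the master list
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

L1 = [("a", 10), ("b", 4), ("c", 3)]
L2 = [("c", 5), ("b", 2), ("a", 0)]
print(combined_top_k([L1, L2], 2))          # [('a', 10), ('c', 8)]
```

The memory use is dominated by `totals`, which holds one entry per unique key, matching the O(U) estimate.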

If I understand your question correctly, the correct output is the top 10 items, irrespective of the list from which each came. If that's correct, then starting with the first 10 items in each list will allow you to generate the correct output (if you only want unique items in the output, but the inputs might contain duplicates, then you need 10 unique items in each list).
In the most extreme case, all the top items come from one list, and all items from the other lists are ignored. In this case, having 10 items in the one list will be sufficient to produce the correct result.

1 Associate an index with each of your n lists. Set it to point to the first element in each case.
2 Create a list-of-lists, and sort it by the indexed elements.
3 The indexed item on the top list in your list-of-lists is your first element.
4 Increment the index for the topmost list, remove that list from the list-of-lists, and re-insert it based on the new value of its indexed element.
5 The indexed item on the top list in your list-of-lists is your next element.
6 Go to 4 and repeat until done.
You didn't specify how many lists you have. If n is small, then step 4 can be done very simply (just re-sort the lists). As n grows, you may want to think about more efficient ways to re-sort an almost-sorted list-of-lists.
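In Python, the list-of-lists bookkeeping can be delegated to a heap. The sketch below uses the standard library's `heapq.merge`, which keeps one cursor per input list in a heap, so it is a swapped-in equivalent of the steps above rather than a literal transcription. The name `merged_top_k` and the `(key, count)` tuple layout are assumptions. Note that it treats every tuple independently and does not add up counts for the same key.

```python
import heapq
from itertools import islice

def merged_top_k(lists, k):
    """Return the first k tuples of the merge of lists sorted by count, descending."""
    # reverse=True tells merge that each input is sorted from largest to smallest
    merged = heapq.merge(*lists, key=lambda kv: kv[1], reverse=True)
    return list(islice(merged, k))          # lazily stops after k tuples

L1 = [("a", 10), ("b", 4), ("c", 3)]
L2 = [("c", 5), ("b", 2), ("a", 0)]
print(merged_top_k([L1, L2], 2))            # [('a', 10), ('c', 5)]
```

Because the merge is lazy, only about k tuples are consumed in total, no matter how long the lists are.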

I did not understand that if an 'a' appears in two lists, their counts must be combined. Here is a new memory-efficient algorithm:
(New) Algorithm:
1 (Re-)sort each list by ID (not by count). To release memory, the list can be written back to disk. Only enough memory for the longest list is required.
2 Get the next lowest unprocessed ID and find its total count across all lists.
3 Insert the ID into a priority queue of k nodes. Use the total count as the node's priority (not the ID). This priority queue drops the lowest node if more than k nodes are inserted.
4 Go to step 2 until all IDs have been exhausted.
Analysis: This algorithm can be implemented using only O(k) additional memory to store the min-heap. It makes several trade-offs to accomplish this:
The lists are sorted by ID in place; the original orderings by count are lost. Otherwise O(U) additional memory is required to make a master list of ID: total_count tuples, where U is the number of unique IDs.
The next lowest ID is found in O(n) time by checking the first tuple of each list. This is repeated U times, where U is the number of unique IDs. This might be improved by using a min-heap to track the next lowest ID. This would require O(n) additional memory (and may not be faster in all cases).
Note: This algorithm assumes IDs can be compared quickly. String comparisons are not trivial. I suggest hashing string IDs to integers. The hashes do not have to be unique, but collisions must be checked so that all IDs are properly sorted/compared. Of course, this would add to the memory/time complexity.
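A compact Python sketch of this algorithm, under some simplifying assumptions: the lists are sorted by ID in memory as copies (not in place or on disk), and `heapq.merge` is used to find the next lowest ID, which corresponds to the min-heap variant with O(n) extra memory mentioned in the analysis. The function name `top_k_by_id_merge` is hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def top_k_by_id_merge(lists, k):
    # Step 1: sort each list by ID (copies here, for clarity)
    by_id = [sorted(lst, key=itemgetter(0)) for lst in lists]
    # Step 2: stream all tuples in ascending ID order across the lists
    stream = heapq.merge(*by_id, key=itemgetter(0))
    heap = []                               # min-heap of (total, id), at most k entries
    for key, group in groupby(stream, key=itemgetter(0)):
        total = sum(count for _, count in group)
        # Step 3: insert, dropping the lowest total once more than k are held
        if len(heap) < k:
            heapq.heappush(heap, (total, key))
        else:
            heapq.heappushpop(heap, (total, key))
    return [(key, total) for total, key in sorted(heap, reverse=True)]

L1 = [("a", 10), ("b", 4), ("c", 3)]
L2 = [("c", 5), ("b", 2), ("a", 0)]
print(top_k_by_id_merge([L1, L2], 2))       # [('a', 10), ('c', 8)]
```

Only the k-entry heap and one cursor per list are live while streaming, so no master list of all unique IDs is ever built.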

The perfect solution requires all tuples to be inspected at least once.
However, it is possible to get close to the perfect solution without inspecting every tuple. Discarding the "long tail" introduces a margin of error. You can use some type of heuristic to calculate when the margin of error is acceptable.
For example, if there are n=100 sorted lists and you have inspected down each list until the count is 2, the most the total count for a key could increase by is 200.
I suggest taking an iterative approach:
1 Tally each list until a certain lower count threshold L is reached.
2 Lower L to include more tuples.
3 Add the new tuples to the counts tallied so far.
4 Go to step 2 until lowering L does not change the top k counts by more than a certain percentage.
This algorithm assumes the counts for the top k keys will approach a certain value the further the long tail is traversed. Instead of a percentage you can use other heuristics, such as the number of new keys in the top k, how much the top k keys were shuffled, etc.
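As a rough illustration of the iterative approach, the sketch below halves the threshold each round and stops as soon as the order of the top k keys is the same in two consecutive rounds (one of the alternative heuristics, not the percentage test). The function name `approx_top_k`, the starting threshold and the round limit are all assumptions, and the result is an approximation, not a guaranteed answer.

```python
from collections import Counter
import heapq

def approx_top_k(lists, k, start=64, max_rounds=20):
    totals = Counter()
    pos = [0] * len(lists)                  # how far each list has been read
    threshold = start                       # the lower count threshold L
    previous = None
    for _ in range(max_rounds):
        for i, lst in enumerate(lists):
            # lists are sorted by count descending, so stop at the first small count
            while pos[i] < len(lst) and lst[pos[i]][1] >= threshold:
                key, count = lst[pos[i]]
                totals[key] += count
                pos[i] += 1
        best = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        current = [key for key, _ in best]
        if current and current == previous:
            return current                  # top-k order was stable: stop early
        previous = current
        threshold //= 2                     # lower L to include more tuples
    return previous

L1 = [("a", 10), ("b", 4), ("c", 3)]
L2 = [("c", 5), ("b", 2), ("a", 0)]
print(approx_top_k([L1, L2], 2, start=8))   # ['a', 'c']
```

In this tiny example the loop stops before reading the last tuple of L2. On real data the stopping rule can be fooled, which is the margin of error described above.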

There is a sane way to implement this through mapreduce:
http://www.yourdailygeekery.com/2011/05/16/top-k-with-mapreduce.html

In general, I think you are in trouble. Imagine the following lists:
['a':100, 'b':99, ...]
['c':90, 'd':89, ..., 'b':2]
and you have k=1 (i.e. you want only the top one). 'b' is the right answer, but you need to look all the way down to the end of the second list to realize that 'b' beats 'a'.
Edit:
If you have the right distribution (long, low count tails), you might be able to do better. Let's keep with k=1 for now to make our lives easier.
The basic algorithm is to keep a hash map of the keys you've seen so far and their associated totals. Walk down the lists processing elements and updating your map.
The key observation is that a key can gain in count by at most the sum of the counts at the current processing point of each list (call that sum S). So on each step, you can prune from your hash map any keys whose total is more than S below your current maximum count element. (I'm not sure what data structure you would need to prune as you need to look up keys given a range of counts - maybe a priority queue?)
When your hash map has only one element in it, and its count is at least S, then you can stop processing the lists and return that element as the answer. If your count distribution plays nice, this early exit may actually trigger so you don't have to process all of the lists.
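Here is a minimal sketch of that idea for k=1. It assumes non-negative counts, lists sorted by count in descending order, at least one non-empty list, and that the lists are read in lockstep (one row of every list per step). The name `top_1_early_exit` is made up, and pruning is done with a plain scan of the hash map instead of a cleverer data structure.

```python
def top_1_early_exit(lists):
    """Return (winning key, number of rows read from each list)."""
    totals = {}                             # key -> total seen so far
    depth = 0
    longest = max(len(lst) for lst in lists)
    while depth < longest:
        for lst in lists:                   # process one row of every list
            if depth < len(lst):
                key, count = lst[depth]
                totals[key] = totals.get(key, 0) + count
        depth += 1
        # S: the most any key can still gain from the unread tuples
        s = sum(lst[depth][1] for lst in lists if depth < len(lst))
        best = max(totals.values())
        # prune keys that cannot catch up with the current maximum
        totals = {k: v for k, v in totals.items() if v + s >= best}
        if len(totals) == 1 and best >= s:
            break                           # no seen or unseen key can beat it
    return max(totals, key=totals.get), depth

friendly = [[("a", 100), ("b", 5), ("c", 1)], [("a", 80), ("c", 4), ("b", 1)]]
nasty = [[("a", 100), ("b", 99), ("x", 1)], [("c", 90), ("d", 89), ("b", 2)]]
print(top_1_early_exit(friendly))           # ('a', 1)
print(top_1_early_exit(nasty))              # ('b', 3)
```

With the friendly distribution the loop exits after one row. With the adversarial lists from the start of this answer it has to read everything before 'b' overtakes 'a'.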

Related

Divide the list into 2 equal Parts

I have a list of random numbers, each >= 0. I have to divide the list into 2 equal parts (assume the list contains an even number of elements) such that all the numbers in the first list are less than the numbers in the second list. This can easily be done with any sorting algorithm in O(n log n). But I don't need the data within either of the two equal-length lists to be sorted. The only condition is that all elements in the first list <= all elements in the second list.
So is there a way or hack to reduce the complexity, since we don't require sorted data here?
If the problem is actually solvable (the data is right), you can find the median using a selection algorithm. Once you have it, create 2 equally sized arrays and iterate over the original list element by element, putting each element into one of the new lists depending on whether it is bigger or smaller than the median; elements equal to the median are used at the end to fill whichever list is still short. This should run in linear time.
Edit: as gen-y-s pointed out, if you write the selection algorithm yourself or use a proper library, it might already partition the input list, so there is no need for the second pass.
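For illustration, here is a hand-written quickselect in Python that partitions as it goes, so no second pass is needed. It is a sketch with a random pivot (expected linear time, not worst-case linear), it works on a copy of the input, and the name `split_halves` is an assumption.

```python
import random

def split_halves(values):
    """Split so that every element of the first half <= every element of the second."""
    a = list(values)
    target = len(a) // 2                    # index where the second half starts
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                       # Hoare-style partition around pivot
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # now a[lo..j] <= pivot <= a[i..hi]; continue in the side holding target
        if target <= j:
            hi = j
        elif target >= i:
            lo = i
        else:
            break                           # a[target] is already in its final place
    return a[:target], a[target:]

low, high = split_halves([7, 1, 9, 3, 8, 2])
print(sorted(low), sorted(high))            # [1, 2, 3] [7, 8, 9]
```

The elements inside each half come out in arbitrary order, which is exactly the relaxation the question allows.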

Distribute list elements between two lists equitably

I have a list with 'n' elements (let's say 10, for example). I want to distribute these elements into two lists, each balanced with the other by some criterion that evaluates the value of each element. That is, the output should be two lists with 5 elements that are approximately balanced with each other.
Thanks for your time.
You could employ a greedy strategy (I'm not certain this will give you "optimal" results, but it should give you relatively good results at least).
Start by finding the total value of all the elements in your list, V. The goal is to create two lists each with value about half this amount (ideally as close to 1/2*V as possible). Start with 3 lists, List_original, List_1, List_2.
Pull off items from List_original (starting with the largest, working your way down to the smallest) and put them into List_1 if and only if adding them to List_1 doesn't cause the total value of List_1 to exceed 1/2*V. Everything else goes into List_2.
The result will be that List_1 will be at most 1/2*V and List_2 will be at least 1/2*V. In the event that some subset of your items sums up to exactly 1/2*V then you might get equality. I haven't tried to prove/disprove this yet. Depending on how close to balanced your result has to be, this could be good enough (it should at least be very fast).
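A short Python sketch of this greedy strategy, assuming the elements are plain non-negative numbers. The name `greedy_split` is hypothetical. Like the description above, it balances the totals only and makes no attempt to give both lists the same number of elements.

```python
def greedy_split(values):
    half = sum(values) / 2                  # target value 1/2*V for List_1
    list_1, list_2 = [], []
    total_1 = 0
    for v in sorted(values, reverse=True):  # largest first
        if total_1 + v <= half:
            list_1.append(v)                # fits without exceeding 1/2*V
            total_1 += v
        else:
            list_2.append(v)                # everything else goes to List_2
    return list_1, list_2

print(greedy_split([8, 7, 6, 5, 4]))        # ([8, 7], [6, 5, 4])
```

In this example both lists sum to 15, but the lengths differ (2 and 3), so an extra constraint would be needed if equal lengths are mandatory.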
I came up with a quick "solution" by taking the average value of the full list, then ordering it ascending, taking the two highest values for each list, and then iterating over the rest. On each iteration I compared the average of the full list with the average each of the two sublists would have with the current element added, and put the element into the list whose average was closer to the full average. I kept doing this until the lists were full.
I know it is not the best choice but it was good enough for now.
Hope my explanation was clear enough.
Thanks to all.

Algorithm for certain permutation of array elements (parallel sorting by regular sampling) [C++]

I am implementing a parallel sorting by regular sampling algorithm, which is described here. I am stuck at the point where I need to migrate sorted sublists to their proper places in the sorted array. The problem can be stated this way: there is one global array. The array has been divided into p subarrays. Each of those subarrays was sorted. p-1 global pivot elements were determined, and each subarray was divided into p sub-sub-arrays (yellow, red, green). Now I need to move those sub-sub-arrays so that the sub-sub-arrays with local index i end up in thread i (so they are ordered such that equal colors are neighbouring and the order from left to right is preserved).
Actually, a serial algorithm will do, but I just have no clever idea how to obtain the proper permutation. The following figure shows a case for p=3 threads. Yellow denotes sub-sub-array 0, red 1, green 2.
The sub-sub arrays may have different sizes.
OK, it seems I don't have enough reputation to comment on your question, so I shall take the route of posting an answer.
So let me get this straight. You are stuck on phase 3 of this algorithm, right?
How about this:
Let's have p linked lists of indexes. Let each process communicate its index ranges to process i; as the indexes are communicated, append them to the list of process i. When all the communication is over, you will have all the indexes for process i in the list of process i. A node of this list should be a data structure like
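struct Node {
    std::size_t index;  // position of the element in the global array
    int valueOfIndex;   // element stored at that position (int is an assumption)
};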
Now, as you populate the list, also copy each value into its node.
Once you are through with this process, you can recreate the array for process i using list i.

Time complexity for array management (Algorithm)

I'm working on a program that takes in a bunch (y) of integers and then needs to return the x highest integers in order. This code needs to be as fast as possible, but at the moment I don't think I have the best algorithm.
My approach so far is to keep a sorted list (high to low) of the integers that have already been input and to handle each item as it comes in. For the first x items, I maintain a sorted array of integers, and when each new item comes in, I figure out where it should be placed using a binary search. (I'm also considering just taking in the first x items and then quick sorting them, but I don't know if this is faster.) After the first x items have been sorted, I consider each remaining item by first checking whether it qualifies to enter the sorted list of highest integers (by checking whether the new integer is greater than the integer at the end of the list); if it does, I add it to the sorted list via a binary search and remove the integer at the end of the list.
I was wondering if anyone had any advice as to how I can make this faster, or perhaps an entire new approach that is faster than this. Thanks.
This is a partial sort:
A fast implementation is a Quicksort variant in which you only recurse on the ranges containing the bottom/top k elements.
In C++ you can just use std::partial_sort
If you use a heap-ordered tree data structure to store the integers, inserting a new integer takes no more than lg N comparisons and removing the maximum takes no more than 2 lg N comparisons. Thus, to insert y items would require no more than y lg N comparisons and to remove the top x items would require no more than 2x lg N comparisons. The Wikipedia entry has references to a range of implementations.
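In Python this can be tried with the standard `heapq` module. The sketch below is one possible reading of the idea: `heapq` is a min-heap, so the values are negated to get max-heap behaviour, and `heapify` builds the heap in one go instead of y separate inserts. The function name `top_x_with_heap` is an assumption.

```python
import heapq

def top_x_with_heap(numbers, x):
    heap = [-n for n in numbers]            # negate: heapq is a min-heap
    heapq.heapify(heap)                     # build the heap from all y items
    # pop the maximum x times (or fewer if there are fewer than x items)
    return [-heapq.heappop(heap) for _ in range(min(x, len(heap)))]

print(top_x_with_heap([5, 1, 9, 3, 7], 3))  # [9, 7, 5]
```

The result comes out already ordered from high to low, which is what the question asks for.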
This is called a top-N sort. Here is a very simple and efficient scheme. No fancy data structures needed.
1 Keep a list of the highest x elements (it starts out empty)
2 Split your input into chunks of x * 10 items
3 For each chunk, add the remembered list of the x highest items so far to it and sort it (e.g. quick sort)
4 Keep the x highest items. They form the new remembered list
5 goto 3 until all chunks processed
6 The remembered list is now your final result
For a fixed x this is linear in the number of items N (O(N log x) overall) and only requires a normal quick sort as a primitive.
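The scheme translates to Python almost line by line. In this sketch the built-in `sorted` takes the place of quick sort, the input may be any iterable, and the name `top_x_chunked` is made up.

```python
from itertools import islice

def top_x_chunked(numbers, x):
    it = iter(numbers)
    best = []                               # remembered list of the x highest so far
    while True:
        chunk = list(islice(it, x * 10))    # next chunk of x * 10 items
        if not chunk:
            break                           # all chunks processed
        # sort the chunk together with the remembered list, keep the x highest
        best = sorted(best + chunk, reverse=True)[:x]
    return best

print(top_x_chunked(range(1000), 3))        # [999, 998, 997]
```

At most 11 * x items are held in memory at any time, so this also works for inputs that do not fit in memory.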
You don't seem to need the top N items in sorted order. Because of this, you can solve this in linear time.
Find the Nth largest array element using linear-time selection. Return it and all array elements larger than it.

Limited Sort/Filter Algorithm

I have a rather large list of elements (100s of thousands).
I have a filter that can either accept or not accept elements.
I want the top 100 elements that satisfy the filter.
So far, I have sorted the results first and then taken the top 100 that satisfy the filter. The rationale behind this is that the filter is not entirely fast.
But right now, the sorting step is taking way longer than the filtering step, so I would like to combine them in some way.
Is there an algorithm to combine the concerns of sorting/filtering to get the top 100 results satisfying the filter without incurring the cost of sorting all of the elements?
My instinct is to select the top 100 elements from the list (much cheaper than a sort, use your favorite variant of QuickSelect). Run those through the filter, yielding n successes and 100-n failures. If n < 100 then repeat by selecting 100-n elements from the top of the remainder of the list:
k = 100
while k > 0 and list is not empty:
    select top k from list and remove them
    filter them, yielding n successes
    k = k - n
All being well this runs in time proportional to the length of the list, since each selection step runs in that time, and the number of selection steps required depends on the success rate of the filter, but not directly on the size of the list.
I expect this has some bad cases, though. If almost all elements fail the filter then it's considerably slower than just sorting everything, since you'll end up selecting thousands of times. So you might want some criteria to bail out if it's looking bad, and fall back to sorting the whole list.
It also has the problem that it will likely do a largeish number of small selects towards the end, since we expect k to decay exponentially if the filter criteria are unrelated to the sort criteria. So you could probably improve it by selecting somewhat more than k elements at each step. Say, k divided by the expected success rate of the filter, plus a small constant. The expectation can be based on past performance if there's no domain knowledge you can use to predict it, and the small constant can be chosen experimentally to avoid an annoyingly large number of steps to find the last few elements. If you end up at any step with more items that have passed the filter than the number you're still looking for (i.e., n > k), then select the top k from the current batch of successes and you're done.
Since QuickSelect gives you the top k without sorting those k, you'll need to do a final sort of 100 elements if you need the top 100 in order.
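A runnable Python sketch of the basic loop (without the over-selection refinement). Python has no QuickSelect in the standard library, so `heapq.nlargest` is swapped in as the selection step. It returns each batch in descending order, which makes the final sort unnecessary here. The names `top_passing` and `accept` are assumptions.

```python
import heapq

def top_passing(items, accept, want=100):
    remaining = list(items)
    passed = []
    while len(passed) < want and remaining:
        k = want - len(passed)              # how many are still missing
        # indices of the k largest remaining elements, largest first
        batch = heapq.nlargest(k, range(len(remaining)),
                               key=remaining.__getitem__)
        passed.extend(remaining[i] for i in batch if accept(remaining[i]))
        chosen = set(batch)
        # remove the selected elements, whether they passed or not
        remaining = [v for i, v in enumerate(remaining) if i not in chosen]
    return passed

is_even = lambda n: n % 2 == 0
print(top_passing(range(1000), is_even, want=5))   # [998, 996, 994, 992, 990]
```

With this input the loop needs four selection rounds, the last two with k=1, which shows the "many small selects towards the end" effect.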
I've solved this exact problem by using a binary tree for sorting and by keeping count of the elements to the left of the current node during insertion. See http://pub.uni-bielefeld.de/publication/2305936 (Figure 4.4 et al) for details.
If I understand right, you have two choices:
Selecting 100 elements - N operations of the filter check, then 100(lg 100) for the sort.
Sorting, then selecting 100 elements - at least N(lg N) for the sort, then the select.
The first sounds shorter than sorting and then selecting.
I'd probably filter first, then insert the result of that into a priority queue. Keep track of the number of items in the PQ, and after you do the insert, if it's larger than the number you want to keep (100 in your case), pop off the smallest item and discard it.
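Something along these lines, using Python's `heapq` as the priority queue. This is a sketch, with `top_passing_heap` and `accept` as made-up names.

```python
import heapq

def top_passing_heap(items, accept, want=100):
    heap = []                               # min-heap of the best items so far
    for item in items:
        if not accept(item):                # filter first
            continue
        heapq.heappush(heap, item)
        if len(heap) > want:
            heapq.heappop(heap)             # discard the smallest item
    return sorted(heap, reverse=True)       # final sort of at most `want` items

is_even = lambda n: n % 2 == 0
print(top_passing_heap(range(1000), is_even, want=5))   # [998, 996, 994, 992, 990]
```

This calls the filter on every element. If the filter is the slow part, comparing an item with `heap[0]` before filtering would skip the filter for items that cannot enter a full queue anyway.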
Steve's suggestion to use Quicksort is a good one.
1 Read in the first 1000 or so elements.
2 Sort them and pick the 100th largest element.
3 Run one pass of Quicksort on the whole file with the element from step 2 as the pivot.
4 Select the upper half of the result of the Quicksort pass for further processing.
You are guaranteed at least 100 elements in the upper half of the single pass of Quicksort. Assuming the first 1000 are reasonably representative of the whole file then you should end up with about one tenth of the original elements at step 4.
