Merging sorted n lists - algorithm

How is the problem of sorting a very large list tackled?
I suppose we divide the list, process each part on a separate CPU, and produce small sorted lists.
But how can we combine them to produce the final sorted list?

You can merge multiple sorted lists using a priority queue (based on a binary heap).
Fill the queue with pairs (current element of each list, or its index; list id).
At every step:
extract the pair with the minimum element from the queue
add its value to the result
get the next element of the same list (if there is one)
insert the new pair into the queue again
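A minimal sketch of that loop in Python, assuming the sorted pieces already fit in memory as ordinary lists (function and variable names are illustrative):

```python
import heapq

def merge_sorted_runs(runs):
    """Merge individually sorted lists (e.g. the chunks sorted on each CPU)."""
    # each heap entry is (current element, run id, index of that element)
    heap = [(run[0], run_id, 0) for run_id, run in enumerate(runs) if run]
    heapq.heapify(heap)

    result = []
    while heap:
        value, run_id, idx = heapq.heappop(heap)   # extract pair with min element
        result.append(value)                       # add value to result
        next_idx = idx + 1
        if next_idx < len(runs[run_id]):           # next element of the same list
            heapq.heappush(heap, (runs[run_id][next_idx], run_id, next_idx))
    return result

# merge_sorted_runs([[1, 4, 7], [2, 5], [0, 3, 6]]) -> [0, 1, 2, 3, 4, 5, 6, 7]
```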
How large is your list relative to the available memory?
For useful clues, start from the Wikipedia page on external sorting.

The basic approach is to create a min-heap of size n, where n is the number of sorted partitions produced from the huge list.
Each node of the binary heap holds the index/number of the sorted list it came from and the value itself.
The top node of the min-heap is the minimum value across all lists, and its index tells you which sorted list it came from. Pop the top of the min-heap, append its value to the output, push the next value from the popped list onto the heap, and heapify again.
Repeat until all nodes are finished; also take care to shrink the heap size as lists become empty during the process.

Since the issue is that your list is larger than memory, I would say external sort is the solution:
https://en.wikipedia.org/wiki/External_sorting
Say we have N blocks of main memory; we can load N-1 blocks from two lists and use the remaining block as an output buffer.
Merge the two lists by the usual merge procedure, comparing the front elements and writing the result to the output buffer.
When the buffer is full, write it back to secondary memory.
Repeat these steps until all the lists are merged.
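Here is a rough sketch of a single merge pass with an output buffer, assuming each input file holds one integer per line in ascending order; the file names and buffer size are illustrative, and a real external sort would also read the inputs block by block:

```python
def merge_two_runs(path_a, path_b, out_path, buffer_size=1024):
    """Merge two sorted runs stored on disk, flushing an output buffer."""
    buffer = []
    with open(path_a) as fa, open(path_b) as fb, open(out_path, "w") as out:
        def flush():
            out.writelines(f"{x}\n" for x in buffer)
            buffer.clear()

        a, b = fa.readline(), fb.readline()
        while a or b:
            # usual merge: compare the front elements and take the smaller
            if a and (not b or int(a) <= int(b)):
                buffer.append(int(a))
                a = fa.readline()
            else:
                buffer.append(int(b))
                b = fb.readline()
            if len(buffer) >= buffer_size:   # output buffer full: write it back
                flush()
        flush()                              # write out whatever remains
```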

Related

Selecting top n items from multiple sorted arrays

What is the optimal algorithm for selecting the top n elements from multiple arrays, given that each array is sorted in the same order in which the resulting array should be?
Reading elements is very expensive and therefore the number of reads should be an absolute minimum.
Put tuples (current_element, array_number, current_index=0) into a priority queue (for example, one based on a binary max-heap), ordered by element value.
Then remove the top of the queue n times.
After each removal, increment the index in the corresponding array (if possible), get the next element, and insert the updated tuple into the queue again.
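As a sketch, assuming the arrays are sorted largest-first (Python's heapq only provides a min-heap, so values are negated to get max-heap behaviour; names are illustrative):

```python
import heapq

def top_n(arrays, n):
    """Select the n largest elements from several descending-sorted arrays."""
    # negate values so the min-heap behaves like a max-heap
    heap = [(-arr[0], i, 0) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < n:
        neg_value, arr_id, idx = heapq.heappop(heap)
        result.append(-neg_value)
        if idx + 1 < len(arrays[arr_id]):      # advance within the same array
            heapq.heappush(heap, (-arrays[arr_id][idx + 1], arr_id, idx + 1))
    return result

# top_n([[10, 4, 3], [9, 2, 0]], 3) -> [10, 9, 4]
# at most n + number_of_arrays elements are ever read
```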

Is there such data structure - "linked list with samples"

Is there such data structure:
There is a slow list data structure, such as a linked list or data saved on disk.
There is a relatively small array of pointers to some of the elements in the "slow list", hopefully evenly distributed.
Then, when you search, you first check the array and then perform the normal search (linked-list search, or binary search in the case of data on disk).
This looks very similar to jump search, sample search and to skip lists, but I think it is a different algorithm.
Please note I am giving the example with a linked list or a file on disk because they are slow structures.
I don't know if there's a name for this algorithm (I don't think it deserves one, though if there isn't, it could bear mine:), but I did implement something like that 10 years ago for an interview.
You can have an array of pointers to the elements of a list. An array of fixed size, say, of 256 pointers. When you construct the list or traverse it for the first time, you store pointers to its elements in the array. So, for a list of 256 or fewer elements you'd have a pointer to each element.
As the list grows beyond 256 elements, you drop every odd-numbered pointer by moving the 128 even-numbered pointers to the beginning of the array. When the array of pointers fills up again, you repeat the procedure. At every such point you double the step between the list elements whose addresses end up in the array of pointers. Initially you'd place every element's address there, then every other's, then of one out of four and so on.
You end up with an array of pointers to the list elements spaced apart by the list length / 256.
If the list is singly-linked, locating the i-th element from the beginning or the end of it is reduced to searching in 1/256th of the list.
If the list is sorted, you can perform binary search on the array to locate the bin (the 1/256th portion of the list) in which to look further.
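A sketch of that scheme, assuming a singly-linked list built by appending; the class and field names are illustrative:

```python
class SampledList:
    """Singly-linked list plus a fixed-size array of pointers to sampled nodes."""

    class Node:
        __slots__ = ("value", "next")
        def __init__(self, value):
            self.value, self.next = value, None

    def __init__(self, capacity=256):
        self.capacity = capacity   # size of the pointer array
        self.head = self.tail = None
        self.length = 0
        self.samples = []          # pointers to every `step`-th node
        self.step = 1              # spacing between sampled nodes

    def append(self, value):
        node = self.Node(value)
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
        self.tail = node
        if self.length % self.step == 0:
            if len(self.samples) == self.capacity:
                # array full: drop every other pointer and double the step
                self.samples = self.samples[::2]
                self.step *= 2
            if self.length % self.step == 0:
                self.samples.append(node)
        self.length += 1

    def get(self, i):
        """Locate the i-th element by jumping to the nearest earlier sample."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        node = self.samples[i // self.step]
        for _ in range(i % self.step):   # walk at most `step - 1` links
            node = node.next
        return node.value
```

Locating an element this way touches at most `step` nodes, i.e. roughly length / 256 of the list.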

I'm trying to find an algorithm for merging m sorted lists (of total n elements) in n log2 m runtime

Above says it all. I cannot really think of a way to do this, and how to prove it. Any ideas?
Put the input lists in a heap (aka a priority queue) where each list's priority is its first element. To get the next element of the output list, pull the top list off the heap, append its first element to the output list, remove that element from the input list, and (if the input list is not empty) put the input list back in the heap. Repeat until the heap is empty.
See this question and answer on Computer Science stackexchange for more details.
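In Python, heapq.merge implements this scheme: the heap never holds more than m entries, so each of the n output elements costs O(log m), giving O(n log m) overall.

```python
import heapq

lists = [[1, 4, 7], [2, 5, 8], [0, 3, 6, 9]]
merged = list(heapq.merge(*lists))   # lazy m-way merge via a heap of list heads
print(merged)                        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```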

Data Structure that supports queue like operations and mode finding

This was an interview question asked to me almost 3 years back, and I was pondering it again a while ago.
Design a data structure that supports the following operations:
insert_back(), remove_front() and find_mode(). Best complexity
required.
The best solution I could think of was O(log n) for insertion and deletion and O(1) for mode. This is how I solved it: keep a queue data structure for tracking which element is inserted and deleted.
Also keep an array which is max heap ordered and a hash table.
The hashtable contains an integer key and an index into the heap array location of that element. The heap array contains an ordered pair (count,element) and is ordered on the count property.
Insertion: Insert the element into the queue. Find the element's location in the heap array via the hashtable. If none exists, then add the element to the heap, heapify upwards, and add the final location to the hashtable. Increment the count in that location and heapify upwards or downwards as needed to restore the heap property.
Deletion: Remove the element from the head of the queue. From the hash table, find its location in the heap array. Decrement the count in the heap and re-heapify upwards or downwards as needed to restore the heap property.
Find Mode: The element at the head of the heap array (getMax()) will give us the mode.
Can someone please suggest something better? The only optimization I could think of was using a Fibonacci heap, but I am not sure whether it is a good fit for this problem.
I think there is a solution with O(1) for all operations.
You need a deque, and two hashtables.
The first one is a linked hashtable, where for each element you store its count, the next element in count order and the previous element in count order. Then you can look up the next and previous elements' entries in that hashtable in constant time. For this hashtable you also keep and update the element with the largest count. (element -> count, next_element, previous_element)
In the second hashtable, for each distinct count you store the first and last elements of the run of elements with that count in the first hashtable. Note that the size of this hashtable will be less than n (it's O(sqrt(n)), I think). (count -> (first_element, last_element))
Basically, when you add an element to or remove an element from the deque, you can find its new position in the first hashtable by analyzing its next and previous elements, and the values for the old and new count in the second hashtable, in constant time. You can remove and add elements in the first hashtable in constant time, using the usual linked-list algorithms. You can also update the second hashtable and the element with the maximum count in constant time as well.
I'll try writing pseudocode if needed, but it seems to be quite complex with many special cases.
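Here is a minimal sketch of the same idea with one simplification: instead of a linked hashtable ordered by count, it keeps one bucket (a set) per count value, which gives the same O(1) bounds; the class and field names are illustrative:

```python
from collections import deque, defaultdict

class QueueWithMode:
    """Queue supporting insert_back, remove_front and find_mode in O(1) each."""

    def __init__(self):
        self.queue = deque()              # elements in insertion order
        self.count = defaultdict(int)     # element -> current count
        self.buckets = defaultdict(set)   # count -> elements having that count
        self.max_count = 0                # largest count currently present

    def insert_back(self, x):
        self.queue.append(x)
        c = self.count[x]
        self.buckets[c].discard(x)        # move x from bucket c to bucket c+1
        self.count[x] = c + 1
        self.buckets[c + 1].add(x)
        self.max_count = max(self.max_count, c + 1)

    def remove_front(self):
        x = self.queue.popleft()
        c = self.count[x]
        self.buckets[c].discard(x)        # move x down one bucket
        self.count[x] = c - 1
        if c > 1:
            self.buckets[c - 1].add(x)
        # the maximum can only drop by one, so updating it stays O(1)
        if self.max_count == c and not self.buckets[c]:
            self.max_count -= 1
        return x

    def find_mode(self):
        if self.max_count == 0:
            return None                   # queue is empty
        return next(iter(self.buckets[self.max_count]))
```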

top-k selection/merge

I have n sorted lists (5 < n < 300). These lists are quite long (300000+ tuples). Selecting the top k of the individual lists is of course trivial - they are right at the head of the lists.
Example for k = 2:
top2 (L1: [ 'a': 10, 'b': 4, 'c':3 ]) = ['a':10 'b':4]
top2 (L2: [ 'c': 5, 'b': 2, 'a':0 ]) = ['c':5 'b':2]
Where it gets more interesting is when I want the combined top k across all the sorted lists.
top2(L1+L2) = ['a':10, 'c':8]
Just combining the top k of the individual lists would not necessarily give the correct result:
top2(top2(L1)+top2(L2)) = ['a':10, 'b':6]
The goal is to reduce the required space and keep the sorted lists small.
top2(topX(L1)+topX(L2)) = ['a':10, 'c':8]
The question is whether there is an algorithm to calculate the combined top k, in the correct order, while cutting off the long tail of the lists at a certain position. And if there is: how does one find the limit X where it is safe to cut?
Note: Correct counts are not important. Only the order is.
top2(magic([L1,L2])) = ['a', 'c']
This algorithm uses O(U) memory where U is the number of unique keys. I doubt a lower memory bound can be achieved, because it is impossible to tell which keys can be discarded until all the keys have been summed.
Make a master list of (key: total_count) tuples. Simply run through each list one item at a time, keeping a running total of each key's count.
Use any top-k selection algorithm on the master list that does not use additional memory. One simple solution is to sort the list in place.
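A short sketch of this with Python's Counter, assuming each list is a sequence of (key, count) tuples:

```python
from collections import Counter

def top_k_exact(lists, k):
    """Tally every tuple into a master Counter, then take the k largest totals."""
    totals = Counter()
    for lst in lists:
        for key, count in lst:
            totals[key] += count
    return totals.most_common(k)   # O(U) memory, U = number of unique keys

# Example from the question:
# top_k_exact([[('a', 10), ('b', 4), ('c', 3)],
#              [('c', 5), ('b', 2), ('a', 0)]], 2) -> [('a', 10), ('c', 8)]
```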
If I understand your question correctly, the correct output is the top 10 items, irrespective of the list from which each came. If that's correct, then starting with the first 10 items in each list will allow you to generate the correct output (if you only want unique items in the output, but the inputs might contain duplicates, then you need 10 unique items from each list).
In the most extreme case, all the top items come from one list, and all items from the other lists are ignored. In this case, having 10 items in the one list will be sufficient to produce the correct result.
1. Associate an index with each of your n lists. Set it to point to the first element in each case.
2. Create a list-of-lists, and sort it by the indexed elements.
3. The indexed item on the top list in your list-of-lists is your first element.
4. Increment the index for the topmost list, remove that list from the list-of-lists, and re-insert it based on the new value of its indexed element.
5. The indexed item on the top list in your list-of-lists is your next element.
6. Go to step 4 and repeat until done.
You didn't specify how many lists you have. If n is small, then step 4 can be done very simply (just re-sort the lists). As n grows you may want to think about more efficient ways to re-sort an almost-sorted list-of-lists.
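A sketch of steps 2-6, assuming each list is a sequence of (key, count) tuples sorted by descending count; the re-insertion in step 4 is done with bisect, and summing duplicate keys is left to the caller (names are illustrative):

```python
import bisect

def merge_by_reinsertion(lists):
    """Yield (key, count) tuples in descending count order across all lists."""
    # each entry is (-current_count, index, list_id); negating the count makes
    # the smallest entry correspond to the largest count
    entries = sorted((-lst[0][1], 0, i) for i, lst in enumerate(lists) if lst)
    while entries:
        _, idx, list_id = entries.pop(0)        # take the topmost list
        yield lists[list_id][idx]
        if idx + 1 < len(lists[list_id]):       # advance it and re-insert
            nxt = (-lists[list_id][idx + 1][1], idx + 1, list_id)
            bisect.insort(entries, nxt)
```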
I did not understand at first that if an 'a' appears in two lists, their counts must be combined. Here is a new, memory-efficient algorithm:
(New) Algorithm:
1. (Re-)sort each list by ID (not by count). To release memory, the list can be written back to disk. Only enough memory for the longest list is required.
2. Get the next lowest unprocessed ID and find the total count across all lists.
3. Insert the ID into a priority queue of k nodes. Use the total count as the node's priority (not the ID). This priority queue drops the lowest node if more than k nodes are inserted.
4. Go to step 2 until all ID's have been exhausted.
Analysis: This algorithm can be implemented using only O(k) additional memory to store the min-heap. It makes several trade-offs to accomplish this:
The lists are sorted by ID in place; the original orderings by counts are lost. Otherwise O(U) additional memory is required to make a master list of ID: total_count tuples, where U is the number of unique ID's.
The next lowest ID is found in O(n) time by checking the first tuple of each list. This is repeated U times where U is the number of unique ID's. This might be improved by using a min-heap to track the next lowest ID. This would require O(n) additional memory (and may not be faster in all cases).
Note: This algorithm assumes ID's can be quickly compared. String comparisons are not trivial. I suggest hashing string ID's to integers. They do not have to be unique hashes, but collisions must be checked so all ID's are properly sorted/compared. Of course, this would add to the memory/time complexity.
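A sketch of the algorithm, under the assumption that each list is a sequence of (ID, count) tuples and fits in memory once sorted (a real implementation would sort and stream the lists from disk):

```python
import heapq

def top_k_low_memory(lists, k):
    """Walk all lists in ID order, summing counts into a size-k min-heap."""
    lists = [sorted(lst) for lst in lists]      # step 1: sort each list by ID
    positions = [0] * len(lists)
    heap = []                                   # min-heap of (total, ID), size <= k

    while True:
        # step 2: next lowest unprocessed ID across all lists
        heads = [lst[pos][0] for lst, pos in zip(lists, positions) if pos < len(lst)]
        if not heads:
            break
        lowest = min(heads)
        total = 0
        for i, lst in enumerate(lists):
            if positions[i] < len(lst) and lst[positions[i]][0] == lowest:
                total += lst[positions[i]][1]
                positions[i] += 1
        # step 3: bounded priority queue keeps only the k largest totals
        if len(heap) < k:
            heapq.heappush(heap, (total, lowest))
        else:
            heapq.heappushpop(heap, (total, lowest))

    return sorted(heap, reverse=True)           # largest totals first
```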
The perfect solution requires all tuples to be inspected at least once.
However, it is possible to get close to the perfect solution without inspecting every tuple. Discarding the "long tail" introduces a margin of error. You can use some type of heuristic to calculate when the margin of error is acceptable.
For example, if there are n=100 sorted lists and you have inspected down each list until the count is 2, the most the total count for a key could increase by is 200.
I suggest taking an iterative approach:
1. Tally each list until a certain lower count threshold L is reached.
2. Lower L to include more tuples.
3. Add the new tuples to the counts tallied so far.
4. Go to step 2 until lowering L does not change the top k counts by more than a certain percentage.
This algorithm assumes the counts for the top k keys will approach a certain value the further the long tail is traversed. You can use other heuristics instead of the certain percentage, such as the number of new keys in the top k, how much the top k keys were shuffled, etc.
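A hedged sketch of that loop, with a fixed threshold schedule and "the top-k keys stopped changing" as the stopping rule; both are illustrative choices, not part of the original answer:

```python
from collections import Counter

def top_k_iterative(lists, k, thresholds=(100, 50, 20, 10, 5, 2, 1, 0)):
    """Tally tuples down to a count threshold L, lowering L until the top k settles."""
    totals = Counter()
    positions = [0] * len(lists)
    previous_keys = None

    for threshold in thresholds:                  # step 2: lower L each round
        # steps 1 and 3: consume every not-yet-seen tuple with count >= L
        for i, lst in enumerate(lists):
            while positions[i] < len(lst) and lst[positions[i]][1] >= threshold:
                key, count = lst[positions[i]]
                totals[key] += count
                positions[i] += 1

        current_keys = [key for key, _ in totals.most_common(k)]
        if current_keys == previous_keys:         # step 4: top k stopped changing
            break
        previous_keys = current_keys

    return totals.most_common(k)
```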
There is a sane way to implement this through mapreduce:
http://www.yourdailygeekery.com/2011/05/16/top-k-with-mapreduce.html
In general, I think you are in trouble. Imagine the following lists:
['a':100, 'b':99, ...]
['c':90, 'd':89, ..., 'b':2]
and you have k=1 (i.e. you want only the top one). 'b' is the right answer, but you need to look all the way down to the end of the second list to realize that 'b' beats 'a'.
Edit:
If you have the right distribution (long, low-count tails), you might be able to do better. Let's stick with k=1 for now to make our lives easier.
The basic algorithm is to keep a hash map of the keys you've seen so far and their associated totals. Walk down the lists processing elements and updating your map.
The key observation is that a key can gain in count by at most the sum of the counts at the current processing point of each list (call that sum S). So on each step, you can prune from your hash map any keys whose total is more than S below your current maximum count element. (I'm not sure what data structure you would need to prune as you need to look up keys given a range of counts - maybe a priority queue?)
When your hash map has only one element in it, and its count is at least S, then you can stop processing the lists and return that element as the answer. If your count distribution plays nice, this early exit may actually trigger so you don't have to process all of the lists.
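A simplified sketch of that idea for k = 1: it keeps every running total rather than pruning (sidestepping the data-structure question above) and uses the bound S only for the early exit, assuming each key appears at most once per list and counts are non-negative:

```python
def top1_with_early_exit(lists):
    """Find the key with the largest combined count, stopping early when possible."""
    totals = {}
    positions = [0] * len(lists)

    while any(pos < len(lst) for pos, lst in zip(positions, lists)):
        # walk the lists in lockstep, one element from each per round
        for i, lst in enumerate(lists):
            if positions[i] < len(lst):
                key, count = lst[positions[i]]
                totals[key] = totals.get(key, 0) + count
                positions[i] += 1

        # S: the most any single key (seen or unseen) can still gain from the tails
        S = sum(lst[pos][1] for lst, pos in zip(lists, positions) if pos < len(lst))
        ranked = sorted(totals.values(), reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0
        if ranked[0] - runner_up >= S:            # nobody can catch up any more
            break

    return max(totals, key=totals.get) if totals else None
```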
