Merging N sorted files using K-way merge - algorithm

There is decent literature about merging sorted files, or, say, merging K sorted files. It all works on the same idea: the first element of each file is put in a heap; then, until the heap is empty, poll the minimum and fetch another element from the file the polled element came from. This works as long as one record from each file can be held in the heap at once.
Now let us say I have N sorted files but I can only bring K records into the heap, with K < N, and let us say N = K*c where "c" is a multiplier, implying that N is so large that it is some multiple of K. Clearly, this will require doing a K-way merge over and over until we are left with only K files, and then we merge those one last time into the final sorted output. How do I implement this, and what will be the complexity?

There are multiple examples of k-way merge written in Java. One is http://www.sanfoundry.com/java-program-k-way-merge-algorithm/.
To implement your merge, you just have to write a simple wrapper that continually scans your directory, feeding files to the k-way merge until there's only one file left. The basic idea is:
while number of files > 1
    fileList = load all file names
    i = 0
    while i < fileList.length
        filesToMerge = copy files i through i+k-1 from fileList
        merge(filesToMerge, output file name)
        i += k
    end while
end while
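A minimal Java sketch of that wrapper, under the assumption that a kWayMerge(inputs, output) helper exists (e.g. the heap-based merge from the linked example); the helper name, the temporary-file naming, and the directory layout are all illustrative:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;
    import java.util.stream.*;

    public class RepeatedKWayMerge {
        // Stand-in for the k-way merge itself (e.g. the heap-based merge linked above):
        // merges the given sorted input files into one sorted output file.
        static void kWayMerge(List<Path> inputs, Path output) throws IOException {
            // ...
        }

        // Repeatedly merge groups of at most k files until a single sorted file remains.
        static Path mergeAll(Path dir, int k) throws IOException {
            int pass = 0;
            while (true) {
                List<Path> files;
                try (Stream<Path> s = Files.list(dir)) {
                    files = s.sorted().collect(Collectors.toList());   // load all file names
                }
                if (files.size() == 1) return files.get(0);            // done: one file left
                for (int i = 0; i < files.size(); i += k) {
                    List<Path> group = files.subList(i, Math.min(i + k, files.size()));
                    Path out = dir.resolve("merged_" + pass + "_" + i + ".tmp");
                    kWayMerge(group, out);                             // merge files i .. i+k-1
                    for (Path p : group) Files.delete(p);              // the output replaces them
                }
                pass++;
            }
        }
    }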
Complexity analysis
This is easier to think about if we assume that each file contains the same number of items.
You have to merge M files, each of which contains n items, but you can only merge k files at a time. So you have to do logk(M) passes. That is, if you have 1,024 files and you can only merge 16 at a time, then you'll make one pass that merges 16 files at a time, creating a total of 64 files. Then you'll make another pass that merges 16 files at a time, creating four files, and your final pass will merge those four files to create the output.
If you have k files, each of which contains n items, then the complexity of merging them is O(n*k log2 k).
So in the first pass you do M/k merges, each of which has complexity O(nk log k). That's O((M/k) * n * k * log2 k), or O(Mn log k).
Now, each of your files contains n*k items, and you do M/k/k merges of k files each. So the second pass complexity is O((M/k^2) * n * k^2 * log2 k). Simplified, that too works out to O(Mn log k). Eventually you are left with at most k files holding M*n items in total, and the final pass that merges them again costs O(Mn log k).
Note that in every pass you're working with M*n items. So each pass you do is O(Mn log k). And you're doing log_k(M) passes. So the total complexity is O(log_k(M) * (Mn log k)), or
O(Mn log M), since log_k(M) * log k = log M.
The assumption that every file contains the same number of items doesn't affect the asymptotic analysis because, as I've shown, every pass manipulates the same number of items: M*n.
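For reference, the per-pass cost written out (in LaTeX; pass i starts from M/k^(i-1) files of n*k^(i-1) items each):

    \frac{M}{k^{i}} \cdot O\!\left( n k^{i-1} \cdot k \cdot \log_2 k \right) = O(M n \log_2 k),
    \qquad \text{total} = \log_k(M) \cdot O(M n \log_2 k) = O(M n \log_2 M).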

These are all my thoughts.
I would do it iteratively. First I would run p = floor(N/K) merges to get p sorted files. Then I would keep doing this on the p + N%K remaining files, until p + N%K becomes less than K, and then finally merge those into the single sorted file.
Does that make sense?

Related

How to merge sorted lists into a single list in O(n * log(k))

(I got this as an interview question and would love some help with it.)
You have k sorted lists containing n different numbers in total.
Show how to create a single sorted list containing all the elements from the k lists in O(n * log(k)).
The idea is to use a min heap of size k.
Push all the k lists on the heap (one heap-entry per list), keyed by their minimum (i.e. first) value
Then repeatedly do this:
Extract the top list (having the minimal key) from the heap
Extract the minimum value from that list and push it on the result list
Push the shortened list back (if it is not empty) on the heap, now keyed by its new minimum value
Repeat until all values have been pushed on the result list.
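A short Java sketch of this heap approach, using an iterator per input list and a PriorityQueue keyed by each list's current front value (class and variable names here are just illustrative):

    import java.util.*;

    public class KWayListMerge {
        // Merge k sorted lists into one sorted list in O(n log k),
        // where n is the total number of elements.
        static List<Integer> merge(List<List<Integer>> lists) {
            // Heap entries: { current front value, index of the source list }.
            PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
            List<Iterator<Integer>> its = new ArrayList<>();

            for (int i = 0; i < lists.size(); i++) {          // initial step: O(k log k)
                Iterator<Integer> it = lists.get(i).iterator();
                its.add(it);
                if (it.hasNext()) heap.add(new int[] { it.next(), i });
            }

            List<Integer> result = new ArrayList<>();
            while (!heap.isEmpty()) {                         // n iterations, O(log k) each
                int[] top = heap.poll();                      // list with the minimal front value
                result.add(top[0]);                           // push its minimum onto the result
                Iterator<Integer> it = its.get(top[1]);
                if (it.hasNext())                             // push the shortened list back
                    heap.add(new int[] { it.next(), top[1] });
            }
            return result;
        }
    }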
The initial step will have a time complexity of O(k log k).
The 3 steps above will be repeated n times. At each iteration the cost of each is:
O(1)
O(1) if the extraction is implemented using a pointer/index (not shifting all values in the list)
O(log k) as the heap size is never greater than k
So the resulting complexity is O(n log k) (as k < n, the initial step is not significant).
As the question is stated, there's no need for a k-way merge (or a heap). A standard 2-way merge, used repeatedly to merge pairs of lists in any order until a single sorted list is produced, will also have time complexity O(n log(k)). If the question had instead asked how to merge the k lists in a single pass, then a k-way merge would be needed.
Consider the case for k == 32, and to simplify the math, assume all lists are merged in order so that each merge pass merges all n elements. After the first pass, there are k/2 lists, after the 2nd pass, k/4 lists, after log2(k) = 5 passes, all k (32) lists are merged into a single sorted list. Other than simplifying the math, the order in which lists are merged doesn't matter, the time complexity remains the same at O(n log2(k)).
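A sketch of that repeated pairwise merging in Java; merge2 is the standard two-way merge, and a queue is used so each round roughly halves the number of lists (names are illustrative):

    import java.util.*;

    public class PairwiseMerge {
        // Standard 2-way merge of two sorted lists.
        static List<Integer> merge2(List<Integer> a, List<Integer> b) {
            List<Integer> out = new ArrayList<>(a.size() + b.size());
            int i = 0, j = 0;
            while (i < a.size() && j < b.size())
                out.add(a.get(i) <= b.get(j) ? a.get(i++) : b.get(j++));
            while (i < a.size()) out.add(a.get(i++));
            while (j < b.size()) out.add(b.get(j++));
            return out;
        }

        // Merge k sorted lists by repeatedly merging pairs: O(n log k) in total.
        static List<Integer> mergeAll(List<List<Integer>> lists) {
            Deque<List<Integer>> queue = new ArrayDeque<>(lists);   // assumes at least one list
            while (queue.size() > 1)
                queue.addLast(merge2(queue.removeFirst(), queue.removeFirst()));
            return queue.removeFirst();
        }
    }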
Using a k-way merge is normally only advantageous when merging data on an external device, such as one or more disk drives (or, classically, tape drives), where the I/O time is great enough that the heap overhead can be ignored. For a RAM-based merge / merge sort, the total number of operations is about the same for a 2-way merge / merge sort and a k-way merge / merge sort. On a processor with 16 registers, most of them used as indexes or pointers, an optimized (no heap) 4-way merge (using 8 of the registers as indexes or pointers to the current and ending location of each run) can be a bit faster than a 2-way merge due to being more cache friendly.
When k=2, you merge the two lists by iteratively popping the front of whichever list currently has the smaller front. In a way, you create a virtual list that supports a pop_front operation, implemented as:
pop_front(a, b): return if front(a) <= front(b) then pop_front(a) else pop_front(b)
You can very well arrange a tree-like merging scheme where such virtual lists are merged in pairs:
pop_front(a, b, c, d): return if front(a, b) <= front(c, d) then pop_front(a, b) else pop_front(c, d)
Every pop will involve every level in the tree once, leading to a cost of O(log k) per pop.
That reasoning is flawed, however, because it doesn't account for the front operations: each one involves a comparison between two elements, and these cascade down the tree, finally requiring a total of k-1 comparisons per output element.
This can be circumvented by "memoizing" the front element, i.e. keeping it next to the two lists after a comparison has been made. Then, when an element is popped, this front element is updated.
This directly leads to the binary min-heap device, as suggested by @trincot.
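A small Java sketch of that memoization: each internal node of the merge tree caches the front of its subtree, so front() is O(1) and every pop does one comparison per level, i.e. O(log k) overall (the interface and class names are mine):

    import java.util.*;

    // A "virtual list" supporting front() and popFront(); front() == null means exhausted.
    interface Src {
        Integer front();
        Integer popFront();
    }

    // A leaf wraps one of the k input lists.
    class Leaf implements Src {
        private final Iterator<Integer> it;
        private Integer front;
        Leaf(List<Integer> list) { it = list.iterator(); front = it.hasNext() ? it.next() : null; }
        public Integer front() { return front; }
        public Integer popFront() {
            Integer v = front;
            front = it.hasNext() ? it.next() : null;
            return v;
        }
    }

    // An internal node merges two virtual lists and memoizes the smaller of their fronts.
    class Node implements Src {
        private final Src left, right;
        private Integer front;                        // cached minimum of the two child fronts
        Node(Src l, Src r) { left = l; right = r; refresh(); }
        private void refresh() {                      // a single comparison
            Integer a = left.front(), b = right.front();
            front = (a == null) ? b : (b == null) ? a : (a <= b ? a : b);
        }
        public Integer front() { return front; }      // O(1): no cascading comparisons
        public Integer popFront() {
            Integer a = left.front(), b = right.front();
            Integer v = (b == null || (a != null && a <= b)) ? left.popFront() : right.popFront();
            refresh();                                // one comparison per tree level => O(log k)
            return v;
        }
    }

Pairing the k leaves into a balanced binary tree of such nodes and repeatedly calling popFront() on the root reproduces the O(n log k) behaviour of the heap-based merge.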

Interval tree of an array with updates on the array

Given an array of size N, and an array of intervals, also of size N, each a contiguous segment of the first array, I need to handle Q queries that either update an element of the first array or ask for the sum of a segment of the second array (the sum of the elements in the i-th through j-th intervals).
Now, the first kind of query can be handled easily. I can build a segment tree from the array and use it to calculate the sum of an interval of the first array (i.e. an element of the second array). But how can I handle the second kind of query in O(log n)? In the worst case, the element I update will be in all the intervals of the second array.
I need an O(Q log N) or O(Q (log N)^2) solution.
Here is an O((Q + N) * sqrt(Q)) solution(it is based on a pretty standard idea of sqrt-decomposition):
1. Let's assume that the array is never updated. Then the problem becomes pretty easy: using prefix sums, it is possible to solve it with O(N) precomputation and O(1) per query (we need two prefix sum arrays here: one over the original array and one over the array of interval sums); a sketch of this static case is given in code after this answer.
2. Now let's divide our queries into blocks of size sqrt(Q). At the beginning of each block, we can do the same thing as in 1., taking into account only those updates that happened before the beginning of the block. It can be done in linear time (using prefix sums twice). The total number of such computations is Q / sqrt(Q) = sqrt(Q) (because that is the number of blocks we have). So far, this gives us O((N + Q) * sqrt(Q)) time in total.
3. When we get a query of type 2, all the updates outside the current block have already been taken into account. So there are at most sqrt(Q) updates that could still affect the answer. Let's process them almost naively: iterate over all updates within the current block that happened before this query and adjust the answer. To do this, we need to know how many times a given position of the array appears in the intervals from i to j. This part can be solved offline with a sweep line algorithm using O(Q * sqrt(N + Q)) time and space (an additional log factor does not appear because radix sort can be used).
So we get O((N + Q) * sqrt(Q)) time and space in the worst case in total. It is worse than O(Q * log N), of course, but should work fine for about 10^5 queries and array elements.
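A sketch of step 1 (the fully static case) in Java, where interval t covers positions l[t]..r[t] of the array, inclusive, and a query asks for the sum of intervals i..j (the names and the 0-based indexing are mine):

    public class StaticIntervalSums {
        // Precompute, in O(N), prefix sums over the array and over the interval sums,
        // so that the sum of intervals i..j can be answered in O(1).
        static long[] build(long[] a, int[] l, int[] r) {
            long[] prefA = new long[a.length + 1];            // prefA[x] = a[0] + ... + a[x-1]
            for (int x = 0; x < a.length; x++) prefA[x + 1] = prefA[x] + a[x];

            long[] prefIntervals = new long[l.length + 1];    // prefix sums of the interval sums
            for (int t = 0; t < l.length; t++) {
                long intervalSum = prefA[r[t] + 1] - prefA[l[t]];
                prefIntervals[t + 1] = prefIntervals[t] + intervalSum;
            }
            return prefIntervals;
        }

        // Sum of the interval sums for intervals i..j (0-based, inclusive).
        static long query(long[] prefIntervals, int i, int j) {
            return prefIntervals[j + 1] - prefIntervals[i];
        }
    }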

Doubly Linked List with some Operations

For a doubly linked list Q with some number of elements, where we have pointers to the first and last elements, we define two operations.
Delete(k): delete the first k elements from Q.
Append(c): check the last element of Q; if its value is bigger than c, delete it and repeat until the last element is lower than or equal to c (or Q is empty), then insert c as the last element of Q.
If we perform a sequence of these two operations, in arbitrary order, n times on an initially empty list Q, the sum of the costs of these operations is close to 2n. Why does my instructor arrive at 2n? Any hint or idea is appreciated.
When we "repeat Delete and Append in arbitrary order for n times on empty list Q", Append is called n times; hence exactly n list element insertions are performed.
Since the list is initially empty, it never contains more than n elements; hence at most n list element deletions are performed in the combination of Delete and Append.
Hence the total number of loop iterations in Delete and Append combined (including the reads in Append) is no more than 2n.
So all in all, no section of the program is executed more than 2n times (counting separately code that may be common to list element insertion, list element deletion, and list element access).
The cost is minimal when k is always 0 and c is non-decreasing (including always 0): we have n list element insertions, n list element reads (one returning empty), n emptiness tests, n-1 element comparisons, and no deletions. The cost thus varies significantly with the parameters.
Note: "the sum of the costs of these operations is close to 2n" is ill-defined, thus not even wrong. Worse, if list element deletion, by some bad luck (e.g. a code cache miss, debug code..), was much slower than the rest, the code's duration could vary by a large factor (much higher than 2) depending on the parameters. Hence execution time is NOT ALWAYS "about 2n" for any loose meaning of that.
Update: In a comment, we are told that list element insertion and deletion each have cost 1. There are n list element insertions, and between 0 and n list element deletions. Hence, if we neglect the other costs (which is reasonable if memory allocation cost dominates), the total cost is about n to about 2n depending on the parameters. Further, for many parameters (including k>=1 most of the time), there are nearly n list element deletions, hence the cost is about 2n if one insists on a best guess, such as in a multiple-choice question with (a) n+k (b) n (c) 2n (d) 3n as the only options.
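An illustrative Java sketch of the two operations, using ArrayDeque in place of a hand-rolled doubly linked list and charging one unit of cost per insertion and per deletion; since each of the n appended elements is inserted once and deleted at most once, the counter never exceeds about 2n:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class DeleteAppendCost {
        private final Deque<Integer> q = new ArrayDeque<>(); // stands in for the doubly linked list Q
        long cost = 0;                                       // one unit per insertion or deletion

        // Delete(k): remove the first k elements (or fewer if Q is shorter).
        void delete(int k) {
            for (int i = 0; i < k && !q.isEmpty(); i++) {
                q.pollFirst();
                cost++;
            }
        }

        // Append(c): pop elements bigger than c off the tail, then insert c at the tail.
        void append(int c) {
            while (!q.isEmpty() && q.peekLast() > c) {
                q.pollLast();
                cost++;
            }
            q.addLast(c);
            cost++;
        }
    }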
if we repeat these operations in arbitrary order for n times on an empty
list Q, the sum of all costs of these operations is close to 2n
It will actually be O(n) since we know the list Q is empty.
DoStuff(list Q, int n):
    for(int i = 0; i < n; i++)
        Q.Delete(k)    // O(k)
        Q.Append(c)    // O(sizeof(Q))
        // or
        // Q.Append(c) // O(sizeof(Q))
        // Q.Delete(k) // O(k)
Where n is the number of iterations.
Now say the list isn't empty; then we would have O(n*(sizeof(Q)+k)). The explanation is below:
The worst-case scenario for Delete(k) would be to delete the first k elements from Q where k is the size of Q, so we would delete every element. Still, it is more accurate to say O(k), because you only ever delete the first k elements.
The worst-case scenario for Append(c) would be that all elements inside Q are greater than the value c. This would start from the tail node and delete all nodes from Q.
In either order
Delete(k) //O(k)
Append(c) //O(sizeof(Q))
Or
Append(c) //O(sizeof(Q))
Delete(k) //O(k)
The worst case for just these two commands is O(sizeof(Q)+k). Now, since we have to do this for n iterations, we finally get O(n*(sizeof(Q)+k)).
As for what your professor said, the only reason I can imagine they said 2n is that there are 2 functions being called n times; therefore 2n.

Removing items from a list - algorithm time complexity

The problem consists of two sorted lists, with no duplicates, of sizes n and m. The first list contains strings that should be deleted from the second list.
The simplest algorithm would have to do n*m operations (I believe the terminology for this is "quadratic time"?).
An improved solution would take advantage of the fact that both lists are sorted and, in future comparisons, skip strings at indexes lower than the last deleted index.
I wonder what time complexity would that be?
Are there any solutions for this problem with better time complexity?
You should look into merge sort. The idea below is the basic reason it works efficiently.
The idea is to scan the two lists together, which takes O(n+m) time:
Make a pointer x for the first list, say A, and another pointer y for the second list, say B. Set x=0 and y=0. While x < n and y < m: if A[x] < B[y], add A[x] to the new merged list and increment x; otherwise add B[y] to the new list and increment y. Once you hit x=n or y=m, append the remaining elements from B or A, respectively.
I believe the complexity would be O(n+m), because every item in each of the lists would be visited exactly once.
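A sketch of that scan adapted from merging to the actual deletion problem (keep every string of the second list that does not appear in the first), still O(n+m); names are illustrative:

    import java.util.ArrayList;
    import java.util.List;

    public class SortedListDifference {
        // a: sorted strings to delete (size n); b: sorted strings to delete them from (size m).
        static List<String> removeAll(List<String> a, List<String> b) {
            List<String> result = new ArrayList<>();
            int x = 0;                                                    // pointer into a
            for (String s : b) {                                          // single pass over b
                while (x < a.size() && a.get(x).compareTo(s) < 0) x++;    // skip entries of a below s
                if (x < a.size() && a.get(x).equals(s)) continue;         // s occurs in a: drop it
                result.add(s);                                            // otherwise keep it
            }
            return result;
        }
    }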
A counting/bucket sort algorithm would work where each string in the second list is a bucket.
You go through the second list (takes m time) and create your buckets. You then go through your first list (takes n time) and increment the number of occurrences. You then have to go through each bucket again (takes m time) and return only the strings that occur once. A trie or a HashMap would work well for storing the buckets. This should be O(n+m+m). If you use a HashSet, then in the second pass, instead of incrementing a counter, you remove the string from the set. It should be O(n+m+(m-n)).
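The HashSet variant mentioned above, as a sketch: load the second list into a set, then remove every string of the first list from it; what is left is the answer (the sorted order is lost, which may or may not matter):

    import java.util.*;

    public class HashSetDifference {
        // first: strings to delete (size n); second: strings to delete them from (size m).
        static Set<String> removeAll(List<String> first, List<String> second) {
            Set<String> remaining = new HashSet<>(second);   // one pass over the second list: O(m)
            for (String s : first) remaining.remove(s);      // one pass over the first list: O(n)
            return remaining;                                // expected O(n + m) overall, unordered
        }
    }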
Might it be O(m + log(n)) if binary search is used?

Is there a way to skip empty buckets during bucket sort?

Counting sort is kind of a bucket sort. Let's assume we're using it like this:
Let A be the array to sort
Let k be the max element
Let bucket[] be an array of buckets
Let each bucket be a linked list (with a start and end pointer)
Then in pseudocode, counting sort looks like this:
Counting-Sort (A[], bucket[], k)
1.  Init bucket[]
2.  for i -> 1 to n
3.      add A[i] to bucket[A[i].key].end
4.  for i -> 1 to k
5.      concatenate bucket[i].start to bucket[0].end
6.      bucket[0].end = bucket[i].end
7.  copy bucket[0] to A
Time Complexity by lines:
1) I know there is a way (not simple, but a way) to init an array in O(1)
2,3) O(n)
4-6) O(k)
7) O(n)
This gives us a net runtime of O(k+n), which for k >> n is much worse than O(n), which is bad for us. But what if we could change lines 4-6 to somehow skip the empty buckets? That way we would end up with O(n) no matter what k is.
Does anyone know how to do this? Or is it impossible?
One option would be to hold an auxiliary BST containing the buckets that are actually being used. Whenever you add something to a bucket, if it's the first entry to be placed there, you also add that bucket's value to the BST.
When you want to then go concatenate everything, you could then just iterate over the BST in sorted order, concatenating just the buckets you find.
If there are z buckets that actually get used, this takes O(n + z log z). If the number of buckets is large compared to the number actually used, this could be much faster.
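A sketch of that idea in Java, using a TreeSet as the auxiliary BST of used bucket keys and plain integer keys for simplicity (the structure of the real buckets is up to you):

    import java.util.*;

    public class SparseCountingSort {
        // Sort a[] in place, touching only the z bucket keys that are actually used.
        static void sort(int[] a) {
            Map<Integer, List<Integer>> bucket = new HashMap<>(); // bucket contents, keyed by value
            TreeSet<Integer> used = new TreeSet<>();              // BST of bucket keys in use

            for (int v : a) {                                     // O(n) distribution
                bucket.computeIfAbsent(v, key -> new ArrayList<>()).add(v);
                used.add(v);                                      // O(log z) per element
            }

            int i = 0;
            for (int key : used)                                  // used buckets, in sorted order
                for (int v : bucket.get(key))
                    a[i++] = v;                                   // concatenation: O(n + z) total
        }
    }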
More generally - if you have a way of sorting the z different buckets being used in O(f(z)) time, you can do a bucket sort in O(n + f(z)) time. Maintain a second array of the buckets you actually use, adding a bucket to the array when it's used for the first time. Before iterating over the buckets, sort the indices of the buckets in use in O(f(z)) time, then iterate across that array to determine which buckets to visit. For example, if you used y-fast tries, you could do the sort in O(n + z log log z).
Hope this helps!
You can turn the bucket array into an associative array, which yields O(n log n), and I don't believe you can do better than that for sorting (on average).
O(n) is impossible in the general case.
