external sorting: multiway merge - algorithm

In a multiway merge, the task is to find the smallest element out of k elements.
Solution: priority queues.
Idea: Take the smallest element from each of the first k runs and store them in main memory in a heap.
Then repeatedly output the smallest element from the heap. The removed element is replaced with the next element from the run it came from.
When finished with the first set of runs, do the same with the next set of runs.
Assume my main memory has size M, which is less than k. How can we sort the elements? In other words, how does the multiway merge algorithm work if the memory size M is less than k?
For example if my M = 3 and i have following
Tape1: 8 9 10
Tape2: 11 12 13
Tape3: 14 15 16
Tape4: 4 5 6
My question is how the multiway merge will work here: we read 8, 11, 14 and build the priority queue, place 8 on the output tape, and then advance Tape1. I don't see when Tape4 is ever read, or how its elements could be compared against what has already been written to the output tape.
Thanks!

It won't work. You must choose a k small enough for available memory.
In this case, you could do a 3-way merge of the first 3 tapes, then a 2-way merge between the result of that and the one remaining tape. Or you could do 3 2-way merges (two pairs of tapes, then combine the results), which is simpler to implement but does more tape access.
In theory you could abandon the priority queue. Then you wouldn't need to store k elements in memory, but you would frequently need to look at the next element on all k tapes in order to find the smallest.
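For completeness, here is a rough Python sketch of the two-phase plan above, with in-memory lists standing in for tapes (all names are my own): a heap-based k-way merge applied first to three tapes, then to the result plus the fourth.

import heapq

# A minimal sketch, not production code: the heap never holds more than k entries.
def kway_merge(runs):
    heap = []
    for i, run in enumerate(runs):
        it = iter(run)
        first = next(it, None)
        if first is not None:
            heap.append((first, i, it))
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, it = heapq.heappop(heap)
        out.append(value)              # "write" to the output tape
        nxt = next(it, None)           # replace with the next element
        if nxt is not None:            # from the run it came from
            heapq.heappush(heap, (nxt, i, it))
    return out

# M = 3 < k = 4: merge three tapes first, then 2-way merge with Tape4.
t1, t2, t3, t4 = [8, 9, 10], [11, 12, 13], [14, 15, 16], [4, 5, 6]
partial = kway_merge([t1, t2, t3])     # 3-way merge fits in memory
print(kway_merge([partial, t4]))       # [4, 5, 6, 8, 9, 10, 11, ..., 16]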

Related

algorithm interview: mth frequent element in n sorted arrays

There is an algorithm interview question:
We have n sorted arrays; how do we find the m-th most frequent element in the aggregated array of the n arrays? Moreover, how can we save space, even at the cost of some time complexity?
What I can think of is enumerating all the elements of the n arrays and using a hashmap to record their frequencies, then sorting the hashmap by value (frequency). But then there is no difference from the single-array case.
Walk over all arrays in parallel using n pointers
1 4 7 12 34
2 6 9 12 25
The walk would look like this
1 1 4 7 7 12 12 34
* 2 2 2 9 12 25 34
You do need a hash map in order to count the number of occurrences of elements in the cut. E.g. at the second step in the example, your cut contains 1 and 2.
Also, you need two min-heaps: one over the cut, to be able to choose which array to advance along, and another to store the m most frequent elements.
The complexity would be expected O(#elements * (log(n) + log(m))). The space requirement is O(n + m).
But if you really need to save space you can consider all these n sorted arrays as one big unsorted array, sort it with something like heapsort, and scan for the longest runs of duplicates. This would require O(#elements * log(#elements)) time but only O(1) space.
You do an n-way merge, but instead of writing out the merged array, you just count the length of each run of duplicate values and remember the m longest runs in a min-heap.
This takes O(total_length * (log n + log m)) time, and O(n) space.
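A rough Python sketch of that counting merge (the names are mine; heapq.merge provides the n-way merge, and a min-heap of size m keeps the most frequent values):

import heapq

def mth_most_frequent(arrays, m):
    merged = heapq.merge(*arrays)       # n-way merge: O(log n) per element
    top = []                            # min-heap of (count, value), size <= m
    prev, count = object(), 0
    for x in merged:
        if x == prev:
            count += 1
            continue
        if count:
            heapq.heappush(top, (count, prev))
            if len(top) > m:
                heapq.heappop(top)      # drop the least frequent of the m+1
        prev, count = x, 1
    if count:                           # flush the final run
        heapq.heappush(top, (count, prev))
        if len(top) > m:
            heapq.heappop(top)
    # the heap root is now the m-th most frequent element
    return top[0][1] if len(top) == m else None

print(mth_most_frequent([[1, 4, 7, 12, 34], [2, 6, 9, 12, 25]], 2))
# prints 34 here (everything but 12 has count 1; ties broken arbitrarily)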
It's a combination of common SO questions. Search for "merge k sorted lists" and "kth largest".

How to merge sorted lists into a single list in O(n * log(k))

(I got this as an interview question and would love some help with it.)
You have k sorted lists containing n different numbers in total.
Show how to create a single sorted list containing all the elements from the k lists in O(n * log(k)).
The idea is to use a min heap of size k.
Push all the k lists on the heap (one heap-entry per list), keyed by their minimum (i.e. first) value
Then repeatedly do this:
Extract the top list (having the minimal key) from the heap
Extract the minimum value from that list and push it on the result list
Push the shortened list back (if it is not empty) on the heap, now keyed by its new minimum value
Repeat until all values have been pushed on the result list.
The initial step will have a time complexity of O(k log k).
The 3 steps above will be repeated n times. At each iteration the cost of each is:
O(1)
O(1) if the extraction is implemented using a pointer/index (not shifting all values in the list)
O(log k) as the heap size is never greater than k
So the resulting complexity is O(n log k) (as k < n, the initial step is not significant).
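A short Python sketch of this procedure (my own naming), with each heap entry keyed by the current minimum of its list:

import heapq

def merge_k_sorted(lists):
    # heap entries are (current value, list id, index into that list)
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)                  # initial step: O(k) here with heapify
    result = []
    while heap:
        value, i, j = heapq.heappop(heap)    # extract the minimal key: O(log k)
        result.append(value)                 # push onto the result list: O(1)
        if j + 1 < len(lists[i]):            # push the shortened list back,
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))  # keyed by its new minimum
    return result

print(merge_k_sorted([[1, 4, 7], [2, 6, 9, 12], [3, 5]]))
# [1, 2, 3, 4, 5, 6, 7, 9, 12]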
As the question is stated, there's no need for a k-way merge (or a heap). A standard 2-way merge, used repeatedly to merge pairs of lists in any order until a single sorted list is produced, also has time complexity O(n log k). If the question had instead asked how to merge k lists in a single pass, then a k-way merge would be needed.
Consider the case for k == 32, and to simplify the math, assume all lists are merged in order so that each merge pass merges all n elements. After the first pass, there are k/2 lists, after the 2nd pass, k/4 lists, after log2(k) = 5 passes, all k (32) lists are merged into a single sorted list. Other than simplifying the math, the order in which lists are merged doesn't matter, the time complexity remains the same at O(n log2(k)).
Using a k-way merge is normally only advantageous when merging data using an external device, such as one or more disk drives (or classic usage tape drives), where the I/O time is great enough that heap overhead can be ignored. For a ram based merge / merge sort, the total number of operations is about the same for a 2-way merge / merge sort or a k-way merge / merge sort. On a processor with 16 registers, most of them used as indexes or pointers, an optimized (no heap) 4-way merge (using 8 of the registers as indexes or pointers to current and ending location of each run) can be a bit faster than a 2-way merge due to being more cache friendly.
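A sketch of these repeated 2-way merge passes in Python (my own code): each pass halves the number of lists, so there are about log2(k) passes over n elements in total.

def merge2(a, b):                    # standard 2-way merge, O(len(a) + len(b))
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

def merge_pairs(lists):
    while len(lists) > 1:            # each pass halves the number of lists
        lists = [merge2(lists[i], lists[i + 1]) if i + 1 < len(lists) else lists[i]
                 for i in range(0, len(lists), 2)]
    return lists[0] if lists else []

print(merge_pairs([[1, 4], [2, 6], [3, 5], [0, 7]]))   # [0, 1, 2, ..., 7]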
When k = 2, you merge the two lists by iteratively popping the front of whichever list currently has the smaller front element. In a way, you create a virtual list that supports a pop_front operation, implemented as:
pop_front(a, b): return if front(a) <= front(b) then pop_front(a) else pop_front(b)
You can very well arrange a tree-like merging scheme where such virtual lists are merged in pairs:
pop_front(a, b, c, d): return if front(a, b) <= front(c, d) then pop_front(a, b) else pop_front(c, d)
Every pop will involve every level in the tree once, leading to a cost O(Log k) per pop.
The above reasoning is wrong because it doesn't account for the front operations, which involve a comparison between two elements; these cascade and finally require a total of k-1 comparisons per output element.
This can be circumvented by "memoizing" the front element, i.e. keeping it next to the two lists after a comparison has been made. Then, when an element is popped, this front element is updated.
This directly leads to the binary min-heap device, as suggested by @trincot.
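A toy Python sketch of this pairwise scheme with memoized fronts (my own construction, not taken verbatim from the answer): each internal node caches the smaller of its children's fronts, so a pop only redoes comparisons along one root-to-leaf path, i.e. O(log k) per pop.

from collections import deque

class MergeNode:
    def __init__(self, left=None, right=None, items=None):
        self.left, self.right = left, right
        self.items = deque(items) if items is not None else None
        self.front = None
        self.refresh()

    def refresh(self):
        if self.items is not None:                  # leaf: a real list
            self.front = self.items[0] if self.items else None
        else:                                       # internal: memoized front
            a, b = self.left.front, self.right.front
            self.front = a if b is None or (a is not None and a <= b) else b

    def pop_front(self):
        if self.items is not None:
            value = self.items.popleft()
        else:                                       # descend into the child
            a, b = self.left.front, self.right.front
            child = self.left if b is None or (a is not None and a <= b) else self.right
            value = child.pop_front()
        self.refresh()                              # update the cached front
        return value

tree = MergeNode(MergeNode(items=[1, 4, 7]),
                 MergeNode(MergeNode(items=[2, 6]), MergeNode(items=[3, 5])))
out = []
while tree.front is not None:
    out.append(tree.pop_front())
print(out)   # [1, 2, 3, 4, 5, 6, 7]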

Show heapsort repeats comparisons

How would one prove that heapsort repeats comparisons that it has made before (i.e. that it performs a comparison that has been done previously)?
Thanks
The same two elements may be compared in the build-heap step (heapify) and again in the re-order step of heapsort; see the Wikipedia description of heapsort.
For example, sort by max-heap:
origin array: 4 6 10 7 3 8 5
Heapify into a new heap array by sift-up.
The comparisons: 4<6, 6<10, 4<7, 6<8
(10) (7 8) (4 3 6 5) // each level is grouped by parentheses
Re-order step:
swap the first element with the last, putting the largest at the end
reduce the heap size by 1
sift down the new root
The comparisons: 5<8, 6<7, 3<6, 3<4, 3<5, 3<4
The comparisons made during heapify depend on the original order of the elements, and after heapify the array is still not sorted, so later steps can make comparisons again (notice that 3<4 appears twice in the list above).
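A quick Python experiment supporting this (my own code; it builds the heap with sift-down rather than the sift-up used above, but the effect is the same): log every comparison and look for pairs that occur more than once.

def heapsort_logged(a):
    log = []

    def less(i, j):                           # compare a[i] < a[j] and log the pair
        log.append(tuple(sorted((a[i], a[j]))))
        return a[i] < a[j]

    def sift_down(root, end):
        while 2 * root + 1 < end:
            child = 2 * root + 1
            if child + 1 < end and less(child, child + 1):
                child += 1                    # pick the larger child
            if less(root, child):
                a[root], a[child] = a[child], a[root]
                root = child
            else:
                return

    n = len(a)
    for start in range(n // 2 - 1, -1, -1):   # build the max-heap
        sift_down(start, n)
    for end in range(n - 1, 0, -1):           # re-order step
        a[0], a[end] = a[end], a[0]
        sift_down(0, end)
    return log

log = heapsort_logged([4, 6, 10, 7, 3, 8, 5])
print({p for p in log if log.count(p) > 1})   # non-empty: repeated comparisons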

Data structure insert, delete, mode

Here is the interview problem: design a data structure for a range of integers {1, ..., M} (numbers can be repeated) supporting insert(x), delete(x), and mode(), which returns the most frequent number.
The interviewer said that all operations can be done in O(1) with O(M) preprocessing. He also accepted insert(x) and delete(x) in O(log n) with mode in O(1), again with O(M) preprocessing.
But I can only do insert(x) and delete(x) in O(n) with mode in O(1). How can I achieve O(log n) or even O(1) insert(x) and delete(x), with mode in O(1) and O(M) preprocessing?
When you hear O(log X) operations, the first structures that come to mind should be a binary search tree and a heap. For reference (since I'm focusing on a heap below):
A heap is a specialized tree-based data structure that satisfies the heap property: If A is a parent node of B then the key of node A is ordered with respect to the key of node B with the same ordering applying across the heap. ... The keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node (this kind of heap is called max heap) ....
A binary search tree doesn't allow construction (from unsorted data) in O(M), so let's see if we can make a heap work (you can create a heap in O(M)).
Clearly we want the most frequent number at the top, so this heap needs to use frequency as its ordering.
But this brings us to a problem - insert(x) and delete(x) will both require that we look through the entire heap to find the correct element.
Now you should be thinking "what if we had some sort of mapping from index to position in the tree?", and this is exactly what we're going to have. If all / most of the M elements exist, we could simply have an array, with each index i's element being a pointer to the node in the heap. If implemented correctly, this will allow us to look up the heap node in O(1), which we could then modify appropriately, and move, taking O(log M) for both insert and delete.
If only a few of the M elements exist, replacing the array with a (hash-)map (of integer to heap node) might be a good idea.
Returning the mode will take O(1).
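A Python sketch of this O(log M) design (all names are mine): a max-heap ordered by frequency, plus a position map so an element's heap node can be found in O(1).

class ModeTracker:
    def __init__(self, M):
        self.heap = list(range(1, M + 1))   # heap of the values 1..M
        self.count = [0] * (M + 1)          # count[v] = frequency of v
        self.pos = {v: i for i, v in enumerate(self.heap)}  # value -> heap index

    def _swap(self, i, j):
        h = self.heap
        h[i], h[j] = h[j], h[i]
        self.pos[h[i]], self.pos[h[j]] = i, j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.count[self.heap[i]] > self.count[self.heap[parent]]:
                self._swap(i, parent)
                i = parent
            else:
                break

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            best = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.count[self.heap[c]] > self.count[self.heap[best]]:
                    best = c
            if best == i:
                return
            self._swap(i, best)
            i = best

    def insert(self, x):                    # O(log M)
        self.count[x] += 1
        self._sift_up(self.pos[x])

    def delete(self, x):                    # O(log M)
        if self.count[x] > 0:
            self.count[x] -= 1
            self._sift_down(self.pos[x])

    def mode(self):                         # O(1)
        return self.heap[0]

t = ModeTracker(7)
for x in (3, 5, 3, 2, 3, 5):
    t.insert(x)
print(t.mode())   # 3 (frequency 3)
t.delete(3); t.delete(3)
print(t.mode())   # 5 (frequency 2)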
O(1) for all operations is certainly quite a bit more difficult.
The following structure comes to mind:
 3   2
 ^   ^
 |   |
 5   7   4   1
12  14  15  18
To explain what's going on here - 12, 14, 15 and 18 correspond to the frequency, and the numbers above correspond to the elements with said frequency, so both 5 and 3 would have a frequency of 12, 7 and 2 would have a frequency of 14, etc.
This could be implemented as a double linked-list:
         /------------\                /------------\
(12) <-> 5 <-> 3 <-> (13) <-> (14) <-> 7 <-> 2 <-> (15) <-> 4 <-> (16) <-> (18) <-> 1
  ^-----------------/   ^----/   ^-----------------/   ^---------/   ^----/
You may notice that:
I filled in the missing 13 and 16 - these are necessary, otherwise we'll have to update all elements with the same frequency when doing an insert (in this example, you would've needed to update 5 to point to 13 when doing insert(3), because 13 wouldn't have existed yet, so it would've been pointing to 14).
I skipped 17 - this is just an optimization in terms of space usage - it makes this structure take O(M) space, as opposed to O(M + MaxFrequency). The exact condition for skipping a number is that no element has that frequency or the frequency one less than it.
There's some strange things going on above the linked-list. These simply mean that 5 points to 13 as well, and 7 points to 15 as well, i.e. each element also keeps a pointer to the next frequency.
There's some strange things going on below the linked-list. These simply mean that each frequency keeps a pointer to the frequency before it (this is more space efficient than each element keeping a pointer to both its own and the next frequency).
Similarly to the above solution, we'd keep a mapping (array or map) of integer to node in this structure.
To do an insert:
Look up the node via the mapping.
Remove the node.
Get the pointer to the next frequency, insert it after that node.
Set the next frequency pointer using the element after the insert position (either it is the next frequency, in which case we can just make the pointer point to that, otherwise we can make this next frequency pointer point to the same element as that element's next frequency pointer).
To do a remove:
Look up the node via the mapping.
Remove the node.
Get the pointer to the current frequency via the next frequency, insert it before that node.
Set the next frequency pointer to that node.
To get the mode:
Return the last node.
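A Python sketch of the same idea in bucket form (my own code, an equivalent formulation of the linked list above): a doubly linked list of frequency buckets, each holding the set of values at that frequency, plus a map from value to bucket. Every operation touches at most two adjacent buckets, so all of them are O(1).

class FreqNode:
    def __init__(self, freq):
        self.freq, self.values = freq, set()
        self.prev = self.next = None

class ModeStructure:
    def __init__(self):
        self.head = FreqNode(0)          # sentinel bucket for frequency 0
        self.tail = self.head            # bucket holding the highest frequency
        self.bucket = {}                 # value -> its FreqNode

    def _link_after(self, node, new):    # splice `new` in right after `node`
        new.prev, new.next = node, node.next
        if node.next:
            node.next.prev = new
        node.next = new
        if node is self.tail:
            self.tail = new

    def _unlink(self, node):             # remove an empty bucket
        node.prev.next = node.next
        if node.next:
            node.next.prev = node.prev
        if node is self.tail:
            self.tail = node.prev

    def insert(self, x):                 # O(1)
        cur = self.bucket.get(x, self.head)
        nxt = cur.next
        if nxt is None or nxt.freq != cur.freq + 1:
            nxt = FreqNode(cur.freq + 1) # fill in the "missing" frequency
            self._link_after(cur, nxt)
        nxt.values.add(x)
        self.bucket[x] = nxt
        if cur is not self.head:
            cur.values.discard(x)
            if not cur.values:
                self._unlink(cur)

    def delete(self, x):                 # O(1)
        cur = self.bucket.get(x)
        if cur is None:
            return
        if cur.freq == 1:
            del self.bucket[x]
        else:
            prv = cur.prev
            if prv.freq != cur.freq - 1:
                prv = FreqNode(cur.freq - 1)
                self._link_after(cur.prev, prv)
            prv.values.add(x)
            self.bucket[x] = prv
        cur.values.discard(x)
        if not cur.values:
            self._unlink(cur)

    def mode(self):                      # O(1): any value in the last bucket
        return next(iter(self.tail.values)) if self.tail is not self.head else None

s = ModeStructure()
for x in (3, 5, 3):
    s.insert(x)
print(s.mode())   # 3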
Since the range is fixed, for simplicity let's take an example: M = 7 (the range is 1 to 7). So we need at most 3 bits to represent each number.
0 - 000
1 - 001
2 - 010
3 - 011
4 - 100
5 - 101
6 - 110
7 - 111
Now create a binary tree where each node has two children (as in the Huffman coding algorithm). Each leaf will contain the frequency of one number (initially 0 for all), and the addresses of these leaf nodes are saved in an array indexed by the number (i.e. the address of the node for 1 is at index 1 in the array).
With pre-processing, we can execute insert, remove in O(1), mode in O(M) time.
insert(x) - go to location x in the array, get the address of the node, and increment the counter for that node.
delete(x) - as above, but decrement the counter if it is > 0.
mode - linear search in the array for the maximum frequency (counter value).
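The essence of this approach is just a counter per value; a minimal Python sketch (my own code, skipping the tree, which only changes constant factors):

class CounterArray:
    def __init__(self, M):
        self.count = [0] * (M + 1)     # index 0 unused; O(M) preprocessing

    def insert(self, x):               # O(1)
        self.count[x] += 1

    def delete(self, x):               # O(1)
        if self.count[x] > 0:
            self.count[x] -= 1

    def mode(self):                    # O(M): linear scan for the max counter
        return max(range(1, len(self.count)), key=self.count.__getitem__)

c = CounterArray(7)
for x in (3, 5, 3):
    c.insert(x)
print(c.mode())   # 3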

Sort numbers by sum algorithm

I have a language-agnostic question about an algorithm.
This comes from a (probably simple) programming challenge I read. The problem is, I'm too stupid to figure it out, and curious enough that it is bugging me.
The goal is to sort a list of integers into ascending order by swapping the positions of numbers in the list. Each time you swap two numbers, you add their sum to a running total. The challenge is to produce the sorted list with the smallest possible running total.
Examples:
3 2 1 - 4
1 8 9 7 6 - 41
8 4 5 3 2 7 - 34
Though you are free to just give the answer if you want, if you'd rather offer a "hint" in the right direction (if such a thing is possible), I would prefer that.
Only read the first two paragraphs if you just want a hint. There is an efficient solution to this (unless I made a mistake, of course). First sort the list. Now we can write the original list as a product of disjoint cycles.
For example, 5,3,4,2,1 has two cycles, (5,1) and (3,4,2). The second cycle can be thought of as starting at 3: 3 is in 2's spot, 4 is in 3's spot, and 2 is in 4's spot. The end goal is 1,2,3,4,5 or (1)(2)(3)(4)(5), five disjoint cycles.
If we switch two elements from different cycles, say 1 and 3 then we get: 5,1,4,2,3 and in cycle notation (1,5,3,4,2). The two cycles are joined into one cycle, this is the opposite of what we want to do.
If we switch two elements from the same cycle, say 3 and 4 then we get: 5,4,3,2,1 in cycle notation (5,1)(2,4)(3). The one cycle is split into two smaller cycles. This gets us closer to the goal of all cycles of length 1. Notice that any switch of two elements in the same cycle splits the cycle into two cycles.
If we can figure out the optimal algorithm for resolving one cycle, we can apply it to every cycle and get an optimal algorithm for the entire sort. One algorithm is to take the minimum element in the cycle and swap it with the element whose spot it occupies. So for (3,4,2) we would switch 2 with 4. This leaves us with a cycle of length 1 (the element just switched into its correct position) and a cycle one smaller than before. We can then apply the rule again. This algorithm swaps the smallest element cycle-length - 1 times and every other element once.
To transform a cycle of length n into cycles of length 1 takes n - 1 operations. Each element must be operated on at least once (think about each element to be sorted, it has to be moved to its correct position). The algorithm I proposed operates on each element once, which all algorithms must do, then every other operation was done on the minimal element. No algorithm can do better.
This algorithm takes O(n log n) to sort then O(n) to mess with cycles. Solving one cycle takes O(cycle length), the total length of all cycles is n so cost of the cycle operations is O(n). The final run time is O(n log n).
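A small Python sketch of the cycle decomposition used above (my own code, assuming distinct values): follow each value to the slot it belongs in until the walk returns home.

def cycles(a):
    pos = {v: i for i, v in enumerate(sorted(a))}  # target slot of each value
    seen, out = [False] * len(a), []
    for i in range(len(a)):
        if seen[i]:
            continue
        cycle, j = [], i
        while not seen[j]:
            seen[j] = True
            cycle.append(a[j])
            j = pos[a[j]]            # slot where a[j] belongs
        if len(cycle) > 1:
            out.append(cycle)
    return out

print(cycles([5, 3, 4, 2, 1]))   # [[5, 1], [3, 4, 2]]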
I'm assuming memory is free and you can simulate the sort before performing it on the real objects.
One approach (that is likely not the fastest) is to maintain a priority queue. Each node in the queue is keyed by the swap cost to get there and it contains the current item ordering and the sequence of steps to achieve that ordering. For example, initially it would contain a 0-cost node with the original data ordering and no steps.
Run a loop that dequeues the lowest-cost queue item, and enqueues all possible single-swap steps starting at that point. Keep running the loop until the head of the queue has a sorted list.
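A brute-force Python sketch of this search (my own code): essentially Dijkstra over permutations, where an edge is one swap and its cost is the sum of the swapped pair. Only feasible for small lists.

import heapq

def min_sort_cost_bruteforce(a):
    start, goal = tuple(a), tuple(sorted(a))
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, state = heapq.heappop(heap)
        if state == goal:
            return cost                      # first pop of the goal is optimal
        if cost > best[state]:
            continue                         # stale queue entry
        for i in range(len(state)):
            for j in range(i + 1, len(state)):
                nxt = list(state)
                nxt[i], nxt[j] = nxt[j], nxt[i]
                nxt = tuple(nxt)
                ncost = cost + state[i] + state[j]
                if ncost < best.get(nxt, float("inf")):
                    best[nxt] = ncost
                    heapq.heappush(heap, (ncost, nxt))

print(min_sort_cost_bruteforce([1, 8, 9, 7, 6]))   # 41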
I did a few attempts at solving one of the examples by hand:
1 8 9 7 6
6 8 9 7 1 (+6+1=7)
6 8 1 7 9 (7+1+9=17)
6 8 7 1 9 (17+1+7=25)
6 1 7 8 9 (25+1+8=34)
1 6 7 8 9 (34+1+6=41)
Since you needed to displace the 1, it seems that you may have to do an exhaustive search to solve the problem - the details of which were already posted by another user. Note that you will encounter problems with this method if the dataset is large.
If the problem allows for "close" answers, you can simply make a greedy algorithm that puts the largest item into position - either doing so directly, or by swapping the smallest element into that slot first.
Since comparisons and traversals apparently come for free, you can pre-calculate the "distance" a number must travel (and effectively the final sort order). The puzzle is the swap algorithm.
Minimizing overall swaps is obviously important.
Minimizing swaps of larger numbers is also important.
I'm pretty sure an optimal swap process cannot be guaranteed by evaluating each ordering in a stateless fashion, although you might frequently come close (but coming close is not the challenge).
I think there is no trivial solution to this problem, and my approach is likely no better than the priority queue approach.
Find the smallest number, N.
Any pairs of numbers that occupy each others' desired locations should be swapped, except for N.
Assemble (by brute force) a collection of every set of numbers that can be mutually swapped into their desired locations, such that the cost of sorting the set amongst itself is less than the cost of swapping every element of the set with N.
These sets will comprise a number of cycles. Swap within those cycles in such a way that the smallest number is swapped twice.
Swap all remaining numbers, which comprise a cycle including N, using N as a placeholder (a sketch combining these ideas follows below).
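A Python sketch combining the cycle idea from the earlier answer with the placeholder idea in the steps above (my own code, assuming distinct values): for each cycle, pay either sum + (len - 2) * cycle_min by rotating with the cycle's own minimum, or sum + cycle_min + (len + 1) * N by borrowing the smallest number N, whichever is cheaper.

def min_sort_cost(a):
    pos = {v: i for i, v in enumerate(sorted(a))}   # target slot of each value
    gmin = min(a)
    seen = [False] * len(a)
    total = 0
    for i in range(len(a)):
        if seen[i] or pos[a[i]] == i:
            seen[i] = True
            continue
        cycle, j = [], i
        while not seen[j]:                          # collect one cycle's values
            seen[j] = True
            cycle.append(a[j])
            j = pos[a[j]]
        s, m, L = sum(cycle), min(cycle), len(cycle)
        total += min(s + (L - 2) * m,               # rotate with the cycle's min
                     s + m + (L + 1) * gmin)        # or borrow the global min
    return total

for lst in ([3, 2, 1], [1, 8, 9, 7, 6], [8, 4, 5, 3, 2, 7]):
    print(min_sort_cost(lst))   # 4, 41, 34 - matching the examples above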
As a hint, this reeks of dynamic programming; that might not be precise enough a hint to help, but I'd rather start with too little!
You are charged for the swaps (by the sum of the values swapped), not for comparisons. Nor did you mention being charged for keeping other records.
