Class Z behaves like which well-known data structure? - algorithm

I am working with this question, which I am unsure about:
Class Z behaves like which well-known data structure?
where the possible answers are:
A. (LIFO) Stack.
B. (FIFO) Queue.
C. Priority queue.
D. Union–Find.
By looking at the code, I think the answer is D (Union–Find). If we look at the methods query, last and first, the class appears to use a Union-Find-style structure to determine whether two elements belong to the same set.
public class Z
{
    int[] next, prev;

    Z(int N) {
        prev = new int[N];
        next = new int[N];
        for (int i = 0; i < N; ++i) {
            // put element i in a list of its own
            next[i] = i;
            prev[i] = i;
        }
    }

    int first(int i) {
        // return first element of list containing i
        while (i != prev[i]) i = prev[i];
        return i;
    }

    int last(int i) {
        // return last element of list containing i
        while (i != next[i]) i = next[i];
        return i;
    }

    void update(int i, int j) {
        int f = first(j);
        int l = last(i);
        next[l] = f;
        prev[f] = l;
    }

    boolean query(int i, int j) {
        return last(i) == last(j);
    }
}

Yes, you're right -- it can be used as a Union-Find data structure. If z is an instance of this class, then Union can be written as if (!z.query(i, j)) z.update(i, j), and Find can be written as z.last(i).
Details
Z keeps the integers 0, 1, ..., N-1 in a set of disjoint lists, with each integer in its own list initially. update(i, j) appends the list containing j to the list containing i. first(i) and last(i) return the first and last element of the list containing i. query(i, j) reports whether i and j are in the same list.
The implementation requires that update(i, j) only be called if i and j are not already in the same list (otherwise the lists become loops, and subsequent calls to any of the methods may not terminate), and its efficiency is poor, as the usual disjoint-set optimizations (path compression, union by rank) aren't made.
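For illustration, here is a minimal wrapper (the names are mine) that exposes the usual operations on top of Z, following the mapping above:
// Hypothetical wrapper giving Z the usual Union-Find interface.
class UnionFind {
    private final Z z;

    UnionFind(int n) { z = new Z(n); }

    // Find: the last element of i's list serves as the set representative.
    int find(int i) { return z.last(i); }

    // Union: only call update when i and j are in different lists (see the caveat above).
    void union(int i, int j) {
        if (!z.query(i, j)) z.update(i, j);
    }
}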

Related

Best algorithm to pair items of two queues

I have to find the best algorithm to define a pairing between the items from two lists, as in the figure. A pair is valid only if the number of the node in list A is lower than the number of the node in list B and no links cross. The quality of the matching algorithm is determined by the total number of links.
I first tried a very simple algorithm: take a node in list A and then look for the first node in list B that is higher than it. The second figure shows a test case where this algorithm is not optimal.
Simple back-tracking can work (it may not be the most efficient approach, but it will certainly work).
For each legal pairing A[i], B[j], there are two choices:
take it, and make it illegal to try to pair any A[x], B[y] with x>i and y<j
not take it, and look at other possible pairs
By incrementally adding legal pairs to the current set of pairs, you will eventually exhaust all legal pairings down a path. The number of valid pairs in a path is what you seek to maximize; this algorithm looks at all possible answers and is guaranteed to find the best one.
Pseudocode:
function search(currentPairs):
    bestPairing = currentPairs
    for each currently legal pair:
        nextPairing = search(copyOf(currentPairs) + this pair)
        if length of nextPairing > length of bestPairing:
            bestPairing = nextPairing
    return bestPairing
Initially, you will pass an empty currentPairs. Searching for legal pairs is the tricky part. You can use 3 nested loops that look at all A[x], B[y] and, whenever A[x] < B[y], check against all currentPairs to see if there is a crossing line (the cost of this is roughly O(n^3)); or you can use a boolean matrix of valid pairings, which you update at each level (less computation time, down to O(n^2), but more expensive in terms of memory).
Here is a Java implementation.
For convenience, I first build a map with the valid choices for each entry of list (array) a to b.
Then I loop through the list, trying both making no choice and each valid choice for a connection to b.
Since you can't go back without crossing the existing connections, I keep track of the maximum index already assigned in b.
It works at least for the two examples...
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListMatcher {
    private int[] a;
    private int[] b;
    private Map<Integer, List<Integer>> choicesMap;

    public ListMatcher(int[] a, int[] b) {
        this.a = a;
        this.b = b;
        choicesMap = makeMap(a, b);
    }

    public Map<Integer, Integer> solve() {
        Map<Integer, Integer> solution = new HashMap<>();
        return solve(solution, 0, -1);
    }

    private Map<Integer, Integer> solve(Map<Integer, Integer> soFar, int current, int max) {
        // done
        if (current >= a.length) {
            return soFar;
        }
        // make no choice from this entry
        Map<Integer, Integer> solution = solve(new HashMap<>(soFar), current + 1, max);
        for (Integer choice : choicesMap.get(current)) {
            if (choice > max) // can't go back
            {
                Map<Integer, Integer> next = new HashMap<>(soFar);
                next.put(current, choice);
                next = solve(next, current + 1, choice);
                if (next.size() > solution.size()) {
                    solution = next;
                }
            }
        }
        return solution;
    }

    // init possible choices
    private Map<Integer, List<Integer>> makeMap(int[] a, int[] b) {
        Map<Integer, List<Integer>> possibleMap = new HashMap<>();
        for (int i = 0; i < a.length; i++) {
            List<Integer> possible = new ArrayList<>();
            for (int j = 0; j < b.length; j++) {
                if (a[i] < b[j]) {
                    possible.add(j);
                }
            }
            possibleMap.put(i, possible);
        }
        return possibleMap;
    }

    public static void main(String[] args) {
        ListMatcher matcher = new ListMatcher(new int[]{3, 7, 2, 1, 5, 9, 2, 2}, new int[]{4, 5, 10, 1, 12, 3, 6, 7});
        System.out.println(matcher.solve());
        matcher = new ListMatcher(new int[]{10, 1, 1, 1, 1, 1, 1, 1}, new int[]{2, 2, 2, 2, 2, 2, 2, 101});
        System.out.println(matcher.solve());
    }
}
Output
(format: zero-based index_in_a=index_in_b)
{2=0, 3=1, 4=2, 5=4, 6=5, 7=6}
{1=0, 2=1, 3=2, 4=3, 5=4, 6=5, 7=6}
Your expected solution isn't picked because, among solutions of equal size, the ones making no choice are found first.
You can change this by processing the choice loop before the no-choice branch...
Thanks to David's suggestion, I finally found the algorithm. It is an LCS approach, replacing the '=' with a '>'.
Recursive approach
The recursive approach is very straightforward. G and V are the two vectors with sizes n and m (with a 0 added at the beginning of both as a sentinel). Starting from the end: if the last element of V is smaller than the last element of G, return 1 + the function evaluated without the last item of each; otherwise return the max of the function with the last item of G removed and the function with the last item of V removed.
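In recurrence form, where f(n, m) is the answer for the first n elements of V and the first m elements of G, the description above reads:
f(n, m) = 0                           if n = 0 or m = 0
f(n, m) = 1 + f(n-1, m-1)             if V[n] < G[m]
f(n, m) = max(f(n-1, m), f(n, m-1))   otherwise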
int evaluateMaxRecursive(const vector<int>& V, const vector<int>& G, int n, int m) {
    if ((n == 0) || (m == 0)) {
        return 0;
    }
    else {
        if (V[n] < G[m]) {
            return 1 + evaluateMaxRecursive(V, G, n - 1, m - 1);
        } else {
            return max(evaluateMaxRecursive(V, G, n - 1, m), evaluateMaxRecursive(V, G, n, m - 1));
        }
    }
}
The recursive approach is only viable for a small number of items, due to the re-evaluation of the same subproblems that occurs during the recursion.
Non-recursive approach
The non-recursive approach goes in the opposite direction and works with a table that is filled in after first clearing the first row and first column to 0. The max value is the value in the bottom-right corner of the table.
int evaluateMax(const vector<int>& V, const vector<int>& G, int n, int m) {
    // (n+1) x (m+1) table, zero-initialized (avoids the leak of a raw new[])
    vector<vector<int>> table(n + 1, vector<int>(m + 1, 0));
    for (int i = 1; i < m + 1; i++)
        for (int t = 1; t < n + 1; t++) {
            if (G[i - 1] > V[t - 1]) {
                table[t][i] = 1 + table[t - 1][i - 1];
            }
            else {
                table[t][i] = max(table[t][i - 1], table[t - 1][i]);
            }
        }
    return table[n][m];
}
You can find more details here: LCS - Wikipedia

Implement algorithm to merge sorted arrays and return it

I have to implement a Java program called Merge.java which contain the following implementation of algorithm:
Using the merge procedure for merge sort, merge the first two sorted arrays, then merge in the third, and so on. Given k sorted arrays, each with n elements, combine them into a single sorted array of kn elements.
The program should generate a 2-dimensional array data with dimension k × n storing k sorted arrays of randomly generated integers of length n. Each algorithm should take data as input and merge all k lists into one single array result of length k × n.
public class Merge {
    int k = 2; int n = 4;
    // generate a 2-dimensional array data with dimension k × n
    int[][] data = new int[k][n];
    int size = k * n;

    // implementing merge procedure for merge sort
    public static int[] merge(int data[][]) {
        // First, creating a new array to store the single sorted array
        int res[] = new int[12];
        // How can I then traverse through the arrays, compare their elements
        // one by one, and insert them into the new array (res) in sorted
        // order -- and is this the right way as per the question?
        return res;
    }

    public static void printArray(int[] arr) {
        for (int i : arr) {
            System.out.printf("%d ", i);
        }
        System.out.printf("\n");
    }

    public static void main(String[] args) {
        Merge obj = new Merge();
        int[][] array = new int[][]{{12, 8, 1, 5}, {10, 3, 4, 23}};
        int[] finalSorted = merge(array);
        printArray(finalSorted);
    }
}
Edited to add:
Both answers were helpful, cheers. This is what I've got so far.
However, my program should work on the 2-dimensional input, and there can be more than two arrays:
The program should generate a 2-dimensional array data with dimension k × n storing k sorted arrays of randomly generated integers of length n. Each algorithm should take data as input and merge all k lists into one single array result of length k × n.
What would be the next step?
// merge method takes two arrays as parameters and returns the merged array
public int[] merge(int[] array1, int[] array2) {
    int i = 0, j = 0, k = 0;
    int m = array1.length;
    int n = array2.length;
    // declaring the to-be-returned array after merging array1 & array2
    int[] mergedArray = new int[m + n];
    // comparing the two arrays: write the smaller element, then compare the next elements, and so on
    while (i < m && j < n) {
        if (array1[i] <= array2[j]) {
            // if the current element of array1 is <= that of array2, place the array1 element in mergedArray, and vice versa
            mergedArray[k] = array1[i];
            i++;
        } else {
            mergedArray[k] = array2[j]; // opposite of above (note: index k, not j)
            j++;
        }
        k++;
    }
    // when run out of elements from one or the other array, just write all the elements from the other
    if (i < m) {
        for (int p = i; p < m; p++) {
            mergedArray[k] = array1[p];
            k++;
        }
    } else {
        for (int p = j; p < n; p++) {
            mergedArray[k] = array2[p];
            k++;
        }
    }
    return mergedArray;
}
try this..
// size of C array must be equal or greater than
// sum of A and B arrays' sizes
public void merge(int[] A, int[] B, int[] C) {
    int i = 0, j = 0, k = 0;
    int m = A.length;
    int n = B.length;
    while (i < m && j < n) {
        if (A[i] <= B[j]) {
            C[k] = A[i];
            i++;
        } else {
            C[k] = B[j];
            j++;
        }
        k++;
    }
    if (i < m) {
        for (int p = i; p < m; p++) {
            C[k] = A[p];
            k++;
        }
    } else {
        for (int p = j; p < n; p++) {
            C[k] = B[p];
            k++;
        }
    }
}
Reference link: http://www.algolist.net/Algorithms/Merge/Sorted_arrays
Rather than just post the answer, let me give you some pointers in the right direction.
First, you'll need a merge() method that takes two arrays as parameters and returns the merged array. That means that the returned array should be declared and allocated inside the merge() method itself.
Then it's just a matter of looking at the two arrays, element by element. If the current element from a is less than the current element of b, write it and get the next element from a. If the current element from b is less than the current element from a, write it and get the next element from b. And when you run out of elements from one or the other array, just write all the elements from the other.
You'll invoke this method with the first two arrays that you generated. Then, you'll invoke it with the result of the first merge and one of the remaining generated arrays. Keep doing that until you have merged in all of the generated arrays, one at a time.
Then you're done.
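As a minimal sketch of that driver loop (assuming a two-array merge(int[], int[]) like the one in the edited question above; mergeAll is an illustrative name):
// Sketch: merge k sorted rows one at a time, as described above.
// Assumes a working two-array merge(int[], int[]) returning the merged array.
public int[] mergeAll(int[][] data) {
    int[] result = new int[0];        // start with an empty result
    for (int[] row : data) {
        result = merge(result, row);  // fold each sorted row into the result
    }
    return result;
}
Note that each row of data must itself already be sorted before it is merged in.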

Adding sum of frequencies while solving Optimal Binary Search Tree

I am referring to THIS problem and solution.
Firstly, I did not get why the sum of frequencies is added in the recursive equation.
Can someone please help me understand that, with an example maybe?
In Author's word.
We add sum of frequencies from i to j (see first term in the above
formula), this is added because every search will go through root and
one comparison will be done for every search.
In the code, the sum of frequencies (the purpose of which I do not understand) ... corresponds to fsum.
int optCost(int freq[], int i, int j)
{
    // Base cases
    if (j < i)   // If there are no elements in this subarray
        return 0;
    if (j == i)  // If there is one element in this subarray
        return freq[i];

    // Get sum of freq[i], freq[i+1], ... freq[j]
    int fsum = sum(freq, i, j);

    // Initialize minimum value
    int min = INT_MAX;

    // One by one consider all elements as root and recursively find cost
    // of the BST, compare the cost with min and update min if needed
    for (int r = i; r <= j; ++r)
    {
        int cost = optCost(freq, i, r-1) + optCost(freq, r+1, j);
        if (cost < min)
            min = cost;
    }

    // Return minimum value
    return min + fsum;
}
Secondly, this solution will just return the optimal cost. Any suggestions on how to get the actual BST?
Why we need the sum of frequencies
The idea behind the sum of frequencies is to correctly calculate the cost of a particular tree. It behaves like an accumulator that stores the tree's weight.
Imagine that on the first level of recursion we start with all keys located on the first level of the tree (we haven't picked any root element yet). Remember the weight function: it sums over all node weights multiplied by node level. For now the weight of our tree equals the sum of the weights of all keys, because each key will end up on some level (at least the first), so we count at least one copy of each key's weight in our result.
1) Suppose we found the optimal root key, say key r. Next we move all keys except r one level down, because each of the remaining elements can be located at best on the second level (the first level is already occupied). Because of that we add the weight of each remaining key to our sum, since for all of them we will count at least double weight. The remaining keys we split into two subarrays according to the element r we selected (those to the left of r and those to the right).
2) The next step is to select the optimal keys for the second level, one from each of the two subarrays left from the first step. After doing that we again move all remaining keys one level down and add their weights to the sum, because they will be located at least on the third level, so we count at least triple weight for each of them.
3) And so on.
I hope this explanation gives you some understanding of why we need this sum of frequencies.
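As a worked example (using the keys from the code below): for keys {10, 12, 20} with frequencies {34, 8, 50}, the optimal tree has 20 at the root, 10 as its left child and 12 as 10's right child, for a cost of 50*1 + 34*2 + 8*3 = 142. In the accumulator view: the first call adds fsum = 34 + 8 + 50 = 92 (every key sits at depth at least 1), choosing 20 as root leaves {10, 12}, whose recursive call adds 34 + 8 = 42 (those keys sit at depth at least 2), and choosing 10 next leaves {12}, adding 8 more (depth 3). 92 + 42 + 8 = 142.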
Finding the optimal BST
As the author mentioned at the end of the article:
2) In the above solutions, we have computed optimal cost only. The
solutions can be easily modified to store the structure of BSTs also.
We can create another auxiliary array of size n to store the structure
of tree. All we need to do is, store the chosen ‘r’ in the innermost
loop.
We can do just that. Below you will find my implementation.
Some notes about it:
1) I was forced to replace int[n][n] with the utility class Matrix because I used Visual C++, which does not support non-compile-time-constant expressions as array sizes.
2) I used the second implementation of the algorithm from the article you provided (the one with memoization), because it is much easier to add the functionality to store the optimal BST to it.
3) The author has a mistake in his code: the second loop for (int i=0; i<=n-L+1; i++) should have n-L as its upper bound, not n-L+1.
4) The way we store the optimal BST is as follows: for each pair i, j we store the optimal key index. This is the same as for the optimal cost, but instead of storing the optimal cost we store the optimal key index. For example, for 0, n-1 we will have the index of the root key r of our result tree. Next we split our array in two according to the root element index r and get their optimal key indexes. We can do that by accessing matrix elements 0, r-1 and r+1, n-1. And so forth. The utility function PrintResultTree uses this approach and prints the result tree in-order (left subtree, node, right subtree), so you basically get an ordered list, because it is a binary search tree.
5) Please don't flame me for my code - I'm not really a C++ programmer. :)
int optimalSearchTree(int keys[], int freq[], int n, Matrix& optimalKeyIndexes)
{
    /* Create an auxiliary 2D matrix to store results of subproblems */
    Matrix cost(n, n);
    optimalKeyIndexes = Matrix(n, n);
    /* cost[i][j] = Optimal cost of binary search tree that can be
       formed from keys[i] to keys[j].
       cost[0][n-1] will store the resultant cost */

    // For a single key, cost is equal to frequency of the key
    for (int i = 0; i < n; i++)
        cost.SetCell(i, i, freq[i]);

    // Now we need to consider chains of length 2, 3, ... .
    // L is chain length.
    for (int L = 2; L <= n; L++)
    {
        // i is row number in cost[][]
        for (int i = 0; i <= n - L; i++)
        {
            // Get column number j from row number i and chain length L
            int j = i + L - 1;
            cost.SetCell(i, j, INT_MAX);
            // Try making all keys in interval keys[i..j] as root
            for (int r = i; r <= j; r++)
            {
                // c = cost when keys[r] becomes root of this subtree
                int c = ((r > i) ? cost.GetCell(i, r - 1) : 0) +
                        ((r < j) ? cost.GetCell(r + 1, j) : 0) +
                        sum(freq, i, j);
                if (c < cost.GetCell(i, j))
                {
                    cost.SetCell(i, j, c);
                    optimalKeyIndexes.SetCell(i, j, r);
                }
            }
        }
    }
    return cost.GetCell(0, n - 1);
}
Below is utility class Matrix:
class Matrix
{
private:
    int rowCount;
    int columnCount;
    std::vector<int> cells;

public:
    Matrix()
    {
    }

    Matrix(int rows, int columns)
    {
        rowCount = rows;
        columnCount = columns;
        cells = std::vector<int>(rows * columns);
    }

    int GetCell(int rowNum, int columnNum)
    {
        return cells[columnNum + rowNum * columnCount];
    }

    void SetCell(int rowNum, int columnNum, int value)
    {
        cells[columnNum + rowNum * columnCount] = value;
    }
};
And the main method with a utility function to print the result tree in-order:
// Print result tree in-order
void PrintResultTree(
    Matrix& optimalKeyIndexes,
    int startIndex,
    int endIndex,
    int* keys)
{
    if (startIndex == endIndex)
    {
        printf("%d\n", keys[startIndex]);
        return;
    }
    else if (startIndex > endIndex)
    {
        return;
    }
    int currentOptimalKeyIndex = optimalKeyIndexes.GetCell(startIndex, endIndex);
    PrintResultTree(optimalKeyIndexes, startIndex, currentOptimalKeyIndex - 1, keys);
    printf("%d\n", keys[currentOptimalKeyIndex]);
    PrintResultTree(optimalKeyIndexes, currentOptimalKeyIndex + 1, endIndex, keys);
}

int main(int argc, char* argv[])
{
    int keys[] = { 10, 12, 20 };
    int freq[] = { 34, 8, 50 };
    int n = sizeof(keys) / sizeof(keys[0]);
    Matrix optimalKeyIndexes;
    printf("Cost of Optimal BST is %d \n", optimalSearchTree(keys, freq, n, optimalKeyIndexes));
    PrintResultTree(optimalKeyIndexes, 0, n - 1, keys);
    return 0;
}
EDIT:
Below you can find code to create a simple tree-like structure.
Here is the utility TreeNode class:
struct TreeNode
{
public:
    int Key;
    TreeNode* Left;
    TreeNode* Right;
};
Updated main function with BuildResultTree function
void BuildResultTree(Matrix& optimalKeyIndexes,
    int startIndex,
    int endIndex,
    int* keys,
    TreeNode*& tree)
{
    if (startIndex > endIndex)
    {
        return;
    }
    tree = new TreeNode();
    tree->Left = NULL;
    tree->Right = NULL;
    if (startIndex == endIndex)
    {
        tree->Key = keys[startIndex];
        return;
    }
    int currentOptimalKeyIndex = optimalKeyIndexes.GetCell(startIndex, endIndex);
    tree->Key = keys[currentOptimalKeyIndex];
    BuildResultTree(optimalKeyIndexes, startIndex, currentOptimalKeyIndex - 1, keys, tree->Left);
    BuildResultTree(optimalKeyIndexes, currentOptimalKeyIndex + 1, endIndex, keys, tree->Right);
}

int main(int argc, char* argv[])
{
    int keys[] = { 10, 12, 20 };
    int freq[] = { 34, 8, 50 };
    int n = sizeof(keys) / sizeof(keys[0]);
    Matrix optimalKeyIndexes;
    printf("Cost of Optimal BST is %d \n", optimalSearchTree(keys, freq, n, optimalKeyIndexes));
    PrintResultTree(optimalKeyIndexes, 0, n - 1, keys);
    TreeNode* tree = NULL; // BuildResultTree allocates the root (avoids leaking a node)
    BuildResultTree(optimalKeyIndexes, 0, n - 1, keys, tree);
    return 0;
}

How can I find most frequent combinations of numbers in a list?

Imagine you have a list of numbers (or letters), such as
1177783777297461145777267337774652113777236237118777
I want to find the most frequent combinations of numbers in this list:
for 1-digit-long combinations - it is the most frequent number in this list
for 2-digit-long combinations - probably '11'
for 3-digits-long combinations - probably '777' etc
Is there some special algorithm for such tasks?
UPDATE
Well, I have coded the following myself (Java). It looks like the execution time is proportional to the data size multiplied by the pattern size:
public static void main(String[] args)
{
    int DATA_SIZE = 10000;
    int[] data = new int[DATA_SIZE];
    for (int i = 0; i < DATA_SIZE; i++)
    {
        data[i] = (int) (10 * Math.random()) % 10;
        System.out.print(data[i]);
    }
    int[] pattern1 = new int[]{1, 2, 3};
    int[] pattern2 = new int[]{7, 7, 7};
    int[] pattern3 = new int[]{7, 7};
    System.out.println();
    System.out.println(match(data, pattern1));
    System.out.println(match(data, pattern2));
    System.out.println(match(data, pattern3));
}

static int match(int[] data, int[] pattern)
{
    int matches = 0;
    int i = 0;
    // stop early enough that the pattern never runs past the end of data
    while (i <= data.length - pattern.length)
    {
        matches = isEqual(data, i, pattern) ? matches + 1 : matches;
        i++;
    }
    return matches;
}

static boolean isEqual(int[] a, int startIndex, int[] a2)
{
    if (a == a2)
    {
        return true;
    }
    if (a == null || a2 == null)
    {
        return false;
    }
    for (int i = 0; i < a2.length; i++)
    {
        if (a[startIndex + i] != a2[i])
        {
            return false;
        }
    }
    return true;
}
This can be done in quadratic time, though I'm curious about faster approaches. The idea is to iterate over the possible length values k = 1..N and, on each iteration, loop through the string to find the most frequent sequence of length k.
The inner loop can use a hashtable for counting the frequencies efficiently.
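For illustration, a minimal Java sketch of that inner loop (the method name and structure are mine), counting every window of length k in a HashMap and returning the winner:
import java.util.HashMap;
import java.util.Map;

// Sketch: return the most frequent substring of length k in s.
static String mostFrequentKGram(String s, int k) {
    Map<String, Integer> counts = new HashMap<>();
    for (int i = 0; i + k <= s.length(); i++) {
        counts.merge(s.substring(i, i + k), 1, Integer::sum);
    }
    String best = null;
    int bestCount = 0;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getValue() > bestCount) {
            best = e.getKey();
            bestCount = e.getValue();
        }
    }
    return best; // null if s is shorter than k
}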
To find the largest number of repeats of a sequence of length at least k in a string of length n, you can build a suffix tree (http://en.wikipedia.org/wiki/Suffix_tree) in linear time, then find the node describing a sequence of length k (or more) from the root that has the most leaves beneath it.
Overall, this is linear time in the length of the input string.
For small k, you're better off with a naive algorithm:
from collections import Counter

def most_common(s, k):
    c = Counter()
    # count every window of length k (note the + 1 so the last window is included)
    for i in xrange(len(s) - k + 1):
        c[s[i : i + k]] += 1
    return c.most_common(1)[0][0]

for k in xrange(1, 4):
    print most_common('1177783777297461145777267337774652113777236237118777', k)
Simply go through the array, maintaining a variable with the most frequent combination found so far and an auxiliary hash table where keys are the patterns being searched and values are their numbers of occurrences in your input data. When you encounter a pattern again, increment its value in the hash table and, if necessary, update the current most frequent combination.
Regex - that's the solution to your problem, with great efficiency:
http://www.vogella.com/articles/JavaRegularExpressions/article.html
Check it, it will help you. If you still can't get through, let me know and I'll help.

Non-Recursive Merge Sort

Can someone explain in English how non-recursive merge sort works?
Thanks
Non-recursive merge sort works by considering window sizes of 1, 2, 4, 8, 16, ..., 2^n over the input array. For each window size ('k' in the code below), all adjacent pairs of windows are merged into a temporary space, then put back into the array.
Here is my single function, C-based, non-recursive merge sort.
Input and output are in 'a'. Temporary storage in 'b'.
One day, I'd like to have a version that was in-place:
float a[50000000], b[50000000];

void mergesort (long num)
{
    int rght, rend;
    int i, j, m;
    for (int k = 1; k < num; k *= 2) {
        for (int left = 0; left + k < num; left += k*2) {
            rght = left + k;
            rend = rght + k;
            if (rend > num) rend = num;
            m = left; i = left; j = rght;
            while (i < rght && j < rend) {
                if (a[i] <= a[j]) {
                    b[m] = a[i]; i++;
                } else {
                    b[m] = a[j]; j++;
                }
                m++;
            }
            while (i < rght) {
                b[m] = a[i];
                i++; m++;
            }
            while (j < rend) {
                b[m] = a[j];
                j++; m++;
            }
            for (m = left; m < rend; m++) {
                a[m] = b[m];
            }
        }
    }
}
By the way, it is also very easy to prove this is O(n log n). The outer loop over window size grows as a power of two, so k has log n iterations. While there are many windows covered by the inner loop, together all the windows for a given k exactly cover the input array, so the inner loop is O(n). Combining inner and outer loops: O(n) * O(log n) = O(n log n).
Loop through the elements and make every adjacent group of two sorted by swapping the two when necessary.
Now, dealing with two groups at a time (any two, most likely adjacent groups, but you could use the first and last groups), merge them into one group by repeatedly selecting the lowest-valued remaining element from each group until all 4 elements are merged into a group of 4. Now you have nothing but groups of 4, plus a possible remainder. Using a loop around the previous logic, do it all again, except this time work in groups of 4. This loop runs until there is only one group.
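For example, on [5, 3, 8, 1, 2]: the first pass sorts adjacent pairs into groups of two, giving [3, 5] [1, 8] [2]; the second pass merges adjacent groups of two into groups of four, giving [1, 3, 5, 8] [2]; the final pass merges those into [1, 2, 3, 5, 8].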
Quoting from Algorithmist:
Bottom-up merge sort is a non-recursive variant of the merge sort, in which the array is sorted by a sequence of passes. During each pass, the array is divided into blocks of size m (initially, m = 1). Every two adjacent blocks are merged (as in normal merge sort), and the next pass is made with a twice larger value of m.
Both recursive and non-recursive merge sort have the same time complexity of O(n log(n)). This is because both approaches use a stack in one manner or another:
In the non-recursive approach, the user/programmer defines and uses the stack.
In the recursive approach, the stack is used internally by the system to store the return addresses of the recursive calls.
The main reason you would want to use a non-recursive MergeSort is to avoid recursion stack overflow. For example, I am trying to sort 100 million records, each record about 1 kByte in length (= 100 gigabytes), in alphanumeric order. An order(N^2) sort would take 10^16 operations, i.e. it would take decades to run even at 0.1 microseconds per compare operation. An order(N log(N)) merge sort will take less than 10^10 operations, or less than an hour at the same operational speed. However, in the recursive version of MergeSort, the 100-million-element sort results in 50 million recursive calls to MergeSort(). At a few hundred bytes per stack recursion, this overflows the recursion stack even though the process easily fits within heap memory. Doing the merge sort using dynamically allocated memory on the heap -- I am using the code provided by Rama Hoetzlein above, but with dynamically allocated memory on the heap instead of the stack -- I can sort my 100 million records with the non-recursive merge sort and I don't overflow the stack. An appropriate conversation for the website "Stack Overflow"!
PS: Thanks for the code, Rama Hoetzlein.
PPS: 100 gigabytes on the heap?!! Well, it's a virtual heap on a Hadoop cluster, and the MergeSort will be implemented in parallel on several machines sharing the load...
I am new here.
I have modified Rama Hoetzlein's solution (thanks for the ideas). My merge sort does not use the last copy-back loop. Plus, it falls back on insertion sort. I have benchmarked it on my laptop and it is the fastest, even better than the recursive version. By the way, it is in Java and sorts from descending order to ascending order. And of course it is iterative. It can be made multithreaded. The code has become complex, so if anyone is interested, please have a look.
Code (wrapped in a method here so the fragment compiles; it returns the sorted array):
static int[] iterativeMergeSort(int[] input_array)
{
    int num = input_array.length;
    int left = 0;
    int right;
    int temp;
    int LIMIT = 16;
    if (num <= LIMIT)
    {
        // Single Insertion Sort
        right = 1;
        while (right < num)
        {
            temp = input_array[right];
            while ((left > (-1)) && (input_array[left] > temp))
            {
                input_array[left+1] = input_array[left--];
            }
            input_array[left+1] = temp;
            left = right;
            right++;
        }
    }
    else
    {
        int i;
        int j;
        // Fragmented Insertion Sort
        right = LIMIT;
        while (right <= num)
        {
            i = left + 1;
            j = left;
            while (i < right)
            {
                temp = input_array[i];
                while ((j >= left) && (input_array[j] > temp))
                {
                    input_array[j+1] = input_array[j--];
                }
                input_array[j+1] = temp;
                j = i;
                i++;
            }
            left = right;
            right = right + LIMIT;
        }
        // Remainder Insertion Sort
        i = left + 1;
        j = left;
        while (i < num)
        {
            temp = input_array[i];
            while ((j >= left) && (input_array[j] > temp))
            {
                input_array[j+1] = input_array[j--];
            }
            input_array[j+1] = temp;
            j = i;
            i++;
        }
        // Rama Hoetzlein method
        int[] temp_array = new int[num];
        int[] swap;
        int k = LIMIT;
        while (k < num)
        {
            left = 0;
            i = k; // The mid point
            right = k << 1;
            while (i < num)
            {
                if (right > num)
                {
                    right = num;
                }
                temp = left;
                j = i;
                while ((left < i) && (j < right))
                {
                    if (input_array[left] <= input_array[j])
                    {
                        temp_array[temp++] = input_array[left++];
                    }
                    else
                    {
                        temp_array[temp++] = input_array[j++];
                    }
                }
                while (left < i)
                {
                    temp_array[temp++] = input_array[left++];
                }
                while (j < right)
                {
                    temp_array[temp++] = input_array[j++];
                }
                // Do not copy back the elements to input_array
                left = right;
                i = left + k;
                right = i + k;
            }
            // Instead of copying back in the previous loop, copy the remaining
            // elements to temp_array, then swap the array pointers
            while (left < num)
            {
                temp_array[left] = input_array[left++];
            }
            swap = input_array;
            input_array = temp_array;
            temp_array = swap;
            k <<= 1;
        }
    }
    return input_array;
}
Just in case anyone's still lurking in this thread... I've adapted Rama Hoetzlein's non-recursive merge sort algorithm above to sort doubly linked lists. This new sort is in-place, stable and avoids the time-costly list-dividing code found in other linked-list merge sort implementations.
// MergeSort.cpp
// Angus Johnson 2017
// License: Public Domain
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

struct Node {
    int data;
    Node *next;
    Node *prev;
    Node *jump;
};

inline void Move2Before1(Node *n1, Node *n2)
{
    Node *prev, *next;
    // extricate n2 from linked-list ...
    prev = n2->prev;
    next = n2->next;
    prev->next = next; // nb: prev is always assigned
    if (next) next->prev = prev;
    // insert n2 back into list ...
    prev = n1->prev;
    if (prev) prev->next = n2;
    n1->prev = n2;
    n2->prev = prev;
    n2->next = n1;
}

void MergeSort(Node *&nodes)
{
    Node *first, *second, *base, *tmp, *prev_base;
    if (!nodes || !nodes->next) return;
    int mul = 1;
    for (;;) {
        first = nodes;
        prev_base = NULL;
        // sort each successive mul group of nodes ...
        while (first) {
            if (mul == 1) {
                second = first->next;
                if (!second) {
                    first->jump = NULL;
                    break;
                }
                first->jump = second->next;
            }
            else {
                second = first->jump;
                if (!second) break;
                first->jump = second->jump;
            }
            base = first;
            int cnt1 = mul, cnt2 = mul;
            // the following 'if' condition marginally improves performance
            // in an unsorted list but very significantly improves
            // performance when the list is mostly sorted ...
            if (second->data < second->prev->data)
                while (cnt1 && cnt2) {
                    if (second->data < first->data) {
                        if (first == base) {
                            if (prev_base) prev_base->jump = second;
                            base = second;
                            base->jump = first->jump;
                            if (first == nodes) nodes = second;
                        }
                        tmp = second->next;
                        Move2Before1(first, second);
                        second = tmp;
                        if (!second) { first = NULL; break; }
                        --cnt2;
                    }
                    else {
                        first = first->next;
                        --cnt1;
                    }
                } // while (cnt1 && cnt2)
            first = base->jump;
            prev_base = base;
        } // while (first)
        if (!nodes->jump) break;
        else mul <<= 1;
    } // for (;;)
}

void InsertNewNode(Node *&head, int data)
{
    Node *tmp = new Node;
    tmp->data = data;
    tmp->next = NULL;
    tmp->prev = NULL;
    tmp->jump = NULL;
    if (head) {
        tmp->next = head;
        head->prev = tmp;
        head = tmp;
    }
    else head = tmp;
}

void ClearNodes(Node *head)
{
    if (!head) return;
    while (head) {
        Node *tmp = head;
        head = head->next;
        delete tmp;
    }
}

int main()
{
    srand(time(NULL));
    Node *nodes = NULL, *n;
    const int len = 1000000; // 1 million nodes
    for (int i = 0; i < len; i++)
        InsertNewNode(nodes, rand() >> 4);
    clock_t t = clock();
    MergeSort(nodes); // ~1/2 sec for 1 mill. nodes on Pentium i7.
    t = clock() - t;
    printf("Sort time: %d msec\n\n", t * 1000 / CLOCKS_PER_SEC);
    n = nodes;
    while (n)
    {
        if (n->prev && n->data < n->prev->data) {
            printf("oops! sorting's broken\n");
            break;
        }
        n = n->next;
    }
    ClearNodes(nodes);
    printf("All done!\n\n");
    getchar();
    return 0;
}
Edited 2017-10-27: Fixed a bug affecting odd numbered lists
Any interest in this anymore? Probably not. Oh well. Here goes nothing.
The insight of merge-sort is that you can merge two (or several) small sorted runs of records into one larger sorted run, and you can do so with simple stream-like operations "read first/next record" and "append record" -- which means you don't need a big data set in RAM at once: you can get by with just two records, each taken from a distinct run. If you can just keep track of where in your file the sorted runs start and end, you can simply merge pairs of adjacent runs (into a temp file) repeatedly until the file is sorted: this takes a logarithmic number of passes over the file.
A single record is trivially sorted, and each time you merge two adjacent runs the size of the resulting run doubles, so that's one way to keep track. The other is to work on a priority queue of runs: take the two smallest runs from the queue, merge them, and enqueue the result, until there is only one remaining run. This is appropriate if you expect your data to naturally start with sorted runs.
In practice with enormous data sets you'll want to exploit the memory hierarchy. Suppose you have gigabytes of RAM and terabytes of data. Why not merge a thousand runs at once? Indeed you can do this, and a priority-queue of runs can help. That will significantly decrease the number of passes you have to make over a file to get it sorted. Some details are left as an exercise for the reader.
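For illustration, a rough Java sketch of that priority-queue merge, with the runs held in memory as sorted arrays (a real external sort would stream records from temp files instead; all names here are mine):
import java.util.PriorityQueue;

// Sketch: k-way merge of already-sorted runs using a priority queue.
// Each heap entry is {headValue, runIndex, positionInRun}.
static int[] kWayMerge(int[][] runs) {
    int total = 0;
    for (int[] run : runs) total += run.length;
    PriorityQueue<int[]> heap =
        new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
    for (int r = 0; r < runs.length; r++)
        if (runs[r].length > 0) heap.add(new int[]{runs[r][0], r, 0});
    int[] out = new int[total];
    int k = 0;
    while (!heap.isEmpty()) {
        int[] top = heap.poll();        // smallest head among all runs
        out[k++] = top[0];
        int r = top[1], p = top[2] + 1; // advance within that run
        if (p < runs[r].length) heap.add(new int[]{runs[r][p], r, p});
    }
    return out;
}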
