Understanding merge sort optimization: avoiding copies - algorithm

I have the merge sort program below from an algorithms book. It says: "The main problem is that merging two sorted lists requires linear extra memory, and the additional work spent copying to the temporary array and back, throughout the algorithm, has the effect of slowing down the sort considerably. This copying can be avoided by judiciously switching the roles of a and tmp_array at alternate levels of the recursion."
My question is: what does the author mean by "copying can be avoided by judiciously switching the roles of a and tmp_array at alternate levels of the recursion", and how is that possible in the following code? Could you show an example of how to achieve this?
void mergesort( input_type a[], unsigned int n )
{
    input_type *tmp_array;

    tmp_array = (input_type *) malloc( (n+1) * sizeof (input_type) );
    m_sort( a, tmp_array, 1, n );
    free( tmp_array );
}

void m_sort( input_type a[], input_type tmp_array[], int left, int right )
{
    int center;

    if( left < right )
    {
        center = (left + right) / 2;
        m_sort( a, tmp_array, left, center );
        m_sort( a, tmp_array, center+1, right );
        merge( a, tmp_array, left, center+1, right );
    }
}

void merge( input_type a[], input_type tmp_array[], int l_pos, int r_pos, int right_end )
{
    int i, left_end, num_elements, tmp_pos;

    left_end = r_pos - 1;
    tmp_pos = l_pos;
    num_elements = right_end - l_pos + 1;

    /* main loop */
    while( ( l_pos <= left_end ) && ( r_pos <= right_end ) )
        if( a[l_pos] <= a[r_pos] )
            tmp_array[tmp_pos++] = a[l_pos++];
        else
            tmp_array[tmp_pos++] = a[r_pos++];

    while( l_pos <= left_end )    /* copy rest of first half */
        tmp_array[tmp_pos++] = a[l_pos++];

    while( r_pos <= right_end )   /* copy rest of second half */
        tmp_array[tmp_pos++] = a[r_pos++];

    /* copy tmp_array back */
    for( i = 1; i <= num_elements; i++, right_end-- )
        a[right_end] = tmp_array[right_end];
}

I'm going to assume, without looking too closely at this code, that it performs merge sort by allocating a contiguous block of memory the same size as the original array.
So normally merge sort is like this:
split array in half
sort half-arrays by recursively invoking MergeSort on them
merge half-arrays back
I'm assuming it's recursive, so no copies will be done before we're sorting sub-arrays of size 2. Now what happens?
_ means it is memory we have available, but we don't care about the data in it
original:
8 5 2 3 1 7 4 6
_ _ _ _ _ _ _ _
Begin recursive calls:
recursive call 1:
(8 5 2 3) (1 7 4 6)
_ _ _ _ _ _ _ _
recursive call 2:
((8 5) (2 3)) ((1 7) (4 6))
_ _ _ _ _ _ _ _
recursive call 3:
(((8) (5))((2) (3)))(((1) (7))((4) (6)))
_ _ _ _ _ _ _ _
Recursive calls resolving with merging, PLUS COPYING (uses more memory, or alternatively is 'slower'):
merge for call 3, using temp space:
(((8) (5))((2) (3)))(((1) (7))((4) (6))) --\ perform merge
(( 5 8 )( 2 3 ))(( 1 7 )( 4 6 )) <--/ operation
UNNECESSARY: copy back:
(( 5 8 )( 2 3 ))(( 1 7 )( 4 6 )) <--\ copy and
_ _ _ _ _ _ _ _ --/ ignore old
merge for call 2, using temp space:
(( 5 8 )( 2 3 ))(( 1 7 )( 4 6 )) --\ perform merge
( 2 3 5 8 )( 1 4 6 7 ) <--/ operation
UNNECESSARY: copy back:
( 2 3 5 8 )( 1 4 6 7 ) <--\ copy and
_ _ _ _ _ _ _ _ --/ ignore old
merge for call 1, using temp space:
( 2 3 5 8 )( 1 4 6 7 ) --\ perform merge
1 2 3 4 5 6 7 8 <--/ operation
UNNECESSARY: copy back:
1 2 3 4 5 6 7 8 <--\ copy and
_ _ _ _ _ _ _ _ --/ ignore old
What the author is suggesting
Recursive calls resolving with merging, WITHOUT COPYING (uses less memory):
merge for call 3, using temp space:
(((8) (5))((2) (3)))(((1) (7))((4) (6))) --\ perform merge
(( 5 8 )( 2 3 ))(( 1 7 )( 4 6 )) <--/ operation
merge for call 2, using old array as temp space:
( 2 3 5 8 )( 1 4 6 7 ) <--\ perform merge
(( 5 8 )( 2 3 ))(( 1 7 )( 4 6 )) --/ operation (backwards)
merge for call 1, using temp space:
( 2 3 5 8 )( 1 4 6 7 ) --\ perform merge
1 2 3 4 5 6 7 8 <--/ operation
There you go: you don't need to do copies as long as you perform each "level" of the merge-sort tree in lock-step, as shown above.
You may also have a minor issue of parity, as demonstrated above. That is, your result may end up in your temp_array. You have three options for dealing with this:
return the temp_array as the answer and release the old memory (if your application is fine with that)
perform a single array copy operation, to copy temp_array back into your original array
allow yourself to consume merely twice as much memory, and perform a single cycle of merges from temp_array1 to temp_array2 and then back to original_array, then release temp_array2. This resolves the parity issue.
This is not necessarily "faster":
additional work spent copying to the temporary array and back
This is actually not the core reason why it's 'faster' per se. It is obviously not asymptotically faster, and not even necessarily faster at all. There is a notion of latency vs. throughput: running time is usually measured as latency, because extra housekeeping work (like releasing memory) may be done asynchronously. You don't necessarily need to copy "back" to the original array, depending on your language. However, if you repeat this many times on memory-bound hardware in a garbage-collected language, garbage collection can occasionally spike if the GC algorithm is a poor fit for what you are doing (or, if this is C, you may end up waiting on allocation). So creating extra memory in a GC language should not really count against you, though using too much memory may keep you from exploiting the cache properly. You'd have to benchmark it yourself, very carefully, for your use case.
I do not recommend creating random temporary arrays for each step though, as that would make memory O(N log(N)) and this is a trivial optimization.
Minor notes on in-placeness:
Also, the reason you can't naively do the merge in place is that while merging two sorted sub-arrays, the result may take arbitrarily many elements from one input before switching to the other. For example, as you can see below, we need a buffer because our input arrays can get split into fragments:
( 4 6 7 8 10)(1 2 3 5 9 11)(... other sub-arrays)
( 1)(6 7 8 10)(4)(2 3 5 9 11)(...
( 1 2)(7 8 10)(4 6)(3 5 9 11) ...
( 1 2 3)(8 10)(4 6 7)(5 9 11)
( 1 2 3 4)(10)(8)(6 7)(5 9 11) ooph :-(
( 1 2 3 4 5)(8)(6 7)(10)(9 11) ooph
You might be able to do this cleverly in place with some weird variant of the kth-statistic median-of-medians algorithm, performing your merge from the middle of the two arrays outwards (from a specifically chosen element, going left/decreasing and right/increasing simultaneously). I'm not sure how one would implement that, though, or whether the hunch is true.
(very minor note: perhaps those who are familiar with sorting algorithms should be careful of comparing a traditional swap operation involving a tmp variable in a register, which is two reads-from-cache and two writes-to-cache, to not-in-place copying to other bits of memory, without a per-operation counting argument.)
Certainly, OP's method is extremely simple to code for only twice as much memory.

Start by thinking of merge sort in this way.
0: Consider the input array A0 as a collection of ordered sequences of
length 1.
1: Merge each consecutive pair of sequences from A0, constructing a
new temporary array A1.
2: Merge each consecutive pair of sequences from A1, constructing a
new temporary array A2.
...
Finish when the last iteration results in a single sequence.
Now, you can obviously get away with just a single temporary array by doing this:
0: Consider the input array A0 as a collection of ordered sequences of
length 1.
1: Merge each consecutive pair of sequences from A0, constructing a
new temporary array A1.
2: Merge each consecutive pair of sequences from A1, overwriting A0
with the result.
3: Merge each consecutive pair of sequences from A0, overwriting A1
with the result.
...
Finish when the last iteration results in a single sequence.
Of course, you can be even smarter than this. If you want to be nicer to the cache, you might decide to sort top-down, rather than bottom-up. In this case, it hopefully becomes obvious what your textbook means when it refers to tracking the role of the arrays at different levels of recursion.
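That bottom-up alternation can be sketched in code. This is not the textbook's listing; the buffer names, the int element type, and the final parity copy are my own choices:

```cpp
#include <algorithm>
#include <cassert>

// Merge sorted runs of length `width` from src into dst.
static void merge_pass(const int* src, int* dst, int n, int width) {
    for (int lo = 0; lo < n; lo += 2 * width) {
        int mid = std::min(lo + width, n);      // end of left run
        int hi  = std::min(lo + 2 * width, n);  // end of right run
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
        while (i < mid) dst[k++] = src[i++];
        while (j < hi)  dst[k++] = src[j++];
    }
}

static void bottom_up_mergesort(int* a, int* tmp, int n) {
    int* src = a;
    int* dst = tmp;
    for (int width = 1; width < n; width *= 2) {
        merge_pass(src, dst, n, width);  // A0 -> A1, then A1 -> A0, ...
        std::swap(src, dst);             // switch buffer roles instead of copying back
    }
    if (src != a)                        // parity: result may have ended in tmp
        std::copy(src, src + n, a);      // one final copy fixes it
}
```

Each pass writes into the other buffer and the roles swap, so no per-merge copy-back is needed; at most one copy happens at the very end, and only when the number of passes is odd.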
Hope this helps.

Here is my implementation without extra copies.
public static void sort(ArrayList<Integer> input) {
    mergeSort(input, 0, input.size() - 1);
}

/**
 * Sorts input and returns the number of inversions,
 * using the classical divide and conquer approach
 *
 * @param input Input array
 * @param start Start index
 * @param end   End index
 * @return inversions number
 */
private static long mergeSort(ArrayList<Integer> input, int start, int end) {
    if (end - start < 1) {
        return 0;
    }
    long inversionsNumber = 0;
    // 1. divide input into subtasks
    int pivot = start + (end - start) / 2;
    if (end - start > 1) {
        inversionsNumber += mergeSort(input, start, pivot);
        inversionsNumber += mergeSort(input, pivot + 1, end);
    }
    // 2. merge the results
    int offset = 0;
    int leftIndex = start;
    int rightIndex = pivot + 1;
    while (leftIndex <= pivot && rightIndex <= end) {
        if (input.get(leftIndex + offset) <= input.get(rightIndex)) {
            if (leftIndex < pivot) {
                leftIndex++;
            } else {
                rightIndex++;
            }
            continue;
        }
        moveElement(input, rightIndex, leftIndex + offset);
        inversionsNumber += rightIndex - leftIndex - offset;
        rightIndex++;
        offset++;
    }
    return inversionsNumber;
}

private static void moveElement(ArrayList<Integer> input, int from, int to) {
    assert 0 <= to;
    assert to < from;
    assert from < input.size();
    int temp = input.get(from);
    for (int i = from; i > to; i--) {
        input.set(i, input.get(i - 1));
    }
    input.set(to, temp);
}

Look at the very last part of the merge function. What if, instead of copying that data back, you just used the knowledge that the sorted run is now in tmp_array rather than a when the function returns, and that a is available for use as a temp?
Details are left as an exercise for the reader.
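If it helps, here is one hedged way to fill in that exercise (my own sketch, using int elements and zero-based index ranges rather than the book's input_type and 1-based ranges): let merge write into its destination and stop, and have each recursive level swap which buffer plays a and which plays tmp_array. Both buffers must start with the same contents so the size-1 base cases are already "sorted" in either one.

```cpp
#include <cassert>
#include <cstring>

// Merge sorted runs src[l..m] and src[m+1..r] into dst[l..r]; no copy-back.
static void merge_to(const int* src, int* dst, int l, int m, int r) {
    int i = l, j = m + 1, k = l;
    while (i <= m && j <= r) dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
    while (i <= m) dst[k++] = src[i++];
    while (j <= r) dst[k++] = src[j++];
}

// Sorts a[l..r], leaving the result in a; tmp is scratch.
// Note the swapped argument order in the recursive calls: one level down,
// tmp plays the role of a and vice versa.
static void msort_swapped(int* a, int* tmp, int l, int r) {
    if (l >= r) return;
    int m = (l + r) / 2;
    msort_swapped(tmp, a, l, m);       // halves end up sorted in tmp
    msort_swapped(tmp, a, m + 1, r);
    merge_to(tmp, a, l, m, r);         // merge them back into a
}
```

Usage requires the one up-front copy so both buffers agree initially, e.g. `memcpy(tmp, a, n * sizeof(int)); msort_swapped(a, tmp, 0, n - 1);` — after which the sorted result is in a with no per-merge copy-back.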

Fenwick tree batch updating

Fenwick tree has an update(idx, delta) function, which may be implemented like this:
for ( ; idx < this._fTree.length; idx += idx & -idx ){
    this._fTree[ idx ] += delta;
}
So you are updating your current idx and then pushing your change till the end.
I want to perform several update calls in a loop, like
update( 12, 100 );
update( 13, 400 );
update( 17, 200 );
If I know the range of updated indexes in advance (12 - 17 here), update may be optimized. Rather than pushing till the end every time, we can push till a certain index, buffer the rest of the change, and then, after the last update, push the buffered change till the end. An example to explain the concept:
For example taking range 3 - 7.
updating of index 3 will lead to update of following indexes: 3 - 4 - 8 - 16.
Updating of index 4 will lead to: 4 - 8 - 16
Updating of index 5 will lead to: 5 - 6 - 8 - 16
It would be optimal to push everything till 8, buffer rest, and then push it till 16.
Question
Is there better way to find this node to push until ( 8 here ), knowing range start and end?
Non-ideal solution
const getLimitIndex = ( from, to ) => {
    let lim = to;
    for( let tmp = lim & -lim; lim - tmp > from; ){
        lim += tmp;
        tmp = lim & -lim;
    }
    return lim;
}
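For context, the standard Fenwick update/prefix-sum pair that the question builds on can be sketched like this (a minimal version; the class and method names are my own, indices are 1-based):

```cpp
#include <cassert>
#include <vector>

// Minimal 1-based Fenwick (binary indexed) tree for prefix sums.
struct Fenwick {
    std::vector<long long> t;
    explicit Fenwick(int n) : t(n + 1, 0) {}

    // Add delta at idx; idx & -idx isolates the lowest set bit,
    // so the loop climbs through every node covering idx.
    void update(int idx, long long delta) {
        for (; idx < (int)t.size(); idx += idx & -idx) t[idx] += delta;
    }

    // Sum of elements 1..idx.
    long long prefix(int idx) const {
        long long s = 0;
        for (; idx > 0; idx -= idx & -idx) s += t[idx];
        return s;
    }
};
```

The batching idea in the question then amounts to stopping several of these update walks early at a shared ancestor index, accumulating their deltas, and pushing the buffered total onward once.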

Broken Calculator

Problem Statement:
There is a broken calculator. Only a few of the digits [0 to 9] and operators [+, -, *, /] are working.
A req no. needs to be formed using the working digits and the operators. Each press on the keyboard is called an operation.
= operator is always working and is used when the req no. is formed using operators.
-1 needs to be printed in case the req no. cannot be formed using the digits and the operators provided OR exceeds the max no. of operations allowed.
At no point in time during the calculation of the result, the no. should become negative or exceed 999 [0 <= calcno <= 999]
Input:
1st line contains 3 space separated nos: no. of working digits, no. of working operators, max. no of operations allowed.
2nd line contains space separated working digits.
3rd line contains space separated working operators [1 represents +, 2 represents -, 3 represents *, 4 represents /].
4th line contains the req. no to be formed.
Output:
Find the minimum required operations to form the req no.
Example:
Input 1:
2 1 8
2 5
3
50
Possible ways:
Case 1: 2*5*5 = -> 6 operations
Case 2: 2*25 = -> 4 operations
4 is the req Answer
Input 2:
3 4 8
5 4 2
3 2 4 1
42
Possible ways:
Case 1: 42 -> 2 operations (direct key in)
Case 2: 5*4*2+2 = -> 8 operations
..........some other ways
2 is the req Answer
I am not getting a proper approach to this problem.
Can someone suggest some ways to approach the problem.
Giving some more context to what vish4071 said in the comments.
Set up a graph in the following way:
Start the graph with a root; its children are the digits you're allowed to use (for the example, 2 and 5). Build up the graph level by level.
Build each level as follows: a new node consists either of appending an allowed digit or of applying an allowed operator. An operator can never directly follow another operator.
If a node has a higher value than the target value, kill that node (make it an end node). This only works for this example (where the operators are * and +); if you were allowed to use the - or / operators, it would not be valid.
Do this until you find the required value; that level (+1, due to the = operation) is your answer.
And example of the graph is given below
for your first example:
D=0 D=1
5
/
Root /
\
\2
D=1 D=2 d=3 d=4
--2
/
/
(*)___5 --> reaches answer but at level 6
/
/ (*)___2 --> complete
/ / \ 5
/ /
2 /____25_252 --> end node
\ \
\ \
\
\ 225 --> end node
\ /
22__222 --> end node
\
(*)
This is slightly better than brute forcing, maybe there is a more optimal way.
#include <bits/stdc++.h>
using namespace std;

int main() {
    int n, m, o;
    cin >> n >> m >> o;
    int arr[n];
    queue<pair<int,int>> q;   // (current number, operations used so far)
    for (int i = 0; i < n; i++) {
        cin >> arr[i];
        q.push(make_pair(arr[i], 1));
    }
    int op[m];
    for (int i = 0; i < m; i++) cin >> op[i];
    unordered_map<int,int> mp;
    for (int i = 0; i < m; i++) mp[op[i]] = 1;
    int target;
    cin >> target;
    int ans = INT_MAX;
    while (!q.empty()) {
        int num = q.front().first;
        int count = q.front().second;
        if (num == target) ans = min(ans, count);
        q.pop();
        for (int i = 0; i <= 4; i++) {
            for (int j = 0; j < n; j++) {
                if (i == 0 and count + 1 <= o) {
                    // key in another digit
                    q.push(make_pair(num * 10 + arr[j], count + 1));
                } else {
                    // operator key + digit key + '=' key: 3 extra operations
                    if (i == 1 and mp.find(i) != mp.end() and count + 3 <= o)
                        q.push(make_pair(num + arr[j], count + 3));
                    if (i == 2 and mp.find(i) != mp.end() and count + 3 <= o)
                        q.push(make_pair(abs(num - arr[j]), count + 3));
                    if (i == 3 and mp.find(i) != mp.end() and count + 3 <= o)
                        q.push(make_pair(num * arr[j], count + 3));
                    if (i == 4 and mp.find(i) != mp.end() and count + 3 <= o and arr[j] != 0)  // avoid division by zero
                        q.push(make_pair(num / arr[j], count + 3));
                }
            }
        }
    }
    if (ans == INT_MAX) cout << "-1" << endl;
    else cout << ans << endl;
    return 0;
}

How to sort a vector of structure belonging to same category/set in c++

While solving a competitive coding problem I got stuck with the following sorting scenario.
I have a vector of following structure
struct Data{
int p;
int val;
int ll;
};
Defined as :
vector<Data> a(N);
Now p field in the structure tells the set number to which val belong.
e.g if values are 1 2 3 4 5 6 7 8 9
(1,4,7) belong to group/set 3 i.e p is 3 , (2,5,8) belong to group/set 4 i.e p is 4 and (3,6,9) belong to group/set 5 i.e p is 5
i have p and val field in structure as
p as 3 4 5 3 4 5 3 4 5
val as 1 2 3 4 5 6 7 8 9
Now the problem is I have to sort the vector set wise in descending order
i.e 7 8 9 4 5 6 1 2 3
here 1 4 and 7 belong to set 3 so they are sorted in their respective places.
I tried with the selection sort as below which worked fine but it gave Time limit exceeded because of O(N^2) complexity.
for(int i=0;i<N;i++)
{
    int mi=i;
    Data max=a[i];
    for(int j=i+1;j<N;j++)
    {
        if((a[i].p==a[j].p)&&(a[j].val>max.val))
        {
            max=a[j];
            mi=j;
        }
    }
    a[mi]=a[i];
    a[i]=max;
}
Please help me find the best (time complexity) way to sort this scenario (if possible using STL sort).
Thanks in advance.
Modifying http://www.cplusplus.com/reference/algorithm/sort/, the key bit is:
#include <algorithm>
...
bool mycomparison (const Data& i, const Data& j) {
if (i.p != j.p)
return j.p < i.p;
else
return j.val < i.val;
}
...
// Sort vector a
std::sort( a.begin(), a.end(), mycomparison );
...
Note that swapping i and j in the return statements flips between descending and ascending order.

check overflow when multiply with 3 by bitwise

I have a problem figuring out how to solve this one. I was thinking about returning
int product = 3 * n;
return (!n || product/n == 3);
however, I can't use those operators.
/*
* Overflow detection of 3*n
* Input is positive
* Example: overflow( 10 ) = 0
* Example: overflow( 1<<30 ) = 1
* Legal ops: & | >> << ~
* Max ops: 10
*
* Number of X86 instructions:
*/
int overflow_3( int n ) {
return 2;
}
The condition is equivalent to checking whether x is larger than INT_MAX / 3, that is, x > 0x2aaaaaaa. Since x is known to be nonnegative, the top bit is zero, so we can check the condition as follows:
unsigned overflow(unsigned x) {
return (x + 0x55555555) >> 31;
}
There are two ways a number can overflow when multiplied by 3.
Consider how x3 is computed: a shift left by 1 (x2), followed by adding the original number.
1. The shift left by 1 already sets the leftmost bit. This can only happen if the next-to-leftmost bit (bit 30) is set.
2. The shift left by 1 leaves the leftmost bit unset, but the following addition of the original number carries into it. This can only happen if bit 29 is set (since it is the only bit that becomes bit 30 after the shift) and the lower bits produce a carry: either bit 28 is set, or bit 27 together with further carries from below, and so on. Bit 27 by itself being set is not enough (we would also need bit 26 set, or bits 25 and 24), etc.
So basically you need a loop here. However, since loops are not allowed, I would use recursion:
int overflow_3( int n ) {
    return n >> 30 || ( n >> 29 && overflow_3( ( n & ( (1 << 29) - 1 ) ) << 2 ) );
}

What is the fastest possible way to sort an array of 7 integers?

This is a part of a program that analyzes the odds of poker, specifically Texas Hold'em. I have a program I'm happy with, but it needs some small optimizations to be perfect.
I use this type (among others, of course):
type
T7Cards = array[0..6] of integer;
There are two things about this array that may be important when deciding how to sort it:
Every item is a value from 0 to 51. No other values are possible.
There are no duplicates. Never.
With this information, what is the absolutely fastest way to sort this array? I use Delphi, so pascal code would be the best, but I can read C and pseudo, albeit a bit more slowly :-)
At the moment I use quicksort, but the funny thing is that this is almost no faster than bubblesort! Possibly because of the small number of items. The sorting accounts for almost 50% of the total running time of the method.
EDIT:
Mason Wheeler asked why it's necessary to optimize. One reason is that the method will be called 2118760 times.
Basic poker information: all players are dealt two cards (the pocket) and then five cards are dealt to the table (the first 3 are called the flop, the next is the turn and the last is the river). Each player picks the five best cards to make up their hand.
If I have two cards in the pocket, P1 and P2, I will use the following loops to generate all possible combinations:
for C1 := 0 to 51-4 do
  if (C1<>P1) and (C1<>P2) then
    for C2 := C1+1 to 51-3 do
      if (C2<>P1) and (C2<>P2) then
        for C3 := C2+1 to 51-2 do
          if (C3<>P1) and (C3<>P2) then
            for C4 := C3+1 to 51-1 do
              if (C4<>P1) and (C4<>P2) then
                for C5 := C4+1 to 51 do
                  if (C5<>P1) and (C5<>P2) then
                  begin
                    //This code will be executed 2 118 760 times
                    inc(ComboCounter[GetComboFromCards([P1,P2,C1,C2,C3,C4,C5])]);
                  end;
As I write this I notice one thing more: The last five elements of the array will always be sorted, so it's just a question of putting the first two elements in the right position in the array. That should simplify matters a bit.
So, the new question is: What is the fastest possible way to sort an array of 7 integers when the last 5 elements are already sorted. I believe this could be solved with a couple (?) of if's and swaps :-)
For a very small set, insertion sort can usually beat quicksort because it has very low overhead.
WRT your edit, if you're already mostly in sort order (last 5 elements are already sorted), insertion sort is definitely the way to go. In an almost-sorted set of data, it'll beat quicksort every time, even for large sets. (Especially for large sets! This is insertion sort's best-case scenario and quicksort's worst case.)
Don't know how you are implementing this, but what you could do is have an array of 52 instead of 7, and just insert each card in its slot directly when you get it. Since there can never be duplicates, you never have to sort the array. This might be faster depending on how it's used.
I don't know that much about Texas Hold'em: Does it matter what suit P1 and P2 are, or does it only matter if they are of the same suit or not? If only suit(P1)==suit(P2) matters, then you could separate the two cases, you have only 13x12/2 different possibilities for P1/P2, and you can easily precalculate a table for the two cases.
Otherwise, I would suggest something like this:
(* C1 < C2 < P1 *)
for C1:=0 to P1-2 do
  for C2:=C1+1 to P1-1 do
  begin
    Cards[0] := C1;
    Cards[1] := C2;
    Cards[2] := P1;
    (* generate C3...C7 *)
  end;
(* C1 < P1 < C2 *)
for C1:=0 to P1-1 do
  for C2:=P1+1 to 51 do
  begin
    Cards[0] := C1;
    Cards[1] := P1;
    Cards[2] := C2;
    (* generate C3...C7 *)
  end;
(* P1 < C1 < C2 *)
for C1:=P1+1 to 51 do
  for C2:=C1+1 to 51 do
  begin
    Cards[0] := P1;
    Cards[1] := C1;
    Cards[2] := C2;
    (* generate C3...C7 *)
  end;
(this is just a demonstration for one card P1, you would have to expand that for P2, but I think that's straightforward. Although it'll be a lot of typing...)
That way, the sorting doesn't take any time at all. The generated permutations are already ordered.
There are only 5040 permutations of 7 elements. You can programmatically generate a program that finds the one represented by your input in a minimal number of comparisons. It will be a big tree of if-then-else instructions, each comparing a fixed pair of elements, for example if (a[3]<=a[6]).
The tricky part is deciding which 2 elements to compare in a particular internal node. For this, you have to take into account the results of the comparisons in the ancestor nodes from the root to that node (for example a[0]<=a[1], not a[2]<=a[6], a[2]<=a[5]) and the set of possible permutations that satisfy them. Compare the pair of elements that splits the set into parts as equal as possible (minimize the size of the larger part).
Once you have the permutation, it is trivial to sort it in a minimal set of swaps.
Since the last 5 items are already sorted, the code can be written just to reposition the first 2 items. Since you're using Pascal, I've written and tested a sorting algorithm that can execute 2,118,760 times in about 62 milliseconds.
procedure SortT7Cards(var Cards: T7Cards);
const
CardsLength = Length(Cards);
var
I, J, V: Integer;
V1, V2: Integer;
begin
// Last 5 items will always be sorted, so we want to place the first two into
// the right location.
V1 := Cards[0];
V2 := Cards[1];
if V2 < V1 then
begin
I := V1;
V1 := V2;
V2 := I;
end;
J := 0;
I := 2;
while I < CardsLength do
begin
V := Cards[I];
if V1 < V then
begin
Cards[J] := V1;
Inc(J);
Break;
end;
Cards[J] := V;
Inc(J);
Inc(I);
end;
while I < CardsLength do
begin
V := Cards[I];
if V2 < V then
begin
Cards[J] := V2;
Break;
end;
Cards[J] := V;
Inc(J);
Inc(I);
end;
if J = (CardsLength - 2) then
begin
Cards[J] := V1;
Cards[J + 1] := V2;
end
else if J = (CardsLength - 1) then
begin
Cards[J] := V2;
end;
end;
Use min-sort: search for the minimal and maximal element at once and place them into the result array. Repeat three times, then pick up the one remaining element. (EDIT: No, I won't try to measure the speed theoretically :_))
var
  cards, result: array[0..6] of integer;
  i, n, min, max, mini, maxi: integer;
begin
  n := 0;
  while n < 3 do begin
    min := 52; max := -1;
    for i := 0 to 6 do
      if (cards[i] >= 0) and (cards[i] < 52) then begin
        if cards[i] < min then begin min := cards[i]; mini := i; end;
        if cards[i] > max then begin max := cards[i]; maxi := i; end;
      end;
    result[n] := min;
    result[6-n] := max;
    cards[mini] := -1;   { mark as used }
    cards[maxi] := 52;   { mark as used }
    inc(n);
  end;
  for i := 0 to 6 do
    if (cards[i] < 52) and (cards[i] >= 0) then begin
      result[3] := cards[i];
      break;
    end;
  { Result is sorted here! }
end;
This is the fastest method: since the 5-card list is already sorted, sort the two-card list (a compare & swap), and then merge the two lists, which is O(k * (5+2)). In this case k will normally be 5: the loop test (1), the compare (2), the copy (3), the input-list increment (4) and the output-list increment (5). That's 35 + 2.5. Throw in loop initialization and you get 41.5 statements, total.
You could also unroll the loops, which would save you maybe 8 statements of execution, but make the whole routine about 4-5 times longer, which may mess with your instruction cache hit ratio.
Given P(0 to 2), C(0 to 5) and copying to H(0 to 6)
with C() already sorted (ascending):
If P(0) > P(1) Then
    // Swap:
    T = P(0)
    P(0) = P(1)
    P(1) = T
    // 1stmt + (3stmt * 50%) = 2.5stmt
End
P(2) = 53 : C(5) = 53  \\ Note these are end-of-list flags
k = 0  \\ P() index
j = 0  \\ H() index
i = 0  \\ C() index
// 4 stmt
Do While j < 7
    If P(k) < C(i) Then
        H(j) = P(k)
        k = k+1
    Else
        H(j) = C(i)
        i = i+1
    End If
    j = j+1
    // 5stmt * 7loops = 35stmt
Loop
And note that this is faster than the other algorithm that would be "fastest" if you had to truly sort all 7 cards: use a 52-bit mask, set the bit for each of the 7 cards, and then scan the mask in order looking for the 7 bits that are set. That takes 60-120 statements at best (but is still faster than any other sorting approach).
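The sentinel-based merge described above, transliterated into a compilable sketch (my own translation; 53 serves as the end-of-list flag since valid card values stop at 51):

```cpp
#include <algorithm>
#include <cassert>

// Merge the 2 pocket cards P and the 5 already-sorted table cards C into H.
// P[2] and C[5] are sentinel slots that get set to 53 (past any card value).
static void merge_7(int P[3], int C[6], int H[7]) {
    if (P[0] > P[1]) std::swap(P[0], P[1]);  // sort the two pocket cards
    P[2] = 53;                               // end-of-list flags
    C[5] = 53;
    int k = 0, i = 0;
    for (int j = 0; j < 7; ++j) {
        if (P[k] < C[i]) H[j] = P[k++];      // sentinels guarantee in-bounds reads
        else             H[j] = C[i++];
    }
}
```

For example, pocket {9, 4} merged with table {2, 5, 7, 30, 51} yields {2, 4, 5, 7, 9, 30, 51}; the sentinels mean the loop never needs a separate "list exhausted" branch.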
For seven numbers, the most efficient algorithm that exists with regard to the number of comparisons is Ford-Johnson's (merge-insertion sort). In fact, Wikipedia references a paper, easily found on Google, that claims Ford-Johnson's is the best for up to 47 numbers. Unfortunately, references to Ford-Johnson's aren't all that easy to find, and the algorithm uses some complex data structures.
It appears on The Art Of Computer Programming, Volume 3, by Donald Knuth, if you have access to that book.
There's a paper which describes FJ and a more memory efficient version here.
At any rate, because of the memory overhead of that algorithm, I doubt it would be worth your while for integers, as the cost of comparing two integers is rather cheap compared to the cost of allocating memory and manipulating pointers.
Now, you mentioned that 5 cards are already sorted, and you just need to insert two. You can do this with insertion sort most efficiently like this:
Order the two cards so that P1 > P2
Insert P1 going from the high end to the low end
(list) Insert P2 going from after P1 to the low end
(array) Insert P2 going from the low end to the high end
How you do that will depend on the data structure. With an array you'll be swapping each element, so place P1 at 1st, P2 and 7th (ordered high to low), and then swap P1 up, and then P2 down. With a list, you just need to fix the pointers as appropriate.
However, once more, because of the particularity of your code, it really is best if you follow nikie's suggestion and just generate the for loops appropriately for every variation in which P1 and P2 can appear in the list.
For example, sort P1 and P2 so that P1 < P2. Let's make Po1 and Po2 the position from 0 to 6, of P1 and P2 on the list. Then do this:
Loop Po1 from 0 to 5
Loop Po2 from Po1 + 1 to 6
If (Po2 == 1) C1start := P2 + 1; C1end := 51 - 4
If (Po1 == 0 && Po2 == 2) C1start := P1+1; C1end := P2 - 1
If (Po1 == 0 && Po2 > 2) C1start := P1+1; C1end := 51 - 5
If (Po1 > 0) C1start := 0; C1end := 51 - 6
for C1 := C1start to C1end
// Repeat logic to compute C2start and C2end
// C2 can begin at C1+1, P1+1 or P2+1
// C2 can finish at P1-1, P2-1, 51 - 3, 51 - 4 or 51 -5
etc
You then call a function passing Po1, Po2, P1, P2, C1, C2, C3, C4, C5, and have this function return all possible permutations based on Po1 and Po2 (that's 36 combinations).
Personally, I think that's the fastest you can get. You completely avoid having to order anything, because the data will be pre-ordered. You incur in some comparisons anyway to compute the starts and ends, but their cost is minimized as most of them will be on the outermost loops, so they won't be repeated much. And they can even be more optimized at the cost of more code duplication.
For 7 elements, there are only a few options. You can easily write a generator that produces a method to sort all possible combinations of 7 elements. Something like this method for 3 elements:
if a[1] < a[2] {
    if a[2] < a[3] {
        // nothing to do, a[1] < a[2] < a[3]
    } else {
        if a[1] < a[3] {
            // correct order should be a[1], a[3], a[2]
            swap a[2], a[3]
        } else {
            // correct order should be a[3], a[1], a[2]
            swap a[2], a[3]
            swap a[1], a[2]
        }
    }
} else {
    // here we know that a[1] >= a[2]
    ...
}
Of course method for 7 elements will be bigger, but it's not that hard to generate.
The code below is close to optimal. It could be made better by composing a list to be traversed while making the tree, but I'm out of time right now. Cheers!
object Sort7 {
  def left(i: Int) = i * 4
  def right(i: Int) = i * 4 + 1
  def up(i: Int) = i * 4 + 2
  def value(i: Int) = i * 4 + 3
  val a = new Array[Int](7 * 4)

  def reset = {
    0 until 7 foreach {
      i => {
        a(left(i)) = -1
        a(right(i)) = -1
        a(up(i)) = -1
        a(value(i)) = scala.util.Random.nextInt(52)
      }
    }
  }

  def sortN(i: Int) {
    var index = 0
    def getNext = if (a(value(i)) < a(value(index))) left(index) else right(index)
    var next = getNext
    while (a(next) != -1) {
      index = a(next)
      next = getNext
    }
    a(next) = i
    a(up(i)) = index
  }

  def sort = 1 until 7 foreach (sortN(_))

  def print {
    def traverse(i: Int): Unit = {
      if (i != -1) {
        traverse(a(left(i)))
        println(a(value(i)))
        traverse(a(right(i)))
      }
    }
    traverse(0)
  }
}
In pseudo code:
int64 temp = 0;
int index, bit_position;
for index := 0 to 6 do
temp |= 1 << cards[index];
for index := 0 to 6 do
begin
bit_position = find_first_set(temp);
temp &= ~(1 << bit_position);
cards[index] = bit_position;
end;
It's an application of bucket sort, which should generally be faster than any of the comparison sorts that were suggested.
Note: The second part could also be implemented by iterating over bits in linear time, but in practice it may not be faster:
index = 0;
for bit_position := 0 to 51 do
begin
if (temp & (1 << bit_position)) > 0 then
begin
cards[index] = bit_position;
index++;
end;
end;
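That pseudo code maps almost directly onto a compiler intrinsic for find_first_set; a sketch (using GCC/Clang's __builtin_ctzll as the find-first-set primitive, which is an assumption about your toolchain):

```cpp
#include <cassert>
#include <cstdint>

// Bucket sort 7 distinct card values (0..51) via a 64-bit bitset:
// set one bit per card, then read the bits back lowest-first.
static void sort7_bitset(int cards[7]) {
    uint64_t bits = 0;
    for (int i = 0; i < 7; ++i) bits |= 1ULL << cards[i];
    for (int i = 0; i < 7; ++i) {
        int pos = __builtin_ctzll(bits);  // index of lowest set bit
        bits &= bits - 1;                 // clear that bit
        cards[i] = pos;
    }
}
```

Distinctness matters: a duplicate card would set the same bit twice and silently vanish, which is exactly why the "no duplicates, ever" guarantee in the question makes this approach viable.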
Assuming that you need an array of cards at the end of it.
Map the original cards to bits in a 64 bit integer ( or any integer with >= 52 bits ).
If during the initial mapping the array is sorted, don't change it.
Partition the integer into nibbles - each will correspond to values 0x0 to 0xf.
Use the nibbles as indices to corresponding sorted sub-arrays. You'll need 13 sets of 16 sub-arrays ( or just 16 sub-arrays and use a second indirection, or do the bit ops rather than looking the answer up; what is faster will vary by platform ).
Concatenate the non-empty sub-arrays into the final array.
You could use larger groups than nibbles if you want; bytes would give 7 sets of 256 arrays and make it more likely that the non-empty arrays require concatenating.
This assumes that branches are expensive and cached array accesses cheap.
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
// for general case of 7 from 52, rather than assuming last 5 sorted
uint32_t card_masks[16][5] = {
{ 0, 0, 0, 0, 0 },
{ 1, 0, 0, 0, 0 },
{ 2, 0, 0, 0, 0 },
{ 1, 2, 0, 0, 0 },
{ 3, 0, 0, 0, 0 },
{ 1, 3, 0, 0, 0 },
{ 2, 3, 0, 0, 0 },
{ 1, 2, 3, 0, 0 },
{ 4, 0, 0, 0, 0 },
{ 1, 4, 0, 0, 0 },
{ 2, 4, 0, 0, 0 },
{ 1, 2, 4, 0, 0 },
{ 3, 4, 0, 0, 0 },
{ 1, 3, 4, 0, 0 },
{ 2, 3, 4, 0, 0 },
{ 1, 2, 3, 4, 0 },
};
void sort7 ( uint32_t* cards) {
uint64_t bitset = ( ( 1LL << cards[ 0 ] ) | ( 1LL << cards[ 1 ] ) | ( 1LL << cards[ 2 ] ) | ( 1LL << cards[ 3 ] ) | ( 1LL << cards[ 4 ] ) | ( 1LL << cards[ 5 ] ) | ( 1LL << cards[ 6 ] ) ) >> 1;
uint32_t* p = cards;
uint32_t base = 0;
do {
uint32_t* card_mask = card_masks[ bitset & 0xf ];
// you might remove this test somehow, as well as unrolling the outer loop
// having separate arrays for each nibble would save 7 additions and the increment of base
while ( *card_mask )
*(p++) = base + *(card_mask++);
bitset >>= 4;
base += 4;
} while ( bitset );
}
void print_cards ( uint32_t* cards ) {
    printf ( "[ %d %d %d %d %d %d %d ]\n", cards[0], cards[1], cards[2], cards[3], cards[4], cards[5], cards[6] );
}
int main ( void ) {
    uint32_t cards[7] = { 3, 9, 23, 17, 2, 42, 52 };
    print_cards ( cards );
    sort7 ( cards );
    print_cards ( cards );
    return 0;
}
Use a sorting network, like in this C++ code:
template<class T>
inline void sort7(T data) {
#define SORT2(x,y) {if(data##x>data##y)std::swap(data##x,data##y);}
//DD = Define Data, create a local copy of the data to aid the optimizer.
#define DD1(a) register auto data##a=*(data+a);
#define DD2(a,b) register auto data##a=*(data+a);register auto data##b=*(data+b);
//CB = Copy Back
#define CB1(a) *(data+a)=data##a;
#define CB2(a,b) *(data+a)=data##a;*(data+b)=data##b;
DD2(1,2) SORT2(1,2)
DD2(3,4) SORT2(3,4)
DD2(5,6) SORT2(5,6)
DD1(0) SORT2(0,2)
SORT2(3,5)
SORT2(4,6)
SORT2(0,1)
SORT2(4,5)
SORT2(2,6) CB1(6)
SORT2(0,4)
SORT2(1,5)
SORT2(0,3) CB1(0)
SORT2(2,5) CB1(5)
SORT2(1,3) CB1(1)
SORT2(2,4) CB1(4)
SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}
Use the function above if you want to pass it an iterator or a pointer, and use the function below if you want to pass it the seven arguments one by one. BTW, using templates allows compilers to generate really optimized code, so don't get rid of the template<> unless you want C code (or some other language's code).
template<class T>
inline void sort7(T& e0, T& e1, T& e2, T& e3, T& e4, T& e5, T& e6) {
#define SORT2(x,y) {if(data##x>data##y)std::swap(data##x,data##y);}
#define DD1(a) register auto data##a=e##a;
#define DD2(a,b) register auto data##a=e##a;register auto data##b=e##b;
#define CB1(a) e##a=data##a;
#define CB2(a,b) e##a=data##a;e##b=data##b;
DD2(1,2) SORT2(1,2)
DD2(3,4) SORT2(3,4)
DD2(5,6) SORT2(5,6)
DD1(0) SORT2(0,2)
SORT2(3,5)
SORT2(4,6)
SORT2(0,1)
SORT2(4,5)
SORT2(2,6) CB1(6)
SORT2(0,4)
SORT2(1,5)
SORT2(0,3) CB1(0)
SORT2(2,5) CB1(5)
SORT2(1,3) CB1(1)
SORT2(2,4) CB1(4)
SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}
Take a look at this:
http://en.wikipedia.org/wiki/Sorting_algorithm
You would need to pick one that will have a stable worst case cost...
Another option could be to keep the array sorted the whole time, so adding a card keeps the array sorted automatically; that way you could skip the sorting step entirely...
What JRL is referring to is a bucket sort. Since you have a finite discrete set of possible values, you can declare 52 buckets and just drop each element into a bucket in O(1) time. Hence bucket sort is O(n). Without the guarantee of a finite number of distinct elements, the fastest theoretical comparison sort is O(n log n), which things like merge sort and quicksort achieve. It's just a balance of best and worst case scenarios then.
But long answer short, use bucket sort.
If you like the above mentioned suggestion to keep a 52 element array which always keeps your array sorted, then may be you could keep another list of 7 elements which would reference the 7 valid elements in the 52 element array. This way we can even avoid parsing the 52 element array.
I guess for this to be really efficient, we would need a linked-list type of structure which supports the operations InsertAtPosition() and DeleteAtPosition() and is efficient at them.
There are a lot of loops in the answers. Given his speed requirement and the tiny size of the data set, I would not do ANY loops.
I have not tried it, but I suspect the best answer is a fully unrolled bubble sort. It would also probably gain a fair amount of advantage from being done in assembly.
I wonder if this is the right approach, though. How are you going to analyze a 7 card hand?? I think you're going to end up converting it to some other representation for analysis anyway. Would not a 4x13 array be a more useful representation? (And it would render the sorting issue moot, anyway.)
Considering that the last 5 elements are always sorted, you only need to insert the first two. Run the passes back to front (i = 1, then i = 0) so that placing the first new card cannot disturb the one already inserted:
for i := 1 downto 0 do begin
    j := i;
    x := array[j];
    while (j + 1 <= 6) and (array[j+1] < x) do begin
        array[j] := array[j+1];
        inc(j);
    end;
    array[j] := x;
end;
Bubble sort is your friend. Other sorts have too much overhead and are not suitable for small numbers of elements.
Cheers
Here is your basic O(n) table sort (really O(n + 52), since the read-out pass always scans 52 entries). I'm not sure how it compares to the others. It uses unrolled loops.
char card[7]; // the original table of 7 numbers in range 0..51
char table[52]; // workspace
// clear the workspace
memset(table, 0, sizeof(table));
// set the 7 bits corresponding to the 7 cards
table[card[0]] = 1;
table[card[1]] = 1;
...
table[card[6]] = 1;
// read the cards back out
int j = 0;
if (table[0]) card[j++] = 0;
if (table[1]) card[j++] = 1;
...
if (table[51]) card[j++] = 51;
If you are looking for a very low overhead, optimal sort, you should create a sorting network. You can generate the code for a 7 integer network using the Bose-Nelson algorithm.
This would guarantee a fixed number of compares and an equal number of swaps in the worst case.
The generated code is ugly, but it is optimal.
Your data is in a sorted array, and I'll assume you swap the two new cards if needed so they are also sorted, so
a. if you want to keep it in place then use a form of insertion sort;
b. if you want to have it the result in another array do a merging by copying.
With such small numbers, a binary chop is overkill, and a ternary chop is appropriate anyway:
One new card will most likely split the five into two and three, viz. 2+3 or 3+2;
two cards split it into singles and pairs, e.g. 2+1+2.
So the most time- and space-efficient approach to placing the smaller new card is to compare with a[1] (viz. skip a[0]), then search left or right to find the card it should displace, then swap and move right (shifting rather than bubbling), comparing with the larger new card until you find where it goes. After this you'll be shifting forward by twos (two cards have been inserted).
The variables holding the new cards (and swaps) should be registers.
The look up approach would be faster but use more memory.
