I have an exponential moving average that gets called millions of times, and thus is the most expensive part of my code:
double _exponential(double price[ ], double smoothingValue, int dataSetSize)
{
int i;
double cXAvg;
cXAvg = price[ dataSetSize - 2 ] ;
for (i= dataSetSize - 2; i > -1; --i)
cXAvg += (smoothingValue * (price[ i ] - cXAvg)) ;
return ( cXAvg) ;
}
Is there a more efficient way to code this to speed things up? I have a multi-threaded app and am using Visual C++.
Thank you.
Ouch!
Sure, multithreading can help. But you can almost assuredly improve the performance on a single threaded machine.
First, you are calculating it in the wrong direction. Only the most modern machines can do negative-stride prefetching; nearly all machines are faster with unit strides. I.e. changing the direction of the scan so that you walk the array from low to high rather than high to low is almost always better.
Next, rewriting a bit - please allow me to shorten the variable names to make it easier to type:
avg = price[0]
for i = 1 to N-1
    avg = s * (price[i] - avg)
By the way, I will start using shorthands p for price and s for smoothing, to save typing. I'm lazy...
avg0 = p0
avg1 = s*(p1-p0)
avg2 = s*(p2-s*(p1-p0)) = s*(p2-s*(p1-avg0))
avg3 = s*(p3-s*(p2-s*(p1-p0))) = s*p3 - s*s*p2 + s*s*avg1
and, in general
avg[i] = s*p[i] - s*s*p[i-1] + s*s*avg[i-2]
Precalculating s*s, you might do
avg[i] = s*p[i] - s*s*(p[i-1] - avg[i-2])
but it is probably faster to do
avg[i] = (s*p[i] - s*s*p[i-1]) + s*s*avg[i-2]
The latency between avg[i] and avg[i-2] is then 1 multiply and an add, rather than a subtract and a multiply between avg[i] and avg[i-1]. I.e. more than twice as fast.
In general, you want to rewrite the recurrence so that avg[i] is calculated in terms of avg[j]
for j as far back as you can possibly go, without filling up the machine, either execution units or registers.
You are basically doing more multiplies overall, in order to get fewer chains of multiples (and subtracts) on the critical path.
Skipping from avg[i-2] to avg[i] is easy; you can probably go three or four back as well. Exactly how far depends on what your machine is, and how many registers you have.
And on the latency of the floating point adder and multiplier. Or, better yet, on the flavour of combined multiply-add instruction you have - all modern machines have them. E.g. if the MADD or MSUB is 7 cycles long, you can do up to 6 other calculations in its shadow, even if you have only a single, fully pipelined, floating point unit. Less if it can only issue every other cycle, as is common for double precision on older chips and GPUs. The assembly code should be software pipelined so that different loop iterations overlap. A good compiler should do that for you, but you might have to rewrite the C code to get the best performance.
By the way: I do NOT mean to suggest that you should be creating an array of avg[]. Instead, you would need two averages if avg[i] is calculated in terms of avg[i-2], and so on.
You can use an array of avg[i] if you want, but I think that you only need to have 2 or 4 avgs, called, creatively, avg0 and avg1 (2, 3...), and "rotate" them.
avg0 = p0
avg1 = s*(p1-p0)
/* avg2 reuses avg0 */ avg0 = s*(p2 - s*(p1 - avg0))
/* avg3 reuses avg1 */ avg1 = s*p3 - s*s*p2 + s*s*avg1
for i from 4 to N-1 by 2 do
    avg0 = s*p[i]   - s*s*p[i-1] + s*s*avg0
    avg1 = s*p[i+1] - s*s*p[i]   + s*s*avg1
This sort of trick, splitting an accumulator or average into two or more,
combining multiple stages of the recurrence, is common in high performance code.
Oh, yes: precalculate s*s, etc.
If I have done it right, in infinite precision this would be identical. (Double check me, please.)
However, in finite precision FP your results may differ, hopefully only slightly, because of different roundings. If the unrolling is correct and the answers are significantly different, you probably have a numerically unstable algorithm. You're the one who would know.
Note: floating point rounding errors will change the low bits of your answer.
Both because of rearranging the code, and using MADD.
I think that is probably okay, but you have to decide.
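To make this concrete, here is a minimal C++ sketch (my own code, not the poster's) of the same two-way unrolling applied to the recurrence actually used in the question, avg += s*(p[i] - avg), scanning low to high as suggested above. The function name, the peeled odd iteration, and the precomputed coefficients are my choices, and the rounding will not be bit-identical to the original function:
// avg = s*p[i] + (1-s)*avg, so unrolled once:
//   avg[i] = s*p[i] + (1-s)*s*p[i-1] + (1-s)*(1-s)*avg[i-2]
double ema_unrolled2(const double* p, double s, int n)
{
    if (n <= 0) return 0.0;
    double avg = p[0];
    const double c1 = s;                       // coefficient of p[i]
    const double c2 = (1.0 - s) * s;           // coefficient of p[i-1]
    const double c3 = (1.0 - s) * (1.0 - s);   // coefficient of avg[i-2]
    int i = 1;
    if (((n - 1) & 1) != 0) {                  // peel one step so only pairs remain
        avg += s * (p[i] - avg);
        ++i;
    }
    for (; i + 1 < n; i += 2) {
        // one trip of the loop advances the average across p[i] and p[i+1]
        avg = c1 * p[i + 1] + c2 * p[i] + c3 * avg;
    }
    return avg;
}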
Note: the calculations for avg[i] and avg[i-1] are now independent. So you can use a SIMD
instruction set, like Intel SSE2, which permits operation on two 64 bit values in a 128 bit wide register at a time.
That'll be good for almost 2X, on a machine that has enough ALUs.
If you have enough registers to rewrite avg[i] in terms of avg[i-4]
(and I am sure you do on iA64), then you can go 4X wide,
if you have access to a machine like 256 bit AVX.
On a GPU... you can go for deeper recurrences, rewriting avg[i] in terms of avg[i-8], and so on.
Some GPUs have instructions that calculate AX+B or even AX+BY as a single instruction.
Although that's more common for 32 bit than for 64 bit precision.
At some point I would probably start asking: do you want to do this on multiple prices at a time?
Not only does this help you with multithreading, it also makes the problem well suited to running on a GPU, and to using wide SIMD.
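To sketch the "multiple prices" idea: below is a minimal SSE2 example (my own, purely illustrative) that advances the EMAs of two independent price series in lockstep, one series in each half of a 128-bit register. The interleaved layout, names, and bounds are assumptions:
#include <emmintrin.h>   // SSE2 intrinsics

// Series A sits in the even slots of prices2, series B in the odd slots.
void ema_two_series(const double* prices2, int n, double s,
                    double* outA, double* outB)
{
    __m128d vs  = _mm_set1_pd(s);
    __m128d avg = _mm_loadu_pd(prices2);                  // { A[0], B[0] }
    for (int i = 1; i < n; ++i) {
        __m128d p    = _mm_loadu_pd(prices2 + 2 * i);     // { A[i], B[i] }
        __m128d diff = _mm_sub_pd(p, avg);                // p - avg
        avg = _mm_add_pd(avg, _mm_mul_pd(vs, diff));      // avg += s*(p - avg)
    }
    double tmp[2];
    _mm_storeu_pd(tmp, avg);
    *outA = tmp[0];
    *outB = tmp[1];
}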
Minor Late Addition
I am a bit embarrassed not to have applied Horner's Rule to expressions like
avg1 = s*p3 - s*s*p2 + s*s*avg1
giving
avg1 = s*(p3 - s*(p2 - avg1))
which is slightly more efficient, with slightly different results due to rounding.
In my defence, any decent compiler should do this for you.
But Horner's rule makes the dependency chain deeper in terms of multiplies. You might need to unroll and software pipeline the loop a few more times.
Or you can do
avg1 = s*p3 - s2*(p2 - avg1)
where you precalculate
s2 = s*s
Here is a very efficient way, albeit in C#; you would need to port it over to C++, which should be very simple. It calculates the EMA and the slope on the fly, incrementally, as each data point arrives.
public class ExponentialMovingAverageIndicator
{
private bool _isInitialized;
private readonly int _lookback;
private readonly double _weightingMultiplier;
private double _previousAverage;
public double Average { get; private set; }
public double Slope { get; private set; }
public ExponentialMovingAverageIndicator(int lookback)
{
_lookback = lookback;
_weightingMultiplier = 2.0/(lookback + 1);
}
public void AddDataPoint(double dataPoint)
{
if (!_isInitialized)
{
Average = dataPoint;
Slope = 0;
_previousAverage = Average;
_isInitialized = true;
return;
}
Average = ((dataPoint - _previousAverage)*_weightingMultiplier) + _previousAverage;
Slope = Average - _previousAverage;
//update previous average
_previousAverage = Average;
}
}
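Since the answer above suggests the port to C++ is simple, here is a rough C++ translation of that C# class; member names mirror the original, and it should be treated as a sketch rather than a verified port:
class ExponentialMovingAverageIndicator {
public:
    explicit ExponentialMovingAverageIndicator(int lookback)
        : _weightingMultiplier(2.0 / (lookback + 1)) {}

    void AddDataPoint(double dataPoint) {
        if (!_isInitialized) {
            _average = dataPoint;
            _slope = 0.0;
            _previousAverage = _average;
            _isInitialized = true;
            return;
        }
        _average = (dataPoint - _previousAverage) * _weightingMultiplier + _previousAverage;
        _slope = _average - _previousAverage;
        _previousAverage = _average;   // update previous average
    }

    double Average() const { return _average; }
    double Slope() const { return _slope; }

private:
    double _weightingMultiplier;
    bool _isInitialized = false;
    double _previousAverage = 0.0;
    double _average = 0.0;
    double _slope = 0.0;
};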
Related
I'm trying to understand if there is a difference in speed when executing the following lines of code in a computer program:
1. myarray[1] = 5; return myarray[1];
2. myarray[0] = 5; return myarray[0];
3. x = 5; return x;
4. x = 5; y = x; return y;
5. return 5;
From what I understand, arrays are basically pointers (variables that store the memory addresses of other variables). Therefore (1) and (2) should be the same speed, but slower than (3), (4) and (5).
(5) should be the fastest, (3) should be slower than (5) because there is an equal sign, and (4) should be slower than (3) because there are two equal signs that need to be handled.
Would this be right?
You don't give any context for what myarray, x and y are. Without that context, the question cannot be answered in any meaningful way. If the extra assignments have no observable side effects, they may simply be optimised away.
Basically, looking at speed optimisation at this elementary level is completely pointless. If you want to look at speed, you need code that is substantial enough that the execution time can be measured. You cannot measure the time of one or two simple statements on a modern processor.
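For example, a measurement along the lines suggested above would time a loop that repeats the statement many millions of times rather than a single statement. A rough C++ sketch; the iteration count and the volatile sink are arbitrary illustrative choices:
#include <chrono>
#include <cstdio>

int main() {
    volatile int sink = 0;   // volatile so the work is not removed entirely
    int myarray[2] = {0, 0};
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) {
        myarray[1] = 5;
        sink = myarray[1];
    }
    auto t1 = std::chrono::steady_clock::now();
    long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("total: %lld ns\n", ns);
    return 0;
}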
Let's have a matrix A, say A = magic(100);. I have seen 2 ways of computing the sum of all elements of matrix A.
sumOfA = sum(sum(A));
Or
sumOfA = sum(A(:));
Is one of them faster (or better practice) than the other? If so, which one is it? Or are they both equally fast?
It seems that you can't make up your mind about whether performance or floating point accuracy is more important.
If floating point accuracy were of paramount importance, then you would segregate the positive and negative elements, sorting each segment, then sum in order of increasing absolute value. Yeah, I know, it's more work than anyone would do, and it probably will be a waste of time.
Instead, use adequate precision such that any errors made will be irrelevant. Use good numerical practices about tests, etc, such that there are no problems generated.
As far as the time goes, for an NxM array,
sum(A(:)) will require N*M-1 additions.
sum(sum(A)) will require (N-1)*M + M-1 = N*M-1 additions.
Either method requires the same number of adds, so for a large array, even if the interpreter is not smart enough to recognize that they are both the same op, who cares?
It is simply not an issue. Don't make a mountain out of a mole hill to worry about this.
Edit: in response to Amro's comment about the errors for one method over the other, there is little you can control. The additions will be done in a different order, but there is no assurance about which sequence will be better.
A = randn(1000);
format long g
The two solutions are quite close. In fact, compared to eps, the difference is barely significant.
sum(A(:))
ans =
945.760668102446
sum(sum(A))
ans =
945.760668102449
sum(sum(A)) - sum(A(:))
ans =
2.72848410531878e-12
eps(sum(A(:)))
ans =
1.13686837721616e-13
Suppose you choose the segregate and sort trick I mentioned. See that the negative and positive parts will be large enough that there will be a loss of precision.
sum(sort(A(A<0),'descend'))
ans =
-398276.24754782
sum(sort(A(A<0),'descend')) + sum(sort(A(A>=0),'ascend'))
ans =
945.7606681037
So you really would need to accumulate the pieces in a higher precision array anyway. We might try this:
[~,tags] = sort(abs(A(:)));
sum(A(tags))
ans =
945.760668102446
An interesting problem arises even in these tests. Will there be an issue because the tests are done on a random (normal) array? Essentially, we can view sum(A(:)) as a random walk, a drunkard's walk. But consider sum(sum(A)). Each element of sum(A) (i.e., the internal sum) is itself a sum of 1000 normal deviates. Look at a few of them:
sum(A)
ans =
Columns 1 through 6
-32.6319600960983 36.8984589766173 38.2749084367497 27.3297721091922 30.5600109446534 -59.039228262402
Columns 7 through 12
3.82231962760523 4.11017616179294 -68.1497901792032 35.4196443983385 7.05786623564426 -27.1215387236418
When we add them up, there will be a loss of precision. So potentially, the operation as sum(A(:)) might be slightly more accurate. Is it so? What if we use a higher precision for the accumulation? So first, I'll form the sum down the columns using doubles, then convert to 25 digits of decimal precision, and sum the rows. (I've displayed only 20 digits here, leaving 5 digits hidden as guard digits.)
sum(hpf(sum(A)))
ans =
945.76066810244807408
Or, instead, convert immediately to 25 digits of precision, then sum the result.
sum(hpf(A(:)))
945.76066810244749807
So both forms in double precision were equally wrong here, in opposite directions. In the end, this is all moot, since any of the alternatives I've shown are far more time consuming compared to the simple variations sum(A(:)) or sum(sum(A)). Just pick one of them and don't worry.
Performance-wise, I'd say both are very similar (assuming a recent MATLAB version). Here is quick test using the TIMEIT function:
function sumTest()
M = randn(5000);
timeit( @() func1(M) )
timeit( @() func2(M) )
end
function v = func1(A)
v = sum(A(:));
end
function v = func2(A)
v = sum(sum(A));
end
the results were:
>> sumTest
ans =
0.0020917
ans =
0.0017159
What I would worry about is floating-point issues. Example:
>> M = randn(1000);
>> abs( sum(M(:)) - sum(sum(M)) )
ans =
3.9108e-11
The error magnitude increases for larger matrices.
I think a simple way to measure this is to apply the tic / toc functions at the beginning and end of your code.
tic
A = randn(5000);
format long g
sum(A(:));
toc
But since randn produces random elements, the calculation time can differ from one run to the next. It is better to use a fixed matrix with a large number of elements when comparing calculation times.
How can I find the cube root of a number in an efficient way?
I think Newton-Raphson method can be used, but I don't know how to guess the initial solution programmatically to minimize the number of iterations.
This is a deceptively complex question. Here is a nice survey of some possible approaches.
In view of the "link rot" that overtook the Accepted Answer, I'll give a more self-contained answer focusing on the topic of quickly obtaining an initial guess suitable for superlinear iteration.
The "survey" by metamerist (Wayback link) provided some timing comparisons for various starting value/iteration combinations (both Newton and Halley methods are included). Its references are to works by W. Kahan, "Computing a Real Cube Root", and by K. Turkowski, "Computing the Cube Root".
metamerist updates the DEC-VAX-era bit-fiddling technique of W. Kahan with this snippet, which "assumes 32-bit integers" and relies on IEEE 754 format for doubles "to generate initial estimates with 5 bits of precision":
inline double cbrt_5d(double d)
{
const unsigned int B1 = 715094163;
double t = 0.0;
unsigned int* pt = (unsigned int*) &t;
unsigned int* px = (unsigned int*) &d;
pt[1]=px[1]/3+B1;
return t;
}
The code by K. Turkowski provides slightly more precision ("approximately 6 bits") by a conventional powers-of-two scaling on float fr, followed by a quadratic approximation to its cube root over interval [0.125,1.0):
/* Compute seed with a quadratic approximation */
fr = (-0.46946116F * fr + 1.072302F) * fr + 0.3812513F;  /* 0.125 <= fr < 1 */
and a subsequent restoration of the exponent of two (adjusted to one-third). The exponent/mantissa extraction and restoration make use of math library calls to frexp and ldexp.
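For illustration, here is one way (my own reconstruction, not Turkowski's actual code) to combine the frexp/ldexp scaling, the quoted quadratic seed, and a few Newton steps into a complete double-precision cube root; the handling of the exponent remainder is a simplification:
#include <math.h>

/* frexp gives ax = fr * 2^e with fr in [0.5, 1); fr is halved until e is a
   multiple of 3 (so fr lands in [0.125, 1)), the quadratic seed is applied,
   ldexp restores 2^(e/3), and Newton steps refine the result. */
double cbrt_seeded(double x)
{
    if (x == 0.0) return 0.0;
    int neg = x < 0.0;
    double ax = neg ? -x : x;

    int e;
    double fr = frexp(ax, &e);
    while (e % 3 != 0) { fr *= 0.5; ++e; }   /* keeps fr * 2^e constant */

    /* quadratic seed for fr^(1/3), coefficients as quoted above */
    double r = (-0.46946116 * fr + 1.072302) * fr + 0.3812513;
    r = ldexp(r, e / 3);

    /* Newton: r <- (2*r + ax/(r*r)) / 3; each step roughly doubles the
       number of correct bits, so add one more for full double precision */
    r = (2.0 * r + ax / (r * r)) / 3.0;
    r = (2.0 * r + ax / (r * r)) / 3.0;
    r = (2.0 * r + ax / (r * r)) / 3.0;

    return neg ? -r : r;
}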
Comparison with other cube root "seed" approximations
To appreciate those cube root approximations we need to compare them with other possible forms. First the criteria for judging: we consider the approximation on the interval [1/8,1], and we use best (minimizing the maximum) relative error.
That is, if f(x) is a proposed approximation to x^{1/3}, we find its relative error:
error_rel = max | f(x)/x^(1/3) - 1 | on [1/8,1]
The simplest approximation would of course be to use a single constant on the interval, and the best relative error in that case is achieved by picking f_0(x) = sqrt(2)/2, the geometric mean of the values at the endpoints. This gives 1.27 bits of relative accuracy, a quick but dirty starting point for a Newton iteration.
A better approximation would be the best first-degree polynomial:
f_1(x) = 0.6042181313*x + 0.4531635984
This gives 4.12 bits of relative accuracy, a big improvement but short of the 5-6 bits of relative accuracy promised by the respective methods of Kahan and Turkowski. But it's in the ballpark and uses only one multiplication (and one addition).
Finally, what if we allow ourselves a division instead of a multiplication? It turns out that with one division and two "additions" we can have the best linear-fractional function:
f_M(x) = 1.4774329094 - 0.8414323527/(x+0.7387320679)
which gives 7.265 bits of relative accuracy.
At a glance this seems like an attractive approach, but an old rule of thumb was to treat the cost of a FP division like three FP multiplications (and to mostly ignore the additions and subtractions). However with current FPU designs this is not realistic. While the relative cost of multiplications to adds/subtracts has come down, in most cases to a factor of two or even equality, the cost of division has not fallen but often gone up to 7-10 times the cost of multiplication. Therefore we must be miserly with our division operations.
static double cubeRoot(double num) {
    double x = num;
    if (num >= 0) {
        // Newton's method for x^3 = num: x <- (2*x^3 + num) / (3*x^2)
        for (int i = 0; i < 10; i++) {
            x = ((2 * x * x * x) + num) / (3 * x * x);
        }
    }
    return x;
}
It seems like the optimization question has already been addressed, but I'd like to add an improvement to the cubeRoot() function posted here, for other people stumbling on this page looking for a quick cube root algorithm.
The existing algorithm works well, but outside the range of 0-100 it gives incorrect results.
Here's a revised version that works with numbers between -/+1 quadrillion (1E15). If you need to work with larger numbers, just use more iterations.
static double cubeRoot( double num ){
    boolean neg = ( num < 0 );
    double x = Math.abs( num );
    // iterate Newton's method on |num|; the sign is restored at the end
    for( int i = 0, iterations = 60; i < iterations; i++ ){
        x = ( ( 2 * x * x * x ) + Math.abs( num ) ) / ( 3 * x * x );
    }
    if( neg ){ return 0 - x; }
    return x;
}
Regarding optimization, I'm guessing the original poster was asking how to predict the minimum number of iterations for an accurate result, given an arbitrary input size. But it seems like for most general cases the gain from optimization isn't worth the added complexity. Even with the function above, 100 iterations takes less than 0.2 ms on average consumer hardware. If speed was of utmost importance, I'd consider using pre-computed lookup tables. But this is coming from a desktop developer, not an embedded systems engineer.
I have large arrays which I am doing fairly simple linear algebra on. I have achieved good speed-ups by vectorising the operations, but I want to know how MATLAB treats the subarrays.
I pre-allocate arrays because they are used in various operations, and my formulae are long and require many different arrays and subarrays so readability counts while I am still coding.
For example, a simple case:
Array = someBig2DArray;
soln = Array;
[len_x len_y] = size(Array);
A1 = Array(2:len_x-1, 2:len_y-1);
A2 = Array(1:len_x-2, 2:len_y-1);
A3 = Array(3:len_x, 2:len_y-1);
soln(2:len_x-1, 2:len_y-1) = (A1 - 2*A2 + A3)/2
The down side of using this method is that I have 3 extra arrays of basically the same size taking up memory.
Alternatively:
soln(2:len_x-1, 2:len_y-1) = (Array(3:len_x, 2:leny-1) - 2*Array(2:len_x-1, 2:len_y-1) + Array(1:len_x-2, 2:len_y-1))/2
Does this second method use less memory, while sacrificing readability? Or does it create 'temporary' arrays and end up using roughly the same amount of memory, but only briefly? (I am nearing the limit of my system...)
Are these methods the same speed internally, in terms of big-O and the number of operations?
Are there any ways of reducing the memory requirements of the first method while keeping readability?
Here is a snippet from my code (actually 1D in this case). My initial thinking was that vectorising the for loop and using MATLAB's usually very good matrix functions would speed things up. As it turns out, the first method is significantly faster. Why?
for i = 3:len-1
dSdx = sigma(i) - sigma(i-1) + ...
0.25*(sigma(i+1) - sigma(i) - sigma(i-1) + sigma(i-2));
ddSdx2 = sigma(i+1) - 2*sigma(i) + sigma(i-1);
sigma_new(i) = sigma(i) + dT*(kappa*ddSdx2/h^2 - U*dSdx/h);
end
%This section replaces the for loop above
dSdx = sigma(3:len-1) - sigma(2:len-2) + 0.25*(sigma(4:len) ...
- sigma(3:len-1) - sigma(2:len-2) + sigma(1:len-3));
ddSdx2 = sigma(4:len) - 2*sigma(3:len-1) + sigma(2:len-2);
sigma_new(3:len-1) = sigma(3:len-1) + dT*(kappa*ddSdx2/h^2 - U*dSdx/h);
You can use tic and toc to time your methods. Trying out your two approaches (there are some typos and assignment mismatches that needed to be fixed) shows that the second approach is ~3.5 times slower than the first.
I would like to generate a random permutation as fast as possible.
The problem: the Knuth shuffle, which is O(n), involves generating n random numbers, and generating random numbers is quite expensive.
I would like to find an O(n) algorithm involving a fixed O(1) amount of random numbers.
I realize that this question has been asked before, but I did not see any relevant answers.
Just to stress a point: I am not looking for anything less than O(n), just an algorithm involving less generation of random numbers.
Thanks
Create a 1-1 mapping of each permutation to a number from 1 to n! (n factorial). Generate a random number in 1 to n!, use the mapping, get the permutation.
For the mapping, perhaps this will be useful: http://en.wikipedia.org/wiki/Permutation#Numbering_permutations
Of course, this would get out of hand quickly, as n! can become really large soon.
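For small n, the mapping can be done with the factorial number system (the Lehmer-code idea behind the "numbering permutations" link). A rough C++ sketch, with names of my own choosing; it only works while n! fits in 64 bits (n <= 20):
#include <cstdint>
#include <vector>

// Decode an index in [0, n!) into the corresponding permutation of 0..n-1.
std::vector<int> nth_permutation(int n, std::uint64_t index)
{
    std::vector<int> elems, result;
    for (int i = 0; i < n; ++i) elems.push_back(i);

    std::vector<std::uint64_t> fact(n > 0 ? n : 1, 1);
    for (int i = 1; i < n; ++i) fact[i] = fact[i - 1] * i;

    for (int i = n - 1; i >= 0; --i) {
        std::size_t d = static_cast<std::size_t>(index / fact[i]);  // next factoradic digit
        index %= fact[i];
        result.push_back(elems[d]);
        elems.erase(elems.begin() + static_cast<std::ptrdiff_t>(d)); // remove chosen element
    }
    return result;
}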
Generating a random number takes a long time, you say? The implementation of Java's Random.nextInt is roughly
oldseed = seed;
nextseed = (oldseed * multiplier + addend) & mask;
return (int)(nextseed >>> (48 - bits));
Is that too much work to do for each element?
See https://doi.org/10.1145/3009909 for a careful analysis of the number of random bits required to generate a random permutation. (It's open-access, but it's not easy reading! Bottom line: if carefully implemented, all of the usual methods for generating random permutations are efficient in their use of random bits.)
And... if your goal is to generate a random permutation rapidly for large N, I'd suggest you try the MergeShuffle algorithm. An article published in 2015 claimed a factor-of-two speedup over Fisher-Yates in both parallel and sequential implementations, and a significant speedup in sequential computations over the other standard algorithm they tested (Rao-Sandelius).
An implementation of MergeShuffle (and of the usual Fisher-Yates and Rao-Sandelius algorithms) is available at https://github.com/axel-bacher/mergeshuffle. But caveat emptor! The authors are theoreticians, not software engineers. They have published their experimental code to github but aren't maintaining it. Someday, I imagine someone (perhaps you!) will add MergeShuffle to GSL. At present gsl_ran_shuffle() is an implementation of Fisher-Yates, see https://www.gnu.org/software/gsl/doc/html/randist.html?highlight=gsl_ran_shuffle.
Not exactly what you asked, but if the provided random number generator doesn't satisfy you, maybe you should try something different. Generally, pseudorandom number generation can be very simple.
Probably the best-known algorithm is the linear congruential generator:
http://en.wikipedia.org/wiki/Linear_congruential_generator
More:
http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators
As other answers suggest, you can make a random integer in the range 0 to N! and use it to produce a shuffle. Although theoretically correct, this won't be faster in general since N! grows fast and you'll spend all your time doing bigint arithmetic.
If you want speed and you don't mind trading off some randomness, you will be much better off using a less good random number generator. A linear congruential generator (see http://en.wikipedia.org/wiki/Linear_congruential_generator) will give you a random number in a few cycles.
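For reference, one step of a 64-bit LCG is just a multiply and an add. A minimal sketch; the constants are Knuth's MMIX parameters, and the struct itself is purely illustrative:
#include <cstdint>

struct Lcg64 {
    std::uint64_t state;
    explicit Lcg64(std::uint64_t seed) : state(seed) {}
    std::uint32_t next() {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        return static_cast<std::uint32_t>(state >> 32);   // the high bits are the better ones
    }
};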
Usually there is no need for the full range of the next random value, so to use exactly the same amount of randomness you can use the following approach (which is almost like drawing from random(0, N!), I guess):
// ...
m = 1; // range of random buffer (single variant)
r = 0; // random buffer (number zero)
// ...
for(/* ... */) {
while (m < n) { // range of our buffer is too narrow for "n"
r = r*RAND_MAX + random(); // add another random to our random-buffer
m *= RAND_MAX; // update range of random-buffer
}
x = r % n; // pull-out next random with range "n"
r /= n; // remove it from random-buffer
m /= n; // fix range of random-buffer
// ...
}
P.S. Of course there will be some bias related to dividing by values that are not powers of 2, but it will be distributed among the resulting samples.
Generate N numbers (N < the number of random numbers you need) before doing the computation, or store them in an array as data, using your slow but good random generator; then pick numbers by simply incrementing an index into the array inside your computing loop; if you need different seeds, create multiple tables.
Are you sure that your mathematical and algorithmic approach to the problem is correct?
I hit exactly the same problem, where the Fisher-Yates shuffle became the bottleneck in corner cases. But for me the real problem is a brute force algorithm that doesn't scale well to all problems. The following story explains the problem and the optimizations that I have come up with so far.
Dealing cards for 4 players
The number of possible deals is a 96 bit number. That puts quite a bit of stress on the random number generator to avoid statistical anomalies when selecting the play plan from the generated sample set of deals. I chose to use 2x mt19937_64 seeded from /dev/random because of the long period and the heavy advertisement on the web that it is good for scientific simulations.
The simple approach is to use a Fisher-Yates shuffle to generate deals and filter out deals that don't match the already collected information. The Knuth shuffle takes ~1400 CPU cycles per deal, mostly because I have to generate 51 random numbers and swap entries in the table 51 times.
That doesn't matter for normal cases, where I would only need to generate 10000-100000 deals in 7 minutes. But there are extreme cases where the filters may select only a very small subset of hands, requiring a huge number of deals to be generated.
Using single number for multiple cards
When profiling with callgrind (valgrind) I noticed that the main slowdown was the C++ random number generator (after switching away from std::uniform_int_distribution, which was the first bottleneck).
Then I came up with the idea of using a single random number for multiple cards. The idea is to use the least significant information from the number first and then erase that information.
int number = uniform_rng(0, 52*51*50*49);
int card1 = number % 52;
number /= 52;
int card2 = number % 51;
number /= 51;
......
Of course that is only a minor optimization, because generation is still O(N).
Generation using bit permutations
The next idea was exactly the solution asked for here, but I still ended up with O(N), and with a larger cost than the original shuffle. But let's look into the solution and why it fails so miserably.
I decided to use the idea from Dealing All the Deals by John Christman.
void Deal::generate()
{
// 52:26 split, 52!/(26!)**2 = 495,918,532,948,104
max = 495918532948104LU;
partner = uniform_rng(eng1, max);
// 2x 26:13 splits, (26!/(13!*13!))**2 = 10,400,600**2
max = 10400600LU*10400600LU;
hands = uniform_rng(eng2, max);
// Create 104 bit presentation of deal (2 bits per card)
select_deal(id, partner, hands);
}
So far so good, and it looks pretty nice, but the select_deal implementation is a PITA.
void select_deal(Id &new_id, uint64_t partner, uint64_t hands)
{
unsigned idx;
unsigned e, n, ns = 26;
e = n = 13;
// Figure out partnership who owns which card
for (idx = CARDS_IN_SUIT*NUM_SUITS; idx > 0; ) {
uint64_t cut = ncr(idx - 1, ns);
if (partner >= cut) {
partner -= cut;
// Figure out if N or S holds the card
ns--;
cut = ncr(ns, n) * 10400600LU;
if (hands > cut) {
hands -= cut;
n--;
} else
new_id[idx%NUM_SUITS] |= 1 << (idx/NUM_SUITS);
} else
new_id[idx%NUM_SUITS + NUM_SUITS] |= 1 << (idx/NUM_SUITS);
idx--;
}
unsigned ew = 26;
// Figure out if E or W holds a card
for (idx = CARDS_IN_SUIT*NUM_SUITS; idx-- > 0; ) {
if (new_id[idx%NUM_SUITS + NUM_SUITS] & (1 << (idx/NUM_SUITS))) {
uint64_t cut = ncr(--ew, e);
if (hands >= cut) {
hands -= cut;
e--;
} else
new_id[idx%NUM_SUITS] |= 1 << (idx/NUM_SUITS);
}
}
}
Now that I had the O(N) permutation solution done to prove the algorithm could work, I started searching for an O(1) mapping from a random number to a bit permutation. Too bad it looks like the only solution would be using huge lookup tables that would kill CPU caches. That doesn't sound like a good idea for an AI that will be using a very large amount of cache for the double dummy analyzer.
Mathematical solution
After all the hard work figuring out how to generate random bit permutations, I decided to go back to the maths. It is entirely possible to apply the filters before dealing cards. That requires splitting the deals into a manageable number of layered sets and selecting between sets based on their relative probabilities after filtering out impossible sets.
I don't yet have the code ready to test how many cycles I'm wasting in the common case where the filter selects the major part of deals. But I believe this approach gives the most stable generation performance, keeping the cost to less than 0.1%.
Generate a 32 bit integer. For each index i (maybe only up to half the number of elements in the array), if bit i % 32 is 1, swap i with n - i - 1.
Of course, this might not be random enough for your purposes. You could probably improve this by not swapping with n - i - 1, but rather by another function applied to n and i that gives better distribution. You could even use two functions: one for when the bit is 0 and another for when it's 1.
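A rough C++ sketch of that idea (my own code, purely illustrative; as noted, it is a cheap scrambling pass rather than a uniform shuffle):
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One 32-bit random word drives every swap decision: bit (i % 32) decides
// whether element i is swapped with element n-i-1.
void cheap_scramble(std::vector<int>& a, std::uint32_t r)
{
    std::size_t n = a.size();
    for (std::size_t i = 0; i < n / 2; ++i) {
        if ((r >> (i % 32)) & 1u)
            std::swap(a[i], a[n - i - 1]);
    }
}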