I am looking for a way to accelerate a program that performs a lot of matrix multiplications, so I have replaced the CLAPACK f2c libraries with the MKL. Unfortunately, the performance results were not the expected ones.
After investigation, I traced the problem to a block triangular matrix that gives bad performance, principally when I try to multiply it by its transpose.
In order to simplify the problem, I ran my tests with an identity matrix of size 5000 (which shows the same behavior):
NAME                                             | Matrix [Size,Size] | CLAPACK f2c (s) | MKL_GNU_THREAD (s)
Multiplication of an identity matrix by itself   | 5000               | 0.076536        | 1.090167
Multiplication of dense matrix by its transpose  | 5000*5000          | 93.71569        | 1.113872
We can see that the CLAPACK f2c multiplication of an identity matrix is about 14x faster than the MKL one.
We can also note a speedup of about 84x for the MKL over CLAPACK f2c on the dense matrix multiplication.
Moreover, for the MKL, the difference in time consumption between the multiplication of dense*denseT and of an identity matrix is very slim.
So I tried to find where in the CLAPACK f2c DGEMM the optimization for the multiplication of a sparse matrix lies, and I found a condition on null values:
/* Form C := alpha*A*B + beta*C. */
i__1 = *n;
for (j = 1; j <= i__1; ++j) {
    if (*beta == 0.) {
        i__2 = *m;
        for (i__ = 1; i__ <= i__2; ++i__) {
            c__[i__ + j * c_dim1] = 0.;
/* L50: */
        }
    } else if (*beta != 1.) {
        i__2 = *m;
        for (i__ = 1; i__ <= i__2; ++i__) {
            c__[i__ + j * c_dim1] = *beta * c__[i__ + j * c_dim1];
/* L60: */
        }
    }
    i__2 = *k;
    for (l = 1; l <= i__2; ++l) {
        if (b[l + j * b_dim1] != 0.) { /* HERE THE CONDITION */
            temp = *alpha * b[l + j * b_dim1];
            i__3 = *m;
            for (i__ = 1; i__ <= i__3; ++i__) {
                c__[i__ + j * c_dim1] += temp * a[i__ + l * a_dim1];
/* L70: */
            }
        } /* END of the condition */
    }
}
When I removed this condition, I got these results:
NAME                                             | Matrix [Size,Size] | CLAPACK f2c (s) | MKL_GNU_THREAD (s)
Multiplication of an identity matrix by itself   | 5000               | 93.210873       | 1.090167
Multiplication of dense matrix by its transpose  | 5000*5000          | 93.71569        | 1.113872
Here we note that the multiplication of a dense matrix and of an identity matrix are now very close in terms of performance, and the MKL shows the best performance.
The MKL multiplication seems to be faster than CLAPACK f2c only when both actually process the same number of non-null elements.
I have two hypotheses about these results:

The zero optimization is not activated by default in the MKL.
The MKL cannot see the 0 (double) values inside my sparse matrices.
Can you tell me why the MKL shows these performance issues?
Do you have any tips to bypass the multiplications by null elements with dgemm?
I did a conversion to CSR and it shows better performance, but in that case, why is lapacke_dgemm worse than f2c_dgemm?
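For context, the reason the CSR conversion pays off is that a CSR product only ever touches the stored nonzeros, so zeros cost nothing by construction. A minimal C sketch of a CSR matrix-vector product (illustrative only, not MKL's actual sparse API):

#include <stddef.h>

/* y = A*x with A stored in CSR: only the stored nonzeros are visited. */
void csr_matvec(size_t n,               /* number of rows              */
                const size_t *row_ptr,  /* n+1 row offsets             */
                const size_t *col_idx,  /* column index per nonzero    */
                const double *val,      /* value per nonzero           */
                const double *x,        /* dense input vector          */
                double *y)              /* dense output vector         */
{
    for (size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}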
Thank you for your help :)
MKL_VERBOSE Intel(R) MKL 2021.0 Update 1 Product build 20201104 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 3.50GHz lp64 gnu_thread
In my shader I have a variable b and I need to determine within which range it lies and, based on that, assign the right value to variable a. I ended up with a lot of if statements:
float a = const1;
if (b >= 2.0 && b < 4.0) {
a = const2;
} else if (b >= 4.0 && b < 6.0) {
a = const3;
} else if (b >= 6.0 && b < 8.0) {
a = const4;
} else if (b >= 8.0) {
a = const5;
}
My question is: could this lead to performance issues (branching), and how can I optimize it? I've looked at the step and smoothstep functions, but I haven't figured out a good way to accomplish this.
To solve the problem depicted and avoid branching, the usual technique is to find a series of math functions, one for each condition, that evaluate to 0 for all the conditions except the one the variable satisfies. We can use these functions as gains to build a sum that evaluates to the right value each time.
In this case the conditions are simple intervals, so using the step function we could write:

x in [a,b] as step(a,x)*step(x,b) (notice the swapped arguments in the second call to get x <= b)
Or
x in [a,b[ as step(a,x)-step(b,x), as explained in this other post: GLSL point inside box test
Using this technique we obtain:
float a = (1.0-step(2.0,x))*const1 +
          (step(2.0,x)-step(4.0,x))*const2 +
          (step(4.0,x)-step(6.0,x))*const3 +
          (step(6.0,x)-step(8.0,x))*const4 +
          step(8.0,x)*const5;
This works for general disjoint intervals, but in the case of a step or staircase function, as in this question, we can simplify it to:

float a = const1 + step(2.0,x)*(const2-const1) +
                   step(4.0,x)*(const3-const2) +
                   step(6.0,x)*(const4-const3) +
                   step(8.0,x)*(const5-const4);
We could also use a 'bool conversion to float' as a means to express our conditions; for example, step(8.0,x)*(const5-const4) is equivalent to float(x>=8.0)*(const5-const4).
You can avoid branching by creating a kind of lookup table:
float table[5] = float[5](const1, const2, const3, const4, const5);
float a = table[int(clamp(b, 0.0, 8.0) / 2.0)];
But the performance will depend on whether the lookup table has to be rebuilt on every shader invocation or whether it can be some kind of uniform... As always, measure first...
It turned out Jaa-cs's answer wasn't viable for me, as I'm targeting WebGL, which doesn't allow variables as indexes (unless it's a loop index). His solution might work great for other OpenGL implementations though.
I came up with this solution using mix and step functions:
//Outside of main function:
uniform vec3 constArray[5]; // Values are sent in to shader
//Inside main function:
float a = constArray[0];
a = mix(a, constArray[1], step(2.0, b));
a = mix(a, constArray[2], step(4.0, b));
a = mix(a, constArray[3], step(6.0, b));
a = mix(a, constArray[4], step(8.0, b));
But after some testing it didn't give any visible performance boost. I finally ended up with this solution:
float a = constArray[0];
if (b >= 2.0)
a = constArray[1];
if (b >= 4.0)
a = constArray[2];
if (b >= 6.0)
a = constArray[3];
if (b >= 8.0)
a = constArray[4];
Which is both compact and easily readable. In my case both these alternatives and my original code performed equally, but at least here are some options to try out.
I'm doing some rigid-body rotation dynamics simulation, which means I have to compute many rotations by a small angle; the performance bottleneck is the evaluation of the trigonometric functions. Currently I do it with a Taylor (Maclaurin) series:
class double2 {
    double x, y;

    // Intrinsic full sin/cos
    final void rotate(double a) {
        double x_ = x;
        double ca = Math.cos(a); double sa = Math.sin(a);
        x = ca*x_ - sa*y; y = sa*x_ + ca*y;
    }

    // Taylor 7th-order approximation
    final void rotate_d7(double a) {
        double x_ = x;
        double a2 = a*a;
        double a4 = a2*a2;
        double a6 = a4*a2;
        double ca = 1.0d - a2/2.0d   + a4/24.0d    - a6/720.0d;
        double sa = a    - a2*a/6.0d + a4*a/120.0d - a6*a/5040.0d;
        x = ca*x_ - sa*y; y = sa*x_ + ca*y;
    }
}
but the precision/speed trade-off is not as good as I would expect:

                 error (100x, dphi = Pi/100)   time [ns per rotation]
v.rotate_d1():   -0.010044860504615213           9.314306 ns/op
v.rotate_d3():    3.2624666136960023E-6         16.268745 ns/op
v.rotate_d5():   -4.600003294941146E-10         35.433617 ns/op
v.rotate_d7():    3.416711358283919E-14         49.831547 ns/op
v.rotate():       3.469446951953614E-16         75.70213  ns/op
Is there any faster method to evaluate approximations of sin() and cos() for small angles (like < Pi/100)?
I was thinking maybe some rational series, or a continued fraction approximation? Do you know any? (A precomputed table doesn't make sense here.)
You might find that adjusting your calculations can improve performance. E.g.:
final double c7 = -1/5040d;
final double c5 = 1/120d;
final double c3 = -1/6d;
double a2 = a * a;
double sa = (((c7 * a2 + c5) * a2 + c3) * a2 + 1) * a;
// similarly for cos
Now the optimiser might be doing some of this itself anyway, so your mileage may vary. Would be interested to know the results either way.
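To spell out the "similarly for cos" part, here is a C sketch of the same Horner scheme (it ports one-for-one to Java; the coefficients are the Maclaurin terms from the question):

/* Horner evaluation of the 7th/6th-order Maclaurin polynomials from the
 * question: three multiply-adds each, after computing a2 = a*a once. */
static double sin_taylor7(double a)
{
    const double c7 = -1.0/5040.0, c5 = 1.0/120.0, c3 = -1.0/6.0;
    double a2 = a * a;
    return (((c7 * a2 + c5) * a2 + c3) * a2 + 1.0) * a;
}

static double cos_taylor6(double a)
{
    const double c6 = -1.0/720.0, c4 = 1.0/24.0, c2 = -1.0/2.0;
    double a2 = a * a;
    return ((c6 * a2 + c4) * a2 + c2) * a2 + 1.0;
}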
Instead of optimizing the trig functions, see if you can do without them. Rigid-body simulations tend to be a perfectly natural fit for vector math.
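For instance, since dphi is a fixed small step here, sin and cos only need to be evaluated once per step size; a hedged C sketch of that idea (hypothetical names, with occasional renormalization to curb floating-point drift):

#include <math.h>

typedef struct { double x, y; } vec2;

/* Advance a vector by a fixed small rotation many times without calling
 * sin/cos inside the loop: the step's (cos, sin) pair is computed once. */
static void rotate_many(vec2 *v, double dphi, long steps)
{
    const double c = cos(dphi), s = sin(dphi); /* evaluated once */
    for (long i = 0; i < steps; ++i) {
        double x = v->x;
        v->x = c*x - s*v->y;
        v->y = s*x + c*v->y;
        if ((i & 1023) == 0) {                 /* renormalize now and then */
            double r = 1.0 / sqrt(v->x*v->x + v->y*v->y);
            v->x *= r; v->y *= r;
        }
    }
}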
Two ways: first, reduce the precision if possible (as often in video games, use the minimal acceptable precision if you aim for performance).
Then you should try to use tabulated values. Once per execution (when the game loads?), compute an array of sine/cosine values that you then access in constant time:

float cosAlpha = COSINUS[(int)(k*alpha)]; // e.g.: k = 1000

Tune k and the array size to trade angle resolution against memory footprint.
edit: Don't forget to use the parity of the cosine/sine functions to avoid duplicate values in the table.
edit2: Try floats instead of doubles. The difference will be insignificant for the player, and the performance impact may be interesting. Test it!
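A minimal C sketch of this table approach (K and the covered range are illustrative values to tune, not recommendations):

#include <math.h>

#define K         1000   /* resolution: 1/K radian per entry        */
#define N_ENTRIES 6284   /* ceil(2*pi*K), covers alpha in [0, 2*pi) */

static float COSINUS[N_ENTRIES];

/* fill the table once, e.g. while the game loads */
static void init_cos_table(void)
{
    for (int i = 0; i < N_ENTRIES; ++i)
        COSINUS[i] = (float)cos((double)i / K);
}

/* constant-time lookup; alpha must lie in [0, 2*pi) */
static float cos_lookup(float alpha)
{
    return COSINUS[(int)(K * alpha)];
}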
Can you add some inline assembler? Targeting the i386 'fsincos' instruction is probably the fastest method:
Vector2 unit_vector ( Angle angle ) {
Vector2 r;
//now the normal processor detection
//and various platform specific versions
# if defined (__i386__) && !defined (NO_ASM)
# if defined __GNUC__
# define ASM_SINCOS
asm ("fsincos" : "=t" (r.x), "=u" (r.y) : "0" (angle.radians()));
# elif defined _MSC_VER
# define ASM_SINCOS
double a = angle.radians();
__asm fld a
__asm fsincos
__asm fstp r.x
__asm fstp r.y
# endif
# endif
}
from here.
This has the added bonus of calculating both sin and cos in a single call.
EDIT : it's Java.
Are your rotations suitably self-contained that you can offload thousands at a time over JNI? Otherwise this hardware-specific approach is no good.
For small x (x < 0.2 in radians) you can safely assume sin(x) = x.
The maximum deviation is 0.0013 (the error is approximately x^3/6, which at x = 0.2 gives about 0.00133).
I am currently reading "Linux Kernel Development" by Robert Love, and I have a few questions about the CFS.
My question is how calc_delta_mine calculates:

delta_exec_weighted = (delta_exec * weight) / lw->weight

I guess it is done in two steps:

calculating the (delta_exec * 1024) part:
if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
    tmp = (u64)delta_exec * scale_load_down(weight);
else
    tmp = (u64)delta_exec;
calculating the division by lw->weight (or the multiplication by lw->inv_weight):
if (!lw->inv_weight) {
    unsigned long w = scale_load_down(lw->weight);

    if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
        lw->inv_weight = 1;
    else if (unlikely(!w))
        lw->inv_weight = WMULT_CONST;
    else
        lw->inv_weight = WMULT_CONST / w;
}

/*
 * Check whether we'd overflow the 64-bit multiplication:
 */
if (unlikely(tmp > WMULT_CONST))
    tmp = SRR(SRR(tmp, WMULT_SHIFT/2) * lw->inv_weight,
              WMULT_SHIFT/2);
else
    tmp = SRR(tmp * lw->inv_weight, WMULT_SHIFT);

return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
The SRR (shift right and round) macro is defined as:
#define SRR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))
And the other macros are defined as:
#if BITS_PER_LONG == 32
# define WMULT_CONST (~0UL)
#else
# define WMULT_CONST (1UL << 32)
#endif
#define WMULT_SHIFT 32
Can someone please explain how exactly SRR works, and how this code checks for 64-bit multiplication overflow?
Also, please explain the definitions of the macros used in this function ((~0UL), (1UL << 32)).
The code you posted is basically doing calculations using 32.32 fixed-point arithmetic, where a single 64-bit quantity holds the integer part of the number in the high 32 bits and the fractional part in the low 32 bits (so, for example, 1.5 is 0x0000000180000000 in this system). WMULT_CONST is thus an approximation of 1.0 (using a value that can fit in a long for platform efficiency considerations), and so dividing WMULT_CONST by w computes 1/w as a 32.32 value.
Note that multiplying two 32.32 values together as integers produces a result that is 2^32 times too large; thus, WMULT_SHIFT (= 32) is the right-shift value needed to normalize the result of multiplying two 32.32 values together back down to 32.32.
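As an illustration, here is a tiny standalone C demo (not kernel code) of multiplying 1.5 by 0.5 in this 32.32 representation:

#include <stdint.h>
#include <stdio.h>

#define WMULT_SHIFT 32

int main(void)
{
    uint64_t a = 0x0000000180000000ULL;  /* 1.5 in 32.32 fixed point */
    uint64_t b = 0x0000000080000000ULL;  /* 0.5 in 32.32 fixed point */

    /* the raw integer product carries an extra factor of 2^32; shifting
       right by WMULT_SHIFT brings it back to 32.32 (here a*b still fits
       in 64 bits; larger operands need the overflow path from the
       question's code) */
    uint64_t c = (a * b) >> WMULT_SHIFT;

    printf("0x%016llx\n", (unsigned long long)c); /* 0x00000000c0000000 == 0.75 */
    return 0;
}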
The necessity of using this improved precision for scheduling purposes is explained in a comment in sched/sched.h:
/*
* Increase resolution of nice-level calculations for 64-bit architectures.
* The extra resolution improves shares distribution and load balancing of
* low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
* hierarchies, especially on larger systems. This is not a user-visible change
* and does not change the user-interface for setting shares/weights.
*
* We increase resolution only if we have enough bits to allow this increased
* resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
* when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
* increased costs.
*/
As for SRR, mathematically it computes the rounded result of x / 2^y.
To round the result of a division x/q you can calculate (x + q/2) floor-divided by q; this is what SRR does by calculating (x + 2^(y-1)) floor-divided by 2^y.
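Putting both pieces together, a small self-contained C demo (with made-up weights, not kernel code) of the multiply-by-inverse-then-SRR path:

#include <stdint.h>
#include <stdio.h>

#define WMULT_SHIFT 32
#define SRR(x, y) (((x) + (1ULL << ((y) - 1))) >> (y))

int main(void)
{
    uint64_t delta_exec = 6000000;  /* 6 ms, in ns (made-up value)     */
    uint64_t weight     = 1024;     /* nice-0 task weight              */
    uint64_t lw_weight  = 3072;     /* runqueue load (made-up value)   */

    /* 1/lw_weight as a 32.32 fixed-point value, like lw->inv_weight */
    uint64_t inv_weight = ((uint64_t)1 << WMULT_SHIFT) / lw_weight;

    uint64_t tmp = delta_exec * weight;
    uint64_t res = SRR(tmp * inv_weight, WMULT_SHIFT);

    /* the exact answer is 6000000 * 1024 / 3072 = 2000000; SRR's rounding
       recovers it despite the truncated inv_weight */
    printf("%llu\n", (unsigned long long)res);  /* prints 2000000 */
    return 0;
}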
I am trying to create random lines and select some of them, which are really rare. My code is rather simple, but to get something usable I need to create very large vectors (i.e. up to 100000000 x 1, the tracks variable in my code). Is there any way to create larger vectors and to reduce the time needed for all those calculations?
My code is
%Initial line values
tracks = input('Give me the number of muon tracks: ');
width  = 1e-4;
height = 2e-4;
Ystart = 15.*ones(tracks,1);
Xstart = -40 + 80.*rand(tracks,1);
%Xend  = -40 + 80.*rand(tracks,1);
Xend   = laprnd(tracks,1,Xstart,15);
X = [Xstart'; Xend'];
Y = [Ystart'; zeros(1,tracks)];
b = (Ystart.*Xend)./(Xend-Xstart);
hot  = 0;
cold = 0;
for i = 1:tracks
    if ((Xend(i,1) < width/2 && Xend(i,1) > -width/2) || (b(i,1) < height && b(i,1) > 0))
        plot(X(:,i), Y(:,i), 'r'); %the chosen ones!
        hold all
        hot = hot + 1;
    else
        %plot(X(:,i), Y(:,i), 'b'); %the rest of them
        %hold all
        cold = cold + 1;
    end
end
I am also using and calling a Laplace distribution generator made by Elvis Chen, which can be found here:
function y = laprnd(m, n, mu, sigma)
%LAPRND generate i.i.d. laplacian random number drawn from laplacian distribution
% with mean mu and standard deviation sigma.
% mu : mean
% sigma : standard deviation
% [m, n] : the dimension of y.
% Default mu = 0, sigma = 1.
% For more information, refer to
% http://en.wikipedia.org/wiki/Laplace_distribution
% Author : Elvis Chen (bee33#sjtu.edu.cn)
% Date : 01/19/07
% Check inputs
if nargin < 2
    error('At least two inputs are required');
end
if nargin == 2
    mu = 0; sigma = 1;
end
if nargin == 3
    sigma = 1;
end

% Generate Laplacian noise
u = rand(m, n) - 0.5;
b = sigma / sqrt(2);
y = mu - b * sign(u).*log(1 - 2*abs(u));
The result plot is: [plot image omitted]
As you indicate, your problem is two-fold. On the one hand, you have memory issues because you need to do so many trials. On the other hand, you have performance issues, because you have to process all those trials.
Solutions to each issue often have a negative impact on the other issue. IMHO, the best approach would be to find a compromise.
More trials are only possible if you get rid of those gargantuan arrays that are required for vectorization, and use a different strategy for the loop. I will give priority to the possibility of using more trials, possibly at the cost of optimal performance.
When I execute your code as-is in the Matlab profiler, it immediately shows that the initial memory allocation for all your variables takes a lot of time. It also shows that the plot and hold all commands are the most time-consuming lines of them all. Some more trial-and-error shows that there is a disappointingly low maximum value for the trials you can do before OUT OF MEMORY errors start appearing.
The loop can be accelerated tremendously if you know a few things about its limitations in Matlab. In older versions of Matlab, it used to be true that loops should be avoided completely in favor of 'vectorized' code. In recent versions (I believe R2008a and up), the Mathworks introduced a piece of technology called the JIT accelerator (Just-in-Time compiler) which translates M-code into machine language on the fly during execution. Simply put, the JIT accelerator allows your code to bypass Matlab's interpreter and talk much more directly with the underlying hardware, which can save a lot of time.
The advice you'll often hear, that loops should be avoided in Matlab, is no longer generally true. While vectorization still has its value, any procedure of sizable complexity that is implemented using only vectorized code is often illegible, hard to understand, hard to change and hard to maintain. An implementation of the same procedure that uses loops often has none of these drawbacks, and moreover, it will quite often be faster and require less memory.
Unfortunately, the JIT accelerator has a few nasty (and IMHO, unnecessary) limitations that you'll have to learn about.
One such thing is plot; it's generally a better idea to let a loop do nothing other than collect and manipulate data, and delay any plotting commands etc. until after the loop.
Another such thing is hold; the hold function is not a Matlab built-in function, meaning, it is implemented in M-language. Matlab's JIT accelerator is not able to accelerate non-builtin functions when used in a loop, meaning, your entire loop will run at Matlab's interpretation speed, rather than machine-language speed! Therefore, also delay this command until after the loop :)
Now, in case you're wondering, this last step can make a HUGE difference -- I know of one case where copy-pasting a function body into the upper-level loop caused a 1200x performance improvement; days of execution time were reduced to minutes.
There is actually another minor issue in your loop (a really small and rather inconvenient one, I'll immediately agree) -- the loop variable should not be named i. The name i is the name of the imaginary unit in Matlab, and its name resolution will unnecessarily consume time on each iteration. It's small, but non-negligible.
Now, considering all this, I've come to the following implementation:
function [hot, cold, h] = MuonTracks(tracks)
    % NOTE: no variables larger than 1x1 are initialized
    width  = 1e-4;
    height = 2e-4;
    % constant used for the Laplacian noise distribution
    bL = 15 / sqrt(2);
    % Loop through all tracks
    X   = [];
    hot = 0;
    ii  = 0;
    while ii < tracks
        ii = ii + 1;
        % Note that I've inlined (== copy-pasted) the original laprnd()
        % function call. This was necessary to work around limitations
        % in loops in Matlab, and to prevent the necessity of those HUGE
        % variables.
        %
        % Of course, you can still easily generalize all of this:
        % the new data
        u = rand - 0.5;
        Ystart = 15;
        Xstart = 800*rand - 400;
        Xend   = Xstart - bL*sign(u)*log(1 - 2*abs(u));
        b = (Ystart*Xend)/(Xend - Xstart);
        % the test
        if ((b < height && b > 0)) || ...
           (Xend < width/2 && Xend > -width/2)
            hot = hot + 1;
            % growing an array is perfectly fine when the chances of it
            % happening are so slim
            X = [X [Xstart; Xend]]; %#ok
        end
    end
    % This is trivial to do here, and prevents an 'else' in the loop
    cold = tracks - hot;
    % Now plot the chosen ones
    h = figure;
    hold all
    Y = repmat([15;0], 1, size(X,2));
    plot(X, Y, 'r');
end
With this implementation, I can do this:
>> tic, MuonTracks(1e8); toc
Elapsed time is 24.738725 seconds.
with a completely negligible memory footprint.
The profiler now also shows a nice and even distribution of effort along the code; no lines that really stand out because of their memory use or performance.
It's possibly not the fastest possible implementation (if anyone sees obvious improvements, please, feel free to edit them in). But, if you're willing to wait, you'll be able to do MuonTracks(1e23) (or higher :)
I've also done an implementation in C, which can be compiled into a Matlab MEX file:
/* DoMuonCounting.c */
#include <math.h>
#include <matrix.h>
#include <mex.h>
#include <time.h>
#include <stdlib.h>
void CountMuons(
unsigned long long tracks,
unsigned long long *hot, unsigned long long *cold, double *Xout);
/* simple little helper functions */
double sign(double x) { return (x>0)-(x<0); }
double rand_double() { return (double)rand()/(double)RAND_MAX; }
/* the gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
mwSize
    dims[] = {1,1};
const mxArray
/* Output arguments */
*hot_out = plhs[0] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
*cold_out = plhs[1] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
*X_out = plhs[2] = mxCreateDoubleMatrix(2,10000, mxREAL);
const unsigned long long
tracks = (const unsigned long long)mxGetPr(prhs[0])[0];
unsigned long long
*hot = (unsigned long long*)mxGetPr(hot_out),
*cold = (unsigned long long*)mxGetPr(cold_out);
double
*Xout = mxGetPr(X_out);
/* call the actual function, and return */
CountMuons(tracks, hot,cold, Xout);
}
// The actual muon counting
void CountMuons(
unsigned long long tracks,
unsigned long long *hot, unsigned long long *cold, double *Xout)
{
const double
width = 1.0e-4,
height = 2.0e-4,
bL = 15.0/sqrt(2.0),
Ystart = 15.0;
double
Xstart,
Xend,
u,
b;
unsigned long long
i = 0ul;
*hot = 0ul;
*cold = tracks;
/* seed the RNG */
srand((unsigned)time(NULL));
/* aaaand start! */
while (i++ < tracks)
{
u = rand_double() - 0.5;
Xstart = 800.0*rand_double() - 400.0;
Xend = Xstart - bL*sign(u)*log(1.0-2.0*fabs(u));
b = (Ystart*Xend)/(Xend-Xstart);
if ((b < height && b > 0.0) || (Xend < width/2.0 && Xend > -width/2.0))
{
Xout[0 + *hot*2] = Xstart;
Xout[1 + *hot*2] = Xend;
++(*hot);
--(*cold);
}
}
}
compile in Matlab with
mex DoMuonCounting.c
(after having run mex -setup :) and then use it in conjunction with a small M-wrapper like this:
function [hot,cold, h] = MuonTrack2(tracks)
% call the MEX function
[hot,cold, Xtmp] = DoMuonCounting(tracks);
% process outputs, and generate plots
hot = uint32(hot); % circumvents limitations in 32-bit matlab
X = Xtmp(:,1:hot);
clear Xtmp
h = NaN;
if ~isempty(X)
h = figure;
hold all
Y = repmat([15;0], 1, hot);
plot(X, Y, 'r');
end
end
which allows me to do
>> tic, MuonTrack2(1e8); toc
Elapsed time is 14.496355 seconds.
Note that the memory footprint of the MEX version is slightly larger, but I think that's nothing to worry about.
The only flaw I see is the fixed maximum number of muon counts (hard-coded as 10000, the initial array size of Xout; needed because there are no dynamically growing arrays in standard C). If you're worried this limit could be exceeded, simply increase it, change it to be equal to a fraction of tracks, or use some smarter (but more painful) dynamic array-growing tricks, as sketched below.
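For the record, a hedged sketch of one such array-growing trick (capacity doubling with realloc; not part of the MEX file above):

#include <stdlib.h>

/* Grow-on-demand storage for (Xstart, Xend) pairs: double the capacity
 * whenever the buffer is full. Amortized cost is O(1) per appended pair. */
static double *append_track(double *buf, size_t *capacity, size_t used,
                            double Xstart, double Xend)
{
    if (used == *capacity) {
        size_t newcap = *capacity ? 2 * *capacity : 64;
        double *tmp = realloc(buf, newcap * 2 * sizeof *buf); /* 2 doubles per pair */
        if (tmp == NULL)
            return NULL;   /* caller must handle allocation failure */
        buf = tmp;
        *capacity = newcap;
    }
    buf[2*used]     = Xstart;
    buf[2*used + 1] = Xend;
    return buf;
}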
In Matlab, it is sometimes faster to vectorize rather than use a for loop. For example, this expression:
(Xend(i,1) < width/2 && Xend(i,1) > -width/2) || (b(i,1) < height && b(i,1) > 0)
which is defined for each value of i, can be rewritten in a vectorised manner like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0)
Expressions like Xend(:,1) will give you a column vector, so Xend(:,1) < width/2 will give you a column vector of boolean values. Note that I have used & rather than && -- this is because & performs an element-wise logical AND, unlike &&, which only works on scalar values. In this way you can build the entire expression, such that the variable isChosen holds a column vector of boolean values, one for each row of your Xend/b vectors.
Getting counts is now as simple as this:
hot = sum(isChosen);
since true is represented by 1. And:
cold = sum(~isChosen);
Finally, you can get the data points by using the boolean vector to select rows:
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values
hold all;
plot(X(:, ~isChosen),Y(:, ~isChosen),'b'); % Plot unchosen values
EDIT: The code should look like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1) > 0);
hot = sum(isChosen);
cold = sum(~isChosen);
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values
hold all;
plot(X(:, ~isChosen),Y(:, ~isChosen),'b'); % Plot unchosen values