cuda intrinsic functions sqrtf and powf performance issues - performance

when i convert from powf to __powf it gives performance improvement to me. but if i convert sqrtf to one of which __fsqrt_[rn,rz,ru,rd] it slows down. I think they should run at least as fast as sqrtf. What can be the problem?
Regards

If you need to square an integer (or float for that matter) then you can just multiply the value with itself, i.e. instead of;
y = powf(x, 2);
use:
y = x * x;
This avoids using an expensive transcendental function (along with its associated function call overhead) and just generates a single multiply instruction in most cases.
The square root probably can't be avoided but you can use fsqrtf rather than sqrtf if you only need single precision - this is typically much faster.

Related

Poor performance in matlab

So I had to write a program in Matlab to calculate the convolution of two functions, manually. I wrote this simple piece of code that I know is not that optimized probably:
syms recP(x);
recP(x) = rectangularPulse(-1,1,x);
syms triP(x);
triP(x) = triangularPulse(-1,1,x);
t = -10:0.1:10;
s1 = -10:0.1:10;
for i = 1:201
s1(i) = 0;
for j = t
s1(i) = s1(i) + ( recP(j) * triP(t(i)-j) );
end
end
plot(t,s1);
I have a core i7-7700HQ coupled with 32 GB of RAM. Matlab is stored on my HDD and my Windows is on my SSD. The problem is that this simple code is taking I think at least 20 minutes to run. I have it in a section and I don't run the whole code. Matlab is only taking 18% of my CPU and 3 GB of RAM for this task. Which is I think probably enough, I don't know. But I don't think it should take that long.
Am I doing anything wrong? I've searched for how to increase the RAM limit of Matlab, and I found that it is not limited and it takes how much it needs. I don't know if I can increase the CPU usage of it or not.
Is there any solution to how make things a little bit faster? I have like 6 or 7 of these for loops in my homework and it takes forever if I run the whole live script. Thanks in advance for your help.
(Also, it highlights the piece of code that is currently running. It is the for loop, the outer one is highlighted)
Like Ander said, use the symbolic toolbox in matlab as a last resort. Additionally, when trying to speed up matlab code, focus on taking advantage of matlab's vectorized operations. What I mean by this is matlab is very efficient at performing operations like this:
y = x.*z;
where x and z are some Nx1 vectors each and the operator '.*' is called 'dot multiplication'. This is essentially telling matlab to perform multiplication on x1*z1, x[2]*z[2] .... x[n]*z[n] and assign all the values to the corresponding value in the vector y. Additionally, many of the functions in matlab are able to accept vectors as inputs and perform their operations on each element and return an equal size vector with the output at each element. You can check this for any given function by scrolling down in its documentation to the inputs and outputs section and checking what form of array the inputs and outputs can take. For example, rectangularPulse's documentation says it can accept vectors as inputs. Therefore, you can simplify your inner loop to this:
s1(i) = s1(i) + ( rectangularPulse(-1,1,t) * triP(t(i)-t) );
So to summarize:
Avoid the symbolic toolbox in matlab until you have a better handle of what you're doing or you absolutely have to use it.
Use matlab's ability to handle vectors and arrays very well.
Deconstruct any nested loops you write one at a time from the inside out. Usually this dramatically accelerates matlab code especially when you are new to writing it.
See if you can even further simplify the code and get rid of your outer loop as well.

Tips for improving performance of a 2d image 'tracing' CUDA kernel?

Can you give me some tips to optimize this CUDA code?
I'm running this on a device with compute capability 1.3 (I need it for a Tesla C1060 although I'm testing it now on a GTX 260 which has the same compute capability) and I have several kernels like the one below. The number of threads I need to execute this kernel is given by long SUM and depends on size_t M and size_t N which are the dimensions of a rectangular image received as parameter it can vary greatly from 50x50 to 10000x10000 in pixels or more. Although I'm mostly interested in working the bigger images with Cuda.
Now each image has to be traced in all directions and angles and some computations must be done over the values extracted from the tracing. So, for example, for a 500x500 image I need 229080 threads computing that kernel below which is the value of SUM (that's why I check that the thread id idHilo doesn't go over it). I copied several arrays into the global memory of the device one after another since I need to access them for the calculations all of length SUM. Like this
cudaMemcpy(xb_cuda,xb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
cudaMemcpy(yb_cuda,yb_host,(SUM*sizeof(long)),cudaMemcpyHostToDevice);
...etc
So each value of every array can be accessed by one thread. All are done before the kernel calls. According to the Cuda Profiler on Nsight the highest memcopy duration is 246.016 us for a 500x500 image so that is not taking so long.
But the kernels like the one I copied below are taking too long for any practical use (3.25 seconds according to the Cuda profiler for the kernel below for a 500x500 image and 5.052 seconds for the kernel with the highest duration) so I need to see if I can optimize them.
I arrange the data this way
First the block dimension
dim3 dimBlock(256,1,1);
then the number of blocks per Grid
dim3 dimGrid((SUM+255)/256);
For a number of 895 blocks for a 500x500 image.
I'm not sure how to use coalescing and shared memory in my case or even if it's a good idea to call the kernel several times with different portions of the data. The data is independent one from the other so I could in theory call that kernel several times and not with the 229080 threads all at once if needs be.
Now take into account that the outer for loop
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
depends on
tendbegin_cuda[idHilo]
the value of which depends on each thread but most threads have similar values for it.
According to the Cuda Profiler the Global Store Efficiency is of 0.619 and the Global Load Efficiency is 0.951 for this kernel. Other kernels have similar values .
Is that good? bad? how can I interpret those values? Sadly the devices of compute capability 1.3 don't provide other useful info for assessing the code like the Multiprocessor and Kernel Memory or Instruction analysis. The only results I get after the analysis is "Low Global Memory Store Efficiency" and "Low Global Memory Load Efficiency" but I'm not sure how I can optimize those.
void __global__ t21_trazo(long SUM,int cT, double Bn, size_t M, size_t N, float* imagen_cuda, double* vector_trazo_cuda, long* xb_cuda, long* yb_cuda, long* xinc_cuda, long* yinc_cuda, long* tbegin_cuda, long* tendbegin_cuda){
long xi;
long yi;
int t;
int k;
int a;
int ji;
long idHilo=blockIdx.x*blockDim.x+threadIdx.x;
int neighborhood[31];
int v=0;
if(idHilo<SUM){
for(t=15;t<=tendbegin_cuda[idHilo]-15;t++){
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
yi = yb_cuda[idHilo] + floor((double)t*yinc_cuda[idHilo]);
neighborhood[v]=floor(xi/Bn);
ji=floor(yi/Bn);
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
{
if(tendbegin_cuda[idHilo]>30 && v==30){
if(t==0)
vector_trazo_cuda[20+idHilo*31]=0;
for(k=1;k<=15;k++)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
for(a=0;a<30;a++)
neighborhood[a]=neighborhood[a+1];
v=v-1;
}
v=v+1;
}
}
}
}
EDIT:
Changing the DP flops for SP flops only slightly improved the duration. Loop unrolling the inner loops practically didn't help.
Sorry for the unstructured answer, I'm just going to throw out some generally useful comments with references to your code to make this more useful to others.
Algorithm changes are always number one for optimizing. Is there another way to solve the problem that requires less math/iterations/memory etc.
If precision is not a big concern, use floating point (or half precision floating point with newer architectures). Part of the reason it didn't affect your performance much when you briefly tried is because you're still using double precision calculations on your floating point data (fabs takes double, so if you use with float, it converts your float to a double, does double math, returns a double and converts to float, use fabsf).
If you don't need to use the absolute full precision of float use fast math (compiler option).
Multiply is much faster than divide (especially for full precision/non-fast math). Calculate 1/var outside the kernel and then multiply instead of dividing inside kernel.
Don't know if it gets optimized out, but you should use increment and decrement operators. v=v-1; could be v--; etc.
Casting to an int will truncate toward zero. floor() will truncate toward negative infinite. you probably don't need explicit floor(), also, floorf() for float as above. when you use it for the intermediate computations on integer types, they're already truncated. So you're converting to double and back for no reason. Use the appropriately typed function (abs, fabs, fabsf, etc.)
if(fabs((double)neighborhood[v]) < M && fabs((double)ji)<N)
change to
if(abs(neighborhood[v]) < M && abs(ji)<N)
vector_trazo_cuda[20+idHilo*31]=vector_trazo_cuda[20+idHilo*31]+
fabs(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
change to
vector_trazo_cuda[20+idHilo*31] +=
fabsf(imagen_cuda[ji*M+(neighborhood[v-(15+k)])]-
imagen_cuda[ji*M+(neighborhood[v-(15-k)])]);
.
xi = xb_cuda[idHilo] + floor((double)t*xinc_cuda[idHilo]);
change to
xi = xb_cuda[idHilo] + t*xinc_cuda[idHilo];
The above line is needlessly complicated. In essence you are doing this,
convert t to double,
convert xinc_cuda to double and multiply,
floor it (returns double),
convert xb_cuda to double and add,
convert to long.
The new line will store the same result in much, much less time (also better because if you exceed the precision of double in the previous case, you would be rounding to a nearest power of 2). Also, those four lines should be outside the for loop...you don't need to recompute them if they don't depend on t. Together, i wouldn't be surprised if this cuts your run time by a factor of 10-30.
Your structure results in a lot of global memory reads, try to read once from global, handle calculations on local memory, and write once to global (if at all possible).
Compile with -lineinfo always. Makes profiling easier, and i haven't been able to assess any overhead whatsoever (using kernels in the 0.1 to 10ms execution time range).
Figure out with the profiler if you're compute or memory bound and devote time accordingly.
Try to allow the compiler use registers when possible, this is a big topic.
As always, don't change everything at once. I typed all this out with compiling/testing so i may have an error.
You may be running too many threads simultaneously. The optimum performance seems to come when you run the right number of threads: enough threads to keep busy, but not so many as to over-fragment the local memory available to each simultaneous thread.
Last fall I built a tutorial to investigate optimization of the Travelling Salesman problem (TSP) using CUDA with CUDAFY. The steps I went through in achieving a several-times speed-up from a published algorithm may be useful in guiding your endeavours, even though the problem domain is different. The tutorial and code is available at CUDA Tuning with CUDAFY.

How do I translate this range coding C++ snippet to performant Haskell?

I know enough Haskell to translate the code below, but I don't know much about making it perform well:
typedef unsigned long precision;
typedef unsigned char uc;
const int kSpaceForByte = sizeof(precision) * 8 - 8;
const int kHalfPrec = sizeof(precision) * 8 / 2;
const precision kTop = ((precision)1) << kSpaceForByte;
const precision kBot = ((precision)1) << kHalfPrec;
//This must be called before encoding starts
void RangeCoder::StartEncode(){
_low = 0;
_range = (precision) -1;
}
/*
RangeCoder does not concern itself with models of the data.
To encode each symbol, you pass the parameters *cumFreq*, which gives
the cumulative frequency of the possible symbols ordered before this symbol,
*freq*, which gives the frequency of this symbol. And *totFreq*, which gives
the total frequency of all symbols.
This means that you can have different frequency distributions / models for
each encoded symbol, as long as you can restore the same distribution at
this point, when restoring.
*/
void RangeCoder::Encode(precision cumFreq, precision freq, precision totFreq){
assert(cumFreq + freq <= totFreq && freq && totFreq <= kBot);
_low += cumFreq * (_range /= totFreq);
_range *= freq;
while ((_low ^ _low + _range) < kTop or
_range < kBot and ((_range= -_low & kBot - 1), 1)){
//the "a or b and (r=..,1)" idiom is a way to assign r only if a is false.
OutByte(_low >> kSpaceForByte); //output one byte.
_range <<= sizeof(uc) * 8;
_low <<= sizeof(uc) * 8;
}
}
I know, I know "Write several versions and use criterion to see what works". I don't know enough to know what my options are though, or to avoid silly mistakes.
Here are my thoughts so far. One way would be to use the State monad and/or lenses. Another would be to translate the loop and state to explicit recursion. I read somewhere that explicit recursion tends to performs badly on ghc though. I think using ByteString Builder would be a good way to output each byte. Assuming I run on a 64 bit platform, should I use unboxed Word64 arguments? The compression quality will not decrease significantly if I decrease the precision to 32 bits. Will GHC optimize better for this?
Since this is not a 1-1 mapping, pipes with StateP would lead to very neat code, where I would request arguments one at a time and then let the while-loop respond byte for byte. Unfortunately, when i benchmarked it, it seems the pipe overhead (unsurprisingly) is quite large. Since each symbol can lead to many byte outputs, it feels a bit like a concatMap with State. Perhaps this would be the idiomatic solution? Concatenating lists of bytes does not sound very fast to me, though. ByteString has a concatMap. Perhaps this is the correct way? EDIT: no it is not. It takes a ByteString as input.
I intend to release the package on Hackage when I'm done, so any advice (or actual code!) you can give will benefit the community :). I plan to use this compression as a base for writing a very memory efficient compressed map.
I read somewhere that explicit recursion tends to performs badly on ghc though.
No. GHC produce slow machine code for recursion, which couldn't be reduced (or GHC "don't want" to reduce). If recursion could be unrolled (I don't see any fundamential problems with it in your snippet), it is translated to almost the same machine code as while-loop in C or C++.
Assuming I run on a 64 bit platform, should I use unboxed Word64 arguments? The compression quality will not decrease significantly if I decrease the precision to 32 bits. Will GHC optimize better for this?
Do you mean Word#? Let GHC to deal with it, use boxed types. I've never met a situation when some profit could be achived only by using unboxed types. Using 32bit types wouldn't help on 64bit platform.
One general rule of optimizing performance for GHC is avoiding data structures where possible. If you can pass pieces of data through function arguments or closures, use the chance.

What has a better performance: multiplication or division?

Which version is faster:
x * 0.5
or
x / 2 ?
I've had a course at the university called computer systems some time ago. From back then I remember that multiplying two values can be achieved with comparably "simple" logical gates but division is not a "native" operation and requires a sum register that is in a loop increased by the divisor and compared to the dividend.
Now I have to optimise an algorithm with a lot of divisions. Unfortunately it's not just dividing by two, so binary shifting is not an option. Will it make a difference to change all divisions to multiplications ?
Update:
I have changed my code and didn't notice any difference. You're probably right about compiler optimisations. Since all the answers were great ive upvoted them all. I chose rahul's answer because of the great link.
Usually division is a lot more expensive than multiplication, but a smart compiler will often convert division by a compile-time constant to a multiplication anyway. If your compiler is not smart enough though, or if there are floating point accuracy issues, then you can always do the optimisation explicitly, e.g. change:
float x = y / 2.5f;
to:
const float k = 1.0f / 2.5f;
...
float x = y * k;
Note that this is most likely a case of premature optimisation - you should only do this kind of thing if you have profiled your code and positively identified division as being a performance bottlneck.
Division by a compile-time constant that's a power of 2 is quite fast (comparable to multiplication by a compile-time constant) for both integers and floats (it's basically convertible into a bit shift).
For floats even dynamic division by powers of two is much faster than regular (dynamic or static division) as it basically turns into a subtraction on its exponent.
In all other cases, division appears to be several times slower than multiplication.
For dynamic divisor the slowndown factor at my Intel(R) Core(TM) i5 CPU M 430 # 2.27GHz appears to be about 8, for static ones about 2.
The results are from a little benchmark of mine, which I made because I was somewhat curious about this (notice the aberrations at powers of two) :
ulong -- 64 bit unsigned
1 in the label means dynamic argument
0 in the lable means statically known argument
The results were generated from the following bash template:
#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;
int main(int argc, char** argv){
$TYPE arg = atoi(argv[1]);
$TYPE i = 0, res = 0;
for (i=0;i< $IT;i++)
res+=i $OP $ARG;
printf($FMT, res);
return 0;
}
with the $-variables assigned and the resulting program compiled with -O3 and run (dynamic values came from the command line as it's obvious from the C code).
Well if it is a single calculation you wil hardly notice any difference but if you talk about millions of transaction then definitely Division is costlier than Multiplication. You can always use whatever is the clearest and readable.
Please refer this link:- Should I use multiplication or division?
That will likely depend on your specific CPU and the types of your arguments. For instance, in your example you're doing a floating-point multiplication but an integer division. (Probably, at least, in most languages I know of that use C syntax.)
If you are doing work in assembler, you can look up the specific instructions you are using and see how long they take.
If you are not doing work in assembler, you probably don't need to care. All modern compilers with optimization will change your operations in this way to the most appropriate instructions.
Your big wins on optimization will not be from twiddling the arithmetic like this. Instead, focus on how well you are using your cache. Consider whether there are algorithm changes that might speed things up.
One note to make, if you are looking for numerical stability:
Don't recycle the divisions for solutions that require multiple components/coordinates, e.g. like implementing an n-D vector normalize() function, i.e. the following will NOT give you a unit-length vector:
V3d v3d(x,y,z);
float l = v3d.length();
float oneOverL = 1.f / l;
v3d.x *= oneOverL;
v3d.y *= oneOverL;
v3d.z *= oneOverL;
assert(1. == v3d.length()); // fails!
.. but this code will..
V3d v3d(x,y,z);
float l = v3d.length();
v3d.x /= l;
v3d.y /= l;
v3d.z /= l;
assert(1. == v3d.length()); // ok!
Guess the problem in the first code excerpt is the additional float normalization (the pre-division will impose a different scale normalization to the floating point number, which is then forced upon the actual result and introducing additional error).
Didn't look into this for too long, so please share your explanation why this happens. Tested it with x,y and z being .1f (and with doubles instead of floats)

What is the difference between the OpenCL functions length() and fast_length()?

On page three of this OpenCL reference sheet (broken link) there are two built in vector length functions with identical parameters: length() and half_length().
What is the difference between these functions? I gather from the name one is 'faster' than the other but in what circumstances? Does it sacrafice accuracy for this speed increase? If not, why would one ever use length() over fast_length()?
According to the OpenCL spec (version 1.1, page 215):
float length(floatn p): Return the length of vector p, i.e. sqrt(p.x²+p.y²+...)
float fast_length(floatn p): Return the length of vector p computed as half_sqrt(p.x²+p.y²+...)
So fast_length uses half_sqrt, while length uses sqrt. As you can guess sqrt has better guarantees on accuracy, but might be slower. More to the point:
Min Accuracy of sqrt: 3ulp (unit of least precision)
Min Accuracy of half_sqrt: 8192ulp
So half_sqrt can be about 11bits less accurate then sqrt (well actually it can be 13 bit less accurate, since there ist no requirement for sqrt not to be better then strictly necessary). Since float has a mantissa of 23bit (plus one implicit bit) half_sqrt only promises about 10bit of precision (11bit including the implicit 1). It might however be faster, if the hardware has such a function. In hardware it's not unusual to have sqrt or rsqrt instruction providing only a small number of bits (like 10-14) and using Newton-Raphson iterations after the instruction to get the necessary precision. In such a case using half_sqrt is obviously faster.

Resources