A couple of CUDA performance questions

This is the first time I've asked a question here, so thanks very much in advance, and please forgive my ignorance. I've also only just started CUDA programming.
Basically, I have a bunch of points, and I want to calculate all the pairwise distances. Currently my kernel holds on to one point, iteratively reads in all the other points (from global memory), and does the calculation. Here are my points of confusion:
I'm using a Tesla M2050 with 448 cores, but my current parallel version (kernel<<<128,16,16>>>) achieves much higher parallelism than that (it's about 600x faster than kernel<<<1,1,1>>>). Is this possibly due to multithreading, or to pipelining, or do those actually mean the same thing here?
I want to further improve the performance, so I figured I'd use shared memory to hold some of the input points for each block. But the new code is just as fast. What's the likely cause? Could it be that I launched too many threads?
Or is it because I have an if-statement in the code? The thing is, I only consider and count the short distances, so I have a statement like if (dist < 200). How much should I worry about this one?
A million thanks!
Bin
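
For reference, the naive kernel Bin describes might look roughly like this. This is a minimal sketch, assuming 2D points; the names (pts, counts, n, countClosePairs) are illustrative, and the 200-unit cutoff is the one from the question:

__global__ void countClosePairs(const float2 *pts, int *counts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per point
    if (i >= n) return;
    float2 p = pts[i];
    int count = 0;
    for (int j = 0; j < n; ++j) {                    // stream all other points from global memory
        if (j == i) continue;                        // skip the point itself
        float dx = p.x - pts[j].x;
        float dy = p.y - pts[j].y;
        if (dx * dx + dy * dy < 200.0f * 200.0f)     // compare squared distances, avoiding sqrtf
            ++count;
    }
    counts[i] = count;
}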

Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA. Two headline results from it:
- Algorithmic optimizations (changes to addressing, algorithm cascading): 11.84x speedup, combined
- Code optimizations (loop unrolling): 2.54x speedup, combined
Having an extra conditional statement does indeed cause problems, although it will be the last thing you want to optimize, if only because you need to know the layout of your code before building in those assumptions!
The problem you are working on sounds like the famous n-body problem; see Fast N-Body Simulation with CUDA.
An additional performance increase can be achieved if you can avoid the full pairwise computation, for example when elements are too far apart to have an effect on each other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the grid into boxes, have each element put itself into a box via a division, and then only evaluate pairwise relations between neighboring boxes. The work drops to O(n*m), where m is the number of elements in a neighborhood rather than in the whole set.
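As a rough illustration of the box assignment (names here are hypothetical), choosing the box side equal to the interaction cutoff means every partner within the cutoff lies in the 3x3 block of neighboring boxes:

// Assign a 2D point p to a box whose side equals the cutoff (e.g. 200).
int bx = (int)(p.x / cutoff);
int by = (int)(p.y / cutoff);
int box = by * boxesPerRow + bx;   // flattened box index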

(1) The GPU runs many more threads in parallel than it has cores because each core is pipelined. Operations take around 20 cycles on compute capability 2.0 (Fermi) architectures. So on each clock cycle, a core starts work on a new operation, returns the finished result of one operation, and moves all the other (around 18) operations one step closer to completion. So, to saturate the GPU, you might need something like 448 * 20 threads.
(2) It's probably because your values are getting cached in the L1 and L2 caches.
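Regarding (2), for what it's worth: the classic shared-memory version tiles the point list so each block cooperatively loads a chunk once and every thread reuses it from shared memory. A sketch under the same illustrative names as the kernel above (assuming blockDim.x == TILE and n a multiple of TILE):

#define TILE 256

__global__ void countClosePairsTiled(const float2 *pts, int *counts, int n)
{
    __shared__ float2 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float2 p = pts[i];
    int count = 0;
    for (int base = 0; base < n; base += TILE) {
        tile[threadIdx.x] = pts[base + threadIdx.x];   // one global load per thread per tile
        __syncthreads();
        for (int j = 0; j < TILE; ++j) {
            float dx = p.x - tile[j].x;
            float dy = p.y - tile[j].y;
            if (base + j != i && dx * dx + dy * dy < 200.0f * 200.0f)
                ++count;
        }
        __syncthreads();
    }
    counts[i] = count;
}

On Fermi the L1/L2 caches often already capture exactly this reuse, which would explain why a hand-written shared-memory version runs no faster.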
(3) It depends on how much work you're doing inside the if conditional. The GPU must run all 32 threads in a warp through all the code inside the if even if the condition is true for only a single one of those threads. If there is a lot of code in the conditional compared to the rest of your kernel, and relatively few threads go through that code path, it is likely that you will end up with low compute throughput.
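In Bin's case the body of the if is tiny (a counter increment), so the divergence penalty should be negligible; the compiler will typically turn it into a predicated instruction anyway. A branch-free way to write it makes that explicit (variable names as in the sketches above):

// The comparison evaluates to 0 or 1, so every thread in the warp
// executes exactly the same instructions, with no divergence.
count += (dx * dx + dy * dy < 200.0f * 200.0f);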

Related

CUDA Sorting Many Vectors / Arrays

I have many (200 000) vectors of integers (around 2000 elements in each vector) in GPU memory.
I am trying to parallelize an algorithm which needs to sort each vector and calculate its average, standard deviation, and skewness.
In the next step, the algorithm has to delete the maximal element and repeat the calculation of the statistical moments until some stopping criterion is met, independently for each vector.
I would like to ask someone more experienced what the best approach is to parallelize this algorithm.
Is it possible to sort more than one vector at once?
Maybe it is better not to parallelize the sorting, but to run the whole algorithm as one thread per vector?
200 000 vectors of integers ... 2000 elements in each vector ... in GPU memory.
2,000 integers sounds like something a single GPU block could tackle handily. They would fit in its shared memory (or into its register file, but that would be less useful for various reasons), so you wouldn't need to sort them in global memory. 200,000 vectors = 200,000 blocks; but you can't have 2,000 threads in a block - that's excessive (1,024 is the hardware maximum).
You might be able to use cub's block radix sort, as @talonmies suggests, but I'm not too sure that's the right thing to do. You might be able to do it with thrust, but there's also a good chance you'll end up with a lot of overhead and complex code (I may be wrong though). Give serious consideration to adapting an existing (e.g. bitonic) sort kernel, or even writing your own - although that's more challenging to get right.
Anyway, if you write your own kernel, you can code your "next step" after sorting the data.
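For illustration only, a block-per-vector sort with cub might look roughly like this. It's an untested sketch: 256 threads x 8 items per thread covers the ~2,000 elements, with INT_MAX padding pushed to the tail.

#include <climits>
#include <cub/cub.cuh>

constexpr int BLOCK_THREADS    = 256;
constexpr int ITEMS_PER_THREAD = 8;    // 256 * 8 = 2048 >= 2000

__global__ void sortVectors(int *data, int vecLen)
{
    using BlockSort = cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD>;
    __shared__ typename BlockSort::TempStorage temp;

    int *vec = data + (size_t)blockIdx.x * vecLen;      // this block's vector
    int keys[ITEMS_PER_THREAD];
    for (int k = 0; k < ITEMS_PER_THREAD; ++k) {
        int idx = threadIdx.x * ITEMS_PER_THREAD + k;
        keys[k] = (idx < vecLen) ? vec[idx] : INT_MAX;  // pad past the end
    }
    BlockSort(temp).Sort(keys);                         // block-wide radix sort
    for (int k = 0; k < ITEMS_PER_THREAD; ++k) {
        int idx = threadIdx.x * ITEMS_PER_THREAD + k;
        if (idx < vecLen) vec[idx] = keys[k];           // write back in sorted order
    }
    // ...the moment calculations and the delete-max loop can follow here...
}

// Launch: sortVectors<<<200000, BLOCK_THREADS>>>(d_data, 2000);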
Maybe it is better not to parallelize the sorting, but to run the whole algorithm as one thread per vector?
This depends on how much time your application currently spends on these sorting efforts, relative to its entire running time. See also Amdahl's Law for a more formal statement. Having said that, it should typically be worthwhile to parallelize the sorting when the data is already in GPU memory.

Parallelization of Multiple Delaunay Triangulations in Mathematica

I am trying to run a large number of "DelaunayTriangulation" routines from the Computational Geometry package in Mathematica. I have an array, "Lattice", which contains data for several thousand points in several thousand time-frames. (e.g. Lattice[[i]] indicates the ith time frame with ~10000 (x,y) coordinates).
I want to generate another large array, "Tri", with all the triangulation index data inside. For a serial calculation:
Tri=Table[DelaunayTriangulation[Lattice[[i]]],{i,imax}];
This calculation will take an exceptionally long time, so naturally, I wish to parallelize this computation:
Tri=Parallelize[Table[DelaunayTriangulation[Lattice[[i]]],{i,imax}]];
The problem lies here; usually, I would expect these individual triangulations to be divided between the 16 cores I have and run in parallel, but I don't see this. The parallelization doesn't affect anything and the computation runs as if it were on a single core.
I'm sure my use of "Parallelize" is correct, as it works with built-in Mathematica commands in other tables.
Is this an issue with the triangulation routine? Or perhaps memory (although the serial calculation uses well over 1 GB of the 32 GB of RAM I have)? Any insight into this would be useful.

Memory and time issues when dividing two matrices

I have two sparse matrices in matlab
M1 of size 9,000 x 1,800,000 and M2 of size 1,800,000 x 1,800,000.
Now I need to calculate the expression
M1/M2
and it took about an hour. Is that normal? Is there an efficient way in MATLAB to overcome this time issue? It's a lot, and since I run many iterations, each one will keep taking an hour. Any suggestions?
A quick back-of-the-envelope calculation, assuming some iterative method like conjugate gradient or the Kaczmarz method is used, and plugging in the sizes, makes me believe that an hour isn't bad.
Because of the tridiagonality of the matrix that's being "inverted" (even if no explicit inverse is formed), both of those methods are going to take a number of instructions near "some near-unity scalar factor" times ~9,000 times 1.8e6 times "the number of iterations required for convergence". The product of the two quantities in quotes is probably around 50 (minimum) to around 1,000 (maximum). I didn't cherry-pick these to make your math work; they are about what I'd expect from having used these methods. If you assume about 1e9 instructions per second (which doesn't account much for memory access, etc.), you get from around 13 minutes to around 4.5 hours.
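Spelling that arithmetic out:

9,000 x 1.8e6 ≈ 1.62e10 operations per pass
1.62e10 x 50 / 1e9 ≈ 810 s ≈ 13.5 minutes (lower end)
1.62e10 x 1,000 / 1e9 ≈ 16,200 s ≈ 4.5 hours (upper end)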
Thus, it seems in the right range for an algorithm that's exploiting sparsity.
You might be able to exploit the sparsity better yourself if you know the structure, but probably not by much.
Note, this isn't to say that 13 minutes is achievable.
Edit: One side note: I'm not sure which method MATLAB uses, but I assumed an iterative one. It's also possible that direct methods are used (like the one explained here). Direct methods can be very efficient for sparse systems if you exploit the sparsity properly, and it's very possible that MATLAB uses them by default, so it's worth investigating what MATLAB is doing in your case.
In my limited experience, iterative methods were usually preferred over direct methods as the size of the system gets large (yours is large). Our linear systems worked out to be block tridiagonal as well, as they often do in image processing.

How does SIMD behave in this case?

I am using an engine that allows SIMD code to be written, and it is fast. But there is only one block that holds all the code.
I understand that this code is run independently on each entity concurrently, but when only one thing is changing, is it still faster to calculate it for every entity regardless? Is this the idea with SIMD parallelism?
For instance:
void simdFunction()
{
    center = mesh.center();  // always the same
    vert.pos.x = center.x;   // run on each vertex
}
In this case, the center is always the same, so will it be recalculated for each vertex under SIMD? If so, is that still efficient?
Basically, does being able to run this in parallel outweigh the cost of calculating it redundantly, in the general SIMD programming sense?
this code is run independently on each entity concurrently
No, that's not how SIMD works.
With SIMD, all arithmetic units are working in lock-step, performing identical operations. There's no independence whatsoever.
Generally though, you're better off computing shared constants just once, in sequential code. That way the SIMD engine will spend less time on each slice of vertices.
The exception would be if the computation is short, the SIMD is a co-processor (like GPGPU), and the data is already in that co-processor. Then computing it using SIMD might easily beat moving data back to the sequential processor and back.
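As a sketch of that "compute once, then broadcast" pattern (the types and names here are hypothetical, standing in for whatever the engine provides):

#include <cstddef>
#include <vector>

struct Vec3   { float x, y, z; };
struct Vertex { Vec3 pos; };

// Call with center = mesh.center() computed once, sequentially, beforehand.
void recenterX(std::vector<Vertex> &verts, Vec3 center)
{
    // The loop body is now uniform work per vertex; a vectorizing
    // compiler broadcasts center.x into every SIMD lane.
    for (std::size_t i = 0; i < verts.size(); ++i)
        verts[i].pos.x = center.x;
}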

Is trigonometry computationally expensive?

I read in an article somewhere that trig calculations are generally expensive. Is this true? And if so, is that why people use trig lookup tables?
EDIT: Hmm, so if the only thing that changes is the angle (accurate to 1 degree), would a lookup table with 360 entries (one for every degree) be faster?
Expensive is a relative term.
The mathematical operations that will perform fastest are those that can be performed directly by your processor. Certainly integer add and subtract will be among them. Depending upon the processor, there may be multiplication and division as well. Sometimes the processor (or a co-processor) can handle floating point operations natively.
More complicated things (e.g. square root) require a series of these low-level calculations to be performed. These are usually accomplished using math libraries (written on top of the native operations your processor can perform).
All of this happens very very fast these days, so "expensive" depends on how much of it you need to do, and how quickly you need it to happen.
If you're writing real-time 3D rendering software, then you may need to use lots of clever math tricks and shortcuts to squeeze every bit of speed out of your environment.
If you're working on typical business applications, odds are that the mathematical calculations you're doing won't contribute significantly to the overall performance of your system.
On the Intel x86 processor, floating point addition or subtraction requires 6 clock cycles, multiplication requires 8 clock cycles, and division 30-44 clock cycles. But cosine requires between 180 and 280 clock cycles.
It's still very fast, since the x86 does these things in hardware, but it's much slower than the more basic math functions.
Since sin(), cos(), and tan() are mathematical functions calculated by summing a series, developers will sometimes use lookup tables to avoid the expensive calculation.
The tradeoff is in accuracy and memory. The greater the need for accuracy, the greater the amount of memory required for the lookup table.
Take a look at the following table accurate to 1 degree.
http://www.analyzemath.com/trigonometry/trig_1.gif
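A minimal sketch of such a table in C++ (assuming integer-degree accuracy is acceptable, as in the EDIT above; the function names are illustrative):

#include <cmath>

const double PI = 3.14159265358979323846;

static double sinTable[360];   // one entry per degree

void initSinTable()
{
    for (int deg = 0; deg < 360; ++deg)
        sinTable[deg] = std::sin(deg * PI / 180.0);
}

// Table lookup for an integer angle in degrees; wraps negative and large inputs.
double sinDeg(int deg)
{
    return sinTable[((deg % 360) + 360) % 360];
}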
While the quick answer is that they are more expensive than the primitive math operations (addition, multiplication, subtraction, etc.), they are not expensive in terms of human time. Typically the reason people optimize them with lookup tables and approximations is that they may be calling them tens of thousands of times per second, and every microsecond can be valuable.
If you're writing a program and just need to call them a couple of times a second, the built-in functions are fast enough by far.
I would recommend writing a test program and timing them for yourself. Yes, they're slow compared to plus and minus, but they're still single processor instructions. It's unlikely to be an issue unless you're doing a very tight loop with millions of iterations.
Yes (relative to other mathematical operations like multiply and divide): if you're doing something realtime (matrix ops, video games, whatever), you can knock off lots of cycles by moving your trig calculations out of your inner loops.
If you're not doing something realtime, then no, they're not expensive (relative to operations such as reading a bunch of data from disk, generating a webpage, etc.). Trig ops are hopefully done in hardware by your CPU (which can do billions of floating point operations per second).
If you always know the angles you are computing, you can store the results in a variable instead of recalculating them every time. This also applies within a method/function call where the angle is not going to change. You can be smart by using some identities (calculating sin(theta) from sin(theta/2), or knowing how often the values repeat: sin(theta + 2*pi*n) = sin(theta)) to reduce computation. See this Wikipedia article.
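For example (illustrative names), hoisting a trig call whose argument never changes out of a loop:

double s = std::sin(theta);   // computed once
for (int i = 0; i < n; ++i)
    out[i] = in[i] * s;       // reused every iteration, no trig call in the loop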
Yes, it is. Trig functions are computed by summing a series, so in general terms they will be a lot more costly than a simple mathematical operation. The same goes for sqrt.
