Parallelizing complex and data intensive calculations on the GPU - performance

Preface: I'm sorry that this is a very open-ended question; it would be quite complex to go into the exact problem I am working on, and I think an abstract formulation also contains the necessary detail. If more details are needed though, feel free to ask.
Efficiency in GPU computing comes from being able to parallelize calculations over thousands of cores, even though these run more slowly than traditional CPU cores. I am wondering if this idea can be applied to the problem I am working on.
The problem I am working on is an optimisation problem, where a potential solution is generated, the quality of this solution calculated, and compared to the current best solution, in order to approach the best solution possible.
In the current algorithm, a variation of gradient descent, calculating this penalty takes by far the most processor time (profiling suggests around 5% of the time is used to generate a new valid possibility, and 95% is used to calculate the penalty). However, calculating the penalty is quite a complex process: different parts of the (potential) solution depend on each other and are subject to multiple different constraints, each of which may add a penalty to the solution. The data model for this problem currently takes over 200MB of RAM to store.
Are there strategies for writing an algorithm for such a problem on the GPU? My problem is currently that the data model needs to be loaded for each processor core/thread working the problem: since generating a new solution takes so little time, it would be inefficient to start using locks and have to wait for a processor to be done with its penalty calculation.
A GPU obviously doesn't have this amount of memory available for each of its cores. However, my understanding is that if the model were stored in host RAM, the overhead of communication between the GPU and the CPU would greatly slow down the algorithm (currently around 1 million of these penalty calculations are performed every second on a single core of a fairly modern CPU, and I'm guessing a million transfers of data to the GPU every second would quickly become a bottleneck).
If anyone has any insights, or even a reference to a similar problem, I would be most grateful, since my own searches have not yet turned up much.

Related

Most relevant performance indicators for C/C++

I am looking for relevant performance indicators to benchmark and optimize my C/C++ code. For example, virtual memory usage is a simple but efficient indicator, but I know some are more specialized and help in optimizing specific domains: cache hits/misses, context switches, and so on.
I believe here is a good place to have a list of performance indicators, what they measure, and how to measure them, in order to help people who want to start optimizing their programs know where to start.
Time is the most relevant indicator.
This is why most profilers default to measuring / sampling time or core clock cycles. Understanding where your code spends its time is an essential first step to looking for speedups. First find out what's slow, then find out why it's slow.
There are 2 fundamentally different kinds of speedups you can look for, and time will help you find both of them.
Algorithmic improvements: finding ways to do less work in the first place. This is often the most important kind, and the one Mike Dunlavey's answer focuses on. You should definitely not ignore this. Caching a result that's slow to recompute can be very worth it, especially if it's slow enough that loading from DRAM is still faster.
Using data structures / algorithms that can more efficiently solve your problem on real CPUs is somewhere between these two kinds of speedups. (e.g. linked lists are in practice often slower than arrays because pointer-chasing latency is a bottleneck, unless you end up copying large arrays too often...)
Applying brute force more efficiently to do the same work in fewer cycles. (And/or more friendly to the rest of the program with smaller cache footprint and/or less branching that takes up space in the branch predictors, or whatever.)
Often involves changing your data layout to be more cache friendly, and/or manually vectorizing with SIMD. Or doing so in a smarter way. Or writing a function that handles a common special case faster than your general-case function. Or even hand-holding the compiler into making better asm for your C source.
Consider summing an array of float on modern x86-64: going from latency-bound scalar addition to AVX SIMD with multiple accumulators can give you a speedup of 8 (elements per vector) * 8 (latency / throughput on Skylake) = 64x for a medium-sized array (still on a single core/thread), in the theoretical best case where you don't run into another bottleneck (like memory bandwidth if your data isn't hot in L1d cache). Skylake vaddps / vaddss has 4 cycle latency, and 2-per-clock = 0.5c reciprocal throughput (https://agner.org/optimize/). See "Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?" for more about multiple accumulators to hide FP latency. But this still loses hard vs. storing the total somewhere, and maybe even updating the total with a delta when you change an element. (FP rounding error can accumulate that way, though, unlike integers.)
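To make the 64x arithmetic above explicit, here is a small sketch of the best-case estimate. The function name and parameters are made up for illustration; the numbers plugged in are the Skylake figures quoted above (8 floats per AVX vector, 4 cycle latency, 0.5 cycle reciprocal throughput). It only reproduces the back-of-the-envelope bound and measures nothing.

```python
def simd_sum_speedup(elements_per_vector, latency_cycles, reciprocal_throughput_cycles):
    """Best-case speedup over one latency-bound scalar add chain.

    Hypothetical helper: assumes no other bottleneck (memory bandwidth,
    front-end, etc.), which is the theoretical best case only.
    """
    # Number of independent add chains that can be in flight at once,
    # i.e. how many accumulators are needed to hide the FP add latency.
    chains_in_flight = latency_cycles / reciprocal_throughput_cycles
    # Each vector add also processes several elements at once.
    return elements_per_vector * chains_in_flight


# Skylake vaddps: 8 floats per 256-bit vector, 4c latency, 0.5c throughput.
skylake_estimate = simd_sum_speedup(8, 4.0, 0.5)
print(skylake_estimate)
```

The same formula also tells you how many accumulators to unroll with: latency divided by reciprocal throughput, so 8 in this case.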
If you don't see an obvious algorithmic improvement, or want to know more before making changes, check whether the CPU is stalling on anything, or if it's efficiently chewing through all the work the compiler is making it do.
Instructions per clock (IPC) tells you whether the CPU is close to its max instruction throughput or not. (Or more accurately, fused-domain uops issued per clock on x86, because for example one rep movsb instruction is a whole big memcpy and decodes to many many uops. And cmp/jcc fuses from 2 instructions to 1 uop, increasing IPC but the pipeline width is still fixed.)
Work done per instruction is a factor, too, but isn't something you can measure with a profiler: if you have the expertise, look at compiler-generated asm to see if the same work with fewer instructions is possible. If the compiler didn't auto-vectorize, or did so inefficiently, you can maybe get a lot more work done per instruction by manually vectorizing with SIMD intrinsics, depending on the problem. Or by hand-holding the compiler into emitting better asm by tweaking your C source to compute things in a way that is natural for asm, e.g. "What is the efficient way to count set bits at a position or lower?". And see also "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?"
If you find low IPC, figure out why by considering possibilities like cache misses or branch misses, or long dependency chains (often a cause of low IPC when not bottlenecked on the front-end or memory).
Or you might find that it's already close to optimally applying the available brute force of the CPU (unlikely but possible for some problems). In that case your only hope is algorithmic improvements to do less work.
(CPU frequency isn't fixed, but core clock cycles is a good proxy. If your program doesn't spend time waiting for I/O, then core clock cycles is maybe more useful to measure.)
A mostly-serial portion of a multi-threaded program can be hard to detect; most tools don't have an easy way to find threads using cycles when other threads are blocked.
Time spent in a function isn't the only indicator, though. A function can make the rest of the program slow by touching a lot of memory, evicting other useful data from cache. Or having a lot of branches somewhere can occupy some of the branch-prediction capacity of the CPU, resulting in more branch misses elsewhere.
But note that simply finding where the CPU is spending a lot of time executing is not the most useful, in a large codebase where functions containing hotspots can have multiple callers. e.g. lots of time spent in memcpy doesn't mean you need to speed up memcpy, it means you need to find which caller is calling memcpy a lot. And so on back up the call tree.
Use profilers that can record stack snapshots, or just hit control-C in a debugger and look at the call stack a few times. If a certain function usually appears in the call stack, it's making expensive calls.
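As a rough sketch of what "usually appears in the call stack" means in practice: count, for each function, the fraction of stack snapshots it appears in anywhere (inclusive), not just at the top. The samples and function names below (load_config, serialize, compute, ...) are invented for illustration; a real run would take them from a debugger or a profiler that records stacks.

```python
from collections import Counter


def inclusive_fractions(samples):
    """Fraction of stack samples in which each function appears anywhere."""
    counts = Counter()
    for stack in samples:
        # set() so a recursive function is counted once per sample.
        for func in set(stack):
            counts[func] += 1
    return {func: n / len(samples) for func, n in counts.items()}


# Hypothetical snapshots, outermost caller first, innermost frame last.
samples = [
    ["main", "load_config", "parse", "memcpy"],
    ["main", "run", "serialize", "memcpy"],
    ["main", "run", "serialize", "memcpy"],
    ["main", "run", "compute"],
    ["main", "run", "serialize", "memcpy"],
]

fractions = inclusive_fractions(samples)
for func, frac in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{func:12s} {frac:.0%}")
```

In this made-up data memcpy is on the stack in 4 of 5 samples, but serialize is the caller in 3 of those, so serialize is the place to look rather than memcpy itself.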
Related: "linux perf: how to interpret and find hotspots"; Mike Dunlavey's answer there in particular makes this point.
Algorithmic improvements to avoid doing work at all are often much more valuable than doing the same work more efficiently.
But if you find very low IPC for some work you haven't figured out how to avoid yet, then sure take a look at rearranging your data structures for better caching, or avoiding branch mispredicts.
Or if high IPC is still taking a long time, manually vectorizing a loop can help, doing 4x or more work per instruction.
@PeterCordes' answers are always good. I can only add my own perspective, coming from about 40 years optimizing code:
If there is time to be saved (which there is), that time is spent doing something unnecessary, that you can get rid of if you know what it is.
So what is it? Since you don't know what it is, you also don't know how much time it takes, but it does take time. The more time it takes, the more worthwhile it is to find, and the easier it is to find it. Suppose it takes 30% of the time. That means a random-time snapshot has a 30% chance of showing you what it is.
I take 5-10 random snapshots of the call stack, using a debugger and the "pause" function.
If I see it doing something on more than one snapshot, and that thing can be done faster or not at all, I've got a substantial speedup, guaranteed.
Then the process can be repeated to find more speedups, until I hit diminishing returns.
The important thing about this method is that no "bottleneck" can hide from it. That sets it apart from profilers: because they summarize, speedups can hide from them.
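To put a number on why a handful of snapshots is enough, here is a small sketch using the binomial distribution. It assumes each snapshot is independent and lands in the wasteful activity with probability equal to the fraction of time it takes (30% in the example above); the function name is made up for illustration.

```python
from math import comb


def prob_seen_at_least(k, n, fraction):
    """Probability that an activity taking `fraction` of the run time
    shows up in at least k of n independent random snapshots."""
    return sum(
        comb(n, j) * fraction**j * (1 - fraction) ** (n - j)
        for j in range(k, n + 1)
    )


# Something costing 30% of the time, 10 snapshots, seen on more than one.
p = prob_seen_at_least(2, 10, 0.3)
print(f"{p:.2f}")  # about 0.85
```

The bigger the problem, the fewer snapshots it takes to see it twice, which is the point made above: the more time it takes, the easier it is to find.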

Julia parallel speedup performance for large scale computations

General context:
I have developed a fairly large Navier-Stokes (finite difference) solver written in FORTRAN90. It has adaptive grids (hence a load-balance issue), and I have tried various techniques (MPI, OpenMP and hybrid OpenMP-MPI) to parallelize it. However, it does not scale well enough, i.e. according to Amdahl's law it runs 96-97% of the computations in parallel. Also, the general size of the mesh is a couple of hundred million points, and this will need to increase in the future.
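For reference, a quick sketch of what a 96-97% parallel fraction implies under Amdahl's law, speedup = 1 / ((1 - p) + p / n). This is the textbook formula only; it ignores communication and load imbalance, and the core counts are just example values.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup on `cores` cores when `parallel_fraction`
    of the work parallelizes perfectly and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


for p in (0.96, 0.97):
    limit = 1.0 / (1.0 - p)  # speedup as the core count goes to infinity
    row = ", ".join(f"{n} cores: {amdahl_speedup(p, n):.1f}x" for n in (64, 512, 2000))
    print(f"p = {p}: {row}, limit {limit:.1f}x")
```

With p = 0.96 the speedup can never exceed 25x, and with p = 0.97 about 33x, no matter how many cores or which language is used, so the serial 3-4% is what has to shrink to reach thousands of cores.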
Query:
Now, I am thinking of switching to Julia, since it has become very tedious to maintain and add further functionalities to the existing code.
The problem is that I am unable to find a good answer about the parallel performance of Julia. I have searched on the internet and have watched a lot of YouTube videos. What I have noticed is that most people say that Julia is very suitable for parallel computing; some even provide a bar chart showing the reduction in elapsed time compared to the serial code. However, some of the answers/videos are quite old, which makes them a little unreliable given how fast this new language is changing.
Therefore, I would like to know if the language has the ability to scale even for a few thousand cores?
Extra information:
I am still trying hard to improve the speedup of my existing code to achieve almost linear performance for a couple of thousand cores. The solver needs to exchange overlapping points 3-4 times per timestep. Hence, it involves a huge communication overhead. However, the non-adaptive grid version of the code easily scales up to 20k cores.
I have also read somewhere that Julia does not use InfiniBand standard for data communication in parallel.
The following paper has scaling results for PDE-constrained parameter estimation problems, but not up to anywhere near the number of cores you seem to be interested in: https://arxiv.org/abs/1606.07399. I haven't seen any examples going up to thousands of cores.
Re InfiniBand: By default Julia uses shared memory for communication within a node and TCP/IP across nodes, so by default InfiniBand is not supported. However, the language allows for the implementation of custom transports, and I imagine someone will add InfiniBand support at some point, but I couldn't find any implementations with a quick Google search.

Upper bound on speedup

My MPI experience showed that the speedup does not increase linearly with the number of nodes we use (because of the costs of communication). My experience is similar to this: [speedup plot].
Today a speaker said: "Magically (smiles), on some occasions we can get more speedup than the ideal one!".
He meant that ideally, when we use 4 nodes, we would get a speedup of 4. But on some occasions we can get a speedup greater than 4, with 4 nodes! The topic was related to MPI.
Is this true? If so, can anyone provide a simple example of that? Or maybe he was thinking about adding multithreading to the application (he ran out of time and then had to leave ASAP, so we could not discuss it)?
Parallel efficiency (speed-up / number of parallel execution units) over unity is not at all uncommon.
The main reason for that is the total cache size available to the parallel program. With more CPUs (or cores), one has access to more cache memory. At some point, a large portion of the data fits inside the cache and this speeds up the computation considerably. Another way to look at it is that the more CPUs/cores you use, the smaller the portion of the data each one gets, until that portion could actually fit inside the cache of the individual CPU. This is sooner or later cancelled by the communication overhead though.
Also, your data shows the speed-up compared to the execution on a single node. Using OpenMP could remove some of the overhead when using MPI for intranode data exchange and therefore result in better speed-up compared to the pure MPI code.
The problem comes from the incorrectly used term ideal speed-up. Ideally, one would account for cache effects. I would rather use linear instead.
Not too sure this is on-topic here, but here goes nothing...
This super-linearity in speed-up can typically occur when you parallelise your code while distributing the data in memory with MPI. In some cases, by distributing the data across several nodes / processes, you end up with chunks of data for each individual process that are small enough to fit in the cache of the processor. This cache effect might have a huge impact on the code's performance, leading to great speed-ups and compensating for the increased need for MPI communications... This can be observed in many situations, but it isn't something you can really count on to compensate for poor scalability.
Another case where you can observe this sort of super-linear scalability is when you have an algorithm where you distribute the task of finding a specific element in a large collection: by distributing your work, you can end up with one of the processes/threads finding the result almost immediately, just because it happened to be given a range of indexes starting very close to the answer. But this case is even less reliable than the aforementioned cache effect.
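The search case can be illustrated with a toy model that counts comparison steps instead of measuring time. This is only a sketch: it assumes the workers scan equal contiguous chunks in lockstep and everyone stops as soon as one of them finds the target, and it ignores communication costs entirely. The function names are made up for illustration.

```python
def steps_sequential(data, target):
    """Comparisons a single worker needs scanning from the start."""
    for steps, value in enumerate(data, start=1):
        if value == target:
            return steps
    return len(data)


def steps_parallel(data, target, workers):
    """Lockstep comparisons until any worker finds the target.

    Assumes len(data) is divisible by `workers` and that every worker
    scans its own contiguous chunk.
    """
    chunk = len(data) // workers
    best = chunk  # worst case: every worker scans its whole chunk
    for w in range(workers):
        part = data[w * chunk:(w + 1) * chunk]
        for steps, value in enumerate(part, start=1):
            if value == target:
                best = min(best, steps)
                break
    return best


data = list(range(1000))

# Lucky case: the target is the first element of the fourth worker's chunk.
lucky = steps_sequential(data, 750) / steps_parallel(data, 750, workers=4)

# Unlucky case: the target is the last element of the first worker's chunk.
unlucky = steps_sequential(data, 249) / steps_parallel(data, 249, workers=4)

print(lucky, unlucky)
```

With 4 workers the step-count speedup is far above 4 in the lucky case and exactly 1 in the unlucky one, which is why this kind of super-linearity is not something to rely on.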
Hope that gives you a flavour of what super-linearity is.
Cache has been mentioned, but it's not the only possible reason. For instance you could imagine a parallel program which does not have sufficient memory to store all its data structures at low node counts, but does at high. Thus at low node counts the programmer may have been forced to write intermediate values to disk and then read them back in again, or alternatively re-calculate the data when required. However at high node counts these games are no longer required and the program can store all its data in memory. Thus super-linear speed-up is a possibility because at higher node counts the code is just doing less work by using the extra memory to avoid I/O or calculations.
Really this is the same as the cache effects noted in the other answers: using extra resources as they become available. And this is really the trick: more nodes doesn't just mean more cores, it also means more of all your resources. Speed-up really measures your core use, so if you can also use those other extra resources to good effect, you can achieve super-linear speed-up.

Is this simulation appropriate for CUDA or OpenCL?

I'm asking on behalf of a friend working in numerical astrophysics.
Basically what he's doing is simulating a cloud of gas. There are a finite number of cells and the timestep is defined such that gas cannot cross more than one cell each step. Each cell has properties like density and temperature. Each timestep, these (and position) need to be calculated. It's mainly position that's the issue I believe as that is affected primarily by the interactions of gravity among the cells, all of which affect each other.
At the moment he's running this on a cluster of ~150 nodes but I wondered, if it's parallelizable like this, could it be run faster on a few GPUs with CUDA? At the moment it takes him a couple of days to finish a simulation. As GPUs generally have ~500 cores, it seemed like they could provide a boost.
Maybe I'm totally wrong.
Yes this sounds like a decent application for a GPU. GPU processing is most effective when it's running the same function on a large data set. If you've already got it running in parallel on a cluster computer, I'd say write it and test it on a single graphics card, and see if that's an improvement on a single cluster, then scale accordingly.
The task you describe is a good fit for the GPU. GPUs have successfully been used for dramatically improving the performance in areas such as particle, aerodynamics and fluid simulations.
Without knowing more details about the simulation it's impossible to say for sure whether it would gain a performance boost. Broadly speaking, algorithms that are arithmetic bound (that is, a relatively high number of arithmetic operations per memory transaction) tend to benefit most from offloading to the GPU.
For astrophysics simulations specifically, the following link may be of use : http://www.astrogpu.org/

Any idea to improve cache performance for large scale program?

I am working on a large-scale piece of software. It's driven by moving memory/data between a huge number of complicated models.
Sometimes the number of cache misses is too high and the performance is not good. But the scenario looks too complicated to me.
I just want to get some general ideas on how to reduce cache misses and improve memory performance.
I'd appreciate any comments.
Thanks!
The most valuable tool when hunting performance bottlenecks is measurement. You need to figure out what code has the problem and then measure it for cache misses, if that indeed proves to be the problem.
As for general ideas, you will need to lower the miss rate. So when you pull data into memory, you need to work as much as possible on it before you leave it again, rather than stream data. Compare as an example,
for i in data:
    f(i)
for i in data:
    g(i)
for i in data:
    h(i)
which traverses the list three times. It may be possible to write this as:
for i in data:
    h(g(f(i)))
lowering the traverse to only a single time - usually leading to fewer misses.
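Here is a runnable sketch of the same idea, with stand-in f, g and h (hypothetical, chosen only so the result can be checked) written as a pipeline where each stage consumes the previous stage's output. It shows that the fused version gives the same result while touching each element once; it does not measure cache misses, and in a language like C or C++ the effect on a large array would be far more visible than in Python.

```python
def f(x):
    return x + 1


def g(x):
    return x * 2


def h(x):
    return x - 3


def three_passes(data):
    # Each pass streams over the whole collection and writes a full
    # intermediate result before the next pass starts.
    after_f = [f(x) for x in data]
    after_g = [g(x) for x in after_f]
    return [h(x) for x in after_g]


def one_pass(data):
    # Fused: each element is loaded once and fully processed while it
    # is still hot, with no intermediate collections.
    return [h(g(f(x))) for x in data]


data = list(range(10))
assert three_passes(data) == one_pass(data)
print(one_pass(data))
```

Fusing is only valid when each stage depends on nothing but its own element; if g needs to see all of f's results first (say, a normalisation by the maximum), the passes have to stay separate.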
Another worthy trick is to think about the data structure. The access patterns of a binary tree are much different from those of a hash table. But establish measurement first so you can be sure you got the misses nailed - and that it is the misses that are your problem.
Finally, even with low miss rates, you can look into lowering memory bandwidth in general. If you move lots and lots of data, it tends to be slow - since memory speeds grow at a much lower rate compared to transistor count.
It may be having cache misses, but don't assume that's the problem.
Find out where the problems are and fix them, as in this example.
In my experience, the larger the software is, the larger are the problems (and the opportunities for speeding it up).
Often the software has been developed on fast machines, with small input data sets, so the developers never felt the need to profile and remove performance bugs.
The good news is they're all in there, just waiting for you to find and remove them, getting massive speedup, for which you can take the credit!
This is a giant topic with no detail in the question. So, I'd suggest buying more RAM.

Resources