Any ideas to improve cache performance for a large-scale program?

I am working on large-scale software. It is driven by memory/data moving between a huge number of complicated models.
Sometimes the cache miss rate is too high and performance suffers, but the scenario looks too complicated to me.
I just want to get some general ideas on how to reduce cache misses and improve memory performance.
I appreciate any comments.
Thanks!

The most valuable tool when hunting performance bottlenecks is measurement. You need to figure out what code has the problem and then measure it for cache misses, if that indeed proves to be the problem.
As for general ideas, you will need to lower the miss rate. So when you pull data into cache, you need to do as much work as possible on it before you leave it again, rather than streaming over the data repeatedly. Compare, as an example,
for i in data:
    f(i)
for i in data:
    g(i)
for i in data:
    h(i)
which traverses the list three times. It may be possible to write this as:
for i in data:
    h(g(f(i)))
lowering the traversals to a single pass - usually leading to fewer misses. (Note this rewrite assumes the result of f feeds g and so on; if the three calls are independent, the fused loop body is simply f(i); g(i); h(i) per iteration.)
Another worthy trick is to think about your data structures. The access patterns of a binary tree are very different from those of a hash table. But establish measurement first so you can be sure you have the misses nailed - and that it is the misses that are your problem.
Finally, even with low miss rates, you can look into lowering memory bandwidth usage in general. If you move lots and lots of data, it tends to be slow - memory speeds have grown at a much lower rate than transistor counts.

Your program may well be having cache misses, but don't assume that's the problem.
Find out where the problems are and fix them, as in this example.
In my experience, the larger the software is, the larger are the problems (and the opportunities for speeding it up).
Often the software has been developed on fast machines, with small input data sets, so the developers never felt the need to profile and remove performance bugs.
The good news is they're all in there, just waiting for you to find and remove them, getting massive speedup, for which you can take the credit!

This is a giant topic with no detail in the question. So, I'd suggest buying more RAM.

Related

Is there a theorem like CAP for web development?

When you're building something in a web development scenario you're often thinking of costs/resources, and you're often juggling between three resources:
CPU (Processing in general)
Memory (Storage in general)
Network/Bandwidth (Or maybe even external/server resources)
The theorem here is simple, you can only choose two of these to be low.
If you want low CPU and Memory, you'll have to ask the server to do the work (High bandwidth usage)
If you want low Memory and Bandwidth, the CPU will have to do extra work to create and recreate things on the go.
If you want low CPU and Bandwidth, the memory will have to store more information and possibly duplicated data.
My question here is: is there a name for this theorem, or for the managing of these three resources? I would like to know more about the theory behind choosing the best option in each scenario, and about research related to this.
To be honest I don't know if this is the right community for this question, it is mostly a theoretical/academic question.
It doesn't work this way. For all three resources you can achieve a low usage percentage simply by adding more resources, in most cases. If you have high CPU usage, deploy your app to a server with twice as many CPUs. So if you have money, you can achieve low usage levels. In CAP, it doesn't matter how much money you have.
Also from Wikipedia:
CAP is frequently misunderstood as if one has to choose to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability only when a network partition or failure happens; at all other times, no trade-off has to be made.

Most relevant performance indicators for C/C++

I am looking for relevant performance indicators to benchmark and optimize my C/C++ code. For example, virtual memory usage is a simple but effective indicator, but I know some are more specialized and help in optimizing specific domains: cache hits/misses, context switches, and so on.
I believe here is a good place to have a list of performance indicators, what they measure, and how to measure them, in order to help people who want to start optimizing their programs know where to start.
Time is the most relevant indicator.
This is why most profilers default to measuring / sampling time or core clock cycles. Understanding where your code spends its time is an essential first step to looking for speedups. First find out what's slow, then find out why it's slow.
There are 2 fundamentally different kinds of speedups you can look for, and time will help you find both of them.
Algorithmic improvements: finding ways to do less work in the first place. This is often the most important kind, and the one Mike Dunlavey's answer focuses on. You should definitely not ignore this. Caching a result that's slow to recompute can be very worth it, especially if it's slow enough that loading from DRAM is still faster.
Applying brute force more efficiently, to do the same work in fewer cycles. (And/or more friendly to the rest of the program, with a smaller cache footprint and/or less branching that takes up space in the branch predictors, or whatever.)
Using data structures / algorithms that can more efficiently solve your problem on real CPUs falls somewhere between these two kinds of speedups. (e.g. linked lists are in practice often slower than arrays because pointer-chasing latency is a bottleneck, unless you end up copying large arrays too often...)
Often involves changing your data layout to be more cache friendly, and/or manually vectorizing with SIMD. Or doing so in a smarter way. Or writing a function that handles a common special case faster than your general-case function. Or even hand-holding the compiler into making better asm for your C source.
Consider summing an array of float on modern x86-64: going from latency-bound scalar addition to AVX SIMD with multiple accumulators can give you a speedup of 8 (elements per vector) * 8 (latency / throughput on Skylake) = 64x for a medium-sized array (still on a single core/thread), in the theoretical best case where you don't run into another bottleneck (like memory bandwidth if your data isn't hot in L1d cache). Skylake vaddps / vaddss has 4 cycle latency and 2-per-clock = 0.5c reciprocal throughput (https://agner.org/optimize/). See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about using multiple accumulators to hide FP latency. But this still loses hard vs. storing the total somewhere, and maybe even updating the total with a delta when you change an element. (FP rounding error can accumulate that way, though, unlike with integers.)
If you don't see an obvious algorithmic improvement, or want to know more before making changes, check whether the CPU is stalling on anything, or whether it's efficiently chewing through all the work the compiler is making it do.
Instructions per clock (IPC) tells you whether the CPU is close to its max instruction throughput or not. (Or more accurately, fused-domain uops issued per clock on x86, because for example one rep movsb instruction is a whole big memcpy and decodes to many many uops. And cmp/jcc fuses from 2 instructions to 1 uop, increasing IPC but the pipeline width is still fixed.)
Work done per instruction is a factor, too, but isn't something you can measure with a profiler: if you have the expertise, look at compiler-generated asm to see if the same work with fewer instructions is possible. If the compiler didn't auto-vectorize, or did so inefficiently, you can maybe get a lot more work done per instruction by manually vectorizing with SIMD intrinsics, depending on the problem. Or by hand-holding the compiler into emitting better asm by tweaking your C source to compute things in a way that is natural for asm. e.g. What is the efficient way to count set bits at a position or lower?. And see also C++ code for testing the Collatz conjecture faster than hand-written assembly - why?
If you find low IPC, figure out why by considering possibilities like cache misses or branch misses, or long dependency chains (often a cause of low IPC when not bottlenecked on the front-end or memory).
Or you might find that it's already close to optimally applying the available brute force of the CPU (unlikely but possible for some problems). In that case your only hope is algorithmic improvements to do less work.
(CPU frequency isn't fixed, but core clock cycles is a good proxy. If your program doesn't spend time waiting for I/O, then core clock cycles is maybe more useful to measure.)
A mostly-serial portion of a multi-threaded program can be hard to detect; most tools don't have an easy way to find threads using cycles when other threads are blocked.
Time spent in a function isn't the only indicator, though. A function can make the rest of the program slow by touching a lot of memory, resulting in eviction of other useful data from cache. So that kind of effect is possible. Or having a lot of branches somewhere can maybe occupy some of the branch-prediction capacity of the CPU, resulting in more branch misses elsewhere.
But note that simply finding where the CPU is spending a lot of time executing is not the most useful, in a large codebase where functions containing hotspots can have multiple callers. e.g. lots of time spent in memcpy doesn't mean you need to speed up memcpy, it means you need to find which caller is calling memcpy a lot. And so on back up the call tree.
Use profilers that can record stack snapshots, or just hit control-C in a debugger and look at the call stack a few times. If a certain function usually appears in the call stack, it's making expensive calls.
Related: linux perf: how to interpret and find hotspots, especially Mike Dunlavey's answer there makes this point.
Algorithmic improvements to avoid doing work at all are often much more valuable than doing the same work more efficiently.
But if you find very low IPC for some work you haven't figured out how to avoid yet, then sure take a look at rearranging your data structures for better caching, or avoiding branch mispredicts.
Or if high IPC is still taking a long time, manually vectorizing a loop can help, doing 4x or more work per instruction.
@PeterCordes's answers are always good. I can only add my own perspective, coming from about 40 years of optimizing code:
If there is time to be saved (which there is), that time is spent doing something unnecessary, that you can get rid of if you know what it is.
So what is it? Since you don't know what it is, you also don't know how much time it takes, but it does take time. The more time it takes, the more worthwhile it is to find, and the easier it is to find it. Suppose it takes 30% of the time. That means a random-time snapshot has a 30% chance of showing you what it is.
I take 5-10 random snapshots of the call stack, using a debugger and the "pause" function.
If I see it doing something on more than one snapshot, and that thing can be done faster or not at all, I've got a substantial speedup, guaranteed.
Then the process can be repeated to find more speedups, until I hit diminishing returns.
The important thing about this method is that no "bottleneck" can hide from it. That sets it apart from profilers: because they summarize, speedups can hide from them.

Understanding perf detail when comparing two different implementations of a BFS algorithm

The results below are measured using perf on a compute server with 32 cores. I know my implementation is unoptimized, but that's on purpose, as I want to make comparisons. I understand that graph algorithms tend to have low locality, which researchers try to address.
I'm unclear about the results, though. The elapsed time is misleading. My implementation runs through a graph with about 4 million nodes in about 10 seconds, and the rest of the time is pre-processing. The optimized version uses the same input and traverses about 10 times, each traversal taking less than a second, so it's really just pre-processing time. I'm not trying to achieve the same performance; I just want to understand why, based on perf.
I see my page faults are substantially higher. I'm not 100% sure why this is the case, as the annotations (from what I can tell) do not point to any specific piece of my code...
__gnu_cxx::new_allocator<std::_List_node<int> >::construct<int, int const&>
This seems to be when I process the graph itself, since I create linked lists for the adjacency lists. I figured this may actually cause issues and wanted to investigate anyway. I should be able to improve the page faults (and hopefully the performance) by switching to jagged arrays?
The optimized algorithm has a much higher last-level-cache miss count, which I thought would explain the primary issue with BFS / graph algorithms (low locality), but performance seems to be unaffected by this, and my unoptimized version's count is significantly lower.
Then there are the front / back end cycles which seems to be the opposite in terms of performance issues when comparing the two - I'm worse in frontend and the optimized is worse in backend.
Am I missing or not understanding something obvious? I thought low locality would show up as an obvious issue in perf, but the optimized version's numbers confuse me.
This is my implementation of unoptimized parallel BFS (running once)...
This is using an optimized parallel BFS from a benchmark suite (running 10 times)...
Both take about 40 seconds to pre-process the data once, before doing parallel searching.
Unfortunately perf stat often doesn't give enough information to really determine where the bottleneck in your application is. It is possible to have two applications with wildly different underlying bottlenecks but with very similar perf stat profiles. For example, two applications may have the same number or fraction of L2 cache misses, and yet one might be dominated by this effect while the other may be almost unaffected, depending on the amount and nature of overlapping work.
So if you try to analyze in depth from these high level counters, you are often just taking stabs in the dark. Still we can make a few observations. You mention:
The optimized algorithm has a much higher last level cache miss which I thought would explain the primary issue with BFS / graph algorithms with low locality but performance seems to be unaffected by this and my unoptimized is significantly lower.
First, LLC misses are ~620 million for the optimized algorithm and ~380 million for yours - but you are running the optimized algorithm 10 times in this benchmark and yours only once. So the optimized algorithm has perhaps 62 million misses per run, and your algorithm has about six times that number of LLC misses. Yes, your algorithm has a lower LLC miss rate - but the absolute number of LLC misses is what counts for performance. The lower miss rate just means you are making even more total accesses than the 6x figure suggests: basically, you make many, many more memory accesses than the optimized version, which leads to a higher hit rate but also more total misses.
All of this points to your unoptimized algorithm accessing more total memory, or perhaps accessing it in a much more cache-unfriendly fashion. That would also explain the much higher number of page faults. Overall, both algorithms have low IPC, and yours is particularly low (0.49 IPC). Given that there aren't branch prediction problems, and that you've already identified these as graph algorithms with locality/memory-access problems, stalls while waiting for memory are very likely.
Luckily, there is a better way than just trying to reverse-engineer the bottleneck from perf stat output. Intel has developed a whole methodology for this kind of top-down analysis, one that determines the true bottlenecks. It's not perfect, but it's far and away better than looking at the plain perf stat counters. VTune isn't free, but you can get a similar analysis based on the same methodology using Andi Kleen's toplev. I highly recommend you start there.

Parallelizing complex and data intensive calculations on the GPU

Preface: I'm sorry that this a very open-ended question, since it would be quite complex to go into the exact problem I am working on, and I think an abstract formulation also contains the necessary detail. If more details are needed though, feel free to ask.
Efficiency in GPU computing comes from being able to parallelize calculations over thousands of cores, even though these run more slowly than traditional CPU cores. I am wondering if this idea can be applied to the problem I am working on.
The problem I am working on is an optimisation problem, where a potential solution is generated, the quality of this solution calculated, and compared to the current best solution, in order to approach the best solution possible.
In the current algorithm, a variation of gradient descent, calculating this penalty is what takes by far the most processor time (profiling suggests around 5% of the time is used to generate a new valid possibility, and 95% to calculate the penalty). However, calculating this penalty is quite a complex process, where different parts of the (potential) solution depend on each other and are subject to multiple different constraints for which a penalty may be given to the solution - the data model for this problem currently takes over 200 MB of RAM to store.
Are there strategies for writing an algorithm for such a problem on the GPU? My problem is currently that the data model needs to be loaded for each processor core/thread working on the problem; since generating a new solution takes so little time, it would be inefficient to start using locks and have to wait for a processor to be done with its penalty calculation.
A GPU obviously doesn't have this amount of memory available for each of its cores. However, my understanding is that if the model were stored in host RAM, the overhead of communication between the GPU and the CPU would greatly slow down the algorithm. (Currently around 1 million of these penalty calculations are performed every second on a single core of a fairly modern CPU, and I'm guessing a million transfers of data to the GPU every second would quickly become a bottleneck.)
If anyone has any insights, or even a reference to a similar problem, I would be most grateful, since my own searches have not yet turned up much.

Upper bound on speedup

My MPI experience showed that speedup does not increase linearly with the number of nodes we use (because of the cost of communication). My experience is similar to this.
Today a speaker said: "Magically (smiles), in some occasions we can get more speedup than the ideal one!".
He meant that ideally, when we use 4 nodes, we would get a speedup of 4. But in some occasions we can get a speedup greater than 4, with 4 nodes! The topic was related to MPI.
Is this true? If so, can anyone provide a simple example of it? Or maybe he was thinking about adding multithreading to the application (he ran out of time and had to leave ASAP, so we could not discuss it)?
Parallel efficiency (speed-up / number of parallel execution units) over unity is not at all uncommon.
The main reason for that is the total cache size available to the parallel program. With more CPUs (or cores), one has access to more cache memory. At some point, a large portion of the data fits inside the cache and this speeds up the computation considerably. Another way to look at it is that the more CPUs/cores you use, the smaller the portion of the data each one gets, until that portion could actually fit inside the cache of the individual CPU. This is sooner or later cancelled by the communication overhead though.
Also, your data shows the speed-up compared to the execution on a single node. Using OpenMP could remove some of the overhead when using MPI for intranode data exchange and therefore result in better speed-up compared to the pure MPI code.
The problem comes from the incorrectly used term ideal speed-up. Ideally, one would account for cache effects. I would rather use linear instead.
Not too sure this is on-topic here, but here goes nothing...
This super-linearity in speed-up can typically occur when you parallelise your code while distributing the data in memory with MPI. In some cases, by distributing the data across several nodes / processes, you end up with sufficiently small chunks of data for each individual process that they fit in the cache of the processor. This cache effect can have a huge impact on the code's performance, leading to great speed-ups and compensating for the increased need for MPI communications... This can be observed in many situations, but it isn't something you can really count on to compensate for poor scalability.
Another case where you can observe this sort of super-linear scalability is when you have an algorithm that distributes the task of finding a specific element in a large collection: by distributing your work, one of the processes/threads can end up finding the result almost immediately, just because it happened to be given a range of indexes starting very close to the answer. But this case is even less reliable than the aforementioned cache effect.
Hope that gives you a flavour of what super-linearity is.
Cache has been mentioned, but it's not the only possible reason. For instance, imagine a parallel program which does not have sufficient memory to store all its data structures at low node counts, but does at high node counts. Thus at low node counts the programmer may have been forced to write intermediate values to disk and then read them back in again, or alternatively to re-calculate the data when required. However, at high node counts these games are no longer required and the program can store all its data in memory. Thus super-linear speed-up is a possibility, because at higher node counts the code is just doing less work, using the extra memory to avoid I/O or re-calculation.
Really this is the same as the cache effects noted in the other answers: using extra resources as they become available. And this is really the trick - more nodes doesn't just mean more cores, it also means more of all your resources. So while speed-up really measures your core use, if you can also put those other extra resources to good effect, you can achieve super-linear speed-up.
