How can I prove that a program is memory bound?

As far as I can tell, there are three general ways to describe the limitations on the running time of a program: CPU bound, memory bound, and I/O bound. How can I prove that a program is memory bound?

Do you mean "prove", as in an academic exercise, or "prove", as in justify buying more memory for the system it runs on?
In the first case, you'd need a lot of detail about what the algorithm is doing and what the memory latency looks like. If you can show that (number of cache misses) * (main-memory latency, in cycles) is greater than or equal to the number of "productive" computation cycles, then the program is memory-bound.
To make a case that an existing piece of software is probably memory-bound, you can use a low-level profiler to get the same kind of information (cache-miss frequency, etc). Or you could try running the program with the CPU either clocked down to a lower rate, or occupied with other work, and see if the runtime increases linearly with CPU degradation, or somewhat more slowly.
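To make that concrete, here is a hypothetical C++ microbenchmark (not part of the original answer; the sizes and constants are arbitrary). The pointer chase spends nearly all of its time waiting on dependent cache misses, so its runtime barely changes when the CPU is clocked down, while the arithmetic loop scales almost linearly with clock speed:

```cpp
#include <algorithm>  // std::shuffle
#include <chrono>
#include <cstdio>
#include <numeric>    // std::iota
#include <random>
#include <vector>

int main() {
    const std::size_t N = 1 << 24;  // 16M elements (~128 MB of indices): far larger than any cache

    // Link the elements into one big random cycle so every step of the chase
    // is a dependent, cache-missing load.
    std::vector<std::size_t> order(N), next(N);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (std::size_t i = 0; i + 1 < N; ++i) next[order[i]] = order[i + 1];
    next[order[N - 1]] = order[0];

    auto t0 = std::chrono::steady_clock::now();
    std::size_t idx = 0;
    for (std::size_t i = 0; i < N; ++i) idx = next[idx];           // latency-bound: waits on memory
    auto t1 = std::chrono::steady_clock::now();

    double x = 1.0;
    for (std::size_t i = 0; i < N; ++i) x = x * 1.0000001 + 1e-9;  // compute-bound: waits on the FPU
    auto t2 = std::chrono::steady_clock::now();

    std::printf("pointer chase: %.3f s (memory-bound: insensitive to CPU clock)\n",
                std::chrono::duration<double>(t1 - t0).count());
    std::printf("arithmetic:    %.3f s (CPU-bound: scales with CPU clock) [x=%g idx=%zu]\n",
                std::chrono::duration<double>(t2 - t1).count(), x, idx);
}
```

Compile with optimizations (e.g. g++ -O2) and, if your machine allows it, repeat the run at a reduced CPU frequency to observe the contrast the answer describes.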

Related

CPU bound vs Cache bound - Can instructions be executed without cache/memory access? Can memory access be as fast as instruction execution?

I was looking up the difference between CPU bound and IO bound programs. That was when I came across answers that explain that there are other variants like Memory Bound, Cache bound, etc.
I understand how Memory Bound (Multiplication of 2 large matrices in Main Memory) and IO Bound (grep) differ from each other and from CPU bound/Cache bound.
However, the difference between CPU Bound programs and Cache Bound programs doesn't seem as clear. Here is what I gathered:
Cache bound - Speed of cache access is an important factor in deciding the speed at which the program gets executed. For example, if the most visited part of a program is a small chunk of code inside a loop small enough to be contained within the cache, then the program may be cache bound.
CPU bound - The speed at which CPU executes instructions is an important factor in deciding the speed at which the program gets executed.
But how can processes be CPU bound? I mean, instructions need to be fetched (from cache/main memory) before execution every time, so no matter how fast the CPU is, it will have to wait for the cache to finish the data transfer, and thus will be at least cache bound or memory bound, since memory access is slower than instruction execution.
So is CPU bound the same as cache bound?
CPU architecture is very much like plumbing, just without the smell. When one of the pipes gets clogged, some others will overflow, while others will remain empty - both cases are bad utilization, but you need to find the jam to release everything.
Similarly, with a CPU you have multiple systems that need to work in unison to make the program progress. Each of these units has an upper limit on the bandwidth at which it can work, and when that limit is reached it becomes the bottleneck, leaving the other systems underutilized or even stalled.
Main memory, for example, depends on the number of channels and the type of DRAM (and of course the frequency), but let's say it commonly peaks at about 25 GB/s in client CPUs. That means that any workload which tries to consume data beyond that rate will become blocked by the memory bandwidth (i.e. memory bound), and the rest of the system will be underutilized.
Cache BW depends on the cache level (and the processor micro-architecture, and of course frequency of that cache domain), but you can find out where it peaks in the optimization guides.
According to 2.1.3 here, Intel Skylake for example provides 2 32B loads + 1 store per cycle from the L1 (though the actual utilization they quote is a little lower, probably due to collisions or writeback interference), L2 is effectively about 1/2 line per cycle and L3 a little less than 1/3. This means that if your data set is contained in one of these levels, you can reach that peak BW before being capped by that cache.
On the other hand, let's say you don't reach the peak cache bandwidth, instead consuming data from the L1 at a lower rate, but each element of data requires many complicated mathematical operations. In that case, you may be bounded by your execution bandwidth - more so if these operations are limited to only part of the execution ports (as is the case with some esoteric operations).
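As a rough illustration of the two extremes (a sketch, not from the answer; the array sizes, the triad kernel, and the transcendental math are arbitrary choices), the first loop below is typically capped by DRAM bandwidth, while the second is capped by execution throughput even though its data sits comfortably in L1:

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Streaming triad over arrays much larger than the last-level cache:
    // limited by DRAM bandwidth, not by arithmetic.
    const std::size_t BIG = std::size_t{1} << 24;   // 16M doubles per array (~128 MB each)
    std::vector<double> a(BIG, 1.0), b(BIG, 2.0), c(BIG, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < BIG; ++i) c[i] = a[i] + 3.0 * b[i];
    auto t1 = std::chrono::steady_clock::now();

    double secs  = std::chrono::duration<double>(t1 - t0).count();
    double bytes = 3.0 * BIG * sizeof(double);      // 2 reads + 1 write (write-allocate traffic ignored)
    std::printf("triad: %.2f GB/s -- compare against your CPU's documented DRAM peak\n",
                bytes / secs / 1e9);

    // Heavy math over an 8 KB, L1-resident array: limited by execution
    // throughput (and by which ports this kind of work can issue on).
    const std::size_t SMALL = 1024;
    std::vector<double> d(SMALL, 0.5);
    double sum = 0.0;
    auto t2 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 20000; ++rep)
        for (std::size_t i = 0; i < SMALL; ++i)
            sum += std::sin(d[i]) * std::cos(d[i]);
    auto t3 = std::chrono::steady_clock::now();
    std::printf("math loop: %.3f s [sum=%g]\n",
                std::chrono::duration<double>(t3 - t2).count(), sum);
}
```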
There are useful tools to determine what you're bounded by - look up TopDown analysis, for example.

What are the common causes of non-scalability of shared memory programs?

Whenever someone parallelizes an application, the expected outcome is a decent speedup, but that is not always the case.
It is very common that a program that runs in x seconds, when parallelized to use 8 cores, will not run in x/8 seconds (the optimal speedup). In some extreme cases, it even takes more time than the original sequential program.
Why? And most importantly, how do I improve scalability?
There are a few common causes of non-scalability:
1. Too much synchronization: some problems (and sometimes overly conservative programmers) require lots of synchronization between parallel tasks; this eliminates most of the parallelism in the algorithm, making it slower.
1.1 Use the minimum synchronization possible for your algorithm. With OpenMP, for instance, a simple change from critical to atomic can make a relevant difference (see the sketch after this list).
1.2 Sometimes a worse sequential algorithm offers better parallelism opportunities; if you have the chance to try something else, it might be worth a shot.
2. Memory bandwidth limitation: it is very common that the most "trivial" implementation of an algorithm is not optimized for locality, which leads to heavy communication costs between the processors and main memory.
2.1 Optimize for locality: get to know where your application will run, what cache memories are available, and how to change your data structures to maximize cache usage.
3. Too much parallelization overhead: sometimes the parallel task is so "small" that the overhead of thread/process creation is too big compared to the total time of the parallel region, which causes a poor speedup or even a slowdown.
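A minimal sketch of item 1.1, assuming OpenMP in C++ (the array, its size, and the reduction itself are only illustrative): the same sum computed with a critical section, with an atomic update, and with OpenMP's built-in reduction clause, which usually go from slowest to fastest under contention:

```cpp
// Compile with e.g.:  g++ -O2 -fopenmp reduction.cpp
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 24;
    std::vector<int> data(N, 1);

    long long slow = 0;
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        #pragma omp critical          // every thread serializes here, every iteration
        slow += data[i];
    }

    long long better = 0;
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        #pragma omp atomic            // cheaper than critical, but still contended
        better += data[i];
    }

    long long best = 0;
    #pragma omp parallel for reduction(+ : best)   // per-thread partials, combined once
    for (int i = 0; i < N; ++i)
        best += data[i];

    std::printf("%lld %lld %lld\n", slow, better, best);
}
```

Timing the three loops on your own machine shows how much the choice of synchronization construct matters for scalability.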
All of RSFalcon7's suggestions can be combined into a "super rule": do as much as possible in unshared resources (the L1 and L2 caches) - which implies economizing on code and data footprint - and if you need to go to shared resources, do as much as possible in L3 before going to RAM, go to RAM before using synchronization (the number of CPU cycles required to synchronize varies, but it is slower - or much slower - than accessing RAM), and use synchronization before going to disks.
If you plan to use hyperthreading, I have found that code compiled with gcc utilizes hyperthreading better at optimization level -O1 than at, say, -O2 or -O3.

risk of "big" computations on hardware

Assuming someone is doing some big computations (I know that's relative... and I'm not going to specify the nature of the operation just to keep the question open; it may be sorting data, searching for elements, calculating the prime factors of a really long number...) using badly designed, brute-force algorithms or just an iterative process to get the results, can this approach have any bad effects on the CPU or the RAM over a long period of time?
Intensive processing will increase the heat generated by the CPU (or GPU) and even the RAM (to a much smaller degree).
Recent CPU chips have the ability to slow themselves down once the heat exceeds certain thresholds to prevent damage to the CPU. That would typically indicate a failure in the cooling system though.
I do not believe there are many other issues beyond electricity consumption and the risk of overheating.

Optimum performance of GPU

I have been asked to measure how "efficiently" my code uses the GPU / what percentage of peak performance the algorithms achieve. I am not sure how to do this comparison. Until now I have basically put timers in my code and measured the execution time. How can I compare this to optimal performance and find what the bottlenecks might be? (I did hear about the Visual Profiler but couldn't get it to work; it keeps giving me a "cannot load output" error.)
Each card has a maximum memory bandwidth and processing speed. For example, the GTX 480 bandwidth is 177.4 GB/s. You will need to know the specs for your card.
The first thing to decide is whether your code is memory bound or computation bound. If it is clearly one or the other, that will help you focus on the correct "efficiency" to measure. If your program is memory bound, then you need to compare your bandwidth with the card's maximum bandwidth.
You can calculate memory bandwidth by computing the amount of memory you read/write and dividing by the run time (I use CUDA events for timing). Here is a good example of calculating bandwidth efficiency (look at the whitepaper for the parallel reduction) and using it to help validate a kernel.
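For what that bookkeeping looks like in practice, here is a small hedged sketch in plain host-side C++ (the kernel, its byte counts, and the 4.2 ms timing are made-up numbers; the 177.4 GB/s peak is the GTX 480 figure quoted above, and the elapsed time would come from your existing timers or CUDA events):

```cpp
#include <cstdio>

// Effective bandwidth = (bytes read + bytes written) / elapsed time.
double effective_bandwidth_gbs(double bytes_read, double bytes_written, double elapsed_ms) {
    return (bytes_read + bytes_written) / (elapsed_ms / 1000.0) / 1e9;
}

int main() {
    // Hypothetical kernel: reads 64M floats once and writes 64M floats once, in 4.2 ms.
    const double n  = 64.0 * 1024 * 1024;
    const double bw = effective_bandwidth_gbs(n * 4, n * 4, 4.2);

    const double peak_gbs = 177.4;   // GTX 480 spec quoted in the answer; substitute your card's
    std::printf("effective: %.1f GB/s, %.1f%% of peak\n", bw, 100.0 * bw / peak_gbs);
}
```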
I don't know very much about determining the efficiency if instead you are ALU bound. You can probably count (or profile) the number of instructions, but what is the card's maximum?
I'm also not sure what to do in the likely case that your kernel is something in between memory bound and ALU bound.
Anyone...?
Generally "efficiently" would probably be a measure of how much memory and GPU cycles (average, min, max) of your program is using. Then the efficiency measure would be avg(mem)/total memory for the time period and so on with AVG(GPU cycles)/Max GPU cycles.
Then I'd compare these metrics to metrics from some GPU benchmark suites (which you can assume to be pretty efficient at using most of the GPU). Or you could measure against some random GPU intensive programs of your choice. That'd be how I'd do it but I've never thought to try so good luck!
As for bottlenecks and "optimal" performance. These are probably NP-Complete problems that no one can help you with. Get out the old profiler and debuggers and start working your way through your code.
Can't help with the profiler and micro-optimisation, but there is a CUDA Occupancy Calculator, http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls, which tries to estimate how your CUDA code uses the hardware resources, based on these values (a rough sketch of that calculation follows the list):
Threads Per Block
Registers Per Thread
Shared Memory Per Block (bytes)
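As a very rough sketch of what the calculator does with those three inputs (the per-SM limits below are illustrative placeholders rather than any particular GPU's real numbers, and real hardware rounds register and shared-memory allocations to a granularity that this ignores):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Kernel parameters: the calculator's three inputs.
    int threads_per_block = 256;
    int regs_per_thread   = 32;
    int smem_per_block    = 4096;     // bytes

    // Hypothetical per-SM hardware limits (placeholders, not a real compute capability).
    int max_threads_per_sm = 1536;
    int max_blocks_per_sm  = 8;
    int regs_per_sm        = 32768;
    int smem_per_sm        = 49152;   // bytes

    // Each resource caps how many blocks can be resident on one SM at a time.
    int by_threads = max_threads_per_sm / threads_per_block;
    int by_regs    = regs_per_sm / (regs_per_thread * threads_per_block);
    int by_smem    = smem_per_block ? smem_per_sm / smem_per_block : max_blocks_per_sm;

    int blocks = std::min({by_threads, by_regs, by_smem, max_blocks_per_sm});
    double occupancy = double(blocks * threads_per_block) / max_threads_per_sm;

    std::printf("resident blocks/SM: %d, occupancy: %.0f%%\n", blocks, occupancy * 100.0);
}
```

Whichever resource produces the smallest block count is the one limiting your occupancy, which is exactly what the spreadsheet highlights.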

Algorithms for modern hardware?

Once again, I find myself with a set of broken assumptions. The article itself is about a 10x performance gain by modifying a proven-optimal algorithm to account for virtual memory:
On a modern multi-issue CPU, running at some gigahertz clock frequency, the worst-case loss is almost 10 million instructions per VM page fault. If you are running with a rotating disk, the number is more like 100 million instructions.
What good is an O(log2(n)) algorithm if those operations cause page faults and slow disk operations? For most relevant datasets an O(n) or even an O(n^2) algorithm, which avoids page faults, will run circles around it.
Are there more such algorithms around? Should we re-examine all those fundamental building blocks of our education? What else do I need to watch out for when writing my own?
Clarification:
The algorithm in question isn't faster than the proven-optimal one because the Big-O notation is flawed or meaningless. It's faster because the proven-optimal algorithm relies on an assumption that is not true in modern hardware/OSes, namely that all memory access is equal and interchangeable.
You only need to re-examine your algorithms when your customers complain about the slowness of your program or it is missing critical deadlines. Otherwise focus on correctness, robustness, readability, and ease of maintenance. Until these items are achieved any performance optimization is a waste of development time.
Page faults and disk operations may be platform specific. Always profile your code to see where the bottlenecks are. Spending time on these areas will produce the most benefits.
If you're interested, along with page faults and slow disk operations, you may want to be aware of:
Cache hits -- Data-Oriented Design
Cache hits -- reducing unnecessary branches/jumps
Cache prediction -- shrinking loops so they fit into the processor's cache
Again, these items matter only after quality has been achieved, customers have complained, and a profiler has analyzed your program.
One important thing is to realize that the most common usage of big-O notation (to talk about runtime complexity) is only half of the story - there's another half, namely space complexity (that can also be expressed using big-O) which can also be quite relevant.
Generally these days, memory capacity advances have outpaced computing speed advances (for a single core - parallelization can get around this), so less focus is given to space complexity, but it's still a factor that should be kept in mind, especially on machines with more limited memory or if working with very large amounts of data.
I'll expand on GregS's answer: the difference is between effective complexity and asymptotic complexity. Asymptotic complexity ignores constant factors and is valid only for “large enough” inputs. Oftentimes “large enough” can actually mean “larger than any computer can deal with, now and for a few decades”; this is where the theory (justifiably) gets a bad reputation. Of course there are also cases where “large enough” means n=3 !
A more complex (and thus more accurate) way of looking at this is to first ask “what is the size range of problems you are interested in?” Then you need to measure the efficiency of various algorithms in that size range, to get a feeling for the ‘hidden constants’. Or you can use finer methods of algorithmic asymptotics which actually give you estimates on the constants.
The other thing to look at is "transition points". Of course an algorithm which runs in 2n^2 time will be faster than one that runs in 10^16 n log(n) time for all n < 1.99 * 10^17. So the quadratic algorithm will be the one to choose (unless you are dealing with the sizes of data that CERN worries about). Even lower-order terms can bite - 3n^3 is a lot better than n^3 + 10^16 n^2 for n < 5 * 10^15 (assuming that these are actual complexities).
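For concreteness, here is the arithmetic behind those two thresholds (a sketch; it assumes the logarithm in the n log(n) term is the natural log, which is what makes the 1.99 * 10^17 figure come out):

```latex
% Quadratic vs. n log n: divide both sides by n > 0.
\[ 2n^2 < 10^{16}\, n \ln n \iff 2n < 10^{16} \ln n
   \quad (\text{holds for } n \lesssim 1.99 \times 10^{17},
   \text{ since } 10^{16}\ln(1.99\times 10^{17}) \approx 3.98\times 10^{17} \approx 2n) \]
% Lower-order term dominating: divide both sides by n^2 > 0.
\[ 3n^3 < n^3 + 10^{16} n^2 \iff 2n^3 < 10^{16} n^2 \iff n < 5 \times 10^{15} \]
```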
There are no broken assumptions that I see. Big-O notation is a measure of algorithmic complexity on a very, very simplified, idealized computing machine, ignoring constant terms. Obviously it is not the final word on actual speeds on actual machines.
O(n) is only part of the story -- a big part, and frequently the dominant part, but not always the dominant part. Once you get to performance optimization (which should not be done too early in your development), you need to consider all of the resources you are using. You can generalize Amdahl's Law to mean that your execution time will be dominated by the most limited resource. Note that this also means that the particular hardware on which you're executing must be considered. A program that is highly optimized and extremely efficient for a massively parallel computer (e.g., CM or MasPar) would probably not do well on a big vector box (e.g., a Cray-2), nor on a high-speed microprocessor. That program might not even do well on a massive array of capable microprocessors (map/reduce style). The different optimizations for the different balances of cache, CPU communications, I/O, CPU speed, memory access, etc. mean different performance.
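For reference (not stated in the answer above), the standard form of Amdahl's Law, with p the fraction of the work that benefits from the extra resource and s the speedup of that fraction; the "most limited resource" reading follows because the un-accelerated term 1 - p eventually dominates:

```latex
\[ S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}},
   \qquad \lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1 - p} \]
```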
Back when I spent time working on performance optimizations, we would strive for "balanced" performance over the whole system. A super-fast CPU with a slow I/O system rarely made sense, and so on. O() typically considers only CPU complexity. You may be able to trade off memory space (unrolling loops doesn't make O() sense, but it does frequently help real performance); worry about cache hits versus linear memory layouts versus memory bank hits; virtual vs real memory; tape vs rotating disk vs RAID, and so on. If your performance is dominated by CPU activity, with I/O and memory loafing along, big-O is your primary concern. If your CPU is at 5% and the network is at 100%, maybe you can get away from big-O and work on I/O, caching, etc.
Multi-threading, particularly with multi-cores, makes all of that analysis even more complex. This opens up into very extensive discussion. If you are interested, Google can give you months or years of references.
