risk of "big" computations on hardware - algorithm

Assuming someone is doing some big computations (I know that's relative... and I'm not gonna specify the nature of the operation just to keep the question open, it may be sorting data, searching for elements, calculating the prime factors of a really long number... ) using badly designed, brute force algorithmes or just an itterative process to get the results, can this approach have any bad effetcs on the cpu or the ram over a long period of time ?

Intensive processing will increase the heat generated by the CPU (or GPU) and even the RAM (to a much smaller degree).
Recent CPU chips have the ability to slow themselves down once the heat exceeds certain thresholds to prevent damage to the CPU. That would typically indicate a failure in the cooling system though.
I do not believe there are much other issues other than electricity consumption and overheating risks.

Related

Roofline model: How does increasing Arithmetic Intensity allow room for improvements to performance?

Intel Tip: If you can’t break a memory roof, try to rework your algorithm for
higher arithmetic intensity. This will move you to the right and give
you more room to increase performance before hitting the memory
bandwidth roof.
For algorithms in the memory-bound region of a roofline plot, Intel suggests increasing the arithmetic intensity so that they move to the right (compute-bound region) hence providing room to improve the performance, since the performance roof would be higher.
I'm unable to understand how increasing the arithmetic intensity (say, increasing the no. of operations in the algorithm) can possibly improve a performance metric like the wall-clock time taken for the algorithm to run. Wouldn't you need to do more no. of computations even for a higher performance (in FLOPS)? Could someone explain how this is possible?
Increasing the arithmetic alone is not sufficient to make the algorithm faster. The idea is if you have a choice between multiple algorithm and one of them is memory bound, then it is probably better to pick the other one assuming it is not much slower in practice since you can hardly optimize memory-bound algorithm while this is often much easier for compute-bound one. The memory latency did almost not improve much over the last decade and the bandwidth is only slowly increasing (much less than the number of FLOPS of processors). This is known as the Memory Wall (stated several decades ago). Moving data becomes so expensive nowadays that it is sometimes better to recompute operations rather than storing the previous results. This is especially true for very large data since the bigger the data structure the slower it is. This situation is expected to become worse over the next decades. Thus, a slower compute-bound algorithm can become faster than a memory-bound one in a near future (especially if it can-be/is parallelized).

Performance related to ram speed

My question is: why overclocking the RAMs does not bring a significant performance increasing?
If I speed up their frequency and/or their latency, I will not get big advantages.
I can't understand it, considering that the CPU should be able to read and write faster as well as processing data.
Or maybe the bottleneck is getting smaller because of faster CPUs?
Please give me as much information as you can.
Regards.
I think this is because of two reasons:
Keep in mind that not all operations that the CPU does are writing or reading from RAM. While it waits for RAM, it can be computing other things. I don't know what the average RAM-based-operations versus register-based operations ratio.
CPU's have ultra fast memory, called L1-, L2-, L3-Cache. These memory units are very very fast, but keep only a few megabytes of data.

hyperthreading and turbo boost in matrix multiply - worse performance using hyper threading

I am tunning my GEMM code and comparing with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000) I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can be also seen in this plot
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable. The performance jumps around quit a bit each iteration both with Eigen and my own GEMM code. I'm surprised that Hyperthreading makes the performance so much worse. I guess this is not not a question. It's an unexpected observation which I'm hoping to find feedback on.
I see that not using hyper threading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring max performance. What I do now is run CPUz and look at the frequency as I'm running my GEMM code and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How to I properly account for turbo boost?
The purpose of hyperthreading is to improve CPU usage for code exhibiting high latency. Hyperthreading masks this latency by treating two threads at once thus having more instruction level parallelism.
However, a well written matrix product kernel exhibits an excellent instruction level parallelism and thus exploits nearly 100% of the CPU ressources. Therefore there is no room for a second "hyper" thread, and the overhead of its management can only decrease the overall performance.
Unless I've missed something, always possible, your CPU has one clock shared by all its components so if you measure it's rate at 4.3GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, some cores running at one rate, others at another rate; the shared components (eg memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor-person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along pushing your n*10^6 contiguous memory locations through the FPUs a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles, at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more that the actual physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work, is likely unaware that it is doing so, relying on merely the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide, is that you can max out all physical processors, without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads, they are not additional processors.

Optimum performance of GPU

I have been asked to measure how "efficiently " does my code use the GPU /what % of peak performance are algorithms achieving.I am not sure how to do this comparison.Till now I have basically had timers put in my code and measure the execution.How can I compare this to optimal performance and find what might be the bottle necks? (I did hear about visual profiler but couldnt get it to work ..it keeps giving me "cannot load output" error).
Each card has a maximum memory bandwidth and processing speed. For example, the GTX 480 bandwidth is 177.4 GB/s. You will need to know the specs for your card.
The first thing to decide is whether your code is memory bound or computation bound. If it is clearly one or the other, that will help you focus on the correct "efficiency" to measure. If your program is memory bound, then you need to compare your bandwidth with the cards maximum bandwidth.
You can calculate memory bandwidth by computing the amount of memory you read/write and dividing by run time (I use cuda events for timing). Here is a good example of calculating bandwidth efficiency (look at the whitepaper for the parallel reduction) and using it to help validate a kernel.
I don't know very much about determining the efficiency if instead you are ALU bound. You can probably count (or profile) the number of instructions, but what is the card's maximum?
I'm also not sure what to do in the likely case that your kernel is something in between memory bound and ALU bound.
Anyone...?
Generally "efficiently" would probably be a measure of how much memory and GPU cycles (average, min, max) of your program is using. Then the efficiency measure would be avg(mem)/total memory for the time period and so on with AVG(GPU cycles)/Max GPU cycles.
Then I'd compare these metrics to metrics from some GPU benchmark suites (which you can assume to be pretty efficient at using most of the GPU). Or you could measure against some random GPU intensive programs of your choice. That'd be how I'd do it but I've never thought to try so good luck!
As for bottlenecks and "optimal" performance. These are probably NP-Complete problems that no one can help you with. Get out the old profiler and debuggers and start working your way through your code.
Can't help with profiler and microoptimisation, but there is a CUDA calculator http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls , which trys to estimate how does your CUDA code use the hardware resources, based on this values:
Threads Per Block
Registers Per Thread
Shared Memory Per Block (bytes)

Algorithms for modern hardware?

Once again, I find myself with a set of broken assumptions. The article itself is about a 10x performance gain by modifying a proven-optimal algorithm to account for virtual memory:
On a modern multi-issue CPU, running
at some gigahertz clock frequency, the
worst-case loss is almost 10 million
instructions per VM page fault. If you
are running with a rotating disk, the
number is more like 100 million
instructions.
What good is an O(log2(n)) algorithm
if those operations cause page faults
and slow disk operations? For most
relevant datasets an O(n) or even an
O(n^2) algorithm, which avoids page
faults, will run circles around it.
Are there more such algorithms around? Should we re-examine all those fundamental building blocks of our education? What else do I need to watch out for when writing my own?
Clarification:
The algorithm in question isn't faster than the proven-optimal one because the Big-O notation is flawed or meaningless. It's faster because the proven-optimal algorithm relies on an assumption that is not true in modern hardware/OSes, namely that all memory access is equal and interchangeable.
You only need to re-examine your algorithms when your customers complain about the slowness of your program or it is missing critical deadlines. Otherwise focus on correctness, robustness, readability, and ease of maintenance. Until these items are achieved any performance optimization is a waste of development time.
Page faults and disk operations may be platform specific. Always profile your code to see where the bottlenecks are. Spending time on these areas will produce the most benefits.
If you're interested, along with page faults and slow disk operations, you may want to aware of:
Cache hits -- Data Oriented Design
Cache hits -- Reducing unnecessary
branches / jumps.
Cache prediction -- Shrink loops so
they fit into the processor's cache.
Again, these items are only after quality has been achieved, customer complaints and a profiler has analyzed your program.
One important thing is to realize that the most common usage of big-O notation (to talk about runtime complexity) is only half of the story - there's another half, namely space complexity (that can also be expressed using big-O) which can also be quite relevant.
Generally these days, memory capacity advances have outpaced computing speed advances (for a single core - parallelization can get around this), so less focus is given to space complexity, but it's still a factor that should be kept in mind, especially on machines with more limited memory or if working with very large amounts of data.
I'll expand on GregS's answer: the difference is between effective complexity and asymptotic complexity. Asymptotic complexity ignores constant factors and is valid only for “large enough” inputs. Oftentimes “large enough” can actually mean “larger than any computer can deal with, now and for a few decades”; this is where the theory (justifiably) gets a bad reputation. Of course there are also cases where “large enough” means n=3 !
A more complex (and thus more accurate) way of looking at this is to first ask “what is the size range of problems you are interested in?” Then you need to measure the efficiency of various algorithms in that size range, to get a feeling for the ‘hidden constants’. Or you can use finer methods of algorithmic asymptotics which actually give you estimates on the constants.
The other thing to look at are ‘transition points’. Of course an algorithm which runs in 2n2 time will be faster than one that runs in 1016nlog(n) times for all n < 1.99 * 1017. So the quadratic algorithm will be the one to choose (unless you are dealing with the sizes of data that CERN worries about). Even sub-exponential terms can bite – 3n3 is a lot better than n3 + 1016n2 for n < 5*1015 (assuming that these are actual complexities).
There are no broken assumptions that I see. Big-O notation is a measure of algorithmic complexity on a very, very, simplified idealized computing machine, and ignoring constant terms. Obviously it is not the final word on actual speeds on actual machines.
O(n) is only part of the story -- a big part, and frequently the dominant part, but not always the dominant part. Once you get to performance optimization (which should not be done too early in your development), you need to consider all of the resources you are using. You can generalize Amdahl's Law to mean that your execution time will be dominated by the most limited resource. Note that also means that the particular hardware on which you're executing must also be considered. A program that is highly optimized and extremely efficient for a massively parallel computer (e.g., CM or MassPar) would probably not do well on a big vector box (e.g., Cray-2) nor on a high-speed microprocessor. That program might not even do well on a massive array of capable microprocessors (map/reduce style). The different optimizations for the different balances of cache, CPU communications, I/O, CPU speed, memory access, etc. mean different performance.
Back when I spent time working on performance optimizations, we would strive for "balanced" performance over the whole system. A super-fast CPU with a slow I/O system rarely made sense, and so on. O() typically considers only CPU complexity. You may be able to trade off memory space (unrolling loops doesn't make O() sense, but it does frequently help real performance); concern about cache hits vice linear memory layouts vice memory bank hits; virtual vs real memory; tape vs rotating disk vs RAID, and so on. If your performance is dominated by CPU activity, with I/O and memory loafing along, big-O is your primary concern. If your CPU is at 5% and the network is at 100%, maybe you can get away from big-O and work on I/O, caching, etc.
Multi-threading, particularly with multi-cores, makes all of that analysis even more complex. This opens up into very extensive discussion. If you are interested, Google can give you months or years of references.

Resources