What is better from an energy-saving perspective to run a task with minimum energy consumption - CPU frequency or CPU usage?

The CPU frequency and CPU usage are the main factors that impact energy consumption (as far as I know). However, which is better from an energy-saving perspective to run a task with minimum energy consumption:
Option 1: Maximum CPU frequency with minimum usage
Option 2: Maximum CPU usage with minimum frequency.

Work per time scales approximately linearly with CPU frequency. (A bit less than linear because higher CPU frequency means DRAM latency is more clock cycles).
CPU power has two components: switching (dynamic) power, which scales with f^3 (because voltage has to increase for higher frequency, and the transistors pump that C*V^2 capacitor energy more often as they switch); and leakage power, which doesn't vary as dramatically. At high frequency dynamic power dominates, but as you lower the frequency, leakage eventually becomes the significant part. The smaller your transistors, the more significant leakage is.
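As a toy illustration of that scaling (all constants are made up, not measurements of any real chip): dynamic power is roughly C*V^2*f, and along the voltage/frequency curve V rises roughly linearly with f, so the dynamic term grows roughly as f^3 while leakage grows with V but not directly with f.

    /* Toy model of dynamic vs. leakage power as frequency (and voltage) scale.
       All constants are made up for illustration; nothing here is measured. */
    #include <stdio.h>

    int main(void) {
        const double c_eff = 1.0;         /* effective switched capacitance (arbitrary units) */
        const double v_per_ghz = 0.3;     /* hypothetical slope of the voltage/frequency curve */
        const double leak_per_volt = 2.0; /* hypothetical leakage coefficient */

        for (double f = 0.8; f <= 4.01; f += 0.8) {
            double v = v_per_ghz * f;          /* supply voltage needed at this frequency */
            double p_dyn = c_eff * v * v * f;  /* switching power: C*V^2*f, so ~f^3 */
            double p_leak = leak_per_volt * v; /* leakage grows with V, not directly with f */
            printf("f=%.1f GHz  V=%.2f  P_dyn=%.3f  P_leak=%.3f\n", f, v, p_dyn, p_leak);
        }
        return 0;
    }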
System-wide, there's also other power for things like DRAM that doesn't change much or at all with CPU frequency.
Min frequency is more efficient, unless the minimum is far below the best frequency for work per energy. (Some parts of power decrease with frequency; others, like leakage current and DRAM refresh, don't.)
Frequencies lower than max give more work per energy (better task efficiency) down to a certain point, like 800 MHz on a Skylake CPU on Intel's 14 nm process. If there's work to be done, there's no gain from dropping below that; just race-to-sleep at that most efficient frequency. (Power would decrease, but work rate would decrease more below that point.)
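To see why there is a sweet spot rather than "lower is always better", here is a toy energy-per-task calculation (again with made-up constants): energy is power times time; the dynamic energy per task shrinks roughly as f^2 as you slow down, but the fixed/leakage power is paid for longer, so total energy per task has a minimum at some intermediate frequency.

    /* Toy model: energy to finish a fixed amount of work at different frequencies.
       Dynamic power ~ f^3 (voltage scaled with frequency); fixed power covers
       leakage, DRAM refresh, and the rest of the system. Made-up constants. */
    #include <stdio.h>

    int main(void) {
        const double work = 1.0;    /* arbitrary units of work to complete */
        const double k_dyn = 1.0;   /* dynamic power coefficient: P_dyn = k_dyn * f^3 */
        const double p_fixed = 3.0; /* power that does not drop with frequency */

        for (double f = 0.4; f <= 4.01; f += 0.4) {
            double time = work / f; /* work rate scales roughly with f */
            double energy = (k_dyn * f * f * f + p_fixed) * time;
            printf("f=%.1f  time=%.2f  energy=%.2f\n", f, time, energy);
        }
        /* Energy per task falls as f drops until the fixed term dominates, then
           rises again; below that sweet spot, race-to-sleep wins. */
        return 0;
    }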
https://en.wikichip.org/wiki/File:Intel_Architecture,_Code_Name_Skylake_Deep_Dive-_A_New_Architecture_to_Manage_Power_Performance_and_Energy_Efficiency.pdf is a slide deck from IDF2015 about Skylake power management that covers a lot of that general-case stuff well. Unfortunately I don't know where to find a copy of the audio from Efraim Rotem's talk; it was up for a year or so after, but the original link is dead now. :/
Also, in general about dynamic power (from switching, not leakage) scaling with frequency cubed if you adjust voltage as well as frequency, see Modern Microprocessors: A 90-Minute Guide! and
https://electronics.stackexchange.com/questions/614018/why-does-switching-cause-power-dissipation
https://electronics.stackexchange.com/questions/258724/why-do-cpus-need-so-much-current
https://electronics.stackexchange.com/questions/548601/why-does-decreasing-the-cmos-supply-voltage-also-decrease-the-maximum-circuit-fr

Related

How do I interpret this difference in matrix multiplication GFLOP/s?

I'm trying some matrix multiplication optimizations from this wiki here. While measuring the GFLOP/s for the naive, triple-for-loop matmul, I expected to see a drop in GFLOP/s after a particular size, which, according to the wiki, represents the point where the data stops fitting in the cache.
I ran the benchmark on 2 different PCs:
3rd gen Intel i5 (3210M): (L1=32KB per core, L2=256KB per core, L3=3MB shared).
I got the expected graph, with a sharp drop from ~2GFLOP/s to 0.5.
6th gen Intel i7 (6500U): (L1=32KB per core, L2=256KB per core, L3=4MB shared)
On this, I instead see a gradual decrease in GFLOP/s, even if I try for larger sizes. Looking at the Ubuntu system monitor, one of the CPU cores was always at 100% usage.
I'm trying to understand the following:
How do I interpret the change in GFLOP/s with matrix size? If the expected drop corresponds to the data no longer fitting in the cache, why do I not see such a drop even for much bigger sizes on the i7?
Why does the 3rd gen i5 perform faster for smaller sizes?
How do I interpret the CPU occupancy? Would I see a reduction in CPU usage if more time was being spent in fetching data from cache/RAM?
Edit:
I switched from float to double and tried -O3 and -O0; here are the plots. I couldn't check frequencies on the older i5, but the Skylake i7 goes to turbo frequency almost instantaneously and stays there for most of the process's duration.
Code from here; I used GCC 7.4.0 on the i7, and clang (Apple LLVM 7) on the i5.
Regarding question 2:
While both CPUs have the same base and turbo frequency, the Ivy Bridge has a TDP of 35 W while the Skylake has 15 W. Even though the Skylake is on a much newer process, it is possible that the Ivy Bridge, with its bigger power budget, is able to use its turbo for a larger part of the calculation. (Peter Cordes already mentioned checking the actual turbo.)
Regarding question 3:
CPU utilization doesn't depend on what the CPU is doing, waiting for RAM still counts as utilized. There are performance counters you can query which would tell you if the Ivy Bridge is slower because it stalls for memory more often.
With efficient cache-blocking, dense matmul should bottleneck on ALU, not memory bandwidth: it does O(N^3) work over O(N^2) memory.
But you're measuring a naive matmul, which strides down the columns of one input, so it's always horrible once the data outgrows the cache. This is the classic problem that cache-blocking / loop-tiling solves.
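For reference, this is the access pattern being described (a sketch of the usual naive triple loop, not necessarily the exact code from that wiki):

    /* Sketch of the naive triple loop, C = A * B, N x N row-major doubles.
       The b[k*n + j] access strides down a column of B: each step jumps n*8 bytes,
       so once B's columns no longer fit in cache, nearly every iteration misses. */
    void matmul_naive(const double *a, const double *b, double *c, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i*n + k] * b[k*n + j];  /* A: sequential, B: column stride */
                c[i*n + j] = sum;
            }
        }
    }

GFLOP/s for this kernel is usually computed as 2*N^3 floating-point operations (one multiply and one add per innermost iteration) divided by the elapsed time in nanoseconds.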
Your Skylake has significantly better bandwidth to L3 cache and DRAM, and less-associative L2 cache (4-way instead of 8-way). Still, I would have expected better performance when your working set fits in L2 than when it doesn't.
SKL probably also has better HW prefetching, and definitely a larger out-of-order window size, than IvyBridge.
IvyBridge (including your 3210M) was the generation that introduced next-page hardware prefetching, but I think the feature with that name is just TLB prefetching, not data. It probably isn't a factor, especially if transparent hugepages are avoiding any TLB misses.
But if not, TLB misses might be the real cause of the dropoff on IvB. Use performance counters to check. (e.g. perf stat)
Was your CPU frequency shooting up to max turbo right away and staying there for both CPUs? #idspispopd's answer also makes some good points about total power / cooling budget, but yeah check that your two systems are maintaining the same CPU frequencies for this. Or if not, record what they are.
You did compile with optimization enabled, right? If not, that could be enough overhead to hide a memory bottleneck. Did you use the same compiler/version/options on both systems? Did you use -march=native?

Risk of "big" computations on hardware

Assuming someone is doing some big computations (I know that's relative... and I'm not going to specify the nature of the operation just to keep the question open; it may be sorting data, searching for elements, calculating the prime factors of a really long number...) using badly designed, brute-force algorithms or just an iterative process to get the results, can this approach have any bad effects on the CPU or the RAM over a long period of time?
Intensive processing will increase the heat generated by the CPU (or GPU) and even the RAM (to a much smaller degree).
Recent CPU chips have the ability to slow themselves down once the heat exceeds certain thresholds to prevent damage to the CPU. That would typically indicate a failure in the cooling system though.
I do not believe there are many issues other than electricity consumption and overheating risks.

Relation between higher CPU frequency and thrashing?

This happened to be one of my class test question.
In a demand paging system, the CPU utilization is 20% and the paging disk utilization is 97.7%
If the CPU speed is increased, will the CPU usage be increased in this scenario?
Paging is effectively a bottleneck in this example. The amount of computation per unit time might increase slightly with a faster CPU but not in proportion to the increase in CPU speed (so the percentage utilization would decrease).
A quick and dirty estimation would use Amdahl's Law. In the example, 80% of the work is paging and 20% is CPU-limited, so an N-fold improvement in CPU performance would result in a speedup factor of 1/((1 - 0.2) + (0.2/N)).
A more realistic estimate would add an awareness of queueing theory to recognize that if the paging requests came in more frequently the utilization would actually increase even with a fixed buffer size. However, the increase in paging utilization is smaller than the increase in request frequency.
Without looking at the details of queueing theory, one can also simply see that the maximum potential improvement in paging is just over 2%. (If paging utilization was driven up to 100%: 100/97.7 or 1.0235.) Even at 100% paging utilization, paging would take 0.80/(100/97.7) of the original time, so clearly there is not much opportunity for improvement.
If a 10-fold CPU speed improvement drove paging utilization to effectively 100%, every second of work under the original system would use 781.6 milliseconds in paging (800 ms / (100/97.7)) and 20 milliseconds in the CPU (200 ms / 10). CPU utilization would decrease to 20 / (781.6 + 20) or about 2.5%.
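Writing that arithmetic out (just reproducing the numbers above, with the same 80%/20% split and the assumed 10-fold CPU speedup):

    /* Reproduces the back-of-the-envelope numbers above. */
    #include <stdio.h>

    int main(void) {
        const double cpu_frac = 0.20, paging_frac = 0.80;
        const double paging_util = 0.977; /* 97.7% paging-disk utilization */
        const double n = 10.0;            /* the assumed 10-fold CPU speedup */

        /* Amdahl's Law with only the CPU component sped up: */
        double speedup = 1.0 / ((1.0 - cpu_frac) + cpu_frac / n);

        /* Per original second of work: paging time if the disk ran at 100%
           utilization, plus the CPU time shrunk by the factor n. */
        double paging_ms = 1000.0 * paging_frac * paging_util; /* 800 ms / (100/97.7) */
        double cpu_ms = 1000.0 * cpu_frac / n;                 /* 200 ms / 10 */
        double cpu_util = cpu_ms / (paging_ms + cpu_ms);

        printf("Amdahl speedup: %.3f\n", speedup);                 /* ~1.220 */
        printf("new CPU utilization: %.1f%%\n", 100.0 * cpu_util); /* ~2.5% */
        return 0;
    }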

Is memory latency affected by CPU frequency? Is it a result of memory power management by the memory controller?

I basically need some help to explain/confirm some experimental results.
Basic Theory
A common idea expressed in papers on DVFS is that execution times have on-chip and off-chip components. On-chip components of execution time scale linearly with CPU frequency whereas the off-chip components remain unaffected.
Therefore, for CPU-bound applications, there is a linear relationship between CPU frequency and instruction-retirement rate. On the other hand, for a memory bound application where the caches are often missed and DRAM has to be accessed frequently, the relationship should be affine (one is not just a multiple of the other, you also have to add a constant).
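A minimal sketch of that two-component model (constants are made up for illustration; the off-chip term is a fixed DRAM-stall time that does not scale with frequency):

    /* Two-component timing model: on-chip time scales with 1/f, off-chip doesn't.
       Constants are made up for illustration. */
    #include <stdio.h>

    int main(void) {
        const double instrs = 1e9;     /* instructions retired by the workload */
        const double cpu_cycles = 2e9; /* on-chip cycles the workload needs */
        const double t_off_cpu = 0.0;  /* CPU-bound case: no off-chip time */
        const double t_off_mem = 0.5;  /* memory-bound case: 0.5 s of DRAM stalls */

        for (double f_ghz = 1.0; f_ghz <= 3.01; f_ghz += 0.5) {
            double t_cpu = cpu_cycles / (f_ghz * 1e9) + t_off_cpu;
            double t_mem = cpu_cycles / (f_ghz * 1e9) + t_off_mem;
            printf("f=%.1f GHz  cpu-bound=%.3f  mem-bound=%.3f instr/ns\n",
                   f_ghz, instrs / t_cpu / 1e9, instrs / t_mem / 1e9);
        }
        /* The CPU-bound rate scales proportionally with f; the memory-bound rate
           grows more slowly because the off-chip term does not shrink. */
        return 0;
    }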
Experiment
I was doing experiments looking at how CPU frequency affects instruction-retirement rate and execution time under different levels of memory-boundedness.
I wrote a test application in C that traverses a linked list. I effectively create a linked list whose individual nodes have sizes equal to the size of a cache line (64 bytes), and I allocate a large amount of memory that is a multiple of the cache-line size.
The linked list is circular, such that the last element links to the first element. Also, this linked list randomly traverses through the cache-line-sized blocks in the allocated memory. Every cache-line-sized block in the allocated memory is accessed, and no block is accessed more than once.
Because of the random traversal, I assumed it should not be possible for the hardware to use any prefetching. Basically, by traversing the list, you have a sequence of memory accesses with no stride pattern, no temporal locality, and no spatial locality. Also, because this is a linked list, one memory access cannot begin until the previous one completes. Therefore, the memory accesses should not be parallelizable.
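A minimal sketch of that kind of pointer-chasing setup (my reconstruction of what is described above, not the original code): each node occupies one 64-byte cache line, the visiting order is shuffled, and the nodes are linked into a single random ring so every line is visited exactly once per lap and each load address depends on the previous load.

    /* Pointer-chasing microbenchmark skeleton: one node per 64-byte cache line,
       linked into a single random cycle so there is no stride pattern and each
       load depends on the previous one. */
    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        struct node *next;
        char pad[64 - sizeof(struct node *)]; /* pad each node to one cache line */
    };

    int main(void) {
        size_t n = 1 << 20; /* number of cache lines (64 MiB here); vary this */
        struct node *nodes = aligned_alloc(64, n * sizeof(struct node));

        /* Shuffle the visiting order, then link the nodes into one ring in that
           order, so the whole list is a single random cycle over all lines. */
        size_t *order = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (size_t i = 0; i < n; i++)
            nodes[order[i]].next = &nodes[order[(i + 1) % n]];

        /* Chase the pointers; each load's address comes from the previous load. */
        struct node *p = &nodes[order[0]];
        for (size_t i = 0; i < 10 * n; i++) p = p->next;

        printf("%p\n", (void *)p); /* keep the traversal from being optimized out */
        free(order);
        free(nodes);
        return 0;
    }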
When the amount of allocated memory is small enough, you should have no cache misses beyond initial warm up. In this case, the workload is effectively CPU bound and the instruction-retirement rate scales very cleanly with CPU frequency.
When the amount of allocated memory is large enough (bigger than the LLC), you should be missing the caches. The workload is memory bound and the instruction-retirement rate should not scale as well with CPU frequency.
The basic experimental setup is similar to the one described here:
"Actual CPU Frequency vs CPU Frequency Reported by the Linux "cpufreq" Subsystem".
The above application is run repeatedly for some duration. At the start and end of the duration, the hardware performance counter is sampled to determine the number of instructions retired over the duration. The length of the duration is measured as well. The average instruction-retirement rate is measured as the ratio between these two values.
This experiment is repeated across all the possible CPU frequency settings using the "userspace" CPU-frequency governor in Linux. Also, the experiment is repeated for the CPU-bound case and the memory-bound case as described above.
Results
The two following plots show results for the CPU-bound case and memory-bound case respectively. On the x-axis, the CPU clock frequency is specified in GHz. On the y-axis, the instruction-retirement rate is specified in (1/ns).
A marker is placed for each repetition of the experiment described above. The line shows what the result would be if the instruction-retirement rate increased at the same rate as the CPU frequency and passed through the lowest-frequency marker.
Results for the CPU-bound case.
Results for the memory-bound case.
The results make sense for the CPU-bound case, but not as much for the memory-bound case. All the markers for the memory-bound case fall below the line, which is expected because the instruction-retirement rate should not increase at the same rate as the CPU frequency for a memory-bound application. The markers appear to fall on straight lines, which is also expected.
However, there appear to be step changes in the instruction-retirement rate as the CPU frequency changes.
Question
What is causing the step changes in the instruction-retirement rate? The only explanation I could think of is that the memory controller is somehow changing the speed and power-consumption of memory with changes in the rate of memory requests. (As instruction-retirement rate increases, the rate of memory requests should increase as well.) Is this a correct explanation?
You seem to have exactly the results you expected - a roughly linear trend for the CPU-bound program, and a shallower affine one for the memory-bound case (which is less affected by the CPU). You will need a lot more data to determine whether they are consistent steps or whether they are - as I suspect - mostly random jitter depending on how 'good' the list is.
The CPU clock will affect bus clocks, which will affect timings and so on - synchronisation between differently clocked buses is always challenging for hardware designers. The spacing of your steps is, interestingly, 400 MHz, but I wouldn't draw too much from this - generally, this kind of stuff is way too complex and hardware-specific to be properly analysed without 'inside' knowledge of the memory controller used, etc.
(please draw nicer lines of best fit)

Questions on Measuring Time Using the CPU Clock

I'm aware of the standard methods of getting time deltas using CPU clock counters on various operating systems. My question is: how do such operating systems account for changes in CPU frequency made for power-saving purposes? I initially thought this could be explained by the fact that OSes use specific calls to measure frequency, to get the corrected frequency based on which core is being used, what frequency it's currently set to, etc. But then I realized: wouldn't that make any time delta inaccurate if the CPU frequency was lowered and raised back to its original value in between two clock queries?
For example take the following scenario:
1. Query the CPU cycle counter.
2. The operating system lowers the CPU frequency for power saving.
3. Some other code is run here.
4. The operating system raises the CPU frequency for performance.
5. Query the CPU cycle counter again.
6. Calculate the delta as the cycle difference divided by the frequency.
This would yield an inaccurate delta since the CPU frequency was not constant between the two queries. How is this worked around by the operating system or programs that have to work with time deltas using CPU cycles?
See this: wrong clock cycle measurements with rdtsc
There are more ways to deal with it:
Set the CPU clock to max.
Read the link above to see how to do it.
Use the PIT instead of RDTSC.
The PIT is the programmable interval timer (Intel 8253, if I remember correctly). It has been present on all PC motherboards since the 286 (and maybe even before), but its base clock is only ~1.19 MHz and not all OSes give you access to it.
Combine the PIT and RDTSC.
Just measure the CPU clock with the PIT repeatedly; when it is stable enough, start your measurement (and keep scanning for CPU clock changes). If the CPU clock changes during the measurement, throw that measurement away and start again.
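A rough sketch of that calibrate-and-check idea (an illustration of the approach, not the original code; clock_gettime(CLOCK_MONOTONIC) stands in for the PIT as the stable reference, and __rdtsc is GCC/clang's intrinsic for RDTSC). Note that on CPUs with an invariant TSC the measured rate won't actually move with frequency scaling, which side-steps the problem entirely.

    /* Estimate the TSC rate against a stable clock, time a region, then re-check
       the rate; if it moved, the measurement is discarded. */
    #include <stdio.h>
    #include <time.h>
    #include <x86intrin.h>

    static double tsc_hz(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        unsigned long long c0 = __rdtsc();
        /* busy-wait ~10 ms against the stable clock */
        do { clock_gettime(CLOCK_MONOTONIC, &t1); }
        while ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec) < 1e7);
        unsigned long long c1 = __rdtsc();
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return (double)(c1 - c0) / (ns / 1e9);
    }

    int main(void) {
        double hz_before = tsc_hz();
        unsigned long long start = __rdtsc();

        volatile double x = 0;                    /* the code being timed */
        for (int i = 0; i < 10000000; i++) x += i;

        unsigned long long end = __rdtsc();
        double hz_after = tsc_hz();

        if (hz_after / hz_before > 1.01 || hz_before / hz_after > 1.01) {
            puts("clock rate changed during measurement; discard and retry");
            return 1;
        }
        printf("elapsed: %.3f ms\n", (end - start) / hz_before * 1e3);
        return 0;
    }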

Resources