Slow threading performance on Linux on ARM9 - performance

When I write a simple application, running for 10 minutes, that starts 10 threads once (pthreads), each sleeping for 1 ms in a loop (not doing anything else) the CPU is used ca. 44% (top reports that). It is a ARM9 CPU with 450 MHz, Linux 2.6.37 is used as OS. There is no other program running, it tried out different kernel configs (Dynamic Ticks, Soft/Hard IRQ, High Resolution Timer, ..., ..., ...), different priorities (up to 99) but the numbers stay the same. /usr/bin/time -v shows ca. 5'200'000 voluntary context switches and ca. 3 minutes are spent in Kernel space. Sleepin in each thread for ca. 5 ms and the CPU utilization goes down to ca. 9% which is IMO still crazy (40'500'000 cycles to safe some registers). clock_nanosleep was used for sleeping (CLOCK_REALTIME/CLOCK_MONOTONIC did not change anything).
I'm aware that a full context switch is expensive on ARM9 because caches have to be cleared. But a simple thread switch, or switch to the OS shouldn't be that expensive IMHO (address space remains the same, no cache/TLB flushing required). Is this common or should I try to find the bottleneck in the kernel?

You're busily waking up and going back to sleep at 100uS intervals -- 10 threads, 1ms, that's 100uS on average. And keep in mind that you have two context switches for each of those 100uS intervals, so you have a context switch every 50uS on average, or 20,000 times per second.
Perhaps that's the answer you're looking for?

Related

Approximating Processing Power from CPU-Time

In a particular scenario I found that a code has taken 20 CPU Years and 4 real Months time. My goal is to approximate the amount of processing power utilized considering the fact that all the processors were on 100% usage all the time. So, my approach is as follows,
20 CPU Years = 20 * 365 * 24 CPU Hours = 175,200 CPU Hours.
Now, 1 CPU Year means 1 GFLOP machine working for 1 real Hour. Which means, in this case, the work done is, 1 GFLOP machine working for 175,200 real Hours. But in reality it took 4 * 30 * 24 = 2,880 real hours. So, approximately 175,200/2,880 =(approx.) 61 GLFOP machine.
My question is am I doing the approximation correctly or misunderstanding some particular term as per the calculations given above ? Or I am mixing GFLOPS and GFLOP together ?
Definitions
My question is am I doing the approximation correctly or misunderstanding some particular term as per the calculations given above ?
"100% usage" may mean the CPU spent 20% of its time doing nothing waiting for data to be transferred to/from RAM (and/or branch mispredictions or other stalls), 10% of its time running faster than normal because other CPUs where actually doing nothing, and 15% of its time running slower than normal for power/temperature management reasons; and (depending on where you got that "100% usage" statistic) "100% usage" may be significantly more confusing (e.g. http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html ).
Depending on context; GFLOPS is either "theoretical maximum under perfect conditions that will never occur in practice" (worthless marketing hype); or a direct measurement of a specific case that ignores most of the work a CPU did (everything involving integers, all control flow, all data transfer, all memory management, ...)
In a particular scenario I found that a code has taken 20 CPU Years and 4 real Months time. My goal is to approximate the amount of processing power utilized.
From this; you might (or might not) be able to say "most of the work that CPUs did was discarded due to lockless algorithm retries and/or transactions that couldn't be committed; and (partly because the bottleneck was RAM bandwidth and partly because of the way SMT works on this system) it would have been 4 times as fast if half as many CPUs were used."
TL;DR: Approximating processor power is just an inconvenient way to obfuscate the (more useful) information that you started with (e.g. that a specific piece of code running on a specific piece of hardware that was working on a specific piece of data happened to take 4 months of real time).
Your Calculation:
Yes; you're mixing GFLOP and GFLOPS (e.g. GFLOPS = GFLOP per second; and a "1 GFLOP machine" is a computer that can do a billion floating point operations in an infinite amount of time, which is every computer), and the web page you linked to is making the same mistake (e.g. saying "a 1 GFLOP reference machine" when it should be saying "a 1 GFLOPS reference machine").
Note that there's no need to care about GFLOPS or GFLOP for the calculation you're doing: If something was supposed to take 20 "reference CPU years" and actually took 4 months (or 4/12 years); then you'd say that your hardware is equivalent to "20 / (4/12) = 60 reference CPUs". Of course this is horribly silly and it'd make more sense to say that your hardware happened to achieve 60 GFLOPS without bothering with the misleading "reference CPU" nonsense.

What is the difference between MIPS and Execution time

When it comes to rating the performance of a processor, is calculating the Million Instructions Per Second (MIPS) a practical measure to use?
Or is finding the Execution Time (IC x CPI x 1/CR) the main thing to use?
Imagine you have one CPU that does 100 million tiny little instructions that don't do much on their own per second. Next; imagine you have another CPU where you need a quarter of the instructions to do the same work; which can do 50 million larger instructions per second. The second CPU has half as many MIPs but is twice as fast.
Now.. Imagine you have 2 CPUs that both execute the exact same instructions; where one CPU runs at 1 GHz, can do 5 instructions per cycle, and stalls rarely; and the other CPU runs at 4 GHz, can only do 2 instructions per cycle, and spends a lot more time stalled doing nothing (due to cache misses, branch mispredictions, etc). In this case the 1 GHz CPU might be significantly faster than the 4 GHz CPU.
Finally; imagine you have 2 CPUs that both execute the exact same instructions, both have exactly the same clock frequency, both execute the same number of instructions per cycle, and both spend exactly the same amount of time stalled. One CPU has overheats easily and had to "under-clock" itself to a crawl after 250 milliseconds of not being idle just to avoid melting itself, and the other CPU can go at max. speed continuously without ever overheating.
Execution time is how long it takes to do some work taking everything into account (and can be extremely different for different types of work); while MIPS is like a real estate agent determining how much a building is worth by measuring the weight of a rubber chicken.

Low GPU usage in CUDA

I implemented a program which uses different CUDA streams from different CPU threads. Memory copying is implemented via cudaMemcpyAsync using those streams. Kernel launches are also using those streams. The program is doing double-precision computations (and I suspect this is the culprit, however, cuBlas reaches 75-85% CPU usage for multiplication of matrices of doubles). There are also reduction operations, however they are implemented via if(threadIdx.x < s) with s decreasing 2 times in each iteration, so stalled warps should be available to other blocks. The application is GPU and CPU intensive, it starts with another piece of work as soon as the previous has finished. So I expect it to reach 100% of either CPU or GPU.
The problem is that my program generates 30-40% of GPU load (and about 50% of CPU load), if trusting GPU-Z 1.9.0. Memory Controller Load is 9-10%, Bus Interface Load is 6%. This is for the number of CPU threads equal to the number of CPU cores. If I double the number of CPU threads, the loads stay about the same (including the CPU load).
So why is that? Where is the bottleneck?
I am using GeForce GTX 560 Ti, CUDA 8RC, MSVC++2013, Windows 10.
One my guess is that Windows 10 applies some aggressive power saving, even though GPU and CPU temperatures are low, the power plan is set to "High performance" and the power supply is 700W while power consumption with max CPU and GPU TDP is about 550W.
Another guess is that double-precision speed is 1/12 of the single-precision speed because there are 1 double-precision CUDA core per 12 single-precision CUDA cores on my card, and GPU-Z takes as 100% the situation when all single-precision and double-precision cores are used. However, the numbers do not quite match.
Apparently the reason was low occupancy due to CUDA threads using too many registers by default. To tell the compiler the limit on the number of registers per thread, __launch_bounds__ can be used, as described here. So to be able to launch all 1536 threads in 560 Ti, for block size 256 the following can be specified:
_global__ void __launch_bounds__(256, 6) MyKernel(...) { ... }
After limiting the number of registers per CUDA thread, the GPU usage has raised to 60% for me.
By the way, 5xx series cards are still supported by NSight v5.1 for Visual Studio. It can be downloaded from the archive.
EDIT: the following flags have further increased GPU usage to 70% in an application that uses multiple GPU streams from multiple CPU threads:
cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost | cudaDeviceLmemResizeToMax);
cudaDeviceScheduleYield lets other threads execute when a CPU
thread is waiting on GPU operation, rather than spinning GPU for the
result.
cudaDeviceLmemResizeToMax, as I understood it, makes kernel
launches themselves asynchronous and avoids excessive local memory
allocations&deallocations.

Why can't my ultraportable laptop CPU maintain peak performance in HPC

I have developed a high performance Cholesky factorization routine, which should have peak performance at around 10.5 GFLOPs on a single CPU (without hyperthreading). But there is some phenomenon which I don't understand when I test its performance. In my experiment, I measured the performance with increasing matrix dimension N, from 250 up to 10000.
In my algorithm I have applied caching (with tuned blocking factor), and data are always accessed with unit stride during computation, so cache performance is optimal; TLB and paging problem are eliminated;
I have 8GB available RAM, and the maximum memory footprint during experiment is under 800MB, so no swapping comes across;
During experiment, no resource demanding process like web browser is running at the same time. Only some really cheap background process is running to record CPU frequency as well as CPU temperature data every 2s.
I would expect the performance (in GFLOPs) should maintain at around 10.5 for whatever N I am testing. But a significant performance drop is observed in the middle of the experiment as shown in the first figure.
CPU frequency and CPU temperature are seen in the 2nd and 3rd figure. The experiment finishes in 400s. Temperature was at 51 degree when experiment started, and quickly rose up to 72 degree when CPU got busy. After that it grew slowly to the highest at 78 degree. CPU frequency is basically stable, and it did not drop when temperature got high.
So, my question is:
since CPU frequency did not drop, why performance suffers?
how exactly does temperature affect CPU performance? Does the increment from 72 degree to 78 degree really make things worse?
CPU info
System: Ubuntu 14.04 LTS
Laptop model: Lenovo-YOGA-3-Pro-1370
Processor: Intel Core M-5Y71 CPU # 1.20 GHz * 2
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0,1
Off-line CPU(s) list: 2,3
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 61
Stepping: 4
CPU MHz: 1474.484
BogoMIPS: 2799.91
Virtualisation: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0,1
CPU 0, 1
driver: intel_pstate
CPUs which run at the same hardware frequency: 0, 1
CPUs which need to have their frequency coordinated by software: 0, 1
maximum transition latency: 0.97 ms.
hardware limits: 500 MHz - 2.90 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 500 MHz and 2.90 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.40 GHz.
boost state support:
Supported: yes
Active: yes
update 1 (control experiment)
In my original experiment, CPU is kept busy working from N = 250 to N = 10000. Many people (primarily those whose saw this post before re-editing) suspected that the overheating of CPU is the major reason for performance hit. Then I went back and installed lm-sensors linux package to track such information, and indeed, CPU temperature rose up.
But to complete the picture, I did another control experiment. This time, I give CPU a cooling time between each N. This is achieved by asking the program to pause for a number of seconds at the start of iteration of the loop through N.
for N between 250 and 2500, the cooling time is 5s;
for N between 2750 and 5000, the cooling time is 20s;
for N between 5250 and 7500, the cooling time is 40s;
finally for N between 7750 and 10000, the cooling time is 60s.
Note that the cooling time is much larger than the time spent for computation. For N = 10000, only 30s are needed for Cholesky factorization at peak performance, but I ask for a 60s cooling time.
This is certainly a very uninteresting setting in high performance computing: we want our machine to work all the time at peak performance, until a very large task is completed. So this kind of halt makes no sense. But it helps to better know the effect of temperature on performance.
This time, we see that peak performance is achieved for all N, just as theory supports! The periodic feature of CPU frequency and temperature is the result of cooling and boost. Temperature still has an increasing trend, simply because as N increases, the work load is getting bigger. This also justifies more cooling time for a sufficient cooling down, as I have done.
The achievement of peak performance seems to rule out all effects other than temperature. But this is really annoying. Basically it says that computer will get tired in HPC, so we can't get expected performance gain. Then what is the point of developing HPC algorithm?
OK, here are the new set of plots:
I don't know why I could not upload the 6th figure. SO simply does not allow me to submit the edit when adding the 6th figure. So I am sorry I can't attach the figure for CPU frequency.
update 2 (how I measure CPU frequency and temperature)
Thanks to Zboson for adding the x86 tag. The following bash commands are what I used for measurement:
while true
do
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> cpu0_freq.txt ## parameter "freq0"
cat sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq >> cpu1_freq.txt ## parameter "freq1"
sensors | grep "Core 0" >> cpu0_temp.txt ## parameter "temp0"
sensors | grep "Core 1" >> cpu1_temp.txt ## parameter "temp1"
sleep 2
done
Since I did not pin the computation to 1 core, the operating system will alternately use two different cores. It makes more sense to take
freq[i] <- max (freq0[i], freq1[i])
temp[i] <- max (temp0[i], temp1[i])
as the overall measurement.
TL:DR: Your conclusion is correct. Your CPU's sustained performance is nowhere near its peak. This is normal: the peak perf is only available as a short term "bonus" for bursty interactive workloads, above its rated sustained performance, given the light-weight heat-sink, fans, and power-delivery.
You can develop / test on this machine, but benchmarking will be hard. You'll want to run on a cluster, server, or desktop, or at least a gaming / workstation laptop.
From the CPU info you posted, you have a dual-core-with-hyperthreading Intel Core M with a rated sustainable frequency of 1.20 GHz, Broadwell generation. Its max turbo is 2.9GHz, and it's TDP-up sustainable frequency is 1.4GHz (at 6W).
For short bursts, it can run much faster and make much more heat than it requires its cooling system to handle. This is what Intel's "turbo" feature is all about. It lets low-power ultraportable laptops like yours have snappy UI performance in stuff like web browsers, because the CPU load from interactive is almost always bursty.
Desktop/server CPUs (Xeon and i5/i7, but not i3) do still have turbo, but the sustained frequency is much closer to the max turbo. e.g. a Haswell i7-4790k has a sustained "rated" frequency of 4.0GHz. At that frequency and below, it won't use (and convert to heat) more than its rated TDP of 88W. Thus, it needs a cooling system that can handle 88W. When power/current/temperature allow, it can clock up to 4.4GHz and use more than 88W of power. (The sliding window for calculating the power history to keep the sustained power with 88W is sometimes configurable in the BIOS, e.g. 20sec or 5sec. Depending on what code is running, 4.4GHz might not increase the electrical current demand to anywhere near peak. e.g. code with lots of branch mispredicts that's still limited by CPU frequency, but that doesn't come anywhere near saturating the 256b AVX FP units like Prime95 would.)
Your laptop's max turbo is a factor of 2.4x higher than rated frequency. That high-end Haswell desktop CPU can only upclock by 1.1x. The max sustained frequency is already pretty close to the max peak limits, because it's rated to need a good cooling system that can keep up with that kind of heat production. And a solid power supply that can supply that much current.
The purpose of Core M is to have a CPU that can limit itself to ultra low power levels (rated TDP of 4.5 W at 1.2GHz, 6W at 1.4GHz). So the laptop manufacturer can safely design a cooling and power delivery system that's small and light, and only handles that much power. The "Scenario Design Power" is only 3.5W, and that's supposed to represent the thermal requirements for real-world code, not max-power stuff like Prime95.
Even a "normal" ULV laptop CPU is rated for 15W sustained, and high power gaming/workstation laptop CPUs at 45W. And of course laptop vendors put those CPUs into machines with beefier heat-sinks and fans. See a table on wikipedia, and compare desktop / server CPUs (also on the same page).
The achievement of peak performance seems to rule out all effects
other than temperature. But this is really annoying. Basically it says
that computer will get tired in HPC, so we can't get expected
performance gain. Then what is the point of developing HPC algorithm?
The point is to run them on hardware that's not so badly thermally limited! An ultra-low-power CPU like a Core M makes a decent dev platform, but not a good HPC compute platform.
Even a laptop with an xxxxM CPU, rather than a xxxxU CPU, will do ok. (e.g. a "gaming" or "workstation" laptop that's designed to run CPU-intensive stuff for sustained periods). Or in Skylake-family, "xxxxH" or "HK" are the 45W mobile CPUs, at least quad-core.
Further reading:
Modern Microprocessors
A 90-Minute Guide!
[Power Delivery in a Modern Processor] - general background, including the "power wall" that Pentium 4 ran into.
(https://www.realworldtech.com/power-delivery/) - really deep technical dive into CPU / motherboard design and the challenges of delivering stable low-voltage to very bursty demands, and reacting quickly to the CPU requesting more / less voltage as it changes frequency.

Linux Multi-Threaded Performance Enhancements for File open()

I’m working on tuning performance on a high-performance, high-capacity data engine which ultimately services an end-user web experience. Specifically, the piece delegated to me revolves around characterizing multi-threaded file IO and memory mapping of the data to local cache. In writing test applications to isolate the timing tall-poles, several questions have been exposed. The code has been minimized to perform only a system file open (open(O_RDONLY)) call. I’m hoping that the result of this query helps us understand the fundamental low-level system processes so that a complete predictive (or at least relational) timing model can be understood. Suggestions are always welcome. We’ve seemed to hit a timing barrier, and would like to understand the behavior and determine whether that barrier can be broken.
The test program:
Is written in C, compiled using the gnu C compiler as noted below;
Is minimally written to isolate the discovered issues to a single system file “open()”;
Is configurable to simultaneously launch a requested number of pthreads;
loads a list of 1000 text files of ~8K size;
creates the threads (simply) with no attribute modifications;
each thread performs multiple, sequential file open() calls on the next available file from the pre-determined list of files until the file list is exhausted in such a way that a single thread should open all 1000 files, 2 threads should theoretically open 500 files (not proven as of yet), etc.);
We’ve run tests multiple times, parametrically varying the thread count, file sizes, and whether the files are located on a local or remote server. Several questions have come up.
Observed results (opening remote files):
File open times are higher the first time through (as expected, due to file caching);
Running the test app with one thread to load all the remote files takes X seconds;
It appears that running the app with a thread count between 1 and # of available CPUs on the machine results in times that are proportional to the number of CPUs (nX seconds).
Running the app using a thread count > #CPUs results in run times that seem to level out at the approx same value as the time is takes to run with #CPUs threads (is this coincidental, or a systematic limit, or what?).
Running multiple, concurrent processes (for example, 25 concurrent instances of the same test app) results in the times being approximately linear with number of processes for a selected thread count.
Running app on different servers shows similar results
Observed results (opening files residing locally):
Orders of magnitude faster times (as to be expected);
With increasing the thread count, a LOW timing inflection point occurs at around 4-5 active threads, then increases again until the number of threads equals the CPU count, then levels off again;
Running multiple, concurrent processes (same test) results in the times being approximately linear with number of processes for a constant thread count (same result as #5 above).
Also, we noticed that Local opens take about .01 ms and sequential network opens are 100x slower at 1ms. Opening network files, we get a linear throughput increase up to 8x with 8 threads, but 9+ threads do nothing. The network open calls seem to block after more than 8 simultaneous requests. What we expected was an initial delay equal to the network roundtrip, and then approximately the same throughput as local. Perhaps there is extra mutex locking done on the local and remote systems that takes 100x longer. Perhaps there is some internal queue of remote calls that only holds 8.
Expected results and questions to be answered either by test or by answers from forums like this one:
Running multiple threads would result in the same work done in shorter time;
Is there an optimal number of threads;
Is there a relationship between the number of threads and CPUs available?
Is there some other systematic reason that an 8-10 file limit is observed?
How does the system call to “open()” work in a multi-threading process?
Each thread gets its context-switched time-slice;
Does the open() call block and wait until the file is open/loaded into file cache? Or does the call allow context switching to occur while the operation is in progress?
When the open() completes, does the scheduler reprioritize that thread to execute sooner, or does the thread have to wait until its turn in round-robin way;
Would having the mounted volume on which the 1000 files reside set as read-only or read/write make a difference?
When open() is called with a full path, is each element in the path stat()ed? Would it make more sense to open() a common directory in the list of files tree, and then open() the files under that common directory by relative path?
Development test setup:
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
8-CPUS, each with characteristics as shown below:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X5460 # 3.16GHz
stepping : 6
cpu MHz : 1992.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips : 6317.47
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
GNU C compiler, version:
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
Not sure if this is one of your issues, but it may be of use.
The one thing that struck me, while optimizing thousands of random reads on a single SATA disk, was that performing non-blocking I/O isn't so easy to do in linux in a clean way, without extra threads.
It is (currently) impossible to issue a non-blocking read() on a block device; i.e. it will block for the 5 ms seek time the disk needs (and 5 ms is an eternity, at 3 GHz). Specifying O_NONBLOCK to open() only served some purpose for backward compatibility, with CD burners or something (this was a rather vague issue). Normally, open() doesn't block or cache anything, it's mostly just to get a handle on a file to do some data I/O later.
For my purposes, mmap() seemed to get me as close to the kernel handling of the disk as possible. Using madvise() and mincore() I was able to fully exploit the NCQ capabilities of the disk, which was simply proved by varying the queue depth of outstanding requests, which turned out to be inversely proportional to the total time taken to issue 10k reads.
Thanks to 64 bit memory addressing, using mmap() to map an entire disk to memory is no problem at all. (on 32 bit platforms, you would need to map the parts of the disk you need using mmap64())

Resources