Different profiling modes for different cores using perf - performance

I have the following questions regarding perf.
a) Is it possible to run different profiling modes on different cores simultaneously, e.g. Core 0 with event-based sampling (sampling every N events) and Core 1 with free-running counter-based sampling?
b) If a) is not possible, is it possible to get a snapshot of the PMU counters on the other cores (Core 1) for every sample (overflow at N events) on Core 0?
P.S.: The platform is an RPi 3B+ based on the Arm Cortex-A53.

It is possible to operate different profiling modes on different cores of the CPU simultaneously.
perf also has a processor-wide mode wherein all threads running on the designated processors are monitored. Counts and samples are thus aggregated per CPU/core.
-C, --cpu=
Count only on the list of CPUs provided. Multiple CPUs can be
provided as a comma-separated list with no space: 0,1. Ranges of
CPUs are specified with -: 0-2. In per-thread mode, this option
is ignored. The -a option is still necessary to activate
system-wide monitoring. Default is to count on all CPUs.
Running a free-running counter and the sampling mechanism of perf simultaneously on different cores of the CPU is possible, as shown below.
e.g. for CPU 0:
perf stat --cpu 0 -B dd if=/dev/zero of=/dev/null count=1000000
and for cpu 1:
perf record --cpu 1 sleep 20
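To make the two sessions actually overlap in time, they can be started from one script. The following is a minimal sketch, not a tested recipe for the RPi: it reuses only the perf options shown above (stat, record, --cpu, -B), and the helper names and the sleep 20 workload used for both sessions are assumptions made for illustration.

```python
import shutil
import subprocess

def perf_stat_cmd(cpu, workload):
    """Counting mode on one CPU (mirrors the perf stat line above)."""
    return ["perf", "stat", "--cpu", str(cpu), "-B", *workload]

def perf_record_cmd(cpu, workload):
    """Sampling mode on one CPU (mirrors the perf record line above)."""
    return ["perf", "record", "--cpu", str(cpu), *workload]

def run_both(stat_cpu=0, record_cpu=1, workload=("sleep", "20")):
    """Start both perf sessions back to back so that they run concurrently."""
    if shutil.which("perf") is None:
        raise RuntimeError("perf not found on PATH")
    stat = subprocess.Popen(perf_stat_cmd(stat_cpu, list(workload)))
    record = subprocess.Popen(perf_record_cmd(record_cpu, list(workload)))
    # Return both exit codes once the two sessions have finished.
    return stat.wait(), record.wait()
```

Calling run_both() typically needs sufficient privileges for system-wide monitoring.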


Confused about OMP_NUM_THREADS and numactl NUMA-cores bindings

I'm confused about how multiple launches of same python command bind to cores on a NUMA Xeon machine.
I read that the OMP_NUM_THREADS env var sets the number of threads launched for a numactl process. So if I ran numactl --physcpubind=4-7 --membind=0 python -u test.py with OMP_NUM_THREADS=4 on a hyperthreaded (HT) machine (lscpu output below), it would limit this numactl process to 4 threads.
But since machine has HT, it's not clear to me if 4-7 in the above are 4 physical or 4 logical.
How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?
If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?
Where is OpenMP library getting invoked in binding threads to physical cores?
(from my limited understanding I could just launch command python main.py in a sh shell 20 times with different numactl bindings and OMP_NUM_THREADS still applies, even though I didn't explicitly use MPI lib anywhere, is that correct?)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
Stepping: 7
Frequency boost: enabled
CPU MHz: 1000.026
CPU max MHz: 2301,0000
CPU min MHz: 1000,0000
BogoMIPS: 4600.00
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 96 MiB
L3 cache: 143 MiB
NUMA node0 CPU(s): 0-23,96-119
NUMA node1 CPU(s): 24-47,120-143
NUMA node2 CPU(s): 48-71,144-167
NUMA node3 CPU(s): 72-95,168-191
I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process.
numactl does not launch threads. It controls the NUMA policy for processes or shared memory. However, OpenMP runtimes may adapt the number of threads created for a parallel region based on the environment set by numactl (although AFAIK this behaviour is not defined by the standard). You should use the environment variable OMP_NUM_THREADS to set the number of threads. You can check the OpenMP configuration using the environment variable OMP_DISPLAY_ENV.
How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?
This is a bit complex. Physical IDs are the ones available in /proc/cpuinfo. They are not guaranteed to stay the same over time (eg. they can change when the machine is restarted) nor "intuitive" (ie. following rules like being contiguous for threads/cores close to each other). One should avoid hard-coding them manually. e.g. a BIOS update or kernel update might lead to enumerating logical cores in a different order.
You can use the great tool hwloc to convert well-defined deterministic logical IDs to physical ones. Here, you cannot be entirely sure that 0 and 96 are two threads sharing the same core (although this is probably true here for your processor, where it looks like the kernel enumerated one logical core from each physical core as cores 0..95, then 96..191 for the other logical core on each physical core). The other common possibility is for Linux to do both logical cores of each physical core consecutively, making logical cores 2n and 2n+1 share a physical core.
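Besides hwloc, Linux exposes the sibling relationship directly in sysfs, in the thread_siblings_list file of each CPU's topology directory. The sketch below reads it to group logical CPUs by physical core. The function names are made up for this example, and it assumes a Linux system with sysfs mounted at the usual location.

```python
from pathlib import Path

def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-23,96-119' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)

def read_siblings(cpu, root="/sys/devices/system/cpu"):
    """Return the logical CPUs that share a physical core with `cpu`."""
    path = Path(root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpulist(path.read_text())

def group_by_core(cpus, siblings_of=read_siblings):
    """Group logical CPUs into physical cores, one tuple of siblings per core."""
    return sorted({tuple(siblings_of(c)) for c in cpus})
```

For example, group_by_core(parse_cpulist("0-23,96-119")) would show whether 0 and 96 really are two threads of the same core on this machine.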
If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?
--physcpubind of numactl accepts physical CPU numbers as shown in the "processor" fields of /proc/cpuinfo, according to the documentation. Thus, 4-7 here should be interpreted as physical thread IDs. Two thread IDs can refer to the same physical core (which is always the case on Intel processors with hyper-threading enabled).
Where is OpenMP library getting invoked in binding threads to physical cores?
AFAIK, this depends on the implementation of the OpenMP runtime used (e.g. GOMP, IOMP, etc.). The initialization of the OpenMP runtime is often done lazily when the first parallel section is encountered. For the binding, some runtimes read /proc/cpuinfo manually while others use hwloc. If you want deterministic bindings, then you should use the OMP_PLACES and OMP_PROC_BIND environment variables to tell the runtime to bind threads using a custom user-defined method and not the default one.
If you want to be safe and portable, use the following configuration (using Bash):
OMP_PLACES={$(hwloc-calc --physical-output --sep "},{" --intersect PU core:all.pu:0)}
The OpenMP threads will be scheduled on OpenMP places. The configuration above defines one place per physical core (the first hardware thread of each core), so with OMP_NUM_THREADS=4, as in the question, there will be 4 threads statically mapped to 4 different fixed cores.
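When the job is launched from a Python driver instead of Bash, the same kind of environment can be assembled by hand. This is only a sketch: the CPU IDs passed in are placeholders that you would take from hwloc-calc or a similar tool, the helper names are invented here, and OMP_PROC_BIND=true is added as an assumption to request binding explicitly.

```python
import os

def omp_places(cpu_ids):
    """Format one single-CPU place per ID, e.g. '{4},{5},{6},{7}'."""
    return ",".join("{%d}" % cpu for cpu in cpu_ids)

def omp_env(cpu_ids, base=None):
    """Return a copy of the environment with one bound thread per listed CPU."""
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = str(len(cpu_ids))
    env["OMP_PROC_BIND"] = "true"   # assumption: ask the runtime to bind threads
    env["OMP_PLACES"] = omp_places(cpu_ids)
    return env
```

The resulting dictionary can be passed as env= to subprocess.run when starting the OpenMP-backed program.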

Low GPU usage in CUDA

I implemented a program which uses different CUDA streams from different CPU threads. Memory copying is implemented via cudaMemcpyAsync using those streams. Kernel launches also use those streams. The program does double-precision computations (and I suspect this is the culprit; however, cuBlas reaches 75-85% CPU usage for multiplication of matrices of doubles). There are also reduction operations; however, they are implemented via if(threadIdx.x < s) with s halving in each iteration, so stalled warps should be available to other blocks. The application is GPU and CPU intensive: it starts another piece of work as soon as the previous one has finished. So I expect it to reach 100% of either CPU or GPU.
The problem is that my program generates 30-40% of GPU load (and about 50% of CPU load), if GPU-Z 1.9.0 is to be trusted. Memory Controller Load is 9-10%, Bus Interface Load is 6%. This is for the number of CPU threads equal to the number of CPU cores. If I double the number of CPU threads, the loads stay about the same (including the CPU load).
So why is that? Where is the bottleneck?
I am using GeForce GTX 560 Ti, CUDA 8RC, MSVC++2013, Windows 10.
One guess of mine is that Windows 10 applies some aggressive power saving, even though GPU and CPU temperatures are low, the power plan is set to "High performance" and the power supply is 700W while power consumption with max CPU and GPU TDP is about 550W.
Another guess is that double-precision speed is 1/12 of the single-precision speed because there is 1 double-precision CUDA core per 12 single-precision CUDA cores on my card, and GPU-Z treats as 100% the situation in which all single-precision and double-precision cores are used. However, the numbers do not quite match.
Apparently the reason was low occupancy due to CUDA threads using too many registers by default. To tell the compiler the limit on the number of registers per thread, __launch_bounds__ can be used, as described here. So to be able to launch all 1536 threads in 560 Ti, for block size 256 the following can be specified:
__global__ void __launch_bounds__(256, 6) MyKernel(...) { ... }
After limiting the number of registers per CUDA thread, GPU usage rose to 60% for me.
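The arithmetic behind the (256, 6) pair can be spelled out. The sketch below takes the 1536 resident threads from the text above and assumes 32768 registers per multiprocessor, which is the figure I recall for compute capability 2.x and should be checked against the CUDA documentation for your card. It ignores register allocation granularity, so the compiler's actual limit may differ.

```python
MAX_THREADS_PER_SM = 1536   # resident threads, as stated in the answer above
REGISTERS_PER_SM = 32768    # assumption: value for compute capability 2.x

def blocks_per_sm(block_size, max_threads=MAX_THREADS_PER_SM):
    """Blocks needed to keep every resident thread slot busy."""
    return max_threads // block_size

def register_budget(block_size, min_blocks, registers=REGISTERS_PER_SM):
    """Upper bound on registers per thread if min_blocks blocks must fit at once."""
    return registers // (block_size * min_blocks)

print(blocks_per_sm(256))        # second argument of __launch_bounds__
print(register_budget(256, 6))   # rough per-thread register ceiling
```

Under these assumptions a block size of 256 needs 6 resident blocks, which leaves roughly 21 registers per thread.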
By the way, 5xx series cards are still supported by NSight v5.1 for Visual Studio. It can be downloaded from the archive.
EDIT: the following flags have further increased GPU usage to 70% in an application that uses multiple GPU streams from multiple CPU threads:
cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost | cudaDeviceLmemResizeToMax);
cudaDeviceScheduleYield lets other threads execute when a CPU
thread is waiting on GPU operation, rather than spinning GPU for the
cudaDeviceLmemResizeToMax, as I understood it, makes kernel
launches themselves asynchronous and avoids excessive local memory

Are not all processors created equal?

My laptop has 4 logical processors (two physical); logical CPUs 1 and 2 map to core 1, and logical CPUs 3 and 4 map to core 2 (verified with GetLogicalProcessorInformation()).
I ran a multithreaded matrix multiplication program on my computer with two threads. The first time, I used SetProcessAffinityMask(hProcess, 0x5) (which means logical processors 1 and 3) while the second time I used SetProcessAffinityMask(hProcess, 0xA) (logical processors 2 and 4).
It turned out that the first version was about twice as fast as the second version, as though I'd never multithreaded the second version anyway.
Does anyone have any guesses as to why this might be happening?
Plugged in (full CPU):
Affinity mask: 0x3 (0011b), 9 gflop/s
Affinity mask: 0x5 (0101b), 17 gflop/s
Affinity mask: 0x6 (0110b), 17 gflop/s
Affinity mask: 0x9 (1001b), 9 gflop/s
Affinity mask: 0xA (1010b), 9 gflop/s
Affinity mask: 0xC (1100b), 9 gflop/s
On battery (clocked down):
Affinity mask: 0x3 (0011b), 5 gflop/s
Affinity mask: 0x5 (0101b), 10 gflop/s
Affinity mask: 0x6 (0110b), 10 gflop/s
Affinity mask: 0x9 (1001b), 5 gflop/s
Affinity mask: 0xA (1010b), 2 gflop/s
(--> Very interesting, why half speed when on battery but normal speed on AC?! this one varies a lot between 1.5-2.5 gflop/s, unlike the others.)
Affinity mask: 0xC (1100b), 5 gflop/s
Does this imply that the fourth logical CPU is not doing anything (!)? (Everything with the mask for the fourth CPU set is slow.)
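For reference, the masks in the table decode as follows. This is a small sketch using the 1-based numbering from the question and the topology reported there (logical CPUs 1 and 2 on core 1, CPUs 3 and 4 on core 2); the function names are only illustrative.

```python
def mask_to_cpus(mask):
    """Return the 1-based logical processor numbers selected by an affinity mask."""
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]

def cores_for_mask(mask, threads_per_core=2):
    """Physical cores (1-based) covered by the mask under the reported topology."""
    return sorted({(cpu - 1) // threads_per_core + 1 for cpu in mask_to_cpus(mask)})

for mask in (0x3, 0x5, 0x6, 0x9, 0xA, 0xC):
    print(f"0x{mask:X}: CPUs {mask_to_cpus(mask)}, cores {cores_for_mask(mask)}")
```

If the reported topology were accurate, masks 0x9 and 0xA would each span both cores, yet they measure like the single-core masks 0x3 and 0xC.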
I just ran the same thing on the High Performance profile on batteries. The results are inconsistent: This time, I got 2x speedup for the masks 5, 6, and 10, but there was no speedup for mask 12. I'll try to run the tests again on AC power, but ultimately it seems like this result is a combination of power management, Turbo Boost, scheduling inconsistencies, etc., and it's more difficult to measure than I previously thought. :(
SetProcessAffinityMask() does not guarantee you will have one thread per core; only that the threads you have will run on the cores you have allowed.
Perhaps the OS is scheduling differently.
Also, I'm surprised 1 and 2 are on core 1. Usually, logical processor numbers interleave over physical cores, to provide an inherent load balancing. I would expect 1 and 3 to be on core 1, 2 and 4 to be on core 2.
No, not all cores are equal. Only one is the boot core. Furthermore, in many cases all IRQs (or at least IRQs from a majority of the devices) are directed to a single core.
More important to your observed behavior, not all sets of cores are equal. In a NUMA memory architecture (which has been relatively mainstream in x86 since Intel Hyperthreading and AMD Opteron), there's an ideal group of processors which can efficiently access a particular region of memory, and all other processors will pay a significant penalty to access that range.
With Hyperthreading, it's not main system memory that's connected non-uniformly, but L1 and L2 cache. If your process migrates between the two virtual processors associated to the same physical core, the cache remains valid. But if it migrates to the other physical core, cached data has to be copied and ownership transferred to the other cache. For some workloads, this could make a big difference.
It would be good to know what physical CPU this is, but I'm assuming from your phrasing about logical processors that there is 1 physical socket, 2 CPU cores, and hyperthreading is enabled giving you 4 logical processors.
The short answer is, for this complicated definition of "processor", no, not all processors are created equal. Hyperthreaded logical cores share execution resources, and if there's contention for those resources they won't be as fast as separate physical cores. This sharing can take place at different levels for both hyperthreading and multicore processors (ALU, execution resources, cache at different levels, etc.) but in broad terms, physical cores in the same socket won't be affected much by what the other core(s) is/are doing, and logical cores implemented by hyperthreading will be hugely affected by what their hypertwin is doing.
Another difference between different CPUs: As Ben said, your OS may process most hardware interrupts on a single CPU, which means that CPU will seem slower for other purposes, but I'd be surprised if the interrupt load is enough to impact performance anywhere near this much.
The results you got -- on processors A and B (being intentionally ambiguous about which 2 processors those are) you get double the performance of A alone, but on processors A and C you get approximately the same performance as A alone -- sure sound like hyperthreading is the difference, where A and C are hypertwins in the same physical core, and B is in the other physical core. You said that GetLogicalProcessorInformation() claims otherwise, but it's not unheard of for the BIOS tables on which that depends to have errors.
I would run Task Manager, keep an eye on loads on each CPU before you run your test to get an idea of how much else is going on and where Windows schedules it, then run your test again a few times, for different combinations of CPU affinity, and see if you can confirm or deny this theory.
Have you checked the return code from SetProcessAffinityMask to see if there was an error? If the call fails, you might get stuck on one logical processor. According to the documentation, you can only use the bits that are set in the result of GetProcessAffinityMask.
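The subset rule is a plain bit test. In the sketch below both masks are ordinary integers; on Windows the allowed mask would come from GetProcessAffinityMask, which is not called here, and the helper name is invented for the example.

```python
def disallowed_bits(requested, allowed):
    """Bits set in the requested mask that the allowed mask does not contain."""
    return requested & ~allowed

def can_request(requested, allowed):
    """True if the requested mask is non-empty and a subset of the allowed mask."""
    return requested != 0 and disallowed_bits(requested, allowed) == 0
```

With four logical processors (allowed mask 0xF), can_request(0xA, 0xF) holds, so a failure would have to come from something other than the mask itself.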
You say you've tried masks of 0x5, 0xA, and 0x9. I'd be curious to see the results with 0x3.

Linux Multi-Threaded Performance Enhancements for File open()

I’m working on tuning performance on a high-performance, high-capacity data engine which ultimately services an end-user web experience. Specifically, the piece delegated to me revolves around characterizing multi-threaded file IO and memory mapping of the data to local cache. In writing test applications to isolate the timing tall-poles, several questions have been exposed. The code has been minimized to perform only a system file open (open(O_RDONLY)) call. I’m hoping that the result of this query helps us understand the fundamental low-level system processes so that a complete predictive (or at least relational) timing model can be understood. Suggestions are always welcome. We’ve seemed to hit a timing barrier, and would like to understand the behavior and determine whether that barrier can be broken.
The test program:
Is written in C, compiled using the gnu C compiler as noted below;
Is minimally written to isolate the discovered issues to a single system file “open()”;
Is configurable to simultaneously launch a requested number of pthreads;
loads a list of 1000 text files of ~8K size;
creates the threads (simply) with no attribute modifications;
each thread performs multiple, sequential file open() calls on the next available file from the pre-determined list of files until the file list is exhausted, in such a way that a single thread should open all 1000 files, 2 threads should theoretically open 500 files each (not proven as of yet), etc.
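For readers who want to reproduce the shape of the experiment, here is a rough Python analogue of the test program described above. It is not the original C code: the function name and the per-thread counters are made up for illustration, and interpreter overhead means the absolute timings will not match a C/pthreads version.

```python
import os
import threading
import time

def open_all(paths, thread_count):
    """Open each path once with O_RDONLY, handing files out to threads on demand."""
    lock = threading.Lock()
    remaining = iter(paths)
    opened = [0] * thread_count          # files opened by each thread

    def worker(index):
        while True:
            with lock:                   # take the next available file
                path = next(remaining, None)
            if path is None:
                return
            fd = os.open(path, os.O_RDONLY)
            os.close(fd)
            opened[index] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(thread_count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, opened
```

The per-thread counts in the second return value make it possible to check the "2 threads should open 500 files each" expectation directly instead of assuming it.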
We’ve run tests multiple times, parametrically varying the thread count, file sizes, and whether the files are located on a local or remote server. Several questions have come up.
Observed results (opening remote files):
File open times are higher the first time through (as expected, due to file caching);
Running the test app with one thread to load all the remote files takes X seconds;
It appears that running the app with a thread count between 1 and # of available CPUs on the machine results in times that are proportional to the number of CPUs (nX seconds).
Running the app using a thread count > #CPUs results in run times that seem to level out at approximately the same value as the time it takes to run with #CPUs threads (is this coincidental, or a systematic limit, or what?).
Running multiple, concurrent processes (for example, 25 concurrent instances of the same test app) results in the times being approximately linear with number of processes for a selected thread count.
Running app on different servers shows similar results
Observed results (opening files residing locally):
Orders of magnitude faster times (as to be expected);
With increasing the thread count, a LOW timing inflection point occurs at around 4-5 active threads, then increases again until the number of threads equals the CPU count, then levels off again;
Running multiple, concurrent processes (same test) results in the times being approximately linear with number of processes for a constant thread count (same result as #5 above).
Also, we noticed that Local opens take about .01 ms and sequential network opens are 100x slower at 1ms. Opening network files, we get a linear throughput increase up to 8x with 8 threads, but 9+ threads do nothing. The network open calls seem to block after more than 8 simultaneous requests. What we expected was an initial delay equal to the network roundtrip, and then approximately the same throughput as local. Perhaps there is extra mutex locking done on the local and remote systems that takes 100x longer. Perhaps there is some internal queue of remote calls that only holds 8.
Expected results and questions to be answered either by test or by answers from forums like this one:
Running multiple threads would result in the same work done in shorter time;
Is there an optimal number of threads;
Is there a relationship between the number of threads and CPUs available?
Is there some other systematic reason that an 8-10 file limit is observed?
How does the system call to “open()” work in a multi-threading process?
Each thread gets its context-switched time-slice;
Does the open() call block and wait until the file is open/loaded into file cache? Or does the call allow context switching to occur while the operation is in progress?
When the open() completes, does the scheduler reprioritize that thread to execute sooner, or does the thread have to wait until its turn in round-robin way;
Would having the mounted volume on which the 1000 files reside set as read-only or read/write make a difference?
When open() is called with a full path, is each element in the path stat()ed? Would it make more sense to open() a common directory in the list of files tree, and then open() the files under that common directory by relative path?
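The relative-path idea in the last question corresponds to openat() on Linux, which Python exposes through the dir_fd argument of os.open. The sketch below shows only the mechanics; the function name is hypothetical, and whether it saves measurable time over full paths is exactly what would need to be measured.

```python
import os

def open_relative(directory, names):
    """Open the common directory once, then open each file relative to it."""
    dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    count = 0
    try:
        for name in names:
            # Path resolution starts at dir_fd instead of the filesystem root.
            fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
            os.close(fd)
            count += 1
    finally:
        os.close(dir_fd)
    return count
```

Timing this against the same loop with absolute paths, on both the local and the remote volume, would give a direct answer for the setup described here.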
Development test setup:
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
8-CPUS, each with characteristics as shown below:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
stepping : 6
cpu MHz : 1992.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips : 6317.47
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
GNU C compiler, version:
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
Not sure if this is one of your issues, but it may be of use.
The one thing that struck me, while optimizing thousands of random reads on a single SATA disk, was that performing non-blocking I/O isn't so easy to do in Linux in a clean way, without extra threads.
It is (currently) impossible to issue a non-blocking read() on a block device; i.e. it will block for the 5 ms seek time the disk needs (and 5 ms is an eternity, at 3 GHz). Specifying O_NONBLOCK to open() only served some purpose for backward compatibility, with CD burners or something (this was a rather vague issue). Normally, open() doesn't block or cache anything, it's mostly just to get a handle on a file to do some data I/O later.
For my purposes, mmap() seemed to get me as close to the kernel handling of the disk as possible. Using madvise() and mincore() I was able to fully exploit the NCQ capabilities of the disk, which was simply proved by varying the queue depth of outstanding requests, which turned out to be inversely proportional to the total time taken to issue 10k reads.
Thanks to 64 bit memory addressing, using mmap() to map an entire disk to memory is no problem at all. (on 32 bit platforms, you would need to map the parts of the disk you need using mmap64())
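The mmap() plus madvise() pattern can be sketched as follows. This is an illustration under assumptions rather than the code used for the measurements above: it relies on Python 3.8 or later on Linux for mmap.madvise and MADV_WILLNEED, it omits mincore() because the standard library does not expose it, and the function name and 4096-byte default are arbitrary.

```python
import mmap
import os

def prefetch_ranges(path, offsets, length=4096):
    """Map a file read-only, hint the upcoming reads to the kernel, then read them."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            page = mmap.PAGESIZE
            # First pass: queue all hints so the kernel can reorder the requests.
            for off in offsets:
                start = (off // page) * page          # madvise needs page alignment
                span = min(off + length, size) - start
                m.madvise(mmap.MADV_WILLNEED, start, span)
            # Second pass: touch the data, ideally after it has been read in.
            return [m[off:off + length] for off in offsets]
```

Issuing every hint before the first read is what lets several requests be outstanding at once, which is the queue-depth effect described above.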

Mapping logical processors to physical processors

On a dual quad-core GetProcessAffinityMask (or the dialog from "Set affinity" in taskman.exe) will report eight logical processors. How do I find out which logical processor is on which physical processor? Especially: which logical processors are on the same physical processor?
EDIT: If it is not possible to do this programmatically, does anyone know what the normal mapping is? Are the first four on the first processor and the second four on the second, or are the odd-numbered on the first and the even-numbered on the second?
You can use Win32_Processor WMI class to query the number of cores, number of logical processors, architecture, cache memory and other information about the CPUs on the system.
To query information about the relationship between the logical processors in a system, you can use GetLogicalProcessorInformation API function.
In case you don't want to write the code yourself, Sysinternals' handy coreinfo utility comes closest to answering your questions. It uses GetLogicalProcessorInformation as Mehrdad recommends. For a Xeon E5640 (quad core, 8 threads), you get from coreinfo:
c:\App\SysInternals>Coreinfo.exe -c
Coreinfo v3.0 - Dump information on system CPU and memory topology
Copyright (C) 2008-2011 Mark Russinovich
Sysinternals - www.sysinternals.com
Logical to Physical Processor Map:
**------ Physical Processor 0 (Hyperthreaded)
--**---- Physical Processor 1 (Hyperthreaded)
----**-- Physical Processor 2 (Hyperthreaded)
------** Physical Processor 3 (Hyperthreaded)
There are 8 * for the 8 hyperthreads, two per core, as expected for this chip. What's not clear, though, is how the arrangement of * matches up with the list of logical processors as Windows presents them. For instance, Task Manager gives me a dialog for assigning the processor affinity, labeled CPU 0 through CPU 7, for any process. It's fair (but not necessary) to assume that you can take coreinfo's output and number the logical processors left-to-right. So "CPU 5" would be the second hyperthread running on physical processor 2.
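Under the left-to-right assumption just described, the map can be turned into a lookup table mechanically. The sketch below parses lines in the format shown above; the function name is invented, and the 0-based numbering is the assumption from the text, not something coreinfo guarantees.

```python
def parse_core_map(lines):
    """Map each 0-based logical processor index to its physical processor number."""
    mapping = {}
    for line in lines:
        pattern, _, label = line.strip().partition(" ")
        words = label.split()            # e.g. ['Physical', 'Processor', '2', ...]
        if len(words) < 3 or words[:2] != ["Physical", "Processor"]:
            continue                     # skip banner and heading lines
        core = int(words[2])
        for index, ch in enumerate(pattern):
            if ch == "*":
                mapping[index] = core
    return mapping

sample = [
    "**------ Physical Processor 0 (Hyperthreaded)",
    "--**---- Physical Processor 1 (Hyperthreaded)",
    "----**-- Physical Processor 2 (Hyperthreaded)",
    "------** Physical Processor 3 (Hyperthreaded)",
]
print(parse_core_map(sample))
```

With the sample above, index 5 maps to physical processor 2, matching the "CPU 5" example in the text.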
The numbering is done in a sequential manner: first all physical cores, followed by the logical cores [1].
[1] CPU Numbering on a hyperthreading enabled system
