How can I check that MKL calls are running with the correct number of threads on Xeon Phi? - linux-kernel

I am running 60 MPI processes and MKL_THREAD_NUM is set to 4 to get me to the full 240 hardware threads on the Xeon Phi. My code is running but I want to make sure that MKL is actually using 4 threads. What is the best way to check this with the limited Xeon Phi linux kernel?

You can set MKL_NUM_THREADS to 4 if you like. However,using every single thread does not necessarily give the best performance. In some cases, the MKL library knows things about the algorithm that mean fewer threads is better. In these cases, the library routines can choose to use fewer threads. You should only use 60 MPI ranks if you have 61 coresIf you are going to use that many MPI ranks, you will want to set the I_MPI_PIN_DOMAIN environment variable to "core". Remember to leave one core free for the OS and system level processes. This will put one rank per core on the coprocessor and allow all the OpenMP threads for each MPI process to reside on the same core, giving you better cache behavior. If you do this, you can also use micsmc in gui mode on the host processor to continuously monitor the activity on all the cores. With one MPI processor per core, you can see how much of the time all threads on a core are being used.

Set MKL_NUM_THREADS to 4. You can use environment variable or runtime call. This value will be respected so there is nothing to check.
Linux kernel on KNC is not stripped down so I don't know why you think that's a limitation. You should not use any system calls for this anyways though.

Related

Intel OpenMP library slows down memory bandwidth significantly on AMD platforms by setting KMP_AFFINITY=scatter

For memory-bound programs it is not always faster to use many threads, say the same number as the cores, since threads may compete for memory channels. Usually on a two-socket machine, less threads are better but we need to set affinity policy that distributes the threads across sockets to maximize the memory bandwidth.
Intel OpenMP claims that KMP_AFFINITY=scatter is to achieve this purpose, the opposite value "compact" is to place threads as close as possible. I have used ICC to build the Stream program for benchmarking and this claim is easily validated on Intel machines. And if OMP_PROC_BIND is set, the native OpenMP env vars like OMP_PLACES and OMP_PROC_BIND are ignored. You will get such a warning:
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
However, a benchmark on a newest AMD EPYC machine I obtained shows really bizarre results. KMP_AFFINITY=scatter gives the slowest memory bandwidth possible. It seems that this setting is doing exactly the opposite on AMD machines: placing threads as close as possible so that even the L3 cache at each NUMA node is not even fully utilized. And if I explicitly set OMP_PROC_BIND=spread, it is ignored by Intel OpenMP as the warning above says.
The AMD machine has two sockets, 64 physical cores per socket. I have tested using 128, 64, and 32 threads and I want them to be spread across the whole system. Using OMP_PROC_BIND=spread, Stream gives me a triad speed of 225, 290, and 300 GB/s, respectively. But once I set KMP_AFFINITY=scatter, even when OMP_PROC_BIND=spread is still present, Streams gives 264, 144, and 72 GB/s.
Notice that for 128 threads on 128 cores, setting KMP_AFFINITY=scatter gives better performance, this even further suggests that in fact all the threads are placed as close as possible, but not scattering at all.
In summary, KMP_AFFINITY=scatter displays completely opposite (in the bad way) behavior on AMD machines and it will even overwrite native OpenMP environment regardless the CPU brand. The whole situation sounds a bit fishy, since it is well known that ICC detects the CPU brand and uses the CPU dispatcher in MKL to launch the slower code on non-Intel machines. So why can't ICC simply disable KMP_AFFINITY and restore OMP_PROC_BIND if it detects a non-Intel CPU?
Is this a known issue to someone? Or someone can validate my findings?
To give more context, I am a developer of commercial computational fluid dynamics program and unfortunately we links our program with ICC OpenMP library and KMP_AFFINITY=scatter is set by default because in CFD we must solve large-scale sparse linear systems and this part is extremely memory-bound. I found that with setting KMP_AFFINITY=scatter, our program becomes 4X slower (when using 32 threads) than the actual speed the program can achieve on the AMD machine.
Update:
Now using hwloc-ps I can confirm that KMP_AFFINITY=scatter is actually doing "compact" on my AMD threadripper 3 machine. I have attached the lstopo result. I run my CFD program (built by ICC2017) with 16 threads. OPM_PROC_BIND=spread can place one thread in each CCX so that L3 cache is fully utilized. Hwloc-ps -l -t gives:
While setting KMP_AFFINITY=scatter, I got
I will try the latest ICC/Clang OpenMP runtime and see how it works.
TL;DR: Do not use KMP_AFFINITY. It is not portable. Prefer OMP_PROC_BIND (it cannot be used with KMP_AFFINITY at the same time). You can mix it with OMP_PLACES to bind threads to cores manually. Moreover, numactl should be used to control the memory channel binding or more generally NUMA effects.
Long answer:
Thread binding: OMP_PLACES can be used to bound each thread to a specific core (reducing context switches and NUMA issues). OMP_PROC_BIND and KMP_AFFINITY should theoretically do that correctly, but in practice, they fail to do so on some systems. Note that OMP_PROC_BIND and KMP_AFFINITY are exclusive option: they should not be used together (OMP_PROC_BIND is a new portable replacement of the older KMP_AFFINITY environment variable). As the topology of the core change from one machine to another, you can use the hwloc tool to get the list of the PU ids required by OMP_PLACES. More especially hwloc-calc to get the list and hwloc-ls to check the CPU topology. All threads should be bound separately so that no move is possible. You can check the binding of the threads with hwloc-ps.
NUMA effects: AMD processors are built by assembling multiple CCX connected together with a high-bandwidth connection (AMD Infinity Fabric). Because of that, AMD processors are NUMA systems. If not taken into account, NUMA effects can result in a significant drop in performance. The numactl tool is designed to control/mitigate NUMA effects: processes can be bound to memory channels using the --membind option and the memory allocation policy can be set to --interleave (or --localalloc if the process is NUMA-aware). Ideally, processes/threads should only work on data allocated and first-touched on they local memory channels. If you want to test a configuration on a given CCX you can play with --physcpubind and --cpunodebind.
My guess is that the Intel/Clang runtime does not perform a good thread binding when KMP_AFFINITY=scatter is set because of a bad PU mapping (which could come from a OS bug, a runtime bug or bad user/admin settings). Probably due to the CCX (since mainstream processors containing multiple NUMA nodes were quite rare).
On AMD processors, threads accessing memory of another CCX usually pay an additional significant cost due to data moving through the (quite-slow) Infinity Fabric interconnect and possibly due to its saturation as well as the one of memory channels. I advise you to not trust OpenMP runtime's automatic thread binding (use OMP_PROC_BIND=TRUE), to rather perform the thread/memory bindings manually and then to report bugs if needed.
Here is an example of a resulting command line so as to run your application:
numactl --localalloc OMP_PROC_BIND=TRUE OMP_PLACES="{0},{1},{2},{3},{4},{5},{6},{7}" ./app
PS: be careful about PU/core IDs and logical/physical IDs.

A multi-threaded software(PFC3D-to do simulation) not using all the available cores

I'm using a multi-threaded software(PFC3D developped by Itasca consulting) to do some simulations.After moving to a powerful computer Intel Xeon Gold 5120T CPU 2.2GHZ 2.19 GHZ (2 Processors)(28 physical cores, 56 logical cores)(Windows10) to have rapid calculations, the software seems to only use a limited number of cores.Normally the number of cores detected in the software is 56 and it takes automaticly the maximum number of cores.
I'm quite sure that the problem is in the system not in my software because I'm running the same code in a intel core i9-9880H Processor (16 logical cores) and it'is using all the cores with even more efficiency than the xeon gold.
The software is using 22 to 30;
28 cores/56 threads are displayed on task managers CPU page.I have windows 10 pro.
I appreciate very much your precious help.
Thank you
Youssef
interface
classes
details
code
It's hard to say because I do not have the code and you provide so little information.
You seems to have no IO because you said that you use 100% of the CPU on i9. That should simplify a little bit but...
There could be many reasons.
My feeling is that you have threading synchronisation (like critical section) that depends on shared ressource(s). That ressource seems to be lightly solicitated and thread require it just a little wich enable 16 threads to access it without too much collisions (or very little). I mean that thread do not have to wait for shared resource (it is mostly available / not locked). But adding more threads improve significantly collisions amount (locking state of shared ressources by another thread) to have to wait for that ressource. That really sounds like something like that. But it is only a guess.
A quick try that could potentially improve the performance (because I have the feeling that shared resource require very quick access), is to use SpinLock instead of regular Critical Section. But that is totally a guess based on very little and also SpinLock is available in C# but perhaps not in your language.
About the number of CPU taken, it could be normal to take only the half depending on how the program is made. Sometimes it could be better to not use hyperthreaded and perhaps your program is doing this itself. Also there could be a bug in either the program itself, in C# or in the BIOS which tell the app that there is only 28 cpus instead of 56 (usually due to hyperthreading). IT is still a guess.
There could be some additional information that could potentially help you in this stack overflow question.
Good luck.

Difference between core and processor

What is the difference between a core and a processor?
I've already looked for it on Google, but I only get definitions for multi-core and multi-processor, which is not what I am looking for.
A core is usually the basic computation unit of the CPU - it can run a single program context (or multiple ones if it supports hardware threads such as hyperthreading on Intel CPUs), maintaining the correct program state, registers, and correct execution order, and performing the operations through ALUs. For optimization purposes, a core can also hold on-core caches with copies of frequently used memory chunks.
A CPU may have one or more cores to perform tasks at a given time. These tasks are usually software processes and threads that the OS schedules. Note that the OS may have many threads to run, but the CPU can only run X such tasks at a given time, where X = number cores * number of hardware threads per core. The rest would have to wait for the OS to schedule them whether by preempting currently running tasks or any other means.
In addition to the one or many cores, the CPU will include some interconnect that connects the cores to the outside world, and usually also a large "last-level" shared cache. There are multiple other key elements required to make a CPU work, but their exact locations may differ according to design. You'll need a memory controller to talk to the memory, I/O controllers (display, PCIe, USB, etc..). In the past these elements were outside the CPU, in the complementary "chipset", but most modern design have integrated them into the CPU.
In addition the CPU may have an integrated GPU, and pretty much everything else the designer wanted to keep close for performance, power and manufacturing considerations. CPU design is mostly trending in to what's called system on chip (SoC).
This is a "classic" design, used by most modern general-purpose devices (client PC, servers, and also tablet and smartphones). You can find more elaborate designs, usually in the academy, where the computations is not done in basic "core-like" units.
An image may say more than a thousand words:
* Figure describing the complexity of a modern multi-processor, multi-core system.
Source:
https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
Let's clarify first what is a CPU and what is a core, a central processing unit CPU, can have multiple core units, those cores are a processor by itself, capable of execute a program but it is self contained on the same chip.
In the past one CPU was distributed among quite a few chips, but as Moore's Law progressed they made to have a complete CPU inside one chip (die), since the 90's the manufacturer's started to fit more cores in the same die, so that's the concept of Multi-core.
In these days is possible to have hundreds of cores on the same CPU (chip or die) GPUs, Intel Xeon. Other technique developed in the 90's was simultaneous multi-threading, basically they found that was possible to have another thread in the same single core CPU, since most of the resources were duplicated already like ALU, multiple registers.
So basically a CPU can have multiple cores each of them capable to run one thread or more at the same time, we may expect to have more cores in the future, but with more difficulty to be able to program efficiently.
CPU is a central processing unit. Since 2002 we have only single core processor i.e. we will only perform a single task or a program at a time.
For having multiple programs run at a time we have to use the multiple processor for executing multi processes at a time so we required another motherboard for that and that is very expensive.
So, Intel introduced the concept of hyper threading i.e. it will convert the single CPU into two virtual CPUs i.e we have two cores for our task. Now the CPU is single, but it is only pretending (masqueraded) that it has a dual CPU and performs multiple tasks. But having real multiple cores will be better than that so people develop making multi-core processor i.e. multiple processors on a single box i.e. grabbing a multiple CPU on single big CPU. I.e. multiple cores.
In the early days...like before the 90s...the processors weren't able to do multi tasks that efficiently...coz a single processor could handle just a single task...so when we used to say that my antivirus,microsoft word,vlc,etc. softwares are all running at the same time...that isn't actually true. When I said a processor could handle a single process at a time...I meant it. It actually would process a single task...then it used to pause that task...take another task...complete it if its a short one or again pause it and add it to the queue...then the next. But this 'pause' that I mentioned was so small (appx. 1ns) that you didn't understand that the task has been paused. Eg. On vlc while listening to music there are other apps running simultaneously but as I told you...one program at a time...so the vlc is actually pausing in between for ns so you dont underatand it but the music is actually stopping in between.
But this was about the old processors...
Now-a- days processors ie 3rd gen pcs have multi cored processors. Now the 'cores' can be compared to a 1st or 2nd gen processors itself...embedded onto a single chip, a single processor. So now we understood what are cores ie they are mini processors which combine to become a processor. And each core can handle a single process at a time or multi threads as designed for the OS. And they folloq the same steps as I mentioned above about the single processor.
Eg. A i7 6gen processor has 8 cores...ie 8 mini processors in 1 i7...ie its speed is 8x times the old processors. And this is how multi tasking can be done.
There could be hundreds of cores in a single processor
Eg. Intel i128.
I hope I explaned this well.
I have read all answers, but this link was more clear explanation for me about difference between CPU(Processor) and Core. So I'm leaving here some notes from there.
The main difference between CPU and Core is that the CPU is an electronic circuit inside the computer that carries out instruction to perform arithmetic, logical, control and input/output operations while the core is an execution unit inside the CPU that receives and executes instructions.
Intel's picture is helpful, as shown by Tortuga's best answer. Here's a caption for it.
Processor: One semiconductor chip, the CPU (central processing unit) seated in one socket, circa 1950s-2010s. Over time, more functions have been packed onto the CPU chip. Prior to the 1950s releases of single-chip processors, one processor might have spread across multiple chips. In the mid 2010s the system-on-a-chip chips made it slightly more sketchy to equate one processor to one chip, though that's generally what people mean by processor, as in "this computer has an i7 processor" or "this computer system has four processors."
Core: One block of a CPU, executing one instruction at a time. (You'll see people say one instruction per clock cycle, but some CPUs use multiple clock cycles for some instructions.)

How are light weight threads scheduled by the linux kernel on a multichip multicore SMP system?

I am running a parallel algorithm using light threads and I am wondering how are these assigned to different cores when the system provides several cores and several chips. Are threads assigned to a single chip until all the cores on the chip are exhausted? Are threads assigned to cores on different chips in order to better distribute the work between chips?
You don't say what OS you're on, but in Linux, threads are assigned to a core based on the load on that core. A thread that is ready to run will be assigned to a core with lowest load unless you specify otherwise by setting thread affinity. You can do this with sched_setaffinity(). See the man page for more details. In general, as meyes1979 said, this is something that is decided by the scheduler implemented in the OS you are using.
Depending upon the version of Linux you're using, there are two articles that might be helpful: this article describes early 2.6 kernels, up through 2.6.22, and this article describes kernels newer than 2.6.23.
Different threading libraries perform threading operations differently. The "standard" in Linux these days is NPTL, which schedules threads at the same level as processes. This is quite fine, as process creation is fast on Linux, and is intended to always remain fast.
The Linux kernel attempts to provide very strong CPU affinity with executing processes and threads to increase the ratio of cache hits to cache misses -- if a task always executes on the same core, it'll more likely have pre-populated cache lines.
This is usually a good thing, but I have noticed the kernel might not always migrate tasks away from busy cores to idle cores. This behavior is liable to change from version to version, but I have found multiple CPU-bound tasks all running on one core while three other cores were idle. (I found it by noticing that one core was six or seven degrees Celsius warmer than the other three.)
In general, the right thing should just happen; but when the kernel does not automatically migrate tasks to other processors, you can use the taskset(1) command to restrict the processors allowed to programs or you could modify your program to use the pthread_setaffinity_np(3) function to ask for individual threads to be migrated. (This is perhaps best for in-house applications -- one of your users might not want your program to use all available cores. If you do choose to include calls to this function within your program, make sure it is configurable via configuration files to provide functionality similar to the taskset(1) program.)

OpenMP thread mapping to physical cores

So I've looked around online for some time to no avail. I'm new to using OpenMP and so not sure of the terminology here, but is there a way to figure out a specific machine's mapping from OMPThread (given by omp_get_thread_num();) and the physical cores on which the threads will run?
Also I was interested in how exactly OMP assigned threads, for example is thread 0 always going to run in the same location when the same code is run on the same machine? Thanks.
Typically, the OS takes care of assigning threads to cores, including with OpenMP. This is by design, and a good thing - you normally would want the OS to be able to move a thread across cores (transparently to your application) as required, since it will interrupt your application at times.
Certain operating system APIs will allow thread affinity to be set. For example, on Windows, you can use SetThreadAffinityMask to force a thread onto a specific core.
Most of the time Reed is correct, OpenMP doesn't care about the assignment of threads to cores (or processors). However, because of things like cache reuse and data locality we have found that there are many cases where having the threads assigned to cores increases the performance of OpenMP. Therefore if you look at most OpenMP implementations, you will find that there is usually some environment variable that can be set to "bind" threads to cores. The OpenMP ARB has not yet specified any "standard" way of doing this, so at this time it is left up to an OpenMP implementation to decide if and how this should be done. There has been a great deal of discussion about whether this should be included in the OpenMP spec or not and if so how it could best be done.

Resources