Can I use OpenACC for multi-core CPU?

I want to use OpenACC to parallelize code on a multi-core CPU.
I know that it is possible to use the CPU as the host and a GPU as the device that executes the target region, but I want
to use CPU cores (or two separate CPUs) as both the host and the target device at the same time. Can I do this with OpenACC?

Yes. The target device of OpenACC can be multicore CPU. If using PGI, use the flag -ta=multicore to target the CPU. By default, the runtime will use all the cores available on the system. If you want to limit the number of cores to use, set the environment variable ACC_NUM_CORES=N.
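
As a rough illustration of what such code can look like, here is a minimal sketch of an OpenACC loop in C. The function name saxpy and its parameters are made up for this example; only the -ta=multicore flag and ACC_NUM_CORES come from the answer above.

```c
/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
 * Assumption: built with PGI and -ta=multicore, the loop is spread over
 * the host CPU cores. A compiler without OpenACC support ignores the
 * pragma and simply runs the loop serially, with the same result. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Called from a main function in a file named, say, saxpy.c (a hypothetical name), this would be built with something like pgcc -ta=multicore saxpy.c, and running it with ACC_NUM_CORES=4 set in the environment should limit it to four cores.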

If using GCC, it's not yet possible (but certainly can be implemented); see https://stackoverflow.com/a/61227622/664214.

Related

Intel OpenMP library slows down memory bandwidth significantly on AMD platforms by setting KMP_AFFINITY=scatter

For memory-bound programs, it is not always faster to use many threads (say, as many as there are cores), since threads may compete for memory channels. On a two-socket machine, fewer threads are usually better, but we need to set an affinity policy that distributes the threads across sockets to maximize memory bandwidth.
Intel OpenMP claims that KMP_AFFINITY=scatter achieves this; the opposite value, "compact", places threads as close together as possible. I have used ICC to build the Stream benchmark, and this claim is easily validated on Intel machines. If KMP_AFFINITY is set, the native OpenMP environment variables such as OMP_PLACES and OMP_PROC_BIND are ignored, and you get a warning like this:
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
However, a benchmark on a new AMD EPYC machine I obtained shows really bizarre results. KMP_AFFINITY=scatter gives the lowest memory bandwidth possible. It seems that this setting does exactly the opposite on AMD machines: it places threads as close together as possible, so that the L3 cache at each NUMA node is not even fully utilized. And if I explicitly set OMP_PROC_BIND=spread, Intel OpenMP ignores it, as the warning above says.
The AMD machine has two sockets with 64 physical cores per socket. I have tested with 128, 64, and 32 threads, and I want them spread across the whole system. With OMP_PROC_BIND=spread, Stream gives a triad bandwidth of 225, 290, and 300 GB/s, respectively. But once I set KMP_AFFINITY=scatter, even with OMP_PROC_BIND=spread still present, Stream gives 264, 144, and 72 GB/s.
Notice that for 128 threads on 128 cores, KMP_AFFINITY=scatter gives better performance. This further suggests that the threads are in fact placed as close together as possible rather than scattered.
In summary, KMP_AFFINITY=scatter displays completely opposite (in a bad way) behavior on AMD machines, and it even overrides the native OpenMP environment variables regardless of the CPU brand. The whole situation sounds a bit fishy, since it is well known that ICC detects the CPU brand and uses the CPU dispatcher in MKL to launch slower code on non-Intel machines. So why can't ICC simply disable KMP_AFFINITY and restore OMP_PROC_BIND if it detects a non-Intel CPU?
Is this a known issue? Can someone validate my findings?
To give more context, I am a developer of a commercial computational fluid dynamics program. Unfortunately, we link our program with the ICC OpenMP library, and KMP_AFFINITY=scatter is set by default, because in CFD we must solve large-scale sparse linear systems and this part is extremely memory-bound. I found that with KMP_AFFINITY=scatter set, our program becomes 4X slower (when using 32 threads) than the speed it can actually achieve on the AMD machine.
Update:
Now, using hwloc-ps, I can confirm that KMP_AFFINITY=scatter is actually doing "compact" on my AMD Threadripper 3 machine. I have attached the lstopo result. I ran my CFD program (built with ICC 2017) with 16 threads. OMP_PROC_BIND=spread places one thread in each CCX so that the L3 cache is fully utilized. hwloc-ps -l -t gives:
While setting KMP_AFFINITY=scatter, I got
I will try the latest ICC/Clang OpenMP runtime and see how it works.

TL;DR: Do not use KMP_AFFINITY. It is not portable. Prefer OMP_PROC_BIND (it cannot be used with KMP_AFFINITY at the same time). You can mix it with OMP_PLACES to bind threads to cores manually. Moreover, numactl should be used to control the memory channel binding or more generally NUMA effects.
Long answer:
Thread binding: OMP_PLACES can be used to bind each thread to a specific core (reducing context switches and NUMA issues). OMP_PROC_BIND and KMP_AFFINITY should theoretically do that correctly, but in practice they fail to do so on some systems. Note that OMP_PROC_BIND and KMP_AFFINITY are mutually exclusive options: they should not be used together (OMP_PROC_BIND is the newer, portable replacement for the older KMP_AFFINITY environment variable). Since the core topology changes from one machine to another, you can use the hwloc tools to get the list of PU IDs required by OMP_PLACES: hwloc-calc to get the list and hwloc-ls to check the CPU topology. Each thread should be bound separately so that no migration is possible. You can check the binding of the threads with hwloc-ps.
NUMA effects: AMD processors are built by assembling multiple CCXs connected together by a high-bandwidth link (AMD Infinity Fabric). Because of that, AMD processors are NUMA systems. If not taken into account, NUMA effects can result in a significant drop in performance. The numactl tool is designed to control and mitigate NUMA effects: processes can be bound to memory channels using the --membind option, and the memory allocation policy can be set to --interleave (or --localalloc if the process is NUMA-aware). Ideally, processes/threads should only work on data allocated and first-touched on their local memory channels. If you want to test a configuration on a given CCX, you can play with --physcpubind and --cpunodebind.
My guess is that the Intel/Clang runtime does not bind threads well when KMP_AFFINITY=scatter is set because of a bad PU mapping (which could come from an OS bug, a runtime bug, or bad user/admin settings). The CCX layout is probably involved, since mainstream processors containing multiple NUMA nodes used to be quite rare.
On AMD processors, threads that access the memory of another CCX usually pay a significant additional cost, because the data moves through the (quite slow) Infinity Fabric interconnect and because the interconnect and the memory channels may saturate. I advise you not to trust the OpenMP runtime's automatic thread binding (use OMP_PROC_BIND=TRUE), but rather to perform the thread and memory bindings manually, and then to report bugs if needed.
Here is an example of the resulting command line for running your application (the variables are set by the shell before numactl starts):
OMP_PROC_BIND=TRUE OMP_PLACES="{0},{1},{2},{3},{4},{5},{6},{7}" numactl --localalloc ./app
PS: be careful about PU/core IDs and logical/physical IDs.
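
The remark above about data being first-touched on the local memory channels can be sketched in C as follows. This is only an illustration under assumptions: the function names are invented here, the arrays are assumed to come from malloc (so that no page has been touched yet), the threads are assumed to be bound as described above, and the OS is assumed to place a page on the NUMA node of the thread that first writes it (the usual default on Linux).

```c
/* First-touch initialization: each thread writes the part of the arrays
 * that it will later compute on, so those pages should end up on that
 * thread's local NUMA node (assuming a first-touch policy and bound threads).
 * Without OpenMP enabled the pragmas are ignored and the code runs serially. */
void first_touch_init(double *a, double *b, double *c, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }
}

/* STREAM-style triad. It uses the same static schedule as the
 * initialization, so each thread works on the indices it touched first. */
void triad(double *a, const double *b, const double *c, double scalar, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        a[i] = b[i] + scalar * c[i];
    }
}
```

The point of the design is that both loops use the same static schedule and the same thread count; if the initialization were done serially, all pages would likely land on one node and the other threads would have to reach them through the interconnect.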

Should I enable SMP on heterogeneous multi-threaded CPUs?

I'm building the Linux kernel for a big.LITTLE board and I've been wondering about the CONFIG_SMP option, which enables the kernel's symmetric multiprocessing support.
Linux's documentation says this should be enabled on multi-threaded processors, but I wonder whether symmetric multiprocessing would only work properly on processors that are actually symmetric.
I understand what SMP is, but I haven't found any hint or documentation saying anything about its use on Linux built for ARM's big.LITTLE.

Yes, if you want to use more than a single core you have to enable CONFIG_SMP. This in itself will make all cores (both big and little ones) available to the kernel.
Then, you have two options (I'm assuming you are using the mainline Linux kernel or something not excessively different from it, e.g. not an Android kernel):
1. If you also enable CONFIG_BL_SWITCHER (-> Kernel Features -> big.LITTLE support -> big.LITTLE switcher support) and CONFIG_ARM_BIG_LITTLE_CPUFREQ (-> CPU Power Management -> CPU Frequency scaling -> CPU Frequency scaling -> Generic ARM big LITTLE CPUfreq driver), each big core in your SoC will be paired to a little core, and only one of the cores in each pair will be active at any given time, depending on the CPU load. So basically the number of logical cores will be half the number of physical cores, and each logical core will combine one physical big core and one physical little core (unless the total number of big cores differs from the number of little cores, in which case there will be non-paired physical cores that are also logical cores). For each logical core, switching between the big and little physical core will be managed by the cpufreq governor and will be conceptually equivalent to CPU frequency switching.
2. If you don't enable the above two configuration options, then all physical cores will be available as logical cores, can be active at the same time and are treated by the scheduler as if they were identical.
The first option is more suited if you are aiming at low power consumption, while the second option allows you to get the most out of the CPU.
This will change when Heterogeneous Multi-Processing (HMP) support is integrated in the mainline kernel.

How can I check that MKL calls are running with the correct number of threads on Xeon Phi?

I am running 60 MPI processes and MKL_THREAD_NUM is set to 4 to get me to the full 240 hardware threads on the Xeon Phi. My code is running, but I want to make sure that MKL is actually using 4 threads. What is the best way to check this with the limited Xeon Phi Linux kernel?

You can set MKL_NUM_THREADS to 4 if you like. However, using every single thread does not necessarily give the best performance. In some cases, the MKL library knows things about the algorithm that mean fewer threads are better, and the library routines can choose to use fewer threads. You should only use 60 MPI ranks if you have 61 cores. If you are going to use that many MPI ranks, you will want to set the I_MPI_PIN_DOMAIN environment variable to "core". Remember to leave one core free for the OS and system-level processes. This will put one rank per core on the coprocessor and allow all the OpenMP threads for each MPI process to reside on the same core, giving you better cache behavior. If you do this, you can also use micsmc in GUI mode on the host processor to continuously monitor the activity on all the cores. With one MPI process per core, you can see how much of the time all threads on a core are being used.

Set MKL_NUM_THREADS to 4. You can use an environment variable or a runtime call. This value will be respected, so there is nothing to check.
The Linux kernel on KNC is not stripped down, so I don't know why you think that's a limitation. You should not use any system calls for this anyway, though.

How can I compile a C program for multiple cores with mingw?

I have a C program that I am compiling with mingw, but it runs on only one core of my 8-core machine. How do I compile it to run on multiple cores?
(To clarify: I am not looking to use multiple cores to compile, as compilation time is low. It's runtime where I want to use my full CPU capacity.)

There is no other way but to write a multithreaded program. You first need to work out how to split your task into independent parts that can then run simultaneously in separate threads.
This cannot be fully automated. You may consider using the threading support added in the C11 standard, or taking a look at pthreads or OpenMP.
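
To give an idea of the OpenMP route, here is a minimal sketch that splits a summation loop across threads. The function name parallel_sum is made up for this example, and it assumes a GCC-based MinGW toolchain that ships OpenMP support and is invoked with -fopenmp; without that flag the pragma is ignored and the loop runs on a single core.

```c
/* Sums n values. With OpenMP enabled, the iterations are divided among
 * the threads and the per-thread partial sums are combined at the end
 * by the reduction clause. */
double parallel_sum(const double *data, long n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

This works because the iterations are independent of each other; loops where one iteration depends on the result of the previous one need to be restructured before they can be split this way.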

OpenMP thread mapping to physical cores

So I've looked around online for some time to no avail. I'm new to OpenMP and so not sure of the terminology here, but is there a way to figure out a specific machine's mapping between the OpenMP thread number (given by omp_get_thread_num()) and the physical cores on which the threads will run?
Also I was interested in how exactly OMP assigned threads, for example is thread 0 always going to run in the same location when the same code is run on the same machine? Thanks.

Typically, the OS takes care of assigning threads to cores, including with OpenMP. This is by design, and a good thing - you normally would want the OS to be able to move a thread across cores (transparently to your application) as required, since it will interrupt your application at times.
Certain operating system APIs will allow thread affinity to be set. For example, on Windows, you can use SetThreadAffinityMask to force a thread onto a specific core.

Most of the time Reed is correct: OpenMP doesn't care about the assignment of threads to cores (or processors). However, because of things like cache reuse and data locality, we have found that there are many cases where assigning threads to cores increases the performance of OpenMP. Therefore, if you look at most OpenMP implementations, you will find that there is usually some environment variable that can be set to "bind" threads to cores. The OpenMP ARB has not yet specified any "standard" way of doing this, so at this time it is left up to each OpenMP implementation to decide if and how this should be done. There has been a great deal of discussion about whether this should be included in the OpenMP spec and, if so, how it could best be done.
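
For the original question of seeing where the threads actually run, one possible approach on Linux is to ask the OS from inside each thread. The sketch below assumes Linux with glibc, whose sched_getcpu() call is not part of OpenMP; the function names record_thread_cpus and print_thread_cpus are invented for this example. The value reported is the logical CPU number as seen by the OS at that instant, so unless the threads are bound it can change from one run (or one moment) to the next.

```c
#define _GNU_SOURCE   /* must precede every #include to expose sched_getcpu() */
#include <sched.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Stores in cpus[t] the logical CPU that OpenMP thread t is running on
 * right now. Returns the number of entries filled in (at most capacity).
 * Without OpenMP enabled, only the calling thread is recorded. */
int record_thread_cpus(int *cpus, int capacity)
{
    int nthreads = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        if (t < capacity) {
            cpus[t] = sched_getcpu();
        }
        #pragma omp single
        nthreads = omp_get_num_threads();
    }
#else
    if (capacity > 0) {
        cpus[0] = sched_getcpu();
    }
#endif
    return nthreads < capacity ? nthreads : capacity;
}

/* Prints one line per OpenMP thread with the CPU it was observed on. */
void print_thread_cpus(void)
{
    int cpus[1024];
    int n = record_thread_cpus(cpus, 1024);
    for (int t = 0; t < n; ++t) {
        printf("OpenMP thread %d -> logical CPU %d\n", t, cpus[t]);
    }
}
```

Calling print_thread_cpus() several times in the same run gives a rough idea of whether thread 0 stays in the same place. Note that a logical CPU is not necessarily a distinct physical core: with hardware multithreading, two logical CPUs can share one core.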
