OpenMP thread affinity with GOMP_CPU_AFFINITY and sched_setaffinity - openmp

I'm trying to bind OpenMP threads to CPUs and have tried two methods.
The first is sched_setaffinity. Every time threads are forked, I call sched_setaffinity to bind the OpenMP threads to CPUs. Binding on every fork seems far too costly. Do I really need to bind every time, or is binding once enough?
The second is GOMP_CPU_AFFINITY. I set GOMP_CPU_AFFINITY=0-7 and OMP_NUM_THREADS=4. Two nodes are used for computing; each node has 2 chips and each chip has 4 cores. I place one process on each node and each process forks 4 threads. If GOMP_CPU_AFFINITY is not set, OpenMP does speed up the program. If GOMP_CPU_AFFINITY is set, the following warnings are printed and OpenMP gives no speedup:
OMP: Warning #123: Ignoring invalid OS proc ID 1.
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
OMP: Warning #123: Ignoring invalid OS proc ID 4.
OMP: Warning #123: Ignoring invalid OS proc ID 5.
OMP: Warning #123: Ignoring invalid OS proc ID 6.
OMP: Warning #123: Ignoring invalid OS proc ID 7.

With OpenMP 4.0, one can set affinity per parallel region:
#pragma omp parallel proc_bind(close)
With nested parallelism, one can also use proc_bind(spread) in the outer region to do coarse-grain pinning first.
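A minimal sketch of that idea (my own illustration, not from the answer): an outer region bound with spread and an inner region bound with close. The thread counts and the suggestion to set OMP_PLACES=cores are assumptions for demonstration; sched_getcpu() is a Linux-specific call used only to print where each thread landed.

#define _GNU_SOURCE           /* for sched_getcpu() */
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    omp_set_nested(1);        /* enable nested parallelism */

    /* Outer region: spread 2 threads far apart (e.g. one per socket). */
    #pragma omp parallel num_threads(2) proc_bind(spread)
    {
        int outer = omp_get_thread_num();

        /* Inner region: keep 4 threads close to their parent. */
        #pragma omp parallel num_threads(4) proc_bind(close)
        {
            printf("outer %d / inner %d on CPU %d\n",
                   outer, omp_get_thread_num(), sched_getcpu());
        }
    }
    return 0;
}

Compile with gcc -fopenmp; setting OMP_PLACES=cores (or a similar places list) makes the binding explicit.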

Related

Is there any way to know which processes are running on which core in QNX

My system is running QNX 6.5 and it has 4 CPU cores, but I don't know which processes are running on each core. Is there any way to find this out in detail?
Thanks in advance.
Processes typically run multiple threads (at least one, the main thread), so the thread is the actual running unit, not the process (and core affinity is settable per thread). Thus you need to know which core(s) the threads are running on.
There is a "%l" format option that tells you which CPU a particular thread is executing on:
# pidin -F "%b %50h %i %l" -p random
tid thread name cpu
1 1 0
Runmask : 0x0000007f
Inherit Mask: 0x0000007f
2 Timer Thread 1
Runmask : 0x0000007f
Inherit Mask: 0x0000007f
3 3 6
Runmask : 0x0000007f
Inherit Mask: 0x0000007f
Above we print the thread id, thread name, and run/inherit CPU masks for the process called "random"; the rightmost column is the CPU each thread is currently running on.
The best tooling for analyzing the details of process scheduling in QNX is the "System Analysis Toolkit", which uses the instrumentation features of the QNX kernel to provide a log of every scheduling event and message pass.
For QNX 6.5, the documentation can be found here: http://www.qnx.com/developers/docs/6.5.0SP1.update/index.html#./com.qnx.doc.instr_en_instr/about.html
I got the details by using the command below:
pidin rmasks
which gives the pid, tid, name, and runmask of every thread.
From the runmask value we can identify which cores each thread is allowed to run on.
For my purposes, the per-thread details were enough.

Can I bound an ELF to a particular CPU

After checking the ELF format, it seems that there is no field that indicates which CPU will execute the file.
My question is:
Can we assign a particular CPU to an ELF? If yes, how can I do that?
Can we assign a particular CPU to an ELF?
No.
If I want to assign a program to a particular CPU, how can I do it?
Depends on the OS.
What if I want to do it from code instead of using a Linux command?
On Linux, you can call sched_setaffinity to bind a process to a set of CPUs, or pthread_setaffinity_np to bind a thread to a set of CPUs (note: neither call is portable to other OSes).
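A minimal sketch of both calls (my own illustration; the CPU numbers 2 and 3 are arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    printf("worker running on CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void)
{
    cpu_set_t set;
    pthread_t tid;

    /* Bind the whole process (pid 0 = calling process) to CPU 2. */
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* Bind a single thread to CPU 3. */
    pthread_create(&tid, NULL, worker, NULL);
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    int rc = pthread_setaffinity_np(tid, sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);

    pthread_join(tid, NULL);
    return 0;
}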

Cuda kernel launch failure

I am trying to call two kernels as shown below
for (t = 0; t <= time_total; t++)
{
    // kernel calls
    kernel1<<<noOfBlocks, noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
    checkCudaError(cudaThreadSynchronize());
    kernel2<<<noOfBlocks, noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
    checkCudaError(cudaThreadSynchronize());
}
And the structure of the second kernel is
var[index+0]=**SOME CALCULATION**
var[index+1]=**SOME CALCULATION**
var[index+2]=**SOME CALCULATION**
Now when I execute this code, checkCudaError does not report anything and the code runs and produces some output, but Visual Studio shows the following exceptions:
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
And when I check in Nsight, it says kernel 2 has the following error:
CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
Now the problem is that the var array from kernel 2 has some rows correct, some rows that are copies of other rows' values, and some that are garbage.
Also when I do this
var[index+0]=3
var[index+1]=3
var[index+2]=3
All the values of var are set to 3
A few side notes:
cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().
The fact that Nsight is reporting an error on the 2nd kernel launch, but your error checking code is not, leads me to believe your error checking code is broken.
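For comparison, here is a hedged sketch (not the poster's actual checkCudaError) of launch checking that catches both launch-configuration errors and asynchronous execution errors:

// Hedged sketch; the macro name and the exit-on-error policy are illustrative.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CHECK_KERNEL(msg)                                                 \
    do {                                                                  \
        cudaError_t err_ = cudaGetLastError();       /* launch errors  */ \
        if (err_ == cudaSuccess)                                          \
            err_ = cudaDeviceSynchronize();          /* runtime errors */ \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err_));   \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage after each launch:
// kernel2<<<noOfBlocks, noOfThreadsPerBlock>>>(/* params */);
// CHECK_KERNEL("kernel2");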
Now, regarding your issue, out of resources is frequently due to code requesting too many registers (registers per thread times the number of threads per threadblock requested). Try recompiling your code with -Xptxas -v to get verbose output, and then recompiling again with -maxrregcount 20 (or something like that) to try to work around this for test purposes.
If this "fixes" your problem, you may then want to consider the following:
See if there is a way you can re-order or restructure your code to reduce the register pressure
If not, then adjust your maxrregcount value upwards to approximately the highest value that will allow your code to compile and run according to the launch configurations (number of threads per block) that you care about. You may also want to benchmark your code at different levels of this setting, as it can affect occupancy. Usually if you have it set to the highest value that will compile and run, then you are limiting yourself to one threadblock per SM at execution time. This may be OK, or there may be a lower setting that is better, allowing two threadblocks per SM residency, and possibly higher performance. Only benchmarking your code will tell.

Determine thread created time in WinDbg with user-mode dump

How can I determine the time a thread was created in WinDbg with a user-mode dump?
The !runaway command allows you to display the elapsed time since a thread was created. From the documentation:
!runaway [Flags]

Parameters
Flags - Specifies the kind of information to be displayed. Flags can be any
        combination of the following bits. The default value is 0x1.
    Bit 0 (0x1) - Causes the debugger to show the amount of user time
                  consumed by each thread.
    Bit 1 (0x2) - Causes the debugger to show the amount of kernel time
                  consumed by each thread.
    Bit 2 (0x4) - Causes the debugger to show the amount of time that has
                  elapsed since each thread was created.
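For example, !runaway 7 combines all three bits (0x1 + 0x2 + 0x4), so the output lists user time, kernel time, and, for each thread, the time elapsed since it was created; that last section is the one that answers the question.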

Why is a thread's status running but it doesn't use any CPU?

Today I found a very strange problem.
I am running Red Hat Enterprise Linux 6 on an Intel E31275 (4 cores, 8 threads), and I found one kernel thread (I'll call it my_thread) that doesn't work correctly.
With the "ps" command, I found that the status of my_thread was always running:
ps ax
5545 ? R 3:14 [my_thread]
15774 ttyS0 Ss 0:00 -bash
...
But its running time was always 3:14. Since it was running, why didn't the total time increase?
From the proc file /proc/5545/sched, I found that all the statistics for this thread, including the wakeup count (se.nr_wakeups), stayed the same, too.
From /proc/5545/stack, I found that this thread had called this function and never returned:
interruptible_sleep_on_timeout(&q, 3*HZ);
In theory this function should return every 3 seconds if no other thread wakes it up. Each time the function returns, se.nr_wakeups in /proc/5545/sched should increase by 1. But this no longer happened once the thread got into this state.
Does anyone have any ideas? Is it possible that interruptible_sleep_on_timeout() never returns?
Update:
I found that the problem does not occur if I set CPU affinity for this thread: if I pin it to a dedicated core, everything is OK. Is there a problem with SMP scheduling?
Update again:
After I disabled hyper-threading in the BIOS, I have not seen the problem again so far.
First off, R indicates the thread is not in the running state but runnable. That is, it does not mean it is running; it means it is in a state where the scheduler is allowed to pick it to run. There is a big difference between the two.
In a similar sense, interruptible_sleep_on_timeout(&q, 3*HZ); will not run the thread after 3*HZ jiffies (3 seconds), but rather make it available for running after that timeout; and indeed you see it in "ps" as available for running, so possibly the timeout has indeed occurred.
Since you did not say anything about the kernel thread in question, I don't even know if it is your own code or standard kernel code, so I cannot really answer in detail.
One possible reason for the situation you describe is that some other thread (user or kernel) has higher priority than your thread, so the scheduler never picks it for running. If so, the other thread is most probably running at real-time priority (SCHED_FIFO or SCHED_RR).
