Parallel (OpenMP) Fortran code stalls after long time without giving error - parallel-processing

Running a Fortran code containing a parallel OpenMP region, I have encountered a problem that after the code runs fine for some time (counter=~1,000,000,000 in the code below), it stalls without crashing or providing any errors. A code snippet reproducing the problem is:
program crasher
implicit none
integer*8 :: limit
integer*8 :: i,temp
integer*8 :: counter
limit=1062
counter=0
do
counter=counter+1
!$OMP PARALLEL &
!$OMP DEFAULT(none) &
!$OMP PRIVATE(i,temp) &
!$OMP SHARED(limit)
!$OMP DO
do i=1,limit
temp=0
enddo
!$OMP END DO
!$OMP END PARALLEL
if (mod(counter,100000).eq.0) then
write(6,'(A,I0)') "Number of runs: ",counter
endif
enddo
end program
When I do strace -p PID, with the PIDs of the processes (16 cores) spawned by this code, one of them yields:
...
futex(0x7f4617a91a00, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f4617a91a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 461, {1489578049, 907892000}, ffffffff^C) = -1 ETIMEDOUT (Connection timed out)
...
over and over again, even after the other processes have ceased to do anything. Running the same code on a different machine, the above strace output does not appear and the code runs fine. Running the code in serial, it runs fine on both machines.
I have compiled with ifort (v 15.0.2) as well as gfortran (v 4.8.5) with the same result on both machines: One machine works the other one does the crazy thing.
I have found some information that this might be a problem with the linux kernel. The machine that produces the error has "Linux 2.6.32-431.23.3.el6.x86_64" the other "Linux 3.10.0-327.18.2.el7.x86_64". Does anyone have an idea how to fix/work around this?

Related

Memory usage increase when some data is copied from other images (processors) in coarrays fortran?

I am using coarray to parallelize a fortran code. The code is working properly in my pc (ubuntu 18, OpenCoarrays 2.0.0). However when I run the code on the cluster (centos) it crashes with the following error:
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
Error: Command:
/APP/enhpc/mpi/mpich2-gcc-hd/bin/mpiexec -n 10 -machinefile machines ./IPS
failed to run
using top command during running the code I found out that memory increases when the code is running. The problem is coming from where I copy some data from another processor:
for example a(:)=b(:)[k]
Since the code is running on my pc properly what can be the reason for memory increase in cluster?
I have to mention that I am running the code with cores on a single node.
It increases continuously. It is a centos cluster. I do not know what kind of architecture it has. I am using OpenCoarrays v2.9.1 which is using coarray fortran (CAF) for compiling. Also GNU v 10.1. I wrote a simple code as follows:
program hello_image
integer::m,n,i
integer,allocatable:: A(:)[:],B(:)
m=1e3
n=1e6
allocate(A(n)[*],B(n))
A(:)=10
B(:)=20
write(*,*) j,this_image()
do j=1,m
Do i=1,n
B(i)=A(i)[3] ! this line means that the data is copied from processor 3 to other processors
enddo
write(*,*) j,this_image()
enddo
end program hello_image
When I am running this code in my pc the memory usage for all clusters are a constant value of 0.1% and they are not increasing. However, when I run the same code in the cluster the memory usage is continously increasing.
Output from My pc:
output from cluster:

IntMap traverseWithKey faster than mapWithKey?

I was reading this part of Parallel and Concurrent Programming in Haskell, and found the sequential version of the program to be far slower than the parallel one with one core:
Sequential:
$ cabal run fwsparse -O2 --ghc-options="-rtsopts" -- 1000 800 +RTS -s
...
Total time 11.469s ( 11.529s elapsed)
Parallel:
$ cabal run fwsparse1 -O2 --ghc-options="-rtsopts -threaded" -- 1000 800 +RTS -s -N1
...
Total time 4.906s ( 4.988s elapsed)
According to the book, the sequential one should be slightly faster than the parallel one (if it runs on a single core).
The only difference in the two programs was this function:
Sequential:
update g k = Map.mapWithKey shortmap g
Parallel:
update g k = runPar $ do
m <- Map.traverseWithKey (\i jmap -> spawn (return (shortmap i jmap))) g
traverse get m
Since the spawn function uses deepseq, I initially though it had something to do with strictness but the use of force didn't change the performance of the sequential program.
Finally, I managed to get the sequential one work faster:
$ cabal run fwsparse -O2 --ghc-options="-rtsopts" -- 1000 800 +RTS -s
...
Total time 3.891s ( 3.901s elapsed)
by changing the update function to this:
update g k = runIdentity $ Map.traverseWithKey (\i jmap -> pure $ shortmap i jmap) g
Why does using traverseWithKey in the Identity monad speed up performance? I checked the IntMap source code but couldn't figure out the reason.
Is this a bug or the expected behaviour?
(Also, this is my first question on StackOverflow so please tell me if I'm doing anything wrong.)
EDIT:
As per Ismor's comment, I turned off optimization (using -O0) and the sequential program runs in 19.5 seconds with either traverseWithKey or mapWithKey, while the parallel one runs in 21.5 seconds.
Why is traverseWithKey optimized to be so much faster than mapWithKey?
Which function should I use in practice?

running subprocesses in parallel with Python

I am trying to understand how can I build a parallel computing pipeline for multiple subprocesses.
As I see, each subprocess block waits for the previous code block to run, whereas I have a pipeline which does not have a dependency for the previous run, and it can be handled in parallel. I want to understand whether this is possible, and if so, a sample syntax for showing how to do that would be a great help! Thanks in advance.
import sys
import os
import subprocess
subprocess.run("python pipelinecode1.py".split() +
[run_date, this_wk, last_wk, prev_wk], shell=True)
subprocess.run("python pipelinecode2.py".split() +
[run_date, this_wk, last_wk, prev_wk], shell=True)
subprocess.run("python pipelinecode3.py".split() +
[run_date, this_wk, last_wk, prev_wk], shell=True)
The MCVE as-is shows zero dependency on the python-interpreter, so the most efficient step for running a set of mutualy independent tasks ( not a pipeline, where one-step-after-another order of processing steps "forms" the "pipeline" ) is GNU parallel:
$ parallel python {} run_date this_wk last_wk prev_wk ::: pipelinecode1.py \
pipelinecode2.py \
pipelinecode3.py
This way you do not waste CPU / cache resources and escape from the blocking and GIL-lock re-introduced re-[SERIAL]-isation of the code-execution without any add-on overhead costs.
For all configurables available read respective details in man parallel

Issue with NVIDIA GPU for matrix operations [duplicate]

I have this little nonsense script here which I am executing in MATLAB R2013b:
clear all;
n = 2000;
times = 50;
i = 0;
tCPU = tic;
disp 'CPU::'
A = rand(n, n);
B = rand(n, n);
disp '::Go'
for i = 0:times
CPU = A * B;
end
tCPU = toc(tCPU);
tGPU = tic;
disp 'GPU::'
A = gpuArray(A);
B = gpuArray(B);
disp '::Go'
for i = 0:times
GPU = A * B ;
end
tGPU = toc(tGPU);
fprintf('On CPU: %.2f sec\nOn GPU: %.2f sec\n', tCPU, tGPU);
Unfortunately after execution I receive a message from Windows saying: "Display driver stopped working and has recovered.".
Which I assume means that Windows did not get response from my graphic cards driver or something. The script returned without errors:
>> test
CPU::
::Go
GPU::
::Go
On CPU: 11.01 sec
On GPU: 2.97 sec
But no matter if the GPU runs out of memory or not, MATLAB is not able to use the GPU device before I restarted it. If I don't restart MATLAB I receive just a message from CUDA:
>> test
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
CPU::
::Go
GPU::
Error using gpuArray
An unexpected error occurred during CUDA execution.
The CUDA error was:
the launch timed out and was terminated
Error in test (line 21)
A = gpuArray(A);
Does anybody know how to avoid this issue or what I am doing wrong here?
If needed, my GPU Device:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'GeForce GTX 660M'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.9037e+09
MultiprocessorCount: 2
ClockRateKHz: 950000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
The key piece of information is this part of the gpuDevice output:
KernelExecutionTimeout: 1
This means that the host display driver is active on the GPU you are running the compute jobs on. The NVIDIA display driver contains a watchdog timer which kills any task which takes more than a predefined amount of time without yielding control back to the driver for screen refresh. This is intended to prevent the situation where a long running or stuck compute job renders the machine unresponsive by freezing the display. The runtime of your Matlab script is clearly exceeding the display driver watchdog timer limit. Once that happens, the the compute context held on the device is destroyed and Matlab can no longer operate with the device. You might be able to reinitialise the context by calling reset, which I guess will run cudaDeviceReset() under the cover.
There is a lot of information about this watchdog timer on the interweb - for example this Stack Overflow question. The solution for how to modify this timeout is dependent on your OS and hardware. The simplest way to avoid this is to not run CUDA code on a display GPU, or increase the granularity of your compute jobs so that no one operation has a runtime which exceeds the timeout limit. Or just write faster code...

Linpack sometimes starting, sometimes not, but nothing changed

I installed Linpack on a 2-Node cluster with Xeon processors. Sometimes if I start Linpack with this command:
mpiexec -np 28 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
linpack starts and prints the output, sometimes I only see the mpi mappings printed and then nothing following. To me this seems like random behaviour because I don't change anything between the calls and as already mentioned, Linpack sometimes starts, sometimes not.
In top I can see that xhpl_intel64processes have been created and they are heavily using the CPU but when watching the traffic between the nodes, iftop is telling me that it nothing is sent.
I am using MPICH2 as MPI implementation. This is my HPL.dat:
# cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
10000 Ns
1 # of NBs
250 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
14 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
edit2:
I now just let the program run for a while and after 30min it tells me:
# mpiexec -np 32 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
(node-0:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
(node-1:16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31)
Assertion failed in file ../../socksm.c at line 2577: (it_plfd->revents & 0x008) == 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
Is this a mpi problem?
Do you know what type of problem this could be?
I figured out what the problem was: MPICH2 uses different random ports each time it starts and if these are blocked your application wont start up correctly.
The solution for MPICH2 is to set the environment variable MPICH_PORT_RANGE to START:END, like this:
export MPICH_PORT_RANGE=50000:51000
Best,
heinrich

Resources