Why does memory usage increase when data is copied from other images (processors) in Coarray Fortran? - memory-management

I am using coarrays to parallelize a Fortran code. The code works properly on my PC (Ubuntu 18, OpenCoarrays 2.0.0). However, when I run the code on the cluster (CentOS) it crashes with the following error:
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
Error: Command:
/APP/enhpc/mpi/mpich2-gcc-hd/bin/mpiexec -n 10 -machinefile machines ./IPS
failed to run
Using the top command while the code is running, I found that the memory usage keeps increasing. The problem comes from where I copy some data from another image (processor), for example:
a(:)=b(:)[k]
Since the code runs properly on my PC, what can be the reason for the memory increase on the cluster?
I should mention that I am running the code on the cores of a single node.
The memory usage increases continuously. It is a CentOS cluster; I do not know what kind of architecture it has. I am using OpenCoarrays v2.9.1, which provides the caf wrapper used for compiling, together with GNU (GCC) 10.1. I wrote a simple test code as follows:
program hello_image
  implicit none
  integer :: m, n, i, j
  integer, allocatable :: A(:)[:], B(:)
  m = 1000
  n = 1000000
  allocate(A(n)[*], B(n))
  A(:) = 10
  B(:) = 20
  sync all                 ! ensure every image has initialized A before any remote read
  write(*,*) this_image()
  do j = 1, m
    do i = 1, n
      B(i) = A(i)[3]       ! copy element i of A from image 3 into the local array B
    enddo
    write(*,*) j, this_image()
  enddo
end program hello_image
When I run this code on my PC, the memory usage of all images stays at a constant 0.1% and does not increase. However, when I run the same code on the cluster, the memory usage increases continuously.
Output from my PC: (not shown)
Output from the cluster: (not shown)
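A variant of the test program that avoids the element-by-element remote access is sketched below. This is only a diagnostic idea, based on the assumption that the growth comes from the per-element one-sided gets (each B(i)=A(i)[3] in the inner loop is a separate remote access that the OpenCoarrays/MPICH layer may buffer): pulling the whole slice from image 3 in one statement issues a single get per outer iteration, so if memory stays flat with this version, the communication layer rather than the user code is the likely culprit.
program hello_image_blocked
  implicit none
  integer :: m, n, j
  integer, allocatable :: A(:)[:], B(:)
  m = 1000
  n = 1000000
  allocate(A(n)[*], B(n))
  A(:) = 10
  B(:) = 20
  sync all               ! all images have initialized A before any remote read
  do j = 1, m
    B(:) = A(:)[3]       ! one remote get of the whole array per iteration,
                         ! instead of n element-wise gets
    write(*,*) j, this_image()
  enddo
end program hello_image_blocked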

Related

How to run multiple training tasks on different GPUs?

I have 2 GPUs on my server, on which I want to run different training tasks.
For the first task, to force TensorFlow to use only one GPU, I added the following code at the top of my script:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
After running the first task, when I try to run the second task on the other GPU (with the same two lines of code), I get the error "No device GPU:1".
What is the problem?
Q : "What is the problem?"
The system needs to see the cards. Validate the current state of the server using a call to the hwloc tool lstopo:
$ lstopo --only osdev
GPU L#0 "card1"
GPU L#1 "renderD128"
GPU L#2 "card2"
GPU L#3 "renderD129"
GPU L#4 "card3"
GPU L#5 "renderD130"
GPU L#6 "card0"
GPU L#7 "controlD64"
Block(Disk) L#8 "sda"
Block(Disk) L#9 "sdb"
Net L#10 "eth0"
Net L#11 "eno1"
GPU L#12 "card4"
GPU L#13 "renderD131"
GPU L#14 "card5"
GPU L#15 "renderD132"
If the output shows more than just the above-mentioned card0, proceed with the proper naming / id#-s, and be sure to set it before doing any other imports, like those of pycuda and tensorflow.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1' # MUST PRECEDE ANY IMPORT-s
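# note: with CUDA_VISIBLE_DEVICES set to '1', that card is the only device this
#       process can see, and it is enumerated as device 0 (TensorFlow's '/GPU:0'),
#       so requesting GPU:1 inside this process will fail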
#---------------------------------------------------------------------
import pycuda as pyCUDA
import tensorflow as tf
...
#----------------------------------------------------------------------

Parallel (OpenMP) Fortran code stalls after long time without giving error

Running a Fortran code containing a parallel OpenMP region, I have encountered a problem: after the code runs fine for some time (counter at roughly 1,000,000,000 in the code below), it stalls without crashing or giving any error. A code snippet reproducing the problem is:
program crasher
  implicit none
  integer*8 :: limit
  integer*8 :: i, temp
  integer*8 :: counter

  limit = 1062
  counter = 0
  do
    counter = counter + 1
!$OMP PARALLEL &
!$OMP DEFAULT(none) &
!$OMP PRIVATE(i,temp) &
!$OMP SHARED(limit)
!$OMP DO
    do i = 1, limit
      temp = 0
    enddo
!$OMP END DO
!$OMP END PARALLEL
    if (mod(counter,100000).eq.0) then
      write(6,'(A,I0)') "Number of runs: ",counter
    endif
  enddo
end program
When I do strace -p PID, with the PIDs of the processes (16 cores) spawned by this code, one of them yields:
...
futex(0x7f4617a91a00, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f4617a91a44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 461, {1489578049, 907892000}, ffffffff^C) = -1 ETIMEDOUT (Connection timed out)
...
over and over again, even after the other processes have ceased to do anything. Running the same code on a different machine, the above strace output does not appear and the code runs fine. Running the code in serial, it runs fine on both machines.
I have compiled with ifort (v 15.0.2) as well as gfortran (v 4.8.5), with the same result on both machines: one machine works, the other one does the crazy thing.
I have found some information suggesting that this might be a problem with the Linux kernel. The machine that produces the error runs "Linux 2.6.32-431.23.3.el6.x86_64", the other "Linux 3.10.0-327.18.2.el7.x86_64". Does anyone have an idea how to fix or work around this?
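One workaround worth trying, purely as a sketch under the assumption that the hang is tied to the enormous number of fork/join cycles (the outer loop creates and destroys the thread team on every pass, roughly a billion times, and the futex calls in the strace output come from the OpenMP runtime parking and waking threads at those boundaries), is to keep a single parallel region alive for the whole run; the restructured program below is only illustrative:
program crasher_single_region
  implicit none
  integer*8 :: limit
  integer*8 :: i, temp
  integer*8 :: counter

  limit = 1062
  counter = 0
!$OMP PARALLEL DEFAULT(none) PRIVATE(i,temp) SHARED(limit,counter)
  do
!$OMP SINGLE
    counter = counter + 1            ! one thread updates the shared counter
!$OMP END SINGLE                     ! implicit barrier: all threads see the update
!$OMP DO
    do i = 1, limit
      temp = 0
    enddo
!$OMP END DO                         ! implicit barrier
!$OMP SINGLE
    if (mod(counter,100000_8) == 0) then
      write(6,'(A,I0)') "Number of runs: ",counter
    endif
!$OMP END SINGLE
  enddo
!$OMP END PARALLEL
end program crasher_single_region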

Issue with NVIDIA GPU for matrix operations [duplicate]

I have this little nonsense script here which I am executing in MATLAB R2013b:
clear all;
n = 2000;
times = 50;
i = 0;
tCPU = tic;
disp 'CPU::'
A = rand(n, n);
B = rand(n, n);
disp '::Go'
for i = 0:times
    CPU = A * B;
end
tCPU = toc(tCPU);
tGPU = tic;
disp 'GPU::'
A = gpuArray(A);
B = gpuArray(B);
disp '::Go'
for i = 0:times
    GPU = A * B;
end
tGPU = toc(tGPU);
fprintf('On CPU: %.2f sec\nOn GPU: %.2f sec\n', tCPU, tGPU);
Unfortunately, after execution I receive a message from Windows saying: "Display driver stopped working and has recovered."
I assume this means that Windows did not get a response from my graphics card driver or something. The script returned without errors:
>> test
CPU::
::Go
GPU::
::Go
On CPU: 11.01 sec
On GPU: 2.97 sec
But whether or not the GPU runs out of memory, MATLAB is not able to use the GPU device until I restart it. If I don't restart MATLAB, I just receive a message from CUDA:
>> test
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
CPU::
::Go
GPU::
Error using gpuArray
An unexpected error occurred during CUDA execution.
The CUDA error was:
the launch timed out and was terminated
Error in test (line 21)
A = gpuArray(A);
Does anybody know how to avoid this issue or what I am doing wrong here?
If needed, my GPU Device:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'GeForce GTX 660M'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.9037e+09
MultiprocessorCount: 2
ClockRateKHz: 950000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
The key piece of information is this part of the gpuDevice output:
KernelExecutionTimeout: 1
This means that the host display driver is active on the GPU you are running the compute jobs on. The NVIDIA display driver contains a watchdog timer which kills any task which takes more than a predefined amount of time without yielding control back to the driver for screen refresh. This is intended to prevent the situation where a long running or stuck compute job renders the machine unresponsive by freezing the display. The runtime of your Matlab script is clearly exceeding the display driver watchdog timer limit. Once that happens, the compute context held on the device is destroyed and Matlab can no longer operate with the device. You might be able to reinitialise the context by calling reset, which I guess will run cudaDeviceReset() under the covers.
There is a lot of information about this watchdog timer on the interweb - for example this Stack Overflow question. The solution for how to modify this timeout is dependent on your OS and hardware. The simplest way to avoid this is to not run CUDA code on a display GPU, or increase the granularity of your compute jobs so that no one operation has a runtime which exceeds the timeout limit. Or just write faster code...

Jobs being killed in a Condor environment

I am running an executable in Condor that basically processes an input image and saves a binary image to a given folder. I use this code on 213 images.
My condor configuration file contents are as following:
universe = vanilla
executable = /datasets/me/output_cpen_database/source_codes_techniques/test/vole
arguments = cmfd -I /datasets/me/cpen_database/scale1/$(Process)/$(Process).png -O /datasets/me/output_cpen_database/scale1/dct/$(Process)/ --numThreads 10 --chan GRAY --featvec DCT --blockSize 16 --minDistEuclidian 50 --kdsort --fastsats --minSameShift 1000 --markRegions --useOrig --writePost --writeMatrix
initialdir = /datasets/me/output_cpen_database/source_codes_techniques/test
requirements = (OpSysAndVer == "Ubuntu12")
request_cpus = 5
request_memory = 20000
output = logs/output-$(Process).log
error = logs/error-$(Process).log
log = logs/log-$(Process).log
Notification = Complete
Notify_User = mymail#gmail.com
Queue 214
Some images are processed OK, but in some cases I receive the following error in my mailbox:
Condor job 1273.47
/datasets/me/output_cpen_database/source_codes_techniques/test/vole cmfd -I /datasets/me/cpen_database/scale1/47/47.png -O /datasets/me/output_cpen_database/scale1/dct/47/ --numThreads 10 --chan GRAY --featvec DCT --blockSize 16 --minDistEuclidian 50 --kdsort --fastsats --minSameShift 1000 --markRegions --useOrig --writePost --writeMatrix
died on signal 9 (Killed)
I was thinking this might happen because of a lack of memory, but this image (named 47) is no larger than 20 MB (it is actually 16.7 MB).
As I said before, Condor runs this executable OK for some other images.
Should I increase the request_memory in my configuration file? What is happening here?
Usually, a job dying on signal 9 means problems with some of the shared libraries required by your executable. What I would check is whether or not all jobs die on a particular host. If that's the case, you could run the code manually and see if you get a missing shared library error.

Linpack sometimes starting, sometimes not, but nothing changed

I installed Linpack on a 2-Node cluster with Xeon processors. Sometimes if I start Linpack with this command:
mpiexec -np 28 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
Linpack starts and prints the output; sometimes I only see the MPI rank mappings printed and then nothing follows. To me this seems like random behaviour, because I don't change anything between the calls and, as already mentioned, Linpack sometimes starts, sometimes not.
In top I can see that xhpl_intel64 processes have been created and that they are heavily using the CPU, but when watching the traffic between the nodes, iftop tells me that nothing is sent.
I am using MPICH2 as MPI implementation. This is my HPL.dat:
# cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
10000 Ns
1 # of NBs
250 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
14 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
Edit 2:
I now just let the program run for a while, and after 30 minutes it tells me:
# mpiexec -np 32 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
(node-0:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
(node-1:16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31)
Assertion failed in file ../../socksm.c at line 2577: (it_plfd->revents & 0x008) == 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
Is this an MPI problem?
Do you know what type of problem this could be?
I figured out what the problem was: MPICH2 uses different random ports each time it starts, and if these are blocked, your application won't start up correctly.
The solution for MPICH2 is to set the environment variable MPICH_PORT_RANGE to START:END, like this:
export MPICH_PORT_RANGE=50000:51000
Best,
heinrich
