Issue with NVIDIA GPU for matrix operations [duplicate]

I have this little nonsense script here which I am executing in MATLAB R2013b:
clear all;
n = 2000;
times = 50;
i = 0;
tCPU = tic;
disp 'CPU::'
A = rand(n, n);
B = rand(n, n);
disp '::Go'
for i = 0:times
CPU = A * B;
end
tCPU = toc(tCPU);
tGPU = tic;
disp 'GPU::'
A = gpuArray(A);
B = gpuArray(B);
disp '::Go'
for i = 0:times
GPU = A * B ;
end
tGPU = toc(tGPU);
fprintf('On CPU: %.2f sec\nOn GPU: %.2f sec\n', tCPU, tGPU);
Unfortunately, after execution I receive a message from Windows saying: "Display driver stopped working and has recovered."
I assume this means that Windows did not get a response from my graphics card driver. The script returned without errors:
>> test
CPU::
::Go
GPU::
::Go
On CPU: 11.01 sec
On GPU: 2.97 sec
But regardless of whether the GPU runs out of memory, MATLAB cannot use the GPU device again until I restart MATLAB. If I don't restart it, I just receive this message from CUDA:
>> test
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
CPU::
::Go
GPU::
Error using gpuArray
An unexpected error occurred during CUDA execution.
The CUDA error was:
the launch timed out and was terminated
Error in test (line 21)
A = gpuArray(A);
Does anybody know how to avoid this issue or what I am doing wrong here?
If needed, my GPU Device:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'GeForce GTX 660M'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.9037e+09
MultiprocessorCount: 2
ClockRateKHz: 950000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1

The key piece of information is this part of the gpuDevice output:
KernelExecutionTimeout: 1
This means that the host display driver is active on the GPU you are running the compute jobs on. The NVIDIA display driver contains a watchdog timer which kills any task that takes more than a predefined amount of time without yielding control back to the driver for screen refresh. This is intended to prevent a long-running or stuck compute job from rendering the machine unresponsive by freezing the display. The runtime of your MATLAB script is clearly exceeding the display driver watchdog timer limit. Once that happens, the compute context held on the device is destroyed and MATLAB can no longer operate with the device. You might be able to reinitialise the context by calling reset, which I guess will run cudaDeviceReset() under the hood.
There is a lot of information about this watchdog timer on the interweb - for example this Stack Overflow question. The solution for how to modify this timeout is dependent on your OS and hardware. The simplest way to avoid this is to not run CUDA code on a display GPU, or increase the granularity of your compute jobs so that no one operation has a runtime which exceeds the timeout limit. Or just write faster code...
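To make the "increase the granularity" suggestion concrete, here is a minimal sketch of splitting one large matrix product into several smaller, independent ones. It is written in plain Python with lists rather than MATLAB, and the helper names (matmul, matmul_in_row_blocks) and the block size are hypothetical. On a real GPU you would presumably also need to synchronise between chunks so that each launch finishes before the next one is queued; that part is an assumption and is not shown here.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def matmul_in_row_blocks(a, b, block_rows):
    """Compute a*b as a sequence of smaller products, one block of rows at a time."""
    if block_rows < 1:
        raise ValueError("block_rows must be >= 1")
    result = []
    for start in range(0, len(a), block_rows):
        # Each call below is a separate, shorter job. On a GPU, this boundary is
        # where control could return to the display driver between launches.
        result.extend(matmul(a[start:start + block_rows], b))
    return result


a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]
blocked = matmul_in_row_blocks(a, b, block_rows=2)
```

The blocked result is identical to the single product; only the size of each individual operation changes, which is what matters for the watchdog limit.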

Related

Memory usage increase when some data is copied from other images (processors) in coarrays fortran?

I am using coarrays to parallelize a Fortran code. The code works properly on my PC (Ubuntu 18, OpenCoarrays 2.0.0). However, when I run the code on the cluster (CentOS), it crashes with the following error:
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
Error: Command:
/APP/enhpc/mpi/mpich2-gcc-hd/bin/mpiexec -n 10 -machinefile machines ./IPS
failed to run
Using the top command while the code was running, I found that memory usage increases as the code runs. The problem comes from the place where I copy some data from another processor, for example:
a(:)=b(:)[k]
Since the code runs properly on my PC, what could be the reason for the memory increase on the cluster?
I should mention that I am running the code with cores on a single node.
The memory usage increases continuously. It is a CentOS cluster; I do not know what kind of architecture it has. I am using OpenCoarrays v2.9.1, which uses Coarray Fortran (CAF) for compiling, together with GNU v10.1. I wrote a simple code as follows:
program hello_image
  integer :: m, n, i, j
  integer, allocatable :: A(:)[:], B(:)
  m = 1000
  n = 1000000
  allocate(A(n)[*], B(n))
  A(:) = 10
  B(:) = 20
  write(*,*) j, this_image()
  do j = 1, m
    do i = 1, n
      B(i) = A(i)[3] ! this line copies the data from processor 3 to the other processors
    enddo
    write(*,*) j, this_image()
  enddo
end program hello_image
When I run this code on my PC, the memory usage of every process stays constant at 0.1% and does not increase. However, when I run the same code on the cluster, the memory usage increases continuously.
Output from my PC:
Output from the cluster:

ModelSim Fatal error in process RAM_i1/RAM_0_0_0/P107 Lattice MACHXO3L_MISC.vhd

I am facing a fatal error when trying to simulate in ModelSim a design that instantiates a RAM IP for the target device MACHXO3L from Lattice Semiconductor. I have compiled their libraries to use in ModelSim, but the simulations always stop due to the following fatal error:
# ** Fatal: (vsim-3483) Delay in signal assignment is not ascending.
# Time: 20 ns Iteration: 1 Process: /fft_tb/fft_i/RAM_i1/RAM_0_0_0/P107 File: C:/lscc/diamond/3.11_x64/cae_library/simulation/script/../vhdl/machxo3l/src/MACHXO3L_MISC.vhd Line: 541
# Fatal error in Process P107 at C:/lscc/diamond/3.11_x64/cae_library/simulation/script/../vhdl/machxo3l/src/MACHXO3L_MISC.vhd line 541
ModelSim Fatal Error:
Any ideas? It seems that the problem is in the Lattice library file MACHXO3L_MISC.vhd, line 541.
As correctly suggested by @user1155120, the problem is solved by changing the simulation time resolution. I changed it to picoseconds by modifying the modelsim.ini file. The parameter to be modified is:
; Set SystemC default time unit.
; Set to fs, ps, ns, us, ms, or sec with optional
; prefix of 1, 10, or 100. The default is 1 ns.
; The ScTimeUnit value is honored if it is coarser than Resolution.
; If ScTimeUnit is finer than Resolution, it is set to the value
; of Resolution. For example, if Resolution is 100ps and ScTimeUnit is ns,
; then the default time unit will be 1 ns. However if Resolution
; is 10 ns and ScTimeUnit is ns, then the default time unit will be 10 ns.
ScTimeUnit = ps
You may also want or need to change the same parameter in the .mpf file, in your project folder.
If that doesn't change the simulation resolution, you can set it explicitly on the vsim command line:
vsim work.<your_test_bench> -t ps
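One plausible reason why a coarser resolution triggers "Delay in signal assignment is not ascending" is that delays smaller than the resolution collapse to the same value. The sketch below illustrates this arithmetic in Python. It assumes that sub-resolution delays are truncated, and the 100/200/300 ps delays are hypothetical values, not taken from the Lattice source file.

```python
# Number of femtoseconds in each time unit.
UNITS_IN_FS = {"fs": 1, "ps": 10**3, "ns": 10**6, "us": 10**9}


def truncate_to_resolution(delay_fs, resolution):
    """Express a delay given in fs as a whole number of resolution steps (truncating)."""
    return delay_fs // UNITS_IN_FS[resolution]


def is_strictly_ascending(values):
    """True if every value is strictly greater than the one before it."""
    return all(x < y for x, y in zip(values, values[1:]))


# Hypothetical waveform delays of 100 ps, 200 ps and 300 ps, stored in fs.
delays_fs = [100 * 10**3, 200 * 10**3, 300 * 10**3]

at_ns = [truncate_to_resolution(d, "ns") for d in delays_fs]  # all collapse to 0
at_ps = [truncate_to_resolution(d, "ps") for d in delays_fs]  # stay distinct
```

Under this assumption, the three delays are indistinguishable at 1 ns resolution but remain strictly ascending at 1 ps, which is consistent with the error disappearing once the resolution is set to ps.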

Linux perf test on 3.10 kernel

I am running perf test on an imx6dl ARM target, and two subtests are failing:
perf test -v 15
15: Test breakpoint overflow signal handler :
--- start ---
count1 0, count2 0, overflow 0
failed: wrong count for bp10
failed: wrong overflow hit
failed: wrong count for bp2
---- end ----
Test breakpoint overflow signal handler: FAILED!
perf test -v 16
16: Test breakpoint overflow sampling :
--- start ---
count 0, overflow 0
Wrong number of executions 0 != 10000
Wrong number of overflows 0 != 100
---- end ----
Test breakpoint overflow sampling: FAILED!
Please help me understand why all the values are zero.
Thanks.
Your imx6dl ARM core may have no hardware performance counters, or no interrupt-on-overflow mode for them. Alternatively, your kernel may lack support for such hardware. You should check the exact core name and configuration of your chip, and ARM's documentation on the hardware performance counters implemented in it.

Jobs being killed on Condor Environment

I am running an executable in Condor that basically processes an input image and saves a binary image in a given folder. I run this code on 213 images.
The contents of my Condor configuration file are as follows:
universe = vanilla
executable = /datasets/me/output_cpen_database/source_codes_techniques/test/vole
arguments = cmfd -I /datasets/me/cpen_database/scale1/$(Process)/$(Process).png -O /datasets/me/output_cpen_database/scale1/dct/$(Process)/ --numThreads 10 --chan GRAY --featvec DCT --blockSize 16 --minDistEuclidian 50 --kdsort --fastsats --minSameShift 1000 --markRegions --useOrig --writePost --writeMatrix
initialdir = /datasets/me/output_cpen_database/source_codes_techniques/test
requirements = (OpSysAndVer == "Ubuntu12")
request_cpus = 5
request_memory = 20000
output = logs/output-$(Process).log
error = logs/error-$(Process).log
log = logs/log-$(Process).log
Notification = Complete
Notify_User = mymail@gmail.com
Queue 214
Some images are processed OK, but in some cases I receive the following error in my mailbox:
Condor job 1273.47
/datasets/me/output_cpen_database/source_codes_techniques/test/vole cmfd -I /datasets/me/cpen_database/scale1/47/47.png -O /datasets/me/output_cpen_database/scale1/dct/47/ --numThreads 10 --chan GRAY --featvec DCT --blockSize 16 --minDistEuclidian 50 --kdsort --fastsats --minSameShift 1000 --markRegions --useOrig --writePost --writeMatrix
died on signal 9 (Killed)
I wondered whether this happens because of a lack of memory, but this image (named 47) is no larger than 20 MB (it is actually 16.7 MB).
As I said before, Condor runs this executable fine for some other images.
Do I have to increase request_memory in my configuration file? What is happening here?
Usually, a job dying on signal 9 means problems with some of the shared libraries required by your executable. What I would check is whether or not all jobs die on a particular host. If that's the case, you could run the code manually and see if you get a missing shared library error.

Linpack sometimes starting, sometimes not, but nothing changed

I installed Linpack on a 2-Node cluster with Xeon processors. Sometimes if I start Linpack with this command:
mpiexec -np 28 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
Linpack starts and prints the output; sometimes I only see the MPI mappings printed and then nothing more. To me this seems like random behaviour, because I don't change anything between the calls and, as already mentioned, Linpack sometimes starts and sometimes does not.
In top I can see that xhpl_intel64 processes have been created and are using the CPU heavily, but when watching the traffic between the nodes, iftop tells me that nothing is sent.
I am using MPICH2 as MPI implementation. This is my HPL.dat:
# cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
10000 Ns
1 # of NBs
250 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
14 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
edit2:
I now just let the program run for a while and after 30min it tells me:
# mpiexec -np 32 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
(node-0:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
(node-1:16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31)
Assertion failed in file ../../socksm.c at line 2577: (it_plfd->revents & 0x008) == 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
Is this an MPI problem?
Do you know what type of problem this could be?
I figured out what the problem was: MPICH2 uses different random ports each time it starts, and if these are blocked your application won't start up correctly.
The solution for MPICH2 is to set the environment variable MPICH_PORT_RANGE to START:END, like this:
export MPICH_PORT_RANGE=50000:51000
Best,
heinrich

Resources