How to make Theano use the integrated GPU in MacBook Air? - macos

I tried running the test program for GPU usage:
from theano import function, config, shared, tensor
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x #threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):  # xrange is Python 2 only
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
It only shows this (even after installing libgpuarray):
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.723539 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
I would like to know how to make Theano use the integrated GPU of a MacBook Air (Early 2014).
My machine has Intel HD Graphics 5000 -- not NVIDIA, and hence not CUDA-compatible. Many links suggest using OpenCL, which is also supposed to come pre-installed with OS X, but I can't make head or tail of the links on the web.
I could not find much help on how to set this up for Theano in the docs either.
I just need to make Theano use my Mac's integrated GPU. Is this possible? If so, how? What are the prerequisites?

One suggestion is to select an OpenCL device via Theano's device flag (this requires libgpuarray built with OpenCL support):

THEANO_FLAGS=device=opencl0:1 python ~/check_GPU.py
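For reference, the same device selection can be made persistent in ~/.theanorc. This is only a sketch, assuming the gpuarray backend with OpenCL support is installed (OpenCL support in Theano was experimental, and float64 is often unavailable on integrated GPUs, hence float32):

```
[global]
device = opencl0:0
floatX = float32
```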

Related

Memory usage increases when some data is copied from other images (processors) in Coarray Fortran?

I am using coarrays to parallelize a Fortran code. The code works properly on my PC (Ubuntu 18, OpenCoarrays 2.0.0). However, when I run it on the cluster (CentOS), it crashes with the following error:
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
Error: Command:
/APP/enhpc/mpi/mpich2-gcc-hd/bin/mpiexec -n 10 -machinefile machines ./IPS
failed to run
Using the top command while the code was running, I found that memory usage increases over time. The problem comes from where I copy some data from another image, for example:
a(:) = b(:)[k]
Since the code runs properly on my PC, what can be the reason for the memory increase on the cluster?
I have to mention that I am running the code with all cores on a single node.
The memory usage increases continuously. It is a CentOS cluster; I do not know what kind of architecture it has. I am using OpenCoarrays v2.9.1, which compiles Coarray Fortran (CAF) code, together with GNU Fortran 10.1. I wrote a simple test code as follows:
program hello_image
  integer :: m, n, i, j
  integer, allocatable :: A(:)[:], B(:)
  m = 1e3
  n = 1e6
  allocate(A(n)[*], B(n))
  A(:) = 10
  B(:) = 20
  do j = 1, m
    do i = 1, n
      B(i) = A(i)[3]  ! data is copied from image 3 to the other images
    enddo
    write(*,*) j, this_image()
  enddo
end program hello_image
When I run this code on my PC, the memory usage of every image stays constant at 0.1% and does not increase. However, when I run the same code on the cluster, the memory usage increases continuously.
(The outputs from my PC and from the cluster were attached as screenshots.)

How many cores can I use when using make -j?

I'm using a MacBook with an M1 chip (10 CPU cores: 8 P cores and 2 E cores, plus 24 GPU cores) and want to compile a program. How can I find out how many cores I can use for compiling?
Simply put: what should x be in make -jx?
You can enter:
$ make -j `sysctl -n hw.ncpu`
so the answer is:
x = `sysctl -n hw.ncpu`
where the backtick characters denote command substitution, i.e. the result of the command inside the backticks.
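As a portable variant (an addition on my part, not from the original answer), getconf works on both macOS and Linux, so the same idea can be written without the macOS-specific sysctl:

```shell
# Query the number of online cores portably; on macOS this matches
# `sysctl -n hw.ncpu`, on Linux it matches `nproc`.
NCPU=$(getconf _NPROCESSORS_ONLN)
echo "$NCPU"
# Then invoke make with it, e.g.:
#   make -j"$NCPU"
```

Some people leave one core free for the rest of the system with make -j$((NCPU - 1)).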

IntMap traverseWithKey faster than mapWithKey?

I was reading this part of Parallel and Concurrent Programming in Haskell, and found the sequential version of the program to be far slower than the parallel one with one core:
Sequential:
$ cabal run fwsparse -O2 --ghc-options="-rtsopts" -- 1000 800 +RTS -s
...
Total time 11.469s ( 11.529s elapsed)
Parallel:
$ cabal run fwsparse1 -O2 --ghc-options="-rtsopts -threaded" -- 1000 800 +RTS -s -N1
...
Total time 4.906s ( 4.988s elapsed)
According to the book, the sequential one should be slightly faster than the parallel one (if it runs on a single core).
The only difference in the two programs was this function:
Sequential:
update g k = Map.mapWithKey shortmap g
Parallel:
update g k = runPar $ do
    m <- Map.traverseWithKey (\i jmap -> spawn (return (shortmap i jmap))) g
    traverse get m
Since the spawn function uses deepseq, I initially thought it had something to do with strictness, but using force didn't change the performance of the sequential program.
Finally, I managed to get the sequential one to run faster:
$ cabal run fwsparse -O2 --ghc-options="-rtsopts" -- 1000 800 +RTS -s
...
Total time 3.891s ( 3.901s elapsed)
by changing the update function to this:
update g k = runIdentity $ Map.traverseWithKey (\i jmap -> pure $ shortmap i jmap) g
Why does using traverseWithKey in the Identity monad speed up performance? I checked the IntMap source code but couldn't figure out the reason.
Is this a bug or the expected behaviour?
(Also, this is my first question on StackOverflow so please tell me if I'm doing anything wrong.)
EDIT:
As per Ismor's comment, I turned off optimization (using -O0) and the sequential program runs in 19.5 seconds with either traverseWithKey or mapWithKey, while the parallel one runs in 21.5 seconds.
Why is traverseWithKey optimized to be so much faster than mapWithKey?
Which function should I use in practice?

How to run multiple training tasks on different GPUs?

I have 2 GPUs on my server, on which I want to run different training tasks.
For the first task, to force TensorFlow to use only one GPU, I added the following code at the top of my script:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
After running the first task, when I try to run the second task on the other GPU (with the same 2 lines of code, now selecting the other device), I get the error "No device GPU:1".
What is the problem?
Q : "What is the problem?"
The system needs to see the cards. First validate the current state of the server, using the hwloc tool lstopo:
$ lstopo --only osdev
GPU L#0 "card1"
GPU L#1 "renderD128"
GPU L#2 "card2"
GPU L#3 "renderD129"
GPU L#4 "card3"
GPU L#5 "renderD130"
GPU L#6 "card0"
GPU L#7 "controlD64"
Block(Disk) L#8 "sda"
Block(Disk) L#9 "sdb"
Net L#10 "eth0"
Net L#11 "eno1"
GPU L#12 "card4"
GPU L#13 "renderD131"
GPU L#14 "card5"
GPU L#15 "renderD132"
If it shows more than just the above-mentioned card0, proceed with the proper naming / id#-s, and be sure to set the variable before doing any other import-s, like those of pycuda and tensorflow.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1' # MUST PRECEDE ANY IMPORT-s
#---------------------------------------------------------------------
import pycuda as pyCUDA
import tensorflow as tf
...
#----------------------------------------------------------------------
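For what it's worth, a likely explanation for the "No device GPU:1" error (my reading, not part of the original answer): CUDA renumbers the devices each process can see, so a process started with CUDA_VISIBLE_DEVICES='1' still addresses its single visible card as GPU:0. A minimal sketch, with the TensorFlow lines commented out as an assumption:

```python
import os

# Select the second physical card for this process; this must run before
# tensorflow (or pycuda) is imported.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

# Inside this process CUDA renumbers the visible devices starting at zero,
# so the one remaining card is addressed as 'GPU:0', NOT 'GPU:1':
# import tensorflow as tf
# with tf.device('/GPU:0'):   # this is physical GPU 1
#     ...
print(os.environ['CUDA_VISIBLE_DEVICES'])
```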

Issue with NVIDIA GPU for matrix operations [duplicate]

I have this little nonsense script here which I am executing in MATLAB R2013b:
clear all;
n = 2000;
times = 50;

tCPU = tic;
disp 'CPU::'
A = rand(n, n);
B = rand(n, n);
disp '::Go'
for i = 0:times
    CPU = A * B;
end
tCPU = toc(tCPU);

tGPU = tic;
disp 'GPU::'
A = gpuArray(A);
B = gpuArray(B);
disp '::Go'
for i = 0:times
    GPU = A * B;
end
tGPU = toc(tGPU);

fprintf('On CPU: %.2f sec\nOn GPU: %.2f sec\n', tCPU, tGPU);
Unfortunately, after execution I receive a message from Windows saying: "Display driver stopped working and has recovered."
I assume this means that Windows did not get a response from my graphics card driver. The script returned without errors:
>> test
CPU::
::Go
GPU::
::Go
On CPU: 11.01 sec
On GPU: 2.97 sec
But no matter whether the GPU runs out of memory or not, MATLAB is not able to use the GPU device again until I restart it. If I don't restart MATLAB, I just receive this message from CUDA:
>> test
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
CPU::
::Go
GPU::
Error using gpuArray
An unexpected error occurred during CUDA execution.
The CUDA error was:
the launch timed out and was terminated
Error in test (line 21)
A = gpuArray(A);
Does anybody know how to avoid this issue or what I am doing wrong here?
If needed, my GPU Device:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'GeForce GTX 660M'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.9037e+09
MultiprocessorCount: 2
ClockRateKHz: 950000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
The key piece of information is this part of the gpuDevice output:
KernelExecutionTimeout: 1
This means that the host display driver is active on the GPU you are running the compute jobs on. The NVIDIA display driver contains a watchdog timer which kills any task that takes more than a predefined amount of time without yielding control back to the driver for screen refresh. This is intended to prevent a long-running or stuck compute job from rendering the machine unresponsive by freezing the display. The runtime of your MATLAB script clearly exceeds the display driver's watchdog timer limit. Once that happens, the compute context held on the device is destroyed and MATLAB can no longer operate with the device. You might be able to reinitialise the context by calling reset, which I guess runs cudaDeviceReset() under the covers.
There is a lot of information about this watchdog timer on the interweb - for example this Stack Overflow question. The solution for how to modify this timeout is dependent on your OS and hardware. The simplest way to avoid this is to not run CUDA code on a display GPU, or increase the granularity of your compute jobs so that no one operation has a runtime which exceeds the timeout limit. Or just write faster code...
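As a sketch of the Windows route (my addition, not from the original answer): the watchdog there is the Timeout Detection and Recovery (TDR) mechanism, and its delay can be raised via registry values that Microsoft documents under the GraphicsDrivers key. For example, a .reg file raising the timeout to 60 seconds (a reboot is required, and the value is illustrative):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c
```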
