Is MEX of MATLAB known to be slow on macOS?

Question: Is MEX of MATLAB known to be slow on macOS?
Here, "slow" refers to the speed of setting MEX up, of compiling MEX code, and of running the MEX function.
I have done some timing using GitHub Actions. The hardware specification below is copied from the official documentation of GitHub Actions; detailed information about the CPU (frequency, etc.) and memory is not available.
GNU/Linux (ubuntu 20.04) and Windows (Windows Server 2019):
2-core CPU
7 GB of RAM memory
14 GB of SSD disk space
macOS (macOS Big Sur 11):
3-core CPU
14 GB of RAM memory
14 GB of SSD disk space
Here is the data obtained.
System: GNU/Linux | Language: C | MATLAB: 2021b | Time: 2022.02.26 05:46:36
MEX configured to use 'gcc' for C language compilation.
- Time for setting MEX up: 0.477950 seconds
- Time for mexifying timestwo: 4.500026 seconds
- Time for 100 runs of timestwo: 0.003845 seconds
System: Windows | Language: C | MATLAB: 2021b | Time: 2022.02.26 05:47:56
MEX configured to use 'Microsoft Visual C++ 2019 (C)' for C language compilation.
- Time for setting MEX up: 2.518557 seconds
- Time for mexifying timestwo: 4.416958 seconds
- Time for 100 runs of timestwo: 0.004215 seconds
System: macOS | Language: C | MATLAB: 2021b | Time: 2022.02.26 05:49:01
MEX configured to use 'Xcode with Clang' for C language compilation.
- Time for setting MEX up: 17.602277 seconds
- Time for mexifying timestwo: 5.979585 seconds
- Time for 100 runs of timestwo: 0.130843 seconds
System: GNU/Linux | Language: Fortran | MATLAB: 2021b | Time: 2022.02.26 05:46:20
MEX configured to use 'gfortran' for FORTRAN language compilation.
- Time for setting MEX up: 0.835881 seconds
- Time for mexifying timestwo: 2.768746 seconds
- Time for 100 runs of timestwo: 0.003279 seconds
System: Windows | Language: Fortran | MATLAB: 2021b | Time: 2022.02.26 05:51:04
MEX configured to use 'Intel oneAPI 2021 for Fortran with Microsoft Visual Studio 2019' for FORTRAN language compilation.
- Time for setting MEX up: 1.660305 seconds
- Time for mexifying timestwo: 3.495534 seconds
- Time for 100 runs of timestwo: 0.003299 seconds
System: macOS | Language: Fortran | MATLAB: 2021b | Time: 2022.02.26 05:49:47
MEX configured to use 'Intel Fortran Composer XE' for FORTRAN language compilation.
- Time for setting MEX up: 248.263933 seconds
- Time for mexifying timestwo: 87.093711 seconds
- Time for 100 runs of timestwo: 0.078741 seconds
It turns out that MEX is much slower on macOS than on GNU/Linux or Windows: slow to set up, slow to mexify, and the resulting MEX function is slow to run. In particular, setting MEX up for Fortran is about 300 times slower on macOS than on GNU/Linux. Note, however, that much of this difference probably comes from the virtualized environment of GitHub Actions; on local machines, the difference may not be that dramatic.
If you are interested in the timing, below is the code I used for it. It is also available on GitHub.
function mex_time(language)
% MEX_TIME measures the running time of MATLAB concerning MEX.
orig_warning_state = warning;
warning('off','all');
if nargin == 1 && (isa(language, 'char') || isa(language, 'string')) && strcmpi(language, 'Fortran')
    language = 'Fortran';
    timestwo_src = 'timestwo.F';
else
    language = 'C';
    timestwo_src = 'timestwo.c';
end
if ismac
    sys = 'macOS';
elseif isunix
    sys = 'GNU/Linux';
elseif ispc
    sys = 'Windows';
else
    error('Platform not supported.')
end
matlab_version = version('-release');
date_time = datestr(now, 'yyyy.mm.dd HH:MM:SS');
fprintf('\nSystem: %s | Language: %s | MATLAB: %s | Time: %s\n\n', sys, language, matlab_version, date_time);
tic;
mex('-setup', language);
fprintf('\n- Time for setting MEX up: %f seconds\n\n', toc);
clear('timestwo');
tic;
mex(fullfile(matlabroot, 'extern', 'examples', 'refbook', timestwo_src));
fprintf('\n- Time for mexifying timestwo: %f seconds\n', toc);
tic;
for i = 1 : 100
    timestwo(i);
end
fprintf('\n- Time for 100 runs of timestwo: %f seconds\n\n', toc);
delete('timestwo.*');
warning(orig_warning_state);
You may try
mex_time('C');
mex_time('Fortran');
on your computer and post the results here. Thank you very much.
(See the discussions on MATLAB Answers and GitHub)

Related

jupyter notebook %%time doesn't measure cpu time of %%sh commands?

When I run python code in a jupyter-lab (v3.4.3) ipython notebook (v8.4.0) and use the %%time cell magic, both cpu time and wall time are reported.
%%time
for i in range(10000000):
    a = i*i
CPU times: user 758 ms, sys: 121 µs, total: 758 ms
Wall time: 757 ms
But when the same computation is performed using the %%sh magic to run a shell script, the cpu time results are nonsense.
%%time
%%sh
python -c "for i in range(10000000): a = i*i"
CPU times: user 6.14 ms, sys: 12.5 ms, total: 18.6 ms
Wall time: 920 ms
The docs for %time do say "Time execution of a Python statement or expression.", but this still surprised me because I had assumed that the shell script would run in a Python subprocess and thus could also be measured. So, what's going on here? Is this a bug, or just a known caveat of using %%sh?
I know I can use the shell builtin time or /usr/bin/time to get similar output, but that is a bit cumbersome for multiple lines of shell. Is there a better workaround?
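One possible workaround (my own sketch, not something from the question): run the shell command through subprocess from Python and read the child CPU time that the operating system accumulates for reaped child processes via os.times(). The helper name timed_sh is made up for illustration.

import os
import subprocess
import time

def timed_sh(cmd):
    """Run a shell command; report wall time and the CPU time of child processes."""
    before = os.times()
    t0 = time.perf_counter()
    subprocess.run(cmd, shell=True, check=True)
    wall = time.perf_counter() - t0
    after = os.times()
    # children_user/children_system are only updated after the child is waited on,
    # which subprocess.run has already done at this point.
    print("wall: %.3f s, child user: %.3f s, child sys: %.3f s" % (
        wall,
        after.children_user - before.children_user,
        after.children_system - before.children_system))

timed_sh('python -c "for i in range(10000000): a = i*i"')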

Query a `perf.data` file for the total raw execution time of a symbol

I used perf to generate a perf.data file with perf record ./application. perf report shows me various things about it. How can I show the total time it took to run the application, and the total time it took to run a specific "symbol"/function? perf seems to mostly show percentages, but I want raw time, and I want "inclusive" time, i.e. including children.
perf v4.15.18 on Ubuntu Linux 18.04
perf is a statistical (sampling) profiler (in its default perf record mode), which means it has no exact timestamps of function entry and exit (tracing would be required for exact data). perf asks the OS kernel to generate interrupts thousands of times per second (4 kHz for the hardware PMU if -e cycles is supported, less for the software event -e cpu-clock). Every interrupt of program execution is recorded as a sample containing the EIP (current instruction pointer), the pid (process/thread id), and a timestamp. When the program runs for several seconds, there will be thousands of samples, and perf report can generate histograms from them: which parts of the program code (which functions) were executed more often than others. You get a generic overview that some functions took around 30% of the program's execution time while others took 5%.
perf report does not compute total program execution time (it may estimate it by comparing the timestamps of the first and last samples, but that is not exact if there were off-CPU periods). However, it does estimate the total event count (printed in the first line of the interactive TUI and listed in the text output):
$ perf report |grep approx
# Samples: 1K of event 'cycles'
# Event count (approx.): 844373507
There is a perf report -n option which adds a "number of samples" column next to the percent column.
Samples: 1K of event 'cycles', Event count (approx.): 861416907
Overhead Samples Command Shared Object Symbol
42.36% 576 bc bc [.] _bc_rec_mul
37.49% 510 bc bc [.] _bc_shift_addsub.isra.3
14.90% 202 bc bc [.] _bc_do_sub
0.89% 12 bc bc [.] bc_free_num
But samples are not taken at exactly equal intervals, and they are less exact than the computed overhead (every sample may have a different weight). I recommend running perf stat ./application to get the real total running time and total hardware counts for your application. This works best when your application has a stable running time (use perf stat -r 5 ./application to have the variation estimated by the tool, shown as "+- 0.28%" in the last column).
To include children, function stack traces must be sampled at every interrupt. They are not sampled in the default perf record mode. This sampling is turned on with the -g or --call-graph dwarf options: perf record -g ./application or perf record --call-graph dwarf ./application. It is not simple to use correctly for preinstalled libraries or applications on Linux (most distributions strip debug information from their packages), but it can be used for your own applications compiled with debug information. The default -g, which is the same as --call-graph fp, requires that all code be compiled with the -fno-omit-frame-pointer gcc option; the non-default --call-graph dwarf is more reliable. With a correctly prepared program and libraries, a single-threaded application, and a long enough stack-sample size (8 KB is the default; change it with --call-graph dwarf,65536), perf report should show around 99% for the _start and main functions (including children).
bc calculator compiled with -fno-omit-frame-pointer:
bc-no-omit-frame$ echo '3^123456%3' | perf record -g bc/bc
bc-no-omit-frame$ perf report
Samples: 1K of event 'cycles:uppp', Event count (approx.): 811063902
Children Self Command Shared Object Symbol
+ 98.33% 0.00% bc [unknown] [.] 0x771e258d4c544155
+ 98.33% 0.00% bc libc-2.27.so [.] __libc_start_main
+ 98.33% 0.00% bc bc [.] main
bc calculator with dwarf call graph:
$ echo '3^123456%3' | perf record --call-graph dwarf bc/bc
$ perf report
Samples: 1K of event 'cycles:uppp', Event count (approx.): 898828479
Children Self Command Shared Object Symbol
+ 98.42% 0.00% bc bc [.] _start
+ 98.42% 0.00% bc libc-2.27.so [.] __libc_start_main
+ 98.42% 0.00% bc bc [.] main
bc without debug info gets incorrect call-graph handling by perf in -g (fp) mode (no 99% for main):
$ cp bc/bc bc.strip
$ strip -d bc.strip
$ echo '3^123456%3' | perf record --call-graph fp ./bc.strip
Samples: 1K of event 'cycles:uppp', Event count (approx.): 841993392
Children Self Command Shared Object Symbol
+ 43.94% 43.94% bc.strip bc.strip [.] _bc_rec_mul
+ 39.73% 39.73% bc.strip bc.strip [.] _bc_shift_addsub.isra.3
+ 11.27% 11.27% bc.strip bc.strip [.] _bc_do_sub
+ 0.92% 0.92% bc.strip libc-2.27.so [.] malloc
Sometimes perf report --no-children can be useful to disable sorting on self+children overhead (it will then sort by "self" overhead), for example when the call graph was not fully captured.

theano: simple neural network with 6 layers trains with only 132 datapoints per second. Why is it so slow?

I have a quite small neural network with two fully connected sigmoid subnetworks 10->100->10, whose output is concatenated and then fed into another 20->100->1 network.
This architecture is quite small, having just a few weights matrices that have maximum dimension 20x100=2000 weights.
Even though I am using Theano with all flags set to use GPU acceleration, the system reaches only 132 iterations (datapoints!) per second. I am not using minibatches because it's not your typical neural network, but a matrix factorization model.
The system I am using has the following specs:
OS
uname -a
Linux node081 2.6.32-358.18.1.el6.x86_64 #1 SMP Wed Aug 28 17:19:38 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
software
Python 3.5.2 (compiled by myself)
theano 0.9.0dev2.dev-ee4c4e21b9e9037f2aa9626c3d779382840ea2e3
NumPy 1.11.2
cpu
Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
nproc returns 12, so the machine presumably has 12 cores (or hardware threads).
gpu
this is the output of nvidia-smi (with my process running):
+------------------------------------------------------+
| NVIDIA-SMI 5.319.49 Driver Version: 319.49 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m On | 0000:03:00.0 Off | Off |
| N/A 29C P0 49W / 225W | 92MB / 5119MB | 19% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 5903 python3 77MB |
+-----------------------------------------------------------------------------+
command and env vars
OMP_NUM_THREADS=8 THEANO_FLAGS=mode=FAST_RUN,device=gpu,init_gpu_device=gpu,floatX=float32,nvcc.flags=-D_FORCE_INLINES,print_active_device=True,enable_initial_driver_test=True,warn_float64=raise,force_device=True,assert_no_cpu_op=raise python3 $@
theano config settings set at run-time
theano.config.optimizer='fast_run'
theano.config.openmp=True
theano.config.openmp_elemwise_minsize=4
theano.config.floatX='float32'
theano.config.assert_no_cpu_op='raise'
I also tried deactivating OpenMP, and it runs slightly slower.
It seems that I took all the precautions to make sure GPU acceleration is correctly set up. What might be the reason for getting only 132 gradient updates per second? Is there any further check I need to perform?
In Theano:
Compilation is much faster with optimizer=fast_compile than with optimizer=fast_run.
Using the new backend with the new optimizer, compilation time has been sped up by about 5X on certain networks. I'd suggest you always stick with the new backend. You can use the new backend via the device=cuda flag.
While you're using the new backend, I'd advise you to use the bleeding-edge version if you want speed. A lot of optimizations are done every week that have the potential to give great speedups.
From the docs, you can set the flag allow_gc=False to get faster speed.
You can set the config.nvcc.fastmath flag to True if you want some speedup from division and multiplication operations at the cost of precision.
If you have convolutional operations in your network, you can set some config.dnn flags depending on your network and needs. Also, setting the cnmem flag will help.
Finally, whenever you report code as slow, please share profiling results to help the development :)
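As a rough illustration of how the flags above can be combined, here is a minimal sketch (my own, not verbatim from the answer) that sets THEANO_FLAGS through the environment before Theano is imported. The particular values, and the use of gpuarray.preallocate as the new-backend counterpart of the old cnmem flag, are assumptions to be tuned for your own network.

import os

# THEANO_FLAGS must be set before Theano is imported.
os.environ["THEANO_FLAGS"] = ",".join([
    "device=cuda",               # new libgpuarray backend (instead of device=gpu)
    "floatX=float32",
    "optimizer=fast_run",
    "allow_gc=False",            # skip intermediate garbage collection for speed
    "nvcc.fastmath=True",        # faster but less precise multiplication/division
    "gpuarray.preallocate=0.8",  # preallocate 80% of GPU memory (new-backend analogue of cnmem)
])

import theano  # import only after the flags are in place
print(theano.config.device, theano.config.floatX)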

How to make Theano use the integrated GPU in MacBook Air?

I tried running the test program for GPU usage:
from theano import function, config, shared, tensor, sandbox
import numpy
import time
vlen = 10 * 30 * 768  # 10 x #cores x #threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
It only shows this (even after installing libgpuarray):
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.723539 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
I would like to know how to utilise the integrated GPU of MacBook Air (Early 2014).
My processor has Intel HD Graphics 5000, which is not NVIDIA and hence not CUDA compatible. Many links suggest using OpenCL, which is also supposed to come pre-installed with OS X, but I can't make head or tail of the links on the web.
I could not find much help on how to set up Theano for this in the docs either.
I just need to make Theano use my Mac's integrated GPU. Is this possible? If so, how? What are its prerequisites?
THEANO_FLAGS=device=opencl0:1 python ~/check_GPU.py
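If that flag seems to have no effect, it may help to first confirm that libgpuarray (the pygpu package) can open an OpenCL device at all. The snippet below is a rough sketch under that assumption; the device strings ('opencl<platform>:<device>') are guesses and the exact indices depend on your machine.

import pygpu

for name in ("opencl0:0", "opencl0:1"):
    try:
        ctx = pygpu.init(name)  # opens a context on that platform/device if it exists
        print("%s -> OK (%s)" % (name, getattr(ctx, "devname", "device opened")))
    except Exception as exc:    # the platform/device index may simply not exist
        print("%s -> %s" % (name, exc))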

Issue with NVIDIA GPU for matrix operations [duplicate]

I have this little nonsense script here which I am executing in MATLAB R2013b:
clear all;
n = 2000;
times = 50;
i = 0;
tCPU = tic;
disp 'CPU::'
A = rand(n, n);
B = rand(n, n);
disp '::Go'
for i = 0:times
    CPU = A * B;
end
tCPU = toc(tCPU);
tGPU = tic;
disp 'GPU::'
A = gpuArray(A);
B = gpuArray(B);
disp '::Go'
for i = 0:times
    GPU = A * B;
end
tGPU = toc(tGPU);
fprintf('On CPU: %.2f sec\nOn GPU: %.2f sec\n', tCPU, tGPU);
Unfortunately, after execution I receive a message from Windows saying: "Display driver stopped working and has recovered.",
which I assume means that Windows did not get a response from my graphics card's driver or something. The script returned without errors:
>> test
CPU::
::Go
GPU::
::Go
On CPU: 11.01 sec
On GPU: 2.97 sec
But whether or not the GPU runs out of memory, MATLAB is not able to use the GPU device until I restart it. If I don't restart MATLAB, I just receive a message from CUDA:
>> test
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
Warning: An unexpected error occurred during CUDA
execution. The CUDA error was:
CUDA_ERROR_LAUNCH_TIMEOUT
> In test at 1
CPU::
::Go
GPU::
Error using gpuArray
An unexpected error occurred during CUDA execution.
The CUDA error was:
the launch timed out and was terminated
Error in test (line 21)
A = gpuArray(A);
Does anybody know how to avoid this issue or what I am doing wrong here?
If needed, my GPU Device:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'GeForce GTX 660M'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 6
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
FreeMemory: 1.9037e+09
MultiprocessorCount: 2
ClockRateKHz: 950000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
The key piece of information is this part of the gpuDevice output:
KernelExecutionTimeout: 1
This means that the host display driver is active on the GPU you are running the compute jobs on. The NVIDIA display driver contains a watchdog timer which kills any task that takes more than a predefined amount of time without yielding control back to the driver for screen refresh. This is intended to prevent the situation where a long-running or stuck compute job renders the machine unresponsive by freezing the display. The runtime of your MATLAB script is clearly exceeding the display driver's watchdog timer limit. Once that happens, the compute context held on the device is destroyed and MATLAB can no longer operate with the device. You might be able to reinitialise the context by calling reset, which I guess runs cudaDeviceReset() under the hood.
There is a lot of information about this watchdog timer on the interweb, for example this Stack Overflow question. The solution for modifying the timeout depends on your OS and hardware. The simplest way to avoid it is to not run CUDA code on a display GPU, or to increase the granularity of your compute jobs so that no single operation has a runtime which exceeds the timeout limit. Or just write faster code...
