OpenCL returning NaN after changing device - macOS

I'm trying to run a simple OpenCL program that adds two vectors held in separate buffers and stores the result in a third buffer. I'm running it on a MacBook Pro with a discrete GPU. The AMD GPU is (as reported via clGetDeviceInfo) second in the device list, with the integrated GPU first. The program works correctly with the integrated GPU, but when I modify the command queue initialisation to the following:
cl_command_queue command_queue = clCreateCommandQueue(context, device_list[1], 0, &clStatus);
It returns NaN values in the output buffer. If I use device_list[0], it works. The command queue initialisation is the only thing I changed. So how do I guarantee that I use the discrete GPU without this issue?
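Since only the queue was changed, one thing worth ruling out is that the context (and therefore the buffers and the built program) still refers only to device_list[0] while the queue targets device_list[1]. Below is a minimal sketch of keeping everything tied to the same chosen device; it assumes, as in the question, that index 1 is the discrete AMD GPU, and it is not the asker's actual code:

#include <OpenCL/opencl.h>   /* <CL/cl.h> on non-Apple platforms */
#include <stdio.h>

int main(void) {
    cl_int clStatus;
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device_list[2];
    cl_uint num_devices;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, device_list, &num_devices);

    cl_device_id device = device_list[1];   /* assumed: discrete AMD GPU */

    /* The context must include the device the queue is created for; buffers
       and programs created from it are then valid on that device. */
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &clStatus);
    cl_command_queue command_queue = clCreateCommandQueue(context, device, 0, &clStatus);
    if (clStatus != CL_SUCCESS) {
        fprintf(stderr, "clCreateCommandQueue failed: %d\n", clStatus);
        return 1;
    }

    /* ... clCreateBuffer with this context, clBuildProgram(program, 1, &device, ...),
       then clSetKernelArg / clEnqueueNDRangeKernel on this queue ... */

    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);
    return 0;
}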

Related

How do I reliably query SIMD group size for Metal Compute Shaders? threadExecutionWidth doesn't always match

I'm trying to use the SIMD group reduction/prefix functions in a series of reasonably complex compute kernels in a Mac app. I need to allocate some threadgroup memory for coordinating between SIMD groups in the same threadgroup. This array should therefore have a capacity that depends on [[simdgroups_per_threadgroup]], but that's not a compile-time value, so it can't be used as an array dimension.
Now, according to various WWDC session videos, threadExecutionWidth on the pipeline object should return the SIMD group size, with which I could then allocate an appropriate amount of memory using setThreadgroupMemoryLength:atIndex: on the compute encoder.
This works consistently on some hardware (e.g. Apple M1, where threadExecutionWidth always seems to report 32), but I'm hitting configurations where threadExecutionWidth does not match the apparent SIMD group size, causing runtime errors due to out-of-bounds access (e.g. on Intel UHD Graphics 630, threadExecutionWidth = 16 for some complex kernels, although the SIMD group size seems to be 32).
So:
Is there a reliable way to query SIMD group size for a compute kernel before it runs?
Alternately, will the SIMD group size always be the same for all kernels on a device?
If the latter is at least true, can I trust threadExecutionWidth for the most trivial of kernels? Or should I submit a trivial kernel to the GPU which returns [[threads_per_simdgroup]]? (A sketch of such a kernel is included after the next paragraph.)
I suspect the problem might occur in kernels where Metal offers an "odd" (non-power-of-2) maximum threadgroup size, although in the case I'm encountering, the maximum threadgroup size is reported as 896, which is an integer multiple of 32, so it's not as if threadExecutionWidth is just the greatest common divisor of the maximum threadgroup size and the SIMD group size.
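For reference, the trivial query kernel mentioned above is short to write. This is only a sketch in Metal Shading Language, with an arbitrary function name and buffer index; dispatch a single threadgroup and read the one uint back on the CPU:

#include <metal_stdlib>
using namespace metal;

// Writes the runtime SIMD group width into a one-element buffer.
kernel void query_simd_width(device uint &out   [[buffer(0)]],
                             uint simd_width    [[threads_per_simdgroup]],
                             uint tid           [[thread_position_in_grid]])
{
    if (tid == 0) {
        out = simd_width;
    }
}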

CUDA OOM on Slurm but not locally, even if Slurm has more GPUs

I am working on a Slurm-based cluster. I debug my code on the login node, which has 2 GPUs.
I can run it fine using model = nn.DataParallel(model), but my Slurm jobs crash because of
RuntimeError: CUDA out of memory. Tried to allocate 246.00 MiB (GPU 0; 15.78 GiB total capacity; 2.99 GiB already allocated; 97.00 MiB free; 3.02 GiB reserved in total by PyTorch)
I submit Slurm jobs using submitit.SlurmExecutor with the following parameters:
executor.update_parameters(
    time=1000,
    nodes=1,
    ntasks_per_node=4,
    num_gpus=4,
    job_name='test',
    mem="256GB",
    cpus_per_task=8,
)
I am even requesting more GPUs (4 vs 2), yet it still crashes.
I am checking that all 4 GPUs are visible to the job, and they are.
The weird thing is that if I reduce the network size:
With nn.DataParallel, I still get CUDA OOM.
Without it, everything works and the jobs do not crash. But I need to use a bigger model, so this is not a solution.
Why? Is it due to nn.DataParallel?
EDIT
My model has an LSTM inside, and I noticed that I get the following warning:
/private/home/xxx/.local/lib/python3.8/site-packages/torch/nn/modules/rnn.py:679: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:924.)
result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
After a Google search, it seems I have to call flatten_parameters() before calling the LSTM, but I cannot find a definitive answer about this (e.g. where exactly should I call it? see the placement sketch after the error below). Also, after adding flatten_parameters() the code still works locally, but the Slurm jobs now crash with
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
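For what it's worth, the placement usually suggested for flatten_parameters() is at the top of forward(), immediately before invoking the LSTM. The sketch below uses a hypothetical model, not the asker's, and is not claimed to fix the OOM here; it only shows where the call typically goes:

import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    """Hypothetical model; only illustrates where flatten_parameters() is usually called."""
    def __init__(self, input_size=128, hidden_size=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # Re-compact the RNN weights into one contiguous chunk, as the
        # UserWarning suggests; DataParallel replicas can otherwise end up
        # with non-contiguous weight storage.
        self.lstm.flatten_parameters()
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

model = nn.DataParallel(LSTMModel()).cuda()
y = model(torch.randn(8, 16, 128).cuda())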

STM32F411: I need to send a lot of data over USB at high speed

I'm using an STM32F411 with the USB CDC library, and the maximum speed for this library is ~1Mb/s.
I'm creating a project where I have 8 microphones connected to the ADC lines (this part works fine). I need a 16-bit signal, so I increase the accuracy by summing the first 16 samples from one line (the ADC gives only a 12-bit signal). In my project I need 96k 16-bit samples per line, so that's 0.768M samples for all 8 lines. This data needs 12000Kb of space, but the STM32 has only 128Kb of SRAM, so I decided to send about 120 packets of 100Kb each per second.
The conclusion is that I need ~11.72Mb/s to send this.
The problem is that I'm unable to do that, because CDC USB limits me to ~1Mb/s.
The question is how to increase the USB speed to 12Mb/s on the STM32F4. I need a pointer or a library.
Or maybe I should set up an "audio device" in CubeMX?
If the small "b" in your question means bytes, the answer is: it is not possible, as your micro has a full-speed (FS) USB peripheral whose maximum speed is 12Mbit/s.
If it means bits, your ~1Mb/s (bit) speed assumption is wrong, but you will still not reach a 12Mbit/s payload transfer rate.
You may try to write your own USB class (only if "b" means bits), but I'm afraid you will not find a ready-made library. You would also need to write the device driver for the host computer.
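For reference, the required throughput from the figures in the question works out as below (plain C, arithmetic only); it confirms that the payload (~12.3 Mbit/s, i.e. the ~11.72 "Mb/s" computed in the question with 1K = 1024) sits at or above the 12 Mbit/s full-speed bus limit, before any protocol overhead:

#include <stdio.h>

int main(void) {
    const double samples_per_second_per_line = 96000.0;  /* 16-bit samples, one line */
    const double lines = 8.0;
    const double bits_per_sample = 16.0;

    double required_bps = samples_per_second_per_line * lines * bits_per_sample;
    printf("required payload: %.3f Mbit/s\n", required_bps / 1.0e6);   /* ~12.288 */
    printf("USB full-speed bus maximum: 12 Mbit/s (before protocol overhead)\n");
    return 0;
}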

Why does my CUDA kernel execution time increase with successive launches?

I'm prototyping an application with CUDA. I've been benchmarking it against the CPU and noticed some variable runtimes. I decided to run my application in a loop from the command line so I could gather some better statistics. I ran the application 50 times and recorded the results. I was very surprised to see that the elapsed kernel time was increasing as a function of launch number.
Here is a snippet so you can see the part of the code that is being timed:
int nblocks = (int)ceil((float)n / (float)NUM_THREADS);
gpuErrchk(cudaEventRecord(start, 0));
gpuperfkernel<<<nblocks, NUM_THREADS>>>(dmetadata, ddatax, ddatay);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaEventRecord(stop, 0));
gpuErrchk(cudaEventSynchronize(stop));
gpuErrchk(cudaEventElapsedTime(&milliseconds, start, stop));
printf("GPU kernel took %f milliseconds.\n", milliseconds);
gpuelapsed += milliseconds;
I've worked with CUDA quite a bit and I haven't seen this behavior before. Has anyone else noticed this? My platform is Windows 10, CUDA 7.5, an MSI notebook with a GeForce GTX 970M.
Since I'm on a laptop, I was thinking it might be a power-related setting or something like that, but I have everything set to high performance and have disabled the screen saver.
The GeForce 970M has boost clocks. Run after run, the temperature of your GPU rises, and the boost is less likely to stay at its top level as the temperature increases.
You can monitor the GPU temperature with nvidia-smi. There is also a monitoring API (NVML). Your boost settings should also be configurable in nvidia-smi to some extent, should you want to verify this.
To disable auto boost via nvidia-smi, use this command:
sudo nvidia-smi --auto-boost-default=DISABLED
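If you want to correlate the slowdown with clocks and temperature from inside the benchmarking loop rather than watching nvidia-smi, the monitoring API mentioned above is NVML. A minimal sketch (link against the NVML library; error handling mostly omitted):

#include <nvml.h>
#include <stdio.h>

int main(void) {
    if (nvmlInit() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);                 /* GPU 0: the 970M here */

    unsigned int temp = 0, sm_clock = 0;
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clock);
    printf("GPU temp = %u C, SM clock = %u MHz\n", temp, sm_clock);

    nvmlShutdown();
    return 0;
}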

Strange behaviour using mmap

I'm using the Angstrom embedded Linux kernel v2.6.37, based on the Technexion distribution.
DM3730 SoC, TDM3730 module, custom baseboard.
CodeSourcery toolchain v. 2010-09.50
Here is the dataflow in my system:
http://i.stack.imgur.com/kPhKw.png
The FPGA generates incrementing data, and the kernel reads it via GPMC DMA. The GPMC pack size is 512 data samples; the buffer size is 61440 32-bit samples (= 60 RAM pages).
The DMA buffer is allocated with dma_alloc_coherent() and mapped to userspace via an mmap() call. The user application reads data directly from the DMA buffer and saves it to NAND using fwrite(). The user reads 4096 samples at a time.
And what do I see in my file? http://i.stack.imgur.com/etzo0.png
The red line marks the first border of the ring buffer. Oops! Small packs (~16 samples) of stale data start to show up after the border: their values are exactly the "old" values of the corresponding buffer positions. But WHY? 16 samples is much smaller than both the DMA pack size and the user read pack size, so there cannot be a pointer mismatch.
I guess some mmap() behaviour is hiding somewhere. I have tried different flags for mmap(), such as MAP_LOCKED, MAP_POPULATE and MAP_NONBLOCK, with no success. I completely fail to understand this behaviour :(
P.S. When I use copy_to_user() from the kernel instead of mmap() and zero-copy access, there is no such behaviour.
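For context, a driver's mmap handler for a dma_alloc_coherent() buffer is often written along the lines of the sketch below. This is hypothetical code, not the driver from the question; the point is that the page protection chosen here decides whether userspace gets a cached view of the buffer, which is exactly the kind of difference from the copy_to_user() path that is worth double-checking:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/types.h>
#include <linux/dma-mapping.h>

/* Hypothetical driver state, filled in at probe time. */
static void *dma_virt;            /* CPU address from dma_alloc_coherent()  */
static dma_addr_t dma_handle;     /* bus/physical address returned with it  */
#define DMA_BUF_SIZE (61440 * sizeof(u32))   /* 60 pages, as in the question */

static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > PAGE_ALIGN(DMA_BUF_SIZE))
        return -EINVAL;

    /* Map the buffer uncached so userspace reads are not served from stale
     * cache lines; dma_mmap_coherent() is the cleaner helper where available. */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

    return remap_pfn_range(vma, vma->vm_start,
                           dma_handle >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}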
