Atomic operations in CUDA supported by specific GPU - visual-studio-2010

I have written a program in CUDA which executes on a GPU (NVIDIA GeForce 310M). In the kernel I've used the atomicMin function. After compiling and running I get an error: "Kernel execution failed : <8> invalid device function". I think that may be because my card does not support atomic operations. Am I right, or is there something else to consider? By the way, to use atomic operations I've read that I need to change a setting in Visual Studio: Project properties -> CUDA C/C++ -> Device -> Code Generation -> compute_13,sm_13. Thanks.

Probably your GPU does not match the compute architecture (sm_13) you are compiling for.
The description of error code 8 in driver_types.h is as follows:
/**
* The requested device function does not exist or is not compiled for the
* proper device architecture.
*/
cudaErrorInvalidDeviceFunction = 8,
A typical reason for this is that the compiled binary architecture does not match the device architecture. You don't mention which GPU you are using, but I'm guessing it's not a sm_13 device.
You can determine which GPU device you have, and its compute architecture and capabilities, by running the CUDA deviceQuery sample code.
More specifics about the compute architecture required for various atomic operations can be found in the documentation. Note that some atomic functions are available as early as the sm_11 (compute 1.1) architecture, including some versions of the atomicMin function.
EDIT: based on the fact that you're now indicating your GPU is a GeForce 310M, this is not a compute 1.3 capable device, so specifying sm_13 won't work. The GeForce 310M is a compute 1.2 device; if you specify that architecture (sm_12), you should be able to run code that has been compiled that way.
Regarding atomics, compute 1.2 devices do support certain atomic operations, including certain versions of atomicMin. Since you haven't shown your code, I can't say anything beyond that.
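For reference, here is a minimal sketch of the kind of kernel that compiles and runs on a compute 1.2 device (the kernel and variable names are illustrative, not taken from your code); build it with nvcc -arch=sm_12, or set compute_12,sm_12 in the Visual Studio code-generation field:

__global__ void findMin(const int *data, int n, int *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMin(result, data[i]);   // 32-bit atomicMin on global memory: available from compute 1.1
}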

CUDA devices with compute capability 1.3 support atomic operations. Try compiling your code with the following flag
-arch sm_13

Related

What flags does the NVIDIA OpenCL compiler support beyond the OpenCL standard ones?

The CUDA 3.0 toolkit documentation listed several flags NVIDIA's OpenCL compiler accepts, as an extension beyond what the OpenCL standard mandates:
-cl-nv-maxrregcount <N> : Maximum number of registers a kernel (or device function?) may use; passed on to ptxas as --maxrregcount.
-cl-nv-opt-level <N> : Code optimization level.
-cl-nv-verbose : Enable verbose mode.
But I know there are others. For example, the clcc project mentions cl-nv-arch and cl-nv-cstd (which actually regards the OpenCL C version targeted). I vaguely recall one flag in particular which turns off support for grids/block sizes beyond CUDA's natively-supported grid and block sizes. How can I determine those extra flags, in a recent NVIDIA OpenCL runtime version?
This one is due to @Tim, in this answer:
-nv-line-info : Generates information about the locations in the source files corresponding to PTX instructions; more or less the same as the NVCC command-line option --generate-line-info.
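For what it's worth, these vendor flags are passed through the ordinary build-options string. A minimal sketch (the function name is mine; it assumes the cl_program and cl_device_id have already been created):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Build `program` for `device` with an NVIDIA-specific option and dump the
   build log, which with -cl-nv-verbose also contains ptxas resource usage. */
void build_with_nv_verbose(cl_program program, cl_device_id device)
{
    cl_int err = clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);

    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char *)malloc(log_size + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    log[log_size] = '\0';
    printf("build status %d, log:\n%s\n", (int)err, log);
    free(log);
}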

How to change endianess settings in cortex m3?

I found two statements in the Cortex-M3 guide (the red book):
1. Cortex-M3 supports both little and big endianness.
2. After reset, endianness cannot be changed dynamically.
So indirectly it is saying to change the endianness setting in the reset handler. Is that so?
If yes, then how do I change endianness? That is, which register do I need to configure, and where (in the reset handler or in an exception handler)?
I know it is not actually a good idea to change endianness, but still, out of curiosity, I wanted to see whether the Cortex-M3 really supports both endiannesses.
The Cortex-M architecture can be configured to support either big-endian or little-endian operation.
However, a specific Cortex-M implementation can only support one endianness -- it's hard-wired into the silicon, and cannot be changed. Every implementation I'm aware of has chosen little-endian.
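There is no register you can write to switch it; at most you can read back how the silicon was configured. Per the ARMv7-M Architecture Reference Manual, the ENDIANNESS bit (bit 15) of AIRCR at 0xE000ED0C reflects the setting latched from BIGEND at reset, so a run-time check is just a register read. A minimal bare-metal sketch (my own, not tied to any vendor's header files):

#include <stdint.h>

/* AIRCR: Application Interrupt and Reset Control Register (ARMv7-M, 0xE000ED0C) */
#define AIRCR (*(volatile uint32_t *)0xE000ED0CUL)

/* Returns 1 if the core was configured (via the BIGEND pin at reset) as big-endian. */
static inline int core_is_big_endian(void)
{
    return (int)((AIRCR >> 15) & 1u);  /* ENDIANNESS bit: 0 = little-endian, 1 = BE-8 big-endian */
}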
You need to be reading the ARM documentation directly; the Technical Reference Manual touches on things like this. If you actually had the source to the Cortex-M3 when building it into a chip, you would see the outer layers and/or config options that you can touch.
From the Cortex-M3 TRM:
SETEND always faults. A configuration pin selects Cortex-M3
endianness.
And then we have it in another hit:
The processor contains a configuration pin, BIGEND, that enables you
to select either the little-endian or BE-8 big-endian format. This
configuration pin is sampled on reset. You cannot change endianness
when out of reset.
Technically it would be possible to build a chip where you could choose: it could be designed with an external strap connected to BIGEND, or with a fuse or some other non-volatile setting that you can change before popping reset on the ARM core, or with another processor or some logic that manages booting of the ARM core, which you talk to or program before releasing the ARM core from reset.
In general it is a bad idea to go against the grain on the default endianness for an architecture, ARM in particular, now that there are two flavors and the later one (BE-8) is more painful than the older one (BE-32). Granted, there are other toolchains than gcc, but even with those, the vast majority of users, and therefore the vast majority of indirect testing, is in the native (little-endian) mode. I would even wonder how thoroughly the big-endian logic is tested. Does anyone outside ARM's design verification actually push that mode hard?
Have you tried actually building big-endian Cortex-M3 code? Since the Cortex-M uses a 16-bit instruction set (with Thumb-2 extensions), how does that interact with BE-8? With BE-8 on a full-sized ARM running ARM instructions, the 32-bit data swaps but the 32-bit instructions do not. Perhaps this is in the TRM and I should read more, but does it work the same way on the Cortex-M: the 16-bit instructions do not swap but the data does? What about on a full-sized ARM running Thumb instructions? And does the toolchain match what the hardware expects?
BTW, what that implies is that there is a signal named BIGEND in the logic that you interface with when you are building a chip around the Cortex-M3. You can go into that logic and change the default setting for BIGEND (I assume they have provided one), or, as mentioned above, you can add logic in your chip to make it a run-time option rather than a compile-time one.

Can I run CUDA or OpenCL on Intel Iris?

I have a MacBook Pro (mid 2014) with Intel Iris graphics, an Intel Core i5 processor, and 16 GB of RAM. I am planning to learn some ray-traced 3D, but I am not sure whether my laptop can render fast without any NVIDIA hardware.
So I would appreciate it if someone could tell me whether I can use CUDA; if not, could you please show me, in a very easy way, how to enable OpenCL in After Effects? I am also looking for any beginner tutorial on how to build with OpenCL.
CUDA works only on NVIDIA hardware, but there are some tools that convert it to run on CPU cores (not the iGPU).
AMD is working on "hipify"-ing old CUDA kernels, translating them into HIP or similar portable code so they become more general.
OpenCL works everywhere as long as both the hardware and the OS support it. AMD, NVIDIA, Intel, Xilinx, Altera, Qualcomm, MediaTek, Marvell, Texas Instruments, and others support it. Maybe even some future Raspberry Pi will.
The OpenCL documentation on stackoverflow.com is still under development, but there are some useful sites:
Amd's tutorial
Amd's parallel programming guide for opencl
Nvidia's learning material
Intel's HD graphics coding tutorial
Some overview of hardware, benchmark and parallel programming subjects
blog
Scratch-a-pixel-raytracing-tutorial (I read it then wrote its teraflops gpu version)
If it is Iris Graphics 6100:
Your integrated GPU has 48 execution units, each with 8 ALUs that can do add, multiply, and many other operations. Its clock frequency can reach 1 GHz. This means a maximum of 48 * 8 * 2 (1 add + 1 multiply) * 1 GHz = 768 Gflops, but only if each ALU can do one addition and one multiplication concurrently. 768 Gflops is more than a low-end discrete GPU such as AMD's R7 240. (As of 19.10.2017, AMD's low end is the RX 550 at about 1200 GFlops, faster than Intel's Iris Plus 650 at nearly 900 GFlops.) Ray tracing re-accesses a lot of geometry data, so ideally the device has its own dedicated memory (as discrete NVIDIA or AMD cards do), letting the CPU do its own work.
How you install OpenCL varies by OS and hardware, but building software on a machine with OpenCL installed always follows a similar structure (a minimal host-code sketch follows the list below):
Query platforms. The result can be AMD, Intel, Nvidia, duplicates of these (because of overlapping installations of wrong drivers), or experimental platforms predating newer OpenCL version support.
Query the devices of a platform (or of all platforms). This gives the individual devices (and their duplicates if there are driver errors or other things to fix).
Create a context (or multiple contexts) using a platform.
Using a context (so everything within it gets implicit synchronization):
Build programs from kernel source strings. Building usually takes less time on a CPU than on a GPU (and there is a binary-load option to shortcut this).
Build kernels (as objects now) from the programs.
Create buffers from host-side buffers or as OpenCL-managed buffers.
Create a command queue (or multiple queues).
Just before computing (or a series of computations):
Select the buffers for a kernel as its arguments.
Enqueue buffer write (or map/unmap) operations on the "input" buffers.
Compute:
Enqueue an ND-range kernel (specifying which kernel runs and with how many threads).
Enqueue buffer read (or map/unmap) operations on the "output" buffers.
Don't forget to synchronize with the host using clFinish() if you haven't used a blocking buffer read.
Use your accelerated data.
After OpenCL is no longer needed:
Be sure all command queues are idle / finished doing their kernel work.
Release everything in the opposite order of creation.
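Putting those steps together, here is a minimal sketch of an OpenCL host program in C (the kernel, buffer sizes, and names are placeholders of my own; most error checking is omitted for brevity):

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <stdio.h>

static const char *src =
    "__kernel void scale(__global float *a) {"
    "    int i = get_global_id(0);"
    "    a[i] = a[i] * 2.0f;"
    "}";

int main(void)
{
    float data[1024];
    for (int i = 0; i < 1024; ++i) data[i] = (float)i;

    cl_platform_id platform;                      /* 1. query a platform        */
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device;                          /* 2. query a device          */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx =                              /* 3. create a context        */
        clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q =                          /* 4. create a command queue  */
        clCreateCommandQueue(ctx, device, 0, NULL);

    cl_program prog =                             /* 5. build program + kernel  */
        clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,   /* 6. a buffer        */
                                sizeof(data), NULL, NULL);

    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

    size_t global = 1024;                         /* 7. run 1024 work-items     */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
    clFinish(q);

    printf("data[3] = %f\n", data[3]);            /* expect 6.0                 */

    /* release in reverse order of creation */
    clReleaseMemObject(buf);
    clReleaseKernel(k);
    clReleaseProgram(prog);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}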
If you need to accelerate open-source software, you can replace a parallelizable hotspot loop with a simple OpenCL kernel, provided it doesn't already have acceleration support. For example, you could accelerate the air-pressure and heat-advection part of the Powder Toy sandbox simulator.
Yes, you can use OpenCL, because macOS supports it natively.
From your question it appears you are not seeking advice on programming, which would have been the appropriate subject for Stack Overflow. The first search hit on Google explains how to turn on OpenCL accelerated effects in After Effects (Project Settings dialog -> Video Rendering and Effects), but I have no experience with that myself.

What makes OpenCV so large on Windows? Anything I can do about it?

The OpenCV x64 distribution (through Emgu CV) for Windows has almost half a gigabyte of DLLs, including a single 224 MB opencv_gpu.dll. It seems unlikely that any human could have produced that amount of code, so what gives? Large embedded resources? Code-generation bloat (this doesn't seem likely given that it's a native C/C++ project)?
I want to use it for face recognition, but it's a problem to have such a large binary dependency in git, and it's a hassle to manage outside of source control.
[Update]
There are no embedded resources (at least not the kind Windows DLLs usually have; but since this is a cross-platform product, I'm not sure that's significant). Maybe lots of initialized C table structures used to perform matrix operations?
The size of opencv_gpu is the result of numerous template instantiations compiled for several CUDA architecture versions.
For example for convolution:
7 data types (from CV_8U to CV_64F)
~30 hardcoded convolution kernel sizes
8 CUDA architectures (bin: 1.1 1.2 1.3 2.0 2.1(2.0) 3.0 + ptx: 2.0 3.0)
This produces about 1700 variants of convolution.
This way opencv_gpu can grow up to 1 GB for the latest OpenCV release.
If you are not going to use any CUDA acceleration, then you can safely drop opencv_gpu.dll.
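To illustrate how the count multiplies (an illustrative sketch of my own, not OpenCV's actual source): a templated CUDA kernel gets one compiled function body per (data type, kernel size) pair that the library instantiates, and nvcc then compiles every one of them again for each architecture listed on the command line.

// One templated kernel, many compiled bodies.
template <typename T, int KSIZE>
__global__ void boxFilterRow(const T *src, T *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
    for (int k = 0; k < KSIZE; ++k)
        sum += (float)src[min(i + k, n - 1)];
    dst[i] = (T)(sum / KSIZE);
}

// Each launcher forces one instantiation; with ~7 types x ~30 sizes that is
// ~210 bodies, repeated again for every -gencode architecture.
void launch_u8_k3(const unsigned char *s, unsigned char *d, int n)
{ boxFilterRow<unsigned char, 3><<<(n + 255) / 256, 256>>>(s, d, n); }
void launch_f32_k3(const float *s, float *d, int n)
{ boxFilterRow<float, 3><<<(n + 255) / 256, 256>>>(s, d, n); }
void launch_f32_k31(const float *s, float *d, int n)
{ boxFilterRow<float, 31><<<(n + 255) / 256, 256>>>(s, d, n); }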

CUDA fallback to CPU?

I have a CUDA application that works fine on one computer (with a GTX 275) and runs about 100 times slower on another with a GeForce 8400. My suspicion is that there is some kind of fallback that makes the code actually run on the CPU rather than on the GPU.
Is there a way to actually make sure that the code is running on the GPU?
Is this fallback documented somewhere?
What conditions may trigger it?
EDIT: The code is compiled with compute capability 1.1, which is what the 8400 has.
Couldn't it just be that the gap in performance really is that large? This link indicates that the 8400 operates at 22-62 GFlops, and this link indicates that the GTX 275 operates at 1010.88 GFlops.
There are a number of possible reasons for this.
Presumably you're not using the emulation device. Can you run the device query sample from the SDK? That will show if you have the toolkit and driver installed correctly.
You can also query the device properties from within your app to check what device you are attached to.
The 8400 is much lower performance than the GTX275, so it could be real, but see the next point.
One of the major changes in going from compute capability 1.1 to 1.2 and beyond is the way memory accesses are handled. In 1.1 you have to be very careful not only to coalesce your memory accesses but also to make sure that each half-warp is aligned; otherwise each thread will issue its own 32-byte transaction. In 1.2 and beyond, alignment is not such an issue, as access degrades gracefully to minimise transactions (see the illustrative kernel after this answer).
This, combined with the lower performance of the 8400, could also account for what you are seeing.
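As an illustration (a kernel of my own, not from the question's code), an access pattern like this is exactly the kind that hurts on compute 1.0/1.1 but much less on 1.2+:

// Assumes `in` has at least n + offset elements.
__global__ void copyShifted(const float *in, float *out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i + offset];  // with offset != 0 the half-warp's loads are no longer aligned
                                  // to a segment boundary: compute 1.0/1.1 hardware splits this
                                  // into 16 separate transactions per half-warp, while compute
                                  // 1.2+ hardware coalesces it into one or two transactions
}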
If I remember correctly, you can list all available devices (and choose which device to use for your kernel) from the host code. You could try to determine whether the available device is a software-emulation device and issue a warning.
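A small sketch of that idea using the CUDA runtime API (the 9999 compute-capability value is what, as far as I recall, the old emulation/no-device path reported in deviceQuery, so treat that check as an assumption):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count == 0) {
        printf("No CUDA-capable device found\n");
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Old toolkits reported compute capability 9999.9999 for an emulated or
        // missing device (as far as I recall), so warn if we see that.
        if (prop.major == 9999)
            printf("Device %d (%s) looks like emulation -- kernels would run on the CPU\n",
                   d, prop.name);
        else
            printf("Device %d: %s, compute capability %d.%d\n",
                   d, prop.name, prop.major, prop.minor);
    }
    cudaSetDevice(0);  // pick a device explicitly rather than relying on the default
    return 0;
}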
