I am using atomic operations in OpenCL. The same code works on an Intel CPU but gives an error on an Nvidia GPU. I have enabled atomics for both 32-bit and 64-bit.
int cidx=idx%10;
int i=1;
C[idx]=In1[idx] & In2[idx];
atomic_add(R,i);
This is just a portion of the overall code. It gives the build error "Unsupported Operation" when running on an Nvidia Quadro GPU, while it works fine on Intel i3, Xeon, and AMD processors.
atomic_add did not appear in OpenCL 1.0; it was added in a later revision of the spec (OpenCL 1.1, which promoted the 32-bit atomics extensions to core). You might be running on two different implementations which conform to different OpenCL versions.
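If the NVIDIA implementation you are building against only reports OpenCL 1.0, a minimal sketch of the older spelling looks like this (the kernel name and argument types are assumptions based on your snippet, not your actual code): enable the base-atomics extension and use atom_add. You can also check what the device reports by querying CL_DEVICE_VERSION with clGetDeviceInfo.

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void and_count(__global const int *In1,
                        __global const int *In2,
                        __global int *C,
                        __global int *R)
{
    int idx = get_global_id(0);
    C[idx] = In1[idx] & In2[idx];
    atom_add(R, 1);    /* OpenCL 1.0 spelling; renamed atomic_add in OpenCL 1.1 */
}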
I have a simple n-body implementation and am trying to compile it to run on NVIDIA GPUs (Tesla K20m / GeForce GTX 650 Ti). I use the following compiler options:
-Minfo=all -acc -Minline -Mfpapprox -ta=tesla:cc35/nvidia
Everything works without -Mfpapprox, but when I use it, the compilation fails with the following output:
346, Accelerator restriction: unsupported operation: RSQRTSS
Line 346 reads:
float rdistance=1.0f/sqrtf(drSquared);
where
float drSquared=dx*dx+dy*dy+dz*dz+softening;
and dx, dy, and dz are float values. This line sits inside a #pragma acc parallel loop independent for() construct, sketched below.
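For context, here is a minimal, self-contained sketch of that kind of construct (the function and the names n, x, y, z, ax are placeholders, not my actual kernel):

#include <math.h>

void body_force(int n, const float *x, const float *y, const float *z,
                float *ax, float softening)
{
    #pragma acc parallel loop independent
    for (int i = 0; i < n; i++) {
        float fx = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            float drSquared = dx*dx + dy*dy + dz*dz + softening;
            /* with -Mfpapprox this reciprocal square root becomes an
               RSQRTSS-style approximation, which the GPU code generator rejects */
            float rdistance = 1.0f / sqrtf(drSquared);
            fx += dx * rdistance * rdistance * rdistance;
        }
        ax[i] = fx;
    }
}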
What is the problem with -Mfpapprox?
-Mfpapprox tells the compiler to use very low-precision CPU instructions to approximate DIV or SQRT. These instructions are not supported on the GPU. The GPU SQRT is both fast and precise, so there is no need for a low-precision version.
Actually, even on the CPU I'd recommend not using -Mfpapprox unless you really understand the mathematics of your code and it can handle a high degree of imprecision (as much as 5-6 bits, or ~20 ULPs, off). We added this flag about 10 years ago because at the time the CPU's divide operation was very expensive. However, CPU divide performance has greatly improved since then (as has sqrt), so you're generally better off not sacrificing precision for the little bit of speed-up you might get from this flag.
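To get a feel for the error involved, here is a small standalone comparison of the raw SSE reciprocal square-root approximation against the exact 1.0f / sqrtf(). This only illustrates the class of instruction -Mfpapprox relies on (the compiler may add a refinement step); it is not its exact code generation:

#include <stdio.h>
#include <math.h>
#include <xmmintrin.h>

int main(void)
{
    float x = 3.14159f;
    /* RSQRTSS-style approximation vs. the exact computation */
    float approx = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    float exact  = 1.0f / sqrtf(x);
    printf("approx = %.8f  exact = %.8f  relative error = %g\n",
           approx, exact, (approx - exact) / exact);
    return 0;
}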
I'll put in an issue report requesting that the compiler ignore -Mfpapprox for GPU code so you won't see this error.
What is the most reliable way, using the Win32 API, to determine if a processor is an Intel Skylake Gen? This seems like an easy question, as one can check the friendly name of the CPU in the registry and get some data, but I have found that to be less than authoritative and feel I am missing some other store of data to query.
Note: I specified the Win32 API both to be clear this is Windows, and also to deter answers that would involve writing a device driver (interfacing with one via DeviceIoControl/IRPs is fine).
Thanks!
Probably the most reliable / direct way is to use the CPUID instruction with the appropriate input register values, and decode the vendor/family/model ID numbers.
According to http://www.sandpile.org/x86/cpuid.htm,
SKL has Family = 0x6 (like every descendant of i686 PPro (P6 core)).
SKL Y/U: model = 0x4E (low power, dual-core even for i7)
SKL S/H: model = 0x5E (desktop/high-power laptop, quad-core except i3)
SKX model = 0x55 (Skylake-E Xeons, not released yet AFAIK)
KBL Y/U: model = 0x8E (Kaby Lake low power, dual-core)
KBL S/H: model = 0x9E (Kaby Lake desktop/high-power laptop, quad-core except i3)
Dual-core desktop i3 CPUs are probably the same die as quad-core i5, but with 2 of the cores disabled. (Often because of a manufacturing defect that would prevent it being sold as a quad-core part.) Interesting to see that the model # reflects this difference between dual-core silicon vs. a quad-core die fused-off to dual-core.
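A sketch of that decoding with the MSVC __cpuid intrinsic, matched against the model numbers listed above (treat the Skylake model list itself as the assumption here):

#include <intrin.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int regs[4];
    char vendor[13] = {0};

    __cpuid(regs, 0);                        /* leaf 0: vendor string in EBX, EDX, ECX */
    memcpy(vendor + 0, &regs[1], 4);
    memcpy(vendor + 4, &regs[3], 4);
    memcpy(vendor + 8, &regs[2], 4);

    __cpuid(regs, 1);                        /* leaf 1: version information in EAX */
    int eax    = regs[0];
    int family = (eax >> 8) & 0xF;
    int model  = (eax >> 4) & 0xF;
    if (family == 0xF)
        family += (eax >> 20) & 0xFF;        /* extended family */
    if (family == 0x6 || family == 0xF)
        model |= ((eax >> 16) & 0xF) << 4;   /* extended model */

    int is_skl = strcmp(vendor, "GenuineIntel") == 0 && family == 0x6 &&
                 (model == 0x4E || model == 0x5E || model == 0x55);
    printf("%s family 0x%X model 0x%X -> %s\n",
           vendor, family, model, is_skl ? "Skylake" : "not Skylake");
    return 0;
}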
If there's something you want to enable based on something Skylake has, it might be better to detect that directly (with some other CPUID query). e.g. check the feature-bit for an instruction-set extension directly. That way you won't run into trouble in a VM where CPUID shows a SKL CPU, but the VM doesn't pass through all instruction-set extensions. (e.g. some don't pass through AVX to the guest OS).
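For the feature-based approach, here is a sketch that tests a single feature bit (AVX2, purely as an example) instead of the model number:

#include <intrin.h>
#include <stdio.h>

int main(void)
{
    int regs[4];
    __cpuid(regs, 0);
    int max_leaf = regs[0];

    int has_avx2 = 0;
    if (max_leaf >= 7) {
        __cpuidex(regs, 7, 0);               /* leaf 7, subleaf 0 */
        has_avx2 = (regs[1] >> 5) & 1;       /* EBX bit 5 = AVX2 */
    }
    /* Note: before actually using AVX2 you also need to confirm OS support
       for the YMM state via the OSXSAVE bit and XGETBV. */
    printf("AVX2 %s\n", has_avx2 ? "supported" : "not supported");
    return 0;
}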
But this might be useful if you're selecting between versions of a function tuned for Haswell vs. Skylake. e.g. psrlvd ymm, ymm, ymm is 1 uop / 1 cycle on Skylake, but 3 uops and 3 cycles on Haswell. So on Haswell, repeated shifts by the same amount (when it isn't a compile-time-constant) would be faster if you use psrld ymm, ymm, xmm (with the count in the low element of the xmm reg), but on Skylake it's faster to pre-broadcast the shift count and use a variable-shift.
There are other improvements to front-end throughput, micro-fusion of indexed addressing modes, and instructions running on more ports that could make it useful to have differently micro-optimized versions of things for Skylake vs. Haswell.
(answering my own question)
While no way of doing this purely through the Windows API became apparent, I found an excellent summary at https://en.wikipedia.org/wiki/CPUID.
Using the CPUID instruction, one can derive the model from the highest supported standard function (leaf), returned in the EAX register, coupled with the vendor string (returned in EBX, EDX, and ECX). I now have a nice abstraction layer for all this.
Here is a list of processors and the highest function supported: https://en.wikipedia.org/wiki/CPUID
For Skylake CPUs, this is 0x16 (decimal 22).
History shows this to be unique for CPU models (see link).
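A sketch of that query (again using the MSVC __cpuid intrinsic); it simply reads the highest supported standard leaf and the vendor string:

#include <intrin.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int regs[4];
    char vendor[13] = {0};

    __cpuid(regs, 0);                        /* leaf 0 */
    int highest_leaf = regs[0];              /* EAX: highest standard function, e.g. 0x16 */
    memcpy(vendor + 0, &regs[1], 4);
    memcpy(vendor + 4, &regs[3], 4);
    memcpy(vendor + 8, &regs[2], 4);

    printf("vendor = %s, highest standard leaf = 0x%X\n", vendor, highest_leaf);
    return 0;
}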
I implemented a program which uses different CUDA streams from different CPU threads. Memory copying is implemented via cudaMemcpyAsync using those streams, and kernel launches also use those streams. The program does double-precision computations (and I suspect this is the culprit; however, cuBLAS reaches 75-85% GPU usage for multiplication of matrices of doubles). There are also reduction operations, but they are implemented via if(threadIdx.x < s) with s halving on each iteration, so stalled warps should be available to other blocks. The application is GPU- and CPU-intensive: it starts another piece of work as soon as the previous one has finished. So I expect it to reach 100% of either the CPU or the GPU.
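Roughly, the per-thread pattern is the one sketched below (the kernel, sizes, and thread count are placeholders, not my actual code):

#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void work(double *d, int n)                // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0;
}

void worker(int n)
{
    cudaStream_t s;
    cudaStreamCreate(&s);
    double *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(double));  // pinned host memory so the copies are truly async
    cudaMalloc((void **)&d, n * sizeof(double));
    cudaMemcpyAsync(d, h, n * sizeof(double), cudaMemcpyHostToDevice, s);
    work<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(double), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);
    cudaFree(d); cudaFreeHost(h); cudaStreamDestroy(s);
}

int main()
{
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) pool.emplace_back(worker, 1 << 20);
    for (auto &t : pool) t.join();
    return 0;
}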
The problem is that my program generates 30-40% GPU load (and about 50% CPU load), according to GPU-Z 1.9.0. The Memory Controller Load is 9-10% and the Bus Interface Load is 6%. This is for a number of CPU threads equal to the number of CPU cores. If I double the number of CPU threads, the loads stay about the same (including the CPU load).
So why is that? Where is the bottleneck?
I am using GeForce GTX 560 Ti, CUDA 8RC, MSVC++2013, Windows 10.
One of my guesses is that Windows 10 applies some aggressive power saving, even though the GPU and CPU temperatures are low, the power plan is set to "High performance", and the power supply is 700 W while the power consumption at maximum CPU and GPU TDP is about 550 W.
Another guess is that double-precision speed is 1/12 of the single-precision speed because there is 1 double-precision CUDA core per 12 single-precision CUDA cores on my card, and GPU-Z counts as 100% the situation where both single-precision and double-precision cores are in use. However, the numbers do not quite match.
Apparently the reason was low occupancy, due to CUDA threads using too many registers by default. To tell the compiler the limit on the number of registers per thread, __launch_bounds__ can be used, as described here. So, to be able to launch all 1536 threads per multiprocessor on the 560 Ti, for a block size of 256 the following can be specified:
__global__ void __launch_bounds__(256, 6) MyKernel(...) { ... }
After limiting the number of registers per CUDA thread, the GPU usage rose to 60% for me.
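As a sanity check, occupancy can be verified at run time with the occupancy API (available since CUDA 6.5). The kernel body below is a placeholder; only the __launch_bounds__ pattern is the point:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void __launch_bounds__(256, 6) MyKernel(float *p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    p[i] += 1.0f;
}

int main()
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, MyKernel, 256, 0);
    // 6 blocks x 256 threads = 1536 resident threads per SM on a 560 Ti,
    // provided the register and shared-memory budgets allow it.
    printf("resident blocks per SM: %d (threads: %d)\n", blocksPerSM, blocksPerSM * 256);
    return 0;
}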
By the way, 5xx series cards are still supported by NSight v5.1 for Visual Studio. It can be downloaded from the archive.
EDIT: the following flags have further increased GPU usage to 70% in an application that uses multiple GPU streams from multiple CPU threads:
cudaSetDeviceFlags(cudaDeviceScheduleYield | cudaDeviceMapHost | cudaDeviceLmemResizeToMax);
cudaDeviceScheduleYield lets other threads execute when a CPU thread is waiting on a GPU operation, rather than spin-waiting for the result.
cudaDeviceLmemResizeToMax, as I understood it, makes kernel launches themselves asynchronous and avoids excessive local memory allocations and deallocations.
This is on a MacBookPro7,1 with a GeForce 320M (compute capability 1.2). Previously, with OS X 10.7.8, XCode 4.x and CUDA 5.0, CUDA code compiled and ran fine.
Then I updated to OS X 10.9.2, Xcode 5.1 and CUDA 5.5. At first, deviceQuery failed. I read elsewhere that 5.5.28 (the driver CUDA 5.5 shipped with) did not support compute capability 1.x (sm_10), but that 5.5.43 did. After updating the CUDA driver to the even more recent 5.5.47 (GPU driver version 8.24.11 310.90.9b01), deviceQuery indeed passes with the following output.
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors, ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce 320M
Result = PASS
Furthermore, I can successfully compile the CUDA 5.5 samples without modification, though I have not tried to compile all of them.
However, samples such as matrixMul, simpleCUFFT, simpleCUBLAS all fail immediately when run.
$ ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
MatrixA(160,160), MatrixB(320,160)
cudaMalloc d_A returned error code 2, line(164)
$ ./simpleCUFFT
[simpleCUFFT] is starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
CUDA error at simpleCUFFT.cu:105 code=2(cudaErrorMemoryAllocation) "cudaMalloc((void **)&d_signal, mem_size)"
Error Code 2 is cudaErrorMemoryAllocation, but I suspect it hides a failed CUDA initialization somehow.
$ ./simpleCUBLAS
GPU Device 0: "GeForce 320M" with compute capability 1.2
simpleCUBLAS test running..
!!!! CUBLAS initialization error
The actual error code is CUBLAS_STATUS_NOT_INITIALIZED, returned from the call to cublasCreate().
Has anyone run into this before and found a fix? Thanks in advance.
I would guess you are running out of memory. Your GPU is being used by the display manager, and it only has 256 MB of RAM. The combined memory footprint of the OS X 10.9 display manager and the CUDA 5.5 runtime might be leaving you with almost no free memory. I would recommend writing and running a small test program like this:
#include <iostream>
#include <cuda_runtime.h>

int main(void)
{
    size_t mfree, mtotal;
    cudaSetDevice(0);                   // establish a context on device 0
    cudaMemGetInfo(&mfree, &mtotal);    // query free and total device memory
    std::cout << mfree << " bytes of " << mtotal << " available." << std::endl;
    return cudaDeviceReset();
}
[disclaimer: written in browser, never compiled or tested; use at your own risk]
That should give you a picture of the available free memory after context establishment on the device. You might be surprised at how little there is to work with.
EDIT: Here is an even lighter-weight alternative test which doesn't even attempt to establish a context on the device. Instead, it only uses the driver API to check the device. If this succeeds, then either the runtime API shipping for OS X is broken somehow, or you have no memory available on the device for establishing a context. If it fails, then you truly have a broken CUDA installation. Either way, I would consider opening a bug report with NVIDIA:
#include <iostream>
#include <cuda.h>

int main(void)
{
    CUdevice d;
    size_t b;
    cuInit(0);                  // initialize the driver API
    cuDeviceGet(&d, 0);         // get a handle to device 0
    cuDeviceTotalMem(&b, d);    // query total memory without creating a context
    std::cout << "Total memory = " << b << std::endl;
    return 0;
}
Note that you will need to explicitly link the CUDA driver library to get this to work (pass -lcuda to nvcc, for example).
I want to evaluate the floating-point performance of several different ARM processors. I use lmbench and pi_css5, but I am confused about the float test.
From cat /proc/cpuinfo (below), I guess there are three kinds of float features: neon, vfp, and vfpv3? From this question & answer, it seems it depends on the compiler.
Still, I don't know which one I should specify in the compile flag (-mfpu=neon/vfp/vfpv3), whether I should compile the program with each of them, or whether I should just not specify -mfpu at all.
cat /proc/cpuinfo
Processor : ARMv7 Processor rev 4 (v7l)
BogoMIPS : 532.00
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 4
It might be even a little bit more complicated than you anticipated. GCC's ARM options page doesn't explain FPU versions, but ARM's manual for their compiler does. You should also note that Linux doesn't tell the whole story about FPU features; it only reports vfp, vfpv3, vfpv3d16, or vfpv4.
Back to your question: you should select the greatest common feature set among them, compile your code for it, and compare the results. On the other hand, if one CPU has vfpv4 and another has vfpv3, which one would you consider better?
If your question is as simple as choosing between neon, vfp, and vfpv3, select neon (source).
-mfpu=neon selects VFPv3 with NEON coprocessor extensions.
From the gcc manual,
If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations will not be used by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular, denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
See for instance, Subnormal IEEE-754 floating point numbers support on ios... for more on this topic.
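As a quick way to see whether NEON auto-vectorization actually kicks in with a given flag set, here is a tiny, self-contained loop you can compile with different -mfpu settings (the file and function names are placeholders):

/* saxpy.c - compile, for example, with:
 *   gcc -std=c99 -O3 -march=armv7-a -mfloat-abi=softfp -mfpu=neon \
 *       -funsafe-math-optimizations -S saxpy.c
 * and compare the output against -mfpu=vfpv3; depending on your GCC version,
 * -ftree-vectorizer-verbose=1 or -fopt-info-vec reports what was vectorized. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}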
I have tried each of them, and it seems that using -mfpu=neon together with -march=armv7-a and -mfloat-abi=softfp is the proper configuration.
In addition, a reference comparing the ARM Cortex-A8 and the Intel Atom is very useful for ARM benchmarking.
Another helpful article, about ARM Cortex-A processors and GCC command lines, clarifies the SIMD coprocessor configuration.