I have this UBO:
layout(std140) uniform _ObjMatrix
{
    layout(row_major) mat4x3 ViewMatrix[256];
};
On desktop OpenGL the size is 3 * vec4 * 256 elements (12288 bytes total) - this is what I was expecting = OK
However, when running on my mobile phone under OpenGL ES 3.0, the size is 4 * vec4 * 256 elements (16384 bytes total) = not OK
I thought std140 was supposed to guarantee the same layout on all platforms?
So what's the problem and how to fix it?
I need the smaller size for better performance (less bandwidth for transfers).
It works OK on desktop and Apple iOS, but fails on two Android ARM Mali GPUs, so perhaps it's a bug in the ARM Mali drivers.
This is a confirmed Mali driver bug affecting the row_major annotation on array declarations. The workaround is to apply row_major to the uniform block rather than to the array member:
layout(std140, row_major) uniform _ObjMatrix {
    mat4x3 ViewMatrix[256];
};
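For completeness, here is a rough sketch (plain GL ES 3.0 C, assuming a linked program object prog that contains this block) of how to query what layout the driver actually applied; this is where the size mismatch shows up at runtime:
/* Sketch only: prog is assumed to be a linked program containing _ObjMatrix. */
GLuint blockIndex = glGetUniformBlockIndex(prog, "_ObjMatrix");
GLint blockSize = 0;
glGetActiveUniformBlockiv(prog, blockIndex, GL_UNIFORM_BLOCK_DATA_SIZE, &blockSize);

const GLchar *names[] = { "ViewMatrix[0]" };
GLuint index = GL_INVALID_INDEX;
glGetUniformIndices(prog, 1, names, &index);

GLint arrayStride = 0, matrixStride = 0, isRowMajor = 0;
glGetActiveUniformsiv(prog, 1, &index, GL_UNIFORM_ARRAY_STRIDE,  &arrayStride);
glGetActiveUniformsiv(prog, 1, &index, GL_UNIFORM_MATRIX_STRIDE, &matrixStride);
glGetActiveUniformsiv(prog, 1, &index, GL_UNIFORM_IS_ROW_MAJOR,  &isRowMajor);

/* Expected for a row-major std140 mat4x3: arrayStride == 48, matrixStride == 16,
   isRowMajor == 1, blockSize == 256 * 48 == 12288. The affected drivers instead
   report a 64-byte array stride, giving the 16384-byte block size seen above. */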
In an algorithm we are using a median filter with a big window of 257x257 on a UInt16 image. My task is to implement that algorithm with OpenCL on a GPU.
In fact I do not only need the median but in some cases also the 0.001, 0.02 and 0.999 quantiles.
The obvious approach is to have an OpenCL kernel running for every output pixel, where the kernel loads all pixels in the window from the input image into local memory, sorts these values, and finally computes the quantiles.
The problem is that with a 257x257 window this approach needs at least 257 * 257 * 2 = 132098 bytes of local memory. But local memory is very limited: my Quadro K4000 has only 49152 bytes.
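To make that figure concrete, the naive kernel described above might look roughly like this (illustrative names, one work-group per output pixel); the __local buffer alone needs 257 * 257 * sizeof(ushort) = 132098 bytes, so it cannot even be allocated within the 49152 bytes available:
/* Illustrative sketch of the naive approach only, not a working filter:
   the window is staged in local memory exactly as described above, and the
   sort / quantile selection is omitted. */
#define W 257

__kernel void median_naive(__global const ushort *src,
                           __global ushort *dst,
                           int width, int height)
{
    __local ushort window[W * W];        /* 132098 bytes: exceeds local memory */
    const int ox = get_group_id(0);      /* output pixel x */
    const int oy = get_group_id(1);      /* output pixel y */

    /* cooperative load of the window, clamped at the image borders */
    for (int i = get_local_id(0); i < W * W; i += get_local_size(0)) {
        int wx = clamp(ox + i % W - W / 2, 0, width  - 1);
        int wy = clamp(oy + i / W - W / 2, 0, height - 1);
        window[i] = src[wy * width + wx];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ...sort window[] and read off the 0.001 / 0.02 / 0.5 / 0.999 quantiles... */
    if (get_local_id(0) == 0)
        dst[oy * width + ox] = window[(W * W) / 2];
}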
So what would be a good approach to implement such a median filter on the GPU?
A CUDA solution would also be acceptable, but I guess the underlying problem is the same.
I know this has been discussed before but I still haven't found a decent answer relevant to 2014.
Is there a max size to Vertex Buffer Objects in OpenGL ES 2.0?
I am writing a graphics engine to run on Android.
I am using glDrawArrays() to draw a bunch of lines with GL_LINE_STRIP.
So I am not using any index arrays, which means I am not capped by the maximum value of a short integer that comes up with index buffer objects.
I would like to load in excess of 2 million X,Y,Z float values, so around 24 MB of data, to the GPU.
Am I well short of the limits or way past them? Is there a way to query this?
As far as the API is concerned, the range of GLsizeiptr is the upper bound.
Generally speaking that means 4 GiB (a 32-bit pointer being the most common case); of course no integrated device actually has that much GPU memory yet, but it is the largest size the API can express, and consequently the largest number of bytes you can allocate with a function such as glBufferData (...).
Consider the prototype for glBufferData:
void glBufferData (GLenum target, GLsizeiptr size, const GLvoid *data, GLenum usage);
Now let us look at the definition of GLsizeiptr (OpenGL ES 2.0 Specification, Basic GL Operation, p. 12): it is a pointer-sized type holding a non-negative binary integer size, used for memory offsets and ranges.
There is no operational limit defined by OpenGL or OpenGL ES. About the best you could portably do is call glBufferData (...) with a certain size and NULL for the data pointer to see if it raises a GL_OUT_OF_MEMORY error. That is very roughly equivalent to a "proxy texture," which is intended to check if there is enough memory to fit a texture with certain dimensions before trying to upload it. It is an extremely crude approach to the problem, but it is one that has been around in GL for ages.
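A minimal sketch of that probing approach (ES 2.0 C, hypothetical helper name), in case it is useful:
/* Hypothetical helper: returns 1 if the GL accepted a buffer of `size`
   bytes, 0 if it raised GL_OUT_OF_MEMORY.  Only a hint; some drivers
   defer the real allocation until first use. */
#include <GLES2/gl2.h>

static int vbo_size_fits(GLsizeiptr size)
{
    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    while (glGetError() != GL_NO_ERROR)   /* flush any stale errors first */
        ;

    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STATIC_DRAW);
    GLenum err = glGetError();

    glBindBuffer(GL_ARRAY_BUFFER, 0);
    glDeleteBuffers(1, &vbo);
    return err != GL_OUT_OF_MEMORY;
}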
As a test, I am trying to squeeze as many GFLOPS out of the GPU as possible, just to see how far we can go with compute via RenderScript.
For this I use a GPU-cache-friendly kernel that will (hopefully) not be bound by memory access:
#pragma rs_fp_relaxed

rs_allocation input;

// Sum the 64 previous elements of `input` for every output element.
float __attribute__((kernel)) compute(float in, int x)
{
    float sum = 0;
    if (x < 64) return 0;
    for (int i = 0; i < 64; i++) {
        sum += rsGetElementAt_float(input, x - i);
    }
    return sum;
}
On the Java side I just invoke the kernel a number of times:
for (int i = 0; i < 1024; i++) {
    m_script.forEach_compute(m_inAllocation, m_outAllocation);
}
With allocation sizes of 1M floats this maxes out at around 1-2 GFLOPS on a GPU that should reach around 100 GFLOPS (Snapdragon 600, APQ8064AB); that is 50x-100x less compute performance!
I have tried unrolling the loop (10% difference), using larger or smaller sums (<5% difference), different allocation sizes (<5% difference), and 1D or 2D allocations (no difference), but I come nowhere near the amount of GFLOPS that should be possible on this device. I am even starting to think that the entire kernel only runs on the CPUs.
In a similar sense, looking at the results of a RenderScript benchmark application (https://compubench.com/result.jsp?benchmark=compu20), top-of-the-line devices only achieve around 60 Mpixels/s on a Gaussian blur. A 5x5 blur in a naive (non-separable) implementation takes around 50 FLOPs/pixel, resulting in 3 GFLOPS as opposed to the 300 GFLOPS these GPUs have.
Any thoughts?
(see e.g. http://kyokojap.myweb.hinet.net/gpu_gflops/ for an overview of device capabilities)
EDIT:
Using the OpenCL libs that are available on the device (Samsung S4, 4.4.2) I have rewritten the RenderScript test program in OpenCL and run it via the NDK. With basically the same setup (1M float buffers and running the kernel 1024 times) I can now get around 25 GFLOPS, that is 10x the RenderScript performance and within 4x of the theoretical device maximum.
For RenderScript there is no way of knowing whether a kernel is running on the GPU. So:
if the RenderScript kernel does run on the GPU, why is it so slow?
if the kernel is not running on the GPU, which devices do run RenderScript on the GPU (aside from most probably the Nexus line)?
Thanks.
What device are you using? Not all devices are shipping with GPU drivers yet.
Also, that kernel will be memory bound, since you've got a 1:1 arithmetic-to-load ratio.
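To illustrate the point, a compute-bound variant might look roughly like this (hypothetical kernel, same .rs file as above): it loads one value per element and then does all further work in registers, so arithmetic dominates memory traffic:
// Hypothetical compute-bound variant: one load, 512 register-only FLOPs per element.
float __attribute__((kernel)) compute_alu(float in)
{
    float a = in;
    float b = 1.0001f;
    for (int i = 0; i < 256; i++) {
        a = a * b + b;   // 2 FLOPs per iteration, no memory access
    }
    return a;
}
If a kernel like this still lands in the same 1-2 GFLOPS range, that would be a strong hint the script is not reaching the GPU at all.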
This is on a MacBookPro7,1 with a GeForce 320M (compute capability 1.2). Previously, with OS X 10.7.8, Xcode 4.x and CUDA 5.0, CUDA code compiled and ran fine.
Then I updated to OS X 10.9.2, Xcode 5.1 and CUDA 5.5. At first, deviceQuery failed. I read elsewhere that 5.5.28 (the driver CUDA 5.5 shipped with) did not support compute capability 1.x (sm_10), but that 5.5.43 did. After updating the CUDA driver to the even more recent 5.5.47 (GPU driver version 8.24.11 310.90.9b01), deviceQuery indeed passes with the following output.
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 320M"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 253 MBytes (265027584 bytes)
( 6) Multiprocessors, ( 8) CUDA Cores/MP: 48 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 1064 Mhz
Memory Bus Width: 128-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce 320M
Result = PASS
Furthermore, I can successfully compile the CUDA 5.5 samples without modification, though I have not tried to compile all of them.
However, samples such as matrixMul, simpleCUFFT, and simpleCUBLAS all fail immediately when run.
$ ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
MatrixA(160,160), MatrixB(320,160)
cudaMalloc d_A returned error code 2, line(164)
$ ./simpleCUFFT
[simpleCUFFT] is starting...
GPU Device 0: "GeForce 320M" with compute capability 1.2
CUDA error at simpleCUFFT.cu:105 code=2(cudaErrorMemoryAllocation) "cudaMalloc((void **)&d_signal, mem_size)"
Error Code 2 is cudaErrorMemoryAllocation, but I suspect it hides a failed CUDA initialization somehow.
$ ./simpleCUBLAS
GPU Device 0: "GeForce 320M" with compute capability 1.2
simpleCUBLAS test running..
!!!! CUBLAS initialization error
The actual error code is CUBLAS_STATUS_NOT_INITIALIZED, returned from the call to cublasCreate().
Has anyone run into this before and found a fix? Thanks in advance.
I would guess you are running out of memory. Your GPU is being used by the display manager, and it only has 256 MB of RAM. The combined memory footprint of the OS X 10.9 display manager and the CUDA 5.5 runtime might be leaving you with almost no free memory. I would recommend writing and running a small test program like this:
#include <iostream>
#include <cuda_runtime.h>   // runtime API (implicit when compiled as .cu with nvcc)

int main(void)
{
    size_t mfree, mtotal;
    cudaSetDevice(0);                  // establishes a context on device 0
    cudaMemGetInfo(&mfree, &mtotal);   // free/total device memory after context creation
    std::cout << mfree << " bytes of " << mtotal << " available." << std::endl;
    return cudaDeviceReset();
}
[disclaimer: written in browser, never compiled or tested, use at own risk]
That should give you a picture of the available free memory after context establishment on the device. You might be surprised at how little there is to work with.
EDIT: Here is an even lighter-weight alternative test which doesn't even attempt to establish a context on the device. Instead, it only uses the driver API to check the device. If this succeeds, then either the runtime API shipping for OS X is broken somehow, or you have no memory available on the device for establishing a context. If it fails, then you truly have a broken CUDA installation. Either way, I would consider opening a bug report with NVIDIA:
#include <iostream>
#include <cuda.h>            // CUDA driver API

int main(void)
{
    CUdevice d;
    size_t b;
    cuInit(0);               // initialise the driver API only, no context
    cuDeviceGet(&d, 0);      // handle to the first device
    cuDeviceTotalMem(&b, d); // total device memory, queryable without a context
    std::cout << "Total memory = " << b << std::endl;
    return 0;
}
Note that you will need to explicitly link the CUDA driver library to get this to work (pass -lcuda to nvcc, for example).
I am using atomic operations in OpenCL. The same code works on an Intel CPU but gives an error on an NVIDIA GPU. I have enabled atomics for both 32-bit and 64-bit.
int cidx = idx % 10;
int i = 1;
C[idx] = In1[idx] & In2[idx];
atomic_add(R, i);
This is just a portion of the overall code. It gives the build error "Unsupported Operation" when running on an NVIDIA Quadro GPU, yet it works fine on Intel i3, Xeon, and AMD processors.
atomic_add did not appear in OpenCL 1.0; it was added in OpenCL 1.1 (OpenCL 1.0 only offered atom_add through the cl_khr_*_int32_base_atomics extensions). You might be running on two different implementations which conform to different OpenCL versions.
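As a rough sketch (illustrative kernel, not the asker's full code), one way to stay portable across both versions is to fall back to the extension-based atom_add on OpenCL 1.0 devices:
// Illustrative only: use the built-in atomic_add on OpenCL 1.1+, and fall back
// to the extension-based atom_add on OpenCL 1.0 devices.
#if __OPENCL_VERSION__ < 110
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#define atomic_add atom_add
#endif

__kernel void and_and_count(__global const int *In1,
                            __global const int *In2,
                            __global int *C,
                            __global int *R)
{
    int idx = get_global_id(0);
    C[idx] = In1[idx] & In2[idx];   // same element-wise AND as in the question
    atomic_add(R, 1);               // count processed elements
}
Checking the CL_DEVICE_VERSION and CL_DEVICE_EXTENSIONS strings on the host side will tell you which case each device falls into.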