What is the point of DXGI_USAGE_UNORDERED_ACCESS buffer usage in swapchain? - directx-11

DirectX 11 allows swap chains created with DXGI_USAGE_UNORDERED_ACCESS, but there is no way to set the required CPUAccess flag (D3D11_CPU_ACCESS_WRITE). What is the purpose of such a swap chain?

DXGI_USAGE_UNORDERED_ACCESS is used for writing to a resource as a UAV from a compute shader, not the CPU.
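To make that concrete, here is a minimal, hypothetical sketch (blt-model swap chain, error handling omitted; the factory, device and hwnd are assumed to already exist) of creating such a swap chain and exposing its back buffer as a UAV so a compute shader can write to it:

#include <d3d11.h>
#include <dxgi.h>

void CreateComputeWritableSwapChain(IDXGIFactory* factory, ID3D11Device* device, HWND hwnd)
{
    DXGI_SWAP_CHAIN_DESC desc = {};
    desc.BufferCount       = 1;
    desc.BufferDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.BufferUsage       = DXGI_USAGE_RENDER_TARGET_OUTPUT | DXGI_USAGE_UNORDERED_ACCESS;
    desc.OutputWindow      = hwnd;
    desc.SampleDesc.Count  = 1;
    desc.Windowed          = TRUE;
    desc.SwapEffect        = DXGI_SWAP_EFFECT_DISCARD;

    IDXGISwapChain* swapChain = nullptr;
    factory->CreateSwapChain(device, &desc, &swapChain);

    // Grab the back buffer and create a UAV over it; no CPU access is involved.
    ID3D11Texture2D* backBuffer = nullptr;
    swapChain->GetBuffer(0, __uuidof(ID3D11Texture2D), reinterpret_cast<void**>(&backBuffer));

    ID3D11UnorderedAccessView* backBufferUav = nullptr;
    device->CreateUnorderedAccessView(backBuffer, nullptr, &backBufferUav);

    // The UAV can now be bound with CSSetUnorderedAccessViews and written by a
    // compute shader; the swap chain is presented as usual afterwards.
}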


Is there any performance difference between Buffer, StructuredBuffer and ByteAddressBuffer (also their RW variants)?

I tried looking this up on various websites, including the MS Docs on DirectX 11 compute shader types, but I haven't found anything mentioning performance differences between these buffer types.
Are they exactly the same performance-wise?
If not, what is the optimal way to use each in various scenarios?
Performance will ultimately differ depending on the GPU/driver combination.
There is a project that benchmarks access patterns for those buffer types (the linear/random cases are the most useful): https://github.com/sebbbi/perftest
The constant-access case is also useful if you want to compare cbuffer access against other buffer access (on NVIDIA it is common to perform a buffer-to-cbuffer GPU copy before running an expensive shader, for example; a sketch of that copy follows below).
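For reference, that buffer-to-cbuffer GPU copy looks roughly like the hypothetical helper below (sizes, names and the calling code are assumptions, error handling omitted):

#include <d3d11.h>

// Copies a GPU buffer into a same-sized constant buffer so that a following
// expensive pass can read the data through the constant-buffer path.
void CopyIntoConstantBuffer(ID3D11Device* device, ID3D11DeviceContext* context,
                            ID3D11Buffer* sourceBuffer, UINT byteWidth /* multiple of 16, <= 64 KiB */)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = byteWidth;
    desc.Usage     = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;

    ID3D11Buffer* constantBuffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &constantBuffer);

    // GPU-side copy; CopyResource requires the two buffers to have the same size.
    context->CopyResource(constantBuffer, sourceBuffer);

    // Bind for the expensive pass, e.g. context->CSSetConstantBuffers(0, 1, &constantBuffer);
}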
Note also that the different buffer types (in D3D11 land) have different limitations, so the performance benefit can be hindered by those.
Structured buffers cannot be bound as vertex/index buffers, so if you want to use them that way you need to perform an extra copy. (For vertex data you can simply fetch from the buffer using the vertex ID, and there is no penalty for this; index buffers can also be read manually, but that is a bit more problematic.)
Byte address buffers let you store anything in a non-structured way (essentially a raw pointer of sorts). Reads are still aligned to 4 bytes (int size). Converting reads to float requires asfloat, and writes from float require asuint, but drivers generally treat these as a no-op, so there is no performance impact.
Byte address buffers (and typed buffers) can be used as index or vertex buffers, so no copy is necessary.
Typed buffers do not support Interlocked operations well; in that case you need to use a Structured/ByteAddress buffer (note that you can do the interlocked operations on a small separate buffer and perform the reads/writes on a typed buffer if you want).
Byte address buffers can be more annoying to use if you have an array of elements of the same type (even a float4x4 takes a decent amount of code to fetch, versus a simple StructuredBuffer<float4x4>).
Structured buffers allow you to bind "partial views". So even if your buffer holds, say, 2048 floats, you can bind elements 4-456 as a read view (and bind elements 500-600 for write at the same time, since the two ranges do not overlap).
For all buffer types, if you only read from them, don't bind them as RW; that generally carries a noticeable penalty.
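To make the limitations above concrete, here is a hedged host-side D3D11 sketch (sizes and names invented, error handling omitted) of creating a structured buffer with a partial view, and a raw (byte address) buffer that can double as a vertex buffer:

#include <d3d11.h>

void CreateBufferVariants(ID3D11Device* device)
{
    // Structured buffer of 2048 floats: the stride is fixed at creation and it
    // cannot carry D3D11_BIND_VERTEX_BUFFER / D3D11_BIND_INDEX_BUFFER.
    D3D11_BUFFER_DESC structuredDesc = {};
    structuredDesc.ByteWidth           = 2048 * sizeof(float);
    structuredDesc.Usage               = D3D11_USAGE_DEFAULT;
    structuredDesc.BindFlags           = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
    structuredDesc.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    structuredDesc.StructureByteStride = sizeof(float);

    ID3D11Buffer* structuredBuffer = nullptr;
    device->CreateBuffer(&structuredDesc, nullptr, &structuredBuffer);

    // "Partial view": an SRV over elements [4, 456). A UAV over a non-overlapping
    // range such as [500, 600) could be bound at the same time.
    D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
    srvDesc.Format              = DXGI_FORMAT_UNKNOWN;       // required for structured buffers
    srvDesc.ViewDimension       = D3D11_SRV_DIMENSION_BUFFER;
    srvDesc.Buffer.FirstElement = 4;
    srvDesc.Buffer.NumElements  = 452;

    ID3D11ShaderResourceView* partialSrv = nullptr;
    device->CreateShaderResourceView(structuredBuffer, &srvDesc, &partialSrv);

    // Byte address (raw) buffer: can also be bound as a vertex buffer, so no
    // extra copy is needed when the same data feeds the input assembler.
    D3D11_BUFFER_DESC rawDesc = {};
    rawDesc.ByteWidth = 2048 * sizeof(float);
    rawDesc.Usage     = D3D11_USAGE_DEFAULT;
    rawDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_VERTEX_BUFFER;
    rawDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS;

    ID3D11Buffer* rawBuffer = nullptr;
    device->CreateBuffer(&rawDesc, nullptr, &rawBuffer);
}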
To add to the accepted answer,
There is also a performance penalty when elements in a StructuredBuffer are not aligned to a 128-bit stride [sizeof(float4)]. If they are not, a single float4, for example, can straddle a cache line, causing up to a ~5% performance penalty.
An example of how to solve this is to use padding to re-align elements:
struct Foo
{
    float4 Position;   // 16 bytes
    float Radius;      // 4 bytes
    float pad0;        // 12 bytes of padding so that Rotation starts
    float pad1;        // on a 16-byte boundary
    float pad2;
    float4 Rotation;   // 16 bytes; total struct size is 48 bytes, a multiple of 16
};
NVIDIA post with more detail

Accelerator restriction: unsupported operation: RSQRTSS

I have a simple n-body implementation and am trying to compile it to run on NVIDIA GPUs (Tesla K20m / GeForce GTX 650 Ti). I use the following compiler options:
-Minfo=all -acc -Minline -Mfpapprox -ta=tesla:cc35/nvidia
Everything works without -Mfpapprox, but when I use it, the compilation fails with the following output:
346, Accelerator restriction: unsupported operation: RSQRTSS
Line 346 reads:
float rdistance=1.0f/sqrtf(drSquared);
where
float drSquared=dx*dx+dy*dy+dz*dz+softening;
and dx, dy, dz are float values. This line is inside a #pragma acc parallel loop independent for() construct.
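For context, a minimal sketch of that kind of loop (not the asker's actual source; every name below is an assumption) looks like this:

#include <math.h>

void bodyForce(int n, float softening,
               const float *x, const float *y, const float *z,
               float *fx, float *fy, float *fz)
{
    #pragma acc parallel loop independent \
        copyin(x[0:n], y[0:n], z[0:n]) copyout(fx[0:n], fy[0:n], fz[0:n])
    for (int i = 0; i < n; i++) {
        float ax = 0.0f, ay = 0.0f, az = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float dz = z[j] - z[i];
            float drSquared = dx * dx + dy * dy + dz * dz + softening;
            float rdistance = 1.0f / sqrtf(drSquared);   // the reported line 346
            float rdistance3 = rdistance * rdistance * rdistance;
            ax += dx * rdistance3;
            ay += dy * rdistance3;
            az += dz * rdistance3;
        }
        fx[i] = ax; fy[i] = ay; fz[i] = az;
    }
}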
What is the problem with -Mfpapprox?
-Mfpapprox tells the compiler to use very low-precision CPU instructions to approximate DIV or SQRT. These instructions are not supported on the GPU. The GPU SQRT is both fast and precise so no need for a low-precision version.
Actually, even on the CPU I'd recommend you not use -Mfpapprox unless you really understand the mathematics of your code and it can handle a high degree of imprecision (as much as 5-6 bits, or roughly 20 ULPs, off). We added this flag about 10 years ago since, at the time, the CPU's divide operation was very expensive. However, CPU divide performance has greatly improved since then (as has sqrt), so you're generally better off not sacrificing precision for the little bit of speed-up you might get from this flag.
I'll put in an issue report requesting that the compiler ignore -Mfpapprox for GPU code so you won't see this error.

OpenCL returning nan after changing device

I'm trying to run a simple OpenCL program that adds two vectors held in individual buffers and stores the result in a third buffer. I'm running it on a MacBook Pro with a discrete GPU. The AMD GPU is (as reported through the clGetDeviceInfo function) second in the device list, with the integrated GPU being first. The program works correctly with the integrated GPU, but when I modify the command queue initialisation to the following:
cl_command_queue command_queue = clCreateCommandQueue(context, device_list[1], 0, &clStatus);
It returns NaN values in the output. If I use device_list[0], it works. The command queue initialisation is the only thing I changed, so how do I guarantee that I use the discrete GPU without this issue?
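One hedged sketch of picking the discrete GPU explicitly (rather than by its position in device_list) and creating the context and queue against that same device is shown below; the CL_DEVICE_HOST_UNIFIED_MEMORY heuristic is an assumption (integrated GPUs typically report CL_TRUE, discrete GPUs CL_FALSE):

#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device_list[8];
    cl_uint num_devices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, device_list, &num_devices);

    // Pick the GPU that does not share memory with the host (i.e. the discrete one).
    cl_device_id chosen = device_list[0];
    for (cl_uint i = 0; i < num_devices; ++i) {
        cl_bool unified = CL_TRUE;
        clGetDeviceInfo(device_list[i], CL_DEVICE_HOST_UNIFIED_MEMORY,
                        sizeof(unified), &unified, NULL);
        if (unified == CL_FALSE) { chosen = device_list[i]; break; }
    }

    cl_int clStatus;
    // The context, the buffers created on it, the program build and the queue
    // must all refer to this same device.
    cl_context context = clCreateContext(NULL, 1, &chosen, NULL, NULL, &clStatus);
    cl_command_queue command_queue = clCreateCommandQueue(context, chosen, 0, &clStatus);

    // ... create the three buffers on `context`, build the kernel for `chosen`,
    // enqueue it on `command_queue`, read the result back ...

    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);
    return 0;
}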

Max Buffer Sizes OpenGL ES 2.0

I know this has been discussed before but I still haven't found a decent answer relevant to 2014.
Is there a max size to Vertex Buffer Objects in OpenGL ES 2.0?
I am writing a graphics engine to run on Android.
I am using glDrawArrays() to draw a bunch of lines with GL_LINE_STRIP.
Since I am not using any index arrays, I am not capped by the maximum value of a short integer that comes up with index buffer objects.
I would like to load in excess of 2 million X,Y,Z float values, so around 24 MB of data, onto the GPU.
Am I well within the limits or way past them? Is there a way to query this?
As far as the API is concerned, the size of GLsizeiptr is the upper bound.
That generally means 4 GiB (a 32-bit size being the most common case). Of course, no integrated device actually has that much GPU memory yet, but it is the largest size you can express, and consequently the largest number of bytes you can allocate with a function such as glBufferData (...).
Consider the prototype for glBufferData:
void glBufferData (GLenum target, GLsizeiptr size, const GLvoid *data, GLenum usage);
Now let us look at the definition of GLsizeiptr:
OpenGL ES 2.0 Specification - Basic GL Operation - p. 12
There is no operational limit defined by OpenGL or OpenGL ES. About the best you can portably do is call glBufferData (...) with the desired size and NULL for the data pointer, then check whether it raises a GL_OUT_OF_MEMORY error. That is very roughly equivalent to a "proxy texture", which is intended to check whether there is enough memory to fit a texture with certain dimensions before trying to upload it. It is an extremely crude approach to the problem, but it is one that has been around in GL for ages.
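A rough sketch of that probing approach (assuming a current OpenGL ES 2.0 context; the function name and sizes are invented):

#include <GLES2/gl2.h>

// Try to allocate `sizeInBytes` of buffer storage without uploading any data;
// returns GL_FALSE if the implementation reports GL_OUT_OF_MEMORY.
GLboolean TryAllocateVertexBuffer(GLsizeiptr sizeInBytes, GLuint* outBuffer)
{
    glGenBuffers(1, outBuffer);
    glBindBuffer(GL_ARRAY_BUFFER, *outBuffer);

    while (glGetError() != GL_NO_ERROR) { }            // clear any stale errors first

    glBufferData(GL_ARRAY_BUFFER, sizeInBytes, NULL, GL_STATIC_DRAW);
    if (glGetError() == GL_OUT_OF_MEMORY) {
        glDeleteBuffers(1, outBuffer);
        *outBuffer = 0;
        return GL_FALSE;                               // too big; fall back or split
    }
    return GL_TRUE;   // storage exists; fill it with glBufferSubData afterwards
}

// Usage for the ~24 MB case from the question:
//   GLuint vbo;
//   if (TryAllocateVertexBuffer(2000000 * 3 * sizeof(GLfloat), &vbo)) { ... }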

strange behaviour using mmap

I'm using the Angstrom embedded Linux kernel v.2.6.37, based on the Technexion distribution.
DM3730 SoC, TDM3730 module, custom baseboard.
CodeSourcery toolchain v. 2010-09.50
Here is the data flow in my system:
http://i.stack.imgur.com/kPhKw.png
The FPGA generates incrementing data, and the kernel reads it via GPMC DMA. GPMC pack size = 512 data samples. Buffer size = 61440 32-bit samples (= 60 RAM pages).
The DMA buffer is allocated by dma_alloc_coherent() and mapped to userspace by an mmap() call. The user application reads data directly from the DMA buffer and saves it to NAND using fwrite(). The user reads 4096 samples at a time.
And what do I see in my file? http://i.stack.imgur.com/etzo0.png
The red line marks the first border of the ring buffer. Oops! Small packs (~16 samples) start to go missing after the border; their values exactly match the "old" values at the corresponding buffer positions. But why? 16 samples is much smaller than both the DMA pack size and the user read pack size, so there cannot be a pointer mismatch.
I guess some mmap() feature is hiding somewhere. I have tried different flags for mmap(), such as MAP_LOCKED, MAP_POPULATE and MAP_NONBLOCK, with no success. I completely misunderstand this behaviour :(
P.S. When I use copy_to_user() from the kernel instead of mmap() and zero-copy access, this behaviour does not occur.
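For reference, the userspace side described above looks roughly like the sketch below; the device node, output path, chunk synchronisation and ring-pointer handling are all assumptions, not the asker's code:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUFFER_SAMPLES 61440u   /* 60 pages of 32-bit samples, as in the question */
#define READ_CHUNK      4096u   /* samples read by the application at once */

int main(void)
{
    int fd = open("/dev/gpmc_dma", O_RDONLY);          /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    size_t bytes = BUFFER_SAMPLES * sizeof(uint32_t);
    uint32_t *ring = (uint32_t *)mmap(NULL, bytes, PROT_READ, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    FILE *out = fopen("/tmp/capture.bin", "wb");       /* stands in for the NAND file */
    size_t rd = 0;                                     /* read index, in samples */
    for (int chunk = 0; chunk < 1000; ++chunk) {
        /* ...wait here until the driver signals that READ_CHUNK new samples are ready... */
        fwrite(&ring[rd], sizeof(uint32_t), READ_CHUNK, out);
        rd = (rd + READ_CHUNK) % BUFFER_SAMPLES;       /* wrap at the ring-buffer border */
    }

    fclose(out);
    munmap(ring, bytes);
    close(fd);
    return 0;
}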
