Is there any performance difference between Buffer, StructuredBuffer and ByteAddressBuffer (also their RW variants)?

I tried looking this up on various websites, including the MS Docs on DirectX 11 compute shader types, but I haven't found anything that mentions performance differences between these buffer types.
Are they exactly the same performance-wise?
If not, what is the optimal way to use each of them in various scenarios?

Performance will ultimately differ per GPU/driver combination.
There is a project that benchmarks access patterns for those buffer types (the linear/random cases are the most useful).
Constant access is also useful if you want to compare cbuffer access versus other buffer access (on NVIDIA, for example, it is common to perform a buffer-to-cbuffer GPU copy before running an expensive shader).
https://github.com/sebbbi/perftest
Note also that the different buffer types (in D3D11 land) have different limitations, so the performance benefit can be hindered by those.
Structured buffers cannot be bound as vertex/index buffers, so if you want to use them that way you need to perform an extra copy. (For vertex data you can just fetch from the vertex ID with no penalty; index buffers can be read too, but are a bit more problematic.)
Byte address buffers allow you to store anything in a non-structured way (essentially a raw pointer). Reads are still aligned to 4 bytes (int size). Converting reads to float needs asfloat, and converting writes from float needs asuint, but in most drivers this is a nop, so there is no performance impact.
Byte address (and typed buffers) can be used as index buffer or vertex buffers. No copy necessary.
Typed buffers do not support Interlocked operations too well; in that case you need to use a Structured/ByteAddress buffer (note that you can do the interlocked operations on a small separate buffer and perform the reads/writes on a typed buffer if you want).
Byte address buffers can be more annoying to use if you have an array of elements of the same type (even a float4x4 takes a decent amount of code to fetch compared to a StructuredBuffer<float4x4>).
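For illustration, here is a rough sketch of the difference in HLSL (resource names are mine, and the byte address path assumes tightly packed, row-major 64-byte matrices):

StructuredBuffer<float4x4> MatricesSB;
ByteAddressBuffer          MatricesBA;

float4x4 LoadStructured(uint i)
{
    return MatricesSB[i];                        // one line
}

float4x4 LoadByteAddress(uint i)
{
    uint offset = i * 64;                        // 64 bytes per float4x4
    float4 r0 = asfloat(MatricesBA.Load4(offset +  0));
    float4 r1 = asfloat(MatricesBA.Load4(offset + 16));
    float4 r2 = asfloat(MatricesBA.Load4(offset + 32));
    float4 r3 = asfloat(MatricesBA.Load4(offset + 48));
    return float4x4(r0, r1, r2, r3);             // manual reassembly plus an asfloat per row
}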
Structured buffers allow you to bind "partial views". So even if your buffer holds, say, 2048 floats, you can bind a range covering elements 4-456 (and you can bind 500-600 as write at the same time, since the ranges do not overlap).
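On the API side that looks roughly like the following sketch (FirstElement/NumElements are counted in structure elements, floats in this example; the device and buffer pointers are assumed to already exist):

#include <d3d11.h>

// Bind elements [4, 456) of an existing structured buffer as a read-only SRV.
HRESULT CreatePartialView(ID3D11Device* device, ID3D11Buffer* structuredBuffer,
                          ID3D11ShaderResourceView** srv)
{
    D3D11_SHADER_RESOURCE_VIEW_DESC desc = {};
    desc.Format              = DXGI_FORMAT_UNKNOWN;         // required for structured buffers
    desc.ViewDimension       = D3D11_SRV_DIMENSION_BUFFER;
    desc.Buffer.FirstElement = 4;                           // offset in elements, not bytes
    desc.Buffer.NumElements  = 452;                         // elements 4..455
    return device->CreateShaderResourceView(structuredBuffer, &desc, srv);
}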
For all buffer types, if you only read from them, don't bind them as RW; that generally carries a decent penalty.

To add to the accepted answer:
There is also a performance penalty when elements of a StructuredBuffer are not aligned to a 128-bit stride (sizeof(float4)). If they are not, a single float4, for example, can end up spanning two cache lines, causing up to a ~5% perf penalty.
An example of how to solve this is to pad the struct to re-align its elements:
struct Foo
{
    float4 Position;
    float  Radius;
    float  pad0;     // padding so that Rotation starts on a 16-byte (float4) boundary
    float  pad1;
    float  pad2;
    float4 Rotation;
};
NVIDIA post with more detail

Related

avx512 strided gather with arbitrary stride

I know in AVX512 you can perform strided gathers with strides of 1, 2, 4, 8. However, what if I have an arbitrary stride that can be anywhere between 10-1000? The stride is known at compile time. I understand that the instruction then won't be the bottleneck; the memory probably will be. Is _mm512_set_ps the most effective way to do this?
strided gathers with strides of 1,2,4,8
No, there's no special support for that; maybe you're thinking of ARM/ARM64 NEON vld4 4-way deinterleave?
In x86 you can use 1, 2, 4, or 8 as a scale factor for an index vector with vpgatherdd / vpgatherdps, but if you just want every 2nd element it's better to shuffle manually (e.g. _mm512_permutex2var_ps to grab alternate floats from 2 input vectors), getting many useful elements with one wide load instead of accessing the cache once per element.
But in your case, with a minimum stride of 10, at most 2 elements will come from the same 512-bit vector (16 x 32-bit elements), and with wider strides not even one per vector.
So you can use vpgatherdps with _mm512_add_epi32(idx, _mm512_set1_epi32(16 * stride)) in a loop. Or better, just use a fixed vector of indices and increment the base pointer. You might generate that vector of indices with _mm512_mullo_epi32(_mm512_setr_epi32(0,1,2,3,...,15), _mm512_set1_epi32(stride)). Since a float is 4 bytes wide, use a scale factor of 4 with your gathers.
Even if you need to handle huge arrays, incrementing the pointer instead of the vector elements avoids any need for 64-bit indices, as well as minimizing the number of vector uops. (Valuable when using 512-bit vectors on current CPUs.)
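Putting that together, a sketch of the fixed-index-vector loop (function and variable names are my own; the tail for sizes that aren't a multiple of 16 is omitted):

#include <immintrin.h>
#include <stddef.h>

// Gather dst[i] = src[i * stride] for n_out elements, 16 at a time.
void gather_strided(float *dst, const float *src, size_t n_out, int stride)
{
    // Element indices 0, stride, 2*stride, ..., 15*stride (not byte offsets).
    const __m512i idx = _mm512_mullo_epi32(
        _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
        _mm512_set1_epi32(stride));

    for (size_t i = 0; i + 16 <= n_out; i += 16) {
        __m512 v = _mm512_i32gather_ps(idx, src, 4);  // scale 4: a float is 4 bytes
        _mm512_storeu_ps(dst + i, v);
        src += (size_t)16 * stride;                   // advance the base pointer, not the indices
    }
}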
IIRC, Intel's optimization manual has a section about strided loads and the tradeoff in manual gather vs. using gather instructions. Gather instructions become relatively better the wider your vectors are (2/clock load throughput but only 1/clock shuffle throughput for most shuffles), so especially for 512-bit vectors it's likely a win to use vector shuffles.

Minimizing global memory reads in OpenCL with vectors?

Suppose my kernel takes 4 (or 3, or 2) unrelated float or double args, or that I want to access 4 separate floats from global memory. Will this cause 4 separate global memory accesses? Is accessing a single vector of 4 floats or doubles faster than accessing 4 separate ones? If so, am I better off packing them into a single vector and then, say, using #defines to reference the individual members?
If this does increase the performance, do I have to do it myself, or might the compiler be smart enough to automatically convert 4 separate float reads into a single vector for me? Is this what "auto-vectorization" is? I've seen auto-vectorization mentioned in a few documents, without detailed explanation of exactly what it does, except that it seems to be an optional performance optimization for CPUs only, not GPUs.
Whether vectors help depends on the kernel itself. If you need all four values at the same time (for example, at the start of the kernel or at the start of a loop), it's better to pack them, because they will then be filled by a single read (the values in a single vector are stored sequentially).
On the other hand, when you only need some of the values, you can speed up execution by reading only what you need.
Another case is when you read them one by one, with each read separated by some computation (i.e. giving the GPU some time to fetch the data).
Basically, such segmented reads behave like a buffer: if you have enough work-items (instances), the number of reads is the same in the optimal case, and what really counts is how well those reads are used.
The compiler often unpacks these structures, so the only speedup is that all the variables are stored together: one read fills them all, and the rest of the fetched data serves other work-items.
As an example, take a 128-bit wide bus and 4 floats (32 bits each).
(4 * 32b) / 128b = 1 work-item served per read (all four variables filled at once)
For scalar data types there are N reads (N = number of variables), each read filling one variable for several work-items.
128b / 32b = 4 work-items served per read (but only one variable each)
So in this example, if you have 4 work-items there will always be at least 4 reads no matter what, and the only thing you can do is cover the fetch time with some computation, if that's even possible.
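To make the packing concrete, here is a small OpenCL sketch (kernel and argument names are illustrative; each work-item handles one element):

// Four separate scalar reads: up to four global memory transactions per work-item.
__kernel void scalar_reads(__global const float *a, __global const float *b,
                           __global const float *c, __global const float *d,
                           __global float *out)
{
    int i = get_global_id(0);
    out[i] = a[i] + b[i] + c[i] + d[i];
}

// One float4 read: the four values arrive in a single 128-bit access.
__kernel void vector_read(__global const float4 *src4, __global float *out)
{
    int i = get_global_id(0);
    float4 v = src4[i];
    out[i] = v.x + v.y + v.z + v.w;
}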

Should I use pitched memory in CUDA for read-only 2D arrays?

I am porting some code from CPU to GPU, and on the CPU side I have a dynamically allocated matrix (double **) that needs to be ported to the GPU. However, once initialized, the matrix is never modified. Since I can't use pointers-to-pointers on the GPU, should I represent this matrix as a flat array (double *, accessed as matrix[i * nCols + j]) or use pitched memory for it? Will the use of pitched memory lead to a performance improvement in this case?
The only case I can think of where using pitched memory for a 2D array could perform worse than linear memory is if you access it directly with a flat index:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
double myVal = _d_array[tid];
Otherwise, pitch will at the least align the first entry of each row. A read through:
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#coalesced-access-to-global-memory
will most definitely help your understanding. If your rows are small (~16 entries), or you're using a 2.x compute-capability card, you could see significant performance improvements when you access data row by row with pitch instead of a linear layout.
The worst case without pitch, for row-by-row access on a 2.x capability card, could be close to 50% of bandwidth for an unaligned grab of 16 double values. This can also thrash your L1 cache pretty badly, as it boots out an extra L1 cache line.
Due to the lack of L1 caching of global loads on 3.x, an unaligned grab of 16 doubles results in a 32B*5 fetch into L2 instead of 32B*4, so the performance hit will likely be "small".
One thing to keep in mind is that making block sizes multiples of 32 is typically a good idea.
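For reference, a rough sketch of the pitched path being compared here (names are illustrative, error checking and the copy back are omitted):

#include <cuda_runtime.h>

// Each thread scales one element; rows start at pitch-aligned byte offsets.
__global__ void scaleMatrix(double *mat, size_t pitchBytes, int nRows, int nCols, double k)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < nRows && col < nCols) {
        double *rowPtr = (double *)((char *)mat + row * pitchBytes);
        rowPtr[col] *= k;
    }
}

void run(const double *hostMat, int nRows, int nCols)
{
    double *dMat;
    size_t pitch;                                  // row stride in bytes chosen by the driver
    cudaMallocPitch((void **)&dMat, &pitch, nCols * sizeof(double), nRows);
    cudaMemcpy2D(dMat, pitch, hostMat, nCols * sizeof(double),
                 nCols * sizeof(double), nRows, cudaMemcpyHostToDevice);

    dim3 block(32, 8);                             // x dimension a multiple of 32, per the note above
    dim3 grid((nCols + block.x - 1) / block.x, (nRows + block.y - 1) / block.y);
    scaleMatrix<<<grid, block>>>(dMat, pitch, nRows, nCols, 2.0);

    cudaFree(dMat);
}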

Max Buffer Sizes Opengl ES 2.0

I know this has been discussed before but I still haven't found a decent answer relevant to 2014.
Is there a max size to Vertex Buffer Objects in OpenGL ES 2.0?
I am writing a graphics engine to run on Android.
I am using glDrawArrays() to draw a bunch of lines with GL_LINE_STRIP.
Since I am not using any index arrays, I am not capped by the maximum value of a short integer, which comes up with index buffer objects.
I would like to load in excess of 2 million X,Y,Z float values, so around 24 MB of data, to the GPU.
Am I well short of the limits or way past them? Is there a way to query this?
As far as the API is concerned, the range of GLsizeiptr is the upper bound.
That generally means 4 GiB (a 32-bit pointer being the most common case); of course no integrated device actually has that much GPU memory yet, but that is the largest size you can deal with, and consequently the largest number of bytes you can allocate with a function such as glBufferData (...).
Consider the prototype for glBufferData:
void glBufferData (GLenum target, GLsizeiptr size, const GLvoid *data, GLenum usage);
Now let us look at the definition of GLsizeiptr:
OpenGL ES 2.0 Specification - Basic GL Operation - p. 12
There is no operational limit defined by OpenGL or OpenGL ES. About the best you could portably do is call glBufferData (...) with a certain size and NULL for the data pointer to see if it raises a GL_OUT_OF_MEMORY error. That is very roughly equivalent to a "proxy texture," which is intended to check if there is enough memory to fit a texture with certain dimensions before trying to upload it. It is an extremely crude approach to the problem, but it is one that has been around in GL for ages.
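A rough sketch of that probing approach (GLES2, plain C; the function name is mine and a current GL context is assumed):

#include <GLES2/gl2.h>

/* Returns 1 if the driver accepted an allocation of `size` bytes, 0 otherwise. */
int probe_buffer_size(GLsizeiptr size)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    while (glGetError() != GL_NO_ERROR) { }                      /* clear stale errors */
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STATIC_DRAW);   /* NULL data: just reserve storage */

    GLenum err = glGetError();
    glDeleteBuffers(1, &vbo);
    return err == GL_NO_ERROR;
}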

Why does the speed of memcpy() drop dramatically every 4KB?

I tested the speed of memcpy(), noticing that the speed drops dramatically at i*4KB. The result is as follows: the Y-axis is the speed (MB/second) and the X-axis is the size of the buffer for memcpy(), increasing from 1 KB to 2 MB. Subfigure 2 and Subfigure 3 detail the parts from 1 KB-150 KB and 1 KB-32 KB.
Environment:
CPU : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
The performance degradation in those two cases is caused by unfriendly loops that read scattered bytes into the cache, wasting the rest of each cache line, which doesn't seem to apply to a straight memcpy().
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

void memcpy_speed(unsigned long buf_size, unsigned long iters){
    struct timeval start, end;
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;

    pbuff_1 = malloc(buf_size);
    pbuff_2 = malloc(buf_size);

    gettimeofday(&start, NULL);
    for(int i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);

    /* bytes copied per microsecond, divided by 1.024^2 to convert to MB/s */
    printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/((end.tv_sec - \
        start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));

    free(pbuff_1);
    free(pbuff_2);
}
UPDATE
Considering the suggestions from @usr, @ChrisW and @Leeor, I redid the test more precisely, and the graph below shows the results. The buffer size goes from 26 KB to 38 KB, and I tested it every 64 B (26 KB, 26 KB+64 B, 26 KB+128 B, ..., 38 KB). Each test loops 100,000 times in about 0.15 seconds. The interesting thing is that the drop not only occurs exactly at the 4 KB boundaries, but also shows up at 4*i+2 KB, with a much smaller falling amplitude.
PS
@Leeor offered a way to fill in the drop: adding a 2 KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but that's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map), would usually try to keep the physical pages together as well, but that's not always possible and they may be fractured (especially on long usage where they may be swapped occasionally).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation. If it has already seen the page, it may be cached in the TLB and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries. That is relatively long, as each level is in fact a memory read by itself (on virtual machines it's even longer, as each level may need a full pagewalk on the host).
Your memcpy function may have another issue: when first allocating memory, the OS would just add the pages to the page map, but mark them as unaccessed and unmodified due to internal optimizations. The first access may then not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stored into, for the target buffer pages), which would take an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up), so you'll see the cache sizes reflected in your graphs. If you want a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iterations (just make sure you don't time that).
EDIT
Rereading the question, you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again but then should enjoy a free ride until the next 4k. It doesn't explain why there are such "spikes", after which the speed returns to normal.
I think you are facing a similar issue to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously) so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, I'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large, and hope that it breaks the alignment.
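A minimal sketch of that experiment, reusing the names from the code in the question (the spacer name is mine, and the exact offset it introduces depends on the allocator):

pbuff_1 = malloc(buf_size);
unsigned char * spacer = malloc(2 * 1024);  /* dummy ~2 KB allocation to shift pbuff_2's start address */
pbuff_2 = malloc(buf_size);
/* ... run the timed memcpy loop as before, then free all three buffers ... */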
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you allocate two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. Set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet; they can all coexist without overflowing the associativity of the cache. However, every time you perform an iteration you have a load and a store accessing the same set - I'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iterations to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set. I'm pretty sure there are a bunch of collisions hiding there.
I also see that the Intel optimization guide has something to say specifically about that (see section 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).
I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
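For example, a sketch of that pre-warming pass using the buffer names from the question:

/* Touch each 4 KB page once so the first-touch page faults happen before the timed region. */
for (unsigned long off = 0; off < buf_size; off += 4096) {
    pbuff_1[off] = 0;
    pbuff_2[off] = 0;
}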
Usually when I want a performance test like yours I code it as:
// Run it once to pre-warm the cache
runTest();
// Repeat and time
startTimer();
for (int i = count; i; --i)
    runTest();
stopTimer();
// use a larger count if the duration is less than a few seconds
// repeat the whole test 3 times to ensure that results are consistent
Since you are looping many times, I think the arguments about pages not being mapped are irrelevant. In my opinion, what you are seeing is the effect of the hardware prefetcher not being willing to cross a page boundary in order not to cause (potentially unnecessary) page faults.
