Minimizing global memory reads in OpenCL with vectors?

Suppose my kernel takes 4 (or 3, or 2) unrelated float or double args, or that I want to access 4 separate floats from global memory. Will this cause 4 separate global memory accesses? Is accessing a single vector of 4 floats or doubles faster than accessing 4 separate ones? If so, am I better off packing them into a single vector and then, say, using #defines to reference the individual members?
If this does increase the performance, do I have to do it myself, or might the compiler be smart enough to automatically convert 4 separate float reads into a single vector for me? Is this what "auto-vectorization" is? I've seen auto-vectorization mentioned in a few documents, without detailed explanation of exactly what it does, except that it seems to be an optional performance optimization for CPUs only, not GPUs.

Whether to use vectors depends on the kernel itself. If you need all four values at the same time (for example, at the start of the kernel or at the start of a loop), it's better to pack them, because they will all be filled by a single read (the components of a vector are stored sequentially in memory).
On the other hand, when you only need some of the values, you can speed up execution by reading only what you need.
Another case is when you read them one by one, with each read separated by some computation (i.e. giving the GPU time to fetch the data).
Basically, these data-read segments behave like a buffer. If you have enough instances (work-items), the number of reads is the same in the optimal case, and what really counts is how well those reads are used.
The compiler often unpacks these structures, so the only speedup is that all the variables are stored together: one read fills them all, and the rest of the fetched data is used for another instance.
As an example, I will use a 128-bit-wide bus and 4 floats (32 bits each).
For the packed vector: 128 b / (4 x 32 b) = 1 instance per read.
For scalar data types, there are N reads (N = number of variables), each read filling one variable in as many instances as fit into one fetch:
128 b / 32 b = 4 instances per read.
So in my example, if you have 4 instances there will always be at least 4 reads no matter what, and the only thing you can do about it is hide the fetch time behind some computation, if that's even possible.
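To make that concrete, here is a minimal OpenCL C sketch of the two access patterns (the buffer names and the summing body are only illustrations, not taken from the question):

__kernel void packed(__global const float4 *in, __global float *out)
{
    size_t i = get_global_id(0);
    float4 v = in[i];                  // one 16-byte read brings in all four values
    out[i] = v.x + v.y + v.z + v.w;
}

__kernel void scattered(__global const float *a, __global const float *b,
                        __global const float *c, __global const float *d,
                        __global float *out)
{
    size_t i = get_global_id(0);
    // Four separate reads; each one is still coalesced across neighbouring work-items.
    out[i] = a[i] + b[i] + c[i] + d[i];
}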

Related

Do CPUs with AVX2 or newer instruction sets support any form of caching on register renaming?

For example, here is a very simple piece of pseudo code in which many duplicated values are read:
Data:
1 5 1 5 1 2 2 3 8 3 4 5 6 7 7 7

For all data elements:
    get particle id from data array
    idx = id / 7
    index = (idx << 8) | id
    aabb = lookup[index]
    test collision of aabb with a ray
so it will very probably re-compute the same result for the same id value (such as 1), the same division followed by the same bitwise operation, with no loop-carried dependency.
Can newer CPUs (for example those with AVX2 or AVX-512) remember the pattern (same data + same code path) and directly rename an old input register and return the output quickly (like branch prediction, but instead predicting the register renamed for a temporary value)?
I'm currently developing a collision detection algorithm on an old CPU (Bulldozer v1), and online C++ compilers are not good enough for predictable performance measurements because the CPU is shared by all visitors.
Removing duplicates with an unordered map takes about 15-30 nanoseconds per insert, or about 3-5 nanoseconds per insert with a vectorized plain-array scan. This is too slow to effectively filter unnecessary duplicates out. Even if a direct-mapped cache is used (containing just a modulo operator and some assignments), it still performs even worse than the unordered map because of cache misses.
I'm not expecting a CPU with only hundred(s) of physical registers to actually cache many things, but it could help a lot in computing duplicate values quickly, by just remembering the "same value + same code path" combo from the last iteration of a loop. At least some physics simulations with collision checking could get a decent boost.
Processing a sorted array is faster, but only for branching code? What about branchless code on the newest CPUs?
Is there any way of harnessing the register renaming performance (zero latency?) as a simple cache for duplicated work?
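For reference, here is a minimal C sketch of the software-side direct-mapped cache described above ("just a modulo operator and some assignments"); the Aabb type, lookup table, ray_test stub and cache size are hypothetical placeholders, not real APIs:

#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 16                      /* power of two so the modulo is cheap */

typedef struct { float min[3], max[3]; } Aabb;   /* hypothetical AABB type */

/* Hypothetical stand-ins for the pseudo code's lookup table and ray test. */
static Aabb lookup[1 << 16];
static int ray_test(const Aabb *box) { (void)box; return 0; }

void process(const uint32_t *data, size_t n)
{
    uint32_t cached_id[CACHE_SLOTS];
    uint32_t cached_index[CACHE_SLOTS];
    for (size_t s = 0; s < CACHE_SLOTS; ++s) cached_id[s] = UINT32_MAX;  /* empty slot */

    for (size_t i = 0; i < n; ++i) {
        uint32_t id   = data[i];
        uint32_t slot = id % CACHE_SLOTS;   /* the "modulo operator" */
        uint32_t index;
        if (cached_id[slot] == id) {
            index = cached_index[slot];     /* duplicate id: reuse the cached result */
        } else {
            uint32_t idx = id / 7;          /* same computation as the pseudo code */
            index = (idx << 8) | id;
            cached_id[slot]    = id;
            cached_index[slot] = index;
        }
        ray_test(&lookup[index]);
    }
}

With the sample data above (1 5 1 5 1 ...), such a cache catches the alternating repeats that a last-iteration-only cache would miss; as the question notes, though, even this small amount of software overhead can cost more than simply recomputing idx and index.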

avx512 strided gather with arbitrary stride

I know that in AVX512 you can perform strided gathers with strides of 1, 2, 4 or 8. However, what if I have an arbitrary stride that can be anywhere between 10 and 1000? The stride is known at compile time. I understand that in that case the instruction won't be the bottleneck; the memory probably will be. Is _mm512_set_ps the most effective way to do this?
strided gathers with strides of 1,2,4,8
No, there's no special support for that; maybe you're thinking of ARM/ARM64 NEON vld4 4-way deinterleave?
In x86 you can use 1, 2, 4, or 8 as a scale factor for an index vector with vpgatherdd / vpgatherdps, but if you just want every 2nd element it's better to shuffle manually (e.g. _mm512_permutex2var_ps to grab alternate floats from 2 input vectors), getting many useful elements with one wide load instead of accessing the cache once per element.
But in your case, with a minimum stride of 10, at most 2 elements will come from the same 512-bit (16 x 32-bit) vector of data, and with strides wider than 16 not even two per vector.
So you can use vpgatherdps with _mm512_add_epi32(idx, _mm512_set1_epi32(16 * stride)) in a loop. Or better, just use a fixed vector of indices and increment the base pointer. You might generate that vector of indices with _mm512_mullo_epi32(_mm512_setr_epi32(0,1,2,3,...,15), _mm512_set1_epi32(stride)). Since a float is 4 bytes wide, use a scale factor of 4 with your gathers.
Even if you need to handle huge arrays, incrementing the pointer instead of the vector elements avoids any need for 64-bit indices, as well as minimizing the number of vector uops. (Valuable when using 512-bit vectors on current CPUs.)
IIRC, Intel's optimization manual has a section about strided loads and the tradeoff of manual gather vs. using gather instructions. Manual shuffling becomes relatively better the wider your vectors are (2/clock load throughput but only 1/clock shuffle throughput for most shuffles, so the shuffle cost is amortized over more elements), so especially for 512-bit vectors it's likely a win to use vector shuffles when the stride is small enough for that to apply.
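Putting the pieces above together, a minimal sketch in C with AVX-512 intrinsics (the function name, the example stride value and the assumption that the number of gathered elements is a multiple of 16 are mine, not from the question):

#include <immintrin.h>
#include <stddef.h>

#define STRIDE 37   /* example stride; any compile-time constant, not just 1/2/4/8 */

/* Gather every STRIDE-th float from src into a contiguous dst.
   Assumes n_out (number of gathered elements) is a multiple of 16. */
void strided_gather(const float *src, float *dst, size_t n_out)
{
    /* Fixed index vector: 0, STRIDE, 2*STRIDE, ..., 15*STRIDE (element counts). */
    const __m512i idx = _mm512_mullo_epi32(
        _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15),
        _mm512_set1_epi32(STRIDE));

    for (size_t i = 0; i < n_out; i += 16) {
        /* Scale factor 4 because a float is 4 bytes and idx holds element indices. */
        __m512 v = _mm512_i32gather_ps(idx, src, 4);
        _mm512_storeu_ps(dst + i, v);
        src += (size_t)16 * STRIDE;   /* advance the base pointer, keep idx fixed */
    }
}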

Using a slice instead of list when working with large data volumes in Go

I have a question on the utility of slices in Go. I have just seen Why are lists used infrequently in Go? and Why use arrays instead of slices?, but I had some questions which I did not see answered there.
In my application:
I read a CSV file containing approx 10 million records, with 23 columns per record.
For each record, I create a struct and put it into a linked list.
Once all records have been read, the rest of the application logic works with this linked list (the processing logic itself is not relevant for this question).
The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need. Also, since I don't know the size of the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Every Go tutorial or article I read seems to suggest that I should use slices and not lists (as a slice can do everything a list can, but do it better somehow). However, I don't see how or why a slice would be more helpful for what I need? Any ideas from anyone?
... approx 10 million records, with 23 columns per record ... The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need.
This contiguous memory is its own benefit as well as its own drawback. Let's consider both parts.
(Note that it is also possible to use a hybrid approach: a list of chunks. This seems unlikely to be very worthwhile here though.)
Also, since I don't know the size of the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Clearly, if there are n records, and you allocate and fill in each one once (using a list), this is O(n).
If you use a slice, and allocate a single extra slice entry every time, you start with none, grow it to size 1, then copy the 1 to a new array of size 2 and fill in item #2, grow it to size 3 and fill in item #3, and so on. The first of the n entities is copied n times, the second is copied n-1 times, and so on, for n(n+1)/2 = O(n²) copies. But if you use a multiplicative expansion technique (which Go's append implementation does), this drops to O(log n) copies. Each one does copy more bytes though. It ends up being O(n), amortized (see Why do dynamic arrays have to geometrically increase their capacity to gain O(1) amortized push_back time complexity?).
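To put rough numbers on that, assume for illustration a pure doubling policy (Go's actual growth factor is smaller than 2 for large slices) and n = 10 million records:

reallocations: capacity grows 1 -> 2 -> 4 -> ... -> 2^24 ≈ 16.8 million, i.e. about 24 reallocations
elements copied in total: 1 + 2 + 4 + ... + 2^23 = 2^24 - 1 ≈ 16.8 million < 2n

So on average each record is copied about twice over the whole run, which is the amortized O(n) behaviour that append relies on.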
The space used with the slice is obviously O(n). The space used for the linked list approach is O(n) as well (though the records now require at least one forward pointer so you need some extra space per record).
So in terms of the time needed to construct the data, and the space needed to hold the data, it's O(n) either way. You end up with the same total memory requirement. The main difference, at first glance anyway, is that the linked-list approach doesn't require contiguous memory.
So: What do we lose when using contiguous memory, and what do we gain?
What we lose
The thing we lose is obvious. If we already have fragmented memory regions, we might not be able to get a contiguous block of the right size. That is, given:
used: 1 MB (starting at base, ending at base+1M)
free: 1 MB (starting at +1M, ending at +2M)
used: 1 MB (etc)
free: 1 MB
used: 1 MB
free: 1 MB
we have a total of 6 MB, 3 used and 3 free. We can allocate three 1 MB blocks, but we can't allocate one 3 MB block unless we can somehow compact the three "used" regions.
Since Go programs tend to run in virtual memory on large-memory-space machines (virtual sizes of 64 GB or more), this tends not to be a big problem. Of course everyone's situation differs, so if you really are VM-constrained, that's a real concern. (Other languages have compacting GC to deal with this, and a future Go implementation could at least in theory use a compacting GC.)
What we gain
The first gain is also obvious: we don't need pointers in each record. This saves some space; the exact amount depends on the size of the pointers, whether we're using singly linked lists, and so on. Let's just assume two 8-byte pointers, or 16 bytes per record. Multiply by 10 million records and we're looking pretty good here: we've saved 160 MBytes. (Go's container/list implementation uses a doubly linked list, and on a 64-bit machine, those 16 bytes are the per-element threading overhead needed.)
We gain something less obvious at first, though, and it's huge. Because Go is a garbage-collected language, every pointer is something the GC must examine at various times. The slice approach has zero extra pointers per record; the linked-list approach has two. That means that the GC system can avoid examining the nonexistent 20 million pointers (in the 10 million records).
Conclusion
There are times to use container/list. If your algorithm really calls for a list and is significantly clearer that way, do it that way, unless and until it proves to be a problem in practice. Or, if you have items that can be on some collection of lists—items that are actually shared, but some of them are on the X list and some are on the Y list and some are on both—this calls for a list-style container. But if there's an easy way to express something as either a list or a slice, go for the slice version first. Because slices are built into Go, you also get the type safety / clarity mentioned in the first link (Why are lists used infrequently in Go?).

What is the best general purpose computing practice in OpenCL for iterative problems?

When we have a program that requires lots of operations over a large data set, and the operations on each of the data elements are independent, OpenCL can be a good choice for making it faster. I have a program like the following:
while( function(b,c)!=TRUE)
{
    [X,Y] = function1(BigData);
    M = functionA(X);
    b = function2(M);
    N = functionB(Y);
    c = function3(N);
}
Here function1 is applied to each of the elements of BigData and produces two other big data sets (X, Y). function2 and function3 are then applied individually to each of the elements of X and Y, respectively.
Since the operations of all the functions are applied to each of the elements of the data sets independently, using a GPU might make it faster. So I came up with the following:
while( function(b,c)!=TRUE)
{
    //[X,Y] = function1(BigData);
    1.  Load kernel1 and BigData onto the GPU. Each thread works on one
        data element and saves its result into X and Y on the GPU.

    //M = functionA(X);
    2a. Load kernel2 onto the GPU. Each thread works on one data element
        of X and saves its result into M on the GPU.
        (workItems=n1, workgroup size=y1)

    //b = function2(M);
    2b. Load kernel2 (same kernel) onto the GPU. Each thread works on one
        data element of M and saves its result into B on the GPU.
        (workItems=n2, workgroup size=y2)

    3.  Read the data B into host variable b.

    //N = functionB(Y);
    4a. Load kernel3 onto the GPU. Each thread works on one data element
        of Y and saves its result into N on the GPU.
        (workItems=n1, workgroup size=y1)

    //c = function3(N);
    4b. Load kernel3 (same kernel) onto the GPU. Each thread works on one
        data element of N and saves its result into C on the GPU.
        (workItems=n2, workgroup size=y2)

    5.  Read the data C into host variable c.
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run it on a GPU). And if the kernels need some sort of synchronization, it might end up even slower.
I also believe this workflow is fairly common. So what is the best practice for using OpenCL to speed up a program like this?
I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
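As a rough illustration of what "combine the kernels into one" could look like for the per-element part of your loop (a sketch only: the per-element device functions and the float element types are assumptions, and producing the scalar b and c from the per-element results B and C is left to a read-back or reduction step, as in your original scheme):

// Hypothetical per-element device functions standing in for the question's
// function1/functionA/function2/functionB/function3 (bodies are placeholders).
float2 function1(float v)  { return (float2)(v * 0.5f, v * 2.0f); }
float  functionA(float x)  { return x * x; }
float  function2(float m)  { return m + 1.0f; }
float  functionB(float y)  { return y - 1.0f; }
float  function3(float n)  { return n * 0.25f; }

__kernel void fused_step(__global const float *bigData,
                         __global float *B,    // per-element results for b
                         __global float *C)    // per-element results for c
{
    size_t i  = get_global_id(0);
    float2 xy = function1(bigData[i]);   // [X,Y] for this element, kept in registers
    float  m  = functionA(xy.x);
    B[i]      = function2(m);
    float  n  = functionB(xy.y);
    C[i]      = function3(n);
}

This avoids writing X, Y, M and N back to global memory between kernel launches; only B and C need to leave the device each iteration. Iterating the while loop itself inside the kernel is only possible if the b/c test can be evaluated on the device (for example via a reduction), since there is no cheap global synchronisation across work-groups within a single kernel launch.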
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind of job size (tens? thousands? millions of arithmetic ops? how big are your data sets?) or on what hardware (compute card? laptop IGPU?). "Significant overhead" can mean a lot of different things. 5 ms? 1 second?
Intel, nVidia and AMD all publish optimisation guides - have you read these?

overlapping variables and performance

Sorry in advance if I have some of this wrong. I may edit to correct later if it's not too disruptive.
When multiple variables are declared in adjacent memory, as I understand it, on a very low level, registers are created that encapsulate a number of bytes, commonly 1, 2, 4 or 8. This allows those bit ranges to be binary rotated, as well as treated by the processor as numbers and so mutated with simple mathematics such as add, subtract, multiply and divide.
There may be abstraction reasons for not overlapping these ranges, but as many languages consider instructions to occur in a well-defined sequential order that the coder will be aware of, are there any performance reasons not to overlap one or more of them in adjacent bytes of allocated memory?
For example, in a block of allocated memory where every bit starts as 0, bytes 0 to 3 could be used as an int, as well as bytes 1 to 4. The first could be set to a value before the second range was multiplied by 3.
If there are performance reasons not to, are they outweighed by otherwise having to copy values in and out of completely new variables and perform more complicated processes to achieve certain algorithms that could otherwise be done on a very low level?
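For concreteness, a minimal C sketch of the bytes 0-3 / bytes 1-4 example from the question; the memcpy calls are just the portable way to spell the overlapping accesses, and the values are arbitrary:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    unsigned char buf[5] = {0};
    uint32_t a = 0x01020304u;

    memcpy(buf, &a, 4);            /* bytes 0-3 hold the first int */

    uint32_t b;
    memcpy(&b, buf + 1, 4);        /* bytes 1-4 read as a second, overlapping int */
    b *= 3;
    memcpy(buf + 1, &b, 4);        /* write the result back over bytes 1-4 */

    printf("%02x %02x %02x %02x %02x\n",
           buf[0], buf[1], buf[2], buf[3], buf[4]);
    return 0;
}

The read and write over bytes 1 to 4 are unaligned, overlapping accesses; how cheaply the compiler and CPU handle them is exactly the kind of performance question the answer below discusses.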
There is nothing wrong with this trick when it is done in assembly: optimizers have been routinely making use of knowing where parts of an integer are to save CPU cycles and reduce the size of the code. For example, when a 32-bit integer variable is initialized to a value that fits in only 16 bits, optimizing compilers would replace the instruction that stores a 32-bit value in memory with a faster instruction that stores a 16-bit value to the lower bits of the variable, and clear the upper 16 bits. Moreover, many optimizers would go even further: if a constant is divisible by 2^16, they would store the value divided by 2^16 to the upper 16 bits, and clear lower 16 bits.
Some architectures restrict such manipulations to addresses of certain properties, for example, by requiring all 4-byte memory load/store instructions to be done at addresses divisible by four. These restrictions may reduce applicability of partial-value writing tricks.

Resources