As part of a class project I am looking at ways to improve the performance of a pathfinding algorithm at the CPU-architecture level. The algorithm is implemented in C++. The basic operation is to read x,y coordinates and perform some operations on them.
The idea I have right now is to store the x and y coordinates separately in two cache banks (set-associative). The two coordinates for a location should be stored in different banks so that both can be read in parallel, separate operations can be done on the x and y coordinates, and the combined result stored. Using vector operations, this could be sped up further to read up to 4 x coordinates and 4 y coordinates at the same time.
For example, to compute the Euclidean distance from the goal node, the x and y coordinates at each location have to be read and subtracted from the goal coordinates to find the distance.
I wish to know if there is an effective way (cache placement policy) to keep x and y coordinates in different cache lines/blocks to take advantage of parallelism. Is there any operation/encoding of the coordinates I can use to implement it?
P.S.: I am not looking for software optimizations but for a (theoretical) modified cache design to speed up the algorithm.
Ref: This blog post mentions that "L1 cache can process two accesses in parallel if they access cache lines from different banks, and serially if they belong to the same bank."
A coordinate held in a structure like this:
struct {
    int x;
    int y;
} coordinate;
will fit into a single cache line. In fact, on typical MIPS processors with a 32 byte cache line, 4 instances of coordinate will fit into a cache line. The CPU will read the entire line at once from a single bank or way in the data cache, and all 32 bytes, i.e. all 4 coordinates, will thenceforth be available in a 32-byte fill/store buffer (FSB) at zero delay to the pipeline. MIPS processors with caches typically have more than one FSB.
MIPS processors with banked or set-associative caches typically use an LRU algorithm to select which bank/way new data is placed into on a cache miss. It is not generally possible to ensure that data from a certain location always gets placed into a particular way in the cache.
All this is to say that your scheme of storing x and y coordinates in separate cache banks will not improve performance over a naive scheme because of the positive effects of the FSBs on the naive scheme and because of the unpredictable effects of the way selection algorithm for the cache.
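To make this concrete, here is a minimal sketch (my code, not the OP's; the struct and function name are just for illustration) of the Euclidean-distance loop from the question over an array of such coordinates. With this layout, each 32-byte line the cache fetches already holds 4 consecutive x/y pairs, so there is nothing to gain from placing x and y in separate banks:

#include <cmath>
#include <cstddef>

struct coordinate { int x; int y; };

// Distance to the goal for every node. With 8-byte elements and 32-byte
// cache lines, coords[i..i+3] share one line (when the array is line-aligned),
// so a single fill brings in 4 coordinate pairs at once.
void distances_to_goal(const coordinate* coords, std::size_t n,
                       coordinate goal, float* out)
{
    for (std::size_t i = 0; i < n; ++i) {
        float dx = static_cast<float>(coords[i].x - goal.x);
        float dy = static_cast<float>(coords[i].y - goal.y);
        out[i] = std::sqrt(dx * dx + dy * dy);  // Euclidean distance
    }
}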
I know that in AVX512 you can perform strided gathers with strides of 1, 2, 4, 8. However, what if I have an arbitrary stride that can be anywhere between 10 and 1000? The stride is known at compile time. I understand that the instruction itself won't be the bottleneck; memory probably will be. Is _mm512_set_ps the most effective way to do this?
strided gathers with strides of 1,2,4,8
No, there's no special support for that; maybe you're thinking of ARM/ARM64 NEON vld4 4-way deinterleave?
On x86 you can use 1, 2, 4, or 8 as a scale factor for an index vector for vpgatherdd / vpgatherdps, but if you just want every 2nd element it's better to manually shuffle (e.g. _mm512_permutex2var_ps to grab alternate floats from 2 input vectors), getting many useful elements with one wide load instead of accessing cache once per element.
But in your case, with a minimum stride of 10, at most 2 elements will come from the same 512-bit vector of 16 x 32-bit elements, and with wider strides not even one per vector.
So you can use vpgatherdps with _mm512_add_epi32(idx, _mm512_set1_epi32(16 * stride)) in a loop. Or better, just use a fixed vector of indices and increment the base pointer. You might generate that vector of indices with _mm512_mullo_epi32(_mm512_setr_epi32(0,1,2,3,...,15), _mm512_set1_epi32(stride)). Since a float is 4 bytes wide, use a scale factor of 4 with your gathers.
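A rough, untested sketch of that second option (fixed index vector, moving base pointer); the function name and loop structure are mine, the intrinsics are plain AVX-512F:

#include <immintrin.h>
#include <cstddef>

// Gather every `stride`-th float from `src` into a contiguous `dst` of `n`
// elements.
void gather_strided(const float* src, float* dst, std::size_t n, int stride)
{
    // Element indices 0, stride, 2*stride, ..., 15*stride.
    const __m512i idx = _mm512_mullo_epi32(
        _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15),
        _mm512_set1_epi32(stride));

    for (std::size_t i = 0; i + 16 <= n; i += 16) {
        // Scale factor 4 because a float is 4 bytes; the base pointer advances
        // by 16*stride elements per iteration instead of the indices growing.
        __m512 v = _mm512_i32gather_ps(idx, src, 4);
        _mm512_storeu_ps(dst + i, v);
        src += 16 * static_cast<std::size_t>(stride);
    }
    // Any remaining (< 16) output elements are left to a scalar or masked tail.
}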
Even if you need to handle huge arrays, incrementing the pointer instead of the vector elements avoids any need for 64-bit indices, as well as minimizing the number of vector uops. (Valuable when using 512-bit vectors on current CPUs.)
IIRC, Intel's optimization manual has a section about strided loads and the tradeoff between manual gather (contiguous loads + shuffles) and using gather instructions. Gather instructions become relatively better the wider your vectors are (2/clock load throughput but only 1/clock shuffle throughput for most shuffles), so especially for 512-bit vectors it's likely a win to use the gather instructions.
In the CSAPP textbook, the description of the memory mountain says that increasing the working-set size worsens temporal locality, but I feel like both the size and stride factors contribute to spatial locality only, since throughput decreases when the data is stored more sparsely across the lower-level caches.
Where is temporal locality in play here? As far as I know, it means the same specific memory address is referenced again in the near future as seen in this answer: What is locality of reference?
This graph is produced by sequentially traversing fixed-size elements of an array. The stride parameter specifies the number of elements to be skipped between two sequentially accessed elements. The size parameter specifies the total size of the array (including the elements that may be skipped). The main loop of the test looks like this (you can get the code from here):
for (i = 0; i < size / sizeof(double); i += stride*4) {
    acc0 = acc0 + data[i];
    acc1 = acc1 + data[i+stride];
    acc2 = acc2 + data[i+stride*2];
    acc3 = acc3 + data[i+stride*3];
}
That loop is shown in the book in Figure 6.40. What is not shown or mentioned in the book is that this loop is executed once to warm up the cache hierarchy and then memory throughput is measured for a number of runs. The minimum memory throughput of all the runs (on the warmed up cache) is the one that is plotted.
Both the size and stride parameters together affect temporal locality (but only the stride affects spatial locality). For example, the 32k-s0 configuration has a similar temporal locality as the 64k-s1 configuration because the first access and last access to every line are interleaved by the same number of cache lines. If you hold the size at a particular value and go along the stride axis, some lines that are repeatedly accessed at a lower stride would not be accessed at higher strides, making their temporal locality essentially zero. It's possible to define temporal locality formally, but I'll not do that to answer the question. On the other hand, if you hold the stride at a particular value and go along the size axis, temporal locality for each accessed line becomes smaller with higher sizes. However, performance deteriorates not because of the uniformly lower temporal locality of each accessed line, but because of the larger working set size.
I think the size axis better illustrates the impact of the size of the working set (the amount of memory the loop is going to access during its execution) on execution time than temporal locality. To observe the impact of temporal locality on performance, the memory throughput of the first run of this loop should be compared against that of the second run of the same loop (same size and stride). Temporal locality increases by the same amount for each accessed cache line in the second run of the loop and, if the cache hierarchy is optimized for temporal locality, the throughput of the second run should be better than that of the first. In general, the throughput of each of N sequential invocations of the same loop should be plotted to see the full impact of temporal locality, where N >= 2.
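A rough sketch of that experiment (my own code, not from the book): run the same strided-sum loop several times over the same array and print the throughput of each run. Run 0 is the cold run; runs 1 and later see a warmed-up cache:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Same access pattern as the Figure 6.40 loop, with 4 accumulators.
double sum_strided(const std::vector<double>& data, std::size_t stride)
{
    double acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (std::size_t i = 0; i + stride * 3 < data.size(); i += stride * 4) {
        acc0 += data[i];
        acc1 += data[i + stride];
        acc2 += data[i + stride * 2];
        acc3 += data[i + stride * 3];
    }
    return acc0 + acc1 + acc2 + acc3;
}

int main()
{
    const std::size_t size = 8 * 1024 * 1024 / sizeof(double); // 8 MiB array
    const std::size_t stride = 4;
    std::vector<double> data(size, 1.0);

    for (int run = 0; run < 4; ++run) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = sum_strided(data, stride); // keep the work alive
        auto t1 = std::chrono::steady_clock::now();
        (void)sink;
        double sec = std::chrono::duration<double>(t1 - t0).count();
        // One double is read per access, 4 accesses per iteration.
        double bytes = double(size / (stride * 4)) * 4 * sizeof(double);
        std::printf("run %d: %.1f MB/s\n", run, bytes / sec / 1e6);
    }
    return 0;
}

If the cache hierarchy exploits temporal locality and the array fits in some level of cache, the later runs should report noticeably higher throughput than run 0.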
By the way, memory mountains on other processors can be found here and here. You can create a 3D mountain plot using this or this script.
When we have a program that requires lots of operations over large data sets, and the operations on each of the data elements are independent, OpenCL can be a good choice for making it faster. I have a program like the following:
while( function(b,c)!=TRUE)
{
    [X,Y] = function1(BigData);
    M = functionA(X);
    b = function2(M);
    N = functionB(Y);
    c = function3(N);
}
Here function1 is applied to each of the elements of BigData and produces another two big data sets (X, Y). function2 and function3 are then applied individually to each of the elements of these X and Y data, respectively.
Since the operations of all the functions are applied to each element of the data sets independently, using the GPU might make it faster. So I came up with the following:
while( function(b,c)!=TRUE)
{
    //[X,Y] = function1(BigData);
    1. Load kernel1 and BigData on the GPU. Each thread will work on one of the
       data elements and save the result in X and Y on the GPU.
    //M = functionA(X);
    2a. Load kernel2 on the GPU. Each thread will work on one of the
        data elements of X and save the result in M on the GPU.
        (workItems=n1, workgroup size=y1)
    //b = function2(M);
    2b. Load kernel2 (same kernel) on the GPU. Each thread will work on
        one of the data elements of M and save the result in B on the GPU.
        (workItems=n2, workgroup size=y2)
    3. Read the data B into the host variable b.
    //N = functionB(Y);
    4a. Load kernel3 on the GPU. Each thread will work on one of the
        data elements of Y and save the result in N on the GPU.
        (workItems=n1, workgroup size=y1)
    //c = function3(N);
    4b. Load kernel3 (same kernel) on the GPU. Each thread will work
        on one of the data elements of N and save the result in C on the GPU.
        (workItems=n2, workgroup size=y2)
    5. Read the data C into the host variable c.
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run it on a GPU). And if the kernels need some sort of synchronization, it might end up even slower.
I also believe this workflow is fairly common. So what is the best practice for using OpenCL to speed up a program like this?
I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind of job size (tens? thousands? millions of arithmetic ops? how big are your data sets?) or what hardware (compute card? laptop iGPU?). "Significant overhead" can mean a lot of different things: 5ms? 1 second?
Intel, nVidia and AMD all publish optimisation guides - have you read these?
Suppose my kernel takes 4 (or 3, or 2) unrelated float or double args, or that I want to access 4 separate floats from global memory. Will this cause 4 separate global memory accesses? Is accessing a single vector of 4 floats or doubles faster than accessing 4 separate ones? If so, am I better off packing them into a single vector and then, say, using #defines to reference the individual members?
If this does increase the performance, do I have to do it myself, or might the compiler be smart enough to automatically convert 4 separate float reads into a single vector for me? Is this what "auto-vectorization" is? I've seen auto-vectorization mentioned in a few documents, without detailed explanation of exactly what it does, except that it seems to be an optional performance optimization for CPUs only, not GPUs.
Whether to use vectors depends on the kernel itself. If you need all four values at the same time (for example, at the start of the kernel or at the start of a loop), it's better to pack them, because they will be assigned during one read (values in a single vector are stored sequentially).
On the other hand, when you need only some of the values, you can speed up execution by reading only what you need.
Another case is when you read them one by one, with each read separated by some computation (i.e. giving the GPU some time to fetch the data).
Basically, these data reads behave like a buffer. If you have enough instances, the number of reads is the same (in the optimal case), and what really counts is how well those reads are used.
The compiler often unpacks these structures, so the only speedup is that all the variables are stored together: when you read, you fill them all with one read, and the rest of the buffer is used for another instance.
As an example, I will use a 128-bit-wide bus and 4 floats (32 bits each).
For vector data types: (32b * 4) / 128b = 1 instance/read
For scalar data types, there are N reads (N = number of variables), each read filling one variable for as many instances as there are fetched values:
128b / 32b = 4 instances/read
So in my example, if you have 4 instances, there will always be at least 4 reads no matter what; the only thing you can do about it is to hide the fetch time behind some computation, if that's even possible.
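Plugging the numbers above into a trivial calculation (my own illustration, same assumptions: 128-bit bus, 4 x 32-bit values per instance, 4 instances):

#include <cstdio>

int main()
{
    const int bus_bits = 128, value_bits = 32;
    const int vars_per_instance = 4, instances = 4;

    // Packed layout: one 128-bit read delivers all 4 variables of 1 instance.
    int packed_reads = instances * (vars_per_instance * value_bits) / bus_bits;

    // Scalar layout: one 128-bit read delivers 1 variable for 4 instances,
    // so with 4 instances you still need one read per variable.
    int scalar_reads = vars_per_instance * (instances * value_bits) / bus_bits;

    std::printf("packed: %d reads, scalar: %d reads for %d instances\n",
                packed_reads, scalar_reads, instances);   // prints 4 and 4
    return 0;
}

Both layouts come out to 4 reads for 4 instances; the difference is only in which values each read delivers and how well the fetch time can be hidden.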
I know the ideas of block and grid in CUDA, and I'm wondering if there is any well-written helper function that can help me determine the best block and grid size for any given 2D image.
For example, for the 512x512 image mentioned in this thread, the grid is 64x64 and the block is 8x8.
However, sometimes my input image may not be a power of 2; it may be 317x217 or something like that. In this case, maybe the grid should be 317x1 and the block should be 1x217.
So if I have an application that accepts an image from the user and uses CUDA to process it, how can it automatically determine the size and dimensions of the block and grid, given that the user can input an image of any size?
Is there any existing helper function or class that handles this problem?
Usually you want to choose the size of your blocks based on your GPU architecture, with the goal of maintaining 100% occupancy on the Streaming Multiprocessor (SM). For example, the GPUs at my school can run 1536 threads per SM, and up to 8 blocks per SM, but each block can only have up to 1024 threads in each dimension. So if I were to launch a 1D kernel on the GPU, I could max out a block with 1024 threads, but then only 1 block would be on the SM (66% occupancy). If I instead chose a smaller number, like 192 or 256 threads per block, then I could have 100% occupancy with 8 and 6 blocks respectively on the SM.
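As a quick sanity check of those numbers, a sketch that just re-does the arithmetic for the SM limits quoted above (1536 resident threads and 8 resident blocks per SM; the limits vary by architecture):

#include <algorithm>
#include <cstdio>

int main()
{
    const int max_threads_per_sm = 1536;  // resident threads per SM
    const int max_blocks_per_sm  = 8;     // resident blocks per SM
    const int block_sizes[] = {1024, 256, 192};

    for (int block : block_sizes) {
        int blocks = std::min(max_blocks_per_sm, max_threads_per_sm / block);
        double occ = 100.0 * blocks * block / max_threads_per_sm;
        std::printf("%4d threads/block -> %d block(s)/SM, %.1f%% occupancy\n",
                    block, blocks, occ);
    }
    return 0;
}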
Another thing to consider is the amount of memory that must be accessed vs the amount of computation to be done. In many imaging applications, you don't just need the value at a single pixel, rather you need the surrounding pixels as well. CUDA groups its threads into warps, which step through every instruction simultaneously (currently, there are 32 threads to a warp, though that may change). Making your blocks square generally minimizes the amount of memory that needs to be loaded vs the amount of computation that can be done, making the GPU more efficient. Likewise, blocks that are a power of 2 load memory more efficiently (if properly aligned with memory addresses) since CUDA loads memory lines at a time instead of by single values.
So for your example, even though it might seem more effective to have a grid that is 317x1 and blocks that are 1x217, your code will likely be more efficient if you launch blocks that are 16x16 on a grid that is 20x14, as it will lead to a better computation-to-memory ratio and better SM occupancy. This does mean, though, that you will have to check within the kernel to make sure the thread is not outside the picture before trying to access memory, something like
const int thread_id_x = blockIdx.x * blockDim.x + threadIdx.x;
const int thread_id_y = blockIdx.y * blockDim.y + threadIdx.y;
if (thread_id_x < pic_width && thread_id_y < pic_height)
{
    // Do stuff
}
Lastly, you can determine the lowest number of blocks you need in each grid dimension to completely cover your image with (N+M-1)/M, where N is the total number of threads needed in that dimension and M is the number of threads per block in that dimension.
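For example, a minimal helper along those lines (my own sketch; it only computes the grid dimensions, and the actual dim3/kernel launch is left out):

#include <cstdio>

// Smallest number of blocks that covers n threads with m threads per block.
int blocks_needed(int n, int m) { return (n + m - 1) / m; }

int main()
{
    const int width = 317, height = 217;   // image size from the question
    const int block_x = 16, block_y = 16;  // threads per block per dimension

    int grid_x = blocks_needed(width,  block_x);   // = 20
    int grid_y = blocks_needed(height, block_y);   // = 14

    std::printf("grid %dx%d of %dx%d blocks covers %dx%d threads\n",
                grid_x, grid_y, block_x, block_y,
                grid_x * block_x, grid_y * block_y);
    return 0;
}

The in-kernel bounds check shown above then discards the threads that fall outside the 317x217 image.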
It depends on how you deal with the image. If your threads only process each pixel separately, for example adding 3 to each pixel value, you can just assign one image dimension to your block size and the other to your grid size (just don't go out of range). But if you want to do something like a filter or erode, this kind of operation often needs to access the pixels near the center pixel, like a 3*3 or 9*9 neighborhood. Then the block should be 8*8 as you mentioned, or some other value. And you'd better use texture memory, because when threads access global memory, 32 threads in a block are always grouped into a warp at a time.
So there isn't a function like the one you described. The number of threads and blocks depends on how you process the data; it is not universal.