What is the best general purpose computing practice in OpenCL for iterative problems? - parallel-processing

When we have a program that performs many operations over large data sets, and the operations on each data element are independent, OpenCL can be a good choice for making it faster. I have a program like the following:
while (function(b, c) != TRUE)
{
    [X, Y] = function1(BigData);
    M = functionA(X);
    b = function2(M);
    N = functionB(Y);
    c = function3(N);
}
Here function1 is applied to each element of BigData and produces two other large data sets (X, Y). functionA and function2 are then applied element-wise along the X branch (X to M to b), and functionB and function3 along the Y branch (Y to N to c).
Since all of these functions operate on each element of their inputs independently, using a GPU might make this faster. So I came up with the following:
while (function(b, c) != TRUE)
{
    // [X, Y] = function1(BigData);
    1. Load kernel1 and BigData onto the GPU. Each thread works on one data
       element and saves its result into X and Y on the GPU.
    // M = functionA(X);
    2a. Load kernel2 onto the GPU. Each thread works on one data element of X
        and saves its result into M on the GPU.
        (workItems = n1, workgroup size = y1)
    // b = function2(M);
    2b. Load kernel2 (same kernel) onto the GPU. Each thread works on one data
        element of M and saves its result into B on the GPU.
        (workItems = n2, workgroup size = y2)
    3. Read the data B back into host variable b.
    // N = functionB(Y);
    4a. Load kernel3 onto the GPU. Each thread works on one data element of Y
        and saves its result into N on the GPU.
        (workItems = n1, workgroup size = y1)
    // c = function3(N);
    4b. Load kernel3 (same kernel) onto the GPU. Each thread works on one data
        element of N and saves its result into C on the GPU.
        (workItems = n2, workgroup size = y2)
    5. Read the data C back into host variable c.
}
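To make the scheme concrete, a minimal host-side sketch of steps 1-5 could look like the following (hypothetical kernel, buffer and argument-index names, no error checking; function() stands in for the host-side convergence test). The key property is that X, Y, M and N never leave the device; only the small B and C results are copied back each iteration.

#include <CL/cl.h>

extern int function(const float *b, const float *c);  /* host-side loop test */

void iterate(cl_command_queue q,
             cl_kernel kernel1, cl_kernel kernel2, cl_kernel kernel3,
             cl_mem dBigData, cl_mem dX, cl_mem dY,
             cl_mem dM, cl_mem dN, cl_mem dB, cl_mem dC,
             size_t n1, size_t y1, size_t n2, size_t y2,
             float *b, float *c, size_t resultBytes)
{
    while (!function(b, c)) {
        /* 1. [X, Y] = function1(BigData) */
        clSetKernelArg(kernel1, 0, sizeof(cl_mem), &dBigData);
        clSetKernelArg(kernel1, 1, sizeof(cl_mem), &dX);
        clSetKernelArg(kernel1, 2, sizeof(cl_mem), &dY);
        clEnqueueNDRangeKernel(q, kernel1, 1, NULL, &n1, &y1, 0, NULL, NULL);

        /* 2a. M = functionA(X)   2b. B = function2(M) */
        clSetKernelArg(kernel2, 0, sizeof(cl_mem), &dX);
        clSetKernelArg(kernel2, 1, sizeof(cl_mem), &dM);
        clEnqueueNDRangeKernel(q, kernel2, 1, NULL, &n1, &y1, 0, NULL, NULL);
        clSetKernelArg(kernel2, 0, sizeof(cl_mem), &dM);
        clSetKernelArg(kernel2, 1, sizeof(cl_mem), &dB);
        clEnqueueNDRangeKernel(q, kernel2, 1, NULL, &n2, &y2, 0, NULL, NULL);

        /* 4a. N = functionB(Y)   4b. C = function3(N) */
        clSetKernelArg(kernel3, 0, sizeof(cl_mem), &dY);
        clSetKernelArg(kernel3, 1, sizeof(cl_mem), &dN);
        clEnqueueNDRangeKernel(q, kernel3, 1, NULL, &n1, &y1, 0, NULL, NULL);
        clSetKernelArg(kernel3, 0, sizeof(cl_mem), &dN);
        clSetKernelArg(kernel3, 1, sizeof(cl_mem), &dC);
        clEnqueueNDRangeKernel(q, kernel3, 1, NULL, &n2, &y2, 0, NULL, NULL);

        /* 3. and 5. Read back only the small results needed for the loop test. */
        clEnqueueReadBuffer(q, dB, CL_TRUE, 0, resultBytes, b, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dC, CL_TRUE, 0, resultBytes, c, 0, NULL, NULL);
    }
}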
However, the overhead involved in this scheme seems significant to me (I have implemented a test program and run it on a GPU), and if the kernels need any sort of synchronization it might end up even slower.
I also believe this workflow is fairly common. So what is the best practice for using OpenCL to speed up a program like this?

I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
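For instance, since every stage in the question is described as element-wise, the X/Y/M/N intermediates need not exist in global memory at all: a single fused kernel can compute B and C directly from BigData. This is only a sketch, assuming the stages really are independent per element; f1x, f1y, fA, f2, fB and f3 are made-up placeholders for the real per-element work.

// OpenCL C sketch: one launch replaces kernels 1, 2a/2b and 4a/4b, and the
// intermediates x, y, m, n live only in registers.

float f1x(float v) { return 0.5f * v; }   /* placeholder for part of function1 */
float f1y(float v) { return v + 1.0f; }   /* placeholder for part of function1 */
float fA (float v) { return v * v; }      /* placeholder for functionA */
float f2 (float v) { return v - 1.0f; }   /* placeholder for function2 */
float fB (float v) { return 2.0f * v; }   /* placeholder for functionB */
float f3 (float v) { return v + 3.0f; }   /* placeholder for function3 */

__kernel void fused_step(__global const float *bigData,
                         __global float *B,
                         __global float *C)
{
    size_t i = get_global_id(0);

    float x = f1x(bigData[i]);   /* [X, Y] = function1(BigData) */
    float y = f1y(bigData[i]);

    float m = fA(x);             /* M = functionA(X) */
    B[i] = f2(m);                /* b = function2(M) */

    float n = fB(y);             /* N = functionB(Y) */
    C[i] = f3(n);                /* c = function3(N) */
}

Moving the outer while loop inside the kernel as well is only possible if the termination test on b and c can be evaluated on the device (e.g. via an on-device reduction), which may or may not fit your problem.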
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
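As a rough illustration of the access-pattern point (a sketch, not taken from the question's code):

// Coalesced: consecutive work items read consecutive elements, so a
// wavefront's loads merge into a few wide memory transactions.
__kernel void coalesced(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = 2.0f * in[i];
}

// Strided: neighbouring work items hit different cache lines, so most of
// each fetched line (and much of your memory bandwidth) is wasted.
__kernel void strided(__global const float *in, __global float *out, int stride)
{
    size_t i = get_global_id(0);
    out[i * stride] = 2.0f * in[i * stride];
}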
You don't specify what kind of runtimes you're getting, on what kind of job size (tens? thousands? millions of arithmetic ops? how big are your data sets?) or on what hardware (compute card? laptop iGPU?). "Significant overhead" can mean a lot of different things: 5 ms? 1 second?
Intel, NVIDIA and AMD all publish optimisation guides - have you read them?

Related

What could be the causes of this performance regression, and how to investigate it?

Context
I'm writing some high-performance code for ARM64 using NEON SIMD instructions, which I am trying to further optimize. I only use integer operations, no floating-point. This code is fully CPU- or memory-bound: it does not perform system calls or I/O of any kind (filesystem, networking, or anything else). The code is single-threaded by design -- any parallelism should be handled by calling the code from different CPUs with different arguments. The data working set should be small enough to fit in my CPU's L1 D-cache, and if it overflows a little, it will definitely fit in L2 with lots of space to spare.
My development environment is an Apple laptop with the M1 processor, running macOS; as such, the prime choice for a performance investigation tool is Apple's Instruments. I know VTune has some more advanced features such as top-down microarchitecture analysis, but evidently this isn't available for ARM.
The problem
I had an idea that, at a high level, works like this: a certain function f(x, y) can be broken down into two functions g() and h(). I can calculate x2 = g(x), y2 = g(y) and then h(x2, y2), obtaining the same result as f(x, y). It turns out that I compute f() many times with different combinations of the same input arguments. By applying all these inputs to g() and caching their outputs, I can call h() directly with the cached values and save the time spent recomputing the g() part of f().
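In code, the transformation looks roughly like this (f, g, h and vec_t are trivial stand-ins with the same shape as my real NEON-based functions and types, not the actual code):

#include <stdint.h>

typedef struct { int32_t lane[4]; } vec_t;   /* placeholder for the real type */

/* Stand-ins chosen so that f(x, y) == h(g(x), g(y)) holds by construction. */
static vec_t g(vec_t x)
{
    for (int i = 0; i < 4; i++) x.lane[i] *= 3;   /* "expensive" preprocessing */
    return x;
}

static vec_t h(vec_t x2, vec_t y2)
{
    vec_t r;
    for (int i = 0; i < 4; i++) r.lane[i] = x2.lane[i] + y2.lane[i];
    return r;
}

static vec_t f(vec_t x, vec_t y) { return h(g(x), g(y)); }

/* Before: every call to f() redoes the g() work for its arguments. */
static void before(vec_t x, vec_t y, vec_t z, vec_t *r1, vec_t *r2)
{
    *r1 = f(x, y);
    *r2 = f(x, z);
}

/* After: apply g() once per distinct input and reuse the cached results. */
static void after(vec_t x, vec_t y, vec_t z, vec_t *r1, vec_t *r2)
{
    vec_t x2 = g(x), y2 = g(y), z2 = g(z);   /* cached g() outputs */
    *r1 = h(x2, y2);
    *r2 = h(x2, z2);
}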
Benchmarks
I confirmed the basic idea is sound by microbenchmarking with Google Benchmark. If f() takes 100 X (where X is some arbitrary unit of time), then each call to g() takes 14 X and a call to h() takes 78 X. Calling g() twice and then h() is slower than a single f(), but suppose I need to compute f(x, y) and f(x, z), which would ordinarily take 200 X. I can instead compute x2 = g(x), y2 = g(y) and z2 = g(z), taking 3*14 = 42 X, and then h(x2, y2) and h(x2, z2), taking 2*78 = 156 X. In total I spend 42 + 156 = 198 X, which is already less than 200 X, and the savings add up for larger examples, up to a maximum of 22%, since that is how much less h() costs compared to f() (assuming I compute h() much more often than g()). This would represent a significant speedup for my application.
I proceeded to test this idea on a more realistic example: I have some code which does a bunch of things, plus 3 calls to f() which, among themselves, use combinations of the same 2 arguments. So, I replace 3 calls to f() by 2 calls to g() and 3 calls to h(). The benchmarks above indicate this should reduce execution time by 3*100 - 2*14 - 3*78 = 38 X. However, benchmarking the modified code shows that execution time increases by ~700 X!
I tried replacing each call to f() individually with 2 calls to g() for its arguments and a call to h(). This should increase execution time by 2*14 + 78 - 100 = 6 X, but instead, execution time increases by 230 X (not coincidentally, approximately 1/3 of 700 X).
Performance counter results using Apple Instruments
To bring some data to the discussion, I ran both codes under Apple Instruments using the CPU counters template, monitoring some performance counters I thought might be relevant.
For reference, the original code executes in 7.6 seconds (considering only number of iterations times execution time per iteration, i.e. disregarding Google Benchmark overhead), whereas the new code executes in 9.4 seconds; i.e. a difference of 1.8 seconds. Both versions use the exact same number of iterations and work on the same input, producing the same output. The code runs on the M1's performance core, which I assume is running at its maximum 3.2 GHz clock speed.
Parameter                        Original code        New code
Total cycles                     22,199,155,777       27,510,276,704
MAP_DISPATCH_BUBBLE              78,611,658           6,438,255,204
L1D_CACHE_MISS_LD                892,442              1,808,341
L1D_CACHE_MISS_ST                2,163,402            4,830,661
L1I_CACHE_MISS_DEMAND            2,620,793            7,698,674
INST_SIMD_ALU                    79,448,291,331       78,253,076,740
INST_SIMD_LD                     17,254,640,147       16,867,679,279
INST_SIMD_ST                     14,169,912,790       14,029,275,120
INST_INT_ALU                     4,512,600,211        4,379,585,445
INST_INT_LD                      550,965,752          546,134,341
INST_INT_ST                      455,541,070          455,298,056
INST_ALL                         119,683,934,968      118,972,558,207
MAP_STALL_DISPATCH               6,307,551,337        5,470,291,508
SCHEDULE_UOP                     116,252,941,232      113,882,670,763
MAP_REWIND                       16,293,616           11,787,119
FLUSH_RESTART_OTHER_NONSPEC      58,616               90,955
FETCH_RESTART                    27,417,457           28,119,690
BRANCH_MISPRED_NONSPEC           432,761              465,697
L1I_TLB_MISS_DEMAND              754,161              1,492,705
L2_TLB_MISS_INSTRUCTION          485,702              1,217,474
MMU_TABLE_WALK_INSTRUCTION       486,812              1,219,082
BRANCH_MISPRED_NONSPEC           377,750              440,382
INST_BRANCH                      1,194,614,553        1,151,040,641
Instruments won't let me add all these counters to the same run, so some results are from different runs. However, since the code is fully deterministic and runs the same number of iterations, any differences between runs should be just random noise.
EDIT: playing around with Instruments, I found one performance counter that has wildly differing values between the original code and the new code, which is MAP_DISPATCH_BUBBLE. Still doing research on what it means, whether it might explain the issues I'm seeing, and how to work around this.
EDIT 2: I decided to test this code on other ARM processors I have access to (Cortex-X2 and Cortex-A72). On the Cortex-X2, both versions perform identically, and on the Cortex-A72, there was a small (~1.5%) increase in performance with the new code. So I'm more inclined than ever to believe that I hit an M1 front-end bottleneck.
Hypotheses and data analysis
Having faced performance problems with this code base before, a few ideas sprang to mind:
Memory alignment: SIMD code is sometimes sensitive to memory alignment, particularly for memory-bound code, which I suspect my code may be. However, adding or removing __attribute__((aligned(64))) made no difference, so I don't think that's it.
D-cache misses: the new code allocates some new arrays to cache the output of g(), so it might lead to more cache misses. And indeed there are 3.6 million more L1 D-cache misses (load + store) than in the original code. However, as mentioned at the beginning, the working set easily fits into L2. Assuming a 10-cycle penalty per L1 miss that hits in L2, that's only 36 million cycles; at 3.2 GHz, that's roughly 11 ms, i.e. well under 1% of the observed 1.8 s difference.
I-cache misses: a similar situation: there are an extra 5.1 million L1 I-cache misses, but at a 10-cycle cost we're looking at roughly 16 ms, again under 1% of the observed difference.
Inlining/unrolling: I employ aggressive inlining and loop unrolling in my code, as well as LTO and unity builds, since performance is the #1 priority and code size is irrelevant (unless it affects performance via e.g. I-cache misses). I considered the possibility that the new code might be inlined/unrolled less aggressively because the compiler hits some heuristic for maximum code size. This could result in more instructions being executed, such as compares/branches for loops, and CALL/RET plus prologues/epilogues for function calls. However, the table shows that the new code executes slightly fewer instructions of each kind (as I would expect), and fewer in total (INST_ALL).
Somehow, the original code simply achieves a higher IPC, and I have no idea why. Also, to be clear: both versions perform the same operation using the same algorithm. What I did was basically to split the code for f() (a bunch of calls to other subroutines) between g() and h().
The question
This brings me to my question: what could possibly be making the new code run slower than the old code? What other performance counters could I look at in Instruments to give me insight into this issue?
Beyond answers to this specific question, I'm looking for general advice on how to approach similar problems in the future. I've found some books about debugging performance problems, but they generally fall into two camps. The first just describes the profiling process I'm familiar with: find out which functions take the longest to execute and optimize them. The second is represented by books like Systems Performance: Enterprise and the Cloud and The Every Computer Performance Book, and is closer to what I'm looking for; however, they focus on system-level issues like I/O, kernel calls, etc., whereas the code I write is CPU- and maybe memory-bound, with many opportunities to convert to SIMD, and no interaction with the outside world. Basically, I'd like to know how to design meaningful experiments using a profiler and CPU performance counters (cycle counters, cache misses, instructions executed by type such as ALU, memory, etc.) to solve these kinds of performance issues when they arise.

Could the "reduce" function be parallelized in Functional Programming?

In Functional Programming, one benefit of the map function is that it could be implemented to be executed in parallel.
So on 4-core hardware, this code plus a parallel implementation of map would allow the four values to be processed at the same time.
let numbers = [0,1,2,3]
let increasedNumbers = numbers.map { $0 + 1 }
Fine, now let's talk about the reduce function. From the documentation:
Return the result of repeatedly calling combine with an accumulated
value initialized to initial and each element of self, in turn, i.e.
return combine(combine(...combine(combine(initial, self[0]),
self[1]),...self[count-2]), self[count-1]).
My question: could the reduce function be implemented so to be executed in parallel?
Or, by definition, it is something that can only be executed sequentially?
Example:
let sum = numbers.reduce(0) { $0 + $1 }
One of the most common reductions is the sum of all elements.
((a+b) + c) + d == (a + b) + (c+d) # associative
a+b == b+a # commutative
That equality works for integers, so you can change the order of operations from one long dependency chain to multiple shorter dependency chains, allowing multithreading and SIMD parallelism.
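For example, a sum can be reassociated into independent partial sums over contiguous chunks, each of which could be handed to its own thread (a sketch; with integers the result is exact, with floats the rounding changes slightly):

#include <stddef.h>

// Instead of one long serial chain ((((a[0]+a[1])+a[2])+a[3])+...),
// compute four independent partial sums and combine them at the end.
long sum_chunked(const int *a, size_t n)
{
    long partial[4] = {0, 0, 0, 0};
    size_t chunk = n / 4;
    for (int c = 0; c < 4; c++)                              // each chunk is independent,
        for (size_t i = c * chunk; i < (c + 1) * chunk; i++) // so it could run on its own thread
            partial[c] += a[i];
    long total = partial[0] + partial[1] + partial[2] + partial[3];
    for (size_t i = 4 * chunk; i < n; i++)                   // leftover if n % 4 != 0
        total += a[i];
    return total;
}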
It's also true for mathematical real numbers, but not for floating point numbers. In many cases, catastrophic cancellation is not expected, so the final result will be close enough to be worth the massive performance gain. For C/C++ compilers, this is one of the optimizations enabled by the -ffast-math option. (There's a -fassociative-math option for just this part of -ffast-math, without the assumptions about lack of infinities and NaNs.)
It's hard to get much SIMD speedup if one wide load can't scoop up multiple useful values. Intel's AVX2 added "gathered" loads, but the overhead is very high. With Haswell, it's typically faster to just use scalar code, but later microarchitectures do have faster gathers. So SIMD reduction is much more effective on arrays, or other data that is stored contiguously.
Modern SIMD hardware works by loading 2 consecutive double-precision floats into a vector register (for example, with 16B vectors like x86's SSE). There is a packed-FP-add instruction that adds the corresponding elements of two vectors. So-called "vertical" vector operations (where the same operation happens between corresponding elements of two vectors) are much cheaper than "horizontal" operations (adding the two doubles within one vector to each other).
So at the asm level, you have a loop that sums all the even-numbered elements into one half of a vector accumulator, and all the odd-numbered elements into the other half. Then one horizontal operation at the end combines them. So even without multithreading, using SIMD requires associative operations (or at least, close enough to associative, like floating point usually is). If there's an approximate pattern in your input, like +1.001, -0.999, the cancellation errors from adding one big positive to one big negative number could be much worse than if each cancellation had happened separately.
With wider vectors, or narrower elements, a vector accumulator will hold more elements, increasing the benefit of SIMD.
Modern hardware has pipelined execution units that can sustain one (or sometimes two) FP vector-adds per clock, but the result of each one isn't ready for about 5 cycles. Saturating the hardware's throughput therefore requires multiple accumulators in the loop, so there are 5 or 10 separate loop-carried dependency chains. To be concrete, Intel Skylake does vector-FP multiply, add, or FMA (fused multiply-add) with 4c latency and one per 0.5c throughput: 4c / 0.5c = 8 FP additions need to be in flight at once to saturate Skylake's FP math units. Each operation can be a 32B vector of eight single-precision floats or four double-precision floats, a 16B vector, or a scalar. (Keeping multiple operations in flight can speed up scalar code too, but if there's any data-level parallelism available you can probably vectorize it as well as use multiple accumulators.) See http://agner.org/optimize/ for x86 instruction timings, pipeline descriptions, and asm optimization material. Note that everything here applies equally to ARM with NEON, PPC Altivec, and other SIMD architectures: they all have vector registers and similar vector instructions.
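A scalar C sketch of the multiple-accumulator idea (illustration only, not the gcc/clang output discussed below; a vectorizing compiler would additionally map each accumulator onto a vector register):

#include <stddef.h>

// Four independent loop-carried dependency chains instead of one, so the
// pipelined FP adder keeps more additions in flight instead of waiting out
// its full latency between dependent adds. Note this reassociates the FP
// sums, which is exactly what -ffast-math / -fassociative-math permit.
float sum_multi_acc(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)              // tail elements
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);   // combine the partial sums at the end
}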
For a concrete example, here's how gcc 5.3 auto-vectorizes an FP sum reduction: it uses only a single accumulator, so it misses out on a factor of 8 in throughput on Skylake. clang is a bit more clever and uses two accumulators (though not as many as its loop unroll factor), getting 1/4 of Skylake's max throughput. Note that if you take -ffast-math out of the compile options, the FP loop uses addss (add scalar single) rather than addps (add packed single). The integer loop still auto-vectorizes, because integer math is associative.
In practice, memory bandwidth is the limiting factor most of the time. Haswell and later Intel CPUs can sustain two 32B loads per cycle from L1 cache. In theory, they could sustain that from L2 cache. The shared L3 cache is another story: it's a lot faster than main memory, but its bandwidth is shared by all cores. This makes cache-blocking (aka loop tiling) for L1 or L2 a very important optimization when it can be done cheaply, when working with more than 256k of data. Rather than producing and then reducing 10MiB of data, produce in 128k chunks and reduce them while they're still in L2 cache instead of the producer having to push them to main memory and the reducer having to bring them back in. When working in a higher level language, your best bet may be to hope that the implementation does this for you. This is what you ideally want to happen in terms of what the CPU actually does, though.
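A sketch of the produce-in-chunks idea (sizes are illustrative; produce() is a hypothetical stand-in for whatever map step generates the data being reduced):

#include <stddef.h>

#define CHUNK (32 * 1024)            // 32K floats = 128 KiB, sized to fit in L2

static float produce(size_t i) { return (float)i * 0.5f; }   // placeholder producer

// Instead of materializing all n elements and then reducing them (forcing a
// round trip through L3/DRAM), produce and reduce one L2-sized chunk at a
// time while it is still hot in cache.
float produce_then_reduce_blocked(size_t n)
{
    static float buf[CHUNK];
    float total = 0.0f;
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t len = (n - base < CHUNK) ? (n - base) : CHUNK;
        for (size_t i = 0; i < len; i++)      // produce one chunk
            buf[i] = produce(base + i);
        for (size_t i = 0; i < len; i++)      // reduce it while it's in cache
            total += buf[i];
    }
    return total;
}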
Note that all the SIMD speedup stuff applies within a single thread operating on a contiguous chunk of memory. You (or the compiler for your functional language!) can and should use both techniques, to have multiple threads each saturating the execution units on the core they're running on.
Sorry for the lack of functional-programming in this answer. You may have guessed that I saw this question because of the SIMD tag. :P
I'm not going to try to generalize from addition to other operations. IDK what kind of stuff you functional-programming guys get up to with reductions, but addition or compare (find min/max, count matches) are the ones that get used as SIMD-optimization examples.
There are some compilers for functional programming languages that parallelize the reduce and map functions. This is an example from the Futhark programming language, which compiles into parallel CUDA and OpenCL source code:
let main (x: []i32) (y: []i32): i32 =
reduce (+) 0 (map2 (*) x y)
It may be possible to write a compiler that would translate a subset of Haskell into Futhark, though this hasn't been done yet. The Futhark language does not allow recursive functions, but they may be implemented in a future version of the language.

Minimizing global memory reads in OpenCL with vectors?

Suppose my kernel takes 4 (or 3, or 2) unrelated float or double args, or that I want to access 4 separate floats from global memory. Will this cause 4 separate global memory accesses? Is accessing a single vector of 4 floats or doubles faster than accessing 4 separate ones? If so, am I better off packing them into a single vector and then, say, using #defines to reference the individual members?
If this does increase the performance, do I have to do it myself, or might the compiler be smart enough to automatically convert 4 separate float reads into a single vector for me? Is this what "auto-vectorization" is? I've seen auto-vectorization mentioned in a few documents, without detailed explanation of exactly what it does, except that it seems to be an optional performance optimization for CPUs only, not GPUs.
Whether to use vectors depends on the kernel itself. If you need all four values at the same time (for example at the start of the kernel, or at the start of a loop), it's better to pack them, because they will be filled by a single read (the values in one vector are stored sequentially).
On the other hand, when you only need some of the values, you can speed up execution by reading only what you need.
Another case is when you read them one by one, with each read separated by some computation (i.e. giving the GPU time to fetch the data).
Basically, these data reads behave like a buffer: if you have enough work instances in flight, the total number of reads is the same (in the optimal case), and what really counts is how well those reads are used.
The compiler often unpacks such structures anyway, so the only gain is that your variables are stored contiguously: one read fills all of them for one instance, and the rest of the fetched line serves another instance.
As an example, take a 128-bit wide bus and 4 floats (32 bits each). Packed:
128b / (4 * 32b) = 1 instance per read
For scalar data types there are N reads (N = number of variables), each read filling one variable in several instances:
128b / 32b = 4 instances per read
So in this example, with 4 instances there will always be at least 4 reads no matter what, and the only thing you can do about it is hide the fetch time behind computation, if that's even possible.
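To make the packing point concrete, here is a rough OpenCL C sketch (not from the question) of the two layouts:

// Packed: four values per work item stored as one float4, so a single
// 128-bit read fills all four for this instance.
__kernel void packed(__global const float4 *in, __global float *out)
{
    size_t i = get_global_id(0);
    float4 v = in[i];                        // one read
    out[i] = v.x + v.y + v.z + v.w;
}

// Scalar: the same values kept in four separate arrays. Each work item
// issues four reads, but every read also fetches neighbouring work items'
// values, so with enough instances in flight the total read count is
// similar; what matters is how much of each fetched line actually gets used.
__kernel void scalar4(__global const float *a, __global const float *b,
                      __global const float *c, __global const float *d,
                      __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i] + c[i] + d[i];      // four reads
}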

Why is MPI slower on my laptop?

I am running MPI on my laptop (Intel i7 quad-core 4700m, 12 GB RAM) and the efficiency drops even for codes that involve no inter-process communication. Obviously I cannot just throw 100 processes at it since my machine is only quad-core, but I thought it should scale well up to 8 processes (an Intel quad core presents itself as 8 logical cores?). For example, consider the simple toy Fortran code:
program test
   use mpi
   implicit none
   integer, parameter :: root = 0
   integer :: ierr, rank, nproc, tt, i
   integer :: n = 100000
   real :: s = 0.0, tstart, tend
   ! nproc below is replaced by hand with the number of processes (see note)
   complex, dimension(100000/nproc) :: u = 2.0, v = 0.0

   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)

   call cpu_time(tstart)
   do tt = 1, 200000
      v = 0.0
      do i = 1, 100000/nproc
         v(i) = v(i) + 0.1*u(i)
      enddo
   enddo
   call cpu_time(tend)

   if (rank == root) then
      print *, 'total time was: ', tend - tstart
   endif

   call MPI_FINALIZE(ierr)
end program test
For 2 processes it takes half the time, but already with 4 processes (shouldn't it be a quarter of the time?) the scaling becomes less efficient, and with 8 processes there is no improvement whatsoever. Basically I am wondering whether this is just because I am running on a laptop and it has something to do with shared memory, or whether I am making some fundamental mistake in my code. Thanks.
Note: In the above example I manually change the nproc in the array declaration and the inner loop to be equal to the number of processors I am using.
A quad-core processor with hyperthreading shows itself as having 8 hardware threads, but physically there are only 4 cores; the other 4 threads are scheduled by the hardware itself, filling free slots in the execution pipelines.
It turns out that, especially with compute-intensive loads, this approach often does not pay off at all, and can even be counter-productive under heavy load because of overheads and less-than-optimal cache usage.
You can try disabling hyperthreading in the BIOS and comparing: you will then have just 4 threads on 4 cores.
Even going from 1 to 4 processes there are resources in competition. In particular, each core has its own L1 cache (and, on this CPU, its own 256 KB L2 cache), while all 4 cores share the L3 cache.
And all the cores obviously share the memory channels.
So you cannot expect linear scaling as you occupy more and more cores, since they have to share resources that are dedicated to a single core/thread in the sequential case.
All of this without involving communication at all.
The same behaviour shows up on desktops and servers, in particular for memory-intensive loads like the one in your test case.
For example, it's less evident with matrix-matrix multiplication, which is compute-intensive: for an NxN matrix you have O(N^2) memory accesses but O(N^3) floating-point operations.

Does the hardware-prefetcher benefit in this memory access pattern?

I have two arrays: A with N_A random integers and B with N_B random integers between 0 and (N_A - 1). I use the numbers in B as indices into A in the following loop:
for (i = 0; i < N_B; i++) {
    sum += A[B[i]];
}
Experimenting on an Intel i7-3770, N_A = 256 million, N_B = 64 million, this loop takes only .62 seconds, which corresponds to a memory access latency of about 9 nanoseconds.
As this latency is too small, I was wondering if the hardware prefetcher is playing a role. Can someone offer an explanation?
The HW prefetcher can see through your first level of indirection (B[i]), since those elements are sequential. It's capable of issuing multiple prefetches ahead, so you can assume that the average access into B hits the caches (either L1 or L2). However, there's no way the prefetcher can predict the random addresses stored in B and prefetch the corresponding elements of A; you still pay a memory access on almost every access to A (disregarding occasional lucky cache hits due to reuse of lines).
The reason you see such low latency is that the accesses into A are not serialized: the CPU can access multiple elements of A simultaneously, so the time doesn't simply accumulate. In effect you are measuring memory bandwidth here (how long it takes to access 64M elements overall), not memory latency (how long it takes to access a single element).
A reasonable "snapshot" of the CPU's memory unit would show several outstanding requests: a few accesses into B (with 4-byte ints, each request fetches a 64-byte line covering 16 elements, so the intermediate accesses simply merge), most of them prefetches for future values of i, intermixed with random accesses to elements of A according to the previously fetched elements of B.
To measure latency, you need each access to depend on the result of the previous one, e.g. by making the content of each element of A the index of the next access.
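A sketch of that dependent-load pattern (assumes A has been filled with a random cyclic permutation of 0..N_A-1):

#include <stddef.h>

// Each load's address comes from the value returned by the previous load,
// so the accesses are fully serialized and the loop time divided by n_loads
// approximates memory latency instead of bandwidth.
size_t pointer_chase(const size_t *A, size_t n_loads)
{
    size_t idx = 0;
    for (size_t i = 0; i < n_loads; i++)
        idx = A[idx];       // next address unknown until this load completes
    return idx;             // returning idx keeps the loop from being optimized away
}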
The CPU charges ahead in the instruction stream and will juggle multiple outstanding loads at once. The stream looks like this:
load b[0]
load a[b[0]]
add
loop code
load b[1]
load a[b[1]]
add
loop code
load b[2]
load a[b[2]]
add
loop code
...
The iterations are only serialized by the loop code, which runs quickly. All loads can run concurrently. Concurrency is just limited by how many loads the CPU can handle.
I suspect you wanted to benchmark random, unpredictable, serialized memory loads. This is actually pretty hard on a modern CPU. Try to introduce an unbreakable dependency chain:
int lastLoad = 0;
for (i = 0; i < N_B; i++) {
    int load = A[B[i] + (lastLoad & 1)];  // be sure to make A one element bigger
    sum += load;
    lastLoad = load;
}
This forces the previous load to complete before the address of the next load can be computed.
