Context
I'm writing some high-performance code for ARM64 using NEON SIMD instructions, which I am trying to further optimize. I only use integer operations, no floating-point. This code is fully CPU- or memory-bound: it does not perform system calls or I/O of any kind (filesystem, networking, or anything else). The code is single-threaded by design -- any parallelism should be handled by calling the code from different CPUs with different arguments. The data working set should be small enough to fit in my CPU's L1 D-cache, and if it overflows a little, it will definitely fit in L2 with lots of space to spare.
My development environment is an Apple laptop with the M1 processor, running macOS; as such, the prime choice for a performance investigation tool is Apple's Instruments. I know VTune has some more advanced features such as top-down microarchitecture analysis, but evidently this isn't available for ARM.
The problem
I had an idea that, at a high level, works like this: a certain function f(x, y) can be broken down into two functions g() and h(). I can calculate x2 = g(x), y2 = g(y) and then h(x2, y2), obtaining the same result as f(x, y). However, it turns out that I compute f() many times with different combinations of the same input arguments. By applying all these inputs to g() and caching their outputs, I can directly call the output of h()with these cached values and save some time recomputing theg()-part of f()`.
Benchmarks
I confirmed the basic idea is sound by microbenchmarking with Google Benchmark. If f() takes 100 X (where X is some arbitrary unit of time), then each call to g() takes 14 X, and a call to h() takes 78 X. While it's longer to call g() twice then h() rather than f(), suppose I need to compute f(x, y) and f(x, z), which would ordinarily take 200 X. I can instead compute x2 = g(x), y2 = g(y) and z2 = g(z), taking 3*14 = 42 X, and then h(x2, y2) and h(x2, z2), taking 2*78 = 156 X. In total, I spend 156 + 42 = 198 X, which is already less than 200 X, and the savings would add up for larger examples, up to maximum of 22%, since this is how much less h() costs compared to f() (assuming I compute h() much more often than g()). This would represent a significant speedup for my application.
I proceeded to test this idea on a more realistic example: I have some code which does a bunch of things, plus 3 calls to f() which, among themselves, use combinations of the same 2 arguments. So, I replace 3 calls to f() by 2 calls to g() and 3 calls to h(). The benchmarks above indicate this should reduce execution time by 3*100 - 2*14 - 3*78 = 38 X. However, benchmarking the modified code shows that execution time increases by ~700 X!
I tried replacing each call to f() individually with 2 calls to g() for its arguments and a call to h(). This should increase execution time by 2*14 + 78 - 100 = 6 X, but instead, execution time increases by 230 X (not coincidentally, approximately 1/3 of 700 X).
Performance counter results using Apple Instruments
To bring some data to the discussion, I ran both codes under Apple Instruments using the CPU counters template, monitoring some performance counters I thought might be relevant.
For reference, the original code executes in 7.6 seconds (considering only number of iterations times execution time per iteration, i.e. disregarding Google Benchmark overhead), whereas the new code executes in 9.4 seconds; i.e. a difference of 1.8 seconds. Both versions use the exact same number of iterations and work on the same input, producing the same output. The code runs on the M1's performance core, which I assume is running at its maximum 3.2 GHz clock speed.
Parameter
Original code
New code
Total cycles
22,199,155,777
27,510,276,704
MAP_DISPATCH_BUBBLE
78,611,658
6,438,255,204
L1D_CACHE_MISS_LD
892,442
1,808,341
L1D_CACHE_MISS_ST
2,163,402
4,830,661
L1I_CACHE_MISS_DEMAND
2,620,793
7,698,674
INST_SIMD_ALU
79,448,291,331
78,253,076,740
INST_SIMD_LD
17,254,640,147
16,867,679,279
INST_SIMD_ST
14,169,912,790
14,029,275,120
INST_INT_ALU
4,512,600,211
4,379,585,445
INST_INT_LD
550,965,752
546,134,341
INST_INT_ST
455,541,070
455,298,056
INST_ALL
119,683,934,968
118,972,558,207
MAP_STALL_DISPATCH
6,307,551,337
5,470,291,508
SCHEDULE_UOP
116,252,941,232
113,882,670,763
MAP_REWIND
16,293,616
11,787,119
FLUSH_RESTART_OTHER_NONSPEC
58,616
90,955
FETCH_RESTART
27,417,457
28,119,690
BRANCH_MISPRED_NONSPEC
432,761
465,697
L1I_TLB_MISS_DEMAND
754,161
1,492,705
L2_TLB_MISS_INSTRUCTION
485,702
1,217,474
MMU_TABLE_WALK_INSTRUCTION
486,812
1,219,082
BRANCH_MISPRED_NONSPEC
377,750
440,382
INST_BRANCH
1,194,614,553
1,151,040,641
Instruments won't let me add all these counters to the same run, so some results are from different runs. However, since the code is fully deterministic and runs the same number of iterations, any differences between runs should be just random noise.
EDIT: playing around with Instruments, I found one performance counter that has wildly differing values between the original code and the new code, which is MAP_DISPATCH_BUBBLE. Still doing research on what it means, whether it might explain the issues I'm seeing, and how to work around this.
EDIT 2: I decided to test this code on other ARM processors I have access to (Cortex-X2 and Cortex-A72). On the Cortex-X2, both versions perform identically, and on the Cortex-A72, there was a small (~1.5%) increase in performance with the new code. So I'm more inclined than ever to believe that I hit an M1 front-end bottleneck.
Hypotheses and data analysis
Having faced previous performance problems with this code base before, some ideas sprung to mind:
Memory alignment: SIMD code is sometimes sensitive to memory alignment, particularly for memory-bound code, which I suspect my code may be. However, adding or removing __attribute__((aligned(64))) made no difference, so I don't think that's it.
D-cache misses: the new code allocates some new arrays to cache the output of g(), so it might lead to more cache misses. And indeed there are 3.6 million more L1 D-cache misses (load + store) than the original code. However, as I've mentioned at the beginning, the working set easily fits into L2. Assuming a 10-cycle L2 cache miss cost, that's only 36 million cycles. At 3.2 GHz, that's just 1.1 ms, i.e. < 0.1% of the observed performance difference.
I-cache misses: a similar situation: there's an extra 5.1 million L1 I-cache misses, but at a 10-cycle cost, we're looking at 1.6 ms, again < 0.1% of the observed performance difference.
Inlining/unrolling: I employ aggressive inlining and loop unrolling on my code, as well as LTO and unity builds, since performance is the #1 priority and code size is irrelevant (unless it affects performance via e.g. I-cache misses). I considered the possibility that the new code might be inlining/unrolling less aggressively due to the compiler hitting some kind of heuristic for maximum code size. This might result in more instructions being executed, such as compares/branches for loops, and CALL/RET and function prologues/epilogues for function call. However, the table shows that the new code executes a bit fewer instructions of each kind (as I would expect), and of course, in total (INST_ALL).
Somehow, the original code simply achieves a higher IPC, and I have no idea why. Also, to be clear: both codes perform the same operation using the same algorithm. What I did was to basically the code for f() (a bunch of function calls to other subroutines) between g() and h().
The question
This brings me to my question: what could possibly be making the new code run slower than the old code? What other performance counters could I look at in Instruments to give me insight into this issue?
Beyond answers to this specific question, I'm looking for general advice on how to approach similar problems like this in the future. I've found some books about debugging performance problems, but they generally fall into two camps. The first just describes the profiling process I'm familiar with: find out which functions take the longest to execute and optimize them. The second is represented by books like Systems Performance: Enterprise and the Cloud and The Every Computer Performance Book, and is closer to what I'm looking for. However, they look at system-level issues like I/O, kernel calls, etc.; the kind of code I write is CPU- and maybe memory-bound, with many opportunities to convert to SIMD, and no interaction with the outside world. Basically, I'd like to know how to design meaningful experiments using a profiler and CPU performance counters (cycle counters, cache misses, instructions executed by type such as ALU, memory, etc.) to solve these kinds of performance issues with my code when they arise.
When we have a program that requires lots of operations over a large data sets and the operations on each of the data elements are independent, OpenCL can be one of the good choice to make it faster. I have a program like the following:
while( function(b,c)!=TRUE)
{
[X,Y] = function1(BigData);
M = functionA(X);
b = function2(M);
N = functionB(Y);
c = function3(N);
}
Here the function1 is applied on each of the elements on the BigData and produce another two big data sets (X,Y). function2 and function3 are then applied operation individually on each of the elements on these X,Y data, respectively.
Since the operations of all the functions are applied on each of the elements of the data sets independently, using GPU might make it faster. So I come up with the following:
while( function(b,c)!=TRUE)
{
//[X,Y] = function1(BigData);
1. load kernel1 and BigData on the GPU. each of the thread will work on one of the data
element and save the result on X and Y on GPU.
//M = functionA(X);
2a. load kernel2 on GPU. Each of the threads will work on one of the
data elements of X and save the result on M on GPU.
(workItems=n1, workgroup size=y1)
//b = function2(M);
2b. load kernel2 (Same kernel) on GPU. Each of the threads will work on
one of the data elements of M and save the result on B on GPU
(workItems=n2, workgroup size=y2)
3. read the data B on host variable b
//N = functionB(Y);
4a. load kernel3 on GPU. Each of the threads will work on one of the
data element of Y and save the result on N on GPU.
(workItems=n1, workgroup size=y1)
//c = function2(M);
4b. load kernel3 (Same kernel) on GPU. Each of the threads will work
on one of the data element of M and save the result on C on GPU
(workItems=n2, workgroup size=y2)
5. read the data C on host variable c
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run on a GPU). And if the kernels have some sort of synchronizations it might be ended up with more slowdown.
I also believe the workflow is kind of common. So what is the best practice to using OpenCL for speedup for a program like this.
I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind Of job size, (Tens? Thousands? Millions of arithmetic ops? How big are your data sets?) or what hardware. (Compute card? Laptop IGPU?) "Significant overhead" can mean a lot of different things. 5ms? 1 second?
Intel, nVidia and AMD all publish optimisation guides - have you read these?
I have been stuck on a problem in my class for a week now.
I was hoping someone could help steer me in the right direction.
a busy cat http://www.designedbychristian.com/unnamed.png
Processor R IS A 64-BIT RISC processor with a 2GHz clock rate. The average instruction requires one cycle to complete, assuming zero wait state memory accesses. Processor C is a CISC processor with a 1.8GHz clock rate. The average simple instruction requires one cycle to complete, assuming zero wait state memory accesses. The average complex instruction requires two cycles to complete, assuming zero wait state memory accesses. Processor R can't directly implement the complex processing instructions or Processor C. Executing an equivalent set of simple instructions requires an average of three cycles to complete, assuming zero wait state to memory accesses.
Program S contains nothing but simple instructions. Program C executes 70% simple instructions and 30% complex instructions. Which processor will execute program S more quickly? What percentage of complex instructions will the performance of the two processor be equal?
I attached an image above translating the data into excel the best I could.
I am not asking you guys to answer this for me but I am stuck completely and I would some help on where to start and what my answer should look like.
For the second part:
Processor R Total cycles = 1 x #simpleInstructions + 3 x #complexInstructions
Processor C Total cycles = 1 x #simpleInstructions + 2 x #complexInstructions
So, how much time for R, and how much time for C?
When expressing complex/simple instructions as a percentage,
RCycles = 1 x 0.7 x totalInstructions + 3 x 0.3 x totalInstructions
CCycles = 1 x 0.7 x totalInstructions + 2 x 0.3 x totalInstructions
Which is faster?
Now replace the percentages with a variable, equate the Rtime and Ctime and calculate percentage.
There is this question regarding solving the AMAT(Average Memory Access Time) given these data:
Legends: Cache Level 1 = L1 Cache Level 2 = L2 Main Memory = M
L1, L2 and M's Hit Time are 1, 10 and 100 respectively whilst
L1 Miss Rate is 5%, L2 5% and M 50%.
Find the AMAT in clock cycles.
After attempting to solve this question, here is my solution:
AMAT's formula is = Hit Time X Hit Rate + Miss Penalty * Miss Rate
Miss Penalty = AMAT for the next cache(say for example, AMAT of L2)
So I manipulated the formula, resulting into something like this:
AMAT = Hit Time L1 X Hit Rate L1 + AMAT L2 * Miss Rate L1
AMAT L2 = Hit Time L2 X Hit Rate L2 + AMAT M * Miss Rate L2
AMAT M = Hit Time M X Hit Rate M + [???] * Miss Rate M
providing the numerical value for the said formula would look like this:
AMAT = 1 X .95 + AMAT L2 * .05
AMAT L2 = 10 X .95 + AMAT M * .05
AMAT M = 100 X .5 + [???] * .5
So my first question would be, is my formula correct?
Next, how to get M's Miss Penalty?
Your "cascading" deductions are correct. If you miss L1 then the data has to be fetched from L2, involving a penalty, and if you miss L2 the data has to be fetched from RAM involving a higher penalty. Based on this your method of computing AMAT is correct.
Now, let's look a bit at the computer architecture. After L1 you have L2, after L2 there can be an L3 or it can be a RAM. After RAM you have the persistent storage (Hard disk).
However, the access time towards the HDD is very difficult to compute. Majority of the profilers (e.g perf, oprofile) just give you miss rates for the cache and assume that the rest of reads are taken directly from the RAM. Usually this is the case in modern computers. Reads from hard drives are marked as I/O requests. Why is it hard to compute the HDD access time? Because your data block can be at any location on your HDD, depending fragmentation. In case of spinning disks the access time can be lower f the data is located near the magnetic reader. So usually HDD access time is hard to predict when computing AMAT.
This means that if the problem statement gives you only L1, L2, M hit time, you cannot compute HDD hit time. However you can measure/estimate it in real life applications, using profilers.
So in short:
Is your formula correct? Yes - From mathematical point of view. From practical point of view, it is hard to use and generally does not make sense if M hit rate is not 100%.
How to compute M miss penalty? If it is not given you cannot compute it. In practice, by analyzing profiler output you can determine it for a specific measurement (no guaranteed that it will be the same for another measurement, because of the reasons described above).
Cheers.
consider this instruction flow diagram....
instruction fetch->instruction decode->operands fetch->instruction execute->write back
suppose a processor that supports
both cisc and risc...like intel 486
now if we issue a risc instruction it takes one clock cycle to execute and so there is no problem...but if a cisc instruction is issued its execution will take time...
so let it takes three clock cycles to execute the cisc instruction and one clock cycle each is taken by the stages preceeding execution....
now in a superscalar structure the two instructions issued while the first is being processed are diverted into other functional units available...but there is no such diversion possible in simple pipelining as only one functional unit is available for execution of instructions....
so what avoids the congestion of instructions in simple pipelining case?
Technically speaking, the x86 is not a RISC processor. It's a CISC processor. There are instructions that take less time, but those aren't RISC instructions. I believe that Intel internally turns instructions into RISC instructions, but that's not really relevant.
If we have instructions which take different amounts of time, then that becomes a CISC processor. It's nearly impossible to pipeline a CISC processor - to the best of my knowledge nobody has done it. There are many things that you can do inside of the CPU itself in order to speed up execution, such as out-of-order execution. So, there's no way you can have pipeline congestion because all instructions must be executed sequentially.
now if we issue a risc instruction it takes one clock cycle to execute and so there is no problem...but if a cisc instruction is issued its execution will take time...
A RISC instruction does not necessarily take one clock cycle. On the MIPS, it takes 5. However, the point of pipelining is that after you execute one instruction, the next instruction will complete one clock cycle after the current one finishes.
now in a superscalar structure the two instructions issued while the first is being processed are diverted into other functional units available...
In a superscalar architecture, two instructions are executed and finish at the same time. In a pure superscalar architecture, the cycle looks like this(F = Fetch, D = Decode, X = eXecute, M = Memory, W = Writeback):
(inst. 1) F D X M W
(inst. 2) F D X M W
(inst. 3) F D X M W
(inst. 4) F D X M W
but there is no such diversion possible in simple pipelining as only one functional unit is available for execution of instructions....
Right, so the cycle looks like this:
(inst. 1) F D X M W
(inst. 2) F D X M W
(inst. 3) F D X M W
(inst. 4) F D X M W
Now, if we have instructions that take a varying amount of time(a CISC computer), it's harder to pipeline, because there's only one execution unit, and we may have to wait for a previous instruction to finish executing. Instruction 1 takes 2 execution cycles, instruction 2 takes 5, instruction 3 takes two, and instruction 4 takes only one in this example
(inst. 1) F D X X M W
(inst. 2) F D X X X X X M W
(inst. 3) F D X X M W
(inst. 4) F D X M W
Thus, we can't really pipeline CISC processors - we must wait for the execute cycle to finish before we can go onto the next instruction. We don't have to do this in MIPS because it can determine if an instruction is a branch and the destination in the decode phase.