Turn the code into code using SIMD instructions - performance

I am preparing for an exam and am doing some exercises that have no answer key. I have been given this code and am wondering whether I have turned it into SIMD instructions correctly.
The code
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000; i++)
C += A[i] * B[i];
Since there is no remainder, we don't need to take care of it. We also assume a 128-bit register, which can therefore process 4 single-precision floating-point values at a time.
My result - using SIMD
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000/4; i += 4)
C += A[i] * B[i];
C += A[i+1] * B[i+1];
C += A[i+2] * B[i+2];
C += A[i+3] * B[i+3];
What advantages can you see for using SIMD instructions instead of writing programs with multiple threads?

Assuming the omitted curly braces on your second loop and the malformed for statement are simply typos, and setting aside the fact that you ask about multiplying floats while your code shows arrays of ints, this won't get great vectorisation even if the compiler sees it. While the compiler might do the loads of 4 values from A and B as a single instruction each, and do the 4 multiplies in one instruction, your code forces it to then extract each of the 4 products and sum them sequentially, and getting individual values out of a SIMD register is typically quite slow.
If on the other hand you did this
float A[100000];
float B[100000];
float C0=0, C1=0, C2=0, C3=0;
for (size_t i=0; i < 100000; i += 4)
{
    C0 += A[i+0] * B[i+0];
    C1 += A[i+1] * B[i+1];
    C2 += A[i+2] * B[i+2];
    C3 += A[i+3] * B[i+3];
}
float C = (C0 + C1) + (C2 + C3);
Then a good compiler can vectorise this, because it now sees that each iteration loads two SIMD registers, multiplies them, and adds the result to a SIMD register holding the running sums; the four partial sums only have to be extracted and combined once, at the very end.
A vectorising compiler can do this with SIMD without changing the order of evaluation of the individual sums (FP maths is NOT associative). The compiler is typically not allowed to reorder FP maths for this reason (not without some extra flags that technically breach the language standard), so the code above can be represented precisely by SIMD instructions and will run much faster (in fact I'd unroll the loop a further stage, as the multiplication will be a bottleneck as it stands).
This is really the trick with SIMD: you have to understand the operation, think about how it would best be implemented with vector instructions, write your code to execute the same sequence of operations, and hope the compiler spots what you've done.
Or you can write the vector instructions yourself with intrinsics, or use OpenMP or similar to tell the compiler more explicitly what to do.
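For reference, here is a minimal sketch (mine, not from the original answer) of the same dot product written directly with SSE intrinsics on 128-bit registers; the function name is made up, and as in the exercise the element count is assumed to be a multiple of 4:

#include <immintrin.h>
#include <stddef.h>

float dot_product_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();              /* four running partial sums */
    for (size_t i = 0; i < n; i += 4)
    {
        __m128 va = _mm_loadu_ps(&a[i]);        /* load 4 floats from A */
        __m128 vb = _mm_loadu_ps(&b[i]);        /* load 4 floats from B */
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    /* Only now extract the 4 partial sums and combine them. */
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return (tmp[0] + tmp[1]) + (tmp[2] + tmp[3]);
}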
Amongst the advantages of SIMD over threads for such an operation is the fact that you're making use of more of the silicon within a single core... so you're not preventing another thread from getting cycles. On our compute grid we typically run many single threaded processes on any one machine to keep all the cores busy at all times... in such a case doing this sum using more cores is a false economy, you'd simply be stealing cycles that another thread could usefully be running another job.

Yes, the provided code should compile into SIMD instructions with capable CPUs and compilers.
On vector-capable processors, SIMD exposes hardware features that greatly accelerate identical, parallel computations. For instance, SIMD typically makes better use of the cache on a single core due to streaming RAM access, assuming the data being processed is laid out in contiguous areas of memory. With multiprocessing, cache contention and other synchronization overhead could actually reduce performance as the various cores attempt to write data simultaneously. This is in addition to the intrinsic boost on von Neumann machines from only having to fetch one instruction, not four, from the shared system memory.
The logic to do these arithmetic operations in parallel is always present, but requires specific SIMD instructions to utilize. As a result, SIMD tends to be used in hot loops where hand tuning makes overall optimization sense.
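As an illustration (not part of the original answers), OpenMP's simd directive is one portable way to request those instructions explicitly for a hot loop like the one in the question; the function name here is hypothetical:

#include <stddef.h>

float dot_product_omp_simd(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    /* Ask the compiler to vectorise this loop, including the reduction,
       which it would otherwise have to avoid reassociating. */
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

Compiled with -fopenmp (or -fopenmp-simd), this gives the compiler explicit permission to vectorise the sum.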

Related

OpenCL Pseudo random generator race condition

Using this question as a basis, I implemented a pseudo-random number generator with a global state:
__global uint global_random_state;

void set_random_seed(uint seed){
    global_random_state = seed;
}

uint get_random_number(uint range){
    uint seed = global_random_state + get_global_id(0);
    uint t = seed ^ (seed << 11);
    uint result = seed ^ (seed >> 19) ^ (t ^ (t >> 8));
    global_random_state = result; /* race condition? */
    return result % range;
}
Since these functions will be used from multiple threads, there will be a race condition present when writing to global_random_state.
This might actually help the system to be more unpredictable, so it seems like a good thing, but I'd like to know if there are any consequences to this that might not surface immediately. Are there any side-effects inside the GPU which might cause problems later on when the kernel is run?
In theory you want atom_cmpxchg (or the equivalent on your GPGPU platform) for correctness here. However, a grave note of warning: having the entire machine serialize through a single cache line is going to strangle your performance. Atomics on the same address must form a queue and wait, while atomics on different locations can proceed in parallel (more details at the end).
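For concreteness, a minimal sketch (not from the original answer) of that compare-and-swap approach, written with the OpenCL 1.1 built-in atomic_cmpxchg (atom_cmpxchg is the older extension spelling); it is correct, but every work item still serializes through one address:

uint get_random_number_cas(volatile __global uint *state, uint range)
{
    uint old_val, new_val;
    do {
        old_val = *state;
        uint t  = old_val ^ (old_val << 11);
        new_val = old_val ^ (old_val >> 19) ^ (t ^ (t >> 8));
        /* retry until no other work item has updated the state in between */
    } while (atomic_cmpxchg(state, old_val, new_val) != old_val);
    return new_val % range;
}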
Generally, algorithms that use random variables on GPGPU keep a separate copy of the random number generator per work item. This lets each work item cache and potentially reuse its own random values without glutting the bus with memory traffic on every new random number. Search for "OpenCL Monte Carlo" "Simulation" or "Example" for samples. CUDA has some nice examples too.
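A minimal sketch of that per-work-item approach (the buffer name states and the kernel are hypothetical): each work item keeps its own xorshift state in private memory, seeded from its own slot of a global buffer, so no shared state is ever written concurrently:

uint xorshift32(uint *state)
{
    uint x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

__kernel void use_private_rng(__global uint *states, __global float *out)
{
    size_t gid = get_global_id(0);
    uint state = states[gid];            /* private copy, no race */

    float r = (float)xorshift32(&state) / 4294967296.0f;
    out[gid] = r;

    states[gid] = state;                 /* persist for the next launch */
}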
Another option is to use a random generator that allows you to skip ahead, and have different work items move forward in the sequence by different amounts. This can be more compute-intensive, but the tradeoff is that you don't strain the memory hierarchy as much.
More gory details on atomics: (1) GPU cache atomics are designed to expect contiguous arrays, and atomic ALUs are per bank; (2) each dword in a cache line will be processed by the same atomic ALU each time; and (3) neighbouring cache lines hash to different banks. So, if every clock you are doing atomics on contiguous cache lines of data, the work should be perfectly spread out (or statistically so). Conversely, if every work item atomically modifies the same 32b word, the cache system cannot spread the work across its 16/32/64 atomic ALU slots (whatever your system uses); it must break the operation up into 16/32/64 separate atomic operations and apply them iteratively (by #2 above). In a system with 512 ALUs available to process atomics, you would be using 1 of those ALUs each clock (the same one). Spread the work out and you can use all 512 per clock.

Desired Compute-To-Memory-Ratio (OP/B) on GPU

I am trying to understand the architecture of GPUs and how we assess the performance of our programs on the GPU. I know that an application can be:
Compute-bound: performance is limited by the FLOPS rate. The processor's cores are fully utilized (they always have work to do).
Memory-bound: performance is limited by the memory bandwidth. The processor's cores are frequently idle because memory cannot supply data fast enough.
The image below shows the FLOPS rate, peak memory bandwidth, and the desired compute-to-memory ratio, labeled (OP/B), for each microarchitecture.
I also have an example of how to compute this OP/B metric. Example: Below is part of a CUDA code for applying matrix-matrix multiplication
for(unsigned int i = 0; i < N; ++i) {
    sum += A[row*N + i]*B[i*N + col];
}
and the way to calculate OP/B for this matrix-matrix multiplication is as follows:
Matrix multiplication performs 0.25 OP/B
1 FP add and 1 FP mul for every 2 FP values (8B) loaded
Ignoring stores
and if we want to utilize this:
But matrix multiplication has high potential for reuse. For NxN matrices:
Data loaded: (2 input matrices)×(N^2 values)×(4 B) = 8N^2 B
Operations: (N^2 dot products)(N adds + N muls each) = 2N^3 OP
Potential compute-to-memory ratio: 0.25N OP/B
So, if I understand this correctly, I have the following questions:
Is it always the case that the greater the OP/B, the better?
How do we know how many FP operations we have? Is it just the adds and the multiplications?
How do we know how many bytes are loaded per FP operation?
Is it always the case that the greater the OP/B, the better?
Not always. The target value balances the load on compute pipe throughput and memory pipe throughput (i.e. that level of op/byte means that both pipes will be fully loaded). As you increase op/byte beyond that level, your code will switch from balanced to compute-bound. Once your code is compute-bound, performance will be dictated by the compute pipe that is the limiting factor. Additional op/byte increases beyond this point may have no effect on code performance.
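As a hypothetical illustration (the numbers are invented for the example, not taken from the table in the question): a device with a peak of 10 TFLOP/s of compute and 500 GB/s of memory bandwidth has a balance point of 10,000 GOP/s / 500 GB/s = 20 OP/B. Below 20 OP/B the code is memory-bound; at 20 it is balanced; pushing the ratio further leaves it compute-bound, where extra OP/B buys no additional speed.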
How do we know how many FP operations we have? Is it just the adds and the multiplications?
Yes, for the simple code you have shown, it is the adds and multiplies. Other more complicated codes may have other factors (e.g. sin, cos, etc.) which may also contribute.
As an alternative to "manually counting" the FP operations, the GPU profilers can indicate the number of FP ops that a code has executed.
How do we know how many bytes are loaded per FP operation?
Similar to the previous question, for simple codes you can "manually count". For complex codes you may wish to try to use profiler capabilities to estimate. For the code you have shown:
sum += A[row*N + i]*B[i*N + col];
The values from A and B have to be loaded. If they are float quantities then they are 4 bytes each. That is a total of 8 bytes. That line of code will require 1 floating point multiplication (A * B) and one floating point add operation (sum +=). The compiler will fuse these into a single instruction (fused multiply-add) but the net effect is you are performing two floating point operations per 8 bytes. op/byte is 2/8 = 1/4. The loop does not change the ratio in this case. To increase this number, you would want to explore various optimization methods, such as a tiled shared-memory matrix multiply, or just use CUBLAS.
(Operations like row*N + i are integer arithmetic and don't contribute to the floating-point load, although it's possible they may be significant, performance-wise.)
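As a sketch of the tiled shared-memory approach mentioned above (mine, not from the original answer; it assumes square matrices with N a multiple of TILE and a TILE×TILE thread block per output tile): each value loaded from global memory is reused TILE times out of shared memory, so the ratio rises from 0.25 OP/B to roughly 0.25*TILE OP/B.

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        /* each thread loads one element of A and one of B into shared memory */
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        /* each loaded element is reused TILE times here */
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = sum;
}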

Does using SIMD load the main CPU registers?

Let's imagine we have a software developer whose goal is to achieve the absolute maximum of CPU performance.
In today's CPUs we have many cores, we can load data into cache for faster processing, and we also have SIMD instructions (AVX, for example) that let us sum/multiply/do other operations on arrays of items (multiply 8 integers per CPU clock). The disadvantage of these instructions is the cost of sending data and instructions to the SIMD module, plus the overhead of converting between vector types and primitive types (sorry, I'm only familiar with C#'s Vector). (We're not looking at code complexity for now.)
As far as I understand, while we are using SIMD, the main CPU registers are used only for sending and receiving data to these registers, and the main ALU blocks used for general-purpose calculations are idle at this time.
And here is my question: does using SIMD instructions also load the main CPU blocks? For example, if we have a huge amount of different calculations (let's imagine 40% of them are best run on SIMD and 60% are better run as usual), will SIMD allow us to gain performance in this way: 100% of all cores' performance + n% of SIMD's boost?
I'm asking because, for example, with GPGPU we can use the GPU for parallel calculations while the CPU is used only for sending and receiving data, so it is idle during that time and we can use its performance for latency-sensitive tasks.
This looks like a question about out-of-order execution. Modern x64 CPUs have a number of execution ports, and each can dispatch a new instruction per clock cycle (so roughly 8 CPU ops can start in parallel on an Intel Skylake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.
So for example, you may be able to dispatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general-purpose registers in a single cycle [you will still have to wait for each operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should be able to keep each of those ports busy (or at least, that's the basic aim!).
Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule)
Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)
#include <immintrin.h>

typedef __m256 float8;
#define mul8f _mm256_mul_ps

void computeThing(float8 a[], float8 b[], float8 c[], int count)
{
    for(int i = 0; i < count; ++i)
    {
        a[i] = mul8f(a[i], b[i]);
        b[i] = mul8f(b[i], c[i]);
    }
}
Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).
The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for: a[i], b[i], c[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test whether the count has been reached
It just so happens that we are also making use of the SIMD units to do a couple of multiplications...
Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. The CPU hardware used to compute either a float or a float8 is exactly the same. There are simply a couple of bits in the machine-code encoding that specify the choice between float/__m128/__m256.
i.e. https://godbolt.org/z/xTcLrf
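To make Rule 3 concrete, a small sketch (mine, not from the original answer): compiled with optimisations on x86-64, the three functions below come out as essentially the same multiply in scalar, 128-bit, and 256-bit form (roughly mulss, mulps, and vmulps respectively).

#include <immintrin.h>

float  scalar_mul(float a, float b)  { return a * b; }
__m128 vec4_mul(__m128 a, __m128 b)  { return _mm_mul_ps(a, b); }
__m256 vec8_mul(__m256 a, __m256 b)  { return _mm256_mul_ps(a, b); }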

OpenMP slower reduction

There are two versions of the OpenMP code, one with reduction and one without.
// with reduction
#pragma omp parallel for reduction(+:sum)
for (i=1; i<=num_steps; i++){
    x = (i-0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
}
// without reduction
#pragma omp parallel private(i)
{
    int id = omp_get_thread_num();
    int numthreads = omp_get_num_threads();
    double x;
    double partial_sum = 0;

    for (i=id; i<num_steps; i+=numthreads){
        x = (i+0.5)*step;
        partial_sum += 4.0/(1.0+x*x);
    }

    #pragma omp critical
    sum += partial_sum;
}
I run the code using 8 cores, and the total time doubles for the reduction version. What's the reason? Thanks.
Scalar reduction in OpenMP is usually quite fast. The observed behaviour in your case is due to two things made wrong in two different ways.
In your first code you did not make x private. Therefore it is shared among the threads and besides getting incorrect results, the execution suffers from the data sharing. Whenever one thread writes to x, the core that it executes on sends a message to all other cores and makes them invalidate their copies of that cache line. When any of them writes to x later, the whole cache line has to be reloaded and then the cache lines in all other cores get invalidated. And so forth. This slows things down significantly.
In your second code you have used the OpenMP critical construct. This is relatively heavy-weight in comparison with the atomic adds usually used to implement the reduction at the end. Atomic adds on x86 are performed using the LOCK instruction prefix and everything is implemented in hardware. Critical sections, on the other hand, are implemented using mutexes and require several instructions and often busy-waiting loops. This is far less efficient than the atomic adds.
In the end, your first code is slowed down by the bad data-sharing condition, and your second code is slowed down by the use of the wrong synchronisation primitive. It just happens that on your particular system the latter effect is less severe than the former, and hence your second example runs faster.
If you want to manually parallelize the loop as well as the reduction you can do it like this:
#pragma omp parallel private(i)
{
    int id = omp_get_thread_num();
    int numthreads = omp_get_num_threads();
    int start = id*num_steps/numthreads;
    int finish = (id+1)*num_steps/numthreads;
    double x;
    double partial_sum = 0;

    for (i=start; i<finish; i++){
        x = (i+0.5)*step;
        partial_sum += 4.0/(1.0+x*x);
    }

    #pragma omp atomic
    sum += partial_sum;
}
However, I don't recommend this. Reductions don't have to be done with atomic and you should just let OpenMP parallelize the loop. The first case is the best solution (but make sure you declare x private).
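For completeness, a minimal corrected version of the first loop (assuming num_steps and step are defined as in the question); declaring x inside the loop makes it private automatically:

double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 1; i <= num_steps; i++){
    double x = (i-0.5)*step;   /* x is now private to each iteration */
    sum += 4.0/(1.0+x*x);
}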
Edit: According to Hristo, once you make x private these two methods are nearly the same in speed. I want to explain why using critical in your second method instead of atomic, or letting OpenMP do the reduction, has hardly any effect on the performance in this case.
There are two ways I can think of doing a reduction:
Sum the partial sums linearly using atomic or critical
Sum the partial sums using a tree. I.e. if you have 8 cores this gives you eight partial sums; you reduce these to 4 partial sums, then 2, then 1.
The first case scales linearly with the number of cores. The second case goes as the log of the number of cores. So one may be tempted to think the second case is always better. However, for only eight cores the reduction is entirely dominated by computing the partial sums; adding eight numbers with atomic/critical vs. reducing the tree in 3 steps is negligible either way.
What if you have e.g. 1024 cores? Then the tree can be reduced in only 10 steps while the linear sum takes 1024 steps. But the constant term can be much larger for the second case, and computing the partial sums of a large array, e.g. with 1 million elements, probably still dominates the reduction.
So I suspect that using atomic or even critical for a reduction has a negligible effect on the reduction time in general.
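For the record, here is a sketch of the tree reduction described in point 2 (illustration only; partial is a hypothetical array holding one partial sum per thread, with nthreads a power of two; for a handful of cores the simple linear sum in point 1 is just as fast):

for (int stride = nthreads/2; stride > 0; stride /= 2) {
    #pragma omp parallel for
    for (int t = 0; t < stride; t++)
        partial[t] += partial[t + stride];   /* pairwise combine */
}
double sum = partial[0];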

OpenMP: Huge slowdown in what should be ideal scenario

In the code below I'm trying to compare all elements of an array to all other elements in a nested for loop. (It's to run a simple n-body simulation. I'm testing with only 4 bodies for 4 threads on 4 cores). An identical sequential version of the code without OpenMP modifications runs in around 15 seconds for 25M iterations. Last night this code ran in around 30 seconds. Now it runs in around 1 minute! I think the problem may lie in that the threads must write to the array which is passed to the function via a pointer.
The array is dynamically allocated elsewhere and is composed of structs I defined. This is just a hunch. I have verified that the 4 threads are running on 4 separate cores at 100% and that they are accessing the elements of the array properly. Any ideas?
void runSimulation (particle* particles, int numSteps){
    //particles is a pointer to an array of structs I've defined and allocated dynamically before calling the function
    //Variable Initializations

    #pragma omp parallel num_threads(4) private(//The variables inside the loop) shared(k,particles) // 4 Threads for four cores
    {
        while (k<numSteps){ //Main loop.

            #pragma omp master //Check whether it is time to report progress.
            {
                //Some simple if statements
                k=k+1; //Increment step counter; for some reason omp doesn't like k++
            }

            //Calculate new velocities
            #pragma omp for
            for (i=0; i<numParticles; i++){ //Calculate forces by comparing each particle to all others
                Fx = 0;
                Fy = 0;
                for (j=0; j<numParticles; j++){
                    //Calculate the cumulative force by comparing each particle to all others
                }
                //Calculate accelerations and set new velocities
                ax = Fx / particles[i].mass;
                ay = Fy / particles[i].mass;

                //ARE THESE TWO LINES THE PROBLEM?!
                particles[i].xVelocity += deltaT*ax;
                particles[i].yVelocity += deltaT*ay;
            }

            #pragma omp master
            //Apply new velocities to create new positions after all forces have been calculated.
            for (i=0; i<numParticles; i++){
                particles[i].x += deltaT*particles[i].xVelocity;
                particles[i].y += deltaT*particles[i].yVelocity;
            }

            #pragma omp barrier
        }
    }
}
You are thrashing the cache. All the cores are writing to the same shared structure, which will be continually bouncing around between the cores via the L2 (best case), L3, or main memory/memory bus (worst case). Depending on how the data is shared, this takes anywhere from 20 to 300 cycles, while writes to private memory in L1 take a cycle or less in ideal conditions.
That explains your slowdown.
If you increase your number of particles the situation may become less severe because you'll often be writing to distinct cache lines, so there will be less thrashing. btown above has the right idea in suggesting a private array.
Not sure if this will fix the problem, but you might try giving each thread its own copy of the full array; the problem might be that the threads are fighting over accessing the shared memory, and you're seeing a lot of cache misses.
I'm not sure of the exact openmp syntax you'd use to do this, but try doing this:
Allocate memory to hold the entire particles array in each thread; do this once, and save all four new pointers.
At the beginning of each main loop iteration, in the master thread, deep-copy the main array four times to each of those new arrays. You can do this quickly with a memcpy().
Do the calculation such that the first thread writes to indices 0 < i < numParticles/4, and so on.
In the master thread, before you apply the new velocities, merge the four arrays into the main array by copying over only the relevant indices. You can do this quickly with a memcpy().
Note that you can parallelize your "apply new velocities" loop without any problems because each iteration only operates on a single index; this is probably the easiest part to parallelize.
The new operations will only be O(N) compared to your calculations which are O(N^2), so they shouldn't take too much time in the long run. There are definitely ways to optimize the steps that I laid out for you, Gabe, but I'll leave those to you.
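A hypothetical sketch of that private-copy scheme (mine, not from the original answer; it varies step 2 slightly by letting each thread copy the array for itself, and it reuses the question's particle struct and numParticles):

#include <string.h>
#include <omp.h>

void stepWithPrivateCopies(particle *particles, particle *copies[4])
{
    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();
        particle *mine = copies[id];            /* pre-allocated private copy */

        /* deep-copy the shared array into this thread's private copy */
        memcpy(mine, particles, numParticles * sizeof(particle));

        /* each thread updates only its own quarter of the indices */
        int start  = id * numParticles / 4;
        int finish = (id + 1) * numParticles / 4;
        for (int i = start; i < finish; i++){
            /* ... force calculation reading `mine`, writing only mine[i] ... */
        }

        #pragma omp barrier

        /* merge back only the slice this thread owns */
        memcpy(&particles[start], &mine[start], (finish - start) * sizeof(particle));
    }
}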
I don't agree that the problem is cache thrashing since the size of the struct particles must exceed the size of a cache line just from the number of members.
I think the more likely culprit is that the overhead of initializing an omp for is thousands of cycles (http://www.ualberta.ca/CNS/RESEARCH/Courses/2001/PPandV/OpenMP.Eric.pdf) and the loop has only a few calculations in it. I'm not remotely surprised the loop is slower with only 4 bodies. If you had a few hundred bodies the situation would be different. I once worked on a loop a bit like this, and ended up using pthreads directly.
