OpenMP: Huge slowdown in what should be an ideal scenario

In the code below I'm trying to compare all elements of an array to all other elements in a nested for loop. (It's to run a simple n-body simulation; I'm testing with only 4 bodies for 4 threads on 4 cores.) An identical sequential version of the code, without the OpenMP modifications, runs in around 15 seconds for 25M iterations. Last night this code ran in around 30 seconds. Now it runs in around 1 minute! I think the problem may lie in the fact that the threads must write to the array, which is passed to the function via a pointer.
The array is dynamically allocated elsewhere and is composed of structs I defined. This is just a hunch, though. I have verified that the 4 threads are running on 4 separate cores at 100% and that they are accessing the elements of the array properly. Any ideas?
void runSimulation (particle* particles, int numSteps){
    // particles is a pointer to an array of structs I've defined and allocated dynamically before calling the function
    // Variable initializations
    #pragma omp parallel num_threads(4) private(/* the variables used inside the loop */) shared(k,particles) // 4 threads for four cores
    {
        while (k < numSteps){ // Main loop.
            #pragma omp master // Check whether it is time to report progress.
            {
                // Some simple if statements
                k = k + 1; // Increment step counter; for some reason omp doesn't like k++
            }
            // Calculate new velocities
            #pragma omp for
            for (i = 0; i < numParticles; i++){ // Calculate forces by comparing each particle to all others
                Fx = 0;
                Fy = 0;
                for (j = 0; j < numParticles; j++){
                    // Calculate the cumulative force by comparing each particle to all others
                }
                // Calculate accelerations and set new velocities
                ax = Fx / particles[i].mass;
                ay = Fy / particles[i].mass;
                // ARE THESE TWO LINES THE PROBLEM?!
                particles[i].xVelocity += deltaT*ax;
                particles[i].yVelocity += deltaT*ay;
            }
            #pragma omp master
            // Apply new velocities to create new positions after all forces have been calculated.
            for (i = 0; i < numParticles; i++){
                particles[i].x += deltaT*particles[i].xVelocity;
                particles[i].y += deltaT*particles[i].yVelocity;
            }
            #pragma omp barrier
        }
    }
}

You are thrashing the cache. All the cores are writing to the same shared structure, which will be continually bouncing around between the cores via the L2 (best case), L3, or main memory/memory bus (worst case). Depending on how the data is shared, this takes anywhere from 20 to 300 cycles, while writes to private memory in L1 take 1 cycle or less in ideal conditions.
That explains your slowdown.
If you increase the number of particles the situation may become less severe, because you'll often be writing to distinct cache lines, so there will be less thrashing. btown above has the right idea in suggesting a private array.

Not sure if this will fix the problem, but you might try giving each thread its own copy of the full array; the problem might be that the threads are fighting over access to the shared memory, and you're seeing a lot of cache misses.
I'm not sure of the exact OpenMP syntax you'd use to do this, but try the following (a rough code sketch appears after this answer):
1. Allocate memory to hold the entire particles array in each thread; do this once, and save all four new pointers.
2. At the beginning of each main loop iteration, in the master thread, deep-copy the main array into each of those new arrays. You can do this quickly with a memcpy().
3. Do the calculation such that the first thread writes to indices 0 <= i < numParticles/4, and so on.
4. In the master thread, before you apply the new velocities, merge the four arrays back into the main array by copying over only the relevant indices. You can do this quickly with a memcpy().
Note that you can parallelize your "apply new velocities" loop without any problems, because each iteration only operates on a single index; this is probably the easiest part to parallelize.
The new operations will only be O(N), compared to your calculations which are O(N^2), so they shouldn't take too much time in the long run. There are definitely ways to optimize the steps that I laid out for you, Gabe, but I'll leave those to you.
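A rough sketch of that scheme, reusing the particle struct and numParticles from the question (the stepWithPrivateCopies wrapper is hypothetical, and the copies and the merge are done by the calling thread here rather than by the master inside a persistent parallel region, purely to keep the sketch short):
#include <omp.h>
#include <cstring>

const int NTHREADS = 4;

void stepWithPrivateCopies(particle* particles, int numParticles)
{
    static particle* copies[NTHREADS] = {nullptr};        // step 1: allocate once
    if (copies[0] == nullptr)
        for (int t = 0; t < NTHREADS; t++)
            copies[t] = new particle[numParticles];

    // Step 2: deep-copy the shared array into each private copy
    // (memcpy is fine as long as particle is a plain struct of scalars).
    for (int t = 0; t < NTHREADS; t++)
        std::memcpy(copies[t], particles, numParticles * sizeof(particle));

    // Step 3: each thread reads only its own copy and writes only to its own slice of it.
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t     = omp_get_thread_num();
        int begin = t * numParticles / NTHREADS;
        int end   = (t + 1) * numParticles / NTHREADS;
        for (int i = begin; i < end; i++) {
            // accumulate Fx, Fy by scanning copies[t][0..numParticles-1],
            // then update copies[t][i].xVelocity and copies[t][i].yVelocity
        }
    }

    // Step 4: merge only the relevant slice of each copy back into the shared array.
    for (int t = 0; t < NTHREADS; t++) {
        int begin = t * numParticles / NTHREADS;
        int end   = (t + 1) * numParticles / NTHREADS;
        std::memcpy(&particles[begin], &copies[t][begin],
                    (end - begin) * sizeof(particle));
    }
}
Because every thread writes only to its own copy during the force loop, no cache line is shared between writers until the final merge.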

I don't agree that the problem is cache thrashing, since the size of the particle struct must exceed the size of a cache line just from the number of members.
I think the more likely culprit is that the overhead of setting up an omp for is thousands of cycles (http://www.ualberta.ca/CNS/RESEARCH/Courses/2001/PPandV/OpenMP.Eric.pdf) and the loop has only a few calculations in it. I'm not remotely surprised the loop is slower with only 4 bodies. If you had a few hundred bodies the situation would be different. I once worked on a loop a bit like this, and ended up using pthreads directly.

Related

Why is iterating through the outer dimension with an outer loop faster than with an inner loop?

Let's consider a matrix
std::vector<std::vector<int>> matrix;
where each row has the same length. I will call each std::vector<int> a column.
Why is iterating through the outer dimension with an outer loop faster than with an inner loop?
First program: Iterating through columns first
int sum = 0;
for (int col = 0; col < matrix.size(); col++)
{
    for (int row = 0; row < matrix[0].size(); row++)
    {
        sum += matrix[col][row];
    }
}
Second program: Iterating through rows first
int sum = 0;
for (int row = 0; row < matrix[0].size(); row++) // Assuming there is at least one element in matrix
{
    for (int col = 0; col < matrix.size(); col++)
    {
        sum += matrix[col][row];
    }
}
Here are my guesses
Jumping around the memory
I have a vague intuition that jumping around in memory would take more time than reading memory that is contiguous, but I thought memory access to RAM takes constant time. Plus, there are no moving parts in DRAM, and I don't understand why it would be faster to read two ints when they are contiguous.
Bus width
An int takes 2 bytes (although this may vary depending on the data model). On a machine with an 8-byte-wide bus, I can imagine that if the ints are contiguous in memory, then 4 ints (depending on the data model) could be sent to the processor every clock cycle, while only one int could be sent per clock cycle if they are not contiguous.
If this is the case, then if the matrix contained long long int, which are 8 bytes long, we would not see any difference between the two programs anymore (I haven't tested it).
Cache
I am not sure why, but I feel like the cache could be the reason why the second program is slower. The cache effect might be related to the bus-width argument I made just above. It is possible that only memory that is contiguous in DRAM can be loaded into the cache, but I don't know why that would be the case.
Yeah, it's cache.
There's a strange coincidence1 that when programs access data in memory, they often access nearby data immediately or shortly after.
CPU designers realized this and thus design caches to load a whole chunk of memory at once.
So when you access matrix[0][0], much, if not all of the rest of matrix[0] was pulled into cache along with the single element at matrix[0][0], whereas there's a good chance that nothing from matrix[20] made it into cache.
Note that this is dependent on your matrix consisting of contiguous arrays, at least in the last dimension. If you were using, say, linked lists, you would probably2 not see much difference; instead you would experience the slower performance regardless of access order.
The reason is that cache loads contiguous blocks. Consider if matrix[0][0] refers to the memory address 0x12340000. Accessing that would load that byte, plus the next 127 bytes into cache (exact amount depends on the cpu). So you'd have every byte from 0x12340000 to 0x1234007F in cache.
In a contiguous array, your next element at 0x12340004 is already in cache. But linked lists aren't contiguous, the next element could be virtually anywhere. If it's outside of that 0x12340000 to 0x1234007F range, you haven't gained anything.
1 It's really not that strange of a coincidence if you think about it. Using local stack variables? Accesses to the same area of memory. Iterating across a 1-dimensional array? Lots of accesses to the same area of memory. Iterating across a 2-dimensional array with the outer dimension in the outer loop and the inner arrays in the inner, nested loop? Basically iterating over a bunch of 1-dimensional arrays.
2 It's possible you might luck out and have your linked list's nodes all right next to each other, but that seems like a very unlikely scenario. And you still won't fit as many elements in cache, because the pointers to the next element take up space, and there will be an additional, small performance hit from the indirection.
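If you want to see the effect directly, here is a small timing sketch (the 4000x4000 size, the std::chrono timing, and the time_it helper are my own choices for illustration, not from the question):
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const int N = 4000;
    std::vector<std::vector<int>> matrix(N, std::vector<int>(N, 1));

    auto time_it = [&](bool col_outer) {
        auto start = std::chrono::steady_clock::now();
        long long sum = 0;
        if (col_outer) {                         // matrix[col][row], col in the outer loop
            for (int col = 0; col < N; col++)
                for (int row = 0; row < N; row++)
                    sum += matrix[col][row];
        } else {                                 // matrix[col][row], row in the outer loop
            for (int row = 0; row < N; row++)
                for (int col = 0; col < N; col++)
                    sum += matrix[col][row];
        }
        auto end = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::printf("sum=%lld  time=%lld ms\n", sum, ms);
    };

    time_it(true);   // contiguous access within each inner vector: cache friendly
    time_it(false);  // jumps between inner vectors: one cache line touched per element
}
Compiled with optimizations, the col-outer order is typically noticeably faster, precisely because each inner std::vector is traversed contiguously.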
When going column-row, you are summing like so ([C][R]): [0][0] + [0][1] + [0][2] ... and so on, so you are not switching between the inner arrays.
When going row-column, you are summing like so ([C][R]): [0][0] + [1][0] + [2][0] ... This way you switch between the inner arrays every time, which takes longer in DRAM.
2D arrays of this kind are handled like new Array{array1, array2, array3}: arrays inside an array. Counting down one inner array (C-R) is faster than switching arrays and reading one element from each (R-C).
Each array is a sectioned-off piece of memory, so when you have a 2D array and count in (R-C) order you are jumping around in DRAM, which is slower.
It does not matter that there are no mechanical parts in DRAM; jumping around is still slower. For example, SRAM has no mechanical parts either, but an SRAM of comparable capacity would be physically larger than DRAM, and the extra distance signals must travel across all those transistors and capacitors makes it slower.
Edit: after reading the other answer, I would like to add that when you iterate (C-R), the whole inner array can be loaded into the cache for quick access, but when going (R-C) a new array has to be brought into the cache for each element, which is not efficient (or may not happen at all).

Turn the code into a code using SIMD instructions

I am preparing for an exam and am doing some exercises that have no answer key. I have been given this code and am wondering whether I have correctly turned it into SIMD form.
The code
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000; i++)
C += A[i] * B[i];
Since there is no remainder, we don't need to take care of one. We also assume a 128-bit register, which can therefore hold 4 single-precision floating-point values.
My result - using SIMD
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000/4; i += 4)
C += A[i] * B[i];
C += A[i+1] * B[i+1];
C += A[i+2] * B[i+2];
C += A[i+3] * B[i+3];
What advantages can you see for using SIMD instructions instead of writing programs with multiple threads?
Assuming the omitted curly braces on your second loop and the malformed for statement are simply typos, and setting aside the fact that you ask about multiplying floats while your code shows arrays of ints, this won't get great vectorisation even if the compiler sees it. While the compiler might do the loads of 4 values from A and B as a single instruction each, and do the 4 multiplies in one instruction, your code forces the compiler to then extract each of the 4 products and sum them sequentially, and getting individual values out of a SIMD register is typically quite slow.
If on the other hand you did this
float A[100000];
float B[100000];
float C0 = 0, C1 = 0, C2 = 0, C3 = 0;
for (size_t i = 0; i < 100000; i += 4)
{
    C0 += A[i+0] * B[i+0];
    C1 += A[i+1] * B[i+1];
    C2 += A[i+2] * B[i+2];
    C3 += A[i+3] * B[i+3];
}
float C = (C0 + C1) + (C2 + C3);
Then a good compiler could vectorise this, as it now sees that within each iteration it loads two SIMD registers, multiplies them, and adds the result to a SIMD register of running sums, and only extracts those 4 sums and adds them together at the end.
A vectorising compiler can do this with SIMD without changing the order of evaluation of the individual sums (FP maths is NOT associative). The compiler is typically not allowed to reorder FP maths for this reason (not without extra flags that let it technically breach the language standard), so the code above can be represented exactly by SIMD instructions and will run much faster (in fact I'd unroll the loop a further stage, as the multiplication will be the bottleneck as it stands).
This is rather the trick with SIMD: you have to think about how the operation would best be implemented with vector instructions, then write your code to execute that same sequence of operations, and hope the compiler spots what you've done.
Or you can write the vector instructions yourself with intrinsics, or use OpenMP or similar to tell the compiler more explicitly what to do.
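For example, the same sequence written directly with SSE intrinsics might look roughly like this (the dot_sse name and the assumption that n is a multiple of 4 are mine, not from the question):
#include <immintrin.h>
#include <cstddef>

float dot_sse(const float* A, const float* B, std::size_t n)
{
    __m128 acc = _mm_setzero_ps();               // four running partial sums
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 a = _mm_loadu_ps(&A[i]);          // load 4 floats from A
        __m128 b = _mm_loadu_ps(&B[i]);          // load 4 floats from B
        acc = _mm_add_ps(acc, _mm_mul_ps(a, b)); // 4 multiplies and 4 adds per iteration
    }
    float partial[4];
    _mm_storeu_ps(partial, acc);                 // extract the 4 sums once, after the loop
    return (partial[0] + partial[1]) + (partial[2] + partial[3]);
}
As in the multi-accumulator C version above, the horizontal extraction of the partial sums happens only once, at the end.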
Amongst the advantages of SIMD over threads for such an operation is the fact that you're making use of more of the silicon within a single core... so you're not preventing another thread from getting cycles. On our compute grid we typically run many single-threaded processes on any one machine to keep all the cores busy at all times... in such a case doing this sum using more cores is a false economy; you'd simply be stealing cycles that could usefully be spent running another job.
Yes, the provided code should compile into SIMD instructions with capable CPUs and compilers.
On vector-capable processors, SIMD exposes hardware features that greatly accelerate identical, parallel computations. For instance, SIMD typically makes better use of the cache on a single core due to streaming RAM access, assuming the data being processed is localized in contiguous areas of memory. Using multiprocessing, cache competition and other synchronization overhead could actually reduce performance as the various cores attempt to write data simultaneously. This is in addition to the intrinsic boost on von-Neumann machines from only having to read one, not four, separate instructions from the shared system memory.
The logic to do these arithmetic operations in parallel is always present, but specific SIMD instructions are required to make use of it. As a result, SIMD tends to be used in hot loops where hand tuning makes sense for overall optimization.

Does the hardware-prefetcher benefit in this memory access pattern?

I have two arrays: A with N_A random integers and B with N_B random integers between 0 and (N_A - 1). I use the numbers in B as indices into A in the following loop:
for (i = 0; i < N_B; i++) {
    sum += A[B[i]];
}
Experimenting on an Intel i7-3770 with N_A = 256 million and N_B = 64 million, this loop takes only 0.62 seconds, which corresponds to a memory access latency of about 9 nanoseconds.
As this latency seems implausibly small, I was wondering whether the hardware prefetcher is playing a role. Can someone offer an explanation?
The HW prefetcher can see through your first level of indirection (B[i]), since these elements are sequential. It's capable of issuing multiple prefetches ahead, so you can assume that the average access into B would hit the caches (either L1 or L2). However, there's no way the prefetcher can predict the random addresses (the data stored in B) and prefetch the correct elements from A. You still have to perform a memory access on almost every access to A (disregarding occasional lucky cache hits due to reuse of lines).
The reason you see such low latency is that the accesses into A are not serialized: the CPU can access multiple elements of A simultaneously, so the time doesn't just accumulate. In fact, you are measuring memory bandwidth here, checking how long it takes to access 64M elements overall, not memory latency (how long it takes to access a single element).
A reasonable "snapshot" of the CPU memory unit should show several outstanding requests - a few accesses into B[i], B[i+64], ... (the intermediate accesses should simply get merged as each request fetches a 64Byte line), all of which would probably be prefetches reflecting future values of i, intermixed with random accesses to A elements according to the previously fetched elements of B.
To measure latency, you need each access to depend on the result of the previous one, e.g. by making the content of each element in A the index of the next access.
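A minimal sketch of such a dependent chain (the chase helper is hypothetical, and A is assumed to have been pre-filled so that following A[idx] walks one long cycle through the array):
#include <cstddef>

// Each load depends on the result of the previous one, so the accesses
// cannot overlap and the loop measures latency rather than bandwidth.
long chase(const int* A, std::size_t n_steps)
{
    int idx = 0;
    long checksum = 0;
    for (std::size_t i = 0; i < n_steps; i++) {
        idx = A[idx];        // the next address is unknown until this load finishes
        checksum += idx;     // keep the result live so the chain isn't optimized away
    }
    return checksum;
}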
The CPU charges ahead in the instruction stream and will juggle multiple outstanding loads at once. The stream looks like this:
load b[0]
load a[b[0]]
add
loop code
load b[1]
load a[b[1]]
add
loop code
load b[2]
load a[b[2]]
add
loop code
...
The iterations are only serialized by the loop code, which runs quickly. All loads can run concurrently. Concurrency is just limited by how many loads the CPU can handle.
I suspect you wanted to benchmark random, unpredictable, serialized memory loads. This is actually pretty hard on a modern CPU. Try to introduce an unbreakable dependency chain:
int lastLoad = 0;
for (i = 0; i < N_B; i++) {
    int load = A[B[i] + (lastLoad & 1)];   // be sure to make A one element bigger
    sum += load;
    lastLoad = load;
}
This requires the previous load to complete before the address of the next load can be computed.

OpenMP output for "for" loop

I am new to OpenMP and I just tried to write a small program with the parallel for construct. I have trouble understanding the output of my program: I don't understand why thread number 3 prints its output before threads 1 and 2. Could someone offer me an explanation?
So, the program is:
#pragma omp parallel for
for (i = 0; i < 7; i++) {
    printf("We are in thread number %d and are printing %d\n",
           omp_get_thread_num(), i);
}
and the output is:
We are in thread number 0 and are printing 0
We are in thread number 0 and are printing 1
We are in thread number 3 and are printing 6
We are in thread number 1 and are printing 2
We are in thread number 1 and are printing 3
We are in thread number 2 and are printing 4
We are in thread number 2 and are printing 5
My processor is an Intel(R) Core(TM) i5-2410M CPU with 4 cores.
Thank you!
OpenMP makes no guarantees of the relative ordering, in time, of the execution of statements by different threads. OpenMP leaves it to the programmer to impose such ordering if it is required. In general it is not required, in many cases not even desirable, which is why OpenMP's default behaviour is as it is. The cost, in time, of imposing such an ordering is likely to be significant.
I suggest you run much larger tests several times; you should observe that the cross-thread sequencing of events is, essentially, random.
If you want to print in order then you can use the ordered construct
#pragma omp parallel for ordered
for (i = 0; i < 7; i++) {
    #pragma omp ordered
    printf("We are in thread number %d and are printing %d\n",
           omp_get_thread_num(), i);
}
I assume this requires threads handling later iterations to wait for the ones handling earlier iterations, so it will have an effect on performance. You can see it used here: http://bisqwit.iki.fi/story/howto/openmp/#ExampleCalculatingTheMandelbrotFractalInParallel
That example draws the Mandelbrot set as characters using ordered. A much faster solution than using ordered is to fill an array of characters in parallel and then print them serially (try the code). Since one uses OpenMP for performance, I have never found a good reason to use ordered, but I'm sure it has its use somewhere.
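A minimal sketch of that fill-then-print pattern applied to the question's loop (the thread_of array and the main wrapper are mine, just for illustration):
#include <cstdio>
#include <omp.h>

int main()
{
    const int N = 7;
    int thread_of[N];                 // record which thread handled each iteration

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        thread_of[i] = omp_get_thread_num();

    // Print serially, in iteration order, after the parallel region has ended.
    for (int i = 0; i < N; i++)
        std::printf("Iteration %d was handled by thread %d\n", i, thread_of[i]);
}
The parallel work still runs in whatever order the threads happen to execute, but the output is deterministic because the printing happens outside the parallel region.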

OpenMP slower reduction

There are two versions of the OpenMP code, with reduction and without.
// with reduction
#pragma omp parallel for reduction(+:sum)
for (i = 1; i <= num_steps; i++){
    x = (i-0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
}

// without reduction
#pragma omp parallel private(i)
{
    int id = omp_get_thread_num();
    int numthreads = omp_get_num_threads();
    double x;
    double partial_sum = 0;
    for (i = id; i < num_steps; i += numthreads){
        x = (i+0.5)*step;
        partial_sum += 4.0/(1.0+x*x);
    }
    #pragma omp critical
    sum += partial_sum;
}
I run the code using 8 cores, and the total time is double for the reduction version. What's the reason? Thanks.
Scalar reduction in OpenMP is usually quite fast. The observed behaviour in your case is due to two different mistakes, one in each version.
In your first code you did not make x private. Therefore it is shared among the threads, and besides producing incorrect results, the execution suffers from the data sharing: whenever one thread writes to x, the core it executes on makes all other cores invalidate their copies of that cache line. When any of them writes to x later, the whole cache line has to be reloaded, and then the cache lines in all other cores get invalidated again, and so forth. This slows things down significantly.
In your second code you have used the OpenMP critical construct. This is relatively heavy-weight in comparison with the atomic adds usually used to implement the reduction at the end. Atomic adds on x86 are performed using the LOCK instruction prefix and everything is implemented in hardware. On the other hand, critical sections are implemented using mutexes and require several instructions and often busy-waiting loops. This is far less efficient than the atomic adds.
In the end, your first code is slowed down by the bad data sharing, and your second code is slowed down by the use of a suboptimal synchronisation primitive. It just happens that on your particular system the latter effect is less severe than the former, and hence your second example runs faster.
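For concreteness, a sketch of the first version with x made private simply by declaring it inside the loop body (the reduce_sum wrapper is mine; num_steps and step follow the question):
#include <omp.h>

double reduce_sum(long num_steps, double step)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= num_steps; i++) {
        double x = (i - 0.5) * step;   // declared per iteration, hence private to each thread
        sum += 4.0 / (1.0 + x * x);
    }
    return sum;
}
Declaring x inside the loop is equivalent to listing it in a private clause and removes the sharing problem entirely.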
If you want to manually parallelize the loop as well as the reduction you can do it like this:
#pragma omp parallel private(i)
{
    int id = omp_get_thread_num();
    int numthreads = omp_get_num_threads();
    int start = id*num_steps/numthreads;
    int finish = (id+1)*num_steps/numthreads;
    double x;
    double partial_sum = 0;
    for (i = start; i < finish; i++){
        x = (i+0.5)*step;
        partial_sum += 4.0/(1.0+x*x);
    }
    #pragma omp atomic
    sum += partial_sum;
}
However, I don't recommend this. Reductions don't have to be done with atomic and you should just let OpenMP parallelize the loop. The first case is the best solution (but make sure you declare x private).
Edit: According to Hristo, once you make x private these two methods are nearly the same in speed. I want to explain why using critical in your second method, instead of atomic or letting OpenMP do the reduction, has hardly any effect on performance in this case.
There are two ways I can think of doing a reduction:
1. Sum the partial sums linearly using atomic or critical.
2. Sum the partial sums using a tree, i.e. if you have 8 cores this gives you eight partial sums, which you reduce to 4 partial sums, then 2, then 1 (a sketch of such a tree reduction appears at the end of this answer).
The first case is linear in the number of cores. The second case goes as the log of the number of cores, so one may be tempted to think the second case is always better. However, for only eight cores the run time is entirely dominated by computing the partial sums; whether you add eight numbers with atomic/critical or reduce them as a tree in 3 steps is negligible.
What if you have, e.g., 1024 cores? Then the tree can be reduced in only 10 steps while the linear sum takes 1024 steps. But the constant term can be much larger for the second case, and computing the partial sums of a large array, e.g. with 1 million elements, probably still dominates the reduction.
So I suspect that using atomic or even critical for a reduction has a negligible effect on the total time in general.
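For reference, here is what the tree reduction from point 2 might look like over an array of per-thread partial sums (a sketch; the tree_reduce name and the assumption that nthreads is a power of two are mine):
// Pairwise combine: log2(nthreads) rounds in total. Each round could itself
// be executed in parallel, but the serial form keeps the structure visible.
double tree_reduce(double* partial, int nthreads)
{
    for (int stride = nthreads / 2; stride >= 1; stride /= 2)
        for (int t = 0; t < stride; t++)
            partial[t] += partial[t + stride];
    return partial[0];
}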
