Is it good practice to place loop invariant code outside a loop? - performance

I'm learning about loop-invariant code motion as a method of code optimization in compiler design. The example used is that
for (i = 0; i < n; i++)
    buffer[i] = 10*i + x*x;
might as well be optimized to
tmp = x*x;
for (i = 0; i < n; i++)
    buffer[i] = 10*i + tmp;
to avoid calculating x*x more than once.
In the curricular example, this is dealt with by the compiler backend.
My question is, is it generally good practice to do this explicitly in the source code (and is it more advantageous in some languages than others)?

Related

while loop getting stuck - OpenMP

I was trying to implement some piece of parallel code and tried to synchronize the threads using an array of flags as shown below
// flags array set to zero initially
#pragma omp parallel for num_threads(n_threads) schedule(static, 1)
for(int i = 0; i < n; i++){
    for(int j = 0; j < i; j++) {
        while(!flag[j]);
        y[i] -= L[i][j]*y[j];
    }
    y[i] /= L[i][i];
    flag[i] = 1;
}
However, the code always gets stuck after a few iterations when I compile it with gcc -O3 -fopenmp <file_name>. I have tried different numbers of threads (2, 4, 8), and all of them lead to the loop getting stuck. By putting print statements inside critical sections, I figured out that even though the value of flag[i] gets updated to 1, the while loop is still stuck, or maybe there is some other problem with the code that I am not aware of.
I also figured out that if I do something inside the while block, like printf("Hello\n"), the problem goes away. I think there is some problem with memory consistency across threads, but I do not know how to resolve it. Any help would be appreciated.
Edit: The single-threaded code I am trying to parallelise is
for(int i = 0; i < n; i++){
    for(int j = 0; j < i; j++){
        y[i] -= L[i][j]*y[j];
    }
    y[i] /= L[i][i];
}
You have a data race in your code, which is easy to fix, but the bigger problem is that you also have a loop-carried dependency. The result of your code depends on the order of execution: try reversing the i loop without OpenMP and you will get a different result, so your code cannot be parallelized efficiently.
One possibility is to parallelize the j loop (a sketch follows below), but the workload inside this loop is very small, so the OpenMP overheads will be significantly bigger than the speed gained by parallelization.
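For illustration, a rough sketch of what parallelizing the j loop with a reduction might look like (assuming n, L and y are the double-precision data from your code); as noted, the tiny per-iteration workload means the threading overhead will usually outweigh any gain:
for(int i = 0; i < n; i++){
    double sum = 0.0;
    // each thread accumulates part of the dot product L[i][0..i-1] . y[0..i-1]
    #pragma omp parallel for reduction(+:sum)
    for(int j = 0; j < i; j++){
        sum += L[i][j]*y[j];
    }
    y[i] = (y[i] - sum)/L[i][i];
}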
EDIT: In the case of your updated code, I suggest forgetting about parallelization (because of the loop-carried dependency) and making sure that the inner loop is properly vectorized, so I suggest the following:
for(int i = 0; i < n; i++){
    double sum_yi = y[i];
    #pragma GCC ivdep
    for(int j = 0; j < i; j++) {
        sum_yi -= L[i][j]*y[j];
    }
    y[i] = sum_yi/L[i][i];
}
#pragma GCC ivdep tells the compiler that there is no loop-carried dependency in the loop, so it can vectorize it safely. Do not forget to inform the compiler about the vectorization capabilities of your processor (e.g. use the -mavx2 flag if your processor is AVX2 capable).
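For example, the compile line might look something like this (the file name is illustrative; -fopt-info-vec makes GCC report which loops it managed to vectorize):
gcc -O3 -mavx2 -fopt-info-vec solve.c -o solve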

OpenMP collapse parallel for with parallel max-reduction?

I have the following nested loops that I want to collapse into one for parallelization. Unfortunately, the inner loop is a max-reduction rather than a standard for loop, so the collapse(2) directive apparently can't be used here. Is there any way to collapse these two loops anyway? Thanks!
(note that s is the number of sublists and n is the length of each sublist and suppose n >> s)
#pragma omp parallel for default(shared) private(i,j)
for (i = 0; i < n; i++) {
    rank[i] = 0;
    for (j = 0; j < s; j++)
        if (rank[i] < sublistrank[j][i])
            rank[i] = sublistrank[j][i];
}
In this code the best idea is not to parallelize the inner loop at all, but to make sure it is properly vectorized. The inner loop does not access memory contiguously, which prevents vectorization and results in poor cache utilization. You should rewrite your code to ensure contiguous memory access (e.g. change the order of indices and use sublistrank[i][j] instead of sublistrank[j][i]).
It is also beneficial to use a temporary variable for the comparisons and assign it to rank[i] after the loop.
Another comment: always use your variables in their minimum required scope; it also helps the compiler create more optimized code. Putting it together, your code should look something like this (assuming you use unsigned int for rank and the loop variables):
#pragma omp parallel for default(none) shared(sublistrank, rank)
for (unsigned int i = 0; i < n; i++) {
    unsigned int max = 0;
    for (unsigned int j = 0; j < s; j++)
        if (max < sublistrank[i][j])
            max = sublistrank[i][j];
    rank[i] = max;
}
I have compared your code and this code on Compiler Explorer. You can see that the compiler is able to vectorize the new loop, but not the old one.
Note also that if n is small, the parallel overhead may be bigger than the benefit of parallelization.
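If n can indeed be small in practice, one option is OpenMP's if clause, which only spawns threads when the trip count is large enough to amortize the overhead. A sketch (the 10000 threshold is purely illustrative and should be tuned; n and s are assumed to be ordinary variables, hence listed as shared):
// threads are only created when the loop is large enough to pay for the overhead
#pragma omp parallel for default(none) shared(sublistrank, rank, n, s) if(n > 10000)
for (unsigned int i = 0; i < n; i++) {
    unsigned int max = 0;
    for (unsigned int j = 0; j < s; j++)
        if (max < sublistrank[i][j])
            max = sublistrank[i][j];
    rank[i] = max;
}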

warning: suggest parentheses around assignment used as truth value in for loop

I am working on a program, but am having trouble with a for loop:
for (int i = N-1; i = 0; i--) {
    guessarray[i] = guess % 10;
    guess /= 10;
}//for
With my g++ compiler I keep getting the warning "suggest parentheses around assignment used as truth value". I understand that I am working backwards in the loop, from high to low, but I don't see how that could be a problem. I have tried putting parentheses in different places, but it doesn't work. I also know it has nothing to do with the assignment operator, since I want to use the assignment operator. The warning is placed directly after N-1.
The compiler is calling your attention to the assignment in the for loop's condition, the second of the three semicolon-delimited expressions. You wrote
i = 0
which assigns 0 to i and then uses the result of that assignment (0, i.e. false) as the loop's truth value. This pattern is prone to human error, so g++ warns about it; by writing
(i = 0)
you would be telling the compiler that yes, you really intended an assignment there, and the warning would go away. The warning has nothing to do with the N-1 initializer.
==================
Note also that there is a logic flaw in the condition section of the for loop (the 2nd expression).
You have
i = 0
which means the loop will never execute: the assignment evaluates to 0 (false) no matter what N is. I assume your intent was to count down from (N-1) to 0. Thus you should probably have the following:
for (int i = N-1; i >= 0; i--) // note the i >= 0 condition
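Putting it together (keeping guess and guessarray exactly as declared elsewhere in your program), the corrected loop would be:
for (int i = N-1; i >= 0; i--) {  // compare, don't assign
    guessarray[i] = guess % 10;   // take the least-significant digit
    guess /= 10;                  // drop that digit
}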

R nested loop slow

I have no idea why something like this should be slow:
steps=500
samples=100000
s_0=2.1
r=.02
sigma=.2
k=1.9
at<-matrix(nrow=(steps+1),ncol=samples)
at[1,]=s_0
for(j in 1:samples)
{
  for(i in 2:(steps+1))
  {
    at[i,j] = at[(i-1),j] + sigma*sqrt(.0008)*rnorm(1)
  }
}
I tried to rewrite this using sapply, but it was still awful from a performance standpoint.
Am I missing something here? This would take seconds in C++ or even the bloated C#.
R can vectorize certain operations. In your case you can get rid of the outer loop by making the following change:
for(i in 2:(steps + 1))
{
  at[i,] = at[(i - 1),] + sigma * sqrt(.0008) * rnorm(samples)
}
According to system.time, the original version with samples = 1000 takes 6.83 s, while the modified one takes 0.09 s.
How about:
at <- s_0 + t(apply(matrix(rnorm(samples*(steps+1), sd=sigma*sqrt(8e-4)),
                           ncol=samples),
                    2,
                    cumsum))
(Haven't tested this carefully yet, but I think it should be right, and much faster.)
To write fast R code, you really need to re-think how you write functions. You want to operate on entire vectors, not just single observations at a time.
If you're really dead set on writing C-style loops, you could also try out Rcpp. It could be handy if you're well accustomed to C++ and prefer writing functions that way.
library(Rcpp)
do_stuff <- cppFunction('NumericMatrix do_stuff(
    int steps,
    int samples,
    double s_0,
    double r,
    double sigma,
    double k ) {
  // Ensure RNG scope set
  RNGScope scope;
  // allocate the output matrix
  NumericMatrix at( steps+1, samples );
  // fill the first row
  for( int i=0; i < at.ncol(); i++ ) {
    at(0, i) = s_0;
  }
  // loop over the matrix and do stuff
  for( int j=0; j < samples; j++ ) {
    for( int i=1; i < steps+1; i++ ) {
      at(i, j) = at(i-1, j) + sigma * sqrt(0.0008) * R::rnorm(0, 1);
    }
  }
  return at;
}')
system.time( out <- do_stuff(500, 100000, 2.1, 0.02, 0.2, 1.9) )
gives me
user system elapsed
3.205 0.092 3.297
So, if you've already got some C++ background, consider learning how to use Rcpp to map data to and from R.

rewriting a simple C++ Code snippet into CUDA Code

I have written the following simple C++ code.
#include <iostream>
#include <omp.h>
using namespace std;
int main()
{
    int myNumber = 0;
    int numOfHits = 0;
    cout << "Enter my Number Value" << endl;
    cin >> myNumber;
    #pragma omp parallel for reduction(+:numOfHits)
    for(int i = 0; i <= 100000; ++i)
    {
        for(int j = 0; j <= 100000; ++j)
        {
            for(int k = 0; k <= 100000; ++k)
            {
                if(i + j + k == myNumber)
                    numOfHits++;
            }
        }
    }
    cout << "Number of Hits" << numOfHits << endl;
    return 0;
}
As you can see, I use OpenMP to parallelize the outermost loop. What I would like to do is rewrite this small code in CUDA. Any help will be much appreciated.
Well, I can give you a quick tutorial, but I won't necessarily write it all for you.
So first of all, you will want to get MS Visual Studio set up with CUDA, which is easy following this guide: http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/
Now you will want to read The NVIDIA CUDA Programming Guide (free PDF), the documentation, and CUDA by Example (a book I highly recommend for learning CUDA).
But let's say you haven't done that yet, and definitely will later.
This is an extremely arithmetic-heavy and data-light computation - actually it can be computed without this brute-force method fairly simply, but that isn't the answer you are looking for. I suggest something like this for the kernel:
__global__ void kernel(int* myNumber, int* numOfHits){
    // a shared counter is stored on-chip and is visible to every thread in the block,
    // which is beneficial since it is written to many times
    __shared__ int s_hits;
    if(threadIdx.x == 0 && threadIdx.y == 0)
        s_hits = 0;
    __syncthreads();
    // these identify the current thread uniquely within the grid
    int i0 = threadIdx.x + blockIdx.x*blockDim.x;
    int j0 = threadIdx.y + blockIdx.y*blockDim.y;
    // stride i and j by the total number of threads in each grid dimension,
    // which can be quite large (but not 100,000)
    for(int i = i0; i <= 100000; i += blockDim.x*gridDim.x){
        for(int j = j0; j <= 100000; j += blockDim.y*gridDim.y){
            // Thanks to talonmies for this simplification:
            // k is fixed by i and j, so just check that it lies in range
            if(0 <= (*myNumber - i - j) && (*myNumber - i - j) <= 100000){
                // an atomic is needed here, otherwise the value may change
                // during the 'read, modify, write' process
                atomicAdd(&s_hits, 1);
            }
        }
    }
    // synchronize threads, so we know s_hits is completely updated
    __syncthreads();
    // only one thread per block adds s_hits into the global count, again atomically
    if(threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(numOfHits, s_hits);
}
To launch the kernel, you will want something like this:
dim3 blocks(some_number, some_number, 1); //some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
I know you probably want a quick way to do this, but getting into CUDA isn't really a 'quick' thing. As in, you will need to do some reading and some setup to get it working; past that, the learning curve isn't too high. I haven't told you anything about memory allocation yet, so you will need to do that (although that is simple; a minimal sketch follows below). If you followed my code, my goal was to make you read up a bit on shared memory and CUDA, so you are already kick-started. Good luck!
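For completeness, here is a minimal host-side sketch of the allocation and copies mentioned above (the variable names are illustrative and error checking is omitted):
int h_myNumber = 42, h_numOfHits = 0;   // h_myNumber would come from user input
int *d_myNumber, *d_numOfHits;
// allocate device memory and copy the inputs over
cudaMalloc((void**)&d_myNumber, sizeof(int));
cudaMalloc((void**)&d_numOfHits, sizeof(int));
cudaMemcpy(d_myNumber, &h_myNumber, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_numOfHits, &h_numOfHits, sizeof(int), cudaMemcpyHostToDevice);
// launch the kernel and copy the result back
dim3 blocks(64, 64, 1);   // hand-tune for your GPU
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(d_myNumber, d_numOfHits);
cudaMemcpy(&h_numOfHits, d_numOfHits, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_myNumber);
cudaFree(d_numOfHits);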
Disclaimer: I haven't tested my code, and I am not an expert - it could be idiotic.
