How do I wait for child kernels to finish in a parent kernel before executing the rest of the parent kernel in CUDA dynamic parallelism? - parallel-processing

So I need the runParatron children to fully finish before the next iteration of the for loop happens. Based on the results I am getting, I'm pretty sure that's not happening. For example, I have a print statement in runParatron that executes AFTER the first "[" is printed outside the for loop.
I tried calling cudaDeviceSynchronize, but it wouldn't compile: the compiler complained that host code can't be called from device code and that cudaDeviceSynchronize is undefined in device code. Is there any way to wait until the child kernels are done here?
I see other posts, examples, and tutorials using cudaDeviceSynchronize within kernels, so perhaps I am missing something basic? Help would be thoroughly appreciated.
__global__ void runMLP(double* x, double* outputs, double* weights, activation_function* A_Fs, int* CIL, int layers, int bias, int* WLO, int* OLO) {
    if (CIL[0] > 511) {
        copyElements<<<CIL[0] / 32, 32>>>(outputs, x, CIL[0]);
        //I WOULD ALSO LIKE TO WAIT HERE
    }
    else {
        for (int i = 0; i < CIL[0]; i++) {
            outputs[i] = x[i];
        }
    }
    for (int i = 1; i < layers; i++) {
        printf("----------------------Layer %d :: InputSize %d :: Layer weight offset %d :: Layer output offset %d----------------------\n", i, CIL[i - 1], WLO[i - 1], OLO[i]);
        runParatron<<<(CIL[i] / 32) + 1, 32>>>(outputs + OLO[i - 1], outputs + OLO[i], weights + WLO[i - 1], A_Fs[i], CIL[i - 1], CIL[i], bias);
        //cudaDeviceSynchronize(); //THIS IS WHERE I NEED TO WAIT UNTIL NEXT ITERATION
    }
    if (A_Fs[layers - 1] == SOFTMAX) {
        double* temp = outputs + OLO[layers - 1];
        printf("[");
        for (int i = 0; i < CIL[layers - 1]; i++) {
            printf("% d, ", temp[i]);
        }
        printf("]\n");
        double denom = 0;
        for (int i = 0; i < CIL[layers - 1]; i++) {
            denom += temp[i];
        }
        if (denom < DBL_MIN)
            denom = DBL_MIN;
        for (int i = 0; i < CIL[layers - 1]; i++) {
            temp[i] /= denom;
        }
    }
}
For example, here is the output where the "[" comes before the child kernel output:
//All "Cell:" lines are produced by the child kernel
[Cell: 0 :: weightOffset 0 :: AF 2 //As you can see, the "[" shows up here when it should come after all the Cell lines
Cell: 1 :: weightOffset 6 :: AF 2
Cell: 2 :: weightOffset 12 :: AF 2
Cell: 3 :: weightOffset 18 :: AF 2
-502657059, 2118981138, 1645236453, ] //Down here!

So I added an atomic counter and incremented it by one at the end of each child kernel. Then I put a while loop after the child kernel launches, checking whether the counter had reached the number of child completions I was waiting for. This fixed it. Let me know if anyone needs code or further clarification.
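A minimal sketch of that counter idea follows; identifiers such as d_childrenDone, childKernel, and blocksPerChild are illustrative and not taken from the code above. (For context: device-side cudaDeviceSynchronize required compiling with relocatable device code and has been removed in recent CUDA toolkits, so a counter like this is one of the remaining options.)

// d_childrenDone must be reset to 0 (e.g. via cudaMemcpyToSymbol) before each
// launch of the parent kernel.
__device__ unsigned int d_childrenDone;

__global__ void childKernel(int layer)
{
    // ... per-thread work for this layer ...
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence();                 // publish this block's results first
        atomicAdd(&d_childrenDone, 1u);  // one tick per finished child block
    }
}

__global__ void parentKernel(int layers, int blocksPerChild)
{
    // Assumes the parent is launched with a single thread, as runMLP above appears to be.
    unsigned int expected = 0;
    for (int i = 1; i < layers; i++) {
        childKernel<<<blocksPerChild, 32>>>(i);
        expected += blocksPerChild;
        // Spin until every child block launched so far has signalled completion.
        // Caveat: a busy-wait like this can hang if the child blocks cannot be
        // scheduled while the parent block stays resident and spinning.
        while (atomicAdd(&d_childrenDone, 0u) < expected) { }
    }
}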

Related

C code is running too slow due to nested for loops

My C program is running too slow (right now it takes around 40 seconds without parallelization). I have tried using OpenMP, which brought the timing down significantly, but I am looking for simple, natural ways to make the code run faster other than parallel for loops. The basic structure of the code is that it takes some command line arguments as inputs and saves those inputs as variables. Then it recursively computes a variable called Rplus1 using the math.h and complex.h libraries. The problem, and where the code spends most of its time, is at the bottom, in the nested for loops. My goal is to get the whole thing running in under 5 seconds, but as of now it takes about 40 seconds without parallel for loops. Please help!
#include "time.h"
#include "stdio.h"
#include "stdlib.h"
#include "complex.h"
#include "math.h"
#include "string.h"
#include "unistd.h"
#include "omp.h"
#define PI 3.14159265
int main (int argc, char *argv[]){
if(argc >= 8){
double start1 = omp_get_wtime();
// command line arguments are aligned in the following order: [theta] [number of layers in superlattice] [material_1] [lat const_1] [number of unit cells_1] [material_2] [lat const_2] [number of unit cells_2] .... [material_N] [lat const_N] [number of unit cells_N] [Log/Linear] [number of repeating superlattice layers] [yes/no]
int N;
sscanf(argv[2],"%d",&N); // Number of layers in superlattice specified by second input argument
if(strcmp(argv[argc-1],"yes") == 0) //If the substrate is included then add one more layer to the N variable
{
N = N+1;
}
int total;
sscanf(argv[argc-2],"%d",&total); // Number of repeating superlattice layers specified by second to last argument
double layers[N][6], horizangle[1001], vertangle[1001];
double complex (*F_hkl)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)), (*F_0)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)), (*g)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)), (*g_0)[1001][1001] = malloc(N*1001*1001*sizeof(complex double)),SF_table[10];// this array will hold the unit cell structure factors for all of the materials selected for each wavevector in the beam spectrum
double real, real2, lam, c_light = 299792458, h_pl = 4.135667516e-15,E = 10e3, r_0 = 2.818e-15, Lccd = 1.013;// just a few variables to hold values through calculations and constants, speed of light, plancks const, photon energy, and detector distance from sample
double angle;
double complex z;// just a variable to hold complex numbers throughout calculations
int i,j,m,n,t; // integers to index through arrays
lam = (h_pl*c_light)/E;
sscanf(argv[1],"%lf",&angle); //first argument is the angle of incidence, read it
angle = angle*(PI/180.0);
angle2 = -angle;
double (*table)[10] = malloc(10*9*sizeof(double)); // this array holds all the coefficients to calculate the atomic scattering factor below
double (*table2)[10] = malloc(10*2*sizeof(double));
FILE*datfile1 = fopen("/home/vhosts/xraydev.engr.wisc.edu/data/coef_table.bin","rb"); // read the binary file containg all the coefficients
fread(table,sizeof(double),90,datfile1);
fclose(datfile1);
FILE*datfile2 = fopen("/home/vhosts/xraydev.engr.wisc.edu/data/dispersioncs.bin","rb");
fread(table2,sizeof(double),20,datfile2);
fclose(datfile2);
// Calculate scattering factors for all elements
double a,b;
double k_z = (sin(angle)/lam)*1e-10; // incorporate angular dependence of SF but neglect 0.24 degree divergence because of approximation
for(i = 0;i<10;i++) // for each element...
{
SF_table[i] = 0;
for(j = 0;j<4;j++) // summation
{
a = table[2*j][i];
b = table[2*j+1][i];
SF_table[i] = SF_table[i] + a * exp(-b*k_z*k_z);
}
SF_table[i] = SF_table[i] + table[8][i] + table2[0][i] + table2[1][i]*I;
}
free(table);
double mm = 4.0, (*phi)[1001][1001] = malloc(N*1001*1001*sizeof(double));
for(i = 1; i < N+1; i++) // for each layer of material...
{
sscanf(argv[i*3+1],"%lf",&layers[i-1][1]); // get out of plane lattice constant
sscanf(argv[i*3+2],"%lf",&layers[i-1][2]); // get the number of unit cells in the layer
layers[i-1][1] = layers[i-1][1]*1e-10; // convert lat const input to meters
// Define reciprocal space positions at the incident angle h, k, l
layers[i-1][3] = 0; // h
layers[i-1][4] = 0; // k
double l; // l calculated for each wavevector in the spectrum because l changes with angle of incidence
for (m = 0; m < 1001; m++)
{
for (n = 0; n <1001; n++)
{
l = 4;
phi[i-1][m][n] = 2*PI*layers[i-1][1]*sin(angle)/lam; // Caculate phi for each layer
if(strcmp(argv[i*3],"GaAs") == 0)
{
F_hkl[i-1][m][n] = (2+2*cexp(I*PI*l))*(SF_table[2]+SF_table[3]*cexp(I*PI*l/2));
F_0[i-1][m][n] = 0.5*8.0*(31 + table2[0][2] + table2[1][2]*I) + 0.5*8.0*(33 + table2[0][3] + table2[1][3]*I);
g[i-1][m][n] = 2*r_0*F_hkl[i-1][m][n]/mm/layers[i-1][1]*cos(2*angle[m][n]);
g_0[i-1][m][n] = 2*r_0*F_0[i-1][m][n]/mm/layers[i-1][1];
}
if(strcmp(argv[i*3],"AlGaAs") == 0)
{
F_hkl[i-1][m][n] = (2+2*cexp(I*PI*l))*((0.76*SF_table[2]+ 0.24*SF_table[4])+SF_table[3]*cexp(I*PI*l/2));
F_0[i-1][m][n] = 0.24*4.0*(13 + table2[0][4] + table2[1][4]*I) + 0.76*4.0*(31 + table2[0][2] + table2[1][2]*I) + 4.0*(33 + table2[0][3] + table2[1][3]*I);
g[i-1][m][n] = 2*r_0*F_hkl[i-1][m][n]/mm/layers[i-1][1]*cos(2*angle[m][n]);
g_0[i-1][m][n] = 2*r_0*F_0[i-1][m][n]/mm/layers[i-1][1];
}
}
}
}
double complex (*Rplus1)[1001] = malloc(1001*1001*sizeof(double complex));
for (m = 0; m < 1001; m++)
{
for (n = 0; n <1001; n++)
{
Rplus1[m][n] = 0.0;
}
}
double stop1 = omp_get_wtime();
for(i=1;i<N;i++) // For each layer of the film
{
for(j=0;j<layers[i][2];j++) // For each unit cell
{
for (m = 0; m < 1001; m++) // For each row of the diffraction pattern
{
for (n = 0; n <1001; n++) // For each column of the diffraction pattern
{
Rplus1[m][n] = -I*g[i][m][n] + ((1-I*g_0[i][m][n])*(1-I*g_0[i][m][n]))/(I*g[i][m][n] + (cos(-2*phi[i][m][n])+I*sin(-2*phi[i][m][n]))/Rplus1[m][n]);
}
}
}
}
double stop2 = omp_get_wtime();
double elapsed1 = (double)(stop1 - start1);// Second user defined function to use Durbin and Follis recursive formula
double elapsed2 = (double)(stop2 - start1);// Second user defined function to use Durbin and Follis recursive formula
printf("main() through before diffraction function took %f seconds to run\n\n",elapsed1);
printf("main() through after diffraction function took %f seconds to run\n\n",elapsed2);
}
}
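One possible direction, sketched below under the assumption that the bottleneck really is the four nested loops at the end: for a fixed layer i and pixel (m, n), the terms g[i][m][n], g_0[i][m][n] and the complex exponential of phi[i][m][n] do not depend on the unit-cell index j, so they can be hoisted out and the j recurrence run innermost. Identifiers mirror the code above; this is an illustration, not a tested rewrite.

for(i = 1; i < N; i++) // For each layer of the film
{
    int cells = (int)layers[i][2]; // unit cells in this layer
    for (m = 0; m < 1001; m++)
    {
        for (n = 0; n < 1001; n++)
        {
            double complex c1 = -I*g[i][m][n];
            double complex c2 = (1 - I*g_0[i][m][n])*(1 - I*g_0[i][m][n]);
            double complex c3 =  I*g[i][m][n];
            double complex ph = cexp(-2*I*phi[i][m][n]); // = cos(-2*phi) + I*sin(-2*phi)
            double complex R  = Rplus1[m][n];
            for (j = 0; j < cells; j++) // same recurrence, now innermost
                R = c1 + c2/(c3 + ph/R);
            Rplus1[m][n] = R;
        }
    }
}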

Error while freeing an array within a data structure

Status bit_flags_set_flag(BIT_FLAGS hBit_flags, int flag_position) {
    Bit_Flags* temp = (Bit_Flags*)hBit_flags;
    int* nums;
    int i;
    int old_size;
    if (temp->size < flag_position) {
        nums = malloc(sizeof(int)*flag_position+1);
        if (nums == NULL) {
            return FAILURE;
        }
        for (i = 0; i < temp->size; i++) {
            nums[i] = temp->data[i];
        }
        free(temp->data);
        temp->data = nums;
        old_size = temp->size;
        temp->size = flag_position + 1;
        for (i = old_size; i < temp->size; i++) {
            temp->data[i] = 0;
        }
    }
    temp->data[flag_position / 32] |= 1 << flag_position % 32;
    return SUCCESS;
}
According to the debugger, the error comes from the free(temp->data) call; however, I only run into it the second time I go through the function. Any ideas what is happening here?
I am getting a heap corruption error in Visual Studio.
I am writing on some assumptions: that int is 32 bits on your platform, that you are trying to set the bit at flag_position in the bit set, and that you are using one int per bit for setting and unsetting flags.
A few comments now.
temp->data[flag_position / 32] |= 1 << flag_position % 32; doesn't make sense here. This line's job is to set the bit at flag_position, and since your code uses one int per bit, it should be temp->data[flag_position] = 1; instead.
The line temp->size = flag_position + 1; also looks incorrect; it should be temp->size = flag_position;.
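For contrast, here is a minimal sketch of the other possible intent, a packed bitset with 32 flags per int, where the allocation is sized in ints so the |= 1 << (flag_position % 32) line is consistent with it. This assumes Bit_Flags keeps size as a count of ints and reuses the poster's Status/FAILURE/SUCCESS types; it is illustrative, not the poster's layout.

Status bit_flags_set_flag(BIT_FLAGS hBit_flags, int flag_position) {
    Bit_Flags* temp = (Bit_Flags*)hBit_flags;
    int needed = flag_position / 32 + 1;           /* ints needed to hold this bit */
    if (temp->size < needed) {
        int* nums = malloc(sizeof(int) * needed);  /* size in ints, not in flags   */
        if (nums == NULL)
            return FAILURE;
        for (int i = 0; i < temp->size; i++)       /* keep the old words           */
            nums[i] = temp->data[i];
        for (int i = temp->size; i < needed; i++)  /* zero the newly added words   */
            nums[i] = 0;
        free(temp->data);
        temp->data = nums;
        temp->size = needed;
    }
    temp->data[flag_position / 32] |= 1 << (flag_position % 32);
    return SUCCESS;
}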

OpenACC bitonic sort is much slower on GPU than on CPU

I have the following bit of code to sort double values on my GPU:
void bitonic_sort(double *data, int length) {
    #pragma acc data copy(data[0:length], length)
    {
        int i, j, k;
        for (k = 2; k <= length; k *= 2) {
            for (j = k >> 1; j > 0; j = j >> 1) {
                #pragma acc parallel loop gang worker vector independent
                for (i = 0; i < length; i++) {
                    int ixj = i ^ j;
                    if ((ixj) > i) {
                        if ((i & k) == 0 && data[i] > data[ixj]) {
                            double buffer = data[i]; // original used _ValueType; the data is double
                            data[i] = data[ixj];
                            data[ixj] = buffer;
                        }
                        if ((i & k) != 0 && data[i] < data[ixj]) {
                            double buffer = data[i];
                            data[i] = data[ixj];
                            data[ixj] = buffer;
                        }
                    }
                }
            }
        }
    }
}
This is a bit slower on my GPU than on my CPU. I'm using GCC 6.1, and I can't figure out how to run the whole code on my GPU. So far, only the parallel loop is executed on the GPU, and execution switches between CPU and GPU for every iteration of the outer loops.
I'd like to run the whole content of the function on the GPU, but I can't figure out how. One major problem is that the GCC implementation currently doesn't allow nested parallelism, so I can't use a parallel construct inside another parallel construct. Is there any way to get around that?
I've tried putting a kernels construct on top of the first loop, but that slows it down by a factor of about 10. If I use a parallel construct above the first loop instead, the result isn't sorted any more, which makes sense: the two outer loops need to be executed sequentially for the algorithm to work.
If you have any other suggestions on how I could improve performance, I would be grateful as well.
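One hedged workaround for the per-step host/device switching, assuming the OpenACC runtime in use supports async queues: launch every j-step on the same async queue, so the steps still execute in order on the device, and wait once at the end. This is only a sketch, not a tested fix for GCC 6.1.

/* Sketch only: all j-steps go on async queue 1, so they still execute in
 * order on the device, but the host no longer blocks after every launch. */
void bitonic_sort_async(double *data, int length) {
    #pragma acc data copy(data[0:length])
    {
        int i, j, k;
        for (k = 2; k <= length; k *= 2) {
            for (j = k >> 1; j > 0; j = j >> 1) {
                #pragma acc parallel loop independent async(1)
                for (i = 0; i < length; i++) {
                    int ixj = i ^ j;
                    if (ixj > i) {
                        if ((i & k) == 0 && data[i] > data[ixj]) {
                            double buffer = data[i];
                            data[i] = data[ixj];
                            data[ixj] = buffer;
                        }
                        if ((i & k) != 0 && data[i] < data[ixj]) {
                            double buffer = data[i];
                            data[i] = data[ixj];
                            data[ixj] = buffer;
                        }
                    }
                }
            }
        }
        #pragma acc wait(1)
    }
}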

How to improve the performance of needleman -wunsch algorithm in CUDA

I need advice on how to optimize my implementation of the Needleman-Wunsch algorithm in CUDA.
I want to optimize the code that fills the DP matrix. Because of the data dependence between matrix elements (each element depends on its left, upper, and upper-left neighbors), I fill the matrix anti-diagonal by anti-diagonal, processing each anti-diagonal in parallel as follows:
__global__ void alignment_kernel(int *T, char *A, char *B, int t_M, int t_N, int d) {
    int row = BLOCK_SIZE_Y * blockIdx.y + threadIdx.y;
    int col = BLOCK_SIZE_X * blockIdx.x + threadIdx.x;
    // Check if we are inside the table boundaries.
    if (!(row < t_M && col < t_N)) {
        return;
    }
    // Check if current thread is on the current diagonal
    if (row + col != d) {
        return;
    }
    int v1;
    int v2;
    int v3;
    int v4;
    v1 = v2 = v3 = v4 = INT_MIN;
    if (row > 0 && col > 0) {
        v1 = T[t_N * (row - 1) + (col - 1)] + score_matrix_read(A[row - 1], B[col - 1]);
    }
    if (row > 0 && col >= 0) {
        v2 = T[t_N * (row - 1) + col] + gap;
    }
    if (row >= 0 && col > 0) {
        v3 = T[t_N * row + (col - 1)] + gap;
    }
    if (row == 0 && col == 0) {
        v4 = 0;
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();
    T[t_N * row + col] = mmax(v1, v2, v3, v4);
}
Nevertheless, one obvious problem with my code is that I make multiple kernel calls (code below). So far I don't know how to process the anti-diagonals without launching one kernel per diagonal, and I think this is a major obstacle to better performance.
// Invoke kernel.
for (int d = 0; d < t_M + t_N - 1; d++) {
    alignment_kernel<<<gridDim, blockDim>>>(d_T, d_A, d_B, t_M, t_N, d);
    //CHECK_FOR_CUDA_ERROR();
}
How can I process the anti-diagonals in parallel and, maybe, use shared memory to increase the speedup?
Beyond this problem, is there any way to do the backtrace step of the Needleman-Wunsch algorithm in parallel?
I am currently working on a parallel implementation of the Needleman-Wunsch algorithm as well (for use in a genome mapper). Depending on how many alignments you will be doing, it may be more efficient to do a single alignment per thread.
However, here is a publication that performs a single alignment in parallel on a GPU. The novelty of their approach is that it does not generate the matrix sequentially, but rather diagonally. They don't describe how they backtrack in the publication: they send the matrix back to the host after it is generated and then perform the backtrack on the CPU. I think backtracking on the GPU would be terribly inefficient due to branching.
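A minimal sketch of the one-alignment-per-thread layout mentioned above; names such as d_queries, MAX_LEN and the scoring constants are illustrative assumptions, not from the question.

// Each thread scores one (query, reference) pair entirely on its own, so no
// inter-thread synchronization is needed. Only the previous DP row is kept.
#define MAX_LEN  128
#define GAP      (-2)
#define MATCH      1
#define MISMATCH (-1)

__global__ void nw_one_per_thread(const char *d_queries, const char *d_refs,
                                  int len, int *d_scores, int n_pairs)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_pairs || len > MAX_LEN) return;

    const char *A = d_queries + tid * len;   // this thread's query
    const char *B = d_refs    + tid * len;   // this thread's reference

    int prev[MAX_LEN + 1], curr[MAX_LEN + 1];
    for (int j = 0; j <= len; j++) prev[j] = j * GAP;   // first DP row

    for (int i = 1; i <= len; i++) {
        curr[0] = i * GAP;
        for (int j = 1; j <= len; j++) {
            int diag = prev[j - 1] + (A[i - 1] == B[j - 1] ? MATCH : MISMATCH);
            int up   = prev[j] + GAP;
            int left = curr[j - 1] + GAP;
            int best = diag > up ? diag : up;
            curr[j]  = best > left ? best : left;
        }
        for (int j = 0; j <= len; j++) prev[j] = curr[j];
    }
    d_scores[tid] = prev[len];               // final alignment score
}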

USACO number triangle - Execution error

The question is as follows
Consider the number triangle shown below. Write a program that calculates the highest sum of numbers that can be passed on a route that starts at the top and ends somewhere on the base. Each step can go either diagonally down to the left or diagonally down to the right.
7
3 8
8 1 0
2 7 4 4
4 5 2 6 5
In the sample above, the route from 7 to 3 to 8 to 7 to 5 produces the highest sum: 30.
I had the following error
Your program had this runtime error: Bad syscall #32000175 (RT_SIGPROCMASK) [email kolstad if you think this is wrong]. The program ran for 0.259 CPU seconds before the error. It used 16328 KB of memory.
The code is as follows.
#include <cstdio>
#include <map>
using namespace std;

int arr[1500][1500];
map<int, map<int, int> > dp;

int main()
{
    // ofstream fout ("numtri.out");
    // ifstream fin ("numtri.in");
    int n;
    // fin >> n;
    freopen("numtri.in", "r", stdin);
    freopen("numtri.out", "w", stdout);
    scanf("%d", &n);
    int ct = 1;
    int gaga = -100;
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < ct; j++)
        {
            scanf("%d", &arr[i][j]);
            if (i > 0)
                dp[i][j] = maxi(dp[i-1][j-1] + arr[i][j], dp[i-1][j] + arr[i][j]); // maxi: the poster's max helper (not shown)
            else
                dp[0][0] = arr[0][0];
            if (i == n-1)
            {
                if (dp[i][j] > gaga)
                    gaga = dp[i][j];
            }
        }
        ct++;
    }
    printf("%d\n", gaga);
    return 0;
}
It works fine on my laptop. On the website it passes 8 test cases and fails on the 9th one with this error.
Thanks for the help!
if(i>0)
dp[i][j]=maxi(dp[i-1][j-1]+arr[i][j],dp[i-1][j]+arr[i][j]);
You check if i > 0, which will ensure you never access a negative index. You never do the same for j however, so you will access dp[i-1][-1] on the first run of the inner (j) loop. I'm pretty sure this is what causes the error.
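A minimal sketch of one way to guard both indices, using names that mirror the poster's code (illustrative only; reading the input and file redirection are omitted):

#include <algorithm>
#include <climits>

// Returns the best top-to-bottom path sum for an n-row triangle stored in arr.
int best_path_sum(int arr[][1500], int n) {
    static int dp[1500][1500];
    dp[0][0] = arr[0][0];
    for (int i = 1; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            int up_left = (j > 0) ? dp[i - 1][j - 1] : INT_MIN; // missing on the left edge
            int up      = (j < i) ? dp[i - 1][j]     : INT_MIN; // missing on the right edge
            dp[i][j] = std::max(up_left, up) + arr[i][j];       // at least one neighbour exists
        }
    }
    int best = INT_MIN;
    for (int j = 0; j < n; j++) best = std::max(best, dp[n - 1][j]);
    return best;
}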
