how to parallelise this code in hpc?


s=1
r=m=n=o=p=q=u=t=19
myfile = fopen ("sequence2.txt", "w", "ieee-le");
for a=0:1
  if(a==1)
    r=5
  endif
  for b=0:r
    if(a==1 && b==5)
      m=11
    endif
    for c=0:m
      n=o=19
      for d=0:1
        if(d==1)
          n=5
        endif
        for e=0:n
          if(d==1 && e==5)
            o=11
          endif
          for f=0:o
            p=q=19
            for g=0:1
              if(g==1)
                p=5
              endif
              for h=0:p
                if(g==1 && h==5)
                  q=11
                endif
                for i=0:q
                  t=u=19
                  for j=0:1
                    if(j==1)
                      t=5
                    endif
                    for k=0:t
                      if(j==1 && k==5)
                        u=11
                      endif
                      for l=0:u
                        s=s+1
                        fputs(myfile,num2str(a));
                        fputs(myfile,".");
                        fputs(myfile,num2str(b));
                        fputs(myfile,".");
                        fputs(myfile,num2str(c));
                        fputs(myfile,":");
                        fflush(stdout);
                        fputs(myfile,num2str(d));
                        fputs(myfile,".");
                        fputs(myfile,num2str(e));
                        fputs(myfile,".");
                        fputs(myfile,num2str(f));
                        fputs(myfile,":");
                        fflush(stdout);
                        fputs(myfile,num2str(g));
                        fputs(myfile,".");
                        fputs(myfile,num2str(h));
                        fputs(myfile,".");
                        fputs(myfile,num2str(i));
                        fputs(myfile,":");
                        fflush(stdout);
                        fputs(myfile,num2str(j));
                        fputs(myfile,".");
                        fputs(myfile,num2str(k));
                        fputs(myfile,".");
                        fputs(myfile,num2str(l));
                        fputs(myfile,"\n");
                        fflush(stdout);
                      end
                    end
                  end
                end
              end
            end
          end
        end
      end
    end
  end
end
The above Octave code generates a number sequence and writes it to a text file. It will take days to complete, since it generates around 2^36 lines. Can anyone please tell us how to parallelise this code on an HPC system?

You may not need to parallelize this; you can speed it up by about 10000x by moving to a compiled language. (Seriously; see below.) Octave, or even MATLAB, is going to be slow as molasses running this. They're great for big matrix operations, but tons of nested loops with if statements in them are going to run slow, slow, slow. Normally I'd suggest moving Octave/MATLAB code to FORTRAN, but since you've already got the file I/O written essentially with C statements anyway, the C equivalent of this code almost writes itself:
#include <stdio.h>

int main(int argc, char **argv) {
  int a,b,c,d,e,f,g,h,i,j,k,l;
  int s,r,m,n,o,p,q,u,t;
  FILE *myfile;

  s=1;
  r=m=n=o=p=q=u=t=19;
  myfile = fopen ("sequence2-c.txt", "w");
  if (myfile == NULL) {  /* stop early if the output file cannot be created */
    perror("fopen");
    return 1;
  }
  for (a=0; a<=1; a++) {
    if (a == 1)
      r = 5;
    for (b=0; b<=r; b++) {
      if (a == 1 && b == 5)
        m = 11;
      for (c=0; c<=m; c++) {
        n = o = 19;
        for (d=0; d<=1; d++) {
          if (d==1)
            n = 5;
          for (e=0; e<=n; e++) {
            if (d==1 && e == 5)
              o = 11;
            for (f=0; f<=o; f++) {
              p = q = 19;
              for (g=0; g<=1; g++) {
                if (g == 1)
                  p = 5;
                for (h=0; h<=p; h++) {
                  if (g == 1 && h==5)
                    q = 11;
                  for (i = 0; i<=q; i++) {
                    t=u=19;
                    for (j=0; j<=1; j++) {
                      if (j==1)
                        t=5;
                      for (k=0; k<=t; k++) {
                        if (j==1 && k==5)
                          u=11;
                        for (l=0;l<=u;l++){
                          s++;
                          fprintf(myfile,"%d.%d.%d:%d.%d.%d:%d.%d.%d:%d.%d.%d\n",a,b,c,d,e,f,g,h,i,j,k,l);
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
  fclose(myfile);  /* flush buffered output and close the file */
  return 0;
}
Running your Octave code above and this C code (compiled with -O3) for one minute each, the Octave code got through about 2,163 items in the sequence, and the compiled C code got through 23,299,068. So that's good. At that rate, the full 2^36 lines would still take about 49 hours in C.
In terms of parallelization, breaking this up into independent pieces is easy, but they won't be perfectly load-balanced. If you start (say) 26 processes, and give them (a=0,b=0), (a=0,b=1), ..., (a=0,b=19), (a=1,b=0), (a=1,b=1), ..., (a=1,b=5), they can all run independently and you can concatenate the results when they're all done. The only down side is that the pieces aren't all the same size: (a=1,b=5) covers only 12 values of c rather than 20, so that job will finish sooner than the others, but maybe that's good enough to start.
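A different way to split the work, sketched here as an alternative rather than as the approach described above: each of the four x.y.z groups takes exactly 512 distinct values (400 + 100 + 12), so line number n of the sequence can be decoded from four 9-bit fields of n. Any worker can then produce an arbitrary range of line numbers, which makes equal-sized chunks straightforward. The names Triple, build_triples, format_line and write_range are made up for this sketch.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { int x, y, z; } Triple;

/* Fill tab with the valid values of one x.y.z group, in the same order
   the original nested loops visit them. Returns the count (512). */
int build_triples(Triple tab[512]) {
    int count = 0;
    for (int x = 0; x <= 1; x++) {
        int ymax = (x == 1) ? 5 : 19;
        for (int y = 0; y <= ymax; y++) {
            int zmax = (x == 1 && y == 5) ? 11 : 19;
            for (int z = 0; z <= zmax; z++) {
                tab[count].x = x;
                tab[count].y = y;
                tab[count].z = z;
                count++;
            }
        }
    }
    return count;
}

/* Format line number n (0-based, n < 2^36) of the sequence into buf. */
void format_line(const Triple *tab, uint64_t n, char *buf, size_t len) {
    const Triple *g0 = &tab[(n >> 27) & 511];  /* a.b.c */
    const Triple *g1 = &tab[(n >> 18) & 511];  /* d.e.f */
    const Triple *g2 = &tab[(n >> 9) & 511];   /* g.h.i */
    const Triple *g3 = &tab[n & 511];          /* j.k.l */
    snprintf(buf, len, "%d.%d.%d:%d.%d.%d:%d.%d.%d:%d.%d.%d",
             g0->x, g0->y, g0->z, g1->x, g1->y, g1->z,
             g2->x, g2->y, g2->z, g3->x, g3->y, g3->z);
}

/* Write lines [first, last) to out. Each worker would call this with
   its own range and its own output file. */
void write_range(const Triple *tab, uint64_t first, uint64_t last, FILE *out) {
    char buf[64];
    for (uint64_t n = first; n < last; n++) {
        format_line(tab, n, buf, sizeof buf);
        fputs(buf, out);
        fputc('\n', out);
    }
}
```

With P workers, worker w would take the range from w * 2^36 / P up to (w + 1) * 2^36 / P, and the per-worker files concatenate in order to give the full sequence.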

Related

How much can we trust warnings generated by static analysis tools for vulnerability detection?

I am running flawfinder on a set of libraries written in C/C++, and it generates a lot of warnings. My question is: how much can I rely on these warnings? For example, consider the following function from the numpy library (https://github.com/numpy/numpy/blob/4ada0641ed1a50a2473f8061f4808b4b0d68eff5/numpy/f2py/src/fortranobject.c):
static PyObject *
fortran_doc(FortranDataDef def)
{
char *buf, *p;
PyObject *s = NULL;
Py_ssize_t n, origsize, size = 100;
if (def.doc != NULL) {
size += strlen(def.doc);
}
origsize = size;
buf = p = (char *)PyMem_Malloc(size);
if (buf == NULL) {
return PyErr_NoMemory();
}
if (def.rank == -1) {
if (def.doc) {
n = strlen(def.doc);
if (n > size) {
goto fail;
}
memcpy(p, def.doc, n);
p += n;
size -= n;
}
else {
n = PyOS_snprintf(p, size, "%s - no docs available", def.name);
if (n < 0 || n >= size) {
goto fail;
}
p += n;
size -= n;
}
}
else {
PyArray_Descr *d = PyArray_DescrFromType(def.type);
n = PyOS_snprintf(p, size, "'%c'-", d->type);
Py_DECREF(d);
if (n < 0 || n >= size) {
goto fail;
}
p += n;
size -= n;
if (def.data == NULL) {
n = format_def(p, size, def) == -1;
if (n < 0) {
goto fail;
}
p += n;
size -= n;
}
else if (def.rank > 0) {
n = format_def(p, size, def);
if (n < 0) {
goto fail;
}
p += n;
size -= n;
}
else {
n = strlen("scalar");
if (size < n) {
goto fail;
}
memcpy(p, "scalar", n);
p += n;
size -= n;
}
}
if (size <= 1) {
goto fail;
}
*p++ = '\n';
size--;
/* p now points one beyond the last character of the string in buf */
#if PY_VERSION_HEX >= 0x03000000
s = PyUnicode_FromStringAndSize(buf, p - buf);
#else
s = PyString_FromStringAndSize(buf, p - buf);
#endif
PyMem_Free(buf);
return s;
fail:
fprintf(stderr, "fortranobject.c: fortran_doc: len(p)=%zd>%zd=size:"
" too long docstring required, increase size\n",
p - buf, origsize);
PyMem_Free(buf);
return NULL;
}
There are two memcpy() API calls, and flawfinder tells me that:
['vul_fortranobject.c:216: [2] (buffer) memcpy:\\n Does not check for buffer overflows when copying to destination (CWE-120).\\n Make sure destination can always hold the source data.\\n memcpy(p, "scalar", n);']
I am not sure whether the report is true.

To answer your question: static analysis tools (like FlawFinder) can generate a LOT of "false positives".
I Googled to find some quantifiable information for you, and found an interesting article about "DeFP":
https://arxiv.org/pdf/2110.03296.pdf
Static analysis tools are frequently used to detect potential vulnerabilities in software systems. However, an inevitable problem of these tools is their large number of warnings with a high false positive rate, which consumes time and effort for investigating. In this paper, we present DeFP, a novel method for ranking static analysis warnings.
Based on the intuition that warnings which have similar contexts tend to have similar labels (true positive or false positive), DeFP is built with two BiLSTM models to capture the patterns associated with the contexts of labeled warnings. After that, for a set of new warnings, DeFP can calculate and rank them according to their likelihoods to be true positives (i.e., actual vulnerabilities).
Our experimental results on a dataset of 10 real-world projects show that using DeFP, by investigating only 60% of the warnings, developers can find +90% of actual vulnerabilities. Moreover, DeFP improves the state-of-the-art approach 30% in both Precision and Recall.
Apparently, the authors built a neural network to analyze FlawFinder results, and rank them.
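I doubt DeFP is a practical "solution" for you. But yes: if you think that specific "memcpy()" warning is a "false positive", then I'm inclined to agree. It very well could be :) The flagged call, memcpy(p, "scalar", n), comes right after a size < n check that jumps to fail, so the destination has room for the n bytes being copied.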
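To illustrate why a scanner can flag a call that is in fact guarded, here is a minimal sketch of the check-then-copy pattern that fortran_doc uses around its memcpy calls. This is not numpy's code, and append_bytes is a hypothetical helper. Flawfinder works by matching function names such as memcpy against a list of risky functions rather than by tracing whether a bounds check precedes the call, so code shaped like this would likely still be reported.

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes from src to *p only if they fit in the *remaining bytes
   left in the buffer, then advance the cursor and shrink the remaining
   size. Returns 0 on success, -1 if the data would not fit. */
int append_bytes(char **p, size_t *remaining, const char *src, size_t n) {
    if (*remaining < n) {
        return -1;          /* plays the role of "goto fail" in fortran_doc */
    }
    memcpy(*p, src, n);     /* n <= *remaining was checked just above */
    *p += n;
    *remaining -= n;
    return 0;
}
```

A warning on the memcpy line here would be a false positive in the same sense: the check exists, but a purely lexical tool does not see it.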

OpenACC bitonic sort is much slower on GPU than on CPU

I have the following bit of code to sort double values on my GPU:
void bitonic_sort(double *data, int length) {
  #pragma acc data copy(data[0:length], length)
  {
    int i,j,k;
    for (k = 2; k <= length; k *= 2) {
      for (j=k >> 1; j > 0; j = j >> 1) {
        #pragma acc parallel loop gang worker vector independent
        for (i = 0; i < length; i++) {
          int ixj = i ^ j;
          if ((ixj) > i) {
            if ((i & k) == 0 && data[i] > data[ixj]) {
              _ValueType buffer = data[i];
              data[i] = data[ixj];
              data[ixj] = buffer;
            }
            if ((i & k) != 0 && data[i] < data[ixj]) {
              _ValueType buffer = data[i];
              data[i] = data[ixj];
              data[ixj] = buffer;
            }
          }
        }
      }
    }
  }
}
This is a bit slower on my GPU than on my CPU. I'm using GCC 6.1. I can't figure out how to run the whole code on my GPU. So far, only the parallel loop is executed on the GPU, and execution switches between CPU and GPU for each iteration of the outer loops.
I'd like to run the whole content of the function on the GPU, but I can't figure out how. One major problem for me now is that the GCC implementation currently doesn't allow nested parallelism, so I can't use a parallel construct inside another parallel construct. Is there any way to get around that?
I've tried putting a kernels construct on top of the first loop but that slows it down by a factor of about 10. If I use a parallel construct above the first loop instead, the result isn't sorted any more, which makes sense. The two outer loops need to be executed sequentially for the algorithm to work.
If you have any other suggestions on how I could improve performance, I would be grateful as well.

Sorting too slow

So, I'm doing a project for my programming languages class, and I have to create a structure, sort it, and then show the time it takes. The thing is, bubble sort (case 1) takes 60 seconds, insertion (case 2) takes 5 seconds, and selection (case 4) takes 10 seconds, all sorting 100000 elements. Shell sort only takes 0.03 seconds, so I started thinking I might have something wrong with my algorithms. Can someone help me?
void ordenesc(compleja * vd, int tam)
{
int i=0,j=0,k=0,aux=0,op=0,inc=0,minimo=0;
char auxcad[20];
clock_t start, end;
double tiempo;
op=menus(3);
start = clock();
switch(op)
{
case 1://Burbujeo
for(i=1;i<=tam;i++)
{
for(j=0;j<tam-1;j++)
{
if(vd[j].nro>vd[j+1].nro)
{
aux=vd[j].nro;
vd[j].nro=vd[j+1].nro;
vd[j+1].nro=aux;
strcpy(auxcad,vd[j].cad);
strcpy(vd[j].cad,vd[j+1].cad);
strcpy(vd[j+1].cad,auxcad);
}
}
}
break;
case 2://Inserccion
for(i = 1; i < tam; i++)
{
aux=vd[i].nro;
strcpy(auxcad,vd[i].cad);
for (j = i - 1; j >= 0 && vd[j].nro > aux; j--)
{
vd[j+1].nro=vd[j].nro;
strcpy(vd[j+1].cad,vd[j].cad);
j--;
}
vd[j+1].nro=aux;
strcpy(vd[j+1].cad,auxcad);
}
break;
case 3://Shell
inc=(tam/2);
while (inc > 0)
{
for (i=0; i < tam; i++)
{
j = i;
aux = vd[i].nro;
strcpy(auxcad,vd[i].cad);
while ((j >= inc) && (vd[j-inc].nro > aux))
{
vd[j].nro = vd[j - inc].nro;
strcpy(vd[j].cad,vd[j-inc].cad);
j = j - inc;
}
vd[j].nro = aux;
strcpy(vd[j].cad,auxcad);
}
if (inc == 2)
inc = 1;
else
inc = inc * 5 / 11;
}
break;
case 4://Seleccion
for(i=0;i<tam-1;i++)
{
minimo=i;
for(j=i+1;j<tam;j++)
{
if(vd[minimo].nro > vd[j].nro) minimo=j;
}
aux=vd[minimo].nro;
vd[minimo].nro=vd[i].nro;
vd[i].nro=aux;
strcpy(auxcad,vd[minimo].cad);
strcpy(vd[minimo].cad,vd[i].cad);
strcpy(vd[i].cad,auxcad);
}
break;
case 9:
break;
default:
break;
}
end = clock();
tiempo = ((double) (end - start)) / CLOCKS_PER_SEC;
//system("cls");
i=0;
for(i=0;i<tam;i++){
printf("%d %s \n",vd[i].nro,vd[i].cad);}
printf("\n Tardo %f segundos \n", tiempo);
return;
}
P.S.: Edited the text. Sorry, English is not my first language and my brain is failing because of this.

To make sure your sort algorithm works as expected, you could add a check to the final loop that the elements are actually ordered when you print them. It's relatively unlikely that there is a fundamental error in the algorithm and it still sorts correctly.
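Such an ordering check could look like the sketch below. The struct layout is an assumption (the question does not show the definition of compleja), and esta_ordenado is a made-up name. A check like this would, for example, expose a loop that decrements its index twice per iteration, as the insertion case appears to do with j-- in both the for header and the loop body.

```c
#include <stdio.h>

/* Assumed layout: the question does not show the struct definition. */
typedef struct {
    int nro;
    char cad[20];
} compleja;

/* Returns 1 if vd[0..tam-1] is in non-decreasing order of nro, else 0.
   On failure, reports the first out-of-order position on stderr. */
int esta_ordenado(const compleja *vd, int tam) {
    for (int i = 1; i < tam; i++) {
        if (vd[i - 1].nro > vd[i].nro) {
            fprintf(stderr, "out of order at index %d\n", i);
            return 0;
        }
    }
    return 1;
}
```

Calling it right after the timing code (so the check itself is not timed) keeps the measurements comparable.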
One point of the exercise may be to show that the choice of sorting algorithm really matters. Shell sort is the only algorithm in your list that typically performs better than O(n^2); bubble, insertion and selection sort are all O(n^2). So I wouldn't be too surprised by wide differences in performance.
One improvement you could make to bubble sort is that the inner loop only needs to run over the first tam - i elements (instead of all tam), since after pass i the i largest elements have already bubbled up to their final positions at the end of the array.
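A minimal sketch of a bubble sort with a shrinking inner loop, again assuming a struct layout the question does not show. The function name burbuja is made up, and two things here go beyond the suggestion above: the early exit when a pass makes no swaps, and swapping whole structs by assignment instead of field by field.

```c
/* Assumed layout: the question does not show the struct definition. */
typedef struct {
    int nro;
    char cad[20];
} compleja;

/* Bubble sort by nro, ascending. */
void burbuja(compleja *vd, int tam) {
    for (int i = 1; i < tam; i++) {
        int swapped = 0;
        /* The last i - 1 elements are already in their final places,
           so pass i only compares pairs up to index tam - i. */
        for (int j = 0; j < tam - i; j++) {
            if (vd[j].nro > vd[j + 1].nro) {
                compleja tmp = vd[j];   /* struct assignment copies nro and cad */
                vd[j] = vd[j + 1];
                vd[j + 1] = tmp;
                swapped = 1;
            }
        }
        if (!swapped) {
            break;                      /* no swaps in this pass: already sorted */
        }
    }
}
```

This roughly halves the number of comparisons but the algorithm stays O(n^2), so it will not come close to the Shell sort timing.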
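Another improvement may be to just copy the pointers instead of the contents of the char arrays. This only works if cad is declared as a char * pointing to separately stored strings, not as a fixed-size array inside the struct. For example,
instead of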
char auxcad[20];
...
strcpy(auxcad, vd[j].cad);
strcpy(vd[j].cad, vd[j+1].cad);
strcpy(vd[j+1].cad, auxcad);
you may want to write
char* auxcad;
...
auxcad = vd[j].cad;
vd[j].cad = vd[j+1].cad;
vd[j+1].cad = auxcad;

SDCC integer comparison unexpected behavior

I'm trying to get started on a project using a PIC18F24J10. Attempting to use SDCC for this. At this point I've reduced my code to simply trying to nail down what is happening, as I've been seeing bizarre behavior for a while now and can't really proceed until I figure out what is going on. Not sure if this is my only problem at this point, but I have no idea why this is happening. Timer interrupt fires off, coupled with a #defined for loop causes LEDs on PORTC to blink maybe twice a second. If I just do a PORTC=0xFF and PORTC=0 this works fine. But it gets weird when I try to go much beyond that.
...
#define _RC0 31
#define _RC1 32
#define _RC2 33
#define _RC3 34
#define _RC4 35
#define _RC5 36
#define _RC6 37
#define _RC7 38
#define HI 1
#define LO 0
void why(unsigned char p, int z)
{
if(z == HI)
{
if(p == _RC0) PORTCbits.RC0 = 1;
else if(p == _RC1) PORTCbits.RC1 = 1;
else if(p == _RC2) PORTCbits.RC2 = 1;
else if(p == _RC3) PORTCbits.RC3 = 1;
else if(p == _RC4) PORTCbits.RC4 = 1;
else if(p == _RC5) PORTCbits.RC5 = 1;
else if(p == _RC6) PORTCbits.RC6 = 1;
else if(p == _RC7) PORTCbits.RC7 = 1;
}
else if(z == LO)
{
PORTC = 0b11110000;
}
else
{
PORTC = 0b10101010;
}
}
void timer_isr (void) __interrupt(1) __using (1)
{
TMR0H=0;
TMR0L=0;
why(_RC0, LO);
why(_RC1, LO);
why(_RC2, LO);
WAIT_CYCLES(5000);
why(_RC0, HI);
why(_RC1, HI);
why(_RC2, HI);
WAIT_CYCLES(5000);
}
void main(void)
{
WDTCONbits.SWDTE = 0;
WDTCONbits.SWDTEN = 0;
TRISC = 0b00000000;
PORTC=0b00000000;
INTCONbits.GIE = 1;
INTCONbits.PEIE = 1;
INTCONbits.TMR0IF = 0;
INTCONbits.TMR0IE = 1;
T0CONbits.T08BIT = 0;
T0CONbits.T0CS = 0;
T0CONbits.PSA = 1;
TMR0H = 0;
TMR0L = 0;
T0CONbits.TMR0ON = 1;
while(1)
{
}
}
In the code above, four of the LEDs should blink, while the other four stay on. Instead, the LEDs stay on in the 10101010 pattern that happens in the "else" block, which should happen when "why" is called with any value other than HI or LO. I never call it with anything else, so why does it ever reach that?
UPDATE - Further reduction, no more interrupts or unspecified includes/defines. This is now the entirety of the program, and I am still seeing the same behavior. Changed the pattern from 10101010 to 10101011 so that I could verify the chip is actually being programmed with the new code, and it appears to be.
#include "pic16/pic18f24j10.h"
#define WAIT_CYCLES(A) for(__delay_cycle = 0;__delay_cycle < A;__delay_cycle++)
int __delay_cycle;
#define HI 1
#define LO 0
void why(int z)
{
if(z == HI)
{
PORTC = 0b11111111;
}
else if(z == LO)
{
PORTC = 0b11110000;
}
else
{
PORTC = 0b10101011;
}
}
void main(void)
{
TRISC = 0b00000000;
PORTC=0b00000000;
while(1)
{
why(LO);
WAIT_CYCLES(5000);
why(HI);
WAIT_CYCLES(5000);
}
}

Once the interrupt is asserted it is never cleared. That results in timer_isr() being called repeatedly. No other code is ever reached. The TMR0IF bit must be cleared in software by the Interrupt Service Routine.
Keep in mind you not only need to spend less time in the ISR than the period of the timer - it’s a good practice to spend the minimum amount of time necessary.
Remove the delays and simply toggle a flag or increment a register. In your main while (1) loop monitor the flag or counter and make your calls to why() from there.
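The flag-and-poll structure described above might look like the following sketch. It is written so that it compiles with a desktop compiler: tmr0_if and portc are hypothetical stand-ins for INTCONbits.TMR0IF and PORTC from the device header, and on the PIC timer_isr_body would be the body of the SDCC __interrupt routine. Which LEDs toggle is an arbitrary choice for illustration.

```c
#include <stdint.h>

/* Stand-ins so the sketch builds off-target (assumptions, not SDCC names). */
volatile uint8_t tmr0_if;       /* stands in for INTCONbits.TMR0IF */
volatile uint8_t portc;         /* stands in for PORTC */

volatile uint8_t tick_pending;  /* set by the ISR, cleared by the main loop */

/* ISR body: acknowledge the interrupt, record the tick, return quickly. */
void timer_isr_body(void) {
    tmr0_if = 0;        /* clear the flag so the ISR is not re-entered at once */
    tick_pending = 1;
}

/* One pass of the main loop: the slow work happens here, not in the ISR. */
void main_loop_step(void) {
    if (tick_pending) {
        tick_pending = 0;
        portc ^= 0x0F;  /* toggle the low four LEDs */
    }
}
```

On the target, main() would call main_loop_step() inside its while (1) loop, with no delay loops inside the interrupt routine.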

N nested for-loops

I need to create N nested loops to print all combinations of a binary sequence of length N. I'm not sure how to do this.
Any help would be greatly appreciated. Thanks.

Use recursion. e.g., in Java
public class Foo {
    public static void main(String[] args) {
        new Foo().printCombo("", 5);
    }

    void printCombo(String soFar, int len) {
        if (len == 1) {
            System.out.println(soFar+"0");
            System.out.println(soFar+"1");
        }
        else {
            printCombo(soFar+"0", len-1);
            printCombo(soFar+"1", len-1);
        }
    }
}
will print
00000
00001
00010
...
11101
11110
11111

You have two options here:
Use backtracking instead.
Write a program that generates a dynamic program with N loops and then executes it.
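The first option, backtracking, can be sketched in C as follows. This is one possible reading of the suggestion, and enumerate_bits is a made-up name. The function fills a buffer one position at a time, tries both choices at each position, and prints the string once all N positions are set.

```c
#include <stdio.h>

/* Fill buf[pos..n-1] with every combination of '0' and '1' and print each
   completed string. buf must have room for n + 1 chars.
   Returns how many strings were printed. */
unsigned long enumerate_bits(char *buf, int pos, int n) {
    if (pos == n) {
        buf[n] = '\0';
        puts(buf);
        return 1;
    }
    unsigned long count = 0;
    for (char bit = '0'; bit <= '1'; bit++) {
        buf[pos] = bit;                             /* choose */
        count += enumerate_bits(buf, pos + 1, n);   /* explore */
        /* undoing the choice is implicit: buf[pos] is overwritten next time */
    }
    return count;
}
```

Calling enumerate_bits(buf, 0, N) with a buffer of N + 1 chars replaces the N nested loops; the recursion depth plays the role of the nesting level.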

You don't need any nested loops for this. You need one recursive function to print a binary value of length N and a for loop to iterate over all numbers [0 .. (2^N)-1].
user949300's solution is also very good, but it might not work in all languages.
Here's my solution(s), the recursive one is approximately twice as slow as the iterative one:
#include <stdio.h>

#ifdef RECURSIVE
/* Print the low len bits of num, most significant bit first. */
static void print_bits(int num, int len)
{
    if(len == 0)
    {
        return;
    }
    print_bits(num >> 1, len -1);
    putchar((num & 1) + '0');
}

void print_bin(int num, int len)
{
    print_bits(num, len);
    /* newline goes after the digits, matching the iterative version */
    putchar('\n');
}
#else
void print_bin(int num, int len)
{
    char str[len+1];
    int i;
    str[len] = '\0';
    for (i = 0; i < len; i++)
    {
        str[len-1-i] = !!(num & (1 << i)) + '0';
    }
    printf("%s\n", str);
}
#endif

int main()
{
    int len = 24;
    int i;
    int end = 1 << len;
    for (i = 0; i < end ; i++)
    {
        print_bin(i, len);
    }
    return 0;
}
(I tried this myself on a Mac printing all binary numbers of length 24 and the terminal froze. But that is probably a poor terminal implementation. :-)
$ gcc -O3 binary.c ; time ./a.out > /dev/null ; gcc -O3 -DRECURSIVE binary.c ; time ./a.out > /dev/null
real 0m1.875s
user 0m1.859s
sys 0m0.008s
real 0m3.327s
user 0m3.310s
sys 0m0.010s

I don't think we need recursion or n nested for-loops to solve this problem. It would be easy to handle this using bit manipulation.
In C++, as an example:
for(int i=0;i<(1<<n);i++)
{
    for(int j=0;j<n;j++)
        if(i&(1<<j))
            printf("1");
        else
            printf("0");
    printf("\n");
}
