How to launch sequential code on the GPU - OpenACC

I have a CPU code:
if (number_of_pushed_particles < N && number_of_alive_particles < K)
{
push_particle();
number_of_pushed_particles++;
}
Here number_of_pushed_particles, number_of_alive_particles, K and N are int; K and N are const. The function push_particle() is:
push_particle()
{
particles[LIFE].id=++MAX_ELEMENT;
particles[LIFE].rx=0.0;
particles[LIFE].ry=0.0;
particles[LIFE].rz=0.0;
...
++LIFE;
}
Particle is a structure of floats. The array Particle particles[0:GL] and the integer variables LIFE and MAX_ELEMENT are statically allocated on the device. That is why I do not want to use #pragma acc update host/device before/after calling the push_particle() function and lose time copying data. How can I launch this sequential code on the GPU?

The OpenACC 2.6 standard, which was just ratified, includes a "serial" region, but it will be a while before this support is added to the various compiler implementations.
The current method is to use a "parallel" region and set "num_gangs(1)" and "vector_length(1)".
Something like:
push_particle()
{
#pragma acc parallel num_gangs(1) vector_length(1) present(particles)
{
particles[LIFE].id=++MAX_ELEMENT;
particles[LIFE].rx=0.0;
particles[LIFE].ry=0.0;
particles[LIFE].rz=0.0;
...
++LIFE;
}
}
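Once compiler support for OpenACC 2.6 arrives, the same region could be expressed more directly with the serial construct; a minimal sketch of what that would look like:

push_particle()
{
#pragma acc serial present(particles)
{
particles[LIFE].id=++MAX_ELEMENT;
...
++LIFE;
}
}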

Related

Am I using Rust's bitvec_simd in the most efficient way?

I'm using bitvec_simd = "0.20" for bitvector operations in Rust.
I have two instances of a struct, call them clique_into and clique_from. The relevant fields of the struct are two bitvectors, members_bv and neighbors_bv, as well as a vector of integers called members. The members_bv and members vector represent the same data.
After profiling my code, I find that this is my bottleneck (41% of the time is spent here): checking whether the members (typically 1) of clique_from are all neighbors of clique_into.
My current approach is to loop through the members of clique_from (typically 1) and check each one in turn to see if it's a neighbor of clique_into.
Here's my code:
use bitvec_simd::BitVec;
use smallvec::{smallvec, SmallVec};
struct Clique {
    members_bv: BitVec,
    members: SmallVec<[usize; 256]>,
    neighbors_bv: BitVec,
}
fn are_cliques_mergable(clique_into: &Clique, clique_from: &Clique) -> bool {
    for i in 0..clique_from.members.len() {
        if !clique_into.neighbors_bv.get_unchecked(clique_from.members[i]) {
            return false;
        }
    }
    return true;
}
That code works fine and it's fast, but is there a way to make it faster? We can assume that clique_from almost always has a single member so the inner for loop is almost always executed once.
It likely comes down to this:
if !clique_into.neighbors_bv.get_unchecked(clique_from.members[i])
Is get_unchecked() the fastest way to do this? While I have written this so it will never panic, the compiler doesn't know that. Does this force Rust to waste time checking if it should panic?

Sort one array by another on the GPU

I have code which looks like this:
...
const int N = 10000;
std::array<std::pair<int,int>, N> nnt;
bool compar(std::pair<int,int> i, std::pair<int,int> j) { return (int)(i.second) > (int)(j.second); }
...
int main(int argc, char **argv)
{
#pragma acc data create(...,nnt)
{
#pragma acc parallel loop
{...}
//the nnt array is filled here
//here i need to sort nnt allocated on gpu, using the
//comparator compar()
}
}
So I need to sort an array of pairs, allocated on the GPU, by means of CUDA or OpenACC.
As far as I understand, it is unlikely that I will be able to sort a std::array of std::pairs on the GPU.
What I actually need is to sort one array, allocated on the GPU, by another one allocated on the GPU, i.e. if there are
int a[N];
int b[N];
which are allocated on or copied to the GPU by means of CUDA or OpenACC, I need to sort the array a by the values of the array b, and I need this sort to be done on the GPU. Maybe there are some CUDA functions that will help, or the CUDA Thrust sort functions could be used (like thrust::stable_sort), I don't know. Is there a way to do it?
Yes, one possible method would be to use thrust::sort_by_key, which allows you to sort device data using a device pointer.
This blog explains how to interface between Thrust and OpenACC, including passing a deviceptr between routines.
This example code may be of interest. Specifically, the hash example gives a fully-worked example of calling thrust::sort_by_key from OpenACC.
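For reference, a minimal sketch of that approach (the wrapper name sort_a_by_b is illustrative, not from the question): a small Thrust wrapper is compiled with nvcc (or another Thrust-capable compiler), and the OpenACC code hands it raw device pointers through host_data use_device.

// sort_by_key wrapper; a and b are already device pointers
#include <thrust/device_ptr.h>
#include <thrust/sort.h>
#include <thrust/functional.h>

extern "C" void sort_a_by_b(int *a, int *b, int n)
{
    thrust::device_ptr<int> keys(b);   // wrap the raw device pointers
    thrust::device_ptr<int> vals(a);
    // sort descending by b (matching the greater-than comparator in the
    // question) and reorder a to match
    thrust::sort_by_key(keys, keys + n, vals, thrust::greater<int>());
}

On the OpenACC side the call would look something like:

#pragma acc data copy(a[0:N], b[0:N])
{
    // ... fill a and b on the device ...
    #pragma acc host_data use_device(a, b)
    {
        sort_a_by_b(a, b, N);
    }
}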

Use of pthread increases execution time, suggestions for improvements

I had a piece of code, which looked like this,
for(i=0;i<NumberOfSteps;i++)
{
for(k=0;k<NumOfNodes;k++)
{
mark[crawler[k]]++;
r = rand() % node_info[crawler[k]].num_of_nodes;
crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
}
}
I changed it so that the load can be split among multiple threads. Now it looks like this,
for(i=0;i<NumberOfSteps;i++)
{
for(k=0;k<NumOfNodes;k++)
{
pthread_mutex_lock( &mutex1 );
mark[crawler[k]]++;
pthread_mutex_unlock( &mutex1 );
pthread_mutex_lock( &mutex1 );
r = rand() % node_info[crawler[k]].num_of_nodes;
pthread_mutex_unlock( &mutex1 );
pthread_mutex_lock( &mutex1 );
crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
pthread_mutex_unlock( &mutex1 );
}
}
I need the mutexes to protect shared variables. It turns out that my parallel code is slower. But why? Is it because of the mutexes?
Could this possibly have something to do with the cache line size?
You are not parallelizing anything but the loop heads. Everything between lock and unlock is forced to be executed sequentially. And since lock/unlock are (potentially) expensive operations, the code is getting slower.
To fix this, you should at least separate expensive computations (without mutex protection) from access to shared data areas (with mutexes). Then try to move the mutexes out of the inner loop.
You could use atomic increment instructions (depends on platform) instead of plain '++', which is generally cheaper than mutexes. But beware of doing this often on data of a single cache line from different threads in parallel (see 'false sharing').
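For the mark[] increment, that could look like the following (a sketch using the GCC/Clang builtin; C11's stdatomic.h atomic_fetch_add would be equivalent):

/* atomically increment mark[crawler[k]] without taking mutex1 */
__atomic_fetch_add(&mark[crawler[k]], 1, __ATOMIC_RELAXED);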
AFAICS, you could rewrite the algorithm as indicated below without needing mutexes or atomic increments at all. getFirstK() is NumOfNodes/NumOfThreads*t if NumOfNodes is an integral multiple of NumOfThreads.
for(t=0;t<NumberOfThreads;t++)
{
kbegin = getFirstK(NumOfNodes, NumOfThreads, t);
kend = getFirstK(NumOfNodes, NumOfThreads, t+1);
// start the following in a separate thread with kbegin and kend
// copied to thread local vars kbegin_ and kend_
int k, i, r;
unsigned state = kend_; // really bad seed
for(k=kbegin_;k<kend_;k++)
{
for(i=0;i<NumberOfSteps;i++)
{
mark[crawler[k]]++;
r = rand_r(&state) % node_info[crawler[k]].num_of_nodes;
crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
}
}
}
// wait for threads/jobs to complete
This way to generate random numbers may lead to bad random distributions, see this question for details.
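A sketch of the actual thread launch, assuming mark, crawler, node_info, DataBlock, NumberOfSteps and getFirstK() from the question, and assuming NumberOfThreads is a compile-time constant (the Range struct and worker() are illustrative names):

#include <pthread.h>
#include <stdlib.h>

typedef struct { int kbegin, kend; } Range;

void *worker(void *arg)
{
    Range r = *(Range *)arg;
    unsigned state = r.kend;                  /* per-thread seed for rand_r() */
    for (int k = r.kbegin; k < r.kend; k++)
        for (int i = 0; i < NumberOfSteps; i++)
        {
            mark[crawler[k]]++;
            int rnd = rand_r(&state) % node_info[crawler[k]].num_of_nodes;
            crawler[k] = (int)DataBlock[node_info[crawler[k]].index + rnd][0];
        }
    return NULL;
}

/* in the caller */
pthread_t tid[NumberOfThreads];
Range ranges[NumberOfThreads];
for (int t = 0; t < NumberOfThreads; t++)
{
    ranges[t].kbegin = getFirstK(NumOfNodes, NumOfThreads, t);
    ranges[t].kend   = getFirstK(NumOfNodes, NumOfThreads, t + 1);
    pthread_create(&tid[t], NULL, worker, &ranges[t]);
}
for (int t = 0; t < NumberOfThreads; t++)
    pthread_join(tid[t], NULL);               /* wait for all jobs to complete */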

C/OpenMP - issue with threadprivate and vectors of pointers

I'm new to the world of parallel programming and OpenMP, so this may be a naive question, but I can't really come up with a good answer to what I'm experiencing, so I hope someone will be able to shed some light on the matter.
What I am trying to achieve is to have a private copy of a dynamically allocated matrix (of integers) for every thread that will handle the following parallel section, but as soon as the flow of execution enters said region the reference to the supposedly private matrix holds a null value.
Is there any limitation of this directive I'm not aware of? Everything seems to work just fine with one-dimensional dynamic arrays.
A snippet of the code is the following one...
#define n 10000
int **matrix;
#pragma omp threadprivate(matrix)
int main()
{
matrix = (int**) calloc(n, sizeof(int*));
for(int i=0;i<n;i++) matrix[i] = (int*) calloc(n, sizeof(int));
AdjacencyMatrix(n, matrix);
...
/* Explicitly turn off dynamic threads */
omp_set_dynamic(0);
#pragma omp parallel
{
// From now on, matrix is NULL...
executor_p(matrix, n);
}
....
Look at the OpenMP documentation regarding what happens with the threadprivate clause:
On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined, unless a COPYIN clause is specified in the PARALLEL directive
There's no guarantee of what value is going to be stored in the matrix variable in the parallel region.
OpenMP can privatise only variables with known storage size. That is, you can have a private copy of an array if it was defined like double matrix[N][M]. In your case, not only is the storage size unknown (a pointer doesn't store the number of elements it points to), but your matrix is also not a contiguous area in memory; rather, it is a pointer to a list of dynamically allocated rows.
What you would end up with is having a private copy of the top-level pointer, not a private copy of the matrix data itself.
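One way to keep the threadprivate pointer and still get a usable private matrix is to do the allocation (and filling) inside the parallel region, so that each thread builds its own copy; a sketch reusing the names from the question, at the cost of constructing the matrix once per thread:

#pragma omp parallel
{
    /* each thread allocates and fills its own copy of the threadprivate matrix */
    matrix = (int**) calloc(n, sizeof(int*));
    for (int i = 0; i < n; i++)
        matrix[i] = (int*) calloc(n, sizeof(int));
    AdjacencyMatrix(n, matrix);

    executor_p(matrix, n);
}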

How do you measure the time a function takes to execute?

How can you measure the amount of time a function will take to execute?
This is a relatively short function and the execution time would probably be in the millisecond range.
This particular question relates to an embedded system, programmed in C or C++.
The best way to do that on an embedded system is to set an external hardware pin when you enter the function and clear it when you leave the function. This is done preferably with a little assembly so you don't skew your results too much.
Edit: One of the benefits is that you can do it in your actual application and you don't need any special test code. External debug pins like that are (should be!) standard practice for every embedded system.
There are three potential solutions:
Hardware Solution:
Use a free output pin on the processor and hook an oscilloscope or logic analyzer to the pin. Initialize the pin to a low state; just before calling the function you want to measure, assert the pin to a high state, and just after returning from the function, deassert the pin.
*io_pin = 1;
myfunc();
*io_pin = 0;
Bookworm solution:
If the function is fairly small and you can manage the disassembled code, you can crack open the processor architecture databook and count the cycles it will take the processor to execute each instruction. This will give you the number of cycles required.
Time = # cycles / processor clock rate
This is easier to do for smaller functions, or for code written in assembler (for a PIC microcontroller, for example).
Timestamp counter solution:
Some processors have a timestamp counter which increments at a rapid rate (every few processor clock ticks). Simply read the timestamp before and after the function.
This will give you the elapsed time, but beware that you might have to deal with the counter rollover.
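On x86, for example, that counter is the TSC; a sketch using the GCC/Clang intrinsic (myfunc() stands in for the function under test):

#include <x86intrin.h>                 /* __rdtsc(), GCC/Clang on x86 */

unsigned long long t0 = __rdtsc();
myfunc();
unsigned long long t1 = __rdtsc();
unsigned long long cycles = t1 - t0;   /* divide by the TSC frequency to get seconds */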
Invoke it in a loop with a ton of invocations, then divide by the number of invocations to get the average time.
so:
// begin timing
for (int i = 0; i < 10000; i++) {
invokeFunction();
}
// end time
// divide by 10000 to get actual time.
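A concrete version of that, if C++11's <chrono> is available (invokeFunction() stands in for the function under test):

#include <chrono>

auto begin = std::chrono::steady_clock::now();
for (int i = 0; i < 10000; i++) {
    invokeFunction();
}
auto end = std::chrono::steady_clock::now();
// total nanoseconds divided by the number of invocations
auto perCall = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 10000;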
If you're using Linux, you can time a program's runtime by typing at the command line:
time [program_name]
If you run only the function in main() (assuming C++), the rest of the app's time should be negligible.
I repeat the function call a lot of times (millions) but also employ the following method to discount the loop overhead:
start = getTicks();
repeat n times {
myFunction();
myFunction();
}
lap = getTicks();
repeat n times {
myFunction();
}
finish = getTicks();
// overhead + function + function
elapsed1 = lap - start;
// overhead + function
elapsed2 = finish - lap;
// overhead + function + function - overhead - function = function
ntimes = elapsed1 - elapsed2;
once = ntimes / n; // Average time it took for one function call, sans loop overhead
Instead of calling function() twice in the first loop and once in the second loop, you could just call it once in the first loop and don't call it at all (i.e. empty loop) in the second, however the empty loop could be optimized out by the compiler, giving you negative timing results :)
start_time = timer
function()
exec_time = timer - start_time
Windows XP/NT Embedded or Windows CE/Mobile
You can use QueryPerformanceCounter() to get the value of a VERY FAST counter before and after your function. Then you subtract those 64-bit values and get a delta of "ticks". Using QueryPerformanceFrequency() you can convert the "delta ticks" to an actual time unit. You can refer to the MSDN documentation about those WIN32 calls.
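A sketch (myfunc() stands in for the function under test):

#include <windows.h>

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);      // counter ticks per second
QueryPerformanceCounter(&t0);
myfunc();
QueryPerformanceCounter(&t1);
double seconds = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;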
Other embedded systems
Without operating systems or with only basic OSes you will have to:
program one of the internal CPU timers to run and count freely.
configure it to generate an interrupt when the timer overflows, and in this interrupt routine increment a "carry" variable (this is so you can actually measure time longer than the resolution of the timer chosen).
before your function, save BOTH the "carry" value and the value of the CPU register holding the running ticks for the counting timer you configured.
do the same after your function.
subtract them to get a delta counter tick.
from there it is just a matter of knowing how long a tick means on your CPU/hardware, given the external clock and the prescaling you configured while setting up your timer. You multiply that "tick length" by the "delta ticks" you just got.
VERY IMPORTANT: Do not forget to disable interrupts before and restore them after getting those timer values (both the carry and the register value), otherwise you risk saving incorrect values.
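In sketch form, with disable_interrupts()/enable_interrupts(), timer_carry, TIMER_COUNT and TIMER_PERIOD as placeholders for whatever your platform, timer hardware and ISR actually provide:

unsigned long snapshot(void)
{
    unsigned long carry, count;
    disable_interrupts();                   /* platform-specific */
    carry = timer_carry;                    /* incremented by the overflow ISR */
    count = TIMER_COUNT;                    /* free-running hardware timer register */
    enable_interrupts();
    return carry * TIMER_PERIOD + count;    /* TIMER_PERIOD = ticks per overflow */
}

Call snapshot() before and after your function and subtract the two values to get the delta in timer ticks.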
NOTES
This is very fast because it is only a few assembly instructions to disable interrupts, save two integer values and re-enable interrupts. The actual subtraction and conversion to real time units occurs OUTSIDE the zone of time measurement, that is, AFTER your function.
You may wish to put that code into a function so you can reuse it all around, but it may slow things down a bit because of the function call and the pushing of all the registers to the stack, plus the parameters, then popping them again. In an embedded system this may be significant. It may be better, in C, to use macros instead, or to write your own assembly routine saving/restoring only the relevant registers.
Depends on your embedded platform and what type of timing you are looking for. For embedded Linux, there are several ways you can accomplish this. If you wish to measure the amount of CPU time used by your function, you can do the following:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#define SEC_TO_NSEC(s) ((s) * 1000LL * 1000 * 1000)
int work_function(int c) {
// do some work here
int i, j;
int foo = 0;
for (i = 0; i < 1000; i++) {
for (j = 0; j < 1000; j++) {
foo ^= i + j;
}
}
return foo;
}
int main(int argc, char *argv[]) {
struct timespec pre;
struct timespec post;
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &pre);
work_function(0);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &post);
printf("time %d\n",
(SEC_TO_NSEC(post.tv_sec) + post.tv_nsec) -
(SEC_TO_NSEC(pre.tv_sec) + pre.tv_nsec));
return 0;
}
You will need to link this with the realtime library, just use the following to compile your code:
gcc -o test test.c -lrt
You may also want to read the man page for clock_gettime; there are some issues with running this code on SMP-based systems that could invalidate your testing. You could use something like sched_setaffinity() or the command-line cpuset to force the code onto only one core.
If you are looking to measure user and system time, then you could use times(NULL), which returns something like jiffies. Or you can change the parameter for clock_gettime() from CLOCK_THREAD_CPUTIME_ID to CLOCK_MONOTONIC...but be careful of wrap-around with CLOCK_MONOTONIC.
For other platforms, you are on your own.
Drew
I always implement an interrupt driven ticker routine. This then updates a counter that counts the number of milliseconds since start up. This counter is then accessed with a GetTickCount() function.
Example:
#define TICK_INTERVAL 1 // milliseconds between ticker interrupts
static unsigned long tickCounter;
interrupt ticker (void)
{
tickCounter += TICK_INTERVAL;
...
}
unsigned long GetTickCount(void)
{
return tickCounter;
}
In your code you would time the code as follows:
int function(void)
{
unsigned long ticks = GetTickCount();
do something ...
printf("Time is %lu", GetTickCount() - ticks);
}
In OS X terminal (and probably Unix, too), use "time":
time python function.py
If the code is .NET, use the Stopwatch class (.NET 2.0+), NOT DateTime.Now. DateTime.Now isn't updated accurately enough and will give you crazy results.
If you're looking for sub-millisecond resolution, try one of these timing methods. They'll all get you resolution in at least the tens or hundreds of microseconds:
If it's embedded Linux, look at Linux timers:
http://linux.die.net/man/3/clock_gettime
Embedded Java, look at nanoTime(), though I'm not sure this is in the embedded edition:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#nanoTime()
If you want to get at the hardware counters, try PAPI:
http://icl.cs.utk.edu/papi/
Otherwise you can always go to assembler. You could look at the PAPI source for your architecture if you need some help with this.
