OpenACC 2.0 routine: data locality

Take the following code, which illustrates calling a simple routine on the accelerator, compiled for the device using OpenACC 2.0's routine directive:
#include <iostream>

#pragma acc routine
int function(int *ARRAY, int multiplier){
    int sum = 0;
    #pragma acc loop reduction(+:sum)
    for(int i = 0; i < 10; ++i){
        sum += multiplier * ARRAY[i];
    }
    return sum;
}

int main(){
    int *ARRAY = new int[10];
    int multiplier = 5;
    int out;
    for(int i = 0; i < 10; i++){
        ARRAY[i] = 1;
    }
    #pragma acc enter data create(out) copyin(ARRAY[0:10], multiplier)
    #pragma acc parallel present(out, ARRAY[0:10], multiplier)
    if (function(ARRAY, multiplier) == 50){
        out = 1;
    }else{
        out = 0;
    }
    #pragma acc exit data copyout(out) delete(ARRAY[0:10], multiplier)
    std::cout << out << std::endl;
}
How does function know to use the device copies of ARRAY[0:10] and multiplier when it is called from within a parallel region? How can we enforce the use of the device copies?

When your routine is called within a device region (the parallel in your code), it is being called by the threads on the device, which means those threads will only have access to arrays on the device. The compiler may actually choose to inline that function, or it may be a device-side function call. So you can be sure that when the function is called from the device it receives the device copies of the data, because it essentially inherits the present clause from the parallel region. If you still want to convince yourself that you're running on the device once inside the function, you could call acc_on_device, but that only tells you that you're running on the accelerator, not that you received a device pointer.
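For illustration, a minimal sketch of that check (assuming the OpenACC header is available; acc_on_device only reports where the code is executing, not where the pointer lives):

#include <openacc.h>

#pragma acc routine
int function(int *ARRAY, int multiplier){
    int sum = 0;
    if (acc_on_device(acc_device_not_host)) {
        /* Executing on the accelerator; this says nothing about ARRAY's location. */
    }
    #pragma acc loop reduction(+:sum)
    for(int i = 0; i < 10; ++i){
        sum += multiplier * ARRAY[i];
    }
    return sum;
}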
If you want to enforce the use of device copies more than that, you could make the routine nohost so that it would technically not be valid to call from the host, but that doesn't really do what you're asking, which is to check on the GPU that the array really is a device array.
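As a sketch, that variant is just one clause on the declaration; no host version of the routine is generated, so a host-side call would fail to link:

/* nohost: only a device version of this routine is compiled. */
#pragma acc routine nohost
int function(int *ARRAY, int multiplier);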
Keep in mind though that any code inside a parallel region that is not inside a loop will be run gang-redundantly, so the write to out is likely a race condition, unless you happen to be running with one gang or you write to it using an atomic.
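As a sketch of the latter suggestion (the temporary result variable is my addition), the redundant stores can be made atomic so they cannot race; alternatively the region could be serialized with num_gangs(1):

#pragma acc parallel present(out, ARRAY[0:10], multiplier)
{
    int result = function(ARRAY, multiplier);
    /* Every gang executes this redundantly; the atomic keeps the
       redundant stores from racing (they all store the same value). */
    #pragma acc atomic write
    out = (result == 50) ? 1 : 0;
}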

Basically, when you use a "data" clause, the runtime creates/copies the data to device memory, and the block of code marked with "acc routine" is executed on the device. Note that, unlike multi-threading (OpenMP), the host and device do not share memory. So yes, "function" will be using the device copies of ARRAY and multiplier as long as it is called inside the data region. Hope this helps! :)

You should also assign the function one parallelism level, such as gang, worker, or vector; that is more precise.
The routine will then use the data in device memory.
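For illustration, a sketch of that suggestion (vector is my choice here; gang or worker would follow the same pattern):

/* Declare that the routine contains at most vector-level parallelism,
   so the loop inside it can be spread across vector lanes. */
#pragma acc routine vector
int function(int *ARRAY, int multiplier){
    int sum = 0;
    #pragma acc loop vector reduction(+:sum)
    for(int i = 0; i < 10; ++i){
        sum += multiplier * ARRAY[i];
    }
    return sum;
}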

Related

Problem allocating memory for a global struct and freeing it

I am using an embedded board with FreeRTOS.
In a task, I defined two structs and use pvPortMalloc to allocate memory (one struct is a member of the other).
I also pass the address of the struct to some functions.
However, there are some issues when freeing the memory using vPortFree.
The following is my code (test_task.c):
/* Struct definition */
typedef struct __attribute__((packed)) {
    uint8_t num_parameter;
    uint32_t member1;
    uint8_t member2;
    uint8_t *parameter;
} struct_member;

typedef struct __attribute__((packed)) {
    uint16_t num_member;
    uint32_t class;
    struct_member *member;
} struct_master;
I define a global struct and an array below.
uint8_t *arr;
struct_master master;
Function definition:
void decode_func(struct_master *master, uint8_t *arr)
{
    master->member = pvPortMalloc(master->num_member);
    for(int i = 0; i < scr->num_command; ++i){
        master->member[i].parameter = pvPortMalloc(master->member[i].num_parameter);
        do_something();
    }
}
The task's operation is shown below.
At the end of the task, I would like to free the memory:
void test_task()
{
    decode_func(&master, arr);
    do_operation();
    vPortFree(master.member);
    for (int i = 0; i < master.num_member; ++i)
        vPortFree(master.member[i].parameter);
    hTest_task = NULL;
    vTaskDelete(NULL);
}
It is OK to free master.member.
However, when the program tries to free master.member[i].parameter,
it seems the memory had already been freed and the software just resets automatically.
Does anyone know why this happens?
At first glance, the way you allocate memory for the members in decode_func is wrong.
I assume that master->num_member indicates the number of struct members that master should contain.
master->member = pvPortMalloc(master->num_member);
should be corrected to,
master->member = pvPortMalloc(master->num_member * sizeof(struct_member));
Again, in the same function the loop seems a bit suspicious as well.
for(int i = 0; i < scr->num_command; ++i){
    master->member[i].parameter = pvPortMalloc(master->member[i].num_parameter);
    do_something();
}
I'm not sure what scr->num_command indicates, but naturally I reckon the loop should execute until i < master->num_member. I assume your loop should be updated as follows:
for(int i = 0; i < master->num_member; ++i){
    master->member[i].parameter = pvPortMalloc(master->member[i].num_parameter * sizeof(uint8_t));
    do_something();
}
When freeing the memory, make sure you free the contained members before freeing the container structure. So you should first free all the parameters and then the member array; change that order in the test_task function as well.
Also make sure that before calling vTaskDelete(NULL) you deallocate all the resources consumed by test_task, otherwise there will be a resource leak. vTaskDelete(NULL) will simply mark the TCB of that particular task as ready to be deleted, so that at some later time the idle task will purge the TCB-related resources.
Generally, when you free an object, the contents of the object are destroyed and you can't access them anymore. So when you want to free nested allocations like this, you need to free the inner allocations first and only free the outer (master) allocation afterwards. In other words:
for (int i = 0; i < master.num_member; ++i)
    vPortFree(master.member[i].parameter);
vPortFree(master.member);
free the parameters first and then the containing member array.
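Putting both fixes together, a consolidated sketch of decode_func and the teardown (reusing the structs and globals from the question, and assuming num_member and each num_parameter have already been decoded from arr before the allocations):

void decode_func(struct_master *master, uint8_t *arr)
{
    /* Allocate space for num_member elements, not num_member bytes. */
    master->member = pvPortMalloc(master->num_member * sizeof(struct_member));
    for (int i = 0; i < master->num_member; ++i) {
        master->member[i].parameter =
            pvPortMalloc(master->member[i].num_parameter * sizeof(uint8_t));
        do_something();
    }
}

void test_task(void)
{
    decode_func(&master, arr);
    do_operation();
    /* Free the inner allocations first, then the array that holds them. */
    for (int i = 0; i < master.num_member; ++i)
        vPortFree(master.member[i].parameter);
    vPortFree(master.member);
    hTest_task = NULL;
    vTaskDelete(NULL);
}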

How to map data with OpenMP target for use inside a function?

I would like to know how I can map data for later use inside a function.
I wrote some code like the following:
struct s {
    int *a;
    int *b;
    // other members...
};

void func1(struct s* _s){
    int *a = _s->a;
    int *b = _s->b;
    // do something with _s
    #pragma omp target
    {
        // do something with a and b;
    }
}

int main(){
    struct s* _s;
    // alloc _s, a and b
    int *a = _s->a;
    int *b = _s->b;
    #pragma omp target data map(to: a, b)
    {
        func1(_s);
        // call another funcs with device use of mapped data...
    }
    // free data
}
The code compiles, but at runtime the verbose execution output is spammed with Kernel execution error at <address>, followed by many Device kernel launch failed! and CUDA error is: an illegal memory access was encountered.
Your map directive looks like it's probably mapping the value of the pointers a and b to the device rather than the arrays they're pointing to. I think you want to shape them so that the runtime maps the data and not just the pointers. Personally, I would also put the map clause on your target region, since that gives the compiler more information to work with, and the present check will find the data already on the device from the outer data region and not perform any further data movement.
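A sketch of that shaping (N is an assumed element count, and the small allocations in main are my additions to make the example complete):

#include <stdlib.h>

#define N 1024   /* assumed length of both arrays */

struct s {
    int *a;
    int *b;
};

void func1(struct s *_s){
    int *a = _s->a;
    int *b = _s->b;
    /* Mapping the sections (not the bare pointers) lets the present check
       find the data already on the device from the enclosing data region,
       so no further transfer happens here. */
    #pragma omp target map(to: a[0:N], b[0:N])
    {
        /* do something with a and b */
    }
}

int main(void){
    struct s *_s = malloc(sizeof *_s);
    _s->a = malloc(N * sizeof(int));
    _s->b = malloc(N * sizeof(int));
    int *a = _s->a;
    int *b = _s->b;
    /* Map the array sections, not just the pointer values. */
    #pragma omp target data map(to: a[0:N], b[0:N])
    {
        func1(_s);
    }
    free(_s->a);
    free(_s->b);
    free(_s);
    return 0;
}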

How can I synchronize data between different cores on a Xeon (Linux: how to use memory barriers)?

I wrote a simple program to test memory synchronization. It uses a global queue shared by two processes and binds the two processes to different cores. My code is below.
#include <stdio.h>
#include <sched.h>
#define __USE_GNU

void bindcpu(int pid) {
    int cpuid;
    cpu_set_t mask;
    cpu_set_t get;
    CPU_ZERO(&mask);
    if (pid > 0) {
        cpuid = 1;
    } else {
        cpuid = 5;
    }
    CPU_SET(cpuid, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
}

#define Q_LENGTH 512
int g_queue[512];

struct point {
    int volatile w;
    int volatile r;
};
volatile struct point g_p;

void iwrite(int x) {
    while (g_p.r == g_p.w);
    sleep(0.1);
    g_queue[g_p.w] = x;
    g_p.w = (g_p.w + 1) % Q_LENGTH;
    printf("#%d!%d", g_p.w, g_p.r);
}

void iread(int *x) {
    while (((g_p.r + 1) % Q_LENGTH) == g_p.w);
    *x = g_queue[g_p.r];
    g_p.r = (g_p.r + 1) % Q_LENGTH;
    printf("-%d*%d", g_p.r, g_p.w);
}

int main(int argc, char * argv[]) {
    //int num = sysconf(_SC_NPROCESSORS_CONF);
    int pid;
    pid = fork();
    g_p.r = Q_LENGTH;
    bindcpu(pid);
    int i = 0, j = 0;
    if (pid > 0) {
        printf("call iwrite \0");
        while (1) {
            iread(&j);
        }
    } else {
        printf("call iread\0");
        while (1) {
            iwrite(i);
            i++;
        }
    }
}
The data between the two processes on the two cores didn't get synchronized.
CPU: Intel(R) Xeon(R) CPU E3-1230
OS: 3.8.0-35-generic #50~precise1-Ubuntu SMP
Beyond IPC, I want to know how I can synchronize the data between the different cores in user space.
If you want your application to manipulate the CPUs' shared cache directly in order to accomplish IPC, I don't believe you will be able to do that.
Chapter 9 of "Linux Kernel Development, Second Edition" has information on synchronizing multi-threaded applications (including atomic operations, semaphores, barriers, etc.):
http://www.makelinux.net/books/lkd2/ch09
so you may get some ideas on what you are looking for there.
Here is a decent write-up on Intel® Smart Cache, "Software Techniques for Shared-Cache Multi-Core Systems": http://archive.is/hm0y
Here are some Stack Overflow questions/answers that may help you find the information you are looking for:
Storing C/C++ variables in processor cache instead of system memory
C++: Working with the CPU cache
Understanding how the CPU decides what gets loaded into cache memory
Sorry for bombarding you with links but this is the best I can do without a clearer understanding of what you are looking to accomplish.
I suggest reading "Volatile: Almost Useless for Multi-Threaded Programming" for why volatile should be removed from the example code. Instead, use C11 or C++11 atomic operations. See also the Fenced Data Transfer example in the TBB Design Patterns Manual.
Below I show the parts of the question example that I changed to use C++11 atomics. I compiled it with g++ 4.7.2.
#include <atomic>
...
struct point {
    std::atomic<int> w;
    std::atomic<int> r;
};
struct point g_p;

void iwrite(int x) {
    int w = g_p.w.load(std::memory_order_relaxed);
    int r;
    while ((r = g_p.r.load(std::memory_order_acquire)) == w);
    sleep(0.1);
    g_queue[w] = x;
    w = (w + 1) % Q_LENGTH;
    g_p.w.store(w, std::memory_order_release);
    printf("#%d!%d", w, r);
}

void iread(int *x) {
    int r = g_p.r.load(std::memory_order_relaxed);
    int w;
    while (((r + 1) % Q_LENGTH) == (w = g_p.w.load(std::memory_order_acquire)));
    *x = g_queue[r];
    g_p.r.store((r + 1) % Q_LENGTH, std::memory_order_release);
    printf("-%d*%d", r, w);
}
The key changes are:
I removed "volatile" everywhere.
The members of struct point are declared as std::atomic
Some loads and stores of g_p.r and g_p.w are fenced. Others are hoisted.
When loading a variable modified by another thread, the code "snapshots" it into a local variable.
The code uses "relaxed load" (no fence) where a thread loads a variable that no other thread modifies. I hoisted those loads out of the spin loops since there is no point in repeating them.
The code uses "acquiring load" where a thread loads a "message is ready" indicator that is set by another thread, and uses a "releasing store" where it is storing a "message is ready" indicator" to be read by another thread. The release is necessary to ensure that the "message" (queue data) is written before the "ready" indicator (member of g_p) is written. The acquire is likewise necessary to ensure that the "message" is read after the "ready" indicator is seen.
The snapshots are used so that the printf reports the value that the thread actually used, as opposed to some new value that appeared later. In general I like to use the snapshot style for two reasons. First, touching shared memory can be expensive because it often requires cache-line transfers. Second, the style gives me a stable value to use locally without having to worry that a reread might return a different value.
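Since the answer mentions C11 as an alternative, here is an equivalent sketch of the two functions using <stdatomic.h> (the printf reporting is omitted). Note that atomics only help if both sides actually share this memory (threads, or an explicitly shared mapping); with a plain fork() each process gets its own copy of g_p and g_queue.

#include <stdatomic.h>

#define Q_LENGTH 512
int g_queue[Q_LENGTH];

struct point {
    atomic_int w;
    atomic_int r;
};
struct point g_p;

void iwrite(int x) {
    int w = atomic_load_explicit(&g_p.w, memory_order_relaxed);
    int r;
    /* Spin until the reader frees a slot. */
    while ((r = atomic_load_explicit(&g_p.r, memory_order_acquire)) == w)
        ;
    g_queue[w] = x;                                  /* write the message first... */
    atomic_store_explicit(&g_p.w, (w + 1) % Q_LENGTH,
                          memory_order_release);     /* ...then publish it         */
}

void iread(int *x) {
    int r = atomic_load_explicit(&g_p.r, memory_order_relaxed);
    int w;
    /* Spin until the writer publishes a new entry. */
    while (((r + 1) % Q_LENGTH) ==
           (w = atomic_load_explicit(&g_p.w, memory_order_acquire)))
        ;
    *x = g_queue[r];                                 /* read the message first...  */
    atomic_store_explicit(&g_p.r, (r + 1) % Q_LENGTH,
                          memory_order_release);     /* ...then release the slot   */
}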

How to use arrays in program (global) scope in OpenCL

AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized, globally scoped, and in the constant address space (as specified in section 6.5.3 of the OpenCL specification). If the size of an array is below 64 kB, it is placed in hardware constant buffers; otherwise, it uses global memory. An example of this is a lookup table for math functions.
I want to use such a "globally scoped constant array". I have code like this in plain C:
#define SIZE 101
int *reciprocal_table;

int reciprocal(int number){
    return reciprocal_table[number];
}

void kernel(int *output)
{
    for(int i = 0; i < SIZE; i++)
        output[i] = reciprocal(i);
}
I want to port it to OpenCL:
__kernel void kernel(__global int *output){
    int gid = get_global_id(0);
    output[gid] = reciprocal(gid);
}

int reciprocal(int number){
    return reciprocal_table[number];
}
What should I do with global variable reciprocal_table? If I try to add __global or __constant to it I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from the kernel to reciprocal. Is it possible to initialize the global variable somehow? I know that I can write it directly into the code, but does another way exist?
P.S. I'm using AMD OpenCL.
UPD: The above code is just an example. My real code is much more complex, with a lot of functions, so I want an array in program scope that can be used in all the functions.
UPD2: Changed the example code and added the citation from the Programming Guide.
#define SIZE 2
int constant array[SIZE] = {0, 1};

kernel void
foo (global int* input,
     global int* output)
{
    const uint id = get_global_id (0);
    output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also works without the initialization of the array, but then you would not know what's in the array, and since it's in the constant address space, you could not assign any values afterwards.
Program global variables have to be in the __constant address space, as stated by section 6.5.3 in the standard.
UPDATE: Now that I fully understand the question:
One thing that worked for me is to define the array in the constant space and then override it by passing a constant int* kernel parameter with the same name.
That produced correct results only on the GPU device. The AMD CPU device and the Intel CPU device did not overwrite the array's address. It is also probably not compliant with the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};

int
baz (int i)
{
    return foo[i];
}

kernel void
bar (global int* input,
     global int* output,
     constant int* foo)
{
    const uint id = get_global_id (0);
    output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1} this produces {2, 4} on my HD 7850 device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OCL implementation (AMD, Intel). So I cannot stress enough how much I personally would NOT do this, because it's only a matter of time before this breaks.
Another way to achieve this would be to generate .h files (or compiler definitions) on the host at runtime with the definition of the array, and pass them to the kernel upon compilation via a compiler option, as sketched below. This, of course, requires recompilation of the clProgram/clKernel for every different LUT.
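As a rough sketch of that compiler-option route (the LOOKUP macro name and the helper are hypothetical; src.cl would contain __constant int lookup[] = { LOOKUP }; and the table is assumed small enough to fit in the option string):

#include <stdio.h>
#include <CL/cl.h>

/* Bake a small lookup table into the program via a -D build option. */
static cl_int build_with_lut(cl_program prog, cl_device_id dev,
                             const int *host_values, int n_values)
{
    char options[8192];
    int n = snprintf(options, sizeof options, "-D LOOKUP=");
    for (int i = 0; i < n_values; ++i)
        n += snprintf(options + n, sizeof options - n, "%d%s",
                      host_values[i], (i + 1 < n_values) ? "," : "");
    return clBuildProgram(prog, 1, &dev, options, NULL, NULL);
}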
I struggled to get this to work in my own program some time ago.
I did not find any way to initialize a constant or program-scope array from the host via clEnqueueWriteBuffer or the like. The only way is to write it explicitly in your .cl source file.
So my trick to initialize it from the host is to use the fact that you are actually compiling your source from the host, which also means you can alter your src.cl file before compiling it.
First my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory).

double func(int idx) {
    return lookup[idx];
}

__kernel void ker1(__global double *in, __global double *out)
{
    ... do something ...
    double t = func(i);
    ...
}
Notice the lookup table is initialized with LOOKUP.
Then, in the host program, before compiling your OpenCL code:
compute the values of my lookup table in host_values[]
on your host, run something like:
char *buf = (char*) malloc( 10000 );
int count = sprintf(buf, "#define LOOKUP "); // actual source generation !
for (int i=0;i<SIZE;i++) count += sprintf(buf+count, "%g, ",host_values[i]);
count += sprintf(buf+count,"\n");
Then read the content of your source file src.cl and place it right at buf+count.
You now have a source file with an explicitly defined lookup table that you just computed from the host.
Compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, err);
Voilà!
It looks like "array" is a look-up table of sorts. You'll need to clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.

Pthreads in Mac OS X - Mutexes issue

I'm trying to learn how to program parallel algorithms in C using POSIX threads. My environment is Mac OS X 10.5.5 with gcc 4.
Compiling:
gcc -Wall -D_REENTRANT -lpthread source.c -o test.o
So, my problem is: if I compile this on an Ubuntu 9.04 box, it runs smoothly in thread order; on the Mac it looks like the mutexes don't work and the threads don't wait to get the shared information.
Mac:
#1
#0
#2
#5
#3
#4
Ubuntu:
#0
#1
#2
#3
#4
#5
Any ideas?
The source code follows:
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

#define NUM_THREADS 6

pthread_mutex_t mutexsum;
pthread_t threads[NUM_THREADS];
long Sum;

void *SumThreads(void *threadid){
    int tmp;
    int i, x[10], y[10];
    // For each x and y in the vector, we just store the value of i, purely for didactic purposes
    for (i=0; i<10; i++){
        x[i] = i;
        y[i] = i;
    }
    tmp = Sum;
    for (i=0; i<10; i++){
        tmp += (x[i] * y[i]);
    }
    pthread_mutex_lock(&mutexsum);
    Sum += tmp;
    printf("Im thread #%ld sum until now is: %ld\n", threadid, Sum);
    pthread_mutex_unlock(&mutexsum);
    return 0;
}

int main(int argc, char *argv[]){
    int i;
    Sum = 0;
    pthread_mutex_init(&mutexsum, NULL);
    for(i=0; i<NUM_THREADS; i++){
        pthread_create(&threads[i], NULL, SumThreads, (void *)i);
    }
    pthread_exit(NULL);
}
There is nothing in your code that makes your threads run in ANY order. If it runs in some order on Ubuntu, it might be because you are just lucky. Try running it 1000 times on Ubuntu and see if you get the same results over and over again.
The thing is, you can't control the way the scheduler will give your threads access to the processor(s). So, when you iterate through the for loop creating your threads, you can't assume that the first call to pthread_create will get to run first, or will get to lock the mutex first. It's up to the scheduler, which sits at the OS level, and you can't control it, unless you write your own kernel :-).
If you want serial behavior, why would you run your code in separate threads in the first place? If it is just for experimentation, then one solution I can think of is using a condition variable (pthread_cond_signal) to wake a specific thread up and make it run... Then the woken-up thread can wake up the second one, and so on and so forth; see the sketch below.
Hope it helps.
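If the goal really is to force id order (purely as an exercise), a sketch of that idea using a condition variable and a shared turn counter might look like this (worker, turn, and the broadcast are my additions, not part of the original code):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 6

static pthread_mutex_t turn_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  turn_cond  = PTHREAD_COND_INITIALIZER;
static long turn = 0;

static void *worker(void *arg)
{
    long id = (long)arg;
    pthread_mutex_lock(&turn_mutex);
    while (turn != id)                      /* sleep until it is our turn   */
        pthread_cond_wait(&turn_cond, &turn_mutex);
    printf("#%ld\n", id);                   /* the serialized section       */
    turn++;                                 /* hand the turn to the next id */
    pthread_cond_broadcast(&turn_cond);
    pthread_mutex_unlock(&turn_mutex);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}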
To my recollection, the variable you have protected isn't actually being shared amongst the threads. It exists in its own context inside each of the threads. So, it's really just a matter of when each thread gets scheduled that determines what will print.
I don't think one simple mutex will allow you to guarantee correctness, if correctness is defined as printing 0, 1, 2, 3 ...
What your code is doing is creating multiple execution contexts, using the code in your sum function as their execution code. The variable you are protecting, unless declared as static, will be unique to each call of that function.
In the end, it is a coincidence that one system prints in order, because you have no logical method of blocking threads until it is their proper turn.
I don't do pthreads in C or any other language (though I do thread programming on high-performance computers), so this 'answer' might be useless to you.
What in your code requires the threads to pass the mutex in thread id order? I see that the threads are created in id order, but what requires them to execute in that order?
If you do require your threads to execute in id order, why? It seems a bit as if you are creating threads, then serialising them. To what end?
When I program in threads and worry about execution order, I often try creating a very large number of threads and seeing what happens to the execution order.
As I say, ignore this if my understanding of C and pthreads is too poor.
