How to map a data with openmp target to use inside a function? - openmp

I would like to know how can I map a data for future use inside of a function?
I wrote some code like the following:
struct {
int* a;
int *b;
// other members...
} s;
void func1(struct s* _s){
int a* = _s->a;
int b* = _s->b;
// do something with _s
#pragma omp target
{
// do something with a and b;
}
}
int main(){
struct s* _s;
// alloc _s, a and b
int *a = _s->a;
int *b = _s->b;
#pragma omp target data map(to: a, b)
{
func1(_s);
// call another funcs with device use of mapped data...
}
// free data
}
The code compiles, but on execution Kernel execution error at <address> is spammed from verbose output of execution, followed by many Device kernel launch failed! and CUDA error is: an illegal memory access was encountered

Your map directive looks like it's probably mapping the value of the pointers a and b to the device rather than the arrays they're pointing to. I think you want to shape them so that the runtime maps the data and not just the pointers. Personally, I would also put the map clause on your target region too, since that gives the compiler more information to work with and the present check will find the data already on the device from the outer data region and not perform any further data movement.

Related

how to fix wrong GCC ARM startup code pointer to initialized and zero variables?

[skip to UPDATE2 and save some time :-)]
I use ARM Cortex-M4, with CMSIS 5-5.7.0 and FreeRTOS, compiling using GCC for ARM (10_2021.10)
My variables are not initialized as they should.
My startup code is pretty simple, the entry point is the reset handler (CMSIS declared startup_ARMCM4.s as deprecated and recommend using the C code startup code so this is what I do).
Here is my code:
__attribute__((__noreturn__)) void Reset_Handler(void)
{
DataInit();
SystemInit(); /* CMSIS System Initialization */
main();
}
static void DataInit(void)
{
typedef struct {
uint32_t const* src;
uint32_t* dest;
uint32_t wlen;
} __copy_table_t;
typedef struct {
uint32_t* dest;
uint32_t wlen;
} __zero_table_t;
extern const __copy_table_t __copy_table_start__;
extern const __copy_table_t __copy_table_end__;
extern const __zero_table_t __zero_table_start__;
extern const __zero_table_t __zero_table_end__;
for (__copy_table_t const* pTable = &__copy_table_start__; pTable < &__copy_table_end__; ++pTable) {
for(uint32_t i=0u; i<pTable->wlen; ++i) {
pTable->dest[i] = pTable->src[i];
}
}
for (__zero_table_t const* pTable = &__zero_table_start__; pTable < &__zero_table_end__; ++pTable) {
for(uint32_t i=0u; i<pTable->wlen; ++i) {
pTable->dest[i] = 0u;
}
}
}
__copy_table_start__, __copy_table_end__ etc. have the wrong values an so no data is copied to the appropriate place in RAM.
I tried adding __libc_init_array() before DataInit(), as suggested in this answer, and remove the nostartfiles flag from the linker, but at some point __libc_init_array() jumps to an illegal address and I get a HardFault interrupt.
Is there a different method to fix it? maybe one where I can use the nostartfiles flag?
UPDATE:
Looking at the memory, where __copy_table_start__ is located, I see the data there is valid (even without the use of __libc_init_array()). It seems that pTable doesn't get the correct value.
I tried using __data_start__, __data_end__, __bss_start__, __bss_end__ and __etext instead of the above variables, in the linker file it is said they can be used in code without definition, but they cannot (maybe that's a clue?). In any case they didn't work either.
UPDATE2:
found the actual problem
all struct members get the same value (modifying one changes all others), it happens with every struct. I have no idea how this is possible. In other words the value of __copy_table_start__.src is, for example, 0x14651234, __copy_table_start__.dest is 0x00100000, and __copy_table_start__.wlen is 0x0365. When looking at pTable all members are 0x14651234.

Directly assigning to a std::vector after reserving does not throw error but does not increase vector size

Let's create a helper class to assist visualizing the issue:
class C
{
int ID = 0;
public:
C(const int newID)
{
ID = newID;
}
int getID()
{
return ID;
}
};
Suppose you create an empty std::vector<C> and then reserve it to hold 10 elements:
std::vector<C> pack;
pack.reserve(10);
printf("pack has %i\n", pack.size()); //will print '0'
Now, you assign a new instance of C into index 4 of the vector:
pack[4] = C(57);
printf("%i\n", pack[4].getID()); //will print '57'
printf("pack has %i\n", pack.size()); //will still print '0'
I found two things to be weird here:
1) shouldn't the assignment make the compiler (Visual Studio 2015, Release Mode) throw an error even in Release mode?
2) since it does not and the element is in fact stored in position 4, shouldn't the vector then have size = 1 instead of zero?
Undefined behavior is still undefined. If we make this a vector of objects, you would see the unexpected behavior more clearly.
#include <iostream>
#include <vector>
struct Foo {
int data_ = 3;
};
int main() {
std::vector<Foo> foos;
foos.reserve(10);
std::cout << foos[4].data_; // This probably doesn't output 3.
}
Here, we can see that because we haven't actually allocated the object yet, the constructor hasn't run.
Another example, since you're using space that the vector hasn't actually started allocating to you, if the vector needed to reallocate it's backing memory, the value that you wrote wouldn't be copied.
#include <iostream>
#include <vector>
int main() {
std::vector<int> foos;
foos.reserve(10);
foos[4] = 100;
foos.reserve(10000000);
std::cout << foos[4]; // Probably doesn't print 100.
}
Short answers:
1) There is no reason to throw an exception since operator[] is not supposed to verify the position you have passed. It might do so in Debug mode, but for sure not in Release (otherwise performance would suffer). In Release mode compiler trusts you that code you provide is error-proof and does everything to make your code fast.
Returns a reference to the element at specified location pos. No
bounds checking is performed.
http://en.cppreference.com/w/cpp/container/vector/operator_at
2) You simply accessed memory you don't own yet (reserve is not resize), anything you do on it is undefined behavior. But, you have never added an element into vector and it has no idea you even modified its buffer. And as #Bill have shown, the vector is allowed to change its buffer without copying your local change.
EDIT:
Also, you can get exception due to boundary checking if you use vector::at function.
That is: pack.at(4) = C(57); throws exception
Example:
https://ideone.com/sXnPzT

OpenACC 2.0 routine: data locality

Take the following code, which illustrates the calling of a simple routine on the accelerator, compiled on the device using OpenACC 2.0's routine directive:
#include <iostream>
#pragma acc routine
int function(int *ARRAY,int multiplier){
int sum=0;
#pragma acc loop reduction(+:sum)
for(int i=0; i<10; ++i){
sum+=multiplier*ARRAY[i];
}
return sum;
}
int main(){
int *ARRAY = new int[10];
int multiplier = 5;
int out;
for(int i=0; i<10; i++){
ARRAY[i] = 1;
}
#pragma acc enter data create(out) copyin(ARRAY[0:10],multiplier)
#pragma acc parallel present(out,ARRAY[0:10],multiplier)
if (function(ARRAY,multiplier) == 50){
out = 1;
}else{
out = 0;
}
#pragma acc exit data copyout(out) delete(ARRAY[0:10],multiplier)
std::cout << out << std::endl;
}
How does function know to use the device copies of ARRAY[0:10] and multiplier when it is called from within a parallel region? How can we enforce the use of the device copies?
When your routine is called within a device region (the parallel in your code), it is being called by the threads on the device, which means those threads will only have access to arrays on the device. The compiler may actually choose to inline that function, or it may be a device-side function call. That means that you can know that when the function is called from the device it will be receiving device copies of the data because the function is essentially inheriting the present data clause from the parallel region. If you still want to convince yourself that you're running on the device once inside the function, you could call acc_on_device, but that only tells you that you're running on the accelerator, not that you received a device pointer.
If you want to enforce the use of device copies more than that, you could make the routine nohost so that it would technically not be valid to call from the host, but that doesn't really do what you're asking, which is to do a check on the GPU that the array really is a device array.
Keep in mind though that any code inside a parallel region that is not inside a loop will be run gang-redundantly, so the write to out is likely a race condition, unless you happen to be running with one gang or you write to it using an atomic.
Basically, when you involved "data" clause, the device will create/copy data to the device memory, then the block of code that defined with "acc routine" will be executed on the device. Notice that the memory between host and device does not share unlike multi-threading (OpenMP). So yes, "function" will be using the device copies of ARRAY and multiplier as long as it is under data segment. Hope this helps! :)
You should assign the function with one parallelism level such as gang/worker/vector. It's a more accurate way.
The routine will use the date in device memory.

What is the use of 'i2c_get_clientdata" and "i2c_set_clientdata"

I have been studying I2C driver (client) code for a while.
I have seen this function "i2c_get_clientdata" and "i2c_set_clientdata" every where.
I have seen the this question here .
Use of pointer to structure instead of creating static local copy
Some times i think like it is like "container_of" macro to get a pointer to the structure.
But still i didn't understood properly why to use it and when to use it.
Below i am posting a sample code where I see its usage.
If any one could help me understand why it is used there and when we shall use it when we write our own drivers.
struct max6875_data {
struct i2c_client *fake_client;
struct mutex update_lock;
u32 valid;
u8 data[USER_EEPROM_SIZE];
unsigned long last_updated[USER_EEPROM_SLICES];
};
static ssize_t max6875_read(struct file *filp, struct kobject *kobj,
struct bin_attribute *bin_attr,
char *buf, loff_t off, size_t count)
{
struct i2c_client *client = kobj_to_i2c_client(kobj);
struct max6875_data *data = i2c_get_clientdata(client);
int slice, max_slice;
if (off > USER_EEPROM_SIZE)
return 0;
if (off + count > USER_EEPROM_SIZE)
count = USER_EEPROM_SIZE - off;
/* refresh slices which contain requested bytes */
max_slice = (off + count - 1) >> SLICE_BITS;
for (slice = (off >> SLICE_BITS); slice <= max_slice; slice++)
max6875_update_slice(client, slice);
memcpy(buf, &data->data[off], count);
return count;
}
Those functions are used to get/set the void *driver_data pointer that is part of the struct device, itself part of struct i2c_client.
This is a void pointer that is for the driver to use. One would use this pointer mainly to pass driver related data around.
That is what is happening in your example. The max6875_read is a callback getting a structu kobject. That kobject is an i2c_client which is enough to communicate with the underlying device using the driver_data pointer here allows to get back the driver related data (instead of using global variables for example).

How to use arrays in program (global) scope in OpenCL

AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized,
globally scoped, and in the constant address space (as specified in
section 6.5.3 of the OpenCL specification). If the size of an array is
below 64 kB, it is placed in hardware constant buffers; otherwise, it
uses global memory. An example of this is a lookup table for math
functions.
I want to use this "globally scoped constant array". I have such code in pure C
#define SIZE 101
int *reciprocal_table;
int reciprocal(int number){
return reciprocal_table[number];
}
void kernel(int *output)
{
for(int i=0; i < SIZE; i+)
output[i] = reciprocal(i);
}
I want to port it into OpenCL
__kernel void kernel(__global int *output){
int gid = get_global_id(0);
output[gid] = reciprocal(gid);
}
int reciprocal(int number){
return reciprocal_table[number];
}
What should I do with global variable reciprocal_table? If I try to add __global or __constant to it I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from kernel to reciprocal. Is it possible to initialize global variable somehow? I know that I can write it down into code, but does other way exist?
P.S. I'm using AMD OpenCL
UPD Above code is just an example. I have real much more complex code with a lot of functions. So I want to make array in program scope to use it in all functions.
UPD2 Changed example code and added citation from Programming Guide
#define SIZE 2
int constant array[SIZE] = {0, 1};
kernel void
foo (global int* input,
global int* output)
{
const uint id = get_global_id (0);
output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also works without the initialization of the array but then you would not know what's in the array and since it's in the constant address space, you could not assign any values.
Program global variables have to be in the __constant address space, as stated by section 6.5.3 in the standard.
UPDATE Now, that I fully understood the question:
One thing that worked for me is to define the array in the constant space and then overwrite it by passing a kernel parameter constant int* array which overwrites the array.
That produced correct results only on the GPU Device. The AMD CPU Device and the Intel CPU Device did not overwrite the arrays address. It also is probably not compliant to the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};
int
baz (int i)
{
return foo[i];
}
kernel void
bar (global int* input,
global int* output,
constant int* foo)
{
const uint id = get_global_id (0);
output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1} this produces {2, 4} on my HD 7850 Device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OCL Implementation (AMD, Intel). So I can not stress, how much I personally would NOT do this, because it's only a matter of time, before this breaks.
Another way to achieve this is would be to compute .h files with the host during runtime with the definition of the array (or predefine them) and pass them to the kernel upon compilation via a compiler option. This, of course, requires recompilation of the clProgram/clKernel for every different LUT.
I struggled to get this work in my own program some time ago.
I did not find any way to initialize a constant or global scope array from the host via some clEnqueueWriteBuffer or so. The only way is to write it explicitely in your .cl source file.
So here my trick to initialize it from the host is to use the fact that you are actually compiling your source from the host, which also means you can alter your src.cl file before compiling it.
First my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory).
double func(int idx) {
return(lookup[idx])
}
__kernel void ker1(__global double *in, __global double *out)
{
... do something ...
double t = func(i)
...
}
notice the lookup table is initialized with LOOKUP.
Then, in the host program, before compiling your OpenCL code:
compute the values of my lookup table in host_values[]
on your host, run something like:
char *buf = (char*) malloc( 10000 );
int count = sprintf(buf, "#define LOOKUP "); // actual source generation !
for (int i=0;i<SIZE;i++) count += sprintf(buf+count, "%g, ",host_values[i]);
count += sprintf(buf+count,"\n");
then read the content of your source file src.cl and place it right at buf+count.
you now have a source file with an explicitely defined lookup table that you just computed from the host.
compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, err);
voilĂ  !
It looks like "array" is a look-up table of sorts. You'll need to clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.

Resources