How to use arrays in program (global) scope in OpenCL - memory-management

AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized,
globally scoped, and in the constant address space (as specified in
section 6.5.3 of the OpenCL specification). If the size of an array is
below 64 kB, it is placed in hardware constant buffers; otherwise, it
uses global memory. An example of this is a lookup table for math
functions.
I want to use this "globally scoped constant array". I have such code in pure C
#define SIZE 101
int *reciprocal_table;
int reciprocal(int number){
return reciprocal_table[number];
}
void kernel(int *output)
{
for(int i=0; i < SIZE; i+)
output[i] = reciprocal(i);
}
I want to port it into OpenCL
__kernel void kernel(__global int *output){
int gid = get_global_id(0);
output[gid] = reciprocal(gid);
}
int reciprocal(int number){
return reciprocal_table[number];
}
What should I do with global variable reciprocal_table? If I try to add __global or __constant to it I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from kernel to reciprocal. Is it possible to initialize global variable somehow? I know that I can write it down into code, but does other way exist?
P.S. I'm using AMD OpenCL
UPD Above code is just an example. I have real much more complex code with a lot of functions. So I want to make array in program scope to use it in all functions.
UPD2 Changed example code and added citation from Programming Guide

#define SIZE 2
int constant array[SIZE] = {0, 1};
kernel void
foo (global int* input,
global int* output)
{
const uint id = get_global_id (0);
output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also works without the initialization of the array but then you would not know what's in the array and since it's in the constant address space, you could not assign any values.
Program global variables have to be in the __constant address space, as stated by section 6.5.3 in the standard.
UPDATE Now, that I fully understood the question:
One thing that worked for me is to define the array in the constant space and then overwrite it by passing a kernel parameter constant int* array which overwrites the array.
That produced correct results only on the GPU Device. The AMD CPU Device and the Intel CPU Device did not overwrite the arrays address. It also is probably not compliant to the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};
int
baz (int i)
{
return foo[i];
}
kernel void
bar (global int* input,
global int* output,
constant int* foo)
{
const uint id = get_global_id (0);
output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1} this produces {2, 4} on my HD 7850 Device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OCL Implementation (AMD, Intel). So I can not stress, how much I personally would NOT do this, because it's only a matter of time, before this breaks.
Another way to achieve this is would be to compute .h files with the host during runtime with the definition of the array (or predefine them) and pass them to the kernel upon compilation via a compiler option. This, of course, requires recompilation of the clProgram/clKernel for every different LUT.

I struggled to get this work in my own program some time ago.
I did not find any way to initialize a constant or global scope array from the host via some clEnqueueWriteBuffer or so. The only way is to write it explicitely in your .cl source file.
So here my trick to initialize it from the host is to use the fact that you are actually compiling your source from the host, which also means you can alter your src.cl file before compiling it.
First my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory).
double func(int idx) {
return(lookup[idx])
}
__kernel void ker1(__global double *in, __global double *out)
{
... do something ...
double t = func(i)
...
}
notice the lookup table is initialized with LOOKUP.
Then, in the host program, before compiling your OpenCL code:
compute the values of my lookup table in host_values[]
on your host, run something like:
char *buf = (char*) malloc( 10000 );
int count = sprintf(buf, "#define LOOKUP "); // actual source generation !
for (int i=0;i<SIZE;i++) count += sprintf(buf+count, "%g, ",host_values[i]);
count += sprintf(buf+count,"\n");
then read the content of your source file src.cl and place it right at buf+count.
you now have a source file with an explicitely defined lookup table that you just computed from the host.
compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, err);
voilĂ  !

It looks like "array" is a look-up table of sorts. You'll need to clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.

Related

How to optimize SYCL kernel

I'm studying SYCL at university and I have a question about performance of a code.
In particular I have this C/C++ code:
And I need to translate it in a SYCL kernel with parallelization and I do this:
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;
constexpr int size = 131072; // 2^17
int main(int argc, char** argv) {
//Create a vector with size elements and initialize them to 1
std::vector<float> dA(size);
try {
queue gpuQueue{ gpu_selector{} };
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
gpuQueue.submit([&](handler& cgh) {
accessor inA{ bufA,cgh };
cgh.parallel_for(range<1>(size),
[=](id<1> i) { inA[i] = inA[i] + 2; }
);
});
gpuQueue.wait_and_throw();
}
catch (std::exception& e) { throw e; }
So my question is about c value, in this case I use directly the value two but this will impact on the performance when I'll run the code? I need to create a variable or in this way is correct and the performance are good?
Thanks in advance for the help!
Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel - this is as efficient as it gets, I think! There's the slight complication that you have an implicit cast from int to float. My guess is that you'll probably end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or cast at runtime or anything like that, it just lives in the instruction.
Equally, if you had:
constexpr int c = 2;
// the rest of your code
[=](id<1> i) { inA[i] = inA[i] + c; }
// etc
The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.
I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:
%5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
%add.i = fadd float %5, 2.000000e+00
store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
This shows a float load & store to/from the same address, with an 'add 2.0' instruction in between. If I modify to use the variable c like I demonstrated, I get the same LLVM IR.
Conclusion: you've already achieved maximum efficiency, and compilers are smart!

Directly assigning to a std::vector after reserving does not throw error but does not increase vector size

Let's create a helper class to assist visualizing the issue:
class C
{
int ID = 0;
public:
C(const int newID)
{
ID = newID;
}
int getID()
{
return ID;
}
};
Suppose you create an empty std::vector<C> and then reserve it to hold 10 elements:
std::vector<C> pack;
pack.reserve(10);
printf("pack has %i\n", pack.size()); //will print '0'
Now, you assign a new instance of C into index 4 of the vector:
pack[4] = C(57);
printf("%i\n", pack[4].getID()); //will print '57'
printf("pack has %i\n", pack.size()); //will still print '0'
I found two things to be weird here:
1) shouldn't the assignment make the compiler (Visual Studio 2015, Release Mode) throw an error even in Release mode?
2) since it does not and the element is in fact stored in position 4, shouldn't the vector then have size = 1 instead of zero?
Undefined behavior is still undefined. If we make this a vector of objects, you would see the unexpected behavior more clearly.
#include <iostream>
#include <vector>
struct Foo {
int data_ = 3;
};
int main() {
std::vector<Foo> foos;
foos.reserve(10);
std::cout << foos[4].data_; // This probably doesn't output 3.
}
Here, we can see that because we haven't actually allocated the object yet, the constructor hasn't run.
Another example, since you're using space that the vector hasn't actually started allocating to you, if the vector needed to reallocate it's backing memory, the value that you wrote wouldn't be copied.
#include <iostream>
#include <vector>
int main() {
std::vector<int> foos;
foos.reserve(10);
foos[4] = 100;
foos.reserve(10000000);
std::cout << foos[4]; // Probably doesn't print 100.
}
Short answers:
1) There is no reason to throw an exception since operator[] is not supposed to verify the position you have passed. It might do so in Debug mode, but for sure not in Release (otherwise performance would suffer). In Release mode compiler trusts you that code you provide is error-proof and does everything to make your code fast.
Returns a reference to the element at specified location pos. No
bounds checking is performed.
http://en.cppreference.com/w/cpp/container/vector/operator_at
2) You simply accessed memory you don't own yet (reserve is not resize), anything you do on it is undefined behavior. But, you have never added an element into vector and it has no idea you even modified its buffer. And as #Bill have shown, the vector is allowed to change its buffer without copying your local change.
EDIT:
Also, you can get exception due to boundary checking if you use vector::at function.
That is: pack.at(4) = C(57); throws exception
Example:
https://ideone.com/sXnPzT

Allocation and usage of cuda device variable in different functions

i am quite new to CUDA and I have a question regarding the memory management for an object. I have an object function to load the data to the device and if another object function is called the computation is carried out.
I have read some parts of the NVIDIA programming guide and some SO questions but they do data copying and computing in a single function so there no need of multiple functions.
Some more specifications:
The data is read one time. I do not know the data size at compile time therefore I need a dynamic allocation. My current device has a compute capability of 2.1 (will be updated soon to 6.1).
I want to copy the data in a first function and use the data in a different function. For example:
__constant__ int dev_size;
__device__ float* dev_data; //<- not sure about this
/* kernel */
__global__ void computeSomething(float* dev_output)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < dev_size)
{
dev_output[idx] = dev_data[idx]*100; // some computation;
}
}
// function 1
void OBJECT::copyVolumeToGPU(int size, float* data)
{
cudaMalloc(&dev_data, size * sizeof(float));
cudaMemcpy(dev_data, data, size * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpyToSymbol(dev_size, size, sizeof(int));
}
// function 2
void OBJECT::computeSmthOnDevice(int size)
{
// allocate output array
auto host_output = new float[size];
float* dev_output;
cudaMalloc(&dev_output, size * sizeof(float));
int block = 256;
int grid = ceil(size/block);
computeSomething<<<grid,block>>>(dev_output);
cudaMemcpy(host_output, dev_data, size * sizeof(float), cudaMemcpyDeviceToHost);
/* ... do something with output ... */
delete[] host_output;
cudaFree(dev_output);
}
gpuErrChk is carried out this way: https://stackoverflow.com/a/14038590/3921660 but omitted in this example.
Can I copy the data using a __device__pointer (like __device__ float* dev_data;)?
Generally, your idea is workable, but this:
cudaMalloc(&dev_data, size * sizeof(float));
is not legal. It is not legal to take an address of a __device__ item in host code. So if you know the size at compile time, the easiest approach is to convert this to a static allocation e.g.
__device__ float dev_data[1000];
If you really want to make this a dynamically allocated __device__ pointer, then you will need to use a method such as described here, which involves using cudaMalloc on a typical device pointer in host code that is a "temporary", then copy that "temporary" pointer to the __device__ pointer via cudaMemcpyToSymbol. And then when you want to copy data to/from that particular allocation via cudaMemcpy, you would use cudaMemcpy to/from the temporary pointer from host code.
Note that for the purposes of "communicating" data from one function to the next, or one kernel to the next, there's no reason you couldn't just use a dynamically allocated pointer from cudaMemcpy, and pass that pointer around to wherever you need it. You can even pass it via a global variable to any host function that needs it, like an ordinary global pointer. For kernels, however, you would still need to pass such a global pointer to the kernel via kernel argument.

How to map a data with openmp target to use inside a function?

I would like to know how can I map a data for future use inside of a function?
I wrote some code like the following:
struct {
int* a;
int *b;
// other members...
} s;
void func1(struct s* _s){
int a* = _s->a;
int b* = _s->b;
// do something with _s
#pragma omp target
{
// do something with a and b;
}
}
int main(){
struct s* _s;
// alloc _s, a and b
int *a = _s->a;
int *b = _s->b;
#pragma omp target data map(to: a, b)
{
func1(_s);
// call another funcs with device use of mapped data...
}
// free data
}
The code compiles, but on execution Kernel execution error at <address> is spammed from verbose output of execution, followed by many Device kernel launch failed! and CUDA error is: an illegal memory access was encountered
Your map directive looks like it's probably mapping the value of the pointers a and b to the device rather than the arrays they're pointing to. I think you want to shape them so that the runtime maps the data and not just the pointers. Personally, I would also put the map clause on your target region too, since that gives the compiler more information to work with and the present check will find the data already on the device from the outer data region and not perform any further data movement.

Does using unique_ptr mean that I don't have to use the restrict keyword?

When trying to get loops to auto-vectorise, I've seen code like this written:
void addFP(int N, float *in1, float *in2, float * restrict out)
{
for (i = 0 ; i < N; i++)
{
out[i] = in1[i] + in2[i];
}
}
Where the restrict keyword is needed reassure the compiler about pointer aliases, so that it can vectorise the loop.
Would something like this do the same thing?
void addFP(int N, float *in1, float *in2, std::unique_ptr<float> out)
{
for (i = 0 ; i < N; i++)
{
out[i] = in1[i] + in2[i];
}
}
If this does work, which is the more portable choice?
tl;dr Can std::unique_ptr be used to replace the restrict keyword in a loop you're trying to auto-vectorise?
restrict is not part of C++11, instead it is part of C99.
std::unique_ptr<T> foo; is telling your compiler: I only need the memory in this scope. Once this scope ends, free the memory.
restrict tells your compiler: I know that you can't know or proof this, but I pinky swear that this is the only reference to this chunk of memory that occurs in this function.
unique_ptr doesn't stop aliases, nor should the compiler assume they don't exist:
int* pointer = new int[3];
int* alias = pointer;
std::unique_ptr<int> alias2(pointer);
std::unique_ptr<int> alias3(pointer); //compiles, but crashes when deleting
So your first version is not valid in C++11 (though it works on many modern compilers) and the second doesn't do the optimization you are expecting. To still get the behavior concider std::valarray.
I don't think so. Suppose this code:
auto p = std::make_unique<float>(0.1f);
auto raw = p.get();
addFP(1, raw, raw, std::move(p));

Resources