I'm trying to use a custom data structure with an OpenCL kernel. In my host program I defined a simple structure like this:
typedef struct myStruct {
    cl_ulong n_occ;
    cl_ulong start_time;
    cl_ulong end_time;
    cl_ulong exec_time;
    cl_ulong total_time;
    cl_float avg_time;
} myStruct_t;
The equivalent structure definition is also present in my OpenCL kernel source:
typedef struct myStruct {
    unsigned long n_occ;
    unsigned long start_time;
    unsigned long end_time;
    unsigned long exec_time;
    unsigned long total_time;
    float avg_time;
} myStruct_t;
The kernel function is the following:
__kernel void process_data(__global myStruct_t* input, __global myStruct_t* output) {
    output->start_time = input->start_time;
    output->end_time = input->end_time;
    output->exec_time = input->end_time - input->start_time;
    output->total_time = input->total_time + output->exec_time;
    output->n_occ = input->n_occ + 1;
    output->avg_time = output->total_time / output->n_occ;
}
I use an Nvidia card as the GPU device. After executing the kernel I obtained incorrect results, and I don't understand why. Is there something missing?
Thank you in advance for your help.
Have you checked that your host struct (C) is packed correctly, that the OpenCL side is packed the same way, and that both report the same size? It is also probably a good idea to use the same cl_* types in both structs.
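One way to verify this is to compare the two sizes directly. A minimal sketch (assuming the typedef'd definitions above; the kernel name and output buffer are made up):

/* Host side: what the host compiler thinks the size is. */
printf("host sizeof(myStruct_t) = %zu\n", sizeof(myStruct_t));

/* Device side: a throwaway kernel reporting the device's view. */
__kernel void report_size(__global ulong* out) {
    out[0] = sizeof(myStruct_t);
}

If the two numbers differ, the layouts don't match and every field after the first misaligned one will contain garbage.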
I'm studying SYCL at university and I have a question about the performance of some code. In particular, I have this C/C++ code:
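(the snippet is not shown; presumably it is a serial loop along these lines, reconstructed from the SYCL kernel below)

// Reconstructed for context: add the constant 2 to each element.
for (int i = 0; i < size; i++)
    dA[i] = dA[i] + 2;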
I need to translate it into a parallel SYCL kernel, and this is what I wrote:
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;
constexpr int size = 131072; // 2^17
int main(int argc, char** argv) {
    // Create a vector with size elements and initialize them to 1
    std::vector<float> dA(size, 1.0f);
    try {
        queue gpuQueue{ gpu_selector{} };
        buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
        gpuQueue.submit([&](handler& cgh) {
            accessor inA{ bufA, cgh };
            cgh.parallel_for(range<1>(size),
                [=](id<1> i) { inA[i] = inA[i] + 2; }
            );
        });
        gpuQueue.wait_and_throw();
    }
    catch (std::exception& e) { throw e; }
    return 0;
}
My question is about the constant value: in this case I use the literal 2 directly, but will this impact performance when I run the code? Do I need to create a variable for it, or is it correct as written, with good performance?
Thanks in advance for the help!
Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel - this is as efficient as it gets, I think! There's the slight complication that you have an implicit cast from int to float. My guess is that you'll end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or cast at runtime or anything like that; it just lives in the instruction.
Equally, if you had:
constexpr int c = 2;
// the rest of your code
[=](id<1> i) { inA[i] = inA[i] + c; }
// etc
The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.
I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:
%5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
%add.i = fadd float %5, 2.000000e+00
store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
This shows a float load & store to/from the same address, with an 'add 2.0' instruction in between. If I modify the code to use the variable c as demonstrated above, I get the same LLVM IR.
Conclusion: you've already achieved maximum efficiency, and compilers are smart!
For a research project, I have a long-running process that uses various buffers and stack variables. I'd like to be able to launch this process multiple times such that the physical addresses backing its heap, stack, code, and static variables are the same each time. I know the exact size of all of these variables, and the sizes of the heap and stack stay constant during execution. To help with this, I use some helper code to translate arbitrary virtual addresses in my program to their corresponding physical addresses (sourced from here):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Layout of one 64-bit /proc/<pid>/pagemap entry:
   bits 0-54 hold the PFN (or swap info), bits 55-63 are flags. */
struct pagemap
{
    union status
    {
        struct present
        {
            unsigned long long pfn : 55;
            unsigned char soft_dirty : 1;
            unsigned char exclusive : 1;
            unsigned char zeroes : 4;
            unsigned char type : 1;
            unsigned char swapped : 1;
            unsigned char present : 1;
        } present;
        struct swapped
        {
            unsigned char swaptype : 5;
            unsigned long long offset : 50;
            unsigned char soft_dirty : 1;
            unsigned char exclusive : 1;
            unsigned char zeroes : 4;
            unsigned char type : 1;
            unsigned char swapped : 1;
            unsigned char present : 1;
        } swapped;
    } status;
} __attribute__ ((packed));
unsigned long get_pfn_for_addr(void *addr)
{
    unsigned long offset;
    struct pagemap pagemap;
    FILE *pagemap_file = fopen("/proc/self/pagemap", "rb");
    if(pagemap_file == NULL)
    {
        fprintf(stderr, "failed to open pagemap\n");
        exit(1);
    }
    /* one 8-byte entry per virtual page */
    offset = (unsigned long) addr / getpagesize() * 8;
    if(fseek(pagemap_file, offset, SEEK_SET) != 0)
    {
        fprintf(stderr, "failed to seek pagemap to offset\n");
        exit(1);
    }
    if(fread(&pagemap, 1, sizeof(struct pagemap), pagemap_file) != sizeof(struct pagemap))
    {
        fprintf(stderr, "failed to read pagemap entry\n");
        exit(1);
    }
    fclose(pagemap_file);
    return pagemap.status.present.pfn;
}
unsigned long virt_to_phys(void *addr)
{
    unsigned long pfn, page_offset, phys_addr;
    pfn = get_pfn_for_addr(addr);
    page_offset = (unsigned long) addr % getpagesize();
    phys_addr = (pfn << PAGE_SHIFT) + page_offset; /* PAGE_SHIFT is 12 for 4 KiB pages */
    return phys_addr;
}
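For illustration, a hypothetical call site (the buffer and its size are made up); note that the page must be touched first so a physical frame actually backs it:

char *buf = malloc(4096);
buf[0] = 1; /* fault the page in so a PFN exists */
printf("virt %p -> phys 0x%lx\n", (void *) buf, virt_to_phys(buf));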
So far, my methodology has only required that a specific buffer in my program is located at the same physical address on each run. For this, I was able to simply exit and relaunch the process whenever the physical address of that buffer was wrong, and I would land on the correct location fairly quickly each time. However, I'd like to extend my experiment to ensure that my process is loaded identically into physical memory between runs, and this try-and-restart method does not work well for that. Ideally, I would like to set aside a small number of physical page frames that can't be allocated to any other process, or to the kernel itself. Then, I would pass a flag down to do_fork that tells the kernel that this is my special process and to allocate those specific page frames to it.
My questions are:
Is there any sort of isolation mechanism already built into the kernel that would let me set aside an exclusive physical memory space that I could launch my process in?
If not, what would be a starting point for modifying the kernel to support behavior like this?
Is there any other solution (not involving either of the two above) that I could use for my desired behavior?
This is something that the kernel, using virtual memory, is tasked with abstracting away from you, so I'm not sure it is even possible (without insane amounts of work).
May I ask what experiment requires this? Perhaps if you describe what you want to achieve, it is easier to offer advice.
I am writing a Linux kernel module.
Here is what I've done in the module's init function:
register_chrdev(300 /* major */, "mydev", &fops);
It works fine, but I need to know the minor number.
I have read that we cannot set this minor number; it is the kernel which gives us this number. If so, how can I find it in the module's init function?
Thanks
register_chrdev calls __register_chrdev internally.
static inline int register_chrdev(unsigned int major, const char *name,
                                  const struct file_operations *fops)
{
    return __register_chrdev(major, 0, 256, name, fops);
}
If you look at the __register_chrdev function signature, it is
int __register_chrdev(unsigned int major, unsigned int baseminor,
                      unsigned int count, const char *name,
                      const struct file_operations *fops)
register_chrdev passes your major number (300) and a base minor number of 0 with a count of 256, so it reserves the minor number range 0-255 for your device.
Also, in the definition of __register_chrdev, a dev_t value (which encodes the major & minor numbers) is created for your device:
err = cdev_add(cdev, MKDEV(cd->major, baseminor), count);
MKDEV(cd->major, baseminor) creates it, so the first device number (dev_t) has 0 as its minor number, and count (256) is the number of consecutive minor numbers that can be used after it.
You can also get the major & minor numbers dynamically if you use alloc_chrdev_region. All you have to do is pass a pointer to a dev_t to alloc_chrdev_region; it will dynamically allocate a major and minor number for your device. To get the major and minor numbers in your module, you can use
major = MAJOR(dev);
minor = MINOR(dev);
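Putting that together, a minimal sketch of the dynamic route in an init function (the device name and count are placeholders, error handling trimmed):

dev_t dev;
int ret = alloc_chrdev_region(&dev, 0 /* baseminor */, 1 /* count */, "mydev");
if (ret < 0)
    return ret;
pr_info("mydev: major=%d minor=%d\n", MAJOR(dev), MINOR(dev));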
I have the latest MacBook Pro (OS: 10.12.2) with an Intel integrated GPU HD 530 (Gen9) which runs the OpenCL code. In my OpenCL code I use the vloadx and atomic_add instructions. I turn my OpenCL kernel code into bitcode as described at https://developer.apple.com/library/content/samplecode/OpenCLOfflineCompilation/Introduction/Intro.html#//apple_ref/doc/uid/DTS40011196-Intro-DontLinkElementID_2 and create the program with clCreateProgramWithBinary. But clBuildProgram returns error -11, and the build log is:
error: undefined reference to _Z6vload2mPKU3AS1h()
undefined reference to _Z8atom_addPVU3AS3ii()
But on my MacBook Air with an HD 5500 (Gen8), the code works fine.
Can someone tell me what I should do?
The problem here is that you cannot use a binary compiled for one device on a different, incompatible device. If you compile for Intel, you cannot use the resulting binary on AMD, for example; the same applies across different GPU generations. What you need to do is compile the code for the specific device, from source, each time.
If you do not want to keep the OpenCL code in separate files, what you can do is put it inside your source file by stringifying it. Instead of reading a file, you pass the kernel string inside your host code as the kernel source. This will allow you to protect your IP. However, you still need to build the code with clBuildProgram every time. You can also save the built program as a binary, so after the first run you won't degrade performance by rebuilding it every time. To give an example, let's suppose you have a kernel.cl file like the following:
__kernel void foo(__global int* in, __global int* out)
{
int idx = get_global_id(0);
out[idx] = in[idx] * in[idx];
}
You probably get this kernel code by reading the file with something like:
FILE *fp = fopen("kernel.cl", "r");
char *source_str = (char *)malloc(MAX_SOURCE_SIZE);
size_t source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);
program = clCreateProgramWithSource(context, 1, (const char **)&source_str, (const size_t *)&source_size, &ret);
What you can do instead is something like:
const char* src = "__kernel void foo(__global int* in, __global int* out)\
{\
    int idx = get_global_id(0);\
    out[idx] = in[idx] * in[idx];\
}";
program = clCreateProgramWithSource(context, 1, &src, NULL, &ret); /* NULL lengths: the string is null-terminated */
When you compile your host C code, this string is embedded in the executable, so you protect your source code: no separate readable .cl file ships with your program.
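For the save-the-binary idea mentioned above, here is a minimal sketch (single-device program, error checks omitted) of pulling the built binary out so it can be cached on disk and fed back to clCreateProgramWithBinary on later runs:

size_t bin_size;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size), &bin_size, NULL);
unsigned char *bin = (unsigned char *) malloc(bin_size);
unsigned char *bins[] = { bin }; /* one pointer per device in the program */
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
/* write bin/bin_size to a cache file; reload later with clCreateProgramWithBinary */

Keep in mind that such a cached binary is only valid for the exact device (and often driver version) that produced it, which is the root of the original problem.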
AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized, globally scoped, and in the constant address space (as specified in section 6.5.3 of the OpenCL specification). If the size of an array is below 64 kB, it is placed in hardware constant buffers; otherwise, it uses global memory. An example of this is a lookup table for math functions.
I want to use such a "globally scoped constant array". I have this code in pure C:
#define SIZE 101
int *reciprocal_table;

int reciprocal(int number) {
    return reciprocal_table[number];
}

void compute(int *output)
{
    for (int i = 0; i < SIZE; i++)
        output[i] = reciprocal(i);
}
I want to port it to OpenCL:
int reciprocal(int number); // forward declaration

__kernel void compute(__global int *output) {
    int gid = get_global_id(0);
    output[gid] = reciprocal(gid);
}

int reciprocal(int number) {
    return reciprocal_table[number];
}
What should I do with the global variable reciprocal_table? If I try to add __global or __constant to it, I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from the kernel down to reciprocal as an argument. Is it possible to initialize a program-scope variable some other way? I know that I can write the values directly into the code, but does another way exist?
P.S. I'm using AMD OpenCL
UPD The above code is just an example. My real code is much more complex, with a lot of functions, so I want an array at program scope that all functions can use.
UPD2 Changed the example code and added the citation from the Programming Guide.
#define SIZE 2
int constant array[SIZE] = {0, 1};

kernel void
foo (global int* input,
     global int* output)
{
    const uint id = get_global_id (0);
    output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also works without the initialization of the array, but then you would not know what's in it; and since it's in the constant address space, you could not assign any values from inside the kernel.
Program global variables have to be in the __constant address space, as stated by section 6.5.3 in the standard.
UPDATE Now that I fully understand the question:
One thing that worked for me is to define the array in the constant address space and then shadow it with a kernel parameter of the same name (constant int* foo below), which overrides the array.
That produced correct results only on the GPU device; the AMD CPU device and the Intel CPU device did not override the array's address. It is also probably not compliant with the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};

int
baz (int i)
{
    return foo[i];
}

kernel void
bar (global int* input,
     global int* output,
     constant int* foo)
{
    const uint id = get_global_id (0);
    output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1}, this produces {2, 4} on my HD 7850 device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OpenCL implementation (AMD, Intel). So I cannot stress enough how much I personally would NOT do this, because it's only a matter of time before this breaks.
Another way to achieve this would be to generate .h files on the host at runtime with the definition of the array (or predefine them) and pass them to the kernel at compile time via a compiler option, as sketched below. This, of course, requires recompiling the clProgram/clKernel for every different LUT.
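For a small table you can even skip the generated header and inject the initializer through the clBuildProgram options string. A sketch (identifiers such as host_values and device are assumptions):

/* the kernel source contains: __constant int lut[2] = { LOOKUP }; */
char options[128];
snprintf(options, sizeof(options), "-DLOOKUP=%d,%d", host_values[0], host_values[1]);
clBuildProgram(program, 1, &device, options, NULL, NULL);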
I struggled to get this to work in my own program some time ago.
I did not find any way to initialize a constant or program-scope array from the host via clEnqueueWriteBuffer or the like. The only way is to write it explicitly in your .cl source file.
So my trick to initialize it from the host uses the fact that you are compiling your source from the host anyway, which means you can alter your src.cl source before compiling it.
First, my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory)

double func(int idx) {
    return lookup[idx];
}

__kernel void ker1(__global double *in, __global double *out)
{
    // ... do something ...
    double t = func(i);
    // ...
}
Notice the lookup table is initialized with the LOOKUP placeholder.
Then, in the host program, before compiling your OpenCL code:
compute the values of your lookup table in host_values[]
on your host, run something like:
char *buf = (char*) malloc( 10000 );
int count = sprintf(buf, "#define LOOKUP "); // actual source generation !
for (int i=0;i<SIZE;i++) count += sprintf(buf+count, "%g, ",host_values[i]);
count += sprintf(buf+count,"\n");
then read the content of your source file src.cl and place it right at buf+count.
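A sketch of that read-and-append step (error checks omitted; it assumes src.cl fits in the remaining space of the 10000-byte buffer):

FILE *f = fopen("src.cl", "rb");
count += fread(buf + count, 1, 10000 - count - 1, f);
fclose(f);
buf[count] = '\0';
size_t src_sz = count;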
You now have a source buffer with an explicitly defined lookup table that you just computed on the host.
compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, &err);
voilà!
It looks like "array" is a lookup table of sorts. You'll need clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.
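A minimal sketch of that approach (the names are placeholders, error checks omitted): upload the table once and pass it as a kernel argument instead of a program-scope variable:

int table[SIZE] = { 0, 1 }; /* host-side lookup table */
cl_mem lut = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(table), NULL, &err);
clEnqueueWriteBuffer(queue, lut, CL_TRUE, 0, sizeof(table), table, 0, NULL, NULL);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &lut); /* e.g. the constant int* foo slot */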