how to compile opencl project with kernels


I am a complete beginner with OpenCL. I searched the internet and found some "hello world" demos for OpenCL projects. Such a minimal project usually has a *.cl file containing the OpenCL kernels and a *.c file containing the main function. My question is how to compile this kind of project from the command line. I know I should use the -lOpenCL flag on Linux and -framework OpenCL on Mac, but I have no idea how to link the *.cl kernel to my main source file. Thank you for any comments or useful links.

In OpenCL, the .cl files that contain device kernel code are usually compiled and built at run time. This means that somewhere in your host program you have to compile and build the device program before you can use it. This enables maximum portability.
Let's consider an example I collected from two books. Below is a very simple OpenCL kernel that adds corresponding elements of two global arrays and stores the sums in a third global array. I save this code in a file named vector_add_kernel.cl.
kernel void vecadd( global int* A, global int* B, global int* C ) {
const int idx = get_global_id(0);
C[idx] = A[idx] + B[idx];
}
Below is the host code, written in C++ using the OpenCL C++ API. I save it in a file named ocl_vector_addition.cpp in the same directory as the .cl file.
#include <iostream>
#include <fstream>
#include <iterator>
#include <string>
#include <memory>
#include <utility>
#include <vector>
#include <stdlib.h>
#define __CL_ENABLE_EXCEPTIONS
#if defined(__APPLE__) || defined(__MACOSX)
#include <OpenCL/cl.hpp>
#else
#include <CL/cl.hpp>
#endif
int main( int argc, char** argv ) {
const int N_ELEMENTS=1024*1024;
unsigned int platform_id=0, device_id=0;
try{
std::unique_ptr<int[]> A(new int[N_ELEMENTS]); // Or you can use simple dynamic arrays like: int* A = new int[N_ELEMENTS];
std::unique_ptr<int[]> B(new int[N_ELEMENTS]);
std::unique_ptr<int[]> C(new int[N_ELEMENTS]);
for( int i = 0; i < N_ELEMENTS; ++i ) {
A[i] = i;
B[i] = i;
}
// Query for platforms
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
// Get a list of devices on this platform
std::vector<cl::Device> devices;
platforms[platform_id].getDevices(CL_DEVICE_TYPE_GPU|CL_DEVICE_TYPE_CPU, &devices); // Select the platform.
// Create a context
cl::Context context(devices);
// Create a command queue
cl::CommandQueue queue = cl::CommandQueue( context, devices[device_id] ); // Select the device.
// Create the memory buffers
cl::Buffer bufferA=cl::Buffer(context, CL_MEM_READ_ONLY, N_ELEMENTS * sizeof(int));
cl::Buffer bufferB=cl::Buffer(context, CL_MEM_READ_ONLY, N_ELEMENTS * sizeof(int));
cl::Buffer bufferC=cl::Buffer(context, CL_MEM_WRITE_ONLY, N_ELEMENTS * sizeof(int));
// Copy the input data to the input buffers using the command queue.
queue.enqueueWriteBuffer( bufferA, CL_FALSE, 0, N_ELEMENTS * sizeof(int), A.get() );
queue.enqueueWriteBuffer( bufferB, CL_FALSE, 0, N_ELEMENTS * sizeof(int), B.get() );
// Read the program source
std::ifstream sourceFile("vector_add_kernel.cl");
std::string sourceCode( std::istreambuf_iterator<char>(sourceFile), (std::istreambuf_iterator<char>()));
cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()));
// Make program from the source code
cl::Program program=cl::Program(context, source);
// Build the program for the devices
program.build(devices);
// Make kernel
cl::Kernel vecadd_kernel(program, "vecadd");
// Set the kernel arguments
vecadd_kernel.setArg( 0, bufferA );
vecadd_kernel.setArg( 1, bufferB );
vecadd_kernel.setArg( 2, bufferC );
// Execute the kernel
cl::NDRange global( N_ELEMENTS );
cl::NDRange local( 256 );
queue.enqueueNDRangeKernel( vecadd_kernel, cl::NullRange, global, local );
// Copy the output data back to the host
queue.enqueueReadBuffer( bufferC, CL_TRUE, 0, N_ELEMENTS * sizeof(int), C.get() );
// Verify the result
bool result=true;
for (int i=0; i<N_ELEMENTS; i ++)
if (C[i] !=A[i]+B[i]) {
result=false;
break;
}
if (result)
std::cout<< "Success!\n";
else
std::cout<< "Failed!\n";
}
catch(const cl::Error& err) {
std::cout << "Error: " << err.what() << "(" << err.err() << ")" << std::endl;
return( EXIT_FAILURE );
}
std::cout << "Done.\n";
return( EXIT_SUCCESS );
}
I compile this code on a machine with Ubuntu 12.04 like this:
g++ ocl_vector_addition.cpp -lOpenCL -std=c++11 -o ocl_vector_addition.o
It produces an executable named ocl_vector_addition.o (despite the .o suffix, it is a linked executable, not an object file), which prints the success message when I run it. Notice that the compilation command says nothing about the .cl file. We only pass the -lOpenCL flag to link against the OpenCL library. Also, don't be distracted by the -std=c++11 flag: I needed it because the host code uses std::unique_ptr.
So where is the .cl file used? If you look at the host code, you'll find four parts, which I repeat below with numbers:
// 1. Read the program source
std::ifstream sourceFile("vector_add_kernel.cl");
std::string sourceCode( std::istreambuf_iterator<char>(sourceFile), (std::istreambuf_iterator<char>()));
cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(), sourceCode.length()));
// 2. Make program from the source code
cl::Program program=cl::Program(context, source);
// 3. Build the program for the devices
program.build(devices);
// 4. Make kernel
cl::Kernel vecadd_kernel(program, "vecadd");
In the first step, we read the contents of the file that holds our device code into a std::string named sourceCode. We then make a pair of the string and its length and store it in source, which has the type cl::Program::Sources. In the second step, we create a cl::Program object named program for the context and load the source code into it. The third step is where the OpenCL code gets compiled (and linked) for the devices. Once the device code is built, the fourth step creates a cl::Kernel object named vecadd_kernel and associates it with the kernel named vecadd in the program. Those are the steps involved in compiling a .cl file inside a program.
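Step 1 in the host code does not check whether the .cl file could be opened. A minimal sketch of a checked version follows, using only the standard library; the helper name load_kernel_source is hypothetical and not part of the OpenCL API.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an OpenCL call): reads a whole text file, such as
// vector_add_kernel.cl, into a std::string.
std::string load_kernel_source(const std::string& path) {
    assert(!path.empty());
    std::ifstream file(path);
    if (!file) {
        // A missing .cl file is a run-time failure, not a build failure.
        throw std::runtime_error("cannot open kernel file: " + path);
    }
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

In the host code above, the returned string would take the place of sourceCode when building cl::Program::Sources.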
The program shown above creates the device program from the kernel source code. Another option is to use binaries instead. Using a binary program improves application loading time and allows binary distribution, but it limits portability, since a binary that works on one device may not work on another. Creating a program from source code and from a binary are also called online and offline compilation, respectively. I skip the details here since the answer is already long.

My answer comes four years late. Nevertheless, I have something to add that complements @Farzad's answer, as follows.
Confusingly, in OpenCL practice, the verb to compile is used to mean two different, incompatible things:
In one usage, to compile means what you already think that it means. It means to build at build-time, as from *.c sources to produce *.o objects for build-time linking.
However, in another usage—and this other usage may be unfamiliar to you—to compile means to interpret at run time, as from *.cl sources, producing GPU machine code.
One happens at build-time. The other happens at run-time.
It might have been less confusing had two different verbs been introduced, but that is not how the terminology has evolved. Conventionally, the verb to compile is used for both.
If unsure, then try this experiment: rename your *.cl file so that your other source files cannot find it, then build.
See? It builds fine, doesn't it?
This is because the *.cl file is not consulted at build time. Only later, when you try to execute the binary executable, does the program fail.
If it helps, you can think of the *.cl file as though it were a data file or a configuration file or even a script. It isn't literally a data file, a configuration file or a script, perhaps, for it does eventually get compiled to a kind of machine code, but the machine code is GPU code and it is not made from the *.cl program text until run-time. Moreover, at run-time, your C compiler as such is not involved. Rather, it is your OpenCL library that does the building.
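To make the "data file" view concrete, here is a small sketch, not taken from the answers above, in which the kernel text is embedded in the host program as a C++11 raw string literal. The host compiler only copies the characters; it never parses them as OpenCL. The names kVecAddSource and mentions_kernel are hypothetical.

```cpp
#include <cassert>
#include <string>

// The kernel text from vector_add_kernel.cl, stored as ordinary string data.
const std::string kVecAddSource = R"CLC(
kernel void vecadd( global int* A, global int* B, global int* C ) {
    const int idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}
)CLC";

// Hypothetical helper: true if the text appears to define the named kernel.
// It is a plain substring search, which is all the host side can do with
// the text before handing it to the OpenCL library.
bool mentions_kernel(const std::string& source, const std::string& name) {
    assert(!name.empty());
    return source.find("kernel void " + name + "(") != std::string::npos;
}
```

A string like this could presumably be passed to the OpenCL library in the same way as the contents read from the file; either way, the device code is only built when the program runs.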
It took me a fairly long time to straighten these concepts in my mind, mostly because—like you—I had long been familiar with the stages of the C/C++ build cycle; and, therefore, I had thought that I knew what words like to compile meant. Once your mind has the words and concepts straight, the various OpenCL documentation begins to make sense, and you can start work.

Related

Performance drop when getting attributes of files residing on an SMB mount

I have a piece of code in C++ that lists files in a folder and then gets attributes for each of them through Windows API. I am puzzled by the performance of this code when the folder is an SMB mount on a remote server (mounted as a disk).
#include <string>
#include <iostream>
#include <windows.h>
#include <vector>
//#include <chrono>
//#include <thread>
int main(int argc, char *argv[]) {
WIN32_FIND_DATA ffd;
HANDLE hFile;
std::string pathStr = argv[1];
std::vector<std::string> paths;
hFile = FindFirstFile((pathStr + "\\*").c_str(), &ffd);
if (hFile == INVALID_HANDLE_VALUE) {
std::cout << "FindFirstFile failed: " << GetLastError();
return 1;
} else {
do {
paths.push_back(pathStr + "\\" + ffd.cFileName);
} while (FindNextFile(hFile, &ffd) != 0);
int error = GetLastError();
if (error != ERROR_NO_MORE_FILES) {
std::cout << "FindNextFile failed: " << error;
FindClose(hFile);
return error;
}
FindClose(hFile);
}
std::cout << paths.size() << " files listed" << std::endl;
// std::this_thread::sleep_for(std::chrono::milliseconds(30000));
for (const std::string & p : paths) {
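DWORD a = GetFileAttributes(p.c_str());
if (a == INVALID_FILE_ATTRIBUTES) { // query failed; the bit tests below would be meaningless
std::cout << p << ": GetFileAttributes failed: " << GetLastError() << std::endl;
continue;
}
bool isDir = (a & FILE_ATTRIBUTE_DIRECTORY) != 0;
bool isHidden = (a & FILE_ATTRIBUTE_HIDDEN) != 0;
std::cout << p << ": " << (isDir ? "D" : "f") << (isHidden ? "H" : "_") << std::endl;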
}
}
Namely, if I have a folder with 250 files, the program completes in about 1 second. When there are 500 files, it takes about 1 minute, and even the first files take hundreds of milliseconds each (so 1 second is enough for only ~10 files).
Experimenting with it, I found that there is some limit below which the processing speed is hundreds of files per second and above which it is ~10 files per second. I also noticed that this limit varies with file name length. With names like file-001, it is between 510 and 520. With names like file-file-file-file-file-001, it is between 370 and 380.
I am interested in why this happens, in particular why the speed degrades from the very beginning when there are "too many" files/folders in the folder. Is there a way to investigate that? Optional: is there a way to overcome that while still using GetFileAttributes?
(The code is probably ugly as hell, I just stuck it together from samples found online. I compile it with MinGW, g++ -static files.cpp -o files.exe, run it files.exe "Z:\test_folder".
My original code is in Java, and I got from reading the source of the Hotspot JVM that it uses GetFileAttributes WinAPI method, so I created this snippet to see if it would behave the same as the Java code — and it does. I am also limited in the ways to solve this performance problem: I noticed that FindFirstFile/FindNextFile WinAPI calls perform consistently fast, but I did not find a way to use it from Java without JNI/JNA which would be too much fuss for me.)
Update: if I put a 30-second sleep between listing (files collected into a vector) and getting their attributes in a loop, the behavior becomes consistent with any number of files in the folder: "slow" for any number of files. I also read some scattered information that the Windows SMB client applies caching, limited by time and so on. I guess this is what I see here: listing the folder fills this cache with file attributes, and a subsequent GetFileAttributes does not hit the remote system if run immediately after the listing. I guess the other behavior is also cache related: when listing "too many" files, only the tail of the list remains in the cache. Then we start GetFileAttributes from the first file again, and every request hits the server. It is still a mystery to me why listing is so fast and GetFileAttributes is slow...
Update 2: I thought to confirm that it has something to do with the cache, but I was not lucky so far. If it had something to do with eviction of the "first" file attributes, then getting attributes in the reverse order would hit the cache for many files — not the case: it's either all fast or all slow.
I tried fiddling with SMB client parameters according to this MS article, hoping that if I set sizes really high I won't notice the slow behavior any more — was not the case either, the behavior seems to be completely independent from these parameters. What I set was:
// HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters
DirectoryCacheEntriesMax = 4096
DirectoryCacheEntrySizeMax = 32768
FileInfoCacheEntriesMax = 8192
In addition, I noticed that when there are "too many files", listing returns paths in random order (not alphabetically sorted). I am not sure whether it has anything to do with the problem. This behavior persists even after changing the listing to use FindFirstFile/FindNextFile.
I also studied more carefully the timeout needed to "invalidate" the "cache" (that is, for a folder with few files to start behaving slowly), and it is around 30 seconds in my case. Sometimes a lower value shows the same behavior (slow attribute retrieval for a folder with few files), but then re-running the program is instantaneous again.
I updated the code above; it originally used std::filesystem::directory_iterator from C++17.
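On the question of how to investigate this, one simple approach is to time each attribute query individually, so that the point where the per-file latency jumps becomes visible. The sketch below is portable C++ and makes no assumptions about SMB; time_each is a hypothetical helper, and the probe would be a lambda wrapping the real call.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: runs `probe` once per path and records how long each
// call took, in microseconds, in the same order as `paths`.
std::vector<long long> time_each(
        const std::vector<std::string>& paths,
        const std::function<void(const std::string&)>& probe) {
    assert(probe);  // an empty std::function cannot be called
    using clock = std::chrono::steady_clock;
    std::vector<long long> micros;
    micros.reserve(paths.size());
    for (const std::string& p : paths) {
        const auto start = clock::now();
        probe(p);
        const auto stop = clock::now();
        micros.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
                .count());
    }
    return micros;
}
```

In the program above, the probe could be `[](const std::string& p) { GetFileAttributes(p.c_str()); }`. Printing the resulting values next to each path would show whether every call is slow or only those past some index.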

Running code at memory location in my OS

I am developing an OS in C (and some assembly, of course) and now I want it to load and run external programs placed on the RAM disk. I have assembled a test program as raw machine code with nasm using '-f bin'. Everything else I found on the subject is about loading code while running Windows or Linux. I load the program into memory using the following code:
#define BIN_ADDR 0xFF000
int run_bin(char *file) //Too many hacks at the moment
{
u32int size = 0;
char *bin = open_file(file, &size);
printf("Loaded [%d] bytes of [%s] into [%X]\n", size, file, bin);
char *reloc = (char *)BIN_ADDR; //no malloc because of the org statement in the prog
memset(reloc, 0, size);
memcpy(reloc, bin, size);
jmp_to_bin();
return 0; // reached only if the loaded program returns control with ret
}
and the code to jump to it:
[global jmp_to_bin]
jmp_to_bin:
jmp [bin_loc] ;also tried a plain jump
bin_loc dd 0xFF000
This caused a GPF when I ran it. I could give you the registers at the GPF and/or a screenshot if needed.
Code for my OS is at https://github.com/farlepet/retro-os
Any help would be greatly appreciated.

You use identity mapping and a flat memory space, so address 0xFF000 falls in the BIOS ROM range. No wonder you can't copy anything there. Better change that address ;)
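As a sketch of how a loader could guard against this, the check below rejects load ranges that touch the conventional PC upper memory area. It assumes the usual PC memory map, in which 0xA0000 to 0xFFFFF holds video memory, option ROMs and the BIOS ROM; the names are hypothetical and not from the linked repository.

```cpp
#include <cassert>
#include <cstdint>

// Assumed conventional PC memory map: 0xA0000-0xFFFFF is not ordinary RAM.
constexpr std::uint32_t kReservedStart = 0xA0000;
constexpr std::uint32_t kReservedEnd = 0x100000;  // one past the last reserved byte

// Hypothetical helper: true if [addr, addr + size) touches the reserved area.
// The sum is computed in 64 bits so that it cannot wrap around.
constexpr bool overlaps_reserved(std::uint32_t addr, std::uint32_t size) {
    return size != 0 && addr < kReservedEnd &&
           static_cast<std::uint64_t>(addr) + size > kReservedStart;
}
```

With this check, `overlaps_reserved(0xFF000, size)` is true for any non-zero size, which matches the answer. A range that passes the check is only outside this one area; it may still collide with the kernel or other data, and the org statement of the test program has to match whatever address is finally chosen.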

_CrtCheckMemory usage example

I'm trying to understand how to use _CrtCheckMemory to track down heap corruption in a Windows application I'm working on. I can't seem to get it to return false. Here's my test code:
int* test = new int[1];
for(int i = 0; i < 100; i++){
test[i] = 1;
}
assert( _CrtCheckMemory( ) );
In the code above, _CrtCheckMemory( ) returns true. I'm running in Debug mode. What else do I need to do in order to get a simple example of _CrtCheckMemory flagging a problem?

An extra step is required: you must convince the compiler to replace the default new operator with the debug allocator. Only the debug allocator creates the "no man's land" areas that detect an under- or overwrite of the heap block. It is risky, because code compiled with the debug allocator will not mix well with code that wasn't. So it forces you to opt in explicitly.
That's best done in the pre-compiled header file (stdafx.h by default) so you can be sure that all code uses the debug allocator. Like this:
#ifdef _DEBUG
# define _CRTDBG_MAP_ALLOC
# define _CRTDBG_MAP_ALLOC_NEW
# include <crtdbg.h>
# include <assert.h>
#endif
The CRTDBG macros get the malloc() functions and the new operators replaced.
Do beware that your code as posted will trigger another diagnostic first. On Windows Vista and up, the Windows heap allocator is going to complain first because the code destroyed the Windows heap integrity. Make the overwrite a bit subtler by indexing only up to, say, 2.
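To illustrate the "no man's land" idea mentioned above, here is a toy model in portable C++. It is not the CRT's implementation: the guard width, the fill byte and all names are assumptions chosen for the sketch. Each block is surrounded by guard bytes, and a check reports whether they are still intact.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy model only; this is not how the CRT debug heap is implemented.
constexpr std::size_t kGuardSize = 4;       // assumed guard width in bytes
constexpr unsigned char kGuardFill = 0xFD;  // assumed fill pattern

struct GuardedBlock {
    std::vector<unsigned char> storage;  // layout: [guard][user bytes][guard]
    std::size_t user_size;

    explicit GuardedBlock(std::size_t n)
        : storage(n + 2 * kGuardSize, kGuardFill), user_size(n) {
        std::memset(user(), 0, n);  // zero the user area, keep the guards
    }

    unsigned char* user() { return storage.data() + kGuardSize; }

    // Plays the role of _CrtCheckMemory for this one block: true while both
    // guard areas still hold the fill pattern.
    bool check() const {
        for (std::size_t i = 0; i < kGuardSize; ++i) {
            if (storage[i] != kGuardFill) return false;
            if (storage[kGuardSize + user_size + i] != kGuardFill) return false;
        }
        return true;
    }
};
```

Writing to `user()[user_size]` lands in the trailing guard and makes `check()` return false, which mirrors a small overrun of the kind the answer suggests. An overrun that jumps far past the guard would go unnoticed by this toy.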

Read/Write memory on OS X 10.8.2 with vm_read and vm_write

This is my code, which works only when run from Xcode (version 4.5):
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mach/mach_init.h>
#include <mach/mach_vm.h>
#include <sys/types.h>
#include <mach/mach.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <Security/Authorization.h>
int main(int argc, const char * argv[]) {
char test[14] = "Hello World! "; //0x7fff5fbff82a
char value[14] = "Hello Hacker!";
char test1[14];
pointer_t buf;
uint32_t sz;
task_t task;
task_for_pid(current_task(), getpid(), &task);
if (vm_write(current_task(), 0x7fff5fbff82a, (pointer_t)value, 14) == KERN_SUCCESS) {
printf("%s\n", test);
//getchar();
}
if (vm_read(task, 0x7fff5fbff82a, sizeof(char) * 14, &buf, &sz) == KERN_SUCCESS) {
memcpy(test1, (const void *)buf, sz);
printf("%s", test1);
}
return 0;
}
I also tried ptrace and other things, which is why I include the other headers too.
The first problem is that this works only in Xcode. With the debugger I can find the memory address of a variable (here, test), so I overwrite that string with the contents of value and then copy the new contents of test into test1.
I actually don't understand how vm_write works (not completely), and the same goes for task_for_pid(). The second problem is that I need to read and write the memory of another process; this is only a test to see whether the functions work within the same process, and they do (only in Xcode).
How can I do that for other processes? I need to read a location (how can I find the address of "something"?); this is the first goal.

For your problems, there are solutions:
The first problem: OS X has address space layout randomization. If you want to make your memory images fixed and predictable, you have to compile your code with NOPIE setting. This setting (PIE = Position Independent Executable), is responsible for allowing ASLR, which "slides" the memory by some random value, which changes on every instance.
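For the same-process test, an alternative to disabling PIE is to take the variable's address at run time instead of hard-coding a value seen in the debugger. The sketch below is portable C++ and uses memcpy as a stand-in for vm_write on the current task; write_at and overwrite_greeting are hypothetical names, and no Mach calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for vm_write on the current process: copies `size` bytes to a
// numeric address. Plain memcpy is used here instead of the Mach call.
void write_at(std::uintptr_t address, const void* data, std::size_t size) {
    assert(address != 0 && data != nullptr);
    std::memcpy(reinterpret_cast<void*>(address), data, size);
}

// Overwrites `target` through its run-time address rather than through a
// constant such as 0x7fff5fbff82a, so address randomization does not matter.
void overwrite_greeting(char (&target)[14]) {
    const char value[14] = "Hello Hacker!";
    const auto address = reinterpret_cast<std::uintptr_t>(&target[0]);
    write_at(address, value, sizeof value);
}
```

This only helps inside one process. For another process, the address still has to be discovered by other means.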
I actually don't understand how vm_write works (not completely) and the same for task_for_pid():
The Mach APIs operate on the lower-level abstractions of "task" and "thread", which correspond roughly to the BSD "process" and "(u)thread" (there are some exceptions, e.g. kernel_task, which does not have a PID, but let's ignore that for now). task_for_pid obtains the task port (think of it as a "handle"), and once you have the port you are free to do whatever you wish. Basically, the vm_* functions operate on any task port: you can use them on your own process (mach_task_self(), that is) or on a port obtained from task_for_pid.
Task for PID actually doesn't necessarily require root (i.e. "sudo"). It requires getting past taskgated on OSX, which traditionally verified membership in procmod or procview groups. You can configure taskgated ( /System/Library/LaunchDaemons/com.apple.taskgated.plist) for debugging purposes. Ultimately, btw, getting the task port will require an entitlement (the same as it now does on iOS). That said, the easiest way, rather than mucking around with system authorizations, etc, is to simply become root.

Did you try to run your app with "sudo"?
You can't read/write other app's memory without sudo.

Linux Device Driver Program, where the program starts?

I've started to learn Linux driver programs, but I'm finding it a little difficult.
I've been studying the i2c driver, and I got quite confused regarding the entry point of the driver program. Does the driver program start at the module_init() macro?
And I'd also like to know how I can know the process of how the driver program runs. I got the book, Linux Device Driver, but I'm still quite confused. Could you help me? Thanks a lot.
I'll take the i2c driver as an example. There are just so many functions in it, I just wanna know how I can get coordinating relation of the functions in the i2c drivers?

A device driver is not a "program" that has a main() with a start point and an exit point. It's more like an API, a library, or a collection of routines. In this case, it's a set of entry points declared by module_init(), module_exit(), perhaps EXPORT_SYMBOL(), and structures that list entry points for operations.
For block devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/blkdev.h):
struct block_device_operations {
int (*open) ();
int (*release) ();
int (*ioctl) ();
int (*compat_ioctl) ();
int (*direct_access) ();
unsigned int (*check_events) ();
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
int (*media_changed) ();
void (*unlock_native_capacity) ();
int (*revalidate_disk) ();
int (*getgeo)();
/* this callback is with swap_lock and sometimes page table lock held */
void (*swap_slot_free_notify) ();
struct module *owner;
};
For char devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/fs.h):
struct file_operations {
struct module *owner;
loff_t (*llseek) ();
ssize_t (*read) ();
ssize_t (*write) ();
ssize_t (*aio_read) ();
ssize_t (*aio_write) ();
int (*readdir) ();
unsigned int (*poll) ();
long (*unlocked_ioctl) ();
long (*compat_ioctl) ();
int (*mmap) ();
int (*open) ();
int (*flush) ();
int (*release) ();
int (*fsync) ();
int (*aio_fsync) ();
int (*fasync) ();
int (*lock) ();
ssize_t (*sendpage) ();
unsigned long (*get_unmapped_area)();
int (*check_flags)();
int (*flock) ();
ssize_t (*splice_write)();
ssize_t (*splice_read)();
int (*setlease)();
long (*fallocate)();
};
For platform devices, the driver is expected to provide the list of operations it can perform by declaring its functions for those operations in (from include/linux/platform_device.h):
struct platform_driver {
int (*probe)();
int (*remove)();
void (*shutdown)();
int (*suspend)();
int (*resume)();
struct device_driver driver;
const struct platform_device_id *id_table;
};
The driver, especially char drivers, does not have to support every operation listed. Note that there are macros to facilitate the coding of these structures by naming the structure entries.
Does the driver program start at the module_init() macro?
The driver's init() routine specified in module_init() will be called during boot (when statically linked in) or when the module is dynamically loaded. The driver passes its structure of operations to the device's subsystem when it registers itself during its init().
These device driver entry points, e.g. open() or read(), are typically executed when the user app invokes a C library call (in user space) and after a switch to kernel space. Note that the i2c driver you're looking at is a platform driver for a bus that is used by leaf devices, and its functions exposed by EXPORT_SYMBOL() would be called by other drivers.
Only the driver's init() routine specified in module_init() is guaranteed to be called. The driver's exit() routine specified in module_exit() would only be executed if and when the module is dynamically unloaded. The driver's op routines will be called asynchronously (just like its interrupt service routine) in unknown order. Hopefully user programs will invoke an open() before issuing a read() or an ioctl() operation, and invoke other operations in a sensible fashion. A well-written and robust driver should accommodate any order or sequence of operations, and produce sane results to ensure system integrity.
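The pattern described above (an init routine that only registers a table of entry points, which are then called later by someone else) can be modeled in user space. The sketch below is a toy in portable C++, not kernel code; every name in it is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy user-space model; these names are hypothetical and not kernel APIs.
struct toy_file_operations {
    int (*open)();
    int (*read)();
    int (*release)();
};

std::vector<std::string> g_log;                     // records the call order
const toy_file_operations* g_registered = nullptr;  // the "subsystem" slot

int toy_open()    { g_log.push_back("open");    return 0; }
int toy_read()    { g_log.push_back("read");    return 0; }
int toy_release() { g_log.push_back("release"); return 0; }

// The driver's table of entry points, in the spirit of struct file_operations.
const toy_file_operations toy_fops = { toy_open, toy_read, toy_release };

// Plays the role of the init routine: it runs once and only registers the
// table. None of the operations are called from here.
int toy_init() {
    assert(g_registered == nullptr);  // register only once
    g_log.push_back("init");
    g_registered = &toy_fops;
    return 0;
}
```

After `toy_init()`, a caller standing in for the kernel can invoke `g_registered->open()`, `g_registered->read()` and so on in whatever order user programs happen to request, which is why the driver cannot assume a fixed sequence.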

It would probably help to stop thinking of a device driver as a program. They're completely different. A program has a specific starting point, does some stuff, and has one or more fairly well defined (well, they should be, anyway) exit points. Drivers have some stuff to do when they first get loaded (e.g. module_init() and other stuff), and may or may not ever do anything ever again (you can forcibly load a driver for hardware your system doesn't actually have), and may have some stuff that needs to be done if the driver is ever unloaded. Aside from that, a driver generally provides some specific entry points (system calls, ioctls, etc.) that user-land applications can access to request the driver to do something.
Horrible analogy, but think of a program kind of like a car: you get in, start it up, drive somewhere, and get out. A driver is more like a vending machine: you plug it in and make sure it's stocked, but then people just come along occasionally and push buttons to make it do something.

Actually, you are talking about an (I2C) platform (native) driver. First you need to understand how module_init() of a platform driver gets called versus other loadable modules.
/*
* module_init() - driver initialization entry point
* @x: function to be run at kernel boot time or module insertion
* module_init() will either be called during do_initcalls() (if
* builtin) or at module insertion time (if a module). There can only
* be one per module.*/
For the i2c driver, you can refer to http://www.linuxjournal.com/article/7136 and
http://www.embedded-bits.co.uk/2009/i2c-in-the-2632-linux-kernel/

A kernel module begins at its initialization function, which is usually marked with the __init macro just in front of the function name.
The __init macro tells the Linux kernel that the function is an initialization function and that the resources it uses can be freed once it has executed.
There are other macros, module_init() and module_exit(), used to designate the initialization and release functions [as described above].
These two macros are used if the device driver is meant to operate as a loadable and removable kernel module at run time [i.e. using the insmod or rmmod commands].

In short: execution starts at the init function as soon as you do insmod. The init function registers the driver with the driver subsystem, which then calls .probe when a matching device is found.
Every time the driver's functionality is invoked from a user application, the corresponding functions are called through callbacks.

"Linux Device Driver" is a good book but it's old!
Basic example:
#include <linux/module.h>
#include <linux/version.h>
#include <linux/kernel.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Name and e-mail");
MODULE_DESCRIPTION("my_first_driver");
static int __init insert_mod(void)
{
printk(KERN_INFO "Module constructor\n");
return 0;
}
static void __exit remove_mod(void)
{
printk(KERN_INFO "Module destructor\n");
}
module_init(insert_mod);
module_exit(remove_mod);
An up-to-date tutorial, really well written, is "Linux Device Drivers Series"
