I only have a Core i3 CPU with two cores, so I can only work with the CPU, not a GPU. I want to test a simple example using OpenCL with a simple add kernel. But here is my problem:
After allocating the platform, CPU device, etc., I do the following:
1) clEnqueueNDRangeKernel() enqueues a kernel task and, through its last parameter, associates an event with the completion of that task.
2) clSetEventCallback() using CL_COMPLETE links the callback function to the aforementioned event.
Normally, the callback function should be called when the task completes. But it doesn't. Indeed, the task is INCOMPLETE at the end even though the host has plenty of work to do before finishing. Could someone tell me why?
Here is my minimal code:
/** Simple add kernel */
private static String programSource0 =
"__kernel void vectorAdd(" +
" __global const float *a,"+
" __global const float *b, " +
" __global float *c)"+
"{"+
" int gid = get_global_id(0);"+
" c[gid] = a[gid]+b[gid];"+
"}";
/** The entry point of this sample */
public static void main(String args[])
{
/** Callback function */
EventCallbackFunction kernelCommandEvent = new EventCallbackFunction()
{
@Override
public void function(cl_event event, int event_status, Object user_data)
{
System.out.println("Callback: task COMPLETED");
}
};
// Initialize the input data
int n = 1000000;
float srcArrayA[] = new float[n];
float srcArrayB[] = new float[n];
float dstArray0[] = new float[n];
Arrays.fill(srcArrayA, 1.0f);
Arrays.fill(srcArrayB, 1.0f);
// .
// (hidden) Allocation of my Intel platform, CPU device, kernel, commandQueue, and memory buffers; setting the kernel arguments, etc.
// .
// Set work-item dimensions and execute the kernels
long globalWorkSize[] = new long[]{n};
// I pass an event that will track completion of the enqueued kernel.
cl_event[] myEventID = new cl_event[1];
myEventID[0] = new cl_event();
clEnqueueNDRangeKernel(commandQueue, kernel0, 1, null, globalWorkSize, null, 0, null, myEventID[0]);
// I link the event to the callback function "kernelCommandEvent", and pass 10 as parameter
clSetEventCallback(myEventID[0], CL_COMPLETE, kernelCommandEvent, new Integer(10));
// host does some very long stuff !!
// Normally, my device task should be completed
int[] ok = new int[1];
Arrays.fill(ok, 0);
clGetEventInfo(myEventID[0], CL_EVENT_COMMAND_EXECUTION_STATUS, Sizeof.cl_int, Pointer.to(ok), null);
if (ok[0] == CL_COMPLETE) System.out.println("Task COMPLETE");
else System.out.println("Task INCOMPLETE");
}
Enqueueing does not force execution of the task; it just puts the task in the queue.
Tasks are executed only if you:
Force the immediate execution using clFlush().
Make a blocking call that depends directly or indirectly on that task.
Some drivers may also decide to start working on a task even if you did not flush it, but that is implementation dependent. If you want to be sure, use clFlush(commandQueue);
Extra:
This behaviour exists because the overhead of submitting work to the device can be large, and submitting on every enqueue call may be inefficient when enqueues happen many times in a loop. Instead, submission is deferred to a flush or a blocking call so it can be batched.
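For example, a minimal sketch of the fixed flow using the C API (the JOCL calls in the question mirror these one-to-one; commandQueue, kernel0, globalWorkSize and kernelCommandEvent are the names from the question):

// Sketch only: assumes the setup from the question; error checking omitted.
cl_event my_event;
clEnqueueNDRangeKernel(commandQueue, kernel0, 1, NULL,
                       globalWorkSize, NULL, 0, NULL, &my_event);
clSetEventCallback(my_event, CL_COMPLETE, kernelCommandEvent, NULL);
clFlush(commandQueue);           // submit the queued commands to the device

// ... long host-side work ...

clWaitForEvents(1, &my_event);   // or clFinish(commandQueue); both block
                                 // until the kernel has completed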
This is not strictly about programming, but I'll ask it here.
In the Linux kernel, the start_kernel() function calls mm_init(), where I see the vmalloc_init() function.
Inside that function I see code like this:
void __init vmalloc_init(void)
{
    struct vmap_area *va;
    struct vm_struct *tmp;
    int i;

    /*
     * Create the cache for vmap_area objects.
     */
    vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);

    for_each_possible_cpu(i) {
        struct vmap_block_queue *vbq;
        struct vfree_deferred *p;

        vbq = &per_cpu(vmap_block_queue, i);
        spin_lock_init(&vbq->lock);
        INIT_LIST_HEAD(&vbq->free);
        p = &per_cpu(vfree_deferred, i);
        init_llist_head(&p->list);
        INIT_WORK(&p->wq, free_work);
    }

    /* Import existing vmlist entries. */
    for (tmp = vmlist; tmp; tmp = tmp->next) {
        va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
        if (WARN_ON_ONCE(!va))
            continue;

        va->va_start = (unsigned long)tmp->addr;
        va->va_end = va->va_start + tmp->size;
        va->vm = tmp;
        insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
    }

    /*
     * Now we can initialize a free vmap space.
     */
    vmap_init_free_space();
    vmap_initialized = true;
}
I'm not sure whether this code runs on every CPU (core) or just on the first CPU.
If this code runs on every SMP core, how does the code inside the for_each_possible_cpu loop run?
The SMP setup seems to be done before this function.
start_kernel() calls mm_init() which calls vmalloc_init(). Only the first (boot) CPU is active at that point. Later, start_kernel() calls arch_call_rest_init() which calls rest_init().
rest_init() creates a kernel thread for the init task with entry point kernel_init(). kernel_init() calls kernel_init_freeable(). kernel_init_freeable() eventually calls smp_init() to activate the remaining CPUs.
Every macro in the for_each_cpu family is just a wrapper for a for() loop whose iterator is a CPU index.
E.g., the core macro of this family is defined as
#define for_each_cpu(cpu, mask)                   \
    for ((cpu) = -1;                              \
         (cpu) = cpumask_next((cpu), (mask)),     \
         (cpu) < nr_cpu_ids;)
Each macro in the for_each_cpu family uses its own CPU mask, which is just a set of bits corresponding to CPU indices. E.g., the mask for the for_each_possible_cpu macro has a bit set for every index of a CPU that could ever be enabled in the current machine session. Note that the loop itself executes entirely on the CPU that runs it; it merely iterates over the indices and touches each CPU's per-cpu data via per_cpu().
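For illustration, for_each_possible_cpu(i) { ... } unrolls to roughly the following ordinary loop (a sketch, not the exact kernel source; cpu_possible_mask is the mask described above):

// Rough expansion sketch: a plain loop over CPU indices, executed entirely
// on whichever CPU called it (the boot CPU during vmalloc_init()).
for ((i) = -1;
     (i) = cpumask_next((i), cpu_possible_mask), (i) < nr_cpu_ids; ) {
    /* body runs once per possible CPU index */
}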
I am learning the Linux kernel, and I have run into a problem.
In the Linux kernel, I use "mod_delayed_work(bdi_wq, &wb->dwork, 0)" to queue a work_struct on a workqueue, and I assume the work function of the queued work_struct will be executed soon. But the work function is not executed until 300 seconds later.
I also find that a watchdog thread runs in the meantime.
Is this a normal case? Or is it the watchdog thread that makes the workqueue sleep even though there is work (my queued work_struct) pending?
Added:
The following is my setup. I use the Linux kernel 4.9.13 sources and do not change them except for adding some printk logs.
I have five disks and use five shell scripts to copy 4 GB files from disk to disk concurrently. The problem happens while I am doing sync. One of the scripts looks like:
#!/bin/bash
for ((i=0; i<9999; i++))
do
    cp disk1/4GB.tar disk2/4GB-chen.tar
    sync
    rm disk2/4GB-chen.tar
    sync
done
I do a sync after each copy is done. After the scripts have been running for some time, I find that the sync command gets blocked for a long time (longer than 2 minutes). sync issues a system call whose code is as follows:
SYSCALL_DEFINE0(sync)
{
    int nowait = 0, wait = 1;

    wakeup_flusher_threads(0, WB_REASON_SYNC);
    iterate_supers(sync_inodes_one_sb, NULL);
    iterate_supers(sync_fs_one_sb, &nowait);
    iterate_supers(sync_fs_one_sb, &wait);
    iterate_bdevs(fdatawrite_one_bdev, NULL);
    iterate_bdevs(fdatawait_one_bdev, NULL);
    if (unlikely(laptop_mode))
        laptop_sync_completion();
    return 0;
}
In iterate_supers(sync_inodes_one_sb, NULL), the kernel calls sync_inodes_one_sb for each disk's superblock. sync_inodes_one_sb eventually calls sync_inodes_sb, whose code is:
void sync_inodes_sb(struct super_block *sb)
{
    DEFINE_WB_COMPLETION_ONSTACK(done);
    struct wb_writeback_work work = {
        .sb = sb,
        .sync_mode = WB_SYNC_ALL,
        .nr_pages = LONG_MAX,
        .range_cyclic = 0,
        .done = &done,
        .reason = WB_REASON_SYNC,
        .for_sync = 1,
    };
    struct backing_dev_info *bdi = sb->s_bdi;

    /*
     * Can't skip on !bdi_has_dirty() because we should wait for !dirty
     * inodes under writeback and I_DIRTY_TIME inodes ignored by
     * bdi_has_dirty() need to be written out too.
     */
    if (bdi == &noop_backing_dev_info)
        return;
    WARN_ON(!rwsem_is_locked(&sb->s_umount));

    bdi_split_work_to_wbs(bdi, &work, false); /* split work to wbs */
    wb_wait_for_completion(bdi, &done);

    wait_sb_inodes(sb);
}
And bdi_split_work_to_wbs(bdi, &work, false) (in fs/fs-writeback.c) queues the writeback works on the workqueue:
static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
                                  struct wb_writeback_work *base_work,
                                  bool skip_if_busy)
{
    struct bdi_writeback *last_wb = NULL;
    struct bdi_writeback *wb = list_entry(&bdi->wb_list,
                                          struct bdi_writeback, bdi_node);

    might_sleep();
restart:
    rcu_read_lock();
    list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
        DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
        struct wb_writeback_work fallback_work;
        struct wb_writeback_work *work;
        long nr_pages;

        if (last_wb) {
            wb_put(last_wb);
            last_wb = NULL;
        }

        /* SYNC_ALL writes out I_DIRTY_TIME too */
        if (!wb_has_dirty_io(wb) &&
            (base_work->sync_mode == WB_SYNC_NONE ||
             list_empty(&wb->b_dirty_time)))
            continue;
        if (skip_if_busy && writeback_in_progress(wb))
            continue;

        nr_pages = wb_split_bdi_pages(wb, base_work->nr_pages);

        work = kmalloc(sizeof(*work), GFP_ATOMIC);
        if (work) {
            *work = *base_work;
            work->nr_pages = nr_pages;
            work->auto_free = 1;
            wb_queue_work(wb, work); /*** here the writeback work is queued ***/
            continue;
        }

        /* alloc failed, execute synchronously using on-stack fallback */
        work = &fallback_work;
        *work = *base_work;
        work->nr_pages = nr_pages;
        work->auto_free = 0;
        work->done = &fallback_work_done;

        wb_queue_work(wb, work);

        /*
         * Pin @wb so that it stays on @bdi->wb_list. This allows
         * continuing iteration from @wb after dropping and
         * regrabbing rcu read lock.
         */
        wb_get(wb);
        last_wb = wb;

        rcu_read_unlock();
        wb_wait_for_completion(bdi, &fallback_work_done);
        goto restart;
    }
    rcu_read_unlock();

    if (last_wb)
        wb_put(last_wb);
}
wb_queue_work(wb, work) queues a work item on the bdi_writeback; in fs/fs-writeback.c, wb_queue_work is:
static void wb_queue_work(struct bdi_writeback *wb,
                          struct wb_writeback_work *work)
{
    trace_writeback_queue(wb, work);

    if (work->done)
        atomic_inc(&work->done->cnt);

    spin_lock_bh(&wb->work_lock);

    if (test_bit(WB_registered, &wb->state)) {
        list_add_tail(&work->list, &wb->work_list);
        mod_delayed_work(bdi_wq, &wb->dwork, 0); /*** queue the work on the workqueue ***/
    } else
        finish_writeback_work(wb, work);

    spin_unlock_bh(&wb->work_lock);
}
Here mod_delayed_work(bdi_wq, &wb->dwork, 0) actually queues wb->dwork on the bdi_wq workqueue; the work function of wb->dwork is wb_workfn() (in fs/fs-writeback.c). I added some printks where the work is about to be queued and inside the work function. Sometimes the printk logs in the work function are not printed until approximately 300 seconds later (most of the time they are printed less than 1 second after the work has been queued), and the bdi_wq workqueue is blocked until the work function begins to execute 300 seconds later.
I'm working on a C++11 class that fetches the value of a temperature sensor in a Raspberry Pi via I2C. It polls the value until it is stopped, and it does the polling in a separate thread so that it does not block the application flow. The problem is at line 64 of this file: https://github.com/OpenStratos/server/blob/feature/temperature/temperature/Temperature.cpp#L64
void Temperature::read_temperature()
{
    while (this->reading)
    {
#ifndef OS_TESTING
        int value = wiringPiI2CRead(this->filehandle);
#else
        int value = 16000;
#endif

        float voltage = value * 5 / 32768; // 2^15
        float temp = r_to_c(TEMP_R * (TEMP_VIN / voltage - 1));
        this->temperature = temp; // Gives segmentation fault

        this_thread::sleep_for(chrono::milliseconds(50));
    }
}
It gives a segmentation fault. The curious thing is that it does not always happen. After compiling, if I run the binary many times, it crashes about 75% of the time.
This is the file that invokes the code: https://github.com/OpenStratos/server/blob/feature/temperature/testing/temperature_test.cpp
Temperature temp(20);
temp.start_reading();
AssertThat(temp.is_reading(), Equals(true));
// this_thread::sleep_for(chrono::milliseconds(100)); if uncommented less segmentation faults
temp.stop_reading();
AssertThat(temp.is_reading(), Equals(false));
What could be happening? How can it be fixed?
You need to wait for Temperature::read_temperature() to quit, so you need:
bool reading;
volatile bool stopped; // volatile required to make the compiler re-read
                       // the value every time we expect it to
bool is_stopped() { return stopped; }
and
void Temperature::start_reading()
{
    if (!reading)
    {
        stopped = false;
        reading = true;
        // etc.
and
void Temperature::read_temperature()
{
    while (this->reading)
    {
        // etc.
    }
    stopped = true;
}
and
temp.stop_reading();
while (!temp.is_stopped());
AssertThat(temp.is_reading(), Equals(false));
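Strictly speaking, volatile is not a synchronization primitive in C++11; std::atomic<bool> expresses the same stop/stopped handshake with defined cross-thread semantics. A minimal standalone sketch of the idea (not the asker's class; the sensor poll is stubbed out):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> reading{true};
std::atomic<bool> stopped{false};

void read_temperature()
{
    while (reading.load())
    {
        // ... poll the sensor here ...
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    stopped.store(true); // signal that the loop has fully exited
}

int main()
{
    std::thread t(read_temperature);
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    reading.store(false);       // ask the polling thread to stop
    while (!stopped.load()) {}  // wait until the loop has exited
    t.join();
}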
I have been working on implementing a half duplex serial driver by learning from a basic serial terminal example using boost::asio::basic_serial_port:
http://lists.boost.org/boost-users/att-41140/minicom.cpp
I need to read asynchronously but still detect in the main thread when the handler has finished, so I pass async_read_some a callback with several additional reference parameters bound via boost::bind. The handler never gets invoked, but if I replace the async_read_some call with read_some, it returns data without an issue.
I believe I'm satisfying all of the necessary requirements for this function to invoke the handler, because they are the same as for the blocking read_some function, which does return:
The buffer stays in scope
One or more bytes is received by the serial device
The io service is running
The port is open and running at the correct baud rate
Does anyone know if I'm missing another assumption unique to the asynchronous read or if I'm not setting up the io_service correctly?
Here is an example of how I'm using the code with async_read_some (http://www.boost.org/doc/libs/1_56_0/doc/html/boost_asio/reference/basic_serial_port/async_read_some.html):
void readCallback(const boost::system::error_code& error, size_t bytes_transferred,
                  bool& finished_reading, boost::system::error_code& error_report,
                  size_t& bytes_read)
{
    std::cout << "READ CALLBACK\n";
    std::cout.flush();
    error_report = error;
    bytes_read = bytes_transferred;
    finished_reading = true;
    return;
}
int main()
{
    int baud_rate = 115200;
    std::string port_name = "/dev/ttyUSB0";

    boost::asio::io_service io_service_;
    boost::asio::serial_port serial_port_(io_service_, port_name);
    serial_port_.set_option(boost::asio::serial_port_base::baud_rate(baud_rate));
    boost::thread service_thread_;
    service_thread_ = boost::thread(boost::bind(&boost::asio::io_service::run, &io_service_));

    std::cout << "Starting byte read\n";
    boost::system::error_code ec;
    bool finished_reading = false;
    size_t bytes_read;
    int max_response_size = 8;
    uint8_t read_buffer[max_response_size];
    serial_port_.async_read_some(boost::asio::buffer(read_buffer, max_response_size),
                                 boost::bind(readCallback,
                                             boost::asio::placeholders::error,
                                             boost::asio::placeholders::bytes_transferred,
                                             finished_reading, ec, bytes_read));

    std::cout << "Waiting for read to finish\n";
    while (!finished_reading)
    {
        boost::this_thread::sleep(boost::posix_time::milliseconds(1));
    }

    std::cout << "Finished byte read: " << bytes_read << "\n";
    for (int i = 0; i < bytes_read; ++i)
    {
        printf("0x%x ", read_buffer[i]);
    }
}
The result is that the callback never prints anything and the while (!finished_reading) loop never finishes.
Here is how I use the blocking read_some function (boost.org/doc/libs/1_56_0/doc/html/boost_asio/reference/basic_serial_port/read_some.html):
int main()
{
    int baud_rate = 115200;
    std::string port_name = "/dev/ttyUSB0";

    boost::asio::io_service io_service_;
    boost::asio::serial_port serial_port_(io_service_, port_name);
    serial_port_.set_option(boost::asio::serial_port_base::baud_rate(baud_rate));
    boost::thread service_thread_;
    service_thread_ = boost::thread(boost::bind(&boost::asio::io_service::run, &io_service_));

    std::cout << "Starting byte read\n";
    boost::system::error_code ec;
    int max_response_size = 8;
    uint8_t read_buffer[max_response_size];
    int bytes_read = serial_port_.read_some(boost::asio::buffer(read_buffer, max_response_size), ec);

    std::cout << "Finished byte read: " << bytes_read << "\n";
    for (int i = 0; i < bytes_read; ++i)
    {
        printf("0x%x ", read_buffer[i]);
    }
}
This version prints from 1 up to 8 characters that I send, blocking until at least one is sent.
The code does not guarantee that the io_service is running. io_service::run() will return when either:
All work has finished and there are no more handlers to be dispatched
The io_service has been stopped.
In this case, it is possible for the service_thread_ to be created and invoke io_service::run() before the serial_port::async_read_some() operation is initiated, adding work to the io_service. Thus, the service_thread_ could immediately return from io_service::run(). To resolve this, either:
Invoke io_service::run() after the asynchronous operation has been initiated.
Create an io_service::work object before starting the service_thread_. A work object prevents the io_service from running out of work (a sketch follows below).
This answer may provide some more insight into the behavior of io_service::run().
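For example, a minimal sketch of the second option, reusing the names from the question (Boost 1.56-era API):

boost::asio::io_service io_service_;
// The work object keeps io_service::run() from returning while the main
// thread has not yet initiated the asynchronous read.
boost::asio::io_service::work work(io_service_);
boost::thread service_thread_(
    boost::bind(&boost::asio::io_service::run, &io_service_));

// ... open serial_port_ and call async_read_some() as before ...

// At shutdown, let run() return so the thread can be joined:
io_service_.stop();
service_thread_.join();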
A few other things to note and to expand upon Igor's answer:
If a thread is not progressing in a meaningful way while waiting for an asynchronous operation to complete (i.e. spinning in a loop sleeping), then it may be worth examining if mixing synchronous behavior with asynchronous operations is the correct solution.
boost::bind() copies its arguments by value. To pass an argument by reference, wrap it with boost::ref() or boost::cref():
boost::bind(..., boost::ref(finished_reading), boost::ref(ec),
boost::ref(bytes_read));
Synchronization needs to be added to guarantee memory visibility of finished_reading in the main thread. For asynchronous operations, Boost.Asio will guarantee the appropriate memory barriers to ensure correct memory visibility (see this answer for more details). In this case, a memory barrier is required within the main thread to guarantee the main thread observes changes to finished_reading by other threads. Consider using either a Boost.Thread synchronization mechanism like boost::mutex, or Boost.Atomic's atomic objects or thread and signal fences.
Note that boost::bind copies its arguments. If you want to pass an argument by reference, wrap it with boost::ref (or std::ref):
boost::bind(readCallback, boost::asio::placeholders::error, boost::asio::placeholders::bytes_transferred, boost::ref(finished_reading), boost::ref(ec), boost::ref(bytes_read))
(However, strictly speaking, there's a race condition on the bool variable you pass to another thread. A better solution would be to use std::atomic_bool.)
I am trying to implement a simple algorithm in preparation for a more complex one.
I want to call a kernel several times, and in each call it shall increment each value within an array by, let's say, 5.
So when I initially have the array [1,2,3,4], I want [6,7,8,9] after the first call, [11,12,13,14] after the second call, and so on. But I don't understand how to configure my buffers and how to enqueue my buffer in that case. I tried to follow this tutorial:
http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody-rev3.html
(this is the algorithm I want to implement in the end with some modifications)
but the library used there hides the most important aspects.
At the moment I create my buffer with:
pos2g_buf = clCreateBuffer(context,
                           CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                           sizeof(cl_float) * (nparticle * 4),
                           pos2g,
                           &status);
The kernel calls are placed within a for loop:
for (int i = 0; i < num_laps; i++) {
There I set the kernel arguments and invoke the kernel with:
status = clEnqueueNDRangeKernel(oclm->commandQueue,
                                kernel,
                                NDRangeDimension,
                                NULL,
                                globalThreads,
                                localThreads,
                                0,
                                NULL,
                                &events[0]);
Can someone please help me and give the correct (pseudo-)code for how to create my simple iterator program?
Many thanks in advance!
Michael
On the host side:
cl_mem buffer = clCreateBuffer(..., CL_MEM_READ_WRITE, ...);
cl_kernel kernel = clCreateKernel(...);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);

for (int i = 0; i < num_laps; i++) {
    clEnqueueNDRangeKernel(..., kernel, ...);
}

void *host_mem = malloc(...);
clEnqueueReadBuffer(..., buffer, ..., host_mem, ...);
On the device side:
__kernel void my(__global int *mem)
{
    mem[get_global_id(0)] += 5;
}
Don't forget to check return codes and release resources.
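Fleshed out, a minimal self-contained sketch of the same idea might look like this (one default device assumed; most error checking omitted; the numbers follow the [1,2,3,4] example from the question):

#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    const char *src =
        "__kernel void my(__global int *mem) { mem[get_global_id(0)] += 5; }";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "my", &err);

    int host_mem[4] = {1, 2, 3, 4};
    cl_mem buffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   sizeof(host_mem), host_mem, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);

    size_t global = 4;
    for (int i = 0; i < 2; i++)  // two laps: expect {11, 12, 13, 14}
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    // The blocking read waits for the queued kernels to finish first.
    clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, sizeof(host_mem), host_mem,
                        0, NULL, NULL);
    for (int i = 0; i < 4; i++)
        printf("%d ", host_mem[i]);

    clReleaseMemObject(buffer);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}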