Iterative Algorithm with OpenCL

I am trying to implement a simple algorithm in preparation for a more complex one.
I want to call a kernel several times, and on each call it should increment every value in an array by, say, 5.
So starting from the array [1,2,3,4] I want [6,7,8,9] after the first call, [11,12,13,14] after the second call, and so on. But I don't understand how to configure my buffers and how to enqueue them in that case. I tried to follow this tutorial:
http://www.browndeertechnology.com/docs/BDT_OpenCL_Tutorial_NBody-rev3.html
(this is the algorithm I want to implement in the end with some modifications)
but the library used there hides the most important aspects.
At the moment I create my buffer with:
pos2g_buf = clCreateBuffer(
    context,
    CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
    sizeof(cl_float) * (nparticle * 4),
    pos2g,
    &status);
The kernel calls are placed inside a for-loop:
for (int i = 0; i < /* number of iterations */; i++) {
    ...
}
In each iteration I set the kernel arguments and enqueue the kernel with:
status = clEnqueueNDRangeKernel(
    oclm->commandQueue,
    kernel,
    NDRangeDimension,
    NULL,
    globalThreads,
    localThreads,
    0,
    NULL,
    &events[0]);
Can someone please help me and give the correct (pseudo-)code for such a simple iterative program?
Many thanks in advance!
Michael

On Host side:
cl_mem buffer = clCreateBuffer(..., CL_MEM_READ_WRITE, ...);
cl_kernel kernel = clCreateKernel(...);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);
for (int i = 0; i < num_laps; i++) {
    clEnqueueNDRangeKernel(..., kernel, ...);
}
void *host_mem = malloc(...);
clEnqueueReadBuffer(..., buffer, ..., host_mem, ...);
On Device side:
__kernel void my(__global int *mem)
{
    mem[get_global_id(0)] += 5;
}
Don't forget to check return codes and release resources.
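Putting the pieces together, a minimal self-contained host-side sketch in plain C might look like this (run_iterations, num_laps and the kernel name my are illustrative; error handling is mostly omitted):

#include <CL/cl.h>
#include <stdio.h>

// Kernel: add 5 to every element on each launch.
static const char *src =
    "__kernel void my(__global int *mem) {\n"
    "    mem[get_global_id(0)] += 5;\n"
    "}\n";

void run_iterations(cl_context context, cl_device_id device,
                    cl_command_queue queue, int num_laps)
{
    cl_int err;
    int data[4] = {1, 2, 3, 4};
    size_t global = 4;

    // One buffer, initialized from host memory, reused across all launches.
    cl_mem buffer = clCreateBuffer(context,
                                   CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   sizeof(data), data, &err);

    cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "my", &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);

    // The buffer stays on the device between launches, so each launch
    // sees the result of the previous one.
    for (int i = 0; i < num_laps; i++)
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                               0, NULL, NULL);

    // Blocking read on the default in-order queue: waits for all kernels above.
    clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, sizeof(data), data,
                        0, NULL, NULL);

    for (int i = 0; i < 4; i++)
        printf("%d ", data[i]);   // after 2 laps: 11 12 13 14
    printf("\n");

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseMemObject(buffer);
}

Because the buffer lives on the device for the whole loop, nothing needs to be read back or re-uploaded between kernel calls; only the final clEnqueueReadBuffer transfers the result.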

Related

for_each_possible_cpu macro in vmalloc_init() function: does the code run on only one CPU, or on every CPU?

This is not exactly about programming, but I'm asking it here.
In the Linux start_kernel() function, inside mm_init(), I see a call to vmalloc_init().
Inside that function I see code like this:
void __init vmalloc_init(void)
{
    struct vmap_area *va;
    struct vm_struct *tmp;
    int i;

    /*
     * Create the cache for vmap_area objects.
     */
    vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);

    for_each_possible_cpu(i) {
        struct vmap_block_queue *vbq;
        struct vfree_deferred *p;

        vbq = &per_cpu(vmap_block_queue, i);
        spin_lock_init(&vbq->lock);
        INIT_LIST_HEAD(&vbq->free);
        p = &per_cpu(vfree_deferred, i);
        init_llist_head(&p->list);
        INIT_WORK(&p->wq, free_work);
    }

    /* Import existing vmlist entries. */
    for (tmp = vmlist; tmp; tmp = tmp->next) {
        va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
        if (WARN_ON_ONCE(!va))
            continue;

        va->va_start = (unsigned long)tmp->addr;
        va->va_end = va->va_start + tmp->size;
        va->vm = tmp;
        insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
    }

    /*
     * Now we can initialize a free vmap space.
     */
    vmap_init_free_space();
    vmap_initialized = true;
}
I'm not sure whether this code runs on every CPU (core) or just on the first CPU.
If it runs on every SMP core, how does the code inside the for_each_possible_cpu loop get executed?
The SMP setup seems to be done before this function.
start_kernel() calls mm_init() which calls vmalloc_init(). Only the first (boot) CPU is active at that point. Later, start_kernel() calls arch_call_rest_init() which calls rest_init().
rest_init() creates a kernel thread for the init task with entry point kernel_init(). kernel_init() calls kernel_init_freeable(). kernel_init_freeable() eventually calls smp_init() to activate the remaining CPUs.
Every macro in the for_each_cpu family is just a wrapper around a for() loop whose iterator is a CPU index.
E.g., the core macro of this family is defined as
#define for_each_cpu(cpu, mask)                  \
    for ((cpu) = -1;                             \
        (cpu) = cpumask_next((cpu), (mask)),     \
        (cpu) < nr_cpu_ids;)
Each macro in the for_each_cpu family uses its own CPU mask, which is just a set of bits corresponding to CPU indices. E.g., the mask used by for_each_possible_cpu has a bit set for every CPU index that could ever be enabled in the current machine session.
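For illustration, a simplified version of the definition (from include/linux/cpumask.h; details vary between kernel versions):

/* Simplified; the real definition differs slightly across kernel versions. */
#define for_each_possible_cpu(cpu)  for_each_cpu((cpu), cpu_possible_mask)

So the loop in vmalloc_init() is an ordinary for() loop executed entirely by the boot CPU: it iterates over the indices of every CPU that may ever come online and initializes their per-CPU data (per_cpu(vmap_block_queue, i) and friends), which exists whether or not those CPUs are running yet.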

How do I allocate memory for some of the structure elements?

I want to allocate memory for some elements of a structure, which are pointers to other small structs. How do I allocate and de-allocate the memory in the best way?
Ex:
typedef struct _SOME_STRUCT {
    PDATATYPE1 PDatatype1;
    PDATATYPE2 PDatatype2;
    PDATATYPE3 PDatatype3;
    .......
    PDATATYPE12 PDatatype12;
} SOME_STRUCT, *PSOME_STRUCT;
I want to allocate memory for PDatatype1,3,4,6,7,9,11. Can I allocate the memory with a single malloc? Or what is the best way to allocate memory for only these elements, and how do I free the whole allocated memory?
There is a trick that allows a single malloc, but that also has to be weighed against the more standard multiple-malloc approach.
If [and only if] the DatatypeN elements of SOME_STRUCT, once allocated, never need to be reallocated in any way, and no other code ever calls free on any of them, you can do the following [assuming PDATATYPEn points to DATATYPEn]:
PSOME_STRUCT
alloc_some_struct(void)
{
    size_t siz;
    char *vptr;
    PSOME_STRUCT sptr;

    // NOTE: this optimizes down to a single assignment
    siz = 0;
    siz += sizeof(DATATYPE1);
    siz += sizeof(DATATYPE2);
    siz += sizeof(DATATYPE3);
    ...
    siz += sizeof(DATATYPE12);

    // one block holds the outer struct plus all the sub-structs
    sptr = malloc(sizeof(SOME_STRUCT) + siz);

    vptr = (char *) sptr;
    vptr += sizeof(SOME_STRUCT);

    sptr->PDatatype1 = (PDATATYPE1) vptr;
    // either initialize the struct pointed to by sptr->PDatatype1 here or
    // the caller should do it -- likewise for the others ...
    vptr += sizeof(DATATYPE1);

    sptr->PDatatype2 = (PDATATYPE2) vptr;
    vptr += sizeof(DATATYPE2);

    sptr->PDatatype3 = (PDATATYPE3) vptr;
    vptr += sizeof(DATATYPE3);
    ...

    sptr->PDatatype12 = (PDATATYPE12) vptr;

    return sptr;
}
Then, when you're done, just do free(sptr).
The sizeof values above should be sufficient to provide proper alignment for the sub-structs. If not, you'll have to wrap them in a macro (e.g. SIZEOF) that rounds up to the necessary alignment; e.g. for 8-byte alignment, something like:
#define SIZEOF(_siz) (((_siz) + 7) & ~0x07)
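Under that assumption, each size in the accumulation, and each pointer advance, would be wrapped the same way, e.g.:

siz += SIZEOF(sizeof(DATATYPE1));    // each sub-struct rounded up to 8 bytes
...
vptr += SIZEOF(sizeof(DATATYPE1));   // keeps the next sub-struct aligned as well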
Note: While it is possible to do all this, it is more common for things like variable-length string structs:
struct mystring {
    int my_strlen;
    char my_strbuf[0];
};
rather than the two-allocation, pointer-based version:
struct mystring {
    int my_strlen;
    char *my_strbuf;
};
It is debatable whether it's worth the [potential] fragility (i.e. somebody forgets and does a realloc/free on one of the individual elements). If the single malloc is a high priority for you, the cleaner way would be to embed the actual structs rather than pointers to them.
Otherwise, just do things the [more] standard way with 12 individual malloc calls and, later, 12 free calls.
Still, it is a viable technique, particularly on small, memory-constrained systems.
Here is the [more] usual way involving per-element allocations:
PSOME_STRUCT
alloc_some_struct(void)
{
    PSOME_STRUCT sptr;

    sptr = malloc(sizeof(SOME_STRUCT));

    // either initialize the structs pointed to by sptr->PDatatypeN here or
    // the caller should do it ...
    sptr->PDatatype1 = malloc(sizeof(DATATYPE1));
    sptr->PDatatype2 = malloc(sizeof(DATATYPE2));
    sptr->PDatatype3 = malloc(sizeof(DATATYPE3));
    ...
    sptr->PDatatype12 = malloc(sizeof(DATATYPE12));

    return sptr;
}

void
free_some_struct(PSOME_STRUCT sptr)
{
    free(sptr->PDatatype1);
    free(sptr->PDatatype2);
    free(sptr->PDatatype3);
    ...
    free(sptr->PDatatype12);
    free(sptr);
}
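Either way, usage from the caller's side looks the same apart from how the memory is released; a quick sketch:

PSOME_STRUCT s = alloc_some_struct();
// ... use s->PDatatype1 ... s->PDatatype12 ...
free_some_struct(s);    // with the single-malloc version: just free(s);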
If your structure contains the other structures as elements instead of pointers, you can allocate memory for the combined structure in one shot:
typedef struct _SOME_STRUCT {
    DATATYPE1 Datatype1;
    DATATYPE2 Datatype2;
    DATATYPE3 Datatype3;
    .......
    DATATYPE12 Datatype12;
} SOME_STRUCT, *PSOME_STRUCT;

PSOME_STRUCT p = (PSOME_STRUCT)malloc(sizeof(SOME_STRUCT));

// Or, in C++, without malloc:
PSOME_STRUCT p = new SOME_STRUCT();

Incomplete task before HOST ends with CPU device

I have only a Core i3 CPU with two cores, so I can only work with the CPU, not a GPU. I want to test a simple example using OpenCL with a simple add kernel. But here is my problem:
After allocating platform, CPU device, etc, I do the following:
1) clEnqueueNDRangeKernel() enqueues a kernel task and, via its last parameter, associates an event with the completion of that task.
2) clSetEventCallback() with CL_COMPLETE links the callback function to the aforementioned event.
Normally, the callback function should be called when the task completes. But it isn't. Indeed, the task is INCOMPLETE at the end, even though the host has plenty of work to do before ending. Could someone tell me why?
Here is my minimal code:
/** Simple add kernel */
private static String programSource0 =
    "__kernel void vectorAdd(" +
    "    __global const float *a," +
    "    __global const float *b, " +
    "    __global float *c)" +
    "{" +
    "    int gid = get_global_id(0);" +
    "    c[gid] = a[gid]+b[gid];" +
    "}";
/** The entry point of this sample */
public static void main(String args[])
{
    /** Callback function */
    EventCallbackFunction kernelCommandEvent = new EventCallbackFunction()
    {
        @Override
        public void function(cl_event event, int event_status, Object user_data)
        {
            System.out.println("Callback: task COMPLETED");
        }
    };

    // Initialize the input data
    int n = 1000000;
    float srcArrayA[] = new float[n];
    float srcArrayB[] = new float[n];
    float dstArray0[] = new float[n];
    Arrays.fill(srcArrayA, 1.0f);
    Arrays.fill(srcArrayB, 1.0f);

    // .
    // (hidden) Allocation of my Intel platform, CPU device, kernel, commandQueue, and memory buffer, set the argument to kernel etc...
    // .

    // Set work-item dimensions and execute the kernels
    long globalWorkSize[] = new long[]{n};

    // I pass an event on completion of the command queue.
    cl_event[] myEventID = new cl_event[1];
    myEventID[0] = new cl_event();
    clEnqueueNDRangeKernel(commandQueue, kernel0, 1, null, globalWorkSize, null, 0, null, myEventID[0]);

    // I link the event to the callback function "kernelCommandEvent", and pass 10 as parameter
    clSetEventCallback(myEventID[0], CL_COMPLETE, kernelCommandEvent, new Integer(10));

    // host does some very long stuff !!

    // Normally, my device task should be completed
    int[] ok = new int[1];
    Arrays.fill(ok, 0);
    clGetEventInfo(myEventID[0], CL_EVENT_COMMAND_EXECUTION_STATUS, Sizeof.cl_int, Pointer.to(ok), null);
    if (ok[0] == CL_COMPLETE) System.out.println("Task COMPLETE");
    else System.out.println("Task INCOMPLETE");
}
Enqueueing does not force execution of the task. It just puts it in the queue.
The tasks are executed only if you:
Force immediate submission using clFlush().
Make a blocking call that depends directly or indirectly on that task.
Some drivers can also decide to start working on a task even if you did not flush it, but that is implementation dependent. If you want to be sure, use clFlush(commandQueue);
Extra:
This behaviour exists because the overhead of submitting work to the device can be big, and paying it on every enqueue call may not be efficient when the call happens many times in a loop. Instead, submission is deferred to a flush or a blocking call, so it can be batched.
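In plain C OpenCL API terms (JOCL mirrors these calls one-to-one), the two options look roughly like this; commandQueue, kernel0 and globalWorkSize are assumed to be set up as in the question, and on_complete is an illustrative callback:

void CL_CALLBACK on_complete(cl_event e, cl_int status, void *user_data)
{
    printf("Callback: task COMPLETED\n");
}

...

cl_event myEvent;

// Enqueueing only *submits* the kernel to the queue; it does not start it.
clEnqueueNDRangeKernel(commandQueue, kernel0, 1, NULL, globalWorkSize, NULL,
                       0, NULL, &myEvent);
clSetEventCallback(myEvent, CL_COMPLETE, on_complete, NULL);

// Option 1: push the queued commands to the device, then keep working on the host.
clFlush(commandQueue);

// Option 2: block until the kernel has actually finished (this implies a flush).
clWaitForEvents(1, &myEvent);
// or: clFinish(commandQueue);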

cudaMemcpy() gives segfault when using Type**

I want to copy a double-pointer object to the device and compute on it on the GPU. When I do the cudaMemcpy of the object to the device, it throws a SEGFAULT.
BMP Input;
Input.ReadFromFile( fileName );
WIDTH = Input.TellWidth();
HEIGHT = Input.TellHeight();

RGBApixel** imageData = new RGBApixel* [HEIGHT];
for (int i = 0; i < HEIGHT; i++)
    imageData[i] = new RGBApixel [WIDTH];

for (int j = 0; j < Input.TellHeight(); j++) {
    for (int i = 0; i < Input.TellWidth(); i++) {
        imageData[j][i] = Input.GetPixel(i, j);
    }
}

long long imageSize = WIDTH*HEIGHT*sizeof(RGBApixel *);

RGBApixel** dev_imgdata, dev_imgdata_out;

//Allocating cudaMemory
cudaMalloc( (void **) &dev_imgdata, imageSize );
cudaMalloc( (void **) &dev_imgdata_out, imageSize );
Now the below line throws segfault
cudaMemcpy(dev_imgdata,imageData,imageSize,cudaMemcpyHostToDevice);
When you declare RGBApixel** imageData = new RGBApixel* [HEIGHT]; you have absolutely no guarantee that imageData will occupy a contiguous block of memory.
cudaMemcpy copies contiguous blocks of memory into device RAM. Your statement tries to copy the start addresses of each matrix row, but not the actual data. Also, when using cudaMalloc, you would need to allocate each row separately, exactly as you did for the host buffer.
What you need to do is declare imageData as just an RGBApixel* - basically put the matrix into a single vector and use proper indexing - and it will work.
You could also copy one row at a time, but that's not good practice, since every memory access would then require an extra indirection and you would hurt caching efficiency.
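A rough sketch of that flattened-buffer approach (names are illustrative, error checking omitted; note the element size is sizeof(RGBApixel), not sizeof(RGBApixel *)):

// Flatten the image into one contiguous host array so a single cudaMemcpy works.
size_t imageSize = (size_t)WIDTH * HEIGHT * sizeof(RGBApixel);

RGBApixel *h_imageData = new RGBApixel[(size_t)WIDTH * HEIGHT];
for (int j = 0; j < HEIGHT; j++)
    for (int i = 0; i < WIDTH; i++)
        h_imageData[j * WIDTH + i] = Input.GetPixel(i, j);   // row-major indexing

RGBApixel *dev_imgdata = NULL;
RGBApixel *dev_imgdata_out = NULL;
cudaMalloc((void **)&dev_imgdata, imageSize);
cudaMalloc((void **)&dev_imgdata_out, imageSize);

// One contiguous block, so one copy and no per-row bookkeeping on the device.
cudaMemcpy(dev_imgdata, h_imageData, imageSize, cudaMemcpyHostToDevice);

// In a kernel, pixel (i, j) is then dev_imgdata[j * WIDTH + i].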
Also, make sure that when you compile your program you use -arch sm_20 to enable extra features of your graphics card (if it has compute capability 2.0). Without it, I believe you can't use double, and the results are unpredictable (or doubles are demoted to float).

Using Thrift for IPC communication via shared memory

I couldn't find a sufficient example of how to use Apache Thrift for IPC communication via shared memory. My goal is to serialize an existing class with the help of Thrift, then send it via shared memory to a different process, where I deserialize it again with Thrift. Right now I'm using TMemoryBuffer and TBinaryProtocol to serialize the data. Although this works, I have no idea how to write it to shared memory.
Here is my code so far:
#include "test_types.h"
#include "test_constants.h"
#include "thrift/protocol/TBinaryProtocol.h"
#include "thrift/transport/TBufferTransports.h"
int main(int argc, char** argv)
{
int shID;
char* myPtr;
Person* dieter = new Person("Dieter", "Neuer");
//Person* johann = new Person("Johann", "Liebert");
//Car* ford = new Car("KLENW", 4, 4);
PersonThrift dieterThrift;
dieterThrift.nachName = dieter->getNachname();
dieterThrift.vorName = dieter->getVorname();
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
test thriftTest;
thriftTest.personSet.insert(dieterThrift);
u_int32_t size = thriftTest.write(protocol.get());
std::cout << transport.get()->getBufferAsString();
shID = shmget(1000, 100, IPC_CREAT | 0666);
if (shID >= 0)
{
myPtr = (char*)shmat(shID, 0, 0);
if (myPtr==(char *)-1)
{
perror("shmat");
}
else
{
//myPtr = protocol.get();
}
}
getchar();
shmdt(myPtr);
}
The main problem is the part
//myPtr = protocol.get();
How do I use Thrift so that I can write my serialized data into myPtr (and thus into shared memory)? I guess TMemoryBuffer might already be a bad idea. As you may see, I'm not really experienced with this.
Kind regards and thanks in advance
Michael
After reading the question again and having a closer look at the code ... you were almost there. The mistake you made is to look at the protocol, which gives you no data. Instead, you have to ask the transport, as you already did with
std::cout << transport.get()->getBufferAsString();
The way to get the raw data is quite similar: just use getBuffer(&pbuf, &sz); instead. With that, we end up with something like this:
// query buffer pointer and data size
uint8_t* pbuf;
uint32_t sz;
transport.get()->getBuffer(&pbuf, &sz);

// alloc shmem block of adequate size
shID = shmget(1000, sz, IPC_CREAT | 0666);
if (shID >= 0)
{
    myPtr = (char*)shmat(shID, 0, 0);
    if (myPtr == (char *)-1)
    {
        perror("shmat");
    }
    else
    {
        // copy serialized data into shared memory
        memcpy(myPtr, pbuf, sz);
    }
}
Since shmget() may give you a larger block than requested, it seems to be a good idea to additionally use the framed transport, which automatically carries the real data size in the serialized data. Some sample code for the latter can be found in the Test Client or server code.
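For the receiving process, the same building blocks run in reverse; a rough sketch, assuming the reader knows the serialized size sz (e.g. passed alongside the segment, or recovered by using the framed transport):

// Reader side: attach the segment written above and deserialize from it.
int shID = shmget(1000, 0, 0666);                       // same key as the writer
uint8_t* myPtr = (uint8_t*)shmat(shID, 0, 0);

// Wrap the shared memory bytes in a TMemoryBuffer (observed, not copied).
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(
    new apache::thrift::transport::TMemoryBuffer(myPtr, sz));
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(
    new apache::thrift::protocol::TBinaryProtocol(transport));

test received;
received.read(protocol.get());                          // rebuild the object

shmdt(myPtr);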
