Is there a good OpenCL wrapper for Ruby?

I am aware of:
https://github.com/lsegal/barracuda
which hasn't been updated since January 2011,
and
http://rubyforge.org/projects/ruby-opencl/
which hasn't been updated since March 2010.
Are these projects dead? Or have they simply not changed because they're still functional and OpenCL/Ruby haven't changed since then? Is anybody using these projects? Any luck?
If not, can you recommend another OpenCL gem for Ruby? Or how is this sort of call usually done? Just call raw C from Ruby?

You can try opencl_ruby_ffi; it's actively developed (by a colleague of mine) and works well with OpenCL 1.2. OpenCL 2.0 support should also be available soon.
sudo gem install opencl_ruby_ffi
In the Khronos forum you can find a quick example that shows how it works:
require 'opencl_ruby_ffi'
# select the first platform/device available
# (improve this if you have multiple GPUs on your machine)
platform = OpenCL::platforms.first
device = platform.devices.first
# prepare the source of the GPU kernel;
# this is not Ruby but OpenCL C
source = <<EOF
__kernel void addition( float2 alpha, __global const float *x, __global float *y) {
  size_t ig = get_global_id(0);
  y[ig] = (alpha.s0 + alpha.s1 + x[ig])*0.3333333333333333333f;
}
EOF
# configure the OpenCL environment, refer to the OpenCL API if necessary
context = OpenCL::create_context(device)
queue = context.create_command_queue(device, :properties => OpenCL::CommandQueue::PROFILING_ENABLE)
# create and compile the OpenCL C source code
prog = context.create_program_with_source(source)
prog.build
# allocate CPU (= RAM) buffers and
# fill the input one with random values
a_in = NArray.sfloat(65536).random(1.0)
a_out = NArray.sfloat(65536)
# allocate GPU buffers matching the CPU ones
b_in = context.create_buffer(a_in.size * a_in.element_size, :flags => OpenCL::Mem::COPY_HOST_PTR, :host_ptr => a_in)
b_out = context.create_buffer(a_out.size * a_out.element_size)
# create a constant pair of floats
f = OpenCL::Float2.new(3.0, 2.0)
# enqueue the kernel 'addition' over 65536 work-items,
# in work-groups of 128
event = prog.addition(queue, [65536], f, b_in, b_out,
                      :local_work_size => [128])
# Or, if you want to be more OpenCL-like:
# k = prog.create_kernel("addition")
# k.set_arg(0, f)
# k.set_arg(1, b_in)
# k.set_arg(2, b_out)
# event = queue.enqueue_NDrange_kernel(k, [65536], :local_work_size => [128])
# tell OpenCL to transfer the content of the GPU buffer b_out
# to CPU memory (a_out), but only after `event` (= the kernel execution)
# has completed
queue.enqueue_read_buffer(b_out, a_out, :event_wait_list => [event])
# wait for everything in the command queue to finish
queue.finish
# now a_out contains the result of the addition performed on the GPU
# add some cleanup here ...
# verify that the computation went well
diff = (a_in - a_out * 3.0)
65536.times { |i|
  raise "Computation error #{i} : #{diff[i] + f.s0 + f.s1}" if (diff[i] + f.s0 + f.s1).abs > 0.00001
}
puts "Success!"

You may want to package whatever C functionality you'd like as a gem. This is pretty straightforward, and this way you can wrap all your C logic in a specific namespace that you can reuse in other projects; a minimal sketch follows the link below.
http://guides.rubygems.org/c-extensions/
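As a minimal sketch of what that looks like (the gem and file names here are hypothetical), the heart of a native-extension gem is an extconf.rb that generates the Makefile for your C sources:
# ext/my_opencl_wrapper/extconf.rb -- hypothetical layout for a native-extension gem
require 'mkmf'

# bail out early if the OpenCL library is not available for linking
abort "OpenCL library not found" unless have_library('OpenCL')

# generate a Makefile that compiles the C sources next to this file
create_makefile('my_opencl_wrapper/my_opencl_wrapper')
The C files alongside it define your module through the usual Init_my_opencl_wrapper entry point; the guide linked above walks through the full directory layout and gemspec wiring.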

If you want to do high-speed calculations on a GPU, Cumo / NArray is a good choice. Cumo has the same interface as NArray, although it is CUDA rather than OpenCL; see the sketch after the link.
https://github.com/sonots/cumo
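A minimal sketch of what using it might look like, assuming a CUDA-capable GPU and the cumo gem installed (the arithmetic mirrors the OpenCL example above):
require 'cumo/narray'

# allocate an array on the GPU and fill it with random single floats
x = Cumo::SFloat.new(65536).rand
# element-wise arithmetic runs on the CUDA device
y = (x + 3.0 + 2.0) / 3.0
puts y[0]  # copy one element back to the host to inspect it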

Related

Pytorch: RAM explodes when using multiprocessing SharedMemory and CUDA

I would like to use multiprocessing to launch multiple training instances on a CUDA device. Since the data is common between the processes, I want to avoid copying it for every process. I'm using Python 3.8's SharedMemory from the multiprocessing module to achieve this, following this SO example.
I can allocate a memory block using SharedMemory and create as many processes as I'd like with constant memory (RAM) usage. However, when I try to send tensors to CUDA, the memory scales linearly with the number of processes. It appears as if when c.to(device) is called, the base data is copied for every process.
Does anyone know why this is happening? Any ideas to mitigate this issue?
Here is the sample code I'm using:
import numpy as np
from multiprocessing import shared_memory, get_context
import time
import torch
import copy

dim = 10000
batch_size = 10
sleep_time = 2
npe = 1  # number of parallel executions

# cuda
if torch.cuda.is_available():
    dev = 'cuda:0'
else:
    dev = "cpu"
device = torch.device(dev)


def step(i, shr_name):
    existing_shm = shared_memory.SharedMemory(name=shr_name)
    np_arr = np.ndarray((dim, dim), dtype=np.float32, buffer=existing_shm.buf)
    b = np_arr[i * batch_size: (i + 1) * batch_size, :]
    b = torch.Tensor(b)
    # This is just to explicitly copy the tensor so that it has nothing to do
    # with the shared memory block
    c = copy.deepcopy(b)
    # If tensor c is sent to the cuda device, then RAM scales linearly
    # with the number of parallel executions.
    # If c is not sent to cuda device, memory consumption is constant.
    c = c.to(device)
    time.sleep(sleep_time)
    existing_shm.close()


def create_shared_block():
    a = np.random.random((dim, dim)).astype(np.float32)
    shm = shared_memory.SharedMemory(create=True, size=a.nbytes, name='sha')
    np_arr = np.ndarray(a.shape, dtype=np.float32, buffer=shm.buf)
    np_arr[:] = a[:]
    return shm, np_arr


if __name__ == '__main__':
    # create shared memory block
    shm, np_arr = create_shared_block()
    # create list of inputs to be executed in parallel
    inp = [[x, 'sha'] for x in range(npe)]
    print(inp)
    # sleep added before and after launching multiprocessing to monitor the memory consumption
    print('before pool')  # to check memory with top or htop
    time.sleep(sleep_time)
    context = get_context('spawn')
    with context.Pool(npe) as pool:
        print('after pool')  # to check memory with top or htop
        time.sleep(sleep_time)
        pool.starmap(step, inp)
        time.sleep(sleep_time)
    shm.close()
    shm.unlink()

Customize task resources on Airflow using MesosExecutor

Is it possible to specify resources (CPU, memory, GPU, disk space) for each operator of a DAG when using MesosExecutor?
I know you can specify global values for task resources.
For instance, I have several operators that are CPU-expensive and others that are not. I would like to execute the CPU-expensive ones one at a time, but run many of the others in parallel.
From the code (mesos_executor.py line 67), it seems that this is not possible, since CPU and memory values are passed to the Scheduler during initialization:
def __init__(self,
             task_queue,
             result_queue,
             task_cpu=1,
             task_mem=256):
    self.task_queue = task_queue
    self.result_queue = result_queue
    self.task_cpu = task_cpu
    self.task_mem = task_mem
and those values are used without modification:
cpus = task.resources.add()
cpus.name = "cpus"
cpus.type = mesos_pb2.Value.SCALAR
cpus.scalar.value = self.task_cpu
mem = task.resources.add()
mem.name = "mem"
mem.type = mesos_pb2.Value.SCALAR
mem.scalar.value = self.task_mem
Achieving this requires a custom Executor implementation.

Memory leak on ruby gtk-3

I'm experiencing a memory leak on Ruby 2.3.1 with gtk-3.
On my system (Ubuntu 16.04) the following code consumes approximately 80 MB.
The size of picture.jpg is 289 KB.
require 'gtk3'

def ptest
  i = 0
  j = 0
  loop {
    i += 1
    j += 1
    exit if j == 50
    #image = Gtk::Image.new
    newPixbuf = GdkPixbuf::Pixbuf.new(:file => "picture.jpg")
    #image.pixbuf = newPixbuf
    #image.clear
    #image=nil
    if i == 10
      p "GC"
      GC.start
      i = 0
    end
  }
end

ptest
According to https://sourceforge.net/p/ruby-gnome2/mailman/message/8659687/ this shouldn't happen. What can I do to release the memory?
I'm not into Ruby but know some bits of Gtk+.
In C, where you must deal with memory allocations by yourself, you need to unref the pixbuf.
From GtkImage Documentation:
The GtkImage does not assume a reference to the pixbuf; you still need to unref it if you own references.
So, most probably, if Ruby doesn't implement ARC (Automatic Reference Counting on GObjects), you must do something like newPixbuf.unref (I'm not sure about the Ruby syntax) right after #image.pixbuf = newPixbuf; a hypothetical sketch follows.
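A sketch of that idea in Ruby, assuming the binding exposes GObject reference counting at all (it may not; recent ruby-gnome releases manage references automatically, in which case this is unnecessary):
image = Gtk::Image.new
new_pixbuf = GdkPixbuf::Pixbuf.new(:file => "picture.jpg")
image.pixbuf = new_pixbuf
# drop our reference, guarded because the binding may not expose unref
new_pixbuf.unref if new_pixbuf.respond_to?(:unref)
new_pixbuf = nil # let Ruby's GC reclaim the wrapper object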
Hope it helps.
Apparently there was a bug in my Ruby gdk3 gem. A gem update solved the problem.

Building an UCS4 string buffer in python 2.7 ctypes

In an attempt to recreate the getenvironment(..) C function of _winapi.c (direct link) in plain Python using ctypes, I'm wondering how the following C code could be translated:
buffer = PyMem_NEW(Py_UCS4, totalsize);
if (! buffer) {
    PyErr_NoMemory();
    goto error;
}
p = buffer;
end = buffer + totalsize;
for (i = 0; i < envsize; i++) {
    PyObject* key = PyList_GET_ITEM(keys, i);
    PyObject* value = PyList_GET_ITEM(values, i);
    if (!PyUnicode_AsUCS4(key, p, end - p, 0))
        goto error;
    p += PyUnicode_GET_LENGTH(key);
    *p++ = '=';
    if (!PyUnicode_AsUCS4(value, p, end - p, 0))
        goto error;
    p += PyUnicode_GET_LENGTH(value);
    *p++ = '\0';
}
/* add trailing null byte */
*p++ = '\0';
It seems that the function ctypes.create_unicode_buffer(..) (doc, code) does something quite close that I could reproduce, if only I had access to the Py_UCS4 C type or could be sure of its link to any other type accessible to Python through ctypes.
Would c_wchar be a good candidate? It seems I can't make that assumption, as Python 2.7 could be compiled in UCS-2 mode if I'm right (source), and I guess Windows is really waiting for UCS-4 there... even if it seems that ctypes.wintypes.LPWSTR is an alias for c_wchar_p in cPython 2.7 (code).
For this question, it is safe to assume that the target platform is Python 2.7 on Windows, if that helps.
Context (if it matters):
I'm delving into ctypes for the first time to attempt a plain-Python fix for cPython 2.7's bug affecting the Windows subprocess.Popen(..) implementation. This bug is a won't-fix. It prevents the use of unicode in command-line calls (as the executable name or arguments). This is fixed in Python 3, so I'm having a go at reverse-implementing in plain Python the actual cPython 3 implementation of the required CreateProcess(..) in _winapi.c, which in turn calls getenvironment(..).
This possible workaround was mentioned in the comments of this answer to a question related to subprocess.Popen(..) unicode issues.
This doesn't answer the part in the title about building a UCS4 buffer specifically. But it gives a partial answer to the question in bold and manages to create a unicode buffer that seems to work on my current Python 2.7 on Windows (so maybe UCS4 is not required).
So we are assuming here that c_wchar is what Windows requires (whether that is UCS4 or UCS2 is not so clear to me yet, and it might not matter, but I admit I have very little confidence in my knowledge here).
So here is the Python code that reproduces the C code as requested in the question:
from ctypes import c_wchar

## creation of a buffer of size totalsize
## (`env` is the environment dict, `totalsize` the precomputed length)
wenv = (c_wchar * totalsize)()
wenv.value = (unicode("").join([
    unicode("%s=%s\0") % (k, v)
    for k, v in env.items()])) + "\0"
This wenv can then be fed to CreateProcessW and this seems to work.

Using parfor and labSend/labRecieve

I want to run two MATLAB scripts in parallel for a project and communicate between them. The purpose is to have one script do image analysis and send the results to the other, which will use them for further calculations (time-consuming, but not related to the task of finding stuff in the images). Since both tasks are time-consuming and should preferably be done in real time, I believe parallelization is necessary.
To get a feel for how this should be done, I created a test script to find out how to communicate between the two scripts.
The first script takes a user input using the built-in function input, then sends it to the other using labSend; the second receives it and prints it.
function [blarg] = inputStuff(blarg)
mpiInit(); % added because of the error message, but does not work...
for i=1:2
    labBarrier; % added because of the error message
    inp = input('Enter a number to write');
    labSend(inp);
    if (inp == 0)
        break;
    else
        i = 1;
    end
end
end

function [ blarg ] = testWrite( blarg )
mpiInit(); % added because of the error message, but does not help
par = 0;
if ( blarg == 0)
    par = 1;
end
for i = 1:10
    if (par == 1)
        labBarrier
        delta = labReceive();
        i = 1;
    else
        delta = input('Enter number to write');
    end
    if (delta == 0)
        break;
    end
    s = strcat('This lab no', num2str(labindex), '. Delta is = ')
    delta
end
end

%% This is the file test_parfor.m
funlist = {@inputStuff, @testWrite};
matlabpool(2);
mpiInit(); % added because of the error message, but does not help
parfor i=1:2
    funlist{i}(0);
end
matlabpool close;
Then, when the code is run, the following error message appears:
Starting matlabpool using the 'local' profile ... connected to 2 labs.
Error using parallel_function (line 589)
The MPI implementation has not yet been loaded. Please
call mpiInit.
Error stack:
testWrite.m at 11
Error in test_parfor (line 8)
parfor i=1:2
Calling the method mpiInit does not help... (Called as shown in the code above.)
And none of the examples MathWorks has in the documentation, or on their website, show this error or what to do about it.
Any help is appreciated!
You would typically use constructs such as labSend, labReceive and labBarrier within an spmd block, rather than a parfor block.
parfor is intended for implementing embarrassingly parallel algorithms, in other words algorithms that consist of multiple independent tasks that can be run in parallel and do not require communication between tasks.
I'm stretching my knowledge here (perhaps someone more expert can correct me), but as I understand things, parfor does not set up an MPI ring for communication between workers, which is probably the explanation for the (rather uninformative) error message you're getting.
An spmd block enables communication between workers using labSend, labReceive and labBarrier. There are quite a few examples of using them all in the documentation.
Sam is right that the MPI functionality is not enabled during parfor, only during spmd. You need to do something more like this:
spmd
    funlist{labindex}(0);
end
(Sam is also quite right that the error message you saw is pretty unhelpful)
