Using Ray Core actors on multiple CPU cores at the same time (OpenMP)

I am trying to use Ray Core for communication between nodes in a multi-node cluster. The problem is that a Ray actor only works on one core at a time, while the C++ APIs I'm calling from Python use OpenMP for loop parallelization. As a result, the OpenMP loop optimization has no effect. I need to use Ray Core in a way that lets me benefit from that loop optimization.
Another approach I am considering is to use Ray actors only for communication between nodes and to start a separate background process that runs my program, where I can use OpenMP parallelization. So far, I have not been able to find any lead on that.
Could anyone suggest a solution or approach for this problem?

Actually, the environment variable 'OMP_NUM_THREADS' seen by Python is set to 1 by default, and even after increasing num_cpus to K > 1, 'OMP_NUM_THREADS' remains 1. If we explicitly change it with something like export OMP_NUM_THREADS="10", it works. That is, OpenMP runs in parallel behind the Ray actor.
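The separate-process idea from the question can be combined with this observation. As a minimal sketch, the snippet below launches a child process with OMP_NUM_THREADS set explicitly in its environment. The helper name run_with_omp_threads and the child command are hypothetical: the child here is only a Python one-liner that reports the value it sees, standing in for the real OpenMP-backed program. Calling such a helper from inside a Ray actor method is an assumption and is not tested here.

```python
import os
import subprocess
import sys


def run_with_omp_threads(command, num_threads):
    """Run `command` in a child process with OMP_NUM_THREADS set explicitly."""
    env = dict(os.environ)                     # copy, so the parent is untouched
    env["OMP_NUM_THREADS"] = str(num_threads)  # read by the OpenMP runtime at startup
    result = subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Hypothetical stand-in for the real OpenMP program: it reports what it sees.
echo_child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['OMP_NUM_THREADS'])",
]

if __name__ == "__main__":
    print(run_with_omp_threads(echo_child, 10))
```

The variable is set on the child's environment only, so the parent process (and anything else it runs) keeps its own setting.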


What makes GPU programs non-preemptible?

Modern operating systems have no support for GPUs, treating them more or less as a normal I/O device. Some research in this area attempts to manage GPUs at the operating system level, but the authors claim that GPU programs are non-preemptible: once a work unit has been started, it's impossible to interrupt it without destroying the channel's data.
So what I am asking is:
Is it true that GPU programs are non-preemptible?
If so, what makes them non-preemptible? Is it because of the hardware design, or is there another reason?
What would be needed to make them preemptible?
I would highly appreciate a clear explanation.
GPUs preempt themselves all the time, but only with other work items from the same kernel. If a compute unit is waiting on a memory read or write, it will execute other work items. It's essentially single instruction, multiple threads. However, it doesn't make sense to stop a job partway through and switch to a different job. You'd need to keep track of an enormous amount of state (unlike a serial processor that just has a register set, you'd have all that multiplied by the number of compute units). GPU jobs are all designed to run quickly, so cycling jobs through the system is more efficient than switching between partially complete jobs. That said, some modern GPUs divide up the hardware and can have different parts working on different jobs at the same time.
At the risk of oversimplification:
Let's say I have a solid object defined within the GPU. For simplicity, let's say that the object is a cube and that the GPU maintains 8 vertices (and that the GPU is VERY slow).
Let me start a rotation. I have to do a matrix multiplication on each vertex. I do 3 of them. Then I get preempted.
My cube is no longer a cube.
If you wanted it preemptible, you'd need some kind of transaction processing with rollback (slowing things down) and hardware support giving a preemptible interface.
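The cube example can be played through on the CPU. This is only an illustrative sketch in plain Python, not GPU code: the function names and the stop_after parameter (which simulates the preemption) are made up for the example. It rotates the 8 vertices one at a time, shows that stopping after 3 leaves something that is no longer a cube, and shows the "transaction with rollback" alternative that works on a scratch copy and commits only when the whole job has finished.

```python
import math
from itertools import product

# The 8 corners of a cube with edge length 2. Bit k of the index selects the
# sign of one coordinate (x = bit 2, y = bit 1, z = bit 0).
CUBE = [tuple(float(c) for c in corner) for corner in product((-1, 1), repeat=3)]


def rotate_z(vertex, angle):
    """Rotate one vertex about the z axis by `angle` radians."""
    x, y, z = vertex
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c, z)


def is_cube(vertices, edge=2.0):
    """True if all 12 edges (index pairs differing in one bit) have length `edge`."""
    for i in range(8):
        for bit in (1, 2, 4):
            j = i ^ bit
            if i < j and not math.isclose(math.dist(vertices[i], vertices[j]), edge):
                return False
    return True


def rotate_in_place(vertices, angle, stop_after=None):
    """Update vertices one by one; `stop_after` simulates preemption mid-job."""
    for i in range(len(vertices)):
        if stop_after is not None and i >= stop_after:
            return False  # preempted: the buffer is left half-updated
        vertices[i] = rotate_z(vertices[i], angle)
    return True


def rotate_transactional(vertices, angle, stop_after=None):
    """Work on a scratch copy and commit only if the whole job finished."""
    scratch = list(vertices)
    if rotate_in_place(scratch, angle, stop_after):
        vertices[:] = scratch  # commit
        return True
    return False  # rollback: the original is left untouched
```

The transactional version pays for its safety with an extra copy of the data, which is the slowdown mentioned above.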

MATLAB Parallel Computing Toolbox - Parallelization vs GPU?

I'm working with someone who has some MATLAB code that they want to speed up. They are currently trying to convert all of this code into CUDA to get it to run on a GPU. I think it would be faster to use MATLAB's Parallel Computing Toolbox to speed this up, and run it on a cluster that has MATLAB's Distributed Computing Toolbox, allowing me to run this across several different worker nodes. Now, as part of the Parallel Computing Toolbox, you can use things like gpuArray. However, I'm confused as to how this would work. Are things like parfor (parallelization) and gpuArray (GPU programming) compatible with each other? Can I use both? Can something be split across different worker nodes (parallelization) while also making use of whatever GPUs are available on each worker?
They think it's still worth exploring the time it takes to convert all of the MATLAB code to CUDA code to run on a machine with multiple GPUs, but I think the right approach would be to use the features already built into MATLAB.
Any help, advice, direction would be really appreciated!
When you use parfor, you are effectively dividing your for loop into tasks, with one task per loop iteration, and splitting up those tasks to be computed in parallel by several workers where each worker can be thought of as a MATLAB session without an interactive GUI. You configure your cluster to run a specified number of workers on each node of the cluster (generally, you would choose to run a number of workers equal to the number of available processor cores on that node).
On the other hand, gpuArray indicates to MATLAB that you want to make a matrix available for processing by the GPU. Under the hood, MATLAB marshals the data from main memory to the graphics board's internal memory. Certain MATLAB functions (there's a list of them in the documentation) can operate on gpuArrays, and the computation then happens on the GPU.
The key differences between the two techniques are that parfor computations happen on the CPUs of nodes of the cluster with direct access to main memory. CPU cores typically have a high clock rate, but there are typically fewer of them in a CPU cluster than there are GPU cores. Individually, GPU cores are slower than a typical CPU core and their use requires that data be transferred from main memory to video memory and back again, but there are many more of them in a cluster. As far as I know, hybrid approaches are supposed to be possible, in which you have a cluster of PCs and each PC has one or more Nvidia Tesla boards and you use both parfor loops and gpuarrays. However, I haven't had occasion to try this yet.
If you are mainly interested in simulations, GPU processing is the perfect choice. However, if you want to analyse (big) data, go with parallelization. The reason is that GPU processing is only faster than CPU processing if you don't have to copy data back and forth. In a simulation, you can generate most of the data on the GPU and only need to copy the result back. If you try to work with bigger data on the GPU, you will very often run into out-of-memory problems.
Parallelization is great if you have big data structures and more than 2 cores in your computer's CPU.
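The point about copying can be written as a back-of-envelope check. This is a rough sketch only: the function name and the simple model (transfer time is bytes divided by bandwidth, with no overlap between transfer and compute) are assumptions, and any numbers you plug in have to come from your own measurements.

```python
def gpu_pays_off(cpu_seconds, gpu_seconds, bytes_moved, bytes_per_second):
    """True if GPU compute time plus host<->device transfer beats the CPU time."""
    transfer_seconds = bytes_moved / bytes_per_second
    return gpu_seconds + transfer_seconds < cpu_seconds
```

With the compute times held fixed, the answer flips from True to False as bytes_moved grows, which is why a simulation that generates its data on the GPU fares better than an analysis that has to ship a large data set there first.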
If you write it in CUDA it is guaranteed to run in parallel at the chip-level versus going with MATLAB's best guess for a non-parallel architecture and your best effort to get it to run in parallel.
Kind of like drinking fresh mountain water run-off versus buying filtered water. Go with the purist solution.

What is easier to learn and debug: OpenMP or MPI?

I have a number-crunching C/C++ application. It is basically a main loop over different data sets. We got access to a 100-node cluster with OpenMP and MPI available. I would like to speed up the application, but I am an absolute newbie with both MPI and OpenMP. I just wonder which one is easier to learn and to debug, even if the performance is not the best.
I also wonder which is better suited to my main-loop application.
If your program is just one big loop, using OpenMP can be as simple as writing this directly above the loop:
#pragma omp parallel for
OpenMP is only useful for shared-memory programming, which, unless your cluster is running something like Kerrighed, means that the parallel version using OpenMP will run on at most one node at a time.
MPI is based around message passing and is slightly more complicated to get started with. The advantage, though, is that your program could run on several nodes at one time, passing messages between them as and when needed.
Given that you said "for different data sets", it sounds like your problem might actually fall into the "embarrassingly parallel" category: provided you've got more than 100 data sets, you could just set up the scheduler to run one data set per node until they are all completed, with no need to modify your code and almost a 100x speedup over using a single node.
For example, if your cluster is using Condor as the scheduler, then you could submit one job per data item to the "vanilla" universe, varying only the "Arguments =" line of the job description. (There are other ways to do this for Condor which may be more sensible, and there are also similar things for Torque, SGE, etc.)
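As a rough sketch of the one-job-per-data-set idea, the helper below builds a Condor submit description in which only the arguments line changes between jobs. The function name, the executable name and the data set file names are hypothetical. The universe, executable, arguments and queue commands are standard submit-file commands, but check your site's Condor setup before relying on this layout.

```python
def condor_submit_description(executable, datasets):
    """Build a submit description that queues one vanilla-universe job per data set."""
    lines = ["universe = vanilla", f"executable = {executable}"]
    for name in datasets:
        lines.append(f"arguments = {name}")  # the only line that varies per job
        lines.append("queue")                # queue one job with the current settings
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical program and data set names, for illustration only.
    print(condor_submit_description("crunch", ["set_001.dat", "set_002.dat"]))
```

The resulting text would be written to a file and handed to condor_submit; the application itself stays serial and unmodified.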
OpenMP is essentially for SMP machines, so if you want to scale to hundreds of nodes you will need MPI anyhow. You can, however, use both: MPI to distribute work across nodes and OpenMP to handle parallelism across cores or multiple CPUs per node. I would say OpenMP is a lot easier than messing with pthreads. But because it is coarser grained, the speedup you get from OpenMP will usually be lower than that of a hand-optimized pthreads implementation.

Drawing triangles with CUDA

I'm writing my own graphics library (yep, it's homework :) and use CUDA to do all rendering and calculations fast.
I have a problem with drawing filled triangles. I wrote it in such a way that one process draws one triangle. It works pretty well when there are a lot of small triangles in the scene, but performance breaks down completely when the triangles are big.
My idea is to do two passes. In the first pass, compute only a table with information about scanlines (draw from here to there). This would be a triangle-per-process calculation, as in the current algorithm. In the second pass, actually draw the scanlines with more than one process per triangle.
But will it be fast enough? Maybe there is some better solution?
You can check this blog: A Software Rendering Pipeline in CUDA. I don't think that's the optimal way to do it, but at least the author shares some useful sources.
Second, read this paper: A Programmable, Parallel Rendering Architecture. I think it's one of the most recent papers, and it's also CUDA based.
If I had to do this, I would go with a Data-Parallel Rasterization Pipeline like in Larrabee (which is TBR) or even REYES and adapt it to CUDA: (see the second part of the presentation)
I suspect that you have some misconceptions about CUDA and how to use it, especially since you refer to a "process" when, in CUDA terminology, there is no such thing.
For most CUDA applications, there are two important things for getting good performance: optimizing memory access, and making sure each 'active' CUDA thread in a warp performs the same operation at the same time as the other active threads in the warp. Both of these sound like they are important for your application.
To optimize your memory access, you want to make sure that your reads from global memory and your writes to global memory are coalesced. You can read more about this in the CUDA programming guide, but it essentially means, adjacent threads in a half warp must read from or write to adjacent memory locations. Also, each thread should read or write 4, 8 or 16 bytes at a time.
If your memory access pattern is random, then you might need to consider using texture memory. When you need to refer to memory that has been read by other threads in a block, then you should make use of shared memory.
In your case, I'm not sure what your input data is, but you should at least make sure that your writes are coalesced. You will probably have to invest some non-trivial amount of effort to get your reads to work efficiently.
For the second part, I would recommend that each CUDA thread process one pixel in your output image. With this strategy, you should watch out for loops in your kernels that will execute longer or shorter depending on the per-thread data. Each thread in your warps should perform the same number of steps in the same order. The only exception to this is that there is no real performance penalty for having some threads in a warp perform no operation while the remaining threads perform the same operation together.
Thus, I would recommend having each thread check if its pixel is inside a given triangle. If not, it should do nothing. If it is, it should compute the output color for that pixel.
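The per-pixel test can be sketched on the CPU as follows. This is plain Python rather than CUDA, and the function names are made up for illustration. Each call to pixel_in_triangle stands for the work one CUDA thread would do for its own pixel, and the edge-function test used here is one common way to decide whether a point lies inside a triangle, not necessarily the one you would end up with.

```python
def edge(a, b, p):
    """Signed area term: its sign says on which side of the edge a->b point p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def pixel_in_triangle(tri, p):
    """Same sequence of operations for every pixel, so there is no data-dependent loop."""
    a, b, c = tri
    w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
    # Accept either winding order: all non-negative or all non-positive.
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0)


def rasterize(tri, width, height):
    """Test every pixel centre; on the GPU each (x, y) would be its own thread."""
    return [
        [1 if pixel_in_triangle(tri, (x + 0.5, y + 0.5)) else 0 for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    for row in rasterize(((0, 0), (4, 0), (0, 4)), 4, 4):
        print(row)
```

Pixels outside the triangle simply write nothing, which matches the case where some threads in a warp sit idle while the rest do the same work together.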
Also, I'd strongly recommend reading more about CUDA as it seems like you are jumping into the deep end without having a good understanding of some of the basic fundamentals.
Not to be rude, but isn't this what graphics cards are designed to do anyway? Seems like using the standard OpenGL and Direct3D APIs would make more sense.
Why not use the APIs to do your basic rendering, rather than CUDA, which is much lower-level? Then, if you wish to do additional operations that are not supported, you can use CUDA to apply them on top. Or maybe implement them as shaders.

Converting a parallel program to a cluster program. From OpenMP to?

I want to write a code converter that takes an OpenMP based parallel program and runs it on a cluster.
How do I go about this problem? What libraries do I use? How do I set up a small cluster for this?
I'm finding it extremely hard to find good material about cluster computing on the internet.
EDIT: If it's impossible, then how does Intel do it? The Intel compiler seems to do exactly what I want. I don't have any specific application that I would like to run. I want to write the "converter/compiler", not the application. I understand that shared memory is different from distributed memory, but there has to be a way to sync memory, if not for all cases then for some specific cases, even if it means that the application is written with custom constructs.
Intel has an implementation of OpenMP that works with their C++ and Fortran compilers for x86 64-bit clusters. You can get a 30-day eval version of these compilers for free. Other than that, Zifre is mostly right. If you are concerned with scalability, bite the bullet and write your parallel program in another programming model (MPI, CUDA, Cilk, ...) which is designed with distributed systems in mind. If you provide a little more information about your application, we may be able to provide more useful guidance on that front.
It seems to me that this is not a good idea.
The basic idea behind OpenMP is parallel execution on shared data. It works well when accessing shared data costs you nothing: every thread can access a variable in shared cache or RAM.
Cluster computations rely on message passing, because the computers in a cluster have distributed memory. When one process needs data from another, you have to manage passing that data over the network, which is a time-consuming operation.
So, if you want to write such a compiler, you would have to implement data broadcasting operations (e.g. MPI_Bcast from MPI) for each data access in OpenMP. This will completely kill parallel performance.
This is simply not possible. You have to structure your code in a completely different way to get it to work on a cluster (programming multiple machines is very different from programming one machine).
There is no magic pixie dust to do this.
On the other hand, if you write your program with clusters in mind, it is possible to run it on a single machine (although it will obviously be slower).
SCORE/SCASH and Omni OpenMP compiler
