As far as I know, the answer is no: OpenCL is designed for multi-core systems.
But is there any way to use OpenCL across multiple computers (each of which is a multi-core system)? If not, are any additional tools or frameworks required?
I have read some articles about distributed computing, cluster computing, grid computing and so on, but I can't find a satisfying answer.
Any ideas will be appreciated.
Thank you :)
There are two frameworks for this purpose: VirtualCL and CLara. Both packages let you work with remote machines transparently, as if they were local devices. Unfortunately, VirtualCL is only available as pre-compiled binaries without source code, and CLara is no longer actively developed.
SnuCL uses MPI and OpenCL to make a cluster transparently usable through the OpenCL API. It also adds a few OpenCL extensions to deal effectively with memory objects.
It is open source. See http://aces.snu.ac.kr/Center_for_Manycore_Programming/SnuCL.html
and http://tbex.twbbs.org/~tbex/pad/SunCL.pdf
There is one more solution not mentioned above: dOpenCL.
"dOpenCL (distributed OpenCL) is a novel, uniform approach to programming distributed heterogeneous systems with accelerators. It transparently integrates the nodes of a distributed system into a single OpenCL platform. Thus, dOpenCL allows the user to run unmodified existing OpenCL applications in a heterogeneous distributed environment. Besides, it extends the OpenCL programming model to deal with individual nodes of the distributed system."
I have used VirtualCL to form a GPU cluster with 3 AMD GPUs as compute nodes and my Ubuntu Intel desktop running as the broker node. I was able to start both the broker and the compute nodes.
In addition to the various options already mentioned by other posters, here are two more open source projects that you may be interested in:
ocland (in beta stage): offers a server application and an ICD implementation that the clients can use to take advantage of local and remote devices that support OpenCL in a transparent fashion. The license is GPLv3.
COPRTHR SDK by Brown Deer Technology (currently version 1.6): this SDK, which offers an open source (GPLv3) OpenCL implementation for x86_64, ARM, Epiphany and Intel MIC, includes a "Compute Layer Remote Procedure Call" implementation. This consists of a client-side OpenCL implementation that supports RPC (libclrpc) and a server application (clrpcd). The website doesn't mention much about it, but the documentation contains a section on this CLRPC implementation.
According to Wikipedia, the Windows kernel is a hybrid design, meaning it has both monolithic and microkernel characteristics.
But the two definitions are opposites: monolithic means that system services and core functionality share a single space, while microkernel means they do not.
So I take that to mean that Windows has shared space for some system services and core functionality, while the rest is decoupled.
I'm trying my best to understand this but it's very cryptic for me, although I'm a professional software engineer.
Do you perhaps have a relatable example of where it is monolithic and where it is microkernel-like?
And to what extent is it similar to, say, the Ubuntu kernel, which is said to be fully monolithic, and to what extent is it totally different?
Generally speaking, a microkernel has very few services provided by the kernel itself and executing in kernel mode, while a monolithic kernel has the vast majority of services (especially drivers) running in kernel mode.
Many monolithic OSes are taking the approach of running some of their services and drivers at user level, and this is what is meant by hybrid. They might keep the network drivers completely in the kernel but run GPU drivers at user level, for example.
I would like to distribute a Windows/Linux application that uses OpenCL, but I can't find the best way to do it.
For the moment my problems are only on Windows:
1- I'm using an Intel CPU; how can I manage both Intel AND AMD (the CPUs of end users)?
2- For distributing applications that use Visual Studio DLLs, we have the Visual Studio Redistributable to manage this easily and avoid a full Visual Studio installation. Is there a package like this for OpenCL?
3- Finally, I don't know whether I must provide OpenCL.dll or not (example of differing points of view here)
I read several topics on the web about this problem without finding a clear solution.
Thank you for your help.
1) You write to the OpenCL API and it works with whatever hardware your user has. Use the header for the lowest version you want to support (e.g., use cl.h from 1.1 if you want to target 1.1 and higher).
2) The OpenCL runtime is installed on the user's machine when they install a graphics driver. You don't need to (and should not) redistribute anything.
3) Please don't redistribute OpenCL.dll
The one problem you may need to deal with is if your user does not have any OpenCL installed on their machine. In this case, the call to clGetPlatformIDs will fail. There are various ways to deal with this, all platform specific. Dynamically linking to OpenCL.dll is one way, or running a helper process to test for OpenCL is another. An elegant solution on Windows is to delay load OpenCL.dll and hook that API to return 0 if the late binding fails.
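For illustration, here is a minimal sketch of that probing approach (Windows-specific and untested; the helper name opencl_available is made up for this example):

```
/* Probe for OpenCL at runtime instead of linking against OpenCL.dll, so the
 * program still starts on machines with no OpenCL runtime installed. */
#include <windows.h>

typedef int (__stdcall *clGetPlatformIDs_fn)(unsigned int, void *, unsigned int *);

int opencl_available(void)
{
    HMODULE lib = LoadLibraryA("OpenCL.dll");
    if (!lib)
        return 0;                                  /* no ICD loader at all */

    clGetPlatformIDs_fn get_platforms =
        (clGetPlatformIDs_fn)GetProcAddress(lib, "clGetPlatformIDs");
    unsigned int num_platforms = 0;
    int ok = get_platforms != NULL
          && get_platforms(0, NULL, &num_platforms) == 0  /* CL_SUCCESS */
          && num_platforms > 0;

    FreeLibrary(lib);
    return ok;
}
```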
1- I'm using an Intel CPU; how can I manage both Intel AND AMD (the CPUs of end users)?
Are you talking about running OpenCL kernels on the CPU, or just host-side code while the kernels run on a GPU? Because if the former (on CPU), your users will need to install their respective OpenCL CPU implementation; IIRC the Intel CPU implementation does not run on AMD CPUs (or at least that used to be the case; perhaps it's different now).
3- Finally, I don't know if I must provide OpenCL.dll
You don't have to, but you should, IMO. The way OpenCL (usually) works, OpenCL.dll is just an ICD loader: a small library (a few dozen KB) that loads the actual OpenCL implementation(s) by looking in a few predefined places. It should be safe to include on Windows, and it simplifies your program logic: you can always build with OpenCL enabled, and if there's no OpenCL implementation installed, the loader will return CL_PLATFORM_NOT_FOUND_KHR. You just handle that error by asking the user to install an OpenCL implementation, or by falling back to a non-OpenCL code path if you have one, whatever suits you better.
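As a minimal sketch of that pattern (assuming you link against the ICD loader; the -1001 value of CL_PLATFORM_NOT_FOUND_KHR comes from the cl_khr_icd extension and is normally defined in CL/cl_ext.h):

```
#include <stdio.h>
#include <CL/cl.h>

#ifndef CL_PLATFORM_NOT_FOUND_KHR
#define CL_PLATFORM_NOT_FOUND_KHR -1001  /* from the cl_khr_icd extension */
#endif

int main(void)
{
    cl_uint num_platforms = 0;
    cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);

    if (err == CL_PLATFORM_NOT_FOUND_KHR || num_platforms == 0) {
        fprintf(stderr, "No OpenCL implementation found; "
                        "ask the user to install one or use a fallback.\n");
        return 1;
    }
    printf("%u OpenCL platform(s) available.\n", (unsigned)num_platforms);
    return 0;
}
```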
There's no need to complicate your life with delayed DLL loads or helper processes. In fact, that's the entire point of the ICD concept: you don't need to look for the platforms and DLLs yourself; you let the ICD loader do it. It's pretty absurd to write helper code to load a helper library (the ICD loader) which then loads the actual implementation DLLs...
I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor software that allows one to create a virtual NUMA machine on top of many networked hosts, then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but its development has also stopped. Besides, their solution was IA64-specific (IA64 being Intel Itanium, not to be confused with Intel 64, the current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be mistaken for the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP built into the Intel compiler suite; you couldn't use it with GCC, for example (this request to start ClOMP development for GCC went into limbo). If you have access to old versions of Intel compilers (versions 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. And starting with version 12.0, Intel compilers no longer support ClOMP at all.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems mature enough for production HPC environments (and it's priced accordingly). It seems most efforts now go into the development of PGAS languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC, or invest some time in learning MPI, as MPI is definitely not going away in the years to come.
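For reference, getting started with MPI is less daunting than it may sound; a minimal program (standard MPI, built with mpicc and launched with mpirun) is just a few lines:

```
/* Minimal MPI example: one process per host (or core), explicit ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```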
Earlier, there was Cluster OpenMP.
Cluster OpenMP was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. This had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. source
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible, relative to the number of accesses to protected pages. This means that once a page is brought up-to-date on a given node, a large number of accesses should be made to it before the next synchronization. In order to accomplish this, a program should have as little synchronization as possible, and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality source.
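As a toy illustration of that advice (plain OpenMP here, nothing Cluster OpenMP-specific): a reduction keeps synchronization coarse-grained, with one combine step at the end of the region instead of per-iteration atomics or locks:

```
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* Each thread works on a contiguous block of `a` and keeps a private
     * partial sum; the partial sums are combined once at the end, so pages
     * are accessed many times between synchronization points. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}
```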
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives that describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs that take data movement into account.
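For example, the map clauses on OpenMP's target constructs make the host/device transfers explicit (a minimal sketch; it requires a compiler built with offloading support, and falls back to the host otherwise):

```
#include <stdio.h>

#define N 1024

int main(void)
{
    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    /* Copy x and y to the device, run the loop there, copy y back. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; i++)
        y[i] += x[i];

    printf("y[1] = %f\n", y[1]);  /* expect 3.0 */
    return 0;
}
```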
I'm working on a diploma project that heavily uses mathematical calculations and should present some results in 3D. For these purposes I decided to use CUDA or OpenCL for the parallel computation of the mathematical part and, most likely, OpenGL for presenting the results. In addition, the project should be deployable on clusters (running MS Windows); for this purpose my project supervisor recommended MPI.
My question is the following: where is it easier to combine all these components, in MS Visual Studio or with Qt?
The main part is CUDA + OpenCL + OpenGL; it will be the core of the project.
P.S. This question is not meant to start a holy war between Qt and MS Visual Studio.
OpenCL is not limited to GPUs; it can be used for parallel programming on clusters as well. Intel, for example, provides an OpenCL implementation aimed at multicore CPUs and clusters.
So my recommendation is to use OpenCL for both GPU computing and clustering. MPI (Message Passing Interface) is mainly a way to communicate between tasks running on separate cluster nodes; it's not so much a clustering framework by itself.
I'm currently developing an OpenCL application for a very heterogeneous set of computers (using JavaCL, to be specific). In order to maximize performance I want to use a GPU if one is available; otherwise I want to fall back to the CPU and use SIMD instructions. My plan is to implement the OpenCL code using vector types, because my understanding is that this allows CPUs to vectorize the instructions and use SIMD instructions.
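For instance, I imagine kernels along these lines (a toy sketch; the kernel name is made up), where each float4 operation can map onto a 4-wide SIMD instruction on the CPU:

```
/* Toy OpenCL kernel using vector types. */
__kernel void scale4(__global float4 *data, const float factor)
{
    size_t i = get_global_id(0);
    data[i] *= factor;  /* one float4 multiply ~ one 4-wide SIMD instruction */
}
```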
My question, however, is regarding which OpenCL implementation to use. E.g. if the computer has an Nvidia GPU I assume it's best to use Nvidia's library, but if no GPU is available I want to use Intel's library for the SIMD instructions.
How do I achieve this? Is this handled automatically, or do I have to include all the libraries and implement some logic to pick the right one? It feels like this is a problem that more people than just me are facing.
Update
After testing the different OpenCL drivers, this is my experience so far:
Intel: crashed the JVM when JavaCL tried to call it. After a restart it didn't crash the JVM, but it also didn't return any usable devices (I was using an Intel i7 CPU). When I compiled the OpenCL code offline it seemed to be able to do some auto-vectorization, so Intel's compiler seems quite nice.
Nvidia: refused to install their WHQL drivers because it claimed I didn't have an Nvidia card (that computer has a GeForce GT 330M). When I tried it on a different computer I managed to get all the way to creating a kernel, but the first execution crashed the drivers (the screen flickered for a while and Windows 7 said it had to restart the drivers). The second execution caused a blue screen of death.
AMD/ATI: refused to install the 32-bit SDK (I tried that since I will be using a 32-bit JVM), but the 64-bit SDK worked well. This is the only driver I've managed to execute the code on (after a restart, because at first it gave a cryptic error message when compiling). However, it doesn't seem to be able to do any implicit vectorization, and since I don't have an ATI GPU I didn't get any performance increase compared to the Java implementation. If I use vector types I might see some improvements, though.
TL;DR: None of the drivers seem ready for commercial use. I'm probably better off creating a JNI module with C code compiled to use SSE instructions.
First try to understand hosts & devices: http://www.streamcomputing.eu/blog/2011-07-14/basic-concept-hosts-and-devices/
Basically you can just do exactly what you described: check if a certain driver is available and if not, try the next one. What you choose first depends completely on your own preference. I would pick the device I have tested my kernel best on. In JavaCL you can pick the fastest device with JavaCL.createBestContext and CLPlatform.getBestDevice, check the host-code here: http://ochafik.com/blog/?p=501
Note that NVidia does not support CPUs via their driver; only AMD and Intel do. Also, targeting multiple devices (say, 2 GPUs and a CPU) is a bit more difficult.
There is no API providing exactly what you want. However, you can do the following:
I suggest you iterate over clGetPlatformIDs and, for each platform, query the number of devices (clGetDeviceIDs) and the type of each device,
and pick the platform which has both types.
Then build a map in your code that maps each device type to the list of platforms supporting it, ordered in some manner.
Finally, just take the first item in the list corresponding to CL_DEVICE_TYPE_CPU and the first item corresponding to CL_DEVICE_TYPE_GPU.
If both results are equal (platform_cpu == platform_gpu), pick one of them and use it for both.
If there is a platform supporting both, you will get a match as before, since you have ordered lists. On a single platform you can then also do load balancing if you like, like what Intel has.
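A rough sketch of that logic in plain C (error handling mostly omitted; MAX_PLATFORMS is an arbitrary cap for the example):

```
#include <stdio.h>
#include <CL/cl.h>

#define MAX_PLATFORMS 16  /* arbitrary cap for this example */

int main(void)
{
    cl_platform_id platforms[MAX_PLATFORMS];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(MAX_PLATFORMS, platforms, &num_platforms);

    cl_platform_id cpu = NULL, gpu = NULL, both = NULL;

    for (cl_uint i = 0; i < num_platforms; i++) {
        cl_uint n_cpu = 0, n_gpu = 0;  /* stay 0 on CL_DEVICE_NOT_FOUND */
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_CPU, 0, NULL, &n_cpu);
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 0, NULL, &n_gpu);

        if (n_cpu && n_gpu && !both) both = platforms[i];
        if (n_cpu && !cpu) cpu = platforms[i];
        if (n_gpu && !gpu) gpu = platforms[i];
    }

    if (both)
        printf("One platform supports both CPU and GPU; use it for both.\n");
    else
        printf("Use the first CPU platform and the first GPU platform "
               "separately.\n");
    return 0;
}
```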
Sorry for being late to the party, but regarding Intel's implementation behaviour under JavaCL, I'm afraid you've been bitten by a JavaCL bug:
https://github.com/ochafik/nativelibs4java/issues/297
Fixed in JavaCL 1.0.0-RC2 !
Cheers