Combining graphic-card and parallel-computation libraries in C++ - visual-studio

I'm working on diploma project that heavily uses mathematical calculations and should present some results in 3D. For these purposes I decided to use CUDA or OpenCL for parallel computation of mathematical part and, most possibly, OpenGL for presenting result. In addition, project should be able to be deployed on clusters (operated by MS Windows), for these purposes project supervisor recommended MPI.
My question is the following: where it is easier to combine all these components, in MS Visual tudio
Main part is CUDA + OpenCL + OpenGL, it will be the core of the project.
P.S. This question is not to star holy-war betwen Qt and MS Visual studio.

OpenCL is not limited to GPUs, it can be used for parallel programming in clusters as well. Intel for example provides a OpenCL implementation, that is aimed at multicore CPU and clusters.
So my recommendation is to use OpenCL for both GPU computing and clustering. MPI (Message Passing Interface) is mainly a way to communicate between tasks running on separate cluster nodes. It's not so much of a clustering framework by itself.

Related

how to run a openmp program on clusters with multiple nodes? [duplicate]

I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I only heard of programs that can be executed on multiple thread but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor software that allows one to create a virtual NUMA machine on top of many networked hosts, then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on data hierarchy and access patterns.
NICTA used to develop similar SSI hypervisor named vNUMA, but development also stopped. Besides their solution was IA64-specific (IA64 is Intel Itanium, not to be mistaken with Intel64, which is their current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be mistaken with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP and it was built into the Intel compiler suite, e.g. you couldn't use it with GCC (this request to start ClOMP development for GCC went in the limbo). If you have access to old versions of Intel compilers (versions from 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. Then again, starting with version 12.0, Intel compilers no longer support ClOMP.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems to be mature enough for production HPC environments (and it's priced accordingly). Seems like most efforts now go into development of co-array languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC or invest some time in learning MPI as it is definitely not going away in the years to come.
Before, there was the Cluster OpenMP.
Cluster OpenMP, was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. This advance had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. source
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible, relative to the number of accesses to protected pages. This means that once a page is brought up-to-date on a given node, a large number of accesses should be made to it before the next synchronization. In order to accomplish this, a program should have as little synchronization as possible, and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality source.
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPU's. Extending OpenMP to handle GPU programming has given rise to OpenMP directives to describe data transfer. Programming GPU's has also forced programmers to think more carefully about building programs that consider data movement.

OpenCL programming in Charm++

Is it possible to run OpenCL through Charm++, while retaining the same fault tolerance and load balancing capabilities as for CPU or CUDA?
I did not explicitly see anything mentioned in the tutorials or the book.
Background: I'm one of the core developers of Charm++.
It's not clear whether you mean compiling OpenCL code to a Charm++-based parallel program, or calling kernels written in OpenCL from Charm++ code. Regardless, there is nothing explicitly implemented to support either of those cases at present.
Compiling OpenCL to Charm++ would be a large project. I don't know of anyone proposing to do such a thing, but it's not fundamentally implausible.
The research group behind Charm++, the Parallel Programming Laboratory has looked at the possibility of implementing OpenCL support to match our offload support for CUDA-based accelerators. This would not be particularly hard. However, at present, we don't have any demand from grant-funded projects that support our work to do so. We would welcome contributions of code to do this. There's also the possibility that commercial development may lead to this getting implemented.

Possible to use OpenCL on multi-computers?

As far as I know, the answer is no. OpenCL is designed for multi-cores system.
But, is there any way to use OpenCL on multi-computers ( each computer is a multi-cores system ) ? If not, are any additional tools, frameworks... required?
I read some articles about Distributed computing, Cluster computing, Grid computing... but I can't find a satisfied answer
Any ideas will be appreciated
Thank you :)
There are two frameworks for this purpose: VirtualCL and CLara. Both packages let you work transparently with remote machines as local devices. Unfortunately, VirtualCL is only available as pre-compiled binaries without sources and CLara is not actively developed anymore.
SnuCL uses MPI and OpenCL to transparently use the cluster through the OpenCL API. It also adds a few OpenCL extensions to effectively deal with the memory objects.
It is open source. See http://aces.snu.ac.kr/Center_for_Manycore_Programming/SnuCL.html
and http://tbex.twbbs.org/~tbex/pad/SunCL.pdf
There is one more solution not mentioned above: dOpenCL.
"dOpenCL (distributed OpenCL) is a novel, uniform approach to programming distributed heterogeneous systems with accelerators. It transparently integrates the nodes of a distributed system into a single OpenCL platform. Thus, dOpenCL allows the user to run unmodified existing OpenCL applications in a heterogeneous distributed environment. Besides, it extends the OpenCL programming model to deal with individual nodes of the distributed system."
I have used VirtualCL to form a GPU cluster with 3 AMD GPU as compute node and my ubuntu intel desktop running as broker node. I was able to start both the broker and compute nodes.
In addition to the various options already mentioned by other posters, here are two more open source projects that you may be interested in:
ocland (in beta stage): offers a server application and an ICD implementation that the clients can use to take advantage of local and remote devices that support OpenCL in a transparent fashion. The license is GPLv3.
COPRTHR SDK by Brown Deer Technnology (currently version 1.6): this SDK which offers an open source (GPLv3) OpenCL implementation for x86_64, ARM, Epiphany and Intel MIC includes a "Compute Layer Remote Procedure Call" implementation. This consists of a client-side OpenCL implementation that supports rpc (libclrpc) and a server application (clrpcd). The website doesn't mention much about it but the documentation contains a section about this CLRPC implementation.

Cross-platform development solutions for (mobile) performance computing

I am looking for a programming language/platform which is suited for handling around 1 to 2 GB of data, and can run signal processing algorithms fast enough across platforms (including modern mobile platforms). Did someone come across suitable solutions?. Please share your research and experience on cross-platform development.
My current research effort is complicated, owing to
the increased deployment of powerful [mobile] platforms,
the backing of Java through the popularity of Android (currently the #1 tag on stackoverflow)
I had mixed experience with C# and Mono (It is fast, but cross-platform compatibility can take some additional effort).
I am not looking for High performance computing(HPC), but into developing code with a data and computing-intensive background that performs well with as little effort as possible. A case study scenario for this would be Python code which is adapted to the Cython compiler.
I am not looking for UI-centric projects with solutions such as Phonegap and Adobe Air, but I know little about its performance.
Links:
Cross-Platform Development of High Performance Applications
Using Generic Programming
Native Client: A Sandbox for Portable, Untrusted x86 Native Code
You can perhaps start having a look at Droidcluster

OpenCL maturity under Windows

I consider using OpenCL in a consumer product which is currently under development.
Doing a small research I found that generally there is good support under Mac OSX. Linux support is also relatively good, but my target audience does not use Linux. It remains to check how well it is supported in Windows.
Regarding Windows I found OpenCL distribution which raises some concerns.
Do any of you have any experience with using OpenCL in consumer-oriented products under Windows? I am more interested in the GPU side of OpenCL, specifically driver support.
Just like CUDA or Stream, OpenCL needs to be supported by the driver. Most CUDA-capable GPUs support OpenCL with a somewhat up-to-date driver (CUDA 1.0 upwards).
In fact, if you compile with, say, CUDA SDK 4.1 your end users will need newer drivers than if you had used OpenCL.
Also, OpenCL is not bound to any GPU architecture. While this might be problematic for specifically designed algorithms, it shouldn't have a very high impact on normal end user programs.
At least with CUDA, you can only compile code optimized for the current known major version. Compiling OpenCL kernels on the end user machine might allow optimizations for newer binary specifications in the future.
The crashes the author in that questions reported for Nvidia OpenCL generally seem to happen a lot if resources are not freed properly. I've been seeing similar crashes until I fixed a leak that didn't release created kernels.
I'm not saying it's the only reason why it might crash, but apart from programmer errors it appears fairly stable to me.
AMD and NVidia both support OpenCL on most (all?) of their GPUs
Unfortunately Intel only supports it on the CPU which is a bit pointless and if you have to insist that the user has a separate GPU for your app you can also insist that they have an NVidia one and use CUDA. This has limited the uptake of OpenCL.

Resources