I'm currently developing an OpenCL-application for a very heterogeneous set of computers (using JavaCL to be specific). In order to maximize performance I want to use a GPU if it's available otherwise I want to fall back to the CPU and use SIMD-instructions. My plan is to implement the OpenCL-code using vector-types because my understanding is that this allows CPUs to vectorize the instructions and use SIMD-instructions.
My question however is regarding which OpenCL-implementation to use. E.g. if the computer has a Nvidia GPU I assume it's best to use Nvidia's library but if no GPU is available I want to use Intel's library to use the SIMD-instructions.
How do I achieve this? Is this handled automatically or do I have to include all libraries and implement some logic to pick the right one? It feels like this is a problem that more people than I are facing.
Update
After testing the different OpenCL-drivers this is my experience so far:
Intel: crashed the JVM when JavaCL tried to call it. After a restart it didn't crash the JVM but it also didn't return any usable
devices (I was using an Intel I7-CPU). When I compiled the
OpenCL-code offline it seemed to be able to do some
auto-vectorization so Intel's compiler seems quite nice.
Nvidia: Refused to install their WHQL-drivers because it claimed I didn't have Nvidia-card (that computer has a Geforce GT 330M). When
I tried it on a different computer I managed to get all the way to
create a kernel but at the first execution it crashed the drivers
(the screen flickered for a while and Windows 7 said it had to
restart the drivers). The second execution caused a bluee-screen of
death.
AMD/ATI: Refused to install 32-bit SDK (I tried that since I will be using a 32-bit JVM) but 64-bit SDK worked well. This is the only
driver which I've managed to execute the code on (after a restart
because at first it gave a cryptic error-message when compiling).
However it doesn't seem to be able to do any implicit vectorization
and since I don't have any ATI GPU I didn't get any performance
increase compared to the Java-implementation. If I use vector-types I
might see some improvements though.
TL;DR None of the drivers seem ready for commercial use. I'm probably better of creating JNI-module with C-code compiled to use SSE-instructions.
First try to understand hosts & devices: http://www.streamcomputing.eu/blog/2011-07-14/basic-concept-hosts-and-devices/
Basically you can just do exactly what you described: check if a certain driver is available and if not, try the next one. What you choose first depends completely on your own preference. I would pick the device I have tested my kernel best on. In JavaCL you can pick the fastest device with JavaCL.createBestContext and CLPlatform.getBestDevice, check the host-code here: http://ochafik.com/blog/?p=501
Know NVidia does not support CPUs via their driver; only AMD and Intel do. Also is targeting multiple devices (say 2 GPUs and a CPU) a bit more difficult.
There is no API providing what you want. however, you can do the following:
i suggest you iterate over clGetPlatformIDs and query for the number of devices (clGetDeviceIDs), and device type for each device;
and pick the platform which has both types.
then build a map in u'r code, that maps for each type the list of platforms supporting it, ordered in some manner.
finally, just get the first item in the list corresponding for CL_DEVICE_TYPE_CPU and the first item corresponding for CL_DEVICE_TYPE_GPU.
if both returned results are equal (platform_cpu == platform_gpu) then pick one of them and use it for both.
if there is a platform supporting both, you will get match as before since you got order lists. then u can also do load balancing if u like on a single platform, like what Intel has.
Sorry for being late to the party, but regarding Intel's implementation behaviour under JavaCL, I'm afraid you've been bitten by a JavaCL bug :
https://github.com/ochafik/nativelibs4java/issues/297
Fixed in JavaCL 1.0.0-RC2 !
Cheers
Related
We are trying to optimize HPC applications using OpenMP on a new hardware platform. These applications need precise placement/pinning of their cores or performance falls in half. Currently, we provide the user a custom GOMP_CPU_AFFINITY map for each platform, but this is cumbersome, because it's different on each hardware version, and even platforms with different firmware versions sometimes change their CoreID physical mappings - all things impossible for the user to detect on the fly.
It would be a great help if HPC applications could simply set GOMP_PROC_BIND to "close" and OpenMP would do the right thing for the given platform - but to make this possible, the hardware vendor would need to define what "close" means for each machine. We'd like to do this, but we can't tell how/where OpenMP gets CoreID lists to use for things like close, spread, etc. (For various external requirements, the CoreID spatial pattern on this machine would appear utterly random to a software writer.)
Any advice as to where/how OpenMP defines the CoreID lists for OMP_PROC_BIND so we could configure them? We are comfortable with the idea that we might need a custom version of OpenMP (with altered source code) for this platform if needed.
Thanks, everyone. :)
Jeff
Expanding on what #VictorEijkhout said...
You seem have invented an envirable that I can't find anywhere with Google (GOMP_PROC_BIND), with the OpenMP standard envirable (OMP_PROC_BIND). If GOMP_PROC_BIND exists the name suggests that it is a GNU feature. Note too that one of the two Google hits for GOMP_PROC_BIND says "Code that reads the setting is buggy. Setting is invalid and ignored at runtime." So, if you are setting that it is unsurprising that it has no effect!
I will therefore answer for the more general case of OMP_PROC_BIND.
The binding of OpenMP threads to logicalCPUs clearly has to be done at runtime, since, beyond its ISA, the compiler has no knowledge of the hardware on which the compiled code will run. Therefore you need to be looking at the runtime library code.
I have not looked at GNU's libgomp, but, where it can, LLVM's libomp uses the hwloc library to explore the machine hardware. Since hwloc also includes other useful tools for machine exploration (such as lstopo) it is likely that your effort is best invested in ensuring good hwloc support on your machine, at which point there will be no need to delve inside the OpenMP runtime.
You know the ARM-based M1 chips that are used in modern mac computers. On those macs, some number of software are ran through the layer called Rosetta (Discord, Steam), some natively, directly through M1 (Slack, IntelliJ) and some actually doesn't work in either way (Virtual Box). Huge list holding the status can be found here.
Apps that can be ran only with Rosetta are not yet M1 optimized, their developers have to optimize it, it takes some time to do so. But what does it mean to optimize it? What the process looks like? I'm quite sure that they don't rewrite the whole application code to another language (like Swift), because Jetbrains was able to M1 optimize their apps quite quickly. On the other hand, Discord is not yet optimized, same for Unity game engine (it's in beta though).
At bottom, it just means that the compiler's backend was configured to emit ARM64 instructions for the program instead of (or in-addition to) x86-64 instructions.
This means that certain x86-64 specific functionality instruction can no longer be used, unless equivalent ARM instructions are used instead.
This usually isn't much of a problem though, because most macOS software is typically written at a higher level of abstraction, using system-provided frameworks.
For example, using CoreImage to manipulate images abstracts you from the details of the CPU and GPU. In such cases, Apple does the heavy lifting of porting over their frameworks. All you have to do as an application developer is to check a box that says "target ARM64".
This might be obvious for some but not to me so I'll ask =)
I'm having an issue that I have build an embedded Linux stack for some piece of hardware (NVidia TX2 + ConnectTech Astro carrier). I use a PCIe card from EPIX
If I use Ubuntu's official distribution for tegra, the PCIe card is properly detected.
With identical kernel and device tree blob, and the same HW unit, the detection fails with embedded Linux.
I thought that detecting PCIe devices would be kernel's job and not be influence by the distro, unless the drivers are built as kernel modules and inserted at different times. But in my case they are build in kernel.
Could someone elaborate why the detection would work with one distro but not the order?
Here is a link to what I tried to do to fix the detection
tx2-pcie-does-not-detect-endpoint-on-connecttech-carrier-board
Thanks!
A Linux distribution contains a kernel that usually differs from the vanilla kernel of the same release. Most of the time a distribution kernel contains lots of back ports of bug fixes that were discovered and fixed later in micro releases. There may be other features that a specific vendor includes and the vanilla kernel does not, like more recent version of certain drivers, etc. What makes this even more confusing is that sets of these back ports are often different in distributions from different vendors. As a side effect, this makes it difficult to depend on something like KERNEL_VERSION() macro in custom kernel code or in custom device drivers.
I can't say about the specific issue that you're having. The topic is pretty generic, and I hope that this explanation helps.
I would like to distribute a Windows/Linux application that uses openCL, but I can't find the best way to do it.
For the moment my problem are only on Windows:
1- I'm using Intel CPU, how can I manage Intel AND AMD (CPU of final users) ?
2- For distribution of application that uses Visual Studio DLL, we have Visual Studio Redistributable to manage this easily and to avoid a big installation of Visual Studio. Is there a package like this for openCL ?
3- Finally, I don't know if I must provide OpenCL.dll or not (example of different point of view here)
I read several topics on the web about this problem without clear solution.
Thank you for your help.
1) You write to the OpenCL API and it works with whatever hardware your user has. User the header for the lower version you want to support (e.g., use cl.h from 1.1 if you want to target 1.1 and higher).
2) The OpenCL runtime is installed on the user's machine when they install a graphics driver. You don't need to (and should not) redistribute anything.
3) Please don't redistribute OpenCL.dll
The one problem you may need to deal with is if your user does not have any OpenCL installed on their machine. In this case, the call to clGetPlatformIDs will fail. There are various ways to deal with this, all platform specific. Dynamically linking to OpenCL.dll is one way, or running a helper process to test for OpenCL is another. An elegant solution on Windows is to delay load OpenCL.dll and hook that API to return 0 if the late binding fails.
1- I'm using Intel CPU, how can I manage Intel AND AMD (CPU of final users)
Are you talking about running OpenCL kernels on CPU, or just host-side code while kernels run on GPU ? because if the former (on CPU), your users will need to install their respective OpenCL CPU implementation, IIRC the Intel CPU implementation does not run on AMDs (or at least that used to be the case, perhaps it's now different..)
3- Finally, I don't know if I must provide OpenCL.dll
You don't have to, but you should, IMO. The way OpenCL works (usually), OpenCL.dll is just an ICD loader - a small library (a few dozen KB) that loads the actual OpenCL implementation(s) by looking into a few predefined places. It should be safe to include on Windows, and it simplifies your program logic - you can always build with OpenCL enabled, and if there's no OpenCL implementation installed, the loader will return CL_PLATFORM_NOT_FOUND_KHR - you just handle that error by asking user to install an OpenCL implementation, or fallback to non-OpenCL code path if you have it, whatever suits you more.
There's no need to complicate your life with delayed DLL loads or helper processes. In fact that's the entire point of the ICD concept - you don't need to look for the platforms and DLLs yourself, you let the ICD loader do it. It's pretty absurd to write helper code to load a helper library (ICD) which then loads the actual implementation DLLs...
I want to write something like DaemonTools: a software that presents itself to the system as a real device (a DVD-ROM in the previous example) but it reads the data from a file instead. My requirement is not limited to DVD-ROM. The first goal is a joystick/gamepad for Windows.
I'm a web developer, so I don't know from where I could start such a project. I believe it will have to be written in C/C++, but other than that, I have no clue where to start.
Did anyone tried something like this and can give me some starting tips ?
Most drivers are written in either C or C++, so if you don't know those languages reasonably well, you'll want to get familiar with them before you start. Windows programming uses a lot of interesting shortcuts that might be confusing to a beginner - for example PVOIDs (typedef void* PVOID) and LPVOIDs (typedef void* far LPVOD;). You'll need to be happy with pointers as concepts as well as structures because you'll be using a lot of them. I'd suggest writing a really straightforward win32 app as an exercise in getting to grips with the Windows style of doing C/C++.
Your next port of call then is to navigate the Windows Driver Kit - specifically, you'll need it to build drivers for Windows. At this stage my ability to advise really depends on what you're doing and the hardware you have available etc, or whether or not you're really using hardware. You'll need to know how to drive your hardware and from there you'll need to choose an appropriate way of writing a driver - there are several different types of driver depending on what you need to achieve and it might be you can plug into one of these.
The windows driver kit contains quite a large number of samples, including a driver that implements a virtual toaster. These should provide you with starting points.
I strongly suggest you do the testing of this in a virtual machine. If your driver successfully builds, but causes a runtime error, the result could well crash windows entirely if you're in kernel-mode. You will therefore save yourself some pain by being able to revert the virtual machine if you damage it, as well as not having to wait on your system restarting. It'll also make debugging easier as virtual serial cables can be used.
This is quite a big undertaking, so before you start, I'd research Windows development more thoroughly - check you can't do it using the Windows APIs first, then have a look at the user-mode driver framework, then finally and only if you need to, look at the kernel level stuff.