Distribution of an application that uses OpenCL

Distribution of an application that uses OpenCL - windows

I would like to distribute a Windows/Linux application that uses openCL, but I can't find the best way to do it.
For the moment my problem are only on Windows:
1- I'm using Intel CPU, how can I manage Intel AND AMD (CPU of final users) ?
2- For distribution of application that uses Visual Studio DLL, we have Visual Studio Redistributable to manage this easily and to avoid a big installation of Visual Studio. Is there a package like this for openCL ?
3- Finally, I don't know if I must provide OpenCL.dll or not (example of different point of view here)
I read several topics on the web about this problem without clear solution.
Thank you for your help.

1) You write to the OpenCL API and it works with whatever hardware your user has. User the header for the lower version you want to support (e.g., use cl.h from 1.1 if you want to target 1.1 and higher).
2) The OpenCL runtime is installed on the user's machine when they install a graphics driver. You don't need to (and should not) redistribute anything.
3) Please don't redistribute OpenCL.dll
The one problem you may need to deal with is if your user does not have any OpenCL installed on their machine. In this case, the call to clGetPlatformIDs will fail. There are various ways to deal with this, all platform specific. Dynamically linking to OpenCL.dll is one way, or running a helper process to test for OpenCL is another. An elegant solution on Windows is to delay load OpenCL.dll and hook that API to return 0 if the late binding fails.

1- I'm using Intel CPU, how can I manage Intel AND AMD (CPU of final users)
Are you talking about running OpenCL kernels on CPU, or just host-side code while kernels run on GPU ? because if the former (on CPU), your users will need to install their respective OpenCL CPU implementation, IIRC the Intel CPU implementation does not run on AMDs (or at least that used to be the case, perhaps it's now different..)
3- Finally, I don't know if I must provide OpenCL.dll
You don't have to, but you should, IMO. The way OpenCL works (usually), OpenCL.dll is just an ICD loader - a small library (a few dozen KB) that loads the actual OpenCL implementation(s) by looking into a few predefined places. It should be safe to include on Windows, and it simplifies your program logic - you can always build with OpenCL enabled, and if there's no OpenCL implementation installed, the loader will return CL_PLATFORM_NOT_FOUND_KHR - you just handle that error by asking user to install an OpenCL implementation, or fallback to non-OpenCL code path if you have it, whatever suits you more.
There's no need to complicate your life with delayed DLL loads or helper processes. In fact that's the entire point of the ICD concept - you don't need to look for the platforms and DLLs yourself, you let the ICD loader do it. It's pretty absurd to write helper code to load a helper library (ICD) which then loads the actual implementation DLLs...

Related

How to specify the physical CoreIDs used for "CLOSE" when specifying OMP_PROC_BIND?

We are trying to optimize HPC applications using OpenMP on a new hardware platform. These applications need precise placement/pinning of their cores or performance falls in half. Currently, we provide the user a custom GOMP_CPU_AFFINITY map for each platform, but this is cumbersome, because it's different on each hardware version, and even platforms with different firmware versions sometimes change their CoreID physical mappings - all things impossible for the user to detect on the fly.
It would be a great help if HPC applications could simply set GOMP_PROC_BIND to "close" and OpenMP would do the right thing for the given platform - but to make this possible, the hardware vendor would need to define what "close" means for each machine. We'd like to do this, but we can't tell how/where OpenMP gets CoreID lists to use for things like close, spread, etc. (For various external requirements, the CoreID spatial pattern on this machine would appear utterly random to a software writer.)
Any advice as to where/how OpenMP defines the CoreID lists for OMP_PROC_BIND so we could configure them? We are comfortable with the idea that we might need a custom version of OpenMP (with altered source code) for this platform if needed.
Thanks, everyone. :)
Jeff

Expanding on what #VictorEijkhout said...
You seem have invented an envirable that I can't find anywhere with Google (GOMP_PROC_BIND), with the OpenMP standard envirable (OMP_PROC_BIND). If GOMP_PROC_BIND exists the name suggests that it is a GNU feature. Note too that one of the two Google hits for GOMP_PROC_BIND says "Code that reads the setting is buggy. Setting is invalid and ignored at runtime." So, if you are setting that it is unsurprising that it has no effect!
I will therefore answer for the more general case of OMP_PROC_BIND.
The binding of OpenMP threads to logicalCPUs clearly has to be done at runtime, since, beyond its ISA, the compiler has no knowledge of the hardware on which the compiled code will run. Therefore you need to be looking at the runtime library code.
I have not looked at GNU's libgomp, but, where it can, LLVM's libomp uses the hwloc library to explore the machine hardware. Since hwloc also includes other useful tools for machine exploration (such as lstopo) it is likely that your effort is best invested in ensuring good hwloc support on your machine, at which point there will be no need to delve inside the OpenMP runtime.

What does M1 mac optimization process for an application mean?

You know the ARM-based M1 chips that are used in modern mac computers. On those macs, some number of software are ran through the layer called Rosetta (Discord, Steam), some natively, directly through M1 (Slack, IntelliJ) and some actually doesn't work in either way (Virtual Box). Huge list holding the status can be found here.
Apps that can be ran only with Rosetta are not yet M1 optimized, their developers have to optimize it, it takes some time to do so. But what does it mean to optimize it? What the process looks like? I'm quite sure that they don't rewrite the whole application code to another language (like Swift), because Jetbrains was able to M1 optimize their apps quite quickly. On the other hand, Discord is not yet optimized, same for Unity game engine (it's in beta though).

At bottom, it just means that the compiler's backend was configured to emit ARM64 instructions for the program instead of (or in-addition to) x86-64 instructions.
This means that certain x86-64 specific functionality instruction can no longer be used, unless equivalent ARM instructions are used instead.
This usually isn't much of a problem though, because most macOS software is typically written at a higher level of abstraction, using system-provided frameworks.
For example, using CoreImage to manipulate images abstracts you from the details of the CPU and GPU. In such cases, Apple does the heavy lifting of porting over their frameworks. All you have to do as an application developer is to check a box that says "target ARM64".

What is it meant by "developers must optimise their apps to run on ARM-based processors"?

This is a subject that I am not very knowledgable about and I was hoping to get a better understanding on the topic.
I was going through articles about Apple's transition to Apple Silicon and at some point I read "Apple is going to ship Rosetta 2, an emulation layer that lets you run old apps on new Macs."
As far as I know, an application is written in a high level language (e.g. C/C++,Java etc.). Then the compiler (let's assume interpreters don't exist for a moment) reads that code and translates it to assembly code. Then the assembler will convert assembly code to machine code which is readable by the processor.
My question is, assuming the above are correct, why is Rosetta 2 required since a CPU is supposed to translate high level code into readable machine code anyway? Why would developers need to "optimise" (or care on what processor their applications are run on) their applications since they are written (mostly) in high level language (which the processor can compile) ? I don't get why would programmers care if the CPU is supposed to handle compiling and assembling.
This question is probably rather trivial but I couldn't find what I was looking for just by reading about compilers or CPU architecture.

a CPU is supposed to translate high level code into readable machine code anyway?
No, the CPU doesn't do that itself, it happens via software running on the CPU (JIT or ahead-of-time compiler).
For ahead-of-time compiler (e.g. normal C++ implementations), closed source software only ships x86 machine code, not source. So you can't just recompile it yourself. Open-source software is usually easily portable by recompiling.
Rewritten is an overstatement for most apps, most can just recompile.
But if you have custom x86-specific code, like manually vectorized SIMD loops using SSE / AVX intrinsics or hand-written asm, you'd have to port those to NEON / AArch64 SIMD.

Expanding media capabilities of Win Embedded CE 6.0

I have an embedded device with WinCE 6.0 as OS. The manufacturer provides an IDE for 3rd party development to it. The IDE pretty much allows nothing else than
.NET 3.5 Compact Framework scripting that's invoked from various events from the main application
Adding files to the device.
The included mediaplayer seems to be using DirectShow and the OS has media codec only for mpeg-1 encoded video playback. My goal is to to be able to play media encoded with some other codecs as well inside that main application.
I've already managed to use DirectShowNETCF (DirectShow wrapper for .NET Compact Framework) and successfully playback mpeg-1 encoded video.
I'm totally new with this stuff and I have tons of (stupid) questions. I'll try to narrow them down:
The OS is based on WinCE, but as far as I've understood, it's actually always some customized version of it (via Platform Builder). Only "correct way" of developing anything for it afterwards is to use the SDK the manufacturer usually provides. Right? In my case, the SDK is extremely limited and tightly integrated into IDE as noted above. However, .NET CF 3.5 is capable for interop so its possible to call native libraries -as long as they are compiled for correct platform.
Compiled code is pretty much just instructions for the processor (assembler code) and the compiler chooses the correct instructions based on the target processor setting. Also there's the PE-header that defines under which platform the program is meant to be run. If I target my "helloworld.exe" (does nothing but returns specific exit code) to x86 and compile it with VC, should it work?
If the PE-header is in fact the problem, is it possible to setup for WINCE without the SDK? Do I REALLY need the whole SDK for creating a simple executable that uses only base types? I'm using VS2010, which doesn't even support smart device dev anymore and I'd hate to downgrade just for testing purposes.
Above questions are prequel to my actual idea: Porting ffmpeg/ffdshow for WinCE. This actually already exists, but not targeted nor built for Intel Atom. Comments?
If the native implementation is not possible and I would end up implementing some specific codec with C#...well that would probably be quite a massive task. But having to choose C# over native, could I run into problems with codec performance? I mean.. is C# THAT much slower?
Thank you.

I've not seen an OEM that shipped their own IDE, but it's certainly possible. That shouldn't change how apps can created, however. It's possible that they've done a lot of work to make sure only things from their IDE work, but that would be a serious amount of work for not that much benefit, so I'd think it's unlikely.
As for your specific questions:
The OS is Windows CE, not "based on" it. The OS is, however, componentized, so not all pieces are going to be available. An SDK generally provides a mechanism to filter out what isn't available. You can actually use any SDK that targets the right processor architecture, but if your app calls into a library for something that isn't in the OS, then you'll get at the very least an error. For managed code this is all not relevant because the CF isn't componentized. If it's there, and CF app can run (and if it's not, you can often install it after the fact). This means that if the platform supports the CF, then you can write a CF app and run it. That app can then call native stuff via P/Invoke (unless, of course, the OS creator decided to add security to prevent that. This is possible in the OS, though I've never seen it implemented).
Yes, compiled code is just "instructions". For native, yes, they are processor instructions. For managed, they are MSIL instructions that the managed runtime in turn converts to processor instructions at JIT time. If your target is an ARM platform, you cannot use an x86 compiler. Broadly speaking, you need to use the correct Microsoft compiler that support Windows CE, and call that compiler with the proper switches to tell it not only the processor architecture, but also the target OS because the linking that needs to be done will be different for OS-level APIs and even the C runtimes. The short of this is that for your platform, you need to use Visual Studio 2008 Pro.
For native apps, you need some SDK that targets the same OS version (CE 6.0) and processor architecture (e.g. ARMv4I). Having it match the OS feature set is also useful but not a requirement. For managed code, you can just use the SDKs that ship with Studio because managed code is not processor-dependent. Still, you have to go back to Studio 2008 because 2010 doesn't have any WinCE compilers.
If you've found an existing library, then you can try to use it. Things that might impede your progress are A) it's unlikely to use an SDK you have so you probably have to create new project files (painful, but workable) and B) if it uses features not available in your OS, then you'd have to work around those. If you're missing OS features, you're probably out of luck but if it already has a media player and codec, I suspect you'll be ok.
Don't implement this in managed code. Seriously, just don't do it. Could you? Yes. Performance could probably be made to be nearly the same except to avoid GC stuttering you're going to have to basically create your own memory manager. The amount of work involved in this path is very, very large.

OpenCL distribution

I'm currently developing an OpenCL-application for a very heterogeneous set of computers (using JavaCL to be specific). In order to maximize performance I want to use a GPU if it's available otherwise I want to fall back to the CPU and use SIMD-instructions. My plan is to implement the OpenCL-code using vector-types because my understanding is that this allows CPUs to vectorize the instructions and use SIMD-instructions.
My question however is regarding which OpenCL-implementation to use. E.g. if the computer has a Nvidia GPU I assume it's best to use Nvidia's library but if no GPU is available I want to use Intel's library to use the SIMD-instructions.
How do I achieve this? Is this handled automatically or do I have to include all libraries and implement some logic to pick the right one? It feels like this is a problem that more people than I are facing.
Update
After testing the different OpenCL-drivers this is my experience so far:
Intel: crashed the JVM when JavaCL tried to call it. After a restart it didn't crash the JVM but it also didn't return any usable
devices (I was using an Intel I7-CPU). When I compiled the
OpenCL-code offline it seemed to be able to do some
auto-vectorization so Intel's compiler seems quite nice.
Nvidia: Refused to install their WHQL-drivers because it claimed I didn't have Nvidia-card (that computer has a Geforce GT 330M). When
I tried it on a different computer I managed to get all the way to
create a kernel but at the first execution it crashed the drivers
(the screen flickered for a while and Windows 7 said it had to
restart the drivers). The second execution caused a bluee-screen of
death.
AMD/ATI: Refused to install 32-bit SDK (I tried that since I will be using a 32-bit JVM) but 64-bit SDK worked well. This is the only
driver which I've managed to execute the code on (after a restart
because at first it gave a cryptic error-message when compiling).
However it doesn't seem to be able to do any implicit vectorization
and since I don't have any ATI GPU I didn't get any performance
increase compared to the Java-implementation. If I use vector-types I
might see some improvements though.
TL;DR None of the drivers seem ready for commercial use. I'm probably better of creating JNI-module with C-code compiled to use SSE-instructions.

First try to understand hosts & devices: http://www.streamcomputing.eu/blog/2011-07-14/basic-concept-hosts-and-devices/
Basically you can just do exactly what you described: check if a certain driver is available and if not, try the next one. What you choose first depends completely on your own preference. I would pick the device I have tested my kernel best on. In JavaCL you can pick the fastest device with JavaCL.createBestContext and CLPlatform.getBestDevice, check the host-code here: http://ochafik.com/blog/?p=501
Know NVidia does not support CPUs via their driver; only AMD and Intel do. Also is targeting multiple devices (say 2 GPUs and a CPU) a bit more difficult.

There is no API providing what you want. however, you can do the following:
i suggest you iterate over clGetPlatformIDs and query for the number of devices (clGetDeviceIDs), and device type for each device;
and pick the platform which has both types.
then build a map in u'r code, that maps for each type the list of platforms supporting it, ordered in some manner.
finally, just get the first item in the list corresponding for CL_DEVICE_TYPE_CPU and the first item corresponding for CL_DEVICE_TYPE_GPU.
if both returned results are equal (platform_cpu == platform_gpu) then pick one of them and use it for both.
if there is a platform supporting both, you will get match as before since you got order lists. then u can also do load balancing if u like on a single platform, like what Intel has.

Sorry for being late to the party, but regarding Intel's implementation behaviour under JavaCL, I'm afraid you've been bitten by a JavaCL bug :
https://github.com/ochafik/nativelibs4java/issues/297
Fixed in JavaCL 1.0.0-RC2 !
Cheers

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio