How to specify the physical CoreIDs used for "CLOSE" when specifying OMP_PROC_BIND? - openmp

We are trying to optimize HPC applications using OpenMP on a new hardware platform. These applications need precise placement/pinning of their threads to cores, or performance drops by half. Currently we provide the user a custom GOMP_CPU_AFFINITY map for each platform, but this is cumbersome because it differs on each hardware version, and even platforms with different firmware versions sometimes change their physical CoreID mappings - all things that are impossible for the user to detect on the fly.
It would be a great help if HPC applications could simply set GOMP_PROC_BIND to "close" and OpenMP would do the right thing for the given platform - but to make this possible, the hardware vendor would need to define what "close" means for each machine. We'd like to do this, but we can't tell how/where OpenMP gets CoreID lists to use for things like close, spread, etc. (For various external requirements, the CoreID spatial pattern on this machine would appear utterly random to a software writer.)
Any advice as to where/how OpenMP defines the CoreID lists for OMP_PROC_BIND so we could configure them? We are comfortable with the idea that we might need a custom version of OpenMP (with altered source code) for this platform if needed.
Thanks, everyone. :)
Jeff

Expanding on what @VictorEijkhout said...
You seem to have invented an environment variable that I can't find anywhere with Google (GOMP_PROC_BIND), confusing it with the OpenMP standard environment variable (OMP_PROC_BIND). If GOMP_PROC_BIND exists, the name suggests that it is a GNU feature. Note too that one of the two Google hits for GOMP_PROC_BIND says "Code that reads the setting is buggy. Setting is invalid and ignored at runtime." So, if you are setting that, it is unsurprising that it has no effect!
I will therefore answer for the more general case of OMP_PROC_BIND.
The binding of OpenMP threads to logical CPUs clearly has to be done at runtime, since, beyond its ISA, the compiler has no knowledge of the hardware on which the compiled code will run. Therefore you need to be looking at the runtime library code.
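As a quick way to see what the runtime actually decided, the standard affinity query routines (OpenMP 4.5 and later) can print the chosen placement from inside the program; a minimal sketch:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* Which place (as defined by OMP_PLACES) did the runtime put this thread on? */
            int place = omp_get_place_num();          /* -1 if the thread is unbound */
            int nprocs = place >= 0 ? omp_get_place_num_procs(place) : 0;
            int ids[256];                             /* assumes <= 256 procs per place */
            if (nprocs > 0 && nprocs <= 256)
                omp_get_place_proc_ids(place, ids);

            #pragma omp critical
            printf("thread %2d -> place %2d (%d procs, first OS proc id %d)\n",
                   omp_get_thread_num(), place, nprocs,
                   nprocs > 0 ? ids[0] : -1);
        }
        return 0;
    }

Running it with, for example, OMP_PROC_BIND=close and OMP_PLACES=cores shows which OS proc IDs "close" actually resolved to on a given machine.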
I have not looked at GNU's libgomp, but, where it can, LLVM's libomp uses the hwloc library to explore the machine hardware. Since hwloc also includes other useful tools for machine exploration (such as lstopo) it is likely that your effort is best invested in ensuring good hwloc support on your machine, at which point there will be no need to delve inside the OpenMP runtime.
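For illustration, here is a minimal hwloc sketch (assuming hwloc is installed; link with -lhwloc) that walks the core list the way a runtime or a site-specific pinning tool might, showing how hwloc's logical "close" ordering maps onto the OS CoreIDs:

    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);   /* discover the machine we are running on */

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        for (int i = 0; i < ncores; i++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
            /* logical_index is hwloc's topology order; os_index is the OS CoreID */
            printf("core logical %u -> OS index %u\n",
                   core->logical_index, core->os_index);
        }

        /* Bind the current thread to the first core, as an example */
        if (ncores > 0) {
            hwloc_obj_t first = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
            hwloc_set_cpubind(topo, first->cpuset, HWLOC_CPUBIND_THREAD);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }

If hwloc reports your machine's topology correctly, an hwloc-aware OpenMP runtime given OMP_PROC_BIND=close and OMP_PLACES=cores should then produce sensible placements without a hand-written affinity map.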

Related

Measuring time in omp_fn routines

I am writing a pintool that gathers metrics in a subset of an application's routines (some of which are generated by the compiler).
The goal is to get the execution time of those routines.
Below is a set of attempts I have already made:
Of course doing it with pin is a bad idea because of the Virtual Machine overhead.
The gcc option -finstrument-functions does not cover the OpenMP functions it generates.
LD_PRELOAD does not work with OpenMP functions which are statically linked.
Maybe if Pin allowed dumping statically instrumented assembly, we could avoid the virtual-environment overhead, but as far as I know that isn't possible.
I know about the Maqao instrumentation tool, which does not use a virtual environment, but I want to avoid using too many frameworks or translating my pintool into a Maqao Lua script.
I guess I am left with manual binary instrumentation, but if anybody has a better solution, the help will be appreciated.
If you just want the results, use a comprehensive measurement infrastructure that supports OpenMP, such as Intel VTune, Extrae/Paraver, or Score-P. This will provide you with profiling or tracing information about the OpenMP regions.
If you want to implement the measurement yourself, you can use the underlying source-to-source transformation tool Opari. You could also use the much cleaner OpenMP tools interface (OMPT), but AFAIK it is not widely supported yet. You might have some luck with recent Intel OpenMP runtimes.
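If you do go the tools-interface route, a minimal tool is fairly small. The sketch below follows the OMPT interface as it was later standardized in OpenMP 5.0 (header omp-tools.h, loaded by building it as a shared library and pointing OMP_TOOL_LIBRARIES at it), so whether it actually works depends entirely on your runtime's OMPT support; it merely timestamps each parallel region:

    #include <omp-tools.h>
    #include <omp.h>
    #include <stdint.h>
    #include <stdio.h>

    static void on_parallel_begin(ompt_data_t *task_data, const ompt_frame_t *frame,
                                  ompt_data_t *parallel_data,
                                  unsigned int requested_parallelism,
                                  int flags, const void *codeptr_ra)
    {
        parallel_data->value = (uint64_t)(omp_get_wtime() * 1e9); /* start time, ns */
    }

    static void on_parallel_end(ompt_data_t *parallel_data, ompt_data_t *task_data,
                                int flags, const void *codeptr_ra)
    {
        uint64_t now = (uint64_t)(omp_get_wtime() * 1e9);
        printf("parallel region at %p took %.3f ms\n",
               codeptr_ra, (now - parallel_data->value) / 1e6);
    }

    static int my_init(ompt_function_lookup_t lookup, int initial_device_num,
                       ompt_data_t *tool_data)
    {
        ompt_set_callback_t set_callback =
            (ompt_set_callback_t)lookup("ompt_set_callback");
        set_callback(ompt_callback_parallel_begin, (ompt_callback_t)on_parallel_begin);
        set_callback(ompt_callback_parallel_end,   (ompt_callback_t)on_parallel_end);
        return 1; /* non-zero keeps the tool active */
    }

    static void my_fini(ompt_data_t *tool_data) {}

    ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                              const char *runtime_version)
    {
        static ompt_start_tool_result_t result = { my_init, my_fini, { 0 } };
        return &result;
    }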

What is the difference between "binary install" and "compile and install from source"? Which is better?

I want to install a driver for ROS (Robot Operating System), and I have two options: the binary install, and compile-and-install from source. I would like to know which installation is better, and what the advantages and disadvantages of each one are.
Source: a.k.a. source code, usually in some sort of tarball or zip file. This is RAW programming language code. You need some sort of compiler (javac for Java, gcc for C++, etc.) to create the executable that your computer then runs.
Advantages:
You can see what the source code is, which means....
You can edit the resulting program to behave differently
Depending on what you're doing, when you compile, you could enable certain optimizations that will work on your machine and ONLY your machine (or one EXACTLY like it). For instance, for some sort of gfx rendering software, you could compile it to enable GPU support, which would increase the rendering speed.
You can create a version of an application for a different OS/Chipset (see Binary below)
Disadvantages:
You have to have your compiler installed
You need to manually install all required libraries, which frequently also need to be compiled (and THEIR libraries need to be installed, etc.). This can easily turn a quick 30-second command into a multi-hour project.
There are any number of things that could go wrong, and if you're not familiar with what the various errors mean, finding support online could be quite difficult.
Binary: This is the actual program that runs. This is the executable that gets created when you compile from source. They typically have all necessary libraries built into them, or install/deploy them as necessary (depending on how the application was written).
Advantages:
It's ready-to-run. If you have a binary designed for your processor and operating system, then chances are you can run the program and everything will work the first time.
Less configuration. You don't have to set up a whole bunch of configuration options to use the program; it just uses a generic default configuration.
If something goes wrong, it should be a little easier to find help online, since the binary is pre-compiled....other people may be using it, which means you are using the EXACT same program as them, not one optimized for your system.
Disadvantages:
You can't see/edit the source code, so you can't get optimizations, or tweak it for your specific application. Additionally, you don't really know what the program is going to do, so there could be nasty surprises waiting for you (this is why Antivirus is useful....although LESS necessary on a linux system).
Your system must be compatible with the Binary. For instance, you can't run a 64-bit application on a 32-bit operating system. You can't run an Intel binary for OS X on an older PowerPC-based G5 Mac.
In summary, which one is "better" is up to you. Only you can decide which one will be necessary for whatever it is you're trying to do. In most cases, using the binary is going to be just fine, and give you the least trouble. Sometimes, though, it is nice to have the source available, if only as documentation.

Is Go developed enough to use it to make the core of an operating system?

I'm wondering if Go is developed enough to be used to build the core of an operating system - so, basically, to replace what you would normally use C for with Go.
Of course you can develop an OS in almost any (Turing complete) language. Usually there's some small assembly layer required, though. And usually one must implement some parts of the OS using only a restricted subset of the language in question.
Examples:
JavaOS.
Singularity. (Applies with some limits only.)
As for Go, there used to be a usable (toy) Go kernel implementation, but it has been obsolete for a long time now. From rsc's post:
In the repository history there is a toy kernel called "tiny".
If you run hg log -k tiny you'll find it. It doesn't build anymore
with the current version of Go but it illustrates what might
be done. It had the whole package runtime, including the
garbage collector, in the kernel.
Russ

OpenCL distribution

I'm currently developing an OpenCL application for a very heterogeneous set of computers (using JavaCL, to be specific). In order to maximize performance I want to use a GPU if one is available, and otherwise fall back to the CPU and use SIMD instructions. My plan is to implement the OpenCL code using vector types, because my understanding is that this allows CPUs to vectorize the instructions and use SIMD instructions.
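In kernel code that plan looks roughly like the following (a minimal sketch; the kernel name and signature are illustrative, not from the actual application):

    // OpenCL C kernel: each work-item processes one float4, i.e. four floats at once.
    // On a CPU device the implementation may map this directly onto SSE/AVX lanes.
    __kernel void saxpy4(__global const float4 *x,
                         __global float4 *y,
                         const float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }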
My question, however, is which OpenCL implementation to use. E.g. if the computer has an Nvidia GPU I assume it's best to use Nvidia's library, but if no GPU is available I want to use Intel's library to get the SIMD instructions.
How do I achieve this? Is this handled automatically or do I have to include all libraries and implement some logic to pick the right one? It feels like this is a problem that more people than I are facing.
Update
After testing the different OpenCL-drivers this is my experience so far:
Intel: crashed the JVM when JavaCL tried to call it. After a restart it didn't crash the JVM, but it also didn't return any usable devices (I was using an Intel i7 CPU). When I compiled the OpenCL code offline it seemed to be able to do some auto-vectorization, so Intel's compiler seems quite nice.
Nvidia: refused to install their WHQL drivers because it claimed I didn't have an Nvidia card (that computer has a GeForce GT 330M). When I tried it on a different computer I managed to get all the way to creating a kernel, but at the first execution it crashed the drivers (the screen flickered for a while and Windows 7 said it had to restart the drivers). The second execution caused a blue screen of death.
AMD/ATI: refused to install the 32-bit SDK (I tried that since I will be using a 32-bit JVM), but the 64-bit SDK worked well. This is the only driver I've managed to execute the code on (after a restart, because at first it gave a cryptic error message when compiling). However it doesn't seem to be able to do any implicit vectorization, and since I don't have an ATI GPU I didn't get any performance increase compared to the Java implementation. If I use vector types I might see some improvements though.
TL;DR None of the drivers seem ready for commercial use. I'm probably better off creating a JNI module with C code compiled to use SSE instructions.
First try to understand hosts & devices: http://www.streamcomputing.eu/blog/2011-07-14/basic-concept-hosts-and-devices/
Basically you can just do exactly what you described: check if a certain driver is available and if not, try the next one. What you choose first depends completely on your own preference. I would pick the device I have tested my kernel best on. In JavaCL you can pick the fastest device with JavaCL.createBestContext and CLPlatform.getBestDevice, check the host-code here: http://ochafik.com/blog/?p=501
Note that NVidia does not support CPUs via their driver; only AMD and Intel do. Also, targeting multiple devices (say 2 GPUs and a CPU) is a bit more difficult.
There is no API that provides exactly what you want; however, you can do the following (a host-code sketch follows below):
Iterate over the platforms returned by clGetPlatformIDs and, for each one, query the number of devices (clGetDeviceIDs) and the type of each device, noting which platforms offer both types.
Then build a map in your code from each device type to the list of platforms supporting it, ordered in some manner.
Finally, take the first item in the list for CL_DEVICE_TYPE_CPU and the first item in the list for CL_DEVICE_TYPE_GPU.
If both results are equal (platform_cpu == platform_gpu), pick that one and use it for both.
If there is a platform supporting both types you will get that match anyway, since the lists are ordered; you can then also do load balancing on the single platform if you like, as Intel's implementation allows.
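A rough illustration of that enumeration in plain C host code (error handling omitted and capped at 16 platforms for brevity):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Return the first platform exposing the requested device type, or NULL. */
    static cl_platform_id find_platform(cl_platform_id *platforms, cl_uint n,
                                        cl_device_type type)
    {
        for (cl_uint i = 0; i < n; i++) {
            cl_uint ndev = 0;
            if (clGetDeviceIDs(platforms[i], type, 0, NULL, &ndev) == CL_SUCCESS && ndev > 0)
                return platforms[i];
        }
        return NULL;
    }

    int main(void)
    {
        cl_uint nplat = 0;
        clGetPlatformIDs(0, NULL, &nplat);

        cl_platform_id platforms[16];
        if (nplat > 16) nplat = 16;
        clGetPlatformIDs(nplat, platforms, NULL);

        cl_platform_id cpu = find_platform(platforms, nplat, CL_DEVICE_TYPE_CPU);
        cl_platform_id gpu = find_platform(platforms, nplat, CL_DEVICE_TYPE_GPU);

        if (gpu && gpu == cpu)
            printf("one platform has both CPU and GPU: use it for both\n");
        else if (gpu)
            printf("use the GPU platform, fall back to %s\n",
                   cpu ? "the CPU platform" : "plain host code");
        else if (cpu)
            printf("no GPU found: use the CPU platform (SIMD via the CPU driver)\n");
        else
            printf("no usable OpenCL platform\n");

        return 0;
    }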
Sorry for being late to the party, but regarding Intel's implementation behaviour under JavaCL, I'm afraid you've been bitten by a JavaCL bug:
https://github.com/ochafik/nativelibs4java/issues/297
Fixed in JavaCL 1.0.0-RC2 !
Cheers

Why is bytecode JIT compiled at execution time and not at installation time?

Compiling a program to bytecode instead of native code enables a certain level of portability, as long as a fitting virtual machine exists.
But I'm kinda wondering, why delay the compilation? Why not simply compile the bytecode when installing an application?
And if that is done, why not apply it to languages that directly compile to native code? Compile them to an intermediate format, distribute a "JIT" compiler with the installer, and compile it on the target machine.
The only thing I can think of is runtime optimization. That's about the only major thing that can't be done at installation time. Thoughts?
Often it is precompiled. Consider, for example, precompiling .NET code with NGEN.
One reason for not precompiling everything would be extensibility. Consider those languages which allow use of reflection to load additional code at runtime.
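A C analogue of that extensibility problem is dynamic loading: code pulled in at runtime with dlopen (the plugin file name and entry point below are made up for illustration; link with -ldl on Linux) simply isn't visible to any install-time compilation step:

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        /* "plugin.so" stands in for code the installer never saw. */
        void *handle = dlopen("./plugin.so", RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }

        /* Look up a function by name at runtime, much as reflection does. */
        int (*entry)(void) = (int (*)(void))dlsym(handle, "plugin_entry");
        int result = entry ? entry() : -1;
        printf("plugin returned %d\n", result);

        dlclose(handle);
        return 0;
    }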
Some JIT Compilers (Java HotSpot, for example) use type feedback based inlining. They track which types are actually used in the program, and inline function calls based on the assumption that what they saw earlier is what they will see later. In order for this to work, they need to run the program through a number of iterations of its "hot loop" in order to know what types are used.
This optimization is totally unavailable at install time.
The bytecode has been compiled just as well as the C++ code has been compiled.
Also, the JIT compiler, i.e. the .NET and Java runtimes, is massive and dynamic; and you can't foresee in a program which parts the app will use, so you need the entire runtime.
Also one has to realize that a language targeted to a virtual machine has very different design goals than a language targeted to bare metal.
Take C++ vs. Java.
C++ wouldn't work on a VM; in particular, a lot of the C++ language design is geared towards RAII.
Java wouldn't work on bare metal for many reasons; primitive types, for one.
EDIT: As delnan correctly points out, JIT and similar technologies, though hugely beneficial to bytecode performance, would likely not be available at install time. Also, compiling for a VM is very different from compiling to native code.

Resources