OpenCL programming in Charm++

Is it possible to run OpenCL through Charm++, while retaining the same fault tolerance and load balancing capabilities as for CPU or CUDA?
I did not explicitly see anything mentioned in the tutorials or the book.

Background: I'm one of the core developers of Charm++.
It's not clear whether you mean compiling OpenCL code to a Charm++-based parallel program, or calling kernels written in OpenCL from Charm++ code. Regardless, there is nothing explicitly implemented to support either of those cases at present.
Compiling OpenCL to Charm++ would be a large project. I don't know of anyone proposing to do such a thing, but it's not fundamentally implausible.
The research group behind Charm++, the Parallel Programming Laboratory, has looked at the possibility of implementing OpenCL support to match our offload support for CUDA-based accelerators. This would not be particularly hard. However, at present, we don't have any demand from the grant-funded projects that support our work to do so. We would welcome contributions of code to do this. There's also the possibility that commercial development may lead to this getting implemented.

Related

How to specify the physical CoreIDs used for "CLOSE" when specifying OMP_PROC_BIND?

We are trying to optimize HPC applications using OpenMP on a new hardware platform. These applications need precise placement/pinning of their threads to cores, or performance drops by half. Currently, we provide the user with a custom GOMP_CPU_AFFINITY map for each platform, but this is cumbersome, because it's different on each hardware version, and even platforms with different firmware versions sometimes change their CoreID physical mappings - all things impossible for the user to detect on the fly.
It would be a great help if HPC applications could simply set GOMP_PROC_BIND to "close" and OpenMP would do the right thing for the given platform - but to make this possible, the hardware vendor would need to define what "close" means for each machine. We'd like to do this, but we can't tell how/where OpenMP gets CoreID lists to use for things like close, spread, etc. (For various external requirements, the CoreID spatial pattern on this machine would appear utterly random to a software writer.)
Any advice as to where/how OpenMP defines the CoreID lists for OMP_PROC_BIND so we could configure them? We are comfortable with the idea that we might need a custom version of OpenMP (with altered source code) for this platform if needed.
Thanks, everyone. :)
Jeff
Expanding on what @VictorEijkhout said...
You seem to have invented an environment variable that I can't find anywhere with Google (GOMP_PROC_BIND) by conflating it with the standard OpenMP environment variable (OMP_PROC_BIND). If GOMP_PROC_BIND exists, the name suggests that it is a GNU feature. Note too that one of the two Google hits for GOMP_PROC_BIND says "Code that reads the setting is buggy. Setting is invalid and ignored at runtime." So, if you are setting that, it is unsurprising that it has no effect!
I will therefore answer for the more general case of OMP_PROC_BIND.
The binding of OpenMP threads to logical CPUs clearly has to be done at runtime, since, beyond its ISA, the compiler has no knowledge of the hardware on which the compiled code will run. Therefore you need to be looking at the runtime library code.
I have not looked at GNU's libgomp, but, where it can, LLVM's libomp uses the hwloc library to explore the machine hardware. Since hwloc also includes other useful tools for machine exploration (such as lstopo) it is likely that your effort is best invested in ensuring good hwloc support on your machine, at which point there will be no need to delve inside the OpenMP runtime.
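To see what a runtime built on hwloc can actually discover, here is a minimal sketch using the hwloc C API (assuming hwloc is installed and the program is linked with -lhwloc) that prints the mapping from hwloc's logical core numbers to the OS/physical core indices; this is the same topology information an hwloc-based OpenMP runtime would consult when implementing "close" or "spread".

    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topology;
        hwloc_topology_init(&topology);
        hwloc_topology_load(topology);

        /* Enumerate cores in hwloc's logical order and show the OS index
           (the "physical" CoreID the kernel uses) for each one. */
        int ncores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
        for (int i = 0; i < ncores; i++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, i);
            printf("logical core %d -> OS core id %u\n", i, core->os_index);
        }

        hwloc_topology_destroy(topology);
        return 0;
    }

If lstopo and a probe like this report a sensible topology on your platform, then OMP_PROC_BIND=close together with OMP_PLACES=cores should behave as intended in an hwloc-based runtime, without you having to hand out per-machine affinity maps.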

What is it meant by "developers must optimise their apps to run on ARM-based processors"?

This is a subject that I am not very knowledgeable about, and I was hoping to get a better understanding of the topic.
I was going through articles about Apple's transition to Apple Silicon and at some point I read "Apple is going to ship Rosetta 2, an emulation layer that lets you run old apps on new Macs."
As far as I know, an application is written in a high-level language (e.g. C/C++, Java, etc.). Then the compiler (let's assume interpreters don't exist for a moment) reads that code and translates it to assembly code. Then the assembler converts the assembly code to machine code which is readable by the processor.
My question is, assuming the above is correct, why is Rosetta 2 required, since a CPU is supposed to translate high-level code into readable machine code anyway? Why would developers need to "optimise" their applications (or care what processor their applications run on) since they are written (mostly) in a high-level language (which the processor can compile)? I don't get why programmers would care if the CPU is supposed to handle compiling and assembling.
This question is probably rather trivial but I couldn't find what I was looking for just by reading about compilers or CPU architecture.
a CPU is supposed to translate high level code into readable machine code anyway?
No, the CPU doesn't do that itself, it happens via software running on the CPU (JIT or ahead-of-time compiler).
With an ahead-of-time compiler (e.g. normal C++ implementations), closed-source software only ships x86 machine code, not source, so you can't just recompile it yourself. Open-source software is usually easily portable by recompiling.
"Rewritten" is an overstatement for most apps; most can just recompile.
But if you have custom x86-specific code, like manually vectorized SIMD loops using SSE / AVX intrinsics or hand-written asm, you'd have to port those to NEON / AArch64 SIMD.
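As an illustration of the kind of code that does need porting, here is a minimal sketch (the function name is hypothetical, not from any particular app) of the same float-addition loop written once with SSE intrinsics and once with NEON intrinsics, with a scalar tail for lengths that are not a multiple of 4.

    #include <stddef.h>

    #if defined(__SSE__)
    #include <xmmintrin.h>      /* SSE intrinsics (x86) */
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>       /* NEON intrinsics (AArch64 / ARM) */
    #endif

    /* dst[i] = a[i] + b[i] -- hypothetical example kernel */
    void add_arrays(float *dst, const float *a, const float *b, size_t n)
    {
        size_t i = 0;
    #if defined(__SSE__)
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    #elif defined(__ARM_NEON)
        for (; i + 4 <= n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);
            float32x4_t vb = vld1q_f32(b + i);
            vst1q_f32(dst + i, vaddq_f32(va, vb));
        }
    #endif
        for (; i < n; i++)      /* scalar tail for any remaining elements */
            dst[i] = a[i] + b[i];
    }

The SSE branch simply does not exist when compiling for AArch64, which is why hot loops like this have to be rewritten (or routed through a portability layer) during a port, while the surrounding portable C or C++ only needs a recompile.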

Measuring time in omp_fn routines

I am writing a pintool gathering metrics in a subset of application routines (some of which are generated by the compiler).
The goal is to get the execution time of those routines.
Below is a set of approaches I have already tried:
Of course doing it with pin is a bad idea because of the Virtual Machine overhead.
The gcc option -finstrument-functions does not cover the OpenMP functions the compiler generates.
LD_PRELOAD does not work with OpenMP functions that are statically linked.
Maybe if pin allowed us to dump statically instrumented assembly, we could avoid the virtual environment overhead, but as far as I know that isn't possible.
I know about the Maqao instrumentation tool, which does not use a virtual environment, but I want to avoid using too many frameworks or translating my pintool into a Maqao Lua script.
I guess I am left with manual binary instrumentation, but if anybody has a better solution, any help will be appreciated.
If you just want the results - use a comprehensive measurement infrastructure that supports OpenMP, such as Intel VTune, Extrae/Paraver, or Score-P. This will provide you with profiling or tracing information about the OpenMP regions.
If you want to implement the measurement yourself, you can use the underlying source-to-source transformation tool Opari. You could also use the much cleaner OpenMP tools interface (OMPT), but AFAIK it is not widely supported yet. You might have some luck with recent Intel OpenMP runtimes.
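If you do want to try OMPT, the overall shape of a tool is sketched below. This is a minimal example under the assumption that your runtime ships the omp-tools.h header and implements the OMPT interface (recent LLVM/Intel runtimes do); it simply times each parallel region with omp_get_wtime() and, for brevity, assumes regions are not nested.

    #include <stdio.h>
    #include <omp.h>
    #include <omp-tools.h>   /* OMPT interface header (OpenMP 5.0) */

    static double region_start;   /* crude: assumes non-nested parallel regions */

    static void on_parallel_begin(ompt_data_t *encountering_task_data,
                                  const ompt_frame_t *encountering_task_frame,
                                  ompt_data_t *parallel_data,
                                  unsigned int requested_parallelism,
                                  int flags, const void *codeptr_ra)
    {
        region_start = omp_get_wtime();
    }

    static void on_parallel_end(ompt_data_t *parallel_data,
                                ompt_data_t *encountering_task_data,
                                int flags, const void *codeptr_ra)
    {
        printf("parallel region at %p took %g s\n",
               codeptr_ra, omp_get_wtime() - region_start);
    }

    static int ompt_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                               ompt_data_t *tool_data)
    {
        ompt_set_callback_t set_cb = (ompt_set_callback_t)lookup("ompt_set_callback");
        set_cb(ompt_callback_parallel_begin, (ompt_callback_t)on_parallel_begin);
        set_cb(ompt_callback_parallel_end,   (ompt_callback_t)on_parallel_end);
        return 1;   /* non-zero: keep the tool active */
    }

    static void ompt_finalize(ompt_data_t *tool_data) { }

    /* The runtime looks this symbol up at startup, either in the application
       itself or in a library named by OMP_TOOL_LIBRARIES. */
    ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                              const char *runtime_version)
    {
        static ompt_start_tool_result_t result = { &ompt_initialize, &ompt_finalize,
                                                   { .value = 0 } };
        return &result;
    }

Compile this as a shared library and point OMP_TOOL_LIBRARIES at it (or link it into the application); the runtime will call ompt_start_tool during initialization.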

C++ running on PIC32 (MIPS32)

Unfortunately, my C app for PIC32 needs OO badly enough that I can't continue writing it in plain C.
Do you know any MIPS32 C++ compiler for PIC32?
Thanks
Microchip's XC32 tool chain has supported C++ since version 1.10.
You might contact Comeau Computing; their C++ compiler generates C code as an intermediate language so that it can utilise a platform's existing native C compiler where only a C compiler is available, and therefore porting to new platforms is relatively quick and simple.
For various reasons the intermediate generation and compiler adaptation is not accessible to end users so you will still need Comeau to generate a PIC32/C32 port, but it probably won't take long and hopefully they would amortise the cost over sales to other users.
However, if you use Comeau or any other C++-to-C translator, you will suffer from the inability to use source-level debugging, and that is likely to be the killer to any attempt to use C++ successfully without native debugger support.
Although it is not always pretty, your best bet is probably to learn how to implement OO designs in C. Here's a whole book on the subject: http://www.planetpdf.com/codecuts/pdfs/ooc.pdf
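For a flavour of what that looks like in practice, here is a minimal sketch of the usual pattern (the names are illustrative, not taken from the book): a struct for the data, a table of function pointers for the virtual methods, and explicit dispatch through that table.

    #include <stdio.h>

    /* "Virtual function table": one function pointer per virtual method. */
    typedef struct Shape Shape;
    typedef struct {
        double (*area)(const Shape *self);
    } ShapeVTable;

    struct Shape {
        const ShapeVTable *vtable;   /* set by each subtype's "constructor" */
    };

    /* Circle "subclass": base struct first, so a Circle* can be used as a Shape*. */
    typedef struct {
        Shape base;
        double radius;
    } Circle;

    static double circle_area(const Shape *self) {
        const Circle *c = (const Circle *)self;
        return 3.14159265358979 * c->radius * c->radius;
    }

    static const ShapeVTable circle_vtable = { circle_area };

    void circle_init(Circle *c, double radius) {
        c->base.vtable = &circle_vtable;
        c->radius = radius;
    }

    /* "Virtual call": dispatch through the vtable, much as a C++ compiler would. */
    double shape_area(const Shape *s) {
        return s->vtable->area(s);
    }

    int main(void) {
        Circle c;
        circle_init(&c, 2.0);
        printf("area = %f\n", shape_area(&c.base));
        return 0;
    }

Inheritance is done by embedding the base struct as the first member of the derived struct, as above, so pointer conversions work in both directions; single-inheritance C++ compilers implement virtual dispatch in essentially the same way under the hood.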
According to this fairly recent thread on the microchip forums it looks like C++ support for PIC32 isn't available anywhere yet and isn't a high priority with Microchip. The wisdom of the respondents in that thread appears to be: don't hold your breath.
I'm an MPLAB user myself, building small programs, so I just take what Microchip gives me. I've never gotten to the point where I thought I needed C++; longed for, yes, but never needed. As a next step you can either consider moving to another platform with C++ support or take another look at your design and ask why you need C++ that badly. Some features can be simulated in C with varying amounts of pain and suffering.
You might keep an eye on the proper GCC MIPS port. They have all the pieces, but I don't know if anyone's made C++ work with PIC32 in particular. I know it did work on sgimips.

Profiling a benchmark compiled for the SPARC v8 on an x86

I'm trying to make a (small) improvement to the leon3 processor (instruction set is SPARC v8) for an academic exercise. Before I decide what to improve, I want to profile a couple of benchmark programs that I want to tailor the improvements to.
I don't have access to a SPARC v8 machine.
Currently, I'm using an evaluation version of 'tsim' (a leon3 simulator), which only does profiling at the function level; that is not really all that useful.
I have tried weird stuff like compiling with loop unrolling enabled and then counting the interesting instructions in the assembly code, but gcc refuses to unroll the loops, probably because some of them go too deep (e.g. 4 nested 'for' loops).
Ideally, what I'm looking for is a SPARC v8 simulator that runs the benchmark and profiles it at the instruction level (stuff like: 'smul' was executed x times) so that I can decide where to start trying with the improvement. Of course if there are other ways I can do this if not a profiler, I won't mind.
Any ideas?
Simulating the processor in Modelsim could be an option. With Modelsim you can do a functional simulation of the complete LEON3 processor. The simulation will be quite slow and probably complete overkill for your purposes, but Aeroflex Gaisler provides excellent scripts to work with Modelsim.
A student edition of modelsim can be found here:
http://www.mentor.com/company/higher_ed/modelsim-student-edition
If you really want to dig that deep into the hardware, you'll find a simulator useful that helps you with that.
Simics comes to mind. They used to have free academic licenses, but since they were bought by Intel, you now need to apply for one, which from my experience takes a couple of weeks. If you are willing to invest this time, you'll certainly get a tool that suits your needs, although they support LEON2, not LEON3, as a model; for profiling this should be fine.
Qemu also has LEON support, but as it relies heavily on dynamic recompilation, it will probably be hard to do instruction-level profiling with it.
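Short of a full simulator, if you can get any of these tools to emit an instruction trace or a disassembly with one instruction per line, a tiny post-processing program already gives the per-mnemonic counts you describe. A rough sketch, under the assumption that the mnemonic is the first whitespace-separated token on each line; adjust the parsing for whatever trace format your tool actually produces.

    /* Count instruction mnemonics read from stdin, one instruction per line.
       Assumes the mnemonic is the first whitespace-separated token on each
       line; adapt the sscanf format to your trace/disassembly layout. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_OPS  512
    #define NAME_LEN 16

    static char names[MAX_OPS][NAME_LEN];
    static long counts[MAX_OPS];
    static int  nops = 0;

    int main(void)
    {
        char line[256];
        while (fgets(line, sizeof line, stdin)) {
            char op[NAME_LEN];
            if (sscanf(line, "%15s", op) != 1)
                continue;
            int i;
            for (i = 0; i < nops; i++)
                if (strcmp(names[i], op) == 0)
                    break;
            if (i == nops) {
                if (nops == MAX_OPS)
                    continue;            /* table full: skip unseen mnemonics */
                strncpy(names[nops], op, NAME_LEN - 1);
                names[nops][NAME_LEN - 1] = '\0';
                nops++;
            }
            counts[i]++;
        }
        for (int i = 0; i < nops; i++)
            printf("%-10s %ld\n", names[i], counts[i]);
        return 0;
    }

Fed a dynamic trace, this gives execution counts ('smul' was executed x times); fed a static disassembly, it gives static occurrence counts, which can still help you pick where to start.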
