How to use EPCC's OpenMP Microbenchmark suite for my program

How to use EPCC's OpenMP Microbenchmark suite for my program - openmp

I have implemented an application using OpenMP that I compiled with GCC on Ubuntu 16.04 for which I would like to calculate overheads in my application. (The binary file of my application is for e.g. xyz.exe.)
For that I'm trying to use EPCC OpenMP micro-benchmark suite. After makeing the suite, I tried to run one of the benchmarks called syncbench (./syncbench) on the terminal. But I would like to know as to how can I use the benchmark on my OpenMP implementation (xyz.exe). I tried to search the EPCC's official webpage for the suite (https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openmp-micro-benchmark-suite) and also the README available with the install code, but couldn't find how exactly can I do this.
If anyone has used this suite for their own implementation, please let me know how you have merged the benchmark with your implementation.
I'm new to parallel computing and benchmarking, so please excuse me if my query sounds trivial.

I think you are confusing a microbenchmark and a profiler. A microbenchmark (like EPCC) measures the performance of a specific set of small code fragments (in the case of the EPCC OpenMP benchmark, the performance of OpenMP constructs). A profiler measures the performance of any code and shows you where time is spent.
Therefore, to measure the behaviour of your code, you need a profiler (such as Intel Vtune, HPC toolkit, Tau, ...) not a microbenchmark.
[FWIW I work for Intel, but not directly on Vtune]

Related

Is there a performance different between compiling and linking mkl library via icc or gcc?

I cant find any info about this topic,
Is there a different in runtime performance when running a program which was compiled and linked with gcc or icc ?
(My assumption is that the program run on Intel architecture)

As both compilers are officially supported by MKL and they link the same libraries such as libmkl_core.a or libmkl_core.so, which do the actual work for MKL. The performance of MKL operations should be same. But of course the code written by yourself could be different as they are compiled by different compilers.
Edit
MKL is designed as a C library. Most of the APIs are pre-compiled and designed to run on large input data, which expect a relatively long running time. The way you calling the API won't affect the performance very much.
There are inline code and helper marcos through. For example mkl_direct_call.h include inline code/marco for small matrix multiplication, small matrix (size of ~20 or smaller) may get performance improvement with this code. So you may see performance difference when involving this part. Please refer to the following links for more details.
Improve Intel MKL Performance for Small Problems: The Use of MKL_DIRECT_CALL
Limitations of the Direct Call

ARM Cortex-M compiler differences

I'm about to develop some firmwares for Cortex-M cores on STM32 processors using C for my projects, and searching on the web I've found a lot of different compilers:
Keil, IAR, Linaro, Yagarto and GNU Tools for ARM Embedded Processors.
I was wondering, what functional differences are there between these compilers that might influence my choice? For example as an enthusiast I don't need support or assistance from the vendor, and a limitation on the code size is OK for the moment. Also the ease of use is not a main concern since I like to learn (and for the moment I have both Keil Lite and Eclipse with GNU ARM configured and working).
Is the generated code so different in terms of size/speed between these compilers? Are there any comparison table? (I've found only stale infos on the web)

benchmarking is an artform in and of itself, usually easy to manipulate the results to show whatever you want. I would not expect the compilers to generate the same results except for very small test cases, and sometimes in those small test cases their results are either identical or sometimes vastly different as your test has exposed an optimization that one compiler knows/uses and one the other doesnt.
I used to keep track of such things (compiler performance numbers) with dhrystone for example, but in the case of known benchmarks (not that dhrystone means much anymore, but others) you may find that some compilers are tuning themselves to look good under benchmarks perhaps at the expense of something else.
There is no right answer, there is no universal "best", it is all in the eye of the beholder, you. Which tool is easier for you to use, which do you like better be it for the gui or pretty colors or sound card sounds or whatever. And go from there.
The gnu compiler generally for applications I have tested does not produce code as "fast" which is my benchmark, compared to the others, but there are way more people using the free gnu tools so the support for it is considerably wider due to the number of web pages and forums and examples. gnu wont have a size restriction either, but it may require more learning or whatever to get up and running...
The cortex-ms are split into the armv6m and armv7m families, the v6m (cortex-m0) only have a small number of thumb2 extensions, the armv7m have about 150 thumbv2 extensions to thumb, so you need to know what your tools support and not use the wrong stuff on the wrong chip. Then the compilers if they know all of this may and will produce different instruction mixes from the same source code. Further within the same compiler or family using different command line options you can/will get vastly different code. And then beyond that with a cortex-m4 with cache on if you have one with such a thing, depending on how the code lies in the cache lines you may get vastly different performance, so benchmarking is a research project in itself for each blob of C code you want to benchmark. The performance range within a single compiler may shadow another compiler or the overlap may be enough to not matter.
If you have access to the tools you add value to yourself professionally by learning to use the competing tools and being able to walk into a job and or within your job choose what you see as the right tool for the job or walk into a Kiel house and be able to work right away or a gnu house and work right away. Where you might lose a job if you are gnu only and the job is for a Kiel house.

We have done some comparisons; IAR and Keil typically outperform GCC with default settings. But with some compiler flags you can make GCC come pretty close to the result of IAR and Keil.
Some of the compilers you mention are integrated development environments. Others are just plain compilers.
Some people prefer a integrated environment with compiler, editor and debugger nicely packaged for you. Others prefer to set up their own environment. It is a matter of taste.
In addition to Yagarto, there is also the "Code Sourcery" distribution of GCC for ARM.

Performance should not be your first concern unless when it becomes so in a production environment. The reason is that first, most ARM compilers are plenty good enough, and really you are down to GCC based, Keil, and IAR. Second, most ARM MCU are "blazingly fast" and have "so much memory" (these are comparing to 8-bit MCU like AVR/PIC but also to older PC). A decent Cortex-M4 MCU runs up to 100MHz and has 256K of flash. Again, to put it in perspective, that's more memory and 10x faster clock rate than the original Macintosh etc. We went to the Moon with much less ;-)
Now the performance of the tools itself, in particular, the IDE and the debuggers, differ greatly. For example, the popular Eclipse is written in Java, might be a bit sluggish to slower or memory-starved PCs. The best thing to do is to install GCC+Eclipse, and the vendors' demos and see for yourself.

Scientific library in C/C with OpenMP

I have to exploit PpenMP in some algorithm and for this purpose I need some mathematical functions, like eig or svd as it is available in MATLAB and it is quite fast in MATLAB. I already tried the following libraries with OpenMP
GSL - GNU Scientific Library
Eigen C++ template library
but I don't know why my OpenMP parallelised code is much slower than the serial code, may be there is some thing wrong in the library, or that the function random, eig or svd are blocking? I have no idea how to figure it out, can some body suggest me which is most compatible math library with OpenMP.

I can recommend Intel's MKL; note that it costs money which may affect your decision. I neither know nor care what language(s) it is written in, just so long as it provides APIs callable from my chosen language. Mine is Fortran, but it has bindings for C too
If you look around SO you'll find many questions from people whose first (or second or third) OpenMP programs were actually slower than their serial versions. Look at some of the answers. Don't conclude that there is a magic bullet, in the shape of a library, to make your code faster. Instead, realise that it is most likely that you've written a poorly-parallelised program and fix that.
Finally, if you have an installation of Matlab, don't expect to be able to write your own routines to outperform Matlab's. I won't say it can't be done, but I think you'll find it very difficult.

GSL is compatible with OpenMP. You can try with Intel Math Kernel Library which comes as a trial version for free.
If the speed up is not so much, then probably the code is not much parallelizable. You may want to debug and see the details of the running threads in Intel Thread Checker, that could be helpful to see where the bottlenecks are.

I think you just want to find a fast implementation of lapack (or related routines) which is already threaded, but it's a little hard to tell from your question. High Performance Mark suggests MKL, which is an excellent example; others include ATLAS or FLAME which are open source but take some doing to build.

Parallel STL algorithms in OS X

I working on converting an existing program to take advantage of some parallel functionality of the STL.
Specifically, I've re-written a big loop to work with std::accumulate. It runs, nicely.
Now, I want to have that accumulate operation run in parallel.
The documentation I've seen for GCC outline two specific steps.
Include the compiler flag -D_GLIBCXX_PARALLEL
Possibly add the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything. The execution time is the same, and I don't see any indication of multiple core usage when monitoring the system.
I get an error when adding the parallel/algorithm header. I thought it would be included with the latest version of gcc (4.7).
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!

See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are going to disappointed in the results in general. Those algorithms generally only begin to show a scalability advantage when working over very large datasets (e.g. N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).

Second alternative is to use OpenMP, which is supported by both GCC and
Clang, though is not STL by any means, but is cross-platform.
Third alternative is to use Grand Central Dispatch - the official multicore API in OSX, again hardly STL.
Forth alternative is to wait for C++17, it will have Parallelism module.

GPU Programming?

I'm new to the GPU Programming world, I've tried reading on Wikipedia and Googling, but I still have several questions:
I downloaded some GPU Examples, for CUDA, there were some .cu files and some CPP files, but all the code was normal C/C++ Code just some weird functions like cudaMemcpyToSymbol and the rest was pure c code. The question is, is the .cu code compiled with nvcc and then linked with gcc? Or how is it programmed?
if I coded something to be run on GPU, will it run on ALL GPUs? or just CUDA? or is there a method to write for CUDA and a Method to write for ATI and a method to write for both?

To answer your second question:
OpenCL is the (only) way to go if you want to write platform independent GPGPU code.
ATIs website actually has a lot of resources for OpenCL if you search a little, and their example projects are very easy to modify into what you need, or just to understand the code.
The OpenCL spec and reference pages is also a very good source of knowledge:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
There are a lot of talks that explain some of the core concepts, and also that explain how to write fast code that I would recommend (that is applicable to CUDA too).
To almost answer your first question:
In OpenCL, the code is compiled at runtime to the specific GPU you're using (to guarantee speed).

You probably want to do some background reading on CUDA - it's not something you can just pick up by looking at a few code samples. There are about 3 different CUDA books on Amazon now, and there is a lot of reference material at http://developer.nvidia.com.
To answer your questions:
yes, .cu files are compiled with nvcc to an intermediate form (PTX) - this is subsequently converted to GPU-specific code at run-time
the generated code will run on a subset of nVidia GPUs, the size of the subset depending on what CUDA capabilities you use in your code

completing the answer given by #nulvinge, I'd say that OpenCL its to GPU Programming like OpenGL is to GPU Rendering. But its not the only option for multi-architecture development, you could also use DirectCompute, but I wouldn't say that its the best option, just if you want your code running on every DirectX11 compatible GPUs, that includes some intel graphics cards chips too right?
But even if you are thinking in doing some GPU programming with OpenCL, do not forget to study the architecture of the platforms that you're using. ATI CPUs, GPUs and NVIDIA GPUs have big differences and your code is needed to be tuned for each platform that you're using if you want to get the most of it...
Fortunately both NVIDIA and AMD have Programming Guides to help you:)

In addition to previous answers, for CUDA you would need a NVIDIA card/GPU, unless you have access for a remote one, which I would recommend this course from Coursera:
Heterogeneous Parallel Programming
It not just gives an introduction to CUDA and OpenCL, memory model, tiling, handling boundary conditions and performance considerations, but also directive-based languages such as OpenACC, a high level language for expressing parallelism into your code, leaving mostly of the parallel programming work for the compiler (good to start with). Also, this course has a online platform where you can use their GPU, which is good to start GPU programming without concerning about software/hardware setup.

If you want to write a portable code which you can execute on different GPU devices and also on CPUs. You need to use OpenCL.
Actually, to configure your kernel you need to write a host code in C. The configuration file might be shorter if you want to write it for CUDA kernels comparing to OpenCL one.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio