Runtime vs. Driver API in CUDA

It's common to use the runtime API when programming with CUDA. Various sources insist that the performance of the two APIs is nearly the same, and that it's better to focus on memory usage and thread organization to improve performance. So what is the real difference between the two APIs?

Quoting from the CUDA documentation (https://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html#driver-vs-runtime-api):

Difference between the driver and runtime APIs

The driver and runtime APIs are very similar and can for the most part be used interchangeably. However, there are some key differences worth noting between the two.

Complexity vs. control

The runtime API eases device code management by providing implicit initialization, context management, and module management. This leads to simpler code, but it also lacks the level of control that the driver API has.
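
To make that concrete, here is a minimal sketch of a driver-API launch (error checking omitted; the module file "kernel.ptx" and kernel name "myKernel" are hypothetical). Every step that the runtime API performs implicitly is explicit here:

    #include <cuda.h>

    void launch_with_driver_api() {
        cuInit(0);                                // explicit initialization
        CUdevice dev;
        cuDeviceGet(&dev, 0);
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);                // explicit context management
        CUmodule mod;
        cuModuleLoad(&mod, "kernel.ptx");         // explicit module management
        CUfunction fn;
        cuModuleGetFunction(&fn, mod, "myKernel");
        // 1 block of 32 threads; this kernel takes no parameters.
        cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, nullptr, nullptr, nullptr);
        cuCtxSynchronize();
        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
    }

With the runtime API, the same launch collapses to myKernel<<<1, 32>>>(); and the initialization, context, and module handling happen behind the scenes on first use.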

Why is having a userspace version of eBPF interesting?

I've seen that userspace versions of eBPF (runtime, assembler, disassembler) are being developed (uBPF, rbpf).
Why is having a userspace version of eBPF interesting?
Do those alternatives focus on the same goals as the eBPF program types (networking, observability, and security)?
One of the main advantages of eBPF is that it runs code in the kernel. Observability, in-kernel data aggregation, early packet processing: it all happens in kernel space. So the question sounds legit: Why were uBPF or rbpf created?
I think they were created mostly as prototypes. uBPF was introduced very early in eBPF history, and was probably a proof-of-concept implementation of an eBPF interpreter and x86_64 JIT in user space. I wrote rbpf, which is strongly based on uBPF, and the main objective for me was to get more familiar with two things: eBPF and Rust. Very little afterthought :).
I've always been curious to see what people could do with it. Truth be told, there are not so many users. The biggest users of rbpf are probably the people from Solana, who implement a blockchain tool with smart contracts run in the eBPF machine. One other use case I've had in the past was to debug some eBPF bytecode, because it is easy to set breakpoints in a user-space interpreter (by contrast, runtime debugging for regular kernel eBPF is quite limited at this time).
uBPF had more success and was used as a basis for other projects, such as DPDK (as a filtering library) or Oko, an extension to Open vSwitch (both about high-performance network processing). [Edit August 2021] More recently, it was chosen to serve as an eBPF runtime for the implementation of eBPF for Windows (references: announcement, my analysis).
As you can see, the interest of having these user-space eBPF machines entirely depends on what you do with them. They're available to run eBPF programs; they don't have a specific focus by themselves, but they can help if you have a use case. In that regard, one of the particularities of uBPF and rbpf is their licenses (Apache, MIT): they are not under the GPL, which means that you can reuse them in a larger number of projects, including proprietary ones. This is not the case with code from the kernel.
One big limitation of those user-space eBPF machines is also that they tend to be quite out of date with regard to what happens in the kernel, where things evolve fast. They don't have a solid verifier, so you cannot assert the security or safety of the programs. They hardly support eBPF maps, if at all; they do not support function calls, or BTF, or even the latest eBPF instructions for that matter. Some of this could be added, but it would require some engineering effort and time. [Edit August 2021] uBPF is getting a lot of activity now that Microsoft contributes to it for its eBPF implementation. They also use a user-space verifier, PREVAIL.
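
For intuition, here is a toy sketch (not the real uBPF/rbpf code) of what the core of such a user-space eBPF machine does: decode fixed 8-byte eBPF instructions and execute them over the eleven eBPF registers. Only three well-known opcodes are handled; everything that makes a real runtime useful (maps, calls, verifier, JIT) is omitted:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One eBPF instruction: opcode, dst/src register nibbles, 16-bit
    // offset, 32-bit immediate (same layout as the kernel's bpf_insn).
    struct EbpfInsn {
        uint8_t opcode;
        uint8_t dst : 4, src : 4;
        int16_t offset;
        int32_t imm;
    };

    uint64_t interpret(const std::vector<EbpfInsn>& prog) {
        uint64_t reg[11] = {};  // r0..r10
        for (const EbpfInsn& ins : prog) {
            switch (ins.opcode) {
            case 0xb7: reg[ins.dst] = (uint64_t)(int64_t)ins.imm; break;  // mov64 dst, imm
            case 0x0f: reg[ins.dst] += reg[ins.src]; break;               // add64 dst, src
            case 0x95: return reg[0];                                     // exit: return r0
            default:   return ~0ULL;  // opcode not handled in this toy
            }
        }
        return reg[0];
    }

    int main() {
        // r0 = 1; r1 = 2; r0 += r1; exit  -> returns 3
        std::vector<EbpfInsn> prog = {
            {0xb7, 0, 0, 0, 1}, {0xb7, 1, 0, 0, 2},
            {0x0f, 0, 1, 0, 0}, {0x95, 0, 0, 0, 0},
        };
        std::printf("result: %llu\n", (unsigned long long)interpret(prog));
    }

Running it prints 3, the value the little add program leaves in r0. Debugging eBPF bytecode with breakpoints, as mentioned above, is exactly this kind of loop stepped through in a debugger.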

Measuring time in omp_fn routines

I am writing a pintool that gathers metrics in a subset of an application's routines (some of which are generated by the compiler).
The goal is to get the execution time of those routines.
Here are the approaches I have already tried:
Of course, doing it with Pin is a bad idea because of the virtual-machine overhead.
The gcc option -finstrument-functions does not cover the OpenMP routines the compiler generates (see the hook sketch after this list for what that option provides).
LD_PRELOAD does not work with OpenMP functions that are statically linked.
Maybe if Pin allowed dumping statically instrumented assembly, we could avoid the virtual-environment overhead, but as far as I know that isn't possible.
I know about the MAQAO instrumentation tool, which does not use a virtual environment, but I want to avoid using too many frameworks or translating my pintool into a MAQAO Lua script.
I guess I am left with manual binary instrumentation, but if anybody has a better solution, the help will be appreciated.
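
For reference, -finstrument-functions works by making the compiler emit calls to two user-provided hooks around every instrumented function. A minimal timing sketch of those hooks (the hook names are GCC's documented ones; the logging logic is simplistic and mine):

    #include <stdio.h>
    #include <time.h>

    /* GCC calls these hooks on entry/exit of every instrumented function.
       The attribute keeps the hooks themselves uninstrumented, which
       would otherwise recurse. */
    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *this_fn, void *call_site) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        fprintf(stderr, "enter %p %ld.%09ld\n", this_fn,
                (long)ts.tv_sec, ts.tv_nsec);
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *this_fn, void *call_site) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        fprintf(stderr, "exit  %p %ld.%09ld\n", this_fn,
                (long)ts.tv_sec, ts.tv_nsec);
    }

The gap, as noted above, is that the compiler-generated OpenMP routines do not get these calls.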
If you just want the results, use a comprehensive measurement infrastructure that supports OpenMP, such as Intel VTune, Extrae/Paraver, or Score-P. These will provide you with profiling or tracing information about the OpenMP regions.
If you want to implement the measurement yourself, you can use the underlying source-to-source transformation tool Opari. You could also use the much cleaner OpenMP tools interface (OMPT), but AFAIK it is not widely supported yet. You might have some luck with recent Intel OpenMP runtimes.
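
To give a feel for the OMPT route, here is a hedged sketch of a tool that logs parallel-region begin/end events (a real tool would record timestamps there), assuming an OpenMP 5.0 runtime that ships omp-tools.h; callback signatures follow the 5.0 spec, and older TR-era runtimes differ:

    #include <omp-tools.h>
    #include <stdio.h>

    /* Called at the begin/end of each parallel region; codeptr_ra
       identifies the region's location in the code. */
    static void on_parallel_begin(ompt_data_t *task_data,
                                  const ompt_frame_t *task_frame,
                                  ompt_data_t *parallel_data,
                                  unsigned int requested_parallelism,
                                  int flags, const void *codeptr_ra) {
        fprintf(stderr, "parallel begin at %p\n", codeptr_ra);
    }

    static void on_parallel_end(ompt_data_t *parallel_data,
                                ompt_data_t *task_data,
                                int flags, const void *codeptr_ra) {
        fprintf(stderr, "parallel end at %p\n", codeptr_ra);
    }

    static int tool_initialize(ompt_function_lookup_t lookup,
                               int initial_device_num,
                               ompt_data_t *tool_data) {
        ompt_set_callback_t set_cb =
            (ompt_set_callback_t)lookup("ompt_set_callback");
        set_cb(ompt_callback_parallel_begin, (ompt_callback_t)on_parallel_begin);
        set_cb(ompt_callback_parallel_end, (ompt_callback_t)on_parallel_end);
        return 1;  /* nonzero keeps the tool active */
    }

    static void tool_finalize(ompt_data_t *tool_data) {}

    /* The OpenMP runtime looks this symbol up at startup. */
    ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                              const char *runtime_version) {
        static ompt_start_tool_result_t result = {tool_initialize,
                                                  tool_finalize, {0}};
        return &result;
    }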

How do these two platforms, staff-wsf and Wt, compare in performance?

How do these two C++ web service frameworks, staff-wsf and Wt ("witty"), compare in performance?
I did not perform any benchmarking, but you can get an idea from the following passage:
Although implemented in C++, Wt’s main focus or novelty is not its performance, but its focus on developing maintainable applications and its extensive library of built-in widgets. But because it is popular and widely used in embedded systems, you will find that performance and footprint have been optimized too, by virtue of a no-nonsense API, thoughtful architecture, and C++ …
(given in the Wt webtoolkit tutorial)

Feasibility of using the same code on both embedded and Windows platforms

We have a program written in VBA that is running on Windows machines.
We have a very similar program written in ANSI C, using a Keil IDE and compiler, that is running on an STR9x microcontroller.
Our plans were to rewrite the VBA code in .NET using C#.
What is the feasibility of writing the shared code in C++ to be used on both systems? Obviously, the .NET framework would be off limits, but that isn't much of a concern. I'm wondering, specifically, about how labor intensive you think the compilation process might be.
This is kind of a theoretical question, I know, but thanks for any thoughts.
I do this as a general practice. I think a better question than "is it possible" is "how should I structure my code to be able to run on both an embedded system and also a PC?"
I prefer to write the code in C and structure each file like a C++ class, using static variables to make global variables private to the module. Create getter and setter functions to access the private variables. Also use function pointers, which I set at initialization of the module, for the methods the module needs to call outside of the module.
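A minimal sketch of that pattern (the module and function names are made up for illustration):

    /* sensor_module.c -- hypothetical module in the style described above:
       file-scope statics act as private members, getters/setters give
       controlled access, and a function pointer (set at init) decouples
       the module from whatever it must call outside itself. */
    #include <stdint.h>

    static uint32_t sample_count;            /* "private" state */
    static void (*report_error)(int code);   /* outward call, set at init */

    void sensor_init(void (*error_handler)(int code)) {
        sample_count = 0;
        report_error = error_handler;
    }

    uint32_t sensor_get_sample_count(void) { return sample_count; }

    void sensor_process_sample(uint32_t raw) {
        if (raw == 0 && report_error) {
            report_error(-1);                /* delegate instead of depending */
            return;
        }
        sample_count++;
    }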
It is also easy to refactor from the above structured C code to a class in C# or C++.
You can also use C++ directly, but using it incorrectly on an embedded system can cause problems.
You will need a hardware abstraction layer if you are accessing any hardware. I separate my code into two types: code that has no reference to what it is running on, and the rest, which I refer to as drivers.
I use this approach for reusing modules such as communication protocols. But more importantly, I use it for testing. I like to use gtest to unit-test the modules. I can also rewrite the drivers and simulate the hardware on a PC, so the same code can run on the PC.
Obviously, the .NET framework would be off limits
Not necessarily true. Given sufficient ROM and RAM resources (256K/64K respectively), the .NET Micro Framework will run on your device. However, that is not necessarily a good reason to use it; there are already two other commonly used portable languages available for both your embedded target and Windows: C and C++. The target resource requirement for both C and C++ is minimal: C/C++ runtime start-up code can be well under 1K, so almost all available resources can be utilised by your application code rather than the run-time environment.
The trick to utilising common code on both platforms is abstraction. This will involve at least hardware abstraction and possibly OS abstraction if your target is using any sort of kernel or scheduler such as an RTOS or thread library.
I'd recommend designing your embedded target with a layered architecture, having at least a device layer and an application layer and, as mentioned already, possibly a system layer that deals with IPC, synchronisation and scheduling, if used. You may have other higher-layer interfaces, such as networking or filesystem, that would equally benefit from abstraction. Note that standard APIs such as BSD sockets or stdio already count as abstraction, so if your target uses these, you have less work to do on Windows (minor differences between BSD sockets and Winsock may still need some work).
The application layer will have no OS or hardware dependencies other than those accessible through the device and system layers. You must then implement the device and system layers on Windows, either as a simulation or as a remapping to services or devices available on Windows. Some RTOSs already include Windows simulators for test and development, but defining your own OS API layer that you can port between a number of native RTOSs and GPOSs will allow your application code to be ported to different targets, for both simulation and real-time execution, very quickly.
Where the platform differences are minor and localised, and may not justify an abstraction layer, target-specific conditional compilation may be appropriate. Compilers provide predefined macros for architecture-, OS- or compiler-specific code, which can be used both for this localised code and to make the abstraction-layer code itself common where there is significant similarity.
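
For example, a small platform-timer shim using the compilers' predefined macros might look like this (hal_timer_read_ms is a hypothetical device-layer call; _WIN32 and __arm__ are standard predefined macros):

    #include <stdint.h>

    #if defined(_WIN32)
    #include <windows.h>
    /* Windows build: map the call to the Win32 tick counter. */
    uint32_t platform_tick_ms(void) {
        return (uint32_t)GetTickCount();
    }
    #elif defined(__arm__)
    /* Embedded build: read a hardware timer through the device layer
       (hal_timer_read_ms is a hypothetical driver function). */
    extern uint32_t hal_timer_read_ms(void);
    uint32_t platform_tick_ms(void) {
        return hal_timer_read_ms();
    }
    #else
    #error "platform_tick_ms: unsupported platform"
    #endif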

What Java application is available to stress-test a virtual machine?

I am interested in ways to stress-test as well as benchmark the SANOS operating system kernel.
While I'm not sure whether this is suitable for kernel testing, you may want to have a look at
SPECjbb2005
""SPECjbb2005 (Java Server Benchmark) is
SPEC's benchmark for evaluating the
performance of server side Java. Like
its predecessor, SPECjbb2000,
SPECjbb2005 evaluates the performance
of server side Java by emulating a
three-tier client/server system (with
emphasis on the middle tier). The
benchmark exercises the
implementations of the JVM (Java
Virtual Machine), JIT (Just-In-Time)
compiler, garbage collection, threads
and some aspects of the operating
system. It also measures the
performance of CPUs, caches, memory
hierarchy and the scalability of
shared memory processors (SMPs).
SPECjbb2005 provides a new enhanced
workload, implemented in a more
object-oriented manner to reflect how
real-world applications are designed
and introduces new features such as
XML processing and BigDecimal
computations to make the benchmark a
more realistic reflection of today's
applications.""
