What Java application is available to stress-test a virtual machine?

I am interested in ways to stress-test as well as benchmark the SANOS operating system kernel.

While I'm not sure if this is suitable for kernel testing, you may want to have a look at SPECjbb2005:
""SPECjbb2005 (Java Server Benchmark) is
SPEC's benchmark for evaluating the
performance of server side Java. Like
its predecessor, SPECjbb2000,
SPECjbb2005 evaluates the performance
of server side Java by emulating a
three-tier client/server system (with
emphasis on the middle tier). The
benchmark exercises the
implementations of the JVM (Java
Virtual Machine), JIT (Just-In-Time)
compiler, garbage collection, threads
and some aspects of the operating
system. It also measures the
performance of CPUs, caches, memory
hierarchy and the scalability of
shared memory processors (SMPs).
SPECjbb2005 provides a new enhanced
workload, implemented in a more
object-oriented manner to reflect how
real-world applications are designed
and introduces new features such as
XML processing and BigDecimal
computations to make the benchmark a
more realistic reflection of today's
applications.""

Related

Runtime vs Driver API in CUDA

It's common to use the runtime API in function calls when programming with CUDA. Various sources insist that the performance of the two APIs is nearly the same, and that it's better to focus on memory usage and thread organization to improve performance. So what is the real difference between the two APIs?
Quoting from the CUDA documentation: https://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html#driver-vs-runtime-api
Difference between the driver and runtime APIs
The driver and runtime APIs are very similar and can for the most part be used interchangeably. However, there are some key differences worth noting between the two.

Complexity vs. control

The runtime API eases device code management by providing implicit initialization, context management, and module management. This leads to simpler code, but it also lacks the level of control that the driver API has.
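
To make the quoted difference concrete, here is a minimal host-side sketch of my own (not from the CUDA docs) that performs the same 1 MiB device allocation through both APIs. Note how the driver path must initialize the API and manage a context explicitly, steps the runtime API performs implicitly on the first call:

    #include <stdio.h>
    #include <cuda.h>             /* driver API: cuInit, cuMemAlloc, ... */
    #include <cuda_runtime.h>     /* runtime API: cudaMalloc, cudaFree, ... */

    int main(void)
    {
        /* Runtime API: no explicit init; the first call sets up a context. */
        void *rt_buf = NULL;
        cudaError_t err = cudaMalloc(&rt_buf, 1 << 20);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaFree(rt_buf);

        /* Driver API: initialization and context management are explicit. */
        CUdevice dev;
        CUcontext ctx;
        CUdeviceptr drv_buf;
        cuInit(0);                     /* must precede all other driver calls */
        cuDeviceGet(&dev, 0);          /* pick device 0 */
        cuCtxCreate(&ctx, 0, dev);     /* create and push a context */
        cuMemAlloc(&drv_buf, 1 << 20); /* same allocation, driver style */
        cuMemFree(drv_buf);
        cuCtxDestroy(ctx);             /* tear the context down ourselves */
        return 0;
    }

(Built with, e.g., nvcc, linking -lcuda for the driver API; error checking on the driver calls is omitted for brevity.)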

Why is the Windows NT kernel said to be a hybrid model?

According to Wikipedia, the Windows kernel is a hybrid model, meaning it combines elements of both monolithic and microkernel architectures.
But the two definitions are opposites: monolithic means system services and core functionality share one space, while microkernel means they do not.
So I take that to mean that Windows has shared space for some system services and core functionality, while others are decoupled.
I'm trying my best to understand this but it's very cryptic to me, although I'm a professional software engineer.
Do you perhaps have a relatable example of where it is monolithic and where it is microkernel-like?
And to what extent is it similar to, and to what extent totally different from, say, the Ubuntu kernel, which is said to be fully monolithic?
Generally speaking, a microkernel has very few services provided by the kernel itself executing in kernel mode, while a monolithic kernel has the vast majority of services (especially drivers) running in kernel mode.
Many monolithic OSes are taking the approach of running some of their services and drivers at user level, and this is what they mean by hybrid. They might keep the network drivers completely in the kernel but run GPU drivers at user level, for example.

How to run an OpenMP program on clusters with multiple nodes? [duplicate]

I want to know if it would be possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor software that allows one to create a virtual NUMA machine on top of many networked hosts, then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but its development has also stopped. Besides, their solution was IA64-specific (IA64 is Intel Itanium, not to be confused with Intel 64, which is their current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be confused with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP built into the Intel compiler suite, i.e. you couldn't use it with GCC (this request to start ClOMP development for GCC went into limbo). If you have access to old versions of Intel compilers (versions 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. Then again, starting with version 12.0, Intel compilers no longer support ClOMP.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems to be mature enough for production HPC environments (and it's priced accordingly). It seems that most effort now goes into the development of partitioned global address space languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC or invest some time in learning MPI, as it is definitely not going away in the years to come.
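If you do decide to look at MPI, the entry cost is lower than it may seem. A minimal sketch of mine (standard MPI calls only, nothing from the answer above) that can run one process per node:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */

        printf("hello from rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched with something like mpirun -np 4 --hostfile hosts ./a.out, the four processes can be spread over several physical machines.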
Earlier, there was Cluster OpenMP.
Cluster OpenMP was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. This had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. (source)
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible relative to the number of accesses to protected pages. This means that once a page is brought up to date on a given node, a large number of accesses should be made to it before the next synchronization. To accomplish this, a program should have as little synchronization as possible and reuse the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality. (source)
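
To illustrate the point about fine-grained synchronization with a generic OpenMP sketch of mine (not specific to Cluster OpenMP): an atomic update per iteration synchronizes constantly, while a reduction lets each thread accumulate privately and combine results once at the end.

    /* Fine-grained: every iteration synchronizes on 'sum'. */
    double sum_atomic(const double *a, int n)
    {
        double sum = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp atomic
            sum += a[i];
        }
        return sum;
    }

    /* Coarse-grained: threads accumulate privately, combine once. */
    double sum_reduction(const double *a, int n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+: sum)
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

On a distributed shared-memory system the atomic version would additionally bounce the page holding 'sum' between nodes on every update.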
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives that describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs that take data movement into account.
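For instance, here is a minimal sketch of mine (not from the answers above) of the data-movement directives in question; with the LLVM remote offloading plugin, the "device" that the mapped data travels to can in principle be a remote host:

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++) {
            a[i] = 1.0;
            b[i] = 2.0;
        }

        /* The map clauses make data movement explicit: a and b are
         * copied to the device, sum travels both ways for the reduction. */
        #pragma omp target teams distribute parallel for \
                map(to: a, b) map(tofrom: sum) reduction(+: sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %f\n", sum);  /* expect 2000000.0 */
        return 0;
    }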

High-performance runtime

This is the first time I've submitted a question to this forum.
I'm posting a general question; I don't have to develop an application for a specific purpose.
After a lot of "googling" I still haven't found a language/runtime/script engine/virtual machine that matches these 5 requirements:
1. memory allocation of variables/values or objects cleaned up at run time (e.g. à la C++, which uses the keyword delete, or free in C)
2. the language (and consequently the program) is a script or pseudo-compiled à la byte code, portable across the main operating systems (Windows, Linux, *BSD, Solaris) and platforms (32/64-bit)
3. native use of multicore (engine/runtime)
4. no limit on heap usage
5. a library for networking
I'm agnostic about the programming language used for building applications that run on this engine (the paradigm is not important).
I hope that this post won't stir up a holy war, but I'd like to focus on the engine's behavior during program execution.
Sorry for my bad English.
Luke
I think Erlang might fit your requirements:

- Most data is either allocated in local scopes, and therefore freed immediately after use, or kept in library-backed permanent storage such as ETS, DETS or Mnesia. There is garbage collection, though, but the paradigm of the language makes it less of a concern.
- The Erlang compiler compiles source code to byte code for the BEAM virtual machine, which, unlike Java's, is register-based and thus much faster. The VM is available for:
  - Solaris (including 64-bit)
  - BSD
  - Linux
  - OS X
  - Tru64
  - Windows NT/2000/2003/XP/Vista/7
  - VxWorks
- Erlang has been designed for distributed systems, concurrency and reliability from day one.
- Erlang's heap grows with your demand for it; it is initially limited and expanded automatically (there are numerous tweaks you can use to configure this on a per-VM basis).
- Erlang comes from a networking background and provides tons of libraries, from IP up to higher-level protocols.

Feasibility of using the same code on both embedded and Windows platforms

We have a program written in VBA that is running on Windows machines.
We have a very similar program written in ANSI C, using the Keil IDE and compiler, that runs on an STR9x microcontroller.
Our plans were to rewrite the VBA code in .NET using C#.
What is the feasibility of writing the shared code in C++ to be used on both systems? Obviously, the .NET framework would be off limits, but that isn't much of a concern. I'm wondering, specifically, about how labor-intensive you think the compilation process might be.
This is kind of a theoretical question, I know, but thanks for any thoughts.
I do this as a general practice. I think a better question than "is it possible" is "how should I structure my code to be able to run on both an embedded system and also a PC".
I prefer to write the code in C and structure each file as a C++-style class, using static variables to make global variables private to the module. Create getter and setter functions to access the private variables. Also use function pointers, set at initialization of the module, for the methods the module needs to call outside of itself.
It is also easy to refactor from the above structured C code to a class in C# or C++.
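A rough sketch of that module pattern (my own illustration, with made-up names, not code from the answer):

    /* sensor.h - the module's public interface */
    typedef void (*sensor_report_fn)(int value); /* callback out of module */

    void sensor_init(sensor_report_fn report);   /* bind outgoing "method" */
    void sensor_set_offset(int offset);          /* setter for private state */
    int  sensor_get_offset(void);                /* getter for private state */
    void sensor_process(int raw);                /* the module's logic */

    /* sensor.c - statics act like private class members */
    #include "sensor.h"

    static int offset;                  /* private to this "class" */
    static sensor_report_fn report_cb;  /* set at init, like a bound method */

    void sensor_init(sensor_report_fn report) { report_cb = report; }
    void sensor_set_offset(int o)             { offset = o; }
    int  sensor_get_offset(void)              { return offset; }

    void sensor_process(int raw)
    {
        if (report_cb)
            report_cb(raw + offset);    /* call out through the pointer */
    }

On the target, the callback can point at a real hardware driver; in a PC unit test it can point at a stub that records the reported value.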
You can also use C++ directly, but using it incorrectly on an embedded system can cause problems.
You will need a hardware abstraction layer if you are accessing any hardware. I separate my code into two types: code that has no reference to what it is running on, and the rest, which I refer to as drivers.
I use this structure to reuse modules for things like communication protocols, but more importantly I use it for testing. I like to use gtest to unit-test the modules. I can also rewrite the drivers to simulate the hardware, so everything can run on a PC.
Obviously, the .NET framework would be off limits
Not necessarily true. Given sufficient ROM and RAM resources (256K/64K respectively), the .NET Micro Framework will run on your device. However, that is not necessarily a good reason to use it; there are already two other commonly used portable languages available for both your embedded target and Windows: C and C++. The target resources required for both C and C++ are minimal: C/C++ runtime start-up code can be well under 1K, so almost all available resources can be utilised by your application code rather than the run-time environment.
The trick to utilising common code on both platforms is abstraction. This will involve at least hardware abstraction and possibly OS abstraction if your target is using any sort of kernel or scheduler such as an RTOS or thread library.
I'd recommend designing your embedded target with a layered architecture, having at least a device layer and an application layer and, as mentioned already, possibly a system layer that deals with IPC, synchronisation and scheduling, if used. You may have other higher-layer interfaces such as networking or a filesystem that would equally benefit from abstraction. Note that standard APIs such as BSD sockets or stdio already count as abstraction, so if your target uses these, you have less work to do on Windows (minor differences between BSD sockets and Winsock may still need some work).
The application layer will have no OS or hardware dependencies other than those accessible through the device and system layers. You must then implement the device and system layers on Windows as either a simulation or a remapping to services or devices available on Windows. Some RTOSes already include Windows simulators for test and development, but defining your own OS API layer that you can port between a number of native RTOSes and GPOSes will allow your application code to be ported to different targets for both simulation and real-time execution very quickly.
Where the platform differences are minor and localised, and may not justify an abstraction layer, target-specific conditional compilation may be appropriate. Compilers provide predefined macros for architecture-, OS- or compiler-specific code that can be used both for this localised code and to make the abstraction-layer code itself common where there is significant similarity.
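For example, a minimal sketch of such localised conditional compilation (the predefined macro _WIN32 is standard; the helper name is my own):

    /* millisecond delay, remapped per platform via a predefined macro */
    #ifdef _WIN32
      #include <windows.h>
      static void sleep_ms(unsigned ms) { Sleep(ms); }
    #else
      #include <unistd.h>
      static void sleep_ms(unsigned ms) { usleep(ms * 1000u); }
    #endif

On the embedded target the same function might instead spin on a hardware timer, but callers throughout the common code stay identical.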
