How to randomize address space at runtime for benchmarking purposes

How to randomize address space at runtime for benchmarking purposes - performance

I'm looking for a mechanism like ASLR for Linux in order to benchmark a distributed application while accounting for incidental layout changes. For background and motivation, see the Stabilizer paper.
The goal is to recreate the behavior of Stabilizer, but in a distributed environment with complex deployment. (As far as I can tell, that project is no longer maintained and never made it past a prototype phase.) In particular, the randomization should take place (repeatedly) at runtime and without needing to invoke the program through a special binary or debugger. On the other hand, I assume full access to source code and the ability to arbitrarily change/recompile the system under test.
By "complex deployment", I mean that there may be different binaries running on different machines, possibly written in different languages. Think Java programs calling into JNI libraries (or other languages using FFI), "main" binaries communicating with sidecar services, etc. The key in all of these cases is that the native code (the target of randomization) is not manually invoked, but is somehow embedded by another program.
I'm only concerned about the randomization aspect (i.e., assume that metrics collection/reporting is handled externally). It's fine if the solution is system-specific (e.g., only runs on Linux or with C++ libraries), but ideally it would be a general pattern that can be applied "anywhere", regardless of the compiler/toolchain/OS.
Side note: layout issues are less of a concern on larger systems thanks to the extra sources of random noise (network, CPU temperatures/throttling, IPC overheads, etc). However, in many cases, distributed applications are deployed on "identical" machines with uniform environments, so there's still plenty of room for correlated performance impacts. The goal is just to debias the process for decision making.

Related

Why isn't there a widely used tool to "warmup" .NET applications to prevent "cold start"?

I understand why cold starts happen (Byte code needs to be turned into machine code through JIT compilation). However with all the generated meta data available for binaries these days I do not understand why there isn't a simple tool that automatically takes the byte code and turns ALL PATHS THROUGH THE CODE (auto discovered) into machine code specific for that target platform. That would mean the first request through any path (assume a rest api) would be fast and not require any further Just In Time Compilation.
We can create an automation test suite or load test to JIT all the paths before allowing the machine into the load balancer rotation (good best practice anyway). We can also flip the "always on" setting in cloud hosting providers to keep the warmed application from getting evicted from memory (requiring the entire process over again). However, it seems like such an archaic process to still be having in 2020.
Why isn't there a tool that does this? What is the limitation that prevents us from using meta data, debug symbols and/or other means to understand how to generate machine code that is already warm and ready for users from the start?

So I have been asking some sharp minds around my professional network and no one seems to be able to point out exactly what limitation makes this so hard to do. However, I did get a few tools on my radar that do what i'm looking for to some level.
Crossgen appears to be the most promising but it's far from widely used among the many peers I've spoken to. Will have to take a closer look.
Also several do some sort of startup task that runs some class initialization and also register them as singletons. I wouldn't consider those much different then just running integration or load tests on the application.

Most programming languages have some form of native image compiler tool. It's up to you to use them if that is what you are looking to do.
Providers are supposed to give you a platform for your application and there is a certain amount of isolation and privacy you should expect from your provider. They should not go digging into your application to figure out all its "paths". That would be very invasive.
Plus "warming up" all paths would be an overly resource intensive process for a provider to be obligated to perform for every application they host.

using hundreds of child processes in Windows?

I have a Windows application which uses many third-party modules of questionable reliability. My app has to create many objects from those modules, and one bad object may cause a lot of problems.
I was thinking of a multi-process scheme where the objects are created in a separate process (the interfaces are basically all the same, so creating the cross-process communication shouldn't be so difficult). At the most extreme, I'm considering one object per process so I might end up with anywhere between 20 processes and a few hundred processes launch from my main app.
Would that cause Windows any problems? I'm using Windows 7, 64-bit, and memory and CPU power won't be an issue.

As long as you have enough CPU power and memory there are no problems. Having the general rules for distributed applications, multithreading (yes, multithreading), deadlocks, atomic operations and co. everything should be fine.

Software patching at a billion miles

Could someone here shed some light about how NASA goes about designing their spacecraft architecture to ensure that they are able to patch bugs in the deployed code?
I have never built any “real time” type systems and this is a question that has come to mind after reading this article:
http://pluto.jhuapl.edu/overview/piPerspective.php?page=piPerspective_05_21_2010
“One of the first major things we’ll
do when we wake the spacecraft up next
week will be uploading almost 20 minor
bug fixes and other code enhancements
to our fault protection (or “autopilot
response”) software.”

I've been a developer on public telephone switching system software, which has pretty severe constraints on reliability, availability, survivability, and performance that approach what spacecraft systems need. I haven't worked on spacecraft (although I did work with many former shuttle programmers while at IBM), and I'm not familiar with VXworks, the operating system used on many spacecraft (including the Mars rovers, which have a phenomenal operating record).
One of the core requirements for patchability is that a system should be designed from the ground up for patching. This includes module structure, so that new variables can be added, and methods replaced, without disrupting current operations. This often means that both old and new code for a changed method will be resident, and the patching operation simply updates the dispatching vector for the class or module.
It is just about mandatory that the patching (and un-patching) software is integrated into the operating system.
When I worked on telephone systems, we generally used patching and module-replacement functions in the system to load and test our new features as well as bug fixes, long before these changes were submitted for builds. Every developer needs to be comfortable with patching and replacing modules as part of their daly work. It builds a level of trust in these components, and makes sure that the patching and replacement code is exercised routinely.
Testing is far more stringent on these systems than anything you've ever encountered on any other project. Complete and partial mock-ups of the deployment system will be readily available. There will likely be virtual machine environments as well, where the complete load can be run and tested. Test plans at all levels above unit test will be written and formally reviewed, just like formal code inspections (and those will be routine as well).
Fault tolerant system design, including software design, is essential. I don't know about spacecraft systems specifically, but something like high-availability clusters is probably standard, with the added capability to run both synchronized and unsynchronized, and with the ability to transfer information between sides during a failover. An added benefit of this system structure is that you can split the system (if necessary), reload the inactive side with a new software load, and test it in the production system without being connected to the system network or bus. When you're satisfied that the new software is running properly, you can simply failover to it.
As with patching, every developer should know how to do failovers, and should do them both during development and testing. In addition, developers should know every software update issue that can force a failover, and should know how to write patches and module replacement that avoid required failovers whenever possible.
In general, these systems are designed from the ground up (hardware, operating system, compilers, and possibly programming language) for these environments. I would not consider Windows, Mac OSX, Linux, or any unix variant, to be sufficiently robust. Part of that is realtime requirements, but the whole issue of reliability and survivability is just as critical.
UPDATE: As another point of interest, here's a blog by one of the Mars rover drivers. This will give you a perspective on the daily life of maintaining an operating spacecraft. Neat stuff!

I've never build real-time system either, but in those system, I suspect their system would not have memory protection mechanism. They do not need it since they wrote all their own software themselves. Without memory protection, it will be trivial for a program to write the memory location of another program and this can be used to hot-patch a running program (writing a self-modifying code was a popular technique in the past, without memory protection the same techniques used for self-modifying code can be used to modify another program's code).
Linux has been able to do minor kernel patching without rebooting for some time with Ksplice. This is necessary for use in situations where any downtime can be catastrophic. I've never used it myself, but I think the technique they uses is basically this:
Ksplice can apply patches to the Linux
kernel without rebooting the computer.
Ksplice takes as input a unified diff
and the original kernel source code,
and it updates the running kernel in
memory. Using Ksplice does not require
any preparation before the system is
originally booted (the running kernel
does not need to have been specially
compiled, for example). In order to
generate an update, Ksplice must
determine what code within the kernel
has been changed by the source code
patch. Ksplice performs this analysis
at the ELF object code layer, rather
than at the C source code layer.
To apply a patch, Ksplice first
freezes execution of a computer so it
is the only program running. The
system verifies that no processors
were in the middle of executing
functions that will be modified by the
patch. Ksplice modifies the beginning
of changed functions so that they
instead point to new, updated versions
of those functions, and modifies data
and structures in memory that need to
be changed. Finally, Ksplice resumes
each processor running where it left
off.
(from Wikipedia)

Well I'm sure they have simulators to test with and mechanisms for hot-patching. Take a look at the linked article below - there's a pretty good overview of the spacecraft design. Section 5 discusses the computation machinery.
http://www.boulder.swri.edu/pkb/ssr/ssr-fountain.pdf
Of note:
Redundant processors
Command switching by the uplink card that does not require processor help
Time-lagged rules

I haven't worked on spacecraft, but the machines I've worked on have all been built to have a stable idle state where it's possible to shut down the machine briefly to patch the firmware. The systems that have accommodated 'live' updates are those that were broken into interacting components, where you can bring down one segment of the system long enough to update it and the other components can continue operating as normal, as they can tolerate the temporary downtime of the serviced component.
One way you can do this is to have parallel (redundant) capabilities, such as parallel machines that all perform the same task, so that the process can be routed around the machine under service. The benefit of this approach is that you can bring it down for longer periods for more significant service, such as regular hardware preventative maintenance. Once you have this capability, supporting downtime for a firmware patch is fairly easy.

One of the approaches that's been used in the past is to use LISP.

Analogs of Intel's Cluster OpenMP

Are there analogs of Intel Cluster OpenMP? This library simulates shared-memory machine (like SMP or NUMA) while running on distributed memory machine (like Ethernet-connected cluster of PC's).
This library allows to start openmp programs directly on cluster.
I search for
libraries, which allow multithreaded programms run on distributed cluster
or libraries (replacement of e.g. libgomp), which allow OpenMP programms run on distributed cluster
or compilers, capable of generating cluster code from openmp programms, besides Intel C++

The keyword you want to be searching for is "distributed shared memory"; there's a Wikipedia page on the subject. MOSIX, which became openMOSIX, which is now being developed as part of LinuxPMI, is the closest thing I'm aware of; but I don't have much experience with the current LinuxPMI project.
One thing you need to be aware of is that none of these systems work especially well, performance-wise. (Maybe a more optimistic way of saying it is that it's a tribute to the developers that these things work at all). You can't just abstract away the fact that accessing on-node memory is very very different from memory on some other node over a network. Even making local memory systems fast is difficult and requires a lot of hardware; you can't just hope that a little bit of software will hide the fact that you're now doing things over a network.
The performance ramifications are especially important when you consider that OpenMP programs you might want to run are almost always going to be written assuming that memory accesses are local and thus cheap, because, well, that's what OpenMP is for. False sharing is bad enough when you're talking about different sockets accessing a common cache line - page-based false sharing across a network is just disasterous.
Now, it could well be that you have a very simple program with very little actual shared state, and a distributed shared memory system wouldn't be so bad -- but in that case I've got to think you'd be better off in the long run just migrating the problem away from a shared-memory-based model like OpenMP towards something that'll work better in a cluster environment anyway.

Supplying 64 bit specific versions of your software

Would I expect to see any performance gain by building my native C++ Client and Server into 64 bit code?
What sort of applications benefit from having a 64 bit specific build?
I'd imagine anything that makes extensive use of long would benefit, or any application that needs a huge amount of memory (i.e. more than 2Gb), but I'm not sure what else.

Architectural benefits of Intel x64 vs. x86
larger address space
a richer register set
can link against external libraries or load plugins that are 64-bit
Architectural downside of x64 mode
all pointers (and thus many instructions too) take up 2x the memory, cutting the effective processor cache size in half in the worst case
cannot link against external libraries or load plugins that are 32-bit
In applications I've written, I've sometimes seen big speedups (30%) and sometimes seen big slowdowns (> 2x) when switching to 64-bit. The big speedups have happened in number crunching / video processing applications where I was register-bound.
The only big slowdown I've seen in my own code when converting to 64-bit is from a massive pointer-chasing application where one compiler made some really bad "optimizations". Another compiler generated code where the performance difference was negligible.
Benefit of porting now
Writing 64-bit-compatible code isn't that hard 99% of the time, once you know what to watch out for. Mostly, it boils down to using size_t and ptrdiff_t instead of int when referring to memory addresses (I'm assuming C/C++ code here). It can be a pain to convert a lot of code that wasn't written to be 64-bit-aware.
Even if it doesn't make sense to make a 64-bit build for your application (it probably doesn't), it's worth the time to learn what it would take to make the build so that at least all new code and future refactorings will be 64-bit-compatible.

Before working too hard on figuring out whether there is a technical case for the 64-bit build, you must verify that there is a business case. Are your customers asking for such a build? Will it give you a definitive leg up in competition with other vendors? What is the cost for creating such a build and what business costs will be incurred by adding another item to your accounting, sales and marketing processes?
While I recognize that you need to understand the potential for performance improvements before you can get a handle on competitive advantages, I'd strongly suggest that you approach the problem from the big picture perspective. If you are a small or solo business, you owe it to yourself to do the appropriate due diligence. If you work for a larger organization, your superiors will greatly appreciate the effort you put into thinking about these questions (or will consider the whole issue just geeky excess if you seem unprepared to answer them).
With all of that said, my overall technical response would be that the vast majority of user-facing apps will see no advantage from a 64-bit build. Think about it: how much of the performance problems in your current app comes from being processor-bound (or RAM-access bound)? Is there a performance problem in your current app? (If not, you probably shouldn't be asking this question.)
If it is a Client/Server app, my bet is that network latency contributes far more to the performance on the client side (especially if your queries typically return a lot of data). Assuming that this is a database app, how much of your performance profile is due to disk latency times on the server? If you think about the entire constellation of factors that affect performance, you'll get a better handle on whether your specific app would benefit from a 64-bit upgrade and, if so, whether you need to upgrade both sides or whether all of your benefit would derive just from the server-side upgrade.

Not much else, really. Though writing a 64-bit app can have some advantages to you, as the programmer, in some cases. A simplistic example is an application whose primary focus is interacting with the registry. As a 32-bit process, your app would not have access to large swaths of the registry on 64-bit systems.

Continuing #mdbritt's comment, building for 64-bit makes far more sense [currently] if it's a server build, or if you're distributing to Linux users.
It seems that far more Windows workstations are still 32-bit, and there may not be a large customer base for a new build.
On the other hand, many server installs are 64-bit now: RHEL, Windows, SLES, etc. NOT building for them would be cutting-off a lot of potential usage, in my opinion.
Desktop Linux users are also likely to be running the 64-bit versions of their favorite distro (most likely Ubuntu, SuSE, or Fedora).
The main obvious benefit of building for 64-bit, however, is that you get around the 3GB barrier for memory usage.

According to this web page you benefit most from the extra general-purpose registers with 64 bit CPU if you use a lot of and/or deep loops.

You can expect gain thanks to the additional registers and the new passing parameters convention (which are really linked to the additional registers).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio