I do computations on the Amazon EC2 platform, using multiple machines connected through Open MPI. To reduce the cost of the computation, spot instances are used, which are automatically shut down when the spot price rises above a preset maximum: http://aws.amazon.com/ec2/spot-instances/ . A weird behaviour occurs: when a machine is shut down, the other processes in the MPI communicator continue to run. I think the network interfaces are silenced before the process has time to indicate to the other processes that it has received a kill signal.
I have read in multiple posts that MPI does not provide many high-level facilities for fault tolerance. On the other hand, the structure of my program is very simple: slave processes query a master process for permission to execute a portion of code. The master process only keeps track of the number of queries it has replied to, and tells a slave to stop when an upper limit is reached. There is no coupling between the slaves.
I would like to be able to detect when a process has silently died, as described above. In that case I would reassign the work it was doing to a slave that is still alive. Is there a simple way to check whether a process has died? I have thought of using threads and sockets to do that independently of the rest of the MPI layer, but that seems cumbersome. I have also thought of maintaining, on the master process (which is launched on a non-spot instance), a list of the time of last communication with each process and specifying a timeout, but that would not guarantee that a slave process is actually dead. There is also the problem that the barrier and finalize functions will not see all the processes, and may hang.
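The timeout bookkeeping described above can be sketched independently of the MPI layer. Here is a minimal sketch in Python; the class name and API are my own invention, not part of any MPI library. Note that silence only makes a worker *suspected* dead, exactly as the question points out: a slow worker is indistinguishable from a killed one.

```python
import time

class LivenessTracker:
    """Tracks the time of last contact with each worker rank and
    flags workers that have been silent longer than a timeout."""

    def __init__(self, ranks, timeout):
        self.timeout = timeout
        self.last_seen = {rank: time.monotonic() for rank in ranks}

    def record_contact(self, rank, now=None):
        # Call this whenever a message from `rank` is received.
        self.last_seen[rank] = time.monotonic() if now is None else now

    def suspected_dead(self, now=None):
        # Workers silent for longer than the timeout are only
        # *suspected* dead; a timeout cannot prove a process is gone.
        now = time.monotonic() if now is None else now
        return [r for r, t in self.last_seen.items() if now - t > self.timeout]
```

On the master, you would call `record_contact` after each receive from a worker, poll `suspected_dead` periodically, and reassign the corresponding work units; collective operations over the full communicator (barrier, finalize) would still have to be avoided once a rank is presumed gone.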
My question would then be: what kind of solution would you implement to detect if processes have silently died? And how would you modify the remainder of the code to be compatible with a reduced number of processes?
Which version of Open MPI are you using?
I'm not sure exactly what Open MPI might be doing (or not doing) that wouldn't detect that a process is gone. The usual behavior of Open MPI after a failure is that the runtime would abort the entire job.
Unfortunately, there is no mechanism in Open MPI for discovering failed processes (especially in the case where it sounds like Open MPI doesn't even know they're failed). However, there is a lot of work ongoing to add this to future versions of all MPI libraries. One of the example implementations that supports this behavior is a branch of Open MPI called ULFM (www.fault-tolerance.org). There's lots of documentation there to see exactly what's going on, but essentially, it's a new chapter in the MPI standard to add fault tolerance.
There is an older effort that's available in MPICH 3.0.3 (unfortunately, it's broken in 3.0.4, but it should be back for 3.1) (www.mpich.org). The documentation for using that work is in the README.
The problem with both of these efforts is that they aren't compliant with the MPI Standard. Eventually, there will be a chapter describing fault tolerance in MPI and all of the MPI implementations will become compatible, but in the meantime, there is no good solution for everyone.
PVM might be a reasonable alternative to MPI in your case. Although it is no longer developed, having lost out to MPI years ago, PVM still comes pre-packaged with most Linux distributions and provides built-in fault tolerance. Its API is conceptually very similar to that of MPI, but its execution model differs a bit. One could say that it allows for one degree less coupling between the tasks in a parallel program than MPI does.
There is an example implementation of a fault-tolerant master-worker PVM application in Beowulf Cluster Computing with Linux; see the relevant chapter of the book.
As for fault tolerance in MPI, the proposed addition to the standard was rejected when the MPI Forum voted for inclusion of new features in MPI-3.0. It might take much longer than anticipated before FT becomes a standard feature of MPI.
I have a Windows application which uses many third-party modules of questionable reliability. My app has to create many objects from those modules, and one bad object may cause a lot of problems.
I was thinking of a multi-process scheme where the objects are created in a separate process (the interfaces are basically all the same, so creating the cross-process communication shouldn't be too difficult). At the most extreme, I'm considering one object per process, so I might end up with anywhere between 20 and a few hundred processes launched from my main app.
Would that cause Windows any problems? I'm using Windows 7, 64-bit, and memory and CPU power won't be an issue.
As long as you have enough CPU power and memory, there are no problems. Follow the general rules for distributed applications, multithreading (yes, multithreading), deadlocks, atomic operations and so on, and everything should be fine.
My question is one I have pondered while working on a demanding network application that explicitly shared a task across the network, using a server to assign the job to each computer individually and "share the load".
I wondered: could this be done in a more implicit manner?
Question
Is there a possibility of distributing processor intensive tasks around a voluntary and public network of computers to make the job run more efficiently without requiring the job's program or process to be installed on each computer?
Scenario
Let's say we have a ridiculously intensive mathematics scenario where I am trying to get my computer to calculate the prime factorization breakdown of every number from 1 to 10,000,000 and store them in a database (assuming I have the space, and that the algorithms are already implemented in their own class, program, dynamic link library or any runnable process).
Now it would be more efficient to share this burdensome computation across a network or run it on a multi-core supercomputer, but both are expensive. To my knowledge, you would need a specifically designed program to run the specific algorithm, have that program installed across said cloud/distributed computing network, and have a server keep track of what each computer is doing (i.e. which number it is currently factorizing).
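For concreteness, the per-number work unit in this scenario could be as simple as a trial-division factorization. This is just an illustrative sketch of the task itself, not the distribution machinery:

```python
def prime_factors(n):
    """Trial-division prime factorization: returns the prime
    breakdown of n as a list, e.g. 12 -> [2, 2, 3]."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors
```

A distributing scheme would hand each volunteer a range of numbers, collect the resulting lists, and write them to the database; the function is pure, so it does not matter which machine runs it.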
Conclusion
Overall:
Would it be possible to create a cloud program / OS / suite where you could share processor time for an unspecified type of process?
If so how would you implement it, where would you start?
Would you make an OS dedicated to running unspecified, non-explicit tasks, or could it be done with a cloud-enabled program installed on the computers of volunteers who were willing to share a percentage of their processor clock to help the general community?
If this was implementable, would you be a voluntary part of the greater cloud?
I would love to hear everyone's thoughts and possible solutions as this would be a wonderful project to start.
I have been dealing with the same challenge for the last few months.
My conclusions thus far:
The main problem with using a public network (internet) for cloud computing is in addressing the computers through NATs and firewalls. Solving this is non-trivial.
Asking all participants to open ports in their firewalls and configure their router for port-forwarding is generally too much to ask for 95% of users and can pose severe security threats.
A solution is to use a registration server where all peers register themselves and can get in contact with others. Connections are kept open by the server and communication is routed through it. There are several variations. In all scenarios, however, this requires vast resources to keep everything scalable, and is therefore out of reach for anyone but large corporations.
More practical solutions are offered by commercial cloud platforms like Azure (for .Net) or just the .Net ServiceBus. Since everything runs in the cloud, there will be no problems with reaching computers behind NATs and firewalls, and even if you need to do so to reach "on-premise" computers or those of clients, this can be done through the ServiceBus.
Would you trust someone else's code to run on your computer?
It's more practical not to ask their permission ;)
I once wrote a programming competition solver in Haxe, in a Flash banner on a friend's fairly popular website...
I would not expect a program that can "share processor time for an unspecified type of process". A prerequisite of sharing processor time is that the task can be divided into multiple subtasks.
To allow automatic sharing of processor time for any divisible task, the key to a solution would be an AI program clever enough to know how to divide a task into subtasks, which does not seem realistic.
Could someone here shed some light about how NASA goes about designing their spacecraft architecture to ensure that they are able to patch bugs in the deployed code?
I have never built any “real time” type systems and this is a question that has come to mind after reading this article:
http://pluto.jhuapl.edu/overview/piPerspective.php?page=piPerspective_05_21_2010
“One of the first major things we’ll do when we wake the spacecraft up next week will be uploading almost 20 minor bug fixes and other code enhancements to our fault protection (or “autopilot response”) software.”
I've been a developer on public telephone switching system software, which has pretty severe constraints on reliability, availability, survivability, and performance that approach what spacecraft systems need. I haven't worked on spacecraft (although I did work with many former shuttle programmers while at IBM), and I'm not familiar with VxWorks, the operating system used on many spacecraft (including the Mars rovers, which have a phenomenal operating record).
One of the core requirements for patchability is that a system should be designed from the ground up for patching. This includes module structure, so that new variables can be added, and methods replaced, without disrupting current operations. This often means that both old and new code for a changed method will be resident, and the patching operation simply updates the dispatching vector for the class or module.
It is just about mandatory that the patching (and un-patching) software is integrated into the operating system.
When I worked on telephone systems, we generally used the patching and module-replacement functions in the system to load and test our new features as well as bug fixes, long before these changes were submitted for builds. Every developer needs to be comfortable with patching and replacing modules as part of their daily work. It builds a level of trust in these components, and makes sure that the patching and replacement code is exercised routinely.
Testing is far more stringent on these systems than anything you've ever encountered on any other project. Complete and partial mock-ups of the deployment system will be readily available. There will likely be virtual machine environments as well, where the complete load can be run and tested. Test plans at all levels above unit test will be written and formally reviewed, just like formal code inspections (and those will be routine as well).
Fault tolerant system design, including software design, is essential. I don't know about spacecraft systems specifically, but something like high-availability clusters is probably standard, with the added capability to run both synchronized and unsynchronized, and with the ability to transfer information between sides during a failover. An added benefit of this system structure is that you can split the system (if necessary), reload the inactive side with a new software load, and test it in the production system without being connected to the system network or bus. When you're satisfied that the new software is running properly, you can simply failover to it.
As with patching, every developer should know how to do failovers, and should do them both during development and testing. In addition, developers should know every software update issue that can force a failover, and should know how to write patches and module replacement that avoid required failovers whenever possible.
In general, these systems are designed from the ground up (hardware, operating system, compilers, and possibly programming language) for these environments. I would not consider Windows, Mac OSX, Linux, or any unix variant, to be sufficiently robust. Part of that is realtime requirements, but the whole issue of reliability and survivability is just as critical.
UPDATE: As another point of interest, here's a blog by one of the Mars rover drivers. This will give you a perspective on the daily life of maintaining an operating spacecraft. Neat stuff!
I've never built real-time systems either, but I suspect such systems would not have a memory protection mechanism. They do not need it, since they wrote all the software themselves. Without memory protection, it is trivial for one program to write to the memory of another program, and this can be used to hot-patch a running program (self-modifying code was a popular technique in the past; without memory protection, the same techniques used for self-modifying code can be used to modify another program's code).
Linux has been able to do minor kernel patching without rebooting for some time with Ksplice. This is necessary in situations where any downtime can be catastrophic. I've never used it myself, but I think the technique they use is basically this:
Ksplice can apply patches to the Linux kernel without rebooting the computer. Ksplice takes as input a unified diff and the original kernel source code, and it updates the running kernel in memory. Using Ksplice does not require any preparation before the system is originally booted (the running kernel does not need to have been specially compiled, for example). In order to generate an update, Ksplice must determine what code within the kernel has been changed by the source code patch. Ksplice performs this analysis at the ELF object code layer, rather than at the C source code layer.

To apply a patch, Ksplice first freezes execution of a computer so it is the only program running. The system verifies that no processors were in the middle of executing functions that will be modified by the patch. Ksplice modifies the beginning of changed functions so that they instead point to new, updated versions of those functions, and modifies data and structures in memory that need to be changed. Finally, Ksplice resumes each processor running where it left off.

(from Wikipedia)
Well I'm sure they have simulators to test with and mechanisms for hot-patching. Take a look at the linked article below - there's a pretty good overview of the spacecraft design. Section 5 discusses the computation machinery.
http://www.boulder.swri.edu/pkb/ssr/ssr-fountain.pdf
Of note:
Redundant processors
Command switching by the uplink card that does not require processor help
Time-lagged rules
I haven't worked on spacecraft, but the machines I've worked on have all been built to have a stable idle state where it's possible to shut down the machine briefly to patch the firmware. The systems that have accommodated 'live' updates are those that were broken into interacting components, where you can bring down one segment of the system long enough to update it and the other components can continue operating as normal, as they can tolerate the temporary downtime of the serviced component.
One way you can do this is to have parallel (redundant) capabilities, such as parallel machines that all perform the same task, so that the process can be routed around the machine under service. The benefit of this approach is that you can bring it down for longer periods for more significant service, such as regular hardware preventative maintenance. Once you have this capability, supporting downtime for a firmware patch is fairly easy.
One of the approaches that's been used in the past is to use LISP.
Are there analogs of Intel Cluster OpenMP? This library simulates a shared-memory machine (like SMP or NUMA) while running on a distributed-memory machine (like an Ethernet-connected cluster of PCs).
This library allows starting OpenMP programs directly on a cluster.
I am searching for:
libraries that allow multithreaded programs to run on a distributed cluster,
or libraries (replacements for e.g. libgomp) that allow OpenMP programs to run on a distributed cluster,
or compilers capable of generating cluster code from OpenMP programs, besides Intel C++.
The keyword you want to be searching for is "distributed shared memory"; there's a Wikipedia page on the subject. MOSIX, which became openMOSIX, which is now being developed as part of LinuxPMI, is the closest thing I'm aware of; but I don't have much experience with the current LinuxPMI project.
One thing you need to be aware of is that none of these systems work especially well, performance-wise. (Maybe a more optimistic way of saying it is that it's a tribute to the developers that these things work at all). You can't just abstract away the fact that accessing on-node memory is very very different from memory on some other node over a network. Even making local memory systems fast is difficult and requires a lot of hardware; you can't just hope that a little bit of software will hide the fact that you're now doing things over a network.
The performance ramifications are especially important when you consider that the OpenMP programs you might want to run are almost always going to be written assuming that memory accesses are local and thus cheap, because, well, that's what OpenMP is for. False sharing is bad enough when you're talking about different sockets accessing a common cache line; page-based false sharing across a network is just disastrous.
Now, it could well be that you have a very simple program with very little actual shared state, and a distributed shared memory system wouldn't be so bad -- but in that case I've got to think you'd be better off in the long run just migrating the problem away from a shared-memory-based model like OpenMP towards something that'll work better in a cluster environment anyway.
How often do you solve your problems by restarting a computer, router, program, browser? Or even by reinstalling the operating system or software component?
This seems to be a common pattern: when there is a suspicion that a software component does not maintain its state correctly, you get back to the initial state by restarting the component.
I've heard that Amazon and Google have clusters of many, many nodes, and that one important property of each node is that it can restart in seconds. So if one of them fails, returning it to its initial state is just a matter of restarting it.
Are there any languages/frameworks/design patterns out there that leverage this technique as a first-class citizen?
EDIT The link that describes some principles behind Amazon as well as overall principles of availability and consistency:
http://www.infoq.com/presentations/availability-consistency
This is actually very rare in the unix/linux world. Those OSes (and Windows too) were designed to protect themselves from badly behaved processes. I am sure Google is not relying on hard restarts to correct misbehaving software. I would say this technique should not be employed, and if someone tells you that restarting is the fastest route to recovery for their software, you should look for something else!
This is common in the embedded systems world, and in telecommunications. It's much less common in the server based world.
There's a research group you might be interested in. They've been working on Recovery-Oriented Computing or "ROC". The key principle in ROC is that the cleanest, best, most reliable state that any program can be in is right after starting up. Therefore, on detecting a fault, they prefer to restart the software rather than attempt to recover from the fault.
Sounds simple enough, right? Well, most of the research has gone into implementing that idea. The reason is exactly what you and other commenters have pointed out: OS restarts are too slow to be a viable recovery method.
ROC relies on three major parts:
A method to detect faults as early as possible.
A means of isolating the faulty component while preserving the rest of the system.
Component-level restarts.
The real key difference between ROC and the typical "nightly restart" approach is that in ROC the restarts are a designed reaction to faults. What I mean is that most software is written with some degree of error handling and recovery (throw-and-catch, logging, retry loops, etc.). A ROC program would instead detect the fault (exception) and immediately exit. Mixing up the two paradigms just leaves you with the worst of both worlds: low reliability and errors.
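The ROC recipe, detect early, isolate, restart the component, can be sketched with a fail-fast component and a supervisor. The names here are illustrative, not taken from the ROC project's code:

```python
def supervise(component, max_restarts):
    """Run a fail-fast component; on any fault, restart it from its
    clean initial state instead of trying to recover in place."""
    restarts = 0
    while True:
        try:
            return component()          # component exits on success
        except Exception:
            restarts += 1               # fault detected: restart, don't repair
            if restarts > max_restarts:
                raise                   # escalate if restarting doesn't help
```

The component itself does no recovery at all; it raises at the first sign of trouble, and every run begins from the known-good startup state, which is exactly the "cleanest, best, most reliable state" the ROC argument rests on.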
Microcontrollers typically have a watchdog timer, which must be reset (by a line of code) every so often or else the microcontroller will reset. This keeps the firmware from getting stuck in an endless loop, stuck waiting for input, etc.
Unused memory is sometimes set to an instruction which causes a reset, or a jump to the same location the microcontroller starts at when it is reset. This will reset the microcontroller if it somehow jumps to a location outside program memory.
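A software analogue of the watchdog pattern is easy to sketch. On a real microcontroller the watchdog is a hardware counter that resets the chip; here an expiry callback stands in for the reset, and the class is my own illustration:

```python
import threading

class Watchdog:
    """Software analogue of a microcontroller watchdog: if kick() is
    not called again within `timeout` seconds, the expiry handler
    fires (a real MCU would reset the chip instead)."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire
        self._timer = None

    def kick(self):
        # The periodic "reset the watchdog" line of code.
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.timeout, self.on_expire)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()
```

The main loop must call `kick()` more often than the timeout; if the loop gets stuck (endless loop, blocked waiting for input), the kicks stop and the watchdog fires.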
Embedded systems may have a checkpoint feature where every n ms, the current stack is saved.
The memory is non-volatile across power cycles (i.e. battery-backed), so on power-up a test is made to see whether the code should jump back to an old checkpoint or start as a fresh system.
I'm going to guess that a similar technique(but more sophisticated) is used for Amazon/Google.
Though I can't think of a design pattern per se, in my experience, it's a result of "select is broken" from developers.
I've seen a 50-user site cripple both SQL Server Enterprise Edition (with a 750 MB database) and a Novell server because of poor connection management coupled with excessive calls and no caching. Novell was always the culprit according to developers until we found a missing "CloseConnection" call in a core library. By then, thousands were spent, unsuccessfully, on upgrades to address that one missing line of code.
(Why they had Enterprise Edition was beyond me so don't ask!!)
If you look at scripting languages like php running on Apache, each invocation starts a new process. In the basic case there is no shared state between processes and once the invocation has finished the process is terminated.
The advantages are less onus on resource management, as resources are released when the process finishes, and less need for error handling, as the process is designed to fail fast and cannot be left in an inconsistent state.
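The same fail-fast isolation can be sketched in any language that can spawn a fresh process per invocation; here is a sketch in Python (the helper name is mine, not a standard API):

```python
import subprocess
import sys

def run_isolated(code):
    """Run a snippet in a fresh interpreter process: no state is
    shared with the parent, and a crash in the snippet cannot leave
    the parent in an inconsistent state."""
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True)
    return result.returncode, result.stdout.strip()
```

Each call starts from a clean interpreter and all resources are reclaimed by the OS when the child exits, which is the same guarantee the process-per-request PHP model gives.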
I've seen it a few places at the application level (an app restarting itself if it bombs).
I've implemented the pattern at an application level, where a service reading from Dbase files starts getting errors after reading x amount of times. It looks for a particular error that gets thrown, and if it sees that error, the service calls a console app that kills the process and restarts the service. It's kludgey, and I hate it, but for this particular situation, I could find no better answer.
AND bear in mind that IIS has a built in feature that restarts the application pool under certain conditions.
For that matter, restarting a service is an option for any service on Windows as one of the actions to take when the service fails.