What happens if I run an MPI program that requests 3 processes (i.e. mpiexec -np 3 ./Program) on a single machine which has 2 CPUs?
This depends on your MPI implementation, of course. Most likely, it will create three processes and use shared memory to exchange the messages. This will work just fine: the operating system will schedule the three processes across the two CPUs, always executing one of the processes that is ready to run. If a process waits to receive a message, it will block, and the operating system will schedule one of the other two processes to run - one of which may well be the one sending the message it is waiting for.
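To make that concrete, here is a minimal sketch in C (my own illustration, not code from the question) in which every rank other than 0 blocks in MPI_Recv. Launched as mpiexec -np 3 ./oversub on a 2-CPU machine it behaves exactly as described: the OS keeps the two CPUs busy with whichever of the three processes is runnable.

    /* oversub.c - illustration only; build with something like: mpicc oversub.c -o oversub */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Rank 0 sends one integer to each of the other ranks. */
            for (int dest = 1; dest < size; dest++) {
                int msg = 42;
                MPI_Send(&msg, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
            }
        } else {
            /* The other ranks wait here; whichever processes are ready
               to run get scheduled onto the two available CPUs. */
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank %d of %d received %d\n", rank, size, msg);
        }

        MPI_Finalize();
        return 0;
    }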
Martin has given the right answer and I've plus-1ed him, but I just want to add a few subtleties which are a little too long to fit into the comment box.
There's nothing wrong with having more processes than cores, of course; you probably have dozens running on your machine well before you run any MPI program. You can try something like mpirun -np 24 hostname or mpirun -np 17 ls with any command-line executable you have sitting around on a linux box, and you'll get 24 copies of your hostname, or 17 (probably interleaved) directory listings, and everything runs fine.
In MPI, using more processes than cores is generally called 'oversubscribing'. The fact that it has a special name already suggests that it's a special case. The sorts of programs written with MPI typically perform best when each process has its own core. There are situations where this need not be the case, but it's (by far) the usual one. And for this reason, for instance, OpenMPI has optimized for the usual case -- it just makes the strong assumption that every process has its own core, and so is very aggressive in using the CPU to poll to see if a message has come in yet (since it figures it's not doing anything else crucial). That's not a problem, and can easily be turned off if OpenMPI knows it's being oversubscribed ( http://www.open-mpi.org/faq/?category=running#oversubscribing ). It's a design decision, and one which improves the performance of the vast majority of cases.
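For what it's worth, the knob that FAQ entry describes (for the Open MPI versions of that era) is the mpi_yield_when_idle MCA parameter, so -- if I'm remembering the syntax right -- something like mpirun --mca mpi_yield_when_idle 1 -np 4 ./a.out tells the library to yield the CPU instead of busy-polling while it waits; more recent Open MPI releases also have an --oversubscribe launcher flag. Treat the exact spellings as version-dependent and check mpirun --help for your release.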
For historical reasons I'm more familiar with OpenMPI than MPICH2, but my understanding is that MPICH2's defaults are more forgiving of the oversubscribed case -- but I think even there it's possible to turn on more aggressive busy-waiting.
Anyway, this is a long way of saying that yes, what you're doing is perfectly fine, and if you see any weird problems when you switch MPIs or even versions of MPIs, do a quick search to see if there are any parameters that need to be tweaked for this case.
Question
Are there any notable differences between context switching between processes running the same executable (for example, two separate instances of cat) vs processes running different executables?
Background
I already know that having the same executable means that it can be cached in the same place in memory and in any of the CPU caches that might be available, so I know that when you switch from one process to another, if they're both executing the same executable, your odds of having a cache miss are smaller (possibly zero, if the executable is small enough or they're executing in roughly the same "spot", and the kernel doesn't do anything in the meantime that could cause the relevant memory to be evicted from the cache). This of course applies "all the way down", to memory still being in RAM vs. having been paged out to swap/disk.
I'm curious if there are other considerations that I'm missing? Anything to do with virtual memory mappings, perhaps, or if there are any kernels out there which are able to somehow get more optimal performance out of context switches between two processes running the same executable binary?
Motivation
I've been thinking about the Unix philosophy of small programs that do one thing well, and how, taken to its logical conclusion, it leads to lots of small executables being forked and executed many times. (For example, 30-something runsv processes getting started up nearly simultaneously on Void Linux boot - note that runsv is only a good example during startup, because they mostly spend their time blocked waiting for events once they start their child service, so besides early boot there isn't much context-switching between them happening. But we could easily imagine numerous cat or /bin/sh instances running at once, or whatever.)
The context switching overhead is the same. That is usually done with a single (time consuming) instruction.
There are some more advanced operating systems (i.e. not eunuchs) that support installed shared programs. They have reduced overhead when more than one process accesses them; e.g., only one copy of the read-only data is loaded into physical memory.
Let's say that the number of processes I'm launching is greater than the number of cores I'm working with. When a series of processes on a set of cores completes, I want to utilize those cores. Is there any way for me to do that?
I thought of updating my rankfile on the go, but I'm not sure if that will work.
Any input will be appreciated. Thanks!
Launching more MPI processes than the number of CPU cores available is often referred to as oversubscription. This is normally perfectly well supported by the MPI libraries and operating systems, but might require some tweaking at job submission time. The main point one should be careful with is the process-to-core attachment possibly performed by the MPI job launcher (i.e. mpirun, mpiexec, orterun, srun, prun, mpprun, [addYourPreferredLauncherHere], ...).
If process-to-core attachment is enabled, then the oversubscription is likely to be quite inefficient (bearing in mind that oversubscribing is already likely to be counterproductive even in the best possible running conditions). So you will simply have to refer to the documentation of your MPI launcher to see how to disable attachment (sometimes referred to as "process affinity") and then run your MPI code as usual, just with more processes than there are cores. No modification of the MPI code itself is required.
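As an illustration (assuming Open MPI here; the exact flag names differ between launchers and versions), disabling the binding is usually just a launcher option, something along the lines of mpirun --bind-to none -np 16 ./my_mpi_app (where ./my_mpi_app is just a stand-in name), after which the kernel scheduler is free to move the processes onto whichever cores fall idle.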
We know that in bash, time foo will tell us how long a command foo takes to execute. But there is so much variability, depending on unrelated factors including what else is running on the machine at the time. It seems like there should be some deterministic way of measuring how long a program takes to run. Number of processor cycles, perhaps? Number of pipeline stages?
Is there a way to do this, or if not, to at least get a more meaningful time measurement?
You've stumbled into a problem that's (much) harder than it appears. The performance of a program is absolutely connected to the current state of the machine in which it is running. This includes, but is not limited to:
The contents of all CPU caches.
The current contents of system memory, including any disk caching.
Any other processes running on the machine and the resources they're currently using.
The scheduling decisions the OS makes about where and when to run your program.
...the list goes on and on.
If you want a truly repeatable benchmark, you'll have to take explicit steps to control for all of the above. This means flushing caches, removing interference from other programs, and controlling how your job gets run. This isn't an easy task, by any means.
The good news is that, depending on what you're looking for, you might be able to get away with something less rigorous. If you run the job on your regular workload and it produces results in a good amount of time, then that might be all that you need.
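If what you actually want is a number that is less sensitive to whatever else the machine is doing, one cheap improvement over wall-clock time is to measure the CPU time charged to your process rather than elapsed time. Here is a small sketch in C (my own illustration, using the POSIX clock_gettime interface with CLOCK_PROCESS_CPUTIME_ID, so assuming a Linux-like system):

    /* cputime.c - measure CPU time consumed by this process, not wall time. */
    #include <stdio.h>
    #include <time.h>

    static double now_cpu(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        double start = now_cpu();

        /* ... the work you want to measure goes here (dummy loop below) ... */
        volatile double x = 0.0;
        for (long i = 0; i < 100000000L; i++)
            x += i * 0.5;

        printf("CPU time: %.3f s\n", now_cpu() - start);
        return 0;
    }

This still isn't deterministic (caches and memory contention still matter), but it at least excludes the time your process spends preempted by other processes; tools such as perf stat can go further and report hardware cycle and instruction counts, which tend to be more repeatable than wall time.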
I have a multithreaded code that I want to run on all 4 cores that my processor has. I.e. I create four threads, and I want each of them to run on a separate core.
What happens is that it starts running on four cores, but occasionally would switch to only three cores. The only things running are the OS and my exe. This is somewhat disappointing, since it decreases performance by a quarter, which is significant enough for me.
The process affinity that I see in Task Manager allows the process to use any core. I tried restricting thread affinities, but it didn't help. I also tried increasing the priority of the process, but that did not help either.
So the question is, is there any way to force Windows to keep it running on all four cores? If this is not possible, can I reduce the frequency of these interruptions? Thanks!
This is not an issue of affinity, unless I am very much mistaken. Certainly the system will not restrict your process's affinity to a specific subset of the processors. Some other program in the system would have to do that, if indeed that is happening.
Much more likely however is that, simply, there is another thread that is ready to run that the system is scheduling in a round-robin fashion. You have four threads that are always ready to run. If there is another thread that is ready to run, it will get its turn. Now there are 5 threads sharing 4 processors. When the other thread is running, only 3 of yours are able to run.
If you want to be sure that such other threads won't run then you need to do one of the following:
Stop running the other program that wants to use CPU resource.
Make the relative thread priorities such that your threads always run in preference to the other thread.
Now, of these options, the first is to be preferred. If you prioritize your threads above others, then the other threads don't get to run at all. Is that really what you want to happen?
In the question you say that there are no other processes running. If that is the case, and nobody is meddling with processor affinity, and only a subset of your threads are executing, then the only conclusion is that not all of your threads are ready to run and have work to do. That might happen if you, for instance, join your threads at the end of one part of work, before continuing on to the next.
Perhaps the next step for you is to narrow things down a little. Use a tool like Process Explorer to diagnose which threads are actually running.
If this is Windows, try SetThreadAffinityMask():
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx
I would assume that if you only set a single bit, then that forces the thread to run only on the selected processor (core).
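A hedged sketch of what that looks like in C (pinning the calling thread to processor 0; SetThreadAffinityMask returns the previous mask, or 0 on failure):

    /* pin_thread.c - illustration only: pin the calling thread to one core. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Bit 0 set => this thread may only be scheduled on processor 0. */
        DWORD_PTR mask = 1;
        DWORD_PTR prev = SetThreadAffinityMask(GetCurrentThread(), mask);

        if (prev == 0) {
            printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }

        printf("previous affinity mask was 0x%llx\n", (unsigned long long)prev);
        /* ... thread now runs only on core 0 ... */
        return 0;
    }

In the four-thread case each thread would set a different bit (1 << core_index), though as the other answer points out, pinning like this may not fix anything if the real problem is some other thread competing for the cores.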
other process / thread functions:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684847(v=vs.85).aspx
I use a Windows video program, and it's able to keep all the cores running at near max when rendering video.
I'm a big fan of speeding up my builds using "make -j8" (replacing 8 with whatever my current computer's number of cores is, of course), and compiling N files in parallel is usually very effective at reducing compile times... unless some of the compilation processes are sufficiently memory-intensive that the computer runs out of RAM, in which case all the various compile processes start swapping each other out, and everything slows to a crawl -- thus defeating the purpose of doing a parallel compile in the first place.
Now, the obvious solution to this problem is "buy more RAM" -- but since I'm too cheap to do that, it occurs to me that it ought to be possible to have an implementation of 'make' (or equivalent) that watches the system's available RAM, and when RAM gets down to near zero and the system starts swapping, make would automatically step in and send a SIGSTOP to one or more of the compile processes it had spawned. That would allow the stopped processes to get fully swapped out, so that the other processes could finish their compile without further swapping; then, when the other processes exit and more RAM becomes available, the 'make' process would send a SIGCONT to the paused processes, allowing them to resume their own processing. That way most swapping would be avoided, and I could safely compile on all cores.
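Purely to illustrate the mechanism I have in mind, a stripped-down watcher might look something like the sketch below (Linux-specific, using sysinfo(2) and SIGSTOP/SIGCONT; the thresholds, the "cc big_file.c" child, and the lack of error handling are all just placeholders).

    /* stop_on_low_mem.c - illustration only: suspend one child when free RAM
       drops below a low-water mark and resume it above a high-water mark. */
    #include <sys/sysinfo.h>
    #include <sys/types.h>
    #include <signal.h>
    #include <unistd.h>
    #include <stdio.h>

    static long free_ram_mib(void)
    {
        struct sysinfo si;
        sysinfo(&si);
        return (long)((si.freeram * si.mem_unit) >> 20);
    }

    static void throttle(pid_t pid, long low_mib, long high_mib)
    {
        int stopped = 0;
        for (;;) {                          /* sketch: never notices child exit */
            long free_mib = free_ram_mib();
            if (!stopped && free_mib < low_mib) {
                kill(pid, SIGSTOP);         /* let it get swapped out in peace */
                stopped = 1;
            } else if (stopped && free_mib > high_mib) {
                kill(pid, SIGCONT);         /* pressure is gone, resume it */
                stopped = 0;
            }
            sleep(1);
        }
    }

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: stand-in for a memory-hungry compile job. */
            execlp("cc", "cc", "-c", "big_file.c", (char *)NULL);
            _exit(127);
        }
        throttle(pid, 512, 1024);           /* pause below 512 MiB free, resume above 1 GiB */
        return 0;
    }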
Is anyone aware of a program that implements this logic? Or conversely, is there some good reason why such a program wouldn't/couldn't work?
For GNU Make, there's the -l option:
-l [load], --load-average[=load]
Specifies that no new jobs (commands) should be started if there are other jobs running and the load average is at least load (a floating-point number). With no argument, removes a previous load limit.
I don't think there's a standard option for this, though.
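So in practice you would combine the two, e.g. make -j8 -l7, which stops make from spawning new jobs once the reported load average reaches 7. Bear in mind that the load average tracks runnable processes rather than memory pressure, so it's only an indirect guard against the swapping scenario described in the question.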