MPI and process scheduling

Let's say that the number of processes I'm launching is greater than the number of cores I'm working with. When a series of processes on a set of cores completes, I want to reuse those cores. Is there any way for me to do that?
I thought of updating my rankfile on the go, but I'm not sure if that will work.
Any input will be appreciated. Thanks!

Launching more MPI processes than the number of CPU cores available is often referred to as oversubscription. This is normally perfectly well supported by MPI libraries and operating systems, but might require some tweaking at job submission time. The main point one should be careful with is the process-to-core binding possibly performed by the MPI job launcher (i.e. mpirun, mpiexec, orterun, srun, prun, mpprun, [addYourPreferredLauncherHere], ...).
If process-to-core binding is enabled, oversubscription is likely to be quite ineffective (bearing in mind that oversubscribing is already likely to be counterproductive even in the best running conditions). So simply refer to the documentation of your MPI launcher to see how to disable binding (sometimes referred to as "process affinity"), and run your MPI code as usual, just with more processes than there are cores. No modification of the MPI code itself is required.
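For example, with Open MPI the launch might look something like this (a sketch; flag spellings vary between versions, and ./my_program is a placeholder for your executable):

mpirun --oversubscribe --bind-to none -np 16 ./my_program

Here --bind-to none disables core binding, and --oversubscribe tells the launcher that running more processes than cores is intentional.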

Related

How to get concurrent function (pmap) to use all cores in Elixir?

I'm new to Elixir, and I'm starting to read through Dave Thomas's excellent Programming Elixir. I was curious how far I could take the concurrency of the "pmap" function, so I iteratively boosted the number of items to square from 1,000 to 10,000,000. Out of curiosity, I watched the output of htop as I did so, usually peaking out with only some of the cores busy (original htop screenshot omitted).
After showing the example in the book, Dave says:
And, yes, I just kicked off 1,000 background processes, and I used all the cores and processors on my machine.
My question is: how come only cores 1, 3, 5, and 7 light up on my machine? My guess would be that it has to do with my iex process being only a single OS-level process, and OS X managing the reach of that process. Is that what's going on here? Is there some way to ensure all cores get utilized for performance-intensive tasks?
Great comment by @Thiago Silveira about the first line of iex's output. The part [smp:8:8] says how many operating-system-level scheduler threads Erlang is running. You can control this with the -smp flag if you want to disable it:
iex --erl '-smp disable'
This will ensure that you have only one scheduler thread. You can achieve a similar result by leaving symmetric multiprocessing enabled, but setting NumberOfSchedulers:NumberOfSchedulersOnline directly with the +S flag:
iex --erl '+S 1:1'
Each scheduler is an operating-system-level thread that runs Erlang processes, so you can easily check how many of them you currently have:
:erlang.system_info(:schedulers_online)
To answer your question about performance: if your processors are not working at full capacity (100%) and none of them is idle (0%), then it is probable that making the load more evenly distributed will not speed things up. Why?
CPU usage is measured by probing the processor state at many points in time. These states are either "working" or "idle". 82% CPU usage means that you can still run a couple more tasks on this CPU without slowing the other tasks.
Erlang schedulers try to be smart and not migrate Erlang processes between cores unless they have to, because migration requires copying. The migration occurs, for example, when one of the schedulers is idle; it can then borrow a process from another scheduler's run queue.
The next thing that may cause such a big discrepancy between odd and even cores is Hyper-Threading. On my dual-core processor, htop shows 4 logical cores. In your case you probably have 4 physical cores and 8 logical ones because of HT. It might be the case that you are already utilizing your physical cores at 100%.
Another thing: pmap needs to calculate each result in a separate process, but at the end every result is sent back to the caller, which may be a bottleneck. The more messages you send, the less CPU utilization you can achieve. For fun, you can try giving the processes a task that is really CPU-intensive, like calculating the Ackermann function. You can even calculate how much of your job is sequential and how much is parallel using Amdahl's law, by measuring execution times for different numbers of cores.
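For reference, Amdahl's law says that if a fraction p of the work can be parallelized, the best possible speedup on n cores is

S(n) = 1 / ((1 - p) + p / n)

With hypothetical numbers p = 0.9 and n = 8, that caps the speedup at about 4.7, however evenly the load is spread.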
To sum up: the CPU utilization from the screenshot looks really great! You don't have to change anything for more performance-intensive tasks.
Concurrency is not Parallelism
In order to get good parallel performance out of Elixir/BEAM coding you need to have some understanding of how the BEAM scheduler works.
This is a very simplistic model, but the BEAM scheduler gives each process 2000 reductions before it swaps out the process for the next process. Reductions can be thought of as function calls. By default a process runs on the core/scheduler that spawned it. Processes only get moved between schedulers if the queue of outstanding processes builds up on a given scheduler. By default the BEAM runs a scheduling thread on each available core.
What this implies is that in order to make the most use of the processors, you need to break your tasks into pieces of work large enough to exceed the standard "reduction" slice. In general, pmap-style parallelism only gives a significant speedup when you chunk many items into a single task.
The other thing to be aware of is that some parts of the BEAM use a spin/wait loop when awaiting work, and that can skew usage when you use a tool like htop to examine CPU usage. You'll get a much better understanding of your program's performance by using :observer.

Preventing Windows from changing process affinity

I have multithreaded code that I want to run on all 4 cores my processor has, i.e. I create four threads and I want each of them to run on a separate core.
What happens is that it starts out running on four cores, but occasionally switches to only three. The only things running are the OS and my exe. This is somewhat disappointing, since it decreases performance by a quarter, which is significant for me.
The process affinity that I see in Task Manager allows the process to use any core. I tried restricting thread affinities, but it didn't help. I also tried increasing the priority of the process, but that did not help either.
So the question is, is there any way to force Windows to keep it running on all four cores? If this is not possible, can I reduce the frequency of these interruptions? Thanks!
This is not an issue of affinity, unless I am very much mistaken. Certainly the system will not restrict your process's affinity to a specific set of cores. Some other program in the system would have to do that, if indeed that is happening.
Much more likely, however, is simply that there is another thread ready to run, and the system schedules threads in a round-robin fashion. You have four threads that are always ready to run. If there is another thread that is ready to run, it will get its turn. Now there are 5 threads sharing 4 processors. When the other thread is running, only 3 of yours are able to run.
If you want to be sure that such other threads won't run then you need to do one of the following:
Stop running the other program that wants to use CPU resource.
Make the relative thread priorities such that your threads always run in preference to the other thread.
Now, of these options, the first is to be preferred. If you prioritize your threads above others, then the other threads don't get to run at all. Is that really what you want to happen?
In the question you say that there are no other processes running. If that is the case, and nobody is meddling with processor affinity, and only a subset of your threads are executing, then the only conclusion is that not all of your threads are ready to run and have work to do. That might happen if you, for instance, join your threads at the end of one part of work, before continuing on to the next.
Perhaps the next step for you is to narrow things down a little. Use a tool like Process Explorer to diagnose which threads are actually running.
If this is Windows, try SetThreadAffinityMask():
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx
I would assume that if you only set a single bit, then that forces the thread to run only on the selected processor (core).
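A minimal sketch of that idea (Windows-specific; pin_current_thread_to_core is a made-up helper name):

#include <windows.h>

/* Pin the calling thread to one core by setting a single bit in its
   affinity mask. Returns the previous mask, or 0 on failure. */
DWORD_PTR pin_current_thread_to_core(int core)
{
    DWORD_PTR mask = (DWORD_PTR)1 << core;
    return SetThreadAffinityMask(GetCurrentThread(), mask);
}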
Other process/thread functions:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684847(v=vs.85).aspx
I use a Windows video program, and it's able to keep all the cores running at near max when rendering video.

How are lightweight threads scheduled by the Linux kernel on a multi-chip, multi-core SMP system?

I am running a parallel algorithm using lightweight threads and I am wondering how these are assigned to different cores when the system provides several cores and several chips. Are threads assigned to a single chip until all the cores on that chip are exhausted? Are threads assigned to cores on different chips in order to better distribute the work between chips?
You don't say what OS you're on, but in Linux, threads are assigned to a core based on the load on that core. A thread that is ready to run will be assigned to a core with lowest load unless you specify otherwise by setting thread affinity. You can do this with sched_setaffinity(). See the man page for more details. In general, as meyes1979 said, this is something that is decided by the scheduler implemented in the OS you are using.
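A minimal sketch of that call (Linux-specific; pinning to core 0 is just an arbitrary example):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                    /* allow core 0 only */

    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* everything from here on runs on core 0 */
    return 0;
}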
Depending upon the version of Linux you're using, there are two articles that might be helpful: this article describes early 2.6 kernels, up through 2.6.22, and this article describes kernels newer than 2.6.23.
Different threading libraries perform threading operations differently. The "standard" in Linux these days is NPTL, which schedules threads at the same level as processes. This is quite fine, as process creation is fast on Linux, and is intended to always remain fast.
The Linux kernel attempts to provide very strong CPU affinity with executing processes and threads to increase the ratio of cache hits to cache misses -- if a task always executes on the same core, it'll more likely have pre-populated cache lines.
This is usually a good thing, but I have noticed the kernel might not always migrate tasks away from busy cores to idle cores. This behavior is liable to change from version to version, but I have found multiple CPU-bound tasks all running on one core while three other cores were idle. (I found it by noticing that one core was six or seven degrees Celsius warmer than the other three.)
In general, the right thing should just happen; but when the kernel does not automatically migrate tasks to other processors, you can use the taskset(1) command to restrict the processors allowed to a program, or modify your program to use the pthread_setaffinity_np(3) function to ask for individual threads to be migrated. (This is perhaps best for in-house applications -- one of your users might not want your program to use all available cores. If you do choose to include calls to this function within your program, make sure it is configurable via configuration files, to provide functionality similar to the taskset(1) program.)
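To make that concrete: from outside the program, a restriction might look like taskset -c 0-3 ./my_program (my_program being a placeholder), while inside the program a sketch using pthread_setaffinity_np(3) could look like this (glibc-specific, hence _GNU_SOURCE):

#define _GNU_SOURCE
#include <pthread.h>

/* Pin an existing thread to a single core.
   Returns 0 on success, an errno value otherwise. */
static int pin_thread_to_core(pthread_t thread, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set);
}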

MPI on a single dual-core machine

What happens if I run an MPI program which requires 3 nodes (i.e. mpiexec -np 3 ./Program) on a single machine which has 2 CPUs?
This depends on your MPI implementation, of course. Most likely, it will create three processes, and use shared memory to exchange the messages. This will work just fine: the operating system will dispatch the two CPUs across the three processes, and always execute one of the ready processes. If a process waits to receive a message, it will block, and the operating system will schedule one of the other two processes to run - one of which will be the one that is sending the message.
Martin has given the right answer and I've plus-1ed him, but I just want to add a few subtleties which are a little too long to fit into the comment box.
There's nothing wrong with having more processes than cores, of course; you probably have dozens running on your machine well before you run any MPI program. You can try this with any command-line executable you have sitting around: run something like mpirun -np 24 hostname or mpirun -np 17 ls on a Linux box, and you'll get 24 copies of your hostname, or 17 (probably interleaved) directory listings, and everything runs fine.
In MPI, using more processes than cores is generally called 'oversubscribing'. The fact that it has a special name already suggests that it's a special case. The sorts of programs written with MPI typically perform best when each process has its own core. There are situations where this need not be the case, but it's (by far) the usual one. And for this reason, for instance, OpenMPI has optimized for the usual case: it makes the strong assumption that every process has its own core, and so is very aggressive in using the CPU to poll to see whether a message has come in yet (since it figures it's not doing anything else crucial). That's not a problem, and can easily be turned off if OpenMPI knows it's being oversubscribed ( http://www.open-mpi.org/faq/?category=running#oversubscribing ). It's a design decision, and one which improves the performance of the vast majority of cases.
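If I recall that FAQ entry correctly, the knob involved is the mpi_yield_when_idle MCA parameter, so enabling the polite "degraded" polling mode by hand would look something like this (exact spelling may differ between OpenMPI versions; ./my_program is a placeholder):

mpirun --mca mpi_yield_when_idle 1 -np 8 ./my_program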
For historical reasons I'm more familiar with OpenMPI than MPICH2, but my understanding is that MPICH2's defaults are more forgiving of the oversubscribed case; I think even there, though, it's possible to turn on more aggressive busy-waiting.
Anyway, this is a long way of saying that yes, what you're doing is perfectly fine, and if you see any weird problems when you switch MPIs, or even versions of the same MPI, do a quick search to see if there are any parameters that need to be tweaked for this case.

What is easier to learn and debug, OpenMP or MPI?

I have a number-crunching C/C++ application. It is basically a main loop over different data sets. We have access to a 100-node cluster with OpenMP and MPI available. I would like to speed up the application, but I am an absolute newbie with both MPI and OpenMP. I just wonder which is the easiest to learn and debug, even if the performance is not the best.
I also wonder which is the most adequate for my main-loop application.
Thanks
If your program is just one big loop, using OpenMP can be as simple as writing:
#pragma omp parallel for
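Fleshed out into a compilable sketch (the squaring loop is a made-up stand-in for your per-data-set work; compile with something like gcc -fopenmp):

#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double result[N];

    /* Iterations are independent, so OpenMP can split the
       loop across all available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        result[i] = (double)i * i;
    }

    printf("result[42] = %.0f\n", result[42]);
    return 0;
}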
OpenMP is only useful for shared-memory programming, which, unless your cluster is running something like Kerrighed, means that the parallel version using OpenMP will only run on at most one node at a time.
MPI is based around message passing and is slightly more complicated to get started with. The advantage, though, is that your program can run on several nodes at one time, passing messages between them as and when needed.
Given that you said "for different data sets", it sounds like your problem might actually fall into the "embarrassingly parallel" category: provided you've got more than 100 data sets, you could just set up the scheduler to run one data set per node until they are all completed, with no need to modify your code and almost a 100x speedup over using a single node.
For example, if your cluster uses Condor as the scheduler, you could submit one job per data item to the "vanilla" universe, varying only the "Arguments =" line of the job description, as in the sketch below. (There are other ways to do this with Condor which may be more sensible, and there are also similar facilities in Torque, SGE, etc.)
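As a sketch, such a submit description might look roughly like this (the names are made up; $(Process) expands to 0, 1, ..., 99, so each job gets its own argument):

universe   = vanilla
executable = crunch
arguments  = dataset_$(Process)
output     = out.$(Process)
queue 100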
OpenMP is essentially for SMP machines, so if you want to scale to hundreds of nodes you will need MPI anyhow. You can, however, use both: MPI to distribute work across nodes, and OpenMP to handle parallelism across the cores or multiple CPUs of each node. I would say OpenMP is a lot easier than messing with pthreads, but being coarser-grained, the speedup you get from OpenMP will usually be lower than that of a hand-optimized pthreads implementation.
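A bare-bones hybrid sketch of that division of labour (one MPI rank per node, OpenMP threads within each rank; compile with an MPI compiler wrapper plus -fopenmp):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: only the main thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI distributes ranks across nodes; OpenMP spreads the
       work of each rank across that node's cores. */
    #pragma omp parallel
    printf("rank %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}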
