MPI Send latency for different process localities - parallel-processing

I am currently taking a course on efficient programming of supercomputers and multicore processors. Our latest assignment is to measure the latency of the MPI_Send command (i.e. the time spent sending a zero-byte message). That alone would not be hard, but we have to perform our measurements for the following criteria:
communication of processes in the same processor,
same node but different processors,
and for processes on different nodes.
I am wondering: how do I determine this? For processes on different nodes I thought about hashing the name returned by MPI_Get_processor_name, which returns the identifier of the node the process is currently running on, and sending it as a tag. I also tried using sched_getcpu() to get the core id, but it seems to return an incrementing number even when the cores are hyper-threaded (so two processes could actually end up on the same physical core). How do I go about this?
I just need a concept for determining the localities, not complete code for the stated problem. Thank you!

In order to have both MPI processes placed on separate cores of the same socket, you should pass the following options to mpiexec:
-genv I_MPI_PIN=1 -genv I_MPI_PIN_DOMAIN=core -genv I_MPI_PIN_ORDER=compact
In order to have both MPI processes on cores from different sockets, you should use:
-genv I_MPI_PIN=1 -genv I_MPI_PIN_DOMAIN=core -genv I_MPI_PIN_ORDER=scatter
In order to have them on two separate machines, you should create a host file that provides only one slot per node or use:
-perhost 1 -genv I_MPI_PIN=1 -genv I_MPI_PIN_DOMAIN=core
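For example (assuming Intel MPI's mpiexec and a hypothetical benchmark binary called ./latency), a complete two-process run pinned to cores on different sockets could look like:
mpiexec -n 2 -genv I_MPI_PIN=1 -genv I_MPI_PIN_DOMAIN=core -genv I_MPI_PIN_ORDER=scatter ./latency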
You can check the actual pinning/binding on Linux by calling sched_getaffinity() and examining the returned affinity mask. As an alternative, you could parse /proc/self/status and look for Cpus_allowed or Cpus_allowed_list. On Windows, GetProcessAffinityMask() returns the active affinity mask.
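A minimal sketch of that check on Linux, assuming glibc's sched_getaffinity() (no MPI required; you can drop the same logic into each rank if you want per-rank output):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Ask the kernel which CPUs this process is allowed to run on. */
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof mask, &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    printf("pid %d may run on CPUs:", (int)getpid());
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    putchar('\n');
    return 0;
}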
You could also ask Intel MPI to report the final pinning by setting I_MPI_DEBUG to 4, but it produces a lot of other output in addition to the pinning information. Look for lines that resemble the following:
[0] MPI startup(): 0 1234 node100 {0}
[0] MPI startup(): 1 1235 node100 {1}
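To turn the original question's idea into something concrete, here is a minimal sketch (my own illustration, not part of the answer above) that classifies the locality of two ranks: it compares the names returned by MPI_Get_processor_name for the node level and, on Linux, reads the current core's physical_package_id from sysfs to distinguish same socket from different sockets. sched_getcpu() and the sysfs path are Linux-specific assumptions.

#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Look up which socket (physical package) a logical CPU belongs to. */
static int socket_of(int cpu)
{
    char path[128];
    int pkg = -1;
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%d", &pkg); fclose(f); }
    return pkg;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {                      /* the sketch assumes exactly 2 ranks */
        if (rank == 0) fprintf(stderr, "run with exactly 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    char node[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(node, &len);   /* identifier of the node            */
    int sock = socket_of(sched_getcpu()); /* socket of the core we run on now  */

    char nodes[2][MPI_MAX_PROCESSOR_NAME];
    int socks[2];
    MPI_Gather(node, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               nodes, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Gather(&sock, 1, MPI_INT, socks, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        if (strcmp(nodes[0], nodes[1]) != 0)
            puts("locality: different nodes");
        else if (socks[0] != socks[1])
            puts("locality: same node, different sockets");
        else
            puts("locality: same socket (possibly same core if hyper-threaded)");
    }

    MPI_Finalize();
    return 0;
}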

Related

How to get concurrent function (pmap) to use all cores in Elixir?

I'm new to Elixir, and I'm starting to read through Dave Thomas's excellent Programming Elixir. I was curious how far I could take the concurrency of the "pmap" function, so I iteratively boosted the number of items to square from 1,000 to 10,000,000. Out of curiosity, I watched the output of htop as I did so, and CPU usage usually peaked in the pattern captured in the htop screenshot below:
After showing the example in the book, Dave says:
And, yes, I just kicked off 1,000 background processes, and I used all the cores and processors on my machine.
My question is, how come on my machine only cores 1, 3, 5, and 7 are lighting up? My guess would be that it has to do with my iex process being only a single OS-level process and OSX is managing the reach of that process. Is that what's going on here? Is there some way to ensure all cores get utilized for performance-intensive tasks?
Great comment by @Thiago Silveira about the first line of iex's output. The [smp:8:8] part says how many scheduler threads Erlang is running. You can control this with the smp flag if you want to disable it:
iex --erl '-smp disable'
This will ensure that you have only one scheduler. You can achieve a similar result by leaving symmetric multiprocessing enabled but setting NumberOfSchedulers:NumberOfSchedulersOnline directly:
iex --erl '+S 1:1'
Each scheduler runs in its own operating-system thread and has its own run queue of Erlang processes, and you can easily check how many of them you currently have:
:erlang.system_info(:schedulers_online)
To answer your question about performance: if your processors are not working at full capacity (100%) and none of them is sitting idle (0%), then making the load more evenly distributed will probably not speed things up. Why?
CPU usage is measured by sampling the processor state at many points in time; each sample is either "working" or "idle". 82% CPU usage means that you could still run a couple more tasks on this CPU without slowing the others down.
Erlang schedulers try to be smart and avoid migrating Erlang processes between cores unless they have to, because migration has a cost. Migration happens, for example, when one of the schedulers is idle: it can then steal a process from another scheduler's run queue.
The next thing that may cause such a big discrepancy between odd and even cores is Hyper-Threading. On my dual-core processor, htop shows 4 logical cores. In your case you probably have 4 physical cores and 8 logical ones because of HT, so it may well be that you are already utilizing your physical cores at 100%.
Another thing: pmap computes each result in a separate process, but in the end every result is sent back to the caller, which can become a bottleneck. The more messages you send, the less CPU utilization you can achieve. For fun, you can give the processes a task that is really CPU-intensive, such as computing the Ackermann function. You can even estimate how much of your job is sequential and how much is parallel by applying Amdahl's law to execution times measured with different numbers of cores.
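For reference, Amdahl's law (the standard formula, nothing Elixir-specific) bounds the speedup on N cores when a fraction p of the work can run in parallel:
S(N) = \frac{1}{(1 - p) + p/N}
Timing the job for a few different values of N lets you fit p and see how much of it is inherently sequential.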
To sum up: the CPU utilization in your screenshot looks really good! You don't have to change anything for more performance-intensive tasks.
Concurrency is not Parallelism
In order to get good parallel performance out of Elixir/BEAM code, you need some understanding of how the BEAM scheduler works.
In a very simplistic model, the BEAM scheduler gives each process roughly 2000 reductions before it swaps that process out for the next one; a reduction can be thought of as a function call. By default a process runs on the core/scheduler that spawned it, and processes only get moved between schedulers if the queue of outstanding processes builds up on a given scheduler. By default the BEAM runs one scheduling thread on each available core.
This implies that to make the most of the processors, you need to break your work into pieces large enough to exceed the standard "reduction" slice. In general, pmap-style parallelism only gives a significant speedup when you chunk many items into a single task.
The other thing to be aware of is that some parts of the BEAM use a spin/wait loop when awaiting work, and that can skew usage figures when you examine CPU usage with a tool like htop. You'll get a much better understanding of your program's performance by using :observer.

Using Parallel gem in Ruby; how many cores to use?

I have a section of slow code that I want to speed up by using the parallel gem and multiple cores.
With the Parallel gem, you must specify the number of processes or threads you want to use, so I hard-coded it to use the same number of logical cores that I have. It works perfectly, but my problem is that this code is intended to be distributed and used by other people who may have a different number of cores.
Should I try to detect the number of cores that their machine has, and use that number? Or should I default to no parallelism and only switch to multi-threaded code if the user explicitly specifies the number of threads they'd like to use? (e.g. pg_restore)
If I do try to detect cores, should I try to utilise all cores found, or would it be more polite to use, say, all but one of the cores?
I have no idea how memory-intensive your program is, but heavy memory requirements could also cause major unexpected issues for people with less memory than the machine you're testing it on.
Since it's a CLI tool, why not add a flag like --procs that takes an argument for the number of processes to use, and leave it up to the user to decide?

mpi and process scheduling

Let's say that the number of processes I'm launching are greater than the number of cores I'm working with. When a series of processes on a set of cores complete, I want to utilize those cores. Is there any way for me to do that?
I thought of updating my rankfile on the go, but I'm not sure if that will work.
Any input will be appreciated. Thanks!
Launching more MPI processes than the number of CPU cores available is often referred to as oversubscription. This is normally perfectly well supported by MPI libraries and operating systems, but it might require some tweaking at job submission time. The main point to be careful with is the process-to-core attachment that the MPI job launcher (i.e. mpirun, mpiexec, orterun, srun, prun, mpprun, [addYourPreferredLauncherHere], ...) may perform.
If process-to-core attachment is enabled, oversubscription is likely to be quite ineffective (bearing in mind that oversubscribing is already likely to be counter-productive, even under the best possible running conditions). So simply refer to the documentation of your MPI launcher to see how to disable attachment (sometimes referred to as "process affinity" or "binding"), then run your MPI code as usual with more processes than there are cores, as in the example below. No modification of the MPI code itself is required.
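For instance, with a reasonably recent Open MPI (other launchers use different flag names, so treat this only as an illustration; ./my_program is a placeholder):
mpirun --oversubscribe --bind-to none -np 64 ./my_program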

MPI shared memory access

In a parallel MPI program running on, for example, 100 processors:
Suppose there is a global counter that should be known by all MPI processes; each of them can add to this number, and the others should see the change instantly and add to the updated value.
Synchronization is not feasible and would introduce a lot of latency.
Would it be OK to open a shared-memory region among all the processes and use it to read and change this number?
Would it be OK to use MPI_WIN_ALLOCATE_SHARED or something like that, or is this not a good solution?
Your question suggests to me that you want to have your cake and eat it too. This will end in tears.
I say you want to have your cake and eat it too because you state that you want to synchronise the activities of 100 processes without synchronisation. You want to have 100 processes incrementing a shared counter, (presumably) to have all the updates applied correctly and consistently, and to have increments propagated to all processes instantly. No matter how you tackle this problem, it is one of synchronisation; either you write synchronised code or you offload the task to a library or run-time which does it for you.
Is it reasonable to expect MPI RMA to provide automatic synchronisation for you? No, not really. Note first that mpi_win_allocate_shared is only valid if all the processes in the communicator which make the call are in shared memory. Even given that you have the hardware to support 100 processes in the same shared memory, you still have to write code to ensure synchronisation; MPI won't do it for you. If you have 100 processes, any or all of which may increment the shared counter, there is nothing in the MPI standard, or in any implementation that I am familiar with, which will prevent a data race on that counter.
Even shared-memory parallel programs (as opposed to MPI providing shared-memory-like parallel programs) have to take measures to avoid data races and other similar issues.
You could certainly write an MPI program to synchronise accesses to the shared counter (a sketch follows), but a better approach would be to rethink your program's structure to avoid too-tight synchronisation between processes.
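As an illustration only (my sketch, not the answer author's code): MPI-3 RMA can do the synchronisation for you if every update goes through atomic RMA calls instead of plain loads and stores. The sketch assumes all ranks live on one node, as MPI_Win_allocate_shared requires; note that the "see the change instantly" requirement is still not met, because other ranks only observe the new value when they next read it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 hosts the counter; the other ranks allocate zero bytes. */
    long *counter;
    MPI_Win win;
    MPI_Win_allocate_shared(rank == 0 ? sizeof(long) : 0, sizeof(long),
                            MPI_INFO_NULL, MPI_COMM_WORLD, &counter, &win);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    if (rank == 0) *counter = 0;
    MPI_Win_sync(win);                    /* publish the initial value          */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Atomic fetch-and-add on rank 0's copy: no data race, because MPI,
       not the application, performs the synchronisation.                     */
    long one = 1, previous;
    MPI_Fetch_and_op(&one, &previous, MPI_LONG, 0, 0, MPI_SUM, win);
    MPI_Win_flush(0, win);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_sync(win);
    if (rank == 0) printf("final counter = %ld\n", *counter);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}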

MPI on a single machine dualcore

What happens if I run an MPI program that requires 3 processes (i.e. mpiexec -np 3 ./Program) on a single machine that has 2 CPUs?
This depends on your MPI implementation, of course. Most likely, it will create three processes, and use shared memory to exchange the messages. This will work just fine: the operating system will dispatch the two CPUs across the three processes, and always execute one of the ready processes. If a process waits to receive a message, it will block, and the operating system will schedule one of the other two processes to run - one of which will be the one that is sending the message.
Martin has given the right answer and I've plus-1ed him, but I just want to add a few subtleties which are a little too long to fit into the comment box.
There's nothing wrong with having more processes than cores, of course; you probably have dozens running on your machine well before you run any MPI program. You can try it with any command-line executable you have sitting around, something like mpirun -np 24 hostname or mpirun -np 17 ls on a Linux box, and you'll get 24 copies of your hostname, or 17 (probably interleaved) directory listings, and everything runs fine.
In MPI, using more processes than cores is generally called 'oversubscribing'. The fact that it has a special name already suggests that it's a special case. Programs written with MPI typically perform best when each process has its own core. There are situations where this need not be the case, but it's (by far) the usual one. For this reason, OpenMPI, for instance, has optimized for the usual case: it makes the strong assumption that every process has its own core, and so is very aggressive in using the CPU to poll to see whether a message has come in yet (since it figures it's not doing anything else crucial). That's not a problem, and can easily be turned off if OpenMPI knows it's being oversubscribed ( http://www.open-mpi.org/faq/?category=running#oversubscribing ). It's a design decision, and one which improves the performance of the vast majority of cases.
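For example, the FAQ entry linked above boils down to an MCA parameter that tells Open MPI to yield the CPU while waiting instead of busy-polling (check your version's documentation for the exact spelling):
mpirun --mca mpi_yield_when_idle 1 -np 3 ./Program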
For historical reasons I'm more familiar with OpenMPI than MPICH2, but my understanding is that MPICH2's defaults are more forgiving of the oversubscribed case, though I think even there it's possible to turn on more aggressive busy-waiting.
Anyway, this is a long way of saying that yes, what you're doing is perfectly fine, and if you see any weird problems when you switch MPI implementations or even MPI versions, do a quick search to see whether there are any parameters that need to be tweaked for this case.
