I have an MPI program written by other people.
The basic structure is like this:
program basis
  initialize MPI
  do n=1,12
    call mpi_job(n)
  end do
  finalize MPI
contains
  subroutine mpi_job(n) !this is mpi subroutine
  end subroutine
end program
What I want to do now is turn the do loop into a parallel do loop. So if I have a 24-core machine, I can run this program with 12 mpi_job calls running simultaneously, each using 2 threads. There are several reasons to do this; for example, the performance of mpi_job may not scale well with the number of cores. To sum up, I want to turn one level of MPI parallelization into two levels of parallelization.
I find myself constantly running into this problem when working with other people. The question is: what is the easiest and most efficient way to modify the program?
"So if I have a 24-core machine, I can run this program with 12 mpi_job calls running simultaneously, each using 2 threads."
I wouldn't do that. I recommend mapping MPI processes to NUMA nodes and then spawning k threads where there are k cores per NUMA node.
"There are several reasons to do this; for example, the performance of mpi_job may not scale well with the number of cores."
That's an entirely different issue. What aspect of mpi_job won't scale well? Is it memory bound? Does it require too much communication?
You should use sub-communicators.
Compute job_nr = floor(global_rank / ranks_per_job).
Use MPI_COMM_SPLIT with job_nr as the color. This creates a local communicator to be used for each job.
Pass the resulting communicator to mpi_job. All communication should then use that communicator and the rank local to that communicator.
Of course, this all implies that there are no dependencies between the different calls to mpi_job, or that you handle any such dependencies over an appropriate global/world communicator.
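The rank-to-job mapping described in these steps can be sketched in plain Python, without any MPI library. The helper name split_ranks is hypothetical; it only mimics the grouping that MPI_COMM_SPLIT would produce if job_nr were passed as the color and the global rank as the key. The values 24 and 2 are the numbers from the question.

```python
def split_ranks(world_size, ranks_per_job):
    """Simulate MPI_COMM_SPLIT with color = global_rank // ranks_per_job.

    Returns {global_rank: (job_nr, local_rank)}. This is a model of the
    grouping only; no communicator is actually created.
    """
    mapping = {}
    for global_rank in range(world_size):
        job_nr = global_rank // ranks_per_job      # color passed to the split
        local_rank = global_rank % ranks_per_job   # rank inside the job's communicator
        mapping[global_rank] = (job_nr, local_rank)
    return mapping


if __name__ == "__main__":
    # 24 ranks, 2 ranks per job: 12 independent jobs
    for rank, (job, local) in split_ranks(24, 2).items():
        print(f"world rank {rank:2d} -> job {job:2d}, local rank {local}")
```

Inside mpi_job, every rank would then address its partners by the local rank, never by the world rank.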
There is some confusion here over the basics of what you are trying to do. Your skeleton code will not run 12 MPI jobs at the same time; each MPI process that you create will run 12 jobs sequentially.
What you want to do is run 12 MPI processes, each of which calls mpi_job a single time. Within mpi_job, you can then create 2 threads using OpenMP.
Process and thread placement is outside the scope of the MPI and OpenMP standards. For example, ensuring that the processes are spread evenly across your multicore machine (e.g. each of the 12 even cores 0, 2, ... out of 24) and that the OpenMP threads run on even and odd pairs of cores would require you to look up the man pages for your MPI and OpenMP implementations. You may be able to place processes using arguments to mpiexec; thread placement may be controlled by environment variables, e.g. KMP_AFFINITY for Intel OpenMP.
Placement aside, here is some code that I think does what you want (I make no comment on whether it is the most efficient thing to do). I am using GNU compilers here.
user#laptop$ mpif90 -fopenmp -o basis basis.f90
user#laptop$ export OMP_NUM_THREADS=2
user#laptop$ mpiexec -n 12 ./basis
Running 12 MPI jobs at the same time
MPI job 2 , thread no. 1 reporting for duty
MPI job 11 , thread no. 1 reporting for duty
MPI job 11 , thread no. 0 reporting for duty
MPI job 8 , thread no. 0 reporting for duty
MPI job 0 , thread no. 1 reporting for duty
MPI job 0 , thread no. 0 reporting for duty
MPI job 2 , thread no. 0 reporting for duty
MPI job 8 , thread no. 1 reporting for duty
MPI job 4 , thread no. 1 reporting for duty
MPI job 4 , thread no. 0 reporting for duty
MPI job 10 , thread no. 1 reporting for duty
MPI job 10 , thread no. 0 reporting for duty
MPI job 3 , thread no. 1 reporting for duty
MPI job 3 , thread no. 0 reporting for duty
MPI job 1 , thread no. 0 reporting for duty
MPI job 1 , thread no. 1 reporting for duty
MPI job 5 , thread no. 0 reporting for duty
MPI job 5 , thread no. 1 reporting for duty
MPI job 9 , thread no. 1 reporting for duty
MPI job 9 , thread no. 0 reporting for duty
MPI job 7 , thread no. 0 reporting for duty
MPI job 7 , thread no. 1 reporting for duty
MPI job 6 , thread no. 1 reporting for duty
MPI job 6 , thread no. 0 reporting for duty
Here's the code:
program basis
  use mpi
  implicit none
  integer :: ierr, size, rank
  integer :: comm = MPI_COMM_WORLD

  call MPI_Init(ierr)
  call MPI_Comm_size(comm, size, ierr)
  call MPI_Comm_rank(comm, rank, ierr)
  if (rank == 0) then
    write(*,*) 'Running ', size, ' MPI jobs at the same time'
  end if
  call mpi_job(rank)
  call MPI_Finalize(ierr)

contains

  subroutine mpi_job(n) !this is mpi subroutine
    use omp_lib
    implicit none
    integer :: n, ithread
    !$omp parallel default(none) private(ithread) shared(n)
    ithread = omp_get_thread_num()
    write(*,*) 'MPI job ', n, ', thread no. ', ithread, ' reporting for duty'
    !$omp end parallel
  end subroutine mpi_job
end program basis
I want to distribute subroutines to different tasks with OpenMP.
In my code I implemented this:
!$omp parallel
!$omp single
do thread = 1, omp_get_num_threads()
  !$omp task
  write(*,*) "Task,", thread, "is computing"
  call find_pairs(me, thread, points)
  call count_neighbors(me, thread, neighbors(:, thread))
  !$omp end task
end do
!$omp end single
!$omp end parallel
The subroutines find_pairs and count_neighbors do some calculations.
I set the number of threads in my program before with:
nr_threads = 4
call omp_set_num_threads(nr_threads)
Compiling this with GNU Fortran (Ubuntu 8.3.0-6ubuntu1) 8.3.0 and running it gives me only one thread, running at nearly 100% when monitoring with top. Nevertheless, it prints the right output:
Task, 1 is computing
Task, 2 is computing
Task, 3 is computing
Task, 4 is computing
I compile it using:
gfortran -fopenmp main.f90 -o program
What I want is to distribute different calls of the subroutines according to
the number of OpenMP threads, working in parallel.
From what I understand, a single thread is created which creates the different
I have been experimenting with my code to send "parallel" commands to multiple serial COM ports.
My multi-threading code consists of:
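global q
q = Queue()
devices = [0, 1, 2, 3]
for i in devices:
    q.put(i)  # assumed loop body (missing from the post): enqueue each device
cpus = cpu_count()  # detect number of cores
logging.debug("Creating %d threads" % cpus)
for i in range(cpus):
    t = Thread(name='DeviceThread_' + str(i), target=testFunc1)
    t.daemon = True
    t.start()  # assumed (missing from the post): start the worker thread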
and multi-processing code consists of:
devices = [0, 1, 2, 3]
cpus=cpu_count() #detect number of cores
pool = Pool(cpus)
results = pool.map(multi_run_wrapper, devices)
I observe that the task of sending serial commands to 4 COM ports in "parallel" takes about 6 seconds, and multi-processing always takes an additional 0.5 to 1 second of total run time.
Any input on why there is a discrepancy on a Windows machine?
Well, for one, you're not comparing apples to apples. If you want equivalent code, use multiprocessing.dummy.Pool in your threaded case (which is the same as multiprocessing.Pool implemented in terms of threads, not processes), so you're at least using the same basic parallelization model with different internal implementations, not changing everything all at once.
Beyond that, launching the workers and communicating data to them has some overhead, more on Windows than on other systems since Windows can't fork to spawn new processes cheaply; it has to spawn a new Python instance and then copy over state via IPC to approximate forking.
Aside from that, you haven't provided enough information; your process and thread based worker functions aren't provided, and could cause significant differences in behavior. Nor have you provided information on how you're performing timing. Similarly, if each worker process needs to reinitialize the COM port communication library, that could involve non-trivial overhead.
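As a rough illustration of the like-for-like comparison suggested above, the sketch below times a thread-backed pool from multiprocessing.dummy. The worker fake_device_job and its 0.05 s sleep are hypothetical stand-ins for the real serial I/O, which was not shown in the question.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool, same API


def fake_device_job(device):
    """Hypothetical stand-in for talking to one COM port."""
    time.sleep(0.05)  # pretend to wait on serial I/O
    return device * 10


def run_with_threads(devices):
    """Map the worker over the devices with a thread pool and time the whole run."""
    start = time.perf_counter()
    with ThreadPool(len(devices)) as pool:
        results = pool.map(fake_device_job, devices)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_with_threads([0, 1, 2, 3])
    print(results, f"{elapsed:.3f} s")
```

Replacing the import with `from multiprocessing import Pool` gives the process-based variant with the same calls, so only the worker backend changes between the two timings. On Windows the process-based variant needs the `if __name__ == "__main__":` guard, because each worker re-imports the main module.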
I have several questions regarding CUDA. The following is a figure taken from a book on parallel programming. It shows how threads are allocated in the device for a multiplication of two vectors, each of length 8192.
1) In thread block 0 there are 15 SIMD threads. Are these 15 threads executed in parallel, or just one thread at a specific time?
2) Each block contains 512 elements in this example. Is this number dependent on the hardware, or is it a decision of the programmer?
In this particular example, each thread seems to be assigned to 32 elements in the vector. Code that is executed by a single thread is executed sequentially.
The size of the thread blocks is up to the programmer. However, there are restrictions on the number and size of the thread blocks given the hardware the code is executed on. For more information on this, see this elaborate answer:
Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)
From your illustration, it seems that:
The grid is composed of 16 thread blocks, numbered from 0 to 15.
Each block is composed of 16 "SIMD threads", numbered from 0 to 15.
Each "SIMD thread" computes the product of 32 vector elements.
It is not necessarily obvious from the illustration whether "SIMD thread" means, in the CUDA (OpenCL) parlance:
A warp (wavefront) of 32 threads (work-items)
A thread (work-item) working on 32 elements
I will assume the former ("SIMD thread" = warp/wavefront), since it is a more reasonable assumption performance-wise, but the latter isn't technically incorrect, it's simply suboptimal design (on current hardware, at least).
"1) In thread block 0 there are 15 SIMD threads. Are these 15 threads executed in parallel, or just one thread at a specific time?"
As stated above, there are 16 warps (numbered from 0 to 15, which makes 16) in thread block 0, each of them made of 32 threads. The threads of a warp execute in lockstep, simultaneously, in parallel. The warps are executed independently of one another, sequentially or in parallel, depending on the capabilities of the underlying hardware. For example, the hardware may be capable of scheduling a number of warps for simultaneous execution.
"2) Each block contains 512 elements in this example. Is this number dependent on the hardware, or is it a decision of the programmer?"
In this case, it is simply a decision of the programmer, but in some cases there are also hardware limitations that could force the programmer into changing the design. For example, there is a maximum number of threads a block can handle, and there is a maximum number of blocks a grid can handle.
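The arithmetic behind this layout can be checked with a short Python sketch. It assumes the "SIMD thread" = warp interpretation and a contiguous assignment in which vector element i is handled by global thread i. The illustration is not available here, so that assignment is an assumption, and locate is a hypothetical helper name.

```python
ELEMENTS = 8192        # vector length from the question
BLOCKS = 16            # thread blocks 0..15 in the grid
WARPS_PER_BLOCK = 16   # "SIMD threads" 0..15 in each block
WARP_SIZE = 32         # threads (lanes) per warp


def locate(element):
    """Return (block, warp, lane) for a vector element, assuming element i
    is handled by global thread i."""
    threads_per_block = WARPS_PER_BLOCK * WARP_SIZE  # 16 * 32 = 512
    block, thread_in_block = divmod(element, threads_per_block)
    warp, lane = divmod(thread_in_block, WARP_SIZE)
    return block, warp, lane


if __name__ == "__main__":
    print(BLOCKS * WARPS_PER_BLOCK * WARP_SIZE)  # total threads in the grid
    print(locate(0), locate(512), locate(8191))
```

Under these assumptions the 512 elements per block are simply 16 warps of 32 threads, which is why the figure is a design choice as long as it stays within the hardware's per-block thread limit.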
I am running MPI on my laptop (Intel i7 4700M quad core, 12 GB RAM) and the efficiency drops even for codes that involve no inter-process communication. Obviously I cannot just throw 100 processes at it since my machine is only quad-core, but I thought that it should scale well up to 8 processes (an Intel quad core presents itself as 8???). For example, consider this simple toy Fortran code:
program test
  use mpi
  implicit none
  integer, parameter :: root = 0
  integer, parameter :: nproc = 4   ! set by hand to the number of processes used
  integer, parameter :: n = 100000
  integer :: ierr, rank, tt, i
  real :: tstart, tend
  complex, dimension(n/nproc) :: u = 2.0, v = 0.0

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call cpu_time(tstart)
  do tt = 1, 200000
    do i = 1, n/nproc
      v(i) = v(i) + 0.1*u(i)
    end do
  end do
  call cpu_time(tend)
  if (rank == root) then
    print *, 'total time was: ', tend - tstart
  end if
  call MPI_FINALIZE(ierr)
end program test
For 2 processes it takes half the time, but even trying 4 processes (should be quarter of the time?) the result begins to become less efficient and for 8 processes there is no improvement whatsoever. Basically I am wondering if this is just because I am running on a laptop and has something to do with shared memory, or if I am making some fundamental mistake in my code. Thanks
Note: In the above example I manually change the nproc in the array declaration and the inner loop to be equal to the number of processors I am using.
A quad-core processor, thanks to hyperthreading, shows itself as having 8 threads, but physically there are just 4 cores. The other 4 threads are scheduled by the hardware itself using the free slots in the execution pipelines.
With compute-intensive loads in particular, this approach often does not pay off at all, and under extreme loads it can even be counterproductive because of overheads and less effective cache usage.
You can try to disable hyperthreading in the BIOS and compare it: you will have just 4 threads, 4 cores.
Even going from 1 to 4 there are resources that the cores compete for. In particular, each core has its own L1 and L2 caches (256 KB of L2 per core), while the 4 cores share the L3 cache.
And all the cores obviously share the memory channels.
So you cannot expect linear scaling as you occupy more and more cores, since they have to share resources that are dedicated to a single core/thread in the sequential case.
All of this without involving communications at all.
The same behavior happens on desktops/servers, in particular for memory-intensive loads such as the one in your test case.
It is less evident with, for example, matrix-matrix multiplication, which is compute-intensive: for an NxN matrix, you have O(N^2) memory accesses but O(N^3) floating point operations.
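The difference between the two workloads can be made concrete with a rough operation count. The sketch below treats elements as real numbers and assumes ideal caching for the matrix product (each matrix is read or written exactly once). Both are simplifying assumptions and the function names are hypothetical; the complex arithmetic in the test case changes the constants but not the scaling.

```python
def vector_update_intensity(n):
    """Flops per memory access for v(i) = v(i) + 0.1*u(i) over n elements."""
    flops = 2 * n        # one multiply and one add per element
    accesses = 3 * n     # read u(i), read v(i), write v(i)
    return flops / accesses


def matmul_intensity(n):
    """Flops per memory access for an n x n matrix-matrix product."""
    flops = 2 * n**3     # one multiply and one add per inner-product term
    accesses = 3 * n**2  # read A and B once, write C once (ideal caching)
    return flops / accesses


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, vector_update_intensity(n), matmul_intensity(n))
```

The vector update stays at the same low ratio for every n, so it remains limited by memory bandwidth, while the ratio for the matrix product grows linearly with n.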
I want to accurately pin my MPI processes to a list of (physical) cores. I refer to the following points of the mpirun --help output:
-cpu-set|--cpu-set <arg0>
Comma-separated list of ranges specifying logical
cpus allocated to this job [default: none]
-rf|--rankfile <arg0>
Provide a rankfile file
The topology of my processor is as follows:
CPU type: Intel Core Bloomfield processor
Hardware Thread Topology
Sockets: 1
Cores per socket: 4
Threads per core: 2
HWThread  Thread  Core  Socket
0         0       0     0
1         0       1     0
2         0       2     0
3         0       3     0
4         1       0     0
5         1       1     0
6         1       2     0
7         1       3     0
Socket 0: ( 0 4 1 5 2 6 3 7 )
Now, if I start my program using mpirun -np 2 --cpu-set 0,1 --report-bindings ./solver, the program starts normally but ignores the --cpu-set argument I provided. On the other hand, starting my program with mpirun -np 2 --rankfile rankfile --report-bindings ./solver gives me the following output:
[neptun:14781] [[16333,0],0] odls:default:fork binding child [[16333,1],0] to slot_list 0
[neptun:14781] [[16333,0],0] odls:default:fork binding child [[16333,1],1] to slot_list 1
Indeed checking with top shows me that mpirun actually uses the specified cores. But how should I interpret this output? Except for the host (neptun) and the specified slots (0,1) I don't have a clue. Same with the other commands I tried out:
$mpirun --np 2 --bind-to-core --report-bindings ./solver
[neptun:15166] [[15694,0],0] odls:default:fork binding child [[15694,1],0] to cpus 0001
[neptun:15166] [[15694,0],0] odls:default:fork binding child [[15694,1],1] to cpus 0002
$mpirun --np 2 --bind-to-socket --report-bindings ./solver
[neptun:15188] [[15652,0],0] odls:default:fork binding child [[15652,1],0] to socket 0 cpus 000f
[neptun:15188] [[15652,0],0] odls:default:fork binding child [[15652,1],1] to socket 0 cpus 000f
With --bind-to-core, the top command once again shows me that cores 0 and 1 are used, but why is the output cpus 0001 and 0002? --bind-to-socket causes even more confusion: 2x 000f?
I use the last paragraph to summarize the questions that arose from my experiments:
Why isn't my --cpu-set command working?
How am I supposed to interpret the output produced by --report-bindings?
The CPU-Topology was read out using LIKWID Performance Tools, more precisely using likwid-topology.
LIKWID is licensed under the GPL-3.0 license, see their GitHub for more info.
In both cases the output matches exactly what you have told Open MPI to do. The hexadecimal number in cpus ... shows the allowed CPUs (the affinity mask) for the process. This is a bit field with each bit representing one logical CPU.
With --bind-to-core each MPI process is bound to its own CPU core. Rank 0 ([...,0]) has its affinity mask set to 0001 which means logical CPU 0. Rank 1 ([...,1]) has its affinity mask set to 0002 which means logical CPU 1. The logical CPU numbering probably matches the HWThread identifier in the output with the topology information.
With --bind-to-socket each MPI process is bound to all cores of the socket. In your particular case the affinity mask is set to 000f, or 0000000000001111 in binary, which corresponds to all four cores in the socket. Only a single hyperthread per core is being assigned.
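To read such masks quickly, the bit field can be decoded with a few lines of Python. This is only a sketch for interpreting the hexadecimal strings printed by --report-bindings; the helper name cpus_in_mask is hypothetical and not part of Open MPI.

```python
def cpus_in_mask(hex_mask):
    """Return the logical CPU numbers whose bits are set in a hex affinity mask."""
    value = int(hex_mask, 16)
    # Bit 0 is logical CPU 0, bit 1 is logical CPU 1, and so on.
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


if __name__ == "__main__":
    for mask in ("0001", "0002", "000f"):
        print(mask, "->", cpus_in_mask(mask))
```

For the masks in the question this gives logical CPU 0 for 0001, logical CPU 1 for 0002, and logical CPUs 0 to 3 for 000f.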
You can further instruct Open MPI how to select the sockets on multisocket nodes. With --bysocket the sockets are selected in round-robin fashion, i.e. the first rank is placed on the first socket, the next rank is placed on the next socket, and so on until there is one process per socket; then the next rank is again put on the first socket, and so on. With --bycore each socket receives as many consecutive ranks as there are cores in that socket.
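The two placement policies can be modeled in a few lines of Python. This is a simplified model of the behavior described above, not Open MPI code: the function names are hypothetical, all sockets are assumed to have the same number of cores, and wrapping back to the first socket once every core is taken is an assumption.

```python
def by_socket(n_ranks, n_sockets):
    """Round-robin placement: consecutive ranks go to consecutive sockets."""
    return [rank % n_sockets for rank in range(n_ranks)]


def by_core(n_ranks, n_sockets, cores_per_socket):
    """Fill one socket with consecutive ranks before moving on to the next."""
    return [(rank // cores_per_socket) % n_sockets for rank in range(n_ranks)]


if __name__ == "__main__":
    # Hypothetical node with 2 sockets of 4 cores each, 8 ranks
    print("bysocket:", by_socket(8, 2))
    print("bycore:  ", by_core(8, 2, 4))
```

Each list gives the socket index for ranks 0, 1, 2, and so on, which makes the alternating and the block-wise patterns easy to compare.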
I would suggest that you read the manual for mpirun for Open MPI 1.4.x, especially the Process Binding section. There are some examples there with how the different binding options interact with each other. The --cpu-set option is not mentioned in the manual, although Jeff Squyres has written a nice page on processor affinity features in Open MPI (it is about v1.5, but most if not all of it applies to v1.4 also).