I want to accurately pin my MPI processes to a list of (physical) cores. I refer to the following points of the mpirun --help output:
-cpu-set|--cpu-set <arg0>
Comma-separated list of ranges specifying logical
cpus allocated to this job [default: none]
...
-rf|--rankfile <arg0>
Provide a rankfile file
The topology of my processor is as follows:
-------------------------------------------------------------
CPU type: Intel Core Bloomfield processor
*************************************************************
Hardware Thread Topology
*************************************************************
Sockets: 1
Cores per socket: 4
Threads per core: 2
-------------------------------------------------------------
HWThread Thread Core Socket
0 0 0 0
1 0 1 0
2 0 2 0
3 0 3 0
4 1 0 0
5 1 1 0
6 1 2 0
7 1 3 0
-------------------------------------------------------------
Socket 0: ( 0 4 1 5 2 6 3 7 )
-------------------------------------------------------------
Now, if I start my program using mpirun -np 2 --cpu-set 0,1 --report-bindings ./solver, it starts normally but without considering the --cpu-set argument I provided. On the other hand, starting my program with mpirun -np 2 --rankfile rankfile --report-bindings ./solver gives me the following output:
[neptun:14781] [[16333,0],0] odls:default:fork binding child [[16333,1],0] to slot_list 0
[neptun:14781] [[16333,0],0] odls:default:fork binding child [[16333,1],1] to slot_list 1
Indeed, checking with top shows me that mpirun actually uses the specified cores. But how should I interpret this output? Except for the host (neptun) and the specified slots (0, 1), I don't have a clue. The same goes for the other commands I tried:
$mpirun --np 2 --bind-to-core --report-bindings ./solver
[neptun:15166] [[15694,0],0] odls:default:fork binding child [[15694,1],0] to cpus 0001
[neptun:15166] [[15694,0],0] odls:default:fork binding child [[15694,1],1] to cpus 0002
and
$mpirun --np 2 --bind-to-socket --report-bindings ./solver
[neptun:15188] [[15652,0],0] odls:default:fork binding child [[15652,1],0] to socket 0 cpus 000f
[neptun:15188] [[15652,0],0] odls:default:fork binding child [[15652,1],1] to socket 0 cpus 000f
With --bind-to-core, the top command once again shows me that cores 0 and 1 are used, but why does the output read cpus 0001 and 0002? And --bind-to-socket causes even more confusion: 2x 000f?
To summarize the questions that arose from my experiments:
Why isn't my --cpu-set command working?
How am I supposed to interpret the output resulting from the --report-bindings output?
References
The CPU topology was read out using the LIKWID performance tools, more precisely using likwid-topology.
LIKWID is licensed under the GPL-3.0 license; see their GitHub for more info.
In both cases the output matches exactly what you have told Open MPI to do. The hexadecimal number in cpus ... shows the allowed CPUs (the affinity mask) for the process. This is a bit field with each bit representing one logical CPU.
With --bind-to-core each MPI process is bound to its own CPU core. Rank 0 ([...,0]) has its affinity mask set to 0001, which means logical CPU 0. Rank 1 ([...,1]) has its affinity mask set to 0002, which means logical CPU 1. The logical CPU numbering most likely matches the HWThread identifiers in your topology output.
With --bind-to-socket each MPI process is bound to all cores of the socket. In your particular case the affinity mask is set to 000f, or 0000000000001111 in binary, which corresponds to all four cores in the socket. Only a single hyperthread per core is being assigned.
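If you want to verify a binding from inside the program itself, a process can print its own affinity mask at run time. The following is a minimal, Linux-specific C sketch (using sched_getaffinity; not part of MPI) that lists the allowed logical CPUs, so a mask of 0001 comes out as CPU 0 and 000f as CPUs 0-3:

/* Minimal Linux-specific sketch: print this process's CPU affinity mask,
   which is exactly what the hexadecimal values from --report-bindings
   (0001, 0002, 000f) describe. Each set bit is one allowed logical CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("allowed logical CPUs:");
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    }
    printf("\n");

    return 0;
}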
You can further instruct Open MPI how to select the sockets on multi-socket nodes. With --bysocket the sockets are selected in round-robin fashion, i.e. the first rank is placed on the first socket, the next rank on the next socket, and so on until there is one process per socket; then the next rank is again put on the first socket, and so on. With --bycore each socket receives as many consecutive ranks as there are cores in that socket.
I would suggest that you read the mpirun manual for Open MPI 1.4.x, especially the Process Binding section. There are some examples there of how the different binding options interact with each other. The --cpu-set option is not mentioned in the manual, although Jeff Squyres has written a nice page on processor affinity features in Open MPI (it is about v1.5, but most if not all of it applies to v1.4 as well).
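For completeness, a rankfile for your two-rank case would look roughly like the following (this is the Open MPI 1.4 rankfile syntax as I remember it, binding the two ranks to logical CPUs 0 and 1 on neptun; please double-check it against your man page). A slot can also be given as socket:core, e.g. slot=0:1 for core 1 on socket 0, if you prefer to address physical cores:

rank 0=neptun slot=0
rank 1=neptun slot=1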
Related
My university has computational nodes with 128 cores in total, made up of two individual AMD processors (i.e., sockets) with 64 cores each. This leads to anomalous simulation runtimes in ABAQUS when using crystal plasticity models implemented in a User MATerial Subroutine (UMAT). For instance, if I run a simulation using 1 node and 128 cores, it takes around 14 hours. If I submit the same job to run across two nodes with 128 cores (i.e., using 64 cores/1 processor on two separate nodes), the job finishes in only 9 hours. One would expect the simulation running on a single host node to be faster than on two separate nodes for the same total number of cores, but this is not the case. The problem is that in the latter configuration, each host node contains two processors, each with 64 cores, and the abaqus_v6.env file therefore contains:
mp_host_list=[['node_1', 64],['node_2', 64]]
for the 2 node/128 core simulation. The ABAQUS .msg file then accordingly splits the job into two processes each with 64 threads:
PROCESS 1 THREAD NUMBER OF ELEMENTS
1 3840
2 3840
...
63
64 3840
PROCESS 2 THREAD NUMBER OF ELEMENTS
1 3584
2 4096
...
63
64 3840
The problem arises when I specify a single host node with 128 cores because ABAQUS has no way of recognizing that the host node consists of two separate processors. I can modify the abaqus_v6.env file accordingly as:
mp_host_list=[['node_1', 64],['node_1', 64]]
but ABAQUS just lumps this into one process with 128 threads. I believe this is why my simulations actually run quicker on two nodes than on one with the same number of cores: ABAQUS does not recognize that it should treat the single node as two processors/processes.
Is there a way to specify two processes on the same host node in ABAQUS?
As a note, the amount of memory/RAM reserved per core does not change (~2 GB per core).
Final update: able to reduce runtimes using multiple nodes
I found that running these types of simulations across multiple nodes reduces run times. A table of simulation speeds for two models across various numbers of cores, nodes, and cores/processor are listed below.
The smaller model finished in 9.7 hours on two nodes with 64 cores/node = a total of 128 cores. The runtime dropped by 25% when simulated over four nodes with 32 cores/node for the same total of 128 cores. Interestingly, the simulation took longer using three nodes with 64 cores/node (a total of 192 cores), and there could be many reasons for this. One surprising result was that the simulation ran quicker using 64 cores split over two nodes (32 cores/socket) vs. 64 cores on a single socket, which means the extra memory bandwidth from using multiple nodes helps (the details of which I do not fully understand).
The larger model finished in ~32.5 hours using 192 cores, and there was little difference between using three nodes (64 cores/processor) and six nodes (32 cores/processor), which means that at some point using more nodes does not help. However, this larger model finished in 36.7 hours using 128 cores with 32 cores/processor (four nodes). Thus, the most efficient use of nodes for both the larger and the smaller model is with 128 CPUs split over four nodes.
Simulation details for a model with 477,956 tetrahedral elements and 86,153 nodes. Model is cyclically strained to a strain of 1.3% for 10 cycles with a strain ratio R = 0.
# CPUs | # Nodes | # ABAQUS processes: actual | # ABAQUS processes: ideal | Cores per processor | Wall time (hr) | Notes
64 | 1 | 1 | 2 | 32 cores/processor | 13.8 | Using cores on two processors but unable to specify two separate processes
64 | 1 | 1 | 1 | 64 cores/processor | 11.5 | No need to specify two processes
64 | 2 | 2 | 2 | 32 cores/processor | 10.5 | Correctly specifies two processes. Surprisingly faster than the scenario directly above!
128 | 1 | 1 | 2 | 64 cores/processor | 14.5 | Unable to specify two separate processes
128 | 2 | 2 | 2 | 64 cores/processor | 8.9 | Correctly specifies two processes
128 | 2 | 2 | 4 | ~32 cores/processor; 4 total processors | 9.9 | Specifies two processes but should be four processes
128 | 2 | 2 | 3 | 64 cores/processor | 9.7 | Specifies two processes over three processors
128 | 4 | 4 | 4 | 32 cores/processor | 7.2 | 32 cores per node. Most efficient!
192 | 3 | 3 | 3 | 64 cores/processor | 7.6 | Three nodes with three processors
192 | 2 | 2 | 4 | 64 and 32 cores/processor across the two nodes | 10.5 | Four processors over two nodes
Simulation details for a model with 4,104,272 tetrahedral elements and 702,859 nodes. The model is strained to 1.3% strain and then back to 0% strain (one cycle).
# CPUs | # Nodes | # ABAQUS processes: actual | # ABAQUS processes: ideal | Cores per processor | Wall time (hr) | Notes
64 | 1 | 1 | 1 | 64 cores/processor | 53.0 | Using a single processor
128 | 1 | 1 | 2 | 64 cores/processor | 57.3 | Using two processors on one node
128 | 2 | 2 | 2 | 64 cores/processor | 40.9 |
128 | 4 | 4 | 4 | 32 cores/processor | 36.7 | Most efficient!
192 | 2 | 2 | 4 | 64 and 32 cores/processor across the two nodes | 42.7 |
192 | 3 | 3 | 3 | 64 cores/processor | 32.4 |
192 | 6 | 6 | 6 | 32 cores/processor | 32.6 |
First of all, sorry for my bad English; my English is not that good...
Before the question, I want to explain my situation to help understanding.
I want to use EEPROM as a kind of counter.
The value of that counter will be incremented very frequently, so I have to consider the endurance problem.
My idea is to write the counter value to multiple addresses in rotation, so the wear on each cell is reduced.
For example, if I use a 5x area for counting:
Count 1 -> 1 0 0 0 0
Count 2 -> 1 2 0 0 0
Count 3 -> 1 2 3 0 0
Count 4 -> 1 2 3 4 0
Count 5 -> 1 2 3 4 5
Count 6 -> 6 2 3 4 5
...
So cell endurance can be extended by a factor of N.
However, AFAIK, for current NAND flash, data erase/write is done in groups of bytes, called blocks. So if all the bytes are within a single write/erase block, my method would not work.
So, my main question: is the erase/write operation of the PIC's EEPROM done on a group of bytes, or on a single word or byte?
For example, if it is done in groups of 8 bytes, then I should leave an 8-byte offset between the counter values to make my method work properly.
Otherwise, if it is done per byte or per word, I don't have to think about spacing/offsets.
From the PIC24FJ256GB110 datasheet, section 5.0:
The user may write program memory data in blocks of 64 instructions
(192 bytes) at a time, and erase program memory in blocks of 512
instructions (1536 bytes) at a time.
However, you can overwrite an individual block several times as long as you leave the rest of the block erased (bits set to one) and the previous content stays the same. Remember: you can clear a single bit in a block only once.
How much the data retention decreases after 8 writes into a single flash block, I don't know!
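For what it is worth, here is a minimal C sketch of the rotating-counter scheme from the question. It assumes a word-addressable EEPROM reached through hypothetical eeprom_read_word/eeprom_write_word helpers (replace them with your compiler's EEPROM routines) and ignores value wrap-around and power-loss robustness:

#include <stdint.h>

#define SLOTS 5u   /* N: each cell is rewritten only every N-th increment */

/* Hypothetical single-word EEPROM access helpers; on a PIC you would
   implement these with the compiler's EEPROM read/write routines. */
uint16_t eeprom_read_word(uint16_t addr);
void     eeprom_write_word(uint16_t addr, uint16_t value);

/* The current count is the largest value stored in any of the N slots
   (all slots are assumed to start at 0 after an initial erase). */
uint16_t counter_read(void)
{
    uint16_t best = 0;
    for (uint16_t i = 0; i < SLOTS; i++) {
        uint16_t v = eeprom_read_word(i);
        if (v > best)
            best = v;
    }
    return best;
}

/* Increment: the new value n goes into slot (n - 1) % SLOTS, exactly as in
   the question's example (count 6 overwrites the slot that held count 1). */
void counter_increment(void)
{
    uint16_t n = counter_read() + 1u;
    eeprom_write_word((n - 1u) % SLOTS, n);
}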
I have a server with four MIC cards (mic0-mic3), and it works well. I want to disable some of the MICs, for example mic3, so that only mic0-mic2 are available.
What should I do?
OFFLOAD_DEVICES="0,1,2" # run with devices 0, 1 and 2 visible
The environment variable OFFLOAD_DEVICES restricts the process to using only the MIC cards specified as the value of the variable. The value is a comma-separated list of physical device numbers in the range 0 to (number_of_devices_in_the_system - 1).
Devices available for offloading are numbered logically. That is, _Offload_number_of_devices() returns the number of allowed devices, and the device indexes specified in the target specifiers of offload pragmas are in the range 0 to (number_of_allowed_devices - 1).
Example
export OFFLOAD_DEVICES="1,2"
This allows the program to use only physical MIC cards 1 and 2 (for instance, in a system with four installed cards). Offloads to devices numbered 0 or 1 will be performed on physical devices 1 and 2. Offloads to target numbers higher than 1 will wrap around so that all offloads remain within logical devices 0 and 1 (which map to physical cards 1 and 2). The function _Offload_get_device_number(), executed on a MIC device, will return 0 or 1 when the offload is running on physical device 1 or 2, respectively.
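If you want to see the remapping in action, something along these lines should work (a sketch that assumes the Intel compiler's offload support and the offload.h API mentioned above; adjust the loop count to your card count):

#include <stdio.h>
#include <offload.h>   /* _Offload_number_of_devices, _Offload_get_device_number */

int main(void)
{
    /* Number of devices visible to this process after OFFLOAD_DEVICES filtering. */
    printf("allowed devices: %d\n", _Offload_number_of_devices());

    for (int i = 0; i < 4; i++) {
        int dev = -1;

        /* Target numbers beyond the allowed range wrap around. */
        #pragma offload target(mic:i) out(dev)
        {
            dev = _Offload_get_device_number();
        }

        printf("offload target %d ran on device %d\n", i, dev);
    }

    return 0;
}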
I got an MPI program written by other people.
The basic structure is like this:
program basis
  initialize MPI
  do n=1,12
    call mpi_job(n)
  end do
  finalize MPI
contains
  subroutine mpi_job(n) !this is mpi subroutine
    .....
  end subroutine
end program
What I want to do now is to make the do loop a parallel do loop. So if I have a 24-core machine, I can run this program with 12 instances of mpi_job running simultaneously, each using 2 threads. There are several reasons to do this; for example, the performance of mpi_job may not scale well with the number of cores. To sum up, I want to turn one level of MPI parallelization into two levels of parallelization.
I constantly encounter this problem when working with other people. The question is: what is the easiest and most efficient way to modify the program?
So if I have a 24-core machine, I can run this program with 12 instances of mpi_job running simultaneously, each using 2 threads.
I wouldn't do that. I recommend mapping MPI processes to NUMA nodes and then spawning k threads where there are k cores per NUMA node.
There are several reasons to do this; for example, the performance of mpi_job may not scale well with the number of cores.
That's an entirely different issue. What aspect of mpi_job won't scale well? Is it memory bound? Does it require too much communication?
You should use sub-communicators:
1. Compute job_nr = floor(global_rank / ranks_per_job).
2. Use MPI_COMM_SPLIT with job_nr as the color. This creates a local communicator for each job.
3. Pass the resulting communicator to mpi_job. All communication should then use that communicator and the rank local to that communicator (see the sketch below).
Of course, this all implies that there are no dependencies between the different calls to mpi_job - or that you handle those over the appropriate global/world communicator.
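Here is a sketch of that splitting scheme, written in C for brevity (the equivalent Fortran calls are MPI_COMM_SPLIT, MPI_COMM_RANK, and so on); ranks_per_job = 2 matches the 24-core, 12-job scenario:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const int ranks_per_job = 2;              /* e.g. 24 ranks -> 12 jobs */
    int job_nr = world_rank / ranks_per_job;  /* color: which job this rank belongs to */

    /* All ranks with the same job_nr end up in the same sub-communicator. */
    MPI_Comm job_comm;
    MPI_Comm_split(MPI_COMM_WORLD, job_nr, world_rank, &job_comm);

    int job_rank, job_size;
    MPI_Comm_rank(job_comm, &job_rank);
    MPI_Comm_size(job_comm, &job_size);
    printf("world rank %d -> job %d, local rank %d of %d\n",
           world_rank, job_nr, job_rank, job_size);

    /* mpi_job(job_nr, job_comm);  <- all communication inside a job would then
       use job_comm and job_rank instead of MPI_COMM_WORLD and world_rank. */

    MPI_Comm_free(&job_comm);
    MPI_Finalize();
    return 0;
}

Launched with mpiexec -n 24 and ranks_per_job = 2, this produces 12 sub-communicators of 2 ranks each, one per call to mpi_job.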
There is some confusion here over the basics of what you are trying to do. Your skeleton code will not run 12 MPI jobs at the same time; each MPI process that you create will run 12 jobs sequentially.
What you want to do is run 12 MPI processes, each of which calls mpi_job a single time. Within mpi_job, you can then create 2 threads using OpenMP.
Process and thread placement is outside the scope of the MPI and OpenMP standards. For example, ensuring that the processes are spread evenly across your multicore machine (e.g. each of the 12 even cores 0, 2, ... out of 24) and that the OpenMP threads run on even and odd pairs of cores would require you to look up the man pages for your MPI and OpenMP implementations. You may be able to place processes using arguments to mpiexec; thread placement may be controlled by environment variables, e.g. KMP_AFFINITY for Intel OpenMP.
Placement aside, here is a code that I think does what you want (I make no comment on whether it is the most efficient thing to do). I am using GNU compilers here.
user@laptop$ mpif90 -fopenmp -o basis basis.f90
user@laptop$ export OMP_NUM_THREADS=2
user@laptop$ mpiexec -n 12 ./basis
Running 12 MPI jobs at the same time
MPI job 2 , thread no. 1 reporting for duty
MPI job 11 , thread no. 1 reporting for duty
MPI job 11 , thread no. 0 reporting for duty
MPI job 8 , thread no. 0 reporting for duty
MPI job 0 , thread no. 1 reporting for duty
MPI job 0 , thread no. 0 reporting for duty
MPI job 2 , thread no. 0 reporting for duty
MPI job 8 , thread no. 1 reporting for duty
MPI job 4 , thread no. 1 reporting for duty
MPI job 4 , thread no. 0 reporting for duty
MPI job 10 , thread no. 1 reporting for duty
MPI job 10 , thread no. 0 reporting for duty
MPI job 3 , thread no. 1 reporting for duty
MPI job 3 , thread no. 0 reporting for duty
MPI job 1 , thread no. 0 reporting for duty
MPI job 1 , thread no. 1 reporting for duty
MPI job 5 , thread no. 0 reporting for duty
MPI job 5 , thread no. 1 reporting for duty
MPI job 9 , thread no. 1 reporting for duty
MPI job 9 , thread no. 0 reporting for duty
MPI job 7 , thread no. 0 reporting for duty
MPI job 7 , thread no. 1 reporting for duty
MPI job 6 , thread no. 1 reporting for duty
MPI job 6 , thread no. 0 reporting for duty
Here's the code:
program basis

  use mpi
  implicit none

  integer :: ierr, size, rank
  integer :: comm = MPI_COMM_WORLD

  call MPI_Init(ierr)
  call MPI_Comm_size(comm, size, ierr)
  call MPI_Comm_rank(comm, rank, ierr)

  if (rank == 0) then
     write(*,*) 'Running ', size, ' MPI jobs at the same time'
  end if

  call mpi_job(rank)

  call MPI_Finalize(ierr)

contains

  subroutine mpi_job(n) !this is mpi subroutine

    use omp_lib
    implicit none

    integer :: n, ithread

    !$omp parallel default(none) private(ithread) shared(n)
    ithread = omp_get_thread_num()
    write(*,*) 'MPI job ', n, ', thread no. ', ithread, ' reporting for duty'
    !$omp end parallel

  end subroutine mpi_job

end program basis
Could somebody please provide a step-through approach to solving the following problem using the Banker's Algorithm? How do I determine whether a "safe state" exists? What is meant by saying a process can "run to completion"?
In this example, I have four processes and 10 instances of the same resource.
Resources Allocated | Resources Needed
Process A 1 6
Process B 1 5
Process C 2 4
Process D 4 7
Per Wikipedia,
A state (as in the above example) is considered safe if it is possible for all processes to finish executing (terminate). Since the system cannot know when a process will terminate, or how many resources it will have requested by then, the system assumes that all processes will eventually attempt to acquire their stated maximum resources and terminate soon afterward. This is a reasonable assumption in most cases since the system is not particularly concerned with how long each process runs (at least not from a deadlock avoidance perspective). Also, if a process terminates without acquiring its maximum resources, it only makes it easier on the system.
A process can run to completion when the number of each type of resource that it needs is available, between itself and the system. If a process needs 8 units of a given resource, and has allocated 5 units, then it can run to completion if there are at least 3 more units available that it can allocate.
Given your example, the system is managing a single resource, with 10 units available. The running processes have already allocated 8 (1+1+2+4) units, so there are 2 units left. The amount that any process needs to complete is its maximum less whatever it has already allocated, so at the start, A needs 5 more (6-1), B needs 4 more (5-1), C needs 2 more (4-2), and D needs 3 more (7-4). There are 2 available, so Process C is allowed to run to completion, thus freeing up 2 units (leaving 4 available). At this point, either B or D can be run (we'll assume D). Once D has completed, there will be 8 units available, after which either A or B can be run (we'll assume A). Once A has completed, there will be 9 units available, and then B can be run, which will leave all 10 units left for further work. Since we can select an ordering of processes that will allow all processes to be run, the state is considered 'safe'.
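For illustration, here is a minimal C sketch of that single-resource safety check, using the Allocated and Needed numbers from the question (the process names and array layout are just for this example):

#include <stdbool.h>
#include <stdio.h>

#define NPROC 4

int main(void)
{
    const char name[NPROC]     = {'A', 'B', 'C', 'D'};
    const int allocated[NPROC] = {1, 1, 2, 4};
    const int needed[NPROC]    = {6, 5, 4, 7};   /* maximum claim of each process */
    int available = 10 - (1 + 1 + 2 + 4);        /* 2 units currently free */

    bool finished[NPROC] = {false, false, false, false};
    int done = 0;

    while (done < NPROC) {
        bool progress = false;
        for (int i = 0; i < NPROC; i++) {
            int remaining = needed[i] - allocated[i];   /* what it still has to acquire */
            if (!finished[i] && remaining <= available) {
                /* Lend it the remaining units, let it finish, reclaim everything:
                   the net effect is that its current allocation returns to the pool. */
                printf("run %c to completion, it frees %d units\n",
                       name[i], allocated[i] + remaining);
                available += allocated[i];
                finished[i] = true;
                progress = true;
                done++;
            }
        }
        if (!progress) {
            printf("no process can run to completion: state is UNSAFE\n");
            return 1;
        }
    }
    printf("all processes can finish: state is SAFE\n");
    return 0;
}

Running it reproduces the ordering of the walk-through above: C, then D, then A, then B.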
Resources Allocated | Resources Needed | Claim (still needed)
Process A 1 6 5
Process B 1 5 4
Process C 2 4 2
Process D 4 7 3
The total number of resources allocated is 8.
Hence 2 resources are still free; these are allocated to process C. After finishing, process C releases 4 resources, which can be given to process B. Process B, after finishing, releases 5 resources, which are allocated to process A. Then process A, after finishing, releases 6 resources, which covers process D's remaining claim of 3, so process D can also complete.