Understanding the MPI map-by and bind-to options

This tutorial shows how to use the map-by option to handle complex scenarios such as running a hybrid program that mixes MPI and OpenMP.

Before starting

The behavior of MPI varies significantly with the environment (including the MPI implementation and version, dependent libraries, and job schedulers). All the experiments in this article were conducted with OpenMPI 4.0.2, so if you use a different implementation or version of MPI, you may see different behavior. For example, OpenMPI 2.1.1, the default version bundled with Ubuntu 18.04, behaves strangely and fails to control the number of OpenMP threads when running a hybrid program. I therefore strongly recommend using the latest version of MPI.

Test Environment

On the test platform, each machine has 2 NUMA nodes, 36 physical cores, and 72 hardware threads in total. The hybrid test program can be downloaded from https://rcc.uchicago.edu/docs/running-jobs/hybrid/index.html, and OpenMPI 4.0.2 and GCC 7.3.0 were installed from Anaconda.

map-by unit

This is the most basic syntax. unit can be one of hwthread, core, L1cache, L2cache, L3cache, socket, numa, board, or node. Note that hwthread means hardware thread, while core means physical core. The numa option is commonly used.

The following examples illustrate the differences between these options. To keep the output clear, PE=1 was added to limit the number of threads; it is introduced in the section map-by unit:pe=n. --report-bindings is an OpenMPI-specific option that visualizes bindings; see the Appendix for the equivalent options in other MPI implementations.

map-by numa

mpirun -n 4 --map-by numa:PE=1 --report-bindings ./a.out
[asialab-01:69587] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69587] MCW rank 1 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../../../../../..]
[asialab-01:69587] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69587] MCW rank 3 bound to socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][../BB/../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 3 out of 4 on asialab-01

map-by hwthread

mpirun -n 4 --map-by hwthread:PE=1 --report-bindings ./a.out
[asialab-01:69621] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69621] MCW rank 1 bound to socket 0[core 0[hwt 1]]: [.B/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69621] MCW rank 2 bound to socket 0[core 1[hwt 0]]: [../B./../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69621] MCW rank 3 bound to socket 0[core 1[hwt 1]]: [../.B/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 1 from process 3 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 0 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 1 out of 4 on asialab-01

map-by core

mpirun -n 4 --map-by core:PE=1 --report-bindings ./a.out
[asialab-01:69653] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69653] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69653] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69653] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 3 out of 4 on asialab-01

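Observe that with map-by numa, MPI takes the NUMA architecture into account and distributes the ranks round-robin across the two NUMA nodes (ranks 0 and 2 on socket 0, ranks 1 and 3 on socket 1). With map-by hwthread or map-by core, the ranks simply fill consecutive hardware threads or cores on socket 0, and NUMA placement is ignored.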
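
The difference between the two placements can be mimicked with a small toy model. The Python sketch below is only an illustration and not OpenMPI's actual mapper: it assumes the layout of the test machine (2 NUMA nodes with 18 cores each), and the function names map_by_numa and map_by_core are made up for this example. Each function returns the first core assigned to each rank.

```python
# Toy model of the test machine (assumed from the Test Environment section).
NUMA_NODES = 2
CORES_PER_NUMA = 18


def map_by_numa(n_ranks):
    """Deal ranks round-robin across NUMA nodes; return each rank's core."""
    next_free = [0] * NUMA_NODES  # next unused core index inside each NUMA node
    placement = []
    for rank in range(n_ranks):
        numa = rank % NUMA_NODES
        placement.append(numa * CORES_PER_NUMA + next_free[numa])
        next_free[numa] += 1
    return placement


def map_by_core(n_ranks):
    """Fill consecutive cores, ignoring NUMA boundaries."""
    return list(range(n_ranks))


print(map_by_numa(4))  # [0, 18, 1, 19]
print(map_by_core(4))  # [0, 1, 2, 3]
```

The two lists agree with the cores reported by --report-bindings for map-by numa and map-by core above.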

You may notice that with map-by hwthread, each rank gets only one OpenMP thread. The section bind-to unit explains this to some degree.

bind-to unit

If this option is not specified, the default in these experiments is core. Although this option is less important than map-by, it introduces several interesting concepts. You may have heard the word slot; you can think of each slot as holding at most one rank. My understanding is that each slot is bound to the specified unit, such as a hardware thread or a physical core. Let us go through some examples.

bind-to hwthread

mpirun -n 4 --bind-to hwthread --map-by numa --report-bindings ./a.out
[asialab-01:71905] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:71905] MCW rank 1 bound to socket 1[core 18[hwt 0]]: [../../../../../../../../../../../../../../../../../..][B./../../../../../../../../../../../../../../../../..]
[asialab-01:71905] MCW rank 2 bound to socket 0[core 0[hwt 1]]: [.B/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:71905] MCW rank 3 bound to socket 1[core 18[hwt 1]]: [../../../../../../../../../../../../../../../../../..][.B/../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 1 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 3 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 0 out of 4 on asialab-01

bind-to core

mpirun -n 4 --bind-to core --map-by numa --report-bindings ./a.out
[asialab-01:71922] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:71922] MCW rank 1 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../../../../../..]
[asialab-01:71922] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:71922] MCW rank 3 bound to socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][../BB/../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 3 out of 4 on asialab-01

bind-to numa

mpirun -n 2 --bind-to numa --map-by numa --report-bindings ./a.out
[asialab-01:72100] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../..]
[asialab-01:72100] MCW rank 1 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]], socket 1[core 20[hwt 0-1]], socket 1[core 21[hwt 0-1]], socket 1[core 22[hwt 0-1]], socket 1[core 23[hwt 0-1]], socket 1[core 24[hwt 0-1]], socket 1[core 25[hwt 0-1]], socket 1[core 26[hwt 0-1]], socket 1[core 27[hwt 0-1]], socket 1[core 28[hwt 0-1]], socket 1[core 29[hwt 0-1]], socket 1[core 30[hwt 0-1]], socket 1[core 31[hwt 0-1]], socket 1[core 32[hwt 0-1]], socket 1[core 33[hwt 0-1]], socket 1[core 34[hwt 0-1]], socket 1[core 35[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
...

If each slot is bound to a hardware thread, OpenMP can use only one thread in each rank. If each slot is bound to a physical core, OpenMP can use all the hardware threads of that core. If each slot is bound to a NUMA node, OpenMP can use all the hardware threads of that NUMA node.
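
One way to check from inside a process which hardware threads it is allowed to use is to query its CPU affinity mask. The Python sketch below is a generic Linux illustration and is not part of the test program. The assumption here, consistent with the outputs above but not verified against the OpenMP runtime itself, is that the default OpenMP thread count follows the size of this mask. The function name allowed_cpus is made up for this example.

```python
import os


def allowed_cpus():
    """Return the sorted logical CPU ids this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: the affinity mask reflects any binding applied by the launcher.
        return sorted(os.sched_getaffinity(0))
    # Other platforms: no affinity query available, so assume no binding.
    return list(range(os.cpu_count() or 1))


cpus = allowed_cpus()
print(f"{len(cpus)} logical CPUs available to this process: {cpus}")
```

If such a script were launched through mpirun in place of ./a.out, one would expect it to report 1 CPU under bind-to hwthread and 2 CPUs under bind-to core on the test machine.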

map-by unit:pe=n

The previous section introduced the concept of a slot. By default, each slot is bound to one physical core. This section digs into pe, which appears to stand for processing element. The concept is ambiguous; my understanding is that pe=n determines the number of units that each slot occupies. Here are some examples.

bind-to core, PE=1

mpirun -n 2 --bind-to core --map-by numa:PE=1 --report-bindings ./a.out
[asialab-01:72668] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:72668] MCW rank 1 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 2 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 2 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 2 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 2 on asialab-01

bind-to core, PE=2

mpirun -n 2 --bind-to core --map-by numa:PE=2 --report-bindings ./a.out
[asialab-01:72700] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:72700] MCW rank 1 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/BB/../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 4 from process 0 out of 2 on asialab-01
Hello from thread 2 out of 4 from process 0 out of 2 on asialab-01
Hello from thread 3 out of 4 from process 0 out of 2 on asialab-01
Hello from thread 1 out of 4 from process 0 out of 2 on asialab-01
Hello from thread 2 out of 4 from process 1 out of 2 on asialab-01
Hello from thread 0 out of 4 from process 1 out of 2 on asialab-01
Hello from thread 1 out of 4 from process 1 out of 2 on asialab-01
Hello from thread 3 out of 4 from process 1 out of 2 on asialab-01

bind-to hwthread, PE=1

mpirun -n 2 --bind-to hwthread --map-by numa:PE=1 --report-bindings ./a.out
[asialab-01:72729] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:72729] MCW rank 1 bound to socket 1[core 18[hwt 0]]: [../../../../../../../../../../../../../../../../../..][B./../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 1 from process 1 out of 2 on asialab-01
Hello from thread 0 out of 1 from process 0 out of 2 on asialab-01

bind-to hwthread, PE=2

mpirun -n 2 --bind-to hwthread --map-by numa:PE=2 --report-bindings ./a.out
[asialab-01:72740] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:72740] MCW rank 1 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 2 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 2 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 2 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 2 on asialab-01

However, I failed to launch with the setting -n 1 --bind-to numa --map-by node:PE=2, where I expected a single rank to consume all the resources (equivalent to bind-to core with PE=36, or bind-to hwthread with PE=72). In any case, pe=n works well in combination with bind-to core or bind-to hwthread.
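
The thread counts in the four examples above follow a simple pattern: the number of hardware threads a rank can use is pe multiplied by the number of hardware threads in the bind-to unit. The Python sketch below merely restates that observation for the test machine (2 hardware threads per core); it is not how OpenMPI computes bindings, and the name hwthreads_per_rank is made up for this example.

```python
# Hardware threads contained in each bind-to unit on the test machine
# (assumption: 2 hardware threads per physical core).
HWTHREADS_PER_UNIT = {"hwthread": 1, "core": 2}


def hwthreads_per_rank(bind_to, pe):
    """Hardware threads (and observed OpenMP threads) available to one rank."""
    return pe * HWTHREADS_PER_UNIT[bind_to]


for bind_to, pe in [("core", 1), ("core", 2), ("hwthread", 1), ("hwthread", 2)]:
    print(f"bind-to {bind_to}, PE={pe}: {hwthreads_per_rank(bind_to, pe)} threads")
# bind-to core, PE=1: 2 threads
# bind-to core, PE=2: 4 threads
# bind-to hwthread, PE=1: 1 threads
# bind-to hwthread, PE=2: 2 threads
```

These values match the "out of N" thread counts printed by the test program in each of the four runs.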

map-by ppr:n:unit

ppr is short for processes per resource. A process here is essentially an MPI rank. This option limits the maximum number of ranks (that is, the number of slots) that each unit can hold. Let us verify this.

bind-to core, ppr:4

mpirun -n 4 --bind-to core --map-by ppr:4:numa --report-bindings ./a.out
[asialab-01:00520] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00520] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00520] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00520] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 3 out of 4 on asialab-01

bind-to core, ppr:2

mpirun -n 4 --bind-to core --map-by ppr:2:numa --report-bindings ./a.out
[asialab-01:00541] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00541] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00541] MCW rank 2 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../../../../../..]
[asialab-01:00541] MCW rank 3 bound to socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][../BB/../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 4 on asialab-01

bind-to core, ppr:1

mpirun -n 4 --bind-to core --map-by ppr:1:numa --report-bindings ./a.out
--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:

App: ./a.out
Number of procs: 4
PPR: 1:numa

Please revise the conflict and try again.
--------------------------------------------------------------------------

Since each NUMA node can hold at most one MPI process and there are only two NUMA nodes, at most two ranks can be placed, so launching four ranks fails as expected.
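
The three ppr runs can be summarized as a capacity check followed by filling each unit in turn. The Python sketch below is a toy model under that assumption, not OpenMPI's implementation; place_ppr and its parameters are hypothetical names. It returns the index of the NUMA node that each rank lands on.

```python
def place_ppr(n_ranks, ppr, n_units):
    """Fill each unit with up to ppr ranks; return the unit index per rank."""
    capacity = ppr * n_units
    if n_ranks > capacity:
        raise ValueError(
            f"requested {n_ranks} ranks, but ppr={ppr} on {n_units} units "
            f"supports at most {capacity}"
        )
    return [rank // ppr for rank in range(n_ranks)]


print(place_ppr(4, ppr=4, n_units=2))  # [0, 0, 0, 0]
print(place_ppr(4, ppr=2, n_units=2))  # [0, 0, 1, 1]
try:
    place_ppr(4, ppr=1, n_units=2)
except ValueError as err:
    print("error:", err)
```

The first two results match the sockets reported for ppr:4:numa and ppr:2:numa, and the third call fails for the same reason as the ppr:1:numa launch.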

map-by ppr:n:unit:pe=n

This is the complete form of map-by. It introduces nothing new, so you should be able to explain the following more complex example. Hint: -host hostname:-1 lets MPI detect the number of available slots on the remote machine automatically.

mpirun -n 4 -host asialab-01:-1,asialab-03:-1 --bind-to hwthread --map-by ppr:1:numa:pe=4 --report-bindings ./a.out
[asialab-01:00614] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00614] MCW rank 1 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/BB/../../../../../../../../../../../../../../../..]
[asialab-03:73603] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-03:73603] MCW rank 3 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/BB/../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 4 from process 2 out of 4 on asialab-03
Hello from thread 2 out of 4 from process 2 out of 4 on asialab-03
Hello from thread 3 out of 4 from process 2 out of 4 on asialab-03
Hello from thread 0 out of 4 from process 3 out of 4 on asialab-03
Hello from thread 2 out of 4 from process 3 out of 4 on asialab-03
Hello from thread 1 out of 4 from process 3 out of 4 on asialab-03
Hello from thread 3 out of 4 from process 3 out of 4 on asialab-03
Hello from thread 1 out of 4 from process 2 out of 4 on asialab-03
Hello from thread 0 out of 4 from process 0 out of 4 on asialab-01
Hello from thread 3 out of 4 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 4 from process 0 out of 4 on asialab-01
Hello from thread 2 out of 4 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 4 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 4 from process 1 out of 4 on asialab-01
Hello from thread 2 out of 4 from process 1 out of 4 on asialab-01
Hello from thread 3 out of 4 from process 1 out of 4 on asialab-01
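
One possible way to reason about this output: with ppr:1:numa, each of the 2 NUMA nodes on each of the 2 hosts holds one rank, giving 4 ranks in total, and with bind-to hwthread and pe=4, each rank occupies 4 hardware threads, which is 2 physical cores. The Python sketch below encodes that reasoning as a toy model. It assumes the layout of the test machines and is not OpenMPI's algorithm; the function name explain is made up for this example.

```python
HOSTS = ["asialab-01", "asialab-03"]
NUMA_PER_HOST = 2
CORES_PER_NUMA = 18
HWTHREADS_PER_CORE = 2


def explain(n_ranks, ppr, pe):
    """Return (rank, host, numa, cores) for ppr:<ppr>:numa:pe=<pe>, bind-to hwthread."""
    if n_ranks > ppr * NUMA_PER_HOST * len(HOSTS):
        raise ValueError("more ranks requested than ppr allows")
    rows = []
    for rank in range(n_ranks):
        unit = rank // ppr                  # global NUMA node index
        host = HOSTS[unit // NUMA_PER_HOST]
        numa = unit % NUMA_PER_HOST
        first_hwt = (rank % ppr) * pe       # offset inside the NUMA node
        cores = sorted({
            numa * CORES_PER_NUMA + hwt // HWTHREADS_PER_CORE
            for hwt in range(first_hwt, first_hwt + pe)
        })
        rows.append((rank, host, numa, cores))
    return rows


for row in explain(4, ppr=1, pe=4):
    print(row)
# (0, 'asialab-01', 0, [0, 1])
# (1, 'asialab-01', 1, [18, 19])
# (2, 'asialab-03', 0, [0, 1])
# (3, 'asialab-03', 1, [18, 19])
```

Each row corresponds to one line of the --report-bindings output above, and the 4 hardware threads per rank explain why every process reports 4 OpenMP threads.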

Appendix

Report bindings

  • Intel MPI: -print-rank-map
  • MVAPICH2: MV2_SHOW_CPU_BINDING=1
  • OpenMPI: --report-bindings

Reference