Running MPI jobs #

Please use the mpirun tool to run MPI jobs through the Slurm management system. By default, mpirun runs your program on all allocated processes and cores. You can run a specific number of processes with the -n np parameter, where np is the number of processes, e.g.:

mpirun -n 8 ./mpiprog8.exe
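
For batch jobs, the same command is typically placed in a Slurm script and submitted with sbatch. A minimal sketch (the job name, partition, and executable below are illustrative and should be adjusted):

#!/bin/bash
#SBATCH --job-name=mpiprog
#SBATCH --partition=short
#SBATCH --ntasks=8
#SBATCH --time=5

# mpirun takes the number of processes from the Slurm allocation
mpirun ./mpiprog8.exe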

Intel MPI #

Load the Intel modules and use the mpiicc, mpiicpc, and mpiifort compiler wrappers:

module load intel impi
mpiicc -o main main.c
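
The C++ and Fortran wrappers are used in the same way (the source file names below are illustrative):

mpiicpc -o main main.cpp
mpiifort -o main main.f90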

Open MPI #

You can also use Open MPI with the GCC compiler wrappers mpicc, mpicxx, and mpifort:

# gcc compilers version 12
module load gnu12 openmpi4
# gcc compilers version 9
# module load gnu9 openmpi4
# gcc compilers version 7
# module load gnu openmpi
mpicc -o main main.c
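
Likewise, the Open MPI C++ and Fortran wrappers follow the same pattern (source file names are illustrative):

mpicxx -o main main.cpp
mpifort -o main main.f90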

Hybrid MPI+OpenMP jobs #

We use xthi as an example of an MPI+OpenMP program. The source code is available on the NERSC site.

#define _GNU_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sched.h>
#include <mpi.h>
#include <omp.h>
#include <sys/syscall.h>

/* Borrowed from util-linux-2.13-pre7/schedutils/taskset.c */
static char *cpuset_to_cstr(cpu_set_t *mask, char *str)
{
  char *ptr = str;
  int i, j, entry_made = 0;
  for (i = 0; i < CPU_SETSIZE; i++) {
    if (CPU_ISSET(i, mask)) {
      int run = 0;
      entry_made = 1;
      for (j = i + 1; j < CPU_SETSIZE; j++) {
        if (CPU_ISSET(j, mask)) run++;
        else break;
      }
      if (!run)
        sprintf(ptr, "%d,", i);
      else if (run == 1) {
        sprintf(ptr, "%d,%d,", i, i + 1);
        i++;
      } else {
        sprintf(ptr, "%d-%d,", i, i + run);
        i += run;
      }
      while (*ptr != 0) ptr++;
    }
  }
  ptr -= entry_made;
  *ptr = 0;
  return(str);
}

int main(int argc, char *argv[])
{
  int rank, thread;
  cpu_set_t coremask;
  char clbuf[7 * CPU_SETSIZE], hnbuf[64];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  memset(clbuf, 0, sizeof(clbuf));
  memset(hnbuf, 0, sizeof(hnbuf));
  (void)gethostname(hnbuf, sizeof(hnbuf));
  /* each OpenMP thread reports its MPI rank, thread id, host name, and core affinity */
  #pragma omp parallel private(thread, coremask, clbuf)
  {
    thread = omp_get_thread_num();
    (void)sched_getaffinity(0, sizeof(coremask), &coremask);
    cpuset_to_cstr(&coremask, clbuf);
    #pragma omp barrier
    printf("Hello from rank %d, thread %d, on %s. (core affinity = %s)\n",
            rank, thread, hnbuf, clbuf);
  }
/*  sleep(60); */
  MPI_Finalize();
  return(0);
}

Use the following commands to compile the code with Intel MPI:

# load Intel MPI modules
module load intel impi
# use mpiicc compiler with -qopenmp option
mpiicc -qopenmp xthi.c -o xthi

Use the following commands to compile the code with Open MPI and gcc compilers:

# load Open MPI modules
module load gnu12 openmpi4
# use mpicc compiler with -fopenmp option
mpicc -fopenmp xthi.c -o xthi

We will use the following Slurm batch script test-xthi.sh:

#!/bin/bash

# A hybrid MPI+OpenMP example xthi
#SBATCH --partition=short
#SBATCH --job-name=xthi
#SBATCH --output=xthi.out
#SBATCH --time=5

####### 4 MPI ranks
#SBATCH --ntasks=4

####### 8 OMP threads per MPI rank
#SBATCH --cpus-per-task=8

## load MPI libs if needed
## for Intel MPI:
# module load intel impi

## for Open MPI:
# module load gnu12 openmpi4


## explicitly set the number of OMP threads to the value of --cpus-per-task
## (required for MPICH and MVAPICH2)
# export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

## run with mpirun
mpirun ./xthi

## Open MPI requires additional parameter
# mpirun --map-by slot:PE=$SLURM_CPUS_PER_TASK ./xthi

Slurm preserves the user's environment variables, so you can skip loading the MPI modules in the batch script if they are already loaded in your session.

You can explicitly set OMP_NUM_THREADS to the value of $SLURM_CPUS_PER_TASK. In most cases this is not necessary, since each MPI process will be bound to specific CPU cores, and OpenMP will determine the number of threads automatically.

When running tasks with Open MPI, the additional option --map-by slot:PE=$SLURM_CPUS_PER_TASK is required; otherwise all OpenMP threads of a rank will be bound to a single core.

Enqueue the job:

sbatch test-xthi.sh
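
While the job is pending or running, you can check its status with squeue, for example:

squeue -u $USER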

The output will be written to the xthi.out file. Example output:

Hello from rank 3, thread 0, on n38. (core affinity = 0-7)
Hello from rank 3, thread 1, on n38. (core affinity = 0-7)
Hello from rank 3, thread 7, on n38. (core affinity = 0-7)
Hello from rank 3, thread 3, on n38. (core affinity = 0-7)
Hello from rank 3, thread 4, on n38. (core affinity = 0-7)
Hello from rank 3, thread 5, on n38. (core affinity = 0-7)
Hello from rank 3, thread 2, on n38. (core affinity = 0-7)
Hello from rank 3, thread 6, on n38. (core affinity = 0-7)
Hello from rank 2, thread 0, on n37. (core affinity = 4-11)
Hello from rank 2, thread 2, on n37. (core affinity = 4-11)
Hello from rank 2, thread 3, on n37. (core affinity = 4-11)
Hello from rank 2, thread 4, on n37. (core affinity = 4-11)
Hello from rank 2, thread 5, on n37. (core affinity = 4-11)
Hello from rank 2, thread 1, on n37. (core affinity = 4-11)
Hello from rank 2, thread 6, on n37. (core affinity = 4-11)
Hello from rank 2, thread 7, on n37. (core affinity = 4-11)
Hello from rank 0, thread 0, on n37. (core affinity = 12-19)
Hello from rank 0, thread 2, on n37. (core affinity = 12-19)
Hello from rank 0, thread 1, on n37. (core affinity = 12-19)
Hello from rank 0, thread 5, on n37. (core affinity = 12-19)
Hello from rank 0, thread 3, on n37. (core affinity = 12-19)
Hello from rank 0, thread 7, on n37. (core affinity = 12-19)
Hello from rank 0, thread 6, on n37. (core affinity = 12-19)
Hello from rank 0, thread 4, on n37. (core affinity = 12-19)
Hello from rank 1, thread 3, on n37. (core affinity = 0-3,20-23)
Hello from rank 1, thread 2, on n37. (core affinity = 0-3,20-23)
Hello from rank 1, thread 5, on n37. (core affinity = 0-3,20-23)
Hello from rank 1, thread 0, on n37. (core affinity = 0-3,20-23)
Hello from rank 1, thread 7, on n37. (core affinity = 0-3,20-23)
Hello from rank 1, thread 6, on n37. (core affinity = 0-3,20-23)
Hello from rank 1, thread 1, on n37. (core affinity = 0-3,20-23)
Hello from rank 1, thread 4, on n37. (core affinity = 0-3,20-23)

The job used two nodes (n37 and n38): three MPI ranks ran on the first node, each with 8 cores, and one MPI rank ran on the second node.

Please note the reported core affinity. One of the MPI ranks (rank 1) has a suboptimal core affinity of 0-3,20-23, spanning two different CPU sockets (nodes in the short queue have two CPU sockets with cores 0-11 and 12-23). Compare the reported core affinity when the tasks are distributed evenly across the nodes:

sbatch --ntasks-per-node=2 test-xthi.sh

Additional options of the sbatch, srun, and mpirun commands may affect CPU binding; please refer to the Slurm, Open MPI, and Intel MPI documentation.
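
For example, the actual bindings can usually be inspected without modifying the program: Open MPI's mpirun accepts the --report-bindings option, and Intel MPI prints the process pinning when the I_MPI_DEBUG environment variable is set to 4 or higher. A sketch:

# Open MPI: print the binding of each rank
mpirun --report-bindings ./xthi

# Intel MPI: print process pinning information
export I_MPI_DEBUG=4
mpirun ./xthi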