Current InfiniBand status

2020-11-21

Latest modification of this page: 25 November 2020

Current status of InfiniBand upgrades #

The recently installed Mellanox MSB7800 InfiniBand switch supports EDR 100 Gb/s speed.

Nodes cl1n005–cl1n010 and cl1n017–cl1n030 have new Mellanox ConnectX-5 adapters with EDR 100 Gb/s support.

Nodes cl1n001–cl1n004 and cl1n011–cl1n016 have old Mellanox ConnectX-3 adapters with QDR 40 Gb/s support.

Nodes with the new adapters have higher priority in the Slurm system. You can also explicitly define the list of nodes to run your job on with the -w parameter; e.g., to run your job on the 4 nodes cl1n005–cl1n008, use the following command:

sbatch -p x12core -w cl1n[005-008] --nodes=4 --ntasks-per-node=24 ...
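If you are unsure which adapter a node has, the active link rate can be checked with ibstat from the infiniband-diags package (a minimal sketch, assuming the package is installed on the compute nodes): EDR links report "Rate: 100", QDR links "Rate: 40".

```shell
# Query the InfiniBand link rate on a specific node via Slurm.
# The partition and node names follow the sbatch example above.
srun -p x12core -w cl1n005 --nodes=1 ibstat | grep Rate
```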

Known problems #

  1. Node cl1n001 was running at SDR speed, which is 4 times slower than QDR, and therefore had the lowest priority in the Slurm system. Maximum speed on the cl1n001 node was restored on 25 November 2020.
  2. Mixing nodes with different adapters (ConnectX-3 and ConnectX-5) may cause problems in MPI applications. The following environment variable may help when using the latest Intel MPI library:
export UCX_TLS=ud,sm,self
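A minimal job-script sketch showing where the variable goes; the partition and task counts follow the sbatch example above, and my_app is a placeholder for your MPI binary:

```shell
#!/bin/bash
#SBATCH -p x12core
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24

# Restrict Intel MPI's UCX layer to transports that work on both
# ConnectX-3 and ConnectX-5 adapters:
#   ud   = unreliable datagram over InfiniBand
#   sm   = intra-node shared memory
#   self = process loopback
export UCX_TLS=ud,sm,self

mpirun ./my_app   # my_app is a hypothetical MPI application
```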
This page will be updated.