Queue management

Queue management system Torque #

The new resource management system is now active; please refer to the Slurm usage page. The information below is deprecated.

We use the TORQUE Resource Manager to schedule all user jobs on our cluster. A job is an ordinary shell script submitted to the queue with the qsub command. The script contains information about the requested resources (number of nodes, amount of RAM, required time), and you can use the usual bash commands inside it. The system executes the script only on the first reserved core; the user is responsible for launching their programs on the other cores, e.g. with the mpiexec tool.

Example #

Let’s start with a simple example: a job named sleep that requests just one core and a maximum execution time of 5 minutes. Create the file sleep.qs with the following content:

#!/bin/bash
#PBS -N sleep
#PBS -l nodes=1
#PBS -l walltime=05:00

echo "Start date: $(date)"
sleep 60
echo "  End date: $(date)"

You can enqueue this job with the qsub command:

qsub sleep.qs

Lines starting with the #PBS prefix define parameters for the qsub command. You can also specify these parameters explicitly on the command line, e.g.:

qsub  -N sleep  -l nodes=1  -l walltime=5:00  sleep.qs

You can combine several resource requests in one -l option by separating them with commas:

qsub  -l nodes=1,walltime=5:00  sleep.qs

qsub command #

The main parameters of the qsub command (a combined example follows the list):

  • -d path
    The working directory of the job; by default, the user's home directory is used.
  • -e path
  • -o path
    The file names for the standard error (stderr) and standard output (stdout) streams; by default, the files <job_name>.e<job_id> and <job_name>.o<job_id> in the current directory are used.
  • -j oe
  • -j eo
    Merge stderr and stdout into one file: with oe both streams go to stdout, with eo both streams go to stderr.
  • -m aben
    Select the events for which the user is notified by e-mail: a — job abortion, b — job start, e — job finish, n — no notifications. You can combine several characters from abe or use the single character n; by default only a is used.
  • -M e-mail
    The e-mail address (or several addresses separated by commas) for notifications; by default the owner of the job is notified.
  • -N name
    The name of the job.
  • -q queue
    The execution queue for the job. Several queues are available; the default one is x6core.
  • -l resource_list
    The list of required resources.
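
As an illustration, several of these options can be combined in a single submission. The script name heat3d.qs, the e-mail address, and the resource values below are placeholders chosen for this example:

qsub  -N heat3d  -q x6core  -j oe  -m abe  -M user@example.org  -l nodes=2:ppn=12,walltime=10:00:00  heat3d.qs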

The main resources (used with the -l option; a sample job script follows the list):

  • nodes=N
    nodes=M:ppn=K
    The number of required cores or nodes. The first form requests N cores, where N ranges from 1 to 432. The second form requests M nodes with K cores per node. E.g., nodes=24 requests 24 cores, nodes=3:ppn=8 requests 3 nodes with 8 cores each, and nodes=6:ppn=4 requests 6 nodes with 4 cores each. By default only one core is requested.
  • pmem=size
    pvmem=size
    The required amount of physical and virtual memory, respectively. The size is a number with one of the suffixes b, kb, mb, gb. E.g., pvmem=1gb requests 1 GB of virtual memory per process. By default there is no limit on memory usage; however, keep in mind that a job with a huge memory footprint may hang the whole node.
  • walltime=time
    The maximum execution time for the job; the job is terminated once this time is exceeded. The default limit is 6 hours, and the maximum you can request is 24 hours. E.g., walltime=1:45:00 stops the job after 1 hour and 45 minutes.
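
For example, a parallel job that requests two 24-core nodes, 1 GB of virtual memory per process, and 12 hours of walltime might look like the sketch below. The executable name solver is a placeholder, and mpiexec is assumed to pick up the allocated cores from Torque; otherwise pass the node file explicitly (e.g. with -machinefile $PBS_NODEFILE, see the environment variables below):

#!/bin/bash
#PBS -N solver
#PBS -q x12core
#PBS -l nodes=2:ppn=24
#PBS -l pvmem=1gb
#PBS -l walltime=12:00:00

cd "$PBS_O_WORKDIR"     # start in the directory the job was submitted from
mpiexec ./solver        # placeholder executable, launched on all reserved cores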

Some useful environment variables defined by the Torque management system (a short snippet using them follows the list):

  • PBS_O_WORKDIR
    The directory from which the user submitted the job.
  • PBS_JOBID
    The unique ID of the job.
  • PBS_O_HOST
    The hostname of the machine from which the job was submitted.
  • PBS_NODEFILE
    The file containing the list of all nodes allocated to the job.
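
A minimal sketch of how these variables might be used inside a job script (the echoed text is only for illustration):

cd "$PBS_O_WORKDIR"                  # switch to the submission directory
NPROCS=$(wc -l < "$PBS_NODEFILE")    # number of allocated cores (one line per core)
echo "Job $PBS_JOBID runs on $NPROCS cores:"
sort -u "$PBS_NODEFILE"              # unique names of the allocated nodes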

You can find more information in the Torque documentation.

qstat command #

You can check the status of your jobs with the qstat command; running qstat -a gives more detailed information. The current state of each job is shown in the S column.

  • Q — the job is waiting in the queue for the requested resources.
  • R — the job is running.
  • E — the job is finishing, and the stderr and stdout files are being transferred to the head node.
  • C — the job has finished. Information about finished jobs is shown only for 5 minutes.

You can get the full list of nodes and cores allocated to your job with the qstat -n command.
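
For example:

qstat       # brief list of jobs and their states
qstat -a    # more details: requested nodes and cores, walltime, elapsed time
qstat -n    # additionally list the nodes and cores allocated to each running job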

qdel command #

You can remove your job from the queue, e.g. in case you requested too many cores, with the command qdel <job_id>. You can also remove a running job the same way; in this case the job will be terminated first.
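
For example, using one of the job IDs from the pbstop listing below purely as an illustration:

qstat          # look up the numeric ID of your job in the first column
qdel 66851     # remove (or terminate) the job with this ID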

pbstop program #

You can monitor general statistics about cluster usage with the pbstop program; press the q key to exit. The output of pbstop will look similar to this:

Usage Totals: 420/480 Procs, 25/28 Nodes, 12/12 Jobs Running                      16:51:05
Node States:     4 free                  23 job-exclusive          1 offline

 Visible CPUs: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
          1                        2
          -------------------------------------------------
  cl1n001 ................         pppp............           [x8core e5core mix]
  cl1n003 pppppppppppppppp         PPPPPPPPPPPPPPPP           [x8core e5core mix]
  cl1n005 yyyyyyyyyyyyyyyyyyyyyyyy PPPPPPPPPPPPPPPPPPPPPPPP   [x12core e5core mix]
  cl1n007 yyyyyyyyyyyyyyyyyyyyyyyy dddddddddddddddddddddddd   [x12core e5core mix]
  cl1n009 DDDDDDDDDDDDDDDDDDDDDDDD yyyyyyyyyyyyyyyyyyyyyyyy   [x12core e5core mix]
  cl1n011 gggggggggggggggggggggggg yyyyyyyyyyyyyyyyyyyyyyyy   [x12core e5core mix]
  cl1n013 pppppppppppppppppppp     ........PPPPPPPPPPPP       [x10core e5core mix]
  cl1n015 EEEEEEEEEEEEEEEEEEEE     PPPPPPPPPPPPPPPPPPPP       [x10core e5core mix]
  cl1n031 tttttttttttt             GGGGGGGGGGGG               [x6core mix]
  cl1n033 tttttttttttt             tttttttttttt               [x6core mix]
  cl1n035 CCCCCCCCCCCC             BBBBBBBBBBBB               [x6core mix]
  cl1n037 ............             tttttttttttt               [x6core mix]
  cl1n039 AAAAAAAAAAAA             tttttttttttt               [x6core mix]
  cl1n041 tttttttttttt             %%%%%%%%%%%%               [x6core mix]
          -------------------------------------------------
 [.] idle  [@] busy  [*] down  [%] offline  [!] other  [?] unknown

      Job#         Username Queue        Jobname          CPUs/Nodes S   Elapsed/Requested
  P = 66851        perezhog e5core       nse2d              72/72    R  17:13:43/24:00:00
  p = 66863        perezhog e5core       nse2d              40/40    R  00:52:17/24:00:00
  E = 66865        goyman   x10core      slm                20/1     R  00:24:31/12:00:00
  y = 66847        yakovlev x12core      SLO.=.INMCM        96/96    R  22:20:10/24:00:00
  D = 66853        dinar    x12core      EXP2019_run+.exe   24/1     R  03:35:41/24:00:00
  d = 66861        dinar    x12core      EXP2019_run-.exe   24/1     R  01:02:25/24:00:00
  g = 66864        goyman   x12core      slm                24/1     R  00:29:51/12:00:00
  C = 66854        galin    x6core       a19f1Rm75.out      12/1     R  09:22:17/24:00:00
  A = 66855        galin    x6core       a19f2Rm75.out      12/1     R  09:22:22/24:00:00
  G = 66856        galin    x6core       a19f1Rm25.out      12/1     R  09:22:23/24:00:00
  B = 66857        galin    x6core       a19f2Rm25.out      12/1     R  09:21:25/24:00:00
  t = 66866        terekhov x6core       STDIN              72/6     R          /02:00:00

Available queues #

We use several queues on the cluster. All compute nodes are divided into groups, and each queue can use nodes only from specific groups. The maximum execution time is also reduced for some queues. In the table below, a + marks the node groups available to each queue.

Queue   | cl1n001–cl1n004 | cl1n005–cl1n012 | cl1n013–cl1n016 | cl1n017–cl1n020 | cl1n031–cl1n038 | Total cores | Max walltime
x6core  |                 |                 |                 |                 |        +        | 96          | 24 h
x8core  |        +        |                 |                 |                 |                 | 64          | 24 h
x10core |                 |                 |        +        |                 |                 | 80          | 24 h
x12core |                 |        +        |                 |                 |                 | 192         | 24 h
mix     |        +        |        +        |        +        |                 |        +        | 432         | 12 h
e5core  |        +        |        +        |        +        |                 |                 | 336         | 24 h

In order to submit a job to, e.g., the x12core queue, use the parameter -q x12core for the qsub command:

qsub -q x12core -N big-problem -l nodes=2:ppn=24  qbig.qs

Or you can add the following line to your job file:

#PBS -q x12core

The default queue is x6core.