Queue management

Queue management system Torque #

The new resource management system is now active; please refer to the Slurm usage page. The information below is deprecated.

We use the TORQUE Resource Manager to schedule all user jobs on our cluster. A job is an ordinary shell script submitted to the queue with the qsub command. The script contains information about the requested resources (number of nodes, amount of RAM, required time), and you can use the usual bash commands inside it. The system executes the script only on the first reserved core; the user is responsible for launching their programs on the other cores, e.g. with the mpiexec tool.

Example #

Let’s start with a simple example: a job named sleep that requests just one core and a maximum execution time of 5 minutes. Create the file sleep.qs with the following content:

#!/bin/bash
#PBS -N sleep
#PBS -l nodes=1
#PBS -l walltime=05:00

echo "Start date: $(date)"
sleep 60
echo "  End date: $(date)"

You can enqueue this job with the qsub command:

qsub sleep.qs

Lines starting with the #PBS prefix define parameters for the qsub command. You can also specify these parameters explicitly on the command line, e.g.:

qsub  -N sleep  -l nodes=1  -l walltime=5:00  sleep.qs

You can combine several resource requests in one -l option by separating them with commas:

qsub  -l nodes=1,walltime=5:00  sleep.qs

qsub command #

The main parameters of the qsub command (a combined example follows the list):

  • -d path
    The working directory of the job; by default, the user's home directory is used.
  • -e path
  • -o path
    The file names for the standard error (stderr) and standard output (stdout) streams; by default, the files <job_name>.e<job_id> and <job_name>.o<job_id> in the current directory are used.
  • -j oe
  • -j eo
    Merge stderr and stdout into one file: with oe both streams go to stdout, with eo both streams go to stderr.
  • -m aben
    Select the events for which the user is notified by e-mail: a — job abortion, b — job start, e — job finish, n — no notifications. You can combine several characters from abe or use the single character n; by default only a is used.
  • -M e-mail
    The e-mail address (or several addresses separated by commas) for notifications; by default the owner of the job is notified.
  • -N name
    The name of the job.
  • -q queue
    The execution queue for the job. Several queues are available; the default one is x6core.
  • -l resource_list
    The list of required resources.
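
As an illustration, several of these options can be combined in a single submission. The script name heat3d.qs, the e-mail address, and the resource values below are placeholders chosen for this example:

qsub  -N heat3d  -q x6core  -j oe  -m abe  -M user@example.org  -l nodes=2:ppn=12,walltime=10:00:00  heat3d.qs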

The main resources (used with the -l option; a sample job script follows the list):

  • nodes=N
    nodes=M:ppn=K
    The number of required cores or nodes. The first form requests N cores, where N ranges from 1 to 432. The second form requests M nodes with K cores per node. E.g., nodes=24 requests 24 cores, nodes=3:ppn=8 requests 3 nodes with 8 cores each, and nodes=6:ppn=4 requests 6 nodes with 4 cores each. By default only one core is requested.
  • pmem=size
    pvmem=size
    The required amount of physical and virtual memory, respectively. The size is a number with one of the suffixes b, kb, mb, gb. E.g., pvmem=1gb requests 1 GB of virtual memory per process. By default there is no limit on memory usage; however, keep in mind that a job with a huge memory footprint may hang the whole node.
  • walltime=time
    The maximum execution time for the job; the job is terminated once this time is exceeded. The default limit is 6 hours, and the maximum you can request is 24 hours. E.g., walltime=1:45:00 stops the job after 1 hour and 45 minutes.
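
For example, a parallel job that requests two 24-core nodes, 1 GB of virtual memory per process, and 12 hours of walltime might look like the sketch below. The executable name solver is a placeholder, and mpiexec is assumed to pick up the allocated cores from Torque; otherwise pass the node file explicitly (e.g. with -machinefile $PBS_NODEFILE, see the environment variables below):

#!/bin/bash
#PBS -N solver
#PBS -q x12core
#PBS -l nodes=2:ppn=24
#PBS -l pvmem=1gb
#PBS -l walltime=12:00:00

cd "$PBS_O_WORKDIR"     # start in the directory the job was submitted from
mpiexec ./solver        # placeholder executable, launched on all reserved cores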

Some useful environment variables defined by the Torque management system (a short snippet using them follows the list):

  • PBS_O_WORKDIR
    The directory from which the user submitted the job.
  • PBS_JOBID
    The unique ID of the job.
  • PBS_O_HOST
    The hostname of the machine from which the job was submitted.
  • PBS_NODEFILE
    The file containing the list of all nodes allocated to the job.
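
A minimal sketch of how these variables might be used inside a job script (the echoed text is only for illustration):

cd "$PBS_O_WORKDIR"                  # switch to the submission directory
NPROCS=$(wc -l < "$PBS_NODEFILE")    # number of allocated cores (one line per core)
echo "Job $PBS_JOBID runs on $NPROCS cores:"
sort -u "$PBS_NODEFILE"              # unique names of the allocated nodes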

You can find more information in the Torque documentation.

qstat command #

You can check the status of your jobs with the qstat command; running qstat -a gives more detailed information. The current state of each job is shown in the S column.

  • Q — the job is waiting in the queue for the requested resources.
  • R — the job is running.
  • E — the job is finishing, and the stderr and stdout files are being transferred to the head node.
  • C — the job has finished. Information about finished jobs is shown only for 5 minutes.

You can get the full list of nodes and cores allocated to your job with the qstat -n command.
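
For example:

qstat       # brief list of jobs and their states
qstat -a    # more details: requested nodes and cores, walltime, elapsed time
qstat -n    # additionally list the nodes and cores allocated to each running job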

qdel command #

You can remove your job from the queue, e.g. in case you requested too many cores, with the command qdel <job_id>. You can also remove a running job the same way; in this case the job will be terminated first.
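
For example, using one of the job IDs from the pbstop listing below purely as an illustration:

qstat          # look up the numeric ID of your job in the first column
qdel 66851     # remove (or terminate) the job with this ID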

pbstop program #

You can monitor general statistics about cluster usage with the pbstop program; press the q key to exit. The output of pbstop will look similar to this:

Usage Totals: 420/480 Procs, 25/28 Nodes, 12/12 Jobs Running                      16:51:05
Node States:     4 free                  23 job-exclusive          1 offline

 Visible CPUs: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
          1                        2
          -------------------------------------------------
  cl1n001 ................         pppp............           [x8core e5core mix]
  cl1n003 pppppppppppppppp         PPPPPPPPPPPPPPPP           [x8core e5core mix]
  cl1n005 yyyyyyyyyyyyyyyyyyyyyyyy PPPPPPPPPPPPPPPPPPPPPPPP   [x12core e5core mix]
  cl1n007 yyyyyyyyyyyyyyyyyyyyyyyy dddddddddddddddddddddddd   [x12core e5core mix]
  cl1n009 DDDDDDDDDDDDDDDDDDDDDDDD yyyyyyyyyyyyyyyyyyyyyyyy   [x12core e5core mix]
  cl1n011 gggggggggggggggggggggggg yyyyyyyyyyyyyyyyyyyyyyyy   [x12core e5core mix]
  cl1n013 pppppppppppppppppppp     ........PPPPPPPPPPPP       [x10core e5core mix]
  cl1n015 EEEEEEEEEEEEEEEEEEEE     PPPPPPPPPPPPPPPPPPPP       [x10core e5core mix]
  cl1n031 tttttttttttt             GGGGGGGGGGGG               [x6core mix]
  cl1n033 tttttttttttt             tttttttttttt               [x6core mix]
  cl1n035 CCCCCCCCCCCC             BBBBBBBBBBBB               [x6core mix]
  cl1n037 ............             tttttttttttt               [x6core mix]
  cl1n039 AAAAAAAAAAAA             tttttttttttt               [x6core mix]
  cl1n041 tttttttttttt             %%%%%%%%%%%%               [x6core mix]
          -------------------------------------------------
 [.] idle  [@] busy  [*] down  [%] offline  [!] other  [?] unknown

      Job#         Username Queue        Jobname          CPUs/Nodes S   Elapsed/Requested
  P = 66851        perezhog e5core       nse2d              72/72    R  17:13:43/24:00:00
  p = 66863        perezhog e5core       nse2d              40/40    R  00:52:17/24:00:00
  E = 66865        goyman   x10core      slm                20/1     R  00:24:31/12:00:00
  y = 66847        yakovlev x12core      SLO.=.INMCM        96/96    R  22:20:10/24:00:00
  D = 66853        dinar    x12core      EXP2019_run+.exe   24/1     R  03:35:41/24:00:00
  d = 66861        dinar    x12core      EXP2019_run-.exe   24/1     R  01:02:25/24:00:00
  g = 66864        goyman   x12core      slm                24/1     R  00:29:51/12:00:00
  C = 66854        galin    x6core       a19f1Rm75.out      12/1     R  09:22:17/24:00:00
  A = 66855        galin    x6core       a19f2Rm75.out      12/1     R  09:22:22/24:00:00
  G = 66856        galin    x6core       a19f1Rm25.out      12/1     R  09:22:23/24:00:00
  B = 66857        galin    x6core       a19f2Rm25.out      12/1     R  09:21:25/24:00:00
  t = 66866        terekhov x6core       STDIN              72/6     R          /02:00:00

Available queues #

We use several queues on the cluster. All compute nodes are divided into groups, and each queue can use nodes only from specific groups. The maximum execution time is also reduced for some queues. In the table below, a + marks the node groups available to each queue.

Queue   | cl1n001–cl1n004 | cl1n005–cl1n012 | cl1n013–cl1n016 | cl1n017–cl1n020 | cl1n031–cl1n038 | Total cores | Max walltime
x6core  |                 |                 |                 |                 |        +        | 96          | 24 h
x8core  |        +        |                 |                 |                 |                 | 64          | 24 h
x10core |                 |                 |        +        |                 |                 | 80          | 24 h
x12core |                 |        +        |                 |                 |                 | 192         | 24 h
mix     |        +        |        +        |        +        |                 |        +        | 432         | 12 h
e5core  |        +        |        +        |        +        |                 |                 | 336         | 24 h

In order to submit a job to, e.g., the x12core queue, use the parameter -q x12core for the qsub command:

qsub -q x12core -N big-problem -l nodes=2:ppn=24  qbig.qs

Or you can add the following line to your job file:

#PBS -q x12core

The default queue is x6core.