Queue management system Torque #
The new resource management system is now active. Please refer to the Slurm usage page. The information below is deprecated.
We use the TORQUE Resource Manager to schedule all user jobs on our cluster. A job is an ordinary shell script which is submitted to the queue with the qsub command. This script contains the information about the requested resources (number of nodes, amount of RAM, required time). You can use the usual bash commands inside this script. The system executes the script only on the first reserved core; running your programs on the other cores is up to you, e.g. with the mpiexec tool.
Example #
Let’s start with a simple example: a sleep job that uses just one core and has a maximum execution time of 5 minutes. Create the file sleep.qs with the following content:
#!/bin/bash
#PBS -N sleep
#PBS -l nodes=1
#PBS -l walltime=05:00
echo "Start date: $(date)"
sleep 60
echo " End date: $(date)"
You can enqueue this job with the qsub command:
qsub sleep.qs
Lines starting with the #PBS prefix define the parameters for the qsub command. You can also specify these parameters directly on the command line, e.g.:
qsub -N sleep -l nodes=1 -l walltime=5:00 sleep.qs
You can combine several resource requests in one -l option, separating them with a comma:
qsub -l nodes=1,walltime=5:00 sleep.qs
qsub command #
The main parameters of the qsub command:

- -d path: the working directory of the job; by default the user's home directory is used.
- -e path, -o path: the file names for the standard error (stderr) and standard output (stdout) streams; by default the files <job_name>.e<job_id> and <job_name>.o<job_id> in the current directory are used.
- -j oe, -j eo: merge stderr and stdout into a single file; with oe both streams go to stdout, with eo both streams go to stderr.
- -m aben: select the types of events for which the user is notified via e-mail: a for job abortion, b for job start, e for job finish, n for no notifications. You can select several characters from abe or the single character n; only a is used by default.
- -M e-mail: the e-mail address (or several addresses separated by commas) for notifications; by default the owner of the job is notified.
- -N name: the name of the job.
- -q queue: the execution queue for the job. Several queues are available; the default one is x6core.
- -l resource_list: the list of required resources.
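As an illustration of how these options combine, a submission might look like the following sketch (the job name, log file name, e-mail address, and script name are placeholders):
qsub -N myjob -q x6core -j oe -o myjob.log -m abe -M user@example.com -l nodes=1,walltime=2:00:00 myjob.qs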
The main resources (used with the -l option):

- nodes=N or nodes=M:ppn=K: the number of required cores or nodes. In the first form you request N cores, where N is from 1 to 432. In the second form you request M nodes with K cores per node. E.g., nodes=24 requests 24 cores, nodes=3:ppn=8 requests 3 nodes with 8 cores each, and nodes=6:ppn=4 requests 6 nodes with 4 cores each. Only one core is requested by default.
- pmem=size, pvmem=size: the required amount of physical and virtual memory respectively. The size is a number with one of the suffixes b, kb, mb, gb. E.g., pvmem=1gb requests 1 GB of virtual memory per process. By default there is no limit on memory usage; however, keep in mind that a job with a huge memory footprint may hang the whole node.
- walltime=time: the maximum execution time for the job. The job is terminated after this time is exceeded. The default time limit is 6 hours; the maximum time you can request is 24 hours. E.g., walltime=1:45:00 stops the execution of the job after 1 hour and 45 minutes.
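To show how these resource requests fit together, here is a minimal sketch of an MPI job file; the program name ./my_mpi_app is a placeholder, and an MPI installation integrated with Torque is assumed (otherwise pass the node list explicitly, as shown below):
#!/bin/bash
#PBS -N mpi-job
#PBS -l nodes=2:ppn=8,pvmem=1gb,walltime=3:00:00
# 2 nodes x 8 cores = 16 MPI processes
mpiexec -np 16 ./my_mpi_app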
Some useful environment variables defined by the Torque management system:

- PBS_O_WORKDIR: the directory from which the user submitted the job.
- PBS_JOBID: the unique ID of the job.
- PBS_O_HOST: the hostname of the machine from which the job was submitted.
- PBS_NODEFILE: the file containing the list of all nodes allocated to the job.
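A common pattern in job scripts is to combine these variables, for example to return to the submission directory and pass the allocated node list to the MPI launcher explicitly. This is a sketch: ./my_program is a placeholder, and the node-list option (-machinefile here) may differ between MPI implementations.
# switch to the directory the job was submitted from
cd $PBS_O_WORKDIR
# one MPI process per allocated core, taken from the Torque node file
NPROCS=$(wc -l < $PBS_NODEFILE)
mpiexec -np $NPROCS -machinefile $PBS_NODEFILE ./my_program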
You can find more information in the Torque documentation.
qstat command #
You can check the status of your jobs with the qstat command. You will get more information by running qstat -a. The current state is shown in the S column:
- Q: the job is waiting in the queue for the requested resources.
- R: the job is running.
- E: the job is finishing; the stderr and stdout files are transferred to the headnode.
- C: the job has finished. Information about finished jobs is shown only for 5 minutes.
You can get the whole list of nodes and cores allocated to your job with the qstat -n command.
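For example, to limit the output to your own jobs you can filter by user name (qstat -u takes a list of user names):
# detailed status of your jobs only
qstat -a -u $USER
# nodes and cores allocated to each of your jobs
qstat -n -u $USER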
qdel command #
You can remove your job from the queue, e.g. in case you requested too many cores, with the command qdel <job_id>. You can remove a running job the same way; in this case the job will be terminated first.
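For example, if qstat shows your job under ID 66863, it can be removed with:
# the job ID is taken from the first column of the qstat output
qdel 66863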
pbstop program #
You can monitor the general statistics of cluster usage with the pbstop program. Press q to exit. The output of pbstop will look similar to this:
Usage Totals: 420/480 Procs, 25/28 Nodes, 12/12 Jobs Running 16:51:05
Node States: 4 free 23 job-exclusive 1 offline
Visible CPUs: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1 2
-------------------------------------------------
cl1n001 ................ pppp............ [x8core e5core mix]
cl1n003 pppppppppppppppp PPPPPPPPPPPPPPPP [x8core e5core mix]
cl1n005 yyyyyyyyyyyyyyyyyyyyyyyy PPPPPPPPPPPPPPPPPPPPPPPP [x12core e5core mix]
cl1n007 yyyyyyyyyyyyyyyyyyyyyyyy dddddddddddddddddddddddd [x12core e5core mix]
cl1n009 DDDDDDDDDDDDDDDDDDDDDDDD yyyyyyyyyyyyyyyyyyyyyyyy [x12core e5core mix]
cl1n011 gggggggggggggggggggggggg yyyyyyyyyyyyyyyyyyyyyyyy [x12core e5core mix]
cl1n013 pppppppppppppppppppp ........PPPPPPPPPPPP [x10core e5core mix]
cl1n015 EEEEEEEEEEEEEEEEEEEE PPPPPPPPPPPPPPPPPPPP [x10core e5core mix]
cl1n031 tttttttttttt GGGGGGGGGGGG [x6core mix]
cl1n033 tttttttttttt tttttttttttt [x6core mix]
cl1n035 CCCCCCCCCCCC BBBBBBBBBBBB [x6core mix]
cl1n037 ............ tttttttttttt [x6core mix]
cl1n039 AAAAAAAAAAAA tttttttttttt [x6core mix]
cl1n041 tttttttttttt %%%%%%%%%%%% [x6core mix]
-------------------------------------------------
[.] idle [@] busy [*] down [%] offline [!] other [?] unknown
Job# Username Queue Jobname CPUs/Nodes S Elapsed/Requested
P = 66851 perezhog e5core nse2d 72/72 R 17:13:43/24:00:00
p = 66863 perezhog e5core nse2d 40/40 R 00:52:17/24:00:00
E = 66865 goyman x10core slm 20/1 R 00:24:31/12:00:00
y = 66847 yakovlev x12core SLO.=.INMCM 96/96 R 22:20:10/24:00:00
D = 66853 dinar x12core EXP2019_run+.exe 24/1 R 03:35:41/24:00:00
d = 66861 dinar x12core EXP2019_run-.exe 24/1 R 01:02:25/24:00:00
g = 66864 goyman x12core slm 24/1 R 00:29:51/12:00:00
C = 66854 galin x6core a19f1Rm75.out 12/1 R 09:22:17/24:00:00
A = 66855 galin x6core a19f2Rm75.out 12/1 R 09:22:22/24:00:00
G = 66856 galin x6core a19f1Rm25.out 12/1 R 09:22:23/24:00:00
B = 66857 galin x6core a19f2Rm25.out 12/1 R 09:21:25/24:00:00
t = 66866 terekhov x6core STDIN 72/6 R /02:00:00
Available queues #
We use several queues on the cluster. All compute nodes are separated into groups, and each queue can use nodes only from specific groups. The maximum execution time may be reduced for some queues.
Queue | cl1n001–cl1n004 | cl1n005–cl1n012 | cl1n013–cl1n016 | cl1n017–cl1n020 | cl1n031–cl1n038 | Total cores | Max walltime |
---|---|---|---|---|---|---|---|
x6core | | | | | + | 96 | 24 h |
x8core | + | | | | | 64 | 24 h |
x10core | | | + | | | 80 | 24 h |
x12core | | + | | | | 192 | 24 h |
mix | + | + | + | | + | 432 | 12 h |
e5core | + | + | + | | | 336 | 24 h |
In order to submit a job to, e.g., the x12core queue, you should use the parameter -q x12core for the qsub command:
qsub -q x12core -N big-problem -l nodes=2:ppn=24 qbig.qs
Or you can add the following line to your job file:
#PBS -q x12core
The default queue is x6core.