Batch Job Management with Torque/OpenPBS
The batch system on titan uses OpenPBS, a free, customizable batch system. Jobs are submitted with qsub from titan.physics.umass.edu and are scheduled to run using a fair-share algorithm, which prioritizes jobs based on each user's recent CPU utilization. This means, for example, that a user who wants to run a few quick jobs will not have to wait behind another user who already has hundreds of 10-hour jobs in the system.
Submitting jobs with qsub
A job, at a minimum, consists of an executable shell script. The script can run other executables or do just about anything you can do from within an interactive shell. The PBS system runs your script on a batch node as you. By default, Torque/OpenPBS writes all files with permissions that allow only you to read them. To let everyone else read the files your batch jobs create, add the following line to your shell script on the second line, immediately after the first line containing #!/bin/zsh (or whatever shell yours invokes):
#PBS -W umask=022
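Put together, a minimal job script might look like the sketch below. The echo and date commands are just placeholders for your real work; the #PBS lines are directives read by qsub but ordinary comments as far as the shell is concerned.

```shell
#!/bin/zsh
#PBS -W umask=022
# The shebang above selects the shell; the #PBS directive makes output
# files world-readable. Everything below is an ordinary shell script.
echo "Job started on $(hostname)"
date
```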
The simplest invocation needed to send a single job myjob.zsh to the batch system is:
>
qsub myjob.zsh
All jobs should specify how much CPU time they need; otherwise they run by default in the express queue, which has a CPU time limit of just a few hours. To specify job resource requirements (e.g. time, memory, etc.), use the '-l' option to qsub.
To send a job myjob.csh requesting 8 hours of CPU time, use:
>
qsub -l cput=08:00:00 myjob.csh
Specify Scientific Linux Version
Both SL5 and SL6 nodes run in our batch system. By default, jobs will run on SL5 machines. To select SL6 instead, add the following to your qsub command:
-l nodes=1:sl6
To explicitly select SL5 (the default if you don't specify), add the following to your qsub command:
-l nodes=1:sl5
To select any flavor (you don't care where the job runs) add the following to your qsub command:
-l nodes=1
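These node options can be combined with other resource requests in a single submission. For instance, to request 8 hours of CPU time on an SL6 node (myjob.zsh here is a hypothetical script name):

```shell
# Request 8 hours of CPU time and one SL6 node in a single qsub invocation.
qsub -l cput=08:00:00 -l nodes=1:sl6 myjob.zsh
```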
List Job Identifiers
Print the job identifiers of all jobs in the system belonging to user bbrau:
>
qselect -u bbrau
Query the system with qstat
List all jobs:
>
qstat
and which nodes they're running on:
>
qstat -n
Print full details for all jobs:
>
qstat -f
or for just one job, using its job identifier:
>
qstat -f 1908.titan.physics.umass.edu
Learn about the batch system with qmgr
Print the server configuration:
>
qmgr -c 'print server'
Find out about node titan12:
>
qmgr -c 'print node titan12'
Node Status with qnodes
List them all:
>
qnodes
or just the ones that aren't up:
>
qnodes -l
Delete jobs with qdel
>
qdel 1908.titan.physics.umass.edu
or use qselect to pick all of your jobs:
>
qdel `qselect -u bbrau`
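qselect can also filter by job state, which is handy when you only want to remove part of your workload. For example (assuming Torque's -s state option), to delete only your currently running jobs and leave queued ones alone:

```shell
# Select jobs owned by bbrau in the Running (R) state, then delete them.
qdel $(qselect -u bbrau -s R)
```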
To investigate a running job, log into the node it is running on and look at its log files in
/var/spool/torque/spool
All the gory details are in [[http://www.doesciencegrid.org/public/pbs/pbs.v2.3_admin.pdf][The OpenPBS Administrator's Guide]].
And of course, on titan you can read the man pages for most of these commands:
>
man qstat
--
BenjaminBrau - 25-Mar-2010