Batch Job Management with Torque/OpenPBS

The batch system on titan is OpenPBS, a free, customizable batch system. Jobs are submitted with qsub from titan.physics.umass.edu and are scheduled to run using a fair-share algorithm, which prioritizes jobs based on each user's recent CPU utilization. This means, for example, that a user who wants to run a few quick jobs will not have to wait for another user who already has hundreds of 10-hour jobs in the system.

Submitting jobs with qsub

A job, at a minimum, consists of an executable shell script. The script can run other executables or do just about anything you can do from within an interactive shell. The PBS system runs your script on a batch node as you. By default, Torque/OpenPBS writes all files with permissions that allow only you to read them. To let everyone else read the files your batch jobs create, add the following line to your shell script as the second line, immediately after the #!/bin/zsh (or whatever yours is) line that invokes the shell:

#PBS -W umask=022
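
Putting these pieces together, a minimal complete job script might look like the sketch below. The cput request and the myanalysis program are placeholders; substitute your own values. PBS_O_WORKDIR is the standard PBS variable holding the directory the job was submitted from.

#!/bin/zsh
#PBS -W umask=022
#PBS -l cput=02:00:00

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# Do the actual work (placeholder program and log file names)
./myanalysis > myanalysis.log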

The simplest invocation needed to send a single job myjob.zsh to the batch system is:

> qsub myjob.zsh

All jobs should specify how much CPU time they need; otherwise they run by default in the express queue, which has a CPU time limit of just a few hours. To specify job resource requirements (time, memory, and so on), use the '-l' option to qsub.

To send a job myjob.csh requesting 8 hours of CPU time, use:

> qsub -l cput=08:00:00 myjob.csh
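
Multiple resource requests can be combined in a single comma-separated list. As a sketch, to also request 2 GB of memory (mem is a standard Torque resource name, but the value here is just an illustration; request what your job actually needs):

> qsub -l cput=08:00:00,mem=2gb myjob.csh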

Specify Scientific Linux Version

Both SL5 and SL6 nodes run in our batch system. By default, jobs run on SL5 machines. To change this, add one of the following node specifications to your qsub command.

To select SL6:

-l nodes=1:sl6

To explicitly select SL5 (the default anyway if you don't specify):

-l nodes=1:sl5

To select any flavor (you don't care where the job runs):

-l nodes=1
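
Node specifications can be combined with other resource requests on the same command line; for example, to request 8 hours of CPU time on an SL6 node (a sketch reusing the values above):

> qsub -l cput=08:00:00 -l nodes=1:sl6 myjob.zsh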

List Job Identifiers

Print a list of the job identifiers of all jobs in the system belonging to user bbrau:

> qselect -u bbrau
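
qselect can also filter by job state with the -s option, which takes a state letter such as R (running) or Q (queued). For example, to list only bbrau's running jobs:

> qselect -u bbrau -s R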

Query the system with qstat

List all jobs:

> qstat

and which nodes they're running on:

> qstat -n
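
qstat also accepts a -u option to restrict the listing to a single user's jobs; for example:

> qstat -n -u bbrau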

Full details of all jobs:

> qstat -f

or of just one job, using its job identifier:

> qstat -f 1908.titan.physics.umass.edu
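
The full listing is verbose, so piping it through grep is handy for pulling out individual fields; for example, to see the resources a job has consumed (resources_used is one of the fields in Torque's full output):

> qstat -f 1908.titan.physics.umass.edu | grep resources_used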

Learn about the batch system with qmgr

Print the server configuration:

> qmgr -c 'print server'

Find out about node titan12:

> qmgr -c 'print node titan12'

Node Status with qnodes

List them all:

> qnodes

or just the ones that aren't up:

> qnodes -l

Delete jobs with qdel

> qdel 1908.titan.physics.umass.edu

or combine it with qselect to delete all of your jobs at once:

> qdel `qselect -u bbrau`
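
The same pattern works with any qselect filter; for example, to delete only your queued (not yet running) jobs, a sketch using the -s state filter shown above:

> qdel `qselect -u bbrau -s Q`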

To investigate a running job, log into the node it is running on and look at its log files in:

/var/spool/torque/spool
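
For example, to follow a job's standard output as it is written (the node name and job number here are placeholders; find the right node with qstat -n, and note that the .OU suffix is Torque's convention for the stdout spool file):

> ssh titan12 tail -f /var/spool/torque/spool/1908.titan.physics.umass.edu.OU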

All the gory details are in: [[http://www.doesciencegrid.org/public/pbs/pbs.v2.3_admin.pdf][The OpenPBS Administrator's Guide]]

And of course on titan, you can read the man pages for most of the commands:

> man qstat

-- BenjaminBrau - 25-Mar-2010
