
GridScripts

Introduction

As part of the AtlasComparator development, I found I needed a 'halfway house' between single grid job submission and full GANGA-based job management. These utilities are designed to be run on a grid user interface (UI) machine, and will administer a few hundred grid jobs via bash scripts. I bundled the suite of scripts and utilities together in gridjobutils.tar.

README file

The README file explains how to use the utilities and how they work. For completeness, I will include it here:

latest changes 14/6/2006:
  • getjoboutput.sh now puts jobs into jobOutput/jobname so you can see the job ID.
  • killajob.sh now has the ability to move job/ and subOutput/ files to deleted/ using an optional 3rd parameter. canjobdellfn.sh asks for this parameter before it makes its calls to killajob.sh.
  • new utility: siteprobe.sh (see below) allows the user to test all sites that claim to provide a specific athena version.
  • new utilities: register.sh and deletefiles.sh. These allow the user to register the contents of a local directory on a grid storage element, and to delete them again.

setup instructions:
  • log on to a Grid UI machine.
  • make a directory called e.g. $HOME/mygridjobs
  • cd into this directory; it will be where your grid jobs are made.
  • copy and untar the file gridjobutils.tar in the mygridjobs directory.
     > tar -xvf gridjobutils.tar
  • Next, get a grid proxy valid for a LONG time e.g.:
     > grid-proxy-init -valid 200:00
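
Because the jobs can run for days, it can help to check how long the proxy has left before submitting. The following is a minimal sketch, not part of gridjobutils.tar: the function name proxy_has_hours and the 48-hour threshold are illustrative assumptions, and the commented-out call relies on grid-proxy-info -timeleft printing the remaining lifetime in seconds.

```shell
# Return success if a proxy with SECONDS_LEFT seconds remaining is still
# valid for at least MIN_HOURS hours. (Hypothetical helper, not in the tarball.)
proxy_has_hours() {
    local seconds_left="$1" min_hours="$2"
    [ "$seconds_left" -ge $((min_hours * 3600)) ]
}

# Example use on a UI machine (needs the Globus client tools, so it is
# left commented out here):
# if ! proxy_has_hours "$(grid-proxy-info -timeleft)" 48; then
#     echo "proxy expires in under 48 hours; rerun grid-proxy-init" >&2
# fi
```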

running requirements:
untarring the file should give you
  • some utilities (described later)
  • a job/ directory to hold the shell and py scripts for job submission
  • a subOutput/ directory to hold the result of job submission (i.e. a url)
  • a jobOutput/ directory to hold the output retrieved by edg-job-get-output, plus the job/ scripts, for successful jobs
  • a deleted/ directory to hold the subOutput/ files when a job is resubmitted.
    canjobdellfn.sh cancels the job with edg-job-cancel and deletes the lfns. It can also move the job-related files in job/ and subOutput/ to deleted/.
    BEWARE: moving the files to deleted/ means we could end up with cancelled jobs and job-related files in deleted/ for jobs which never ran at all! Once files are moved to deleted/, canjobdellfn.sh loses the ability to clean up that job.
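
A quick way to confirm the untar produced the layout described above is to check for the four directories. This is only a sketch: the directory names come from the list above, while the function name check_layout is an illustrative assumption and not one of the bundled utilities.

```shell
# Report which of the expected working directories are missing under BASE
# (default: the current directory). Returns non-zero if any is missing.
# (Hypothetical helper, not in the tarball.)
check_layout() {
    local base="${1:-.}" missing=0 d
    for d in job subOutput jobOutput deleted; do
        if [ ! -d "$base/$d" ]; then
            echo "missing: $base/$d"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_layout "$HOME/mygridjobs"
```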

submitting grid jobs:
  • Make sure you are in the directory you created to run your grid jobs:
     cd $HOME/mygridjobs
  • submitgridjobs.sh is an example Reconstruction job. It uses input "digi" (hits) files from athena 9.0.4, and outputs ESD & AOD files onto the grid.
  • Edit submitgridjobs.sh
    • You MUST change the job names and output file names!!! - (at least) change 'myusername' to your username
    • You might want to change the input file names - I have left a 'digi' file full of detector hits as input.
    • You might want to change the storage element where your files are output. The 'se' is currently scotgrid: "se2-gla.scotgrid.ac.uk". You can get a list of available se's:
               lcg-infosites --vo atlas se

  • ./submitgridjobs.sh makes files for running the grid jobs in the job/ directory.
    It makes a "record of job submission" in subOutput/ directory.
    Files for running athena (e.g. RecExCommon_myOptions.py and
    esd-post-options.py) are also stored in the job/ directory.

  • Submit grid jobs by typing, for example (where 0 and 190 are the lowest and highest numbers of the jobs you are submitting):
     ./submitgridjobs.sh 0 190
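
Note that the range is inclusive, so "0 190" means 191 jobs. The loop below sketches how a script can walk such a range; the function name job_names and the "prefix_number" naming scheme are assumptions for illustration, not necessarily what submitgridjobs.sh does internally.

```shell
# Print one job name per job number in the inclusive range LOW..HIGH.
# (Hypothetical helper; the naming scheme is assumed.)
job_names() {
    local prefix="$1" low="$2" high="$3" n
    n="$low"
    while [ "$n" -le "$high" ]; do
        echo "${prefix}_${n}"
        n=$((n + 1))
    done
}

job_names myusername_reco 0 2
# prints:
# myusername_reco_0
# myusername_reco_1
# myusername_reco_2
```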

utilities:
./statusgridjobs.sh Lists jobs. You must supply a parameter:
  • 0 - Jobs in the subOutput/ directory which completed successfully (NB- I have known jobs to fail and still say this!)
  • 1 - Jobs in the subOutput/ directory which are in error
  • R - Jobs in the subOutput/ directory which are currently running or scheduled
  • A - All jobs in the subOutput/ directory
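
The parameter is essentially a filter on job status. A minimal sketch of such a dispatch is shown below; the function name status_pattern is hypothetical, and the status strings are assumptions about what edg-job-status prints, so check them against real output before relying on them.

```shell
# Map a statusgridjobs.sh-style parameter to an extended regular expression
# that could be matched against a job's status line.
# (Hypothetical helper; the status strings are assumed, not taken from the scripts.)
status_pattern() {
    case "$1" in
        0) echo 'Done \(Success\)' ;;
        1) echo 'Aborted|Done \(Failed\)' ;;
        R) echo 'Running|Scheduled' ;;
        A) echo '.' ;;
        *) echo "parameter must be one of 0, 1, R, A" >&2
           return 1 ;;
    esac
}

# Example: test a status line against the "running or scheduled" filter.
echo 'Current Status:     Running' | grep -Eq "$(status_pattern R)" && echo "matches R"
```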

./getjoboutput.sh Finds the successful jobs, as reported by "statusgridjobs.sh 0", and 'gets' their output to jobOutput/.
It then moves the .jdl, .sh and subOutput/ submission records to jobOutput/ as well. (getjoboutput.sh calls getajob.sh for each successful job)

./canjobdellfn.sh Is designed to make sure you have edg-job-cancelled and lcg-del'd any grid lfns so you are free to restart the jobs (canjobdellfn.sh calls killajob.sh for each job in error).
The script takes two parameters.

  1. which jobs to cancel (1,R,A). You can see which jobs it will cancel by typing:
    statusgridjobs.sh [parameter=1,R,A].
  2. Whether to move job-related files to deleted/.
    If you move files to deleted/, the job is effectively gone (restarting the job will NOT be possible).
    NB: You should only move the files to deleted/ once you are sure you have 'cleaned up' after your job i.e. edg-job-cancel and lcg-del worked ok.

./resubmitgridjobs.sh Calls statusgridjobs.sh 1 to get a list of jobs to restart.
You can check which jobs this will apply to by typing "statusgridjobs.sh 1".
For these jobs, resubmitgridjobs.sh 'gets' the job output into deleted/.
Then, it moves the submission record from subOutput/ to deleted/.
It then tries to cancel the job in case it is still running.
It then tries to delete (with lcg-del) any files which may have been registered on the grid for the job, by looking at any lcg-cr commands in the .sh script which formed the job.
It will then resubmit the job, and make a new record in the subOutput/ directory.
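
The lfn clean-up step can be sketched as follows. This is an illustration under assumptions, not the code in resubmitajob.sh: it assumes each lcg-cr line names its logical file as an "lfn:..." argument, and it only prints the lcg-del commands (a dry run) rather than running them. The function names are hypothetical.

```shell
# Print every lfn: argument found on an lcg-cr line of a job script.
# Surrounding quotes, if any, are not included in the match.
lfns_in_script() {
    grep 'lcg-cr' "$1" | grep -o "lfn:[^[:space:]\"']*"
}

# Dry run: show the delete command for each lfn instead of executing it.
print_delete_commands() {
    lfns_in_script "$1" | while read -r lfn; do
        echo "lcg-del --vo atlas -a $lfn"
    done
}

# Example: print_delete_commands job/myjob.sh   (path is illustrative)
```

Printing the commands first makes it easy to review what would be removed before anything is deleted from the grid.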

./siteprobe.sh Probes sites to see if they really have a version of athena.
You must supply the athena version, e.g.: ./siteprobe.sh 11.0.3
It writes two files ("testjob.jdl" & "testjob.sh") and then dynamically gets sites which fulfill requirements in the jdl.
It uses the list of sites obtained to fire a test job to each one.
You can then see which sites respond and work.
NB: You should also change 'myusername' in "JobName=myusername_..." to your own username.
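
For orientation, a test JDL of the kind described could be generated like this. It is a sketch only: the function name make_test_jdl is hypothetical, the output file names are invented for the example, and the software tag format "VO-atlas-release-<version>" is an assumption about how sites publish athena releases, so the JDL written by siteprobe.sh may differ.

```shell
# Write a minimal test JDL to standard output for a given athena version.
# (Hypothetical helper; the software tag format is assumed.)
make_test_jdl() {
    local version="$1"
    cat <<EOF
Executable = "testjob.sh";
StdOutput = "testjob.out";
StdError = "testjob.err";
InputSandbox = {"testjob.sh"};
OutputSandbox = {"testjob.out", "testjob.err"};
Requirements = Member("VO-atlas-release-${version}", other.GlueHostApplicationSoftwareRunTimeEnvironment);
EOF
}

# Example: make_test_jdl 11.0.3 > testjob.jdl
# The matching sites can then be listed with: edg-job-list-match testjob.jdl
```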

./register.sh A simple utility to register the contents of a local directory on the grid.
It requires a storage element so it knows where to put your files. (You can get a list of se's using "lcg-infosites --vo atlas se")
It requires an "lfn prefix", usually "username_description_" which will be prepended to your actual filenames.
It also requires a local directory holding the files you'd like to register and copy onto the grid.
Output log files from the copy and register are stored in registerlogs/.
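
The three inputs combine roughly as in the sketch below, which prints one lcg-cr command per file instead of running it. The function name register_commands is hypothetical and the real register.sh may build its commands differently; the local directory is assumed to be given as an absolute path, since the file: argument needs one.

```shell
# Dry run: print one copy-and-register command per regular file in DIR.
# Arguments: storage element, lfn prefix, absolute local directory.
# (Hypothetical helper, not the code in register.sh.)
register_commands() {
    local se="$1" prefix="$2" dir="$3" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        echo "lcg-cr --vo atlas -d $se -l lfn:${prefix}$(basename "$f") file:$f"
    done
}

# Example:
# register_commands se2-gla.scotgrid.ac.uk myusername_description_ "$HOME/myfiles"
```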

./deletefiles.sh Will simply delete from the grid any records of files it finds in the registerlogs/ directory.

getajob.sh, killajob.sh, resubmitajob.sh These are utilities called by the other scripts to deal with individual grid jobs. You may use them to deal with individual jobs if you wish. In all cases, they take two parameters: (1) the job name and (2) the https url of the job.
NB: killajob.sh also takes an optional parameter which moves the job/ and subOutput/ files to deleted/.
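
To drive these per-job scripts by hand you need (job name, url) pairs. The sketch below assumes each file in subOutput/ is named after its job and contains the job's https url somewhere in its text; that file format and the function name job_pairs are assumptions, so adjust the extraction to match your actual submission records.

```shell
# Print "jobname url" for every file in DIR (default: subOutput) that
# contains an https url. Files without a url are skipped.
# (Hypothetical helper; the record format is assumed.)
job_pairs() {
    local dir="${1:-subOutput}" f url
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        url=$(grep -o 'https://[^[:space:]]*' "$f" | head -n 1) || url=""
        if [ -n "$url" ]; then
            echo "$(basename "$f") $url"
        fi
    done
}

# Example: fetch the output of every recorded job, one at a time.
# job_pairs subOutput | while read -r name url; do
#     ./getajob.sh "$name" "$url"
# done
```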

-- ChrisCollins - 15 Dec 2005

Topic revision: r8 - 2006-06-15 - ChrisCollins
 