Run CMSSW code using CRAB at FNAL

Introduction

Following the data-location-driven computing model, most large-scale access to data and MC samples for analysis has to go through GRID tools to run jobs. CRAB is the tool provided by CMS to handle the interaction with the GRID for the user. In the following, we will submit jobs of the prepared analysis code to the GRID.

CRAB overview

CRAB, short for CMS Remote Analysis Builder, enables the user to process datasets and MC samples using the GRID. It hides the interaction with the GRID behind a simple, easy-to-use interface. CRAB organizes the processing of data and MC samples in four steps:

  1. Job creation
  2. Job submission
  3. Job status check
  4. Job output retrieval

GRID authentication

After installation of the certificate in the user's home directory, the following command creates a user proxy:

voms-proxy-init -voms cms

using the passphrase defined during installation.
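
By default, the proxy is valid for 12 hours. If your jobs may run longer, you can request a longer-lived proxy (up to whatever the VOMS server grants); a sketch using the -valid hours:minutes option of voms-proxy-init:

voms-proxy-init -voms cms -valid 192:00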

To check how long the user's proxy is valid, use the following command:

voms-proxy-info -all

A valid proxy should produce output similar to:

subject   : /DC=org/DC=doegrids/OU=People/CN=Oliver Gutsche 103748/CN=proxy
issuer    : /DC=org/DC=doegrids/OU=People/CN=Oliver Gutsche 103748
identity  : /DC=org/DC=doegrids/OU=People/CN=Oliver Gutsche 103748
type      : proxy
strength  : 512 bits
path      : /tmp/x509up_u12840
timeleft  : 11:59:54
VO        : cms
subject   : /DC=org/DC=doegrids/OU=People/CN=Oliver Gutsche 103748
issuer    : /C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/analysis/Role=NULL/Capability=NULL
attribute : /cms/uscms/Role=NULL/Capability=NULL
timeleft  : 11:59:54

Authentication test

A good test of whether the user's certificate and proxy are valid is to use the following globus tool to authenticate to FNAL:

globusrun -a -r cmsosgce.fnal.gov
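
If the test succeeds, globusrun should report success with a message similar to the following (the exact wording may vary with the globus version):

GRAM Authentication test successful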

CRAB setup

To initialize CRAB, call its setup script. In the following, the initialization commands of the CRAB installation at various sites are given (here for CRAB_1_5_2).

Site   Initialization command
CERN   sh shell family:   source /afs/cern.ch/cms/ccs/wm/scripts/Crab/CRAB_1_5_2/crab.sh
       csh shell family:  source /afs/cern.ch/cms/ccs/wm/scripts/Crab/CRAB_1_5_2/crab.csh
FNAL   sh shell family:   source /uscmst1/prod/grid/CRAB_1_5_2/crab.sh
       csh shell family:  source /uscmst1/prod/grid/CRAB_1_5_2/crab.csh

or call the appropriate setup script from your local installation (CRAB installation).

Here, we will use the FNAL setup.
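
To verify that the setup worked, check that the crab executable is now in your PATH and prints its help text:

which crab
crab -h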

First time initialization

If you call the CRAB initialization script for the first time, CRAB will ask you to initialize one of its sub-components, BOSS, by calling the following command:

$CRABDIR/configureBoss

BOSS will create two directories in your home directory:

boss
.bossrc

which should not be removed.

CRAB configuration file

CRAB is configured by a configuration file called crab.cfg. The configuration file should be located within the CMSSW user project directory, at the same location as the CMSSW parameter-set to be used by CRAB. Its basic content is described in the following. Please save the following file in the src directory of your local user project area, named:

crab.cfg

and fill it with the following content:

[CRAB]
jobtype                = cmssw
scheduler              = edg 

[CMSSW]

datasetpath            = /RelVal150Higgs-ZZ-4Mu/CMSSW_1_5_0-RelVal-1182498841/GEN-SIM-DIGI-RECO
pset                   = crab_produce.cfg

total_number_of_events = 100
events_per_job         = 10

output_file            = h_zz_4mu-plus-utilities.root

[USER]
return_data            = 1

use_central_bossDB     = 0

use_boss_rt            = 0

[EDG]
lcg_version            = 2
rb                     = CERN
proxy_server           = myproxy.cern.ch 
virtual_organization   = cms
retry_count            = 2
lcg_catalog_type       = lfc
lfc_host               = lfc-cms-test.cern.ch
lfc_home               = /grid/cms

The CRAB configuration file is structured into sections, and it matters in which section a specific configuration item is listed. The sections in the configuration file given above are

[CRAB]
[CMSSW]
[USER]
[EDG]

[CRAB] section

Parameter   Description
jobtype     Defines the kind of job CRAB should run. As CMSSW is the only job type CRAB knows, this is always cmssw.
scheduler   Defines which GRID middleware is to be used by CRAB. There are three schedulers for EGEE and one special scheduler for OSG only:

    Scheduler   Description
    edg         Default access mode to all EGEE and OSG resources using the resource broker.
    glite       New access mode to all EGEE and OSG resources using the new gLite resource broker.
    glitecoll   New access mode to all EGEE and OSG resources using the new gLite resource broker in high-performance bulk mode.
    condor_g    Direct access mode to OSG sites only (requires a local Condor scheduler; see Local user interface for sh family or Local user interface for csh family).

For this tutorial, we will choose the edg scheduler.

[CMSSW] section

Parameter   Description
datasetpath   Identifies the dataset you want to access. It can be queried using the CMS data discovery page: http://cmsdbs.cern.ch/discovery/. More information is given at Dataset discovery and job configuration. In this tutorial, we reuse the discovery from Access files from local disk and mass storage (dCache), Data discovery, and enter the datasetpath given in bold in the output of the discovery. Alternatively, the user can select the crab.cfg link on the page as a template.
pset   The name of the CMSSW parameter-set of your CMSSW job. The parameter-set has to be in the same directory as the CRAB configuration file.
total_number_of_events   Total number of events to be processed by CRAB. If set to -1, all events of the selected dataset are processed. More information is given at Dataset discovery and job configuration.
events_per_job   Number of events per job. CRAB creates as many jobs as needed to process total_number_of_events. For technical reasons, the number of jobs may be larger than the mathematical quotient total_number_of_events/events_per_job because of constraints on the job splitting. More information is given at Dataset discovery and job configuration.
output_file   Comma-separated list of output filenames. Usually the filename selected in the PoolOutputModule of the CMSSW parameter-set, but it can also hold user-specific output filenames like histogram files. These names are used by CRAB when generating the output filenames of the individual jobs: CRAB automatically adds job identifiers so that the user can distinguish them. For example, if the output filename is output.root and the selected CRAB configuration results in 10 jobs, the output files of the individual jobs are named output_1.root, output_2.root, ...
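
With the values above, total_number_of_events = 100 and events_per_job = 10 give at least 100/10 = 10 jobs. A one-line shell sketch of the corresponding ceiling division (the actual job count may be larger because of the splitting constraints mentioned above):

echo $(( (100 + 10 - 1) / 10 ))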

[USER] section

Parameter   Description
return_data   Defines how CRAB handles user output. The default is 1, for returning the output via the GRID middleware sandbox. Attention: the sandbox is limited to 100 MB. More information is given at Output handling.
use_central_bossDB   BOSS-specific parameter.
use_boss_rt   BOSS-specific parameter.
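
For outputs larger than the 100 MB sandbox limit, contemporary CRAB versions can instead copy the output directly to a storage element. A hedged sketch of the alternative [USER] settings (the storage element and path values are placeholders; consult your site documentation):

return_data          = 0
copy_data            = 1
storage_element      = <storage element hostname>
storage_path         = <storage path>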

[EDG] section

Parameter   Description
lcg_version   EGEE resource broker specific information.
rb   Defines which resource broker configuration should be used. If set to CERN, the official CERN configuration is downloaded from cmsdoc.cern.ch; if set to CNAF, the configuration for the CNAF resource broker is downloaded. If this parameter is commented out, the default of the used user interface is used.
proxy_server   Defines the GRID proxy server name.
virtual_organization   Has to be cms.
retry_count   Resource broker parameter; defines how often the resource broker should try to resubmit a job before giving up.
lcg_catalog_type   LFC catalog specific parameter.
lfc_host   LFC catalog specific parameter.
lfc_home   LFC catalog specific parameter.

CRAB CMSSW parameter-set

Create a parameter-set in the src directory of your local user project directory:

crab_produce.cfg

with the following content:

process P =
{

  #
  # load input file
  #

  untracked PSet maxEvents = {untracked int32 input = -1}

  source = PoolSource
  {
    untracked vstring fileNames = {"file:test.root"}
    untracked uint32 skipEvents = 0
  }

  # run the MyTrackUtility producer
  module producer = MyTrackUtility
  {
    InputTag TrackProducerTag = ctfWithMaterialTracks
  }

  #
  # write results out to file
  #
  module Out = PoolOutputModule
  {
    untracked string fileName = "h_zz_4mu-plus-utilities.root"
  }

  path p =
  {
    producer
  }

  endpath e =
  {
    Out
  }
}

Job creation

The job creation uses the provided CRAB configuration file to check the availability of the selected dataset and prepares the jobs for submission according to the selected job splitting:

crab -create

which uses the default name for a CRAB configuration file: crab.cfg. If a differently named CRAB configuration file should be used, the command changes to

crab -create -cfg <configuration file>

The creation process creates a CRAB project directory (format crab_<number>_<date>_<time>) in the current working directory. It can be used later to distinguish multiple CRAB projects in the same directory.

The CRAB configuration file that was used is copied into the CRAB project directory; the original can then be changed and reused without interfering with the already created projects.

This is an example for the standard output of the creation command:

crab. crab (version 1.5.2) running on Sun Jun 24 17:45:30 2007

crab. Working options:
  scheduler           edg
  job type            CMSSW
  working directory   /uscms_data/d1/gutsche/tutorial/CMSSW_1_5_0/src/crab_0_070624_174530/

crab. Downloading config files for RB: http://cmsdoc.cern.ch/cms/ccs/wm/www/Crab/useful_script/edg_wl_ui.conf.CMS_CERN
crab. Downloading config files for RB: http://cmsdoc.cern.ch/cms/ccs/wm/www/Crab/useful_script/edg_wl_ui_cmd_var.conf.CMS_CERN
crab. Contacting DBS...
crab. Required data are :/RelVal150Higgs-ZZ-4Mu/CMSSW_1_5_0-RelVal-1182498841/GEN-SIM-DIGI-RECO
crab. The number of available events is 1100

crab. Contacting DLS...
crab. Sites (1) hosting part/all of dataset: ['srm.cern.ch']

crab. 10 job(s) can run on 100 events.

crab. Creating 10 jobs, please wait...

crab. Total of 10 jobs created.

crab. Log-file is /uscms_data/d1/gutsche/tutorial/CMSSW_1_5_0/src/crab_0_070624_174530/log/crab.log

In case of problems, a good place to look for output is the CRAB log file in

crab_?_*_*/log/crab.log
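
A quick way to inspect the log of the most recent CRAB project from the shell (a sketch assuming your CRAB project directories live in the current directory):

tail -n 50 $(ls -t crab_*/log/crab.log | head -1)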

Job submission

The following command submits the previously created jobs:

crab -submit all -c

where -c specifies that CRAB uses the latest CRAB project in the current directory and all specifies to submit all created jobs. You can also specify any combination of jobs and job-ranges separated by commas (example: 1,2,3-4). You can specify a directory to use a different CRAB project than the latest:

crab -submit all -c <directory>
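
For example, to submit only jobs 1, 2 and 5 through 7 of the project created earlier (directory name taken from the creation example above):

crab -submit 1,2,5-7 -c crab_0_070624_174530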

An example standard output would look like:

crab. crab (version 1.5.0) running on Thu Mar 22 09:53:18 2007

crab. Working options:
  scheduler           edg
  job type            CMSSW
  working directory   /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/

crab. Matched Sites :['cmslcgce.fnal.gov']
crab. Found 1 compatible site(s) for job 1
                                                                  Submitting 10 jobs                                                                  
100% [============================================================================================================================================]
                                                                     please wait                                                                      
crab. Total of 10 jobs submitted.

crab. Log-file is /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/log/crab.log

Job status check

The following command checks the status of all jobs in the latest CRAB project:

crab -status -c

You can specify a specific directory to use a different CRAB project than the latest:

crab -status -c <directory>
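
Since jobs can run for a while, it can be convenient to poll the status periodically; a minimal shell sketch that re-checks the latest project every five minutes:

while true; do crab -status -c; sleep 300; done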

An example standard output would look like:

crab. crab (version 1.5.0) running on Thu Mar 22 10:07:09 2007

crab. Working options:
  scheduler           edg
  job type            CMSSW
  working directory   /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/

crab. Checking the status of all jobs: please wait
Chain    STATUS             E_HOST                                   EXE_EXIT_CODE JOB_EXIT_STATUS
---------------------------------------------------------------------------------------------------
1        Done (Success)     cmslcgce.fnal.gov                                                     
2        Done (Success)     cmslcgce.fnal.gov                                                     
3        Done (Success)     cmslcgce.fnal.gov                                                     
4        Done (Success)     cmslcgce.fnal.gov                                                     
5        Running            cmslcgce.fnal.gov                                                     
6        Done (Success)     cmslcgce.fnal.gov                                                     
7        Running            cmslcgce.fnal.gov                                                     
8        Running            cmslcgce.fnal.gov                                                     
9        Running            cmslcgce.fnal.gov                                                     
10       Running            cmslcgce.fnal.gov                                                     

>>>>>>>>> 10 Total Jobs 

>>>>>>>>> 5 Jobs Running
          List of jobs: 5,7,8,9,10

>>>>>>>>> 5 Jobs Done
          List of jobs: 1,2,3,4,6
          Retrieve them with: crab -getoutput <Jobs list>

crab. Log-file is /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/log/crab.log

The output gives lists of job numbers at the end for easy access to the different job categories.

Job output retrieval

The following command retrieves the output of all jobs of a CRAB project which are Done:

crab -getoutput all -c

where all specifies to try to retrieve the output of all jobs of the latest project. You can specify a directory to use a different CRAB project than the latest:

crab -getoutput all -c <directory>

You can also specify any combination of job numbers and job ranges instead of all.
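
For example, to retrieve only the five jobs reported as Done in the status output above:

crab -getoutput 1,2,3,4,6 -c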

An example standard output would look like:

crab. crab (version 1.5.0) running on Thu Mar 22 10:08:05 2007

crab. Working options:
  scheduler           edg
  job type            CMSSW
  working directory   /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/

crab. Results of Job # 1 are in /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/res/
crab. Results of Job # 2 are in /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/res/
crab. Results of Job # 3 are in /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/res/
crab. Results of Job # 4 are in /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/res/
crab. Results of Job # 5 are in /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/res/
crab. Results of Job # 6 are in /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/res/
crab. Results of Job # 7 are in /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/res/
crab. Results of Job # 8 are in /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/res/
crab. Results of Job # 9 are in /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/res/
crab. Results of Job # 10 are in /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/res/
 
crab. Log-file is /uscms_data/d1/gutsche/tutorial/CMSSW_1_3_0_pre6/src/crab_0_070322_094957/log/crab.log

The output of the jobs can be found in the given directory, which contains the following files:

BossChainer.log
BossProgram_1.log
CMSSW_000001.stderr
CMSSW_000001.stdout
CMSSW_000002.stderr
CMSSW_000002.stdout
CMSSW_000003.stderr
CMSSW_000003.stdout
CMSSW_000004.stderr
CMSSW_000004.stdout
CMSSW_000005.stderr
CMSSW_000005.stdout
CMSSW_000006.stderr
CMSSW_000006.stdout
CMSSW_000007.stderr
CMSSW_000007.stdout
CMSSW_000008.stderr
CMSSW_000008.stdout
CMSSW_000009.stderr
CMSSW_000009.stdout
CMSSW_000010.stderr
CMSSW_000010.stdout
crab_0_070322_094957_1_10.log
crab_0_070322_094957_1_1.log
crab_0_070322_094957_1_2.log
crab_0_070322_094957_1_3.log
crab_0_070322_094957_1_4.log
crab_0_070322_094957_1_5.log
crab_0_070322_094957_1_6.log
crab_0_070322_094957_1_7.log
crab_0_070322_094957_1_8.log
crab_0_070322_094957_1_9.log
crab_fjr_10.xml
crab_fjr_1.xml
crab_fjr_2.xml
crab_fjr_3.xml
crab_fjr_4.xml
crab_fjr_5.xml
crab_fjr_6.xml
crab_fjr_7.xml
crab_fjr_8.xml
crab_fjr_9.xml
h_zz_4mu-plus-utilities_10.root
h_zz_4mu-plus-utilities_1.root
h_zz_4mu-plus-utilities_2.root
h_zz_4mu-plus-utilities_3.root
h_zz_4mu-plus-utilities_4.root
h_zz_4mu-plus-utilities_5.root
h_zz_4mu-plus-utilities_6.root
h_zz_4mu-plus-utilities_7.root
h_zz_4mu-plus-utilities_8.root
h_zz_4mu-plus-utilities_9.root

Amongst the files are the standard output and error of the jobs (CMSSW_*.stdout and CMSSW_*.stderr) and the output root files (h_zz_4mu-plus-utilities_*.root).
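
Since each job writes its own root file, it is often convenient to merge them for analysis; a sketch using ROOT's hadd utility (assuming ROOT is set up in your environment, with the project directory from the example above):

hadd h_zz_4mu-plus-utilities_merged.root crab_0_070322_094957/res/h_zz_4mu-plus-utilities_*.root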

Additional exercise

Change the scheduler to condor_g and submit the jobs again to FNAL (can be done in parallel).
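
Only the scheduler line in the [CRAB] section of crab.cfg needs to change; the rest of the configuration can stay the same:

[CRAB]
jobtype                = cmssw
scheduler              = condor_g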

Previous: Run CMSSW code using the Condor batch queue of the LPCCAF at FNAL Top: Main page Next: Documentation and further Information