CURRENTLY OUTDATED, SEND A MAIL TO leonardo.sala at cern.ch IF YOU NEED INFORMATION
CMSSW Performance Toolkit
Introduction
This set of tools is intended to provide a standard way to retrieve and analyze performance data related to CMSSW jobs and computing centres.
CMSSW can provide a lot of information, e.g.:
- Timing and memory statistics
- CPU specs
- Storage-related statistics
These methods are described here:
The provided information can be large, and its handling difficult. This set of tools (let's call it CMSSW Performance Toolkit, or CPT) is available on CVS:
http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/UserCode/leo/Utilities/PerfToolkit/
and provides:
- a standard set of CMSSW configuration files (PerfToolkit/TestBed)
- a python script for retrieving interesting statistics from the job output, saving them as ROOT histograms (PerfToolkit/cpt_getJobInfo.py)
- a python script for plotting the selected quantities and produce a twiki-compliant summary table (PerfToolkit/cpt_getStats.py)
- a python script containing utilities used by cpt_getJobInfo.py and cpt_getStats.py (PerfToolkit/cpt_utilities.py)
- A bunch of simple scripts for using CRAB (PerfToolkit/CrabUtils)
Current CPT release is:
V01-00
Example config files
Some example cfg can be found here:
http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/UserCode/leo/Utilities/PerfToolkit/TestBed/
The important cfg parts are:
- Enabling the Timing service:
process.Timing = cms.Service("Timing",
useJobReport = cms.untracked.bool(True)
)
- Enabling the CPU specs info (not yet supported by the plotting utility):
process.CPU = cms.Service("CPU",
useJobReport = cms.untracked.bool(True),
reportCPUProperties = cms.untracked.bool(True)
)
- Enabling the Simple Memory checker (not yet supported by the plotting utility):
process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
useJobReport = cms.untracked.bool(True)
)
- Setting the I/O adaptor hints and the source cache size:
process.AdaptorConfig = cms.Service("AdaptorConfig",
cacheHint=cms.untracked.string("application-only"),
readHint=cms.untracked.string("auto-detect")
)
process.source.cacheSize = cms.untracked.uint32(20*1024*1024)
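For convenience, the snippets above can be collected into one helper applied to an existing process. This is only a sketch assuming a standard CMSSW environment (FWCore available); the instrument() helper is hypothetical, not part of CPT, and simply sets the service parameters shown above:

```python
# Sketch only: apply the instrumentation services above to an existing
# cms.Process. The function name "instrument" is hypothetical.
import FWCore.ParameterSet.Config as cms

def instrument(process, cache_mb=20):
    # Timing and memory reports go into the FrameworkJobReport
    process.Timing = cms.Service("Timing",
        useJobReport = cms.untracked.bool(True))
    process.CPU = cms.Service("CPU",
        useJobReport = cms.untracked.bool(True),
        reportCPUProperties = cms.untracked.bool(True))
    process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
        useJobReport = cms.untracked.bool(True))
    # I/O hints: ROOT-driven caching, auto-detected read-ahead
    process.AdaptorConfig = cms.Service("AdaptorConfig",
        cacheHint = cms.untracked.string("application-only"),
        readHint = cms.untracked.string("auto-detect"))
    # source cache size in MB, 20 MB as in the snippet above
    process.source.cacheSize = cms.untracked.uint32(cache_mb*1024*1024)
    return process
```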
Furthermore, a "drop *" output option has been included in order not to depend on stageOut.
The cfg name schema is very simple:
<Jobtype>_<READHINT_cfg>_<CACHEHINT_cfg>_<CACHE>.py
Two different jobtypes are given; if you want to use the latter, you need to retrieve the code from CVS.
The CACHEHINT parameter can take the following values in the example cfgs:
- APP: this refers to the "application-only" setting, and it means that ROOT will do the caching
- LD: this refers to the "lazy-download" setting. In this case, the data will be downloaded to a shadow local file
- STO: this refers to the "storage-only" setting, and means ROOT will drive the caching using a prefetch list, but will not allocate a cache of its own. Usable with file and RFIO technologies; it will crash if used with dCache
- AUTO: this refers to the "auto-detect" setting. This will select "lazy-download" for RFIO and dCache and "storage-only" for the rest.
The READHINT parameter can take the following values in the example cfgs:
- UNBUFF: this refers to the "direct-unbuffered" setting, where all the caching is disabled
- BUFF: this refers to the "read-ahead-buffered" setting, where buffering is used at the IO level
- AUTO: this refers to the "auto-detect" setting, and requests optimum read-ahead buffering given the other I/O choices
CACHE indicates the amount of cache to be set: usually a value between 10 and 50 MB is suggested. For more details, please visit:
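The naming schema above can be decoded programmatically. The following sketch (a hypothetical helper, not part of CPT) parses names like PAT1_RHAUTO_CHAPP_CACHE20.py, where the RH/CH prefixes and CACHE size follow the convention described above:

```python
import re

# Hypothetical decoder for the example cfg naming schema
# <Jobtype>_RH<READHINT>_CH<CACHEHINT>_CACHE<MB>.py
CFG_RE = re.compile(
    r"^(?P<job>.+)_RH(?P<read>[A-Z]+)_CH(?P<cache>[A-Z]+)_CACHE(?P<mb>\d+)\.py$")

def decode_cfg_name(name):
    """Return the cfg fields as a dict, or None if non-compliant."""
    m = CFG_RE.match(name)
    if m is None:
        return None
    return {"jobtype": m.group("job"),
            "readhint": m.group("read"),
            "cachehint": m.group("cache"),
            "cache_mb": int(m.group("mb"))}
```

For example, decode_cfg_name("PAT1_RHAUTO_CHAPP_CACHE20.py") yields jobtype PAT1, READHINT AUTO, CACHEHINT APP and a 20 MB cache.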
Getting job statistics
The cpt_getJobInfo.py script retrieves the job statistics from the job stdout and the FrameworkJobReport xml file. Two kinds of jobs are currently supported:
- Plain CMSSW jobs (e.g. simply run on a shell)
- CMSSW jobs sent with CRAB
In both cases, cpt_getJobInfo.py gets the timing information from the stdout and the storage stats from the FJR. The CPU time information is retrieved from the stdout in both cases, but when running a CMSSW job without CRAB, /usr/bin/time must be used, e.g.:
( /usr/bin/time cmsRun -j myCMSSW_TEST/test_1.xml PAT1.py ) &> myCMSSW_TEST/test_1.stdout
The usage syntax is:
cpt_getJobInfo.py --type=CMSSW|CRAB DIR [outfile.root]
- the --type option tells the program how the output is formatted. By default it is set to CRAB
- DIR is the directory which contains the relevant output files:
  - for CRAB, it is simply the CRAB working dir, e.g. crab_0_100114_102558
  - for CMSSW, it is the directory which contains the stdout and the xml files. CAVEAT: they must have the same name and be numbered, e.g.:
$ ls myCMSSW/
cmsrun_1.stdout cmsrun_1.xml cmsrun_2.stdout cmsrun_2.xml cmsrun_3.stdout cmsrun_3.xml cmsrun_4.stdout cmsrun_4.xml
- outfile.root is the name of the output root file containing the histos. By default, it is DIR.root
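The same-name-and-numbered convention for plain CMSSW directories can be pre-checked before running the script. This is a hypothetical helper (not part of CPT) that, given a file listing, returns the job bases for which both the .stdout and the .xml are present:

```python
import re

# Hypothetical pre-flight check: verify that a plain-CMSSW job directory
# holds numbered, same-named .stdout/.xml pairs, as expected by
# cpt_getJobInfo.py --type=CMSSW.
def check_cmssw_dir(files):
    pat = re.compile(r"^(?P<base>.+_\d+)\.(?P<ext>stdout|xml)$")
    seen = {}
    for f in files:
        m = pat.match(f)
        if m:
            seen.setdefault(m.group("base"), set()).add(m.group("ext"))
    # only bases with both files count as complete jobs
    return sorted(b for b, exts in seen.items() if exts == {"stdout", "xml"})
```

Pass it e.g. os.listdir("myCMSSW") to spot jobs with a missing stdout or xml before extracting statistics.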
ROOT file format
The ROOT file produced by cpt_getJobInfo.py contains histograms with the information gathered from CMSSW. Histogram names are formatted in a way that makes them easily identifiable by the plotting script (and by humans, too). The structure is:
QUANT<quantName>-SAMPLE<sampleName>
where <quantName> is the variable registered in the CMSSW statistics (e.g. tstoragefile-read-total-msecs) and <sampleName> is the name of the sample analyzed (if the output file is OUT.root, then the sample name will be OUT).
Histogram binning is hardcoded in the setBinEdges() function and has been tuned to cover all the retrieved information with the best resolution. Rebinning and optimization are delegated to the cpt_getStats.py script.
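Note that quantity names themselves contain dashes (e.g. tstoragefile-read-total-msecs), so a naive split on "-" would break the QUANT...-SAMPLE... schema. A sketch of a parser (a hypothetical helper, not CPT's actual code) that splits on the -SAMPLE marker instead:

```python
# Hypothetical parser for the histogram naming schema
# QUANT<quantName>-SAMPLE<sampleName>. Quantity names contain dashes,
# so split on the last "-SAMPLE" marker rather than on "-".
def parse_histo_name(name):
    """Return (quantName, sampleName), or None if non-compliant."""
    if not name.startswith("QUANT") or "-SAMPLE" not in name:
        return None
    quant, sample = name[len("QUANT"):].rsplit("-SAMPLE", 1)
    return quant, sample
```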
How does it work
Analyzing job statistics
Once one or more root files have been produced with cpt_getJobInfo.py, they can be used as input for cpt_getStats.py. This script computes statistics from the ROOT histos, plots the selected histos and produces a twiki-formatted table.
The syntax is:
cpt_getStats.py [--save-png] [-h] [--no-auto-bin] [--binwidth-time=BINWIDTHTIME] output.root [output_2.root, ...]
where output.root is the file produced by cpt_getJobInfo.py. Wildcards are supported, so if you have e.g. files called test1.root, test2.root, test3.root you can do:
cpt_getStats.py test*.root
Options are explained by calling cpt_getStats.py -h:
Options:
--version show program's version number and exit
-h, --help show this help message and exit
--save-png Saves created histos in png format
--no-auto-bin Disables automatic histo binning
--binwidth-time=BINWIDTHTIME Bin width of time histos in seconds (default 30, min 10)
At the time of writing, the script supports a specific naming schema for root files:
SITE-CFG-DATASET-NEVTS-LABEL-DATE.root
This allows more user-friendly labelling in histogram legends: if the file name is not compliant, it will be used directly as the label in the legend (e.g. if the file is called test.root, it will appear in the legend as "test").
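The naming schema and its fallback can be sketched as follows. This is a hypothetical re-implementation (not CPT's actual code) of the convention just described; note that compliant names encode the dataset with "+" and "." instead of "/", so the six fields are separated by dashes:

```python
# Hypothetical parser for the file-name convention
# SITE-CFG-DATASET-NEVTS-LABEL-DATE.root used for legend labels.
def legend_fields(filename):
    stem = filename[:-len(".root")] if filename.endswith(".root") else filename
    parts = stem.split("-")
    if len(parts) != 6:
        # non-compliant name: use the bare name as the label
        return {"Label": stem}
    keys = ["Site", "Cfg", "Dataset", "EventsJob", "Label", "Date"]
    return dict(zip(keys, parts))
```

E.g. legend_fields("test.root") falls back to the bare label "test", while a compliant CRAB-style name is split into its six fields.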
The script can be configured to plot only some variables. The configurations are stored in sequences in the file header:
- filter: this filters the quantities to be analyzed. It accepts regular expressions, e.g.:
filter = [
".*read.*[^(max|min)].*",
"Time",
"Percentage",
"User",
"Error"
]
selects all the read(v) statistics (except the min and max values) and all quantities which contain Time (at the beginning), Percentage, User and Error.
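The effect of such a filter list can be sketched in a few lines. Whether CPT applies the patterns with re.search or re.match, and the any-pattern-matches semantics, are assumptions here:

```python
import re

# Sketch: keep a quantity if any pattern from the filter list matches
# somewhere in its name (matching convention is an assumption).
FILTER = [".*read.*[^(max|min)].*", "Time", "Percentage", "User", "Error"]

def passes_filter(quant, patterns=FILTER):
    return any(re.search(p, quant) for p in patterns)
```

With this convention, tstoragefile-read-total-msecs and TimeEvent pass, while an unrelated quantity like Success is filtered out.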
- plotFilter: selects the quantities to be actually plotted, e.g.:
plotFilter = [
"readv-total-msecs",
"read-total-msecs",
"TimeEvent",
"Percentage",
"UserTime",
"WrapperTime"
]
In this case, only quantities containing one of the selected words will be plotted. Regular expressions are supported.
- plotTogether: a list of tstorage statistics to be plotted together (tstoragefile values excluded), so that quantities coming from different SE technologies (dcap, file, ...) are plotted overlapped. Better not to touch it.
- summaryPlots: selects the plots to be inserted in a single canvas. Useful to get an overview at a glance. E.g.:
summaryPlots = (
"CpuPercentage",
"UserTime",
"ExeTime",
"tstoragefile-read-total-msecs"
)
will plot a canvas divided into 4 pieces, each with one of the selected plots.
- legendComposition: the labels to use in legends (histos and tables). Supported labels are: Site, Cfg, Dataset, EventsJob, Label, Date, Hour. E.g.:
legendComposition = ["Site","Cfg","Label"]
will use as labels the site, the cfg name and the Label.
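Composing the legend label from the parsed file-name fields then amounts to joining the selected fields. This is a sketch, not CPT's actual code; the space-joined form matches labels like "CSCS PAT1 CMSSW_3_3_2" seen in the tables below:

```python
# Sketch: build a legend label from parsed file-name fields according to
# a legendComposition sequence such as ["Site","Cfg","Label"].
def compose_legend(fields, composition=("Site", "Cfg", "Label")):
    # skip fields missing from a non-compliant file name
    return " ".join(fields[k] for k in composition if k in fields)
```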
Using CPT with simple CMSSW jobs: An Example
In this case, I ran 10 jobs (100 evts each) with both PAT1.py and PAT1_RHAUTO_CHAPP_CACHE20.py, using a simple shell loop:
# for i in `seq 1 10`; do ( /usr/bin/time cmsRun -j CMSSW_PAT1/test_$i.xml PAT1.py ) &> CMSSW_PAT1/test_$i.stdout; done
# for i in `seq 1 10`; do ( /usr/bin/time cmsRun -j CMSSW_PAT1_RHAUTO_CHAPP_CACHE20/test_$i.xml PAT1_RHAUTO_CHAPP_CACHE20.py) &> CMSSW_PAT1_RHAUTO_CHAPP_CACHE20/test_$i.stdout; done
NB: please notice the names of the stdout and xml files (test_X.stdout and test_X.xml).
The files produced and stored in CMSSW_PAT1 and CMSSW_PAT1_RHAUTO_CHAPP_CACHE20 are the basis for getting the statistics. Get the scripts from CVS:
# cvs co -r %CPT_VERSION% -d PerfToolkit UserCode/leo/Utilities/PerfToolkit/cpt_getJobInfo.py
# cvs co -r %CPT_VERSION% -d PerfToolkit UserCode/leo/Utilities/PerfToolkit/cpt_getStats.py
# cvs co -r %CPT_VERSION% -d PerfToolkit UserCode/leo/Utilities/PerfToolkit/cpt_utilities.py
First of all, produce the root files:
# python PerfToolkit/cpt_getJobInfo.py --type=CMSSW CMSSW_PAT1
OutFile: CMSSW_PAT1.root
Analyzing CMSSW_PAT1
# python PerfToolkit/cpt_getJobInfo.py --type=CMSSW CMSSW_PAT1_RHAUTO_CHAPP_CACHE20
OutFile: CMSSW_PAT1_RHAUTO_CHAPP_CACHE20.root
Analyzing CMSSW_PAT1_RHAUTO_CHAPP_CACHE20
NB: if you run it as a simple executable (./cpt_getJobInfo.py) it will fail in a CMSSW area (e.g. after cmsenv has been used), because the script calls the system python version.
Then we are ready to produce tables and plots:
# python PerfToolkit/cpt_getStats.py --save-png CMSSW*.root
The results are shown below:
| | CMSSW_PAT1 | CMSSW_PAT1_RHAUTO_CHAPP_CACHE20 |
| Success | 100.0% (10 / 10) | 100.0% (10 / 10) |
| ExeTime | 61.59 +- 1.69 | 60.31 +- 1.81 |
| MinEventTime | 0.10 +- 0.01 | 0.08 +- 0.00 |
| UserTime | 53.96 +- 0.24 | 53.75 +- 0.18 |
| CpuPercentage | 91.60 +- 1.96 | 93.10 +- 1.70 |
| AvgEventTime | 3.02 +- 0.04 | 2.96 +- 0.12 |
| MaxEventTime | 27.30 +- 0.23 | 26.39 +- 0.57 |
| TotalJobTime | 30.22 +- 0.37 | 29.64 +- 1.17 |
| SysTime | 2.67 +- 0.19 | 2.61 +- 0.51 |
| READ | | |
| tstoragefile-read-actual-total-megabytes | 3.49 +- 0.00 | 2.57 +- 0.00 |
| tstoragefile-read-async-total-megabytes | // | 0.00 +- 0.00 |
| tstoragefile-read-total-megabytes | 3.49 +- 0.00 | 3.49 +- 0.00 |
| dcap-read-total-megabytes | 3.49 +- 0.00 | 2.57 +- 0.00 |
| file-read-total-megabytes | 0.00 +- 0.00 | 0.00 +- 0.00 |
| tstoragefile-read-via-cache-total-megabytes | // | 0.92 +- 0.00 |
| tstoragefile-read-actual-total-msecs | 1953.94 +- 344.83 | 473.26 +- 748.09 |
| tstoragefile-read-async-total-msecs | // | 0.00 +- 0.00 |
| tstoragefile-read-total-msecs | 1954.58 +- 344.83 | 477.75 +- 748.10 |
| dcap-read-total-msecs | 1953.24 +- 344.84 | 473.17 +- 748.09 |
| file-read-total-msecs | 0.00 +- 0.00 | 0.00 +- 0.00 |
| tstoragefile-read-via-cache-total-msecs | // | 3.79 +- 0.03 |
| tstoragefile-read-actual-num-operations | 306.00 +- 0.00 | 30.00 +- 0.00 |
| tstoragefile-read-actual-num-successful-operations | 306.00 +- 0.00 | 30.00 +- 0.00 |
| tstoragefile-read-async-num-operations | // | 1.00 +- 0.00 |
| tstoragefile-read-async-num-successful-operations | // | 0.00 +- 0.00 |
| tstoragefile-read-num-operations | 306.00 +- 0.00 | 306.00 +- 0.00 |
| dcap-read-num-operations | 306.00 +- 0.00 | 30.00 +- 0.00 |
| file-read-num-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
| tstoragefile-read-num-successful-operations | 306.00 +- 0.00 | 306.00 +- 0.00 |
| dcap-read-num-successful-operations | 306.00 +- 0.00 | 30.00 +- 0.00 |
| file-read-num-successful-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
| tstoragefile-read-via-cache-num-operations | // | 277.00 +- 0.00 |
| tstoragefile-read-via-cache-num-successful-operations | // | 276.00 +- 0.00 |
| READV | | |
| tstoragefile-readv-actual-total-megabytes | // | 20.70 +- 0.00 |
| dcap-readv-total-megabytes | 0.00 +- 0.00 | 20.70 +- 0.00 |
| file-readv-total-megabytes | 0.00 +- 0.00 | 0.00 +- 0.00 |
| tstoragefile-readv-actual-total-msecs | // | 1059.14 +- 472.79 |
| dcap-readv-total-msecs | 0.00 +- 0.00 | 1059.13 +- 472.79 |
| file-readv-total-msecs | 0.00 +- 0.00 | 0.00 +- 0.00 |
| tstoragefile-readv-actual-num-operations | // | 1.00 +- 0.00 |
| tstoragefile-readv-actual-num-successful-operations | // | 1.00 +- 0.00 |
| dcap-readv-num-operations | 0.00 +- 0.00 | 1.00 +- 0.00 |
| file-readv-num-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
| dcap-readv-num-successful-operations | 0.00 +- 0.00 | 1.00 +- 0.00 |
| file-readv-num-successful-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
- Performance Overview. NB: in this case the time spent reading the file is too short (<10 secs): this means that only the 0th bin is filled, and the script does not plot these kinds of histos.
- Cpu Percentage Event by event:
- Cpu Time Event by event:
- Time used by eleIsoDepositTk-CandIsoDepositProducer module (the first called by PAT) Event by event:
- Wall Time Event by event:
All the workflow: site performance example
In this use case, we want to submit the same job(s) to different sites, in order to evaluate their performances. This example will use
PAT and
CRAB.
Job Submission
Go to a CMSSW area, e.g. CMSSW_3_3_2, and download the TestBed directory:
# cd CMSSW_3_3_2/src
# cvs co -d TestBed UserCode/leo/Utilities/PerfToolkit/TestBed
# cd TestBed
In this directory the example cfgs are available. There is also a crab.template file which eases the submission of the jobs: it contains placeholders which can be substituted through sed. A python script which does this work is provided in PerfToolkit/CrabUtils:
# cvs co -d CrabUtils UserCode/leo/Utilities/PerfToolkit/CrabUtils
The script is named crab_LaunchIOTestJobs.py, and its usage is:
crab_LaunchIOTestJobs.py SITE LABEL CFG EVENTSJOB DATASET
In order for the script to work properly, you should source the LCG, CMSSW and CRAB environment:
source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
cmsenv
source /afs/cern.ch/cms/ccs/wm/scripts/Crab/crab.sh
For example, to submit to CSCS the PAT1.py cfg with 10000 evts/job on the /QCD_Pt80/Summer09-MC_31X_V3_AODSIM-v1/AODSIM dataset:
crab_LaunchIOTestJobs.py CSCS CMSSW_3_3_2 PAT1.py 10000 /QCD_Pt80/Summer09-MC_31X_V3_AODSIM-v1/AODSIM
where I've chosen to use CMSSW_3_3_2 as the label. This will produce a CRAB working dir like CSCS-PAT1-QCD_Pt80+Summer09.MC_31X_V3_AODSIM.v1+AODSIM-10000-CMSSW_3_3_2-201001042356, which is compliant with the labelling rules described above. If you create a text file with the site names in it, you can submit to all the sites in one go:
for i in `cat sites.txt`; do crab_LaunchIOTestJobs.py $i CMSSW_3_3_2 PAT1.py 10000 /QCD_Pt80/Summer09-MC_31X_V3_AODSIM-v1/AODSIM; done
NB: crab_LaunchIOTestJobs.py will create jobs for all the events in the specified dataset, but only the first 20 (this number can be changed in the script) will be submitted.
In order to quickly check the status, retrieve the output and kill jobs, you can use the other utilities contained in CrabUtils:
- crabStatus.sh LABEL
- crabGetoutput.sh LABEL
- crabKill.sh LABEL
LABEL is a part of the name of the CRAB working dirs you're interested in; in this case it can be QCD, Summer09, etc. E.g.:
crabStatus.sh QCD
CSCS-PAT1-QCD_Pt80+Summer09.MC_31X_V3_AODSIM.v1+AODSIM-10000-CMSSW_3_3_2-201001141613/
List of jobs Created: 21-220
List of jobs Ready: 1-20
KNU-PAT1-QCD_Pt80+Summer09.MC_31X_V3_AODSIM.v1+AODSIM-10000-CMSSW_3_3_2-201001141622/
List of jobs Created: 21-220
List of jobs Ready: 1-20
Analyzing statistics
Now that we have our CRAB directories, we only need to:
- Get the scripts from CVS:
# cvs co -r %CPT_VERSION% -d PerfToolkit UserCode/leo/Utilities/PerfToolkit/cpt_getJobInfo.py
# cvs co -r %CPT_VERSION% -d PerfToolkit UserCode/leo/Utilities/PerfToolkit/cpt_getStats.py
# cvs co -r %CPT_VERSION% -d PerfToolkit UserCode/leo/Utilities/PerfToolkit/cpt_utilities.py
# python PerfToolkit/cpt_getJobInfo.py CSCS-PAT1-QCD_Pt80+Summer09.MC_31X_V3_AODSIM.v1+AODSIM-10000-CMSSW_3_3_2-201001141613
OutFile: CSCS-PAT1-QCD_Pt80+Summer09.MC_31X_V3_AODSIM.v1+AODSIM-10000-CMSSW_3_3_2-201001141613.root
Analyzing CSCS-PAT1-QCD_Pt80+Summer09.MC_31X_V3_AODSIM.v1+AODSIM-10000-CMSSW_3_3_2-201001141613
# python PerfToolkit/cpt_getJobInfo.py KNU-PAT1-QCD_Pt80+Summer09.MC_31X_V3_AODSIM.v1+AODSIM-10000-CMSSW_3_3_2-201001141622/
OutFile: KNU-PAT1-QCD_Pt80+Summer09.MC_31X_V3_AODSIM.v1+AODSIM-10000-CMSSW_3_3_2-201001141622.root
Analyzing KNU-PAT1-QCD_Pt80+Summer09.MC_31X_V3_AODSIM.v1+AODSIM-10000-CMSSW_3_3_2-201001141622/
Then, we are ready to produce histos and tables:
# python PerfToolkit/cpt_getStats.py --save-png *20100114*.root
which produces this table and these graphs (the table has been cut and pasted from the stdout).
| | CSCS PAT1 CMSSW_3_3_2 | KNU PAT1 CMSSW_3_3_2 |
| Success | 100.0% (20 / 20) | 100.0% (20 / 20) |
| WrapperTime | 4720.45 +- 254.19 | 3285.55 +- 654.12 |
| ExeTime | 4709.25 +- 253.92 | 3274.65 +- 653.02 |
| MinEventTime | 0.07 +- 0.01 | 0.07 +- 0.02 |
| UserTime | 2664.38 +- 119.62 | 2580.32 +- 390.90 |
| CpuPercentage | 57.45 +- 2.54 | 81.50 +- 9.11 |
| User_ReadkBEvt | 89.96 +- 0.26 | 89.96 +- 0.26 |
| AvgEventTime | 0.47 +- 0.03 | 0.32 +- 0.06 |
| MaxEventTime | 28.51 +- 4.88 | 32.26 +- 7.05 |
| TotalJobTime | 4660.38 +- 254.32 | 3227.93 +- 647.26 |
| StageoutTime | -1.00 +- 0.00 | -1.00 +- 0.00 |
| SysTime | 67.22 +- 7.50 | 71.97 +- 11.04 |
| READ | | |
| tstoragefile-read-actual-total-megabytes | 878.52 +- 2.59 | 878.52 +- 2.59 |
| tstoragefile-read-total-megabytes | 878.52 +- 2.59 | 878.52 +- 2.59 |
| dcap-read-total-megabytes | 878.52 +- 2.59 | 878.52 +- 2.59 |
| file-read-total-megabytes | 0.00 +- 0.00 | 0.00 +- 0.00 |
| tstoragefile-read-actual-total-msecs | 1982040.00 +- 188744.51 | 641224.70 +- 412932.27 |
| tstoragefile-read-total-msecs | 1982305.00 +- 188750.95 | 641683.15 +- 412922.38 |
| dcap-read-total-msecs | 1981720.00 +- 188733.85 | 640731.05 +- 412930.85 |
| file-read-total-msecs | 0.00 +- 0.00 | 0.00 +- 0.00 |
| tstoragefile-read-actual-num-operations | 134683.15 +- 311.18 | 134683.15 +- 311.18 |
| tstoragefile-read-actual-num-successful-operations | 134683.15 +- 311.18 | 134683.15 +- 311.18 |
| tstoragefile-read-num-operations | 134683.15 +- 311.18 | 134683.15 +- 311.18 |
| dcap-read-num-operations | 134683.15 +- 311.18 | 134683.15 +- 311.18 |
| file-read-num-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
| tstoragefile-read-num-successful-operations | 134683.15 +- 311.18 | 134683.15 +- 311.18 |
| dcap-read-num-successful-operations | 134683.15 +- 311.18 | 134683.15 +- 311.18 |
| file-read-num-successful-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
| READV | | |
| dcap-readv-total-megabytes | 0.00 +- 0.00 | 0.00 +- 0.00 |
| file-readv-total-megabytes | 0.00 +- 0.00 | 0.00 +- 0.00 |
| dcap-readv-total-msecs | 0.00 +- 0.00 | 0.00 +- 0.00 |
| file-readv-total-msecs | 0.00 +- 0.00 | 0.00 +- 0.00 |
| dcap-readv-num-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
| file-readv-num-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
| dcap-readv-num-successful-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
| file-readv-num-successful-operations | 0.00 +- 0.00 | 0.00 +- 0.00 |
- CSCS-PAT1-QCD_Pt80Summer09-Overview.png:
- CSCS-PAT1-QCD_Pt80Summer09-TimeEvent_cpuPercentage.png:
- CSCS-PAT1-QCD_Pt80Summer09-TimeEvent_cpuSecs.png:
- CSCS-PAT1-QCD_Pt80Summer09-TimeEvent_eleIsoDepositTk-CandIsoDepositProducer_secs.png:
- CSCS-PAT1-QCD_Pt80Summer09-TimeEvent_secs.png:
If we want a coarser time binning:
# python PerfToolkit/cpt_getStats.py --binwidth-time=120 --save-png *20100114*.root
- coarser_CSCS-PAT1-QCD_Pt80Summer09-Overview.png:
--
LeonardoSala - 14-Jan-2010