Datasets and Data Preparation Exercise
Note
This exercise page is an updated version of
https://twiki.cern.ch/twiki/bin/view/CMS/SWGuideCMSDataAnalysisSchoolNTU2016PPDExercise
The original exercises were created by Giovanni Franzoni (giovanni.franzoni@cern.ch), many thanks to him!
Overview
In this set of exercises you will learn how to look for
datasets and find out their key properties relevant for analysis: how to navigate their parent-child relationships and determine the software release, production configuration and alignment-calibration conditions with which they were produced. You'll be exposed to the principal services providing the status and details of datasets:
das,
brilcalc,
McM,
pMp,
cmsDBbrowser. You'll also find out how to compute the integrated luminosity for your analysis.
Introduction to datasets
Getting ready for these exercises
A bit of preparation will ease following this tutorial. Below are a few concrete actions you can take before starting the hands-on session.
Prerequisites
In order to carry out the exercises of this session you need:
- a CMS account at CERN to access web services (if you can access this twiki, you have one) and log into lxplus
Color coding conventions
The following color scheme is used in the exercises on this page:
It is expected that you can cut-and-paste from the command box into the command line. Similarly, you should be able to cut-and-paste from the configuration fragments directly into a text editor, when necessary.
Setup a CMSSW area
Most of the exercises will be carried out using your web browser.
A CMSSW work area will also be needed; please create a working directory in your lxplus home:
ssh -XY your_login@lxplus.cern.ch
mkdir data-preparation-exercises/
cd data-preparation-exercises/
cmsrel CMSSW_8_0_21
cd CMSSW_8_0_21/src
cmsenv
Now you are ready to proceed to the exercises.
Note: CMSSW_8_0_21 is the release which contains the changes for the L1 software, desired for the Moriond17 Monte Carlo (Digi-Reco) production.
Exercise 1: find an accessible file with single electron events in miniaod format from the latest data reprocessing of 2016D
Reminder about the general structure of a dataset name:
dataset = /PrimaryDataset/ProcessingVersion/DataTier
Examples:
dataset = /SingleElectron/Run2016D-23Sep2016-v1/MINIAOD
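As a quick illustration of this convention, the sketch below (not an official CMS tool) splits a dataset name into its three components:

```python
# Minimal sketch: split a dataset name of the form
# /PrimaryDataset/ProcessingVersion/DataTier into its components.
def parse_dataset(name):
    """Return (primary_dataset, processing_version, data_tier)."""
    parts = name.strip("/").split("/")
    assert len(parts) == 3, "expected /PrimaryDataset/ProcessingVersion/DataTier"
    return tuple(parts)

primary, processing, tier = parse_dataset(
    "/SingleElectron/Run2016D-23Sep2016-v1/MINIAOD")
print(primary, processing, tier)  # SingleElectron Run2016D-23Sep2016-v1 MINIAOD
```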
To start with, you need to establish the full dataset name for the "single electron" data. Either you know it from a reference, or you need to identify its key elements and put them together in a search. With the key elements at hand, you'll be able to use the
das web interface and queries with wildcards in order to establish or confirm the complete dataset name.
- The PrimaryDataset string for real data (i.e. collected at CMS) is, strictly speaking, specified in the HLT configuration accessible via the HLT browser; however, in most cases you can make a reasonable guess to find what you need:
- → SingleElectron.
- the latest reprocessing of the 2016 data can be found by looking at the PdmVDataReprocessing (data reprocessing campaigns documentation) twiki: the version of the reprocessing is indicated by a date
- → 23Sep2016
- the acquisition era is part of the ProcessingVersion and indicates the portion of the 2016 run when the data were collected
- → 2016D
You can now place a
query to das to find the dataset; using wildcards increases the chances of finding what you're looking for at the first try, with no need to remember the details of the naming conventions. Of course, you ought to use wildcards with a pinch of salt, so as not to be flooded with too many results matching your query.
dataset dataset=/SingleElectron*/*Run2016D*23Sep2016*/MINIAOD
- → check the Sites
- note that some sites are not accessible to the users (e.g. tape storage)
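The wildcard semantics of the DAS query can be illustrated with Python's glob-style matching; the candidate names below are hypothetical and only serve to show how the pattern narrows the search:

```python
import fnmatch

# Hypothetical candidate dataset names (illustrative only).
candidates = [
    "/SingleElectron/Run2016D-23Sep2016-v1/MINIAOD",
    "/SingleElectron/Run2016D-PromptReco-v2/MINIAOD",
    "/SingleMuon/Run2016D-23Sep2016-v1/MINIAOD",
]
pattern = "/SingleElectron*/*Run2016D*23Sep2016*/MINIAOD"
matches = fnmatch.filter(candidates, pattern)
print(matches)  # only the first candidate matches
```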
Place your
final query to das, looking for the file you want at a site where the dataset is present. (
Note: the site where the dataset is present might change with time, thus the site chosen in the detailed query which follows might need to be updated):
file dataset=/SingleElectron/Run2016D-23Sep2016-v1/MINIAOD site=T1_US_FNAL_Buffer
Does the query return a file? If not, why?
You can now run on one of the files and find out its basic properties, exploiting the fact that
xrootd will serve the file you've chosen from the CMS site where it's available on disk to your
cmsRun
process:
edmFileUtil --eventsInLumis -P root://xrootd.unl.edu//store/data/Run2016D/SingleElectron/MINIAOD/23Sep2016-v1/70000/04E8F72C-AF89-E611-9D2F-FA163E1D7951.root
edmFileUtil --eventsInLumis -P /store/data/Run2016D/SingleElectron/MINIAOD/23Sep2016-v1/70000/04E8F72C-AF89-E611-9D2F-FA163E1D7951.root (*update with the actual file you've found*)
Exercise 2: compute the integrated luminosity collected by CMS in run 276775
The data collected by CMS are
certified on a luminosity-section basis to determine which
data are of good quality to be included in physics analyses. The
data certification is carried out taking into account both the operational health of the sub-detectors and the scrutiny of the reconstructed physics objects by DPG and POG experts. The outcome of the certification process is regularly updated, as more data gets collected and for each new version of the data processing, by the
DQM-DataCertification team, with reports at the
PPD General Meeting and by means of
json files, also available in this
certification repository:
ls -ltrFh /afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/certification/Collisions16/13TeV
The json files from the certification are used to
restrict the events to be included in analysis, typically setting the
lumiMask
in the crab configuration.
You can see the run and luminosity section structure by opening one of the files:
cat /afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/certification/Collisions16/13TeV/Cert_271036-280385_13TeV_PromptReco_Collisions16_JSON.txt
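The file maps each run number to a list of good [first, last] luminosity-section ranges. The sketch below uses a tiny excerpt in that format (the ranges here are illustrative, not the real certified values) to show how a lumi mask is interpreted:

```python
import json

# A tiny excerpt in the certification-file format:
# run number (as a string) -> list of [first, last] good LS ranges.
# The ranges are illustrative, not the real certified values.
good_lumis = json.loads('{"276775": [[1, 100], [103, 1165]]}')

def is_certified(run, ls, mask):
    """True if luminosity section `ls` of `run` falls in a good range."""
    return any(lo <= ls <= hi for lo, hi in mask.get(str(run), []))

print(is_certified(276775, 50, good_lumis))   # inside [1, 100]
print(is_certified(276775, 101, good_lumis))  # in the gap between ranges
```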
Only successfully processed luminosity sections should be used to
compute the integrated luminosity of your analysis: that's typically achieved by asking for the
crab report
, which is also in json format and provides a summary of the runs and luminosity sections processed by completed jobs. Here, for simplicity, we'll use the certification file directly for the luminosity calculation, assuming all processing jobs for run 276775 have been successful.
The luminosity information can be accessed via the
BRIL Work Suite , which needs a simple installation procedure:
*bash* : export PATH=$HOME/.local/bin:/afs/cern.ch/cms/lumi/brilconda-1.0.3/bin:$PATH
*tcsh* : setenv PATH $HOME/.local/bin:/afs/cern.ch/cms/lumi/brilconda-1.0.3/bin:$PATH
pip install --install-option="--prefix=$HOME/.local" brilws
(Export the PATH again after the installation)
*bash* : export PATH=$HOME/.local/bin:/afs/cern.ch/cms/lumi/brilconda-1.0.3/bin:$PATH
*tcsh* : setenv PATH $HOME/.local/bin:/afs/cern.ch/cms/lumi/brilconda-1.0.3/bin:$PATH
The integrated luminosity as measured during data taking (Norm tag:
onlineresult), delivered and recorded, is provided for the luminosity sections specified in the json, limited to run 276775:
brilcalc lumi --help
brilcalc lumi -b 'STABLE BEAMS' -r 276775 -i /afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/certification/Collisions16/13TeV/Cert_271036-280385_13TeV_PromptReco_Collisions16_JSON.txt [--byls]
#Data tag : v1 , Norm tag: onlineresult
+-------------+-------------------+------+------+----------------+---------------+
| run:fill | time | nls | ncms | delivered(/ub) | recorded(/ub) |
+-------------+-------------------+------+------+----------------+---------------+
| 276775:5093 | 07/12/16 21:26:20 | 1165 | 1165 | 222078295.493 | 210692911.427 |
+-------------+-------------------+------+------+----------------+---------------+
#Summary:
+-------+------+------+------+-------------------+------------------+
| nfill | nrun | nls | ncms | totdelivered(/ub) | totrecorded(/ub) |
+-------+------+------+------+-------------------+------------------+
| 1 | 1 | 1165 | 1165 | 222078295.493 | 210692911.427 |
+-------+------+------+------+-------------------+------------------+
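Note that the brilcalc totals above are in inverse microbarns (/ub); a quick hand-rolled conversion to the units usually quoted in analyses:

```python
# brilcalc reports integrated luminosity in inverse microbarns (/ub).
recorded_per_ub = 210692911.427   # totrecorded from the table above
recorded_per_pb = recorded_per_ub / 1e6   # 1 /pb = 1e6 /ub
recorded_per_fb = recorded_per_ub / 1e9   # 1 /fb = 1e9 /ub
print(f"{recorded_per_pb:.2f} /pb")  # 210.69 /pb
print(f"{recorded_per_fb:.4f} /fb")  # 0.2107 /fb
```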
You can verify that you get the same output if you construct yourself a json file limited to run 276775 and process it without run restrictions:
cd CMSSW_8_0_21/src
cmsenv
filterJSON.py --min=276775 --max=276775 /afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/certification/Collisions16/13TeV/Cert_271036-280385_13TeV_PromptReco_Collisions16_JSON.txt | tee 276775.txt
cat 276775.txt
brilcalc lumi -b 'STABLE BEAMS' -i 276775.txt
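The run filter applied by filterJSON.py above can be sketched in a few lines of plain Python (a hand-rolled equivalent for illustration, not the actual tool):

```python
# Sketch of the run filter applied by filterJSON.py: keep only the runs
# whose number falls in [min_run, max_run]. Input follows the
# certification-json format: run (string) -> list of LS ranges.
def filter_json(mask, min_run, max_run):
    return {run: ranges for run, ranges in mask.items()
            if min_run <= int(run) <= max_run}

full = {"276775": [[1, 1165]], "276776": [[1, 50]]}
print(filter_json(full, 276775, 276775))  # {'276775': [[1, 1165]]}
```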
Exercise 3: for a given Monte Carlo AODSIM sample, find: the global tag, the digi-reco configuration, the production history/advancement and all the pile up scenarios available for it
The sample we start from is the neutrino gun overlaid with the pile up which matches the profile of instantaneous luminosity of the 2015 data taking:
dataset dataset=/SingleNeutrino/RunIIFall15DR76-PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/AODSIM
- The global tag is fully specified in the ProcessingVersion, following the campaign name (RunIIFall15DR76) and preceding the processing string (here absent) and the dataset version (v1)
- → 76X_mcRun2_asymptotic_v12
- You find multiple files output-config-* , one for each step: digitization, reconstruction, miniaod/PAT
Any Monte Carlo sample is associated with a
prepID, a unique identifier of the production request which produced it.
prepIDs are strings like
HCA-RunIIFall15DR76-00002, formed from the physics group which placed the production request, the production campaign and an integer number.
prepIDs are used by the Monte Carlo Management Meeting, where production requests are notified and prioritized, and by the computing operation teams, and are the identifiers used in its two key web-based platforms:
Monte Carlo Management (McM) and the
production Monitoring platform (pMp).
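The prepID convention described above can be sketched with a small parser (illustrative only, not an official CMS utility):

```python
# Sketch: split a prepID into the conventional pieces described above
# (physics-group acronym, production campaign, sequence number).
def parse_prepid(prepid):
    pwg, campaign, number = prepid.split("-")
    return pwg, campaign, int(number)

print(parse_prepid("HCA-RunIIFall15DR76-00002"))  # ('HCA', 'RunIIFall15DR76', 2)
```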
- Click on the Search tab, enter HCA-RunIIFall15DR76-00002 in the prepId field and click Search
- Each column shows different elements of the request. You can view more using "Select View"
- Click the tick icon in the column Actions
- The sequence of commands can be executed to run all the steps of the digi-reco processing over a few events
The
production Monitoring platform (pMp) is a service available to CMS members to monitor the progress of single Monte Carlo production requests, full campaigns, and groups of requests (defined by physics working group, processing configuration, priority, etc.). It can be accessed directly or linked from
Monte Carlo Management (McM).
- Click the movie-film icon in the column Actions (see the screenshot of the previous bullet) to get the status of the request: the events for a given production request are split across the different statuses they traverse in production: new, approved, submitted, done
- Click the video-camera icon in the column Actions to get the historical development of the events produced in this request
Most datasets in any campaign are produced with a single pile-up scenario: the one typical of the campaign. In 2015, because of the transition from 50 to 25 ns data taking, the main production campaign had 2 pile-up scenarios for a large set of requests. Some datasets, however, get processed with multiple pile-up configurations to support specific studies. Different versions of the pile-up in the DIGI-RECO step all process the same parent GEN-SIM: this is the simplest way of finding out whether there's more than one pile-up version for a physics dataset.
- Click on the Children link in the das presentation of the dataset
- Multiple children are found, spanning 6 pile-up scenarios
- Note that:
- the ProcessingVersion also contains the optional strings indicating a special processing configuration (related to ECAL zero-suppression settings)
- the datasets with a special processing configuration also have a dedicated data tier
Exercise 4: Simulate your private sample
--
PhatSrimanobhas - 2016-11-11