CMS Data Analysis School 2018: Jet Analysis

Contacts

Robin Aggleton (Uni. Hamburg), Jens Multhaup (Uni. Hamburg)

NB to edit this page use the Raw Edit button at the bottom, otherwise it WILL lose formatting

Introduction

What is this set of exercises trying to do?

Give you hands-on experience of how to access the jet collections in an event, plot basic jet quantities, apply jet energy corrections, and look at jet substructure:

  • A 101 on how to access jets in the CMS framework without assuming prior knowledge of jet analysis.
  • Make you familiar with basic jet types and algorithms and how to use them in your analysis.
  • Illustrate each exercise using real life example scripts.
  • Give you a comprehensive reference to more advanced workbook examples, additional resources, and pedagogical documentation in one place.

What are these exercises NOT meant for?

To give a comprehensive summary of the CMS JetMET software machinery or of the jet analyses being performed at CMS.

We will not be covering MET or b-tagging, even though they are closely related to jets - you are advised to go to those specific short exercises.

What do YOU need to do?

Colour coding convention

Colour legend for this tutorial:

GRAY background for the commands to execute  (cut&paste) 
GREEN background for the output of the executed commands
BLUE background for the configuration files  (cut&paste)
PINK background for the code (EDAnalyzer etc.)  (cut&paste) 

Exercise 1: Setup and run simple jet analysis

Step 0: Login to a machine

ssh -Y lxplus6.cern.ch (for running this exercise at CERN only)
ssh -Y nafhh-cms02.desy.de (at DESY only)

(the -Y is to do X11 forwarding, i.e. so you can see a GUI on your laptop that's actually running on the remote machine)

Step 1: Checkout a CMSSW release

# if using BASH:
source /cvmfs/cms.cern.ch/cmsset_default.sh
export SCRAM_ARCH=slc6_amd64_gcc630

# if using TCSH
source /cvmfs/cms.cern.ch/cmsset_default.csh
setenv SCRAM_ARCH slc6_amd64_gcc630

mkdir JetsShort
cd JetsShort
cmsrel CMSSW_9_4_8
cd CMSSW_9_4_8/src/
cmsenv

Step 2: Checkout additional software for jet analysis exercises

git clone https://github.com/raggleton/JMEDAS.git Analysis/JMEDAS
scram b -j 4

Step 3: Run simple analysis

cd Analysis/JMEDAS/test
voms-proxy-init -voms cms  # init your grid cert so you can access files from around the world
python jmedas_fwlite.py --files=../data/ttbar.txt --outname=ttbar.root --maxevents=2000 --maxjets=2

What you are doing: this python script runs over CMS files in the MINIAOD format, using the FWLite framework. We loop over events, and in each event fill some histograms using the objects inside the file. We then write the result to another ROOT file, which can then be used for subsequent plotting, etc.

The script has several options, including:

  • --files : text file with list of MINIAOD files to run over
  • --outname : name of output ROOT file
  • --maxevents : maximum number of events to loop over
  • --maxjets : maximum number of jets to consider in each event
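As a rough illustration of how these four options could be handled, here is a minimal sketch using Python's standard argparse module. It is not the code of jmedas_fwlite.py: the default values, the function names parse_args and leading_jet_pts, and the toy list of events are all assumptions made for this example, and the plain list of jet pT values stands in for the real FWLite event loop.

```python
import argparse


def parse_args(argv=None):
    """Parse the four options listed above (defaults here are assumptions)."""
    parser = argparse.ArgumentParser(description="Toy jet analysis options")
    parser.add_argument("--files", required=True,
                        help="text file with a list of MINIAOD files")
    parser.add_argument("--outname", default="output.root",
                        help="name of the output ROOT file")
    parser.add_argument("--maxevents", type=int, default=-1,
                        help="maximum number of events (-1: no limit)")
    parser.add_argument("--maxjets", type=int, default=999,
                        help="maximum number of jets per event")
    return parser.parse_args(argv)


def leading_jet_pts(events, maxevents, maxjets):
    """Collect the pT of the leading jets, respecting both limits.

    `events` is a plain list of per-event jet-pT lists standing in for
    the real FWLite event loop.
    """
    selected = []
    for ievent, jet_pts in enumerate(events):
        if maxevents >= 0 and ievent >= maxevents:
            break  # stop once the requested number of events is reached
        ordered = sorted(jet_pts, reverse=True)  # highest pT first
        selected.append(ordered[:maxjets])
    return selected


args = parse_args(["--files=../data/ttbar.txt", "--outname=ttbar.root",
                   "--maxevents=2", "--maxjets=2"])
toy_events = [[30.0, 120.0, 75.0], [50.0], [200.0, 180.0]]
result = leading_jet_pts(toy_events, args.maxevents, args.maxjets)
print(result)
```

With --maxevents=2 and --maxjets=2 only the first two toy events are read, and at most the two highest-pT jets are kept from each.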

Code aside: we're using FWLite here to run over the MINIAOD, which is convenient for small, quick analyses like this. But for a full-blown analysis you'd want to leverage the speed of C++, and write a custom EDAnalyzer module to use in a python config with cmsRun.

Quiz 1: Plot basic jet quantities

python -i plots1.py

(the -i means that after running the script, python goes into "interactive" mode, allowing the canvas to remain, otherwise python would quit and the canvas would disappear)

To quit python, in your terminal use Ctrl+d, or type exit().

Do your histograms look like the following?

plots1.png

Exercise 2: Get familiar with different jet types and jet reconstruction algorithms

A jet algorithm forms "clusters" out of an input of 4-vectors. One can cluster reconstructed objects from the detector, or one can cluster objects purely from the Monte Carlo generator. At CMS, the most popular type of reconstructed jet uses objects reconstructed using particle flow (PF). This aims to utilise information from all the subdetectors to create 4-vectors representing specific particles.

Reconstructed ParticleFlow Jets (PFJets)

  • Particle Flow candidates (PFCandidates) combine information from various detectors to make the best combined estimation of particle properties.
  • A PFJet is clustered from PFCandidates and contains information about the contribution of each particle class: electromagnetic/hadronic, charged/neutral, etc.
  • Jet response is high. Jet pT resolution is good: starting at 15--20% at low pT and asymptotically reaching 5% at high pT.

Monte Carlo Generator-level Jets (GenJets)

  • GenJets are pure Monte Carlo simulated jets. Useful for analysis with MC samples.
  • Made by clustering the four-vectors of Monte Carlo generator-level particles. May include “invisible” particles (muons, neutrinos, ...)
  • Jet response is 1, i.e., the jet energy scale is 1. Resolution is perfect, by definition.
  • GenJets include information about the 4-vectors of the constituent particles, the hadronic and electromagnetic components of the energy etc.
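The jet response mentioned in both lists is the ratio of reconstructed to generator-level jet pT. A minimal sketch of how it could be computed is shown here: each reconstructed jet is matched to the closest generator-level jet in ΔR, and the pT ratio is taken. The matching radius of 0.2, the function names, and the toy (pT, eta, phi) tuples are assumptions for illustration only, not values taken from the exercise scripts.

```python
import math


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into (-pi, pi]."""
    dphi = (phi1 - phi2) % (2.0 * math.pi)  # now in [0, 2pi)
    if dphi > math.pi:
        dphi -= 2.0 * math.pi
    return dphi


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance in the eta-phi plane."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


def jet_response(reco_jets, gen_jets, max_dr=0.2):
    """Return pT(reco)/pT(gen) for every reco jet with a gen match.

    Jets are (pt, eta, phi) tuples; max_dr is an assumed matching radius.
    """
    responses = []
    for pt, eta, phi in reco_jets:
        best_pt, best_dr = None, max_dr
        for gen_pt, gen_eta, gen_phi in gen_jets:
            dr = delta_r(eta, phi, gen_eta, gen_phi)
            if dr < best_dr:
                best_pt, best_dr = gen_pt, dr
        if best_pt is not None and best_pt > 0.0:
            responses.append(pt / best_pt)
    return responses


reco = [(95.0, 0.50, 1.00), (40.0, -1.20, -2.00)]
gen = [(100.0, 0.52, 1.01), (50.0, 2.00, 0.30)]
responses = jet_response(reco, gen)
print(responses)
```

In this toy input only the first reconstructed jet has a generator-level partner nearby, so a single response value is returned; the second jet is left unmatched.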

Quiz 2: Reconstructed versus Generator-Level Jets

Compare the basic kinematic distributions of reconstructed vs generator-level jets in Monte Carlo using the command

python -i plots2.py

Do you see (and expect) any difference between the reconstructed and generator-level jets?

The result should look like this (reconstructed jets are in solid black, generator-level jets are in dashed red) :

Jet Algorithms

The majority of jet algorithms at CMS use a so-called "clustering sequence". This is essentially a pairwise examination of the input four-vectors. If a pair satisfies some criteria, the two are merged. The process is repeated until the entire list of constituents is exhausted. In addition, there are several ways to determine the "area" of the jet over which the input constituents lie. This is very important in correcting for pileup, as we will see, because some algorithms tend to "consume" more constituents than others and hence are more susceptible to pileup. Furthermore, the amount of energy inside a jet due to pileup is proportional to the area, so to correct for this effect it is essential to know the jet area.
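To make the idea of a clustering sequence concrete, here is a minimal pure-Python sketch of the generalised kt family of sequential recombination algorithms (p = -1 gives anti-kt, p = 0 Cambridge/Aachen, p = 1 kt), using the standard distance measures d_ij = min(pT_i^2p, pT_j^2p) ΔR_ij^2 / R^2 and d_iB = pT_i^2p. It is a deliberately slow teaching example, not the FastJet implementation used in CMSSW; the function names and the toy input particles are assumptions.

```python
import math


def make_p4(pt, eta, phi):
    """Massless four-vector (px, py, pz, E) from pT, eta, phi."""
    return (pt * math.cos(phi), pt * math.sin(phi),
            pt * math.sinh(eta), pt * math.cosh(eta))


def pt_y_phi(p4):
    """Transverse momentum, rapidity and azimuth of a four-vector."""
    px, py, pz, e = p4
    return math.hypot(px, py), 0.5 * math.log((e + pz) / (e - pz)), math.atan2(py, px)


def dist2(a, b):
    """Squared distance in the rapidity-phi plane."""
    _, ya, phia = pt_y_phi(a)
    _, yb, phib = pt_y_phi(b)
    dphi = abs(phia - phib)
    if dphi > math.pi:
        dphi = 2.0 * math.pi - dphi
    return (ya - yb) ** 2 + dphi ** 2


def cluster(particles, R=0.4, p=-1):
    """Sequential recombination; returns jets sorted by decreasing pT."""
    objs = list(particles)
    jets = []
    while objs:
        best = None  # (distance, i, j); j is None for a beam distance
        for i, a in enumerate(objs):
            kti = pt_y_phi(a)[0] ** (2 * p)
            if best is None or kti < best[0]:
                best = (kti, i, None)
            for j in range(i + 1, len(objs)):
                ktj = pt_y_phi(objs[j])[0] ** (2 * p)
                dij = min(kti, ktj) * dist2(a, objs[j]) / R ** 2
                if dij < best[0]:
                    best = (dij, i, j)
        _, i, j = best
        if j is None:
            jets.append(objs.pop(i))  # closest to the beam: final jet
        else:
            merged = tuple(x + y for x, y in zip(objs[i], objs[j]))
            objs.pop(j)  # j > i, so remove j first
            objs.pop(i)
            objs.append(merged)  # E-scheme: add the four-vectors
    return sorted(jets, key=lambda v: pt_y_phi(v)[0], reverse=True)


particles = [make_p4(100.0, 0.0, 0.0), make_p4(20.0, 0.1, 0.1),
             make_p4(5.0, -0.2, 0.05), make_p4(50.0, 1.5, 2.5),
             make_p4(10.0, 1.6, 2.4)]
jets = cluster(particles, R=0.4, p=-1)
for jet in jets:
    print("pt=%.1f y=%.2f phi=%.2f" % pt_y_phi(jet))
```

With anti-kt the hardest particle in each group acts as a seed and absorbs its softer neighbours first, which is why the five toy particles end up in two jets.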

Figure: Comparison of jet areas for four different jet algorithms, from "The anti-kt Clustering Algorithm" by Cacciari, Salam, and Soyez [JHEP04, 063 (2008), arXiv:0802.1189].
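Since the pileup energy inside a jet scales with its area, the simplest area-based correction subtracts the product of the event's median energy density (rho) and the jet area from the raw jet pT. The sketch below shows only this basic idea; the real L1 pileup correction used at CMS is parametrised in more variables, and the function name and numbers here are hypothetical.

```python
def subtract_pileup(raw_pt, area, rho):
    """Subtract the average pileup offset rho * area from the raw jet pT.

    The result is clamped at zero so that a jet never gets negative pT.
    """
    return max(raw_pt - rho * area, 0.0)


rho = 20.0  # hypothetical pileup density, GeV per unit area
small_jet = subtract_pileup(50.0, area=0.5, rho=rho)
large_jet = subtract_pileup(50.0, area=2.0, rho=rho)
print(small_jet, large_jet)
```

For the same raw pT, the jet with four times the area loses four times as much pT to the pileup offset, which is why larger jets are more sensitive to pileup.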

Some excellent references about jet algorithms can be found here:

You will also be looking at several "jet grooming" algorithms, which attempt to reduce the impact of "soft" contributions to the clustering sequence by adding some other criteria. You will be investigating three types of groomers:

You will also be investigating several algorithms to identify ("tag") highly-boosted massive SM particles such as W, Z, H bosons and top quarks. These involve utilizing substructure and kinematic information of subjets to "tag" these particles. You will be investigating two types of identification algorithms ("taggers"):

Several measurements of jet substructure have also been performed:

Quiz 3: Comparing jet cone sizes

Compare the jet areas for anti-kt jets with R-parameters of 0.4 and 0.8. What type of distribution do you expect given what you now know about jet algorithms?

Now make plots comparing them:

 python -i plots3.py 

The answers should look like this (AK4 jets are in solid black, AK8 jets are in dashed blue) :

Jet types and algorithms in CMS

  • This twiki summarizes the respective labels by which each jet collection can be retrieved from the event record for general AOD files.

  • For the MINIAOD, there are three main jet collections as described here.
    • slimmedJets :
      • AK4 jets
      • Using charged hadron subtraction
      • b-tagging applied
      • Jet corrections applied
      • Pileup jet ID info embedded.
      • Quark/gluon likelihood info embedded.
    • slimmedJetsPUPPI :
      • Also AK4 jets, but using PUPPI instead of charged hadron subtraction
    • slimmedJetsAK8 :
      • AK8 jets
      • Uses PUPPI instead of CHS
      • Several jet grooming techniques applied (modified mass drop tagging with beta=0 (aka "Soft Drop"), pruning)
      • CMS top tagging algorithm
      • n-subjettiness algorithm
      • energy correlation functions (ECFs)
      • Access to subjets (from applying soft drop)
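The Soft Drop grooming mentioned for slimmedJetsAK8 can be illustrated with a short sketch. The jet is declustered step by step, and at each splitting the softer branch is dropped unless the momentum sharing z = min(pT1, pT2) / (pT1 + pT2) exceeds z_cut (ΔR / R0)^beta. The values z_cut = 0.1 and R0 = 0.8 are assumptions chosen for this example, and the list of splittings is hypothetical toy input rather than the output of a real declustering.

```python
def soft_drop_passes(pt1, pt2, delta_r, z_cut=0.1, beta=0.0, r0=0.8):
    """Soft Drop condition for a single splitting into two branches."""
    z = min(pt1, pt2) / (pt1 + pt2)
    return z > z_cut * (delta_r / r0) ** beta


def first_hard_splitting(splittings, **kwargs):
    """Walk through successive splittings along the harder branch.

    `splittings` is a list of (pt1, pt2, delta_r) tuples, ordered from the
    last clustering step backwards. Returns the first splitting that
    satisfies the condition, or None if every splitting is too soft.
    """
    for pt1, pt2, delta_r in splittings:
        if soft_drop_passes(pt1, pt2, delta_r, **kwargs):
            return (pt1, pt2, delta_r)
    return None


toy_splittings = [(400.0, 10.0, 0.7), (380.0, 20.0, 0.5), (250.0, 130.0, 0.3)]
groomed = first_hard_splitting(toy_splittings)
print(groomed)
```

With beta = 0 the angular factor drops out and the condition reduces to a plain cut on z, which is the modified mass drop tagger limit quoted in the list above. The two branches of the splitting that passes are the natural candidates for the subjets.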

Exercise 3: Jet energy scale and uncertainty

The jets stored in the data and MC samples used in the previous exercises have already been corrected for non-uniform responses in pt and eta, as well as an average correction for pileup. More information and technical details on the jet energy scale calibration in CMS can be found at the following twiki: Jet Energy Scale Calibration.

However, it is often necessary to re-apply jet energy corrections (JECs), and to apply the jet energy uncertainties to one's analysis. Often, the JECs are updated fairly late in the analysis cycle simply due to the nature of the problem (the JEC experts do their analysis on the same data that people are using for their analyses, so there's a "chicken-and-egg" problem). For this reason it is imperative to maintain flexibility in the JECs, and the software reflects this. It is possible to run the JEC software "on the fly" after you've done your heavy processing (Ntuple creation, skimming, etc). We will now show how this is done.

More info:

  • For the official documentation see here
  • For the official Jet Energy Corrections see here
  • Intro to JECs : here
  • JEC uncertainty sources: here
  • Github of JECs: here

Applying Jet Energy Corrections (JECs) and scale uncertainties

Sets of JECs and their associated uncertainties are published in 3 formats:

1. In a Global Tag (GT)

2. As SQL files

3. As plain text files

The latter two options are used for applying JECs "on the fly". We will now use the last option to change JECs in our jet analysis.

The exercise repository already comes with a set of JEC text files, in JMEDAS/data/JECs.

To apply these JECs "on the fly" in FWLite, we do the following (this is already implemented in the script, so there is no need to add it yourself):

1. Construct a FactorizedJetCorrector from whichever levels of JECs

vPar = ROOT.vector(ROOT.JetCorrectorParameters)()
vPar.push_back( ROOT.JetCorrectorParameters('../data/JECs//_V3_MC_L1FastJet_AK4PFchs.txt') )
vPar.push_back( ROOT.JetCorrectorParameters('../data/JECs//_MC_L2Relative_AK4PFchs.txt') )
vPar.push_back( ROOT.JetCorrectorParameters('../data/JECs//_MC_L3Absolute_AK4PFchs.txt') )
jec = ROOT.FactorizedJetCorrector( vPar )

2. We also construct a JetCorrectionUncertainty

jecUnc = ROOT.JetCorrectionUncertainty( '../data/JECs//_MC_Uncertainty_AK4PFchs.txt' )

3. Then for each jet, we have to tell the corrector the jet's pT, eta, area, and energy, along with the event rho and the number of primary vertices in the event. Note that we pass the uncorrected pT and energy.

jec.setJetEta( uncorrJet.eta() )
jec.setJetPt ( uncorrJet.pt() )
jec.setJetE  ( uncorrJet.energy() )
jec.setJetA  ( jet.jetArea() )
jec.setRho   ( rhoValue[0] )
jec.setNPV   ( len(pvs) )

4. We can then ask for the correction factor, and apply it to the uncorrected jet:

corr = jec.getCorrection()
jet.setJetPt( corr * uncorrJet.pt() )

5. We then ask for the uncertainties, by telling it the pt, eta, phi and asking for the variation:

jecUnc.setJetEta( uncorrJet.eta() )
jecUnc.setJetPhi( uncorrJet.phi() )
jecUnc.setJetPt( corr * uncorrJet.pt() )

corrUp = corr + jecUnc.getUncertainty(1)  # 1 is shift UP, 0 is shift DOWN
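The word "factorized" refers to the corrections being applied in sequence, with each level evaluated on the jet pT left by the previous one. The sketch below imitates that logic in plain Python so that it runs without CMSSW. The three level functions and all numbers in them are toy stand-ins invented for this example (they are not real JEC parametrisations), and the up/down variation assumes the uncertainty is a relative (fractional) one applied to the corrected pT.

```python
def toy_l1_offset(pt, area, rho):
    """Toy pileup offset level: removes rho * area from the jet pT."""
    return max(1.0 - rho * area / pt, 1e-4)


def toy_l2_relative(pt, eta):
    """Toy eta-dependent level (hypothetical 2% per unit of |eta|)."""
    return 1.0 + 0.02 * abs(eta)


def toy_l3_absolute(pt):
    """Toy absolute-scale level (hypothetical flat 10%)."""
    return 1.10


def factorized_correction(raw_pt, eta, area, rho):
    """Multiply the levels, each evaluated on the pT from the previous one."""
    pt = raw_pt
    total = 1.0
    levels = (lambda x: toy_l1_offset(x, area, rho),
              lambda x: toy_l2_relative(x, eta),
              toy_l3_absolute)
    for level in levels:
        factor = level(pt)
        pt *= factor
        total *= factor
    return total


def vary(corrected_pt, relative_unc):
    """Shift the corrected pT up and down by a relative uncertainty."""
    return corrected_pt * (1.0 + relative_unc), corrected_pt * (1.0 - relative_unc)


corr = factorized_correction(raw_pt=100.0, eta=1.0, area=0.5, rho=20.0)
corrected_pt = corr * 100.0
pt_up, pt_down = vary(corrected_pt, relative_unc=0.03)
print(corr, corrected_pt, pt_up, pt_down)
```

In this toy case the pileup level lowers the pT while the other two raise it, so the total factor ends up close to one even though the individual levels are not.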

Run the main analysis script, this time correcting jets with a different set of JECs:

python jmedas_fwlite.py --files=../data/ttbar.txt --outname=ttbar_jec.root --maxevents=2000 --maxjets=2 --correctJets

Quiz 4

Now plot a comparison between uncorrected and corrected jets:

python -i plots_jec_1.py

This should look like:

Now plot a comparison between the central ("nominal") value of the corrected jet pT, along with its ±1 sigma variations:

python -i plots_jec_2.py

This should look like:

Why do we need to calibrate the jet energy? Why is the "jet response" not equal to 1? Can you think of a physics process in nature that can help us calibrate the jet response to 1?

The amount of material in front of the CMS calorimeter varies with η. Therefore, the calorimeter response to jets is also a function of jet η. Can you think of a physics process in nature that can help us calibrate the jet response in η to be uniform?

Exercise 4: Jet ID

In order to avoid fake jets in data (e.g. originating from a "hot"/noisy calorimetric cell or electronic read-out box etc.) we require some basic quality criteria for jets. These criteria are collectively called "Jet ID". Details on the jet ID for PFJets can be found in the following twiki: https://twiki.cern.ch/twiki/bin/viewauth/CMS/JetID#Recommendations_for_13_TeV_2017

For 2017 data, there is the option of "Tight" or "TightLepVeto", depending on how much you want to veto jets that overlap with/are faked by leptons.

JetID is typically > 99% efficient - it is only designed to remove bad "jets" that are really just noise from the detector. Note that the JetID requirements change depending on which section of the detector your jet is in, and whether it uses CHS or PUPPI pileup subtraction.

Applying JetID

There are several ways to apply JetID. Note that JetID cuts must be done using the uncorrected jet pT.

  • To apply the cuts on a pat::Jet (as in MINIAOD) in Python, you can do:
                    # Apply jet ID to uncorrected jet
                    nhf = jet.neutralHadronEnergy() / uncorrJet.E()
                    nef = jet.neutralEmEnergy() / uncorrJet.E()
                    chf = jet.chargedHadronEnergy() / uncorrJet.E()
                    cef = jet.chargedEmEnergy() / uncorrJet.E()
                    nconstituents = jet.numberOfDaughters()
                    nch = jet.chargedMultiplicity()
                    goodJet = \
                      nhf < 0.99 and \
                      nef < 0.99 and \
                      chf > 0.00 and \
                      cef < 0.99 and \
                      nconstituents > 1 and \
                      nch > 0
  • To apply the cuts on a pat::Jet (as in MINIAOD) in C++, you can do:
                    // Apply jet ID to uncorrected jet
                    double nhf = jet.neutralHadronEnergy() / uncorrJet.E();
                    double nef = jet.neutralEmEnergy() / uncorrJet.E();
                    double chf = jet.chargedHadronEnergy() / uncorrJet.E();
                    double cef = jet.chargedEmEnergy() / uncorrJet.E();
                    int nconstituents = jet.numberOfDaughters();
                    int nch = jet.chargedMultiplicity();
                    bool goodJet = 
                      nhf < 0.99 &&
                      nef < 0.99 &&
                      chf > 0.00 &&
                      cef < 0.99 &&
                      nconstituents > 1 &&
                      nch > 0;
  • A selector module for CMSSW is available in PFJetIDSelectionFunctor. It can be used with such a code snippet:
from PhysicsTools.SelectorUtils.pfJetIDSelector_cfi import pfJetIDSelector
process.goodPatJetsPFlow = cms.EDFilter("PFJetIDSelectionFunctorFilter",
                                                filterParams = pfJetIDSelector.clone(),
                                                src = cms.InputTag("selectedPatJetsPFlow"),
                                                filter = cms.bool(True)
                                                )
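For quick checks outside CMSSW, the same cuts can be wrapped in a small function that takes the energy fractions and multiplicities as plain numbers. The thresholds below are copied from the snippets above and are not meant as the official recommendation for any particular data-taking year; the function name and the two example jets are hypothetical.

```python
def passes_jet_id(nhf, nef, chf, cef, nconstituents, nch):
    """Return True if a jet passes the cuts used in the snippets above.

    nhf/nef: neutral hadron / neutral EM energy fractions
    chf/cef: charged hadron / charged EM energy fractions
    nconstituents: number of constituents, nch: charged multiplicity
    """
    return (nhf < 0.99 and
            nef < 0.99 and
            chf > 0.00 and
            cef < 0.99 and
            nconstituents > 1 and
            nch > 0)


# A typical-looking jet with a mix of charged and neutral energy
good = passes_jet_id(nhf=0.10, nef=0.25, chf=0.60, cef=0.05,
                     nconstituents=20, nch=12)
# A noise-like "jet": a single neutral hadronic deposit
noise = passes_jet_id(nhf=1.00, nef=0.00, chf=0.00, cef=0.00,
                      nconstituents=1, nch=0)
print(good, noise)
```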

Quiz 5

  • What would the hadronic and electromagnetic energy fractions look like for electrons? Photons? Muons? Single protons? Single pions?
  • What would the hadronic and electromagnetic energy fractions look like for jets ? HINT : Assume jets are made of a bunch of pions. What would that mean for the fractions?

Exercise 5: Jet Resolution

Jets are stochastic objects. The content of jets fluctuates quite a lot, and the content also depends on what actually caused the jet (uds quarks, gluons, etc). In addition, there are experimental limitations to the measurement of jets. Both of these aspects limit the accuracy to which we can measure the 4-momentum of a jet. This is called the jet resolution. If you have a group of single pions that have the same energy, the energy measured by CMS will not be exactly the same every time, but will typically follow a (roughly) Gaussian distribution with a mean and a width. The mean is corrected using the jet energy corrections. It is impossible to "correct" for all resolution effects on a jet-by-jet basis, although regression techniques can account for many effects.

As such, there will always be some experimental and theoretical uncertainty in the jet energy measurement; this is the jet energy resolution. There are also the jet angular resolution and the jet mass resolution. We will demonstrate how to apply the jet energy resolution, since that is applicable to all analyses that use jets. More information can be found at the jet resolution twiki and jet resolution software guide. The resolution is measured in data for different eta bins, and was approximately 10% with a 10% uncertainty for 7 TeV and 8 TeV data. For precision, it is important to use the correctly measured resolutions, but a reasonable approximation is to assume a flat 10% uncertainty for simplicity.

We will now repeat our jet analysis, but this time smearing the jet energy. Similar to correcting the jets, this is easily performed by adding another command line option:

python jmedas_fwlite.py --files=../data/ttbar.txt --outname=ttbar_jer.root --maxevents=2000 --maxjets=2 --smearJets
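As an illustration of what "smearing" can mean, here is a sketch of a commonly used hybrid approach. If a matched generator-level jet exists, the difference between reconstructed and generator-level pT is scaled by a data/MC resolution scale factor; otherwise a random Gaussian smearing is applied. This is not necessarily what the --smearJets option does internally, and the scale factor, the relative resolution, and the function name are assumptions for this example.

```python
import math
import random


def smear_factor(reco_pt, gen_pt, scale_factor, rel_resolution, rng):
    """Multiplicative factor that widens the MC jet pT resolution.

    gen_pt is the matched generator-level pT, or None if unmatched.
    scale_factor is the data/MC resolution ratio; rel_resolution is the
    relative pT resolution in simulation. Both are inputs, not measured here.
    """
    if gen_pt is not None:
        # Scaling method: stretch the reco-gen difference by the scale factor
        factor = 1.0 + (scale_factor - 1.0) * (reco_pt - gen_pt) / reco_pt
    else:
        # Stochastic method: add a Gaussian of the missing extra width
        sigma = rel_resolution * math.sqrt(max(scale_factor ** 2 - 1.0, 0.0))
        factor = 1.0 + rng.gauss(0.0, sigma)
    return max(factor, 0.0)  # never flip the sign of the jet momentum


rng = random.Random(12345)  # fixed seed so the example is reproducible
matched = smear_factor(110.0, 100.0, scale_factor=1.1,
                       rel_resolution=0.10, rng=rng)
unmatched = smear_factor(110.0, None, scale_factor=1.1,
                         rel_resolution=0.10, rng=rng)
print(110.0 * matched, 110.0 * unmatched)
```

For the matched jet the 10 GeV difference between reconstructed and generator-level pT is stretched to 11 GeV, moving the jet from 110 GeV to 111 GeV.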

Quiz 6: Jet Resolution vs Scale

Plot the results of smearing the jets up & down:

python -i plots_jer_1.py

The result should look like:

For our sample, which has the larger effect: jet energy correction uncertainty, or jet energy resolution uncertainty?

Exercise 6: Jet Substructure - Jet Mass

Jet substructure is a rapidly growing field, taking advantage of the new phase space of high-momentum jets being produced by the LHC, along with the ever-increasing mass of exotic objects being searched for, combined with increasingly sophisticated algorithms. One is now able to "tag" candidate W/Z bosons, Higgs bosons, top quarks, or some exotic object that decays to jets. However, one must also be able to efficiently reject boosted QCD jets.

Jet mass is a key variable used to discriminate between the different sources of boosted jets. First we need some other samples of different physics processes with boosted top jets and W bosons:

python jmedas_fwlite.py --files=../data/XtoWW_M3000.txt --outname=grav_ww_3000.root --maxevents=2000 --maxjets=2
python jmedas_fwlite.py --files=../data/Zprime3000.txt --outname=zprime_ttbar_3000.root --maxevents=2000 --maxjets=2

For our boosted Ws, we use a sample of 3 TeV gravitons decaying to WW. For our boosted top jets, we use a sample of 3 TeV Z' bosons decaying to ttbar.
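The jet mass used in these comparisons is the invariant mass of the sum of the constituent four-vectors. A minimal sketch of that calculation in plain Python follows; the helper names and the two toy constituents are assumptions for illustration.

```python
import math


def p4(pt, eta, phi, mass=0.0):
    """Four-vector (px, py, pz, E) from pT, eta, phi and mass."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px ** 2 + py ** 2 + pz ** 2 + mass ** 2)
    return (px, py, pz, energy)


def invariant_mass(constituents):
    """Mass of the summed four-vectors, m^2 = E^2 - |p|^2."""
    px = sum(c[0] for c in constituents)
    py = sum(c[1] for c in constituents)
    pz = sum(c[2] for c in constituents)
    energy = sum(c[3] for c in constituents)
    m2 = energy ** 2 - px ** 2 - py ** 2 - pz ** 2
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding


# Two massless constituents of 200 GeV each, separated by 0.4 in phi
two_prong = [p4(200.0, 0.0, 0.2), p4(200.0, 0.0, -0.2)]
# A single massless constituent carries no mass
one_prong = [p4(400.0, 0.0, 0.0)]
mass_two = invariant_mass(two_prong)
mass_one = invariant_mass(one_prong)
print(mass_two, mass_one)
```

Two hard, well-separated constituents give a sizeable mass even though each one is massless, whereas a single collimated constituent gives essentially zero. This is the basic reason jet mass helps separate boosted heavy-particle decays from ordinary QCD jets.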

First, plot a comparison of the different types of mass for each sample:

# for Z' -> ttbar
python -i plots4.py 
# for Graviton (G*) -> WW
python -i plots4a.py 

You should see:

Now, compare the mass distribution for a SM ttbar sample to the Z' -> ttbar sample:

python -i plots5.py 

Quiz

How do the different types of mass compare?

Can you explain all of the peaks you see?

Why is it that there are no boosted top quarks in the ttbar sample, but there are in the Z' sample?

Exercise 7: Jet Substructure - Subjet Mass

The subjets can also be used to help "tag" interesting objects.

Compare the masses of subjets between the Z' and ttbar samples. We will perform the comparison separately for the highest-mass subjet, and the lowest-mass subjet.

python -i plots7.py 

You should see:

Quiz

Do you understand all the features?

Why are there extra peaks in the highest-mass subjet plot but not the lowest-mass subjet plot?

Exercise 8: Jet Substructure - Boosted W/Z Boson Tagging

We can use N-subjettiness to tag likely boosted W/Z boson candidates, since they can decay to a pair of quarks. The key variable is therefore tau21, defined as the ratio tau2 / tau1.
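N-subjettiness tau_N measures how well the jet constituents line up with N candidate subjet axes: it is the pT-weighted sum of each constituent's distance to its nearest axis, normalised by the summed pT times the jet radius R0. The sketch below evaluates this definition with axes supplied by hand; in practice the axes come from a minimisation or an exclusive clustering step, which is not shown. The toy constituents, the chosen axes, and R0 = 0.8 are assumptions for this example.

```python
import math


def delta_r(y1, phi1, y2, phi2):
    """Distance in the rapidity-phi plane, with phi wrapped."""
    dphi = abs(phi1 - phi2) % (2.0 * math.pi)
    if dphi > math.pi:
        dphi = 2.0 * math.pi - dphi
    return math.hypot(y1 - y2, dphi)


def tau_n(constituents, axes, r0=0.8):
    """N-subjettiness for the given axes (N = len(axes)).

    constituents: list of (pt, y, phi); axes: list of (y, phi).
    """
    d0 = r0 * sum(pt for pt, _, _ in constituents)
    numerator = sum(
        pt * min(delta_r(y, phi, ay, aphi) for ay, aphi in axes)
        for pt, y, phi in constituents
    )
    return numerator / d0


# Toy two-prong jet: two hard constituents plus a little soft radiation
constituents = [(100.0, 0.2, 0.0), (100.0, -0.2, 0.0),
                (5.0, 0.25, 0.05), (5.0, -0.15, -0.05)]
tau1 = tau_n(constituents, axes=[(0.0, 0.0)])
tau2 = tau_n(constituents, axes=[(0.2, 0.0), (-0.2, 0.0)])
tau21 = tau2 / tau1
print(tau1, tau2, tau21)
```

For this two-prong toy jet a single axis describes the energy flow poorly while two axes describe it well, so tau2 is much smaller than tau1 and the ratio is close to zero.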

Compare the tau2 / tau1 ratio for the boosted W bosons from the Graviton sample, and the jets from the ttbar sample :

python -i plots9.py

The result should look like this :

Quiz

Did you expect the distributions to look different? If so, why?

Exercise 9: Jet Substructure - Boosted Top Quark Tagging

We can also use N-subjettiness to tag likely boosted top quark candidates, since they can decay to a pair of quarks (via the W) along with a b quark. The key variable is therefore tau32, defined as the ratio tau3 / tau2.

Compare the tau3 / tau2 ratio for the boosted top quarks from the Z' sample, and the jets from the ttbar sample :

python -i plots10.py

The results should look like this :

Quiz

Why are the jets in the ttbar sample so different from those in the Z' sample? Why is it that there are no boosted top quarks in the ttbar sample, but there are in the Z' sample?

What cut would you apply to select boosted top quarks?

References/Where to get help

There are many TWiki pages, some up to date, some out of date:

There are also several relevant hypernews:

Indico meetings:

FAQ/Problems

  • Error in <TNetXNGFile::Open>: [FATAL] Redirect limit has been reached or other errors pertaining to input files: did you do voms-proxy-init -voms cms?

-- RobinAggleton - 2018-07-11

Topic revision: r41 - 2018-07-16 - RobinAggleton