-- Based on an exercise designed by Kai-Feng Chen.

CMS Data Analysis School: Search for rare $B_s^0 \to \mu^+\mu^-$ decay -- Exercise


This is an introduction to the exercise based on the recent publication on the measurement of the $B_s^0 \to \mu^+\mu^-$ decay branching fraction and effective lifetime using the CMS Run-2 data sets (a.k.a. the BMM5 analysis); see BPH-21-006 for details. In this exercise we will start with an introductory presentation, quickly touch on the reconstruction of B mesons using the standard tool from the BPH group, and practice how to construct an unbinned maximum likelihood fitter based on RooFit to extract the decay branching fraction (the real main task!).

Introductory notes

In this Twiki, the following colour scheme will be used:

  • Commands that you should run will be shown in a gray box, e.g.:
ls -ltr
  • Output such as screen printouts will be shown in a green box, e.g.:
This is a test
  • Code snippets to be looked at (e.g., as examples) or edited will be shown in a code block, e.g.:
std::cout << "This is a test" << std::endl; 

In case you are not familiar with RooFit, you can have a look at the RooFit documentation and tutorials.

Please work on your own code! It is not necessary to reproduce exactly the code posted here, which is provided just for reference and debugging purposes. Indeed, many improvements can be introduced, and different programming languages can be used.

Data and MC samples

In task #1 we provide (very) small sets of data and MC events for the normalization and control channels, i.e. $B^+\to J/\psi K^+$ and $B_s\to J/\psi \phi$ events respectively. They can be used in the exercise for reconstructing B meson candidates.

Starting from task #2, we are going to build several maximum likelihood fitters based on a collection of flat trees, which were reproduced/simplified/compressed from the official BMM4/BMM5 analysis. The files are already copied to the server and available (on lxplus) at:

ls /eos/user/c/cmsdas/2024/long-ex-bph/
bdmmMc.root         bmmSoup05.root    bspsiphiData.root  effyield.csv
bdpimumuMcBg.root   bmmSoup10.root    bspsiphiMc.root    miniaod_bspsiphi.root
bdpimunuMcBg.root   bmmSoup15.root    bstohhMcBg.root    miniaod_bupsik.root
bdtohhMcBg.root     bmmSoup20.root    bupimumuMcBg.root  miniaod_run2016b.root
bmmData-blind.root  bskmunuMcBg.root  bupsikData.root    pickevents_2016_2ps_wgt1.root
bmmData.root        bsmmMc.root       bupsikMc.root

A summary of all the files is given below.

The files in standard CMS format (miniAOD or AOD) will be used for task #1 and part of task #7:

Filename Process Comment
miniaod_bupsik.root $B^+ \to J/\psi K^+$ 2K $J/\psi K^+$ events in miniAOD format
miniaod_bspsiphi.root $B_s \to J/\psi \phi$ 2K $J/\psi \phi$ events in miniAOD format
miniaod_run2016b.root "Charmonium" data from 2016B 10K events in miniAOD format
pickevents_2016_2ps_wgt1.root $B_s\to\mu^+\mu^-$ candidates picked data events in AOD format
The flat trees will be used from task #2 on:

Filename Process Comment
bsmmMc.root $B_s \to \mu^+\mu^-$ Signal MC
bdmmMc.root $B^0 \to \mu^+\mu^-$ Signal/peaking background MC
bstohhMcBg.root $B_s \to K^+K^-/K^-\pi^+/\pi^+\pi^-$ Rare/peaking background MC
bdtohhMcBg.root $B^0 \to K^+K^-/K^-\pi^+/\pi^+\pi^-$
bskmunuMcBg.root $B_s \to K^+\mu^-\nu$ Semileptonic background MC
bdpimunuMcBg.root $B^0 \to \pi^+\mu^-\nu$
bdpimumuMcBg.root $B^0 \to \pi^0\mu^+\mu^-$
bupimumuMcBg.root $B^+ \to \pi^+\mu^+\mu^-$
bmmSoup05.root data-like MC soup mixture $\mathcal{B}(B_s \to \mu^+\mu^-) = 0.5 \cdot 3.66 \times 10^{-9}$
bmmSoup10.root data-like MC soup mixture $\mathcal{B}(B_s \to \mu^+\mu^-) = 1.0 \cdot 3.66 \times 10^{-9}$
bmmSoup15.root data-like MC soup mixture $\mathcal{B}(B_s \to \mu^+\mu^-) = 1.5 \cdot 3.66 \times 10^{-9}$
bmmSoup20.root data-like MC soup mixture $\mathcal{B}(B_s \to \mu^+\mu^-) = 2.0 \cdot 3.66 \times 10^{-9}$
bmmData-blind.root Real data (blind) $5.2<M(\mu\mu)<5.45$ GeV are excluded
bmmData.root Real data (full) will be provided only towards the end of the exercise
bupsikData.root $B^+ \to J/\psi(\to \mu^+\mu^-) K^+$ Normalization channel, data
bupsikMc.root Normalization channel, MC
bspsiphiData.root $B_s \to J/\psi(\to \mu^+\mu^-) \phi(\to K^+K^-)$ Control channel, data
bspsiphiMc.root Control channel, MC
The format of the flat trees is given below:

Variable Format Meaning
run uint32_t run number
evt uint64_t event number
ls uint32_t lumi section
cate uint32_t category index (0-7)
wgt float event prescale/weight
bdt float event BDT score
pt float $p_T$ of the candidate [GeV]
eta float $\eta$ of the candidate
phi float $\phi$ of the candidate
m float invariant mass [GeV]
me float per-event mass resolution [GeV]
t float decay time [ps]
te float per-event decay time resolution [ps]
gt float true decay time at generator [ps]
A direct check can be performed like this:
root bsmmMc.root
root [1] bsmmMc->Scan("*")
************************************************************************************************************************
* Row * run.run * evt.evt  * ls.ls * cate.cate * wgt.wgt * bdt.bdt   * pt.pt     * eta.eta   * phi.phi   * m.m       * me.me     * t.t       * te.te     * gt.gt     *
************************************************************************************************************************
*   0 *       1 * 35458896 *  5182 *         1 *       1 * 0.9979488 * 13.765307 * 0.7137690 * 0.0989341 * 5.4061279 * 0.0435446 * 2.5570764 * 0.0923385 * 2.6121985 *
*   1 *       1 * 35459383 *  5182 *         1 *       1 * 0.9977474 * 12.912679 * -1.103692 * 2.9579954 * 5.3832445 * 0.0504641 * 5.0555057 * 0.1244605 * 5.0800714 *
*   2 *       1 * 35459885 *  5182 *         1 *       1 * 0.9970942 * 12.452368 * -1.215893 * 0.1227818 * 5.3627362 * 0.0446646 * 2.7904653 * 0.1141994 * 2.6338570 *
*   3 *       1 * 35460161 *  5182 *         1 *       1 * 0.9992429 * 14.459903 * 0.8303663 * -2.311928 * 5.2686963 * 0.0421808 * 3.856426  * 0.0734309 * 3.8440439 *
*   4 *       1 * 35461440 *  5182 *         1 *       1 * 0.9307835 * 13.047897 * 1.0679357 * 1.7753243 * 5.4436235 * 0.0452461 * 2.3172597 * 0.0922169 * 2.1170656 *

All of the events are already required to pass a set of preselections, including the BDT-based muon ID (>0.45 for both muons) and an event BDT score >0.1.

The category index serves as the channel number and is defined as follows:

Category index Year/era Detector region CM Energy Lumi
0 2016BF Central ($\eta^f<0.7$) 13 TeV 20.0/fb
1 2016BF Forward ($0.7<\eta^f<1.4$) 13 TeV 20.0/fb
2 2016GH Central ($\eta^f<0.7$) 13 TeV 16.6/fb
3 2016GH Forward ($0.7<\eta^f<1.4$) 13 TeV 16.6/fb
4 2017 Central ($\eta^f<0.7$) 13 TeV 40.2/fb
5 2017 Forward ($0.7<\eta^f<1.4$) 13 TeV 40.2/fb
6 2018 Central ($\eta^f<0.7$) 13 TeV 61.3/fb
7 2018 Forward ($0.7<\eta^f<1.4$) 13 TeV 61.3/fb
The Central or Forward regions are defined by the pseudorapidity of the most forward muon of the candidate ($\eta^f$).

The efficiencies for some of the processes and the expected yields at the target luminosity after the preselections (including the BDT>0.1 cut and muon ID) are given in the effyield.csv file. The associated systematic uncertainties are also provided. The content of the csv file reads like:

Category Eff_bsmmMc dEff_bsmmMc ... N_bsmmMc dN_bsmmMc ...
0 0.003209738 1.419501e-05 ... 13.34416 0.5941475 ...
1 0.005369033 1.836274e-05 ... 24.43867 1.086063 ...
The Eff_bsmmMc is the efficiency times acceptance of the $B_s \to \mu^+\mu^-$ decay in each category, and the dEff_bsmmMc is the associated systematic uncertainty; likewise N_bsmmMc and dN_bsmmMc are the expected yield (and its uncertainty) at the target luminosity, based on the standard model predicted branching fraction ($\mathcal{B}(B_s \to \mu^+\mu^-) = 3.66 \times 10^{-9}$). Similar information can be found for other processes as well. The efficiencies are only provided for the $B_s \to \mu^+\mu^-$, $B^0 \to \mu^+\mu^-$, $B^+ \to J/\psi K^+$, and $B_s \to J/\psi \phi$ decays.

In the following code, the content of the given csv file can be accessed through the array provided in the common.h file, i.e.

enum { Category, Eff_bsmmMc, dEff_bsmmMc, ..., N_Columns };
const int N_Categories = 8;
const double effyield[N_Categories][N_Columns] = {
    {0, 0.003209738, 1.419501e-05, ...},
    {1, 0.005369033, 1.836274e-05, ...},
    ...};

So one can easily access this information via the effyield array, e.g. effyield[4][N_bsmmMc] will give the expected $B_s \to \mu^+\mu^-$ yield in the 2017 central region.

EXERCISE: How do categories help in improving sensitivity? Show that the combined sensitivity is always better than or equal to the merged sensitivity. You can take the simple example of a counting experiment with Poisson statistics (i.e. significance $= s/\sqrt{b}$ for known $b$), assuming only statistical errors and no correlated systematic errors.
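As a hint, the inequality can be checked numerically in a few lines (a minimal sketch with made-up yields; the numbers are purely illustrative and not part of the exercise inputs):

    // toy check: two categories with assumed signal/background counts
    double s[2] = {10., 10.};
    double b[2] = {100., 1000.};
    // merged: a single counting experiment with the summed yields
    double Z_merged = (s[0]+s[1])/sqrt(b[0]+b[1]);               // ~0.60
    // combined: per-category significances added in quadrature
    double Z_combined = sqrt(s[0]*s[0]/b[0] + s[1]*s[1]/b[1]);   // ~1.05
    cout << "merged: " << Z_merged << ", combined: " << Z_combined << endl;

The combined significance can never be worse, since by the Cauchy-Schwarz inequality $\sum_i s_i^2/b_i \ge (\sum_i s_i)^2/\sum_i b_i$.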

Task #1: Reconstruction of B mesons

Our very first task is to play with one of the standard tools to reconstruct B mesons (note: not the actual analyzer used for BMM4, which is unfortunately far too complicated for the purpose of CMSDAS!). In this twiki we will just provide the exact commands to run on a small set of MC and data events, and explain what you will obtain in the output files. For more information it is recommended to go through the BPHReco_V2 twiki page. The 3 miniAOD files listed above are going to be used in this task.

First, access lxplus9 as

ssh -XY <username>@lxplus.cern.ch

and then set up the environment, i.e.

cd
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
git clone https://github.com/ronchese/BPHAnalysis
cd BPHAnalysis
git checkout V06-00-18
cd ../
scram b

Now you can browse around and compare the code with the documentation, etc. Then we can use the example analyzer provided by the tool to process some $J/\psi K^+$ and/or $J/\psi \phi$ MC events. Here are the exact commands:

cd BPHAnalysis/RecoDecay/test
emacs -nw cfg_mini.py   # or any other editor you like

Let's change the following settings: first, run on more events instead of just 1K:

process = cms.Process("bphAnalysis")

process.maxEvents = cms.untracked.PSet( input = cms.untracked.int32(-1) )
#process.maxEvents = cms.untracked.PSet( input = cms.untracked.int32(1000) )

process.load("Configuration.Geometry.GeometryRecoDB_cff")

Second, set the input file:

process.source = cms.Source("PoolSource",fileNames = cms.untracked.vstring(
    'file:/eos/user/c/cmsdas/2024/long-ex-bph/miniaod_bupsik.root'
))

Then one can just process the sample with the cmsRun command:

cmsRun cfg_mini.py

You may see lots of "BPHDecayVertex::tTracks: no track for reco::(PF)Candidate" messages, but this is fine (not an error). The outputs are stored in two files: the text file dump_mini.txt (a dump of the reconstruction information) and a root file hist_mini.root, containing some predefined histograms. The beginning of dump_mini.txt reads like this:

TestBPHRecoDecay::beginJob
--------- event 1 / 352464209 ---------
1160 pcCands found
4 muons found
build and dump JPsi
0 JPsi cand found
build and dump Phi
22 Phi cand found
****** Phi cand mass: 1.03452 momentum -1.64548 1.23153 8.91898
****** Phi constr mass: 1.04879 momentum -1.6471 1.23816 8.99651
****** Phi vertex: 0 1 - 1521.3 1 0 - 1.38869 -0.876096 -4.29726
daughter 0 KNeg - momentum: -0.954123 0.526696 4.5247 - at vertex: -0.944963 0.542958 4.5247
daughter 1 KPos + momentum: -0.691362 0.704833 4.39428 - at vertex: -0.704663 0.691535 4.39428
kin fit vertex: 1.3887 -0.876097 -4.29726
daughter 0 refitted: -0.641582 0.604058 4.48505
daughter 1 refitted: -1.00552 0.634103 4.51146
****** Phi cand mass: 1.03919 momentum -3.65517 1.11109 7.99432
****** Phi constr mass: 2.16498 momentum -26.3609 7.24158 52.5885
......

In the root file you may find rather nice reconstructed $J/\psi$ and $B^+$ mass distributions, e.g.
MassJpsi.png MCSTBu.png
Note there are two different versions of the $B^+$ invariant mass distribution: "massBu" is the raw reconstructed mass, while "mcstBu" is the reconstructed mass after constraining the $J/\psi$ mass to its known value. Naturally "mcstBu" shows a much narrower distribution, since the energy loss is "corrected" by the mass constraint.

EXERCISE: try the example job as above, then change the input file from the $J/\psi K^+$ MC to the $J/\psi \phi$ MC, or even the data file, and see what can be obtained

Note: for running on real data, the global tag should be changed from auto:run2_mc to auto:run2_data accordingly:

from Configuration.AlCa.GlobalTag_condDBv2 import GlobalTag
process.GlobalTag = GlobalTag(process.GlobalTag, 'auto:run2_data', '')

If you try to process some of the data events (it will take a while since the whole file contains 10K events; one can run just the first 2K to check a little bit), the output histograms will not show peaks as clear as for the MC samples, due to the large combinatorial background. (Yes, one has to be aware of the "dirty" environment of proton collisions!) Let's try to tighten the selection cuts before filling the histograms.

Please try to implement the following modifications to the code BPHAnalysis/RecoDecay/test/stubs/TestBPHRecoDecay.cc:

......
    KaonChargeSelect tkPos( +1 );
    KaonChargeSelect tkNeg( -1 );
    KaonPtSelect tkPt( 1.0 ); // change from 0.7 GeV to 1.0 GeV for the pt threshold of the kaons
    string kPos = "KPos";
    string kNeg = "KNeg";
......
    bBs.filter( "JPsi", mJPsi );
    bBs.filter(  "Phi",  mPhi );
    Chi2Select chi2Bs( 0.10 ); // change the Bs vertex probability threshold from 2% to 10%
    bBs.filter( chi2Bs );
......
    bBu.filter( "JPsi", mJPsi );
    bBu.filter( "Kaon", tkPt );
    bBu.filter( "Kaon", knv );
    Chi2Select chi2Bu( 0.10 ); // change the B+ threshold as well
    bBu.filter( chi2Bu );
......

In fact just raising the cut thresholds may not be enough; let's "hack" the class for the vertex probability selection (the Chi2Select class) and add an additional cut to require the candidates to be displaced:

......
    virtual bool accept( const BPHDecayVertex& cand ) const {
      const reco::Vertex& v = cand.vertex();
      if ( v.isFake() ) return false;
      if ( !v.isValid() ) return false;
      if ( sqrt(v.x()*v.x()+v.y()*v.y())<0.15 ) return false; // add a cut on transverse flight distance > 0.15 cm
      return ( TMath::Prob( v.chi2(), lround( v.ndof() ) ) > mProb );
    }
......

Now one can compile and run it again:

scram b
cmsRun cfg_mini.py

After processing 10K data events (all events in miniaod_run2016b.root), a "hint" of B hadron mass peak can be observed:
task1_3.png

EXERCISE: apply the modifications to the code mentioned above and verify the same "hint"

EXERCISE: look up a few more data files (Charmonium PD, miniAOD format) in the DAS database and process them

The $B^+ \to J/\psi K^+$ events are easy to see due to their clear signature (narrow $J/\psi$ peak, displaced vertex due to the long-lived B meson, and a relatively large branching fraction of 0.1%). This will not be the case for our target $B_s \to \mu^+\mu^-$, whose branching fraction is as small as $10^{-9}$. In the following tasks/exercises, we are going to use pre-processed flat trees instead of running the reconstruction jobs directly, and we will prepare a simplified maximum likelihood fitter to extract the target branching fraction.

Task #2: Yield extractions with fits to normalization/control channels

We are going to demonstrate how to construct a maximum likelihood fitter to extract the yields of the normalization channel ($B^+ \to J/\psi K^+$) and control channel ($B_s \to J/\psi \phi$) decays step-by-step.

Task #2-1: Fit to $B^+ \to J/\psi K^+$ MC

The first task is to read the $B^+ \to J/\psi K^+$ MC sample and construct a simple signal PDF. One can process the root file as in the following code: first create an observable m, representing the invariant mass of the candidate, then read the TTree from the root file and create the corresponding RooFit data set rds_mc with only two columns (mass and category):

{
    using namespace RooFit;

  unsigned int cate_t = 0;
  RooRealVar m("m","",5.0,5.8);
  RooRealVar cate("cate","",-1,20);

  TFile *fin = new TFile("/eos/user/c/cmsdas/2024/long-ex-bph/bupsikMc.root");
  TTree *tin = (TTree*)fin->Get("bupsikMc");

  RooDataSet* rds_mc   = new RooDataSet("rds_mc", "rds_mc", tin,  RooArgSet(cate, m));

  rds_mc = (RooDataSet*)rds_mc->reduce(m, Form("m >= 5.0 && m <= 5.8 && cate==%d", cate_t));
  delete fin;

Through the "reduce" method of RooDataSets, we select one specific category (in this case, we select the 0th, i.e. 2011/central).

Then, one can start to construct a double-Gaussian model, and perform a fit to the loaded sample:

......
    RooRealVar sig_mean1("sig_mean1","",5.28,5.2,5.4);
    RooRealVar sig_mean2("sig_mean2","",5.28,5.2,5.4);
    RooRealVar sig_sigma1("sig_sigma1","",0.030,0.005,0.060);
    RooRealVar sig_sigma2("sig_sigma2","",0.080,0.040,0.200);
    RooRealVar sig_frac("sig_frac","",0.9,0.5,1.0);
    RooGaussian sig_g1("sig_g1","",m,sig_mean1,sig_sigma1);
    RooGaussian sig_g2("sig_g2","",m,sig_mean2,sig_sigma2);
    RooAddPdf pdf_sig("pdf_sig","",RooArgList(sig_g1,sig_g2),RooArgList(sig_frac));

    pdf_sig.fitTo(*rds_mc);
}

In fact the code snippet given above should already produce a working fit. The results of the fit, as returned by Minuit, are:

  NO.   NAME      VALUE            ERROR       STEP SIZE       VALUE
   1  sig_frac     9.26028e-01   1.79343e-03   9.77751e-05   7.81170e-01
   2  sig_mean1    5.27971e+00   5.56665e-05   3.90503e-05  -2.04339e-01
   3  sig_mean2    5.26646e+00   8.15478e-04   1.14711e-04  -3.42061e-01
   4  sig_sigma1   1.56011e-02   5.44736e-05   2.82384e-05  -1.09294e+02
   5  sig_sigma2   6.27416e-02   7.24608e-04   1.41262e-04  -2.34392e+00

Of course it is difficult to check the goodness of the fit just by looking at the fitted parameters. It is much better to produce a projection plot along the main observable (the mass m) and overlay the fitted curve on top of it. This can be carried out by adding the following piece of code after the call to the fitTo() function:

......
    RooPlot *frame = m.frame();
    rds_mc->plotOn(frame);
    pdf_sig.plotOn(frame);
    
    TCanvas* canvas = new TCanvas("canvas", "", 600, 600);
    frame->Draw();

With some simple decoration work one should be able to obtain a plot like this:
task_2_1.png

EXERCISE: work out the code, perform the fit, add legend/decorations to the resulting projection plot

EXERCISE: fit the events from other categories and/or in a different phase-space region (i.e. different $p_T$ thresholds).

Task #2-2: Fit to $B^+ \to J/\psi K^+$ data

The next step is to perform the fit on real data events. The code is similar to the previous one, but you need to replace the input MC tree with the data tree, as given below. In this task we use an explicit for-loop instead of the "reduce" method of RooFit. The two approaches are basically the same; users can choose whichever they prefer.

......
    RooRealVar m("m","",5.0,5.8);
    RooRealVar wgt("wgt","",1.,0.,1000.);
    RooDataSet *rds_data = new RooDataSet("rds_data","",RooArgSet(m,wgt),"wgt");
    
    TFile *fin = new TFile("/eos/user/c/cmsdas/2024/long-ex-bph/bupsikData.root");
    TTree *tin = (TTree*)fin->Get("bupsikData");
    
    unsigned int cate_t;
    float wgt_t,m_t;
    tin->SetBranchAddress("cate",&cate_t);
    tin->SetBranchAddress("wgt",&wgt_t);
    tin->SetBranchAddress("m",&m_t);
    
    for(int evt=0; evt<tin->GetEntries(); evt++) {
        tin->GetEntry(evt);
        if (cate_t!=0) continue;
        if (m_t<5.0 || m_t>=5.8) continue;
        m.setVal(m_t);
        wgt.setVal(wgt_t);
        rds_data->add(RooArgSet(m,wgt),wgt_t);
    }
    delete fin;

The event weight wgt is necessary since the triggers were, in fact, prescaled during the 2016 data taking. Question: do you know what "prescaling a trigger" means?

There are also some additional corrections to account for the known efficiency losses due to the tracker HIP problem (basically, the efficiency was lower at higher pileup). All of these effects can be corrected by introducing the proper weights when processing data events.

Based on the fit to MC we have already constructed the PDF model for the signal. The data also contain background contributions, though; therefore we need to add further functions to model the backgrounds: for example, an exponential function can be used to model the combinatorial background, and a (complementary) error function erfc for the $J/\psi$+X background on the left-hand side. Question: can you guess how the function for the $J/\psi$+X background has been chosen, and how it can be better studied?

The joint model is then constructed by adding all three components together with the corresponding yields (a.k.a. extended PDF).

/// From task2_1.C  You can try to fix the signal parameters and fit releasing the background parameters
//  NO.   NAME      VALUE            ERROR       STEP SIZE       VALUE
//   1  sig_frac     9.26028e-01   1.79343e-03   9.77751e-05   7.81170e-01
//   2  sig_mean1    5.27971e+00   5.56665e-05   3.90503e-05  -2.04339e-01
//   3  sig_mean2    5.26646e+00   8.15478e-04   1.14711e-04  -3.42061e-01
//   4  sig_sigma1   1.56011e-02   5.44736e-05   2.82384e-05  -1.09294e+02
//   5  sig_sigma2   6.27416e-02   7.24608e-04   1.41262e-04  -2.34392e+00
 ///


......
    RooRealVar sig_mean1("sig_mean1","",5.27971);
    RooRealVar sig_mean2("sig_mean2","",5.26646);
    RooRealVar sig_sigma1("sig_sigma1","",0.0156007);
    RooRealVar sig_sigma2("sig_sigma2","",0.0627436);
    RooRealVar sig_frac("sig_frac","",0.926022);
    RooGaussian sig_g1("sig_g1","",m,sig_mean1,sig_sigma1);
    RooGaussian sig_g2("sig_g2","",m,sig_mean2,sig_sigma2);
    RooAddPdf pdf_sig("pdf_sig","",RooArgList(sig_g1,sig_g2),RooArgList(sig_frac));
    
    RooRealVar comb_coeff("comb_coeff","",-1.2,-10.,10.);
    RooExponential pdf_comb("pdf_comb","",m,comb_coeff);
    
    RooRealVar jpsix_scale("jpsix_scale","",0.02,0.001,0.08);
    RooRealVar jpsix_shift("jpsix_shift","",5.13,5.12,5.16);
    RooGenericPdf pdf_jpsix("pdf_jpsix","","TMath::Erfc((@0-@1)/@2)",RooArgList(m,jpsix_shift,jpsix_scale));
    
    RooRealVar n_sig("n_sig","",100000,0.,1E8);
    RooRealVar n_comb("n_comb","",80000,0.,1E6);
    RooRealVar n_jpsix("n_jpsix","",20000,0.,1E5);
    RooAddPdf model("model","",RooArgList(pdf_sig,pdf_comb,pdf_jpsix),RooArgList(n_sig,n_comb,n_jpsix));
    
    model.fitTo(*rds_data, Extended(true), SumW2Error(true));

The fit to data with the joint PDF model should be carried out with the option SumW2Error(true), given that the events can be weighted (even though this is not the case for the 0th category at this moment). The result of the fit is shown below:
task_2_2.png
Although the fit does not look bad at all, there is an obvious issue with the fixed signal shape, which seems to be slightly off with respect to the data points!

EXERCISE: construct the fit as explained above, produce the corresponding projection plot

EXERCISE: release the parameters of the signal model in the fit, and see if the fit quality improves or not

Task #2-3: Fit to $B^+ \to J/\psi K^+$ data (improved)

Surely one can release the signal parameters in the fit and resolve the small difference between the data and MC peaks, but this may not always work if the peak is very wide. If you apply the same procedure to the forward category (for example the 1st category instead of the 0th), the modeling becomes difficult. In order to improve the fit, we can instead introduce a "correction" to the MC-driven model and float only those corrections in the fit to data. In this way the basic signal shape is still defined by the MC events, but the actual mean and resolution can be different in data. The trick is as follows:

......
    RooRealVar sig_shift("sig_shift","",0.,-0.02,0.02);
    RooRealVar sig_scale("sig_scale","",1.,0.8,1.2);
    
    RooRealVar sigmc_mean1("sigmc_mean1","",5.27971);
    RooRealVar sigmc_mean2("sigmc_mean2","",5.26646);
    RooRealVar sigmc_sigma1("sigmc_sigma1","",0.0156007);
    RooRealVar sigmc_sigma2("sigmc_sigma2","",0.0627436);

    RooAddition sig_mean1("sig_mean1","",RooArgList(sigmc_mean1,sig_shift));
    RooAddition sig_mean2("sig_mean2","",RooArgList(sigmc_mean2,sig_shift));
    RooProduct sig_sigma1("sig_sigma1","",RooArgList(sigmc_sigma1,sig_scale));
    RooProduct sig_sigma2("sig_sigma2","",RooArgList(sigmc_sigma2,sig_scale));
    RooRealVar sig_frac("sig_frac","",0.926022);
    RooGaussian sig_g1("sig_g1","",m,sig_mean1,sig_sigma1);
    RooGaussian sig_g2("sig_g2","",m,sig_mean2,sig_sigma2);
    RooAddPdf pdf_sig("pdf_sig","",RooArgList(sig_g1,sig_g2),RooArgList(sig_frac));

Instead of using the MC-driven Gaussian means and sigmas directly, one applies a shift correction sig_shift to the means and a scale correction sig_scale to the resolutions, and performs the fit to data. With such an improvement the fit projection shows a better agreement (look at the data points around the signal peak). Question: do you think that, in order to evaluate the goodness of the fit, it would be useful to look at the distribution of the pulls between the fit model and the data? If yes, you can try to add it when doing the exercise (suggestion: use the dedicated RooFit method; a minimal sketch follows).
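As a hint for the pull-distribution question, a minimal sketch using RooFit's RooPlot::pullHist() method (which computes (data-fit)/error from the last data set and curve added to the frame) could read:

......
    // after rds_data->plotOn(frame) and model.plotOn(frame):
    RooHist *hpull = frame->pullHist(); // pulls of the last plotted data/curve pair

    RooPlot *frame_pull = m.frame();
    frame_pull->addPlotable(hpull,"P");

    TCanvas *canvas_pull = new TCanvas("canvas_pull","",600,300);
    frame_pull->Draw();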


task_2_3.png
Also, the fit result shows

   7  sig_scale    9.95578e-01   1.55972e-03   1.45973e-02  -2.21099e-02
   8  sig_shift   -1.38291e-03   2.63982e-05   2.63303e-03  -6.92006e-02

which means the $J/\psi K^+$ signal in data is about 0.5% narrower and shifted to the left by 1.4 MeV, compared to the shape from MC.

EXERCISE: implement the fit described above

Task #2-4: Fit to $B^+ \to J/\psi K^+$ data (improved again)

Now we have managed to come up with a fairly good fit to the $J/\psi K^+$ candidates. Note that we have ignored a relatively small background component, the $B^+ \to J/\psi \pi^+$ decay: its branching fraction is around 4% of the $J/\psi K^+$ one, and its distribution is much wider due to the wrong mass assignment in the reconstruction. Nevertheless, this is an exercise, so we do not want to make the modeling too complicated.

However, there is one more thing we can still improve: one can run the fits to MC and to data sequentially within one macro, given that the fit to data requires the MC-driven signal model. E.g. one can read in both MC and data events, perform a fit to the MC events to determine sigmc_mean1/2, sigmc_sigma1/2, and sig_frac first, then fix them, apply the mean/width corrections, and perform the fit to data afterwards.

EXERCISE: construct the code to perform MC and data fits in sequence

Task #2-5: Fit to $B_s \to J/\psi \phi$ events

Another important decay, $B_s \to J/\psi \phi$, is the control channel of the analysis. Surely we shall come up with a code to perform the fit and extract the yields as well! Basically the only differences are the mass of the $B_s$ meson, which is slightly heavier (5.367 GeV), and the absence of the $J/\psi$+X component on the left-hand side. Hence it is, in fact, easier than for the $J/\psi K^+$ events. One can come up with a fit with only signal and combinatorial background, as shown below
task_2_5a.png task_2_5b.png
Surely the data and MC trees should be replaced by bspsiphiData.root and bspsiphiMc.root!

Again, we are ignoring a smaller (and wider) component in the fit, $B^0 \to J/\psi K^{*0}(\to K^+\pi^-)$, where the pion is mis-assigned the kaon mass. But the fit already does not look too bad.

EXERCISE: convert the code(s) to work on $B_s \to J/\psi \phi$ decays

Task #3: Verify fs/fu dependence

As discussed in the introductory lecture, the ratio of fragmentation fractions, $f_s/f_u$, is an external input to this analysis. The experimental situation is not very clear: LHCb observes a non-trivial $p_T$ dependence (ref. PRD 100, 031102), while ATLAS does not (ref. PRL 115, 262001). Also, there is no obvious dependence on $\eta$. It would be nice to check this in the exercise as well!

The key idea is to measure the production ratio between $B_s \to J/\psi \phi$ (as the $f_s$ part) and $B^+ \to J/\psi K^+$ (as the $f_u$ part). One has to measure the yields for these two processes (in a particular phase-space, or $p_T$/$\eta$ region), correct for their efficiencies, and correct for their branching fractions. Note that such a ratio study cannot be a pure/absolute measurement of $f_s/f_u$, since the known branching fraction of $B_s \to J/\psi \phi$ itself depends on $f_s/f_u$, i.e. one cannot use the fraction in the study and at the same time measure it. But a relative study (of the dependence on $p_T$/$\eta$) can be performed. Note that the current PDG estimate of the fraction is $f_s/f_u = 0.252 \pm 0.012$, from an average of the LHCb and ATLAS 7 TeV results.

All we need to do is to recycle the code we produced for task #2. First one shall define a particular phase-space region for the study at the beginning of the code, e.g.

......
    uint cate = 2;
    double pt_min = 20.0;
    double pt_max = 30.0;
    double eta_min = 0.;
    double eta_max = 2.5;

Let's pick category index = 2, which is the 2016GH/central region and is the single category with the highest statistics. The $p_T$ bin between 20 and 30 GeV and the full $\eta$ region are selected here. The piece of code for reading the data should be updated as follows:

......
    uint cate_t;
    float wgt_t,m_t,pt_t,eta_t;
    tin->SetBranchAddress("cate",&cate_t);
    tin->SetBranchAddress("wgt",&wgt_t);
    tin->SetBranchAddress("m",&m_t);
    tin->SetBranchAddress("pt",&pt_t);
    tin->SetBranchAddress("eta",&eta_t);
    
    for(int evt=0; evt<tin->GetEntries(); evt++) {
        tin->GetEntry(evt);
        if (cate_t!=cate) continue;
        if (m_t<5.0 || m_t>=5.8) continue;
        if (pt_t<pt_min || pt_t>=pt_max) continue;
        if (fabs(eta_t)<eta_min || fabs(eta_t)>=eta_max) continue;
        m.setVal(m_t);
        wgt.setVal(wgt_t);
        rds_data->add(RooArgSet(m,wgt),wgt_t);
    }
    delete fin;

Now the addresses of the pt and eta branches are assigned, and the selection requirements are introduced in the event loop as given above. A very similar fix has to be applied to the MC reading code as well:

......
    tin->SetBranchAddress("cate",&cate_t);
    tin->SetBranchAddress("m",&m_t);
    tin->SetBranchAddress("pt",&pt_t);
    tin->SetBranchAddress("eta",&eta_t);
    
    int evtcount[2] = {0,0};
    for(int evt=0; evt<tin->GetEntries(); evt++) {
        tin->GetEntry(evt);
        if (cate_t!=cate) continue;
        evtcount[0]++;
        if (m_t<5.0 || m_t>=5.8) continue;
        if (pt_t<pt_min || pt_t>=pt_max) continue;
        if (fabs(eta_t)<eta_min || fabs(eta_t)>=eta_max) continue;
        evtcount[1]++;
        m.setVal(m_t);
        rds_mc->add(RooArgSet(m));
    }
    double eff = (double)evtcount[1]/(double)evtcount[0];
    double eff_error = sqrt(eff*(1.-eff)/(double)evtcount[0]);
    delete fin;

Another ingredient of the measurement is the efficiency of the phase-space requirement; hence a simple estimate based on counting (evtcount[]) the signal MC events is carried out. At the end of the code we can print out a summary of the fit results:

......
    cout << "Category: " << cate << endl;
    cout << "pt range: " << pt_min << ", " << pt_max << endl;
    cout << "|eta| range: " << eta_min << ", " << eta_max << endl;
    cout << "Selection efficiency: " << eff << " +- " << eff_error << endl;
    cout << "Observed yield: " << n_sig.getVal() << " +- " << n_sig.getError() << endl;

By applying the same updates for both $J/\psi K^+$ and $J/\psi \phi$ events, one can obtain the results of the fit as following:

(for $J/\psi K^+$)

Category: 2
pt range: 20, 30
|eta| range: 0, 2.5
Selection efficiency: 0.32762 +- 0.0014842
Observed yield: 106740 +- 782.594

(for $J/\psi \phi$)

Category: 2
pt range: 20, 30
|eta| range: 0, 2.5
Selection efficiency: 0.39079 +- 0.00154296
Observed yield: 7593.83 +- 207.071

The selection efficiency reported above accounts only for the phase-space requirement. An important factor, the efficiency of the preselections (including the muon ID, the BDT cut, etc.), has to be taken from the given effyield.csv file. These "base" efficiencies for $J/\psi K^+$ and $J/\psi \phi$ in category index 2 are 0.001267 and 0.0005476, respectively.

The fraction $f_s$ is proportional to $N(J/\psi \phi)/\epsilon_{\rm ps}(J/\psi \phi)/\epsilon_{\rm base}(J/\psi \phi)/\mathcal{B}(B_s\to J/\psi \phi \to \mu\mu KK)$, and $f_u$ is proportional to $N(J/\psi K^+)/\epsilon_{\rm ps}(J/\psi K^+)/\epsilon_{\rm base}(J/\psi K^+)/\mathcal{B}(B^+\to J/\psi K^+\to\mu\mu K)$. The $\epsilon_{\rm ps}$ is the efficiency for the phase-space cut, and $\epsilon_{\rm base}$ is the base efficiency mentioned above. The only missing factors are the decay branching fractions: $\mathcal{B}(B_s\to J/\psi \phi) = (1.08\pm0.08)\times 10^{-3}$, $\mathcal{B}(B^+\to J/\psi K^+) = (1.010\pm0.028)\times 10^{-3}$, and $\mathcal{B}(\phi \to K^+K^-) = 49.2 \pm 0.5\%$.

By injecting all of the numerical values above, the $f_s/f_u$ fraction in the selected phase space (20<$p_T$<30 GeV) can be evaluated as

$f_s/f_u = 0.262 \pm 0.008 \pm 0.021$,

where the first uncertainty is the uncorrelated error (e.g. statistical uncertainties from the fit and the MC statistics), and the second is the common error (e.g. from the decay branching fractions). Note that we have ignored the systematic uncertainties provided in the effyield.csv table, since they are basically dominated by the uncertainties on the branching fractions.
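As a cross-check of the arithmetic, the inputs quoted above can be combined in a few lines (a minimal sketch; all numbers are copied from the fits and branching fractions given in this section):

    // fs/fu cross-check from the quoted numbers (category 2, 20<pt<30 GeV)
    double N_phi = 7593.83,   dN_phi = 207.071;        // Bs -> J/psi phi yield
    double e_ps_phi = 0.39079, de_ps_phi = 0.00154296; // phase-space efficiency
    double e_base_phi = 0.0005476;                     // base efficiency (csv)
    double BF_phi = 1.08e-3*0.492;                     // B(Bs->J/psi phi) x B(phi->KK)

    double N_k = 106740.,   dN_k = 782.594;            // B+ -> J/psi K+ yield
    double e_ps_k = 0.32762, de_ps_k = 0.0014842;
    double e_base_k = 0.001267;
    double BF_k = 1.010e-3;                            // B(J/psi->mumu) cancels in the ratio

    double fsfu = (N_phi/e_ps_phi/e_base_phi/BF_phi)/(N_k/e_ps_k/e_base_k/BF_k);

    // uncorrelated part (fit + MC statistics), added in quadrature
    double rel_unc = sqrt(pow(dN_phi/N_phi,2)+pow(dN_k/N_k,2)
                         +pow(de_ps_phi/e_ps_phi,2)+pow(de_ps_k/e_ps_k,2));
    // common part (branching fraction uncertainties)
    double rel_com = sqrt(pow(0.08/1.08,2)+pow(0.028/1.010,2)+pow(0.5/49.2,2));

    cout << fsfu << " +- " << fsfu*rel_unc << " +- " << fsfu*rel_com << endl;
    // prints roughly 0.262 +- 0.008 +- 0.021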

EXERCISE: verify the fits and calculations above

EXERCISE: verify $p_T$ and $\eta$ dependence of $f_s/f_u$

For example, evaluate $f_s/f_u$ in the following phase-space regions (fill the table accordingly; you can fine-tune the binning as well). Also consider producing plots (such as $f_s/f_u$ versus $p_T$ and $f_s/f_u$ versus $\eta$).

pt range $f_s/f_u$ abs(eta) range $f_s/f_u$
10-12 GeV x.xxx +- y.yyy +- z.zzz 0.0-0.2 x.xxx +- y.yyy +- z.zzz
12-14 GeV ... 0.2-0.4 ...
14-16 GeV ... 0.4-0.6 ...
16-18 GeV ... 0.6-0.8 ...
18-20 GeV ... 0.8-1.0 ...
20-24 GeV ... 1.0-1.2 ...
24-30 GeV ... 1.2-1.6 ...
30-40 GeV ...    
40-60 GeV ...    
60-120 GeV ...    

Task #4: Construct signal & background models

model_components_1.png
Now we shall move on to construct the PDF models for the actual target, the $B_s \to \mu^+\mu^-$ decay. The steps are similar to the fits to the normalization/control channels: build the model for each component, collect all of them, and construct a maximum likelihood fitter on top of them. In this exercise a much simpler model will be introduced: we will only fit the invariant mass distribution, with only four components (and without considering $\mathcal{B}(B^0 \to \mu^+\mu^-)$ as one of the parameters of interest), as shown in the plot above. Note that the official BMM4 fits were performed in 3D, and with 6 components.

Another key piece of information to consider is the BDT threshold in each category. In principle multiple BDT categories can be introduced in each channel (year/era/detector region), but here again a simplified version is considered, with only one BDT cut per category. A fine-tuned BDT threshold should be introduced at some point, but in the following demonstration/example code we start with a simple, fixed BDT threshold of >0.8 for all of the 8 categories. You will have to optimize this yourself in the next task.

Here is some information about each component to be included in the fit:

  • Signal $B_s \to \mu^+\mu^-$:
    • An obvious/essential component!
    • To be modeled by a Gaussian plus a Crystal Ball line shape, i.e. the RooCBShape class.
    • Also need to calculate the efficiencies after the BDT cut.
  • Peaking background
    • Including all of the $B \to hh$ (two-body hadronic decays), and $B^0 \to \mu^+\mu^-$ (merged, treated as a background here).
    • Will have to scale/weight the MC samples according to their expected yields and use the MC mixture to determine the PDF model.
    • To be modeled by a non-parametric keys PDF with the RooKeysPdf class.
    • Sum of expected yields (and the associated uncertainty) should be evaluated, to be used in the fit.
  • Semileptonic background
    • Including all of the $B \to h\mu\nu$ and $B \to h\mu\mu$ 3-body decays.
    • Similar treatment as for the peaking background: scale/weight the MC samples according to their expected yields and use the MC mixture to determine the PDF model; to be modeled by a non-parametric keys PDF.
    • Sum of expected yields (and the associated uncertainty) should be evaluated, to be used in the fit.
  • Combinatorial background
    • The major background component.
    • To be modeled by a linear function (Bernstein polynomial to avoid negative PDF, via RooBernstein class).
    • Fit from data directly, only need to calculate an initial "guess" of the yield.

Although the normalization channel ($B^+ \to J/\psi K^+$) is not part of the fit, it is still necessary to fit it again and to estimate the yield and total efficiency after the selected BDT threshold. Hence:

  • Normalization
    • Just recycle the fitting code introduced for the $f_s/f_u$ check
    • Need to calculate the fitted yield as well as total efficiency (including the "base" efficiency of the preselection, as given by the CSV file, and the cuts applied on top of the trees, such as a tighter BDT threshold).

Remark: in this exercise we have ignored many technical details, including the carefully calculated systematic uncertainties. The model is also significantly simplified (i.e. the per-event mass resolution is not considered in the following example; you are encouraged to test such a 2D model after the school). We have also skipped the very important step of data-MC comparisons and validations, to be carried out before entering the fitting stage, not to mention the construction of the BDT variables and the training/validation tasks. Please be aware!

Task #4-1: Construct peaking background PDF

As mentioned above, in order to construct proper models for the peaking backgrounds, one has to scale/weight the MC samples according to their expected yields and use the MC mixture to determine the PDF model. Here is an example code to do this (the common.h file can be found here):

#include "common.h"
{
    unsigned int cate = 0;
    double bdt_min = 0.8;
    
    vector<TString> decay;
    vector<double> yield, yield_err;
    
    decay.push_back("bdmmMc");
    yield.push_back(effyield[cate][N_bdmmMc]);
    yield_err.push_back(effyield[cate][dN_bdmmMc]);
    
    RooRealVar m("m","",4.9,5.9);
    RooRealVar wgt("wgt","",1.,0.,1000.);
    RooDataSet *rds = new RooDataSet("rds","",RooArgSet(m,wgt),"wgt");
    
    double sum_weight = 0.;
    double sum_weight_err = 0.;

    rds->weightError(RooAbsData::SumW2);

    double weight_max = 0.; // get the largest weight value across samples 
    for(int proc=0; proc<(int)decay.size(); proc++) {
        TFile *fin = new TFile(input_path + decay[proc]+".root");
        TTree *tin = (TTree*)fin->Get(decay[proc]);
        double weight = yield[proc]/(double)tin->GetEntries(Form("cate==%d",cate));
        if (weight>weight_max) weight_max = weight;
        delete fin;
    }

     for(int proc=0; proc<(int)decay.size(); proc++) {
        TFile *fin = new TFile(input_path + decay[proc]+".root");
        TTree *tin = (TTree*)fin->Get(decay[proc]);

        unsigned int cate_t;
        float m_t,bdt_t;
        tin->SetBranchAddress("cate",&cate_t);
        tin->SetBranchAddress("m",&m_t);
        tin->SetBranchAddress("bdt",&bdt_t);

        double weight = yield[proc]/(double)tin->GetEntries(Form("cate==%d",cate));
        double weight_err = yield_err[proc]/(double)tin->GetEntries(Form("cate==%d",cate)); // per-event share of the (systematic) yield uncertainty

        std::cout<<"weight "<<weight<<" weight_err"<<weight_err<<std::endl;

        for(int evt=0; evt<tin->GetEntries(); evt++) {
            tin->GetEntry(evt);
            if (cate_t!=cate) continue;
            if (bdt_t<=bdt_min) continue;
            m.setVal(m_t);
            wgt.setVal(weight/weight_max); // rescale the event with largest weight to be 1
            rds->add(RooArgSet(m,wgt),weight/weight_max);

            sum_weight += weight;
            sum_weight_err += weight_err; // systematics; linear sum
        }
        delete fin;
    }

In the code above, we first push the name of the decay (i.e. "bdmmMc", indicating $B^0\to\mu^+\mu^-$) and the expected yield (with its associated uncertainty) from the effyield table into std::vectors. In the "proc" loop we go through all processes (although there is only one for now) and calculate the weights as the ratio between the expected yield and the input MC size. The sum of the event weights after the specific BDT cut is a direct representation of the expected yield after the selection. The uncertainties are added linearly, since all the uncertainties in the effyield table are systematic (including the uncertainties on the branching fractions and so on). The next step is to build a RooKeysPdf based on the weighted MC sample and produce a plot, i.e.:

......
    RooKeysPdf pdf("pdf", "", m, *rds,  RooKeysPdf::NoMirror, 2.0);
    
    RooPlot *frame = m.frame();
    rds->plotOn(frame);
    pdf.plotOn(frame);
    frame->Draw();
    
    cout << "Category: " << cate << endl;
    cout << "BDT min: " << bdt_min << endl;
    cout << "Sum of weights: " << sum_weight << " +- " << sum_weight_err << endl;
}

Now all we need to do is to include all of the "peaking" channels in the decay/yield/yield_err vectors, as sketched after the figure below. This should include "bdmmMc" ($B^0 \to \mu\mu$), "bstohhMcBg" (including all the $B_s \to hh$ decays: $B_s \to KK$, $B_s \to K\pi$, $B_s \to \pi\pi$), and "bdtohhMcBg" (including all the $B^0 \to hh$ decays: $B^0 \to KK$, $B^0 \to K\pi$, $B^0 \to \pi\pi$). An example PDF projection is shown below.
task_4_1.png
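For instance, the vectors can be extended as follows (a sketch; the effyield column names for these processes are assumed to follow the same pattern as in common.h):

    decay.push_back("bstohhMcBg");                      // Bs -> hh (column names assumed)
    yield.push_back(effyield[cate][N_bstohhMcBg]);
    yield_err.push_back(effyield[cate][dN_bstohhMcBg]);

    decay.push_back("bdtohhMcBg");                      // B0 -> hh
    yield.push_back(effyield[cate][N_bdtohhMcBg]);
    yield_err.push_back(effyield[cate][dN_bdtohhMcBg]);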

EXERCISE: build a "peak" background PDF model including all the peaking two-body rare decays listed above

Task #4-2: Construct semileptonic background PDF

The background PDF for semileptonic 3-body B decays can be constructed with a very similar method, except

  • Have to replace the inputs for the decay/yield/yield_err vectors with the corresponding decays: "bskmunuMcBg" ($B_s \to K\mu\nu$), "bdpimunuMcBg" ($B^0 \to \pi\mu\nu$), "bdpimumuMcBg" ($B^0\to\pi^0\mu\mu$), and "bupimumuMcBg" ($B^+\to\pi^+\mu\mu$).
  • It is better to change the mirroring option of the RooKeysPdf, i.e.
RooKeysPdf pdf("pdf", "", m, *rds,  RooKeysPdf::MirrorLeft, 2.0);
otherwise one may observe some boundary effects at the left-hand edge.
  • Again, an example projection is given below.
    task_4_2.png

EXERCISE: build a "semi" background PDF model including all of the semileptonic 3-body rare decays listed above

Task #4-3: Derive signal PDF from MC events

The next step is to build the signal ($B_s\to\mu^+\mu^-$) PDF based on the MC sample. The $B_s\to\mu^+\mu^-$ peak has a slightly longer tail compared to $J/\psi K^+$, hence a model consisting of a Crystal Ball function plus a Gaussian is introduced. One can start with the following piece of code:

......
    RooRealVar m("m","",4.9,5.9);
    RooDataSet *rds = new RooDataSet("rds","",RooArgSet(m));
    
    TFile *fin = new TFile("/eos/user/c/cmsdas/2024/long-ex-bph/bsmmMc.root");
    TTree *tin = (TTree*)fin->Get("bsmmMc");
        
    unsigned int cate_t;
    float m_t,bdt_t;
    tin->SetBranchAddress("cate",&cate_t);
    tin->SetBranchAddress("m",&m_t);
    tin->SetBranchAddress("bdt",&bdt_t);
    
    int evtcount[2] = {0,0};
    for(int evt=0; evt<tin->GetEntries(); evt++) {
        tin->GetEntry(evt);
        if (cate_t!=cate) continue;
        evtcount[0]++;
        if (bdt_t<=bdt_min) continue;
        evtcount[1]++;
        m.setVal(m_t);
        rds->add(RooArgSet(m));
    }
    double eff = (double)evtcount[1]/(double)evtcount[0] * effyield[cate][Eff_bsmmMc];
    delete fin;
    
    RooRealVar bs_mean1("bs_mean1","",5.37,5.2,5.5);
    RooRealVar bs_mean2("bs_mean2","",5.37,5.2,5.5);
    RooRealVar bs_sigma1("bs_sigma1","",0.030,0.005,0.060);
    RooRealVar bs_sigma2("bs_sigma2","",0.080,0.040,0.200);
    RooRealVar bs_cbalpha("bs_cbalpha","",1.,0.,4.);
    RooRealVar bs_cbn("bs_cbn","",1.,0.,4.);
    RooRealVar bs_frac("bs_frac","",0.7,0.,1.);
    RooGaussian bs_gaus("bs_gaus", "", m, bs_mean1, bs_sigma1);
    RooCBShape bs_cbline("bs_cbline", "", m, bs_mean2, bs_sigma2, bs_cbalpha, bs_cbn);
    RooAddPdf pdf("pdf","",RooArgList(bs_gaus,bs_cbline),RooArgList(bs_frac));
    
    pdf.fitTo(*rds);

This model should be sufficient to describe the signal shape, as shown in the example figure below. In the piece of code above we also calculate the total efficiency by counting the signal MC events passing the additional BDT cut and multiplying the result by the base efficiency from the effyield table. For the 0th category and a BDT cut of 0.8, the total efficiency is 0.356%.
task_4_3.png

EXERCISE: build the signal PDF model as described above

Task #4-4: Revisit the normalization channel, $B^+\to J/\psi K^+$

A similar efficiency calculation and yield extraction should be carried out for the normalization channel $B^+\to J/\psi K^+$. The task is rather straightforward since it can be performed with a simple modification of the code introduced in task #3 (or task #2-4). The key modifications are collected in the following code snippet:

......
    tin->SetBranchAddress("cate",&cate_t);
    tin->SetBranchAddress("m",&m_t);
    
    int evtcount[2] = {0,0};
    for(int evt=0; evt<tin->GetEntries(); evt++) {
        tin->GetEntry(evt);
        if (cate_t!=cate) continue;
        evtcount[0]++;
        if (m_t<5.0 || m_t>=5.8) continue;
        evtcount[1]++;
        m.setVal(m_t);
        rds_mc->add(RooArgSet(m));
    }
    double eff = (double)evtcount[1]/(double)evtcount[0] * effyield[cate][Eff_bupsikMc];
    delete fin;

All we need to do is to evaluate the total efficiency (i.e. the efficiency of the mass cut multiplied by the base efficiency). For the 0th category, the total efficiency is calculated to be 0.0010% and the total observed yield is $266196 \pm 799$. The projections for the $J/\psi K^+$ MC and data events can be found below.
task_4_4a.pngtask_4_4b.png

EXERCISE: prepare the code for normalization channel accordingly

Task #4-5: Store PDF models into RooWorkspace

In tasks #4-1/2/3/4 we managed to create the PDFs for single components, one category at a time -- but the full maximum likelihood fitter requires the PDFs of all components for every category! We need to collect all of the PDF models in the same place, connect them into a global PDF, and simultaneously fit the data from the 8 categories. Technically one could build all the PDFs on the fly and perform the fit afterwards, but this can be rather inefficient and slow. In the following example we modify the codes introduced earlier and store the output PDF models (and some of the key variables) in a common RooWorkspace; with the workspace one can easily reload the models into memory, so that those PDFs can be used to build the global PDF afterwards.

The following (simple) modifications are required for this task. First we define a function for each task with the proper arguments (the pointer to the RooWorkspace, the selected category index, and the BDT threshold):

void build_pdf_peak(RooWorkspace *wspace, uint cate, double bdt_min)
void build_pdf_semi(RooWorkspace *wspace, uint cate, double bdt_min)
void build_pdf_bs(RooWorkspace *wspace, uint cate, double bdt_min)
void fit_bupsik(RooWorkspace *wspace, uint cate, double bdt_min)

The body of those functions can be copied from the codes introduced in tasks #4-1/2/3/4. At the end of each function we need to store the output PDF as well as the yields and efficiencies in the workspace. (Note: it is better to delete the allocated objects such as TCanvas, RooPlot, etc. before the end of these routines as well.)

void build_pdf_peak(RooWorkspace *wspace, uint cate, double bdt_min)
{
    ......
    wspace->import(pdf);
    wspace->import(RooRealVar(Form("N_peak_%d",cate),"",sum_weight));
}
void build_pdf_semi(RooWorkspace *wspace, uint cate, double bdt_min)
{
    ......
    wspace->import(pdf);
    wspace->import(RooRealVar(Form("N_semi_%d",cate),"",sum_weight));
}
void build_pdf_bs(RooWorkspace *wspace, uint cate, double bdt_min)
{
    ......
    bs_mean1.setConstant(true);
    bs_mean2.setConstant(true);
    bs_sigma1.setConstant(true);
    bs_sigma2.setConstant(true);
    bs_cbalpha.setConstant(true);
    bs_cbn.setConstant(true);
    bs_frac.setConstant(true);
    wspace->import(pdf);
    wspace->import(RooRealVar(Form("Eff_bs_%d",cate),"",eff));
}
void fit_bupsik(RooWorkspace *wspace, uint cate, double bdt_min)
{
    ......
    wspace->import(RooRealVar(Form("N_bu_%d",cate),"",n_sig.getVal()));
    wspace->import(RooRealVar(Form("Eff_bu_%d",cate),"",eff));
}

The model parameters of the signal $B_s \to \mu\mu$ process can be fixed before saving it to the workspace. In addition to the tasks introduced earlier, we should also construct the model for the combinatorial background. We just need to add a linear function (a Bernstein polynomial here) and a floating yield with an initial guess. Here is the small piece of code for this remaining model:

void build_pdf_comb(RooWorkspace *wspace, uint cate, double bdt_min)
{
    RooRealVar m("m","",4.9,5.9);
    
    RooRealVar comb_B1(Form("comb_B1_%d",cate), "", 0.5, 0. , 1);
    RooFormulaVar comb_B2(Form("comb_B2_%d",cate), "", "1.-@0", RooArgList(comb_B1));
    RooBernstein pdf(Form("pdf_comb_%d",cate), "", m, RooArgList(comb_B1, comb_B2));
    
    TFile *fin = new TFile("/eos/user/c/cmsdas/2024/long-ex-bph/bmmData-blind.root");
    TTree *tin = (TTree*)fin->Get("bmmData");
    
    double N_comb_guess = (double)tin->GetEntries(Form("cate==%d&&bdt>%g&&m>5.45",cate,bdt_min));
    N_comb_guess *= 1.0/0.45; // scale to full mass region
    delete fin;
 
    wspace->import(pdf);
    wspace->import(RooRealVar(Form("N_comb_%d",cate),"",N_comb_guess,0.,N_comb_guess*10.));
}

The expected yield is obtained by counting the number of events in the high-mass sideband (5.45<m<5.9) and scaling them to the full mass region (4.9<m<5.9), assuming a flat distribution. The construction of a first-order Bernstein polynomial takes two parameters, but there is only one free parameter due to the normalization condition; here we enforce the sum of the two parameters to be one. At this point we need to prepare a code to loop over all the categories and save the resulting workspace to a file, as follows:

......
    RooWorkspace *wspace = new RooWorkspace("wspace");
    for(int cate=0; cate<N_Categories; cate++) {
        double bdt_min = 0.80;
        build_pdf_peak(wspace,cate,bdt_min);
        build_pdf_semi(wspace,cate,bdt_min);
        build_pdf_bs(wspace,cate,bdt_min);
        build_pdf_comb(wspace,cate,bdt_min);
        fit_bupsik(wspace, cate, bdt_min);
    }
    wspace->writeToFile("wspace.root");

EXERCISE: construct the code as described above, and produce the corresponding workspace file

Once wspace.root has been produced, it is straightforward to double-check that all of the variables and PDF models are in place, for example:

root wspace.root
root [1] wspace->Print()

RooWorkspace(wspace) wspace contents

variables
---------
(Eff_bs_0,Eff_bs_1,Eff_bs_2,Eff_bs_3,Eff_bs_4,Eff_bs_5,Eff_bs_6,Eff_bs_7...
......

p.d.f.s
-------
RooCBShape::bs_0_cbline[ m=m m0=bs_0_mean2 sigma=bs_0_sigma2 alpha=bs_0_cbalpha n=bs_0_cbn ] = 0.664694
......

If one wants to verify just one of the many stored PDFs, it can be done with the usual RooFit commands, e.g.

root [2] frame = wspace->var("m")->frame()
root [3] wspace->pdf("pdf_bs_0")->plotOn(frame)
root [4] frame->Draw()

And one can see a display of the PDF like the plot below.
task_4_5.png

Task #5: Optimize BDT thresholds

Up to now we have always used a tentative BDT threshold of 0.8. This is definitely not optimal, and in particular the threshold can be different for every category, due to the different detector responses and data-taking conditions. It is therefore necessary to optimize the BDT cut, for example by choosing the BDT value which maximizes the signal significance.

In the following exercise we are going to provide small pieces of code to calculate the expected signal and combinatorial background yields in the signal region (5.2<m<5.45), and to evaluate the counting significance according to the formula
$Z = \sqrt{2\left[(n_s+n_b)\ln\left(1+\frac{n_s}{n_b}\right)-n_s\right]}$

#include "common.h"

{
    uint cate = 0;
    double bdt_min = 0.8;
    
    RooRealVar bdt("bdt","",0.1,1.0);
    RooRealVar wgt("wgt","",1.,0.,1000.);
    RooDataSet *rds_sig = new RooDataSet("rds_sig","",RooArgSet(bdt,wgt),"wgt");
    RooDataSet *rds_bkg = new RooDataSet("rds_bkg","",RooArgSet(bdt,wgt),"wgt");
    
    TFile *fin = new TFile("/eos/user/c/cmsdas/2024/long-ex-bph/bsmmMc.root");
    TTree *tin = (TTree*)fin->Get("bsmmMc");
    
    uint cate_t;
    float m_t, bdt_t;
    tin->SetBranchAddress("cate",&cate_t);
    tin->SetBranchAddress("m",&m_t);
    tin->SetBranchAddress("bdt",&bdt_t);
    
    double weight = effyield[cate][N_bsmmMc]/(double)tin->GetEntries(Form("cate==%d",cate));
    for(int evt=0; evt<tin->GetEntries(); evt++) {
        tin->GetEntry(evt);
        if (cate_t!=cate) continue;
        bdt.setVal(bdt_t);
        wgt.setVal(weight);
        rds_sig->add(RooArgSet(bdt,wgt),weight);
    }
    delete fin;
    
    fin = new TFile("/eos/user/c/cmsdas/2024/long-ex-bph/bmmData-blind.root");
    tin = (TTree*)fin->Get("bmmData");
    
    tin->SetBranchAddress("cate",&cate_t);
    tin->SetBranchAddress("m",&m_t);
    tin->SetBranchAddress("bdt",&bdt_t);
    
    weight = 0.25/0.45; // scale from 5.45-5.90 to 5.20-5.45, assuming a uniform mass dist.
    for(int evt=0; evt<tin->GetEntries(); evt++) {
        tin->GetEntry(evt);
        if (cate_t!=cate) continue;
        if (m_t<5.45) continue;
        bdt.setVal(bdt_t);
        wgt.setVal(weight);
        rds_bkg->add(RooArgSet(bdt,wgt),weight);
    }
    delete fin;
    
    cout << "Category: " << cate << endl;
    double ns = rds_sig->sumEntries(Form("bdt>%g",bdt_min));
    double nb = rds_bkg->sumEntries(Form("bdt>%g",bdt_min));
    if (ns>0. && nb>0.) {
        double Z = sqrt(2.*((ns+nb)*log(1.+ns/nb)-ns));
        cout << "bdt_min: " << bdt_min << ", ns: " << ns << ", nb: " << nb << ", Z: " << Z << endl;
    }
}

EXERCISE: Find the optimal BDT threshold for each category

Evaluate the counting significance for different BDT thresholds (from 0.10 to 0.40) for each category, and record the optimal selection in the following table; a simple scan loop is sketched after the table. You can also consider adding the estimates for the other background sources to the code above, or using a non-uniform model in the estimation of the combinatorial background.

Category index Best BDT threshold Ns Nb Significance
0 x.xx y.yy z.zz s.ss
1 ... ... ... ...
2 ... ... ... ...
3 ... ... ... ...
4 ... ... ... ...
5 ... ... ... ...
6 ... ... ... ...
7 ... ... ... ...
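A simple threshold scan, reusing the rds_sig and rds_bkg datasets filled in the code above, could look like this (a sketch; the scan range and step size are free choices):

    // scan the BDT threshold and keep the value maximizing the significance
    double best_bdt = 0., best_Z = 0.;
    for(double cut=0.10; cut<=0.99; cut+=0.01) {
        double ns = rds_sig->sumEntries(Form("bdt>%g",cut));
        double nb = rds_bkg->sumEntries(Form("bdt>%g",cut));
        if (ns<=0. || nb<=0.) continue;
        double Z = sqrt(2.*((ns+nb)*log(1.+ns/nb)-ns));
        if (Z>best_Z) { best_Z = Z; best_bdt = cut; }
    }
    cout << "Best threshold: " << best_bdt << " (Z = " << best_Z << ")" << endl;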

EXERCISE: Implement the optimal BDT threshold in the model building

Insert the optimal BDT threshold for each category back into the code produced for task #4-5, and create the updated PDF models for each component.

Task #6: Put all together -- branching fraction extraction

We are going to read back the workspace created at the end of task #4 and/or #5, join the PDFs into a RooSimultaneous global PDF, and eventually perform the fit.

Task #6-1: Construct a (minimal) fitter for all categories

The construction starts with a RooCategory discrete variable and the main parameter of interest, the $B_s\to\mu\mu$ branching fraction. The signal yield can be expressed through the yield and efficiency of the $J/\psi K^+$ normalization channel and the $B_s\to\mu\mu$ efficiency. In the first model we are going to construct, the background yields (except the combinatorial one) are fixed to their expected values. The shape parameters of the background models are also fixed, except for the slope of the linear combinatorial background.
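Written out, the per-category signal yield implemented in the RooFormulaVar below is

$N_{B_s} = \mathcal{B}(B_s\to\mu^+\mu^-) \cdot N_{B^+} \cdot \frac{f_s}{f_u} \cdot \frac{\epsilon_{B_s}}{\epsilon_{B^+}} \cdot \frac{1}{\mathcal{B}(B^+\to J/\psi K^+\to\mu^+\mu^- K^+)}$,

so that the fitted branching fraction is normalized to the observed $J/\psi K^+$ yield rather than to the integrated luminosity.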

#include "common.h"

{
    TFile *fin_wspace = new TFile("wspace.root");
    RooWorkspace *wspace = (RooWorkspace*)fin_wspace->Get("wspace");
    
    RooRealVar *m = wspace->var("m");
    RooCategory cate("cate","");
    for(int idx=0; idx<N_Categories; idx++)
        cate.defineType(Form("c%d",idx),idx);
    
    RooSimultaneous *model = new RooSimultaneous("model", "", cate);
    
    // Main POI: Bs->mumu branching fraction
    RooRealVar *BF_bs = new RooRealVar("BF_bs","",3.57E-9,0.,2E-8);
    
    // BF(B+ -> J/psi K+) = (1.010 +- 0.028) E-3 (PDG)
    // BF(J/psi -> mu+mu-) = (5.961 +- 0.033) E-2 (PDG)
    RooRealVar *BF_bu = new RooRealVar("BF_bu","",1.010E-3 * 5.961E-2);
    
    // fs/fu = 0.252 +- 0.012 (PDG) +- 0.015 (energy/pt dependence)
    RooRealVar *fs_over_fu = new RooRealVar("fs_over_fu","",0.252);
    
    RooFormulaVar *N_bs[N_Categories];
    RooAddPdf *pdf_sum[N_Categories];
    for(int idx=0; idx<N_Categories; idx++) {
        
        RooRealVar *Eff_bs = wspace->var(Form("Eff_bs_%d",idx));
        RooRealVar *Eff_bu = wspace->var(Form("Eff_bu_%d",idx));
        RooRealVar *N_bu = wspace->var(Form("N_bu_%d",idx));
        
        N_bs[idx] = new RooFormulaVar(Form("N_bs_%d", idx), "", "@0*@1*@2*@3/@4/@5",
                                      RooArgList(*BF_bs, *N_bu, *fs_over_fu, *Eff_bs, *Eff_bu, *BF_bu));
        RooRealVar *N_peak = wspace->var(Form("N_peak_%d",idx));
        RooRealVar *N_semi = wspace->var(Form("N_semi_%d",idx));
        RooRealVar *N_comb = wspace->var(Form("N_comb_%d",idx));
        
        RooArgList pdf_list;
        pdf_list.add(*wspace->pdf(Form("pdf_bs_%d",idx)));
        pdf_list.add(*wspace->pdf(Form("pdf_peak_%d",idx)));
        pdf_list.add(*wspace->pdf(Form("pdf_semi_%d",idx)));
        pdf_list.add(*wspace->pdf(Form("pdf_comb_%d",idx)));
        
        RooArgList N_list;
        N_list.add(*N_bs[idx]);
        N_list.add(*N_peak);
        N_list.add(*N_semi);
        N_list.add(*N_comb);
        
        pdf_sum[idx] = new RooAddPdf(Form("pdf_sum_%d",idx), "", pdf_list, N_list);
        model->addPdf(*pdf_sum[idx],Form("c%d",idx));
    }

In the code snippet above, model contains the final global/simultaneous PDF for all categories. There are also two global parameters (currently fixed as constants): the combined branching fraction of the normalization channel $B^+ \to J/\psi K^+ \to \mu^+\mu^- K^+$ and the fragmentation fraction $f_s/f_u$, both taken from the PDG.

Once the model is built, we can read the data and perform the fit! However, in order not to bias ourselves, we must not look at the unblinded data sample during code development! Instead, the MC soup (a mixture of MC events) serves as the data sample to play with. Here we take the bmmSoup10.root file, which is composed of all the background and signal samples according to their expected yields (with the signal branching fraction set to the SM predicted value). The file actually contains 100 trees of random MC mixtures; here we just take the first one to practice. In the next exercise you can try to fit all of these sets (and the sets stored in the other files with different assumed branching fractions), and check the distribution of the fitted branching fraction and its uncertainty.

......
    double bdt_min = 0.8;

    RooDataSet *rds_data = new RooDataSet("rds_data","",RooArgSet(*m,cate));
    
    TFile *fin_data = new TFile("/eos/user/c/cmsdas/2024/long-ex-bph/bmmSoup10.root");
    TTree *tin = (TTree*)fin_data->Get("bmmSoup10_100");
    
    int cate_t;
    float m_t, bdt_t;
    tin->SetBranchAddress("cate",&cate_t);
    tin->SetBranchAddress("m",&m_t);
    tin->SetBranchAddress("bdt",&bdt_t);
    
    for(int evt=0; evt<tin->GetEntries(); evt++) {
        tin->GetEntry(evt);
        if (bdt_t<=bdt_min) continue;
        cate.setIndex(cate_t);
        m->setVal(m_t);
        rds_data->add(RooArgSet(*m,cate));
    }
    
    model->fitTo(*rds_data,Extended(true),Minos(RooArgSet(*BF_bs)));

The RooDataSet contains two variables: the invariant mass m and the category index cate. In the code snippet the BDT threshold is still set to 0.8; obviously, it has to be updated with the optimal thresholds found in the previous task. The fit to the data should then give this:

Pos | Name      | type    | Value           | Error +/-
  0 | BF_bs     | fixed   | 4.039940232e-09 | 6.049049088e-10 (at lower limit)
  1 | N_comb_0  | limited | 283.2667767     | 17.56401315
  2 | N_comb_1  | limited | 720.4000421     | 27.78586579
  3 | N_comb_2  | limited | 270.966665      | 17.40725515
  4 | N_comb_3  | limited | 618.0347983     | 25.97171289
  5 | N_comb_4  | limited | 1421.604657     | 39.00779314
  6 | N_comb_5  | limited | 3125.620564     | 57.34804216
  7 | N_comb_6  | limited | 2250.806339     | 49.17224555
  8 | N_comb_7  | limited | 4916.795396     | 71.95540089
  9 | comb_B1_0 | limited | 0.5438618653    | 0.05424791566
 10 | comb_B1_1 | limited | 0.5434179277    | 0.03295451653
 11 | comb_B1_2 | limited | 0.6489947221    | 0.05220176234
 12 | comb_B1_3 | limited | 0.5106479047    | 0.03665468166
 13 | comb_B1_4 | limited | 0.607380486     | 0.02273956399
 14 | comb_B1_5 | limited | 0.582694311     | 0.01553735396
 15 | comb_B1_6 | limited | 0.617767062     | 0.01864070746
 16 | comb_B1_7 | limited | 0.5627468118    | 0.01243258839

The fit is performed with 17 free parameters, but the only parameter we actually want to measure is BF_bs. You can see that the fitted branching fraction agrees rather well with the input value $3.57 \times 10^{-9}$. Note that this is NOT always the case: since the MC soup samples were generated with Poisson fluctuations, the fitted branching fraction can vary! (Try fitting other soup sets within the same ROOT file, and you should be able to observe this.)

Obviously, we should then produce a projection plot of the fit by overlaying the fitted curves on top of the data. This can be tricky, but the following piece of code should do it properly:

......
    TCanvas* canvas = new TCanvas("canvas", "", 1200, 600);
    canvas->Divide(4,2);
    
    RooArgSet nset(*m);
    RooPlot *frame[N_Categories];
    for(int idx=0; idx<N_Categories; idx++) {
        canvas->cd(idx+1);
        
        frame[idx] = m->frame(Title(" "),Bins(25));
        rds_data->plotOn(frame[idx], Cut(Form("cate==%d",idx)), MarkerSize(0.8));
        
        double norm = pdf_sum[idx]->expectedEvents(&nset);
        pdf_sum[idx]->plotOn(frame[idx], Normalization(norm, RooAbsReal::NumEvent), LineWidth(3));
        pdf_sum[idx]->plotOn(frame[idx], Normalization(norm, RooAbsReal::NumEvent), Components(RooArgSet(*wspace->pdf(Form("pdf_bs_%d",idx)))), DrawOption("F"), FillColor(kRed), FillStyle(3365));
        pdf_sum[idx]->plotOn(frame[idx], Normalization(norm, RooAbsReal::NumEvent), Components(RooArgSet(*wspace->pdf(Form("pdf_peak_%d",idx)))), DrawOption("F"), FillColor(kViolet-4), FillStyle(3344));
        pdf_sum[idx]->plotOn(frame[idx], Normalization(norm, RooAbsReal::NumEvent), Components(RooArgSet(*wspace->pdf(Form("pdf_semi_%d",idx)))), DrawOption("L"), LineColor(kGreen-3), LineStyle(2));
        frame[idx]->Draw();
    }

With some simple decoration one can reach a projection plot containing 8 subplots, one for each category, as shown below. The red shaded area represents the signal component. You may find that the signal-to-noise ratio is not very high; this is fully expected, since a rather relaxed BDT threshold of 0.8 is used here.
task_6_1.png

task_6_1_bdt.png

EXERCISE: build the full maximum likelihood fitter as introduced above (with the BDT thresholds obtained in the previous task)

EXERCISE: perform the fits to different sets of MC soup within the same or different root file

Please perform the fits to 20 sets in each MC soup file, and summarize the results in the table below; a sketch of such a loop is given after the table:

MC soup file Input BF_bs Average fitted BF_bs Average positive error Average negative error
bmmSoup05.root $0.5 \cdot 3.57 \times 10^{-9}$ x.xx y.yy z.zz
bmmSoup10.root $1.0 \cdot 3.57 \times 10^{-9}$ ... ... ...
bmmSoup15.root $1.5 \cdot 3.57 \times 10^{-9}$ ... ... ...
bmmSoup20.root $2.0 \cdot 3.57 \times 10^{-9}$ ... ... ...
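
A minimal sketch of such a loop is given here. The tree naming bmmSoup10_<n> is inferred from the bmmSoup10_100 example above (verify the actual names with fin_data->ls()), and fit_one_set() is a hypothetical wrapper around the model building and fitTo() steps of task #6-1, returning the fitted branching fraction and its MINOS errors:

    double sum_bf = 0., sum_ehi = 0., sum_elo = 0.;
    const int n_sets = 20;
    for(int set=1; set<=n_sets; set++) {
        double bf, ehi, elo;
        // rebuild the model, load the given soup tree, fit, and return
        // BF_bs together with its MINOS errors
        fit_one_set(Form("bmmSoup10_%d",set), &bf, &ehi, &elo);
        sum_bf += bf; sum_ehi += ehi; sum_elo += elo;
    }
    cout << "Average fitted BF: " << sum_bf/n_sets
         << ", average errors: +" << sum_ehi/n_sets
         << " / " << sum_elo/n_sets << endl;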

Task #6-2: Standard check: profiled likelihood scan

A profiled likelihood scan (together with a verification of the observed significance) is a very common post-fit check. The key idea is to calculate the log-likelihood as a function of the main parameter of interest (here the branching fraction of $B_s\to\mu\mu$, BF_bs). The proper way is to fix BF_bs to a target value, repeat the fit (a.k.a. "profiling"), and compute the resulting difference of the log-likelihood w.r.t. the minimum. The following piece of code does exactly that:

......
    RooFitResult *res_best = model->fitTo(*rds_data,Extended(true),Save(true),Minos(RooArgSet(*BF_bs)));
    RooAbsReal* nll = model->createNLL(*rds_data,Extended(true));
    
    TH1D *frame = new TH1D("frame","",101, -0.5E-10, 100.5E-10);
    for(int bin=1; bin<=frame->GetNbinsX(); bin++) {
        BF_bs->setVal(frame->GetBinCenter(bin));
        BF_bs->setConstant(true);
        model->fitTo(*rds_data,Extended(true));
        frame->SetBinContent(bin,(nll->getVal()-res_best->minNll())*2.);
    }
    
    TCanvas* canvas = new TCanvas("canvas", "", 600, 600);
    frame->Draw("c");

Note that we have to multiply the log-likelihood difference by two, so that in the Gaussian limit the curve behaves like a $\Delta\chi^2$. The crossing points between the likelihood curve and the horizontal line y=1 represent the $\pm1\sigma$ uncertainties, and should be consistent with the values reported by the MINOS call in the fit. Again, with some decoration a plot similar to the one below can be obtained:
task_6_2.png
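
To read off the $\pm1\sigma$ crossings numerically rather than by eye, one can linearly interpolate the scan histogram between adjacent bins; a minimal sketch (the precision is limited by the scan granularity):

    // locate the crossings of the 2*DeltaNLL curve with y=1 by linear
    // interpolation between adjacent scan points
    double best = frame->GetBinCenter(frame->GetMinimumBin());
    for(int bin=1; bin<frame->GetNbinsX(); bin++) {
        double y1 = frame->GetBinContent(bin);
        double y2 = frame->GetBinContent(bin+1);
        if ((y1-1.)*(y2-1.)<0.) { // the curve crosses y=1 in this interval
            double x = frame->GetBinCenter(bin)+(1.-y1)/(y2-y1)*frame->GetBinWidth(bin);
            cout << (x<best ? "-1 sigma" : "+1 sigma") << " crossing at BF = " << x << endl;
        }
    }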

EXERCISE: perform the profiled likelihood scan (with the fitter adopting the optimal BDT thresholds)

Another piece of information that can be extracted is the significance, computed from the difference between the log-likelihood at the best fit (res_best->minNll() in the code snippet) and its value under the null hypothesis (i.e. zero branching fraction, BF_bs = 0), via $Z=\sqrt{2\,\Delta\ln L}$. Since the first bin of the histogram already contains twice the log-likelihood difference at zero branching fraction, the significance can be evaluated as

cout << "Observed significance: " << sqrt(frame->GetBinContent(1)) << " sigma." << endl;

One should see this on the shell printout (if the BDT threshold is still set to 0.8 for all categories):

Observed significance: 5.93132 sigma.

Task #6-3: Standard check: statistical tests with toy MC

It is essential to understand the properties of the fitting model, and some of them can be checked with toy MC tests:

  • whether the model (and the fitter implementation) is unbiased;
  • whether the uncertainties reported by Minuit are consistent with the expected statistical fluctuations;
  • whether the observed significance is consistent with the value expected from the model.
The study is performed by first generating toy MC samples based on the PDF model, and then fitting these samples to extract the central value and uncertainties of the parameter of interest. Repeating this generate-and-fit procedure allows one to study the statistical properties. The following code snippet demonstrates how to perform such a study: the model construction from task #6-1 has been moved into a separate function, do_gen_and_fit(), and instead of loading the data from the MC soup, the events are generated with a RooFit routine.

void do_gen_and_fit(double *res)
{
    double bdt_min = 0.8;
    
    TFile *fin_wspace = new TFile("wspace.root");
    RooWorkspace *wspace = (RooWorkspace*)fin_wspace->Get("wspace");
    ......
    for(int idx=0; idx<N_Categories; idx++) {
        ......
        model->addPdf(*pdf_sum[idx],Form("c%d",idx));
    }

    RooDataSet *rds_toy = model->generate(RooArgSet(*m,cate),Extended(true));
    
    RooFitResult *res_best = model->fitTo(*rds_toy,Save(true),Extended(true),Minos(RooArgSet(*BF_bs)));
    
    res[0] = BF_bs->getVal();
    res[1] = BF_bs->getError();
    res[2] = BF_bs->getErrorHi();
    res[3] = BF_bs->getErrorLo();

    BF_bs->setConstant(true);
    BF_bs->setVal(0.);
    RooFitResult *res_null = model->fitTo(*rds_toy,Save(true),Extended(true));
    res[4] = sqrt((res_null->minNll()-res_best->minNll())*2.);
}

The pointer passed to the function is used to store the results of the toy test: res[0]-res[3] hold the fitted value and the uncertainties from Minuit, while res[4] keeps the estimated significance. One can then repeat the study with a simple loop that calls do_gen_and_fit() multiple times.

......
    TH1D *h_mean = new TH1D("h_mean","",50,0.,1E-8);
    TH1D *h_error = new TH1D("h_error","",50,0.,2E-9);
    TH1D *h_signif = new TH1D("h_signif","",50,0.,10.);
    TH1D *h_pull = new TH1D("h_pull","",50,-5.,5.);
    
    for(int iter=0; iter<100; iter++) {
        cout << "iteration: " << iter << endl;
        double res[5];
        do_gen_and_fit(res);
        
        h_mean->Fill(res[0]);
        h_error->Fill(res[1]);
        h_signif->Fill(res[4]);
        
        // guard against vanishing MINOS errors, then compute the pull using
        // the asymmetric error on the side facing the true value
        if (fabs(res[2])>0. && fabs(res[3])>0.) {
            double pull = res[0]-3.57E-9;
            pull /= (pull>0.) ? fabs(res[3]) : fabs(res[2]);
            h_pull->Fill(pull);
        }
    }
    
    TCanvas* canvas = new TCanvas("canvas", "", 800, 800);
    canvas->Divide(2,2);
    
    canvas->cd(1);
    h_mean->Draw();
    canvas->cd(2);
    h_error->Draw();
    canvas->cd(3);
    h_signif->Draw();
    canvas->cd(4);
    h_pull->Fit("gaus","L");

An example of the resulting plot is shown below, with 1000 repeated toy tests (instead of 100).
task6_3.png
The results of the fit to the first MC soup set are consistent with the expectations from the toy samples (including the observed central value, the uncertainties, and the significance). The fit to the pull distribution shows very good consistency with a standard Gaussian, which means that the uncertainties reported by Minuit agree with the expected statistical fluctuations.

EXERCISE: carry out the studies with toy MC samples described above

(Remark: it is better to delete the allocated objects before the end of the do_gen_and_fit() function!)

Task #6-4: Systematic uncertainties as constrained nuisances

In the PDF model constructed above, many of the parameters (including the yields of the peaking/semileptonic backgrounds, the branching fraction and yields of the normalization channel, the signal efficiencies, ...) are fixed in the fit. This is not wrong, but such an arrangement requires dedicated studies of the systematic uncertainties. A classical way is to repeat the fit after shifting one of those parameters, and take the resulting difference in the POI (here the $B_s\to\mu\mu$ branching fraction) as an estimate of the uncertainty associated with that parameter (or nuisance); a minimal sketch of this procedure is given below.
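
For illustration, here is a hypothetical sketch of this one-at-a-time variation using the fs_over_fu parameter from task #6-1 ($0.252 \pm 0.012$ from PDG); the names follow the earlier snippets:

......
    // nominal fit with all fixed parameters at their central values
    model->fitTo(*rds_data,Extended(true));
    double bf_ref = BF_bs->getVal();
    
    fs_over_fu->setVal(0.252+0.012); // shift the nuisance up by one sigma
    model->fitTo(*rds_data,Extended(true));
    double bf_up = BF_bs->getVal();
    
    fs_over_fu->setVal(0.252-0.012); // shift the nuisance down by one sigma
    model->fitTo(*rds_data,Extended(true));
    double bf_dn = BF_bs->getVal();
    
    cout << "Syst. from fs/fu: +" << bf_up-bf_ref << " / " << bf_dn-bf_ref << endl;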

One can incorporate all of these systematic effects into the fitter directly with constrained nuisances. Technically, this is done by adding an additional constraint PDF to the likelihood function for each nuisance parameter, so that all nuisance parameters can be profiled together. This means we are going to change the model slightly, as illustrated below.
model_components_2.png
In each category the yields of the peaking and semileptonic components will be constrained by log-normal distributions: these yields are relatively small, and the built-in non-negativity of the log-normal distribution works better in this regime. The yield of the normalization channel is large enough that a Gaussian constraint can be used instead. The efficiencies for the signal and normalization modes are tricky, since some of the systematics (as given by the effyield table) do in fact cancel out. Here we simplify this by including only the uncertainty on the signal efficiency, while keeping the efficiency of the normalization channel fixed. There are two global parameters, the branching fraction of the normalization channel and the fragmentation fraction $f_s/f_u$; these are external inputs and will be constrained by Gaussians as well. (Remark: in the BMM4 analysis it was decided to keep the $f_s/f_u$ uncertainty as an additional/external uncertainty, instead of a constrained nuisance.)

By introducing all these constrained nuisances into the fit, the uncertainties reported by Minuit will reflect the total uncertainty, including both the statistical and the systematic components. Of course, the nuisance parameters above are not a complete set of systematics, since one should still consider the uncertainties in the model (line shape) itself, but this is a good demonstration of how to implement such uncertainties ourselves.
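
Schematically, the extended likelihood is multiplied by one constraint term per nuisance parameter,

$\mathcal{L}(\mathcal{B}(B_s\to\mu\mu),\vec{\theta}) = \prod_{c=0}^{7} \mathcal{L}_c(\mathrm{data}_c \mid \mathcal{B}(B_s\to\mu\mu),\vec{\theta}) \times \prod_{j} C_j(\theta_j)$,

where each $C_j$ is the Gaussian or log-normal constraint PDF of the nuisance parameter $\theta_j$.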

Nevertheless, one can mostly recycle the code for model-building (from task #4-5), and add the following two helper functions to prepare the constrained PDFs:

void build_gaussian_constraint(RooWorkspace* wspace, TString varname, double mean, double sigma)
{
    double min = mean-sigma*6.;
    double max = mean+sigma*6.;
    if (min<0.) min = 0.;
    
    RooRealVar var(varname, "", mean, min, max);
    
    RooRealVar var_mean(Form("%s_mean",varname.Data()),"",mean);
    RooRealVar var_sigma(Form("%s_sigma",varname.Data()),"",sigma);
    RooGaussian gau(Form("%s_gau",varname.Data()), "", var, var_mean, var_sigma);
    wspace->import(gau);
}
void build_lognormal_constraint(RooWorkspace* wspace, TString varname, double mean, double sigma)
{
    double min = mean-sigma*6.;
    double max = mean+sigma*6.;
    if (min<0.) min = 0.;
    
    RooRealVar var(varname, "", mean, min, max);
    
    RooRealVar var_mean(Form("%s_mean",varname.Data()),"",mean);
    RooRealVar var_kappa(Form("%s_kappa",varname.Data()),"",1.+sigma/mean);
    RooLognormal lnn(Form("%s_lnn",varname.Data()), "", var, var_mean, var_kappa);
    wspace->import(lnn);
}

These two functions add the corresponding variable to the workspace, plus a Gaussian or log-normal PDF to be used in the constrained fit. Note that the log-normal distribution is constructed with "kappa" (= 1 + sigma/mean) instead of the regular "sigma"; for example, a yield of 10 +- 3 events corresponds to kappa = 1.3. The other modifications are at the end of each model-building routine: instead of creating a fixed RooFit variable, we add the floating variable and its constraint PDF together.

void build_pdf_peak(RooWorkspace *wspace, int cate, double bdt_min)
{
    ......
    wspace->import(pdf);
    build_lognormal_constraint(wspace,Form("N_peak_%d",cate),sum_weight,sum_weight_err);
    ......
}
void build_pdf_semi(RooWorkspace *wspace, int cate, double bdt_min)
{
    ......
    wspace->import(pdf);
    build_lognormal_constraint(wspace,Form("N_semi_%d",cate),sum_weight,sum_weight_err);
    ......
}
void build_pdf_bs(RooWorkspace *wspace, int cate, double bdt_min)
{
    ......
    wspace->import(pdf);
    build_gaussian_constraint(wspace,Form("Eff_bs_%d",cate),eff,eff_error);
    ......
}
void fit_bupsik(RooWorkspace *wspace, int cate, double bdt_min)
{
    ......
    build_gaussian_constraint(wspace,Form("N_bu_%d",cate),n_sig.getVal(),n_sig.getError());
    wspace->import(RooRealVar(Form("Eff_bu_%d",cate),"",eff));
    ......
}

Before the end of the main code, one can also add the two global parameters as follows:

......
    build_gaussian_constraint(wspace, "BF_bu", 1.010E-3*5.961E-2, 1.010E-3*5.961E-2*0.028);
    build_gaussian_constraint(wspace, "fs_over_fu", 0.252, sqrt(pow(0.012,2)+pow(0.015,2)));

    wspace->writeToFile("wspace_constr.root");
......

The total uncertainty on $\mathcal{B}(B^+\to J/\psi K^+ \to \mu^+\mu^-K^+)$ is around 2.8% (the 2.8% and 0.55% relative uncertainties of the two PDG branching fractions, added in quadrature), and the total uncertainty on $f_s/f_u$ combines the PDG uncertainty ($\pm$0.012) with a possible $p_T$/energy dependence ($\pm$0.015). The rest of the code can be kept unchanged.

The resulting workspace should contain the complete list of constrained PDFs, as shown here:

RooGaussian::Eff_bs_0_gau[ x=Eff_bs_0 mean=Eff_bs_0_mean sigma=Eff_bs_0_sigma ] = 1
......
RooGaussian::Eff_bs_7_gau[ x=Eff_bs_7 mean=Eff_bs_7_mean sigma=Eff_bs_7_sigma ] = 1
RooGaussian::N_bu_0_gau[ x=N_bu_0 mean=N_bu_0_mean sigma=N_bu_0_sigma ] = 1
......
RooGaussian::N_bu_7_gau[ x=N_bu_7 mean=N_bu_7_mean sigma=N_bu_7_sigma ] = 1
RooLognormal::N_peak_0_lnn[ x=N_peak_0 m0=N_peak_0_mean k=N_peak_0_kappa ] = 2.02782
......
RooLognormal::N_peak_7_lnn[ x=N_peak_7 m0=N_peak_7_mean k=N_peak_7_kappa ] = 1.1453
......
RooLognormal::N_semi_7_lnn[ x=N_semi_7 m0=N_semi_7_mean k=N_semi_7_kappa ] = 0.110549

EXERCISE: update the code for individual PDF construction as introduced above

The PDF construction in the main fitter code should be updated as well. Here is the updated version:

......
    TFile *fin_wspace = new TFile("wspace_constr.root");
    RooWorkspace *wspace = (RooWorkspace*)fin_wspace->Get("wspace");
    .......
    RooSimultaneous *model = new RooSimultaneous("model", "", cate);
    
    RooRealVar *BF_bs = new RooRealVar("BF_bs","",3.57E-9,0.,1E-8);
    RooRealVar *BF_bu = wspace->var("BF_bu");
    RooRealVar *fs_over_fu = wspace->var("fs_over_fu");
    
    RooFormulaVar *N_bs[N_Categories];
    RooAddPdf *pdf_sum[N_Categories];
    RooProdPdf *pdf_prod[N_Categories];
    for(int idx=0; idx<N_Categories; idx++) {
        
        ......
        
        pdf_sum[idx] = new RooAddPdf(Form("pdf_sum_%d",idx), "", pdf_list, N_list);
        
        RooArgList prod_list;
        prod_list.add(*pdf_sum[idx]);
        prod_list.add(*wspace->pdf(Form("Eff_bs_%d_gau",idx)));
        prod_list.add(*wspace->pdf(Form("N_bu_%d_gau",idx)));
        prod_list.add(*wspace->pdf(Form("N_peak_%d_lnn",idx)));
        prod_list.add(*wspace->pdf(Form("N_semi_%d_lnn",idx)));
        pdf_prod[idx] = new RooProdPdf(Form("pdf_prod_%d",idx), "", prod_list);
        
        model->addPdf(*pdf_prod[idx],Form("c%d",idx));
    }

There are two ways to include the constrained PDFs in the fit: one can multiply them into a product PDF (combining all the constraints with RooProdPdf) as shown above, or pass them as ExternalConstraints() in the fitTo() call:

......
    model->fitTo(*rds_data,Extended(true),Minos(RooArgSet(*BF_bs)),
                 ExternalConstraints(RooArgSet(*wspace->pdf("BF_bu_gau"),*wspace->pdf("fs_over_fu_gau"))));

It is recommended to pass the global constraint PDFs to ExternalConstraints(), and to implement the per-category constraints as a product of PDFs (which suits the RooFit internal treatment). Again, the rest of the code can stay the same as before.

The fit is now performed with many more free parameters, although most of them are constrained by the newly added PDFs:

Pos | Name       | type    | Value           | Error +/-
  0 | BF_bs      | fixed   | 4.165862909e-09 | 6.777987392e-10 (at lower limit)
  1 | BF_bu      | limited | 6.044651178e-05 | 1.640414785e-06
  2 | Eff_bs_0   | limited | 0.002619082111  | 1.153090826e-05
......
  9 | Eff_bs_7   | limited | 0.006364220381  | 1.010071755e-05
 10 | N_bu_0     | limited | 266196.2308     | 795.433597
......
 17 | N_bu_7     | limited | 1884067.403     | 5452.51321
 18 | N_comb_0   | limited | 286.1006563     | 17.55374327
......
 25 | N_comb_7   | limited | 4913.472791     | 71.93819959
 26 | N_peak_0   | limited | 6.233823826e-07 | 2.163720571e-06
......
 33 | N_peak_7   | limited | 0.0005014437129 | 0.001595842986
 34 | N_semi_0   | limited | 15.49556334     | 2.6824379
......
 41 | N_semi_7   | limited | 181.752032      | 24.77093462
 42 | comb_B1_0  | limited | 0.5504284884    | 0.05367459644
......
 49 | comb_B1_7  | limited | 0.5600463199    | 0.0124970767
 50 | fs_over_fu | limited | 0.2443940214    | 0.01674567941

You should observe that the uncertainty reported for BF_bs is significantly larger than the one reported earlier (task #6-1), which shows the effect of including the systematic uncertainties. The fit projections are very similar, since most of the parameters take values similar to before (again, the projection and fit here use a BDT threshold of 0.8; your result may look different!):
task6_5.png

EXERCISE: update the main fitter code as introduced above, verify the updated uncertainties reported by Minuit

EXERCISE: perform the profiled likelihood scan for the fitter with constrained nuisances

EXERCISE: carry out the toy MC studies for the fitter with constrained nuisances

In fact, in the presence of constrained nuisances the toy studies have to be performed with care: either

  • the constrained nuisance parameters have to be randomized for each toy sample (i.e. the constrained parameters N_bu, N_peak, N_semi, ...),
  • or the constraints themselves have to be randomized (i.e. the constraint-PDF parameters N_bu_mean, N_peak_mean, N_semi_mean, ...); a minimal sketch is given below.
The latter method is usually considered the frequentist treatment, since the constraint-PDF parameters can be regarded as a complementary (auxiliary) dataset; it is typically introduced for a model post-fit to data. If the randomization is not performed properly, the pull distribution will not agree with a standard Gaussian.
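
As an illustration of the second option, here is a hypothetical sketch that redraws the Gaussian constraint means before generating each toy. The variable names follow build_gaussian_constraint(), and nbu_center[] is a hypothetical array caching the original central values outside the toy loop:

......
    // redraw the Gaussian "means" (the global observables) around their
    // original centers before generating each toy sample
    for(int idx=0; idx<N_Categories; idx++) {
        RooRealVar *mean  = wspace->var(Form("N_bu_%d_mean",idx));
        RooRealVar *sigma = wspace->var(Form("N_bu_%d_sigma",idx));
        mean->setVal(gRandom->Gaus(nbu_center[idx],sigma->getVal()));
    }
    // repeat for the other constraints (the log-normal ones can be redrawn
    // by sampling the corresponding RooLognormal PDFs), then call
    // model->generate() and fit as before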

Task #7: Fit to the unblind data and produce the results

Now you should have full control of the fits and be able to perform the analysis on the unblinded data. Get the ROOT file from the facilitator and redo the final analysis. Prepare your presentation accordingly!

EXERCISE: perform the analysis with unblind data

You have certainly spotted the "pickevents" file pickevents_2016_2ps_wgt1.root, which contains some of the $B_s\to\mu\mu$ signal-like candidates in AOD format. It is provided so that you can play with the Fireworks event display (see WorkBookFireworks). Please follow the description on the Fireworks twiki to prepare your environment, then load the pickevents file and "see" the events with your own eyes! An example command is given after the remark below.
task7_1.png

Remark: the file pickevents_2016_2ps_wgt1.root was produced with CMSSW_8_X, so one may need to use an older version of Fireworks, or enable the --no-version-check option.
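
For example, assuming the file has been copied to your working directory:

cmsShow --no-version-check pickevents_2016_2ps_wgt1.root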
