Tests of the new 2020 BDT
The most recent versions of the analysis ntuples were merged and the new BDT algorithm was applied to them. The results can be compared to the 2015/16 analysis data with the 2016 BDT.
BDT 2020
This BDT uses the following list of discriminating variables:
No pileup variable was used in this BDT.
Running
When the addClassBDT_2020_groupVersion.cpp macro is run, it reports a missing branch in the input ntuples:
Error in <TTree::SetBranchStatus>: unknown branch -> closeTrkDOCA_T0134217728_LooSiHi1Pt05_f2dc2
Error in <TTree::SetBranchAddress>: unknown branch -> closeTrkDOCA_T0134217728_LooSiHi1Pt05_f2dc2
This should be one of the variables entering the BDT, so it needs to be clarified whether the missing branch affects the results.
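A pre-flight check of the input branches could make this kind of problem visible before the BDT is evaluated. The sketch below is only an illustration in plain Python: the function and the lists are hypothetical, and only the first branch name is taken from the log above. With PyROOT the available names could be collected from `tree.GetListOfBranches()`.

```python
def find_missing_branches(required, available):
    """Return the required branch names that are absent from the tree."""
    present = set(available)
    return [name for name in required if name not in present]

# Only the first name comes from the error message above;
# "B_pT" and "B_mass" are hypothetical placeholders.
required = ["closeTrkDOCA_T0134217728_LooSiHi1Pt05_f2dc2", "B_pT"]
available = ["B_pT", "B_mass"]
missing = find_missing_branches(required, available)
print("missing branches:", missing)
```

Aborting when the returned list is not empty would prevent the BDT from being evaluated with an unset input.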
SB data comparison
Number of entries in SB, Original ntuple: 2452402
Number of entries in SB, New ntuple: 2454693
Difference Orig - New: -2291
i.e. there are 2291 more events in the new ntuple. The analysis preselections are applied to both.
The following figures show the invariant mass distribution in the top three 18% signal efficiency bins.
Corresponding bin edges:
2016 BDT ... {0.2455,0.3312,0.4163,1}
2020 BDT ... {0.2774,0.3662,0.4418,1}
Mass distribution in BDT bin 1 |
Mass distribution in BDT bin 2 |
Mass distribution in BDT bin 3 |
Data SB blinded fits in top three bins
First, the fitter was validated against the shape parameters obtained from data SB fits in all four bins as presented in the 2015/2016 internal note. The model used in both cases is a 1st-order Chebychev polynomial plus an exponential. The Chebychev slope is constrained to be linear in BDT, and the exponential constant is constrained to be the same in all BDT bins.
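The structure of this model can be sketched as follows. This is not the fitter code. It assumes that the slope in a bin is slope_0 + slope_1 * (average BDT of the bin), that the mass is mapped onto [-1, 1] for the Chebychev term, and that both components are normalised on the fit range; all function names and the numbers in the example call are hypothetical.

```python
import math

def comb_slope(av_bdt, slope_0, slope_1):
    """Chebychev slope, constrained to be linear in the average BDT of the bin."""
    return slope_0 + slope_1 * av_bdt

def bkg_density(mass, av_bdt, n_comb, n_sssv, slope_0, slope_1, exp_const,
                m_lo, m_hi):
    """Expected background events per unit mass in one BDT bin."""
    width = m_hi - m_lo
    # Map the mass onto [-1, 1], the domain of the Chebychev polynomials.
    x = 2.0 * (mass - m_lo) / width - 1.0
    # 1st-order Chebychev: T0 + slope * T1; the T1 term integrates to zero.
    comb_pdf = (1.0 + comb_slope(av_bdt, slope_0, slope_1) * x) / width
    # Exponential with one constant shared by all bins (exp_const != 0 assumed).
    norm = (math.exp(exp_const * m_hi) - math.exp(exp_const * m_lo)) / exp_const
    sssv_pdf = math.exp(exp_const * mass) / norm
    return n_comb * comb_pdf + n_sssv * sssv_pdf

# Hypothetical numbers, for illustration only.
print(bkg_density(5000.0, 0.3, 100.0, 50.0, -0.2, 0.1, -0.005, 4766.0, 5966.0))
```

Only the two yields differ freely between bins in this parametrisation; the slope is tied across bins through the two slope parameters and the exponential constant is common.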
Fitter validation against 15/16 bkg shape parameters in data SB
Next, this fitter was used to perform the same flavour of the fit in the top three bins only, comparing the 2016 and 2020 BDTs:
Background yields comparison:
| Combinatorial | 2016 BDT | 2020 BDT |
| Bin 1 | 1.0950e+03 +/- 7.29e+01 | 4.4930e+02 +/- 5.53e+01 |
| Bin 2 | 1.6882e+02 +/- 3.04e+01 | 6.8667e+01 +/- 1.65e+01 |
| Bin 3 | 2.2281e+01 +/- 9.91e+00 | 2.1687e+01 +/- 1.07e+01 |

| SSSV | 2016 BDT | 2020 BDT |
| Bin 1 | 2.1811e+02 +/- 4.79e+01 | 1.8530e+02 +/- 3.90e+01 |
| Bin 2 | 1.3447e+02 +/- 2.32e+01 | 8.4999e+01 +/- 1.37e+01 |
| Bin 3 | 3.4279e+01 +/- 8.52e+00 | 2.1437e+01 +/- 8.08e+00 |
Unblinding - re-applying preselection cuts on loose ntuples
The mass spectrum looks strange + there are some negative-mass entries.
Unblinded region - missing entries |
|
Checking Bs MC
These are the mass distribution comparisons between the 2016 BDT applied to the old derivation and 2020 BDT applied to the new derivation.
Bin 0 lower edge
The lower edge of BDT bin 0 (72 % signal efficiency) was identified by ordering the MC events by increasing BDT value and summing their weights (CombWeights branch) until the ratio (summed weights)/(total sum of weights) reached 0.28, i.e. until 72 % of the weighted signal lies above that BDT value. The result is:
Crossed 0.72 signal efficiency point at BDT value: 0.164033
Previous entry has BDT: 0.164031
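The procedure described above can be sketched in a few lines. This is a minimal illustration, not the macro that was actually run; the function name and the toy inputs are hypothetical.

```python
def bdt_edge(bdt_values, weights, signal_efficiency):
    """BDT value at which the running weight fraction reaches 1 - efficiency."""
    total = sum(weights)
    target = (1.0 - signal_efficiency) * total
    running = 0.0
    # Walk through the events in order of increasing BDT value.
    for bdt, weight in sorted(zip(bdt_values, weights)):
        running += weight
        if running >= target:
            return bdt
    raise ValueError("target weight fraction was never reached")

# Toy input: five events, one of them carrying a double weight.
edge = bdt_edge([0.5, 0.1, 0.3, 0.2, 0.4], [1.0, 1.0, 2.0, 1.0, 1.0], 0.5)
print(edge)
```

Calling the same function with efficiencies 0.18, 0.36, 0.54 and 0.72 gives the full set of bin edges from a single weight definition.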
An attempt was made to validate, with the same approach, the other 2020 BDT bin edges found earlier by Aidan:
What was found : 18 % eff ... 0.439777, 36 % eff ... 0.363047, 54 % ... 0.274418, 72 % ... 0.164033
What Aidan found: 18 % eff ... 0.4418, 36 % eff ... 0.3662, 54 % ... 0.2774
UPDATE 25.9.20 :
The disagreement observed in the bin edges was due to a wrong calculation of the weights. The "CombWeights" branch contains only the QLC*DDW weights, and these need to be multiplied further by PVWeight, Muon{1,2}_trigger_sf and Muon{1,2}_reco_eff_sf. With that we indeed get the same bin edges as Aidan found, together with the (hopefully this time) correct edge of the 0th bin:
18 % eff ... 0.441817, 36 % eff ... 0.366231, 54 % eff ... 0.277443, 72 % eff ... 0.167089
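The corrected per-event weight is simply the product of all these factors. The snippet below is a sketch; it assumes that the Muon{1,2} shorthand expands to the four branch names listed, and the event dictionary is a hypothetical stand-in for one ntuple entry.

```python
import math

# Assumed expansion of the Muon{1,2}_* shorthand used above.
WEIGHT_BRANCHES = [
    "CombWeights",        # QLC * DDW only
    "PVWeight",
    "Muon1_trigger_sf",
    "Muon2_trigger_sf",
    "Muon1_reco_eff_sf",
    "Muon2_reco_eff_sf",
]

def event_weight(event):
    """Full per-event weight: product of all weight branches."""
    return math.prod(event[name] for name in WEIGHT_BRANCHES)

# Hypothetical entry with made-up values.
example = {
    "CombWeights": 2.0, "PVWeight": 0.5,
    "Muon1_trigger_sf": 1.0, "Muon2_trigger_sf": 1.0,
    "Muon1_reco_eff_sf": 1.0, "Muon2_reco_eff_sf": 3.0,
}
print(event_weight(example))
```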
2016 BDT values of events in 2020 BDT bins and vice versa
Regarding the two MC derivations:
2016 derivation: 166218 events in total, 120682 in the top 4 BDT bins, out of those 113728 match an event from the 2020 derivation
2020 derivation: 166752 events in total, 118016 in the top 4 BDT bins, out of those 110886 match an event from the 2016 derivation
there are 156694 events shared between the two derivations
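How the events were matched is not spelled out here. One common way, assumed in the sketch below, is to compare (run number, event number) pairs, with one candidate per event; the field names "run" and "event" are hypothetical.

```python
def event_keys(events):
    """Set of (run number, event number) pairs identifying the events."""
    return {(entry["run"], entry["event"]) for entry in events}

def common_events(events_a, events_b):
    """Keys of the events present in both derivations."""
    return event_keys(events_a) & event_keys(events_b)

# Toy input with made-up run and event numbers.
derivation_2016 = [{"run": 1, "event": 10}, {"run": 1, "event": 11}, {"run": 2, "event": 10}]
derivation_2020 = [{"run": 1, "event": 11}, {"run": 2, "event": 10}, {"run": 3, "event": 5}]
shared = common_events(derivation_2016, derivation_2020)
print(len(shared))
```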
Taking into account only the events in common, the current bin edges still correspond more or less to 18% signal efficiencies:
| Common events - efficiency | 2016 BDT | 2020 BDT |
| bin 0 | 0.178217 | 0.180164 |
| bin 1 | 0.182015 | 0.179978 |
| bin 2 | 0.184372 | 0.179547 |
| bin 3 | 0.19332 | 0.180383 |
Full Fit on 2016/2020 BDT
| feature | 2016 analysis fitter | new fitter |
| comb bkg (chebychev) | yes | yes |
| sssv bkg (exponential) | yes | yes |
| Bs (double gaussian) | yes | yes |
| Bd (double gaussian) | yes | yes |
| Peaking bkg (double gaussian) | yes | yes |
| Peaking bkg constraint | yes | yes |
| Smearing parameters | yes | no |
| Relative efficiency in bins | yes | no |
| BDT mean + constraint | no | yes |
Validation of the new fitter
In terms of Bs/Bd yields:
| | N Bs | N Bd |
| new fitter | 80.83 +/- 21.0 | -10.96 +/- 19.1 |
| 15/16 result | 80 +/- 22 | -12 +/- 20 |
Fitter validation BDT bin 0 |
Fitter validation BDT bin 1 |
Fitter validation BDT bin 2 |
Fitter validation BDT bin 3 |
The model fitted by the new fitter has the background parameters initialized from the individual fits and the signal normalization initialized with the SM expectations: 91 Bs and 10 Bd.
New BDT
Mass projections:
Fit of the 2020 derivation, BDT bin 0 |
Fit of the 2020 derivation, BDT bin 1 |
Fit of the 2020 derivation, BDT bin 2 |
Fit of the 2020 derivation, BDT bin 3 |
fitLog_2020BDT_allBins.txt |
Bs yield: 75 +/- 18
Bd yield: -9 +/- 16
Shape-wise comparison: new vs. old derivation with 2016 BDT
Apr 22: working on the reference channel fit, some x-checks are in place to see if the new derivation is OK. It was projected vs. the 15/16 derivation, v2 ntuples. Note that the 2020 v4 is not the latest greatest at this point, as it has been replaced by 2021 v2. The 2021 v2 vs 2020 v4 comparison will be done by Joe on the reference channel, here I'm adding the missing link comparing 2020 derivation to 15/16 derivation. It's the full 15/16 Bmumu data.
Shape-wise comparison of 2020 derivation to 15/16 one, under the 15/16 BDT |
Running the ntupling on data17_Main data18_Main
Gradually moving to the latest ntuples (v2, 2021 derivation) in May 2022. These have not yet been produced for data17_Main and data18_Main. I got the how-to on running the ntupling from here:
https://twiki.cern.ch/twiki/bin/viewauth/AtlasProtected/BPhysicsRareDecaysBsmumuNtupleMaker .
I first checked that it works on 2015 period D Bmumu data: I reproduced the ntuples and cross-checked them against what's on eos. The distributions of several variables seem identical in the reference and in my reproduction. I also checked the BJpsiK channel - all seems to work there as well. I'm therefore ready to ntuple the data17_Main and data18_Main.
Ntupling for 2017_Main and 2018_Main was performed successfully after several resubmissions of failed jobs.
Preselection code was updated for the Bmumu process as well to account for the int-like trigger matching description in v2 ntuples.
Then I found out that some of the branch names were changed - those of the isolation variables that enter the BDT calculation. They were, however, not changed in the BDT applicator itself, so now what? Jesus Christ, why did they change them from one nonsensical description to another? Why not just leave the naming as it was?? Anyway, the change is described here:
https://indico.cern.ch/event/968954/contributions/4078161/attachments/2129095/3585188/brare_ww_20201023-1.pdf , so essentially we need to change the iso in the BDT applicator to BEJ and the dca/zca to BEL. However, there are two occurrences of these in the applicator: one for the TMVA reader of the xml weight files, the other for the input/output tree branch name. The first must not be changed, the latter has to be.
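One way to keep the two kinds of names apart is an explicit mapping from the reader variable name (as stored in the xml weight files) to the branch name in the tree. The sketch below only illustrates that idea; all names in it are hypothetical placeholders, not the real variable or branch names.

```python
# Keys: variable names as booked in the TMVA xml weight files (must not change).
# Values: branch names in the v2 ntuples (renamed). All names are placeholders.
READER_TO_BRANCH = {
    "iso_oldTag": "iso_BEJ",
    "dca_oldTag": "dca_BEL",
    "zca_oldTag": "zca_BEL",
}

def reader_inputs(tree_entry, reader_to_branch):
    """Collect the BDT inputs keyed by reader name, read from renamed branches."""
    return {reader: tree_entry[branch]
            for reader, branch in reader_to_branch.items()}

# Hypothetical tree entry with made-up values.
entry = {"iso_BEJ": 0.9, "dca_BEL": 0.1, "zca_BEL": 0.2}
print(reader_inputs(entry, READER_TO_BRANCH))
```

With such a table, a future renaming in the ntuples touches only the values, while the keys stay aligned with the trained weight files.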
After applying the above-mentioned changes to the preselection and the BDT macro, I ran the BDT on the v2 ntuples and cross-checked the output mass distribution against the 2020 v4 ntuples that were used for the BDT comparisons earlier (2015/16 data SB - the v2 ntuples have not been unblinded). This tests the new tools on the new derivation vs. the older tools on the older derivation.
SB mass distribution, 2015 & 2016 data, top 4 2020 BDT bins |
difference thereof |
Shape-wise comparison of 2020 derivation to 15/16 one, under the 15/16 BDT |
|
FR2 SB fits
The same model as used in 15/16 was fitted to the Full Run 2 sample sidebands. The BDT bins were used as determined in the previous studies on the 2020 v4 ntuples, i.e. BDTBins = {0.1671,0.2774,0.3662,0.4418,1}.
| BDT bin 0 | BDT bin 1 | BDT bin 2 | BDT bin 3 |
The fit result:
result.root
For future convenience, let's list the parameter values in the bins:
| | av BDT | nCOMB | nSSSV |
| bin 0 | 2.0474e-01 | 2.1862e+04 +/- 3.41e+02 | 2.0076e+03 +/- 2.29e+02 |
| bin 1 | 3.1154e-01 | 1.9449e+03 +/- 1.29e+02 | 8.9262e+02 +/- 9.70e+01 |
| bin 2 | 3.9776e-01 | 2.4779e+02 +/- 3.89e+01 | 4.3567e+02 +/- 3.53e+01 |
| bin 3 | 4.7235e-01 | 4.8540e+01 +/- 1.66e+01 | 1.8126e+02 +/- 1.81e+01 |

| parameter | value |
| expConst | -7.5872e-03 +/- 6.00e-04 |
| slope_0_cmb | -2.0741e-01 +/- 1.47e-01 |
| slope_1_bdt | 1.5670e-02 +/- 7.30e-01 |
This was a bit rushed - when I look back at the evolution of the shape parameters with average BDT, I see that we may not want to introduce the linear trend into the fit for the COMB slope:
Evolution of bkg shape parameters in the FR2 data sidebands |
Preliminary MC mixture
We don't yet have all the weights, but we need to mix MC16a, d and e approximately according to the collected effective luminosities. The effective luminosities per period are in the BLS trigger table:
BphysTriggers_FullRun2_(1).xls
We have multiple triggers in 2017 and 2018. The same approach as in Run 1 was adopted: divide the events into mutually exclusive categories with Trigger_1 && !(Trigger_2 || Trigger_3 || ...), Trigger_2 && !(Trigger_3 || ...), with the triggers ordered by descending prescale (the high-prescale ones are the most exclusive). The events in each category were then weighted by the ratio of the effective luminosity collected by that trigger to that of the least prescaled trigger. E.g. in 2018 we have:
Category 1: HLT_2mu4_bBmumu_Lxy0_L1BPH-2M9-2MU4_BPH-0DR15-2MU4 && !(HLT_mu6_mu4_bBmumu_Lxy0_L1BPH-2M9-MU6MU4_BPH-0DR15-MU6MU4 || HLT_2mu6_bBmumu_Lxy0_L1BPH-2M9-2MU6_BPH-2DR15-2MU6)
Category 2: HLT_mu6_mu4_bBmumu_Lxy0_L1BPH-2M9-MU6MU4_BPH-0DR15-MU6MU4 && !(HLT_2mu6_bBmumu_Lxy0_L1BPH-2M9-2MU6_BPH-2DR15-2MU6)
Category 3: HLT_2mu6_bBmumu_Lxy0_L1BPH-2M9-2MU6_BPH-2DR15-2MU6
Category 1 will be weighted by 26.2/62.739, Category 2 will be weighted by 53.353/62.739
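The category assignment and the luminosity weights for 2018 can be sketched as below. Since each category requires its trigger and vetoes all less prescaled ones, the category is simply given by the least prescaled trigger that fired. The function name is hypothetical, and the weight of 1 for Category 3 is an assumption following from the ratio definition.

```python
# 2018 triggers ordered by descending prescale (Category 1 first).
TRIGGERS_2018 = [
    "HLT_2mu4_bBmumu_Lxy0_L1BPH-2M9-2MU4_BPH-0DR15-2MU4",
    "HLT_mu6_mu4_bBmumu_Lxy0_L1BPH-2M9-MU6MU4_BPH-0DR15-MU6MU4",
    "HLT_2mu6_bBmumu_Lxy0_L1BPH-2M9-2MU6_BPH-2DR15-2MU6",
]
# Effective-luminosity ratios quoted above; Category 3 assumed to have weight 1.
LUMI_WEIGHT_2018 = {1: 26.2 / 62.739, 2: 53.353 / 62.739, 3: 1.0}

def trigger_category(fired, ordered_triggers):
    """Category = position of the least prescaled trigger that fired."""
    for index in range(len(ordered_triggers), 0, -1):
        if fired.get(ordered_triggers[index - 1], False):
            return index
    return None  # no trigger fired: the event is not used

# Example: only the 2mu4 trigger fired, so the event lands in Category 1.
only_2mu4 = {TRIGGERS_2018[0]: True}
category = trigger_category(only_2mu4, TRIGGERS_2018)
print(category, LUMI_WEIGHT_2018[category])
```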
This weighting leads to the following behaviour on the preselected FR2 sample of Bs MC, which is otherwise unweighted and has no BDT cut imposed:
Effective luminosity weighting on Bs MC, FR2 |
As it happens, in my immense ignorance I didn't get this quite right. The above weights may reproduce the average reweighting, but what we need rather is a per-event weight reflecting the prescales at a given luminosity (~= pileup) value. These are, in a slightly convoluted way, already present in some form in the ntuples as "PV weights".
The PV weights are calculated by the PRW tool (Athena) during the ntupling process. The PRW tool uses two inputs:
1. pileup profiles generated in the MC (NTUP_PILEUP files on the grid) - sample specific
2. luminosity files generated by the lumicalc tool based on GRL and trigger - these include the pileup information and corresponding trigger prescale in each LB in data.
The MC events are then assigned a random run number and, based on their generated \mu, a weight accounting for the prescale of a given trigger in that run, averaged over LBs with the same average \mu. On top of the prescale weight, a weight reproducing the actual average \mu profile in data is applied - correct me if I'm wrong - by taking the \mu profile histograms in MC (generated) and data, dividing them and using the ratio to reweight the MC based on its actual generated \mu. The documentation of the PRW tool is here:
https://twiki.cern.ch/twiki/bin/view/AtlasProtected/ExtendedPileupReweighting.
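The profile-ratio part of this reweighting can be illustrated with a toy in plain Python. This is not the PRW tool or its API, and it leaves out the prescale weight entirely; the function names, the binning and the histogram contents are hypothetical.

```python
import bisect

def mu_profile_weights(data_hist, mc_hist):
    """Per-bin weight: normalised data profile over normalised MC profile."""
    n_data = sum(data_hist)
    n_mc = sum(mc_hist)
    return [(d / n_data) / (m / n_mc) if m > 0 else 0.0
            for d, m in zip(data_hist, mc_hist)]

def mu_weight(mu, bin_edges, weights):
    """Look up the weight of the bin containing the generated mu."""
    index = bisect.bisect_right(bin_edges, mu) - 1
    if index < 0 or index >= len(weights):
        return 0.0  # mu outside the histogram range
    return weights[index]

# Toy profiles in two mu bins: [0, 20) and [20, 40).
edges = [0.0, 20.0, 40.0]
weights = mu_profile_weights([1.0, 3.0], [2.0, 2.0])
print(mu_weight(10.0, edges, weights), mu_weight(25.0, edges, weights))
```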
In the actual use, there's one PRW tool for each trigger category. The categories, fortunately, correspond to what's above, i.e. we've got for N triggers:
Cat 1: Softest_trigger && !(||all other triggers)
Cat 2: Second_Softest_trigger && !(||(Third_Softest_trigger ... Stiffest_Trigger))
...
Cat N: Stiffest_trigger
The categories are most relevant for 2017 and 2018, although they have been used for 2015 as well (see later). The PRW tool for each category is then fed the lumifiles of each trigger that falls into it. The prescale weight is then handled for the OR of the triggers, based on their actual prescale at a given \mu.
Contrary to the baseline use of the 2015's mu6mu4 trigger for Bmumu, a category of 2mu4 && mu6mu4 was introduced in the ntuples too, but is then discarded at the preselection (going from Loose to Nominal).
The PRW tool inputs used for the v2 ntuples are as follows:
PRW files found in: /eos/atlas/atlascerngroupdisk/phys-beauty/BsMuMuRun2/PRW/v_05, obtained according to
https://gitlab.cern.ch/atlas-physics/beauty/rare/bmumu-run2/AnalysisTools/-/blob/master/Pileup_Files/mc16d_get_ntup_prw.sh. These are the standard NTUP_PILEUP files for each of our MC samples (process, campaign). Downloaded from the grid and renamed for convenience. I tried to re-download mc16e bsmumu and cross-checked against the one in the folder - they are identical.
LUMI files found in: /eos/atlas/atlascerngroupdisk/phys-beauty/BsMuMuRun2/Lumifiles/v_10, obtained with online lumicalc tool. It used recommended GRLs from
https://twiki.cern.ch/twiki/bin/viewauth/AtlasProtected/GoodRunListsForAnalysisRun2 which have not changed since then - the website lists the same ones as used in the lumicalc tool. I've cross-checked this on the logfiles that are in the above folder. Aside from that, the recommended settings of the lumicalc tool were used, including the LAr veto (LARBadChannelsOflEventVeto -RUN2-UPD4-10).
In summary, all looks up to date even today.
Fits to FR2 bkg MC
With the weighting checked, I proceeded to make projections of the background shapes on the PV-weighted FR2 bkg MC, in the 4 preliminary BDT bins
| BDT bin 0 | BDT bin 1 | BDT bin 2 | BDT bin 3 |
| | av BDT | nCOMB | nSSSV |
| bin 0 | 2.0504e-01 | 7.8427e+04 +/- 3.15e+02 | 1.7570e+03 +/- 1.51e+02 |
| bin 1 | 3.1022e-01 | 8.2781e+03 +/- 1.09e+02 | 1.7570e+03 +/- 1.51e+02 |
| bin 2 | 3.9497e-01 | 1.0160e+03 +/- 3.90e+01 | 8.0790e+02 +/- 3.62e+01 |
| bin 3 | 4.7586e-01 | 1.3047e+02 +/- 1.38e+01 | 3.1443e+02 +/- 1.93e+01 |

| parameter | value |
| expConst | -1.4041e-02 +/- 5.46e-04 |
| slope_0_cmb | -2.6408e-01 +/- 4.18e-02 |
| slope_1_bdt | 8.6599e-02 +/- 1.94e-01 |
Evolution of bkg shape parameters in the FR2 MC fit |
--
OndrejKovanda - 2020-08-31