Individual fits performed by simultaneous machinery - BDT bins, data SB; Three BDT bins, data SB; Four BDT bins, data SB; Four BDT bins, MC; Linear slope and constant expConst, MC and data SB; Comparison of BDT errors
Independent fits to Bs and Bkg samples in BDT and η evolutions
MC fits
As a sanity check, the individual fits were redone using the simultaneous machinery to cross-check the consistency of the two approaches. The individual fits done in this way are called independent here. When the high-statistics MC is used as the fitted sample for both signal and background, the results of the independent fits agree with those of the individual fits to many decimal places.
Bs MC, BDT evolution:
BsMC_indep_BDTEv.txt
Bs MC, η evolution:
BsMC_indep_etaEv.txt
Bkg MC, BDT evolution:
BkgMC_indep_BDTEv.txt
Bkg MC, η evolution:
BkgMC_indep_etaEv.txt
The case of independent Bkg fits to sideband data
There are two problems with simultaneous fits to the sideband data. First, the statistics in certain bins can be insufficient, so the individual fits provide badly wrong starting values and the entire simultaneous fit goes wrong. Second, the starting values of the extended parameters (nComb, nTot) are provided by the individual fits in the sidebands only, but the independent fits require them over the full range. Therefore, nTot is initialized as the number of events in the entire sample, and nComb as the number of events in the entire sample scaled by the ratio of nComb to nTot resulting from the individual fits.
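The initialization of the extended parameters described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name, argument names and numbers are hypothetical and are not taken from the fitting code.

```python
def initial_yields(n_events_full, ncomb_individual, ntot_individual):
    """Starting values (nTot, nComb) for an independent full-range fit.

    nTot starts at the number of events in the entire sample; nComb is
    that number scaled by the nComb/nTot ratio from the individual fit.
    """
    if ntot_individual <= 0:
        raise ValueError("ntot_individual must be positive")
    ntot_start = float(n_events_full)
    ncomb_start = n_events_full * ncomb_individual / ntot_individual
    return ntot_start, ncomb_start


# Hypothetical numbers, for illustration only.
ntot0, ncomb0 = initial_yields(5000, ncomb_individual=900.0, ntot_individual=1200.0)
print(ntot0, ncomb0)
```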
In the case of Bkg sideband data fitting, the simultaneous procedure in the BDT evolution fails because of the last bin, which has very low statistics (Approximated covariance matrix, MINIMIZE=1, HESSE=4) and an unnatural starting value of the slope. If the last two BDT bins are merged, the fit converges and the results are largely compatible.
Bkg Data, BDT evolution, merged bins:
BkgData_indep_BDTev_merged.txt
Without merging the bins, and initializing the simFit with the initial values of the individual RooFit step (which take care of the full-range normalization and set the slope to zero whenever it becomes positive), we obtain compatible results in all but the last BDT bin. These plots compare the simFit and its components with the result of the individual RooFit step (log: simFit_bkgSB.txt).
There are enough events in all η bins of the sideband data, and the simultaneous fit in this evolution converges without changes to the binning. The discrepancy with respect to the individual fits is about 1% in the shape parameters, calculated as 100*(individual-independent)/individual.
Bkg Data, η evolution:
BkgData_indep_etaEv.txt
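For reference, the discrepancy measure quoted for the η evolution, 100*(individual-independent)/individual, can be written as a small helper. The function name and the example values are hypothetical; the actual fitted values are in the log files.

```python
def relative_discrepancy_percent(individual, independent):
    """Discrepancy in percent: 100 * (individual - independent) / individual."""
    if individual == 0:
        raise ZeroDivisionError("individual fit value is zero")
    return 100.0 * (individual - independent) / individual


# Hypothetical shape-parameter values, for illustration only.
print(relative_discrepancy_percent(individual=-2.00, independent=-1.98))
```

Note that the sign of the result depends on the sign of the individual value, so comparing absolute values may be more convenient when parameters of different sign are tabulated together.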
Individual fits performed by simultaneous machinery - BDT bins, SB data
Another cross-check was performed: the fits in BDT bins of the Bkg sideband data were done bin by bin, i.e. individually, but using the full simultaneous approach, i.e. defining an index category for the bin, creating a 'combined' dataset (containing only one category) and fitting it with a RooSimultaneous (again containing the one category and one pdf). Interestingly, the results are in good agreement in the two middle bins, and in total agreement in the first and the last bin.
This has been done for parameter limits expConst = [-10, 10], slope = [-1000, 1000], nComb = nTot = [-10 000, 1 000 000]. By setting the slope limits to [-100, 100], the agreement in the two middle bins is improved.
Independent simultaneous fits in two BDT bins, SB data
In general, if we fit in bins that do not include the isolated bin 4, such as 1+2, 2+3, (1+2)+(3+4), the agreement between the values obtained from the simultaneous fit and those from the last RooFit step of the individual fitting is good:
BDT1+2_sim.txt,
BDT2+3_sim.txt,
BDT1+2-3+4_mergedSim.txt.
However, whenever bin 4 is involved, the fit does not finish properly: the covariance matrix is not positive-definite. Examples for simFits (1+2+3)+4, 3+4:
BDT1+2+3-4_mergedSim.txt,
BDT3+4_Sim.txt. The limits of the parameters expConst and slope were varied to see whether the fit would converge properly for some combination, but no such combination was found.
UPDATE: The combination of limits expConst = [-1000, 1000], slope = [-5000, 5000], nComb = nTot = [-100 000, 100 000] seems to have the desired effect: the covariance matrix is now positive-definite. The fourth bin still converges to a different minimum.
Process of constraining the parameters
Example: BDT evolution of expConst. We want to introduce the dependence of the SSSV bkg exp constant on the average BDT value. The overall bkg pdf then becomes a '2D' pdf in mass and average BDT value. The average BDT value is itself a random variable and has a pdf. We're therefore dealing with a conditional probability, and the joint pdf factorizes as pdf(m|params(<bdt>))*pdf(<bdt>).
Simultaneously with introducing expConst as a RooFormulaVar, we need to introduce the pdf for <bdt> for each bin. This is a Gaussian with an unknown, fitted mean <bdt>True and a fixed (not fitted) width. The dependence of the parameter is then: expConst = e_0 + e_1*<bdt>True. The parameters e_0, e_1 are fitted, and initialized with the respective values obtained from the individual fits. The <bdt>True is initialized with the (sub)sample BDT mean, and the width of the pdf(<bdt>) is fixed to the uncertainty on the mean, i.e. the standard deviation of the (sub)sample BDT values divided by sqrt(N). The pdf(<bdt>) is then used as an external constraint, i.e. the overall likelihood is multiplied by this term.
In practice, there's a loop over BDT and η bins. Before this loop, the evolution variables are initialized as RooRealVar. Inside the loop, the pdf(<bdt>) is initialized, together with the RooFormulaVar corresponding to the actual evolved variable.
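The ingredients of the constraint described above can be illustrated without RooFit. The following is a minimal plain-Python sketch, not the actual fitting code: all function names, the toy BDT values and the values of e_0, e_1 are hypothetical, and the sample standard deviation (N-1 in the denominator) is an assumption. Multiplying the likelihood by the Gaussian pdf(<bdt>) is equivalent to adding its negative logarithm to the negative log-likelihood, which is what the last function returns.

```python
import math
import statistics


def constraint_inputs(bdt_values):
    """Return (mean, uncertainty on the mean) of the BDT values in one bin."""
    n = len(bdt_values)
    mean = statistics.fmean(bdt_values)
    # Assumption: sample standard deviation, divided by sqrt(N).
    sigma_mean = statistics.stdev(bdt_values) / math.sqrt(n)
    return mean, sigma_mean


def exp_const(e0, e1, bdt_true):
    """Linear evolution of the parameter: expConst = e_0 + e_1 * <bdt>True."""
    return e0 + e1 * bdt_true


def gaussian_constraint_nll(bdt_true, bdt_mean, sigma_mean):
    """-log of the Gaussian pdf(<bdt>) with fixed width (added to the fit NLL)."""
    pull = (bdt_true - bdt_mean) / sigma_mean
    return 0.5 * pull * pull + math.log(sigma_mean * math.sqrt(2.0 * math.pi))


# Toy BDT values per bin and toy evolution parameters (hypothetical).
bins = {"bin1": [0.10, 0.12, 0.15, 0.11], "bin2": [0.30, 0.33, 0.29, 0.35]}
e0, e1 = -1.0, -2.0  # shared evolution parameters, defined before the loop

for name, values in bins.items():
    mean, sigma = constraint_inputs(values)
    bdt_true = mean  # starting value of the fitted <bdt>True
    print(name, exp_const(e0, e1, bdt_true),
          gaussian_constraint_nll(bdt_true, mean, sigma))
```

As in the setup described above, the evolution parameters live outside the bin loop, while the constraint and the evolved parameter are built once per bin.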
Constraining expConst and slope of the BKG model over BDT bins, data SB
For the time being, the fitting is set up in the following way:
The sim fits are initialized with the same values with which the RooFit step of the individual fitting chain is initialized. This provides a sort of reference and can easily be changed to use the final values of this step. In other words, the simFit is initialized with the values resulting from the last ROOT step, where the combinatorial slope is set to zero whenever it's non-negative. The evolution parameters are taken from the results of the last ROOT step, or can be set by hand. This can also be changed.
29-5-20 UPDATE: The simFits are initialized with the final values of the RooFit step in the individual fitting chain. Evolution parameters are also taken from the results of this step.
As a sanity check, the expConst was constrained to be linear over the two bins. In that case, if the machinery works correctly, the result should not change with respect to the final RooFit step of the individual fitting chain (fitting two points with a line). This is indeed the case:
29-5-20 UPDATE: The updates from above are reflected here: the binning has been changed to match the blinded region edges, comparisons of all parameters have been added to the logs, and the fitted evolution has been overplotted.
Exponential constant evolution
The comparison of the parameter values (RooFit step vs simFit): valuesComp_sanityCheck.txt, fit log: fitLog_sanityCheck.txt
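The reason the two-bin linear constraint must reproduce the individual results is that a line through two points has no remaining freedom. A minimal sketch, with hypothetical names and toy numbers:

```python
def line_through_two_points(x1, y1, x2, y2):
    """Return (e0, e1) such that y = e0 + e1 * x passes through both points."""
    if x1 == x2:
        raise ValueError("the two <bdt> values must differ")
    e1 = (y2 - y1) / (x2 - x1)
    e0 = y1 - e1 * x1
    return e0, e1


# Toy (<bdt>, expConst) pairs for two bins (hypothetical values).
e0, e1 = line_through_two_points(0.2, -1.5, 0.4, -2.5)
# Evaluating the line at the two <bdt> values returns the per-bin inputs,
# up to floating-point rounding.
print(e0 + e1 * 0.2, e0 + e1 * 0.4)
```

With two bins and two evolution parameters the constrained model has as many degrees of freedom as the unconstrained one, so any difference between the simFit and the individual fits would point to a problem in the machinery rather than in the data.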
Next, the expConst was set to be constant:
Values comparison:
valuesComp_expConst_constEv_bdtEv.txt, fit log:
fitLog_expConst_constEv_bdtEv.txt
Now, the expConst was released and the slope was set to be constant.
Values comparison:
valuesComp_constSlope_bdtEv.txt, fit log:
fitLog_constSlope_bdtEv.txt
Three BDT bins
The same for BDT binning 1, 2, 3+4:
Linear dependence of expConst:
Values comparison:
valuesComp_threeBin_expConstLinear_bdtEv.txt, fit log:
fitLog_threeBin_expConstLinear_bdtEv.txt
Linear dependence of slope:
Values comparison:
valuesComp_threeBin_slopeLinear_bdtEv.txt, fit log:
fitLog_threeBin_slopeLinear_bdtEv.txt
Constant dependence of expConst:
Values comparison:
valuesComp_threeBin_expConstConstant_bdtEv.txt, fit log:
fitLog_threeBin_expConstConstant_bdtEv.txt
Constant dependence of slope:
Values comparison:
valuesComp_threeBin_slopeConstant_bdtEv.txt, fit log:
fitLog_threeBin_slopeConstant_bdtEv.txt
Four BDT bins
16-6-2020 UPDATE: A bug was found in the simFit code: the slope was not initialized with the final value of the individual RooFit step, but with the respective initial value. The only significant effect was in the four-bin case, especially when the expConst was linear. Fixed and updated for the four-bin linear and constant expConst fits. This has no effect on the slope evolution plots.
Linear dependence of expConst:
Values comparison:
valuesComp_fourBin_expConstLin_bdtEv_update.txt, fit log:
fitLog_fourBin_expConstLin_bdtEv_update.txt
Constant dependence of expConst:
Values comparison:
valuesComp_fourBin_expConstConstant_bdtEv_update.txt, fit log:
fitLog_fourBin_expConstConstant_bdtEv_update.txt
Linear dependence of slope:
Values comparison:
valuesComp_fourBin_slopeLin_bdtEv.txt, fit log:
fitLog_fourBin_slopeLin_bdtEv.txt
Constant dependence of slope:
Values comparison:
valuesComp_fourBin_slopeConst_bdtEv.txt, fit log:
fitLog_fourBin_slopeConst_bdtEv.txt
Four bins in Bkg MC
Blinded MC
Linear dependence of expConst:
Values comparison:
valuesComp_fourBin_expConstLin_bdtEv_MC_blinded.txt, fit log:
fitLog_fourBin_expConstLin_bdtEv_MC_blinded.txt
Constant dependence of expConst:
Values comparison:
valuesComp_fourBin_expConstConst_bdtEv_MC_blinded.txt, fit log:
fitLog_fourBin_expConstConst_bdtEv_MC_blinded.txt
Linear dependence of slope:
Values comparison:
valuesComp_fourBin_slopeLin_bdtEv_MC_blinded.txt, fit log:
fitLog_fourBin_slopeLin_bdtEv_MC_blinded.txt
Constant dependence of slope:
Values comparison:
valuesComp_fourBin_slopeConst_bdtEv_MC_blinded.txt, fit log:
fitLog_fourBin_slopeConst_bdtEv_MC_blinded.txt
Unblinded MC
Linear dependence of expConst:
Values comparison:
valuesComp_fourBin_expConstLin_bdtEv_MC_unBlinded.txt, fit log:
fitLog_fourBin_expConstLin_bdtEv_MC_unBlinded.txt
Constant dependence of expConst:
Values comparison:
valuesComp_fourBin_expConstConst_bdtEv_MC_unBlinded.txt, fit log:
fitLog_fourBin_expConstConst_bdtEv_MC_unBlinded.txt
Linear dependence of slope:
Values comparison:
valuesComp_fourBin_slopeLin_bdtEv_MC_unBlinded.txt, fit log:
fitLog_fourBin_slopeLin_bdtEv_MC_unBlinded.txt
Constant dependence of slope:
Values comparison:
valuesComp_fourBin_slopeConst_bdtEv_MC_unBlinded.txt, fit log:
fitLog_fourBin_slopeConst_bdtEv_MC_unBlinded.txt
Linear slope and constant expConst
Bkg MC unblinded
Values comparison:
valuesComp_fourBin_expConstConst_slopeLin_bdtEv_MC_unBlinded.txt, fit log:
fitLog_fourBin_expConstConst_slopeLin_bdtEv_MC_unBlinded.txt
Bkg MC blinded
Values comparison:
valuesComp_fourBin_expConstConst_slopeLin_bdtEv_MC_Blinded.txt, fit log:
fitLog_fourBin_expConstConst_slopeLin_bdtEv_MC_Blinded.txt
Data SB
Values comparison:
valuesComp_fourBin_expConstConst_slopeLin_bdtEv_dataSB.txt, fit log:
fitLog_fourBin_expConstConst_slopeLin_bdtEv_dataSB.txt
Comparison of parameter evolution fits with BDT RMS and uncertainty on BDT mean
Comparison of the evolution fits to the independent fit results. The green dependence takes the BDT RMS as the x-axis error in each bin, the yellow dependence takes the error on the mean (BDT RMS/sqrt(N)). The constant dependence is plotted in blue as a consistency check (it is not affected by x-axis errors). These comparisons are performed for the evolutions of the SSSV exponential constant and the combinatorial slope in blinded MC, unblinded MC and data SB.
expConst:
slope:
Generating bkg+signal MC sample
High-stat sample
Signal and background MC samples were combined to contain both background and Bs entries. The number of Bs entries blended into the background MC sample was calculated from the expected number of signal events for the 2015/16 dataset, which is 120. Rescaling this number by the ratio of SB events in the bkg MC to SB events in data, we obtain 426.535. The actual number of signal events for our background MC sample is then drawn as a random Poisson-distributed number with mean 426.535. As a result, this dataset contains 436 signal events (within 1 sigma of the mean 426.535). These were selected by picking 436 uniformly distributed indices of the signal MC events (the total number of signal MC events passing the mass and BDT cuts is 120682). The following plots summarize the sample generation: they show the entries in the highest BDT bin, the histogram of the signal events added, the distribution of the signal MC event indices, and a check of their uniformity performed by generating 10 000 numbers with the same generator. Another 10 000 Poisson-distributed numbers with mean 426.535 were histogrammed as well to check the Poisson random number generator.
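The two random steps of the sample generation (drawing the Poisson-distributed number of signal events, then picking uniformly distributed event indices) can be sketched with the Python standard library. This is an illustration only and does not reproduce the generator actually used: the seed and function names are hypothetical, and picking indices without repetition is an assumption.

```python
import random


def poisson_sample(mean, rng):
    """Draw one Poisson(mean) number by counting unit-rate arrivals in [0, mean]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def pick_signal_indices(n_pick, n_available, rng):
    """Pick n_pick distinct, uniformly distributed indices in [0, n_available)."""
    return sorted(rng.sample(range(n_available), n_pick))


rng = random.Random(12345)  # hypothetical seed
expected_signal = 426.535   # rescaled expectation quoted in the text
n_signal_mc = 120682        # signal MC events passing the mass and BDT cuts

n_signal = poisson_sample(expected_signal, rng)
indices = pick_signal_indices(n_signal, n_signal_mc, rng)
print(n_signal, len(indices), indices[:5])
```

Repeating `poisson_sample` many times and histogramming the result gives the same kind of generator check as the 10 000-number test mentioned above.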
Low-stat (pseudo data) sample
Another sample, similar in statistics to the 2015/2016 data, was produced by rescaling the total available Bkg MC statistics by the ratio of SB events in the Bkg MC sample and in the 2015/16 data. The resulting number (11133.1) was taken as the expected number of background events and used as the Poisson mean to generate the actual number of "measured" bkg events (11090, within 1 sigma). Similarly, the expected number of signal events (120) was used to generate the "measured" number of signal events (123, within 1 sigma). The "measured" numbers of uniformly distributed random entries were then selected from the background and signal MCs and combined into a single dataset. Both a ROOT ntuple and a RooFit dataset had to be produced in this case, because of the individual fits (in the high-stat case this was not necessary, as the sidebands are the same for the Bkg+signal and Bkg-only MC).
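The "within 1 sigma" statements can be checked directly, taking sigma as the square root of the Poisson mean. The helper name is hypothetical; the numbers are the ones quoted in the text.

```python
import math


def within_one_sigma(observed, mean):
    """True if |observed - mean| does not exceed the Poisson sigma, sqrt(mean)."""
    return abs(observed - mean) <= math.sqrt(mean)


checks = [
    ("high-stat signal", 436, 426.535),
    ("pseudo-data background", 11090, 11133.1),
    ("pseudo-data signal", 123, 120.0),
]
for label, observed, mean in checks:
    print(label, within_one_sigma(observed, mean))  # each line prints True
```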
UPDATE: The first bin of this sample was not fitted properly, as described below.
Individual Fits of bkg+signal high-stat MC sample
First, a blinded fit of the Bkg MC was performed, then another step was added to fit the full model with the signal component on the unblinded high-statistics Bkg+signal MC.
Individual Fits of bkg+signal pseudo data MC sample
These fits are problematic. Multiple versions of the sample were fitted, and they often end in nonsense, such as a negative number of SSSV events (with nComb larger than nTot) and a positive exponential constant of the SSSV model. In some cases, this was due to the last step of ROOT fitting starting with nComb/nTot larger than 1. The model used in ROOT fitting tolerates this, but RooFit subsequently fails.
A fix was introduced: whenever the starting value of nComb in the signal+bkg fitting would be higher than nTot, it is set to 0.99*nTot, and the Chebychev slope is set to 0. With these starting values, the RooFit steps can proceed. These are the results for the low-stat sample.
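The fix amounts to a simple guard on the starting values. A minimal sketch, with a hypothetical function name and toy numbers (the real code operates on the fit parameters of the individual fitting chain):

```python
def safe_start_values(ncomb, ntot, slope):
    """Return (nComb, slope) starting values that keep nComb below nTot.

    If nComb would start above nTot, it is set to 0.99 * nTot and the
    Chebychev slope is reset to 0; otherwise both values are kept.
    """
    if ncomb > ntot:
        return 0.99 * ntot, 0.0
    return ncomb, slope


# Toy values (hypothetical): the first case triggers the fix, the second does not.
print(safe_start_values(ncomb=130.0, ntot=100.0, slope=0.3))
print(safe_start_values(ncomb=80.0, ntot=100.0, slope=0.3))
```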
Simultaneous Fits of bkg+signal pseudo data MC sample
Free background, constraint on the signal yield in each bin - constant with Gaussian smearing (mean = nSignal/nBins, sigma = 3.7)
+ Linear slope, constant expConst:
Change of (extended) parametrization from {nComb, nTot} to {nComb, nSSSV}
Comparison of individual and unconstrained simultaneous background fits to bkg MC unblinded and blinded and to data SB:
valuesComp_bkgMCUnb_parametrizationCheck_oldParam.txt vs.
valuesComp_bkgMCUnb_parametrizationCheck.txt
valuesComp_bkgMCBlind_parametrizationCheck_oldParam.txt vs.
valuesComp_bkgMCBlind_parametrizationCheck.txt
valuesComp_bkgDataBlind_parametrizationCheck_oldParam.txt vs.
valuesComp_bkgDataBlind_parametrizationCheck.txt
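For orientation, the relation between the two parametrizations can be sketched as follows, assuming nTot = nComb + nSSSV. This relation is an assumption inferred from the remark that a negative number of SSSV events corresponds to nComb larger than nTot; the function names and numbers are hypothetical.

```python
def to_new_parametrization(ncomb, ntot):
    """{nComb, nTot} -> {nComb, nSSSV}, assuming nTot = nComb + nSSSV."""
    return ncomb, ntot - ncomb


def to_old_parametrization(ncomb, nsssv):
    """{nComb, nSSSV} -> {nComb, nTot}, under the same assumption."""
    return ncomb, ncomb + nsssv


# Toy yields (hypothetical), converted there and back.
ncomb, nsssv = to_new_parametrization(ncomb=900.0, ntot=1200.0)
print(ncomb, nsssv, to_old_parametrization(ncomb, nsssv))
```

Under this assumption, a lower limit of zero on nSSSV would keep nComb from exceeding nTot, which may be one motivation for the change.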
2D Sim fits
These are on the page SimFits2D.
-- OndrejKovanda - 2020-05-18