A Guide to Advanced Statistical Methods in Particle Physics
Introduction
We compile a list of commonly and less commonly used advanced statistical methods
in particle physics.
Types of models
In all tasks described here, we make use of a model to describe the data. Such models
usually have several parameters which are adjusted to best describe the data. They
can be divided into three categories, depending on how the number of parameters grows
with the number of data points.
Parametric models
These are models which have a fixed set of parameters which does not depend on the number
of data points.
An example is a linear fit to data points or a Gaussian fit to an energy distribution.
Advantage: The model usually is easily interpretable, e.g. its parameters have a physical
meaning.
Disadvantage: The model might describe the data very badly, e.g. because the 'true' process
generating the data does not correspond to the model assumption. An example is the
'non-Gaussian' lower tail in energy distributions reconstructed in a calorimeter due to regions
of non-sensitive material (e.g. gaps between the calorimeter cells).
Semi-Parametric models
These lie between the two other categories, e.g. models which combine a parametric
component with a non-parametric one, or models where the number of parameters can be
chosen by the user (such as the number of hidden nodes of a neural network).
Non-parametric models
These are models whose number of parameters grows with the number of data points.
An example is kernel density estimation, where one kernel is placed at each data point,
so the estimate has as many components as there are data points.
Density estimation
Density estimation is the task of estimating a probability density.
Given a data sample, this corresponds to finding a function which
describes 'how many events one expects for given values of the
independent variables'.
This problem is 'ill posed': In most cases, we can describe the given
data perfectly by increasing the number of parameters in our model
(unless we use a parametric model). See
Overfitting.
Histograms
Kernel density estimation
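As an illustration (not part of the original text), a one-dimensional Gaussian kernel density estimate can be sketched in a few lines of numpy; the sample, grid and bandwidth below are arbitrary choices.

```python
import numpy as np

def gaussian_kde_1d(sample, x, bandwidth):
    """Density estimate at points x: place a Gaussian kernel of width
    `bandwidth` on each data point and average over all kernels."""
    s = np.asarray(sample)[:, np.newaxis]   # shape (n_events, 1)
    g = np.asarray(x)[np.newaxis, :]        # shape (1, n_points)
    kernels = np.exp(-0.5 * ((g - s) / bandwidth) ** 2)
    kernels /= bandwidth * np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=0)

# Toy sample: 1000 events from a unit Gaussian. The bandwidth 0.3 is an
# ad-hoc choice (too small -> spiky estimate, too large -> washed out).
rng = np.random.default_rng(0)
events = rng.normal(0.0, 1.0, size=1000)
grid = np.linspace(-4.0, 4.0, 81)
density = gaussian_kde_1d(events, grid, bandwidth=0.3)
```

The bandwidth plays the same role as the bin width of a histogram: it controls the trade-off between statistical fluctuations and smearing of real features.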
To test the quality of the density estimation in one dimension (density as function of one variable), one
can use the Kolmogorov-Smirnov test. However, there is no straightforward way to test the
goodness in two or more dimensions.
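A minimal numpy-only sketch of the Kolmogorov-Smirnov statistic (the largest distance between the empirical CDF of the sample and the model CDF); the Gaussian sample and the deliberately shifted 'bad' model are toy assumptions:

```python
import numpy as np
from math import erf

def ks_statistic(sample, model_cdf):
    """Kolmogorov-Smirnov statistic: the largest distance between the
    empirical CDF of the sample and the model CDF."""
    x = np.sort(np.asarray(sample))
    n = x.size
    cdf = model_cdf(x)
    ecdf_hi = np.arange(1, n + 1) / n   # empirical CDF just after each point
    ecdf_lo = np.arange(0, n) / n       # empirical CDF just before each point
    return max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))

def std_normal_cdf(x):
    return 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=500)

# Against the correct model the distance is small; against a model
# shifted by five standard deviations it is close to one.
d_good = ks_statistic(data, std_normal_cdf)
d_bad = ks_statistic(data, lambda x: std_normal_cdf(x - 5.0))
```

To turn the statistic into a p-value, one compares it with the Kolmogorov distribution (available e.g. in scipy.stats.kstest).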
Likelihood combination
Instead of constructing a density as a function of all variables, one approximates
the overall density with the product of the densities ('projections') of each variable,
i.e. the full density of all variables is written as:
p(x_1, ..., x_n) = p_1(x_1) * ... * p_n(x_n)
To estimate the one dimensional densities, methods such as
Histograms or
kernel density estimation
are used.
A widespread application of this method is the search for new physics phenomena, i.e. to discriminate a signal from background.
Densities are constructed for the background physics process and for signal plus background (or in many cases for signal only) which
are then used to calculate the local likelihood ratio of both hypotheses (see
Neyman Pearson Lemma).
Advantages:
- The number of points available to determine each one-dimensional density is larger than for estimating the higher-dimensional density directly (see Curse of dimensionality).
Disadvantages:
- Some features of the density are lost if the approximation is bad, e.g. because the variables are not independent. The resulting density is then not optimal e.g. for searches (see Neyman-Pearson Lemma).
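The method can be sketched as follows; the two-variable Gaussian toy data, the histogram binning and the small regularisation term are all assumptions made for illustration:

```python
import numpy as np

def projection(sample, bins=30, range_=(-3.0, 4.0)):
    """One-dimensional histogram density ('projection') of one variable."""
    counts, edges = np.histogram(sample, bins=bins, range=range_, density=True)
    return counts, edges

def product_density(event, projections, eps=1e-12):
    """Approximate the full density by the product of the 1D projections."""
    p = 1.0
    for value, (counts, edges) in zip(event, projections):
        i = np.clip(np.searchsorted(edges, value) - 1, 0, len(counts) - 1)
        p *= counts[i]
    return p + eps   # eps avoids division by zero in empty bins

rng = np.random.default_rng(2)
# Toy data: signal peaks at (1, 1), background at (0, 0), width 0.5.
sig = rng.normal([1.0, 1.0], 0.5, size=(5000, 2))
bkg = rng.normal([0.0, 0.0], 0.5, size=(5000, 2))

sig_proj = [projection(sig[:, j]) for j in range(2)]
bkg_proj = [projection(bkg[:, j]) for j in range(2)]

def local_likelihood_ratio(event):
    """Ratio of the approximated signal and background densities."""
    return product_density(event, sig_proj) / product_density(event, bkg_proj)
```

Events in the signal region then get a ratio above one, events in the background region a ratio below one; note that correlations between the variables are ignored by construction.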
Regression
Fisher's discriminant
Neural Networks (Multilayer Perceptron)
Radial Basis functions
Support Vector Machines
Classification
Although treated differently in the statistical literature, classification problems ('signal vs. background') are often
solved using
regression methods by assigning a different target value to each class (e.g. 0 for background, 1 for signal).
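A minimal sketch of this trick, using least-squares linear regression on 0/1 labels; the two-variable Gaussian toy data and the 0.5 decision threshold are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data: background around (0, 0), signal around (2, 2).
bkg = rng.normal(0.0, 1.0, size=(1000, 2))
sig = rng.normal(2.0, 1.0, size=(1000, 2))
X = np.vstack([bkg, sig])
y = np.concatenate([np.zeros(1000), np.ones(1000)])   # 0 = bkg, 1 = sig

# Least-squares regression of the class label on the variables (plus an
# intercept column); events with output above 0.5 are called signal.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
is_signal = A @ coef > 0.5

accuracy = np.mean(is_signal == (y == 1))
```

For two Gaussian classes with equal covariance, this least-squares direction coincides with Fisher's discriminant, which is why the simple recipe works well here.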
Hypothesis testing
When searching for new phenomena or particles, we have to come up with a conclusion
on whether we see this new phenomenon or not. We can do this by comparing two hypotheses
to each other:
- H0: The observed data consist of the established (background) physics processes only
- H1: The observed data consist of the established (background) physics processes plus the new phenomenon (signal)
Note that in this scheme, hypotheses like 'the data can be described neither by the established (background)
processes only nor by the established processes plus the new phenomenon' are not included.
Such a hypothesis would be favoured e.g. in cases where we have massive underfluctuations of the background,
detector problems, problems with backgrounds which were not considered, etc.
Note that one calculates P(data|hypothesis) [the probability of the observed data under a certain hypothesis/physics model],
not P(hypothesis|data) [the probability of the hypothesis/physics model given the observed data]. To calculate the latter from
the former, one can use
Bayes' theorem but then one needs to make an assumption
on the (unconditional) probability of the observation and of the model ('prior').
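A worked toy example of this inversion; all numbers below (likelihoods and priors) are made up purely to illustrate Bayes' theorem, not measurements:

```python
# P(data | hypothesis) for the two hypotheses (assumed values):
p_data_h0 = 0.05          # P(data | background only)
p_data_h1 = 0.60          # P(data | background + new phenomenon)
# Prior belief in each hypothesis (assumed values):
prior_h0, prior_h1 = 0.99, 0.01

# P(data) as the sum over both hypotheses, then Bayes' theorem:
p_data = p_data_h0 * prior_h0 + p_data_h1 * prior_h1
p_h1_data = p_data_h1 * prior_h1 / p_data   # P(hypothesis | data)
```

Even though the data are twelve times more likely under H1 than under H0, the small prior keeps P(H1 | data) at roughly 11%, which shows how strongly the conclusion depends on the chosen prior.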
Neyman-Pearson Lemma
The Neyman-Pearson lemma states that the likelihood ratio of the two hypotheses is the optimal test statistic: no other
discriminant achieves a higher signal efficiency at the same background rejection.
Thus, whenever possible, we would like to know the ratio of the 'signal+background' density and the 'background only' density
at each point in the measurement space. If we know this function exactly, we can make the optimal decision.
Search analyses using multivariate statistical techniques often construct a discriminant which calculates a 'discriminant variable'
as function of the variables measured for each event. This discriminant variable should be the likelihood ratio mentioned above or a
function of it.
See also:
Wikipedia article on the Neyman-Pearson lemma
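A toy one-dimensional case where both densities are exactly known (background a unit Gaussian at 0, signal at 1; both are assumptions for illustration), so the optimal discriminant can be written down directly:

```python
import numpy as np

def norm_pdf(x, mu, sigma=1.0):
    """Gaussian probability density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def likelihood_ratio(x):
    """The optimal discriminant variable according to Neyman-Pearson."""
    return norm_pdf(x, 1.0) / norm_pdf(x, 0.0)

x = np.linspace(-3.0, 4.0, 100)
ratio = likelihood_ratio(x)

# Any monotone function of the ratio orders events identically and is
# therefore an equally good discriminant; here the log-ratio is simply
# linear in x, so in this toy case cutting on x itself is already optimal.
log_ratio = np.log(ratio)
```

In a real analysis the densities are not known exactly, which is why one resorts to density estimation or to discriminants trained to approximate (a monotone function of) this ratio.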
Glossary
Curse of dimensionality
The fact that for increasing dimension (number of variables) the number of data points needed
to fill the observation space with 'equal density' increases exponentially with the dimension.
An example: To fill a one dimensional histogram of ten bins with 100 events per bin, we need 10 * 100 = 1000 measurements.
To draw the distribution as function of two variables, we take a histogram with ten bins in both variables. For the same number
of events per bin, we need 10 * 10 * 100 = 10,000 points. For three variables, we would need 10^3 * 100 = 100,000 points
while for five variables we need 10^5 * 100 = 10^7 points.
The more variables one uses in a multivariate analysis (keeping the amount of data fixed), the farther apart
the points are and the more 'empty space' we have.
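The counting argument above as a small helper function; the defaults of ten bins per variable and 100 events per bin reproduce the numbers quoted in the text:

```python
def measurements_needed(n_vars, bins_per_var=10, events_per_bin=100):
    """Events needed for `events_per_bin` entries in every bin of an
    n_vars-dimensional histogram with `bins_per_var` bins per variable."""
    return bins_per_var ** n_vars * events_per_bin

# 1, 2, 3 and 5 variables: 1000, 10,000, 100,000 and 10^7 events.
needed = [measurements_needed(d) for d in (1, 2, 3, 5)]
```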
When doing an analysis, we often forget about this. In addition, the data can often be visualized in two, sometimes in three dimensions,
but not more. When comparing Monte Carlo (simulated physics processes) to observed data, it is hard to see whether
there are regions where data occurs but which are not populated by Monte Carlo events (our expectation). In classification
problems (i.e. 'is an event observed in the experiment background-like or signal-like?') it is then often unpredictable how such
events in 'unpopulated Monte Carlo regions' will be classified.
See also:
Wikipedia article on the curse of dimensionality
Overfitting
This is the situation where our model 'learns the data by heart'. It shows good performance
on the data used to build the model (the training data) but performs badly on independent data
(originating from the same generating process).
An example: Given 100 measurements of (x,y),
we can try to fit a polynomial y = p(x) of degree 99 to them. If we take another 100 measurements
of the same process, we will most likely get different polynomial coefficients if the values x and/or
y contain random noise.
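The effect can be demonstrated with a scaled-down numerical sketch (15 points and a degree-14 polynomial instead of 100 points and degree 99, which behaves the same way but is numerically very unstable); the sine curve and the noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 15)

def measure():
    # Assumed 'true' process: a sine curve plus Gaussian noise.
    return np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

y_train = measure()   # data used to build the model
y_test = measure()    # independent data from the same process

# A degree-14 polynomial through 15 points interpolates the training
# data (noise included) essentially exactly.
coef = np.polyfit(x, y_train, deg=14)
fit = np.polyval(coef, x)

train_error = np.mean((fit - y_train) ** 2)   # essentially zero
test_error = np.mean((fit - y_test) ** 2)     # dominated by the noise
```

The training error is tiny while the error on the independent sample is of the order of twice the noise variance: the polynomial has memorized the noise, not the underlying process.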
See also:
Wikipedia article on overfitting
Further resources