A Guide to Advanced Statistical Methods in Particle Physics


Introduction

We compile a list of commonly and less commonly used advanced statistical methods in particle physics.

Types of models

In all tasks described here, we make use of a model to describe the data. Such models usually have several parameters which are adjusted to best describe the data. They can be divided into three categories, depending on how the number of parameters grows with the number of data points.

Parametric models

These are models with a fixed set of parameters whose number does not depend on the number of data points.

An example is a linear fit to data points or a Gaussian fit to an energy distribution.

Advantage: The model is usually easy to interpret, e.g. its parameters have a physical meaning.

Disadvantage: The model might describe the data very badly, e.g. because the 'true' process generating the data does not correspond to the model assumption. An example is the 'non-Gaussian' lower tails in energy distributions reconstructed in a calorimeter due to regions of non-sensitive material (e.g. gaps between the calorimeter cells).

Semi-Parametric models

Non-parametric models


Density estimation

Density estimation is the task of estimating a probability density. Given a data sample, this corresponds to finding a function which describes 'how many events one expects for given values of the independent variables'.

This problem is 'ill-posed': in most cases, we can describe the given data perfectly by increasing the number of parameters in our model (unless we use a parametric model). See Overfitting.

Histograms

Kernel density estimation

To test the quality of the density estimate in one dimension (density as a function of one variable), one can use the Kolmogorov-Smirnov test. However, there is no straightforward way to test the goodness of fit in two or more dimensions.
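
As a rough illustration in Python, a kernel density estimate of a toy one-dimensional sample can be compared to an independent control sample with a Kolmogorov-Smirnov test (the Gaussian toy process and the sample sizes below are chosen purely for illustration):

    import numpy as np
    from scipy.stats import gaussian_kde, kstest

    # toy 'energy' measurements drawn from an (in practice unknown) process
    rng = np.random.default_rng(0)
    sample = rng.normal(50.0, 5.0, size=1000)

    # Gaussian kernel density estimate of the underlying density
    kde = gaussian_kde(sample)

    # build the cumulative distribution of the estimate by numerical
    # integration so that it can be passed to the Kolmogorov-Smirnov test
    grid = np.linspace(sample.min() - 10, sample.max() + 10, 2000)
    cdf_values = np.cumsum(kde(grid))
    cdf_values /= cdf_values[-1]

    def kde_cdf(x):
        # interpolated cumulative distribution of the density estimate
        return np.interp(x, grid, cdf_values)

    # compare an independent control sample to the estimated density
    control = rng.normal(50.0, 5.0, size=500)
    statistic, p_value = kstest(control, kde_cdf)
    print("KS statistic = %.3f, p-value = %.3f" % (statistic, p_value))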


Likelihood combination

Instead of constructing a density as a function of all variables, one approximates the overall density with the product of the densities ('projections') of each variable. I.e. the full density of all variables is written as:

p(x_1, ..., x_n) = p_1(x_1) * ... * p_n(x_n)

To estimate the one dimensional densities, methods such as Histograms or kernel density estimation are used.

A widespread application of this method is the search for new physics phenomena, i.e. to discriminate a signal from background. Densities are constructed for the background physics process and for signal plus background (or in many cases for signal only) which are then used to calculate the local likelihood ratio of both hypotheses (see Neyman Pearson Lemma).
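
As a rough sketch in Python (the two-variable toy samples and the simple histogram-based projections below are assumptions made only for illustration), the combination could look like this:

    import numpy as np

    def projection_densities(data, bins=20, ranges=None):
        """Return one (bin_edges, normalized_contents) pair per variable."""
        densities = []
        for j in range(data.shape[1]):
            rng_j = None if ranges is None else ranges[j]
            contents, edges = np.histogram(data[:, j], bins=bins,
                                           range=rng_j, density=True)
            densities.append((edges, contents))
        return densities

    def product_density(x, densities, floor=1e-9):
        """Approximate p(x_1, ..., x_n) as p_1(x_1) * ... * p_n(x_n)."""
        p = 1.0
        for xj, (edges, contents) in zip(x, densities):
            i = np.clip(np.searchsorted(edges, xj) - 1, 0, len(contents) - 1)
            p *= max(contents[i], floor)   # avoid zeros from empty bins
        return p

    # toy signal and background samples in two variables
    rng = np.random.default_rng(1)
    signal = rng.normal([1.0, 1.0], [1.0, 1.0], size=(5000, 2))
    background = rng.normal([0.0, 0.0], [1.5, 1.5], size=(5000, 2))

    ranges = [(-6.0, 6.0), (-6.0, 6.0)]
    densities_sig = projection_densities(signal, ranges=ranges)
    densities_bkg = projection_densities(background, ranges=ranges)

    # local likelihood ratio used as discriminant (see Neyman Pearson Lemma)
    event = np.array([0.8, 1.2])
    ratio = (product_density(event, densities_sig)
             / product_density(event, densities_bkg))
    print("likelihood ratio for this event:", ratio)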

Advantages:

  • The number of points available to determine each one-dimensional density is effectively larger than when estimating the higher-dimensional density directly (see Curse of dimensionality).

Disadvantages:

  • Some features of the density are lost if the variables are not independent and the approximation is therefore bad. The resulting density is then not optimal, e.g. for searches (see Neyman Pearson Lemma).


Regression

Fisher's discriminant

Neural Networks (Multilayer Perceptron)

Radial Basis functions

Support Vector Machines


Classification

Although treated differently in the statistical literature, classification problems ('signal vs. background') are often solved using regression methods by assigning a different target value to each class (e.g. 0 for background, 1 for signal).
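
A minimal sketch in Python, assuming a toy two-class sample and an ordinary least-squares fit to the 0/1 targets (for this simple case the result is closely related to Fisher's discriminant):

    import numpy as np

    # toy signal and background events in two variables
    rng = np.random.default_rng(2)
    signal = rng.normal(1.0, 1.0, size=(1000, 2))
    background = rng.normal(-1.0, 1.0, size=(1000, 2))

    X = np.vstack([signal, background])
    X = np.column_stack([X, np.ones(len(X))])            # add a constant term
    y = np.concatenate([np.ones(1000), np.zeros(1000)])  # 1 = signal, 0 = background

    # least-squares fit of the regression targets; thresholding the output
    # at 0.5 then acts as the classifier
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    prediction = X @ coefficients > 0.5
    print("fraction of correctly classified training events: %.2f"
          % (prediction == y).mean())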


Hypothesis testing

When searching for new phenomena or particles, we have to come up with a conclusion on whether we see this new phenomenon or not. We can do this by comparing two hypotheses to each other:

  • H0: The observed data consist of the established (background) physics processes only
  • H1: The observed data consist of the established (background) physics processes plus the new phenomenon (signal)

Note that this scheme does not include hypotheses like 'the data can be described neither by the established (background) processes alone nor by the established processes plus the new phenomenon'. Such a hypothesis would be favoured e.g. in cases where we have massive underfluctuations of the background, detector problems, backgrounds which were not considered, etc.

Note that one calculates P(data|hypothesis) [the probability of the observed data under a certain hypothesis/physics model], not P(hypothesis|data) [the probability of the hypothesis/physics model given the observed data]. To calculate the latter from the former, one can use Bayes' theorem but then one needs to make an assumption on the (unconditional) probability of the observation and of the model ('prior').
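
Written out, Bayes' theorem reads

P(hypothesis | data) = P(data | hypothesis) * P(hypothesis) / P(data)

where P(hypothesis) is the prior probability of the model and P(data) is the (unconditional) probability of the observation.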

Neyman-Pearson Lemma

The Neyman-Pearson lemma states that the likelihood ratio of two simple hypotheses is the most powerful test statistic for distinguishing them. Thus, whenever possible, we would like to know the ratio of the 'signal+background' density and the 'background only' density at each point in the measurement space. If we know this function exactly, we can make the optimal decision.

Search analyses using multivariate statistical techniques often construct a discriminant which calculates a 'discriminant variable' as a function of the variables measured for each event. This discriminant variable should be the likelihood ratio mentioned above or a function of it.
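
As a toy illustration in Python (the Gaussian signal shape, the exponential background shape and the signal fraction are assumptions made only for this sketch), the discriminant is simply the local ratio of the two densities:

    from scipy.stats import norm, expon

    def discriminant(x, signal_fraction=0.1):
        """Likelihood ratio of 'signal+background' to 'background only' at x."""
        p_bkg = expon.pdf(x, scale=20.0)          # assumed background density
        p_sig = norm.pdf(x, loc=91.0, scale=3.0)  # assumed signal density
        p_splusb = signal_fraction * p_sig + (1.0 - signal_fraction) * p_bkg
        return p_splusb / p_bkg

    # events near the assumed signal peak get a large discriminant value
    for x in (10.0, 85.0, 91.0):
        print("x = %5.1f  ->  likelihood ratio = %.3f" % (x, discriminant(x)))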

See also: Wikipedia article on the Neyman-Pearson lemma


Glossary

Curse of dimensionality

The fact that the number of data points needed to fill the observation space with 'equal density' grows exponentially with the dimension (the number of variables).

An example: To fill a one-dimensional histogram of ten bins with 100 events per bin, we need 10 * 100 = 1000 measurements. To draw the distribution as a function of two variables, we take a histogram with ten bins in each variable. For the same number of events per bin, we need 10 * 10 * 100 = 10000 points. For three variables, we would need 10^3 * 100 = 100000 points, while for five variables we need 10^5 * 100 = 10^7 points.
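
The numbers above can be checked with a few lines of Python:

    # events needed to keep 100 events per bin with ten bins per variable
    events_per_bin = 100
    bins_per_variable = 10
    for dimension in (1, 2, 3, 5):
        needed = bins_per_variable ** dimension * events_per_bin
        print("%d variable(s): %d events" % (dimension, needed))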

The more variables one uses in a multivariate analysis (keeping the amount of data fixed), the 'farther away from each other' the points are and the 'more empty space' we have.

When doing an analysis, we often forget about this. In addition, the data can typically be visualized in two, sometimes three, dimensions but not more. When comparing Monte Carlo (simulated physics processes) to observed data, it is therefore hard to see whether there are regions where data occur but which are not populated by Monte Carlo events (our expectation). In classification problems (i.e. 'is an event observed in the experiment background-like or signal-like?') it is then often unpredictable how such events in 'unpopulated Monte Carlo regions' will be classified.

See also: Wikipedia article on the curse of dimensionality

Overfitting

This is the situation where our model 'learns the data by heart'. It shows good performance on the data used to build the model (the training data) but performs badly on independent data (originating from the same generating process).

An example: Given 100 measurements of (x,y), we can try to fit a polynomial y = p(x) of degree 99 to them. If we take another 100 measurements of the same process, we will most likely get different polynomial coefficients if the values of x and/or y contain random noise.
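
A small sketch in Python makes this concrete; it is scaled down to 10 points and a degree-9 polynomial (purely for numerical stability of the toy fit), but the effect is the same:

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 1.0, 10)

    def measure():
        # toy measurements: a smooth process plus random noise in y
        return np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, size=len(x))

    y_first, y_second = measure(), measure()

    # fit polynomials with as many parameters as data points
    coeff_first = np.polyfit(x, y_first, deg=len(x) - 1)
    coeff_second = np.polyfit(x, y_second, deg=len(x) - 1)

    fitted = np.polyval(coeff_first, x)
    print("max residual on the fitted sample:      %.2g"
          % np.max(np.abs(fitted - y_first)))
    print("max residual on the independent sample: %.2g"
          % np.max(np.abs(fitted - y_second)))
    print("leading coefficients of the two fits:",
          coeff_first[0], coeff_second[0])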

See also: Wikipedia article on overfitting


Further resources
