Abstract

We have developed an algorithm for nonparametric fitting and extraction of statistically significant peaks in the presence of statistical and systematic uncertainties. Applications of this algorithm for analysis of high-energy collision data are discussed. In particular, we illustrate how to use this algorithm in general searches for new physics in invariant-mass spectra using Monte Carlo simulations.

1. Introduction

Searching for peaks in particle spectra is a task which is becoming increasingly popular at the Large-Hadron collider that focuses on new physics beyond TeV-scale. Bump searches can be performed either in single-particle (such as distributions) or invariant-mass spectra. For instance, searches for new particles decaying into a two-body final state (jet-jet, gamma-gamma, etc.) and multibody decays are typically done by examining invariant masses of final-state objects (jets, leptons, missing transverse momenta, etc.). For example, assuming seven identified particles (jets, photons, electrons, muons, taus, -bosons, and missing ), a search can be made for parent particles decaying into 2, 3, or 4 daughter particles. This leads to 322 unique daughter groups. Thus, the task of analyzing such invariant-mass combinations becomes rather tedious and difficult to handle. Considering a “blind” analysis technique for scanning many channels [1], any cut variation increases the number of channels that need to be investigated. Finally, similar challenges exist for automatic searches for new hadronic resonances combining tracks [2].

The task of finding bumps is ultimately related to the task of determining a correct background shape using theoretical or known cross sections. However, a theory can be rather uncertain in the regions of interest, difficult to use for background simulation, or entirely nonexistent. Even for a simple jet-jet invariant mass, finding an analytical background function that fits the QCD-driven background spanning many orders in magnitude and which can be used to extract possible excess of events due to new physics requires a careful examination. Attempts to fit two-jet and three-jet invariant masses have been discussed in CMS [3, 4] and ATLAS [5] papers; while both experiments have reached the necessary precision for such fits using initial low statistical data, the used analytical functions are rather different and have many free parameters. This task becomes even more difficult considering multiple channels (invariant-mass distributions) with various cuts or detector-selection criteria (like -tagging). Each such channel requires a careful selection of analytical functions for background fit and adjustments of their initial values for convergence of a nonlinear regression while determining an expectantly smooth background shape. A fully automated approach to searches for new physics has been discussed elsewhere [6].

One technically attractive approach is to find a nonparametric way to extract statistically significant peaks without a priori assumptions on background shapes. Such approach is popular in many areas, from image processing to studies of financial market, where a typical peak-identification task is reduced to data smoothing in order to create a function that attempts to approximate the background. The smoothing can be achieved using the moving average [7], Lowess [8], and Splines [9] algorithms. Statistically significant deviations from smoothed distributions can be considered as peaks. Such technique is certainly adequate for the peak extraction, but it does not pursue the goal of peak identification with a correct treatment of statistical (or systematic) uncertainties. The later can be asymmetric.

The closest peak-search approach for high-energy-physics applications has been developed for studies of -ray spectra where the usual features of interest are the energies and intensities of photo-peaks. Several techniques have been developed, such as those based on least squares [10], second differences with least-squares fitting [11], the Fourier transformation [12], Markov chain [13], and convolution [14], (just naming a few). While such approaches are well suited for counting-type observables, they typically focus on narrow peaks on top of small and often flat-shaped background.

For example, the ROOT analysis framework [15] used in high-energy physics contains the TSPECTRUM package based on a smoothing method developed for -ray spectra [14]. The latter typically have narrow peaks on a smooth background. This algorithm is efficient in finding sharp peaks, while detection of wide peaks requires a visual examination of data to adjust several free parameters of this tool. Thus this approach is not well suited for a completely automatic peak search. In addition, systematic uncertainties on data points are not easy to incorporate in this approach.

In high-energy collisions, a typical standard-model background distribution has a falling shape spanning many orders of magnitude in event counts. A typical example is jet-jet invariant masses used for new particle searches [3, 5]. For such spectra, the most interesting regions are the tails of the exponentially suppressed distributions where a new high- physics may show up. This means that there should be rather different thresholds to statistical noise, depending on the phase-space region, and, as result, a correct treatment of statistical and systematic uncertainties is obligatory. Unlike the -ray spectra where peaks are rather common and subject of various classification techniques, peaks in high-energy collisions are rather rare. As a consequence, relatively little progress has been made to develop a nonparametric fitting technique for high-energy physics applications where an observation of peaks is typically a subject for searches for new physics rather than for peak-classification purposes.

The above discussion leads to the need for a nonparametric way of background estimation together with the peak extraction mechanism which can be suited for high-energy collision distributions, such as invariant masses. The algorithm should be able to take into account the discrete nature of input distributions with their uncertainties. The proposed algorithm is less ambiguous compared to the smoothing methods (such as that used in ROOT [14, 15]), since it uses only one free parameter. In addition, it can take into account systematic uncertainties on data points (that can be asymmetric) and thus can estimate statistical significance of possible peaks in the presence of systematic uncertainties.

2. Nonparametric Peak Finder Algorithm

Due to the reasons discussed above, the program called Nonparametric Peak Finder (NPFinder) was developed using a numerical, iterative approach to detect statistically significant peaks in event-counting distributions. In short, NPFinder iterates through bins of input histograms and, using only one sensitivity parameter, determines the location and statistical significance of possible peaks. Unlike the known smoothing algorithms, the main focus of this method is not how to smooth data and then extract peaks, but rather how to extract peaks by comparing neighboring points and then calling what is left over the “background.” Below we discuss the major elements of this algorithm and then we illustrate and discuss its limitations and possible improvements.

For each point in a histogram, the first-order derivative is found taking into account possible (statistical or/and systematic) uncertainties. This is done by calculating the slope between two points including their experimental uncertainties: if point is lower than point , the upper error is used, while if point is higher than point , the lower error uncertainty is used. This is done in order to be always on a conservative side while reducing statistical noise. Mathematically, this can be written as where the uncertainty is taken with negative sign for and with positive sign otherwise. The uncertainty may not need to be symmetric, but for simplicity we assume that they are symmetric as this is usually the case for statistical nature of uncertainties. The derivatives are averaged calculating a running average for any given position : The algorithm triggers the beginning of a peak if the local derivatives satisfy where is a free positive parameter that reflects (unknown) slope of the peak. This parameter should be found empirically and we will discuss below a possible range for its value. When the above conditions are true, NPFinder registers a possible peak and begins classifying next points as a part of the peak. The running average equation (2) is not accumulated for the points which belong to a possible peak. is the only free parameter which specifies the sensitivity to the peak finding. This parameter should decrease with increase of sensitivity to the peaks (and likely will increase sensitivity to statistical fluctuations).

NPFinder continues to walk over data points until and are both negative, which signifies that the maximum of the peak has been reached. The double condition in (3) is used to reinforce the peak-search robustness. When this condition is met, NPFinder exits the peak and adds an equal number of points to the right side from the peak center. The requirement of having the same number of points implies that the peaks are expected to be symmetric, which is the most common case. For steeply falling distributions, such as transverse-momenta spectra or dijet mass distributions, this assumption usually means that we somewhat underestimate the peak significance. Figure 1 illustrates the NPFinder algorithm for a falling invariant-mass distribution. Each point of the distribution can have an upper and lower statistical (or systematic) uncertainty.

After detecting all peak candidates, NPFinder iterates through the list of possible peaks in order to form a background for each peak. This is achieved by performing a linear regression of points between the first and last points in the peak, that is, applying the function , where and are the slope and intercept of the linear regression, which in this case is rather trivial as it is performed via the two points only. It should be noted that the linear regression is also performed taking into account uncertainties: where is the first point of the peak, is the final point of the peak, and and are their statistical uncertainties, respectively. Here the statistical uncertainties are added in order to always be on the conservative side in estimation of the background level under the peak. The intercept parameter then is .

It should be mentioned that the technique of the peak finding considered above is somewhat similar to that discussed for -ray applications [11]. But there are several important differences of NPFinder compared to this algorithm: NPFinder can detect peaks of arbitrary shapes (not only Gaussian-shaped peaks as in [11]), no fitting or smoothing procedure is used, and statistical and systematic uncertainties for data points are included during the peak-finding procedure. The algorithm [11] was not tested since its source code is not publicly available.

Finally, NPFinder uses the background points to calculate the statistical significance of each peak in a given histogram. This is done by summing up the differences of the original points in a peak with respect to the calculated background points and then dividing this value by its own square root. For a given peak, it can be approximated by where the sum runs over all points in the peaks. The algorithm runs over an input histogram or graph, builds a list of peaks, and estimates their statistical significance. A typical statistically significant peak in this approach has . A first peak is usually ignored as it corresponds to the kinematic peak of background distributions.

Below we illustrate the above approach by generating fully inclusive collision events using the PYTHIA generator [16]. The required integrated luminosity was 200 pb−1. Jets are reconstructed with the anti- algorithm [17] with a distance parameter of 0.6 using the cut  GeV. Then, the dijet invariant-mass distribution is calculated and the NPFinder finder is applied using the parameter . As expected, no peaks with were found.

Next, a few fake peaks were generated using the Gaussian distributions with different peak positions and widths. The peaks were added to the original background histogram. Figure 2 shows an example with 3 peaks generated at 1000 GeV (20 GeV widths, 200000 events), 1500 GeV (50 GeV widths, 30000 events), and at 2800 GeV (40 GeV widths, 1200 events). The algorithm found all three peaks and gave correct estimates of their positions, widths, and approximate statistical significance using the input parameter .

For a comparison, the same distribution was used to test the TSPECTRUM package of the ROOT program discussed in the introduction. It was found that TSPECTRUM can also detect such peaks, but several iterations with a visual examination of the data have been required to adjust the free parameters of this algorithm, (an effective sigma of searched peaks), and the amplitude of the expected peaks. After the first TSPECTRUM pass, an additional analytic fit was required to determine the statistical significance of each peak. This approach was found to be difficult to implement in a fully automatic peak search.

It should be noted that the peak statistical significance of the proposed nonparametric method might be smaller than that calculated using more conventional approaches, such as those based on a minimisation with appropriate background and signal functions. This can be due to the assumption on the symmetric form of the extracted peaks, the linear approximation for the background under the peaks, and the way in which the uncertainty is incorporated in the peak-significance calculation. An influence of experimental resolution can also be an issue [18], which can only be addressed via correctly identified signal and background functions. Such drawbacks are especially well seen for the highest mass peak shown in Figure 2 where a statistical fluctuation to the right of the peak pulls the background level up compared to the expected falling shape. Given the approximate nature of the statistical significance calculations which only serve to trigger attention of analyzers who need to study the found peaks in more detail, the performance of the algorithm was found to be reasonable.

It should be noted that there is a correlation between the peak width and the input parameter : a detection of broader peaks typically requires a smaller value of (which can be as low as 0.2).

In conclusion, a peak-detection algorithm has been developed which can be used for extraction of statistically significant peaks in event-counting distributions taking into account statistical (and potentially systematic) uncertainties. The method can be used for new physics searches in high-energy particle experiments where a correct treatment of such uncertainties is one of the most important issues. The nonparametric peak finder has only one free parameter which is fairly independent of input background distributions. The algorithm was tested and found to perform well. The code is implemented in the Python programming language with the graphical output using either ROOT (C++) [15] or SCaVis (Java) [19]. The code example is available for download [20].

Acknowledgments

The authors would like to thank J. Proudfoot for discussion and comments. The submitted paper has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (Argonne). Argonne, a U.S. Department of Energy Office of Science Laboratory, is operated under Contract no. DE-AC02-06CH11357.