# Exercise 3: Background subtraction

The following exercises relate to the NTUPLEs and ROOT code found in `/data/tutorial`. There is a ROOT macro, `backgroundSubtractionTemplate.C`, in the subdirectory. It illustrates
how to do a "background subtraction".

The basic concept is illustrated in the plots below. Figure 1 shows an invariant mass distribution. There is clearly a signal on top of a substantial background.

We define a signal region to be the 20 MeV range symmetric
around the expected mass (1870 MeV). This is the blue range in
the figure. We define a lower sideband and an upper sideband, each 20
MeV wide, symmetrically spaced with respect to the center of the nominal
signal region. These are the red ranges in the figure. We *assume*
that the kinematic distributions of background events in the signal
region are described by the kinematic distributions of background events
in the lower and upper sidebands. From the observed distribution for all
events in the signal region and the observed distribution for background
events in the sidebands, we estimate the kinematic distribution of
signal events in the signal region. This is illustrated in Fig. 1.
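
The region definitions above can be sketched as a small classification function. This is plain C++ rather than ROOT so that it stands alone. The 1870 MeV center and the 20 MeV widths come from the text; the distance from the signal center to each sideband center (`sidebandOffset`, set to 40 MeV here) is an assumed value chosen only for illustration, and the names `Region` and `classifyMass` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Labels for where a candidate's invariant mass falls.
enum class Region { Signal, LowerSideband, UpperSideband, Other };

// Classify an invariant mass (in MeV).
// center, signalWidth and sidebandWidth follow the text.
// sidebandOffset is an ASSUMPTION: the text only says the sidebands are
// placed symmetrically about the center of the signal region.
Region classifyMass(double mass,
                    double center = 1870.0,
                    double signalWidth = 20.0,
                    double sidebandWidth = 20.0,
                    double sidebandOffset = 40.0) {
    // The sidebands must not overlap the signal region.
    assert(sidebandOffset >= 0.5 * (signalWidth + sidebandWidth));

    const double delta = mass - center;
    if (std::fabs(delta) < 0.5 * signalWidth) {
        return Region::Signal;
    }
    // Distance of |delta| from the sideband center, same on both sides.
    if (std::fabs(std::fabs(delta) - sidebandOffset) < 0.5 * sidebandWidth) {
        return delta < 0.0 ? Region::LowerSideband : Region::UpperSideband;
    }
    return Region::Other;
}
```

Keeping the center, widths, and offset as parameters makes it easy to reuse the same function for a different mass peak or to test how sensitive the result is to the sideband placement.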

For each event, anywhere in Fig. 1, we determine which
track has the lowest transverse momentum and we call this the minimum
$p_T$ for the event. The blue points in Fig. 2 are
the observed minimum $p_T$ distribution for tracks from all
candidates in the signal region. The red points are
the *estimated* distribution for background events in the signal
region. This distribution is determined by summing the distributions for
the events in the sideband regions of Fig. 1 and dividing
by two. This factor of two corresponds to using 40 MeV of background
range and 20 MeV of signal range. If this ratio were different, the
scaling factor would differ accordingly. Finally, the green
*background-subtracted signal distribution* is determined by
subtracting the red distribution from the blue distribution bin-by-bin.
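
The scaling and bin-by-bin subtraction can be sketched without ROOT as follows. This is a minimal illustration, not the contents of `backgroundSubtractionTemplate.C`: the `Hist` struct and the function names are hypothetical, and the errors assume Poisson-distributed, statistically independent counts in the signal region and the sidebands.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A histogram reduced to its essentials: bin contents and their errors.
struct Hist {
    std::vector<double> content;
    std::vector<double> error;
};

// Build a Hist from raw event counts, ASSUMING Poisson errors sqrt(N).
Hist fromCounts(const std::vector<double>& counts) {
    Hist h{counts, std::vector<double>(counts.size())};
    for (std::size_t i = 0; i < counts.size(); ++i) {
        h.error[i] = std::sqrt(counts[i]);
    }
    return h;
}

// Estimated background in the signal region: (lower + upper) * scale,
// where scale = (signal range width) / (total sideband width).
Hist estimateBackground(const Hist& lower, const Hist& upper, double scale) {
    assert(lower.content.size() == upper.content.size());
    Hist bkg{std::vector<double>(lower.content.size()),
             std::vector<double>(lower.content.size())};
    for (std::size_t i = 0; i < bkg.content.size(); ++i) {
        bkg.content[i] = scale * (lower.content[i] + upper.content[i]);
        // Independent errors add in quadrature, then scale linearly.
        bkg.error[i] = scale * std::hypot(lower.error[i], upper.error[i]);
    }
    return bkg;
}

// Background-subtracted signal: all events in the signal region minus
// the estimated background, bin by bin.
Hist subtractBackground(const Hist& all, const Hist& bkg) {
    assert(all.content.size() == bkg.content.size());
    Hist sig{std::vector<double>(all.content.size()),
             std::vector<double>(all.content.size())};
    for (std::size_t i = 0; i < sig.content.size(); ++i) {
        sig.content[i] = all.content[i] - bkg.content[i];
        sig.error[i] = std::hypot(all.error[i], bkg.error[i]);
    }
    return sig;
}
```

With 20 MeV of signal range and 40 MeV of sideband range, `scale` is 20.0 / 40.0 = 0.5, which is the "dividing by two" in the text. In ROOT, one common way to do the same thing is to call `Sumw2()` on the histograms and then use `TH1::Add` with a coefficient of -0.5 for each sideband histogram; whether the template macro does it this way is not stated here.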

Looking at Fig. 2, it is clear that the signal-to-background ratio is much greater above minimum $p_T$ = 700 MeV than it is below 500 MeV. If we were to select only candidates with minimum $p_T$ > 700 MeV we would lose signal, but what remained would be much cleaner. We could alternatively study the same issue by separating the data into, for example, three ranges of minimum $p_T$ and making the invariant mass distribution for each. Events with minimum $p_T$ < 500 MeV would have more background than signal in the signal region. Events with minimum $p_T$ in the range 500 - 700 MeV would have more signal than background in the signal region. Events with minimum $p_T$ > 700 MeV would have an even higher signal-to-background ratio in the signal region. Exactly how to use this information in an analysis depends on the details of the distributions and what is being studied. But the approach to understanding how to statistically discriminate between signal and background distributions is common to very many analyses.
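
One way to put numbers on this comparison is to sum the background-subtracted signal and the estimated background over a range of minimum transverse momentum and take the ratio. The sketch below assumes the two distributions are available as bin contents with shared bin edges; the function names are hypothetical, and only bins lying entirely inside the requested range are counted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum the contents of all bins that lie entirely inside [lo, hi].
// edges holds the bin boundaries, so it has one more entry than contents.
double sumInRange(const std::vector<double>& contents,
                  const std::vector<double>& edges,
                  double lo, double hi) {
    assert(edges.size() == contents.size() + 1);
    double total = 0.0;
    for (std::size_t i = 0; i < contents.size(); ++i) {
        if (edges[i] >= lo && edges[i + 1] <= hi) {
            total += contents[i];
        }
    }
    return total;
}

// Signal-to-background ratio in [lo, hi], using the background-subtracted
// signal distribution and the estimated background distribution.
double signalToBackground(const std::vector<double>& signal,
                          const std::vector<double>& background,
                          const std::vector<double>& edges,
                          double lo, double hi) {
    const double s = sumInRange(signal, edges, lo, hi);
    const double b = sumInRange(background, edges, lo, hi);
    assert(b > 0.0);  // the ratio is undefined without background
    return s / b;
}
```

Calling `signalToBackground` for the ranges below 500 MeV, 500 to 700 MeV, and above 700 MeV gives three numbers that can be compared directly with what the mass distributions in each range show.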

Using the NTUPLE we have been using:

- Define signal and sideband ranges in the `LambdaC_M` distribution.
- Make a background-subtracted distribution of `PromptK_ProbNNk`. This variable is supposed to describe the probability that a charged track is really a kaon. Compare it to the estimated background distribution. That is, make a plot showing both distributions and write a few sentences describing what you see qualitatively.
- Make a background-subtracted distribution of `PromptK_ProbNNpi`. This variable is supposed to describe the probability that a charged track is really a pion. Compare it to the estimated background distribution. That is, make a plot showing both distributions and write a few sentences describing what you see qualitatively.
- Make a scatter plot of `PromptK_ProbNNk` versus `PromptK_ProbNNpi`. Describe what you see and draw a conclusion.
- Make a background-subtracted distribution of the variable `PromptPi_TRACK_GhostProb`. It is supposed to be large if the reconstructed track is likely to be a fake track created by incorrectly combining track segments. What conclusion do you draw?
- Make a background-subtracted distribution of `LambdaC_TAU`. This variable is nominally the decay time of the candidate, reported in nanoseconds. It is calculated *assuming* the $\Lambda_c$ is produced at the event’s primary interaction point (also called the primary vertex). If the calculation fails for some reason, the decay time is reported as -100. The lifetime of the $\Lambda_c$ is a fraction of a picosecond. Make your plots explicitly defining a reasonable range of candidate decay times. What conclusions do you draw?
- Based on the results of these studies and whatever you have previously learned about variables which discriminate between signal and background, define sets of cuts which produce larger, dirtier samples of $\Lambda_c$ candidates and smaller, cleaner samples. Fit the distributions to extract the signals and statistical errors. Make a table showing how the statistical significance first grows, and later diminishes, as you apply tighter and tighter cuts.
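
For the significance table in the last item, a small helper can keep the bookkeeping consistent. The exercise does not define "statistical significance"; the sketch below assumes it is the fitted signal yield divided by its statistical error, and the `CutResult` struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Result of fitting the mass distribution for one set of cuts.
struct CutResult {
    std::string label;  // description of the cut set
    double yield;       // fitted signal yield
    double error;       // statistical error on the yield
};

// ASSUMED definition: significance = yield / statistical error.
double significance(const CutResult& r) {
    assert(r.error > 0.0);
    return r.yield / r.error;
}

// Index of the cut set with the highest significance.
std::size_t mostSignificant(const std::vector<CutResult>& results) {
    assert(!results.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < results.size(); ++i) {
        if (significance(results[i]) > significance(results[best])) {
            best = i;
        }
    }
    return best;
}
```

Filling one `CutResult` per cut set, ordered from loosest to tightest, and printing `label`, `yield`, `error`, and `significance` for each gives the requested table; `mostSignificant` then points to where the significance peaks.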