Implementing a Multivariate Analysis to Identify Plug Photons in the Search for the
Higgs Boson at CDF
Joseph Rauch
Baylor University
University of Denver
(Dated: August 5, 2011)
We discuss replacing the standard selection requirements (cuts) used to identify plug photons with
a multivariate analysis. The standard cuts perform reasonably well; however, they fail to account for
correlations between variables and ignore some potentially discriminating variables altogether. We trained
multivariate classifiers with Monte Carlo (MC) samples to distinguish real photons from fake photons. Multivariate
analysis techniques have already been used to improve signal efficiency and background rejection
for central photons and have already been implemented in the H→γγ search [1]. Here we show
that they have the same effect for plug photons.
I. INTRODUCTION
The Standard Model (SM) predicts the existence of 17 fundamental particles. By colliding protons and antiprotons
at a center-of-mass energy of 1.96 TeV, the Tevatron at the Fermi National Accelerator Laboratory has successfully
produced 16 of the fundamental particles (most likely 17 if the Higgs does exist) and various combinations of them. The Higgs
boson remains the only particle predicted by the SM that is still experimentally unconfirmed. The Collider Detector at
Fermilab (CDF) uses multiple complex detectors to measure different properties of particles produced in the collisions,
such as momentum and energy. Some particles, generally the more massive ones, have an extremely short lifetime
and cannot be directly detected; they must be reconstructed from the more stable particles into which they decay.
Thus, more efficient identification (high signal efficiency and low background rates) of the more stable particles is
necessary for accurate reconstruction of the more massive and elusive particles.
Photons are an integral part of many different particles' decay chains, so a more efficient method for identifying
photons will potentially help many different analyses. The identification of photons suffers primarily from two
sources of background. The first is the electron, which mimics a photon signal when it deposits its energy in the
electromagnetic calorimeter. The second is π⁰ and η⁰ particles within jets, which decay into two collinear
photons. These photons can deposit their energy in the calorimeter so close together that they appear as one photon,
faking a true photon from the collision region. Multivariate analysis techniques have been shown to distinguish
real photon events from photon-like background better than the standard selection requirements (cuts) in the
central region of the detector [1], |η| < 1.1; however, these techniques had not been applied to the plug (forward) region,
1.2 < |η| < 2.8, until now.
The improved identification for plug photons would especially help the search for the Higgs boson via the diphoton
decay channel. The search for the Higgs boson at CDF focuses primarily on the bb̄ decay channel because of its
high branching ratio. Although the branching ratio for H→ γγ is very small (0.2%), the diphoton decay channel
remains interesting because photons are measured with a better energy resolution and higher identification efficiency
than b jets. For a search in the H→ γγ channel, it is important to maximize the efficiency for detecting photons.
By implementing a multivariate analysis technique to identify plug photons, we may be able to increase the signal
efficiency.
II. DETECTOR
The CDF detector contains a Silicon Vertex detector (SVX II) closest to the collision site. Around the SVX is
the Central Outer Tracker (COT). Together, these detectors track the positions and measure the momenta
of charged particles emerging from the collision region. Around the COT is a powerful solenoid creating a uniform
magnetic field (1.4 T) to bend the paths of charged particles so their momentum may be measured. The central
electromagnetic calorimeter (CEM) measures the energy of electrons and photons as they pass into it and are absorbed.
The central hadronic calorimeter (CHA) does the same for hadrons. Lastly, the muon chambers track the muons' positions
as they leave the detector. In the plug region of the detector, 1.2 < |η| < 2.8, the particle tracking is less efficient
because particles only pass through portions of the SVX and COT. Instead of direct tracking, sophisticated programs
are used to infer the particles' paths from the information that is measured. Later we will discuss electrons detected
using this method, which are called phoenix electrons. The plug region has its own electromagnetic calorimeter
(PEM) and hadronic calorimeter (PHA), which perform at slightly lower efficiencies than the CEM and CHA [2].
More information on the CDF detector can be found in references [3] and [4].
FIG. 1: A cross sectional view of the CDF depicting the different layers of the detector as well as the central and plug regions.
III. STANDARD CUTS AND MOTIVATION FOR MULTIVARIATE ANALYSIS
While a multivariate analysis has been studied and implemented in the current H→γγ analysis for the central region
of the detector [1], analyses continue to use standard cuts in the plug region. Like the central region, however, the plug
region is plagued by backgrounds from electrons and quark jets. While in the central region tracking information from
the SVX and COT can significantly reduce the background from electrons, the low efficiency for track reconstruction
in the plug region makes cutting these electrons more difficult because some of them may not have a track. This
is discussed further in the conclusion section. A similar problem arises for the jet backgrounds. Many π⁰ and η⁰
mesons are not tracked at all by the SVX or COT, and they decay into two collinear photons which are often so close
that they appear as a single photon in the PEM. Because of this, electrons and quark jets can pass the standard cuts with
a frequency studied in reference [5].
Variable               Standard Cut                                Loose Cut
PES U and V Fiducial   1.2 < |η| < 2.8                             Same
HAD/EM                 < 0.05 for ECorr ≤ 100 GeV                  < 0.125
                       ELSE: < 0.05 + 0.026 × ln(ECorr/100)
Cone 0.4 IsoEtCorr     < 0.1 × EtCorr for EtCorr ≤ 20              < 0.15 × EtCorr for EtCorr ≤ 20
                       ELSE: < 2.0 + 0.02 × (EtCorr - 20.0)        ELSE: < 3.0 + 0.02 × (EtCorr - 20.0)
PEM χ2                 < 10                                        None
PES 5 × 9              > 0.65                                      None
Cone 0.4 Track Iso     < 2.0 + 0.005 × EtCorr                      < 5
TABLE I: Standard Cuts for Photon ID in Plug Region (For more information on the variables see reference [6].)
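The cuts in Table I can be written out as a pair of boolean selections. The following is a minimal Python sketch, not the CDF analysis code: the dictionary keys (eta, ECorr, EtCorr, HadEm, IsoEtCorr, PemChi2, Pes5x9, TrackIso) are hypothetical names chosen for illustration, and a single PES 5 × 9 value is assumed where the real selection may treat the U and V profiles separately.

```python
import math

def passes_standard_cuts(p):
    """True only if the candidate passes every standard cut in Table I."""
    if p["ECorr"] <= 100.0:
        had_em_max = 0.05
    else:
        had_em_max = 0.05 + 0.026 * math.log(p["ECorr"] / 100.0)
    if p["EtCorr"] <= 20.0:
        iso_max = 0.1 * p["EtCorr"]
    else:
        iso_max = 2.0 + 0.02 * (p["EtCorr"] - 20.0)
    return (
        1.2 < abs(p["eta"]) < 2.8
        and p["HadEm"] < had_em_max
        and p["IsoEtCorr"] < iso_max
        and p["PemChi2"] < 10.0
        and p["Pes5x9"] > 0.65
        and p["TrackIso"] < 2.0 + 0.005 * p["EtCorr"]
    )

def passes_loose_cuts(p):
    """True if the candidate passes the loose cuts in Table I."""
    if p["EtCorr"] <= 20.0:
        iso_max = 0.15 * p["EtCorr"]
    else:
        iso_max = 3.0 + 0.02 * (p["EtCorr"] - 20.0)
    return (
        1.2 < abs(p["eta"]) < 2.8
        and p["HadEm"] < 0.125
        and p["IsoEtCorr"] < iso_max
        and p["TrackIso"] < 5.0
    )

# Hypothetical candidate (values invented for illustration): it fails only the PEM chi2 cut.
candidate = {
    "eta": 1.8, "ECorr": 60.0, "EtCorr": 30.0, "HadEm": 0.01,
    "IsoEtCorr": 1.0, "PemChi2": 12.0, "Pes5x9": 0.9, "TrackIso": 0.5,
}
print(passes_standard_cuts(candidate), passes_loose_cuts(candidate))
```

The invented candidate is rejected by the standard selection because of a single variable, yet it survives the loose selection and so would still reach a multivariate classifier.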
Maximizing signal efficiency and background rejection is the main motivation for this study. As just
discussed, the standard cuts can allow background events to be mistaken for signal events, and they can also
remove real photon events from the event sample. A real photon may pass all of the cuts with the
exception of one, and the standard cuts will still reject the event. This problem arises because the standard cuts are fixed
and give a boolean result of true or false for each event. Standard cuts also neglect possible correlations between
variables, which can be used to better identify photons. Multivariate analysis techniques use complex algorithms,
called “classifiers”, to look at all the variables for an event in combination, rather than at each variable independently. The
classifiers give a continuous output, with background and signal defined at the extremes. This allows the analysis
to make a single cut on how signal-like an event must be to be considered a real photon event. The trade-offs for
better signal efficiency and background rejection are a longer processing time, because the classifier needs to evaluate
each event individually, and reduced transparency, because the complexity of the algorithms often makes it hard to see
how the output was produced. However, the extra processing time is normally not significant, and the output can be
trusted if the classifier was carefully trained.
IV. TRAINING
The multivariate classifiers must be trained to distinguish signal from background with samples of pure signal events
and background events. How these samples are produced is therefore extremely important, as the performance
of the classifier will depend on them. The samples used to train the multivariate classifiers were produced using Monte
Carlo techniques, so we were able to control the parameters precisely for both the signal and the background
training samples. The background sample was composed of quark jets and no electrons; since electrons interact with
the calorimeter in the same way as photons, we should not expect a multivariate analysis to distinguish the two.
Instead, a set of tracking cuts must be applied to exclude electrons. While this is made difficult by the less
efficient tracking in the plug region, there are some variables we can possibly use [6]. The jets, on the other hand, will
not interact with the calorimeter in the same way, so we can expect the multivariate analysis to learn to distinguish
between the jet sample and the photon sample. Making the jet sample more general than just π⁰'s and η⁰'s meant
the possibility of producing Initial State Radiation (ISR), where a quark from the proton or antiproton radiates a
real photon. To avoid training the classifier to recognize this real photon as background, we apply a cut to get rid
of ISR before training begins. Both samples were constructed so that they pass the loose set of cuts (see Table I)
because these cuts are general enough that they exclude a large portion of background but keep all the signal. This
way the multivariate classifier trains to distinguish between the background and signal that appear most similar and
with which the standard cuts have the most difficulty.
A. Training Variables
The input variables for training were chosen specifically to highlight the differences between jet signatures
and photon signatures in the calorimeter. No tracking variables intended to exclude electrons were included, except for Sum
Pt4, because it can be used to exclude jets as well. The tracking variables would be applied with the loose set of
cuts before the multivariate analysis is applied. For specific information on each of the variables see [6].
[Figure 2: normalized signal and background distributions, (1/N) dN per bin, of the input variables Chi3x3, HadEm, EIso4, PES_PEM, PesDeltaR, and Prof5by9U.]
FIG. 2: The variables used to train the multivariate analysis to distinguish between quark jets and photons. For more detail
about these variables see [6].
B. Weighting
While transverse energy (ET), the energy directed in the plane transverse to the beam direction, may seem to be a
good variable to train with, because photons normally have a higher ET than background jets, this is not the case.
First of all, ET is highly correlated with other training variables, meaning the classifier may simply cut on the
ET value rather than exploit the correlations among the other variables. This is undesirable: although the ET distribution
for the background falls off rapidly at high ET, the background can still overwhelm the signal through sheer numbers. Also,
we desire a method for identifying photons at both high and low ET, where each range suffers from background of the
same ET. Thus, to force the classifier to discriminate between background and signal of the same ET, we re-weighted
the background ET distribution to match that of the signal. Lastly, to avoid situations with too few signal or too few
background events, we cut out the ET ranges where the weights were too high or too low. This ensured we had good
statistics in both signal and background samples.
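The re-weighting just described can be sketched as a per-bin ratio of normalized ET histograms. This is a minimal Python illustration under stated assumptions, not the code used in the study: the function names, the coarse binning, and the toy event lists are hypothetical. The weight limits (a lower cut of 1/13 everywhere and an upper cut of 13 only above 40 GeV) are taken from the caption of Figure 3.

```python
from bisect import bisect_right

def normalized_hist(values, bin_edges):
    """Fraction of in-range events falling in each [edge_i, edge_i+1) bin.
    Assumes at least one value lies inside the binned range."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        if bin_edges[0] <= v < bin_edges[-1]:
            counts[bisect_right(bin_edges, v) - 1] += 1
    total = sum(counts)
    return [c / total for c in counts]

def background_weights(signal_et, background_et, bin_edges,
                       w_low=1.0 / 13.0, w_high=13.0, high_cut_from_et=40.0):
    """Per-bin weight = signal fraction / background fraction.
    A bin is dropped (None) if either sample is empty there, if its weight is
    below w_low, or if it starts at or above high_cut_from_et and exceeds w_high."""
    sig = normalized_hist(signal_et, bin_edges)
    bkg = normalized_hist(background_et, bin_edges)
    weights = []
    for i, (s, b) in enumerate(zip(sig, bkg)):
        if s == 0.0 or b == 0.0:
            weights.append(None)
            continue
        w = s / b
        too_low = w < w_low
        too_high = bin_edges[i] >= high_cut_from_et and w > w_high
        weights.append(None if (too_low or too_high) else w)
    return weights

# Toy ET values in GeV, invented purely to exercise the logic.
edges = [0.0, 20.0, 40.0, 60.0]
toy_signal = [10.0] * 1 + [30.0] * 14 + [50.0] * 5
toy_background = [10.0] * 18 + [30.0] * 1 + [50.0] * 1
print(background_weights(toy_signal, toy_background, edges))
```

In this toy case the lowest bin is dropped because its weight falls below 1/13, while the 20 to 40 GeV bin keeps a weight near 14 because no upper limit is imposed below 40 GeV.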
[Figure 3: left panel, Et distributions of the signal and background samples; right panel, weights applied to the background as a function of Et.]
FIG. 3: The left plot shows the ET distributions for the signal (black) and the background (red). The right plot shows the
weights applied to the background. No upper weight cut was applied in the ET < 40 GeV range and a weight cut of 13 was
applied in the ET > 40 GeV range. A lower weight cut of 1/13 was applied throughout the whole ET range.
C. Procedure
Once the signal and background samples were generated with the necessary variables, the following process was
used to train the MVA classifiers:
• Apply loose cuts from Table I
• Apply cuts to exclude ISR photons
• Weight the background and cut appropriately
At this point the samples are ready for input into the classifiers. For the training we used the Tools for Multivariate
Analysis (TMVA) software package Version 04.01.00 [7], available online and through recent versions of ROOT. There
are many different classifiers available through TMVA, so we did a preliminary training run on a fraction of the signal
and background samples. After this training we chose two of the classifiers that performed the best to conduct a full
training and comparison.
V. RESULTS
The plot in Figure 4 shows the results from the preliminary training of several MVA classifiers on a fraction of the
signal and background samples. As one can see, the MLP and BDT classifiers performed the best and were therefore
chosen for a full training. During training, one runs the risk of overtraining: fitting the classifier so closely to the
training samples that it recognizes only those particular events as signal or background. Thus, the classifiers were
trained on half of the samples and the other half was used to test the training. The results of this are shown in
Figure 5. After the classifier has been trained, it
outputs a number on a continuous scale, the boundaries of which depend on the specific classifier chosen, for each of
the events tested. This means that only one cut is now required, rather than a separate cut on each variable. A 0.8 cut
on the MLP output, for example, would mean an event with an output > 0.8 would be taken as signal, and everything
below as background.
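The single cut on the classifier output translates directly into a signal efficiency and a background rejection. The following short Python sketch illustrates the bookkeeping; the function name and the score lists are hypothetical and are not outputs of the trained MLP or BDT.

```python
def efficiency_and_rejection(signal_scores, background_scores, cut):
    """Treat 'score > cut' as signal.
    Returns (fraction of signal kept, fraction of background removed)."""
    n_sig_pass = sum(1 for s in signal_scores if s > cut)
    n_bkg_fail = sum(1 for b in background_scores if b <= cut)
    return n_sig_pass / len(signal_scores), n_bkg_fail / len(background_scores)

# Hypothetical classifier outputs, invented for illustration only.
signal_scores = [0.95, 0.90, 0.85, 0.60, 0.99]
background_scores = [0.10, 0.30, 0.85, 0.05]

sig_eff, bkg_rej = efficiency_and_rejection(signal_scores, background_scores, cut=0.8)
print(sig_eff, bkg_rej)

# Scanning the cut value traces out a rejection-versus-efficiency curve.
curve = [efficiency_and_rejection(signal_scores, background_scores, c)
         for c in (0.2, 0.5, 0.8)]
print(curve)
```

Raising the cut moves along the curve toward higher background rejection and lower signal efficiency, which is the trade-off a later optimization of the cut would have to balance.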
We can see the gain in signal efficiency and background rejection from Figure 6 and from Table II. A further study
will investigate which cut optimizes the signal efficiency and background rejection in the search for the Higgs boson.
However, because of the extremely low number of signal events compared to the high number of background events,
the cut will be extremely high.
ID Method       Signal Efficiency   Background Rejection   Cut
Standard Cuts   83.4%               85.7%                  All Cuts Applied
MLP             83.4%               94.5%                  0.72
BDT             83.4%               94.9%                  0.07
MLP             96.1%               85.7%                  0.31
BDT             96.4%               85.7%                  -0.04
TABLE II: Signal efficiency and background rejection for the standard cuts, together with the signal efficiencies,
background rejections, and cut values for the MLP and BDT classifiers. Each classifier is shown at two working points: one
matching the signal efficiency of the standard cuts and one matching their background rejection. In both cases the MVA
classifier outperforms the standard cuts.
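The size of the improvement in Table II is easier to judge as ratios. The short Python calculation below uses only the MLP and standard-cut numbers from the table; the variable and function names are illustrative.

```python
# Working points copied from Table II, written as fractions.
standard_cuts = {"sig_eff": 0.834, "bkg_rej": 0.857}
mlp_matched_eff = {"sig_eff": 0.834, "bkg_rej": 0.945}
mlp_matched_rej = {"sig_eff": 0.961, "bkg_rej": 0.857}

def fake_rate(point):
    """Fraction of background events that survive the selection."""
    return 1.0 - point["bkg_rej"]

# At equal signal efficiency: how many times fewer background events survive.
fake_reduction = fake_rate(standard_cuts) / fake_rate(mlp_matched_eff)

# At equal background rejection: relative gain in signal efficiency.
efficiency_gain = mlp_matched_rej["sig_eff"] / standard_cuts["sig_eff"]

print(f"background surviving: {fake_rate(standard_cuts):.3f} -> {fake_rate(mlp_matched_eff):.3f}")
print(f"fake reduction factor: {fake_reduction:.2f}")
print(f"relative efficiency gain: {efficiency_gain:.3f}")
```

With these numbers, the MLP at the standard-cut signal efficiency lets through about 5.5% of the background instead of 14.3%, a reduction of roughly a factor of 2.6, while at the standard-cut background rejection it keeps about 15% more signal.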
FIG. 4: The performance of several MVA classifiers is evaluated. The classifiers that perform the best maximize signal efficiency
and background rejection simultaneously. From the plot one can see the MLP (a Neural Network) and Boosted Decision Tree
(BDT) performed the best.
VI. CONCLUSIONS
In conclusion, we compared the standard cuts for plug photon identification with
a multivariate analysis. We demonstrated that, if properly trained, the multivariate analysis can substantially
outperform the standard cuts: at the same background rejection (85.7%), the signal efficiency rises from 83.4% to
about 96% (Table II). By providing a sample of Monte Carlo-generated jet background and a sample of
pure photons to the chosen classifiers, we trained them to distinguish the background, particularly π⁰'s and η⁰'s,
from the photons better than the standard cuts do. Further study is required to investigate the possibility of making
tracking cuts along with the loose cuts prior to training. Tracking cuts would remove phoenix electrons and so reduce
the other main source of false photon signals. Also, we need to study further the effects that multiple vertices have on
the signal efficiencies. While these results are somewhat preliminary, the advantages to using a multivariate analysis
over the standard cuts are clear. Once every concern has been addressed, the group searching for the Higgs boson
via the diphoton decay channel plans to implement a multivariate classifier to identify plug photons so as to match
the method for identifying central photons. The multivariate identification of plug photons should be included in the
final results based on the full dataset collected at CDF before the Tevatron stops running at the end of September.
FIG. 5: The results from training and testing the MLP and BDT classifiers. The close agreement between the training and
testing samples implies that the classifiers were not overtrained or susceptible to statistical fluctuations in the training sample.
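One way to put a number on the train/test agreement described in this caption is a two-sample Kolmogorov-Smirnov distance between the classifier outputs on the two halves. The Python sketch below is an illustration only: the text does not state which comparison was used, and the score lists are hypothetical.

```python
def ks_distance(sample_a, sample_b):
    """Largest gap between the empirical cumulative distributions of two samples.
    0.0 means identical distributions; 1.0 means no overlap at all."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    distance = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every entry equal to x so that ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        distance = max(distance, abs(i / len(a) - j / len(b)))
    return distance

# Hypothetical classifier outputs on the training half and the testing half.
train_scores = [0.10, 0.35, 0.60, 0.80, 0.95]
test_scores = [0.12, 0.33, 0.58, 0.82, 0.93]
print(ks_distance(train_scores, test_scores))
```

A small distance for both the signal and the background samples is consistent with a classifier that generalizes; a large distance would suggest it has adapted to fluctuations in the training half.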
[Figure 6: "Background rejection versus Signal efficiency" curves for the BDT and MLP methods; x axis: signal efficiency (0.7 to 1), y axis: background rejection (0.2 to 1).]
FIG. 6: Background rejection versus signal efficiency for the fully trained MLP and BDT classifiers.
The star marks the signal efficiency and background rejection of the standard cuts. The plot shows
the advantage of an MVA identification method.
[1] K. Bland, R. Culbertson, J. Dittmann, C. Group, and J. Ray (CDF) (2010), CDF note 10036.
[2] R. Oishi, Nuclear Instruments and Methods in Physics Research A 453, 227 (2000).
[3] F. Abe et al. (CDF Collaboration), Nucl. Instrum. Meth. A271, 387 (1988).
[4] P. T. Lukens (CDF IIb Collaboration) (2003).
[5] C. Lester et al. (CDF) (2007), CDF note 9033.
[6] S.-M. V. Wynne (2007), Ph.D. thesis (advisor: Mike Houlden).
[7] A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. von Toerne, H. Voss, M. Backes, T. Carli, O. Cohen, A. Christov,
et al., ArXiv Physics e-prints (2007), arXiv:physics/0703039.