Building the NIST Tandem Mass Spectral Library 2014

Building the NIST Tandem Mass Spectral Library 2014
Xiaoyu (Sara) Yang
Mass Spectrometry Data Center
Biomolecular Measurement Division Seminar
September 9, 2014
1
Outline
• Methods of building the NIST tandem mass spectral library
• Quality Control
o Peak annotation
o Noise Removal
o Chemical information consistency
• Major types of mass spectra
• Major types of compounds
2
Mass Spectral Library Searching
Experimental
mass spectrum
Searching
Reference
Spectra in
MS Library
Experimental
matched
Reference
MS Search
3
NIST Tandem Mass Spectral Library
Rationale and Objectives:
• Chemical identification through Electrospray ionization (ESI) tandem
mass spectrometry (MS/MS) is becoming a routine technique in
metabolomics, proteomics and other fields.
• The identification can be aided by matching the acquired tandem mass
spectra against reference library spectra.
• We are developing a comprehensive library of high quality reference ESI
tandem mass spectra for the identification of compounds through the
ion fragmentations.
Goals:
• Develop a tandem mass spectral library of all biologically relevant metabolite ions.
• Provide the library in a form that is easily searchable using software tools.
4
Steps of Building the NIST Tandem Mass Spectral Library
Authentic samples
metabolites, drugs,
peptides
lipids, pesticides, surfactants,
glycans, sugars
LC/MS/MS
Ion Trap (LTQ, IT/FTMS)
First
clustering
Second
clustering
Collision Cell (HCD, QQQ, QTOF)
Cluster MS2 spectra using precursor m/z
count-based
clustering algorithm
cluster spectra of the similar
precursor m/z values
Create consensus spectra of MS2, MS3 and MS4
adjusted dot product-based
clustering algorithm
Chemical structure,
formula, name,
synonyms, and CAS
are consistent
cluster spectra of the similar
fragmentations
Precursor type identification
• precursor purity
• mass accuracy
• peak annotation
• noise removal
Manual
inspection
MSMS library with multiple precursor types
5
Count-based Algorithm for Clustering Precursors
Steps:
1. Count the number of precursors (m/z count) within 0.1 m/z;
2. Sort the precursors by the m/z count in descending order;
3. Group similar precursors into the same cluster by using the precursor
with the highest m/z count as the cluster center;
4. Repeat step 3 until all the precursors are clustered.
intensity
100,000,00035
m/z count
40
Intensity vs m/z
m/z count vs m/z
30
10,000,00025
437
spectra
30
20
33
clusters
1,000,000
15
20
10
10
100,000
5
β-Dihydroequilin
[M+H]+
10,0000
HCD 10eV
0
0
100
100
200
200
300
300
400
400
500
500
600
600 m/z
6
Clustering Algorithm for Generating Consensus Spectra
Steps:
1. Group similar spectra into the same cluster;
2. Generate one consensus spectrum from each cluster;
3. Pick the best consensus spectrum for the library.
Dot Product (DP) =
1
3
2
4
 I 1* I 2
5
 I 1*  I 2
I – peak intensity
MS/MS spectra
1. Calculate adjusted dot
product (DP)
2. Find a cluster center with
the highest average DP (>0.7)
cluster 2
cluster 1
3 5
1 2 4
Cluster the
same ion at
the same
collision
energy
Consensus spectrum:
Median peak intensity
Median m/z
consensus
spectrum 1
consensus
spectrum 2
bin size
bin = 700 x 10ppm x 10-6 =
0.0070 m/z when m/z=700
best consensus
spectrum
7
Using Consensus Spectrum in the Library
• Eliminated low quality spectra by spectral clustering.
• Improved the spectrum quality by using the median of the m/z and
intensity values.
• Realistically represented the characteristic fragmentations.
Same compound
Same precursor type
Same instrument
Same energy (cone, collision)
Same mode (+/-)
Same spectrum type (MS2, MS3, MS4)
10-20 spectra / consensus spectrum
10-20 energy levels
8
What Precursor Types are in the NIST MSMS Library?
Compound
Type
Neutral
Molecule
Organic
Salt
Cations
ESI Product
Ionic Species
[M+H]+, [M+2H]2+, [2M+H]+, [M+H-H2O]+,
+, [M+H-OH]+, [M+H+H O]+,
[M+H-NH
]
3
2
Positive Ions
[M+NH4]+, [M+Na]+, [M-H+2Na]+, [M(129,662; 84%) 2H+3Na]+, [M+K]+, [M-H+2K]+, [M-2H+3K]+,
[M+Li]+, [M-H+2Li]+, [M-2H+3Li]+
Negative Ions [M-H]-, [M-2H]2-, [2M-H]-, [M-H-H2O]-, [M-H-, [M-H+H O]-, [M-H+NH ]NH
]
3
2
3
(23,638; 15%)
Positive Ions [Cat]+, [Cat+H]2+, [Cat-H2O]+, [Cat-NH3]+,
(1,751; 1%)
Negative Ions
(40; <0.1%)
[Cat+H2O]+
[Cat-2H]-, [Cat-2H-H2O]-, [Cat-2H-NH3]-, [Cat2H+H2O]-, [Cat-2H+NH3]9
Multiple Precursor Ions for
More Flexible Identification
100
p-C5H10O10P2
137.0459
O
p=[M+H]+
Relative Intensity
0
N
N
HO
O
100
50
0
100
50
0
O
P
OH
p
429.0211
p-C5H10O9N4P2
97.0283
200
300
400
p-C5H4ON4
336.9462
Na2PO3
124.9375
p-C5H6O2N4
318.9356
500
200
300
m/z
OH
50
0
100
p
472.9848
50
400
500
p=[M+Na]+
Energy=18eV
P
p=[M-H+2Na]+
Energy=28eV
100
200
300
400
500
p-C5H10O10P2
HP2O6
p=[M-H]135.0316
158.9257
Energy=27eV
PO3
p
78.9592
p-H3PO4
329.0298 427.0067
100
OH
O
O O
HO
100
100
N
Energy=12eV
50
p-C5H6O2N4
296.9537
H
N
0
100
50
0
p-C5H9O9N4NaP2
97.0283
90
NaH4P2O7+
200.9325
180
270
360
p-C8H11NaO4N4 p-C H ON
5 4
4
244.8964
358.9282
100
p
494.9668
200
HP2O6
158.9256
300
p-C5H4ON4
272.9574
200
inosine 5'-diphosphate acquired on Orbitrap HCD
450
p=[M-2H+3Na]+
Energy=29eV
Na3HPO4
164.9300
PO3100
78.9592
p-C5H4ON4
314.9643
p
451.0031
400
500
p=[M-H-H2O]Energy=32eV
p
408.9958
300
m/z
400
500
10
What Precursor Types are in the NIST MSMS Library?
Structure dependent losses:
2H2O, 3H2O, NH3+H2O,
H2S, HCl, H3PO4, HCN, H2,
CO, CO2, HCOOH,
CH4, CH3, CH3OH, CH3SH,
C2H5OH,
...
11
In Source Fragmentation Confirms Metabolite Identification
MS2 spectra acquired on Orbitrap HCD
12
n
MS
Orbitrap FT/IT
Orbitrap ion trap MS2
MS22
MS
Estra-1,3,5(10),7-tetraene-3,17β-diol [M+H]+
13
n
MS
Estra-1,3,5(10),7-tetraene-3,17β-diol [M+H]+
MS2
MS3
Orbitrap ion trap
MS3
MS3
MS4
14
Noise Removal
Relative Intensity
1. Satellite peaks in Orbitrap HCD spectra:
due to the Fourier transform ringing artifacts
100
359.0117
A
130.0612
50
0
<1% intensity of
the main peak
within ± 2 m/z
218.9783
70
140
210
m/z
2 m/z
280
thiencarbazone
-methyl
[M+H]+
350
2 m/z
15
Noise Removal
Relative Intensity
2. Tailing peaks in QTOF spectra:
due to the imperfect centroiding
100
A
84.04
102.05
130.04
1-Eicosatrienoyl-snglycero-3phosphoethanolamine
[M+H]+
190.08
50
172.07
0
60
90
84.044
20 84.044
m/z
2020
<20% intensity
of the main peak
after 0.3 m/z
150
190.080
20 190.080
202020
B
10
1010
00
00
120
84
84
1010
10
2020 84.044
20 84.044
10
101010
84.236
84
190
000
00 0
0.3
190
84
84.236
101010
10
180
C
190.246
0.3
B
202020 190.080
20 190.080
190.246
85
85
191
85
191
85
C
16
Noise Removal
3. Random noise peaks: in all mass spectra
due to unstable instrument, impurity…
The occurrence <25%
occurrence=1/3
=33.3%
occurrence=2/3
=66.7%
17
Noise Removal – an Example of Voting Algorithm
100
A
p-C4
75.0
50
…
Relative Intensity
0
100
50
100
0
50
0
22
Relative Intensity
C
100
p-C4H8O2
75.0229 p-C2H5O2
50
102.0466
0
22
11
0
00
1
1
60
80
100
120
140
160
m/z
B
A
p-C4H8O2
p-CH4O2
75.0229
p-C3Hp-C
4O22H115.0545
5O2
102.0466
115.0545
p-C4H8O2
60 75.0229
80
B
p-C
100
2H5O2
102.0466
11
22
0
00
11
1
01
00
2
12
1
22
80
100
120
140
160
60
80
100
120
140
160
B
C
11
120
140
160
m/z
60
80
100
120
140
160
B60
80
100
120
140
160
60
60
80
80
100
100
120
120
140
140
160
160
60
80
100
120
140
160
after
0
00
1
1
2
2
before
2
2
60
60
C
C
p-CH4O2
p-C3H4O2 115.0545
115.0545
A
A
p-CH4O2
p-C3H4O2 115.0545
115.0545
m/z
4-methylcinnamic acid [M+H]+
HCD
18
60
60
Peak Annotation – Peptide
(for low and high resolution MS/MS spectra)
H+
ADGK/1
Int-DG
• p, y, b, a, internal fragments, neutral losses(-H2O, -NH3, -CO)
• y+10(CO-H2O), a2-45(CONH3)
• 52 immonium ions (e.g. IHA: C6H7N3O + H - CO --> C5H8N3)
19
Peak Annotation - Small Molecule
(for high resolution MS/MS spectra)
• Peaks were annotated with the most probable chemical
formula consistent with the precursor formula
formula valence=∑Count*(valence-2) + 2 + charge
count of each element
formula valence<=# of H; H:C ratio>=0.125
|(Observed m/z – Theoretical m/z)|
Accuracy (ppm) =
x 106
Observed m/z
Sum of intensities of unassigned peaks
Unassigned % =
x 100
Sum of intensities of all peaks
T. Kind and O. Fiehn, Seven Golden Rules for heuristic filtering of molecular formulas obtained by
accurate mass spectrometry. BMC Bioinformatics 2007, 8:105 doi:10.1186/1471-2105-8-105
10 ppm
10 %
20
Peak Annotation - Small Molecule
(for high resolution MS/MS spectra)
H+
/-4.2ppm
C10H12O H+
Cuminaldehyde [M+H]+
theoretical precursor m/z= 149.0961
HCD energy=19eV
Unassigned=0.9%
/-1.3ppm
/-0.6ppm
/-0.1ppm
21
Anal. Chem. 2014, 86: 6393−6400
22
What Types of Mass Spectra are in the NIST MSMS Library?
8% 6%
MS2
86%
MS3
MS4
Instruments:
Micromass Quattro Micro: Triple Quadrupole
Thermo Finnigan LTQ: IT/ion trap
Agilent QTOF 6530: Q-TOF
Thermo Finnigan Elite Orbitrap: HCD
IT-FT/ion trap with FTMS
IT/ion trap
23
What Types of Compounds are in the NIST MSMS Library?
Metabolites ~50%
Ureidosuccinic acid
[M+H]+
HCD 14eV
24
What Types of Compounds are in the NIST MSMS Library?
Drugs ~20%
Acetaminophen
[M+H]+
QQQ 24V
25
What Types of Compounds are in the NIST MSMS Library?
Bioactive peptides
~10%
ARLDVASEFRKKWNKWALSR
[M+4H]4+
QQQ 27V
26
What Types of Compounds are in the NIST MSMS Library?
All amino acids (20)
All dipeptides (400)
All tryptic tripeptides (800)
YNK
[M+H]+
Ion Trap 35%
27
What Types of Compounds are in the NIST MSMS Library?
Lipids ~ 5%
Cholesterol
[M+H-H2O]+
QTOF 20V
28
What Types of Compounds are in the NIST MSMS Library?
Glycans ~ 2%
Glycan A1F-MIX
[M+2Na]2+
HCD 78eV
29
What Types of Compounds are in the NIST MSMS Library?
Sugars ~ 2%
Like sugar? So does diabetes…
D-(+)-Galactose
[M+H]+
QQQ 12V
30
What Types of Compounds are in the NIST MSMS Library?
Surfactants and
Contaminants
Sorbitan monopalmitate
[M+H]+
QQQ 14V
31
What Types of Compounds are in the NIST MSMS Library?
Pesticides
Atrazine
[M+H]+
HCD 32eV
32
NIST Tandem Mass Spectral Library 2014
33
Conclusions
• Two level clustering algorithms (counter-based and distance-based) were developed
and provide a robust means of generating consensus spectra of multiple precursor
types for the NIST Tandem Mass Spectral Library.
• Quality control programs such as peak annotation and noise removal methods were
developed and are used in building the reference quality NIST Tandem Mass Spectral
Library.
• NIST Tandem Mass Spectral Library 2014 can be applied in chemical identification in
Metabolomics, Proteomics and other fields.
Plans for the future:
• Improve peak annotation program by using compound’s structure and physical
chemical properties.
• Develop more quality control methods to eliminate low quality spectra for the library.
34
Acknowledgements
35