Refining similarity scoring to enable decoy-free validation
in spectral library searching
Wenguang Shao1, Kan Zhu2, Henry Lam1,2*
1 Division of Biomedical Engineering, the Hong Kong University of Science and
Technology, Clear Water Bay, Hong Kong, China
2 Department of Chemical and Biomolecular Engineering, the Hong Kong University of
Science and Technology, Clear Water Bay, Hong Kong, China
* Corresponding author
Supplementary Information
Spectral preprocessing before searching
To reduce search time and memory use, and to increase the sensitivity of spectral library
searching, the query and library spectra were de-noised by retaining only the n most intense
peaks, as suggested previously [34]. To determine the optimal value of n in our study, we
varied n from 25 to 125 and compared the number of SSMs identified at the same FDR cutoff.
[Figure S1: two panels, Dataset 1 and Dataset 2]
Figure S1 | Sensitivity of the spectral library search with different values of n, the
number of most intense peaks retained. The number of correct SSMs is plotted against
the FDR predicted by PeptideProphet for the new SpectraST discriminant function (the
logarithm of the expectation value, obtained from the rank-based dot product) for Dataset 1
and Dataset 2. For Dataset 2, a randomly chosen subset of 8 LC-MS runs was used to speed
up the optimization.
Given the results in Figure S1, we chose n = 50, since the best sensitivity was obtained for
n = 50-75. In addition, to spread the retained peaks evenly across the spectrum, a peak is
skipped if there are U other peaks more intense than itself within +/- V Th. We optimized
the values of U and V in the same way, by testing the combinations listed in Table S1.
Combination   U     V (Th)
A             50    N/A
B             2     30
C             4     30
D             5     50
E             5     30
F             6     50
G             8     30
H             8     50
I             10    50
Table S1 | A list of different combinations of U and V used in this study. Combination A
means that there is no restriction on the m/z locations of the 50 retained peaks.
[Figure S2: two panels, Dataset 1 and Dataset 2]
Figure S2 | The number of correct SSMs using different combinations of U and V when
the FDR is controlled at 1%. For Dataset 2, a randomly chosen subset of 8 LC-MS runs was
used to speed up the optimization.
As shown in Figure S2, Combination A (no restriction on the m/z locations of the 50
retained peaks) yields the fewest correct SSMs among all tested combinations, although the
difference is small. Spreading the retained peaks evenly across the spectrum therefore
results in better sensitivity for spectral library searching. Based on these results, we
chose Combination D (U = 5, V = 50) as the default setting for our algorithm.
Proof that the top-scoring incorrect SSMs should follow the Gumbel distribution
Let $X_1, X_2, X_3, \ldots, X_{N-1}, X_N$ be the p-values of all SSMs for a query, with
$X_1 \le X_2 \le X_3 \le \cdots \le X_{N-1} \le X_N$, where $N$ is the search space. By
definition, p-values should be uniformly distributed. Therefore
$X_1, X_2, X_3, \ldots, X_{N-1}, X_N$ is a sequence of independent and identically
distributed (i.i.d.) random variables drawn from a standard uniform distribution,
$\mathrm{unif}[0,1]$. Applying the Bonferroni correction for the search space and the
logarithm transform to the p-values, we defined the corresponding discriminant F-values:
\[ Y = a \ln\left(\frac{N X}{N_0}\right) \]
where $a$ is a negative constant and $N_0$ is a positive integer. Thus,
$Y_1 \ge Y_2 \ge Y_3 \ge \cdots \ge Y_{N-1} \ge Y_N$.
We now prove that $Z = Y_1$ follows a Gumbel (Type I extreme value)
distribution with cumulative distribution function (cdf):
\[ P(Z \le z) = \exp\left[-N_0 e^{z/a}\right] \]
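Proof:
Since $X \sim \mathrm{unif}[0,1]$, we have $P(X \le x) = x$ and $P(X \ge x) = 1 - x$. The cdf of $Y$ is:
\[ P(Y \le y) = P\left(a \ln\left(\frac{N X}{N_0}\right) \le y\right) \]
Note that $a < 0$,
\[ P\left(a \ln\left(\frac{N X}{N_0}\right) \le y\right) = P\left(\ln\left(\frac{N X}{N_0}\right) \ge \frac{y}{a}\right) = P\left(X \ge \frac{N_0}{N} \exp\left(\frac{y}{a}\right)\right) \]
Since $P(X \ge x) = 1 - x$,
\[ P(Y \le y) = P\left(X \ge \frac{N_0}{N} \exp\left(\frac{y}{a}\right)\right) = 1 - \frac{N_0}{N} \exp\left(\frac{y}{a}\right) \]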
The cdf of $Z = Y_1$, the F-value of the top-scoring SSM, is:
\[
\begin{aligned}
P(Z \le z) &= P(\max\{Y_1, Y_2, Y_3, \ldots, Y_{N-1}, Y_N\} \le z) \\
&= P(Y_1 \le z) P(Y_2 \le z) \cdots P(Y_{N-1} \le z) P(Y_N \le z) \\
&= \left[1 - \frac{N_0}{N} \exp\left(\frac{z}{a}\right)\right]^N
\end{aligned}
\]
Since $\lim_{N \to \infty} \left(1 - \frac{x}{N}\right)^N = e^{-x}$, we can write, for large $N$ and with $x = N_0 e^{z/a}$:
\[ P(Z \le z) = \left[1 - \frac{N_0}{N} \exp\left(\frac{z}{a}\right)\right]^N \approx \exp\left[-N_0 e^{z/a}\right] \]
which is the cdf of the Gumbel distribution.
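The limiting result can be checked numerically with a small Monte Carlo experiment. The sketch below is an illustrative addition rather than part of the study: the function names and the parameter values (N = 500, N0 = 10, a = -1) are hypothetical. It draws N uniform p-values per trial, converts the smallest one to the top F-value, and compares the empirical cdf with the Gumbel form derived above.

```python
import math
import random
from typing import List


def gumbel_cdf(z: float, a: float, n0: int) -> float:
    """Limiting cdf from the proof: P(Z <= z) = exp(-N0 * exp(z / a))."""
    return math.exp(-n0 * math.exp(z / a))


def simulate_top_f_values(n: int, n0: int, a: float, trials: int,
                          seed: int = 0) -> List[float]:
    """Draw `trials` top F-values, each from n i.i.d. uniform p-values."""
    rng = random.Random(seed)
    top = []
    for _ in range(trials):
        # 1 - random() lies in (0, 1], which avoids log(0).
        x_min = min(1.0 - rng.random() for _ in range(n))
        # a < 0, so the smallest p-value gives the largest F-value.
        top.append(a * math.log(n * x_min / n0))
    return top


def empirical_cdf(samples: List[float], z: float) -> float:
    """Fraction of samples that are <= z."""
    return sum(1 for s in samples if s <= z) / len(samples)


# Hypothetical parameters, chosen only for illustration.
N, N0, A = 500, 10, -1.0
samples = simulate_top_f_values(N, N0, A, trials=2000)
z_median = A * math.log(math.log(2.0) / N0)  # point where the Gumbel cdf is 0.5
observed = empirical_cdf(samples, z_median)
expected = gumbel_cdf(z_median, A, N0)
```

If the derivation holds, `observed` should lie close to `expected` (0.5 at the chosen point), up to sampling noise and the finite-N approximation error.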
Search space variation
The search space of a typical spectral library search can vary greatly from query to query,
depending on the precursor m/z. To investigate this variation, we searched Dataset 1 and
Dataset 2 against the NIST human ion-trap library and the NIST yeast ion-trap library, respectively.
Figure S3 shows that the number of candidate library spectra can differ by approximately
one order of magnitude.
[Figure S3: two panels, Dataset 1 and Dataset 2]
Figure S3 | The histogram of the search space (the number of candidate library spectra
matched) for all query spectra in Dataset 1 and Dataset 2.
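To see what a tenfold difference in search space means for the discriminant score, consider the F-value $Y = a \ln(NX/N_0)$ defined in the proof above. The short calculation below is an illustrative addition, not a result from the study. It compares two queries that have the same p-value $X$ but search spaces $N_1$ and $N_2$:

\[ Y(N_2) - Y(N_1) = a \ln\left(\frac{N_2 X}{N_0}\right) - a \ln\left(\frac{N_1 X}{N_0}\right) = a \ln\left(\frac{N_2}{N_1}\right) \]

For $N_2 = 10 N_1$, the shift is $a \ln 10 \approx 2.303\,a$. Because $a < 0$, the same p-value receives a lower F-value in the larger search space, which is the effect of the Bonferroni correction.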
Examples of SSMs gained by the new scoring function of SpectraST
To illustrate the profile of additional SSMs gained by the new scoring function of SpectraST,
which employs the rank-based dot product and p-value transformation, we selected several
example SSMs from the set of confident identifications (FDR 1%) that were not in the set of
confident identifications (FDR 1%) of the original SpectraST. Upon examination, we found
that the original SpectraST often ranked the same SSMs first, but the scores were not
favorable enough to pass the FDR threshold. This could be due to a high dot bias (Figures
S4a, S4b), a low delta (Figures S4c, S4d), or a low dot product (Figure S4e).
Figure S4a | One of 6 replicate SSMs of IINEPTAAAIAYGLDKK (charge 3+). Original
SpectraST also found 4 out of 6 of these SSMs as top identifications with high dot products,
but they did not pass the FDR threshold due to a very large dot bias penalty.
Figure S4b | One of 6 replicate SSMs of GIPEFWLTVFK (charge 2+). Original
SpectraST also found all 6 of these SSMs as top identifications with high dot products, but
they did not pass the FDR threshold due to a very large dot bias penalty.
Figure S4c | One of 3 replicate SSMs of AGAEVVK (charge 1+). Original SpectraST also
found all 3 SSMs as top identifications but the delta scores were not high enough to pass the
FDR threshold.
Figure S4d | One of 16 replicate SSMs of DGNASGTTLLEALDC160ILPPTRPTDKPLR
(charge 4+). Original SpectraST also found 15 of these SSMs as top identifications but the
delta scores were not high enough to pass the FDR threshold.
Figure S4e | One of 4 replicate SSMs of ENPSATLEDLEKPGVDEEPQHVLLR
(charge 3+). Original SpectraST also found all 4 SSMs as top identifications but the dot
products were not high enough to pass the FDR threshold.