Comparison between Linear and Nonlinear Estimation of Multifield
15N Relaxation Parameters in Protein
by
Yun-Ting Wang
Advisor
Mei-Hui Guo
Department of Applied Mathematics,
National Sun Yat-sen University
Kaohsiung, Taiwan 804 R.O.C.
July, 2003
Contents
Abstract
1 Introduction
2 Literature Review
  2.1 Theory of Relaxation Mechanisms
  2.2 Transverse Relaxation Rates and Chemical Exchange
  2.3 The Protein Information
3 Methodology
  3.1 Linear Regression Method
  3.2 Non-Linear Regression Method
  3.3 Hierarchical Clustering - Ward’s Method
  3.4 Principal Component Analysis (PCA)
4 Results and Discussions
  4.1 The First Protein
  4.2 The Protein Pilin from Strain K122-4
5 Conclusions
References
Appendices
A Mathematica Programming
  A.1 The Linear Estimation
  A.2 The Nonlinear Estimation
List of Figures
1. The structure of p8MTCP1
2. The structure of Pilin from Strain K122-4
3. The fitted parameters (A, B, S², τs, τe, ΔσN) of the first protein estimated by
   three methods (Linear-4, Nonlinear-4, and Nonlinear-3), residue by residue
4. The SSE value calculated by three methods
5. The coordinates of the four standardized estimators by the linear-4 and nonlinear-4
   methods in the first (x coordinate axis) and second (y coordinate axis) components
6-1. The result of Ward’s clustering method by the linear-4 method
6-2. The result of Ward’s clustering method by the nonlinear-4 method
7. The estimated parameters (A, B, S², τs, τe, ΔσN) of the protein Pilin from
   Strain K122-4 by three different estimation methods (Linear-4, Nonlinear-4,
   and Nonlinear-3), residue by residue. (a) The estimators of A; (b) the estimators
   of B; (c) the estimators of τs; (d) the estimators of ΔσN; (e) the estimators
   of S²; (f) the estimators of τe
8. The 95% confidence interval of the R2 parameters in three fields and the experimental
   value of R2, residue by residue
List of Tables
The summary of the relative differences of the estimated parameters (A, B, S², τs, τe, ΔσN)
Correlation matrix of the parameters (S², τs, τe, ΔσN)
Principal component analysis of the standardized parameters (S², τs, τe, ΔσN)
Correlation matrix of the parameters S², τs, τe, ΔσN in the protein Pilin from Strain K122-4
The ratio of R2 values in different fields
The simulated and exact 95% C.I. of R2 of residue 35
Abstract
Under the model-free approach, four parameters related to protein dynamics can be
estimated at each residue from the spectral density functions at the resonance
frequencies of N (ωN) and H (ωH): the correlation times of the slow and fast local
motions of the NH vector, the generalized order parameter, and the 15N shielding
anisotropy. In this work, we study the linear and nonlinear estimation of these
parameters for two proteins, C12A-p8MTCP1 and Pilin from strain K122-4. The principal
components of the four parameters of p8MTCP1 are used to cluster the residues. The
results show that the principal components provide useful information about the
secondary structure of the protein. Finally, we propose a practical method to examine
the model-free assumption by characterizing the distribution of the transverse
relaxation rate (R2) across multiple fields.
Keywords: NMR, nonlinear estimation, linear estimation, correlation time,
relaxation parameter, principal component analysis, Ward’s clustering method
1 Introduction
Heteronuclear spin relaxation is now frequently used to study protein dynamics.
Nuclear magnetic resonance (NMR) spectroscopy is currently the only experimental method
that can yield high-resolution structural information on peptides and proteins in
solution. Each experiment at a given magnetic field yields three values: the
longitudinal (R1) and transverse (R2) 15N relaxation rates and the 1H-15N
cross-relaxation rate (σNH), each of which can be expressed as a combination of
spectral density functions at ωN and ωH.
Four quantities sensed by the NH vector at each residue can be derived from the
analysis of the NMR parameters (R1, R2, σNH) and are useful for indirectly explaining
the structure of the protein: S² (the order parameter, S² ∈ [0, 1]; S² → 1 means that
the residue lies in a stable part of the structure, for example a helix, and S² → 0
means that it lies in a loose part, for example the N- and C-terminal turns), τs (the
correlation time associated with slow motion), τe (the effective correlation time
associated with fast motion), and ΔσN (the 15N shielding anisotropy). Based on the
model-free assumption (Lipari, G.; Szabo, A.), the parameters (S², τs, τe, ΔσN) can be
obtained from the experimental data R1, R2, σNH. Because four parameters must be
estimated at each residue, more than four observed values are needed, so data must be
acquired at no fewer than two magnetic field strengths. There are two ways to estimate
the four parameters: linear estimation and nonlinear estimation. Canet et al. (2001)
used a linear estimation method to fit four parameters (A, B, τs, ΔσN, where
A = 2τe(1 − S²) and B = 2τs S²) and discussed whether the chemical shift anisotropy
(csa) is related to the protein structure. This study mainly compares the estimates of
(S², τs, τe, ΔσN) obtained by different methods and analyzes whether the results help
to explain the protein structure.
This study compares the estimators of S², τs, τe, ΔσN obtained by three different
methods (linear estimation, nonlinear regression, and nonlinear regression with
ΔσN = −170 ppm) for two proteins (p8MTCP1 and Pilin from Strain K122-4). The relaxation
parameters R1, R2, σNH of the first protein were measured on a 15N-labeled sample of
C12A-p8MTCP1, involving 68 residues. The relaxation parameters of p8MTCP1 at five
different magnetic fields are taken from the Supporting Information of Canet et al.
(2001). The relaxation parameters R1, R2, σNH of the second protein were measured on a
15N-labeled sample of Pilin from Strain K122-4, involving 119 residues. These data are
taken from the Supporting Information provided by Jeong-Yong Suh et al. (2001).
Canet et al. (2001) discussed the estimators of S², τs, τe, ΔσN obtained by the linear
regression method. In this study, we add two estimation methods (nonlinear regression
with four estimated parameters, and nonlinear regression with three estimated
parameters and ΔσN = −170 ppm) and compare them with the linear estimation of
(S², τs, τe, ΔσN).
For the protein p8MTCP1, the significant difference between the linear and nonlinear
estimation lies in the estimator of τe. For the protein Pilin from Strain K122-4,
however, the estimators from all three methods show unusual patterns. We suspect that
the problem is caused by poor data quality. We therefore examine the distribution of
the relaxation parameters and trace the causes of the unusual estimates. The rest of
this thesis is organized as follows:
1. Literature review (section 2): We introduce the background knowledge on relaxation
   mechanisms, transverse relaxation rates, and chemical exchange. We also provide
   information on the two proteins (p8MTCP1 and Pilin from Strain K122-4) used in
   this study.
2. Methodology (section 3): We describe the four methods used in this study. The first
   two are the linear and nonlinear estimations used to estimate the four parameters
   (S², τs, τe, ΔσN); the last two are Ward’s clustering method and principal component
   analysis, which are used to analyze the protein residues through these estimated
   parameters.
3. Results and discussions (section 4): This section has two parts:
   (a) The protein p8MTCP1: we estimate the parameters at each residue of p8MTCP1 and
       study the following two aspects:
       i. Three different estimation methods (the linear estimation, the nonlinear
          estimation, and the nonlinear estimation with ΔσN = −170 ppm) are used to
          estimate (A, B, S², τs, τe, ΔσN). We compare the performance of the three
          estimation procedures at each residue.
       ii. The fitted parameters, which describe the dynamic motion of the residues,
           are used to explain the secondary structure of the protein. We use principal
           component analysis and Ward’s clustering method to cluster the residues.
   (b) The protein Pilin from Strain K122-4: since the three methods give many unusual
       parameter estimates for this protein, we focus on the following points to detect
       outliers in the relaxation data:
       i. We discuss the relation between the relaxation parameters and the resonance
          frequencies ωN and ωH corresponding to different magnetic fields.
       ii. The R2 parameter is found to have a great impact on the estimation of ΔσN.
           Assuming a normal distribution, the 95% confidence intervals (C.I.) of the
           R2 parameter for ΔσN ∈ [−220 ppm, −140 ppm] are obtained. Simulations are
           also performed to confirm our results.
4. Conclusions (section 5): We give the conclusions of the comparison between the
   linear and nonlinear estimation.
Finally, the Mathematica source code for the linear and nonlinear estimation is given
in the appendix.
2 Literature Review
This section has three subsections: the theory of relaxation mechanisms, transverse
relaxation rates and chemical exchange, and the protein information. It introduces the
background knowledge for this study and describes the characteristics of the two
proteins used. Subsection 2.1 describes the relaxation mechanisms, which are the
premise for handling the protein data. Subsection 2.2 describes the effect of chemical
exchange on the transverse relaxation rates. Subsection 2.3 introduces the two
proteins, p8MTCP1 and Pilin from Strain K122-4, in turn. We also give the
transformations between R1 and T1, R2 and T2, and σNH and NOE; T1, T2, and NOE are the
data format used in the Supporting Information of the second protein.
2.1 Theory of Relaxation Mechanisms
The relaxation mechanisms considered in this study are based on the assumptions of
Canet et al. (2001).
The equations below are written by assuming that the following relaxation mechanisms
are dominant: the dipolar 15N-1H interaction (d) and the 15N chemical shift
anisotropy (csa). Denote by $\tilde{J}(\omega)$ a spectral density function which
involves only dynamical parameters and whose simplest form would be
\[
\tilde{J}(\omega) = \frac{2\tau_c}{1 + \omega^2 \tau_c^2},
\]
with τc being an effective correlation time. The functions $\tilde{J}_d$ and
$\tilde{J}_{csa}$ must be defined separately for the two mechanisms; their definitions
are given later in this subsection.
The spectral density functions characterizing protein dynamics are related to the
experimentally measured 15N relaxation rates R1, R2, and σNH as follows (Canet, D.):
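\[
R_1 = K_d\left[6\tilde{J}_d(\omega_H+\omega_N) + 3\tilde{J}_d(\omega_N) + \tilde{J}_d(\omega_H-\omega_N)\right] + K_{csa}\,\tilde{J}_{csa}(\omega_N) \tag{1}
\]
\[
R_2 = K_d\left[3\tilde{J}_d(\omega_H+\omega_N) + \tfrac{3}{2}\tilde{J}_d(\omega_N) + 3\tilde{J}_d(\omega_H) + \tfrac{1}{2}\tilde{J}_d(\omega_H-\omega_N) + 2\tilde{J}_d(0)\right]
+ K_{csa}\left[\tfrac{1}{2}\tilde{J}_{csa}(\omega_N) + \tfrac{2}{3}\tilde{J}_{csa}(0)\right] \tag{2}
\]
\[
\sigma_{NH} = K_d\left[6\tilde{J}_d(\omega_H+\omega_N) - \tilde{J}_d(\omega_H-\omega_N)\right] \tag{3}
\]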
The symbols have their usual meaning: ωN/2π is the 15N resonance frequency, dNH is the
N-H bond length, and ΔσN is the 15N shielding anisotropy. The spectral density
functions are multiplied by
\[
K_d = \frac{1}{20}\left(\frac{\mu_0}{4\pi}\right)^2\left(\frac{\gamma_H\gamma_N\hbar}{d_{NH}^3}\right)^2 \tag{4}
\]
and
\[
K_{csa} = \frac{1}{15}(\Delta\sigma_N)^2\,\omega_N^2, \tag{5}
\]
where Kd = 2.5 × 10^8 assuming dNH = 1.02 Å.
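As a quick numerical check of eqs 4 and 5, the following minimal Python sketch evaluates Kd and Kcsa. The SI values of the physical constants, the conversion of ΔσN from ppm to a dimensionless fraction, and the function names are our assumptions for illustration; they are not taken from the Mathematica programs of this thesis.

```python
# Assumed SI constants (the magnetogyric ratios are the rounded values of section 2.3).
MU0_OVER_4PI = 1.0e-7        # T m / A
HBAR = 1.054571817e-34       # J s
GAMMA_H = 2.68e8             # rad s^-1 T^-1
GAMMA_N = -2.71e7            # rad s^-1 T^-1


def dipolar_constant(d_nh=1.02e-10):
    """K_d of eq 4 for an N-H bond length d_nh given in metres."""
    coupling = MU0_OVER_4PI * GAMMA_H * GAMMA_N * HBAR / d_nh ** 3
    return coupling ** 2 / 20.0


def csa_constant(delta_sigma_ppm, omega_n):
    """K_csa of eq 5; delta_sigma_ppm is in ppm, omega_n in rad/s."""
    delta_sigma = delta_sigma_ppm * 1.0e-6   # ppm -> dimensionless fraction
    return delta_sigma ** 2 * omega_n ** 2 / 15.0


if __name__ == "__main__":
    omega_n = abs(GAMMA_N) * 14.1            # 15N Larmor frequency at 14.1 T
    print("K_d   =", dipolar_constant())
    print("K_csa =", csa_constant(-170.0, omega_n))
```

With these rounded constants the sketch gives a Kd of about 2.6 × 10^8, of the same order as the value quoted above.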
The above procedure (eqs 1-3) is model-independent but provides only a qualitative
interpretation of backbone dynamics. This study instead inserts the functional form
\[
\tilde{J}(\omega) = A + \frac{B}{1 + \omega^2\tau_s^2} \tag{6}
\]
directly into eqs 1-3, where τs is an effective correlation time associated with slow
motion. The parameter A represents the local fast motion and is assumed to have the
form
\[
A = (1 - S^2)(2\tau_e), \tag{7}
\]
where τe is the effective correlation time describing the fast local motions
(1/τc = 1/τf + 1/τs, where τf is associated with the fast local motions; in fact τc is
very close to τf because τs ≫ τf), and S is an order parameter specifying the
restriction of these motions with respect to a local director. The other parameter B is
\[
B = S^2(2\tau_s), \tag{8}
\]
where τs is the effective correlation time describing the slow motions sensed by the
relevant NH vector. Using eqs 6-8, we define $\tilde{J}_d(\omega)$ by eq 6 and
$\tilde{J}_{csa}(\omega) = B/(1 + \omega^2\tau_s^2)$.
It should be noted that the concept of a local director is reminiscent of organized
systems and that, as far as a spherical object is concerned, the local director would
lie along the direction passing through the sphere center and the atom involved in the
relaxation study. Consequently, for a protein of approximately spherical shape, τs
would be the overall tumbling correlation time. Of course, the order parameter depends
on the orientation of the relaxation vector with respect to the local director. This
relaxation vector is the N-H bond for dipolar spectral densities, or the symmetry axis
of the nitrogen shielding tensor (supposed to be axially symmetric) for csa spectral
densities. Because these two vectors are not collinear, there should be one order
parameter for the dipolar interaction and another for the csa mechanism. In other
words, the parameters A and B are not only site-dependent but also mechanism-dependent.
However, the angle between the N-H vector and the shielding tensor symmetry axis is
small (∼13-16°), so treating the order parameters for the N-H dipolar interaction and
the 15N csa mechanism as identical is a reasonable assumption.
2.2 Transverse Relaxation Rates and Chemical Exchange
The first data set, published by Canet et al. (2001), takes into account the
chemical-exchange contribution (Rex) to the transverse relaxation rate R2. Rex depends
on the nitrogen Larmor frequency ωN and can be written as Rex = ΦωN², where Φ depends
on the intrinsic rate constant of the exchange process. Consider the quantity
(2R2 − R1), which can be written as
\[
2R_2 - R_1 = K_d\left[6\tilde{J}(\omega_H) + 4\tilde{J}(0)\right] + \left[\frac{4}{45}(\Delta\sigma_N)^2\,\tilde{J}(0) + 2\Phi\right]\omega_N^2,
\]
where eq 4 can be used and $\tilde{J}_d = \tilde{J}_{csa} = \tilde{J}$ has been
supposed. Assuming further that $\tilde{J}(\omega_H)$ is negligible with respect to
$\tilde{J}(0)$, (2R2 − R1) is a linear function of ωN², and $\tilde{J}(0)$ is obtained
from its intercept. The slope then yields Φ once ΔσN is set to −170 ppm. With Φ
determined, Rex can be computed and used to correct R2 for each residue at each
magnetic field. In other words, for the first data set the equation for R2 (eq 2)
becomes
\[
R_2 = K_d\left[3\tilde{J}_d(\omega_H+\omega_N) + \tfrac{3}{2}\tilde{J}_d(\omega_N) + 3\tilde{J}_d(\omega_H) + \tfrac{1}{2}\tilde{J}_d(\omega_H-\omega_N) + 2\tilde{J}_d(0)\right]
+ K_{csa}\left[\tfrac{1}{2}\tilde{J}_{csa}(\omega_N) + \tfrac{2}{3}\tilde{J}_{csa}(0)\right] + R_{ex}.
\]
The corrected data of the first protein are taken directly from the Supporting
Information of Canet et al. (2001). For the second protein, no exchange correction is
applied when we use the Supporting Information of Suh et al. (2001), because that paper
does not define it.
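The determination of Φ described in this subsection can be sketched as an ordinary least-squares line through the points (ωN², 2R2 − R1). The Python code below is only an illustration under the stated assumptions ($\tilde{J}(\omega_H)$ neglected, ΔσN fixed at −170 ppm and converted from ppm to a fraction); the function names are hypothetical, and the corrected data actually used in this thesis come from the Supporting Information rather than from this code.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def exchange_coefficient(omega_n, r1, r2, kd, delta_sigma_ppm=-170.0):
    """Estimate J(0) and Phi from multifield R1 and R2 of one residue."""
    xs = [w * w for w in omega_n]
    ys = [2.0 * b - a for a, b in zip(r1, r2)]
    intercept, slope = fit_line(xs, ys)
    j0 = intercept / (4.0 * kd)              # intercept = 4 K_d J(0)
    delta_sigma = delta_sigma_ppm * 1.0e-6   # ppm -> fraction
    phi = 0.5 * (slope - (4.0 / 45.0) * delta_sigma ** 2 * j0)
    return j0, phi


def corrected_r2(r2, omega_n, phi):
    """Subtract R_ex = Phi * omega_N^2 from each measured R2."""
    return [r - phi * w * w for r, w in zip(r2, omega_n)]
```

At least two fields are needed to define the line; with the five fields of the first protein the fit is overdetermined, which also gives a rough check of the assumed linearity in ωN².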
2.3 The Protein Information
§ The First Protein: p8MTCP1
The first protein, from Canet et al. (2001), is the 15N-labeled C12A-p8MTCP1, and the
experimental relaxation data (R1, R2, σNH) were obtained at five magnetic fields (9.4,
11.75, 14.1, 16.45, and 18.8 T). The human protein p8MTCP1 is a 68-residue protein
encoded by the MTCP-1 oncogene. Its PDB code is 1HP8, and its biological function is
unknown. The structure of p8MTCP1 in the PDB consists of three α-helices and one 3₁₀
helix, associated with a new cysteine motif. The core of the protein mainly consists of
two helices (helix I: residues 8-20; helix II: residues 25-39), which are covalently
paired by two disulfide bridges (Cys7-Cys38 and Cys17-Cys28), forming an α-hairpin. The
third disulfide bridge (Cys39-Cys50) links the top of helix IV to the tip of helix II.
Helix III spans residues 44-46 and helix IV spans residues 48-59. In addition, the
protein contains the loops connecting these four helices, the N-terminal turns
preceding helix I, and the C-terminal turns at the end of the structure. Few nOe
contacts were found between helix IV and the α-hairpin, suggesting that helix IV is
loosely bound to the core of the protein.
This study uses the structure of the protein as described in the PDB, which differs
somewhat from the definition in the original paper (Canet, D. et al.). This choice is
useful for explaining the results of the data analysis, for example the clustering
results.
Figure 1. The structure of p8MTCP1.
§ The Second Protein: Pilin from Strain K122-4
The information on the second protein is mainly taken from the paper by Jeong-Yong Suh
et al. (2001). The PDB code of this protein is 1hpw, and the structure described there
differs slightly from the definition given by Jeong-Yong Suh et al. (2001). Pilin from
strain K122-4 is a 150-residue protein comprising a long N-terminal segment connected
to an α-helix (residues 31-54) and four β strands (residues βI: 79-87, βII: 91-100,
βIII: 110-119, βIV: 126-133). There are two disulphide bridges (Cys57-Cys93 and
Cys129-Cys142) in the protein; the C-terminal disulphide loop region is hypervariable,
and the central region is semiconserved (Marceau, M. et al.). Removing the first 28
residues does not alter the structure of the intact protein (Keizer et al.,
unpublished).
The experimental data of the 15N-labeled K122-4 pilin(29-150) are given in the form of
T1 (longitudinal relaxation time; T1 = 1/R1), T2 (transverse relaxation time;
T2 = 1/R2), and the {1H}-15N NOE. In terms of the spectral densities, the {1H}-15N NOE
is
\[
NOE = 1 + \frac{\gamma_H}{\gamma_N}\,\frac{K_d\left[6\tilde{J}_d(\omega_H+\omega_N) - \tilde{J}_d(\omega_H-\omega_N)\right]}{1/T_1}
= 1 + \frac{\gamma_H}{\gamma_N}\,\frac{\sigma_{NH}}{1/T_1},
\]
where γH is the proton magnetogyric ratio (2.68 × 10^8 rad s⁻¹ T⁻¹) and γN is the
magnetogyric ratio of 15N (−2.71 × 10^7 rad s⁻¹ T⁻¹). Before applying our procedure, we
transform the data (T1, T2, NOE) from the Supporting Information published by
Jeong-Yong Suh et al. (2001) into the standard form (R1, R2, σNH). The experimental 15N
relaxation data were acquired at three different magnetic fields (300 MHz, 500 MHz, and
600 MHz). The T1, T2, and NOE relaxation parameters are governed principally by the
dipolar interaction between the 15N nucleus and its attached proton, and by the
chemical shift anisotropy of the 15N nucleus (Abragam, A. et al.).
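The transformation from (T1, T2, NOE) to (R1, R2, σNH) follows directly from the NOE expression, which gives σNH = (NOE − 1)(γN/γH)/T1. A minimal Python sketch is shown below; the function names and the numerical values in the usage lines are hypothetical and are not data from the Supporting Information.

```python
GAMMA_H = 2.68e8    # proton magnetogyric ratio, rad s^-1 T^-1
GAMMA_N = -2.71e7   # 15N magnetogyric ratio, rad s^-1 T^-1


def to_rates(t1, t2, noe):
    """Convert (T1, T2, NOE) into (R1, R2, sigma_NH)."""
    r1 = 1.0 / t1
    r2 = 1.0 / t2
    sigma_nh = (noe - 1.0) * (GAMMA_N / GAMMA_H) * r1
    return r1, r2, sigma_nh


def to_noe(r1, sigma_nh):
    """Inverse relation: NOE = 1 + (gamma_H / gamma_N) * sigma_NH / R1."""
    return 1.0 + (GAMMA_H / GAMMA_N) * sigma_nh / r1


if __name__ == "__main__":
    # Hypothetical values: T1 = 0.5 s, T2 = 0.1 s, NOE = 0.8.
    print(to_rates(0.5, 0.1, 0.8))
```

Because γN is negative, an NOE below 1 corresponds to a positive σNH.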
Figure 2. The structure of Pilin from Strain K122-4.
3 Methodology
This section has four subsections. The first two describe the linear and nonlinear
methods used to estimate the parameters (S², τs, τe, ΔσN) from the relaxation data. The
last two introduce two methods for analyzing the estimated parameters: principal
component analysis, which yields useful combinations of the estimated parameters, and
Ward’s method, which clusters the residues according to the factors we provide.
3.1 Linear Regression Method
Assume that the contribution R2,others of other mechanisms to R2 is negligibly small
(this can be achieved in practice by correcting the R2 values for exchange
contributions, which requires multifield data) and that
\[
\tilde{J}_d(\omega) = A + \frac{B}{1+\omega^2\tau_s^2} \quad \text{(the dipolar interaction)}, \qquad
\tilde{J}_{csa}(\omega) = \frac{B}{1+\omega^2\tau_s^2} \quad \text{(the csa mechanism)}.
\]
Substituting these into eqs 1-3 and letting
\[
u_{csa} = \frac{1}{15}\,\Delta\sigma_N^2\,B \tag{9}
\]
(assuming that the shielding tensor is axially symmetric and that its symmetry axis
possesses the same dynamical properties as the NH vector), we obtain
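\[
R_1 = 10K_dA + K_d\left[\frac{6}{1+(\omega_H+\omega_N)^2\tau_s^2} + \frac{3}{1+\omega_N^2\tau_s^2} + \frac{1}{1+(\omega_H-\omega_N)^2\tau_s^2}\right]B
+ \frac{\frac{1}{15}\omega_N^2}{1+\omega_N^2\tau_s^2}\,B\,\Delta\sigma_N^2 \tag{10}
\]
\[
R_2 = 10K_dA + K_d\left[\frac{3}{1+(\omega_H+\omega_N)^2\tau_s^2} + \frac{3/2}{1+\omega_N^2\tau_s^2} + \frac{3}{1+\omega_H^2\tau_s^2} + \frac{1/2}{1+(\omega_H-\omega_N)^2\tau_s^2} + 2\right]B
+ \left[\frac{\frac{1}{30}\omega_N^2}{1+\omega_N^2\tau_s^2} + \frac{2}{45}\omega_N^2\right]B\,\Delta\sigma_N^2 \tag{11}
\]
\[
\sigma_{NH} = 5K_dA + K_d\left[\frac{6}{1+(\omega_H+\omega_N)^2\tau_s^2} - \frac{1}{1+(\omega_H-\omega_N)^2\tau_s^2}\right]B \tag{12}
\]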
Fix the parameter τs. Eqs 10-12 then become linear combinations of A, B, and ucsa, and
can be written as
\[
\sum_{j=1}^{3} x_{ij}\,\beta_j = R_i^{exp},
\]
where β1 = A, β2 = B, β3 = ucsa; $R_1^{exp}$ and $R_2^{exp}$ are the experimental
values of R1 and R2, and $R_3^{exp}$ is the experimental value of σNH. Here $x_{ij}$ is
the coefficient multiplying $\beta_j$ in the equation for the ith relaxation parameter
(i = 1, 2, 3; eqs 10-12). Each experiment at one magnetic field yields three observed
values (R1, R2, σNH), so n different magnetic field strengths give 3 × n experimental
values in total. In matrix form, the above equation reads
\[
X\beta = R,
\]
where X is a 3n × 3 coefficient matrix, β is a 3 × 1 parameter vector, and R is a
3n × 1 vector of experimental relaxation parameters. From this matrix equation we can
estimate A, B, and ΔσN for a fixed value of τs. Define
\[
E = \frac{1}{3n}\sum_{i=1}^{3}\sum_{j=1}^{n}\left(\frac{R_{ij}^{exp} - R_{ij}^{cal}}{\Delta R_{ij}}\right)^2,
\]
where $R_{ij}^{exp}$ is the experimental value of $R_i$ at the jth magnetic field,
$\Delta R_{ij}$ is its experimental uncertainty, and $R_{ij}^{cal}$ is the value fitted
with the parameters estimated by the linear regression method (Introduction To
Regression Analysis, p. 67-79). The value of τs is varied to minimize E; the resulting
estimates of A, B, ΔσN together with the minimizing value of τs are the estimates given
by this method. In this way we obtain the four parameters even though they enter the
original equations (eqs 10-12) nonlinearly.
The estimated variance-covariance matrix s²{β} is given by
\[
s^2\{\beta\}_{3\times 3} = MSE\,(X'X)^{-1},
\]
where MSE = (R − Xβ)'(R − Xβ)/(3n − 3). The estimated standard deviations of A, B, ΔσN
are, in turn, $s\{\beta\}_{ii}$, i = 1, 2, 3. For the τs term, the standard deviation
is calculated by $(B + s\{\beta\}_{22})/(2\tau_s)$.
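The linear estimation of this subsection can be sketched in Python as a grid search over τs with a weighted least-squares fit of (A, B, ucsa) at each grid point. This is a minimal illustration and not the Mathematica program of the appendix: the function names, the use of the magnitudes of ωH and ωN, the column scaling of the normal equations, and the convention of reporting ΔσN as the negative root of 15·ucsa/B (in ppm) are all our assumptions. The standard deviations are not computed here.

```python
import math


def design_rows(omega_h, omega_n, tau_s, kd):
    """Rows of X (eqs 10-12) for one field; columns multiply A, B, u_csa."""
    def lor(w):                               # 1 / (1 + w^2 tau_s^2)
        return 1.0 / (1.0 + (w * tau_s) ** 2)
    ls, ln, lh, ld = (lor(omega_h + omega_n), lor(omega_n),
                      lor(omega_h), lor(omega_h - omega_n))
    wn2 = omega_n ** 2
    return [
        [10 * kd, kd * (6 * ls + 3 * ln + ld), wn2 * ln],
        [10 * kd, kd * (3 * ls + 1.5 * ln + 3 * lh + 0.5 * ld + 2),
         wn2 * (0.5 * ln + 2.0 / 3.0)],
        [5 * kd, kd * (6 * ls - ld), 0.0],
    ]


def solve_least_squares(rows, y):
    """Least squares for 3 unknowns via column-scaled normal equations."""
    m = 3
    scale = [max(abs(r[j]) for r in rows) or 1.0 for j in range(m)]
    xs = [[r[j] / scale[j] for j in range(m)] for r in rows]
    a = [[sum(x[i] * x[j] for x in xs) for j in range(m)]
         + [sum(x[i] * yi for x, yi in zip(xs, y))] for i in range(m)]
    for c in range(m):                        # elimination with partial pivoting
        p = max(range(c, m), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(c + 1, m):
            f = a[r][c] / a[c][c]
            for k in range(c, m + 1):
                a[r][k] -= f * a[c][k]
    beta = [0.0] * m
    for i in reversed(range(m)):              # back substitution
        s = a[i][m] - sum(a[i][k] * beta[k] for k in range(i + 1, m))
        beta[i] = s / a[i][i]
    return [beta[j] / scale[j] for j in range(m)]


def linear_fit(fields, rates, errors, tau_grid, kd):
    """fields: [(omega_h, omega_n)]; rates, errors: [(R1, R2, sigma_NH)]."""
    best = None
    for tau_s in tau_grid:
        rows, y = [], []
        for (wh, wn), obs, err in zip(fields, rates, errors):
            for row, r, e in zip(design_rows(wh, wn, tau_s, kd), obs, err):
                rows.append([v / e for v in row])   # weight by 1 / Delta R
                y.append(r / e)
        beta = solve_least_squares(rows, y)
        e_val = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
                    for row, yi in zip(rows, y)) / len(y)
        if best is None or e_val < best[0]:
            best = (e_val, tau_s, beta)
    e_val, tau_s, (a, b, u) = best
    # Only the square of the anisotropy is identified; report the negative root.
    delta_sigma_ppm = -1.0e6 * math.sqrt(max(15.0 * u / b, 0.0))
    return {"E": e_val, "tau_s": tau_s, "A": a, "B": b,
            "delta_sigma_ppm": delta_sigma_ppm}
```

A finer τs grid, or a one-dimensional minimizer in place of the grid, would sharpen the estimate of τs; the grid is used here only because it mirrors the description "by changing the value of τs" most directly.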
3.2 Non-Linear Regression Method
The parameters S², τs, τe, ΔσN in eqs 10-12 can also be estimated directly by the
non-linear regression method (Introduction To Linear Regression Analysis, p. 414-426).
The non-linear regression model for eqs 10-12 is
\[
y = f(x, \theta) + \varepsilon,
\]
where θ = (A, B, τs, ΔσN) is a vector of unknown parameters and ε is an uncorrelated
random error term with E(ε) = 0 and Var(ε) = σ². Therefore,
\[
E(y) = E(f(x, \theta) + \varepsilon) = f(x, \theta).
\]
The relaxation parameter functions in eqs 10-12 can all be written in the common form
\[
f(x, \theta) = x_1A + \frac{x_2B}{1+x_9\tau_s^2} + \frac{x_3B}{1+x_{10}\tau_s^2} + \frac{x_4B}{1+x_{11}\tau_s^2} + \frac{x_5B}{1+x_{12}\tau_s^2} + x_6B
+ \frac{x_7B\,\Delta\sigma_N^2}{1+x_{13}\tau_s^2} + x_8B\,\Delta\sigma_N^2, \tag{13}
\]
where x is the vector of known coefficients determined by eqs 10-12.
A method widely used in computer algorithms for non-linear regression is linearization
of the non-linear function followed by the Gauss-Newton iteration method of parameter
estimation. Linearization is accomplished by a Taylor series expansion of
$f(x_i, \theta)$ about the point $\theta_0 = (A_0, B_0, \tau_{s0}, \Delta\sigma_{N0})$
with only the linear terms retained. This yields
\[
f(x_i, \theta) \approx f(x_i, \theta_0)
+ \left.\frac{\partial f(x_i, \theta)}{\partial A}\right|_{\theta=\theta_0}(A - A_0)
+ \left.\frac{\partial f(x_i, \theta)}{\partial B}\right|_{\theta=\theta_0}(B - B_0)
+ \left.\frac{\partial f(x_i, \theta)}{\partial \tau_s}\right|_{\theta=\theta_0}(\tau_s - \tau_{s0})
+ \left.\frac{\partial f(x_i, \theta)}{\partial \Delta\sigma_N}\right|_{\theta=\theta_0}(\Delta\sigma_N - \Delta\sigma_{N0}),
\]
where $\theta_0$ is an initial value. Let
\[
f_i^0 = f(x_i, \theta_0), \qquad
\beta_0 = \theta - \theta_0 = (A - A_0,\; B - B_0,\; \tau_s - \tau_{s0},\; \Delta\sigma_N - \Delta\sigma_{N0}), \qquad
Z_{ij}^0 = \left.\frac{\partial f(x_i, \theta)}{\partial \theta_j}\right|_{\theta=\theta_0};
\]
we can then write the non-linear regression model as
\[
y_i - f_i^0 = \sum_{j=1}^{4} \beta_{0j} Z_{ij}^0 + \varepsilon_i, \qquad i = 1, 2, \ldots, 3n.
\]
The matrix form is
\[
y_0 = Z_0\beta_0 + \varepsilon.
\]
By linear regression estimation,
$\hat\beta_0 = (Z_0'Z_0)^{-1}Z_0'y_0 = (Z_0'Z_0)^{-1}Z_0'(y - f_0)$. Since
$\beta_0 = \theta - \theta_0$, we can define $\hat\theta_1 = \hat\beta_0 + \theta_0$ as
the revised estimate of θ. Following the same procedure with $\theta_0$ replaced by
$\hat\theta_1$, the estimates $\hat\theta_2$ and so forth can be produced.
Consequently, at the kth iteration we have
\[
\hat\theta_{k+1} = \hat\theta_k + \hat\beta_k.
\]
The iterative process continues until
\[
\left|\frac{\hat\theta_{j,k+1} - \hat\theta_{j,k}}{\hat\theta_{j,k}}\right| < \delta, \qquad j = 1, \ldots, 4,
\]
where δ is very small, around 1.0 × 10⁻⁶. We also evaluate
\[
S(\hat\theta_k) = \sum_{i=1}^{3n}\left[y_i - f(x_i, \hat\theta_k)\right]^2.
\]
Finally, the estimated variance of $\hat\theta$ is calculated by
\[
Var(\hat\theta) = \hat\sigma^2 (Z'Z)^{-1} \approx \frac{S(\hat\theta)}{3n - 4}\,(Z'Z)^{-1}.
\]
We ran the above method in Mathematica. The program sets every component of the
starting point ($\theta_0$) to 1.0.
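The Gauss-Newton iteration described above can be illustrated on a deliberately small problem. The Python sketch below fits the two-parameter model f(x; B, τ) = B/(1 + xτ²), which has the same Lorentzian shape as the terms of eq 13 but is not the full four-parameter model of this thesis. The model, the starting point, and the step-halving safeguard are our own assumptions for the illustration; the Mathematica program may behave differently.

```python
def model(x, b, tau):
    return b / (1.0 + x * tau * tau)


def jacobian_row(x, b, tau):
    """Partial derivatives of the model with respect to (B, tau)."""
    d = 1.0 + x * tau * tau
    return 1.0 / d, -2.0 * b * x * tau / (d * d)


def sse(xs, ys, b, tau):
    """S(theta): sum of squared residuals."""
    return sum((y - model(x, b, tau)) ** 2 for x, y in zip(xs, ys))


def gauss_newton(xs, ys, b, tau, delta=1.0e-6, max_iter=100):
    for _ in range(max_iter):
        # Build Z'Z and Z'(y - f) for the linearized model.
        s11 = s12 = s22 = g1 = g2 = 0.0
        for x, y in zip(xs, ys):
            z1, z2 = jacobian_row(x, b, tau)
            r = y - model(x, b, tau)
            s11 += z1 * z1
            s12 += z1 * z2
            s22 += z2 * z2
            g1 += z1 * r
            g2 += z2 * r
        det = s11 * s22 - s12 * s12
        db = (g1 * s22 - g2 * s12) / det      # beta_k = (Z'Z)^-1 Z'(y - f)
        dtau = (s11 * g2 - s12 * g1) / det
        # Step halving (not part of the text above) keeps S from increasing.
        step, s_old = 1.0, sse(xs, ys, b, tau)
        while sse(xs, ys, b + step * db, tau + step * dtau) > s_old and step > 1e-10:
            step *= 0.5
        b_new, tau_new = b + step * db, tau + step * dtau
        converged = (abs((b_new - b) / b) < delta
                     and abs((tau_new - tau) / tau) < delta)
        b, tau = b_new, tau_new
        if converged:
            break
    return b, tau


if __name__ == "__main__":
    xs = [0.0, 0.5, 1.0, 2.0, 4.0]
    ys = [model(x, 2.0, 1.0) for x in xs]     # noiseless synthetic data
    print(gauss_newton(xs, ys, 1.5, 0.8))
```

The result of a Gauss-Newton fit generally depends on the starting point, so the estimates from several starting points are worth comparing whenever a fit looks unusual.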
3.3 Hierarchical Clustering - Ward’s Method
Hierarchical methods (Methods of Multivariate Analysis, p. 455) attempt to find "good"
clusters in the data using a computationally efficient technique, and the algorithm is
a sequential process. We start with n clusters (n being the total number of residues in
a protein) and end with one single cluster containing the entire data set. At each
step, the two closest clusters are merged into a single new cluster. Ward’s method
(Methods of Multivariate Analysis, p. 466-468), also called the incremental sum of
squares method, uses the within-cluster (squared) distances and between-cluster
(squared) distances (Ward 1963, Wishart 1969). If AB is the cluster obtained by
combining clusters A and B, then the sums of within-cluster distances (of the items
from the cluster mean vectors) are
\[
SSE_A = \sum_{i=1}^{n_A} (y_i - \bar{y}_A)'(y_i - \bar{y}_A), \qquad
SSE_B = \sum_{i=1}^{n_B} (y_i - \bar{y}_B)'(y_i - \bar{y}_B),
\]
\[
SSE_{AB} = \sum_{i=1}^{n_{AB}} (y_i - \bar{y}_{AB})'(y_i - \bar{y}_{AB}),
\]
where $\bar{y}_{AB} = (n_A\bar{y}_A + n_B\bar{y}_B)/(n_A + n_B)$, and $n_A$, $n_B$, and
$n_{AB} = n_A + n_B$ are the numbers of points in A, B, and AB. The vector $y_i$
contains the factor(s) of the ith residue of the protein that are used to cluster the
residues. Since these sums of distances are equivalent to within-cluster sums of
squares, they are denoted by $SSE_A$, $SSE_B$, and $SSE_{AB}$.
Ward’s method joins the two clusters A and B that minimize the increase in SSE,
defined as
\[
I_{AB} = SSE_{AB} - (SSE_A + SSE_B).
\]
Furthermore, minimizing the increase in SSE is equivalent to minimizing the
between-cluster distance (refer to Rencher, Alvin C., 1934, "Methods of Multivariate
Analysis", p. 468). Ward’s method therefore tends to join smaller clusters or clusters
of equal size. We use the S-plus software to run this clustering program.
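The merging criterion can be made concrete with a few lines of Python. The sketch below computes I_AB directly from the definition and also through the closed form n_A n_B/(n_A + n_B) · (ȳ_A − ȳ_B)'(ȳ_A − ȳ_B), a standard identity that makes the link to the between-cluster distance explicit. The function names and the data points are hypothetical; the clustering in this thesis was run in S-plus, not with this code.

```python
def mean_vector(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def sse(points):
    """Within-cluster sum of squared distances to the cluster mean."""
    m = mean_vector(points)
    return sum(sum((p[k] - m[k]) ** 2 for k in range(len(m))) for p in points)


def ward_increase(a, b):
    """I_AB = SSE_AB - (SSE_A + SSE_B), computed from the definition."""
    return sse(a + b) - (sse(a) + sse(b))


def ward_increase_closed_form(a, b):
    """Same quantity from the cluster sizes and the distance between means."""
    ma, mb = mean_vector(a), mean_vector(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ma, mb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def best_merge(clusters):
    """Indices (i, j) of the pair of clusters with the smallest increase in SSE."""
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(pairs, key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]))


if __name__ == "__main__":
    clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 0.0)], [(2.5, 0.5)]]
    print(best_merge(clusters))
```

The factor n_A n_B/(n_A + n_B) grows with the cluster sizes, which is why the method favors merging small clusters.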
3.4 Principal Component Analysis (PCA)
The multivariate statistical method of PCA (Methods of Multivariate Analysis, p. 380)
is a very useful tool for reducing the number of variables in a data set and for
finding the linear combinations of the variables that explain the maximum amount of
variance. PCA determines a set of perpendicular axes (eigenvectors) in the space
defined by the dimensions of the data set. The number of axes can be as large as the
number of variables (or dimensions): the first principal component is the linear
combination with maximal variance, the second principal component is the linear
combination with maximal variance in a direction orthogonal to the first principal
component, and so on. Principal component analysis often indicates which variables in
a data set are important and which ones may be of little consequence. If a variable
does not correspond to any principal component axis (eigenvector), or corresponds only
to high-numbered principal component axes, this usually suggests that the variable has
little or no control on the distribution of the data set. The eigenvectors can be
calculated from either the covariance matrix or the correlation matrix of the data set.
In this study, the principal component analysis of the estimated parameters
(S², τs, τe, ΔσN) uses the correlation matrix, in order to remove the effect of the
different scales of these parameters.
The mean of the original data is the origin of the transformed system, and the
transformed axes of the components are mutually orthogonal. To begin the
transformation, the correlation matrix $C_{4\times4}$ of the original data is found
(in this thesis each residue is described by the four-dimensional vector of estimated
parameters (S², τs, τe, ΔσN)). From the correlation matrix, the eigenvalues
($\lambda_i$, i = 1, 2, 3, 4) are obtained from
\[
|C - \lambda_i I| = 0,
\]
where I is the 4 × 4 identity matrix. The eigenvalues are equal to the variances of the
corresponding components. The eigenvectors $\vec{e}_i$ (4 × 1), associated with
$\lambda_i$, i = 1, 2, 3, 4, define the axes of the components and are obtained from
\[
(C - \lambda_i I)\,\vec{e}_i = 0.
\]
The principal components ($P_{4\times4}$) are then given as
\[
P_{4\times4} = T_{4\times4} \cdot D_{4\times4},
\]
where $D_{4\times4}$ is the matrix of the original data and T is the transformation
matrix given by
\[
T_{4\times4} = (\vec{e}_1, \vec{e}_2, \vec{e}_3, \vec{e}_4).
\]
The principal components thus generated are uncorrelated and ordered by decreasing
variance. The covariance matrix of the transformed data is a diagonal matrix whose
elements are the eigenvalues. Each transformed data point is a linear combination of
the original data values, which builds the respective principal component. The percent
of the total variance in each of the components is given by
\[
\%Var_i = \frac{\lambda_i \cdot 100}{\sum_{k=1}^{4}\lambda_k}.
\]
The first component has the maximum signal-to-noise ratio and the largest percentage of
the total variance ($\%Var_i$). Each subsequent component contains the maximum variance
for any axis orthogonal to the previous components. Contrast enhancement,
dimensionality reduction, and lossy compression are common applications of principal
component analysis.
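As an illustration of the eigenvalue step, the following Python sketch extracts the first principal component of a correlation matrix by power iteration and reports its percentage of the total variance. The 4 × 4 matrix used here is a hypothetical equicorrelation matrix, not the correlation matrix estimated in this thesis, and power iteration is only one simple way to obtain the leading eigenpair; the function names are our own.

```python
import math


def mat_vec(c, v):
    return [sum(cij * vj for cij, vj in zip(row, v)) for row in c]


def leading_component(c, iterations=200):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix."""
    v = [1.0] + [0.0] * (len(c) - 1)
    for _ in range(iterations):
        w = mat_vec(c, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(c, v)))   # Rayleigh quotient
    return eigenvalue, v


def percent_variance(eigenvalue, c):
    """%Var_i; the trace of C equals the sum of all eigenvalues."""
    return 100.0 * eigenvalue / sum(c[i][i] for i in range(len(c)))


if __name__ == "__main__":
    r = 0.5   # hypothetical common correlation between the four parameters
    c = [[1.0 if i == j else r for j in range(4)] for i in range(4)]
    lam, vec = leading_component(c)
    print(lam, vec, percent_variance(lam, c))
```

For a correlation matrix the eigenvalues always sum to the number of variables, here 4, so a first eigenvalue well above 1 indicates that one combination of the parameters carries much of the residue-to-residue variation.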
4 Results and Discussions
This section has two subsections, which present the estimation results for the two proteins introduced in section 2.3, p8MTCP1 and Pilin from strain K122-4. We are mainly interested in the estimation of the six parameters (A, B, S², τs, τe, ∆σN), where
- the parameter A is related to the fast motion of the NH bond of each residue (see eq. (7));
- the parameter B is related to the slow motion of the NH bond of each residue (see eq. (8));
- the parameter τs describes the correlation time of the local slow motion of the NH bond of a residue;
- the parameter τe describes the correlation time of the fast motion of the NH bond of a residue;
- the parameter S² is the order parameter, which describes the stability of the protein structure. Its value lies between 0 and 1. A value close to 1 means that the residue is in a stable part of the protein structure, such as a helix. A value close to 0 means that the residue is in a loose part of the protein structure, such as the N and C terminal turns;
- the parameter ∆σN is the 15N shielding anisotropy sensed by the NH vector at each residue. The unit of ∆σN is ppm.
In section 4.1, we discuss the following two aspects of the estimation of the parameters (A, B, S², τs, τe, ∆σN) of the protein p8MTCP1 at each residue.
1. Three estimation methods, the linear estimation, the nonlinear estimation, and the nonlinear estimation with ∆σN = −170 ppm, are used to estimate (A, B, S², τs, τe, ∆σN). We compare the performance of the three estimation procedures at each residue.
2. The fitted parameters, which describe the dynamic motion of the residues, are used to explain the protein secondary structure. We use principal component analysis and Ward's clustering method to cluster the residues.
In section 4.2, since the three methods give many unusual parameter estimates for the protein Pilin from strain K122-4, we concentrate on the following points in order to detect outliers in the relaxation parameter data.
1. We discuss the relation between the relaxation parameters and the resonance frequencies ωN and ωH corresponding to different magnetic fields.
2. We find that the R2 parameter has a great impact on the estimation of the parameter ∆σN. Assuming a normal distribution, the 95% confidence interval (C.I.) of the R2 parameter for ∆σN ∈ [−220 ppm, −140 ppm] is obtained. Simulations are also performed to confirm our results.
4.1 The Protein p8MTCP1
In this study, we use three methods to estimate the parameters of the protein p8MTCP1. First, we estimate the four parameters (A, B, τs, and ∆σN) by the linear regression method (described in section 3.1). Second, we estimate the four parameters (A, B, τs, and ∆σN) by the nonlinear regression method (described in section 3.2). Finally, we fix ∆σN = −170 ppm and estimate the other three parameters (A, B, and τs) by the nonlinear regression method. For simplicity, these estimation procedures are called linear-4, nonlinear-4, and nonlinear-3, respectively. The main goal is to compare the estimators from the different methods. We focus on the following two aspects:
1. Comparison of the fitted parameters of the three methods at each residue.
2. The scatter of the fitted parameters, which describe the dynamic motion of the residues, is used to explain the protein secondary structure.
First, we compare the fitted parameters of the three methods by their scatter diagrams and relative error plots. Figure 3(a1) plots the estimators of A by the three estimation procedures. Figures 3(a2) and 3(a3) plot the relative error between the nonlinear-3 and nonlinear-4 methods, and between the nonlinear-4 and linear-4 methods, respectively. We summarize the estimation of the parameter A as follows.
1. From Figure 3(a1), most of the estimators by the linear-4 method have smaller values and smaller variation than those by the nonlinear-3 and nonlinear-4 methods.
2. From Figure 3(a2), the A values estimated by nonlinear-4 are smaller than those by nonlinear-3 at most residues, except for residues 7, 31, 64, 65 and 66. The nonlinear-3 and nonlinear-4 estimators are most consistent at residues 31, 45, 57, 58, 59 and the residues that lie in the N and C terminal turns.
3. From Figure 3(a3), most of the A values estimated by linear-4 are smaller than those by nonlinear-4, except for residues 31 and 65. The linear-4 and nonlinear-4 methods are most consistent at residue 45 and the residues that lie in the N and C terminal turns.
4. The relative errors between the linear-4 and nonlinear-4 methods are in general larger than those between the nonlinear-3 and nonlinear-4 methods.
To summarize, the estimators of A in general have the order A_nonlinear-3 > A_nonlinear-4 > A_linear-4, and the three methods are most consistent at the N and C terminal turns. From Figure 3(b1)-(b3), we summarize the estimation of the parameter B as follows.
1. From Figure 3(b1), the estimators of B from the three estimation procedures are consistent, and have small values in the N and C terminal turns.
2. From Figure 3(b2), the B values estimated by nonlinear-4 are smaller than those by nonlinear-3 at most residues, except for residues 7, 31 and 68. The nonlinear-3 and nonlinear-4 estimators are consistent at most residues, except for residues 57, 58, 59 and the residues that lie in the N and C terminal turns.
3. From Figure 3(b3), most of the B values estimated by linear-4 are smaller than those by nonlinear-4, except for residues 1, 2, 3, 31, 45, 62, 63, 64, 66, and 68. The linear-4 and nonlinear-4 methods are consistent.
To summarize, the estimators of B in general have the order B_nonlinear-3 > B_nonlinear-4 > B_linear-4, and the three methods are consistent. From Figure 3(c1)-(c3), we summarize the estimation of the parameter τs as follows.
1. From Figure 3(c1), the estimators of τs from the three estimation procedures are consistent, and have small values in the N and C terminal turns, except for residue 7.
2. From Figure 3(c2), the τs values estimated by nonlinear-4 are smaller than those by nonlinear-3 at most residues, except for residues 7, 65 and 66. The nonlinear-3 and nonlinear-4 estimators are consistent at most residues, except for residues 57, 58, 59 and the residues that lie in the N and C terminal turns.
3. From Figure 3(c3), most of the τs values estimated by linear-4 are smaller than those by nonlinear-4, except for residues 31, 45 and 65. The linear-4 and nonlinear-4 methods are consistent.
To summarize, the estimators of τs in general have the order τs,nonlinear-3 > τs,nonlinear-4 > τs,linear-4, and the three methods are consistent. From Figure 3(d1)-(d3), we summarize the estimation of the parameter S² as follows.
1. From Figure 3(d1), the estimators of S² from the three estimation procedures are consistent, and have small values in the N and C terminal turns, except for residue 7.
2. From Figure 3(d2), the S² values estimated by nonlinear-4 are smaller than those by nonlinear-3 at most residues, except for residues 7 and 68. The nonlinear-3 and nonlinear-4 estimators are consistent at most residues, except for the residues that lie in the N and C terminal turns.
3. From Figure 3(d3), most of the S² values estimated by linear-4 are smaller than those by nonlinear-4, except for residues 31, 45 and 65. The linear-4 and nonlinear-4 methods are consistent.
To summarize, the estimators of S² in general have the order S²_nonlinear-3 > S²_nonlinear-4 > S²_linear-4, and the three methods are consistent. From Figure 3(e1)-(e3), we summarize the estimation of the parameter τe as follows.
1. From Figure 3(e1), most of the estimators by the linear-4 method have smaller values and smaller variation than those by the nonlinear-3 and nonlinear-4 methods.
2. From Figure 3(e2), the τe values estimated by nonlinear-4 are smaller than those by nonlinear-3 at most residues, except for residues 7, 31, 65 and 66. The nonlinear-3 and nonlinear-4 estimators are most consistent at the residues that lie in the N and C terminal turns.
3. From Figure 3(e3), most of the τe values estimated by linear-4 are smaller than those by nonlinear-4, except for residue 65. The linear-4 and nonlinear-4 methods are most consistent at residues 57, 58, 59 and the residues that lie in the N and C terminal turns and loops.
4. The relative errors between the linear-4 and nonlinear-4 methods are in general larger than those between the nonlinear-3 and nonlinear-4 methods.
To summarize, the estimators of τe in general have the order τe,nonlinear-3 > τe,nonlinear-4 > τe,linear-4, and the three methods are most consistent at the N and C terminal turns. Finally, from Figure 3(f1)-(f2) (Figure 3(f2) plots the relative error between the nonlinear-4 and linear-4 methods), we summarize the estimation of the parameter ∆σN as follows.
1. From Figure 3(f1), the estimators of ∆σN from the three estimation procedures are consistent, and have small values in the N and C terminal turns, except for residue 7.
2. From Figure 3(f2), the ∆σN values estimated by nonlinear-4 are smaller than those by linear-4, except for residues 31 and 66. The linear-4 and nonlinear-4 methods are consistent.
To summarize, the estimators of ∆σN from the two methods are consistent.
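The relative errors used in the comparisons above follow the form given on the axes of Figure 3, for example (A_nonlinear-3 − A_nonlinear-4)/A_nonlinear-4 and (A_nonlinear-4 − A_linear-4)/A_nonlinear-4. A minimal Python sketch of this calculation is shown below; the function name relative_error and the three small dictionaries of A values are hypothetical and are not taken from the thesis data.

```python
def relative_error(first, second, denominator):
    """Per-residue (first - second) / denominator for residues present in all three."""
    common = sorted(set(first) & set(second) & set(denominator))
    return {res: (first[res] - second[res]) / denominator[res] for res in common}


# Hypothetical A estimates keyed by residue number (illustration only).
a_nonlinear3 = {8: 0.050, 9: 0.044, 10: 0.047}
a_nonlinear4 = {8: 0.040, 9: 0.040, 10: 0.050}
a_linear4 = {8: 0.010, 9: 0.008, 10: 0.006}

# (A_nonlinear-3 - A_nonlinear-4) / A_nonlinear-4
nl3_vs_nl4 = relative_error(a_nonlinear3, a_nonlinear4, a_nonlinear4)
# (A_nonlinear-4 - A_linear-4) / A_nonlinear-4
nl4_vs_lin4 = relative_error(a_nonlinear4, a_linear4, a_nonlinear4)
```

A positive value means that the first method gives the larger estimate at that residue, which is how the orderings summarized above can be read off residue by residue.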
Figure 3. The fitted parameters (A, B, S², τs, τe, ∆σN) of the protein p8MTCP1 at each residue by the three estimation methods (linear-4, nonlinear-4 and nonlinear-3), with the regions of helix I to helix IV marked along the residue axis.
(a1) A (ns), (b1) B (ns), (c1) τs (ns), (d1) S², (e1) τe (ns), (f1) ∆σN (ppm): estimators by the three methods versus residue.
(a2)-(e2) Relative error between the nonlinear-3 and nonlinear-4 methods, (X_nonlinear-3 − X_nonlinear-4)/X_nonlinear-4, for X = A, B, τs, S², τe.
(a3)-(e3), (f2) Relative error between the nonlinear-4 and linear-4 methods, (X_nonlinear-4 − X_linear-4)/X_nonlinear-4, for X = A, B, τs, S², τe and ∆σN.
Table: The mean and median values of (A, B, S², τs, τe, ∆σN) over three sets of residues in the above figures are listed as follows:
Method                   Parameter   total mean   total median   core mean   core median   2 turns mean   2 turns median
Nonlinear-4 Estimation   A           0.056        0.041          0.039       0.037         0.040          0.037
                         B           8.626        9.713          9.827       10.170        9.884          10.221
                         τs          5.768        6.222          6.255       6.300         6.251          6.300
                         τe          0.096        0.096          0.091       0.092         0.093          0.094
                         S²          0.714        0.778          0.783       0.793         0.788          0.800
                         ∆σN         180.345      175.985        176.491     175.638       176.329        175.514
Nonlinear-3 Estimation   A           0.061        0.049          0.047       0.042         0.047          0.042
                         B           8.835        9.993          10.032      10.365        10.084         10.529
                         τs          5.841        6.264          6.334       6.376         6.327          6.376
                         τe          0.112        0.115          0.111       0.117         0.113          0.117
                         S²          0.725        0.785          0.790       0.798         0.794          0.806
Linear-4 Estimation      A           0.025        0.006          0.009       0.006         0.009          0.005
                         B           8.600        9.660          9.797       10.140        9.853          10.178
                         τs          5.591        6.079          6.078       6.170         6.072          6.176
                         τe          0.030        0.019          0.018       0.016         0.018          0.015
                         S²          0.734        0.798          0.803       0.819         0.808          0.825
                         ∆σN         182.322      177.750        178.456     177.330       178.320        177.290
The total mean and median of the estimated parameters are calculated over all residues. The "core" region of the protein consists of the residues that do not lie in the N and C terminal turns. The "2 turns mean" and "2 turns median" are the mean and median values of the estimated parameters at the residues that lie in the N and C terminal turns.
Figure 4. SSE of the fits for the protein p8MTCP1 at each residue, with the regions of helix I to helix IV marked along the residue axis.
(a) SSE of the three methods (linear-4, nonlinear-4 and nonlinear-3).
(b) (SSE_nonlinear − SSE_linear)/SSE_linear for the nonlinear-4 and linear-4 methods.
In Figure 4(a), we plot the SSE (sum of squared errors) values of the three methods. The SSE values of the nonlinear-3 method are in general larger than those of the nonlinear-4 and linear-4 methods. In Figure 4(b), we plot the standardized difference between the SSE of the nonlinear-4 and linear-4 methods. The SSE values of the nonlinear-4 method are clearly smaller than those of the linear-4 method. Consequently, the nonlinear-4 method attains the best goodness of fit among the three methods.
In the second part, we are interested in explaining the protein secondary structure by the four parameters (S², τs, τe, ∆σN). Since the estimation errors of the nonlinear-3 method are relatively larger than those of the other two methods, in what follows we only use the estimators of the linear-4 and nonlinear-4 methods. In Table 1, we give the correlation matrices of the four estimators (S², τs, τe, ∆σN) for the linear-4 and nonlinear-4 methods, respectively.
Table 1: Correlation Matrix of Parameters (S², τs, τe, ∆σN)
Method                 Parameter   τs   ∆σN       τe       S²
Nonlinear Estimation   τs          1    *-0.613   0.073    *0.928
                       ∆σN              1         0.107    *-0.733
                       τe                         1        0.080
                       S²                                  1
Linear Estimation      τs          1    *-0.613   *0.888   *0.949
                       ∆σN              1         *0.683   *-0.726
                       τe                         1        *-0.906
                       S²                                  1
*. Correlation is significant at the 0.01 level (2-tailed).
According to Table 1, the correlation coefficients among the three parameters S², τs, and ∆σN are consistent between the linear-4 and nonlinear-4 methods. However, the correlation coefficients between the estimator of τe and the other three parameters are quite different for the two methods. For the linear-4 method, the estimator of τe is highly correlated with the other parameters, so the effect of τe is confounded with that of the other three parameters. For the nonlinear-4 method, the correlations between τe and the other parameters are all insignificant, so the nonlinear-4 method can be used to separate out more useful information from the parameter τe (which relates to the local fast motion of the NH bond). Since there are significant correlations among the parameters (S², τs, τe, ∆σN), we apply principal component analysis to extract the significant components to explain the protein structure. In the following, we show that more information about the protein can be obtained from the estimators of the nonlinear-4 method than from those of the linear-4 method.
Since the scales of the four estimators (S², τs, τe, ∆σN) are different, we apply the principal component analysis to the standardized estimators, as introduced in section 3.4. In Table 2, we give the results of the principal component analysis of the standardized estimators of S², τs, τe, ∆σN by the linear-4 and nonlinear-4 methods, respectively.
Table 2: Principal Component Analysis (of the standardized parameters (S², τs, τe, ∆σN))
Method                 Component     Cumulative Proportion   τs*      ∆σN*     τe*      S²*
Nonlinear Estimation   Component 1   0.6316                  0.584    -0.532            0.612
                       Component 2   0.8877                  0.169    0.107    0.98
                       Component 3   0.9879645               0.475    0.826    -0.171   0.252
                       Component 4   1                       -0.636   0.157             0.749
Linear Estimation      Component 1   0.8515                  0.513    -0.44    -0.513   0.529
                       Component 2   0.9597                  -0.377   0.886    0.21     -0.617
                       Component 3   0.9898                  0.457             0.83     0.315
                       Component 4   1                       -0.621   0.133             0.77
parameter*: the standardized value of the parameter S², τs, τe, ∆σN.
For the nonlinear-4 method, the first three components explain 98% of the variability. For the linear-4 method, the first two components already explain more than 95% of the variability.
In Figure 5(a) and (b), we plot the coordinates of the four standardized estimators in the first component versus the second component (denoted component 1 and component 2) for the nonlinear-4 and linear-4 methods, respectively. From Table 2 and Figure 5(a), we observe that component 1 of the nonlinear-4 method is composed of τs, S² and ∆σN (with ∆σN having the opposite sign), τe plays the main role in component 2, and ∆σN is the main term in component 3. From Table 2 and Figure 5(b), component 1 of the linear-4 method is composed of S², τs, τe and ∆σN (with ∆σN and τe having signs opposite to the other two parameters), and ∆σN and τe are the main terms in component 2 and component 3, respectively. In order to compare the effectiveness of the two sets of principal components, we further analyze the clustering results from Ward's method with the principal components of the linear-4 and nonlinear-4 methods, respectively.
Figure 5. The coordinates of the four standardized estimators by the linear-4 and nonlinear-4 methods in the first component (x axis) and the second component (y axis)
(a) The nonlinear-4 method: τe (0, 0.98), ∆σN (-0.532, 0.107), τs (0.584, 0.169), S² (0.612, 0).
(b) The linear-4 method: ∆σN (-0.44, 0.886), τe (-0.513, 0.21), τs (0.513, -0.377), S² (0.529, -0.617).
We compare the clustering results of the linear and nonlinear methods using three different clustering vectors: (component 1), (component 1, component 2), and (component 1, component 2, component 3). The clustering results for most regions are similar for the linear-4 and nonlinear-4 methods, except for three groups of residues. Thus, we focus our comparison on these three groups: the first group (denoted 1-Group) is residue 7, the second group (denoted 2-Group) includes the N and C terminals excluding residue 5 and residue 60, and the last group (denoted 3-Group) includes residues 5, 57, 58, 59 and 60.
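The clustering step can be sketched in a few lines of Python. The sketch below assumes that each residue is represented by its principal component scores and merges clusters greedily by Ward's criterion, that is, by the smallest increase in the within-cluster sum of squares. The function name ward_clusters and the six example points are hypothetical; the thesis does not state which software produced its dendrograms.

```python
def ward_clusters(points, n_clusters):
    """Merge clusters greedily by Ward's criterion until n_clusters remain.

    points: list of equal-length coordinate tuples (e.g. principal component scores).
    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [([i], list(p)) for i, p in enumerate(points)]  # (members, centroid)
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters) - 1):
            for y in range(x + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[x], clusters[y]
                dist2 = sum((u - w) ** 2 for u, w in zip(ca, cb))
                # Increase in the within-cluster sum of squares if x and y merge.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * dist2
                if best is None or cost < best[0]:
                    best = (cost, x, y)
        _, x, y = best
        (ma, ca), (mb, cb) = clusters[x], clusters[y]
        size = len(ma) + len(mb)
        centroid = [(len(ma) * u + len(mb) * w) / size for u, w in zip(ca, cb)]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append((ma + mb, centroid))
    return sorted(sorted(members) for members, _ in clusters)


# Hypothetical (component 1, component 2) scores; index 5 plays the role of an
# isolated residue that ends up in a group of its own.
example_scores = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0)]
three_groups = ward_clusters(example_scores, 3)
two_groups = ward_clusters(example_scores, 2)
```

Using only the first coordinate of each point, the first two, or the first three corresponds to the three clustering vectors described above. For real data a library routine such as SciPy's scipy.cluster.hierarchy.linkage with method="ward" would normally be preferred over this quadratic-search sketch.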
Figure 6-1. Clustering results of Ward's method by the linear-4 method. Dendrograms (vertical axis: Height) with leaves labelled by residue number and structural region (for example 1N, 8H_1, 21R_12, 57H_4, 68C); the 1-Group, 2-Group and 3-Group are marked.
(a) Linear-4 Method, Principal Component 1
(b) Linear-4 Method, Principal Component 1 and 2
(c) Linear-4 Method, Principal Component 1, 2 and 3
Figure 6-2. Clustering results of Ward's method by the nonlinear-4 method, in the same layout.
(a) Nonlinear-4 Method, Principal Component 1
(b) Nonlinear-4 Method, Principal Component 1 and 2
(c) Nonlinear-4 Method, Principal Component 1, 2 and 3
Figure 6-1(a)-(c) shows the clustering results for the three clustering vectors by the linear-4 method, and Figure 6-2(a)-(c) shows those by the nonlinear-4 method. In the following, we concentrate on the clustering of the aforementioned three groups.
1. The 1-Group (residue 7) is clustered into a single group by all three clustering vectors of the nonlinear-4 method (see Figure 6-2(a), (b), (c)), yet for the linear-4 method it is clustered into a single group only by the third clustering vector (see Figure 6-1(c)). In Ward's clustering method, a residue that is clustered into a single group is significantly different from the other residues (Methods of Multivariate Analysis, p. 456). Residue 7 is classified in the tip of the N-terminal turn of the protein p8MTCP1 by PDB (see section 2.3), so it is an unstable residue, yet Canet, D. et al. (2001) classified residue 7 as being at the top of helix I. Therefore, the status of residue 7 is still debatable. From the above results, we see that it is easier to separate out residue 7 by the nonlinear-4 method.
2. The 2-Group is clustered into one group at the highest level in all of the clustering results of the nonlinear-4 method (see Figure 6-2(a)-(c)) using the principal components for the protein p8MTCP1. Thus, the method is useful for identifying the residues in the N and C terminal turns.
3. The 3-Group of the nonlinear-4 method (see Figure 6-2(a)-(c)) is classified in the group of the highest clustering level that contains the residues in the core region of the protein p8MTCP1. Yet, the 3-Group of the linear-4 method is clustered in the other group of the highest clustering level (see Figure 6-1(a)-(c)). Only in the clustering results of the nonlinear-4 method are residues 57, 58 and 59 (in the tip of helix IV) separated from the residues of the N and C terminal turns, except for residues 5 and 60. In the clustering results of the linear-4 method, residues 57, 58 and 59 cannot be separated out.
From Figure 6-1(a)-(c), in the clustering results of the linear-4 method, we can also observe that the 1-Group is classified in a single group by the third clustering vector, and that the 2-Group is classified in one group of the highest clustering level. Since residues 57, 58, 59 (in the tip of helix IV) in the 3-Group are in a loose secondary structure of the protein p8MTCP1, it is also meaningful that the 3-Group and 2-Group of the linear-4 method are classified in the same group. Consequently, the estimators (S², τs, τe, ∆σN), combined with principal component analysis and the clustering method, provide helpful information for explaining the protein secondary structure. The estimators by the nonlinear-4 method contain more useful information than those by the linear-4 method.
4.2 The Protein Pilin from Strain K122-4
In the first part of this section, we discuss the estimation results of the three methods for the protein Pilin from strain K122-4. Next, we address the following two aspects of detecting outliers in the relaxation parameters.
1. We derive the relation between the relaxation parameters and the resonance frequencies ωN and ωH corresponding to different magnetic fields.
2. We discuss the 95% confidence interval (C.I.) of the relaxation parameter R2 for ∆σN ∈ [−220 ppm, −140 ppm].
Figure 7. The fitted parameters (A, B, S², τs, τe, ∆σN) of the protein Pilin from strain K122-4 at each residue by the three estimation methods (linear-4, nonlinear-4 and nonlinear-3), with the regions of the α helix and βI to βIV marked along the residue axis.
(a) A, (b) B, (c) τs (ns), (d) ∆σN (ppm), (e) S², (f) τe (ns).
From Figure 7, there are some unusual patterns in the estimators of τe, τs, ∆σN and S² for most residues of the protein Pilin. There are two notable characteristics in Figure 7. The first is that many of the estimators of ∆σN by the linear-4 and nonlinear-4 methods are too large, about −120 ppm compared to −170 ppm. The second is that the correlation matrices of the estimators of S², τs, τe, ∆σN (see Table 3) are not consistent with the results for the protein p8MTCP1. In particular, for the nonlinear-4 method the correlation between τe and τs, ρ(τe, τs), changes from insignificant to positive, and ρ(∆σN, S²) changes from negative to positive. For the linear-4 method, ρ(τe, S²) changes from highly negative to insignificant. We suspect that the problem is caused by the poor quality of the experimental data. Since among the four parameters (S², τs, τe, ∆σN) only ∆σN is known to lie between −220 ppm and −140 ppm, it is possible to derive a monitoring procedure based on ∆σN. In the following, we first derive the sampling distribution of R2 based on the normality assumption on ∆σN.
Table 3: Correlation Matrix of Parameters S², τs, τe, ∆σN in the Protein Pilin from Strain K122-4
Method                 Parameter   τs   ∆σN       τe       S²
Nonlinear Estimation   τs          1    *-0.453   *0.413   *0.672
                       ∆σN              1         *0.536   *0.574
                       τe                         1        *0.447
                       S²                                  1
Linear Estimation      τs          1    *-0.307   0.513    *0.695
                       ∆σN              1         0.54     *-0.718
                       τe                         1        0.034
                       S²                                  1
*. Correlation is significant at the 0.01 level (2-tailed).
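From eqs. (1)-(3) for the multifield data, we find that the parameter R2 has the main effect on the estimation of ∆σN. Eq. (2) can be simplified to
\[ R_2 = K_d\left[3\tilde{J}_d(\omega_H+\omega_N) + \frac{3}{2}\tilde{J}_d(\omega_N) + 3\tilde{J}_d(\omega_H) + \frac{1}{2}\tilde{J}_d(\omega_H-\omega_N) + 2\tilde{J}_d(0)\right] + K_{csa}\left[\frac{1}{2}\tilde{J}_{csa}(\omega_N) + \frac{2}{3}\tilde{J}_{csa}(0)\right] \]
\[ \phantom{R_2} = JD(\omega_N,\omega_H) + (\Delta\sigma_N)^2\, JC(\omega_N), \]
where
\[ JD(\omega_N,\omega_H) = K_d\left[3\tilde{J}_d(\omega_H+\omega_N) + \frac{3}{2}\tilde{J}_d(\omega_N) + 3\tilde{J}_d(\omega_H) + \frac{1}{2}\tilde{J}_d(\omega_H-\omega_N) + 2\tilde{J}_d(0)\right], \qquad JC(\omega_N) = \frac{\omega_N}{15}\left[\frac{1}{2}\tilde{J}_{csa}(\omega_N) + \frac{2}{3}\tilde{J}_{csa}(0)\right]. \]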
As the strength of the magnetic field increases (the resonance frequencies ωN and ωH increase), JD(ωN, ωH) and JC(ωN) decrease, yet the decrease of JC(ωN) is smaller than that of JD(ωN, ωH). If ∆σN is too large to lie in (−220 ppm, −140 ppm), then ∂R2/∂ωN < 0 and ∂R2/∂ωH < 0, so R2 decreases as the strength of the magnetic field increases. Conversely, if the multifield R2 data decrease as the strength of the magnetic field increases, then ∆σN will be too large to lie in the (−220 ppm, −140 ppm) interval.
Based on the above discussion, we summarize the following two characteristics of R2 in multifield data. If ∆σN ∈ [−220 ppm, −140 ppm], R2 increases as the field strength increases, and the relationships R2,500MHz ≈ 1.03 R2,300MHz and R2,600MHz ≈ 1.09 R2,300MHz hold. The exact ratios for two residues (50 and 138) of the protein Pilin are given in Table 4. Furthermore, the relationship also holds for all the residues of the protein p8MTCP1. However, if ∆σN ∉ (−220 ppm, −140 ppm), no such proportional relationship exists for R2 in different magnetic fields. For the relaxation data of the protein Pilin (from the Supporting Information provided by Jeong-Yong Suh et al.), about 80% of the R2 parameters do not conform to the proportional relationship.
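The proportional relationship can be turned into a simple screening rule for the multifield R2 data. The Python sketch below checks whether the ratios R2,500MHz/R2,300MHz and R2,600MHz/R2,300MHz of a residue are close to 1.03 and 1.09. The function names, the tolerance of 0.02, and the two sets of R2 values are assumptions made for illustration; the text does not specify how close the ratios must be.

```python
# Target ratios taken from the relationship quoted in the text.
EXPECTED_RATIOS = {"500/300": 1.03, "600/300": 1.09}


def field_ratios(r2_300, r2_500, r2_600):
    """Ratios of R2 at 500 MHz and 600 MHz to R2 at 300 MHz."""
    return {"500/300": r2_500 / r2_300, "600/300": r2_600 / r2_300}


def conforms(r2_300, r2_500, r2_600, tol=0.02):
    """True if both ratios lie within tol of the expected values (tol is hypothetical)."""
    ratios = field_ratios(r2_300, r2_500, r2_600)
    return all(abs(ratios[key] - EXPECTED_RATIOS[key]) <= tol for key in EXPECTED_RATIOS)


# Hypothetical R2 values at 300, 500 and 600 MHz for two made-up residues.
typical = (12.0, 12.36, 13.08)   # R2 grows with the field as expected
suspect = (12.0, 11.5, 11.0)     # R2 shrinks with the field
flags = {"typical": conforms(*typical), "suspect": conforms(*suspect)}
```

A residue flagged as non-conforming in this way is a candidate outlier in the sense discussed above, and would be examined further with the confidence interval of R2.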
Table 4: The ratios of R2 values in different fields

Magnetic Field Increment   residue 50                   residue 138
                           (A=0.0181, B=15.738,         (A=0.0743, B=13.513,
                           τs=8.6376, ∆σN=−170ppm)      τs=8.343, ∆σN=−170ppm)
300MHz-500MHz              1.032                        1.027
300MHz-600MHz              1.098                        1.09
500MHz-600MHz              1.062                        1.06

The increasing ratio of R2 = (R2 value in the last field) ÷ (R2 value in the first field). This table shows two examples (residues
50 and 138) to illustrate the distribution of R2 in different fields.

Next, we derive the distribution of R2 from eq. (2) in a given magnetic field by
assuming that ∆σN (ppm) follows a normal distribution with mean = −170ppm and variance = 20ppm,
denoted by ∆σN ∼ N (−170ppm, 20ppm). From eq. (2), we can simplify the expression of
R2 to φ × ∆σN² + ψ, where φ and ψ are positive real functions of ωN , ωH , τs , S² and τe , i.e.
φ = JC(ωN*, ωH*) and ψ = JD(ωN*, ωH*). Therefore R2 is a linear transformation of a chi-square
random variable with the following c.d.f. (cumulative distribution function)
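$$
F(x) = \frac{1}{2}\left[\mathrm{Erf}\left(\frac{\sqrt{x}-\mu}{\sqrt{2}\times\sigma}\right) + \mathrm{Erf}\left(\frac{\sqrt{x}+\mu}{\sqrt{2}\times\sigma}\right)\right],
$$
where
$$
\Delta\sigma_N \sim N(\mu, \sigma^2), \qquad x = \phi\times\Delta\sigma_N^2 + \psi, \quad x \in [0, \infty),
$$
$$
\mathrm{Erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-t^2}\,dt, \quad z \in [0, \infty).
$$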
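The error-function expression above is the c.d.f. of the square of a normal variable: P(∆σN² ≤ y) = P(−√y ≤ ∆σN ≤ √y). The sketch below evaluates it with Python's standard `math.erf` and compares it with a seeded Monte Carlo estimate. The function names, the sample size and the choice of a standard deviation of 20 ppm are assumptions made for illustration only.

```python
import math
import random


def cdf_squared_normal(y, mu, sigma):
    """P(X**2 <= y) for X ~ N(mu, sigma**2), written with the error function."""
    if y <= 0.0:
        return 0.0
    root = math.sqrt(y)
    scale = math.sqrt(2.0) * sigma
    return 0.5 * (math.erf((root - mu) / scale) + math.erf((root + mu) / scale))


def empirical_cdf_squared_normal(y, mu, sigma, n=20000, seed=0):
    """Monte Carlo estimate of the same probability (seeded for repeatability)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.gauss(mu, sigma) ** 2 <= y)
    return hits / n


# Values in ppm; SIGMA = 20 is an assumed standard deviation for this illustration.
MU, SIGMA = -170.0, 20.0

for root in (150.0, 170.0, 190.0):
    y = root ** 2
    exact = cdf_squared_normal(y, MU, SIGMA)
    approx = empirical_cdf_squared_normal(y, MU, SIGMA)
    print(f"sqrt(y) = {root:5.1f} ppm  formula = {exact:.4f}  simulation = {approx:.4f}")
```

At √y = 170 ppm the formula gives 0.5, because the far tail on the positive side contributes essentially nothing when the mean is −170 ppm.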
Based on this result, the 95% C.I. (confidence interval) of R2 for µ = −170ppm, σ = √20ppm
is
$$
\left[\phi(1.71088\times10^{-8}) + \psi,\ \ \phi(4.37643\times10^{-8}) + \psi\right]. \qquad (14)
$$
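The two numerical constants in eq. (14) are the 2.5% and 97.5% quantiles of ∆σN². The sketch below recovers them by inverting the c.d.f. with bisection. It is an illustration under stated assumptions: ∆σN is expressed in absolute units (1 ppm = 10⁻⁶), the spread is taken as a standard deviation of 20 ppm, which is the setting under which the quoted endpoints are reproduced here, and the values of `phi` and `psi` in the usage lines are hypothetical placeholders rather than estimates from the data.

```python
import math


def cdf_squared_normal(y, mu, sigma):
    """P(X**2 <= y) for X ~ N(mu, sigma**2)."""
    if y <= 0.0:
        return 0.0
    root = math.sqrt(y)
    scale = math.sqrt(2.0) * sigma
    return 0.5 * (math.erf((root - mu) / scale) + math.erf((root + mu) / scale))


def quantile_squared_normal(p, mu, sigma, iterations=200):
    """Invert the c.d.f. by bisection on y = X**2."""
    lo, hi = 0.0, (abs(mu) + 10.0 * sigma) ** 2
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cdf_squared_normal(mid, mu, sigma) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def r2_interval(phi, psi, mu, sigma, level=0.95):
    """Map the quantiles of delta_sigma_N**2 through R2 = phi * y + psi (phi > 0)."""
    tail = 0.5 * (1.0 - level)
    lower = quantile_squared_normal(tail, mu, sigma)
    upper = quantile_squared_normal(1.0 - tail, mu, sigma)
    return phi * lower + psi, phi * upper + psi


PPM = 1e-6
MU = -170.0 * PPM
SIGMA = 20.0 * PPM  # assumed standard deviation, see the note above

lower = quantile_squared_normal(0.025, MU, SIGMA)
upper = quantile_squared_normal(0.975, MU, SIGMA)
print(f"{lower:.5e}  {upper:.5e}")  # approximately 1.71088e-08 and 4.37643e-08

# Hypothetical phi and psi, used only to show how the interval is assembled.
print(r2_interval(phi=1.0e8, psi=10.0, mu=MU, sigma=SIGMA))
```

Because φ is positive, the mapping y ↦ φy + ψ is increasing, so the quantiles of ∆σN² carry over directly to the endpoints of the interval for R2.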
Simulations are performed to confirm the above formula for the 95% C.I. of R2, using the
estimators of S² , τs and τe from the nonlinear-4 method. From Table 5, the results of the simulation (5000 times) and of eq. (14) are similar. Consequently, we can use eq. (14) to detect
outliers among the R2 parameters in the original data.

Table 5: The simulated and exact 95% C.I. of R2 of residue 35

           simulated C.I. (5000 times)   exact C.I.
600MHz     [12.4513,15.7955]             [12.4314,15.7685]
500MHz     [12.0312,14.3992]             [12.0171,14.3801]
300MHz     [12.1244,13.0574]             [12.1189,13.0574]

Many R2 values lie outside the 95% C.I., which causes problems in estimating the parameters.
Recalling Figure 7(d), we observe that most of the estimated ∆σN values are around
−100ppm, which is outside the normal range (−220ppm, −140ppm).
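The outlier check based on eq. (14) amounts to testing whether each measured R2 falls inside the interval for its field. A minimal sketch follows, using the exact intervals of residue 35 from Table 5; the measured values and the function name are hypothetical and serve only to show the mechanics.

```python
# Exact 95% C.I. of R2 for residue 35, copied from Table 5 (field in MHz).
EXACT_CI = {
    600: (12.4314, 15.7685),
    500: (12.0171, 14.3801),
    300: (12.1189, 13.0574),
}


def flag_outliers(measured_r2, intervals):
    """Map each field to True when the measured R2 lies outside its interval."""
    return {
        field: not (intervals[field][0] <= value <= intervals[field][1])
        for field, value in measured_r2.items()
    }


# Hypothetical measurements for one residue; they are not taken from the data set.
measured = {300: 12.50, 500: 11.20, 600: 13.90}

print(flag_outliers(measured, EXACT_CI))  # {300: False, 500: True, 600: False}
```

In this made-up example only the 500MHz value is flagged, since 11.20 lies below the lower endpoint 12.0171.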
The estimation problems for the protein Pilin from strain K122-4 might arise
from the following two reasons: (i) the relaxation data of the protein Pilin may not be suitable
for the model-free approach, or (ii) there are large experimental errors in the
data (the second protein is a large protein, which increases the experimental measurement
error).
[Figure: 95% confidence intervals of R2 plotted against residue. Panels: (a) 95% Confidence Interval of R2 in 300MHz; (b) 95% Confidence Interval of R2 in 500MHz; (c) 95% Confidence Interval of R2 in 600MHz. Horizontal axis: residue; vertical axis: R2 in the given field; legend: 95% C.I.]
5 Conclusions
In this study, we compare the estimators obtained by nonlinear and linear estimation. Our
conclusions are as follows:
1. For reasonable multifield relaxation parameter data, such as those of the first protein
(p8MTCP1), the estimators analyzed by principal component analysis and
Ward's clustering method give significant information for discussing the protein
secondary structure.
2. The estimators from nonlinear estimation give a better fit for explaining the second
protein structure than those from linear estimation.
3. The estimators of ∆σN from both linear and nonlinear estimation provide useful
information for analyzing protein structure.
4. Eq. (14) can be used to detect outliers among the experimental values of the relaxation parameter R2 .
References
[1] Abragam, A. (1961) Principles of Nuclear Magnetism. Clarendon Press, Oxford,
U.K.
[2] Alvin C. Rencher (1961) Methods of Multivariate Analysis. second edition Brigham
Young University.
[3] Canet, D. (1996) Nuclear Magnetic Resonance: Concepts and Methods. Wiley:
Chichester.
[4] Canet, D., Philippe Barthe, Pierre Mutzenhardt, and Christian Roumestand
(2001) A Comprehensive Analysis of Multifield 15N Relaxation Parameters in
Protein: Determination of 15N Chemical Shift Anisotropies. J. Am. Chem. Soc.
123, 4567-4576.
[5] Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining (2001) Introduction To Linear Regression Analysis (Third Edition) Wiley-Interscience.
[6] Jeong-Yong Suh, Leo Spyracopoulos, David W. Keizer, Randall T. Irvin, and Brian
D. Sykes (2001) Backbone Dynamics of Receptor Binding and Antigenic Region
of a Pseudomonas aeruginosa Pilin Monomer. Biochemistry 40, 3985-3995.
[7] Lipari, G.; Szabo, A.(1982) J. Am. Chem. Soc. 104, 4546.
[8] Marceau, M., and Nassif, X. (1999) J. Bacteriol. 181, 656-661.
[9] PDB (Protein Data Bank: www.rcsb.org/pdb/)