Shape manifold and curves in the context of big data

Kanti Mardia
Senior Research Professor,
University of Leeds and University of Oxford
LEEDS ANNUAL STATISTICAL RESEARCH WORKSHOP(LASR)—MISSION STATEMENT
“Statistics without science is incomplete,
Science without statistics is imperfect.” - Mardia
Joint work with many statisticians, scientists, and industries
–1–
CONTENTS
1. Curves and Big data context: Fetal alcohol spectrum disorder (FASD)
and Drug Discovery
2. Shape Analysis for FASD (Background)
3. FASD (a closed curve in 2D)
4. Drug Discovery: Form analysis of Helix (an open curve in 3D)
5. Locating the Kink
6. Big Picture of Shape Manifold
7. Some Concluding Remarks
–2–
Topic 1: Fetal alcohol spectrum disorder (FASD) and some “big” data
FASD is an umbrella term that covers a range of abnormalities resulting from
exposing human embryos to high levels of ALCOHOL.
The prevalence of FASD in the USA and some European countries is estimated at
20 to 50 per 1,000 school children.
–3–
FASD DIAGNOSIS from the callosal mid-curve of the corpus
callosum of the brain.
The callosal mid-curve (a closed curve in 2D), seen in a cut of the brain up the middle,
can be described as a shallow upside-down letter C (TOP RED section) with varying
curvature and thickness.
–4–
MURDERER AND JAIL
Judge: Death Penalty or Life Sentence?
(a defendant with FASD on trial for murder in the USA)
–5–
Topic 2: Big Data in Human Genomics/Proteins
–6–
Recap: Genome and big data (high dimensional as well)
1. 23 pairs of chromosomes (not 24!)
2. 20,000-25,000 genes
3. 3 billion letters (A, T, C, G) in the human genome (Human Genome Project); millions of protein sequences
4. More than 100,000 protein structures
5. High dimensional as well: say 200 amino acids per protein on average, times about
four angles per amino acid = 800 angles
6. More and more complete genomes (the horse recently)
7. Challenge: predict protein structures/functions for, say, 1,000 sequences at the same
time (going from 1D to 3D).
–7–
Membrane Protein: Straight and Kinked Helices (3D) in Drug
Discovery
–8–
Drug Discovery
–9–
2 SHAPE ANALYSIS for FASD (Background)
Shape (similarity) is the description of an object after ignoring changes in
• location, scale, and rotation (David Kendall, Fred Bookstein).
Objects are described (mainly) in terms of landmarks.
Form (rigid transformations) ignores changes in
• location and rotation, but not scale (as for the helix).
Two triangles (3 landmarks: points on a nose?) with the same labelled shape (form).
–10–
CONTRASTED WITH DIRECTIONAL STATISTICS
Deals with angular observations (or directions). Angles such as wind bearings are treated as points on
the circle of unit radius with centre at the origin.
Average direction? Use the centre of gravity... Shape analysis is much more
complicated than directional statistics; how do we get an average face?
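The centre-of-gravity idea for an average direction can be sketched in a few lines of Python. This is only an illustration: the function name `mean_direction` and the two example bearings are assumptions, not part of the talk.

```python
import math

def mean_direction(angles_deg):
    """Mean direction from the centre of gravity of points on the unit circle.

    Returns the direction in degrees (0 to 360) and the mean resultant
    length (near 1 = concentrated data, near 0 = dispersed data).
    """
    n = len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    direction = math.degrees(math.atan2(s, c)) % 360.0
    return direction, math.hypot(c, s)

# Two hypothetical wind bearings on either side of north.
direction, r_bar = mean_direction([350.0, 30.0])
print(direction, r_bar)
```

For these two bearings the arithmetic mean is 190 degrees, which points almost due south, whereas the centre of gravity gives a direction of about 10 degrees.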
–11–
Bookstein coordinates in 2 dimensions
The starting point is a configuration of k landmarks in 2D, but to work statistically with shapes
we need a coordinate system. The simplest system is Bookstein coordinates:
• translate so that the midpoint of the baseline (landmarks 1 and 2) is at the origin;
• rotate and scale so that landmarks 1 and 2 lie at (-1/2, 0) and (1/2, 0);
• the remaining k - 2 points are the shape coordinates in R^2.
But the choice of baseline matters: (i) there are nonlinear effects if the data are not highly concentrated, and
(ii) one should change to PCA even if the data are highly concentrated.
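A minimal sketch of Bookstein coordinates, assuming the common convention in which the baseline landmarks 1 and 2 are sent to (-1/2, 0) and (1/2, 0). Landmarks are treated as complex numbers so that translation, rotation and scaling reduce to one subtraction and one division; the function name is hypothetical.

```python
import cmath

def bookstein_coordinates(landmarks):
    """Bookstein shape coordinates of a 2D configuration.

    landmarks: list of (x, y) pairs; the first two form the baseline.
    Returns the k - 2 remaining landmarks after the baseline has been
    sent to (-1/2, 0) and (1/2, 0).
    """
    z = [complex(x, y) for x, y in landmarks]
    baseline = z[1] - z[0]
    if baseline == 0:
        raise ValueError("baseline landmarks coincide")
    # Dividing by the baseline removes rotation and scale in one step.
    u = [(zj - z[0]) / baseline - 0.5 for zj in z]
    return [(uj.real, uj.imag) for uj in u[2:]]

triangle = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)]
print(bookstein_coordinates(triangle))

# A rotated, rescaled and shifted copy has the same shape coordinates.
g = 3.0 * cmath.exp(0.7j)
moved = [((g * complex(x, y) + (5 - 2j)).real,
          (g * complex(x, y) + (5 - 2j)).imag) for x, y in triangle]
print(bookstein_coordinates(moved))
```

Both calls should give the same point, which is the sense in which the coordinates describe shape only.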
–12–
Bookstein Coordinates for a triangle
The original triangle in (a) is transformed by translation in (b), rotation in (c) and finally
rescaling in (d) to give the Bookstein coordinates labelled 3 in plot (d).
–13–
Shape Model and Procrustes Matching
Consider two centred configurations
$$y = (y_1, \ldots, y_k)^T \quad \text{and} \quad w = (w_1, \ldots, w_k)^T,$$
both in $\mathbb{C}^k$, with $y^* 1_k = 0 = w^* 1_k$, where $y^*$ denotes the transpose of
the complex conjugate of $y$.
We take the shape model to be the complex linear regression
$$y = (a + ib)\,1_k + \beta e^{i\theta} w + \epsilon,$$
where
translation: $a + ib$,
scale: $\beta > 0$,
rotation: $0 \le \theta < 2\pi$;
$\epsilon$: a $k \times 1$ complex error vector.
Procrustes Matching and Distance
The Procrustes fit (superimposition) of $w$ onto $y$ is
$$w^P = (\hat a + i\hat b)\,1_k + \hat\beta e^{i\hat\theta} w,$$
where $(\hat\beta, \hat\theta, \hat a, \hat b)$ are chosen to minimize
$$D^2(y, w) = \| y - w\beta e^{i\theta} - (a + ib)1_k \|^2.$$
The Procrustes distance (a metric) between $w$ and $y$ is given by
$$d_F(w, y) = \left\{ 1 - \frac{y^* w\, w^* y}{w^* w\, y^* y} \right\}^{1/2}. \qquad (1)$$
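The fit and the distance in (1) can be sketched with NumPy as follows. This is an illustrative sketch, not code from the talk: the function names are hypothetical, and it uses the standard least-squares solution for centred configurations, in which the fitted translation is zero and the fitted scale and rotation are the complex regression coefficient $w^* y / (w^* w)$.

```python
import numpy as np

def centre(z):
    """Subtract the centroid of a complex landmark vector."""
    z = np.asarray(z, dtype=complex)
    return z - z.mean()

def procrustes_fit(y, w):
    """Fit w onto y; return the fitted configuration and d_F(w, y)."""
    y, w = centre(y), centre(w)
    wy = np.vdot(w, y)                 # w^* y (vdot conjugates its first argument)
    ww = np.vdot(w, w).real            # w^* w
    yy = np.vdot(y, y).real            # y^* y
    w_fit = (wy / ww) * w              # beta_hat * exp(i * theta_hat) * w
    d2 = 1.0 - abs(wy) ** 2 / (ww * yy)
    return w_fit, float(np.sqrt(max(d2, 0.0)))

y = np.array([0, 1, 1 + 1j, 0.2 + 2j])
w = 2.5 * np.exp(1.1j) * y + (3 - 4j)      # same shape as y, different pose
w_fit, d = procrustes_fit(y, w)
print(d)                                    # close to zero: identical shapes
```

Because `w` is only a translated, rotated and rescaled copy of `y`, the fitted configuration coincides with the centred `y` and the distance is zero up to rounding.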
–15–
Mean Shape and Tangent Coordinates
The full Procrustes estimate of mean shape $[\hat\mu]$ is obtained by minimizing (over $\mu$) the sum of squared
full Procrustes distances from each $w_i$ to an unknown unit-size mean configuration $\mu$, i.e.
$$[\hat\mu] = \arg\inf_{\mu} \sum_{i=1}^{n} d_F^2(w_i, \mu).$$
The mean is seen to be the eigenvector corresponding to the largest eigenvalue of the
complex sum of squares and products matrix
$$S = \sum_{i=1}^{n} w_i w_i^* / (w_i^* w_i) = \sum_{i=1}^{n} z_i z_i^*, \qquad (2)$$
where the $z_i = w_i / \|w_i\|$, $i = 1, \ldots, n$, are centred and of unit size. The Procrustes residuals are
$$r_i = w_i^P - \frac{1}{n} \sum_{j=1}^{n} w_j^P, \quad i = 1, \ldots, n, \qquad (3)$$
where $w_i^P$ is the Procrustes fit of $w_i$ onto $\hat\mu$. These residuals have the form of tangent coordinates,
so multivariate analysis (MVA) can now be used on the coordinates.
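A short NumPy sketch of equations (2) and (3): the mean shape is taken as the leading eigenvector of S, each configuration is fitted onto it, and the residuals are formed about the average fit. The function name and the toy data are assumptions made for illustration.

```python
import numpy as np

def full_procrustes_mean(configs):
    """configs: (n, k) complex array, one configuration per row.

    Returns the unit-size mean shape, the Procrustes fits of each
    configuration onto it, and the Procrustes residuals.
    """
    w = np.asarray(configs, dtype=complex)
    w = w - w.mean(axis=1, keepdims=True)               # centre each w_i
    z = w / np.linalg.norm(w, axis=1, keepdims=True)    # z_i = w_i / ||w_i||
    S = z.T @ z.conj()                                  # sum_i z_i z_i^*
    eigvals, eigvecs = np.linalg.eigh(S)                # ascending eigenvalues
    mu = eigvecs[:, -1]                                 # leading eigenvector
    coef = (w.conj() @ mu) / np.sum(np.abs(w) ** 2, axis=1)   # w_i^* mu / w_i^* w_i
    fitted = coef[:, None] * w                          # w_i^P
    residuals = fitted - fitted.mean(axis=0)            # r_i in equation (3)
    return mu, fitted, residuals

base = np.array([0, 1, 1 + 1j, 0.2 + 2j])
poses = [(1.0, 0.0, 0), (2.0, 0.7, 1 + 1j), (0.5, -1.2, -3j)]
configs = np.array([s * np.exp(1j * th) * base + c for s, th, c in poses])
mu, fitted, residuals = full_procrustes_mean(configs)
print(np.max(np.abs(residuals)))    # near zero: all rows share one shape
```

With real data the rows of `residuals` (real and imaginary parts stacked) are what would be passed on to PCA or discriminant analysis.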
–16–
3 FASD: Case of a Real Trial of Murderer XX (USA) and FASD
• We consider the court proceedings at the penalty phase of this trial.
• Analysis of his MRI brain scan was part of the defence.
• If a jury could be convinced that he had been born with brain damage, the death
penalty might NOT be applied.
• Put crudely, whether or not he was to be executed could depend on the shape of his
callosal curve.
–17–
FASD: Shape characterization of callosal curve
Average normal callosal midcurve (15 normals) with one landmark point (rostrum: closed
circle) and 39 sliding landmarks (stars)
–18–
MRI scan of the defendant XX
The callosal curve of defendant XX as visualized over a nearby plane of his MRI scan
–19–
Normal (dashed lines) vs FASD XX (solid lines)
The average curve (dashed lines) compared with the polygon (solid lines) of XX. Note the
narrowing of XX's callosal curve in the isthmus region.
–20–
On callosal curve of FASD XX
Narrowness of isthmus with the shortest width shown by a vector of the outline.
–21–
Shape Discrimination on XX
The likelihood ratio contours (equispaced) for the hypothesis of prenatal alcohol damage
versus normal, from quadratic discrimination in the tangent space (+ = FASD (45);
filled circles = normal (15); Seattle data; and XX).
–22–
Statistical Analysis of shape of callosal mid-curve of Murderer XX
• The odds (likelihood ratio) for the one hypothesis (damage of the sort seen in FASD) over the
alternative hypothesis (a normal callosal curve) are about 800 to 1.
• In open courtroom testimony, my collaborator, Fred Bookstein, pointed to this strong
evidence for congenital brain damage.
• As a consequence of that argument, together with observations by other
experts, the American jury found the defendant not to be sufficiently culpable
to deserve execution. Instead he is serving a life sentence in an American prison.
• Fred has acted as an expert witness for the defence in about 25 murder cases. In
each case the guilt of the defendant was not in question, but evidence of FASD was
used as a mitigating circumstance when sentencing.
• In more than half of those cases, the defendant was saved from the death
penalty.
–23–
4. Membrane Protein: Straight Helices and Kinked Helices
–24–
Why do kinks play a key role in membrane protein helices?
• Membrane proteins are attached to the membrane of a cell.
• It is estimated that 20%-30% of all genes in most genomes (more so in the human
proteome) encode membrane proteins.
• Why do we care about kinks in membrane protein helices?
• Around 50% of current drug targets are membrane proteins.
• Kinks in membrane protein (alpha) helices are known to be functionally important.
–25–
GOOD HELIX with its atoms (3A7KA: protein data bank) and a fitted
cylinder
–26–
Kinked Helix with two adjacent cylinders at the kink (1RHZA 365)
–27–
5A. Classifying a Helix: The General Equation of Ellipsoid / Cylinder
Recall that an ellipsoid has the equation
$$(x - \mu)^T \Sigma^{-1} (x - \mu) = 1, \quad x \in \mathbb{R}^3,$$
where $\mu$ is the centre and $\Sigma$ is a positive definite matrix.
Let
$$\Sigma = \Gamma D \Gamma^T,$$
where
$\Gamma = (\gamma_1, \gamma_2, \gamma_3)$ is an orthogonal matrix and
$D = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3)$ is the matrix of eigenvalues.
These $\lambda$'s are the squared half-lengths of the axes of the ellipsoid, whereas
$\gamma_1, \gamma_2, \gamma_3$ (the columns of $\Gamma$) are the directions of the corresponding three axes.
If the $\lambda$'s are arranged in descending order, then for a (right) cylinder
$\lambda_1 =$ "$\infty$", $\lambda_2 = \lambda_3$, and its axis is the first eigenvector $\gamma_1$.
–28–
For straight helices $\lambda_2 = \lambda_3$, but $\lambda_2 \neq \lambda_3$ for kinked helices; $\lambda_1$ is very
large.
–29–
Cylinder Testing and Fitting
Let $x_1, x_2, \ldots, x_n$ be a random sample, and let $l_1, l_2, l_3$ be the eigenvalues of the sample
covariance matrix $S$. For a cylinder, we need to test $\lambda_2 = \lambda_3$. The likelihood ratio
test with Bartlett's correction leads to
$$B = 2\left(n - \tfrac{17}{6}\right) \log(a/g); \quad a = (l_2 + l_3)/2, \quad g = \sqrt{l_2 \times l_3}, \qquad (4)$$
which is distributed as $\chi^2$ with 2 degrees of freedom under the null hypothesis.
If the null hypothesis is not rejected, we can take the sample eigenvector g1, the axis of
the cylinder, as the z-axis, and the eigenvectors g2, g3 as the x-axis and y-axis.
For the real STRAIGHT helix 3A7K, n = 22 (Cα coordinates), we find
l1 = 94.0, l2 = 2.7, l3 = 2.9,
B = 0.024 and P(B > 0.024) ≈ 0.99.
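The statistic in (4) is easy to check numerically. The sketch below (function name hypothetical) plugs in the eigenvalues quoted for helix 3A7K; it relies on the fact that the upper tail of a chi-squared distribution with 2 degrees of freedom is exp(-B/2), so no statistics library is needed.

```python
import math

def cylinder_test(l2, l3, n):
    """Bartlett-corrected likelihood ratio test of lambda_2 = lambda_3.

    Returns the statistic B of equation (4) and its p-value under a
    chi-squared distribution with 2 degrees of freedom.
    """
    a = (l2 + l3) / 2.0                  # arithmetic mean of the two eigenvalues
    g = math.sqrt(l2 * l3)               # geometric mean of the two eigenvalues
    B = 2.0 * (n - 17.0 / 6.0) * math.log(a / g)
    p_value = math.exp(-B / 2.0)         # P(chi^2_2 > B)
    return B, p_value

B, p = cylinder_test(l2=2.7, l3=2.9, n=22)
print(round(B, 3), round(p, 2))
```

A small B (large p-value) is consistent with a circular cross-section, that is, a straight helix; kinked helices give large B.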
–30–
Discriminant based on B for straight helices (1014) and kinked
helices (356), large B for kinked helices. Boundary: log(B) = 0
–31–
5. Location of a kink: PCA fails for small number of atoms
(“non-integer” turns)
[Figure: scatter plot with axes coords[,1] and coords[,2].]
–32–
Kinked Helices and Robust Estimation of the Helix Axis
Bioinformaticians have shown that the helix axis is the key to locating the kink. Let the
observed points be $C_i^\alpha$, $a$ = the (unit) direction vector of the helix axis, $d_i$ = the distance vector from $C_i^\alpha$ to the
axis, and $r_0$ = the vector from the origin to the axis (perpendicular to it).
Our new method involves minimizing, with respect to $r_0$ and $a$,
$$\Delta^2(r_0, a) = \sum_i \left( d_i^2 - \overline{d^2} \right)^2, \qquad \overline{d^2} = \frac{1}{n} \sum_i d_i^2,$$
under the constraints $|a| = 1$, $a^T r_0 = 0$. If $x_i$ is the vector from the origin to $C_i^\alpha$, then
$$d_i^2 = x_i^T (I - a a^T) x_i - 2 x_i^T r_0 + r_0^T r_0.$$
A nonlinear conjugate gradient method makes the minimization fast, and thus the axis $a$ is
estimated. The radius of the helix is estimated by $\bar d$.
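The objective function above can be sketched with NumPy as follows. This shows only the criterion and the radius estimate, not the conjugate gradient search; the function name is hypothetical, and the test helix (22 points, 100 degrees and 1.5 Å rise per residue, radius 2.3 Å) is an idealised illustration rather than data from the talk.

```python
import numpy as np

def delta_squared(points, r0, a):
    """Criterion Delta^2(r0, a) and the radius estimate (mean of d_i)."""
    x = np.asarray(points, dtype=float)
    a = np.asarray(a, dtype=float)
    a = a / np.linalg.norm(a)                 # enforce |a| = 1
    r0 = np.asarray(r0, dtype=float)
    r0 = r0 - (a @ r0) * a                    # enforce a^T r0 = 0
    # d_i^2 = x_i^T (I - a a^T) x_i - 2 x_i^T r0 + r0^T r0
    d2 = np.sum(x * x, axis=1) - (x @ a) ** 2 - 2.0 * (x @ r0) + r0 @ r0
    delta2 = np.sum((d2 - d2.mean()) ** 2)
    radius = np.sqrt(np.clip(d2, 0.0, None)).mean()
    return delta2, radius

# Idealised helix around the axis through (1, 2, 0) with direction (0, 0, 1).
i = np.arange(22)
t = np.radians(100.0 * i)
helix = np.column_stack([1 + 2.3 * np.cos(t), 2 + 2.3 * np.sin(t), 1.5 * i])

print(delta_squared(helix, r0=[1, 2, 0], a=[0, 0, 1]))   # true axis
print(delta_squared(helix, r0=[0, 0, 0], a=[1, 0, 0]))   # wrong axis
```

At the true axis every point is at the same distance from the axis, so the criterion vanishes and the radius is recovered; any general-purpose optimiser could then be applied to this function over the axis parameters.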
–33–
Geometry of a robust method for the axis estimation
The observed points are $C_i^\alpha$; $a$ = the direction of the helix axis; $d_i$ = the distance vector from $C_i^\alpha$ to
the axis; $r_0$ = the vector from the origin O to the axis.
–34–
Axes fitted to the backbone atoms of a sliding window of 6 residues.
Axes used to calculate a local angle θ for each residue.
–35–
Oxford algorithm to locate axial changes: “KINK FINDER” software
• Use a sliding window of 6 residues, using all four heavy atoms of the backbone
(N, Cα, C, O), so there are 24 atoms from which to calculate the axis.
• Estimate the axis using our modified least squares method.
• A kink is identified where a residue:
a. has an angle greater than 10°,
b. is at least four residues from any already identified kinks,
c. has a residue with an angle less than 10° between it and any already identified
kinks.
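Rules (a) to (c) can be read as a greedy selection over the per-residue angles. The sketch below is one possible reading, not the Kink Finder implementation: visiting residues in order of decreasing angle is an assumption (the slide does not state the order), and the function name and example angles are hypothetical.

```python
def find_kinks(angles, threshold=10.0, min_separation=4):
    """Return the indices of residues flagged as kinks.

    angles: local axis angle (degrees) for each residue.
    ASSUMPTION: candidates are visited from the largest angle downwards.
    """
    kinks = []
    order = sorted(range(len(angles)), key=lambda i: angles[i], reverse=True)
    for i in order:
        if angles[i] <= threshold:                       # rule (a)
            break
        accepted = True
        for k in kinks:
            lo, hi = min(i, k), max(i, k)
            if hi - lo < min_separation:                 # rule (b)
                accepted = False
            elif not any(angles[j] < threshold for j in range(lo + 1, hi)):
                accepted = False                         # rule (c)
        if accepted:
            kinks.append(i)
    return sorted(kinks)

# Two well separated peaks: both are reported.
print(find_kinks([2, 3, 15, 30, 14, 5, 4, 3, 12, 25, 11, 3]))
# One broad bend with no dip below 10 degrees: only its peak is reported.
print(find_kinks([2, 12, 20, 15, 13, 12, 18, 3]))
```

Rule (c) is what stops a single broad bend from being counted as several kinks.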
–36–
CUSUM plot (Quality Control) of θ for locating the kink residues
–37–
Kinked Helix (3DDLA 90) with cylinders (6 residues each with 4 atoms, n = 23)
Kink angle = 32.6°; upper cylinder r = 1.91 Å, RMSD = 0.36 Å; lower cylinder r = 1.86 Å, RMSD = 0.20 Å;
log B = 2.6.
–38–
6. Big Picture: Shape manifold and FASD
• A manifold is a space which can be viewed locally as a Euclidean space. The key
ingredients are the spaces formed by the tangent vectors, as in Procrustes analysis.
• A Riemannian manifold is a connected manifold which has a positive-definite inner
product defined on each tangent space. (The simplest example is of a circle: the
distance between a pair of points is defined to be the length of the shorter of the two
arcs into which the circle is partitioned by the two points.)
• For two-dimensional landmark shapes (k landmarks), the shape space is a
Riemannian manifold, as shown by Kendall. Namely, the shape space is
$$S_2^k / SO(2) = \mathbb{CP}^{k-2}(4),$$
the complex projective space with constant holomorphic sectional curvature 4, where $S_2^k$ is the pre-shape sphere.
• We have assumed a finite number of landmarks; we could instead use a space of closed curves.
–39–
Quotient space for curves: FORM and Helix
• Let $f$ be a differentiable curve in the original space,
$f : [0, 1] \to \mathbb{R}^m$. The normalized tangent vector of $f$ is defined as
$q : [0, 1] \to \mathbb{R}^m$, where
$$q(t) = \frac{\dot f(t)}{\sqrt{\|\dot f(t)\|}},$$
and $\|\cdot\|$ denotes the standard Euclidean norm.
• The parametric equation of a helix can be written as
$$x = \rho \cos t, \quad y = \rho \sin t, \quad z = \phi t / (2\pi).$$
• After taking the derivative, the q function is now invariant under translation of the
original function. In the one dimensional functional case the domain t ∈ [0, 1] often
represents ‘time’ rescaled to unit length, whereas in two and higher dimensional
cases t represents the proportion of arc-length along the curve.
• We need invariance under re-parameterization (warping) and rotation, and such a
quotient space (with a metric) has been constructed recently by Anuj Srivastava.
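The q function and its translation invariance can be illustrated numerically. This sketch assumes a helix sampled over two turns with the parameter rescaled to [0, 1], takes the derivative by finite differences, and uses illustrative values for ρ and φ; the function names are hypothetical, and no warping or rotation alignment is attempted.

```python
import numpy as np

def q_function(f, t):
    """q(t) = f'(t) / sqrt(||f'(t)||), with f' from finite differences."""
    fdot = np.gradient(f, t, axis=0)
    speed = np.linalg.norm(fdot, axis=1)
    return fdot / np.sqrt(speed)[:, None]

def helix(t, rho=2.3, phi=5.4, turns=2.0):
    """Circular helix x = rho cos s, y = rho sin s, z = phi s / (2 pi)."""
    s = 2.0 * np.pi * turns * t          # t in [0, 1] rescaled to the angle s
    return np.column_stack([rho * np.cos(s), rho * np.sin(s),
                            phi * s / (2.0 * np.pi)])

t = np.linspace(0.0, 1.0, 2001)
f = helix(t)
q = q_function(f, t)
q_shifted = q_function(f + np.array([10.0, -3.0, 7.0]), t)

print(np.allclose(q, q_shifted))         # translation does not change q
print(np.sum(q ** 2, axis=1)[1000])      # ||q(t)||^2 equals the speed ||f'(t)||
```

Since the squared norm of q is the speed of the curve, integrating it over [0, 1] gives the curve length, which is why q retains scale (form) information.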
–40–
The Frenet framework and the form space of 3D curves.
• Recall from differential geometry that curvature measures the rate of change of the
angle of the tangents; curvature is zero for straight lines.
• Torsion measures the twisting of a curve; when the torsion is zero, the
curve lies in a plane (2D).
• Curvature and torsion are invariant under rigid transformations and thus useful for studying the
form of curves.
• For circular helices, these two measures are constant.
• Recent work of Peter Kim, using splines, provides
consistent estimators of curves in 3D, which in particular lead to consistent
estimators of curvature and torsion.
• One of the applications is to infer whether torsion is zero or not, for example in spinal
deformity.
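For the circular helix x = ρ cos t, y = ρ sin t, z = ct with c = φ/(2π), the standard formulas give constant curvature ρ/(ρ² + c²) and constant torsion c/(ρ² + c²). The sketch below checks this from the first three derivatives; the function names and the values of ρ and φ are illustrative assumptions.

```python
import numpy as np

def curvature_torsion(d1, d2, d3):
    """Curvature and torsion from the first three derivatives of a 3D curve.

    kappa = |f' x f''| / |f'|^3,  tau = (f' x f'') . f''' / |f' x f''|^2
    """
    cross = np.cross(d1, d2)
    cross_norm2 = cross @ cross
    kappa = np.sqrt(cross_norm2) / np.linalg.norm(d1) ** 3
    tau = (cross @ d3) / cross_norm2
    return kappa, tau

def helix_derivatives(t, rho, phi):
    """Analytic derivatives of x = rho cos t, y = rho sin t, z = phi t / (2 pi)."""
    c = phi / (2.0 * np.pi)
    d1 = np.array([-rho * np.sin(t), rho * np.cos(t), c])
    d2 = np.array([-rho * np.cos(t), -rho * np.sin(t), 0.0])
    d3 = np.array([rho * np.sin(t), -rho * np.cos(t), 0.0])
    return d1, d2, d3

rho, phi = 2.3, 5.4
for t in (0.0, 1.0, 2.5):
    print(curvature_torsion(*helix_derivatives(t, rho, phi)))
```

The printed pair is the same at every t. Setting φ = 0 collapses the helix to a circle and the torsion to zero, the planar case mentioned above.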
–41–
7. Concluding Remarks
• Statistical problems related to curves and surfaces appear in various scientific areas:
biology, medicine, nuclear physics, archaeology, industry, ...
• An explosion of interest in this problem occurred in the 1990s, when it was
realized that fitting simple contours (lines, circles, ellipses) to images was one
of the basic tasks in pattern recognition and computer vision.
• Statistical modelling has still not reached the scientific community.
• “Big data” on manifolds in the medical and life sciences should provide a further
boost to the subject.
–42–