Shape manifold and curves in the context of big data

Kanti Mardia
Senior Research Professor, University of Leeds and University of Oxford

LEEDS ANNUAL STATISTICAL RESEARCH WORKSHOP (LASR): MISSION STATEMENT
“Statistics without science is incomplete, Science without statistics is imperfect.” - Mardia

Joint work with many statisticians, scientists, and industries

CONTENTS
1. Curves and Big Data context: Fetal alcohol spectrum disorder (FASD) and Drug Discovery
2. Shape Analysis for FASD (Background)
3. FASD (a closed curve in 2D)
4. Drug Discovery: Form analysis of Helix (an open curve in 3D)
5. Locating the Kink
6. Big Picture of Shape Manifold
7. Some Concluding Remarks

Topic 1: Fetal alcohol spectrum disorder (FASD) and some “big” data
FASD is an umbrella term that covers a range of abnormalities resulting from exposure of the human embryo to high levels of alcohol. The prevalence of FASD in the USA and some European countries is 20 to 50 per 1,000 school children.

FASD DIAGNOSIS from the callosal mid-curve of the corpus callosum of the brain
The callosal mid-curve (a closed curve in 2D), seen in a cut of the brain up the middle, can be described as a shallow upside-down letter C (TOP RED section) with varying curvatures and thicknesses.

MURDERER AND JAIL
Judge: Death Penalty or Life Sentence?? (a defendant with FASD on trial for murder in the USA)

Topic 2: Big Data in Human Genomics/Proteins

Recap: Genome and big data (high dimensional as well)
1. 23 pairs of chromosomes (not 24!)
2. 20,000-25,000 genes
3. 3 billion letters (A, T, C, G) in the human genome (Human Genome Project); millions of protein sequences
4. More than 100,000 protein structures
5. High dimensional as well: say 200 amino acids on average per protein, times four angles on average = 800 angles
6. More and more complete genomes (horse recently)
7. Challenge: predict protein structures/functions of, say, 1000 sequences at the same time (to go from 1D to 3D).
Membrane Protein: Straight and Kinked Helices (3D) in Drug Discovery

Drug Discovery

2. SHAPE ANALYSIS for FASD (Background)
• Shape (similarity): the description of objects after ignoring changes in location, scale, and rotation (David Kendall, Fred Bookstein). Objects are described (mainly) in terms of landmarks.
• Form (rigid transformations): ignore changes in location and rotation but not scale (Helix).
Two triangles (3 landmarks: points on a nose?) with the same labelled shape (form).

CONTRASTED WITH DIRECTIONAL STATISTICS
Directional statistics deals with angular observations (or directions). Angles, such as wind bearings, are treated as points on the circle of unit radius with centre at the origin. Average direction? Use the centre of gravity...
Shape analysis is much more complicated than directional statistics: how does one get an average face?

Bookstein coordinates in 2 dimensions
The starting point is a configuration of k landmarks in 2D, but to work statistically with shapes we need a coordinate system. The simplest system is Bookstein coordinates:
• translate so that the midpoint of the baseline (landmarks 1 and 2) is at the origin;
• rotate and scale so that landmarks 1 and 2 lie at (-1/2, 0) and (1/2, 0);
• the remaining k - 2 points are the shape coordinates, each in $\mathbb{R}^2$.
But the choice of baseline matters: (i) there are nonlinear effects if the data are not highly concentrated, and (ii) should change to PCA even if the data are highly concentrated.

Bookstein Coordinates for a triangle
The original triangle in (a) is transformed by translation in (b), rotation in (c) and finally rescaling in (d) to give the Bookstein coordinates of the point labelled 3 in plot (d).

Shape Model and Procrustes Matching
Consider two centred configurations $y = (y_1, \ldots, y_k)^T$ and $w = (w_1, \ldots, w_k)^T$, both in $\mathbb{C}^k$, with $y^* 1_k = 0 = w^* 1_k$, where $y^*$ denotes the transpose of the complex conjugate of $y$. We take the shape model to be the complex linear regression
$$ y = (a + ib) 1_k + \beta e^{i\theta} w + \epsilon, $$
where $a + ib$ is the translation, $\beta > 0$ the scale, $0 \le \theta < 2\pi$ the rotation, and $\epsilon$ a $k \times 1$ complex error vector.

Procrustes Matching and Distance
The Procrustes fit (superimposition) of $w$ onto $y$ is
$$ w^P = (\hat{a} + i\hat{b}) 1_k + \hat{\beta} e^{i\hat{\theta}} w, $$
where $(\hat{\beta}, \hat{\theta}, \hat{a}, \hat{b})$ are chosen to minimize
$$ D^2(y, w) = \| y - w \beta e^{i\theta} - (a + ib) 1_k \|^2 . $$
The Procrustes distance (a metric) between $w$ and $y$ is given by
$$ d_F(w, y) = \left\{ 1 - \frac{y^* w \, w^* y}{w^* w \, y^* y} \right\}^{1/2}. \qquad (1) $$

Mean Shape and Tangent Coordinates
The full Procrustes estimate of mean shape $[\hat{\mu}]$ is obtained by minimizing (over $\mu$) the sum of squared Procrustes distances from each $w_i$ to an unknown unit-size mean configuration $\mu$, i.e.
$$ [\hat{\mu}] = \arg\inf_{\mu} \sum_{i=1}^{n} d_F^2(w_i, \mu). $$
The mean is seen to be the eigenvector corresponding to the largest eigenvalue of the complex sum of squares and products matrix
$$ S = \sum_{i=1}^{n} w_i w_i^* / (w_i^* w_i) = \sum_{i=1}^{n} z_i z_i^*, \qquad (2) $$
where the $z_i = w_i / \|w_i\|$, $i = 1, \ldots, n$, are centred. The Procrustes residuals are
$$ r_i = w_i^P - \frac{1}{n} \sum_{j=1}^{n} w_j^P, \quad i = 1, \ldots, n, \qquad (3) $$
which are a form of tangent coordinates, so MVA can now be used on these coordinates.

3. FASD: Case of a Real Trial of Murderer XX (USA) and FASD
• We consider the court proceedings at the penalty phase of this trial.
• Analysis of his MRI brain scan was part of the defence.
• If a jury could be convinced that he had been born with brain damage, the death penalty might NOT be applied.
• Put crudely, whether or not he was to be executed could depend on the shape of his callosal curve.

FASD: Shape characterization of callosal curve
Average normal callosal midcurve (15 normals) with one landmark point (rostrum: closed circle) and 39 sliding landmarks (stars).

MRI scan of the defendant XX
The callosal curve of defendant XX as visualized over a nearby plane of his MRI scan.

Normal (dashed lines) vs FASD XX (solid lines)
The average curve (dashed lines) compared to the polygon (solid lines) of XX. Note the narrowing of XX's callosal curve in the isthmus region.

On callosal curve of FASD XX
Narrowness of the isthmus, with the shortest width shown by a vector of the outline.
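The Procrustes fit and the distance (1) from Section 2 can be illustrated with a minimal sketch, which is not the software used for the FASD analysis. It assumes landmarks are stored as Python complex numbers (x + iy). The function names `centre`, `inner`, `procrustes_fit` and `full_procrustes_distance` are hypothetical, chosen only for illustration. For centred configurations the least-squares solution of the complex regression has zero translation and $\hat{\beta} e^{i\hat{\theta}} = w^* y / (w^* w)$.

```python
import cmath
import math


def centre(z):
    """Subtract the centroid so that the landmarks sum to zero."""
    m = sum(z) / len(z)
    return [p - m for p in z]


def inner(u, v):
    """Hermitian inner product u* v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def procrustes_fit(w, y):
    """Least-squares fit of w onto y under the complex regression model.

    Both configurations are centred first, so the fitted translation is
    zero. Returns (beta, theta, fitted), where fitted is w^P expressed
    relative to the centroid of y.
    """
    w, y = centre(w), centre(y)
    c = inner(w, y) / inner(w, w).real  # c = beta * exp(i * theta)
    beta = abs(c)
    theta = cmath.phase(c) % (2 * math.pi)  # rotation in [0, 2*pi)
    return beta, theta, [c * p for p in w]


def full_procrustes_distance(w, y):
    """Distance (1): sqrt(1 - |w* y|^2 / (w* w  y* y)) on centred data."""
    w, y = centre(w), centre(y)
    ratio = abs(inner(w, y)) ** 2 / (inner(w, w).real * inner(y, y).real)
    return math.sqrt(max(0.0, 1.0 - ratio))  # guard against rounding below 0


# Illustrative triangle (made-up landmarks) and a copy that is rotated by
# 90 degrees, doubled in size and shifted: same shape, so distance ~ 0.
w = [0j, 1 + 0j, 0.5 + 1j]
y = [2j * p + (3 + 4j) for p in w]
beta, theta, fitted = procrustes_fit(w, y)
print(beta, theta, full_procrustes_distance(w, y))
```

In this example the recovered scale is 2 and the recovered rotation is π/2 up to rounding, and the distance between the two triangles is numerically zero because they differ only by a similarity transformation.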
Shape Discrimination on XX
The likelihood ratio contours (equispaced) for the hypothesis of prenatal alcohol damage versus normal, by quadratic discrimination in the tangent space (+ = FASD (45); filled circles = normal (15); Seattle data; and XX).

Statistical Analysis of shape of callosal mid-curve of Murderer XX
• The odds in favour of the one hypothesis (damage of the sort seen in FASD) over the alternative hypothesis (a normal callosal curve) are about 800 to 1.
• In open courtroom testimony, my collaborator, Fred Bookstein, pointed to this strong evidence for congenital brain damage.
• As a consequence of that argument, together with observations by other experts, the American jury found the defendant not to be sufficiently culpable to deserve execution. Instead he is serving a life sentence in an American prison.
• Fred has acted as an expert witness for the defence in about 25 murder cases. In each case the guilt of the defendant was not in question, but evidence of FASD was used as a mitigating circumstance when sentencing.
• In more than half of those cases, the defendant was saved from the death penalty.

4. Membrane Protein: Straight Helices and Kinked Helices

Why do kinks play a key role in membrane protein helices?
• Membrane proteins are attached to the membrane of a cell.
• An estimated 20%-30% of all genes in most genomes (more so in the human proteome) encode membrane proteins.
• Why do we care about kinks in membrane protein helices?
• Around 50% of current drug targets are membrane proteins.
• Kinks in membrane protein (alpha) helices are known to be functionally important.

GOOD HELIX with its atoms (3A7KA: Protein Data Bank) and a fitted cylinder

Kinked Helix with two adjacent cylinders at the kink (1RHZA 365)

5A. Classifying a Helix: The General Equation of Ellipsoid / Cylinder
Recall that an ellipsoid has the equation
$$ (x - \mu)^T \Sigma^{-1} (x - \mu) = 1, \quad x \in \mathbb{R}^3, $$
where $\mu$ is the center and $\Sigma$ is a positive definite matrix.
Let $\Sigma = \Gamma D \Gamma^T$, where $\Gamma = (\gamma_1, \gamma_2, \gamma_3)$ is an orthogonal matrix and $D = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3)$ is the matrix of eigenvalues. The $\lambda$'s are the squared half-lengths of the axes of the ellipsoid, whereas $\gamma_1, \gamma_2, \gamma_3$ are the corresponding three axes. If the $\lambda$'s are arranged in descending order, then for a (right) cylinder $\lambda_1 = $ “∞”, $\lambda_2 = \lambda_3$, and its axis is the first eigenvector $\gamma_1$.

For straight helices $\lambda_2 = \lambda_3$, but $\lambda_2 \neq \lambda_3$ for kinked helices; $\lambda_1$ is very large.

Cylinder Testing and Fitting
Let $x_1, x_2, \ldots, x_n$ be a random sample. Let $l_1, l_2, l_3$ be the eigenvalues of the sample covariance matrix $S$, so for a cylinder we need to test $\lambda_2 = \lambda_3$. The likelihood ratio test with Bartlett's correction leads to
$$ B = 2 (n - 17/6) \log(a/g); \quad a = (l_2 + l_3)/2, \quad g = \sqrt{l_2 \times l_3}, \qquad (4) $$
distributed as $\chi^2$ with 2 degrees of freedom under the null hypothesis. If the null hypothesis is accepted, we can take the sample eigenvector $g_1$, the axis of the cylinder, as the z-axis, and the eigenvectors $g_2, g_3$ as the x-axis and y-axis.
For the real STRAIGHT helix 3A7K, $n = 22$ (Cα coordinates), we find $l_1 = 94.0$, $l_2 = 2.7$, $l_3 = 2.9$, $B = 0.024$ and $P(B > 0.024) \approx 0.99$.

Discriminant based on B for straight helices (1014) and kinked helices (356); B is large for kinked helices. Boundary: log(B) = 0.

5. Location of a kink: PCA fails for small number of atoms (“non-integer” turns)
[Figure: scatter plot of coords[,2] against coords[,1]]

Kinked Helices and Robust Estimation of the Helix Axis
Bioinformaticians have shown that the helix axis is the key to locating the kink.
Let the observed points be $C_i^\alpha$, $a$ = the axis of the helix, $d_i$ = the distance of $C_i^\alpha$ from the axis, and $r_0$ = the vector from the origin to the axis. Our new method involves minimizing, with respect to $r_0$ and $a$,
$$ \Delta^2(r_0, a) = \sum_{i=1}^{n} \left( d_i^2 - \overline{d^2} \right)^2, \quad \overline{d^2} = \frac{1}{n} \sum_{i=1}^{n} d_i^2, $$
under the constraints $|a| = 1$, $a^T r_0 = 0$. If $x_i$ is the vector from the origin to $C_i^\alpha$, then
$$ d_i^2 = x_i^T (I - a a^T) x_i - 2 x_i^T r_0 + r_0^T r_0 . $$
A nonlinear conjugate gradient method for the minimization is fast, and thus the axis $a$ is estimated. The radius of the helix is estimated by $\bar{d}$.

Geometry of a robust method for the axis estimation
[Figure: the observed points $C_i^\alpha$, the axis $a$ of the helix, the distance $d_i$ of $C_i^\alpha$ from the axis, and the vector $r_0$ from the origin O to the axis.]

Axes fitted to the backbone atoms of a sliding window of 6 residues. The axes are used to calculate a local angle θ for each residue.

Oxford algorithm to locate axial changes: “KINK FINDER” software
• Use a sliding window of 6 residues, using all four heavy atoms of the backbone (N, Cα, C, O), so there are 24 atoms from which to calculate the axis.
• Estimate the axis using our modified least squares method.
• A kink is identified where a residue:
  a. has an angle greater than 10°,
  b. is at least four residues from any already identified kinks,
  c. has a residue with an angle less than 10° between it and any already identified kinks.

CUSUM plot (Quality Control) of θ for locating the kink residues

Kinked Helix (3DDLA 90) with cylinders (6 residues each with 4 atoms, n = 23)
Kink angle = 32.6°; upper cylinder r = 1.91 Å, RMSD = 0.36 Å; lower cylinder r = 1.86 Å, RMSD = 0.20 Å; log B = 2.6.

6. Big Picture: Shape manifold and FASD
• A manifold is a space which can be viewed locally as a Euclidean space. The key ingredients are the spaces formed by the tangent vectors, as in Procrustes analysis.
• A Riemannian manifold is a connected manifold which has a positive-definite inner product defined on each tangent space. (The simplest example is a circle: the distance between a pair of points is defined to be the length of the shorter of the two arcs into which the circle is partitioned by the two points.)
• For two-dimensional landmark shapes (k landmarks), the shape space is a Riemannian manifold, as shown by Kendall. Namely, the shape space is $S^{2k-3}/SO(2) = \mathbb{CP}^{k-2}(4)$, the complex projective space with sectional curvature 4.
• We have assumed a finite number of landmarks, but we could instead use a space of closed curves.

Quotient space for curves: FORM and Helix
• Let f be a real-valued differentiable curve in the original space, $f(t) : [0, 1] \to \mathbb{R}^m$. The normalized tangent vector of f is defined as $q : [0, 1] \to \mathbb{R}^m$, where
$$ q(t) = \frac{\dot{f}(t)}{\sqrt{\|\dot{f}(t)\|}}, $$
and $\|\cdot\|$ denotes the standard Euclidean norm.
• The parametric equation of a helix can be written as $x = \rho \cos t$, $y = \rho \sin t$, $z = \phi t / (2\pi)$.
• After taking the derivative, the q function is invariant under translation of the original function. In the one-dimensional functional case the domain $t \in [0, 1]$ often represents ‘time’ rescaled to unit length, whereas in two- and higher-dimensional cases t represents the proportion of arc length along the curve.
• We need invariance under re-parameterization (warping) and rotation, and such a quotient space (with a metric) has been constructed recently by Anuj Srivastava.

The Frenet framework and the form space of 3D curves
• Recall from differential geometry that curvature measures the rate of change of the angle of the tangent, and curvature is zero for straight lines.
• Torsion measures the twisting of a curve; when the torsion is zero, the curve lies in a plane (2D).
• Curvature and torsion are invariant under rigid transformations and thus useful for studying the form of curves.
• For circular helices, these two measures are constant.
• Recently, using splines, Peter Kim has provided consistent estimators of curves in 3D, which in particular lead to consistent estimators of curvature and torsion.
• One of the applications is to infer whether torsion is zero or not, for example in spinal deformity.

7. Concluding Remarks
• Statistical problems related to curves and surfaces appear in various scientific areas: Biology, Medicine, Nuclear Physics, Archaeology, Industry, ...
• An explosion of interest in this problem occurred in the 1990s, when it was realized that fitting simple contours (lines, circles, ellipses) to images was one of the basic tasks in pattern recognition and computer vision.
• Statistical modelling has still not reached the scientific community.
• “Big data” on manifolds in the medical and life sciences should provide a further boost to the subject.
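
As a supplement to the cylinder test of Section 5A, the statistic B of equation (4) can be computed with a few lines of Python. This is only a sketch under stated assumptions: the function name `cylinder_test` is hypothetical, the eigenvalues are taken as given rather than computed from coordinates, and the tail probability uses the fact that a $\chi^2$ variable with 2 degrees of freedom has survival function $\exp(-x/2)$.

```python
import math


def cylinder_test(l2, l3, n):
    """Bartlett-corrected likelihood ratio statistic for lambda2 = lambda3.

    l2, l3: the two smaller eigenvalues of the sample covariance matrix.
    n: number of points (for example, C-alpha atoms).
    Returns (B, p_value), with B referred to chi-square on 2 df.
    """
    a = (l2 + l3) / 2.0           # arithmetic mean of the eigenvalues
    g = math.sqrt(l2 * l3)        # geometric mean of the eigenvalues
    b_stat = 2.0 * (n - 17.0 / 6.0) * math.log(a / g)
    p_value = math.exp(-b_stat / 2.0)  # chi-square(2) upper tail
    return b_stat, p_value


# Eigenvalues quoted in the slides for the straight helix 3A7K (n = 22).
b_stat, p_value = cylinder_test(2.7, 2.9, 22)
print(round(b_stat, 3), p_value)  # B rounds to 0.024
```

A small B (large p-value) is consistent with a single cylinder, that is, a straight helix; the slides use log(B) = 0 as the boundary between straight and kinked helices.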