Protein Engineering vol.11 no.10 pp.1015–1025, 1998
Geometric invariant core for the VL and VH domains of
immunoglobulin molecules
Israel Gelfand1, Alexander Kister1, Casimir Kulikowski2
and Ognyan Stoyanov1,2,3
1Department of Mathematics and 2Department of Computer Science,
Rutgers University, New Brunswick, NJ 08903, USA
3To whom correspondence should be addressed
A new algorithmic method for identifying a geometric
invariant of protein structures, termed the geometrical core,
is developed. The method uses the matrix of Cα–Cα distances
and does not require the usual superposition of
structures. Applying the algorithm to 53
immunoglobulin structures led to the identification of two
geometrical core sets of Cα atom positions, one for the VL
and one for the VH domain. Based on these geometric invariants, a
preferred coordinate system for the immunoglobulin family
is constructed, which serves as a basis for structural prediction.
The X-ray atom coordinates for all available immunoglobulin
structures are transformed to the preferred
coordinate system. An affine symmetry between the VL and
VH domains is defined and computed for each of the 53
immunoglobulin structures.
Keywords: atom coordinate prediction/immunoglobulin geometry/preferred coordinate system/protein core
Introduction
It is well understood that in order to find any reasonable
classification of proteins into families one needs to find
invariant characteristics that are shared by all members of the
family. There are a number of such characteristics, depending
on functional and structural properties. For structures in a
protein family the main invariant is the geometrical core.
There are several definitions of a core and of ‘core atoms’. In
this paper we use a core structure concept similar to that
developed by Chothia and Lesk (1986, 1987) and Altman and
Gerstein (1994, 1995). Intuitively a core is a set of residues
whose Cα atoms occupy the same relative positions in space.
A variety of methods and core-finding algorithms have been
developed. Most of them are based on finding the optimal
superposition of structures in order to determine the structural
variations of their residues (Kearsley, 1990; Diamond, 1992;
Shapiro et al., 1992; Yee and Dill, 1993) or analyzing contact
matrices between side chain atoms or Cα–Cα atoms (Taylor
and Orengo, 1989; Selbig, 1995; Swindells, 1995; Thomas
et al., 1996; Godzik et al., 1993).
An analytical way to define a geometrical core was recently
suggested (Gelfand et al., 1996), where an invariant system
of coordinates was introduced and a core defined as the subset
of positions with approximately the same coordinates relative
to this coordinate system. The main idea is to introduce a
system of coordinates that is based on the geometrical properties common to all molecules of a protein family.
This approach was applied to the molecules of the immunoglobulin (Ig) family. One of the common properties of this
© Oxford University Press
family is the pseudo two-fold symmetry which is observed in
all known Ig structures (Padlan, 1994). For the invariant
system of coordinates specific for the Ig molecules one of the
axes was chosen to be this axis of symmetry. The other axis
was chosen to be the line connecting the centers of mass for
the set of Cα atoms of the variable light (VL) and variable
heavy (VH) domains. The invariant system of coordinates was
constructed using 22 different immunoglobulin structures. The
calculations showed that the coordinates of about 65% of the
Cα atoms in the variable domains are approximately the same
in all structures when transformed to the
invariant system of coordinates.
The advantage of this approach is the ability to compare
protein structures by defining a common geometrical core
without the procedure of superposition. However, this definition
of a geometrical core is still dependent on the way the invariant
system of coordinates was constructed (Gelfand et al., 1996).
The procedure used averaging over the directions of the pseudo
two-fold axes of symmetry and can only be considered as a
first approximation to the construction of an invariant system
of coordinates.
We introduce here a new core-finding method which is
independent of any system of coordinates and does not require
the procedure of superposition either. We define a geometrical
core as the set of residues such that the distances between
pairs Cα(I)–Cα(J) of Cα atoms at the Ith and Jth positions are
equal in all members of a protein family. The positions I and
J are secondary structure positions—positions in strands, loops,
or parts of them (Gelfand and Kister, 1995). The method
includes an algorithm which was used to identify a geometrical
core in the VL and VH domains of Ig molecules. The results
of constructing the core are based on the calculation of Cα–
Cα distances in 53 (mouse and human) Ig structures with VL
κ domains taken from the Protein Data Bank (PDB). (The
structures are identified by PDB names: 1acy, 1baf, 1bbd,
1bql, 1cbv, 1cgs, 1dbj, 1dfb, 1eap, 1fai, 1fdl, 1fgv, 1flr, 1for,
1fpt, 1frg, 1fvc, 1gaf, 1ggb, 1hil, 1ibg, 1ifh, 1igc, 1igf, 1igm,
1ikf, 1jel, 1jhl, 1kno, 1mam, 1mcp, 1mlc, 1mrd, 1nbv, 1nmb,
1opg, 1plg, 1tet, 1vfa, 1vge, 2cgr, 2dbl, 2fl9, 2fbj, 2fgw, 2iff,
2igf, 3hfl, 3hfm, 6fab, 1igj, 1ncc and 4fab.) The idea for the
algorithm is described concisely as follows. We align the
N (= 53) Ig sequences and assign to each residue a secondary
structure position. For every pair (I, J) of secondary structure
positions common to all structures (that is, positions that are
present in all structures after the alignment) we compute the
distances Di(I, J), i = 1, ..., N (= 53), between the Cα atoms
at these positions. Thus, for each pair (I, J) we have a
distribution of distances across the family of structures. Let
DISP(I, J) be the dispersion of this distribution. Our aim is to
find a subset of positions that we term the geometrical core. Let
us call a pair (I, J) ‘perfect’ if the distance D(I, J) is the
same in all N (= 53) structures. Since ‘perfect’ pairs do not exist,
we measure the quality of the pair (I, J) by the dispersion
DISP(I, J) of D(I, J) across the family. It turns out that there
exist ‘almost perfect pairs’, that is, pairs with very small
dispersion (≤0.3 Å, for example). A geometrical core is a
subset of positions such that all pairs drawn from it are ‘almost perfect’.
The identification of the geometrical core has several immediate
applications. It is used for structure comparison. Using the
geometrical core, a preferred coordinate system for the
immunoglobulin family is constructed.
Table I. Sample of Ig sequences and the alignment of their VL κ domains
Materials and methods
The method has two major components: (i) secondary structure
alignment; (ii) comparison of distances between Cα atoms at
pairs of identical secondary structure positions across the
Ig family.
Secondary structure alignment
A prerequisite condition for secondary structure alignment is
a division of protein sequences into secondary structure units
(strands and loops). A procedure for segmentation of Ig
sequences has been developed (Gelfand and Kister, 1995). The
sequences are divided into 21 fragments, which we termed
‘words’. In most cases the amino acid fragment-word coincides
with a strand or loop but there are two exceptions. The first
exception is for the first three residues of a sequence. They
are picked out from the A strand and constitute an OA word.
The second difference is found in the connection between the
B and C strands. This loop has a unique ‘two span bridge’
conformation with one residue deeply inserted into the structure
(Tramontano et al., 1990). We describe the loop by two words,
BC and CB.
In total, the sequences of the variable domains are divided
into 21 words: OA, A, AA′, A′, A′B, B, BC, CB, C, CC′,
C′, C′C″, C″, C″D, D, DE, E, EF, F, FG and G. Here
the accepted notation for the secondary structural units for
immunoglobulin structures is used. The segmentation of
sequences into words allows the possibility of assigning to
every residue a position in a word.
All Ig sequences under study have been aligned according
to this secondary structure assignment. In Tables I and II are
given, respectively, the aligned VL and VH domains of several
of the sequences from our pool. This secondary structure
definition of residues allows us to compare residues at identical
positions of words that are common to all structures, that is,
positions that are present in all structures under the alignment.
Comparison of distances between Cα atoms
Geometric principle underlying the core-finding algorithm. As
discussed already, the aim is to identify the subset of Cα atoms
of residues at certain secondary structure positions (designated
‘core positions’) such that the distance between every pair of
them is invariant (the same in each structure) across a
family of protein structures. According to this definition, the
invariant subset consisting of geometrical core positions can
be obtained from the following simple geometrical fact. If two
polytopes in three-dimensional space with the same number of
vertices have edges of equal length between corresponding
vertices, then there is a Euclidean motion (a translation and
rotation in space, i.e. a motion that is determined by six
parameters) which maps (superimposes) one polytope onto the
other. For example, any two triangles in the plane with equal
sides can be superimposed by a Euclidean motion consisting
of a translation and rotation in the plane (determined by three
parameters).
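The easy direction of this fact, namely that a Euclidean motion leaves every pairwise distance unchanged, can be checked numerically. The following is a minimal sketch in Python; the four points and the helper names are hypothetical and are not taken from any PDB entry.

```python
import math

def pairwise_distances(points):
    """Distances between all unordered pairs of 3D points."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

def rotate_z_then_translate(point, angle, shift):
    """Rotate a point about the z axis by `angle` (radians), then translate it."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])

# Hypothetical vertex coordinates (in Å) of a small "polytope".
original = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.1, 3.5, 0.0), (2.0, 4.0, 1.5)]
moved = [rotate_z_then_translate(p, 0.7, (10.0, -4.0, 2.5)) for p in original]

# Largest change in any edge length caused by the motion (rounding error only).
max_change = max(abs(a - b) for a, b in zip(pairwise_distances(original),
                                            pairwise_distances(moved)))
```

Because the edge lengths do not depend on the frame in which the coordinates are expressed, they can be compared across structures without first superimposing them.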
Based on this, we can reason as follows. Given an immunoglobulin
molecule, we can compute all distances between Cα
atoms in the VL and VH domains (using the X-ray crystallographic
atom coordinates available from PDB). The Cα atoms
play the role of vertices of our VL and VH polytopes. If
one does this for all immunoglobulins for which structural
information is available (we selected from PDB 62 molecules
with the highest resolution and non-identical sequences, and
from them only the 53 with VL κ domains), the following
natural question arises: does there exist a subset of Cα positions
in the VL (or VH) domain such that the polytopes in the VL (or
VH) domain with vertices formed by these positions are
identical (can be superimposed) across the whole family? If
such an invariant subset of positions exists, then it is natural
to view it as a candidate for the invariant geometrical core of
positions for the immunoglobulin family. The results of this
research show (within the suitably defined constraints given
below) the existence of such invariant cores for the VL and VH
domains on their own, and a way to identify them.
We hasten to remark that, in reality, we should not expect
to find that polytopes with vertices at the Cα atoms of proteins
coincide completely when compared. Therefore, we need to
modify our definition of a core by allowing for some variability
when comparing different polytopes. This variability is
accounted for by a parameter T (with the dimension of length),
which we term the variability threshold. Thus, a subset of
positions CORE_T is termed a core at a variability threshold T
if the variation in the length of edges between any pair of
vertices of the polytopes compared across the family is less
than T. In other words, one allows a variation in the length of
edges of the core polytopes of each immunoglobulin molecule
such that it is less than a certain threshold T. This variation is
controlled by the dispersion of the distribution of the lengths
of edges between a pair Cα(I)–Cα(J) of Cα atoms across the
family of structures under consideration.
Table II. Sample of Ig sequences and the alignment of their VH domains
The above definition of geometrical core has the merit of
being independent of the (orthogonal) coordinate system used
in the X-ray experiment for measuring the coordinates of the
Cα atoms, since it is based only on comparing Euclidean
distances.
We started by aligning the Ig structures according to the
secondary structure assignments described above (Gelfand and
Kister, 1995). Further, we selected 53 Ig structures with VL κ
and VH domains with the highest resolution and nonidentical
sequences. All VL κ domains were aligned and the set of
positions that were common to all of them was selected. [The
set of (101) common VL κ positions includes: OA2-OA3, A1-A3, AA′1-AA′4, A′1-A′4, A′B1-A′B3, B1-B8, BC1-BC3,
CB1-CB3, C1-C6, CC′1-CC′6, C′1-C′5, C′C″1-C′C″3, C″D1-C″D5, D1-D7, DE1-DE2, E1-E7, EF1-EF7, F1-F7, FG1-FG4
and G1-G10.] This was the starting set of positions of the VL κ
domain from which a geometrical core set was later obtained.
Similarly, all VH domains were aligned and a set of positions
common to all of them was taken as a starting point for
obtaining a set of geometrical core positions for the VH domain.
[The set of (109) common VH positions includes: OA2-OA3,
A1-A3, AA′1-AA′4, A′1-A′3, A′B1-A′B3, B1-B8, BC1-BC4,
CB1-CB5, C1-C6, CC′1-CC′6, C′1-C′6, C′C″1-C′C″3, C″1-C″6, C″D1-C″D5, D1-D7, DE1-DE4, E1-E7, EF1-EF7, F1-F7,
FG1-FG4 and G1-G10.]
The use of a multiple alignment of all
53 immunoglobulin molecules ensures that we look at
polytopes whose vertices belong to the same set of positions.
Moreover, with this choice we implicitly use the information
about structural similarities on the basis of which the
segmentation scheme was adopted. This means that we compare
the light and heavy domain polytopes of each of the 53
immunoglobulins with vertices being the chosen common
positions. By comparison of polytopes we mean
comparison of the lengths of their corresponding edges, i.e. edges
connecting the same pairs of positions (e.g. A1-F3, C2-FG1, etc.).
Core-finding algorithm. We now describe the algorithm for
selecting the set of core positions. We explain the procedure
for the VL domain; the algorithm is the same for the
heavy domains. The main steps are: (i) from the set, termed
CANDIDATES below, of all positions in the VL domain that are
common to all 53 structures, select a subset called FOLDREP.
(We also add the requirement that these positions
belong to different parts of the fold. This requirement is not
necessary and the final result does not depend on it. It was
introduced in order to reduce the space of the initial search
for the most conserved positions.) This subset must be self
consistent. That is, for each two positions of FOLDREP the
dispersion of the distribution of distances between them across
all 53 structures is less than a fixed threshold T1 = 0.2 Å.
(ii) Ensure that the subset FOLDREP is self consistent.
(iii) Enlarge the subset FOLDREP to a subset PRE-CENTRAL
by adding positions such that the dispersion of their
distances to the members of FOLDREP is less than a new
threshold T2 = 0.25 Å. (iv) Ensure that the subset PRE-CENTRAL
is self consistent with the new threshold T2 =
0.25 Å. This may lead to the exclusion of some of the positions
from PRE-CENTRAL that were added initially to FOLDREP.
After this exclusion, the resulting self consistent subset is
called CENTRAL. (v) Enlarge CENTRAL to a subset PRE-CORE
by adding positions such that the dispersion
of their distances to the members of CENTRAL is less than a
third threshold T3 = 0.35 Å. (vi) Ensure that PRE-CORE is
self consistent. This again may lead to the exclusion of positions
from PRE-CORE that were added initially to CENTRAL. The
resulting self consistent subset of CANDIDATES is termed the
(geometrical) CORE of the domain. The choice of the thresholds
is dictated by the density distribution of positions in
CANDIDATES with respect to the average dispersion relative
to the members of FOLDREP (Figures 3 and 4). We choose
a ‘window’ (T2, T3) around the point where this distribution
has its maximum. (T1, T2 and T3 are generally different for
the VL and VH domains. For the VH domain the choices were
T1 = 0.25 Å, T2 = 0.30 Å and T3 = 0.40 Å.) The choice of
the two bounds T2 and T3 filters the set
CANDIDATES of secondary structure positions into three
levels of conservation, the first level of which is determined
by FOLDREP. A detailed description of the algorithm follows.
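Steps (iii) to (vi) can be sketched in Python as follows. This is one possible reading of the procedure, not the authors' implementation: it assumes that a position is accepted when its average dispersion to a reference set does not exceed the threshold, that self consistency is enforced by a single pruning pass, and that the zero self-dispersion is included in the average. The function names, the position labels P to U and all dispersion values are hypothetical.

```python
def symmetric_matrix(labels, pairs):
    """Build a symmetric dispersion table with zeros on the diagonal."""
    disp = {p: {q: 0.0 for q in labels} for p in labels}
    for (p, q), value in pairs.items():
        disp[p][q] = disp[q][p] = value
    return disp

def average_dispersion(disp, position, reference):
    """Average of disp[position][r] over the reference positions r."""
    return sum(disp[position][r] for r in reference) / len(reference)

def enlarge(disp, candidates, reference, threshold):
    """Candidates whose average dispersion to the reference set is within threshold."""
    return [p for p in candidates
            if average_dispersion(disp, p, reference) <= threshold]

def make_self_consistent(disp, subset, threshold):
    """Single pass: drop positions whose average dispersion within the subset is too large."""
    return [p for p in subset
            if average_dispersion(disp, p, subset) <= threshold]

def find_core(disp, candidates, foldrep, t2, t3):
    """Return (CENTRAL, CORE) for a given FOLDREP and thresholds t2 < t3."""
    pre_central = enlarge(disp, candidates, foldrep, t2)
    central = make_self_consistent(disp, pre_central, t2)
    pre_core = enlarge(disp, candidates, central, t3)
    return central, make_self_consistent(disp, pre_core, t3)

# Hypothetical dispersions (Å) between five made-up positions.
labels = ["P", "Q", "R", "S", "U"]
disp = symmetric_matrix(labels, {
    ("P", "Q"): 0.10, ("P", "R"): 0.20, ("P", "S"): 0.40, ("P", "U"): 0.70,
    ("Q", "R"): 0.28, ("Q", "S"): 0.30, ("Q", "U"): 0.80,
    ("R", "S"): 0.30, ("R", "U"): 0.60, ("S", "U"): 0.90,
})
central, core = find_core(disp, labels, foldrep=["P", "Q"], t2=0.25, t3=0.35)
```

In this toy example position S is too variable relative to FOLDREP to enter CENTRAL but is admitted to the core at the looser threshold, while U is rejected at every level.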
Fig. 1. Density distribution of number of pairs with respect to their
dispersion for the VL κ chains.
Fig. 2. Density distribution of number of pairs with respect to their
dispersion for the VH chains.
We take the set of 101 positions, denoted CANDIDATES, from the VL domain common to all structures under consideration. As mentioned above, we did computations using 53 Igs with VL κ chains only. For each pair (I, J) (e.g. A1-G3) of positions from this set we compute the distance Di(I, J) between these two positions for each of the i = 1, 2, ..., N (= 53) Igs. Then we take the average

$$\mathrm{AVER}(I, J) = \frac{1}{N} \sum_{i=1}^{N} D_i(I, J)$$

over these N = 53 Igs and compute the dispersion

$$\mathrm{DISP}(I, J) = \left\{ \frac{1}{N} \sum_{i=1}^{N} \left[ D_i(I, J) - \mathrm{AVER}(I, J) \right]^2 \right\}^{1/2}.$$

Thus, with each pair (I, J) of positions from our CANDIDATES list we associate a number DISP(I, J). The set of all these numbers forms a symmetric matrix DISP with zeros along the main diagonal. Figures 1 and 2 show the density distribution of number of pairs versus their dispersion for the VL and VH domains respectively.
To determine a stable set of core positions for the VL domain we need only know the matrix DISP, and the following natural characteristic one can ascribe to each of the positions of the starting CANDIDATES set. Namely, for each position I we define

$$\mathrm{AVERDISP}(I) := \frac{1}{M} \sum_{J=1}^{M} \mathrm{DISP}(J, I),$$

where M is the number of positions in CANDIDATES (the size of the matrix DISP). AVERDISP(I) gives a rough characterization of the overall ‘poor fitness’ of the position I. The set of geometric core positions is determined in the following algorithmic way.
• Using the matrix DISP, we chose the set FOLDREP of positions, each representing a distinct part of the protein fold, having low AVERDISP and being self-consistent in the following sense. For any element I of FOLDREP we must have

$$\frac{1}{L} \sum_{J \in \mathrm{FOLDREP}} \mathrm{DISP}(J, I) \le 0.20\ \text{\AA}.$$

Here L is the number of elements in FOLDREP. One subset of CANDIDATES satisfying these conditions turned out to be FOLDREP = {B6, C3, C′3, E4, F5 and G2}. The submatrix of dispersions for these ‘super conservative’ positions is

        B6     C3     C′3    E4     F5     G2
B6      0.00   0.19   0.23   0.19   0.16   0.25
C3      0.19   0.00   0.18   0.23   0.14   0.19
C′3     0.23   0.18   0.00   0.24   0.17   0.23
E4      0.19   0.23   0.24   0.00   0.21   0.26
F5      0.16   0.14   0.17   0.21   0.00   0.18
G2      0.25   0.19   0.23   0.26   0.18   0.00
Av      0.17   0.15   0.17   0.19   0.14   0.19

In the last row is shown the average of the dispersions over the members of FOLDREP which are all ≤ T1 = 0.20 Å.
• Next we take the 6×101 submatrix DISP_TO_FOLDREP of DISP

        OA2    ...    A3     ...    B7     ...    F3     ...    G9     G10
B6      0.26   ...    0.19   ...    0.05   ...    0.25   ...    0.34   0.43
C3      0.31   ...    0.24   ...    0.24   ...    0.16   ...    0.27   0.55
C′3     0.33   ...    0.26   ...    0.26   ...    0.27   ...    0.38   0.74
E4      0.29   ...    0.26   ...    0.21   ...    0.21   ...    0.31   0.61
F5      0.29   ...    0.22   ...    0.27   ...    0.13   ...    0.22   0.40
G2      0.27   ...    0.34   ...    0.24   ...    0.17   ...    0.23   0.31
Av      0.29   ...    0.25   ...    0.21   ...    0.20   ...    0.29   0.51

The last row presents the average dispersion relative to the six members of FOLDREP. The distribution of the 101 positions of CANDIDATES with respect to their average dispersions relative to the six members of FOLDREP is shown in Figure 3.
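The quantities AVER, DISP and AVERDISP defined above can be computed directly from Cα coordinates. The sketch below is illustrative only: the input layout (one dictionary per structure, mapping a secondary structure position label to its Cα coordinates), the function names and the three toy structures are assumptions, and the 1/N (population) form of the dispersion is used, as in the definition of DISP.

```python
import math

def dispersion(distances):
    """Square root of the mean squared deviation from the mean (1/N form)."""
    n = len(distances)
    mean = sum(distances) / n
    return math.sqrt(sum((d - mean) ** 2 for d in distances) / n)

def dispersion_matrix(structures):
    """DISP over the positions present in every structure.

    `structures` is a list of dicts: position label -> (x, y, z) of the C-alpha atom.
    Each structure may be given in its own coordinate frame.
    """
    common = sorted(set.intersection(*(set(s) for s in structures)))
    disp = {i: {} for i in common}
    for i in common:
        for j in common:
            disp[i][j] = dispersion([math.dist(s[i], s[j]) for s in structures])
    return disp

def averdisp(disp, position):
    """AVERDISP(I): average of DISP(J, I) over all M candidate positions J."""
    return sum(disp[j][position] for j in disp) / len(disp)

# Three hypothetical structures, each in a different coordinate frame.
structures = [
    {"A1": (0.0, 0.0, 0.0), "B6": (3.0, 4.0, 0.0), "G3": (0.0, 0.0, 10.0)},
    {"A1": (1.0, 1.0, 1.0), "B6": (4.0, 5.0, 1.0), "G3": (1.0, 1.0, 12.0)},
    {"A1": (0.0, 0.0, 0.0), "B6": (0.0, 5.0, 0.0), "G3": (0.0, 0.0, 9.0)},
]
disp = dispersion_matrix(structures)
```

Here the A1-B6 distance is 5 Å in all three toy structures, so its dispersion is zero (a ‘perfect’ pair), whereas the A1-G3 distance takes the values 10, 11 and 9 Å and has a dispersion of about 0.82 Å.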
Fig. 3. Density distribution of positions in CANDIDATES with respect to the average dispersion relative to the members of FOLDREP for the VL κ domain.
• The second level of conservation is determined as follows. Based on the data contained in the submatrix DISP_TO_FOLDREP we select a subset PRE-CENTRAL of positions of CANDIDATES satisfying the criterion: every position I in PRE-CENTRAL has Av ≤ T2 = 0.25 Å, where Av is the average dispersion of the distances between that position and the members of FOLDREP. There are 36 positions that satisfied this criterion: A3, B3, B5-B8, C1-C6, C′1-C′5, E1-E5, EF3, F2-F7 and G1-G7.
• The set PRE-CENTRAL is checked for self consistency. Namely, we take the 36×36 submatrix DISP_PRE-CENTRAL of DISP, the 36 rows and 36 columns of which correspond to the elements of PRE-CENTRAL

        A3     B3     B5     ...    G6     G7
A3      0.00   0.31   0.23   ...    0.33   0.33
B3      0.31   0.00   0.20   ...    0.21   0.29
B5      0.23   0.20   0.00   ...    0.26   0.32
:       :      :      :      ...    :      :
G6      0.33   0.21   0.26   ...    0.00   0.03
G7      0.33   0.29   0.32   ...    0.03   0.00
Av      0.25   0.26   0.24   ...    0.20   0.23

Then all positions in PRE-CENTRAL with Av > T2 = 0.25 Å are excluded from PRE-CENTRAL, yielding a new self consistent set CENTRAL. The excluded positions are B3, B8, EF3 and G3. Thus, after the requirement for self consistency has been met, the set CENTRAL consists of the following 32 positions: A3, B5-B7, C1-C6, C′1-C′5, E1-E5, F2-F7, G1-G2 and G4-G7.
• To identify the third level of conservation (which we termed geometrical core) we take the 32×101 submatrix of DISP, the rows of which are the ones labeled by the positions of the set CENTRAL and the columns by all positions in CANDIDATES:

        OA2    OA3    A1     ...    G9     G10
A3      0.26   0.26   0.25   ...    0.30   0.33
B5      0.26   0.28   0.29   ...    0.35   0.45
B6      0.26   0.29   0.30   ...    0.35   0.43
:       :      :      :      ...    :      :
G6      0.32   0.32   0.30   ...    0.18   0.22
G7      0.32   0.30   0.39   ...    0.18   0.19
Av      0.31   0.31   0.28   ...    0.29   0.46

• A new set PRE-CORE of positions is selected, the elements of which satisfy the requirement Av ≤ T3 = 0.35 Å. It consists of 74 positions.
• Finally, the set PRE-CORE is checked for self-consistency by taking the submatrix of DISP with rows and columns labeled by the members of the set PRE-CORE:

        OA2    OA3    A1     ...    G8     G9
OA2     0.00   0.06   0.24   ...    0.32   0.35
OA3     0.06   0.00   0.05   ...    0.31   0.34
A1      0.24   0.05   0.00   ...    0.29   0.30
:       :      :      :      ...    :      :
G8      0.32   0.31   0.29   ...    0.00   0.04
G9      0.35   0.34   0.30   ...    0.04   0.00
Av      0.33   0.31   0.28   ...    0.29   0.34

and then excluding from PRE-CORE all positions with Av > T3 = 0.35 Å. In this case it turned out that all positions in PRE-CORE satisfy Av < T3 = 0.35 Å and none was excluded. Thus, this yields a set of 74 positions: OA2-OA3, A1-A3, AA′2-AA′4, A′B1-A′B3, B1-B8, BC1-BC2, C1-C6, CC′1, C′1-C′5, C′C″1-C′C″2, C″D4-C″D5, D1-D6, DE2, E1-E7, EF2-EF7, F1-F7 and G1-G9, which we term geometrical core, CORE_TL, for the VL domain of the immunoglobulin family at variability threshold TL = 0.35 Å.
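The selection and exclusion steps for the VL domain can be replayed on the few Av values that are printed in the matrix excerpts. The sketch below covers only those published columns (the full matrices are not reproduced in the text), and the variable and function names are hypothetical; the thresholds are the VL values 0.25 Å and 0.35 Å quoted in the algorithm summary.

```python
T2, T3 = 0.25, 0.35  # VL thresholds in Å, as quoted in the algorithm summary

# Av rows (Å) copied from the published excerpts; all other columns are elided there.
av_to_foldrep = {"OA2": 0.29, "A3": 0.25, "B7": 0.21, "F3": 0.20, "G9": 0.29, "G10": 0.51}
av_within_pre_central = {"A3": 0.25, "B3": 0.26, "B5": 0.24, "G6": 0.20, "G7": 0.23}
av_to_central = {"OA2": 0.31, "OA3": 0.31, "A1": 0.28, "G9": 0.29, "G10": 0.46}

def within(av, threshold):
    """Positions whose average dispersion does not exceed the threshold."""
    return [p for p, value in av.items() if value <= threshold]

# Positions (among those shown) admitted to PRE-CENTRAL.
pre_central_shown = within(av_to_foldrep, T2)
# Positions (among those shown) removed by the self-consistency check.
excluded_shown = [p for p, value in av_within_pre_central.items() if value > T2]
# Positions (among those shown) admitted to PRE-CORE.
pre_core_shown = within(av_to_central, T3)
```

This agrees with the text for the columns shown: A3, B7 and F3 enter PRE-CENTRAL, B3 is removed by the self-consistency check, and G10 is the only displayed position left out of PRE-CORE.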
For the VH domain we start with the initial set FOLDREP = {B6, C3, D3, F5, E4, G2}, to which corresponds a dispersion submatrix:

        B6     C3     D3     E4     F5     G2
B6      0.00   0.32   0.29   0.23   0.18   0.32
C3      0.32   0.00   0.30   0.23   0.21   0.21
D3      0.29   0.30   0.00   0.22   0.35   0.32
E4      0.23   0.23   0.22   0.00   0.30   0.29
F5      0.18   0.21   0.35   0.30   0.00   0.16
G2      0.32   0.21   0.32   0.29   0.16   0.00
Av      0.22   0.21   0.25   0.21   0.20   0.22
All of these positions have Av ≤ T1 = 0.25 Å. Further,
as was done for the VL domain, we selected a
subset PRE-CENTRAL of positions having average dispersion
(averaged over the positions in FOLDREP)
Av ≤ T2 = 0.30 Å. After the self consistency requirement was
met this yielded the set CENTRAL consisting of the following
28 positions: B1, B3-B7, C3-C4, C6, D3-D4, DE4, E1-E6,
F1-F6, G2, G5, G7 and G9. Similarly, a subset PRE-CORE is
further selected consisting of positions whose average
dispersion satisfies Av ≤ T3 = 0.40 Å (the average being over the
members of CENTRAL). Finally, the refinement of PRE-CORE
yields the geometrical core CORE_TH for the VH domain at
variability threshold TH = 0.40 Å. This core consists of the
following 65 positions: A′1-A′3, A′B1-A′B3, B1-B8, C1-C6,
CC′1, CC′6, C′1-C′3, C′5-C′6, D3-D7, DE2-DE4, E1-E7,
EF1-EF3, EF5-EF7, F1-F7 and G1-G10.
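As a consistency check on the VH FOLDREP submatrix, the Av row can be recomputed from the six published dispersions of each column. The snippet below is a sketch that assumes Av is the plain column mean over the six members (including the zero on the diagonal); this assumption is not stated explicitly in the text, but it reproduces the printed values to within rounding.

```python
labels = ["B6", "C3", "D3", "E4", "F5", "G2"]

# Dispersions (Å) copied from the published VH FOLDREP submatrix.
disp = [
    [0.00, 0.32, 0.29, 0.23, 0.18, 0.32],
    [0.32, 0.00, 0.30, 0.23, 0.21, 0.21],
    [0.29, 0.30, 0.00, 0.22, 0.35, 0.32],
    [0.23, 0.23, 0.22, 0.00, 0.30, 0.29],
    [0.18, 0.21, 0.35, 0.30, 0.00, 0.16],
    [0.32, 0.21, 0.32, 0.29, 0.16, 0.00],
]
published_av = [0.22, 0.21, 0.25, 0.21, 0.20, 0.22]

# Column means, including the zero self-dispersion on the diagonal.
column_av = [sum(row[j] for row in disp) / len(disp) for j in range(len(labels))]

# Largest disagreement with the printed Av row (expected to be rounding only).
max_rounding_gap = max(abs(c - p) for c, p in zip(column_av, published_av))
T1_VH = 0.25  # VH threshold for FOLDREP in Å, from the text
all_within_t1 = all(value <= T1_VH for value in column_av)
```

Note that individual entries such as D3-F5 (0.35 Å) exceed T1, so for FOLDREP it is the average over the members, not every single pair, that is compared with the threshold.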
Construction of a preferred system of coordinates and an
average Cα framework for the VL and VH domains
The existence of geometrical cores is here used to construct a
preferred coordinate system for the immunoglobulin family.
The main idea can be stated as follows: we would like to
describe each molecule in a system of coordinates that is based
on the geometrical properties common to all molecules under
comparison. For the construction of the preferred system of
coordinates the two-fold symmetry of the VL and the VH
complex was used. The Y axis is (approximately) identified
with the two-fold axis. The line connecting the centers of mass
of the core common to both VL and VH domains becomes the
X axis of the preferred system of coordinates. The origin of
the coordinate system is the midpoint on this line, between
the two centers of mass. One of the advantages of this system
of coordinates is that knowledge of coordinate values in
this coordinate system yields information about the possible
structural or functional roles of the residues. Residues with
small values of their X-coordinates are mostly involved in
domain–domain interactions, while residues with high Y values
are usually located in antigen-binding regions.
The preferred system of coordinates was built for each of
the 53 immunoglobulin structures under study. Subsequently,
one of the 53 structures was chosen as a reference and the
coordinates of the Cα atoms of all other structures were
transformed to the preferred coordinate system of the reference
structure. Comparison of the coordinates of Cα atoms in the
identical positions of the 53 different structures revealed those
positions whose coordinates are approximately the same. The
dispersions in the values of the coordinates provide valuable
information about the relative position of a residue in space.
In Table III we present the average (across the 53
immunoglobulin structures) coordinates of the VL and VH core positions
in the preferred coordinate system together with their dispersion
across this family. The significance of these tables can be
stated as follows: we can predict the coordinates of the core
positions in the preferred coordinate system within an error
estimated by the dispersions of the coordinates. As seen from
the tables, this can be done in most of the cases with an error
equal to ±0.35 Å for the VL domain and ±0.40 Å for the
VH domain.
Results
Using a simple geometric technique we identified two sets,
CORE_TL and CORE_TH, of Cα atom secondary structure positions
for the variable VL κ and VH domains of the immunoglobulin
family based on the analysis of the experimental data for 53
X-ray structures. The sets CORE_TL and CORE_TH consist of
secondary structure positions with low structural variation.
The positions of the Cα atoms in CORE_TL can be thought of
as vertices of a skeleton that forms a geometrical ‘core’ of the
VL domain and that is the same across the Ig family.
Analogously, the secondary structure positions from CORE_TH are
vertices of a skeleton for the VH domain that is conserved
across the family.
In order to determine invariant characteristics and in order
to further compare different structures it is of utmost importance
to have the coordinates of atoms of each structure in a single
coordinate system. We now show how the above determined
cores can be used in order to construct such a coordinate system
for the immunoglobulin family. The idea of its construction is
presented to us by nature itself. It is well understood that the
immunoglobulin molecules possess an approximate rotational
symmetry. In the literature it is known as a symmetry under
a rotation around a ‘pseudo two-fold’ axis. This usually is
understood as being able to superimpose the VL and VH
domains by rotating one of them around the ‘pseudo two-fold’
axis by approximately 180°. However, we could not find in
the literature an exact definition (or determination) of this
‘pseudo two-fold’ axis. Mathematically the existence of such
symmetry can be expressed as follows. Let us choose a
coordinate system such that, for example, its Y axis is the
‘pseudo two-fold’ axis. Let x(L), y(L), z(L) be the coordinates of
a Cα atom position in the VL domain and x(H), y(H), z(H) be the
coordinates of a Cα atom position in the VH domain. Then a
‘pseudo two-fold’ symmetry between these two Cα atom
positions exists if the following relations (approximately) hold:
x(L) = –x(H), y(L) = y(H), z(L) = –z(H). The fact that intrinsic
geometrical cores for the VL and VH have been found leads us
to hypothesize that the geometrical core positions common to
both domains do exhibit this symmetry. Before we proceed
with the construction, it is important to mention
that we do not expect the above relations between coordinates
to hold exactly. They should be thought of as holding within
some small error (e.g. <0.5 Å). Our idea is to determine
analytically an approximation to the ‘pseudo two-fold’ axis
for each of the 53 Ig X-ray structures and then use it as one
of the axes of a special coordinate system constructed for each
structure. We discovered that for several of these structures
the ‘pseudo two-fold’ symmetry holds for the common VL–VH
domain core position with very high precision. One of these
structures, 1CGS, is later used as a reference for the construction of the final common coordinate system. We now describe
the construction of this coordinate system which we refer to
as a preferred coordinate system. Let
$$X_L = \frac{1}{N} \sum_{i=1}^{N} x_i^{(L)}, \qquad Y_L = \frac{1}{N} \sum_{i=1}^{N} y_i^{(L)}, \qquad Z_L = \frac{1}{N} \sum_{i=1}^{N} z_i^{(L)}$$

be the coordinates of the center of mass of the VL Cα atom core positions that are in the common VL–VH set for a particular Ig structure. Analogously, let

$$X_H = \frac{1}{N} \sum_{i=1}^{N} x_i^{(H)}, \qquad Y_H = \frac{1}{N} \sum_{i=1}^{N} y_i^{(H)}, \qquad Z_H = \frac{1}{N} \sum_{i=1}^{N} z_i^{(H)}$$

be the coordinates of the center of mass of the VH Cα atom core positions that are in the common VL–VH set for the same Ig structure. Here (xi(L), yi(L), zi(L)) and (xi(H), yi(H), zi(H)), i = 1, ..., N (= 55), are the coordinates of the common core positions in the VL and VH domains relative to the coordinate system used in the X-ray experiment and in which the coordinates were deposited in the PDB. Let Lc be the line that passes through the above two centers of mass. The direction vector of this line is

$$c = (X_H - X_L,\; Y_H - Y_L,\; Z_H - Z_L).$$
The vector

$$b = \left( \frac{X_H + X_L}{2},\; \frac{Y_H + Y_L}{2},\; \frac{Z_H + Z_L}{2} \right)$$

is the radius vector of the point bisecting the segment of Lc connecting the two centers of mass. We choose this point to be the origin O of our new coordinate system. The points with coordinates

$$\left( \frac{x_i^{(L)} + x_i^{(H)}}{2},\; \frac{y_i^{(L)} + y_i^{(H)}}{2},\; \frac{z_i^{(L)} + z_i^{(H)}}{2} \right), \qquad i = 1, \ldots, N\ (= 55),$$

are the points bisecting the segments connecting corresponding Cα atom core positions common to both the VL and VH domains of the molecule. The first axis of the preferred coordinate system is constructed as the line La passing through the origin O and best fitting the bisecting points above (that is, minimizing the sum of the squares of the distances from the line to the bisecting points). This will be the sought after approximation to the ‘pseudo two-fold’ axis. Let a = (a1, a2, a3) be the direction vector of this axis. For example, in the X-ray structure 1CGS this line has the following equation (in parametric form):

$$x = -16.638\,t + 31.355, \qquad y = 3.531\,t - 0.636, \qquad z = 1.000\,t + 16.449.$$

Here the direction vector is a = (–16.638, 3.531, 1.000). Note that the lines Lc and La are not necessarily orthogonal. In the case of the structure 1CGS the angle between Lc and La is 90.322°. Within the family of 53 X-ray Ig structures that we studied, the angle between Lc and La varies within the range (85–95°). The second axis of the preferred coordinate system is taken to be the line Ld with direction vector d = a × c, which is perpendicular to both Lc and La. Finally, the third axis Lc′ is obtained by rotating Lc in the plane determined by Lc and La (and perpendicular to Ld) until it becomes perpendicular to La. In the case of 1CGS, we need to rotate by 0.322°. The direction vector c′ = (c′1, c′2, c′3) of Lc′ can be taken to be

$$c' = \left( 1,\; -\frac{d_1 - \frac{a_1}{a_3} d_3}{d_2 - \frac{a_2}{a_3} d_3},\; -\frac{a_1}{a_3} - \frac{a_2}{a_3}\, c'_2 \right).$$

Table III. Average coordinates and their dispersion for the secondary structure positions in the VL and VH domain geometrical cores in the preferred coordinate system
Fig. 4. Density distribution of positions in CANDIDATES with respect to the average dispersion relative to the members of FOLDREP for the VH domain.
Fig. 5. Cα atom core positions for the VL (shown in black) and VH (shown in grey) domains of the structure 1CGS. The midpoints (shown in black) bisecting the segments connecting corresponding positions in the two domains are also shown. The Y axis of the preferred coordinate system is the line best fitting the set of midpoints and approximating the ‘pseudo two-fold’ axis of symmetry. (The distances are in Angstroms.)
We label Lc′ as the X axis, La as the Y axis and Ld as the Z
axis of the preferred coordinate system. Then, if x = (x, y, z)
is the radius vector and coordinates of a point in the coordinate
system of the X-ray experiment of the structure 1CGS, the
coordinates of the same point in the newly constructed preferred
orthogonal coordinate system (whose Y axis is the approximation to the ‘pseudo two-fold’ axis) are given by the formulae:
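$$\begin{aligned}
\tilde{x} &= \frac{\mathbf{c}'}{\|\mathbf{c}'\|} \cdot (\mathbf{x} - \mathbf{b}) = \frac{1}{\|\mathbf{c}'\|}\left[c'_1 x + c'_2 y + c'_3 z - (c'_1 b_1 + c'_2 b_2 + c'_3 b_3)\right],\\
\tilde{y} &= \frac{\mathbf{a}}{\|\mathbf{a}\|} \cdot (\mathbf{x} - \mathbf{b}) = \frac{1}{\|\mathbf{a}\|}\left[a_1 x + a_2 y + a_3 z - (a_1 b_1 + a_2 b_2 + a_3 b_3)\right],\\
\tilde{z} &= \frac{\mathbf{d}}{\|\mathbf{d}\|} \cdot (\mathbf{x} - \mathbf{b}) = \frac{1}{\|\mathbf{d}\|}\left[d_1 x + d_2 y + d_3 z - (d_1 b_1 + d_2 b_2 + d_3 b_3)\right].
\end{aligned}$$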
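The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the results reported here: the function names are ours, the centers of mass are taken over the same common core positions as the midpoints, the best-fitting axis is found by power iteration on the scatter matrix of the midpoints, and c′ is obtained as d × a (the component of c perpendicular to a) rather than from the explicit component formula, so it may differ from that formula by a scale factor and a sign. The coordinates are synthetic.

```python
import math

def sub(u, v):
    return [u[i] - v[i] for i in range(3)]

def dot(u, v):
    return sum(u[i] * v[i] for i in range(3))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def unit(u):
    n = math.sqrt(dot(u, u))
    return [x / n for x in u]

def best_fit_direction(points, origin, iterations=200):
    """Unit direction of the line through `origin` that minimizes the sum of
    squared distances to `points`: the dominant eigenvector of the scatter
    matrix, found by power iteration (assumes the start vector is not
    orthogonal to that eigenvector)."""
    rel = [sub(p, origin) for p in points]
    s = [[sum(r[i] * r[j] for r in rel) for j in range(3)] for i in range(3)]
    v = unit([1.0, 1.0, 1.0])
    for _ in range(iterations):
        v = unit([dot(row, v) for row in s])
    return v

def preferred_frame(core_l, core_h):
    """Return (b, ex, ey, ez): origin and unit vectors of the X, Y, Z axes."""
    n = len(core_l)
    com_l = [sum(p[k] for p in core_l) / n for k in range(3)]
    com_h = [sum(p[k] for p in core_h) / n for k in range(3)]
    c = sub(com_h, com_l)                              # direction of Lc
    b = [(com_l[k] + com_h[k]) / 2 for k in range(3)]  # origin O
    mids = [[(p[k] + q[k]) / 2 for k in range(3)]
            for p, q in zip(core_l, core_h)]
    a = best_fit_direction(mids, b)                    # direction of La (Y)
    d = cross(a, c)                                    # direction of Ld (Z)
    c_prime = cross(d, a)                              # c made perpendicular to a (X)
    return b, unit(c_prime), a, unit(d)

def to_preferred(x, frame):
    """Coordinates of the point x in the preferred coordinate system."""
    b, ex, ey, ez = frame
    r = sub(x, b)
    return [dot(ex, r), dot(ey, r), dot(ez, r)]

# Synthetic example: core_h is core_l rotated by 180 degrees about the line
# through (1, 2, 0) parallel to the z axis of the input coordinate system.
core_l = [(3.0, 2.0, 0.0), (4.0, 3.0, 1.0), (3.0, 4.0, 2.0), (5.0, 2.0, 4.0)]
core_h = [(2.0 - x, 4.0 - y, z) for x, y, z in core_l]
frame = preferred_frame(core_l, core_h)
```

For this exactly symmetric toy input the recovered Y axis coincides with the rotation axis, and corresponding positions satisfy the sign relations between the two domains exactly; for real structures they hold only approximately.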
In Figures 5, 6 and 7 we show a preferred coordinate system
constructed using the structure 1CGS as a reference. The Cα
atom core positions for both the VL and VH domains are shown
relative to both the X-ray experiment coordinate system and
the preferred coordinate system.
We transformed the X-ray experiment coordinates of the VL
and VH core positions of all 53 Ig structures to the preferred
coordinate system. The variance of these coordinates is remarkably small. In Table III are shown the averages (over the
family of 53 structures) of the coordinates of the VL and VH
core positions as well as their dispersions. (The coordinates
of positions that are in neither one of the VL and VH cores are
not shown for clarity.) From Table III one can also clearly
read off the relations x(L) = –x(H), y(L) = y(H), z(L) = –z(H) for
corresponding positions in both cores with very good accuracy.
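Written in matrix form, these relations say that corresponding core positions of the two domains are related, to a good approximation, by a rotation through 180° about the Y axis of the preferred coordinate system:

$$\begin{pmatrix} x^{(H)} \\ y^{(H)} \\ z^{(H)} \end{pmatrix} \approx \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} x^{(L)} \\ y^{(L)} \\ z^{(L)} \end{pmatrix}, \qquad \det \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix} = +1.$$

Since the determinant is +1, the map is a proper rotation and not a reflection, in keeping with the Y axis approximating the ‘pseudo two-fold’ axis.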
Fig. 6. Preferred coordinate system built using the structure 1CGS as a reference. The coordinate axes are shown relative to the Cα atom core positions for the VL and VH domains of 1CGS as they appear in the light and heavy chain respectively. (The distances are in Angstroms.)

We transformed the coordinates of all Cα atoms (and all
heavy atoms) of all 53 X-ray Ig structures to the preferred
coordinate system. In order to have the Cα atom coordinates
of all 53 X-ray structures in the preferred coordinate system
(constructed using the reference molecule) we need to first
transform them into the coordinate system of the X-ray
experiment of the reference structure (1CGS in our example)
and then, using the above formulae, transform them to the
preferred coordinate system. To transform the Cα atom coordinates of all 53 X-ray structures to the coordinate system of the
X-ray experiment of the reference structure, we devised a
purely analytical method of superposition of X-ray structures.
Fig. 7. Preferred coordinate system built using the structure 1CGS as a reference. The VL and VH domain cores are shown as solid bodies which represent the polytopes whose vertices are the VL and VH geometrical core positions. (The distances are in Angstroms.)

Let NL and NH equal the number of core positions in the VL
and VH domains of the immunoglobulin family, and let xi(L),
i = 1, ..., NL, be the radius vectors of the VL core positions of
the reference structure (1CGS in our example) in its X-ray
experiment coordinate system and let ζi(L), i = 1, ..., NL, be the
radius vectors of the VL core positions of another X-ray
structure (of our list of 53) in its own X-ray experiment
coordinate system. Then the global minimum of the function
$$g(A_L, \mathbf{b}_L) = \sum_{i=1}^{N_L} \left\| A_L\, \boldsymbol{\zeta}_i^{(L)} + \mathbf{b}_L - \mathbf{x}_i^{(L)} \right\|^2,$$
yields a Euclidean transformation (AL,bL) that is later used
to transform the coordinates of all Cα atoms of the VL domain
of the second structure to the coordinate system of the reference
structure, or vice versa. Similarly, let xi(H), i = 1, ..., NH, be
the radius vectors of the VH core positions of the reference
structure (1CGS in our example) in its X-ray experiment
coordinate system and let ζi(H), i = 1, ..., NH, be the radius
vectors of the VH core positions of another X-ray structure (of
our list of 53) in its own X-ray experiment coordinate system.
Then the global minimum of the function
$$h(A_H, \mathbf{b}_H) = \sum_{i=1}^{N_H} \left\| A_H\, \boldsymbol{\zeta}_i^{(H)} + \mathbf{b}_H - \mathbf{x}_i^{(H)} \right\|^2,$$
yields a Euclidean transformation (AH,bH) that is later used
to transform the coordinates of all Cα atoms of the VH domain
of the second structure to the coordinate system of the reference
structure, or vice versa. The second step is to transform the
Cα atom coordinates to the preferred coordinate system (for
which we used the reference molecule) constructed above.
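A minimal sketch of the first step follows, assuming that the matrix in g and h is left unconstrained (a general 3×3 matrix), so that the minimum has a closed form via the normal equations. This is our reading and not a statement of the exact procedure used; if a strict rotation were required, a constrained fit would be needed instead. The function names and the sample coordinates are hypothetical.

```python
def mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(3)]

def inverse3(m):
    """Inverse of a 3x3 matrix via the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("points are (nearly) coplanar; A is not determined")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]

def fit_affine(source, target):
    """Return (A, b) minimizing sum_i ||A s_i + b - t_i||^2 over all 3x3
    matrices A and vectors b (no orthogonality constraint on A)."""
    ms, mt = mean(source), mean(target)
    ds = [[p[k] - ms[k] for k in range(3)] for p in source]
    dt = [[p[k] - mt[k] for k in range(3)] for p in target]
    # Sums of outer products of the centred coordinates
    c_ts = [[sum(t[i] * s[j] for t, s in zip(dt, ds)) for j in range(3)]
            for i in range(3)]
    c_ss = [[sum(s[i] * s[j] for s in ds) for j in range(3)]
            for i in range(3)]
    inv = inverse3(c_ss)
    # Normal equations: A * c_ss = c_ts
    a = [[sum(c_ts[i][k] * inv[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    b = [mt[i] - sum(a[i][k] * ms[k] for k in range(3)) for i in range(3)]
    return a, b

def apply(a, b, p):
    """Image A p + b of the point p."""
    return [sum(a[i][k] * p[k] for k in range(3)) + b[i] for i in range(3)]

# Synthetic example: the "reference" core is the "other" core rotated by
# 90 degrees about z and shifted by (1, -2, 0.5).
other = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
         (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
reference = [(-y + 1.0, x - 2.0, z + 0.5) for x, y, z in other]
A_fit, b_fit = fit_affine(other, reference)
```

Once (A, b) has been obtained from the core positions, `apply` can be used on every Cα atom of the second structure, which is how the fitted transformation is meant to be used in the text.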
Thus, we can transform the coordinates of all Cα atoms of
any known immunoglobulin X-ray structure to a preferred
coordinate system. (We are in the process of creating a small
data bank containing the atom coordinates of all available
immunoglobulin X-ray structures in such a preferred coordinate
system.) This gives a very powerful method of comparison
within a given protein family based on an intrinsic invariant
(the geometrical core) of the family. Previously one of our
preferred coordinate systems has been the one of the ‘average
immunoglobulin molecule’ described in Gelfand et al. (1996),
in the construction of which averaging over all possible
directions of the ‘pseudo two-fold’ axis was performed as well
as averaging over the coordinates of positions in both domains
that exhibit the ‘pseudo two-fold’ symmetry. However, averaging
over the directions of the ‘pseudo two-fold’ axis in
different X-ray structures ceases to be legitimate when the
variation in the direction of the ‘pseudo two-fold’ axis is large.
As was mentioned above, after applying our construction to
each of the 53 immunoglobulin X-ray structures that we
studied we found that the variation can be of the order of
±5°. [Note that an averaging of rotations was also used in
Altman and Gerstein (1995).] Thus, one loses information
about the mutual orientation of the light and heavy domains
in each individual structure. (We are grateful to C.Chothia for
pointing this out to us.)
The determination of the geometrical core allows us to
undertake a very precise study of the 3D structure of the
immunoglobulins. One can now look at different invariant
characteristics inherent to the immunoglobulin molecules. One
example is the mutual orientation of the VL and VH domains.
To any Ig molecule there corresponds a Euclidean motion
under which a best superposition of the core positions in the
VL and VH domains can be achieved. This transformation,
determined by a 3×3 matrix A and a (translation) vector b, is
a unique intrinsic characteristic of each Ig molecule. By a
transformation achieving the best superposition of the set of
core positions we mean the pair (A,b) which yields the
global minimum of the function
$$f(A, \mathbf{b}) = \sum_{i=1}^{N} \left\| A\, \mathbf{x}_i^{(L)} + \mathbf{b} - \mathbf{x}_i^{(H)} \right\|^2,$$
where N is the number of geometrical core positions common
to both the VL and VH domains. (These are the following 55
positions: A′B1-A′B3, B1-B8, C1-C6, CC′1, C′1-C′3, C′5,
D3-D6, DE2, E1-E7, EF2-EF3, EF5-EF7, F1-F7 and G1-G9.)
Had the superposition of the two domains been exact the
matrix A would have been orthogonal. The fact that in practice
we can only achieve superposition of Cα atoms with some
(small) error results in the matrix A being not exactly orthogonal. Having transformed the Cα atom coordinates of both
VL and VH domains of all 53 X-ray structures to a preferred
coordinate system we can now compute for each of them the
pair (A,b) that gives rise to a Euclidean motion achieving an
optimal superposition of the light and heavy geometrical cores
in each of them. We can consider the pair (A,b) as an individual
characteristic of each immunoglobulin structure. As an example
we give here the matrix A for the X-ray structure 1CGS:
$$A_{\mathrm{1CGS}} = \begin{pmatrix} -1.016 & -0.028 & 0.049 \\ -0.040 & 1.000 & 0.070 \\ -0.019 & 0.003 & -1.001 \end{pmatrix}.$$
The corresponding vector b = (–0.21, –0.37, –0.24) has length
||b|| = 0.49 Å.
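How far such a matrix is from being exactly orthogonal can be checked directly. The short sketch below uses the 1CGS values quoted above and computes the determinant of A, the largest entry of AᵀA − I in absolute value, and the length of b; the function names are ours and are introduced only for illustration.

```python
import math

A_1CGS = [[-1.016, -0.028, 0.049],
          [-0.040, 1.000, 0.070],
          [-0.019, 0.003, -1.001]]
b_1CGS = [-0.21, -0.37, -0.24]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def orthogonality_defect(m):
    """Largest absolute entry of A^T A - I (zero for an orthogonal matrix)."""
    worst = 0.0
    for i in range(3):
        for j in range(3):
            gram = sum(m[k][i] * m[k][j] for k in range(3))
            worst = max(worst, abs(gram - (1.0 if i == j else 0.0)))
    return worst

det_a = det3(A_1CGS)
defect = orthogonality_defect(A_1CGS)
length_b = math.sqrt(sum(x * x for x in b_1CGS))
```

For the quoted matrix the determinant is about 1.02 and no entry of AᵀA − I exceeds 0.07 in absolute value, so A is close to a proper rotation; its diagonal is close to (–1, 1, –1), that is, to a rotation by 180° about the Y axis.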
Discussion
Before discussing further applications of the geometrical cores
determined above we would like to compare our method and
results with the ones in other research devoted to identifying
structurally conservative secondary structure positions in protein families (cf. Altman and Gerstein, 1995). Our method is
coordinate system independent. No superposition techniques
were employed. The only input on which the final result
depends is the predetermined secondary structure alignment
of the 53 Ig sequences which was done by hand and is based
on the principles of identification of positions of residues
described in Gelfand and Kister (1995).

Table IV. Geometrical core positions at which conserved residues were observed
The column entries containing one residue indicate that this residue has been found at the corresponding position in >90% of the sequences polled. Similarly, the column entries containing two residues indicate that either one or the other was found at the corresponding position in >90% of the sequences polled.

The thresholds TL = 0.35 Å and TH = 0.40 Å on the average dispersions for
the VL and VH domains, respectively, were chosen to be the
upper limits of windows with size 0.10 Å around the maxima
of the corresponding density distributions of all positions with
respect to their average dispersions relative to the six members
of FOLDREP. The choice of the initial sets FOLDREP was
based on two criteria: (i) the six positions for both the VL and
VH domains were selected to represent the different parts of
the fold and (ii) the distances among them exhibited a very
low dispersion (≤0.20 Å in the case of the VL domain and
≤0.25 Å in the case of the VH domain) after averaging across
the family. Thus, these positions formed the kernel around
which the geometrical core was subsequently built.
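The threshold choice described above can be illustrated as follows. The sketch assumes that the density distribution is estimated by a simple histogram and that the 0.10 Å window is centered on the histogram maximum, so that the threshold is the peak position plus half the window; the bin width, the function names, the position labels and the dispersion values are all hypothetical and are not taken from our data.

```python
import math
from collections import Counter

def threshold_from_density(dispersions, bin_width=0.05, window=0.10):
    """Upper limit of a window of size `window` centered on the most
    populated histogram bin of the average dispersions (in Angstroms)."""
    bins = Counter(math.floor(d / bin_width) for d in dispersions)
    # Most populated bin; ties are resolved in favor of the lower bin.
    peak_bin = max(bins, key=lambda k: (bins[k], -k))
    peak_center = (peak_bin + 0.5) * bin_width
    return peak_center + window / 2

def select_core(dispersion_by_position, threshold):
    """Positions whose average dispersion does not exceed the threshold."""
    return sorted(p for p, d in dispersion_by_position.items() if d <= threshold)

# Hypothetical average dispersions (Angstroms) for a few candidate positions
sample = {"A1": 0.58, "B3": 0.31, "B4": 0.32, "C2": 0.33,
          "E5": 0.34, "F3": 0.12, "G7": 0.41}
t = threshold_from_density(sample.values())
core = select_core(sample, t)
```

With a finer or coarser binning the peak position, and hence the threshold, can shift by a fraction of the bin width, which is one reason to treat the result as an estimate.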
As was explained in the previous sections, our CORETL
contains 74 Cα atom positions. This set differs at several
places from the VL core determined in Altman and Gerstein
(1995). They are OA2[2](+,–), AA′1[7](–,+), AA′3[9](+,*),
A′1-A′2[11–12](–,+), A′4[14](+,–), A′B1[15](+,–), BC1-BC2[26–27](+,*), CC′6[44](–,+), C′C″1-C′C″2[50–51](+,*),
C″D3[58](–,+), DE1[68](–,+), EF1[77](–,+), EF4[80](+,–),
F7[90](+,–), G9[105](+,–), where the notation (+,–) means
that the position is in CORETL but not in the VL core of Altman
and Gerstein (1995) and (–,+) means the opposite. The notation
(+,*) indicates a position that was not taken into consideration
in Altman and Gerstein (1995) but appears in CORETL. In
square brackets we indicate the corresponding Kabat number
of the positions.
Analogously, for the VH domain we determined CORETH
which consists of 65 Cα atom positions. There are again
several differences with the VH core determined in Altman
and Gerstein (1995). The differences are in OA2-OA3[2–3](–,+),
A1-A3[4–6](–,+), AA′1[7](–,+), BC1[25](–,+),
CB5[33](–,+), CC′2-CC′3[41–42](–,+), CC′5[44](–,+),
C′C″1[53](–,+), C″2-CC″5[56–59](–,+), D1-D2[66–67](–,+),
DE1[73](–,+), EF4[84](–,+). As seen from this comparison,
our self-consistent method yielded a smaller subset of VH Cα
atom positions based on the experimental data for the 53 X-ray
structures. The cores determined in Gelfand and Kister (1995)
were based on the analysis of 12 VL κ domains and 12 VH
domains of which only eight structures are common with our
list. [These are: 1fdl, 1hil, 1igf, 1mcp, 2fbj, 3hfm, 6fab, 4fab.
The structure 2hfl used in Gelfand and Kister (1995) is now
obsolete and has been replaced by 3hfl (which was included
in our list) in the PDB.] Our experience with this data also
showed that the Cα atom positions in the VH domain exhibit
greater variability than the ones in the VL domain.
The role of residues at each position in the core is determined
from the examination of residue frequencies, calculations of
surface areas and determination of residue–residue contacts
(Chothia et al., 1998). In that work the frequencies of residues
were calculated for about 5300 immunoglobulin sequences of
both light and heavy chains from the Kabat database. The
most strongly conserved positions were found at the center of
the interface between the beta sheets. It includes eight invariant
residue positions (IR), which are occupied by a single particular
residue in more than 90% of the sequences and 11 similar
residue positions (SR), which are occupied by several residues
with similar chemical properties (i.e. hydrophobic, charged,
aromatic, etc.), the sum of the frequencies of which is more
than 0.9 (i.e. they appear in >90% of the sequences).
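The two definitions can be made concrete with a small sketch. It classifies one alignment column as IR when a single residue exceeds the 90% cutoff and as SR when the summed frequency of one group of chemically similar residues does. The grouping of residues below is a hypothetical one chosen for illustration (the groups actually used are not listed here), and the function name and sample columns are ours.

```python
from collections import Counter

# Hypothetical grouping by chemical similarity (an assumption for this sketch)
GROUPS = {
    "hydrophobic": set("AVLIMC"),
    "aromatic": set("FWY"),
    "charged": set("DEKR"),
}

def classify_position(column, cutoff=0.9):
    """Classify one alignment column (a string of one-letter residue codes).

    Returns ("IR", residue) if a single residue occurs in more than `cutoff`
    of the sequences, ("SR", group) if the residues of one group together do,
    and ("variable", None) otherwise."""
    if not column:
        raise ValueError("empty column")
    counts = Counter(column)
    total = len(column)
    residue, top = counts.most_common(1)[0]
    if top / total > cutoff:
        return ("IR", residue)
    for name, members in GROUPS.items():
        if sum(counts[r] for r in members) / total > cutoff:
            return ("SR", name)
    return ("variable", None)

# Synthetic columns of 100 sequences each
invariant = classify_position("C" * 95 + "S" * 5)
similar = classify_position("L" * 50 + "V" * 45 + "S" * 5)
mixed = classify_position("A" * 40 + "D" * 30 + "F" * 30)
```

The IR test is applied first, so a position dominated by one residue is never reported as SR even though it would also pass the group test.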
In the present paper we analyzed the statistics of residue
frequencies separately for 2800 human heavy chains and 2000 human kappa
chains. This allowed us to understand better the difference
between residue occupancies at core positions in the VL and
VH domains. Our analysis of residue frequencies at different
positions (Table IV) showed that there are 24 IR positions in
the VL κ and 15 IR positions in the VH domain (the single
residues for the invariant positions are shown in the rows of
Table IV). Further inspection of the positions in the core of
the VL and VH domains revealed 22 SR in the VL κ and 20 SR
positions in the VH respectively. It is important to note that
more than half of these positions are occupied by hydrophobic
or aromatic residues (Table IV). The nature of the residues
found at the conserved IR and SR positions of the core agrees
very well with the major requirement for a stable protein fold:
a sufficiently large hydrophobic interior of the three-dimensional
structure. Thus, these positions play a significant role in the
stability of the immunoglobulin fold.
The application of our method to the understanding of the
geometric structure of the antigen-binding (CDR) regions in
order to get an insight into the functional role of each of the
immunoglobulins within the family is the subject of current
research. Similar constructions are being carried out for the
constant domains CL and CH. Even though our focus of study
is the immunoglobulin family, it is clear from the generality
of the described algorithms that the above methods can be
applied to the study of invariants of other protein families as
well. Currently the method is also being applied to several of
the azurin families.
Acknowledgements
The authors are grateful to Drs C.Chothia and O.Ptitsyn for very helpful
discussions. A.K. and O.S. would like to thank the Gabriella and Paul
Rosenbaum Foundation for support. We wish to thank Mrs M.Goldman for
continuous encouragement. We would also like to thank SROA at Rutgers
University for partial support.
References
Altman, R.B. and Gerstein, M. (1994) In: Proc. Second Int. Conf. Intell. Sys.
Mol. Biol., AAAI Press, Menlo Park, CA, pp. 161–175.
Altman, R.B. and Gerstein, M. (1995) J. Mol. Biol., 251, 161–175.
Chothia, C. and Lesk, A. (1986) EMBO J., 5, 823–826.
Chothia, C. and Lesk, A. (1987) J. Mol. Biol., 196, 901–917.
Chothia, C., Gelfand, I. and Kister, A. (1998) J. Mol. Biol., 278, 457–479.
Diamond, R. (1992) Protein Sci., 1, 1279–1287.
Gelfand, I. and Kister, A.E. (1995) Proc. Natl Acad. Sci. USA, 92,
10884–10888.
Gelfand, I.M., Kister, A.E. and Leshchiner, D. (1996) Proc. Natl Acad. Sci.
USA, 93, 3675–3675.
Gelfand, I., Kister, A., Kulikowski, C. and Stoyanov, O. (1998) J. of Comp.
Biol., 5, 467–477.
Godzik, A., Skolnick, J. and Kolinski, A. (1993) Protein Engng, 8, 801–810.
Kearsley, S. (1990) J. Comp. Chem., 11, 1187–1192.
Padlan, E. (1994) Mol. Immunol., 31, 169–217.
Selbig, J. (1995) Protein Engng, 8, 339–351.
Shapiro, A., Botha, J., Pastore, A. and Lesk, A. (1992) Acta Crystallogr.
Sect. A, 48, 11–14.
Swindells, M. (1995) Protein Sci., 4, 93–102.
Taylor, W. and Orengo, C. (1989) J. Mol. Biol., 208, 1–22.
Thomas, D., Casari, G. and Sander, C. (1996) Protein Engng, 9, 941–948.
Tramontano, A., Chothia, C. and Lesk, A. (1990) J. Mol. Biol., 215, 175–182.
Yee, D. and Dill, K. (1993) Protein Sci., 2, 884–899.
Received February 9, 1998, revised May 13, 1998, accepted June 1, 1998