Protein Structure Geometry
Protein geometry: Dr. Patrice Koehl, UC Davis
A common concrete model for proteins is to represent them as a union of balls, in which each
ball corresponds to an atom. Properties of a protein are then expressed in terms of properties of
the union. For example, the interaction of the protein with its environment is quantified through
the surface area and/or volume of the union of balls, and its potential active sites are detected as
cavities.
How do we compute geometric properties of a protein, represented as a union of balls? This is
not an easy problem. Analytical solutions to that problem exist (see paper below), but they are
not easy to implement. Here we will use numerical methods to solve this problem that are
surprisingly accurate and easy to implement. These methods are based on the scientific
computing technique called “Monte Carlo integration”.
A good resource on protein geometry:
Koehl and Edelsbrunner : “The geometry of biomolecular solvation”
http://www.cs.usfca.edu/~pfrancislyon/koehl_msri_05.pdf
Monte Carlo techniques:
Monte Carlo methods use random samples to approximate distributions.
Let us assume we want to compute the integral I of a function f over a multidimensional volume
V. Suppose that we have picked N random points, uniformly distributed in a multidimensional
volume V. Call them x1,…,xN. The basic theorem of Monte Carlo integration estimates the
integral I as:
fdV V
f V
f2 f
2
N
Here the brackets denote taking the arithmetic mean over the N sample points,
f
1
N
N
f ( xi )
i 1
f2
1
N
N
f
i 1
2
( xi )
The “plus-or-minus” term is a one standard deviation error estimate for the integral.
To understand how it can be applied to computing a volume, let us look at a simple toy problem
in 2D, where we want to compute the surface area of two overlapping disks.
R
A
B
We define the function f as:
For all x in R, f(x) = 1 if x A B , and f(x) = 0 otherwise.
To compute the surface area covered by the two disks A and B, which we denote S A B , we
use the following algorithm:
(1) Initialize S = 0 and S2 = 0
(2) Initialize number of random points N
(3) for (i=0; i <= N; i++)
{
- Position xi at random in the box R
- Compute f(xi): if the distance from xi to the center of A is smaller than the radius
of A, or if the distance from xi to the center of B is smaller than the radius of B,
f(xi) = 1; in all other cases, f(xi) = 0
- Update the sums: S = S + f(xi); S2 = S2 + f(xi) * f(xi)
}
(4) Compute means:
< f > = S/N
< f2 > = S2/N
(5) Compute Surface area
S A B S ( R) f
where S(R) is the surface of the rectangular box R.
(6) Compute the one standard deviation error estimate:
SD S ( R )
f2 f
2
N
This algorithm can be generalized to any number of disks, and any dimensions.
Problem 1:
Biotech students may solve the problem in two dimensions as sketched out in the above
pseudocode: Write a script in Python that will compute the total surface area of the union of
disks that correspond to the x and y coordinates of the atoms in this file:
http://www.cs.usfca.edu/~pfrancislyon/test.txt
The format of test.txt is:
line 1: number of atoms (M)
lines 2 to (M+1) : x, y, z, r for each atom sphere
where (x, y, z) is the center of the sphere and r is its radius
CS students: adapt the above algorithm to the problem of computing the total volume of a
union of balls, in this case atoms. This involves mapping the problem from R2 -> R3 (from the
Euclidean space of real numbers in 2 dimensions -> the Euclidean space of real numbers in 3
dimensions). Write a script implementing this algorithm Python, utilizing numpy and
BioPython. Your script should read in a .pdb file of any length. Apply this program to a test case,
test.pdb, you find here:
http://www.cs.usfca.edu/~pfrancislyon/test.pdb
This file corresponds to residues 105-450 of 1z6t from the protein data bank.
CS students may begin development with test.txt and later utilize BioPython to parse the
information from test.pdb or any pdb file.
All students: Show how the estimate of the volume/area and of the standard deviation changes
as you vary N, the number of points you consider. (Using an analytical method, the volume of
this union of balls is 35461.81 Å3)
Problem 2: Biotech students skip this part
Now compute the surface area of this molecule, i.e. the accessible surface area of the union of
balls. We cannot apply directly the algorithm described above. To explain the method we will
use, let us go back to the example considered above.
We want to compute the length of the union of disks, i.e. the length of the arcs shown in red. We
do that by sampling uniformly the circles directly:
(1) Initialize number of circles NC (here 2: A and B) and number of random points N
(2) Initialize total length L=0
(3) For (k=0; k <= NC; k++)
{
a. Initialize: S = 0; S2 = 0;
b. For {i=0; i <= N; i++)
{
i. Position xi at random on circle k
ii. Compute f(xi): if xi is inside a circle j, j≠k (green region in the figure
above), f(xi) = 0; in all other cases (red region in the figure) f(xi) = 1
iii. Update the sums: S = S + f(xi); S2 = S2 + f(xi) * f(xi)
}
c. Compute means:
i. < f > = S/N
ii. < f2 > = S2/N
d. Compute accessible lengths and update total length:
i. Lk 2Rk f
where Rk is the radius of circle k, and (2πRk) its total circumference.
ii. L = L + Lk
}
Upon termination, L contains the total length of the union of circles.
a) Adapt the algorithm above to the problem of computing the total accessible surface area of a
union of balls. Extend your script to implement this algorithm in Python utilizing numpy. As
before, your program should obtain the required information from a .pdb file of any length.
(Hint: there is a hidden difficulty in this part of the assignment: how do you uniformly sample
the surface of a sphere?)
b) Apply this program to the above test case, test.pdb. Show how the estimate of the surface area
and of the standard deviation changes as you vary N, the number of points you consider. Note
that standard deviation is the deviation in the total surface area, not in the individual surface
areas. (Using an analytical method, the surface of this union of balls is 36764.21 Å2)
Please provide for both Problem 1 and Problem 2: the source code of the script you wrote, the
output of your results of running test.pdb as you vary N, and your brief explanation of the
variation.
© Copyright 2026 Paperzz