Mathematical Biology

J. Math. Biol. (2012) 64:829–854
DOI 10.1007/s00285-011-0433-5
Exact and approximate distributions of protein
and mRNA levels in the low-copy regime of gene
expression
Pavol Bokes · John R. King ·
Andrew T. A. Wood · Matthew Loose
Received: 25 June 2010 / Revised: 10 February 2011 / Published online: 8 June 2011
© Springer-Verlag 2011
Abstract Gene expression at the single-cell level incorporates reaction mechanisms
which are intrinsically stochastic as they involve molecular species present at low copy
numbers. The dynamics of these mechanisms can be described quantitatively using
stochastic master-equation modelling; in this paper we study a generic gene-expression
model of this kind which explicitly includes the representations of the processes of
transcription and translation. For this model we determine the generating function
of the steady-state distribution of mRNA and protein counts and characterise the
underlying probability law using a combination of analytic, asymptotic and numerical
approaches, finding that the distribution may assume a number of qualitatively distinct
forms. The results of the analysis are suitable for comparison with single-molecule
resolution gene-expression data emerging from recent experimental studies.
Keywords Stochastic modelling · Gene expression · Master equation ·
Generating function
Mathematics Subject Classification (2000)
92C40 · 60J27
P. Bokes (B) · J. R. King · A. T. A. Wood
Centre for Mathematical Medicine and Biology, School of Mathematical Sciences,
University of Nottingham, Nottingham NG7 2RD, UK
e-mail: [email protected]
Present Address:
P. Bokes
Department of Applied Mathematics and Statistics, Comenius University, Mlynská dolina,
Bratislava 842 48, Slovakia
M. Loose
Institute of Genetics, Queen’s Medical Centre, University of Nottingham, Nottingham NG7 2UH, UK
1 Introduction
The regulation of a genetic programme in a given developmental context or in response
to a particular set of environmental stimuli is characterized by complex networks of
molecular interactions between the regulatory factors involved (Davidson et al. 2002;
Shen-Orr et al. 2002; Lee et al. 2002; Swiers et al. 2006). Mathematical and computational modelling has proved to be useful in reproducing the observed behaviour of
such networks and has also revealed features yet to be verified experimentally (Tomlin
and Axelrod 2007; Tyson et al. 2003). A common approach to modelling gene regulatory networks is to represent the cell as a well-stirred chemical reactor comprising
relevant biochemical species such as transcription factors, RNA molecules and signalling proteins (Tyson et al. 2003; McAdams and Arkin 1997); the interactions between
these species can be modelled using elementary chemical kinetics and the resulting
temporal dynamics studied using the deterministic or stochastic formalisms developed
for systems of chemical reactions (McAdams and Arkin 1997; Tyson et al. 2003).
The building block of such (typically large) gene regulatory networks is a single
gene and its read-outs, the associated mRNA and protein molecules (Griffith 1968a,b).
Gene expression is (at least) a two-stage process: first, the DNA code is transcribed
by an RNA polymerase into mRNA molecules; then, mRNAs are translated by ribosomes into proteins (Lewin 2000). Transcription and translation are counteracted by
the degradation of mRNAs and proteins by appropriate enzymes, such as ribonucleases and proteases (Lewin 2000). Following the work of others (e.g. McAdams and Arkin
1997; Thattai and van Oudenaarden 2001), we model these mechanisms as elementary
reactions, obtaining a reaction system of two chemical species (mRNA and protein),
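$$
\emptyset \xrightarrow{k_1} \mathrm{mRNA}, \qquad
\mathrm{mRNA} \xrightarrow{\gamma_1} \emptyset, \qquad
\mathrm{mRNA} \xrightarrow{k_2} \mathrm{mRNA} + \mathrm{protein}, \qquad
\mathrm{protein} \xrightarrow{\gamma_2} \emptyset.
\tag{1}
$$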
In this schematic, transcription is a zeroth order reaction occurring with rate k1 , while
translation and the degradation mechanisms are first-order reactions occurring with
rate constants k2 , γ1 and γ2 . Biologically, these rate constants can depend on many
factors, notably k1 on the availability of transcriptional machinery and polymerases
(Lewin 2000), the phase of the cell cycle (Berg 1978), or the expression patterns of
the regulators of the gene (Ackers et al. 1982), k2 on the ribosomal activity of the cell
(Lewin 2000) and γ1 and γ2 on the availability of the degradation enzymes (Lewin
2000). By taking these rates to be constant, we assume that the underlying extrinsic
mechanisms are in a non-fluctuating steady state (Elowitz et al. 2002). We shall discuss in the final few paragraphs of this section the relationship between (1) and more
complex models for gene expression which include transitioning between multiple
promoter states.
Experimental studies in model organisms provide useful information on the order
of magnitude of the above parameters (Wang et al. 2002; Bernstein et al. 2002;
Lu et al. 2007; Belle et al. 2006; García-Martínez et al. 2007). Genome-wide studies
in E. coli showed that a majority of genes have mRNA half-lives, ln(2)/γ1 , between
3 and 8 min (Bernstein et al. 2002). In S. cerevisiae, the half-lives of mRNAs for a
large set of genes range from 3 to 90 min (Wang et al. 2002), with a median of 20 min,
while for the protein half-life, ln(2)/γ2 , a median value of 43 min has been reported
(Belle et al. 2006). Messengers tend to be less stable than proteins (García-Martínez
et al. 2007), i.e. we can expect γ1 > γ2 , and, in particular for genes exhibiting translational bursts, we often have γ1 ≫ γ2 (Ozbudak et al. 2002; McAdams and Arkin
1997). Typical values of the production rates k1 and k2 can be inferred from the mRNA
and protein abundances. The average number of mRNA molecules per cell, which, as
detailed below, corresponds to the ratio k1 /γ1 in the model (1), ranges from less than
one molecule to a few dozen in E. coli and S. cerevisiae (Lu et al. 2007). Proteins
are more abundant than their mRNA templates; the number of proteins per mRNA
molecule, corresponding to k2 /γ2 in (1), is around 540 for an average gene in E. coli
and 5600 in yeast (Lu et al. 2007). Although these numbers suggest that the number of proteins of a specific type in a cell can be quite high, a recent system-wide
study of gene expression in E. coli demonstrated that a large fraction of proteins were
expressed at less than 10 molecule copies per cell (Taniguchi et al. 2010). Protein
abundance is related to their function, with regulatory proteins, e.g. transcription factors, being present at lower numbers than housekeeping molecules (Taniguchi et al.
2010; García-Martínez et al. 2007). The low-copy regime of protein expression has
also been experimentally demonstrated for genes transcribed in repressed conditions
(Cai et al. 2006; Yu et al. 2006). Specifically, Cai et al. (2006) counted the copy number of the lactose utilisation enzyme β-galactosidase (β-gal) in E. coli cells grown in
glucose-containing medium, in which the lac promoter controlling the β-gal-encoding lacZ gene is repressed, obtaining 1.2 molecules per cell. Yu et al. (2006) replaced
the lacZ gene behind a repressed lac promoter by a gene encoding a fluorescent
protein and counted the protein copy numbers using imaging techniques, measuring
on average 0.037 mRNA and 5 protein molecules per E. coli cell.
If the levels of mRNAs (m) and proteins (n) in the reaction system (1) are sufficiently large, we can treat them as continuous variables that satisfy (van Kampen
2006)
$$
\frac{dm}{dt} = k_1 - \gamma_1 m, \qquad
\frac{dn}{dt} = k_2 m - \gamma_2 n.
\tag{2}
$$
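The solution to the linear initial-value problem in which (2) is subject to the initial
condition
$$
m(0) = m_0, \qquad n(0) = n_0,
\tag{3}
$$
is given by
$$
\begin{aligned}
m(t) &= \frac{k_1}{\gamma_1} + \left(m_0 - \frac{k_1}{\gamma_1}\right) e^{-\gamma_1 t}, \\
n(t) &= \frac{k_1 k_2}{\gamma_1 \gamma_2} + \left(n_0 - \frac{k_1 k_2}{\gamma_1 \gamma_2}\right) e^{-\gamma_2 t} + k_2 \left(m_0 - \frac{k_1}{\gamma_1}\right) F(t),
\end{aligned}
\tag{4}
$$
where
$$
F(t) =
\begin{cases}
\dfrac{e^{-\gamma_1 t} - e^{-\gamma_2 t}}{\gamma_2 - \gamma_1} & \text{if } \gamma_1 \neq \gamma_2, \\[2ex]
t\, e^{-\gamma_2 t} & \text{if } \gamma_1 = \gamma_2.
\end{cases}
\tag{5}
$$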
This completes the deterministic analysis of the gene expression model (1).
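As an illustration, the closed-form solution (4)–(5) can be evaluated and checked against the ODE system (2) numerically. The following is a minimal sketch; the function names and the finite-difference check are illustrative assumptions and are not part of the original analysis.

```python
import math

def deterministic_solution(t, k1, k2, g1, g2, m0=0.0, n0=0.0):
    """Evaluate the closed-form solution (4)-(5) of the linear system (2)-(3)."""
    m_inf = k1 / g1                 # stationary mRNA level
    n_inf = k1 * k2 / (g1 * g2)     # stationary protein level
    if math.isclose(g1, g2):
        F = t * math.exp(-g2 * t)   # degenerate case gamma_1 = gamma_2
    else:
        F = (math.exp(-g1 * t) - math.exp(-g2 * t)) / (g2 - g1)
    m = m_inf + (m0 - m_inf) * math.exp(-g1 * t)
    n = n_inf + (n0 - n_inf) * math.exp(-g2 * t) + k2 * (m0 - m_inf) * F
    return m, n

def ode_residual(t, k1, k2, g1, g2, m0, n0, h=1e-6):
    """Central-difference residuals of (2) evaluated on the formula (4)."""
    m, n = deterministic_solution(t, k1, k2, g1, g2, m0, n0)
    mp, np_ = deterministic_solution(t + h, k1, k2, g1, g2, m0, n0)
    mm, nm = deterministic_solution(t - h, k1, k2, g1, g2, m0, n0)
    dm = (mp - mm) / (2 * h)
    dn = (np_ - nm) / (2 * h)
    return dm - (k1 - g1 * m), dn - (k2 * m - g2 * n)
```

Both residuals should be close to zero for any admissible parameter set, including the degenerate case γ1 = γ2, and the solution should approach (k1/γ1, k1 k2/(γ1 γ2)) for large t.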
However, for many genes, mRNAs and proteins are present in small quantities and
molecular fluctuations cannot therefore be neglected (McAdams and Arkin 1999).
It is then appropriate to treat m and n as discrete stochastic variables; the probability pm,n (t) of having m mRNA and n protein molecules at time t then satisfies the
chemical master equation (van Kampen 2006)
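$$
\begin{aligned}
\frac{d}{dt} p_{m,n} ={}& k_1 (p_{m-1,n} - p_{m,n}) + \gamma_1 \big((m+1) p_{m+1,n} - m p_{m,n}\big) \\
&+ k_2 m (p_{m,n-1} - p_{m,n}) + \gamma_2 \big((n+1) p_{m,n+1} - n p_{m,n}\big),
\end{aligned}
\tag{6}
$$
subject to the initial condition
$$
p_{m,n}(0) = \delta_{m,m_0} \delta_{n,n_0},
\tag{7}
$$
where δ represents Kronecker’s delta; $m_0$ and $n_0$ are the initial amounts of mRNA and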
protein, respectively. Recent studies (Coulon et al. 2010; Thattai and van Oudenaarden
2001; Paulsson 2005; Paszek 2007) of (6)–(7) and its generalisations focused on finding the first and second moments of the probability distribution pm,n (t) without solving
the master equation itself; to illustrate this approach, let us denote by (M(t), N (t))
the Markov process associated with the distribution of pm,n (t) and let
$$
\langle M^i(t) N^j(t) \rangle = \sum_{m,n} m^i n^j p_{m,n}(t)
$$
be the (i, j)th moment of this process. Multiplying (6)–(7) by $m^i n^j$ and summing
over all m and n gives an infinite-dimensional system of ODEs for the moments.
It can be shown that, since the reaction rates in (6) depend linearly on the state (m, n)
of the system, the equations for all moments of order up to a given integer d > 0
(i.e. for $\langle M^i(t) N^j(t) \rangle$ such that i + j ≤ d) do not depend on the moments of higher
order and hence form a closed finite system of ODEs (Lestas et al. 2008; Gadgil et al.
2005; Tomioka et al. 2004); in particular, the first moments, i.e. the mRNA and protein means, are given as solutions to the deterministic formulation (2)–(3) (Singh and
Hespanha 2007). The stationary means $\langle M_s \rangle$ and $\langle N_s \rangle$ can be obtained as the limit as
t → +∞ of the time-dependent ones; the formula for the deterministic solution (4)
yields
$$
\langle M_s \rangle = \frac{k_1}{\gamma_1}, \qquad
\langle N_s \rangle = \frac{k_1 k_2}{\gamma_1 \gamma_2}.
\tag{8}
$$
The time-dependent second moments satisfy a system of equations which is more
complex than the one for the means and which can be found in Singh and Hespanha
(2007); the stationary variances and covariance can be determined by analysing this
system and are given by (cf. Thattai and van Oudenaarden 2001)
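$$
\mathrm{Var}(M_s) = \frac{k_1}{\gamma_1}, \qquad
\mathrm{Var}(N_s) = \frac{k_1 k_2}{\gamma_1 \gamma_2} \left(1 + \frac{k_2}{\gamma_1 + \gamma_2}\right), \qquad
\mathrm{Cov}(M_s, N_s) = \frac{k_1 k_2}{\gamma_1 (\gamma_1 + \gamma_2)}.
\tag{9}
$$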
The derivation of the analytic formulae (8)–(9) for the mean and the variance of mRNA
and protein levels, the methods of which can be applied to understand the transmission
of noise down a generic regulatory pathway (Paulsson 2004), provides an important
tool for theoretical understanding of stochasticity in gene expression; nevertheless,
complete characterisation of the distribution of gene products is of equal interest and
has not been provided in recent studies (e.g. Cheong et al. 2010; Shahrezaei and Swain
2008b) discussing the stochastic model (6), except for a special asymptotic case of
fast mRNA degradation (Shahrezaei and Swain 2008a).
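The formulae (8)–(9) can be compared with time averages taken along a sample path of the Markov process behind (6). The sketch below uses Gillespie's direct method, which is not discussed in the text and is brought in here only as a standard way of sampling such a process; the function names, the seed and the run length are illustrative assumptions.

```python
import random

def stationary_moments(k1, k2, g1, g2):
    """Stationary means, variances and covariance according to (8)-(9)."""
    mean_m = k1 / g1
    mean_n = k1 * k2 / (g1 * g2)
    var_m = mean_m
    var_n = mean_n * (1.0 + k2 / (g1 + g2))
    cov_mn = k1 * k2 / (g1 * (g1 + g2))
    return mean_m, mean_n, var_m, var_n, cov_mn

def simulated_moments(k1, k2, g1, g2, t_end, seed=0):
    """Time-averaged moments along one Gillespie sample path of scheme (1)."""
    rng = random.Random(seed)
    m, n, t = 0, 0, 0.0
    acc = [0.0] * 5   # time integrals of m, n, m^2, n^2 and m*n
    while True:
        # Propensities: transcription, mRNA decay, translation, protein decay.
        rates = (k1, g1 * m, k2 * m, g2 * n)
        total = sum(rates)
        wait = rng.expovariate(total)
        dt = min(wait, t_end - t)
        for idx, value in enumerate((m, n, m * m, n * n, m * n)):
            acc[idx] += value * dt
        if wait >= t_end - t:
            break
        t += wait
        r = rng.random() * total
        if r < rates[0]:
            m += 1
        elif r < rates[0] + rates[1]:
            m -= 1
        elif r < rates[0] + rates[1] + rates[2]:
            n += 1
        else:
            n -= 1
    avg_m, avg_n, avg_mm, avg_nn, avg_mn = (a / t_end for a in acc)
    return (avg_m, avg_n, avg_mm - avg_m ** 2, avg_nn - avg_n ** 2,
            avg_mn - avg_m * avg_n)
```

For a sufficiently long run the simulated values should scatter around the analytic ones; the agreement is statistical, so only loose tolerances are meaningful.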
The chemical kinetics description (1) for spontaneous gene expression is referred
to as the two-stage model (Thattai and van Oudenaarden 2001; Shahrezaei and Swain
2008a), the stages being those of transcription and translation, as opposed to a more
detailed three-stage model (Shahrezaei and Swain 2008a; Blake et al. 2003; Raser and
O’Shea 2004; Coulon et al. 2010; Raj et al. 2006), which includes an upstream mechanism of promoter transitioning between two or more states each associated with a
distinct rate constant for transcription. Such promoter transitioning can be attributed to
stochastic binding to and unbinding from the promoter of regulatory factors modifying
the transcription initiation rate (Golding et al. 2005). If these processes of binding and
dissociation are fast compared to the transcription dynamics—such a limit was previously considered for prokaryotic gene expression by Shea and Ackers (1985) in the
deterministic and by Hornos et al. (2005) in the stochastic contexts—then the threestage model reduces to the two-stage one, with k1 in (1) equal to the weighted average
of the transcription rates associated with the individual promoter states, wherein the
weights are the stationary probabilities of these states.
Although the focus of this paper is on the two-stage model, let us briefly summarise the related results obtained for the model with three stages: as in the analysis presented
above for the master equation (6), the first and second moments of mRNA and protein
counts have been determined for the stochastic three-stage model (Raj et al. 2006;
Paszek 2007). The stationary distribution of mRNA levels, but not that of proteins,
has been completely characterised for the case of constitutive transcription from a
transitioning promoter (Raj et al. 2006; Innocentini and Hornos 2007; Iyer-Biswas
et al. 2009) using generating-function methods. These methods have also been used
to characterise the steady-state protein distribution in stochastic models which do not
explicitly include a representation of translation, notably in an early study of Peccoud
and Ycart (1995) and also in an analysis of stochastic gene autoregulation (Hornos
et al. 2005).
In this paper we present the following results on the large-time behaviour of the
solution to problem (6)–(7): we shall find the generating function of the stationary
distribution of mRNA and protein counts and use this to derive analytic formulae
for the marginal stationary probabilities and to find the first four central stationary
moments, thus determining the skewness and kurtosis of the marginal distributions
whilst reiterating the results (8)–(9) for their means and variances. We also present
a numerical method for obtaining the joint stationary distribution of mRNA and protein counts from its generating function and study the asymptotic properties of the
generating function in order to interpret the results of our analysis.
Note that in (1) the dynamics of mRNA is functionally independent of the amount
of protein in the system; indeed, the amount of mRNA in the system evolves according
to a simple one-dimensional Markov process, known as the immigration-and-death
process (Cox and Miller 1977) due to its early applications in stochastic modelling of
population growth (Kendall 1949); it is well known that the stationary distribution of
this process is Poissonian (Kendall 1949; Cox and Miller 1977). We shall not repeat
the standard analysis of the immigration-and-death process here; instead, we focus
on the derivation of the joint (stationary) distribution of mRNA and protein levels,
obtaining the result for mRNAs in Section 3 as a trivial corollary of our analysis.
2 The generating function of the stationary distribution
As a rule of thumb, the master equation has a unique equilibrium distribution [see van
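Kampen (2006) for details and also some examples for which it is not true]. Thus, for
all initial conditions, the solution $p_{m,n}(t)$ of (6)–(7) tends as t → +∞ to a limit $p^{\mathrm{st}}_{m,n}$
which is a time-independent solution to the master equation, i.e.
$$
\begin{aligned}
& k_1 \big(p^{\mathrm{st}}_{m-1,n} - p^{\mathrm{st}}_{m,n}\big) + \gamma_1 \big((m+1) p^{\mathrm{st}}_{m+1,n} - m p^{\mathrm{st}}_{m,n}\big) \\
&\quad + k_2 m \big(p^{\mathrm{st}}_{m,n-1} - p^{\mathrm{st}}_{m,n}\big) + \gamma_2 \big((n+1) p^{\mathrm{st}}_{m,n+1} - n p^{\mathrm{st}}_{m,n}\big) = 0,
\end{aligned}
\tag{10}
$$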
and satisfies the normalizing condition
$$
\sum_{m,n} p^{\mathrm{st}}_{m,n} = 1.
\tag{11}
$$
The standard approach to solving recurrence relations such as (10)–(11) is to consider
the corresponding generating function (e.g. Cox and Miller 1977) which is defined by
$$
\varphi(x, y) = \sum_{m,n} x^m y^n p^{\mathrm{st}}_{m,n}.
\tag{12}
$$
Multiplying (10) by the factor $x^m y^n$ and summing over m and n, we obtain that the
generating function satisfies the linear first-order PDE
$$
\big(\gamma_1 (x - 1) + k_2 x (1 - y)\big) \frac{\partial \varphi}{\partial x} + \gamma_2 (y - 1) \frac{\partial \varphi}{\partial y} = k_1 (x - 1) \varphi.
\tag{13}
$$
The normalizing condition (11) implies
$$
\varphi(1, 1) = 1.
\tag{14}
$$
The characteristics of (13) are defined by
$$
\dot{x} = \gamma_1 (x - 1) + k_2 x (1 - y), \qquad
\dot{y} = \gamma_2 (y - 1), \qquad
\dot{\varphi} = k_1 (x - 1) \varphi.
\tag{15}
$$
Although the system (15) is exactly solvable, its solutions (and their ϕ components in
particular) are given by complex formulae from which the functional form of solutions
to the PDE (13) cannot be easily inferred. Therefore, we shall not analyse the system
(15) any further; instead, we shall use a suitable ansatz to find a particular solution
to (13)-(14) and then we shall demonstrate that it is the only solution to the problem.
First, we simplify (13)-(14) by changing the variables according to
$$
x = 1 + u, \qquad y = 1 + v, \qquad \varphi = \exp(\psi),
\tag{16}
$$
and obtain that the factorial cumulant generating function (Johnson et al. 2005) ψ =
ψ(u, v) is a solution of the inhomogeneous linear PDE,
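$$
\big(\gamma_1 u - k_2 (1 + u) v\big) \frac{\partial \psi}{\partial u} + \gamma_2 v \frac{\partial \psi}{\partial v} = k_1 u,
\tag{17}
$$
subject to
$$
\psi(0, 0) = 0.
\tag{18}
$$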
Let us look for ψ(u, v) in the form of a power series,
$$
\psi(u, v) = \sum_{m,n} u^m v^n a_{m,n}.
$$
Immediately, we obtain
$$
a_{0,0} = \psi(0, 0) = 0.
$$
The PDE (17) implies the following recurrence relation for the sequence $a_{m,n}$:
$$
(\gamma_1 m + \gamma_2 n)\, a_{m,n} = k_2 \big((m + 1)\, a_{m+1,n-1} + m\, a_{m,n-1}\big) + k_1 \delta_{m,1} \delta_{n,0}.
\tag{19}
$$
Taking n = 0 in (19), we obtain $\gamma_1 m\, a_{m,0} = k_1 \delta_{m,1}$, which implies that $a_{m,0} = 0$ for
all m ≥ 2. Setting n = 1 in (19), we find that $a_{m,1}$ is a linear combination of $a_{m+1,0}$
and $a_{m,0}$; therefore, $a_{m,1} = 0$ for all m ≥ 2. Clearly, we can successively use (19) for
increasing integer values of n to obtain
$$
a_{m,n} = 0, \qquad m \ge 2, \ n \ge 0.
$$
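It remains to determine $\{a_{0,n}\}_{n \ge 1}$ and $\{a_{1,n}\}_{n \ge 0}$. Setting m = 1 in (19) yields a linear
first-order recurrence equation for the latter,
$$
(\gamma_1 + \gamma_2 n)\, a_{1,n} = k_2\, a_{1,n-1} + k_1 \delta_{n,0},
$$
which gives
$$
a_{1,n} = \frac{k_1 (k_2/\gamma_2)^n}{\gamma_1 (1 + \gamma_1/\gamma_2)_n}, \qquad n \ge 0,
\tag{20}
$$
where
$$
(c)_n = c (c + 1)(c + 2) \cdots (c + n - 1), \qquad (c)_0 = 1.
$$
To determine the terms of the sequence $\{a_{0,n}\}_{n \ge 1}$, we set m = 0 in (19), and obtain
$$
\gamma_2 n\, a_{0,n} = k_2\, a_{1,n-1},
$$
which implies
$$
a_{0,n} = \frac{k_1 (k_2/\gamma_2)^n}{\gamma_1 n (1 + \gamma_1/\gamma_2)_{n-1}} = \frac{k_1 (k_2/\gamma_2)^n}{\gamma_2 n (\gamma_1/\gamma_2)_n}, \qquad n \ge 1.
\tag{21}
$$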
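Thus, we find that all coefficients $a_{m,n}$, except for those given by (20)–(21), are zero,
which implies that the solution ψ(u, v) of (17)–(18) satisfies
$$
\begin{aligned}
\psi(u, v) &= \sum_{m,n} u^m v^n a_{m,n} = \sum_{n \ge 1} v^n a_{0,n} + u \sum_{n \ge 0} v^n a_{1,n} \\
&= \frac{k_1}{\gamma_2} \sum_{n \ge 1} \frac{(k_2 v/\gamma_2)^n}{n (\gamma_1/\gamma_2)_n} + \frac{k_1 u}{\gamma_1} \sum_{n \ge 0} \frac{(k_2 v/\gamma_2)^n}{(1 + \gamma_1/\gamma_2)_n}.
\end{aligned}
\tag{22}
$$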
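The closed-form coefficients (20)–(21) can be checked directly against the recurrence (19). The following sketch does this numerically; the helper names are hypothetical and the parameter values used in any check are arbitrary.

```python
def pochhammer(c, n):
    """Rising factorial (c)_n = c (c+1) ... (c+n-1), with (c)_0 = 1."""
    result = 1.0
    for j in range(n):
        result *= c + j
    return result

def coefficient(m, n, k1, k2, g1, g2):
    """Power-series coefficient a_{m,n} of psi(u, v), following (20)-(21)."""
    if m < 0 or n < 0 or m >= 2 or (m == 0 and n == 0):
        return 0.0   # all remaining coefficients vanish
    lam, beta = g1 / g2, k2 / g2
    if m == 1:
        return (k1 / g1) * beta ** n / pochhammer(1.0 + lam, n)
    return (k1 / g2) * beta ** n / (n * pochhammer(lam, n))

def recurrence_residual(m, n, k1, k2, g1, g2):
    """Left-hand side minus right-hand side of the recurrence (19)."""
    def a(i, j):
        return coefficient(i, j, k1, k2, g1, g2)
    lhs = (g1 * m + g2 * n) * a(m, n)
    rhs = k2 * ((m + 1) * a(m + 1, n - 1) + m * a(m, n - 1))
    if m == 1 and n == 0:
        rhs += k1   # inhomogeneous term k1 * delta_{m,1} * delta_{n,0}
    return lhs - rhs
```

The residual should vanish to rounding error for every pair of non-negative indices, which is a quick safeguard against sign or index slips when transcribing the series.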
We now show that the constructed particular solution ψ(u, v) is the only solution to
the problem (17)–(18). Let us consider another (possibly different) solution ψ̃(u, v)
to the same problem, i.e. let ψ̃(u, v) be an arbitrary differentiable function of two
variables which satisfies
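$$
\big(\gamma_1 u - k_2 (1 + u) v\big) \frac{\partial \tilde{\psi}}{\partial u} + \gamma_2 v \frac{\partial \tilde{\psi}}{\partial v} = k_1 u, \qquad \tilde{\psi}(0, 0) = 0.
$$
Then $\psi_h = \tilde{\psi} - \psi$ satisfies the homogeneous equation
$$
\big(\gamma_1 u - k_2 (1 + u) v\big) \frac{\partial \psi_h}{\partial u} + \gamma_2 v \frac{\partial \psi_h}{\partial v} = 0,
\tag{23}
$$
and the additional condition
$$
\psi_h(0, 0) = \tilde{\psi}(0, 0) - \psi(0, 0) = 0.
\tag{24}
$$
The characteristics of (23) satisfy the planar system
$$
\dot{u} = \gamma_1 u - k_2 (1 + u) v, \qquad \dot{v} = \gamma_2 v.
\tag{25}
$$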
According to the method of characteristics, the solution ψh (u, v) to the homogeneous
PDE (23) is constant along each trajectory of the dynamical system (25). Since the
origin (u, v) = (0, 0) is an (unstable) node of the system (25), there are two possible scenarios for the qualitative behaviour of the solution ψh (u, v): either ψh (u, v) is
constant everywhere, or it is discontinuous at the trivial node. The latter option is not
appropriate in our case because the additional condition (24) requires ψh (u, v) to be
well-behaved around the origin. Therefore, ψh (u, v) ≡ ψh (0, 0) = 0, which implies
that ψ̃(u, v) = ψ(u, v) for all u and v; thus, the function ψ(u, v) given by (22) is
indeed the only solution to the problem (17)–(18).
Thus, having found a unique solution (22) to the PDE (17)–(18), we return to the
original variables according to (16), obtaining that the solution to (13)–(14), which
gives the generating function of the stationary distribution of mRNA and protein
amounts, is given by
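$$
\varphi(x, y) = \exp\left( \frac{k_1}{\gamma_2} \sum_{n \ge 1} \frac{(k_2/\gamma_2)^n (y - 1)^n}{n (\gamma_1/\gamma_2)_n} + \frac{k_1 (x - 1)}{\gamma_1} \sum_{n \ge 0} \frac{(k_2/\gamma_2)^n (y - 1)^n}{(1 + \gamma_1/\gamma_2)_n} \right).
$$
The above formula can be rewritten as
$$
\varphi(x, y) = \exp\left( \alpha\beta \int_1^y M(1, 1 + \lambda, \beta(s - 1))\, ds + \alpha (x - 1) M(1, 1 + \lambda, \beta(y - 1)) \right),
\tag{26}
$$
where
$$
M(a, b, z) = \sum_{n=0}^{\infty} \frac{(a)_n z^n}{(b)_n n!}
$$
is Kummer’s function (Abramowitz and Stegun 1972) and
$$
\lambda = \frac{\gamma_1}{\gamma_2}, \qquad \alpha = \frac{k_1}{\gamma_1}, \qquad \beta = \frac{k_2}{\gamma_2},
\tag{27}
$$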
so that λ is the ratio between protein and mRNA half-lives, which is typically greater
than one and often λ ≫ 1 (refer back to Sect. 1 for details), α is the average number
of mRNA molecules, ranging from less than one to a few dozen, and β gives the ratio
between the average protein and the average mRNA abundances, which is typically
quite high (see Lu et al. (2007), Taniguchi et al. (2010) or Sect. 1).
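A quick numerical sanity check of the generating function is to evaluate it from the series form (22) and substitute it into the PDE (13). The sketch below does so in the dimensionless parameters (27), with (13) divided through by γ2; the function names, the truncation tolerances and the finite-difference step are illustrative assumptions.

```python
import math

def kummer_1(b, z, tol=1e-16, max_terms=500):
    """Power series of Kummer's function M(1, b, z) = sum_n z^n / (b)_n."""
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= z / (b + n)
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def log_gen_fun(x, y, alpha, beta, lam, tol=1e-16, max_terms=500):
    """psi = log(phi) from the series (22), with u = x - 1 and v = y - 1."""
    u, v = x - 1.0, y - 1.0
    term, first = 1.0, 0.0
    for n in range(1, max_terms):
        term *= beta * v / (lam + n - 1)   # term = (beta v)^n / (lam)_n
        first += term / n
        if abs(term) < tol:
            break
    # k1/gamma_2 = alpha*lam and k1/gamma_1 = alpha in the notation (27).
    return alpha * lam * first + alpha * u * kummer_1(1.0 + lam, beta * v)

def gen_fun(x, y, alpha, beta, lam):
    """Joint generating function phi(x, y)."""
    return math.exp(log_gen_fun(x, y, alpha, beta, lam))

def pde_residual(x, y, alpha, beta, lam, h=1e-5):
    """Residual of (13) divided by gamma_2, using central differences."""
    phi_x = (gen_fun(x + h, y, alpha, beta, lam) - gen_fun(x - h, y, alpha, beta, lam)) / (2 * h)
    phi_y = (gen_fun(x, y + h, alpha, beta, lam) - gen_fun(x, y - h, alpha, beta, lam)) / (2 * h)
    lhs = (lam * (x - 1) + beta * x * (1 - y)) * phi_x + (y - 1) * phi_y
    return lhs - alpha * lam * (x - 1) * gen_fun(x, y, alpha, beta, lam)
```

Evaluating `gen_fun(1.0, 1.0, ...)` should return one, in line with (14), and the PDE residual should be small at any test point where the series converges comfortably.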
3 The marginal generating functions and distributions
In this section, we use the formula for the stationary generating function (26) to
characterize the underlying distribution. In particular, we shall determine the marginal stationary probabilities,
$$
p^{\mathrm{st}}_{m,\cdot} = \sum_{n} p^{\mathrm{st}}_{m,n}, \qquad
p^{\mathrm{st}}_{\cdot,n} = \sum_{m} p^{\mathrm{st}}_{m,n}.
\tag{28}
$$
Here, $p^{\mathrm{st}}_{m,\cdot}$ gives the probability of having m mRNA molecules and any number (including zero) of protein; $p^{\mathrm{st}}_{\cdot,n}$ is the probability of having n protein molecules and an
arbitrary amount of mRNA. Obviously, each of these two sequences of probabilities
sums to unity, so that each forms a probability distribution; these two are referred to
as the marginal distributions of mRNA (or protein) levels, and the two-dimensional
distribution $p^{\mathrm{st}}_{m,n}$ is known as the joint distribution.
The generating functions corresponding to the marginal distributions are defined by
$$
\varphi_1(x) = \sum_{m} x^m p^{\mathrm{st}}_{m,\cdot}, \qquad
\varphi_2(y) = \sum_{n} y^n p^{\mathrm{st}}_{\cdot,n}.
\tag{29}
$$
Below, we refer to ϕ1 (x) and ϕ2 (y) as the marginal generating functions; similarly,
the generating function ϕ(x, y) of the joint distribution $p^{\mathrm{st}}_{m,n}$ is referred to as the joint
generating function. The marginal and joint generating functions are related by
$$
\varphi_1(x) = \varphi(x, 1), \qquad \varphi_2(y) = \varphi(1, y),
\tag{30}
$$
so that we can use formula (26) for the latter to obtain formulae for the former.
To determine the marginal mRNA distribution, we set y = 1 in (26), finding that
$$
\varphi_1(x) = \exp(\alpha (x - 1)).
\tag{31}
$$
Such a generating function corresponds to the Poisson distribution (Johnson et al.
2005), i.e.
$$
p^{\mathrm{st}}_{m,\cdot} = \frac{e^{-\alpha} \alpha^m}{m!}.
\tag{32}
$$
This result is a trivial consequence of the fact that the mRNA dynamics in the two-stage
gene-expression model is governed by a simple univariate immigration-and-death
Markov process. Clearly, the Poisson distribution of mRNA levels can be determined
without the knowledge of the joint distribution: inserting y = 1 in the PDE (13) for
the joint generating function, we obtain that the marginal mRNA generating function
satisfies the linear first-order ODE $\gamma_1 \varphi_1' = k_1 \varphi_1$; solving this equation subject to the
normalizing condition ϕ1 (1) = 1, we arrive at the formula (31), which implies the
Poisson distribution of mRNA levels.
Setting x = 1 in (26), we find that the marginal protein generating function ϕ2 (y)
can be written as
$$
\varphi_2(y) = \exp(\psi_2(y)),
\tag{33}
$$
where ψ2 (y) is defined by
$$
\psi_2(y) = \alpha\beta \int_1^y M(1, 1 + \lambda, \beta(s - 1))\, ds.
\tag{34}
$$
Generating functions of a similar, but non-identical, form, involving an exponential
of a power series in y, are known to correspond to the family of the Poisson-stopped
sum distributions (Johnson et al. 2005), which have been used to model growth of
a heterogeneous population (Neyman 1939). Some of the methods developed for the
Poisson-stopped sum family are readily applicable to the marginal protein distribution
(33)–(34); below, we find a recursive formula for the marginal probabilities $p^{\mathrm{st}}_{\cdot,n}$ using
a procedure previously suggested for a particular type of the Poisson-stopped sum
by Gurland (1958).
The marginal protein distribution can be obtained from its generating function as
$$
p^{\mathrm{st}}_{\cdot,n} = \left. \frac{D^n(\varphi_2(y))}{n!} \right|_{y=0},
\tag{35}
$$
where D denotes the differential operator d/dy. Equation (35) requires us to find the
nth derivative of the composite function ϕ2 (y) = exp(ψ2 (y)); the first derivative is
obtained by the chain rule,
$$
\frac{d\varphi_2(y)}{dy} = \varphi_2(y) \frac{d\psi_2(y)}{dy}.
\tag{36}
$$
Taking the (n −1)th derivative of (36) and differentiating the product on the right-hand
side according to the Leibniz rule (Gurland 1958), we have
$$
D^n(\varphi_2(y)) = \sum_{i=0}^{n-1} \binom{n-1}{i} D^i(\varphi_2(y))\, D^{n-i}(\psi_2(y)),
\tag{37}
$$
which gives the nth derivative of ϕ2 (y) expressed in terms of its lower-order derivatives, thus allowing us to evaluate the derivatives of ϕ2 (y) of any order recursively; an
alternative, nonrecursive expression for $D^n(\varphi_2(y))$ can be obtained using the Faà di
Bruno formula (Johnson 2002) but, as this leads to algebraic expressions that reveal
little, we adhere to the recursive formulation in this exposition. Equation (37) can be
rewritten in a form which will be more convenient for us in what follows:
$$
\frac{D^n(\varphi_2(y))}{n!} = \frac{1}{n} \sum_{i=0}^{n-1} \frac{D^{n-i}(\psi_2(y))}{(n - i - 1)!}\, \frac{D^i(\varphi_2(y))}{i!}.
\tag{38}
$$
Obviously, we still need to determine the derivatives of the inner function ψ2 (y),
which appear in our recursive formula (38); this is not difficult: the rth derivative of ψ2 (y),
r being an arbitrary positive integer, is given by
$$
D^r(\psi_2(y)) = \frac{\alpha \beta^r (r - 1)!}{(1 + \lambda)_{r-1}}\, M(r, r + \lambda, \beta(y - 1)),
\tag{39}
$$
in which we used that for the derivative of Kummer’s function, we have (Abramowitz
and Stegun 1972)
$$
\frac{d^s}{dz^s} M(a, b, z) = \frac{(a)_s}{(b)_s}\, M(a + s, b + s, z),
$$
for any nonnegative integer s. Expressing the derivatives of ψ2 (y) in (38) according
to (39) and then taking y = 0, we arrive at the recursive formula for the marginal
protein probabilities,
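$$
p^{\mathrm{st}}_{\cdot,n} = \frac{\alpha\beta}{n} \sum_{i=0}^{n-1} \frac{\beta^{n-i-1}}{(1 + \lambda)_{n-i-1}}\, M(n - i, n - i + \lambda, -\beta)\, p^{\mathrm{st}}_{\cdot,i},
\tag{40}
$$
where the first term of the sequence is given by
$$
p^{\mathrm{st}}_{\cdot,0} = \varphi_2(0) = \exp\left( -\alpha\beta \int_0^1 M(1, 1 + \lambda, \beta(s - 1))\, ds \right).
\tag{41}
$$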
Further properties of the marginal protein stationary distribution, e.g. the cumulants
and the first four central moments, are derived in the Appendix.
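The recursion (40)–(41) translates into a few lines of code. The sketch below is one possible implementation under stated assumptions: Kummer's function is summed from its power series (adequate for moderate β), and the integral in (41) is obtained by integrating that series term by term, which gives ψ2(0) = α Σ_{j≥1} (−β)^j / (j (1 + λ)_{j−1}). The function names are hypothetical.

```python
import math

def kummer(a, b, z, tol=1e-16, max_terms=1000):
    """Power series M(a, b, z) = sum_k (a)_k z^k / ((b)_k k!)."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * z / ((b + k) * (k + 1))
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def protein_marginal(n_max, alpha, beta, lam):
    """Marginal protein probabilities p_0, ..., p_{n_max} via (40)-(41)."""
    # psi_2(0) / alpha from term-by-term integration of the Kummer series.
    term, series = -beta, 0.0   # term = (-beta)^j / (1 + lam)_{j-1}
    for j in range(1, 500):
        series += term / j
        term *= -beta / (lam + j)
        if abs(term) < 1e-18:
            break
    # weights[r-1] = beta^(r-1) / (1 + lam)_{r-1} * M(r, r + lam, -beta)
    weights, coef = [], 1.0
    for r in range(1, n_max + 1):
        weights.append(coef * kummer(r, r + lam, -beta))
        coef *= beta / (lam + r)
    p = [math.exp(alpha * series)]
    for n in range(1, n_max + 1):
        s = sum(weights[n - i - 1] * p[i] for i in range(n))
        p.append(alpha * beta * s / n)
    return p
```

With the parameter values of Fig. 1 (α = 2, β = 5/3, λ = 1/3), the computed probabilities should sum to one and reproduce the mean αβ from (8) and the variance from (9), provided `n_max` is large enough for the tail to be negligible.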
4 Calculating the joint distribution using the discrete Fourier transform
By the definition (12) of the generating function ϕ(x, y), the probabilities $p^{\mathrm{st}}_{m,n}$ are
the coefficients of the power-series expansion of ϕ(x, y); therefore, these probabilities
can in theory be obtained by evaluating the partial derivatives of any order of ϕ(x, y)
at x = y = 0. Unfortunately, such a direct approach to finding $p^{\mathrm{st}}_{m,n}$ does not yield
results as neat as those obtained in Sect. 3 for the marginal distributions. Therefore,
we choose an alternative approach: we shall determine numerically the joint stationary distribution of mRNA and protein counts from the generating function using the
discrete Fourier transform. We begin by considering two positive integers M, N and
the following values of the generating function:
$$
A_{k,l} = \varphi\left(e^{-2\pi i k/M}, e^{-2\pi i l/N}\right), \qquad k = 0, \ldots, M - 1, \quad l = 0, \ldots, N - 1.
$$
Using the definition (12) of the generating function ϕ(x, y), we find that
$$
A_{k,l} = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} p^{\mathrm{st}}_{m,n} \exp\left( -2\pi i \left( \frac{mk}{M} + \frac{nl}{N} \right) \right).
\tag{42}
$$
If M and N are sufficiently large, we can truncate the infinite sum in (42) by excluding
the terms for which m ≥ M or n ≥ N without introducing a significant numerical
error, so that we can write
$$
A_{k,l} \approx \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} p^{\mathrm{st}}_{m,n} \exp\left( -2\pi i \left( \frac{mk}{M} + \frac{nl}{N} \right) \right),
\tag{43}
$$
which implies that the M × N matrix $A_{k,l}$ is approximately equal to the discrete Fourier
transform (Press et al. 2007) of the truncated probability sequence $p^{\mathrm{st}}_{m,n}$. Taking the
inverse discrete Fourier transform (Press et al. 2007) of (43), we obtain
$$
p^{\mathrm{st}}_{m,n} \approx \frac{1}{MN} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} A_{k,l} \exp\left( 2\pi i \left( \frac{mk}{M} + \frac{nl}{N} \right) \right), \qquad m < M, \ n < N.
\tag{44}
$$
The remaining terms of the probability distribution $p^{\mathrm{st}}_{m,n}$ are negligibly small. The
right-hand side of (44) can be evaluated efficiently using the fast Fourier transform
algorithm (Press et al. 2007); thus, (44) provides us with the desired numerical recipe
for computing the joint stationary distribution from the matrix Ak,l .
Let us now focus on the problem of the numerical evaluation of the terms Ak,l .
It can easily be obtained from (42) that these terms satisfy the Hermitian property
$$
A_{M-k,N-l} = \overline{A_{k,l}}, \qquad k < M, \ l < N,
$$
and therefore it is sufficient to compute $A_{k,l}$ for l < N/2 + 1 only. From the functional
form of ϕ(x, y), (26), we see that in order to obtain the values $A_{k,l} = \varphi(e^{-2\pi i k/M}, e^{-2\pi i l/N})$,
we need to evaluate the function
$$
F(y) = M(1, 1 + \lambda, \beta(y - 1))
$$
at $y = e^{-2\pi i l/N}$; we are also required to find a sequence of complex integrals of this
function,
$$
I_l = \int_1^{e^{-2\pi i l/N}} F(z)\, dz, \qquad 0 \le l < N/2 + 1.
$$
The former task is normally straightforward as most mathematical software provides
an implementation of Kummer’s function; to evaluate the integrals, note that
$$
I_0 = 0.
$$
For l > 0, given that N is large enough, we can use the trapezium rule to obtain that
$$
\begin{aligned}
I_l &= I_{l-1} + \int_{e^{-2\pi i (l-1)/N}}^{e^{-2\pi i l/N}} F(z)\, dz \\
&\approx I_{l-1} + \frac{1}{2} \left( F(e^{-2\pi i l/N}) + F(e^{-2\pi i (l-1)/N}) \right) \left( e^{-2\pi i l/N} - e^{-2\pi i (l-1)/N} \right),
\end{aligned}
$$
which enables us to compute the terms $I_l$ recursively.
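The recipe of this section can be sketched in dependency-free Python as follows. To keep the example self-contained, Kummer's function is summed from its power series and the inverse transform (44) is evaluated as a direct separable sum rather than by a fast Fourier transform; both are substitutions made for this illustration, and the function names and default truncation sizes are assumptions. For production use a library FFT routine (for instance `numpy.fft.ifft2`, which uses the same sign and normalisation convention as (44)) would be the natural replacement.

```python
import cmath
import math

def kummer_1(b, z, tol=1e-16, max_terms=500):
    """Power series of M(1, b, z) = sum_n z^n / (b)_n; z may be complex."""
    term, total = 1.0 + 0j, 1.0 + 0j
    for n in range(max_terms):
        term *= z / (b + n)
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def joint_distribution(alpha, beta, lam, M=32, N=32):
    """Approximate p[m][n], m < M, n < N, from (26) and (44)."""
    half = N // 2
    y = [cmath.exp(-2j * math.pi * l / N) for l in range(half + 1)]
    F = [kummer_1(1.0 + lam, beta * (yl - 1.0)) for yl in y]
    I = [0j]
    for l in range(1, half + 1):   # trapezium rule along the unit circle
        I.append(I[-1] + 0.5 * (F[l] + F[l - 1]) * (y[l] - y[l - 1]))
    for l in range(half + 1, N):   # remaining l by the Hermitian property
        F.append(F[N - l].conjugate())
        I.append(I[N - l].conjugate())
    x = [cmath.exp(-2j * math.pi * k / M) for k in range(M)]
    A = [[cmath.exp(alpha * beta * I[l] + alpha * (x[k] - 1.0) * F[l])
          for l in range(N)] for k in range(M)]
    wM = [cmath.exp(2j * math.pi * j / M) for j in range(M)]
    wN = [cmath.exp(2j * math.pi * j / N) for j in range(N)]
    # Separable inverse DFT: sum over l first, then over k; keep the real part.
    B = [[sum(A[k][l] * wN[(n * l) % N] for l in range(N)) for n in range(N)]
         for k in range(M)]
    return [[(sum(B[k][n] * wM[(m * k) % M] for k in range(M)) / (M * N)).real
             for n in range(N)] for m in range(M)]
```

Summing the result over n should recover the Poisson probabilities (32) essentially to rounding error, because the l = 0 column of $A_{k,l}$ involves no numerical integration; the protein marginal, by contrast, inherits the trapezium-rule error, which shrinks as N grows.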
We implemented the above-described recipe for finding the joint stationary distribution in the programming language Python (version 2.5) enhanced by its numerical
and scientific packages NumPy and SciPy (versions 1.2.1 and 0.7.1 respectively). We
used the irfft2 routine from the module numpy.fft to compute the inverse discrete Fourier transform of a two-dimensional sequence of complex numbers satisfying
the Hermitian property as discussed above. For finding values of Kummer’s function,
we used the routine hyp1f1 from the module scipy.special. In Fig. 1, which
was prepared using Python’s plotting package Matplotlib, we show a graphical representation of the distribution $p^{\mathrm{st}}_{m,n}$ for parameter values α = 2, β = 5/3, λ = 1/3
and for the truncation integers M = N = 64. Each of the squares in the central
graph of the figure corresponds to a particular pair of indices m and n; the color of the
square relates to the value of $p^{\mathrm{st}}_{m,n}$ for the given index pair. The two bar charts next to
the central graph depict the marginal distributions of mRNAs and proteins, each of
which has been calculated from the numerically evaluated joint distribution $p^{\mathrm{st}}_{m,n}$ by
summing it over rows or columns (the marginal probabilities calculated in this way
are shown as semi-transparent white bars in the charts) and, independently, using the
exact analytic expressions (32) and (40)–(41), the evaluations of which are visualised
as background grey bars. The close agreement between the numerical and analytic
results observed in the bar charts of Fig. 1 indicates that the numerical method for
evaluating the joint distribution introduces a small error only.
In Fig. 1 the mRNA and protein counts are positively correlated, the distribution of
protein counts being more widely-spread and heavy-tailed than that of mRNA levels.
Intuitively, the correlation results from mRNA acting as an upstream element which
positively regulates protein production. The relatively large variability in protein levels
is due to the variability in the expression of the upstream element being transmitted
down the regulatory pathway. Accordingly, in models for stochastic gene expression
which include promoter transitions (e.g. Raj et al. 2006), the distribution of the mRNA
levels can be more heavy-tailed than the Poisson distribution in the vertical bar chart of
Fig. 1, since random transitions between promoter states introduce an extra source of
stochasticity which is transmitted downstream to the mRNA levels and through those
down to the levels of protein. The Poisson distribution of mRNA levels is however
appropriate if promoter transitions do not represent a significant source of stochasticity
in gene expression (Yu et al. 2006).
Fig. 1 The joint and marginal distributions of mRNA and protein counts. The mRNA and protein levels are
positively correlated. The marginal distribution of protein counts is more widely-spread and has a heavier
tail than that of mRNA levels
5 Asymptotic analysis of the stationary distribution
In the previous sections we used the formula (26) for the generating function ϕ(x, y)
to find a recursive expression for the marginal protein distribution $p^{st}_{\cdot,n}$ and to compute
the joint distribution $p^{st}_{m,n}$ via the discrete Fourier transform. Here we shall examine the
asymptotic behaviour of the function ϕ(x, y) in order to understand the properties of
the underlying distributions qualitatively. Among the asymptotic regimes considered
below, some are realistic and have been experimentally demonstrated. Other regimes
are less realistic, yet they help us in exploring the parameter space of the investigated
distributions. The method of using limit cases to understand a complex stochastic phenomenon has previously been used in the field of gene expression by Hornos et al. (2005),
who examined a gene autoregulation model in the limit case of slow, as well as in the
case of fast, promoter transitions.
The first asymptotic case we focus on is that of β/(1 + λ) = k2/(γ1 + γ2) ≪ 1,
which requires that either k2/γ1 ≪ 1, i.e. that most mRNAs are not translated, or
k2/γ2 ≪ 1, so that mRNAs are much more abundant than proteins, the latter of
which, however, is not in agreement with experimental evidence (Lu et al. 2007).
If y is fixed and β ≪ 1 + λ, Kummer’s function satisfies
\[
M(1, 1+\lambda, \beta(y-1)) = 1 + O\!\left(\frac{\beta}{1+\lambda}\right).
\]
Therefore, for the joint generating function ϕ(x, y) we have
\[
\begin{aligned}
\varphi(x, y) &= \exp\left(\alpha\beta \int_1^y M(1, 1+\lambda, \beta(s-1))\,ds + \alpha(x-1)\,M(1, 1+\lambda, \beta(y-1))\right) \\
&= \exp\bigl(\alpha\beta(y-1) + \alpha(x-1)\bigr) + O\!\left(\frac{\beta}{1+\lambda}\right).
\end{aligned}
\tag{45}
\]
Fig. 2 The a Poisson, b Neyman and c negative binomial distributions (white bars), compared to the exact
stationary distribution of protein counts (black bars in the background) for illustrative parameter sets (details
in the main text)
The leading-order term in (45) is the joint generating function of two Poissonian
random variables which are independent as the function factorises into a product of its
marginals. Thus, if β ≪ 1 + λ, the stationary distribution of protein levels converges
weakly to the Poisson distribution with mean αβ which is statistically independent
of the amount of mRNA in the system (which, as we showed before, has the Poisson
distribution with mean α). In Fig. 2a we compare the exact stationary distribution $p^{st}_{\cdot,n}$
of protein levels, obtained by the recursive formula (40)–(41) for a chosen parameter
set (λ = 20, α = 1, β = 4), to the Poisson distribution with mean αβ = 4, observing
a close agreement between the two.
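A comparison of this kind can be reproduced with the following Python sketch. It does not use the recursive formula (40)–(41), which is not restated here. Instead, as an assumption of this example, it differentiates ϕ2(y) = exp(ψ2(y)) with ψ2′(y) = αβ M(1, 1 + λ, β(y − 1)), expands ψ2′ in powers of y, and applies Kummer's transformation M(a, b, −z) = e^{−z} M(b − a, b, z) so that only series with non-negative terms are summed. The function names and the truncation level n_max are hypothetical.

```python
import math

def kummer(a, b, z, tol=1e-17):
    """M(a, b, z) by direct summation; used only with a >= 0, b > 0, z >= 0,
    so that every term is non-negative and no cancellation occurs."""
    total, term, n = 0.0, 1.0, 0
    while term > tol * total:
        total += term
        term *= (a + n) * z / ((b + n) * (n + 1))
        n += 1
    return total

def protein_pmf(alpha, beta, lam, n_max=120):
    """Stationary protein distribution, truncated at n_max and renormalised."""
    # c[k]: Taylor coefficients about y = 0 of psi_2'(y), which equal
    #   alpha * beta^(k+1) / (1+lam)_k * exp(-beta) * M(lam, 1+lam+k, beta).
    c, ratio = [], beta                 # ratio = beta^(k+1) / (1+lam)_k
    for k in range(n_max):
        c.append(alpha * ratio * math.exp(-beta) * kummer(lam, 1 + lam + k, beta))
        ratio *= beta / (1 + lam + k)
    # phi_2' = psi_2' * phi_2 gives (n + 1) q[n+1] = sum_k c[k] q[n-k].
    q = [1.0]
    for n in range(n_max - 1):
        q.append(sum(c[k] * q[n - k] for k in range(n + 1)) / (n + 1))
    norm = sum(q)
    return [v / norm for v in q]

alpha, beta, lam = 1.0, 4.0, 20.0       # parameter set used for Fig. 2a
exact = protein_pmf(alpha, beta, lam)
mu = alpha * beta
poisson = [math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))
           for n in range(len(exact))]
mean = sum(n * p for n, p in enumerate(exact))
var = sum(n * n * p for n, p in enumerate(exact)) - mean ** 2
tv_distance = 0.5 * sum(abs(a - b) for a, b in zip(exact, poisson))
print(mean, var, tv_distance)
```

The total variation distance printed at the end is one possible way to quantify the agreement; the text itself only reports a visual comparison.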
The Poisson distribution of protein counts represents a trivial asymptotic stationary behaviour of the gene expression model; to identify some of its nontrivial limits,
we shall consider the behaviour of the generating function for the ratio λ = γ1 /γ2
between the protein and the mRNA half-lives small or large, the latter representing a
much more biologically realistic case than the former (García-Martínez et al. 2007).
For small values of λ, we write
\[
M(1, 1+\lambda, \beta(y-1)) = e^{\beta(y-1)} + O(\lambda),
\]
which implies
\[
\begin{aligned}
\varphi(x, y) &= \exp\left(\alpha\beta \int_1^y M(1, 1+\lambda, \beta(s-1))\,ds + \alpha(x-1)\,M(1, 1+\lambda, \beta(y-1))\right) \\
&= \exp\left(\alpha\bigl(e^{\beta(y-1)} - 1\bigr) + \alpha(x-1)\,e^{\beta(y-1)}\right) + O(\lambda) \\
&= \exp\left(\alpha\bigl(x e^{\beta(y-1)} - 1\bigr)\right) + O(\lambda).
\end{aligned}
\]
In particular, the marginal protein generating function satisfies
\[
\varphi_2(y) = \varphi(1, y) = e^{\alpha\left(e^{\beta(y-1)} - 1\right)} + O(\lambda).
\tag{46}
\]
The leading-order term on the right-hand side of (46) is the generating function of the
Neyman type A distribution (Johnson et al. 2005) with parameters α and β, which we
plot in Fig. 2b and compare to the exact distribution $p^{st}_{\cdot,n}$ of protein levels for an
illustrative parameter set (λ = 0.05, α = 1, β = 4).
Let us provide an intuitive explanation for the appearance of the Neyman probability law. The condition λ = γ1/γ2 ≪ 1 requires the degradation of proteins to be
much faster than that of mRNAs. Let us assume that this is so, even though such an assumption
is unlikely to be satisfied in real biological systems (García-Martínez et al.
2007). Then on the fast timescale of protein turnover, the number m of
mRNA molecules is constant (with high probability) and the protein dynamics evolves
according to an immigration-and-death process in which the immigration rate is k2 m
and the death rate per protein is γ2. Therefore, on the fast timescale, conditional on
having m mRNAs in the system, the distribution of protein levels equilibrates and
becomes Poisson with mean k2 m/γ2 = βm. On the slow timescale, however, the
amount of mRNA in the system, m, is subject to variation, and has a Poisson distribution with mean α, as shown in Sect. 3. Therefore, the large-time distribution of protein
levels can be expected to be Poisson(βm), where m is a Poisson(α) random variable.
The distribution of such a variable has been shown to be Neyman type A with the
generating function given by the leading-order term of (46) (Johnson et al. 2005).
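The mixture construction described above translates directly into code. The following Python sketch builds the distribution of a Poisson(βm) variable with m drawn from Poisson(α) by summing over m, and checks it against the leading-order term of (46). The function names, the truncation limits and the test point y = 0.5 are assumptions of this example.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of k events; a zero mean puts all mass at k = 0."""
    if mean == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def neyman_pmf(n, alpha, beta, m_max=60):
    """P(N = n) where N | m ~ Poisson(beta * m) and m ~ Poisson(alpha).
    The sum over m is truncated at m_max (an assumption of this sketch)."""
    return sum(poisson_pmf(m, alpha) * poisson_pmf(n, m * beta)
               for m in range(m_max))

alpha, beta = 1.0, 4.0
pmf = [neyman_pmf(n, alpha, beta) for n in range(200)]

mean = sum(n * p for n, p in enumerate(pmf))
var = sum(n * n * p for n, p in enumerate(pmf)) - mean ** 2

# Generating function of the mixture at y = 0.5, to be compared with the
# leading-order term of (46), exp(alpha * (exp(beta * (y - 1)) - 1)).
y = 0.5
g_numeric = sum(p * y ** n for n, p in enumerate(pmf))
g_formula = math.exp(alpha * (math.exp(beta * (y - 1)) - 1))
print(mean, var, g_numeric, g_formula)
```

By the laws of total expectation and total variance, the mean and variance of the mixture are αβ and αβ(1 + β), which the printed values should reproduce.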
To investigate the behaviour of ϕ(x, y) for large values of λ, we rescale the parameters α and β according to
\[
\alpha = \frac{\alpha'}{\lambda}, \qquad \beta = \lambda\beta',
\tag{47}
\]
in which β′ = k2/γ1 gives the mean number of protein molecules synthesised from
an mRNA copy before it is degraded and α′β′ = αβ gives the average number of
proteins in the cell at any time. If α′ and β′ are kept constant in (47), while λ tends
to infinity, then these characteristics of protein synthesis remain unchanged while the
average amount of mRNA, α = α′/λ, tends to zero.
Using reparametrization (47) in the formula (26) for the generating function ϕ(x, y),
we obtain
\[
\varphi(x, y) = \exp\left(\alpha'\beta' \int_1^y M(1, 1+\lambda, \lambda\beta'(s-1))\,ds + \frac{\alpha'(x-1)}{\lambda}\,M(1, 1+\lambda, \lambda\beta'(y-1))\right).
\]
Note that
\[
\begin{aligned}
M(1, 1+\lambda, \lambda\beta'(y-1)) &= \sum_{n\ge 0} \frac{\bigl(\lambda\beta'(y-1)\bigr)^n}{(1+\lambda)_n} = \sum_{n\ge 0} \bigl(\beta'(y-1)\bigr)^n + O(1/\lambda) \\
&= \frac{1}{1+\beta'(1-y)} + O(1/\lambda) \quad \text{as } \lambda \to +\infty.
\end{aligned}
\]
Therefore, for x and y fixed, we have
\[
\varphi(x, y) = \bigl(1+\beta'(1-y)\bigr)^{-\alpha'} + O(1/\lambda),
\tag{48}
\]
from which we can deduce that the stationary distribution of mRNA levels converges
weakly to a degenerate distribution which has the whole probability mass concentrated at m = 0, and that the distribution of protein levels tends to the negative
binomial distribution (Johnson et al. 2005) with parameters α′ and β′. In Fig. 2c we
compare the negative binomial to the exact distribution for a suitable parameter set
λ = 10, α′ = 1, β′ = 4. These parameter values imply an average of 0.1 mRNA and
4 protein molecules in the system, which is consistent with the experimental measurements of gene expression from the lac promoter in repressed conditions by Yu et al.
(2006), who reported an average of 0.037 mRNA and 5 protein molecules per cell.
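The limit of Kummer's function that underlies (48) can be checked numerically. The Python sketch below sums the series for M(1, 1 + λ, λβ′(y − 1)) directly and compares it with 1/(1 + β′(1 − y)) for increasing λ. As an assumption of this example, the point y is chosen so that β′(1 − y) < 1, which keeps all terms of the series bounded by 1 in magnitude and avoids numerical cancellation; the function name and the chosen values are hypothetical.

```python
def kummer_1(b, z, terms=200):
    """M(1, b, z) = sum_n z^n / (b)_n, summed term by term."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= z / (b + n)
    return total

beta_p, y = 4.0, 0.9                     # beta' and y with beta' * (1 - y) = 0.4
limit = 1.0 / (1.0 + beta_p * (1.0 - y))

# Absolute deviation from the limit for increasing lambda.
errors = {lam: abs(kummer_1(1 + lam, lam * beta_p * (y - 1)) - limit)
          for lam in (10.0, 100.0, 1000.0)}
for lam, err in errors.items():
    print(lam, err, lam * err)           # lam * err should level off for O(1/lam)
```

If the deviation is indeed O(1/λ), the product of λ and the error in the last column should approach a constant as λ grows.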
Heuristically, if λ ≫ 1 while α′ and β′ are finite, so that α is low and β is high, then
proteins are produced during infrequent and brief periods (often called bursts) of
rapid translation which occur whenever an mRNA molecule is transcribed and which
end once the mRNA is degraded (McAdams and Arkin 1997). The number of proteins
produced from a burst has been shown to be geometrically distributed with the mean
number equal to β′ (Berg 1978). In the limit of short mRNA lifetimes, it is no longer
necessary to consider the mRNA intermediates; the dynamics of protein expression
can be approximated by a univariate Markov process composed of (i) protein degradation which occurs with the rate constant γ2 and (ii) protein burst production occurring
with the rate constant k1, by which a geometrically distributed number (with mean β′)
of proteins is produced. The stationary distribution of such a Markov process has been
shown to be negative binomial with the generating function equal to the leading order
term of (48) (Paulsson and Ehrenberg 2000).
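For reference, the negative binomial probabilities corresponding to the leading-order term of (48) can be evaluated as in the following Python sketch. The closed form of the probabilities is the standard expansion of (1 + β′(1 − y))^{−α′} in powers of y; the function name, the truncation at 400 terms and the test point y = 0.5 are assumptions of this example.

```python
import math

def negative_binomial_pmf(n, alpha_p, beta_p):
    """Coefficient of y^n in (1 + beta' (1 - y))^(-alpha'):
    Gamma(n + a') / (Gamma(a') n!) * (b' / (1 + b'))^n * (1 + b')^(-a')."""
    log_p = (math.lgamma(n + alpha_p) - math.lgamma(alpha_p) - math.lgamma(n + 1)
             + n * math.log(beta_p / (1 + beta_p))
             - alpha_p * math.log(1 + beta_p))
    return math.exp(log_p)

alpha_p, beta_p = 1.0, 4.0               # alpha' and beta' used for Fig. 2c
pmf = [negative_binomial_pmf(n, alpha_p, beta_p) for n in range(400)]

mean = sum(n * p for n, p in enumerate(pmf))
var = sum(n * n * p for n, p in enumerate(pmf)) - mean ** 2

# Generating function at y = 0.5 against the leading-order term of (48).
y = 0.5
g_numeric = sum(p * y ** n for n, p in enumerate(pmf))
g_formula = (1 + beta_p * (1 - y)) ** (-alpha_p)
print(mean, var, g_numeric, g_formula)
```

With α′ = 1 the distribution reduces to a geometric one, so a single burst size and the stationary protein count follow the same law in this particular case.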
The results of the simple asymptotic analysis presented above allow us to examine
the qualitative properties of the stationary distribution of protein levels and to characterize the dependence of these properties on the model parameters. Recall that the
mean ν and the variance ν2 of the stationary distribution $p^{st}_{\cdot,n}$ of protein levels are
given by
\[
\nu = \alpha\beta, \qquad \nu_2 = \alpha\beta\left(1 + \frac{\beta}{1+\lambda}\right).
\]
Note that we always have ν2 > ν and if ν2 ≈ ν, then β ≪ 1 + λ and thus, by (45),
the protein levels have the Poisson distribution. In the general situation of ν2 > ν, we
can express the parameters α, β (or alternatively α′ and β′) in terms of the mean ν,
variance ν2 and the parameter λ as
\[
\alpha = \frac{\nu^2}{(1+\lambda)(\nu_2-\nu)}, \qquad \alpha' = \frac{\lambda\nu^2}{(1+\lambda)(\nu_2-\nu)},
\]
\[
\beta = \frac{(1+\lambda)(\nu_2-\nu)}{\nu}, \qquad \beta' = \frac{(1+\lambda)(\nu_2-\nu)}{\lambda\nu}.
\]
Fig. 3 The non-Poisson behaviour of the distribution of protein counts: a the negative binomial limit case;
b–d intermediate exact distributions; e the Neyman limit case
We find that (α, β)|λ→0 = (ν²/(ν2 − ν), (ν2 − ν)/ν), and thus, for fixed values of the mean
and variance and for λ ≪ 1, the marginal protein distribution tends to the Neyman
type A distribution with parameters ν²/(ν2 − ν) and (ν2 − ν)/ν. Similarly, since (α′, β′)|λ→+∞ =
(ν²/(ν2 − ν), (ν2 − ν)/ν), we have that for λ ≫ 1, the protein distribution converges to the negative binomial distribution with parameters ν²/(ν2 − ν) and (ν2 − ν)/ν. Thus, if the mean and
variance are fixed, the exact stationary protein distribution is specified by the parameter λ ∈ (0, +∞) and can be envisaged as an intermediate between the Neyman and
the negative binomial distributions; this is illustrated in Fig. 3, in which we show five
distributions $p^{st}_{\cdot,n}$ which all have the same mean ν = 4 and variance ν2 = 20 but have
different ratios λ = γ1/γ2, namely (a) +∞ (the negative binomial limit case), (b) 5,
(c) 1, (d) 0.2, (e) 0+ (the Neyman limit case).
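The change of parameters used for Fig. 3 can be written as a small helper. The Python sketch below maps a prescribed mean ν, variance ν2 and ratio λ to (α, β) and to the rescaled pair (α′, β′) = (λα, β/λ) from (47). The function name and the dictionary keys are hypothetical.

```python
def parameters_from_moments(nu, nu2, lam):
    """Return alpha, beta, alpha', beta' for a given mean nu, variance nu2
    and ratio lam = gamma1 / gamma2; requires nu2 > nu > 0 and lam > 0."""
    if not (nu2 > nu > 0 and lam > 0):
        raise ValueError("need nu2 > nu > 0 and lam > 0")
    excess = (1 + lam) * (nu2 - nu)
    alpha = nu ** 2 / excess
    beta = excess / nu
    return {"alpha": alpha, "beta": beta,
            "alpha_prime": lam * alpha, "beta_prime": beta / lam}

# Panels (b)-(d) of Fig. 3 share nu = 4 and nu2 = 20 and differ only in lam.
for lam in (5.0, 1.0, 0.2):
    p = parameters_from_moments(4.0, 20.0, lam)
    # Round trip: recover the mean and the variance from alpha, beta and lam.
    mean = p["alpha"] * p["beta"]
    var = mean * (1 + p["beta"] / (1 + lam))
    print(lam, p, mean, var)
```

For very small λ the pair (α, β) returned by this helper approaches the Neyman parameters, and for very large λ the pair (α′, β′) approaches the negative binomial parameters, in line with the limits stated above.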
Having investigated the discrete limits of the mRNA-protein distribution, let us
focus on identifying the continuous ones. If the mean mRNA and protein counts are
large, i.e. if
\[
\alpha \gg 1, \qquad \alpha\beta \gg 1,
\tag{49}
\]
hold simultaneously, then mRNA and protein levels are approximately normally distributed with means, variances and covariance given by (8)–(9), i.e. equal to those of
the exact distributions. This result can be derived using van Kampen’s large system
size expansion, which is applicable for an arbitrary stochastic reaction system (van
Kampen 2006).
If the first condition in (49) holds but the second does not, the protein levels
are not normally distributed; however, we then have β ≪ 1 and hence the Poisson
approximation (45) can be used instead. Thus, for α ≫ 1, the exact mRNA and protein distributions can be approximated either by Gaussians if β ≫ 1/α or by mutually
independent Poisson distributions if β ≪ 1. Notably, the latter case does not seem
biologically realistic in light of experimental evidence (Lu et al. 2007).
A mixed continuous-discrete mode of the gene expression model occurs if the
dynamics of mRNA levels is discrete and stochastic but that of protein levels is continuous and deterministic. Heuristically, for the protein dynamics to be deterministic,
we need simultaneously k2/γ1 ≫ 1 and k2/γ2 ≫ 1, so that upon the transcription
of an individual mRNA molecule a large number of proteins are produced from the
transcript before either degradation process (of proteins or of the mRNA template)
steps in to interrupt the resulting surge in protein levels. The above two conditions can
be written succinctly as one, k2/(γ1 + γ2) = β/(1 + λ) ≫ 1.
To analyse the leading-order asymptotic behaviour of the protein distribution for
β/(1 + λ) large, we consider the characteristic function χ(v) of the rescaled random
variable (1 + λ)Ns/β, in which Ns is the random variable giving the amount of proteins
at steady state. The characteristic function χ(v) can easily be expressed in terms of
the generating function ϕ2(y), given by (33)–(34), of the variable Ns and satisfies
\[
\begin{aligned}
\chi(v) = \left\langle e^{iv(1+\lambda)N_s/\beta} \right\rangle
&= \varphi_2\!\left(e^{iv(1+\lambda)/\beta}\right)
= \exp\left(\alpha\beta \int_1^{e^{iv(1+\lambda)/\beta}} M(1, 1+\lambda, \beta(s-1))\,ds\right) \\
&\sim \exp\left(\alpha \int_0^{iv(1+\lambda)} M(1, 1+\lambda, z)\,dz\right)
\quad \text{for } \frac{\beta}{1+\lambda} \gg 1.
\end{aligned}
\tag{50}
\]
The approximate characteristic function given by the final expression in (50) corresponds to a parametric family of distributions which includes two notable special cases
identified by considering the behaviour of the formula (50) for λ asymptotically large
or small. If in addition to our previous assumption β/(1 + λ) ≫ 1 we assume that
λ ≫ 1, then it is possible to further simplify (50) to find that χ(v) ∼ (1 − iv)^{−α′},
which is the characteristic function of the gamma distribution. The same gamma distribution can alternatively be obtained by taking β′ large in the previously derived
negative binomial distribution (48) which we showed to approximate protein levels
whenever λ ≫ 1; clearly, if λ ≫ 1, then the condition β/(1 + λ) ≫ 1 is equivalent to β/λ = β′ ≫ 1, and hence, the two alternative derivations of the gamma
probability law confirm that our analysis is self-consistent. The gamma distribution of
protein levels has previously been obtained from a relatively simple piecewise deterministic model for gene expression dynamics (Friedman et al. 2006). It represents an
important asymptotic case of the protein distribution studied here since the condition
k2 ≫ γ1 ≫ γ2 for which the approximation is valid is often satisfied (McAdams and
Arkin 1997; Friedman et al. 2006).
Examining the behaviour of (50) for λ ≪ 1, we find that the characteristic function
of the random variable Ns/β can be approximated by exp(α(e^{iv} − 1)), this being the
characteristic function of the Poisson distribution with mean α; the same result can be
obtained by taking β large in the Neyman distribution (46) which has been shown to be
valid whenever λ ≪ 1. A similar comment to that made for the Neyman distribution
needs to be made here: biologically, λ tends to be large rather than very small.
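The two consistency statements made above (negative binomial to gamma, and Neyman to a rescaled Poisson) can be illustrated numerically by comparing characteristic functions. In the Python sketch below, the characteristic function of a rescaled variable N/c is obtained by evaluating that of N at v/c; the function names, the value 1000 for the large parameter and the evaluation point v = 1 are assumptions of this example.

```python
import cmath

def cf_negative_binomial(s, alpha_p, beta_p):
    """Characteristic function from (48) with y = exp(i s)."""
    return (1 + beta_p * (1 - cmath.exp(1j * s))) ** (-alpha_p)

def cf_gamma(v, alpha_p):
    """Gamma distribution with shape alpha' and unit scale."""
    return (1 - 1j * v) ** (-alpha_p)

def cf_neyman(s, alpha, beta):
    """Characteristic function from (46) with y = exp(i s)."""
    return cmath.exp(alpha * (cmath.exp(beta * (cmath.exp(1j * s) - 1)) - 1))

def cf_poisson(v, alpha):
    """Poisson distribution with mean alpha."""
    return cmath.exp(alpha * (cmath.exp(1j * v) - 1))

v, shape, big = 1.0, 2.0, 1000.0

# Negative binomial variable divided by beta' (large) versus the gamma law.
gap_gamma = abs(cf_negative_binomial(v / big, shape, big) - cf_gamma(v, shape))

# Neyman variable divided by beta (large) versus the Poisson law with mean alpha.
gap_poisson = abs(cf_neyman(v / big, shape, big) - cf_poisson(v, shape))

print(gap_gamma, gap_poisson)
```

Both gaps should shrink as the large parameter is increased further, which is what the two alternative derivations in the text lead one to expect.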
Table 1 Approximate distributions of protein levels. For each relevant asymptotic parameter region, we
describe the distribution of the approximate stationary protein levels, Ñs, and give the functional form of
its characteristic function ω(s) = ⟨e^{isÑs}⟩

Asymptotic region       Approximate distribution
of validity             Description                     Characteristic function
β/(1+λ) ≪ 1             Poisson                         exp(αβ(e^{is} − 1))
λ ≪ 1                   Neyman                          exp(α(e^{β(e^{is} − 1)} − 1))
λ ≫ 1                   Negative binomial               (1 + β′(1 − e^{is}))^{−α′}
α ≫ 1, αβ ≫ 1           Gaussian                        exp(iαβs − (1 + β/(1+λ)) αβs²/2)
β/(1+λ) ≫ 1             Deterministic prot. dynamics    exp(α ∫_0^{isβ} M(1, 1+λ, z) dz)
β ≫ 1, λ ≪ 1            Proportional to Poisson         exp(α(e^{iβs} − 1))
β′ ≫ 1, λ ≫ 1           Gamma                           (1 − iβ′s)^{−α′}

Thus, we identified a number of distinguished asymptotic cases for which the exact
distribution of steady-state protein levels can be approximated by simpler distribution
types; the complete list is given in Table 1.
6 Discussion
In this paper, we studied the properties of the stationary distribution of a stochastic
model for gene expression. The model, described by the chemical master equation (6),
has been analysed in a standard way—by writing down and solving the partial differential equation (13) for the generating function of the unknown stationary probability
distribution. The method of finding the solution to this PDE involved a non-trivial
step: we transformed the variables, obtaining the PDE (17) for the factorial cumulant
generating function of the unknown distribution; this transformed equation was solved
easily using a power-series ansatz. Similar approaches have been used in other contexts: changing the variables in the PDE for the generating function to obtain one for
the moment generating function or one for the (non-factorial) cumulant generating
function is a standard method by which information on the probabilistic behaviour of
Markov processes in continuous time and with discrete state space can be obtained
(see e.g. Bailey 1964).
The joint stationary distribution of protein and mRNA levels was calculated in Sect. 4 from the derived formula for the generating function via the discrete Fourier
transform, which was efficiently implemented using the fast Fourier transform algorithm. In addition, we found a relatively simple recursive expression (40)–(41) for the
marginal distribution of protein levels. We also obtained formulae giving the first four
central moments of this distribution, thus determining its skewness and kurtosis whilst
reiterating the result (8)–(9) for the mean and variance obtained in previous studies
using different methods.
We used these results, combined with a simple asymptotic analysis of the generating function, to examine the qualitative behaviour of the stationary distribution of
the gene expression model. The protein counts have been found to have the Poisson
distribution if the rate of translation is significantly lower than either of the degradation
rates (of protein or mRNA). If, however, the rate of translation balances with the degradation rates, then the deviation from the Poisson distribution becomes pronounced,
and two particular cases appear to be of special interest: the Neyman distribution of
protein levels, which occurs when it is the protein degradation reaction which balances with translation, and the negative binomial distribution which occurs when the
mRNA degradation is the dominant decay reaction. We also characterised a number
of distributions in continuous state space which can serve as approximations, upon
rescaling, of the exact distribution of gene expression levels.
The stochastic model for gene expression (1) can be extended by assuming that
the promoter of the gene can transition in a Markovian fashion between several states
and that from each of these promoter states mRNA is transcribed with a specific rate
constant (Blake et al. 2003; Raser and O’Shea 2004; Coulon et al. 2010). Several
authors considered the illustrative case of a promoter which can be in the active state
from which mRNA is expressed with a given rate constant or in the inactive state from
which no transcription occurs (Peccoud and Ycart 1995; Raj and van Oudenaarden
2009). The time-dependent and stationary first and second moments of mRNA and
protein counts can be determined for this description in a similar way as was done for
(1) by Thattai and van Oudenaarden (2001); the details can be found in Paszek (2007).
In addition, the stationary distribution of mRNA levels, which is not Poissonian in this
case as the transitioning between the promoter states contributes to the stochasticity
in the expression of gene products, has been completely characterised (Peccoud and
Ycart 1995; Raj et al. 2006; Iyer-Biswas et al. 2009). The results in this paper for the
stochastic model (1) could possibly be extended to obtain, for the model which includes
promoter transitions, a complete characterisation of the distribution of protein levels,
which is not available in the literature (Paszek 2007).
We believe that the theoretical results derived in this work are suitable for comparison with experimental data which are becoming available thanks to the recent
advances in imaging technologies allowing the measurement of gene expression at the
single-cell level with single-molecule sensitivity (Xie et al. 2008; Larson et al. 2009).
Previous theoretical results on stochastic gene expression have already been exploited
in that way: Raj et al. (2006) used the maximum likelihood method to fit the model
by Peccoud and Ycart (1995) to experimentally observed mRNA counts obtained by
single-molecule resolution fluorescence in-situ hybridization. Therefore, we expect
that the characterisation of the protein distribution provided in this study could be of
help in analysing data on protein expression in individual cells at the single-molecule
level obtained e.g. by live cell fluorescence microscopy (Yu et al. 2006) or by protein
detection methods based on enzymatic amplification (Cai et al. 2006). We believe that
comparing theoretical results to such gene-expression data provided by future experimental studies will improve the understanding of the role of stochasticity in gene
expression in real biological systems.
Acknowledgment P. Bokes was supported by the European Commission under Marie Curie Early Stage
Researcher Training (contract no. MEST-CT-2005-020723). J. King gratefully acknowledges the funding
of the BBSRC/EPSRC (reference no. BB/D008522/1) and of the Royal Society and Wolfson Foundation.
Appendix
The reaction system (1) which we used in this paper as a model for gene expression
involves linear kinetics only, and therefore, as explained in Sect. 1, the first and second
moments (i.e. means and variances) of the dependent stochastic variables, the amount
of mRNA M(t) and of protein N (t), satisfy a closed system of linear inhomogeneous
ordinary differential equations, which has been solved by other authors (as reviewed
in Sect. 1). The stationary solution of this linear system gives the stationary means
and variances of the gene expression model. Here we show that the same results can
be obtained from the explicit formula for the generating function of the stationary distribution of mRNA and protein amounts. In addition, we determine the third and the
fourth central stationary moments, which have not been available in the literature so far.
It is well known that the moments can be obtained from the generating function by
evaluating its derivatives; however, for the generating function of the functional form
we obtained, it is more convenient to calculate the moments via the corresponding
cumulants. By Johnson et al. (2005), the r-th factorial cumulant κ[r] of the marginal protein distribution $p^{st}_{\cdot,n}$ is defined as the r-th derivative of the logarithm of its generating
function (39) taken at y = 1, i.e.
\[
\kappa_{[r]} = D^r\bigl(\psi_2(y)\bigr)\big|_{y=1} = \frac{\alpha\beta^r (r-1)!}{(1+\lambda)_{r-1}},
\]
i.e. the first four factorial cumulants are
\[
\kappa_{[1]} = \alpha\beta, \quad
\kappa_{[2]} = \frac{\alpha\beta^2}{1+\lambda}, \quad
\kappa_{[3]} = \frac{2\alpha\beta^3}{(1+\lambda)(2+\lambda)}, \quad
\kappa_{[4]} = \frac{6\alpha\beta^4}{(1+\lambda)(2+\lambda)(3+\lambda)}.
\tag{A1}
\]
Another useful characteristic of discrete distributions is the sequence of their (non-factorial) cumulants: for definitions, consult Johnson et al. (2005); here we shall use
that the r-th cumulant κr can be expressed in terms of the first r factorial cumulants as
(see Johnson et al. 2005)
\[
\kappa_r = \sum_{j=1}^{r} S(r, j)\,\kappa_{[j]},
\]
where the S(r, j) are the Stirling numbers of the second kind (Johnson et al. 2005).
Thus, the first four cumulants are given by
\[
\kappa_1 = \kappa_{[1]}, \quad
\kappa_2 = \kappa_{[2]} + \kappa_{[1]}, \quad
\kappa_3 = \kappa_{[3]} + 3\kappa_{[2]} + \kappa_{[1]}, \quad
\kappa_4 = \kappa_{[4]} + 6\kappa_{[3]} + 7\kappa_{[2]} + \kappa_{[1]}.
\tag{A2}
\]
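Denoting by Ns the random variable associated with the stationary protein distribution
$p^{st}_{\cdot,n}$, we can define the first four central moments by
\[
\nu = \langle N_s \rangle, \quad
\nu_2 = \langle (N_s-\nu)^2 \rangle, \quad
\nu_3 = \langle (N_s-\nu)^3 \rangle, \quad
\nu_4 = \langle (N_s-\nu)^4 \rangle.
\]
Johnson et al. (2005) give the relation between the first four central moments and the
first four cumulants:
\[
\nu = \kappa_1, \quad \nu_2 = \kappa_2, \quad \nu_3 = \kappa_3, \quad \nu_4 = \kappa_4 + 3\kappa_2^2.
\tag{A3}
\]
Thus, using (A1)–(A3), the first four central moments of the marginal protein distribution $p^{st}_{\cdot,n}$ are given by
\[
\nu = \alpha\beta, \quad
\nu_2 = \alpha\beta\left(\frac{\beta}{1+\lambda} + 1\right), \quad
\nu_3 = \alpha\beta\left(\frac{2\beta^2}{(1+\lambda)(2+\lambda)} + \frac{3\beta}{1+\lambda} + 1\right),
\]
\[
\nu_4 = \alpha\beta\left(\frac{6\beta^3}{(1+\lambda)(2+\lambda)(3+\lambda)} + \frac{12\beta^2}{(1+\lambda)(2+\lambda)} + \frac{7\beta}{1+\lambda} + 1\right) + 3\nu_2^2.
\]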
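The chain (A1)–(A3) from factorial cumulants to central moments can be followed step by step in code. The Python sketch below evaluates the factorial cumulants with the Pochhammer symbol, converts them to cumulants using Stirling numbers of the second kind, and then forms the central moments. The function names are hypothetical and the parameter values are those of Fig. 2a.

```python
def factorial_cumulant(r, alpha, beta, lam):
    """kappa_[r] = alpha * beta^r * (r-1)! / (1+lam)_{r-1}, as in (A1)."""
    poch, fact = 1.0, 1.0
    for j in range(r - 1):
        poch *= 1 + lam + j          # builds the Pochhammer symbol (1+lam)_{r-1}
        fact *= j + 1                # builds (r-1)!
    return alpha * beta ** r * fact / poch

def stirling2(r, j):
    """Stirling number of the second kind S(r, j) by the standard recurrence."""
    if r == j:
        return 1
    if j == 0 or j > r:
        return 0
    return j * stirling2(r - 1, j) + stirling2(r - 1, j - 1)

def central_moments(alpha, beta, lam):
    """Mean and second to fourth central moments of the protein distribution."""
    kf = [factorial_cumulant(r, alpha, beta, lam) for r in range(1, 5)]
    # kappa_r = sum_j S(r, j) kappa_[j], which reproduces (A2) for r = 1..4.
    k = [sum(stirling2(r, j) * kf[j - 1] for j in range(1, r + 1))
         for r in range(1, 5)]
    # (A3): the fourth central moment picks up the extra term 3 * kappa_2^2.
    return k[0], k[1], k[2], k[3] + 3 * k[1] ** 2

alpha, beta, lam = 1.0, 4.0, 20.0
nu, nu2, nu3, nu4 = central_moments(alpha, beta, lam)
skewness = nu3 / nu2 ** 1.5
kurtosis = nu4 / nu2 ** 2
print(nu, nu2, nu3, nu4, skewness, kurtosis)
```

Setting λ = 0 in this sketch gives the factorial cumulants αβ^r, so the same functions also return the moments of the Neyman type A limit.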
For completeness, let Ms denote the random variable associated with the stationary
mRNA distribution $p^{st}_{m,\cdot}$ and let us consider the first four central moments
\[
\mu = \langle M_s \rangle, \quad
\mu_2 = \langle (M_s-\mu)^2 \rangle, \quad
\mu_3 = \langle (M_s-\mu)^3 \rangle, \quad
\mu_4 = \langle (M_s-\mu)^4 \rangle.
\]
The random variable Ms has the Poisson distribution (32); therefore, the central
moments are given by (Johnson et al. 2005)
\[
\mu = \mu_2 = \mu_3 = \alpha, \qquad \mu_4 = \alpha + 3\alpha^2.
\]
If we express the above formulae for the stationary means μ, ν and variances μ2, ν2
in terms of the chemical kinetics parameters of the model, we find that
\[
\mu = \frac{k_1}{\gamma_1}, \quad
\nu = \frac{k_1 k_2}{\gamma_1\gamma_2}, \quad
\mu_2 = \frac{k_1}{\gamma_1}, \quad
\nu_2 = \frac{k_1 k_2}{\gamma_1\gamma_2}\left(1 + \frac{k_2}{\gamma_1+\gamma_2}\right),
\tag{A4}
\]
which coincide with the formulae (8)–(9) obtained in other studies, as reviewed in
Sect. 1, by deriving from the master equation a finite closed system of ordinary differential equations for the first- and second-order moments of the Markov process
(M(t), N (t)); the stationary moments were obtained in Sect. 1 as the time-independent
solution of that system of ODEs. Thus we observe an agreement of the results we
arrived at by finding the exact stationary generating function with the previous results
obtained using the differential equations for moments.
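The formulae (A4) can also be checked against a direct stochastic simulation. The Python sketch below runs a Gillespie simulation of the four reactions described in the text (transcription with rate constant k1, mRNA degradation with γ1, translation with k2 per mRNA and protein degradation with γ2) and compares time-averaged moments with (A4). The function name, the rate values, the simulated time span and the random seed are assumptions of this example, and the estimates carry sampling error.

```python
import random

def simulate_two_stage(k1, gamma1, k2, gamma2, t_end, seed=1):
    """Gillespie simulation; returns time-averaged moments of (M(t), N(t))."""
    rng = random.Random(seed)
    t, m, n = 0.0, 0, 0
    # Time integrals of m, n, n^2 and m*n along the trajectory.
    s_m = s_n = s_nn = s_mn = 0.0
    # State changes: transcription, mRNA decay, translation, protein decay.
    changes = ((1, 0), (-1, 0), (0, 1), (0, -1))
    while True:
        rates = (k1, gamma1 * m, k2 * m, gamma2 * n)
        wait = rng.expovariate(sum(rates))
        dt = min(wait, t_end - t)
        s_m += m * dt
        s_n += n * dt
        s_nn += n * n * dt
        s_mn += m * n * dt
        if wait >= t_end - t:
            break
        t += wait
        # Pick the next reaction with probability proportional to its rate.
        dm, dn = rng.choices(changes, weights=rates)[0]
        m += dm
        n += dn
    mean_m, mean_n = s_m / t_end, s_n / t_end
    return {"mean_m": mean_m, "mean_n": mean_n,
            "var_n": s_nn / t_end - mean_n ** 2,
            "cov_mn": s_mn / t_end - mean_m * mean_n}

k1, gamma1, k2, gamma2 = 2.0, 1.0, 4.0, 1.0
stats = simulate_two_stage(k1, gamma1, k2, gamma2, t_end=5000.0)
exact = {"mean_m": k1 / gamma1,
         "mean_n": k1 * k2 / (gamma1 * gamma2),
         "var_n": k1 * k2 / (gamma1 * gamma2) * (1 + k2 / (gamma1 + gamma2))}
print(stats)
print(exact)
```

The simulated covariance of mRNA and protein counts should come out positive, in line with the positive correlation seen in Fig. 1.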
References
Abramowitz M, Stegun I (1972) Handbook of mathematical functions with formulas, graphs, and mathematical tables. National Bureau of Standards, Washington, DC
Ackers G, Johnson A, Shea M (1982) Quantitative model for gene regulation by lambda phage repressor.
Proc Natl Acad Sci USA 79:1129–1133
Bailey N (1964) The elements of stochastic processes with applications to the natural sciences. Wiley,
New York
Belle A, Tanay A, Bitincka L, Shamir R, O’Shea E (2006) Quantification of protein half-lives in the budding
yeast proteome. Proc Natl Acad Sci USA 103:13004–13009
Berg O (1978) A model for the statistical fluctuations of protein numbers in a microbial population. J Theor
Biol 71:587–603
Bernstein J, Khodursky A, Lin P, Lin-Chao S, Cohen S (2002) Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays.
Proc Natl Acad Sci USA 99:9697–9702
Blake W, Kaern M, Cantor C, Collins J (2003) Noise in eukaryotic gene expression. Nature 422:633–637
Cai L, Friedman N, Xie X (2006) Stochastic protein expression in individual cells at the single molecule
level. Nature 440:358–362
Cheong R, Paliwal S, Levchenko A (2010) Models at the single cell level. Wiley Interdiscip Rev Syst Biol
Med 2:34–48
Coulon A, Gandrillon O, Beslon G (2010) On the spontaneous stochastic dynamics of a single gene:
complexity of the molecular interplay at the promoter. BMC Syst Biol 4:2
Cox D, Miller H (1977) The theory of stochastic processes. Chapman & Hall/CRC, London
Davidson E, Rast J, Oliveri P, Ransick A, Calestani C, Yuh C, Minokawa T, Amore G, Hinman V,
Arenas-Mena C et al (2002) A genomic regulatory network for development. Science 295:1669–
1678
Elowitz M, Levine A, Siggia E, Swain P (2002) Stochastic gene expression in a single cell. Science
297:1183–1186
Friedman N, Cai L, Xie X (2006) Linking stochastic dynamics to population distribution: an analytical
framework of gene expression. Phys Rev Lett 97:168302
Gadgil C, Lee C, Othmer H (2005) A stochastic analysis of first-order reaction networks. B Math Biol
67:901–946
García-Martínez J, González-Candelas F, Pérez-Ortín J (2007) Common gene expression strategies
revealed by genome-wide analysis in yeast. Genome Biol 8:R222
Golding I, Paulsson J, Zawilski S, Cox E (2005) Real-time kinetics of gene activity in individual bacteria.
Cell 123:1025–1036
Griffith J (1968a) Mathematics of cellular control processes. I. Negative feedback to one gene. J Theor Biol
20:202–208
Griffith J (1968b) Mathematics of cellular control processes. II. Positive feedback to one gene. J Theor Biol
20:209–216
Gurland J (1958) A generalized class of contagious distributions. Biometrics 14:229–249
Hornos J, Schultz D, Innocentini G, Wang J, Walczak A, Onuchic J, Wolynes P (2005) Self-regulating gene:
an exact solution. Phys Rev E 72:051907
Innocentini G, Hornos J (2007) Modeling stochastic gene expression under repression. J Math Biol 55:
413–431
Iyer-Biswas S, Hayot F, Jayaprakash C (2009) Stochasticity of gene products from transcriptional pulsing.
Phys Rev E 79:031911
Johnson N, Kotz S, Kemp A (2005) Univariate discrete distributions, 3rd edn. Wiley-Interscience, London
Johnson W (2002) The curious history of Faà di Bruno’s formula. Am Math Mon 109:217–234
Kendall D (1949) Stochastic processes and population growth. J Roy Stat Soc B 11:230–282
Larson D, Singer R, Zenklusen D (2009) A single molecule view of gene expression. Trends Cell Biol
19:630–637
Lee T, Rinaldi N, Robert F, Odom D, Bar-Joseph Z, Gerber G, Hannett N, Harbison C, Thompson C, Simon
I et al (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799–804
Lestas I, Paulsson J, Ross N, Vinnicombe G (2008) Noise in gene regulatory networks. IEEE T Circuits-I
53:189–200
Lewin B (2000) Genes VII. Oxford University Press, Oxford
Lu P, Vogel C, Wang R, Yao X, Marcotte E (2007) Absolute protein expression profiling estimates the
relative contributions of transcriptional and translational regulation. Nat Biotechnol 25:117–124
McAdams H, Arkin A (1997) Stochastic mechanisms in gene expression. Proc Natl Acad Sci USA 94:
814–819
McAdams H, Arkin A (1999) It is a noisy business! Genetic regulation at the nanomolar scale. Trends
Genet 15:65–69
Neyman J (1939) On a new class of “contagious” distributions, applicable in entomology and bacteriology.
Ann Math Stat 10:35–57
Ozbudak E, Thattai M, Kurtser I, Grossman A, van Oudenaarden A (2002) Regulation of noise in the
expression of a single gene. Nat Genet 31:69–73
Paszek P (2007) Modeling stochasticity in gene regulation: characterization in the terms of the underlying
distribution function. B Math Biol 69:1567–1601
Paulsson J (2004) Summing up the noise in gene networks. Nature 427:415–418
Paulsson J (2005) Models of stochastic gene expression. Phys Life Rev 2:157–175
Paulsson J, Ehrenberg M (2000) Random signal fluctuations can reduce random fluctuations in regulated
components of chemical regulatory networks. Phys Rev Lett 84:5447–5450
Peccoud J, Ycart B (1995) Markovian modeling of gene-product synthesis. Theor Popul Biol 48:222–234
Press W, Teukolsky S, Vetterling W, Flannery B (2007) Numerical recipes: the art of scientific computing.
Cambridge University Press, Cambridge
Raj A, van Oudenaarden A (2009) Single-molecule approaches to stochastic gene expression. Annu Rev
Biophys 38:255–270
Raj A, Peskin C, Tranchina D, Vargas D, Tyagi S (2006) Stochastic mRNA synthesis in mammalian cells.
PLoS Biol 4:e309
Raser J, O’Shea E (2004) Control of stochasticity in eukaryotic gene expression. Science 304:1811–1814
Shahrezaei V, Swain P (2008a) Analytical distributions for stochastic gene expression. Proc Natl Acad Sci
USA 105:17256
Shahrezaei V, Swain P (2008b) The stochastic nature of biochemical networks. Curr Opin Biotechnol
19:369–374
Shea M, Ackers G (1985) The OR control system of bacteriophage lambda. A physical–chemical model
for gene regulation. J Mol Biol 181:211–230
Shen-Orr S, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network
of Escherichia coli. Nat Genet 31:64–68
Singh A, Hespanha J (2007) Stochastic analysis of gene regulatory networks using moment closure. In:
Proceedings of the American control conference
Swiers G, Patient R, Loose M (2006) Genetic regulatory networks programming hematopoietic stem cells
and erythroid lineage specification. Dev Biol 294:525–540
Taniguchi Y, Choi P, Li G, Chen H, Babu M, Hearn J, Emili A, Xie X (2010) Quantifying E. coli proteome
and transcriptome with single-molecule sensitivity in single cells. Science 329:533–538
Thattai M, van Oudenaarden A (2001) Intrinsic noise in gene regulatory networks. Proc Natl Acad Sci
USA 98:8614–8619
Tomioka R, Kimura H, Kobayashi TJ, Aihara K (2004) Multivariate analysis of noise in genetic regulatory
networks. J Theor Biol 229:501–521
Tomlin C, Axelrod J (2007) Biology by numbers: mathematical modelling in developmental biology. Nat
Rev Genet 8:331–340
Tyson J, Chen K, Novak B (2003) Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and
signaling pathways in the cell. Curr Opin Cell Biol 15:221–231
van Kampen N (2006) Stochastic processes in physics and chemistry. Elsevier, New York
Wang Y, Liu C, Storey J, Tibshirani R, Herschlag D, Brown P (2002) Precision and functional specificity
in mRNA decay. Proc Natl Acad Sci USA 99:5860–5865
Xie X, Choi P, Li G, Lee N, Lia G (2008) Single-molecule approach to molecular biology in living bacterial
cells. Annu Rev Biophys 37:417–444
Yu J, Xiao J, Ren X, Lao K, Xie X (2006) Probing gene expression in live cells, one protein molecule at a
time. Science 311:1600–1603