J. Math. Biol. (2012) 64:829–854
DOI 10.1007/s00285-011-0433-5
Mathematical Biology

Exact and approximate distributions of protein and mRNA levels in the low-copy regime of gene expression

Pavol Bokes · John R. King · Andrew T. A. Wood · Matthew Loose

Received: 25 June 2010 / Revised: 10 February 2011 / Published online: 8 June 2011
© Springer-Verlag 2011

Abstract Gene expression at the single-cell level incorporates reaction mechanisms which are intrinsically stochastic as they involve molecular species present at low copy numbers. The dynamics of these mechanisms can be described quantitatively using stochastic master-equation modelling; in this paper we study a generic gene-expression model of this kind which explicitly includes the representations of the processes of transcription and translation. For this model we determine the generating function of the steady-state distribution of mRNA and protein counts and characterise the underlying probability law using a combination of analytic, asymptotic and numerical approaches, finding that the distribution may assume a number of qualitatively distinct forms. The results of the analysis are suitable for comparison with single-molecule resolution gene-expression data emerging from recent experimental studies.

Keywords Stochastic modelling · Gene expression · Master equation · Generating function

Mathematics Subject Classification (2000) 92C40 · 60J27

P. Bokes (B) · J. R. King · A. T. A. Wood
Centre for Mathematical Medicine and Biology, School of Mathematical Sciences, University of Nottingham, Nottingham NG7 2RD, UK
e-mail: [email protected]

Present Address: P. Bokes
Department of Applied Mathematics and Statistics, Comenius University, Mlynská dolina, Bratislava 842 48, Slovakia

M. Loose
Institute of Genetics, Queen’s Medical Centre, University of Nottingham, Nottingham NG7 2UH, UK
1 Introduction

The regulation of a genetic programme in a given developmental context or in response to a particular set of environmental stimuli is characterized by complex networks of molecular interactions between the regulatory factors involved (Davidson et al. 2002; Shen-Orr et al. 2002; Lee et al. 2002; Swiers et al. 2006). Mathematical and computational modelling has proved to be useful in reproducing the observed behaviour of such networks and has also revealed features yet to be verified experimentally (Tomlin and Axelrod 2007; Tyson et al. 2003). A common approach to modelling gene regulatory networks is to represent the cell as a well-stirred chemical reactor comprising relevant biochemical species such as transcription factors, RNA molecules and signalling proteins (Tyson et al. 2003; McAdams and Arkin 1997); the interactions between these species can be modelled using elementary chemical kinetics and the resulting temporal dynamics studied using the deterministic or stochastic formalisms developed for systems of chemical reactions (McAdams and Arkin 1997; Tyson et al. 2003).

The building block of such (typically large) gene regulatory networks is a single gene and its read-outs, the associated mRNA and protein molecules (Griffith 1968a,b). Gene expression is (at least) a two-stage process: first, the DNA code is transcribed by an RNA polymerase into mRNA molecules; then, mRNAs are translated by ribosomes into proteins (Lewin 2000). Transcription and translation are counteracted by the degradation of mRNAs and proteins by appropriate enzymes, such as ribonucleases and proteases (Lewin 2000). Following the work of others (e.g. McAdams and Arkin 1997; Thattai and van Oudenaarden 2001), we model these mechanisms as elementary reactions, obtaining a reaction system of two chemical species (mRNA and protein),

\[
\emptyset \xrightarrow{k_1} \text{mRNA}, \qquad \text{mRNA} \xrightarrow{\gamma_1} \emptyset, \qquad \text{mRNA} \xrightarrow{k_2} \text{mRNA} + \text{protein}, \qquad \text{protein} \xrightarrow{\gamma_2} \emptyset. \tag{1}
\]

In this schematic, transcription is a zeroth-order reaction occurring with rate $k_1$, while translation and the degradation mechanisms are first-order reactions occurring with rate constants $k_2$, $\gamma_1$ and $\gamma_2$. Biologically, these rate constants can depend on many factors, notably $k_1$ on the availability of transcriptional machinery and polymerases (Lewin 2000), the phase of the cell cycle (Berg 1978), or the expression patterns of the regulators of the gene (Ackers et al. 1982), $k_2$ on the ribosomal activity of the cell (Lewin 2000) and $\gamma_1$ and $\gamma_2$ on the availability of the degradation enzymes (Lewin 2000). By taking these rates to be constant, we assume that the underlying extrinsic mechanisms are in a non-fluctuating steady state (Elowitz et al. 2002). We shall discuss in the final few paragraphs of this section the relationship between (1) and more complex models for gene expression which include transitioning between multiple promoter states.

Experimental studies in model organisms provide useful information on the order of magnitude of the above parameters (Wang et al. 2002; Bernstein et al. 2002; Lu et al. 2007; Belle et al. 2006; García-Martínez et al. 2007). Genome-wide studies in E. coli showed that a majority of genes have mRNA half-lives, $\ln(2)/\gamma_1$, between 3 and 8 min (Bernstein et al. 2002). In S. cerevisiae, the half-lives of mRNAs for a large set of genes range from 3 to 90 min (Wang et al. 2002), with a median of 20 min, while for the protein half-life, $\ln(2)/\gamma_2$, a median value of 43 min has been reported (Belle et al. 2006). Messengers tend to be less stable than proteins (García-Martínez et al. 2007), i.e. we can expect $\gamma_1 > \gamma_2$, and, in particular for genes exhibiting translational bursts, we often have $\gamma_1 \gg \gamma_2$ (Ozbudak et al. 2002; McAdams and Arkin 1997).
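As a small worked example of the conversion between half-lives and rate constants used above, the following sketch evaluates $\gamma = \ln(2)/t_{1/2}$ for the half-lives quoted in the text. The function and variable names are illustrative only, and the last ratio combines median values taken from two different studies, so it should be read as an order-of-magnitude indication rather than a measured quantity.

```python
import math


def rate_from_half_life(t_half):
    """First-order degradation rate constant, gamma = ln(2) / half-life."""
    return math.log(2) / t_half


# Half-lives in minutes, as quoted in the text.
gamma1_ecoli_fast = rate_from_half_life(3.0)   # E. coli mRNA, short end of the 3-8 min range
gamma1_ecoli_slow = rate_from_half_life(8.0)   # E. coli mRNA, long end of the range
gamma1_yeast = rate_from_half_life(20.0)       # S. cerevisiae, median mRNA half-life
gamma2_yeast = rate_from_half_life(43.0)       # S. cerevisiae, median protein half-life

# Ratio gamma_1 / gamma_2 = (protein half-life) / (mRNA half-life).
# Assumption: a ratio of medians from different data sets is only indicative.
ratio_yeast = gamma1_yeast / gamma2_yeast

print("E. coli gamma_1 range (1/min):", gamma1_ecoli_slow, gamma1_ecoli_fast)
print("yeast gamma_1/gamma_2 from medians:", ratio_yeast)
```

With these medians the ratio $\gamma_1/\gamma_2$ is $43/20$, consistent with the expectation $\gamma_1 > \gamma_2$.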
Typical values of the production rates $k_1$ and $k_2$ can be inferred from the mRNA and protein abundances. The average number of mRNA molecules per cell, which, as detailed below, corresponds to the ratio $k_1/\gamma_1$ in the model (1), ranges from less than one molecule to a few dozen in E. coli and S. cerevisiae (Lu et al. 2007). Proteins are more abundant than their mRNA templates; the number of proteins per mRNA molecule, corresponding to $k_2/\gamma_2$ in (1), is around 540 for an average gene in E. coli and 5600 in yeast (Lu et al. 2007). Although these numbers suggest that the number of proteins of a specific type in a cell can be quite high, a recent system-wide study of gene expression in E. coli demonstrated that a large fraction of proteins were expressed at fewer than 10 copies per cell (Taniguchi et al. 2010). Protein abundance is related to function, with regulatory proteins, e.g. transcription factors, being present at lower numbers than housekeeping molecules (Taniguchi et al. 2010; García-Martínez et al. 2007). The low-copy regime of protein expression has also been experimentally demonstrated for genes transcribed in repressed conditions (Cai et al. 2006; Yu et al. 2006). Specifically, Cai et al. (2006) counted the copy number of the lactose utilisation enzyme β-galactosidase (β-gal) in E. coli cells grown in glucose-containing medium, in which the lac promoter controlling the β-gal-encoding lacZ gene is repressed, obtaining 1.2 molecules per cell. Yu et al. (2006) replaced the lacZ gene behind a repressed lac promoter by a gene encoding a fluorescent protein and counted the protein copy numbers using imaging techniques, measuring on average 0.037 mRNA and 5 protein molecules per E. coli cell.

If the levels of mRNAs ($m$) and proteins ($n$) in the reaction system (1) are sufficiently large, we can treat them as continuous variables that satisfy (van Kampen 2006)

\[
\frac{dm}{dt} = k_1 - \gamma_1 m, \qquad \frac{dn}{dt} = k_2 m - \gamma_2 n. \tag{2}
\]

The solution to the linear initial-value problem in which (2) is subject to the initial condition

\[
m(0) = m_0, \qquad n(0) = n_0, \tag{3}
\]

is given by

\[
m(t) = \frac{k_1}{\gamma_1} + \left(m_0 - \frac{k_1}{\gamma_1}\right) e^{-\gamma_1 t}, \qquad
n(t) = \frac{k_1 k_2}{\gamma_1 \gamma_2} + \left(n_0 - \frac{k_1 k_2}{\gamma_1 \gamma_2}\right) e^{-\gamma_2 t} + k_2 \left(m_0 - \frac{k_1}{\gamma_1}\right) F(t), \tag{4}
\]

where

\[
F(t) =
\begin{cases}
\dfrac{e^{-\gamma_1 t} - e^{-\gamma_2 t}}{\gamma_2 - \gamma_1} & \text{if } \gamma_1 \neq \gamma_2, \\[2ex]
t\, e^{-\gamma_2 t} & \text{if } \gamma_1 = \gamma_2.
\end{cases} \tag{5}
\]

This completes the deterministic analysis of the gene expression model (1). However, for many genes, mRNAs and proteins are present in small quantities and molecular fluctuations cannot therefore be neglected (McAdams and Arkin 1999). It is then appropriate to treat $m$ and $n$ as discrete stochastic variables; the probability $p_{m,n}(t)$ of having $m$ mRNA and $n$ protein molecules at time $t$ then satisfies the chemical master equation (van Kampen 2006)

\[
\frac{dp_{m,n}}{dt} = k_1 (p_{m-1,n} - p_{m,n}) + \gamma_1 \big((m+1) p_{m+1,n} - m p_{m,n}\big) + k_2 m (p_{m,n-1} - p_{m,n}) + \gamma_2 \big((n+1) p_{m,n+1} - n p_{m,n}\big), \tag{6}
\]

subject to the initial condition

\[
p_{m,n}(0) = \delta_{m,m_0} \delta_{n,n_0}, \tag{7}
\]

where $\delta$ represents Kronecker’s delta; $m_0$ and $n_0$ are the initial amounts of mRNA and protein, respectively.

Recent studies (Coulon et al. 2010; Thattai and van Oudenaarden 2001; Paulsson 2005; Paszek 2007) of (6)–(7) and its generalisations focused on finding the first and second moments of the probability distribution $p_{m,n}(t)$ without solving the master equation itself; to illustrate this approach, let us denote by $(M(t), N(t))$ the Markov process associated with the distribution $p_{m,n}(t)$ and let

\[
\langle M^i(t) N^j(t) \rangle = \sum_{m,n} m^i n^j p_{m,n}(t)
\]

be the $(i, j)$th moment of this process. Multiplying (6)–(7) by $m^i n^j$ and summing over all $m$ and $n$ gives an infinite-dimensional system of ODEs for the moments. It can be shown that, since the reaction rates in (6) depend linearly on the state $(m, n)$ of the system, the equations for all moments of order up to a given integer $d > 0$ (i.e.
for $\langle M^i(t) N^j(t) \rangle$ such that $i + j \le d$) do not depend on the moments of higher order and hence form a closed finite system of ODEs (Lestas et al. 2008; Gadgil et al. 2005; Tomioka et al. 2004); in particular, the first moments, i.e. the mRNA and protein means, are given as solutions to the deterministic formulation (2)–(3) (Singh and Hespanha 2007). The stationary means $\langle M_s \rangle$ and $\langle N_s \rangle$ can be obtained as the limit as $t \to +\infty$ of the time-dependent ones; the formula for the deterministic solution (4) yields

\[
\langle M_s \rangle = \frac{k_1}{\gamma_1}, \qquad \langle N_s \rangle = \frac{k_1 k_2}{\gamma_1 \gamma_2}. \tag{8}
\]

The time-dependent second moments satisfy a system of equations which is more complex than the one for the means and which can be found in Singh and Hespanha (2007); the stationary variances and covariance can be determined by analysing this system and are given by (cf. Thattai and van Oudenaarden 2001)

\[
\mathrm{Var}(M_s) = \frac{k_1}{\gamma_1}, \qquad
\mathrm{Var}(N_s) = \frac{k_1 k_2}{\gamma_1 \gamma_2} \left(1 + \frac{k_2}{\gamma_1 + \gamma_2}\right), \qquad
\mathrm{Cov}(M_s, N_s) = \frac{k_1 k_2}{\gamma_1 (\gamma_1 + \gamma_2)}. \tag{9}
\]

The derivation of the analytic formulae (8)–(9) for the mean and the variance of mRNA and protein levels, the methods of which can be applied to understand the transmission of noise down a generic regulatory pathway (Paulsson 2004), provides an important tool for theoretical understanding of stochasticity in gene expression; nevertheless, complete characterisation of the distribution of gene products is of equal interest and has not been provided in recent studies (e.g. Cheong et al. 2010; Shahrezaei and Swain 2008b) discussing the stochastic model (6), except for a special asymptotic case of fast mRNA degradation (Shahrezaei and Swain 2008a).

The chemical kinetics description (1) for spontaneous gene expression is referred to as the two-stage model (Thattai and van Oudenaarden 2001; Shahrezaei and Swain 2008a), the stages being those of transcription and translation, as opposed to a more detailed three-stage model (Shahrezaei and Swain 2008a; Blake et al.
2003; Raser and O’Shea 2004; Coulon et al. 2010; Raj et al. 2006), which includes an upstream mechanism of promoter transitioning between two or more states each associated with a distinct rate constant for transcription. Such promoter transitioning can be attributed to stochastic binding to and unbinding from the promoter of regulatory factors modifying the transcription initiation rate (Golding et al. 2005). If these processes of binding and dissociation are fast compared to the transcription dynamics (such a limit was previously considered for prokaryotic gene expression by Shea and Ackers (1985) in the deterministic and by Hornos et al. (2005) in the stochastic contexts), then the three-stage model reduces to the two-stage one, with $k_1$ in (1) equal to the weighted average of the transcription rates associated with the individual promoter states, wherein the weights are the stationary probabilities of these states.

Although the focus of this paper is on the two-stage model, let us briefly summarise the related results obtained for the model with three stages. As for the master equation (6) above, the first and second moments of mRNA and protein counts have been determined for the stochastic three-stage model (Raj et al. 2006; Paszek 2007). The stationary distribution of mRNA levels, but not that of proteins, has been completely characterised for the case of constitutive transcription from a transitioning promoter (Raj et al. 2006; Innocentini and Hornos 2007; Iyer-Biswas et al. 2009) using generating-function methods. These methods have also been used to characterise the steady-state protein distribution in stochastic models which do not explicitly include a representation of translation, notably in an early study of Peccoud and Ycart (1995) and also in an analysis of stochastic gene autoregulation (Hornos et al. 2005).
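Returning briefly to the two-stage model, the closure of the moment equations described earlier in this section can be illustrated with a short sketch. The second-moment equations used below were derived here by multiplying (6) by $m^2$, $mn$ and $n^2$ and summing; they are our own working rather than a quotation from the cited references, and the function name and the parameter values (those of Fig. 1, with $\gamma_2 = 1$ as an arbitrary time unit) are assumptions made for illustration. At stationarity the equations can be solved one after another, and the result can be compared with (8)–(9).

```python
def stationary_moments(k1, g1, k2, g2):
    """Solve the stationary first- and second-moment equations of (6) in sequence.

    Moment equations (our own derivation from the master equation):
      d<m>/dt   = k1 - g1<m>
      d<n>/dt   = k2<m> - g2<n>
      d<m^2>/dt = k1(2<m> + 1) + g1(<m> - 2<m^2>)
      d<mn>/dt  = k1<n> + k2<m^2> - (g1 + g2)<mn>
      d<n^2>/dt = k2(2<mn> + <m>) + g2(<n> - 2<n^2>)
    """
    m = k1 / g1
    n = k2 * m / g2
    mm = (k1 * (2 * m + 1) + g1 * m) / (2 * g1)
    mn = (k1 * n + k2 * mm) / (g1 + g2)
    nn = (k2 * (2 * mn + m) + g2 * n) / (2 * g2)
    return {
        "mean_m": m,
        "mean_n": n,
        "var_m": mm - m * m,
        "var_n": nn - n * n,
        "cov": mn - m * n,
    }


# Illustrative parameters: k1/g1 = 2, k2/g2 = 5/3, g1/g2 = 1/3.
g2 = 1.0
g1 = g2 / 3.0
k1 = 2.0 * g1
k2 = (5.0 / 3.0) * g2

moments = stationary_moments(k1, g1, k2, g2)

# Closed-form expressions (8)-(9) for comparison.
var_n_formula = k1 * k2 / (g1 * g2) * (1 + k2 / (g1 + g2))
cov_formula = k1 * k2 / (g1 * (g1 + g2))
print(moments, var_n_formula, cov_formula)
```

Because each equation involves only moments of the same or lower order, no truncation or closure approximation is needed, which is the point made in the text about linear reaction rates.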
In this paper we present the following results on the large-time behaviour of the solution to problem (6)–(7): we shall find the generating function of the stationary distribution of mRNA and protein counts and use this to derive analytic formulae for the marginal stationary probabilities and to find the first four central stationary moments, thus determining the skewness and kurtosis of the marginal distributions whilst reiterating the results (8)–(9) for their means and variances. We also present a numerical method for obtaining the joint stationary distribution of mRNA and protein counts from its generating function and study the asymptotic properties of the generating function in order to interpret the results of our analysis.

Note that in (1) the dynamics of mRNA is functionally independent of the amount of protein in the system; indeed, the amount of mRNA in the system evolves according to a simple one-dimensional Markov process, known as the immigration-and-death process (Cox and Miller 1977) due to its early applications in stochastic modelling of population growth (Kendall 1949); it is well known that the stationary distribution of this process is Poissonian (Kendall 1949; Cox and Miller 1977). We shall not repeat the standard analysis of the immigration-and-death process here; instead, we focus on the derivation of the joint (stationary) distribution of mRNA and protein levels, obtaining the result for mRNAs in Section 3 as a trivial corollary of our analysis.

2 The generating function of the stationary distribution

As a rule of thumb, the master equation has a unique equilibrium distribution [see van Kampen (2006) for details and also some examples for which it is not true]. Thus, for all initial conditions, the solution $p_{m,n}(t)$ of (6)–(7) tends as $t \to +\infty$ to a limit $p^{\mathrm{st}}_{m,n}$ which is a time-independent solution to the master equation, i.e.

\[
k_1 (p^{\mathrm{st}}_{m-1,n} - p^{\mathrm{st}}_{m,n}) + \gamma_1 \big((m+1) p^{\mathrm{st}}_{m+1,n} - m p^{\mathrm{st}}_{m,n}\big) + k_2 m (p^{\mathrm{st}}_{m,n-1} - p^{\mathrm{st}}_{m,n}) + \gamma_2 \big((n+1) p^{\mathrm{st}}_{m,n+1} - n p^{\mathrm{st}}_{m,n}\big) = 0, \tag{10}
\]

and satisfies the normalizing condition

\[
\sum_{m,n} p^{\mathrm{st}}_{m,n} = 1. \tag{11}
\]

The standard approach to solving recurrence relations such as (10)–(11) is to consider the corresponding generating function (e.g. Cox and Miller 1977), which is defined by

\[
\varphi(x, y) = \sum_{m,n} x^m y^n p^{\mathrm{st}}_{m,n}. \tag{12}
\]

Multiplying (10) by the factor $x^m y^n$ and summing over $m$ and $n$, we obtain that the generating function satisfies the linear first-order PDE

\[
\big(\gamma_1 (x - 1) + k_2 x (1 - y)\big) \frac{\partial \varphi}{\partial x} + \gamma_2 (y - 1) \frac{\partial \varphi}{\partial y} = k_1 (x - 1) \varphi. \tag{13}
\]

The normalizing condition (11) implies

\[
\varphi(1, 1) = 1. \tag{14}
\]

The characteristics of (13) are defined by

\[
x' = \gamma_1 (x - 1) + k_2 x (1 - y), \qquad y' = \gamma_2 (y - 1), \qquad \varphi' = k_1 (x - 1) \varphi. \tag{15}
\]

Although the system (15) is exactly solvable, its solutions (and their $\varphi$ components in particular) are given by complex formulae from which the functional form of solutions to the PDE (13) cannot be easily inferred. Therefore, we shall not analyse the system (15) any further; instead, we shall use a suitable ansatz to find a particular solution to (13)–(14) and then we shall demonstrate that it is the only solution to the problem.

First, we simplify (13)–(14) by changing the variables according to

\[
x = 1 + u, \qquad y = 1 + v, \qquad \varphi = \exp(\psi), \tag{16}
\]

and obtain that the factorial cumulant generating function (Johnson et al. 2005) $\psi = \psi(u, v)$ is a solution of the inhomogeneous linear PDE,

\[
\big(\gamma_1 u - k_2 (1 + u) v\big) \frac{\partial \psi}{\partial u} + \gamma_2 v \frac{\partial \psi}{\partial v} = k_1 u, \tag{17}
\]

subject to

\[
\psi(0, 0) = 0. \tag{18}
\]

Let us look for $\psi(u, v)$ in the form of a power series,

\[
\psi(u, v) = \sum_{m,n} u^m v^n a_{m,n}.
\]

Immediately, we obtain $a_{0,0} = \psi(0, 0) = 0$. The PDE (17) implies the following recurrence relation for the sequence $a_{m,n}$ (with the convention that $a_{m,-1} = 0$):

\[
(\gamma_1 m + \gamma_2 n) a_{m,n} = k_2 \big((m + 1) a_{m+1,n-1} + m a_{m,n-1}\big) + k_1 \delta_{m,1} \delta_{n,0}. \tag{19}
\]

Taking $n = 0$ in (19), we obtain $\gamma_1 m a_{m,0} = k_1 \delta_{m,1}$, which implies that $a_{m,0} = 0$ for all $m \ge 2$.
Setting $n = 1$ in (19), we find that $a_{m,1}$ is a linear combination of $a_{m+1,0}$ and $a_{m,0}$; therefore, $a_{m,1} = 0$ for all $m \ge 2$. Clearly, we can successively use (19) for increasing integer values of $n$ to obtain

\[
a_{m,n} = 0, \qquad m \ge 2, \; n \ge 0.
\]

It remains to determine $\{a_{0,n}\}_{n \ge 1}$ and $\{a_{1,n}\}_{n \ge 0}$. Setting $m = 1$ in (19) yields a linear first-order recurrence equation for the latter,

\[
(\gamma_1 + \gamma_2 n) a_{1,n} = k_2 a_{1,n-1} + k_1 \delta_{n,0},
\]

which gives

\[
a_{1,n} = \frac{k_1}{\gamma_1} \frac{(k_2/\gamma_2)^n}{(1 + \gamma_1/\gamma_2)_n}, \qquad n \ge 0, \tag{20}
\]

where $(c)_n = c(c + 1)(c + 2) \cdots (c + n - 1)$, $(c)_0 = 1$. To determine the terms of the sequence $\{a_{0,n}\}_{n \ge 1}$, we set $m = 0$ in (19), and obtain

\[
\gamma_2 n a_{0,n} = k_2 a_{1,n-1},
\]

which implies

\[
a_{0,n} = \frac{k_1}{\gamma_1} \frac{(k_2/\gamma_2)^n}{n (1 + \gamma_1/\gamma_2)_{n-1}} = \frac{k_1}{\gamma_2} \frac{(k_2/\gamma_2)^n}{n (\gamma_1/\gamma_2)_n}, \qquad n \ge 1. \tag{21}
\]

Thus, we find that all coefficients $a_{m,n}$, except for those given by (20)–(21), are zero, which implies that the solution $\psi(u, v)$ of (17)–(18) satisfies

\[
\psi(u, v) = \sum_{m,n} u^m v^n a_{m,n} = \sum_{n \ge 1} v^n a_{0,n} + u \sum_{n \ge 0} v^n a_{1,n}
= \frac{k_1}{\gamma_2} \sum_{n \ge 1} \frac{(k_2 v/\gamma_2)^n}{n (\gamma_1/\gamma_2)_n} + \frac{k_1 u}{\gamma_1} \sum_{n \ge 0} \frac{(k_2 v/\gamma_2)^n}{(1 + \gamma_1/\gamma_2)_n}. \tag{22}
\]

We now show that the constructed particular solution $\psi(u, v)$ is the only solution to the problem (17)–(18). Let us consider another (possibly different) solution $\tilde{\psi}(u, v)$ to the same problem, i.e. let $\tilde{\psi}(u, v)$ be an arbitrary differentiable function of two variables which satisfies

\[
\big(\gamma_1 u - k_2 (1 + u) v\big) \frac{\partial \tilde{\psi}}{\partial u} + \gamma_2 v \frac{\partial \tilde{\psi}}{\partial v} = k_1 u, \qquad \tilde{\psi}(0, 0) = 0.
\]

Then $\psi_h = \tilde{\psi} - \psi$ satisfies the homogeneous equation

\[
\big(\gamma_1 u - k_2 (1 + u) v\big) \frac{\partial \psi_h}{\partial u} + \gamma_2 v \frac{\partial \psi_h}{\partial v} = 0, \tag{23}
\]

and the additional condition

\[
\psi_h(0, 0) = \tilde{\psi}(0, 0) - \psi(0, 0) = 0. \tag{24}
\]

The characteristics of (23) satisfy the planar system

\[
u' = \gamma_1 u - k_2 (1 + u) v, \qquad v' = \gamma_2 v. \tag{25}
\]

According to the method of characteristics, the solution $\psi_h(u, v)$ to the homogeneous PDE (23) is constant along each trajectory of the dynamical system (25).
Since the origin $(u, v) = (0, 0)$ is an (unstable) node of the system (25), there are two possible scenarios for the qualitative behaviour of the solution $\psi_h(u, v)$: either $\psi_h(u, v)$ is constant everywhere, or it is discontinuous at the trivial node. The latter option is not appropriate in our case because the additional condition (24) requires $\psi_h(u, v)$ to be well-behaved around the origin. Therefore, $\psi_h(u, v) \equiv \psi_h(0, 0) = 0$, which implies that $\tilde{\psi}(u, v) = \psi(u, v)$ for all $u$ and $v$; thus, the function $\psi(u, v)$ given by (22) is indeed the only solution to the problem (17)–(18).

Thus, having found a unique solution (22) to the PDE (17)–(18), we return to the original variables according to (16), obtaining that the solution to (13)–(14), which gives the generating function of the stationary distribution of mRNA and protein amounts, is given by

\[
\varphi(x, y) = \exp\left( \frac{k_1}{\gamma_2} \sum_{n \ge 1} \frac{(k_2/\gamma_2)^n (y - 1)^n}{n (\gamma_1/\gamma_2)_n} + \frac{k_1 (x - 1)}{\gamma_1} \sum_{n \ge 0} \frac{(k_2/\gamma_2)^n (y - 1)^n}{(1 + \gamma_1/\gamma_2)_n} \right).
\]

The above formula can be rewritten as

\[
\varphi(x, y) = \exp\left( \alpha\beta \int_1^y M(1, 1 + \lambda, \beta(s - 1))\, ds + \alpha (x - 1) M(1, 1 + \lambda, \beta(y - 1)) \right), \tag{26}
\]

where

\[
M(a, b, z) = \sum_{n=0}^{\infty} \frac{(a)_n}{(b)_n} \frac{z^n}{n!}
\]

is Kummer’s function (Abramowitz and Stegun 1972) and

\[
\lambda = \frac{\gamma_1}{\gamma_2}, \qquad \alpha = \frac{k_1}{\gamma_1}, \qquad \beta = \frac{k_2}{\gamma_2}, \tag{27}
\]

so that $\lambda$ is the ratio between protein and mRNA half-lives, which is typically greater than one and often $\lambda \gg 1$ (refer back to Sect. 1 for details), $\alpha$ is the average number of mRNA molecules, ranging from less than one to a few dozen, and $\beta$ gives the ratio between the average protein and the average mRNA abundances, which is typically quite high (see Lu et al. (2007), Taniguchi et al. (2010) or Sect. 1).

3 The marginal generating functions and distributions

In this section, we use the formula for the stationary generating function (26) to characterize the underlying distribution. In particular, we shall determine the marginal stationary probabilities,

\[
p^{\mathrm{st}}_{m,\cdot} = \sum_n p^{\mathrm{st}}_{m,n}, \qquad p^{\mathrm{st}}_{\cdot,n} = \sum_m p^{\mathrm{st}}_{m,n}. \tag{28}
\]

Here, $p^{\mathrm{st}}_{m,\cdot}$ gives the probability of having $m$ mRNA molecules and any number (including zero) of protein; $p^{\mathrm{st}}_{\cdot,n}$ is the probability of having $n$ protein molecules and an arbitrary amount of mRNA. Obviously, each of these two sequences of probabilities sums to unity, so that each forms a probability distribution; these two are referred to as the marginal distributions of mRNA (or protein) levels, and the two-dimensional distribution $p^{\mathrm{st}}_{m,n}$ is known as the joint distribution.

The generating functions corresponding to the marginal distributions are defined by

\[
\varphi_1(x) = \sum_m x^m p^{\mathrm{st}}_{m,\cdot}, \qquad \varphi_2(y) = \sum_n y^n p^{\mathrm{st}}_{\cdot,n}. \tag{29}
\]

Below, we refer to $\varphi_1(x)$ and $\varphi_2(y)$ as the marginal generating functions; similarly, the generating function $\varphi(x, y)$ of the joint distribution $p^{\mathrm{st}}_{m,n}$ is referred to as the joint generating function. The marginal and joint generating functions are related by

\[
\varphi_1(x) = \varphi(x, 1), \qquad \varphi_2(y) = \varphi(1, y), \tag{30}
\]

so that we can use formula (26) for the latter to obtain formulae for the former.

To determine the marginal mRNA distribution, we set $y = 1$ in (26), finding that

\[
\varphi_1(x) = \exp(\alpha (x - 1)). \tag{31}
\]

Such a generating function corresponds to the Poisson distribution (Johnson et al. 2005), i.e.

\[
p^{\mathrm{st}}_{m,\cdot} = e^{-\alpha} \frac{\alpha^m}{m!}. \tag{32}
\]

This result is a trivial consequence of the fact that the mRNA dynamics in the two-stage gene-expression model is governed by a simple univariate immigration-and-death Markov process. Clearly, the Poisson distribution of mRNA levels can be determined without the knowledge of the joint distribution: inserting $y = 1$ in the PDE (13) for the joint generating function, we obtain that the marginal mRNA generating function satisfies the linear first-order ODE $\gamma_1 \varphi_1' = k_1 \varphi_1$; solving this equation subject to the normalizing condition $\varphi_1(1) = 1$, we arrive at the formula (31), which implies the Poisson distribution of mRNA levels.
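As a sanity check on the joint generating function (26) used throughout this section, the sketch below evaluates $\psi = \ln \varphi$ from its power series and tests, by central finite differences, that it satisfies the PDE (17) after division by $\gamma_2$, i.e. $(\lambda u - \beta(1+u)v)\,\psi_u + v\,\psi_v = \alpha\lambda u$. The function names, the step size and the test points are illustrative assumptions, and the direct series summation is only intended for moderate values of $|\beta v|$.

```python
def kummer_1(b, z, terms=200):
    """M(1, b, z) = sum_n z^n / (b)_n by direct summation (moderate |z| assumed)."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= z / (b + n)
    return total


def psi(u, v, alpha, beta, lam, terms=200):
    """Logarithm of (26) at x = 1 + u, y = 1 + v."""
    # alpha*beta * integral_0^v M(1, 1+lam, beta*s) ds, integrated term by term.
    integral, coeff = 0.0, 1.0  # coeff = (beta*v)^k / (1+lam)_k
    for k in range(terms):
        integral += coeff * v / (k + 1)
        coeff *= beta * v / (1 + lam + k)
    return alpha * beta * integral + alpha * u * kummer_1(1 + lam, beta * v, terms)


def pde_residual(u, v, alpha, beta, lam, h=1e-5):
    """Residual of (17) divided by gamma_2, using central differences of step h."""
    psi_u = (psi(u + h, v, alpha, beta, lam) - psi(u - h, v, alpha, beta, lam)) / (2 * h)
    psi_v = (psi(u, v + h, alpha, beta, lam) - psi(u, v - h, alpha, beta, lam)) / (2 * h)
    return (lam * u - beta * (1 + u) * v) * psi_u + v * psi_v - alpha * lam * u


# Parameter values of Fig. 1; the (u, v) test points are arbitrary.
alpha, beta, lam = 2.0, 5.0 / 3.0, 1.0 / 3.0
residuals = [pde_residual(u, v, alpha, beta, lam)
             for (u, v) in [(-0.5, -0.7), (0.3, 0.2), (-1.0, -1.0)]]
print(residuals)

# Setting v = 0 (y = 1) recovers the Poisson marginal (31): psi(u, 0) = alpha * u.
print(psi(0.4, 0.0, alpha, beta, lam))
```

The residuals are zero up to finite-difference error, and the restriction to $v = 0$ reproduces $\ln \varphi_1(x) = \alpha(x - 1)$.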
Setting $x = 1$ in (26), we find that the marginal protein generating function $\varphi_2(y)$ can be written as

\[
\varphi_2(y) = \exp(\psi_2(y)), \tag{33}
\]

where $\psi_2(y)$ is defined by

\[
\psi_2(y) = \alpha\beta \int_1^y M(1, 1 + \lambda, \beta(s - 1))\, ds. \tag{34}
\]

Generating functions of a similar, but non-identical, form, involving an exponential of a power series in $y$, are known to correspond to the family of the Poisson-stopped sum distributions (Johnson et al. 2005), which have been used to model growth of a heterogeneous population (Neyman 1939). Some of the methods developed for the Poisson-stopped sum family are readily applicable to the marginal protein distribution (33)–(34); below, we find a recursive formula for the marginal probabilities $p^{\mathrm{st}}_{\cdot,n}$ using a procedure previously suggested for a particular type of the Poisson-stopped sum by Gurland (1958).

The marginal protein distribution can be obtained from its generating function as

\[
p^{\mathrm{st}}_{\cdot,n} = \left. \frac{D^n(\varphi_2(y))}{n!} \right|_{y=0}, \tag{35}
\]

where $D$ denotes the differential operator $d/dy$. Equation (35) requires us to find the $n$th derivative of the composite function $\varphi_2(y) = \exp(\psi_2(y))$; the first derivative is obtained by the chain rule,

\[
\frac{d\varphi_2(y)}{dy} = \varphi_2(y) \frac{d\psi_2(y)}{dy}. \tag{36}
\]

Taking the $(n - 1)$th derivative of (36) and differentiating the product on the right-hand side according to the Leibniz rule (Gurland 1958), we have

\[
D^n(\varphi_2(y)) = \sum_{i=0}^{n-1} \binom{n-1}{i} D^i(\varphi_2(y))\, D^{n-i}(\psi_2(y)), \tag{37}
\]

which gives the $n$th derivative of $\varphi_2(y)$ expressed in terms of its lower-order derivatives, thus allowing us to evaluate the derivatives of $\varphi_2(y)$ of any order recursively; an alternative, nonrecursive expression for $D^n(\varphi_2(y))$ can be obtained using the Faà di Bruno formula (Johnson 2002) but, as this leads to algebraic expressions that reveal little, we adhere to the recursive formulation in this exposition. Equation (37) can be rewritten in a form which will be more convenient for us in what follows:

\[
\frac{D^n(\varphi_2(y))}{n!} = \frac{1}{n} \sum_{i=0}^{n-1} \frac{D^{n-i}(\psi_2(y))}{(n - i - 1)!} \frac{D^i(\varphi_2(y))}{i!}. \tag{38}
\]

Obviously, we still need to determine the derivatives of the inner function $\psi_2(y)$, which appear in our recursive formula (38); this is not difficult: the $r$th derivative of $\psi_2(y)$, with $r$ an arbitrary positive integer, is given by

\[
D^r(\psi_2(y)) = \frac{\alpha \beta^r (r - 1)!}{(1 + \lambda)_{r-1}} M(r, r + \lambda, \beta(y - 1)), \tag{39}
\]

in which we used that for the derivative of Kummer’s function, we have (Abramowitz and Stegun 1972)

\[
\frac{d^s}{dz^s} M(a, b, z) = \frac{(a)_s}{(b)_s} M(a + s, b + s, z),
\]

for any nonnegative integer $s$. Expressing the derivatives of $\psi_2(y)$ in (38) according to (39) and then taking $y = 0$, we arrive at the recursive formula for the marginal protein probabilities,

\[
p^{\mathrm{st}}_{\cdot,n} = \frac{\alpha\beta}{n} \sum_{i=0}^{n-1} \frac{\beta^{n-i-1}}{(1 + \lambda)_{n-i-1}} M(n - i, n - i + \lambda, -\beta)\, p^{\mathrm{st}}_{\cdot,i}, \tag{40}
\]

where the first term of the sequence is given by

\[
p^{\mathrm{st}}_{\cdot,0} = \varphi_2(0) = \exp\left( -\alpha\beta \int_0^1 M(1, 1 + \lambda, \beta(s - 1))\, ds \right). \tag{41}
\]

Further properties of the marginal protein stationary distribution, e.g. the cumulants and the first four central moments, are derived in the Appendix.

4 Calculating the joint distribution using the discrete Fourier transform

By the definition (12) of the generating function $\varphi(x, y)$, the probabilities $p^{\mathrm{st}}_{m,n}$ are the coefficients of the power-series expansion of $\varphi(x, y)$; therefore, these probabilities can in theory be obtained by evaluating the partial derivatives of any order of $\varphi(x, y)$ at $x = y = 0$. Unfortunately, such a direct approach to finding $p^{\mathrm{st}}_{m,n}$ does not yield results as neat as those obtained in Sect. 3 for the marginal distributions. Therefore, we choose an alternative approach: we shall determine numerically the joint stationary distribution of mRNA and protein counts from the generating function using the discrete Fourier transform.

We begin by considering two positive integers $M$, $N$ and the following values of the generating function:

\[
A_{k,l} = \varphi\left(e^{-2\pi i k/M}, e^{-2\pi i l/N}\right), \qquad k = 0, \ldots, M - 1, \quad l = 0, \ldots, N - 1.
\]
Using the definition (12) of the generating function $\varphi(x, y)$, we find that

\[
A_{k,l} = \sum_{m=0}^{\infty} \sum_{n=0}^{\infty} p^{\mathrm{st}}_{m,n} \exp\left( -2\pi i \left( \frac{mk}{M} + \frac{nl}{N} \right) \right). \tag{42}
\]

If $M$ and $N$ are sufficiently large, we can truncate the infinite sum in (42) by excluding the terms for which $m \ge M$ or $n \ge N$ without introducing a significant numerical error, so that we can write

\[
A_{k,l} \approx \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} p^{\mathrm{st}}_{m,n} \exp\left( -2\pi i \left( \frac{mk}{M} + \frac{nl}{N} \right) \right), \tag{43}
\]

which implies that the $M \times N$ matrix $A_{k,l}$ is approximately equal to the discrete Fourier transform (Press et al. 2007) of the truncated probability sequence $p^{\mathrm{st}}_{m,n}$. Taking the inverse discrete Fourier transform (Press et al. 2007) of (43), we obtain

\[
p^{\mathrm{st}}_{m,n} \approx \frac{1}{MN} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} A_{k,l} \exp\left( 2\pi i \left( \frac{mk}{M} + \frac{nl}{N} \right) \right), \qquad m < M, \; n < N. \tag{44}
\]

The remaining terms of the probability distribution $p^{\mathrm{st}}_{m,n}$ are negligibly small. The right-hand side of (44) can be evaluated efficiently using the fast Fourier transform algorithm (Press et al. 2007); thus, (44) provides us with the desired numerical recipe for computing the joint stationary distribution from the matrix $A_{k,l}$.

Let us now focus on the problem of the numerical evaluation of the terms $A_{k,l}$. It can easily be obtained from (42) that these terms satisfy the Hermitian property (with the indices understood modulo $M$ and $N$, and the bar denoting complex conjugation)

\[
A_{M-k,N-l} = \overline{A_{k,l}}, \qquad k < M, \; l < N,
\]

and therefore it is sufficient to compute $A_{k,l}$ for $l < N/2 + 1$ only. From the functional form of $\varphi(x, y)$, (26), we see that in order to obtain the values $A_{k,l} = \varphi(e^{-2\pi i k/M}, e^{-2\pi i l/N})$, we need to evaluate the function $F(y) = M(1, 1 + \lambda, \beta(y - 1))$ at $y = e^{-2\pi i l/N}$; we are also required to find a sequence of complex integrals of this function,

\[
I_l = \int_1^{e^{-2\pi i l/N}} F(z)\, dz, \qquad 0 \le l < N/2 + 1.
\]

The former task is normally straightforward as most mathematical software provides an implementation of Kummer’s function; to evaluate the integrals, note that
For l > 0, given that N is large enough, we can use the trapezium rule to obtain that − 2πil N e Il = Il−1 + e ≈ Il−1 + F(z) dz − 2πi(l−1) N 2πil 2πi(l−1) 2πi(l−1) 2πil 1 F(e− N ) + F(e− N ) e− N − e− N , 2 which enables us to compute the terms Il recursively. We implemented the above-described recipe for finding the joint stationary distribution in the programming language Python (version 2.5) enhanced by its numerical and scientific packages NumPy and SciPy (versions 1.2.1 and 0.7.1 respectively). We used the irfft2 routine from the module numpy.fft to compute the inverse discrete Fourier transform of a two-dimensional sequence of complex numbers satisfying the Hermitian property as discussed above. For finding values of Kummer’s function, we used the routine hyp1f1 from the module scipy.special. In Fig. 1, which was prepared using Python’s plotting package Matplotlib, we show a graphical repst for parameter values α = 2, β = 5/3, λ = 1/3 resentation of the distribution pm,n and for the truncation integers M = N = 64. Each of the squares in the central graph of the figure corresponds to a particular pair of indices m and n; the color of the st for the given index pair. The two bar charts next to square relates to the value of pm,n the central graph depict the marginal distributions of mRNAs and proteins, either of st by which has been calculated from the numerically evaluated joint distribution pm,n summing it over rows or columns—the marginal probabilities calculated in this way are shown as semi-transparent white bars in the charts—and, independently, using the exact analytic expressions (32) and (40)–(41), the evaluations of which are visualised as background grey bars. The close agreement between the numerical and analytic results observed in the bar charts of Fig. 1 indicates that the numerical method for evaluating the joint distribution introduces a small error only. In Fig. 
1 the mRNA and protein counts are positively correlated, the distribution of protein counts being more widely spread and heavy-tailed than that of mRNA levels. Intuitively, the correlation results from mRNA acting as an upstream element which positively regulates protein production. The relatively large variability in protein levels is due to the variability in the expression of the upstream element being transmitted down the regulatory pathway. Accordingly, in models for stochastic gene expression which include promoter transitions (e.g. Raj et al. 2006), the distribution of the mRNA levels can be more heavy-tailed than the Poisson distribution in the vertical bar chart of Fig. 1, since random transitions between promoter states introduce an extra source of stochasticity which is transmitted downstream to the mRNA levels and through those down to the levels of protein. The Poisson distribution of mRNA levels is however appropriate if promoter transitions do not represent a significant source of stochasticity in gene expression (Yu et al. 2006).

Fig. 1 The joint and marginal distributions of mRNA and protein counts. The mRNA and protein levels are positively correlated. The marginal distribution of protein counts is more widely spread and has a heavier tail than that of mRNA levels

5 Asymptotic analysis of the stationary distribution

In the previous sections we used the formula (26) for the generating function $\varphi(x, y)$ to find a recursive expression for the marginal protein distribution $p^{\mathrm{st}}_{\cdot,n}$ and to compute the joint distribution $p^{\mathrm{st}}_{m,n}$ via the discrete Fourier transform. Here we shall examine the asymptotic behaviour of the function $\varphi(x, y)$ in order to understand the properties of the underlying distributions qualitatively. Among the asymptotic regimes considered below, some are realistic and have been experimentally demonstrated.
Other regimes are less realistic, yet they help us in exploring the parameter space of the investigated distributions. In the field of gene expression, limit cases were previously used to understand a complex stochastic phenomenon by Hornos et al. (2005), who examined a gene autoregulation model in the limit of slow, as well as of fast, promoter transitions.

The first asymptotic case we focus on is that of $\beta/(1+\lambda) = k_2/(\gamma_1+\gamma_2) \ll 1$, which requires that either $k_2/\gamma_1 \ll 1$, i.e. that most mRNAs are not translated, or $k_2/\gamma_2 \ll 1$, so that mRNAs are much more abundant than proteins, the latter of which, however, is not in agreement with experimental evidence (Lu et al. 2007). If y is fixed and $\beta \ll 1+\lambda$, Kummer’s function satisfies

$$M(1, 1+\lambda, \beta(y-1)) = 1 + O\!\left(\frac{\beta}{1+\lambda}\right).$$

Therefore, for the joint generating function ϕ(x, y) we have

$$\phi(x,y) = \exp\left(\alpha\beta\int_1^y M(1, 1+\lambda, \beta(s-1))\,ds + \alpha(x-1)M(1, 1+\lambda, \beta(y-1))\right) = \exp\left(\alpha\beta(y-1) + \alpha(x-1)\right) + O\!\left(\frac{\beta}{1+\lambda}\right). \tag{45}$$

The leading-order term in (45) is the joint generating function of two Poissonian random variables which are independent, as the function factorises into a product of its marginals. Thus, if $\beta \ll 1+\lambda$, the stationary distribution of protein levels converges weakly to the Poisson distribution with mean αβ which is statistically independent of the amount of mRNA in the system (which, as we showed before, has the Poisson distribution with mean α). In Fig. 2a we compare the exact stationary distribution $p^{\mathrm{st}}_{\cdot,n}$ of protein levels, obtained by the recursive formula (40)–(41) for a chosen parameter set (λ = 20, α = 1, β = 4), to the Poisson distribution with mean αβ = 4, observing a close agreement between the two.

Fig. 2 The a Poisson, b Neyman and c negative binomial distributions (white bars), compared to the exact stationary distribution of protein counts (black bars in the background) for illustrative parameter sets (details in the main text)
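The estimate $M(1, 1+\lambda, \beta(y-1)) = 1 + O(\beta/(1+\lambda))$ behind (45) can be checked numerically. The following is a minimal sketch, not the implementation used for the figures (which relied on scipy.special.hyp1f1): it assumes the standard power series $M(1, b, z) = \sum_{n\ge 0} z^n/(b)_n$, and the helper name kummer_m1 is hypothetical.

```python
import math

def kummer_m1(b, z, tol=1e-15, max_terms=1000):
    """Kummer's function M(1, b, z) = sum_{n>=0} z**n / (b)_n via its power series."""
    term = 1.0    # the n = 0 term
    total = 1.0
    for n in range(max_terms):
        term *= z / (b + n)          # now equals z**(n+1) / (b)_(n+1)
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total

lam, beta = 20.0, 4.0                # parameter set quoted for Fig. 2a
bound = beta / (1.0 + lam)
for y in (0.0, 0.25, 0.5, 0.75, 1.0):
    m = kummer_m1(1.0 + lam, beta * (y - 1.0))
    print(f"y = {y:4.2f}  M = {m:.6f}  |M - 1| = {abs(m - 1.0):.6f}  beta/(1+lam) = {bound:.6f}")
```

For y in [0, 1] the deviation |M − 1| stays below β/(1 + λ) for this parameter set, in line with the order estimate; the series form is only practical for moderate |z|.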
The Poisson distribution of protein counts represents a trivial asymptotic stationary behaviour of the gene expression model; to identify some of its nontrivial limits, we shall consider the behaviour of the generating function when the ratio $\lambda = \gamma_1/\gamma_2$ between the protein and the mRNA half-lives is small or large, the latter representing a much more biologically realistic case than the former (García-Martínez et al. 2007). For small values of λ, we write

$$M(1, 1+\lambda, \beta(y-1)) = e^{\beta(y-1)} + O(\lambda),$$

which implies

$$\phi(x,y) = \exp\left(\alpha\beta\int_1^y M(1, 1+\lambda, \beta(s-1))\,ds + \alpha(x-1)M(1, 1+\lambda, \beta(y-1))\right) = \exp\left(\alpha\left(e^{\beta(y-1)} - 1\right) + \alpha(x-1)e^{\beta(y-1)}\right) + O(\lambda) = \exp\left(\alpha\left(x e^{\beta(y-1)} - 1\right)\right) + O(\lambda).$$

In particular, the marginal protein generating function satisfies

$$\phi_2(y) = \phi(1, y) = e^{\alpha\left(e^{\beta(y-1)} - 1\right)} + O(\lambda). \tag{46}$$

The leading-order term on the right-hand side of (46) is the generating function of the Neyman type A distribution (Johnson et al. 2005) with parameters α and β, which we plot in Fig. 2b and compare to the exact distribution $p^{\mathrm{st}}_{\cdot,n}$ of protein levels for an illustrative parameter set (λ = 0.05, α = 1, β = 4).

Let us provide an intuitive explanation for the appearance of the Neyman probability law. The condition $\lambda = \gamma_1/\gamma_2 \ll 1$ requires the degradation of proteins to be much faster than that of mRNAs; let us assume that this holds, even though such an assumption is unlikely to be satisfied in real biological systems (García-Martínez et al. 2007). Then on the fast timescale of protein turnover, the number m of mRNA molecules is constant (with high probability) and the protein dynamics evolves according to an immigration-and-death process in which the immigration rate is $k_2 m$ and the death rate per protein is $\gamma_2$. Therefore, on the fast timescale, conditional on having m mRNAs in the system, the distribution of protein levels equilibrates and becomes Poisson with mean $k_2 m/\gamma_2 = \beta m$.
On the slow timescale, however, the amount of mRNA in the system, m, is subject to variation, and has a Poisson distribution with mean α, as shown in Sect. 3. Therefore, the large-time distribution of protein levels can be expected to be Poisson(mβ), where m is a Poisson(α) random variable. The distribution of such a variable has been shown to be Neyman type A with the generating function given by the leading-order term of (46) (Johnson et al. 2005).

To investigate the behaviour of ϕ(x, y) for large values of λ, we rescale the parameters α and β according to

$$\alpha = \frac{\alpha'}{\lambda}, \qquad \beta = \lambda\beta', \tag{47}$$

in which $\beta' = k_2/\gamma_1$ gives the mean number of protein molecules synthesised from an mRNA copy before it is degraded and $\alpha'\beta' = \alpha\beta$ gives the average number of proteins in the cell at any time. If α′ and β′ are kept constant in (47) while λ tends to infinity, then these characteristics of protein synthesis remain unchanged while the average amount of mRNA, $\alpha = \alpha'/\lambda$, tends to zero. Using the reparametrization (47) in the formula (26) for the generating function ϕ(x, y), we obtain

$$\phi(x,y) = \exp\left(\alpha'\beta'\int_1^y M(1, 1+\lambda, \lambda\beta'(s-1))\,ds + \frac{\alpha'(x-1)}{\lambda}M(1, 1+\lambda, \lambda\beta'(y-1))\right).$$

Note that

$$M(1, 1+\lambda, \lambda\beta'(y-1)) = \sum_{n\ge 0}\frac{\left(\lambda\beta'(y-1)\right)^n}{(1+\lambda)_n} = \sum_{n\ge 0}\left(\beta'(y-1)\right)^n + O(1/\lambda) = \frac{1}{1+\beta'(1-y)} + O(1/\lambda) \quad \text{as } \lambda\to+\infty.$$

Therefore, for x and y fixed, we have

$$\phi(x,y) = \left(1+\beta'(1-y)\right)^{-\alpha'} + O(1/\lambda), \tag{48}$$

from which we can deduce that the stationary distribution of mRNA levels converges weakly to a degenerate distribution which has the whole probability mass concentrated at m = 0, and that the distribution of protein levels tends to the negative binomial distribution (Johnson et al. 2005) with parameters α′ and β′. In Fig. 2c we compare the negative binomial to the exact distribution for a suitable parameter set λ = 10, α′ = 1, β′ = 4.
These parameter values imply an average of 0.1 mRNA and 4 protein molecules in the system, which is consistent with the experimental measurements of gene expression from the lac promoter in repressed conditions by Yu et al. (2006), who reported an average of 0.037 mRNA and 5 protein molecules per cell.

Heuristically, if $\lambda \gg 1$ while α′ and β′ are finite, so that α is low and β is high, then proteins are produced during infrequent and brief periods (often called bursts) of rapid translation which occur whenever an mRNA molecule is transcribed and which end once the mRNA is degraded (McAdams and Arkin 1997). The number of proteins produced from a burst has been shown to be geometrically distributed with the mean number equal to β′ (Berg 1978). In the limit of short mRNA lifetimes, it is no longer necessary to consider the mRNA intermediates; the dynamics of protein expression can be approximated by a univariate Markov process composed of (i) protein degradation, which occurs with the rate constant $\gamma_2$, and (ii) protein burst production, occurring with the rate constant $k_1$, by which a geometrically distributed number (with mean β′) of proteins is produced. The stationary distribution of such a Markov process has been shown to be negative binomial with the generating function equal to the leading-order term of (48) (Paulsson and Ehrenberg 2000).

The results of the simple asymptotic analysis presented above allow us to examine the qualitative properties of the stationary distribution of protein levels and to characterize the dependence of these properties on the model parameters. Recall that the mean ν and the variance $\nu_2$ of the stationary distribution $p^{\mathrm{st}}_{\cdot,n}$ of protein levels are given by

$$\nu = \alpha\beta, \qquad \nu_2 = \alpha\beta\left(1 + \frac{\beta}{1+\lambda}\right).$$

Note that we always have $\nu_2 > \nu$, and if $\nu_2 \approx \nu$, then $\beta \ll 1+\lambda$ and thus, by (45), the protein levels approximately have the Poisson distribution.
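To make the negative binomial limit (48) concrete, the sketch below evaluates the probabilities whose generating function is $(1+\beta'(1-y))^{-\alpha'}$ and compares their mean and variance with the exact stationary values ν and $\nu_2$ recalled above. It is an illustration only: the closed-form probability mass function is the standard expansion of that generating function, the function name neg_binomial_pmf is hypothetical, and the truncation at 2000 terms is an assumption that is harmless for these parameters.

```python
import math

def neg_binomial_pmf(n, a, b):
    """P(N = n) for the law with generating function (1 + b*(1 - y))**(-a)."""
    log_p = (math.lgamma(n + a) - math.lgamma(a) - math.lgamma(n + 1)
             + n * math.log(b / (1.0 + b)) - a * math.log(1.0 + b))
    return math.exp(log_p)

alpha_p, beta_p, lam = 1.0, 4.0, 10.0    # alpha', beta', lambda quoted for Fig. 2c
pmf = [neg_binomial_pmf(n, alpha_p, beta_p) for n in range(2000)]
nb_mean = sum(n * p for n, p in enumerate(pmf))
nb_var = sum((n - nb_mean) ** 2 * p for n, p in enumerate(pmf))

# Exact stationary moments of the full model, with alpha = alpha'/lam, beta = lam*beta'
alpha, beta = alpha_p / lam, lam * beta_p
exact_mean = alpha * beta
exact_var = alpha * beta * (1.0 + beta / (1.0 + lam))

print(f"negative binomial: mean = {nb_mean:.4f}, variance = {nb_var:.4f}")
print(f"exact (lam = {lam:g}): mean = {exact_mean:.4f}, variance = {exact_var:.4f}")
```

The means coincide for every λ, whereas the exact variance $\alpha'\beta'(1 + \lambda\beta'/(1+\lambda))$ approaches the negative binomial value $\alpha'\beta'(1+\beta')$ only as λ grows.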
In the general situation of $\nu_2 > \nu$, we can express the parameters α, β (or alternatively α′ and β′) in terms of the mean ν, variance $\nu_2$ and the parameter λ as

$$\alpha = \frac{\nu^2}{(1+\lambda)(\nu_2-\nu)}, \qquad \beta = \frac{(1+\lambda)(\nu_2-\nu)}{\nu}, \qquad \alpha' = \frac{\lambda\nu^2}{(1+\lambda)(\nu_2-\nu)}, \qquad \beta' = \frac{(1+\lambda)(\nu_2-\nu)}{\lambda\nu}.$$

We find that $(\alpha, \beta)|_{\lambda\to 0} = \left(\frac{\nu^2}{\nu_2-\nu}, \frac{\nu_2-\nu}{\nu}\right)$, and thus, for fixed values of the mean and variance and for $\lambda \ll 1$, the marginal protein distribution tends to the Neyman type A distribution with parameters $\frac{\nu^2}{\nu_2-\nu}$ and $\frac{\nu_2-\nu}{\nu}$. Similarly, since $(\alpha', \beta')|_{\lambda\to+\infty} = \left(\frac{\nu^2}{\nu_2-\nu}, \frac{\nu_2-\nu}{\nu}\right)$, we have that for $\lambda \gg 1$ the protein distribution converges to the negative binomial distribution with parameters $\frac{\nu^2}{\nu_2-\nu}$ and $\frac{\nu_2-\nu}{\nu}$. Thus, if the mean and variance are fixed, the exact stationary protein distribution is specified by the parameter λ ∈ (0, +∞) and can be envisaged as an intermediate between the Neyman and the negative binomial distributions; this is illustrated in Fig. 3, in which we show five distributions $p^{\mathrm{st}}_{\cdot,n}$ which all have the same mean ν = 4 and variance $\nu_2 = 20$ but differ in the ratio $\lambda = \gamma_1/\gamma_2$, namely (a) +∞ (the negative binomial limit case), (b) 5, (c) 1, (d) 0.2, (e) 0+ (the Neyman limit case).

Fig. 3 The non-Poisson behaviour of the distribution of protein counts: a the negative binomial limit case; b–d intermediate exact distributions; e the Neyman limit case

Having investigated the discrete limits of the mRNA-protein distribution, let us focus on identifying the continuous ones. If the mean mRNA and protein counts are large, i.e. if

$$\alpha \gg 1, \qquad \alpha\beta \gg 1, \tag{49}$$

hold simultaneously, then mRNA and protein levels are approximately normally distributed with means, variances and covariance given by (8)–(9), i.e. equal to those of the exact distributions. This result can be derived using van Kampen’s large system size expansion, which is applicable for an arbitrary stochastic reaction system (van Kampen 2006).
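The expressions for α, β, α′ and β′ in terms of ν, $\nu_2$ and λ given earlier in this section translate directly into a few lines of code. The sketch below is only a worked illustration for the Fig. 3 values ν = 4, $\nu_2 = 20$; the function name params_from_moments is hypothetical.

```python
def params_from_moments(mean, var, lam):
    """Return (alpha, beta, alpha', beta') for given protein mean, variance and lambda."""
    if not var > mean > 0:
        raise ValueError("the model requires variance > mean > 0")
    beta = (1.0 + lam) * (var - mean) / mean
    alpha = mean ** 2 / ((1.0 + lam) * (var - mean))
    # Rescaled parameters of (47): alpha' = lam * alpha, beta' = beta / lam
    return alpha, beta, lam * alpha, beta / lam

for lam in (5.0, 1.0, 0.2):              # the intermediate cases (b)-(d) of Fig. 3
    alpha, beta, alpha_p, beta_p = params_from_moments(4.0, 20.0, lam)
    print(f"lam = {lam:3.1f}: alpha = {alpha:.4f}, beta = {beta:.4f}, "
          f"alpha' = {alpha_p:.4f}, beta' = {beta_p:.4f}")
```

As λ tends to 0 the pair (α, β), and as λ tends to +∞ the pair (α′, β′), approaches (1, 4) for these moments, which are the Neyman and negative binomial parameters respectively.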
If the first condition in (49) holds but the second does not, the protein distribution is not normally distributed; however, we then have $\beta \ll 1$ and hence the Poisson approximation (45) can be used instead. Thus, for $\alpha \gg 1$, the exact mRNA and protein distributions can be approximated either by Gaussians if $\beta \gg 1/\alpha$ or by mutually independent Poisson distributions if $\beta \ll 1$. Notably, the latter case does not seem biologically realistic in light of experimental evidence (Lu et al. 2007).

A mixed continuous-discrete mode of the gene expression model occurs if the dynamics of mRNA levels is discrete and stochastic but that of protein levels is continuous and deterministic. Heuristically, for the protein dynamics to be deterministic, we need simultaneously $k_2/\gamma_1 \gg 1$ and $k_2/\gamma_2 \gg 1$, so that upon the transcription of an individual mRNA molecule a large number of proteins are produced from the transcript before either degradation process (of proteins or of the mRNA template) steps in to interrupt the resulting surge in protein levels. The above two conditions can be written succinctly as one, $k_2/(\gamma_1+\gamma_2) = \beta/(1+\lambda) \gg 1$. To analyse the leading-order asymptotic behaviour of the protein distribution for β/(1 + λ) large, we consider the characteristic function χ(v) of the rescaled random variable $(1+\lambda)N_s/\beta$, in which $N_s$ is the random variable giving the amount of proteins at steady state. The characteristic function χ(v) can easily be expressed in terms of the generating function $\phi_2(y)$, given by (33)–(34), of the variable $N_s$ and satisfies

$$\chi(v) = \left\langle e^{iv(1+\lambda)N_s/\beta}\right\rangle = \phi_2\!\left(e^{iv(1+\lambda)/\beta}\right) = \exp\left(\alpha\beta\int_1^{e^{iv(1+\lambda)/\beta}} M(1, 1+\lambda, \beta(s-1))\,ds\right) \sim \exp\left(\alpha\int_0^{iv(1+\lambda)} M(1, 1+\lambda, z)\,dz\right) \quad \text{for } \frac{\beta}{1+\lambda} \gg 1. \tag{50}$$

The approximate characteristic function given by the final expression in (50) corresponds to a parametric family of distributions which includes two notable special cases identified by considering the behaviour of the formula (50) for λ asymptotically large or small.
If in addition to our previous assumption $\beta/(1+\lambda) \gg 1$ we assume that $\lambda \gg 1$, then it is possible to further simplify (50) to find that $\chi(v) \sim (1-iv)^{-\alpha'}$, which is the characteristic function of the gamma distribution. The same gamma distribution can alternatively be obtained by taking β′ large in the previously derived negative binomial distribution (48), which we showed to approximate protein levels whenever $\lambda \gg 1$; clearly, if $\lambda \gg 1$, then the condition $\beta/(1+\lambda) \gg 1$ is equivalent to $\beta/\lambda = \beta' \gg 1$, and hence the two alternative derivations of the gamma probability law confirm that our analysis is self-consistent. The gamma distribution of protein levels has previously been obtained from a relatively simple piecewise deterministic model for gene expression dynamics (Friedman et al. 2006). It represents an important asymptotic case of the protein distribution studied here since the condition $k_2 \gg \gamma_1 \gg \gamma_2$ for which the approximation is valid is often satisfied (McAdams and Arkin 1997; Friedman et al. 2006).

Examining the behaviour of (50) for $\lambda \ll 1$, we find that the characteristic function of the random variable $N_s/\beta$ can be approximated by $\exp(\alpha(e^{iv}-1))$, this being the characteristic function of the Poisson distribution with mean α; the same result can be obtained by taking β large in the Neyman distribution (46), which has been shown to be valid whenever $\lambda \ll 1$. A similar comment to that made for the Neyman distribution needs to be made here: biologically, λ tends to be large rather than very small.

Table 1 Approximate distributions of protein levels. For each relevant asymptotic parameter region, we describe the distribution of the approximate stationary protein levels, $\tilde{N}_s$, and give the functional form of its characteristic function $\omega(s) = \langle e^{is\tilde{N}_s}\rangle$

Asymptotic region of validity    Description                     Characteristic function
β/(1+λ) ≪ 1                      Poisson                         $\exp\left(\alpha\beta(e^{is}-1)\right)$
λ ≪ 1                            Neyman                          $\exp\left(\alpha\left(e^{\beta(e^{is}-1)}-1\right)\right)$
λ ≫ 1                            Negative binomial               $\left(1+\beta'(1-e^{is})\right)^{-\alpha'}$
α ≫ 1, αβ ≫ 1                    Gaussian                        $\exp\left(i\alpha\beta s - \left(1+\frac{\beta}{1+\lambda}\right)\frac{\alpha\beta s^2}{2}\right)$
β/(1+λ) ≫ 1                      Deterministic prot. dynamics    $\exp\left(\alpha\int_0^{is\beta} M(1, 1+\lambda, z)\,dz\right)$
β ≫ 1, λ ≪ 1                     Proportional to Poisson         $\exp\left(\alpha(e^{i\beta s}-1)\right)$
β′ ≫ 1, λ ≫ 1                    Gamma                           $(1-i\beta' s)^{-\alpha'}$

Thus, we identified a number of distinguished asymptotic cases for which the exact distribution of steady-state protein levels can be approximated by simpler distribution types; the complete list is given in Table 1.

6 Discussion

In this paper, we studied the properties of the stationary distribution of a stochastic model for gene expression. The model, described by the chemical master equation (6), has been analysed in a standard way: by writing down and solving the partial differential equation (13) for the generating function of the unknown stationary probability distribution. The method of finding the solution to this PDE involved a non-trivial step: we transformed the variables, obtaining the PDE (17) for the factorial cumulant generating function of the unknown distribution; this transformed equation was solved easily using a power-series ansatz. Similar approaches have been used in other contexts: changing the variables in the PDE for the generating function to obtain one for the moment generating function or one for the (non-factorial) cumulant generating function is a standard method by which information on the probabilistic behaviour of Markov processes in continuous time and with discrete state space can be obtained (see e.g. Bailey 1964).
The joint stationary distribution of protein and mRNA levels was calculated in Section 4 from the derived formula for the generating function via the discrete Fourier transform, which was efficiently implemented using the fast Fourier transform algorithm. In addition, we found a relatively simple recursive expression (40)–(41) for the marginal distribution of protein levels. We also obtained formulae giving the first four central moments of this distribution, thus determining its skewness and kurtosis whilst reiterating the result (8)–(9) for the mean and variance obtained in previous studies using different methods.

We used these results, combined with a simple asymptotic analysis of the generating function, to examine the qualitative behaviour of the stationary distribution of the gene expression model. The protein counts have been found to have the Poisson distribution if the rate of translation is significantly lower than either of the degradation rates (of protein or mRNA). If, however, the rate of translation balances with the degradation rates, then the deviation from the Poisson distribution becomes pronounced, and two particular cases appear to be of special interest: the Neyman distribution of protein levels, which occurs when it is the protein degradation reaction which balances with translation, and the negative binomial distribution, which occurs when the mRNA degradation is the dominant decay reaction. We also characterised a number of distributions in continuous state space which can serve as approximations, upon rescaling, of the exact distribution of gene expression levels.

The stochastic model for gene expression (1) can be extended by assuming that the promoter of the gene can transition in a Markovian fashion between several states and that from each of these promoter states mRNA is transcribed with a specific rate constant (Blake et al. 2003; Raser and O’Shea 2004; Coulon et al. 2010).
Several authors considered the illustrative case of a promoter which can be in the active state from which mRNA is expressed with a given rate constant or in the inactive state from which no transcription occurs (Peccoud and Ycart 1995; Raj and van Oudenaarden 2009). The time-dependent and stationary first and second moments of mRNA and protein counts can be determined for this description in a similar way as was done for (1) by Thattai and van Oudenaarden (2001); the details can be found in Paszek (2007). In addition, the stationary distribution of mRNA levels, which is not Poissonian in this case as the transitioning between the promoter states contributes to the stochasticity in the expression of gene products, has been completely characterised (Peccoud and Ycart 1995; Raj et al. 2006; Iyer-Biswas et al. 2009). The results in this paper for the stochastic model (1) could possibly be extended to obtain for the model which includes promoter transitions a complete characterisation of the distribution of protein levels, which is not available in literature (Paszek 2007). We believe that the theoretical results derived in this work are suitable for comparison with experimental data which are becoming available thanks to the recent advances in imaging technologies allowing the measurement of gene expression at the single-cell level with single-molecule sensitivity (Xie et al. 2008; Larson et al. 2009). Previous theoretical results on stochastic gene expression have already been exploited in that way: Raj et al. (2006) used the maximum likelihood method to fit the model by Peccoud and Ycart (1995) to experimentally observed mRNA counts obtained by single-molecule resolution fluorescence in-situ hybridization. Therefore, we expect that the characterisation of the protein distribution provided in this study could be of help in analysing data on protein expression in individual cells at the single-molecule level obtained e.g. 
by live cell fluorescence microscopy (Yu et al. 2006) or by protein detection methods based on enzymatic amplification (Cai et al. 2006). We believe that comparing theoretical results to such gene-expression data provided by future experimental studies will improve the understanding of the role of stochasticity in gene expression in real biological systems.

Acknowledgment P. Bokes was supported by the European Commission under Marie Curie Early Stage Researcher Training (contract no. MEST-CT-2005-020723). J. King gratefully acknowledges the funding of the BBSRC/EPSRC (reference no. BB/D008522/1) and of the Royal Society and Wolfson Foundation.

Appendix

The reaction system (1) which we used in this paper as a model for gene expression involves linear kinetics only, and therefore, as explained in Sect. 1, the first and second moments (i.e. means and variances) of the dependent stochastic variables, the amount of mRNA M(t) and of protein N(t), satisfy a closed system of linear inhomogeneous ordinary differential equations, which has been solved by other authors (as reviewed in Sect. 1). The stationary solution of this linear system gives the stationary means and variances of the gene expression model. Here we show that the same results can be obtained from the explicit formula for the generating function of the stationary distribution of mRNA and protein amounts. In addition, we determine the third and the fourth central stationary moments, which have not been available in the literature so far.

It is well known that the moments can be obtained from the generating function by evaluating its derivatives; however, for the generating function of the functional form we obtained, it is more convenient to calculate the moments via the corresponding cumulants. By Johnson et al.
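(2005), the $r$th factorial cumulant $\kappa_{[r]}$ of the marginal protein distribution $p^{\mathrm{st}}_{\cdot,n}$ is defined as the $r$th derivative of the logarithm of its generating function (39) taken at y = 1, i.e.

$$\kappa_{[r]} = D^r\left(\psi_2(y)\right)\big|_{y=1} = \frac{\alpha\beta^r (r-1)!}{(1+\lambda)_{r-1}},$$

where $(1+\lambda)_{r-1}$ denotes the rising factorial; the first four factorial cumulants are therefore

$$\kappa_{[1]} = \alpha\beta, \quad \kappa_{[2]} = \frac{\alpha\beta^2}{1+\lambda}, \quad \kappa_{[3]} = \frac{2\alpha\beta^3}{(1+\lambda)(2+\lambda)}, \quad \kappa_{[4]} = \frac{6\alpha\beta^4}{(1+\lambda)(2+\lambda)(3+\lambda)}. \tag{A1}$$

Another useful characteristic of discrete distributions is the sequence of their (non-factorial) cumulants: for definitions, consult Johnson et al. (2005); here we shall use that the $r$th cumulant $\kappa_r$ can be expressed in terms of the first r factorial cumulants as (see Johnson et al. 2005)

$$\kappa_r = \sum_{j=1}^{r} S(r, j)\,\kappa_{[j]},$$

where the S(r, j) are the Stirling numbers of the second kind (Johnson et al. 2005). Thus, the first four cumulants are given by

$$\kappa_1 = \kappa_{[1]}, \quad \kappa_2 = \kappa_{[2]} + \kappa_{[1]}, \quad \kappa_3 = \kappa_{[3]} + 3\kappa_{[2]} + \kappa_{[1]}, \quad \kappa_4 = \kappa_{[4]} + 6\kappa_{[3]} + 7\kappa_{[2]} + \kappa_{[1]}. \tag{A2}$$

Denoting by $N_s$ the random variable associated with the stationary protein distribution $p^{\mathrm{st}}_{\cdot,n}$, we can define the mean and the second to fourth central moments by

$$\nu = \langle N_s\rangle, \quad \nu_2 = \langle (N_s-\nu)^2\rangle, \quad \nu_3 = \langle (N_s-\nu)^3\rangle, \quad \nu_4 = \langle (N_s-\nu)^4\rangle.$$

Johnson et al. (2005) give the relation between these moments and the first four cumulants:

$$\nu = \kappa_1, \quad \nu_2 = \kappa_2, \quad \nu_3 = \kappa_3, \quad \nu_4 = \kappa_4 + 3\kappa_2^2. \tag{A3}$$

Thus, using (A1)–(A3), the first four central moments of the marginal protein distribution $p^{\mathrm{st}}_{\cdot,n}$ are given by

$$\nu = \alpha\beta, \quad \nu_2 = \alpha\beta\left(\frac{\beta}{1+\lambda} + 1\right), \quad \nu_3 = \alpha\beta\left(\frac{2\beta^2}{(1+\lambda)(2+\lambda)} + \frac{3\beta}{1+\lambda} + 1\right),$$

$$\nu_4 = \alpha\beta\left(\frac{6\beta^3}{(1+\lambda)(2+\lambda)(3+\lambda)} + \frac{12\beta^2}{(1+\lambda)(2+\lambda)} + \frac{7\beta}{1+\lambda} + 1\right) + 3\nu_2^2.$$

For completeness, let $M_s$ denote the random variable associated with the stationary mRNA distribution $p^{\mathrm{st}}_{m,\cdot}$ and let us consider the first four central moments

$$\mu = \langle M_s\rangle, \quad \mu_2 = \langle (M_s-\mu)^2\rangle, \quad \mu_3 = \langle (M_s-\mu)^3\rangle, \quad \mu_4 = \langle (M_s-\mu)^4\rangle.$$

The random variable $M_s$ has the Poisson distribution (32); therefore, the central moments are given by (Johnson et al. 2005)

$$\mu = \mu_2 = \mu_3 = \alpha, \qquad \mu_4 = \alpha + 3\alpha^2.$$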
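The chain (A1)–(A3) from factorial cumulants to central moments is mechanical and easy to verify numerically. The following sketch, with hypothetical function names, builds the Stirling numbers of the second kind from their standard recurrence, forms the cumulants from the factorial cumulants, and then applies (A3); it is an illustration rather than code from the original study.

```python
from math import factorial

def stirling2(r, j):
    """Stirling number of the second kind S(r, j), by the standard recurrence."""
    if r == j:
        return 1
    if j == 0 or j > r:
        return 0
    return j * stirling2(r - 1, j) + stirling2(r - 1, j - 1)

def factorial_cumulant(r, alpha, beta, lam):
    """kappa_[r] = alpha * beta**r * (r-1)! / (1+lam)_(r-1), cf. (A1)."""
    rising = 1.0
    for i in range(r - 1):
        rising *= 1.0 + lam + i       # (1+lam)(2+lam)...(r-1+lam)
    return alpha * beta ** r * factorial(r - 1) / rising

def protein_central_moments(alpha, beta, lam):
    """Return (nu, nu_2, nu_3, nu_4) of the stationary protein distribution."""
    kf = [factorial_cumulant(r, alpha, beta, lam) for r in range(1, 5)]
    # (A2): cumulants as Stirling-weighted sums of factorial cumulants
    k = [sum(stirling2(r, j) * kf[j - 1] for j in range(1, r + 1)) for r in range(1, 5)]
    # (A3): central moments from cumulants
    return k[0], k[1], k[2], k[3] + 3.0 * k[1] ** 2

print(protein_central_moments(1.0, 4.0, 1.0))
```

For α = 1, β = 4, λ = 1 the closed-form expressions above give ν = 4, $\nu_2$ = 12, $\nu_3$ = 148/3 and $\nu_4$ = 684, which the sketch reproduces up to floating-point rounding.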
If we express the above formulae for the stationary means μ, ν and variances $\mu_2$, $\nu_2$ in terms of the chemical kinetics parameters of the model, we find that

$$\mu = \frac{k_1}{\gamma_1}, \quad \nu = \frac{k_1 k_2}{\gamma_1\gamma_2}, \quad \mu_2 = \frac{k_1}{\gamma_1}, \quad \nu_2 = \frac{k_1 k_2}{\gamma_1\gamma_2}\left(1 + \frac{k_2}{\gamma_1+\gamma_2}\right), \tag{A4}$$

which coincide with the formulae (8)–(9) obtained in other studies, as reviewed in Sect. 1, by deriving from the master equation a finite closed system of ordinary differential equations for the first- and second-order moments of the Markov process (M(t), N(t)); the stationary moments were obtained in Sect. 1 as the time-independent solution of that system of ODEs. Thus we observe an agreement of the results we arrived at by finding the exact stationary generating function with the previous results obtained using the differential equations for moments.

References

Abramowitz M, Stegun I (1972) Handbook of mathematical functions with formulas, graphs, and mathematical tables. National Bureau of Standards, Washington, DC Ackers G, Johnson A, Shea M (1982) Quantitative model for gene regulation by lambda phage repressor. Proc Natl Acad Sci USA 79:1129–1133 Bailey N (1964) The elements of stochastic processes with applications to the natural sciences. Wiley, New York Belle A, Tanay A, Bitincka L, Shamir R, O’Shea E (2006) Quantification of protein half-lives in the budding yeast proteome. Proc Natl Acad Sci USA 103:13004–13009 Berg O (1978) A model for the statistical fluctuations of protein numbers in a microbial population. J Theor Biol 71:587–603 Bernstein J, Khodursky A, Lin P, Lin-Chao S, Cohen S (2002) Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proc Natl Acad Sci USA 99:9697–9702 Blake W, Kaern M, Cantor C, Collins J (2003) Noise in eukaryotic gene expression.
Nature 422:633–637 Cai L, Friedman N, Xie X (2006) Stochastic protein expression in individual cells at the single molecule level. Nature 440:358–362 Cheong R, Paliwal S, Levchenko A (2010) Models at the single cell level. Wiley Interdiscip Rev Syst Biol Med 2:34–48 Coulon A, Gandrillon O, Beslon G (2010) On the spontaneous stochastic dynamics of a single gene: complexity of the molecular interplay at the promoter. BMC Syst Biol 4:2 Cox D, Miller H (1977) The theory of stochastic processes. Chapman & Hall/CRC, London Davidson E, Rast J, Oliveri P, Ransick A, Calestani C, Yuh C, Minokawa T, Amore G, Hinman V, Arenas-Mena C et al (2002) A genomic regulatory network for development. Science 295:1669– 1678 Elowitz M, Levine A, Siggia E, Swain P (2002) Stochastic gene expression in a single cell. Science 297:1183–1186 Friedman N, Cai L, Xie X (2006) Linking stochastic dynamics to population distribution: an analytical framework of gene expression. Phys Rev Lett 97:168,302 Gadgil C, Lee C, Othmer H (2005) A stochastic analysis of first-order reaction networks. B Math Biol 67:901–946 García-Martínez J, González-Candelas F, Pérez-Ortín J (2007) Common gene expression strategies revealed by genome-wide analysis in yeast. Genome Biol 8:R222 Golding I, Paulsson J, Zawilski S, Cox E (2005) Real-time kinetics of gene activity in individual bacteria. Cell 123:1025–1036 Griffith J (1968a) Mathematics of cellular control processes. I. Negative feedback to one gene. J Theor Biol 20:202–208 Griffith J (1968b) Mathematics of cellular control processes. II. Positive feedback to one gene. J Theor Biol 20:209–216 Gurland J (1958) A generalized class of contagious distributions. Biometrics 14:229–249 Hornos J, Schultz D, Innocentini G, Wang J, Walczak A, Onuchic J, Wolynes P (2005) Self-regulating gene: an exact solution. Phys Rev E 72:051,907 Innocentini G, Hornos J (2007) Modeling stochastic gene expression under repression. 
J Math Biol 55:413–431 Iyer-Biswas S, Hayot F, Jayaprakash C (2009) Stochasticity of gene products from transcriptional pulsing. Phys Rev E 79:031,911 Johnson N, Kotz S, Kemp A (2005) Univariate discrete distributions, 3rd edn. Wiley-Interscience, London Johnson W (2002) The curious history of Faà di Bruno’s formula. Am Math Mon 109:217–234 Kendall D (1949) Stochastic processes and population growth. J Roy Stat Soc B 11:230–282 Larson D, Singer R, Zenklusen D (2009) A single molecule view of gene expression. Trends Cell Biol 19:630–637 Lee T, Rinaldi N, Robert F, Odom D, Bar-Joseph Z, Gerber G, Hannett N, Harbison C, Thompson C, Simon I et al (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799–804 Lestas I, Paulsson J, Ross N, Vinnicombe G (2008) Noise in gene regulatory networks. IEEE T Circuits-I 53:189–200 Lewin B (2000) Genes VII. Oxford University Press, Oxford Lu P, Vogel C, Wang R, Yao X, Marcotte E (2007) Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol 25:117–124 McAdams H, Arkin A (1997) Stochastic mechanisms in gene expression. Proc Natl Acad Sci USA 94:814–819 McAdams H, Arkin A (1999) It is a noisy business! Genetic regulation at the nanomolar scale. Trends Genet 15:65–69 Neyman J (1939) On a new class of “contagious” distributions, applicable in entomology and bacteriology. Ann Math Stat 10:35–57 Ozbudak E, Thattai M, Kurtser I, Grossman A, van Oudenaarden A (2002) Regulation of noise in the expression of a single gene. Nat Genet 31:69–73 Paszek P (2007) Modeling stochasticity in gene regulation: characterization in the terms of the underlying distribution function. B Math Biol 69:1567–1601 Paulsson J (2004) Summing up the noise in gene networks. Nature 427:415–418 Paulsson J (2005) Models of stochastic gene expression.
Phys Life Rev 2:157–175 Paulsson J, Ehrenberg M (2000) Random signal fluctuations can reduce random fluctuations in regulated components of chemical regulatory networks. Phys Rev Lett 84:5447–5450 Peccoud J, Ycart B (1995) Markovian modeling of gene-product synthesis. Theor Popul Biol 48:222–234 Press W, Teukolsky S, Vetterling W, Flannery B (2007) Numerical recipes: the art of scientific computing. Cambridge university press, Cambridge Raj A, van Oudenaarden A (2009) Single-molecule approaches to stochastic gene expression. Annu Rev Biophys 38:255–270 Raj A, Peskin C, Tranchina D, Vargas D, Tyagi S (2006) Stochastic mRNA synthesis in mammalian cells. PLoS Biol 4:e309 Raser J, O’Shea E (2004) Control of stochasticity in eukaryotic gene expression. Science 304:1811–1814 Shahrezaei V, Swain P (2008a) Analytical distributions for stochastic gene expression. Proc Natl Acad Sci USA 105:17,256 Shahrezaei V, Swain P (2008b) The stochastic nature of biochemical networks. Curr Opin Biotechnol 19:369–374 Shea M, Ackers G (1985) The OR control system of bacteriophage lambda. A physical–chemical model for gene regulation. J Mol Biol 181:211–230 Shen-Orr S, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31:64–68 Singh A, Hespanha J (2007) Stochastic analysis of gene regulatory networks using moment closure. In: Proceedings of the American control conference Swiers G, Patient R, Loose M (2006) Genetic regulatory networks programming hematopoietic stem cells and erythroid lineage specification. Dev Biol 294:525–540 Taniguchi Y, Choi P, Li G, Chen H, Babu M, Hearn J, Emili A, Xie X (2010) Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329:533–538 Thattai M, van Oudenaarden A (2001) Intrinsic noise in gene regulatory networks. 
Proc Natl Acad Sci USA 98:151588,598 Tomioka R, Kimura H, J Kobayashi T, Aihara K (2004) Multivariate analysis of noise in genetic regulatory networks. J Theor Biol 229:501–521 Tomlin C, Axelrod J (2007) Biology by numbers: mathematical modelling in developmental biology. Nat Rev Genet 8:331–340 Tyson J, Chen K, Novak B (2003) Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell. Curr Opin Cell Biol 15:221–231 van Kampen N (2006) Stochastic processes in physics and chemistry. Elsevier, New York Wang Y, Liu C, Storey J, Tibshirani R, Herschlag D, Brown P (2002) Precision and functional specificity in mRNA decay. Proc Natl Acad Sci USA 99:5860–5865 Xie X, Choi P, Li G, Lee N, Lia G (2008) Single-molecule approach to molecular biology in living bacterial cells. Annu Rev Biophys 37:417–444 Yu J, Xiao J, Ren X, Lao K, Xie X (2006) Probing gene expression in live cells, one protein molecule at a time. Science 311:1600–1603