Exhaustive genotype-phenotype mapping in
metabolic genotype space
Sayed-Rzgar Hosseini
August 2013
Master thesis to partially fulfill requirements of:
Master of Science (MSc) in Computational Biology & Bioinformatics (CBB)
Swiss Federal Institute of Technology (ETH) Zürich
Research performed at:
Institute of Evolutionary Biology and Environmental Studies, University of Zürich
Supervisor:
Prof. Andreas Wagner
Abstract
Genotype-phenotype mapping provides a unique opportunity to gain new insights into the
underlying design principles of biological systems and their evolution. Especially informative
and insightful is exhaustive genotype-phenotype mapping, which aims to determine the
phenotype of every single genotype belonging to a genotype space. These types of studies have
been possible with systems of RNA, lattice proteins and Boolean gene-regulatory networks. Due
to the astronomically large size of metabolic genotype space, the same approach has not been
successful for metabolic networks, even though similar studies have been conducted for
metabolic systems based on efficient sampling of metabolic genotype space. The present project
aims to depart from sampling-based methods towards an exhaustive approach. To accomplish
this goal I focused on the central carbon metabolism of Escherichia coli (E. coli), and by developing
a new algorithm to efficiently explore the genotype space I obtained a complete genotype-phenotype
map for this metabolic system. I analyzed the topological properties of metabolic
genotype networks and showed that the metabolic genotype networks are connected. Moreover,
the results revealed that genotypes with more connections or higher centrality in the genotype
networks are more robust to reaction elimination. Surprisingly, I observed a strong association
between robustness and biomass synthesis rate (i.e. fitness), which might shed light on the
emergence of robustness in metabolic networks. Finally, I extensively examined the
super-essentiality of metabolic reactions and the results indicate that consecutive reactions in metabolic
pathways tend to show similar super-essentiality profiles.
Keywords: genotype-phenotype mapping, metabolism, genotype networks, network topology,
network connectedness, robustness, reaction super-essentiality
Contents
Chapter 1: Introduction
1.1. Metabolism
1.2. Genome-scale metabolic networks
1.3. Genotype-phenotype mapping
1.4. Biological networks and their topological properties
1.5. Connectedness of metabolic genotype networks
1.6. Robustness of biological systems
1.7. Super-essentiality of metabolic reactions
1.8. Central carbon metabolism
1.9. Aims of the study
Chapter 2: Characterization of genotype networks
2.1. Flux balance analysis (FBA)
2.2. The algorithm
2.3. Analysis of the efficiency of the algorithm
2.4. Partitioning the set of viable genotypes based on genotype size
2.5. Assessing the correctness of the algorithm
Chapter 3: Topological properties of genotype networks
3.1. Degree distribution of genotype networks
3.2. Shortest path and diameter of genotype networks
3.3. Betweenness centrality in genotype networks
Chapter 4: Connectedness of genotype networks
4.1. The algorithm
4.2. Analyzing the efficiency of the algorithm
4.3. Connectedness in networks with more than one million genotypes
4.4. Results
Chapter 5: Mutational robustness of metabolic networks
5.1. Measuring the mutational robustness of metabolic networks
5.2. The impact of genotype size on mutational robustness
5.3. More central genotypes in genotype networks are more robust
5.4. The influence of genotype size on fitness
5.5. On the association between robustness and fitness
Chapter 6: Super-essentiality of metabolic reactions
6.1. Super-essentiality index
6.2. Results
Chapter 7: Conclusion
Acknowledgements
References
Appendix
Chapter 1: Introduction
1.1. Metabolism:
In order for living cells to accomplish the myriad of tasks required for sustaining life and growth,
they need to obtain chemical energy either by capturing solar energy or by degrading energy-rich
nutrients from the environment. Moreover, they need to convert nutrients to small molecules like
amino acids, DNA nucleotides, RNA nucleotides, lipids, carbohydrates and several enzyme
cofactors (collectively referred to as biomass). These two purposes are fulfilled through a series
of well-coordinated chemical reactions which altogether constitute the catabolic pathways.
Furthermore, living cells need to synthesize macromolecules like proteins, nucleic acids,
polysaccharides and membrane lipids from small precursor molecules. This task is achieved via
sets of enzymatic reactions which comprise anabolic pathways. The set of catabolic reactions
together with the set of anabolic reactions define the metabolism of the cell. Importantly,
metabolic reactions are catalyzed by enzymes, which are catalytic proteins encoded by genes in a
genome. Thus, the metabolic network of a cell can be inferred from its genomic information. In the
post-genomic era it is possible to reconstruct the metabolic networks of many organisms, from
bacteria to humans [1-3].
1.2. Genome-scale metabolic networks:
Studying the structure, function and evolution of metabolic networks has been an active area of
research for many decades [4]. Early studies focused on small networks comprising a handful of
reactions, or on a linear sequence of reactions. The dominant approach for studying such small
scale metabolic networks was traditional experimental biochemistry, including measurements of
enzyme concentration, enzyme activities, reaction rate constants, or metabolic fluxes.
Quantitative models for studying such small systems were kinetic models which describe the
changes in concentrations of individual metabolites of a system over time, based on
experimentally determined parameters [5].
The availability of complete genome sequences of prokaryotes and later eukaryotes, together with
the accumulated biochemical literature, provided the opportunity to identify almost all the
reactions proceeding in an organism’s metabolism. Access to such comprehensive information
about metabolism shifted the paradigm from small-scale to genome-scale analysis and
modeling of metabolic networks. For genome-scale metabolic networks, however, the
experimental difficulty of determining rate constants and reaction fluxes for hundreds or
thousands of reactions makes it very difficult to achieve a quantitative understanding with as much
detail as is possible for small-scale networks. Nevertheless, a coarser-grained genome-scale analysis
of metabolism is still possible using a well-established and fruitful computational modeling
approach called flux balance analysis (FBA) [6].
FBA employs constraints imposed by reaction stoichiometry, reversibility and maximal nutrient
uptake rates of an organism in a certain environment to predict the allowed metabolic fluxes for
all metabolic reactions in a metabolic steady state. It then uses linear programming to identify,
among the many allowed flux distributions, one that maximizes a certain objective
function such as the rate of ATP production or the rate at which biomass is produced [7, 8]. In order to
use FBA for modeling any one organism, the reactions constituting its metabolism, biomass
composition and also the nutrient uptake rates imposed by the environment must be known.
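The optimization described above can be written compactly as a linear program. The symbols used here ($S$, $v$, $c$, $v^{\min}$, $v^{\max}$) follow common FBA notation and are assumptions introduced for illustration, not notation taken from this section:

```latex
\begin{aligned}
\max_{v}\quad & c^{\mathsf{T}} v \\
\text{subject to}\quad & S\,v = 0, \\
& v_j^{\min} \le v_j \le v_j^{\max}, \qquad j = 1, \ldots, n .
\end{aligned}
```

Here $S$ would be the stoichiometric matrix (metabolites by reactions), $v$ the vector of reaction fluxes, and $c$ a vector selecting the objective, for example the biomass reaction. The equality $S v = 0$ expresses the metabolic steady state, and the bounds encode reversibility and maximal nutrient uptake rates.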
1.3. Genotype-phenotype mapping:
Mapping genotypes to their corresponding phenotypes provides an unprecedented opportunity to
gain insight into the underlying design principles of biological systems and their evolution [9,
10]. For instance, the mapping of amino acid sequences (genotypes) to lattice models of protein
structures (phenotypes) [11, 12] and of RNA sequences to their secondary structures
[13-16] has been extensively studied. Even for more complex systems like gene regulatory
networks, in which genotypes correspond to a given topology of the regulatory network and
phenotypes correspond to the steady-state gene expression pattern of the network, the same types
of studies have been conducted [17, 18].
The resulting genotype-phenotype maps have shown that any one phenotype is adopted by a vast
number of genotypes. Moreover, these genotypes form large connected sets in genotype space,
meaning that it is possible for any pair of genotypes to reach each other by a series of
phenotype-preserving genotypic changes called neutral mutations. Such a set of connected genotypes which
are mapped to the same phenotype is referred to as a neutral network or genotype network [15,
16].
Moreover, genotype-phenotype mapping of genome-scale metabolic networks has been possible
using sampling methods [19, 20]. Any one organism's genome encodes enzymes that catalyze
some of the reactions belonging to a universe of reactions. Thus, a metabolic genotype G can
simply be represented as a binary vector of length N, $G = (r_1, \ldots, r_N)$, where each entry
$r_i$ of the vector corresponds to one of the N enzymatic reactions in the reaction universe.
If the enzyme catalyzing the i-th reaction is present in the metabolic
network, the i-th entry of the genotype vector is one ($r_i = 1$); otherwise it is zero ($r_i = 0$).
Thus, a metabolic genotype G can be envisioned as a point in an N-dimensional hypercube (i.e. the
genotype space) comprising $2^N$ metabolic genotypes [20].
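The binary-vector representation can be illustrated with a minimal Python sketch. The four reaction names and the universe size are hypothetical placeholders chosen to keep the enumeration small; they do not come from the thesis.

```python
from itertools import product

# Hypothetical toy universe of N = 4 reactions (names are placeholders).
REACTIONS = ("R1", "R2", "R3", "R4")
N = len(REACTIONS)


def genotype_to_reactions(genotype):
    """Return the names of the reactions present in a binary genotype vector."""
    return [name for name, r_i in zip(REACTIONS, genotype) if r_i == 1]


# Every vertex of the N-dimensional hypercube is one genotype.
genotype_space = list(product((0, 1), repeat=N))

print(len(genotype_space))                  # 2**N genotypes
print(genotype_to_reactions((1, 0, 1, 1)))  # ['R1', 'R3', 'R4']
```

Explicit enumeration of this kind is only practical for small N, which is why the size of the reaction universe matters so much for an exhaustive approach.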
Using the flux balance analysis (FBA) approach mentioned above, it is possible to determine
whether a given metabolic network can synthesize all major biomass components in a given
chemical environment (i.e. is viable) or not. Thus, a metabolic phenotype P can simply be
represented as a binary vector of length M, $P = (c_1, \ldots, c_M)$, where M is the number of
chemical environments (e.g. carbon sources) on which the viability of metabolic network
genotypes is checked. Each position in the vector P corresponds to one carbon source, and the
corresponding carbon source either is sufficient for viability of a given metabolic network
($c_i = 1$) or is not ($c_i = 0$). Thus, genotype-phenotype mapping in the context of metabolic networks
amounts to assigning the corresponding vector P of length M to each binary string G of length N in
the genotype space of size $2^N$. As mentioned above, the collection of metabolic genotypes
which are mapped to the same phenotype forms a genotype network. At an even more abstract
level, a metabolic phenotype can be defined based on viability on a single carbon source.
Genotype-phenotype mapping then amounts to partitioning the
genotype space into two disjoint sets: the set of genotypes which are viable on the carbon source
and the set of genotypes which are not.
The genotypes viable on a carbon source constitute the genotype network
corresponding to that carbon source. Finally, if the genotypes belonging to a genotype network
are to be of the same size (i.e. contain the same number of reactions), the genotype network
corresponding to the carbon source should be further partitioned based on genotype size. In this
context, a genotype network is defined as the “genotype network including genotypes of
size x viable on carbon source y”.
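The two partitioning steps described above (viable versus non-viable, then by genotype size) can be sketched as follows. The function `is_viable` is a hypothetical toy rule standing in for an FBA viability check on one carbon source; it is not the thesis's method and is used only so that the example is self-contained.

```python
from collections import defaultdict
from itertools import product

N = 4  # hypothetical toy universe size


def is_viable(genotype):
    """Toy stand-in for an FBA viability check on a single carbon source.

    Assumption for illustration only: a genotype is 'viable' if it contains
    reaction 0 and at least one of reactions 1 or 2.
    """
    return genotype[0] == 1 and (genotype[1] == 1 or genotype[2] == 1)


def partition_by_size(genotypes):
    """Group genotypes by size, i.e. by the number of reactions they contain."""
    groups = defaultdict(list)
    for g in genotypes:
        groups[sum(g)].append(g)
    return dict(groups)


space = list(product((0, 1), repeat=N))
viable = [g for g in space if is_viable(g)]
nonviable = [g for g in space if not is_viable(g)]

# Each entry is one "genotype network of size x viable on carbon source y".
by_size = partition_by_size(viable)
for size, members in sorted(by_size.items()):
    print(size, len(members))
```

In a real analysis the toy rule would be replaced by an FBA call per genotype and per carbon source, which is the expensive step an efficient enumeration algorithm has to economize on.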
Investigation of the properties of metabolic genotype networks provides us with broad insights
into the structure, function and evolution of metabolic networks [19, 20]. Hence, the concepts
and methods developed for studying complex networks, which are the topic of section 1.4, can be
very helpful for inspecting the graph-theoretical properties of metabolic genotype networks.
1.4. Biological networks and their topological properties:
The fact that most complex biological characteristics arise from complex interactions
between the cell’s numerous components has shifted the dominant paradigm in biological sciences
from reductionism to holism and has brought the emerging field of
systems biology into prominence. Therefore, the key challenge for the biological sciences in the
twenty-first century is to understand the structure and dynamics of the complicated intracellular web of interactions
which results in complex biological functions and behaviors [21]. Thanks to the development of
high-throughput data-collection techniques, it is now possible to interrogate the status of all
cellular components simultaneously to gain insight into the structure and dynamics of
intracellular networks such as protein-protein interaction, metabolic, signaling and
transcription-regulatory networks. In parallel, the concepts and methodologies developed in the theory of
complex networks, which have aided in uncovering the organizing principles governing the
function and evolution of various complex technological and social networks [22-25], have
been applied directly to the study of complex biological networks.
At an abstract level, the components of a network are reduced to a series of nodes connected to
each other by links, with each link representing the interaction between two components. The
nodes and links together form a network or, in more formal mathematical language, a graph.
Physical interactions like protein-protein, protein-nucleic acid and protein-metabolite interactions can easily
be conceptualized using the node-link representation. However, this representation can be
extended to more complex functional interactions. For example, small-molecule substrates can
be considered as the nodes of a metabolic network and the enzymatic reactions transforming
one metabolite into another as the links. In an alternative representation, the enzymes can be envisioned as
nodes, and the links connect enzymes catalyzing consecutive reactions [21, 26]. To
characterize the local features of a network, several measures have been
defined in graph theory, which are briefly reviewed here:





• Degree (k) is the number of neighbors of a node (also called node connectivity). For
example, in the network shown in figure 1.1, node A has a degree of 4 ($k_A = 4$). The average
degree of the nodes of the whole network ($\langle k \rangle$) is used as an index of the 'density' of a
network, and the degree distribution ($p(k)$) gives the probability that a selected node
has exactly k links [26].
• Clustering coefficient (C) is a measure of the degree of interconnectivity in the neighborhood
of a node [27]. In the network shown in figure 1.1, the clustering coefficient of node A
is defined as $C_A = 2 n_A / (k_A (k_A - 1))$, where $n_A$ is the number of links connecting the
neighbors of node A to each other.
• Assortativity (NC) is the average degree of the nearest neighbors of a node [28] (figure 1.1).
A negative correlation of assortativity with degree indicates that nodes that have a high
connectivity (e.g., hubs) tend to interact with nodes that have a relatively low connectivity.
By contrast, a positive correlation shows that the hubs tend to be located in highly connected
topological modules.
• Shortest path (SP) is the path between two nodes in a network with the minimum number of
steps among the many alternative paths between the two nodes. For example, in figure 1.1, the
shortest path from node F to node H ($SP_{FH}$) is composed of four steps. Moreover, the length
of the shortest path between the two most distant nodes is the diameter of the network.
• Betweenness (B) measures the centrality of nodes in a network, defined as the frequency with
which a node is located on the shortest paths between all other nodes [29]. Nodes with high
betweenness control the flow of information across a network. In figure 1.1, the diameters of
the nodes correlate with their betweenness.
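Several of the local measures listed above can be computed with a few lines of Python. The graph below is a hypothetical toy example, not the network of figure 1.1, and betweenness is omitted to keep the sketch short.

```python
from collections import deque

# Hypothetical undirected toy graph as an adjacency dict (not figure 1.1).
GRAPH = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}


def degree(graph, node):
    """Number of neighbors of a node."""
    return len(graph[node])


def clustering_coefficient(graph, node):
    """C = 2n / (k (k - 1)), with n the number of links among the node's neighbors."""
    neighbors = list(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    n_links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in graph[neighbors[i]]
    )
    return 2 * n_links / (k * (k - 1))


def shortest_path_lengths(graph, source):
    """Breadth-first search: number of steps from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(graph):
    """Longest shortest path over all node pairs (assumes a connected graph)."""
    return max(max(shortest_path_lengths(graph, s).values()) for s in graph)


print(degree(GRAPH, "A"))
print(clustering_coefficient(GRAPH, "A"))
print(shortest_path_lengths(GRAPH, "B")["E"])
print(diameter(GRAPH))
```

Breadth-first search suffices here because all links are unweighted, so the first time a node is reached is also along a shortest path.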
Moreover, the global structure of networks can be described by several topological models:
• Regular networks are networks in which each node has exactly the same degree as all other
nodes of the graph, like lattice networks [30].
• Random networks are graphs in which each pair of nodes is connected with probability p
[31] (figure 1.2.A), and the degree distribution follows a Poisson distribution.
• Scale-free networks are characterized by a power law-like degree distribution [32] (figure
1.2.B). In a scale-free network, the probability that a node has k links follows $p(k) \sim k^{-\gamma}$,
where $\gamma$ is the degree exponent. Such distributions are seen as a straight line on a log–log
plot. A relatively small number of highly connected nodes are known as hubs, and such hubs
are far more probable than in a random network. In this
model, the probability of an additional node connecting to an existing node depends on its
degree [21].
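The contrast between the last two models can be made explicit by writing out the two degree distributions. The Poisson form assumes a large, sparse random network with mean degree $\langle k \rangle$; this limit is a standard assumption and is not stated in the text above.

```latex
p_{\text{random}}(k) = e^{-\langle k \rangle}\,\frac{\langle k \rangle^{k}}{k!},
\qquad
p_{\text{scale-free}}(k) \sim k^{-\gamma}
\;\;\Longrightarrow\;\;
\log p_{\text{scale-free}}(k) = -\gamma \log k + \text{const.}
```

Taking logarithms of the power law shows why it appears as a straight line with slope $-\gamma$ on a log-log plot, whereas the Poisson distribution falls off much faster for large k, which makes highly connected hubs very unlikely in a random network.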
A major finding towards understanding cellular network architecture was that
most networks within the cell have a scale-free topology. The first evidence came from the
analysis of metabolism. The analysis of the metabolic networks of 43 different organisms
indicated that cellular metabolism has a scale-free topology in which most metabolic
substrates participate in only a few reactions, but a few, such as pyruvate or coenzyme A,
participate in dozens and function as metabolic hubs [33, 34]. Moreover, several studies show
that protein–protein interactions in diverse eukaryotic species also have the features of a
scale-free network [35-39]. Further examples of scale-free organization include genetic regulatory
networks [40, 41] and protein domain networks that are constructed on the basis of protein
domain interactions [42, 43].
Importantly, all complex networks share a common feature called the ‘small-world effect’, meaning
that any two nodes can be connected by a path of only a few links. This feature, which was
originally observed in a study on social networks [44], has subsequently been shown in several
systems, from neural networks [45] to the World-Wide-Web [46]. Although the small-world
effect is a property of random networks, scale-free networks are ultra-small (i.e., their average
shortest path is much shorter than predicted by the small-world effect) [47, 48]. For intracellular
networks, this ultra-small-world effect was first demonstrated for metabolism, where paths of
only three to four reactions were enough to link most pairs of metabolites [33, 34]. However, in
contrast to the assortative nature of social networks, in which well-connected people tend to
know each other, cellular networks are disassortative, meaning that highly connected nodes
(hubs) tend to connect to low-connected nodes and avoid other hubs [49].
Another network of interest is the metabolic genotype network, in which each node corresponds
to a genotype (metabolic network) and the links represent neighborhood relationships among
genotypes. Two metabolic networks are considered neighbors if they have the same
reaction content except for a single reaction in each that the other lacks. In other words, if
genotype G1 is a neighbor of G2, it is possible to convert G1 to G2 by first deleting from G1 the
reaction that G1 has but G2 does not, and then adding to G1 the reaction that
G2 has but G1 does not. This consecutive deletion and addition of two
distinct reactions is called a reaction swap [19]. Thus, having defined the metabolic genotype
network within the graph-theoretical framework, it is possible to investigate its graph-theoretical
properties as discussed above.
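The neighborhood relation defined by a reaction swap can be sketched in Python on binary genotype vectors. The function names are hypothetical, and the sketch only generates candidate neighbors; in a genotype network a candidate would count as a neighbor only if it is also viable.

```python
def is_swap_neighbor(g1, g2):
    """True if g2 is obtained from g1 by exactly one reaction swap.

    A swap deletes one reaction present in g1 and adds one that is absent,
    so the two vectors have equal size and differ in exactly two positions.
    """
    if len(g1) != len(g2) or sum(g1) != sum(g2):
        return False
    return sum(a != b for a, b in zip(g1, g2)) == 2


def swap_neighbors(genotype):
    """Yield every genotype reachable from `genotype` by one reaction swap."""
    present = [i for i, r in enumerate(genotype) if r == 1]
    absent = [i for i, r in enumerate(genotype) if r == 0]
    for i in present:
        for j in absent:
            neighbor = list(genotype)
            neighbor[i], neighbor[j] = 0, 1  # delete reaction i, add reaction j
            yield tuple(neighbor)


g = (1, 1, 0, 0)
for neighbor in swap_neighbors(g):
    print(neighbor, is_swap_neighbor(g, neighbor))
```

A genotype with n reactions in a universe of N reactions therefore has n(N - n) candidate swap neighbors, all of the same size as the genotype itself.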
Figure 1.1: Local topological properties of networks: degree, clustering coefficient, assortativity, shortest path
and betweenness. The details are mentioned in the accompanying text. The figure is adapted from (26) by
permission from Nature Publishing Group.
Figure 1.2: Global properties of random networks (A) versus scale-free networks (B). Panel
1.2.Aa shows an example of a random network, and 1.2.Ab shows its degree distribution, which
is similar to a Poisson distribution. Panel 1.2.Ba shows an example of a scale-free network, and
1.2.Bb shows the log-log plot of its degree distribution, which is linear. A linear log-log
plot indicates that the degree distribution of scale-free networks obeys a power law. The
figure is obtained from (21) by permission from Nature Publishing Group.
1.5. Connectedness of metabolic genotype networks:
Another important measurable characteristic of networks is their connectedness. A network is
connected if any pair of its nodes can be joined by a path; otherwise
the network decomposes into disjoint connected sub-networks called “connected
components” [50]. In studying genotype networks, it is of particular interest to know
whether the genotype network is connected or fragmented into smaller disjoint sub-networks.
If a genotype network is connected, evolution is facilitated because
the entire genotype network can be explored via neutral mutations [16, 51]. Studies on
genotype-phenotype maps of systems like RNA, proteins and gene regulatory networks have
shown that the corresponding genotype networks are connected [15, 17, 52]. Surprisingly, it
has also been shown that the number of disjoint components does not correlate with genotype network
size (i.e. the number of nodes in the genotype network) [53]. Furthermore, many studies have shown that
connectedness of genotype networks helps to increase evolvability [54-56].
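Whether a set of genotypes forms one connected network or several components can be checked with a breadth-first search. The sketch below is a generic, quadratic-time illustration on hypothetical data; it is not the algorithm developed in chapter 4, and the list of "viable" genotypes is invented for the example.

```python
from collections import deque


def connected_components(nodes, are_neighbors):
    """Split `nodes` into connected components by breadth-first search.

    `are_neighbors(u, v)` is any symmetric neighborhood test. This pairwise
    version needs O(n^2) tests and is only meant for small examples.
    """
    unvisited = set(nodes)
    components = []
    while unvisited:
        start = unvisited.pop()
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            linked = {v for v in unvisited if are_neighbors(u, v)}
            unvisited -= linked
            component |= linked
            queue.extend(linked)
        components.append(component)
    return components


def one_swap_apart(g1, g2):
    """Equal-size genotypes differing in exactly two positions (one reaction swap)."""
    return sum(g1) == sum(g2) and sum(a != b for a, b in zip(g1, g2)) == 2


# Hypothetical viable genotypes of size 2 in a universe of 5 reactions.
viable = [(1, 1, 0, 0, 0), (1, 0, 1, 0, 0), (0, 1, 1, 0, 0), (0, 0, 0, 1, 1)]
components = connected_components(viable, one_swap_apart)
print(len(components))  # 2
```

The genotype network is connected exactly when this procedure returns a single component; in the toy data the last genotype shares no reaction with the others and is therefore isolated.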
1.6. Robustness of biological systems:
Robustness is a general feature of biological systems and is defined as persistence of the
organism’s traits or phenotypes against various types of perturbations [57]. The robust
phenotypes encompass a broad range of traits from macroscopic and visible features to
molecular traits like three-dimensional structure of proteins, expression level of a gene or the
flux of matter through a metabolic pathway. The perturbations that affect a phenotype originate
from two broad sources. The first comprises the environmental perturbations like change in
temperature, available nutrients, or internal fluctuations such as inherent stochasticity of gene
expression. The second type of perturbation is genetic mutation which alters the organism’s
genotype. For example, genetic changes might lead to loss of function mutations in the genes
encoding metabolic enzymes which consequently changes the metabolic genotype. Surprisingly,
it has been shown that mutationally robust biological systems are also robust to environmental
perturbations [57, 58], even though exceptions may exist [59, 60]. Two mechanisms have been
suggested for the emergence of robustness in biological systems, namely “redundancy” and
“distributed robustness” [61]. Redundant systems, in which multiple components have the same
function, are typical of biological systems. The reason is that gene duplication
produces genes with redundant functions during genome evolution [62, 63]. Distributed
robustness, in contrast, can exist in systems where no two parts have the same function.
For example, in complex metabolic networks, blockage of one metabolic pathway may be
bypassed if an important metabolite can be produced through an alternative pathway, even
though the two pathways may not share a single enzyme with the same function [64].
1.7. Super-essentiality of metabolic reactions:
Aristotle’s famous quotation, “The whole is more than the sum of its parts”, applies to many
complex systems. In such systems, because the interactions among the parts are nonlinear,
the parts do not contribute equally to the
ultimate function of the system; instead, the role of some parts is more crucial than that of
others. The same principle holds for complex metabolic systems, in which reactions are the parts
forming the whole (the metabolic network). For instance, some reactions are
indispensable for the life of an organism, while the role of other reactions can be fulfilled by
alternative reactions or pathways, which renders them unnecessary for the life of the organism.
On the other hand, a certain reaction which is essential for life of an organism might be
unnecessary for another organism. For example, for production of isopentenyl diphosphate
which is important for the synthesis of cell wall components, the isoprenoid pathway is essential
in Bacillus subtilis, while this pathway is replaced by mevalonate pathway in Staphylococcus
aureus. Moreover, neither of the two pathways would be essential in an organism possessing
both metabolic routes [65, 66]. Attempts to quantify the essentiality of metabolic reactions in
different metabolic networks and different environments using various approaches have recently
been made [67-73]. Studying essentiality of reactions in metabolic networks has direct
applications in development of new antibiotics to combat pathogens. For instance, several
existing drugs like sulfonamides, fosmidomycin and isoniazid target the metabolism of
pathogens [74-77]. Importantly, a reaction which is selected as target of drug discovery must be
essential for the survival of the pathogen. Recently, Barve et al. defined a reaction as
essential if its elimination abolishes the network’s ability to synthesize all biomass molecules in
a certain environment, and as nonessential if the network can bypass the
reaction through alternative reactions or metabolic pathways, or if the product of the reaction is
not needed in a given environment [73]. Moreover, a super-essentiality index has been estimated
for each reaction [73]. This index indicates the fraction of metabolic networks (genotypes) in the
genotype network in which a given reaction is essential. Reactions whose super-essentiality
index is low tend to be easily bypassed, while reactions whose index is high are difficult to
bypass. The word super-essentiality highlights the fact that reactions can be more than just
essential: they can be essential in many, most, or all metabolic networks with a given
phenotype.
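The definition of the super-essentiality index can be illustrated with a small Python sketch. The viability rule is a hypothetical toy stand-in for an FBA check, the set of genotypes includes all sizes for brevity, and treating an absent reaction as "not essential" in that genotype is an assumption made here rather than a statement from the text.

```python
from itertools import product


def is_viable(genotype):
    """Toy viability rule (hypothetical): reaction 0 is required, plus reaction 1 or 2."""
    return genotype[0] == 1 and (genotype[1] == 1 or genotype[2] == 1)


def is_essential(genotype, i, viable):
    """Reaction i is essential in a viable genotype if deleting it destroys viability."""
    if genotype[i] == 0:
        return False  # assumption: an absent reaction is not counted as essential
    knocked_out = genotype[:i] + (0,) + genotype[i + 1:]
    return not viable(knocked_out)


def super_essentiality(genotypes, i, viable):
    """Fraction of the given viable genotypes in which reaction i is essential."""
    return sum(is_essential(g, i, viable) for g in genotypes) / len(genotypes)


N = 4
network = [g for g in product((0, 1), repeat=N) if is_viable(g)]
indices = [super_essentiality(network, i, is_viable) for i in range(N)]
print(indices)
```

In this toy setting reaction 0 cannot be bypassed and obtains the maximal index, reactions 1 and 2 can substitute for each other and obtain an intermediate index, and reaction 3 is never needed.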
1.8. Central carbon metabolism:
Central carbon metabolism refers to the collection of metabolic pathways inside the cell which
extract energy from extracellular carbon sources. In most bacteria, the main pathways of central
carbon metabolism (CCM) include glycolysis, gluconeogenesis, pentose-phosphate pathway (PP)
and the tri-carboxylic acid cycle (TCA). In E. coli, in which the metabolism of glucose has
been studied extensively [78-82], the anaplerotic reactions, the glyoxylate shunt, and acetate
production and assimilation are also included in the central carbon metabolism (figure 1.3) [83].
At the terminal stages of glycolysis in E. coli, phosphoenolpyruvate (PEP) is either converted to
pyruvate by pyruvate kinase (PK) or gives rise to oxaloacetate via PEP carboxylase.
Pyruvate, the end-product of glycolysis, is oxidized to acetyl-CoA and CO2 by the pyruvate
dehydrogenase complex. Acetyl-CoA can participate in many different reactions either as
substrate or as product. For example, it can enter the TCA cycle or it can be used in the
biosynthesis of fatty acids, triglycerides and acetate. Acetyl-CoA connects glycolysis and the
acetate metabolism pathways with the TCA cycle and the glyoxylate shunt. Moreover, via
conversion to acetyl-CoA and oxaloacetate, PEP and pyruvate can directly enter the TCA cycle,
forming a route called anaplerosis which aids in replenishing the intermediates of the TCA
cycle that were used for anabolic purposes. Under gluconeogenic conditions, the TCA cycle
intermediates oxaloacetate or malate are converted to pyruvate and PEP by decarboxylation,
thereby providing the precursors for gluconeogenesis.
Figure 1.3: Central carbon metabolism of E. coli. (A) glycolysis and gluconeogenesis,
(B) anaplerotic reactions, (C) acetate formation and assimilation, (D) citric-acid
cycle, (E) glyoxylate shunt, and (F) pentose-phosphate pathway. The figure is adapted by
permission from (83).
1.9. Aims of the study:
Exhaustive genotype-phenotype mapping refers to the assignment of a phenotype to every single
genotype belonging to a genotype space. This task has been possible for the genotype spaces
of RNA, lattice proteins and Boolean gene-regulatory networks [11-18]. However,
the same types of studies have not been feasible for metabolic networks because metabolic
genotype spaces are extremely large and not amenable to exhaustive enumeration. For example,
the universe of reactions currently contains more than 5000 biochemical reactions that take place
in some organism. This means that the space of possible networks comprises more than $2^{5000}$
networks. Therefore, genotype-phenotype mapping in metabolic genotype spaces has been
possible only by sampling-based approaches [19, 20]. Although sampling-based approaches
have enabled us to gain insights into some properties of metabolic genotype networks, many
other characteristics of genotype networks are amenable to study only if the genotype networks
are exhaustively characterized. Therefore, the aim of this study is to depart from sampling-based
approaches and to exhaustively map all individual genotypes in a metabolic genotype space to
their corresponding phenotypes. Since exhaustive enumeration of genotypes corresponding to
genome-scale metabolic networks is infeasible, the focus is on a small albeit crucial part
of metabolism, the central carbon metabolism. Hence, the central carbon metabolism of
E. coli, including 20 export reactions and 51 internal reactions, is considered as the universe of
reactions (Appendix tables A-2, A-3 and A-4). Nevertheless, the corresponding genotype space
is still large, and exhaustive characterization of genotype networks requires efficient algorithms
to be designed.
Hence, in chapter 2, an efficient algorithm is introduced to tackle the complexity of the task.
Moreover, the set of genotypes belonging to the genotype space which are viable on each of the
10 carbon sources, are identified. Based on the number of reactions included in each genotype
(i.e. genotype size), the set of viable genotypes on each carbon source is partitioned into disjoint
sets of equally sized genotypes which constitute the corresponding genotype networks.
Therefore, in this chapter, the genotype networks each of which is comprised of equally-sized
genotypes that are viable on a particular carbon source are exhaustively characterized.
In chapter 3, the topological properties of the genotype networks are studied and the focus will
be on degree distribution, shortest path and betweenness centrality of the genotype networks.
Importantly, the impact of genotype size on these graph-theoretical features is investigated.
Moreover, the global topological properties of the genotype networks are compared with those of
the known networks like regular, random and scale-free networks.
Chapter 4 is devoted to studying the connectedness of genotype networks. Checking connectedness
of the vast genotype networks requires an algorithm which could determine the connected
components of the network without a need for an already filled adjacency matrix (or adjacency
list). A new algorithm is designed to fulfill this criterion and its efficiency is discussed.
Moreover, a new theorem will be proven which aids in determination of the connectedness of
very large genotype networks for which using the first algorithm is not feasible. Finally the
number of connected components of each genotype network is determined.
Robustness of metabolic networks to reaction elimination is investigated in chapter 5. In particular,
the influence of genotype size and of the number of genotypes in genotype networks on the mutational
robustness of metabolic networks is discussed. Moreover, the question of whether metabolic
networks that are more central nodes in the genotype network are mutationally
more robust is also addressed. Finally, the growth rate that is associated with each
genotype, and is considered as the fitness of the genotype, is computed, and how it is influenced by
genotype size is discussed. Importantly, it is checked whether genotypes which are
mutationally more robust are associated with a higher fitness or not.
Finally in chapter 6, in the framework of genotype networks, the super-essentiality index for all
the reactions belonging to the reaction universe is determined. Importantly the influence of
genotype size on the super-essentiality of the reactions is investigated. Moreover reactions are
going to be clustered based on their super-essentiality indices on different genotype networks.
Finally the carbon sources are also going to be classified based on the super-essentiality profile
of the reactions in the corresponding genotype networks.
Chapter 2: Characterization of genotype networks
A neutral network corresponds to a set of genotypes having the same phenotype. For instance, all
possible metabolic genotypes (metabolic networks) which are viable in an environment with
glucose as its sole carbon source collectively constitute a neutral network. Determining the
neutral network corresponding to a certain metabolic phenotype requires exhaustive examination
of the viability of each individual genotype in the genotype space. The viability of each
metabolic network on each carbon source can be checked by Flux Balance Analysis (FBA)
approach described in section 2.1. FBA checks whether a given metabolic network is able to
produce all the necessary small molecules required to sustain life or not. Since determination of
neutral network entails exhaustive exploration of the genotype space, the feasibility of the task
strongly depends on the size of the genotype space which exponentially depends on the number
of reactions in the reaction universe. This approach is indeed infeasible for studying genome-scale
metabolic networks, because the total reaction universe contains around 5000 reactions,
which results in a genotype space of size $2^{5000}$, an astronomically large number. Hence, in
this study we focus on the central carbon metabolism of E.coli, which is a smaller-scale metabolic
network and plays a pivotal role in cellular metabolism. The central carbon metabolism of E.coli
includes 71 reactions, among which 20 are export reactions and 51 are internal reactions
(Appendix-table A.2). Since each of the 51 reactions can either be present or absent in a
metabolic network, the metabolic genotype space is of size $2^{51} \approx 10^{15}$. However, exhaustive
exploration of such a genotype space is still not feasible, because each FBA test requires about 0.01
seconds, which results in roughly $10^{13}$ seconds ($10^{5}$ years!) to check the viability of all the genotypes.
In section 2.2, an algorithm is introduced which makes this task computationally
possible. Using the algorithm, the neutral networks corresponding to 10 different carbon sources
(Appendix-table A.1) are identified. Importantly, in section 2.4, the neutral networks are
partitioned based on the genotype size to obtain genotype networks comprised of equally sized
genotypes which are viable on the corresponding carbon source. Therefore, at the end,
$G(i, C_m)$, $i \in \{1, 2, \ldots, 51\}$, $m \in \{1, 2, \ldots, 10\}$, is obtained, where $G(i, C_m)$ represents the genotype
network comprised of genotypes of size $i$ which are viable on carbon source $C_m$.
2.1. Flux Balance Analysis (FBA):
Flux balance analysis (FBA) is a linear programming [84] based method for analyzing metabolic
networks. FBA predicts the rate of each reaction in the metabolic network. To do so, it first
imposes constraints on the fluxes of reactions based on a stoichiometric matrix constructed from
the information regarding the molar quantities of each substrate consumed or produced in each
reaction. Moreover, further constraints such as the maximum and minimum rates of some reactions and
the reversibility of reactions are considered. More importantly, FBA assumes that the metabolic
network has reached a steady state as follows:
$$S \cdot v = 0$$
where S is the m × n stoichiometric matrix of all the reactions in the metabolic network, m is the
number of metabolites, n is the number of reactions, and v is the flux vector of the metabolic
network. Since the system is usually underdetermined (more unknown fluxes than independent
balance equations), the above equation does not have a unique solution; instead, the feasible flux
vectors form a space called the null space of S. In the second step, FBA uses linear programming to
select a flux vector in this space which maximizes a particular objective function, as follows:
$$\max_{v} \; f^{T} v \quad \text{s.t.} \quad S \cdot v = 0, \quad 0 \le v \le v_{\max}$$
where f is the objective function vector and $v_{\max}$ is the vector containing the maximum capacities
of the fluxes. Maximum measured uptake rates are used to constrain the exchange fluxes. The
objective function of choice in this study is the rate of biomass formation. Biomass refers to the
collection of small molecules, like amino acids, DNA nucleotides, RNA nucleotides, lipids,
carbohydrates and several enzyme cofactors, that are required for sustaining the life and growth of an
organism. The biomass reaction is a hypothetical reaction in which each of these small molecules
is produced from its corresponding precursors with a certain stoichiometry determined from
experimental data [85, 86]. For this study we used a biomass reaction which is generated by
fluxes branching out from central carbon metabolism, originating at 12 well-known precursor
substances [85-87]. At each such branch-point node, flux branches out from central carbon
metabolism to build the components of the cell's biomass (the precursors are listed in Appendix-table A.5).
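As a hedged illustration of the two FBA steps described above, the following worked example uses a hypothetical three-reaction chain (uptake of metabolite A, conversion of A to B, and a biomass drain on B). The network, the capacities in $v_{\max}$ and the objective vector f are assumptions made only for this example; they are not taken from the E.coli model used in this study.

```latex
% Hypothetical toy network (illustration only):
%   v_1: uptake -> A,   v_2: A -> B,   v_3: B -> biomass
% Rows of S are the metabolites A and B, columns are the reactions v_1, v_2, v_3.
\[
S = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}, \qquad
S \cdot v = 0 \;\Longrightarrow\; v_1 = v_2 = v_3 .
\]
% Assumed objective f = (0, 0, 1)^T and assumed capacities v_max = (10, 8, 1000)^T:
\[
\max_{v} \; f^{T} v = v_3
\quad \text{s.t.} \quad S \cdot v = 0, \quad 0 \le v \le v_{\max}
\;\Longrightarrow\; v^{*} = (8,\, 8,\, 8)^{T} .
\]
% Deleting reaction 2 forces v_2 = 0 and therefore v_3 = 0:
% the reduced network cannot form biomass and would be classified as unviable.
```

In this toy case the steady-state constraint alone leaves a one-dimensional family of flux vectors, and the tightest capacity (here the assumed bound on $v_2$) determines the optimal biomass flux.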
2.2. The algorithm:
The key idea behind the algorithm is to skip the time-consuming FBA test for genotypes
whose unviability is already known. In other words, the idea is to identify most of the unviable
genotypes without checking their viability by FBA. This shrinks the candidate genotype space
whose viability has to be checked by FBA, and consequently reduces the running time of the
problem. The first reduction in running time comes from the notion of “environment-general
essential reactions”, which refers to reactions whose deletion abolishes viability on every carbon
source [73]. Six reactions of the reaction universe (reactions 22 to 27 in Appendix-table A.2) are
environment-general essential reactions. Therefore, for a genotype to be viable, these six
reactions must be included in its reaction set. In other words, without any FBA test, it can be
concluded that all genotypes which lack any of these six reactions are unviable. So, in this step
the number of genotypes whose viability should be checked is reduced from $2^{51} \approx 10^{15}$ to
$2^{45} \approx 10^{13}$, and consequently the estimated running time from $10^{5}$ years to $10^{3}$ years.
The second and more important reduction in running time comes from the
fact that removing a reaction from a genotype cannot improve its viability. In other words,
removing reactions from a genotype which is already unviable will not result in a viable genotype.
Therefore, once a genotype which includes a set (S) of reactions is known to be unviable, it can
be concluded, without further FBA tests, that all genotypes whose reaction set is a
subset of S are unviable as well. Metaphorically, when a “parent
genotype” including a set (S) of reactions is unviable, it can be inferred that all of its “child
genotypes” (the genotypes whose reaction set is a subset of S) are unviable as well.
Thus, an algorithm for efficient characterization of the viable genotype space should
identify the unviable parent genotypes first and then skip all their child genotypes. The
above-mentioned points were incorporated in an algorithm as follows:
According to the first point the six environment-general essential reactions should always be
present in all genotypes, so the difference in different genotypes will be based on the presence or
absence of the remaining 45 reactions. Therefore each genotype is represented as a binary vector
of length 45 in which each entry will be one if the corresponding reaction is present and zero if
not.
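The parent-child principle and the binary-vector representation can be illustrated with a small sketch. It is not the implementation used in this study: the viability test is a hypothetical stand-in for FBA (a genotype counts as viable if it contains all reactions of at least one assumed "pathway"), genotypes are stored as integer bitmasks rather than length-45 vectors, and reactions are removed one at a time, which is the simple variant that the block-wise algorithm described next is designed to avoid. The names `make_counting_oracle` and `viable_genotypes_top_down` are illustrative only.

```python
def make_counting_oracle(pathways):
    """Hypothetical stand-in for FBA.

    A genotype (bitmask, bit r = 1 if reaction r is present) is viable if it
    contains every reaction of at least one pathway mask. This rule is
    monotone: removing reactions can never turn an unviable genotype viable.
    """
    calls = {"n": 0}

    def is_viable(genotype: int) -> bool:
        calls["n"] += 1  # count how many viability tests are performed
        return any((genotype & mask) == mask for mask in pathways)

    return is_viable, calls


def viable_genotypes_top_down(n_reactions: int, is_viable) -> set:
    """Enumerate all viable genotypes, testing only children of viable parents."""
    full = (1 << n_reactions) - 1  # genotype containing every reaction
    if not is_viable(full):
        return set()
    viable = {full}
    frontier = {full}
    while frontier:
        # Children: remove exactly one present reaction from a viable parent.
        children = set()
        for parent in frontier:
            for r in range(n_reactions):
                if (parent >> r) & 1:
                    children.add(parent & ~(1 << r))
        # Children of unviable genotypes are never generated or tested.
        frontier = {g for g in children if is_viable(g)}
        viable |= frontier
    return viable
```

Because every superset of a viable genotype is itself viable, each viable genotype is reachable from the full genotype through viable intermediates, so the search is exhaustive for any monotone viability test. With an assumed 10-reaction universe and the two assumed pathways {0, 1, 2} and {0, 3, 4, 5}, calling `viable_genotypes_top_down(10, is_viable)` finds the viable set without ever testing, for example, the empty genotype.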
To incorporate the parent-child notion, the algorithm must start from genotypes with higher
number of reactions and step by step reduce the size of the reaction set of the genotypes.
However, if the size of the reaction set of the genotypes is reduced by one in each step, the
algorithm will need 45 steps and the implementation issues aggravate the complexity of the
problem. Therefore, I decided to split the 45 length vector into 5 disjoint blocks of length 9 such
that the algorithm will need 5 steps (figure 2.1). Two types of blocks exist which are called
“closed blocks” and “open blocks” and they are defined as follows:
Closed block: an all-ones vector. When a portion of the genotype vector is a closed block, that
genotype includes all the 9 reactions corresponding to the block. For example, the genotype
including all 51 reactions can be considered as the concatenation of 5 closed blocks.
Open block: initially, the set of all possible binary vectors of length 9 (i.e. a set including
$2^9 = 512$ binary vectors), represented as red boxes (figure 2.1). In later stages, the open
blocks are represented as green boxes.
The point is that at each step one open block is added and one closed block is removed. More
importantly, the added open block is obtained from the output of the previous step, which
corresponds to the viable subset of the initial open block (so its set of binary vectors typically
contains fewer than $2^9 = 512$ vectors). Hence, the unviable subsets of the previous steps (the
unviable parents) are not included in the next step, so their children will not be generated and
consequently their viability is not going to be checked by FBA. Another concept used by the
algorithm is the “merged open block”, which is defined as follows:
Merged open block: When open blocks A and B, containing respectively $n_a$ and $n_b$ binary
vectors of length 9, are merged, a new block called “merged open block” is generated, which
contains $n_a \cdot n_b$ binary vectors of length 18. Likewise, if the resulting merged open block is
merged with the open block C containing $n_c$ binary vectors of length 9, a merged open block
containing $n_a \cdot n_b \cdot n_c$ binary vectors of length 27 is generated. Moreover, if the resulting
merged open block is merged with the open block D containing $n_d$ binary vectors of length 9, a
merged open block containing $n_a \cdot n_b \cdot n_c \cdot n_d$ binary vectors of length 36 is generated.
Finally, if the resulting merged open block is merged with the open block E containing $n_e$ binary
vectors of length 9, a merged open block containing $n_a \cdot n_b \cdot n_c \cdot n_d \cdot n_e$ binary
vectors of length 45 is generated. Therefore, the algorithm requires five steps described as follows:
Step 1:
In this step, one block is open and four are closed (filled with all ones). The open block is a set
including $2^9 = 512$ distinct binary vectors. There are $\binom{5}{1} = 5$ ways to select one block
among the five as open, so five distinct sets of genotypes $(a_i,\ i \in \{1, 2, 3, 4, 5\})$ with
$(n_i = 512,\ i \in \{1, 2, 3, 4, 5\})$ genotypes are defined, where $i$ represents the index of
the open block in the corresponding genotype set (figure 2.1). Next, the viability of each
genotype is checked by Flux Balance Analysis, and the viable genotypes are stored in one of five
different files, each of which corresponds to one of the five distinct sets of genotypes. Thus, in this
step FBA is used 5×512 times and the output will be 5 distinct sets of genotypes
$(a_i^{*},\ i \in \{1, 2, 3, 4, 5\})$ with $n_i^{*} = \alpha_i \times n_i$ $(i \in \{1, 2, 3, 4, 5\},\ 0 \le \alpha_i \le 1)$ viable genotypes,
where $\alpha_i$ is the viability ratio of the corresponding genotype set of step 1 (figures 2.1 and 2.6).
Figure 2.1) Step 1: The length-45 binary vector is split into five blocks and based on these blocks, five sets each
containing 512 genotypes ($a_i$, $i \in \{1, 2, 3, 4, 5\}$) are defined. In the i-th set, 4 blocks are kept closed and filled with
the all-ones vector (blue blocks) while the i-th block is open and can be any of 512 binary vectors of length 9 (red
blocks). After viability checking using FBA, five sets ($a_i^{*}$, $i \in \{1, 2, 3, 4, 5\}$) with ($n_i^{*} \le 512$, $i \in \{1, 2, 3, 4, 5\}$)
viable genotypes are obtained. The green blocks represent the viable subsets of the corresponding red blocks.
Step 2:
In this step, two blocks are open and three are closed (filled with all ones). The two open
blocks are obtained from the step 1 output (the viable ones). There are $\binom{5}{2} = 10$ ways to select two
blocks as open, so 10 distinct sets of genotypes
$(a_i^{*} a_j^{*})$, $i \in \{1, 2, 3, 4\}$, $j \in \{i+1, \ldots, 5\}$, with
$(n_i^{*} \times n_j^{*})$ genotypes are defined, where $i$ and $j$
represent the indexes of the open blocks in the corresponding genotype sets (figures 2.2 and 2.6).
Next, the viability of each genotype is checked by Flux Balance Analysis, and the viable
genotypes are stored in one of 10 different files, each of which corresponds to one of the 10
distinct sets of genotypes. Thus, in this step FBA is used
$$\sum_{i=1}^{4} \sum_{j=i+1}^{5} n_i^{*} \times n_j^{*}$$
times and the output will be 10 distinct sets of genotypes
$(a_i a_j)^{**}$, $i \in \{1, 2, 3, 4\}$, $j \in \{i+1, \ldots, 5\}$, with
$n_{ij}^{**} = \alpha_{ij} \times (n_i^{*} \times n_j^{*})$, $0 \le \alpha_{ij} \le 1$, viable genotypes, where
$\alpha_{ij}$ is the viability ratio of the corresponding genotype set of step 2.
Figure 2.2) Step 2: 10 sets of genotypes, each containing three closed blocks (blue blocks) and 2 open
blocks obtained from the step 1 output (green blocks), form the input of step 2. After viability checking, the two open
blocks merge and form a merged block (with vectors of length 18) containing the simultaneously viable genotypes of
the corresponding open blocks. Hence the output of step 2 is 10 sets of genotypes, each containing three closed
blocks and one merged open block.
Step 3:
In this step, three blocks are open and two are closed (filled with all ones). Two of the three open
blocks are obtained from the step 2 outputs (in the form of a merged block containing binary
vectors of length 18) and one from the step 1 output. There are $\binom{5}{3} = 10$ ways to select three blocks as
open, so 10 distinct sets of genotypes
$(a_i^{*} a_j^{*} a_k^{*})$, $i \in \{1, 2, 3\}$, $j \in \{i+1, \ldots, 4\}$, $k \in \{j+1, \ldots, 5\}$, with
$(n_{ij}^{**} \times n_k^{*})$ genotypes are
defined, where $i$, $j$ and $k$ represent the indices of the open blocks in the corresponding
genotypes (figures 2.3 and 2.6). Next, the viability of each genotype is checked by Flux Balance
Analysis, and the viable genotypes are stored in one of 10 different files, each of which
corresponds to one of the 10 distinct sets of genotypes. Thus, in this step FBA is used
$$\sum_{i=1}^{3} \sum_{j=i+1}^{4} \sum_{k=j+1}^{5} n_{ij}^{**} \times n_k^{*}$$
times and the output will be 10 distinct sets of genotypes
$(a_i a_j a_k)^{***}$, $i \in \{1, 2, 3\}$, $j \in \{i+1, \ldots, 4\}$, $k \in \{j+1, \ldots, 5\}$, with
$n_{ijk}^{***} = \alpha_{ijk} \times (n_i^{*} \times n_j^{*} \times n_k^{*})$, $0 \le \alpha_{ijk} \le 1$, viable genotypes, where
$\alpha_{ijk}$ is the viability ratio of the corresponding genotype set of step 3.
Figure 2.3) Step 3: 10 sets of genotypes, each containing two closed blocks (blue blocks), one open block
obtained from the step 1 output (green block with white text) and one merged open block coming from the step 2
output (green block with black text), form the input of step 3. After viability checking, the two open blocks
merge and form a merged block (with vectors of length 27) containing the simultaneously viable genotypes of the
corresponding open blocks. Hence the output of step 3 is 10 sets of genotypes, each containing two closed blocks
and one merged open block.
Step 4:
In this step, four blocks are open and one is closed (filled with all ones). Three of the four open
blocks are obtained from the step 3 outputs (in the form of a merged block containing binary
vectors of length 27) and one from the step 1 output.
There are $\binom{5}{4} = 5$ ways to select four blocks as open, so 5 distinct sets of genotypes
$(a_i^{*} a_j^{*} a_k^{*} a_l^{*})$, $i \in \{1, 2\}$, $j \in \{i+1, \ldots, 3\}$, $k \in \{j+1, \ldots, 4\}$, $l \in \{k+1, \ldots, 5\}$, with
$(n_{ijk}^{***} \times n_l^{*})$ genotypes are defined, where $i$, $j$, $k$ and $l$ represent the indices of the open blocks in the
corresponding genotypes (figures 2.4 and 2.6).
Next, the viability of each genotype is checked by Flux Balance Analysis, and the viable
genotypes are stored in one of 5 different files, each of which corresponds to one of the 5 distinct
sets of genotypes.
Thus, in this step FBA is used
$$\sum_{i=1}^{2} \sum_{j=i+1}^{3} \sum_{k=j+1}^{4} \sum_{l=k+1}^{5} n_{ijk}^{***} \times n_l^{*}$$
times and the output will be 5 distinct sets of genotypes
$(a_i a_j a_k a_l)^{****}$, $i \in \{1, 2\}$, $j \in \{i+1, \ldots, 3\}$, $k \in \{j+1, \ldots, 4\}$, $l \in \{k+1, \ldots, 5\}$, with
$n_{ijkl}^{****} = \alpha_{ijkl} \times (n_i^{*} \times n_j^{*} \times n_k^{*} \times n_l^{*})$, $0 \le \alpha_{ijkl} \le 1$,
viable genotypes, where $\alpha_{ijkl}$ is the viability ratio of the corresponding genotype set of
step 4.
Figure 2.4) Step 4: 5 sets of genotypes, each containing one closed block (blue block), one open block
obtained from the step 1 output (green block with white text) and one merged open block coming from the step 3
output (green block with black text), form the input of step 4. After viability checking, the two open blocks
merge and form a merged block (with vectors of length 36) containing the simultaneously viable genotypes of the
corresponding open blocks. Hence the output of step 4 is 5 sets of genotypes, each containing one closed block and
one merged open block.
Step 5:
In this step, all five blocks are open. Four of the five open blocks are obtained from the step 4 outputs
(in the form of a merged block containing binary vectors of length 36) and one from the step 1 output. There is
$\binom{5}{5} = 1$ way to select five blocks as open, so one set of genotypes $(a_1^{*} a_2^{*} a_3^{*} a_4^{*} a_5^{*})$
with $(n_{1234}^{****} \times n_5^{*})$ genotypes is defined (figures 2.5 and 2.6). Next, the viability of each genotype is checked by Flux
Balance Analysis, and the viable genotypes are stored in a file. Thus, in this step FBA is used
$(n_{1234}^{****} \times n_5^{*})$ times and the output will be one set of genotypes $(a_1 a_2 a_3 a_4 a_5)^{*****}$ with
$n_{12345}^{*****} = \alpha_{12345} \times (n_1^{*} \times n_2^{*} \times n_3^{*} \times n_4^{*} \times n_5^{*})$, $0 \le \alpha_{12345} \le 1$,
viable genotypes, where $\alpha_{12345}$ is the
viability ratio of the genotype set of step 5.
Figure 2.5) Step 5: one set of genotypes containing one open block obtained from the step 1 output (green block with
white text) and one merged open block coming from the step 4 output (green block with black text) forms the
input of step 5. After viability checking, the two open blocks merge and form a merged block (with vectors of
length 45) containing the simultaneously viable genotypes of the corresponding open blocks. Hence the output of step
5 is a set of genotypes consisting of one merged open block and no closed block.
Figure 2.6) Summary of the consecutive five steps of the algorithm: For each step, the genotype sets with the number
of genotypes in each set for both input and output are specified. Furthermore, the number of required FBA tests for each
step is illustrated.
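The five steps summarized above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names `block_algorithm` and `toy_is_viable` are hypothetical, the viability test is a simple monotone rule standing in for FBA, and the toy universe has 5 blocks of 3 reactions instead of 5 blocks of 9. Genotypes are integer bitmasks in which block b occupies bits 3b to 3b+2.

```python
from itertools import combinations, product


def toy_is_viable(genotype: int) -> bool:
    """Hypothetical stand-in for FBA (monotone under reaction removal).

    Assumed rule: reactions 0, 3, 6, 9 and 12 are all required, and at
    least one of the two alternative reactions 1 and 4 must be present.
    """
    essential_present = all((genotype >> r) & 1 for r in (0, 3, 6, 9, 12))
    alternative_present = bool((genotype >> 1) & 1 or (genotype >> 4) & 1)
    return essential_present and alternative_present


def block_algorithm(n_blocks: int, block_len: int, is_viable):
    """Return (set of viable genotypes, number of viability tests used)."""
    all_ones = (1 << block_len) - 1
    tests = 0

    def assemble(open_blocks, patterns) -> int:
        # Closed blocks are all ones; open blocks take the given patterns.
        blocks = [all_ones] * n_blocks
        for b, p in zip(open_blocks, patterns):
            blocks[b] = p
        genotype = 0
        for b, p in enumerate(blocks):
            genotype |= p << (b * block_len)
        return genotype

    # Step 1: open one block at a time and test all 2**block_len patterns.
    merged = {}
    for b in range(n_blocks):
        keep = []
        for p in range(all_ones + 1):
            tests += 1
            if is_viable(assemble((b,), (p,))):
                keep.append((p,))
        merged[(b,)] = keep

    # Steps 2..n_blocks: merge the viable output for the first size-1 open
    # blocks with the step-1 output of the newly opened block, then re-test.
    for size in range(2, n_blocks + 1):
        for combo in combinations(range(n_blocks), size):
            keep = []
            for head, tail in product(merged[combo[:-1]], merged[(combo[-1],)]):
                patterns = head + tail
                tests += 1
                if is_viable(assemble(combo, patterns)):
                    keep.append(patterns)
            merged[combo] = keep

    all_blocks = tuple(range(n_blocks))
    viable = {assemble(all_blocks, p) for p in merged[all_blocks]}
    return viable, tests
```

A call such as `viable, tests = block_algorithm(5, 3, toy_is_viable)` returns the same viable set as testing all $2^{15}$ toy genotypes, because any viable genotype remains viable when some of its open blocks are replaced by closed blocks, so it is never discarded at an earlier step. How much is saved depends on the viability ratios of the individual steps.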
2.3. Analysis of the efficiency of the algorithm:
The efficiency of the algorithm can be evaluated by measuring the number of FBA tests required
to identify the viable genotype space. According to the algorithm the number of required FBA
tests is the sum of the number of FBA tests for the five steps, which equals:
$$\text{Total number of FBA tests} = \sum_{i=1}^{5} n_i + \sum_{i=1}^{4} \sum_{j=i+1}^{5} n_i^{*} n_j^{*} + \sum_{i=1}^{3} \sum_{j=i+1}^{4} \sum_{k=j+1}^{5} n_i^{*} n_j^{*} n_k^{*} + \sum_{i=1}^{2} \sum_{j=i+1}^{3} \sum_{k=j+1}^{4} \sum_{l=k+1}^{5} n_i^{*} n_j^{*} n_k^{*} n_l^{*} + n_1^{*} n_2^{*} n_3^{*} n_4^{*} n_5^{*}$$
Since $n_i^{*} = \alpha_i \times n_i$ $(i \in \{1, 2, 3, 4, 5\},\ 0 \le \alpha_i \le 1)$, where $\alpha_i$ is the viability ratio of the corresponding
step 1 genotype set, we can rewrite the above equation as follows:
$$\text{Total number of FBA tests} = \sum_{i=1}^{5} n_i + \sum_{i=1}^{4} \sum_{j=i+1}^{5} \alpha_i \alpha_j n_i n_j + \sum_{i=1}^{3} \sum_{j=i+1}^{4} \sum_{k=j+1}^{5} \alpha_i \alpha_j \alpha_k n_i n_j n_k + \sum_{i=1}^{2} \sum_{j=i+1}^{3} \sum_{k=j+1}^{4} \sum_{l=k+1}^{5} \alpha_i \alpha_j \alpha_k \alpha_l n_i n_j n_k n_l + \alpha_1 \alpha_2 \alpha_3 \alpha_4 \alpha_5 n_1 n_2 n_3 n_4 n_5$$
To simplify the analysis, let us assume that every $\alpha_i$ equals the average viability ratio,
$\alpha_i = \frac{1}{5} \sum_{j=1}^{5} \alpha_j = \alpha$; then we are able to express the total
number of FBA tests as a univariate function of $\alpha$ as follows:
$$\text{Total number of FBA tests} = \sum_{i=1}^{5} n_i + \sum_{i=1}^{4} \sum_{j=i+1}^{5} \alpha^{2} n_i n_j + \sum_{i=1}^{3} \sum_{j=i+1}^{4} \sum_{k=j+1}^{5} \alpha^{3} n_i n_j n_k + \sum_{i=1}^{2} \sum_{j=i+1}^{3} \sum_{k=j+1}^{4} \sum_{l=k+1}^{5} \alpha^{4} n_i n_j n_k n_l + \alpha^{5} n_1 n_2 n_3 n_4 n_5$$
And, since $n_i = 512$, it equals: $5 \times 512 + 10 \alpha^{2} \times 512^{2} + 10 \alpha^{3} \times 512^{3} + 5 \alpha^{4} \times 512^{4} + \alpha^{5} \times 512^{5}$, $0 \le \alpha \le 1$.
This dependence of the algorithm on $\alpha$ is illustrated in figure 2.7 and is compared to the brute
force approach.
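The univariate model can be evaluated numerically with a few lines of code. The sketch below only restates the formula above; the function names `model_fba_tests` and `reduction_factor` are illustrative, and the brute force reference is taken as $2^{45} = 512^{5}$ tests, i.e. the genotype space that remains after fixing the six environment-general essential reactions.

```python
BLOCK_SIZE = 512          # 2**9 patterns per open block (from the text)
BRUTE_FORCE = 2 ** 45     # tests needed without the block algorithm


def model_fba_tests(alpha: float, n: int = BLOCK_SIZE) -> float:
    """5n + 10 a^2 n^2 + 10 a^3 n^3 + 5 a^4 n^4 + a^5 n^5 for average ratio a."""
    return (5 * n
            + 10 * alpha ** 2 * n ** 2
            + 10 * alpha ** 3 * n ** 3
            + 5 * alpha ** 4 * n ** 4
            + alpha ** 5 * n ** 5)


def reduction_factor(alpha: float) -> float:
    """How many times fewer tests the model predicts compared to brute force."""
    return BRUTE_FORCE / model_fba_tests(alpha)


if __name__ == "__main__":
    for a in (1.0, 0.63, 0.39, 0.25, 0.15, 0.10):
        print(f"alpha = {a:4.2f}: {model_fba_tests(a):.3e} tests, "
              f"reduction factor {reduction_factor(a):.3g}")
```

For small $\alpha$ the last two terms dominate, which is why the predicted reduction grows roughly like $\alpha^{-5}$; at $\alpha = 1$ the lower-order terms make the model slightly more expensive than brute force.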
[Figure 2.7 plot: "Number of required FBA tests as a function of alpha". Vertical axis: Log(Number of required FBA tests); horizontal axis: Alpha (0 to 1). Series: Brute force Algorithm, My Algorithm, Real Data.]
Figure 2.7) Dependence of the algorithm on alpha (The average viability ratio of step 1) in comparison to the brute
force approach: The number of required FBA tests is constant for brute force approach (red line) while my algorithm
(blue curve) depends strongly on average viability ratio of step1 (α). The blue circles correspond to the number of FBA
tests required by my algorithm for each of the 10 carbon sources. We see that the number of required FBA tests in reality
was smaller than what is predicted by the simple univariate model.
According to Figure 2.7, the success of the algorithm strongly depends on $\alpha$. At large values
of alpha ($\alpha \approx 1$), the algorithm is as inefficient as the brute force approach, while at smaller
values of alpha the number of required FBA tests drops increasingly steeply.
For instance, at $\alpha \approx 0.63$, $0.39$, $0.25$, $0.15$ and $0.10$ the number
of required FBA tests is reduced roughly 10, 100, 1000, 10000 and 100000 times, respectively. However,
in reality the number of required FBA tests was even smaller than what the above-mentioned
model predicts. The real numbers are depicted as blue circles in figure 2.7, each of which represents the
number of FBA tests needed for one carbon source. We can see that the algorithm
reduces the number of FBA tests approximately $10^{4}$ to $10^{6}$ times, depending on the carbon
source. Thus, instead of 1000 years, the algorithm is able to characterize the neutral networks
in between 0.1 year and one day, depending on the carbon source. Moreover, since the algorithm was
amenable to parallelization, using the computing cluster in the lab this task was done in a few
days for each carbon source. The number of required FBA tests and the number of viable
genotypes are depicted together for each carbon source in figure 2.8. Roughly speaking, the
number of required FBA tests is twice the number of viable genotypes; in other words, the candidate
genotypes whose viability is checked by FBA have around a 50% chance of
being viable. Thus, the algorithm is able to shrink the search space (the total genotype
space) to a tiny subspace in which the probability of finding a viable genotype is around 50%,
and it skips the vast portion of the genotype space in which the probability of finding a viable
genotype is zero.
[Figure 2.8 bar chart: vertical axis in units of 10^9 (0 to 3.5); series: Number of FBA Tests, Number of viable genotypes; one pair of bars per carbon source (Lactate, Acetate, Pyruvate, Glutamate, Succinate, Fumarate, Malate, Fructose, Alpha-ketoglutarate, Glucose).]
Figure 2.8) the number of required FBA tests versus the number of viable genotypes for each carbon source:
The number of FBA tests (blue bars) required to define viable genotype space is roughly twice that of the number of
viable genotypes (brown bars), in other words, the algorithm is able to shrink the total genotype space to a tiny
subspace in which the probability of finding viable genotypes is around 50%.
2.4. Partitioning the set of viable genotypes based on genotype size:
The number of reactions among reaction universe which are present in a genotype is defined as
the size of the genotype. As it was previously mentioned, the number of reactions in our reaction
universe was 71 among which 20 reactions were export reactions, and 51 were internal reactions.
Every viable genotype should contain all export reactions, so by convention we assign size 51
to the genotype which contains all 51 internal reactions, and the size of any other
genotype is $(51 - i)$, where $i$ represents the number of eliminated reactions. Based on the
size of genotypes, the neutral networks were partitioned into genotype networks, each of which
includes genotypes of a certain size (Figure 2.9). Partitioning the total genotype space (viable
plus unviable) based on size leads to a binomial distribution, which is simply given by $\binom{51}{k}$,
where $k$ is the genotype size. The logarithm of this binomial distribution generates the
symmetrical black curve in figure 2.9 with a maximum in the middle. Partitioning the neutral
networks of all carbon sources based on genotype size also leads to almost symmetrical curves
but obviously with lower maximum value and right-shifted maximum point (the size with
maximum number of genotypes). We observe that in all of the carbon sources, there is no viable
genotype in sizes smaller than 17. Furthermore, in carbon sources with larger neutral network
(e.g. glucose), genotypes tend to be distributed in a wider range of size, and also they tend to
have a higher maximum genotype network than carbon sources with smaller neutral network
(e.g. acetate). On the other hand, the size in which the number of viable genotypes reaches
maximum is smaller for carbon sources with larger neutral network than those with smaller one.
Then, for each genotype network corresponding to each carbon source we measured a new
parameter defined as follows:
$$v_n^c = \frac{V_n^c}{\binom{51}{n}}$$
where:
$V_n^c$ is the number of viable genotypes of size $n$ for carbon source $c$,
$\binom{51}{n}$ is the maximum possible number of genotypes of size $n$, and
$v_n^c$ is the viability ratio of the genotype network of size $n$ for carbon source $c$.
As is shown in figure 2.10, the viability ratio for each carbon source increases monotonically with
genotype size. This result is expected, since the more reactions a genotype
possesses, the higher the probability of its viability. Moreover, for each genotype size, the
viability ratio of carbon sources with larger neutral network (e.g. glucose) is higher than that of
carbon sources with smaller neutral network (e.g. acetate).
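The partitioning by size and the viability ratio defined above can be sketched as follows. The helper names `partition_by_size` and `viability_ratio` are illustrative, genotypes are assumed to be stored as integer bitmasks over the 51 internal reactions, and any counts passed to the functions in this example are hypothetical rather than results of this study.

```python
from collections import Counter
from math import comb

N_INTERNAL = 51  # number of internal reactions in the reaction universe


def partition_by_size(genotypes) -> Counter:
    """Count genotypes per size, where size is the number of 1-bits (reactions present)."""
    return Counter(bin(g).count("1") for g in genotypes)


def viability_ratio(viable_count: int, size: int, n_reactions: int = N_INTERNAL) -> float:
    """v_n^c = V_n^c / C(n_reactions, n): fraction of size-n genotypes that are viable."""
    return viable_count / comb(n_reactions, size)


if __name__ == "__main__":
    # The sizes of the total genotype space add up to 2**51, as expected
    # for a partition of all subsets of 51 reactions.
    assert sum(comb(N_INTERNAL, k) for k in range(N_INTERNAL + 1)) == 2 ** N_INTERNAL
    # Hypothetical count: 17 viable genotypes of size 50 out of C(51, 50) = 51.
    print(viability_ratio(17, 50))
```

Applied to the set of viable genotypes of one carbon source, `partition_by_size` yields the quantities $V_n^c$, and dividing each by $\binom{51}{n}$ gives the curve for that carbon source.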
[Figure 2.9 plot: vertical axis: Log(number of genotypes) (0 to 15); horizontal axis: Size (0 to 55). Curves: Total (viable+unviable), Glucose, Fructose, Alpha-ketoglutarate, Malate, Fumarate, Pyruvate, Glutamate, Succinate, Lactate, Acetate.]
Figure 2.9) Partitioning the neutral network into genotype networks of homogeneous genotype size for each
carbon source: The black curve corresponds to the logarithm of the number of possible genotypes (viable plus
unviable) for each size which is logarithm of a binomial distribution. The colorful curves correspond to the
logarithm of the number of viable genotypes with a given number of reactions (size) for each carbon source. The
genotype networks are almost symmetrically distributed but with smaller maximum value and a right-shifted
maximum point with respect to the total genotype network.
[Figure 2.10 plot: vertical axis: Log(viability ratio) (-15 to 0); horizontal axis: Size (20 to 55). Curves: Glucose, Fructose, Alpha-ketoglutarate, Malate, Fumarate, Pyruvate, Glutamate, Succinate, Lactate, Acetate.]
Figure 2.10) Viability ratio for genotype networks of different size for each carbon source: the vertical axis is the
logarithm of the viability ratio and the horizontal axis is the genotype size. Each curve corresponds to one of the ten carbon
sources. The viability ratio rises monotonically, with a decreasing slope, as the genotype size increases. Moreover, for
each genotype size, the viability ratio of carbon sources with a larger neutral network is higher than that of carbon sources
with a smaller neutral network.
2.5. Assessing the correctness of the algorithm:
To evaluate the correctness of the algorithm, a brute-force approach to exhaustive
genotype-phenotype mapping was used as follows:
It was computationally feasible to generate all possible genotypes of size 46 to 51 using the
MATLAB built-in function “nchoosek”. For all genotypes in these 6 genotype sets, viability on
each carbon source was then tested directly using flux balance analysis. Finally, the number of
viable genotypes in each of the 6 genotype sets (genotypes of size 46 to 51) was determined for
each carbon source, and the resulting numbers were compared to those obtained by the proposed
algorithm. The numbers were exactly the same, with no inconsistency.
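The brute-force check described above can be sketched in Python, where itertools.combinations plays the role that “nchoosek” plays in MATLAB. This is only an illustration under stated assumptions: the 6-reaction universe and the rule in toy_is_viable are hypothetical stand-ins for the 51-reaction universe and the flux balance analysis test, and all function names are invented for this sketch.

```python
from itertools import combinations
from math import comb

def count_viable_by_size(universe, sizes, is_viable):
    """Enumerate every genotype of each requested size and count the viable ones."""
    counts = {}
    for n in sizes:
        counts[n] = sum(1 for genotype in combinations(universe, n)
                        if is_viable(frozenset(genotype)))
    return counts

# Hypothetical stand-in for the flux balance analysis test:
# a genotype is "viable" if it keeps reaction 0 and at least one of reactions 1 or 2.
def toy_is_viable(genotype):
    return 0 in genotype and (1 in genotype or 2 in genotype)

universe = range(6)      # toy reaction universe; the thesis uses 51 reactions
sizes = range(3, 7)
brute_force = count_viable_by_size(universe, sizes, toy_is_viable)

# Viability ratio per size: viable genotypes divided by all genotypes of that size.
viability_ratio = {n: brute_force[n] / comb(len(universe), n) for n in sizes}

# Number of genotypes a thesis-scale check enumerates (sizes 46 to 51 out of 51).
total_checked = sum(comb(51, n) for n in range(46, 52))
```

The per-size counts in brute_force are the quantities that would be compared against the counts produced by the partitioning algorithm; total_checked shows why only the largest sizes are feasible to enumerate directly.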
One might argue that this assessment is insufficient, since the results are compared only for
genotype sizes 46 to 51 and not for the rest. However, the algorithm first defines the whole
neutral network and only then partitions it into genotype networks containing genotypes of
different sizes, so the correctness of the method is independent of the size of the genotypes.
Therefore, a problem in the algorithm or in its implementation would be expected to show up in
at least one of these 6 genotype sets.
Chapter 3: Topological properties of genotype networks
In the total genotype space, every node has exactly the same number of neighbors, so the
graph-theoretical properties of the network are easy to determine: the network is a typical
example of a regular graph, in which all nodes have the same degree. However, a genotype network
corresponding to a particular phenotype is a subgraph of the genotype space. In other words, the
set of nodes in the genotype network is a subset of the nodes in the genotype space, because
some nodes of the genotype space are absent from the genotype network. Due to the missing nodes,
the topological features of genotype networks might deviate from those of a regular graph. Here,
we use a graph-theoretical approach to investigate the topological features of these genotype
networks. First, the degree distribution and average degree are discussed. Next, the shortest
paths between all pairs of nodes and the diameter of the genotype networks are determined.
Finally, the distribution of betweenness centrality for each network is quantified. The
influence of genotype size on the topological features is also investigated. Genotype networks
including more than $10^6$ genotypes are not amenable to exhaustive analysis, so whenever
needed we randomly sampled $10^5$ genotypes from the corresponding genotype networks.
3.1. Degree distribution of genotype networks:
The degree of a node in the genotype network is its number of neighbors. As mentioned
previously, each node of the network corresponds to a distinct metabolic network, and two
metabolic networks are neighbors if they can be converted into each other through a single
reaction swap. For each genotype network including genotypes of a particular size, an upper
bound for the degree of the nodes can be defined. A genotype attains the maximum number of
neighbors in its genotype network if all its neighbors in the total genotype space are mapped to
the same phenotype and are consequently all present in the genotype network. Therefore, the
upper bound for the degree equals the number of neighboring genotypes in the total genotype
space, which is determined by the genotype size.
For a genotype network including genotypes of size $i$, the upper bound for the degree of each
individual node is:
$$\max_{k} d_k(i,j) \le (|U| - i) \cdot i, \qquad k \in G(i, C_j), \quad i = 0, 1, \dots, |U|; \quad j = 1, 2, \dots, 10$$
Where:
$d_k(i,j)$ is the degree of node $k$ in the genotype network $G(i, C_j)$, which includes all metabolic
networks that have $i$ reactions and are viable on carbon source $C_j$. $|U|$ is the number of
reactions in the reaction universe, which equals 51 in this study.
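As a small check of this bound, the following Python sketch enumerates the swap neighbors of a genotype in the full genotype space and compares their number with $(|U| - i) \cdot i$. The function names and the 6-reaction toy universe are assumptions made for illustration only.

```python
def degree_upper_bound(size, universe_size=51):
    """Number of swap neighbors in the full genotype space: (|U| - i) * i."""
    return (universe_size - size) * size

def swap_neighbors(genotype, universe):
    """All genotypes reachable by removing one reaction and adding an absent one."""
    genotype = frozenset(genotype)
    absent = set(universe) - genotype
    return {(genotype - {removed}) | {added}
            for removed in genotype for added in absent}

toy_universe = range(6)          # hypothetical small universe
toy_genotype = {0, 1, 2, 3}      # hypothetical genotype of size 4
neighbors = swap_neighbors(toy_genotype, toy_universe)

# For |U| = 51 the bound is largest at sizes 25 and 26, where it equals 650.
largest_bound = max(degree_upper_bound(i) for i in range(52))
```

Each of the 4 present reactions can be exchanged for each of the 2 absent ones, so the toy genotype has 8 neighbors, matching the bound for size 4 in a universe of 6 reactions.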
In figure 3.1, the theoretical upper bound, maximum degree, average degree (with standard
deviation) and minimum degree are depicted for each genotype network viable on glucose. As in
figure 2.9, where the maximum possible number of genotypes versus genotype size resulted in a
symmetrical curve, the maximum possible (upper-bound) degree versus genotype size is also a
symmetrical curve. Furthermore, the size at which the degree of the genotype networks reaches
its maximum is right-shifted. In other words, the size with the largest upper bound is not the
size of the genotype network with the largest degree; instead, the largest degree occurs at a
larger size than the largest upper bound. Surprisingly, the size with the largest maximum degree
is neither the size with the largest average degree nor the size with the largest minimum
degree. Precisely, the upper bound, maximum degree, average degree and minimum degree reach
their maxima at sizes 25, 35, 38 and 40, respectively. However, when the degrees at each size
are normalized by dividing them by the corresponding upper bound, the maximum, minimum and
average degree all grow with increasing size, and therefore reach their maximum values at the
largest size (figure 3.2). Another important point is that the variance of degrees among nodes
is higher for genotype sizes between 30 and 45. This might be explained by the fact that
genotype networks in this size range contain more genotypes, so their graphs include more nodes
with a more heterogeneous degree distribution, leading to a larger variance. Importantly, the
features discussed above are not peculiar to the genotype network corresponding to glucose;
they are general features that extend to the genotype networks corresponding to the other
carbon sources investigated in this study.
Furthermore, the degree distribution of genotype networks was analyzed and its similarity to
the degree distributions typical of regular, random and scale-free networks was investigated.
As noted before, a genotype space in which every possible genotype is viable on a certain
carbon source forms a regular graph in which every node’s degree equals the upper-bound degree
for the genotype size. This is not the case for genotype networks, which are subgraphs of the
original genotype space: we observed that the degrees of nodes in genotype networks span a wide
range. Moreover, as shown for a typical example in figure 3.4, the degree distribution of
genotype networks does not obey a power law, so they are not similar to scale-free networks.
The degree distribution of genotype networks does appear similar to that of a random graph with
a Poisson degree distribution (figure 3.3): the most frequent degree is almost the same as the
average degree, and the frequency of a degree decreases with its deviation from the average.
However, a goodness-of-fit test does not confirm that the degree distributions follow a Poisson
distribution. One reason for the deviation from a Poisson distribution might be the presence of
heavier tails, as depicted in figure 3.5. Hence, the degree distribution of genotype networks
cannot be assigned to any of these known distributions.
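The comparison with a Poisson distribution can be sketched as follows. This is a rough illustration, not the goodness-of-fit procedure used in the thesis (which the text does not specify): the Poisson mean is set to the average degree, the statistic is a Pearson chi-square restricted to the observed degrees, and toy_degrees is hypothetical data.

```python
from collections import Counter
from math import exp, lgamma, log

def poisson_pmf(k, mean):
    """Poisson probability of k, computed in log space to avoid overflow."""
    return exp(k * log(mean) - mean - lgamma(k + 1))

def degree_frequencies(degrees):
    """Relative frequency of each observed degree."""
    counts = Counter(degrees)
    total = len(degrees)
    return {k: c / total for k, c in sorted(counts.items())}

def chi_square_vs_poisson(degrees):
    """Pearson chi-square of observed degree counts against a Poisson
    distribution whose mean is the average degree (observed degrees only)."""
    n = len(degrees)
    mean = sum(degrees) / n
    counts = Counter(degrees)
    return sum((c - n * poisson_pmf(k, mean)) ** 2 / (n * poisson_pmf(k, mean))
               for k, c in counts.items())

toy_degrees = [3, 4, 4, 5, 5, 5, 6, 6, 7, 12]   # hypothetical degrees, not thesis data
frequencies = degree_frequencies(toy_degrees)
statistic = chi_square_vs_poisson(toy_degrees)
```

A single unusually large degree (12 in the toy data) contributes strongly to the statistic, which is the kind of heavy-tail effect the text mentions as a reason for the deviation.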
[Figure 3.1 plot. Horizontal axis: Size (0 to 60). Vertical axis: degree (0 to 700). Curves:
Upper-bound, Maximum, Average, Minimum.]
Figure 3.1: Dependence of degree on genotype size. The upper bound (black), the maximum degree (green),
average degree with standard deviation (blue) and minimum degree for each genotype network corresponding to
glucose are illustrated.
[Figure 3.2 plot. Horizontal axis: Size (0 to 60). Vertical axis: Normalized degree (0 to 0.8).
Curves: Maximum, Average, Minimum.]
Figure 3.2: Dependence of normalized degree on genotype size. The raw degrees are normalized by dividing
them by their corresponding upper bound, and the resulting maximum (green), average (blue), and minimum (red)
normalized degrees are depicted.
[Figure 3.3 plot. Horizontal axis: Degree (40 to 200). Vertical axis: Frequency (0 to 0.08).
Legend: Genotype network, Random network.]
Figure 3.3: The degree distribution of genotype network corresponding to genotypes of size 38 viable on
acetate (blue circles) versus the fitted Poisson distribution (red line).
[Figure 3.4 plot (log-log). Horizontal axis: Degree ($10^1$ to $10^3$). Vertical axis: Frequency
($10^{-25}$ to $10^0$).]
Figure 3.4: Log-log plot of the degree distribution of the genotype network corresponding to genotypes of size 38
viable on acetate. It clearly deviates from a straight line, which shows that the distribution does not follow a
power law, so the network is not scale-free.
[Figure 3.5 plot. Horizontal axis: Standard Normal Quantiles (-2.5 to 2.5). Vertical axis:
Quantiles of Input Sample (-5000 to 20000).]
Figure 3.5: QQ plot of the degree distribution of the genotype network corresponding to genotypes of size 38
viable on acetate versus the standard normal distribution. Deviation from the standard normal at the tails is clear.
Therefore, the global topological features of the genotype network are different from those of random graphs.
3.2. Shortest path and diameter of genotype networks:
The length of the shortest path between nodes A and B in a graph is the minimum number of edges
that have to be traversed to reach B starting from A, or vice versa. Identifying the shortest
paths between all pairs of nodes in a graph has a broad range of applications in science and
engineering [50]. In genotype networks, the length of the shortest path between genotypes A and
B indicates the number of mutational steps required to arrive at genotype B starting from
genotype A. The longer the shortest path between a pair of genotypes, the more mutational swaps
are required, and consequently the more time is needed to convert the genotypes into each other
on evolutionary time scales. Hence, the distribution of shortest path lengths between all pairs
of nodes in the genotype network indicates how long it takes for genotypes to be converted into
each other through neutral mutations. There are two widely used algorithms for the all-pairs
shortest path problem in a graph, namely the Floyd-Warshall algorithm and Johnson’s algorithm,
which solve the problem with time complexity $O(V^3)$ and $O(V^2 \log V + VE)$ respectively,
where $V$ is the number of nodes in the graph and $E$ is the number of edges [88, 89].
Therefore, identifying all-pairs shortest paths with these algorithms is not efficient for vast
genotype networks including millions of nodes. Fortunately, an intrinsic property of genotype
networks can be employed to greatly facilitate the study of all-pairs shortest paths in
genotype networks. As noted beforehand, genotypes that are reachable from each other via one
mutational swap are considered neighbors. Furthermore, the neighbors of a neighbor are
reachable by 2 mutational swaps: the first swap is needed to reach the neighbor and the second
to reach the neighbor's neighbor. In a connected network, this definition can be extended such
that the minimum number of edges traversed from one genotype to another is determined by the
number of mutational swaps required to convert one genotype into the other. Thus, the length of
the shortest path between each distinct pair of nodes in a connected genotype network can be
measured simply by counting the reaction swaps required to convert one genotype into the other.
Moreover, the number of mutational swaps required to convert genotype i into genotype j equals
the number of reactions that are included in i but not in j (or, equivalently for genotypes of
equal size, in j but not in i). Therefore, the length of the shortest path between a pair of
genotypes in a connected network is defined as follows:
$$X = \{\, x \mid x \in R_i \setminus R_j \,\}$$
$$Sp(i,j) = |X|$$
Where $X$ is the set of reactions that are included in the reaction set of genotype $i$ ($R_i$) but not
in the reaction set of genotype $j$ ($R_j$), and $Sp(i,j)$ is the length of the shortest path
between genotypes $i$ and $j$, which equals the cardinality of set $X$. Thus, using this simple idea,
the time complexity of determining the all-pairs shortest paths of the network is reduced to $\Theta(V^2)$.
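The definition above translates directly into a set difference. The following Python sketch computes $Sp(i,j)$ for all distinct pairs and derives the diameter and average shortest path from them. It relies on the text's assumption that the network is connected and that the swap count is realized by a path inside the genotype network; the function names and the toy genotypes are hypothetical.

```python
from itertools import combinations

def shortest_path_length(genotype_i, genotype_j):
    """Sp(i, j): number of reactions present in genotype i but absent from j."""
    return len(frozenset(genotype_i) - frozenset(genotype_j))

def path_length_summary(genotypes):
    """Diameter and average shortest path over all distinct pairs (V^2 scale)."""
    lengths = [shortest_path_length(a, b) for a, b in combinations(genotypes, 2)]
    return max(lengths), sum(lengths) / len(lengths)

# Hypothetical genotypes of size 3, each a set of reaction indices.
toy_network = [{0, 1, 2}, {0, 1, 3}, {0, 3, 4}, {3, 4, 5}]
diameter, average = path_length_summary(toy_network)
```

No graph traversal is needed: each pair costs one set difference, which is why the total work grows with the number of pairs rather than with $V^3$.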
The maximum possible length of the shortest path between two nodes in a genotype network of a
particular size is bounded by an upper bound defined as follows:
$$\max_{i,j} Sp_n^m(i,j) \le \min\bigl(|U| - n,\; n\bigr), \qquad i, j \in G(n, C_m), \quad n = 0, 1, \dots, |U|; \quad m = 1, 2, \dots, 10$$
Where:
$\max_{i,j} Sp_n^m(i,j)$ is the maximum length of the shortest path among pairs of nodes $i$ and $j$ in the
genotype network $G(n, C_m)$ of size $n$ corresponding to carbon source $C_m$. The upper bound
versus size forms a symmetrical curve with straight lines of opposite slopes on each side
(figure 3.6). The length of the longest shortest path in a network is called the diameter of
the network. As exemplified for the genotype networks corresponding to glucose in figure 3.6,
at smaller sizes the diameter of the networks is smaller than the upper bound, but from size 34
onwards the diameter reaches the upper bound, meaning that the theoretically longest shortest
path is realized in the corresponding networks. The average shortest path versus size also
results in a symmetrical curve, but unlike the upper bound the curve is not composed of
straight lines; instead it resembles a parabola. Dividing the diameter and the average shortest
path length by the upper bound of the corresponding genotype size results in normalized
parameters that increase monotonically with size (figure 3.7). Both the normalized diameter and
the normalized average shortest path grow along a straight line with increasing size, but the
normalized diameter grows with a steeper slope, so it reaches 1 sooner than the average
shortest path length. The reason is that, according to figure 2.10, the viability ratio
increases with size; consequently, the genotype networks become more similar to their
corresponding genotype sub-space, so the diameter and average shortest path of the genotype
networks get closer to those of the original sub-spaces.
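The upper bound and the normalization used for figure 3.7 are simple enough to state in a few lines of Python. The function names are assumptions of this sketch; the universe size of 51 and the diameter of 8 at size 43 are taken from the text.

```python
def shortest_path_upper_bound(size, universe_size=51):
    """Largest possible shortest-path length at a genotype size: min(|U| - n, n)."""
    return min(universe_size - size, size)

def normalized_length(length, size, universe_size=51):
    """Path length divided by the upper bound for that genotype size."""
    return length / shortest_path_upper_bound(size, universe_size)

# Upper bound for every non-trivial genotype size in a universe of 51 reactions.
bounds = {n: shortest_path_upper_bound(n) for n in range(1, 51)}

# The text reports a diameter of 8 for the glucose network of size 43.
normalized_diameter_43 = normalized_length(8, 43)
```

The bound rises by one per added reaction up to size 25 and falls by one per added reaction from size 26 onward, which produces the two straight lines of opposite slope described above.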
[Figure 3.6 plot. Horizontal axis: Size (0 to 60). Vertical axis: Length of shortest path (0 to 25).
Curves: Upper-bound, Diameter, Average.]
Figure 3.6: Dependence of the length of the shortest path on genotype size. Upper bound (black), diameter (red)
and average shortest path (blue) are illustrated for genotype networks corresponding to glucose.
[Figure 3.7 plot. Horizontal axis: Size (0 to 60). Vertical axis: (Length of shortest path)/(Upper
bound) (0 to 1). Curves: Diameter, Average.]
Figure 3.7: Dependence of the normalized length of the shortest path on genotype size. The lengths of the
shortest paths were normalized by dividing them by the corresponding upper bound, and the normalized diameter
(red) and the normalized average shortest path length (blue) are depicted.
As depicted in figure 3.6, plotting the diameter against size does not result in a symmetrical
curve, unlike the upper bound and the average shortest path. Thus, two genotype networks of
different genotype size that have almost the same average shortest path can differ strongly in
diameter. Conversely, two genotype networks with almost the same diameter can have very
different average shortest paths. The reason for this trend is that the distribution of
shortest path lengths changes with increasing genotype size. Generally, for genotype networks
of smaller genotype sizes, the length of the shortest path is distributed almost like a Poisson
distribution, in which the lengths in the middle (around the average) have the highest
frequency and the frequency declines gradually with distance from the average. For genotype
networks of larger genotype sizes, in contrast, the distribution is shifted to the right,
meaning that the majority of node pairs are connected by long shortest paths and only a few by
shorter ones. For example, as illustrated in figure 3.8, the length distribution for the
genotype network of size 43 (blue) is shifted to the right and the majority of the lengths are
6 or 7, close to the diameter (8), while for the genotype networks of sizes 26 and 29 (green
and red, respectively) lengths near the average account for most of the lengths. Although the
genotype networks of size 43 and 26 have the same diameter, the right-shifted distribution of
lengths in the genotype network of size 43 makes its average shortest path higher than that of
the genotype network of size 26. Similarly, although the genotype network of size 43 has
approximately the same average shortest path as the genotype network of size 29, the lengths
for size 29 are distributed over a larger range, like a Poisson distribution, leading to a
larger diameter than in the genotype network of size 43.
[Figure 3.8 plot, three panels. Horizontal axis: Length of shortest path (1 to 8, 1 to 8, and 1 to 13).
Vertical axis: Frequency (0 to 0.5).]
Figure 3.8: Distribution of the length of the shortest path for genotype networks containing genotypes of size 43
(blue), 26 (green) and 29 (red) which are viable on glucose.
3.3. Betweenness centrality in genotype networks:
Betweenness is a measure of a node's centrality and importance in a network. The betweenness of
a node equals the total number of shortest paths between all possible pairs of nodes in the
network that pass through that node. The more nodes a network has, the larger the number of
shortest paths passing through a node tends to be. Therefore, betweenness centrality is usually
normalized by dividing it by the total number of distinct pairs of nodes in the graph, which
equals $n(n-1)/2$, where $n$ is the number of nodes in the graph. This normalization causes the
values of betweenness centrality to lie between zero and one. To determine the betweenness
centrality of the genotypes in the genotype network, the Floyd-Warshall algorithm implemented
in the igraph package was employed [88, 90]. However, the algorithm was not efficient for
networks with more than $10^5$ genotypes, so the betweenness of the genotype networks of size
between 35 and 45 was not investigated.
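To make the normalization concrete, here is a minimal pure-Python sketch of normalized betweenness on a small undirected graph. It does not reproduce the igraph computation reported above; it swaps in Brandes' accumulation scheme for illustration, divides by $n(n-1)/2$ as described in the text, and uses a hypothetical four-node path graph.

```python
from collections import deque

def normalized_betweenness(adjacency):
    """Betweenness of every node in an undirected, unweighted graph, divided
    by n(n-1)/2. Uses Brandes' accumulation over breadth-first searches."""
    nodes = list(adjacency)
    score = {v: 0.0 for v in nodes}
    for source in nodes:
        # Breadth-first search that counts shortest paths from the source.
        dist = {source: 0}
        paths = {v: 0 for v in nodes}
        paths[source] = 1
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes inward.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    n = len(nodes)
    pairs = n * (n - 1) / 2
    # Each unordered pair was counted from both endpoints, hence the division by 2.
    return {v: score[v] / 2 / pairs for v in nodes}

# Hypothetical path graph a - b - c - d.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
centrality = normalized_betweenness(toy_graph)
```

In the toy graph, node b lies on the shortest paths of the pairs (a, c) and (a, d), so its raw betweenness is 2 and its normalized value is 2 divided by the 6 distinct pairs.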
As illustrated in figure 3.9.a, for smaller sizes (i.e. between 20 and 35) the average
normalized betweenness decreases exponentially with increasing size, while for larger sizes
(i.e. between 45 and 50) it increases exponentially with increasing size. On the other hand, as
shown in figure 2.9, for smaller sizes the number of viable genotypes in the corresponding
genotype network increases with size, leading to networks with more nodes, while for larger
sizes the number of genotypes decreases with size, leading to genotype networks with fewer
nodes. Therefore, it can be concluded that the more nodes a genotype network possesses, the
lower the betweenness centrality of individual nodes is. Figure 3.9.b confirms that this
inverse relationship indeed holds. Moreover, the same trend exists for the range of betweenness
centrality among the genotypes of a genotype network: the more genotypes a genotype network
contains, the smaller the difference between the most central and the least central nodes.
Consequently, in larger genotype networks the centrality of every genotype is more or less the
same, while in smaller networks the differences among genotypes in terms of centrality are more
pronounced (figure 3.9.c-d). Therefore, in networks with a smaller number of genotypes there
exist genotypes that are very central, and their presence in the network allows other genotypes
to reach each other via a minimum number of mutational steps. The fraction of highly central
genotypes in different genotype networks was also investigated. By convention, highly central
genotypes are defined as genotypes whose betweenness centrality differs from that of the
maximally central genotype by less than 10% of the range of betweenness values in the network.
The pattern is again the same: networks with a smaller number of genotypes possess a larger
fraction of highly central genotypes (figure 3.9.e-f). Nevertheless, even in most of the small
networks the highly central genotypes account for a tiny fraction of the genotypes, implying
that genotype networks contain only a few highly central genotypes whose existence is vital for
facilitating reachability of genotypes in the network via stepwise mutations.
Figure 3.9: Dependence of betweenness centrality on genotype size and network size. The dependence of the logarithm
of average normalized betweenness on genotype size (a) and on the logarithm of network size (b) is shown for
genotype networks associated with 10 different carbon sources. Moreover, the dependence of the logarithm of the range
of betweenness centrality on genotype size (c) and on the logarithm of network size (d) is illustrated for genotype
networks related to the 10 carbon sources. Finally, the dependence of the logarithm of the fraction of highly
central genotypes on genotype size (e) and on the logarithm of network size (f) is depicted for the genotype networks
corresponding to the 10 carbon sources. The details are explained in the main text.
Chapter 4: Connectedness of genotype networks
Another important feature of genotype networks is their connectedness. A network in which every
pair of nodes is reachable via a path (a sequence of consecutive edges of the network) is
called a connected network. If genotype networks turn out to be connected, this implies that
the genotypes in the network can be converted into each other on evolutionary time scales via
mutational steps that preserve the phenotype. Moreover, if a genotype network is not fully
connected, the number of connected components into which the network is fragmented and the
relative sizes of the components are important to study. For example, it is of particular
interest whether the network is fragmented into small components of more or less the same size
or into unequally sized components. In network theory, the largest component, when it includes
a high fraction of the total nodes in the network, is called the “giant component” [91].
Therefore, this chapter examines whether the genotype networks are fully connected and, if not,
whether they contain a giant component. Beyond connectivity, there is another concept called
“biconnectivity”, which is a property of a connected network in which removal of any individual
node does not affect the connectivity of the network [92]. In other words, the connectivity of
a biconnected network is robust to node removal. Hence, the biconnectivity of connected
genotype networks is also investigated.
The algorithm generally used to determine the number of connected components in a network and
to assign the nodes to their components employs either breadth-first search or depth-first
search [93]. However, this algorithm first needs the network to be represented as an adjacency
matrix (or an adjacency list for sparse graphs). Each entry of an adjacency matrix corresponds
to a pair of nodes and is one if there is an edge between the two nodes and zero otherwise. For
genotype networks, the adjacency matrix is not known beforehand. To define such a matrix for a
genotype network comprised of $V$ genotypes, $V^2$ pairs of genotypes have to be checked to
determine whether their reaction contents differ by only a pair of reactions (i.e. whether
there is an edge between them). Therefore, the time complexity of just the initial step of the
algorithm is $\Theta(V^2)$, which renders the algorithm infeasible for studying the
connectivity of our vast genotype networks. Moreover, the space complexity is another issue,
which causes memory problems. Hence, an alternative algorithm is needed that circumvents the
need, first, for exhaustive checking of all node pairs for neighborhood and, second, for
storing the resulting adjacency matrix or list. An algorithm fulfilling both criteria was
developed and is discussed in sections 4.1 and 4.2. Nevertheless, this algorithm encounters
memory problems for networks with more than $10^6$ nodes. This issue was resolved using a new
approach discussed in section 4.3.
4.1. The algorithm:
Algorithm for identifying connected components of a genotype network
connected_comp(G, V, l, comp)
Input: G, V[1...l], l, comp
Output: V[1...l]
Initialization:
    on = 0      // a variable which is used to decide whether to continue or not
    init = 0    // the index of the first node with un-assigned connected component
    for i = 1...l
        if V[i] = 0
            V[i] = comp    // the node gets assigned to the current component
            on = 1         // the algorithm will be allowed to progress to the main procedure
            init = i
            break
Main procedure:
    if on = 1
        // run the following procedure twice
        for p = init,...,l
            if V[p] = comp    // the node p can influence others
                for q = (p+1),...,l
                    if V[q] = 0    // if node q is not yet assigned to any component it might be influenced by p
                        if |R| = 1 : R = {r | r ∈ G_p \ G_q}    // to check whether p and q are neighbors
                            V[q] = comp    // node q is assigned to the same component as p
            else if V[p] = 0    // the node p can be influenced by other nodes
                for q = (p+1),...,l
                    if V[q] = comp    // if node q is assigned to the current component it might influence p
                        if |R| = 1 : R = {r | r ∈ G_p \ G_q}    // to check whether p and q are neighbors
                            V[p] = comp    // node p is assigned to the same component as q
    else
        return V    // the vector V, each entry of which is the index of the connected component of the
                    // corresponding node, is returned.
Recursive call:
    comp = comp + 1    // the comp variable is redefined to identify nodes belonging to the next component
    connected_comp(G, V, l, comp)    // the algorithm is recursively called; the new V vector as
                                     // modified above is passed to the algorithm.
The algorithm receives the genotype matrix (G) containing l rows, each of which corresponds to a
genotype and stores the reaction content of that genotype. V is a vector in which each entry
represents the index of the connected component to which the corresponding genotype belongs.
Initially, this vector is passed to the algorithm as an all-zero vector; inside the algorithm
the vector is updated, and finally it is returned as the output of the algorithm. Comp is the
variable indicating the current connected component to which the algorithm is trying to map
genotypes, and it is initialized to one. At the beginning, the algorithm checks whether there
is still any zero element in the V vector (i.e. a genotype that is not yet assigned to any
connected component). If not, the algorithm ends and outputs the V vector; otherwise it
progresses to the main procedure. The main procedure implements the idea of reducing the number
of node pairs whose neighborhood is checked. There are two independent double for loops, which
together could check all possible pairs of nodes in the network. However, using several “if”
statements, the algorithm restricts the number of neighborhood checks to a limited fraction of
the total possible pairs of nodes. When the algorithm reaches a node indexed p, it checks
whether the node is already assigned to the current component (comp) or is still unassigned. If
it is assigned to the current component, it could influence the unassigned nodes, so its
neighborhood is checked only against unassigned nodes; if an unassigned node is its neighbor,
that node gets assigned to the current component. On the other hand, if the node indexed p is
unassigned, it could be influenced by nodes assigned to the current component, so its
neighborhood is checked only against those nodes. If the node finds a neighbor among the nodes
assigned to the current component, it gets mapped to the current component; otherwise it
remains unassigned. After labeling all nodes that belong to the current component, the
algorithm proceeds to find the nodes belonging to the next component. To this end, the
algorithm is called recursively, and the updated V vector and the new comp, together with the
original genotype matrix, are passed to it.
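The procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the thesis implementation: genotypes are represented as sets of reaction indices instead of rows of a matrix, the recursion is replaced by an outer loop over components, and the sweep is repeated until no label changes instead of a fixed two passes. All names are hypothetical.

```python
def is_neighbor(genotype_p, genotype_q):
    """Two equal-sized genotypes are neighbors if they differ by one reaction swap."""
    return (len(genotype_p) == len(genotype_q)
            and len(genotype_p - genotype_q) == 1)

def connected_components(genotypes):
    """Label each genotype with the index (1, 2, ...) of its connected component.

    Neighborhood is tested on the fly, so no adjacency matrix or list is stored.
    """
    genotypes = [frozenset(g) for g in genotypes]
    total = len(genotypes)
    label = [0] * total            # 0 means "not yet assigned", like vector V
    comp = 0
    for init in range(total):
        if label[init] != 0:
            continue
        comp += 1
        label[init] = comp         # seed the next component
        changed = True
        while changed:             # assumption: sweep until stable, not a fixed two passes
            changed = False
            for p in range(init, total):
                if label[p] == comp:
                    # p can pull later unassigned nodes into the component
                    for q in range(p + 1, total):
                        if label[q] == 0 and is_neighbor(genotypes[p], genotypes[q]):
                            label[q] = comp
                            changed = True
                elif label[p] == 0:
                    # p can be pulled in by a later node already in the component
                    for q in range(p + 1, total):
                        if label[q] == comp and is_neighbor(genotypes[p], genotypes[q]):
                            label[p] = comp
                            changed = True
                            break
    return label

toy_genotypes = [{0, 1}, {2, 3}, {0, 2}, {4, 5}]   # hypothetical genotypes
labels = connected_components(toy_genotypes)
```

In the toy input, {0, 1} and {2, 3} are not neighbors of each other but are both neighbors of {0, 2}, so the first three genotypes share one component, while {4, 5} differs from all of them by two reactions and forms its own component.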
4.2. Analyzing the efficiency of the algorithm:
As noted before, using breadth-first search or depth-first search to find the connected
components of the genotype networks requires two steps: first defining the adjacency matrix in
$\Theta(V^2)$ time, and then running the corresponding graph traversal algorithm, which has
$\Theta(V + E)$ time complexity, where $E$ is the number of edges in the graph. Therefore, the
total time complexity is $\Theta(V^2) + \Theta(V + E) = \Theta(V^2)$. As will be formally shown
later, the proposed algorithm not only reduces space complexity by identifying connected
components without the need for a pre-filled adjacency matrix, but also performs the whole task
in a running time of $\Theta(m \cdot V)$, where $m$ stands for the average degree of the graph
and $V$ is the number of nodes in the network. Importantly, $\Theta(m \cdot V)$ is a great
reduction in comparison to $\Theta(V^2)$, especially because in most networks the average
number of nodes with which an individual node is connected is a very small fraction of the
total nodes in the network. In other words, $m \ll V \Rightarrow m \cdot V \ll V^2$. For
example, for a network comprised of $10^6$ nodes with an average degree of 100 per node, the
maximum running time for determining the connected components using the proposed algorithm is
on the order of $100 \cdot 10^6 = 10^8$, while with conventional algorithms the time complexity
is on the order of $10^6 \cdot 10^6 = 10^{12}$. In other words, the algorithm performs the task
10000 times faster.
Theorem: The proposed algorithm for identification of the connected components of a network
$G(V, E)$ with average degree per node of $m$ has a running time of $\Theta(m \cdot V)$.
Proof:
The maximum possible number of edges in a graph with $V$ nodes is achieved when all
possible pairs of nodes ($V^2$) are directly connected to each other by an edge. Hence, the density
of a network can be measured by dividing the total number of edges by the maximum possible
number of edges as follows:
$$d = \frac{E}{V^2} \qquad (4.1)$$
This parameter can also be considered as the probability that an edge exists between any
randomly chosen pair of nodes in the network. The larger this density, the higher the probability
that a randomly chosen pair of nodes is connected via an edge.
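The theorem is stated in terms of the average degree $m$, while the proof works with the density $d$. A short worked link between the two may help, assuming an undirected graph without self-loops, so that each edge contributes to the degree of two nodes (this relation is an assumption of this note and is not stated in the source). The numbers in the second line are those of the example in section 4.2.

```latex
m = \frac{2E}{V}, \qquad d = \frac{E}{V^2}
\;\Longrightarrow\; d = \frac{m}{2V}, \qquad d \cdot V^2 = \frac{m \cdot V}{2}
% Example: V = 10^6 nodes with average degree m = 100
E = \frac{m \cdot V}{2} = 5 \times 10^{7}, \qquad d = \frac{5 \times 10^{7}}{10^{12}} = 5 \times 10^{-5}
```

Under this assumption, a quantity of order $d \cdot V^2$ is of order $m \cdot V$, which is how a bound expressed through $d$ can be restated through $m$.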
Let’s represent the running time of the algorithm for defining the first connected component as R
n
which is defined as R   Ri where Ri is the running time for ith node which equals to the
i 1
number of nodes whose neighborhood is checked against ith node. Let’s denote the number of
nodes which has been included in the current connected component before the algorithm reaches
node i, as A  i  1 . Then the number of nodes whose neighborhood is checked against node i
(i.e. Ri) is calculated as follows:
$$R_i = \frac{A(i-1)}{V} \cdot \bigl(V - A(i-1)\bigr) + \left(1 - \frac{A(i-1)}{V}\right) \cdot A(i-1) \qquad (4.2)$$
In the first product, $A(i-1)/V$ is the probability that the $i$th node is assigned to the current component and $V - A(i-1)$ is the number of unassigned nodes. In the second product, $1 - A(i-1)/V$ is the probability that the $i$th node is unassigned to the current component and $A(i-1)$ is the number of nodes assigned to the current component.
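As a side note (a simplification added here for readability, writing $A$ for $A(i-1)$), the two products in (4.2) are equal, so the expression collapses to a single term. This is the form that is expanded later in (4.7):

```latex
R_i = \frac{A}{V}\,(V - A) + \left(1 - \frac{A}{V}\right) A
    = 2A\left(1 - \frac{A}{V}\right)
    = 2A - \frac{2}{V}A^{2}, \qquad A \equiv A(i-1).
```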
Moreover, the number of nodes assigned to the current component at each step (i.e. $A(i)$) can be
defined as the number of nodes assigned to the current component at the previous step plus
the number of nodes that get assigned to the current component at the present step:
$$A(i) = A(i-1) + \frac{A(i-1)}{V} \cdot d\,\bigl(V - A(i-1)\bigr) + \left(1 - \frac{A(i-1)}{V}\right) \cdot d \qquad (4.3)$$
Here the first product is the probability that the $i$th node is assigned to the current component times the number of newly assigned nodes in that case, and the second product is the probability that the $i$th node is unassigned to the current component times the number of newly assigned nodes in that case.
The above equation can be re-written as follows:
$$A(i) = \left(1 + d - \frac{d}{V}\right) A(i-1) + d - \frac{d}{V}\, A^2(i-1) \qquad (4.4)$$
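The step from (4.3) to (4.4) can be made explicit as follows (a sketch added for clarity, abbreviating $A(i-1)$ as $A$): the products are expanded and the terms are collected by powers of $A$.

```latex
A(i) = A + \frac{A}{V}\,d\,(V - A) + \left(1 - \frac{A}{V}\right) d
     = A + dA - \frac{d}{V}A^{2} + d - \frac{d}{V}A
     = \left(1 + d - \frac{d}{V}\right) A + d - \frac{d}{V}A^{2}.
```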



This is a non-homogeneous first-order recurrence equation whose analytical solution results in
the following formula:
$$A(i) = P^i + Q\,i^2 + R\,i - 1 \qquad (4.5)$$
where, to leading order,
$$P = 1 + d - \frac{d}{V} \approx 1 + d, \qquad Q \approx \frac{d^2}{2V}, \qquad R \approx \frac{d}{V}.$$
Moreover, the exponential part of the above formula can be further approximated using the binomial
series as follows:
$$P^i \approx (1 + d)^i = 1 + i\,d + \binom{i}{2} d^2 + \binom{i}{3} d^3 + \dots$$
$$d\,i \ll 1 \;\Rightarrow\; P^i = 1 + i\,d + \binom{i}{2} d^2 + \binom{i}{3} d^3 + \dots \approx 1 + d\,i \qquad (4.6)$$
Plugging equation (4.6) into equation (4.5), and then equation (4.5) into equation
(4.2), results in the following equation:
$$R_{i+1} = 2\left(d\,i + \frac{d^2}{2V}\, i^2 + \frac{d}{V}\, i\right) - \frac{2}{V}\left(d\,i + \frac{d^2}{2V}\, i^2 + \frac{d}{V}\, i\right)^2$$
$$\frac{d}{V} \ll d \;\Rightarrow\; R_{i+1} \approx 2\left(d\,i + \frac{d^2}{2V}\, i^2\right) - \frac{2}{V}\left(d\,i + \frac{d^2}{2V}\, i^2\right)^2 \qquad (4.7)$$
Therefore the total running time will be:
$$R = \sum_{i=1}^{V} R_{i+1} = \sum_{i=1}^{V} \left[ 2\left(d\,i + \frac{d^2}{2V}\, i^2\right) - \frac{2}{V}\left(d\,i + \frac{d^2}{2V}\, i^2\right)^2 \right] \qquad (4.8)$$
Using the formulas for sums of powers of integers we obtain:
$$R = V^2 d + V d - d^2\left(\frac{V^2}{3} + \frac{V}{2} + \frac{1}{6}\right) - \frac{d^4}{2V^3}\left(\frac{V^5}{5} + \frac{V^4}{2} + \frac{V^3}{3} - \frac{V}{30}\right) - \frac{2d^3}{V^2}\left(\frac{V^4}{4} + \frac{V^3}{2} + \frac{V^2}{4}\right) \approx V^2 d + V d$$
where the three bracketed terms are treated as approximately zero. Hence:
$$R \approx V^2 d + V d \;\Rightarrow\; R = O(V^2 d) = O\!\left(V^2 \cdot \frac{E}{V^2}\right) = O(E) = O(mV) \qquad (4.9)$$
The running time $R$ is what was required to find the first component. Therefore, the running time for
identifying $n$ connected components, each containing $W_l$ nodes, where $l$ is
the index of the connected component, equals:
$$R_{total} = O(mV) + \sum_{k=2}^{n} O\!\left(m \cdot \Bigl(V - \sum_{l=1}^{k-1} W_l\Bigr)\right) \le n \cdot O(mV)$$
$$R_{total} \le n \cdot O(mV) \;\Rightarrow\; R_{total} = O(mV) \qquad (4.10)$$
Therefore the theorem is proven.
Nevertheless, if the adjacency matrix is available to the algorithm at no computational cost
and without memory problems, the running time of graph-search-based algorithms would be
$O(V + E) = O(E) = O(mV)$, which is the same as that of the proposed algorithm, so the algorithm
offers no advantage under these conditions. The advantage becomes substantial, first, if constructing
the adjacency matrix requires costly computations. In this circumstance the algorithm reduces the
running time from $\Theta(V^2)$ to $O(mV)$. The second condition under which the algorithm becomes
advantageous is when storing the adjacency matrix or list causes memory problems, because the
algorithm does not store the adjacency matrix or list at all. Nonetheless, the algorithm for
genotype networks does need the reaction content of each genotype to be stored. In practice,
storing networks of more than $10^6$ genotypes was not possible. Since half of the genotype networks have
more than $10^6$ genotypes, the algorithm could cover only half of the networks. However, without
studying all the networks, a general understanding of the connectedness of genotype
networks would be impossible. Furthermore, the connectedness of a network cannot be
studied by randomly sampling a fraction of its nodes. Fortunately, a
relationship between the connectivity of networks corresponding to genotypes of consecutive sizes
helped to identify the connected components of networks with a huge number of genotypes. This idea
is discussed in the next section.
4.3. Connectedness in networks with more than one million genotypes
For large genotype networks, memory limitations made it impossible to check connectedness with any
algorithm. Nevertheless, a very useful relationship between the connectedness of
genotype networks of consecutive sizes allowed us to circumvent the need for the aforementioned
algorithms. The idea is derived as follows:
Proposition 1: By adding a reaction to an already viable metabolic network, the metabolic
network remains viable.
Definition 1: An immediate superset of a set x is a superset of x with cardinality |x| + 1 [94].
Definition 2: The immediate superset of a genotype is the set of genotypes whose reaction content is
an immediate superset of the reaction content of that genotype.
Lemma 1:
The immediate superset of a genotype of size n forms a clique in the genotype network
containing genotypes of size n+1.
Proof:
According to Proposition 1, all genotypes belonging to the immediate superset of a viable genotype
containing n reactions are viable as well. Furthermore, according to Definition 1, they are
present in the genotype network of genotypes of size n+1. Since they are all
immediate supersets of the same genotype, they share n reactions and their reaction contents
differ from each other by only one reaction. In other words, all pairs among those genotypes are
connected to each other. This results in a clique in the genotype network containing genotypes of
size n+1.
Lemma 2:
If there is an edge between genotypes x and y in the genotype network comprised of genotypes
of size n, their corresponding immediate supersets form connected cliques with one common
node in the network comprised of genotypes of size n+1.
Proof:
The presence of an edge between genotypes x and y in the genotype network comprised of genotypes
of size n implies that the two genotypes share n-1 reactions and differ from each other in only one
reaction. Let us denote the set of shared reactions by $S = \{s_1, s_2, \dots, s_{n-1}\}$, the reaction
that is present in genotype x but not in y by $r_x$, and the reaction that is present in genotype y but
not in x by $r_y$. Adding $r_x$ to the reaction content of genotype y results in a genotype in the
immediate superset of genotype y whose reaction content is
$\{s_1, s_2, \dots, s_{n-1}\} \cup \{r_x\} \cup \{r_y\}$. On the other hand, adding $r_y$ to the reaction content of genotype x
results in a genotype in the immediate superset of genotype x whose reaction content is
$\{s_1, s_2, \dots, s_{n-1}\} \cup \{r_x\} \cup \{r_y\}$. This is the same as the previous genotype. Therefore, the
immediate supersets of genotypes x and y share a common genotype, and
consequently their corresponding cliques in the genotype network comprised of genotypes of
size n+1 are connected via the common node (Figure 4.1).
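The two lemmas can be checked on the small example of Figure 4.1. The following is a minimal sketch, not the implementation used in this work: the function names are hypothetical, genotypes are represented as Python frozensets of reaction indices, and every genotype is assumed viable so that only the combinatorial structure is tested.

```python
from itertools import combinations

def immediate_superset(genotype, universe):
    """Genotypes obtained by adding exactly one reaction of the universe."""
    return {genotype | {r} for r in universe - genotype}

def are_neighbors(a, b):
    """Equal-sized genotypes are linked if they differ by one reaction swap."""
    return len(a) == len(b) and len(a ^ b) == 2

universe = frozenset(range(1, 8))   # reaction universe {1, ..., 7} of Figure 4.1
x = frozenset({2, 3})
y = frozenset({1, 2})

sup_x = immediate_superset(x, universe)
sup_y = immediate_superset(y, universe)

# Lemma 1: every pair inside one immediate superset is linked (a clique).
is_clique_x = all(are_neighbors(a, b) for a, b in combinations(sup_x, 2))
is_clique_y = all(are_neighbors(a, b) for a, b in combinations(sup_y, 2))

# Lemma 2: the two cliques share the genotype that contains both x and y.
shared = sup_x & sup_y

print(len(sup_x), len(sup_y), is_clique_x, is_clique_y)
print([sorted(g) for g in shared])
```

Each immediate superset has 7 - 2 = 5 members, and the only shared genotype is the union of the two linked genotypes.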
Theorem:
If a genotype network comprised of genotypes of size n is connected, the network constructed from
the immediate supersets of the genotypes in that network is also connected.
Proof:
According to Lemma 1, all genotypes belonging to the immediate superset of a genotype are
connected to each other. Moreover, according to Lemma 2, the immediate supersets of two genotypes that
are connected by an edge are also connected via a common genotype. Therefore, the network formed by the
immediate supersets of the genotype network can be viewed as a network of cliques that
are connected to each other via common genotypes. Since the genotype network is assumed to be
connected, according to Lemma 2 the network of cliques is also
connected. Therefore, the network comprised of the immediate supersets of the genotype
network is connected.
This theorem is very useful for studying the connectedness of genotype networks comprised of more
than one million genotypes. For instance, if the algorithm described in the previous sections
has shown that a genotype network containing genotypes of size n is connected (i.e. forms
a single connected component), whether the genotype network of size n+1, which
contains more than one million genotypes, is connected can be checked as follows:
Step 1: For each genotype in the genotype network of genotypes of size n, the set
of genotypes comprising its immediate superset is generated.
Step 2: The immediate supersets of all genotypes are combined.
Step 3: The number of distinct genotypes is counted (note that a genotype
occurring in more than one immediate superset is counted only once).
Step 4: If the number of distinct genotypes counted this way equals the number of genotypes
in the genotype network of genotypes of size n+1, it is concluded, according to the theorem,
that the genotype network of genotypes of size n+1 is
also connected.
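Steps 1 to 4 can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not the shell implementation used in this work: the function names, the four-reaction universe and the viability rule `toy_viable` (a stand-in for FBA) are all hypothetical, and the size-n network passed in is assumed to have already been shown to be connected.

```python
from itertools import combinations

def immediate_superset(genotype, universe):
    """Genotypes obtained by adding one reaction of the universe to `genotype`."""
    return {genotype | {r} for r in universe - genotype}

def supersets_cover_network(network_n, size_of_network_n_plus_1, universe):
    """Steps 1-4: pool the immediate supersets and compare the distinct count."""
    pooled = set()
    for genotype in network_n:                            # Step 1
        pooled |= immediate_superset(genotype, universe)  # Steps 2-3: a set keeps each genotype once
    return len(pooled) == size_of_network_n_plus_1        # Step 4

# Toy data (hypothetical): 4 reactions and a made-up viability rule that
# stands in for FBA. Adding reactions never destroys viability under this rule.
universe = frozenset({1, 2, 3, 4})

def toy_viable(genotype):
    return 1 in genotype

def viable_network(size, is_viable):
    """All viable genotypes of a given size, by brute-force enumeration."""
    candidates = (frozenset(c) for c in combinations(sorted(universe), size))
    return {g for g in candidates if is_viable(g)}

net2 = viable_network(2, toy_viable)   # assumed already shown to be connected
net3 = viable_network(3, toy_viable)
covered = supersets_cover_network(net2, len(net3), universe)
print(covered)
```

If the two counts differ, the check is inconclusive rather than a proof of disconnectedness: some genotypes of size n+1 are then not an immediate superset of any genotype in the size-n network, so the theorem says nothing about them.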
Figure 4.1: An example of two linked genotypes of size n with their corresponding connected cliques. In this example the reaction universe is {1, 2, 3, 4, 5, 6, 7}. Two genotypes, {2, 3} (blue) and {1, 2} (red), are connected in the genotype network comprised of genotypes of size 2. The immediate supersets of genotypes {2, 3} and {1, 2} each form a clique of size 5, whose nodes are colored blue and red, respectively. The two cliques are connected via the common bicolored genotype.
This approach was implemented efficiently by carefully combining Linux shell utilities.
Fortunately, at smaller sizes the genotype networks were shown to be connected using the previous
algorithm, so the theorem discussed above was applicable. Moreover, for all the very
large genotype networks, the above approach was sufficient to confirm their connectedness.
4.4. Results:
The results show that most genotype networks are connected, so metabolic genotypes can be
converted into each other via mutational swaps. The results in figure 4.2 indicate that
genotype size influences the connectivity of the corresponding genotype network. Regardless
of the carbon source, the genotype networks of genotypes of size larger than 31 form an
all-encompassing connected component, meaning that all genotypes belong to the same
connected component. However, for genotypes of size smaller than 31, some genotype networks
were partitioned into two components, a few networks into three components, and only one network,
belonging to the pyruvate carbon source with genotypes of size 27, was partitioned into four distinct
connected components. One reason for the disconnectedness of some of the networks
of smaller genotype sizes could be the extremely low viability ratio shown in
figure 2.10, which renders the networks very sparse and consequently increases the probability that some
genotypes are disconnected from the rest. However, the connectedness of a genotype
network cannot be predicted from the genotype size alone, and there is no direct relationship
between them. For instance, among the genotype networks for acetate, the network of
genotypes of size 29 is connected, while the network of size 30 is fragmented into
two disjoint components, although it is denser than the network of
size 29. More importantly, even in genotype networks fragmented into more than one
component, the nodes are not partitioned into equally sized components; instead, the vast majority of
nodes belong to a giant component and the remaining minority form very small
components. Figure 4.3 shows this asymmetric fragmentation by plotting the ratio of the size of
the giant component to the total size of the network. This ratio is exactly one for connected networks,
and the figure indicates that it is very close to one even for the majority of the fragmented networks.
There are a few exceptions, all of which belong to networks of very small size.
Moreover, the robustness of the connectedness of genotype networks to node removal was
investigated. As noted previously, if the connectedness of a network
is robust to the removal of any single node, the network is considered biconnected. Among the 90 genotype
networks, 86 were biconnected, so in the majority of the networks there was no individual node
whose removal abolishes the network's connectedness (i.e. no articulation vertex).
Furthermore, each of the remaining 4 non-biconnected networks had only one articulation
vertex. This means that the removal of only one specific node caused the network to
fragment, and the connectedness of the network was robust to the removal of any other
node.
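The quantities reported above (number of connected components and the giant-component ratio) can be illustrated on a toy network. The sketch below is not the memory-saving algorithm of section 4.2: it uses the conventional approach of building the full adjacency list by pairwise comparison followed by breadth-first search, which is only feasible for small networks. The function names and the four toy genotypes are hypothetical.

```python
from collections import deque
from itertools import combinations

def build_adjacency(genotypes):
    """Link equal-sized genotypes that differ by exactly one reaction swap."""
    adjacency = {g: set() for g in genotypes}
    for a, b in combinations(genotypes, 2):
        if len(a ^ b) == 2:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def connected_components(adjacency):
    """Breadth-first search from every node that has not been visited yet."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

def giant_component_ratio(adjacency):
    """Size of the largest component divided by the number of nodes."""
    components = connected_components(adjacency)
    return max(len(c) for c in components) / len(adjacency)

# Toy genotype network of size-2 genotypes (hypothetical data).
genotypes = [frozenset(s) for s in ({1, 2}, {1, 3}, {2, 3}, {5, 6})]
adjacency = build_adjacency(genotypes)
components = connected_components(adjacency)
ratio = giant_component_ratio(adjacency)
print(len(components), ratio)
```

In this toy case three genotypes form the giant component and one genotype is isolated, which mirrors the asymmetric fragmentation described in the text.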
Figure 4.2: Number of connected components for genotype networks of different sizes corresponding to 10 distinct carbon sources (glucose, fructose, alpha-ketoglutarate, malate, fumarate, pyruvate, glutamate, succinate, lactate, acetate). x-axis: genotype size; y-axis: number of connected components.
Figure 4.3: Ratio of the size of the giant component to the size of the total network for genotype networks of different sizes corresponding to 10 distinct carbon sources. x-axis: genotype size; y-axis: ratio of the size of the giant component to the size of the network.
Chapter 5:
Mutational robustness of metabolic networks
In this chapter the robustness of metabolic networks to loss-of-function mutations that
eliminate a reaction is investigated. I ask how mutationally robust metabolic
networks are and how robustness varies among different metabolic networks. In particular, the
dependence of mutational robustness on the number
of reactions present in the network (i.e. genotype size) is studied. Moreover, the average
robustness of the metabolic networks in a genotype network is measured and its correlation
with the number of genotypes in the genotype network is discussed. As mentioned before, the degree
and betweenness of a node indicate the importance and centrality of the node in a network. Since
metabolic networks with more connections and higher betweenness are more central in the
genotype network, they might also be more robust to reaction elimination. Therefore, the
correlations between the robustness of a metabolic network and its degree, and between robustness
and betweenness centrality, are studied. Finally, the growth rate (i.e. rate of biomass formation)
of each metabolic network in a genotype network on the corresponding carbon source
is measured and taken as the fitness of the metabolic network (genotype). How
fitness varies among genotypes and how it relates to genotype size is examined, and
eventually the question of whether fitter genotypes are also more robust is
addressed.
5.1. Measuring the mutational robustness of metabolic networks:
Let us define a metabolic network as a set of n reactions:
$$M = \{r_1, r_2, \dots, r_n\}$$
Then the set of all metabolic networks obtained by eliminating one reaction from the reaction content
of M is:
$$S = \{m_1, m_2, \dots, m_n\}, \qquad m_i = M \setminus \{r_i\}$$
The ability of each metabolic network in the set S to sustain life
on the corresponding carbon source is then checked using the FBA approach. Let us denote the subset of
viable metabolic networks in S by $S'$. Then the robustness of metabolic network
M is defined as:
$$R_M = \frac{|S'|}{|S|} \qquad (5.1)$$
When elimination of a reaction does not abolish the viability of a metabolic network, that
reaction is considered non-essential. The cardinality of the set $S'$ equals the number of non-essential reactions, so according to equation (5.1) the robustness of a metabolic network is
the fraction of non-essential reactions in that metabolic network.
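The definition in equation (5.1) translates directly into code. The following is a minimal sketch in which the viability test is passed in as a function: in this work viability is decided by FBA, whereas the `toy_viable` rule below (a network is viable if it contains two hypothetical essential reactions) is only a placeholder so that the example runs on its own.

```python
def robustness(network, is_viable):
    """Fraction of single-reaction knockouts of `network` that remain viable."""
    knockouts = [network - {r} for r in network]        # the set S
    viable = [m for m in knockouts if is_viable(m)]     # the subset S'
    return len(viable) / len(knockouts)

# Placeholder viability rule (hypothetical, stands in for an FBA call):
# a network is viable if it still contains both essential reactions.
ESSENTIAL = frozenset({"r1", "r2"})

def toy_viable(network):
    return ESSENTIAL <= network

M = frozenset({"r1", "r2", "r3", "r4", "r5"})
rob = robustness(M, toy_viable)
print(rob)
```

Here three of the five reactions are non-essential, so the robustness is 3/5. A network consisting only of the two essential reactions would have robustness zero, i.e. it would be a minimal network.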
5.2. The impact of genotype size on mutational robustness
How the number of reactions in a metabolic network (i.e. genotype size) influences its
robustness to reaction elimination is the main question addressed in this section.
Intuitively, a metabolic network that includes all reactions of the
reaction universe should be the most robust network: as a network loses more reactions, a further
reaction elimination becomes more likely to affect the phenotype, so the
robustness of metabolic networks with fewer reactions decreases. Hence, the more
reactions a metabolic network loses, the more fragile it becomes. Nonetheless,
different reactions are not equally important for the final phenotype, so the removal of different
reactions may affect the robustness of the networks to varying degrees. Therefore,
variation in mutational robustness is expected among metabolic networks with
the same number of reactions. Figure 5.1 supports these predictions. The robustness of the
metabolic network with the maximum number of reactions (i.e. 51) is the largest, and as
genotype size decreases the average robustness of the corresponding genotype networks goes down.
Moreover, variation in robustness among genotypes belonging to the same genotype network
can be seen. Importantly, the variation in robustness increases with genotype size, which
might be a consequence of the fact that the differences in super-essentiality among
reactions shrink as genotype size decreases (see chapter 6). Although the average
robustness of the genotype networks goes up with increasing genotype size, it is not true that every
genotype of larger size is more robust than every genotype of smaller size. The
reason is that, owing to the variation in robustness among genotypes of the same size, some genotypes of
smaller size may be more robust than some genotypes of larger size. For example, for almost
every size n, the maximum robustness in the genotype network of size n is larger than the
minimum robustness in the genotype network of size n+1. At lower sizes the slope of the maximum
robustness is larger than at higher sizes, implying that even at smaller genotype sizes there
are genotypes whose robustness is very close to that of the metabolic network comprising the whole
reaction universe. On the other hand, the minimum robustness drops sharply at large sizes
(from 51 to 48); the slope of the decline then remains constant until the minimum robustness
reaches zero at genotype size 30. A metabolic network whose robustness equals zero
has no non-essential reaction, so it is called a minimal metabolic network, since all of its
reactions are required to sustain life [95]. According to figure 5.1, genotype networks of genotype size
above 30 contain no minimal network. Therefore, for n > 30, according to the parent-child relationship described before, each genotype of size n has at least one viable child of
size n-1. In other words, in genotype networks of genotypes of size larger than 30
that are viable on glucose, not only does each child of size n have at least one parent of size n+1,
but each parent of size n+1 also has at least one child of size n.
Figure 5.2 shows that the same trend holds for the genotype networks corresponding to the other
carbon sources. The curves of average robustness versus genotype size for the
10 carbon sources are almost parallel. However, the maximum robustness, which is the
robustness of the metabolic network comprising the whole reaction universe, differs among
carbon sources. The reason is that the number of super-essential
reactions differs among carbon sources (see chapter 6).
Furthermore, the relationship between the average robustness of genotype networks and the
number of genotypes in the network was investigated. In other words, it was of
particular interest to know whether genotype networks with more nodes contain more robust
genotypes. As figure 5.3 shows, there is no correlation between the size of the network and
the average robustness. In particular, figure 5.3 shows that, for approximately every network size,
two networks of the same size can have very different average robustness.
Therefore, the robustness of metabolic networks does not depend on the number of genotypes
in the genotype network; rather, it is a property influenced more by the nature and
number of reactions in the metabolic network.
Figure 5.1: Dependence of robustness on genotype size. The maximum (upper black), minimum (lower black), average (red) and standard deviation (blue) of the robustness of the genotype networks corresponding to glucose are shown as a function of genotype size. x-axis: genotype size; y-axis: robustness.
Figure 5.2: Dependence of average robustness on genotype size. The influence of genotype size on the average robustness of the genotype networks corresponding to each of the 10 carbon sources (glucose, fructose, alpha-ketoglutarate, malate, fumarate, pyruvate, glutamate, succinate, lactate, acetate) is shown. x-axis: genotype size; y-axis: average robustness.
Figure 5.3: Dependence of average robustness on network size. The influence of the logarithm of network size on the average robustness of the genotype networks corresponding to each of the 10 carbon sources is shown. x-axis: log (size of genotype network); y-axis: average robustness.
5.3. More central genotypes in the genotype network are more robust
In network theory there are various measures of the centrality of
nodes in a graph, which reflect the relative importance of a vertex within the network [30].
Four measures of centrality are widely used: degree centrality, betweenness, closeness and
eigenvector centrality [96]. I investigated how the centrality of a metabolic network, as a node in a
genotype network, relates to the robustness of the metabolic network. The focus was on two
measures of centrality: degree centrality and betweenness centrality. The correlations between
robustness and degree, and between robustness and betweenness, were investigated for every
genotype network corresponding to different genotype sizes and different carbon sources. The
correlation between degree and robustness shows Spearman's $\rho \approx 0.9$ and
$P < 10^{-300}$, which implies that degree centrality and robustness are strongly correlated, so
metabolic networks with more connections in the genotype network are more robust to reaction
elimination. Moreover, the correlation between robustness and betweenness for genotypes of
smaller sizes (between 25 and 35) shows Spearman's $\rho \approx 0.4$ and $P < 10^{-100}$. The
same correlation for genotypes of larger sizes (between 45 and 51) shows Spearman's $\rho \approx 0.9$
and $P < 10^{-300}$. As examples, the correlations between robustness and degree for the genotype
networks of genotypes of size 21 and 41 viable on glucose are depicted in figure
5.4 (upper left and lower left, respectively). Moreover, the associations between robustness and
betweenness for the same genotype networks are depicted in figure 5.4 (upper and lower right).
Thus, the results support the conclusion that more central genotypes in the genotype network are also
more robust.
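For reference, Spearman's rank correlation between a centrality measure and robustness can be computed as the Pearson correlation of the ranks. The sketch below is a plain-Python illustration, not the analysis code used in this work; the function names are hypothetical and the degree and robustness values are invented solely to make the example runnable.

```python
def average_ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1          # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented illustration data: degree and robustness of five genotypes.
degree = [80, 95, 110, 120, 150]
robustness = [0.41, 0.48, 0.47, 0.60, 0.72]
rho = spearman_rho(degree, robustness)
print(round(rho, 3))
```

Because the statistic depends only on ranks, it captures any monotonic association between centrality and robustness, not just a linear one.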
Figure 5.4: Correlation between robustness and centrality. The correlations between robustness and degree and between robustness and betweenness are shown for the genotype network corresponding to glucose containing genotypes of size 21 (upper left and right, respectively) and of size 41 (lower left and right, respectively). x-axes: degree (left panels) and betweenness (right panels); y-axis: robustness.
5.4. The influence of genotype size on fitness
As noted previously, in this study the growth rate associated with a metabolic network was
taken as the fitness of the genotype, and the growth rate was approximated by the rate of the
biomass reaction. Here, the aim is to investigate the influence of genotype size on fitness. Since
eliminating reactions from a metabolic network cannot increase fitness but may decrease
the growth rate, the average fitness of a genotype network is expected to go up with increasing
genotype size. Figure 5.5, which shows the relationship between
fitness and genotype size for the genotype networks corresponding to glucose, supports this expectation.
Moreover, since different reactions influence fitness to varying degrees, variation in fitness is
expected among genotypes of the same size. Figure 5.5 indicates that this
variation also goes up with increasing genotype size: it reaches its
maximum at size 42 and then declines as genotype size increases further. Because of the variation in
fitness among genotypes of the same size, it cannot be claimed that every metabolic
network with more reactions is fitter than every metabolic network with
fewer reactions. For example, there are metabolic networks with only 30 reactions
that grow faster than some genotypes of size 45. Surprisingly, the maximum fitness curve
indicates that even in genotype networks of smaller genotype sizes, which are
obtained by eliminating several reactions from the metabolic network including all 51
reactions, there exist genotypes with exactly the same fitness as the metabolic network including
all 51 reactions. This implies that there are some reactions whose removal has no
influence at all on the growth rate of the metabolic network. For example, even at size 32 there are
genotypes with the same fitness as the metabolic network comprising all 51 reactions.
Therefore, there are 19 reactions whose simultaneous elimination did not influence fitness.
One possible explanation is that the rate of biomass formation is entirely independent of the
rates of these 19 reactions. A more in-depth analysis of this observation will be provided later.
Furthermore, average fitness appears to depend exponentially on genotype size. Figure 5.6
supports this claim by showing that the logarithm of average fitness grows almost linearly with
genotype size. Figure 5.7 (a, b) shows fitness versus genotype size for the other
carbon sources. For example, the relationship between average fitness and genotype size for the
genotype networks of carbon sources such as fructose, α-ketoglutarate, pyruvate and glutamate is
similar to that for glucose, in that average fitness grows exponentially with genotype size.
However, the curves are not exactly the same, because the fitness of the metabolic network
including all 51 reactions differs among carbon sources. Moreover, the average fitness at
each genotype size was normalized by dividing it by the fitness of the genotype
including all 51 reactions grown on the corresponding carbon source. The results are shown in
figure 5.7 (c, d); they show that not only does the fitness of the genotype including all 51
reactions differ among carbon sources, but the slopes of the curves also differ.
For example, average fitness grows faster with genotype size in the genotype
networks corresponding to pyruvate and α-ketoglutarate than in those corresponding to fructose and glucose. On the other
hand, the curve of average fitness versus genotype size deviates from an exponential curve for the
genotype networks corresponding to malate, fumarate, succinate and lactate. In particular, for
succinate, at small sizes the average fitness declines with increasing genotype size. The analysis
required to explain this counter-intuitive result will be done later. Furthermore, it
was hypothesized that the number of carbon atoms in the carbon source might be the reason why
the genotype including all 51 reactions has different fitness on different carbon sources, so the
fitness was normalized by the number of carbon atoms in the carbon source molecule.
According to figure 5.7 (e, f), this normalization brings the fitness values on different carbon sources
closer to each other, but differences among carbon sources
remain. This implies that the number of carbon atoms is not sufficient to explain the differences in
fitness; instead, the growth rate also depends on the nature of the carbon source and how it
influences the flux distribution in the metabolic network.
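The exponential dependence of average fitness on genotype size can be tested by fitting a straight line to the logarithm of average fitness, as in figure 5.6. The following is a minimal sketch of such a log-linear least-squares fit. It is not the fitting procedure used in this work: the function name is hypothetical, the natural logarithm is an assumption (the base used in the figures is not stated), and the data are synthetic values generated from a known exponential so that the result can be checked.

```python
import math

def fit_exponential(sizes, mean_fitness):
    """Least-squares line through (size, ln fitness).

    Returns (a, b) such that fitness is approximated by a * exp(b * size).
    """
    logs = [math.log(f) for f in mean_fitness]
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope

# Synthetic data (not thesis results): an exact exponential in genotype size.
sizes = list(range(25, 52))
synthetic_fitness = [0.05 * math.exp(0.05 * s) for s in sizes]
a, b = fit_exponential(sizes, synthetic_fitness)
print(round(a, 6), round(b, 6))
```

On real data, systematic curvature in the residuals of this fit would indicate a deviation from exponential growth, such as the one described above for malate, fumarate, succinate and lactate.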
Figure 5.5: Dependence of fitness (growth rate) on genotype size. The maximum (upper black), minimum (lower black), average (red) and standard deviation (blue) of the growth rate of the genotype networks corresponding to glucose are shown as a function of genotype size. x-axis: genotype size; y-axis: fitness (growth rate).
Figure 5.6: Exponential dependence of average growth rate (fitness) on genotype size. The logarithm of the average growth rate of the genotype networks corresponding to glucose grows linearly with genotype size, indicating that average fitness depends exponentially on genotype size. Real data and a linear fit are shown. x-axis: genotype size; y-axis: log (average fitness).
Figure 5.7: The influence of genotype size on average fitness (growth rate). How genotype size affects the average growth rate of the genotype networks related to each of the 10 carbon sources is shown in (a, b). The same dependence after normalization of the growth rates with respect to that of the genotype including all 51 reactions is shown in (c, d). Finally, the average growth rates are normalized by the number of carbon atoms in the associated carbon source, and the influence of genotype size on the normalized average growth rate is depicted in (e, f). x-axis: genotype size; y-axes: average fitness (a), log (average fitness) (b), average normalized fitness (c), log (average normalized fitness) (d), average fitness per carbon atom (e), log (average fitness per carbon atom) (f).
5.5. On the association between robustness and fitness
It was also of particular interest to ask whether genotypes with higher robustness to reaction elimination are
fitter than less robust genotypes. Such an association between robustness and fitness might provide a more direct
explanation for the emergence of robustness in metabolic networks: if organisms with more robust metabolic
networks grow faster, selection will eventually favor them. To address this question, the correlation between
robustness and growth rate was measured for metabolic networks of different sizes belonging to the genotype
networks of the different carbon sources. In general, the correlations are statistically significant for the vast
majority of genotype networks (Appendix, table A.6). Spearman’s ρ was between 0.2 and 0.4 and P < 10^-300 for
almost all genotype networks. Three examples covering three ranges of genotype size (small, medium and large) are
depicted in figure 5.8. From left to right, the figure shows the correlation for the genotype networks of genotypes
of size 30 viable on succinate, genotypes of size 35 viable on malate and genotypes of size 45 viable on pyruvate.
The figure shows that genotypes with higher mutational robustness have, on average, a higher growth rate.
Spearman’s ρ is 0.42, 0.49 and 0.29, respectively, and the p-values are almost zero. Although the significant
correlation holds for the majority of genotype networks across genotype sizes and carbon sources, there is a
minority of genotype networks for which Spearman’s ρ is negative and the correlation is not statistically
significant. These are the genotype networks of size 27 to 32 viable on pyruvate and those of size 31 to 35 viable
on acetate. Nevertheless, even in these genotype networks the claim that more robust genotypes grow faster remains
valid if robustness is defined in a coarser-grained way. For example, if genotypes are partitioned into a “more
robust” and a “less robust” group, as the rectangles do in figure 5.9, genotypes in the more robust group still
grow faster on average than genotypes in the less robust group.
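The two summaries used above, a rank correlation and a coarse two-group comparison, can be sketched as follows. This is a minimal illustration in plain Python, not the analysis code of the thesis. The data values, the threshold of 0.5 and all function names are made up for the example. The data are chosen so that Spearman’s ρ is negative while the “more robust” group still has the higher mean growth rate, which is the situation described for figure 5.9 (b).

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def coarse_grained_means(robustness, growth, threshold):
    """Mean growth rate of the 'less robust' and the 'more robust' group."""
    low = [g for r, g in zip(robustness, growth) if r < threshold]
    high = [g for r, g in zip(robustness, growth) if r >= threshold]
    return mean(low), mean(high)

# Made-up illustrative data (hypothetical, not taken from the thesis).
robustness = [0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90]
growth = [0.30, 0.29, 0.28, 0.27, 0.60, 0.26, 0.25, 0.24]

rho = spearman_rho(robustness, growth)
low_mean, high_mean = coarse_grained_means(robustness, growth, threshold=0.5)
print(f"rho = {rho:.3f}, mean growth: low = {low_mean:.3f}, high = {high_mean:.3f}")
```

In this toy data set a single fast-growing genotype in the high-robustness group is enough to raise the group mean, so the two-group comparison is more sensitive to outliers than the rank correlation.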
[Figure 5.8: three scatter plots, left to right; x-axis: Robustness (0 to 1); y-axis: Growth rate.]
Figure 5.8: Correlation between robustness and growth rate. From left to right: the correlation for the genotype
networks of genotypes of size 30 viable on succinate, genotypes of size 35 viable on malate and genotypes of size
45 viable on pyruvate. The figure shows that genotypes with higher mutational robustness have, on average, a
higher growth rate. Spearman’s ρ is 0.42, 0.49 and 0.29, respectively, and P < 10^-300.
[Figure 5.9, panels a and b: scatter plots; x-axis: Robustness (0 to 1); y-axis: Growth rate (0 to 0.5).]
Figure 5.9: Correlation between robustness and growth rate. Panel (a) shows the correlation for the genotype
network of genotypes of size 30 viable on succinate, and panel (b) for the genotype network of genotypes of size
29 viable on pyruvate. Spearman’s ρ is 0.42 for panel (a) and -0.11 for panel (b). However, if each plot is split
into two regions using the rectangles shown in the figure, both plots look similar: in both, the average growth
rate in the left rectangle, which corresponds to lower robustness, is smaller than the average growth rate in the
right rectangle, which corresponds to higher robustness. Therefore, under a coarser-grained definition of
robustness, it can still be claimed that even in panel (b), which has a negative Spearman’s ρ, genotypes with
“high robustness” grow faster on average than genotypes with “low robustness”.
Chapter 6: Super-essentiality of metabolic reactions
In this chapter, the relative importance of the reactions of a metabolic network for sustaining life on a given
carbon source is discussed. To this end, an index termed super-essentiality is assigned to each reaction of a
metabolic network. It was of particular interest to know how the super-essentiality index varies with the number
of reactions present in a metabolic network (i.e. genotype size). This shows how the importance of the role that a
specific reaction plays in a metabolic network is influenced by the elimination or addition of other reactions.
It is also investigated how the super-essentiality of each reaction varies with the carbon source. Finally, I
categorized reactions based on the similarity of their super-essentiality profiles, and the same approach was
used to cluster carbon sources.
6.1. Super-essentiality index:
A reaction is essential for the viability of a metabolic network if its elimination renders the metabolic network
unable to sustain life. The super-essentiality index of a reaction is defined as the fraction of metabolic
networks in the genotype network for which the reaction is essential. If a reaction is essential for all metabolic
networks in the genotype network, its super-essentiality index is 1. If a reaction is not essential for any
metabolic network in the genotype network, its super-essentiality index is 0. Reactions with a super-essentiality
index of 1 and 0 are called “absolutely essential” and “absolutely non-essential” reactions, respectively. All
other reactions lie between these two extremes and are assigned a real number between 0 and 1.
The brute-force approach to determining the super-essentiality index of a reaction would be to eliminate the
reaction from every metabolic network in the genotype network that contains it, check the viability of each
resulting metabolic network using FBA, and finally take the fraction of metabolic networks in the genotype network
that lose viability. Fortunately, the process was sped up by an alternative approach that circumvents the need for
FBA. The point is that the super-essentiality of a reaction in a genotype network comprised of genotypes of size i
is determined simply from the frequency of the presence and absence of the reaction in the genotype networks of
size i and i-1 as follows:

$$ S^{i}_{C_m}(R_j) = \frac{f^{i}_{C_m}(R_j) - \bar{f}^{\,i-1}_{C_m}(R_j)}{N(i, C_m)}, \qquad i, j \in \{1, \ldots, 51\},\; m \in \{1, \ldots, 10\} \qquad (6.1) $$

Where $S^{i}_{C_m}(R_j)$ is the super-essentiality index of the reaction $R_j$ in the genotype network comprised
of genotypes of size i which are viable on carbon source $C_m$. $f^{i}_{C_m}(R_j)$ is the number of metabolic
networks in the genotype network comprised of genotypes of size i which include reaction $R_j$.
$\bar{f}^{\,i-1}_{C_m}(R_j)$ is the number of metabolic networks in the genotype network comprised of genotypes
of size i-1 which do not include reaction $R_j$. Finally, $N(i, C_m)$ is the total number of genotypes in the
genotype network formed by genotypes of size i which are viable on carbon source $C_m$. The numerator gives the
number of metabolic networks for which the reaction is essential: every viable genotype of size i from which
$R_j$ can be removed without loss of viability corresponds to exactly one viable genotype of size i-1 that lacks
$R_j$, so subtracting the second count from the first leaves the genotypes for which $R_j$ is essential. This
number is normalized by the total number of genotypes in the network to obtain a value between zero and one.
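The counting shortcut of equation 6.1 can be checked against the brute-force definition on a small example. The sketch below is only an illustration: the six-reaction universe and the rule in `is_viable` are hypothetical stand-ins for the 51 reactions and for FBA. The only property it relies on is that adding a reaction never makes a viable genotype unviable.

```python
from itertools import combinations

REACTIONS = range(6)  # hypothetical toy reaction universe

def is_viable(genotype):
    """Toy stand-in for FBA; monotone, so supersets of viable genotypes stay viable."""
    needs_core = 0 in genotype
    needs_branch = 1 in genotype or 2 in genotype
    needs_route = {3, 4} <= genotype or 5 in genotype
    return needs_core and needs_branch and needs_route

def viable_genotypes(size):
    """All viable genotypes with exactly `size` reactions (one genotype network)."""
    candidates = (frozenset(c) for c in combinations(REACTIONS, size))
    return [g for g in candidates if is_viable(g)]

def superessentiality_brute_force(reaction, size):
    """Fraction of genotypes of this size that lose viability without `reaction`."""
    network = viable_genotypes(size)
    essential = sum(1 for g in network
                    if reaction in g and not is_viable(g - {reaction}))
    return essential / len(network)

def superessentiality_by_counting(reaction, size):
    """Equation 6.1: presence count at size i minus absence count at size i-1."""
    network = viable_genotypes(size)
    f_with = sum(1 for g in network if reaction in g)
    f_without_smaller = sum(1 for g in viable_genotypes(size - 1)
                            if reaction not in g)
    return (f_with - f_without_smaller) / len(network)

for size in range(3, 7):  # sizes with at least one viable genotype in this toy model
    for reaction in REACTIONS:
        brute = superessentiality_brute_force(reaction, size)
        counted = superessentiality_by_counting(reaction, size)
        assert abs(brute - counted) < 1e-12
```

The counting version never calls the viability check once the genotype networks of sizes i and i-1 are known, which is the source of the speed-up described above.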
6.2. Results:
I observed that the influence of genotype size on the super-essentiality index is not the same for all reactions
in the reaction universe. Instead, the reactions can be categorized into four types according to how genotype
size influences their super-essentiality index (Figure 6.1):
The first type: The super-essentiality index of the reaction remains one regardless of genotype size. In other
words, reactions of this type remain absolutely essential in all genotype networks, whatever the size of the
genotypes.
The second type: The super-essentiality index of the reaction decreases with increasing genotype size. These
reactions are therefore not essential for most metabolic networks with many reactions, and they gradually become
more essential in metabolic networks comprised of fewer reactions. One possible explanation for this behavior is
based on the viability ratio described in chapter 2 (figure 2.10): in genotype networks with a higher viability
ratio these reactions are less essential, while in genotype networks with a lower viability ratio they become
more essential.
The third type: The super-essentiality index of the reaction remains small at every genotype size, but it still
varies with genotype size. With increasing genotype size, super-essentiality increases until it reaches its
maximum and then diminishes towards zero. Hence, the reaction is more essential in genotypes of intermediate size
than in genotypes of extreme sizes. According to chapter 2 (figure 2.9), the genotype networks comprised of
genotypes of intermediate size contain more genotypes than those comprised of genotypes of extreme sizes.
Therefore, reactions of the third type have a higher super-essentiality index in more populated genotype networks.
The fourth type: The super-essentiality index of the reaction remains zero regardless of genotype size. In other
words, reactions of this type remain absolutely non-essential in all genotype networks, whatever the size of the
genotypes.
Then, the distribution of these four reaction types was studied for the genotype networks corresponding to the 10
different carbon sources (Figure 6.2).
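The four types described above can be told apart from a reaction's super-essentiality values ordered by genotype size. The following sketch shows one way to do this. The decision rules, the tolerance and the function name are assumptions made for illustration, since the text does not state how the classification was carried out. In particular, any profile that is neither constant nor non-increasing is simply labelled as the third type here.

```python
def classify_profile(profile, tol=1e-9):
    """Classify a super-essentiality profile ordered by increasing genotype size.

    Returns 1, 2, 3 or 4 following the four types in the text.
    The rules below are illustrative assumptions, not the thesis procedure.
    """
    if all(abs(s - 1.0) <= tol for s in profile):
        return 1  # always absolutely essential
    if all(abs(s) <= tol for s in profile):
        return 4  # always absolutely non-essential
    non_increasing = all(b <= a + tol for a, b in zip(profile, profile[1:]))
    if non_increasing:
        return 2  # essentiality falls as genotype size grows
    return 3      # rises to a maximum and falls again (catch-all in this sketch)

# Hypothetical example profiles, ordered from small to large genotype size.
examples = {
    "always essential": [1.0, 1.0, 1.0, 1.0],
    "decreasing": [1.0, 0.8, 0.3, 0.0],
    "small hump": [0.0, 0.1, 0.2, 0.1, 0.0],
    "never essential": [0.0, 0.0, 0.0, 0.0],
}
types = {name: classify_profile(p) for name, p in examples.items()}
```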
[Figure 6.1: line plot with one example curve for each of Type 1, Type 2, Type 3 and Type 4; x-axis: genotype size
(30 to 55); y-axis: Super-essentiality index (0 to 1).]
Figure 6.1: Dependence of super-essentiality index on genotype size. Based on how the super-essentiality index
of a reaction is influenced by genotype size, the reactions are categorized into four classes as illustrated.
[Figure 6.2: bar chart; x-axis: carbon source (Acetate, Alpha-ketoglutarate, Fructose, Fumarate, Glucose,
Glutamate, Lactate, Malate, Pyruvate, Succinate); y-axis: Frequency (0 to 25); bars: Type 1, Type 2, Type 3,
Type 4.]
Figure 6.2: Distribution of reaction types. The frequency of each type of reaction in genotype networks
corresponding to each carbon source is shown.
For almost all carbon sources, the fourth type of reaction was the least frequent. However, the most frequent
reaction type differed between carbon sources. For example, in the genotype networks corresponding to acetate, the
majority of the reactions belong to the first type, and its frequency is considerably higher than that of the
other types. For lactate and succinate, reactions of the first type are also dominant, but their frequency is not
very different from that of the other types. On the other hand, in the genotype networks belonging to pyruvate the
second type is the most frequent, and for glucose and fructose the third type is. For the remaining four carbon
sources, the first three types of reactions have more or less the same frequency. It is important to note that a
reaction which is classified into a certain type for the genotype networks of one carbon source is not necessarily
classified into the same type for the genotype networks of another carbon source. Among the 51 reactions, only 15
were classified into a single reaction type regardless of the carbon source; 19, 14 and 3 reactions were assigned
to 2, 3 and 4 different types, respectively, depending on the carbon source. The 3 reactions that fall into each
of the four types for some carbon source all belong to pyruvate metabolism. Among the 15 reactions that are always
classified into a single type, 6 were classified into the first type, meaning that they are absolutely essential
on every carbon source; such reactions are called “environment-general essential reactions” [73]. Five of these
six reactions belong to the glycolysis and gluconeogenesis pathway. Furthermore, I defined the reactions which
belong to the first type for at least 5 carbon sources as frequently “absolutely essential” reactions, and the
reactions which belong to the fourth type for at least 5 carbon sources as frequently “absolutely non-essential”
reactions. Among the 51 reactions, 16 are frequently absolutely essential and 7 are frequently absolutely
non-essential. Surprisingly, a reaction that is absolutely essential on more than five carbon sources can be
absolutely non-essential on the remaining ones. For example, the reaction catalyzed by aconitase in the citric
acid cycle is absolutely essential on 8 of the 10 carbon sources and absolutely non-essential on the other two
(i.e. glutamate and α-ketoglutarate). As shown in figure 6.3, the majority of the frequently “absolutely
essential” reactions belong to the glycolysis pathway and the citric acid cycle, while most of the frequently
“absolutely non-essential” reactions belong to the oxidative phosphorylation pathway and pyruvate metabolism.
Furthermore, to measure the influence of genotype size on the essentiality of reactions, the average
super-essentiality index for a genotype network comprised of metabolic networks of size i viable on carbon source
Cm was defined as follows:

$$ \mu(i, C_m) = \frac{\sum_{j=1}^{|U|} S^{i}_{C_m}(R_j)}{|U|} \qquad (6.2) $$

Where the numerator is the sum of the super-essentiality indices of all the reactions belonging to the reaction
universe U, and the denominator is the number of reactions in the reaction universe, which equals 51 in this
study.
[Figure 6.3: bar chart; x-axis: pathway (Glycolysis/Gluconeogenic, Citric acid cycle, Anaplerotic pathway,
Pyruvate metabolism, Glutamate metabolism, Pentose phosphate pathway, Oxidative phosphorylation); y-axis: Number
of reactions (0 to 10); bars: Frequently "absolutely essential" and Frequently "absolutely non-essential".]
Figure 6.3: The occurrence of frequently absolutely essential and frequently absolutely non-essential
reactions in different pathways of central carbon metabolism.
[Figure 6.4: line plot; x-axis: genotype size (20 to 55); y-axis: Average Super-essentiality (0.2 to 0.7); one
curve per carbon source: Acetate, Alpha-ketoglutarate, Fructose, Fumarate, Glucose, Glutamate, Lactate, Malate,
Pyruvate, Succinate.]
Figure 6.4: Dependence of average super-essentiality on genotype size. How average super-essentiality is
influenced by genotype size for genotype networks corresponding to each carbon source is depicted.
As shown in figure 6.4, the genotype that includes all 51 reactions has the lowest average super-essentiality
index. As genotype size decreases through reaction elimination, the average super-essentiality index increases in
proportion to the number of eliminated reactions. However, below a certain size (around 40), the average
super-essentiality almost stabilizes and no longer changes with decreasing genotype size. This trend can be
explained by the four classes of reactions described above. The super-essentiality of the first and fourth types
stays the same regardless of size, so these types cannot contribute to the change of the average
super-essentiality index with genotype size. Therefore, only the second and third types of reactions drive the
change. At larger sizes, reducing genotype size increases the super-essentiality index of both types, so the
average super-essentiality also increases. However, below a certain size, reducing the genotype size lowers the
super-essentiality index of reactions of the third type towards zero, while that of reactions of the second type
rises towards one. As a result, the sum of the super-essentiality indices of the second and third types
stabilizes, and the average super-essentiality levels off. This trend is more or less the same for the genotype
networks of the different carbon sources. Nevertheless, the average super-essentiality for the genotype networks
corresponding to acetate is markedly higher than for the other carbon sources. This can explain why the number of
genotypes viable on acetate is much smaller than for the other carbon sources: higher super-essentiality
translates into lower viability of the genotypes obtained by eliminating reactions from the genotype that includes
all 51 reactions.
As defined before, the absolutely essential and absolutely non-essential reactions of a genotype network are the
reactions whose super-essentiality index equals 1 and 0, respectively. Reactions of the first type are absolutely
essential in every genotype network regardless of genotype size, and reactions of the fourth type are absolutely
non-essential in every genotype network. Whereas reactions of the third type can never become absolutely
essential, reactions of the second type can be absolutely essential at smaller sizes. Therefore, at larger sizes
the number of absolutely essential reactions equals the number of reactions of the first type, while at smaller
sizes the number of reactions of the second type whose super-essentiality index has reached 1 gradually increases,
and consequently the number of absolutely essential reactions also goes up (figure 6.5). On the other hand, at
size 51 the super-essentiality index of the reactions of the second and third types is also zero, in addition to
that of the reactions of the fourth type. Therefore, the number of absolutely non-essential reactions is very high
at the largest size. The number of absolutely non-essential reactions drops quickly at the next largest size,
because the super-essentiality index of the majority of the reactions of the second and third types departs from
zero (figure 6.6). Nevertheless, the super-essentiality index of the reactions of the third type drops again at
smaller sizes until new reactions with a super-essentiality index of zero emerge. This in turn increases the
number of absolutely non-essential reactions at smaller sizes (figure 6.6).
[Figure 6.5: plot over Genotype size (25 to 50) and Carbon source (Glucose, Fructose, Alpha-ketoglutarate,
Pyruvate, Glutamate, Malate, Fumarate, Succinate, Lactate, Acetate); vertical axis: Number of absolutely essential
reactions (0 to 30).]
Figure 6.5: Number of absolutely essential reactions based on genotype size and carbon source.
[Figure 6.6: plot over Genotype size (25 to 50) and Carbon source (Alpha-ketoglutarate, Pyruvate, Succinate,
Acetate, Glutamate, Fumarate, Lactate, Malate, Fructose, Glucose); vertical axis: Number of absolutely
non-essential reactions (0 to 40).]
Figure 6.6: Number of absolutely non-essential reactions based on genotype size and carbon source.
Furthermore, average super-essentiality of the reaction Ri in genotype networks corresponding to
the carbon source Cm was defined as follows:
$$ S(R_i, C_m) = \frac{\sum_{j=1}^{|K_m|} S^{j}_{C_m}(R_i)}{|K_m|} \qquad (6.3) $$
Where the numerator is the sum of the super-essentiality indices of the reaction Ri over the genotype networks
comprised of genotypes of the different sizes j viable on the carbon source Cm, and the denominator |Km| is the
number of genotype networks corresponding to the carbon source Cm. Each reaction is thus assigned a vector of
length 10, each entry of which is the average super-essentiality index of the reaction on one of the 10 carbon
sources. This vector is used as the super-essentiality profile of the reaction, on the basis of which the
reactions were classified. The agglomerative hierarchical clustering approach implemented in MATLAB was employed
to classify the reactions. As depicted in figure 6.7, the reactions fall into 4 distinct clusters. Moreover, the
Spearman’s ρ of the super-essentiality profiles of each pair of reactions is illustrated in figure 6.8. The
clustering approach placed all the reactions involved in pyruvate metabolism in the same cluster. However, the
results show that reactions belonging to the same functional pathway are not necessarily classified into the same
cluster. For example, four reactions of the pentose phosphate pathway are classified into the first cluster, while
the other two are placed in the fourth. Nevertheless, pairs of reactions whose super-essentiality profiles are
highly correlated are also functionally dependent on each other, and they are neighboring reactions in real
metabolic pathways. Highly correlated pairs of reactions are those in the same cluster which are drawn close
together in figure 6.7 and whose corresponding position in figure 6.8 is colored dark red. For example, the four
reactions colored red and abbreviated as PGI, FBP, FBA and TPI in figure 6.7 are four consecutive reactions of the
glycolysis/gluconeogenesis pathway, catalyzed by phosphoglucose isomerase, fructose-bisphosphatase, aldolase and
triosephosphate isomerase, respectively. According to figure 6.8, the pairwise Spearman’s ρ of their
super-essentiality profiles is close to one (colored dark red). Therefore, the similarity in the
super-essentiality profiles of reactions reflects, to some extent, the real functional relationship among the
reactions.
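The clustering step can be illustrated with a small self-contained sketch. The thesis used the agglomerative hierarchical clustering implemented in MATLAB and does not state the distance measure or linkage rule, so the choices below (distance 1 - ρ with Spearman’s ρ, average linkage) are assumptions. The reaction names R1 to R4 and their five-entry profiles are hypothetical; real profiles would have one entry per carbon source.

```python
from itertools import combinations
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            out[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return out

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of ranks.

    Undefined (division by zero) for a constant profile, which this sketch
    does not handle.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def average_linkage(profiles, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    names = list(profiles)
    dist = {frozenset((a, b)): 1.0 - spearman(profiles[a], profiles[b])
            for a, b in combinations(names, 2)}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return mean(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]

# Hypothetical profiles: average super-essentiality on five carbon sources.
profiles = {
    "R1": [0.9, 0.8, 0.1, 0.2, 0.5],
    "R2": [0.8, 0.7, 0.2, 0.3, 0.6],
    "R3": [0.1, 0.2, 0.9, 0.8, 0.4],
    "R4": [0.2, 0.3, 0.7, 0.6, 0.5],
}
clusters = average_linkage(profiles, n_clusters=2)
print(clusters)
```

Because the distance is based on ranks, R1 and R2 end up together even though their values differ: only the ordering of the carbon sources within each profile matters.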
Figure 6.7: Clustering metabolic reactions based on super-essentiality profile. The reactions are clustered into 4
classes. The names of the reactions are colored according to the pathway they belong to.
[Figure 6.8: heat map of pairwise Spearman’s ρ with a color bar from -1 to 1. Row and column labels: TKT2, RPE,
PPC, PDH, PGL, G6PDH2r, GND, GLUN, ATPM, PFK, ICDHyr, CS, ACONT, GLUSy, PYK, NADH16, FRD7, NADTRHD, ICL, MALS,
SUCOAS, AKGDH, GLUDy, CYTBD, THD2, PPCK, MDH, ME1, TPI, FBA, FBP, PGI, ATPS4r, ACKr, PTAr, ME2, PFL, FUM, SUCDi,
PPS, ADK1, TKT1, TALA, LDH_D, ACALD.]
Figure 6.8: The Spearman’s ρ of the correlation between super-essentiality profiles of the reactions. The
colors in each position represent the Spearman’s ρ value of the correlation between super-essentiality profiles of the
corresponding reaction pairs based on the color bar. The position corresponding to positively correlated reactions
will be colored dark red and that of negatively correlated reactions is colored dark blue and that of uncorrelated ones
is colored green.
With the super-essentiality profile of every reaction defined as above, a vector of length 51 is at the same time
assigned to each carbon source, each entry of which is the average super-essentiality of one of the 51 reactions.
These vectors were used to measure the functional similarity between carbon sources, based on the idea that if two
carbon sources influence the metabolic network in the same way, the resulting super-essentiality of the 51
reactions should be the same. Figure 6.9 shows the classification of the 10 carbon sources, and the pairwise
correlation among carbon sources is illustrated in figure 6.10. The carbon sources are classified into four
clusters. The first one contains the glycolytic carbon sources, namely glucose and fructose. The second cluster
includes five carbon sources. Three of them, namely fumarate, malate and succinate, are part of the citric acid
cycle and are consecutively convertible into each other by enzymes of the citric acid cycle. Pyruvate and lactate,
which also belong to this cluster, are functionally related metabolites as well, because they are convertible into
each other via lactate dehydrogenase. Importantly, acetate is not directly convertible into any other carbon
source, so it is alone in the third cluster. Acetate can be functionally related to the other carbon sources via
acetyl-CoA, but the clustering approach does not reveal that. Moreover, glutamate and α-ketoglutarate are
convertible into each other via glutamate dehydrogenase, and the clustering approach places both of them in the
fourth cluster. Thus, carbon sources which are directly convertible into each other show similar
super-essentiality profiles for the reactions of the metabolic network. Moreover, the functional relatedness of
the carbon sources clustered together confirms, to some extent, the validity of the proposed clustering approach.
[Figure 6.9: dendrogram; leaves from left to right: Fructose, Glucose, Fumarate, Malate, Succinate, Lactate,
Pyruvate, Acetate, Alpha-ketoglutarate, Glutamate; vertical axis from 0 to 0.8.]
Figure 6.9: Clustering carbon sources based on super-essentiality of the reactions.
[Figure 6.10: heat map with a color bar from 0.5 to 1. Row and column labels: Fructose, Glucose, Fumarate, Malate,
Succinate, Lactate, Pyruvate, Acetate, Alpha-ketoglutarate, Glutamate.]
Figure 6.10: The Spearman’s ρ of the correlation between super-essentiality profiles associated with pairs of
carbon sources. The color in each position represents the Spearman’s ρ between the super-essentiality profiles
associated with the corresponding pair of carbon sources, according to the color bar. Positions corresponding to
strongly correlated carbon sources are colored dark red and those of weakly correlated carbon sources are colored
dark blue.
Chapter 7: Conclusion
So far, because of the astronomically large size of the metabolic genotype space, genotype-phenotype mapping for
metabolic networks has been possible only with sampling-based methods. In this project, for the first time, the
genotype space of a metabolic system is exhaustively explored and the phenotype of every individual genotype in
the genotype space is determined. This task became possible, first, by focusing on a smaller reaction universe,
that of the central carbon metabolism of E.coli, instead of considering all the reactions of genome-scale
metabolic networks. Secondly, the efficient algorithm discussed in chapter 2 helped to overcome the computational
hurdle of the task. The algorithm is based on the observation that a genotype whose set of reactions is a subset
of that of an already known unviable genotype is unviable as well, so there is no need to check its viability
using the time-consuming FBA approach. Thus, the algorithm shrinks the search space by skipping the vast portion
of the genotype space whose unviability is already known without computation. There are 51 internal reactions in
the central carbon metabolism of E.coli, so the number of genotypes in the genotype space is 2^51 ≈ 10^15. The
results show that the total number of viable genotypes on each carbon source is around 10^9, which implies that
only about one genotype in a million is viable. Checking the viability of a single genotype using FBA takes about
0.01 seconds, so brute-force checking of the whole space using FBA would require on the order of 10^5 years. The
algorithm, by skipping the vast number of genotypes whose unviability is already established without FBA, could
instead determine the viability of all genotypes in a few days. Hence, using this approach, it was determined for
each genotype in the genotype space whether it is viable on each of the 10 carbon sources. In other words, a
binary phenotype vector of length 10 was exhaustively assigned to each genotype. Moreover, the set of viable
genotypes on each carbon source was partitioned by size into disjoint sets of equally-sized genotypes viable on
that carbon source. The set of genotypes of size i which are viable on carbon source Cm forms a genotype network
denoted by G(i, Cm), in which each genotype is represented as a node and two genotypes are connected by an edge if
they can be converted into each other by a mutational swap. Thus, an information-rich dataset comprising the
genotype networks for the different sizes and carbon sources was generated. Access to this dataset made it
possible to answer new questions about the structure, function and evolution of metabolic networks. Chapters 3 to
6 were devoted to answering some of these questions.
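The pruning idea summarized above can be sketched in a few lines. This is not the algorithm of chapter 2, only one simple way to exploit the same observation: start from the genotype with all reactions, remove one reaction at a time, and never expand a genotype that has been found unviable, since all of its subsets are unviable too. The 12-reaction universe and the rule in `toy_is_viable` are hypothetical stand-ins for the 51 reactions and for FBA.

```python
from collections import deque

def explore_viable(universe, is_viable):
    """Enumerate all viable genotypes top-down without expanding unviable ones.

    Returns the set of viable genotypes and the number of viability checks.
    Assumes viability is monotone: every subset of an unviable genotype is
    unviable, so every viable genotype is reachable through viable supersets.
    """
    full = frozenset(universe)
    calls = 1
    if not is_viable(full):
        return set(), calls
    viable, seen = {full}, {full}
    queue = deque([full])
    while queue:
        genotype = queue.popleft()
        for reaction in genotype:
            child = genotype - {reaction}
            if child in seen:
                continue
            seen.add(child)
            calls += 1  # one (expensive) viability check per new genotype
            if is_viable(child):
                viable.add(child)
                queue.append(child)  # only viable genotypes are expanded
    return viable, calls

def toy_is_viable(genotype):
    """Hypothetical stand-in for FBA with a monotone viability rule."""
    return {0, 1, 2} <= genotype and (3 in genotype or 4 in genotype)

viable, calls = explore_viable(range(12), toy_is_viable)
print(f"{len(viable)} viable genotypes found with {calls} checks "
      f"instead of {2 ** 12} for brute force")
```

The number of checks is bounded by the number of viable genotypes plus their unviable one-reaction-smaller neighbors, which is why the saving grows as viable genotypes become rarer.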
The frequency of genotypes in the total genotype space as a function of genotype size follows a binomial
distribution. Surprisingly, partitioning the viable genotypes by size also leads to binomial distributions.
However, the frequency of the most frequent size is smaller than that of the total genotype space, and the size
with maximum frequency is shifted to the right with respect to that of the total genotype space (figure 2.9).
Moreover, the viability ratio of the genotype networks decreases with decreasing genotype size (figure 2.10). This
is plausible, since with decreasing size the number of eliminated reactions increases and the chance that a
genotype is still able to sustain life decreases.
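For the total genotype space the size distribution follows directly from counting: the number of genotypes with k of the 51 reactions is the binomial coefficient C(51, k). The short sketch below, with illustrative variable names, confirms that these counts add up to 2^51 and shows where the distribution peaks.

```python
from math import comb

N_REACTIONS = 51  # size of the reaction universe in the thesis

# Number of genotypes of each size k = 0..51 in the total genotype space.
counts = [comb(N_REACTIONS, k) for k in range(N_REACTIONS + 1)]
total = sum(counts)
peak = max(counts)
most_frequent_sizes = [k for k, c in enumerate(counts) if c == peak]

print(f"total genotypes: {total:.3e}")
print(f"most frequent sizes: {most_frequent_sizes}")
```

With an odd number of reactions the maximum is shared by sizes 25 and 26, so a right shift of the peak for viable genotypes means that their most frequent size lies above this pair.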
In chapter 3, some of the graph-theoretical properties of genotype networks were investigated. The results show
that the average number of neighbors per node (i.e. average degree) is higher in larger networks than in smaller
ones (figure 3.1). This implies that larger genotype networks are denser than smaller ones. Moreover, an upper
bound on the average degree was introduced for the genotype networks of each genotype size. When the average
degree of each genotype network is normalized by dividing it by the corresponding upper bound, the normalized
average degree grows with increasing size (figure 3.2). This implies that for genotype networks of larger
genotypes, the average degree is more similar to that of the original genotype space (i.e. the network including
all possible genotypes of a given size, which is a regular lattice). Importantly, the degree distribution of each
genotype network appears similar to that of random networks; however, the similarity is not statistically
significant. Fortunately, for genotype networks it is possible to determine all-pairs shortest paths in the graph
without the common time-consuming algorithms used for this purpose. Because the neighborhood between genotypes in
the network is defined by mutational swaps, the shortest path between any two genotypes in a connected genotype
network is defined as the minimum number of mutational swaps required to convert one genotype into the other, and
it is easily inferred from the reaction contents of the genotypes. The average shortest path length and the
diameter are also higher in larger graphs than in smaller ones (figure 3.6). Again, normalizing the average
shortest path length and the diameter of a genotype network by dividing them by the corresponding upper bound
shows that the normalized parameters also increase with genotype size (figure 3.7). This also implies that
genotype networks of larger genotypes are more similar to the original genotype space. Moreover, the average
normalized betweenness centrality of the genotype networks was measured, and it was shown that the mean and
variance of the betweenness centrality decrease with increasing size of the genotype network. Therefore, in larger
genotype networks the relative importance of individual genotypes for connecting other genotypes to each other is
lower.
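How a path length can be read off the reaction contents is shown in the sketch below. A mutational swap replaces one reaction by another, so two genotypes of equal size that differ in d reactions on each side are at least d swaps apart. The function name and the example genotypes are hypothetical. Treating this count as the shortest path inside a genotype network additionally assumes that a sequence of viable intermediate genotypes of that length exists, which the sketch does not check.

```python
def swap_distance(genotype_a, genotype_b):
    """Minimum number of mutational swaps between two equal-sized genotypes.

    Counts the reactions present in one genotype but not in the other.
    Viability of the intermediate genotypes is NOT verified here.
    """
    if len(genotype_a) != len(genotype_b):
        raise ValueError("mutational swaps preserve genotype size")
    return len(genotype_a - genotype_b)

# Hypothetical genotypes written as sets of reaction abbreviations.
a = frozenset({"PGI", "FBA", "TPI", "PYK"})
b = frozenset({"PGI", "FBA", "PFK", "PPC"})
distance = swap_distance(a, b)  # TPI and PYK must be swapped for PFK and PPC
```

Each distance costs one set difference, so no graph traversal is needed to fill an all-pairs table under this assumption.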
Connectedness, a very important topological property of genotype networks, was studied in
chapter 4. It was of particular interest to know whether any pair of genotypes in a genotype
network can reach each other via neutral mutational swaps. Answering this question again posed a
computational obstacle, because the known algorithms for determining the connected components of
a network, such as breadth-first search or depth-first search, first require the network to be
built and stored in an adjacency matrix or list. For the vast genotype networks studied in this
project, using these algorithms was not feasible. First, checking all pairs of genotypes for a
neighborhood relationship is computationally time consuming. For example, for a network with
10^6 genotypes, filling the adjacency matrix requires 10^12 pairwise comparisons of the reaction
contents of the genotypes. Second, such large adjacency matrices or lists cannot be stored due to
memory limitations. Fortunately, this computational hurdle was overcome by designing a new
algorithm that assigns the nodes of the genotype network to their connected components without
filling and storing the adjacency matrix. Moreover, the total running time of the algorithm is
O(mV), where m is the average degree and V is the number of genotypes in the network. This
allowed the task to be completed very efficiently for genotype networks with fewer than one
million genotypes.
Although the algorithm does not need the adjacency matrix to be filled, it requires the genotypes
to be stored as sets of reactions. Therefore, due to memory limitations, even this new algorithm
could not determine the connectedness of genotype networks comprising more than one million
genotypes. Fortunately, a relationship between the connectedness of genotype networks
corresponding to genotypes of consecutive sizes was observed, which led to the formulation of a
theorem that helps determine the connectedness of very large genotype networks based on the
connectedness of the smaller genotype networks. Together, the new algorithm and the theorem were
sufficient to determine the connectedness of all the genotype networks. The results show that the
genotype networks corresponding to genotypes of larger sizes are all connected, but due to the
extremely low viability ratio and the resulting tremendous sparsity, some of the genotype
networks corresponding to genotypes of smaller sizes are fragmented into 2, 3 or at most 4
connected components. Nevertheless, the components are not equally sized; instead, one of the
components is always a giant component encompassing a large fraction of the nodes in the network.
Therefore, it is concluded that genotypes belonging to genotype networks are, for the most part,
reachable from one another via neutral mutational swaps.
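The idea of finding connected components without ever building an adjacency matrix can be sketched as follows. This is an illustrative breadth-first search, not the algorithm developed in the thesis: it assumes that all viable genotypes of one size fit in memory as a hash set of frozensets, and it generates swap neighbors on the fly and keeps only those present in the set. The function and parameter names are hypothetical. Each genotype tests n(N - n) candidate swaps (n reactions present, N - n absent), so the cost of this sketch is O(n(N - n)V) set lookups rather than the O(mV) reported above.

```python
from collections import deque


def connected_components(genotypes, universe):
    """Group genotypes into connected components using only the node set.

    genotypes: iterable of frozensets of reactions (viable, all the same size).
    universe:  iterable of every reaction that may occur in a genotype.
    """
    unvisited = set(genotypes)
    universe = frozenset(universe)
    components = []
    while unvisited:
        start = unvisited.pop()
        component = [start]
        queue = deque([start])
        while queue:
            g = queue.popleft()
            # Enumerate swap neighbors directly instead of reading an adjacency list.
            for removed in g:
                for added in universe - g:
                    neighbor = (g - {removed}) | {added}
                    if neighbor in unvisited:
                        unvisited.remove(neighbor)
                        component.append(neighbor)
                        queue.append(neighbor)
        components.append(component)
    return components
```

The memory footprint is dominated by the genotype set itself, which mirrors the limitation described above for networks with more than one million genotypes.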
Metabolic networks are robust to reaction elimination, meaning that despite the removal of
several reactions the ability of the metabolic networks to sustain life and growth persists. The
focus of the study was on the impact of genotype size on the robustness of metabolic networks.
The largest genotype size corresponds to the genotype possessing all the reactions belonging to
the reaction universe, and it has the largest robustness. Reducing genotype size implies
increasing the number of eliminated reactions. Since a larger number of eliminated reactions
increases the chance that removal of another reaction would abolish viability, reducing genotype
size is expected to reduce robustness. The results confirm that average robustness decreases with
decreasing genotype size (figures 5.1 and 5.2). Moreover, the average robustness of a genotype
network does not correlate with the number of genotypes in the network (figure 5.3). It is also
worth noting that robustness varies among genotypes of the same size, so robustness is determined
not only by the number of reactions but also by the reaction content of the metabolic network.
Furthermore, the results show that more central genotypes in the genotype space are more robust
(figure 5.4). Therefore, more robust genotypes are crucial for the evolution of metabolic
networks because they facilitate exploration of genotype networks by easing the reachability of
distant genotypes.
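Robustness to reaction elimination can be made concrete with a short sketch. It assumes one common definition, namely the fraction of single-reaction deletions that leave the genotype viable; the exact definition used in chapter 5 may differ. The viability test is passed in as a hypothetical callable `is_viable`, which in the thesis setting would be a flux balance analysis check for a nonzero biomass flux.

```python
def robustness(genotype, is_viable):
    """Fraction of single-reaction deletions that leave the genotype viable."""
    genotype = frozenset(genotype)
    if not genotype:
        return 0.0
    # Delete each reaction in turn and count the deletions that keep viability.
    surviving = sum(1 for r in genotype if is_viable(genotype - {r}))
    return surviving / len(genotype)
```

With a toy viability rule that requires reactions A and B, the genotype {A, B, C, D} tolerates the loss of C or D only, giving a robustness of 0.5.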
In this study, the growth rate of a genotype was approximated by the rate of the biomass reaction,
and the growth rate was taken as the fitness of the genotype. The results show that the average
fitness of a genotype network grows almost exponentially with increasing genotype size for
genotype networks corresponding to glucose, fructose, glutamate, α-ketoglutarate and pyruvate
(figures 5.5-5.7). Moreover, robustness and growth rate were shown to be significantly
correlated, implying that genotypes that are more robust to reaction elimination are associated
with a higher fitness (figures 5.8 and 5.9). This finding can help to explain the emergence of
robustness in metabolic networks.
Finally, the super-essentiality index of the reactions in each genotype network was determined.
Generally, reactions are expected to become more essential as the metabolic networks undergo more
reaction eliminations (i.e. at smaller genotype sizes). However, the results show that the
reactions can be categorized into four types based on how their super-essentiality index depends
on genotype size (figure 6.1). The first and fourth categories comprise the reactions whose
super-essentiality index remains 1 and 0, respectively, regardless of genotype size. In contrast,
the super-essentiality index of the reactions belonging to the second category increases with
decreasing genotype size, and for the third category the super-essentiality index initially
increases until reaching a maximum and then drops towards zero. Surprisingly, a reaction does not
necessarily stay in the same category across the genotype networks corresponding to different
carbon sources; its category may change depending on the carbon source. Moreover, the results
show that reactions belonging to the first category in more than 5 carbon sources frequently
occur in glycolysis and the citric acid cycle, while the reactions that belong to the fourth
category occur more frequently in oxidative phosphorylation and pyruvate metabolism (figure 6.3).
Furthermore, consecutive reactions with common metabolites frequently have similar
super-essentiality profiles and consequently are clustered in the same category (figures 6.7 and
6.8). However, it is not true that all reactions belonging to a particular pathway are clustered
in the same category. Moreover, the carbon sources were also clustered based on the
super-essentiality profiles of the reactions. Surprisingly, the super-essentiality profiles of
the reactions in carbon sources that are directly convertible to each other via an enzymatic
reaction were strongly similar. Therefore, the carbon sources directly convertible to each other
were classified in the same cluster (figures 6.9 and 6.10).
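A super-essentiality index of the kind discussed above can be sketched in a few lines. The sketch assumes that the index of a reaction is the fraction of genotypes in a genotype network in which deleting that reaction abolishes viability, so that an index of 1 means essential everywhere and 0 means essential nowhere; the precise definition used in chapter 6 may differ. As before, `is_viable` is a hypothetical callable standing in for the flux balance analysis check.

```python
def superessentiality_index(genotype_network, is_viable):
    """Map each reaction to the fraction of genotypes in which it is essential."""
    genotypes = [frozenset(g) for g in genotype_network]
    essential_counts = {}
    for g in genotypes:
        for r in g:
            # A reaction is essential in g if removing it makes g non-viable.
            if not is_viable(g - {r}):
                essential_counts[r] = essential_counts.get(r, 0) + 1
    reactions = set().union(*genotypes)
    return {r: essential_counts.get(r, 0) / len(genotypes) for r in reactions}
```

Computing this index separately for the genotype network of each size yields, for every reaction, a profile over genotype size, which is the quantity the four categories are based on.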
As can be concluded from their definitions, robustness, super-essentiality and viability ratio
are interrelated concepts. If the average super-essentiality of reactions is higher, the average
robustness of the genotype network and the viability ratio will be lower. Genotype networks
corresponding to acetate are good examples of these interrelationships. According to figure 6.4,
the average super-essentiality of the genotype networks corresponding to acetate is higher than
that of other carbon sources. Consequently, as illustrated in figure 5.2, the average mutational
robustness of the genotype networks associated with acetate is lower than that of other carbon
sources. Furthermore, as depicted in figure 2.10, the viability ratio of the genotype networks
associated with acetate is clearly below that of other carbon sources.
Acknowledgements
I would like to offer my thanks to Prof. Andreas Wagner for setting up this interesting project
and offering me the position that allowed me to pursue questions related to my research interests.
I particularly wish to thank Aditya Barve for his assistance and sharing of experience during the
research. Moreover, I am grateful to Dr João Matias Rodrigues for his useful suggestions.
I appreciate the education that I received from the CBB Master program. Especially, I wish to
express my gratitude to my mentor, Prof. Niko Beerenwinkel for his guidance and support during
my studies.
I wish to thank my parents and my brothers for their support and encouragement throughout my
studies.
References
1. Feist AM, Herrgård MJ, Thiele I, Reed JL, Palsson BØ. Reconstruction of biochemical
networks in microorganisms. Nat Rev Microbiol. 2009; 7(2):129-143.
2. Herrgård MJ, Swainston N, Dobson P, Dunn WB, Arga KY, Arvas M, Blüthgen N, Borger S,
Costenoble R, Heinemann M, Hucka M, Le Novère N, Li P, Liebermeister W, Mo ML,
Oliveira AP, Petranovic D, Pettifer S, Simeonidis E, Smallbone K, Spasić I, Weichart D,
Brent R, Broomhead DS, Westerhoff HV, Kirdar B, Penttilä M, Klipp E, Palsson BØ, Sauer
U, Oliver SG, Mendes P, Nielsen J, Kell DB. A consensus yeast metabolic network
reconstruction obtained from a community approach to systems biology. Nat Biotechnol.
2008 Oct; 26(10):1155-60.
3. Duarte NC, Becker SA, Jamshidi N, Thiele I, Mo ML, Vo TD, Srivas R, Palsson BØ. Global
reconstruction of the human metabolic network based on genomic and bibliomic data. Proc
Natl Acad Sci USA. 2007; 104: 1777–1782.
4. Wagner A. Metabolic networks and their evolution. Adv Exp Med Biol. 2012;751:29-52
5. Zielinski DC, Palsson BØ. Kinetic Modeling of Metabolic Networks in Systems Metabolic
Engineering. 2012, pp 25-55.
6. Orth, J.D., Thiele, I. & Palsson, B.Ø. What is flux balance analysis? Nature Biotechnology
28, 245-248(2010).
7. Price N, Reed J, Palsson B (2004) Genome-scale models of microbial cells: evaluating the
consequences of constraints. Nat Rev Microbiol 2:886–897
8. Schuetz R, Kuepfer L, Sauer U (2007) Systematic evaluation of objective functions for
predicting intracellular fluxes in Escherichia coli. Mol Syst Biol 3.
9. Pigliucci M. Genotype–phenotype mapping and the end of the ‘genes as blueprint’ metaphor.
Phil Trans R Soc B 2010; 365:557-66.
10. Loewe L (2009) A framework for evolutionary systems biology. BMC Syst Biol 3:27
11. K.F. Lau, K.A. Dill. A lattice statistical mechanics model of the conformational and
sequence spaces of proteins Macromolecules, 22 (1989), pp. 3986–3997
12. E. Bornberg-Bauer. How are model protein structures distributed in sequence space?
Biophys. J., 73 (1997), pp. 2393–2403
13. W. Fontana, D.A.M. Konings, P. Schuster et al. Statistics of RNA secondary structures.
Biopolymers, 33 (1993), pp. 1389–1404
14. I.L. Hofacker, W. Fontana, P. Schuster et al. Fast folding and comparison of RNA secondary
structures. Monatsh. Chem., 125 (1994), pp. 167–188
15. P. Schuster, W. Fontana, I.L. Hofacker et al. From sequences to shapes and back: a case
study in RNA secondary structures. Proc. Biol. Sci., 255 (1994), pp. 279–284
16. Grüner W, Giegerich R, Strothmann D, Reidys C, Weber J, et al. (1996) Analysis of RNA
sequence structure maps by exhaustive enumeration I: Neutral networks. Monatshefte für
Chemie 127: 375–389.
17. Ciliberti S, Martin OC and Wagner A. Innovation and robustness in complex regulatory
gene networks. Proc Natl Acad Sci U S A. 2007; 104(34): 13591
18. Nochomovitz YD, Li H. Highly designable phenotypes and mutational buffers emerge from a
systematic mapping between network topology and dynamic output. Proc Natl Acad Sci U S
A. 2006 Mar 14; 103(11):4180-5.
19. Matias Rodrigues JF, Wagner A: Evolutionary plasticity and innovations in complex
metabolic reaction networks. PLoS Comput Biol 2009, 5:e1000613.
20. Samal A, Matias Rodrigues JF, Jost J, Martin OC, Wagner A. Genotype networks in
metabolic reaction spaces. BMC Syst Biol. 2010; 4:30.
21. Barabási AL, Oltvai ZN. Network biology: understanding the cell's functional organization.
Nat Rev Genet. 2004; 5(2):101-13
22. Albert, R. & Barabási, A.-L. Statistical mechanics of complex networks. Rev. Mod. Phys.
2002; 74, 47–97
23. Dorogovtsev, S. N. & Mendes, J. F. F. Evolution of Networks: from Biological Nets to the
Internet and WWW. 2003; Oxford University Press, Oxford.
24. Bornholdt, S. & Schuster, H. G. Handbook of Graphs and Networks: from the Genome to the
Internet. 2003; Wiley-VCH, Berlin, Germany.
25. Strogatz, S. H. Exploring complex networks. Nature. 2001; 410, 268–276.
26. Yamada T, Bork P. Evolution of biomolecular networks: lessons from metabolic and protein
interactions. Nat Rev Mol Cell Biol. 2009; 10(11):791-803.
27. Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. & Barabási, A. L. Hierarchical
organization of modularity in metabolic networks. Science. 2002; 297, 1551–1555.
28. Maslov, S. & Sneppen, K. Specificity and stability in topology of protein networks. Science.
2002; 296, 910–913.
29. Hahn, M. W. & Kern, A. D. Comparative genomics of centrality and essentiality in three
eukaryotic protein interaction networks. Mol. Biol. Evol. 2005; 22, 803–806.
30. M. E. J. Newman. Networks: An Introduction. 2010; Oxford: Oxford University Press.
31. Erdős, P. & Rényi, A. On the strength of connectedness of a random graph. Acta Math.
Hung. 1961; 12, 261–267.
32. Barabasi, A. L. & Albert, R. Emergence of scaling in random networks. Science. 1999; 286,
509–512.
33. Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabási, A. -L. The large-scale
organization of metabolic networks. Nature. 2000; 407, 651–654.
34. Wagner, A. & Fell, D. A. The small world inside large metabolic networks. Proc. R. Soc.
Lond. 2001; B 268, 1803–1810.
35. Jeong, H., Mason, S. P., Barabási, A. -L. & Oltvai, Z. N. Lethality and centrality in protein
networks. Nature. 2001; 411, 41–42.
36. Wagner, A. The yeast protein interaction network evolves rapidly and contains few
redundant duplicate genes. Mol. Biol. Evol. 2001; 18, 1283–1292.
37. Giot, L. et al. A protein interaction map of Drosophila melanogaster. Science 2003; 302,
1727–1736.
38. Li, S. et al. A map of the interactome network of the metazoan, C. elegans. Science. 2004 23;
303(5657):540-3.
39. Yook SH, Oltvai ZN, Barabási AL. Functional and topological characterization of protein
interaction networks. Proteomics. 2004; 4(4):928-42.
40. Featherstone, D. E. & Broadie, K. Wrestling with pleiotropy: genomic and topological
analysis of the yeast gene expression network. Bioessays; 2002; 24, 267–274.
41. Agrawal, H. Extreme self-organization in networks constructed from gene expression data.
Phys. Rev. Lett. 2002; 89, 268702.
42. Wuchty, S. Scale-free behavior in protein domain networks. Mol. Biol. Evol. 2001; 18,
1694–1702.
43. Apic, G., Gough, J. & Teichmann, S. A. An insight into domain combinations.
Bioinformatics; 2001; 17, S83–S89.
44. Milgram, S. The small world problem. Psychol. Today. 1967; 2, 60.
45. Watts, D. J. & Strogatz, S. H. Collective dynamics of 'small-world' networks. Nature. 1998;
393, 440–442.
46. Albert R, Jeong H and Barabási AL. Internet: Diameter of the World-Wide Web. Nature
1999; 401, 130-131.
47. Chung, F. & Lu, L. The average distances in random graphs with given expected degrees.
Proc. Natl Acad. Sci. 2002; USA 99, 15879–15882.
48. Cohen, R. & Havlin, S. Scale-free networks are ultra small. Phys. Rev. Lett. 2003; 90,
058701.
49. Maslov, S. & Sneppen, K. Specificity and stability in topology of protein networks. Science
2002; 296, 910–913.
50. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms. 3rd edn.
MIT Press, Cambridge, MA
51. Cowperthwaite MC and Meyers LA. How Mutational Networks Shape Evolution: Lessons
from RNA Models. Annu. Rev. Ecol. Evol. Syst. 2007. 38:203–30
52. Babajide A, Hofacker IL, Stadler PF et al. Neutral networks in protein space: A
computational study based on knowledge-based potentials of mean force. Fold. Des, 2
(1997), pp. 261–269
53. Cowperthwaite MC, Economo EP, Harcombe WR, Miller EL, Meyers LA. The ascent of the
abundant: how mutational networks constrain evolution. PLoS Comput Biol. 2008; 18;4(7).
54. Kirschner M, Gerhart J. Evolvability. Proc. Natl. Acad. Sci. 1998; USA 95:8420–27
55. Stadler BMR, Stadler P, Wagner GP, Fontana W. The topology of the possible: formal
spaces underlying patterns of evolutionary change. J. Theor. Biol. 2001; 213:241–74
56. Wagner A. Robustness, evolvability, and neutrality. 2005; FEBS Lett. 579:1772–78
57. Wagner A. Robustness and evolvability in living systems. 2005; Princeton, NJ: Princeton
University Press.
58. Szollosi G. J., Derenyi I. Congruent evolution of genetic and environmental robustness in
micro-RNA. Mol. Biol. Evol. 2009; 26, 867–874.
59. Cooper T. F., Morby A. P., Gunn A., Schneider D. Effect of random and hub gene
disruptions on environmental and mutational robustness in Escherichia coli. BMC Genomics.
2006; 7, 237.
60. Milton C. C., Huynh B., Batterham P., Rutherford S. L., Hoffmann A. A. Quantitative trait
symmetry independent of Hsp90 buffering: distinct modes of genetic canalization and
developmental stability. Proc. Natl Acad. Sci. USA. 2003; 100, 13 396–13 401.
61. Félix MA, Wagner A. Robustness and evolution: concepts, insights and challenges from a
developmental model system. Heredity. 2008 Feb; 100(2):132-40.
62. Woollard A. Gene duplications and genetic redundancy in C. elegans. WormBook. 2005;
25:1-6.
63. Dean EJ, Davis JC, Davis RW, and Petrov DA. Pervasive and persistent redundancy among
duplicated genes in Yeast. PLoS Genet. 2008 July; 4(7): e1000113.
64. Wagner A. Distributed robustness versus redundancy as causes of mutational robustness.
Bioessays. 2005; 27, 176–188.
65. Wilding EI, et al. Identification, evolution, and essentiality of the mevalonate pathway for
isopentenyl diphosphate biosynthesis in gram-positive cocci. J Bacteriol. 2000; 182:4319–
4327.
66. Chaudhuri RR, et al. Comprehensive identification of essential Staphylococcus aureus genes
using Transposon-Mediated Differential Hybridisation (TMDH) BMC Genomics. 2009;
10:291.
67. Baba T, et al. Construction of Escherichia coli K-12 in-frame, single-gene knockout
mutants: The Keio collection. Mol Syst Biol. 2006; 2:2006.0008.
68. Feist AM, et al. A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655
that accounts for 1260 ORFs and thermodynamic information. Mol Syst Biol. 2007; 3:121.
69. Wang Z, Zhang J. Abundant indispensable redundancies in cellular metabolic networks.
Genome Biol Evol. 2009; 1:23–33.
70. Papp B, Pál C, Hurst LD. Metabolic network analysis of the causes and evolution of enzyme
dispensability in yeast. Nature. 2004; 429:661–664.
71. Joyce AR, et al. Experimental and computational assessment of conditionally essential genes
in Escherichia coli. J Bacteriol. 2006; 188:8259–8271.
72. Dowell RD, et al. Genotype to phenotype: A complex problem. Science 2010; 328:469.
73. Barve A, Rodrigues JF, Wagner A. Superessential reactions in metabolic networks. Proc Natl
Acad Sci U S A. 2012. 1; 109(18):E1121-30.
74. Gadad AK, Mahajanshetti CS, Nimbalkar S, Raichurkar A. Synthesis and antibacterial
activity of some 5-guanylhydrazone/thiocyanato-6-arylimidazo[2,1-b]-1,3,4-thiadiazole-2-sulfonamide
derivatives. Eur J Med Chem. 2000; 35:853–857.
75. Wiesner J, Borrmann S, Jomaa H. Fosmidomycin for the treatment of malaria. Parasitol Res.
2003; 90(Suppl 2):S71–S76.
76. Banerjee A, et al. inhA, a gene encoding a target for isoniazid and ethionamide in
Mycobacterium tuberculosis. Science 1994; 263:227–230.
77. Timmins GS, Deretic V. Mechanisms of action of isoniazid. Mol Microbiol. 2006; 62:1220–
1227.
78. Lehninger Principles of Biochemistry (4th Ed.) Nelson, D., and Cox, M.; W.H. Freeman and
Company, New York, 2005.
79. Chassagnole C, Noisommit-Rizzi N, Schmid JW, Mauch K, Reuss M. Dynamic modeling of
the central carbon metabolism of Escherichia coli. Biotechnol Bioeng. 2002; 5; 79(1):53-73.
80. Peskov K, Mogilevskaya E & Demin O. Kinetic modelling of central carbon metabolism in
Escherichia coli. FEBS J. 2012; 279(18):3374-85.
81. Sauer U, Eikmanns B: The PEP-pyruvate-oxaloacetate node as the switch point for carbon
flux distribution in bacteria. FEMS Microbiol Rev. 2005; 29:765.
82. Shiloach J, Rinas U: Glucose and acetate metabolism in E. coli- System level analysis and
biotechnological applications in protein production processes. In Systems Biology and
Biotechnology of Escherichia coli. Edited by Lee SY. Springer Science and Business Media
B.V, Dordrecht; 2009:377-400.
83. Papagianni M. Recent advances in engineering the central carbon metabolism of industrially
important bacteria. Microb Cell Fact. 2012 30; 11:50.
84. G.B Dantzig: Maximization of a linear function of variables subject to linear inequalities,
1947. Published pp. 339–347 in T.C. Koopmans (ed.):Activity Analysis of Production and
Allocation, New York-London 1951 (Wiley & Chapman-Hall)
85. Neidhardt F.C., Ingraham J.L., Schaechter M. Physiology of the Bacterial Cell: A Molecular
Approach. Sinauer Associates, Sunderland, MA (1990).
86. Palsson B.O. Systems Biology: Properties of Reconstructed Networks. Cambridge University
Press, New York (2006).
87. Noor E, Eden E, Milo R, Alon U. Central carbon metabolism as a minimal biochemical walk
between precursors for biomass and energy. Mol Cell. 2010; 10; 39(5):809-20.
88. Floyd, Robert W. Algorithm 97: Shortest Path. Communications of the ACM. 1962; 5 (6): 345.
89. Johnson, Donald B. Efficient algorithms for shortest paths in sparse networks. Journal of the
ACM. 1977; 24 (1): 1–13
90. G. Csardi, Package ‘igraph’, 2010, available at: http://igraph.sourceforge.net/.
91. Bollobás, Béla (2001), "6. The Evolution of Random Graphs—The Giant Component",
Random Graphs, Cambridge studies in advanced mathematics 73 (2nd ed.), Cambridge
University Press, pp. 130–159.
92. Paul E. Black, "biconnected graph", in Dictionary of Algorithms and Data Structures
[online], Paul E. Black, ed., U.S. National Institute of Standards and Technology. Available
from: http://www.nist.gov/dads/HTML/biconnectedGraph.html
93. Hopcroft, J. & Tarjan, R. "Efficient algorithms for graph manipulation". Communications of
the ACM 1973; 16 (6): 372–378.
94. Pudi, Vikram and Haritsa, Jayant R. Generalized Closed Itemsets for Association Rule
Mining. In: 19th International Conference on Data Engineering (ICDE’03), 5-8 March, 2003,
Bangalore, India, pp. 714-716.
95. Matias Rodrigues JF, Wagner A. Genotype networks, innovation, and robustness in sulfur
metabolism. BMC Syst Biol. (2011) 7; 5:39.
96. Opsahl. T, Agneessens. F, Skvoretz. J (2010). "Node centrality in weighted networks:
Generalizing degree and shortest paths". Social Networks 32 (3): 245
Appendix
Glucose
Fructose
Acetate
Alpha-Keto Glutarate
Fumarate
Lactate
Malate
Pyruvate
Glutamate
Succinate
Table A.1: The 10 different carbon sources which were studied
Number | Reaction | Description
1 | Biomass | Biomass
2 | NH4t: nh4[e] <==> nh4 | Export Reaction
3 | PIt2r: h[e] + pi[e] <==> h + pi | Export Reaction
4 | SUCCt2_2: (2) h[e] + succ[e] --> (2) h + succ | Export Reaction
5 | ACALDt: acald[e] <==> acald | Export Reaction
6 | SUCCt3: h[e] + succ --> h + succ[e] | Export Reaction
7 | H2Ot: h2o[e] <==> h2o | Export Reaction
8 | GLUt2r: glu-L[e] + h[e] <==> glu-L + h | Export Reaction
9 | GLNabc: atp + gln-L[e] + h2o --> adp + gln-L + h + pi | Export Reaction
10 | GLCpts: glc-D[e] + pep --> g6p + pyr | Export Reaction
11 | FRUpts2: fru[e] + pep --> f6p + pyr | Export Reaction
12 | FUMt2_2: fum[e] + (2) h[e] --> fum + (2) h | Export Reaction
13 | CO2t: co2[e] <==> co2 | Export Reaction
14 | D_LACt2: h[e] + lac-D[e] <==> h + lac-D | Export Reaction
15 | ACt2r: ac[e] + h[e] <==> ac + h | Export Reaction
16 | O2t: o2[e] <==> o2 | Export Reaction
17 | AKGt2r: akg[e] + h[e] <==> akg + h | Export Reaction
18 | FORt2: for[e] + h[e] --> for + h | Export Reaction
19 | FORti: for --> for[e] | Export Reaction
20 | MALt2_2: (2) h[e] + mal-L[e] --> (2) h + mal-L | Export Reaction
21 | PYRt2r: h[e] + pyr[e] <==> h + pyr | Export Reaction
22 | GLNS: atp + glu-L + nh4 --> adp + gln-L + h + pi | Glutamate Metabolism*
23 | PGM: 2pg <==> 3pg | Glycolysis/Gluconeogenesis*
24 | PGK: 3pg + atp <==> 13dpg + adp | Glycolysis/Gluconeogenesis*
25 | GAPD: g3p + nad + pi <==> 13dpg + h + nadh | Glycolysis/Gluconeogenesis*
26 | ENO: 2pg <==> h2o + pep | Glycolysis/Gluconeogenesis*
27 | RPI: r5p <==> ru5p-D | Pentose Phosphate Pathway*
28 | ICL: icit --> glx + succ | Anaplerotic reactions
29 | PPCK: atp + oaa --> adp + co2 + pep | Anaplerotic reactions
30 | PPC: co2 + h2o + pep --> h + oaa + pi | Anaplerotic reactions
31 | MALS: accoa + glx + h2o --> coa + h + mal-L | Anaplerotic reactions
32 | ME1: mal-L + nad --> co2 + nadh + pyr | Anaplerotic reactions
33 | ME2: mal-L + nadp --> co2 + nadph + pyr | Anaplerotic reactions
34 | MDH: mal-L + nad <==> h + nadh + oaa | Citric Acid Cycle
35 | ICDHyr: icit + nadp <==> akg + co2 + nadph | Citric Acid Cycle
36 | CS: accoa + h2o + oaa --> cit + coa + h | Citric Acid Cycle
37 | FUM: fum + h2o <==> mal-L | Citric Acid Cycle
38 | ACONT: cit <==> icit | Citric Acid Cycle
39 | SUCOAS: atp + coa + succ <==> adp + pi + succoa | Citric Acid Cycle
40 | AKGDH: akg + coa + nad --> co2 + nadh + succoa | Citric Acid Cycle
41 | GLUDy: glu-L + h2o + nadp <==> akg + h + nadph + nh4 | Glutamate Metabolism
42 | GLUSy: akg + gln-L + h + nadph --> (2) glu-L + nadp | Glutamate Metabolism
43 | GLUN: gln-L + h2o --> glu-L + nh4 | Glutamate Metabolism
44 | PPS: atp + h2o + pyr --> amp + (2) h + pep + pi | Glycolysis/Gluconeogenesis
45 | PYK: adp + h + pep --> atp + pyr | Glycolysis/Gluconeogenesis
46 | TPI: dhap <==> g3p | Glycolysis/Gluconeogenesis
47 | PDH: coa + nad + pyr --> accoa + co2 + nadh | Glycolysis/Gluconeogenesis
48 | PFK: atp + f6p --> adp + fdp + h | Glycolysis/Gluconeogenesis
49 | PGI: g6p <==> f6p | Glycolysis/Gluconeogenesis
50 | FBA: fdp <==> dhap + g3p | Glycolysis/Gluconeogenesis
51 | FBP: fdp + h2o --> f6p + pi | Glycolysis/Gluconeogenesis
52 | CYTBD,nottransport: (2) h + (0.5) o2 + q8h2 --> (2) h[e] + h2o + q8 | Oxidative Phosphorylation
53 | ATPS4r,nottransport: adp + (4) h[e] + pi <==> atp + (3) h + h2o | Oxidative Phosphorylation
54 | ATPM: atp + h2o --> adp + h + pi | Oxidative Phosphorylation
55 | NADTRHD: nad + nadph --> nadh + nadp | Oxidative Phosphorylation
56 | ADK1: amp + atp <==> (2) adp | Oxidative Phosphorylation
57 | NADH16,nottransport: (4) h + nadh + q8 --> (3) h[e] + nad + q8h2 | Oxidative Phosphorylation
58 | THD2,nottransport: (2) h[e] + nadh + nadp --> (2) h + nad + nadph | Oxidative Phosphorylation
59 | SUCDi: q8 + succ --> fum + q8h2 | Oxidative Phosphorylation
60 | FRD7: fum + q8h2 --> q8 + succ | Oxidative Phosphorylation
61 | PGL: 6pgl + h2o --> 6pgc + h | Glycolysis/Gluconeogenesis
62 | G6PDH2r: g6p + nadp <==> 6pgl + h + nadph | Pentose Phosphate Pathway
63 | TKT1: r5p + xu5p-D <==> g3p + s7p | Pentose Phosphate Pathway
64 | TKT2: e4p + xu5p-D <==> f6p + g3p | Pentose Phosphate Pathway
65 | TALA: g3p + s7p <==> e4p + f6p | Pentose Phosphate Pathway
66 | RPE: ru5p-D <==> xu5p-D | Pentose Phosphate Pathway
67 | GND: 6pgc + nadp --> co2 + nadph + ru5p-D | Pentose Phosphate Pathway
68 | ACALD: acald + coa + nad <==> accoa + h + nadh | Pyruvate Metabolism
69 | ACKr: ac + atp <==> actp + adp | Pyruvate Metabolism
70 | PTAr: accoa + pi <==> actp + coa | Pyruvate Metabolism
71 | LDH_D: lac-D + nad <==> h + nadh + pyr | Pyruvate Metabolism
72 | PFL: coa + pyr --> accoa + for | Pyruvate Metabolism
Table A.2: Reaction Universe
* Reactions marked with an asterisk (colored red in the original table) are environment-general super-essential reactions.
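The reaction strings in Table A.2 follow a simple notation: a reaction name, a colon, and an equation in which "<==>" marks a reversible and "-->" an irreversible reaction, with optional stoichiometric coefficients in parentheses. As an illustration only, the following Python sketch turns one such string into a stoichiometry dictionary (negative for substrates, positive for products). The function names are hypothetical and the parser is an assumption about the notation as printed here, not code from the thesis.

```python
import re

# Optional "(coefficient)" followed by a metabolite identifier such as h[e] or glu-L.
_TERM = re.compile(r"^(?:\((?P<coef>[0-9.]+)\)\s+)?(?P<met>\S+)$")


def parse_side(side, sign):
    """Turn 'a + (2) b' into {'a': sign * 1.0, 'b': sign * 2.0}."""
    stoich = {}
    for term in side.split(" + "):
        m = _TERM.match(term.strip())
        if m is None:
            raise ValueError(f"cannot parse term: {term!r}")
        coef = float(m.group("coef") or 1.0)
        met = m.group("met")
        stoich[met] = stoich.get(met, 0.0) + sign * coef
    return stoich


def parse_reaction(line):
    """Return (name, reversible, stoichiometry) for one reaction string."""
    name, equation = line.split(":", 1)
    reversible = "<==>" in equation
    left, right = equation.split("<==>" if reversible else "-->")
    stoich = parse_side(left, -1.0)
    for met, coef in parse_side(right, 1.0).items():
        stoich[met] = stoich.get(met, 0.0) + coef
    return name.strip(), reversible, stoich
```

Collecting these dictionaries for all reactions of a genotype gives the columns of the stoichiometric matrix that a flux balance analysis would operate on. The Biomass entry carries no equation in the table and is not handled by this sketch.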
Abbreviation: Reaction full name
ACALD: acetaldehyde dehydrogenase (acetylating)
ACALDt: acetaldehyde reversible transport
ACKr: acetate kinase
ACt2r: acetate reversible transport via proton symport
ADK1: adenylate kinase
AKGDH: 2-Oxoglutarate dehydrogenase
AKGt2r: 2-oxoglutarate reversible transport
ATPM: ATP maintenance requirement
ATPS4r: ATP synthase (four protons for one ATP)
Biomass: Biomass Objective Function with GAM
CO2t: CO2 transporter via diffusion
CS: citrate synthase
CYTBD: cytochrome oxidase bd
D_LACt2: D-lactate transport via proton symport
ENO: enolase
FBA: fructose-bisphosphate aldolase
FBP: fructose-bisphosphatase
FORt2: formate transport via proton symport
FORti: formate transport via diffusion
FRD7: fumarate reductase
FRUpts2: Fructose transport via PEP:Pyr PTS
FUM: fumarase
FUMt2_2: Fumarate transport via proton symport
G6PDH2r: glucose 6-phosphate dehydrogenase
GAPD: glyceraldehyde-3-phosphate dehydrogenase
GLCpts: D-glucose transport via PEP:Pyr PTS
GLNS: glutamine synthetase
GLNabc: L-glutamine transport via ABC system
GLUDy: glutamate dehydrogenase (NADP)
GLUN: glutaminase
GLUSy: glutamate synthase (NADPH)
GLUt2r: L-glutamate transport via proton symport
GND: phosphogluconate dehydrogenase
H2Ot: H2O transport via diffusion
ICDHyr: isocitrate dehydrogenase (NADP)
ICL: Isocitrate lyase
LDH_D: D-lactate dehydrogenase
MALS: malate synthase
MALt2_2: Malate transport via proton symport (2 H)
MDH: malate dehydrogenase
ME1: malic enzyme (NAD)
ME2: malic enzyme (NADP)
NADH16: NADH dehydrogenase
NADTRHD: NAD transhydrogenase
NH4t: ammonia reversible transport
O2t: O2 transport via diffusion
PDH: pyruvate dehydrogenase
PFK: phosphofructokinase
PFL: pyruvate formate lyase
PGI: glucose-6-phosphate isomerase
PGK: phosphoglycerate kinase
PGL: 6-phosphogluconolactonase
PGM: phosphoglycerate mutase
PIt2r: phosphate reversible transport via proton symport
PPC: phosphoenolpyruvate carboxylase
PPCK: phosphoenolpyruvate carboxykinase
PPS: phosphoenolpyruvate synthase
PTAr: phosphotransacetylase
PYK: pyruvate kinase
PYRt2r: pyruvate reversible transport
RPE: ribulose 5-phosphate 3-epimerase
RPI: ribose-5-phosphate isomerase
SUCCt2_2: succinate transport via proton symport (2 H)
SUCCt3: succinate transport out via proton antiport
SUCDi: succinate dehydrogenase (irreversible)
SUCOAS: succinyl-CoA synthetase (ADP-forming)
TALA: transaldolase
THD2: NAD(P) transhydrogenase
TKT1: transketolase
TKT2: transketolase
TPI: triose-phosphate isomerase
Table A.3: Reaction abbreviations with their corresponding full name
Abbreviation: Metabolite full name
13dpg: 3-Phospho-D-glyceroyl phosphate
2pg: D-Glycerate 2-phosphate
3pg: 3-Phospho-D-glycerate
6pgc: 6-Phospho-D-gluconate
6pgl: 6-phospho-D-glucono-1,5-lactone
ac: Acetate
ac[e]: Acetate (extracellular)
acald: Acetaldehyde
acald[e]: Acetaldehyde (extracellular)
accoa: Acetyl-CoA
actp: Acetyl phosphate
adp: ADP
akg: 2-Oxoglutarate
akg[e]: 2-Oxoglutarate (extracellular)
amp: AMP
atp: ATP
cit: Citrate
co2: CO2
co2[e]: CO2 (extracellular)
coa: Coenzyme A
dhap: Dihydroxyacetone phosphate
e4p: D-Erythrose 4-phosphate
f6p: D-Fructose 6-phosphate
fdp: D-Fructose 1,6-bisphosphate
for: Formate
for[e]: Formate (extracellular)
fru[e]: D-Fructose (extracellular)
fum: Fumarate
fum[e]: Fumarate (extracellular)
g3p: Glyceraldehyde 3-phosphate
g6p: D-Glucose 6-phosphate
glc-D[e]: D-Glucose (extracellular)
gln-L: L-Glutamine
gln-L[e]: L-Glutamine (extracellular)
glu-L: L-Glutamate
glu-L[e]: L-Glutamate (extracellular)
glx: Glyoxylate
h2o: H2O
h2o[e]: H2O (extracellular)
h: H+
h[e]: H+ (extracellular)
icit: Isocitrate
lac-D: D-Lactate
lac-D[e]: D-Lactate (extracellular)
mal-L: L-Malate
mal-L[e]: L-Malate (extracellular)
nad: Nicotinamide adenine dinucleotide
nadh: Nicotinamide adenine dinucleotide - reduced
nadp: Nicotinamide adenine dinucleotide phosphate
nadph: Nicotinamide adenine dinucleotide phosphate - reduced
nh4: Ammonium
nh4[e]: Ammonium (extracellular)
o2: O2
o2[e]: O2 (extracellular)
oaa: Oxaloacetate
pep: Phosphoenolpyruvate
pi: Phosphate
pi[e]: Phosphate (extracellular)
pyr: Pyruvate
pyr[e]: Pyruvate (extracellular)
q8: Ubiquinone-8
q8h2: Ubiquinol-8
r5p: alpha-D-Ribose 5-phosphate
ru5p-D: D-Ribulose 5-phosphate
s7p: Sedoheptulose 7-phosphate
succ: Succinate
succ[e]: Succinate (extracellular)
succoa: Succinyl-CoA
xu5p-D: D-Xylulose 5-phosphate
Table A.4: Metabolite abbreviations with their corresponding full name
Number | Metabolite | Abbreviation | Building Blocks Produced
1 | d-glucose-6-phosphate | G6P | glycogen, LPS
2 | d-fructose-6-phosphate | F6P | cell wall
3 | d-ribose-5-phosphate | R5P | His, Phe, Trp, nucleotides
4 | d-erythrose-4-phosphate | E4P | Phe, Trp, Tyr
5 | d-glyceraldehyde-3-phosphate | GAP | lipids
6 | glycerate-3-phosphate | 3PG | Cys, Gly, Ser
7 | phosphoenolpyruvate | PEP | Tyr, Trp
8 | pyruvate | PYR | Ala, Ile, Lys, Leu, Val
9 | acetyl-CoA | ACA | Leu, lipids
10 | 2-ketoglutarate | 2KG | Glu, Gln, Arg, Pro
11 | succinyl-CoA | SCA | Met, Lys, tetrapyrroles (e.g., heme)
12 | oxaloacetate | OXA | Asn, Asp, Ile, Lys, Met, Thr, nucleotides
Table A.5: The 12 Precursor Metabolites for Biomass in E. coli
Size | Acetate | Alpha-ketoglutarate | Fructose | Fumarate | Glucose | Glutamate | Lactate | Malate | Pyruvate | Succinate
23 | # | # | # | # | # | # | # | # | # | #
24 | # | # | 0.245509 | # | 0.245509 | # | # | # | # | #
25 | # | # | 0.284402 | # | 0.291674 | # | # | # | # | #
26 | # | 0.247897 | 0.285785 | # | 0.297915 | # | # | 0.225766 | # | #
27 | # | 0.296468 | 0.271008 | 0.229719 | 0.284716 | 0.204028 | 0.214479 | 0.297053 | -0.109561 | #
28 | # | 0.316424 | 0.248776 | 0.310452 | 0.261409 | 0.229004 | 0.262913 | 0.341183 | -0.122586 | 0.414439
29 | # | 0.323818 | 0.223363 | 0.363142 | 0.233254 | 0.229379 | 0.293769 | 0.373406 | -0.111056 | 0.442872
30 | # | 0.323528 | 0.197061 | 0.402816 | 0.203411 | 0.221687 | 0.318483 | 0.398976 | -0.085891 | 0.423077
31 | -0.076234 | 0.318719 | 0.171133 | 0.434316 | 0.173818 | 0.212081 | 0.337693 | 0.420515 | -0.052802 | 0.39317
32 | -0.057669 | 0.311856 | 0.146351 | 0.459697 | 0.145638 | 0.203951 | 0.349419 | 0.439418 | -0.015326 | 0.366352
33 | -0.034785 | 0.304575 | 0.123226 | 0.479824 | 0.119584 | 0.199332 | 0.056399 | 0.456130 | 0.024268 | 0.346262
34 | -0.016827 | 0.297775 | 0.102215 | 0.494938 | 0.096169 | 0.199179 | 0.054004 | 0.470832 | 0.064616 | 0.332855
35 | -0.002346 | 0.291776 | 0.122977 | 0.504959 | 0.119983 | 0.203409 | 0.049154 | 0.482891 | 0.105038 | 0.324856
36 | 0.013199 | 0.286485 | 0.118814 | 0.509726 | 0.118148 | 0.211194 | 0.043527 | 0.491964 | 0.145286 | 0.320661
37 | 0.033755 | 0.281523 | 0.152555 | 0.509183 | 0.132932 | 0.221299 | 0.039706 | 0.497485 | 0.185183 | 0.318772
38 | 0.060952 | 0.276413 | 0.117768 | 0.503446 | 0.105734 | 0.232389 | 0.040547 | 0.499035 | 0.224157 | 0.318051
39 | 0.094238 | 0.270626 | 0.096782 | 0.492797 | 0.094827 | 0.243239 | 0.048237 | 0.496339 | 0.260841 | 0.317766
40 | 0.130604 | 0.263664 | 0.078914 | 0.477637 | 0.063182 | 0.252852 | 0.063478 | 0.489224 | 0.293001 | 0.317497
41 | 0.165651 | 0.255044 | 0.105514 | 0.458395 | 0.087231 | 0.260481 | 0.084859 | 0.477654 | 0.317817 | 0.316923
42 | 0.194734 | 0.244292 | 0.141776 | 0.435373 | 0.120396 | 0.265595 | 0.108909 | 0.461527 | 0.332408 | 0.315609
43 | 0.214056 | 0.230808 | 0.184972 | 0.408562 | 0.160083 | 0.267839 | 0.130665 | 0.440419 | 0.334441 | 0.312875
44 | 0.221219 | 0.213716 | 0.230682 | 0.377509 | 0.202087 | 0.266948 | 0.144466 | 0.413642 | 0.322721 | 0.307814
45 | 0.216652 | 0.191686 | 0.273515 | 0.341318 | 0.241251 | 0.262646 | 0.219134 | 0.379735 | 0.297666 | 0.299555
46 | 0.203585 | 0.163099 | 0.309139 | 0.299177 | 0.273401 | 0.254808 | 0.189878 | 0.337225 | 0.261761 | 0.288053
47 | 0.188863 | 0.127473 | 0.338505 | 0.251967 | 0.299607 | 0.244216 | 0.151528 | 0.286113 | 0.220425 | 0.275873
48 | 0.184391 | 0.090656 | 0.375231 | 0.208735 | 0.334718 | 0.235365 | 0.121142 | 0.234229 | 0.185241 | 0.271146
49 | 0.210009 | 0.074285 | 0.451441 | 0.196631 | 0.415387 | 0.243412 | 0.140131 | 0.210579 | 0.185499 | 0.291718
50 | 0.290916 | 0.124994 | 0.587042 | 0.261953 | 0.572569 | 0.282588 | 0.266273 | 0.272494 | 0.279695 | 0.367483
Table A.6. Spearman's ρ between robustness and growth rate for genotype networks of different sizes and
carbon sources.
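
For readers who want to reproduce a coefficient of the kind reported in Table A.6, the following is a minimal sketch of Spearman's rank correlation, computed as the Pearson correlation of average ranks. It is not the code used in this thesis. The function names and the toy robustness and growth-rate values are hypothetical and only illustrate the calculation for one genotype network.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one robustness and one growth-rate value per genotype.
robustness = [0.10, 0.25, 0.25, 0.40, 0.55]
growth_rate = [0.02, 0.05, 0.04, 0.09, 0.08]
rho = spearman_rho(robustness, growth_rate)
print(f"Spearman's rho = {rho:.6f}")
```

The sketch assumes neither input is constant, since a constant input gives zero variance and the coefficient is then undefined.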