
THE AGING PROCESS OF C. ELEGANS VIEWED THROUGH
TIME DEPENDENT PROTEIN EXPRESSION ANALYSIS
by
DAVID JAMES ALOUANI
Submitted in partial fulfillment of the requirements
for the degree of Master of Science
SYSTEMS BIOLOGY AND BIOINFORMATICS
CENTER FOR PROTEOMICS AND BIOINFORMATICS
SCHOOL OF MEDICINE
CASE WESTERN RESERVE UNIVERSITY
August 2015
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the thesis of
DAVID JAMES ALOUANI
Candidate for the degree of
Master of Science
Committee Chair
Dr. Masaru Miyagi
Committee Member
Dr. David Lodowski
Committee Member
Dr. Gurkan Bebek
Date of Defense
March 25, 2015
*We also certify that written approval has been obtained
for any proprietary material contained therein
To my wonderful wife, Katherine, for always being there for me and for being a
continuous source of support, inspiration and love.
To my two adorable daughters Sarah and Sophie for bringing joy and smiles to my life.
Thank you for helping me become a better person, and for making this work possible.
Table of Contents
List of Tables
List of Figures
Acknowledgments
Abstract
1. Introduction
2. Proteomics of C. elegans
  2.1. Materials
  2.2. C. elegans Strain, Maintenance and Age Synchronization
  2.3. Labeling Bacteria with (12C6-Lys) and (13C6-Lys) Lysine
  2.4. Preparation and Labeling of C. elegans
  2.5. Separation of Live and Dead Worms
  2.6. Proteomic Sample Preparation
  2.7. LC-MS/MS Analysis
  2.8. Identification and Quantification of Peptides and Proteins
3. Methods and Algorithms
  3.1. Background
  3.2. Imputation Algorithm and Normalization
  3.3. Statistical Testing
  3.4. Information Theory: Shannon Entropy and Protein Network Properties
  3.5. Machine-Learning: Classification and Feature Selection Algorithm
4. Results and Discussion
  4.1. Background
  4.2. Outlier Detection
  4.3. Differential Expression between Age Groups
  4.4. Ontology and Pathway Association
  4.5. Network Entropy
  4.6. Feature Selection and Classification
  4.7. Discussion
Appendix A Supplementary Tables
Bibliography
List of Tables
Table 4.1: Differential expression (day 5 vs day 8): Proteins for which the expression levels are significantly greater in day 5
Table 4.2: Differential expression (day 5 vs day 8): Proteins for which the expression levels are significantly greater in day 8
Table 4.3: Differential expression (day 5 vs day 11): Proteins for which the expression levels are significantly greater in day 11
Table 4.4: Differential expression (day 5 vs day 16): Proteins for which the expression levels are significantly greater in day 5
Table 4.5: Differential expression (day 5 vs day 16): Proteins for which the expression levels are significantly greater in day 16
Table 4.6: Differential expression (day 8 vs day 16): Proteins for which the expression levels are significantly greater in day 8
Table 4.7: Differential expression (day 5 vs day 16): Proteins for which the expression levels are significantly greater in day 16
Table 4.8: Differential expression (day 11 vs day 16): Proteins for which the expression levels are significantly greater in day 11
Table 4.9: Feature selection – Features selected based on the level of predictive power of the age of C. elegans, based on protein expression
List of Figures
Figure 2.1: Experimental Workflow
Figure 3.1: Imputation Sketch
Figure 3.2: Flow Chart of the Classification Procedure
Figure 4.1: Fraction of Proteins with Outlier Expression Value as Function of the Age of C. elegans
Figure 4.2: Clustered Protein expression
Figure 4.3: Fraction of Differentially Expressed Proteins
Figure 4.4: Distribution of Significant Biological Processes
Figure 4.5: Protein-Protein Interaction Network
Figure 4.6: Distribution of the Number of First Neighbors for each Protein
Figure 4.7: Distribution of Protein’s Entropy
Figure 4.8: Distribution of Proteins’ First Neighbors
Figure 4.9: Proteins Highly Ranked in Differential Expression and Entropy
Figure 4.10: Distribution of Selected Features
Figure 4.11: Classifier Accuracy
Figure 4.12: Feature Selection Overlap with Proteins Highly Ranked in Differential Expression
Figure 4.13: Distribution of Biological Processes for Selected Features
Acknowledgements
I would like to thank my thesis advisor Dr. Masaru Miyagi for his guidance,
support and constant availability throughout this research project. I would like to
especially acknowledge his most valuable role as a mentor in sharing his knowledge and
wisdom. Most importantly, I would like to express my gratitude to Dr. Miyagi for
introducing me to the fields of genomics, transcriptomics, proteomics, metabolomics and
mass spectroscopy. I am very thankful for his help.
I would like to thank the thesis committee members, Dr. David Lodowski
and Dr. Gurkan Bebek for taking the time to review this thesis and for the highly
constructive and valuable suggestions and comments they made. They undoubtedly
improved the quality of this work.
THE AGING PROCESS OF C. ELEGANS VIEWED THROUGH
TIME DEPENDENT PROTEIN EXPRESSION ANALYSIS
Abstract
by
DAVID JAMES ALOUANI
The main goal of the present effort is to develop a comprehensive computational
and statistical framework for analyzing large-scale proteomics data and to understand the
aging process of Caenorhabditis elegans (C. elegans) nematodes based on the
age-dependent pattern of protein expression. Modern numerical methods were used for
the analysis, including outlier detection, imputation, entropy and feature selection.
Protein expression in C. elegans was found to be highly age dependent. Increased
expression in younger nematodes was associated with the activation of metabolic
pathways. House-keeping processes, such as proteolysis, protein biogenesis and assembly
were found to be important at older age. Network entropy was also found to be age
dependent for a significant fraction of proteins. Increased protein expression was
associated with reduced entropy. Feature selection analysis further showed that proteins
linked to metabolic processes are most predictive of the age of C. elegans nematodes,
based on their level of expression.
Chapter 1
Introduction
Human life expectancy at birth (LEB) has seen more than a twofold increase in
the last two centuries [Christensen et al. 2009]. A portion of this increase can be attributed to
external factors such as environment, socioeconomics, development and the
industrialization of society. The improved quality of healthcare and preventive medicine
has also played a major role in the extended human longevity seen today. Genetics is now
well-accepted as one of the key players in mediating longevity at the single individual
level. The role of the genetic component, as a factor in human longevity, is estimated at
25% compared to all remaining external cofactors [Hjelmborg et al. 2006]. The role of
genes and inherited traits is still far from being fully understood. Steady progress,
however, has been made since the groundbreaking mapping of the human genome
[Schmutz et al. 2004]. The genetics of aging is in fact a multilayered and highly complex
problem. Linking the aging process simply to “genes” is a gross oversimplification. The
central dogma of biology is based on a two stage process, transcription from DNA to
RNA and translation from RNA to proteins. Further steps, such as post-translation
modification of proteins, protein-protein interactions and enzymatic reactions within a
larger metabolome are all assumed to be involved, to varying degrees, in the aging
process. At the cellular level, aging is linked to the accumulation of aggregates and
misfolded proteins, which in some cases lead to cancers or to neurodegenerative
diseases such as Alzheimer's, Parkinson's and Huntington's [Clarke et al. 2003, Irvine et al. 2008,
Kammeyer et al. 2015, Walker 2007]. Extended longevity also depends on the ability of
cells to counter the effects of various types of stress. These include oxidative, hypoxic, heat
shock and osmotic stress.
To understand the aging process, this thesis focused on the age-dependent
changes taking place at the protein level in the model organism Caenorhabditis elegans
(C. elegans). The protein expression process is neither unidirectional nor linear as it
relates to aging. The spectrum of protein expression is broad, ranging from rarely
expressed proteins to expression levels that steadily increase or decrease over time
[Alberts et al. 2002, Hsu et al. 2003, Yang et al. 2011]. This work implemented
state-of-the-art experimental methods for protein expression analysis and quantification. These
included isotopic labeling techniques, age synchronization and mass spectrometry. In
addition, the age-dependent protein expression data were analyzed using a variety of
numerical and statistical approaches and algorithms, some of which were developed during
the production of this thesis. This work looked at the age-dependent protein expression of
C. elegans across the adult life span of these nematodes.
C. elegans is a model organism that is often used for genomics analysis due to its
well mapped genome, short life span and cost efficient experimental setup. C. elegans is
a worm-like organism with an approximate length of 1 mm [Wood 1988]. It was the first
multicellular organism to have its whole genome sequenced in 1998 [The C. elegans
Genome Sequencing Consortium 1998, Hillier et al. 2005]. Important advances were also
made toward mapping the C. elegans neural network (connectome) [White et al. 1986],
which is a first for any multicellular organism. The life expectancy of C. elegans is
between 2 and 3 weeks. These nematodes undergo multiple developmental phases, which
comprise an initial embryonic phase and four larval stages (L1-L4) followed by
adulthood [Stiernagle 2006]. In terms of genetic makeup, C. elegans has five pairs of
autosomes and one pair of sex chromosomes. Its genome has approximately 20,500
protein-coding genes, and the estimated fraction of C. elegans genes with human
homologs is slightly above 1/3 [Stiernagle 2006], making C. elegans an ideal model
organism not only for aging studies but also for ontology and signaling pathway
analysis. C. elegans is therefore a highly valuable model for understanding both the
dynamics of protein expression with aging and its link to age-related diseases, e.g.
cancers and neurodegenerative diseases. The goal of the present study was to lay the
foundations for future genomic and proteomic studies of human aging. The experimental
and numerical approaches used in this work, as well as the identified proteins with a
human homolog, could be used as a starting point for future proteomic studies on this and
other model organisms.
Chapter 2
Proteomics of C. elegans
2.1 Materials
12C6-Lys and 13C6-Lys were purchased from Sigma-Aldrich (St. Louis, MO) and
Cambridge Isotope Laboratories (Tewksbury, MA), respectively. Lys-C was purchased
from Wako USA (Richmond, VA). All other chemicals were either reagent grade or
of the highest quality commercially available.
2.2 C. elegans Strain, Maintenance and Age Synchronization
WT Bristol N2 strain nematodes were used in this study as they are the strain that
has its genome sequenced [The C. elegans Genome Sequencing Consortium 1998, Hillier
et al. 2005]. For the incorporation of 12C6-Lys and 13C6-Lys, nematodes were maintained
by standard methods that included culture on peptone-free NGM plates (51 mM NaCl, 25
mM K3PO4, 5 μg/mL cholesterol, 1 mM CaCl2, 1 mM MgSO4) seeded with Escherichia
coli strain AT713. To synchronize their age, gravid nematodes were bleached and the
surviving eggs were hatched as age-synchronized nematodes. The pre-fertile period of
adulthood was identified as t = 0 and day 1 is the first day of the worm’s adulthood. The
composition of the media, solutions and protocols are reported below. The protocols are
the same as described in [Stiernagle 2006].
Preparation of NGM plates [Stiernagle 2006]:
Equipment and Reagents:

NaCl.

Agar.

Peptone.

5 mg/mL cholesterol in ethanol.

1 M KPO4 buffer pH 6.0 (108.3 g KH2PO4, 35.6 g K2HPO4, H2O to 1 liter).

1 M MgSO4.

Petri plates.

Peristaltic pump.
Methods:

Mix 3 g NaCl, 17 g agar, and 2.5 g peptone in a 2 liter Erlenmeyer flask. Add 975
mL H2O. Cover mouth of flask with aluminum foil. Autoclave for 50 min.

Cool flask in 55 °C water bath for 15 min.

Add 1 mL 1 M CaCl2, 1 mL 5 mg/mL cholesterol in ethanol, 1 mL 1 M MgSO4
and 25 mL 1 M KPO4 buffer. Swirl to mix well.

Using sterile procedures, dispense the NGM solution into petri plates using a
peristaltic pump. Fill plates 2/3 full of agar.

Leave plates at room temperature for 2-3 days before use to allow for detection of
contaminants, and to allow excess moisture to evaporate. Plates stored in an airtight container at room temperature will be usable for several weeks.
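The final concentrations quoted in Section 2.2 can be checked against this recipe with a short calculation. The sketch below is illustrative only: it assumes that the final volume is simply the sum of the liquid volumes (the volume contributed by the dissolved solids is ignored), and it uses a NaCl molar mass of 58.44 g/mol. All variable and function names are hypothetical.

```python
# Minimal sketch: final concentrations in the NGM recipe above.
# Assumption: total volume = sum of liquid volumes (solids ignored).
NACL_MOLAR_MASS = 58.44  # g/mol

water_ml = 975.0
additions_ml = {
    "CaCl2 (1 M)": 1.0,
    "cholesterol (5 mg/mL)": 1.0,
    "MgSO4 (1 M)": 1.0,
    "KPO4 buffer (1 M)": 25.0,
}
total_ml = water_ml + sum(additions_ml.values())


def final_mM(stock_molar, added_ml):
    """Final concentration (mM) of a stock after dilution into total_ml."""
    return stock_molar * 1000.0 * added_ml / total_ml


nacl_mM = 3.0 / NACL_MOLAR_MASS / (total_ml / 1000.0) * 1000.0
cacl2_mM = final_mM(1.0, additions_ml["CaCl2 (1 M)"])
mgso4_mM = final_mM(1.0, additions_ml["MgSO4 (1 M)"])
kpo4_mM = final_mM(1.0, additions_ml["KPO4 buffer (1 M)"])
# 5 mg/mL stock -> micrograms per mL in the final medium
cholesterol_ug_per_ml = 5.0 * 1000.0 * additions_ml["cholesterol (5 mg/mL)"] / total_ml

for name, value in [("NaCl (mM)", nacl_mM), ("CaCl2 (mM)", cacl2_mM),
                    ("MgSO4 (mM)", mgso4_mM), ("KPO4 (mM)", kpo4_mM),
                    ("cholesterol (ug/mL)", cholesterol_ug_per_ml)]:
    print(f"{name}: {value:.2f}")
```

Rounded to the nearest integer, the values agree with the nominal 51 mM NaCl, 1 mM CaCl2, 1 mM MgSO4, 25 mM phosphate and 5 μg/mL cholesterol.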
2.3 Labeling Bacteria with (12C6-Lys) and (13C6-Lys) Lysine
Arginine and lysine auxotrophic Escherichia coli strain AT713 was obtained from
the E. coli Genetic Stock Center at Yale University. To label AT713 bacteria with 12C6- or
13C6-Lys, bacteria were first streaked on a lysogeny broth (LB) plate and incubated
overnight at 37 °C. A single bacterial colony was then inoculated into 10 mL of LB
media and cultured overnight on an incubator shaker (37 °C, 180 rpm). 100 µL of
bacterial culture was next inoculated into 50 mL of M9 minimal medium (50 mM
Na2HPO4, 20 mM KH2PO4, 10 mM NaCl, 20 mM NH4Cl, 2 mM MgSO4, 0.1 mM CaCl2,
and 0.2% glucose) supplemented with arginine (100 μg/mL), cysteine (100 μg/mL), and
lysine (100 μg/mL, labeled with either 12C6- or 13C6-Lys) and continuously cultured on an
incubator shaker (37 °C, 200 rpm) until the absorbance of the culture at 600 nm (A600)
reached 1.0. Then, 10 mL of the resulting labeled bacteria were inoculated into 1000 mL
of M9 basal medium with the above mentioned amino acids and cultured on an incubator
shaker (37 °C, 200 rpm) until A600 reached 2.0. Bacteria were then pelleted by
centrifugation (8,000 × g, 10 min), and resuspended in 15 mL of sterile water, then were
spread onto a peptone-free NGM plate (500 μL for each 100-mm plate and 200 μL for
each 60-mm plate) and exposed to 1000 mJ/cm2 of ultraviolet light (SpectroLinker
XL-1500, Spectronics Corp, Westbury, NY) to kill the bacteria. The plates with the 12C6- and
13C6-Lys containing bacteria were stored at 4 °C until used to feed C. elegans.
2.4 Preparation and Labeling of C. elegans
WT Bristol N2 strain nematodes were transferred onto a peptone-free NGM plate
seeded with 12C6-Lys labeled E. coli. Gravid animals from the next generation of these
nematodes were bleached to collect their live eggs. Eggs were then transferred onto new
peptone-free NGM plates seeded with 12C6-Lys E. coli. The bleaching, synchronization
and plating on NGM plates seeded with 12C6-Lys labeled bacteria were repeated once
more to ensure that all the worms were reared on NGM plates seeded with the 12C6-Lys
labeled E. coli strain AT713. It is important to note that the original worm population was
frozen and had been reared on NGM plates seeded with a different E. coli strain. After
hatching, age-synchronized animals were cultured to L4 (day 0) and then transferred to
peptone-free NGM plates seeded with 12C6-Lys E. coli and containing 25 mg/L
5-fluoro-2′-deoxyuridine (Acros Organics, Pittsburgh, PA). 5-fluoro-2′-deoxyuridine is an
agent used to inhibit egg production. It was included to ensure that no offspring, i.e. worms
younger than the age-synchronized original worms, could be produced during the study.
Worms were then transferred to fresh 12C6-Lys E. coli plates after 5, 8,
11 and 15 days. Worm samples were harvested on adult days 1, 5, 8, 11 and 16.
13C6-Lys labeled worms were also prepared by feeding them on a plate seeded
with 13C6-Lys E. coli, and age-synchronized 8-day old worms were harvested (see Figure
2.1). Both the 12C6- and 13C6-Lys labeled worms were then subjected to the protocol
described below for the separation of dead/live worms.
2.5 Separation of Live and Dead Worms
The separation of live worms from dead worms was performed by sucrose density
centrifugation. This extra step was necessary to analyze only the proteome of live worms.
Worms from a 10 cm diameter plate were collected with 40 mL of distilled water into a
50 mL falcon tube and centrifuged at 2,000 × g for 2 min. The volume of resuspension
buffer was 1 mL. The collected worms were then carefully overlaid on top of chilled 30
% sucrose and centrifuged at 2,000 × g for 5 min. The upper layer containing live worms
was immediately collected into a 15 mL falcon tube, and washed twice with 1 mL of
distilled water, then water was removed by centrifugation at 2,000 × g for 2 minutes. The
collected live worms were weighed and stored at -80 °C until used.
2.6 Proteomic Sample Preparation
The worms were suspended in 250 µL of 100 mM ammonium bicarbonate
containing 4% perfluorooctanoic acid (PFOA) (w/v) [Kadiyala et al. 2010], protease
inhibitor mixture (Sigma-Aldrich) and phosphatase inhibitor mixture 3 (Sigma-Aldrich).
PFOA was used to solubilize not only soluble proteins but also membrane proteins, both
of which were analyzed in our study. Worm cells were then sonicated at 4.5 kHz three
times for 9 s with a 3-min pause on ice between the strokes, using a Virsonic 100
ultrasonic cell disrupter (SP Scientific, Warminster, PA) [Yuan et al. 2012, Vukoti et al.
2015]. The resulting protein extract was centrifuged at 15,000 × g for 10 min and the
supernatant collected. The supernatants were further reduced with 10 mM dithiothreitol
(DTT) at 37 °C for 30 min and then S-alkylated by 25 mM iodoacetamide at 25 °C for 45
min in the dark. Proteins were then precipitated by mixing with a 9-fold excess volume of
ice-cold acetone and incubated for 2 hr at -20 °C. Following the incubation, precipitated
proteins were then harvested by centrifugation at 2,400 × g for 10 min at 4 °C and the
pellet was again washed with ice-cold 90 % acetone. Protein pellets were air-dried for 5
min and then redissolved, in 50 µL of a 100 mM ammonium bicarbonate buffer
containing 8 M urea, by inversion in a water bath sonicator for 1 min [Vukoti et al. 2015].
The resulting solution was diluted with 450 μL of 100 mM ammonium bicarbonate to
reduce the urea concentration to 0.8 M and the amount of dissolved protein determined
with a DC protein assay kit (Bio-Rad, Hercules, CA). A total of 25 µg of protein was
digested with Lys-C (1:25 Lys-C to protein ratio w/w) at 37 °C for 18 hr. The digest was
desalted using Vydac C18 UltraMicro Tip Column and resuspended in 0.1 % formic acid.
A similarly prepared constant amount of a 13C6-Lys labeled reference peptide mixture
was spiked into all of the 12C6-Lys labeled samples prior to the LC-MS/MS analysis for
standardization.
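Two of the quantities in this protocol follow from simple arithmetic: the urea concentration after dilution and the mass of Lys-C needed for the digest. The short sketch below reproduces both numbers; the function names are hypothetical and the calculation assumes ideal mixing of volumes.

```python
# Minimal sketch of the dilution and enzyme-ratio arithmetic in Section 2.6.
def diluted_concentration(c_initial, v_initial, v_added):
    """Concentration after adding v_added of diluent to v_initial of solution."""
    return c_initial * v_initial / (v_initial + v_added)


def enzyme_mass(protein_ug, enzyme_to_protein=1.0 / 25.0):
    """Mass of enzyme (ug) for a given protein mass at a w/w ratio."""
    return protein_ug * enzyme_to_protein


urea_molar = diluted_concentration(8.0, 50.0, 450.0)  # 8 M urea, 50 uL + 450 uL
lysc_ug = enzyme_mass(25.0)                           # 25 ug protein, 1:25 w/w

print(f"urea after dilution: {urea_molar:.2f} M")
print(f"Lys-C required: {lysc_ug:.2f} ug")
```

The result matches the 0.8 M urea stated above and corresponds to 1 µg of Lys-C per 25 µg of protein.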
2.7 LC-MS/MS Analysis
LC-MS/MS analyses used an UltiMate 3000 LC system (Dionex Inc.) interfaced
to a Velos Pro Ion Trap and Orbitrap Elite Hybrid Mass Spectrometer (Thermo Scientific,
Bremen, Germany). The platform was operated in the nano-LC mode, using the standard
nano-ESI (Proxeon Biosystems) source. The spray voltage was set to 1.2 kV and the
temperature of the heated capillary was set to 200 °C. The solvent flow rate through the
column was maintained at 300 nL/min. Lys-C peptide digests (typically 1 μg) were
injected into a reverse-phase 0.3 × 5 mm C18 PepMap trapping column with a 5-μm
particle size (Dionex Inc.) preequilibrated with 0.1% formic acid, 1% acetonitrile (v/v).
The column was washed for 5 min with the equilibration solution at a flow rate of
25 μL/min, using an isocratic loading pump operated through an auto sampler. Next, the
trapping column was switched in-line with a reverse-phase 0.075 × 150-mm C18
Acclaim PepMap 100 column (Dionex Inc.). For analysis, the peptides were separated
chromatographically using a linear gradient of acetonitrile, from 2 to 37%, in aqueous
0.1% formic acid at a flow rate of 300 nL/min for a total time of 202 min.
The eluent was directly introduced from the reverse phase separator into the mass
spectrometer. The mass spectrometer was operated in a data-dependent MS to MS/MS
switching mode, with the 25 most intense ions in each MS scan subjected to further
MS/MS analysis. The full MS scan was performed at a resolution of
120,000 FWHM in the Orbitrap, which serves as both a mass analyzer and a detector. The
MS/MS scans were performed in the ion trap in the collision-induced
dissociation mode with a normalized collision energy of 35 eV.
The data were collected entirely in the profile mode for the full MS scan and the
centroid mode for the MS/MS scans. The profile mode offers a smooth, multi-point
representation of each peak in the spectrum, whereas the centroid mode
represents each peak as a single point. The profile mode therefore has larger data
storage requirements. In practice, the centroid mode (one data point
corresponding to the peak top per ion peak) is often used for low-resolution spectra. The dynamic
exclusion function for previously selected precursor ions applied the following
parameters: repeat count of 1, repeat duration of 40 s, exclusion duration of 90 s and
exclusion size list of 500. Xcalibur software (Version 2.2 SP1 build 48, Thermo-Finnigan
Inc., San Jose, CA) was used for instrument control, data acquisition, and data
processing.
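The data-dependent selection described above (the most intense ions per MS scan, with previously selected precursors excluded for a fixed time) can be illustrated with a simplified sketch. This is not the instrument firmware logic: it assumes a repeat count of 1 (an ion is excluded as soon as it has been selected once), matches ions by a hypothetical fixed m/z tolerance `tol`, and drops the oldest entry when the exclusion list is full. All names are illustrative.

```python
# Simplified sketch of top-N precursor selection with dynamic exclusion.
# Assumptions: repeat count 1, fixed m/z tolerance, oldest entry evicted first.
def select_precursors(scans, top_n=25, exclusion_s=90.0, max_list=500, tol=0.01):
    """scans: list of (time_s, [(mz, intensity), ...]) in acquisition order.

    Returns a list of (time_s, [selected m/z values]) per MS scan.
    """
    excluded = []  # (mz, expiry_time), oldest first
    picks = []
    for time_s, peaks in scans:
        # Drop exclusion entries whose exclusion duration has elapsed.
        excluded = [(mz, expiry) for mz, expiry in excluded if expiry > time_s]
        chosen = []
        for mz, _intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
            if len(chosen) == top_n:
                break
            if any(abs(mz - ex_mz) <= tol for ex_mz, _ in excluded):
                continue  # recently fragmented, skip
            chosen.append(mz)
            excluded.append((mz, time_s + exclusion_s))
            if len(excluded) > max_list:
                excluded.pop(0)  # keep the list within its size limit
        picks.append((time_s, chosen))
    return picks


peaks = [(500.0, 10.0), (600.0, 5.0), (700.0, 1.0)]
demo = select_precursors([(0.0, peaks), (30.0, peaks), (100.0, peaks)], top_n=2)
for time_s, chosen in demo:
    print(time_s, chosen)
```

In this toy run the two most intense ions are selected in the first scan, the third ion gets its turn while they are excluded, and the first two become eligible again once the 90 s exclusion has elapsed. This is how dynamic exclusion gives lower-abundance peptides a chance to be fragmented.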
Figure 2.1: Experimental Workflow. (a) The wild-type C. elegans nematodes are age
synchronized and their growth is followed through their life span. 12C6-Lys labeled
worms were harvested at days 1, 5, 8, 11 and 16, respectively, proteins extracted, and
digested by Lys-C. Eight day old 13C6-Lys labeled worms were also prepared, proteins
extracted, and digested by Lys-C. The constant amount of 13C6-Lys labeled digest was
mixed with all the 12C6-Lys labeled digest from different ages, and analyzed by
LC-MS/MS. (b) Representative mass-spectrometer signal at days 1, 8 and 16 shows the
peptide intensity of 12C6- (L) vs 13C6-Lys (H) labeled C. elegans [Vukoti et al. 2015]. The
6 Da difference reflects the mass difference between 12C6-Lys and 13C6-Lys. The increase
in intensity of the 12C6-Lys labeled peptides shows the relative change in protein
expression with age.
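The 6 Da separation mentioned in the caption, and the corresponding spacing on the m/z axis, can be worked out from the isotope masses. The sketch below assumes one labeled lysine per peptide (as expected for Lys-C peptides without missed cleavages) and uses 13.00335 Da for the mass of 13C; the function name is hypothetical.

```python
# Minimal sketch: mass and m/z spacing between 12C6-Lys (L) and 13C6-Lys (H) peptides.
C13_MINUS_C12 = 13.0033548 - 12.0  # Da gained per carbon replaced by 13C


def pair_spacing(charge, n_labeled_lys=1):
    """m/z distance between the light and heavy peak of a peptide pair."""
    return 6 * C13_MINUS_C12 * n_labeled_lys / charge


for z in (1, 2, 3):
    print(f"charge {z}+: spacing = {pair_spacing(z):.4f} m/z")
```

For a singly charged peptide the spacing is about 6.02 Da; it shrinks to about 3.01 and 2.01 m/z units for doubly and triply charged ions, which is where the pairs appear in the spectra.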
2.8 Identification and Quantification of Peptides and Proteins
Mascot database search software (Version 2.2.0, Matrix Science, London, UK) in
conjunction with the (December 2014) Wormpep database was used to identify the
proteins from the obtained MS/MS peptide spectra. The carbamidomethylation of
cysteine was set as a fixed modification. During this experiment, cysteine residues were
alkylated with iodoacetamide to prevent the formation of disulfide bridges. All cysteine
residues were therefore expected to be carbamidomethylated. The oxidation of
methionine to methionine sulfoxide, acetylation of the N-terminal amino group and the
replacement of C-terminal lysine with 13C6-Lys were considered variable modifications for
identification. The mass tolerance was set to 10 ppm for the precursor ion and to 0.8 Da
for the product ion. Strict Lys-C specificity was applied and missed cleavages were not
allowed. Peptides with at least six amino acid residues and with a minimum Mascot score
of 20 were considered significant. False discovery rate was calculated from
2N(decoy)/[N(decoy) + N(target)] and threshold rate was set to be ≤ 0.01 for peptide
identification. Protein isoforms and proteins that could not be distinguished based on the
peptides identified were grouped and reported as a single protein group. ProteomicsTools
version 3.8.7 was used for obtaining the intensities of 12C6- and 13C6-Lys labeled proteins
[Guo et al. 2014], from which the fraction of 12C6-Lys protein expression was calculated.
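The decoy-based false discovery rate and the light-label fraction can be sketched in a few lines. The FDR formula is the one quoted above; the threshold search and the definition of the 12C6-Lys fraction as L / (L + H) are assumptions made for illustration, since the text does not spell them out, and all names are hypothetical.

```python
# Minimal sketch: decoy-based FDR and 12C6-Lys (light) fraction.
def fdr(n_decoy, n_target):
    """FDR = 2 N(decoy) / [N(decoy) + N(target)], as quoted in the text."""
    return 2.0 * n_decoy / (n_decoy + n_target)


def score_cutoff(psms, max_fdr=0.01):
    """Lowest score threshold whose accepted set keeps FDR <= max_fdr.

    psms: list of (score, is_decoy) peptide-spectrum matches.
    Returns None if no threshold satisfies the limit.
    """
    best = None
    n_decoy = n_target = 0
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
        if fdr(n_decoy, n_target) <= max_fdr:
            best = score
    return best


def light_fraction(light_intensity, heavy_intensity):
    """Assumed definition: fraction of 12C6-Lys signal in the L/H pair."""
    return light_intensity / (light_intensity + heavy_intensity)


psms = [(50, False), (40, False), (30, True), (20, False)]
print(score_cutoff(psms, max_fdr=0.01), light_fraction(3.0, 1.0))
```

With the toy list above, the single decoy hit at score 30 forces the 1% threshold up to a score of 40, and an L/H pair with intensities 3 and 1 gives a light fraction of 0.75.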
Chapter 3
Methods and Algorithms
3.1 Background
Genomics and proteomics data are characterized by both high throughput and
high noise levels. The high throughput nature of the data is often reflected in the
disproportionally large number of features (e.g. genes, proteins) compared to the
available samples. This makes it difficult to determine the true link between a subset of
features and a disease or a biological phenomenon. Higher noise levels lead to strong
variability between and within samples, caused by a combination of missing data and the
presence of outliers. There are many potential sources of noise in genomics and
proteomics experiments including experimental protocols, instrument resolution and
variations in the reagents or errors in quantification [Bantscheff et al. 2007]. Some of this
variability can also be biological, e.g. not all proteins are expressed in the sample at all
times. Therefore, there is a certain bias, due to the finite sample size, toward highly
expressed proteins. To avoid some of these issues, it is crucial to have both biological and
technical replicates as well as to use modern statistical and machine-learning methods for
data reconstruction, removal of outliers and feature selection [Habibi et al. 2014, Kourou
et al. 2014].
3.2 Imputation Algorithm and Normalization
In the present study, the nematode populations originated from different
parental lineages, making our replicates biological in nature. This allowed for a stronger
validation of effects or trends that are overlapping between replicates. Protein
identification was done (see Chapter 2) using MS/MS analysis, which was then scored
using Mascot database search software (Version 2.2.0, Matrix Science, London, UK) in
conjunction with the (December 2014) Wormpep database. Proteins were quantified from
the subpopulation of peptides that have a level of collinearity, measured by the correlation
coefficient (R2), of ≥ 0.9 as estimated by ProteomicsTools 3.8.7 [Guo et al. 2014].
Due to the stated variability in protein expression across replicates and across
C. elegans life cycle, a certain amount of reconstruction of missing expression data is
required. The data reconstruction process is called “imputation”. The imputation of
missing data was based on protein co-expression within the same time point across
replicates or within the same replicate across different time points. It is important to note
that data imputation in high throughput experiments has been extensively studied using a
variety of algorithms and approaches; a short comparative review of such methods is
given in [Troyanskaya et al. 2001, Jerez et al. 2010]. Some of these include machine-learning based methods, both supervised and unsupervised, for instance, multi-layer
perceptron (MLP) and self-organizing maps (SOM). Other imputation methods are based
on Singular Value Decomposition (SVD) or weighted K-nearest neighbors (KNN)
[Troyanskaya et al. 2001].
Figure 3.1: Imputation Sketch. The missing expression of protein p1 at day 5, is
reconstructed based on the N neighbors, using a co-expression metric. The co-expression
is measured on at least two days other than day 5 (see pseudo-code 3.1). Proteins that are
not expressed in day 5, are excluded from the reconstruction procedure. The horizontal
axis of this figure shows the proteins extraction days (day 1, 5, 8, 11 and 16). The vertical
axis shows the list of proteins. The squares marked in black are missing expression data.
The green circle in protein p1 indicates the data point to be reconstructed for this protein.
Proteins p2 and pi cannot be used for this reconstruction as they have missing expression
in day 5. The yellow squares indicate both the time points at which protein expression is
available and where the co-expression with protein p1 is to be measured.
Pseudo-code 3.1
for protein X in replicate Y:
    if n(X, Y) ≥ 3 then:
        for day(Z) in DaySet1(X, Y):
            m = 0
            if ProtSet1(day(Z), Y) is not empty:
                for X' in ProtSet1(day(Z), Y):
                    if number of days in DaySet2(X', X, Y) ≥ 2 then:
                        calculate Corr(X', X, days, Y)
                        m = m + 1
            if m < N then:
                imputation not completed for X in replicate Y at day(Z)
            else:
                calculate the weighted expression of X in replicate Y at day(Z)
where:
    n(X, Y) = number of days protein X is expressed in replicate Y.
    DaySet1(X, Y) = days when protein X is not expressed in replicate Y.
    ProtSet1(day(Z), Y) = proteins in replicate Y that are expressed in day(Z).
    DaySet2(X', X, Y) = days when protein X' and X are both expressed in Y.
    Corr(X', X, days, Y) = correlation coefficient between X' and X during days in Y.
In this work, the weighted K-nearest neighbor algorithm (KNN) was utilized for the
imputation of missing protein expression data. The formulation of this algorithm used
protein co-expression, i.e. correlation of expression, as a metric for imputation. The
implementation of this algorithm is described in pseudo-code 3.1. The imputed value of
the expression of protein X in replicate Y is given in Eq. 3.1, where the parameters are
defined as in pseudo-code 3.1:
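\[
E(X, \mathrm{Day}(Z)) = \frac{1}{\sum_{l=1}^{N} \mathrm{Corr}(X', X, \mathrm{days})}
\sum_{l=1}^{N} \left[ \mathrm{Corr}(X', X, \mathrm{days}) \,
\frac{E(X, \mathrm{days})}{E(X', \mathrm{days})} \, E(X', \mathrm{day}(Z))
\right]_{X' \in \mathrm{ProtSet1}(\mathrm{day}(Z),\, Y)}
\tag{3.1}
\]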
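A minimal Python sketch of the procedure in pseudo-code 3.1 and Eq. 3.1 is given below for a single replicate. It is an illustration under stated assumptions rather than the code used in this work: E(X, days) is taken to be the mean expression over the shared days, only positively correlated neighbors are used, the N most strongly correlated neighbors are kept, and the default N = 2 is arbitrary. The names `pearson` and `impute_replicate` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)


def impute_replicate(expr, n_neighbors=2, min_days=3, min_shared=2):
    """expr: dict protein -> list of expression values per day (None = missing)."""
    imputed = {p: list(values) for p, values in expr.items()}
    for x, row in expr.items():
        if sum(v is not None for v in row) < min_days:   # n(X, Y) >= 3
            continue
        for z, value in enumerate(row):                   # DaySet1(X, Y)
            if value is not None:
                continue
            candidates = []
            for xp, other in expr.items():                # ProtSet1(day(Z), Y)
                if xp == x or other[z] is None:
                    continue
                shared = [d for d in range(len(row))      # DaySet2(X', X, Y)
                          if row[d] is not None and other[d] is not None]
                if len(shared) < min_shared:
                    continue
                a = [row[d] for d in shared]
                b = [other[d] for d in shared]
                corr = pearson(a, b)
                if corr is None or corr <= 0:             # assumption: positive only
                    continue
                scale = (sum(a) / len(a)) / (sum(b) / len(b))
                candidates.append((corr, scale * other[z]))
            candidates.sort(reverse=True)                 # strongest neighbors first
            top = candidates[:n_neighbors]
            if len(top) < n_neighbors:
                continue                                  # imputation not completed
            total = sum(c for c, _ in top)
            imputed[x][z] = sum(c * v for c, v in top) / total
    return imputed


example = {
    "p1": [1.0, 2.0, None, 4.0, 5.0],
    "p2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "p3": [10.0, 20.0, 30.0, 40.0, 50.0],
    "p4": [None, None, None, 1.0, None],
}
result = impute_replicate(example)
print(result["p1"])
```

In this toy data set p1 is perfectly co-expressed with p2 and p3, so its missing third value is reconstructed as approximately 3.0, while p4 is left untouched because it is expressed on fewer than three days.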
Protein expression levels were normalized to correct for protein quantification
errors, variability across replicates and between protein collection time points. For each
replicate and for a given extraction day, the data were normalized to the expression level
of Tubulin (isoform 1: tbb-1). Tbb-1 is a major constituent of microtubules [Ellis et al.
2004, Lu et al. 2004]. This protein is also conserved in humans and other species [Ellis
et al. 2004, Lu et al. 2004] and is often used as an internal reference for normalization
[Yuan et al. 2014], due to its relatively uniform level of expression regardless of age. In
addition, the data for each protein were normalized, within each replicate, to the level of
expression of the same protein in day 1. It is important to note that data imputation is
crucial for this step to avoid zero-expression on day 1, which would preclude this
normalization.
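The two normalization steps can be sketched as follows for one replicate. The sketch assumes that imputation has already been applied, so that no value is missing and the day 1 value is non-zero; the function name and the toy values are hypothetical.

```python
# Minimal sketch of the two-step normalization for one replicate.
# Assumption: missing values were imputed beforehand (no None, no zero on day 1).
def normalize(expr, reference="tbb-1"):
    """expr: dict protein -> list of expression values per day (day 1 first)."""
    ref = expr[reference]
    # Step 1: divide by the reference protein on the same extraction day.
    by_reference = {p: [v / r for v, r in zip(values, ref)]
                    for p, values in expr.items()}
    # Step 2: divide each protein by its own (reference-normalized) day 1 value.
    return {p: [v / values[0] for v in values]
            for p, values in by_reference.items()}


toy = {"tbb-1": [2.0, 4.0, 4.0], "protein-a": [4.0, 16.0, 4.0]}
normalized = normalize(toy)
print(normalized)
```

After both steps every protein starts at 1.0 on day 1, so later values read directly as fold changes relative to day 1, corrected for the reference level on each day.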
To ensure that the three replicates were drawn from populations with the same mean, a
one-way analysis of variance (ANOVA) test was conducted. The cross-replicate
variability was assessed to be non-significant. Nevertheless, single proteins may exhibit
deviations across replicates. These can be reflected through the distribution of σ (standard
deviation) for all proteins. Each σ was calculated for a single protein across the three
replicates for a given extraction day. Another measure of deviation is the skewness of
the distribution of expression. Due to the small sample size, i.e. three replicates, and the
inability to smoothly describe the distribution function of the data, these measures were
not used. Instead, Dixon’s Q-test was used as a measure for detecting outliers in
expression values between replicates for a single protein. The Q-test [Dean et al. 1951]
was used to detect and exclude a single outlier value for a given protein. The threshold
Q-test value for a 90% confidence level and a sample size of 3 is 0.941. The Q-score for
protein Pi is given as:
\[
Q_{\mathrm{score}} = \max\left\{
\frac{E(P_i, REP_Y) - E(P_i, REP_X)}{E(P_i, REP_Z) - E(P_i, REP_X)},\;
\frac{E(P_i, REP_Z) - E(P_i, REP_Y)}{E(P_i, REP_Z) - E(P_i, REP_X)}
\right\}
\tag{3.2}
\]
The protein expression, E ( Pi , REP ) , in Eq. (3.2) is sorted from low to high across
replicates. The minimum and maximum values are E ( Pi , REPX ) and E ( Pi , REPZ ) ,
respectively.
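Eq. (3.2) and the 0.941 threshold translate into a few lines of code. The sketch below handles exactly three replicate values; the function names are hypothetical, and the choice to return the values unchanged when all three are identical is an assumption.

```python
# Minimal sketch of Dixon's Q-test for three replicates (Eq. 3.2).
Q_CRIT_90_N3 = 0.941  # threshold for a 90% confidence level and n = 3


def dixon_q3(values):
    """Return (Q-score, suspect value) for three replicate values."""
    low, mid, high = sorted(values)
    spread = high - low
    if spread == 0:
        return 0.0, None          # identical values: no outlier
    q_low = (mid - low) / spread   # gap below the middle value
    q_high = (high - mid) / spread # gap above the middle value
    if q_low >= q_high:
        return q_low, low
    return q_high, high


def remove_outlier(values, q_crit=Q_CRIT_90_N3):
    """Drop the single suspect value if its Q-score exceeds the threshold."""
    q, suspect = dixon_q3(values)
    kept = list(values)
    if suspect is not None and q > q_crit:
        kept.remove(suspect)
    return kept


print(remove_outlier([1.0, 1.02, 2.0]))
print(remove_outlier([1.0, 1.5, 2.0]))
```

With only three replicates the threshold is very demanding: the suspect value is removed only when the other two values are almost identical relative to the total spread, as in the first call above.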
3.3 Statistical Testing
To estimate the degree of differential protein expression between the different
phases of the C. elegans aging process, unpaired t-test statistics were used. To quantify the
significance of differential expression, a permutation test was conducted. The purpose of
this test was to derive all t-score distributions that could be obtained by chance, and then
to compare the t-score for each protein to the scores, for the same protein, that could be
obtained by chance. These distributions are referred to as null distributions of t-scores
[Storey et al. 2003], as they define the boundary of validity of the null hypothesis. These
distributions were obtained by calculating the unpaired t-statistics between samples of
similar size to the original t-test, with the new samples obtained by randomly relabeling
the initial sets. The corresponding p-value for each protein was then derived for each
permutation. It is important to note that differential expression analysis for genomics or
proteomics data is in fact a multiple-hypothesis-testing problem and, as such, one needs to
correct for a possible inflation of type I statistical error (false positives). Intuitively, the
more properties of two different objects we compare, the more likely we are to find them to
be different by chance alone. There are many approaches to correct for such a statistical
artifact, for instance the Bonferroni correction [Storey et al. 2003]. This method
consists of dividing the significance threshold by the number of features compared
(equivalently, multiplying each p-value by that number), in this case the
number of proteins. The Bonferroni correction, however, can be very conservative and can
lead to an inflation of type II statistical error (false negatives). In this work, the concept of
a False Discovery Rate was used. The same formulation has been adopted as in [Storey et
al. 2003]. The t-statistics were calculated using the Welch unpaired two-sample t-test.
This approach is necessary when equal variance between the age groups to be compared
cannot be assumed. The variance was measured across the replicates for each single
protein. The t-score for the ith protein was calculated as:
$$
t(P_i, Day_l, Day_k) = \sqrt{3}\,
\frac{E(P_i, Day_l) - E(P_i, Day_k)}
{\sqrt{\sigma^2(P_i, Day_l) + \sigma^2(P_i, Day_k)}}
\tag{3.3}
$$
The expected value and standard deviation were calculated based on the expression level
of protein Pi across the three replicates. In Eq. (3.3), differential expression was
estimated between Day_l and Day_k. The square root of 3 reflects the sample size, i.e. 3
replicates.
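A minimal sketch of the t-score for equal group sizes is given below. The function name is hypothetical, and the use of the sample variance (n - 1 in the denominator, as returned by `statistics.variance`) is an assumption, since the text does not state which estimator was used.

```python
import math
from statistics import mean, variance


def welch_t(day_l, day_k):
    """Welch t-score for two groups of equal size n (here n = 3 replicates)."""
    n = len(day_l)
    if n != len(day_k) or n < 2:
        raise ValueError("both groups need the same number (>= 2) of replicates")
    # Sample variances are assumed; the source does not specify the estimator.
    summed_variance = variance(day_l) + variance(day_k)
    if summed_variance == 0:
        raise ValueError("zero variance in both groups; t-score undefined")
    return math.sqrt(n) * (mean(day_l) - mean(day_k)) / math.sqrt(summed_variance)
```

For equal group sizes this is algebraically the same as the usual Welch form, in which each variance is divided by n under the square root; factoring n out gives the leading square root of the sample size.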
For a small sample size it is often the case that the derived distribution of
t-statistics does not behave as a standard t-distribution from which a p-value can be
analytically calculated. In our case, the corresponding p-values were derived using a
permutation method. This step consists of relabeling replicates, i.e. from the initial 3+3
values for Day_l and Day_k, all possible combinations of 3 replicate values in Day_l and 3 replicate
values for Day_k were randomly generated. After each randomization, the t-statistics were
calculated (Eq. 3.3). These represent the null distribution of the t-scores that are
generated by chance. The p-value for the ith protein for differential expression was then
calculated using the approach described in [Storey et al. 2003], i.e.:
$$
p(P_i, Day_l, Day_k) = \sum_{j=1}^{N_{perm}}
\frac{\#\{P_m : |t^{0j}(P_m, Day_l, Day_k)| \ge |t(P_i, Day_l, Day_k)|,\ P_m \in \mathrm{Proteome}\}}
{N_{perm} \cdot \#(\mathrm{Proteome})}
\tag{3.4}
$$

Here N_perm is the number of permutations, t^{0j} is the null t-score obtained in the jth
permutation, and #(Proteome) in the denominator is the number of proteins
in our collected proteome (740). A larger number of proteins was detected in each separate
replicate; however, to conduct statistical testing, we required the same proteins to be present
in all replicates and to be expressed in at least three days per replicate. Using the obtained
p-values and the above described approach the corresponding false discovery rate (FDR)
q-values were calculated. We set the threshold of significance to q < 0.05. One of the
advantages of the q-value formulation over the p-value is interpretability: a p-value of 0.05
indicates that among all the null features the false positive rate is 5%. A p-value, therefore,
gives no information about the significant features, while a q-value of 0.05 indicates that
among the significant features 5% are false positives [Storey et al. 2003].
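The pooled permutation p-value of Eq. (3.4) can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this work: all C(6,3) = 20 relabelings are enumerated instead of being sampled at random, sample variances are used, the function names and toy data are hypothetical, and the subsequent q-value estimation of [Storey et al. 2003] is not reproduced.

```python
import math
from itertools import combinations
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-score for two equal-sized groups (sample variances assumed)."""
    diff = mean(a) - mean(b)
    summed_variance = variance(a) + variance(b)
    if summed_variance == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return math.sqrt(len(a)) * diff / math.sqrt(summed_variance)


def null_t_scores(day_l, day_k):
    """t-scores for every relabeling of the pooled replicates into two groups."""
    pooled = list(day_l) + list(day_k)
    scores = []
    for chosen in combinations(range(len(pooled)), len(day_l)):
        group_l = [pooled[i] for i in chosen]
        group_k = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        scores.append(welch_t(group_l, group_k))
    return scores


def permutation_p_values(proteome):
    """`proteome` maps protein name -> (Day_l replicates, Day_k replicates)."""
    observed = {name: welch_t(l, k) for name, (l, k) in proteome.items()}
    nulls = {name: null_t_scores(l, k) for name, (l, k) in proteome.items()}
    n_perm = len(next(iter(nulls.values())))
    n_proteins = len(proteome)
    p_values = {}
    for name, t_obs in observed.items():
        # Count null scores, pooled over all proteins, at least as extreme as t_obs.
        count = sum(1 for scores in nulls.values()
                    for t0 in scores if abs(t0) >= abs(t_obs))
        p_values[name] = count / (n_perm * n_proteins)
    return p_values


# Hypothetical two-protein proteome.
toy = {"A": ([1.0, 2.0, 3.0], [10.0, 11.0, 12.0]),
       "B": ([1.0, 5.0, 3.0], [2.0, 4.0, 6.0])}
p = permutation_p_values(toy)
```

Pooling the null scores over all proteins, rather than using only the 20 relabelings of a single protein, is what gives the p-values a finer resolution than 1/20 with three replicates per group.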
3.4 Information Theory: Shannon Entropy and Protein Network Properties
Shannon entropy [Shannon 1948] is a fundamental concept of information theory
that is increasingly explored in genomics and proteomics data analysis [Teschendroff et
al. 2010, West et al. 2012]. In the systems biology context, genes and proteins are often
viewed as components of a graph or a network. In such networks, proteins represent the
nodes while protein-protein interactions are edges of the network. The entropy is a
natural way of describing the transduced information flow along different parts of the
network. Another argument for using entropy as an additional metric for data analysis is
the fact that the correlation of expression can only capture linear interactions between
proteins, whereas entropy extends this concept further to include non-linear interactions.
Recent studies in the field of cancer genomics [Teschendroff et al. 2010, West et al.
2012] showed the importance of entropy as a measure for identifying not only sample
phenotypes (e.g. cancer/normal) but also the degree of cancer invasiveness (metastatic vs.
non-metastatic) [Teschendroff et al. 2010, West et al. 2012]. Changes in entropy, based
on the theorem of dynamical systems [Demetrius et al. 2004, Demetrius et al. 2005], were
proposed as a method for biomarker discovery for targeted gene therapies [West et al.
2012]. This theorem links changes in entropy, S, within the microscopic state of a system
to changes in the robustness, R, of the same system. The changes in entropy and
robustness are positively correlated, i.e. ∆R∆S > 0. Herein, the concept of entropy was
used to further characterize the C. elegans protein-protein interaction network. Following
the formulation given in [West et al. 2012], the Shannon entropy of node (protein) i in the
network was defined as:
$$
S(P_i, Day_l) = -\frac{1}{\log_{10} \#(\mathrm{Neighbors}(P_i))}
\sum_{P_j \in \mathrm{Neighbors}(P_i)}
p(P_i, P_j, Day_l)\, \log_{10} p(P_i, P_j, Day_l)
\tag{3.5}
$$
In Eq. (3.5) the traditional probability mass function is defined through the elements of the
adjacency matrix p(Pi, Pj, Day_l), which expresses the probability of co-expression of
proteins Pi and Pj at Day_l. The elements of this matrix were defined as [West et al.
2012]:
$$
p(P_i, P_j, Day_l) =
\frac{|r(P_i, P_j, Day_l)|}
{\sum_{P_k \in \mathrm{Neighbors}(P_i)} |r(P_i, P_k, Day_l)|}
\tag{3.6}
$$
The adjacency matrix was expressed through the Pearson correlation, r, between the
expression of proteins Pi and Pj. The term in the denominator ensures that, for any given
protein, the sum of the adjacency matrix elements linking this protein to its neighbors is
1. This gives the adjacency matrix properties similar to those of a probability mass function.
The absolute value in the numerator assigns equal weight to anti-correlated and
co-expressed proteins. This assumes that any two proteins that are either perfectly correlated or
anti-correlated in their expression must be either downstream of each other or
downstream of a common activating or inhibiting process.
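The node entropy of Eqs. (3.5) and (3.6) can be illustrated with the sketch below. The function names and toy data are hypothetical, and it is assumed here that the Pearson correlation for a given day is computed over the expression vectors supplied by the caller (for example, the replicate values of that day); the text does not spell out this detail.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def node_entropy(protein, neighbors, expression):
    """Normalized Shannon entropy of one node, given its first neighbors."""
    if len(neighbors) < 2:
        raise ValueError("entropy is not meaningful for fewer than two neighbors")
    # Eq. (3.6): absolute correlations, normalized to sum to one over the neighbors.
    weights = [abs(pearson(expression[protein], expression[nb])) for nb in neighbors]
    total = sum(weights)
    if total == 0:
        raise ValueError("all correlations are zero; probabilities undefined")
    probabilities = [w / total for w in weights]
    # Eq. (3.5): entropy divided by log10 of the node degree, so the maximum is 1.
    entropy = -sum(q * math.log10(q) for q in probabilities if q > 0)
    return entropy / math.log10(len(neighbors))


# Hypothetical expression vectors for one day.
toy_expression = {"A": [1.0, 2.0, 3.0], "B": [2.0, 4.0, 6.0],
                  "C": [3.0, 2.0, 1.0], "D": [1.0, -2.0, 1.0]}
```

In the toy data, B is perfectly correlated and C perfectly anti-correlated with A, so both neighbors receive equal weight and the normalized entropy of A reaches its maximum of 1; replacing C by the uncorrelated D concentrates all the weight on B and the entropy drops to 0.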
3.5 Machine-Learning: Classification and Feature Selection Algorithm
In the last few years, machine-learning algorithms have seen extensive use in a
wide range of applications in cancer genomics, proteomics, and epidemiology [Habibi et
al. 2014, Kourou et al. 2014, Cruz et al. 2006]. Some of the more prominent aspects for
which these methods were used involved classification. In a supervised setting,
classification consists of training a machine, i.e. a computer, on a dataset for which class
labels are given (the training data). The class labels can be, for instance, “cancer” vs
“normal”, “diseased” vs “healthy” or, in the case of the present study, “young” vs “old”.
The classification algorithm then generates a classifier, i.e. a criterion by which the machine
recognizes and labels future samples. The performance of a classifier is assessed on
class-blinded data (the test data): one compares the predicted class labels to the true
“hidden” labels. Prediction accuracy, over multiple test datasets, is one of the metrics
by which a classifier is assessed. Using both classification and feature selection, we
propose to identify proteins whose expression is most predictive of the age of the C.
elegans nematodes.
Feature selection consists of reducing the initial set of features to a smaller subset
that is most representative of the relationship between attributes and labels. In this case,
attributes are protein expression levels, while labels are age categories (“Young” or
“Old”). Feature selection methods create an approximation of the functional relationship
between attributes and labels. This has two distinct advantages: 1) it reduces the size of
the problem, hence minimizing the computational cost, although there is an initial
computational cost in extracting the relevant features; 2) by reducing the number of
useless features, which are not truly predictive of the label, one reduces the risk of
over-fitting, hence enhancing the potential for generalization. Feature selection methods have
an important advantage over dimensionality reduction methods, such as Principal
Component Analysis (PCA), in that the latter completely ignore the label. In addition,
for methods such as PCA the output is not directly interpretable, as it represents different
combinations/projections of attributes along the axes of variance; the first component,
for example, represents the direction of greatest variance in the data.
There are multiple approaches to tackling feature selection; these include "filter"
[Almuallim et al. 1991, Kira et al. 1992], "wrapper" [Caruana et al. 1994] and embedded
methods [Efron et al. 2004]. Filter methods consist of ranking features based on joint
mutual information or other feature ranking/scoring schema. This approach, due to its
simplicity and computational tractability, is often used as a first solution to the feature
selection problem. These methods, however, are noisy and highly dependent on the first
selected feature. The subsequent features are conditioned based on the initial choice. To
reduce such a bias and/or the effect of feature redundancy between highly correlated
features, one often uses the Joint Mutual Information (JMI) criterion [Hua et al. 1999].
The wrapper methods, on the other hand, assess the error of the classifier with respect to
selected features and use this as a metric in increasing or decreasing the size of the
selected feature set. There are two main categories, forward and backward selection. The
former starts from an empty set and incrementally adds new features based on the
performance of the classifier with respect to the new set. The backward selection starts
from the complete set of features then incrementally discards useless features until a set
that maximizes the performance of the classifier is reached.
Figure 3.2: Flow Chart of the Classification Procedure. The protein expression data
were split into 5 folds, i.e. 5 equal partitions of the data, for cross-validation. Four of the
five folds were combined into a training set while the remaining fold was used as a test
set, on which the performance of the classifier was assessed. Feature selection was
performed within an internal cross-validation loop, to avoid generating features from the
test dataset. During each of the five iterations of the inner cross-validation loop, a set of
features was generated. A majority vote was used to select the most common features and
extract the final set. The extracted features were then tested on the remaining test dataset
from the outer cross-validation loop. At the end of the five iterations of the outer
cross-validation loop, the mean and variance of the accuracy of the classifier, with and without
feature selection, were generated.
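The nested cross-validation with majority voting described in the caption of Figure 3.2 can be sketched as follows. This is an illustrative skeleton only: the function names are hypothetical, the round-robin fold assignment and the "more than half" voting rule are assumptions, and `select_features` stands for whatever feature selection routine is run on the inner training indices.

```python
from collections import Counter


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k nearly equal folds (round-robin assignment)."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds


def majority_vote(feature_sets):
    """Keep the features selected in more than half of the inner iterations."""
    counts = Counter(f for features in feature_sets for f in set(features))
    return sorted(f for f, c in counts.items() if c > len(feature_sets) / 2)


def nested_cv_feature_sets(n_samples, select_features, k=5):
    """For each outer fold, return (test indices, features voted on the training part)."""
    results = []
    for test_idx in k_fold_indices(n_samples, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        inner_sets = []
        for inner_fold in k_fold_indices(len(train_idx), k):
            inner_held_out = {train_idx[i] for i in inner_fold}
            inner_train = [i for i in train_idx if i not in inner_held_out]
            # Feature selection only ever sees indices from the outer training set.
            inner_sets.append(select_features(inner_train))
        results.append((test_idx, majority_vote(inner_sets)))
    return results
```

Because `select_features` is called only with indices drawn from the outer training set, the held-out fold cannot influence which features are selected, which is the purpose of the inner loop.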
Algorithm 3.1
Both approaches, although highly accurate, are computationally costly. Embedded
methods can be viewed as an intermediate approach, having less of the arbitrary scoring
of filter-type methods and less computational cost than wrapper-type methods. Embedded
methods tend to perform both feature selection and classification simultaneously. This, of
course, requires a modification of the classification algorithm, i.e. optimizing the loss
function. This is still an active area of research driven mostly by the goal of having both
differentiability of the loss function and faster decay toward zero for weights of irrelevant
features. Variants of this method, such as LASSO [Tibshirani 1996], have been used
recently with some success to analyze cancer genomics based on mixed data types,
including molecular data (i.e. DNA methylation and mRNA) and clinical variables [Yuan et al.
2014].
In high throughput experiments such as genomic sequencing, microarrays and
proteomics, the output datasets have a disproportionally large number of variables with
respect to sample size. In trying to establish a functional relationship between protein
expression and age of the nematodes, one finds that the expression level of most proteins
is not necessarily predictive of the age of the nematodes. This can, to some extent, be
inferred from differential expression analysis. However, statistical testing is significantly
more stringent in terms of the requirement on the distribution function of the expression
and the threshold of significance cannot always be achieved. The flowchart of this
method and a description of the hybridization algorithm are shown in Figure 3.2
and Algorithm 3.1, respectively.
Using a wrapper-based approach with a hybrid forward/backward algorithm,
feature selection was performed. This approach allows for the use of a prior on the set of
proteins to be selected. For the classification, Logistic Regression was used [Diaz et al.
2010]. This is a highly robust, discriminative algorithm well suited for high throughput
data, where the number of features largely exceeds the number of samples. Most
importantly, this algorithm does not make any strong assumptions about the data, e.g.
independence with respect to class label, which are made in models such as the naïve
Bayes [Demichelis et al. 2006]. This is especially relevant in the present data, as protein
expression can regulate the expression of other proteins. The hybrid forward/backward
feature selection algorithm is described in (Algorithm 3.1).
In contrast to traditional forward feature selection algorithms that add one feature
at a time, the algorithm shown (Algorithm 3.1) adds features in batches, then makes
backward corrections. Forward/backward algorithms have been successfully applied in
the past for feature selection in sparse data settings [Zhang 2011]. The method described
in Algorithm 3.1 remains, overall, a forward selection method with a backward
correction. This approach is similar to many numerical methods of the
predictor/corrector or expectation-maximization (EM) type. The idea was to use the model
parameter α to advance not one feature at a time but a set of features that minimize the
classifier error. These features may increase the error when put together, hence the
backward correction. It is apparent from the analysis that most of the backward steps are
shorter than the forward steps, which allowed for greater computational efficiency.
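The general idea of a batched forward step followed by a backward correction can be sketched as below. This is not a reproduction of Algorithm 3.1, whose details are not restated here; it is a generic illustration in which `alpha` is assumed to be the number of features added per forward step, `error` is a caller-supplied function returning the classifier error for a feature subset, and the stopping rule is an assumption.

```python
def hybrid_forward_backward(features, error, alpha=3, max_rounds=50):
    """Greedy selection: add up to `alpha` features per round, then prune.

    `error(subset)` returns the classifier error (lower is better) for a list
    of feature names. Returns the selected features and their error.
    """
    selected = []
    best = error(selected)
    for _ in range(max_rounds):
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        # Forward step: rank candidates by the error obtained when each is added alone.
        ranked = sorted(remaining, key=lambda f: error(selected + [f]))
        candidate = selected + ranked[:alpha]
        # Backward correction: drop features whose removal does not increase the error.
        improved = True
        while improved and len(candidate) > 1:
            improved = False
            current = error(candidate)
            for f in list(candidate):
                reduced = [g for g in candidate if g != f]
                if error(reduced) <= current:
                    candidate, improved = reduced, True
                    break
        new_error = error(candidate)
        if new_error >= best:
            break  # no improvement over the previous round: stop
        selected, best = candidate, new_error
    return selected, best


# Hypothetical error function: only "a" and "b" are informative, extras add a small penalty.
def toy_error(subset):
    useful = {"a", "b"}
    chosen = set(subset)
    return 1.0 - 0.4 * len(useful & chosen) + 0.01 * len(chosen - useful)


chosen_features, final_error = hybrid_forward_backward(["a", "b", "c", "d", "e"], toy_error)
```

In the toy run, the first forward step admits an uninformative feature alongside the two informative ones, and the backward correction removes it again, which is the behaviour the batched scheme relies on.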
Chapter 4
Results and Discussion
4.1 Background
To assess the effect of aging on protein expression in C. elegans, a set of methods
for data processing and noise reduction was used. These methods were described in
Chapter 3 and include outlier detection, missing data imputation [Troyanskaya et al.
2001, Jerez et al. 2010] and machine learning methods for both classification and feature
selection. Outlier detection was performed using Dixon’s Q-test [Dean et al. 1951],
which is appropriate for small sample sizes such as ours (three replicates for each protein
at a given age). The imputation of missing data was performed using the weighted K-nearest
neighbors (KNN) algorithm [Troyanskaya et al. 2001]. This algorithm has two main
advantages: computational efficiency and a higher tolerance to noise in the data. The
metric used for KNN imputation was the correlation of protein expression. Data
classification and feature selection were conducted using the logistic regression model and
the wrapper approach with a hybrid forward-backward algorithm for feature selection
[Almuallim et al. 1991, Kira et al. 1992, Caruana et al. 1994, Efron et al. 2004]. The goal
was to extract a subset of proteins whose expression is most predictive of the age of the
C. elegans nematodes.
4.2 Outlier Detection
Outliers were detected using Dixon’s Q-test with a Q-value, Eq. (3.2), exceeding
the 90% confidence threshold (QThres = 0.941) for a sample size of 3. The Q-value, for each
protein, was calculated using protein expression from the three replicates at each age.
The fraction of outliers, with respect to the total number of quantified proteins, is shown
in Figure 4.1 as a function of age. The fractions, derived from imputed data, showed only
a slight change with age, with the exception of 8 day old nematodes. The uniformity of the
fraction of outliers indicates a certain stability of the overall shape of the distribution of
protein expression with age. This, however, does not indicate that the expression of
individual proteins remains unchanged with age.
The effect of imputation of missing data can be seen in Figure 4.1. The presence
of missing data induced a noticeable asymmetry in the distribution of protein expression,
which led to enhanced inhomogeneity across replicates. The effect was more noticeable
for old age nematodes. This can be interpreted as increased asynchrony in protein
expression, between different replicates, as the nematodes reach the final stage of their
lives. It is also important to note the effect seen at day 8, where the imputed data
show a higher fraction of outlier expression values, as defined by Dixon’s Q-test.
The reason for this effect lies in the metric used in the imputation algorithm, which is the
correlation of protein expression. Lower co-expression levels between proteins at this
intermediate aging phase could decrease the accuracy of data reconstruction.
Figure 4.1: Fraction of Proteins with Outlier Expression Value as Function of the
Age of C. elegans. The fractions were calculated using the Dixon’s Q-test with a 90%
confidence level for a sample size of three (QThres = 0.941). The fractions are shown as a
function of the age of C. elegans. The fractions, derived from data before imputation, are
shown in blue, while the yellow bars represent fractions calculated after imputation of
missing data. The fraction of outliers is higher before imputation of missing protein
expression, especially for 16 day old C. elegans.
4.3 Differential Expression between Age Groups
The expression levels of individual proteins are expected to change as C. elegans
ages [Stiernagle 2006]. Age related changes in protein expression are neither monotonic
nor similar in trend for all proteins [Yuan et al. 2012, Vukoti et al. 2015]. Identifying
groups of proteins with similar expression dependencies with age is fundamental to
understanding the cellular changes associated with aging as well as the mechanisms of
age-associated diseases [Irvine et al. 2008].
To visualize the age dependent changes of protein expression, hierarchical
clustering was conducted across all age groups considered in this study. The clustering
was performed on scaled protein expression, z-scores, for each age and for each replicate.
The results are shown in Figures 4.2a-b for non-imputed and imputed missing data,
respectively. The clustering pattern is similar between the two panels of Figure 4.2.
Imputation of missing data reduces the number of zeros in the data, hence reducing the
spread in the z-scored expression data. Figure 4.2 shows that protein expression has a
similar age dependency for young (day 5) and adult (days 8 and 11) nematodes. In both
panels of Figure 4.2, the upper cluster of proteins (red color map) has relatively higher
expression levels across the age groups (days 5, 8 and 11). Noticeable differences in
protein expression pattern can be seen for old nematodes (day 16) for all three replicates.
This suggests that the expression levels of many proteins change between days 11 and 16.
The relative protein expression of younger and older C. elegans showed a roughly
inverse relationship. A higher expression at younger age was
expected as the nematodes grew toward maturity.
Figure 4.2: Clustered Protein Expression – a) without imputation, b) with
imputation. The data are z-scored along columns, i.e. along each age (day 5, 8, 11 and
16). Within each age group, the three columns represent the data from the three
replicates. The data showed a reverse trend of protein expression between young (day 5)
and old age (day 16) nematodes. The data also showed similarities between days 5, 8, 11.
This could indicate an abrupt change in protein expression taking place between days 11
and 16.
Figure 4.3: Fraction of Differentially Expressed Proteins. Differential protein
expression is conducted between all age groups considered in this study: (day 5 - 8), (day
5 - 11), (day 5 - 16), (day 8 - 11), (day 8 –16) and (day 11 - 16). For each comparison
(day i – j), the (blue/yellow) bars represent the fraction of proteins (decreased/increased),
at a significant level, in day i compared to day j. The fractions are larger when comparing
younger age (days 5, 8) to older age (day 16) nematodes.
An equally large group of proteins was found to have an increased expression at older
age. These can be labeled “house-keeping” proteins, e.g. ubiquitin and apoptosis
regulatory proteins [Chung et al. 2006]. Higher levels of expression in such regulatory
proteins were proposed to explain increased skeletal muscle atrophy with age [Chung et
al. 2006]. In the case of ubiquitin, this protein is responsible for tagging misfolded or
damaged proteins for degradation. Accumulating evidence suggests that the capability of
maintaining proper protein clearance (degradation) system declines as an organism ages,
resulting in the accumulation of protein aggregates [Martinez-Vicente et al. 2005]. Such
aggregates are ultimately a consequence of aging and in some cases may lead to cancers
[Finkel et al. 2007]. In other cases, these aggregates form highly stable and extended
beta-sheet type structures that are known for their causative associations with
neurodegenerative diseases, such as Alzheimer’s disease, Parkinson's disease and Huntington's
disease [Irvine et al. 2008].
Using the results shown in Figure 4.2, age dependent differential expression
between younger and older C. elegans nematodes was estimated. The statistical method
described in Chapter 3, i.e. unpaired t-testing with permutations between samples, was
used for this analysis. Differential protein expression was calculated between all possible
combinations of age groups, i.e. (day 5 vs days 8, 11 and 16), (day 8 vs days 11 and 16)
and (day 11 vs day 16). The threshold of significance was chosen as a false discovery rate
(FDR) q-value of 0.05. The fraction of differentially expressed proteins, at significance
level of q = 0.05, between each age group was reported in Figure 4.3. These fractions
were calculated with respect to the total number of quantified proteins common to all
three replicates with a minimum of three data points (n = 740). The data in Figure 4.3
were divided into two categories, increased and decreased proteins between the ages,
shown as yellow and blue bars respectively. The fractions are shown for each comparison
(day i – j), where (i, j) represent all possible non-redundant pairs of the
considered ages, i.e. days 5, 8, 11 and 16. Detailed information about each protein,
including name and gene identification number from the (December 2014) Wormpep
database, is reported in Tables 4.1-4.7 in the Appendix, as supplementary
material.
The results shown in Figure 4.3 indicate that the fractions of differentially expressed
proteins increase with age. This can be seen from the comparison between day 5 and 16,
where the maximum fractions were obtained. In this case, the fraction of proteins for
which the levels of expression were higher in younger C. elegans (13%) was larger than
the fraction of proteins for which the levels of expression were higher in older nematodes
(8%). The age dependent differential expression showed a changing trend in the
proportions of increased and decreased proteins between age groups. The comparison
between day 8 and 16 was dominated by the fraction of proteins for which the levels of
expression were higher at older age (~5%). The analysis also showed a negligible fraction
of differentially expressed proteins between day 5 and 11 as well as day 8 and 11. The
behavior of protein expression at day 8 and day 11 is characteristic of a transition phase
between young and old age. Cluster analysis (Figure 4.2) showed strong similarities
between days 5, 8 and 11, but not day 16. At day 11 the effect of aging may not be
homogeneous for all proteins, i.e. some proteins react to aging faster than others.
Therefore, proteins at day 11 may share expression similarities common to both young
and old C. elegans, because the number of differentially expressed proteins between day
11 and either age group is small.
4.4 Ontology and Pathway Association
Identifying the biological pathways that are directly involved or
activated/inhibited in aging C. elegans is essential to better understand the aging
mechanism. The PANTHER database (http://www.pantherdb.org/) was used to identify
significant biological processes and pathways with which the differentially expressed proteins
found in the study are associated. The ontology search was conducted using the
PANTHER-Overrepresentation Test (release 20150430) with the following annotation
version and release date: version 10.0 and release date of 2015-05-15. The C. elegans
reference list and the Bonferroni correction for multiple testing were also used in this
analysis.
The distribution of identified significant biological processes, ranked by their
p-value, for the comparison between day 5 and day 16 is shown in Figure 4.4. The fraction
of proteins (gene ids) mapped to the PANTHER gene ontology search database was ~60%
for proteins whose expression levels are higher in day 5 compared to day 16, and ~66%
for proteins with the opposite trend. Figure 4.4 shows the significant biological processes
associated with proteins that were highly expressed in either young or old C. elegans. The
level of significance of metabolic processes, although highly ranked for both younger and
older C. elegans, was much higher at younger age. This could be associated with the
rapid growth of the nematodes.
Figure 4.4: Distribution of Significant Biological Processes. The PANTHER gene
ontology search was conducted to identify significant biological processes that involve proteins
whose levels of expression are significantly altered between day 5 and day 16. Biological
processes suggested to be activated (a)/inactivated (b) in day 5 compared to day 16 are
shown. The increase in protein expression of C. elegans at younger age was found to be
mostly associated with metabolic processes. At older age, protein expression, in addition
to metabolic processes, was associated with house-keeping processes such as proteolysis
and protein-folding.
At older age (day 16), in addition to metabolic processes, increased protein expression
was also associated with house-keeping type processes, such as proteolysis, protein
complex biogenesis and assembly. These processes are interlinked and are characteristic
of aging [Comaret et al. 2009, Bowerman 2007]. In aging nematodes, misfolded and
aggregated proteins accumulate progressively in the cells. This could activate processes
such as ubiquitin-mediated proteolysis to tag and dispose of the misfolded or harmful
aggregates. Proteins that respond to stress were found to be highly ranked among proteins
with increased expression in old nematodes. These proteins, however, were not found to
be significant in terms of differential expression.
4.5 Network Entropy
To understand the effect of aging on the pattern of interaction between C. elegans
proteins, network entropy analysis was conducted. Entropy [Shannon 1948] as a measure
of uncertainty in the path of information flow through a network has been successfully
applied to cancer genomics to identify the hallmarks of cancer and its degree of
invasiveness [Teschendroff et al. 2010, West et al. 2012]. Entropy has the added
advantage, compared to correlation of expression, in that it captures non-linear effects
within the protein network. In addition, changes in entropy were linked to the robustness
of cellular systems [Demetrius et al. 2004, Demetrius et al. 2005, Manke et al 2006]. This
has significant implications for the choice of genes or proteins that are key to
compromising certain types of cells, e.g. cancer cells [Teschendroff et al. 2010, West et al.
2012, Manke et al 2006].
The protein-protein interaction network (PPIN) was downloaded from
(http://interactome.dfci.harvard.edu/C_elegans/). This network was used as a topological
framework to calculate entropy. There were 6,176 proteins, nodes, and 178,151 curated
interactions, edges, in this network.
The protein network was further reduced, by discarding both self-interactions and
proteins not identified in our three replicated experiments. The updated protein
interaction network, as a result, has substantially reduced dimensions, 390 nodes and
7,201 edges. The layout of the protein-protein interaction network is shown in Figure 4.5.
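The network reduction described above, followed by the node degree count used later for the degree distribution, can be sketched as follows. The function names and the toy edge list are hypothetical, and treating interactions as undirected (so that A-B and B-A count as one edge) is an assumption made for the example.

```python
from collections import Counter


def reduce_network(edges, quantified):
    """Drop self-interactions and edges involving proteins outside `quantified`."""
    kept = set()
    for a, b in edges:
        if a != b and a in quantified and b in quantified:
            kept.add(frozenset((a, b)))  # undirected: (a, b) and (b, a) are one edge
    return kept


def node_degrees(edges):
    """Number of first neighbors of each protein in the reduced network."""
    degrees = Counter()
    for edge in edges:
        for node in edge:
            degrees[node] += 1
    return degrees


# Hypothetical interaction list and set of quantified proteins.
toy_edges = [("A", "B"), ("B", "A"), ("A", "A"), ("B", "C"), ("C", "X")]
reduced = reduce_network(toy_edges, quantified={"A", "B", "C"})
degrees = node_degrees(reduced)
```

In the toy example the self-interaction of A, the duplicate B-A entry and the edge to the unquantified protein X are all discarded, leaving two edges.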
Proteins within the network shown in Figure 4.5 have different network properties
and could be categorized as either network hubs, i.e. central nodes, or peripheral. In
entropy calculations, one needs to take into account such properties to avoid entropy bias.
Such bias could result from the fact that proteins with a higher number of interacting
neighbors (in graph terms, a higher node degree) have multiple paths for the flow of
information along the network, hence an increased path uncertainty, which in turn
increases the entropy. To account for this effect, the distribution of the number of
neighbors (node degree) for each protein was calculated, based on the network in Figure
4.5, and is shown in Figure 4.6. Similar to the human PPIN structure, Figure 4.6 shows most
proteins to be peripheral, i.e. having a degree of 1 or a single direct interacting neighbor.
The entropy of each protein in the network was calculated using Eq. (3.5). The
correlation of protein expression was used to estimate protein-protein interaction
probability Eq. (3.6), at the core of the entropy estimate. The distribution of entropy as a
function of C. elegans age is shown in Figure 4.7.
Figure 4.5: Protein-Protein Interaction Network. Topological layout of the C. elegans
protein-protein interaction network. Nodes, filled yellow circles, represent proteins, while
edges, solid lines, represent curated interactions. The network was visualized using
Cytoscape v.3.2.1 [Shannon et al. 2003].
Figure 4.6: Distribution of the Number of First Neighbors for each Protein. This
distribution represents the degree of each node within the protein-protein interaction
network for C. elegans. Most of the proteins in the network have a single direct interacting
neighbor.
Figure 4.7: Distribution of Protein’s Entropy. Protein entropy for day 5 (red circles)
and day 16 (blue circles). For both, day 5 and day 16, the same proteins are sorted by
increased entropy of day 5. Entropy, for most proteins, was unchanged between young
and old C. elegans. Two groups of proteins, however, at the left and at the right side of
the figure show noticeable entropic age dependency.
Figure 4.8: Distribution of Proteins’ First Neighbors. The number of first neighbors of
each protein, sorted in the same order as in Figure 4.7, i.e. by increasing entropy of day 5.
Proteins with the largest time variability in entropy are characterized by a low number of
first neighbors.
Figure 4.9: Proteins Highly Ranked in Differential Expression and Entropy. The
overlap between proteins identified in both differential expression and entropy analysis,
for the comparison between day 5 and day 16. For entropy, only proteins with changes
between day 5 and day 16 exceeding 1.2-fold are considered. The fractions were calculated
with respect to the number of proteins in each respective entropy group, i.e. entropy up in
day 5 or up in day 16.
To avoid entropy bias favoring high-degree nodes, i.e. network hub proteins, the entropy of
each protein was normalized by the number of its first neighbors (Eq. 3.5). In this case, increased entropy
can only arise when the interactions between the central protein and its
neighbors approach being equally likely. In other words, the level of co-expression is the same between
a protein and each of its first neighbors. Proteins with a single first neighbor were
excluded from this comparison, as the entropy calculation is meaningless in this case
due to the lack of an alternative information path.
Figure 4.7 shows that the entropy of most proteins has a weak age dependency,
which suggests that the effect of aging involves a limited sub-network of the protein-protein
interaction network. Proteins at both edges of Figure 4.7, on the other hand, showed a
reversed entropy trend between young and old nematodes. The topological
layout of these proteins was therefore analyzed. The distribution of the number of first neighbors of
each protein, sorted by increasing entropy of day 5, is shown in Figure 4.8. In this
representation the x-axes of Figures 4.7 and 4.8 are the same. Proteins that showed the most
significant entropy dependency on age are proteins with fewer direct neighbors. The
entropy reversal with age seen in Figure 4.7 is similar to the clustering pattern seen for
protein expression in Figure 4.2. The group of proteins on the left side of Figure 4.7 has
lower entropy at a younger age, i.e. low uncertainty in information flow or highly specific
protein-protein interaction. For older C. elegans these proteins have multiple, equally
probable, interactors. This could be viewed as multi-tasking by these proteins, due to
age-related degradation or the activation of new pathways, such as stress resistance and
proteolysis. The inverse trend is true for the group of proteins on the right side of Figure 4.7.
These proteins become more dedicated to fewer functions with age, i.e. more specific
interaction pathways, hence lower entropy. Further analysis was conducted of the group of proteins with
strong entropic changes between day 5 and day 16, exceeding 1.2-fold.
The results in Figure 4.9 show an interesting pattern of overlap between differentially expressed
proteins and proteins with strong entropic age dependency. A higher overlap fraction, 30%,
was obtained between proteins with increased expression in old C. elegans and proteins
with lower entropy on day 16, or higher entropy on day 5. This inverse
relationship between expression and entropy was previously reported in the context of
cancer genomics [West et al. 2012]. Increased protein expression tends to put more
emphasis on specific pathways, thus reducing the uncertainty in information flow,
which reduces entropy.
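The overlap fractions of Figure 4.9 can be illustrated with a short Python sketch. It assumes that the 1.2-fold criterion is applied to the ratio of day 5 and day 16 entropies and that each fraction is taken relative to the size of the entropy group, as the caption states. The function names and the toy values are hypothetical and do not come from the thesis data.

```python
def entropy_groups(entropy_day5, entropy_day16, fold=1.2):
    """Split proteins by the direction of an entropy change exceeding `fold`."""
    up_day5, up_day16 = set(), set()
    for protein, s5 in entropy_day5.items():
        s16 = entropy_day16[protein]
        if s5 <= 0 or s16 <= 0:
            continue  # fold change is undefined for zero entropy
        if s5 / s16 > fold:
            up_day5.add(protein)
        elif s16 / s5 > fold:
            up_day16.add(protein)
    return up_day5, up_day16

def overlap_fraction(entropy_group, expression_group):
    """Fraction of the entropy group also found in the expression group."""
    if not entropy_group:
        return 0.0
    return len(entropy_group & expression_group) / len(entropy_group)

# Hypothetical entropies for four proteins on day 5 and day 16.
day5 = {"p1": 0.9, "p2": 0.5, "p3": 0.4, "p4": 0.6}
day16 = {"p1": 0.5, "p2": 0.52, "p3": 0.8, "p4": 0.3}
up5, up16 = entropy_groups(day5, day16)

# Hypothetical set of proteins with higher expression on day 16.
expr_up_day16 = {"p1", "p9"}
fraction = overlap_fraction(up5, expr_up_day16)
```

Here the fraction is computed for proteins with higher entropy on day 5 against proteins with higher expression on day 16, which is the pairing for which the text reports the largest overlap.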
4.6 Feature Selection and Classification
Feature selection was conducted using the algorithm described in Chapter 3. The
goal was to extract features (proteins) whose expression is most predictive of C. elegans
age. The data were partitioned into 12 samples, representing protein expression from each
of the three replicates at each of the four extraction days, i.e. days 5, 8, 11 and 16.
Defining the class labels is a required step in any supervised feature selection
process. The class labels define and restrict the hypothesis space, i.e. link the values of
features to certain outcomes. In this case, binary class labels were adopted: class 1 - “young”,
class 2 - “old”. The “young” category was chosen to include days 5 and 8,
while the “old” category included days 11 and 16.
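The labeling scheme above can be written down directly. The following Python sketch builds the 12 samples (four extraction days, three replicates each) and assigns the binary class labels; the variable and function names are illustrative assumptions.

```python
DAYS = [5, 8, 11, 16]   # extraction days
REPLICATES = 3          # replicates per day

def age_class(day):
    """Class 1 ("young") for days 5 and 8, class 2 ("old") for days 11 and 16."""
    return 1 if day in (5, 8) else 2

# One (day, replicate) pair per sample, in day order.
samples = [(day, rep) for day in DAYS for rep in range(1, REPLICATES + 1)]
labels = [age_class(day) for day, _ in samples]
```

With this grouping the two classes are balanced, with six samples each.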
Feature selection and classification, as described in Chapter 3, are intricately
linked. The logistic regression classifier was used to predict the class labels. The
performance of the classifier, using all features (all proteins) and selected features (a subset
of proteins), was assessed using cross-validation. Feature selection was carried out in the
following order:
1. An outer cross-validation loop was initiated to generate training and test data sets
from the 12 samples. Due to the small sample size, the partitioning between training
and test data was done using the leave-one-out method.
2. Training data were used to create yet another, inner, cross-validation step for feature
selection. This was done to avoid selecting features from test data.
3. Within the inner cross-validation loop the forward/backward algorithm (Algorithm
3.1) was used and a majority vote was performed to extract features.
4. The classifier was then applied to the test data set at the outer cross-validation loop
using the selected features and the accuracy of the classifier was assessed.
5. Features that are frequently reported in the outer cross-validation loop are validated
using an ontology search to identify the biological processes with which they are most
significantly associated.
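The nested procedure in steps 1 to 4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the thesis implementation: the forward/backward search of Algorithm 3.1 is replaced by a simple stand-in that ranks features by the gap between class means, the logistic regression is fitted by basic gradient descent, and the data are a hypothetical toy set with labels coded as 0 and 1.

```python
import math
from collections import Counter

def train_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient descent; y holds 0/1 labels."""
    n_feat = len(X[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / len(X)
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
    return w, b

def predict(model, x):
    w, b = model
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

def top_features(X, y, n_keep):
    """Stand-in for Algorithm 3.1: rank features by the gap between class means."""
    scores = []
    for j in range(len(X[0])):
        ones = [x[j] for x, lab in zip(X, y) if lab == 1]
        zeros = [x[j] for x, lab in zip(X, y) if lab == 0]
        scores.append((abs(sum(ones) / len(ones) - sum(zeros) / len(zeros)), j))
    return [j for _, j in sorted(scores, reverse=True)[:n_keep]]

def nested_loo(X, y, n_keep=2):
    correct, chosen_per_fold = 0, []
    for i in range(len(X)):                      # outer leave-one-out loop
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        votes = Counter()
        for k in range(len(X_tr)):               # inner loop, training data only
            X_in, y_in = X_tr[:k] + X_tr[k + 1:], y_tr[:k] + y_tr[k + 1:]
            votes.update(top_features(X_in, y_in, n_keep))
        # Majority vote: keep features chosen in more than half of the inner folds.
        selected = sorted(j for j, v in votes.items() if v > len(X_tr) / 2)
        model = train_logistic([[x[j] for j in selected] for x in X_tr], y_tr)
        correct += predict(model, [X[i][j] for j in selected]) == y[i]
        chosen_per_fold.append(selected)
    return correct / len(X), chosen_per_fold

# Hypothetical toy data: features 0 and 1 separate the classes, feature 2 is noise.
X = [[-1.0, -0.5, 0.3], [-1.2, -0.4, -0.2], [-0.9, -0.6, 0.1], [-1.1, -0.3, -0.3],
     [1.0, 0.5, 0.2], [1.1, 0.4, -0.1], [0.9, 0.6, 0.3], [1.2, 0.3, -0.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
accuracy, chosen = nested_loo(X, y, n_keep=2)
```

The key design point is that the held-out sample of the outer loop never takes part in feature selection, so the reported accuracy is not biased by selecting features on test data.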
The distribution of features as a function of the number of times they were selected at the
outer cross-validation step is shown in Figure 4.10. Only a minority of features
were selected repeatedly over multiple iterations of the cross-validation step.
These high-frequency features are reported in Table 4.9 in the Appendix.
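Counting how often each feature is selected across the outer folds is a simple tally. The sketch below assumes the per-fold selections are available as lists of feature identifiers; the identifiers, the function names and the cut-off used for "high frequency" are hypothetical, since no explicit threshold is stated here.

```python
from collections import Counter

def selection_frequency(selected_per_fold):
    """Number of outer folds in which each feature was selected."""
    counts = Counter()
    for fold in selected_per_fold:
        counts.update(set(fold))  # count a feature at most once per fold
    return counts

def frequency_distribution(counts):
    """How many features were selected exactly 1, 2, ... times."""
    return Counter(counts.values())

def high_frequency(counts, min_count):
    """Features selected in at least `min_count` folds (assumed cut-off)."""
    return sorted(f for f, c in counts.items() if c >= min_count)

# Hypothetical selections from four outer folds.
folds = [["P1", "P2"], ["P1"], ["P1", "P3"], ["P2", "P1"]]
counts = selection_frequency(folds)
histogram = frequency_distribution(counts)
frequent = high_frequency(counts, min_count=2)
```

The histogram corresponds to the kind of distribution plotted in Figure 4.10, and the filtered list corresponds to the features reported in the Appendix.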
Figure 4.10: Distribution of Selected Features. Features were distributed based on the
number of times they were selected during the outer cross-validation step, i.e. selection
frequency. Higher selection frequency means a stronger link between the selected feature
(protein expression) and C. elegans age.
Figure 4.11: Classifier Accuracy. The logistic regression classifier was applied to the test
data set. Blue squares: data without feature selection; red circles: data with feature
selection.
Figure 4.12: Feature Selection Overlap with Proteins Highly Ranked in Differential
Expression. The fraction of selected features, proteins, also identified in differential
expression analysis between day 5 and day 16. The fractions were calculated with respect
to the number of selected features.
Figure 4.13: Distribution of Biological Processes for Selected Features. The
PANTHER gene ontology search was conducted and significant biological processes are
shown for the selected features. The selected features, proteins, were mostly linked to
metabolic processes.
The higher the selection frequency of a feature, the more robust the link between
the selected feature and the class label, in this case C. elegans age. The accuracy of the
logistic regression classifier in predicting “young” vs “old” developmental stage is shown
in Figure 4.11. The results indicate an overall improvement in accuracy when relevant
features are selected. The reason for this improvement is the elimination of
features that are only loosely predictive of the class label, which reduces the
noise in classification. In addition, feature selection helps reduce the risk of overfitting
the data, which improves the ability of the classifier to generalize to other datasets.
The selected features constituted a subset of the proteins identified as
differentially expressed between day 5 and day 16, as can be seen in Figure 4.12. Further
ontology search was conducted on the selected features, i.e. proteins whose expression
was most predictive of the age of C. elegans nematodes. The PANTHER gene ontology
search database was used and 75% of the selected features were mapped to the database.
The biological processes with which these features were associated are shown in Figure
4.13. The processes were sorted by level of significance. Metabolic processes were found
to be the most significant. Metabolic processes were also among the highest-ranking
biological processes found using protein differential expression between young and old
C. elegans, see Figure 4.4. This provides a measure of confidence in both the feature
selection algorithm used and the selected features. The ontology search shown in
Figure 4.13 was based on feature selection, for which C. elegans age was the class label.
Therefore, one can assert that the expression of proteins linked to metabolic processes is
most predictive of the age of C. elegans nematodes.
4.7 Discussion
The central question addressed in this thesis was how protein expression
changes during the aging process of the model organism C. elegans. The wild-type (N2)
strain of C. elegans was chosen for this study, owing to its well-characterized and
mapped genome. A variety of state-of-the-art numerical and bioinformatics
approaches were used in this study to analyze age-dependent variations in C. elegans
protein expression. These included imputation of missing data, outlier detection, multiple
hypothesis statistical testing, network entropy calculations and, lastly, machine-learning
classification and feature selection techniques.
Differential expression analysis between aging phases of C. elegans showed a
clear age dependency of protein expression. In the comparison of protein
expression levels between day 5 and day 16, more proteins were expressed at a higher
level in the old nematodes than in
the young nematodes. It is worth pointing out that day 16 is very old: it corresponds
to the half-life of C. elegans (50% of nematodes are dead at this age) [Vukoti, JPR 2015].
On the other hand, the trend is reversed for the comparison between day 8 and day 16, with
more proteins expressed at a higher level in the young nematodes (day 8) than in
day 16 nematodes. This could be attributed to a drop in protein expression, at day 16, for
proteins linked to early developmental stages of C. elegans. The comparisons between
day 5 and 16 as well as day 8 and 16 showed comparable fractions of significantly
differentially expressed proteins. These represent two protein populations with strong
age-dependent expression. Proteins expressed at a higher level at younger age are needed during
early developmental phases; these were mainly found to be linked to metabolic
processes. Proteins expressed at a higher level predominantly at older age showed significant
association with house-keeping processes. These included proteolysis, protein complex
assembly and biogenesis, as well as response to stress. These biological processes are
characteristic of aging [Combaret et al. 2009, Bowerman 2007], as a result of the
accumulation of aggregates and misfolded proteins in aging nematodes. Overexpression
of proteins involved in processes such as ubiquitin-mediated proteolysis is a natural cell
response to dispose of these aggregates. Among the biological processes linked to
elevated protein expression at older age, although not at a significant level, was cellular
response to stress. The ability to counter cellular stress, e.g. oxidative stress, hypoxia, heat
shock and osmotic stress, was previously linked to longevity in C. elegans [Zhou et al.
2011, Rodriguez et al. 2013]. Metabolic processes were also highly ranked in this group
of proteins. The degree of significance, however, was lower than for younger C. elegans.
For the mature adulthood phase (day 11), similarities with both younger and older
C. elegans were expected. In other words, the effect of aging on protein expression may
not be uniform for all proteins. Hence, proteins at day 11 may share expression
similarities with both young and old C. elegans. As a result, only negligible fractions
of proteins were significantly differentially expressed between day 11 and any of
days 5, 8 or 16.
By employing information theory concepts such as entropy, we added another
layer of understanding of the interactions between proteins within a global protein
network. Such a representation is often used for genomic and transcriptomic studies, where
equivalency between the transcriptome and the proteome is assumed, even though they are
different biological events and do not correlate well with each other [Nagaraj et al. 2011].
This assumption was not made in the present study: protein expression of C.
elegans nematodes was directly measured at different ages. Entropy has been
increasingly used as a metric for analyzing high-throughput gene and protein expression
data [Teschendorff et al. 2010, West et al. 2012]. It was previously shown, for instance,
that the entropy of pharynx tissues in C. elegans was directly correlated with the age of
these nematodes [Shamir et al. 2009]. Network entropy was also used to analyze the
robustness of yeast and C. elegans cells [Manke et al. 2006]. In these studies, a strong
correlation was found between lethality and the knockdown of proteins with higher
network entropy. Such proteins were hence assumed to be central to the protein-protein
interaction network, i.e. network hubs. The topological properties of these proteins within
the network could create an entropy bias favoring central proteins [West et al. 2012].
Normalized entropy was therefore used, and its age-mediated changes were
analyzed. The results showed that a significant number of proteins, including central
network nodes, have a weak entropic age dependency. Two clusters of proteins, however,
were identified with significant age-related entropic changes between young (day 5) and
old (day 16) C. elegans. These groups of proteins showed an entropy reversal with age
that was similar to the clustering pattern of protein expression. Lower
entropy at a younger age indicated highly specific protein-protein interactions. The same
proteins have higher entropy in older C. elegans, which signifies equally probable
interactions with multiple proteins, i.e. multi-tasking. The group of proteins with the opposite
time dependency, i.e. high entropy at younger age and low entropy at older age, was
assumed to become dedicated to a single task or fewer tasks as C. elegans ages.
By using machine-learning concepts such as feature selection and classification, a
subset of proteins was numerically compiled. These proteins, considered as features, are
most predictive of the age category of C. elegans, due to their distinctive time-dependent
protein expression. Feature selection has a fundamental advantage over dimensionality
reduction methods, in that the output is directly tractable and has a direct biological
interpretation. In addition, this approach is less stringent than differential expression
analysis, as it does not place strict requirements on the distribution of the features. The age
categories were labeled in a binary fashion (“young” / “old”). The ontology search based
on the extracted features (proteins) identified metabolic processes as most predictive of
the age of C. elegans. This, to some extent, corroborates the results of differential
expression analysis, at least for the younger age category, where metabolic processes are the
most significant. Metabolic processes were highly ranked for both groups of proteins, i.e.
those with elevated expression at younger or at older age. Hence, one could assume that such
processes may not be a good predictor of the aging of C. elegans. The level of
significance of metabolic processes, however, differed between the two age
categories. The difference in the level of significance, combined with the predominance of
these processes at younger age, gave metabolic processes the age-predictor
characteristics captured by the feature selection method.
Appendix A
Supplementary Tables
Table 4.1: Differential expression (day 5 vs day 8): Proteins for which the expression
levels are significantly greater in day 5.
Wormbase Id | Protein | Gene | t-statistic | p-value
CE04859 | acetoacetyl CoA thiolase | T02G5.7 | 25.72873 | 2.8E-07
CE01308 | locus:tag-174 Cytochrome C oxidase | F54D8.2 | 15.36142 | 0.000399
Table 4.2: Differential expression (day 5 vs day 8): Proteins for which the expression
levels are significantly greater in day 8.
Wormbase Id | Protein | Gene | t-statistic | p-value
CE00475 | locus:ttr-2 Transthyretin-like family | K03H1.4 | -20.2686 | 0.000132
CE04189 | Unknown | C42D4.1 | -16.9051 | 0.000264
CE09542 | locus:asp-6 protease | F21F8.7 | -10.4278 | 0.000529
Table 4.3: Differential expression (day 5 vs day 11): Proteins for which the expression
levels are significantly greater in day 11.
Wormbase Id | Protein | Gene | t-statistic | p-value
CE05529 | Unknown | D1054.11 | -11.3901 | 0.000284
CE00475 | locus:ttr-2 Transthyretin-like family | K03H1.4 | -10.6167 | 0.000418
CE09542 | locus:asp-6 protease | F21F8.7 | -10.4832 | 0.000564
Table 4.4: Differential expression (day 5 vs day 16): Proteins for which the expression
levels are significantly greater in day 5.
Wormbase Id | Protein | Gene | t-statistic | p-value
CE26206 | Unknown | Y59A8B.10a | 11.28049 | 0.000491
CE00874 | locus:rnr-2 Ribonuclease-diphosphate reductase M2 | C03C10.3 | 9.514267 | 0.0007
CE20707 | locus:zyg-9 elongation factor | F22B5.7 | 9.427848 | 0.000832
CE18855 | locus:dut-1 deoxyuridine 5'-triphosphate nucleotidohydrolase | K07A1.2 | 8.925026 | 0.001105
CE42949 | locus:rad-23 RAD23 protein homolog2 like | ZK20.3 | 7.742909 | 0.001747
CE16234 | locus:lin-53 chromatin assembly factor 1 P55 subunit like | K07A1.12 | 7.655504 | 0.002068
CE06233 | Unknown | R04F11.2 | 7.070476 | 0.002567
CE18476 | Unknown | ZK1055.7 | 6.869931 | 0.002843
CE07241 | locus:msra-1 | F43E2.5 | 6.865826 | 0.002853
CE20336 | locus:tfg-1 | Y63D3A.5 | 6.313182 | 0.004004
CE02218 | adenylate kinase | F38B2.4a | 6.217497 | 0.004434
CE24222 | Iron-containing alcohol dehydrogenases | Y38F1A.6 | 6.193826 | 0.004649
CE33722 | Unknown | Y105E8A.19 | 6.060377 | 0.005315
CE37535 | RNA recognition motif. (aka RRM, RBD, or RNP domain) | K08F4.2 | 6.047346 | 0.005355
CE40818 | locus:pgp-6 p-glycoprotein | T21E8.1a | 5.879783 | 0.005588
CE17349 | locus:eif-3.E | B0511.10 | 5.534634 | 0.005987
CE24278 | locus:rps-4 | Y43B11AR.4 | 5.423025 | 0.006536
CE08086 | Unknown | C10G11.7 | 5.32125 | 0.007132
CE34065 | NADH dehydrogenase ND1 | MTCE.11 | 5.280182 | 0.007329
CE32648 | Unknown | F35D11.4 | 5.227409 | 0.007594
CE20708 | locus:fars-3 phenylalanyl-tRNA synthetase | F22B5.9 | 5.219364 | 0.007643
CE46432 | locus:ant-1.1 | T27E9.1d | 5.188295 | 0.007851
CE40098 | Unknown | M02H5.8 | 5.053657 | 0.008477
CE34713 | Unknown | C27B7.9 | 4.98694 | 0.008759
CE16792 | locus:mai-2 ATPase inhibitor | B0546.1 | 4.968943 | 0.008873
CE17506 | locus:cyn-12 | C34D4.12 | 4.962991 | 0.008907
CE16968 | locus:cyc-2.1 cytochrome C | E04A4.7 | 4.914915 | 0.009057
CE20207 | Unknown | Y37D8A.2 | 4.815095 | 0.009772
CE24447 | thioredoxin-like protein | Y54E10A.3 | 4.795333 | 0.009913
CE07426 | locus:htz-1 histone H2A variant | R08C7.3 | 4.719642 | 0.010623
CE16252 | locus:cysl-2 beta-synthase | K10H10.2 | 4.719306 | 0.010626
CE18284 | locus:acdh-7 acyl-CoA dehydrogenase | T25G12.5 | 4.699784 | 0.010775
CE29997 | locus:vha-11 | Y38F2AL.3a | 4.652329 | 0.011048
CE21023 | locus:fbp-1 fructose-bisphosphatase | K07A3.1 | 4.641067 | 0.011111
CE41403 | Unknown | F53H4.2 | 4.640525 | 0.011114
CE33098 | Unknown | F46H5.3b | 4.57982 | 0.011458
CE20820 | locus:mdh-1 lactate dehydrogenase | F46E10.10a | 4.535479 | 0.011803
CE16403 | locus:mlc-5 EF hand | T12D8.6 | 4.535364 | 0.011803
CE29373 | Unknown | Y37E3.10 | 4.460484 | 0.012521
CE27882 | Unknown | C53C9.2 | 4.457335 | 0.012563
CE30581 | guanosine-3',5'-bis(diphosphate)pyrophosphohydrolase like | ZK909.3 | 4.311385 | 0.013397
CE21229 | Unknown | VF13D12L.3 | 4.308329 | 0.013405
CE31008 | Unknown | F56F11.4b | 4.182023 | 0.014212
CE08497 | D-3-Phosphoglycerate dehydrogenase | C31C9.2 | 4.159781 | 0.014331
CE09767 | locus:cpn-3 calponin | F28H1.2 | 4.11841 | 0.014426
CE26774 | Unknown | Y37E3.8a | 4.115546 | 0.014436
CE39609 | locus:kin-2 | R07E4.6a | 4.082933 | 0.014888
CE42624 | locus:gcn-1 | Y48G9A.3 | 3.992393 | 0.016229
CE20974 | locus:mog-2 | H20J04.8 | 3.990326 | 0.016256
CE28976 | Unknown | Y38A10A.7 | 3.989277 | 0.016269
CE43614 | Unknown | ZC247.1 | 3.944351 | 0.01697
CE22542 | Unknown | Y55F3AM.13 | 3.908747 | 0.017484
CE18819 | locus:mbf-1 Helix-turn-helix | H21P03.1 | 3.899382 | 0.017609
CE16399 | locus:drr-2 RNA recognition motif. (aka RRM, RBD, or RNP domain) | T12D8.2 | 3.80957 | 0.018882
CE19362 | locus:vha-8 ATPase | C17H12.14 | 3.807933 | 0.018904
CE27230 | locus:sec-23 | Y113G7A.3 | 3.79176 | 0.01911
CE10934 | aldehyde reductase | F53F1.2 | 3.734414 | 0.019887
CE06107 | locus:gsnl-1 gelsolin | K06A4.3 | 3.707201 | 0.020355
CE24685 | locus:vars-2 | Y87G2A.5 | 3.706966 | 0.020359
CE28376 | Unknown | Y67H2A.5 | 3.683136 | 0.020854
CE17154 | locus:ahcy-1 S-adenosylhomocysteine hydrolase | K02F2.2 | 3.64978 | 0.02177
CE01308 | locus:tag-174 Cytochrome C oxidase | F54D8.2 | 3.646905 | 0.021855
CE18478 | locus:rpl-2 Ribosomal Proteins L2 | B0250.1 | 3.641181 | 0.022022
CE18785 | ATP-dependent helicase (DEAD box) | F58E10.3a | 3.627245 | 0.022415
CE04501 | locus:his-3 histone H2A | T10C6.12 | 3.605163 | 0.023016
CE22210 | locus:vha-13 ATP synthase alpha and beta subunits; ATP synthase ab C-terminal | Y49A3A.2 | 3.593631 | 0.023332
CE32003 | locus:pptr-1 RTS1 PROTEIN (SCS1 PROTEIN) | W08G11.4 | 3.557601 | 0.024391
CE40008 | locus:unc-27 troponin I | ZK721.2 | 3.557431 | 0.024396
CE16341 | locus:aldo-1 Fructose-bisphosphate aldolase class-I | T05D4.1 | 3.554241 | 0.024501
CE16650 | locus:rpl-18 Eukaryotic ribosomal protein L18 | Y45F10D.12 | 3.527577 | 0.025384
CE04757 | Unknown | M02D8.1 | 3.518384 | 0.025663
CE28486 | locus:ogdh-1 2-oxoglutarate dehydrogenase | T22B11.5 | 3.498073 | 0.026194
CE36263 | Unknown | H28O16.1d | 3.492358 | 0.026332
CE23530 | high-density lipoprotein-binding protein | C08H9.2a | 3.458827 | 0.027046
CE30781 | locus:rpl-36 Ribosomal protein YL39 | F37C12.4 | 3.432533 | 0.027444
CE29774 | electron transfer flavoprotein beta | F23C8.5 | 3.409685 | 0.027711
CE17082 | locus:cpn-4 | F49D11.8 | 3.402582 | 0.027802
CE29312 | locus:acs-4 | F37C12.7 | 3.388213 | 0.028028
CE40116 | Unknown | T02E9.5 | 3.377391 | 0.028263
CE18020 | Unknown | K07E3.4b | 3.364348 | 0.028607
CE31733 | locus:lev-11 | Y105E8B.1e | 3.36357 | 0.028629
CE29161 | locus:dod-19 | ZK6.10 | 3.350252 | 0.029012
CE34018 | locus:idha-1 isocitrate dehydrogenase | F43G9.1 | 3.337572 | 0.029362
CE30395 | Unknown | T12A2.6 | 3.326333 | 0.029632
CE28974 | Unknown | Y24D9A.8a | 3.325381 | 0.029653
CE01253 | locus:rpt-3 | F23F12.6 | 3.320601 | 0.029756
CE22195 | locus:rpl-17 | Y48G8AL.8a | 3.319945 | 0.02977
CE17755 | locus:cey-1 'Cold-shock' DNA-binding domain | F33A8.3 | 3.314475 | 0.029883
CE22858 | Unknown | Y71F9AL.9 | 3.300621 | 0.030157
CE26948 | locus:rps-17 40S ribosomal protein S17 | T08B2.10 | 3.226166 | 0.032
CE27398 | locus:rpl-7A | Y24D9A.4a | 3.209526 | 0.032412
CE00561 | Acetyl-coa acetyltransferase | B0303.3 | 3.187211 | 0.032842
CE25224 | locus:pud-1.2 | Y19D10B.7 | 3.168548 | 0.033164
CE15997 | Unknown | F37H8.5 | 3.144813 | 0.033617
CE13100 | locus:pgk-1 phosphoglycerate kinase | T03F1.3 | 3.136648 | 0.033783
CE17986 | locus:rpl-12 Ribosomal protein L11 | JC8.3a | 3.134835 | 0.03382
CE18107 | locus:gpx-2 Glutathione peroxidases | R05H10.5 | 3.108994 | 0.034386
Table 4.5: Differential expression (day 5 vs day 16): Proteins for which the expression
levels are significantly greater in day 16.
Wormbase Id | Protein | Gene | t-statistic | p-value
CE04442 | locus:ilys-5 | F22A3.6a | -22.1064 | 0
CE00475 | locus:ttr-2 Transthyretin-like family | K03H1.4 | -13.209 | 0.00011
CE05529 | Unknown | D1054.11 | -11.9632 | 0.000247
CE09542 | locus:asp-6 protease | F21F8.7 | -9.62595 | 0.000596
CE04746 | locus:vit-1 vit-1 | K09F5.2 | -9.18562 | 0.00094
CE04189 | Unknown | C42D4.1 | -9.0985 | 0.001001
CE06290 | locus:vha-3 | Y38F2AL.4 | -8.7682 | 0.00126
CE40453 | Unknown | F13G11.3 | -8.08169 | 0.001448
CE09406 | locus:far-3 O.volvulus antigen peptide like | F15B9.1 | -8.04526 | 0.001491
CE03921 | locus:vit-5 | C04F6.1 | -7.71063 | 0.001854
CE06950 | locus:vit-2 vitellogenin | C42D8.2a | -7.58344 | 0.002171
CE06664 | Unknown | ZK856.7 | -7.4765 | 0.002295
CE04533 | locus:lbp-1 fatty acid-binding protein | F40F4.3 | -6.38885 | 0.003435
CE32898 | Unknown | F55B11.2 | -6.38148 | 0.003466
CE31124 | Unknown | Y62H9A.5 | -6.26209 | 0.004084
CE15560 | Unknown | B0513.4a | -6.2333 | 0.004292
CE03567 | locus:asp-4 aspartyl protease | R12H7.2 | -6.15874 | 0.004985
CE02733 | Serine carboxypeptidase | F41C3.5 | -5.75812 | 0.005676
CE03639 | locus:ttr-15 Transthyretin-like family | T07C4.5 | -5.496 | 0.00627
CE05036 | Unknown | W05H9.1a | -5.39579 | 0.006685
CE39742 | locus:ttr-45 | JC8.14 | -5.38366 | 0.006758
CE19242 | Unknown | Y62H9A.6 | -5.23522 | 0.007548
CE33800 | locus:ttr-51 Transthyretin-like family | JC8.8 | -5.20322 | 0.007747
CE02283 | ribosomal protein (L7AE family) | M28.5 | -5.00693 | 0.008646
CE02543 | Unknown | C44B7.5 | -4.85823 | 0.009383
CE14325 | locus:ttr-6 | T28B4.3 | -4.80003 | 0.00988
CE17388 | Unknown | C08F11.11 | -4.74859 | 0.010321
CE02454 | 17k antigen (O. volvulus) | C06A8.3 | -4.64714 | 0.011078
CE00133 | locus:far-1 O. volvulus 20Kd antigenic peptide | F02A9.2a | -4.57087 | 0.011522
CE02618 | locus:ifb-1 Intermediate filament protein | F10C1.2a | -4.5539 | 0.011652
CE37390 | Unknown | K06G5.1b | -4.52673 | 0.011876
CE01220 | Unknown | E04F6.8 | -4.31626 | 0.013385
CE00994 | locus:sip-1 Heat shock hsp20 proteins | F43D9.4 | -4.24152 | 0.013645
CE15746 | locus:lmn-1 Intermediate filament proteins (2 domains) | DY3.2 | -4.11672 | 0.014432
CE10246 | locus:grd-5 | F41E6.2 | -4.11054 | 0.014458
CE16921 | locus:perm-4 | C44B12.5 | -4.05915 | 0.015433
CE00839 | locus:cgh-1 ATP-dependent RNA helicase | C07H6.5 | -4.04843 | 0.015612
CE01531 | locus:mlc-4 myosin regulatory light chain 2 | C56G7.1 | -4.0338 | 0.01579
CE06835 | locus:flu-2 kynureninase | C15H9.7 | -3.98547 | 0.016321
CE12296 | locus:cey-3 | M01E11.5 | -3.94466 | 0.016965
CE05528 | Unknown | D1054.10 | -3.91628 | 0.017383
CE11226 | locus:ucr-1 Mitochondrial processing protease enhancing protein | F56D2.1 | -3.78532 | 0.019189
CE07828 | locus:lys-7 | C02A12.4 | -3.71538 | 0.020205
CE03635 | locus:ttr-14 Transthyretin-like family | T05A10.3 | -3.63674 | 0.02215
CE01235 | locus:dnpp-1 | F01F1.9 | -3.63461 | 0.02221
CE04532 | locus:lbp-2 fatty acid-binding protein | F40F4.2 | -3.62024 | 0.022606
CE10370 | locus:cpn-1 smooth muscle protein SM22 like | F43G9.9 | -3.6099 | 0.022887
CE15602 | locus:atp-5 ATP synthase D chain | C06H2.1 | -3.57532 | 0.023849
CE02249 | locus:iff-2 initiation factor 5A | F54C9.1 | -3.57031 | 0.023995
CE21681 | locus:asp-1 peptidase (A1 pepsin family) | Y39B6A.20 | -3.56005 | 0.024312
CE06114 | yeast protein L8167.9-like | K07C5.4 | -3.52536 | 0.025453
CE07648 | locus:cth-2 cystathionine gamma-lyase | ZK1127.10 | -3.44861 | 0.027222
CE00657 | locus:prdx-3 Mer5 (mouse) | R07E5.2 | -3.39087 | 0.027981
CE09784 | locus:ubc-9 ubiquitin-conjugating enzyme | F29B9.6 | -3.36661 | 0.028544
CE10508 | locus:hmg-5 | F45E4.9 | -3.31626 | 0.029846
CE16638 | Unknown | Y45F10C.4 | -3.31484 | 0.029875
CE04001 | Unknown | C15H9.9 | -3.30803 | 0.030012
CE01917 | locus:tpp-2 Tripeptidyl-peptidase II | F21H12.6 | -3.28234 | 0.030524
CE13738 | locus:npp-7 | T19B4.2 | -3.27681 | 0.030642
CE07373 | locus:spc-1 spectrin alpha chain | K10B3.10a | -3.24932 | 0.031342
CE02601 | locus:numr-1 | F08F8.5 | -3.11391 | 0.034271
CE08945 | locus:snr-4 small nuclear ribonucleoprotein D2 like | C52E4.3 | -3.1065 | 0.034446
CE03030 | locus:spr-2 | C27B7.1b | -3.09355 | 0.034781
Table 4.6: Differential expression (day 8 vs day 16): Proteins for which the expression
levels are significantly greater in day 8.
Wormbase Id | Protein | Gene | t-statistic | p-value
CE16403 | locus:mlc-5 EF hand | T12D8.6 | 11.74316 | 0.000571
CE17349 | locus:eif-3.E | B0511.10 | 9.335005 | 0.001191
CE33722 | Unknown | Y105E8A.19 | 9.103345 | 0.001301
CE17766 | Leucine Rich Repeat (2 copies) (2 domains) | F33H2.3 | 8.958298 | 0.001556
CE40818 | locus:pgp-6 p-glycoprotein | T21E8.1a | 7.050867 | 0.003368
CE30779 | locus:rps-21 Ribosomal protein S21 | F37C12.11 | 6.316825 | 0.004343
CE34065 | NADH dehydrogenase ND1 | MTCE.11 | 6.253446 | 0.004479
CE27850 | locus:perm-2 | C44B12.1 | 6.103793 | 0.004872
CE18855 | locus:dut-1 deoxyuridine 5'-triphosphate nucleotidohydrolase | K07A1.2 | 5.969666 | 0.005085
CE40116 | Unknown | T02E9.5 | 5.852998 | 0.005385
CE32648 | Unknown | F35D11.4 | 5.645527 | 0.006362
CE00874 | locus:rnr-2 Ribonuclease-diphosphate reductase M2 | C03C10.3 | 5.367934 | 0.007738
CE26206 | Unknown | Y59A8B.10a | 5.284548 | 0.008106
CE42949 | locus:rad-23 RAD23 protein homolog2 like | ZK20.3 | 4.982257 | 0.008974
CE20707 | locus:zyg-9 elongation factor | F22B5.7 | 4.883019 | 0.009438
CE20336 | locus:tfg-1 | Y63D3A.5 | 4.842286 | 0.00978
Table 4.7: Differential expression (day 8 vs day 16): Proteins for which the expression
levels are significantly greater in day 16.
Wormbase Id | Protein | Gene | t-statistic | p-value
CE04442 | locus:ilys-5 | F22A3.6a | -25.3306 | 0
CE02573 | locus:icd-1 Transcription factor BTF3 (human) | C56C10.8 | -12.9218 | 0.000153
CE06835 | locus:flu-2 kynureninase | C15H9.7 | -11.9853 | 0.000339
CE15947 | locus:clec-63 Lectin C-type domain short and long forms, von Willebrand factor type A domain | F35C5.6 | -10.6122 | 0.000724
CE00475 | locus:ttr-2 Transthyretin-like family | K03H1.4 | -9.65924 | 0.000911
CE15602 | locus:atp-5 ATP synthase D chain | C06H2.1 | -8.95042 | 0.001566
CE00548 | locus:eef-1B.1 Elongation factor1 | F54H12.6 | -8.71844 | 0.001777
CE05529 | Unknown | D1054.11 | -8.31435 | 0.001957
CE04859 | acetoacetyl CoA thiolase | T02G5.7 | -8.06285 | 0.002082
CE09406 | locus:far-3 O.volvulus antigen peptide like | F15B9.1 | -7.95689 | 0.002259
CE02249 | locus:iff-2 initiation factor 5A | F54C9.1 | -7.81098 | 0.002454
CE09656 | locus:tct-1 TCTP protein | F25H2.11 | -7.78831 | 0.002496
CE00854 | locus:rps-0 40S ribosomal protein | B0393.1 | -7.76332 | 0.002551
CE15745 | locus:tin-13 | DY3.1 | -7.6775 | 0.002832
CE07648 | locus:cth-2 cystathionine gamma-lyase | ZK1127.10 | -7.26906 | 0.003065
CE01235 | locus:dnpp-1 | F01F1.9 | -7.25281 | 0.003101
CE01308 | locus:tag-174 Cytochrome C oxidase | F54D8.2 | -6.72905 | 0.003734
CE03567 | locus:asp-4 aspartyl protease | R12H7.2 | -6.55555 | 0.004
CE03921 | locus:vit-5 | C04F6.1 | -6.55108 | 0.004006
CE04306 | locus:nap-1 | D2096.8 | -6.24001 | 0.004517
CE04001 | Unknown | C15H9.9 | -6.17658 | 0.004715
CE02283 | ribosomal protein (L7AE family) | M28.5 | -5.89118 | 0.005255
CE06290 | locus:vha-3 | Y38F2AL.4 | -5.81767 | 0.005497
CE04533 | locus:lbp-1 fatty acid-binding protein | F40F4.3 | -5.8096 | 0.005523
CE10598 | locus:cey-2 | F46F11.2 | -5.7179 | 0.00588
CE02454 | 17k antigen (O. volvulus) | C06A8.3 | -5.67098 | 0.006206
CE01296 | Unknown | F42A10.5 | -5.65036 | 0.006333
CE02733 | Serine carboxypeptidase | F41C3.5 | -5.58415 | 0.006668
CE08177 | locus:hsp-3 heat shock protein | C15H9.6 | -5.57153 | 0.006716
CE04189 | Unknown | C42D4.1 | -5.41791 | 0.007581
CE00133 | locus:far-1 O. volvulus 20Kd antigenic peptide | F02A9.2a | -5.32219 | 0.007903
CE11226 | locus:ucr-1 Mitochondrial processing protease enhancing protein | F56D2.1 | -5.30595 | 0.007975
CE05036 | Unknown | W05H9.1a | -5.02112 | 0.008767
CE12770 | locus:pes-9 Yeast hypothetical 52.9 KD protein like | R11H6.1 | -4.99277 | 0.008892
CE09542 | locus:asp-6 protease | F21F8.7 | -4.8437 | 0.009767
CE00657 | locus:prdx-3 Mer5 (mouse) | R07E5.2 | -4.80427 | 0.01015
CE14325 | locus:ttr-6 | T28B4.3 | -4.76687 | 0.010544
CE15560 | Unknown | B0513.4a | -4.72783 | 0.010983
Table 4.8: Differential expression (day 11 vs day 16): Proteins for which the
expression levels are significantly greater in day 11.
Wormbase Id | Protein | Gene | t-statistic | p-value
CE30779 | locus:rps-21 Ribosomal protein S21 | F37C12.11 | 11.72859 | 0.000273
Table 4.9: Feature selection: Features selected for their power to predict
the age of C. elegans from protein expression.
Wormbase Id | Protein | Gene
CE00291 | locus:eif-3.D | R08D7.3
CE01104 | locus:ccdc-47 | ZK1058.4
CE02249 | locus:iff-2 initiation factor 5A | F54C9.1
CE03567 | locus:asp-4 aspartyl protease | R12H7.2
CE05402 | locus:cdc-48.2 P97 protein | C41C4.8
CE05445 | locus:sars-1 seryl-tRNA synthetase | C47E12.1
CE06290 | locus:vha-3 | Y38F2AL.4
CE06950 | locus:vit-2 vitellogenin | C42D8.2a
CE07669 | locus:rpl-4 ribosomal protein L1 | B0041.4
CE07828 | locus:lys-7 | C02A12.4
CE09406 | locus:far-3 O.volvulus antigen peptide like | F15B9.1
CE09542 | locus:asp-6 protease | F21F8.7
CE10370 | locus:cpn-1 smooth muscle protein SM22 like | F43G9.9
CE13219 | locus:clic-1 | T05B11.3
CE13738 | locus:npp-7 | T19B4.2
CE15560 | Unknown | B0513.4a
CE16968 | locus:cyc-2.1 cytochrome C | E04A4.7
CE18372 | locus:qars-1 tRNA synthetases classI (E and Q) | Y41E3.4a
CE18476 | Unknown | ZK1055.7
CE18819 | locus:mbf-1 Helix-turn-helix | H21P03.1
CE20622 | locus:ftn-2 ferritin | D1037.3
CE26206 | Unknown | Y59A8B.10a
CE27646 | Unknown | ZK105.1
CE29774 | electron transfer flavoprotein beta | F23C8.5
CE33800 | locus:ttr-51 Transthyretin-like family | JC8.8
CE40116 | Unknown | T02E9.5
CE40453 | Unknown | F13G11.3
CE42949 | locus:rad-23 RAD23 protein homolog2 like | ZK20.3
CE45033 | locus:lam-2 | C54D1.5
Bibliography
Alberts B., Johnson A., Lewis J., Raff M., Roberts K. and Walter P., Molecular biology
of the cell. 4th ed. New York: Garland Science, 2002.
Almuallim H. and Dietterich T.G., Learning with many irrelevant features, MIT Press, In
Proc. AAAI-91: 547-5, 1991.
Bantscheff M., Schirle M., Sweetman G., Rick J. and Kuster B., Quantitative mass
spectrometry in proteomics: a critical review, Anal Bioanal Chem., 389: 1017–103,
2007.
Bowerman B., C. elegans Aging: Proteolysis Cuts Both Ways, Curr. Bio., 17(13): R5142, 2007.
Caruana R. and Freitag D., Greedy attribute selection, In Proc. of the Eleventh Int. Conf.
on Machine Learning, p 28-36, 1994.
Christensen K., Doblhammer G., Rau R. and Vaupel J.W., Ageing populations: the
challenges ahead. Lancet, 374(9696): 1196-22, 2009.
Chung L. and Ng Y-C., Age-related alterations in expression of apoptosis regulatory
proteins and heat shock proteins in rat skeletal muscle. Biochim Biophys Acta.,
1762(1):103-9, 2006.
Clarke S., Aging as war between chemical and biochemical processes: Protein
methylation and the recognition of age-damaged proteins for repair. Ageing Res.
Rev., 2(3): 263-85, 2003.
Combaret L., Dardevet D., Béchet D., Taillandier D., Mosoni L. and Attaix D., Skeletal
muscle proteolysis in aging, Curr. Opin. Clin. Nutr. Metab. Care., 12(1): 37-4,
2009.
Cruz J.A. and Wishart D.S., Applications of Machine Learning in Cancer Prediction and
Prognosis, Cancer Inform., 2: 59-18, 2006.
Dean R.B. and Dixon W.J., Simplified Statistics for Small Numbers of Observations.
Anal. Chem., 23(4): 636-2, 1951.
Demetrius L., Gundlach V.M. and Ochs G., Complexity and demographic stability in
population models. Theo Pop Biol., 65: 211, 2004.
Demetrius L. and Manke T., Robustness and network evolution-an entropic principle.
Physica A, 346: 682, 2005.
Demichelis F., Magni P., Piergiorgi P., Rubin M.A. and Bellazzi R., A hierarchical Naïve
Bayes Model for handling sample heterogeneity in classification problems: an
application to tissue microarrays, BMC Bioinformatics, 7(514): 1-12, 2006.
Diaz A.A., Tomba E., Lennarson R., Richard R., Bagajewicz M.J. and Harrison R.G.,
Prediction of protein solubility in Escherichia coli using logistic regression,
Biotechnol Bioeng., 105(2): 374-9, 2010.
Efron B., Hastie T., Johnstone I. and Tibshirani R., Least angle regression, Annals of
Statistics, 32(2): 407-92, 2004.
Ellis G.C., Phillips J.B., O'Rourke S., Lyczak R. and Bowerman B., Maternally expressed
and partially redundant beta-tubulins in Caenorhabditis elegans are autoregulated,
J Cell Sci, 117(3): 457-64, 2004.
Finkel T., Serrano M. and Blasco M.A., The common biology of cancer and ageing.
Nature, 448: 767-7, 2007.
Guo Y., Miyagi M., Zeng R. and Sheng Q., O18Quant: A Semiautomatic Strategy for
Quantitative Analysis of High-Resolution 16O/18O Labeled Data. Biomed Res.
Int., 2014: 971857, 2014.
Habibi N., Hashim S.Z.M., Norouzi A. and Samian M.R., A review of machine learning
methods to predict the solubility of overexpressed recombinant proteins in
Escherichia coli, BMC Bioinformatics, 15(134): 1-16, 2014.
Hillier L.W., Coulson A., Murray J.I., Bao Z., Sulston J.E. and Waterston R.H.,
Genomics in C. elegans: So many genes, such a little worm. Genome Res., 15(12):
1651-60, 2005.
Hjelmborg J., Iachine I., Skytthe A., Vaupel J.W., McGue M., Koskenvuo M., Kaprio J.,
Pedersen N.L. and Christensen K., Genetic influence on human lifespan and
longevity. Human Genetics, 119(3): 312–9, 2006.
Hsu A.L., Murphy C.T. and Kenyon C., Regulation of aging and age-related disease by
DAF-16 and heat-shock factor. Science, 300(5622): 1142-5, 2003.
Hua Yang H. and Moody J., Feature Selection Based on Joint Mutual Information, In
Proc. of Int. ICSC Symp. on Adv. in Intel. Data Analysis, p 22-25, 1999.
Irvine G.B., El-Agnaf O.M., Shankar G.M. and Walsh D.M., Protein aggregation in the
brain: the molecular basis for Alzheimer's and Parkinson's diseases. Mol Med.,
14(7-8): 451-64, 2008.
Jerez J.M., Molina I., García-Laencina P.J., Alba E., Ribelles N., Martín M. and
Franco L., Missing data imputation using statistical and machine learning methods
in a real breast cancer problem. Artificial Intel. in Med., 50(2): 105-5, 2010.
Kadiyala C.S., Tomechko S.E. and Miyagi M., Perfluorooctanoic Acid for shotgun
proteomics. PLoS One, 5(12): e15332-7, 2010.
Kammeyer A. and Luiten R.M., Oxidation events and skin aging. Ageing Res. Rev., 21C:
16-29, 2015.
Kira K. and Rendell L.A., The feature selection problem: Traditional methods and a new
algorithm, MIT Press, Tenth Nat. Conf. on Artificial Intelligence, p 129-134, 1992.
Kourou K., Exarchos T.P., Exarchos K.P., Karamouzis M.V. and Fotiadis D.I., Machine
learning applications in cancer prognosis and prediction, Comput Struct Biotechnol
J., 13: 8-17, 2014.
Lu C., Srayko M. and Mains P.E., The Caenorhabditis elegans microtubule-severing
complex MEI-1/MEI-2 katanin interacts differently with two superficially
redundant beta-tubulin isotypes, Mol Biol Cell., 15(1): 142-50, 2004.
Manke T., Demetrius L. and Vingron M., An entropic characterization of protein
interaction networks and cellular robustness. J R Soc Interface., 3(11): 843–7,
2006.
Martinez-Vicente M., Sovak G. and Cuervo A.M., Protein degradation and aging. Exp
Gerontol., 40(8-9): 622-33, 2005.
Nagaraj N., Wisniewski J.R., Geiger T., Cox J., Kircher M., Kelso J., Pääbo S. and Mann M.,
Deep proteome and transcriptome mapping of a human cancer cell line. Mol Syst
Biol., 7:548, 2011.
Rodriguez M., Snoek L.B., De Bono M. and Kammenga J.E., Worms under stress:
C. elegans stress response and its relevance to complex human disease and aging,
Trends Genet., 29(6): 367-74, 2013.
Schmutz J., Wheeler J., Grimwood J., Dickson M., Yang J., Caoile C., Bajorek E.,
Black S., Chan Y.M., Denys M., Escobar J., Flowers D., Fotopulos D., Garcia C.,
Gomez M., Gonzales E., Haydu L., Lopez F., Ramirez L., Retterer J., Rodriguez
A., Rogers S., Salazar A., Tsai M. and Myers R.M., Quality assessment of the
human genome sequence. Nature, 429(6990): 365–3, 2004.
Shamir L., Wolkow C.A. and Goldberg I.G., Quantitative measurement of aging using
image texture entropy. Bioinformatics, 25(23): 3060-3, 2009.
Shannon C.E., A Mathematical Theory of Communication. Bell System Technical
Journal, 27(3): 379-44, 1948.
Shannon P., Markiel A., Ozier O., Baliga N.S., Wang J.T., Ramage D., Amin N.,
Schwikowski B. and Ideker T., Cytoscape: A Software Environment for Integrated
Models of Biomolecular Interaction Networks, Genome Res., 13(11): 2498-6, 2003.
Stiernagle T., Maintenance of C. elegans. WormBook, 1-11, 2006.
Storey J.D. and Tibshirani R., Statistical significance for genomewide studies. PNAS,
100(16): 9440-5, 2003.
Teschendorff A. and Severini S., Increased entropy of signal transduction in the cancer
metastasis phenotype. BMC Sys. Bio., 4: 104, 2010.
The C. elegans Genome Sequencing Consortium, Genome sequence of the nematode C.
elegans: A platform for investigating biology. Science, 282: 2012-6, 1998.
Tibshirani R., Regression shrinkage and selection via the lasso, J. Royal. Statist. Soc B.,
58(1): 267-21, 1996.
Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T., Tibshirani R., Botstein D.
and Altman R.B., Missing value estimation methods for DNA microarrays.
Bioinformatics, 17(6): 520-5, 2001.
Vukoti K., Yu X., Sheng Q., Saha S., Feng Z., Hsu A-L. and Miyagi M., Monitoring
Newly Synthesized Proteins over the Adult Life Span of Caenorhabditis elegans. J.
Proteome Res., 14: 1483−11, 2015.
Walker F.O., Huntington's disease, Lancet, 369(9557): 218-28, 2007.
West J., Bianconi G., Severini S. and Teschendorff A.E., Differential network entropy
reveals cancer system hallmarks. Sci Rep., 2: 802, 2012.
White J.G., Southgate E., Thomson J.N. and Brenner S., The structure of the nervous
system of the nematode Caenorhabditis elegans. Philos. Trans. R. Soc. Lond., B,
Biol. Sci., 314(1165): 1–340, 1986.
Wood W.B., The Nematode Caenorhabditis elegans. Cold Spring Harbor Laboratory
Press. p.1, ISBN 0-87969-433-5, 1988.
Yang J.S., Nam H.J., Seo M., Han S.K., Choi Y., Nam H.G., Lee S.J. and Kim S.,
OASIS: online application for the survival analysis of lifespan assays performed in
aging research. PLoS One, 6(8): e23525-11, 2011.
Yuan Y., Kadiyala C.S., Ching T.T., Hakimi P., Saha S., Xu H., Yuan C., Mullangi V.,
Wang L., Fivenson E., Hanson R.W., Ewing R., Hsu A.L., Miyagi M. and Feng Z.,
Enhanced Energy Metabolism Contributes to the Extended Life Span of Calorie-restricted
Caenorhabditis elegans. J. Biol. Chem, 287(37): 31414-26, 2012.
Yuan Y., Van Allen E.M., Omberg L., Wagle N., Amin-Mansour A., Sokolov A.,
Byers L.A., Xu Y., Hess K.R., Diao L., Han L., Huang X., Lawrence M.S.,
Weinstein J.N., Stuart J.M., Mills G.B., Garraway L.A., Margolin A.A., Getz G.
and Liang H., Assessing the clinical utility of cancer genomic and proteomic data
across tumor types. Nature Biotech., 32: 644-8, 2014.
Zhang T., Adaptive Forward-Backward Greedy Algorithm for Learning Sparse
Representations, IEEE Transactions on Info. Theo., 57(7): 4689-19, 2011.
Zhou K.I., Pincus Z. and Slack F.J., Longevity and stress in Caenorhabditis elegans,
Aging, 3(8): 733–20, 2011.