Studies On The Role Of Protein Structural Disorder On The
Evolutionary Features Of Prokaryotic And Eukaryotic
Genomes
THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY (Sc.)
IN
BIOPHYSICS, MOLECULAR BIOLOGY AND BIOINFORMATICS
By
Arup Panda
Department of Biophysics, Molecular Biology and Bioinformatics
University of Calcutta
2015
Declaration
I do hereby declare that the dissertation entitled “Studies On The Role Of Protein Structural
Disorder On The Evolutionary Features Of Prokaryotic And Eukaryotic Genomes” submitted to the
evaluation committee is a record of original research done by me under the guidance of Professor
Tapash Chandra Ghosh, Bioinformatics Center, Bose Institute, Kolkata. I certify that this work
contains no material that has been accepted for the award of any other degree or diploma in my name
in any university or other tertiary institution, and no part of this work will in the future be used for any
other degree or diploma in any university or other tertiary institution without the prior approval from
the committee. I further declare that all the results presented in this thesis, either in statements or in the
form of tables or figures, are the outcome of my own research and that this thesis contains no material
previously published or written by any other person, unless stated with due reference in appropriate
context.
Date:
Place: Kolkata
(Arup Panda)
Acknowledgements
I would like to express my deepest sense of gratitude to my revered supervisor Dr. Tapash
Chandra Ghosh, Professor, Bioinformatics Center, Bose Institute, Kolkata for allowing me to pursue
my Ph.D. under his guidance. His unfeigned care, spontaneous help, valuable suggestions and constant
encouragement have played the key role in the successful completion of this dissertation in time.
Further, I owe my sincere gratitude to Professor Pinakpani Chakraborty, Head of the Department BIC,
Bose Institute, and other faculty members of BIC, Bose Institute: Dr. Sudipto Saha, Dr. Subhra Ghosh
Dastidar and Dr. Jhumur Ghosh. I would like to take this opportunity to thank all my respected
teachers of the Department of Biophysics, Molecular Biology and Bioinformatics and the Department
of Biochemistry, Calcutta University. I would like to express my indebtedness to all my seniors and
other lab members and staff members of BIC, Bose Institute. Finally, I would like to express my
gratitude to my parents and my family members.
Abstract
Intrinsically disordered proteins (IDPs) are a class of proteins that lack stable three-dimensional structures under physiological conditions. IDPs have been found to play intriguing roles in
cellular signalling, regulation of cell division, transcriptional and translational control, etc. Owing to their
extensive functional importance, special attention has been paid to the contribution of IDPs to the origin
and evolution of prokaryotic and eukaryotic systems. In this thesis we explored the role of disordered
proteins in various evolutionary features of prokaryotic and eukaryotic genomes. First, we analyzed
whether disordered proteins have been exploited for microbial adaptation to the aerobic environment.
Our analysis of prokaryotes from four oxygen requirement groups revealed that aerobic proteomes
contain a high proportion of disordered residues, irrespective of selection on other genomic or
proteomic attributes. We analyzed the functional significance of disordered proteins in aerobic
proteomes and proposed that high protein disorder is an adaptive opportunity for aerobic microbes to
cope with the genomic and functional complexities of aerobic lifestyles. Considering the inherent
differences in genome organization between cold- and warm-blooded vertebrates, it was previously
proposed that warm-blooded vertebrates had undergone a significant GC increase. This type of genome
transition was thought to increase the thermodynamic and structural stabilities of proteins through a
selective increase in protein hydrophobicity. However, our study showed that the GC transition
between vertebrate genomes increases protein disorder content in warm-blooded proteomes,
promoting functional diversity of proteins encoded by GC-rich genes. To evaluate how disordered
residues influence human disease gene evolution, we analyzed the evolutionary rates of human
neurodegenerative disease (NDD) associated genes. Here, we observed that human NDD genes are
evolutionarily conserved relative to non-disease genes. To explain the conserved nature of NDD
genes, we examined several evolutionary parameters such as protein connectivity, 3′ UTR length,
relative aggregation propensity (RAP), nature of hub proteins (singlish-/multi-interface), etc. The relative
importance of these determinants was confirmed by categorical regression analysis. Our
investigation has clarified the role of protein disorder content in the evolutionary attributes of NDD
genes and also explored its interconnection with the other determinants of protein evolutionary rates.
Abbreviations
ADDA: Automatic Domain Decomposition Algorithm
ANOVA: ANalysis Of VAriance
AUC: Area Under the Curve
BKL: Biobase Knowledge Library
CAI: Codon Adaptation Index
CBP: CREB-Binding Protein
CD: Circular Dichroism spectroscopy
CH-Plot: Charge-Hydrophobicity Plot
COA: COrrespondence Analysis
COG: Clusters of Orthologous Groups
dN/dS: Evolutionary Rate
dN: Non-Synonymous substitution rate
dS: Synonymous substitution rate
FTIR: Fourier Transform Infrared Spectroscopy
GAD: Genetic Association Database
gBGC: GC-Biased Gene Conversion
GC3: GC content at the third codon position
GO: Gene Ontology
HGMD: Human Gene Mutation Database
HMM: Hidden Markov Model
IDPs: Intrinsically Disordered Proteins
ITC: Isothermal Titration Calorimetry
IUPs: Intrinsically Unstructured Proteins
KID: Kinase Inducible transcriptional activation Domain
LHGR: Long Homogeneous Genome Regions
MCC: Matthews Correlation Coefficient
MD: Molecular Dynamics
NDD: Neurodegenerative Disease
OMIM: Online Mendelian Inheritance in Man
PDB: Protein Data Bank
PIR: Protein Information Resource
PSRC: PhotoSynthetic Reaction Center
RAP: Relative Aggregation Propensity
RNaseA: RiboNuclease-A
ROC: Receiver Operating Characteristic
RSCU: Relative Synonymous Codon Usage
SANS: Small-Angle Neutron Scattering
SAXS: Small-Angle X-Ray Scattering
smFRET: single-molecule Fluorescence Resonance Energy Transfer
tAI: tRNA Adaptation Index
Contents
1. Chapter-1: General introduction
1.1. Preface
1.2. Early history of protein structure determination and discovery of disordered proteins
1.3. Amino acid composition bias of disordered proteins
1.4. Nucleotide composition of disordered proteins
1.5. Characterization of intrinsically disordered proteins
1.6. Identification of disordered regions by CH-plot analysis
1.7. Identification of disordered regions by prediction algorithms
1.8. Abundance of disordered proteins
1.9. Databases of disordered proteins
1.10. Functional annotations of disordered proteins
1.11. Disordered protein and disease association
1.12. Evolution and disordered proteins
1.13. Origin of the proposal
1.14. General organization of the thesis
2. Chapter-2: Resources and Methods
2.1. NCBI database
2.2. Ensembl database
2.3. UCSC genome browser
2.4. Gene Expression Atlas
2.5. Human disease gene databases
2.6. MicroRNA.org database
2.7. BioGRID database
2.8. Pfam database
2.9. UniProt and DBD databases
2.10. AgBase GOanna web server
2.11. IUPred algorithm
2.12. FoldIndex
2.13. ANCHOR algorithm
2.14. NuPoP web server
2.15. CpG Island Searcher
2.16. TANGO algorithm
2.17. CodonW
2.18. ClustalW
2.19. BLAST
2.20. Statistical analysis
3. Chapter-3: Prevalent structural disorder carries signature of prokaryotic adaptation to oxic atmosphere
Chapter summary
3.1. Introduction
3.2. Methods
3.2.1. Collection of dataset
3.2.2. Prediction of disordered residues
3.2.3. Calculation of GC content and amino acid frequencies
3.2.4. Disorder content of aerobic and anaerobic COGs
3.2.5. Prediction of disordered binding sites and transcription factors
3.2.6. Statistical analysis
3.3. Results
3.3.1. Predicted protein disorder in prokaryotic genomes
3.3.2. High protein disorder in aerobic prokaryotes and other covariates
3.3.3. High protein disorder in aerobic prokaryotes and their functional implications
3.4. Discussion
3.5. Conclusions
4. Chapter-4: GC-made protein disorder sheds new light on vertebrate evolution
Chapter summary
4.1. Introduction
4.2. Results
4.2.1. Compositional transition within vertebrates: effects on protein intrinsic disorder content
4.2.2. Confounding factors that can modulate protein intrinsic disorder in transition and non-transition groups
4.2.3. Correlation between GC and protein disorder: significance of amino acid choice
4.2.4. Disorder content evolution in human proteins: functional advantages
4.3. Discussion
4.3.1. Potential caveats
4.4. Conclusion
4.5. Materials and methods
4.5.1. Collection of dataset
4.5.2. Prediction of protein intrinsic disorder content
4.5.3. Mapping of genes to their corresponding isochores
4.5.4. Analysis of nucleosome occupancy and CpG islands between ordered and disordered regions
4.5.5. Association between GC changing substitutions and disorder promoting amino acid mutations
4.5.6. Analysis of multifunctionality
4.5.7. Analysis of aggregation propensity and hydrophobicity
4.5.8. Statistical analysis
5. Chapter-5: Insights into the Evolutionary Features of Human Neurodegenerative Diseases
Chapter summary
5.1. Introduction
5.2. Materials and Methods
5.2.1. Dataset Preparation for Evolutionary Rate Estimation
5.2.2. Determining Gene Expression Level and Expression Width
5.2.3. Protein-protein interaction data
5.2.3. Identification of Nature of Hub Proteins
5.2.4. microRNA Targeting and 3′ UTR Length Calculation
5.2.6. Estimation of Protein Disorder Content
5.2.5. Computing Protein Relative Aggregation Propensity (RAP)
5.2.6. Statistical Analyses
5.3. Results
5.3.1. Gene Expression Level Constraining the Evolutionary Rates of NDD Genes
5.3.2. Examining Protein Connectivity and miRNA Targeting as Influential Factors of Protein Evolutionary Rates
5.3.3. Protein Intrinsic Disorder Content and Nature of Hub Proteins as the Functions of Protein Evolutionary Rates
5.3.4. Relative Aggregation Propensity Negatively Steers Protein Evolutionary Rates
5.3.5. Independent Forces of Protein Evolutionary Rates Using Categorical Regression Model
5.4. Discussion
6. Chapter-6: Summary and general conclusions
7. References
8. Publications
8.1. List of publications
8.2. Reprints