Understanding the Pathogenic Fungus Penicillium

Abstract of thesis entitled
Understanding the Pathogenic Fungus Penicillium marneffei: A
Computational Genomics Perspective
by James J. Cai
for the degree of Doctor of Philosophy
at The University of Hong Kong
in May 2006
Penicillium marneffei, a thermally dimorphic fungus that alternates between a filamentous and a yeast growth form in response to changes in
its environmental temperature, has become an emerging fungal pathogen
endemic in Southeast Asia. Defining the genomics of P. marneffei will
provide a better understanding of the fungus.
This thesis reports the draft sequence of the P. marneffei genome assembled from 6.6× coverage of the genome through whole genome shotgun
sequencing. The 31 Mb genome obtained from the assembly contains
10,060 protein-coding genes. The complete mitochondrial genome is 35
kb long and its gene content and gene order are very similar to those of
Aspergillus. An annotation system and a P. marneffei genome database
(PMGD) were developed to allow a preliminary annotation of the sequences and provide an intuitive graphic interface to give curators and
users ready access to the annotation and the underlying evidence, and
a Matlab-based software package, MBEToolbox, was developed for data
analysis in phylogenetics and comparative genomics. A well-designed and
structured annotation system and powerful sequence analysis software
are essential requirements for the success of large-scale genome analysis
projects.
Analysis of the gene set of P. marneffei provided insights into the
adaptations required by a fungus to cause disease. The genome encodes
a diverse set of putative virulence genes such as proteinase, phospholipase, metacaspase and agglutinin, which may enable the fungus to adhere
to, colonise and invade the host, adapt to the tissue environment, and
avoid the host’s humoral and cellular defences of the innate and adaptive
immune responses. A gene cluster involved in biosynthesis of melanin, a
known virulence factor in some other pathogenic fungi, was also identified in the genome, indicating that P. marneffei may produce melanin
or melanin-like immunosuppressive compounds that protect the fungus
against immune effector cells. More interestingly, the P. marneffei genome
contains more intragenic tandem repeats (IntraTRs) than other fungi.
These IntraTRs encoding repeat domains/motifs may create quantitative variation in surface proteins, allowing the fungus to ‘disguise’ itself
to slip past the vigilant defences of the host immune system. The genome
sequence of P. marneffei also revealed a number of genes associated with
mating processes and sexual development, suggesting an unidentified sexual cycle in the fungus.
The extent and evolutionary patterns of duplicate genes in P. marneffei and other ascomycetes were compared. All ascomycetes show a
certain degree of redundancy (though its extent can vary considerably),
which may provide the foundation for the specialisation of fungal genes
and form the basis for fungal diversification. An inverse relationship between the lineage specificity of a gene and its evolutionary rate was
also discovered, implying that an accelerated evolutionary rate may be
responsible for the emergence of lineage specific genes.
The genome sequence of P. marneffei has provided our first glimpse
into the genomic basis of the physiology of the dimorphic filamentous
fungus.
Understanding the Pathogenic Fungus
Penicillium marneffei: A Computational
Genomics Perspective
BY
James J. Cai
M.D., Henan Medical University, 1996
M.S., University of New South Wales, 2001
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
at The University of Hong Kong
May 2006
To Yan
“Any living cell carries with it the experiences of a billion
years of experimentation by its ancestors.”
Max Delbrück (1949)
DECLARATION
I declare that this thesis represents my own work, except where due
acknowledgement is made, and that it has not been previously included
in a thesis, dissertation or report submitted to this University or to any
other institution for a degree, diploma or other qualifications.
Signature:
Date:
ACKNOWLEDGEMENTS
First of all, special thanks go to my principal supervisor, Professor Kwok-yung Yuen, for his enthusiasm and support during
the course of my study. My heartfelt thanks to Dr. David K.
Smith and Dr. Xuhua Xia who introduced me to the fascinating
world of bioinformatics and molecular evolution.
Thanks to my friends and colleagues for their moral support
and technical assistance over the past four years, especially Dr.
Patrick Woo, Dr. Sussana Lau, and Jade, Huang Yi, Ken, Haw,
Candy, Rachel ... I am also grateful to my external mentor Dr.
Gavin Huttley and fellow colleagues Peter, Ray, Helen and Brett
at the Australian National University.
Finally, I am very grateful to my wife and my parents. Without
their support, this work would not have been possible.
TABLE OF CONTENTS
Declaration
Acknowledgements
List of Figures
List of Tables
Abbreviations
Glossary
Introduction
Chapter 1: The draft genome sequence of Penicillium marneffei
1.1 Introduction
1.2 Literature Review
1.2.1 General fungal biology
1.2.2 P. marneffei, as an important fungal pathogen
1.2.3 Penicilliosis marneffei
1.2.4 Fungal genome projects
1.3 Materials and Methods
1.3.1 Strain and DNA preparation
1.3.2 Library construction, shotgun sequencing
1.3.3 Sequence assembly
1.3.4 Data release
1.4 Results
1.4.1 Assembly and general characteristic
1.4.2 Genome architecture and co-linearity
1.4.3 Gene duplications (multigene families) and comparisons
1.4.4 Interspecies proteome comparison
1.4.5 Lineage-specific genes
1.4.6 Cell signalling and morphogenesis
1.4.7 Potential mating ability
1.4.8 Putative virulence genes
1.4.9 Cell wall antigens and biosynthetic genes
1.5 Discussion
Chapter 2: Penicillium marneffei genome database and annotation pipeline
2.1 Introduction
2.2 Literature Review
2.2.1 Methods for predicting protein function
2.2.2 Software/database systems for protein function prediction
2.2.3 The art of gene finding
2.3 Implementation
2.3.1 Annotation pipeline
2.3.2 Assembly process
2.3.3 Gene finding
2.3.4 Database and databank to store results
2.3.5 Perl source code collection
2.3.6 Genome browser configuration
2.3.7 Synteny identification
2.4 Results
2.4.1 Statistics of assembly
2.4.2 Genome size estimation
2.4.3 Accuracy of gene finding
2.4.4 Combination of gene finding
2.4.5 Database and databank to store results
2.5 Discussion
Chapter 3: Mitochondrial genome of Penicillium marneffei
3.1 Introduction
3.2 Materials and Methods
3.2.1 Library construction and sequence assembly
3.2.2 Mitochondrial DNA sequence annotation
3.2.3 Phylogenetic analysis
3.2.4 Mitochondrial DNA sequences in nuclear genome
3.3 Results and Discussion
3.3.1 Gene content and genome organisation
3.3.2 Protein coding genes
3.3.3 Genetic code and codon usage
3.3.4 tRNA genes
3.3.5 Other RNA genes
3.3.6 Group I introns
3.3.7 Mitochondrial DNA sequences in nuclear genome
Chapter 4: Genomic evidence for the presence of melanin biosynthesis gene cluster in Penicillium marneffei
4.1 Introduction
4.2 Literature Review
4.2.1 Potential virulence factors
4.2.2 Genomic approaches in identification of virulence factors
4.3 Materials and Methods
4.3.1 Identification of melanin biosynthesis genes in P. marneffei
4.3.2 Multiple alignments and phylogenetic analyses
4.4 Results and Discussion
4.4.1 Melanin gene cluster present in P. marneffei
4.4.2 Disrupted aflatoxin biosynthesis gene cluster in P. marneffei
4.4.3 Absence of penicillin biosynthesis genes in P. marneffei
Chapter 5: Mating abilities in Penicillium marneffei
5.1 Introduction
5.2 Literature Review
5.2.1 Mating in hemiascomycete yeasts
5.2.2 Mating in filamentous ascomycetes
5.3 Materials and Methods
5.4 Results and Discussion
5.4.1 Homologs of known sexual genes
5.4.2 Mating type genes
5.4.3 Mating pheromone precursor genes
5.4.4 Mating pheromone processing genes
5.4.5 Mating pheromone receptor and other GPCRs
Chapter 6: Exploring the genetic components associated with the dimorphism of Penicillium marneffei
6.1 Introduction
6.2 Materials and Methods
6.2.1 Sequence similarity
6.2.2 Phylogenetic Analysis
6.3 Results and Discussion
6.3.1 Perception of external stimuli by cellular sensors
6.3.2 Transduction of biochemical signal
6.3.3 Alteration of the genomic expression
6.3.4 Structural reorganization towards the morphological change
Chapter 7: Intragenic tandem repeats in Penicillium marneffei and other ascomycetes
7.1 Introduction
7.2 Materials and Methods
7.2.1 Identification of coding tandem repeats
7.2.2 Sequence analysis
7.3 Results and Discussion
Chapter 8: Extent and evolutionary pattern of duplicate genes in Penicillium marneffei and other ascomycetes
8.1 Introduction
8.2 Literature Review
8.3 Materials and Methods
8.3.1 Sequences and gene families
8.3.2 Estimation of substitution rate
8.3.3 Relative rate test
8.4 Results
8.4.1 Extent of gene duplication in ascomycetes
8.4.2 Age distribution of duplicate genes
8.4.3 Selective constraint between paralogs
8.4.4 Ka/Ks between paralogs and orthologs
8.4.5 Relative evolutionary rate between paralogs
8.5 Discussion
8.5.1 Gene duplication in ascomycetes is highly diverse
8.5.2 Different selective constraints in yeasts and filamentous ascomycetes
8.5.3 Majority of paralogous genes evolve symmetrically
Chapter 9: Accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes
9.1 Introduction
9.2 Literature Review
9.3 Materials and Methods
9.3.1 Sequences and data sets
9.3.2 Identification of orthologs
9.3.3 Classification of genes into LS groups
9.3.4 Divergence Times
9.3.5 Estimation of substitution rates and statistical analyses
9.3.6 Detection of rate variability across species - Relative Divergence Score (RDS)
9.4 Results
9.4.1 Evolutionary rate differences among LS groups
9.4.2 Evolutionary rate-related factors of genes belonging to different LS groups
9.4.3 Linear regression of divergence time and relative divergence score (RDS)
9.5 Discussion
Chapter 10: MBEToolbox: a Matlab toolbox for sequence data analysis in molecular biology and evolution
10.1 Introduction
10.2 Literature Review
10.2.1 Probabilistic DNA substitution models
10.2.2 Maximum likelihood estimation
10.2.3 Elements of phylogenetic theory
10.2.4 Programs used for phylogenetic analyses
10.3 Implementation
10.3.1 Input data and formats
10.3.2 Sequence Manipulation and Statistics
10.3.3 Evolutionary Distances
10.3.4 Phylogeny Inference
10.3.5 Combination of functions
10.3.6 Graphics and GUI
10.4 Results and Discussion
10.4.1 Vectorisation simplifies programming
10.4.2 Extensibility
10.4.3 Comparison with other toolboxes
10.4.4 A novel enhanced window analysis
10.4.5 Limitations
Chapter 11: Concluding remarks
Bibliography
LIST OF FIGURES
1.1 P. marneffei mould and yeast culture
1.2 Dimorphic switching of P. marneffei
1.3 Phylogenetic tree showing the relationships of P. marneffei to other fungi
1.4 Microsyntenies containing pheromone precursor loci from four fungi
1.5 Triple proteome comparison between P. marneffei, S. cerevisiae and A. fumigatus
1.6 Putative MAPK signalling pathway in P. marneffei
2.1 Flowchart of annotation pipeline for P. marneffei genome
2.2 PMGD genome browser
2.3 Database schema of PMGD
3.1 Fungal respiratory pathways
3.2 Physical map of P. marneffei mitochondrial DNA
3.3 Comparison of gene order between mitochondrial DNAs
3.4 Phylogenetic distribution of group I and group II introns
3.5 28 tRNAs encoded in the mitochondrial genome of P. marneffei
3.6 Secondary structures of two representative group I introns
4.1 P. marneffei abr1 gene Cu-oxidase domain homologues
4.2 Melanin gene cluster in P. marneffei and A. fumigatus
5.1 Comparison of the mating-type loci in P. marneffei and other fungi
5.2 Comparison of the alpha1 domain of MAT proteins of filamentous ascomycetes
5.3 Gene organisation around the MAT locus
5.4 P. marneffei biogenesis of the a-factor pheromones
6.1 Phylogenetic tree of fungal GPCR family genes
6.2 P. marneffei genes in cAMP pathway
7.1 Amino acid composition in intragenic tandem repeats
8.1 Frequency distribution of Ks
8.2 Log-log plots of Ka vs. Ks for duplicate gene pairs
9.1 LS classification based on phylogenetic profiles of genes
9.2 Divergence of nonsynonymous substitution rate in LS groups
9.3 Dependence of log gene expression level and substitution rate
9.4 Linear regression analysis of divergence time and RDS
10.1 Relationship of GTR class DNA substitution models
10.2 Log-likelihood of evolutionary distance
10.3 MBEToolbox GUI
10.4 Comparison between sliding window and enhanced sliding window methods
LIST OF TABLES
1.1 General features of the P. marneffei genome
1.2 Comparison of genome statistics of several fungi
1.3 Putative virulence genes
1.4 Cell wall antigens and biosynthetic genes predicted in P. marneffei
2.1 Commonly used domain databases
2.2 Summary of assembly statistics
3.1 Gene content of P. marneffei mitochondrial genome
3.2 Codon usage in protein-coding genes of P. marneffei mitochondrial genome
3.3 Presence of mitochondrial DNA fragments in nuclear genomes
3.4 P. marneffei mitochondrial DNA sequences present in nuclear genome
4.1 Major dimorphic fungal pathogens
4.2 Putative gene products related to melanin biosynthesis in P. marneffei
5.1 Mating strategies adopted by ascomycetous fungi
5.2 Pheromone-processing enzymes encoded by the putative P. marneffei genes
6.1 GPCR family in P. marneffei and A. nidulans
6.2 Homologous genes related to signal transduction in filamentous growth
7.1 P. marneffei genes containing intragenic tandem repeats
7.2 Comparison of genome size and base in repeats
8.1 Distribution of multigene families in fungi
8.2 Large multigene families in fungi
8.3 Ka/Ks ratio for recently diverged paralogs
8.4 Amino-acid substitution rates versus Ka/Ks ratios in two copies of duplicate genes
9.1 Genomic sequence sources
9.2 Average Ka, Ks and Ka/Ks among LS classes
9.3 Correlation and partial correlation between LS and other factors
9.4 Regression analyses on predicted S. cerevisiae-S. mikatae orthologs
ABBREVIATIONS AND SYMBOLS
aa Amino acid
AIDS Acquired Immunodeficiency Syndrome
ADHoRe Automatic Detection of Homologous Regions
BLAST Basic Local Alignment Search Tool
BLOSUM BLOcks SUbstitution Matrix
bp Base pairs
CDS Nucleotide coding sequence
DBMS Database management system
DDC Duplication-degeneration-complementation (model)
EST Expressed Sequence Tag
FASTA Fast-All (pronounced fast-aye) a program for pairwise sequence
alignment
FGI Fungal Genome Initiative
GFF ‘Gene-Finding Format’ or ‘General Feature Format’
GO Gene Ontology
GOLD Genomes OnLine Database
GPCR G Protein-Coupled Receptor
GTR General Time Reversible model
GUI Graphical User Interface
HAART Highly Active Anti-Retroviral Therapy
HMM Hidden Markov Model
HKU CC Computer Centre, University of Hong Kong
ITR Intragenic Tandem Repeat
Ka Nonsynonymous substitution rate
Ks Synonymous substitution rate
LS Lineage specificity
MAPK Mitogen-activated protein kinase
Mb Megabases
MBEToolbox Molecular biology and evolution toolbox
MCMC Markov-chain Monte Carlo
MDD Maximal dependence decomposition
MFS Major facilitator superfamily
MIPS Munich Information Center for Protein Sequences
TF Transcription Factor
TNF Tumor Necrosis factor
MIT Massachusetts Institute of Technology
MLMT Multilocus microsatellite typing system
NCBI National Center for Biotechnology Information
RDS Relative Divergence Score
ORF Open Reading Frame
PAUP* Phylogenetic Analysis Using Parsimony, *and other methods
(pronounced pop star)
PFGE Pulsed-field gel electrophoresis
PHYLIP PHYLogenetic Inference Package
PMGD P. marneffei genome database
REV General reversible process model
RIP Repeat-induced point
SAGE Serial Analysis of Gene Expression
SGD Saccharomyces Genome Database in Stanford Genomic Resources
Swiss-Prot a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high
level of integration with other databases.
TIGR The Institute for Genomic Research
TrEMBL a computer-annotated supplement of Swiss-Prot that contains
all the translations of EMBL nucleotide sequence entries not yet
integrated in Swiss-Prot.
UML Unified Modelling Language
UCSC University of California, Santa Cruz
URF unidentified reading frame
UTR Untranslated transcriptional region
WGS Whole-genome shotgun
HMG high mobility group motif
GLOSSARY
ADDITIVE TREE:
A phylogenetic tree in which the distance between
any two terminal nodes is equal to the sum of the branch lengths
connecting them.
BOOTSTRAP:
A statistical technique using resampling with replacement.
BRANCH:
The graphical representation of an evolutionary relationship in a phylogenetic tree.
CODON:
A triplet of adjacent nucleotides in mRNA that either codes
for an amino acid carried by a specific tRNA or specifies the termination of the translation process.
CODON USAGE:
The frequency with which members of a codon family
are used in protein-coding genes.
COMPLEMENTARY DNA (cDNA):
DNA synthesised from an RNA template by the enzyme reverse transcriptase.
CONCERTED EVOLUTION:
Maintenance of homogeneity of nucleotide
sequences among members of a gene family in a species, although
the nucleotide sequences change over time.
CONSENSUS SEQUENCE:
A sequence that represents the most prevalent nucleotide or amino acid at each site in a number of homologous
sequences.
CONSERVATIVE SUBSTITUTION:
The substitution of an amino acid by
another with similar chemical properties.
CONSTANT SITE OR CONSTANT REGION:
A site or region within the
DNA that is occupied by the same nucleotide in all homologous
sequences under comparison.
CONVERGENCE:
The independent evolution of similar genetic or phenotypic traits.
CONVERGENT SUBSTITUTION:
The substitution of two different nucleotides by the same nucleotide at the same nucleotide site in two
homologous sequences.
DETERMINISTIC PROCESS:
A process, the outcome of which can be
predicted exactly from knowledge of initial conditions.
DIRECTIONAL SELECTION:
A selective regime that changes the frequency of an allele in a specific direction, either toward fixation or
toward elimination.
DIVERGENCE:
The differences between two homologous sequences due
to the independent accumulation of genetic changes in each lineage.
DOMAIN:
A well-defined region within a protein that can perform a
specific function. May not consist of a continuous stretch of amino
acids, although it almost always consists of amino acids that are
adjacent to each other as far as the tertiary structure of the protein
is concerned.
DUPLICATION:
The presence or the creation of two copies of a DNA
segment in the genome.
EUKARYOTE:
An organism having a true nucleus and membraneous
organelles. One of the three primary lines of descent in the living
world.
EXON:
A DNA segment of a gene, the transcript of which appears in
the mature RNA molecule.
FIXATION PROBABILITY:
The probability that a particular allele will
become fixed in a population.
FIXATION TIME:
The time it takes for a mutant allele to become fixed
in a population.
FLANKING SEQUENCE:
Untranscribed sequences at the 5’ or 3’ terminal of transcribed genes.
FOURFOLD DEGENERATE SITE:
A nucleotide site within a codon at
which all possible substitutions are synonymous. For example, in
the codon CCT, the third site is fourfold degenerate because CCT,
CCC, CCA and CCG are all codons for proline.
FUNCTIONAL CONSTRAINT (SELECTIVE CONSTRAINT):
The degree of
intolerance characteristic of a site or a locus toward nucleotide substitutions.
GENE CONVERSION:
A nonreciprocal recombination process resulting
in a sequence becoming identical with another.
GENE DIVERSITY:
A measure of genetic variability in a population.
The mean expected heterozygosity per locus in a population.
GENE DUPLICATION:
Generally, the production of two copies of a
DNA sequence. Specifically, the duplication of an entire gene sequence.
GENETIC DISTANCE:
Broadly, any of several measures of the degree
of genetic difference between individuals, populations, or species.
In reference to molecular evolution, a measure of the number of
nucleotide substitutions per nucleotide site between two homologous DNA sequences that have accumulated since the divergence
between the sequences.
INFERRED TREE:
A phylogenetic tree based on empirical data pertaining to extant taxa.
INFORMATIVE SITE (DIAGNOSTIC POSITION):
A site that is used to
choose the most-parsimonious tree from among all the possible phylogenetic trees. In molecular evolution, a site where there are at
least two different kinds of nucleotides or amino acids, and each of
them is represented in at least two sequences.
LIKELIHOOD RATIO TEST:
A statistical test of the goodness-of-fit between two models. A relatively more complex model is compared
to a simpler model to see if it fits a particular dataset significantly
better.
LINEAGE:
A linear evolutionary sequence from an ancestral species
through all intermediate species to a particular descendant species.
MAXIMUM LIKELIHOOD:
A statistical procedure of finding the value
of one or more parameters for a given statistic which makes the
known likelihood distribution a maximum.
ORTHOLOGOUS LOCUS:
A gene that has evolved directly from an ancestral locus.
HOMOLOGOUS GENES:
Genes that share a common evolutionary ancestor.
PARALOGOUS LOCUS:
A gene that originated by duplication and then
diverged from the parent copy by mutation and selection or drift.
PATTERN OF SUBSTITUTION (SUBSTITUTION SCHEME):
The relative frequency with which a nucleotide or an amino acid changes into another during evolution.
POSITIVE SELECTION:
Selection for an advantageous mutant allele.
POSTERIOR PROBABILITY:
The probability of a parameter value inferred from an analysis.
RELATIVE-RATE TEST:
A calibration-free test for checking the constancy of the rate of nucleotide substitutions in different lineages
during their evolution, thus determining whether or not the molecular clock operates at the same rate among different lineages.
ROOTED TREE:
A phylogenetic tree that specifies ancestral and descendant species, thus indicating the direction of the evolutionary
path.
SENSE CODON:
A codon specifying an amino acid.
SEQUENCE DIVERGENCE (DIVERGENCE):
The differences between two
homologous sequences due to the independent accumulation of genetic changes in each lineage.
STOCHASTIC PROCESS:
A process, the outcome of which cannot be
predicted exactly from knowledge of initial conditions. However,
given the initial conditions, each of the possible outcomes of the
process can be assigned a certain probability.
SYNTENY:
The condition, in a pair of genomes, in which at least some of the genes are
located at similar map positions.
TANDEM DUPLICATION:
A duplication, the products of which reside
in close proximity to each other on the chromosome.
TRANSITION:
The substitution of a purine for a purine or a pyrimidine
for a pyrimidine.
TRANSVERSION:
The substitution of a purine for a pyrimidine or vice
versa.
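The last two glossary entries can be illustrated with a short Python sketch that classifies each difference between two aligned DNA sequences as a transition or a transversion. This is only a minimal illustration under stated assumptions: the function names are hypothetical, the code is not taken from MBEToolbox or any other software described in this thesis, and sites containing gaps or ambiguity codes are simply skipped.

```python
# Purines and pyrimidines, as used in the glossary definitions above.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def classify_substitution(a: str, b: str) -> str:
    """Return 'identical', 'transition' or 'transversion' for two bases.

    Hypothetical helper: raises ValueError for anything other than A, C, G, T.
    """
    a, b = a.upper(), b.upper()
    if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
        raise ValueError(f"unsupported nucleotide pair: {a!r}, {b!r}")
    if a == b:
        return "identical"
    # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq1: str, seq2: str) -> dict:
    """Count transitions and transversions between two aligned sequences.

    Assumption: sites with gaps or ambiguity codes are ignored.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in NUCLEOTIDES or b not in NUCLEOTIDES:
            continue  # skip gaps ('-') and ambiguous bases such as 'N'
        kind = classify_substitution(a, b)
        if kind != "identical":
            counts[kind] += 1
    return counts
```

For example, `count_substitutions("ACGT", "GCTA")` counts one transition (A to G) and two transversions (G to T and T to A).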
INTRODUCTION
Penicillium marneffei is a dimorphic fungus that intracellularly infects the reticuloendothelial system of humans and bamboo rats. Endemic in Southeast Asia, it infects 10% of AIDS patients in this region [365, 201, 182, 50, 348, 350]. Complete genome sequencing of various organisms has accelerated rapidly in recent years, offering another path
to gene discovery. This thesis presents the sequence of
the P. marneffei genome, as well as related studies from the perspectives of
comparative and evolutionary genomics. These studies will throw light
on the molecular mechanism of virulence of this important pathogenic
fungus.
Chapter 1 gives an overview of the P. marneffei genome, including sequence statistics, gene content and prediction of gene function. Chapter
2 describes the organisation and implementation of the genome database of
the P. marneffei genome project. The complete mitochondrial genome of P.
marneffei is reported in Chapter 3. The gene content and gene order
of the P. marneffei mitochondrial genome are highly similar to those of Aspergillus, further confirming their close phylogenetic relationship. This
provides the basis for comparative genomic studies between P. marneffei
and Aspergillus species.
This is followed by Chapter 4, which reports the presence of an important virulence gene cluster, the melanin biosynthesis gene cluster, in the P.
marneffei genome. Since melanin is a highly toxic natural product produced by some species of Aspergillus which are phylogenetically close to
P. marneffei, this finding is also valuable in revealing the evolutionary
origin of this gene cluster.
Mating of P. marneffei has not yet been observed in nature or under
defined laboratory conditions. The lack of a sexual stage impairs the
utility of experimental fungal genetics. By using genome sequence information, however, we found evidence of the potential mating ability of P.
marneffei (Chapter 5). This suggests that P. marneffei, like other pathogenic fungi, may limit access to the sexual cycle to generate a population
structure that is in part clonal but which retains the ability to undergo
a sexual cycle in response to challenging conditions in the environment or
in the host. Chapter 6 offers a systematic
exploration of the genetic components in the genome of P. marneffei that may be responsible for its morphogenetic processes, mainly through
sequence analysis in the context of comparative genomics. Chapter 7 reports an interesting phenomenon: tandemly repeated DNA sequences
occur frequently in the genome of P. marneffei, not only in noncoding regions but also in protein-coding regions, i.e. intragenic regions.
These highly dynamic genomic components provide a clue to how the
pathogenic fungus adapts to the host immune system.
Chapter 8 presents a systematic survey of the extent of gene duplication in major ascomycetes. We observed significant variation among ascomycetes in the extent of gene duplication. The age distribution of gene duplications tentatively suggests that the P. marneffei genome has experienced large-scale duplication twice. We argue that the different extents and evolutionary patterns of duplicate genes in ascomycetes might be associated with the great genotypic and phenotypic differences among ascomycetes. Chapter 9 tackles the question of the origin of species-specific genes. A statistically significant correlation between accelerated evolutionary rate and the degree of lineage specificity is confirmed. This correlation is independent of many confounding factors, such as gene essentiality and expression level. This finding helps to explain the origin of P. marneffei-specific genes, which make up about one third of all P. marneffei genes.
Finally, Chapter 10 introduces the software package, developed in a high-performance scientific computing language, for sequence data manipulation and analysis, which was used successfully throughout the genome project.
Publications arising from this thesis are:
1. Cai JJ, Liu B, Woo PC, Lau SKP, Wong SS, Zhen H, Yuen KY (In
preparation) Genomic evidence for the presence of melanin biosynthesis gene cluster in the thermal dimorphic fungus Penicillium
marneffei
2. Cai JJ, Woo PCY, Lau SKP, Smith DK and Yuen KY (2006) Accelerated evolutionary rate may be responsible for the emergence
of lineage-specific genes in Ascomycota. Journal of Molecular Evolution, in press
3. Cai JJ, Smith DK, Xia X and Yuen KY (2005) MBEToolbox: a
MATLAB™ toolbox for sequence data analysis in molecular biology and evolution. BMC Bioinformatics, 6:64
4. Woo PC, Zhen H, Cai JJ, Yu J, Lau SKP, Wang J, Teng JLL,
Wong SS, Tse RH, Chen R, Yang H, Liu B and Yuen KY (2003) The
mitochondrial genome of the thermal dimorphic fungus Penicillium
marneffei is more closely related to those of molds than yeasts.
FEBS Letters, 555 (3): 469-77
5. Yuen KY, Pascal G, Wong SS, Glaser P, Woo PC, Kunst F, Cai JJ,
Cheung EY, Medigue C, Danchin A (2003) Exploring the Penicillium marneffei genome. Archives of Microbiology, 179 (5): 339-53
I have tried to explicitly acknowledge where the other authors’ ideas
have contributed significantly to the present work.
Chapter 1
THE DRAFT GENOME SEQUENCE OF
PENICILLIUM MARNEFFEI
This chapter describes basic features of the genome of Penicillium marneffei, such as genome assembly, gene content and some comparative results, in an attempt to give an overall impression of the genome. More detailed and complete analyses of some topics may be found in the corresponding chapters.
1.1 Introduction
Although fungi pose little threat to people with healthy immune systems, they can cause fatal infections in immunocompromised individuals. Penicillium marneffei is the most important thermally dimorphic fungus causing respiratory, skin and systemic mycosis in Southeast Asia [365, 201, 182, 50, 348, 350]. The fungus was discovered in 1956 in hepatic abscesses of the Chinese bamboo rat Rhizomys sinensis, and only 18 cases of human disease were reported (in HIV-negative patients) until 1985 [66]. With the HIV pandemic, especially in Southeast Asian countries, the infection emerged as an important opportunistic mycosis in this group of immunocompromised patients. About 10% of AIDS patients in Hong Kong are infected with P. marneffei [346]. In northern Thailand, penicilliosis is the third most common indicator disease of AIDS, following tuberculosis and cryptococcosis [300].
Genome sequencing of P. marneffei will improve our understanding of the molecular biology and biochemical mechanisms underlying the pathogenicity of this fungus. Despite its medical importance and its unusual thermal dimorphism, our understanding of gene organisation in P. marneffei has been limited. To my knowledge, only one cell wall mannoprotein gene has been characterised and successfully used in serodiagnosis and prevention of this infection [38, 37, 347]. As a ‘pilot study’ for this genome project, an analysis of 2303 random sequence tags was performed [364], which laid the foundation for the complete genomic sequencing project of this fungus. In 2002, the complete genome sequencing project of P. marneffei was initiated, and we now have approximately 6.6× coverage of the genome, which includes a contig that contains the complete sequence of the mitochondrial genome. The sequencing of its genome paves the way for the development of novel methods for detecting, preventing and treating this infection.
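The relationship between sequencing depth and genome size can be made concrete with a small calculation. The sketch below assumes the idealised Lander-Waterman model of uniformly random reads, takes the 31 Mb genome size from the abstract of this thesis, and uses a hypothetical mean read length of 500 bp. The read length and the function names are illustrative assumptions, not values or tools from the sequencing project.

```python
import math


def fold_coverage(n_reads, mean_read_len, genome_size):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * mean_read_len / genome_size


def expected_uncovered_fraction(coverage):
    """Lander-Waterman estimate of the fraction of bases hit by no read.

    Assumes reads are placed uniformly at random, which real shotgun
    libraries only approximate.
    """
    return math.exp(-coverage)


GENOME_SIZE = 31_000_000   # 31 Mb, as stated in the abstract
COVERAGE = 6.6             # approximate depth reported in the text
MEAN_READ_LEN = 500        # hypothetical read length (assumption)

# Number of reads that would give the reported depth at the assumed read length.
reads_needed = COVERAGE * GENOME_SIZE / MEAN_READ_LEN

# Fraction of the genome expected to remain unsequenced under the model.
uncovered = expected_uncovered_fraction(COVERAGE)
```

Under these assumptions the model predicts that roughly 0.14% of the genome would be left uncovered, which illustrates why a draft assembly at this depth still contains gaps.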
1.2 Literature Review
In this section I will first recap some basic concepts and terminology
in fungal biology, and then review some clinical aspects, including the
diagnosis and management of P. marneffei infection. Finally, I will give
a survey of the recent advances in fungal genome projects.
1.2.1 General fungal biology
Fungi are a large and diverse group of eukaryotes characterised by their
absorptive mode of nutrition, i.e., digesting food outside of their bodies.
Modern taxonomists place fungi in their own kingdom, on equal footing
with plants and animals, sometimes called “The Fifth Kingdom”. They
include moulds, yeasts, and mushrooms. Most fungi are multicellular,
but some, the yeasts, are simple unicellular organisms. Fungi are plastic,
having a diversity of forms which influence the manner of function, and
a range of dispersal mechanisms enabling various approaches to survival
over time. Nevertheless, some basic structures are common to diverse fungi.
6
A fungal organism consists of a mass of threadlike filaments called
hyphae, which combine to make up the fungal mycelium. Each hypha is
composed of a chain of fungal cells, a continuous cytoplasm with many
nuclei. The hypha is surrounded by a plasma membrane and a cell wall containing the polysaccharide chitin. The hyphae in a fungus branch off from one another to form the mycelium, and are all ultimately connected to the original hypha. Septa are barriers across the filament. Septa form in all fungi, either adventitiously, as in all filamentous fungi, or at regular intervals along the hypha, as in most members of the Ascomycota and Basidiomycota.
Different methods of reproduction have been adopted by different types
of fungus. For example, yeasts reproduce mitotically, while moulds have
much more complex life cycles involving distinct phases, including diploid
and haploid phases.
Fungi are often directly involved in our lives. Some fungi are parasitic and cause devastating plant infections. Parasitic fungi such as the rusts and the smuts are serious agricultural pests that can ruin entire crops, especially cereals such as wheat and corn. Only about 50 species are known to harm animals. Many medical applications of fungi have been discovered, of which antibiotic production is the most important. The first of these antibiotics was penicillin, possibly the most important non-genetic medical breakthrough of the last century. Approximately 75% of all described fungi belong to the Ascomycota. Among them are some famous species, such as Saccharomyces cerevisiae, the baker’s yeast; Penicillium chrysogenum, the producer of penicillin; Neurospora crassa, the “one-gene-one-enzyme” organism; Aspergillus flavus, the producer of aflatoxin; and Candida albicans, the cause of thrush.
Figure 1.1: P. marneffei mould (A) and yeast (B) culture. Courtesy of Prof. KY Yuen, Microbiology, HKU.
1.2.2 P. marneffei as an important fungal pathogen
Mycology
The fungus grows well on Sabouraud dextrose agar. When grown at 25 °C, the fresh culture appears similar to other Penicillium species, with rapidly growing greenish-silver mycelial colonies. The reverse side is usually of a beige colour. One of the most characteristic features is the production of a soluble red pigment that diffuses into the medium. Of all the Penicillium species, only P. marneffei, P. citrinum, P. janthinellum, P. purpurogenum, and P. rubrum produce diffusible red pigments. The other Penicillium species are generally not associated with human infections, nor do they display dimorphism. In contrast to a room temperature culture, the fungus assumes a yeast form at 37 °C, whether in culture or in vivo. Colonies at 37 °C are glabrous and beige-coloured and do not produce any red pigment (Fig. 1.1). The dimorphic growth of the fungus, as a yeast at 37 °C and as a mould in culture at temperatures below 30 °C, is illustrated in Fig. 1.2.
Microscopically, the mycelial form resembles other Penicillium species, with conidiophores bearing biverticillate penicilli, each penicillus being composed of four to five metulae with smooth-walled conidia. The yeast forms are ovoid or elongated, measuring 2–3 µm × 2–6.5 µm. Similar forms are also observed in tissue samples obtained from patients, where they may be seen within macrophages or extracellularly. In contrast to other yeasts, the yeast cells of P. marneffei divide not by budding but by fission, with the result that a transverse septum is often seen in the dividing cell. This helps to differentiate P. marneffei from other dimorphic fungi in histological sections, especially Histoplasma capsulatum.
Figure 1.2: Dimorphic switching of P. marneffei. The diagram was obtained from the website of the Department of Genetics, University of Melbourne.
Ecology and epidemiology
P. marneffei is geographically restricted to Southeast Asia. Cases
have been reported mostly from northern Thailand, southwestern China
(e.g., around the Guangxi Province), Hong Kong, Taiwan, Singapore,
Malaysia, and the Philippines.
The ecology and possible environmental reservoirs of P. marneffei were first investigated in 1986 by Deng et al. [67]. In the Guangxi region of the People’s Republic of China, P. marneffei was isolated from the internal organs of 18 out of 19 bamboo rats belonging to the species Rhizomys pruinosus. The findings of Deng et al. were confirmed by a subsequent study by Li et al. [195], in which Rhizomys pruinosus senex bamboo rats in the Guangxi Province were studied: 93.1% of the wild bamboo rats carried P. marneffei in the internal organs. The fungus was most commonly isolated from the lungs (87.5%), followed by the liver (56.3%), spleen (56.3%) and mesenteric lymph nodes (50%).
The association between P. marneffei and bamboo rats has also been
noted in Thailand, another country where the infection is endemic. In two
studies by Ajello et al. [3] and Chariyalertsak et al. [47], P. marneffei
was recovered from various species of bamboo rats, including Cannomys
badius, Rhizomys pruinosus, and R. sumatrensis. The distribution of the
fungus in the internal organs was similar to previous studies, with the
highest prevalence in the lungs followed by the liver.
The consistency of these findings suggests that inhalation of the (presumably) infective conidia could be an important mode of transmission.
The occurrence of the fungus in the liver could be a result of the propensity of the fungus to invade the reticuloendothelial system. It has been
suggested that bamboo rats, like human victims, probably acquired the
infection from a common environmental source. The possible link to environmental factors is demonstrated by two studies from northern Thailand which showed a significant clustering of cases of penicilliosis marneffei during the rainy season [45, 46]. A recent history of occupational or
other forms of exposure to soil is also a significant risk factor. Importantly, exposure to or consumption of bamboo rats was not a risk factor for infection. The exact mode of transmission of the fungus from its natural habitat is still unsettled at the moment.
Although P. marneffei is a naturally occurring sylvatic infection in
a high proportion of bamboo rat species [67], it is not known whether
bamboo rats are (1) an obligate stage in P. marneffei’s life cycle or (2) a
zoonotic focus for human infection. Furthermore, it is not known whether
all lineages of P. marneffei are equally infectious to bamboo rats and humans or rather represent a subset of a wider, more genetically diverse population. To address these questions, four groups of investigators have reported the use of various molecular typing techniques in the differentiation of P. marneffei strains. Vanittanakom et al. [323] first reported in 1996 the use of restriction endonuclease analysis for epidemiological typing of strains isolated in Thailand. Hsueh et al. noted an increase in the incidence of P. marneffei infection in Taiwan in the 1990s [134].
Antifungal susceptibility, chromosomal DNA restriction fragment-length
polymorphism types, and randomly amplified polymorphic DNA patterns
recognised 8 strain types out of 20 isolates. Trewatcharegon et al., on
the other hand, used pulsed-field gel electrophoresis (PFGE) with NotI
digestion for strain differentiation [316]. Fisher et al. [88] used a multilocus microsatellite typing (MLMT) system, an accurate and reproducible method of characterising the genetic diversity of eukaryotic pathogens that have low levels of genetic variation. They observed high genetic diversity and extensive spatial structure among clinical isolates, revealing spatially structured P. marneffei populations [88]. In a further study, again based on MLMT typing results, Fisher et al. [89] showed that different clones of the fungus are found in different environments, whereas all the samples from any given location were genetically very similar. This led them to conclude that the fungus becomes highly adapted to its local environment, making it highly successful there but preventing it from spreading to other areas. This would explain why P. marneffei is endemic only to a relatively small area of Southeast Asia.
Immunobiology
As for most other pathogens, the availability of iron is crucial to the survival of P. marneffei in the human host. Studies by Taramelli et al. have shown that the antifungal activity of macrophages is markedly suppressed in the presence of iron overload and that iron chelators inhibit the extracellular growth of P. marneffei [306].
The route of transmission and infection of P. marneffei is unknown at
the moment. However, it is generally believed that inhalation of the conidia is a likely route, in line with the mode of infection for other moulds.
The attachment of P. marneffei conidia to host cells and tissues is the
first step in the establishment of an infection. The conidia-host interaction may occur via adhesion to the extracellular matrix proteins laminin and fibronectin through a sialic acid-dependent process. Using immunofluorescence microscopy, Hamilton et al. demonstrated that fibronectin binds to the conidial surface and to phialides, but not to hyphae [122]. The investigators suggested that there could be a common receptor for the binding
of fibronectin and laminin on the surface of P. marneffei [123, 122].
The interaction between human leukocytes and heat-killed yeast-phase
P. marneffei has been studied by Rongrungruang et al. [269]. Their data
suggested that monocyte-derived macrophages phagocytose P. marneffei
even in the absence of opsonisation and the major receptor(s) recognising
P. marneffei could be a glycoprotein with N-acetyl-beta-D-glucosaminyl
groups. P. marneffei stimulates the respiratory burst of macrophages
regardless of whether opsonins are present, but tumour necrosis factor-α
secretion is stimulated only in the presence of opsonins. The authors thus
speculated that the ability of unopsonised fungal cells to infect mononuclear phagocytes in the absence of TNF-α production is a possible virulence mechanism.
Although P. marneffei is capable of infecting and replicating inside mononuclear macrophages, it is also evident that macrophages do possess antifungal activity. The fungicidal activity of macrophages is likely to involve the generation of reactive nitrogen intermediates, as described by Kudeken et al. [180]. In addition to macrophages, neutrophils also exhibit antifungal properties. The fungicidal activity of neutrophils is significantly increased in the presence of proinflammatory cytokines, especially GM-CSF, G-CSF and IFN-γ. In addition to these three, other cytokines such as TNF-α and IL-8 are capable of enhancing the neutrophil’s inhibitory effects on germination of P. marneffei conidia. The strongest effect was observed with GM-CSF [179]. Conidia are, however, generally not susceptible to killing by phagocytes. The fungicidal activity exhibited by neutrophils is believed to be independent of superoxide anion and to act instead through exocytosis of granular enzymes [181].
Recently, Koguchi et al. demonstrated that osteopontin (secreted by
monocytes) could be involved in IL-12 production by peripheral blood
mononuclear cells during infection by P. marneffei, and the production
of osteopontin is also regulated by GM-CSF [171]. It is also likely that
the mannose receptor is involved as a signal-transducing receptor for triggering the secretion of osteopontin by P. marneffei-stimulated peripheral
blood mononuclear cells.
Molecular biology
The mechanism of thermal dimorphism and morphogenesis in P. marneffei is not fully understood. However, studies by Borneman et al. have started to provide important information in this area [18, 19]. It was shown that the
homologue of the Aspergillus nidulans abaA gene is involved in the regulation of cell cycle and morphogenesis in P. marneffei [18]. An STE12
homologue of P. marneffei (stlA gene) was subsequently shown to be able
to complement the sexual defect of an A. nidulans steA mutant [19]. A
hitherto unknown sexual stage of P. marneffei is therefore postulated to
be present.
Other genes which are involved in the growth and development of
P. marneffei have been described recently. A CDC42 homologue (cflA
gene) was shown to be required for polarisation and determination of correct cell shape during yeast-like growth, and for the separation of yeast
cells [22]. Deletion of the homologue of the Aspergillus nidulans stuA gene in P. marneffei showed that the gene is required for metula and phialide formation during conidiation but is not required for dimorphic growth [20].
No vaccine is currently available for P. marneffei. Some recent studies
showed that vaccine development is potentially feasible. The P. marneffei mannoprotein Mp1p (encoded by the MP1 gene) has been tested in a
mouse model as a potential vaccine candidate [347]. The relative efficacies of an intramuscular MP1 DNA vaccine, an oral mucosal MP1 DNA vaccine using a live-attenuated Salmonella typhimurium carrier, and an intraperitoneal recombinant Mp1p protein vaccine were compared. The intramuscular MP1 DNA vaccine appeared to give the best protection against P. marneffei.
1.2.3 Penicilliosis marneffei
Clinical features
Penicilliosis marneffei manifests clinically as a progressive systemic febrile
illness as a result of infiltration and inflammation of the reticuloendothelial system by the yeast stage of P. marneffei. Common clinical features include systemic symptoms of fever, weight loss, anaemia, and those
due to local organ involvement such as pulmonary syndrome, chest radiographic infiltrate, lymphadenopathy, hepatosplenomegaly, molluscum contagiosum-like skin lesions, osteolytic bone lesions, arthritis, subcutaneous abscesses and even endophthalmitis. Almost all organs can be
involved in severe disseminated disease.
In immunocompetent hosts, the tissue damage is mainly associated
with granulomatous inflammation with multinucleated giant cells, lymphocytes, and neutrophils. A suppurative inflammation dominated by
neutrophils resulting in abscess formation can be present. In immunosuppressed hosts, an anergic and necrotising reaction is found with diffuse
infiltration of macrophages engorged with yeast cells.
Underlying immunosuppression could be found in 80% of penicilliosis
patients. The commonest underlying disease is AIDS. P. marneffei is
second only to Cryptococcus neoformans as the commonest opportunistic fungal pathogen in AIDS patients in Southeast Asian countries like
Thailand.
Infections in non-HIV-infected patients have also been described, primarily among immunocompromised patients and less frequently in patients without any known underlying diseases. Reported cases of non-HIV-associated penicilliosis marneffei have occurred in patients with alcoholism, tuberculosis or systemic lupus erythematosus, in patients receiving corticosteroid or other forms of immunosuppressive therapy, and even in patients without any apparent underlying disease. Manifestations of the
infection included lymphadenopathy, osteomyelitis and septic arthritis,
pulmonary infection, and disseminated infection with multi-organ involvement.
A comparison of the clinical manifestations of penicilliosis in HIV-positive and HIV-negative patients has been published recently [349]. Of the 15 patients who had culture-documented P. marneffei infection, 8 (53.3%) were HIV positive and 7 (46.7%) were HIV negative. The HIV-infected patients had a higher incidence of fungaemia than the non-HIV-infected patients (50% vs. 28.6%), while the latter group frequently required tissue biopsies for confirmation of the infection. There was a significant delay in establishing the diagnosis in non-HIV-infected patients when compared with HIV-infected patients (median delay of 5.5 weeks vs. 1 week, P < 0.01). Most of the non-HIV patients (85.7%) had underlying immunocompromising conditions, including haematological malignancies and autoimmune diseases requiring the use of corticosteroids or cytotoxic chemotherapy, as well as diabetes mellitus. In both categories, pulmonary involvement was the commonest manifestation on initial presentation, followed by pyrexia of unknown origin and cutaneous manifestations.
Diagnosis
Fungal culture. The infection itself is relatively amenable to antifungal therapy and a cure is potentially possible. Early recognition of the
infection is therefore essential for timely initiation of effective therapy.
Conventional fungal culture remains the diagnostic test of choice in
most settings. The fungus may be cultivated from appropriate clinical
specimens, such as blood, skin lesions, and respiratory tract specimens, in most cases. In AIDS patients with high levels of fungaemia, it has occasionally been reported that a direct smear of the peripheral blood may reveal the fungus. In HIV-positive patients, fungaemia could be detected in at least 55% of the patients in previous reports.
Unfortunately, fungal culture suffers from a long turnaround time, and invasive tissue biopsies are sometimes necessary to obtain a satisfactory specimen. In a series of HIV-infected patients from Hong Kong, 50% had documented fungaemia [349].
The yeast form of P. marneffei may be stained by the methenamine
silver or periodic acid-Schiff stains in tissue sections. When the central septation of the yeast cell is seen in the histopathological section,
this offers a clue to the diagnosis of penicilliosis. Piérard et al. reported that the monoclonal antibody EB-A1 against the galactomannan of Aspergillus species may also be used to detect P. marneffei in formalin-fixed, paraffin-embedded tissues [249].
Serology. A number of studies have aimed to detect fungal antibodies and/or antigens in the serum and body fluids of infected patients. In earlier studies, culture filtrates or whole cell extracts were used as antigens. P. marneffei was cultured in liquid media, and the culture filtrate was concentrated and used to immunise rabbits. The culture filtrate and the anti-P. marneffei rabbit sera were incorporated in an immunodiffusion test to detect antibodies or antigens, respectively [277, 333, 144].
In 1994, an indirect immunofluorescent antibody test for serodiagnosis
of P. marneffei infection was reported, using the yeast-hyphae (representing tissue multiplication phase) or the germinating conidia (representing
initial tissue invasion phase) as antigens [365]. None of the eight sera
from culture-documented patients tested at 1 : 10 dilution gave a positive result for IgM. High IgG titres (of the respective phases, geometric
mean 1 : 905 and 1 : 1280) were found in all eight penicilliosis marneffei
patients, in contrast to that obtained from 78 healthy controls (with a
respective geometric mean of 1 : 1.34 and 1 : 2.14). Sera from patients
with cryptococcosis (n = 2) or candidaemia (n = 2) did not show cross-reactivity (IgG titre < 1 : 40, which is similar to that of the healthy controls). Overall, the IgG titre was higher than the IgA titre for the cases, but there was little difference between using the germinating conidia and the yeast-hyphae form as the testing antigen. Moreover, IgA could not be detected in two out of eight positive cases. Three HIV patients with culture-documented penicilliosis marneffei tested positive (IgG titres 1 : 80 to 1 : 160). An IgG titre > 1 : 80 is suggestive of penicilliosis marneffei.
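The geometric mean titres quoted above are averages taken on a logarithmic scale, which suits serial dilution data. The following sketch shows one common way to compute such a mean from reciprocal titres. The function name and the four example titres are hypothetical and chosen only for illustration; they are not data from the cited study.

```python
import math


def geometric_mean_titre(reciprocal_titres):
    """Geometric mean of reciprocal titres (e.g. 1280 for a titre of 1 : 1280)."""
    if not reciprocal_titres:
        raise ValueError("at least one titre is required")
    if any(t <= 0 for t in reciprocal_titres):
        raise ValueError("reciprocal titres must be positive")
    # Average the log2 values, then transform back to the titre scale.
    mean_log2 = sum(math.log2(t) for t in reciprocal_titres) / len(reciprocal_titres)
    return 2 ** mean_log2


# Hypothetical reciprocal IgG titres from four sera (illustrative values only).
example_titres = [640, 1280, 1280, 2560]
gmt = geometric_mean_titre(example_titres)
```

For these illustrative values the geometric mean corresponds to a titre of 1 : 1280, whereas the arithmetic mean (1440) would be pulled upwards by the single highest titre.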
In 1996 Kaufman et al. developed a latex agglutination test to detect
antigenaemia, where polystyrene beads were coated with rabbit anti-P.
marneffei globulin, obtained from rabbits immunised with yeast culture
filtrate [160]. Of the 17 P. marneffei culture-positive HIV patients, 77% tested positive.
Desakorn et al. later used purified hyperimmune IgG, from rabbits
immunised with yeast cells, in an enzyme-linked immunosorbent assay
(ELISA) to quantitate P. marneffei yeast antigens in urine samples [69].
All urine samples from 33 P. marneffei culture-positive HIV patients
tested positive, with a median titre of 1 : 20.
Jeavons et al. characterised and purified three cytoplasmic yeast antigens of 50, 54 and 61 kDa, which were found in 48%, 71% and 85%, respectively, of serum samples from 21 P. marneffei culture-positive patients [146]. Chongtrakool et al. isolated a 38-kDa antigen, partially purified from yeast culture filtrate; 45% of P. marneffei culture-positive HIV patients (n = 51), 17% of HIV-positive asymptomatic patients (n = 262) and 25% of other fungal culture-positive HIV patients (n = 67) had developed antibodies against this antigen [54].
PCR. The detection of P. marneffei genomic DNA in clinical specimens has also been reported. LoBuglio and Taylor used primers PM2 and PM4 to amplify a 347 bp fragment of the internal transcribed spacer region between the 18S rDNA and 5.8S rDNA [202]. Vanittanakom et al., on the other hand, used a PCR-Southern hybridisation format, in which primers RRF1 and RRH1 were used to amplify a 631 bp fragment of the 18S rDNA, followed by hybridisation with a P. marneffei-specific 15-oligonucleotide probe [324]. More recently, Vanittanakom et al. described a nested PCR assay which might prove useful in the detection of P. marneffei and the identification of young fungal cultures [325].
Mp1p. The first gene cloned from P. marneffei was the MP1 gene [37]. Serum from guinea pigs immunised with P. marneffei yeast cells was used to screen a cDNA library of P. marneffei. The MP1 gene, which encodes an abundant antigenic cell wall mannoprotein in P. marneffei, was subsequently cloned. MP1 is a unique gene without homologues in
sequence databases. It codes for a protein, Mp1p, of 462 amino acid
residues, with a few sequence features that are present in several cell wall
proteins of Saccharomyces cerevisiae and Candida albicans. It contains
two putative N-glycosylation sites, a serine- and threonine-rich region for
O-glycosylation, a signal peptide, and a putative glycosylphosphatidylinositol attachment signal sequence. Specific anti-Mp1p antibody was
generated with recombinant Mp1p protein purified from Escherichia coli
to allow further characterisation of Mp1p. Western blot analysis with
anti-Mp1p antibody revealed that Mp1p produces dominant bands with
molecular masses of 58 and 90 kDa and that it belongs to a group of cell
wall proteins that can be readily removed from yeast cell surfaces by glucanase digestion. In addition, Mp1p is an abundant yeast glycoprotein
and has high affinity for concanavalin A, a characteristic indicative of a
mannoprotein. Furthermore, ultrastructural analysis with immunogold
staining indicated that Mp1p is present in the cell walls of the yeast, hyphae, and conidia of P. marneffei. Finally, it was observed that infected
patients develop a specific antibody response against Mp1p, suggesting
that this protein represents a good cell surface target for host humoral
immunity.
The antibody response of penicilliosis patients to Mp1p was studied
in two subsequent studies [38, 39]. An ELISA-based antibody test with
purified Mp1p was produced. Evaluation of the test with guinea pig sera
against P. marneffei and other pathogenic fungi indicated that this assay
was specific for P. marneffei. Clinical evaluation revealed that high levels
of specific antibody were detected in two immunocompetent penicilliosis
patients. Furthermore, approximately 80% (14 of 17) of the documented
penicilliosis patients with human immunodeficiency virus tested positive
for the specific antibody. No false-positive results were found for serum
samples from 90 healthy blood donors, 20 patients with typhoid fever,
and 55 patients with tuberculosis, indicating a high specificity of the test.
Thus, this ELISA-based test for the detection of anti-Mp1p antibody can
be of significant value as a diagnostic for penicilliosis.
In vitro, Mp1p is secreted into the cell culture supernatant at a level that can be detected by Western blotting. A sensitive ELISA developed with antibodies against Mp1p was capable of detecting this protein in the cell culture supernatant of P. marneffei at 10⁴ cells/mL. The anti-Mp1p antibody is specific, since it fails to react with any protein from lysates of Candida albicans, Histoplasma capsulatum, or Cryptococcus neoformans by Western blotting. In addition, this Mp1p antigen-based ELISA is also specific for P. marneffei, since the cell culture supernatants of the other three fungi gave negative results. Finally,
a clinical evaluation of sera from penicilliosis patients indicates that 17
of 26 (65%) patients are Mp1p antigen test positive. Furthermore, an
Mp1p antibody test was performed with these serum specimens. The
combined antibody and antigen tests for P. marneffei carry a sensitivity
of 88% (23 of 26), with a positive predictive value of 100% and a negative
predictive value of 96%. The specificities of the tests are high since none
of the 85 control sera was positive by either test.
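The figures quoted for the combined test follow directly from a two-by-two table. The sketch below recomputes them from the counts given in the text (23 of 26 patients positive, none of 85 control sera positive). The function name and dictionary keys are illustrative assumptions and do not describe any software used in the original study.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard accuracy measures from a 2x2 table of test results.

    tp/fn: diseased subjects testing positive/negative.
    fp/tn: control subjects testing positive/negative.
    """
    return {
        "sensitivity": tp / (tp + fn),   # diseased subjects correctly detected
        "specificity": tn / (tn + fp),   # controls correctly negative
        "ppv": tp / (tp + fp),           # positive results that are true
        "npv": tn / (tn + fn),           # negative results that are true
    }


# Counts taken from the text: 23 of 26 patients positive by either test,
# and none of the 85 control sera positive.
combined = diagnostic_metrics(tp=23, fn=3, fp=0, tn=85)
```

With these counts the sensitivity is 23/26 (about 88%), the positive predictive value is 23/23 (100%) and the negative predictive value is 85/88 (about 96%), in line with the values reported above. Predictive values depend on the ratio of patients to controls in the sample, so they would differ in a population with a different prevalence.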
The value of antigen (Mp1p) and antibody (anti-Mp1p) detection in
the diagnosis of penicilliosis marneffei is best evaluated by comparing the
results in patients with or without underlying HIV infection. In a study
involving eight HIV positive and seven non-HIV penicilliosis marneffei
patients, the HIV positive patients tended to have a higher antigen titre
and a lower antibody titre, while the converse is true in the HIV negative
patients. This presumably is due to impaired antibody production as a
result of the underlying immune defects associated with HIV infection
and a higher fungal load in this group of patients. Concomitant testing
of the serum antigen and antibody levels could therefore improve the
diagnostic yield of serology in immunocompromised patients.
When serial serum samples were available for the HIV-positive patients, it was found that the serum antigen and antibody titres against
P. marneffei were elevated as early as 30 days before the day of positive cultures. The titres of both serum antigen and antibody dropped
with the initiation of amphotericin B therapy and itraconazole prophylaxis. Upon subsequent follow-up, there was no clinical or mycological
evidence of relapse and this was associated with a persistently negative
serum antigen and antibody ELISA.
Treatment
In vitro, P. marneffei is susceptible to itraconazole and amphotericin
B, while its susceptibility to fluconazole and 5-fluorocytosine is less
uniform [301]. The recommended antifungal regimen to date consists of two
weeks of intravenous amphotericin B (0.6 mg/kg/d) followed by ten weeks
of oral itraconazole (400 mg/d), which resulted in clinical and
microbiological cure in 97.3% of the patients. Long-term secondary
prophylaxis has also been suggested to reduce the relapse rate [290, 302].
Highly active antiretroviral therapy (HAART) has been shown to reduce the
incidence of many opportunistic infections in AIDS patients, including
invasive fungal infections, and with its wider use for HIV infection it
has been suggested that long-term antifungal prophylaxis may not be
necessary. There is, however, currently no specific cut-off value of
CD4 cell count that can be used to guide the use of secondary antifungal
prophylaxis [140]. One recent interesting observation is that several
4-aminoquinoline agents, including chloroquine, were found to inhibit the
growth of P. marneffei inside macrophages. The activity of chloroquine
against P. marneffei is postulated to be due to an increase in the
intravacuolar pH and a disruption of pH-dependent metabolic processes.
This finding could be of value in the chemotherapy or chemoprophylaxis
of penicilliosis marneffei [307].
1.2.4 Fungal genome projects
Genomics has only just started to have an impact on biological and medical
research, although modern molecular genetics has been at the center of the
biomedical revolution since the 1980s. The study of whole genome sequences
is a new tool in biomedical research.
At the time of writing, there are about 317 completed and published
genome sequence projects, together with 549 eukaryotic and 802
prokaryotic ongoing projects (data from the Genomes OnLine Database
(GOLD) at http://www.genomesonline.org/). Current estimates suggest at
least 2 million fungal species, of which only some 50,000 to 70,000
have been documented, and the genomes of only a few of them have been
completed.
S. cerevisiae was the first eukaryote to have its genome fully sequenced.
The work was completed in 1996 by many different laboratories and
organisations. Its genome contains ≈6,000 genes on 16 chromosomes.
At the time the genome sequence was published, only 43.3% of the
yeast genes were classified as ‘functionally characterised’, i.e., having
experimentally well-investigated properties, being members of well-defined
protein families, or displaying strong homology to proteins with known
biochemical functions. Despite this limitation, it is the best-studied
fungus and serves as the most important model organism for fungal
genetics. An all-against-all matching of the yeast genome has been
accomplished, and duplication patterns within the genome have been
investigated in a systematic way. Such a view of the genome’s architecture,
based on an exhaustive intra-genomic sequence comparison, revealed that
whole genome duplication seems to have had an important influence on
the evolutionary development of S. cerevisiae [220].
The S. pombe genome [354] contains the smallest number of protein-coding
genes yet recorded for a eukaryote: 4,824. Centromere structure
has been well studied in S. pombe: the centromeres are between 35 and
110 kb and contain related repeats including a highly conserved 1.8-kb
element. More introns (of which there are 4,730) are found than in S.
cerevisiae. Some 43% of the genes contain introns. Some homologs of
human disease genes, such as cancer-related genes, have been identified.
Comparative study identified highly conserved genes important for
eukaryotic cell organisation, including those required for the cytoskeleton,
compartmentation, cell-cycle control, proteolysis, protein phosphorylation
and RNA splicing, which may have originated with the appearance
of eukaryotic life. In contrast, few similarly conserved genes that are
important for multicellular organisation were identified. The lesson from
studying the S. pombe genome is that the transition from prokaryotes to
eukaryotes required more new genes than did the transition from unicellular
to multicellular organisation.
The N. crassa genome has been reported recently [101]. The genome
was assembled from genomic data of more than 20-fold sequence coverage.
It has the largest genome size (39.9 Mb) and gene number (10,082
protein-coding genes) among all published fungal genomes so far. On
average, the gene density is one gene per 3.7 kilobases (kb), with an
average of 1.7 short introns (134 bp on average) per gene. The Neurospora
genome comprises a small number of repetitive elements, a low degree of
segmental duplication and very few paralogous genes. Neurospora genes
are highly divergent: of the predicted proteins, 41% have no significant
matches to known proteins. Many genes have predicted products likely
to be involved in determining hyphal growth and multicellular
developmental structures in Neurospora, as well as in catabolism, chemical
detoxification and stress-defence mechanisms. It has also been noted
that for some Neurospora genes the only known homologs are found in
prokaryotes [216], indicating that occupation of similar ecological niches
has resulted in conservation of genes for substrate degradation and
secondary metabolism.
Magnaporthe grisea, one of the most devastating agricultural pathogens
in the world, has been sequenced [64]. The fungus causes blast disease in
rice, a scourge that destroys enough rice crops to feed 60 million people
annually. The pathogen’s remarkable ability to overcome plant defences
has stymied efforts to fight the disease. Analysis of its predicted gene set
provides an insight into the adaptations required by a fungus to cause
disease. The M. grisea genome encodes a large and diverse set of secreted proteins, including those defined by unusual carbohydrate-binding
domains. This fungus also possesses an expanded family of G-protein-coupled receptors, several new virulence-associated genes and large suites
of enzymes involved in secondary metabolism. Together with the draft
rice genome sequences published earlier this year, the new information
will help researchers develop better and cheaper methods of protecting
plants than the currently available fungicides.
Recently, the C. albicans and C. neoformans genomes were reported
[148, 203], enabling a comparison between these divergent fungi. Moreover, high-quality draft sequences of A. nidulans and A. fumigatus are
already in the public domain, and others, such as Ustilago maydis, are
likely to be available soon. Other genome sequencing projects of pathogenic fungi are also under way or will soon be started (for instance,
Pneumocystis carinii).
1.3 Materials and Methods
Strain and genomic DNA preparation for P. marneffei were carried out by
colleagues in the Department of Microbiology, University of Hong Kong.
Library construction and shotgun sequencing were carried out by the Beijing
Genomics Institute (BGI).
1.3.1 Strain and DNA preparation
P. marneffei strain PM1 was isolated from an HIV-negative patient
suffering from culture-documented penicilliosis in Hong Kong. The
arthroconidia (“yeast form”) of PM1 were used throughout the DNA sequencing
experiments. Genomic DNA, including mitochondrial DNA, was prepared
from the arthroconidia purified at 37 °C. A single colony of the
fungus grown on Sabouraud dextrose agar at 37 °C was inoculated into
yeast peptone broth and incubated in a shaker at 30 °C for 3 days. Cells
were cooled on ice for 10 min, harvested by centrifugation at 2,000 g for
10 min, washed twice and re-suspended in ice-cold 50 mmol/l EDTA
buffer (pH 7.5). After addition of 20 mg/ml novazym, the suspension was
incubated at 37 °C for one hour, followed by digestion in a mixture of
1 mg/ml proteinase K, 1% N-lauroylsarcosine and 0.5 mol/l EDTA (pH 9.5)
at 50 °C for 2 hours. Genomic DNA was then extracted with phenol and
phenol-chloroform, and finally precipitated and washed in ethanol. After
digestion with RNase A, a second ethanol precipitation was followed by
washing with 70% ethanol; the DNA was then air-dried and dissolved in
500 µl of TE (pH 8.0).
1.3.2 Library construction, shotgun sequencing
Two genomic DNA libraries were made in pUC18 carrying insert sizes
from 2.0 – 3.0 kb and 7.5 – 8.0 kb, respectively. DNA inserts were prepared by physical shearing using the sonication method. The genome
sequence was assembled from deep whole-genome shotgun (WGS) coverage obtained by paired-end sequencing from a variety of clone types,
i.e., all inserts were sequenced from both ends to generate paired reads.
A total of about 190.3 Mb of sequence data, equivalent to approximately
6.6× coverage of the genome, was generated by random shotgun sequencing.
1.3.3 Sequence assembly
The Phred/Phrap/Consed package was used for base calling, contig assembly
and quality assessment [83, 84, 112]. Contigs were ordered into scaffolds
by the scaffold-building program Bambus [255]. Refer to Chapter 2 for
more detailed descriptions of the annotation procedure and genome database
construction.
1.3.4 Data release
Sequence data generated by the project were released continuously and
were available for searching using the on-site BLAST server and for
downloading by FTP with access restriction. The annotated sequences are
available for browsing and downloading from the web interface of the
P. marneffei Genome Database (PMGD), http://www.pmarneffei.hku.hk. At
present, PMGD contains 10,060 protein-coding genes.

Table 1.1: General features of the P. marneffei genome.

Feature                           Value
Assembly size (excluding gaps)    28.98 Mb
Estimated genome size             ∼31 Mb
GC content overall                47%
GC content (coding)               50%
Protein coding genes              10,060
tRNAs                             110
% coding                          62%
Average gene size                 1,753 bp
Average intergenic distance       1,051 bp
Average intron size               111 bp
Average exon size                 380 bp
1.4 Results
1.4.1 Assembly and general characteristics
Using a pure whole genome shotgun approach, we sequenced the P. marneffei genome to 6.6× coverage. The net length of assembled contigs
totalled 28.98 Mbp. Genome statistics are presented in Table 1.1.
Genome sequence
The P. marneffei genome size was estimated at ∼31 Mb (see Section 2.4.2),
which is similar to that of Magnaporthe (∼30 Mbp), larger than that
of S. cerevisiae and S. pombe (both about 12 Mbp), but smaller than
that of Neurospora (greater than 40 Mbp). The resulting assembly consists
of 2,911 sequence contigs with a total length of 28,977,603 bp. Contigs
were ordered into 273 supercontigs (i.e., scaffolds) with a total length
of 28.42 Mbp (excluding gaps between contigs). Most of the assembly
(98.35%) is contained in the contigs. Given the high sequencing coverage,
the assembly represents the vast majority (> 95%) of the genome,
as theoretically assessed by the Lander-Waterman model [186]. The
mitochondrial genome (35 kb, circular) has been completely sequenced and
assembled (see Chapter 3 for detail).
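The coverage argument can be illustrated with the simplest form of the Lander-Waterman model, in which reads are assumed to fall uniformly at random, so that a given base is missed with probability exp(-c) at c-fold coverage. The sketch below is a simplified illustration rather than the calculation actually performed for the thesis; the function names are hypothetical, and because the text does not state which genome size was used as the denominator, both the assembled length and the estimated genome size are tried.

```python
import math


def fold_coverage(total_read_bases, genome_size):
    """Average number of times each base is sequenced (c = N*L / G)."""
    return total_read_bases / genome_size


def expected_fraction_covered(c):
    """Idealised Lander-Waterman expectation: a base is missed with probability exp(-c)."""
    return 1.0 - math.exp(-c)


# Values quoted in this chapter, in Mb; the choice of denominator is an assumption.
c_assembly = fold_coverage(190.3, 28.98)  # against the assembled contig length
c_estimate = fold_coverage(190.3, 31.0)   # against the estimated genome size

for label, c in [("assembly", c_assembly), ("estimated genome", c_estimate)]:
    print(f"{label}: {c:.2f}x coverage, "
          f"expected fraction covered {expected_fraction_covered(c):.4f}")
```

Under either denominator the idealised expectation exceeds 99%, which is compatible with the more conservative figure of > 95% given above; real libraries depart from the uniform-sampling assumption because of cloning bias and repeats.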
Genes
A total of 10,060 protein-coding genes (9,257 (92%) longer than 100
amino acids) were predicted. This number is again similar to that of
Magnaporthe and lower than that of Neurospora, and constitutes nearly twice
as many genes as in S. cerevisiae (about 6,300) and S. pombe (about 4,800),
and nearly as many as in D. melanogaster (about 14,300). The average
gene density is one gene per 2.8 kb. The average gene length of 1.75 kb
is slightly longer than the 1.67 kb average gene length for Magnaporthe
and the 1.40 kb for both S. cerevisiae and S. pombe. The protein-coding
sequence is predicted to occupy 62.1% (51.2% excluding introns) of the
sequenced portion of the P. marneffei genome, compared with 71% in
S. cerevisiae (70.5% excluding introns) and 60.2% in S. pombe (57%
excluding introns) (Table 1.2). An estimated total of 28,180 introns are
distributed among 91% of P. marneffei genes, with 34 being the largest
number of introns found within a single gene. Introns vary from 15 to
1,617 nucleotides in length, with a mean length of 111 nucleotides. The
telomere tandem repeat identified is TTAGGG. Several predicted genes that
encode conserved telomere and centromere proteins, such as
telomere-associated helicases, were identified, but the telomere and
centromere sequences themselves have remained elusive. Although the
complete genomes of A. fumigatus and A. nidulans have not been published,
high-quality drafts of their genomes can
be obtained. Preliminary analyses reveal that most of the above statistics
about gene number and gene density of P. marneffei are similar to those
of Aspergillus. This result is consistent with our understanding of the
phylogenetic relationship between them, as obtained from small ribosomal
RNA sequences (Section 1.4.1) and mitochondrial comparison (Chapter 3).

Table 1.2: Comparison of P. marneffei genome statistics to those of other
fungi. PM - P. marneffei, AN - A. nidulans, MG - M. grisea, NC - N.
crassa, SC - S. cerevisiae, and SP - S. pombe.

                                     PM       AN       MG       NC       SC       SP
Genome size (Mb)                     31       31       30       43       12       12
Gene number                          10,060   9,457    11,108   10,620   6,300    4,800
Gene coverage                        62.1%    59.2%    48.2%    44.5%    71.0%    60.2%
Gene coverage (excluding introns)    51.2%    50.6%    40.5%    37.6%    70.5%    57.0%

Ribosomal RNA and tRNA
Copies of the large rRNA tandem repeat containing the 18S, 5.8S and
25S rRNA genes are present in the P. marneffei genome. Ribosomal RNAs
from P. marneffei and other fungi were used to construct a phylogeny
to study phylogenetic relationships. 18S rRNA sequences from 43 species of
Ascomycetes were obtained from Ribosomal Database Project II Release
8.1 (http://rdp.cme.msu.edu/html/). The phylogenetic relationship
is presented in Fig. 1.3. The neighbour-joining method of tree
reconstruction, implemented in MBEToolbox (Chapter 10), was used. Alignment
replicates for bootstrapping were generated by using Phylip [86].
The result suggests that P. marneffei is likely to be an anamorph of a
Talaromyces species. This substantiates the observation that the spacer
regions of the rRNA loci are highly similar to those found in Talaromyces
species [158, 330]. Indeed the sequence is almost identical with that of T.
flavus and T. bacillisporus (Fig. 1.3). It is also very similar to that of
Chromocleista cinnabarina, a soil fungus that produces a red pigment, as
does P. marneffei. A total of 110 tRNA genes were identified, including
69 (63%) with introns.
Figure 1.3: Phylogenetic tree showing the relationships of P. marneffei to
other Penicillium and Talaromyces species. The tree was inferred from
18S rRNA data by the neighbour-joining method and bootstrap values
calculated from 1000 trees. The scale bar indicates the estimated number
of substitutions per 100 bases using the Jukes-Cantor correction. Names
and accession numbers are given as cited in the GenBank database.
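The Jukes-Cantor correction mentioned in the caption converts the observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The following sketch illustrates that formula only; it is not the MBEToolbox implementation, the function names are hypothetical, and the two short sequences are made-up toy data written in the DNA alphabet.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns containing gaps or ambiguous characters are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance; undefined when p reaches the saturation value 0.75."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy alignment: one difference in ten comparable sites.
p = p_distance("ACGTACGTAC", "ACGTACGTAT")
d = jukes_cantor(p)
print(f"p = {p:.3f}, d = {d:.4f}")
```

A matrix of such pairwise distances is the input that a neighbour-joining procedure requires.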
1.4.2 Genome architecture and co-linearity
Identification of syntenies conserved between species is valuable for
tracing the evolutionary events that affect genomes; however, little is
known about synteny among chromosome segments (or contigs) in
filamentous ascomycetes. Analysis of orthologous genes among P.
marneffei, A. nidulans and A. fumigatus revealed extensive regions of
conserved synteny, as well as a considerable extent of genome
reorganisation that has occurred in this phylum. There are 1,340 regions
containing four or more genes that were found to be co-linear between
P. marneffei and A. nidulans. A total of 3,188 P. marneffei genes are in
these regions. There are 1,273 such regions between P. marneffei and A.
fumigatus, containing 3,716 P. marneffei genes. The largest syntenic
cluster contains 27 gene pairs and is shared by P. marneffei and A. nidulans.
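One simple way to detect co-linear regions of this kind is to walk along one genome in gene order and chain together ortholog pairs whose partners are also neighbours, in a consistent direction, in the other genome. The sketch below is a greedy illustration under stated assumptions and is not the procedure actually used in this work: the input format, the `max_gap` tolerance and the function name are all hypothetical, and only the threshold of four genes is taken from the text.

```python
def collinear_blocks(pairs, min_genes=4, max_gap=2):
    """Group ortholog pairs into co-linear blocks.

    pairs: list of (contig_a, pos_a, contig_b, pos_b), where pos_* is the
    ordinal position of the gene on its contig. A block is extended while
    both genomes advance by 1..max_gap genes on the same pair of contigs
    and the direction in genome B stays the same.
    """
    blocks, current, direction = [], [], 0
    for pair in sorted(pairs):
        if current:
            prev = current[-1]
            step_a = pair[1] - prev[1]
            step_b = pair[3] - prev[3]
            step_dir = (step_b > 0) - (step_b < 0)
            extends = (pair[0] == prev[0] and pair[2] == prev[2]
                       and 1 <= step_a <= max_gap
                       and 1 <= abs(step_b) <= max_gap
                       and direction in (0, step_dir))
            if extends:
                current.append(pair)
                direction = step_dir
                continue
            if len(current) >= min_genes:
                blocks.append(current)
        current, direction = [pair], 0
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


# Toy data: four co-linear pairs followed by two pairs that break the run.
toy_pairs = [
    ("A1", 1, "B1", 10), ("A1", 2, "B1", 11), ("A1", 3, "B1", 12),
    ("A1", 4, "B1", 13), ("A1", 5, "B2", 3), ("A1", 9, "B1", 14),
]
blocks = collinear_blocks(toy_pairs)
print(len(blocks), [len(b) for b in blocks])
```

Because the walk is greedy, a single interleaved ortholog from another contig ends a block; a production method would normally tolerate such interruptions.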
Melanin-biosynthesis gene cluster
One of the interesting examples of the syntenic segments conserved between P. marneffei and Aspergillus spp. is the melanin biosynthesis gene
cluster. This six-gene cluster, spanning ∼ 19 kb, which participates in
DHN-melanin biosynthesis [24, 187, 317, 318], is found in P. marneffei,
and is syntenic in A. fumigatus (Chapter 4).
Pheromone precursor gene loss
Syntenic regions reveal evolutionary events, like gene loss, which are difficult to identify by other methods. One of the examples is the loss
of known mating pheromone precursor genes. Figure 1.4 shows the microsyntenies among pheromone precursor loci from P. marneffei, A. nidulans, A. fumigatus and N. crassa. The pheromone precursor gene has
been identified in all these species (highlighted in green) except for P.
marneffei. The hypothetical locations of P. marneffei pheromone precursor genes before loss are indicated by triangles in the figure.
Figure 1.4: Microsyntenies containing pheromone precursor loci from P.
marneffei, A. nidulans, A. fumigatus and N. crassa. The pheromone precursor genes have been highlighted in green. The hypothetical locations
of P. marneffei pheromone precursor genes before gene loss are indicated
by triangles.
1.4.3 Gene duplications (multigene families) and comparisons
Among all predicted P. marneffei genes (10,060 in total, with 9,541 longer
than 100 bp), 1,335 belong to 428 multigene families, each containing
more than one homologous member. The largest gene family consists
of 34 genes. The most expanded gene families include MFS multidrug
transporters, dehydrogenases/reductases and hexose transporters, as well as
pepsin-type proteases (see Table 8.2 on page 165). Comparisons of
contig/supercontig sequences and searches for tracts of conserved gene order
reveal little evidence for large-scale duplications in P. marneffei. The
incomplete genome sequence and unordered contigs obviously impair
the detection. Notably, this result is inconsistent with another line of
evidence, presented in Chapter 8, in which the histogram of synonymous
substitution rates of P. marneffei duplicate gene pairs suggests that two
large-scale gene duplications probably happened. Compared with S.
cerevisiae, which has undergone genome duplication (i.e., the largest
gene duplication), P. marneffei has a relatively small number of recently
duplicated gene pairs. However, the age distribution of duplicate genes in
P. marneffei at the first peak (see Chapter 8 for detail) shows a pattern
similar to that in S. cerevisiae, which might suggest that duplicate
genes in P. marneffei originated through one or two episodic,
large-scale gene duplications.
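The grouping of genes into multigene families described at the start of this section can be pictured as single-linkage clustering of pairwise homology relationships. This chapter does not state how the families were built, so the sketch below is an assumption offered only for illustration: it uses a union-find structure, the function name and gene identifiers are hypothetical, and every gene in `homolog_pairs` is assumed to appear in `genes`.

```python
def gene_families(genes, homolog_pairs):
    """Single-linkage clustering: genes linked by any chain of homology
    relationships end up in the same family. Singletons are dropped."""
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the root, shortening the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in homolog_pairs:
        parent[find(a)] = find(b)

    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    return [sorted(members) for members in families.values() if len(members) > 1]


# Toy data with hypothetical gene identifiers.
toy_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
toy_links = [("g1", "g2"), ("g2", "g3"), ("g5", "g6")]
families = gene_families(toy_genes, toy_links)
print(len(families), "families,", sum(len(f) for f in families), "genes in families")
```

Single linkage is permissive: one spurious link is enough to merge two unrelated families, which is why the stringency of the homology threshold matters for counts such as those reported above.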
1.4.4 Interspecies proteome comparison
The comparison of genomic sequences of two or more species may provide
information on how evolution shapes genome structure and content, and
reveal specific sequences that have been conserved, as well as those that
have been invented, throughout evolution. I conducted such a comparative
analysis of proteome sequences between P. marneffei, A. fumigatus and
S. cerevisiae. The analysis started by defining ortholog
or paralog pairs among proteomes. Two genes are said to be paralogous
if they are derived from a duplication event, but orthologous if they are
derived from a speciation event. Determining orthologs is an important step
in assessing the relationship between genomes. This was performed using
the BLAST comparison tool. BLASTP was used to compare the
sequences of proteins encoded by genes of one genome against those from
the other genomes. Protein sequences, instead of nucleotide sequences,
were compared because protein sequences remain conserved much longer
on an evolutionary time scale, and their alignments can therefore detect
much older relationships. The lower the E-value, the greater the chance
that two proteins are homologous, that is, derived from a common
ancestral protein, and therefore likely to share the same function.
E-values have been shown to be an accurate indication of the ratio of
false positives to true positives among homologous relationships. Genes g
and h were considered orthologues if h is the best BLASTP hit for g and
vice versa, with an E-value less than or equal to 1e-10.
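The reciprocal-best-hit criterion stated above can be sketched as follows. The input is assumed to be a list of (query, subject, E-value) triples already parsed from BLASTP output; the parsing step, the function names and the gene identifiers in the toy data are hypothetical, and ranking hits by E-value rather than by bit score is also an assumption.

```python
def best_hits(hits, max_evalue=1e-10):
    """Map each query to its lowest-E-value subject, ignoring hits above the cutoff.

    hits: iterable of (query, subject, evalue). On ties the first hit seen is kept.
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits_ab, hits_ba, max_evalue=1e-10):
    """Return sorted (g, h) pairs where h is the best hit of g and g the best hit of h."""
    best_ab = best_hits(hits_ab, max_evalue)
    best_ba = best_hits(hits_ba, max_evalue)
    return sorted((g, h) for g, h in best_ab.items() if best_ba.get(h) == g)


# Toy data with hypothetical identifiers (genome A against B, and B against A).
hits_ab = [("Pm_g1", "Af_h1", 1e-50), ("Pm_g1", "Af_h2", 1e-20),
           ("Pm_g2", "Af_h2", 1e-30), ("Pm_g3", "Af_h3", 1e-5)]
hits_ba = [("Af_h1", "Pm_g1", 1e-48), ("Af_h2", "Pm_g1", 1e-25),
           ("Af_h3", "Pm_g3", 1e-6)]
orthologs = reciprocal_best_hits(hits_ab, hits_ba)
print(orthologs)
```

In the toy data only the first pair qualifies: the second gene's best hit prefers a different partner, and the third pair fails the E-value cutoff in both directions.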
The translated ORF sequences of S. cerevisiae were obtained from
the Saccharomyces Genome Database (SGD) at http://www.yeastgenome.org/.
The predicted peptides of A. fumigatus were downloaded from the
FTP service of the A. fumigatus genome project at the Sanger Institute
(http://www.sanger.ac.uk/). The result of the proteome comparison
is given in Fig. 1.5.
Figure 1.5: Graphical representation of a triple proteome comparison
between P. marneffei, S. cerevisiae and A. fumigatus.
1.4.5 Lineage-specific genes
We identified many genes present only in P. marneffei or its closely
related fungal species, namely lineage-specific genes. At the most extreme,
some genes are present in P. marneffei exclusively. These genes are of
particular interest because they may be determinants of characteristic
features of the fungus. A total of 1,447 genes whose proteins lack
significant matches to known proteins from public databases (TBLASTN cutoff
1e-10) were found. This reflects the fact that genome projects for
Penicillium and its closely related fungi are still at an early stage, and
that the diversity of fungal genes remains to be explored. Furthermore,
2,506 proteins do not have significant matches to genes in either the
sequenced yeasts or A. nidulans. A novel theory about the emergence of
lineage- or species-specific genes is given in Chapter 9. Briefly, an
accelerated evolutionary rate, one of the most characteristic properties of
a lineage-specific gene, may be responsible for the gene’s emergence.
In addition to the lineage-specific genes, many fungal-specific domains
have been identified. These include the cell wall antigen MP1 domain, which
was first described in the cell wall antigen Mp1p encoded in P. marneffei
[347]. Mp1p contains two self-conserved regions, namely CR1 and CR2,
which form a new conserved domain family that has not been described
in conserved domain databases such as Pfam and ProDom. The genome
sequence reveals more than 12 P. marneffei genes containing at least one
MP1 domain. That is to say, the family of genes encoding MP1-containing
proteins has expanded in the P. marneffei genome. The expansion is less
extensive in A. fumigatus and A. nidulans, although at least two
MP1-containing proteins, afmp1 and afmp2 (GenBank Acc.: AAG09624 and
AAR22399), have been discovered in the A. fumigatus genome.
Figure 1.6: Putative MAPK signalling pathway in P. marneffei.
Overview of major intracellular signalling pathways in P. marneffei.
Common genes between S. cerevisiae and P. marneffei are marked with
asterisks. Names of S. cerevisiae genes are presented. The P. marneffei genes are in parentheses. Created by using GenMAPP v2.0, a free
program for visualising genes on biological pathways.
1.4.6 Cell signalling and morphogenesis
The sequences encoding proteins that act in well-studied signalling
pathways, including mitogen-activated protein kinases (MAPK) and cyclic
AMP-dependent protein kinase, as well as small GTPases of the Ras
family, are readily recognised in the P. marneffei genome. Figure 1.6
compares the MAPK signalling pathways of S. cerevisiae and
P. marneffei.
1.4.7 Potential mating ability
Traditionally, P. marneffei has been considered an asexual (anamorph)
ascomycete that lacks an apparent sexual (teleomorph) stage in its life cycle
and seems to reproduce only mitotically [44, 104]. Recent genetic studies,
however, suggest that it may have an unidentified sexual cycle. Except
for the pheromone precursor gene, the whole set of sex-related genes was
identified in the P. marneffei genome, which demonstrates the potential
mating ability of this important thermally dimorphic fungus (Chapter
5).
1.4.8 Putative virulence genes
What makes a fungus a pathogen is an old question. The P. marneffei
genome sequence has revealed many proteins and systems with functions
that have previously been found to be important in pathogenic fungi. For
example, proteins such as phospholipases and proteinases are involved in
direct host cell damage and lysis. A review of fungal virulence factors
is given in Section 4.2. A few of the putative virulence factors identified
are presented in Table 1.3.
1.4.9 Cell wall antigens and biosynthetic genes
The cell wall of a fungus maintains the structural integrity of the cell,
protects the fungus against the defence mechanisms of the host and
harbours most of the fungal antigens. It consists of a polymer of α- and
β(1,3)-glucans, chitin, galactomannan and β(1,3)(1,4)-glucan embedding
protein antigens including the adhesins. The cell wall is synthesised and
continuously remodelled by enzymes including synthases, transglycosidases
and glycosyl hydrolases. All of these are absent from human cells and are
thus ideal targets for anti-fungal agents and immunisation. Previous studies
have shown that a specific monoclonal antibody against the galactofurane
side chain of the galactomannan antigen of A. fumigatus can react with
the cell wall of P. marneffei and can be used to detect antigenaemia or
antigenuria in patients suffering from penicilliosis marneffei [363]. An
ortholog of one of the known P. marneffei cell wall antigen
genes, MP1, is present in A. fumigatus. Within P. marneffei, homologs
of a number of Aspergillus genes encoding similar biosynthetic enzymes
and cell wall antigens have been identified (Table 1.4).

Table 1.3: Putative virulence genes

Gene        Acc. No.   BLAST hit                                                      E value
Proteinase
Pm47.49     P87184     Intracellular vacuolar serine proteinase precursor             0
Pm61.35     Q96WN2     Lon proteinase                                                 0
Pm109.24    P25375     Saccharolysin (EC 3.4.24.37) (Protease D) (Proteinase yscD)    1e-159
Pm61.50     Q6FX66     YCL057w PRD1 proteinase yscD                                   1e-158
Pm88.30     Q64HW0     Aspartyl proteinase                                            1e-122
Pm66.31     P32379     Proteasome component PUP2 (EC 3.4.25.1)                        3e-98
Pm13.58     Q871P4     Related to ubiquitin-specific proteinase UBP1                  6e-97
Phospholipase
Pm1.261     Q769K2     N-acyl-phosphatidylethanolamine-hydrolysing phospholipase D    6e-61
Pm103.31    Q874F2     Phospholipase D                                                1e-156
Pm16.57     Q6U820     Lysophospholipase (EC 3.1.1.5)                                 0
Pm167.18    Q877A5     Phospholipase (Fragment)                                       2e-51
Pm182.7     Q76H92     Phospholipase A2                                               3e-27
Pm22.27     Q9P866     Candida albicans Phosphatidylinositol phospholipase C          4e-44
Metacaspase
Pm112.34    Q8J140     Metacaspase                                                    1e-91
Pm205.1     Q8J140     Metacaspase                                                    3e-58
Agglutinin
Pm113.29    Q9P5P9     Related to A-agglutinin core protein AGA1                      1e-24
Pm10.4      P11219     Lectin precursor (Agglutinin)                                  5e-09
Pm2.195     Q8CMU7     Streptococcal hemagglutinin protein                            3e-07
Pm28.53     Q7N911     Similar to hemagglutinin/hemolysin-related protein             0.00005
Toxin
Pm21.30     A45086     HC-toxin synthetase - fungus (Cochliobolus carbonum)           0
Pm21.31     Q9UVN5     AM-toxin synthetase                                            0
Pm71.10     Q9UVN5     AM-toxin synthetase                                            0
Pm71.39     Q9UVN5     AM-toxin synthetase                                            0
Pm137.4     Q9UVN5     AM-toxin synthetase                                            0
Pm151.1     A45086     HC-toxin synthetase - fungus (Cochliobolus carbonum)           0
Pm112.24    Q96WL1     Aflatoxin efflux pump Aflt                                     1e-141
1.5 Discussion
This is the initial analysis of the genome of a thermally dimorphic
fungus. Although P. marneffei has not been studied intensively, the
analysis of the genome sequence has provided many new insights into a
variety of gene functions and cellular processes, including cell wall
components, signalling pathways, secondary metabolism and mating ability.
Comparisons of the genome of P. marneffei with those of other pathogenic
and non-pathogenic fungi have also uncovered surprising similarities and
differences, providing a new perspective on the molecular underpinnings
of these lifestyles. The analysis of P. marneffei-specific genes might allow
researchers to begin to gain insights into the transition from mould to
yeast growth. Furthermore, the genome sequence has revealed different
patterns of gene duplication in P. marneffei and other ascomycetes,
which might be linked with their divergent biological characteristics. The
apparent lack of a pheromone precursor locus in P. marneffei may provide
an explanation of its asexual life style. However, the fungus may indeed
undergo an as yet undetected sexual cycle, a possibility supported by the
finding of homologs of many mating genes. Finally, one of the most
interesting findings is the abundance of intragenic tandem repeats in the
coding regions of the genome. This finding provides a possible mechanism
to explain how the fungus can change its surface coat and thereby evade
detection by the host’s natural defences (see Chapter 7).

Table 1.4: Cell wall antigens and biosynthetic genes predicted in P. marneffei.

Aspergillus gene                     Acc. No.      Pm gene     E value
CHSs
  Class I CHSA                       AAB33397      Pm14.101    e-107
  Class II CHSB                      AAB33398      Pm132.15    5e-097
  Class III CHSG                     AAB07678      Pm110.5     0
  Class IV CHSF                      AAB33402      Pm87.22     6e-064
  Class V CHSE                       CAA70736      Pm38.37     0
  Class VI CHSD                      AAB33400      Pm223.4     e-051
β(1,3)-glucan synthase
  FKS1                               AAB58492      Pm120.1     0
  RHO1                               AAG12155      Pm203.6     5e-099
α(1,3)-glucan synthase
  AGS1                               AAL28129      Pm162.3     0
  AGS2                               AAL18964      Pm66.50     0
β(1,3)-glucanosyl transferases
  GEL1                               AAC35942      Pm221.6     e-154
  GEL2                               AAF40139      Pm94.24     e-123
  GEL3                               AAF40140      Pm119.10    e-124
Mannosyl transferases
  MNN9                               Afu2g01450    Pm207.2     5e-097
  PIG-M                              Afu7g01300    Pm90.41     2e-063
Chitinases Endo-β(1,3)-glucanases
  Engl1                              AAF13033      Pm5.32      0
The draft genome sequence of P. marneffei presented in this chapter
provides the first attempt to understand the genetic basis of the
physiology of this special Penicillium species. This first glimpse
may be expanded as many other fungal genomes are generated by fungal
genome sequencing projects that are ongoing or planned. This new era in
fungal biology promises to yield insights into this important group of
organisms, as well as to provide a deeper understanding of the fundamental
cellular processes common to all eukaryotes.
Chapter 2
PENICILLIUM MARNEFFEI GENOME DATABASE
AND ANNOTATION PIPELINE
The draft genome of Penicillium marneffei has been obtained (Chapter 1).
Extracting valuable information from this large amount of sequence data
requires efficient analysis and a computer-based analysis system tailored
to the genome. To meet this need, a sequence data management system with a
number of peripheral applications has been developed.
2.1 Introduction
The rapidly growing volume of genome information for P. marneffei
needs to be adequately processed, annotated and interpreted. Computational
annotation systems that provide tools and algorithms can facilitate
this process and advance our understanding of the genome sequences.
For such systems to be developed and refined, data must be easily accessible
and amenable to analysis. The analysed data must then be fed back into
the loop so that they can be re-analysed, refined and verified, and so that
new hypotheses can be built. This is the issue of data management. Good data
management practices are essential to users of genomic data.
This chapter is concerned with two aspects: (1) construction of the
PMGD (P. marneffei genome database) system, and (2) issues relevant to the
development of the annotation pipeline. Many steps are involved
in these two aspects. Among them, prediction of protein function
is one of the most critical in genome information processing, and
function prediction therefore forms the central part of the annotation
pipeline. Since P. marneffei genetics has not been well established,
most of the proteins derived from its genome will be totally unknown
to biologists. More than ten thousand unknown proteins will undergo
function prediction. Different methods of protein function prediction
have been developed (see Literature Review). Briefly, these methods
can be categorised into two major groups: homology-based methods and
nonhomology-based methods. The former depend on detectable homology
between the unknown protein and characterised proteins in a database.
The latter are based on various kinds of contextual functional
information, which are collected and integrated
around the protein in order to assign it a putative function
in an indirect way [218]. However, none of these methods can guarantee
a ‘one-stop’ solution that is particularly successful in P. marneffei gene
function prediction. Hence, the newly developed annotation pipeline
integrates several currently used methods, but it is by no means a mere
collection of methodologies. Each method was tailored before being
integrated, in order to maximise its predictive power with respect to
the features of fungal proteins.
In the next section, I will first review the principles underlying the
methodologies used for predicting the function of unknown proteins. I will
then examine a few protein function prediction systems implemented by
several research groups, before pointing out some additional considerations
regarding the further development of similar systems. Note that
the topic of protein function prediction is a broad one, and it could be
broken down into subtopics in many different ways. I have tried to organise
them in a flow from theory to application as smoothly as possible.
Even so, the content of the sections may jump around slightly; some key
concepts, such as sequence alignment algorithms, may be mentioned
more than once in different sections.
2.2 Literature Review
In this literature review I will first examine the most widely used methods
in protein function prediction. I will then survey the software/database
systems currently available, highlighting their strengths and shortcomings.
Possible directions for further research will be addressed at the end of
the review.
2.2.1 Methods for predicting protein function
Based on the underlying principle, the methods of protein function prediction can be categorised into two major groups: homology-based methods and nonhomology-based methods [17, 217, 142].
Homology-based methods
Homology-based annotation relies on sequence similarity between the query
protein and a well characterised protein. If two proteins are highly similar
in sequence, they possibly share the same function. The rationale behind
this extrapolation of function is that similarity in sequence is a
sufficient indicator of functional similarity. This is reasonable, but
counter-examples are not rare. For instance, in the presence of domains that
are shared by numerous proteins [74], choosing the first or the best hit may
not be optimal. The multi-domain organisation of proteins can lead to
incorrectly annotated database entries. Despite such criticisms,
homology-based methods are by far the most widely used. To calculate
similarities/distances to sequences of known proteins, pairwise similarities
are computed using the rigorous dynamic programming algorithm [292],
or heuristic algorithms such as FASTA [245] and BLAST [6].
Besides comparison of whole-protein similarity, detecting motifs or
domains shared among proteins gives additional information about function.
A motif is a simple combination of a few consecutive secondary structure
elements with a specific geometric arrangement (e.g., helix-loop-helix).
Some, but not all, motifs are associated with a specific biological
function. A domain is the fundamental unit of structure folding and
evolution. It may combine several secondary structure elements and motifs,
not necessarily contiguous. A domain can fold independently into a stable
3D structure, and it has a specific function. A variety of mathematical
representations of protein motifs/domains have been developed and used
for detecting and storing them, such as regular expressions,
position-specific scoring matrices [97], hidden Markov models [57],
probabilistic suffix trees [15], and sparse Markov transducers [81].
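The simplest of these representations, the regular expression, can be illustrated with a short sketch. The Python function below translates a PROSITE-style pattern into a standard regular expression, assuming the usual PROSITE conventions (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, and < or > for terminal anchors). The function name prosite_to_regex, the example pattern and the test sequence are illustrative assumptions and are not part of PMGD.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # '<' anchors the pattern at the N-terminus, '>' at the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split the element into its core and an optional repeat, e.g. x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError("malformed pattern element: %r" % element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # '[...]' alternatives and single letters carry over unchanged.
        if low and high:
            repeat = "{%s,%s}" % (low, high)
        elif low:
            repeat = "{%s}" % low
        else:
            repeat = ""
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Illustrative zinc-finger-like pattern and a made-up sequence that satisfies it.
EXAMPLE_PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
EXAMPLE_SEQUENCE = "MKC" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "HK"
EXAMPLE_REGEX = prosite_to_regex(EXAMPLE_PATTERN)
```

Searching EXAMPLE_SEQUENCE with re.search(EXAMPLE_REGEX, ...) finds the motif starting at the first cysteine. A regular expression either matches or does not, which is why the weighted representations listed above (scoring matrices, HMMs) are preferred when partial matches must be ranked.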
Nonhomology-based methods
Although homology-based annotation has been widely successful in extending
knowledge from the small set of experimentally characterised
proteins to the tens of thousands of proteins found in genome sequencing
projects, a serious limitation of this method is that a well characterised
reference protein must be found on the basis of sequence similarity;
otherwise, no putative function can be assigned to the unknown protein.
According to the data that we currently have, 30-40% of proteins have no
clear sequence homolog in today’s most up-to-date protein databases.
Another recently finished fungal genome sequencing project encountered the
same problem [101].
Nonhomology-based methods, also called context-based function prediction,
are complementary to homology-based function prediction. Phylogenetic
profiles, domain fusion and gene neighbouring are examples of
these methods. Pellegrini et al. [248] presented the phylogenetic profiles
method based on the assumption that proteins that function together in
a pathway or structural complex are likely to evolve in a correlated
fashion. If proteins A and B tend to be either preserved or eliminated
together in a new species, we can expect that they are functionally linked.
In this case, if we know the function of protein A, we can predict the
function of protein B on the basis of this functional linkage. The method
of phylogenetic profiling could be useful in predicting the function of
uncharacterised proteins in P. marneffei, especially as more and more
fungal species are sequenced. For the time being, however, this method has
to be performed manually because no free software is available to
automate the analysis.
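The idea behind phylogenetic profiling can be made concrete with a minimal sketch. Each protein is represented by a presence/absence vector over a set of genomes, and two proteins are proposed as functionally linked when their vectors are (nearly) identical. The function names, the Hamming-distance criterion and the toy profiles below are assumptions made for illustration; they are not the procedure of Pellegrini et al. [248] in detail, nor part of the PMGD pipeline.

```python
from itertools import combinations


def profile_distance(a, b):
    """Hamming distance between two presence/absence profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(a, b))


def linked_pairs(profiles, max_distance=0):
    """Return protein pairs whose profiles differ in at most max_distance genomes."""
    pairs = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        # Profiles that are present everywhere (or nowhere) carry no signal.
        if len(set(prof_a)) < 2 or len(set(prof_b)) < 2:
            continue
        if profile_distance(prof_a, prof_b) <= max_distance:
            pairs.append((name_a, name_b))
    return pairs


# Toy data: 1 = homolog present in the genome, 0 = absent (five hypothetical genomes).
TOY_PROFILES = {
    "protA": (1, 0, 1, 1, 0),
    "protB": (1, 0, 1, 1, 0),
    "protC": (0, 1, 0, 1, 1),
    "protD": (1, 1, 1, 1, 1),
}
```

With these toy profiles, linked_pairs(TOY_PROFILES) proposes only the pair (protA, protB); protD is ignored because a protein found in every genome is uninformative. In practice the presence/absence calls would come from BLAST searches against each genome, and a statistical measure is usually preferred over a fixed distance cut-off.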
2.2.2 Software/database systems for protein function prediction
Over the decades, through close cooperation between biologists and software
engineers, a wide range of software and/or database systems has
been developed. As shown below, some of them rely mainly on one of the
methods mentioned above as their predictive tool, while others try to
integrate more than one method in order
to give more comprehensive annotation for unknown proteins.
Systems for automatic function assignment
A group of software systems, such as PEDANT, GeneQuiz and Bio-Dictionary,
attempt to accelerate the work of human experts by providing detailed and
exhaustive information for function assignment.
PEDANT (http://pedant.gsf.de) is a software system for completely
automatic and exhaustive analysis of protein sequence sets, from
individual sequences to complete genomes [96]. Launched in 1996, it is one
of the earliest such systems and has been used extensively at
MIPS, a Europe-based bioinformatics institute. Its developers claim that it
is fully integrated with a sequence database system and provides access to
a broad range of biological information through a hierarchically organised,
controlled vocabulary. Like some other similar systems, the whole system
has since been commercialised, which limits its popularity.
The GeneQuiz analysis server is open to public use and accepts
anonymous protein sequences through GQserve [7]. It is composed of several
major modules: GQupdate keeps the target databases current, maintaining
integrated, up-to-date, non-redundant protein and nucleotide sequence
databases, as well as databases of protein structures and motifs; GQsearch
performs database searches with the queries, applies a variety of sequence
analysis tools to the query sequence, and parses and stores the results
in a common format; GQbrowse allows browsing and querying of results.
These modules are general engineering achievements that do not differ in
principle from other database systems. It is the GQreason module
that embodies the most critical know-how of the whole system. This module
analyses the results and makes intelligent guesses, assigning to the query
a specific function, a general functional class, and a reliability estimate.
Bio-Dictionary [264] employs a weighted, position-specific scoring scheme
and uses a complete collection of amino acid patterns (referred to as
seqlets) to determine, in a single pass, all local and
global similarities between the query and any protein already present in a
public database. The most distinctive feature of Bio-Dictionary is its use
of seqlets that completely cover the natural sequence space of proteins in
the currently available public databases. Its developers claim that the
seqlets contained in this collection can capture both functional and
structural signals that have been reused during evolution, both within and
across families of related proteins. With this capacity, seqlets are
well-suited elements for use in protein annotation.
Classification system
It is not always the case that an unknown protein can be readily assigned
a definite functional description. In such a case, protein classification
can help to elucidate the function of the new protein. Comparing
a protein sequence with a database of protein families is more effective
than a standard database search. Generally, conserved proteins are
classified according to their homologous relationships. Each protein group
is composed of a set of “seed” proteins, represented as multiple
alignments, regular expressions, profiles or HMMs. Protein classification is
useful in structure and function prediction, and especially important in
large-scale annotation efforts.
As of 2001, Clusters of Orthologous Groups of proteins
(COGs) had been delineated by comparing protein sequences encoded in 43
complete genomes, representing 30 major phylogenetic lineages [308].
The system has since been updated to include more complete genomes
representing broader lineages. Each COG consists of individual proteins or
groups of paralogs from at least 3 lineages and thus corresponds to an
ancient conserved domain. One problem with the COGs system is that it is
not fully open to the public: batch application of COGnitor, the key
component of the system used to fit new proteins into the COGs, can only be
accessed inside the NCBI. Another issue that has to be taken into account is
that COGs does not discriminate paralogs (genes from the same genome
that are related by duplication) from orthologs (genes in different species
that evolved from the same ancestral gene). Orthologs typically have
the same function, allowing transfer of functional information from one
member to an entire COG. In contrast, paralogs, whose genes duplicated
after speciation, may be functionally diverse, although high sequence
similarity is normally preserved between them. A system like COGs can
therefore only be used as a classification system for automatically
yielding a number of functional predictions for poorly characterised
genomes. The COGs system is of limited usefulness in the P. marneffei genome
project because its current version contains few fungal genomes. Other
database systems, such as Systers [177], iProClass [135] and ProtoMap [362],
have the same shortcoming as COGs. They are better treated as protein
information storage/retrieval systems than as active protein function
prediction systems.
Protein domain databases
A list of commonly used protein domain databases is given in Table 2.1.
Two of them, Pfam and InterPro, have been used in PMGD.
Pfam (http://www.sanger.ac.uk/Software/Pfam) is a large collection of multiple sequence alignments and hidden Markov models covering
many common protein domains and families [13]. For each protein family, Pfam allows looking at multiple alignments, viewing protein domain
architectures, examining species distribution, and so on. Pfam is built
from fixed releases of Swiss-Prot and TrEMBL. In the current version, 18.0
(2005), 75% of protein sequences in Swiss-Prot and TrEMBL have at
least one match to Pfam.
InterPro (http://www.ebi.ac.uk/interpro) is a database of protein families, domains and functional sites in which identifiable features
found in known proteins can be applied to unknown protein sequences.
It provides an integrated view of the commonly used signature databases
like PROSITE, PRINTS, SMART, Pfam, ProDom, etc., and has an intuitive
interface for text- and sequence-based searches. The latest release,
11.0, contains 12,294 entries and covers 77.5% of UniProt proteins.
InterProScan is a tool that combines the different protein signature
recognition methods native to the InterPro member databases into one
resource, with look-up of the corresponding InterPro and GO annotation.
2.2.3 The art of gene finding
The last 20 years have witnessed significant development of computational
methodology for finding genes and other functional sites in genomic DNA.
Two major classes of computational approaches are commonly used to detect
genes in genomic sequences: homology-based approaches and ab initio
gene-finding algorithms. The former are relatively straightforward,
focusing on the search for homologous relationships with the content and
structure of known genes. If a region of sequence is similar to the
sequence of an identified gene, it is highly suggestive, though not
necessarily conclusive, of a gene. The most commonly used program for such
comparisons is probably BLAST.
Table 2.1: Commonly used domain databases.
Database   Method        Data type   URL
Prosite    Semi-Manual   Motif       www.expasy.ch/prosite/
Pfam       Semi-Auto     Domain      www.sanger.ac.uk/Software/Pfam/
Blocks     Full-Auto     Motif       www.blocks.fhcrc.org/
ProDom     Full-Auto     Domain      prodes.toulouse.inra.fr/prodom
Prints     N/A           Motif       www.bioinf.man.ac.uk/PRINTS/
Domo       Full-Auto     Domain      www.infobiogen.fr/services/domo/
InterPro   N/A           Motif       www.ebi.ac.uk/interpro/
Smart      Semi-Auto     Domain      smart.embl-heidelberg.de/
eMotif     Full-Auto     Motif       dna.stanford.edu/identify
Next I will review some issues related to ab initio gene-finding
algorithms. Generalised hidden Markov models (GHMMs) appear to be
approaching acceptance as a de facto standard for state-of-the-art ab
initio gene finding, as evidenced by the recent proliferation of GHMM
implementations, including GenScan [30] and FGENESH (Softberry). At
the time of writing, neither GenScan nor FGENESH is open source, and no
detailed information about the underlying algorithms and
implementations is available. According to the general description of the
algorithm, GenScan uses a training set to estimate the HMM parameters,
and then returns the exon structure using the maximum likelihood
approach standard to many HMM algorithms (the Viterbi algorithm). The
generalised HMM that GenScan uses consists of a number of states modelling
the various parts of a gene. These states include the 5’ splice site, 3’
splice site, internal coding exon, start exon, and terminal exon. The final
gene structure predicted by GenScan is the maximum probability path
through the HMM. FGENESH is also HMM-based, with an algorithm
similar to that of GenScan [30]; it differs in that, in its model of gene
structure, a signal term (such as a splice site or start site score) has
some advantage over a content term (such as coding potential), reflecting
the biological significance of the signals.
No matter what algorithm a gene-finding program implements, several basic
types of signal must be detected. The signals (or functional sites in
genomic DNA) that researchers have sought to recognise include splice
sites, start and stop codons, branch
points, promoters and terminators of transcription, polyadenylation sites,
ribosomal binding sites, topoisomerase II binding sites, topoisomerase I
cleavage sites, and various transcription factor binding sites [108]. From
the point of view of information science, two basic types of information
are used here: (1) “signals” in the sequence, such as splice sites; and (2)
“content” statistics, such as codon bias. Among signal measures, the
splice junctions (the donor and acceptor sites) are the most important
features to be identified. The most common methods for this have been
“weight matrix” based. Other methods, such as consensus sequences, maximal
dependence decomposition (MDD) and neural networks,
are also used. Other signals, such as start and stop codons, TATA boxes,
transcription factor (TF) binding sites, and CpG islands, are also useful
in predicting protein-coding regions. Content measures, such as
codon bias and the periodicities and asymmetries of coding regions, help to
distinguish coding from noncoding regions. Fairly long exons are easy to
identify, whereas short ones remain difficult. Neural networks have also
been used to distinguish coding from noncoding sequences.
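To make the “weight matrix” idea concrete, the following sketch builds a log-odds matrix from a handful of aligned donor-site sequences and uses it to locate the best-scoring window in a longer sequence. It is only an illustration of the general technique: the function names, the uniform background of 0.25 per base, the pseudocount and the toy training sites are all assumptions, and the sketch does not reproduce the scoring used by GenScan or FGENESH.

```python
import math


def build_weight_matrix(sites, pseudocount=1.0):
    """Build a log-odds weight matrix from aligned, equal-length site sequences."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all training sites must have the same length")
    background = 0.25  # assumption: uniform base composition
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        matrix.append(scores)
    return matrix


def score_site(matrix, window):
    """Sum the per-position log-odds scores for one candidate window."""
    return sum(column[base] for column, base in zip(matrix, window))


def best_site(matrix, sequence):
    """Return (position, score) of the highest-scoring window in the sequence."""
    width = len(matrix)
    candidates = [(score_site(matrix, sequence[i:i + width]), i)
                  for i in range(len(sequence) - width + 1)]
    score, position = max(candidates)
    return position, score


# Made-up donor-site examples (3 exon bases followed by 6 intron bases).
TOY_DONOR_SITES = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA",
                   "GAGGTAAGT", "CAGGTGAGT", "TAGGTAAGC"]
TOY_MATRIX = build_weight_matrix(TOY_DONOR_SITES)
TOY_SEQUENCE = "TTTTCAGGTAAGTCCCC"
```

Calling best_site(TOY_MATRIX, TOY_SEQUENCE) places the best candidate donor site at offset 4, where the consensus CAGGTAAGT occurs. A weight matrix treats positions as independent, which is the limitation that methods such as MDD were designed to relax.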
Recently, homology-based approaches have been incorporated into
ab initio gene-finding algorithms. GenomeScan, for example, combines two
sources of information: probabilistic models of exons and introns, and
sequence similarity information [361]. It is an extension of
the GenScan program, predicting gene structures that have at least one
exon with supporting evidence from an existing protein sequence. The
major disadvantage of this method is the requirement for a close homolog.
It is often the case that homologs are unknown or only remote, in which
case this system would be inappropriate.
Although the programs for gene structure prediction have greatly improved
in the last decade, even the best cannot autonomously detect all
genes and genomic elements, and they have to be supported by experimental
analysis. The programs still produce a considerable proportion of incorrect
and missed exons, and they concentrate only on the detection of coding
exons, while 5’ and 3’ UTRs, promoter elements, and polyA sites often
remain undetected. The elucidation of complex genome organisation,
such as nested and overlapping genes or alternative splicing, has not yet
been addressed by any of the programs [267].
2.3 Implementation
The overall objective of PMGD is to design and implement a distributed
information framework that provides services, tools and infrastructure
for high-quality analysis and annotation of large amounts of diverse
genomic data. The whole system starts with the assembly of sequences and
ends with the web interface for output of all processed information.
Because update requirements depend on the genomic data sources
to be updated, PMGD was designed to be modular and configurable, so that
adding new sequence data is as straightforward as possible.
2.3.1 Annotation pipeline
The general strategy applied to the analysis of all contigs is diagrammed
in Fig. 2.1. It uses standard published procedures for sequence comparison
as well as sh/bash shell scripts and Perl scripts specifically developed for
this work (see Section 2.3.5). The procedure involves the following major
steps:
[Figure 2.1 (flowchart). Box labels: Contigs (2911); Consed/BAMBUS; Scaffolds (273); BLASTX Search; GenScan; FGENESH; HmmGene; Tandem Repeat Finder; Other Nucleotide Analyses; Best Gene Prediction; Predicted Genes (10,060); BLASTP Search; Domain Identification; Other Protein Analyses; Gene Structure & Functional Annotation; Relational Database Storing Annotation; Sequence Data Files; PMGD Website Interface.]
Figure 2.1: Flowchart of annotation pipeline for P. marneffei genome.
Step 1: contig assembly
Contigs were assembled from the sequence electropherograms using
Phred/Phrap with default options except where otherwise indicated
(for details, see Section 2.3.2).
Step 2: comparisons of contigs to sequence databases
Comparisons of all contigs with fungal DNA sequences were performed
using BLASTN (default parameters) to search for rDNA, plasmid or
mitochondrial DNA sequences. The contigs were also compared to all known
proteins in GenBank (release 131) using ungapped BLASTX, with significant
hits indicating potential exons. The searches were made using
the seg filter and the PAM250 substitution matrix. The searches against
mitochondrial sequences were made using the filamentous fungal
mitochondrial genetic code. To facilitate visual inspection of the
alignments, I developed the blast2html script, which converts regular
BLAST output to HTML format. A graph is inserted above the
description lines showing alignments coloured according to their similarity
score with the contig or protein query. Note that BLASTX hits can often
indicate the approximate location of many coding exons, but they do not
find every exon and do not accurately delineate exon boundaries, so the
BLASTX search in this step provides only preliminary coding information.
Step 3: identification of genetic elements
This step identifies protein-coding genes and other genetic elements.
Different gene-finding programs were evaluated, and the best one was then
used as the primary gene-finding program (for details, see Section 2.3.3).
In addition to the protein-coding genes, tRNAs were identified using the
tRNAscan-SE program [207] (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/).
Step 4: BLAST comparisons to protein sequences
After the predicted proteins were obtained, they were compared with the
nonredundant NCBI protein database using BLASTP version 2.0.10 with the
seg filter and the PAM250 substitution matrix. All
predicted genes were searched against the Pfam set of hidden Markov
models using the HMMER program, and against InterPro using a modified
InterProScan running locally on the bioinfo server.
Step 5: Data storing and PMGD web interface
Before the annotation data were loaded into the database system, information
from the various software programs was integrated and the results were
converted into either GenBank or GFF format (see below). A manual
validation step was introduced at this stage. The data storing procedure is
described in Section 2.3.4.
2.3.2 Assembly process
The Phred/Phrap/Consed package (version 0.99.03.19) is one of the most
frequently used software sets for trace file base calling, contig assembly
and contig editing [83, 84, 112].
Base calling
The purpose of base calling is to determine the nucleotide sequence on
the basis of the multi-colour peaks in the sequence trace. Because traces
(and regions within a trace) are of variable quality, the fidelity of the
“called” nucleotides is also variable. The accuracy of each called base is
measured by its base quality value. Phred takes a trace file as input and
provides these base quality values to help evaluate sequence accuracy
realistically. It computes a probability p
of an error in the base call at each position, and converts this to a quality
value q using the transformation q = −10 × log10(p). Thus a quality of
30 corresponds to an error probability of 1/1000, a quality of 40 to an
error probability of 1/10000, etc.
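The transformation between error probability and quality value can be checked with a few lines of Python. This is a minimal sketch of the formula q = −10 × log10(p) and its inverse; the function names are hypothetical and the code is not taken from Phred.

```python
import math


def error_prob_to_quality(p):
    """Convert a base-call error probability p into a quality value q."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Inverse transformation: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given its quality values."""
    return sum(quality_to_error_prob(q) for q in qualities)
```

For example, a read of 500 bases that all have quality 20 is expected to contain about 5 miscalled bases (expected_errors([20] * 500)), whereas the same read at quality 30 is expected to contain about 0.5.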
Vector clipping
Use the cross_match alignment program to compare each read in the
fasta-format file generated by base calling to a fasta database of cloning
and sequencing vectors (vector.fasta). The sequence of the cloning vector
used (the pUC18 plasmid sequence in our case) was added to the vector
sequence database. On the bioinfo server, the vector sequence database is
located at /db/univec/UNIVEC/UniVec or /pgm1/phrap/vector.seq. The
example command line for clipping CLONE.fasta is:
% cross_match -minmatch 12 -penalty -2 -minscore 20 -screen CLONE.fasta /db/univec/UNIVEC/UniVec
The -screen option tells cross_match to produce another fasta file,
CLONE.fasta.screen, nearly identical to CLONE.fasta, except that recognised
vector sequences are replaced by X (or x, according to the original
capitalisation).
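The effect of the -screen option on a single read can be sketched as follows. The function below only reproduces the masking step described above; it assumes that the vector matches have already been found and are supplied as half-open (start, end) intervals, which is a simplification made for illustration. The function name and the example read are hypothetical.

```python
def mask_vector(read, vector_hits):
    """Replace bases inside each (start, end) interval with X or x, keeping case.

    Intervals are 0-based and half-open; positions outside the read are ignored.
    """
    bases = list(read)
    for start, end in vector_hits:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = "x" if bases[i].islower() else "X"
    return "".join(bases)


# A made-up read with vector sequence assumed at both ends.
TOY_READ = "ACGTacgtACGT"
TOY_MASKED = mask_vector(TOY_READ, [(0, 4), (10, 12)])
```

Here TOY_MASKED is "XXXXacgtACXX": the read keeps its original length, so the companion base quality file still lines up with the screened sequence when it is passed to the assembler.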
Sequence assembly
Assemble the vector-clipped reads to reconstruct the clone sequence, using
the Phrap sequence assembler. The program takes as input a fasta-format
file of sequence fragments and a companion base quality file, and
constructs the contig sequence as a mosaic of the highest quality parts of
the reads. Run the assembly program using the command line:
% phrap -new_ace CLONE.fasta.screen > phrap.out
As a result, Phrap creates a number of files. The most important ones are
CLONE.fasta.screen.contigs (assembly consensus sequence in fasta
format), CLONE.fasta.screen.contigs.qual (assembly consensus base quality
values assigned by Phrap), and CLONE.fasta.screen.ace (a
complicated-looking file that enables one to view the result of the
assembly in the Consed assembly viewer/editor program).
In the file CLONE.fasta.screen.contigs.qual, Phrap provides quality
information about the assembly (i.e., quality values for the contig
sequence) by generating its own quality measures (based on read-read
confirmation). This process appears to be rule-based, and few references
describe it. For example, if all input quality values (given by Phred) are
relatively small (less than 15), Phrap assumes that they do not correspond
to error probabilities and attempts to rescale them so that the largest
quality value is approximately 30; in contrast, if the input quality values
are relatively high (≥ 40), Phrap may give a base in the contig (the
consensus of more than one read base) a higher quality value such as 90.
After contig assembly, for a contig of length n, the average quality value
is given by:
$$ \bar{q} = \frac{1}{n} \sum_{i=1}^{n} q_i , $$
where q_i is the quality value of base i in the contig and n is the number
of bases in the contig.
2.3.3 Gene finding
One of the main aims of the annotation pipeline is to aid in the
identification of protein-coding genes. This can be done by using a
gene-finding program to predict gene models (ab initio gene finding), or by
predicting possible genes based on the similarity of the sequence to other
sequences, particularly other identified sequences. I used both of these
approaches as follows. Ab initio gene predictions were performed using
FGENESH (SoftBerry). The automated gene prediction pipeline was hosted on
the bioinfo server at the Computer Center, HKU. The original predictions
were manually refined with assistance from GenomeScan, another gene
prediction program, which combines sequence similarity and exon-intron
composition (i.e., the two distinct types of evidence used by these classes
of methods) into one integrated algorithm.
Evaluation of gene recognition accuracy
The predictive accuracy of a gene-finding program is evaluated by comparing
the exons predicted by the program with the actual coding exons
at nucleotide level and exon level [31]. For nucleotide level accuracy,
define the values TP (true positives), TN (true negatives), FP (false
positives), and FN (false negatives) as follows: TP = the number of
coding nucleotides predicted as coding; TN = the number of noncoding
nucleotides predicted as noncoding; FP = the number of noncoding
nucleotides predicted as coding; FN = the number of coding nucleotides
predicted as noncoding, then sensitivity as the proportion of coding
nucleotides that are correctly predicted as coding:
$$ Sn = \frac{TP}{TP + FN}, $$
and specificity as the proportion of nucleotides predicted as coding that
are actually coding:
$$ Sp = \frac{TP}{TP + FP}. $$
For exon level accuracy, the formulas for exon level sensitivity (ESn) and
specificity (ESp) are:
$$ ESn = \frac{TE}{AE}, \qquad ESp = \frac{TE}{PE}, $$
where TE (true exons) is the number of exactly predicted exons and AE
and PE are the numbers of annotated and predicted exons, respectively.
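The four measures defined above can be computed directly from an annotation and a prediction. The sketch below assumes a simple encoding chosen for illustration: at nucleotide level each position is labelled 'C' (coding) or 'N' (noncoding), and at exon level each exon is a (start, end) tuple that counts as true only if both boundaries match exactly. The function names and toy data are hypothetical.

```python
def nucleotide_accuracy(actual, predicted):
    """Return (Sn, Sp) for two equal-length label strings of 'C' and 'N'."""
    if len(actual) != len(predicted):
        raise ValueError("label strings must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(a == "C" and p == "C" for a, p in pairs)  # coding predicted as coding
    fp = sum(a == "N" and p == "C" for a, p in pairs)  # noncoding predicted as coding
    fn = sum(a == "C" and p == "N" for a, p in pairs)  # coding predicted as noncoding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


def exon_accuracy(annotated, predicted):
    """Return (ESn, ESp); an exon is true only if both boundaries match exactly."""
    annotated, predicted = set(annotated), set(predicted)
    te = len(annotated & predicted)
    esn = te / len(annotated) if annotated else 0.0
    esp = te / len(predicted) if predicted else 0.0
    return esn, esp


# Toy example: the first predicted exon is shifted by one base, the second is exact.
TOY_ACTUAL = "NNCCCCNNNNCCCNN"
TOY_PREDICTED = "NNNCCCCNNNCCCNN"
TOY_ANNOTATED_EXONS = [(2, 6), (10, 13)]
TOY_PREDICTED_EXONS = [(3, 7), (10, 13)]
```

In this toy example the nucleotide-level values are both 6/7, while the exon-level values are both 1/2, which illustrates why exon-level accuracy is the stricter measure: a boundary that is off by a single base costs one nucleotide but a whole exon.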
Combining predictions from two gene-finding programs
Gene-finding programs are still unable to provide automatic gene discovery
with the desired correctness. The benefits of combining predictions
from more than one existing gene prediction program have been
explored [268]. Therefore, methods for combining the predictions of two
programs, GenScan and HMMgene, were used in the prediction of P. marneffei
genes, in an attempt to improve the exon level accuracy of gene finding by
identifying more probable exon boundaries and by eliminating false
positive exon predictions. The scripts implementing these methods were
obtained from http://www.cs.ubc.ca/labs/beta/genefinding/. Note
that at the time this combined prediction study was conducted, the
gene-finding program FGENESH was not yet available. A later retrospective
test was, however, conducted by combining FGENESH with either GenScan or
HMMgene.
2.3.4 Database and databank to store results
The first step in database design is to decide what the database will be
used for and how users will interact with it. Once these are defined, the
data to be stored and the associations among them are defined using a
conceptual data model. The model is
independent of how the information will be stored in the final, physical
implementation on the computer. Entities, such as gene, contig and gene
product, are defined to represent informally concepts from the real
world. The relationships between these concepts were also defined: for
example, a contig contains more than one gene, and generally one gene
produces one gene product. A formal language, Unified Modelling
Language (UML), was used for specifying both use cases and conceptual
data models.
The next step is the physical implementation of the data model, for which a
database management system (DBMS) has to be selected. Here I used
Microsoft Access, a relational database manager running on the Windows
operating system. It is available in our departmental facilities and is
quite powerful and efficient for medium-size database management. It
has straightforward web-publication capabilities and intuitive capabilities
for building graphical user interfaces. Administrators of the database work
through the application interface, while users interact with the database
through a web interface. Physical implementation of the conceptual data
model was mediated by the database schema (Fig. 2.3).
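The relationships named in the conceptual model (a contig contains several genes; a gene generally has one gene product) can be sketched as a small relational schema. The example below uses Python's built-in sqlite3 module purely as a stand-in for Microsoft Access, and every table name, column name and record in it is a hypothetical illustration; the actual PMGD schema is the one shown in Fig. 2.3.

```python
import sqlite3

# Hypothetical schema: one contig has many genes, one gene has at most one product.
SCHEMA = """
CREATE TABLE contig (
    contig_id TEXT PRIMARY KEY,
    length    INTEGER NOT NULL
);
CREATE TABLE gene (
    gene_id   TEXT PRIMARY KEY,
    contig_id TEXT NOT NULL REFERENCES contig(contig_id),
    start_pos INTEGER NOT NULL,
    end_pos   INTEGER NOT NULL,
    strand    TEXT CHECK (strand IN ('+', '-'))
);
CREATE TABLE gene_product (
    gene_id     TEXT PRIMARY KEY REFERENCES gene(gene_id),
    description TEXT
);
"""


def build_demo_db():
    """Create an in-memory database populated with made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO contig VALUES ('contig_demo', 50000)")
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
        [("gene_demo_1", "contig_demo", 100, 1500, "+"),
         ("gene_demo_2", "contig_demo", 2000, 3200, "-")],
    )
    conn.execute("INSERT INTO gene_product VALUES ('gene_demo_1', 'hypothetical protein')")
    return conn


def genes_on_contig(conn, contig_id):
    """List the gene identifiers on one contig, ordered by start position."""
    rows = conn.execute(
        "SELECT gene_id FROM gene WHERE contig_id = ? ORDER BY start_pos",
        (contig_id,),
    )
    return [row[0] for row in rows]
```

Using gene_id as the primary key of gene_product encodes the "one gene, one product" rule in the schema itself; a model that had to accommodate alternative splicing would need a separate product identifier instead.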
Large-scale data that are to be made accessible to the community
should be well curated, annotated and documented, and appropriately
formatted for publication. At present, no universally accepted standard
data format exists for genomics data. Here, I adopted GFF
(http://www.sanger.ac.uk/Software/formats/GFF) and GenBank format to
transfer information to and from public databases and applications. The
database was populated using Perl scripts written with ActiveState Perl
version 5.6 for Windows (downloaded from http://www.activestate.com) and
the Bioperl modules (obtained from http://www.bioperl.org).
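To show what a GFF record looks like in practice, the sketch below parses one tab-delimited GFF line into its nine standard fields (seqname, source, feature, start, end, score, strand, frame and an optional attributes column). It is written in Python for illustration only, whereas the scripts used in this work were written in Perl with Bioperl; the function name and the example line are hypothetical.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Parse one GFF line; return None for blank lines and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a GFF line needs at least 8 tab-separated fields")
    fields += [""] * (9 - len(fields))  # the attributes column is optional
    seqname, source, feature, start, end, score, strand, frame, attributes = fields[:9]
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),                    # 1-based, inclusive coordinates
        None if score == "." else float(score),  # '.' means no score
        strand, frame, attributes,
    )


# Made-up example line; the identifiers are placeholders, not PMGD data.
TOY_GFF_LINE = "contig_demo\tFGENESH\tCDS\t1050\t1500\t12.5\t+\t0\tgene_demo_1\n"
TOY_RECORD = parse_gff_line(TOY_GFF_LINE)
```

Because each feature occupies one line with fixed columns, GFF files can be filtered and merged with ordinary text tools, which is one reason the format suits exchange between the programs in a pipeline.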
2.3.5 Perl source code collection
In the annotation pipeline, a sequence of analysis steps, each using
different tools, must be carried out one after the other. The challenge lay
in the absence of defined standards for the input and output of the
different tools. Because there is no explicit ‘contract’ between the
various tools as to what input and output formats each will support, at any
time one of the tools in the pipeline may change the format of its input or
output (breaking the system). To connect multiple tools together ‘smoothly’
and ‘robustly’, special ‘glue code’ has been written, mostly in Perl.
The collection of Perl scripts, organised into several modules, is available
at the PMGD website.
2.3.6
Genome browser configuration
Visualisation of genomic information is not merely a matter of aesthetics. It is of practical use because a graphical display conveys more meaning to people than reading raw 'cipher text'. Three of the most prominent genome browsers are the Ensembl Genome Browser (http://www.ensembl.org/) by the European Bioinformatics Institute and the Sanger Institute, the Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/) by the National Center for Biotechnology Information and the UCSC Genome Browser (http://genome.ucsc.edu/) by the University of California Santa Cruz Genome Bioinformatics Group. They are highly specialised for their particular data types and information. Most genome browsers can work either online or offline. They are usually developed in Perl, Java or other high-level languages.
PMGD incorporates two free but powerful genome browsers, Argo (a Java applet; Fig. 2.2) and GBrowse (Generic Genome Browser), in order to organise and annotate genomic data. GBrowse (http://www.gmod.org/) combines a database and interactive web pages for manipulating and displaying annotations on genomes. Setting it up requires three steps: installation, configuration and customisation. Installation is straightforward when the instructions are followed. Configuration and customisation are both done through a configuration file. The host machine is equipped with a Pentium III processor running at 800 MHz and 128 MB of main memory. ActivePerl, BioPerl and the Apache web server must be installed. There is an advanced option for choosing between the 'in-memory' database and the relational database MySQL for storing the sequence and annotation information. For a genome of the size of that of P. marneffei, the 'in-memory' architecture is sufficient. Sequence files (in FASTA format) and annotation files (in GFF format) are stored under '$HTDOCS/gbrowse/databases' in the Apache web server directory. The configuration file (.conf) defining the settings is stored in '$CONF/gbrowse.conf'. GBrowse is highly customisable. For example, administrators can use different colours or shapes to represent exons, introns and other genetic elements. More sophisticated functions, such as the display of different reading frames, transcription profiles, ESTs and alignments, are also provided. Administrators can freely customise the browser by switching these functions on or off and altering the default settings so that it better fits the purposes of a particular database.
2.3.7
Synteny identification
To perform synteny analyses, amino acid identity between P. marneffei and A. nidulans (or other fungi) was first determined by comparing the predicted proteins from each fungus using BLASTP. Putative ortholog pairs were predicted using the INPARANOID program [261]. These pairs were aligned using ClustalW and the amino acid per cent identity for each pair was calculated. If the alignment spanned 60% of both genes and the alignment score was within 80% of the top score for either gene of the pair, the pair was accepted. Using these putative ortholog pairs, supercontigs were compared with the ADHoRe program [322] (r² cutoff = 0.8, maximum gap size = 35 genes, minimum number of pairs = 3). Results were filtered such that the maximum probability for a segment to be generated by chance was < 0.01.
Figure 2.2: PMGD genome browser.
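The acceptance rule for putative ortholog pairs can be written down as a small predicate. The sketch below is one possible reading of the criteria, not the original Perl code: it interprets "within 80% of the top score" as the pair's score being at least 0.8 times the best score recorded for either gene, and the function and parameter names are hypothetical.

```python
def accept_pair(aligned_len_a, gene_len_a, aligned_len_b, gene_len_b,
                pair_score, top_score_a, top_score_b,
                min_span=0.60, min_score_ratio=0.80):
    """Return True if a putative ortholog pair passes both filters.

    The alignment must cover at least `min_span` of both genes, and the
    pair's score must reach `min_score_ratio` of the top score of at least
    one of the two genes (an assumed interpretation of the rule).
    """
    spans_both = (aligned_len_a / gene_len_a >= min_span
                  and aligned_len_b / gene_len_b >= min_span)
    near_top = (pair_score >= min_score_ratio * top_score_a
                or pair_score >= min_score_ratio * top_score_b)
    return spans_both and near_top
```

Keeping the two thresholds as keyword arguments makes it easy to test how sensitive the resulting synteny blocks are to the cut-offs.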
2.4
Results
2.4.1
Statistics of assembly
As mentioned in Section 1.3.2, all inserts were sequenced from both ends to generate paired reads. These paired sequence fragments were assembled using the Phrap package of assembly tools [84], yielding a draft assembly. Of the reads, 98.35% were assembled, and the assembly was reconstructed in 273 supercontigs (2,911 contigs). The longest contig is 178,730 bp and the longest supercontig is 729,276 bp. The fidelity of the assembly is supported by the high proportion (80.50%) of plasmid-end pairs preserved in contigs and scaffolds. The net length of the assembled contigs totalled 28.98 Mbp, including the mitochondrial genome of ∼ 35 kbp (Table 2.2).
Table 2.2: Summary of assembly statistics.

Features                                   Value
Read
  Total Number of Reads Sequenced          315,580
  Number of Bases in Total Reads           173,664,505 bp
  Average Read Length                      550.20
  Number of Confirmed Reads (by Phrap)     310,365
  Fraction of Reads Assembled              98.35%
  Fraction of Reads Paired in Assembly     80.50%
  Number of Bases Used in Assembly         170,951,774 bp
  Average Shotgun Coverage                 6.6 fold (Phrap report)
Contig
  Total Number of Contigs                  2,911
  Number of Bases in Contigs               28,977,603 bp
  Longest Contigs                          178,730 bp
  Average Length of Contigs                9,955 bp
Supercontig (scaffold)
  Total Number of Supercontigs             273
  Number of Bases in Supercontigs          28,421,390 bp
  Longest Supercontigs                     729,276 bp
  Average Length of Supercontigs           104,110 bp

2.4.2
Genome size estimation
The genome size was approximated from the draft assembly by estimating the size of the gaps between contigs and between scaffolds. As shown in Table 2.2, the total number of bases is 28.42 Mb in supercontigs and 28.98 Mb in contigs. These totals do not include gaps. Within a supercontig, the gaps between its constituent contigs are called within-supercontig gaps. As mentioned in Section 1.3.2, two sequencing clone libraries were constructed, carrying insert sizes of 2.0 – 3.0 kb and 7.5 – 8.0 kb, respectively. For each pair of reads in contigs flanking a gap, the library of origin was identified, so the size of the gap between adjacent contigs in a supercontig can be derived from the size of the clones spanning it. When estimated gap sizes are included, the total physical length of all scaffolds is estimated to be 29.8 – 30.5 Mb. The gaps between supercontigs are called between-supercontig gaps. Their size is hard to estimate since no spanning clones are available. In addition, these gaps include difficult-to-sequence regions of the genome, including the ribosomal DNA (rDNA) repeats, centromeres and telomeres. Taking these considerations into account, the genome size is estimated to be ∼ 31 Mb.
When sequencing is still at a stage of relatively low coverage, there is a 'dynamic' way to estimate genome size by applying the Lander-Waterman mathematical model. Assuming there is no cloning bias, the DNA fragments generated in the shotgun sequencing process are distributed along the chromosome according to a Poisson distribution [92]. The unsequenced fraction of a genome (double-strand) is:
\[ p = e^{-nw/L} \]
where n is the number of reads, w is the average length of reads and L is the length of the genome. For a 20 Mb genome, about 120,000 reads of 500 bp would theoretically be required to produce about 95% (p = 0.05) coverage.
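The worked figure quoted above can be checked numerically. The following is a small sketch of the Poisson coverage formula; the function names are hypothetical, and the second function simply solves the same formula for n.

```python
import math


def unsequenced_fraction(n_reads, read_len, genome_len):
    """Expected fraction of the genome not covered by any read: p = exp(-n*w/L)."""
    return math.exp(-n_reads * read_len / genome_len)


def reads_needed(target_unsequenced, read_len, genome_len):
    """Number of reads required to leave at most `target_unsequenced` uncovered."""
    return math.ceil(-genome_len * math.log(target_unsequenced) / read_len)


# 120,000 reads of 500 bp on a 20 Mb genome give n*w/L = 3, so p = exp(-3).
p = unsequenced_fraction(120_000, 500, 20_000_000)
n = reads_needed(0.05, 500, 20_000_000)
```

Here p evaluates to roughly 0.05 and n to just under 120,000 reads, in agreement with the figures in the text.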
The number of unsequenced regions on both strands generates the same number of contigs, N, which can be calculated as:
\[ N = n e^{-nw/L} \]
For the total sequence data obtained (about 60 Mb of reads), there were 119,744 reads in total with a mean length of 511 bp. Assembly with Phrap generated 13,861 contigs. Therefore, n = 119,744, w = 511 and N = 13,861, and the genome size can be calculated as follows:
\[ L = -\frac{nw}{\ln(N/n)} \approx 28{,}377{,}000 \]
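The estimate can be reproduced directly from the three quantities given in the text. The sketch below rearranges N = n exp(-nw/L) for L; the function name is hypothetical.

```python
import math


def estimate_genome_size(n_reads, mean_read_len, n_contigs):
    """Solve N = n * exp(-n*w/L) for the genome length L."""
    return -n_reads * mean_read_len / math.log(n_contigs / n_reads)


# Values reported in the text: n = 119,744 reads, w = 511 bp, N = 13,861 contigs.
genome_size = estimate_genome_size(119_744, 511, 13_861)
```

With these inputs the result is approximately 28.4 Mb, matching the value quoted above to the precision given.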
In practice, the number of contigs is higher than the theoretical expectation, since Phrap needs a minimum overlap of nucleotides to link two reads together when assembling fragments. These overlap regions do not contribute to the actual coverage, but the calculation treats them as if they did. Another factor is the bias due to cloning difficulties [186].
2.4.3
Accuracy of gene finding
The purpose of evaluating gene recognition accuracy was to select the best gene-finding program. A test data set composed of 103 Penicillium protein-coding genes that contain multiple exons was built. Our results show that FGENESH gives the most accurate prediction overall. With it, we can identify ∼ 90% of coding nucleotides with 12% false positives. It provides sensitivity (Sn) = 96% and specificity (Sp) = 89% at the base level, Sn = 92% and Sp = 84% at the exon level, and Sn = 85% and Sp = 67% at the gene level.
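For readers unfamiliar with these measures, the sketch below shows how base-level Sn and Sp can be computed. It assumes the convention common in the gene-finding literature, Sn = TP/(TP+FN) and Sp = TP/(TP+FP), which the text does not state explicitly; the function name and the toy coordinates are hypothetical.

```python
def base_level_accuracy(true_coding, predicted_coding):
    """Return (Sn, Sp) at the base level.

    Both arguments are iterables of nucleotide positions annotated or
    predicted as coding. Sn = TP / (TP + FN); Sp = TP / (TP + FP).
    """
    true_set, pred_set = set(true_coding), set(predicted_coding)
    tp = len(true_set & pred_set)   # coding bases correctly predicted
    fn = len(true_set - pred_set)   # coding bases missed
    fp = len(pred_set - true_set)   # non-coding bases predicted as coding
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, sp


# Toy example: a 100-base exon predicted with a 10-base shift.
sn, sp = base_level_accuracy(range(100, 200), range(110, 210))
```

In the toy example 90 of the 100 true coding bases are recovered and 90 of the 100 predicted bases are correct, so both measures equal 0.9.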
2.4.4
Combination of gene finding
Gene recognition accuracy may be improved by combining predictions
from two gene-finding programs. Rogic et al. [268] implemented a series
of algorithms combining gene prediction from two existing gene finding
systems, GenScan and HMMgene. The combined algorithms were tested
on the HMR195 sequence dataset and generated improved accuracy at
both the nucleotide and exon levels, where the average improvement was
7.9% compared to the best result obtained by GenScan or HMMgene
alone.
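One simple way to combine exon predictions from two programs is to take the intersection or the union of the predicted exons. The sketch below is a generic illustration under that assumption and is not the specific set of algorithms implemented by Rogic et al.; the function name and the exon coordinates are hypothetical.

```python
def combine_exons(pred_a, pred_b, mode="intersection"):
    """Combine two exon predictions given as (start, end) tuples.

    'intersection' keeps only exons predicted identically by both programs;
    'union' keeps every exon predicted by either program.
    """
    a, b = set(pred_a), set(pred_b)
    if mode == "intersection":
        combined = a & b
    elif mode == "union":
        combined = a | b
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(combined)


# Hypothetical predictions from two programs for the same sequence.
program_1 = [(100, 250), (400, 520), (700, 910)]
program_2 = [(100, 250), (400, 530), (700, 910)]
both = combine_exons(program_1, program_2, "intersection")
either = combine_exons(program_1, program_2, "union")
```

The intersection can never add an exon that neither program predicted, so it trades sensitivity for specificity; the union does the reverse.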
In order to identify the most accurate gene prediction system for P. marneffei, I conducted an evaluation study to compare GenScan, HMMgene and the combined gene prediction system based on them. The improved accuracy obtained with the combined algorithm in Rogic's study was not observed in our study, in which we used a dataset of 103 sequences with known genes from Penicillium species. Our results show that GenScan tends to give a significantly better prediction than either of the other systems. At the nucleotide level, the sensitivity was 95% for GenScan, 89% for HMMgene and 92% for the combined algorithm.
Two considerations arise from the discouraging result obtained when the combined algorithm was applied to the dataset from Penicillium species. Firstly, the different performance of the combined algorithm in our study and in Rogic's is most likely caused by the difference in organisms: the HMR195 dataset used in Rogic's study is composed of 195 human, mouse and rat sequences. Secondly, if two systems generate consistent predictions, whether good or bad, then combining them will not give better results. For the human and rodent dataset, GenScan and HMMgene performed differently, but neither was always superior to the other. When GenScan and HMMgene were applied to our dataset of sequences from Penicillium species, however, GenScan always generated significantly better results than HMMgene. Combining gene-finding systems does not help if one system is always superior.
As mentioned, FGENESH was not available at the time the gene combination test was conducted. A later retrospective test indicated that no improvement can be obtained by combining FGENESH with either GenScan or HMMgene (data not shown). Consequently we decided to use FGENESH alone to perform the gene prediction for this project.
2.4.5
Database and databank to store results
The physical deployment of the P. marneffei genome database differs from that of the annotation pipeline, which is hosted on a SUN Solaris server at the Computer Center, HKU. PMGD is located on a Windows 2000 based system at the Department of Microbiology, HKU, which is accessible as a workstation for administrators and as a web service system for general users.
2.5
Discussion
High-throughput DNA sequencing now offers a rapid and cost-effective approach to obtaining the most important and relevant of all genetic information: the complete DNA sequence of an organism. As the quantity of data increases for a genome project such as that of P. marneffei, researchers have to become more sophisticated about data management issues. This study developed such a system for the P. marneffei genome project. The system performs semi-automatic assembly analysis, gene prediction and analysis, and extragenic region analysis. In order to be compatible with the computer systems available at the Department of Microbiology, HKU, the system was designed to span multiple working environments and to integrate several public-domain and newly developed software programs capable of dealing with several types of databases. Our PMGD solution proves to be a feasible way to handle the information and to manage large quantities of data internally or for public use. The genome sequence was searched against the public protein databases using BLAST. Genes were predicted using FGENESH and adjusted manually by reference to GenomeScan. FGENESH was selected as the best predictor from a number of gene-calling programs validated against a test set of 103 previously characterised Penicillium protein-coding genes.
Ab initio gene finding is challenging in P. marneffei for two reasons. The first is the lack of a training dataset. Training a gene-finding program normally requires more than 300 genes in order to reach statistical power, but not enough characterised genes are available for P. marneffei. The second is the lack of cDNA, which is very useful for confirming initial gene predictions. Identifying genes that lack available cDNA sequence requires other methods, such as interspecies homolog searches. We do have a small number of RST sequences available [364] but, owing to their poor sequence quality, they are not helpful. Our solution to this problem was to apply a pre-existing gene-finding program, namely FGENESH. Generally speaking, if one uses a pre-existing gene-finding program on a newly sequenced organism, one expects inaccurate predictions. However, our evaluation shows that FGENESH trained with an A. nidulans dataset produced satisfactory results when applied to P. marneffei. This is due to the close phylogenetic relationship between the two species. We also tried to combine predictions made by more than one gene prediction system, an approach that has been proposed to significantly improve gene prediction accuracy. Unfortunately, because FGENESH is consistently better than any other gene-finding program available, we did not observe such an improvement after combination.

Figure 2.3: Database schema of PMGD. [Schema diagram not reproduced. Names legible in the diagram include AUTHOR, JOURNAL, REFERENCE, CATEGORY, BLAST_PROGRAM, BLASTP, BLASTX, SCAFFOLD, CONTIG, GENE, GENE_ALIAS, GENE_PRODUCT, GENE_FUNCTION, FUNCTION_EVIDENCE, PROTEIN, PROTEIN_INFO, PROTEIN_INTERPRO, INTERPRO, HOMOLOG, ORTHOLOG, PATHWAY, GO_EVIDENCE, GO_GENE_GOEV, SGD_GO, SGD_GENENAME and SGD_ESSENTIAL_ORF.]
Further directions can be envisaged on the basis of the current stage of the system. Firstly, one of the striking characteristics of the genomes of eukaryotic organisms is the existence of multigene families. This confounds the identification of orthologous relationships among genes in interspecies comparisons. To discriminate between orthologs and paralogs, more sophisticated algorithms are required. These algorithms should take phylogenetic information into account and integrate it into the protein prediction system. Secondly, when assigning a function to a protein, a controlled vocabulary should be applied across all organisms. The recent development of the Gene Ontology [9] project has produced a dynamic controlled vocabulary that can cope with the ever accumulating and changing knowledge of gene and protein functions. Thirdly, the more function prediction systems develop, the more important the evaluation of their accuracy becomes. Iliopoulos (2002) established a scoring scheme to measure the performance of prediction systems [143]. Despite this, considerable concerns are still raised regarding the accuracy of assignment and the reproducibility of methodologies. An evaluation of the performance of these systems is still missing at this stage.
In summary, modern biology has created an information explosion. The areas of whole-genome sequencing and functional genomics have produced a prodigious amount of data, and the P. marneffei genome project is no exception. This study provided a solution by offering an annotation pipeline that links various biological software tools in a systematic way, as well as a state-of-the-art database management system for storing and retrieving biological sequence data. It has been successfully applied to the day-to-day work of annotating the most important thermally dimorphic fungus.
Chapter 3
MITOCHONDRIAL GENOME OF PENICILLIUM
MARNEFFEI
The work described in this chapter is very closely based on a paper I have published with colleagues [353].
3.1
Introduction
Mitochondria are the power centres of the cell. They are generally the major sites of aerobic respiration and the energy production centre in fungi, providing the energy a cell needs to move, divide, produce secretory products and contract. They are small, oval-shaped, membrane-bound organelles, about the size of a bacterium, surrounded by highly specialised double membranes. The outer membrane is fairly smooth, but the inner membrane, where oxidative phosphorylation takes place, is highly convoluted. The two membranes form two compartments, the intermembrane space and the matrix. The reactions of the citric acid cycle and fatty acid oxidation occur in the matrix.
Mitochondria maintain their own genomes, and a number of mitochondrial genome sequences have become available. At present, the NCBI organelle genome resource maintains a collection of 350 completed mitochondrial genomes from different organisms, including 256 metazoans, 15 fungi, 9 plants and 22 others. The number is subject to change as sequencing efforts advance. The gene content of mitochondrial genomes is generally well conserved. In metazoans, for example, the mitochondrial genomes are generally circular, about 16 kb long, and encode three primary transcript types (13 proteins used for energy production, two rRNAs and 22 tRNAs). The homologous genes present in the mitochondria of plants, protists, fungi and animals, and in the genomes of prokaryotes, make it possible to undertake inter-species gene comparisons. Next I review the major components of the respiratory pathway of fungal mitochondria.
The common and invariant feature of respiratory pathways of mitochondria is production of ATP coupled to electron transport. The
respiratory chain begins with electrons being transferred from NADH to
complex I (NADH:ubiquinone oxidoreductase) or from the tricarboxylic
acid cycle intermediate succinate to complex II (succinate:ubiquinone
oxidoreductase). Electrons are transferred via ubiquinones, complex III
(ubiquinol:cytochrome c oxidoreductase), cytochrome c, complex IV (cytochrome c oxidase) and finally to molecular oxygen to give water (Fig.
3.1).
Complex I is composed of peptides encoded by both nuclear and mitochondrial genes (more than 25 nuclear genes and seven mitochondrially encoded genes, nad1, 2, 3, 4, 4L, 5 and 6), forming a large multisubunit complex that spans the inner mitochondrial membrane. Note that a few fungi, such as Saccharomyces cerevisiae and Schizosaccharomyces pombe, lack complex I, and many fungi have additional components, such as alternative NADH dehydrogenases and/or an alternative terminal oxidase (see review [152]). Complex III contains nine subunits, of which only apocytochrome b is encoded in the mitochondrion. Between complexes III and IV, cytochrome c resides in the intermembrane space and passes electrons. Cytochrome c is encoded by the nuclear cyc-1 gene. Complex IV contains 7-8 polypeptides, of which three are encoded in the mitochondrion (cox1, cox2 and cox3). It is the terminal oxidase of the standard respiratory pathway. Complex V is the mitochondrial ATP synthase, which includes subunits encoded by the ATP synthase subunit genes atp6 and atp8.
Since several mitochondrial complexes have subunits encoded in both the mitochondrial and nuclear genomes, the coordinated expression of genes encoded in the nucleus and mitochondrion is critical for mitochondrial function. These complexes include not only the large respiratory complexes mentioned above, but also the translational machinery, which involves nuclear-encoded polypeptides and mitochondrially encoded rRNAs and tRNAs [240]. The nuclear and mitochondrial genomes therefore both contribute essential subunit polypeptides to important mitochondrial proteins, and they collaborate in the synthesis and assembly of these proteins (for review, see [256]).
Figure 3.1: Fungal respiratory pathways. The diagram was downloaded from http://pages.slu.edu/faculty/kennellj
In this chapter I report the complete sequence of the mitochondrial genome of Penicillium marneffei, the first complete mitochondrial DNA sequence of a thermally dimorphic fungus. This 35 kb mitochondrial genome contains the genes encoding ATP synthase subunits 6, 8, and 9 (atp6, atp8, and atp9), cytochrome oxidase subunits I, II, and III (cox1, cox2, and cox3), apocytochrome b (cob), reduced nicotinamide adenine dinucleotide ubiquinone oxidoreductase subunits (nad1, nad2, nad3, nad4, nad4L, nad5, and nad6), the ribosomal protein of the small ribosomal subunit (rps), 28 tRNAs, and the small and large ribosomal RNAs. Analysis of gene contents, gene orders, and gene sequences revealed that the mitochondrial genome of P. marneffei is more closely related to those of moulds than to those of yeasts.
3.2
Materials and Methods
3.2.1
Library construction and sequence assembly
The P. marneffei mitochondrial genome was sequenced as part of the P. marneffei whole genome sequencing project as described in Chapters 1 and 2. A genomic DNA (including mitochondrial DNA) library was made in pUC18 carrying insert sizes from 2.0 to 8.0 kb. DNA inserts were prepared by physical shearing using the sonication method. The work above was done by my colleagues in the Department of Microbiology, HKU and the Beijing Genomics Institute. I used the Phred/Phrap/Consed software package for base calling, contig assembly and assembly quality assessment [83, 84, 112]. The complete mitochondrial DNA genome was generated from the assembly of 467 successful sequence reads (100 bp at Phred value Q20 [112, 243]), which corresponded to an overall mitochondrial genome coverage of about 7×.
3.2.2
Mitochondrial DNA sequence annotation
The putative ORFs in P. marneffei mitochondrial DNA were identified using Artemis, a free sequence viewer and annotation tool, with the mould genetic code. Genes in which the putative ORFs were located were functionally assigned through BLASTP searches against fungal mitochondrion-encoded proteins available in the GenBank database. Introns and rRNAs were mainly identified by BLASTN pairwise comparison of P. marneffei mitochondrial DNA with the mitochondrial DNAs of Aspergillus nidulans, Neurospora crassa, Saccharomyces cerevisiae (Acc. NC_001224), Schizosaccharomyces pombe (Acc. NC_001326), Podospora anserina (Acc. NC_001329), Allomyces macrogynus (Acc. NC_001715), Pichia canadensis (Acc. NC_001762), Candida albicans (Acc. NC_002653), Yarrowia lipolytica (Acc. NC_002659), and Candida glabrata (Acc. NC_004691) [29, 91, 101, 354, 175, 262]. The BLASTN results were viewed through ACT, a DNA sequence comparison viewer based on Artemis [40], and exon and intron boundaries were adjusted manually. The tRNAs were predicted by tRNAscan-SE 1.21 [207]. The core structures of the group I introns were inferred by the program CITRON [200].
3.2.3
Phylogenetic analysis
Phylogenetic analysis was performed by using MBEToolbox as described
in Chapter 10. The 11 genes that encode subunits of respiratory chain
complexes (cox1, cox2, cox3, cob, nad1, nad2, nad3, nad4, nad4L, nad5,
and nad6 ) and the three that encode ATPase subunits (atp6, atp8, and
atp9 ) in the P. marneffei mitochondrial genome and the corresponding
genes in 24 other fungi with completed mitochondrial genomes were used
to determine the phylogenetic relationships of P. marneffei to the other
fungi. Phylogenetic trees were constructed using unambiguously aligned
portions of concatenated amino acid sequences of these 14 protein coding genes by the maximum likelihood method in the Phylip package [86].
The corresponding nad genes are not present in Schizosaccharomyces
japonicus, Schizosaccharomyces octosporus, S. pombe, C. glabrata, Saccharomyces castellii, Saccharomyces servazzii, and S. cerevisiae, and the
maximum likelihood method is not as sensitive to a lack of sequence information as the distance methods. A total of 3,462 amino acid positions
were included in the analysis.
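Building a concatenated alignment when some taxa lack some genes can be sketched as below. Whether the original analysis padded absent genes with gap characters is not stated in the text, so that choice, the function name and the toy sequences are assumptions for illustration only.

```python
def concatenate_alignments(gene_alignments, taxa, gene_order):
    """Concatenate per-gene alignments into one sequence per taxon.

    gene_alignments maps gene name -> {taxon: aligned sequence}. A taxon
    that lacks a gene receives a run of gap characters of that gene's
    alignment width (an assumed way of handling missing data).
    """
    parts = {taxon: [] for taxon in taxa}
    for gene in gene_order:
        alignment = gene_alignments[gene]
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError(f"sequences for {gene} are not aligned to equal length")
        width = widths.pop()
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy example: S. pombe lacks the nad genes, so nad1 is padded with gaps.
toy = {
    "cox1": {"P_marneffei": "MFER", "S_pombe": "MLER"},
    "nad1": {"P_marneffei": "MYLI"},
}
merged = concatenate_alignments(toy, ["P_marneffei", "S_pombe"], ["cox1", "nad1"])
```

Every taxon ends up with a sequence of the same total length, which is what tree-building programs expect from a concatenated alignment.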
3.2.4
Mitochondrial DNA sequences in nuclear genome
Fragments of mitochondrial DNA sequences were searched for in the corresponding nuclear genomes of P. marneffei, A. nidulans, N. crassa, S. cerevisiae, and S. pombe. For each fungus, the mitochondrial DNA sequence was used as the query sequence to search against its own nuclear genome, using a method published for S. cerevisiae [262]. The mitochondrial and genomic DNA sequences of A. nidulans and N. crassa were downloaded from the A. nidulans Database (http://www-genome.wi.mit.edu/annotation/fungi/aspergillus/) and the N. crassa Database (http://www-genome.wi.mit.edu/annotation/fungi/neurospora/) respectively, and those of S. cerevisiae and S. pombe were obtained from GenBank. For P. marneffei, the 6.6× coverage of genomic DNA sequences was generated by our own whole genome sequencing project.
3.3
Results and Discussion
3.3.1
Gene content and genome organisation
The mitochondrial DNA of P. marneffei is a circular DNA molecule of
35,438 bp (Fig. 3.2). The overall G+C content is 25%, and 24% in
protein-coding genes. The genome encodes 28 tRNAs, the small and
the large subunit rRNAs, the ribosomal protein of the small ribosomal
subunit, 11 genes encoding subunits of respiratory chain complexes, and
the three ATPase subunits (Table 3.1). All genes are encoded by the
same DNA strand. 63.6% of the genome is occupied by structural genes
(40.5% corresponds to protein coding exons, 5.9% to the 28 tRNA genes,
and 17.3% to the rRNA subunits), 8.8% by intergenic spacers that are
14-372 bp in size, and 32.4% by the 11 introns.
3.3.2
Protein coding genes
The P. marneffei mitochondrial genome contains 15 protein coding genes. These include genes encoding ATP synthase subunits 6, 8, and 9 (atp6, atp8, and atp9), the cytochrome oxidase subunits I, II, and III (cox1, cox2, and cox3), apocytochrome b (cob), the reduced nicotinamide adenine dinucleotide ubiquinone oxidoreductase subunits (nad1, nad2, nad3, nad4, nad4L, nad5, and nad6), and the ribosomal protein of the small ribosomal subunit (rps). This set of protein coding genes is exactly the same as that in the A. nidulans mitochondrial genome. Furthermore, the order of the protein genes is the same as that in the A. nidulans mitochondrial genome, except for the atp9 gene, which is located between the cox1 and nad3 genes in the A. nidulans mitochondrial genome but between the nad2 and cob genes in the P. marneffei mitochondrial genome (Fig. 3.3).

Figure 3.2: Physical map of P. marneffei mitochondrial DNA (35,438 bp). The map is based on an annotation of the reverse complement of Assembly 3 of the P. marneffei mitochondrial sequence determined by the P. marneffei Sequencing Project at the University of Hong Kong in collaboration with Beijing Genomics Institute of Chinese Academy of Sciences. Numbers in the inner circle are in kb. The sequence is numbered from the unique restriction enzyme ClaI site (AT|CGAT) (0/35.4), which is located just upstream to the nad4L gene and downstream to the cox2 gene. Exons are shown in black, introns in white, and intronic ORFs in gray.

Table 3.1: Gene content of P. marneffei mitochondrial genome. * Exact start codon could not be determined merely through sequence comparison.

Genetic element  Localisation (nt)                 Size (bp)  Size (aa)  Start  Stop
nad4L            26-295                            270        89         ATG    TAA
nad5             295-2271                          1977       658        ATG    TAA
nad2             2289-4028                         1740       579        TTA    TAA
atp9             4216-4440                         225        74         ATG    TAA
cob              Join: (4706-5098, 6270-7037)      2332       386        ATG    TAA
cob-i1-ORF       5099-5965                         867        288        TTG*   TAA
nad1             Join: (7532-8179, 8650-9081)      1550       359        ATA    TAA
nad4             9253-10716                        1464       487        ATG    TAA
atp8             10945-11091                       147        48         ATG    TAG
atp6             11158-11928                       771        256        ATG    TAA
rns              12341-13721                       1381
nad6             14053-14637                       585        194        ATG    TAA
URF1             14722-15177                       456        151        ATG    TAA
cox3             15352-16161                       810        269        ATG    TAA
rnl              Join: (17165-19688, 21361-21902)  4738
rps              19987-21252                       1266       421        ATG    TAA
cox1             Join: (23339-23718, 24994-25099,  9821       561        ATT    TAA
                 26298-26641, 27740-27875,
                 29012-29201, 30504-30553,
                 31652-31806, 32835-33159)
cox1-i1-ORF      23720-24622                       903        300        AAA*   TAA
cox1-i2-ORF      25100-26200                       1101       366        AAA*   TAA
cox1-i3-ORF      26643-27647                       1005       334        AAA*   TAA
cox1-i4-ORF      27876-28928                       1053       350        TGA*   TAA
cox1-i5-ORF      29204-30043                       840        279        TTA*   TAA
cox1-i6-ORF      30554-31384                       831        276        ACA*   TAA
cox1-i7-ORF      31808-32629                       821        273        AGA*   TAG
URF2             33223-33660                       438        145        ATT    TAA
nad3             33955-34362                       408        135        ATG    TAA
cox2             34591-35346                       756        251        ATG    TAA
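The size columns of Table 3.1 for intron-containing genes follow from the 'Join' coordinates. The sketch below shows the arithmetic for cob and nad1. It assumes that coordinates are 1-based and inclusive, that the size in bp is the full genomic span including introns, and that the protein length excludes the stop codon; the function name is hypothetical.

```python
def span_and_protein_length(exons):
    """Return (genomic span in bp, coding length in bp, protein length in aa).

    `exons` is a list of (start, end) pairs, 1-based and inclusive. The span
    runs from the first exon start to the last exon end, introns included.
    The protein length assumes the last codon is a stop codon.
    """
    span = max(end for _, end in exons) - min(start for start, _ in exons) + 1
    coding = sum(end - start + 1 for start, end in exons)
    protein = coding // 3 - 1
    return span, coding, protein


# Exon coordinates of cob and nad1 as listed in Table 3.1.
cob = span_and_protein_length([(4706, 5098), (6270, 7037)])
nad1 = span_and_protein_length([(7532, 8179), (8650, 9081)])
```

For cob this gives a span of 2,332 bp and 386 amino acids, and for nad1 a span of 1,550 bp and 359 amino acids, in agreement with the table.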
Concatenated amino acid sequences of the 14 protein coding genes in
the mitochondrial genomes of P. marneffei and 24 other fungi were used
for phylogenetic tree construction. The closest relatives of P. marneffei were A. nidulans and other moulds, such as P. anserina, N. crassa,
Hypocrea jecorina, and Verticillium lecanii (Fig. 3.4). On the other hand,
the yeasts, such as the Saccharomyces species, Schizosaccharomyces species,
Candida species, and P. canadensis were more distantly related to P.
marneffei. This implies that, phylogenetically, the mitochondrial genome of P. marneffei is more closely related to those of moulds than to those of yeasts. This is in line with our previous observation, and with results published by others, that when the chromosomal 18S rRNA genes, or the internal transcribed spacers and 5.8S rRNA genes (ITS1-5.8S-ITS2) and mitochondrial small subunit rRNA genes, were used for phylogenetic tree construction, the closest neighbours of P. marneffei, besides the other Penicillium species, were the Aspergillus species as well as other moulds [202, 364]. Furthermore, the same gene content and almost the same gene order in the
mitochondrial genomes of P. marneffei and A. nidulans also implies that
the mitochondrial genome is probably not related to the unique characteristic of thermal dimorphism of P. marneffei. Interestingly, MP1, the
gene that encodes an abundant and highly immunogenic protein in P.
[Figure 3.3: gene maps of the P. marneffei and A. nidulans mitochondrial DNA. Legend: protein and rRNA genes; tRNA genes.
Order of protein and rRNA genes in P. marneffei: nad4L, nad5, nad2, atp9, cob, nad1, nad4, atp8, atp6, rns, nad6, cox3, rnl, rps, cox1, nad3, cox2.
Order of protein and rRNA genes in A. nidulans: nad4L, nad5, nad2, cob, nad1, nad4, atp8, atp6, rns, nad6, cox3, rnl, rps, cox1, atp9, nad3, cox2.]
Figure 3.3: Gene content and order comparison between P. marneffei mitochondrial DNA and A. nidulans mitochondrial DNA. The only exonic
gene that has undergone rearrangement is atp9, which is highlighted
with a black background.
[Figure 3.4: maximum likelihood tree (scale bar 0.1) with, beside each taxon, the group I and group II introns found in rnl, atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4L, nad5 and nad6.
Taxa, in the order drawn: Schizophyllum commune, Cryptococcus neoformans var. grubii, Schizosaccharomyces octosporus, Schizosaccharomyces japonicus, Schizosaccharomyces pombe, Candida albicans, Yarrowia lipolytica, Candida glabrata, Saccharomyces castellii, Saccharomyces servazzii, Saccharomyces cerevisiae, Pichia canadensis, Penicillium marneffei, Aspergillus nidulans, Podospora anserina, Hypocrea jecorina, Neurospora crassa, Verticillium lecanii, Allomyces macrogynus, Rhizophydium sp., Harpochytrium sp. JEL94, Spizellomyces punctatus, Harpochytrium sp. JEL105, Hyaloraphidium curvatum, Monoblepharella sp. JEL15.
Symbol legend: group I intron with intronic ORF; group I intron without intronic ORF; group II intron. Genes not present were crossed out.]
Figure 3.4: Phylogenetic relationships of P. marneffei to other fungi
and distribution of group I and group II introns in the corresponding
fungi. Maximum likelihood tree showing phylogenetic relationships of
P. marneffei to other fungi and distribution of group I and group II introns in the corresponding fungi. The tree was constructed using unambiguously aligned portions of concatenated amino acid sequences of the
14 protein-coding genes (atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1,
nad2, nad3, nad4, nad4L, nad5 and nad6). A total of 3462 amino acid
positions were used for the inference with ProML [86]. Sequences were obtained from GenBank: Allomyces macrogynus (NC_001715), Aspergillus
nidulans (CAA32799, CAA33481, AAA99207, AAA31737, CAA25707,
AAA31736, CAA23994, P15956, CAA23995, CAA33116), Candida albicans (NC_002653), Candida glabrata (NC_004691), Cryptococcus neoformans var. grubii (NC_004336), Harpochytrium sp. JEL105 (NC_004623),
Harpochytrium sp. JEL94 (NC_004760), Hyaloraphidium curvatum (NC_003048), Hypocrea jecorina (NC_003388), Monoblepharella
sp. JEL15 (NC_004624), Neurospora crassa (CAA24041, CAA32799,
AAA31961, CAA27029, CAA27418, AAA66053, AAA31959), P. marneffei (present study), Pichia canadensis (NC_001762), Podospora anserina (NC_001329), Rhizophydium sp. 136 (NC_003053), Saccharomyces
castellii (NC_003920), Saccharomyces cerevisiae (NC_001224), Saccharomyces servazzii (NC_004918), Schizophyllum commune (NC_003049),
Schizosaccharomyces japonicus (NC_004332), Schizosaccharomyces octosporus (NC_004312), Schizosaccharomyces pombe (NC_001326), Spizellomyces punctatus (NC_003052, NC_003061 and NC_003060), Verticillium lecanii (NC_004514), Yarrowia lipolytica (NC_002659). Some sequences of A. nidulans were downloaded from Fungal Mitochondrial
Genome Project (http://megasun.bch.umontreal.ca/People/lang/FMGP/FMGP.html), and some sequences of N. crassa were downloaded
from http://pages.slu.edu/faculty/kennellj/genbank.html. The
scale bar indicates the branch lengths that were scaled in terms of expected numbers of amino acid substitutions.
marneffei, has known homologues only in A. nidulans, A. fumigatus, and
A. flavus, and not in other fungi [37, 39, 38, 363, 43, 351, 352].
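The construction of a concatenated alignment from per-gene alignments, as used for the tree in Fig. 3.4, can be sketched as follows. This is a minimal illustration under stated assumptions: the taxon labels and toy sequences are made up, the function names are hypothetical, and dropping every column that contains a gap is only a crude stand-in for selecting "unambiguously aligned portions". It is not the pipeline used to prepare the ProML input in this study.

```python
def concatenate_alignments(alignments, gene_order):
    """Join per-gene alignments into one sequence per taxon.

    alignments maps gene name -> {taxon: aligned sequence}.
    A taxon missing from a gene is padded with gap characters.
    """
    taxa = sorted({taxon for gene in gene_order for taxon in alignments[gene]})
    parts = {taxon: [] for taxon in taxa}
    for gene in gene_order:
        block = alignments[gene]
        lengths = {len(seq) for seq in block.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(block.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def drop_gapped_columns(matrix):
    """Keep only alignment columns in which no taxon has a gap."""
    taxa = list(matrix)
    columns = zip(*(matrix[taxon] for taxon in taxa))
    kept = [column for column in columns if "-" not in column]
    return {taxon: "".join(column[i] for column in kept)
            for i, taxon in enumerate(taxa)}


# Toy input (made-up sequences, two genes, three taxa).
toy = {
    "atp8": {"Pm": "MPQL", "An": "MPQL", "Sc": "MP-L"},
    "atp9": {"Pm": "MLA", "An": "MIA", "Sc": "MLA"},
}
supermatrix = concatenate_alignments(toy, ["atp8", "atp9"])
trimmed = drop_gapped_columns(supermatrix)
print(trimmed)
```

In a real analysis the gene order list would hold the 14 protein-coding genes, and the trimmed matrix would be written out in the format expected by the tree-building program.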
3.3.3 Genetic code and codon usage
Since the mitochondrial genome of P. marneffei is phylogenetically closely
related to those of moulds and its gene content is the same as that of A.
nidulans, the genetic code of the mitochondrial genome of P. marneffei
is assumed to be the same as that of A. nidulans.
There is a strong codon usage bias in exonic ORFs in the mitochondrial genome of P. marneffei towards codons ending in A or T. In fact, eight
codons (CTC, CTG, ACG, TGC, TGG, CGC, CGG, and GGC) were not
used at all, five codons (GTC, TCC, TCG, ACC, and AGG) were used
only once, and nine codons (ATC, CCG, GCC, GCG, CAC, CAG, AGG,
GAC, GGG) were used 2 to 10 times, in exonic ORFs. Moreover, this
bias is also evident in the use of stop codons: TAA is
used as the stop codon in 14 genes, whereas TAG is used in only one gene.
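How such a codon usage tally can be obtained is sketched below. The example assumes an in-frame coding sequence as input; the toy sequence is made up for illustration and is not taken from the P. marneffei genome, and the function names are hypothetical.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons of a coding sequence given as a DNA string."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))


def at_ending_fraction(counts):
    """Fraction of counted codons whose third position is A or T."""
    total = sum(counts.values())
    at_ending = sum(n for codon, n in counts.items() if codon[2] in "AT")
    return at_ending / total


# Toy coding sequence (made up): ATG TTA TTT GGT AAA TAA
toy_cds = "ATGTTATTTGGTAAATAA"
counts = codon_counts(toy_cds)
fraction = at_ending_fraction(counts)
print(counts["TTA"], round(fraction, 3))
```

Summing such counters over all exonic ORFs gives a table of the kind shown in Table 3.2, from which unused and rarely used codons can be read off directly.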
3.3.4 tRNA genes
Twenty-eight tRNA genes were identified in the P. marneffei mitochondrial genome (Fig. 3.5). These are all located on the same DNA strand
as the other genes. The set of mitochondrial tRNAs in P. marneffei is
similar in type to that in A. nidulans. Furthermore, the sequences of
the mitochondrial tRNA genes of P. marneffei are fairly well conserved relative to
those of A. nidulans, especially in the two tRNA gene clusters of
the two species (Fig. 3.3).
3.3.5 Other RNA genes
The genes that encode the 23S and 16S ribosomal RNAs of the large and
small subunits of the ribosome (rnl and rns) were identified. Furthermore, a gene (rps), located within the intron of rnl (Table 3.1 and Fig.
Table 3.2: Codon usage in protein-coding genes of P. marneffei mitochondrial genome. Numbers indicate the total numbers of codons
in either identified protein coding genes or ORFs (including both freestanding URFs, intronic ORFs and RPS).

Codon  AA  Genes  ORFs    Codon  AA  Genes  ORFs
TTT    F   307    143     TCT    S   160    93
TTC    F   66     13      TCC    S   1      5
TTA    L   572    250     TCA    S   105    45
TTG    L   26     33      TCG    S   1      13
CTT    L   49     42      CCT    P   119    35
CTC    L   0      6       CCC    P   4      2
CTA    L   20     24      CCA    P   25     20
CTG    L   0      4       CCG    P   4      3
ATT    I   182    134     ACT    T   121    78
ATC    I   10     12      ACC    T   1      7
ATA    I   326    162     ACA    T   105    45
ATG    M   112    38      ACG    T   0      4
GTT    V   132    74      GCT    A   144    49
GTC    V   1      3       GCC    A   4      7
GTA    V   131    70      GCA    A   81     35
GTG    V   18     5       GCG    A   7      3
TAT    Y   191    180     TGT    C   24     21
TAC    Y   32     27      TGC    C   0      4
TAA    *   14     9       TGA    W   56     37
TAG    *   1      1       TGG    W   0      5
CAT    H   76     47      CGT    R   10     24
CAC    H   8      7       CGC    R   0      1
CAA    Q   83     75      CGA    R   0      1
CAG    Q   5      7       CGG    R   0      2
AAT    N   196    277     AGT    S   123    90
AAC    N   11     30      AGC    S   15     8
AAA    K   101    347     AGA    R   78     94
AAG    K   6      18      AGG    R   1      9
GAT    D   97     112     GGT    G   188    94
GAC    D   3      11      GGC    G   0      1
GAA    E   89     133     GGA    G   92     32
GAG    E   21     21      GGG    G   6      13
[Figure 3.5: cloverleaf diagrams of the 28 mitochondrial tRNAs, numbered 1 to 28 and labelled with their amino acids (Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr and Val); one structure carries an intron label.]
Figure 3.5: 28 tRNAs encoded in the mitochondrial genome of P. marneffei. Predicted clover-leaf structures of the 28 tRNAs encoded in the
mitochondrial genome of P. marneffei. Anticodons are underlined and
the corresponding amino acids are indicated. tRNAs are listed according
to the order of their positions in the map in Fig. 3.2.
3.6), that encodes the ribosomal protein of the small ribosomal subunit,
which is also present in the A. nidulans mitochondrial genome, was also
identified.
[Figure 3.6: secondary-structure diagrams of two group I introns, in panels labelled Pm Lsu.1 and Pm Cox1.3. Paired regions P3 to P8 are marked, segments not drawn are given by their length in bp, genome coordinates are marked on each panel, and the RPS5 ORF is drawn as a box.]
Figure 3.6: Predicted secondary structures of two representative group
I introns. Group I introns PmRnl.1 and PmCox1.3, of the rnl and cox1
genes respectively, in P. marneffei. The exon/intron boundaries are represented by dotted lines. Base pairs are depicted by bars. The sizes of regions not shown are indicated in bp. The RPS5 gene
is depicted by a square box. The numbers correspond to the coordinates
in the mitochondrial genome.
3.3.6 Group I introns
In P. marneffei, the cox1 gene contains seven introns (PmCox1.1, PmCox1.2, PmCox1.3, PmCox1.4, PmCox1.5, PmCox1.6, and PmCox1.7),
while the cob gene, nad1 gene, and rnl gene contain one intron each
(PmCob1.1, PmNad1.1, and PmRnl1.1 respectively). Each intron in the
cox1, nad1, and rnl genes contains an ORF. The ORF in the rnl gene
Table 3.3: Presence of mitochondrial DNA fragments in nuclear genomes.
‘Nuc no.’, number of mtDNA fragments in nuclear genomes; ‘Mt size’,
size of mitochondrial genomes (kb); ‘Nuc size’, size of nuclear genome
(Mb); ‘Ratio’, ratio of sizes of mitochondrial to nuclear genome (kb/Mb).

Fungus          Nuc no.  Mt size  Nuc size  Ratio
P. marneffei    10       35.4     ∼ 29.5    ∼ 1.20
A. nidulans     17       ∼ 33.2   ∼ 31.0    ∼ 1.07
N. crassa       21       ∼ 64.8   ∼ 43.0    ∼ 1.51
S. cerevisiae   34       85.7     12.1      7.08
S. pombe        21       19.4     13.8      1.41
encodes the rps gene. The predicted secondary structures of two representative group I introns are depicted in Fig. 3.6. In both introns, the
upstream exons end with a T and the introns end with a G, typical for
most group I introns.
A comparison of the distribution of group I and group II introns in the
14 protein-coding genes and the rnl gene of the P. marneffei mitochondrial
genome with that in the corresponding genes of the other 24 fungi is
shown in Fig. 3.4. As a whole, the distribution of these introns in the
genes encoded in the mitochondrial genome of P. marneffei is consistent with
that in the other fungi. The cox1 gene, which contains the
largest number of self-splicing introns in other mitochondrial genomes,
is also the gene that contains the largest number of self-splicing introns
in the P. marneffei mitochondrial genome. The cob and nad1 genes, which
also contain significant numbers of self-splicing introns in other fungi, each possess one
group I intron in the P. marneffei mitochondrial genome.
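The intron content of cox1 can be checked against the gene coordinates with a few lines of arithmetic. The sketch below assumes the cox1 exon and intronic ORF coordinates as read from the gene table earlier in this chapter (adjacent numbers that were run together in the listing are split into start and end positions); the function and variable names are hypothetical.

```python
# cox1 exon coordinates (start, end), as read from the gene table (assumption).
cox1_exons = [
    (23339, 23718), (24994, 25099), (26298, 26641), (27740, 27875),
    (29012, 29201), (30504, 30553), (31652, 31806), (32835, 33159),
]
# Coordinates of cox1-i1-ORF to cox1-i7-ORF, from the same table.
intronic_orfs = [
    (23720, 24622), (25100, 26200), (26643, 27647), (27876, 28928),
    (29204, 30043), (30554, 31384), (31808, 32629),
]


def introns_between(exons):
    """Return the intervals lying between consecutive exons."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


introns = introns_between(cox1_exons)
intron_sizes = [end - start + 1 for start, end in introns]
exonic_length = sum(end - start + 1 for start, end in cox1_exons)
# Each intronic ORF should lie entirely within the intron of the same rank.
orf_inside_intron = [
    intron_start <= orf_start and orf_end <= intron_end
    for (intron_start, intron_end), (orf_start, orf_end)
    in zip(introns, intronic_orfs)
]
print(len(introns), intron_sizes, exonic_length, all(orf_inside_intron))
```

Eight exons give seven introns, in agreement with PmCox1.1 to PmCox1.7, and the summed exon length of 1686 bp corresponds to 561 codons plus a stop codon.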
3.3.7 Mitochondrial DNA sequences in nuclear genome
The presence of mitochondrial DNA sequence fragments in the corresponding nuclear genomes of P. marneffei, A. nidulans, N. crassa, S. cerevisiae, and S. pombe was compared (Table 3.3). By using the same
method of sequence similarity comparison used for S. cerevisiae [262],
Table 3.4: P. marneffei mitochondrial DNA sequences present in nuclear
genome.

No.  Coordinates     Size (bp)  Location      E-value
1    9031..9069      39         nad1          9e-08
2    10182..10201    20         nad4          1e-03
3    11622..11697    76         atp6          2e-15
4    13445..13465    21         rrs           2e-04
5    15158..15177    20         nad6 – cox3   1e-03
6    18757..18776    20         rnl           1e-03
7    25168..25187    20         cox1          1e-03
8    31197..31216    20         cox1          1e-03
9    32560..32580    21         cox1          2e-04
10   34510..34529    20         nad3 – cox2   1e-03
only 10 mitochondrial DNA sequence fragments were detected in the 4×
coverage nuclear genome sequences of P. marneffei, which represent 95% of the nuclear genome
(Table 3.4). This number of mitochondrial DNA sequence fragments in
the nuclear genome, as well as the ratio of mitochondrial
to nuclear genome size, was comparable to those found in A. nidulans,
N. crassa, and S. pombe (Table 3.3). On the other hand, the number
of mitochondrial DNA sequence fragments in the nuclear genome of S.
cerevisiae was 34, about twice the number found in the other fungi.
Although the relatively high ratio of mitochondrial to nuclear genome
size of S. cerevisiae may partly explain this phenomenon, further studies
would be necessary to elucidate the differences in the significance of these
mitochondrial DNA fragments in the nuclear genomes of the different
fungi.
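The ratios in Table 3.3 and the roughly twofold excess in S. cerevisiae follow from simple arithmetic, sketched here. The values are those listed in Table 3.3, with approximate sizes treated as exact for illustration; the variable names are hypothetical.

```python
# (number of mtDNA fragments in nuclear genome, mt size in kb, nuclear size in Mb)
table_3_3 = {
    "P. marneffei": (10, 35.4, 29.5),
    "A. nidulans": (17, 33.2, 31.0),
    "N. crassa": (21, 64.8, 43.0),
    "S. cerevisiae": (34, 85.7, 12.1),
    "S. pombe": (21, 19.4, 13.8),
}

# Ratio of mitochondrial to nuclear genome size (kb/Mb), as in the last column.
ratios = {name: round(mt_kb / nuc_mb, 2)
          for name, (_, mt_kb, nuc_mb) in table_3_3.items()}

# Fragment count in S. cerevisiae relative to the mean of the other four fungi.
others = [count for name, (count, _, _) in table_3_3.items()
          if name != "S. cerevisiae"]
fold = table_3_3["S. cerevisiae"][0] / (sum(others) / len(others))

print(ratios)
print(round(fold, 2))
```

The S. cerevisiae ratio stands apart from the other four, whereas its fragment count is about twice their mean, which is why the size ratio can explain the difference only in part.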
In conclusion, among the known mitochondrial genomes of fungi, the P.
marneffei mitochondrial genome has an intermediate size. The replication origin of the P. marneffei mitochondrial genome is unknown. Despite the distinct biological property of thermal dimorphism in P. marneffei, its mitochondrial genome is much more closely related to those of
moulds, especially to that of A. nidulans, than to those of yeasts. The set of
protein-coding genes in the P. marneffei mitochondrial genome is exactly the same as that in the A. nidulans mitochondrial genome. Except
for the atp9 gene, the order of the protein-coding genes is also the same
as that in the A. nidulans mitochondrial genome. Furthermore, when
concatenated amino acid sequences of 14 protein-coding genes in the mitochondrial genomes of P. marneffei and 24 other fungi were used for
phylogenetic tree construction, the closest relatives of P. marneffei were
A. nidulans and other moulds, whereas the yeasts were more distantly
related.
Chapter 4
GENOMIC EVIDENCE FOR THE PRESENCE OF
MELANIN BIOSYNTHESIS GENE CLUSTER IN
PENICILLIUM MARNEFFEI
In this chapter, I first review fungal virulence factors and their
identification by genomic approaches, and then present genomic evidence for
the presence of melanin biosynthesis genes in Penicillium marneffei.
4.1 Introduction
In Chapter 3, when I compared the mitochondrial genome of P. marneffei
to those of other fungi, it was observed that the mitochondrial genome
of P. marneffei is much more closely related to those of moulds, especially to that of Aspergillus nidulans, than to those of yeasts. The set of protein-coding
genes in the P. marneffei mitochondrial genome is exactly the
same as that in the A. nidulans mitochondrial genome. Except for the
atp9 gene, the order of the protein-coding genes is also the same as that
in the A. nidulans mitochondrial genome. Furthermore, the amino acid
sequence identity between the mitochondrial genes of P. marneffei and
those of A. nidulans is significantly higher than that between the mitochondrial genes of P. marneffei and those of Neurospora crassa, Candida albicans, Saccharomyces cerevisiae, and Schizosaccharomyces pombe.
This evidence of a close relationship between P. marneffei and Aspergillus
species has prompted a further search for previously undiscovered characteristics in P. marneffei based on our knowledge of the various Aspergillus
species.
Melanins are negatively charged pigments of high molecular weight
with hydrophobic surfaces. They are formed by the oxidative polymerisation of phenolic and/or indolic compounds [341]. They are carcinogens
that are widespread in agricultural products and food. They are mainly
produced by various Aspergillus species, like A. parasiticus and A. flavus,
and less frequently, also by A. nomius, A. pseudotamarii, and A. bombycis [170]. Since melanin is made by these important pathogenic fungi
and has been implicated in the pathogenesis of a number of fungal infections, it would be of interest to investigate whether P. marneffei could
synthesise melanin or melanin-like compounds.
Here, after the literature review, I report progress in identifying a
gene cluster in P. marneffei, spanning 19 kb, which contains homologues of six
genes. All six genes in the corresponding cluster in A. fumigatus have been
shown to be involved in DHN-melanin biosynthesis [24, 187, 317, 318].
These genes are alb1, arp1, arp2, abr1 and abr2, encoding a polyketide
synthase, a scytalone dehydratase, a hydroxynaphthalene reductase,
a putative protein possessing two signatures of multicopper oxidases and
a laccase respectively, as well as ayg1, of unknown function. The order of
genes in the clusters of the two fungi differs slightly. These
findings indicate that P. marneffei can potentially produce melanin or
melanin-like compounds. Since melanin is an important virulence factor
in other pathogenic fungi, this pigment may have a similar role to play
in the pathogenesis of penicilliosis.
4.2 Literature Review
Most fungi cannot survive in the environment provided by human tissue
and are therefore not pathogenic. Amongst the more than 100,000 fungal
species that have been described, only a handful are pathogens.
The pathogenic fungi are divided into two classes, primary pathogens and
opportunistic pathogens. Primary pathogenic fungi, e.g., Coccidioides
immitis and Histoplasma capsulatum, are “professional” pathogens that
are adapted to live inside healthy mammalian and human tissue, causing disease not only in immunocompromised patients but also in healthy people. Opportunistic fungi may have an environmental reservoir or exist as
commensals in a healthy host. Examples include Candida species,
C. neoformans and A. fumigatus. These fungi are able to grow and invade host tissue only when they can take advantage of an immunocompromised
host. However, the incidence of life-threatening mycoses caused by opportunistic fungal pathogens has increased dramatically in recent years,
and they are now the major cause of fungal infections. The infections
caused by pathogenic fungi can be superficial, subcutaneous or systemic.
Superficial infection is localised to the skin, the hair, and the nails; subcutaneous infection is confined to the dermis, subcutaneous tissue or adjacent
structures; systemic infection refers to deep infection of the internal organs.
4.2.1 Potential virulence factors
A virulence factor in a fungus is, literally, any factor that the fungus
possesses that increases its virulence in the host. For instance, if a gene
or a protein is essential for growth in vivo but its deletion does not affect mycelial growth in vitro, it is considered a virulence factor [189].
The concept of a virulence factor differs between primary pathogens and
opportunistic pathogens, and it is relatively difficult to define
for the latter, as pointed out by [128]. For most fungal
pathogens, few virulence factors that contribute to their pathogenicity
have been reported.
Although the mechanisms of fungal pathogenicity remain poorly
understood, the development of a fungal infection must satisfy several
conditions. The fungus must first be able to adhere to the host
tissues. It must then colonise the host and invade the host tissue.
Once the fungus has invaded the host tissue, it must be able to adapt to
the tissue environment. Probably most importantly, the fungus must be
able to avoid the host’s cellular defences.
Adherence to host tissues
Adherence factors are essential for fungal pathogens to attach themselves
to host tissue and to resist physical clearing of the infectious agent.
For example, C. immitis, Aspergillus species, H. capsulatum and Cryptococcus neoformans all infect via the bronchial route and must have
specific adaptations in order to avoid effective clearance from a host’s
lungs. Adherence is dependent on a variety of factors, including surface
glycoproteins, fungal cell surface hydrophobicity, pH, temperature, and
of course, the phenotype of the organism. Adhesins are biomolecules that
promote the adherence of fungi to host cells or host-cell ligands, and that
bind to several extracellular matrix proteins of mammalian cells, such as
fibronectin, laminin, fibrinogen and collagen types I and IV.
Amongst the many studies that have shown the association of adherence
with fungal pathogenesis, the studies on adhesion in C. albicans are the most
extensive. Candida species express several cell surface proteins, termed
adhesins, which actively promote binding to host cells. These include a
lectin-like protein that recognises sugar residues of epithelial cell surface
glycoproteins, and a complement receptor-like protein, CR3, which may
play a role in adherence to endothelial cells. Several adherence-promoting molecules or adhesins of C. albicans regulate attachment, invasion,
and dissemination of the fungus [36, 157].
Als1p (agglutinin-like sequence) of C. albicans is a member of a family of seven glycosylated proteins with similarity to the S. cerevisiae agglutinin protein that is required for cell-cell recognition during mating.
Als1p is essential for virulence in a hematogenously disseminated murine
model [98].
HWP1 is a hyphal- and germ-tube-specific outer surface mannoprotein that binds C. albicans hyphae to human buccal epithelial cells [319].
The null mutant was less virulent than parental or single-gene-deleted
strains in a hematogenously disseminated murine model. The yeast germinated less readily in the kidneys of infected mice and caused less endothelial cell damage [319]. C. albicans binds to several extracellular matrix (ECM) ligands,
including fibronectin (FN), laminin and collagens I and IV. C. albicans expresses an
integrin-like protein, INT1, which is 25% identical to a non-repeat region
of the fibrinogen-binding protein, ClfA, of Staphylococcus aureus. Strains
of C. albicans deleted in INT1 were less virulent and adhered less readily
to an epithelial cell line [102]. Strains of C. albicans deleted in the 1,2-mannosyltransferase gene (MNT1) are less able to adhere in vitro and are
avirulent. Mnt1p is a type II membrane protein that is required for both
O- and N-mannosylation in fungi and was found to be required for adherence
to an epithelial cell line [34].
Adhesins of other medically important fungi, such as Blastomyces dermatitidis (a dimorphic fungal pathogen that infects the host through inhalation of conidia [276]), have also been characterised. WI-1 is a 120-kDa
surface protein adhesin of B. dermatitidis that binds the CD18
and CD14 receptors on human macrophages [232]. Hogan et al. [133]
cloned the WI-1 adhesin gene and found a total of 30 highly conserved
repeats of a 24-amino acid sequence. The repeat sequence is similar to
invasin, an adhesion-promoting protein of Yersiniae [169].
Invasion
Invasion is required for the development of deep mycoses in the internal
tissues of the body. The process is probably aided by hydrolytic enzymes,
such as proteinases and lipases, and in the case of dermatophytes, keratinases. Secretion of extracellular enzymes, such as phospholipase, has been
proposed as one of the virulence mechanisms used by bacteria, parasites,
and pathogenic fungi to overcome host defence mechanisms. The role of
extracellular phospholipase as a potential virulence factor in pathogenic
fungi, including C. albicans, C. neoformans, and A. fumigatus, has been
reported. Of the four candidal phospholipases (PLA, PLB, PLC and PLD),
only phospholipase B, encoded by PLB1, has been shown to affect virulence: C. albicans null mutants that failed to secrete phospholipase B, constructed by targeted gene disruption, showed attenuated virulence when tested in
two clinically relevant murine models of candidiasis. Initial data suggest that direct host cell
damage and lysis are the main virulence mechanisms.
The secretion of lytic and degradative enzymes is also of obvious importance to the invasion of host tissues. Such enzymes secreted
by fungi can break down structural barriers and play an important role
in mediating host tissue invasion. The most extensively studied example
is the SAP gene family in C. albicans [294]. At least nine proteins comprise
the family of secreted aspartyl proteinases. In guinea pig and murine
models of invasive disease, deletions in sap1-6 attenuated virulence. The
SAP genes have been shown to be differentially expressed, according to
the growth phase and phenotype of the organism: SAP2 mRNA was the
dominant transcript in the yeast phase, whereas SAP4, SAP5 and SAP6
transcripts were observed only at neutral pH during the serum-induced yeast
to hyphal transition. The order of expression (SAP1 and SAP2, followed
sequentially by SAP8, SAP6 and SAP3) correlated with the stages of tissue invasion, i.e.,
early invasion (SAP1, SAP2), extensive penetration (SAP8) and extensive
hyphal growth (SAP6). These data indicate that members of the SAP
gene family may have distinct roles in the colonisation and invasion of
the host [63].
Growth at elevated temperature/Thermotolerance
Thermotolerance is one of the most obvious factors leading to pathogenesis. The ability to grow at the body temperature of 37 °C and within the fever range of
38 to 42 °C is important for systemic infection. The majority of fungi have an
optimum growth temperature of 25 to 30 °C, and may grow only weakly
or not at all at 37 °C. The first genome-wide analysis of the temperature-regulated transcriptome of C. neoformans was carried out by Steen et
al. [296]. They identified sets of genes with higher transcript levels at
25 °C or 37 °C respectively.
Morphology/Morphogenesis
There is a growing body of evidence linking morphogenesis and virulence.
Changes in morphology are advantageous for fungal pathogens. It has
been demonstrated that fungal hyphae can exert significant tip pressure
for penetration [224]. Many fungi adopt this morphological change and
develop virulence. Filamentous fungi (such as Aspergillus species) tend
to form branched hyphae in the lung. C. neoformans, being a unique encapsulated yeast, is coated with a polysaccharide capsule. The capsule
is a potent inhibitor of macrophage phagocytosis, which is an important
factor in the resistance to C. neoformans infection.
The most remarkable ability shared among the dimorphic fungi, such
as B. dermatitidis, C. immitis, H. capsulatum, Paracoccidioides brasiliensis and Sporothrix schenckii, is to switch between two distinct forms: yeast
and mould. The dimorphic fungi normally exist as non-pathogenic forms
(usually filamentous mycelia) in the environment and convert into
pathogenic forms (yeast) in the tissues of a host. This process is reversible, although the trigger for the switch is unknown and differs amongst
fungi. The importance of the yeast cell, as an invasive morphology, for dimorphic fungi has been reviewed by Gow et al. [113, 114]. As
shown in Table 4.1, most dimorphic mycelial pathogens invade tissues of
a host as yeast cells. Yeast cells are regarded as better adapted for
dissemination within the host circulatory system and avoidance of immune
capture. Note that although the opportunistic pathogens C. albicans
and Candida tropicalis show dimorphic growth, these Candida species
Table 4.1: Major dimorphic fungal pathogens and their characteristic
morphologies in infectious disease. Taken from [114].

Fungal species                  Form in diseased tissue
Blastomyces dermatitidis        Budding yeasts
Candida albicans                (Pseudo)hyphae, budding yeasts
Candida tropicalis              Yeast and pseudohyphae
Coccidioides immitis            Endosporulating spherules
Cryptococcus neoformans         Budding capsulate yeasts
Histoplasma capsulatum          Budding yeasts
Paracoccidioides brasiliensis   Budding yeasts
Penicillium marneffei           Yeasts undergoing binary fission
Sporothrix schenckii            Budding yeasts
Wangiella dermatitidis          Budding yeasts
mainly form pseudohyphae and are therefore not regarded as true dimorphic fungi. Nevertheless, conversion to pseudohyphae has long been
regarded as essential for tissue invasion by Candida species.
4.2.2 Genomic approaches in identification of virulence factors
In practice, combining several of the following techniques has great
potential for elucidating biological systems in detail.
Mining whole genome sequences and fishing for virulence factors
The sequence of the genome of the budding yeast, S. cerevisiae, is a landmark
of genomics. Since its publication, progress has been made in sequencing whole fungal genomes. The second complete sequence of a fungal genome, that of
S. pombe, was published in 2002 [354]. The genome sequences of the filamentous fungi A. nidulans,
A. fumigatus, N. crassa and Ashbya gossypii are nearing completion (see
also Section 1.2.4). Even at its early stage, in 2002, the Fungal Genome Initiative
(FGI), a genome sequencing program by the National Human Genome
Research Institute, USA, proposed to sequence 15 fungi selected on the
basis of medical, scientific and commercial criteria. FGI will apply deep-shotgun sequencing approaches (sequencing coverage > 10×) in
order to finish all sequencing work quickly. If fully funded, it will produce
a large amount of valuable information for detailed comparative genomic analysis
across the fungal taxa.
Genome sequences have an immediate impact on conventional fungal genetics
by eliminating the years of effort previously associated with gene
discovery. Traditional genetic and biochemical approaches to gene discovery
in fungi suffer from many limitations, such as poor efficiencies of
transfer, lack of stable extrachromosomal elements and poor growth in
the laboratory. With the genomic sequence in hand, one can bypass these
limitations by using genomic approaches, which permit rapid identification
of novel genes. Therefore, obtaining genome sequences from pathogenic fungi
is one of the most efficient steps towards identifying potential
targets for therapeutic intervention and vaccination.
Other genomic approaches
Current genomic approaches can be categorised into three groups:
mutagenic-based, nucleotide-based and protein-based [206]. The
mutagenic-based techniques include signature-tagged mutagenesis and the
construction of mutant libraries. Microarray analysis and serial analysis of
gene expression (SAGE), for example, belong to the nucleotide-based
techniques. The two-hybrid system, protein arrays and 2D-PAGE expression
analysis are examples of protein-based techniques.
4.3 Materials and Methods
4.3.1 Identification of melanin biosynthesis genes in P. marneffei
To identify melanin biosynthesis genes in the P. marneffei genome, protein
sequences of melanin biosynthesis genes of Aspergillus were downloaded from
GenBank and used as queries against the P. marneffei genome. The comparison
was conducted using the NCBI TBLASTN program version 2.0 with the BLOSUM62
scoring matrix [6]. The E-value cutoff used to assign homologues was
1 × 10^-20. The contigs in the P. marneffei genome that contained
homologues were extracted and annotated manually. Predicted peptides
were compared to the amino acid sequences of their corresponding query
proteins using NCBI BLAST2SEQ
(http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html). The statistics of the
“expect value” were calculated based on the size of the NCBI non-redundant
protein database. Conserved domains/motifs were identified using InterPro
release 5.1 [367].
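The homologue-assignment step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in this work: it assumes the TBLASTN hits have been exported as 12-column tab-delimited text (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score), and the contig names and scores in the sample are made up.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-20  # cutoff stated in the text


def parse_evalue(field):
    """Convert an E-value string to float; older BLAST output may print 'e-140'."""
    return float("1" + field if field.startswith("e") else field)


def best_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Map each query to its lowest-E-value (contig, E-value) hit passing the cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, comment or malformed lines
        query, contig, evalue = row[0], row[1], parse_evalue(row[10])
        if evalue <= cutoff and (query not in best or evalue < best[query][1]):
            best[query] = (contig, evalue)
    return best


# Hypothetical hits: contig names, coordinates and scores are invented.
SAMPLE = (
    "arp1\tcontig_A\t77.0\t160\t36\t1\t1\t160\t500\t979\t2e-81\t300\n"
    "arp1\tcontig_B\t30.0\t90\t60\t3\t5\t94\t100\t369\t1e-05\t45\n"
    "ayg1\tcontig_A\t57.0\t403\t170\t2\t1\t403\t2000\t3208\te-140\t495\n"
    "weak\tcontig_C\t28.0\t80\t55\t2\t1\t80\t10\t249\t1e-05\t40\n"
)
hits = best_hits(SAMPLE)
```

Queries whose best hit fails the cutoff (here the hypothetical `weak`) are simply absent from the result, which mirrors the idea of assigning a homologue only above a fixed significance threshold.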
4.3.2 Multiple alignments and phylogenetic analyses
Multiple alignments of amino acid sequences were performed using the
program ClustalX 1.81 [311]. Initial pairwise alignments were performed
using the Blosum62 protein weight matrix, and the alignments were then
adjusted manually. Graphic presentations of the alignments and consensus
sequences were produced using the program BOXSHADE 3.21
(http://www.ch.embnet.org/software/BOX_form.html). Regions of ambiguous
alignment were removed using the GeneDoc program
(http://www.psc.edu/biomed/genedoc). Phylogenetic trees were
inferred by the neighbour-joining method [273]. Bootstrap resampling
with 1000 pseudoreplicates was carried out to assess support for each
individual branch.
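The resampling step of the bootstrap can be illustrated with a short sketch. It assumes the alignment is held as a dictionary of equal-length strings; the sequence names and residues are made up, and tree inference itself (neighbour-joining on each pseudoreplicate) is deliberately left out. Each pseudoreplicate is built by drawing alignment columns with replacement until the original alignment length is reached.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudoreplicate by sampling alignment columns with replacement."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def pseudoreplicates(alignment, n=1000, seed=0):
    """Return n pseudoreplicate alignments (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]


# Toy alignment with invented sequences.
TOY = {"seqA": "MKVLAT", "seqB": "MKILST", "seqC": "MRVLSS"}
reps = pseudoreplicates(TOY, n=1000)
```

A tree would then be inferred from each pseudoreplicate, and the support for a branch reported as the fraction of those trees that contain it.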
4.4 Results and Discussion
4.4.1 Melanin gene cluster present in P. marneffei
Secondary metabolism, the production of compounds not essential for
growth in culture, is thought to be integrally intertwined with development
in fungi. Triggering events, usually nutrient exhaustion, biosynthesis
or addition of an inducer, and/or a decrease in growth rate, generate
signals which effect a cascade of regulatory events resulting in chemical
differentiation (secondary metabolism) and morphological differentiation
(morphogenesis). Microbial secondary metabolites have a major effect on
the health, nutrition and economics of our society. They include
antibiotics, pigments, toxins, effectors of ecological competition and
symbiosis, pheromones, enzyme inhibitors, immunomodulating agents, receptor
antagonists and agonists, pesticides, antitumor agents and growth promoters
of animals and plants. Among them, fungal secondary metabolites
are of intense interest due to their pharmaceutical (antibiotics) and/or
toxic (mycotoxins) properties. Unlike those of primary metabolism, the
pathways of secondary metabolism are still poorly understood and
thus provide opportunities for basic investigations of enzymology, control
and differentiation. Recently, tremendous progress has been made
in understanding the genes that are associated with the production of
various fungal secondary metabolites. For example, work with Aspergillus
species has revealed a link between asexual reproduction and the production
of toxic secondary metabolites. One of the best-studied fungal
secondary metabolic processes is the biosynthesis of melanin.
Using similarity searches, we took advantage of
the whole genome sequence to test for the presence of this important
genetic capacity in P. marneffei. Six genes are known for DHN-melanin
biosynthesis in A. fumigatus: abr2, abr1, ayg1, arp2, arp1 and alb1
[318]. The functions or gene products of these genes are given in Table 4.2;
note that the function of ayg1 is unknown. All of these genes are available
from GenBank, and the gene order has been determined by a previous genetic
study [318] and further confirmed by the A. fumigatus genome project.
The gene order is abr2-abr1-ayg1-arp2-arp1-alb1 (Fig. 4.2).
When the amino acid sequences of the proteins encoded by these 6 genes
were used as queries against the P. marneffei genome, significant hits were
obtained for all 6 proteins. When the predicted peptides of the
corresponding contigs were compared to the amino acid sequences of the
corresponding query proteins, the E-values of the 6 comparisons ranged from
5E-13 to 0 (Table 4.2), indicating high levels of similarity between the P.
marneffei proteins and the A. fumigatus proteins. In A. fumigatus, abr1
encodes a multicopper oxidase and abr2 encodes a laccase. We detected
weak sequence similarity (60% alignable overlap with 30% amino-acid
positive similarity) between the two genes at the amino-acid level. This
weak similarity suggests that the two genes are paralogs that originated
by gene duplication. In addition, we collected abr1 or
abr2 homologs from some other fungal species and built a multiple alignment
of the gene family (Fig. 4.1), which indicates how the gene family has
diverged.

Table 4.2: Putative gene products related to melanin biosynthesis in P. marneffei.

Af protein (Acc. No.)  Pm protein  Length (aa), Af/Pm  E-value  Identity / Positive (%)  Overlap length (aa)  Function
abr1 (AAF03353)        pm-abr1     664/555             0.0      60/77                    528                  brown 1
ayg1 (AAF03354)        pm-ayg1     406/403             e-140    57/71                    403                  yellowish-green 1
arp2 (AAF03314)        pm-arp2     273/275             8e-95    63/74                    254                  1,3,6,8-tetrahydroxynaphthalene reductase
arp1 (AAC49843)        pm-arp1     168/208             2e-81    77/91                    160                  scytalone dehydratase
abr2 (AAF03349)        pm-abr2     587/526             0.0      55/73                    505                  brown 2
alb1 (AAC39471)        pm-alb1     2146/1568           0.0      59/71                    1639                 polyketide synthase
Figure 4.1: P. marneffei abr1 gene Cu-oxidase domain homologues.
Alignment of partial amino acid sequences of Cu-oxidase domains of ascomycetes.
More importantly, the synthases of secondary metabolism are often
encoded by genes clustered on chromosomal DNA. It has been suggested
that such an organisation of genes may allow coordinated regulation of
the pathway [337]. The 6 melanin biosynthesis genes are located in a gene
cluster in P. marneffei (Fig. 4.2). The gene order is largely conserved when
compared to that of A. fumigatus. In P. marneffei, abr1-ayg1-arp2-arp1
are located on one contig, and abr2 and alb1 on two other contigs.
Scaffolding suggests that these 3 contigs belong to a single scaffold,
within which they are ordered one after another, i.e. uninterrupted
by other contigs. Therefore, the gene order in P. marneffei can be inferred
as abr1-ayg1-arp2-arp1-abr2-alb1. This placement is supported by
5 and 6 pairs of forward-reverse paired reads spanning the 2 gaps between
the 3 contigs, respectively; it is therefore likely that the 6 genes are
correctly ordered and that the length of the gene cluster can be closely
approximated. As shown in Fig. 4.2, the 6 genes span over 35 kb of the
P. marneffei genome, about twice the length of the cluster in A. fumigatus
(19 kb). The majority of this difference is due to a gene-free region of
> 15 kb between abr2 and alb1 (Fig. 4.2). Comparing the gene order in the
two fungi, the only change is that abr2 has moved from the beginning of the
cluster (as in A. fumigatus) to a position after arp1 in P. marneffei. In
addition, the direction of alb1 is reversed. The tendency of genes for
enzymes of certain metabolic pathways to be clustered in filamentous fungi
has been noted previously [161]. Generally these gene clusters encode
optional pathways for nutrient utilisation (e.g., the optional carbon
source, quinate) [107] or for synthesis of secondary metabolites (e.g., the
mycotoxin, sterigmatocystin) [28]. Unlike genes clustered as operons in
prokaryotes, clusters of similar genes in fungi are not cotranscribed, nor
has any vital regulatory function for clustering been established [161].
Thus the reason for the existence of gene clusters in filamentous fungi has
not been resolved.
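The gene-order comparison described above can be reproduced with a small sketch. It is an illustration rather than the method used in this work: it takes the two gene orders given in the text and uses a longest common subsequence (a standard dynamic-programming technique, chosen here for simplicity) to find the genes whose relative order is conserved; any gene outside that subsequence has changed position. Gene orientation, such as the reversal of alb1, is not modelled.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of two lists (dynamic programming)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Trace back through the table to recover the subsequence itself
    i = j = 0
    result = []
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Gene orders as stated in the text
A_FUMIGATUS = ["abr2", "abr1", "ayg1", "arp2", "arp1", "alb1"]
P_MARNEFFEI = ["abr1", "ayg1", "arp2", "arp1", "abr2", "alb1"]

conserved = longest_common_subsequence(A_FUMIGATUS, P_MARNEFFEI)
relocated = [gene for gene in A_FUMIGATUS if gene not in conserved]
```

With these two orders, five genes keep their relative order and `relocated` contains only abr2, in agreement with the single gene-order change noted in the text.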
4.4.2 Disrupted aflatoxin biosynthesis gene cluster in P. marneffei
With the possible exception of the penicillin metabolic cluster, the most
thoroughly examined fungal secondary metabolite gene clusters are those
involved in mycotoxin biosynthesis, particularly the aflatoxin (AF) and
sterigmatocystin (ST) biosynthetic clusters found in several Aspergillus
species [28]. In Aspergillus species, these clusters contain a total of 23
genes involved in aflatoxin biosynthesis and other related functions
(20 genes that encode enzymes, two genes that encode regulatory proteins,
and one gene that encodes an efflux transport protein). No sequence
information for cypA, norB and ordB was available from GenBank at the time
of analysis. The sequences of the remaining 20 genes, comprising 17 genes
that encode enzymes (hexA, hexB, pksA, nor-1, avnA, adhA, norA, avfA, cypX,
estA, vbs, ver1, moxY, verB, omtB, omtA and ordA), the two regulatory genes
(aflR and aflJ) and the transport gene (aflT), were downloaded. When the
amino acid sequences of these proteins were used as queries against the
P. marneffei genome, significant hits (TBLASTN E-value cutoff 1.0e-10) were
obtained for all 20 proteins. When the predicted peptides of the
corresponding contigs were compared to the amino acid sequences of the
corresponding query proteins, the BLASTP E-values of these comparisons
ranged from 5.0e-13 to 0 (data not shown), indicating high levels of
similarity between the P. marneffei proteins and the Aspergillus proteins.
Notably, the putative gene products of omtA and ordA, which are responsible
for the last step in the conversion of ST to AF, were found in P. marneffei
and are highly similar to their counterparts in A. parasiticus.

Figure 4.2: Comparison of the melanin gene clusters of P. marneffei and
A. fumigatus. Gene order shown: A. fumigatus, abr2-abr1-ayg1-arp2-arp1-alb1;
P. marneffei, abr1-ayg1-arp2-arp1-abr2-alb1. Scale bar: 5 kb.
Although putative homologues of the Aspergillus genes in the aflatoxin
biosynthesis pathway are present in the P. marneffei genome, these
genes do not form a cluster as they do in Aspergillus. This contrasts with
the general trend that genes involved in fungal secondary metabolism usually
appear as a cluster, as in the A. flavus and A. parasiticus genomes.
Since almost all of these genes lie on different contigs in the
P. marneffei genome, the homologs we identified might
be involved in the production of other, unknown secondary metabolites
rather than aflatoxin. Alternatively, major rearrangement of the aflatoxin
biosynthesis gene cluster may have occurred in P. marneffei during
evolution, which might affect the ability to produce aflatoxins and the
amount produced.
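A simple way to quantify whether a set of homologues is co-located is to group them by contig, as sketched below. The gene names, contig identifiers and both example mappings are hypothetical and serve only to show the calculation. Note that in a draft assembly genes on different contigs may still be neighbours in the genome (as with the melanin cluster, where scaffolding linked three contigs), so co-location on one contig is a conservative proxy for clustering.

```python
from collections import defaultdict


def genes_by_contig(hits):
    """Group gene names by the contig on which their homologue was found."""
    groups = defaultdict(list)
    for gene, contig in hits.items():
        groups[contig].append(gene)
    return dict(groups)


def largest_group_fraction(hits):
    """Fraction of genes that sit on the single most populated contig."""
    groups = genes_by_contig(hits)
    return max(len(genes) for genes in groups.values()) / len(hits)


# Hypothetical mappings of genes to contigs (not real assembly data)
CLUSTERED = {"geneA": "contig_1", "geneB": "contig_1",
             "geneC": "contig_1", "geneD": "contig_1"}
DISPERSED = {"geneA": "contig_1", "geneB": "contig_7",
             "geneC": "contig_12", "geneD": "contig_7"}

clustered_fraction = largest_group_fraction(CLUSTERED)
dispersed_fraction = largest_group_fraction(DISPERSED)
```

A fraction close to 1 is what a compact cluster would give; a low fraction corresponds to the dispersed arrangement described for the aflatoxin homologues.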
4.4.3 Absence of penicillin biosynthesis genes in P. marneffei
Genomic sequence provides evidence for the presence of genetic components,
such as the melanin biosynthesis gene cluster. It also provides evidence for
the absence of important genetic components, which is equally valuable. The
beta-lactam antibiotic penicillin, one of the most commonly used antibiotics
for the therapy of infectious diseases, is produced as an end product by
some filamentous fungi, such as Penicillium chrysogenum. Penicillin
biosynthesis is catalysed by three enzymes encoded by three genes, acvA
(pcbAB), ipnA (pcbC) and aatA (penDE), which are organised in a gene cluster.
Although the production of secondary metabolites such as penicillin
is not essential for the direct survival of the producing organisms, several
studies have indicated that penicillin biosynthesis genes are controlled by
a complex regulatory network responding to, e.g., ambient pH, carbon source,
amino acids and nitrogen. Most notably, this gene cluster is present in
A. nidulans, which is a penicillin producer.
In conclusion, the coding capacity for a set of proteins that could be
involved in melanin biosynthesis has been identified and reported
here. The presence of these homologues suggests the potential for
biosynthesis of melanin or melanin-like substances in P. marneffei.
Since melanin is a well-defined fungal virulence factor, it is reasonable to
infer that it is also a virulence factor in P. marneffei, although
experimental confirmation is required. In addition, although putative
homologues of the Aspergillus genes in the aflatoxin biosynthesis pathway
are present in the P. marneffei genome, these genes do not form a cluster
as they do in Aspergillus. They might be involved in the production of
other, unknown secondary metabolites.
Chapter 5
MATING ABILITIES IN PENICILLIUM MARNEFFEI
Penicillium marneffei was believed to be asexual, but genome
sequence analysis suggests that the fungus maintains the genetic capability
for sexual reproduction. If confirmed, this raises the potential for
developing powerful genetic tools for the organism, with far-reaching
implications for its genetic study and disease control.
5.1 Introduction
The most distinctive feature of Penicillium marneffei is its
temperature-dependent dimorphic switch. At 25°C P. marneffei exhibits true
filamentous growth, while at 37°C it undergoes a dimorphic transition to
produce uninucleate yeast cells that divide by fission. The control of this
“dramatic” developmental process is of interest because it is required for
pathogenicity and may therefore provide a target for controlling infection.
Fungal dimorphic growth and mating are regulated by common
signal transduction pathways, such as the mitogen-activated protein kinase
pathway and the nutrient-sensing cAMP pathway. Studies of development in
many fungi have converged to define these conserved pathways,
which are organised in different ways to regulate filamentation, mating
and virulence in different fungi as they adapt to unique environmental
challenges [192]. Given such a common regulatory mechanism, it is not
surprising to find an association between the mating process and virulence
in some fungi. For example, MAT α strains of Cryptococcus neoformans
are 30-fold more prevalent in the environment and 40-fold more prevalent
in infections than MAT a strains [183, 193]. Candida albicans utilises a
number of the same genes for both mating and pathogenesis. The mating
pheromone of C. albicans elicits over-expression of a set of virulence
genes in recipient cells [16]. Proteins encoded by these genes were
previously shown to be required for virulence in a mouse model of
disseminated candidiasis. Therefore, it is of particular interest to
understand the P. marneffei mating system, which may parallel the dimorphic
development and pathogenesis of this medically important fungus.
Traditionally, P. marneffei has been considered an asexual (anamorph)
ascomycete that lacks an apparent sexual (teleomorph) stage in its life
cycle and seems to reproduce only mitotically [44, 104]. Recent genetic
studies, however, suggest it may have an unidentified sexual cycle [20, 19].
Two homologs of the Aspergillus nidulans steA and stuA genes, stlA and
stuA, have been cloned from P. marneffei [20, 19]. Both steA and stuA
are involved in controlling mating in the sexual, homothallic A. nidulans.
The stlA gene plays no apparent role in vegetative growth, asexual
development or dimorphic switching in P. marneffei, and is able to
complement the sexual defect of an steA mutant of A. nidulans [19]. The
P. marneffei stuA gene encodes a basic helix-loop-helix (bHLH) protein of
the APSES family and is thought to regulate both dimorphic growth and
mating or asexual sporulation. Loss of stuA from P. marneffei had
no obvious effect on dimorphic growth, and P. marneffei stuA is able
to complement the conidiation defect of an A. nidulans stuA mutant [20].
Moreover, the P. marneffei tupA gene, a homolog of rcoA, is able to
complement both the asexual and sexual development phenotypes of an A.
nidulans rcoA deletion mutant [315]. This indicates that the sexual
function of tupA has been retained in P. marneffei. Although the presence of
these highly conserved P. marneffei homologs of A. nidulans genes
suggests the presence of an undiscovered mating system in P.
marneffei, the mating process requires a comprehensive network of genes
functioning coordinately. Therefore, the finding of a complete mating
gene repertoire in P. marneffei would be stronger evidence for
the presence of a sexual stage in the fungus.
The genome sequence has now enabled us to conduct a
search for mating-related genes in the P. marneffei genome in order to
reveal the potential mating system of this important dimorphic fungal
pathogen. Similar studies have been carried out in C. albicans, which
was thought to be constitutively diploid and to reproduce only asexually
[138]. The complete genome sequence predicted that a mating system existed
in C. albicans, following the identification of numerous highly conserved
homologs of S. cerevisiae mating genes [190, 259, 272]. Eventually, two
research groups demonstrated that C. albicans can be induced to
mate under certain conditions [139, 213].
The sexual cycle provides valuable genetic tools for fungal study.
If a fungus has a sexual cycle, we can screen for mutants arising from
recombination events during meiosis and gamete formation, followed by zygote
formation. In the case of P. marneffei, the absence of a known sexual stage
has handicapped biological studies of this fungus. The genome sequence
analysis reported in this chapter, however, provides encouraging
information: many homologs of sexual cycle-related genes have been identified
in the P. marneffei genome, suggesting a potential mating ability in
this important pathogenic fungus, even though a sexual state has not
been reported. Practically, this discovery might open the door to simple
and efficient procedures for obtaining sexual recombinants of P. marneffei
that will be useful for genetic analyses of pathogenicity and other
traits.
5.2 Literature Review
Studies on mating type in fungi have been helpful for the understanding
of many eukaryotic regulatory pathways, including cell cycle regulation,
cellular and nuclear identity, and signal transduction. Most
ascomycetes have only two different mating types; their MAT locus encodes
transcription factors that regulate mating-type-specific genes involved in
pheromone production, pheromone sensing and signal transduction [94].
Some ascomycetes are asexual, while many others have adopted different
reproductive strategies: heterothallic, homothallic and, less frequently,
pseudohomothallic (Table 5.1). In homothallic species, homokaryotic
haploid strains are self-fertile and complete the sexual cycle without
seeking a mate. This diversity is so extensive that even species within the
same genus, such as Neurospora, adopt either homothallic or heterothallic
modes. More strikingly, a recent study found that α cells of the
heterothallic C. neoformans can reproduce sexually via fruiting,
without fusing with a partner of the opposite mating type.
5.2.1 Mating in hemiascomycete yeasts
The mating-type locus has been well studied in the ascomycete S. cerevisiae.
The two haploid cell types of S. cerevisiae, denoted α and a, are
determined by their MAT loci. A pheromone-mediated fusion process creates
a diploid cell (a/α), which, under starvation conditions, can
undergo meiosis to form four haploid cells, two of which
are a and two α. The α and a mating-type loci each contain two
divergently transcribed genes: α1 and α2, and a1 and a2, respectively. The
a1 and α2 proteins are transcriptional repressors (when both are present)
and both contain a homeodomain DNA-binding motif [284]. The α1 protein
has been shown to be a transcription activator [278], but its DNA-binding
domain (the α-box) has yet to be characterised in detail. The function of
a2 is unknown. The a1 and α1 proteins are encoded by totally dissimilar
sequences of 642 and 747 bp, respectively, while the a2 and α2 sequences
are partially similar [227, 299]. S. cerevisiae is basically heterothallic;
however, a homothallic breeding system can be achieved through
mating-type switching, in which an S. cerevisiae α haploid cell switches
to the opposite mating type a, or vice versa [132]. This is caused by gene
conversion between the MAT locus and two MAT-like loci during cellular
division of haploid cells [120]. The molecular basis of the gene conversion
is the presence of two MAT-like cassettes, HMR and HML. Normally
they are transcriptionally repressed through silencing by the formation of
a specialised compacted chromatin structure. They are both flanked
by “silencers”, short specific sequences that are binding sites for
DNA-binding proteins and are also involved in transcriptional activation and
DNA replication (for recent reviews, see [105, 117]). Moreover,
haploid-specific gene products, such as the HO endonuclease, are involved in
repression of meiosis and mating-type switching [120].
5.2.2 Mating in filamentous ascomycetes
These mating systems include many conserved components, such as gene
regulatory polypeptides and pheromone/receptor signal transduction cascades, as well as conserved processes, like self-nonself recognition and
controlled nuclear migration. The mating systems in filamentous ascomycetes share similar components and processes with those in yeasts
but they exhibit many unique properties. First, the sequence dissimilarity between two alternate mating-type alleles is more pronounced in
filamentous ascomycetes. Usually they consist of unrelated and unique
sequences. Second, the mating-type switching mechanism of filamentous
ascomycetes is unknown but different from that of yeast. Filamentous
ascomycetes exhibit great stability of the mating type, which might be
due to the lack of additional copies of mating-type sequences outside
the mating-type locus. The additional copies of the mating-type locus in
yeasts are usually silent copies facilitating mating type switching through
gene conversion.
Among filamentous ascomycetes, the structure of the components and
genetic arrangements of their mating type loci vary greatly. Neurospora
Table 5.1: Mating strategies adopted by ascomycetous fungi, the presence
of mating type gene and ability in switching between mating types.
Species
Mating strategy
S. cerevisiae
C. glabrata
Kluyveromyces lactis
Homothallic
Asexual?
Heterothallic, some
homothallic strains
Homothallic
Kluyveromyces
waltii
Ashbya gossypii
Debaryomyces
hansenii
Yarrowia lipolytica
Neurospora crassa
Podospora anserina
Bipolaris sacchari
Neurospora intermedia
S. almonella
C. neoformans
Mating
type
gene
Y
Y
Y
Switching
Y
NA
Y
Y
Y
Asexual?
Homothallic
Y
Y
Y
Y
Heterothallic
Homothallic
Pseudohomothallic
Asexual
Heterothallic
Y
Y
Y
Y
Y
Y
NA
NA
NA
NA
Heterothallic
Heterothallic
Y
Y
Y
N
crassa and Podospora anserina are two representative ascomycetes whose
mating systems have been well characterised at the molecular level.
In N. crassa, mat a-1 and mat A-1 are the two genes responsible for a
and A mating specificity, respectively. Two additional genes, mat A-2 and
mat A-3, are present in opposite orientations in the region adjacent to
mat A-1. In P. anserina, FPR1 is the only gene present in the
mat+ idiomorph and is sufficient to induce fertilisation; in contrast, FMR1
and two additional genes, SMR1 and SMR2, are required for the mat-
strain to develop perithecia to maturity.
Heterothallic species require a partner for mating, whereas homothallic
species are able to self-mate. The difference between heterothallic
and homothallic species is not due to the presence or absence of
mating-type genes. Sequences similar to mating-type genes have been
identified and functionally characterised in all the species tested, whether
they are heterothallic or homothallic. Mating-type genes are even present in
asexual species; for example, the asexual Bipolaris sacchari has a homolog of
the MAT-2 gene of the related species C. heterostrophus. The process of
sexual development is identical in homothallic and heterothallic species.
Homothallic filamentous ascomycetes, even though individual nuclei contain
both types of mating-type information, could be functionally heterothallic
through a proposed mechanism allowing alternate expression of either mating
type.
Mating may serve as a model for the study of developmental genetics
and could help in elucidating regulatory mechanisms of multicellularity
and sexual dimorphism. Mating systems are divergent in ascomycetes.
The presence of mating-type genes does not determine the mode of sexual
reproduction. Because changes in the mode of sexual reproduction are
frequent and disruption of sexual function is tolerated in ascomycetous
fungi, the presence or absence of particular genetic components of
the mating system is not necessarily a good indicator of which
reproductive mode a fungus has adopted.
5.3 Materials and Methods
Protein sequences of fungal sex-related genes downloaded from GenBank
were used as queries against the P. marneffei genome sequences. The
comparison was conducted using the NCBI TBLASTN program 2.0 with
the BLOSUM62 scoring matrix [6]. The E-value cutoff used to assign
homologues was 1.0e-20. The contigs of the P. marneffei genome that
contained homologues were extracted and annotated manually. Each
annotated gene was given a sequentially assigned locus number of the form
Pm## that identifies it uniquely. Each gene also has a
version attribute (so loci are in fact displayed as Pm##.version). Predicted
peptides were compared to the amino acid sequences of their corresponding
query proteins using NCBI BLAST2SEQ
(http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html). The statistics of the
expect value were calculated based on the size of the NCBI non-redundant
protein database. Conserved domains/motifs were identified using InterPro
release 5.1 [367]. Multiple alignments of amino acid sequences were performed
using the program ClustalX 1.81 [311]. Adjustments to the alignments
were performed manually. Graphic presentations of the alignments and
consensus sequences were produced using the program BOXSHADE 3.21
(http://www.ch.embnet.org/software/BOX_form.html).
In addition to the degree of sequence similarity, several lines of
supplementary information were used to further support gene homology. These
include: (i) conserved positions of intron(s) between homologs, which
argue for a common ancestor of the genes studied; (ii) phylogenetic trees
constructed from aligned genes, so that the closest homolog can be
identified when paralogous genes are present; (iii) features
characteristic of the family to which a gene belongs.
Phylogenetic trees were inferred by the neighbour-joining method
[273]. Genetic distances between protein sequences were estimated using
the WAG amino-acid substitution model [342] implemented in MBEToolbox
(Chapter 10).

Figure 5.1: Comparison of the mating-type loci in P. marneffei and other
fungi. Boxes interrupted by gaps represent the coding sequences of the
genes and the introns, respectively. Arrows indicate the directions of
genes. Dashed lines indicate that the genes linked together are present in
the genome of the same isolate. Symbols: dark-gray bar, conserved HMG-box
domain; light-gray bar, conserved alpha-box motif. Loci shown: A. fumigatus
AfMAT-2 (Af59.m09249); A. nidulans AnMAT-2 (AF508279/AN4734.2*) and AnMAT-1
(AY339600/AN2755.2); P. marneffei PmMAT-1 (Pm1.126); N. crassa mat a-1
(M54787), mat A-1, mat A-2 and mat A-3; S. cerevisiae MATa1, MATalpha1 and
MATalpha2; S. pombe mat1-P, mat1-M, mat2-P and mat3-M.
5.4 Results and Discussion
The close relationship between the genera Penicillium and Aspergillus has
been well established on the basis of various sources of evidence. It is
further supported by our recent comparative study of the mitochondrial
genome of P. marneffei and those of other fungi (Chapter 3). This
relationship prompted the search for previously undiscovered characteristics
in P. marneffei based on our knowledge of the various Aspergillus species.
5.4.1 Homologs of known sexual genes
With respect to the potential mating system of P. marneffei, A. nidulans
is of particular interest, as this model species has two distinct
reproductive developmental processes: sexual and asexual development.
We used a set of empirically selected A. nidulans genes involved in sexual
development as queries to identify their homologs in P. marneffei.
These genes are veA, medA, tubB, phoA and nsdD. The veA gene was
first known to mediate the light response as early as 1965 [156]. It was
later found to be required for cleistothecium and ascospore formation as
well [159]. The veA1 mutant is unable to develop sexual structures, and
asexual sporulation in the veA1 mutant is promoted and increased [164],
implying that the veA gene plays a key role in activating sexual
development and/or inhibiting asexual development. A. nidulans medA
(GenBank Acc.: AAC31205) encodes a transcriptional regulator of sexual and
asexual reproduction. tubB, one of two genes encoding alpha-tubulin, is
involved in the processes of karyogamy and meiosis I [167, 168], but it
is not required for vegetative growth or asexual reproduction, nor is it
required for the initiation or early stages of sexual differentiation. The
gene nsdD encodes a GATA-type transcription factor that functions in
activating sexual development [124]. The gene phoA [33], like stuA [222],
is involved in the biosynthesis of tryptophan and has been identified as
being involved in sexual development [77, 314, 355].
As in A. nidulans veA, the predicted P. marneffei veA contains one
intron with conserved boundaries. The predicted P. marneffei MedA
(741 aa) shows 49% amino-acid identity to A. nidulans MedA (600 aa)
within an alignable region of 555 aa. The predicted P. marneffei tubB
and phoA are highly conserved, sharing 83 and 80% identical amino acid
residues with A. nidulans tubB and phoA, respectively. The predicted P.
marneffei NsdD consists of 385 amino acid residues and, like A. nidulans
NsdD, is rich in proline (13.8 and 11.3%) and serine (13.8 and 13.4%).
Both have type IVb C-X2-C-X18-C-X2-C zinc finger DNA-binding
domains at their C-termini.
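The two sequence features mentioned above, residue composition and the C-X2-C-X18-C-X2-C spacing, are easy to check programmatically. The sketch below is an illustration only: the peptide is a made-up toy sequence (not NsdD), and the regular expression treats X as any residue, which is an assumption about how strictly the motif should be matched.

```python
import re

# C-X2-C-X18-C-X2-C, with X taken to be any residue (assumption)
ZINC_FINGER = re.compile(r"C.{2}C.{18}C.{2}C")


def residue_percent(seq, residue):
    """Percentage of positions in seq occupied by the given residue."""
    return 100.0 * seq.count(residue) / len(seq)


def find_zinc_fingers(seq):
    """Return (start, matched_text) for each non-overlapping motif match."""
    return [(m.start(), m.group()) for m in ZINC_FINGER.finditer(seq)]


# Toy peptide, invented for illustration
TOY = "MPSPSP" + "C" + "AA" + "C" + "G" * 18 + "C" + "AA" + "C" + "SPK"

fingers = find_zinc_fingers(TOY)
proline = residue_percent(TOY, "P")
serine = residue_percent(TOY, "S")
```

Applied to a real predicted protein, the start position of a match shows whether the motif lies near the C-terminus, and the two percentages correspond to the proline and serine contents quoted in the text.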
We also identified homologs of two inhibitors of sexual processes, lsdA
and rosA, in P. marneffei. LsdA is expressed abundantly at the late
sexual developmental stage of A. nidulans. Disruption of lsdA causes the
preferential formation of sexual structures even under conditions,
such as a high salt concentration, in which sexual development in the wild
type is inhibited [191]. Hence, the lsdA gene inhibits sexual development
in the presence of sex-inhibiting environmental signals. Under low-carbon
conditions and in submersed culture, A. nidulans RosA is also a repressor
of sexual development initiation [331]. The predicted P. marneffei lsdA
encodes a 350 amino acid polypeptide which, when compared to the
356 amino acid A. nidulans LsdA, shares 43% identical and 60% similar
amino acid residues. The predicted P. marneffei RosA exhibits 57%
amino acid identity to A. nidulans RosA. The position of the larger intron
of P. marneffei rosA is the same as that in the orthologs of A. nidulans, Sordaria
brevicollis and N. crassa. At the N terminus of P. marneffei RosA,
the highly conserved Zn(II)2Cys6 motif, a putative bipartite nuclear
localisation signal and a DNA-binding domain are predicted.
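Percent identity and similarity figures of the kind quoted above are ratios over the columns of a pairwise alignment. The sketch below is a minimal illustration under stated assumptions: the residue groups used to call two amino acids "similar" are a hypothetical simplification (BLAST instead counts positions with a positive substitution-matrix score), and the toy alignment is invented. As in BLAST reports, the denominator is the full alignment length, gaps included.

```python
# Illustrative conservative-substitution groups; an assumption for this sketch,
# not the BLOSUM62-based "positives" reported by BLAST.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]

def identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) over all alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0, 0.0
    identical = similar = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # a gapped column counts in the denominator only
        if a == b:
            identical += 1
            similar += 1
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    columns = len(aligned_a)
    return 100.0 * identical / columns, 100.0 * similar / columns

# Hypothetical toy alignment
print(identity_and_similarity("MKT-LV", "MRTALI"))
```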
In summary, although studies of the molecular mechanism controlling
sexual development in filamentous fungi are very limited, several sexual
genes that have been identified, isolated and characterised from A. nidulans enable us to find their homologs in P. marneffei. This finding is in
line with the other two genes mentioned above, stuA [222] and steA [19],
that have been experimentally characterised in both A. nidulans and P.
marneffei, revealing the functional exchangeability between corresponding homologs. The presence of these faithful homologs suggests that
sexual development is potentially possible in P. marneffei. However, this
inference is not conclusive, because many sexual genes may function not
only in sexual development but also in other processes, such as secondary
metabolism. Hence, homologous sexual genes in P. marneffei might be
responsible for processes that are not related to sexual development.
Further evidence is therefore needed to draw a conclusion.
Figure 5.2: Comparison of the alpha1 domain of MAT proteins of filamentous ascomycetes. The amino acid sequence alignments are as follows:
putative P. marneffei, MAT-1 (Pm1.126); putative A. nidulans, MAT-1 (AN2755.2); N. crassa, mat A-1; Paecilomyces tenuipes MAT1-1-1;
Gibberella fujikuroi, MAT-1-1; Alternaria alternata, MAT-1; Pyrenopeziza brassicae, alpha-1 domain protein (CAA06844.1); Gibberella zeae,
MAT1-1-1; Fusarium oxysporum, MAT-1; Cochliobolus ellisii, MAT-1;
Podospora anserina, FMR1. The arrow indicates the conserved position of
introns.
5.4.2 Mating type genes
Fungi are capable of sexual reproduction by using either heterothallic
(self-sterile) or homothallic (self-fertile) mating strategies. In most ascomycetes, mating ability is controlled by a single mating type locus,
MAT, with two alternate forms (MAT-1 and MAT-2) called idiomorphs.
MAT-1 and/or MAT-2 mediate not only mating, but also several other
key processes, including secretion of and response to pheromones and
vegetative incompatibility. In heterothallic ascomycetes, these alternate
idiomorphs reside in different nuclei. In contrast, most homothallic ascomycetes carry both MAT-1 and MAT-2 in a single nucleus, usually
closely linked.
Figure 5.3: Gene organisation around the MAT locus of A. nidulans and
the putative MAT loci of P. marneffei and A. fumigatus. AnMAT-1 and
AnMAT-2 are A. nidulans MAT-1 and MAT-2, located on contigs 47 and
27 of the unfinished A. nidulans genome, respectively.
[Diagram not reproduced. Genes shown: A. nidulans contig 27: AN4732.2, AN4733.2, AnMAT-2 (AN4734.2), AN4735.2, AN4736.2, AN4737.2; A. nidulans contig 47: AN2753.2, AN2754.2, AnMAT-1 (AN2755.2), AN2756.2; P. marneffei: Pm1.124, Pm1.125, PmMAT-1 (Pm1.126), Pm1.127, Pm1.128, Pm1.129; A. fumigatus: Af59.m09500, Af59.m09250, AfMAT-2 (Af59.m09249), Af59.m09248, Af59.m09247, Af59.m09246. Labelled gene products: cytoskeleton assembly control protein; DNA lyase. Legend: 'is neighbor'; 'is homolog'.]
A. nidulans is a homothallic ascomycete. A. nidulans MAT-2 (AnMAT-2) has been previously characterised using ‘classic’ molecular biological
techniques [76], while A. nidulans MAT-1 (AnMAT-1, GenBank Acc.
BK001307) was found by similarity searching [76]. In the MIT A.
nidulans genome database, two annotated genes on different contigs, AN2755.2 and AN4734.2,
are in fact AnMAT-1 and AnMAT-2, respectively. Note that AN4734.2 is slightly different from AnMAT-2 (GenBank
Acc. AF508279), simply because the sequences come from different isolates of A. nidulans. In contrast to A. nidulans, only MAT-2 has been identified by genome analyses
of A. fumigatus [253, 326]. AfMAT-2 encodes a regulatory protein
with a high mobility group (HMG) DNA-binding domain [320], which is
the characteristic feature of MAT-2 genes. No homologue of the MAT-1 gene sequence of any of the tested fungi was found in the TIGR A.
fumigatus genomic database. This suggests that A. fumigatus is perhaps a
heterothallic ascomycete rather than a homothallic one (as all
homothallic euascomycetes so far analysed contain either only MAT-1 or
both MAT-1 and MAT-2 [252]), and that the genome sequence was from a
MAT-2 strain.
Using this pair of Aspergillus species that are closely related to P.
marneffei, the homothallic A. nidulans and the possibly heterothallic A.
fumigatus, as models, we undertook a series of MAT searches to determine
whether P. marneffei has a hypothetical MAT locus and, if so, whether P.
marneffei carries both MAT1-1 and MAT1-2 genes. Through BLAST
searches, we identified a putative mating-type (PmMAT) locus in P.
marneffei, containing a conserved homolog of the A. nidulans MAT-1
(AnMAT-1), which is denoted PmMAT-1 hereafter. The PmMAT-1
gene encodes a putative 348 amino acid polypeptide which shares 38%
similarity with AnMAT-1 (361 aa) over its full length, and exhibits 59, 60, 61
and 60% similarity to the alpha-box domains of AnMAT-1, P. brassicae
MAT-1, G. fujikuroi MAT-1 and P. anserina MAT-1, respectively. More importantly,
the intron boundaries are conserved between the putative PmMAT-1 and
other fungal MAT-1 genes (Fig. 5.2).
Despite extensive genome sequence searches, we could not identify a
MAT-2-like gene in P. marneffei. Having one mating-type gene is similar
to the situation in A. fumigatus, where, in contrast, MAT-1 cannot be
found. The other mating type gene, P. marneffei MAT-2 or A. fumigatus MAT-1, might be present in other isolates, as observed in the asexual
species Fusarium culmorum [163]. Alternatively, the other putative mating type gene could have become extinct, as observed in C. neoformans
populations and Ophiostoma novo-ulmi [356].
The former explanation seems more plausible after we identified putative mating-type loci in P. marneffei and A. fumigatus, which show
similarity to A. nidulans MAT-2 and MAT-1 regions, respectively. We
compared flanking genes of two mating-type loci to each other, as well as
to corresponding A. nidulans MAT-2 or MAT-1 regions (Fig. 5.3). Striking patterns were observed in the organisation of flanking genes where
several syntenies were identified. Comparing P. marneffei to A. fumigatus,
PmMAT-1 (Pm1.126) and AfMAT-2 (Af59.m09249) are oriented
differently, upstream of a hypothetical gene (Pm1.127 and Af59.m09250,
respectively). The mating-type gene and the gene that follows it occupy a
unique region of ∼5 kb in both P. marneffei and A. fumigatus. No significant similarity at the amino acid or nucleotide level can be detected
between the two regions. Three pairs of homologous genes flank the two
regions: the first pair encodes a homologue of the S. cerevisiae SLA2-like
cytoskeleton assembly control protein, and the other two encode a putative DNA lyase and a protein of the cytochrome c oxidase subunit
VIa family. It therefore seems likely that the non-homologous regions in
P. marneffei and A. fumigatus are the mating-type loci of their respective idiomorphic
types. The mating-type locus of the other idiomorphic type might be found
in other isolates. This suggests that P. marneffei and A. fumigatus are
heterothallic fungi.
Taken together with N. crassa, we now have the schematic organisation of mating-type loci from four filamentous fungi, whose genome
sequences are complete or almost complete (Fig. 5.1). Comparing
them with those from yeasts, we note that the mating-type DNA regions
of filamentous fungi are generally larger than in S. cerevisiae [10] or in
S. pombe [162]. In fission yeast S. pombe, the mating-type region comprises three linked loci, mat1, mat2 and mat3, which occupy about 30
kb of DNA on chromosome II [14]. The mat1 locus determines the cell
type, depending on whether it has P (for plus) or M (for minus) information. mat2-P and mat3-M loci are transcriptionally silent and act as
donors of information for switching mat1 DNA by the process of gene
conversion. There is no similar arrangement of such mating-type regions
in P. marneffei; however, it is noteworthy that there are other genes,
such as Pm6.88 in P. marneffei and AN1962.2 in A. nidulans, that have
similarity to the HMG mating-type genes. They are not ‘true’ MAT-2 family mating-type genes because they do not contain the intron with
conserved positions and some other conserved motifs, which are only seen
in the MAT-2 gene. Also, they are not located at the MAT locus, unlike
in other filamentous fungi, such as N. crassa, which may have an additional
HMG gene at the MAT-1 idiomorph involved in fertility. These extra
HMG genes cannot be silent copies of MAT genes, as seen in
the yeasts. However, they may theoretically have some role in fertility,
which will need experimental investigation [Dr Paul S. Dyer, personal
communication].
Finally, the detection of mating type genes, which play roles in sexual
signalling between compatible heterothallic isolates, yet are present in a
‘selfing’ fungus like A. nidulans, is itself noteworthy. As suggested by
Dyer [76], this observation can be interpreted as either the evolution of
heterothallic species towards a homothallic form or vice versa. Taking our
observation from the P. marneffei genome into account, we consider
the former interpretation more plausible, i.e., the homothallic A. nidulans
originated from a heterothallic common ancestor of Penicillium and
Aspergillus.
5.4.3 Mating pheromone precursor genes
The nucleotide sequences and deduced amino acid sequences of the pheromone
precursor genes from several fungi were used to search the P. marneffei genome. After intensive searches, however, no significant similarity
was found (BLAST E-value cutoff = 10). As mentioned in a previous
section (Section 1.4.2), syntenic comparisons suggest that the original
mating pheromone precursor loci may have been lost in P. marneffei. However,
we cannot exclude the possibility that P. marneffei mating pheromone
precursor genes are so highly specific that they are too divergent to be
detected by similarity searches.
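To see why even a permissive cutoff of 10 may miss short, divergent sequences, it helps to recall that a BLAST E-value is the number of hits with at least a given score expected by chance in the search space. The sketch below uses the simplified Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score. It ignores the effective-length corrections BLAST applies, and the query lengths and the search-space size (taken loosely from the 31 Mb genome size) are illustrative assumptions, not values from the searches reported here.

```python
import math

def expected_chance_hits(bit_score, query_len, db_len):
    """Simplified Karlin-Altschul expectation: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

def bit_score_for_cutoff(e_cutoff, query_len, db_len):
    """Bit score at which the simplified E-value equals e_cutoff."""
    return math.log2(query_len * db_len / e_cutoff)

# Hypothetical query lengths and an approximate search space (illustrative only)
DB_LEN = 31_000_000
for query_len in (30, 300):
    for cutoff in (10.0, 1e-5):
        needed = bit_score_for_cutoff(cutoff, query_len, DB_LEN)
        print(f"query {query_len} aa, E <= {cutoff:g}: bit score >= {needed:.1f}")
```

Under these assumptions a short query needs almost as high a bit score as a long one, yet has far fewer aligned positions from which to accumulate it, which is consistent with the caveat that divergent pheromone precursors may escape detection.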
[Figure 5.4 diagram not reproduced. Steps and predicted P. marneffei homologues shown: C-terminal CAAX modification: farnesylation (Ram1p, Pm6.49; Ram2p, Pm60.30), AAX proteolysis (Ste24p, Pm60.4; Rce1p, Pm96.20) and carboxylmethylation (Ste14p, Pm92.26); N-terminal processing: P1->P2 proteolysis (Ste24p, Pm60.4) and P2->M proteolysis (Axl1p, no match; Ste23p, Pm134.14); export (Ste6p, Pm125.22).]
Figure 5.4: Predicted P. marneffei homologues of the genes involved in
the biogenesis of the a-factor pheromones in S. cerevisiae. The a-factor
biosynthetic intermediates and the components of the a-factor biogenesis machinery are shown (see the text for more information). Several of
the a-factor intermediates can be directly visualised by SDS-PAGE and
are designated P0, P1, P2, and M [49]. The a-factor precursor contains
an N-terminal extension, a mature portion, and a C-terminal CAAX
motif, as indicated at top. During a-factor biogenesis, the unmodified
a-factor precursor (P0) undergoes C-terminal modification (prenylation,
proteolytic cleavage of AAX, and carboxylmethylation) to yield the fully
C-terminally modified species P1. Next, N-terminal proteolytic processing occurs in two distinct steps, the first (P1→P2) cleavage removing
seven residues from the N-terminal extension to yield the P2 species, and
the second (P2→M) cleavage generating mature a-factor, which is exported from the cell. The corresponding components predicted from P.
marneffei have been given. Among them, AXL1 has not been identified.
Table 5.2: Pheromone-processing enzymes encoded by the putative P. marneffei genes, as shown by a BLAST search of the P.
marneffei genome. E-value, identity and similarity are given for the overlap.
Sc protein (aa)  Pm protein (aa)  E-value  Identity         Similarity       Function
Kex1p (729)      Pm76.8 (672)     4e-057   124/350 (35%)    183/350 (52%)    Carboxypeptidase; α-factor processing
Kex2p (814)      Pm6.3 (813)      1e-154   302/774 (39%)    428/774 (55%)    Endoprotease; α-factor processing
Ste13p (931)     Pm10.77 (899)    1e-128   263/787 (33%)    399/787 (50%)    Dipeptidyl aminopeptidase; α-factor processing
Ram2p (316)      Pm60.30 (350)    5e-051   124/354 (35%)    177/354 (50%)    CaaX farnesyltransferase α subunit; a-factor modification
Ram1p (431)      Pm6.49 (635)     6e-050   114/329 (34%)    157/329 (47%)    CaaX farnesyltransferase β subunit; a-factor modification
Rce1p (315)      Pm96.20 (333)    3e-025   79/263 (30%)     132/263 (50%)    CaaX protease; a-factor C-terminal processing
Ste14p (239)     Pm92.26 (259)    1e-034   61/134 (45%)     87/134 (64%)     Prenylcysteine carboxyl methyltransferase
Ste24p (453)     Pm60.4 (456)     1e-115   202/446 (45%)    274/446 (61%)    CaaX prenyl protease; N- and C-terminal a-factor processing
Ste23p (988)     Pm134.44 (1012)  0.0      369/947 (38%)    562/947 (59%)    Metalloprotease involved, with homolog Axl1p, in N-terminal processing of pro-a-factor to the mature form
Ste6p (1290)     Pm125.22 (1262)  1e-127   335/1280 (26%)   580/1280 (45%)   ATP-dependent multidrug efflux pump of a-factor
5.4.4 Mating pheromone processing genes
The production of pheromones has provided important insights into proprotein processing in eukaryotic cells. The system has been well characterised in S. cerevisiae (for review, see [62]). A budding yeast cell
produces either a-factor or α-factor corresponding to its mating type.
Either a- or α-factor is synthesised as a precursor that undergoes multiple
maturation steps to generate its mature form. A number of S. cerevisiae
pheromone processing genes have been cloned and characterised [32]. We
used the protein sequences of all these genes in a BLAST search to identify pheromone-processing genes encoding putative homologous proteins
in P. marneffei. For all the query S. cerevisiae proteins, except Axl1p,
the corresponding P. marneffei homologs with high levels of amino-acid
similarity have been identified (Table 5.2). Hence, P. marneffei appears capable of synthesising/processing mating pheromones although
the pheromone precursor gene has not been identified by searching for
known pheromone precursor genes.
Genes involved in the processing of α-factor and a-factor are different.
In the case of α-factor, the maturation requires signal cleavage, glycosylation and proteolytic processing by three peptidases encoded by KEX2,
KEX1 and STE13. The S. cerevisiae KEX2 gene encoding kexin belongs
to the prohormone convertase family, which has been identified in many
species. The S. cerevisiae Kex2p is membrane-bound and cleaves peptide substrates at both Lys-Arg and Arg-Arg sites [26, 100]. A previous
study has shown that mutant Kex2p enzyme molecules lacking as many
as 200 C-terminal residues still retained protease activity. Although not
essential for enzymatic activity, the C-terminal cytoplasmic tail contains a
localisation signal that directs Kex2p to a late compartment of
the Golgi complex. The predicted P. marneffei Kex2p shows high similarity (55%) to S. cerevisiae Kex2p overall, but similarity at the C-terminal
residues is slightly lower; hence, the predicted P. marneffei Kex2p possibly
has protease activity but may be localised differently. The S.
cerevisiae KEX1 encodes a carboxypeptidase that cleaves the Lys-Arg residues
exposed at the C-terminus of the α-factor precursor following digestion with
kexin [60, 70, 188]. As with Kex2p, the C-terminal residues of S. cerevisiae Kex1p are not highly conserved in P. marneffei, again suggesting
a difference in peptide localisation between species. P. marneffei is predicted to have a homolog of S. cerevisiae Ste13p, a type IV dipeptidyl
aminopeptidase that trims N-terminal x-Ala dipeptides of the α-factor
precursors [154].
a-factor undergoes three major maturation stages: C-terminal modification, N-terminal modification, and export [49], which involve genes
RAM2, RAM1, RCE1, STE14, STE24/AFC1, STE23, AXL1 and STE6
(Fig. 5.4). The S. cerevisiae RAM2 and RAM1 genes encode the α
and β subunits of farnesyltransferase (FTase), respectively [129]. FTase
catalyses the addition of 15-carbon (farnesyl) groups to a-factor destined for cell membranes [260]. RAM2 and RAM1 are conserved genes
that have mammalian counterparts. RAM2 is essential to the viability of C. albicans, while RAM1 is essential to C. neoformans, indicating
that protein prenylation is an indispensable cellular process in these opportunistic yeast pathogens. The predicted P. marneffei Ram1p shows
high levels of similarity to S. cerevisiae Ram1p (51%) and to mammalian protein farnesyltransferase β subunits (e.g. 55% similarity to
rat fntb). The predicted P. marneffei Ram2p shows 50% similarity to
S. cerevisiae Ram2p, with both containing at least three PPTA (Pfam
acc. PF01239) domains at their N-termini. The S. cerevisiae RCE1
encodes an AAX prenyl protease [21]. The sequence of RCE1 contains
three potential transmembrane domains but there are no other defining
features and no significant similarity with other proteins, hence it may
belong to a novel superfamily [247]. The predicted P. marneffei Rce1p,
which is 50% similar, also contains multiple potential transmembrane
domains. More importantly, the three putative zinc-binding residues
(E156A, H184A, H248A) and Cys (C251) are all conserved. Mutating
each of these residues inactivates the protease [72]. The S. cerevisiae
STE14 encodes a carboxyl methyltransferase that methylates a-factor.
The predicted P. marneffei Ste14p, containing multiple predicted transmembrane spans, shares 64% similarity with S. cerevisiae Ste14p. The
S. cerevisiae Ste24p, a membrane-associated metalloprotease, is required
for the first step of N-terminal processing of a-factor [99]. The predicted
P. marneffei Ste24p shows 60% similarity to its counterpart. Like S.
cerevisiae Ste24p, P. marneffei Ste24p (at positions 299 to 303) has a Zn-dependent metalloprotease motif (HEXXH) [304]. It also matches the
larger consensus sequence characteristic of neutral Zn metalloproteases,
and contains multiple predicted transmembrane regions. Unlike S. cerevisiae Ste24p, however, the C-terminal di-lysine motif, KKXX (K is Lys)
is replaced with KXXX in P. marneffei Ste24p. Our analysis reveals that
the predicted Ste24p homologs in A. fumigatus (AF58.m07859) and N.
crassa (NCU03637.2) also have the replacement of the di-lysine motif.
Since the di-lysine motif at the C-terminus of many proteins facilitates
their retrieval from the Golgi complex to the ER [310], it could suggest that Ste24p in S. cerevisiae is localised to the ER, but this is not
the case in P. marneffei or the other two filamentous fungi. The S. cerevisiae metalloprotease Ste23p, a member of the insulin-degrading enzyme
family, is involved in N-terminal processing of pro-a-factor to the mature
form. Axl1p is a paralog to Ste23p. In S. cerevisiae, Ste23p and Axl1p
proteins show 22% identity and 39% similarity throughout their entire
length and Ste23p performs a role at least partially redundant with that
of Axl1p in a-factor processing [1]. In P. marneffei, I identified a putative homolog of Ste23p but not Axl1p. P. marneffei Ste23p is highly
conserved, showing 59% similarity to S. cerevisiae Ste23p. We argue that
since STE23 genes are present in S. cerevisiae and P. marneffei while
AXL1 is present in S. cerevisiae only, it is possible that AXL1 was created by duplication of the gene STE23 after the separation of the two
species. Moreover, S. cerevisiae STE23 and AXL1 may be an example of
duplicate genes that undergo subfunctionalisation, through which Axl1p
gains a new role in controlling the axial budding pattern of haploid cells
while retaining partial STE23 functions in processing a-factor. Finally,
unlike α-factor, which is exported from MATα cells via the classical secretion
pathway, a-factor is pumped out of the cell by the MATa cell-specific
protein Ste6p. A homolog of Ste6p was identified in P. marneffei, with
multiple transmembrane domains and two ATP-binding domains.
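The two Ste24p sequence features discussed above, the HEXXH metalloprotease motif and the C-terminal di-lysine signal, are simple enough to test with a few lines of code. The following Python sketch is a hypothetical illustration (function names and toy sequences are invented, and no real Ste24p sequence is used). It reports the positions of HEXXH and whether the last four residues fit KKXX as opposed to KXXX.

```python
import re

def find_hexxh(protein):
    """1-based start positions of the zinc metalloprotease motif HEXXH."""
    return [m.start() + 1 for m in re.finditer(r"(?=HE..H)", protein)]

def has_c_terminal_dilysine(protein):
    """True if the last four residues fit KKXX (lysine at positions -4 and -3)."""
    return len(protein) >= 4 and protein[-4] == "K" and protein[-3] == "K"

# Hypothetical toy sequences, for illustration only
yeast_like = "MAAHEAAHLLGGKKTN"   # ends in KKXX
mould_like = "MAAHEAAHLLGGKSTN"   # ends in KXXX
for name, seq in (("yeast-like", yeast_like), ("mould-like", mould_like)):
    print(name, find_hexxh(seq), has_c_terminal_dilysine(seq))
```

Only the KKXX form named in the text is tested; other reported variants of the di-lysine signal are deliberately left out of this sketch.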
5.4.5 Mating pheromone receptor and other GPCRs
In S. cerevisiae, a- or α-factor binds to cell-type-specific receptors encoded
by STE2 or STE3. STE2 is expressed in a cells and encodes the receptor for α-factor, and STE3 is expressed in α cells and encodes the receptor for a-factor. This
binding is essential for signalling the mating process between haploid cells.
In A. nidulans, Han et al. [125] identified 9 genes, gprA∼I, belonging to
the GPCR family. Among them, gprA and gprB are putative orthologs of
STE2 and STE3. gprD is similar to the yeast glucose-sensing Gpr1p [176]
and plays a key role in coordinating hyphal growth and sexual development. Using these A. nidulans GPCRs as query genes, I identified 7 P.
marneffei GPCRs closely related to them. A phylogeny reconstructed
from a collection of fungal GPCRs gives an indication of several distinct
families. The seven P. marneffei GPCRs are distributed across all these sub-divisions.
They all contain multiple predicted transmembrane domains, which is one
of the characteristic features of GPCRs. Han et al. [125] also reported that 7
putative GPCRs have been found in the A. fumigatus genome. It would be
interesting to re-analyse this gene family when gene sequences from all
three genomes of these closely related species become available.
Our results indicate that P. marneffei might have a recent evolutionary
history of sexual recombination and might have the potential for
sexual reproduction. The possible presence of a sexual cycle is highly
significant for the population biology and disease management of the
species.
Chapter 6
EXPLORING THE GENETIC COMPONENTS
ASSOCIATED WITH THE DIMORPHISM OF
PENICILLIUM MARNEFFEI
Penicillium marneffei accommodates both complex asexual development and dimorphic switching programs, and is hence a valuable system for the study of morphogenesis and pathogenicity. The study of
the morphogenetic programs of P. marneffei has recently been greatly
facilitated by the development of molecular genetic techniques, but we
are only beginning to uncover some of the determinants that control these
events, and the comprehensive picture remains blurred. This chapter contributes to the thesis by offering a systematic exploration of the genetic
components in the genome of P. marneffei that may be responsible for its morphogenetic processes,
mainly through sequence analysis in the context of comparative genomics. This will provide insights into the biology
of P. marneffei and its pathogenic capacity.
6.1 Introduction
Dimorphism, the ability to switch between a cellular yeast form and a
filamentous form, is a common morphogenetic feature in many fungi, despite their enormous diversity in size and shape. The change of growth
form is believed to be effected by an altered programme of gene expression, which is induced by a wide range of metabolic and environmental
factors. In Saccharomyces, it is starvation for nitrogen; in Candida, it is
serum (among other things); in Ustilago, it is a putative molecular signal
from the host plant; and in P. marneffei, it is apparently temperature.
Note that environmentally conditioned dimorphism is reversible.
The yeast form is characterised by round or ovoid unicellular organisms that divide mitotically, either by budding or fission, to form two
independent daughters. Filamentous or mould forms are more complex
multicellular structures. The filaments are characterised by long,
thin, parallel-walled tubes, growing by apical extension, with occasional
branching at an angle from the original direction of growth. In contrast
to yeast, filamentous cells do not separate after nuclear division but
rather form septa between cellular units that remain physically
associated with the mother cell.
There is a growing body of evidence suggesting that morphogenesis is a crucial determinant of fungal pathogenicity in both plants and
animals. In Magnaporthe grisea, for example, MAPK and cAMP signalling promote the formation of a highly specialized infection structure,
the appressorium, which is essential for invasion into the host [223]. Most dimorphic fungal pathogens, including P. marneffei, Blastomyces dermatitidis, Coccidioides immitis, Histoplasma capsulatum and Paracoccidioides
brasiliensis, typically enter the body as spores or, possibly, mycelial fragments via the lungs and grow in yeast forms in the body. Pathogenic
Cryptococcus neoformans has been shown to form self-fertilising, diploid
strains that are thermally dimorphic [286]. Aspergillus fumigatus spores
establish invasive disease in lung tissue exclusively by hyphal development.
Because of the prevalence of dimorphism among human pathogenic
fungi, it is of interest and importance to identify the molecules necessary for the morphologic switch. However, the mechanism of thermal dimorphism of P. marneffei remains unknown. Nevertheless, since fungal
dimorphism has been seen by many investigators as a useful model of differentiation in eukaryotic systems, significant progress has been achieved
in the study of fungal morphogenesis in other fungi. The approach of
this chapter is to review this progress (especially experimental developments) achieved in recent years in the field of fungal genetics. These
developments have suggested models and hypotheses for understanding the
regulation of the molecular mechanisms involved in fungal differentiation. Comparative sequence analysis is then used to explore the genetic
components that may be involved in the morphogenesis of P. marneffei.
Specifically, we would like to know whether P. marneffei possesses specific (probably temperature-sensitive) cellular sensors to detect external
stimuli, unique signal transduction pathways that translate the
external stimuli into biochemical messages that alter gene expression
levels, or an enhanced ability for structural reorganization resulting in
morphological change.
It is noteworthy that the comparative genomics approach adopted
in this chapter is impaired by the lack of genome sequence information
from other true dimorphic fungi. Moreover, even if the genome sequences of
Blastomyces dermatitidis, Coccidioides immitis, Histoplasma capsulatum
or Paracoccidioides brasiliensis were available, the comparative
genomics approach might still be handicapped by the large genetic
distance between P. marneffei and these divergent species. The following analysis is therefore limited mainly to comparisons between P.
marneffei and Aspergillus species.
6.2 Materials and Methods
6.2.1 Sequence similarity
To identify homologous genes in the P. marneffei genome, protein sequences derived from target genes were used as queries against the P. marneffei
genome. Sequence similarity searches were performed using BLASTP or
PSI-BLAST against selected fungal genomes downloaded from GenBank.
The searches were also performed against an in-house database composed
of whole-genome sequences of several fungal species from finished and
ongoing sequencing projects. The comparison was conducted using the
BLOSUM62 scoring matrix [6]. The E-value cutoff used to assign homologues was 1e-5, unless otherwise stated. Conserved domains/motifs
were identified using InterPro release 5.1 [367].
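As an illustration of how an E-value cutoff of this kind can be applied, the Python sketch below filters BLAST results in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score) and keeps the best-scoring hit per query. It is a minimal sketch, not the pipeline used in this thesis: the function name, the choice to keep a single best hit, and the sample records are all hypothetical.

```python
E_VALUE_CUTOFF = 1e-5  # cutoff stated in the text

def best_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Map each query to its best (subject, E-value, bit score) at or below the cutoff."""
    best = {}
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bit_score = float(fields[10]), float(fields[11])
        if evalue > cutoff:
            continue
        if query not in best or bit_score > best[query][2]:
            best[query] = (subject, evalue, bit_score)
    return best

# Hypothetical records with invented identifiers and values
sample = "\n".join("\t".join(row) for row in [
    ["queryA", "subj1", "45.0", "446", "200", "5", "1", "440", "3", "448", "1e-115", "410"],
    ["queryA", "subj2", "28.0", "120", "80", "3", "10", "129", "40", "159", "2e-07", "52.0"],
    ["queryB", "subj3", "25.0", "60", "45", "0", "5", "64", "7", "66", "0.5", "30.1"],
])
print(best_hits(sample))
```

Here queryB is dropped because its only hit has an E-value above the cutoff, which mirrors the "no match" outcome reported for some queries.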
6.2.2 Phylogenetic Analysis
Protein sequences were aligned using PROBCONS [71] and columns of
low conservation were removed manually. Phylogenetic trees were inferred
by the neighbour-joining method [273]. The alignments were also used
to infer maximum-likelihood trees. The maximum-likelihood trees were
constructed using the PHYLIP package [86], applying the JTT substitution model with a gamma distribution (alpha = 0.5) of rates over
four categories of variable sites. In general, the maximum-likelihood and
neighbour-joining trees were congruent.
6.3
Results and Discussion
It has long been assumed that morphogenesis and virulence are associated
in dimorphic fungi, as one morphotype exists in the environment or during commensalism, and another within the host during the invasive process.
For instance, P. marneffei lives outside the host as an environmental saprophytic mould. Its primary infectious form may be conidia or mycelial
fragments aerosolised from disturbed soil or animal excreta. After entering the host via the respiratory route upon inhalation, the cells rapidly
convert to the yeast form. The same is true of other dimorphic fungi,
such as B. dermatitidis, C. immitis, H. capsulatum and P. brasiliensis.
From the perspective of the fungal cell, the phenomenon of dimorphic
switching can be divided into four interwoven events as follows [275]:
(i) perception of external stimuli by cellular sensors; (ii) transduction of
biochemical signal; (iii) alteration of the genomic expression, and (iv)
structural reorganization towards the morphological change.
6.3.1 Perception of external stimuli by cellular sensors
Table 6.1: GPCR family in P. marneffei and A. nidulans. … ortholog
relationship supported by synteny; „ no phenotypic changes
when knocked out. Abbreviations: Pm - P. marneffei, An - A. nidulans, Sc - S.
cerevisiae, Af - A. fumigatus, and Sp - S. pombe.
Family
1
2
3
4
5
An gene
gprA (AN2520.2)
gprB (AN7743.2)
gprC (AN3765.2)„
gprD (AN3387.2)
gprE (AN9199.2)„
gprF (AN5721.2)…
gprG (AN5720.2)„
gprH (AN8262.2)
gprI (AN8348.2)
Pm gene
Pm198.6…
Pm20.41…
Sc/Af homolog
Ste2
Ste3
Sp homolog
Map3
Pm14.37
Gpr1
Git3
Pm105.27…
Pm34.71
Pm58.4
Pm31.53
AF54.m07020…
Stm1
AF53.m04209…
Limited information about cellular sensors that detect external stimuli (especially temperature) is available for ascomycetes. Among known
receptors, G protein-coupled receptors (GPCRs) are key components of
heterotrimeric G protein-mediated signalling pathways. The receptors
detect environmental signals and confer rapid cellular responses. The
GPCR family has expanded in the genome of Aspergillus nidulans,
as shown in recent analyses of the A. nidulans genome: 9
genes (gprA∼gprI) predicted to encode seven-transmembrane-spanning
GPCRs have been identified [125]. Among them, the gprD gene was found
to play a central role in coordinating hyphal growth and sexual development. Deletion of gprD causes extremely restricted hyphal growth,
delayed conidial germination and uncontrolled activation of sexual development resulting in a small colony covered by sexual fruiting bodies. We
identified 7 P. marneffei GPCRs closely related to A. nidulans GPCRs
(Table 6.1). The phylogenetic tree of fungal GPCR family genes (Fig.
6.1) helps the assignment of these putative P. marneffei GPCRs into
their corresponding sub-families.
Figure 6.1: Phylogenetic tree of fungal GPCR family genes. Classification of fungal GPCR families was carried out by analyses of P. marneffei
Pm198.6, Pm20.41, Pm14.37, Pm105.27, Pm34.71, Pm58.4 and Pm31.53,
A. nidulans GprA∼GprI, A. fumigatus Af54.m07020, Af53.m04209, Saccharomyces cerevisiae Ste2p, Ste3p, Gpr1p, Schizosaccharomyces pombe
Mam2p, Map3p, Git3p, Stm1p, Dictyostelium discoideum cAR1p and
crlAp (GenBank Acc.: AAO62367) using PROBCONS [71]. Algorithm
parameters: Gaps/Missing data - Pairwise Deletion; Distance method
- Amino Gamma Model [Pairwise distances]; Tree making method - Neighbour-joining.
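The distance settings named in the caption can be made concrete with a small example. The sketch below assumes that "Amino Gamma Model" refers to the standard gamma-corrected amino acid distance, d = a[(1 - p)^(-1/a) - 1], where p is the proportion of differing sites and a is the gamma shape parameter, and that "pairwise deletion" means sites with a gap or missing residue are dropped separately for each pair of sequences. The caption does not state the shape parameter, so the default used here is an assumption, as are the function name and the toy alignment.

```python
def gamma_distance(aligned_a, aligned_b, shape=2.0):
    """Gamma-corrected amino acid distance with pairwise deletion.

    d = a * ((1 - p) ** (-1 / a) - 1); the default shape a = 2.0 is an assumption.
    """
    # Pairwise deletion: keep only sites where both sequences have a residue.
    # Treating X and ? as missing data is a choice made for this sketch.
    pairs = [(x, y) for x, y in zip(aligned_a.upper(), aligned_b.upper())
             if x not in "-?X" and y not in "-?X"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(x != y for x, y in pairs) / len(pairs)
    if p >= 1.0:
        return float("inf")  # distance is undefined when every site differs
    return shape * ((1.0 - p) ** (-1.0 / shape) - 1.0)

# Hypothetical toy alignment: 4 comparable sites, 2 of which differ
print(gamma_distance("MKTL-", "MRTVA", shape=1.0))
```

A matrix of such pairwise distances is the input that a neighbour-joining implementation would then cluster into a tree.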
6.3.2 Transduction of biochemical signal
Studies combining the powerful genetic and genomic tools available in
fungi (mainly in Saccharomyces) have revealed three pathways that couple
afferent signals to the dimorphic switch. Although many different
signals can induce filamentous development, the strategies for connecting
the external signal to the change in cell differentiation are broadly
conserved among the fungi. For example, studies show that distantly
related fungi, Saccharomyces (an ascomycete) and Cryptococcus (a
basidiomycete), use common STE12 family members to form filamentous
structures in response to nitrogen starvation, sharing a high degree of
conservation in the regulatory pathways that control filamentous growth.
Studies on the signalling of filamentous growth in S. cerevisiae have revealed
that four genes of the MAPK pathway that signals the mating pheromone
response are also required for filamentous growth of diploid cells and the
invasive growth of haploid cells (Fig. 1.6). These four genes are STE20,
STE11 and STE7, which encode three protein kinases that act in sequence,
and STE12, which encodes a transcription factor acting at the terminus
of both pathways. In Fig. 1.6, all four genes are marked
with asterisks, indicating that their orthologs in P.
marneffei have been identified (see also Table 6.2). The STE20 homolog
from P. marneffei, pakA (GenBank Acc. AY621630; Pm80.15), is known
to be essential during yeast but not hyphal growth (Boyce KJ et al.,
personal communication). The STE12 homolog, stlA, has been cloned [19].
The P. marneffei stlA gene, together with the A. nidulans steA and C.
neoformans STE12alpha genes, forms a distinct subclass of STE12
homologs that have a C2H2 zinc-finger motif in addition to the homeobox
domain that defines STE12 genes. The stlA gene had no detectable role in
vegetative growth, asexual development, or dimorphic switching
in P. marneffei. However, stlA complements the sexual defect of an A.
nidulans steA mutant [19]. These data suggest that although members
of the STE12 family of regulators are involved in controlling both mating
and yeast-hyphal transitions in a number of fungi, stlA in P. marneffei
may only play a role in controlling mating processes (see also chapter 5)
but not dimorphic switching. There may be as yet undetected compensatory
genes or pathways responsible for dimorphic switching.
Figure 6.2: P. marneffei genes in cAMP pathway. Labelled components: Ras2p
(Pm85.8); Gpa2p (Pm51.59); Cyr1p (Pm7.24); Pde2p (Pm146.17); PKA (r), Bcy1p
(Pm33.83); PKA (c), Tpk1p, 2p, 3p (Pm18.86, Pm47.4, Pm19.3); metabolites
ATP, cAMP and AMP.
Another pathway controlling filamentation in Saccharomyces is the cAMP
pathway (Fig. 6.2). Ras2p and Gpa2p are regulators of cAMP levels,
acting upstream of adenylate cyclase, Cyr1p, which in turn regulates the
formation of cAMP. This process inactivates the cAMP-dependent protein
kinase (protein kinase A, PKA), leading to enhanced filamentous growth
in Saccharomyces. Homologs of all genes involved in this pathway have
been identified in P. marneffei (Fig. 6.2 and Table 6.2).
Another regulator implicated in Saccharomyces filamentation is the
zinc-finger transcription factor Rim1p. It is activated by a proteolytic
cleavage dependent on several other RIM genes (RIM8, RIM9, RIM13). The
Rim1p homolog in Aspergillus nidulans, PacC, is also regulated by such a
proteolytic mechanism. Homologs of all these RIM genes were also identified
in P. marneffei, suggesting that this regulatory pathway exists in the fungus.
Because signal transduction pathways have been well elucidated in
Saccharomyces, the yeast has been used as a reference for the
analysis of conserved signalling pathways. However, even the most detailed
analyses in S. cerevisiae can provide only stepping stones
toward explaining key morphological features of more complex,
multicellular filamentous fungi. These mould-specific features may
include polarized hyphal growth, septation, establishment of multinucleate
cellular compartments, cell type-specific gene expression, and subcellular
localization of proteins. Furthermore, protein networks of other
fungi may even differ in their regulation of similar morphological tasks.
Hence, further studies toward an understanding of these differences at
the molecular level will remain an important task in functional analyses,
particularly of organisms, like P. marneffei, whose genomes will be
completely sequenced in the near future.
6.3.3 Alteration of the genomic expression
Elevated temperature is apparently the major environmental stimulus
that causes P. marneffei to undergo a mycelium-to-yeast
transformation. However, the influence of elevated temperature on the
overall gene expression of P. marneffei has not been studied. Nevertheless,
since survival at elevated temperatures, i.e. thermotolerance,
is a trait critical to the ability of many fungal pathogens to thrive in host
infections, a number of studies have been conducted in other fungi. For
example, two genes have been implicated in growth at elevated
temperatures in C. neoformans. Gene RAS1 (encoding a small GTP-binding
protein) regulates filamentation, mating and growth at high temperature [5].
Gene CNA1 (encoding calcineurin) is required for C. neoformans virulence
and may define signal transduction elements required
for fungal pathogenesis [236]. Homologs of both genes can be identified
within the P. marneffei genome. The P. marneffei homolog of C.
neoformans RAS1, Pm85.8, is a known P. marneffei gene (rasA, GenBank
Acc. AY232652). It has been confirmed by experiment to act upstream
of CflA (Cdc42) to regulate germination of spores and polarized growth
of both hyphal and yeast cells, while also exhibiting CflA-independent
activities [23]. For CNA1, the putative homolog, Pm119.15, encodes
a highly conserved (74% aa identity within an alignable region of 485
aa) calcineurin peptide sequence (557 aa long).
Table 6.2: Homologous genes related to signal transduction in filamentous
growth.
Sc gene | Pm gene | Function/product
MAPK pathway
STE20 (CST20) | Pm80.15 | Signal transducing kinase of the PAK family, involved in pheromone response and pseudohyphal/invasive growth pathways
STE11 | Pm129.8 | MAP kinase kinase kinase in the filamentous growth pathway
STE7 (HST7) | Pm161.15 | Serine/threonine/tyrosine protein kinase of MAP kinase kinase family
STE12 (CPH1) | Pm201.2 (stlA) | Ortholog to AN2290.2 (SteA). Members of the STE12 family of regulators are involved in controlling mating and yeast-hyphal transitions in a number of fungi
TEC1 | Pm109.16 (abaA) | Transcription factor that participates in two developmental programmes: conidiation and dimorphic growth
PSS1 | Pm41.61 | MAP kinase dedicated to filamentation pathway
FUS3 | Pm8.42 | MAP kinase dedicated to pheromone response pathway
cAMP pathway
PDE2 | Pm146.17 | cAMP phosphodiesterase, component of the cAMP-dependent protein kinase signaling system
RAS2 | Pm85.8 | Regulator of cAMP levels
GPA2 | Pm51.59 | G protein alpha subunit homologue
CYR1 | Pm7.24 | Adenylate cyclase, required for cAMP production and cAMP-dependent protein kinase signalling
BCY1 | Pm33.83 | Regulatory subunit of the cyclic AMP-dependent protein kinase (PKA)
TPK1, 2, 3 | Pm18.86, Pm47.4, Pm19.3 | Subunit of cytoplasmic cAMP-dependent protein kinase; promotes vegetative growth in response to nutrients; inhibits filamentous growth
RIM1 related
RIM1 | Pm20.42 | Rim1p is homologous to the Aspergillus nidulans transcription factor PacC, which is also regulated by proteolysis
RIM8 | Pm148.7 | Protein of unknown function, involved in the proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans PalF
RIM9 | Pm26.50 | Involved in the proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans PalI
RIM13 | Pm146.2 | Calpain-like protease involved in proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans palB
In addition to these analyses of individual gene functions, Steen
et al. have initiated a genome-wide analysis of the response of C.
neoformans to host temperature [296]. This analysis revealed differences
in the levels of responsiveness of serotype A and D strains to growth
at 25°C versus 37°C, with changes in transcript levels for histone genes,
stress-related genes, and genes encoding translation components. Nunes
et al. [234] used a Paracoccidioides brasiliensis biochip to monitor gene
expression at several time points of the mycelium-to-yeast morphological
shift. Their results revealed a total of 2,583 genes that displayed
statistically significant modulation in at least one experimental time
point. Among the identified genes, some encoded enzymes involved in
amino acid catabolism, signal transduction, protein synthesis, cell wall
metabolism, genome structure, oxidative stress response, growth control,
and development. In particular, the gene 4-HPPD, encoding
4-hydroxyphenylpyruvate dioxygenase, is highly overexpressed during
mycelium-to-yeast differentiation, and inhibition of this enzyme has been
shown to inhibit growth and differentiation of the pathogenic yeast phase
of the fungus in vitro [234]. Two copies of 4-HPPD, Pm48.10 and Pm14.48, were
identified in the P. marneffei genome.
Neither C. neoformans nor P. brasiliensis is phylogenetically closely
related to P. marneffei. Comparison of gene expression patterns with
the much more closely related Aspergillus species may be more meaningful.
Information about A. fumigatus gene expression in metabolic adaptation
to higher temperatures became available recently [233]. Nierman
et al. examined gene expression throughout a time course upon a shift of
growth temperature from 30°C to 37°C and 48°C [233]. A total of 1,926
temperature shift-responsive genes were identified. Comparative data also
indicate that high temperature responses in A. fumigatus differ from the
general stress response in yeast. We compared
these genes against the P. marneffei genome in order to identify their
homologs. Among the 1,926 genes, 1,032 (about 54%) have homologs in P.
marneffei, i.e., a majority of A. fumigatus temperature shift-responsive
genes are present in P. marneffei. Here the set of homologs was defined by
identifying unique pairwise reciprocal best hits, with at least 40% similarity
in protein sequence and less than 20% difference in length. This result
suggests that the genetic components of the high temperature response in
P. marneffei may not differ much from those in A. fumigatus.
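The homolog definition above can be illustrated with a minimal Python sketch.
It is not the pipeline used in this work: the Hit record, its field names and
the choice to measure the length difference relative to the longer protein are
assumptions made only for illustration, and ties between equally scored hits
are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise search hit (hypothetical record layout)."""
    query: str
    subject: str
    score: float
    similarity: float  # percent similarity over the alignment
    query_len: int     # protein length in aa
    subject_len: int   # protein length in aa


def best_hits(hits):
    """Map each query to its single highest-scoring hit."""
    best = {}
    for h in hits:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a, min_similarity=40.0, max_len_diff=0.20):
    """Return (a, b) pairs that are each other's best hit and pass both filters."""
    fwd = best_hits(a_vs_b)
    rev = best_hits(b_vs_a)
    pairs = []
    for query, hit in fwd.items():
        back = rev.get(hit.subject)
        if back is None or back.subject != query:
            continue  # not reciprocal
        if hit.similarity < min_similarity:
            continue  # below the 40% similarity threshold
        longer = max(hit.query_len, hit.subject_len)
        # Assumption: length difference is taken relative to the longer protein.
        if abs(hit.query_len - hit.subject_len) / longer >= max_len_diff:
            continue
        pairs.append((query, hit.subject))
    return sorted(pairs)
```

Calling reciprocal_best_hits with the A. fumigatus versus P. marneffei hits and
the reverse search would return the retained gene pairs; the two thresholds are
keyword arguments so that their effect on the homolog count can be explored.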
The experiments mentioned above identified temperature shift-responsive
genes that may play a role in the structural or metabolic
changes that take place during morphogenesis or may be necessary for
colonisation and survival in the host. However, directly inferring the
roles of P. marneffei homologs from temperature shift-responsive
genes in other fungi may not be reliable. Moreover, very few
genetic determinants have been shown to be directly involved in
phase transition or pathogenicity. Further studies of gene expression
in P. marneffei are necessary to resolve these problems.
In addition to revealing the overall gene expression pattern,
understanding the transcriptional mechanisms that control the dimorphic
program is also important. Some of the transcription factors within known
pathways have been mentioned above. Further studies have
identified several other transcription factors that control conidiation
and dimorphic switching in P. marneffei. The P. marneffei abaA gene
(Pm109.16), encoding an ATTS/TEA DNA-binding domain transcriptional
regulator, regulates cell cycle events and morphogenesis in both
filamentous and yeast growth [18]. The stuA gene (Pm107.14), encoding
a basic helix-loop-helix transcription factor, may control processes
that require budding but not those that require fission, as in dimorphic
growth in P. marneffei [20]. TATA-binding protein (TBP) is a general
transcription factor required for initiation of transcription in eukaryotes.
The TBP-encoding gene, Tbp (Pm19.17), has been cloned and characterized
in P. marneffei [254]. Tbp is essential for P. marneffei filamentous
growth, but plays a less significant role in growth and development
during the yeast phase. Furthermore, it has been shown that transcriptional
regulation in S. cerevisiae appears to be mechanistically bipolar, i.e.,
TATA box-containing genes are predominantly involved in responses to
stress, whereas TATA-less genes are mainly associated with constitutive
housekeeping functions [12]. Only 20% of yeast genes contain a TATA
box [12]. It is therefore of interest to see whether TATA-less promoters
are also present in P. marneffei, which would suggest a need to balance
inducible stress-related responses with constitutive housekeeping
functions, or would reflect
the difference in the regulatory basis for growth and development of the
two morphological forms [254].
6.3.4 Structural reorganization towards the morphological change
It is reasonable to speculate that the mycelium-to-yeast transformation of
P. marneffei is an active process triggered by a shift in temperature. The
fungus undergoes a ‘drastic’ structural reorganisation associated with this
active process. We assume this process may be linked with a number of
phenotypic changes like those characteristic of apoptosis or programmed
cell death. Indeed, programmed cell death has been observed in both A.
fumigatus [225] and A. nidulans [313]. The metazoan upstream apoptotic
machinery is absent in fungi, whereas the downstream effectors and
regulators, both caspase-dependent and caspase-independent, seem to be
present in A. fumigatus [225]. As in animal apoptotic cells, caspase
activities are involved in the self-activated proteolysis of fungal
mycelium. Searches
of the P. marneffei genome revealed three genes (Pm105.4, Pm112.34 and
Pm205.1) encoding metacaspase proteins that could be responsible for the
caspase-like activities. Only two such genes were identified in
the A. nidulans genome. The searches also found a single gene (Pm93.8)
encoding a poly (ADP-ribose) polymerase (PARP) protein, a homologue of
the key participant in caspase-independent apoptosis in mammals. PARP
is one of the known target proteins inactivated by caspase degradation in
animal cells. PARP activity was demonstrated previously in A. nidulans
during sporulation-induced apoptosis. PARP is absent in S. cerevisiae
but present in Aspergillus. The presence of these proteins in P.
marneffei and Aspergillus is indicative of a PARP-dependent programmed cell
death pathway. In addition, homologs of the mammalian apoptotic protein
AMID are found in P. marneffei and A. fumigatus, but not in unicellular
yeasts such as S. cerevisiae, further suggesting that mechanisms of cell
death are more complex in filamentous fungi.
Analysis of the cell wall of P. marneffei is fundamental to understanding its
morphological transformation. In the mould form, the hyphal cell wall
is essential for P. marneffei to penetrate solid nutrient substrates. In
the yeast form, a transformed cell wall is essential to resist host cell
defence reactions. The cell wall protects P. marneffei against the aggressive
human defence reactions, harbours most of the fungal antigens and
represents a potential drug target. Therefore, comprehension of cell wall
biosynthesis pathways is important. We speculate that, as in many other
filamentous fungi, the cell wall of P. marneffei is built from
polysaccharide constituents comprising α- and β(1,3)-glucans, chitin,
galactomannan, and β(1,3),(1,4)-glucan. The corresponding structural
genes and genes encoding a number of enzymes, including synthases,
transglycosidases, and glycosyl hydrolases responsible for their
biosynthesis and remodelling, were identified in the P. marneffei genome
(provided on the PMGD website: www.pmarneffei.hku.hk). One of the known
differences between the yeast cell wall and the mycelium cell wall is that
β(1,6)-glucan and peptidomannan, present in yeast cell walls, are missing in
A. fumigatus [233]. β(1,6)-Glucan is a key component of the yeast
cell wall, interconnecting cell wall proteins, β(1,3)-glucan, and chitin.
The yeast genes KRE5, KRE6 and SKN1 are predicted to encode paralogous
proteins that participate in the assembly of β(1,6)-glucan. Homologs of
these three genes, Pm76.37, Pm104.21 and Pm34.5, were identified in the P.
marneffei genome, as well as in the A. fumigatus genome. Seemingly, the
specificity of the cell wall biosynthetic gene inventory in the P. marneffei
genome determines the specificity of the polymer organization of the cell
wall. Further analysis is needed to confirm this.
As a general feature of development in eukaryotes, only a small
proportion of the genome is associated with any particular morphogenetic
process. In yeast, for example, only 21-75 of the estimated 6,000 genes
were assumed to be specific to meiosis and ascospore formation. This
is also the case in P. marneffei. Therefore, the study of morphogenesis
should emphasise the regulation of differential expression or activity of
morphogenetic genes, rather than large-scale replacement
of one set of gene products by another. Gene expression
studies in P. marneffei are still lacking to date. Nevertheless, the findings
in this chapter offer new interpretive clues to the mechanisms of fungal
virulence and dimorphism. First, the signalling systems that control
dimorphism may be conserved between P. marneffei and related fungi. That is to
say, many fungal species contain orthologous genes specifying the same
pathways. Presumably, only subtle quantitative differences in the inputs
and outputs of each pathway generate the different morphologies and
behavioural characteristics. Second, dimorphism in P. marneffei may be
controlled by multiple signalling pathways. As in Saccharomyces, at least
three parallel pathways control the switch to filamentous growth. How
the fungus integrates the information from different pathways to effect a
change in cell type is not known.
In summary, morphogenesis is an essential developmental event, promoting host invasion and evasion by dimorphic fungi. Prevention of this
event may hold the key to control of infections by these fungi. Understanding the molecular mechanisms for the morphologic switch could lead
to new drug or vaccine targets that block the earliest events in colonization or infection.
Chapter 7
INTRAGENIC TANDEM REPEATS IN PENICILLIUM MARNEFFEI AND OTHER ASCOMYCETES
Tandemly repeated DNA sequences occur frequently in genomes.
Although their function and origin are not fully understood,
these highly dynamic genomic components may provide insights
into how a pathogenic fungus adapts to the host immune system.
7.1 Introduction
A tandem repeat (TR) is defined as two or more adjacent copies of
the same sequence of nucleotides and may result from tandem duplication
event(s). Over time, individual copies within a TR may undergo
additional, uncoordinated mutations so that, typically, only approximate
tandem copies are present. The number of adjacent copies in a TR can
be variable. TR lengths range from a few tens of base pairs (micro- and
mini-satellites) to megabases (larger satellite repeats).
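As a small illustration of this definition, the Python sketch below counts
adjacent exact copies of a repeat unit starting at a given position. The
function names are hypothetical, and only exact copies are matched; the
approximate copies described above would require an alignment-based scoring
scheme instead.

```python
def count_tandem_copies(seq, start, unit_len):
    """Count adjacent exact copies of the unit seq[start:start + unit_len]."""
    unit = seq[start:start + unit_len]
    if unit_len <= 0 or len(unit) < unit_len:
        return 0
    copies = 0
    pos = start
    # Step through the sequence one unit at a time while the unit repeats.
    while seq[pos:pos + unit_len] == unit:
        copies += 1
        pos += unit_len
    return copies


def is_tandem_repeat(seq, start, unit_len):
    """A tandem repeat needs at least two adjacent copies of the unit."""
    return count_tandem_copies(seq, start, unit_len) >= 2
```

For instance, count_tandem_copies("AACAGCAGCAGTT", 2, 3) counts the adjacent
CAG units that begin at position 2 of the string.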
Genomes, particularly those of eukaryotes, contain a large number of TRs.
For example, 10% or more of the human genome is composed of TRs. Simple
sequence repeats are fairly abundant in plant genomes, occurring once
in approximately every 6 kb [258]. TRs are of biological importance
for many reasons. First, they cause human diseases, including fragile-X
mental retardation, Huntington’s disease and myotonic dystrophy [288],
which are the result of a dramatic expansion in the number of copies of
a trinucleotide pattern. Second, they play a variety of regulatory and
evolutionary roles. The repeats may interact with transcription factors,
alter the structure of the chromatin, or act as protein binding sites [121,
208]. Third, they are important laboratory and analytic tools. They have
been applied in linkage analysis and DNA fingerprinting [78,340], since the
number of copies of a specific TR is often polymorphic in the population.
Last but not least, TRs play an apparent role in the development of
immune system cells in humans. Du et al. [75] showed that breakpoints
of immunoglobulin switch recombination, which occur between pairs of
switch regions located upstream of the constant heavy chain genes, cluster
in a defined subregion in three TRs.
The most interesting feature of TRs is their association with the
functional variability of a gene product. Most TRs are in intergenic
regions, but some are in coding sequences or pseudogenes. Verstrepen et
al. [328] showed that in the genome of Saccharomyces cerevisiae, most
genes containing intragenic TRs (IntraTRs) encode cell-wall proteins.
The presence of IntraTRs facilitates recombination within the gene or between
the gene and a pseudogene. The result of this increased frequency of
recombination events is an expansion or contraction of the gene size. More
importantly, this size variation creates quantitative alterations in
phenotypes (e.g., adhesion, flocculation or biofilm formation). The variation of
the fungal cell surface allows fungal microbes to ‘disguise’ themselves in
order to evade the host immune system’s defences.
Inspired by the finding of Verstrepen et al. [328], this
chapter aims to reveal the composition of IntraTRs in the genomes of
Penicillium marneffei and other related species. Using computer
programs, we searched for both long and short repeated sequences within
protein-coding regions in P. marneffei and related Ascomycetes.
Comparison of observed frequencies with expected values reveals that repeats
are enriched in the P. marneffei genome.
7.2 Materials and Methods
7.2.1 Identification of coding tandem repeats
The previously described methodology [328] was applied to find IntraTRs
in the P. marneffei genome and other fungal genomes, using the EMBOSS
ETANDEM software [263] to screen the sequences. The ETANDEM
threshold score was set to 20. All known and predicted genes were
scanned for long (≥ 40 nucleotides (nt)) or short (3-39 nt) repeats. Here
a sequence was considered to be an intragenic repeat if it met two
conditions: (i) repeat conservation was at least 85%; and (ii) the number of
repeats was at least 20 for trinucleotide repeats, 16 for repeats between
4 and 10 nt, 10 for repeats between 11 and 39 nt, and 3 for repeats of at
least 40 nt.
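The two acceptance conditions can be written down compactly. The following
Python sketch is only an illustration of the thresholds stated above, not the
code used for the screen; the function names are hypothetical, and the repeat
unit length, copy number and conservation are assumed to have been parsed
already from the repeat-finder output.

```python
def min_copies(unit_len):
    """Minimum copy number required for a repeat unit of the given length (nt)."""
    if unit_len < 3:
        return None  # units shorter than 3 nt are outside the screen
    if unit_len == 3:
        return 20
    if unit_len <= 10:
        return 16
    if unit_len <= 39:
        return 10
    return 3  # long repeats, 40 nt or more


def is_intragenic_repeat(unit_len, copies, conservation):
    """Apply conditions (i) conservation >= 85% and (ii) the copy-number rule."""
    required = min_copies(unit_len)
    if required is None:
        return False
    return conservation >= 85.0 and copies >= required
```

Keeping the copy-number rule in its own function makes it easy to see how the
required count falls as the repeat unit grows.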
7.2.2 Sequence analysis
Position-specific iterated BLAST (PSI-BLAST) [6] was used to search
publicly available microbial genome sequences, GenBank, or EMBL.
GenBank and EMBL were accessed through the National Center for
Biotechnology Information (http://www.ncbi.nlm.nih.gov/) and the Oxford
University Bioinformatics Centre, respectively. Protein domains were
determined with the NCBI Conserved Domain Search. The
MBEToolbox package (Chapter 10) was used for nucleotide and amino
acid sequence analysis and alignments.
7.3 Results and Discussion
One of the ultimate goals of sequence analysis is to accurately identify
candidate virulence genes that confer pathogenicity on P. marneffei.
General comparative analyses, such as ortholog prediction and
species-specific gene detection, are valuable but not very specific; that is,
these methods yield too many candidate genes. To narrow these candidate
genes down to a manageable number, genes that contain IntraTRs were
carefully investigated. This is because IntraTRs have been suggested to
generate functional variability in S. cerevisiae, and variation in IntraTR
number provides the functional diversity of cell surface antigens that, in
fungi and other pathogens, allows rapid adaptation to the environment
and evasion of the host immune system [328]. In S. cerevisiae,
a total of 44 such genes with known functions have been identified.
These genes show unexpected functional similarities: 62% of those with
conserved long repeats encode cell-wall proteins [328].
Table 7.1: P. marneffei genes containing intragenic tandem repeats. Column
“Size” is the length of the repeat unit (nt); “Count” is the number of
occurrences of the repeat unit. The total length of the repeat region
therefore equals size × count. Sequence identity (%) of repeat unit is
greater than 80%. Consensus sequences of the repeat unit for each gene are
available in PMGD. * indicates that the gene contains more than one type of
repeat. Genes are ordered by the size of the repeat unit. The last 12 genes
contain short repeats; the rest contain long repeats.
Pm gene | Size | Count | Putative function
Pm6.47 | 228 | 3 | Polyubiquitin, similar to S. cerevisiae UBI4 (YLL039C)
Pm27.95 | 171 | 5 | Unknown function
Pm78.37* | 165 | 3 | Unknown function
Pm54.4 | 147 | 3 | Streptococcal protective antigen (Q8NZA4)
Pm71.41 | 144 | 5 | Unknown function
Pm133.2 | 141 | 3 | Unknown function
Pm1.199 | 126 | 12 | Homologous to AN7363.2, AN3547.2 and AN8457.2
Pm12.139 | 126 | 9 | Putative ATP/GTP binding protein
Pm14.111 | 126 | 4 | O-acetylhomoserine (thiol)-lyase (CYSD_EMENI)
Pm30.75 | 126 | 7 | Beta transducin-like protein HET-E2C*4 (Q8X1P4)
Pm35.44 | 126 | 11 | Beta transducin-like protein HET-E2C (Q8X1P5)
Pm94.31 | 126 | 8 | Putative ATP/GTP binding protein (Q6TMU6)
Pm210.2 | 126 | 9 | Beta transducin-like protein HET-D2Y (Q8X1P2)
Pm39.56 | 120 | 3 | Unknown function
Pm183.10 | 117 | 3 | Casein kinase I homolog hhp1 (HHP1_SCHPO)
Pm54.56* | 108 | 6 | Pedal peptide precursor protein (O01387)
Pm12.114 | 102 | 3 | Unknown function
Pm161.1 | 102 | 3 | Phosphorylase (Q8TK58)
Pm77.10 | 99 | 5 | KIAA1223 protein (Q8TB46)
Pm226.4* | 99 | 7 | Ankyrin 2 (Q9NCP8)
Pm209.2 | 96 | 3 | Beta transducin-like protein HET-E4S (Q8X1P6)
Pm44.53 | 81 | 3 | Related to transport protein USO1 (Q873K7)
Pm163.5 | 78 | 9 | Erythrocyte binding protein 3 [Plasmodium falciparum] (Q7K5Q6)
Pm42.29 | 72 | 3 | Phenol 2-monooxygenase (Q8X0B1)
Pm117.16* | 72 | 5 | Unknown function
Pm31.1 | 66 | 5 | Unknown function
Pm34.34 | 66 | 5 | Chitinase (Q873Y0)
Pm54.65 | 66 | 4 | Extensin class I (cell wall hydroxyproline-rich glycoprotein) [Plasmodium falciparum] (Q09082)
Pm78.42 | 66 | 3 | Chitinase 4 (Q7ZA41)
Pm118.4 | 63 | 5 | Unknown function
Pm40.30 | 60 | 3 | PAAA motif protein, similar to microfilament and actin filament cross-linker protein [Pan troglodytes]
Pm64.14 | 60 | 8 | Zonadhesin [Mouse]; PT repeat protein family (EAL93999) [Aspergillus fumigatus]
Pm95.32 | 60 | 3 | Related to mannosyltransferase ALG2 (Q8X0H8)
Pm194.2 | 60 | 3 | Retrovirus-related Pol polyprotein from transposon TNT 1-94 (POLX_TOBAC)
Pm41.72 | 54 | 5 | Unknown function
Pm166.6 | 54 | 3 | Unknown function
Pm48.11 | 48 | 5 | Similar to S. cerevisiae YJR054W (Q6CXI0)
Pm78.3 | 48 | 4 | Telomere-linked helicase 1 (Q8J216)
Pm173.14 | 48 | 4 | Telomere-linked helicase 1 (Q8J216)
Pm194.1 | 48 | 4 | Telomere-associated recQ-like helicase (O13400)
Pm194.5 | 48 | 5 | Polymerase (Q9C435)
Pm224.1 | 48 | 3 | Telomere-linked helicase 1 (Q8J216)
Pm224.2 | 48 | 5 | Telomere-linked helicase 1 (Q8J216)
Pm230.1 | 48 | 5 | Telomere-linked helicase 1 (Q8J216)
Pm234.1 | 48 | 5 | Telomere-linked helicase 1 (Q8J216)
Pm236.2 | 48 | 4 | DWIQ motif containing hypothetical protein (NP_702011) PF14_0123 [Plasmodium falciparum]
Pm236.3 | 48 | 7 | Telomere-linked helicase 1 (Q8J216)
Pm247.2 | 48 | 5 | Telomere-linked helicase 1 (Q8J216)
Pm108.33 | 45 | 4 | Unknown function
Pm8.109 | 42 | 3 | ATPase, AAA family
Pm40.29 | 42 | 3 | Unknown function
Pm40.31 | 42 | 4 | H7H motif in multiple proteins of Plasmodium
Pm52.29 | 42 | 3 | Mitochondrial chaperone BCS1 (BCS1_XENLA)
Pm210.1 | 42 | 4 | Unknown function
Pm173.16 | 24 | 10 | Unknown function
Pm36.21 | 12 | 11 | Unknown function
Pm1.35 | 6 | 25 | Transcription initiation factor TFIID subunit 12 (TAF12_YEAST)
Pm1.28 | 3 | 24 | Unknown function
Pm3.168 | 3 | 28 | Putative cell wall protein FLO11p (Q7Z884)
Pm5.75 | 3 | 25 | Dynamin binding protein, TUBA; DNMBP_MOUSE (Q6TXD4)
Pm14.75 | 3 | 29 | Unknown function
Pm22.8 | 3 | 22 | Unknown function
Pm67.24 | 3 | 22 | Related to heat shock transcription factor HSF21 (Q9P554)
Pm76.36 | 3 | 21 | Unknown function
Pm85.21 | 3 | 30 | Unknown function
Pm138.7 | 3 | 24 | Oxygenase-like protein (Q93M01)
A total of 66 P. marneffei genes that contain IntraTR(s) were
identified (Table 7.1). Nearly one third of these genes are of unknown
function, i.e., no putative homologs were detected by the extensive
PSI-BLAST search against GenPept databases, and no putative conserved
domains were detected. These genes may be P. marneffei-specific.
The remaining two thirds, for which putative homologs can be found,
are genes with assigned functions. Nine of these genes, namely Pm78.3,
Pm173.14, Pm224.1, Pm224.2, Pm230.1, Pm234.1, Pm236.3, Pm247.2,
and Pm194.1, are homologs of the Magnaporthe grisea telomere-linked
helicase 1 (TLH1) gene. Genetic mapping showed that most members
of the TLH gene family are tightly linked to the telomeres and located
within 10 kb of the telomeric repeat. Similar helicase gene families
are also present at the chromosome ends of Saccharomyces cerevisiae
and Ustilago maydis, which suggests that the initial association of helicase
genes with fungal telomeres might date back to the very early stages of
fungal evolution [103]. Four genes, Pm210.2, Pm30.75, Pm35.44, and
Pm209.2, are homologs of beta transducin-like protein genes, most
similar to Podospora anserina het-d2y, het-e2c, het-e2c*4 and het-e4s,
respectively. These genes are involved in vegetative incompatibility, which
prevents a viable heterokaryotic cell from being formed by the fusion of
filaments from two different wild-type strains. In P. anserina, such
incompatibility is always the consequence of at least one genetic difference
in het genes, specifically het-e and het-d. These loci control heterokaryon
viability through genetic interactions with alleles of the unlinked het-c
locus [82]. Other interesting homologs include streptococcal protective
antigen, chitinase, extensin, zonadhesin, and erythrocyte binding protein
(Table 7.1).
For further experimental studies, such as DNA typing, only those genes
that are most likely to be responsible for the pathogenic adaptation of
P. marneffei should be selected. The selection process involves multi-step
filtering. The underlying rationale is that a candidate virulence gene has to
be (1) P. marneffei-specific (without orthologs, or with orthologs containing
no similar IntraTR), and (2) either functionally known to be related to
intracellular adaptation or of completely unknown function. Moreover, in
order to conduct a PCR-based IntraTR length polymorphism study, the
constraint on the length of target DNA in PCR reactions has to be taken
into account. After the multi-step filtering and an investigation of the
lengths of the IntraTRs and introns of these genes, two genes, Pm40.30
(745 bp) and Pm40.31 (733 bp), were selected for further polymorphism study.
The lengths of IntraTRs plus introns of the two genes are 234 and 277 bp,
respectively. What makes these two genes special is their BLAST analysis
results. The top PSI-BLAST hit of Pm40.30 against the NCBI NRProt database
is a hypothetical chimpanzee protein containing multiple PAAA motifs,
while the top hit of Pm40.31 is a hypothetical histidine-rich
motif-containing protein from Plasmodium falciparum. Although the function
of this hypothetical protein is unknown, it is noteworthy
that another histidine-rich protein, PfHRP2, encoded by the P. falciparum
gene HRP-2, is responsible for intracellular adaptation of this
parasite [11]. PfHRP2 binds heme and plays a role in the proteolysis of
hemoglobin, which is the primary nutrient source during the erythrocytic
growth stage of P. falciparum [52].
The relative abundances of IntraTRs in different fungi were
compared. Table 7.2 shows the genome size, G, the number of bases in repeat
regions, B, and the number of genes containing repeats, N, for several fungi.
When all diploid and haploid species are considered together, the two diploid
fungi, S. cerevisiae and C. albicans, show higher B/G ratios. It appears
that genomes of diploid species may accommodate more bases located in
IntraTR regions, as much as 3 times more. Among haploid fungi, P.
marneffei shows the highest B/G ratio, i.e., the fraction of its bases
belonging to repeat regions is higher than in any other haploid fungus. We
argue that the relatively more abundant IntraTRs in P. marneffei might be
responsible for its immuno-escaping mechanism, which enables the fungal
pathogen to survive within its host. Finally, note that B/N ratios remain
largely constant across different species, i.e., the average number of
repeat bases within each repeat-containing gene is similar.
Table 7.2: Comparison of genome size and bases in repeats. Abbreviations:
Pm, P. marneffei; Af, Aspergillus fumigatus; An, Aspergillus nidulans; Sc,
Saccharomyces cerevisiae; Ca, Candida albicans; Mg, Magnaporthe grisea; Nc,
Neurospora crassa. G, genome size (Mb); B, bases in repeat regions (bp); N,
number of genes containing repeats.

Species  Diploid  G (Mb)   B (bp)     N   B/G ratio  B/N ratio
Pm       No           30   23,814    66         794        361
Af       No           28   12,687    33         453        384
An       No           30   16,820    31         561        543
Sc       Yes          12   29,664    69       2,472        430
Ca       Yes          16   34,662    82       2,166        423
Mg       No           39   16,933    62         434        273
Nc       No           40   22,101   121         553        183
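The two ratio columns of Table 7.2 follow directly from G, B and N. The short Python sketch below recomputes them from the tabulated values; the dictionary layout and the function name are illustrative choices, not part of the original analysis pipeline.

```python
# Values transcribed from Table 7.2: genome size G (Mb), bases in repeat
# regions B (bp), and number of repeat-containing genes N.
TABLE_7_2 = {
    "Pm": (30, 23814, 66),
    "Af": (28, 12687, 33),
    "An": (30, 16820, 31),
    "Sc": (12, 29664, 69),
    "Ca": (16, 34662, 82),
    "Mg": (39, 16933, 62),
    "Nc": (40, 22101, 121),
}


def repeat_ratios(g_mb, b_bp, n_genes):
    """Return (B/G, B/N), each rounded to the nearest integer."""
    return round(b_bp / g_mb), round(b_bp / n_genes)


ratios = {species: repeat_ratios(*values) for species, values in TABLE_7_2.items()}

for species, (b_over_g, b_over_n) in ratios.items():
    print(f"{species}: B/G = {b_over_g}, B/N = {b_over_n}")
```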
The amino acid composition of a protein is the mole percent of the
different amino acids in its sequence. It is usually conserved among the
same proteins of different species. Here we performed a cross-species
comparison of the amino acid composition of IntraTRs (Fig. 7.1). The two
yeasts show a different visual pattern from that of the moulds. S. cerevisiae
and C. albicans use far more threonine and/or serine residues than any other
amino acid, while in moulds the patterns contrast more between species.
Serine is used most in P. marneffei and A. fumigatus, alanine in A. nidulans,
glycine in N. crassa and isoleucine in M. grisea. Phenylalanine, valine and
tryptophan are used less in all species. The overall patterns of P.
marneffei, A. nidulans and A. fumigatus are similar to each other. The result
shows that the differences in amino acid composition are associated with the
phylogenetic distances among species. This suggests that the amino acid
composition of IntraTRs is not shaped by neutral mutation alone but is under
the constraint of a certain level of selection.
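The composition measure used here can be illustrated with a minimal Python sketch. It pools a set of repeat-region peptide sequences and reports the mole percent of each of the 20 standard amino acids. The example sequences are hypothetical and serve only to show the calculation; they are not taken from any of the genomes analysed.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequences):
    """Return the mole percent of each standard amino acid over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore gaps, stop symbols and ambiguous residue codes.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical repeat-region peptides, for illustration only.
example_repeats = ["SSTSSTSST", "PAAAPAAA"]
composition = aa_composition(example_repeats)

for aa in AMINO_ACIDS:
    if composition[aa] > 0:
        print(f"{aa}: {composition[aa]:.1f}%")
```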
Figure 7.1: Amino acid composition in intragenic tandem repeats. Fungal
species are: Af, A. fumigatus; Pm, P. marneffei; An, A. nidulans; Sc,
S. cerevisiae; Ca, C. albicans; Nc, N. crassa; Mg, M. grisea. For each
subplot, the x axis is the occurrence (frequency) of each amino acid and the
y axis lists the amino acids, from top to bottom: A - Alanine, C - Cysteine,
D - Aspartic Acid, E - Glutamic Acid, F - Phenylalanine, G - Glycine,
H - Histidine, I - Isoleucine, K - Lysine, L - Leucine, M - Methionine,
N - Asparagine, P - Proline, Q - Glutamine, R - Arginine, S - Serine,
T - Threonine, V - Valine, W - Tryptophan, and Y - Tyrosine.

The cell surfaces of microorganisms show distinctive properties which
can be recognised by the host immune system. Many microorganisms have the
ability to switch their cell-surface molecules, a tactic that permits them
to elude the immune system and adhere to diverse materials and cells (for
review, see [329]). The human immune system poses challenges to P.
marneffei, which might have characteristic cell-surface molecules that are
recognised by dedicated phagocytic cells. Recent studies linked the
diversity of cell-surface molecules to the variation in IntraTR number. The
persistence of a large number of IntraTRs in the P. marneffei genome
suggests that there is a compensating benefit. We therefore propose that
variation in IntraTR number provides the functional diversity of cell
surface antigens in P. marneffei, allowing rapid adaptation to the
environment and evasion of the host immune system.
Chapter 8
EXTENT AND EVOLUTIONARY PATTERN OF
DUPLICATE GENES IN PENICILLIUM MARNEFFEI
AND OTHER ASCOMYCETES
Gene duplication and subsequent divergence have long been believed
to be of importance for the functional novelty and complexity of organisms.
The extent and evolutionary patterns of duplicate genes (paralogs) have long
been studied in higher eukaryotes, but not in lower eukaryotes such as
fungi. In this chapter, gene-coding sequences in the genome of Penicillium
marneffei, together with those of other ascomycetes, Saccharomyces
cerevisiae, Schizosaccharomyces pombe, Candida albicans, Aspergillus
nidulans and Neurospora crassa, are used to identify multigene families. The
number of synonymous substitutions per synonymous site, Ks, and the number
of nonsynonymous substitutions per nonsynonymous site, Ka, are calculated to
measure the time (or relative frequency) of duplication as well as the
selective constraint on gene pairs. The evolutionary rates of duplicate gene
pairs are measured by applying the codon substitution model, which is more
sensitive than traditional models [111]. A large variation in the extent of
gene duplication in these species was found (the percentage of genes in
multigene families ranged from 23.6% in S. cerevisiae to 8.0% in N. crassa).
The age distribution of the gene duplications tentatively suggests that the
P. marneffei genome may have experienced two rounds of large-scale
duplication. Paralogs in filamentous ascomycetes (but not those in yeast
ascomycetes) were also found to be under weaker functional constraint than
orthologs. Analysis of the divergence of evolutionary rates in S. cerevisiae
and C. albicans revealed that 17.8% of gene pairs show an asymmetric
divergence pattern in amino-acid substitutions. However, there is no
evidence that this asymmetry is associated with positive selection. I
speculate that the different extent and evolutionary pattern of duplicate
genes in these ascomycetes might be associated with their genotypic and
phenotypic differences.
8.1 Introduction
In the early 1970s Ohno proposed in his book that gene duplication is a
major evolutionary source of gene innovation [237]. By this he meant that
the creation of a paralog of a gene through duplication (by many possible
means) results in one of the duplicates being functionally redundant. This
redundant copy may mutate more freely without affecting the overall fitness
of the organism, and is thus more likely to become a gene with a novel
function. Biologists now generally accept the view that, by creating sets of
gene paralogs, gene duplication plays an important role in the adaptation of
organisms to their environment and in the origin of the phenotypic diversity
of organismal evolution [210]. With the completion of several eukaryotic
genome projects, it is well known that one of the characteristics of
eukaryotic genomes is the presence of duplicate genes, forming numerous gene
families [287]. More than a third of a typical eukaryotic genome consists of
gene families [115, 287, 345]. Whole-genome duplication(s) during the
earlier evolution of the vertebrate lineage have been proposed to account
for the presence of extensive gene duplications in most vertebrate genomes
[209, 221, 287].
The extent of gene families in one organism is firstly determined by
the frequency and magnitude of gene duplication events, and secondly determined by the subsequent evolutionary fates of gene pairs following the
duplication events. This may be better understood through comparative
studies of sequence divergence in duplicate genes in different genomes.
However, until recently few studies have been conducted in the limited
number of representative organisms available [68, 174, 210], because such
kinds of inter-genomic comparisons rely on the availability of complete
genome sequence from multiple organisms.
In this study, I compare the extent and evolutionary pattern of duplicate
genes in the phylum Ascomycota, using the complete sets of protein-coding
genes in the fungi Saccharomyces cerevisiae [110], Schizosaccharomyces pombe
[354], Candida albicans, Aspergillus nidulans, Penicillium marneffei and
Neurospora crassa [101]. These fungal species display different lifestyles
and phenotypic characteristics. The brewer’s yeast, S.
cerevisiae, and fission yeast, S. pombe, have a life cycle characterized
by a unicellular thallus that reproduces by budding and fission respectively. Filamentous ascomycetes, N. crassa and A. nidulans, grow hyphae
apically and branch laterally. P. marneffei shows dimorphic switching
between mould and yeast forms of growth under different temperatures.
It is of interest to know how gene duplication shaped their gene
repertoires, leading to novel genes conferring novel adaptive functions in
these fungi.
In practice I used nucleotide alignments of duplicate genes to calculate
two key parameters of molecular evolution: the number of synonymous (silent)
substitutions per synonymous site, Ks, and the number of nonsynonymous
(amino-acid replacement) substitutions per nonsynonymous site, Ka. Ks
provides a crude measure of the time since duplication for each gene pair,
if we assume that Ks increases approximately linearly with time. The ratio
Ka/Ks provides a measure of the selection pressure to which a gene pair is
subjected. Generally speaking, Ka/Ks = 1 means that the duplicate genes are
under few or no selective constraints (i.e., amino-acid replacement
substitutions occur at the same rate as synonymous substitutions). Ka/Ks > 1,
which is strong evidence for positive selection, indicates that replacement
substitutions occur at a rate higher than that expected by chance, so
advantageous mutations have occurred during sequence divergence. In
contrast, Ka/Ks < 1 is consistent with ‘purifying selection’; that is to
say, some amino-acid replacement substitutions have been purged by natural
selection because of their deleterious effects [48]. Another evolutionary
pattern that has attracted great interest is the asymmetry of evolutionary
rates between the two copies of a duplicated gene pair, i.e., one copy
evolves faster than the other. Intensive studies of this pattern in
different organisms have produced a wide range of estimates of the
proportion of duplicate gene pairs that show asymmetric evolution [59, 68,
137, 174, 265, 321, 370, 371].
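The interpretation of the Ka/Ks ratio described above can be summarised in a few lines of Python. This is only a sketch of the point-estimate reading of the ratio: the function name and the tolerance band around 1 are assumptions introduced for illustration, and in practice a claim of positive or purifying selection would rest on a statistical test rather than on the raw ratio.

```python
def selection_regime(ka, ks, tolerance=0.05):
    """Classify a gene pair by its Ka/Ks point estimate.

    `tolerance` is a hypothetical band around Ka/Ks = 1 within which the
    pair is reported as evolving without detectable constraint.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the Ka/Ks ratio")
    omega = ka / ks
    if omega > 1 + tolerance:
        return "positive"   # replacement changes exceed silent changes
    if omega < 1 - tolerance:
        return "purifying"  # replacement changes are being removed
    return "neutral"        # few or no selective constraints


# Hypothetical (Ka, Ks) estimates for three gene pairs.
for ka, ks in [(0.05, 0.50), (0.50, 0.50), (0.90, 0.30)]:
    print(ka, ks, selection_regime(ka, ks))
```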
Since the completion of the whole genome sequence of S. cerevisiae [110],
a number of studies have involved the identification of multigene families
in this model eukaryotic genome. The number of multigene families in S.
cerevisiae reported by Rubin et al. [270] is higher than that reported by
Friedman and Hughes [95] (1858 compared to 1440). This is because the former
study used a simple criterion, a BLAST E-value of $10^{-6}$, while the
latter used a much stricter search with E = $10^{-50}$. However, using a
single statistical score (such as the E-value given by a BLAST search or a
related score) without specifying the proportion of alignable regions may
put two non-homologous proteins into the same family due to domain sharing
[118]. Hence in this study, in order to obtain a reasonable estimate, I
adopted a relatively stringent definition in which the lengths of the
encoded proteins are taken into account, instead of relying on E-values
only.
8.2 Literature Review
The ability to adapt to changing environments and to exploit new niches
has a great influence on the success of an organism [210]. This ability is
associated with new genes or genes with new functions [219]. Gene duplications are traditionally considered to be a major evolutionary source of
new protein functions. After duplication, the fate of the resulting copy of
a gene is of great interest. At least three hypotheses have been proposed,
as follows:
Nonfunctionalisation The classical view pioneered by Susumu Ohno
[237] holds that a duplicate gene produces two functionally redundant,
paralogous genes and thereby frees one of them from selective constraints.
The duplicate gene may be degraded to a pseudogene by mutational
inactivation and finally could be removed from the genome by deletion
[237, 238]. This is the most likely outcome of duplicate genes [237, 68].
Neofunctionalisation The duplicate gene may avoid redundancy by
assuming a novel function, i.e., the redundant copy may be modified and
in time assume a new role [237, 166, 336, 334, 298, 212]. Since this unconstrained paralog is free to accumulate neutral mutations, there is the
possibility of fixation of mutations that may lead to a new function. This
prediction was supported by studies on isozyme spectra of polyploidy in a
number of organisms (reviewed in [196]). Of course, mutational time is a
deciding factor, since copies need sufficient modifications to assume roles
different from their parents, assuming that they are initially of neutral
fitness. Thus, the deletion rate is of great importance to gene innovation
by being sufficiently slow to give copies time to diverge.
Both the hypotheses above assume one copy of a duplicate gene pair
is free to evolve, while the other remains under selective pressure. This
has been challenged in work by Kondrashov et al. [174] and Lynch and
Conery [212], who show that paralogs do not seem to have experienced
any extensive period of neutral evolution. Kondrashov et al. [174] proposed that paralogs avoid neutrality through gene amplification, followed
by a period of either relaxed or positive selection. They also observed
that paralogs evolve faster than their corresponding orthologs. Again,
this could be due to relaxed or positive selection. Furthermore, a study
of 17 pairs of duplicate genes in the tetraploid frog Xenopus laevis has
shown that both copies were subject to purifying selection, contrary to
the notion of neutrality of one of the copies [137]. The failure of empirical research to support Ohno’s model has led to the proposal of an
alternative hypothesis – subfunctionalisation.
Subfunctionalisation The third hypothesis, ‘subfunctionalisation’ or
the duplication-degeneration-complementation (DDC) model [90], proposes that
duplicate genes come under selective pressure and are retained by losing
separate subfunctions from a multifunctional ancestral gene. Redundant
material is discarded through degradation [90]. It also states that
duplicate genes are initially redundant in function and, accordingly, a
duplication event is selectively neutral. But it differs from the previous
hypotheses in that successfully retained subdomains can be reused for a
subset of the original functions or even for other new or related purposes
[90]. As a result, the two genes can be said to belong to a family, being
related by sequence similarity, if not by function. Naturally, this
relationship will decrease with time until no discernible similarity can be
observed in regions of low conservation. A large number of observations
support this model, although mostly in diploid or polyploid eukaryotes.
8.3 Materials and Methods
8.3.1 Sequences and gene families
For each organism other than P. marneffei, the complete sets of available
putative amino-acid sequences and coding DNA sequences were downloaded from
genomic databases as follows: for S. cerevisiae,
http://genome-www.stanford.edu/Saccharomyces; for S. pombe,
http://www.genedb.org/genedb/pombe (Schizosaccharomyces pombe GeneDB); for
C. albicans, http://genolist.pasteur.fr/CandidaDB/ (CandidaDB Data Release
R1 Dec 17, 2001; this genome database was created by the EU-funded
consortium Galar Fungail by performing independent annotation of assembly 19
sequence data obtained from the Stanford Genome Technology Centre,
http://www-sequence.stanford.edu/group/candida); for A. nidulans,
http://www.broad.mit.edu/annotation/fungi/aspergillus/ (Aspergillus nidulans
Database); and for N. crassa,
http://www-genome.wi.mit.edu/annotation/fungi/neurospora (Neurospora crassa
Database release 3: 02.12.2002). All protein sequences that were annotated
as known or suspected pseudogenes and those proteins encoded by
mitochondrial genomes were removed. Gene families in each genome were
identified by using BLASTCLUST (30% identical residues, aligned over at
least 80% of their lengths). BLASTCLUST applies the single-linkage
algorithm. For documentation on its use, see
ftp://ftp.ncbi.nlm.nih.gov/blast/documents/README.bcl. The clusters were
used to identify and count duplication events (although not all pairs of
genes in a cluster are homologous to each other). Throughout the analysis,
the same criteria were applied in searching for orthologs of genes from all
other species; that is to say, orthologs were predicted by BLASTP search for
interspecies genes with > 30% identical residues and an alignable region
covering at least 80% of their lengths.
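The single-linkage grouping described above can be sketched in Python as follows. This is not BLASTCLUST itself but a minimal illustration of the idea using a union-find structure. The format of the hit tuples is hypothetical, and the requirement that the aligned region cover 80% of both sequences (rather than only one) is an assumption made here.

```python
def single_linkage_families(genes, hits, min_identity=30.0, min_coverage=0.8):
    """Group genes into families by single linkage over qualifying hits.

    hits: iterable of (gene_a, gene_b, percent_identity, aligned_length,
    length_a, length_b). This tuple layout is hypothetical.
    """
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the family representative, compressing paths.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, identity, aligned, len_a, len_b in hits:
        covered = (aligned / len_a >= min_coverage
                   and aligned / len_b >= min_coverage)
        if identity >= min_identity and covered:
            parent[find(a)] = find(b)  # merge the two families

    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    return sorted(families.values(), key=len, reverse=True)


# Hypothetical example: g1-g2 and g2-g3 qualify, g3-g4 fails the identity cut.
genes = ["g1", "g2", "g3", "g4"]
hits = [
    ("g1", "g2", 45.0, 90, 100, 100),
    ("g2", "g3", 35.0, 85, 100, 100),
    ("g3", "g4", 25.0, 95, 100, 100),
]
families = single_linkage_families(genes, hits)
print(families)
```

In this example g1 and g3 end up in the same family only through their shared link to g2, which is why not every pair of genes within a single-linkage cluster need satisfy the similarity criteria directly.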
8.3.2 Estimation of substitution rate
Gene families with sequences similar to known transposable elements
were removed at this point and excluded from the rest of the analysis.
Paralogous protein sequences were aligned using ClustalW version 1.82 with
the default parameters (PAM matrix; gap opening penalty = 10.0; gap
extension penalty = 0.2). The corresponding nucleotide-sequence alignments
were derived by substituting the respective coding sequences for the protein
sequences using MBEToolbox (Chapter 10). Ks and Ka were calculated by the
maximum-likelihood method implemented in the CODEML program of the PAML
package version 3.13d [359]. Following the procedure described in Zhang et
al. [371], the pair of duplicate genes with the smallest value of Ks was
picked within each family. This process was repeated for the remaining genes
within the family until there were no gene pairs left that could be picked.
The process was implemented by ad hoc scripts in Perl.
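The pair-picking procedure can be sketched as a simple greedy loop. The Python below is an illustration, not the original Perl scripts: it assumes that the pairwise Ks values within one family are already available, and that both genes of a picked pair are withdrawn before the next pair is chosen. The data layout and function name are hypothetical.

```python
def pick_independent_pairs(pairwise_ks):
    """Greedily pick non-overlapping gene pairs in order of increasing Ks.

    pairwise_ks maps frozenset({gene_a, gene_b}) to the Ks of that pair
    within a single gene family.
    """
    remaining = dict(pairwise_ks)
    picked = []
    while remaining:
        pair = min(remaining, key=remaining.get)  # smallest Ks left
        picked.append((tuple(sorted(pair)), remaining[pair]))
        # Drop every pair that shares a gene with the pair just picked.
        remaining = {p: ks for p, ks in remaining.items() if not (p & pair)}
    return picked


# Hypothetical four-gene family with all six pairwise Ks values.
family_ks = {
    frozenset({"a", "b"}): 0.1,
    frozenset({"a", "c"}): 0.4,
    frozenset({"a", "d"}): 0.9,
    frozenset({"b", "c"}): 0.5,
    frozenset({"b", "d"}): 0.8,
    frozenset({"c", "d"}): 0.3,
}
picked_pairs = pick_independent_pairs(family_ks)
print(picked_pairs)
```

Each gene therefore contributes to at most one pair, so the resulting pairs are independent of one another.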
To plot Ka versus Ks , pairs with Ks > 5.0 or Ka > 5.0 were eliminated because such high sequence divergence is often associated with
problems like difficulty in alignment, different codon usage biases or
nucleotide compositions in the different sequences. Ks is known to be
strongly distorted by codon usage bias [283]. The codon adaptation index
(CAI) [282] was used as a measure of codon bias. I therefore calculated
average values of CAI for all gene pairs and excluded those with average
CAI > 0.5 from the analysis.
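The exclusion criteria above amount to three threshold checks per gene pair, as in the following minimal Python sketch. The thresholds are those stated in the text; the function name and the example records are hypothetical.

```python
def passes_divergence_filters(ks, ka, mean_cai, max_rate=5.0, max_cai=0.5):
    """Return True if a gene pair survives the divergence and codon-bias filters."""
    return ks <= max_rate and ka <= max_rate and mean_cai <= max_cai


# Hypothetical gene pairs: (pair id, Ks, Ka, average CAI of the two genes).
pairs = [
    ("pair1", 0.30, 0.05, 0.21),
    ("pair2", 6.20, 0.90, 0.18),  # excluded: Ks above 5.0
    ("pair3", 0.80, 0.10, 0.74),  # excluded: average CAI above 0.5
]
kept = [pid for pid, ks, ka, cai in pairs
        if passes_divergence_filters(ks, ka, cai)]
print(kept)
```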
8.3.3 Relative rate test
The relative evolutionary rate test aims to compare the substitution rates
of two sequences or two groups of sequences. Here it was applied to compare
the evolutionary rates of the two copies of a duplicate gene pair. In the
test I used only recently duplicated genes (i.e., duplicate genes with
Ks < 0.5). These ‘young’ duplicates have fewer multiple substitutions, so
their rates can be estimated more accurately than those of older duplicates.
In addition, very young duplicates (Ks < 0.05) were excluded because they
have too few substitutions for a statistical test to reach significance
[199].
In order to apply the relative rate test, I obtained outgroup sequences
for these young gene pairs. Each relative rate test was based on one gene
pair and its outgroup, forming a triplet. Selection of outgroups was done
using the method described in Conant and Wagner [59]. When more than one
outgroup sequence was available, either from the same genome or from other
genomes, the triplet of genes closest to each other in synonymous
divergence, Ks, was chosen.

Table 8.1: Distribution of multigene families in fungi. Abbreviations: SC
- S. cerevisiae; SP - S. pombe; CA - C. albicans; AN - A. nidulans; PM
- P. marneffei; NC - N. crassa.

Family size                           SC     SP     CA     AN     PM     NC
1                                   4500   4104   5276   7887   8725   9274
2                                    390    229    188    320    291    198
3                                     54     34     41     84     64     43
4                                     23     18     29     38     26     22
5                                     11      4      8     17     10      5
6-10                                  17     18     24     29     29     15
11-20                                  7      4      2      9      3      3
>20                                    2      0      2      5      5      1
Number of multigene families
(size >= 2)                          504    307    294    502    428    287
Total genes used in the analysis    5889   4939   6165   9541  10060  10082
Number of genes in families         1389    835   1189   1654   1335    808
Number of young duplicate gene
pairs (Ks < 0.5)                     165     51     50     43     52     10

I used two likelihood ratio (LR) tests to test for asymmetric divergence at
both the amino-acid and the codon level. Codon substitution rates were
estimated using the codon substitution model described by Goldman and Yang
[111]. To do the LR test, two models were applied to the data: model 0
constrains the amino-acid or codon substitution rates to be equal in the two
sequences, and model 1 treats the rates as free parameters (hence they could
be unequal in the two sequences). Maximum likelihood values $ML_1$ and
$ML_2$ from the two models were collected and the likelihood ratios were
calculated as $\mathrm{LR} = 2(\ln ML_1 - \ln ML_2)$. LR was then compared
against the $\chi^2$ distribution with one degree of freedom, as detailed by
Yang [358].
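The likelihood ratio test can be illustrated with a short Python sketch. It assumes that the log-likelihoods of the constrained (equal-rate) and free (unequal-rate) models have already been obtained, for example from CODEML output, and it forms the statistic as twice the log-likelihood of the free model minus that of the constrained model so that it is non-negative. The function name and the example values are hypothetical. For one degree of freedom the upper tail of the chi-square distribution has the closed form erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math


def likelihood_ratio_test(lnl_constrained, lnl_free):
    """Return (LR, p) for two nested models differing by one free parameter.

    lnl_constrained: log-likelihood with the two rates forced to be equal.
    lnl_free: log-likelihood with the two rates estimated separately.
    """
    lr = 2.0 * (lnl_free - lnl_constrained)
    lr = max(lr, 0.0)  # guard against tiny negative values from optimisation
    # Upper-tail probability of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Hypothetical log-likelihoods for one triplet.
lr, p_value = likelihood_ratio_test(lnl_constrained=-1234.5, lnl_free=-1231.2)
asymmetric = p_value < 0.05
print(f"LR = {lr:.2f}, p = {p_value:.4f}, asymmetric = {asymmetric}")
```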
8.4 Results
8.4.1 Extent of gene duplication in ascomycetes
As shown in Table 8.1, 1,389 (23.6%) of 5,889 genes in S. cerevisiae belong
to multigene families (families including at least two genes), compared with
16.9% in S. pombe, 19.3% in C. albicans, 17.3% in A. nidulans, 13.3% in P.
marneffei, and only 8.0% in N. crassa.
When comparing the numbers of young duplicates, I found that genes in young
duplicate pairs (Ks < 0.5) account for 23.8% of the genes in multigene
families in S. cerevisiae, 12.2% in S. pombe, 8.4% in C. albicans, 5.2% in
A. nidulans, 7.8% in P. marneffei, and only 2.5% in N. crassa (Table 8.1).
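The percentages quoted above can be recomputed from the last three rows of Table 8.1, as in the Python sketch below. The first percentage is the number of genes in families divided by the total number of genes. The second reproduces the reported values when each young pair is counted as two genes and divided by the number of genes in families; this reading is an inference from the tabulated numbers rather than a definition stated in the text.

```python
# Species: (total genes, genes in families, young duplicate pairs), Table 8.1.
TABLE_8_1 = {
    "SC": (5889, 1389, 165),
    "SP": (4939, 835, 51),
    "CA": (6165, 1189, 50),
    "AN": (9541, 1654, 43),
    "PM": (10060, 1335, 52),
    "NC": (10082, 808, 10),
}


def duplication_summary(total, in_families, young_pairs):
    """Return (% of genes in families, % of family genes in young pairs)."""
    pct_in_families = 100.0 * in_families / total
    pct_young = 100.0 * (2 * young_pairs) / in_families  # two genes per pair
    return round(pct_in_families, 1), round(pct_young, 1)


summary = {sp: duplication_summary(*row) for sp, row in TABLE_8_1.items()}
for sp, (pct_fam, pct_young) in summary.items():
    print(f"{sp}: {pct_fam}% in families, {pct_young}% in young pairs")
```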
Apparently S. cerevisiae contains more multigene families and more
recently duplicated genes than any other fungus in this analysis. This
is in concordance with an earlier study [345]. A whole-genome duplication
approximately $10^8$ years ago was proposed as an explanation for the
presence of many duplicate genes [279]. S. pombe, C. albicans, A. nidulans
and P. marneffei contain moderate numbers of duplicated genes, to roughly
the same extent as each other. Very few duplicated genes are present in the
N. crassa genome. This low number of duplicate genes is consistent with
results reported previously [101, 231].
Table 8.2 lists the largest multigene families in each species. S.
cerevisiae contains a large number of transposable elements, which play an
important role in creating duplications in the yeast genome [366]. The top
multigene families of S. cerevisiae include a group of proteins, the
seripauperins, whose function(s) remain poorly understood [332]. Comparable
numbers of predicted sugar transporters are found in N. crassa and S.
cerevisiae. Transporter and reductase gene families are expanded in
filamentous fungi. Interestingly, P. marneffei has a large gene family of 24
putative pepsin-like proteases, which is not so substantial in the other
fungi studied here.
8.4.2 Age distribution of duplicate genes
In general, we assume that Ks increases approximately linearly with time
because synonymous substitutions do not alter the amino-acid sequence
and are therefore under lower constraint from natural selection [212].
If this assumption largely holds, the distribution of Ks can be used as an
indicator of the distribution of duplication events along a time scale. I
plotted the frequency distribution of pairs of duplicate genes as a function
of Ks in Fig. 8.1. An obvious pattern found in all species is that most gene
duplicates are young and the density of duplicates drops off with increasing
Ks. The distribution of C. albicans shows a flat pattern, in which the gene
pairs are evenly distributed over Ks, with a peak around Ks = 0.2. This may
indicate that small-scale gene duplications happened persistently during the
course of evolution.

Table 8.2: Large multigene families in fungi.

Fungi          Size of family  Function/Product
S. cerevisiae  20              Hexose transporter
               20              Seripauperins
               17              Amino acid permease
               15              GTP-binding protein
               13              Helicase
S. pombe       20              Multidrug resistance protein
               17              GTP-binding protein
               12              Amino acid permease
               11              Retrotransposable element
               10              Protein kinase
C. albicans    23              Unknown proteins
               21              Amino acid permease
               13              GTP-binding protein
               11              Ferric reductase transmembrane component
               9               Unknown proteins
A. nidulans    61              Hexose transporter
               42              Putative transporter
               36              Oxidoreductase
               28              Multidrug resistance protein
               21              Aldehyde dehydrogenase
P. marneffei   34              MFS multidrug transporter
               31              Short chain dehydrogenase/reductase family
               27              Hexose transporter protein
               24              Pepsin-type protease
               23              Major facilitator superfamily
N. crassa      21              Oxidoreductase
               17              Phosphoethanolamine N-methyltransferase
               16              Hexose transporter
               11              Aldehyde dehydrogenase
               10              Endoglucanase

Figure 8.1: Frequency distribution of Ks. Frequency distribution of
duplicate gene pairs as a function of the number of synonymous substitutions
per synonymous site (Ks), with one panel per species (S. cerevisiae, S.
pombe, C. albicans, A. nidulans, P. marneffei and N. crassa). Arrow
indicates the second peak in P. marneffei.

Figure 8.2: Log-log plots of Ka vs. Ks for duplicate gene pairs. Log-log
plots of the number of nonsynonymous substitutions per nonsynonymous site
(Ka) vs. the number of synonymous substitutions per synonymous site (Ks) for
duplicate gene pairs, with one panel per species (S. cerevisiae, S. pombe,
C. albicans, A. nidulans, P. marneffei and N. crassa). Each point represents
a single pair of gene duplicates. Points below the diagonal (Ka < Ks) imply
that the genes have been subjected to purifying selection against amino acid
changes. Open points denote orthologous gene pairs.
For P. marneffei, there are two peaks in the plot. The first is a high
peak in the age distribution centred around Ks = 0.1, indicating that there
are a large number of gene pairs of a similar recent age; the second peak
corresponds to a low region from Ks = 2.0 to 4.5. I speculate that the
second peak is a trace of ancient gene duplication events on a relatively
large scale. This proposed ancient duplication would have created many
duplicate gene pairs. After such a long evolutionary time, most of these
gene pairs would be expected to have mutated and become divergent. Only some
pairs retain some degree of similarity, which gives rise to the second peak.
This dual-peak pattern is not readily observed in the other fungal species,
except for N. crassa, whose second peak might result from gene duplication
prior to the development of repeat-induced point mutation (see below).
8.4.3 Selective constraint between paralogs
As mentioned in the Introduction, Ka/Ks is used as a measure of selective
constraint between two copies of duplicate genes. The larger the Ka/Ks
value, the weaker the selective constraint on the two copies. Table 8.3
gives the estimated Ka/Ks values in different fungi.
Comparison of Ka/Ks values for different fungi revealed that the strength
of selection is generally similar among yeasts (i.e., S. cerevisiae, S.
pombe and C. albicans) and among moulds (i.e., A. nidulans, P. marneffei and
N. crassa). There is a substantial difference in Ka/Ks between yeasts and
moulds. The strongest purifying selection is among the S. cerevisiae
paralogs and the weakest among the A. nidulans paralogs. Mould paralogs show
significantly weaker functional constraints, indicated by larger values of
Ka/Ks, than those in yeasts (Student’s t-tests for pairwise comparisons).
Table 8.3: Ratio of nonsynonymous to synonymous substitution rates
(Ka/Ks) for recently diverged paralogs (0.05 < Ks < 0.5).

Fungi          No. of gene pairs  Ka/Ks (mean ± SD)
S. cerevisiae  89                 0.134 ± 0.166
S. pombe       22                 0.148 ± 0.234
C. albicans    34                 0.245 ± 0.224
A. nidulans    12                 0.491 ± 0.214
P. marneffei   29                 0.456 ± 0.231
N. crassa      9                  0.359 ± 0.276

8.4.4 Ka/Ks between paralogs and orthologs
Ka /Ks is also used to estimate the selective constraints acting on orthologs. I therefore also characterised rates of synonymous and nonsynonymous substitution of orthologs for each genome. By plotting Ka as
a function of Ks and superimposing data from paralogs onto those from
orthologs, we can get an overall view of how natural selection acts on two
groups of comparisons (Fig. 8.2).
In all species, overall Ka values are much smaller than Ks values,
which implies that the vast majority of duplicate genes are subject to
purifying selection. In C. albicans, A. nidulans and P. marneffei, gene
pairs with smaller Ks tend to gather around the diagonal line (Ka/Ks = 1)
and gene pairs with larger Ks tend to fall away from it. It seems that, in
these three species, recent duplicates tolerate more amino-acid replacement
substitutions than older duplicates.
In mould species, the strength of purifying selection acting on paralogs
is smaller than that acting on orthologs with the same level of sequence
divergence. As shown in Fig. 8.2, at the same level of Ks, most of the open
points lie below the clusters of closed points; that is to say, Ka in
paralogs is generally larger than that in orthologs in A. nidulans, P.
marneffei and N. crassa. On the other hand, there is no difference in
overall Ka/Ks between paralogs and orthologs in the yeasts S. cerevisiae, S.
pombe and C. albicans.
8.4.5 Relative evolutionary rate between paralogs
The two copies of a paralog pair may evolve at different rates. If most
paralog pairs evolve in such an asymmetric way, it may indicate that
Ohno’s neofunctionalisation theory is plausible. Therefore, as mentioned,
many studies on the relative evolutionary rates between paralogs have
been conducted. However, these studies have led to different conclusions.
Two critical aspects responsible for the success of such analyses are the
sensitivity of methods and the appropriateness of the outgroup used.
Here I used a method that incorporates a codon-based model. Generally
speaking, methods relying on codon-based models (for example, [111, 226])
are more sensitive than nucleotide-based and amino-acid-based tests because,
in the latter two, one cannot distinguish between silent substitutions and
amino-acid replacement substitutions [59]. A codon-based model, however,
takes into account the ratio between the rates of nonsynonymous and
synonymous substitutions, which gives a more direct measure of the strength
of selection or functional constraints on the gene.
The major issue in choosing an outgroup is that the potential outgroup
cannot be too distant evolutionarily from the paralogs being studied;
otherwise, saturation at synonymous sites for many genes will interfere with
the power of the statistical test. To avoid this influence, Kondrashov et
al. [174] used a within-genome approach, since their study included four
highly diverged eukaryotic organisms, S. cerevisiae, A. thaliana, C. elegans
and D. melanogaster. By using the within-genome approach, they identified
outgroups of S. cerevisiae paralogs within the S. cerevisiae genome itself.
In addition, they required that the two paralogs be closer in amino-acid
sequence to each other than to the outgroup. This extra condition, which has
probably led to an underestimate of asymmetric divergence, was criticised by
Conant and Wagner [59], who adopted a similar within-genome approach in
multiple eukaryotes.
In the selection of gene duplicates and their outgroups, I adopted a
method similar to that of Conant and Wagner [59]. The only modification
made was to search all fungal genomes for outgroups, instead of using
the within-genome approach.
I identified a total of 163 triplets (composed of two paralogs and one
corresponding outgroup) which included 101 triplets based on paralogs
from S. cerevisiae, 6 from S. pombe, 50 from C. albicans, 2 from A.
nidulans, 3 from P. marneffei, and 1 from N. crassa.
Because the majority of triplets are from S. cerevisiae and C. albicans, the following analysis has no power to distinguish differences among
species. Instead it can only be considered as a comprehensive analysis
dealing with the subject of ascomycetes as a whole.
I adopted the model of Goldman and Yang [111] (see Methods) in
the comparison of the relative rates in amino-acid substitution between
each of the paralogs. The result shows that, of a total of 163 analysed
gene pairs from the ascomycetes, 29 (17.8%) evolve at a significantly
(p < 0.05) different rate (Table 8.4). This figure includes 12 (11.9%) of
101 triplets in S. cerevisiae and 17 (32.7%) of 52 in C. albicans. In the
majority of cases, both paralogs evolved at approximately the same rate,
under a similar level of purifying selection.
In order to examine whether the Ka/Ks ratio is the factor causing
asymmetry in evolutionary rates between paralogs, I estimated the asymmetry
of Ka/Ks ratios between two paralogs. A 2 × 2 χ² test failed to reject
the null hypothesis that the number of pairs with different Ka/Ks ratios
is independent of the number of pairs with different amino-acid
substitution rates (Table 8.4). That is to say, there is no detectable
association between different Ka/Ks ratios and different amino-acid
substitution rates.
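The independence test described above can be reproduced from the counts in Table 8.4 with a short sketch. It assumes a Pearson χ² statistic without continuity correction (the text does not say whether a correction was applied) and uses the identity that, for one degree of freedom, the upper-tail probability equals erfc(sqrt(χ²/2)). The function name is hypothetical.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2 x 2 table (1 df)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Rows: different / equal Ka/Ks ratio; columns: different / equal Ka.
table_8_4 = [[3, 10], [26, 124]]
chi2, p_value = chi_square_2x2(table_8_4)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

One expected count in this table is below 5 (about 2.3), so the χ² approximation is rough here and an exact test might be preferred as a cross-check.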
Table 8.4: Amino-acid substitution rates versus Ka/Ks ratios in two
copies of duplicate genes. Columns show gene pairs with different or
equal amino-acid substitution rates between two paralogs; rows show
gene pairs with different or equal Ka/Ks ratios between two paralogs.
                         Different Ka   Equal Ka   Total
Different Ka/Ks ratio               3         10      13
Equal Ka/Ks ratio                  26        124     150
Total                              29        134     163
8.5
Discussion
This study took advantage of the availability of genome sequences of P.
marneffei and five other ascomycetes, S. cerevisiae, S. pombe, C. albicans,
A. nidulans and N. crassa. It also relied on the recent development
of methods to analyse selective constraints on duplicate genes in each
genome. Given the considerable phenotypic variation between the two
groups of distinct ascomycetes, yeasts and moulds, I speculated that gene
duplication may play an evolutionary role at different levels and selection
patterns of duplicate genes may be different. To my knowledge, no similar
analysis has been conducted in fungi, despite several genome-level studies
on gene duplications using S. cerevisiae as one of their model eukaryotic
organisms [95].
8.5.1
Gene duplication in ascomycetes is highly diverse
Most genomes show a certain degree of redundancy caused by single-gene
duplication, chromosomal segment duplication or complete genome
duplication (through polyploidisation). So do the ascomycetes I studied.
S. cerevisiae
S. cerevisiae has the largest amount of gene redundancy
among all ascomycetes I analysed. Previous studies have revealed that
its genome contains approximately 55 large duplicated chromosomal
regions [345]. It has been widely accepted that the duplicated regions
found in the modern Saccharomyces species are probably the result of
a whole-genome duplication (tetraploidisation) approximately 10^8 years
ago [95, 250, 279, 280, 345]. This proposed genome duplication might
coincide with the origin of the ability to grow under anaerobic conditions,
one of the most striking physiological differences between S. cerevisiae
and other yeasts.
S. pombe
S. pombe and S. cerevisiae have been separated for as long
as 420 million years [289]. Comparing the two yeasts, S. pombe has
fewer gene duplications than S. cerevisiae, which may account in part
for the smaller genome size. Transposable elements exist in the S. pombe
genome. However, their proportion is low compared to S. cerevisiae.
Using phylogenetic analysis, Hughes and Friedman [136] suggested that
parallel gene duplication appears to have played a role in the independent
origin of similar adaptations in the two unicellular fungi, S. pombe and S.
cerevisiae [136]. That is to say, gene duplications have occurred independently in the same gene families in S. pombe and S. cerevisiae; S. pombe
has adapted to a similar unicellular lifestyle without polyploidisation.
C. albicans
The age distribution of relatively young duplicate genes
(Ks < 5) in C. albicans (Fig. 8.1) suggests that duplication events are
likely to occur continuously during the course of evolution in this yeast.
In neither S. pombe nor C. albicans has evidence of polyploidisation,
such as duplicated genomic blocks, so far been found. Hence, genome
duplication as happened in S. cerevisiae, which may represent an extreme
adaptive strategy for providing genetic raw material for the functional
divergence of novel genes, has not occurred in C. albicans.
A. nidulans
A. nidulans contains a relatively large number of recently
duplicated gene pairs: 43 in total with Ks < 0.5. The age distribution of
duplicate genes (Ks < 5) in A. nidulans displays a high peak at Ks = 0.1
to 0.2 and shows a pattern similar to that in S. cerevisiae (Fig. 8.2).
However, S. cerevisiae has undergone genome duplication and there are
extensive duplicated blocks in its genome, traces of the proposed
ancient tetraploidy that remain detectable after widespread deletion of
superfluous duplicate genes and sequence divergence. Most of the gene
pairs in these duplicated regions are believed to have been produced
simultaneously or within a narrow time frame [95]. Based on the similar
patterns of age distribution of gene pairs between A. nidulans and S.
cerevisiae, I propose that duplicate genes in A. nidulans probably
originated through one or more episodic, large-scale gene duplications in
a relatively short period of time. What is uncertain is whether such a
peak of gene duplication over the course of evolution implies a
polyploidisation event in A. nidulans. As noted by Friedman and Hughes
[95], a peak of gene duplication need not imply a polyploidisation event.
Therefore, it would be interesting to know how many duplicated blocks are
present within and between A. nidulans chromosomes when the genome
sequencing of A. nidulans is completely finished.
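The age distributions discussed in this section are histograms of Ks values for duplicate pairs. A minimal sketch of such a binning follows; the bin width of 0.1, the function names and the example values are illustrative assumptions, not the settings or data used to draw Figs. 8.1 and 8.2.

```python
def ks_histogram(ks_values, bin_width=0.1, ks_max=5.0):
    """Count duplicate pairs per Ks bin over the range [0, ks_max)."""
    n_bins = round(ks_max / bin_width)
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < ks_max:
            # Clamp guards against floating-point overshoot at the last edge.
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    return counts


def peak_bin(counts, bin_width=0.1):
    """Return the (lower, upper) Ks bounds of the most populated bin."""
    index = max(range(len(counts)), key=counts.__getitem__)
    return index * bin_width, (index + 1) * bin_width


# Hypothetical Ks values for illustration only; 6.0 falls outside Ks < 5.
example_ks = [0.12, 0.15, 0.18, 0.45, 2.35, 6.0]
example_counts = ks_histogram(example_ks)
```

A single sharp peak at small Ks, as described for A. nidulans, would appear as one dominant bin near the origin; a second ancient peak would appear as a separate cluster of populated bins at larger Ks.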
P. marneffei
Slightly fewer genes in P. marneffei belong to multiple
gene families than in A. nidulans. However, 52 pairs are young duplicate
genes compared to 43 in A. nidulans. There is no difference in the overall
extent of duplicate genes between these two close species. The pattern
of the Ks histogram is broadly similar to those of A. nidulans and S.
cerevisiae. A difference is the dual-peak pattern, seemingly implying
that, besides the modern duplications, there was an ancient large-scale
duplication. The modern peak is at a similar location, Ks = 0.1, to
that of A. nidulans and other fungi, but on a smaller scale (less than 25%
of genes belong to this peak) compared to that of A. nidulans. In contrast,
the second peak at Ks = 2.0 to 4.0 is more apparent than in other fungi
except N. crassa. More evidence is needed before any solid conclusion
can be reached, though.
N. crassa
N. crassa exhibits much greater morphological and developmental
complexity. Its genome is approximately three times the size
of the S. cerevisiae genome, and accordingly has a protein count much
larger than those in yeasts. However, the paucity of duplicate genes in
N. crassa is obvious: (1) the number of multigene families in N. crassa is
much smaller than that in yeast, and (2) the number of gene pairs with
a small Ks (0.05 < Ks < 0.5) in N. crassa is much smaller than those
in unicellular yeasts (Table 8.1). An extraordinary feature of N. crassa,
repeat-induced point (RIP) mutation [219], has been suggested to play a
major role in preventing gene innovation through gene duplication and
to be responsible for this paucity. RIP, acting as a defence against mobile
DNA [219], can detect and mutate both copies of a sequence duplication.
In fact, RIP is so efficient that all gene duplications remaining
in the N. crassa genome have been proposed to have arisen and been fixed
before the emergence of the RIP mechanism. Examples of the remaining
multigene families that may have ‘survived’ RIP include hexose
transporters and cellulases (Table 8.2). N. crassa may have other
mechanisms of gene innovation, since gene duplication has rarely occurred
in its genome.
Ascomycetes display a wide variation in the number of gene duplication events. This may have provided the foundation for specialisation of
a number of genes and their corresponding proteins, and formed the basis
for diversification. Amplification of their genetic material might increase
their fitness in adapting to the environment. Examples include genes
for the yeast hexose transporters, which increase fitness in low-glucose
conditions; genes for N. crassa cellulases, which allow growth on decaying
plant material; and genes for cytochrome P450 and efflux systems involved
in detoxification.
8.5.2
Different selective constraints in yeasts and filamentous ascomycetes
There are different models, such as the classical model and the
duplication-degeneration-complementation (DDC) model, to explain the
creation of novel genes by gene duplication. The classical model
emphasises that one copy is neutral and free to evolve while the other
remains under selective pressure. The DDC model [90] explains
sub-functional divergence after a gene has been duplicated. According to
the DDC model, the two gene copies acquire complementary loss-of-function
mutations in independent sub-functions. Thus both genes are required to
produce the full complement of functions of the single ancestral gene.
Both the classical model and the DDC model predict a period immediately
following
duplication when the genome should be able to tolerate a high degree of
nonsynonymous substitutions in one member of a duplicate pair because
the other member is still functioning at full strength.
Comparing Ka with Ks in each genome, I found a common pattern
in all fungi which is in partial agreement with these theoretical expectations. First, in either filamentous fungi or yeasts, purifying selection
was dominant against amino acid changes in paralogous genes. This
confirms the earlier observation that paralogs evolve under purifying selection [211], which challenges the classical model but supports the DDC
model. Second, recent duplicates with smaller Ks appear to tolerate
more replacement amino-acid substitutions than older duplicates, which
is compatible with both models.
I also found two exclusive patterns in filamentous fungi. The first
finding is that there are significantly (p < 0.01) higher values of the
Ka/Ks ratio in paralogs in moulds than in those in yeasts with a similar
level of divergence (Table 8.3). Filamentous fungi show greater
morphological and developmental complexity than do yeasts, and their
genomes are normally larger. As gene duplication is a source of novel
protein functions, the bigger genome size may partially result from
frequent gene duplications, which provide a basis for divergence and
result in an increase in novel genes through neofunctionalisation, or an
increase in gene number through subfunctionalisation. Therefore,
the higher value of the Ka/Ks ratio in paralogs in moulds may imply that,
at a similar stage after duplication, gene pairs in filamentous fungi have
faster evolutionary rates than those in yeasts. Either positive selection
or relaxed functional constraint can cause a higher value of the Ka/Ks
ratio. Few gene pairs in moulds are actually found to be under positive
selection when Ka/Ks > 1 is used as the indicator of positive selection.
Thus, a slightly elevated Ka relative to Ks accounts for the larger value
of Ka/Ks given by gene pairs in moulds.
Another interesting finding is that paralogs in A. nidulans, P. marneffei and N. crassa appear to be under weaker functional constraint than
orthologs at the same age. In other words, orthologs in moulds experience stronger functional constraints than paralogs. Natural selection
seems to allow paralogs in these three filamentous fungi to mutate with
less constraint, which may lead to more advantageous mutations. This
phenomenon was first observed in eukaryotes [174] but it has not been
reported in fungi. Note that this trend is not observed in the unicellular
yeasts, S. cerevisiae, S. pombe, and C. albicans. Therefore, it is suggested
that elevated functional constraint in orthologs or weaker functional constraint in paralogs is a more common feature in the evolutionary pattern
of multicellular eukaryotes.
8.5.3
Majority of paralogous genes evolve symmetrically
Estimation of asymmetric evolutionary rates was conducted mainly on
paralogs from S. cerevisiae and C. albicans, so the result should not be
applied to other species. Of a total of 163 analysed gene pairs, 29 (17.8%)
evolve at significantly (p < 0.05) different rates (Table 8.4). Therefore,
in the majority of cases at least in S. cerevisiae and C. albicans, both
paralogs evolved at approximately the same rate, under similar levels of
purifying selection.
Several similar studies have been done in S. cerevisiae and in several
other eukaryotes. Some concluded that both copies of a duplicate gene
typically evolve at the same rate [137, 174, 265], whereas others suggested
that asymmetric divergence between two paralogs is not uncommon. Because
different organisms were used in those studies and different methods with
varying sensitivities were applied, it is hard to compare data in this study
with others directly. For instance, Kondrashov et al. [174] selected 15 S.
cerevisiae gene triplets and, using a distance-based method, found
no paralogs showing different rates. In another study, Conant and Wagner
[59] identified six of 22 (27%) gene triplets in S. cerevisiae, and three
(21%) of 14 in S. pombe, that showed asymmetry in Ka using a codon-based
model following Muse and Gaut [226].
An asymmetric evolutionary rate is not always associated with an
asymmetric evolutionary constraint, as indicated by Ka/Ks. Moreover,
no simple dependence between evolutionary rate and gene function is
observed (data not shown). This finding is inconsistent with Zhang’s
finding in young paralogs of human genes [371], that genes with different
Ka/Ks ratios tend to evolve at different rates, suggesting that different
functional constraints might be largely responsible for the unequal evolutionary rates. The incongruence may be again due to the difference in
species used in the studies.
In conclusion, this chapter reports the variation in the extent of gene
duplications in ascomycetes. The age distribution of gene duplications
tentatively suggests that the P. marneffei genome has experienced a
recent as well as an ancient large-scale duplication. Analysis of the divergence of evolutionary rates in S. cerevisiae and C. albicans revealed
that less than 20% of gene pairs in these two yeasts show asymmetric
divergence patterns in amino-acid substitutions. I speculate that the different extent and evolutionary pattern of duplicate genes in ascomycetes
might be associated with their genotypical and phenotypical differences.
Chapter 9
ACCELERATED EVOLUTIONARY RATE MAY BE
RESPONSIBLE FOR THE EMERGENCE OF
LINEAGE-SPECIFIC GENES
Once the genome of Penicillium marneffei became available, genes
could be predicted and annotated. Hundreds of these predicted genes lack
homology to any known gene. These are species-specific genes, also called
“orphan” genes. Where do these genes come from? This is still a mystery.
One suggestion has been that most orphan genes evolve rapidly
so that similarity to other genes cannot be traced after a certain evolutionary distance. This can be tested by examining the divergence rates
of genes with different degrees of lineage specificity. Here the lineage
specificity (LS) of a gene describes the phylogenetic distribution of that
gene’s orthologs in related species. Highly lineage-specific genes will be
distributed in fewer species in a phylogeny.
In this chapter, I used the complete genomes of seven ascomycetes
and two animals to define several levels of LS, such as Eukaryotes-core,
Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific,
Aspergillus-specific and Saccharomyces-specific. The rates of gene
evolution in groups of higher LS are compared with those in groups of
lower LS.
Molecular evolutionary analyses indicate a significant increase in nonsynonymous nucleotide substitution rates in genes with higher LS. Multiple
regression analyses suggest that LS is significantly correlated with the
evolutionary rate of the gene. This correlation is stronger than those of a
number of other factors that have been proposed as predictors of a gene’s
evolutionary rate, including the expression level of genes, gene
essentiality or dispensability and the number of protein-protein
interactions. The
significantly accelerated evolutionary rates of genes with higher LS may
reflect the influence of selection and adaptive divergence during the emergence of orphan genes. These analyses suggest that accelerated rates of
gene evolution may be responsible for the origin of apparently orphan
genes.
This chapter is very closely based on a paper I have published with
colleagues [in press]. The original draft of the manuscript was revised by
Dr. David K. Smith, in the Department of Biochemistry, HKU.
A preliminary version of this work was presented at the SMBE
conference on 17th June 2004.
9.1
Introduction
During annotation of genome sequences a substantial fraction of the putative genes are found to lack sequence similarity to any of the genes in public databases. These genes or protein-coding regions have been referred
to as “orphan” genes. Some may have crucial organism-specific functions;
however, the origin and evolution of orphan genes remain poorly
understood. A proposed explanation of this problem has been that some
genes evolve so rapidly that their homologs cannot be discovered over
larger evolutionary distances. Although this has been supported by recent findings in Drosophila that orphan genes evolve, on average, more
than three times faster than non-orphan genes [73], the influence of other
factors on the evolutionary rate of genes should be taken into account.
These factors include the expression level of genes [127, 241], a gene’s
dispensability (the organism’s fitness after deletion of the gene) [178],
gene essentiality [343], gene duplication [150, 357], and the number of
protein-protein interactions involving the gene’s product [93, 335]. Due
to the inherently stochastic property of evolutionary rates, the influence
of many of these factors has proved difficult to confirm and their relative
importance also needs further elaboration.
In order to systematically examine the relationship between a gene’s
evolutionary rate and the origin of orphan genes, as well as to assess the
influence of other factors, we have devised a study based on the following
rationale. Orthologs of a gene usually have a particular phyletic distribution in several related species, thus giving each gene a certain lineage
specificity (LS). Orphan genes represent the extreme of LS because they
are only present in one node of a phylogeny. In contrast, highly conserved genes have a low degree of LS and are widely distributed, while a
range of different degrees of LS can be defined for other gene groups. If
an elevated evolutionary rate is the major cause of the origin of orphan
genes, one should find a correlation between evolutionary rate and LS.
Slower evolving genes should tend to be less lineage specific.
Studying the relationship between the evolutionary rate of genes and
LS may reveal the dynamic processes that lead to the origin of species
specific, or orphan, genes. It can also be tested whether the evolutionary
rate leading to the emergence of orphan genes is relatively constant or
highly variable. If genes become lineage-specific gradually, one might
expect a simple relationship (e.g., a linear relationship, perhaps after data
transformation) between divergence time and genetic distance, otherwise,
a more complex relationship would be expected.
To investigate these matters, the complete sets of predicted
protein-coding genes from Aspergillus fumigatus
(http://www.sanger.ac.uk/Projects/A_fumigatus/) and Saccharomyces
cerevisiae [110] were extracted. Orthologs of these genes from five other
ascomycotan fungi, Aspergillus nidulans
(http://www.broad.mit.edu/annotation/fungi/aspergillus/),
Schizosaccharomyces pombe [354], Candida albicans [65], Neurospora crassa
[101], and Saccharomyces mikatae, and two metazoans, Caenorhabditis
elegans [79] and Drosophila melanogaster [2], were also obtained.
The fungi studied here represent three major Ascomycetes classes,
Euascomycetes, Hemiascomycetes and Archaeascomycetes. The Euascomycetes,
which contain well over 90% of Ascomycota, include Aspergillus and
Neurospora. The Hemiascomycetes include the Saccharomyces yeasts and
Candida. The fission yeast, S. pombe, belongs to the class
Archaeascomycetes, whose members are distantly related to each other and
are possibly remnants of an early radiation of Ascomycota [289]. These fungi
also represent two major fungal morphological subdivisions, yeasts and
moulds. Yeasts, like S. cerevisiae, S. mikatae, C. albicans, as well as
S. pombe, have life cycles characterised by unicellular (occasionally dimorphic) growth. In contrast, the filamentous ascomycota, A. nidulans,
A. fumigatus and N. crassa, predominantly grow as hyphal filaments.
Despite having such a morphological divergence, all of them share a relatively recent common ancestor with respect to the rest of the eukaryotes.
The phylogeny of these ascomycota is clear and generally accepted, except for the ancient Schizosaccharomyces, S. pombe [289].
Genes from S. cerevisiae and A. fumigatus were classified, according
to their phylogenetic profiles, into several LS groups as follows:
Eukaryotes-core, Ascomycota-core, Euascomycetes-specific,
Hemiascomycetes-specific, Aspergillus-specific and Saccharomyces-specific.
Average nonsynonymous substitution rates, Ka, of genes among LS groups were
compared and correlations between LS and several other factors, for
example, gene expression level, gene dispensability and gene redundancy,
were explored. The relative importance of LS and other factors, in terms of
the prediction of a protein’s evolutionary rate, was evaluated, and whether
the divergence rate is relatively constant over genes with similar degrees
of LS was tested.
9.2
Literature Review
Under a gene-centric rationale, our understanding of evolutionary
novelties is limited to the consequences of the creation of new genes.
Attention has recently been paid to this phenomenon in genomes, yet the
mechanism remains a mystery. Some insights have been obtained, especially
by studying newly created genes (i.e., young genes) [210, 257, 204]. A
number of mechanisms that may be responsible for new gene origination have
been proposed. These include gene duplication, exon shuffling,
retroposition, lateral gene transfer, and transposable element assimilation
(for review, see [204]). Gene duplication has been reviewed in
Chapter 8.
Here I focus only on the origination of exons, the basic units of genes.
Once exons exist, exon shuffling (the recombination or exclusion of exons)
is widely recognised as important in the generation of new genes [109, 244,
155]. The creation of new exons has been proposed to occur through three
possible processes: (1) exaptation of transposable elements [27, 215, 230, 293],
(2) exon duplication [172, 194], and (3) exonisation of intronic sequences
[173].
Exaptation of transposable elements is a process in which a retroelement
takes on new functions for a genome. It was first exemplified by
the integration of an Alu element into the coding portion of the human
decay-accelerating factor (DAF) gene [215]; in addition, an L1
retrotransposon insertion that provides a premature stop codon and
polyadenylation sites is responsible for the generation of the secreted
form of the human transmembrane protein attractin [305]. Recently, about 4%
of human genes were found to contain transposable elements in their coding
regions [230]. Exon duplication has also been reported: about 10% of
all genes in the genomes of human, fly and worm contain tandemly
duplicated exons. These are likely to be involved in mutually
exclusive alternative splicing events, which might confer further
evolutionary potential [194]. Exonisation of intronic sequences is the most
easily conceived mechanism but few examples of such a process have
been reported [173]. Wang et al. [339] identified newly evolved exons by
EST comparison against an outgroup to learn how new exons originate
and evolve, and how often new exons appear. They claim that the new
exon origination rate is about 2.71 × 10^-3 per gene per million years and
that a much higher proportion of new exons than of old exons have Ka/Ks
ratios > 1.
It is noteworthy that the gene origination processes mentioned above do
not necessarily create new genes with novel functions, but may instead
yield new variants of genes [369]. Moreover, newly evolved genes often show
an elevated evolutionary rate driven by positive selection [205, 235, 147,
338, 369].
9.3
Materials and Methods
9.3.1
Sequences and data sets
Table 9.1: Genomic sequence sources.
Species           Web source for the sequence data
A. nidulans       www-genome.wi.mit.edu/annotation/fungi/aspergillus/
A. fumigatus      www.sanger.ac.uk/Projects/A_fumigatus
N. crassa         www-genome.wi.mit.edu/annotation/fungi/neurospora/
S. cerevisiae     genome-www.stanford.edu/Saccharomyces
S. mikatae        ftp://genome-ftp.stanford.edu/pub/yeast/data_download/sequence/fungal_genomes/S_mikatae
C. albicans       genolist.pasteur.fr/CandidaDB
S. pombe          www.genedb.org/genedb/pombe/index.jsp
C. elegans        www.sanger.ac.uk/Projects/C_elegans/wormpep
D. melanogaster   www.fruitfly.org
For each Ascomycotan, the complete set of available amino acid sequences
and coding DNA sequences was downloaded from the repositories
given in Table 9.1. All known or suspected pseudogenes and genes in
mitochondrial genomes were removed. The S. mikatae dataset is derived
from the ORF predictions of Cliften et al. [56].
[Figure 9.1 image not reproduced: phylogenetic tree of A. fumigatus,
A. nidulans, N. crassa, S. cerevisiae, S. mikatae, C. albicans, S. pombe
and animals, with node divergence times of 1,458, 1,144, 1,085, 841, 670
and ~10 Mya, alongside presence/absence profiles for the LS groups
Eukaryotes-core, Ascomycota-core, Euascomycetes-specific,
Hemiascomycetes-specific, Aspergillus-specific and Saccharomyces-specific.]
Figure 9.1: LS classification based on phylogenetic profiles of genes.
Divergence times were adopted from Hedges and Kumar [131]. The divergence
times between S. cerevisiae and S. mikatae and between A. fumigatus and
A. nidulans are based on the estimates by Cliften et al. [56]
and [87], respectively. A solid square (■) means the gene is present in
the corresponding species; an open square (□) means it is absent.
Yeast gene expression data came from Cho et al. [51] who characterised all mRNA transcript levels during the cell cycle of S. cerevisiae.
mRNA levels were measured at 17 time points at 10 min intervals, covering nearly two full cell cycles. The mean of these 17 numbers was taken
to produce one general time-averaged expression level for each protein.
Protein dispensability was assessed by the fitness effect of a
single-gene deletion, as measured by the average growth rate of the knockout
strain in several types of media. The results of assays of a nearly complete
set of single gene deletions in S. cerevisiae [297] were obtained, and the
data were manipulated following the method by Gu et al. [119]. Briefly,
the fitness value f_i is defined as r_i/r_pool, where r_i is the growth rate
of the strain with gene i deleted and r_pool is the pooled average growth
rate of different strains.
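The fitness definition above amounts to a simple normalisation. In the sketch below, r_pool is taken as the arithmetic mean of the growth rates supplied, which is an assumption about how the pooled average was formed; the function name and the gene labels are hypothetical.

```python
def relative_fitness(growth_rates):
    """Map each deleted gene i to f_i = r_i / r_pool.

    growth_rates: dict of gene id -> growth rate of the deletion strain.
    r_pool is assumed here to be the mean growth rate over all strains given.
    """
    if not growth_rates:
        raise ValueError("at least one strain is required")
    r_pool = sum(growth_rates.values()) / len(growth_rates)
    return {gene: rate / r_pool for gene, rate in growth_rates.items()}


# Hypothetical growth rates for three deletion strains.
example_fitness = relative_fitness({"geneA": 1.0, "geneB": 0.5, "geneC": 1.5})
```

Under this definition a value below 1 marks a deletion strain that grows more slowly than the pooled average, that is, a less dispensable gene.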
Essential genes were from the dataset of the Saccharomyces Genome
Deletion Project, which contains 1,106 essential genes
(http://www-sequence.stanford.edu/group/yeast_deletion_project/).
Although gene dispensability and gene essentiality are highly associated,
they were treated as two separate variables in order to compare the results
of each variable to previous studies.
A list of protein-protein interactions among S. cerevisiae proteins
was obtained from two integrated interaction databases, YEAST GRID
[25] and the yeast subset of DIP [274], and a number of major
high-throughput studies published to date [106]. The final non-redundant set
contains 252,011 interactions involving 5,698 proteins.
9.3.2
Identification of orthologs
Orthologs of the genes from S. cerevisiae and A. fumigatus in each other
and in other genomes studied here were identified by the automatic clustering method INPARANOID [261]. Orthologs between the genomes
of two species are derived in this method from mutual best pairwise
BLASTP hits. A further reciprocal test was applied by requiring the
longest region of local sequence similarity between putative orthologs to
cover ≥ 80% of each sequence and to have ≥ 30% sequence identity in
this region. 113 pairs that did not pass this test were excluded. A gene
was considered as being absent from another genome if no sequence similarity could be detected between the gene and the genes in that genome.
To define the level at which sequence similarity was not detectable, a
TBLASTN expectation (E) value of 1 × 10^-2 with respect to a fixed
effective search space (set to the size of the N. crassa genome) was used
as a cut-off.
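The two filters described above, mutual best hits followed by the coverage and identity test, can be sketched as follows. This is a simplified illustration and not the INPARANOID implementation, which also clusters in-paralogs; the input layout (dictionaries of bit scores), the function names and the treatment of the aligned region as a single length are assumptions.

```python
def mutual_best_hits(hits_ab, hits_ba):
    """Return (a, b) pairs that are each other's best-scoring hit.

    hits_ab maps each query in genome A to {subject in genome B: score};
    hits_ba is the same for the reverse search.
    """
    best_ab = {q: max(s, key=s.get) for q, s in hits_ab.items() if s}
    best_ba = {q: max(s, key=s.get) for q, s in hits_ba.items() if s}
    return [(a, b) for a, b in best_ab.items() if best_ba.get(b) == a]


def passes_alignment_test(aligned_length, identical, length_a, length_b,
                          min_coverage=0.80, min_identity=0.30):
    """Check the >= 80% coverage and >= 30% identity requirements."""
    covers_both = (aligned_length >= min_coverage * length_a
                   and aligned_length >= min_coverage * length_b)
    identity_ok = identical >= min_identity * aligned_length
    return covers_both and identity_ok


# Hypothetical scores: a2's best hit is b2, but b2's best hit is a1.
hits_ab = {"a1": {"b1": 200, "b2": 50}, "a2": {"b2": 80}}
hits_ba = {"b1": {"a1": 190}, "b2": {"a1": 90, "a2": 75}}
pairs = mutual_best_hits(hits_ab, hits_ba)
```

Only pairs returned by the first function and accepted by the second would be carried forward as putative orthologs in this simplified scheme.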
Orthologs of fast-evolving genes may not be detected in more distantly
related genomes by the TBLASTN search used above. To address
this, ancestral sequence(s) were constructed (Collins et al. [58]), based
on the detected orthologs, using the maximum likelihood method
implemented in the PAML phylogenetic analysis package version 3.13d [359].
Ancestral sequences are expected to be less divergent from their possible orthologs in the more distant genomes and their reconstructions
were used to search, as above, for orthologs in the more distantly related
genomes. If potential orthologs were identified, the gene was excluded
from further analysis to avoid ambiguity in the assignment of genes to
LS groups.
9.3.3
Classification of genes into LS groups
Phylogenetic profiles, a gene table giving 1 (or 0) if a gene is present in
(or absent from) a genome, for the genes from S. cerevisiae and A.
fumigatus, were constructed based on the detected orthologs in the genomes
studied. The genes were then classified into the different LS groups:
Eukaryotes-core (present in all genomes studied), Ascomycota-core (present
in all fungal genomes), Hemiascomycetes-specific, Euascomycetes-specific,
Saccharomyces-specific and Aspergillus-specific (Fig. 9.1). The
phylogenetic tree relating the species was derived from [131].
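The classification step can be sketched as a lookup of a gene's presence/absence profile against the profile expected for each LS group. The exact membership sets below are assumptions inferred from the group names and the species list (only the two core groups are defined explicitly in the text), and the function name is hypothetical.

```python
FUNGI = {
    "A. fumigatus", "A. nidulans", "N. crassa",
    "S. cerevisiae", "S. mikatae", "C. albicans", "S. pombe",
}
ANIMALS = {"C. elegans", "D. melanogaster"}

# Assumed profile for each LS group: the set of genomes containing the gene.
LS_GROUPS = {
    "Eukaryotes-core": FUNGI | ANIMALS,
    "Ascomycota-core": FUNGI,
    "Euascomycetes-specific": {"A. fumigatus", "A. nidulans", "N. crassa"},
    "Hemiascomycetes-specific": {"S. cerevisiae", "S. mikatae", "C. albicans"},
    "Aspergillus-specific": {"A. fumigatus", "A. nidulans"},
    "Saccharomyces-specific": {"S. cerevisiae", "S. mikatae"},
}


def classify_ls(present_in):
    """Return the LS group whose profile matches exactly, or None."""
    present = frozenset(present_in)
    for group, members in LS_GROUPS.items():
        if present == members:
            return group
    return None  # profile does not fit any defined LS group
```

Genes whose profile matches none of the groups, for example a gene found only in S. cerevisiae and S. pombe, are left unclassified in this sketch.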
9.3.4
Divergence Times
Lineage divergence times are somewhat controversial [285]. In this work
divergence times were taken from [130] and [131]. These give the following
divergence times (Fig. 9.1): Animals vs Fungi, 1576 Mya; Fungi vs Ascomycetes, 1144 Mya; Saccharomyces and Candida vs Aspergillus, 1085
Mya; Candida vs Saccharomyces, 841 Mya; Neurospora vs Aspergillus,
670 Mya. Divergence times for S. cerevisiae vs S. mikatae and A. fumigatus vs A. nidulans were taken as ∼10 Mya.
To convert LS into numeric form to calculate correlations with other
properties, the ratio of the time of the animal-fungi divergence to that
of the divergence of a lineage from its last common ancestor was used.
For example, the Eukaryotes-core value is 1 (1458/1458) while that of
Ascomycota-core is 1.27 (1458/1144). The final results were not sensitive
to changes in the divergence time estimates used for this
category-to-numeric conversion.
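The conversion is a single ratio, shown here for the two worked values given in the text. The function name is hypothetical, and the reference time of 1458 Mya is taken from the example above.

```python
def ls_numeric(divergence_mya, reference_mya=1458.0):
    """Numeric LS: reference divergence time divided by the lineage's time."""
    if divergence_mya <= 0:
        raise ValueError("divergence time must be positive")
    return reference_mya / divergence_mya


eukaryotes_core = ls_numeric(1458)   # 1458 / 1458
ascomycota_core = ls_numeric(1144)   # 1458 / 1144
```

More recently diverged lineages receive larger values, so the numeric scale increases with lineage specificity.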
9.3.5
Estimation of substitution rates and statistical analyses
The number of synonymous substitutions per synonymous site, Ks , and
the number of nonsynonymous substitutions per nonsynonymous site,
Ka , were estimated between A. fumigatus-A. nidulans ortholog pairs and
S. cerevisiae-S. mikatae ortholog pairs in the Euascomycetes and Hemiascomycetes lineages, respectively. For each ortholog pair, the orthologous protein sequences were aligned using ClustalW version 1.82 with the
default parameters. The corresponding nucleotide-sequence alignments
were derived by replacing each residue in the aligned protein sequences with the
corresponding codon from the respective coding sequence, using MBEToolbox (Chapter 10 [35]). Ks and Ka
were then estimated by the maximum-likelihood method implemented in
the CODEML program of PAML [359].
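The derivation of a codon alignment from a protein alignment can be sketched as below. This is not the MBEToolbox implementation, only an illustration in Python; the function name is hypothetical, and it assumes the coding sequence is in frame and starts at the first aligned residue.

```python
def protein_to_codon_alignment(aligned_protein, cds, gap="-"):
    """Thread the codons of `cds` onto a gapped protein sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(codons) < n_residues:
        raise ValueError("coding sequence is shorter than the protein")
    aligned = []
    next_codon = 0
    for aa in aligned_protein:
        if aa == gap:
            aligned.append(gap * 3)  # one protein gap becomes three nucleotide gaps
        else:
            aligned.append(codons[next_codon])
            next_codon += 1
    # Codons beyond the last residue (e.g. a stop codon) are ignored.
    return "".join(aligned)


print(protein_to_codon_alignment("M-KL", "ATGAAACTG"))  # ATG---AAACTG
```

Applying the function to both members of an aligned ortholog pair gives two nucleotide sequences of equal length whose codon boundaries agree with the protein alignment.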
High apparent sequence divergence, as shown by high Ks or Ka values,
is often associated with problems such as difficulty in alignment, or differences in codon usage bias or nucleotide composition in the sequences.
Ortholog pairs with Ks < 0.05 may include too few substitutions to
provide a statistically significant measure of change [371]. To accurately
measure the intensity of selective forces acting on a protein, only ortholog
pairs with Ka ≤ 2 and 0.05 ≤ Ks ≤ 2 were used. Similar results were
obtained when more relaxed cutoffs for Ka and Ks (≤ 5) were used (data
not shown). All known ribosomal protein genes were excluded from the
data set as their high level of conservation gives them substantially lower
average values of Ka , Ks and Ka /Ks than those for the rest of the genes.
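The rate cutoffs described above amount to a simple filter. The sketch below uses hypothetical gene names and values purely for illustration.

```python
def passes_filter(ka, ks, ks_min=0.05, rate_max=2.0):
    """Keep ortholog pairs with Ka <= 2 and 0.05 <= Ks <= 2."""
    return ka <= rate_max and ks_min <= ks <= rate_max


# (name, Ka, Ks) tuples; the values are invented for this example.
pairs = [("gene1", 0.12, 0.80), ("gene2", 0.30, 2.70), ("gene3", 0.001, 0.02)]
kept = [name for name, ka, ks in pairs if passes_filter(ka, ks)]
print(kept)  # ['gene1']
```

Passing rate_max=5.0 corresponds to the more relaxed cutoff mentioned in the text.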
Statistical regression analyses were performed by referring to the procedure described by Rocha and Danchin. Since the linear regression
model works better with normally distributed variables, the scatter plots of Ka against
other variables were examined to determine whether linear models are
reasonable for these variables. It was necessary to transform the values
of Ka , expression level and fitness of gene deletion into their logarithmic
forms to give a distribution closer to a normal distribution. For the same
reason, log(Ka ) values were used in the correlation and partial correlation
analyses.
9.3.6
Detection of rate variability across species - Relative Divergence
Score (RDS)
To measure the degree of divergence of genes in a species away from orthologs in other species, TBLASTN comparisons for all proteins in the A.
fumigatus or S. cerevisiae genomes were run against all DNA sequences
in the 9 genomes studied here. The relative divergence score (RDS) was
defined as $D_{A,B} = -\ln(S_{A,B} / S_{A,A})$, where $S_{A,B}$ is the TBLASTN bit
score for the query protein from genome A against subject genome B, and $S_{A,A}$ is the score of the query against its own genome. Such
scores range from 0 (identical proteins found in the subject genome) to
infinity (no significant hit found). For genes belonging to each LS group,
and to the relevant species at each divergence time point, 10,000 bootstrapped medians of random samples were taken from the RDS values
of the genes. The mean of the bootstrapped medians was used as the
estimated RDS of the LS group.
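The RDS calculation and the bootstrap estimate can be sketched as follows. The function names, the seed and the example values are assumptions made for illustration; resampling with replacement at the original sample size is also an assumption, since the sampling scheme is not detailed above.

```python
import math
import random
import statistics


def rds(score_ab, score_aa):
    """Relative divergence score, -ln(S_AB / S_AA); infinite if there is no hit."""
    if score_ab is None or score_ab <= 0:
        return math.inf
    return -math.log(score_ab / score_aa)


def bootstrap_rds(values, n_boot=10000, seed=0):
    """Mean of the medians of n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    medians = [
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    ]
    return sum(medians) / len(medians)


# Hypothetical bit scores: (hit in subject genome, self hit).
scores = [(180.0, 200.0), (95.0, 210.0), (40.0, 150.0), (120.0, 125.0)]
values = [rds(s_ab, s_aa) for s_ab, s_aa in scores]
estimate = bootstrap_rds(values, n_boot=1000)
print(round(estimate, 3))
```

Genes with no significant hit have an infinite score and would need to be handled separately before averaging.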
9.4
Results
9.4.1
Evolutionary rate differences among LS groups
The Ascomycotan fungi used in this study represent two distinct fungal groups: Euascomycetes (A. nidulans, A. fumigatus and N. crassa)
and Hemiascomycetes (S. cerevisiae, S. mikatae and C. albicans) and
the more distantly related fission yeast, S. pombe. Data from the two
groups, Euascomycetes and Hemiascomycetes, were processed separately.
For the Euascomycetes sequences, we predicted 6,432 A. fumigatus-A.
nidulans orthologs and calculated the nonsynonymous substitution rate,
Ka , and the synonymous substitution rate, Ks , for each gene pair. We
then classified the predicted orthologs into the following groups: (1)
Eukaryotes-core, (2) Ascomycota-core, (3) Euascomycetes-specific and
(4) Aspergillus-specific, according to the phylogenetic profiles of A. fumigatus genes.
The Hemiascomycetes sequences gave 3,707 pairs of
S. cerevisiae-S. mikatae orthologs which were processed similarly and
classified into four groups: (1) Eukaryotes-core, (2) Ascomycota-core,
(3) Hemiascomycetes-specific and (4) Saccharomyces-specific. Thus, LS
groups from (1) to (4) represent increasingly more recent times of origin.
[Figure 9.2: box plots of Ka (y-axis) for each LS group. (A) Eukaryotes-core, N = 113; Ascomycota-core, N = 27; Euascomycetes-specific, N = 22; Aspergillus-specific, N = 21. (B) Eukaryotes-core, N = 17; Ascomycota-core, N = 23; Hemiascomycetes-specific, N = 22; Saccharomyces-specific, N = 297.]
Figure 9.2: Divergence of nonsynonymous substitution rate in LS groups.
The edges of the boxes indicate the upper and lower quartiles. The line at
the centre of the box indicates the median and the edges of the whiskers
represent the limits of 1.5 times the upper or lower inter-quartile ranges.
The circle (°) indicates cases with values between 1.5 and 3 box lengths
from the upper or lower edge of the box. The number of the gene pairs
(N) is given. (A) A. fumigatus-A. nidulans orthologs. (B) S. cerevisiae-S.
mikatae orthologs.
Filtering steps of (1) removing ortholog pairs with Ks , Ka > 2 or
Ks < 0.05, (2) excluding ribosomal proteins, and (3) eliminating genes
where possible similarity to a reconstructed ancestral sequence was found,
were applied to the data set. Step 3 removed only 3 gene pairs, 2 in the
Hemiascomycetes lineage and 1 in the Euascomycetes lineage, which may
be due to either the limits of the ancestral reconstruction method or the
relatively conservative criteria adopted in defining orthologs. Final sets
of 183 A. fumigatus-A. nidulans ortholog pairs and of 359 S. cerevisiae-S. mikatae ortholog pairs were obtained. The mean Ka , Ks and Ka /Ks
of the ortholog pairs in each LS group are given in Table 9.2.
Genes that are distributed in the more specific lineages tend to have
higher Ka values than more widely distributed genes. Box plots of the
distribution of the Ka values for the Aspergillus and Saccharomyces genes
are shown in Fig. 9.2 (A and B, respectively). In both the Aspergillus
and Saccharomyces gene sets, average Ka increases with the degree of LS
with significant among-group variation as measured by a Kruskal-Wallis
test (Aspergillus, P < 0.001; Saccharomyces, P < 0.001). Moreover, as
expected, Ka is consistently smaller than Ks within all LS groups, which
suggests the operation of purifying (negative) selection or functional constraints.
The ratio Ka /Ks (i.e., the rate of nonsynonymous substitutions corrected for neutral rates) showed a trend similar to Ka , namely, the values
of Ka /Ks for genes of high LS (e.g., Aspergillus-specific or Euascomycetes-specific genes) are significantly higher than those for genes of low LS (e.g.,
Eukaryotes-core or Ascomycota-core genes). The differences among the
rates of sequence divergence for different LS groups are more pronounced
for Ka than for Ks , which suggests that the acceleration of a gene’s divergence rate may be mainly caused by more relaxed purifying selection
against amino acid replacement. Functions of representative genes in different LS groups were also examined. Largely, the functions of highly
lineage-specific genes are poorly characterised or simply unknown.
[Figure 9.3: plots of substitution rate against Log(EXP) (x-axis) for the Saccharomyces-specific, Hemiascomycetes-specific, Ascomycota-core and Eukaryotes-core groups and for all genes. y-axis: (A) Log(Ka); (B) Log(Ks).]
Figure 9.3: Dependence of log gene expression level, Log(EXP), and
substitution rate. (A) log non-synonymous substitution rate, log(Ka ).
(B) log synonymous substitution rate, log(Ks ).
[Figure 9.4: plots of -ln(D), where D is the relative dissimilarity score (y-axis), against divergence time in Myr (x-axis). (A) Series: Euascomycetes-specific, Ascomycota-core and Eukaryotes-core; R2 values shown: 0.9518 and 0.9429. (B) Series: Hemiascomycetes-specific, Ascomycota-core and Eukaryotes-core; R2 values shown: 0.9544 and 0.939.]
Figure 9.4: Linear regression analysis of divergence time and RDS. (A)
LS of A. fumigatus-A. nidulans genes. (B) LS of S. cerevisiae-S. mikatae
genes.
9.4.2
Evolutionary rate-related factors of genes belonging to different
LS groups
The correlation between Ka and LS may be confounded by other factors.
For S. cerevisiae-S. mikatae orthologs, bivariate correlations were used
to compute the pairwise associations between Ka and LS and potentially
confounding factors. These factors include the expression level of genes,
the dispensability or essentiality of a gene, gene duplication and the number of protein-protein interactions of the gene product. The results are
summarised in the upper triangle of Table 9.3. The coefficient for correlation between log(Ka ) and LS is 0.584 (Pearson’s R, P < 0.01, Table
9.3), which is higher than that between log(Ka ) and any other factor or
that between any two other factors.
Log gene expression level correlates negatively with log Ka (R = -0.382, P < 0.01, Table 9.3, Fig. 9.3). This is consistent with previous studies which showed a correlation between Ka and gene expression
level [127, 241]. A correlation between Ka and gene essentiality has long
been proposed [343] but remains controversial [141,149]. The correlation
between log(Ka ) and gene essentiality was found to be weak, albeit significant (R = -0.163, P < 0.01), and essential genes have a lower mean
Ka (0.081, median 0.081) compared to that for non-essential genes (mean
0.136; median 0.110) (Mann-Whitney U test, P = 0.004).
Our data show a weak correlation between log(Ka ) and gene dispensability (R = 0.186, P < 0.001, Table 9.3), which is at a similar magnitude
to that of gene essentiality. This result is consistent with that recently
reported by Hirsh and Fraser. This correlation remains significant after controlling for gene expression levels (partial R = 0.240, P < 0.01),
suggesting the independent nature of gene dispensability as a factor.
Table 9.2: Average Ka , Ks and Ka /Ks among LS classes. ∗ A Kruskal-Wallis test reveals significant rate heterogeneity of average
Ka or average Ka /Ks of genes in different LS groups in both the Euascomycetes branch and the Hemiascomycetes branch, P < 0.001. § A
Kruskal-Wallis test reveals no significant rate heterogeneity of average Ks of genes in different LS groups in either the Euascomycetes
branch or the Hemiascomycetes branch, P > 0.01.

LS Class                    Number of gene pairs    Ka∗ mean (SD)    Ks§ mean (SD)    Ka/Ks∗ mean (SD)
A. fumigatus – A. nidulans (Euascomycetes branch)
Eukaryotes-core             113                     0.051 (0.032)    1.431 (0.441)    0.039 (0.027)
Ascomycota-core             27                      0.126 (0.069)    1.577 (0.329)    0.080 (0.042)
Euascomycetes-specific      22                      0.198 (0.118)    1.436 (0.490)    0.155 (0.091)
Aspergillus-specific        21                      0.293 (0.136)    1.263 (0.567)    0.261 (0.127)
S. cerevisiae – S. mikatae (Hemiascomycetes branch)
Eukaryotes-core             17                      0.018 (0.021)    0.586 (0.213)    0.029 (0.026)
Ascomycota-core             23                      0.031 (0.030)    0.639 (0.172)    0.047 (0.040)
Hemiascomycetes-specific    22                      0.072 (0.037)    0.839 (0.284)    0.091 (0.045)
Saccharomyces-specific      297                     0.131 (0.100)    0.830 (0.329)    0.165 (0.130)

Table 9.3: Correlation (Pearson’s R) (upper triangle) and partial correlation after controlling for log(Ks ) (lower triangle). Abbreviations: Ka :
nonsynonymous substitution rate; LS: lineage specificity; EXP: expression level; FIT: fitness effect (gene dispensability); ESS: gene essentiality;
DUP: duplicated (or not) gene; INT: number of interactions. Among
them, Ka , Ks , EXP and FIT are in their log forms.

        Ka        LS        EXP       FIT       ESS       DUP       INT       Ks
Ka      –         0.584     -0.382    0.186     -0.163    0.257     -0.308    0.429
LS      0.582     –         -0.271    0.195     -0.263    0.324     -0.428    0.185
EXP     -0.294    -0.161    –         -0.037    0.076     -0.113    0.197     -0.165
FIT     0.240     0.192     -0.049    –         0.032     -0.116    -0.159    -0.048
ESS     -0.018    -0.146    -0.091    0.033     –         0.020     0.243     -0.087
DUP     0.215     0.312     -0.065    -0.106    0.028     –         -0.163    0.160
INT     -0.253    -0.379    0.123     -0.175    -0.007    -0.111    –         -0.128

Gene duplication has been shown to play a role in influencing gene
divergence rates [119, 150, 357]. Genes were classified as duplicate genes
if they belonged to any multigene family and as singletons otherwise. The mean
Ka of 0.097 (median 0.049) for duplicate genes was significantly smaller
than the mean of 0.138 (median 0.114) for singleton genes (Mann-Whitney U test, P < 0.001). The same pattern was observed between
different LS groups with the exception of the Ascomycota-core group.
Ka has been shown to be positively correlated with Ks in several
species [116, 214, 239, 344]. Such a correlation, which may confound correlations between log(Ka ) and LS or with other factors, was observed here
for log(Ka ) and log(Ks ) (R = 0.429, P < 0.01, Table 9.3). To examine
the influence of the correlation of Ka with Ks on other factors, partial correlation coefficients between log(Ka ) and other variables were calculated
while holding the value of log(Ks ) constant. The results are given in the
lower triangle of Table 9.3 and indicate that, after controlling
for log(Ks ), log(Ka ) remains significantly correlated with LS. There is
little change in the value of the coefficient with or without controlling for
log(Ks ) (partial $R_{\log(Ka)-LS \mid \log(Ks)} = 0.582$ compared with $R_{\log(Ka)-LS} = 0.584$). Thus,
Ka is correlated with LS independently of Ks .
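The idea of a partial correlation can be illustrated with a short sketch. It uses the standard first-order formula on toy data and is not the statistical package used for the analyses above; the function names and the data are assumptions. In the toy data, x and y are correlated only because both depend on z.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y after controlling for z (first order)."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy data: x and y each equal z plus an independent component.
z = [1, 1, -1, -1]
x = [2, 0, 0, -2]
y = [2, 0, -2, 0]
print(pearson(x, y))          # 0.5
print(partial_corr(x, y, z))  # close to 0 once z is held constant
```

A partial coefficient that stays close to the raw coefficient, as for log(Ka ) and LS, indicates that the controlled variable accounts for little of the association.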
Table 9.4: Results of the regression analyses on 359 predicted S. cerevisiae-S. mikatae orthologs. ¶ R2 is the proportion of variation
in the dependent variable explained by the regression model constructed from the individual variable. The values indicate the
independent contribution of each variable to explain the global variance of log(Ka ). ∗ Order of variables entered into model at
each step. ∗∗ t statistics can indicate the relative importance of each variable in the model.

                      Indep. contribution (R2 )¶    Entry order∗    Unstd. coeff. (B)±1SE    Std. coeff. (β)    t∗∗        P
Included Variables
(Constant)            –                             –               -1.149±0.113             –                  -10.148    <0.0001
LS                    0.341                         1               0.048±0.004              0.562              11.676     <0.0001
log(EXP)              0.164                         2               -0.197±0.038             -0.247             -5.124     <0.0001
Excluded Variables
log(FIT)              0.035                         3               –                        0.087              1.836      >0.1
DUP                   0.066                         4               –                        0.070              1.399      >0.1
ESS                   0.027                         5               –                        0.038              0.787      >0.1
INT                   0.095                         6               –                        -0.028             -0.546     >0.1

A decrease in the absolute value of the correlation coefficient was observed between log(Ka ) and expression level when controlling for log(Ks )
($|R_{\log(Ka)-\log(EXP) \mid \log(Ks)}| = 0.294$ and $|R_{\log(Ka)-\log(EXP)}| = 0.382$).
This suggests Ks might be a confounding factor for gene expression level
in determining Ka . Figure 9.3 plots the relationship of log expression
level with log(Ka ) (Fig. 9.3A) and with log(Ks ) (Fig. 9.3B) showing the
values for the Saccharomyces gene lineage groups. The more consistent
relationship of log expression value with log(Ks ) among the genes can be
seen.
Linear multiple regression was used to further examine the effect of
multiple factors on log(Ka ). Gene essentiality and gene redundancy were
recoded to be quantitative variables by using two sets of binary variables
(essential = 1 and non-essential = 0; duplicated gene = 1 and singleton
gene = 0). A forward stepwise regression model was used to examine
the contribution of each independent variable to the regression. The
regression model defines log(Ka ) as a function of LS ($X_{LS}$), log expression
level ($\log(X_{exp})$), log fitness effect of gene deletion ($\log(X_{fit})$), essentiality
($X_{ess}$), gene duplication ($X_{dup}$), and the number of protein interactions
($X_{int}$):
\[ \log(K_a) = \beta_0 + \beta_{LS} X_{LS} + \beta_{exp} \log(X_{exp}) + \beta_{fit} \log(X_{fit}) + \beta_{ess} X_{ess} + \beta_{dup} X_{dup} + \beta_{int} X_{int} \]
Table 9.4 gives the results of the modelling procedure. The final model
gives a global R2 of 0.436 (P < 0.001). That is, nearly one half of the
variation in log(Ka ) is explained by this model. During the construction
of the final model, the predictors most highly correlated with log(Ka ),
LS and the expression level, were kept. The remaining variables, which
have minor roles in overall regression with log(Ka ), were excluded from
the final model (Table 9.4). The standardised coefficients were examined
to determine the relative importance of the significant predictors. LS
contributes more to the model than does the expression level, as shown
by its larger absolute standardised coefficient of 0.562 and t statistic of
11.676, when compared with values of 0.247 and 5.124, respectively, for
expression level. This analysis suggests that LS is the most relevant
predictor of the rate of protein divergence.
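Fitting a two-predictor model and comparing standardised coefficients can be sketched as below. This is a plain least-squares fit on invented data, not the forward stepwise procedure or the statistical package used above; all names and values are assumptions made for illustration.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(rows, y):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def sd(v):
    mean = sum(v) / len(v)
    return math.sqrt(sum((a - mean) ** 2 for a in v) / (len(v) - 1))


def standardised(coef, rows, y):
    """Standardised coefficients: b_j * sd(x_j) / sd(y)."""
    return [coef[j + 1] * sd([r[j] for r in rows]) / sd(y)
            for j in range(len(rows[0]))]


# Invented data: columns stand for a numeric LS value and log expression level.
rows = [(1, 0), (2, 1), (3, 5), (4, 2), (5, 8)]
y = [1 + 2 * ls - 3 * ex for ls, ex in rows]
coef = ols(rows, y)
betas = standardised(coef, rows, y)
print([round(c, 6) for c in coef])
```

Because standardised coefficients are expressed in units of standard deviations, their absolute values can be compared across predictors measured on different scales.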
9.4.3
Linear regression of divergence time and relative divergence score
(RDS)
To relate the group divergence times and RDS a linear regression for
each LS group was performed (Fig. 9.4). An increasing linear trend of
RDS with divergence time was observed in each LS group, suggesting
that genes diverge from other species at an approximately constant rate.
Groups with higher LS have greater slopes than those with lower LS, indicating that genes with higher LS evolve faster than those with lower LS.
This trend would still be apparent if different divergence time estimates
were used.
9.5
Discussion
The phylogenetic distribution of a gene has been suggested to be of biological importance. For example, genes with the same phylogenetic
distribution may have linked functions [8, 218]. Lineage specificity (LS)
is a form of phylogenetic distribution whereby genes are found only in
a group of species that diverge from a certain point in a species tree.
Orphan genes, those identified from only one species, are the extremes of
lineage specificity. How these orphan and lineage specific genes arose is
still an open question.
Three possibilities are generally proposed [73]. One is that genes in a
lineage originate from a lineage ancestral gene formed by the recombination of exons from other genes or from random ORFs. These genes might
show similarity to the original exons and so not necessarily be considered
orphans or lineage specific. In the case of formation from random ORFs
it is unlikely that such a protein would be functional. A second option is
gene loss [8, 178]. However it is relatively unlikely that a gene would be
lost in all but one lineage [73] and this may not explain most orphan or
lineage specific genes. The third option, which is examined here, is that
some genes evolve at a rapid rate and so can no longer be recognised as
orthologs of the genes they diverged from after a certain time span.
If accelerated rates of evolution lead to the creation of orphan or
lineage specific genes, then it follows that genes with a high degree of LS
should show higher rates of evolution than genes with lower degrees of LS.
This hypothesis has been tested here with respect to the Ascomycotan
fungi. If LS arose through widespread gene loss or from creation of new
genes from recombination of exons or ORFs there is no reason to expect
accelerated evolutionary rates or a trend in evolutionary rate with respect
to the degree of LS.
The evolutionary rates of genes in Ascomycotan fungi that have different degrees of LS were compared, and the comparison revealed a significant, strong
correlation between LS and the evolutionary rates of the genes. A trend
that genes with narrow phylogenetic distributions (high LS) tend to have
elevated evolutionary rates when compared with more ubiquitous genes
(low LS) was observed. This is consistent with the hypothesis that acceleration of the evolutionary rate is largely responsible for the formation
of lineage specific genes.
However, the rate of gene evolution is one of the most important parameters in molecular evolution. Correlations between the rate of gene
evolution and many properties of genes, including their phylogenetic distribution have been explored by several studies. As noted in the Introduction, the evolutionary rate has been associated with expression
level [127, 241], gene dispensability [178], essentiality [343] or morbidity,
gene duplication, gene loss [178] and protein-protein interactions [93,335].
Not all these studies have been in agreement, e.g., [93,151]. These factors
may influence the apparent correlation of LS with evolutionary rate.
All pair-wise correlations of these factors with LS, Ka and Ks were
examined to investigate the influence of these factors on the relationship
between LS and Ka . The strongest correlation observed was that of LS
with log(Ka ), however log(Ka ) also correlated highly with log(Ks ). Correlations of log(Ka ) with LS and the other factors were then calculated
after controlling for log(Ks ). Again the correlation of LS with log(Ka )
was the strongest and similar to that without controlling for log(Ks ).
With one exception, both LS and log(Ka ) showed significant but low
correlations to all other factors. As log(Ka ) showed the strongest correlation with LS in both cases it seems clear that the evolutionary rate has
a considerable, though not unique, influence on the origin of LS.
Further examination of this was undertaken with a stepwise regression analysis of the factors likely to influence Ka. In the final regression
model, which explained close to half the variation in log(Ka ), only the
parameters LS and log expression level were kept, with LS making the
larger individual contribution. The other parameters investigated did
not make significant contributions to the regression model. This again
indicated the role of evolutionary rate on LS.
Another approach used the relative divergence score (RDS) which
measures the divergence of a gene from its orthologs in other genomes
as a ratio of the TBLASTN score with its orthologs to the maximal (or
self-self) score. This provides another view of the degree of divergence
within a lineage and, when matched to divergence times, allows an examination of the evolutionary rate as the degree of LS increases. Within
each LS group a reasonably constant rate of evolution was seen since the
appearance of the LS group. Groups with low LS show lower RDS values
and evolutionary rates than groups with higher LS, consistent with the
evolutionary rate being a major determinant of LS. Allowing for errors
in the determination of divergence times this trend will still hold.
Genes with a certain degree of LS may have arisen from duplication
followed by acquisition of a lineage specific function [73] or simply have
diverged from a common ancestor to the extent that they cannot be
recognised as orthologs across lineages. Our findings support the idea
that genes destined to have high levels of LS will have higher evolutionary rates. It should be noted that Ka is a measurement of the average
nonsynonymous substitution rate along the whole length of a gene. Although highly lineage-specific genes had higher average Ka , the extent
to which region-specific or site-specific contributions to Ka affect this
was not examined. Further research could be directed to evaluate such
region- or site-specific effects on the rate of protein divergence, especially,
for instance, for genes that have high LS but low evolutionary rates or
vice versa.
For ascomycotan fungi, our findings show that the degree of LS correlates with the evolutionary rate and indicate that an elevated evolutionary rate may be a major cause of the development of lineage specific
genes.
Chapter 10
MBETOOLBOX: A MATLAB TOOLBOX FOR
SEQUENCE DATA ANALYSIS IN MOLECULAR
BIOLOGY AND EVOLUTION
This chapter is very closely based on a paper I have published [35].
The original draft of the manuscript was revised by Dr. David K.
Smith of the Department of Biochemistry, HKU.
10.1
Introduction
Matlab is a high-performance language for technical computing, integrating computation, visualization, and programming in an easy-to-use environment. It has been widely used in many areas, such as mathematics
and computation, algorithm development, data acquisition, modelling,
simulation, and scientific and engineering graphics. However, few functions are freely available in Matlab specifically for sequence data analysis
in molecular biology and evolution. I have developed a Matlab toolbox, called MBEToolbox, which aims to fill this gap by offering
efficient implementations of the most needed functions in molecular biology and evolution. It can be used to manipulate aligned sequences,
calculate evolutionary distances, estimate synonymous and nonsynonymous substitution rates, and infer phylogenetic trees. Moreover, it provides an extensible functional framework for more specialized needs in
exploring and analysing aligned nucleotide or protein sequences from an
evolutionary perspective. All functions in the toolbox are accessible
from the command line for seasoned Matlab users, and a graphical user interface is also provided, which may be especially useful for non-specialist
end users. Through its application during the Penicillium
marneffei genome project, MBEToolbox has proved to be a useful tool
that can aid in the exploration, interpretation and visualization of data
in molecular biology and evolution. The software is publicly available
at http://web.hku.hk/∼jamescai/mbetoolbox/.
10.2
Literature Review
10.2.1
Probabilistic DNA substitution models
In this section I will discuss probability models, more specifically, Markov
models. (Other types of models, e.g., deterministic models, also exist.) Markov models can be discrete or continuous with regard to
time. The discrete time models are called Markov chains, whereas continuous time models are usually called Markov processes. The mathematical
notation used in this section is as follows: R - intrinsic rate matrix; Q
- (instantaneous) transition rate matrix; P - transition probability matrix; X - divergence matrix; Π - matrix of base frequencies; and t - time or
evolutionary distance.
The molecular evolution of sequences is generally modelled under a hypothesis of phylogeny, i.e., by modelling sequence evolution along a branch
of a phylogenetic tree. This is done using a continuous time Markov process,
more specifically a finite, aperiodic, irreducible process (referred
to here simply as a Markov process). A Markov process has a defined
state space, e.g., {A, C, G, T}, and the (instantaneous) transition rate
between states is given by an n × n transition rate matrix, Q, where
$Q_{ij} > 0$ for all $i \neq j$ and $Q_{ii} = -\sum_{j \neq i} Q_{ij}$. Amino acid models have
n = 20, while nucleotide models have n = 4, e.g.:
\[
Q = \begin{pmatrix}
-1.218 & 0.504 & 0.336 & 0.378 \\
0.126 & -0.882 & 0.252 & 0.504 \\
0.168 & 0.504 & -1.050 & 0.378 \\
0.126 & 0.672 & 0.252 & -1.050
\end{pmatrix}
\]
$Q_{ij}$ indicates the rate for going from state i to state j. Since the total
instantaneous rate is zero, each row should sum to zero. For a specified
time interval, t, we can calculate the transition probability matrix from
$P(t) = e^{Qt}$, e.g.:
\[
P(t) = \begin{pmatrix}
0.6883 & 0.1308 & 0.0828 & 0.0981 \\
0.0327 & 0.7783 & 0.0654 & 0.1236 \\
0.0414 & 0.1308 & 0.7297 & 0.0981 \\
0.0327 & 0.1647 & 0.0654 & 0.7372
\end{pmatrix}
\]
Here t = 0.33 and the exponential operation is the matrix exponential. In
Matlab, this is computed using a scaling and squaring algorithm with a
Padé approximation. In P, the rows sum to one, since the total probability over the time interval is one. If the Markov process is run
for a sufficiently long time, the probabilities P(t) will converge on a stationary distribution such that for all pairs (i, k) of starting states, $P_{i,j}(t) = P_{k,j}(t)$.
That is, the probability of the end state is independent of the starting
state. Here we will limit our discussion to cases where the overall rate
of changing from state i to state j is the same as the overall rate of changing from j to i,
a constraint that defines models said to be time-reversible. The models
used in phylogenetic inference to date are almost exclusively subsets of
this class.
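The calculation of P(t) from Q can be sketched without any numerical library. The code below uses scaling and squaring with a truncated Taylor series, which is a simpler stand-in for the Padé approximation mentioned above; the function names and the default numbers of squarings and series terms are assumptions chosen for this example.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, squarings=10, terms=12):
    """Matrix exponential: Taylor series of m / 2**squarings, then squaring."""
    n = len(m)
    scaled = [[v / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds scaled**k / k!
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


Q = [
    [-1.218, 0.504, 0.336, 0.378],
    [0.126, -0.882, 0.252, 0.504],
    [0.168, 0.504, -1.050, 0.378],
    [0.126, 0.672, 0.252, -1.050],
]
t = 0.33
P = mat_exp([[q * t for q in row] for row in Q])
row_sums = [sum(row) for row in P]

# For a very long interval every row approaches the same distribution.
P_long = mat_exp([[q * 200 for q in row] for row in Q])
print([round(s, 6) for s in row_sums])
print([round(p, 4) for p in P_long[0]])
```

Each row of P should sum to one, and the rows of P_long should be nearly identical, which is the convergence to a stationary distribution described above.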
The transition rate matrix, Q, can be decomposed into an intrinsic
rate matrix, R, and Π, such that:
\[ Q = R\Pi \]
If R is symmetric and Q is constructed as indicated above, then Π
holds the equilibrium base frequencies. The rates at which each state is
replaced by each alternative state in R, and the methods for calculating or
estimating Π, are set differently in different situations. Hence, different
DNA substitution models exist. I will start with the most
general model of nucleotide substitution, the general time reversible
model (GTR), also called the REV model. The
rate matrix R for the REV model is:
\[
R_{REV} = \begin{pmatrix}
- & \mu a & \mu b & \mu c \\
\mu a & - & \mu d & \mu e \\
\mu b & \mu d & - & \mu f \\
\mu c & \mu e & \mu f & -
\end{pmatrix}
\]
In this matrix, the rows (and columns) correspond to the bases A, C,
G, and T, respectively. The factor µ represents the mean instantaneous
rate. This rate is modified by the relative rate parameters a, b, c, d, e and f,
which correspond to each possible transformation between two bases. To
construct $Q_{REV}$, all we need to do is compute RΠ, where Π = diag(πA , πC , πG , πT )
holds the frequency parameters that correspond to the frequencies of the four
bases. The diagonal elements of Q are always chosen so that the row
sums are zero, which ensures that the rows of P(t) sum to one.
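The construction of Q from R and Π can be sketched as follows. The function name is hypothetical. The rate and frequency values are not stated in the text; they were chosen because, with µ = 1, they give back the example Q matrix shown earlier in this section.

```python
def gtr_rate_matrix(rates, freqs, mu=1.0):
    """Build Q for the REV/GTR model, with bases ordered A, C, G, T.

    rates: relative rates a (A-C), b (A-G), c (A-T), d (C-G), e (C-T), f (G-T)
    freqs: base frequencies [piA, piC, piG, piT]
    """
    a, b, c, d, e, f = (rates[key] for key in "abcdef")
    r = [
        [0, a, b, c],
        [a, 0, d, e],
        [b, d, 0, f],
        [c, e, f, 0],
    ]
    # Off-diagonal entries: Q_ij = mu * R_ij * pi_j.
    q = [[mu * r[i][j] * freqs[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        # Diagonal entries make each row sum to zero.
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


rates = {"a": 1.26, "b": 1.68, "c": 1.26, "d": 1.26, "e": 1.68, "f": 1.26}
freqs = [0.1, 0.4, 0.2, 0.3]
Q = gtr_rate_matrix(rates, freqs)
for row in Q:
    print([round(v, 3) for v in row])
```

Because R is symmetric, the result satisfies $\pi_i Q_{ij} = \pi_j Q_{ji}$, which is the time-reversibility condition.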
Many other models, all belonging to the GTR class, have been described.
They are usually named using the initial letters of the authors' last names
and the year of publication. Their relationships are illustrated
in Fig. 10.1. The κ parameter represents the ratio of the instantaneous
rate of transition-type substitutions to that of transversion-type substitutions.
It takes the value 1.0 for models in which all substitutions are taken
to occur at the same rate (i.e., the JC and F81 models). In the K2P and
HKY models, the rate of transversions is β, with the rate of transitions
being determined as α = κβ.
[Figure 10.1: diagram of nested models. JC (πA = πC = πG = πT ; α = β) leads to K2P (πA = πC = πG = πT ; α ≠ β) by allowing transition/transversion bias, and to F81 (πA ≠ πC ≠ πG ≠ πT ; α = β) by allowing base frequencies to vary. K2P leads to HKY85 by allowing base frequencies to vary, and F81 leads to HKY85 by allowing transition/transversion bias. HKY85 (πA ≠ πC ≠ πG ≠ πT ; α ≠ β) generalises to GTR/REV (πA ≠ πC ≠ πG ≠ πT ; a, b, c, d, e, f).]
Figure 10.1: Relationship of GTR class DNA substitution models
JC model The JC model was described by Jukes & Cantor in 1969
[153] and is the most restrictive model. It assumes that the base frequencies are all equal and that the instantaneous rate of substitution is the
same for all possible changes. When this model is selected, the base frequencies (πA , πC , πG , πT ) are all set to 0.25 and a, b, c, d, e and f are all set to 1.0.
The only free parameter that can be adjusted under this model is the µt
parameter.
F81 model The F81 model was described by Felsenstein (1981) [85].
It is like the JC model in assuming that all possible changes occur at
the same rate, but allows the base frequencies to be unequal. If the base
frequencies are all set to 0.25, this model is equivalent to the JC model.
When this model is selected, the base frequency
parameters are free to vary, but the κ parameter is fixed at 1.0.
K2P model The K2P model was described by Kimura in 1980 [165].
It is like the JC model in assuming equal base frequencies, but allows the
rate of transition-type substitutions to differ from the rate of transversion-type substitutions. The ratio of these two instantaneous
rates is κ. Two parameters, κ and µt, are free to vary when
using this model. When κ = 1.0, the K2P model is identical to
the JC model. The base frequency parameters are forced to be equal.
HKY model The Hasegawa, Kishino and Yano (HKY) model [126]
allows for different rates of transitions and transversions as well as unequal base frequencies. The parameters required by this
model are the transition to transversion ratio κ and the base frequencies. If
the base frequencies are uniform, the HKY model reduces to the K2P model.
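The nesting of these four models can be illustrated by building the HKY rate matrix and fixing its parameters. The function name, the argument names and the example values are assumptions made for this sketch.

```python
def hky_rate_matrix(kappa, freqs, beta=1.0):
    """Rate matrix for the HKY model, with bases ordered A, C, G, T.

    Transversions occur at rate beta and transitions at rate kappa * beta,
    each weighted by the frequency of the target base.
    """
    bases = "ACGT"
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(bases):
        for j, y in enumerate(bases):
            if i != j:
                rate = kappa * beta if (x, y) in transitions else beta
                q[i][j] = rate * freqs[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this sums the row
    return q


uniform = [0.25, 0.25, 0.25, 0.25]
unequal = [0.1, 0.4, 0.2, 0.3]

jc = hky_rate_matrix(1.0, uniform)    # equal rates, equal frequencies
f81 = hky_rate_matrix(1.0, unequal)   # equal rates, unequal frequencies
k2p = hky_rate_matrix(2.0, uniform)   # transition bias, equal frequencies
hky = hky_rate_matrix(2.0, unequal)   # transition bias, unequal frequencies
print(jc[0])
print(k2p[0])
```

Setting κ = 1.0 removes the transition/transversion bias, and setting all frequencies to 0.25 removes the composition effect, so each simpler model is obtained by fixing parameters of the HKY model.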
10.2.2
Maximum likelihood estimation
Maximum likelihood estimation (MLE) is a popular statistical method
used to make inferences about parameters of the underlying probability
distribution of a given data set. Given a set of observations, the method
of maximum likelihood finds the parameters of a model that are most
consistent with these observations.
Here I use a simple and general example to explain the philosophy of
MLE. Suppose n data, X1, X2, . . . , Xn, are drawn from a given discrete
probability distribution D with known probability mass function fD and
distributional parameter θ. The probability associated with our observed
data may be computed as:
\[ P(x) = f_D(x \mid \theta) \]
where x ∈ {x1 , x2 , . . . , xn }. At this point, although we know that
our data come from the distribution D, we may not know the value of
the parameter θ. This is usually the case when we perform an experiment to sample data points so that we can estimate some parameter,
such as θ, of a distribution. The question is how we should estimate θ.
MLE provides a general technique for seeking an estimate of the value
of θ from the sample. We maximise the likelihood of the observed data
set over all possible values of θ, i.e., we seek the most likely value of the
parameter θ.
We define likelihood mathematically:
$$\mathrm{lik}(\theta) = \prod_{i=1}^{n} f_D(x_i \mid \theta)$$
MLE seeks the value θ̂ which maximises this likelihood function over all
possible θ. MLE methods are versatile and apply to most models and to
different types of data.
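As a concrete illustration of this definition, the following minimal Matlab sketch takes D to be a Bernoulli distribution, so that θ is the probability of observing a 1. The data vector x and the grid of candidate values for θ are made up for the example, and the log of the likelihood is maximised instead of the likelihood itself, which gives the same estimate.

```matlab
% Ten made-up Bernoulli observations (1 = success, 0 = failure).
x = [1 0 1 1 0 1 1 1 0 1];

% Log-likelihood of the whole sample for a candidate theta:
% log lik(theta) = sum_i log f_D(x_i | theta).
loglik = @(theta) sum(x.*log(theta) + (1 - x).*log(1 - theta));

% Evaluate the log-likelihood on a grid of candidate values of theta
% and keep the value that gives the largest log-likelihood.
theta_grid = 0.01:0.01:0.99;
ll = arrayfun(loglik, theta_grid);
[ll_max, imax] = max(ll);
theta_hat = theta_grid(imax);
```

For this distribution the maximum can also be found analytically as the sample mean, mean(x), which the grid search reproduces.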
The general principle of MLE has been applied to many
aspects of phylogenetics, such as phylogenetic parameter estimation and
optimal tree searching [41, 85]. Generally, the likelihood of observing a
given set of data is maximised for each topology, and the topology that
gives the highest maximum likelihood is chosen as the final tree. In this
case, however, the parameters to be estimated are not the topologies but
the branch lengths for each topology: the likelihood is maximised to
estimate branch lengths, and the topologies are then compared by their maximised likelihoods. The problem with
phylogenetic inference based on this optimisation principle is that it is
very time-consuming, because the number of possible topologies is very
large for a sizable number of nucleotide sequences (> 15) and an enormous amount of computational time is required to find the optimal tree.
Calculating MLEs in phylogeny often requires specialised software, because
the likelihood equations are complex and non-linear and must usually be
solved by numerical optimisation.
10.2.3
Elements of phylogenetic theory
The purpose of the remainder of this section is to explain how phylogenetic trees
may be constructed from analysis of nucleotide and protein sequences.
Such analyses enable the evolutionary relationships among species or
genes to be deduced. I will review basic concepts of phylogenetic theory, such as the phylogenetic tree and the likelihood calculation of a phylogeny
given a substitution model. Then I will introduce some of the most commonly
used software packages in phylogenetic analyses, with their advantages and
shortcomings.
Phylogenetic trees
We usually describe evolution, of either genes or species, by using a sketch
of a tree-like structure, which represents the hierarchical relationships
among species/genes arising through evolution. Such a tree-like structure is called a phylogenetic tree. In a rooted tree, the root is the
common ancestor of all the nodes. In an evolutionary tree of species,
the ancestral species is located at the root of the tree and contemporary
species are the leaves. The topology of
the tree, i.e., its branching pattern, defines the phylogenetic relationships among
the nodes. When the data for the ancestors are missing, the phylogenetic
trees produced are unrooted: they are only schematic trees comprising
a set of nodes linked together by branches. The location of the common ancestor of all the species/genes under study cannot be identified in
an unrooted tree.
The string representation of a tree, following the Newick standard,
is usually used. It uses the recursive definition of a tree to represent
phylogenies in a computer-readable form with nested parentheses. For
example, a tree can be written:
(outgroup,neurospora,(penicillium,aspergillus));
However, one must be aware that this representation is not unique;
the following one works as well:
(penicillium,(outgroup,neurospora),aspergillus);
When an outgroup is provided, the rooted representation is:
(outgroup,(neurospora,(penicillium,aspergillus)));
In addition to the branching topology, the branch lengths of a phylogeny
are also needed to specify a particular tree. The lengths of branches
represent the evolutionary distances between two consecutive nodes.
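Simple properties of a Newick string can be read directly from its characters. The sketch below is only an illustration under stated assumptions: it takes the three example trees, written with balanced parentheses and without branch lengths or quoted labels, and checks that the parentheses are balanced, counts the leaves (one more than the number of commas) and counts the children of the outermost node (three for the unrooted forms, two for the rooted form).

```matlab
trees = {'(outgroup,neurospora,(penicillium,aspergillus));', ...
         '(penicillium,(outgroup,neurospora),aspergillus);', ...
         '(outgroup,(neurospora,(penicillium,aspergillus)));'};

balanced = false(1, numel(trees));
nleaves  = zeros(1, numel(trees));
rootdeg  = zeros(1, numel(trees));

for k = 1:numel(trees)
    s = trees{k};
    % Nesting depth after each character: +1 for '(' and -1 for ')'.
    depth = cumsum((s == '(') - (s == ')'));
    balanced(k) = all(depth >= 0) && depth(end) == 0;
    % Every comma separates two sibling subtrees, so leaves = commas + 1.
    nleaves(k) = sum(s == ',') + 1;
    % Commas at depth 1 separate the children of the outermost node.
    rootdeg(k) = sum((s == ',') & (depth == 1)) + 1;
end
```

All three strings describe the same four leaves; only the number of children at the outermost level distinguishes the rooted form from the two unrooted forms.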
Phylogeny reconstruction
The data used for phylogeny reconstruction are not limited to nucleotide
and amino acid sequences; in fact, protein structures or exon-intron structures can also be used for this purpose. However, I will limit the following discussion to nucleotide and amino acid sequences. It is important to
note that most phylogeny-building methods require a multiple alignment of
the sequences. Sequence alignment is one of the most important problems in
bioinformatics. Much effort has been put into improving its efficiency
and accuracy, and the area is still actively developing.
Once a multiple alignment has been obtained, three different
methods are commonly used to construct a phylogeny: the distance matrix method, the maximum
parsimony method and the maximum likelihood method. A good review of
all these methods can be found in [199].
Maximum parsimony infers a phylogenetic tree by minimising the
total number of evolutionary steps required to explain a given set of data,
or in other words by minimising the total tree length. It is a character-based method: the input data are in the form of “characters” for a
range of taxa. Besides a protein or nucleotide residue, a character could
be a binary value for the presence or absence of a feature (such as the
presence of a tail). Maximum parsimony is a very simple approach, and
is popular for this reason. However, it is not always very accurate.
Maximum likelihood evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesised history would give rise to the observed data set.
Central to likelihood-based methods is the likelihood function (for a
general description, see Section 10.2.2).
Likelihood = f (Data|T, l, θ)
where T is the topology, l denotes the branch lengths of the given tree and θ denotes the remaining model parameters.
The topology with the highest maximum probability (likelihood) is
chosen. Maximum likelihood methods have several advantages over other methods: they may have lower variance (they are least affected by
sampling error), they tend to be robust to violations of the assumptions in
the evolutionary model, they are statistically well founded, they can statistically
evaluate different tree topologies, and they use all of the sequence information.
There are also some disadvantages: they are very computationally intensive (slow)
and the result depends on the model of evolution.
Computation of likelihood of phylogeny
Substitution models describe the way sequences evolve over
time by nucleotide replacements. The most commonly used Markov models
of DNA substitution have been reviewed in Section 10.2.1.
10.2.4
Programs used for phylogenetic analyses
A few selected programs are introduced below; they are representative
of those most commonly used in phylogenetic analyses.
PAUP* - http://paup.csit.fsu.edu/ is an integrated and user-friendly package. Many distinct models of nucleotide substitution are
available (all possible submodels of the GTR + Γ + inv sites model). It
does not allow analyses of protein sequences using parametric approaches.
Tree-Puzzle - http://www.tree-puzzle.de/ reconstructs phylogenetic trees from molecular sequence data by maximum likelihood. It
implements a fast tree search algorithm, quartet puzzling, that allows
analysis of large data sets and automatically assigns estimations of support to each internal branch. It also computes pairwise maximum likelihood distances as well as branch lengths for user-specified trees.
Mesquite - http://mesquiteproject.org/mesquite/mesquite.html
is an extensible and modular program for a variety of evolutionary analyses. It is written in Java and is therefore platform-independent. At this
point Mesquite is of limited usefulness because it is a modular set of
programs to which specific applications must be added. It does, however, implement one- and two-parameter models of evolution for ancestral state
reconstruction.
MrBayes - http://morphbank.ebc.uu.se/mrbayes/ is a program for
Markov chain Monte Carlo analysis of phylogeny. It implements a limited
set of submodels of the GTR + Γ + inv sites model. The current version
allows the use of mixed models (e.g., distinct GTR + Γ + inv sites submodels for 1st, 2nd, and 3rd codon positions or for different genes). A
number of protein models, using parameters estimated from large-scale
analyses of protein databases, are also available. It is the only known package
implementing the covarion model.
PAML - http://abacus.gene.ucl.ac.uk/software/paml.html is
a package of programs for phylogenetic analyses of DNA or protein sequences using maximum likelihood. It contains a modular set of programs
for a flexible range of likelihood analyses (submodels of the GTR + Γ + inv
sites model, amino acid models, codon-based models). It is not designed
for tree searches, but its flexibility makes it ideal for analyses of the evolutionary process and
estimation of evolutionary parameters. PAML
has a simulator module called “evolver” that is also quite flexible.
PHYLIP - http://evolution.genetics.washington.edu/phylip.html is a modular set of programs for various types of phylogenetic analyses (including likelihood analyses of DNA and proteins). It implements
a heuristic tree space search algorithm, which is faster than PAML, but
does not search as rapidly or as extensively as PAUP*.
10.3
Implementation
MBEToolbox is written in the Matlab language and has been tested on
the Windows platform with Matlab version 6.1.0. The main functions
implemented are: sequence manipulation, computation of evolutionary
distances derived from nucleotide-, amino acid- or codon-based substitution models, phylogenetic tree construction, sequence statistics and
graphics functions to visualize the results of analyses. Although it implements only a small fraction of the multiplicity of existing methods used
in molecular evolutionary analyses, interested users can easily extend the
toolbox.
10.3.1
Input data and formats
MBEToolbox requires a single ASCII file containing the nucleotide or
amino acid sequence alignment in either Phylip [86], ClustalW [312]
or Fasta format. The toolbox also provides a built-in ClustalW [312]
interface if an unaligned sequence file is provided. Protein-coding DNA
sequences can be automatically aligned based on the corresponding protein alignment with the command alignseqfile.
After input, in common with the MathWorks bioinformatics toolbox, MBEToolbox represents the alignment as a numeric matrix with
every element standing for a nucleic or amino acid character. Nucleotides
A, C, G and T are converted to integers 1 to 4, and the 20 amino acids are
converted to integers 1 to 20. A header, containing information about the
names and type of the sequences as well as the relevant genetic code for
protein-coding nucleotides, is attached to the alignment matrix to form a
Matlab structure. An example alignment structure, aln, in Matlab code
follows:
aln =
        seqtype: 2
    geneticcode: 1
       seqnames: {1xn cell}
            seq: [nxm double]
where n is the number of sequences and m is the length of the aligned
sequences. The type of sequence is denoted by 1, 2 or 3 for sequences
of non-coding nucleotides, protein coding nucleotides and amino acids,
respectively.
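To make the structure concrete, the sketch below builds such a structure by hand from three made-up coding sequences, using only core Matlab functions. It is an illustration of the layout described above and not the toolbox's own input routine; in particular, how gaps and ambiguous characters are coded is not covered here.

```matlab
% Three made-up, already aligned protein-coding sequences (one per row).
seqs = ['ATGGCC'; 'ATGGCT'; 'ATGACC'];

% Map A, C, G, T to the integers 1 to 4; loc has the same size as seqs.
[tf, loc] = ismember(seqs, 'ACGT');

aln.seqtype     = 2;                          % protein-coding nucleotides
aln.geneticcode = 1;                          % standard genetic code
aln.seqnames    = {'seq1', 'seq2', 'seq3'};   % hypothetical names
aln.seq         = loc;                        % n-by-m numeric matrix

[n, m] = size(aln.seq);                       % n sequences of length m
```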
10.3.2
Sequence Manipulation and Statistics
The alignment structure, aln, can be manipulated using the Matlab language. For example, aln.seq(x,:) will extract the xth sequence from
the alignment, while aln.seq(:,[i:j]) will extract columns i to j from
the alignment. Users may easily extract more specific positions by using functions developed in the toolbox, such as extractpos(aln,3) or
extractdegeneratesites to obtain the third codon positions or fourfold
degenerate sites, respectively. For each sequence, some basic statistics
such as the nucleotide composition (ntcomposition) and GC content,
can be reported. Other functions include the calculation of the relative
synonymous codon usage (RSCU) and the codon adaptation index (CAI),
counts of segregating sites, taking the reverse complement or translating
a sequence, and determining the sequence complexity.
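Because the alignment is a plain numeric matrix, several of these statistics reduce to one-line matrix expressions. The sketch below uses a small made-up matrix coded as A, C, G, T = 1 to 4 and core Matlab indexing only; it does not call the toolbox functions named above, whose exact outputs may differ.

```matlab
% Made-up alignment of three coding sequences (ATGGCC, ATGGCT, ATGACC).
seq = [1 4 3 3 2 2;
       1 4 3 3 2 4;
       1 4 3 1 2 2];

second = seq(2, :);                 % the second sequence
cols   = seq(:, 2:4);               % alignment columns 2 to 4
third  = seq(:, 3:3:end);           % third codon positions

% GC content per sequence: fraction of sites that are C (2) or G (3).
gc  = mean(double(seq == 2 | seq == 3), 2);
gc3 = mean(double(third == 2 | third == 3), 2);

% Segregating sites: columns in which not all sequences agree.
firstrow = repmat(seq(1, :), size(seq, 1), 1);
nseg = sum(any(seq ~= firstrow, 1));
```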
10.3.3
Evolutionary Distances
The evolutionary distance is one of the important measures in molecular evolutionary studies. It is required to measure the diversity among
sequences and to infer distance-based phylogenies. MBEToolbox contains a number of functions to calculate evolutionary distances based
on the observed number of differences.
The formulae used in these
functions are analytical solutions of a variety of Markov substitution
models, such as JC69 [153], K2P [165], F84 [86], HKY [126] (see [229]
for detail). Given the stationarity condition, the most general form of
Markov substitution models is the General Time Reversible (GTR or
REV) model [185, 309, 266, 358]. There is no analytical formula to calculate the GTR distance directly. A general method, described by Rodriguez et al. [266], has been implemented here. In this method a matrix
F, where Fij denotes the proportion of sites for which sequence 1 (s1 ) has
an i and sequence 2 (s2 ) has a j, is formed. The GTR distance between
s1 and s2 is then given by
$$\hat{d} = -\mathrm{tr}\left(\Pi \log\left(\Pi^{-1} F\right)\right)$$
where Π denotes the diagonal matrix with values of nucleotide equilibrium frequencies on the diagonal, and tr(A) denotes the trace of matrix
A. The above formula can be expressed in Matlab syntax directly as:
>> d=-trace(PI*logm(inv(PI)*F))
MBEToolbox also calculates the gamma distribution distance and the
LogDet distance [295] (i.e., Lake’s paralinear distance [184]).
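The GTR distance formula can be checked on a case with a known answer. In the sketch below, a made-up divergence matrix F is constructed with equal base frequencies and with all twelve kinds of difference equally frequent, so the general formula should reproduce the Jukes-Cantor distance −(3/4) ln(1 − 4p/3), where p is the proportion of differing sites. Symmetrising F and taking Π from its row sums are assumptions of this sketch and not a description of the toolbox code.

```matlab
p = 0.3;                                   % proportion of differing sites

% Divergence matrix: (1-p)/4 on the diagonal, p/12 elsewhere (sums to 1).
F = (p/12)*ones(4) + ((1 - p)/4 - p/12)*eye(4);
F = (F + F')/2;                            % enforce symmetry (reversibility)

PI = diag(sum(F, 2));                      % equilibrium frequencies on the diagonal

% General time-reversible distance; real() drops any round-off imaginary part.
d_gtr = real(-trace(PI*logm(PI\F)));

% Jukes-Cantor distance for the same proportion of differences.
d_jc = -0.75*log(1 - 4*p/3);
```

The two values agree to rounding error, as expected when the data satisfy the JC assumptions exactly.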
For alignments of codons, the toolbox provides calculation or estimation of the synonymous (Ks ) and non-synonymous (Ka ) substitution
rates by the counting method of Nei and Gojobori [228], the degenerate
methods of Li, Wu and Luo [198] and the method of Li or Pamilo and
Bianchi [197, 242], as well as the maximum likelihood method through
PAML [360]. All these methods for calculating Ks and Ka require that
the input sequences are aligned in the appropriate reading frame, which
can be performed by the function alignseqfile. Unresolved codon sites
will be removed automatically. In addition, several quantities, including the number of substitutions per site at only synonymous sites, at
only non-synonymous sites, at only four-fold-degenerate sites, or at only
zero-fold-degenerate sites can be calculated. The outputs of these calculations are distance matrices, which can be exported into text or Excel
files, or used directly in further operations.
10.3.4
Phylogeny Inference
Two distance-based tree creation algorithms, Unweighted Pair Group
Method with Arithmetic mean (UPGMA) and neighbour-joining (NJ)
[273] are provided and trees from these methods can be displayed or exported. Maximum parsimony and maximum likelihood algorithms can
be applied to nucleotide or amino acid alignments through an interface
to the Phylip package [86]. As properly implemented maximum likelihood methods are the best vehicles for statistical inference of evolutionary relationships among species from sequence data, several maximum
likelihood functions have been explicitly implemented in MBEToolbox.
These functions allow users to incorporate various evolutionary models,
estimate parameters and compare different evolutionary trees.
The simplest case of estimation of the evolutionary distance between
two sequences, s1 and s2, can be considered as the estimation of the
branch length (the number of substitutions along a branch) separating
ancestor and descendent nodes. Branch lengths, relative to a calibrated
molecular clock, can reveal the time interval for this separation. A continuous time Markov process is generally used to model evolution along
the branch from s1 to s2. A transition rate matrix, Q, is used to indicate
the rate of changing from one state to another. For a specified time interval or distance, t, the transition probability matrix is calculated from
P(t) = e^{Qt}. If there are N sites, the full likelihood is
$$L = \prod_{i=1}^{N} \pi_{s_{1i}} P(s_{1i} \rightarrow s_{2i}, t)$$
In this equation, $s_{1i}$ and $s_{2i}$ are the ith bases of sequences 1 and 2 respectively; $\pi_{s_{1i}}$ is the expected frequency of base $s_{1i}$.
In MBEToolbox, to calculate the likelihood, L, at a given time interval
(or distance) t, we have to specify a substitution model by using an appropriate model defining function, such as modeljc, modelk2p or modelgtr
for non-coding nucleotides, modeljtt or modeldayhoff for amino acids,
or modelgy94 for codons. These functions return a model structure composed of an instantaneous rate matrix, R, and an equilibrium frequency
vector, pi which give Q, (Q=R*diag(pi)). Once the model is specified,
the function likelidist(t,model,s1,s2) can calculate the log likelihood of the alignment of the two sequences, s1 and s2, with respect to
the time or distance, t, under the substitution model, model.
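The calculation can be sketched with core Matlab functions alone. The example below is a stand-in and does not call modeljc or likelidist: it writes the Jukes-Cantor model in the R and pi form described above, scales Q to one expected substitution per unit time (an assumed convention), and evaluates the log-likelihood of two short made-up sequences at a fixed distance t.

```matlab
freq = [0.25 0.25 0.25 0.25];        % equilibrium frequencies (A, C, G, T)
R = ones(4) - eye(4);                % equal exchange rates, zero diagonal
Q = R*diag(freq);                    % off-diagonal rates
Q = Q - diag(sum(Q, 2));             % make every row sum to zero
Q = Q / (-sum(freq .* diag(Q)'));    % one expected substitution per unit time

s1 = [1 2 3 4 1 2 3 4 1 1];          % made-up sequences, coded 1 to 4
s2 = [1 2 3 4 1 2 3 1 2 1];

t = 0.2;
P = expm(Q*t);                       % transition probability matrix P(t)

% Log of the product over sites of pi(s1i) * P(s1i -> s2i, t).
idx = sub2ind(size(P), s1, s2);
lnL = sum(log(freq(s1))) + sum(log(P(idx)));
```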
In most cases we wish to estimate t instead of calculating L as a function of t, so the function optimlikelidist(model,s1,s2) will search for
the t that maximises the likelihood by using the Nelder-Mead simplex (direct search) method, while holding the other parameters in the model at
fixed values. This constraint can be relaxed by allowing every parameter
in the model to be estimated by functions, such as optimlikelidistk2p,
that can estimate both t and the model’s parameters. Figure 10.2(a and
b) illustrates the estimation of the evolutionary distance between two
ribonuclease genes through the fixed- and free-parameter K2P models,
respectively. When the K2P model’s parameter, kappa, is fixed, the result and trace of the optimisation process is illustrated by the graph of
L and t (Fig. 10.2a). When kappa is a free parameter, a surface shows
the result and trace of the optimisation process (Fig. 10.2b).
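The search over t can be illustrated with Matlab's fminsearch, which implements the Nelder-Mead simplex method. The sketch below is a stand-in for optimlikelidist, not its actual code: it uses the closed-form Jukes-Cantor transition probabilities, two made-up sequences, and an arbitrary starting value of 0.1. Under this model the maximum likelihood distance has the known closed form −(3/4) ln(1 − 4p/3), which provides a check on the numerical result.

```matlab
s1 = [1 2 3 4 1 2 3 4 1 1];          % made-up sequences, coded 1 to 4
s2 = [1 2 3 4 1 2 3 1 2 1];
n = numel(s1);
k = sum(s1 ~= s2);                   % number of differing sites

% Jukes-Cantor probabilities of no change and of one specific change.
psame = @(t) 0.25 + 0.75*exp(-4*t/3);
pdiff = @(t) 0.25 - 0.25*exp(-4*t/3);

% Negative log-likelihood; abs(t) keeps the distance non-negative.
negLnL = @(t) -(n*log(0.25) + (n - k)*log(psame(abs(t))) + k*log(pdiff(abs(t))));

opts  = optimset('TolX', 1e-8, 'TolFun', 1e-10);
t_hat = abs(fminsearch(negLnL, 0.1, opts));

t_closed = -0.75*log(1 - 4*(k/n)/3); % analytical JC distance
```

Freeing further parameters, such as kappa in the K2P model, only changes the length of the vector passed to fminsearch.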
Figure 10.2: Log-likelihood of evolutionary distance. (a) Likelihood as
function of K2P distance. Distance is estimated by maximising likelihood
of the alignment when the bias of transition and transversion, kappa, is
fixed. (b) Likelihood as function of distance and kappa. Both distance
and kappa are numerically optimised simultaneously to give maximum
likelihood. The maximum likelihood peaks are marked with *. The two
sequences used are coding regions of two mammalian ribonuclease genes,
enc, of 474 bp.
[Plot axes: (a) ln(Likelihood) against distance (substitutions/site); (b) ln(Likelihood) against distance (substitutions/site) and kappa.]
When calculating the likelihood of a phylogenetic tree, where s1 and
s2 are two (descendant) nodes in a tree joined to an internal (ancestor)
node, sa, we must sum over all possible assignments of nucleotides to sa
to get the likelihood of the distance between s1 and s2. Consequently,
the number of possible combinations of nucleotides becomes too large to
be enumerated for even moderately sized trees. The pruning algorithm
introduced by Felsenstein [85] takes advantage of the tree topology to
evaluate the summation in a computationally efficient (but mathematically equivalent) manner. This and a simple and elegant mapping from a
‘parentheses’ encoding of a tree to the matrix equation for calculating the
likelihood of a tree, developed in the Matlab software, PhylLab [271],
have been adopted in likelitree.
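The equivalence between pruning and the full summation can be verified on a very small case. The sketch below is not the likelitree implementation: it assumes the Jukes-Cantor model, a single alignment site, and a made-up rooted tree ((s1:t1,s2:t2):t3,s3:t4) with arbitrary branch lengths, and compares the pruning result with brute-force enumeration of the two unobserved nodes.

```matlab
% Jukes-Cantor transition matrix for a branch of length t.
psame = @(t) 0.25 + 0.75*exp(-4*t/3);
pdiff = @(t) 0.25 - 0.25*exp(-4*t/3);
Pjc   = @(t) pdiff(t)*ones(4) + (psame(t) - pdiff(t))*eye(4);
freq  = [0.25 0.25 0.25 0.25];

x1 = 1; x2 = 1; x3 = 3;              % observed bases at the three leaves
P1 = Pjc(0.1); P2 = Pjc(0.2);        % branches from node a to s1 and s2
P3 = Pjc(0.05); P4 = Pjc(0.3);       % branches from the root to a and s3

% Pruning: conditional likelihoods are passed from the leaves to the root.
La = P1(:, x1) .* P2(:, x2);         % node a: one value per possible base
Lr = (P3*La) .* P4(:, x3);           % root: sum over bases at a, times s3
L_prune = freq*Lr;

% Brute force: explicit sum over all 16 assignments to the root and a.
L_brute = 0;
for r = 1:4
    for a = 1:4
        L_brute = L_brute + freq(r)*P3(r, a)*P1(a, x1)*P2(a, x2)*P4(r, x3);
    end
end
```

With more leaves the brute-force sum grows exponentially in the number of internal nodes, whereas the pruning pass visits each node once.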
10.3.5
Combination of functions
Basic operations can be combined to give more complicated functions.
A simple combination of the function to extract the fourfold degenerate
sites with the function to calculate GC content produces a new function
(countgc4) that determines the GC content at 4-fold degenerate sites
(GC4). A subfunction for calculating synonymous and nonsynonymous
differences between two codons, getsynnonsyndiff, can be converted
into a program for calculating codon volatility [251] with trivial effort.
Similarly, karlinsig which returns Karlin’s genomic signature (the dinucleotide relative abundance or bias) for a given sequence can be easily
re-formulated to estimate relative di-codon frequencies, which may be a
new index of biological signals in a coding sequence. In addition, the
menu-driven user interface, MBEGUI, is also a good example illustrating
the power of combination of basic MBEToolbox functions.
10.3.6
Graphics and GUI
Good visualisation is essential for successful numerical model building.
Leveraging the rich graphics functionality of Matlab, MBEToolbox provides a number of functions that can be used to create graphic output,
such as scatterplots of Ks vs Ka , plots of the number of transitions and
transversions against genetic distance, sliding window analyses on a nucleotide sequence and the Z-curve (a 3-dimensional curve representation
of a DNA sequence [372]). A simple menu-driven graphical user interface (GUI) has been developed by using GUIDE (Graphical User Interface Development Environment) in Matlab. The top menu includes File,
Sequences, Distances, Phylogeny, Graph, Polymorphism and Help submenus (Fig. 10.3). It gives access to the most frequently required
functions so that users do not have to run any scripts or functions from
the Matlab command line in most cases.
Figure 10.3: MBEToolbox GUI. (a) Distances submenu; (b) Phylogeny
submenu; and (c) Graph submenu.
10.4
Results and Discussion
Only a few Matlab toolboxes or functions are freely available for data analysis, exploration, and visualisation of nucleotide and protein sequences.
MBEToolbox, presented here, fulfils the most obvious needs in
sequence manipulation, genetic distance estimation and phylogeny inference in the Matlab environment. Moreover, it is an extensible functional
framework to formulate and solve problems in evolutionary data analysis;
it facilitates the rapid construction of both general applications and
special-purpose tools for computational biologists in a fraction of the
time it would take to write a program in a scalar noninteractive language
such as C or FORTRAN.
10.4.1
Vectorisation simplifies programming
Matlab is a matrix language, which means it is designed for vector and
matrix operations. Programming can be simplified and made more efficient by using algorithms that take advantage of vectorisation (converting
for and while loops to the equivalent vector or matrix operations). The
Matlab compiler in version 7.0 will automatically recognise and vectorise
loops without recursion. An example of vectorisation is the calculation
of Z-scores [246] for Smith-Waterman alignments [291] to give a measure of the significance of an alignment score against a background of
scores from randomly generated sequences with the same composition
and length. Z-scores are designed to overcome the bias due to the
composition of the alignment and are usually calculated by comparing
an actual alignment score with the scores obtained on a set of random
sequences generated by a Monte-Carlo process. The Z-score is defined
as:
Z(A, B) = (S(A, B) − mean)/standard deviation
where S(A, B) is the Smith-Waterman (S-W) score between two sequences A and B. The mean and standard deviation are taken from
realignments of the permuted sequences. The algorithm is implemented
as follows in Matlab with as few as 15 lines of code:
function [z,z_raw]=zscores(s1,s2,nboot)
% ZSCORES  Z-score of the Smith-Waterman score between s1 and s2,
% estimated from nboot random permutations of each sequence.
m1=length(s1);
m2=length(s2);
% Initialise two vectors holding the scores of
% s1_rep and s2_rep, i.e., replicate samples
% of sequences s1 and s2.
v_z1=zeros(1,nboot);
v_z2=zeros(1,nboot);
z_raw=smithwaterman(s1,s2);
for k=1:nboot
    s1_rep=s1(:,randperm(m1));
    v_z1(1,k)=smithwaterman(s1_rep, s2);
    s2_rep=s2(:,randperm(m2));
    v_z2(1,k)=smithwaterman(s1, s2_rep);
end
z1=(z_raw-mean(v_z1))./std(v_z1);
z2=(z_raw-mean(v_z2))./std(v_z2);
z=min(z1,z2);
where randperm(n) is a vector function returning a random permutation
of the integers from 1 to n and smithwaterman performs local alignment
by the standard dynamic programming technique.
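The benefit of vectorisation is easiest to see on a smaller task than the Z-score. The sketch below, written for two made-up sequences coded as A, C, G, T = 1 to 4, counts the differing sites first with an explicit loop and then with a single vector expression, and uses the same style to split the differences into transitions and transversions. It is an illustration only and is not taken from the toolbox.

```matlab
s1 = [1 2 3 4 1 2 3 4];
s2 = [3 2 3 2 1 4 3 1];

% Loop version: visit every site in turn.
ndiff_loop = 0;
for i = 1:numel(s1)
    if s1(i) ~= s2(i)
        ndiff_loop = ndiff_loop + 1;
    end
end

% Vectorised version: compare all sites at once.
ndiff_vec = sum(s1 ~= s2);

% Purines are A (1) and G (3); a transition keeps the purine/pyrimidine
% class of the base, a transversion changes it.
ispurine = @(s) (s == 1 | s == 3);
n_ts = sum((s1 ~= s2) & (ispurine(s1) == ispurine(s2)));
n_tv = sum(ispurine(s1) ~= ispurine(s2));
```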
10.4.2
Extensibility
An important distinction between compiled languages with subroutine
libraries and interactive environments like Matlab is the ease with which
problems can be specified and solved in the latter. Moreover, Matlab
toolboxes are traditionally organised in a less object-oriented mode and,
consequently, functions are more independent of each other and easier to
combine and extend. Several examples were given in the Implementation
section.
10.4.3
Comparison with other toolboxes
Some other toolboxes have been developed in Matlab for bioinformatics
related analyses. These include PhylLab [271] and MatArray [327]
as well as the bioinformatics toolbox developed by MathWorks. Other
examples can be found at the link and file exchange maintained at Matlab Central [42]. PhylLab is a molecular phylogeny toolbox which
also provides some functions for sequence and tree input and manipulation. Its main focus is on creating a maximum likelihood tree based on
Bayesian principles using a Markov chain Monte Carlo method to compute posterior parameter distributions. MatArray is focussed on the
analysis of gene expression data from microarrays and provides normalisation and clustering functions but does not address molecular evolution.
The bioinformatics toolbox from MathWorks provides a range of bioinformatics functions, including some related to molecular evolution.
MBEToolbox provides a much broader range of molecular evolution
related functions and phylogenetic methods than either the more specialised PhylLab project or the more general bioinformatics toolbox from
MathWorks. These extra functions include IO in Phylip format, statistical and sequence manipulation functions relevant to molecular evolution (e.g. count segregating sites), evolutionary distance calculation
for nucleic and amino acid sequences, phylogeny inference functions and
graphic plots relevant to molecular evolution (e.g. Ka vs Ks ). As such
it makes an important contribution to the bioinformatics analyses that
can be performed in the Matlab environment.
10.4.4
A novel enhanced window analysis
To test for the selective pressures in the different lineages of a phylogenetic tree, the nonsynonymous to synonymous rate ratio (Ka /Ks ) is normally estimated [281, 4, 61]. Values of Ka /Ks = 1, > 1, or < 1 indicate
neutrality, positive selection, or purifying selection, respectively. However, Ks and Ka are measurements of average synonymous and nonsynonymous substitutions per site along the whole length of the sequences.
Average Ks and Ka values give neither the pattern of intragenic fluctuation of selective constraints, nor region- or site-specific information.
A sliding window method is usually adopted to examine the intragenic
pattern of the substitution rates and to test for the occurrence of significant clusters of variant regions [55, 145, 80, 53]. Significant heterogeneity
in Ks would indicate that the neutral substitution rate varies across the
gene, whereas heterogeneity in Ka may indicate that selective constraints
vary along the gene. The results and accuracy of sliding window methods, either overlapping or non-overlapping, depend on both the size of
the window and the moving distance adopted. Large window lengths
may obliterate the details of patterns in Ks or Ka, whereas small window lengths usually result in larger statistical fluctuations. Hence, the
resolution of a sliding window is usually limited.
Figure 10.4: Comparison between sliding window and enhanced sliding
window methods. Sliding window analysis of Ks and Ka for the concatenated coding regions of two hepatitis C virus strains, HCV-JS and
HCV-JT. The number of codons for the C, E1, E2, NS2, NS3, NS4,
NS5A, and NS5B genes are 191, 192, 426, 217, 631, 315, 447, and 591,
respectively. The different coding regions are separated by vertical lines.
(a) illustrates the result of a normal sliding window analysis; (b) illustrates the result of the enhanced sliding window analysis. Beginnings
and ends of regions poor in synonymous substitutions (slope < 0) are
indicated by the arrows a and b (genes C and E1) and e and f (gene
NS5B). A region rich in synonymous substitutions (slope > 0) in gene
NS3 is indicated by arrows c and d.
[Plot axes: (a) substitution number per site (syn, nonsyn) against codon site; (b) transformed substitution number per site (syn, nonsyn) against codon site, with arrows a to f.]
A mathematical formalism, similar to the Z’-curve [368], is introduced
here to solve this problem. Consider a subsequence-based analysis of Ks
or Ka. In the n-th step, count the cumulative number of Ks or Ka substitutions
occurring from the first to the n-th nucleotide position in the gene sequences being inspected. Let K denote either Ks or Ka and K(n) denote
the cumulative K at the n-th sequence position. K(n) is usually an approximately linear, monotonically increasing function of n. The points (K(n), n),
n = 1, 2, · · · , N are fitted by the least squares method to a linear function,
K(n) = βn, to give a straight line with slope β. We define
$$K'(n) = K(n) - \beta n$$
The two-dimensional curve of (K'(n) ∼ n) gives an alternative representation of the normal sliding window curve.
To compare these two curve representations, the example dataset of
Suzuki and Gojobori [303], which contains the coding regions of two
hepatitis C virus strains (HCV-JS - Genbank Acc.: D85516 and HCV-JT - Genbank Acc.: D11168), was used. The entire coding sequence is
divided into eight regions (C, E1, E2, NS2, NS3, NS4, NS5A, NS5B).
Some of the coding regions have been combined as these short ORFs are
unlikely to yield meaningful Ks and Ka values. The reduction of Ks
in the C, E1 and NS5B regions, as well as its elevation in NS3, which
have been shown in previous studies [303], are not clear in a standard
sliding window representation (Fig. 10.4a). In contrast a sharp increase
in the (K'(n) ∼ n) curve (Fig. 10.4b) indicates an increase in K, while
a drop in the curve indicates a decrease in K. This new method has
been implemented in the function plotSlidingKaKs. Since it is derived
from the sliding window method, it is called the enhanced sliding window
method.
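The transformation can be illustrated on synthetic data. The sketch below does not reproduce plotSlidingKaKs; it assumes made-up per-site substitution counts with a low-rate region in the middle, fits a straight line through the origin by least squares, and subtracts it from the cumulative curve. The low-rate region then appears as a falling segment and the flanking regions as rising segments.

```matlab
% Made-up per-site substitution counts: sites 51-100 evolve more slowly.
k = [0.2*ones(1, 50), 0.05*ones(1, 50), 0.2*ones(1, 50)];
n = 1:numel(k);

K = cumsum(k);                       % cumulative count K(n)

% Least-squares slope of a line through the origin, K(n) ~ beta*n.
beta = sum(n.*K) / sum(n.^2);

Kprime = K - beta*n;                 % transformed curve K'(n)

% Average slope of K'(n) inside the slow region and in the first region.
slope_low  = mean(diff(Kprime(51:100)));
slope_high = mean(diff(Kprime(1:50)));

% plot(n, Kprime) would display the enhanced sliding window curve.
```

Because the slope of K'(n) in any region equals the local rate minus β, regions evolving more slowly than the gene-wide trend fall and faster regions rise, without any choice of window size.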
10.4.5
Limitations
The current version of this toolbox implements a variety of existing algorithms rather than novel ones. There are some limitations in
the practical use of MBEToolbox. First, though the toolbox provides
many methods for sequence and evolutionary analyses,
the full range of these features can only be accessed through the Matlab
command line interface, as in the majority of Matlab packages. Second,
some of the functions cannot handle ambiguous nucleotide or amino acid
codes in the sequences. The future development of MBEToolbox will
overcome these present limitations.
In summary, the MBEToolbox project is an ongoing effort to provide an
easy-to-use yet powerful analysis environment for molecular biology
and evolution. Currently, it offers a solid set of frequently used functions
to manipulate sequences, calculate genetic distances, infer phylogenetic
trees and perform related analyses. MBEToolbox is a useful tool that may encourage
evolutionary biologists to take advantage of Matlab. Moreover, it has
been widely applied in data analysis in the Penicillium marneffei genome
project, as mentioned on pages 73, 113, 146, 161 and 190.
Chapter 11
CONCLUDING REMARKS
In this last chapter I provide a summary of the conclusions and recommendations for future research to the preceding chapters presented.
Chapter 1 has presented the draft genome of the important thermally
dimorphic fungus Penicillium marneffei. A number of features of the
pathogenic fungus have been uncovered.
Given the similarity between the mitochondrial genome of P. marneffei and
those of nonpathogenic Aspergillus species (Chapter 3), P. marneffei appears to be more closely related to moulds than to yeasts, which is consistent with the established
classification. No direct association between mitochondrially encoded genetic components and pathogenicity was observed. Moreover, the in silico
evidence for the capability of melanin biosynthesis in P. marneffei (Chapter 4) should encourage further research towards the experimental elucidation
of melanin’s role in fungal virulence. Based on this computational finding,
gene knockout and in vivo animal survival analysis are being undertaken
in our department. The possible presence of a sexual cycle in P. marneffei
reported in Chapter 5 is highly significant for genetic study of
the fungus, since a sexual cycle could serve as a useful genetic tool for
studying the way in which the fungus causes disease. On the other
hand, if the fungus does reproduce sexually as part of its life cycle, it
might evolve more rapidly to become resistant to antifungal drugs, because sex might create new strains with an increased ability to cause disease
and infect humans. Chapter 6 reviewed our current knowledge of
the genetic components related to fungal morphogenesis, with
emphasis on the molecular mechanism of dimorphic switching. More research is nevertheless required in the following directions: (i) perception
of external stimuli by cellular sensors; (ii) transduction of the biochemical
signal; (iii) alteration of genomic expression; and (iv) structural reorganization towards the morphological change. Without such work, this
task remains far from achieved. The presence of over-abundant intragenic tandem repeats (IntraTRs) in the P. marneffei genome is a striking finding
(Chapter 7). The IntraTRs may create quantitative alterations in phenotypes (e.g., adhesion, flocculation or biofilm formation). The variation
resulting from these quantitative alterations of the fungal cell surface may
have allowed the fungus to ‘disguise’ itself in order to slip past the host
immune system’s vigilant defences. Many P. marneffei proteins contain tandemly repeated domains or motifs with some degree of homology to
the Plasmodium erythrocyte-binding protein domain.
The area of gene and genome duplication and its evolutionary significance has attracted considerable attention from researchers in recent
years. Chapter 8 makes a novel contribution to the field by describing gene duplication in five ascomycetes. We calculated the rates of synonymous and non-synonymous substitution using
the codon substitution model and reported large variation in the proportion of genes in multigene families across these fungi. We also suggest
that paralogs of filamentous fungi are under less selective constraint than
orthologs (although this does not hold for yeasts), that there is a lack
of evidence for an association between asymmetry in rates of evolution
and positive selection, and finally that different extents and consequences
of gene duplication may explain some of the phenotypic variation among the
ascomycetes. One new conclusion, that P. marneffei may have undergone a whole-genome duplication, is not solidly supported by the evidence
presented so far; analysis of gene order information will be necessary to
support the claim once the P. marneffei genome sequencing approaches
completion. Moreover, at the time the analysis was performed, the Aspergillus genomes remained unpublished; the underlying data might change,
and results from a premature analysis might be hard to reproduce or become obsolete. Therefore, no Aspergillus genome was included in the
comparison; further analysis of this sort should overcome this limitation.
In addition, in Chapter 9 we analysed genes with
various degrees of conservation among species, as measured by the lineage specificity (LS) of genes. We examined the correlations between evolutionary rate and LS, as well as several other related factors, such as
expression, essentiality, and protein-protein interactions. We found that,
in seven ascomycete genomes, the more lineage-specific a gene, the higher
its evolutionary rate. This is taken as evidence for the hypothesis that
orphan genes arise as a result of a higher rate of evolution. The same general
rule may explain the origin of P. marneffei-specific genes.
Finally, two software products, the P. marneffei genome database and
MBEToolbox for sequence data analysis, have been developed (Chapters
2 and 10). Together they cover two major aspects of bioinformatics, namely biological database management and algorithm
development. Both have been applied successfully throughout the
genome project and have proved efficient and sufficient for its needs.
In conclusion, the boom in fungal genome sequence data over the past
few years came with high expectations for new insights into fungal biology and pathogen control strategies. In the case of P. marneffei, it
became evident that computational approaches can be used to decipher the genome and to derive biological meaning and infer evolutionary
processes from it. This work paves the way for a systematic experimental study of
the pathogenic fungus.
BIBLIOGRAPHY
[1] N. Adames, K. Blundell, M. N. Ashby, and C. Boone. Role of yeast insulin-degrading enzyme homologs in propheromone processing and bud site selection.
Science, 270(5235):464–7, 1995.
[2] M. D. Adams, S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, P. G. Amanatides, S. E. Scherer, P. W. Li, R. A. Hoskins, R. F. Galle, R. A. George, S. E.
Lewis, S. Richards, M. Ashburner, S. N. Henderson, G. G. Sutton, J. R. Wortman, M. D. Yandell, Q. Zhang, L. X. Chen, R. C. Brandon, Y. H. Rogers, R. G.
Blazej, M. Champe, B. D. Pfeiffer, K. H. Wan, C. Doyle, E. G. Baxter, G. Helt,
C. R. Nelson, G. L. Gabor, J. F. Abril, A. Agbayani, H. J. An, C. Andrews-Pfannkoch, D. Baldwin, R. M. Ballew, A. Basu, J. Baxendale, L. Bayraktaroglu,
E. M. Beasley, K. Y. Beeson, P. V. Benos, B. P. Berman, D. Bhandari, S. Bolshakov, D. Borkova, M. R. Botchan, J. Bouck, P. Brokstein, P. Brottier, K. C.
Burtis, D. A. Busam, H. Butler, E. Cadieu, A. Center, I. Chandra, J. M. Cherry,
S. Cawley, C. Dahlke, L. B. Davenport, P. Davies, B. de Pablos, A. Delcher,
Z. Deng, A. D. Mays, I. Dew, S. M. Dietz, K. Dodson, L. E. Doup, M. Downes,
S. Dugan-Rocha, B. C. Dunkov, P. Dunn, K. J. Durbin, C. C. Evangelista,
C. Ferraz, S. Ferriera, W. Fleischmann, C. Fosler, A. E. Gabrielian, N. S. Garg,
W. M. Gelbart, K. Glasser, A. Glodek, F. Gong, J. H. Gorrell, Z. Gu, P. Guan,
M. Harris, N. L. Harris, D. Harvey, T. J. Heiman, J. R. Hernandez, J. Houck,
D. Hostin, K. A. Houston, T. J. Howland, M. H. Wei, C. Ibegwam, et al. The
genome sequence of drosophila melanogaster. Science, 287(5461):2185–95, 2000.
[3] L. Ajello, A. A. Padhye, S. Sukroongreung, C. H. Nilakul, and S. Tantimavanic.
Occurrence of penicillium marneffei infections among wild bamboo rats in thailand. Mycopathologia, 131(1):1–8, 1995.
[4] H. Akashi. Within- and between-species dna sequence variation and the ‘footprint’ of natural selection. Gene, 238:39–51, 1999.
[5] J. A. Alspaugh, L. M. Cavallo, J. R. Perfect, and J. Heitman. Ras1 regulates filamentation, mating and growth at high temperature of cryptococcus neoformans.
Mol Microbiol, 36(2):352–65, 2000.
[6] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and
D. J. Lipman. Gapped blast and psi-blast: a new generation of protein database
search programs. Nucleic Acids Res, 25(17):3389–402, 1997.
[7] M. A. Andrade, N. P. Brown, C. Leroy, S. Hoersch, A. de Daruvar, C. Reich,
A. Franchini, J. Tamames, A. Valencia, C. Ouzounis, and C. Sander. Automated
genome sequence analysis and annotation. Bioinformatics, 15(5):391–412, 1999.
[8] L. Aravind, H. Watanabe, D. J. Lipman, and E. V. Koonin. Lineage-specific
loss and divergence of functionally linked genes in eukaryotes. Proc Natl Acad
Sci U S A, 97(21):11319–24, 2000.
[9] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P.
Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald,
G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology.
the gene ontology consortium. Nat Genet, 25(1):25–9, 2000.
[10] C. R. Astell, L. Ahlstrom-Jonasson, M. Smith, K. Tatchell, K. A. Nasmyth,
and B. D. Hall. The sequence of the dnas coding for the mating-type loci of
saccharomyces cerevisiae. Cell, 27(1 Pt 2):15–23, 1981.
[11] J. Baker, J. McCarthy, M. Gatton, D. E. Kyle, V. Belizario, J. Luchavez, D. Bell,
and Q. Cheng. Genetic diversity of plasmodium falciparum histidine-rich protein
2 (pfhrp2) and its effect on the performance of pfhrp2-based rapid diagnostic
tests. J Infect Dis, 192(5):870–7, 2005.
[12] A. D. Basehoar, S. J. Zanton, and B. F. Pugh. Identification and distinct regulation of yeast tata box-containing genes. Cell, 116(5):699–709, 2004.
[13] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones,
A. Khanna, M. Marshall, S. Moxon, E. L. Sonnhammer, D. J. Studholme,
C. Yeats, and S. R. Eddy. The pfam protein families database. Nucleic Acids
Res, 32(Database issue):D138–41, 2004.
[14] D. H. Beach and A. J. Klar. Rearrangements of the transposable mating-type
cassettes of fission yeast. Embo J, 3(3):603–10, 1984.
[15] G. Bejerano and G. Yona. Variations on probabilistic suffix trees: statistical
modeling and prediction of protein families. Bioinformatics, 17(1):23–43, 2001.
[16] R. J. Bennett and S. C. West. Ruvc protein resolves holliday junctions via
cleavage of the continuous (noncrossover) strands. Proc Natl Acad Sci U S A,
92(12):5635–9, 1995.
[17] P. Bork, T. Dandekar, Y. Diaz-Lazcoz, F. Eisenhaber, M. Huynen, and Y. Yuan.
Predicting function: from genes to genomes and back. J Mol Biol, 283(4):707–25,
1998.
[18] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. The abaa homologue
of penicillium marneffei participates in two developmental programmes: conidiation and dimorphic growth. Mol Microbiol, 38(5):1034–47, 2000.
[19] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. An ste12 homolog
from the asexual, dimorphic fungus penicillium marneffei complements the defect in sexual development of an aspergillus nidulans stea mutant. Genetics,
157(3):1003–14, 2001.
[20] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. A basic helix-loop-helix
protein with similarity to the fungal morphological regulators, phd1p, efg1p and
stua, controls conidiation but not dimorphic growth in penicillium marneffei.
Mol Microbiol, 44(3):621–31, 2002.
[21] V. L. Boyartchuk, M. N. Ashby, and J. Rine. Modulation of ras and a-factor
function by carboxyl-terminal proteolysis. Science, 275(5307):1796–800, 1997.
[22] K. J. Boyce, M. J. Hynes, and A. Andrianopoulos. The cdc42 homolog of the
dimorphic fungus penicillium marneffei is required for correct cell polarization
during growth but not development. J Bacteriol, 183(11):3447–57, 2001.
[23] K. J. Boyce, M. J. Hynes, and A. Andrianopoulos. The ras and rho gtpases
genetically interact to co-ordinately regulate cell polarity during development in
penicillium marneffei. Mol Microbiol, 55(5):1487–501, 2005.
[24] A. A. Brakhage, K. Langfelder, G. Wanner, A. Schmidt, and B. Jahn. Pigment
biosynthesis and virulence. Contrib Microbiol, 2:205–15, 1999.
[25] B. J. Breitkreutz, C. Stark, and M. Tyers. The grid: the general repository for
interaction datasets. Genome Biol, 4(3):R23, 2003.
[26] C. Brenner and R. S. Fuller. Structural and enzymatic characterization of a
purified prohormone-processing enzyme: secreted, soluble kex2 protease. Proc
Natl Acad Sci U S A, 89(3):922–6, 1992.
[27] J. Brosius and S. J. Gould. On “genomenclature”: a comprehensive (and respectful) taxonomy for pseudogenes and other “junk dna”. Proc Natl Acad Sci
U S A, 89(22):10706–10, 1992.
[28] D. W. Brown, J. H. Yu, H. S. Kelkar, M. Fernandes, T. C. Nesbitt, N. P. Keller,
T. H. Adams, and T. J. Leonard. Twenty-five coregulated transcripts define a
sterigmatocystin gene cluster in aspergillus nidulans. Proc Natl Acad Sci U S
A, 93(4):1418–22, 1996.
[29] T. A. Brown, R. B. Waring, C. Scazzocchio, and R. W. Davies. The aspergillus
nidulans mitochondrial genome. Curr Genet, 9(2):113–7, 1985.
[30] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic
dna. J Mol Biol, 268(1):78–94, 1997.
[31] M. Burset and R. Guigo. Evaluation of gene structure prediction programs.
Genomics, 34(3):353–67, 1996.
[32] H. Bussey. Proteases and the processing of precursors to secreted proteins in
yeast. Yeast, 4(1):17–26, 1988.
[33] H. J. Bussink and S. A. Osmani. A cyclin-dependent kinase family member
(phoa) is required to link developmental fate to environmental conditions in
aspergillus nidulans. Embo J, 17(14):3990–4003, 1998.
[34] E. T. Buurman, C. Westwater, B. Hube, A. J. Brown, F. C. Odds, and N. A.
Gow. Molecular analysis of camnt1p, a mannosyl transferase important for adhesion and virulence of candida albicans. Proc Natl Acad Sci U S A, 95(13):7670–5,
1998.
[35] J. J. Cai, D. K. Smith, X. Xia, and K. Y. Yuen. Mbetoolbox: a matlab toolbox
for sequence data analysis in molecular biology and evolution. BMC Bioinformatics, 6(1):64, 2005.
[36] R. Calderone. Molecular pathogenesis of fungal infections. Trends Microbiol,
2(12):461–3, 1994.
[37] L. Cao, C. M. Chan, C. Lee, S. S. Wong, and K. Y. Yuen. Mp1 encodes an
abundant and highly antigenic cell wall mannoprotein in the pathogenic fungus
penicillium marneffei. Infect Immun, 66(3):966–73, 1998.
[38] L. Cao, K. M. Chan, D. Chen, N. Vanittanakom, C. Lee, C. M. Chan, T. Sirisanthana, D. N. Tsang, and K. Y. Yuen. Detection of cell wall mannoprotein mp1p
in culture supernatants of penicillium marneffei and in sera of penicilliosis patients. J Clin Microbiol, 37(4):981–6, 1999.
[39] L. Cao, D. L. Chen, C. Lee, C. M. Chan, K. M. Chan, N. Vanittanakom, D. N.
Tsang, and K. Y. Yuen. Detection of specific antibodies to an antigenic mannoprotein for diagnosis of penicillium marneffei penicilliosis. J Clin Microbiol,
36(10):3028–31, 1998.
[40] T. J. Carver, K. M. Rutherford, M. Berriman, M. A. Rajandream, B. G. Barrell,
and J. Parkhill. Act: the artemis comparison tool. Bioinformatics, 21(16):3422–
3, 2005.
[41] L. L. Cavalli-Sforza and A. W. Edwards. Phylogenetic analysis. models and
estimation procedures. Am J Hum Genet, 19(3):Suppl 19:233+, 1967.
[42] MATLAB Central. Matlab central, 2005.
[43] C. M. Chan, P. C. Woo, A. S. Leung, S. K. Lau, X. Y. Che, L. Cao, and K. Y.
Yuen. Detection of antibodies specific to an antigenic cell wall galactomannoprotein for serodiagnosis of aspergillus fumigatus aspergillosis. J Clin Microbiol,
40(6):2041–5, 2002.
[44] Y. F. Chan and T. C. Chow. Ultrastructural observations on penicillium marneffei in natural human infection. Ultrastruct Pathol, 14(5):439–52, 1990.
[45] S. Chariyalertsak, T. Sirisanthana, K. Supparatpinyo, and K. E. Nelson. Seasonal variation of disseminated penicillium marneffei infections in northern thailand: a clue to the reservoir? J Infect Dis, 173(6):1490–3, 1996.
[46] S. Chariyalertsak, T. Sirisanthana, K. Supparatpinyo, J. Praparattanapan, and
K. E. Nelson. Case-control study of risk factors for penicillium marneffei infection
in human immunodeficiency virus-infected patients in northern thailand. Clin
Infect Dis, 24(6):1080–6, 1997.
[47] S. Chariyalertsak, P. Vanittanakom, K. E. Nelson, T. Sirisanthana, and N. Vanittanakom. Rhizomys sumatrensis and cannomys badius, new natural animal hosts
of penicillium marneffei. J Med Vet Mycol, 34(2):105–10, 1996.
[48] D. Charlesworth, B. Charlesworth, and G. A. McVean. Genome sequences and
evolutionary biology, a two-way interaction. Trends Ecol Evol, 16(5):235–242,
2001.
[49] P. Chen, S. K. Sapperstein, J. D. Choi, and S. Michaelis. Biogenesis of the
saccharomyces cerevisiae mating pheromone a-factor. J Cell Biol, 136(2):251–
69, 1997.
[50] C. S. Chim, C. Y. Fong, S. K. Ma, S. S. Wong, and K. Y. Yuen. Reactive
hemophagocytic syndrome associated with penicillium marneffei infection. Am
J Med, 104(2):196–7, 1998.
[51] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, and
R. W. Davis. A genome-wide transcriptional analysis of the mitotic cell cycle.
Mol Cell, 2(1):65–73, 1998.
[52] C. Y. Choi, E. L. Schneider, J. M. Kim, I. Y. Gluzman, D. E. Goldberg, J. A.
Ellman, and M. A. Marletta. Interference with heme binding to histidine-rich
protein-2 as an antimalarial strategy. Chem Biol, 9(8):881–9, 2002.
[53] S. S. Choi and B. T. Lahn. Adaptive evolution of mrg, a neuron-specific gene
family implicated in nociception. Genome Res, 13:2252–2259, 2003.
[54] P. Chongtrakool, S. C. Chaiyaroj, V. Vithayasai, S. Trawatcharegon, R. Teanpaisan, S. Kalnawakul, and S. Sirisinha. Immunoreactivity of a 38-kilodalton
penicillium marneffei antigen with human immunodeficiency virus-positive sera.
J Clin Microbiol, 35(9):2220–3, 1997.
[55] A. G. Clark and T. Kao. Excess nonsynonymous substitution at shared polymorphic sites among self-incompatibility alleles of solanaceae. Proc Natl Acad
Sci U S A, 88:9823–9827, 1991.
[56] P. Cliften, P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston, B. A. Cohen, and M. Johnston. Finding functional features in saccharomyces genomes by phylogenetic footprinting. Science, 301(5629):71–6, 2003.
[57] L. Coin, A. Bateman, and R. Durbin. Enhanced protein domain discovery by
using language modeling techniques from speech recognition. Proc Natl Acad
Sci U S A, 100(8):4516–20, 2003.
[58] L. J. Collins, A. M. Poole, and D. Penny. Using ancestral sequences to uncover
potential gene homologues. Appl Bioinformatics, 2(3 Suppl):S85–95, 2003.
[59] G. C. Conant and A. Wagner. Asymmetric sequence divergence of duplicate
genes. Genome Res, 13(9):2052–8, 2003.
[60] A. Cooper and H. Bussey. Characterization of the yeast kex1 gene product: a
carboxypeptidase involved in processing secreted precursor proteins. Mol Cell
Biol, 9(6):2706–14, 1989.
[61] K. A. Crandall, C. R. Kelsey, H. Imamichi, H. C. Lane, and N. P. Salzman. Parallel
evolution of drug resistance in hiv: failure of nonsynonymous/synonymous substitution rate ratio to detect selection. Mol Biol Evol, 16:372–382, 1999.
[62] J. Davey, K. Davis, M. Hughes, G. Ladds, and D. Powner. The processing of
yeast pheromones. Semin Cell Dev Biol, 9(1):19–30, 1998.
[63] F. De Bernardis, S. Arancia, L. Morelli, B. Hube, D. Sanglard, W. Schafer, and
A. Cassone. Evidence that members of the secretory aspartyl proteinase gene
family, in particular sap2, are virulence factors for candida vaginitis. J Infect
Dis, 179(1):201–8, 1999.
[64] R. A. Dean, N. J. Talbot, D. J. Ebbole, M. L. Farman, T. K. Mitchell, M. J.
Orbach, M. Thon, R. Kulkarni, J. R. Xu, H. Pan, N. D. Read, Y. H. Lee, I. Carbone, D. Brown, Y. Y. Oh, N. Donofrio, J. S. Jeong, D. M. Soanes, S. Djonovic,
E. Kolomiets, C. Rehmeyer, W. Li, M. Harding, S. Kim, M. H. Lebrun, H. Bohnert, S. Coughlan, J. Butler, S. Calvo, L. J. Ma, R. Nicol, S. Purcell, C. Nusbaum,
J. E. Galagan, and B. W. Birren. The genome sequence of the rice blast fungus
magnaporthe grisea. Nature, 434(7036):980–6, 2005.
[65] C. d’Enfert, S. Goyard, S. Rodriguez-Arnaveilhe, L. Frangeul, L. Jones,
F. Tekaia, O. Bader, A. Albrecht, L. Castillo, A. Dominguez, J. F. Ernst,
C. Fradin, C. Gaillardin, S. Garcia-Sanchez, P. de Groot, B. Hube, F. M. Klis,
S. Krishnamurthy, D. Kunze, M. C. Lopez, A. Mavor, N. Martin, I. Moszer,
D. Onesime, J. Perez Martin, R. Sentandreu, E. Valentin, and A. J. Brown.
Candidadb: a genome database for candida albicans pathogenomics. Nucleic
Acids Res, 33(Database issue):D353–7, 2005.
[66] Z. L. Deng and D. H. Connor. Progressive disseminated penicilliosis caused by
penicillium marneffei. report of eight cases and differentiation of the causative
organism from histoplasma capsulatum. Am J Clin Pathol, 84(3):323–7, 1985.
[67] Z. L. Deng, M. Yun, and L. Ajello. Human penicilliosis marneffei and its relation
to the bamboo rat (rhizomys pruinosus). J Med Vet Mycol, 24(5):383–9, 1986.
[68] E. T. Dermitzakis and A. G. Clark. Differential selection after duplication in
mammalian developmental genes. Mol Biol Evol, 18(4):557–62, 2001.
[69] V. Desakorn, M. D. Smith, A. L. Walsh, A. J. Simpson, D. Sahassananda, A. Rajanuwong, V. Wuthiekanun, P. Howe, B. J. Angus, P. Suntharasamai, and N. J.
White. Diagnosis of penicillium marneffei infection by quantitation of urinary
antigen by using an enzyme immunoassay. J Clin Microbiol, 37(1):117–21, 1999.
[70] A. Dmochowska, D. Dignard, D. Henning, D. Y. Thomas, and H. Bussey. Yeast
kex1 gene encodes a putative protease with a carboxypeptidase b-like function
involved in killer toxin and alpha-factor precursor processing. Cell, 50(4):573–84,
1987.
[71] C. B. Do, M. S. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilistic consistency-based multiple sequence alignment. Genome Res, 15(2):330–
40, 2005.
[72] J. M. Dolence, L. E. Steward, E. K. Dolence, D. H. Wong, and C. D. Poulter.
Studies with recombinant saccharomyces cerevisiae caax prenyl protease rce1p.
Biochemistry, 39(14):4096–104, 2000.
[73] T. Domazet-Loso and D. Tautz. An evolutionary analysis of orphan genes in
drosophila. Genome Res, 13(10):2213–9, 2003.
[74] R. F. Doolittle. The multiplicity of domains in proteins. Annu Rev Biochem,
64:287–314, 1995.
[75] J. Du, Y. Zhu, A. Shanmugam, and A. L. Kenter. Analysis of immunoglobulin
sgamma3 recombination breakpoints by pcr: implications for the mechanism of
isotype switching. Nucleic Acids Res, 25(15):3066–73, 1997.
[76] P. S. Dyer, M. Paoletti, and D. B. Archer. Genomics reveals sexual secrets of
aspergillus. Microbiology, 149(Pt 9):2301–3, 2003.
[77] S. E. Eckert, B. Hoffmann, C. Wanke, and G. H. Braus. Sexual development of aspergillus nidulans in tryptophan auxotrophic strains. Arch Microbiol,
172(3):157–66, 1999.
[78] A. Edwards, H. A. Hammond, L. Jin, C. T. Caskey, and R. Chakraborty. Genetic variation at five trimeric and tetrameric tandem repeat loci in four human
population groups. Genomics, 12(2):241–53, 1992.
[79] C. elegans Sequencing Consortium. Genome sequence of the nematode c. elegans:
a platform for investigating biology. Science, 282(5396):2012–8, 1998.
[80] T. Endo, K. Ikeo, and T. Gojobori. Large-scale search for genes on which positive
selection may operate. Mol Biol Evol, 13:685–690, 1996.
[81] E. Eskin, W. N. Grundy, and Y. Singer. Protein family classification using sparse
markov transducers. Proc Int Conf Intell Syst Mol Biol, 8:134–45, 2000.
[82] E. Espagne, P. Balhadere, M. L. Penin, C. Barreau, and B. Turcq. Het-e and
het-d belong to a new subfamily of wd40 proteins involved in vegetative incompatibility specificity in the fungus podospora anserina. Genetics, 161(1):71–81,
2002.
[83] B. Ewing and P. Green. Base-calling of automated sequencer traces using phred.
ii. error probabilities. Genome Res, 8(3):186–94, 1998.
[84] B. Ewing, L. Hillier, M. C. Wendl, and P. Green. Base-calling of automated
sequencer traces using phred. i. accuracy assessment. Genome Res, 8(3):175–85,
1998.
[85] J. Felsenstein. Evolutionary trees from dna sequences: a maximum likelihood
approach. J Mol Evol, 17:368–376, 1981.
[86] J. Felsenstein. Phylip – phylogeny inference package (version 3.2). Cladistics,
5:164–166, 1989.
[87] Fungal Research Community FGI. Fungal genome initiative
(http://www.broad.mit.edu/annotation/fungi/fgi/), 2002.
[88] M. C. Fisher, D. Aanensen, S. de Hoog, and N. Vanittanakom. Multilocus
microsatellite typing system for penicillium marneffei reveals spatially structured
populations. J Clin Microbiol, 42(11):5065–9, 2004.
[89] M. C. Fisher, W. P. Hanage, S. de Hoog, E. Johnson, M. D. Smith, N. J.
White, and N. Vanittanakom. Low effective dispersal of asexual genotypes in
heterogeneous landscapes by the endemic pathogen penicillium marneffei. PLoS
Pathog, 1(2):e20, 2005.
[90] A. Force, M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, and J. Postlethwait.
Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4):1531–45, 1999.
[91] F. Foury, T. Roganti, N. Lecrenier, and B. Purnelle. The complete sequence of
the mitochondrial genome of saccharomyces cerevisiae. FEBS Lett, 440(3):325–
31, 1998.
[92] C. M. Fraser and R. D. Fleischmann. Strategies for whole microbial genome
sequencing and analysis. Electrophoresis, 18(8):1207–16, 1997.
[93] H. B. Fraser, D. P. Wall, and A. E. Hirsh. A simple dependence between protein
evolution rate and the number of protein-protein interactions. BMC Evol Biol,
3(1):11, 2003.
[94] J. A. Fraser and J. Heitman. Evolution of fungal sex chromosomes. Mol Microbiol, 51(2):299–306, 2004.
[95] R. Friedman and A. L. Hughes. Gene duplication and the structure of eukaryotic
genomes. Genome Res, 11(3):373–81, 2001.
[96] D. Frishman, M. Mokrejs, D. Kosykh, G. Kastenmuller, G. Kolesov, I. Zubrzycki,
C. Gruber, B. Geier, A. Kaps, K. Albermann, A. Volz, C. Wagner, M. Fellenberg,
K. Heumann, and H. W. Mewes. The pedant genome database. Nucleic Acids
Res, 31(1):207–11, 2003.
[97] M. C. Frith, J. L. Spouge, U. Hansen, and Z. Weng. Statistical significance of
clusters of motifs represented by position specific scoring matrices in nucleotide
sequences. Nucleic Acids Res, 30(14):3214–24, 2002.
[98] Y. Fu, G. Rieg, W. A. Fonzi, P. H. Belanger, J. E. Edwards Jr., and S. G. Filler.
Expression of the candida albicans gene als1 in saccharomyces cerevisiae induces
adherence to endothelial and epithelial cells. Infect Immun, 66(4):1783–6, 1998.
[99] K. Fujimura-Kamada, F. J. Nouvet, and S. Michaelis. A novel membrane-associated metalloprotease, ste24p, is required for the first step of nh2-terminal
processing of the yeast a-factor precursor. J Cell Biol, 136(2):271–85, 1997.
[100] R. S. Fuller, A. Brake, and J. Thorner. Yeast prohormone processing enzyme
(kex2 gene product) is a ca2+-dependent serine protease. Proc Natl Acad Sci U
S A, 86(5):1434–8, 1989.
[101] J. E. Galagan, S. E. Calvo, K. A. Borkovich, E. U. Selker, N. D. Read, D. Jaffe,
W. FitzHugh, L. J. Ma, S. Smirnov, S. Purcell, B. Rehman, T. Elkins, R. Engels,
S. Wang, C. B. Nielsen, J. Butler, M. Endrizzi, D. Qui, P. Ianakiev, D. Bell-Pedersen, M. A. Nelson, M. Werner-Washburne, C. P. Selitrennikoff, J. A. Kinsey, E. L. Braun, A. Zelter, U. Schulte, G. O. Kothe, G. Jedd, W. Mewes,
C. Staben, E. Marcotte, D. Greenberg, A. Roy, K. Foley, J. Naylor, N. Stange-Thomann, R. Barrett, S. Gnerre, M. Kamal, M. Kamvysselis, E. Mauceli,
C. Bielke, S. Rudd, D. Frishman, S. Krystofova, C. Rasmussen, R. L. Metzenberg, D. D. Perkins, S. Kroken, C. Cogoni, G. Macino, D. Catcheside, W. Li,
R. J. Pratt, S. A. Osmani, C. P. DeSouza, L. Glass, M. J. Orbach, J. A. Berglund,
R. Voelker, O. Yarden, M. Plamann, S. Seiler, J. Dunlap, A. Radford, R. Aramayo, D. O. Natvig, L. A. Alex, G. Mannhaupt, D. J. Ebbole, M. Freitag,
I. Paulsen, M. S. Sachs, E. S. Lander, C. Nusbaum, and B. Birren. The genome
sequence of the filamentous fungus neurospora crassa. Nature, 422(6934):859–68,
2003.
[102] C. A. Gale, C. M. Bendel, M. McClellan, M. Hauser, J. M. Becker, J. Berman,
and M. K. Hostetter. Linkage of adhesion, filamentous growth, and virulence in
candida albicans to a single gene, int1. Science, 279(5355):1355–8, 1998.
[103] W. Gao, C. H. Khang, S. Y. Park, Y. H. Lee, and S. Kang. Evolution and
organization of a highly dynamic, subtelomeric helicase gene family in the rice
blast fungus magnaporthe grisea. Genetics, 162(1):103–12, 2002.
[104] R. G. Garrison and K. S. Boyd. Dimorphism of penicillium marneffei as observed
by electron microscopy. Can J Microbiol, 19(10):1305–9, 1973.
[105] S. M. Gasser and M. M. Cockell. The molecular biology of the sir proteins.
Gene, 279(1):1–16, 2001.
[106] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer,
J. Schultz, J. M. Rick, A. M. Michon, C. M. Cruciat, M. Remor, C. Hofert,
M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M. A.
Heurtier, R. R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes,
M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and
G. Superti-Furga. Functional organization of the yeast proteome by systematic
analysis of protein complexes. Nature, 415(6868):141–7, 2002.
[107] R. F. Geever, L. Huiet, J. A. Baum, B. M. Tyler, V. B. Patel, B. J. Rutledge,
M. E. Case, and N. H. Giles. Dna sequence, organization and regulation of the
qa gene cluster of neurospora crassa. J Mol Biol, 207(1):15–34, 1989.
[108] M. S. Gelfand. Prediction of function in dna sequence analysis. J Comput Biol,
2(1):87–115, 1995.
[109] W. Gilbert, S. J. de Souza, and M. Long. Origin of genes. Proc Natl Acad Sci
U S A, 94(15):7698–703, 1997.
[110] A. Goffeau, B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann,
F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W. Mewes,
Y. Murakami, P. Philippsen, H. Tettelin, and S. G. Oliver. Life with 6000 genes.
Science, 274(5287):546, 563–7, 1996.
[111] N. Goldman and Z. Yang. A codon-based model of nucleotide substitution for
protein-coding dna sequences. Mol Biol Evol, 11(5):725–36, 1994.
[112] D. Gordon, C. Abajian, and P. Green. Consed: a graphical tool for sequence
finishing. Genome Res, 8(3):195–202, 1998.
[113] N. A. Gow. Candida albicans switches mates. Mol Cell, 10(2):217–8, 2002.
[114] N. A. Gow, A. J. Brown, and F. C. Odds. Fungal morphogenesis and host
invasion. Curr Opin Microbiol, 5(4):366–71, 2002.
[115] D. Grant, P. Cregan, and R. C. Shoemaker. Genome organization in dicots:
genome duplication in arabidopsis and synteny between soybean and arabidopsis.
Proc Natl Acad Sci U S A, 97(8):4168–73, 2000.
[116] D. Graur. Amino acid composition and the evolutionary rates of protein-coding
genes. J Mol Evol, 22(1):53–62, 1985.
[117] S. I. Grewal and D. Moazed. Heterochromatin and epigenetic control of gene
expression. Science, 301(5634):798–802, 2003.
[118] Z. Gu, A. Cavalcanti, F. C. Chen, P. Bouman, and W. H. Li. Extent of gene
duplication in the genomes of drosophila, nematode, and yeast. Mol Biol Evol,
19(3):256–62, 2002.
[119] Z. Gu, L. M. Steinmetz, X. Gu, C. Scharfe, R. W. Davis, and W. H. Li.
Role of duplicate genes in genetic robustness against null mutations. Nature,
421(6918):63–6, 2003.
[120] J. E. Haber. Mating-type gene switching in saccharomyces cerevisiae. Annu Rev
Genet, 32:561–99, 1998.
[121] H. Hamada, M. Seidman, B. H. Howard, and C. M. Gorman. Enhanced gene
expression by the poly(dt-dg).poly(dc-da) sequence. Mol Cell Biol, 4(12):2622–
30, 1984.
[122] A. J. Hamilton, L. Jeavons, S. Youngchim, and N. Vanittanakom. Recognition of
fibronectin by penicillium marneffei conidia via a sialic acid-dependent process
and its relationship to the interaction between conidia and laminin. Infect Immun, 67(10):5200–5, 1999.
[123] A. J. Hamilton, L. Jeavons, S. Youngchim, N. Vanittanakom, and R. J. Hay.
Sialic acid-dependent recognition of laminin by penicillium marneffei conidia.
Infect Immun, 66(12):6024–6, 1998.
[124] K. H. Han, K. Y. Han, J. H. Yu, K. S. Chae, K. Y. Jahng, and D. M. Han. The
nsdd gene encodes a putative gata-type transcription factor necessary for sexual
development of aspergillus nidulans. Mol Microbiol, 41(2):299–309, 2001.
[125] K. H. Han, J. A. Seo, and J. H. Yu. A putative g protein-coupled receptor
negatively controls sexual development in aspergillus nidulans. Mol Microbiol,
51(5):1333–45, 2004.
[126] M. Hasegawa, H. Kishino, and T. Yano. Dating of the human-ape splitting by a
molecular clock of mitochondrial dna. J Mol Evol, 22:160–174, 1985.
[127] K. E. Hastings. Strong evolutionary conservation of broadly expressed protein
isoforms in the troponin i gene family and other vertebrate gene families. J Mol
Evol, 42(6):631–40, 1996.
[128] K. Haynes. Virulence in candida species. Trends Microbiol, 9(12):591–6, 2001.
[129] B. He, P. Chen, S. Y. Chen, K. L. Vancura, S. Michaelis, and S. Powers. Ram2,
an essential gene of yeast, and ram1 encode the two polypeptide components of
the farnesyltransferase that prenylates a-factor and ras proteins. Proc Natl Acad
Sci U S A, 88(24):11373–7, 1991.
[130] D. S. Heckman, D. M. Geiser, B. R. Eidell, R. L. Stauffer, N. L. Kardos, and
S. B. Hedges. Molecular evidence for the early colonization of land by fungi and
plants. Science, 293(5532):1129–33, 2001.
[131] S. B. Hedges and S. Kumar. Genomic clocks and evolutionary timescales. Trends
Genet, 19(4):200–6, 2003.
[132] I. Herskowitz. Fungal physiology. yeast branches out. Nature, 357(6375):190–1,
1992.
[133] L. H. Hogan, S. Josvai, and B. S. Klein. Genomic cloning, characterization, and
functional analysis of the major surface adhesin wi-1 on blastomyces dermatitidis
yeasts. J Biol Chem, 270(51):30725–32, 1995.
[134] P. R. Hsueh, L. J. Teng, C. C. Hung, J. H. Hsu, P. C. Yang, S. W. Ho, and
K. T. Luh. Molecular evidence for strain dissemination of penicillium marneffei:
an emerging pathogen in taiwan. J Infect Dis, 181(5):1706–12, 2000.
[135] H. Huang, W. C. Barker, Y. Chen, and C. H. Wu. iproclass: an integrated
database of protein family, function and structure information. Nucleic Acids
Res, 31(1):390–2, 2003.
[136] A. L. Hughes and R. Friedman. Parallel evolution by gene duplication in the
genomes of two unicellular fungi. Genome Res, 13(6A):1259–64, 2003.
[137] M. K. Hughes and A. L. Hughes. Evolution of duplicate genes in a tetraploid
animal, xenopus laevis. Mol Biol Evol, 10(6):1360–9, 1993.
[138] C. M. Hull and A. D. Johnson. Identification of a mating type-like locus in the
asexual pathogenic yeast candida albicans. Science, 285(5431):1271–5, 1999.
[139] C. M. Hull, R. M. Raisner, and A. D. Johnson. Evidence for mating of the
“asexual” yeast candida albicans in a mammalian host. Science, 289(5477):307–
10, 2000.
[140] C. C. Hung, M. Y. Chen, S. M. Hsieh, W. H. Sheng, C. F. Hsiao, and S. C.
Chang. Discontinuation of secondary prophylaxis for penicilliosis marneffei in
aids patients responding to highly active antiretroviral therapy. Aids, 16(4):672–
3, 2002.
[141] L. D. Hurst and N. G. Smith. Do essential genes evolve slowly? Curr Biol,
9(14):747–50, 1999.
[142] M. Huynen, B. Snel, W. Lathe 3rd, and P. Bork. Predicting protein function
by genomic context: quantitative evaluation and qualitative inferences. Genome
Res, 10(8):1204–10, 2000.
[143] I. Iliopoulos, S. Tsoka, M. A. Andrade, A. J. Enright, M. Carroll, P. Poullet, V. Promponas, T. Liakopoulos, G. Palaios, C. Pasquier, S. Hamodrakas,
J. Tamames, A. T. Yagnik, A. Tramontano, D. Devos, C. Blaschke, A. Valencia,
D. Brett, D. Martin, C. Leroy, I. Rigoutsos, C. Sander, and C. A. Ouzounis.
Evaluation of annotation strategies using an entire genome sequence. Bioinformatics, 19(6):717–26, 2003.
[144] P. Imwidthaya, A. S. Sekhon, T. D. Mastro, A. K. Garg, and E. Ambrosie. Usefulness of a microimmunodiffusion test for the detection of penicillium marneffei
antigenemia, antibodies, and exoantigens. Mycopathologia, 138(2):51–5, 1997.
[145] Y. Ina. Oden: a program package for molecular evolutionary analysis and database search of dna and amino acid sequences. Comput Appl Biosci, 10:11–12,
1994.
[146] L. Jeavons, A. J. Hamilton, N. Vanittanakom, R. Ungpakorn, E. G. Evans,
T. Sirisanthana, and R. J. Hay. Identification and purification of specific penicillium marneffei antigens and their recognition by human immune sera. J Clin
Microbiol, 36(4):949–54, 1998.
[147] M. E. Johnson, L. Viggiano, J. A. Bailey, M. Abdul-Rauf, G. Goodwin, M. Rocchi, and E. E. Eichler. Positive selection of a gene family during the emergence
of humans and african apes. Nature, 413(6855):514–9, 2001.
[148] T. Jones, N. A. Federspiel, H. Chibana, J. Dungan, S. Kalman, B. B. Magee,
G. Newport, Y. R. Thorstenson, N. Agabian, P. T. Magee, R. W. Davis, and
S. Scherer. The diploid genome sequence of candida albicans. Proc Natl Acad
Sci U S A, 101(19):7329–34, 2004.
[149] I. K. Jordan, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Essential genes are
more evolutionarily conserved than are nonessential genes in bacteria. Genome
Res, 12(6):962–8, 2002.
[150] I. K. Jordan, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Microevolutionary
genomics of bacteria. Theor Popul Biol, 61(4):435–47, 2002.
[151] I. K. Jordan, Y. I. Wolf, and E. V. Koonin. No simple dependence between
protein evolution rate and the number of protein-protein interactions: only the
most prolific interactors tend to evolve slowly. BMC Evol Biol, 3(1):1, 2003.
[152] T. Joseph-Horne, D. W. Hollomon, and P. M. Wood. Fungal respiration: a
fusion of standard and alternative components. Biochim Biophys Acta, 1504(2-3):179–95, 2001.
[153] T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro,
editor, Mammalian Protein Metabolism, pages 21–132. Academic Press, New
York, 1969.
[154] D. Julius, L. Blair, A. Brake, G. Sprague, and J. Thorner. Yeast alpha factor is
processed from a larger precursor polypeptide: the essential role of a membrane-bound dipeptidyl aminopeptidase. Cell, 32(3):839–52, 1983.
[155] H. Kaessmann, S. Zollner, A. Nekrutenko, and W. H. Li. Signatures of domain
shuffling in the human genome. Genome Res, 12(11):1642–50, 2002.
[156] E. Kafer. Origins of translocations in aspergillus nidulans. Genetics, 52(1):217–
32, 1965.
[157] T. Kanbe and J. E. Cutler. Minimum chemical requirements for adhesin activity of the acid-stable part of candida albicans cell wall phosphomannoprotein
complex. Infect Immun, 66(12):5812–8, 1998.
[158] R. Kappe, C. Fauser, C. N. Okeke, and M. Maiwald. Universal fungus-specific
primer systems and group-specific hybridization oligonucleotides for 18s rdna.
Mycoses, 39(1-2):25–30, 1996.
[159] N. Kato, W. Brooks, and A. M. Calvo. The expression of sterigmatocystin and
penicillin genes in aspergillus nidulans is controlled by vea, a gene required for
sexual development. Eukaryot Cell, 2(6):1178–86, 2003.
[160] L. Kaufman, P. G. Standard, M. Jalbert, P. Kantipong, K. Limpakarnjanarat,
and T. D. Mastro. Diagnostic antigenemia tests for penicilliosis marneffei. J
Clin Microbiol, 34(10):2503–5, 1996.
[161] N. P. Keller and T. M. Hohn. Metabolic pathway gene clusters in filamentous
fungi. Fungal Genet Biol, 21(1):17–29, 1997.
[162] M. Kelly, J. Burke, M. Smith, A. Klar, and D. Beach. Four mating-type genes
control sexual differentiation in the fission yeast. Embo J, 7(5):1537–47, 1988.
[163] Z. Kerenyi and L. Hornok. Structure and function of mating-type genes in
fusarium species. Acta Microbiol Immunol Hung, 49(2-3):313–4, 2002.
[164] H. Kim, K. Han, K. Kim, D. Han, K. Jahng, and K. Chae. The vea gene activates
sexual development in aspergillus nidulans. Fungal Genet Biol, 37(1):72–80,
2002.
[165] M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol,
16:111–120, 1980.
[166] M. Kimura and J. L. King. Fixation of a deleterious allele at one of two “duplicate” loci by mutation pressure and random drift. Proc Natl Acad Sci U S A,
76(6):2858–61, 1979.
[167] K. E. Kirk and N. R. Morris. The tubb alpha-tubulin gene is essential for sexual
development in aspergillus nidulans. Genes Dev, 5(11):2014–23, 1991.
[168] K. E. Kirk and N. R. Morris. Either alpha-tubulin isogene product is sufficient for
microtubule function during all stages of growth and differentiation in aspergillus
nidulans. Mol Cell Biol, 13(8):4465–76, 1993.
[169] B. S. Klein, L. H. Hogan, and J. M. Jones. Immunologic recognition of a 25-amino
acid repeat arrayed in tandem on a major antigen of blastomyces dermatitidis.
J Clin Invest, 92(1):330–7, 1993.
[170] M. A. Klich, E. J. Mullaney, C. B. Daly, and J. W. Cary. Molecular and physiological aspects of aflatoxin and sterigmatocystin biosynthesis by aspergillus
tamarii and a. ochraceoroseus. Appl Microbiol Biotechnol, 53(5):605–9, 2000.
[171] Y. Koguchi, K. Kawakami, S. Kon, T. Segawa, M. Maeda, T. Uede, and A. Saito.
Penicillium marneffei causes osteopontin-mediated production of interleukin-12
by peripheral blood mononuclear cells. Infect Immun, 70(3):1042–8, 2002.
[172] F. A. Kondrashov and E. V. Koonin. Origin of alternative splicing by tandem
exon duplication. Hum Mol Genet, 10(23):2661–9, 2001.
[173] F. A. Kondrashov and E. V. Koonin. Evolution of alternative splicing: deletions,
insertions and origin of functional parts of proteins from intron sequences. Trends
Genet, 19(3):115–9, 2003.
[174] F. A. Kondrashov, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Selection in the
evolution of gene duplications. Genome Biol, 3(2):RESEARCH0008, 2002.
[175] R. Koszul, A. Malpertuy, L. Frangeul, C. Bouchier, P. Wincker, A. Thierry,
S. Duthoy, S. Ferris, C. Hennequin, and B. Dujon. The complete mitochondrial
genome sequence of the pathogenic yeast candida (torulopsis) glabrata. FEBS
Lett, 534(1-3):39–48, 2003.
[176] L. Kraakman, K. Lemaire, P. Ma, A. W. Teunissen, M. C. Donaton, P. Van Dijck,
J. Winderickx, J. H. de Winde, and J. M. Thevelein. A saccharomyces cerevisiae
g-protein coupled receptor, gpr1, is specifically required for glucose activation of
the camp pathway during the transition to growth on glucose. Mol Microbiol,
32(5):1002–12, 1999.
[177] A. Krause, J. Stoye, and M. Vingron. The systers protein sequence cluster set.
Nucleic Acids Res, 28(1):270–2, 2000.
[178] D. M. Krylov, Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. Gene loss, protein
sequence divergence, gene dispensability, expression level, and interactivity are
correlated in eukaryotic evolution. Genome Res, 13(10):2229–35, 2003.
[179] N. Kudeken, K. Kawakami, and A. Saito. Cytokine-induced fungicidal activity
of human polymorphonuclear leukocytes against penicillium marneffei. FEMS
Immunol Med Microbiol, 26(2):115–24, 1999.
[180] N. Kudeken, K. Kawakami, and A. Saito. Role of superoxide anion in the fungicidal activity of murine peritoneal exudate macrophages against penicillium marneffei. Microbiol Immunol, 43(4):323–30, 1999.
[181] N. Kudeken, K. Kawakami, and A. Saito. Mechanisms of the in vitro fungicidal effects of human neutrophils against penicillium marneffei induced by
granulocyte-macrophage colony-stimulating factor (gm-csf). Clin Exp Immunol,
119(3):472–8, 2000.
[182] E. Y. Kwan, Y. L. Lau, K. Y. Yuen, B. M. Jones, and L. C. Low. Penicillium marneffei infection in a non-hiv infected child. J Paediatr Child Health,
33(3):267–71, 1997.
[183] K. J. Kwon-Chung and J. E. Bennett. Distribution of alpha and a mating
types of cryptococcus neoformans among natural and clinical isolates. Am J
Epidemiol, 108(4):337–40, 1978.
[184] J. A. Lake. Reconstructing evolutionary trees from dna and protein sequences:
paralinear distances. Proc Natl Acad Sci U S A, 91:1455–1459, 1994.
[185] C. Lanave, G. Preparata, C. Saccone, and G. Serio. A new method for calculating
evolutionary substitution rates. J Mol Evol, 20:86–93, 1984.
[186] E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random
clones: a mathematical analysis. Genomics, 2(3):231–9, 1988.
[187] K. Langfelder, B. Jahn, H. Gehringer, A. Schmidt, G. Wanner, and A. A.
Brakhage. Identification of a polyketide synthase gene (pksp) of aspergillus fumigatus involved in conidial pigment biosynthesis and virulence. Med Microbiol
Immunol (Berl), 187(2):79–89, 1998.
[188] L. Latchinian-Sadek and D. Y. Thomas. Expression, purification, and characterization of the yeast kex1 gene product, a polypeptide precursor processing
carboxypeptidase. J Biol Chem, 268(1):534–40, 1993.
[189] J. P. Latge and R. Calderone. Host-microbe interactions: fungi invasive human
fungal opportunistic infections. Curr Opin Microbiol, 5(4):355–8, 2002.
[190] E. Leberer, D. Harcus, I. D. Broadbent, K. L. Clark, D. Dignard, K. Ziegelbauer,
A. Schmidt, N. A. Gow, A. J. Brown, and D. Y. Thomas. Signal transduction
through homologs of the ste20p and ste7p protein kinases can trigger hyphal
formation in the pathogenic fungus candida albicans. Proc Natl Acad Sci U S
A, 93(23):13217–22, 1996.
[191] D. W. Lee, S. Kim, S. J. Kim, D. M. Han, K. Y. Jahng, and K. S. Chae. The
isda gene is necessary for sexual development inhibition by a salt in aspergillus
nidulans. Curr Genet, 39(4):237–43, 2001.
[192] K. B. Lengeler, R. C. Davidson, C. D’Souza, T. Harashima, W. C. Shen,
P. Wang, X. Pan, M. Waugh, and J. Heitman. Signal transduction cascades regulating fungal development and virulence. Microbiol Mol Biol Rev, 64(4):746–85,
2000.
[193] K. B. Lengeler, P. Wang, G. M. Cox, J. R. Perfect, and J. Heitman. Identification of the mata mating-type locus of cryptococcus neoformans reveals a
serotype a mata strain thought to have been extinct. Proc Natl Acad Sci U S
A, 97(26):14455–60, 2000.
[194] I. Letunic, R. R. Copley, and P. Bork. Common exon duplication in animals and
its role in alternative splicing. Hum Mol Genet, 11(13):1561–7, 2002.
[195] J. C. Li, L. Q. Pan, and S. X. Wu. Mycologic investigation on rhizomys pruinous
senex in guangxi as natural carrier with penicillium marneffei. Chin Med J
(Engl), 102(6):477–85, 1989.
[196] W. H. Li. Rate of gene silencing at duplicate loci: a theoretical study and
interpretation of data from tetraploid fishes. Genetics, 95(1):237–58, 1980.
[197] W. H. Li. Unbiased estimation of the rates of synonymous and nonsynonymous
substitution. J Mol Evol, 36:96–99, 1993.
[198] W. H. Li, C. I. Wu, and C. C. Luo. A new method for estimating synonymous
and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol, 2:150–174, 1985.
[199] Wen-Hsiung Li. Molecular evolution. Sinauer Associates, Sunderland, Mass.,
1997.
[200] F. Lisacek, Y. Diaz, and F. Michel. Automatic identification of group i intron
cores in genomic dna sequences. J Mol Biol, 235(4):1206–17, 1994.
[201] C. Y. Lo, D. T. Chan, K. Y. Yuen, F. K. Li, and K. P. Cheng. Penicillium
marneffei infection in a patient with sle. Lupus, 4(3):229–31, 1995.
[202] K. F. LoBuglio and J. W. Taylor. Phylogeny and pcr identification of the human
pathogenic fungus penicillium marneffei. J Clin Microbiol, 33(1):85–9, 1995.
[203] B. J. Loftus, E. Fung, P. Roncaglia, D. Rowley, P. Amedeo, D. Bruno, J. Vamathevan, M. Miranda, I. J. Anderson, J. A. Fraser, J. E. Allen, I. E. Bosdet,
M. R. Brent, R. Chiu, T. L. Doering, M. J. Donlin, C. A. D’Souza, D. S. Fox,
V. Grinberg, J. Fu, M. Fukushima, B. J. Haas, J. C. Huang, G. Janbon, S. J.
Jones, H. L. Koo, M. I. Krzywinski, J. K. Kwon-Chung, K. B. Lengeler, R. Maiti,
M. A. Marra, R. E. Marra, C. A. Mathewson, T. G. Mitchell, M. Pertea, F. R.
Riggs, S. L. Salzberg, J. E. Schein, A. Shvartsbeyn, H. Shin, M. Shumway, C. A.
Specht, B. B. Suh, A. Tenney, T. R. Utterback, B. L. Wickes, J. R. Wortman, N. H. Wye, J. W. Kronstad, J. K. Lodge, J. Heitman, R. W. Davis, C. M.
Fraser, and R. W. Hyman. The genome of the basidiomycetous yeast and human
pathogen cryptococcus neoformans. Science, 307(5713):1321–4, 2005.
[204] M. Long, E. Betran, K. Thornton, and W. Wang. The origin of new genes:
glimpses from the young and old. Nat Rev Genet, 4(11):865–75, 2003.
[205] M. Long and C. H. Langley. Natural selection and the origin of jingwei, a
chimeric processed functional gene in drosophila. Science, 260(5104):91–5, 1993.
[206] M. C. Lorenz. Genomic approaches to fungal pathogenicity. Curr Opin Microbiol, 5(4):372–8, 2002.
[207] T. M. Lowe and S. R. Eddy. trnascan-se: a program for improved detection of
transfer rna genes in genomic sequence. Nucleic Acids Res, 25(5):955–64, 1997.
[208] Q. Lu, L. L. Wallrath, H. Granok, and S. C. Elgin. (ct)n (ga)n repeats and heat
shock elements have distinct roles in chromatin structure and transcriptional
activation of the drosophila hsp26 gene. Mol Cell Biol, 13(5):2802–14, 1993.
[209] L. G. Lundin. Evolution of the vertebrate genome as reflected in paralogous
chromosomal regions in man and the house mouse. Genomics, 16(1):1–19, 1993.
[210] M. Lynch and J. S. Conery. The evolutionary fate and consequences of duplicate
genes. Science, 290(5494):1151–5, 2000.
[211] M. Lynch and J. S. Conery. The evolutionary demography of duplicate genes. J
Struct Funct Genomics, 3(1-4):35–44, 2003.
[212] M. Lynch and A. Force. The probability of duplicate gene preservation by
subfunctionalization. Genetics, 154(1):459–73, 2000.
[213] B. B. Magee and P. T. Magee. Induction of mating in candida albicans by
construction of mtla and mtlalpha strains. Science, 289(5477):310–3, 2000.
[214] W. Makalowski and M. S. Boguski. Synonymous and nonsynonymous substitution distances are correlated in mouse and rat genes. J Mol Evol, 47(2):119–21,
1998.
[215] W. Makalowski, G. A. Mitchell, and D. Labuda. Alu sequences in the coding
regions of mrna: a source of protein variability. Trends Genet, 10(6):188–93,
1994.
[216] G. Mannhaupt, C. Montrone, D. Haase, H. W. Mewes, V. Aign, J. D. Hoheisel,
B. Fartmann, G. Nyakatura, F. Kempken, J. Maier, and U. Schulte. What’s
in the genome of a filamentous fungus? analysis of the neurospora genome
sequence. Nucleic Acids Res, 31(7):1944–54, 2003.
[217] E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates, and D. Eisenberg. Detecting protein function and protein-protein interactions from genome
sequences. Science, 285(5428):751–3, 1999.
[218] E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg.
A combined algorithm for genome-wide prediction of protein function. Nature,
402(6757):83–6, 1999.
[219] A. McLysaght, K. Hokamp, and K. H. Wolfe. Extensive genomic duplication
during early chordate evolution. Nat Genet, 31(2):200–4, 2002.
[220] H. W. Mewes, K. Albermann, M. Bahr, D. Frishman, A. Gleissner, J. Hani,
K. Heumann, K. Kleine, A. Maierl, S. G. Oliver, F. Pfeiffer, and A. Zollner.
Overview of the yeast genome. Nature, 387(6632 Suppl):7–65, 1997.
[221] A. Meyer and M. Schartl. Gene and genome duplications in vertebrates: the
one-to-four (-to-eight in fish) rule and the evolution of novel gene functions.
Curr Opin Cell Biol, 11(6):699–704, 1999.
[222] K. Y. Miller, T. M. Toennis, T. H. Adams, and B. L. Miller. Isolation and transcriptional characterization of a morphological modifier: the aspergillus nidulans
stunted (stua) gene. Mol Gen Genet, 227(2):285–92, 1991.
[223] T. K. Mitchell and R. A. Dean. The camp-dependent protein kinase catalytic
subunit is required for appressorium formation and pathogenesis by the rice blast
pathogen magnaporthe grisea. Plant Cell, 7(11):1869–78, 1995.
[224] N. P. Money. Plant pathology. reverend berkeley’s devil. Nature, 411(6838):644,
2001.
[225] S. A. Mousavi and G. D. Robson. Oxidative and amphotericin b-mediated cell
death in the opportunistic pathogen aspergillus fumigatus is associated with an
apoptotic-like phenotype. Microbiology, 150(Pt 6):1937–45, 2004.
[226] S. V. Muse and B. S. Gaut. A likelihood approach for comparing synonymous and
nonsynonymous nucleotide substitution rates, with application to the chloroplast
genome. Mol Biol Evol, 11(5):715–24, 1994.
[227] K. A. Nasmyth and K. Tatchell. The structure of transposable yeast mating
type loci. Cell, 19(3):753–64, 1980.
[228] M. Nei and T. Gojobori. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol, 3:418–426,
1986.
[229] Masatoshi Nei and S. Kumar. Molecular evolution and phylogenetics. Oxford
University Press, Oxford, UK, 2000.
[230] A. Nekrutenko and W. H. Li. Transposable elements are found in a large number
of human protein-coding genes. Trends Genet, 17(11):619–21, 2001.
[231] M. A. Nelson, S. Kang, E. L. Braun, M. E. Crawford, P. L. Dolan, P. M.
Leonard, J. Mitchell, A. M. Armijo, L. Bean, E. Blueyes, T. Cushing, A. Errett, M. Fleharty, M. Gorman, K. Judson, R. Miller, J. Ortega, I. Pavlova,
J. Perea, S. Todisco, R. Trujillo, J. Valentine, A. Wells, M. Werner-Washburne,
D. O. Natvig, et al. Expressed sequences from conidial, mycelial, and sexual
stages of neurospora crassa. Fungal Genet Biol, 21(3):348–63, 1997.
[232] S. L. Newman, S. Chaturvedi, and B. S. Klein. The wi-1 antigen of blastomyces
dermatitidis yeasts mediates binding to human macrophage cd11b/cd18 (cr3)
and cd14. J Immunol, 154(2):753–61, 1995.
[233] W. C. Nierman, A. Pain, M. J. Anderson, J. R. Wortman, H. S. Kim, J. Arroyo, M. Berriman, K. Abe, D. B. Archer, C. Bermejo, J. Bennett, P. Bowyer,
D. Chen, M. Collins, R. Coulsen, R. Davies, P. S. Dyer, M. Farman, N. Fedorova,
T. V. Feldblyum, R. Fischer, N. Fosker, A. Fraser, J. L. Garcia, M. J. Garcia,
A. Goble, G. H. Goldman, K. Gomi, S. Griffith-Jones, R. Gwilliam, B. Haas,
H. Haas, D. Harris, H. Horiuchi, J. Huang, S. Humphray, J. Jimenez, N. Keller,
H. Khouri, K. Kitamoto, T. Kobayashi, S. Konzack, R. Kulkarni, T. Kumagai, A. Lafton, J. P. Latge, W. Li, A. Lord, C. Lu, W. H. Majoros, G. S.
May, B. L. Miller, Y. Mohamoud, M. Molina, M. Monod, I. Mouyna, S. Mulligan, L. Murphy, S. O’Neil, I. Paulsen, M. A. Penalva, M. Pertea, C. Price,
B. L. Pritchard, M. A. Quail, E. Rabbinowitsch, N. Rawlins, M. A. Rajandream, U. Reichard, H. Renauld, G. D. Robson, S. Rodriguez de Cordoba, J. M.
Rodriguez-Pena, C. M. Ronning, S. Rutter, S. L. Salzberg, M. Sanchez, J. C.
Sanchez-Ferrero, D. Saunders, K. Seeger, R. Squares, S. Squares, M. Takeuchi,
F. Tekaia, G. Turner, C. R. Vazquez de Aldana, J. Weidman, O. White, J. Woodward, J. H. Yu, C. Fraser, J. E. Galagan, K. Asai, M. Machida, N. Hall, B. Barrell, and D. W. Denning. Genomic sequence of the pathogenic and allergenic
filamentous fungus aspergillus fumigatus. Nature, 438(7071):1151–6, 2005.
[234] L. R. Nunes, R. Costa de Oliveira, D. B. Leite, V. S. da Silva, E. dos Reis Marques, M. E. da Silva Ferreira, D. C. Ribeiro, L. A. de Souza Bernardes, M. H.
Goldman, R. Puccia, L. R. Travassos, W. L. Batista, M. P. Nobrega, F. G. Nobrega, D. Y. Yang, C. A. de Braganca Pereira, and G. H. Goldman. Transcriptome analysis of paracoccidioides brasiliensis cells undergoing mycelium-to-yeast
transition. Eukaryot Cell, 4(12):2115–28, 2005.
[235] D. I. Nurminsky, M. V. Nurminskaya, D. De Aguiar, and D. L. Hartl. Selective sweep of a newly evolved sperm-specific gene in drosophila. Nature,
396(6711):572–5, 1998.
[236] A. Odom, S. Muir, E. Lim, D. L. Toffaletti, J. Perfect, and J. Heitman.
Calcineurin is required for virulence of cryptococcus neoformans. Embo J,
16(10):2576–89, 1997.
[237] S. Ohno. Evolution by Gene Duplication. Springer-Verlag Inc., New York, 1970.
[238] T. Ohta. How gene families evolve. Theor Popul Biol, 37(1):213–9, 1990.
[239] T. Ohta. Synonymous and nonsynonymous substitutions in mammalian genes
and the nearly neutral theory. J Mol Evol, 40(1):56–63, 1995.
[240] H. D. Osiewacz and E. Kimpel. Mitochondrial-nuclear interactions and lifespan
control in fungi. Exp Gerontol, 34(8):901–9, 1999.
[241] C. Pal, B. Papp, and L. D. Hurst. Highly expressed genes in yeast evolve slowly.
Genetics, 158(2):927–31, 2001.
[242] P. Pamilo and N. O. Bianchi. Evolution of the zfx and zfy genes: rates and
interdependence between the genes. Mol Biol Evol, 10:271–281, 1993.
[243] B. Paquin and B. F. Lang. The mitochondrial dna of allomyces macrogynus: the
complete genomic sequence from an ancestral fungus. J Mol Biol, 255(5):688–
701, 1996.
[244] L. Patthy. Genome evolution and the evolution of exon-shuffling–a review. Gene,
238(1):103–14, 1999.
[245] W. R. Pearson. Rapid and sensitive sequence comparison with fastp and fasta.
Methods Enzymol, 183:63–98, 1990.
[246] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A, 85:2444–2448, 1988.
[247] J. Pei and N. V. Grishin. Type ii caax prenyl endopeptidases belong to a novel
superfamily of putative membrane-bound metalloproteases. Trends Biochem Sci,
26(5):275–7, 2001.
[248] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates.
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A, 96(8):4285–8, 1999.
[249] G. E. Pierard, J. Arrese Estrada, C. Pierard-Franchimont, A. Thiry, and D. Stynen. Immunohistochemical expression of galactomannan in the cytoplasm of
phagocytic cells during invasive aspergillosis. Am J Clin Pathol, 96(3):373–6,
1991.
[250] J. Piskur. Origin of the duplicated regions in the yeast genomes. Trends Genet,
17(6):302–3, 2001.
[251] J. B. Plotkin, J. Dushoff, and H. B. Fraser. Detecting selection using a single
genome sequence of m. tuberculosis and p. falciparum. Nature, 428:942–945,
2004.
[252] S. Poggeler. Mating-type genes for classical strain improvements of ascomycetes.
Appl Microbiol Biotechnol, 56(5-6):589–601, 2001.
[253] S. Poggeler. Genomic evidence for mating abilities in the asexual pathogen
aspergillus fumigatus. Curr Genet, 42(3):153–60, 2002.
[254] S. Pongsunk, A. Andrianopoulos, and S. C. Chaiyaroj. Conditional lethal disruption of tata-binding protein gene in penicillium marneffei. Fungal Genet Biol,
42(11):893–903, 2005.
[255] M. Pop, D. S. Kosack, and S. L. Salzberg. Hierarchical scaffolding with bambus.
Genome Res, 14(1):149–59, 2004.
[256] R. O. Poyton and J. E. McEwen. Crosstalk between nuclear and mitochondrial
genomes. Annu Rev Biochem, 65:563–607, 1996.
[257] V. E. Prince and F. B. Pickett. Splitting pairs: the diverging fates of duplicated
genes. Nat Rev Genet, 3(11):827–37, 2002.
[258] L. Ramsay, M. Macaulay, S. degli Ivanissevich, K. MacLean, L. Cardle, J. Fuller,
K. J. Edwards, S. Tuvesson, M. Morgante, A. Massari, E. Maestri, N. Marmiroli,
T. Sjakste, M. Ganal, W. Powell, and R. Waugh. A simple sequence repeat-based
linkage map of barley. Genetics, 156(4):1997–2005, 2000.
[259] M. Raymond, D. Dignard, A. M. Alarco, N. Mainville, B. B. Magee, and D. Y.
Thomas. A ste6p/p-glycoprotein homologue from the asexual yeast candida
albicans transports the a-factor mating pheromone in saccharomyces cerevisiae.
Mol Microbiol, 27(3):587–98, 1998.
[260] Y. Reiss, J. L. Goldstein, M. C. Seabra, P. J. Casey, and M. S. Brown. Inhibition
of purified p21ras farnesyl:protein transferase by cys-aax tetrapeptides. Cell,
62(1):81–8, 1990.
[261] M. Remm, C. E. Storm, and E. L. Sonnhammer. Automatic clustering of
orthologs and in-paralogs from pairwise species comparisons. J Mol Biol,
314(5):1041–52, 2001.
[262] M. Ricchetti, C. Fairhead, and B. Dujon. Mitochondrial dna repairs double-strand breaks in yeast chromosomes. Nature, 402(6757):96–100, 1999.
[263] P. Rice, I. Longden, and A. Bleasby. Emboss: the european molecular biology
open software suite. Trends Genet, 16(6):276–7, 2000.
[264] I. Rigoutsos, T. Huynh, A. Floratos, L. Parida, and D. Platt. Dictionary-driven
protein annotation. Nucleic Acids Res, 30(17):3901–16, 2002.
[265] M. Robinson-Rechavi and V. Laudet. Evolutionary rates of duplicate genes in
fish and mammals. Mol Biol Evol, 18(4):681–3, 2001.
[266] F. Rodriguez, J. L. Oliver, A. Marin, and J. R. Medina. The general stochastic
model of nucleotide substitution. J Theor Biol, 142:485–501, 1990.
[267] S. Rogic, A. K. Mackworth, and F. B. Ouellette. Evaluation of gene-finding
programs on mammalian sequences. Genome Res, 11(5):817–32, 2001.
[268] S. Rogic, B. F. Ouellette, and A. K. Mackworth. Improving gene recognition
accuracy by combining predictions from two gene-finding programs. Bioinformatics, 18(8):1034–45, 2002.
[269] Y. Rongrungruang and S. M. Levitz. Interactions of penicillium marneffei with
human leukocytes in vitro. Infect Immun, 67(9):4732–6, 1999.
[270] G. M. Rubin, M. D. Yandell, J. R. Wortman, G. L. Gabor Miklos, C. R. Nelson,
I. K. Hariharan, M. E. Fortini, P. W. Li, R. Apweiler, W. Fleischmann, J. M.
Cherry, S. Henikoff, M. P. Skupski, S. Misra, M. Ashburner, E. Birney, M. S.
Boguski, T. Brody, P. Brokstein, S. E. Celniker, S. A. Chervitz, D. Coates,
A. Cravchik, A. Gabrielian, R. F. Galle, W. M. Gelbart, R. A. George, L. S.
Goldstein, F. Gong, P. Guan, N. L. Harris, B. A. Hay, R. A. Hoskins, J. Li,
Z. Li, R. O. Hynes, S. J. Jones, P. M. Kuehl, B. Lemaitre, J. T. Littleton, D. K.
Morrison, C. Mungall, P. H. O’Farrell, O. K. Pickeral, C. Shue, L. B. Vosshall,
J. Zhang, Q. Zhao, X. H. Zheng, and S. Lewis. Comparative genomics of the
eukaryotes. Science, 287(5461):2204–15, 2000.
[271] A. Rzhetsky and P. Morozov. Markov chain monte carlo computation of confidence intervals for substitution-rate variation in proteins. Pac Symp Biocomput,
6:203–214, 2001.
[272] C. Sadhu, D. Hoekstra, M. J. McEachern, S. I. Reed, and J. B. Hicks. A g-protein alpha subunit from asexual candida albicans functions in the mating
signal transduction pathway of saccharomyces cerevisiae and is regulated by the
a1-alpha 2 repressor. Mol Cell Biol, 12(5):1977–85, 1992.
[273] N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 4(4):406–25, 1987.
[274] L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg. The database of interacting proteins: 2004 update. Nucleic Acids Res,
32(Database issue):D449–51, 2004.
[275] G. San-Blas. [dimorphic fungi: biochemical approach to their dimorphism]. Acta
Cient Venez, 46(4):221–4, 1995.
[276] G. A. Sarosi and D. S. Serstock. Isolation of blastomyces dermatitidis from
pigeon manure. Am Rev Respir Dis, 114(6):1179–83, 1976.
[277] A. S. Sekhon, J. S. Li, and A. K. Garg. Penicillosis marneffei: serological and
exoantigen studies. Mycopathologia, 77(1):51–7, 1982.
[278] P. Sengupta and B. H. Cochran. Mat alpha 1 can mediate gene activation by
a-mating factor. Genes Dev, 5(10):1924–34, 1991.
[279] C. Seoighe and K. H. Wolfe. Extent of genomic rearrangement after genome
duplication in yeast. Proc Natl Acad Sci U S A, 95(8):4447–52, 1998.
[280] C. Seoighe and K. H. Wolfe. Updated map of duplicated regions in the yeast
genome. Gene, 238(1):253–61, 1999.
[281] P. M. Sharp. In search of molecular darwinism. Nature, 385:111–112, 1997.
[282] P. M. Sharp and W. H. Li. The codon adaptation index–a measure of directional
synonymous codon usage bias, and its potential applications. Nucleic Acids Res,
15(3):1281–95, 1987.
[283] P. M. Sharp and W. H. Li. The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol Biol Evol, 4(3):222–30,
1987.
[284] J. C. Shepherd, W. McGinnis, A. E. Carrasco, E. M. De Robertis, and W. J.
Gehring. Fly and frog homoeo domains show homologies with yeast mating type
regulatory proteins. Nature, 310(5972):70–1, 1984.
[285] R. Shields. Pushing the envelope on molecular dating. Trends Genet, 20(5):221–
2, 2004.
[286] R. A. Sia, K. B. Lengeler, and J. Heitman. Diploid strains of the pathogenic
basidiomycete cryptococcus neoformans are thermally dimorphic. Fungal Genet
Biol, 29(3):153–63, 2000.
[287] A. Sidow. Gen(om)e duplications in the evolution of early vertebrates. Curr
Opin Genet Dev, 6(6):715–22, 1996.
[288] R. R. Sinden. Biological implications of the dna structures associated with
disease-causing triplet repeats. Am J Hum Genet, 64(2):346–53, 1999.
[289] M. Sipiczki. Where does fission yeast sit on the tree of life? Genome Biol,
1(2):REVIEWS1011, 2000.
[290] T. Sirisanthana, K. Supparatpinyo, J. Perriens, and K. E. Nelson. Amphotericin
b and itraconazole for treatment of disseminated penicillium marneffei infection
in human immunodeficiency virus-infected patients. Clin Infect Dis, 26(5):1107–
10, 1998.
[291] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol, 147:195–197, 1981.
[292] T. F. Smith, M. S. Waterman, and C. Burks. The statistical distribution of
nucleic acid similarities. Nucleic Acids Res, 13(2):645–56, 1985.
[293] R. Sorek, G. Ast, and D. Graur. Alu-containing exons are alternatively spliced.
Genome Res, 12(7):1060–7, 2002.
[294] P. Staib, M. Kretschmar, T. Nichterlein, H. Hof, and J. Morschhauser. Differential activation of a candida albicans virulence gene family during infection. Proc
Natl Acad Sci U S A, 97(11):6102–7, 2000.
[295] M. A. Steel. Recovering a tree from the leaf colourations it generates under a
markov model. Appl Math Lett, 7:19–32, 1994.
[296] B. R. Steen, T. Lian, S. Zuyderduyn, W. K. MacDonald, M. Marra, S. J. Jones,
and J. W. Kronstad. Temperature-regulated transcription in the pathogenic
fungus cryptococcus neoformans. Genome Res, 12(9):1386–400, 2002.
[297] L. M. Steinmetz, C. Scharfe, A. M. Deutschbauer, D. Mokranjac, Z. S. Herman,
T. Jones, A. M. Chu, G. Giaever, H. Prokisch, P. J. Oefner, and R. W. Davis.
Systematic screen for human disease genes in yeast. Nat Genet, 31(4):400–4,
2002.
[298] A. Stoltzfus. On the possibility of constructive neutral evolution. J Mol Evol,
49(2):169–81, 1999.
[299] J. N. Strathern, E. Spatola, C. McGill, and J. B. Hicks. Structure and organization of transposable mating type cassettes in saccharomyces
yeasts. Proc Natl Acad Sci U S A, 77(5):2839–43, 1980.
[300] K. Supparatpinyo, C. Khamwan, V. Baosoung, K. E. Nelson, and T. Sirisanthana. Disseminated penicillium marneffei infection in southeast asia. Lancet,
344(8915):110–3, 1994.
[301] K. Supparatpinyo, K. E. Nelson, W. G. Merz, B. J. Breslin, C. R. Cooper, Jr.,
C. Kamwan, and T. Sirisanthana. Response to antifungal therapy by human
immunodeficiency virus-infected patients with disseminated penicillium marneffei infections and in vitro susceptibilities of isolates from clinical specimens.
Antimicrob Agents Chemother, 37(11):2407–11, 1993.
[302] K. Supparatpinyo, J. Perriens, K. E. Nelson, and T. Sirisanthana. A controlled trial of itraconazole to prevent relapse of penicillium marneffei infection
in patients infected with the human immunodeficiency virus. N Engl J Med,
339(24):1739–43, 1998.
[303] Y. Suzuki and T. Gojobori. Analysis of coding sequences. In M. Salemi and A. M.
Vandamme, editors, The phylogenetic handbook: a practical approach to DNA
and protein phylogeny, pages 283–311. Cambridge University Press, Cambridge,
UK, 2003.
[304] A. Tam, W. K. Schmidt, and S. Michaelis. The multispanning membrane protein
ste24p catalyzes caax proteolysis and nh2-terminal processing of the yeast a-factor precursor. J Biol Chem, 276(50):46798–806, 2001.
[305] W. Tang, T. M. Gunn, D. F. McLaughlin, G. S. Barsh, S. F. Schlossman, and
J. S. Duke-Cohan. Secreted and membrane attractin result from alternative
splicing of the human atrn gene. Proc Natl Acad Sci U S A, 97(11):6025–30,
2000.
[306] D. Taramelli, S. Brambilla, G. Sala, A. Bruccoleri, C. Tognazioli, L. Riviera-Uzielli, and J. R. Boelaert. Effects of iron on extracellular and intracellular
growth of penicillium marneffei. Infect Immun, 68(3):1724–6, 2000.
[307] D. Taramelli, C. Tognazioli, F. Ravagnani, O. Leopardi, G. Giannulis, and J. R.
Boelaert. Inhibition of intramacrophage growth of penicillium marneffei by 4-aminoquinolines. Antimicrob Agents Chemother, 45(5):1450–5, 2001.
[308] R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T.
Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and
E. V. Koonin. The cog database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res, 29(1):22–8, 2001.
[309] S. Tavare. Some probabilistic and statistical problems in the analysis of dna
sequences. Lectures on Mathematics in the Life Sciences, 17:57–86, 1986.
[310] R. D. Teasdale and M. R. Jackson. Signal-mediated sorting of membrane proteins
between the endoplasmic reticulum and the golgi apparatus. Annu Rev Cell Dev
Biol, 12:27–54, 1996.
[311] J. D. Thompson, D. G. Higgins, and T. J. Gibson. Clustal w: improving the
sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res,
22(22):4673–80, 1994.
[312] J. D. Thompson, D. G. Higgins, and T. J. Gibson. Clustal w: improving the sensitivity
of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22:4673–4680,
1994.
[313] C. Thrane, U. Kaufmann, B. M. Stummann, and S. Olsson. Activation of
caspase-like activity and poly (adp-ribose) polymerase degradation during sporulation in aspergillus nidulans. Fungal Genet Biol, 41(3):361–8, 2004.
[314] W. E. Timberlake. Molecular genetics of aspergillus development. Annu Rev
Genet, 24:5–36, 1990.
[315] R. B. Todd, J. R. Greenhalgh, M. J. Hynes, and A. Andrianopoulos. Tupa, the
penicillium marneffei tup1p homologue, represses both yeast and spore development. Mol Microbiol, 48(1):85–94, 2003.
[316] S. Trewatcharegon, S. Sirisinha, A. Romsai, B. Eampokalap, R. Teanpaisan, and
S. C. Chaiyaroj. Molecular typing of penicillium marneffei isolates from thailand
by noti macrorestriction and pulsed-field gel electrophoresis. J Clin Microbiol,
39(12):4544–8, 2001.
[317] H. F. Tsai, Y. C. Chang, R. G. Washburn, M. H. Wheeler, and K. J. Kwon-Chung. The developmentally regulated alb1 gene of aspergillus fumigatus:
its role in modulation of conidial morphology and virulence. J Bacteriol,
180(12):3031–8, 1998.
[318] H. F. Tsai, M. H. Wheeler, Y. C. Chang, and K. J. Kwon-Chung. A developmentally regulated gene cluster involved in conidial pigment biosynthesis in
aspergillus fumigatus. J Bacteriol, 181(20):6469–77, 1999.
[319] N. Tsuchimori, L. L. Sharkey, W. A. Fonzi, S. W. French, J. E. Edwards, Jr., and
S. G. Filler. Reduced virulence of hwp1-deficient mutants of candida albicans
and their interactions with host cells. Infect Immun, 68(4):1997–2002, 2000.
[320] B. G. Turgeon and O. C. Yoder. Proposed nomenclature for mating type genes
of filamentous ascomycetes. Fungal Genet Biol, 31(1):1–5, 2000.
[321] Y. Van de Peer, J. S. Taylor, I. Braasch, and A. Meyer. The ghost of selection
past: rates of evolution and functional divergence of anciently duplicated genes.
J Mol Evol, 53(4-5):436–46, 2001.
[322] K. Vandepoele, Y. Saeys, C. Simillion, J. Raes, and Y. Van De Peer. The automatic detection of homologous regions (adhore) and its application to microcolinearity between arabidopsis and rice. Genome Res, 12(11):1792–801, 2002.
[323] N. Vanittanakom, C. R. Cooper, Jr., S. Chariyalertsak, S. Youngchim, K. E.
Nelson, and T. Sirisanthana. Restriction endonuclease analysis of penicillium
marneffei. J Clin Microbiol, 34(7):1834–6, 1996.
[324] N. Vanittanakom, W. G. Merz, N. Sittisombut, C. Khamwan, K. E. Nelson, and
T. Sirisanthana. Specific identification of penicillium marneffei by a polymerase
chain reaction/hybridization technique. Med Mycol, 36(3):169–75, 1998.
[325] N. Vanittanakom, P. Vanittanakom, and R. J. Hay. Rapid identification of
penicillium marneffei by pcr-based detection of specific sequences on the rrna
gene. J Clin Microbiol, 40(5):1739–42, 2002.
[326] J. Varga and B. Toth. Genetic variability and reproductive mode of aspergillus
fumigatus. Infect Genet Evol, 3(1):3–17, 2003.
[327] D. Venet. Matarray: a matlab toolbox for microarray data. Bioinformatics,
19:659–660, 2003.
[328] K. J. Verstrepen, A. Jansen, F. Lewitter, and G. R. Fink. Intragenic tandem
repeats generate functional variability. Nat Genet, 37(9):986–90, 2005.
[329] K. J. Verstrepen, T. B. Reynolds, and G. R. Fink. Origins of variation in the
fungal cell surface. Nat Rev Microbiol, 2(7):533–40, 2004.
[330] P. E. Verweij, J. F. Meis, P. van den Hurk, J. Zoll, R. A. Samson, and W. J.
Melchers. Phylogenetic relationships of five species of aspergillus and related
taxa as deduced by comparison of sequences of small subunit ribosomal rna. J
Med Vet Mycol, 33(3):185–90, 1995.
[331] K. Vienken, M. Scherer, and R. Fischer. The zn(ii)2cys6 putative aspergillus
nidulans transcription factor repressor of sexual development inhibits sexual development under low-carbon conditions and in submersed culture. Genetics,
169(2):619–30, 2005.
[332] M. Viswanathan, G. Muthukumar, Y. S. Cong, and J. Lenard. Seripauperins of
saccharomyces cerevisiae: a new multigene family encoding serine-poor relatives
of serine-rich proteins. Gene, 148(1):149–53, 1994.
[333] M. A. Viviani, A. M. Tortorano, G. Rizzardini, T. Quirino, L. Kaufman, A. A.
Padhye, and L. Ajello. Treatment and serological studies of an italian case of
penicilliosis marneffei contracted in thailand by a drug addict infected with the
human immunodeficiency virus. Eur J Epidemiol, 9(1):79–85, 1993.
[334] A. Wagner. The fate of duplicated genes: loss or new function? Bioessays,
20(10):785–8, 1998.
[335] A. Wagner. The yeast protein interaction network evolves rapidly and contains
few redundant duplicate genes. Mol Biol Evol, 18(7):1283–92, 2001.
[336] J. B. Walsh. How often do duplicated genes evolve new functions? Genetics,
139(1):421–8, 1995.
[337] J. D. Walton. Horizontal gene transfer and the evolution of secondary metabolite
gene clusters in fungi: an hypothesis. Fungal Genet Biol, 30(3):167–71, 2000.
[338] W. Wang, F. G. Brunet, E. Nevo, and M. Long. Origin of sphinx, a young
chimeric rna gene in drosophila melanogaster. Proc Natl Acad Sci U S A,
99(7):4448–53, 2002.
[339] W. Wang, H. Zheng, S. Yang, H. Yu, J. Li, H. Jiang, J. Su, L. Yang, J. Zhang,
J. McDermott, R. Samudrala, J. Wang, H. Yang, J. Yu, K. Kristiansen, and
G. K. Wong. Origin and evolution of new exons in rodents. Genome Res,
15(9):1258–64, 2005.
[340] J. L. Weber and P. E. May. Abundant class of human dna polymorphisms
which can be typed using the polymerase chain reaction. Am J Hum Genet,
44(3):388–96, 1989.
[341] M. H. Wheeler and A. A. Bell. Melanins and their importance in pathogenic
fungi. Curr Top Med Mycol, 2:338–87, 1988.
[342] S. Whelan and N. Goldman. A general empirical model of protein evolution
derived from multiple protein families using a maximum-likelihood approach.
Mol Biol Evol, 18(5):691–9, 2001.
[343] A. C. Wilson, S. S. Carlson, and T. J. White. Biochemical evolution. Annu Rev
Biochem, 46:573–639, 1977.
[344] K. H. Wolfe and P. M. Sharp. Mammalian gene evolution: nucleotide sequence
divergence between mouse and rat. J Mol Evol, 37(4):441–56, 1993.
[345] K. H. Wolfe and D. C. Shields. Molecular evidence for an ancient duplication of
the entire yeast genome. Nature, 387(6634):708–13, 1997.
[346] K. H. Wong and S. S. Lee. Comparing the first and second hundred aids cases
in hong kong. Singapore Med J, 39(6):236–40, 1998.
[347] L. P. Wong, P. C. Woo, A. Y. Wu, and K. Y. Yuen. Dna immunization using
a secreted cell wall antigen mp1p is protective against penicillium marneffei
infection. Vaccine, 20(23-24):2878–86, 2002.
[348] S. S. Wong, H. Siau, and K. Y. Yuen. Penicilliosis marneffei–west meets east. J
Med Microbiol, 48(11):973–5, 1999.
[349] S. S. Wong, K. H. Wong, W. T. Hui, S. S. Lee, J. Y. Lo, L. Cao, and K. Y. Yuen.
Differences in clinical and laboratory diagnostic characteristics of penicilliosis
marneffei in human immunodeficiency virus (hiv)- and non-hiv-infected patients.
J Clin Microbiol, 39(12):4535–40, 2001.
[350] S. S. Wong, P. C. Woo, and K. Y. Yuen. Candida tropicalis and penicillium
marneffei mixed fungaemia in a patient with waldenstrom’s macroglobulinaemia.
Eur J Clin Microbiol Infect Dis, 20(2):132–5, 2001.
[351] P. C. Woo, C. M. Chan, A. S. Leung, S. K. Lau, X. Y. Che, S. S. Wong, L. Cao,
and K. Y. Yuen. Detection of cell wall galactomannoprotein afmp1p in culture
supernatants of aspergillus fumigatus and in sera of aspergillosis patients. J Clin
Microbiol, 40(11):4382–7, 2002.
[352] P. C. Woo, K. T. Chong, A. S. Leung, S. S. Wong, S. K. Lau, and K. Y.
Yuen. Aflmp1 encodes an antigenic cell wall protein in aspergillus flavus. J Clin
Microbiol, 41(2):845–50, 2003.
[353] P. C. Woo, H. Zhen, J. J. Cai, J. Yu, S. K. Lau, J. Wang, J. L. Teng, S. S. Wong,
R. H. Tse, R. Chen, H. Yang, B. Liu, and K. Y. Yuen. The mitochondrial genome
of the thermal dimorphic fungus penicillium marneffei is more closely related to
those of molds than yeasts. FEBS Lett, 555(3):469–77, 2003.
[354] V. Wood, R. Gwilliam, M. A. Rajandream, M. Lyne, R. Lyne, A. Stewart,
J. Sgouros, N. Peat, J. Hayles, S. Baker, D. Basham, S. Bowman, K. Brooks,
D. Brown, S. Brown, T. Chillingworth, C. Churcher, M. Collins, R. Connor,
A. Cronin, P. Davis, T. Feltwell, A. Fraser, S. Gentles, A. Goble, N. Hamlin,
D. Harris, J. Hidalgo, G. Hodgson, S. Holroyd, T. Hornsby, S. Howarth, E. J.
Huckle, S. Hunt, K. Jagels, K. James, L. Jones, M. Jones, S. Leather, S. McDonald, J. McLean, P. Mooney, S. Moule, K. Mungall, L. Murphy, D. Niblett,
C. Odell, K. Oliver, S. O’Neil, D. Pearson, M. A. Quail, E. Rabbinowitsch,
K. Rutherford, S. Rutter, D. Saunders, K. Seeger, S. Sharp, J. Skelton, M. Simmonds, R. Squares, S. Squares, K. Stevens, K. Taylor, R. G. Taylor, A. Tivey,
S. Walsh, T. Warren, S. Whitehead, J. Woodward, G. Volckaert, R. Aert,
J. Robben, B. Grymonprez, I. Weltjens, E. Vanstreels, M. Rieger, M. Schafer,
S. Muller-Auer, C. Gabel, M. Fuchs, A. Dusterhoft, C. Fritzc, E. Holzer,
D. Moestl, H. Hilbert, K. Borzym, I. Langer, A. Beck, H. Lehrach, R. Reinhardt,
T. M. Pohl, P. Eger, W. Zimmermann, H. Wedler, R. Wambutt, B. Purnelle,
A. Goffeau, E. Cadieu, S. Dreano, S. Gloux, et al. The genome sequence of
schizosaccharomyces pombe. Nature, 415(6874):871–80, 2002.
[355] J. Wu and B. L. Miller. Aspergillus asexual reproduction and sexual reproduction are differentially affected by transcriptional and translational mechanisms
regulating stunted gene expression. Mol Cell Biol, 17(10):6191–201, 1997.
[356] Z. Yan, X. Li, and J. Xu. Geographic distribution of mating type alleles of
cryptococcus neoformans in four areas of the united states. J Clin Microbiol,
40(3):965–72, 2002.
[357] J. Yang, Z. Gu, and W. H. Li. Rate of protein evolution versus fitness effect of
gene deletion. Mol Biol Evol, 20(5):772–4, 2003.
[358] Z. Yang. Estimating the pattern of nucleotide substitution. J Mol Evol, 39:105–
111, 1994.
[359] Z. Yang. Paml: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci, 13(5):555–6, 1997.
[360] Z. Yang. Phylogenetic Analysis by Maximum Likelihood (PAML). Version 3.0.
London: University College, 2000.
[361] R. F. Yeh, L. P. Lim, and C. B. Burge. Computational inference of homologous
gene structures in the human genome. Genome Res, 11(5):803–16, 2001.
[362] G. Yona, N. Linial, and M. Linial. Protomap: automatic classification of protein
sequences and hierarchy of protein families. Nucleic Acids Res, 28(1):49–55,
2000.
[363] K. Y. Yuen, C. M. Chan, K. M. Chan, P. C. Woo, X. Y. Che, A. S. Leung,
and L. Cao. Characterization of afmp1: a novel target for serodiagnosis of
aspergillosis. J Clin Microbiol, 39(11):3830–7, 2001.
[364] K. Y. Yuen, G. Pascal, S. S. Wong, P. Glaser, P. C. Woo, F. Kunst, J. J. Cai,
E. Y. Cheung, C. Medigue, and A. Danchin. Exploring the penicillium marneffei
genome. Arch Microbiol, 179(5):339–53, 2003.
[365] K. Y. Yuen, S. S. Wong, D. N. Tsang, and P. Y. Chau. Serodiagnosis of penicillium marneffei infection. Lancet, 344(8920):444–5, 1994.
[366] M. Zagulski, B. Babinska, R. Gromadka, A. Migdalski, J. Rytka, J. Sulicka,
and C. J. Herbert. The sequence of 24.3 kb from chromosome x reveals five
complete open reading frames, all of which correspond to new genes, and a
tandem insertion of a ty1 transposon. Yeast, 11(12):1179–86, 1995.
[367] E. M. Zdobnov and R. Apweiler. Interproscan–an integration platform for the
signature-recognition methods in interpro. Bioinformatics, 17(9):847–8, 2001.
[368] C. T. Zhang, J. Wang, and R. Zhang. A novel method to calculate the g+c
content of genomic dna sequences. J Biomol Struct Dyn, 19:333–341, 2001.
[369] J. Zhang, Y. P. Zhang, and H. F. Rosenberg. Adaptive evolution of a duplicated
pancreatic ribonuclease gene in a leaf-eating monkey. Nat Genet, 30(4):411–5,
2002.
[370] L. Zhang, T. J. Vision, and B. S. Gaut. Patterns of nucleotide substitution
among simultaneously duplicated gene pairs in arabidopsis thaliana. Mol Biol
Evol, 19(9):1464–73, 2002.
[371] P. Zhang, Z. Gu, and W. H. Li. Different evolutionary patterns between young
duplicate genes in the human genome. Genome Biol, 4(9):R56, 2003.
[372] R. Zhang and C. T. Zhang. Z curves, an intuitive tool for visualizing and analyzing the dna sequences. J Biomol Struct Dyn, 11:767–782, 1994.