Automated analysis detects cases of possible horizontal gene transfer in
different domains of life
Alexander Y. Panchin
Institute for Information Transmission Problems, RAS
alexpanchin@yahoo.com
Yuri V. Panchin
Institute for Information Transmission Problems, RAS
[email protected]
Alexander I. Tuzhikov
Institute for Information Transmission Problems, RAS
alexander.tuzhikov@gmail.com
Abstract
We developed a novel approach to identify horizontal
gene transfer (HGT) events in genomic sequences.
Predicted protein-coding sequences from pairs of genomes
are mixed together and logistic regression models are
trained on a sub-sample of these sequences to separate
them. The models use arrays of normalized BLAST bit
scores obtained for each sequence by comparison with its
closest hit from a number of other genomes, covering
different domains of life. Regression models that passed
validation on independent subsamples are used to identify
sequences that cluster with sequences from other genomes
(predicted HGT events). Using the genomes of 61 species,
we confirmed some previously reported cases of HGT (such
as the transfer of ankyrin-encoding sequences from
eukaryotes to their symbiotic bacteria or the acquisition of
thaumatins by the roundworm C. elegans) and identified
novel cases of HGT (such as the acquisition of actin-coding
sequences by the carnivorous plant Genlisea aurea).
1. Introduction
Genes within a single genome may have different
phylogenetic histories: some passed down vertically from
the direct ancestors of an organism, while others may have
been acquired via horizontal gene transfer (HGT) from
other species. HGT is widely accepted as an important
factor in the evolution of prokaryotic organisms [1]. A
number of HGT events between prokaryotes and
eukaryotes and even between different eukaryotic species
have been described, and in some cases, the molecular
mechanisms that facilitate such events have been
uncovered [2].
For example, agrobacteria are widely used in modern
biotechnology because of their natural ability to transfer
some of their genes into plant cells [3]. Therefore, it is not
surprising that “naturally transgenic” plants can be found,
such as the sweet potato Ipomoea batatas with actively
transcribed genes of Agrobacterium spp. [4].
Notably, bacterial genes were found in all studied
cultivated (but not wild-type) accessions of Ipomoea batatas,
suggesting that lateral gene transfer may have played an
important evolutionary role during the domestication of this
crop. Similar cases of HGT have been reported for species
within the genera Nicotiana and Linaria [5, 6].
Gene transfer is possible from bacteria to other
eukaryotes, not just plants. It has not yet been confirmed that
the human germline can acquire genes of bacterial origin;
however, the transfer of genes from bacteria to human
somatic cells has been reported and even linked to cancer
[7]. Another interesting case is the acquisition of genes by
arthropods from bacterial endosymbionts. For example, the
genome of the Callosobruchus chinensis beetle contains
around 30% of the Wolbachia genome [8].
Viruses provide a well-established mechanism of HGT.
The acquisition of a particular viral gene played an
important role in the evolution of mammals: a protein
called syncytin is a captive retroviral envelope protein
involved in placental morphogenesis [9]. There is some
evidence that HGT between different eukaryotic species
can occur, for example between certain parasitic plants and
their hosts [10-13]. These and other examples have been
reviewed elsewhere [2].
There is ongoing controversy about the extent to which
HGT plays a role in eukaryotic evolution. Numerous
examples of HGT were found in different eukaryotic
species, including humans, by Crisp et al. [14]. However, a
study by Ku et al. concluded that even gene transfer from
bacteria to eukaryotes is episodic and coincides with major
evolutionary transitions such as the origin of chloroplasts
and mitochondria [15].
It should be noted that Ku et al. analyzed eukaryotic
gene families with prokaryotic homologs and used
clustering and phylogenetic analysis. Crisp et al. used
differences in BLAST bit scores between prokaryotic and
non-prokaryotic best hits of eukaryotic genes as a criterion for
the identification of probable HGT events. Thus, these
studies differ substantially in design, including the criteria
for HGT, as there is no apparent consensus on what
evidence is enough to confirm or disprove the horizontal
transfer of a gene in eukaryotic species with sufficient
certainty.
Here we present a novel approach for the identification
of HGT events in genomic sequences. The approach
involves in silico metagenomic experiments, in which a
model is trained (and subsequently validated) to sort genes
from two different genomes back into two piles. Within the
model, each gene is represented by an array of values that
correspond to its normalized degrees of similarity with its
best hits within each genome from a larger set of genomes,
covering a wide section of the tree of life.
2. Methods
We used publicly available genomic data for 61
genomes. Since most of these genomes were not
well-annotated, we scanned each genome for open reading
frames with a minimum length of 150 bp to define putative
protein-coding regions. These regions may or may not
correspond to actual genes, because splicing was not taken
into account. We did not use existing annotations so that
well-annotated and poorly annotated genomes were
handled in the same way and no bias arising from the
different annotation processes used by different authors
could influence our results.
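The ORF-based definition of putative protein-coding regions can be illustrated with a minimal sketch. The text only fixes the 150 bp minimum length; treating an ORF as an ATG-to-stop stretch, scanning all six reading frames, and the function names below are assumptions made for illustration, not a description of the pipeline actually used.

```python
# Minimal ORF scan. Assumptions (not stated in the text): an ORF runs from
# an ATG codon to the next in-frame stop codon, all six frames are scanned,
# and the input sequence is upper-case A/C/G/T.
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_ORF_LENGTH = 150  # bp, the threshold given in the text


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_length=MIN_ORF_LENGTH):
    """Return (strand, frame, orf_sequence) tuples for ORFs >= min_length bp."""
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_length:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs
```

Because splicing is ignored, a scan of this kind will split intron-containing eukaryotic genes into fragments, which is the limitation noted above.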
We used blastp to compare all predicted protein-coding
sequences with all protein-coding sequences within all
other genomes. We used bit scores as a degree of similarity
for further analysis. We chose bit scores over other
measures of sequence similarity due to several important
properties. Bit scores do not depend on the database size
and represent a level of alignment similarity that considers
alignment size. We normalized the bit score for each hit by
the maximum bit score the query sequence
could receive if an identical match were found. We only
used predicted protein-coding sequences that had hits in 3
or more genomes.
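The normalization and the "hits in 3 or more genomes" filter can be sketched as follows. The dictionary layout, the function names, and the choice to fill genomes without a hit with 0.0 are assumptions for illustration; the text does not say how missing hits were encoded. The maximum attainable bit score is taken here to be the score of the query aligned against itself.

```python
def normalized_scores(best_hits, self_score):
    """Divide each best-hit bit score by the query's self-alignment bit score.

    best_hits:  {genome_name: raw bit score of the best hit in that genome}
    self_score: bit score of the query against an identical sequence
    """
    return {genome: score / self_score for genome, score in best_hits.items()}


def feature_vector(best_hits, self_score, genomes, min_genomes=3):
    """Return the array of normalized scores, or None if the query is filtered out.

    Queries with hits in fewer than `min_genomes` genomes are dropped.
    Genomes without a hit get 0.0 (an assumption, see the lead-in).
    """
    if len(best_hits) < min_genomes:
        return None
    norm = normalized_scores(best_hits, self_score)
    return [norm.get(genome, 0.0) for genome in genomes]
```

A query with best-hit bit scores of 50, 100 and 25 in three genomes and a self-score of 100 would be represented by the values 0.5, 1.0 and 0.25 for those genomes.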
Consider protein sequence X from a mix of sequences
encoded by genomes S1 and S2. Suppose it has the
normalized score X1 against its best hit (BH) from
genome A, X2 against its BH from genome B, etc., up to
Xn, where n = 59 (61 genomes minus genomes S1 and S2).
In each in silico metagenomic experiment, logistic
regression analysis optimizes the following prediction
function:
$$\ln\left(\frac{Y}{1-Y}\right) = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots + b_n X_n$$
Here, Y (with values ranging from 0 to 1) is the
estimated probability that the gene belongs to a given
cluster (S1 or S2). This prediction function is optimized on
a training set of sequences (a random 25% subsample) and
verified on the remaining set of sequences. R2 values were
used to estimate if the regression explains the variance in
the data and is capable of separating sequences from
genomes S1 and S2.
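To make the prediction function and the 25% training split concrete, here is a small sketch. It only inverts the logit to recover Y and draws a random subsample; the coefficients a and b1..bn would come from the fitting step, which is not reproduced here. The function names and the fixed random seed are illustrative assumptions.

```python
import math
import random


def predict_probability(a, b, x):
    """Solve ln(Y / (1 - Y)) = a + b1*X1 + ... + bn*Xn for Y."""
    linear = a + sum(bi * xi for bi, xi in zip(b, x))
    # Two algebraically equivalent forms, chosen to avoid overflow in exp().
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    return math.exp(linear) / (1.0 + math.exp(linear))


def split_train_validation(sequences, train_fraction=0.25, seed=0):
    """Randomly split sequences into a training subsample and the remainder."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]
```

A sequence would then be assigned to one cluster or the other depending on which side of a probability threshold its Y falls; the threshold itself is not specified in the text.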
We used the common McFadden's R2, defined as
$R^2_{McF} = 1 - \ln(L_M) / \ln(L_0)$, where $L_0$ is the value of the
likelihood function for a model with no predictors and $L_M$
is the likelihood for the model being estimated. We fitted the
logistic regression with the glm function in R. Note that here
and in further analysis we only analyze those genes that
have hits in at least 3 other genomes, which indicates at
least some degree of evolutionary conservation.
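McFadden's R2 can be computed directly from class labels and predicted probabilities. The sketch below is a Python illustration rather than the R code used in the study; it takes the model with no predictors to be an intercept-only model, whose maximum-likelihood prediction is the fraction of sequences labeled 1. The function names are hypothetical.

```python
import math


def log_likelihood(labels, probabilities):
    """Bernoulli log-likelihood; labels are 0/1, probabilities strictly in (0, 1)."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probabilities)
    )


def mcfadden_r2(labels, probabilities):
    """McFadden's R2 = 1 - ln(L_M) / ln(L_0).

    Assumes both classes are present, so the null probability is not 0 or 1.
    """
    ll_model = log_likelihood(labels, probabilities)
    null_p = sum(labels) / len(labels)  # intercept-only prediction
    ll_null = log_likelihood(labels, [null_p] * len(labels))
    return 1.0 - ll_model / ll_null
```

A model that predicts the class frequency for every sequence scores 0, and the value approaches 1 as the predicted probabilities approach the true labels.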
We selected a cut-off of R2 = 0.73 as the criterion for
the successful separation of genes from two genomes. In
most cases, this corresponds to > 90% true positive
predictions. If the separation is successful, the genes placed
in the wrong genomic cluster are considered outsider
genes. Their presence in the wrong genomic cluster could
indicate that they underwent HGT or are present in a
genome assembly because of contamination.
“Core” sequences within each genome are defined as
sequences that were found in the correct cluster in all cases
when the regression model was successful at separation.
“Outsider once” sequences are defined as sequences that
were found in the correct cluster in all but one case when
the regression model was successful at separation. “Outsider
multiple” sequences are defined as sequences that were
found in the wrong cluster in at least two cases when the
regression model was successful at separation.
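The three categories can be expressed as a short classification rule. This sketch reads "outsider multiple" as misplaced in two or more validated experiments, which matches the Results ("found as outsiders more than once"). The input layout and names are assumptions for illustration.

```python
R2_CUTOFF = 0.73  # separation is considered successful above this value


def classify_sequence(experiments, r2_cutoff=R2_CUTOFF):
    """Classify one sequence from its placements across experiments.

    experiments: list of (r2, in_correct_cluster) pairs, one per in silico
    metagenomic experiment involving the sequence's genome.
    Only experiments with R2 above the cut-off are taken into account.
    """
    validated = [correct for r2, correct in experiments if r2 > r2_cutoff]
    misplaced = sum(1 for correct in validated if not correct)
    if misplaced == 0:
        return "core"
    if misplaced == 1:
        return "outsider once"
    return "outsider multiple"
```

Note that a misplacement in an experiment below the cut-off does not count against a sequence, since that model did not separate the two genomes reliably in the first place.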
Each protein-coding sequence with hits in 3 or more
genomes was analyzed using PFAM. For each genome, we
compared the number of genes with each discovered PFAM
domain between the core and outsider multiple sequences.
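The text does not name the statistical test used for this comparison. One plausible choice, shown here purely as an assumption, is a one-sided Fisher's exact (hypergeometric) test on the counts of sequences with and without a given domain in the two sets. The function name and argument layout are hypothetical.

```python
from math import comb


def enrichment_p_value(k_outsider, n_outsider, k_core, n_core):
    """One-sided hypergeometric p-value for over-representation among outsiders.

    k_outsider / n_outsider: sequences with the domain / all sequences
                             in the "outsider multiple" set
    k_core / n_core:         the same counts for the core set
    Returns P(X >= k_outsider) when n_outsider sequences are drawn at random
    from the pooled set.
    """
    total = n_outsider + n_core
    with_domain = k_outsider + k_core
    upper = min(with_domain, n_outsider)
    favourable = sum(
        comb(with_domain, k) * comb(total - with_domain, n_outsider - k)
        for k in range(k_outsider, upper + 1)
    )
    return favourable / comb(total, n_outsider)
```

Since many domains are tested per genome, any such p-values would also need a multiple-testing correction before being called significant.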
3. Results and discussion
We performed 61*60/2 = 1830 in silico metagenomic
experiments involving a total of 296630 predicted
protein-coding sequences, each of which had a hit in at least 3
genomes. A total of 67% of the obtained regression models
were successfully validated and thus allowed the distinction
between the sequences of two genomes.
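The count of experiments follows from pairing every genome with every other genome once. A brief sketch of this enumeration, with placeholder genome names that are purely hypothetical:

```python
from itertools import combinations

N_GENOMES = 61
# Placeholder identifiers; the real genome names are not listed in the text.
genomes = [f"genome_{i}" for i in range(N_GENOMES)]

# One in silico metagenomic experiment per unordered pair (S1, S2).
experiments = list(combinations(genomes, 2))


def reference_genomes(s1, s2):
    """Genomes that supply the n = 59 predictors for the pair (S1, S2)."""
    return [g for g in genomes if g not in (s1, s2)]
```

Each pair leaves 59 remaining genomes, which is where n = 59 in the prediction function comes from.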
The majority of sequences (77% on average) from each
analyzed genome were identified as core sequences in all
metagenomic experiments with validated regression models.
Around 7% of the sequences were in the core in all but one
metagenomic experiment. The remaining 16% of sequences
were found as outsiders more than once; these are the
most likely candidates for HGT.
We can now identify “definite core” genes within a
genome as those genes that always appear inside the correct
genomic cluster when a clear separation between genomic
clusters (with R2 > 0.73) was possible. The remaining
genes are outsiders.
We used a PFAM analysis to check whether any PFAM
domains are significantly overrepresented in “outsider
multiple” sets of sequences. Below we present some of the
most interesting findings.
3.1. Ankyrins are acquired via HGT by
endosymbiotic bacteria
Recently Jernigan and Bordenstein [16] showed that the
lifestyle of bacteria, rather than phylogenetic history, is a
predictor of ANK repeat abundance. They also showed that
phylogenetically unrelated organisms that forge facultative
and obligate symbioses with eukaryotes show enrichment
for ANK repeats in comparison to free-living bacteria. This
observation was especially strong for obligate intracellular
bacteria. Ankyrin domains are very common in eukaryotes,
but much rarer in bacteria, with the exception of parasites
and symbionts. Siozios et al. [17] concluded that
ankyrin genes are likely to be horizontally
transferred between strains with the aid of bacteriophages.
Al-Khodor et al. [18] also suggest that prokaryotic genes
encoding ANK-containing proteins have been acquired
from eukaryotes by horizontal gene transfer. This finding
was reproduced in our study. Our analysis suggests that
both species of the endosymbiotic bacteria Wolbachia included
in our study have acquired ANK-containing proteins via
HGT.
3.2. Thaumatins are acquired via HGT by nematode
species
Thaumatins are a group of defensive proteins produced
by plants in response to fungal infections, some of which
are characterized as “sweet” by humans. The
presence of these proteins has been reported in the desert locust
Schistocerca gregaria, a related species Locusta migratoria
and in the nematode Caenorhabditis but not in Drosophila
and other insects [19]. According to InterPro, among
Metazoa thaumatins can be found only in a few
roundworms, arthropods and mollusks (IPR001938). This
observation is in line with our analysis, which suggests that
thaumatin-domain containing proteins were acquired via
HGT by nematode species.
3.3. Possible tubulin acquisition by Leishmania
mexicana
The finding that some sequences encoding tubulins in
Leishmania mexicana could have been obtained via HGT
(possibly from their hosts) is intriguing, especially
considering two observations. One is that tubulins can be
targets of antileishmanial drugs [20] and the other is that
tubulin gene arrays have undergone genomic restructuring
in Leishmania during their evolution [21].
3.4. Certain actins are acquired via HGT by the
carnivorous plant Genlisea aurea
Genlisea aurea is a carnivorous plant with an unusually
compact genome [22]. Our analysis revealed a significant
overrepresentation of actin-domain containing sequences in
this plant among the “outsider multiple” sequences, the
candidates for HGT. These plants produce subsoil traps of
foliar origin for catching small soil protozoa and metazoa
[23]. This makes Genlisea aurea an interesting subject for
the study of potential HGT because it had access to
digested metazoan and protozoan genetic material during
its evolutionary history.
3.5. Large-scale HGT detected in Archaeon Loki
Archaeon Loki was reported as a complex archaeon that
bridges the gap between prokaryotes and eukaryotes [24].
Sequences encoding six different PFAM domains were
overrepresented among the outsiders of Archaeon Loki: Arf,
Gtr1_RagA, LRR_4, LRR_8, Ras and Roc. According to
InterPro (IPR006762), GTR1/RagA proteins are
widespread in eukaryotes, especially in Opisthokonta, but
can be found in only a few eubacterial species and only one
archaeon: Archaeon Loki. A similar phylogenetic pattern is
observed for the small GTPase superfamily, ARF/SAR type
(IPR006689), except that it is present in several groups of
Archaea. Leucine-rich repeats 4 (IPR025875) and 8
(IPR001611) are common in bacteria and eukaryotes but
are rarely found in Archaea. Small GTPase Ras type
proteins (IPR020849) are specific to eukaryotes with the
exception of several bacterial species, and the Roc domain
(IPR020859) is also more common in eukaryotes than in
other domains of life. All this is consistent with the
hypothesis that Archaeon Loki acquired proteins with the
mentioned domains via HGT. Since certain
sequences, such as those of small GTPases, are commonly
used in the phylogenetic reconstruction of eukaryotic
origins, the possibility of HGT should be considered when
drawing conclusions on phylogenetic similarity.
4. Conclusions
In silico metagenomic experiments are a useful tool to
discover candidate sequences with unusual phylogenetic
histories. Our findings suggest that HGT is not uncommon in
both prokaryotic and eukaryotic genomes. One important
limitation of our approach is that exhaustive calculations
are involved, which restricts the number of
genomes we can practically use. Due to this limitation,
some cases of HGT could remain unidentified. In addition, we
limited this report to the most interesting cases found during
our analysis.
References
[1] Ochman H, Lawrence JG, Groisman EA: Lateral gene
transfer and the nature of bacterial innovation. Nature 2000,
405(6784):299-304.
[2] Soucy SM, Huang J, Gogarten JP: Horizontal gene
transfer: building the web of life. Nat Rev Genet 2015,
16(8):472-482.
[3] Chilton MD, Drummond MH, Merlo DJ, Sciaky D,
Montoya AL, Gordon MP, Nester EW: Stable incorporation of
plasmid DNA into higher plant cells: the molecular basis of crown
gall tumorigenesis. Cell 1977, 11(2):263-271.
[4] Kyndt T, Quispe D, Zhai H, Jarret R, Ghislain M, Liu Q,
Gheysen G, Kreuze JF: The genome of cultivated sweet potato
contains Agrobacterium T-DNAs with expressed genes: An
example of a naturally transgenic food crop. Proc Natl Acad Sci
U S A 2015, 112(18):5844-5849.
[5] Matveeva TV, Bogomaz DI, Pavlova OA, Nester EW,
Lutova LA: Horizontal gene transfer from genus Agrobacterium
to the plant Linaria in nature. Mol Plant Microbe Interact 2012,
25(12):1542-1551.
[6] Matveeva TV, Lutova LA: Horizontal gene transfer from
Agrobacterium to plants. Front Plant Sci 2014, 5:326.
[7] Riley DR, Sieber KB, Robinson KM, White JR, Ganesan
A, Nourbakhsh S, Dunning Hotopp JC: Bacteria-human somatic
cell lateral gene transfer is enriched in cancer samples. PLoS
Comput Biol 2013, 9(6):e1003107.
[8] Nikoh N, Tanaka K, Shibata F, Kondo N, Hizume M,
Shimada M, Fukatsu T: Wolbachia genome integrated in an insect
chromosome: evolution and fate of laterally transferred
endosymbiont genes. Genome Res 2008, 18(2):272-280.
[9] Mi S, Lee X, Li X, Veldman GM, Finnerty H, Racie L,
LaVallie E, Tang XY, Edouard P, Howes S et al: Syncytin is a
captive retroviral envelope protein involved in human placental
morphogenesis. Nature 2000, 403(6771):785-789.
[10] Yoshida S, Maruyama S, Nozaki H, Shirasu K:
Horizontal gene transfer by the parasitic plant Striga
hermonthica. Science 2010, 328(5982):1128.
[11] Xi Z, Bradley RK, Wurdack KJ, Wong K, Sugumaran M,
Bomblies K, Rest JS, Davis CC: Horizontal transfer of expressed
genes in a parasitic flowering plant. BMC Genomics 2012,
13:227.
[12] Zhang Y, Fernandez-Aparicio M, Wafula EK, Das M,
Jiao Y, Wickett NJ, Honaas LA, Ralph PE, Wojciechowski MF,
Timko MP et al: Evolution of a horizontally acquired legume
gene, albumin 1, in the parasitic plant Phelipanche aegyptiaca
and related species. BMC Evol Biol 2013, 13:48.
[13] Zhang D, Qi J, Yue J, Huang J, Sun T, Li S, Wen JF,
Hettenhausen C, Wu J, Wang L et al: Root parasitic plant
Orobanche aegyptiaca and shoot parasitic plant Cuscuta
australis obtained Brassicaceae-specific strictosidine
synthase-like genes by horizontal gene transfer. BMC Plant Biol 2014,
14:19.
[14] Crisp A, Boschetti C, Perry M, Tunnacliffe A, Micklem
G: Expression of multiple horizontally acquired genes is a
hallmark of both vertebrate and invertebrate genomes. Genome
Biol 2015, 16:50.
[15] Ku C, Nelson-Sathi S, Roettger M, Sousa FL, Lockhart
PJ, Bryant D, Hazkani-Covo E, McInerney JO, Landan G, Martin
WF: Endosymbiotic origin and differential loss of eukaryotic
genes. Nature 2015, 524(7566):427-432.
[16] Jernigan KK, Bordenstein SR: Ankyrin domains across
the Tree of Life. PeerJ 2014, 2:e264.
[17] Siozios S, Ioannidis P, Klasson L, Andersson SG, Braig
HR, Bourtzis K: The diversity and evolution of Wolbachia ankyrin
repeat domain genes. PLoS One 2013, 8(2):e55390.
[18] Al-Khodor S, Price CT, Kalia A, Abu Kwaik Y:
Functional diversity of ankyrin repeats in microbial proteins.
Trends Microbiol 2010, 18(3):132-139.
[19] Brandazza A, Angeli S, Tegoni M, Cambillau C, Pelosi
P: Plant stress proteins of the thaumatin-like family discovered in
animals. FEBS Lett 2004, 572(1-3):3-7.
[20] Havens CG, Bryant N, Asher L, Lamoreaux L, Perfetto
S, Brendle JJ, Werbovetz KA: Cellular effects of leishmanial
tubulin inhibitors on L. donovani. Mol Biochem Parasitol 2000,
110(2):223-236.
[21] Jackson AP, Vaughan S, Gull K: Evolution of tubulin
gene arrays in Trypanosomatid parasites: genomic restructuring
in Leishmania. BMC Genomics 2006, 7:261.
[22] Leushkin EV, Sutormin RA, Nabieva ER, Penin AA,
Kondrashov AS, Logacheva MD: The miniature genome of a
carnivorous plant Genlisea aurea contains a low number of genes
and short non-coding sequences. BMC Genomics 2013, 14:476.
[23] Plachno BJ, Kozieradzka-Kiszkurno M, Swiatek P:
Functional ultrastructure of Genlisea (Lentibulariaceae) digestive
hairs. Ann Bot 2007, 100(2):195-203.
[24] Spang A, Saw JH, Jorgensen SL, Zaremba-Niedzwiedzka
K, Martijn J, Lind AE, van Eijk R, Schleper C, Guy L, Ettema TJ:
Complex archaea that bridge the gap between prokaryotes and
eukaryotes. Nature 2015, 521(7551):173-179.