Automated analysis detects cases of possible horizontal gene transfer in different domains of life

Alexander Y. Panchin
Institute for Information Transmission Problems, RAS
alexpanchin@yahoo.com

Yuri V. Panchin
Institute for Information Transmission Problems, RAS
[email protected]

Alexander I. Tuzhikov
Institute for Information Transmission Problems, RAS
alexander.tuzhikov@gmail.com

Abstract

We developed a novel approach to identify horizontal gene transfer (HGT) events in genomic sequences. Predicted protein-coding sequences from pairs of genomes are mixed together, and logistic regression models are trained on a subsample of these sequences to separate them. The models use arrays of normalized BLAST bit scores obtained for each sequence by comparison with its closest hit from a number of other genomes, covering different domains of life. Regression models that pass validation on independent subsamples are used to identify sequences that cluster with sequences from the other genome (predicted HGT events). Using the genomes of 61 species, we confirmed some previously reported cases of HGT (such as the transfer of ankyrin-encoding sequences from eukaryotes to their symbiotic bacteria or the acquisition of thaumatins by the roundworm C. elegans) and identified novel cases of HGT (such as the acquisition of actin-coding sequences by the carnivorous plant Genlisea aurea).

1. Introduction

Genes within a single genome may have different phylogenetic histories: some are passed down vertically from the direct ancestors of an organism, while others may have been acquired via horizontal gene transfer (HGT) from other species. HGT is widely accepted as an important factor in the evolution of prokaryotic organisms [1]. A number of HGT events between prokaryotes and eukaryotes, and even between different eukaryotic species, have been described, and in some cases the molecular mechanisms that facilitate such events have been uncovered [2].
For example, agrobacteria are widely used in modern biotechnology because of their natural ability to transfer some of their genes into plant cells [3]. Therefore, it is not surprising that "naturally transgenic" plants can be found, such as the sweet potato Ipomoea batatas, which carries actively transcribed genes of Agrobacterium spp. [4]. Notably, bacterial genes were found in all studied cultivated (but not wild-type) species of Ipomoea batatas, suggesting that lateral gene transfer may have played an important evolutionary role during the domestication of this crop. Similar cases of HGT have been reported for species within the genera Nicotiana and Linaria [5, 6].

Gene transfer is possible from bacteria to other eukaryotes, not just plants. It has not yet been confirmed that the human germline can acquire genes of bacterial origin; however, the transfer of genes from bacteria to human somatic cells has been reported and even linked to cancer [7]. Another interesting case is the acquisition of genes by arthropods from bacterial endosymbionts. For example, the genome of the beetle Callosobruchus chinensis contains around 30% of the Wolbachia genome [8].

Viruses provide a well-established mechanism of HGT. The acquisition of a particular viral gene played an important role in the evolution of mammals: the protein syncytin is a captive retroviral envelope protein involved in placental morphogenesis [9]. There is some evidence that HGT can occur between different eukaryotic species, for example between certain parasitic plants and their hosts [10-13]. These and other examples have been reviewed elsewhere [2].

There is ongoing controversy about the extent to which HGT plays a role in eukaryotic evolution. Crisp et al. found numerous examples of HGT in different eukaryotic species, including humans [14].
However, a study by Ku et al. concluded that even gene transfer from bacteria to eukaryotes is episodic and coincides with major evolutionary transitions such as the origin of chloroplasts and mitochondria [15]. It should be noted that Ku et al. analyzed eukaryotic gene families with prokaryotic homologs and used clustering and phylogenetic analysis, whereas Crisp et al. used differences in BLAST bit scores between the prokaryotic and non-prokaryotic best hits of eukaryotic genes as a criterion for identifying probable HGT events. Thus, these studies differ considerably in design, including in their criteria for HGT, as there is no apparent consensus on what evidence is sufficient to confirm or rule out the horizontal transfer of a gene in eukaryotic species.

Here we present a novel approach for the identification of HGT events in genomic sequences. The approach involves in silico metagenomic experiments: a process in which a model is trained (and subsequently validated) to sort genes from two different genomes back into two piles. Within the model, each gene is represented by an array of values that correspond to its normalized degrees of similarity with its best hits within each genome from a larger set of genomes, covering a wide section of the tree of life.

2. Methods

We used publicly available genomic data for 61 different genomes. Since most of these genomes were not well annotated, we scanned each genome for open reading frames with a minimum length of 150 bp to define putative protein-coding regions. These regions may or may not correspond to actual genes, because splicing was not taken into account. We did not use existing annotations, so that well-annotated and poorly annotated genomes were handled in the same way and no bias arising from the different annotation processes used by different authors could influence our results. We used blastp to compare all predicted protein-coding sequences with all protein-coding sequences within all other genomes.
We used bit scores as the measure of similarity for further analysis. We chose bit scores over other measures of sequence similarity because of two important properties: bit scores do not depend on the database size, and they represent a level of alignment similarity that takes alignment length into account. We normalized the bit score of each hit by the maximum bit score the query sequence could receive if an identical match were found. We only used predicted protein-coding sequences that had hits in 3 or more genomes.

Consider a protein sequence X from a mix of sequences encoded by genomes S1 and S2. Suppose it has the normalized score $X_1$ against its best hit (BH) from genome A, $X_2$ against its BH from genome B, etc., up to $X_n$, where n = 59 (61 genomes minus genomes S1 and S2). In each in silico metagenomic experiment, logistic regression analysis optimizes the following prediction function:

$\ln(Y/(1-Y)) = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots + b_n X_n$

Here, Y (with values ranging from 0 to 1) is the estimated probability that the gene belongs to a given cluster (S1 or S2). This prediction function is optimized on a training set of sequences (a random 25% subsample) and verified on the remaining sequences. $R^2$ values were used to estimate whether the regression explains the variance in the data and is capable of separating sequences from genomes S1 and S2. We used the common McFadden's $R^2$, defined as

$R^2_{McF} = 1 - \ln(L_M) / \ln(L_0)$,

where $L_0$ is the value of the likelihood function for a model with no predictors and $L_M$ is the likelihood for the model being estimated. We fit the logistic regression with the glm function in R. Note that here and in further analysis we only analyze those genes that have hits in at least 3 other genomes, which indicates at least some degree of evolutionary conservation. We selected a cut-off of $R^2$ = 0.73 as the criterion for the successful separation of genes from two genomes. In most cases, this corresponds to more than 90% true positive predictions.
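The quantities defined above can be illustrated with a minimal Python sketch. It is not the authors' pipeline (the paper fits the model with glm in R); the function names, the choice of the self-alignment score as the "identical match" score, and the use of the label base rate as the no-predictor model are assumptions made for illustration only.

```python
import math


def normalize_bit_score(hit_bits, self_bits):
    """Scale a hit's bit score by the score of an identical match.

    Assumption: self_bits is the bit score of the query aligned to itself.
    """
    if self_bits <= 0:
        raise ValueError("self_bits must be positive")
    return hit_bits / self_bits


def predict(intercept, coefs, features):
    """Logistic model: ln(Y / (1 - Y)) = a + b1*X1 + ... + bn*Xn."""
    z = intercept + sum(b * x for b, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(labels, probs, eps=1e-12):
    """Bernoulli log-likelihood of 0/1 labels given predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def mcfadden_r2(labels, probs):
    """McFadden's R^2 = 1 - ln(L_M) / ln(L_0).

    The no-predictor model L_0 assigns every sequence the base rate of label 1.
    """
    base_rate = sum(labels) / len(labels)
    ll_null = log_likelihood(labels, [base_rate] * len(labels))
    ll_model = log_likelihood(labels, probs)
    return 1.0 - ll_model / ll_null
```

With labels of 1 for genome S1 and 0 for genome S2, a model that reproduces only the base rate gives a value of 0, while confident and correct predictions push the value toward 1, which is the behaviour the 0.73 cut-off relies on.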
If the separation is successful, the genes placed in the wrong genomic cluster are considered outsider genes. Their presence in the wrong genomic cluster could indicate that they underwent HGT or are present in a genome assembly because of contamination. "Core" sequences within each genome are defined as sequences that were found in the correct cluster in all cases when the regression model was successful at separation. "Outsider once" sequences are defined as sequences that were found in the correct cluster in all but one case when the regression model was successful at separation. "Outsider multiple" sequences are defined as sequences that were found in the wrong cluster in at least two cases when the regression model was successful at separation.

Each protein-coding sequence with hits in 3 or more genomes was analyzed using PFAM. For each genome, we compared the number of genes with each discovered PFAM domain between the core and outsider multiple sequences.

3. Results and discussion

We performed 61*60/2 = 1830 in silico metagenomic experiments involving a total of 296630 predicted protein-coding sequences, each of which had a hit in at least 3 genomes. A total of 67% of the obtained regression models were successfully validated and thus allowed the sequences of the two genomes to be distinguished. The majority of sequences (77% on average) from each analyzed genome were identified as core sequences in all metagenomic experiments with validated regression models. Around 7% of the sequences were in the core in all but one metagenomic experiment. The remaining 16% of sequences were found as outsiders more than once; these are the most likely candidates for HGT. We can now identify "definite core" genes within a genome as those genes that always appear inside the correct genomic cluster when a clear separation between genomic clusters (with $R^2$ > 0.73) was possible. The remaining genes are outsiders.
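The classification into core and outsider sequences can be sketched as follows. This is a hypothetical illustration, not the authors' code: it reads "outsider multiple" as a sequence misplaced at least twice (consistent with the statement that such sequences were found as outsiders more than once), and the input layout, the function names and the "unclassified" label for sequences with no validated experiment are assumptions.

```python
from itertools import combinations

R2_CUTOFF = 0.73  # separation counts as successful above this value


def count_genome_pairs(n_genomes):
    """Number of pairwise experiments, n * (n - 1) / 2."""
    return sum(1 for _ in combinations(range(n_genomes), 2))


def classify_sequence(experiments):
    """Classify one sequence from its pairwise experiments.

    experiments: list of (r2, in_correct_cluster) tuples, one per
    experiment involving the sequence's genome (hypothetical layout).
    Only experiments with r2 above the cut-off are considered.
    """
    validated = [ok for r2, ok in experiments if r2 > R2_CUTOFF]
    if not validated:
        return "unclassified"  # assumption: no validated model to judge by
    misplaced = sum(1 for ok in validated if not ok)
    if misplaced == 0:
        return "core"
    if misplaced == 1:
        return "outsider once"
    return "outsider multiple"
```

For example, `classify_sequence([(0.9, True), (0.8, False), (0.5, False)])` returns "outsider once", because the third experiment falls below the cut-off and is ignored.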
We used a PFAM analysis to check whether any PFAM domains are significantly overrepresented in the "outsider multiple" sets of sequences. Below we present some of the most interesting findings.

3.1. Ankyrins are acquired via HGT by endosymbiotic bacteria

Recently, Jernigan and Bordenstein [16] showed that the lifestyle of bacteria, rather than their phylogenetic history, is a predictor of ANK repeat abundance. They also showed that phylogenetically unrelated organisms that forge facultative and obligate symbioses with eukaryotes show enrichment for ANK repeats in comparison to free-living bacteria. This observation was especially strong for obligate intracellular bacteria. Ankyrin domains are very common in eukaryotes but much rarer in bacteria, with the exception of parasites and symbionts. Siozios et al. [17] concluded that ankyrin genes are likely to be horizontally transferred between strains with the aid of bacteriophages. Al-Khodor et al. [18] also suggest that prokaryotic genes encoding ANK-containing proteins have been acquired from eukaryotes by horizontal gene transfer. This finding was reproduced in our study: our analysis suggests that both species of the endosymbiotic bacterium Wolbachia included in our study have acquired ANK-containing proteins via HGT.

3.2. Thaumatins are acquired via HGT by nematode species

Thaumatins are a group of defensive proteins produced by plants in response to fungal infections, some of which are characterized as "sweet" by humans. Recently, the presence of these proteins was reported in the desert locust Schistocerca gregaria, a related species Locusta migratoria, and the nematode Caenorhabditis, but not in Drosophila and other insects [19]. According to InterPro, among Metazoa thaumatins can be found only in a few roundworms, arthropods and mollusks (IPR001938). This observation is in line with our analysis, which suggests that thaumatin-domain-containing proteins were acquired via HGT by nematode species.

3.3. Possible tubulin acquisition by Leishmania mexicana

The finding that some sequences encoding tubulins in Leishmania mexicana could have been obtained via HGT (possibly from their hosts) is intriguing, especially considering two observations. One is that tubulins can be targets of antileishmanial drugs [20], and the other is that tubulin gene arrays have undergone genomic restructuring in Leishmania during their evolution [21].

3.4. Certain actins are acquired via HGT by the carnivorous plant Genlisea aurea

Genlisea aurea is a carnivorous plant with an unusually compact genome [22]. Our analysis revealed a significant overrepresentation of actin-domain-containing sequences among the "outsider multiple" sequences of this plant, which are candidates for HGT. These plants produce subsoil traps of foliar origin for catching small soil protozoa and metazoa [23]. This makes Genlisea aurea an interesting subject for the study of potential HGT, because it had access to digested metazoan and protozoan genetic material during its evolutionary history.

3.5. Large-scale HGT detected in Archaeon Loki

Archaeon Loki was reported as a complex archaeon that bridges the gap between prokaryotes and eukaryotes [24]. Sequences encoding six different PFAM domains were overrepresented among the outsider sequences of Archaeon Loki: Arf, Gtr1_RagA, LRR_4, LRR_8, Ras and Roc. According to InterPro (IPR006762), GTR1/RagA proteins are widespread in eukaryotes, especially in Opisthokonta, but can be found in a few eubacterial species and in only one archaeon: Archaeon Loki. A similar phylogenetic pattern is observed for the small GTPase superfamily, ARF/SAR type (IPR006689), except that it is present in several groups of Archaea. Leucine-rich repeats 4 (IPR025875) and 8 (IPR001611) are common in bacteria and eukaryotes but are rarely found in Archaea.
Small GTPase Ras-type proteins (IPR020849) are specific to eukaryotes with the exception of several bacterial species, and the Roc domain (IPR020859) is also more common in eukaryotes than in other domains of life. All this is consistent with the hypothesis that Archaeon Loki acquired proteins with the mentioned domains via HGT. Since certain sequences, such as those of small GTPases, are commonly used in the phylogenetic reconstruction of eukaryotic origins, the possibility of HGT should be considered when drawing conclusions from phylogenetic similarity.

4. Conclusions

In silico metagenomic experiments are a useful tool for discovering candidate sequences with unusual phylogenetic histories. Our findings suggest that HGT is not uncommon in both prokaryotic and eukaryotic genomes. One important limitation of our approach is that it involves exhaustive calculations, which restricts the number of genomes we can practically use. Because of this limitation, some cases of HGT could remain unidentified. We also limited this report to the most interesting cases found during our analysis.

References

[1] Ochman H, Lawrence JG, Groisman EA: Lateral gene transfer and the nature of bacterial innovation. Nature 2000, 405(6784):299-304.
[2] Soucy SM, Huang J, Gogarten JP: Horizontal gene transfer: building the web of life. Nat Rev Genet 2015, 16(8):472-482.
[3] Chilton MD, Drummond MH, Merlo DJ, Sciaky D, Montoya AL, Gordon MP, Nester EW: Stable incorporation of plasmid DNA into higher plant cells: the molecular basis of crown gall tumorigenesis. Cell 1977, 11(2):263-271.
[4] Kyndt T, Quispe D, Zhai H, Jarret R, Ghislain M, Liu Q, Gheysen G, Kreuze JF: The genome of cultivated sweet potato contains Agrobacterium T-DNAs with expressed genes: An example of a naturally transgenic food crop. Proc Natl Acad Sci U S A 2015, 112(18):5844-5849.
[5] Matveeva TV, Bogomaz DI, Pavlova OA, Nester EW, Lutova LA: Horizontal gene transfer from genus Agrobacterium to the plant Linaria in nature. Mol Plant Microbe Interact 2012, 25(12):1542-1551.
[6] Matveeva TV, Lutova LA: Horizontal gene transfer from Agrobacterium to plants. Front Plant Sci 2014, 5:326.
[7] Riley DR, Sieber KB, Robinson KM, White JR, Ganesan A, Nourbakhsh S, Dunning Hotopp JC: Bacteria-human somatic cell lateral gene transfer is enriched in cancer samples. PLoS Comput Biol 2013, 9(6):e1003107.
[8] Nikoh N, Tanaka K, Shibata F, Kondo N, Hizume M, Shimada M, Fukatsu T: Wolbachia genome integrated in an insect chromosome: evolution and fate of laterally transferred endosymbiont genes. Genome Res 2008, 18(2):272-280.
[9] Mi S, Lee X, Li X, Veldman GM, Finnerty H, Racie L, LaVallie E, Tang XY, Edouard P, Howes S et al: Syncytin is a captive retroviral envelope protein involved in human placental morphogenesis. Nature 2000, 403(6771):785-789.
[10] Yoshida S, Maruyama S, Nozaki H, Shirasu K: Horizontal gene transfer by the parasitic plant Striga hermonthica. Science 2010, 328(5982):1128.
[11] Xi Z, Bradley RK, Wurdack KJ, Wong K, Sugumaran M, Bomblies K, Rest JS, Davis CC: Horizontal transfer of expressed genes in a parasitic flowering plant. BMC Genomics 2012, 13:227.
[12] Zhang Y, Fernandez-Aparicio M, Wafula EK, Das M, Jiao Y, Wickett NJ, Honaas LA, Ralph PE, Wojciechowski MF, Timko MP et al: Evolution of a horizontally acquired legume gene, albumin 1, in the parasitic plant Phelipanche aegyptiaca and related species. BMC Evol Biol 2013, 13:48.
[13] Zhang D, Qi J, Yue J, Huang J, Sun T, Li S, Wen JF, Hettenhausen C, Wu J, Wang L et al: Root parasitic plant Orobanche aegyptiaca and shoot parasitic plant Cuscuta australis obtained Brassicaceae-specific strictosidine synthase-like genes by horizontal gene transfer. BMC Plant Biol 2014, 14:19.
[14] Crisp A, Boschetti C, Perry M, Tunnacliffe A, Micklem G: Expression of multiple horizontally acquired genes is a hallmark of both vertebrate and invertebrate genomes. Genome Biol 2015, 16:50.
[15] Ku C, Nelson-Sathi S, Roettger M, Sousa FL, Lockhart PJ, Bryant D, Hazkani-Covo E, McInerney JO, Landan G, Martin WF: Endosymbiotic origin and differential loss of eukaryotic genes. Nature 2015, 524(7566):427-432.
[16] Jernigan KK, Bordenstein SR: Ankyrin domains across the Tree of Life. PeerJ 2014, 2:e264.
[17] Siozios S, Ioannidis P, Klasson L, Andersson SG, Braig HR, Bourtzis K: The diversity and evolution of Wolbachia ankyrin repeat domain genes. PLoS One 2013, 8(2):e55390.
[18] Al-Khodor S, Price CT, Kalia A, Abu Kwaik Y: Functional diversity of ankyrin repeats in microbial proteins. Trends Microbiol 2010, 18(3):132-139.
[19] Brandazza A, Angeli S, Tegoni M, Cambillau C, Pelosi P: Plant stress proteins of the thaumatin-like family discovered in animals. FEBS Lett 2004, 572(1-3):3-7.
[20] Havens CG, Bryant N, Asher L, Lamoreaux L, Perfetto S, Brendle JJ, Werbovetz KA: Cellular effects of leishmanial tubulin inhibitors on L. donovani. Mol Biochem Parasitol 2000, 110(2):223-236.
[21] Jackson AP, Vaughan S, Gull K: Evolution of tubulin gene arrays in Trypanosomatid parasites: genomic restructuring in Leishmania. BMC Genomics 2006, 7:261.
[22] Leushkin EV, Sutormin RA, Nabieva ER, Penin AA, Kondrashov AS, Logacheva MD: The miniature genome of a carnivorous plant Genlisea aurea contains a low number of genes and short non-coding sequences. BMC Genomics 2013, 14:476.
[23] Plachno BJ, Kozieradzka-Kiszkurno M, Swiatek P: Functional ultrastructure of Genlisea (Lentibulariaceae) digestive hairs. Ann Bot 2007, 100(2):195-203.
[24] Spang A, Saw JH, Jorgensen SL, Zaremba-Niedzwiedzka K, Martijn J, Lind AE, van Eijk R, Schleper C, Guy L, Ettema TJ: Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 2015, 521(7551):173-179.