Origin of metazoan cadherin diversity and the antiquity of the classical cadherin/β-catenin complex

Scott Anthony Nichols (a), Brock William Roberts (b), Daniel Joseph Richter (b), Stephen Robert Fairclough (b), and Nicole King (b,1)

(a) Department of Biological Sciences, University of Denver, Denver, CO 80208; and (b) Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720

The evolution of cadherins, which are essential for metazoan multicellularity and restricted to metazoans and their closest relatives, has special relevance for understanding metazoan origins. To reconstruct the ancestry and evolution of cadherin gene families, we analyzed the genomes of the choanoflagellate Salpingoeca rosetta, the unicellular outgroup of choanoflagellates and metazoans Capsaspora owczarzaki, and a draft genome assembly from the homoscleromorph sponge Oscarella carmela. Our finding of a cadherin gene in C. owczarzaki reveals that cadherins predate the divergence of the C. owczarzaki, choanoflagellate, and metazoan lineages. Data from these analyses also suggest that the last common ancestor of metazoans and choanoflagellates contained representatives of at least three cadherin families, lefftyrin, coherin, and hedgling. Additionally, we find that an O. carmela classical cadherin has predicted structural features that, in bilaterian classical cadherins, facilitate binding to the cytoplasmic protein β-catenin and, thereby, promote cadherin-mediated cell adhesion. In contrast with premetazoan cadherin families (i.e., those conserved between choanoflagellates and metazoans), the later appearance of classical cadherins coincides with metazoan origins.

The cadherin gene family is hypothesized to have had special importance for metazoan origins (1–5). Cadherins are cell-surface receptors that function in cell adhesion, cell polarity, and tissue morphogenesis (6–8).
Moreover, cadherins are found in the genomes of all sequenced metazoans, including diverse bilaterians, cnidarians, and sponges, and are apparently lacking from multicellular lineages such as plants, fungi, and Dictyostelium (9). Although it once seemed likely that cadherins were unique to metazoans, 23 genes encoding the diagnostic extracellular cadherin (EC) domain (10) have since been discovered in the genome of the unicellular choanoflagellate Monosiga brevicollis, one of the closest living relatives of Metazoa (1, 11). Proteins in the cadherin family are characterized by the presence of one or more tandem copies of the EC domain, an ∼100-aa protein domain that mediates adhesion with EC domains in other cadherins (10, 12–14). Cadherins are further assigned to different subfamilies based on the number and arrangement of additional, non-EC protein domains and sequence motifs that refine cadherin function and suggest shared ancestry (2, 3). For example, classical cadherins are distinguished by the presence of a cytoplasmic cadherin domain (CCD) at the C terminus that regulates interactions with the cytoplasmic protein β-catenin (2, 3, 12, 15). When bound to β-catenin, classical cadherins on neighboring cells interact homophilically and, thereby, promote cell-cell adhesion (16). When not bound to β-catenin, classical cadherins are rapidly degraded (17, 18). The regulation of classical cadherin function by β-catenin thereby forms the foundation of adherens junctions and is crucial for cell adhesion in all studied bilaterian tissues, including epithelia, neurons, muscles, and bones (3, 19). The classical cadherins are one of six cadherin families (including fat, dachsous, fat-like, CELSR/flamingo, and protocadherins) that are found in most metazoans. 
In contrast with the cell adhesion functions of classical cadherins, CELSR/flamingo, dachsous, fat, and fat-like cadherins regulate planar cell polarity in organisms as disparate as Drosophila and mouse (20–22). Members of the protocadherin family have diverse functions that include mechanosensation in stereocilia and regulation of nervous system development (23, 24). It is not known whether the bilaterian roles of these cadherin families had already evolved in the last common ancestor of metazoans, and it is not clear how these cadherin families themselves originated.

To date, only one cadherin family, the hedgling family, is inferred to have been present in the last common ancestor of choanoflagellates and metazoans. Hedgling family members are defined by the presence of an N-terminal hedgehog signal domain (Hh-N) and are absent from Bilateria (25, 26). Differences in the cadherin repertoire of choanoflagellates and metazoans have led to the proposal that cadherins in these two lineages may have largely independent histories; that is, one or a few ancestral cadherins may have undergone independent evolutionary radiations in each lineage (2). To reconstruct the evolutionary history of cadherin families before and after the transition to metazoan multicellularity, we have analyzed the diversity of cadherins in the newly sequenced genomes of phylogenetically relevant taxa: the colony-forming choanoflagellate Salpingoeca rosetta, the close choanoflagellate/metazoan outgroup Capsaspora owczarzaki, and the homoscleromorph sponge Oscarella carmela.

Results

Reconstructing the Ancestry of Cadherin Diversity. By searching the S. rosetta genome using BLAST analyses (27) and hidden Markov model (HMM)-based searches (28–30) for the EC domain (Fig. 1), we identified at least 29 predicted cadherin genes (Fig. 1 and SI Appendix, Figs.
S1 and S2), all of which were verified through deep sequencing of the transcriptome (SI Appendix, Table S1). The number of cadherin genes in S. rosetta, like that in M. brevicollis (1), rivals that of most metazoans (Fig. 1), whereas the C. owczarzaki genome assembly was found to contain only a single cadherin gene.

Edited by Masatoshi Takeichi, RIKEN, Kobe, Japan, and approved June 20, 2012 (received for review December 19, 2011). PNAS Early Edition.

Author contributions: S.A.N., B.W.R., and N.K. designed research; S.A.N., B.W.R., and D.J.R. performed research; S.A.N., D.J.R., and S.R.F. contributed new reagents/analytic tools; S.A.N., B.W.R., and S.R.F. analyzed data; and S.A.N. and N.K. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database [accession nos. PRJNA20341 (Capsaspora genome); PRJNA37927S (Salpingoeca genome); EGD72656, EGD73963, EGD74518, EGD74667, EGD74707, EGD74783, EGD75074, EGD75359, EGD75381, EGD75404, EGD75405, EGD75586, EGD75710, EGD76846, EGD77346, EGD78086, EGD78170, GD78171, EGD78831, EGD78839, EGD78969, EGD78970, EGD79002, EGD79017, EGD79249, EGD80879, EGD80917, EGD81200, EGD82245, and EGD82557 (S. rosetta cadherins); EFW44034 (Capsaspora owczarzaki cadherins), JN197609 (Oscarella carmela lefftyrin), AEC12441 (Oscarella carmela cadherin 1), and HQ234356 (Oscarella carmela β-catenin)].

1 To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1120685109/-/DCSupplemental.

[Figure 1: plot of the number of cadherins per genome for Pult, plants, Ddis, fungi, Cowc, Mbre, Sros, Aque, Tadh, Hmag, Nvec, Dmel, Cele, Cint, and Mmus, with Holozoa, choanoflagellates, Porifera (sponge), Placozoa, Cnidaria, and Metazoa labeled, and with functionally characterized cadherins marked. Per-taxon values are not recoverable from the extracted text.]

Fig. 1. Phylogenetic distribution and abundance of cadherins in the genomes of diverse eukaryotes. Once thought to be restricted to metazoans, cadherins are abundant in choanoflagellates and evolved before the divergence of Capsaspora owczarzaki, choanoflagellates, and metazoans (1). EC domains detected in the genome of the oomycete Pythium ultimum likely evolved through convergence or lateral gene transfer (9). The number of cadherin families inferred at ancestral nodes (determined based upon their shared domain composition and organization) is indicated (open circles). The dashed lineage of Trichoplax adhaerens reflects its uncertain phylogenetic placement. *All fungal and plant species represented in the Pfam v24.0 database (29) were analyzed. Aque, A. queenslandica; Cele, Caenorhabditis elegans; Cint, Ciona intestinalis; Cowc, C. owczarzaki; Ddis, Dictyostelium discoideum; Dmel, D. melanogaster; Hmag, Hydra magnipapillata; Mbre, M. brevicollis; Mmus, Mus musculus; Nvec, N. vectensis; Pult, P. ultimum; Sros, S. rosetta; Tadh, T. adhaerens.

To increase the taxonomic breadth of genomes available from early branching metazoan lineages, we also sequenced the genome of the sponge O. carmela by using massively parallel sequencing (Illumina). Although the genome assembly is fragmented relative to traditional Sanger assemblies (SI Appendix), multiple cadherin-domain encoding sequences were detected and two cadherin genes assembled in near entirety (GenBank accession nos. JN197609 and AEC12441).
The value of this draft genome for providing unique insights into cadherin evolution is demonstrated by the fact that one of the two assembled cadherins, JN197609, has homologs in choanoflagellates, despite being absent from the genome of the only other sequenced sponge, Amphimedon queenslandica, which encodes at least 17 cadherins (Fig. 2 and ref. 31).

To reconstruct the evolutionary relationships among cadherins from nonmetazoans and early branching metazoans, we grouped cadherins from C. owczarzaki, choanoflagellates, and sponges according to shared structural features (i.e., domain composition and arrangement). Mapping of the phylogenetic distribution of cadherin families reveals that they have origins that predate the evolution of Metazoa. Although the earliest branching lineage to contain a predicted cadherin (Owcz_Cdh1) is C. owczarzaki, the evolutionary connection between this and cadherin families from choanoflagellates and metazoans is uncertain (Fig. 2A). Owcz_Cdh1 has at least 10 predicted EC domains, two membrane-proximal epidermal growth factor (EGF) domains, and a transmembrane (TM) domain. This domain organization resembles that of cadherins in the choanoflagellates M. brevicollis (accession no. MBCDH14) and S. rosetta (accession nos. EGD82557 and EGD79002) but is not sufficiently complex to definitively indicate that these proteins are orthologous.

In contrast, two cadherin families are clearly shared by choanoflagellates and sponges to the exclusion of all other lineages analyzed in this study. The first, lefftyrins, are defined by the presence of an amino-terminal “LEF” cassette [containing a Laminin N-terminal (Lam-N) domain, four EGF domains, and a Furin domain] and a carboxyl-terminus “FTY” cassette [containing one or two Fibronectin 3 (FN3) domains, a TM domain and a cytoplasmic protein tyrosine phosphatase (PTPase) domain; Fig. 2B]. The M.
brevicollis lefftyrin family member, MBCDH21, also has an N-terminal Laminin G (Lam-G) domain that has prompted previous comparisons with metazoan classical cadherins and fat cadherins (1, 4). Cadherins in the second family, the coherins (Fig. 2C), are united by the presence of at least one cohesin domain (not to be confused with the eukaryotic cohesin protein that regulates sister chromatid separation). The presence of cohesin domains (SI Appendix, Fig. S3) in coherins is diagnostic because they are otherwise found only in bacteria and archaea (32).

Members of the remaining premetazoan family of cadherins, the hedglings (Fig. 2D), are found in choanoflagellates, sponges, and the cnidarian Nematostella vectensis (1, 25, 26), but are absent from C. owczarzaki and bilaterians. Hedglings contain an amino-terminal Hedgehog signal domain (Hh-N; ref. 33) that was thought to be exclusive to the secreted signaling portion of the metazoan-specific Hedgehog protein. The amino-terminal Hh-N domain in all hedglings is adjacent to a von Willebrand factor A (VWA) domain and, with the exception of one M. brevicollis hedgling (accession no. MBCDH3), all hedglings have a carboxyl-terminal cassette with between one and eight extracellular EGF domains positioned proximal to the TM region. Although the first identified choanoflagellate hedgling, MBCDH11 from M. brevicollis, contains additional domains (including TNFR, Furin, and 9-cysteine GPCR), all other choanoflagellate hedglings detected in this study and all known metazoan hedglings lack these domains. Thus, hedgling in the last common ancestor of metazoans more likely resembled hedglings from metazoans (e.g., Aque_hedgling and Nvec_hedgling) and S. rosetta (accession no. EGD79017) than MBCDH11.
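The family definitions given above are rules over the ordered list of domains in each protein, so they can be written down as a small classifier. The following Python sketch is an illustration only and is not the pipeline used in this study: the `Protein` class, the helper `has_run`, the domain labels, and the toy architectures are all hypothetical simplifications, and real domain predictions (for example from Pfam or SMART) would need curation before rules like these could be applied.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# "LEF" cassette as described in the text: Lam-N, four EGF domains, Furin.
LEF_CASSETTE = ("Lam-N", "EGF", "EGF", "EGF", "EGF", "Furin")
# End of the "FTY" cassette: FN3 (one or two copies), TM, PTPase.
FTY_TAIL = ("FN3", "TM", "PTPase")


@dataclass(frozen=True)
class Protein:
    """Hypothetical record: domains listed from N terminus to C terminus."""
    name: str
    domains: Tuple[str, ...]


def has_run(domains: Sequence[str], motif: Sequence[str]) -> bool:
    """Return True if `motif` occurs as a contiguous run within `domains`."""
    n = len(motif)
    return any(tuple(domains[i:i + n]) == tuple(motif)
               for i in range(len(domains) - n + 1))


def classify(protein: Protein) -> str:
    """Assign a family label from domain composition and arrangement."""
    d = protein.domains
    if "EC" not in d:
        return "not a cadherin"        # any EC domain defines a cadherin
    if "Cohesin" in d:
        return "coherin"               # at least one cohesin domain
    if d[:2] == ("Hh-N", "VWA"):
        return "hedgling"              # N-terminal Hh-N adjacent to VWA
    if has_run(d, LEF_CASSETTE) and d[-3:] == FTY_TAIL:
        return "lefftyrin"             # LEF cassette plus C-terminal FTY cassette
    if d[-1] == "CCD":
        return "classical"             # cytoplasmic cadherin domain at the C terminus
    return "unassigned cadherin"


# Toy architectures (illustrative, not real gene models).
examples = [
    Protein("toy_lefftyrin", LEF_CASSETTE + ("EC",) * 3 + ("FN3",) + FTY_TAIL),
    Protein("toy_coherin", ("EC", "Cohesin", "EC", "TM")),
    Protein("toy_hedgling", ("Hh-N", "VWA", "EC", "EC", "EGF", "EGF", "TM")),
    Protein("toy_classical", ("EC",) * 7 + ("EGF", "Lam-G", "EGF", "TM", "CCD")),
    Protein("toy_capsaspora_like", ("EC",) * 10 + ("EGF", "EGF", "TM")),
    Protein("toy_receptor_ptpase", ("FN3", "TM", "PTPase")),
]

if __name__ == "__main__":
    for p in examples:
        print(p.name, "->", classify(p))
```

In this sketch, an architecture with EC domains, membrane-proximal EGF repeats, and a TM domain but no family-specific domain is left unassigned, which mirrors the uncertain placement of the C. owczarzaki cadherin. The LEF cassette is matched anywhere in the protein rather than strictly at the N terminus so that an additional N-terminal domain, such as the Lam-G domain of MBCDH21, does not prevent assignment.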
The inference that the last common ancestor of choanoflagellates and metazoans contained lefftyrin, coherin, and hedgling cadherins reveals the evolutionary foundations for the subsequent origin of metazoan-specific cadherins.

[Figure 2: domain-architecture diagrams (scale bar, 1000 aa). (A) M. brevicollis (MBCDH14), S. rosetta (EGD82557), S. rosetta (EGD79002), C. owczarzaki (EFW44034). (B) Lefftyrin family: M. brevicollis (MBCDH21), S. rosetta (EGD79249), O. carmela (JN197609). (C) Coherin family: M. brevicollis (MBCDH8), S. rosetta (EGD82245), A. queenslandica (Aqu1.221884). (D) Hedgling family: M. brevicollis (MBCDH11), M. brevicollis (MBCDH3), M. brevicollis (MBCDH15), S. rosetta (EGD79017), A. queenslandica (ABX90059), N. vectensis (ABX84114). Domain key: Epidermal Growth Factor (EGF), Extracellular Cadherin (EC), Transmembrane (TM), Cohesin Domain, N-terminal Hedgehog Signal Domain (Hh-N), von Willebrand A (VWA), Laminin N-terminal Domain (Lam-N), Furin, Protein Tyrosine Phosphatase (PTPase), inactive Protein Tyrosine Phosphatase (marked *), Fibronectin 3 Domain (FN3), IG I-set, SH2, Lam-G, Candida ALS, Dockerin 1, PKD, TNFR, TNFR/FU, 9-cysteine GPCR.]

Fig. 2. Predicted domain architecture of modern representatives of premetazoan cadherins. At least three cadherin families evolved before the origin of metazoans. (A) The single cadherin discovered in the genome of C. owczarzaki has a cassette of EGF repeats positioned proximal to a single transmembrane domain (blue box) that is also found in choanoflagellate and sponge cadherins. The phylogenetic relationships among cadherins with this feature are not yet clear. The lefftyrin (B) and coherin (C) families are present only in choanoflagellates and sponges. Lefftyrins are distinguished by an N-terminal “LEF” cassette (orange box) with a Lam-N domain, four EGF repeats, and a Furin repeat and a C-terminal “FTY” cassette (purple box) with one or two Fibronectin 3 domains, a transmembrane domain, and a tyrosine phosphatase domain. Coherins contain a diagnostic bacterial/archaeal-like cohesin (50) domain. (D) The hedgling family (1, 26) is present in choanoflagellates, sponges, and cnidarians and is absent from bilaterians. All hedglings contain an N-terminal Hedgehog signal domain linked to a von Willebrand A domain (green box) and most contain a series of EGF repeats proximal to the transmembrane domain (blue box). Candida ALS, Candida Agglutinin-like sequence; IG I-set, Ig I-set; KU, BPTI/Kunitz family of serine protease inhibitors; Lam-G, Laminin G domain; 9-cysteine GPCR, 9-cysteine G protein coupled receptor; PKD, polycystic kidney disease; SH2, src homology domain 2; TNFR, tumor necrosis factor receptor.

Metazoan Classical Cadherin/β-Catenin Adhesion Complex. Among the cadherins that evolved along the metazoan stem lineage, classical cadherins have the clearest potential link to metazoan origins, both because of their ubiquity in modern metazoan lineages and because of their central roles in bilaterian cell adhesion (4). To investigate whether the adhesive functions of classical cadherins might extend to the earliest branching lineages of metazoans, we examined the possibility that the regulatory interaction between classical cadherins and β-catenin is conserved in sponges. The single detected classical cadherin homolog in O. carmela, OcCdh1 (GenBank accession no. AEC12441), encodes at least seven EC domains and a CCD domain, as well as multiple EGF and Lam-G domains that are typical of classical cadherins in invertebrates (e.g., Drosophila melanogaster N-cadherin and Shotgun; Fig. 3A and refs. 3 and 34). By aligning the amino acid sequence of the CCD of OcCdh1 with those of other classical cadherins, we found that two residues (D675 and E682) necessary for binding and modulating interactions with β-catenin (35) in bilaterians are conserved (Fig. 3B).

We next investigated whether O. carmela β-catenin (Oc_bcat; GenBank accession no.
HQ234356) has diagnostic protein domains and residues indicative of the ability to interact with classical cadherins. Oc_bcat contains at least 11 of the 12 conserved armadillo (arm) repeats (36, 37) that are typical of eumetazoan β-catenin proteins (Fig. 3C) and shows 66.4% amino acid sequence identity with human β-catenin over the conserved arm-repeat region. Furthermore, Oc_bcat has two lysine residues (homologous to positions K312 and K435 in mouse) required for the interaction of mouse β-catenin with E-cadherin (Fig. 3 D and E and refs. 35, 38, and 39).

[Figure 3, panels A to E. (A) Domain diagrams of mouse E-cadherin, Drosophila Shotgun, and O. carmela cadherin1, labeled with EC domains, EGF, LamG, TM, and CCD. (B) CCD alignment with position markers 674 and 700; part of the alignment is not recoverable from the extracted text:
mouse DSLLVFDYEGSGSEAASLSSL-NSSE
Drosophila DDVRHYAYEGDGNSDGSLSSLASCTD
O. carmela DELLHFEDEGILSEGASLSSLSIASE
(C) Armadillo repeats 1 to 12 and helix-C for zebrafish, Drosophila, and O. carmela β-catenin. (D) Structural views labeled with zebrafish K312 and K435 and O. carmela K358 and K482. (E) Alignment around mouse positions 312 and 435 (O. carmela positions 358 and 482):
mouse YGNQESKLIILAS...CNNYKNKMMVCQV
zebrafish YGNQESKLIILAS...CNNYKNKMMVCQV
O. carmela YGNQESKLIILAS...CNNQQNKVIVCQC]

Fig. 3. A conserved β-catenin/classical cadherin protein complex in a sponge. (A) The genome of the sponge O. carmela encodes a classical cadherin, Oc_cdh1, identified by the presence of the diagnostic cadherin cytoplasmic domain (CCD). Oc_cdh1 also has EGF and Lam-G domains in a membrane-proximal position that is typical of invertebrate classical cadherins (4). The dashed line at the N terminus of Oc_cdh1 indicates that the gene model is incomplete because of the draft nature of the genome assembly. (B) An alignment of a portion of the Oc_cdh1 CCD with bilaterian CCDs demonstrates the conservation of two residues (Aspartate and Glutamate, highlighted in green) required for binding to β-catenin (SI Appendix, Fig. S4 depicts the full alignment and includes the only known CCD from the demosponge A. queenslandica, in which critical β-catenin binding residues are also conserved). Conserved residues are shaded gray and Casein Kinase II and Glycogen Synthase Kinase 3b phosphorylation sites essential for the regulation of adhesion dynamics are indicated by filled or open circles, respectively (35, 38, 39). (C) The O. carmela genome also encodes a single β-catenin ortholog (Oc_bcat) with 11 predicted armadillo (arm) repeats and a helix-C domain; each arm repeat is numbered according to its similarity (determined by best-reciprocal Blast) with the 12 arm repeats from other metazoan β-catenin homologs (SI Appendix, Fig. S4). (D) Through comparison of a surface representation of the 3D structure of zebrafish β-catenin (37) with a structural model of Oc_bcat, we predict the conservation of a positively charged groove lined by the third helix (blue) of each arm repeat. Within this groove there are two lysine residues whose orientation resembles that of conserved lysines from zebrafish β-catenin. (E) These lysines align with Lysine-312 and Lysine-435 of mouse β-catenin, each of which is required for binding to mouse E-cadherin (35, 38, 39) at Aspartate647 and Glutamate-682 (highlighted in B). Ocar_cdh1 was initially discovered from a yeast two-hybrid screen using full-length Ocar_bcat as bait (SI Appendix, Table S2; see SI Appendix for further discussion). CCD, cadherin cytoplasmic domain; EC, extracellular cadherin; EGF, epidermal growth factor domain; Lam-G, Laminin G domain; TM, transmembrane domain.

By threading the full-length sequence of Oc_bcat onto the crystal structure of zebrafish β-catenin (Fig.
3D), we predict that the third helix of each arm repeat in Oc_bcat orients along the surface of a positively charged groove that has been shown to contact E-cadherin directly in mouse (35, 38, 39). Moreover, the conserved lysines of β-catenin that are required to mediate interactions with E-cadherin are oriented similarly in the 3D models of the full-length zebrafish (37) and Oc_bcat. Furthermore, an unbiased yeast two-hybrid screen of O. carmela proteins using Oc_bcat as the “bait” recovered OcCdh1 as a binding partner (SI Appendix). Further study is required to determine whether OcCdh1 and Oc_bcat have the capacity to bind to each other directly in vivo and, thereby, contribute to cell adhesion in O. carmela.

Discussion

Cadherins represent a compelling case study for how large metazoan gene families evolve. Like members of most metazoan signaling and adhesion protein families, cadherins are typically large, multidomain proteins. Such protein families evolve through duplication and divergence and through the shuffling of protein domains among different protein families (40, 41). By using a phylogenetically informed comparative genomic approach, we were able to reconstruct a concrete portrait of the minimal cadherin diversity in the metazoan stem lineage. Furthermore, by reconstructing the ancestral domain composition of early-evolving cadherin families, we have been able to predict their evolutionary relationships with other, later-evolving modern protein families.

Premetazoan Cadherin Diversity. An initially surprising result from the genome of M. brevicollis was that the genomes of choanoflagellates and most metazoans have comparable numbers of cadherin genes (1), despite vast differences in their biology. This result is further supported by our analysis of the S. rosetta genome, which has at least 29 predicted cadherin genes. In contrast, our analyses of cadherin relationships among metazoans, choanoflagellates, and C.
owczarzaki suggest that as few as three modern cadherin families were present in the last common ancestor of choanoflagellates and metazoans, and that potentially only one cadherin was present in the last common ancestor of C. owczarzaki, choanoflagellates, and metazoans (Fig. 4A). However, these inferences may represent an underestimate because of limited available data. For example, C. owczarzaki is the only known member of its lineage, it diverged from choanoflagellates and metazoans more than 650 Mya, and it is a symbiont (42) that is likely to have evolved from a free-living ancestor; hence, aspects of its biology and genome content may be reduced.

The contrast between the large number of cadherins in modern lineages and the low diversity of cadherins inferred in the metazoan-stem lineage raises the intriguing possibility that modern cadherin diversity arose from a handful of ancestral cadherin families that still exist today (however, it is notable that all of the premetazoan cadherin families detected are absent from Bilateria). Alternatively, although future studies of a broader diversity of choanoflagellates and early branching metazoans may reveal additional members of the premetazoan cadherin repertoire, it is also possible that cadherins present in the ancestors of metazoans and choanoflagellates were subsequently lost (or evolved beyond recognition) in both lineages.

[Figure 4. (A) Tree of Fungi, Capsaspora, choanoflagellates, sponges, Cnidaria, and Bilateria, with Holozoa and Metazoa labeled, onto which Cowc_Cdh1 and the coherin, lefftyrin, hedgling, CELSR/flamingo, and classical cadherin families are mapped. (B) Domains that link cadherin families to other protein families: lefftyrins, C-terminal cassette of FN3-TM-PTPase domains (receptor protein tyrosine phosphatases) and LamNT (usherin, laminin, and netrin); coherins, cohesin domain (bacterial cellulosome); hedglings, HhN domain (Hedgehog).]

Fig. 4. An emerging model of cadherin evolution. (A) At least five modern families of cadherins (hedglings, coherins, lefftyrins, CELSR/flamingo, and classical cadherins) evolved before the diversification of modern metazoans. Of these families, only the CELSR/flamingo and classical cadherin families are clearly conserved in all metazoan lineages (2, 4, 31). In contrast, among metazoans, hedgling is restricted to sponges and cnidarians. All of the cadherin families that evolved before the divergence of choanoflagellates and metazoans (“premetazoan” cadherin families) have been lost or have evolved beyond recognition in bilaterians. The relationships among the single cadherin detected in the genome of C. owczarzaki (Cowc_Cdh1) and other modern cadherin families are uncertain (indicated by dotted circle, also see Fig. 2A). (B) In addition to having EC domains, members of many cadherin families contain domains that provide clues to their evolutionary origins and to their relationships with other modern protein families (see Discussion).

Radiation of Cadherins in Choanoflagellate and Metazoan Lineages. The study of cadherin families conserved in choanoflagellates and metazoans promises to provide an unprecedented perspective on cadherin function before the evolution of metazoan multicellularity. Three cadherin families (lefftyrins, coherins, and hedglings) were present in the last common ancestor of metazoans and choanoflagellates and seem to have evolutionary connections to diverse metazoan signaling and adhesion gene families (Fig. 4B). For example, lefftyrins, so far known only from choanoflagellates and the sponge O. carmela, contain a Lam-N domain that is otherwise found in the proteins laminin, netrin, and usherin. These proteins are united by the fact that they function in the extracellular matrix (43–46).
Furthermore, the carboxyl-terminal FTY cassette of lefftyrins is diagnostic of metazoan receptor PTPases, which help regulate cellular responses to interactions with neighboring cells and the extracellular matrix (47–49). C. owczarzaki is the most divergent outgroup of Metazoa that has cadherins, and we have discovered that its genome also encodes a metazoan-like receptor PTPase that lacks EC domains (GenBank accession no. EFW39745). Thus, it seems that lefftyrins may have evolved through a domain-shuffling event that brought PTPase and EC domains together in the choanoflagellate/metazoan stem lineage.

Whereas lefftyrins may represent a case of protein family evolution through the process of domain shuffling, the newly discovered coherin family may have evolved through horizontal gene transfer. Coherins, which are restricted to choanoflagellates and sponges, are defined by the presence of EC domains and the cohesin domain. The cohesin domain is otherwise known only from archaea and bacteria. In the bacterial genus Clostridium, the cohesin domain functions in the assembly of the cellulosome, a complex of enzymes used to degrade plant cell walls (50). The possible evolutionary connection between coherins and the prokaryotic cohesin domain-containing proteins highlights the complexities of the evolutionary processes that shaped cadherin evolution during the early ancestry of Metazoa. Unless the cohesin domain of coherins evolved by convergent evolution with its prokaryotic counterpart, it must have been acquired by horizontal gene transfer (32); this explanation seems quite plausible when considering that the earliest metazoan ancestors likely were bacterivorous (51). Either way, the presence of a cohesin domain in coherins is compelling evidence of the homology of these proteins between sponges and choanoflagellates.

Premetazoan Cadherin Functions.
Our understanding of the scope of cadherin function derives from their study in morphologically complex bilaterians, but C. owczarzaki is unicellular (42) and choanoflagellates exist as either single cells or simple undifferentiated colonies (52–54). Cadherins in these organisms may have functions that are unrelated to cadherin functions known from bilaterians. For example, even in colony-forming S. rosetta, adjacent cells are linked by cytoplasmic bridges and lack structures that resemble the cadherin-based adherens junctions of metazoans (53). However, it is possible to identify some analogous functions that might be served by cadherins in nonmetazoans. For example, cadherins in unicellular lineages could have adhesive functions other than the regulation of stable cell-cell adhesion, such as during bacterial prey capture, attachment to ECM, attachment to environmental substrates, or gamete recognition (although sex is undocumented in choanoflagellates).

One biological context in which cadherin function may be conserved between choanoflagellates and metazoans is in the collar cells of sponges. Like choanoflagellates, sponge collar cells have a motile flagellum used to generate water flow for the capture of bacterial prey on a surrounding microvillar collar where they are phagocytosed. It is reasonable to hypothesize that cadherin families restricted to sponges and choanoflagellates (i.e., lefftyrins and coherins), in particular, may have functions specific to the biology of collar cells. Such functions may include roles in the regulation of microvillar collar integrity or bacterial prey capture. Indeed, one cadherin (MBCDH1) has been shown to localize to the microvillar collar of M. brevicollis (1).
Furthermore, there is precedent for a physiologically important interaction between bacteria and cadherins in metazoans: Some pathogenic bacteria interact with classical cadherins in gut epithelia, thereby stimulating the host cells to phagocytose the invading pathogen (55–57).

Linking Cadherin Evolution to the Origin of Metazoa. A challenge for relating cadherin gene family evolution to metazoan morphological evolution is that, until now, none of the functionally characterized cadherin families of bilaterians have been studied in nonbilaterians. Of all of the modern cadherin families, the classical cadherin family is perhaps the strongest candidate for having played a role in the evolution of metazoan multicellularity (2, 4). The CCD of classical cadherins binds to β-catenin to regulate cell-cell adhesion in all studied bilaterian tissues. Here, we show that the genome of the sponge O. carmela encodes a typical nonchordate classical cadherin with a CCD domain-containing cytoplasmic tail that is predicted to be capable of binding to O. carmela β-catenin. Thus, it is plausible that an evolutionarily conserved classical cadherin/β-catenin adhesion complex was a feature of the cell biology of the last common ancestor of all modern metazoans.

The ubiquity of certain cadherin families in lineages that diverged more than 600 Mya indicates that these protein families have conserved (and essential) roles in organisms with vastly different biology. As we learn about their functions, we stand to gain insight into ancestral features of metazoans and their single-celled relatives, similarities that are fundamental to their basic cell biology.

Materials and Methods

The genomes of C. owczarzaki and S. rosetta were sequenced and assembled by the Broad Institute (Massachusetts Institute of Technology/Harvard; http://www.broadinstitute.org/annotation/genome/multicellularity_project/MultiHome.html), and the S. rosetta gene models were refined by using Illumina RNA-seq data. The O. carmela genome was sequenced by using paired-end Illumina reads at the Vincent J. Coates Genomic Sequencing Laboratory at the University of California, Berkeley and an early draft was assembled in-house. To identify new cadherins in these genomes, we performed protein homology-based searches (i.e., Blast; ref. 27) and domain-based searches (e.g., Pfam; ref. 29 and Smart; ref. 30). Any protein containing an EC domain was defined as a cadherin, and most of these also had a transmembrane domain. Cadherin families were identified based on the shared composition and arrangement of their protein domains. Structural predictions for Ocar_bcat were inferred by using LOOPP (58) to thread the full-length sequence onto the crystal structure of full-length zebrafish β-catenin. For detailed experimental procedures, see SI Appendix.

ACKNOWLEDGMENTS. We thank M. Abedin, S. Brenner, A. Brooks, M. Eisen, W. J. Nelson, M. Paris, D. Scannell, B. Steele, S. Q. Schneider, L. Tonkin, and Q. Zhou for technical support, advice, and helpful discussions. This work was supported in part by funding from an American Cancer Society Postdoctoral Fellowship (to S.A.N.), American Cancer Society Research Scholar Grant 116795-RSG-09-044-01-DDC (to N.K.), the National Aeronautics and Space Administration Astrobiology program (to N.K., S.A.N., and D.J.R.), the Hellman Family Fund (to N.K.), and a National Defense Science and Engineering Graduate fellowship from the Department of Defense (to D.J.R.).

1. Abedin M, King N (2008) The premetazoan ancestry of cadherins. Science 319:946–948.
2. Hulpiau P, van Roy F (2011) New insights into the evolution of metazoan cadherins. Mol Biol Evol 28:647–657.
3. Hynes RO, Zhao Q (2000) The evolution of cell adhesion. J Cell Biol 150:F89–F96.
4. Oda H, Takeichi M (2011) Evolution: Structural and functional diversity of cadherin at the adherens junction. J Cell Biol 193:1137–1146.
5. Rokas A (2008) The origins of multicellularity and the early history of the genetic toolkit for animal development. Annu Rev Genet 42:235–251.
6. Angst BD, Marcozzi C, Magee AI (2001) The cadherin superfamily: Diversity in form and function. J Cell Sci 114:629–641.
7. Saburi S, McNeill H (2005) Organising cells into tissues: New roles for cell adhesion molecules in planar cell polarity. Curr Opin Cell Biol 17:482–488.
8. Simons M, Mlodzik M (2008) Planar cell polarity signaling: From fly development to human disease. Annu Rev Genet 42:517–540.
9. Lévesque CA, et al. (2010) Genome sequence of the necrotrophic plant pathogen Pythium ultimum reveals original pathogenicity mechanisms and effector repertoire. Genome Biol 11:R73.
10. Overduin M, et al. (1995) Solution structure of the epithelial cadherin domain responsible for selective cell adhesion. Science 267:386–389.
11. King N, Hittinger CT, Carroll SB (2003) Evolution of key cell signaling and adhesion protein families predates animal origins. Science 301:361–363.
12. Nollet F, Kools P, van Roy F (2000) Phylogenetic analysis of the cadherin superfamily allows identification of six major subfamilies besides several solitary members. J Mol Biol 299:551–572.
13. Posy S, Shapiro L, Honig B (2008) Sequence and structural determinants of strand swapping in cadherin domains: Do all cadherins bind through the same adhesive interface? J Mol Biol 378:954–968.
14. Shapiro L, et al. (1995) Structural basis of cell-cell adhesion by cadherins. Nature 374:327–337.
15. Ozawa M, Baribault H, Kemler R (1989) The cytoplasmic domain of the cell adhesion molecule uvomorulin associates with three independent proteins structurally related in different species. EMBO J 8:1711–1717.
16. Shapiro L, Weis WI (2009) Structure and biochemistry of cadherins and catenins. Cold Spring Harb Perspect Biol 1:a003053.
17. Chen YT, Stewart DB, Nelson WJ (1999) Coupling assembly of the E-cadherin/beta-catenin complex to efficient endoplasmic reticulum exit and basal-lateral membrane targeting of E-cadherin in polarized MDCK cells. J Cell Biol 144:687–699.
18. Huber AH, Stewart DB, Laurents DV, Nelson WJ, Weis WI (2001) The cadherin cytoplasmic domain is unstructured in the absence of beta-catenin. A possible mechanism for regulating cadherin turnover. J Biol Chem 276:12301–12309.
19. Okazaki M, et al. (1994) Molecular cloning and characterization of OB-cadherin, a new member of cadherin family expressed in osteoblasts. J Biol Chem 269:12092–12098.
20. Casal J, Lawrence PA, Struhl G (2006) Two separate molecular systems, Dachsous/Fat and Starry night/Frizzled, act independently to confer planar cell polarity. Development 133:4561–4572.
21. Goodrich LV, Strutt D (2011) Principles of planar polarity in animal development. Development 138:1877–1892.
22. Viktorinová I, König T, Schlichting K, Dahmann C (2009) The cadherin Fat2 is required for planar cell polarity in the Drosophila ovary. Development 136:4123–4132.
23. Morishita H, Yagi T (2007) Protocadherin family: Diversity, structure, and function. Curr Opin Cell Biol 19:584–592.
24. Kazmierczak P, et al. (2007) Cadherin 23 and protocadherin 15 interact to form tip-link filaments in sensory hair cells. Nature 449:87–91.
25. King N, et al. (2008) The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature 451:783–788.
26. Adamska M, et al. (2007) The evolutionary origin of hedgehog proteins. Curr Biol 17:R836–R837.
27. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.
28. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755–763.
29. Finn RD, et al. (2010) The Pfam protein families database. Nucleic Acids Res 38(Database issue):D211–D222.
N.K. is a Fellow in the Integrated Microbial Biodiversity program of the Canadian Institute for Advanced Research. 30. Schultz J, Milpetz F, Bork P, Ponting CP (1998) SMART, a simple modular architecture research tool: Identification of signaling domains. Proc Natl Acad Sci USA 95:5857–5864. 31. Fahey B, Degnan BM (2010) Origin of animal epithelia: Insights from the sponge genome. Evol Dev 12:601–617. 32. Peer A, Smith SP, Bayer EA, Lamed R, Borovok I (2009) Noncellulosomal cohesin- and dockerin-like modules in the three domains of life. FEMS Microbiol Lett 291:1–16. 33. Hall TM, Porter JA, Beachy PA, Leahy DJ (1995) A potential catalytic site revealed by the 1.7-A crystal structure of the amino-terminal signalling domain of Sonic hedgehog. Nature 378:212–216. 34. Iwai Y, et al. (1997) Axon patterning requires DN-cadherin, a novel neuronal adhesion receptor, in the Drosophila embryonic CNS. Neuron 19:77–89. 35. Huber AH, Weis WI (2001) The structure of the beta-catenin/E-cadherin complex and the molecular basis of diverse ligand recognition by beta-catenin. Cell 105:391–402. 36. Huber AH, Nelson WJ, Weis WI (1997) Three-dimensional structure of the armadillo repeat region of beta-catenin. Cell 90:871–882. 37. Xing Y, et al. (2008) Crystal structure of a full-length beta-catenin. Structure 16:478–487. 38. Gooding JM, Yap KL, Ikura M (2004) The cadherin-catenin complex as a focal point of cell adhesion and signalling: New insights from three-dimensional structures. Bioessays 26:497–511. 39. Graham TA, Weaver C, Mao F, Kimelman D, Xu W (2000) Crystal structure of a betacatenin/Tcf complex. Cell 103:885–896. 40. Doolittle RF (1995) The origins and evolution of eukaryotic proteins. Philos Trans R Soc Lond B Biol Sci 349:235–240. 41. Lundin LG (1999) Gene duplications in early metazoan evolution. Semin Cell Dev Biol 10:523–530. 42. Hertel LA, Bayne CJ, Loker ES (2002) The symbiont Capsaspora owczarzaki, nov. gen. nov. 
sp., isolated from three strains of the pulmonate snail Biomphalaria glabrata is related to members of the Mesomycetozoea. Int J Parasitol 32:1183–1191. 43. Colognato H, Yurchenco PD (2000) Form and function: The laminin family of heterotrimers. Dev Dyn 218:213–234. 44. Eudy JD, et al. (1998) Mutation of a gene encoding a protein with extracellular matrix motifs in Usher syndrome type IIa. Science 280:1753–1757. 45. Serafini T, et al. (1994) The netrins define a family of axon outgrowth-promoting proteins homologous to C. elegans UNC-6. Cell 78:409–424. 46. Vuolteenaho R, Chow LT, Tryggvason K (1990) Structure of the human laminin B1 chain gene. J Biol Chem 265:15611–15616. 47. Petrone A, Sap J (2000) Emerging issues in receptor protein tyrosine phosphatase function: Lifting fog or simply shifting? J Cell Sci 113:2345–2354. 48. Blanchetot C, Tertoolen LG, Overvoorde J, den Hertog J (2002) Intra- and intermolecular interactions between intracellular domains of receptor protein-tyrosine phosphatases. J Biol Chem 277:47263–47269. 49. Tonks NK (2006) Protein tyrosine phosphatases: From genes, to function, to disease. Nat Rev Mol Cell Biol 7:833–846. 50. Carvalho AL, et al. (2003) Cellulosome assembly revealed by the crystal structure of the cohesin-dockerin complex. Proc Natl Acad Sci USA 100:13809–13814. 51. Nichols SA, Dayel MJ, King N (2009) Genomic, phylogenetic and cell biological insights into metazoan origins. Animal evolution: Genomes, fossils and trees, eds Telford MJ, Littlewood D (Oxford Univ Press, Oxford), pp 24–32. 52. Leadbeater BSC (1983) Life-history and ultrastructure of a new marine species of Proterospongia (Choanoflagellida). J Mar Biol Assoc U K 63:135–160. 53. Dayel MJ, et al. (2011) Cell differentiation and morphogenesis in the colony-forming choanoflagellate Salpingoeca rosetta. Dev Biol 357:73–82. 54. Karpov S, Coupe S (1998) A revision of choanoflagellate genera Kentrosiga Schiller, 1953 and Desmarella Kent, 1880. Acta Protozool 37:23–27. 55. 
Mengaud J, Ohayon H, Gounon P, Cossart P, Cossart P; Mege R-M (1996) E-cadherin is the receptor for internalin, a surface protein required for entry of L. monocytogenes into epithelial cells. Cell 84:923–932. 56. Boyle EC, Finlay BB (2003) Bacterial pathogenesis: Exploiting cellular adherence. Curr Opin Cell Biol 15:633–639. 57. Blau K, et al. (2007) Flamingo cadherin: A putative host receptor for Streptococcus pneumoniae. J Infect Dis 195:1828–1837. 58. Tobi D, Elber R (2000) Distance-dependent, pair potential for protein folding: Results from linear optimization. Proteins 41:40–46. Nichols et al. Nichols et al. Supplemental Information Detailed Experimental Procedures: O. carmela, Illumina library construction A paired-end genomic library for Illumina sequencing was constructed using Oscarella carmela DNA prepared by whole genome amplification (WGA, (1)). To reduce contamination and polymorphism that could complicate genome assembly and analysis, a single sponge larva was isolated, washed five times in sterile-filtered seawater and lysed using the REPLI-g Mini kit for WGA (Qiagen, Valencia, CA). The lysate was divided and used to conduct four separate WGA reactions that were pooled to reduce the effects of stochastic amplification bias. Paired-end library construction was performed using the Illumina PE Adapter Oligo Mix and PCR primers (Illumina Inc., San Diego, CA) in combination with protocol modifications suggested by Quail and colleagues (2). Additionally, during each spin-column purification step, residual ethanol was pipetted out of the column prior to elution to prevent ethanol carry-over. Library quality was determined using a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA) to confirm fragment size and concentration. O. 
carmela Illumina sequencing and draft genome assembly A total of 388,627,652 reads were generated from two separate paired-end Illumina runs on the same library: 39,460,320 reads from 2 lanes of 76 cycle sequencing (hereafter called “run 1”), and 349,167,332 reads from 7 lanes of 101 cycle sequencing (“run 2”). Before assembly, low frequency “noise” k-mers were corrected in the reads using the Corrector tool version 1.00 from the Beijing Genomics Institute [http://soap.genomics.org.cn/down/correction.tar.gz] with default parameter values. The two lanes from run 1 were corrected together using a frequency cutoff of 5 per k-mer, and each lane from run 2 was corrected individually using a frequency cutoff of 10 per k-mer. After correction, 33,249,809 reads from run 1 and 298,166,837 reads from run 2 remained, for a total of 331,416,646 reads. Genome assembly was performed iteratively using 1 Nichols et al. SOAPdenovo version 1.04 (3) with default parameter values (unless otherwise noted), as follows: an initial assembly was created using a k-mer size of 31, with both runs used for building contigs and only run 1 used for building scaffolds. To close gaps in the initial assembly, we ran GapCloser version 1.10 (4) with default parameter values using only reads from run 1. We found that the processes of building scaffolds and gap closing were more successful using fewer reads, and thus we chose run 1 for both tasks; using the reads from any single lane of run 2 produced similar results. After running SOAPdenovo and GapCloser, we mapped all corrected reads back to the assembly using Bowtie version 0.12.1 (5) with default parameter values. We then created a final assembly using only the reads that mapped to the initial gap-closed assembly. We ran SOAPdenovo followed by GapCloser, repeating the initial assembly process but instead using at each step the set of reads mapping to the initial assembly. Assembly statistics are shown in Tables S1-S3. O. 
carmela gene prediction Gene prediction was performed de novo on the final assembly using Augustus version 2.3 (6) with the autoAug script and the 6,235 assembled Sanger ESTs (7) as prediction aids. Gene prediction was only performed on sequences with a minimum length of 500 (9,823 genes were predicted). O. carmela genome: assembly statistics for scaffolds Assemblies Total Number of Assembly Scaffolds Size (bp) Number of Scaffolds + Contigs Longest (bp) N50 (bp) N90 (bp) Pilot assembly 29,148 57,006,393 70,595 49,630 3,324 416 Initial assembly 22,699 60,727,654 77,270 84,460 4,699 351 Final assembly 17,451 56,386,309 67,767 108,178 5,897 368 2 Nichols et al. O. carmela genome: assembly statistics for contigs Assemblies Number of Reads Total Assembly Size (bp) Longest (bp) N50 (bp) N90 (bp) Average Coverage (x) Pilot assembly 39,460,320 46,779,956 6,153 339 124 22 Initial assembly 331,416,646 54,313,237 28,111 890 132 568 Final assembly 239,209,057 54,193,990 43,946 1,158 142 562 O. carmela genome: scaffold GC content, paired end insert size, and gap information Assemblies Estimated Total Size Insert Number of of Gaps Size Number of Total Size of Gaps Remaining GC Estimated Standard Gaps Gaps Before Remaining After Content Insert Deviation Before GapCloser After GapCloser (percent) Size (bp) (bp) GapCloser (bp) GapCloser (bp) Pilot assembly 43.7 390 78 85,376 13,410,978 - - Initial assembly 43.5 397 71 55,768 7,991,639 21,580 5,733,376 Final assembly 43.5 395 79 39,994 4,765,458 8,105 2,452,188 Discovery and annotation of novel cadherins The stand-alone BLAST search algorithm was used to search the best predicted protein set from the draft genomes of S. rosetta, C. owczarzaki, and O. carmela using the 23 predicted cadherins from the M. brevicollis genome (8) as a query. As a complement to this approach, Pfam (9), SMART (10) and Phobius (11) domain prediction programs were run on all predicted S. rosetta proteins. 
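The domain-based part of this screen can be illustrated with a minimal sketch, assuming the domain predictions have already been exported as (protein ID, domain name) pairs. The labels in EC_DOMAIN_NAMES, the function names, and the protein identifiers below are assumptions made for illustration; they are not taken from the actual analysis or from the output format of any particular tool.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Assumed labels under which a domain predictor reports the EC domain.
# Adjust to whatever names the chosen tools actually emit.
EC_DOMAIN_NAMES: Set[str] = {"Cadherin", "CA"}


def count_ec_domains(
    hits: Iterable[Tuple[str, str]],
    ec_names: Set[str] = EC_DOMAIN_NAMES,
) -> Dict[str, int]:
    """Count predicted EC domains per protein from (protein_id, domain) pairs."""
    counts: Counter = Counter()
    for protein_id, domain in hits:
        if domain in ec_names:
            counts[protein_id] += 1
    return dict(counts)


def call_cadherins(hits: Iterable[Tuple[str, str]], min_ec: int = 1) -> List[str]:
    """Return sorted IDs of proteins with at least min_ec predicted EC domains."""
    counts = count_ec_domains(hits)
    return sorted(pid for pid, n in counts.items() if n >= min_ec)


# Hypothetical domain hits for three made-up proteins.
example_hits = [
    ("prot1", "Cadherin"),
    ("prot1", "Cadherin"),
    ("prot1", "EGF"),
    ("prot2", "PKD"),
    ("prot3", "CA"),
]
candidate_cadherins = call_cadherins(example_hits)
```

Here prot1 and prot3 would be flagged because each carries at least one EC-type hit, while prot2 would not. Requiring only a single EC domain follows the criterion stated in Materials and Methods; family assignment from the remaining domains is a separate step that this sketch does not attempt.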
Every protein predicted to have at least one extracellular cadherin (EC) domain was annotated and categorized according to whether its overall domain composition and architecture matched known cadherins from M. brevicollis or any metazoan. The S. rosetta gene models are supported by 33-fold sequence coverage, suggesting that we have identified most, if not all, cadherins in the genome (12). Accurate abundance data for O. carmela could not be determined due to the early draft status of the genome. Therefore, cadherin abundance in sponges was determined from the genome of Amphimedon queenslandica (11). Cadherin abundance estimates for eumetazoans were derived from Hulpiau and van Roy (13) and references therein. Taxonomic data from SMART were used to conclude that no EC domains are present in any annotated plant or fungus.

HMM searches for Hh-N domain-containing proteins
We used the HMMER 3.0 suite of tools (14) to build custom models of the Hh-N signaling domain in order to increase sensitivity for searches of choanoflagellates and other opisthokonts. We used hmmsearch (14) with the Pfam domain Hh_signal [PF01085, Pfam version 24.0 (9)] to detect Hh-N domains in the predicted protein sets from the genomes of the sponge A. queenslandica (15), the sea anemone N. vectensis (16), and the choanoflagellates S. rosetta (12) and M. brevicollis (17). Using the sequences of all domains predicted by hmmsearch with an E value below the gathering threshold for the model in Pfam, we built a multiple alignment using the FSA web server version 1.15.2 (18). We used the resulting alignment to build a custom model with hmmbuild (14), and ran hmmsearch with the custom model against the predicted protein sets from O. carmela, S. rosetta and M. brevicollis in order to detect previously unidentified instances of the Hh-N domain.

Cloning full-length Ocar_bcat
Tissue of O. carmela was flash frozen and ground to a powder using a mortar and pestle containing liquid nitrogen. Messenger RNA was isolated using Trizol Reagent (Invitrogen Corp., Carlsbad, CA) followed by the Oligotex mRNA Mini Kit (Qiagen, Valencia, CA). The unknown 5’ sequence of Ocar_bcat was cloned and sequenced using GeneRacer (Invitrogen Corp., Carlsbad, CA) in combination with an antisense primer (SN33R: 5’ CCCAAGGGCAAGTCTTCGCTGGAT 3’) corresponding to the known 3’ EST sequence (7). The full-length sequence is deposited in GenBank (HQ234356).

Ocar_bcat structural predictions
The full-length sequence of Ocar_bcat was translated from the cloned mRNA transcript using NCBI ORF Finder. The predicted protein was analyzed for its homology to known beta-catenin sequences by comparing its primary sequence to the non-redundant GenBank database (nr) via blastp (19) and by searching for conserved structural domains (arm repeats) using Pfam (9) and SMART (10). Each predicted arm repeat in beta-catenin-related proteins from human, O. carmela, M. brevicollis, S. rosetta, Dictyostelium discoideum and Arabidopsis thaliana was subjected to pair-wise reciprocal blast (9). For example, arm repeat 1 from Ocar_bcat was used to perform a blastp (19) search against a database of all arm repeats from all sampled proteins. We expected that orthologous sequences from different species would exhibit a co-linear sequence of arm repeat homology with human beta-catenin [Fig. S5; method modified from (20)]. In the example of O. carmela arm repeat 1, only a best-reciprocal blast hit with arm repeat 1 from human beta-catenin would be interpreted to support homology of these two proteins. To identify conserved functional residues and motifs within Ocar_bcat, multiple sequence alignment was performed using MUSCLE (21). Additionally, the three-dimensional structure of Ocar_bcat was analyzed using alignment-based fold-prediction as implemented by LOOPP (22). Predicted structures were visualized with PyMOL (The PyMOL Molecular Graphics System, Version 1.2r3pre, Schrödinger, LLC.).
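The best-reciprocal test on arm repeats described above can be sketched as follows. This is a minimal illustration, assuming that pairwise search scores between individual repeats are already available as a dictionary; the function names, the protein label human_bcat, and every score are hypothetical, and treating co-linearity as "paired repeats share the same index" is a simplifying assumption rather than the exact criterion used in the study.

```python
from typing import Dict, List, Optional, Tuple

Repeat = Tuple[str, int]                      # (protein name, arm repeat number)
Scores = Dict[Tuple[Repeat, Repeat], float]   # (query, subject) -> search score


def best_hit(query: Repeat, scores: Scores, target: str) -> Optional[Repeat]:
    """Return the highest-scoring repeat of protein `target` for `query`."""
    candidates = {
        subject: value
        for (q, subject), value in scores.items()
        if q == query and subject[0] == target
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def reciprocal_best_pairs(a: str, b: str, scores: Scores) -> List[Tuple[int, int]]:
    """List (repeat in a, repeat in b) index pairs that are mutual best hits."""
    pairs = []
    for query in sorted({q for (q, _subject) in scores if q[0] == a}):
        hit = best_hit(query, scores, b)
        if hit is not None and best_hit(hit, scores, a) == query:
            pairs.append((query[1], hit[1]))
    return pairs


def is_colinear(pairs: List[Tuple[int, int]]) -> bool:
    """True if every mutual best-hit pair joins repeats with the same index."""
    return bool(pairs) and all(i == j for i, j in pairs)


# Hypothetical scores for two repeats in each of two proteins.
toy_scores: Scores = {
    (("Ocar_bcat", 1), ("human_bcat", 1)): 60.0,
    (("Ocar_bcat", 1), ("human_bcat", 2)): 25.0,
    (("Ocar_bcat", 2), ("human_bcat", 1)): 22.0,
    (("Ocar_bcat", 2), ("human_bcat", 2)): 55.0,
    (("human_bcat", 1), ("Ocar_bcat", 1)): 58.0,
    (("human_bcat", 1), ("Ocar_bcat", 2)): 20.0,
    (("human_bcat", 2), ("Ocar_bcat", 1)): 21.0,
    (("human_bcat", 2), ("Ocar_bcat", 2)): 50.0,
}
toy_pairs = reciprocal_best_pairs("Ocar_bcat", "human_bcat", toy_scores)
```

With these made-up scores, repeat 1 pairs with repeat 1 and repeat 2 with repeat 2, so the two proteins would count as co-linear under the stated assumption. A repeat whose best hit is not returned in the reverse search simply drops out of the pair list, which corresponds to the uncolored repeats in Fig. S5.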
Yeast two-hybrid screen
A yeast two-hybrid screen was conducted to identify candidate binding-partners of full-length Ocar_bcat. To construct a yeast expression library representative of the expressed genes of O. carmela, mRNA was isolated from pooled adult and embryonic tissues (from many individuals to maximize transcript diversity) and cloned into pDONR222 using the CloneMiner cDNA Library Construction Kit (Invitrogen Corp., Carlsbad, CA). Inserts from this library were shuttled into the yeast two-hybrid prey plasmid, pDEST22, using LR Clonase II enzyme mix (Invitrogen Corp., Carlsbad, CA) and transformed for storage and amplification into ElectroMAX DH10B T1 Phage Resistant Cells (Invitrogen Corp., Carlsbad, CA). Likewise, full-length Ocar_bcat was modified using PCR to incorporate Gateway-compatible attB1/attB2 recombination sites and cloned into pDONR221 using BP Clonase II enzyme mix (Invitrogen Corp., Carlsbad, CA). This insert was shuttled into the yeast two-hybrid bait plasmid, pDEST32, using LR Clonase II enzyme mix. Yeast transformation and screening were performed at the yeast two-hybrid facility at Indiana University (23). Full-length Ocar_bcat and positive clones were tested for autoactivation on his- media. 3-Amino-1,2,4-triazole (3AT), which acts as a quantitative inhibitor of the HIS3 reporter gene, was used to control autoactivation by Ocar_bcat. After a <10 day screen, positive clones were retested on his- media, ura- media, and in LacZ assays. Inserts from positive clones were rescued and sequenced at the University of California DNA sequencing facility. Insert sequences from positive clones were compared against the draft assembly of the O. carmela genome using blastn (19), and predicted proteins were annotated using blastp (19), Pfam (9) and SMART (10) to test for homology with known proteins.
Seventeen unique candidate binding-partners of Ocar_bcat were detected (Table S2), including three clones encoding the CCD region of Ocar_Cdh1 and an additional well-known beta-catenin binding protein, Axin. These detected interactions could not be independently validated using in vitro binding assays because recombinant forms of Ocar_bcat proved to be highly insoluble. Nevertheless, the conserved structural features of Ocar_bcat and Ocar_Cdh1, coupled with the fact that this is a widely conserved interaction in metazoans, suggest that the yeast two-hybrid result represents a bona fide interaction.

[Fig. S1: domain architecture diagrams, panels A-F. Proteins labeled in panels B-F: (B) M. brevicollis MBCDH12, MBCDH1, MBCDH2 and S. rosetta EGD72656, EGD75710, EGD74518; (C) M. brevicollis MBCDH10 and S. rosetta EGD78831, EGD78839; (D) M. brevicollis MBCDH9, MBCDH13 and S. rosetta EGD75586; (E) M. brevicollis MBCDH7 and S. rosetta EGD81200; (F) M. brevicollis MBCDH18 and S. rosetta EGD77346.]

Fig. S1. Domain architecture of S. rosetta cadherins without orthologs in Metazoa or C. owczarzaki. (A) Sixteen of 29 predicted S. rosetta cadherin proteins have no clear orthology to any cadherins known from other species, whereas five protein families (B-F) can be identified as shared between and exclusive to S. rosetta and M. brevicollis based upon similarities in their domain composition and arrangement. Of these, one family (E) has partial homology to the lefftyrin family that is found in choanoflagellates and sponges. However, genes in this family differ from choanoflagellate lefftyrins in that they are predicted to have catalytically active cytoplasmic PTPase domains.
(Abbreviations: Candida ALS = Candida Agglutinin-like sequence; CCP = domain abundant in complement control proteins; FN2 = fibronectin 2; HYR = Hyalin Repeat; KU = BPTI/Kunitz family of serine protease inhibitors; LamG = laminin G domain; P protein = Proprotein convertase P-domain; PbH1 = parallel beta-helix repeats; PKD = polycystic kidney disease; TIG = transcription factor immunoglobulin-like domain; TSPN = Thrombospondin N-terminal-like domain; WAP = whey acidic protein; ZnF_c2h2 = zinc-finger, c2h2 type).

Fig. S2. Additional detected Hh-N domain containing proteins from S. rosetta and M. brevicollis. (A) In S. rosetta, two adjacent gene models on a single scaffold have close homology to parts of M. brevicollis hedgling (MBCDH11). Both gene models are supported by RNAseq expression data, but there is a predicted stop codon between them and there are no RNAseq reads that span the divide. We infer either that the stop codon that splits S. rosetta hedgling evolved following the divergence of the M. brevicollis and S. rosetta lineages, or that it is the result of a genome assembly error. Further interpretation will require experimental investigation of these gene models. (B) Using a custom HMM created against the Hh-N domain of known hedgling proteins we also identified five S. rosetta proteins and one M. brevicollis protein that have a conserved Hh-N domain, but lack EC domains. In each case, as in all known hedglings, the Hh-N domain is adjacent to a von Willebrand A domain. Therefore, we hypothesize that the association of these two domains in diverse proteins and in diverse organisms reflects an ancestral function that has been lost in eumetazoans.

Fig. S3. Cohesin domains from Coherin family proteins aligned against the Cohesin Hidden Markov Model from Pfam. Residues that exactly match Pfam HMM (highlighted in blue) are indicated with black shading whereas residues that are considered to be a conservative substitution with respect to what the model expects are indicated with gray shading. Cohesin domains 1 and 2 from Monosiga brevicollis (MBCDH8) are identical to each other. Protein identifiers correspond to Fig. 2c. (Abbreviations: HMM: Hidden Markov Model).

Fig. S4. Annotated alignment of classical cadherin cytoplasmic tails. The juxtamembrane domain (purple box) that constitutes the binding site for p120 catenin is partially conserved between human and Drosophila and Amphimedon, but is divergent in Ocar_Cdh1. In contrast, the beta-catenin binding domain (light green box) of the predicted CCD (light orange box) of Ocar_Cdh1 is conserved, including at residues that are required for the interaction (dark green). The sponge sequences are predicted to be longer than their bilaterian counterparts, complicating alignment of all but the most highly conserved residues.

Fig. S5. Domain organization and phylogenetic distribution of proteins with homology to beta-catenin. Protein diagrams are mapped onto a previously determined phylogenetic tree (24) with arm domains colored to indicate their similarity. Repeats of the same color are best-reciprocal Blast pairs. Arm repeats without close identity to any other are uncolored and indicated with an asterisk. Linear conservation of homologous arm repeats is restricted to metazoan beta-catenin orthologs, suggesting that the metazoan roles of beta-catenin evolved in the metazoan stem lineage and have been highly conserved throughout metazoan evolution.

Tables.

Table S1. S. rosetta cadherin expression levels.
GenBank ID   Min FPKM*     Max FPKM      Mean FPKM      Median FPKM
EGD80879     27.617977     113.246096    56.93163613    48.3590645
EGD80917     2.25581       6.049739      3.860855875    3.533781
EGD78831     7.874201      40.03944      19.4444355     15.1492895
EGD78839     0.109114      26.26796      11.104367      9.8600525
EGD79002     1.87756       6.256101      3.839630625    3.370277
EGD79017     29.325694     128.553899    80.03320963    89.619573
EGD82245     3.403667      15.877104     9.775254375    10.434421
EGD82557     0.85664       8.627106      4.377121       3.1091385
EGD72656     168.694501    984.67225     624.6796178    621.626123
EGD73963     2.017099      8.457588      4.626551625    3.828202
EGD74518     46.138224     267.075716    159.4101904    161.7252545
EGD74707     1.962277      15.002993     8.222477875    8.487063
EGD75381     0.133699      51.162787     18.63580838    15.35265
EGD75404     3.990599      9.91731       7.319792125    7.684626
EGD75405     2.37142       9.804725      6.56574025     6.3694265
EGD75586     0.087914      6.21004       2.840604125    2.17229
EGD75074     2.197013      6.533185      4.66290875     4.722256
EGD74783     0.026136      3.799631      1.556376       1.0930275
EGD75710     71.962177     626.409101    259.4577603    220.060925
EGD76846     5.967787      85.11871      33.87954975    17.35544
EGD77346     7.357232      20.994633     12.16801713    10.4218215
EGD78086     0             7.934519      2.76326325     1.801815
EGD78170     18.746381     50.514396     28.3529975     26.7736315
EGD78171     23.099038     61.605291     35.97480775    33.790346
EGD81200     0.053023      20.10651      9.513880375    8.764376
EGD78969     9.214266      59.752132     31.85870863    33.0626255
EGD78970     5.89713       31.968329     15.953967      15.8754105
EGD74667     2.071066      14.728778     7.26199325     6.9755495
EGD75359     0.023944      7.275513      3.04185575     2.4945935
EGD79249     0.020866      3.374047      1.3963085      1.166545

* The number of fragments per kilobase per million sequenced reads (FPKM) mapping to each identified S. rosetta cadherin from RNA-seq of eight growth conditions is summarized as evidence of gene expression.

Table S2. O. carmela binding partners predicted from yeast two-hybrid screen of beta-catenin.
Gene ID     Tentative identification                     Predicted domain architecture (Pfam)   Predicted domain architecture (SMART)
g4908.t1    none                                         none                                   none
g9583.t1    none                                         death                                  none
g6098.t1    Upstream binding protein                     CP2                                    none
g8349.t1    40S ribosomal protein S11                    Ribosomal S17                          none
g6246.t1    Tenascin                                     EGF 2 (x9); EGF Ca (x2)                VWD; EGF like; EGF (x10); EGF Ca (x2)
g6719.t1    none                                         EIF4E-T                                coiled coil
g8701.t1    Transcription factor AP-1/c-Jun              bZIP 1                                 BRLZ
g2054.t1    Calumenin                                    SPARC Ca bdg; efhand (x2)              EFh (x2)
g6285.t1    E74-like factor                              Ets                                    ETS
g10012.t1   Chromosomal segregation protein SMC          none                                   coiled-coil
g4744.t1    GTPase Rab2                                  Ras                                    RAB
g8915.t1    Baculoviral IAP repeat-containing protein 4  BIR (x4)                               BIR (x4); RING
g6056.t1    Ribosomal protein L13                        Ribosomal L13e                         none
g2979.t1    Ral                                          Ras                                    RAS
g3724.t1    Choline-phosphate cytidylyltransferase       none                                   coiled-coil
g6554.t1    Axin                                         RGS; DIX                               RGS; DAX
AEC12441    Ocar_Cdh1                                    EC; EGF; Lam-G; CCD                    EC; EGF; Lam-G

References
1. Hosono S, et al. (2003) Unbiased whole-genome amplification directly from clinical samples. Genome Res 13(5):954-964.
2. Quail MA, et al. (2008) A large genome center's improvements to the Illumina sequencing system. Nat Methods 5(12):1005-1010.
3. Li R, et al. (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20(2):265-272.
4. http://soap.genomics.org.cn/down/GapCloser.tar.gz
5. Langmead B, Trapnell C, Pop M, & Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25.
6. Stanke M, Diekhans M, Baertsch R, & Haussler D (2008) Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24(5):637-644.
7. Nichols SA, Dirks W, Pearse JS, & King N (2006) Early evolution of animal cell signaling and adhesion genes. Proc Natl Acad Sci U S A 103(33):12451-12456.
8. Abedin M & King N (2008) The premetazoan ancestry of cadherins. Science 319(5865):946-948.
9. Finn RD, et al. (2010) The Pfam protein families database. Nucleic Acids Res 38(Database issue):D211-222.
10. Schultz J, Milpetz F, Bork P, & Ponting CP (1998) SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A 95(11):5857-5864.
11. Kall L, Krogh A, & Sonnhammer EL (2007) Advantages of combined transmembrane topology and signal peptide prediction--the Phobius web server. Nucleic Acids Res 35(Web Server issue):W429-432.
12. http://www.broadinstitute.org/annotation/genome/multicellularity_project/MultiHome.html
13. Hulpiau P & van Roy F (2011) New insights into the evolution of metazoan cadherins. Mol Biol Evol 28(1):647-657.
14. http://www.hmmer.janelia.org
15. Srivastava M, et al. (2010) The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 466(7307):720-726.
16. Putnam NH, et al. (2007) Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317(5834):86-94.
17. King N, et al. (2008) The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature 451(7180):783-788.
18. Bradley RK, et al. (2009) Fast statistical alignment. PLoS Comput Biol 5(5):e1000392.
19. Altschul SF, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389-3402.
20. Oda H, Tagawa K, & Akiyama-Oda Y (2005) Diversification of epithelial adherens junctions with independent reductive changes in cadherin form: identification of potential molecular synapomorphies among bilaterians. Evol Dev 7(5):376-389.
21. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792-1797.
22. Tobi D & Elber R (2000) Distance-dependent, pair potential for protein folding: results from linear optimization. Proteins 41(1):40-46.
23. http://sites.bio.indiana.edu/~michaelslab/yeast_two_hybrid_facility.html
24. Ruiz-Trillo I, Roger AJ, Burger G, Gray MW, & Lang BF (2008) A phylogenomic investigation into the origin of metazoa. Mol Biol Evol 25(4):664-672.