Phylogenetic Trees Lecture 11 Sections 6.1, 6.2, in Setubal et. al., 7.1, 7.1 Durbin et. al. © Shlomo Moran, based on Nir Friedman. Danny Geiger, Ilan Gronau . Evolution Evolution of new organisms is driven by Diversity Different individuals carry different variants of the same basic blue print Mutations The DNA sequence q can be changed due to single base changes, deletion/insertion of DNA segments, etc. Selection bias 2 1 Theory of Evolution Basic idea speciation events lead to creation of different species (speciation: physical separation into groups where different genetic variants become dominant) Any two species share a (possibly distant) common ancestor This is described by a rooted tree – the tree of life. 3 Tree of Life Any two species share a (possibly distant) common ancestor The process of evolution consists of: speciation events. events mutations along evolutionary branches. Source: Alberts et al 4 2 Often only a subtree is studied Definition: A phylogeny is a tree that describes the sequence of speciation events that lead to the forming of a set of current day species; also called a phylogenetic tree. 5 Components of Phylogenenetic Trees Aardvark Bison Chimp Dog Elephant Leaves - current day species (or taxa – plural of taxon) Internal vertices - hypothetical common ancestors Edges length - “time” from one speciation to the next The Tree Topology – the tree structure, ignoring edge lengths. Usually the goal is to find the topolgy. 6 3 Historical Note Until mid 1950’s phylogenies were constructed by experts based on their opinion (subjective criteria) Since then, focus on objective criteria for constructing phylogenetic trees Thousands of articles in the last decades Important for many aspects of biology Classification Understanding biological mechanisms 7 Outline A. Introduction (this lecture) 1. The p phylogenetic y Reconstruction Problem: from sequences to trees 2.Morphological vs. molecular sequences 3. Possible pitfalls 4. Directed and undirected trees 5. The “big” problem, the “small” problem. . 4 Outline (cont) B. Character based methods (this + next lectures) 1. Perfect Phylogeny y y 2. Maximum Parsimony 3. Maximum Likelihood (not studied in this course) • These methods consider the evolution of each character separately. • Try to find the tree which gives the “best” best evolutionary explanation: - least number of observed mutations (1&2), or most probable tree (3). • These optimization problems are typically NP-hard. • We’ll discuss ways for solving simplified versions of the problems. . Outline (cont) C. Distance based methods (last 1-2 lectures) - Run in polynomial time p distances between all taxon-pairs p - Compute - Find a tree (edge-weighted) best-describing the distances D 0 9 0 21 19 15 16 22 20 16 17 0 18 14 15 0 8 9 0 3 0 4 5 10 7 6 1 2 1 2 5 Outline (end) Distance Methods (cont.) 1. Efficient reconstruction (O(n2) time) from accurate distances 2. Reconstruction from noisy distances: Can we reconstruct accurate trees from approximate distances? • Worst-case noise model • More realistic noise models: inter-species distances derived from probabilistic models of mutations. p . Evolution as a Tree ACGGTCA AAAGTCA ACGGATA ACGGGTA AAAGGCG AAAGCTG AAACACA GGGGATT TCTGGTA GAACGTA ACCCGTG AATCCTG AATGGGC ATAGCTG AAACCGA TCTGGGA TCCGGAA ACCGTTG AGCCGTG 12 6 Phylogenetic Reconstruction GGGGATT GAACGTA AATCCTG AATGGGC ATAGCTG GC G AAACCGA TCTGGGA ACCGTTG TCCGGAA AGCCGTG 13 Phylogenetic Reconstruction A : AATGGGC B: C: D: E: F: G: H: I: J: AATCCTG ATAGCTG GAACGTA AAACCGA GGGGATT TCTGGGA TCCGGAA AGCCGTG ACCGTTG reconstruct F D B A G C E H I J Goal: reconstruct the ‘true’ tree as accurately as possible 14 7 What are the sequences? Morphological vs. Molecular Classical methods. morphological features: number of legs, lengths of legs, etc. Modern methods. molecular features: Gene (DNA) sequences Protein sequences Analysis based on homologous sequences (e.g., globins) in different species 16 Possible pitfall in reconstruction: Misleading selection of sequences Gene/protein sequences can be homologous for several different reasons: Orthologs -- sequences diverged after a speciation event Paralogs -- sequences diverged after a duplication p event ((next slides)) Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus) 17 8 Misleading selection of sequences: Using paralogs instead of orthologs Consider evolutionary tree of three taxa: Gene Duplication …and assume that at some point in the past a gene duplication event occurred. 1 3 2 18 Paralogs instead of Orthologs The gene evolution is described by this tree (1,2,3 are species; A, B are the copies of the same gene). Gene Duplication Copy B Copy A Speciation events 1A 2A 3A 3B 2B 1B 19 9 Paralogs instead of Orthologs If we happen to consider genes 1A, 2B, and 3A of species 1,2,3, we get a wrong tree. S Gene Duplication S 1A 2A Speciation events 3A 3B S 2B 1B In the sequel we assume all given sequences are orthologs – created from a common ancestor by specification events. 20 Rooted vs. Undirected Trees A natural representation of phylogeny is rooted trees Common Ancestor 21 10 Types of trees Unrooted tree represents the same phylogeny without the root node Most known tree-reconstruction techniques do not distinguish between different placements of the root. 22 Rooted versus unrooted trees Tree a Tree b Tree c b a c Represents the three rooted trees 23 11 Positioning Roots in Unrooted Trees We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest Proposed root Falcon Aardvark Bison Chimp Dog Elephant 24 Two phylogenenetic trees of the same species: Do these trees represent the same evolutionary history? Aardvark Bison Chimp Dog Chimp Aardvark Bison Elephant Elephant Dog 25 12 When two unrooted phylogenetic trees are considered different? Trees T1 and T2 on the same set of species are considered identical if they represent the same evolutionary history, i.e.: they have the same topology. Formally, this is equivalent to: There is a tree isomorphism h: T1 T2 s.t: For each species x, h(x)=x. h( ) 26 The two trees represent the same evolution v u w Aardvark Bison Chimp Dog h(v) Elephant h(u) h(w) Chimp Aardvark Bison Elephant Dog 27 13 The “Big” reconstruction problem, the “Small” problem The “big” problem: compute the whole phylogenetic tree from the n input sequences. sequences The “small” problem: Assume the tree topology and the identities of the leaf-species are known. Reconstruct the sequences at the internal vertices, and give a score to the resulted phylogeny. Connection between the problems: In order to solve the big problem, problem solve the small problem on all possible trees with n leaves, and output the tree(s) with the highest “score”. This is impossible in practice for more than few taxa. 28 Input for the “big” problem A: B B: C: D: E: CAGGTA CAGACA CGGGTA TGCACT TGCGTA Our task: Find evolutionary tree with leafs corresponding to the 5 sequences, which best explains the evolution of the strings. 29 14 Input for the “small” problem Aardvark Bison Chimp Dog A: B: C: D: E: CAGGTA CAGACA CGGGTA TGCACT TGCGTA Elephant The tree and assignments of strings to the leaves is given, and we need only to assign strings to internal vertices. 30 Character-based methods for constructing phylogenies IIn this hi approach, h trees are constructedd by b comparing i the h characters h off the corresponding sequences. Characters may be morphological (teeth structures) or molecular (nucleotides in homologous DNA sequences). We will present two methods: “Perfect Phylogeny” and “Maximum Parsimony” Basic Assumption in these methods: Best tree is one with minimal number of observed mutations (character changes along the edges, aka substitutions). 31 15 Character based methods: Input data species C1 C2 C3 C4 … Cm dog A A C A G G T C T T C G A G G C C C horse A A C A G G C C T A T G A G A C C C frog A A C A G G T C T T T G A G T C C C human A A C A G G T C T T T G A T G A C C pig A A C A G T T C T T C G A T G G C C * * * * * * * * * * * • Each character (column) is processed independently. • The green character will separate the human and pig from frog, horse and dog. • The red character will separate the dog and pig from frog, horse and human. 32 The perfect phylogeny problem A character is assumed to be a significant g property, p p y, which distinguishes between species (e.g. dental structure, number of legs/limbs). A characters state is a value of the character (eg: human dental structure). Assumption: It is unlikely that a given state will be created twice in the evolution tree. Such characters are called “Homoplasy free”, and are detailed next. 33 16 Homoplasy-free characters 1 Homoplasy free characters should avoid: reversal transitions A species regains a state it’s direct ancestor has lost. Famous known exceptions: Teeth in birds. Legs in snakes. snakes 34 Homoplasy-free characters 2 …and also avoid convergence transitions Two species possess the same state while their least common ancestor possesses a different state. Famous known exceptions: The marsupials. 35 17 The Perfect Phylogeny Problem Input: 1. A set of species 2 A set of characters 2. 3. For each character, assignment of states to the species Problem: Is there a phylogenetic tree T=(V,E), s.t. the evolution of all characters is “homoplasy free” (no reversal, no convergence) First, we define the problem using graphtheoretic terms. 36 Characters = Colorings A coloring of a tree T=(V,E) is a mapping C:V [set of colors] A partial coloring of T is a coloring of a subset of the vertices U V: C:U [set of colors] U= 37 18 Characters as Colorings Each character defines a (partial) coloring of the corresponding phylogenetic tree: Species ≡ Vertices States ≡ Colors 38 Convex Colorings (and Characters) Let T=(V,E) be a partially colored tree, and d be a color. The d-carrier is the minimal subtree of T containing all vertices colored d Definition: A (partial/total) coloring of a tree is convex iff all d-carriers are disjoint 39 19 Convexity Homoplasy Freedom A character is Homoplasy free (avoids reversal and convergence transitions) ↕ The corresponding (partial) coloring is convex 40 The Perfect Phylogeny Problem (pure graph theoretic setting) Input: Partial colorings (C1,…,Ck) of a set of vertices U (in the example: 3 total colorings: left left, center, center right, right each by two colors). colors) R G A B B P R G P R B P Problem: Is there a tree T=(V,E), s.t. UV and for i=1,…,k,, Ci is a convex (partial) coloring of T? PP is NP-Hard In general In the tutorial you will see a special case solvable in p-time. 41 20 Maximum Parsimony Perfect Phylogeny is not only hard to compute, but in y cases it doesn’t exist. many Next we discuss a more common approach, called “Maximum Parsimony”, which looks for a tree which minimizes the number of mutations. 42 Maximum Parsimony A Character-based method Input: h sequences (one per species), all of length k. Goal: Find a tree whose leaves are labeled by the input sequences and an assignment of sequences to internal sequences, nodes, such that the total number of substitutions is minimized. 43 21 Example Input: four nucleotide sequences: AAG, AAA, GGA, AGA taken from four species. By th B the parsimony i principle, i i l we seekk a tree t whose h leaves l are labeled l b l d by the input sequences, and assignment of sequences to internal vertices, with minimum total number of mutations (ie, letter changes) along the tree edges. Here is one possible tree + sequences assignment. AAA AAA 1 GGA AAG AAA 2 AAA 1 AGA Total #substitutions = 4 44 Example Continued Here are two other trees+ sequence assignments: AAA 1 AAA 1 AAG AAA AAA AAA AGA 1 GGA AGA 1 AAG 1 AGA AAA AAA 2 GGA Total #substitutions = 3 Total #substitutions = 4 The left solution is preferred over the right one. A solution has two parts: First, select a tree and label its leaves by the input sequences; then, assign sequences to the internal vertices. 45 22 Parsimony score AAA 1 AAA 1 AAG AAA AAA AAA AGA 1 GGA Parsimony score = 3 AGA 1 AAG AAA 1 AGA AAA 2 GGA Parsimony score = 4 The parsimony score of a leaf-labeled tree T is the minimum possible number of mutations over all assignments of sequences to internal vertices of T. 47 Parsimony Based Reconstruction We have here both the small and big problems: 1. The small problem: find the parsimony score for a given leaf labeled tree. 2. The big problem: Find a tree whose leaves are labeled by y the input p sequences, q , with the minimum possible parsimony score. 3. We will see efficient algorithms for (1). (2) is hard. 48 23
© Copyright 2026 Paperzz