Human Genome Sequencing Project: What Did We Learn?

02-223 Personalized Medicine: Understanding Your Own Genome, Fall 2014

Our Goal is…
• To understand your own genome
  – Imagine you just got your genome sequenced. What can you learn from your genome sequence?
• To understand our genomes
  – What can you learn by comparing your genome with others' genomes, and how?

Human Genome Sequencing Project
• International Human Genome Sequencing Consortium ('public project')
  – Initial Sequencing and Analysis of the Human Genome. Nature 409:860-921, 2001
  – 20 centers from 6 countries (US, UK, China, France, Germany, Japan)
  – Also sequenced yeast, worm, mouse, and regions of mammalian genomes
• Celera Genomics, Venter JC et al. ('private project')
  – The Sequence of the Human Genome. Science 291:1304-1351, 2001

Public Project: Sequencing Strategy
• Shotgun Phase
  – The genome is divided into appropriately sized segments, and each segment is covered to a high degree of redundancy (typically eight- to tenfold) through the sequencing of randomly selected subfragments
  – Challenges in sequencing the repeat regions!
  – Hierarchical shotgun sequencing was used to produce a draft sequence of 90% of the genome
• Finishing Phase
  – Fill in gaps and resolve ambiguities

Hierarchical Shotgun Sequencing
• Strategy for sequencing repeat-rich genomes such as the human genome
  – Generate and organize a set of large-insert clones (typically 100-200 kb each) covering the genome, then separately perform shotgun sequencing on appropriately chosen clones.
  – Because the sequence information is local, the issue of long-range misassembly is eliminated and the risk of short-range misassembly is reduced.

Sample Selection
• Deciding whose genome to sequence
  – DNA obtained from anonymous human donors in accordance with US Federal Regulations for the Protection of Human Subjects in Research (45CFR46) and following full review by an Institutional Review Board.
  – Volunteers of diverse backgrounds
  – Samples were obtained after discussion with a genetic counselor and written informed consent.
  – The samples were made anonymous.
  – The processing laboratory chose samples at random from which to prepare DNA.

Automated Production Line for Sample Preparation
[Figure: Nature 409:860-921, 2001]

What Did We Learn from the Draft of the Human Genome Sequence?
• Long-range variation in GC content
  – $\text{GC content} = \frac{G+C}{A+T+G+C} \times 100$ (the proportion of G's and C's in the given region of the sequence, as a percentage)
  – GC-rich and GC-poor regions may have different biological properties, such as gene density, composition of repeat sequences, and recombination rate.
  – Genome-wide average: 41%
  – CpG islands
  [Figure: GC content in a 100 Mb region of chromosome 1]
• Classes of interspersed repeat in the human genome
• Gene content of the human genome
  – Human protein-coding genes tend to have small exons (encoding an average of only 50 codons) separated by long introns (some exceeding 10 kb)
  – Coding sequences comprise only a few percent of the genome and an average of about 5% of each gene (the introns are long)
• Can we identify genes computationally by scanning the sequenced genome?
  – Hidden Markov models (HMMs)
    • A state-space model that represents the gene structure (intron, exon, intergenic region, etc.)
    • HMMs were popular for speech recognition and were applied to the problem of finding genes in a genome sequence

Machine Learning Approach
• Learn a model for genes
  – Training data: genome sequences for a set of known genes
  – Model: HMMs
• Use the model (HMM) to analyze a previously unseen dataset (e.g., detecting genes in a new sequence)

Machine Learning Approach for Sequence Modeling
• Markov models
  – Modeling observed sequences
• Hidden Markov models
  – Modeling observed sequences generated by unobserved hidden processes
  – Observed sequences: ACTTAAAGG…
  – Unobserved hidden processes: gene region, intergenic region, …

Markov Models
• Observations are generated from the model

Ingredients of Our Markov Model
• Collection of states: $\{S_{\text{sunny}}, S_{\text{rainy}}, S_{\text{snowy}}\}$
• State transition probabilities (transition matrix), with rows giving the state at day t-1 and columns the state at day t, both ordered Sunny, Rainy, Snowy:

$$A = \begin{pmatrix} .8 & .15 & .05 \\ .38 & .6 & .02 \\ .75 & .05 & .2 \end{pmatrix}$$

• Initial state distribution over (Sunny, Rainy, Snowy): $\pi = (.7, .25, .05)$
• Reading the matrix: the entry in the Rainy row and Sunny column (.38) is P(Sunny at day t | Rainy at day (t-1))

Ingredients of a Markov Model
• Collection of states: $\{S_1, S_2, \ldots, S_N\}$
• State transition probabilities (transition matrix): $A_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$
• Initial state distribution: $\pi_i = P(q_1 = S_i)$
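
With these three ingredients, the probability of a state sequence is the initial probability of its first state multiplied by one transition probability per step. The following is a minimal Python sketch of that calculation using the numbers from the weather example. The names `STATES`, `PI`, `A`, and `sequence_probability` are illustrative and not from the slides, and the matrix is read with rows as the current state and columns as the next state, which is the reading under which each row sums to 1.

```python
# Illustrative sketch: state order and all names are assumptions for this example.
STATES = ["sunny", "rainy", "snowy"]
INDEX = {name: i for i, name in enumerate(STATES)}

# Initial state distribution and transition matrix from the weather example.
# A[i][j] is read as P(next state = j | current state = i).
PI = [0.7, 0.25, 0.05]
A = [
    [0.80, 0.15, 0.05],  # from sunny
    [0.38, 0.60, 0.02],  # from rainy
    [0.75, 0.05, 0.20],  # from snowy
]


def sequence_probability(sequence):
    """Probability of a state sequence under the first-order Markov chain."""
    if not sequence:
        return 1.0  # the empty sequence has no events to score
    idx = [INDEX[name] for name in sequence]
    prob = PI[idx[0]]  # probability of starting in the first state
    for prev, cur in zip(idx, idx[1:]):
        prob *= A[prev][cur]  # one transition term per consecutive pair
    return prob


if __name__ == "__main__":
    days = ["sunny", "rainy", "rainy", "rainy", "snowy", "snowy"]
    print(sequence_probability(days))  # approximately 0.0001512
```

For long sequences the product underflows quickly, so a practical version would probably sum log-probabilities instead; the plain product is kept here to mirror the hand calculation.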
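
Scoring a Sequence Given a Markov Model
• Probability of a sequence of events, here the state sequence Sunny, Rainy, Rainy, Rainy, Snowy, Snowy:

$$P(S_{\text{sunny}}) \times P(S_{\text{rainy}} \mid S_{\text{sunny}}) \times P(S_{\text{rainy}} \mid S_{\text{rainy}}) \times P(S_{\text{rainy}} \mid S_{\text{rainy}}) \times P(S_{\text{snowy}} \mid S_{\text{rainy}}) \times P(S_{\text{snowy}} \mid S_{\text{snowy}})$$
$$= 0.7 \times 0.15 \times 0.6 \times 0.6 \times 0.02 \times 0.2 = 0.0001512$$

Hidden Markov Models (HMMs)
• Observations are generated from the model

Ingredients of Our HMM
• States: $\{S_{\text{sunny}}, S_{\text{rainy}}, S_{\text{snowy}}\}$
• Observations: $\{O_1, O_2, \ldots, O_M\}$
• State transition probabilities (transition matrix):

$$A = \begin{pmatrix} .8 & .15 & .05 \\ .38 & .6 & .02 \\ .75 & .05 & .2 \end{pmatrix}$$

• Initial state distribution: $\pi = (.7, .25, .05)$
• Observation probabilities (emission matrix), with one row per state and one column per observation:

$$B = \begin{pmatrix} .8 & .15 & .05 \\ .38 & .6 & .02 \\ .75 & .05 & .2 \end{pmatrix}$$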
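
A minimal Python sketch of how such a model generates data: a hidden state is drawn, that state emits an observation, and the chain moves to the next state. The same ingredients also give the joint probability of a hidden path together with its observations. Several things here are assumptions and not part of the slides: the observation labels `O1`, `O2`, `O3` are placeholders, the first entry of each matrix is read as .8 so that every row sums to 1, rows of `A` are taken as the current state, and the function names and the random seed are hypothetical.

```python
import random

# Illustrative sketch: labels, names, and seed are assumptions for this example.
STATES = ["sunny", "rainy", "snowy"]
OBSERVATIONS = ["O1", "O2", "O3"]  # placeholder observation labels

PI = [0.7, 0.25, 0.05]          # initial state distribution
A = [[0.80, 0.15, 0.05],        # A[i][j] = P(next state j | current state i)
     [0.38, 0.60, 0.02],
     [0.75, 0.05, 0.20]]
B = [[0.80, 0.15, 0.05],        # B[j][k] = P(observation k | state j)
     [0.38, 0.60, 0.02],
     [0.75, 0.05, 0.20]]


def generate(length, rng):
    """Draw a hidden state path and the observation emitted at each step."""
    hidden, observed = [], []
    state = rng.choices(range(len(STATES)), weights=PI)[0]
    for _ in range(length):
        symbol = rng.choices(range(len(OBSERVATIONS)), weights=B[state])[0]
        hidden.append(STATES[state])
        observed.append(OBSERVATIONS[symbol])
        # Move to the next hidden state according to the current state's row of A.
        state = rng.choices(range(len(STATES)), weights=A[state])[0]
    return hidden, observed


def joint_probability(hidden, observed):
    """P(O, S) for a non-empty hidden path and its equally long observation list."""
    s = [STATES.index(name) for name in hidden]
    o = [OBSERVATIONS.index(name) for name in observed]
    prob = PI[s[0]] * B[s[0]][o[0]]  # start state and its emission
    for t in range(1, len(s)):
        prob *= A[s[t - 1]][s[t]] * B[s[t]][o[t]]  # transition, then emission
    return prob


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the run repeatable
    hidden, observed = generate(7, rng)
    print(hidden)
    print(observed)
    print(joint_probability(hidden, observed))
```

In gene finding only the observations (nucleotides) are available, so the hidden path has to be inferred; this sketch covers only the generative direction and the scoring of a known path.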

Ingredients of an HMM
• Collection of states (unobserved): $\{S_1, S_2, \ldots, S_N\}$
• Observations: $\{O_1, O_2, \ldots, O_M\}$
• State transition probabilities (transition matrix): $A_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$
• Initial state distribution: $\pi_i = P(q_1 = S_i)$
• Observation probabilities: $B_j(k) = P(v_t = O_k \mid q_t = S_j)$

Probability of a Sequence of Events
• Hidden state sequence: $S_{\text{snowy}}, S_{\text{snowy}}, S_{\text{rainy}}, S_{\text{rainy}}, S_{\text{rainy}}, S_{\text{rainy}}, S_{\text{rainy}}$
• Joint probability of observations and states:

$$P(O,S) = P(S_{\text{snowy}})\,P(O_{\text{gloves}} \mid S_{\text{snowy}})\,P(S_{\text{snowy}} \mid S_{\text{snowy}})\,P(O_{\text{gloves}} \mid S_{\text{snowy}})\,P(S_{\text{rainy}} \mid S_{\text{snowy}})\,P(O_{\text{umbrella}} \mid S_{\text{rainy}}) \cdots$$

HMMs and Gene Structure
• A sequence of nucleotides {A, C, G, T} is observed
• Gene annotations {intergenic, start/stop, coding} are unobserved hidden processes
• Different states generate nucleotides at different frequencies
• A simple HMM for unspliced genes:
    AAAGC ATG CAT TTA ACG AGA GCA CAA GGG CTC TAA TGCCG
• The sequence of states is an annotation of the generated string: each nucleotide is generated in an intergenic, start/stop, or coding state

Genscan Example
• Developed by Chris Burge, 1997
• One of the most accurate ab initio programs for finding genes

GenScan States
• N - intergenic region
• P - promoter
• F - 5' untranslated region
• Esngl - single exon (intronless) (translation start -> stop codon)
• Einit - initial exon (translation start -> donor splice site)
• Ek - phase k internal exon (acceptor splice site -> donor splice site)
• Eterm - terminal exon (acceptor splice site -> stop codon)
• Ik - phase k intron: 0, between codons; 1, after the first base of a codon; 2, after the second base of a codon

Summary
• Human genome sequencing project
  – Shotgun sequencing
  – First global view of the whole genome in terms of repeat regions, GC content, CpG islands, genes, etc.
  – Finding genes from the whole genome sequence
    • Hidden Markov models for modeling a complex gene structure and scanning the genome to find genes

Acknowledgement
• Sanja Rogic's lecture slides "Computational Gene Finding"