Bio5488 Genomics
Spring 2016
Lectures: Mon, Wed 10:00-11:30 am
Lab: Fri 10:00-11:30 am
4th floor classroom, 4515 MSRB

Outline of the day
• Outline of the course
• What is genomics?
• A little history
• The simple principles of genomics
• Being quantitative
• From a student to an investigator

A few TA administrivia...
• If you didn’t receive an email from [email protected] this week, please email [email protected] or talk to a TA after class
• If you’re taking the lab:
– Please fill out the anonymous survey: https://goo.gl/YfGnEp
– Read assignment 1 (link on the syllabus section in Blackboard)
– Attempt to install the required software (instructions in the resources section of Blackboard)
– Bring your laptop to class on Friday

Course Web Site
• http://www.genetics.wustl.edu/bio5488/home.html
• Blackboard: bb.wustl.edu
• Linux Primer
• Python Primer
• Lecture notes
• Schedule
• Weekly Assignments and Answers
• Weekly Readings

Grading
• 4 credit: ¼ midterm, ¼ final, ½ weekly assignments
• 3 credit: ½ midterm, ½ final

What is the key to your success?
http://www.genome.gov/Director/

History of Bio5488
• 1998: Bio5495 (Eddy)
• 2003: Bio5488 (Cohen, Mitra)
• 2006: Bio5495 (Brent)
• 2012: Bio5488 (Wang, Conrad)
[Timeline labels: Computational Biology; HGP; Functional Genomics; Systems/Synthetic Biology; Microarray; Next-gen Sequencing; ENCODE, GWAS, Roadmap, etc.]

History of Genomics and Epigenomics
1865  Gregor Mendel: founding of genetics
1953  Watson and Crick: double helix model for DNA
1955  Sanger: first protein sequence, bovine insulin
1970  Needleman-Wunsch algorithm for sequence alignment
1977  Sanger: DNA sequencing
1978  The term “bioinformatics” appeared for the first time
1980  The first complete gene sequence (Bacteriophage ΦX174), 5386 bp
1981  Smith-Waterman algorithm for sequence alignment
1981  IBM: first Personal Computer
1983  Kary Mullis: PCR
1986  The term "Genomics" appeared for the first time: name of a journal
1986  The SWISS-PROT database is created
1987  Perl (Practical Extraction and Report Language) is released by Larry Wall
1990  BLAST is published
1995  The Haemophilus influenzae genome (1.8 Mb) is sequenced
1996  Affymetrix produces the first commercial DNA chips
2001  A draft of the human genome (3,000 Mbp) is published

History of Genomics and Epigenomics
• Decades: 90’s, 00’s, 10’s
• Milestones: HGP; Microarray; Next-gen Sequencing; ENCODE, GWAS, Roadmap, etc.
• Topics: Computational Biology, Sequence analysis, Hidden Markov Model, Gene finding, BLAT, Genome Browser, Motif finding, Assembly, Omics, Gene expression, Machine Learning, Data mining, Structural informatics, Drug Design, Statistical Modeling, Database, Comparative Genomics, Evolution, Systems/Synthetic Biology

Genome, genetics, and genomics
• What is a genome?
– The genetic material of an organism.
– A genome contains genes, regulatory elements, and other mysterious stuff.
• What is genetics?
– The study of genes and their roles in inheritance.
• What is genomics?
– The study of all of a person's genes (the genome), including interactions of those genes with each other and with the person's environment

What about genomes
• Characterize the genome
– How big
– How many genes
– How are they organized
• Annotate the genome
– What, where, and how
• Modern genomics: “ChIPer” vs “Mapper”
– Direct measurement
– Inference
– Comparison
– Evolution
• From genome to molecular mechanisms to diseases
– Genomes/epigenomes of diseased cells
• What do you want to learn from this class?
– Being quantitative
– Concept/philosophy
– Techniques/problem solving skills
– Do not forget genetics!!!

Motivation slides

Genomes sequenced
http://www.genomesonline.org/ (as of November 2015)
• 49559 Bacteria, 1136 Archaea, 11122 Eukarya, 4473 viruses
• 18385 metagenome samples
• 2298 Synthetic genomes

The sequence explosion

The Human Genome Project
• The human genome completed! – 2001
• The human genome completed, again! – 2010
• June 26, 2000: President Clinton, with Craig Venter and Francis Collins, announces completion of "the first survey of the entire human genome."
• February 15, 2001
• April 1, 2010

10 years after draft sequence: What have we learned?
• Genome sequencing
• Functional elements
• Evolution of genome
• Basis of diseases
• Human history
• Computational biology

“If I was a senior in college or a first-year graduate student trying to figure out what area to work in, I would be a computational biologist.”
-- During AAAS Meeting Jan 2010

COLLINS: .... Computational biologists are having a really good time and it’s going to get better.
ROSE: Their day is coming?
COLLINS: Their day is here, but it’s going to be even more here in a few years.
COLLINS: They’re going to be the breakthrough artists ....
-- Mar 15, 2010, Charlie Rose Show
Francis Collins, NIH Director

The Human genome: the “blueprint” of our body
• $10^{13}$ different cells in an adult human
• The cell is the basic unit of life
[Photos: James Watson, Francis Crick, Fred Sanger]

GTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCT
CCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATG
AAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAA
ATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATT
AGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG
ATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTG
CATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTA
TTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATT

DNA = linear molecule inside the cell that carries instructions needed throughout the cell’s life ~ long string(s) over a small alphabet
Alphabet of four (nucleotides/bases): {A,C,G,T}

DNA, Chromosome, and Genome

Building an Organism
[Figure labels: DNA, cell]
• Every cell has the same sequence of DNA
• Subsets of the DNA sequence determine the identity and function of different cells

What makes us different?
• Differences between individuals?
• Differences between species?

One genome, thousands of epigenomes
• Embryonic stem cells
• Fetal tissues
• Adult cells and tissues

Understanding the genome
• View from 2000
– Protein-coding genes: 35,000 - 120,000
– Regulatory sequence: less than protein-coding information
– Transposons: junk DNA
All these are WRONG!
• We now know ...

How many genes do we have?
[Figure: what we used to think; gene counts for fly, worm, human, weed, fish, rice. Science 2005]
• Gene numbers do not correlate with organism complexity.
• Many gene families are surprisingly old.

Complexity, Genome Size and the C-value Paradox

Organism                 Genome Size (Mb)
Amoeba                   670,000
Fern                     160,000
Salamander               81,300
Onion                    18,000
Paramecium               8,600
Toad                     6,900
Barley                   5,000
Chimp                    3,600
Gorilla                  3,500
Human                    3,500
Mouse                    3,400
Dog                      3,300
Pig                      3,100
Rat                      3,000
Boa Constrictor          2,100
Zebrafish                1,900
Chicken                  1,200
Fruit fly                180
C. elegans               100
Plasmodium falciparum    25
Yeast, Fission           14
Yeast, Baker's           12
Escherichia coli         4.6
Bacillus subtilis        4.2
H. influenzae            1.8
Mycoplasma genitalium    0.60

Source: www.genomesize.com
C-value: the amount of DNA contained within a haploid nucleus (e.g. a gamete) or one half the amount in a diploid somatic cell of a eukaryotic organism, expressed in picograms (1 pg = $10^{-12}$ g).

Complexity and Organism Specific Genes
• Only 14 out of 731 genes on mouse chromosome 16 have no human homolog (Celera)
• Many genes specific to the mouse are olfactory receptors, also some differences in immunity and reproduction

Most functional information is non-coding
~5% highly conserved, but only 1.5% encodes proteins
[Figure: UCSC Genome Browser view of chr2 (q31.1), roughly chr2:172,660,000-172,675,000, showing the track "UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics" (DLX1, DLX2) and the track "Vertebrate Multiz Alignment & PhastCons Conservation (28 Species)"]
[Figure: base-level view of the same tracks at DLX1, with the multiple alignment of human against chimp, rhesus, bushbaby, tree shrew, mouse, rat, guinea pig, shrew, hedgehog, dog, cat, horse, cow, armadillo, elephant, tenrec, opossum, platypus, lizard, chicken, zebrafish, tetraodon, fugu, stickleback, medaka]
What do they do?

Ultra conserved elements
[Figure label: ultra conserved, e.d 12.5]

HARs: Human accelerated regions
• 118 bp segment with 18 changes between the human and chimp sequences
• Expect less than 1
Human HAR1F differs from the ancestral RNA structure

Main components in the Human genome
[Photo: Barbara McClintock]
• Only 1.5% of the human genome is protein-coding
• Transposable elements make up almost half of the human genome

Transposable Elements (TEs)
TEs can shape transcriptional networks
The LTR10 and MER61 families are particularly enriched for copies with a p53 site. These ERV families are primate-specific and transposed actively near the time when the New World and Old World monkey lineages split.
[Figure: Evolutionary pattern of LTR10B1 genomic copies]
Wang et al. PNAS 2007

Understanding Human Disease
Published Genome-Wide Associations through 12/2013
Published GWA at $p \le 5 \times 10^{-8}$ for 17 trait categories
NHGRI GWA Catalog: www.genome.gov/GWAStudies, www.ebi.ac.uk/fgpt/gwas/

Proposed in Nov 2009
Epigenome, ENCODE Project, HapMap, 1000 Genomes, TCGA

Thinking Quantitatively
Biology Is A Quantitative Science!!!
1) Mendel’s Laws: Gregor Mendel (1822-1884)
2) Chargaff’s Rules: Erwin Chargaff (1905-2002)

Thinking Quantitatively
• Space
– Be comprehensive
• Signal to Noise Ratio
– Sensitivity, specificity, dynamic range
– What is my background control?
• Distributions
– Normal (Gaussian), Poisson, negative binomial, extreme value, hypergeometric, etc.
– Discrete vs continuous
• The P value
• Statistics, Probability, Computation, and Informatics
• Don’t forget genetics!!!

Spaces: Be comprehensive
• Conditions: spatial, temporal, treatment – think about controlling for multiple variables
• Think globally – interaction between local features and global features (Placenta histone example)
• Be comprehensive about what assumptions are made – some we know, some we don’t (genome assembly example)

Signal to Noise
[Figure: relative abundance (0-100) vs m/z (400-1000), ChIP vs input]
Different sources of noise

Sensitivity, Specificity, and Dynamic Range
• Sensitivity
– What is the smallest signal that can reliably be detected (signal to noise)?
– True positive rate
• Specificity
– How well can we discriminate between similar signals?
– True negative rate
• Dynamic Range
– What is the linear range of detection?
– What is the range of natural variation?

Distributions
[Figure panels: Normal (Gaussian), Poisson, Extreme value, Negative binomial]

The P value (for discrete variables)
[Figure: histogram of # of trials vs # of occurrences; bar labels 1, 3, 9, 17, 23, 17, 9, 3, 1]
What is the chance of getting exactly seven?
$P = \dfrac{\#\text{ of trials that were seven}}{\text{total } \#\text{ of trials}} \approx 0.10$
What is the chance of getting seven or better?
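Both questions on this slide come down to counting trials in the histogram. A minimal sketch, assuming the bar heights are 1, 3, 9, 17, 23, 17, 9, 3, 1 for 1 through 9 occurrences (values read off the figure labels, so treat them as an assumption; `counts`, `p_exactly`, and `p_at_least` are illustrative names, not from the lecture):

```python
# Histogram of outcomes: number of occurrences -> number of trials.
# Bar heights are an assumption read off the slide's figure labels.
counts = {1: 1, 2: 3, 3: 9, 4: 17, 5: 23, 6: 17, 7: 9, 8: 3, 9: 1}
total_trials = sum(counts.values())


def p_exactly(k):
    """Fraction of trials with exactly k occurrences."""
    return counts.get(k, 0) / total_trials


def p_at_least(k):
    """Fraction of trials with k or more occurrences (a one-sided P value)."""
    return sum(n for value, n in counts.items() if value >= k) / total_trials


print(f"P(exactly 7)   = {p_exactly(7):.3f}")
print(f"P(7 or better) = {p_at_least(7):.3f}")
```

With these assumed counts the two fractions are 9/83 and 13/83, close to the 0.10 and 0.16 shown on the slides. The second quantity, the chance of a result at least as extreme as the one observed, is what is usually reported as the P value.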
$P = \dfrac{\#\text{ of trials that were seven or better}}{\text{total } \#\text{ of trials}} \approx 0.16$
[Figure: the same histogram of # of trials vs # of occurrences, with the bars for seven or more (9, 3, 1) counted]

“There Are Three Kinds of Lies: Lies, Damned Lies, and Statistics”
-Benjamin Disraeli (1804-1881)

Statistics: drawing inferences from the results of an experiment
• Comparing averages
• Comparing variation

Gaussian (Normal) Distributions I
[Figure: # of people vs height]
• Mean: $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
• Standard Deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Gaussian (Normal) Distributions II
[Figure: # of people vs height]
• Mean ($\bar{x}$ or $\mu$)
• Standard Deviation ($s$ or $\sigma$)
• Compare means (t-tests)
• Compare standard deviations (F-tests)
• Calculate P-values for particular results (z-tests)

“The most important questions of life are …really only problems of probability.”
-Pierre Simon, Marquis de Laplace (1749-1827)

Probability: Computing the chance of a particular outcome of an experiment
[Figure: sample space S containing event E and its complement $E^c$]
P(E) = E/S
P($E^c$) = 1 - P(E)

Independent Events/Mutually Exclusive Events
What is the probability of A and B both occurring?
• P(AB) = P(A)*P(B) (Independent)
• P(A|B) = P(A) (Independent)
• P(A|B)*P(B) = P(B|A)*P(A) (Bayes rule)
• P(AB) = P(A|B) = 0 (Mutually Exclusive)
What is the probability of A or B occurring (independent)?
• P(A)*P(B) + P(A)*(1-P(B)) + (1-P(A))*P(B), that is: both + A only + B only
What is the probability of A or B occurring (mutually exclusive)?
• P(A) + P(B)

An example
Amino acid percentages of Swiss-Prot

Ala (A) 7.81    Gln (Q) 3.94    Leu (L) 9.62    Ser (S) 6.88
Arg (R) 5.32    Glu (E) 6.60    Lys (K) 5.93    Thr (T) 5.45
Asn (N) 4.20    Gly (G) 6.93    Met (M) 2.37    Trp (W) 1.15
Asp (D) 5.30    His (H) 2.28    Phe (F) 4.01    Tyr (Y) 3.07
Cys (C) 1.56    Ile (I) 5.91    Pro (P) 4.84    Val (V) 6.71

What is the probability that a peptide of length 25 contains at least one SP motif?
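One way to approach the question just posed, as a minimal sketch: assume residues are independent draws at the Swiss-Prot frequencies in the table, and treat the 24 adjacent residue pairs of a 25-mer as independent trials (an approximation, since neighboring pairs share a residue). The function name `p_motif_at_least_once` and the `FREQ` dictionary are illustrative, not from the lecture.

```python
# Swiss-Prot amino acid frequencies from the table, as fractions.
FREQ = {"S": 0.0688, "P": 0.0484, "K": 0.0593, "R": 0.0532}


def p_motif_at_least_once(motif, length, freq=FREQ):
    """Approximate P(motif occurs at least once in a random peptide).

    Assumes independent residues and treats each start position as an
    independent trial, which is an approximation for overlapping windows.
    """
    p_motif = 1.0
    for residue in motif:
        p_motif *= freq[residue]
    n_positions = length - len(motif) + 1  # 24 start positions for SP in a 25-mer
    return 1.0 - (1.0 - p_motif) ** n_positions


p_sp = p_motif_at_least_once("SP", 25)
print(f"P(at least one SP in a 25-mer) ~ {p_sp:.3f}")
```

The same function applies to any short motif whose residues appear in `FREQ`; for longer or self-overlapping motifs the independence shortcut becomes less accurate.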
P(SP) = P(S)P(P) = .0688*.0484 = .00333
P($SP^c$) = 1 - P(SP) = .9967
P(no SP anywhere in a 25-mer) = P($SP^c$)$^{24}$ = 0.92 (a 25-mer has 24 adjacent residue pairs)
P(at least one SP in a 25-mer) = 1 - P($SP^c$)$^{24}$ = 0.08

What is the probability that the last residue of a protein is either K or R?
P(K) + P(R) = .0593 + .0532 = .11

Counting permutations
Question: How many different 5-mer sequences can I make using each of the amino acids S, T, A, G, P once and only once?
Answer: 5! = 5*4*3*2*1 = 120

Counting combinations
Question: How many different 5-mer sequences can I make using three Ser’s and two Pro’s?
Answer: $\dfrac{5!}{3! \cdot 2!} = \dfrac{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(3 \cdot 2 \cdot 1)(2 \cdot 1)} = 10$

Example
A particular cluster has 25 coexpressed genes in it. 15 of these genes are annotated as being involved in rRNA transcription. Is 15/25 significant?

Hypergeometric Probability Distribution
the “overlap problem” or sampling without replacement
[Figure: Venn diagram with N, n, s, m]
• N = number of genes in the genome (6000 for yeast)
• n = number of genes in the cluster (25)
• m = number of rRNA transcription genes (109 from MIPS)
• s = number of rRNA transcription genes in the cluster (15)

Hypergeometric Probability Distribution
$P(X \ge s) = \sum_{i=s}^{n} \dfrac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}$, where $\binom{a}{b} = \dfrac{a!}{b!\,(a-b)!}$

Hypergeometric Probability Distribution
[Figure: Venn diagram with N = 6000, n = 25, s = 15, m = 109]
$P(X \ge 15) = \sum_{i=15}^{25} \dfrac{\binom{109}{i}\binom{6000-109}{25-i}}{\binom{6000}{25}} < 1 \times 10^{-16}$

Therefore in our example…
[Figure: Venn diagram with N = 6000, n = 25, s = 15, m = 109]
15/25 rRNA transcription genes in the cluster is significant.
BUT…
1) 10 out of 25 genes in the cluster are not rRNA transcription genes.
2) 94 rRNA transcription genes are not in the cluster.
3) What about the other genes in the cluster?
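The hypergeometric tail probability for the cluster example can be evaluated exactly with integer binomial coefficients. A minimal sketch using only the Python standard library (`math.comb`, available from Python 3.8); the function name `hypergeom_tail` and its argument order are illustrative choices, and the inputs are the yeast numbers from the example:

```python
from math import comb


def hypergeom_tail(N, m, n, s):
    """P(X >= s) when drawing n genes without replacement from N genes,
    of which m carry the annotation of interest."""
    upper = min(n, m)  # the overlap cannot exceed either set
    favorable = sum(comb(m, i) * comb(N - m, n - i) for i in range(s, upper + 1))
    return favorable / comb(N, n)  # exact integers, converted to float at the end


# Yeast example from the slides: 6000 genes, 109 rRNA transcription genes,
# a cluster of 25 genes, 15 of them annotated.
p_value = hypergeom_tail(N=6000, m=109, n=25, s=15)
expected_overlap = 25 * 109 / 6000  # mean overlap expected by chance

print(f"expected overlap by chance: {expected_overlap:.2f} genes")
print(f"P(X >= 15) = {p_value:.3g}")
```

By chance the cluster would be expected to contain fewer than one rRNA transcription gene, so an overlap of 15 sits far out in the tail. Summing exact integers and dividing once at the end avoids the underflow that can appear when many small probabilities are multiplied together.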