The Human genome

Bio5488
Genomics
Spring, 2016
Lectures: Mon, Wed 10:00-11:30 am
Lab: Fri 10:00-11:30 am
4th floor classroom
4515 MSRB
Outline of the day
•
•
•
•
•
•
Outline of the course
What is genomics?
A little history
The simple principles of genomics
Being quantitative
From a student to an investigator
A few TA administrivia...
• If you didn’t receive an email from
[email protected] this week, please
email [email protected] or talk to a
TA after class
• If you’re taking the lab:
– Please fill out the anonymous survey:
https://goo.gl/YfGnEp
– Read assignment 1 (link on the syllabus
section in Blackboard)
– Attempt to install the required software
(instructions in the resources section of
Blackboard)
– Bring your laptop to class on Friday
Course Web Site
• http://www.genetics.wustl.edu/bio548
8/home.html
• Blackboard: bb.wustl.edu
• Linux Primer
• Python Primer
• Lecture notes
• Schedule
• Weekly Assignments and Answers
• Weekly Readings
Grading
4 credit
3 credit
• ¼ midterm
• ½ midterm
• ¼ final
• ½ weekly assignments • ½ final
What is the key to your success?
http://www.genome.gov/Director/
History of Bio5488
Computational
Biology
HGP
Functional
Genomics
1998
Bio5495
Eddy
Systems/Synt
hetic
Biology
Microarray
2003
Bio5488
Cohen
Mitra
2006
Bio5495
Brent
Next-gen
Sequencing
ENCODE
GWAS
Roadmap
etc
2012
Bio5488
Wang
Conrad
History of Genomics and Epigenomics
1865
1953
1955
1970
1977
1978
1980
1981
1981
1983
1986
1986
1987
1990
1995
1996
2001
Gregor Mendel: founding of genetics
Watson and Crick: double helix model for DNA
Sanger: first protein sequence, bovine insulin
Needleman-Wunsch algorithm for sequence alignment
Sanger: DNA sequencing
The term “bioinformatics” appeared for the first time
The first complete gene sequence (Bacteriophage FX174), 5386 bp
Smith-Waterman algorithm for sequence alignment
IBM: first Personal Computer
Kary Mullis: PCR
The term "Genomics" appeared for the first time: name of a journal
The SWISS-PROT database is
Perl (Practical Extraction Report Language) is released by Larry Wall.
BLAST is published
The Haemophilus influenzea genome (1.8 Mb) is sequenced
Affymetrix produces the first commercial DNA chips
A draft of the human genome (3,000 Mbp) is published
History of Genomics and Epigenomics
HGP
90’s
Computational
Biology
Sequence analysis
Hidden Markov Model
Gene finding
BLAT
Genome Browser
Motif finding
Assembly
Microarray
00’s
Omics
Gene
expression
ENCODE
GWAS
Roadmap etc
10’s
Next-gen
Sequencing
Machine Learning
Data mining
Structural informatics
Drug Design
Statistical Modeling
Database
Comparative
Genomics
Evolution
Systems/Synt
hetic
Biology
Genome, genetics, and genomics
• What is a genome?
– The genetic material of an organism.
– A genome contains genes, regulatory elements, and other
mysterious stuff.
• What is genetics?
– The study of genes and their roles in inheritance.
• What is genomics
– The study of all of a person's genes (the genome), including
interactions of those genes with each other and with the
person's environment
What about genomes
•
Characterize the genome
– How big
– How many genes
– How are they organized
•
Annotate the genome
– What, where, and how
•
Modern genomics: “ChIPer” vs “Mapper”
–
–
–
–
•
Direct measurement
Inference
Comparison
Evolution
From genome to molecular mechanisms to diseases
– Genomes/epigenomes of diseased cells
•
What do you want to learn from this class?
–
–
–
–
Being quantitative
Concept/philosophy
Techniques/problem solving skills
Do not forget genetics!!!
Motivation slides
Genomes sequenced
http://www.genomesonline.org/, (as of November 2015)
49559 Bacteria, 1136 Archaea, 11122 Eukarya, 4473 viruses
18385 metagenome samples
2298 Synthetic genomes
The sequence explosion
The Human Genome Project
The human genome completed!
– 2001
The human genome completed,
again! – 2010
April 1, 2010
June 26, 2000
President Clinton, with
Craig Venter and
Francis Collins,
announces completion
of "the first survey of
the entire human
genome."
February 15, 2001
10 years after draft sequence:
What have we learned?
•
•
•
•
•
•
Genome sequencing
Functional elements
Evolution of genome
Basis of diseases
Human history
Computational biology
“
If I was a senior in college or a
first-year graduate student trying to
figure out what area to work in, I
would be a computational
biologist.
”
-- During AAAS Meeting Jan 2010
COLLINS: .... Computational biologists
are having a really good time and it’s
going to get better.
ROSE: Their day is coming?
COLLINS: Their day is here, but it’s
going to be even more here in a few
years.
COLLINS: They’re going to be the
breakthrough artists ....
-- Mar 15, 2010, Charlie Rose Show
Francis Collins
NIH Director
The Human genome:
the “blueprint” of our body
1013 different cells in
an adult human
The cell is the basic
unit of life
James Watson
Francis Crick
Fred Sanger
GTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCT
CCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATG
AAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAA
ATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATT
AGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG
ATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTG
CATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTA
TTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATT
DNA = linear
molecule inside the
cell that carries
instructions needed
throughout the
cell’s life ~ long
string(s) over a
small alphabet
Alphabet of four
(nucleotides/bases)
{A,C,G,T}
DNA, Chromosome, and Genome
Building an Organism
DN
cellA
Every cell has the same sequence of DNA
Subsets of the DNA
sequence determine the
identity and function of
different cells
What makes us different?
Differences between individuals?
Differences between species?
One genome, thousands of epigenomes
Embryonic
stem cells
Fetal
tissues
Adult cells
and tissues
Understanding the genome
• View from 2000
– Protein-coding genes
35,000 - 120,000
– Regulatory sequence
Less than proteincoding information
– Transposons
Junk DNA
All these are WRONG!
• We now know ...
How many genes do we have?
What we used to think
fly
worm
human
weed
fish
rice
Science 2005
Gene numbers do not correlate with organism complexity.
Many gene families are surprisingly old.
Complexity, Genome Size and the C-value Paradox
Organism
Amoeba
Fern
Salamander
Onion
Paramecium
Toad
Barley
Chimp
Gorilla
Human
Mouse
Dog
Pig
Rat
Boa Constrictor
Zebrafish
Chicken
Fruit fly
C. elegans
Plasmodium falciparum
Yeast, Fission
Yeast, Baker's
Escherichia coli
Bacillus subtilis
H. influenzae
Mycoplasma genitalium
Genome Size (MB)
670,000
160,000
81,300
18,000
8,600
6,900
5,000
3,600
3,500
3,500
3,400
3,300
3,100
3,000
2,100
1,900
1,200
180
100
25
14
12
4.6
4.2
1.8
0.60
www.genomesize.com
C-value: the amount of
DNA contained within a
haploid nucleus (e.g. a
gamete) or one half the
amount in a diploid
somatic cell of a
eukaryotic organism,
expressed in picograms
(1pg = 10−12 g).
Complexity and Organism Specific Genes
• Only 14 out of 731 genes on mouse
chromosome 16 have no human homolog
(Celera)
• Many genes specific to the mouse are
olfactory receptors, also some differences in
immunity and reproduction
Mostfunctional
functionalinformation
information is
is non-coding
non-coding
Most
!5% highly conserved, but only 1.5% encodes proteins
21
chr2 (q31.1)
c
h
r
2
:
p14
1
7
2
6
6
0
0
0
0
2p12
13
31.1
q34 q35
1
7
2
6
6
5
0
0
0
1
7
2
6
7
0
0
0
0
U
C
S
C
G
e
n
e
s
B
a
s
e
d
o
n
R
e
fS
e
q
,U
n
iP
r
o
t,G
e
n
B
a
n
k
,C
C
D
S
a
n
d
C
o
m
p
a
r
a
tiv
e
G
e
n
o
m
ic
s
1
7
2
6
7
5
0
0
0
D
L
X
1
D
L
X
2
V
e
r
te
b
r
a
te
M
u
ltiz
A
lig
n
m
e
n
t&
P
h
a
s
tC
o
n
s
C
o
n
s
e
r
v
a
tio
n
(
2
8
S
p
e
c
ie
s
)
V
e
r
te
b
r
a
te
C
o
n
s
C
h
im
p
R
h
e
s
u
s
B
u
s
h
b
a
b
y
T
r
e
e
_
s
h
r
e
w
M
o
u
s
e
R
a
t
G
u
in
e
a
_
P
ig
S
h
r
e
w
H
e
d
g
e
h
o
g
D
o
g
C
a
t
H
o
r
s
e
C
o
w
A
r
m
a
d
illo
E
le
p
h
a
n
t
T
e
n
r
e
c
O
p
o
s
s
u
m
P
la
ty
p
u
s
L
iz
a
r
d
C
h
ic
k
e
n
Z
e
b
r
a
fis
h
T
e
tr
a
o
d
o
n
F
u
g
u
S
tic
k
le
b
a
c
k
M
e
d
a
k
a
D
L
X
1
G
a
p
s
H
u
m
a
nA
C
h
im
pA
R
h
e
s
u
sA
B
u
s
h
b
a
b
yA
T
r
e
e
_
s
h
r
e
wA
M
o
u
s
eA
R
a
t A
G
u
in
e
a
_
P
ig C
S
h
r
e
wA
H
e
d
g
e
h
o
gA
D
o
gA
C
a
t A
H
o
r
s
eA
C
o
wA
A
r
m
a
d
illo A
E
le
p
h
a
n
t A
T
e
n
r
e
cA
O
p
o
s
s
u
mA
P
la
ty
p
u
sA
K
P
A
A
A
A
A
A
A
C
A
A
A
A
A
A
A
A
A
A
A
1
CC
CC
CC
CC
CC
CC
CC
C T
CC
CC
CC
CC
CC
CC
CC
CC
C T
CC
CC
A C
A C
A C
A C
A C
A C
A C
CC
A C
GC
A C
A C
A C
A C
A C
A C
A C
A C
A C
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
R
T
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
GGA
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
GA
GA
GA
GA
GA
A A
A A
A A
GA
A A
GA
GA
GA
GA
GA
A A
GA
T A
T A
I
U
C
S
C
G
e
n
e
s
B
a
s
e
d
o
n
R
e
fS
e
q
,U
n
iP
r
o
t,G
e
n
B
a
n
k
,C
C
D
S
a
n
d
C
o
m
p
a
r
a
tiv
e
G
e
n
o
m
ic
s
Y
S
S
L
Q
L
V
e
r
te
b
r
a
te
M
u
ltiz
A
lig
n
m
e
n
t&
P
h
a
s
tC
o
n
s
C
o
n
s
e
r
v
a
tio
n
(
2
8
S
p
e
c
ie
s
)
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
C
T
T
T
T
T
T
T
T
A
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
T
T
T
T
T
T
T
T
T
T
C
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
GT
GT
GC
GT
GT
GT
GT
GT
GT
GT
GT
GT
GT
GT
GT
GT
GT
GT
GT
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
C
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
GC
GC
GC
GC
GC
GC
GC
GC
GC
GC
GC
GC
GC
GC
GC
GC
GC
GC
GC
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
GT
GT
GT
GT
GT
GT
GT
GC
GT
GT
GT
GT
GT
GT
GT
GT
GC
GT
GT
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
Q
GC A
GC A
GC A
GC A
GC A
GC A
GC A
GGA
GC A
GC A
GC A
GC A
GC A
GC A
GC A
GC A
GC A
GC A
GC A
What do they do?
A
GGC
GGC
GGC
GGC
GGC
GGC
GGC
C GC
GGC
GGC
GGC
GGC
GGC
GGC
GGC
GGC
GGC
GGC
GGC
L
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
A
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
T
N
GA
GA
GA
GA
GA
GA
GA
GA
GA
GA
GA
GA
GA
GA
GA
GA
GA
GA
GA
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
C
C
C
T
C
C
C
T
C
C
C
C
C
C
C
C
T
C
C
Ultra conserved elements
ultra conserved
e.d 12.5
HARs: Human accelerated regions
• 118 bp segment with 18 changes between
the human and chimp sequences
• Expect less than 1
Human HAR1F differs from the ancestral
RNA structure
Main components in the Human genome
Barbara McClintock
Only 1.5% of the human genome are protein-coding regions
Transposable elements make up almost half of the human genome
Transposable Elements (TEs)
TEs can shape transcriptional network
The LTR10 and MER61 families are
particularly enriched for copies with a p53
site. These ERV families are primate-specific
and transposed actively near the time when
the New World and Old World monkey
lineages split.
Evolutionary pattern of LTR10B1 genomic copies
Wang et al. PNAS 2007
Understanding Human Disease
Published Genome-Wide Associations through 12/2013
-8 for 17 trait categories
Publi
sh
e
dGW A a
tp≤5X1
0
NHGRI GWA Catalog
www.genome.gov/GWAStudies
www.ebi.ac.uk/fgpt/gwas/
Proposed in Nov 2009
Epigenome
ENCODE Project
HapMap
1000 Genomes
TCGA
Thinking Quantitatively
Biology Is A Quantitative Science!!!
Erwin Chargaff
Gregor Mendel
1) Mendel’s Laws
2) Chargaff’s Rules
1823-1884
1929-1992
Thinking Quantitatively
•
Space
–
•
Be comprehensive
Signal to Noise Ratio
–
–
•
Sensitivity, specificity, dynamic range
What is my background control?
Distributions
–
–
Normal, Gaussian, Poisson, negative binomial, extreme value,
hypergeometric, etc.
Discrete vs continuous
•
The P value
•
Statistics, Probability, Computation, and Informatics
•
Don’t forget genetics!!!
Spaces: Be comprehensive
• Conditions: spatial, temporal, treatment – think about
controlling for multiple variables
• Think globally – interaction between local features and
global features (Placenta histone example)
• Be comprehensive about what assumptions are made
– some we know, some we don’t (genome assembly
example)
Signal to Noise
relative abundance
100
ChIP
75
input
50
25
400
600
800
m/z
1000
Different sources of noise
Sensitivity, Specificity, and Dynamic Range
• Sensitivity
– What is the smallest signal that can reliably be detected
(signal to noise)?
– True positive rate
• Specificity
– How well can we discriminate between similar signals?
– True negative rate
• Dynamic Range
– What is the linear range of detection?
– What is the range of natural variation?
Distributions
Normal (Gaussian)
Poisson
Extreme value
Negative binomial
The P value
# of occurrences
*for discreet variables
25
What is the chance of getting exactly
seven?
20
15
# of trials that were seven
total # of trials
P
9
1  3  9  17  23  17  9  3  1
10
5
0 1 2 3 4 5 6 7 8 9 10
# of occurrences
P
25
P  0.10
What is the chance of getting seven or
better?
20
15
P
# of trials that were seven or better
total # of trials
P
9  3 1
1  3  9  17  23  17  9  3  1
10
5
0 1 2 3 4 5 6 7 8 9 10
P  0.16
“There Are Three Kinds of Lies:
Lies, Damned Lies, and Statistics”
-Benjamin Disraeli (1804-1881)
Statistics: drawing inferences from the results
of an experiment
1
0.6
0.4
0.2
97
91
85
79
73
67
61
55
49
43
37
31
25
19
13
7
0
1
• Comparing averages
• Comparing variation
0.8
Gaussian (Normal) Distributions I
1
# of people
0.8
0.6
0.4
0.2
n
1
• Mean (x) =  xi
n i 1
n
• Standard Deviation (s) =
2
(
x

x
)
 i
i 1
n 1
97
91
85
79
73
67
61
55
height
49
43
37
31
25
19
13
7
1
0
Gaussian (Normal) Distributions II
1
# of people
0.8
0.6
0.4
0.2
97
91
79
73
67
61
55
49
43
37
31
25
19
13
7
height
Mean (x or )
Standard Deviation (s or )
Compare means (t-tests)
Compare standard deviations (f-tests)
Calculate P-values for particular results (z-tests)
1
•
•
•
•
•
85
0
“The most important questions of life are
…really only problems of probability.”
-Pierre Simon, Marquis de Laplace (17491827)
Probability: Computing the chance of a particular
outcome of an experiment
S
EC
E
P(E)=E/S
P(Ec)=1-P(E)
Independent Events/Mutually Exclusive Events
What is the probability of A and B both occurring?
P(AB) = P(A)*P(B)
(Independent)
P(A|B) = P(A)
(Independent)
P(A|B)*P(B) = P(B|A)*P(A)
(Bayes rule)
P(AB) = P(A|B) = 0
(Mutually Exclusive)
What is the probability of A or B occurring (independent)?
P(A) * P(B) + P(A) * 1-P(B) + 1-P(A) * P(B)
both
+
A only
+ B only
What is the probability of A or B occurring (mutually exclusive)?
P(A) + P(B)
An example
Amino acid percentages of Swisprot
Ala
Arg
Asn
Asp
Cys
(A)
(R)
(N)
(D)
(C)
7.81
5.32
4.20
5.30
1.56
Gln
Glu
Gly
His
Ile
(Q)
(E)
(G)
(H)
(I)
3.94
6.60
6.93
2.28
5.91
Leu
Lys
Met
Phe
Pro
(L)
(K)
(M)
(F)
(P)
9.62
5.93
2.37
4.01
4.84
Ser
Thr
Trp
Tyr
Val
(S)
(T)
(W)
(Y)
(V)
6.88
5.45
1.15
3.07
6.71
What is the probability that a peptide of length 25 contains at least one SP motif?
P(SP) = P(S)P(P) = .0688*.0484 =.00329
P(SPc) = 1 - P(SP) = .9966
P(no SP anywhere in a 25 mer) = P(SPc)24 = 0.92
P(at least one SP in a 25 mer) = 1- P(SPc)24 =0.08
What is the probability that the last residue of a protein is either K or R?
P(K)+P(R) = .0593 + .0532= .11
counting permutations
Permutations:
Question: How many different 5-mer
sequences can I make using each of the
amino acids S, T, A, G, P once and only
once?
Answer:
5! = 5*4*3*2*1 = 120
counting combinations
Combinations:
Question: How different 5-mer sequences
can I make using three Ser’s and two
Pro’s
Answer:
 5!  5 * 4 * 3 * 2 *1
 10


 3!*2!  3 * 2 *1* 2 *1
Example
A particular cluster has 25
coexpressed genes in it. 15 of these
genes are annotated as being
involved in rRNA transcription.
Is 15/25 significant?
Hypergeometric Probability Distribution
the “overlap problem” or sampling without replacement
N
n
s
m
• N = number of genes in the genome (6000 for yeast)
• n = number of genes in the cluster (25)
• m = number of rRNA transcription genes (109 from
MIPS)
• s = number of rRNA transcription genes in the cluster
(15)
Hypergeometric Probability Distribution
N
n
s
 m  N  m 
 

n 
i  n  i 

P( X  s)  
N
is
 
n 
m
a
a!
where   
 b  b!(a  b)!
Hypergeometric Probability Distribution
6000
25 15 109
109  6000  109 



25 
i  25  i


16
P ( X  15)  
 110
 6000 
i 15


 25 
Therefore in our example…
6000
25 15 109
15/25 rRNA transcription genes in the cluster is significant.
BUT…
1)
2)
3)
10 out of 25 genes in the cluster are not rRNA transcription genes.
94 rRNA transcription genes are not in the cluster.
What about the other genes in the cluster?