Gene expression

Statistics 246, Week 3
Thesis: the analysis of gene
expression data is going to be big
in 21st century statistics
Many different technologies, including
High-density nylon membrane arrays
Serial analysis of gene expression (SAGE)
Short oligonucleotide arrays (Affymetrix)
Long oligo arrays (Agilent)
Fibre optic arrays (Illumina)
cDNA arrays (Brown/Botstein)*
[Figure: Total microarray articles indexed in Medline. Number of papers (0 to 600) by year, 1995 to 2001 (2001 projected).]
Common themes
• Parallel approach to collection of very large amounts of data (by biological standards)
• Sophisticated instrumentation, requires some understanding
• Systematic features of the data are at least as important as the random ones
• Often more like industrial process than single investigator lab research
• Integration of many data types: clinical, genetic, molecular ... databases
Biological background
Transcription
DNA
G T A A T C C T C
| | | | | | | | |
C A T T A G G A G
RNA polymerase
mRNA
G U A A U C C
Idea: measure the amount of mRNA to see which
genes are being expressed in (used by) the cell.
Measuring protein might be better, but is currently
harder.
Reverse transcription
Clone cDNA strands, complementary to the mRNA
mRNA
G U A A U C C U C
Reverse transcriptase
cDNA
C A T T A G G A G
[Figure: many cloned copies of the cDNA strand, shown in various orientations]
cDNA microarray experiments
mRNA levels compared in many different contexts
Different tissues, same organism (brain v. liver)
Same tissue, same organism (treatment v. control, tumor v. non-tumor)
Same tissue, different organisms (wild type v. knockout, transgenic, or mutant)
Time course experiments (effect of treatment, development)
Other special designs (e.g. to detect spatial patterns).
cDNA microarrays
[Figure: cDNA clones]
Compare the genetic expression in two samples of cells
PRINT: cDNA from one gene on each spot
SAMPLES: cDNA labelled red/green, e.g. treatment / control, normal / tumor tissue
HYBRIDIZE: Add equal amounts of labelled cDNA samples to microarray.
SCAN: Laser, Detector
Biological question: differentially expressed genes, sample class prediction etc.
-> Experimental design
-> Microarray experiment
-> 16-bit TIFF files
-> Image analysis: (Rfg, Rbg), (Gfg, Gbg)
-> Normalization: R, G
-> Estimation, Testing, Clustering, Discrimination
-> Biological verification and interpretation
Some statistical questions
Image analysis: addressing, segmenting, quantifying
Normalisation: within and between slides
Quality: of images, of spots, of (log) ratios
Which genes are (relatively) up/down regulated?
Assigning p-values to tests/confidence to results.
Some statistical questions, ctd
Planning of experiments: design, sample size
Discrimination and allocation of samples
Clustering, classification: of samples, of genes
Selection of genes relevant to any given analysis
Analysis of time course, factorial and other special
experiments ... and much more.
Some bioinformatic questions
Connecting spots to databases, e.g. to sequence,
structure, and pathway databases
Discovering short sequences regulating sets of
genes: direct and inverse methods
Relating expression profiles to structure and
function, e.g. protein localisation
Identifying novel biochemical or signalling
pathways ... and much more.
[Figure: Part of the image of one channel, false-coloured on a scale running from white (v. high) and red (high) through yellow and green (medium) to blue (low) and black.]
Does one size fit all?
Segmentation: limitation of the
fixed circle method
[Figure panels: SRG (seeded region growing) and Fixed Circle]
Inside the boundary is spot (foreground), outside is not.
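The fixed circle rule can be sketched in a few lines. This is a minimal illustration, assuming numpy; the function name `fixed_circle_mask`, the synthetic 11 x 11 patch, and the choice of mean foreground and median background are all assumptions made for the example, not the method used in the lecture.

```python
import numpy as np

def fixed_circle_mask(shape, center, radius):
    """Boolean mask: True for pixels inside a circle of the given radius."""
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    return (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2

# Synthetic patch (made-up values): dim background with a bright round spot.
patch = np.full((11, 11), 100.0)
mask = fixed_circle_mask(patch.shape, center=(5, 5), radius=3)
patch[mask] = 1000.0

# Inside the boundary is spot (foreground), outside is not.
foreground = patch[mask].mean()
background = np.median(patch[~mask])
```

Because every spot gets the same radius, a spot that is larger, smaller, or less round than the circle will have pixels assigned to the wrong side of the boundary, which is the limitation the slide points at.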
Some local backgrounds
[Figure: single channel, grey scale]
We use something different again: a smaller, less variable value.
Quantification of expression
For each spot on the slide we calculate
Red intensity = Rfg - Rbg
Green intensity = Gfg - Gbg
where fg = foreground, bg = background, and combine them in the log (base 2) ratio
$\log_2(\text{Red intensity} / \text{Green intensity})$
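The calculation above can be sketched as follows. This is a minimal example, assuming numpy; the function name `log_ratio` and the input numbers are made up, and returning NaN for spots whose background-corrected intensity is not positive is one possible convention, not something the slides specify.

```python
import numpy as np

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Log (base 2) ratio of background-corrected red and green intensities."""
    red = np.atleast_1d(np.asarray(r_fg, dtype=float) - np.asarray(r_bg, dtype=float))
    green = np.atleast_1d(np.asarray(g_fg, dtype=float) - np.asarray(g_bg, dtype=float))
    out = np.full(red.shape, np.nan)
    # The logarithm is only defined when both corrected intensities are positive.
    valid = (red > 0) & (green > 0)
    out[valid] = np.log2(red[valid] / green[valid])
    return out

# Three made-up spots: red twice green, green twice red, and a spot
# whose red foreground falls below its background.
ratios = log_ratio(r_fg=[1100, 600, 150], r_bg=[100, 100, 200],
                   g_fg=[600, 1100, 300], g_bg=[100, 100, 100])
```

On the log scale a two-fold increase and a two-fold decrease are symmetric about zero (+1 and -1), which is one reason the base 2 log ratio is convenient.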
Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.

                     Slides
Genes   slide 1   slide 2   slide 3   slide 4   slide 5   …
1         0.46      0.30      0.80      1.51      0.90    ...
2        -0.10      0.49      0.24      0.06      0.46    ...
3         0.15      0.74      0.04      0.10      0.20    ...
4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
5        -0.06      1.06      1.35      1.09     -1.09    ...

Gene expression level of gene 5 in slide 4 = $\log_2(\text{Red intensity} / \text{Green intensity})$
These values are conventionally displayed
on a red (>0) yellow (0) green (<0) scale.
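A small sketch of the genes-by-slides layout and the display convention, assuming numpy. The values are the five genes and five slides shown on the slide, arranged as rows and columns according to one reading of the extracted table; `display_colour` is a hypothetical helper name.

```python
import numpy as np

# Rows are genes 1-5, columns are slides 1-5 (log base 2 ratios).
expr = np.array([
    [ 0.46,  0.30,  0.80,  1.51,  0.90],
    [-0.10,  0.49,  0.24,  0.06,  0.46],
    [ 0.15,  0.74,  0.04,  0.10,  0.20],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
    [-0.06,  1.06,  1.35,  1.09, -1.09],
])

def display_colour(value):
    """Red for positive, green for negative, yellow for exactly zero."""
    if value > 0:
        return "red"
    if value < 0:
        return "green"
    return "yellow"

gene, slide = 5, 4
level = expr[gene - 1, slide - 1]   # numpy indices start at 0
colour = display_colour(level)
```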
The red/green ratios can be spatially biased
[Figure: slide image with the top 2.5% of ratios shown in red and the bottom 2.5% of ratios shown in green]
The red/green ratios can be intensity-biased
$M = \log_2(R/G) = \log_2 R - \log_2 G$
$A = (\log_2 R + \log_2 G)/2$
M values should scatter about zero.
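A minimal sketch of the two quantities plotted against each other, assuming numpy and background-corrected, strictly positive intensities. The name `ma_values` and the example intensities are made up for illustration; M is the log (base 2) ratio and A is the average of the two log intensities.

```python
import numpy as np

def ma_values(red, green):
    """Return (M, A): log ratio and average log intensity, both base 2."""
    log_r = np.log2(np.asarray(red, dtype=float))
    log_g = np.log2(np.asarray(green, dtype=float))
    m = log_r - log_g
    a = (log_r + log_g) / 2
    return m, a

# Two made-up spots with the same average intensity but opposite ratios.
m, a = ma_values(red=[1024, 256], green=[256, 1024])
```

Plotting M against A makes an intensity-dependent bias visible as a curve in the cloud of points, rather than a band centred on M = 0.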
Normalization: how we “fix” the previous problem
The curved line becomes the new zero line
Orange: Schadt-Wong rank invariant set
Yellow: GAPDH, tubulin
Light blue: MSP pool / titration
Red line: lowess smooth
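The idea of making the curved line the new zero line can be sketched as below. This is a simplified stand-in, assuming numpy: `lowess_fit` and `normalise` are hypothetical names, and the smoother is a plain tricube-weighted local linear fit without the robustness iterations of a full lowess, so it will not reproduce the slide's red line exactly.

```python
import numpy as np

def lowess_fit(a, m, frac=0.4):
    """Simplified lowess: local linear fit of m on a, evaluated at each a[i]."""
    a = np.asarray(a, dtype=float)
    m = np.asarray(m, dtype=float)
    n = len(a)
    k = max(3, int(np.ceil(frac * n)))        # neighbours used in each local fit
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(a - a[i])
        h = max(np.sort(d)[k - 1], 1e-12)     # bandwidth: k-th smallest distance
        w = np.clip(1 - (d / h) ** 3, 0, 1) ** 3   # tricube weights
        sw = np.sqrt(w)
        # Weighted least squares for a line centred at a[i];
        # the intercept is the fitted value at a[i].
        X = np.column_stack([np.ones(n), a - a[i]])
        beta, *_ = np.linalg.lstsq(X * sw[:, None], m * sw, rcond=None)
        fitted[i] = beta[0]
    return fitted

def normalise(a, m, frac=0.4):
    """Subtract the smooth trend so that M scatters about zero at every A."""
    return np.asarray(m, dtype=float) - lowess_fit(a, m, frac)

# Made-up example: a purely intensity-dependent (linear) bias in M.
a_example = np.linspace(6, 16, 50)
m_example = 0.3 * a_example - 3.0
m_normalised = normalise(a_example, m_example)
```

In practice the smooth could also be fitted to a chosen subset of spots (for example a rank invariant set or control spots, as in the legend above) and then subtracted from all spots; the sketch fits to every spot for simplicity.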
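[Figure: Normalizing: before. M (vertical axis, -4 to 2) against average log intensity (horizontal axis, 6 to 16).]
[Figure: Normalizing: after. M normalised (vertical axis, -4 to 2) against average log intensity (horizontal axis, 6 to 16).]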
From a study of the mouse olfactory system
[Figure, from Buck (2000), with labels: Main (Auxiliary) Olfactory Bulb; VomeroNasal Organ; Olfactory Epithelium]
Axonal connectivity between the nose
and the mouse olfactory bulb
[Figure labels: >2M, ~1,800 types; Neocortex]
Two principles: “zone-to-zone projection”, and “glomerular convergence”
Of interest: the hardwiring of the
vertebrate olfactory system
• Expression of a specific odorant receptor gene by an olfactory neuron.
• Targeting and convergence of like axons to specific glomeruli in the olfactory bulb.
The biological question in this case
Are there genes with spatially
restricted expression patterns within
the olfactory bulb?
Layout of the cDNA Microarrays
• Sequence verified mouse cDNAs
• 19,200 spots in two print groups of 9,600 each
– 4 x 4 grid, each with 25 x 24 spots
– Controls on the first 2 rows of each grid.
[Figure: slide layout showing print groups pg1 and pg2]
Design: How We Sliced Up the Bulb
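[Figure: the bulb divided into regions labelled A, P, D, L, V, M (anterior, posterior, dorsal, lateral, ventral, medial)]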
Design: Two Ways to Do the
Comparisons
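Goal: 3-D representation of gene expression
• Compare all samples to a common reference sample (e.g., whole bulb)
• Multiple direct comparisons between different samples (no common reference)
[Figure: two design diagrams. Left: samples A, P, D, L, V, M each compared to a reference R. Right: samples A, P, D, L, V, M compared directly with one another.]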
An Important Aspect of Our Design
Different ways of estimating the same contrast: e.g. A compared to P
Direct = A-P
Indirect = A-M + (M-P)
or A-D + (D-P)
or -(L-A) - (P-L)
[Figure: design diagram linking samples A, D, M, L, V, P]
How do we combine these?
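One common way to combine direct and indirect estimates is to fit a linear model to all slides at once by least squares. The sketch below is an illustration under assumptions, not necessarily the approach taken in the lecture: it assumes numpy, treats each slide's log ratio as (first sample) minus (second sample), fixes P as the baseline, and uses made-up noise-free expression levels. The names `design_matrix` and `estimate_effects` are hypothetical.

```python
import numpy as np

samples = ["A", "P", "M", "D", "L"]
# Each slide measures the log ratio (first sample) - (second sample).
slides = [("A", "P"), ("A", "M"), ("M", "P"),
          ("A", "D"), ("D", "P"), ("L", "A"), ("P", "L")]

def design_matrix(slides, samples, baseline="P"):
    """One row per slide, one column per non-baseline sample (+1 / -1 coding)."""
    cols = [s for s in samples if s != baseline]
    X = np.zeros((len(slides), len(cols)))
    for i, (first, second) in enumerate(slides):
        if first != baseline:
            X[i, cols.index(first)] += 1.0
        if second != baseline:
            X[i, cols.index(second)] -= 1.0
    return X, cols

def estimate_effects(y, slides, samples, baseline="P"):
    """Least squares estimate of each sample's level relative to the baseline."""
    X, cols = design_matrix(slides, samples, baseline)
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return dict(zip(cols, beta))

# Made-up true levels for one gene; the observed log ratios are noise-free here.
true_level = {"A": 1.0, "P": 0.0, "M": 0.4, "D": -0.2, "L": 0.7}
y = np.array([true_level[f] - true_level[s] for f, s in slides])

effects = estimate_effects(y, slides, samples)
a_vs_p = effects["A"]   # combined estimate of the contrast A - P
```

With noisy data, the least squares fit weights the direct slide and each indirect path according to how much information it carries, instead of requiring a hand-picked average of the separate estimates.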