Chicago - Duke Statistical

Identification of regulatory elements using
high-throughput binding evidence.
Inference of population structure on large
genetic data sets.
Stoyan Georgiev
advisors: Uwe Ohler and Sayan Mukherjee
Computational Biology and Bioinformatics, Duke University
February 2011
Outline
• Motif analysis
– Transcriptional regulation
• genome-wide DNA binding data (Georgiev et al. 2010)
– Post-transcriptional regulation
• transcriptome-wide RNA binding data (Mukherjee et al.,
under review; Corcoran* and Georgiev* et al., submitted)
• Inference of population structure
– randomized algorithm
Motif analysis
Outline
• Introduction
• Transcriptional regulation
– Problem statement
– Genomic assays
– Statistical framework
– Results
• Post-transcriptional regulation
Gene regulation
Transcription
DNA motifs
Splicing, Capping,
Poly-adenylation
Nucleus
Export
Cytoplasm
Stability
Translation
miRBP
miR-RBP
complexes
RBP
RNA-binding
Proteins
RNA motifs
Gene regulatory code
• Transcriptional regulation: short patterns in DNA (motifs)
control the initiation of production of gene transcripts
– mechanism: sequence-specific DNA binding proteins (TFs)
Motif Discovery Tool: cERMIT (Georgiev et al. 2010)
• Post-transcriptional regulation: short patterns in RNA control
the utilization of gene transcripts
– mechanism: sequence-specific RNA binding proteins (RBPs), or
microRNA mediated
Motif Analysis Tool (Corcoran* and Georgiev* et
al.; Mukherjee et al.)
Transcriptional regulation
Transcriptional regulation
• Chromatin arrangement
• Activity of transcription factors
- intra-cellular environment
- cis-regulatory code
• DNA methylation
• Copy Number Variation
Simplified abstraction
location
ChIP-seq
cERMIT
• Computational tool for de-novo motif discovery
– Predict the binding motif and functional targets of a specific transcription
factor (TF) of interest using genome-wide measurements of
binding (e.g. ChIP-seq, ChIP-chip) (Georgiev et al. 2010)
• Input: set of sequence regions with assigned binding evidence
• Output: ranked list of predicted binding motifs and
corresponding target locations
Brief introduction to cERMIT
• Binding site representation: consensus sequence
• Search for the "best" binding site that explains the genome-wide binding evidence.
– "best": occurs in regions that tend to have high evidence of being
bound (this is formalized as a normalized average score)
– can evaluate all possible binding sites up to some reasonable
length...in theory
– in practice, we try to cover as many as possible
• start with all possible 5-mers
(AAAAA, AAAAG, AAAAC,...,TTTTT)
• for each, evaluate its "neighbours" and replace it with the "best" one
• repeat until no neighbour scores better than the current motif
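The search loop above can be sketched as a simple hill climb. This is a minimal illustration, not the cERMIT implementation: the neighbourhood used here (single-base substitutions and one-base extensions over A, C, G, T) and the function names are assumptions, cERMIT's degenerate IUPAC symbols are left out, and the score is the mean evidence shift divided by its standard error, without the correction factor A.

```python
import math

ALPHABET = "ACGT"

def score(motif, regions, evidence):
    """Normalized average evidence of the regions that contain the motif."""
    n = len(regions)
    mean = sum(evidence) / n
    var = sum((y - mean) ** 2 for y in evidence) / n
    hits = [y for s, y in zip(regions, evidence) if motif in s]
    if not hits or var == 0:
        return float("-inf")
    shift = sum(hits) / len(hits) - mean
    return shift / math.sqrt(var / len(hits))

def neighbours(motif, max_len=8):
    """Hypothetical neighbourhood: substitutions plus one-base extensions."""
    out = set()
    for i in range(len(motif)):
        for b in ALPHABET:
            out.add(motif[:i] + b + motif[i + 1:])
    if len(motif) < max_len:
        for b in ALPHABET:
            out.add(b + motif)
            out.add(motif + b)
    out.discard(motif)
    return out

def evolve(seed, regions, evidence):
    """Move to the best neighbour until no neighbour scores strictly better."""
    current, best = seed, score(seed, regions, evidence)
    while True:
        cand = max(neighbours(current), key=lambda m: score(m, regions, evidence))
        cand_score = score(cand, regions, evidence)
        if cand_score <= best:
            return current, best
        current, best = cand, cand_score
```

Running `evolve` from every seed 5-mer and ranking the final motifs by score would give the ranked output list; the strict improvement test guarantees termination.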
Algorithmic view
[Figure: sequence regions ranked from high to low binding evidence. The 512 seed motifs (AAAAA, AAAAG, AAAAC, AAAAT, AAAGA, ..., TTTCT, TTTTA, TTTTG, TTTTC, TTTTT; e.g. ES = 1.5) are evolved into longer motifs (e.g. RTGASTCA with ES = 15.0, TGACTCA, RTGASTCAK, GAWTCAYY, TGAWTCAK).]
ES = normalized average binding evidence
Variable definitions
$s_i$, $i \in \{1, \ldots, n\}$: sequence regions
$y_i$: binding evidence for sequence region $s_i$
$m_j$, $j \in \{1, \ldots, T\}$: candidate sequence motifs
$x_{ij} = 1$ if a match to motif $m_j$ is present in region $s_i$, $0$ otherwise
$n_j = \sum_{i=1}^{n} x_{ij}$: number of motif occurrences in $\{s_i\}_{i=1}^{n}$
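Motif model
Motif Binding Evidence $E_j$:
$$E_j = A \, \frac{e_j}{\hat{\sigma}_j}$$
Notation:
$$e_j = \frac{1}{n_j} \sum_{i : x_{ij} = 1} (y_i - \bar{y}), \qquad A = \frac{n - n_j}{n - 1}, \qquad \hat{\sigma}_j^2 = \frac{1}{n \, n_j} \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Optimal motif $m_{j^*}$:
$$j^* = \arg\max_{j \in \{1, \ldots, T\}} E_j$$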
ChIP-seq motif discovery
input
output
Results
ChIP-chip validation
•
conservation
filter improves
prediction
accuracy
(Georgiev et al. 2010)
SKO1
Example yeast ChIP-chip output
GCN4
Human ChIP-seq
prediction
literature
CTCF
Barski et al. 2007
STAT1
Robertson et al. 2007
SRF
Valouev et al. 2008
Post-transcriptional
control
Gene regulatory code
• Post-transcriptional regulation: short patterns in RNA control
the utilization of gene transcripts
– mechanism: sequence-specific RNA binding proteins (RBPs), or
microRNA mediated to control translation
Motif Analysis Tool (Corcoran*, Georgiev* et al.; Mukherjee et al.)
PAR-CLIP
• CLIP: cross-linking and immunoprecipitation
– a method for transcriptome-wide identification of RNA-protein interaction sites; drawback: quite noisy
• PAR-CLIP = CLIP + photoactivatable nucleotides
– more efficient cross-linking
– directly observable evidence of protein-RNA cross-linking:
upon reverse transcription, a T->C conversion appears at or near the
interaction site
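As a small illustration of how T->C conversions become a numeric signal, the sketch below counts them between a reference sequence and reads aligned to it without gaps, then takes the log of the total, following the "binding evidence = log[# T->C conversion events]" definition used later in the deck. The function names and the gap-free alignment are assumptions made for the example.

```python
import math

def count_t_to_c(reference, read):
    """Count positions where the reference has T and the aligned read has C."""
    if len(reference) != len(read):
        raise ValueError("reference and read must be aligned to the same length")
    return sum(1 for ref, obs in zip(reference, read) if ref == "T" and obs == "C")

def binding_evidence(reference, reads):
    """log of the total number of T->C conversion events over all reads."""
    total = sum(count_t_to_c(reference, r) for r in reads)
    return math.log(total) if total > 0 else float("-inf")
```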
PAR-CLIP
1. culture with 4-SU
2. cross-link
3. Immunoprecipitate
& size-select
4. convert into
a cDNA library
& sequence
[Hafner et al. 2010]
RBP motif analysis pipeline
RBP
cERMIT
Motif seeds
Motif
predictions
Modified motif score
Variable definitions
$Y = (y_1, \ldots, y_n)^T$: binding evidence
$m_j$, $j \in \{1, \ldots, T\}$: candidate sequence motifs
$x_{ij} = 1$ if a match to motif $m_j$ is present in sequence region $s_i$, $0$ otherwise
$c_k \in \mathbb{R}^n$, $k \in \{1, \ldots, p-1\}$: confounders (e.g. sequence biases)
$Z_j = (x_j, c_1, \ldots, c_{p-1})$: matrix of regression covariates
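Motif model
model: $Y^* = Z_j \beta_j + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I_{n \times n})$
OLS fit: $\hat{\beta}_j = (Z_j^T Z_j)^{-1} Z_j^T y^*$, $\hat{\Sigma}_{\hat{\beta}_j} = \hat{\sigma}^2 (Z_j^T Z_j)^{-1}$
motif binding evidence $E_j$:
$$E_j = \frac{\hat{\beta}_{1j}}{\sqrt{(\hat{\Sigma}_{\hat{\beta}_j})_{11}}}$$
optimal motif is $m_{j^*}$:
$$j^* = \arg\max_{j \in \{1, \ldots, T\}} E_j$$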
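A minimal NumPy sketch of the regression-based motif score follows. It fits the motif indicator together with the confounders by least squares on the centered evidence and returns the motif coefficient divided by its estimated standard error. The function name is hypothetical, and estimating the residual variance with n - p degrees of freedom is an assumption; the slides do not state which estimator is used.

```python
import numpy as np

def regression_motif_score(y, x, confounders):
    """t-like score of the motif indicator x, adjusted for confounders.

    y: (n,) binding evidence; x: (n,) 0/1 motif indicator;
    confounders: (n, p-1) matrix of nuisance covariates.
    """
    y_star = y - y.mean()                      # centered evidence
    Z = np.column_stack([x, confounders])      # motif indicator is column 0
    beta, *_ = np.linalg.lstsq(Z, y_star, rcond=None)
    resid = y_star - Z @ beta
    n, p = Z.shape
    sigma2 = resid @ resid / (n - p)           # assumed variance estimator
    cov = sigma2 * np.linalg.inv(Z.T @ Z)      # covariance of the OLS estimate
    return beta[0] / np.sqrt(cov[0, 0])
```

Scoring every candidate motif this way and taking the maximum corresponds to the arg max step on the slide.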
Results
Pumilio
predicted motif
• 2 million mapped reads
• # clusters with site / total # clusters = 1,162 / 8,483
(Hafner et al. 2010)
Summary
• cERMIT: motif discovery using genome-wide binding
data
– identify motifs that are highly enriched in targets with high
binding evidence.
– applicable to RNA and DNA binding data
– adjust for sequence biases and other potential
confounders using linear regression framework
• In progress…
– Bayesian formulation
– improve stability of predictions
– more comprehensive search
Inference of population structure and
generalized eigendecomposition
Outline
• Motivation
• Current approaches
• Extensions
– large data sets
– supervised dimension reduction
• Empirical results
– Wishart simulation
– WTCCC Crohn’s disease data set
Motivation
• A classic problem in biology and genetics is to study
population structure (Cavalli-Sforza 1978, 2003)
• Genotype data on millions of loci and thousands of
individuals
• Can we detect structure based on the genetic data?
– infer population demographic histories
– correct for population structure in disease association
studies
– correspondence to geography
Current approaches
• Structure (Pritchard et al. 2000)
– Bayesian model-based clustering of genotype data
• Eigenstrat (Patterson et al. 2006)
– PCA-based inference of axes of genetic variation
Population structure within Europe
(Novembre et al. 2008)
Eigenstrat
(Patterson et al. 2006)
• Combines Principal Component Analysis and Random Matrix Theory
1. $M_{ij} = \dfrac{C_{ij} - \hat{\mu}_j}{\sqrt{\frac{\hat{\mu}_j}{2}\left(1 - \frac{\hat{\mu}_j}{2}\right)}}$; $i = 1, \ldots, m$; $j = 1, \ldots, n$
2. $X = \frac{1}{n} M M^T$
3. Estimate "effective population size": $n' = \dfrac{(m + 1)\left(\sum_i \lambda_i\right)^2}{(m - 1)\sum_i \lambda_i^2 - \left(\sum_i \lambda_i\right)^2}$
4. Order $\lambda_1, \ldots, \lambda_m$ and test for significance using the Tracy-Widom statistic
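Steps 1 to 3 can be sketched in NumPy as below, assuming a genotype matrix C with m individuals in rows, n markers in columns, and entries in {0, 1, 2}. The function name is hypothetical, monomorphic markers are dropped to avoid division by zero (an assumption), and the Tracy-Widom test of step 4 is not implemented.

```python
import numpy as np

def eigenstrat_spectrum(C):
    """Return eigenvalues (descending) of X = M M^T / n and the estimate n'."""
    C = np.asarray(C, dtype=float)
    m = C.shape[0]
    mu = C.mean(axis=0)                  # per-marker mean
    p = mu / 2.0                         # estimated allele frequency
    keep = (p > 0) & (p < 1)             # drop monomorphic markers
    M = (C[:, keep] - mu[keep]) / np.sqrt(p[keep] * (1.0 - p[keep]))
    n = M.shape[1]
    X = M @ M.T / n                      # m-by-m matrix, O(m^2 n) to form
    lam = np.linalg.eigvalsh(X)[::-1]    # eigvalsh is ascending; reverse it
    s1, s2 = lam.sum(), (lam ** 2).sum()
    n_eff = (m + 1) * s1 ** 2 / ((m - 1) * s2 - s1 ** 2)
    return lam, n_eff
```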
Eigenstrat
(Patterson et al. 2006)
• Runtime: O(m^2 n) computation
• The challenge: future (current?) genetic data sets
n ≥ 500,000
m ≥ 20,000
(e.g. WTCCC, Nature 2007: 17,000 individuals, 500K SNP array)
• Can we extend Eigenstrat to this data to be run on a
standard desktop?
• Assume low rank, k << min(m,n)
• Approx algorithm in O(kmn) computation
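To make the gap concrete, here is the operation count at the data-set sizes quoted above; the rank k = 10 is an assumed value chosen only for illustration.

```latex
\begin{align*}
m^2 n &= (2 \times 10^{4})^2 \times (5 \times 10^{5}) = 2 \times 10^{14} \\
k m n &= 10 \times (2 \times 10^{4}) \times (5 \times 10^{5}) = 10^{11} \\
\frac{m^2 n}{k m n} &= \frac{m}{k} = 2000
\end{align*}
```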
Randomized PCA
Basic steps:
1. Random projection (approx. preserves distances)
• project data onto a low-dimensional space
– $Y = M G$, with $M$ m-by-n, $G$ n-by-r, $G_{ij} \sim N(0,1)$, $r \ge k$, $r \ll n$
• do SVD on Y (similar to SVD on M)
2. Power method: when spectrum decay is slow
$Y = (M M^T)^i M G$, $i = 1, 2, \ldots$
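The two steps can be sketched in NumPy as follows. This is a generic randomized range finder with power iterations, in the spirit of the Rokhlin et al. and Halko et al. references, not the code used for the talk's results. The oversampling amount, the QR re-orthonormalization between iterations, and the function name are assumptions.

```python
import numpy as np

def randomized_pca(M, k, oversample=10, n_iter=1, seed=0):
    """Approximate the top-k singular triplets of an m-by-n matrix M."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    r = min(k + oversample, m, n)          # projection dimension, r >= k
    G = rng.standard_normal((n, r))        # Gaussian test matrix
    Y = M @ G                              # random projection, O(rmn)
    for _ in range(n_iter):                # power iterations: (M M^T)^i M G
        Q, _ = np.linalg.qr(Y)             # re-orthonormalize for stability
        Y = M @ (M.T @ Q)
    Q, _ = np.linalg.qr(Y)                 # orthonormal basis for range(Y)
    B = Q.T @ M                            # small r-by-n matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]
```

Setting `n_iter=0` gives the plain random projection of step 1; larger values trade extra passes over M for accuracy when the singular values decay slowly.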
Properties of Randomized PCA
• Error bound on the rank-k approximation $\hat{A}_k$:
$$E \, \| \hat{A}_k - A \| \le (1 + C)^{\frac{1}{2i+1}} \, \sigma_{k+1}$$
power iteration drives the leading constant to one exponentially fast as i increases!
• Top k eigenvalues and eigenvectors can be well
approximated in time O(ikmn)
– rapid convergence when close to low rank structure (i=1-3)
– slowly decaying singular values require more iterations
• Clearly no benefit when ik ≈ m << n
Properties of Randomized PCA
• Empirical observations
– we don’t seem to need power iteration, as random
projection is good enough (data is low rank)
– eigenvalue accuracy estimate can be “sloppy” if emphasis
is on subspace estimation, assuming a spectral gap
– often we care mainly about subspace estimation accuracy
Generalized eigendecomposition
1. (Semi) supervised dimension reduction
–
add prior information by means of class labels
–
linear and non-linear variations: (L)SIR (Li et al. 1991, Wu et al. 2010)
2. (Non-) linear embeddings
–
Laplacian Eigenmaps (Belkin and Niyogi 2002)
–
Locality Preserving Projections (He and Niyogi, 2003)
3. Canonical Correlation Analysis
Empirical results
• Wishart Covariance Structure
– independent N(0,1) entries for data matrix
• The Wellcome Trust Case Control Consortium
(Nature 2007)
– Crohn’s Disease; 500K SNP array; 5,000 individuals
Subspace distance metric
• Exact method: subspace A; approx. method: subspace B (consider column spaces)
• Construct projection operators
$P_A = A (A^T A)^{-1} A^T$
$P_B = B (B^T B)^{-1} B^T$
• Define distance metric: (Ye and Weiss, 2003)
$$\mathrm{dist}(A, B) = 1 - \frac{1}{n} \, \mathrm{tr}(P_A P_B), \qquad 0 \le \mathrm{dist}(A, B) \le 1$$
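A short NumPy sketch of this metric is given below. It assumes A and B have full column rank and the same number of columns, and it takes the normalizing constant to be that shared column count so that the distance lies in [0, 1]; both points are assumptions about the slide's notation, and the function names are hypothetical.

```python
import numpy as np

def projector(A):
    """Orthogonal projector onto the column space of A (full column rank)."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def subspace_distance(A, B):
    """1 - tr(P_A P_B) / k, where k is the common number of columns."""
    k = A.shape[1]
    if B.shape[1] != k:
        raise ValueError("A and B must span subspaces of the same dimension")
    return 1.0 - np.trace(projector(A) @ projector(B)) / k
```

Identical subspaces give a distance of 0 and mutually orthogonal subspaces give 1.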
Wishart covariance
• Data matrix: independent N(0,1) entries
• Runtime improvement over exact
Spiked wishart (rank = 5)
WTCCC Crohn’s disease data set
Subspace distance metric (WTCCC)
Subspace distance metric (WTCCC)
Acknowledgements
• Uwe Ohler1,2 & Sayan Mukherjee2,3
• David Corcoran1,2
• Nick Patterson4
• Ohler & Mukherjee Group
1 Department of Biostatistics and Bioinformatics, Duke University
2 Institute for Genome Sciences and Policy, Duke University
3 Department of Statistical Sciences, Duke University
4 Broad Institute, Harvard and MIT
Thank you!
Wishart Covariance Structure
• Data matrix: independent N(0,1) entries
• Runtime improvement over exact
Decreasing difference in dimension size
Random wishart (# iter = 1)
Random wishart (# iter = 2)
Random wishart (# iter = 3)
Spiked wishart (rank = 5)
WTCCC data (# iter = 1)
WTCCC data (# iter = 2)
Sequence region binding evidence
binding evidence = log[# T-> C conversion events]
T-T
T-C
X-linked clusters
TCATGCTATTTTAGCGATCTGATCGTAGACTGTTAGTCGATGCTGTGTATTTGCA
[David Corcoran]
Quaking
predicted motif
• 4 million mapped reads
• # clusters with site / total # clusters = 3,740 / 9,998
Bibliography
[1] Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data
(2000). Genetics, 155: 945-959
[2] Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR,
Stephens M, Bustamante CD: Genes mirror geography within Europe (2008). Nature, 456 (7218): 98-101
[3] Patterson N, Price AL, Reich D: Population structure and eigenanalysis (2006). PLoS Genetics, 2 (12): e190.
doi:10.1371/journal.pgen.0020190
[4] Rokhlin V, Szlam A, Tygert M: A randomized algorithm for principal component analysis (2009). SIAM
Journal on Matrix Analysis and Applications, 31 (3): 1100-1124
[5] Halko N, Martinsson P, Tropp JA: Finding structure with randomness: Probabilistic algorithms for
constructing approximate matrix decompositions. arXiv:0909.4061v2 [math.NA]
[6] Ye Z, Weiss RE: Using the bootstrap to select one of a new class of dimension reduction methods
(2003). Journal of the American Statistical Association, 98: 968-979
[7] Zhu Y, Zeng P: Fourier methods for estimating the central subspace and the central mean subspace in
regression (2006). Journal of the American Statistical Association, 101: 1638-1651
[8] The Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven
common diseases and 3,000 shared controls (2007). Nature, 447: 661-678
ChIP-seq papers
CTCF: Barski A, Cuddapah S, Cui K, Roh T, Schones D, Wang Z, Wei G, Chepelev I,
Zhao K High-resolution profiling of histone methylations in the human genome.
Cell 2007
STAT1: Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen
G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith O, He A, Marra M, Snyder M,
Jones S Genome-wide profiles of STAT1 DNA association using chromatin
immunoprecipitation and massively parallel sequencing.
Nat Methods 2007
SRF: Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers
RM, Sidow A Genome-wide analysis of transcription factor binding sites based
on ChIP-Seq data.
Nat Methods 2008
Example of cluster generation in the
Argonaute dataset
Eigenstrat
Properties of Randomized PCA
• Empirical observations
– we don’t seem to need power iteration, as random
projection is good enough (data is low rank)
– eigenvalue accuracy estimate can be “sloppy” if emphasis
is on subspace estimation, assuming a spectral gap
– often we care mainly about subspace estimation accuracy
• Lots of “painful” implementation details
– efficient matrix multiply
– data packing
Inference of population structure and
generalized eigendecomposition
(with Sayan Mukherjee1 and Nick Patterson2)
1 Department of Statistical Sciences, Duke University
2 Broad Institute, Harvard and MIT
PARalyzer
• Non-parametric kernel-density-estimate classifier to identify
the RNA-protein interaction sites from a combination of T=>C
conversions and read density
1. Reads that have been aligned to the genome and overlap by
at least 1 nucleotide are grouped together.
2. Within each read group we generate two smoothed
kernel density estimates: the first of T=>C transitions and the
other of non-transition events.
3. Nucleotides within the grouped reads that maintain a
minimum read depth, and where the relative likelihood of
T=>C conversion is higher than non-conversion, are
considered interaction sites.
4. This region is then extended either to include the underlying
reads, or by a generic window size (by 3 nt for Pum).
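Steps 2 and 3 can be illustrated with a small Gaussian kernel density sketch. It is not the PARalyzer implementation: the Gaussian kernel, the bandwidth, the default minimum depth, the comparison of normalized densities as the "relative likelihood", and all function names are assumptions made for the example.

```python
import numpy as np

def kde_on_grid(positions, grid, bandwidth):
    """Gaussian kernel density estimate of event positions, evaluated on grid."""
    positions = np.asarray(positions, dtype=float)
    if positions.size == 0:
        return np.zeros(grid.shape[0])
    z = (grid[:, None] - positions[None, :]) / bandwidth
    norm = positions.size * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

def call_interaction_sites(conv_pos, nonconv_pos, depth, min_depth=5, bandwidth=3.0):
    """Indices with enough reads where conversion density beats non-conversion."""
    depth = np.asarray(depth)
    grid = np.arange(depth.shape[0], dtype=float)
    d_conv = kde_on_grid(conv_pos, grid, bandwidth)    # T=>C events
    d_non = kde_on_grid(nonconv_pos, grid, bandwidth)  # non-conversion events
    return np.flatnonzero((depth >= min_depth) & (d_conv > d_non))
```

Here `conv_pos` and `nonconv_pos` are positions within one read group, and `depth` is the per-nucleotide read depth of that group.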
AGO
• largest number of clusters for the Argonaute dataset was
found in intergenic regions
• requiring at least two separate locations with observed T=>C
conversions within the cluster removed a large proportion
(67%) of those sites, while only removing a small proportion
(24%) of clusters found in 3'UTRs
• We therefore require all clusters to have more than one
location with a T=>C conversion for all subsequent analysis.
• To increase the stringency of the CCRs, we required the mode
location to have had at least 20% T=>C conversion
Argonaute (AGO) PAR-CLIP Analysis
microRNA Enrichment Analysis Tool (mEAT)
[Figure: sequence regions ranked from high to low binding evidence; miR seeds (miR-93, miR-15, let-7, ...) are scored against them, e.g. ES = 15.0.]
ES = average binding evidence across miR canonical seeds
Variable Definitions
$Y^* = (y_1^*, \ldots, y_n^*)^T$: binding evidence
$k$: number of miR canonical seeds
$Z_j = (x_{j1}, \ldots, x_{jk}, c_1, \ldots, c_{p-k})$: regression covariates
$\beta_j = (\beta_{1j}, \ldots, \beta_{kj}, \beta_{k+1,j}, \ldots, \beta_{pj})^T$: regression coefficients
$x_{jr} \in \{0, 1\}^n$, $r \in \{1, \ldots, k\}$: miR seed match indicator
$c_l \in \mathbb{R}^n$, $l \in \{1, \ldots, p-k\}$: confounders
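miR seed enrichment
model: $Y^* = Z_j \beta_j + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I_n)$
OLS fit: $\hat{\beta}_j = (Z_j^T Z_j)^{-1} Z_j^T y^*$, $\hat{\Sigma}_{\hat{\beta}_j} = \hat{\sigma}^2 (Z_j^T Z_j)^{-1}$
miR seed binding evidence:
$$m_j^{(r)} = \max\left( \frac{\hat{\beta}_{rj}}{\sqrt{(\hat{\Sigma}_{\hat{\beta}_j})_{rr}}}, \; 0 \right) \quad \text{(contribution of the } r\text{-th miR seed)}$$
$$n_j = \sum_{r=1}^{k} 1[m_j^{(r)} > 0]$$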
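miR seed enrichment
model: $Y^* = Z_j \beta_j + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I_n)$
OLS fit: $\hat{\beta}_j = (Z_j^T Z_j)^{-1} Z_j^T y^*$, $\hat{\Sigma}_{\hat{\beta}_j} = \hat{\sigma}^2 (Z_j^T Z_j)^{-1}$
miR seed binding evidence:
$$S_j^{(r)} = \max\left( \frac{\hat{\beta}_{rj}}{\sqrt{(\hat{\Sigma}_{\hat{\beta}_j})_{rr}}}, \; 0 \right) \quad \text{(contribution of the } r\text{-th miR seed)}$$
$$n_j = \sum_{r=1}^{k} 1[S_j^{(r)} > a]; \quad a \text{ fit using cross-validation}$$
miR binding evidence:
$$S_j^{miR} = \frac{1}{n_j} \sum_{r=1}^{k} S_j^{(r)}$$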
Results
cluster #  miRbase           8-mer     expression rank  microRNA score  # targets  cumulative # targets
1          hsa-mir-106b      GCACTTTA  5                11.42           1028       1028 (9.8%)
           hsa-mir-20a       GCACTTTA  9                11.42
           hsa-mir-519c      TGCACTTT  287              9.93
           hsa-mir-519c-3p   TGCACTTT  NA               9.93
           hsa-mir-519a-2    TGCACTTT  NA               9.93
           hsa-mir-519b-3p   TGCACTTT  NA               9.93
           hsa-mir-519a-1    TGCACTTT  NA               9.93
           hsa-mir-106a      GCACTTTT  121              9.84
           hsa-mir-526bstar  GCACTTTC  NA               9.77
           hsa-mir-93        GCACTTTG  1                8.96
           hsa-mir-17        GCACTTTG  10               8.96
           hsa-mir-20b       GCACTTTG  225              8.96
           hsa-mir-519d      GCACTTTG  288              8.96
           hsa-mir-520d-3p   AGCACTTT  NA               7.44
           hsa-mir-520b      AGCACTTT  NA               7.44
           hsa-mir-520e      AGCACTTT  NA               7.44
           hsa-mir-372       AGCACTTT  NA               7.44
           hsa-mir-520c-3p   AGCACTTT  NA               7.44
           hsa-mir-520a-3p   AGCACTTT  NA               7.44
2          hsa-mir-16-2      TGCTGCTA  22               10.32           795        1799 (17.2%)
           hsa-mir-15b       TGCTGCTA  53               10.32
           hsa-mir-15a       TGCTGCTA  64               10.32
           hsa-mir-195       TGCTGCTA  218              10.32
           hsa-mir-16-1      TGCTGCTA  NA               10.32
           hsa-mir-424       TGCTGCTG  60               7.61
           hsa-mir-497       TGCTGCTG  133              7.61
           hsa-mir-103-2     ATGCTGCT  2                7.6
           hsa-mir-107       ATGCTGCT  39               7.6
           hsa-mir-103-1     ATGCTGCT  NA               7.6
           hsa-mir-503       CGCTGCTA  97               6.81
3          hsa-mir-92a-1     GTGCAATA  4                8.68            329        2098 (20.1%)
           hsa-mir-32        GTGCAATA  95               8.68
           hsa-mir-92b       GTGCAATA  101              8.68
           hsa-mir-92a-2     GTGCAATA  NA               8.68
           hsa-mir-25        GTGCAATG  11               7.44
           hsa-mir-363       GTGCAATT  130              7.21
           hsa-mir-367       GTGCAATT  NA               7.21
4          hsa-mir-19b-1     TTTGCACA  6                7.93            570        2367 (22.6%)
           hsa-mir-19a       TTTGCACA  7                7.93
           hsa-mir-19b-2     TTTGCACA  NA               7.93
5          hsa-mir-454       TTGCACTA  108              7.74            745        2367 (22.6%)
           hsa-mir-301a      TTGCACTG  18               6.36
           hsa-mir-130a      TTGCACTG  47               6.36
           hsa-mir-130b      TTGCACTG  56               6.36
           hsa-mir-301b      TTGCACTG  74               6.36
           hsa-mir-3666      TTGCACTG  NA               6.36
           hsa-mir-4295      TTGCACTG  NA               6.36
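Regression Interpretation
$\beta_j$: regression coefficient for motif $m_j$
Model: $y_i - \bar{y} = x_{ij} \beta_j + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$
$$\Rightarrow \hat{\beta}_j^{OLS} = \frac{1}{n_j} \sum_{i : x_{ij} = 1} y_i^* = e_j$$
$$S_j^{reg} = \frac{\hat{\beta}_j^{OLS}}{\hat{\sigma}_j} = \frac{1}{A} \, S_j^{cERMIT} \;\Rightarrow\; m^*_{reg} = \arg\max_{j \in \{1, \ldots, T\}} S_j^{reg}$$
$$n_j \ll n \;\Rightarrow\; S_j^{reg} \approx S_j^{cERMIT}$$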
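The first identity is easy to check numerically: for a 0/1 indicator, least squares without an intercept on the centered evidence returns the mean shift of the regions containing the motif. The snippet below is a small sanity check under that model; the function names are hypothetical.

```python
import numpy as np

def ols_no_intercept(x, y_star):
    """Least-squares slope of y_star on a single covariate x, no intercept."""
    return float(x @ y_star / (x @ x))

def motif_mean_shift(x, y):
    """e_j: mean evidence of regions with the motif minus the overall mean."""
    return float(y[x == 1].mean() - y.mean())
```

Because x only takes the values 0 and 1, `x @ x` equals n_j and `x @ y_star` is the sum of centered evidence over matching regions, which is why the two quantities coincide.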
PARalyzer (PAR-CLIP data analyzer)
• mRNA translation into protein can be regulated
through sequence motifs on the mRNA transcript
– RNA binding proteins (RBPs)
• Input:
– binding evidence for transcribed mRNAs
– library of mRNA sequence motifs
• Output: enriched mRNA sequence motifs
[David Corcoran]
PARalyzer (PAR-CLIP data analyzer)
1. Align reads to a reference genome
2. Group adjacent reads into clusters (sequence
regions)
3. Assign binding evidence to each cluster: log2[#
reads]
4. Use clusters to find enriched motifs
[David Corcoran]
PARalyzer (PAR-CLIP data analyzer)
1. Align reads to a reference genome, allowing for up to 3
mismatches (i.e. up to 3 T->C conversion events per read)
2. Group overlapping reads
– groups with ≥ 5 reads are further analyzed
– Clusters are extended to either the longest read that overlaps a ‘positive’
signal or until there are no longer at least 5 reads at a location
– filter groups based on known repeat regions
3. Within each group generate sub-groups (clusters) based on the
observed T->C conversion events
– identify regions with enriched T->C relative to T->T
– use non-parametric smoothing (KDE) to call peaks
4. Use sub-groups in downstream motif enrichment analysis
[David Corcoran]
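The read-grouping step (step 2) amounts to merging overlapping intervals and keeping groups with enough reads. A minimal sketch, assuming reads on a single chromosome and strand given as half-open (start, end) pairs; the function name and the dictionary layout are hypothetical, and the repeat-region filter is not included.

```python
def group_reads(reads, min_reads=5):
    """Merge reads that overlap by at least 1 nt; keep groups with >= min_reads."""
    groups = []
    for start, end in sorted(reads):
        # Half-open intervals overlap by >= 1 nt when start < current group end.
        if groups and start < groups[-1]["end"]:
            groups[-1]["end"] = max(groups[-1]["end"], end)
            groups[-1]["count"] += 1
        else:
            groups.append({"start": start, "end": end, "count": 1})
    return [g for g in groups if g["count"] >= min_reads]
```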
mEAT
• Input data: Argonaute X-linked clusters, miRbase seeds
• miR seeds: 8mer, 7mer-M1, 7mer-A1, 7mer-1-7, 7mer-2-8
• miR seed enrichment: normalized average enrichment score
for the set of clusters with a seed match [3]
• Targets of miR i: indexes of clusters containing a match to the top
enriched seed for miR i
• Goal: find the most highly enriched miRs
mEAT: Enrichment vs. Expression
Top expressed miRs
Top expressed miRs
Gene expression in the eukaryotic cell
ChIP-seq
Motif model
(Georgiev et al. 2010)
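Notation:
$$\bar{y} = \frac{1}{n} \sum_{i \in \{1, \ldots, n\}} y_i; \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i} (y_i - \bar{y})^2; \qquad y_i^* = y_i - \bar{y};$$
$$A = \frac{n - n_j}{n - 1}; \qquad e_j = \frac{1}{n_j} \sum_{i : x_{ij} = 1} y_i^*; \qquad \hat{\sigma}_j^2 = \frac{\hat{\sigma}^2}{n_j}$$
Motif Binding Evidence:
$$m_j = A \, \frac{e_j}{\hat{\sigma}_j}$$
Optimal motif:
$$m^*_{cERMIT} = \arg\max_{j \in \{1, \ldots, T\}} m_j$$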
HuR
# reads = 20M, aligned 13M,
# clusters = 250K,
# clusters after pre-processing = 125K,
“explained” with presence of binding motif = 25% long, 75% two short
plots with T->C conversions (David),
- in vitro binding studies, which have shown that HuR is capable of
binding to AREs including, AUUUA pentamers, long poly-U stretches,
and 3 to 5 nucleotide stretches of Us
separated by A, C, or G (Levine et al., 1993; Meisner et al., 2004).
Quaking
• # clusters with site / total # clusters = 3,740 /
9,998
• # reads, # clusters, # “explained” with
presence of binding motif, plots with T->C
conversions (David),
• Group using reads, as not all X-linked
Pumilio