Supplementary Information

The biological function of some human transcription factor
binding motifs varies with position relative to the transcription
start site
Kannan Tharakaraman, Olivier Bodenreider, David Landsman, John L. Spouge and
Leonardo Mariño-Ramírez
1.1 Local Maximum Statistic and Evaluation of Statistically Significant Clusters
Our previous work describes the local maximum statistic and the clustering algorithm
(described below) in much detail (Tharakaraman et al. 2005). In brief, consider $n$ PPRs
of length $l$ aligned in an $n \times l$ block, so the columns are indexed by $1, \ldots, l$. For brevity,
we call this arrangement of PPRs “a block alignment”. For any particular word $W$
containing $w$ letters, if an unmasked instance of $W$ has its final letter in column $i$, we
say that $W$ has “occurred at $i$”, where $w \le i \le l$. We wish to locate occurrences of $W$
that are unusually clustered by columns within the block alignment.
To develop our clustering statistic, let $X_i$ be the number of PPRs where $W$ has
occurred at $i$ ($i = 1, \ldots, l$), and let $S_i = \sum_{j=1}^{i} X_j$ be the cumulative occurrences of $W$ up
to $i$. Let $a > 0$ be arbitrary. Our statistic for sequence clustering is the “local maximum”
$\hat{M}_l = \max_{1 \le i \le j \le l} D_{i,j}$, where the difference $D_{i,j} = (S_j - aj) - (S_i - ai)$ for $i, j = 1, \ldots, l$.
Intuitively, $D_{i,j}$ is large if there is a large number $S_j - S_i$ of occurrences of $W$ in a short
interval $j - i$ of columns.
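The local maximum can be computed in one pass over the columns by tracking the running minimum of $S_i - ai$. The following is a minimal Python sketch under that reading of the definition; the function name `local_maximum` and the toy counts are illustrative assumptions, not part of the original analysis.

```python
from typing import Optional, Sequence


def local_maximum(x: Sequence[int], a: float) -> float:
    """Return max over 1 <= i <= j <= l of D_ij = (S_j - a*j) - (S_i - a*i).

    x[0] holds X_1, x[1] holds X_2, and so on (hypothetical input layout).
    """
    cumulative = 0.0                 # S_j
    lowest: Optional[float] = None   # min over i <= j of S_i - a*i
    best = 0.0                       # D_jj = 0 is always attainable
    for j, count in enumerate(x, start=1):
        cumulative += count
        walk = cumulative - a * j    # S_j - a*j
        lowest = walk if lowest is None else min(lowest, walk)
        best = max(best, walk - lowest)
    return best


# Toy column counts: a small cluster in columns 3 to 5.
toy_counts = [0, 0, 1, 2, 1, 0, 0]
toy_statistic = local_maximum(toy_counts, a=0.5)
```

For the toy counts, the maximum is attained at $i = 2$, $j = 5$, where four occurrences fall within three columns.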
To determine the approximate distribution of $\hat{M}_l$ under a random model, assume the
$n$ PPRs are independent, the $l$ letters of each PPR chosen independently. Each letter is
randomly drawn with fixed frequencies from the nucleotide alphabet $\{a, c, g, t\}$. Because
of masking, $W$ occurs at most once in each PPR, at a random position. Under the random
model, therefore, the $X_i$’s are independent and identically distributed with a binomial
distribution. The number of binomial trials for each $X_i$ is $n$, each trial having
approximate probability of success $p = S_l / [n(l - w + 1)]$ if $nl$ is large. If the average
number $\lambda = np$ of occurrences in each column is small, most of the $\{X_i\}$ are 0 or 1, like
a coin toss. If $l\lambda^2$ is also small, $S_l$ is approximately Poisson distributed, with mean $l\lambda$.
Intuitively, the $\{X_i\}$ approximate a Poisson process of intensity $\lambda = S_l / (l - w + 1)$ on the
continuous time-interval from $w$ to $l$.
The Poisson process approximation is improved if each occurrence of $W$ is “jittered”
by adding a random value chosen uniformly from the interval $[0, 1)$ to its position $i$. Let
$S(t)$ be the cumulative jittered occurrences of $W$ up to (continuous) time $t$, where
$w \le t \le l + 1$. Let $D(t, u) = (S(u) - au) - (S(t) - at)$ and $\hat{M}(l + 1) = \max_{w \le t \le u \le l + 1} D(t, u)$
be the continuous-time analogs of $D_{i,j}$ and $\hat{M}_l$. Karlin and Dembo (1992) suggest
$\hat{M}(l + 1)$ as a statistic for assessing clustering.
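The jittering step can be sketched in Python as follows. The function name `jitter_positions`, the seed argument, and the example columns are assumptions made for illustration only.

```python
import random
from typing import Iterable, List, Optional


def jitter_positions(columns: Iterable[int], seed: Optional[int] = None) -> List[float]:
    """Add an independent uniform value from [0, 1) to each column position.

    `columns` lists the column i of every occurrence of the word, with
    repeats when several PPRs share a column. Returns sorted jittered times.
    """
    rng = random.Random(seed)  # seeded generator, for reproducible sketches
    return sorted(i + rng.random() for i in columns)


# Hypothetical occurrences: two PPRs hit column 5, one column 7, one column 100.
example_times = jitter_positions([5, 5, 7, 100], seed=0)
```

After jittering, ties between occurrences in the same column are broken, so the occurrence times can be treated as points of a continuous-time process.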


Karlin and Dembo give inequalities on the p-value $P\{\hat{M}(t) \ge y\}$ for the simple
Poisson process described above. Their inequalities can be extended to compound
Poisson processes and sharpened to an exact asymptotic formula. The exact formula is an
extreme-value distribution, closely related to BLAST E-values (Karlin and Altschul
1990; Karlin and Dembo 1992).


As in BLAST, $P\{\hat{M}(t) < y\} \approx e^{-E}$, where $E$ is an E-value (i.e., a mean for a Poisson
distribution). The E-value
$$E = k e^{-\theta y} t, \tag{1}$$
where the time $t = l - w + 1$. Here, $\theta$ is the unique positive solution to the equation
$$e^{\theta} - \frac{a}{\lambda}\,\theta = 1, \tag{2}$$
and
$$k = a\theta \, \frac{(1 - \lambda/a)^{2}}{(\lambda/a)\, e^{\theta} - 1}. \tag{3}$$
In the notation of Eqs 6-8 of (Frith et al. 2002), Eqs (1)-(3) specialize the known Poisson
process solution from compound to simple with the substitution $Z = 1$. More recent
versions of BLAST include a statistical correction for edge effects (Altschul and Gish
1996; Spouge 2001), but for $l = 3001$ these corrections are negligible and were omitted.
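As a rough numerical illustration of Eqs (1)-(3), the sketch below solves for the exponent by bisection and evaluates the E-value and p-value. It assumes the reading $e^{\theta} - (a/\lambda)\theta = 1$ for Eq (2), $k = a\theta(1 - \lambda/a)^2 / [(\lambda/a)e^{\theta} - 1]$ for Eq (3), and an intensity $\lambda < a$; the symbol names, function names, and parameter values are all hypothetical.

```python
import math


def solve_theta(lam: float, a: float) -> float:
    """Positive root of exp(theta) - (a / lam) * theta = 1, assuming 0 < lam < a."""
    if not 0.0 < lam < a:
        raise ValueError("a positive root requires 0 < lam < a")
    c = a / lam

    def f(theta: float) -> float:
        return math.exp(theta) - c * theta - 1.0

    lo = math.log(c)         # f is minimal, and negative, at log(c)
    hi = 2.0 * lo + 1.0
    while f(hi) <= 0.0:      # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):     # plain bisection
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cluster_e_value(y: float, lam: float, a: float, t: float) -> float:
    """E = k * exp(-theta * y) * t, with k as assumed in the lead-in."""
    theta = solve_theta(lam, a)
    rho = lam / a
    k = a * theta * (1.0 - rho) ** 2 / (rho * math.exp(theta) - 1.0)
    return k * math.exp(-theta * y) * t


def cluster_p_value(y: float, lam: float, a: float, t: float) -> float:
    """P{M >= y} is approximately 1 - exp(-E)."""
    return -math.expm1(-cluster_e_value(y, lam, a, t))


# Hypothetical inputs: intensity 0.01 per column, a = 0.1, t = l - w + 1 = 2994.
example_p = cluster_p_value(y=3.0, lam=0.01, a=0.1, t=2994.0)
```

Bisection is used here only because it is simple and robust; any one-dimensional root finder would serve.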
1.2 Algorithm to Find Significant Clusters
Our algorithm was a mild modification of the linear-time Ruzzo-Tompa algorithm for
finding all maximal segments in a set of real numbers $\{z_1, z_2, \ldots, z_k\}$ (Ruzzo and Tompa
1999). Briefly, a segment is a contiguous subset, having the form $\{z_{i+1}, \ldots, z_j\}$. The
segment corresponds to the score $d_{i,j} = \sum_{m=i+1}^{j} z_m$. According to Ruzzo and Tompa, a
segment has “Property P1” if all its proper sub-segments have a lower score. A segment is
“maximal” if it has Property P1, but none of its containing segments has Property P1.
Ruzzo and Tompa show that maximal segments, so defined, are disjoint.
In our set-up, let the word $W$ occur $k$ times in the block alignment, at jittered
column positions $\{T_1, T_2, \ldots, T_k\}$. Associate with $T_j$ the cumulative score $s_j = j - aT_j$
described above, since there are $j$ occurrences of $W$ up to time $T_j$. Define
$z_j = s_j - s_{j-1} = 1 - a(T_j - T_{j-1})$, where $s_0 = T_0 = 0$. Our modification of the Ruzzo-Tompa
algorithm determines all maximal segments of $\{z_j\}$, while maintaining a list of the
corresponding positions and PPRs where $W$ occurred.
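The sketch below builds the scores $z_j = 1 - a(T_j - T_{j-1})$ from jittered positions and applies the Ruzzo-Tompa list-merging rule. It is a plain rendering for illustration: it searches the candidate list directly rather than in linear time, it omits the bookkeeping of PPRs described above, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def maximal_segments(z: Sequence[float]) -> List[Tuple[int, int, float]]:
    """Return (start, end, score) for each maximal segment z[start:end]."""
    # Each candidate is [start, end, L, R]: L is the running total before
    # the segment, R the running total at its end (Ruzzo and Tompa 1999).
    candidates: List[list] = []
    total = 0.0
    for idx, score in enumerate(z):
        left = total
        total += score
        if score <= 0:
            continue
        cur = [idx, idx + 1, left, total]
        while True:
            # Rightmost earlier candidate whose L is strictly below cur's L.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= cur[2]:
                j -= 1
            if j < 0 or candidates[j][3] >= cur[3]:
                candidates.append(cur)
                break
            # Otherwise extend cur leftwards over candidate j and retry.
            cur = [candidates[j][0], cur[1], candidates[j][2], cur[3]]
            del candidates[j:]
    return [(s, e, r - l) for s, e, l, r in candidates]


def cluster_segments(times: Sequence[float], a: float):
    """Score jittered times with z_j = 1 - a * (T_j - T_{j-1}), T_0 = 0."""
    ordered = sorted(times)
    z, previous = [], 0.0
    for t in ordered:
        z.append(1.0 - a * (t - previous))
        previous = t
    return [(ordered[s:e], score) for s, e, score in maximal_segments(z)]


# Hypothetical jittered positions with one tight group near column 11.
example_clusters = cluster_segments([10.5, 11.0, 11.5, 50.0], a=0.1)
```

Because every $z_j$ is at most 1, a segment scores highly only when many occurrences follow one another with gaps well below $1/a$.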
To calculate statistical significance of a segment $\{z_{i+1}, \ldots, z_j\}$, determine $y$ so that
$p = P\{\hat{M}(l - w + 1) \ge y\}$, as given in Section 1.1. Thus, in a random block alignment, a segmental score
exceeds $y$ with a probability not exceeding the p-value $p$. Because the maximal
segments are disjoint, they are probabilistically independent, so each segmental score
$d_{i,j} \ge y$ can be considered statistically significant.
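Under the extreme-value approximation of Section 1.1, read here as $p \approx 1 - e^{-E}$ with $E = k e^{-\theta y} t$, the threshold $y$ for a chosen p-value can be obtained in closed form. The sketch below shows that inversion; the function name and the numerical values of $\theta$, $k$ and $t$ are hypothetical placeholders, not values from the study.

```python
import math


def score_threshold(p: float, theta: float, k: float, t: float) -> float:
    """Return y such that 1 - exp(-k * exp(-theta * y) * t) equals p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    e_value = -math.log1p(-p)                  # E = -ln(1 - p)
    return math.log(k * t / e_value) / theta   # invert E = k * exp(-theta*y) * t


# Hypothetical parameters, for illustration only.
example_y = score_threshold(p=0.01, theta=3.6, k=0.1, t=2994.0)
```

Segments whose score reaches the returned threshold would then be reported at the chosen significance level.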
1.3 Gene Ontology Similarity Metric
First, the Lin metric quantified the semantic similarity among terms from the Molecular
Function hierarchy in the Gene Ontology (GO) database. The metric uses information
content to define the semantic similarity between two terms ci and c j as follows:
$$\mathrm{sim}(c_i, c_j) = \frac{2 \max_{c \in S(c_i, c_j)} \log p(c)}{\log p(c_i) + \log p(c_j)}, \tag{4}$$
where $S(c_i, c_j)$ represents the set of ancestor terms $c_i$ and $c_j$ share; “max”, the
maximum operator; and $p(c)$, the probability of finding $c$ or any of its descendants in
the GO database. Semantic similarity satisfies $0 \le \mathrm{sim}(c_i, c_j) \le 1$. Similarity between two
genes $g_i$ and $g_j$, with sets of annotations $A_i$ and $A_j$ comprising $m$ and $n$ terms
respectively, is defined as the largest average (inter-set) similarity between terms from
$A_i$ and $A_j$:
$$\mathrm{SIM}(g_i, g_j) = \frac{1}{m + n} \left( \sum_{k \in A_i} \max_{p \in A_j} \mathrm{sim}(c_k, c_p) + \sum_{p \in A_j} \max_{k \in A_i} \mathrm{sim}(c_k, c_p) \right). \tag{5}$$
This aggregation method can be understood as a variant of Dice similarity (Azuaje et al.
2006).
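A small Python sketch of Eqs (4) and (5) on a toy ontology follows. It assumes that the shared ancestor entering Eq (4) is the most informative one (the one with the smallest $p(c)$, as in Lin's original measure), since the root term has $\log p = 0$; the term names, probabilities, and function names are invented for illustration and do not come from GO.

```python
import math
from typing import Dict, Sequence, Set

# Hypothetical toy ontology: p(c) and ancestor sets (each term includes itself).
PROB: Dict[str, float] = {
    "root": 1.0, "binding": 0.5, "dna_binding": 0.1, "rna_binding": 0.2,
}
ANCESTORS: Dict[str, Set[str]] = {
    "root": {"root"},
    "binding": {"binding", "root"},
    "dna_binding": {"dna_binding", "binding", "root"},
    "rna_binding": {"rna_binding", "binding", "root"},
}


def lin_similarity(ci: str, cj: str, prob=PROB, ancestors=ANCESTORS) -> float:
    """Eq (4), using the most informative shared ancestor (assumption)."""
    shared = ancestors[ci] & ancestors[cj]
    denom = math.log(prob[ci]) + math.log(prob[cj])
    if not shared or denom == 0.0:   # no shared ancestor, or both terms are the root
        return 0.0
    most_informative = min(math.log(prob[c]) for c in shared)
    return 2.0 * most_informative / denom


def gene_similarity(a_i: Sequence[str], a_j: Sequence[str]) -> float:
    """Eq (5): average of best matches in both directions."""
    forward = sum(max(lin_similarity(k, p) for p in a_j) for k in a_i)
    backward = sum(max(lin_similarity(k, p) for k in a_i) for p in a_j)
    return (forward + backward) / (len(a_i) + len(a_j))


example_sim = gene_similarity(["dna_binding"], ["dna_binding", "rna_binding"])
```

Two genes with identical annotation sets score 1 under this aggregation, and the measure is symmetric in its arguments.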
1.4 The Fisher Inverse Chi-Square Test for Combining p-values
Consider the product $Z_n = p_1 \cdots p_n$ of continuous independent p-values $p_1, \ldots, p_n$. The
probability that $Z_n \le p$ is
$$P\{Z_n \le p\} = p \sum_{i=0}^{n-1} \frac{(-\ln p)^{i}}{i!} \tag{6}$$
for $0 < p \le 1$ and is zero when $p$ is zero (Bailey and Gribskov 1998). The test is
classical, but Bailey and Gribskov (1998) give the particularly easy formula in Eq (6) for
the computation.
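Eq (6) translates directly into a few lines of Python. The sketch below is illustrative only; the function name is hypothetical, and it multiplies the p-values directly, which could underflow for very long lists of very small p-values.

```python
import math
from typing import Sequence


def combined_p_value(p_values: Sequence[float]) -> float:
    """Probability that a product of n independent uniform p-values is <= the observed product."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    product = math.prod(p_values)      # observed value of Z_n
    if product == 0.0:
        return 0.0
    neg_log = -math.log(product)
    term, total = 1.0, 0.0             # term holds (-ln p)^i / i!
    for i in range(len(p_values)):
        total += term
        term *= neg_log / (i + 1)
    return product * total


# Two p-values of 0.1 each combine to 0.01 * (1 + ln 100).
example_combined = combined_p_value([0.1, 0.1])
```

With a single p-value the formula returns that p-value unchanged, as expected.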
Supplementary References
Altschul, S.F. and W. Gish. 1996. Local alignment statistics. Methods Enzymol 266: 460-480.
Azuaje, F., F. Al-Shahrour, and J. Dopazo. 2006. Ontology-driven approaches to
analyzing data in functional genomics. Methods Mol Biol 316: 67-86.
Bailey, T.L. and M. Gribskov. 1998. Combining evidence using p-values: application to
sequence homology searches. Bioinformatics 14: 48-54.
Frith, M.C., J.L. Spouge, U. Hansen, and Z. Weng. 2002. Statistical Significance of
Clusters of Motifs Represented by Position Specific Scoring Matrices in
Nucleotide Sequences. Nucleic Acids Res 30: 3214-3224.
Karlin, S. and S.F. Altschul. 1990. Methods for assessing the statistical significance of
molecular sequence features by using general scoring schemes. Proc Natl Acad
Sci U S A 87: 2264-2268.
Karlin, S. and A. Dembo. 1992. Limit Distributions of Maximal Segmental Score Among
Markov-Dependent Partial-Sums. Advances in Applied Probability 24: 113-140.
Ruzzo, W.L. and M. Tompa. 1999. A linear time algorithm for finding all maximal
scoring subsequences. In Seventh International Conference on Intelligent Systems
for Molecular Biology (eds. T. Lengauer, R. Schneider, P. Bork, D. Brutlag, J.
Glasgow, H.-W. Mewes, and R. Zimmer). American Association for Artificial
Intelligence, Heidelberg, Germany.
Spouge, J.L. 2001. Finite-size correction to Poisson approximations of rare events in
renewal processes. J. Appl. Prob. 38: 554-569.
Tharakaraman, K., L. Marino-Ramirez, S. Sheetlin, D. Landsman, and J.L. Spouge. 2005.
Alignments anchored on genomic landmarks can aid in the identification of
regulatory elements. Bioinformatics 21: I440-I448.
Figure S1. The nucleotide frequency distribution in the columns of the block alignment
of the 7,914 PPRs. The anchored PPR dataset shows an over-representation of G and C
around the TSS, probably reflecting excess CpG dinucleotides. Strikingly, the A and C
are over-represented near the TSS (specifically, at 1 and 0 bp), probably reflecting the
dinucleotide CA, part of the consensus initiator (INR) element. These trends support the
quality of TSS annotation.
Figure S2. The tissue specificities of the 50 most significant clusters under the
Mann-Whitney test. The matrices in Figures S2a and S2b show a cross-table of clusters (rows)
and tissues (columns), with its entries color-coded according to $1 - p$ from the
Mann-Whitney test (so large values of $1 - p$ indicate enrichment of the cluster’s gene group in
the tissue). Figure S2a represents the actual gene group; Figure S2b, the control gene
group using the same word without positional preference.
Figure S3. The intersection network. The intersection network was obtained by intersecting
three individual networks, corresponding to (1) the positionally significant words, (2) the
GO annotation, and (3) the microarray Atlas. Figure S3 shows the disconnected
components of the intersection network.
Figure S4 a,b,c. Three highly interconnected networks of genes. For each interconnected
network, the top of each figure shows the microarray coexpression matrix with its rows
corresponding to tissues and its columns to genes.
Supplementary Table 1. For a set of 44 words, the tissue of enriched expression
predicted using the Mann-Whitney rank sum statistic and the TRANSFAC TF are listed.
For each word, evidence from literature linked the TF to at least one of the predicted
tissues.
Supplementary Table 2. Global characteristics of individual and intersection networks.
Network                 Nodes¹    Edges²    Average Degree³
Coexpression            2875      320230    222.76
GO similarity           3574      320237    179.20
TF similarity           3582      320230    178.80
Intersection network    764       949       2.48
¹ Number of genes (nodes) in the network
² Number of gene pairs (edges) in the network
³ Average number of edges at a node (degree)
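The average degree column appears consistent with the usual relation for an undirected network, in which each edge contributes to the degree of two nodes. The short check below assumes that relation; the function name is hypothetical.

```python
def average_degree(nodes: int, edges: int) -> float:
    """Average degree of an undirected network: 2 * edges / nodes (assumed relation)."""
    return 2.0 * edges / nodes


# Node and edge counts as listed in Supplementary Table 2.
table_rows = {
    "Coexpression": (2875, 320230),
    "GO similarity": (3574, 320237),
    "TF similarity": (3582, 320230),
    "Intersection network": (764, 949),
}
computed = {name: average_degree(n, e) for name, (n, e) in table_rows.items()}
```

The computed values agree with the tabulated ones to within 0.01.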