The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site

Kannan Tharakaraman, Olivier Bodenreider, David Landsman, John L. Spouge and Leonardo Mariño-Ramírez

Supplementary Information

1.1 Local Maximum Statistic and Evaluation of Statistically Significant Clusters

Our previous work describes the local maximum statistic and the clustering algorithm (described below) in much detail (Tharakaraman et al. 2005). In brief, consider $n$ PPRs of length $l$ aligned in an $n \times l$ block, so the columns are indexed by $1, \ldots, l$. For brevity, we call this arrangement of PPRs “a block alignment”. For any particular word $W$ containing $w$ letters, if an unmasked instance of $W$ has its final letter in column $i$, we say that $W$ has “occurred at $i$”, where $w \le i \le l$. We wish to locate occurrences of $W$ that are unusually clustered by columns within the block alignment.

To develop our clustering statistic, let $X_i$ be the number of PPRs where $W$ has occurred at $i$ ($i = 1, \ldots, l$), and let $S_i = \sum_{j=1}^{i} X_j$ be the cumulative occurrences of $W$ up to $i$. Let $a > 0$ be arbitrary. Our statistic for sequence clustering is the “local maximum” $\hat{M}_l = \max_{1 \le i \le j \le l} D_{i,j}$, where the difference $D_{i,j} = (S_j - aj) - (S_i - ai)$ for $i, j = 1, \ldots, l$. Intuitively, $D_{i,j}$ is large if there is a large number $S_j - S_i$ of occurrences of $W$ in a short interval $j - i$ of columns.

To determine the approximate distribution of $\hat{M}_l$ under a random model, assume the $n$ PPRs are independent, the $l$ letters of each PPR chosen independently. Each letter is randomly drawn with fixed frequencies from the nucleotide alphabet $\{a, c, g, t\}$. Because of masking, $W$ occurs at most once in each PPR, at a random position. Under the random model, therefore, the $X_i$’s are independent and identically distributed with a binomial distribution. The number of binomial trials for each $X_i$ is $n$, each trial having approximate probability of success $p = S_l / [n(l - w + 1)]$ if $nl$ is large.
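The discrete statistic $\hat{M}_l$ defined above can be evaluated in one pass over the columns by tracking the smallest value of $S_i - ai$ seen so far. The following is a minimal Python sketch under stated assumptions: the column counts $X_1, \ldots, X_l$ are already available as a list, and the function name `local_maximum` and the example counts are illustrative only and do not come from the original work.

```python
def local_maximum(counts, a):
    """Return the maximum over 1 <= i <= j <= l of (S_j - a*j) - (S_i - a*i).

    counts[c - 1] holds X_c, the number of PPRs in which the word
    occurred at column c; a > 0 is the per-column penalty.
    """
    best = 0.0              # the case i == j always gives D = 0
    cumulative = 0          # running value of S_j
    lowest = float("inf")   # minimum over i <= j of (S_i - a*i)
    for column, x in enumerate(counts, start=1):
        cumulative += x
        value = cumulative - a * column
        lowest = min(lowest, value)
        best = max(best, value - lowest)
    return best


# Hypothetical counts for l = 6 columns, with a cluster in columns 3 and 4.
example_counts = [0, 0, 3, 2, 0, 0]
print(local_maximum(example_counts, a=0.5))
```

For these hypothetical counts the maximum is attained at $D_{2,4} = (5 - 0) - 0.5 \cdot 2 = 4$: five occurrences fall in the two columns after column 2.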
If the average number $\lambda = np$ of occurrences in each column is small, most of the $X_i$ are 0 or 1, like a coin toss. If $l\lambda^2$ is also small, $S_l$ is approximately Poisson distributed, with mean $l\lambda$. Intuitively, the $X_i$ approximate a Poisson process of intensity $\lambda = S_l / (l - w + 1)$ on the continuous time-interval from $w$ to $l$. The Poisson process approximation is improved if each occurrence of $W$ is “jittered” by adding a random value chosen uniformly from the interval $[0, 1]$ to its position $i$. Let $S(t)$ be the cumulative jittered occurrences of $W$ up to (continuous) time $t$, where $w \le t \le l + 1$. Let $D(t, u) = [S(u) - au] - [S(t) - at]$ and $\hat{M}_{l+1} = \max_{w \le t \le u \le l+1} D(t, u)$ be the continuous-time analogs of $D_{i,j}$ and $\hat{M}_l$. Karlin and Dembo (Karlin and Dembo 1992) suggest $\hat{M}_{l+1}$ as a statistic for assessing clustering.

Karlin and Dembo give inequalities on the p-value $P(\hat{M}_t \ge y)$ for the simple Poisson process described above. Their inequalities can be extended to compound Poisson processes and sharpened to an exact asymptotic formula. The exact formula is an extreme-value distribution, closely related to BLAST E-values (Karlin and Altschul 1990; Karlin and Dembo 1992). As in BLAST, $P(\hat{M}_t \ge y) \approx 1 - e^{-E}$, where $E$ is an E-value (i.e., a mean for a Poisson distribution). The E-value is

$$E = k t e^{-\theta y}, \qquad (1)$$

where the time $t = l - w + 1$. Here, $\theta$ is the unique positive solution to the equation

$$e^{\theta} = 1 + \frac{a\theta}{\lambda}, \qquad (2)$$

and

$$k = \frac{\theta (a - \lambda)^2}{\lambda e^{\theta} - a}. \qquad (3)$$

In the notation of Eqs 6-8 of (Frith et al. 2002), Eqs (1)-(3) specialize the known Poisson process solution from compound to simple with the substitution $Z = 1$. More recent versions of BLAST include a statistical correction for edge effects (Altschul and Gish 1996; Spouge 2001), but for $l = 3001$ these corrections are negligible and were omitted.

1.2 Algorithm to Find Significant Clusters

Our algorithm was a mild modification of the linear-time Ruzzo-Tompa algorithm for finding all maximal segments in a set of real numbers $z_1, z_2, \ldots, z_k$ (Ruzzo and Tompa 1999).
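For orientation, the unmodified Ruzzo-Tompa procedure on a list of real numbers can be sketched as follows. This is a simplified Python sketch, not the modified algorithm used in this work: it does not track the positions and PPRs of each occurrence, and it searches the candidate list with a plain backward scan, so its worst case is not linear (the published algorithm caches that search to reach linear time). The function name `maximal_segments` and the input scores are illustrative assumptions.

```python
def maximal_segments(scores):
    """Return all maximal segments as (start, end, score) triples.

    start is inclusive and end is exclusive, both 0-based indices
    into scores. Simplified Ruzzo-Tompa: correct output, but the
    backward scan below is not the cached linear-time search.
    """
    # Each candidate is [start, end, L, R], where L is the cumulative
    # sum just before the segment and R the cumulative sum at its end.
    candidates = []
    total = 0
    for index, x in enumerate(scores):
        before = total
        total += x
        if x <= 0:
            continue  # non-positive scores never start a segment
        segment = [index, index + 1, before, total]
        while True:
            # Find the rightmost candidate whose L is below ours.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= segment[2]:
                j -= 1
            if j < 0 or candidates[j][3] >= segment[3]:
                candidates.append(segment)
                break
            # Otherwise merge candidate j and everything after it
            # into the current segment, then search again.
            segment = [candidates[j][0], segment[1],
                       candidates[j][2], segment[3]]
            del candidates[j:]
    return [(s, e, right - left) for s, e, left, right in candidates]


# Hypothetical input scores.
print(maximal_segments([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]))
```

The returned segments are pairwise disjoint, which is the property the significance argument in this section relies on.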
Briefly, a segment is a contiguous subset, having the form $z_{i+1}, \ldots, z_j$. The segment corresponds to the score $d_{i,j} = \sum_{m=i+1}^{j} z_m$. According to Ruzzo and Tompa, a segment has “Property P1” if all its proper sub-segments have a lower score. A segment is “maximal” if it has Property P1 but none of its containing segments has Property P1. Because of the definition, Ruzzo and Tompa show that maximal segments are disjoint.

In our set-up, let the word $W$ occur $k$ times in the block alignment, at jittered column positions $T_1, T_2, \ldots, T_k$. Associate with $T_j$ the cumulative score $s_j = j - aT_j$ described above, since there are $j$ occurrences of $W$ up to time $T_j$. Define $z_j = s_j - s_{j-1} = 1 - a(T_j - T_{j-1})$, where $s_0 = T_0 = 0$. Our modification of the Ruzzo-Tompa algorithm determines all maximal segments of $z_j$, while maintaining a list of the corresponding positions and PPRs where $W$ occurred.

To calculate the statistical significance of a segment $z_{i+1}, \ldots, z_j$, determine $y$ so that $p = P(\hat{M}_{l-w+1} \ge y)$, using the formula above. Thus, in a random block alignment, a segmental score exceeds $y$ with a probability not exceeding the p-value $p$. Because the maximal segments are disjoint, they are probabilistically independent, so each segmental score $d_{i,j} \ge y$ can be considered statistically significant.

1.3 Gene Ontology Similarity Metric

First, the Lin metric quantified the semantic similarity among terms from the Molecular Function hierarchy in the Gene Ontology (GO) database. The metric uses information content to define the semantic similarity between two terms $c_i$ and $c_j$ as follows:

$$\mathrm{sim}(c_i, c_j) = \frac{2 \max_{c \in S(c_i, c_j)} \log p(c)}{\log p(c_i) + \log p(c_j)}, \qquad (4)$$

where $S(c_i, c_j)$ represents the set of ancestor terms $c_i$ and $c_j$ share; “max”, the maximum operator; and $p(c)$, the probability of finding $c$ or any of its descendants in the GO database. Semantic similarity satisfies $0 \le \mathrm{sim}(c_i, c_j) \le 1$.
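The term-level similarity can be illustrated on a toy ontology. The sketch below is a minimal Python example under several assumptions that are not taken from the original work: the term names, parent links and probabilities are invented; a term counts as its own ancestor, so that a term compared with itself scores 1; and the shared ancestor is chosen as the most informative one (smallest $p(c)$), which is how Lin's measure is usually evaluated. The return value for two root terms is an arbitrary convention of this sketch.

```python
import math

# Hypothetical toy ontology: each term maps to its parent terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}

# Hypothetical p(c): probability of finding c or any of its descendants.
PROB = {
    "root": 1.0,
    "binding": 0.5,
    "catalysis": 0.4,
    "dna_binding": 0.2,
    "rna_binding": 0.1,
}


def ancestors(term):
    """Return the term together with all of its ancestors (assumption:
    a term is included in its own ancestor set)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lin_similarity(ci, cj):
    """Lin similarity using the most informative shared ancestor."""
    shared = ancestors(ci) & ancestors(cj)
    p_shared = min(PROB[c] for c in shared)  # smallest p = most informative
    denominator = math.log(PROB[ci]) + math.log(PROB[cj])
    if denominator == 0.0:
        return 0.0  # both terms are the root; arbitrary convention here
    return 2.0 * math.log(p_shared) / denominator


print(lin_similarity("dna_binding", "rna_binding"))
print(lin_similarity("dna_binding", "catalysis"))
```

In this toy example the two binding terms share the ancestor `binding`, whereas `dna_binding` and `catalysis` share only the root, which carries no information.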
Similarity between two genes $g_i$ and $g_j$, with sets of annotations $A_i$ and $A_j$ comprising $m$ and $n$ terms respectively, is defined as the largest average (inter-set) similarity between terms from $A_i$ and $A_j$:

$$\mathrm{SIM}(g_i, g_j) = \frac{1}{m + n} \left[ \sum_{k \in A_i} \max_{p \in A_j} \mathrm{sim}(c_k, c_p) + \sum_{p \in A_j} \max_{k \in A_i} \mathrm{sim}(c_k, c_p) \right]. \qquad (5)$$

This aggregation method can be understood as a variant of Dice similarity (Azuaje et al. 2006).

1.4 The Fisher Inverse Chi-Square Test for Combining p-values

Consider the product $Z_n = p_1 \cdots p_n$ of continuous independent p-values $p_1, \ldots, p_n$. The probability that $Z_n \le p$ is

$$P(Z_n \le p) = p \sum_{i=0}^{n-1} \frac{(-\ln p)^i}{i!} \qquad (6)$$

for $0 < p \le 1$ and is zero when $p$ is zero (Bailey and Gribskov 1998). The test is classical, but (Bailey and Gribskov 1998) give the particularly easy formula in Eq (6) for the computation.

Supplementary References

Altschul, S.F. and W. Gish. 1996. Local alignment statistics. Methods Enzymol 266: 460-480.

Azuaje, F., F. Al-Shahrour, and J. Dopazo. 2006. Ontology-driven approaches to analyzing data in functional genomics. Methods Mol Biol 316: 67-86.

Bailey, T.L. and M. Gribskov. 1998. Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14: 48-54.

Frith, M.C., J.L. Spouge, U. Hansen, and Z. Weng. 2002. Statistical Significance of Clusters of Motifs Represented by Position Specific Scoring Matrices in Nucleotide Sequences. Nucleic Acids Res 30: 3214-3224.

Karlin, S. and S.F. Altschul. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 87: 2264-2268.

Karlin, S. and A. Dembo. 1992. Limit Distributions of Maximal Segmental Score Among Markov-Dependent Partial-Sums. Advances in Applied Probability 24: 113-140.

Ruzzo, W.L. and M. Tompa. 1999. A linear time algorithm for finding all maximal scoring subsequences. In Seventh International Conference on Intelligent Systems for Molecular Biology (eds. T. Lengauer, R. Schneider, P. Bork, D. Brutlag, J. Glasgow, H.-W. Mewes, and R. Zimmer). American Association for Artificial Intelligence, Heidelberg, Germany.

Spouge, J.L. 2001. Finite-size correction to Poisson approximations of rare events in renewal processes. J. Appl. Prob. 38: 554-569.

Tharakaraman, K., L. Marino-Ramirez, S. Sheetlin, D. Landsman, and J.L. Spouge. 2005. Alignments anchored on genomic landmarks can aid in the identification of regulatory elements. Bioinformatics 21: I440-I448.

Figure S1. The nucleotide frequency distribution in the columns of the block alignment of the 7,914 PPRs. The anchored PPR dataset shows an over-representation of G and C around the TSS, probably reflecting excess CpG dinucleotides. Strikingly, A and C are over-represented near the TSS (specifically, at 1 and 0 bp), probably reflecting the dinucleotide CA, part of the consensus initiator (INR) element. These trends support the quality of TSS annotation.

Figure S2. The tissue specificities of the 50 most significant clusters under the Mann-Whitney test. The matrices in Figures S2a and S2b show a cross-table of clusters (rows) and tissues (columns), with its entries color-coded according to $1 - p$ from the Mann-Whitney test (so large values of $1 - p$ indicate enrichment of the cluster’s gene group in the tissue). Figure S2a represents the actual gene group; Figure S2b, the control gene group using the same word without positional preference.

Figure S3. The intersection network. The intersection network was obtained by intersecting three individual networks, corresponding to (1) the positionally significant words, (2) the GO annotation, and (3) the microarray Atlas. Figure S3 shows the disconnected components of the intersection network.

Figure S4 a,b,c. Three highly interconnected networks of genes. For each interconnected network, the top of each figure shows the microarray coexpression matrix with its rows corresponding to tissues and its columns to genes.

Supplementary Table 1.
For a set of 44 words, the tissue of enriched expression predicted using the Mann-Whitney rank sum statistic and the TRANSFAC TF are listed. For each word, evidence from literature linked the TF to at least one of the predicted tissues.

Supplementary Table 2. Global characteristics of individual and intersection networks.

Network                 Nodes (1)   Edges (2)   Average Degree (3)
Coexpression            2875        320230      222.76
GO similarity           3574        320237      179.20
TF similarity           3582        320230      178.80
Intersection network    764         949         2.48

(1) Number of genes (nodes) in the network
(2) Number of gene pairs (edges) in the network
(3) Average number of edges at a node (degree)