S.3. Brief Summary of UTR Features

SUPPORTING MATERIAL
Improving Performance of Mammalian MicroRNA Target Prediction
Hui Liu1, Dong Yue2, Yidong Chen4,5, Shou-Jiang Gao3,5 and Yufei Huang2,5*
1 SIEE, China University of Mining and Technology, Xuzhou, Jiangsu, China.
2 Department of ECE, University of Texas at San Antonio, 3 Department of Pediatrics, 4 Department of Epidemiology and Biostatistics, 5 Greehey Children's Cancer Research Institute, University of Texas Health Science Center at San Antonio.
S.1. SENSITIVITIES OF PROPOSED POTENTIAL SITE FILTER
Table 1. Sensitivities of the proposed filter and of the rule based on 6-mer seed match, obtained on training data.
Seed Match Rules | Sensitivity of Site Detection | Sensitivity of UTR Detection
6mer perfect match | 79.8% | 77.1%
proposed rules | 96.2% | 95.8%
© Oxford University Press 2005
H. Liu
S.2. BRIEF SUMMARY OF SITE FEATURES
Table 2. Brief summary of all site features.
Index | Feature name | Data type | Group | Explanation | Typical min | Typical max
1 | consv_3cntxt | FLOAT | conservation | seed's 3' context conservation score | 0 | 1
2 | consv_seed | FLOAT | conservation | seed conservation score | 0 | 1
3 | consv_5cntxt | FLOAT | conservation | seed's 5' context conservation score | 0 | 1
4 | sm_6mer | INTEGER | seed match type | 6mer seed match | 0 | 1
5 | sm_7mer_A1 | INTEGER | seed match type | 7mer_A1 seed match | 0 | 1
6 | sm_7mer_m1 | INTEGER | seed match type | 7mer_m1 seed match | 0 | 1
7 | sm_7mer_m8 | INTEGER | seed match type | 7mer_m8 seed match | 0 | 1
8 | sm_8mer_A1 | INTEGER | seed match type | 8mer_A1 seed match | 0 | 1
9 | sm_8mer_m1 | INTEGER | seed match type | 8mer_m1 seed match | 0 | 1
10 | to_stop_codon | INTEGER | position | distance to stop codon | 0 | 2000
11 | to_ends | INTEGER | position | distance to nearest end | 0 | 2000
12 | ratio_to_ends | FLOAT | position | ratio to nearest end | 0 | 1
13 | nt1 | INTEGER | nt match status | p1 match status | 1 | 4
14 | nt2 | INTEGER | nt match status | p2 match status | 1 | 4
15 | nt3 | INTEGER | nt match status | p3 match status | 1 | 4
16 | nt4 | INTEGER | nt match status | p4 match status | 1 | 4
17 | nt5 | INTEGER | nt match status | p5 match status | 1 | 4
18 | nt6 | INTEGER | nt match status | p6 match status | 1 | 4
19 | nt7 | INTEGER | nt match status | p7 match status | 1 | 4
20 | nt8 | INTEGER | nt match status | p8 match status | 1 | 4
21 | nt9 | INTEGER | nt match status | p9 match status | 1 | 4
22 | nt10 | INTEGER | nt match status | p10 match status | 1 | 4
23 | nt11 | INTEGER | nt match status | p11 match status | 1 | 4
24 | nt12 | INTEGER | nt match status | p12 match status | 1 | 4
25 | nt13 | INTEGER | nt match status | p13 match status | 1 | 4
26 | nt14 | INTEGER | nt match status | p14 match status | 1 | 4
27 | nt15 | INTEGER | nt match status | p15 match status | 1 | 4
28 | nt16 | INTEGER | nt match status | p16 match status | 1 | 4
29 | nt17 | INTEGER | nt match status | p17 match status | 1 | 4
30 | nt18 | INTEGER | nt match status | p18 match status | 1 | 4
31 | nt19 | INTEGER | nt match status | p19 match status | 1 | 4
32 | nt20 | INTEGER | nt match status | p20 match status | 1 | 4
33 | 2mer1 | INTEGER | 2mer match status | 2mer1 match status | 1 | 16
34 | 2mer2 | INTEGER | 2mer match status | 2mer2 match status | 1 | 16
35 | 2mer3 | INTEGER | 2mer match status | 2mer3 match status | 1 | 16
36 | 2mer4 | INTEGER | 2mer match status | 2mer4 match status | 1 | 16
37 | 2mer5 | INTEGER | 2mer match status | 2mer5 match status | 1 | 16
38 | 2mer6 | INTEGER | 2mer match status | 2mer6 match status | 1 | 16
39 | 2mer7 | INTEGER | 2mer match status | 2mer7 match status | 1 | 16
40 | 2mer8 | INTEGER | 2mer match status | 2mer8 match status | 1 | 16
41 | 2mer9 | INTEGER | 2mer match status | 2mer9 match status | 1 | 16
42 | 2mer10 | INTEGER | 2mer match status | 2mer10 match status | 1 | 16
43 | 2mer11 | INTEGER | 2mer match status | 2mer11 match status | 1 | 16
44 | 2mer12 | INTEGER | 2mer match status | 2mer12 match status | 1 | 16
45 | 2mer13 | INTEGER | 2mer match status | 2mer13 match status | 1 | 16
46 | 2mer14 | INTEGER | 2mer match status | 2mer14 match status | 1 | 16
47 | 2mer15 | INTEGER | 2mer match status | 2mer15 match status | 1 | 16
48 | 2mer16 | INTEGER | 2mer match status | 2mer16 match status | 1 | 16
49 | 2mer17 | INTEGER | 2mer match status | 2mer17 match status | 1 | 16
50 | 2mer18 | INTEGER | 2mer match status | 2mer18 match status | 1 | 16
51 | 2mer19 | INTEGER | 2mer match status | 2mer19 match status | 1 | 16
52 | rgs_match | INTEGER | region | number of matches in seed region | 0 | 8
53 | rgs_gu | INTEGER | region | number of G:U pairs in seed region | 0 | 8
54 | rgs_mismatch | INTEGER | region | number of mismatches in seed region | 0 | 8
55 | rgs_gap | INTEGER | region | number of gaps in seed region | 0 | 8
56 | rgs_bulge | INTEGER | region | number of bulges in seed region | 0 | 2
57 | rgs_bulge_nt | INTEGER | region | number of bulged nts in seed region | 0 | 2
58 | rgs_energy | FLOAT | region | binding energy of seed region | -10 | 5
59 | rg3_match | INTEGER | region | number of matches in 3' region | 0 | 8
60 | rg3_gu | INTEGER | region | number of G:U pairs in 3' region | 0 | 8
61 | rg3_mismatch | INTEGER | region | number of mismatches in 3' region | 0 | 8
62 | rg3_gap | INTEGER | region | number of gaps in 3' region | 0 | 8
63 | rg3_bulge | INTEGER | region | number of bulges in 3' region | 0 | 2
64 | rg3_bulge_nt | INTEGER | region | number of bulged nts in 3' region | 0 | 5
65 | rg3_energy | FLOAT | region | binding energy of 3' region | -10 | 5
66 | rgt_match | INTEGER | region | number of matches in total region | 0 | 15
67 | rgt_gu | INTEGER | region | number of G:U pairs in total region | 0 | 15
68 | rgt_mismatch | INTEGER | region | number of mismatches in total region | 0 | 15
69 | rgt_gap | INTEGER | region | number of gaps in total region | 0 | 15
70 | rgt_bulge | INTEGER | region | number of bulges in total region | 0 | 4
71 | rgt_bulge_nt | INTEGER | region | number of bulged nts in total region | 0 | 12
72 | rgt_energy | FLOAT | region | binding energy of total region | -20 | 10
73 | acc_energy | FLOAT | accessibility energy | accessibility energy | -20 | 10
74 | cntxt_A_cntnt | FLOAT | context | A content in context | 0 | 1
75 | cntxt_C_cntnt | FLOAT | context | C content in context | 0 | 1
76 | cntxt_G_cntnt | FLOAT | context | G content in context | 0 | 1
77 | cntxt_U_cntnt | FLOAT | context | U content in context | 0 | 1
78 | cntxt_AA_cntnt | FLOAT | context | AA content in context | 0 | 1
79 | cntxt_AC_cntnt | FLOAT | context | AC content in context | 0 | 1
80 | cntxt_AG_cntnt | FLOAT | context | AG content in context | 0 | 1
81 | cntxt_AU_cntnt | FLOAT | context | AU content in context | 0 | 1
82 | cntxt_CA_cntnt | FLOAT | context | CA content in context | 0 | 1
83 | cntxt_CC_cntnt | FLOAT | context | CC content in context | 0 | 1
84 | cntxt_CG_cntnt | FLOAT | context | CG content in context | 0 | 1
85 | cntxt_CU_cntnt | FLOAT | context | CU content in context | 0 | 1
86 | cntxt_GA_cntnt | FLOAT | context | GA content in context | 0 | 1
87 | cntxt_GC_cntnt | FLOAT | context | GC content in context | 0 | 1
88 | cntxt_GG_cntnt | FLOAT | context | GG content in context | 0 | 1
89 | cntxt_GU_cntnt | FLOAT | context | GU content in context | 0 | 1
90 | cntxt_UA_cntnt | FLOAT | context | UA content in context | 0 | 1
91 | cntxt_UC_cntnt | FLOAT | context | UC content in context | 0 | 1
92 | cntxt_UG_cntnt | FLOAT | context | UG content in context | 0 | 1
93 | cntxt_UU_cntnt | FLOAT | context | UU content in context | 0 | 1
94 | cntxt_pos_n8 | INTEGER | context | nt type of -8 | 1 | 4
95 | cntxt_pos_n7 | INTEGER | context | nt type of -7 | 1 | 4
96 | cntxt_pos_n6 | INTEGER | context | nt type of -6 | 1 | 4
97 | cntxt_pos_n5 | INTEGER | context | nt type of -5 | 1 | 4
98 | cntxt_pos_n4 | INTEGER | context | nt type of -4 | 1 | 4
99 | cntxt_pos_n3 | INTEGER | context | nt type of -3 | 1 | 4
100 | cntxt_pos_n2 | INTEGER | context | nt type of -2 | 1 | 4
101 | cntxt_pos_n1 | INTEGER | context | nt type of -1 | 1 | 4
102 | cntxt_pos_n0 | INTEGER | context | nt type of -0 | 1 | 4
103 | cntxt_pos_p1 | INTEGER | context | nt type of +1 | 1 | 4
104 | cntxt_pos_r1 | INTEGER | context | nt type of r1 | 0 | 4
105 | cntxt_pos_r2 | INTEGER | context | nt type of r2 | 0 | 4
106 | cntxt_pos_r3 | INTEGER | context | nt type of r3 | 0 | 4
107 | cntxt_pos_r4 | INTEGER | context | nt type of r4 | 0 | 4
108 | cntxt_pos_r5 | INTEGER | context | nt type of r5 | 0 | 4
109 | cntxt_pos_r6 | INTEGER | context | nt type of r6 | 0 | 4
110 | cntxt_pos_r7 | INTEGER | context | nt type of r7 | 0 | 4
111 | cntxt_pos_r8 | INTEGER | context | nt type of r8 | 0 | 4
112 | cntxt_pos_r9 | INTEGER | context | nt type of r9 | 0 | 4
113 | cntxt_pos_r10 | INTEGER | context | nt type of r10 | 0 | 4
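As an illustration of the context-content features in Table 2 (cntxt_A_cntnt through cntxt_UU_cntnt), the following is a minimal sketch of how such values could be computed from a context sequence. It assumes that "content" means the fraction of mono- or dinucleotides in the context, which is consistent with the FLOAT type and the 0 to 1 range listed in the table but is not stated in the text. The function name context_content and the treatment of overlapping dinucleotides are hypothetical choices.

```python
from itertools import product

NUCLEOTIDES = "ACGU"


def context_content(context: str) -> dict:
    """Return mono- and dinucleotide fractions of an RNA context sequence.

    Assumption: "content" is a fraction in [0, 1]; dinucleotides are
    counted with overlap (positions i and i+1 for every i).
    """
    seq = context.upper().replace("T", "U")  # accept DNA-style input
    n_mono = len(seq)
    n_di = max(len(seq) - 1, 0)
    features = {}
    for nt in NUCLEOTIDES:
        features[f"cntxt_{nt}_cntnt"] = seq.count(nt) / n_mono if n_mono else 0.0
    for a, b in product(NUCLEOTIDES, repeat=2):
        # str.count skips overlapping matches, so count position by position
        count = sum(1 for i in range(n_di) if seq[i:i + 2] == a + b)
        features[f"cntxt_{a}{b}_cntnt"] = count / n_di if n_di else 0.0
    return features
```

Calling context_content on a short sequence returns a dictionary with 20 entries, one per feature name in rows 74 to 93 of Table 2.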
S.3. BRIEF SUMMARY OF UTR FEATURES
Table 3. Brief summary of all UTR features.
Index | Feature name | Data type | Group | Explanation | Typical min | Typical max
1 | utr_len | INTEGER | UTR length | length of UTR | 0 | 2000
2 | psite_dens | FLOAT | density | density of potential sites in entire UTR | 0 | 0.1
3 | max_partial_psite_num | INTEGER | density | max number of potential sites in 100 nt | 0 | 5
4 | pos_site_dens | FLOAT | density | density of positive sites in entire UTR | 0 | 0.01
5 | max_partial_pos_site_num | INTEGER | density | max number of positive sites in 100 nt | 0 | 2
6 | total_pos_score | FLOAT | global site score | total score of positive sites | -2 | 2
7 | psite_num | INTEGER | global site score | number of potential sites | 0 | 50
8 | pos_site_num | INTEGER | global site score | number of positive sites | 0 | 5
9 | top_score | FLOAT | global site score | top score of all potential sites | -2 | 2
10 | psite_num_6mer | INTEGER | site score of seed type | number of potential sites with 6mer seed | 0 | 2
11 | pos_site_num_6mer | INTEGER | site score of seed type | number of positive sites with 6mer seed | 0 | 2
12 | top_score_6mer | FLOAT | site score of seed type | top score of all potential sites with 6mer seed | -2 | 2
13 | psite_num_7mer_A1 | INTEGER | site score of seed type | number of potential sites with 7mer_A1 seed | 0 | 2
14 | pos_site_num_7mer_A1 | INTEGER | site score of seed type | number of positive sites with 7mer_A1 seed | 0 | 2
15 | top_score_7mer_A1 | FLOAT | site score of seed type | top score of all potential sites with 7mer_A1 seed | -2 | 2
16 | psite_num_7mer_m1 | INTEGER | site score of seed type | number of potential sites with 7mer_m1 seed | 0 | 2
17 | pos_site_num_7mer_m1 | INTEGER | site score of seed type | number of positive sites with 7mer_m1 seed | 0 | 2
18 | top_score_7mer_m1 | FLOAT | site score of seed type | top score of all potential sites with 7mer_m1 seed | -2 | 2
19 | psite_num_7mer_m8 | INTEGER | site score of seed type | number of potential sites with 7mer_m8 seed | 0 | 2
20 | pos_site_num_7mer_m8 | INTEGER | site score of seed type | number of positive sites with 7mer_m8 seed | 0 | 2
21 | top_score_7mer_m8 | FLOAT | site score of seed type | top score of all potential sites with 7mer_m8 seed | -2 | 2
22 | psite_num_8mer_A1 | INTEGER | site score of seed type | number of potential sites with 8mer_A1 seed | 0 | 2
23 | pos_site_num_8mer_A1 | INTEGER | site score of seed type | number of positive sites with 8mer_A1 seed | 0 | 2
24 | top_score_8mer_A1 | FLOAT | site score of seed type | top score of all potential sites with 8mer_A1 seed | -2 | 2
25 | psite_num_8mer_m1 | INTEGER | site score of seed type | number of potential sites with 8mer_m1 seed | 0 | 2
26 | pos_site_num_8mer_m1 | INTEGER | site score of seed type | number of positive sites with 8mer_m1 seed | 0 | 2
27 | top_score_8mer_m1 | FLOAT | site score of seed type | top score of all potential sites with 8mer_m1 seed | -2 | 2
28 | psite_num_other | INTEGER | site score of seed type | number of potential sites without perfect seed | 0 | 2
29 | pos_site_num_other | INTEGER | site score of seed type | number of positive sites without perfect seed | 0 | 2
30 | top_score_other | FLOAT | site score of seed type | top score of all potential sites without perfect seed | -2 | 2
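The density features of Table 3 (psite_dens and max_partial_psite_num, and their positive-site counterparts) can be sketched as follows. This is an illustrative example only: it assumes each site is represented by a single start position within the UTR and that a site belongs to a 100 nt window when its start position falls inside it. The function name utr_density_features and these conventions are hypothetical and are not taken from the text.

```python
def utr_density_features(utr_len, site_positions, window=100):
    """Return (site density, max number of sites in any `window`-nt stretch).

    Assumptions: `site_positions` holds one start coordinate per site;
    density is the number of sites divided by the UTR length.
    """
    if utr_len <= 0:
        raise ValueError("utr_len must be positive")
    positions = sorted(site_positions)
    density = len(positions) / utr_len

    # Two-pointer sweep: positions[left..right] always span less than `window` nt.
    best = 0
    left = 0
    for right, pos in enumerate(positions):
        while pos - positions[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return density, best
```

The same function can be applied to the positions of positive sites only to obtain values analogous to pos_site_dens and max_partial_pos_site_num.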
S.4. DATA SOURCE OF NEGATIVE SAMPLES
Table 4. Data Source of Negative Samples.
Index | miRNA | GEO Dataset ID | No. of negative samples
1 | hsa-let-7c | GSM156557, GSM156558 | 29
2 | hsa-miR-15a | GSM156545, GSM156549 | 613
3 | hsa-miR-16 | GSM156546, GSM156550 | 587
4 | hsa-miR-17 | GSM156553, GSM156555 | 115
5 | hsa-miR-192 | GSM156547, GSM156551 | 77
6 | hsa-miR-20a | GSM156554, GSM156556 | 108
7 | hsa-miR-215 | GSM156548, GSM156552 | 92
8 | hsa-miR-192 | GSM328290, GSM328287 | 21
9 | hsa-miR-215 | GSM328291, GSM328288 | 20
10 | hsa-miR-122 | GSM210900, GSM210901 | 13
11 | hsa-miR-128 | GSM210902, GSM210903 | 10
12 | hsa-miR-132 | GSM210904, GSM210905 | 11
13 | hsa-miR-133a | GSM210906, GSM210907 | 203
14 | hsa-miR-142-3p | GSM210908, GSM210909 | 38
15 | hsa-miR-148b | GSM210910, GSM210911 | 42
16 | hsa-miR-34a | GSM187633, GSM187634, GSM187631, GSM187632 | 676
17 | hsa-miR-34b | GSM190765, GSM190757 | 424
18 | hsa-miR-34c-5p | GSM190758, GSM190766 | 451
19 | hsa-miR-7 | GSM210896, GSM210897 | 8
20 | hsa-miR-9 | GSM210898, GSM210899 | 4
Total No. | 3542 (3492 pairs remain after removing duplicates)
S.5. HISTOGRAMS OF SITE FEATURES
The empirical distribution of each site feature was obtained independently, in the form of a histogram, from the positive and negative data. Although these histograms do not reveal the combinatorial discriminative power of the features, they do indicate the importance of each feature for prediction. In particular, if the distributions of a feature in the positive and negative target sites are similar, the two classes cannot be easily separated by this feature; the feature therefore has low discriminative power and is unlikely to be a good feature.
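The comparison described above can be made concrete with a small sketch that builds the normalized histogram of a discrete feature in each class and measures how much the two histograms overlap. The overlap coefficient used here (the sum over feature values of the smaller of the two probabilities) is an illustrative choice and is not a measure used in the paper; the function names are hypothetical. A continuous feature such as a binding energy would first need to be binned.

```python
from collections import Counter


def empirical_distribution(values):
    """Normalized histogram of a discrete feature: value -> probability."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}


def histogram_overlap(pos_values, neg_values):
    """Overlap of the positive and negative histograms.

    1.0 means identical distributions (feature carries no information
    on its own); 0.0 means the two classes never share a value.
    """
    p = empirical_distribution(pos_values)
    q = empirical_distribution(neg_values)
    return sum(min(p.get(v, 0.0), q.get(v, 0.0)) for v in set(p) | set(q))
```

Under this measure, a feature whose overlap is close to 1 corresponds to the "similar distributions" case discussed above.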
[Figure 1: six panels, one per perfect seed match feature (6mer, 7mer_A1, 7mer_m1, 7mer_m8, 8mer_A1 and 8mer_m1 seed match); x-axis: feature value (0 or 1); y-axis: probability.]
Fig. 1. Histograms of perfect seed match features.
[Figure 2: four panels (binding energy of seed region, binding energy of 3' region, binding energy of total region, accessibility energy); y-axis: probability; legend: negative, positive.]
Fig. 2. Histograms of energy features.
[Figure 3: three panels (seed's 3' context conservation score, seed conservation score, seed's 5' context conservation score); x-axis: score from 0 to 1; y-axis: probability; legend: negative, positive.]
Fig. 3. Histograms of conservation features.
[Figure 4: twenty panels, p1 match status through p20 match status; x-axis: match status (1 to 4); y-axis: probability; legend: negative, positive.]
Fig. 4. Histograms of nt match features.
[Figure 5: nineteen panels, 2mer1 match status through 2mer19 match status; x-axis: match status (axis ticks 0 to 15); y-axis: probability; legend: negative, positive.]
Fig. 5. Histograms of 2mer match features.
[Figure 6: eighteen panels. For each of the seed region, the 3' region and the total region: number of match, number of mismatch, number of G:U, number of gap, number of bulge, number of bulged nts; y-axis: probability; legend: negative, positive.]
Fig. 6. Histogram of regional binding structure features.
[Figure 7: three panels (distance to stop codon, distance to nearest end, ratio to nearest end); y-axis: probability; legend: negative, positive.]
Fig. 7. Histogram of Position Features.
S.6. ROC PERFORMANCE OF SITE-SVM
To investigate the performance of Site-SVM, the receiver operating characteristic (ROC) curve was obtained by cross-validation on the training dataset (Fig. 8). The ROC curve plots the true positive rate (TPR), or sensitivity, against the false positive rate (FPR), or 1 - specificity. TPR is the fraction of true targets that are predicted as targets, while FPR is the fraction of non-targets that are falsely predicted as targets. A better algorithm has a smaller FPR at a given TPR. In Fig. 8, Site-SVM performs better than the six types of perfect seed match. Moreover, Site-SVM yields a continuous curve, which means that it can assign a confidence that a potential site is a positive site; this is useful for the subsequent identification work.
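The relation between a scoring classifier and its ROC curve can be sketched in a few lines. The example below is a generic illustration rather than the evaluation code used for Site-SVM: it sweeps a decision threshold over the scores from high to low and records one (FPR, TPR) point per distinct score. The function names roc_points and auc are hypothetical, and labels are assumed to be 1 for positive sites and 0 for negative sites.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (FPR, TPR) points.

    `labels` are 1 (positive) or 0 (negative). Tied scores are consumed
    together, so a binary rule yields a single interior point.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

With a real-valued score every distinct value contributes a point, giving a curve; with a 0/1 rule such as a perfect seed match, all samples tie at two values and the curve collapses to one operating point between (0, 0) and (1, 1).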
[Figure 8: ROC plot titled "Site-SVM Top 11 ROC: 95.33"; x-axis: false positive rate (0 to 1); y-axis: true positive rate (0 to 1); legend: Site-SVM, 6mer seed match, 7mer_A1 seed match, 7mer_m1 seed match, 7mer_m8 seed match, 8mer_A1 seed match, 8mer_m1 seed match.]
Fig. 8. ROC curve of Site-SVM and perfect seed match.
S.7. HISTOGRAMS OF UTR FEATURES
[Figure 9: one panel; x-axis ticks from 0 to 12000; y-axis: probability; legend: negative, positive.]
Fig. 9. Histogram of UTR Length Feature.
[Figure 10: four panels (density of potential site in entire UTR, max number of potential site in 100 nt, density of positive site in entire UTR, max number of positive site in 100 nt); y-axis: probability; legend: negative, positive.]
Fig. 10. Histograms of sites density features.
[Figure 11: 25 panels. Global panels: number of potential sites, number of positive sites, total score of positive sites, top score of potential sites. For each seed type (6mer, 7mer_A1, 7mer_m1, 7mer_m8, 8mer_A1, 8mer_m1, other): number of potential sites, number of positive sites, top score of potential sites. y-axis: probability; legend: negative, positive.]
Fig. 11. Histograms of Sites Score Features.
S.8. EVALUATION BASED ON THE PROTEOMICS DATA
To demonstrate the robustness of the prediction, we carried out prediction for 4 more miRNAs (miR-155, hsa-let-7b, hsa-miR-16, and hsa-miR-30a), for which proteomic data are available in (Selbach, et al., 2008). The cumulative fold changes for different numbers of top ranked predictions are summarized for each miRNA in Figs. 12-15, respectively. SVMicrO achieves the largest down-fold for three of the 4 miRNAs at the top 300 predictions, indicating better sensitivity. For the top 200 predictions, SVMicrO is consistently among the methods with the highest cumulative down-fold; this suggests better precision of the algorithm.
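The quantity plotted in Figs. 12-15 can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation script used for the paper: fold changes are assumed to be signed so that negative values indicate down-regulation, predicted genes without a proteomic measurement are assumed to be skipped, and the names cumulative_fold_change, ranked_genes and fold_change are hypothetical. The default cutoffs follow the x-axis of the figures (Top 25 to Top 300).

```python
def cumulative_fold_change(ranked_genes, fold_change,
                           cutoffs=(25, 50, 100, 200, 300)):
    """Sum the protein fold changes of the top-k ranked predictions.

    `ranked_genes` is ordered from most to least confident prediction;
    `fold_change` maps gene -> signed fold change (negative = down).
    Genes absent from `fold_change` contribute nothing (assumption).
    """
    result = {}
    for k in cutoffs:
        top = ranked_genes[:k]
        result[k] = sum(fold_change[g] for g in top if g in fold_change)
    return result
```

A more negative value at a given cutoff means that the top-ranked predictions are, in total, more strongly down-regulated at the protein level.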
[Figure 12: y-axis: cumulative sum of protein fold change; x-axis: Top 25, Top 50, Top 100, Top 200, Top 300; legend: SVMicro, TargetScan, miRanda, MirTarget, PicTar, PITA.]
Fig. 12. Cumulative sum of protein fold change for different numbers of top ranked predictions of miR-155.
[Figure 13: y-axis: cumulative sum of protein fold change; x-axis: Top 25, Top 50, Top 100, Top 200, Top 300; legend: SVMicro, TargetScan, miRanda, MirTarget, PicTar, PITA.]
Fig. 13. Cumulative sum of protein fold change for different numbers of top ranked predictions of let-7b.
[Figure 14: y-axis: cumulative sum of protein fold change; x-axis: Top 25, Top 50, Top 100, Top 200, Top 300; legend: SVMicro, TargetScan, miRanda, MirTarget, PicTar, PITA.]
Fig. 14. Cumulative sum of protein fold change for different numbers of top ranked predictions of miR-16.
[Figure 15: y-axis: cumulative sum of protein fold change; x-axis: Top 25, Top 50, Top 100, Top 200, Top 300; legend: SVMicro, TargetScan, miRanda, MirTarget, PicTar, PITA.]
Fig. 15. Cumulative sum of protein fold change for different numbers of top ranked predictions of miR-30a.
S.14. EVALUATION FOR MIR-1 BASED ON THE IP PULL-DOWN DATA
Validation based on IP pull-down data was also carried out for miR-1. The 56 high-confidence targets identified by the IP experiment were treated as true targets. The ROC curves and the number of true positives among the top ranked predictions are shown in Figs. 16 and 17. SVMicrO has the best performance in both figures.
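Counting true positives among the top ranked predictions can be sketched as below. The example is illustrative only: the function and argument names are hypothetical, and the default cutoffs are taken from the x-axis of the corresponding figure (Top 25 to 300). The set of true targets would be the high-confidence targets from the IP experiment.

```python
def true_positives_at_k(ranked_genes, true_targets,
                        cutoffs=(25, 50, 75, 100, 150, 200, 250, 300)):
    """Count validated targets among the top-k predictions for each k.

    `ranked_genes` is ordered from most to least confident prediction;
    `true_targets` is any iterable of validated target identifiers.
    """
    truth = set(true_targets)
    return {k: sum(1 for g in ranked_genes[:k] if g in truth)
            for k in cutoffs}
```

Applying the function to the ranked output of each prediction method yields one series per method, which can then be compared cutoff by cutoff.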
[Figure 16: ROC plot; x-axis: False Positive Rate; y-axis: True Positive Rate; legend with values: SVMicrO (0.75184), pictar (0.56736), miRanda (0.63766), mirTarget (0.60963), PITA (0.74165), TargetScan (0.58089).]
Fig. 16. ROC curves for the predictions of miR-1 tested on IP pull-downs.
[Figure 17: y-axis: Number of True Positives (0 to 20); x-axis: Top 25, 50, 75, 100, 150, 200, 250, 300; legend: SVMicro, PicTar, miRanda, mirTarget, PITA, TargetScan.]
Fig. 17. Number of true positives among top ranked predictions of miR-1.