Modified Association Rule Mining Approach for the
MHC-Peptide Binding Problem
Galip Gürkan Yardımcı, Alper Küçükural, Yücel Saygın, Uğur Sezerman
{yardimci, kucukural}@su.sabanciuniv.edu
{ysaygin, ugur}@sabanciuniv.edu
Faculty of Engineering and Natural Sciences,
Sabanci University, Turkey
Abstract. Computational approaches for predicting peptide binding to the major
histocompatibility complex (MHC) are crucial for vaccine design, since these
peptides can act as T-Cell epitopes that trigger an immune response. There are two
main branches of peptide prediction methods: structural and data mining
approaches. These methods can be successfully used for prediction of T-Cell
epitopes in cancer, allergy and infectious diseases. In this paper, association
rule mining methods are implemented to generate rules of peptide selection by
MHCs. To capture the binding characteristics, modified rule mining and data
transformation methods are used. Peptides that are known to bind to the same MHC
show sequence variability; to capture this characteristic, we used a reduced
amino acid alphabet obtained by clustering amino acids according to their
physico-chemical properties. The innovations of the presented method are the
classification of amino acids and the use of the OR-operator to combine rules,
reflecting that different amino acid types and positions along the peptide may
be responsible for binding. We can predict MHC Class-I binding with
75-97% coverage and 76-100% accuracy.
Keywords: Peptides, MHC Class-I, Association rule mining, reduced amino
acid alphabet, data mining.
1 Introduction
Peptide binding prediction is a crucial step for vaccine design since it enables the
understanding of the mechanism of the immune response to foreign bodies and how
vaccines work. There are numerous experimental research results regarding this
subject. However, these experiments are time-consuming and costly, since there is a vast
number of peptides to be tried as vaccine candidates even for a single MHC.
Therefore, there is an urgent need for effective computational methods to
solve the problem of peptide binding to the MHC. The methods developed for finding
peptide sequences specific to the target MHCs can also be used for developing
therapeutic proteins for other types of receptors.
MHCs recognize antigens which are foreign macromolecules that cause an
immune response in the body. There are two types of immune responses to the
antigens: humoral and cellular immune response. Class II MHC molecules are
involved in humoral immune response whereas Class I MHC molecules are involved
in the cellular immune response which is the response after the antigen enters the cell
[9]. In this paper we will focus on cellular response which involves recognition of
antigenic fragments by Class I MHCs. After foreign bodies enter the cell they are
cleaved into smaller pieces that are called peptides. These peptides are picked up by
Class I MHCs and brought to the cell surface. There are on average three to four
different types of Class I MHCs in a human cell, which all bind to different types of
peptides including self and antigenic peptides. The T-Cells recognize the infected cell
upon binding to the antigenic peptide-MHC complex, which triggers a cascade of
events leading to the cellular immune response to foreign bodies. In both the Class I and
Class II pathways, the most important molecule for initiating the recognition of
infected cells is the major histocompatibility complex (MHC). Knowing which of the
peptides that result from the cleavage of antigens will be picked up by the MHC
molecule, and understanding the mechanism of peptide binding (sequence
motifs), will be of great use in vaccine design. A peptide presented to a T-Cell
together with an MHC molecule is called a T-Cell epitope. If the cell is infected, it can
be driven into apoptosis by the T-Cell. In this paper, we investigate the Class I pathway
for prediction of T-Cell epitopes.
Laboratory experiments can be used to determine which peptides bind to which
kind of MHC molecules. The peptides that are known to bind to Class I MHCs have
variable length, but the majority of them have between 8 and 10 residues. Conducting
laboratory experiments for all types of peptide binding combinations is not feasible
since there are $20^8$ to $20^{10}$ possible peptides using the 20 amino acid alphabet, but only a
few are selected by the MHC [12]. We combine structural and data mining based
methods for prediction of T-Cell epitopes. Association rule mining techniques are
used for finding correlations between positions of the bound peptides and determining
the binding motifs for each type of MHC. These rules will be useful for
understanding the mechanism of peptide binding.
2 Background and Related Work
There are two main approaches to the peptide prediction problem: profile based
approaches and machine learning. Profile based approaches build profile scoring
matrices from the alignment of the binding peptides. These methods check the
peptide sequence for the presence of the preferred residues at certain positions of
the peptide, as predicted by the scoring matrix. Up to now, the most successful methods
have been machine learning methods such as SVMHC [7].
Profile based methods such as SYFPEITHI [11], Rankpep [13], and ProPred1 [15]
take into account only the positive cases to derive the information; therefore they do not
have high specificity compared to machine learning approaches, where the non-binder
class information is also taken into account to distinguish the properties of binders [4].
The second group of researchers used machine learning approaches such as
Support Vector Machines and Artificial Neural Networks to find the correlations
between the positions of the peptide and to build a valid probabilistic model using
data from both binding and non-binding peptides [5], [6], [7]. Another method,
developed by Milledge et al. for predicting peptides for the HLA 0201 type of MHC,
created sequence-structural patterns by using association rules to reflect the MHC
binding characteristics of peptides [10].
2.1 Association rule mining
The problem of finding association rules among items is formally defined by Agrawal
et al. in [1], [2] as follows:
Let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of all items. Let $T$ be a transaction consisting of a set
of items such that $T \subseteq I$. We call $D$ a database of transactions. We say that a
transaction $T$ contains $X$, a set of some items in $I$, if $X \subseteq T$. An association rule is an
implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. An item set $X$
has support $s$ if $s\%$ of the transactions contain $X$. We say that the rule $X \Rightarrow Y$ holds
with confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$. The
rule $X \Rightarrow Y$ has support $s$ if $s\%$ of transactions in $D$ contain $X \cup Y$.
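The definitions above can be made concrete with a minimal Python sketch. The function names and the toy database are illustrative assumptions, not part of the original work; transactions are modelled as sets of item labels.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], database: List[Transaction]) -> float:
    """Percentage of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in database if items <= t)
    return 100.0 * hits / len(database)


def confidence(x: Iterable[str], y: Iterable[str], database: List[Transaction]) -> float:
    """Percentage of the transactions containing X that also contain Y."""
    x_items, y_items = frozenset(x), frozenset(y)
    with_x = [t for t in database if x_items <= t]
    if not with_x:
        return 0.0
    return 100.0 * sum(1 for t in with_x if y_items <= t) / len(with_x)


# Toy database (made-up transactions, not data from the paper).
toy_db = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]

# Rule {a} => {b}: its support is the support of the union {a, b}.
rule_support = support({"a", "b"}, toy_db)
rule_confidence = confidence({"a"}, {"b"}, toy_db)
print(rule_support, round(rule_confidence, 2))
```

In this toy database two of the four transactions contain both a and b, and two of the three transactions containing a also contain b.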
Association rule mining algorithms scan the database of transactions and calculate
the support and confidence of the candidate rules to determine if they are significant
or not. For that purpose, threshold values are used by the algorithms to prune the
insignificant rules. A rule is significant if its support and confidence are higher than the
user-specified minimum support and minimum confidence thresholds. In this way,
algorithms do not retrieve all the association rules that could possibly be generated
from a database; instead, only the very small subset of rules which satisfy the threshold
values is retrieved.
Support of an association rule mimics the coverage of that rule, and confidence of
the rule specifies the accuracy. Both of these measures are important for determining
the significance of a rule. Therefore we used a combined support confidence measure
(CSC-Measure; see footnote 1). The formula for the CSC-Measure is obtained by taking the
harmonic mean of the support and confidence measures, which is formulated below:
$$\mathrm{CSC}(s, c) = \frac{2 s c}{s + c}$$
where s is the support and c is the confidence of the rule. CSC-Measure takes both the
confidence and support of the rule into account, so rules which have high confidence
values and which cover more transactions over the data set will be more valuable.
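As a small illustration of the harmonic mean defined above, the following Python sketch computes the CSC-Measure for two made-up support/confidence pairs. The function name `csc_measure` and the input values are assumptions chosen for the example.

```python
def csc_measure(support: float, confidence: float) -> float:
    """Harmonic mean of support and confidence (both given in percent)."""
    if support + confidence == 0:
        return 0.0
    return 2.0 * support * confidence / (support + confidence)


# Two hypothetical rules with the same arithmetic mean (75%):
balanced = csc_measure(75.0, 75.0)     # support and confidence equal
unbalanced = csc_measure(50.0, 100.0)  # high confidence, low support
print(round(balanced, 2), round(unbalanced, 2))
```

The harmonic mean rewards rules whose support and confidence are both high: the unbalanced rule scores lower than the balanced one even though their arithmetic means coincide.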
3 Association Rule Mining Methods for (Peptide-Binding)
Prediction
Our data set D contains amino acid sequences of peptides which are known to bind to
Class I MHC molecules [3]. In D, there are 198 transactions (peptides) known to bind
to 4 different MHCs. We have worked with nine amino acid long sequences only,
since the majority of the known peptides were nine-mers. Each peptide is represented
by an item-set of nine elements, based on its sequence. So in our case there are 180
different items, since there are nine different positions and twenty different amino
acids. There are $20^9$ possible item-sets; each set has nine elements for the nine positions,
and each element can be one of 20 different amino acids. The position of each amino
acid in the sequence is important, so we have turned the sequences into item sets X whose
items have the form $A_p$, where A is the one letter code of the amino acid and p is its
position in the sequence. The rules mined will be of the form $\{V1\} \Rightarrow \{G2\}$,
meaning that the presence of a Valine in the first position of the peptide sequence implies
that there will be a Glycine in the second position of the peptide sequence. For
simplicity, we'll omit the curly brackets in the following sections.
Footnote 1. In the information retrieval context, precision and recall measures are combined
in the same way to calculate the F-measure.
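The positional item encoding described in this section can be sketched in a few lines of Python. The helper name `peptide_to_items` and the example peptide are hypothetical and only serve to illustrate the item format.

```python
from typing import FrozenSet


def peptide_to_items(peptide: str) -> FrozenSet[str]:
    """Turn a peptide into positional items such as 'V1' (Valine at position 1)."""
    return frozenset(f"{aa}{pos}" for pos, aa in enumerate(peptide, start=1))


# Hypothetical nine-mer, not taken from the data set.
example_items = peptide_to_items("VGLMAKTYI")
print(sorted(example_items, key=lambda item: int(item[1:])))
```

Because the position is part of the item label, the same amino acid at two different positions yields two distinct items, which is what gives 20 x 9 = 180 possible items.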
However, MHC molecules are not highly selective when binding peptides; they can
accommodate different types of amino acids at the same position of the peptide. There
are pockets at the binding site of the MHC, and some of these pockets have to be filled with
certain types of amino acids for the binding requirements to be fulfilled [19].
Sometimes the second position of the peptide fills the appropriate pocket and
sometimes the third position of the peptide occupies the same pocket. Therefore,
different amino acids and different positions of the peptide may play the same role in
defining the peptide's binding characteristics; conventional association rules cannot
capture this property well. So we have decided to change the rule structure to deal with this
problem.
Our association rules have the form $\{V2\} \lor \{A2\} \lor \{L3\} \Rightarrow \{I9\}$, meaning that the
presence of a Valine or an Alanine in the second position, or a Leucine in the third
position of the peptide, implies that the ninth position of the peptide sequence will be
an Isoleucine. Such rules can capture the binding characteristics better. This rule
structure with ORs ($\lor$) will also increase the CSC-Measure of the rules, resulting
in more globally correct binding characteristic rules. The definitions of the support and
confidence measures remain unchanged; the only difference is that the
calculations are done taking the OR into account.
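One way to read "taking the OR into account" is sketched below in Python: a sequence matches the left-hand side if it contains at least one of the OR-ed items. This interpretation, the function names and the five toy peptides are assumptions made for illustration; they are not taken from the original implementation or data.

```python
from typing import FrozenSet, Iterable, List, Tuple


def encode(peptide: str) -> FrozenSet[str]:
    """Positional items such as 'V2' (residue V at position 2)."""
    return frozenset(f"{aa}{pos}" for pos, aa in enumerate(peptide, start=1))


def or_rule_stats(antecedents: Iterable[str], consequent: str,
                  database: List[FrozenSet[str]]) -> Tuple[float, float]:
    """Support and confidence (in %) of the rule (a1 OR a2 OR ...) => consequent."""
    ants = frozenset(antecedents)
    matching = [t for t in database if ants & t]  # at least one OR-ed item present
    both = sum(1 for t in matching if consequent in t)
    support = 100.0 * both / len(database)
    confidence = 100.0 * both / len(matching) if matching else 0.0
    return support, confidence


# Made-up nine-mers used only to exercise the functions.
toy_db = [encode(p) for p in
          ["AVLKDEFGI", "KAMSTQNPI", "GSLTTQNPI", "GVTTTQNPL", "KKDDEEFFW"]]

single = or_rule_stats(["V2"], "I9", toy_db)
combined = or_rule_stats(["V2", "A2", "L3"], "I9", toy_db)
print(single, combined)
```

On this toy data the single rule V2 => I9 covers one of five peptides, while the OR-combined rule covers three of five, which shows how the disjunction raises support.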
3.1 Candidate generation and rule mining
The candidate generation step is generally done by the apriori algorithm and its
variations [2]. However, using OR in the rules increases the number of candidates so much
that the apriori algorithm would not have a reasonable runtime. We therefore first extracted
rules with one amino acid on each side using the conventional rule mining algorithm. Then we
combined these rules with the OR operator to yield rules which reflect the
binding characteristics better. The confidence of a new combined rule will typically lie between
the minimum and the maximum of the confidences of the rules which were
combined to yield it; this holds exactly when the combined left-hand sides match disjoint
sets of sequences, in which case the new confidence is a weighted average. The support will
not decrease, since the number of sequences which contain one of the amino acids on the left
side can only grow because of the OR operator between them.
First we have mined the database for association rules of the form $X_i \Rightarrow Y_j$, where
X and Y are amino acids and i and j are their positions. Small confidence (50%) and
support (20%) thresholds are used for two reasons. The first is that we expect these
values to go up as we combine the rules with the OR operator so we want as many
rules as possible. The second is that low support values imply that the number of
sequences or the transactions which contain both of the combined amino acids will be
small.
The combining process is as follows. Over the set of all one amino acid rules
of the form $X_i \Rightarrow Y_j$, we group the rules which have the same implication and then
generate all possible two amino acid combinations of these rules.
After we have the two amino acid rules, we again combine these rules to yield
three amino acid rules. This time the process is similar to the apriori algorithm:
we combine k amino acid rules which share k-1 amino acids and which have the
same implication. Combining these rules yields k+1 amino acid rules. The fact that we
are using the OR operator guarantees that the new rules' support values will never
decrease, so we do not have to check support values. The pruning criterion is the
CSC-Measure of the new rules: if the CSC-Measure does not improve by at least 2% by
addition of the new OR rule, the new rule is pruned.
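The level-wise combination just described can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the original implementation: the 2% criterion is read as an absolute gain of two percentage points over the better of the two joined parent rules, a candidate is accepted if any joining pair passes, and all names (`combine_with_or`, `min_gain`, the toy peptides) are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple

Items = FrozenSet[str]


def encode(peptide: str) -> Items:
    """Positional items such as 'V2' (residue V at position 2)."""
    return frozenset(f"{aa}{pos}" for pos, aa in enumerate(peptide, start=1))


def csc(antecedents: Items, consequent: str, db: List[Items]) -> float:
    """CSC-Measure (in %) of the rule (a1 OR a2 OR ...) => consequent."""
    matching = [t for t in db if antecedents & t]
    both = sum(1 for t in matching if consequent in t)
    if both == 0:
        return 0.0
    s = 100.0 * both / len(db)
    c = 100.0 * both / len(matching)
    return 2.0 * s * c / (s + c)


def combine_with_or(single_rules: Iterable[Tuple[str, str]], db: List[Items],
                    min_gain: float = 2.0) -> Dict[Tuple[Items, str], float]:
    """Grow OR-rules level by level from single rules (antecedent, consequent)."""
    groups: Dict[str, set] = {}
    for antecedent, consequent in single_rules:
        groups.setdefault(consequent, set()).add(frozenset({antecedent}))

    kept: Dict[Tuple[Items, str], float] = {}
    for consequent, singles in groups.items():
        level = {a: csc(a, consequent, db) for a in singles}
        while level:
            for a, score in level.items():
                kept[(a, consequent)] = score
            next_level: Dict[Items, float] = {}
            for a, b in combinations(level, 2):
                union = a | b
                # Join only k-item rules that share k-1 items (union has k+1 items).
                if len(union) != len(a) + 1 or union in next_level:
                    continue
                score = csc(union, consequent, db)
                # Assumed pruning rule: keep only if CSC gains at least min_gain points.
                if score >= max(level[a], level[b]) + min_gain:
                    next_level[union] = score
            level = next_level
    return kept


# Made-up nine-mers and single rules, used only to exercise the sketch.
toy_db = [encode(p) for p in
          ["AVLKDEFGI", "KAMSTQNPI", "GSLTTQNPI", "GVTTTQNPL", "KKDDEEFFW"]]
rules = combine_with_or([("V2", "I9"), ("A2", "I9"), ("L3", "I9")], toy_db)
best = max(rules, key=rules.get)
print(sorted(best[0]), "=>", best[1], round(rules[best], 2))
```

No support check appears in the loop, mirroring the observation that OR-combination cannot lower support; only the CSC gain decides whether a candidate survives.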
3.2 Amino acid classification
Evolution allows for sequence variability; to capture this information, we have
also classified the amino acids according to their physico-chemical properties as given
in Table 1. The classes of amino acids are obtained from a previous study by
Sezerman et al., which used an encoding-decoding algorithm that classified amino
acids based on similarity scoring matrices [14]. The classification scheme given in
Table 1 yielded the best results for us. Using the classification table enabled us to
distinguish the binding rules according to physico-chemical properties; e.g. the HLA-A2
molecule prefers a peptide with a bulky hydrophobic residue at position two
(Class F) and a small hydrophobic residue at position nine (Class A) for binding.
The classification step reduces the number of items and item-sets, reducing the
number of rules but making the rules more compact. The number of possible item-sets
is reduced from $20^9$ to $12^9$, and the number of items is reduced from 180 to 108.
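As a quick worked check of these counts (a calculation added for illustration, using only the figures of twenty amino acids, twelve classes and nine positions):

```latex
\begin{aligned}
\text{items:}     &\quad 20 \times 9 = 180, \qquad 12 \times 9 = 108 \\
\text{item-sets:} &\quad 20^{9} = 5.12 \times 10^{11}, \qquad 12^{9} = 5\,159\,780\,352 \approx 5.16 \times 10^{9} \\
\text{ratio:}     &\quad \frac{20^{9}}{12^{9}} = \left(\tfrac{5}{3}\right)^{9} \approx 99
\end{aligned}
```

So the reduced alphabet shrinks the space of possible nine-position item-sets by roughly two orders of magnitude.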
Table 1. Classification of amino acids
Class   Amino Acid(s)     Class   Amino Acid(s)
A       I, V, L, M, A     G       W
B       R, K              H       H
C       D, E              J       G
D       S, T              K       Q, N
E       Y                 L       C
F       F                 M       P
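The rewriting of peptides into the reduced alphabet of Table 1 can be sketched as a simple lookup in Python. The dictionary below transcribes Table 1; the names `CLASS_MEMBERS`, `AA_TO_CLASS` and `to_class_sequence`, as well as the example peptide, are illustrative assumptions.

```python
# Class letter -> member amino acids, transcribed from Table 1.
CLASS_MEMBERS = {
    "A": "IVLMA", "B": "RK", "C": "DE", "D": "ST", "E": "Y", "F": "F",
    "G": "W", "H": "H", "J": "G", "K": "QN", "L": "C", "M": "P",
}

# Inverted lookup: one-letter amino acid code -> class letter.
AA_TO_CLASS = {aa: cls for cls, members in CLASS_MEMBERS.items() for aa in members}


def to_class_sequence(peptide: str) -> str:
    """Rewrite a peptide in the reduced (12-letter) class alphabet."""
    return "".join(AA_TO_CLASS[aa] for aa in peptide)


# Hypothetical nine-mer, not taken from the data set.
print(to_class_sequence("AVLKDEFGI"))
```

Note that class letters and amino acid codes share the same characters (for example, amino acid G belongs to class J, while class G contains W), so class-encoded and raw sequences should not be mixed.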
4 Implementation and Experimental Results
First, the datasets were downloaded from SYFPEITHI [11]. As a preprocessing step, the peptide sequences were rewritten using the classes in Table 1.
Only nine amino acid long binding sequences of different kinds of MHC molecules are
used for rule extraction. The number of binding peptides for the different kinds
of MHC molecules varied from 24 to 107. The nature of our data set required data
cleaning. Peptide sequences are obtained experimentally. In some cases, MHC-bound
peptides are isolated, sequenced and stored in the databases. In other
cases, experimenters artificially create polyalanine peptide sequences of length nine and
check the binding affinity of this peptide to the specific MHC of interest. They then mutate
each position to different amino acid types separately and compare the binding affinity of the
mutated peptide with that of the original one. Therefore many binding
peptides coming from these studies had alanine (which is a neutral, small amino acid
that would not have any impact on the binding) in many positions. Since we are
looking at the support and confidence of the binders, this would cause a bias towards that
amino acid type in our association rules; therefore we cleaned our data of such
sequences. A peptide sequence was removed from our data set if it had the same
amino acid in four consecutive positions.
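The cleaning criterion stated above (same amino acid in four consecutive positions) can be sketched in Python as follows. The function name `has_repeat_run` and the sample peptides are assumptions for illustration only.

```python
from typing import List


def has_repeat_run(peptide: str, run_length: int = 4) -> bool:
    """True if the same residue occupies `run_length` consecutive positions."""
    run = 1
    for prev, cur in zip(peptide, peptide[1:]):
        run = run + 1 if cur == prev else 1
        if run >= run_length:
            return True
    return False


def clean(peptides: List[str]) -> List[str]:
    """Drop peptides that contain such a run (e.g. polyalanine scans)."""
    return [p for p in peptides if not has_repeat_run(p)]


# Made-up sequences: the first and third contain a run of four identical residues.
sample = ["AAAALKDEF", "AAALAAKDE", "KLMGGGGTV", "GLLTTQNPL"]
print(clean(sample))
```

Only contiguous runs trigger removal, so a peptide with many alanines that are interrupted by other residues is kept under this criterion.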
Table 2. Some of the best rules for four types of MHC molecule using four fold cross
validation (∨ denotes OR, ⇒ denotes implication).
Molecule     Rule                                Avg.        Avg.           Avg.            Avg.
                                                 Support %   Confidence %   CSC-Measure %   Accuracy %
HLAA020110   A1 ∨ A5 ∨ A6 ∨ A7 ⇒ A2              69.3        83.22          75.59           76.38
             A1 ∨ A6 ∨ A7 ∨ A8 ⇒ A2              69.3        86.89          77.01           76.38
             A1 ∨ A5 ∨ A6 ∨ A7 ∨ A8 ⇒ A2         71.92       83.71          77.32           78.88
HLAA02019    A1 ∨ A3 ∨ A5 ∨ A6 ⇒ A2              77.25       93.25          84.48           82.19
             A1 ∨ A3 ∨ A5 ∨ A6 ∨ A7 ⇒ A2         83.48       93.71          88.29           85.93
             A1 ∨ A3 ∨ A4 ∨ A6 ∨ A7 ∨ A8 ⇒ A9    85.66       93.85          89.56           87.82
HLAB089      A1 ∨ A4 ∨ A2 ⇒ B3                   68.05       74.17          70.96           83.33
             A1 ∨ C1 ∨ M2 ∨ A6 ∨ A8 ⇒ A9         72.22       75.63          73.84           87.5
             C4 ∨ B5 ∨ A6 ∨ A7 ∨ A8 ⇒ A9         75          77.28          76.11           87.5
HLAB27059    B1 ∨ A3 ∨ A6 ⇒ B2                   79.27       100            88.41           92.85
             B1 ∨ A3 ∨ A6 ∨ A9 ⇒ B2              90.80       100            95.17           100
             B1 ∨ A3 ∨ A5 ∨ A6 ∨ A7 ∨ B9 ⇒ B2    95.4        100            97.64           100
4.1 Testing Method
The data set we have used for the association rule mining is non-redundant, and the
number of sequences in the data set is not large enough, especially for certain MHC
data, to split the database into separate test and training sets. We have used only binding
peptides (positives) for the rule mining and testing processes. Since we have not
worked with non-binding peptides (negatives), we can only calculate the sensitivity of our
rules. Therefore we refer to sensitivity as the accuracy of our method. For the testing
process, rules whose accuracy values are among the top 80% of all accuracy values
are used. Some of the best rules are presented in Table 2. The values in Table 2 are
obtained by using a training set of 198 peptides in total for 4 different MHC types. We
can predict binding with up to 100% accuracy and 97% coverage in some cases
(Table 2).
We have used four fold cross-validation to test the accuracy and validity of the
rules we have mined. The data set was split into four subsets randomly. We
obtained the association rules using three subsets and tested them on the fourth. Then
we switched the test set and the training sets until we had run all possible combinations of
training and test sets. The testing procedure involved using the association rules
generated from the training set to identify binders. The values in Table 2 are the average
values of the four tests. The cross validation showed that association rules can predict
binders with between 76% and 100% accuracy.
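The four fold procedure described above can be sketched generically in Python. Since only binders are available, the score is sensitivity: the share of held-out binders that the trained model identifies. The `train` and `predict` callables, the stand-in model (most frequent residue at position two) and the toy peptides are all assumptions for illustration; the original work mines OR-rules at the training step instead.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def k_fold_splits(peptides: Sequence[str], k: int = 4,
                  seed: int = 0) -> List[Tuple[List[str], List[str]]]:
    """Randomly partition the peptides into k folds; return (train, test) pairs."""
    shuffled = list(peptides)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        splits.append((train, test))
    return splits


def cross_validated_sensitivity(peptides: Sequence[str], train: Callable,
                                predict: Callable, k: int = 4, seed: int = 0) -> float:
    """Average percentage of held-out binders identified over the k folds."""
    scores = []
    for train_set, test_set in k_fold_splits(peptides, k, seed):
        model = train(train_set)
        hits = sum(1 for p in test_set if predict(model, p))
        scores.append(100.0 * hits / len(test_set))
    return sum(scores) / len(scores)


# Stand-in model (an assumption): remember the most frequent residue at position two.
def train_p2(train_set: List[str]) -> str:
    return Counter(p[1] for p in train_set).most_common(1)[0][0]


def predict_p2(model: str, peptide: str) -> bool:
    return peptide[1] == model


# Made-up nine-mers, not taken from the data set.
toy_binders = ["ALAKDEFGV", "KLMSTQNPI", "GLLTTQNPL", "GVTTTQNPL",
               "SLDDEEFFV", "RMYYWWHHI", "TLQQNNCCV", "ALPPGGHHL"]
print(cross_validated_sensitivity(toy_binders, train_p2, predict_p2))
```

Every peptide is used for testing exactly once, and the reported value is the mean of the four per-fold sensitivities, matching the way the averages in Table 2 are described.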
Accuracy of the resulting rules depends on the confidence and support
thresholds. For some MHC classes, the dataset size is not sufficiently large, thus small
confidence and support thresholds must be used. For sufficiently large datasets, large
support and confidence thresholds can be set, yielding 90% accuracy.
The CSC measure gives a better picture of the success of our method. CSC values varied
between 71% and 92%. Our methods yield approximately 81% accuracy.
Brusic et al. report a predictive value of 78% for binding to human MHC HLA-A2
and 88% for mouse MHC H-2KB using ANNs [6]. Udaka et al. report approximately
80% accuracy using a scoring program for prediction on three mouse MHC binding
sequences [17]. Dönnes et al. report that 90% of all the peptides that are known to bind to
MHC can be predicted with 90% specificity using support vector machines on data for 21
MHC types [7]. In another article, Udaka et al. report that HMMs achieve 84%
precision, assessing their method by using a so-called precision-recall curve analysis
[16]. SYFPEITHI uses a profile based method, evaluating the contribution of each
amino acid in a peptide to the binding process and assigning an overall score to a given
peptide. The scoring process is based on the knowledge of anchor and auxiliary
anchor positions. For a given protein, all possible octamers, nonamers and decamers
are evaluated, and SYFPEITHI reports that the naturally presented epitope is among
the top scoring 2% of all peptides in 80% of all predictions [11]. The methods
reported here use different data sets with varying data preprocessing steps, so our
results are not directly comparable to theirs, except for SYFPEITHI, with which we
share our dataset.
5 Discussion and Perspectives
The novelties of our approach are the use of the OR operator and the reduced amino acid
alphabet classification. We have used a new association rule mining operator (OR) to
combine the rules to describe the binding preferences of MHC molecules. This
combination gives a better explanation of the importance of specific sites on the binding
peptide. The second and ninth positions appear most frequently in the motifs. These
positions have highly correlated hydrophobicity values, which is also supported by
Zhang et al. [19]. Zhang et al. also report that HLA-A02 classes require isoleucine,
valine, leucine or methionine (class A according to our amino acid classification) as
consensus anchors for binding, and that HLA-B classes need charged residues (classes B
and C according to our classification) as consensus anchors. These findings also
correlate with our rules. We also used a reduced amino acid alphabet, which helped us
to determine important physical and chemical properties of amino acids required at
significant positions for successful binding to the MHC. Deriving general rules for
binding is a crucial contribution of our method. Profile based methods assume a
contribution from each position of the peptide, even though some would contribute more
than others depending on the frequency of occurrence at the given position. Even
if a peptide has the binding motif at the key positions, the scores coming from
the other sites can cause it to be classified as a non-binder. According to Gulukota et al.
[8], profile based methods have 30% accuracy in the prediction of binders. Our method
points out key positions and significant features for binding. Machine learning based
methods can predict the binders with high accuracy and specificity but cannot give
out the features that are important for binding, which is crucial information for vaccine
design. Therefore, they are not well suited for this type of application.
So far we have not considered the information coming from non-binders in this
work. For future work, we are developing a new approach which takes the
non-binders' information into account as well. We are also trying to scan for explicit pairs
or triplets in peptide sequences using a Bayesian approach and to compare its efficiency
with that of our method.
References
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in
large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data
(SIGMOD'93), Washington, DC, pp. 207-216, May 1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int.
Conf. Very Large Data Bases (VLDB'94), Santiago, Chile, pp. 487-499, Sept. 1994.
[3] M. Bhasin, H. Singh, G. P. S. Raghava. MHCBN: A Comprehensive Database of MHC
Binding and Non-Binding Peptides. Nucleic Acids Research, Vol. 19 no.5 pp. 665-666,
2002.
[4] V. Brusic, V.B. Bajica, N. Petrovsky. Computational methods for prediction of T-cell
epitopes—a framework for modelling, testing, and applications. Methods, 34(4):436-43,
2004.
[5] V. Brusic and D.R. Flower. Bioinformatics tools for identifying T-cell epitopes. DDT:
BIOSILICO Vol. 2, No. 1, pp. 18-23, January 2004.
[6] V. Brusic, G. Rudy, L. C. Harrison. Prediction of MHC Binding Peptides Using Artificial
Neural Networks. Complexity International, Volume 2, 1995
[7] P. Dönnes, A. Elofsson. “Prediction of MHC class I binding peptides, using SVMHC”.
Bioinformatics, 3:25, 2002.
[8] K. Gulukota, J. Sidney, A. Sette, C. DeLisi. Two complementary methods for predicting
peptides binding major histocompatibility complex molecules. J Mol Biol, 267:1258-1267,
1997
[9] P. M. Kloetzel. The proteasome and MHC class I antigen processing. Biochimica et
Biophysica Acta 1695, pp. 217-225, 2004.
[10] T. Milledge, G. Zheng, G. Narasimhan. An Application Of Association Rule Mining to
Hla-A*0201 Epitope Prediction. ICBA, 2004.
[11] H. G. Rammensee, J. Bachmann, N.P.N. Emmerich, O.A. Bachor, and S. Stevanovic.
SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics, 50(3-4):213-219, 1999.
[12] H.G. Rammensee, T. Friede, S. Stevanovic. MHC ligands and peptide motifs: 1st listing.
Immunogenetics 41, pp. 178-228, 1995.
[13] P.A. Reche, J. P. Glutting, and E.L. Reinherz. Prediction of MHC Class I Binding Peptides
Using Profile Motifs. Hum. Immunol., 63:701-709, 2002.
[14] O.U. Sezerman, R. Islamaj and E. Alpaydin. Three dimensional representation of amino
acid characteristics. IEEE EMBC, Vol. 3 2903-2906, 2001
[15] H. Singh and G.P.S. Raghava. ProPred1: prediction of promiscuous MHC Class-I binding
sites. Bioinformatics, Vol. 19 no. 8 pp. 1009-1014, 2003.
[16] K. Udaka, H. Mamitsuka, Y. Nakaseko and N. Abe. Empirical Evaluation of a Dynamic
Experiment Design Method for Prediction of MHC Class I-Binding Peptides. The Journal
of Immunology, 169:5744 – 5753, 2002
[17] K. Udaka, K.H. Wiesmuller, S. Kienle, G. Jung, H. Tamamura, H. Yamagishi, K.
Okumura, P. Walden, T. Suto, T. Kawasaki. An automated prediction of MHC class I-binding
peptides based on positional scanning with peptide libraries. Immunogenetics, pp.
816-828, 2000.
[18] J. Zeng, H. R. Treutlein & G. B. Rudy. Predicting sequences and structures of MHC-binding
peptides: a computational combinatorial approach. Journal of Computer-Aided
Molecular Design, pp. 573-576, 2001.
[19] C. Zhang, A. Anderson , C. DeLisi . Structural principles that govern the peptide-binding
motifs of class I MHC molecules. J. Mol Biol, 929 – 947, 1998.