
MLEnt: The Machine Learning Entailment System of the
University of Alicante
Zornitsa Kozareva
Dept. Lenguajes y Sistemas Informáticos
University of Alicante
Alicante, 03080
[email protected]

Andrés Montoyo
Dept. Lenguajes y Sistemas Informáticos
University of Alicante
Alicante, 03080
[email protected]
Abstract
This paper describes a machine learning based approach for the resolution of textual entailment. We model features based on lexical overlap and semantic similarity measures. The machine learning algorithm we work with is Support Vector Machines. Several feature sets are constructed and their combination is studied in order to boost the final performance of the MLEnt system.
1 Introduction
Machine learning techniques have previously been applied to word sense disambiguation (Yarowsky, 1994), named entity recognition (Tjong Kim Sang, 2002), part-of-speech tagging (Màrquez et al., 1998),
parsing (Collins and Roark, 2004) and many other
Natural Language Processing tasks.
In an overview of the systems participating in the First Textual Entailment challenge, we saw that the approaches used are based on logic forms (Akhmatova, 2005), inversion transduction grammars (Wu, 2005), WordNet similarities (Herrera et al., 2005; Jijkoun and de Rijke, 2005), and edit distance between parse trees (Kouylekov and Magnini, 2005), among others (Dagan et al., 2005). The majority of the systems experiment with different threshold and parameter settings in order to reach their best performance. This parameter adjustment requires numerous experiments, and even then the selected settings may lead to incorrect reasoning by the systems.
Therefore, we decided to develop a machine learning entailment system (MLEnt) that captures lexical and semantic information and models it in the form of attributes. To determine the result of the entailment relation, a machine learning algorithm is employed instead of human-determined thresholds: given the set of attributes for an example, the algorithm automatically estimates the corresponding output.
In the experiments performed, we first studied the limitations of the designed lexical and semantic feature sets. Second, we explored different techniques for combining the complementary sets. The obtained results show that MLEnt is still at an early stage of development and needs considerably more information to improve. Moreover, the performance of MLEnt is influenced not only by the attributes we worked with, but also by the number and type of available textual entailment examples.
The paper is organized as follows: Section 2 describes the MLEnt system, Section 3 presents the results of the performed experiments, and Section 4 discusses conclusions and future work.
2 MLEnt system
The MLEnt system consists of feature extraction, machine learning (training and testing), classifier combination, and evaluation modules. The structure of the system is shown in Figure 1. In the next subsections we describe the functioning of the modules in detail.
Figure 1: Modules of MLEnt system.
2.1 MLEnt feature extraction
The performance of every machine learning system depends on the constructed set of attributes and
the selected machine learning algorithm. To create MLEnt in its present form, we studied different
sets of features and experimented with several machine learning algorithms such as k-NN (Daelemans et al., 2003), Decision trees (Daelemans et al., 2003), SVM (Collobert and Bengio, 2001) and MaxEnt (Suárez and Palomar, 2002).
These experiments determined the final set of features, which we describe in this subsection. The attributes rely on lexical and semantic information that is captured by well-known text summarization (Lin and Och, 2004) and similarity (Lin and Path) measures. The values of the attributes are binary or numeric.
To generate some of the features, part-of-speech tags are needed; we use TreeTagger (Schmid, 1994). For other features the WordNet::Similarity package (Patwardhan et al., 2003) is used. The features of MLEnt are:
• f1: n-gram = c/m, obtains the ratio of the consecutive word overlaps c in text T and hypothesis H with respect to all words m in H. According to the n-gram attribute, the more common consecutive words two sentences have, the more similar they are.

• f2: LCS = LCS(T,H)/n, estimates the similarity between text T with length m and hypothesis H with length n by searching for in-sequence matches that reflect sentence-level word order. The longest common subsequence (LCS) captures sentence-level structure in a natural way. This measure is also known as ROUGE-L (Lin and Och, 2004).

• f3,4: skip_gram = skip_gram(T,H) / C(n, number of skip grams), where skip_gram(T,H) refers to the number of common skip-grams (pairs of words in sentence order that allow arbitrary gaps) found in T and H, C(n, number of skip grams) is a combinatorial function in which n is the number of words in the hypothesis H and number of skip grams corresponds to the number of common skip-grams between (T,H)¹. This measure is known as ROUGE-S (Lin and Och, 2004). Note that, in contrast to LCS, skip-grams need a determinate length. Only bi- and tri-skip-grams are calculated, because skip-grams of order higher than three do not occur often. Therefore two skip-gram attributes are generated: one looks for common two-skip-grams and the other for common tri-skip-grams. (A code sketch of these overlap measures is given below, after the feature list.)
• f5: nT is 1 if there is negation in the text T, 0 otherwise.

• f6: nH is 1 if there is negation in the hypothesis H, 0 otherwise.
• f7: number determines that "four-thousand" is the same as 4000, "more than 5" indicates "6 or more", "less than 5" indicates "4 or below", and "four-years-old" is the same as "4-years old". For a perfect match, number is 1. When T and H do not contain numeric expressions, the number attribute is 0. A partial match has a value between 0 and 1.
• f8: Np identifies the proper names in text T and hypothesis H, and then lexically matches them. Consecutive proper names that are not separated by special symbols such as a comma or dash are merged and treated as a single proper name. Two-word proper names such as "Mexico City" are thus treated as one proper name.
¹ e.g. the number of skip grams is 1 if there is a common unigram between T and H, 2 if there is a common bigram, and so on.
• f9: adv is 1 for a perfect match of adverbs between the two sentences, 0 when there is no adverb in one or both of the sentences, and between 0 and 1 for partial adverb matching.

• f10: adj is 1 for a perfect match of adjectives between the two sentences, 0 when there is no adjective in one or both of the sentences, and between 0 and 1 for partial adjective matching.
• f11: n_lin = (Σ_{i=1..n} sim(T,H)_lin) / n, estimates the similarity sim(T,H)_lin for the noun pairs in T and H according to the measure of Lin, against the maximum noun similarity n for T and H.

• f12: v_lin = (Σ_{j=1..v} sim(T,H)_lin) / v, determines the similarity sim(T,H)_lin of the verbs in T and H according to Lin's measure, compared to the maximum verb similarity v for T and H.

• f13: n_path = (Σ_{i=1..n} sim(T,H)_path) / n, is the same as f11 with the difference that the similarity between the nouns is determined with the Path measure.

• f14: v_path = (Σ_{j=1..v} sim(T,H)_path) / v, is the same as f12 with the difference that the similarity between the verbs is calculated with the Path measure.

• f15: nv_lin = (Σ_{i=1..n} sim(T,H)_lin * Σ_{j=1..v} sim(T,H)_lin) / (n*v), measures the similarity of the two sentences with regard to the similarity of the nouns and the verbs in the text T and the hypothesis H.

• f16: nv_path = (Σ_{i=1..n} sim(T,H)_path * Σ_{j=1..v} sim(T,H)_path) / (n*v), is the same as f15, but the similarity measure is Path.

• f17: units_lin = Π_{unit=1..k} ((Σ_{i=1..l} sim(T,H)_lin) / l), represents the text similarity between the two sentences. Units correspond to the different constituents in a sentence. Noun/verb similarity is determined with the attributes f11 and f12, while the rest of the units are lexically matched.

The described attributes are divided into two complementary sets: A = {f1, f2, f3, f4, f5, f6, f7} and B = {f7, f8, f9, f10, f11, f12, f13, f14, f15, f16, f17}. The distribution of the features according to the information gain measure is shown in Figures 2 and 3.

Figure 2: Information gain of the features in set A
Figure 3: Information gain of the features in set B
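To make the lexical overlap attributes concrete, the sketch below shows one possible implementation of the word overlap, LCS and skip-gram measures defined above (f1–f4). It is a minimal sketch: the whitespace tokenization, the lowercasing and the exact reading of f1 are assumptions, not the exact MLEnt implementation.

```python
from itertools import combinations
from math import comb

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def word_overlap(text, hyp):
    """Rough counterpart of f1: fraction of hypothesis words also found in the text."""
    t, h = text.lower().split(), hyp.lower().split()
    return sum(1 for w in h if w in set(t)) / len(h) if h else 0.0

def lcs_feature(text, hyp):
    """f2 (ROUGE-L style): LCS(T, H) normalized by the hypothesis length n."""
    t, h = text.lower().split(), hyp.lower().split()
    return lcs_length(t, h) / len(h) if h else 0.0

def skip_gram_feature(text, hyp, k):
    """f3/f4 (ROUGE-S style): common k-skip-grams (in-order word tuples with
    arbitrary gaps) normalized by C(n, k), where n is the hypothesis length."""
    t, h = text.lower().split(), hyp.lower().split()
    if len(h) < k:
        return 0.0
    common = set(combinations(t, k)) & set(combinations(h, k))
    return len(common) / comb(len(h), k)

# A (T, H) pair is turned into a small numeric feature vector for the classifier.
T = "Three leaders of a premier Johnstown tourism event were charged with illegally diverting money"
H = "Three leaders of a premier Johnstown tourism event are suspected of stealing money"
print([word_overlap(T, H), lcs_feature(T, H),
       skip_gram_feature(T, H, 2), skip_gram_feature(T, H, 3)])
```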
High information gain values correspond to informative attributes. In a ten-fold cross-validation backward-forward feature selection procedure, f2 and f4 gave the lowest error rate for set A, while for set B f7 and f14 were selected.
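A rough equivalent of this selection step is sketched below. Note that scikit-learn's SequentialFeatureSelector performs plain forward or backward selection rather than the combined backward-forward procedure used here, so this is only an approximation; the feature matrix and labels are toy stand-ins.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Toy stand-ins: one row of attribute values per (T, H) pair and YES/NO labels.
rng = np.random.default_rng(0)
X = rng.random((40, 7))             # e.g. the seven attributes of set A
y = np.tile([0, 1], 20)             # 1 = YES (entailment), 0 = NO

# Backward elimination with ten-fold cross-validation, keeping two attributes.
selector = SequentialFeatureSelector(SVC(), n_features_to_select=2,
                                     direction="backward", cv=10)
selector.fit(X, y)
print(selector.get_support())       # boolean mask of the retained attributes
```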
On their own, these sets are able to resolve the textual entailment task only up to a point. The feature sets obtain quite similar results, but correctly resolve different examples. The distribution of the attributes into complementary sets is very important for the classifier combination module.
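Finally, the WordNet-based attributes (f11–f14) rely on the Lin and Path similarity measures, which MLEnt obtains through the WordNet::Similarity package. The sketch below approximates the same idea with NLTK's WordNet interface (assuming the WordNet and WordNet-IC data are installed); the pairing strategy, taking the best match for each hypothesis noun, is an assumption rather than the exact MLEnt formula.

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')   # information content used by Lin's measure

def best_pair_similarity(word_t, word_h, measure):
    """Maximum synset-to-synset similarity between two nouns."""
    scores = [measure(s1, s2)
              for s1 in wn.synsets(word_t, pos=wn.NOUN)
              for s2 in wn.synsets(word_h, pos=wn.NOUN)]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

def noun_similarity_feature(text_nouns, hyp_nouns, use_lin=True):
    """f11/f13-style attribute: average best similarity of each hypothesis noun
    against the text nouns, using the Lin or the Path measure."""
    if not hyp_nouns:
        return 0.0
    measure = ((lambda a, b: a.lin_similarity(b, brown_ic)) if use_lin
               else (lambda a, b: a.path_similarity(b)))
    total = sum(max((best_pair_similarity(t, h, measure) for t in text_nouns),
                    default=0.0)
                for h in hyp_nouns)
    return total / len(hyp_nouns)

# The noun lists would come from the TreeTagger part-of-speech output.
print(noun_similarity_feature(["growth", "survey"], ["report", "growth"], use_lin=False))
```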
2.2 Training MLEnt
The previously described attributes were studied and examined with several machine learning algorithms. We observed the behavior of several classifiers (e.g. k-NN, Decision Trees, MaxEnt and SVM) with these feature sets, and found that the best generalization is obtained with SVM.
For this reason in the RTE2 challenge only SVM
is used. The training of MLEnt is conducted with
the RTE2 development data set. To avoid overfitting,
the system is also tested with the RTE1 development
and test data sets. Thus, the final system relies on the
observations of the three data sets.
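As an illustration of this training setup, the procedure could look as follows. This is only a sketch: scikit-learn's SVC stands in for the SVMTorch implementation cited above, the kernel and parameters are assumptions, and build_feature_vectors is a hypothetical helper that turns each (T, H) pair into the attribute vector of Section 2.1.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Hypothetical helper: reads the RTE pairs and returns (feature matrix, labels).
X_train, y_train = build_feature_vectors("rte2_dev.xml")    # development set
X_test, y_test = build_feature_vectors("rte2_test.xml")     # test set

clf = SVC()                          # kernel and parameters are an assumption
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```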
2.3 Combination methods in MLEnt
According to (Dietterich, 1997), an ensemble of
classifiers is potentially more accurate than an individual classifier. This is one of the reasons for building the combination module in MLEnt. The other reason is the design of the complementary feature sets.
In order to combine the different feature sets, we first had to test the classifiers for complementarity. When two sets are complementary, they are able to resolve different examples, and therefore their combination is beneficial, i.e., more cases are resolved. To evaluate the complementarity of the feature sets, the kappa statistic (Cohen, 1960) is used. According to the kappa measure, sets A and B are complementary, and so are their subsets.
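As a minimal illustration of this check (the predictions shown are invented, and scikit-learn's cohen_kappa_score is used purely for convenience), the agreement between two classifiers can be measured as follows; a low kappa indicates that the classifiers tend to resolve different examples.

```python
from sklearn.metrics import cohen_kappa_score

# YES/NO predictions of the set A and set B classifiers on the same
# development examples (1 = YES, 0 = NO); the values here are invented.
preds_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
preds_b = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]

# Low agreement (low kappa) suggests the two feature sets are complementary,
# so combining them can resolve more cases.
print("kappa:", cohen_kappa_score(preds_a, preds_b))
```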
Below we present the stacking and voting techniques with which the classifier combination is realized. However, we face the problem of determining which classifier gives the most precise prediction for a given example. To obtain such a confidence measure, we use the distance between the predicted value of the example and the training prototypes.
For our system, stacking is applied in the following way. First, several base-level classifiers are generated. Second, the outputs of the base-level classifiers, together with the predicted classes, are included as features in one of the base-level classifiers. Thus a new, so-called meta-classifier is generated. The outputs of the meta-classifier are taken as the final results of the system.
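A schematic version of this stacking step is given below; using cross-validated predictions to build the meta-features and SVC as the base learner are assumptions rather than the exact MLEnt configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def build_meta_features(base_feature_sets, X_B, y):
    """Append the predictions of every base-level classifier (obtained here by
    cross-validation to avoid training-set leakage) to the features of
    base-level classifier B."""
    meta_columns = [cross_val_predict(SVC(), X, y, cv=10).reshape(-1, 1)
                    for X in base_feature_sets]
    return np.hstack([X_B] + meta_columns)

# X_A, X_B, X_subA, X_subB would be the feature matrices of the four base-level
# classifiers; the meta-classifier is then trained on the extended matrix:
#   X_meta = build_meta_features([X_A, X_B, X_subA, X_subB], X_B, y_train)
#   meta_classifier = SVC().fit(X_meta, y_train)
```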
The other combination method in MLEnt is voting. Several classifiers are generated and their outputs are compared. The final decision about the class assignment is taken in favour of the class with the majority of votes. Examples for which the classifiers disagree acquire the class of the classifier with the highest performance.
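The voting rule, with disagreements resolved in favour of the best-performing classifier, can be written compactly as below; the classifier names and the ranking are illustrative.

```python
from collections import Counter

def vote(predictions, ranking):
    """predictions: classifier name -> predicted label ("YES"/"NO");
    ranking: classifier names ordered from best to worst accuracy."""
    counts = Counter(predictions.values()).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]                    # clear majority vote
    # tie: fall back to the label of the best-performing classifier
    for name in ranking:
        if name in predictions:
            return predictions[name]

# Example: the two subsets disagree, so the better classifier decides.
print(vote({"sub A": "YES", "sub B": "NO"}, ["sub A", "sub B"]))   # -> YES
```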
For the final configuration of MLEnt, we had to establish which classifiers to combine, and we conducted several experiments with the RTE1 data sets. For stacking, including the outputs of all four classifiers (sets A and B and their subsets) performs better than taking only two or three of them; the meta-classifier is created with the features of base-level classifier B and the outputs of all four base-level classifiers. For voting, the best combination is obtained by merging the two subsets.
3 Results and analysis
The results obtained for the RTE2 test data set are influenced by the development of MLEnt on the RTE2 development set. Therefore, we first show the results on the development set in Table 1. The table lists the performance of the four feature sets together with the stacking and voting runs².
As a baseline system, we used an SVM classifier where only the first feature from subsection 2.1 is considered. This feature indicates the ratio of the words in T that are also present in H. All development and test runs outperformed the baseline.
runs      Acc.   IE     IR     QA     SUM
set A     56.87  49.50  55.50  51.00  71.50
set B     60.12  54.00  61.00  59.00  66.50
sub A     57.75  49.50  57.00  51.50  73.00
sub B     57.13  54.50  61.00  49.50  63.50
run 1     68.63  58.50  68.00  70.50  77.50
run 2     62.12  54.50  61.50  62.00  70.50
baseline  50.50  51.00  50.00  48.50  52.50

Table 1: MLEnt results with RTE2 development set
The performance of the sets is quite similar in accuracy; however, set A covers the majority of the text summarization examples, while set B identifies text summarization and information retrieval examples. The stacking and voting combinations lead to improvement not only in overall accuracy, but also for the particular tasks, especially question answering.

² run1 corresponds to the output of stacking and run2 corresponds to the output of voting.
Additionally, the same experiments were conducted with the RTE1 data sets, where we saw that the accuracy results for the test data set are similar to those in Table 1, but for the RTE1 development set the voting scheme performed better than stacking. The final RTE2 submissions were made on the basis of these performances.
Table 2 shows the results for the submitted RTE2 runs, denoted run1 and run2, together with the scores for each feature set.
runs      Acc.   IE     IR     QA     SUM
set A     51.75  52.00  53.50  55.50  46.00
set B     54.25  50.00  55.50  47.50  64.00
sub A     56.75  46.00  55.50  56.50  69.00
sub B     52.38  49.00  51.50  52.00  57.00
run 1     54.87  54.00  56.00  50.00  59.50
run 2     55.00  49.50  54.00  54.50  62.00
baseline  50.50  51.00  50.00  48.50  52.50

Table 2: MLEnt results with RTE2 test set
The subset of A performs better than each of the full sets and even their combinations. Table 1 shows that during training this subset performed quite similarly to the others; however, on the RTE2 test data set it obtained about 5% better coverage.
The stacking run reached 54.87%. It could not boost the performance of the MLEnt system, because the majority of the sets performed similarly; in addition, noise accumulated in the stacking scheme and led to degradation. The same happened with the voting run: its performance of 55% is better than stacking, but it is still insufficient and lower than the performance of subset A.
With the RTE2 test set, MLEnt is able to correctly identify examples such as:
T: Mangla was summoned after Madhumita’s sister Nidhi Shukla, who was the first witness in the
case.
H: Shukla is related to Mangla.
The relation is correctly identified as negative, because the number of common words between the two sentences is low. Likewise, the pair below is correctly identified as positive, because the skip-gram and LCS attributes have high values, indicating many common words between the text and the hypothesis.
T: Three leaders of a premier Johnstown tourism
event were charged with illegally diverting money
to their organization, state police said Wednesday,
sending shock waves through the region.
H: Three leaders of a premier Johnstown tourism
event are suspected of stealing money.
Despite the presence of the similarity measures in the other sets, MLEnt was not able to correctly recognize the following pair:
T: Growth in the vast U.S. services sector slowed
in August although managers seemed more willing
to hire new workers, according to an industry survey
published Friday.
H: A report showed that growth in the vast service
sector slowed in August.
We saw that for the great majority of the cases, the feature vectors of the positive and negative classes were densely concentrated around a single point rather than being scattered. This hampered the generalization ability of the classifier.
4 Conclusions and future work
Our main interest in participating in the Second Textual Entailment challenge was to experiment with resolving textual entailment using a machine learning based system. In this competition, the system works with lexical and semantic features. With the performed experiments we measured the contribution of each developed feature set and the impact of their combination.
Although the MLEnt system was developed and previously evaluated on three different data sets, on which it achieved accuracy results between 60% and 65%, the RTE2 evaluation data set produced different results. This performance is influenced by the features that we used and, mainly, by the high data sparsity problem.
It is interesting to note that a feature set of only two attributes (tri-skip-gram and LCS) can correctly resolve 56% of the entailment examples. Of course, 69% of these examples come from text summarization, which is expected since the attributes are derived from text summarization measures. The accuracy for the IE, IR and QA entailment examples is very low, around 52%. In order to improve their performance, more attributes should be incorporated.
For the IE task, the proper name attribute is not sensitive enough, so a named entity recognizer is needed. The performance on IR is influenced by the length of the sentences, as the attributes are normalized according to the number of words in the hypothesis.
We believe that, in order to improve MLEnt, several classifiers that resolve the different NLP tasks should be prepared and later joined together.
Acknowledgements
This research has been partially funded by the Spanish Government under project CICyT number TIC2003-07158-C04-01 and PROFIT number FIT-340100-2004-14, and by the Valencia Government under project number GV04B-276.
References
D. Yarowsky. 1994. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in
Spanish and French. Proceedings of the 32nd Annual
Meeting of the ACL.
E. F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. Proceedings of CoNLL-2002, Taipei, Taiwan, p. 155–158.
L. Màrquez, L. Padró and H. Rodríguez. 1998. A Machine Learning Approach to POS Tagging.
M. Collins and B. Roark. 2004. Incremental Parsing with
the Perceptron Algorithm. Proceedings of the 42nd
Annual Meeting of the Association for Computational
Linguistics. Barcelona, Spain, p. 111–118.
E. Akhmatova. 2005. Textual Entailment Resolution
via Atomic Propositions. Proceedings of the PASCAL
Challenges Workshop on Recognising Textual Entailment.
D. Wu. 2005. Textual Entailment Recognition Based on
Inversion Transduction Grammars. Proceedings of the
PASCAL Challenges Workshop on Recognising Textual Entailment.
J. Herrera, A. Peñas and F. Verdejo. 2005. Textual Entailment Recognition Based on Dependency Analysis
and WordNet. Proceedings of the PASCAL Challenges
Workshop on Recognising Textual Entailment.
V. Jijkoun and M. de Rijke. 2005. Recognizing Textual Entailment Using Lexical Similarity. Proceedings
of the PASCAL Challenges Workshop on Recognising
Textual Entailment.
M. Kouylekov and B. Magnini. 2005. Recognizing textual entailment with edit distance algorithms. Proceedings of the PASCAL Challenges Workshop on
Recognising Textual Entailment.
I. Dagan, O. Glickman and B. Magnini. 2005. The
PASCAL Recognising Textual Entailment Challenge.
Proceedings of the PASCAL Challenges Workshop on
Recognising Textual Entailment.
C.-Y. Lin and F. J. Och. 2004. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-bigram
Statistics, Proceedings of ACL-2004, Barcelona,
Spain.
W. Daelemans, J. Zavrel, K. van der Sloot and A. van
den Bosch. 2003. TiMBL: Tilburg Memory-Based
Learner. Tilburg University, ILK 03-10.
R. Collobert and S. Bengio. 2001. SVMTorch: support
vector machines for large-scale regression problems,
The Journal of Machine Learning Research, vol. 1,
issn 1533-7928, p. 143-160, MIT Press, Cambridge,
MA, USA.
A. Suárez and M. Palomar. 2002. A maximum entropy-based word sense disambiguation system. Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), p. 960–966.
H. Schmid. 1994. Probabilistic part-of-speech tagging
using decision trees. Proceedings of International
Conference on New Methods in Language Processing.
T. Dietterich. 1997. Machine-Learning Research: Four
Current Directions. AI Magazine. Winter 1997, p.
97–136.
S. Patwardhan, S. Banerjee, and T. Pedersen. 2003. Using measures of semantic relatedness for word sense
disambiguation. Proceedings of the Fourth International Conference on Intelligent Text Processing and
Computational Linguistics, Mexico City, February.
J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.