MLEnt: The Machine Learning Entailment System of the University of Alicante

Zornitsa Kozareva
Dept. Lenguajes y Sistemas Informáticos
University of Alicante
Alicante, 03080
[email protected]

Andrés Montoyo
Dept. Lenguajes y Sistemas Informáticos
University of Alicante
Alicante, 03080
[email protected]

Abstract

This paper describes a machine learning based approach for the resolution of textual entailment. We model features based on lexical overlaps and semantic similarity measures. The machine learning algorithm we worked with is Support Vector Machines. Several feature sets are constructed and their combination is studied in order to boost the final performance of the MLEnt system.

1 Introduction

Machine learning techniques have previously been applied to word sense disambiguation (Yarowsky, 1994), named entity recognition (Tjong Kim Sang, 2002), part-of-speech tagging (Màrquez et al., 1998), parsing (Collins and Roark, 2004) and many other Natural Language Processing tasks. In an overview of the systems participating in the First Textual Entailment challenge, we saw that the approaches used are based on logic forms (Akhmatova, 2005), inversion transduction grammars (Wu, 2005), WordNet similarities (Herrera et al., 2005; Jijkoun and de Rijke, 2005), and edit distance between parse trees (Kouylekov and Magnini, 2005), among others (Dagan et al., 2005).

The majority of these systems experiment with different threshold and parameter settings to estimate the best performance. Such parameter adjustment requires numerous experiments, and even then the selected settings may lead to incorrect reasoning by the systems. Therefore, we decided to present a machine learning entailment system (MLEnt) that captures lexical and semantic information and models it in the form of attributes. To determine the result of the entailment relation, instead of using manually determined thresholds, a machine learning algorithm is employed: given the set of attributes for an example, the algorithm automatically estimates the corresponding output.

In the performed experiments, we first studied the limitations of the designed lexical and semantic feature sets. Second, we explored different techniques for combining the complementary sets. The obtained results show that MLEnt is in an early stage of development and needs more information to improve its functioning. However, the performance of MLEnt is influenced not only by the attributes we worked with, but also by the number and type of textual entailment examples.

The paper is organized in the following way: Section 2 describes the MLEnt system, the results of the performed experiments are shown in Section 3, and finally conclusions and future work are discussed in Section 4.

2 MLEnt system

The MLEnt system consists of a feature extraction module, a machine learning (training and testing) module, a classifier combination module and an evaluation module. The structure of the system is shown in Figure 1. In the next subsections we describe in detail the functioning of the modules.

Figure 1: Modules of the MLEnt system.
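To make the module pipeline concrete, the sketch below shows, in Python, how feature extraction, SVM training and prediction could be wired together. It is a minimal illustration, not the authors' implementation: the single overlap feature, the toy training pairs and the use of scikit-learn's SVC (standing in for the SVMTorch package cited later) are all assumptions made for the example.

```python
# Hypothetical sketch of the MLEnt module pipeline: feature extraction -> SVM
# training -> prediction. scikit-learn's SVC stands in for the SVMTorch package
# used by the authors; the single overlap feature and the toy data are illustrative.
from sklearn.svm import SVC

def extract_features(text, hypothesis):
    """Toy feature vector: ratio of hypothesis words also present in the text."""
    t_words = set(text.lower().split())
    h_words = hypothesis.lower().split()
    overlap = sum(w in t_words for w in h_words) / max(len(h_words), 1)
    return [overlap]

# Invented (text, hypothesis, entailment) training triples.
train_pairs = [
    ("John bought a new red car yesterday", "John bought a car", 1),
    ("The cat slept on the warm windowsill", "A dog chased the mailman", 0),
]
X_train = [extract_features(t, h) for t, h, _ in train_pairs]
y_train = [label for _, _, label in train_pairs]

classifier = SVC(kernel="linear")   # one base-level classifier per feature set in MLEnt
classifier.fit(X_train, y_train)

# Prediction for a new pair; the combination module would merge several such outputs.
print(classifier.predict([extract_features("Mary owns three cats", "Mary owns cats")]))
```

In MLEnt itself, one such classifier is trained per feature set, and the classifier combination module merges their outputs, as described in Section 2.3.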
2.1 MLEnt feature extraction

The performance of every machine learning system depends on the constructed set of attributes and the selected machine learning algorithm. To create MLEnt in its present form, we studied different sets of features and experimented with several machine learning algorithms such as k-NN (Daelemans et al., 2003), decision trees (Daelemans et al., 2003), SVM (Collobert and Bengio, 2001) and MaxEnt (Suárez and Palomar, 2002). These experiments determined the definitive set of features described in this subsection. The attributes rely on lexical and semantic information, captured by well-known text summarization measures (Lin and Och, 2004) and by the Lin and Path similarity measures. The values of the attributes are binary and numeric. To generate some of the features, part-of-speech tags are needed; we use TreeTagger (Schmid, 1994). For other features the WordNet::Similarity package (Patwardhan et al., 2003) is used. The features of MLEnt are:

• f1: $n\_gram = \frac{c}{m}$ obtains the ratio of the consecutive word overlaps $c$ between the text T and the hypothesis H with respect to all words $m$ in H. According to the n-gram attribute, the more consecutive words two sentences have in common, the more similar they are.

• f2: $LCS = \frac{LCS(T,H)}{n}$ estimates the similarity between the text T with length $m$ and the hypothesis H with length $n$ by searching for in-sequence matches that reflect sentence-level word order. The longest common subsequence (LCS) captures sentence-level structure in a natural way. This measure is also known as ROUGE-L (Lin and Och, 2004).

• f3,4: $skip\_gram = \frac{skip\_gram(T,H)}{C(n,\,number\_of\_skip\_gram)}$, where $skip\_gram(T,H)$ is the number of common skip-grams (pairs of words in sentence order that allow arbitrary gaps) found in T and H, and $C(n, number\_of\_skip\_gram)$ is a combinatorial function in which $n$ is the number of words in the hypothesis H and $number\_of\_skip\_gram$ corresponds to the number of common skip-grams between T and H (e.g. it is 1 if there is a common unigram between T and H, 2 if there is a common bigram, and so on). This measure is known as ROUGE-S (Lin and Och, 2004). Note that, in contrast to LCS, skip-grams require a fixed length. Only bi- and tri-skip-grams are calculated, because skip-grams longer than three do not occur often. Therefore two skip-gram attributes are generated: one looks for common bi-skip-grams and the other for common tri-skip-grams.

• f5: nT is 1 if there is a negation in the text T, 0 otherwise.

• f6: nH is 1 if there is a negation in the hypothesis H, 0 otherwise.

• f7: number determines that "four-thousand" is the same as 4000, "more than 5" indicates "6 or more", "less than 5" indicates "4 or below", and "four-years-old" is the same as "4-years old". For a perfect match, number is 1. When T and H do not contain numeric expressions, the number attribute is 0. A partial match has a value between 0 and 1.

• f8: Np identifies the proper names in the text T and the hypothesis H, and then lexically matches them. Consecutive proper names that are not separated by special symbols such as a comma or a dash are merged and treated as a single proper name. Two-word proper names such as "Mexico City" are transformed into one single proper name "Mexico City".

• f9: adv is 1 for a perfect match of the adverbs between the two sentences, 0 when there is no adverb in one or both of the sentences, and between 0 and 1 for a partial adverb match.

• f10: adj is 1 for a perfect match of the adjectives between the two sentences, 0 when there is no adjective in one or both of the sentences, and between 0 and 1 for a partial adjective match.

• f11: $n\_lin = \frac{\sum_{i=1}^{n} sim(T,H)_{lin}}{n}$ estimates the similarity $sim(T,H)_{lin}$ of the noun pairs in T and H according to the measure of Lin, normalized by the maximum noun similarity $n$ for T and H.

• f12: $v\_lin = \frac{\sum_{j=1}^{v} sim(T,H)_{lin}}{v}$ determines the similarity $sim(T,H)_{lin}$ of the verbs in T and H according to Lin's measure, normalized by the maximum verb similarity $v$ for T and H.

• f13: $n\_path = \frac{\sum_{i=1}^{n} sim(T,H)_{path}}{n}$ is the same as f11, with the difference that the similarity between the nouns is determined with the Path measure.

• f14: $v\_path = \frac{\sum_{j=1}^{v} sim(T,H)_{path}}{v}$ is the same as f12, with the difference that the similarity between the verbs is calculated with the Path measure.

• f15: $nv\_lin = \frac{\sum_{i=1}^{n} sim(T,H)_{lin} \, \ast \, \sum_{j=1}^{v} sim(T,H)_{lin}}{n \ast v}$ measures the similarity of the two sentences in terms of the similarity of the nouns and the verbs in the text T and the hypothesis H.

• f16: $nv\_path = \frac{\sum_{i=1}^{n} sim(T,H)_{path} \, \ast \, \sum_{j=1}^{v} sim(T,H)_{path}}{n \ast v}$ is the same as f15, but the similarity measure is Path.

• f17: $units\_lin = \prod_{unit=1}^{k} \left( \frac{\sum_{i=1}^{l} sim(T,H)_{lin}}{l} \right)$ represents the text similarity between the two sentences. Units correspond to the different constituents in a sentence. Noun/verb similarity is determined with the attributes f11 and f12, while the rest of the units are lexically matched.

The described attributes are divided into two complementary sets: A={f1, f2, f3, f4, f5, f6, f7} and B={f7, f8, f9, f10, f11, f12, f13, f14, f15, f16, f17}.
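As an illustration of the lexical overlap attributes f1–f4, the following Python sketch computes a unigram overlap ratio, the word-level LCS used by ROUGE-L, and a skip-bigram score for a (T, H) pair. It is a simplified reading of the definitions above (whitespace tokenization, no TreeTagger preprocessing, skip-bigrams only), not the system's actual code.

```python
from itertools import combinations

def ngram_overlap(t_tokens, h_tokens):
    """f1-style ratio: hypothesis words also appearing in the text, over |H|."""
    t_set = set(t_tokens)
    return sum(w in t_set for w in h_tokens) / max(len(h_tokens), 1)

def lcs_length(t_tokens, h_tokens):
    """Length of the word-level longest common subsequence, as used by ROUGE-L."""
    m, n = len(t_tokens), len(h_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if t_tokens[i] == h_tokens[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def skip_bigram_overlap(t_tokens, h_tokens):
    """f3-style score: common in-order word pairs (arbitrary gaps), over C(|H|, 2)."""
    t_pairs = set(combinations(t_tokens, 2))
    h_pairs = set(combinations(h_tokens, 2))
    denom = len(h_tokens) * (len(h_tokens) - 1) / 2
    return len(t_pairs & h_pairs) / denom if denom else 0.0

T = "three leaders of a tourism event were charged with diverting money".split()
H = "three leaders of a tourism event are suspected of stealing money".split()
print(ngram_overlap(T, H), lcs_length(T, H) / len(H), skip_bigram_overlap(T, H))
```

The actual system computes these counts after TreeTagger tagging and derives separate bi- and tri-skip-gram attributes.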
The distribution of the features according to the information gain measure is shown in Figures 2 and 3.

Figure 2: Information gain of the features in set A.

Figure 3: Information gain of the features in set B.

High information gain values correspond to informative attributes. In a ten-fold cross-validation backward-forward feature selection, f2 and f4 yielded the lowest error rate for set A, while f7 and f14 were selected for set B. On their own, these sets are able to resolve the textual entailment task only to some extent. The feature sets obtain quite similar results, but correctly resolve different examples. The distribution of the attributes into complementary sets is very important for the classifier combination module.

2.2 Training MLEnt

The previously described attributes are studied and examined with several machine learning algorithms. We observed the behavior of several classifiers (e.g. k-NN, decision trees, MaxEnt and SVM) with these feature sets, and found that the best generalization is obtained with SVM. For this reason, only SVM is used in the RTE2 challenge. The training of MLEnt is conducted with the RTE2 development data set. To avoid overfitting, the system is also tested with the RTE1 development and test data sets. Thus, the final system relies on observations from the three data sets.

2.3 Combination methods in MLEnt

According to (Dietterich, 1997), an ensemble of classifiers is potentially more accurate than an individual classifier. This is one of the reasons for building the combination module in MLEnt. The other reason is the design of the complementary feature sets. In order to combine the different feature sets, we first had to test the classifiers for complementarity. Two sets are complementary when they are able to resolve different examples; their combination is therefore beneficial, e.g. resolving more cases. To evaluate the complementarity of the feature sets, the kappa statistic (Cohen, 1960) is used. According to the kappa measure, sets A and B are complementary, and so are their subsets.
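As a concrete illustration of the complementarity test, the following Python sketch computes Cohen's kappa over the binary outputs of two classifiers on the same examples. The prediction vectors are invented for the example and do not come from the paper.

```python
def cohens_kappa(preds_a, preds_b):
    """Cohen's kappa between the binary outputs of two classifiers on the same examples."""
    assert len(preds_a) == len(preds_b)
    n = len(preds_a)
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    # Chance agreement from the marginal label distributions of each classifier.
    p_a1, p_b1 = sum(preds_a) / n, sum(preds_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Invented outputs of the set-A and set-B classifiers on ten development pairs.
set_a_out = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
set_b_out = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]
print(round(cohens_kappa(set_a_out, set_b_out), 3))  # low kappa -> complementary outputs
```

A kappa value well below 1 means the two classifiers agree little beyond chance, i.e. they tend to resolve different examples, which is the property MLEnt requires before combining the feature sets.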
Next we present the stacking and voting techniques with which the classifier combination is realized. A problem we face is determining which classifier gives the most precise prediction for a given example. To obtain such a confidence measure, we use the distance between the predicted value of the example and the training prototypes.

For our system, stacking is applied in the following way. First, several base-level classifiers are generated. Second, the outputs of the base-level classifiers, together with the predicted classes, are included as features in one of the base-level classifiers. Thus a new, so-called meta-classifier is generated. The outputs of the meta-classifier are considered the final results of the system.

The other combination method in MLEnt is voting. Several classifiers are generated and their outputs are compared. The final decision about the class assignment is made according to the class with the majority of votes. Examples for which the classifiers disagree acquire the class of the classifier with the highest performance.

For the final configuration of MLEnt, we had to establish which classifiers to combine. We conducted several experiments with the RTE1 data sets. For stacking, including the outputs of all four classifiers (sets A and B and their subsets) performs better than taking only two or three of them. The meta-classifier is created with the features of the base-level classifier B and the outputs of all four base-level classifiers. For voting, the best combination is obtained by merging the two subsets.

3 Results and analysis

The obtained results for the RTE2 test data set are influenced by the development of MLEnt with the RTE2 development set. Therefore, we first show the results on the development set in Table 1. The table lists the performance of the four feature sets together with the stacking and voting runs (run 1 corresponds to the output of stacking and run 2 to the output of voting). As a baseline system, we used an SVM classification in which only the first feature from Subsection 2.1 is considered; this feature indicates the ratio of the words in T that are also present in H. All development and test runs outperformed the baseline.

runs       Acc.    IE      IR      QA      SUM
set A      56.87   49.50   55.50   51.00   71.50
set B      60.12   54.00   61.00   59.00   66.50
sub A      57.75   49.50   57.00   51.50   73.00
sub B      57.13   54.50   61.00   49.50   63.50
run 1      68.63   58.50   68.00   70.50   77.50
run 2      62.12   54.50   61.50   62.00   70.50
baseline   50.50   51.00   50.00   48.50   52.50

Table 1: MLEnt results with the RTE2 development set

The performance of the sets is quite similar in accuracy; however, set A covers the majority of the text summarization examples, while set B identifies text summarization and information retrieval examples. The stacking and voting combinations lead to improvement not only in accuracy, but also for the particular tasks, especially question answering. Additionally, the same experiments were conducted with the RTE1 data sets, where we saw that the accuracy results for the test data set are similar to those in Table 1, but for the RTE1 development set the voting scheme performed better than stacking. Considering these performances, the final RTE2 submissions were made. Table 2 shows the results for the submitted RTE2 runs, denoted run 1 and run 2, together with the scores for each feature set.
runs       Acc.    IE      IR      QA      SUM
set A      51.75   52.00   53.50   55.50   46.00
set B      54.25   50.00   55.50   47.50   64.00
sub A      56.75   46.00   55.50   56.50   69.00
sub B      52.38   49.00   51.50   52.00   57.00
run 1      54.87   54.00   56.00   50.00   59.50
run 2      55.00   49.50   54.00   54.50   62.00
baseline   50.50   51.00   50.00   48.50   52.50

Table 2: MLEnt results with the RTE2 test set

The subset of A performs better than each of the full sets and even their combinations. Table 1 shows that during training this set performed quite similarly to the others; however, on the RTE2 test data set it obtained 5% better coverage. The stacking run reached 54.87%. It could not boost the performance of the MLEnt system, because the majority of the sets performed similarly. In addition, noise accumulated in the stacking scheme, which led to degradation. The same happened with the voting run: its performance of 55% is better than stacking, but is still insufficient and lower than the performance of subset A.

With the RTE2 test set, MLEnt is able to correctly identify examples such as:

T: Mangla was summoned after Madhumita's sister Nidhi Shukla, who was the first witness in the case.
H: Shukla is related to Mangla.

The relation is correctly identified as negative, because the number of common words between the two sentences is low. Respectively, the pair below is correctly identified as positive, because the skip-gram and LCS attributes have high values, indicating plenty of common words between the text and the hypothesis.

T: Three leaders of a premier Johnstown tourism event were charged with illegally diverting money to their organization, state police said Wednesday, sending shock waves through the region.
H: Three leaders of a premier Johnstown tourism event are suspected of stealing money.

Despite the presence of the similarity measures in the other sets, MLEnt was not able to correctly recognize:

T: Growth in the vast U.S. services sector slowed in August although managers seemed more willing to hire new workers, according to an industry survey published Friday.
H: A report showed that growth in the vast service sector slowed in August.

We saw that for the great majority of the cases, the feature vectors of the positive and negative classes were densely concentrated around a single point rather than being scattered. This hampered the generalization ability of the classifier.

4 Conclusions and future work

Our main interest in participating in the Second Textual Entailment challenge was to experiment with resolving textual entailment with a machine learning based system. In this competition, the system works with lexical and semantic features. With the performed experiments we measured the degree of contribution of the developed feature sets and the impact of their combination. Although the MLEnt system was prepared and previously evaluated on three different data sets, achieving accuracy results of around 60% to 65%, the RTE2 test data set produced different results. This performance is influenced by the features that we used and mainly by the high data sparsity problem.

It is interesting to note that a feature set of two attributes (tri-skip-gram and LCS) can correctly resolve 56% of the entailment examples. Of course, 69% of these examples come from text summarization, which is expected, as the attributes are generated from text summarization measures. The accuracy for the IE, IR and QA entailment examples is very low, around 52%. In order to improve their performance, more attributes should be incorporated. For the IE task, the proper name attribute is not sensitive enough; therefore a named entity recognizer is needed.
The performance on IR is influenced by the length of the sentences, as the attributes are normalized according to the number of words in the hypothesis. We believe that to improve MLEnt, several classifiers that resolve the different NLP tasks should be prepared and later joined together.

Acknowledgements

This research has been partially funded by the Spanish Government under project CICyT number TIC2003-07158-C04-01 and PROFIT number FIT340100-2004-14, and by the Valencia Government under project number GV04B-276.

References

D. Yarowsky. 1994. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. Proceedings of the 32nd Annual Meeting of the ACL.

E. F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. Proceedings of CoNLL-2002, Taipei, Taiwan, p. 155–158.

L. Màrquez, L. Padró and H. Rodríguez. 1998. A Machine Learning Approach to POS Tagging.

M. Collins and B. Roark. 2004. Incremental Parsing with the Perceptron Algorithm. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, p. 111–118.

E. Akhmatova. 2005. Textual Entailment Resolution via Atomic Propositions. Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

D. Wu. 2005. Textual Entailment Recognition Based on Inversion Transduction Grammars. Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

J. Herrera, A. Peñas and F. Verdejo. 2005. Textual Entailment Recognition Based on Dependency Analysis and WordNet. Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

V. Jijkoun and M. de Rijke. 2005. Recognizing Textual Entailment Using Lexical Similarity. Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

M. Kouylekov and B. Magnini. 2005. Recognizing Textual Entailment with Edit Distance Algorithms. Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

I. Dagan, O. Glickman and B. Magnini. 2005. The PASCAL Recognising Textual Entailment Challenge. Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

C.-Y. Lin and F. J. Och. 2004. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. Proceedings of ACL-2004, Barcelona, Spain.

W. Daelemans, J. Zavrel, K. van der Sloot and A. van den Bosch. 2003. TiMBL: Tilburg Memory-Based Learner. Tilburg University, ILK 03-10.

R. Collobert and S. Bengio. 2001. SVMTorch: Support Vector Machines for Large-Scale Regression Problems. The Journal of Machine Learning Research, vol. 1, p. 143–160, MIT Press, Cambridge, MA, USA.

A. Suárez and M. Palomar. 2002. A Maximum Entropy-Based Word Sense Disambiguation System. Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), p. 960–966.

H. Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing.

T. Dietterich. 1997. Machine-Learning Research: Four Current Directions. AI Magazine, Winter 1997, p. 97–136.

S. Patwardhan, S. Banerjee and T. Pedersen. 2003. Using Measures of Semantic Relatedness for Word Sense Disambiguation. Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, February.

J. Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement.