Identification of a lexical-syntactical pattern in a Corpus

Fernando Miguel Filipe Gomes
Instituto Superior Técnico, Portugal, [email protected]

Abstract. This article focuses on the validation of a property found in a lexical-syntactical matrix, where a verb takes a human noun both as its subject and as its direct complement (Nhum-V-Nhum). The validation is based on a statistical comparison between results obtained from a large test corpus and the information contained in the matrices. The statistical comparison is done with the aid of GRID computing, together with scheduling and parallel-programming software. All the tools used in this identification work are presented and briefly explained, and a sample of the results obtained is shown. The process used to validate the results is also described. Some relevant conclusions were drawn from the results; they allowed a better understanding of the data present in the matrices and suggested ideas for the future improvement of this endeavor.

1 Introduction

In this paper we deal with the class of Portuguese psychological verbs [1] whose properties were already available in tabular format (matrices). This class is defined by an obligatory human (Nhum) direct object. The subject is a loosely constrained noun, or even a subordinate clause, usually with a causative semantic role:

Esta notícia irritou/alegrou/entristeceu o João. 'The news irritated/cheered/saddened João.'

These verbs often also allow a human noun in the subject position; in this case the Nhum subject can be associated either with an agent or with a causative semantic role:

O Pedro irritou/alegrou/entristeceu o João. 'Pedro irritated/cheered/saddened João.'

It is this latter example that corresponds to the Nhum-V-Nhum property we intend to validate. The validation is divided into two main stages: the first is identifying, in the nodes and dependencies produced by the L2F chain, how human nouns are classified; the second consists of searching the corpus for the Nhum-V-Nhum pattern (made simpler once we know how a Nhum is represented in the data).

2 Tools Used

The tools used in this process are the processing chain of L2F (Laboratório de Sistemas de Língua Falada do Instituto de Engenharia de Sistemas e Computadores - Investigação e Desenvolvimento) [2], the L2F GRID [3], and the Condor [4] and Hadoop [5] systems that work on the Andrew File System (AFS) [6]. The L2F chain is used to process the corpus into an XML file (showing the properties of each node and the dependencies between them), in order to apply the needed lexical, morphological and syntactical information to the input text. The GRID is used to accelerate the processing of the corpus by using several computers at the same time, and the file system provides a stable way to run the corpus through the L2F chain. Condor manages the scheduling of the GRID during corpus processing, by queuing and prioritizing processes, while Hadoop is used to perform jobs on large data (like the output of the processed corpus) while maintaining high-throughput data access.

3 Corpus Processing

Before focusing on developing programs to search for specific information, we must first process the corpus data (CETEMPúblico [7], a corpus of approximately 180 million words of European Portuguese, created by the project Processamento Computacional do Português) in the L2F chain, in order to obtain it as an XML tree, which shows all the relevant aspects in an easy-to-search format. The first step taken towards a solution is the decomposition of the problem into smaller, more manageable ones, simplifying its complexity and allowing its sequential resolution in a step-by-step manner. The first "small" problem detected is the processing of the corpus in a manageable time (less than 24 hours).
This has been solved by dividing the corpus into several files processed concurrently on the GRID, using Condor and the AFS file system. The process itself begins with setting up a Condor environment, in order to process the information on the GRID effectively. Next we decided to process several large files (6,000 sentences ≈ 800 KB) using the GRID; since we encountered memory issues, the file size was reduced in each test until we found the largest size that caused no problem: 2,000 sentences ≈ 280 KB. The processing of 200 of these files on the GRID (10 parallel processings) allowed us to process 60 MB of information in under 21 minutes. Simple arithmetic allows us to conclude that, given the corpus size, its processing will be concluded well within the 24-hour limit set as this task's goal (this limit was set because the processing chain is constantly being upgraded, so reprocessing the corpus is a task that may have to be done several times, and therefore must not be too time-consuming).

The division and the Condor/Hadoop commands were run through a series of scripts developed specifically for this purpose, in order to automate the process for subsequent runs. These scripts work by running a set of commands that process each file and place the results in a designated folder in the Hadoop file system. Once placed in the Hadoop file system, the XML results are easily accessed through the Hadoop infrastructure, which allows programs to be run as jar files. As stated above, some of the problems found in this process were:

– discovering the correct size for each file,
– encoding problems,
– automating the process (script creation).

Each of these problems occurred at a specific point and had a particular solution. The size discovery was a trial-and-error endeavor, with several sizes being tried (starting at 6,000 sentences / 800 KB) until we found the 2,000 sentences / 280 KB value. The encoding problem occurred because the files were in an ISO encoding that did not correctly render certain characters, such as accented letters; this was resolved with a script that applied a conversion command (iconv) to all input files (a sketch of these two steps is given at the end of this section). The final problem was the automation of the whole process, which was solved by creating a series of scripts to run the commands:

– condor-submit.sh: submits the jobs and places the results in the Hadoop file system,
– run-xip.sh: runs the xip-runner jar on an input file,
– run-xip.condor: sets the parameters for running Condor (such as which machines are used),
– xip-runner.jar: a jar application that runs the L2F chain.

Basically, running condor-submit.sh sets the process in motion, since it calls run-xip.sh for each input file and run-xip.condor to specify the Condor parameters.
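To make the division and re-encoding steps concrete, the following is a minimal sketch in Java (the language of the xip-runner application). It assumes one sentence per line in the source file, an ISO-8859-1 source encoding, and an illustrative chunk-naming scheme; the actual work was done by the shell scripts listed above.

import java.io.*;
import java.nio.charset.StandardCharsets;

// Splits a corpus file into chunks of 2,000 sentences (the largest size that
// caused no memory problems in the chain) and converts them from the original
// ISO encoding to UTF-8, mirroring what the iconv command did.
public class CorpusSplitter {
    private static final int SENTENCES_PER_FILE = 2000;

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "ISO-8859-1"))) {
            int chunk = 0, count = 0;
            PrintWriter out = newChunk(chunk++);
            String sentence;
            while ((sentence = in.readLine()) != null) {
                out.println(sentence);
                if (++count == SENTENCES_PER_FILE) { // start a new 2,000-sentence file
                    out.close();
                    out = newChunk(chunk++);
                    count = 0;
                }
            }
            out.close();
        }
    }

    private static PrintWriter newChunk(int index) throws IOException {
        // UTF-8 output avoids the accented-character problems of the ISO files
        return new PrintWriter(new OutputStreamWriter(
                new FileOutputStream("chunk-" + index + ".txt"), StandardCharsets.UTF_8));
    }
}

Each resulting chunk can then be submitted as an independent Condor job, which is what makes the 10-way parallel processing described above possible.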
4 Human Noun Identification

To discover whether properties such as the presence of human nouns (Nhum) are allowed in a given syntactical position for a list of verbs, human nouns must first be identified. This section shows the process of identifying human nouns in the XML trees resulting from running the CETEMPúblico corpus through the processing chain, by identifying the various classes of words that are considered human nouns and, more specifically, the XML tags that represent them.

The categories of nouns considered to be Nhum were personal pronouns, named entities (NE) and other nouns that convey information such as professions, proper names, affiliation status, organizations, nationalities, titles and family ties. The attributes obtained from the processing chain (PROFESSION, PEOPLE, MEMBER, AFFILIATION, ORG, NATIONALITY, RELATIVE, HUMAN and TITLE) were used to identify the Nhums.

Personal pronouns (such as eu, tu, ele, and so on) were also classified as Nhums and are identified by the PERS attribute. They are considered Nhum because they can replace proper nouns in a sentence: for instance, the sentence O Rui comeu a sopa 'Rui ate the soup' can be reduced to Ele comeu a sopa 'He ate the soup', where the personal pronoun ele substitutes the human noun Rui. First and second person pronouns always represent Nhum. Notice that third person pronouns can also replace non-human nouns (O furacão arrasou a cidade 'The hurricane wrecked the city' is analogous to Ele arrasou a cidade 'It wrecked the city'), but they were nonetheless treated as Nhum to simplify the identification process.

Named entities were also considered valid candidates for human nouns. The NE type depends not only on the string of words it contains but also on its syntactic context. For instance, Instituto Superior Técnico is a named entity; it is an organization (ORG) in the context of the sentence Sou estudante do Instituto Superior Técnico 'I'm a student at Instituto Superior Técnico', and a location in the context of the sentence A reunião decorreu no Instituto Superior Técnico 'The meeting took place at Instituto Superior Técnico'. It should be noted that the named entities that interest us are those related to organizations and persons, since only these represent entities that are considered Nhum.

Other classes of nouns besides named entities have also been considered human nouns. This is the case of nouns designating professionals (e.g. carpinteiro 'carpenter'), since they can occupy the syntactic slot otherwise filled by a proper name: O carpinteiro comeu a sopa 'The carpenter ate the soup'.
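To make the tag check concrete, here is a minimal sketch in Java of the test applied to each candidate word. It assumes, as a simplification of the actual chain output, that each of the tags above appears as an attribute on the word's XML node.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

public class NhumChecker {
    // The attributes listed above that mark a node as a human noun,
    // plus PERS for personal pronouns.
    private static final Set<String> HUMAN_TAGS = new HashSet<>(Arrays.asList(
            "PROFESSION", "PEOPLE", "MEMBER", "AFFILIATION", "ORG",
            "NATIONALITY", "RELATIVE", "HUMAN", "TITLE", "PERS"));

    // True if any attribute of the word node is one of the Nhum-marking tags.
    public static boolean isNhum(Element word) {
        NamedNodeMap attrs = word.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            if (HUMAN_TAGS.contains(attrs.item(i).getNodeName().toUpperCase())) {
                return true;
            }
        }
        return false;
    }
}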
5 The Nhum-V-Nhum pattern

5.1 Strategy

This study has two main objectives: first, the identification of nouns as human nouns when that information is not yet available for those particular lexical items; second, the validation of the data presented in the lexical matrices. The results of this process can then be used to feed the processing chain with those new words, which had not previously been classified as human nouns. This will have a feedback effect and is expected to improve the results of the following iterations (an example of this is the HUMAN tag that was added to the processing chain to mark human nouns identified in the first iteration of the process). The basic structure of these verbs is N0 V N1, where N1 is obligatorily a human noun and V a psychological verb. Therefore, sentences with a psychological verb and an explicit direct object are relevant for the determination of human nouns.

5.2 Implementation

The program identifies sentences that have a psychological verb as their main verb, retrieves the verb's subject, and classifies it as Nhum or non-Nhum according to the presence or absence of the tags referred to in the strategy section. Only if the verb also presents a direct object, and this object is classifiable as a Nhum, is the result retrieved. The program works in the following way (a sketch of the map step is given after this list):

– First, the list of psychological verbs is read from a file into a hashmap,
– Then the program runs over the XML trees and retrieves the VMAIN tags, verifying whether the main verb of each sentence is a psychological verb,
– For sentences with psychological verbs, the SUBJ tag is taken and the lemma of its head is extracted,
– Should the word be a Nhum, it is marked with Nhum+, and with Nhum- otherwise,
– Only if a direct object (CDIR) exists and is considered a human noun is the information sent to the REDUCER,
– The respective counters (Nhum+ or Nhum-) are incremented for the given verb.
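A minimal sketch of this map step, written in Java against the Hadoop MapReduce API, follows. The helpers mainVerbLemma, subjectHead and directObjectHead, as well as the Word holder class, are hypothetical stand-ins for the actual traversal of the VMAIN, SUBJ and CDIR nodes in the XML trees.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NhumVNhumMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Set<String> psychVerbs = new HashSet<>(); // filled from the verb-list file (loading omitted)

    @Override
    protected void map(Object key, Text sentenceXml, Context context)
            throws IOException, InterruptedException {
        String verb = mainVerbLemma(sentenceXml);       // lemma under the VMAIN tag
        if (verb == null || !psychVerbs.contains(verb)) return;

        Word subject = subjectHead(sentenceXml);        // head of the SUBJ node
        Word object = directObjectHead(sentenceXml);    // head of the CDIR node
        // Only sentences matching N0 V N1 with a human-noun direct object are emitted.
        if (subject == null || object == null || !object.isNhum) return;

        String mark = subject.isNhum ? "NHUM+" : "NHUM-";
        context.write(new Text("VERBO: " + verb), ONE);
        context.write(new Text("VERBO: " + verb + " " + mark + ":" + subject.lemma), ONE);
    }

    // Hypothetical stand-ins for the real XML-tree traversal code.
    private static class Word { String lemma; boolean isNhum; }
    private String mainVerbLemma(Text xml) { return null; }
    private Word subjectHead(Text xml) { return null; }
    private Word directObjectHead(Text xml) { return null; }
}

The reducer then only has to sum the counts for each emitted key, giving exactly the VERBO lines shown in the example below.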
After running the main program in the map-reduce paradigm, the results are fed into a program that computes the percentage of Nhum subjects for each psychological verb: it divides the number of Nhum+ subjects of a verb by the total number of its subjects (Nhum+ plus Nhum-) and multiplies the result by 100. For the test sentences A notícia abalou o Rui 'The news shook Rui' and O cliente enerva o sapateiro 'The client annoys the shoemaker', the program follows these steps:

– The verbs abalar and enervar are the main verbs of their sentences and are present in the psychological verbs list, so the process continues to run for both sentences,
– The words notícia and cliente are found to be the subjects of the above-mentioned verbs and are marked accordingly (in this case cliente is marked as a human noun, while notícia as a non-human noun),
– The sentences are found to have direct objects (CDIR) in relation to the verb; these are tested to see if they are human nouns (in this case both sapateiro and Rui are human nouns),
– All results that meet the pattern N0 V N1, with N1 a human noun and V a psychological verb, are sent to the REDUCER (each sentence sends two results: one with the verb alone, the other with the verb plus the subject and its classification).

The results obtained from running the program on these two sentences are:

– VERBO: abalar 1
– VERBO: abalar NHUM-:notícia 1
– VERBO: enervar 1
– VERBO: enervar NHUM+:cliente 1

6 Evaluation

The evaluation of human noun recognition involves both program verification and result validation: the first to discover program errors, the second to discover inconsistencies between manual and automatic human noun identification that do not stem from the program. These inconsistencies happen because the chain does not identify all human nouns (be they named entities or others, such as certain professions). For the first task, a list of one hundred (100) sentences was randomly chosen and processed with the L2F chain; the resulting output was manually scanned for the patterns it contained and was afterwards used to obtain the automatic program results. The results showed 4 sentences with the pattern:

– Rui marcaria Eunice tanto artística como pessoalmente. 'Rui would mark Eunice both artistically and personally.' (for the verb marcar 'to mark')
– Jogadores alugados preocupam a FIFA. 'Loaned players worry FIFA.' (for the verb preocupar 'to worry')
– As mulheres vão ultrapassar os homens? 'Will women overtake men?' (for the verb ultrapassar 'to overtake')
– O brasileiro ainda ultrapassou o alemão [...] 'The Brazilian still overtook the German [...]' (also for the verb ultrapassar 'to overtake')

Since these results match those obtained from the manual check, the only thing left to test was the validity of the results. To validate the results, a list of 10 verbs was manually checked. These verbs had fewer than 100 N-V-Nhum instances each, because of the difficulty of manually classifying a larger quantity of data. The results of these checks are presented in Table 1; for the most part they coincide with the results found in the manual search, yet some show discrepancies in the 5-10% range. These are mostly caused by verbs that occur with a large quantity of proper nouns (especially people's names) that the system does not identify as such.

Table 1. Results with manual checks

Verb          No of N-V-Nhum  No of Nhum-V-Nhum  % of Nhum-V-Nhum  % after manual check
abater              99               79               79.80%            90.91%
favorecer           99               16               16.16%            22.22%
impor               69               33               47.83%            56.52%
impressionar        71               15               21.13%            25.35%
influenciar         75               28               37.33%            38.67%
inspirar            69               16               23.19%            28.98%
orientar            64               40               62.50%            67.19%
reduzir             85               37               43.53%            51.76%
satisfazer          87                7                8.05%            10.34%
seduzir             73               22               30.14%            36.98%

7 Results

In this section we show the major results found and some conclusions that can be drawn from them. The first data recorded were the number of times the N-V-Nhum pattern was found for each verb, followed by the number of times the same pattern occurred with a human noun in the subject position as well. Verbs with no cases of the Nhum-V-Nhum pattern are thus marked as not having the property. Table 2 shows the results for a group of verbs, where "No of occurrences" counts the times the verb appears as a main verb in the corpus.

Table 2. Noun-Verb-Noun results

Verb        No of occurrences  No of N-V-Nhum  % of N-V-Nhum  No of Nhum-V-Nhum  % of Nhum-V-Nhum
abalar             794               67            8.44%             20               29.85%
abater            1035               99            9.57%             79               79.80%
aborrecer          303               11            3.63%              2               18.18%
abrandar           512                9            1.76%              1               11.11%
absorver          1003               33            3.29%             19               57.58%
acalmar            403               28            6.95%              9               32.14%
admirar           1883               30            1.59%             22               73.33%
afectar           3541              429           12.12%             43               10.02%
afligir            178               14            7.87%              0                0.00%
agastar             46                0            0.00%              0                0.00%
agitar            1367              137           10.02%             19               13.87%
agoniar             24                0            0.00%              0                0.00%
alegrar            253               10            3.95%              1               10.00%
aliciar            163               38           23.31%             29               76.32%
alienar            366               10            2.73%              6               60.00%
alucinar            23                0            0.00%              0                0.00%
amaciar             22                0            0.00%              0                0.00%
amargurar           27                0            0.00%              0                0.00%
amedrontar          30                5           16.67%              1               20.00%
amenizar            46                0            0.00%              0                0.00%

After an analysis of these results, we can reach a number of conclusions. First, verbs like amargurar 'to embitter' and agoniar 'to agonize' do not have human nouns in their subjects, since they do not present a single example of the pattern Noun-Verb-Human Noun. Other verbs, like afligir 'to afflict', have the pattern but do not have human nouns in the subject position. On the other side, we have verbs like abater 'to bring down': of the 99 patterns found for it, 79 have human nouns in their subjects.

Table 3 lists the same verbs with the matrix classification and the corpus-based classification. The classification based on the corpus was obtained in the following fashion: all results with zero occurrences of the Nhum-V-Nhum pattern are marked as (-), and all other results are marked as (+).

Table 3. Noun-Verb-Noun matrix results

Verb         % of Nhum-V-Nhum  Matrix Value  Corpus Value
abalar             29.85%                         +
abater             79.80%                         +
aborrecer          18.18%                         +
abrandar           11.11%                         +
absorver           57.58%                         +
acalmar            32.14%                         +
admirar            73.33%          -              +
afectar            10.02%                         +
afligir             0.00%          +              -
agastar             0.00%                         -
agitar             13.87%                         +
agoniar             0.00%                         -
alegrar            10.00%                         +
aliciar            76.32%          +              +
alienar            60.00%                         +
alucinar            0.00%          +              -
amaciar             0.00%                         -
amargurar           0.00%                         -
amedrontar         20.00%                         +
amenizar            0.00%                         -
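Both post-processing rules used in this comparison are simple enough to state as code; here is a sketch in Java, with illustrative names:

public class VerbStats {
    // Percentage of Nhum subjects among all subjects counted for a verb
    // (Section 5.2): 100 * Nhum+ / (Nhum+ + Nhum-).
    static double nhumPercentage(int nhumPlus, int nhumMinus) {
        int total = nhumPlus + nhumMinus;
        return total == 0 ? 0.0 : 100.0 * nhumPlus / total;
    }

    // Corpus-based value: (+) iff the Nhum-V-Nhum pattern occurred at least
    // once for the verb, (-) otherwise.
    static String corpusValue(int nhumVNhumCount) {
        return nhumVNhumCount > 0 ? "+" : "-";
    }
}

For abalar in Table 2, for instance, nhumPercentage(20, 47) yields 100 × 20/67 ≈ 29.85%, and corpusValue(20) yields (+).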
This led to some interesting results, as some values of both (-) and (+) conflict with those presented in the matrices. From these tables we find that some of the results obtained differ significantly from those in the matrices. The verb afligir 'to afflict' (classified as (+)), for instance, appears 178 times as a main verb in the corpus, but only on 14 occasions does it present the Noun-Verb-Human Noun pattern, and in none of them is the subject noun a human noun. On the other hand, some verbs (like admirar 'to admire') present several cases where the property is found, despite being classified as (-) in the matrix. This is the case of sentences like Chen Xinji admira Mao Tsé-Tung 'Chen Xinji admires Mao Tsé-Tung', extracted from the corpus. This discrepancy is due to the fact that these examples do not constitute a psychological construction of the verb but another lexical-syntactical entry, with a different meaning (equivalent to ter admiração por 'to have admiration for'). The psychological construction would correspond to sentences like Que o Pedro tenha feito isso admira-me imenso 'It surprises me a lot that Pedro has done such a thing'.

Still, verbs like alucinar 'to hallucinate' do not present the pattern that was expected, while (+)-marked verbs like aliciar 'to entice' have a large number of examples where the pattern is present, as in the sentences O BCP aliciava as empresas com novos serviços 'BCP enticed companies with new services' and Liberato e Isaltino aliciam candidatos 'Liberato and Isaltino entice candidates'. Overall, we found that 220 verbs were correctly tagged according to the matrices, while 150 showed results that contradicted the data in the matrices. This can also be due to the various meanings a verb can have, since some psychological verbs are known to have non-psychological uses.

8 Conclusion

This article shows the process of retrieving a lexical property from a corpus, as well as a comparison between these results and those achieved, by introspective means, by the linguists who built the matrices. The whole process was described, starting with the processing of the corpus and the description of the tools used. The results allow us to validate the data presented in the matrices, as well as to improve the processing chain by finding words that had no significant tags to describe them, or even incorrect tags (for this purpose the tag HUMAN was created in the chain, to describe human nouns that did not fit other categories, like mulher 'woman' or criança 'child'). The results thus obtained allow us to better direct our efforts of syntactical classification and description of the information developed by linguists. Further improvements could be made, such as:

– extending the search for this pattern to other verbs besides psychological ones,
– increasing the relevance of the processed data by using an even larger corpus,
– modifying the program to catch patterns that are only slightly different from this one.

References

1. Maria Oliveira. Syntaxe des Verbes Psychologiques du Portugais. PhD thesis, Centro de Linguística da Universidade de Lisboa, 1984.
2. Nuno Mamede.
A cadeia de processamento XIP em Maio de 2007, 2007.
3. Ian Foster and Carl Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, San Francisco, CA, USA, 1998.
4. Condor Team. Condor Version 7.0.5 Manual. University of Wisconsin-Madison, November 2008.
5. Hadoop Team. Hadoop, 2008. http://hadoop.apache.org/core/.
6. John H. Howard. An overview of the Andrew File System. In USENIX Winter Technical Conference, 1988.
7. Diana Santos and Paulo Rocha. Evaluating CETEMPúblico, a free resource for Portuguese. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 442-449, Morristown, NJ, USA, 2001. Association for Computational Linguistics.