Identification of a Lexical-Syntactical Pattern in a Corpus
Fernando Miguel Filipe Gomes
Instituto Superior Técnico, Portugal,
[email protected]
Abstract. This article focuses on the validation of a property found in a
lexical-syntactical matrix: whether a verb takes a human noun both as its
subject and as its direct complement (Nhum-V-Nhum). The validation is based
on a statistical comparison between results obtained from a large test corpus
and the information contained in the matrices. The comparison is carried out
with the aid of GRID computing, together with scheduling and parallel
programming software.
All the tools used in this identification work are presented and briefly
explained, a sample of the results obtained is shown, and the process used to
validate the results is described.
Some relevant conclusions were drawn from the results; they allowed a better
understanding of the data present in the matrices and suggested ideas for the
future improvement of this endeavor.
1 Introduction
In this paper we deal with the class of Portuguese psychological verbs [1]
whose properties were already available in tabular format (matrices). This
class is defined by an obligatory human (Nhum) direct object. The subject is
a loosely constrained noun, or even a subordinate clause, usually with a
causative semantic role:
Esta notícia irritou/alegrou/entristeceu o João.
’The news irritated/cheered/saddened João.’
These verbs often also allow a human noun in the subject position; in this
case the Nhum subject can be associated with either an agent or a causative
semantic role:
O Pedro irritou/alegrou/entristeceu o João.
’Pedro irritated/cheered/saddened João.’
It is this latter example that corresponds to the Nhum-V-Nhum property we
intend to validate.
The validation step is divided into two main stages: the first is identifying,
in the nodes and dependencies produced by the L2F chain, how human nouns are
classified; the second consists of searching the corpus for the Nhum-V-Nhum
pattern (made simpler once we know how a Nhum is represented in the data).
2 Tools Used
The tools used in this process are the L2F processing chain [2] (L2F: the
Laboratório de Sistemas de Língua Falada do Instituto de Engenharia de
Sistemas e Computadores - Investigação e Desenvolvimento), the L2F GRID [3],
and the Condor [4] and Hadoop [5] systems, which work on top of the AFS
(Andrew File System) [6]. The L2F chain is used to process the corpus into an
XML file (showing the properties of each node and the dependencies between
them), applying the needed lexical, morphological and syntactical information
to the input text. The GRID is used to accelerate the processing of the
corpus by using several computers at the same time, and the file system
provides a stable way to run the corpus through the L2F chain. Condor manages
the scheduling of the GRID during corpus processing, by queuing and
prioritizing processes, while Hadoop is used to perform jobs on large data
(like the output of the processed corpus) while maintaining high-throughput
data access.
3 Corpus Processing
Before focusing on developing programs that search for specific information,
we must first run the corpus data (CETEMPúblico [7], a corpus of
approximately 180 million words of European Portuguese created by the
Processamento Computacional do Português project) through the L2F chain in
order to obtain it as an XML tree, which presents all the relevant aspects in
an easy-to-search format.
The first step taken towards a solution is the decomposition of the problem
into smaller, more manageable ones, reducing its complexity and allowing a
sequential, step-by-step resolution. The first “small” problem identified was
processing the corpus in a manageable time (less than 24 hours). This was
solved by dividing the corpus into several files processed concurrently on
the GRID, using Condor and the AFS file system.
The process itself begins with the setting up of a Condor environment, in
order to process the information on the GRID effectively. We initially tried
to process several large files (6,000 sentences long, ≈ 800 kB) on the GRID;
since we ran into memory issues, the file size was reduced in each test until
we found the largest size that caused no problems: 2,000 sentences ≈ 280 kB.
Processing 200 of these files on the grid (10 parallel processes) allowed us
to handle 60 MB of information in under 21 minutes. Simple arithmetic allows
us to conclude that, given the corpus size, its processing will finish well
within the 24-hour limit set as this task’s goal (this limit was set because
the processing chain is constantly being upgraded, so reprocessing the corpus
may have to be done several times and therefore must not be too
time-consuming).
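For concreteness, the throughput these figures imply (assuming the same ten
parallel slots are kept for the full run) is roughly

    \[
    \frac{60\ \text{MB}}{21\ \text{min}} \approx 2.9\ \text{MB/min},
    \qquad
    2.9\ \text{MB/min} \times 1440\ \text{min} \approx 4.1\ \text{GB per day,}
    \]

which is consistent with the claim that the full corpus can be reprocessed
well within the 24-hour limit.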
The division and the Condor/Hadoop run commands were issued through a series
of scripts developed specifically for this purpose, in order to automate the
process for subsequent runs.
These scripts work by running a set of commands that launch the processing of
each file and place the results in a designated folder in the Hadoop file
system.
Once the runs finish and the XML results are placed in the Hadoop file
system, they are easily accessed through the Hadoop infrastructure, which
allows programs to be run in jar format.
As stated above, some of the problems found in this process were:
– Discovery of the correct size for each file,
– Encoding problems,
– Automation of the process (script creation).
Each of these problems arose at a specific point and had a particular
solution. The size discovery was a trial-and-error endeavor, with several
sizes being tried (starting at 6,000 sentences / 800 kB) until we found the
2,000 sentences / 280 kB value. The encoding problem occurred because the
files were in an ISO encoding (presumably ISO-8859-1) that did not allow the
correct display of certain characters, such as accented letters; this was
resolved with a script that applied a conversion command (iconv) to all input
files.
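As an illustration, the conversion the iconv-based script performed can be
sketched in Java (a stand-in assuming Latin-1 sources; the original used a
shell script around iconv):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Re-encodes one file from ISO-8859-1 to UTF-8, mirroring what the
    // iconv-based script did for every input file.
    // Usage: java Recode in.txt out.txt
    public class Recode {
        public static void main(String[] args) throws Exception {
            byte[] raw = Files.readAllBytes(Paths.get(args[0]));
            String text = new String(raw, StandardCharsets.ISO_8859_1);
            Files.write(Paths.get(args[1]), text.getBytes(StandardCharsets.UTF_8));
        }
    }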
The final problem was the automation of the whole process, and that was
solved by creating a series of scripts to run the commands:
– condor-submit.sh: submits the results to the Hadoop file system,
– run-xip.sh: runs the xip-runner jar on an input file,
– run-xip.condor: sets the parameters for running Condor (like which machines
are used),
– xip-runner.jar: a jar application that runs the L2F chain.
Basically, running condor-submit.sh sets the process in motion, since it
calls run-xip.sh for each input file and run-xip.condor to specify the Condor
parameters.
4 Human Noun Identification
To discover whether properties such as the presence of human nouns (Nhum) are
allowed in a given syntactical position for a list of verbs, human nouns must
first be identified. This section shows the process of identifying human
nouns in the XML trees resulting from running the CETEMPúblico corpus through
the processing chain, by identifying the various classes of words that are
considered human nouns, and more specifically by identifying the XML tags
that represent them.
The categories of nouns considered to be Nhum were personal pronouns, named
entities (NE) and other nouns that convey information like professions,
proper names, affiliation status, organizations, nationalities, titles and
family ties.
The attributes obtained from the processing chain (PROFESSION, PEOPLE,
MEMBER, AFFILIATION, ORG, NATIONALITY, RELATIVE, HUMAN and TITLE) were used
to identify the Nhums. Personal pronouns (such as eu, tu, ele, and so on)
were also classified as Nhums; they are identified by the PERS attribute.
They are considered Nhum because they can replace proper nouns in a sentence:
for instance, the sentence O Rui comeu a sopa ’Rui ate the soup’ can be
reduced to Ele comeu a sopa ’He ate the soup’, where the personal pronoun ele
substitutes the human noun Rui. First and second person pronouns always
represent Nhum. Notice that third person pronouns can also replace non-human
nouns (O furacão arrasou a cidade ’The hurricane wrecked the city’ is
analogous to Ele arrasou a cidade ’It wrecked the city’), but they were
considered nonetheless, to simplify the identification process.
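A minimal sketch of this check in Java over the chain's XML output (how the
features are encoded on a node, attribute vs. nested tag, is an assumption
here, not taken from the chain's documentation):

    import java.util.Set;
    import org.w3c.dom.Element;

    // Sketch: a noun node counts as Nhum if the chain attached any of the
    // human-indicating features to it. The attribute-style encoding assumed
    // below is illustrative.
    public class NhumDetector {
        private static final Set<String> HUMAN_FEATURES = Set.of(
                "PROFESSION", "PEOPLE", "MEMBER", "AFFILIATION", "ORG",
                "NATIONALITY", "RELATIVE", "HUMAN", "TITLE", "PERS");

        static boolean isHumanNoun(Element node) {
            return HUMAN_FEATURES.stream().anyMatch(node::hasAttribute);
        }
    }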
Named entities were also considered as valid candidates for human nouns.
The NE type depends not only on the string of words it contains but also on
its syntactic context. For instance, Instituto Superior Técnico is a named entity,
since it is an organization (ORG) in the context of the sentence Sou estudante do
Instituto Superior Técnico ’I’m a student at Instituto Superior Técnico’, and a
location in the context of the sentence A reunião decorreu no Instituto
Superior Técnico ’The meeting took place at Instituto Superior Técnico’. It
should be
noted that the named entities that interest us are those related to organizations
and persons, since only these represent entities that are considered Nhum.
Other classes of nouns besides named entities have been considered human
nouns. This is the case of nouns designating professionals (e.g. carpinteiro
’carpenter’), since they can occupy the syntactic slot otherwise filled by a
proper name: O carpinteiro comeu a sopa ’The carpenter ate the soup’.
5 The Nhum-V-Nhum Pattern
5.1 Strategy
This study has two main objectives: first, the identification of nouns as
human nouns when that information is not yet available for those particular
lexical items; secondly, the validation of the data presented in the lexical
matrices. The results of this process can then be used to feed the processing
chain with the new words that had not previously been classified as human
nouns. This has a feedback effect and is expected to improve the results of
the following iterations (an example is the HUMAN tag, which was added to the
processing chain to mark human nouns identified in the first iteration of the
process).
The basic structure of these verbs is N0 V N1, where N1 is obligatorily a
human noun and V a psychological verb. Therefore, sentences with a
psychological verb and an explicit direct object are relevant to the
determination of human nouns.
5.2 Implementation
The program identifies sentences that have a psychological verb as their main
verb, retrieves the verb’s subject, and classifies it as Nhum or non-Nhum
according to the presence or absence of the tags referred to in the strategy
section. Then, only if the verb also presents a direct object, and that
direct object is classifiable as a Nhum, is the result retrieved.
The program itself works in the following way:
– First, the list of psychological verbs is read from a file into a hashmap,
– Then the program runs over the XML trees and retrieves the VMAIN tags,
verifying whether the main verb of each sentence is a psychological verb,
– For the sentences with psychological verbs, the SUBJ tag is taken and the
lemma of its head is extracted,
– Should that word be a Nhum, it is marked Nhum+, and Nhum- otherwise,
– Then, only if a direct object (CDIR) exists and is considered a human noun,
is the information sent to the REDUCER,
– The respective counter (Nhum+ or Nhum-) is incremented for the given verb.
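A minimal sketch of the map step in Java (Hadoop MapReduce) is given below.
For brevity it assumes the XML stage has already been flattened into
tab-separated lines (verb, subject lemma, subject-is-human,
direct-object-is-human); the actual program walks the XIP XML trees directly,
and the inlined verb list stands in for the hashmap fed from a file:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class NhumPatternMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        // In the real program this list is read from a file into a hashmap.
        private static final Set<String> PSYCH_VERBS =
                new HashSet<>(Arrays.asList("abalar", "enervar", "irritar"));

        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed input: verb TAB subjLemma TAB subjHuman TAB cdirHuman
            String[] f = line.toString().split("\t");
            if (f.length != 4 || !PSYCH_VERBS.contains(f[0])) return;
            if (!Boolean.parseBoolean(f[3])) return; // CDIR must be a Nhum
            ctx.write(new Text("VERBO: " + f[0]), ONE);
            String tag = Boolean.parseBoolean(f[2]) ? "NHUM+" : "NHUM-";
            ctx.write(new Text("VERBO: " + f[0] + " " + tag + ":" + f[1]), ONE);
        }
    }

A standard summing reducer (as in the usual word-count example) then produces
the per-key counts shown in the sample output below.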
After running the main program in the map-reduce paradigm, the results are
fed into a program that computes the percentage of Nhums for each
psychological verb: it divides the number of Nhum+ subjects of a verb by the
total number of its subjects (Nhum+ plus Nhum-) and multiplies the result
by 100.
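The computation itself is straightforward (names here are illustrative):

    // Percentage of human-noun subjects for one verb, from the summed counters.
    static double nhumPercentage(long nhumPlus, long nhumMinus) {
        long total = nhumPlus + nhumMinus;
        return total == 0 ? 0.0 : 100.0 * nhumPlus / total;
    }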
For the test sentences A notícia abalou o Rui ’The news shook Rui’ and O
cliente enerva o sapateiro ’The client annoys the shoemaker’, we follow these
steps:
– The verbs abalar and enervar are identified as the main verbs of their
sentences and are present in the psychological verbs list, so the process
continues for both sentences,
– The words notícia and cliente are found to be the subjects of the
above-mentioned verbs and are marked accordingly (in this case cliente is
marked as a human noun, while notícia is marked as a non-human noun),
– The sentences are then found to have direct objects (CDIR) in relation to
the verb; these are tested to see if they are human nouns (in this case both
sapateiro and Rui are human nouns),
– Then all results that match the pattern N0 V N1, with N1 being a human noun
and V a psychological verb, are sent to the REDUCER (each sentence sends two
results: one with the verb alone, the other with the verb plus the subject
and its classification).
The results obtained from running the program on both these sentences are:
– VERBO: abalar 1
– VERBO: abalar NHUM-:notícia 1
– VERBO: enervar 1
– VERBO: enervar NHUM+:cliente 1
6 Evaluation
The process of human noun recognition involves both program verification and
result validation: the first to discover program errors, the second to
discover inconsistencies between manual and automatic human noun
identification that do not stem from the program. These inconsistencies
happen because the chain does not identify all human nouns (be they named
entities or others, such as certain professions).
For the first problem, a list of one hundred (100) sentences was randomly
chosen and processed with the L2F chain; the resulting output was manually
scanned for the patterns it contained and afterwards used to obtain the
automatic program results.
The results showed 4 sentences with the pattern:
– Rui marcaria Eunice tanto artística como pessoalmente. ’Rui would mark
Eunice both artistically and personally.’ (for the verb marcar ’to mark’)
– Jogadores alugados preocupam a FIFA. ’Loaned players worry FIFA.’ (for
the verb preocupar ’to worry’)
– As mulheres vão ultrapassar os homens? ’Will women overtake men?’ (for the
verb ultrapassar ’to overtake’)
– O brasileiro ainda ultrapassou o alemão [...] ’The Brazilian still overtook
the German [...]’ (also for the verb ultrapassar ’to overtake’)
Since these results match those obtained from the manual check, the only
thing left to test was the validity of the results.
To validate the results, a list of 10 verbs was manually checked. These verbs
were chosen with fewer than 100 N-V-Nhum instances each, given the difficulty
of manually classifying a larger quantity of data.
The results of these checks are presented in Table 1 and, for the most part,
coincide with the results found in the manual search, yet some show
discrepancies in the 5-10% range. These are mostly caused by verbs that take
a large quantity of proper nouns (especially people’s names) that the system
does not identify as such.
7 Results
In this section we show the major results found and some conclusions that can
be drawn from them. The first figure recorded was the number of times the
N-V-Nhum pattern was found for each verb, followed by the number of times the
same pattern occurs with a human noun in the subject position as well. Verbs
with no occurrences of the Nhum-V-Nhum pattern are therefore marked as not
having the property. Table 2 shows the results for a group of verbs.
After an analysis of these results, we can reach a number of conclusions.
First, verbs like amargurar ’to embitter’ and agoniar ’to agonize’ do not
take human nouns as their subjects, since they do not show a single example
of the pattern Noun-Verb-Human noun. Other verbs, like afligir ’to afflict’,
do show the pattern, but never with a human noun in the subject position. At
the other extreme, there are verbs like abater ’to bring down’: of the 99
patterns found for it, 79 have a human noun in the subject.
Table 1. Results with Manual Checks

Verb         | No of N-V-Nhum | No of Nhum-V-Nhum | % of Nhum-V-Nhum | % after manual check
abater       |  99            |  79               | 79,80%           | 90,91%
favorecer    |  99            |  16               | 16,16%           | 22,22%
impor        |  69            |  33               | 47,83%           | 56,52%
impressionar |  71            |  15               | 21,13%           | 25,35%
influenciar  |  75            |  28               | 37,33%           | 38,67%
inspirar     |  69            |  16               | 23,19%           | 28,98%
orientar     |  64            |  40               | 62,50%           | 67,19%
reduzir      |  85            |  37               | 43,53%           | 51,76%
satisfazer   |  87            |   7               |  8,05%           | 10,34%
seduzir      |  73            |  22               | 30,14%           | 36,98%
Next, Table 3 lists the verbs with both the matrix classification and the
corpus-based classification. The corpus-based classification was obtained as
follows: all verbs with zero occurrences of the Nhum-V-Nhum pattern are
marked (-), and all other verbs are marked (+). This led to some interesting
results, as some values, both (-) and (+), conflict with those presented in
the matrices.
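As a sketch, the classification rule is simply (illustrative name):

    // Corpus-based value: any verb with at least one Nhum-V-Nhum occurrence
    // is marked as having the property.
    static String corpusValue(int nhumVNhumCount) {
        return nhumVNhumCount > 0 ? "+" : "-";
    }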
From these tables we find that some of the results obtained differ
significantly from those presented in the matrices. The verb afligir ’to
afflict’ (classified as (+)), for instance, appears 178 times as a main verb
in the corpus, but only on 14 occasions does it present the Noun-Verb-Human
Noun pattern, and in none of them is the subject noun a human noun.
On the other hand, some verbs (like admirar ’to admire’) present several
cases where the property is found, despite being classified as (-) in the
matrix. This is the case of sentences like Chen Xinji admira Mao Tsé-Tung
’Chen Xinji admires Mao Tsé-Tung’, extracted from the corpus. This
discrepancy is due to the fact that these examples do not constitute a
psychological construction of the verb but another lexical-syntactical entry,
with a different meaning (equivalent to ter admiração por ’to have admiration
for’). The psychological construction would correspond to sentences like Que
o Pedro tenha feito isso admira-me imenso ’It surprises me a lot that Pedro
has done such a thing’.
Table 2. Noun-Verb-Noun Results

Verb       | No of occurrences | No of N-V-Nhum | % of N-V-Nhum | No of Nhum-V-Nhum | % of Nhum-V-Nhum
abalar     |  794              |  67            |  8,44%        | 20                | 29,85%
abater     | 1035              |  99            |  9,57%        | 79                | 79,80%
aborrecer  |  303              |  11            |  3,63%        |  2                | 18,18%
abrandar   |  512              |   9            |  1,76%        |  1                | 11,11%
absorver   | 1003              |  33            |  3,29%        | 19                | 57,58%
acalmar    |  403              |  28            |  6,95%        |  9                | 32,14%
admirar    | 1883              |  30            |  1,59%        | 22                | 73,33%
afectar    | 3541              | 429            | 12,12%        | 43                | 10,02%
afligir    |  178              |  14            |  7,87%        |  0                |  0,00%
agastar    |   46              |   0            |  0,00%        |  0                |  0,00%
agitar     | 1367              | 137            | 10,02%        | 19                | 13,87%
agoniar    |   24              |   0            |  0,00%        |  0                |  0,00%
alegrar    |  253              |  10            |  3,95%        |  1                | 10,00%
aliciar    |  163              |  38            | 23,31%        | 29                | 76,32%
alienar    |  366              |  10            |  2,73%        |  6                | 60,00%
alucinar   |   23              |   0            |  0,00%        |  0                |  0,00%
amaciar    |   22              |   0            |  0,00%        |  0                |  0,00%
amargurar  |   27              |   0            |  0,00%        |  0                |  0,00%
amedrontar |   30              |   5            | 16,67%        |  1                | 20,00%
amenizar   |   46              |   0            |  0,00%        |  0                |  0,00%
Table 3. Noun-Verb-Noun Matrix Results

Verb       | % of Nhum-V-Nhum | Corpus Value | Matrix Value
abalar     | 29,85%           | +            |
abater     | 79,80%           | +            |
aborrecer  | 18,18%           | +            |
abrandar   | 11,11%           | +            |
absorver   | 57,58%           | +            |
acalmar    | 32,14%           | +            |
admirar    | 73,33%           | +            | -
afectar    | 10,02%           | +            |
afligir    |  0,00%           | -            | +
agastar    |  0,00%           | -            |
agitar     | 13,87%           | +            |
agoniar    |  0,00%           | -            |
alegrar    | 10,00%           | +            |
aliciar    | 76,32%           | +            | +
alienar    | 60,00%           | +            |
alucinar   |  0,00%           | -            | +
amaciar    |  0,00%           | -            |
amargurar  |  0,00%           | -            |
amedrontar | 20,00%           | +            |
amenizar   |  0,00%           | -            |
Still, verbs like alucinar ’to hallucinate’ do not present the pattern that
was expected of them, while (+)-marked verbs like aliciar ’to entice’ have a
large number of examples where the pattern is present, as in the sentences O
BCP aliciava as empresas com novos serviços ’BCP enticed companies with new
services’ and Liberato e Isaltino aliciam candidatos ’Liberato and Isaltino
entice candidates’.
We also found that 220 verbs were tagged in agreement with the matrices,
while 150 showed results that contradicted the data in the matrices. This can
also be due to the various meanings a verb can have, since some psychological
verbs are known to have non-psychological uses.
8 Conclusion
This article shows the process of retrieving a lexical property from a
corpus, as well as a comparison between these results and those achieved by
introspective means by the linguists who built the matrices. The whole
process was described, starting with the processing of the corpus and the
description of the tools used.
The results allow us to validate the data presented in the matrices, as well
as to improve the processing chain by finding words that had no significant
tags to describe them, or even incorrect tags (for this purpose the HUMAN tag
was created in the chain to describe human nouns that did not fit in other
categories, like mulher ’woman’ or criança ’child’).
The results thus obtained allow us to better direct the efforts of
syntactical classification and description carried out by linguists.
Further improvements could be made, such as:
– Extending the search for this pattern to verbs other than psychological
ones,
– Increasing the relevance of the processed data by using an even larger
corpus,
– Modifying the program to catch patterns that are only slightly different
from this one.
References
1. Maria Oliveira. Syntaxe des Verbes Psychologiques du Portugais. PhD
thesis, Centro de Linguística da Universidade de Lisboa, 1984.
2. Nuno Mamede. A cadeia de processamento XIP em Maio de 2007, 2007.
3. Carl Kesselman and Ian Foster. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, San Francisco, CA, USA, 1998.
4. Condor Team. Condor Version 7.0.5 Manual. University of Wisconsin-Madison,
November 2008.
5. Hadoop Team. Hadoop, 2008. http://hadoop.apache.org/core/.
6. John H. Howard. An overview of the Andrew File System. In USENIX Winter
Technical Conference, 1988.
7. Diana Santos and Paulo Rocha. Evaluating CETEMPúblico, a free resource for
Portuguese. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 442–449, Morristown, NJ, USA, 2001. Association for
Computational Linguistics.