KWIC corpora as a source of
specialized definitional
information:
a pilot study
Antonio San Martín
University of Granada, Spain
1. Introduction
Motivation: definition writing
•Definitions in other resources
•Corpus analysis
http://ecolexicon.ugr.es
What should I include in my definitions?
Assumption
The lexical units that normally
co-occur with another lexical
unit are potentially important for
defining it.
Hypothesis
A corpus of KWIC (Key Word In Context) concordances of the concept to be defined
→ Term list: potentially definitional terms for that concept
2. Methods
Analysis list
Reference list
2.1. Reference list
- Term list generated with TermoStat Web 3.0 (Drouin 2003): most frequent
nouns, noun phrases and adjectives (4 or more occurrences)
- Source: English corpus of 133 specialized definitions of MAGMA.
- To minimize interference from terminological
variation, terms in the reference list were categorized
according to the conceptual proposition they establish
with MAGMA.
- Any categorization has a certain degree of
subjectivity. The configuration of our reference list is
the result of certain choices.
Conceptual proposition | Instances from the list generated by TermoStat
magma is a rock | rock (163), molten rock (79), rock material (17), molten rock material (10), liquid rock (4)
magma is a material | material (37), rock material (17), molten rock material (10), molten material (8)
magma is (a) liquid / magma is a fluid | liquid (13), fluid (6), liquid rock (4)
magma is a mixture / magma is made of a mixture | mixture (6)
magma is molten | molten (105), molten rock (79), molten rock material (10), molten material (8), molten state (4)
magma is hot | hot (18), temperature (6)
magma is mobile | mobile (6)
magma contains gas/bubbles | gas (25), bubble (4)
magma contains crystals | crystal (24)
magma contains silicate | silicate (9)
magma contains volatiles | volatile (4)
magma contains minerals | mineral (4)
magma undergoes solidification | solidification (6), solid (5)
magma undergoes (partial) melting | melting (7), partial melting (6)
magma causes intrusion | intrusion (7)
magma causes extrusion | extrusion (6)
magma becomes igneous rock / magma is the raw material of igneous rocks | igneous (40), igneous rock (37), raw material (4)
magma becomes lava | lava (38)
magma is found under the Earth’s or a planet’s surface | earth (98), surface (63), planet (5), deep (6), depth (4), underground (5)
magma is found deep in the Earth / at depth | deep (6), depth (4)
magma is found in the (Earth’s) crust | crust (33)
magma is found in the upper part of the (Earth’s) mantle | mantle (20), upper (5)
magma is erupted from a volcano | volcano (7), volcanic (7)
2.2. Analysis lists
- An English corpus of environmental texts (PANACEA
corpus + LexiCon corpus). 359 occurrences of MAGMA.
- WordSmith Tools (Scott 2008) to generate KWIC
concordance lines:
100c MAGMA 100c
250c MAGMA 250c
500c MAGMA 500c
750c MAGMA 750c
Sentences
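The character-window concordances listed above can be illustrated with a minimal sketch. It is not the WordSmith Tools procedure used in the study; the function name `kwic_lines` and the sample text are hypothetical, and the window is simply cut at a fixed number of characters on each side of the keyword.

```python
import re

def kwic_lines(text, keyword, width):
    """Return one concordance line per hit of `keyword`,
    with up to `width` characters of context on each side."""
    # Whole-word, case-insensitive match (an assumption of this sketch).
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        lines.append(text[start:end])
    return lines

# Made-up sample text, for illustration only.
sample = ("When rock melts at depth it forms magma. "
          "Magma that reaches the surface is called lava.")

for width in (10, 20):
    for line in kwic_lines(sample, "magma", width):
        print(width, repr(line))
```

Calling the same function with widths of 100, 250, 500 and 750 would give four corpora analogous to the ones named above; the sentence-based corpus would need a sentence splitter instead of a fixed window.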
- Each corpus was fed into TermoStat in order to
obtain the most frequent nouns, noun phrases, and
adjectives.
- The 50 and 100 terms with the highest raw
frequency were retained for comparison with the
reference list.
- Analysis lists:
50-term 100c
50-term 250c
50-term 500c
50-term 750c
50-term sentence
100-term 100c
100-term 250c
100-term 500c
100-term 750c
100-term sentence
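Retaining the N terms with the highest raw frequency can be sketched as follows. The frequencies below are invented for illustration and are not TermoStat output; `top_terms` is a hypothetical helper name.

```python
from collections import Counter

def top_terms(frequencies, n):
    """Return the n terms with the highest raw frequency.
    Ties keep the order in which the terms were supplied."""
    counts = Counter(frequencies)
    return [term for term, _ in counts.most_common(n)]

# Invented raw frequencies, for illustration only.
toy_frequencies = {"rock": 12, "lava": 9, "chamber": 7, "volcano": 5, "study": 2}

print(top_terms(toy_frequencies, 3))
```

In the study the cut-offs were 50 and 100 terms, applied to each of the five corpora.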
2.3. Precision and recall
P = TP / (TP+FP)
R = TP / (TP+FN)
- TP (true positive): a term in the analysis list that matches any of
the categories in the reference list. The result is expressed as a
percentage.
- FP (false positive): a term in the analysis list that matches no
category in the reference list. The result is expressed as a
percentage.
- FN (false negative): a category in the reference list that is not
matched by any of the terms in the analysis list. The result is
expressed as a percentage.
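As a minimal sketch of these definitions, the counts can be computed from an analysis list and a reference list organized by category. The function name, the data layout (a dictionary from category to its set of terms) and the non-matching terms "chamber" and "eruption" are assumptions made for illustration; the three categories are taken from the reference list.

```python
def precision_recall(analysis_terms, reference):
    """Return (precision, recall) as percentages.
    `reference` maps each category to the set of terms that instantiate it."""
    matched_categories = set()
    tp = fp = 0
    for term in analysis_terms:
        hits = {cat for cat, terms in reference.items() if term in terms}
        if hits:
            tp += 1                      # term matches at least one category
            matched_categories |= hits
        else:
            fp += 1                      # term matches no category
    fn = len(reference) - len(matched_categories)  # categories left unmatched
    precision = 100 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Three categories from the reference list; the analysis list is invented.
toy_reference = {
    "magma is molten": {"molten", "molten rock"},
    "magma becomes lava": {"lava"},
    "magma contains crystals": {"crystal"},
}
toy_analysis = ["molten rock", "lava", "chamber", "eruption"]

p, r = precision_recall(toy_analysis, toy_reference)
print(f"P = {p:.2f} %, R = {r:.2f} %")
```

Here two terms match a category (TP = 2), two match none (FP = 2) and one category stays unmatched (FN = 1), so P = 50 % and R is about 66.67 %.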
The F2-measure (Chinchor, 1992, 25) was used, which gives
recall twice the importance of precision. The
formula used was the following:
F2 = (5 · P · R) / (4 · P + R)
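For reference, the general F-beta measure is (1 + β²) · P · R / (β² · P + R), which for β = 2 weights recall twice as heavily as precision. The sketch below assumes that standard definition; the input values are illustrative and are not results from the study.

```python
def f_beta(precision, recall, beta=2.0):
    """General F-beta measure; beta=2 gives the F2-measure.
    Inputs and output are on the same scale (here, percentages)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

# Illustrative values only.
print(round(f_beta(50.0, 80.0), 2))            # F2
print(round(f_beta(50.0, 80.0, beta=1.0), 2))  # F1, for comparison
```

With P = 50 and R = 80, F2 is higher than F1 because the stronger of the two scores is recall, which F2 favours.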
3. Results
- The 100-term 250c list performed best (F2-measure: 69.08 %)
and also had the highest recall (78.28 %).
- The highest precision corresponded to the 50-term 100c list,
but its recall was 12 points below that of the 100-term 250c list.
- The sentence-based (SC) list obtained a lower F2 score than any
of the KWIC lists.
- Beyond the 250-character context, longer contexts caused both
precision and recall to decrease.
4. Conclusions and future work
‣Although the scope of this pilot study was limited, results
indicate that a 250-character KWIC corpus coupled with a
100-term list generated from it could be a useful tool for
definition writing.
‣The inevitable bias caused by the use of a reference list
based on a manual classification does not invalidate the
results.
‣This initial pilot study will subsequently be
expanded to include new variables:
‣other kinds of definienda
‣verbs and adverbs in the term lists
‣corpora of different levels of specialization
‣more KWIC corpora with different character
counts
‣comparison of the output of TermoStat
with other term extractors as well as a simple
keyword generator
‣Our ultimate objective is to combine our approach with
the application of knowledge-pattern-based techniques
(Pearson, 1998; Meyer, 2001; Malaisé et al., 2005;
Marshman and L’Homme, 2006; Auger and Barrière, 2008,
inter alia) to create a system of semi-automatic
definitional information extraction.
Thank you
[email protected]
http://lexicon.ugr.es/sanmartin