Chapter 4
Analysis of Linguistic Features Associated with
Point of View for Generating Stylistically
Appropriate Text
Nancy L. Green
University of North Carolina at Greensboro
Dept. of Mathematical Sciences
Greensboro, NC 27402-6170 USA
Email: [email protected]
Abstract
We describe a qualitative analysis of a corpus of clinical genetics patient
letters. In this genre, a single letter is intended to serve multiple functions
and is designed for multiple audiences. The goal of the analysis was to
identify stylistically-related features for a natural language generation
system. We found that, perhaps because of the multiple intended functions
and audiences, within a single letter more than one writing style (set of
realization choices) can be observed, and the sets of features are associated
with different perspectives. Thus, an NLG system must take perspective
into account to generate stylistically appropriate text in this application.
The paper outlines the perspectives and the features associated with each
that were identified in the corpus.
Keywords: clinical genetics, patient letters, style analysis, natural language generation,
perspective, point of view.
1. Introduction
We are studying a corpus of clinical genetics patient letters written by genetic counselors to their
clients. According to Baker et al. (2002), the typical patient letter, one to two pages in length,
summarizes the counselor's meeting with the client. At a meeting the counselor may provide
information on the client's case (e.g., test results, diagnosis of a genetic disorder, prediction of
genetic risks), counseling to cope with the potential emotional effects of the information, as well
as explanations of genetics concepts relevant to the client's case. While the client is the addressee
of the letter, intended secondary audiences include family members and (in case the client is the
parent or guardian of a pediatric patient) staff members at the patient's school or daycare. In
addition, the letter is intended to provide medical documentation for healthcare providers. These
audiences differ in background (e.g., expert or layperson), in information needs (e.g., a description
of patient symptoms to support a medical diagnosis or to provide information for caregivers), and
in their emotional relationship to the patient (e.g., a parent or someone not personally involved
with the patient). The motivation for our study of the corpus, unlike most of the other papers in
this volume, is generation rather than interpretation. We wish to identify stylistically-related
features to guide linguistic realization and content selection in a natural language generation
(NLG) system for genetic counselors. The system will generate the first draft of a patient letter
using general information about clinical genetics and specific information about the patient's case.
Previous NLG research on stylistic variation has viewed style as a constant property within a
document and as defining a genre (Hovy, 1990; DiMarco and Hirst, 1993). After informal review
of letters in the corpus, we noted that, perhaps because of the multiple intended functions and
audiences, within a single letter (and in some cases within a single sentence) more than one
writing style can be observed. Our hypothesis is that each style (i.e., coherent set of realization
choices) is associated with a different perspective assumed by the writer, e.g., a counseling
perspective addressing the client’s emotional state or a medical perspective serving a
documentation function. For example, in sentence (2) below the writer uses the referring doctor's
perspective in reporting the reason for the referral to the author's clinic. (The number in
parentheses identifies the sentence; the letter's identifier, VCF, is given in parentheses at the end of
the excerpt. In the corpus, capitalized words in brackets have been substituted for original text to
maintain client confidentiality but convey the gist of the original text. In this domain, proband
refers to the person who is the focus of a genetic study, i.e., the patient.)
(2) [DOCTOR] asked us to evaluate [PROBAND] to determine if [HIS/HER] delays in
development and [SPECIFIC TYPES OF BIRTH DEFECT] were due to a recognizable
genetic condition. (letter VCF)
When speaking from the referring doctor's perspective, the writer's description of the patient's
symptoms is precise and uses words that may have negative connotations to the addressee (the
patient's parent), e.g., a description of the specific types of birth defects and use of the term delays.
In contrast, when the writer assumes the genetic counselor's perspective, the wording is designed
to mitigate the possible negative effect of the information on the addressee. A key stylistic choice
expressing the voice of the counselor in sentence (14) below is use of the value-free or
nonstigmatizing phrase altered form instead of mutation (Baker et al., 2002).
(14) [PROBAND] could have inherited an altered form of a gene from both you and
[HIS/HER] father that caused [HIS/HER] birth defects and learning problems. (letter VCF)
In summary, we claim that in addition to a representation of what must be said, our NLG system
must take perspective into account in order to be able to generate stylistically appropriate text in
this application. This paper justifies the claim by outlining a set of perspectives and some of the
features potentially associated with each that we have identified by qualitative analysis of the
corpus.
2. Perspectives in Corpus
Based upon a review of letters in the corpus and information on genetic counseling, e.g., (Wilson,
2000), we have identified the following perspectives:
• author: the letter writer, i.e., a genetic counselor writing on behalf of a genetics clinic. This
voice can be distinguished from the voices that we call genetic counselor and clinic. For
example, in the author's voice, formulaic expressions are used (e.g., We hope this information
is helpful), which are not used in parts of a letter representing those other perspectives.
• client: the person(s) who met with the counselor and who is (are) the principal addressee(s) of
the letter, usually the patient or some member(s) of the patient's family. This perspective is
taken to document discussion initiated by the client at the meeting (e.g., You expressed
concern that …) as well as to enable the writer to include information for the medical record
although it is already known to the client (e.g., As you know, [DOCTOR] first saw
[PROBAND] at eight months…).
• referring doctor: the doctor who referred the patient to the clinic (e.g., [DOCTOR] asked us to
evaluate …). This perspective is used to document the referring doctor’s findings and
tentative diagnosis, with which the clinic need not agree.
• clinic: the genetics clinic with which the genetic counselor is affiliated and that was visited by
the client. This voice is used to document what was done to a patient (e.g., We obtained a blood
sample …), or told to the client (e.g., We have recommended …) during the visit.
• genetic counselor: the genetic counselor who met with the client(s), who is also the letter
writer. This perspective is used in discussing patient-specific information such as the
diagnosis or a family member’s inheritance risks in terms that the client can understand and
that mitigate the potential negative effect of the information (e.g., It is important to remember
that [PROBAND'S] problems could still be caused by genetic alteration…).
• education: basic background knowledge about human genetics. For example, this perspective
is used to explain the role of genes in health and how genes are inherited (e.g., In autosomal
dominant inheritance, only one altered gene is needed for the person to have the condition.
This gene can come from either the mother or the father…).
• research: information from the clinical genetics research literature (e.g., Most children [with
osteogenesis imperfecta] have fragile bones, blue sclera, ...).
Although originally developed for the automated analysis of narrative (Wiebe, 1994), and later
applied to analysis of attitude in newspaper articles (Wilson and Wiebe, 2003), the model of
psychological point of view (POV) provides a framework for our own study. That model defines a
private-state relation whose components include an experiencer, an attitude, and the object of the
private state. For example, in sentence (2, VCF), repeated below, the experiencer, identified as
[DOCTOR], is the referring doctor, the attitude could be interpreted as believes it likely that, and
the object corresponds to what is expressed as the proband's delays in development and
[SPECIFIC TYPES OF BIRTH DEFECT] were due to a recognizable genetic condition.
(2) [DOCTOR] asked us to evaluate [PROBAND] to determine if [HIS/HER] delays in
development and [SPECIFIC TYPES OF BIRTH DEFECT] were due to a recognizable
genetic condition.
(3) During your appointment on [DATE], we obtained a blood sample from [PROBAND].
(4a) In addition to the routine chromosome study,
(4b) in which a microscopic study of the 46 chromosomes is done,
(4c) a special analysis of the long arm of chromosome 22 (22q11)
(4d) by a technique called fluorescence in situ hybridization (FISH)
(4e) was done to test for Velocardiofacial syndrome (VCF).
(5) Individuals with VCF often have [SPECIFIC TYPES OF BIRTH DEFECT] and learning
problems. (letter VCF)
This excerpt illustrates several other points. As noted in (Wiebe, 1994), experiencer and attitude
need not be stated explicitly. In (3), the experiencer, signaled by we, is the clinic and the attitude
could be interpreted as knowledge shared by experiencer and addressee. In (4a), the experiencer
could be interpreted as the clinic again, although it is not explicitly signaled; in the terms of
Wiebe (1994), (4a) continues the experiencer of the current POV. However, we claim that the explanatory
information provided in (4b) and (4d) is the voice of the genetic counselor and the attitude for
those phrases could be interpreted as knowledge that the experiencer believes the addressee does
not share with the experiencer. This change in attitude is associated with a shift in tense; the
explanatory information in (4b) and (4d) is presented in the present tense while the rest of (2)
through (4), a narration of the patient's referral, history and clinic visit, is presented in the past
tense. Finally, the experiencer in (5) is the research perspective. This change in experiencer is
marked also by a shift to the present tense.
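
The annotations discussed above can be written down as simple records of the private-state relation. The following Python sketch is illustrative only and is not part of the system described in this paper. The class name PrivateState, the field names, the function experiencer_shifts, and the paraphrased attitude labels are assumptions introduced here. Clauses (4c) and (4e) are left out because the discussion above does not assign them an experiencer, and the attitude is left empty where none is stated.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PrivateState:
    """One annotated segment (hypothetical record layout)."""
    segment: str             # sentence or clause label in letter VCF
    experiencer: str         # perspective whose point of view is expressed
    attitude: Optional[str]  # None where the discussion gives no attitude
    tense: str


# Values transcribed from the discussion of sentences (2) through (5).
VCF_EXCERPT = [
    PrivateState("2", "referring doctor", "believes it likely that", "past"),
    PrivateState("3", "clinic", "knowledge shared with addressee", "past"),
    PrivateState("4a", "clinic", None, "past"),
    PrivateState("4b", "genetic counselor", "knowledge not shared by addressee", "present"),
    PrivateState("4d", "genetic counselor", "knowledge not shared by addressee", "present"),
    PrivateState("5", "research", None, "present"),
]


def experiencer_shifts(states: List[PrivateState]) -> List[Tuple[str, str, str]]:
    """Return (segment, previous experiencer, new experiencer) at each change."""
    shifts = []
    for previous, current in zip(states, states[1:]):
        if current.experiencer != previous.experiencer:
            shifts.append((current.segment, previous.experiencer, current.experiencer))
    return shifts


for segment, old, new in experiencer_shifts(VCF_EXCERPT):
    print(f"({segment}): {old} -> {new}")
```

In this encoding, every present-tense segment belongs to the genetic counselor or research perspective, which mirrors the tense shifts noted above.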
3. Associated Features
Table 1 shows, for each perspective defined above, some associated features that we have
identified by manual inspection of the corpus. The second column lists the typical forms used for
referring to each type of experiencer. Note that according to the table, first person plural pronoun
forms such as we are used to refer to several categories of experiencer. The third column lists
typical forms for referring to individuals other than the experiencer. For example, the education
and research perspectives are characterized by reference to generic individuals instead of to
members of the client's family. According to Baker et al. (2002) the strategy of conveying
information about a patient indirectly by using general terms (e.g., Children with this condition
tend to lose their hearing, instead of Nisha is likely to lose her hearing) can be used by the writer
to mitigate the negative impact of the information on the client. The fourth column lists verb
tenses characteristic of each perspective. The fifth column lists forms for conveying probability,
and is discussed below. The last column lists other associated features, including characteristic
open-class words and word patterns. For example, several perspectives can be distinguished on the
basis of use of expert biomedical terminology in contrast to use of more layperson-oriented
terminology, e.g., use of the geneticist's term allele instead of the layperson-oriented copy. In
addition to this distinction, some perspectives can be characterized by use of value-free or nonstigmatizing language.
COMPUTING AFFECT AND ATTITUDE IN TEXT: THEORY AND APPLICATIONS
| Experiencer | Reference to experiencer | Reference to others | Tense | Probability | Other cues |
|---|---|---|---|---|---|
| author | pronoun (1p-plural), self-reference to letter (e.g., this letter) | reference to family members by name or pronoun (2p, 3p) | present or past (time of clinic visit) | | formulaic language (e.g., it was a pleasure), position near beginning and end of letter |
| client | reference to family members by name or pronoun (2p, 3p) | | past (time of clinic visit) | | client’s knowledge or questions (e.g., you asked whether, as you know) |
| referring doctor | doctor's name | reference to family members by name or pronoun (2p, 3p) | past (before clinic visit) | implicit (e.g., due to) | referral verbs (e.g., referred by), expert biomedical terminology, non-value-free words |
| genetics clinic | pronoun (1p-plural) | reference to family members by name or pronoun (2p, 3p) | past (time of clinic visit) | | clinic’s actions (e.g., we gave you, we obtained), expert biomedical terminology |
| genetic counselor | pronoun (1p-plural) | reference to family members by name or pronoun (2p, 3p) | present or future | qualitative (e.g., could, it appears that), Mendelian ratio (e.g., a 50% chance) | emphasis (still, it is important), value-free words (e.g., alteration instead of mutation), layperson-oriented biomedical terminology |
| education | agentless passive (e.g., it is believed that) | reference to population (e.g., the parents, the mother) or universal (e.g., we, everyone) | habitual present or future | qualitative, Mendelian ratio | layperson-oriented biomedical terminology, called (e.g., a gene called GJB2) |
| research | agentless passive (e.g., has been reported) | reference to population (e.g., individuals) | habitual present | qualitative, quantitative | expert biomedical terminology |
Table 1. Types of features characterizing each perspective.
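
To suggest how the terminology features in the last column of Table 1 might inform lexical choice in a generator, the following Python sketch maps a concept and a perspective to a word. It is a minimal illustration under stated assumptions, not the design of our system. The names STYLE, LEXICON, and choose_word are hypothetical, the two lexicon entries come from the examples in the text (allele versus copy, mutation versus alteration), and perspectives for which Table 1 lists no relevant feature are left unspecified.

```python
from typing import Dict, Optional, Tuple

# Style settings per perspective, restricted to what Table 1 lists.
STYLE: Dict[str, Dict[str, str]] = {
    "referring doctor": {"terminology": "expert", "connotation": "non-value-free"},
    "clinic": {"terminology": "expert"},
    "genetic counselor": {"terminology": "layperson", "connotation": "value-free"},
    "education": {"terminology": "layperson"},
    "research": {"terminology": "expert"},
}

# Each concept names the style feature that governs it and its variants.
LEXICON: Dict[str, Tuple[str, Dict[str, str]]] = {
    "ALLELE": ("terminology", {"expert": "allele", "layperson": "copy"}),
    "MUTATION": ("connotation", {"value-free": "alteration", "non-value-free": "mutation"}),
}


def choose_word(concept: str, perspective: str) -> Optional[str]:
    """Pick the variant matching the perspective, or None if unspecified."""
    feature, variants = LEXICON[concept]
    setting = STYLE[perspective].get(feature)
    if setting is None:
        return None
    return variants[setting]


print(choose_word("MUTATION", "genetic counselor"))
print(choose_word("ALLELE", "research"))
```

Returning None rather than a default makes explicit where the table gives no guidance, so that a generator would have to decide such cases by other means.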
In a previous study of this corpus (Green, 2003), we manually tagged both qualitative and
quantitative indicators of probability. Examples of qualitative indicators are modal verbs (e.g.,
can, could), frequency adverbs (e.g., often), and quantifiers (e.g., many). Quantitative indicators
are phrases containing numeric expressions (e.g., rates, odds, percentages), possibly with
qualifiers (e.g., approximately 80%). That study determined that the ratio of probability cues to
the number of sentences was high, which is not surprising given the inherent uncertainty in
human genetics. Column five of Table 1 shows the types of probability cues associated with each
perspective. Qualitative cues are used in all perspectives characterized by explicit use of
probability terms. The cues that we call Mendelian ratios, i.e., the idealized ratios of a Mendelian
inheritance model (e.g., 0%, 25%, 50%, 75%, and 100%), are characteristic of the education
perspective (in explanations of inheritance patterns) and of the genetic counselor perspective (in
explaining inheritance patterns that occur in the client's family). The presence of a quantitative,
non-Mendelian probability value (e.g., 6%) seems to be a good indicator of the research perspective,
since the original source of information would have been from empirical studies published in the
research literature.
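
The distinctions among probability cues can be made concrete with a small sketch. The Python code below is a rough illustration and is not the tagging procedure used in (Green, 2003), which was manual. The function name probability_cues and the regular expressions are assumptions, and the qualitative word list contains only the examples mentioned above. Only percentages are handled; rates and odds are not.

```python
import re
from typing import List, Tuple

# Idealized ratios of a Mendelian inheritance model, as percentages.
MENDELIAN_PERCENTAGES = {0.0, 25.0, 50.0, 75.0, 100.0}

# Illustrative qualitative cues taken from the examples in the text.
QUALITATIVE_CUES = {"can", "could", "often", "many"}

PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")
WORD = re.compile(r"[A-Za-z]+")


def probability_cues(sentence: str) -> List[Tuple[str, str]]:
    """Return (cue, type) pairs in order of appearance in the sentence."""
    found = []
    for match in PERCENT.finditer(sentence):
        value = float(match.group(1))
        kind = "mendelian" if value in MENDELIAN_PERCENTAGES else "quantitative"
        found.append((match.start(), match.group(0), kind))
    for match in WORD.finditer(sentence):
        word = match.group(0).lower()
        if word in QUALITATIVE_CUES:
            found.append((match.start(), word, "qualitative"))
    return [(cue, kind) for _, cue, kind in sorted(found)]


print(probability_cues("Each child could have a 50% chance of inheriting the gene."))
print(probability_cues("About 6% of individuals often have this finding."))
```

Under the observations above, a cue typed as quantitative would point towards the research perspective, while a Mendelian ratio would point towards the education or genetic counselor perspective.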
4. Implications for Natural Language Generation and Automatic Recognition of Point of View
An NLG system for a domain such as this must take perspective into account in order to be able to
generate stylistically appropriate text, regardless of whether perspective is considered in
generating text from "first principles", or whether it is "compiled into" quasi-textual building
blocks. Otherwise, for example, information needed for medical documentation purposes might be
realized in layperson-oriented terminology that is unsuitable for its intended function, or
information intended for a parent might be realized in obscure-sounding medical terminology that
fails to consider the emotional impact on the parent. Even when a generator uses precompiled
"building blocks" (Hirst et al., 1997), if the generator is not informed of the perspective
represented by each building block, then subsequent transformations such as text aggregation or
referring expression construction could produce phrasing that mixes perspective infelicitously.
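
One way to keep later transformations from mixing perspectives is to tag each building block with its perspective and to treat only adjacent blocks with the same tag as candidates for aggregation. The Python sketch below illustrates this idea under assumptions of our own: the class BuildingBlock, the function aggregation_groups, and the ordering of the sample draft are hypothetical, and the sample texts are sentences quoted earlier in this paper.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass(frozen=True)
class BuildingBlock:
    """A precompiled piece of text tagged with its perspective."""
    text: str
    perspective: str


def aggregation_groups(blocks: List[BuildingBlock]) -> List[Tuple[str, List[str]]]:
    """Group consecutive blocks that share a perspective.

    Only blocks inside the same group are candidates for aggregation.
    """
    return [
        (perspective, [block.text for block in run])
        for perspective, run in groupby(blocks, key=lambda block: block.perspective)
    ]


DRAFT = [
    BuildingBlock("[DOCTOR] asked us to evaluate [PROBAND] ...", "referring doctor"),
    BuildingBlock("During your appointment on [DATE], we obtained a blood sample from [PROBAND].", "clinic"),
    BuildingBlock("In autosomal dominant inheritance, only one altered gene is needed for the person to have the condition.", "education"),
    BuildingBlock("This gene can come from either the mother or the father ...", "education"),
]

for perspective, texts in aggregation_groups(DRAFT):
    print(perspective, len(texts))
```

Here the two education sentences fall into one group, whereas the referring doctor and clinic sentences are kept apart even though they are adjacent.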
In contrast to our work, most of the other projects described in this volume have goals related to
automatic recognition of point of view in text. Despite the difference in motivation, our qualitative
analysis can be seen as a possible step towards automatic recognition of point of view in clinical
genetics-related documents. It seems likely one could build a classifier to predict perspective
based on features like those that we have identified. The classifier might be used, for example, in
a question-answering system with access to a heterogeneous collection of text, e.g., patient
medical records and general patient education material on genetic disorders.
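
As a very rough indication of what such a classifier might start from, the Python sketch below scores a sentence against surface cues drawn from the examples in Section 2 and Table 1. It is a rule-based cue-matching baseline, not a trained classifier, and it has not been evaluated on the corpus. The names PERSPECTIVE_CUES and candidate_perspectives and the particular choice of cue phrases are assumptions made for illustration.

```python
from typing import Dict, List

# Cue phrases taken from examples in this paper; the selection is illustrative.
PERSPECTIVE_CUES: Dict[str, List[str]] = {
    "author": ["it was a pleasure", "this letter", "we hope"],
    "client": ["you asked whether", "as you know", "you expressed concern"],
    "referring doctor": ["referred by", "asked us to evaluate"],
    "clinic": ["we gave you", "we obtained", "we have recommended"],
    "genetic counselor": ["it is important", "still", "alteration"],
    "education": ["called", "everyone"],
    "research": ["has been reported", "individuals"],
}


def candidate_perspectives(sentence: str) -> List[str]:
    """Return the perspectives whose cues match the sentence most often.

    An empty list means that no cue matched; several entries mean a tie.
    """
    text = sentence.lower()
    scores = {
        perspective: sum(cue in text for cue in cues)
        for perspective, cues in PERSPECTIVE_CUES.items()
    }
    best = max(scores.values())
    if best == 0:
        return []
    return [perspective for perspective, score in scores.items() if score == best]


print(candidate_perspectives(
    "During your appointment on [DATE], we obtained a blood sample from [PROBAND]."
))
```

A realistic classifier would presumably combine such lexical cues with the tense, reference, and probability features of Table 1 and would be trained and evaluated on annotated letters.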
5. Acknowledgments
This work is supported by the National Science Foundation under CAREER Award No. 0132821.
6. Bibliography
Baker, D.L., Eash, T., Schuette, J.L., and Uhlmann, W.R. (2002) Guidelines for Writing Letters to
Patients. Journal of Genetic Counseling, 11 (5), 399-418.
DiMarco, C. and Hirst, G. (1993) A Computational Theory of Goal-Directed Style in Syntax.
Computational Linguistics, 19 (3), 451-500.
Green, N. (2003) Towards an Empirical Model of Argumentation in Medical Genetics. In
Proceedings of IJCAI 2003 Workshop on Computational Models of Natural Argument (CMNA03), 39-44.
Hirst, G., DiMarco, C., Hovy, E., and Parsons, K. (1997) Authoring and Generating Health-education
Documents that are Tailored to the Needs of the Individual Patient. In Proceedings of
User Modeling 1997.
Hovy, E. (1990) Pragmatics and Natural Language Generation. Artificial Intelligence, 43, 153-197.
Wiebe, J. M. (1994) Tracking Point of View in Narrative. Computational Linguistics, 20 (2), 233-288.
Wilson, T. and Wiebe, J. (2003) Annotating Opinions in the World Press. In Proceedings of the 4th
SIGDial Workshop.
Wilson, G.N. (2000) Clinical Genetics: A Short Course. Wiley-Liss.