A Flexible Synonym Interface with application examples in
CAL and help environments
G. M. GWEI* AND E. FOXLEY
Computer Science Department, University of Nottingham, Nottingham, NG7 2RD
In both writing and conversation, different people may use different terminologies for the same concept. This poses
problems in computer environments which support conversation. This paper outlines a flexible interface obtained by
exploiting any establishable relationships between different terminologies. We illustrate the flexible interface in a
Computer Assisted Learning (CAL) environment. We also argue that every application needs an additional component
to provide the synonym facility. The environment described incorporates synonym-generation programs based on an
on-line version of Roget's Thesaurus.
Received April 1986, revised June 1986
Table 1

Operating system   Command name   Root word
UNIX               rm             remove
VMS                del            delete
Harris             el             eliminate
CP/M               era            erase
OS/360             decatalog      decatalog

1.2 Grouping words of similar stems
Another issue of concern is how various forms of a word
might appear in a user's query or answer. For instance,
remove might appear in any of the following forms:
1.1 Needs for synonyms in a help environment
Most operating systems provide on-line help of some
kind. Unfortunately, most systems expect exact citations
of the entities for which help is being sought. Typically,
the vocabulary of a system is limited and different systems
have different vocabularies. Table 1 shows the terms used
on some systems for the command to get rid of a file.
To obtain help, a user is typically expected to type
'help <command name>'. From the example above
* This is part of research supported by Cameroon Government BS
grant (Cameroon Embassy, London).
removers
remove
removal
removes
removals
removed
removable
removedness
removability
removing
remover
Though of the same root, these words are distinct and are
considered to be unrelated in many computing environments. It would be advantageous to get a single and
unique representation for any group of such words; the
technical term for this is conflation.
1.3 A flexible user interface
An interface which exploits the relationships cited above
would benefit users and course designers. Users would be
THE COMPUTER JOURNAL, VOL. 30, NO. 6, 1987 551
'<command name>' would depend not only on the
concept but also on the operating system in use. Knowing
the root word used by a particular system is not
straightforward; knowing the command name is even less
so.
Users with computer experience may well try the
command names from their previous system, while
newcomers might try less obvious terms. However, the
terms used by the newcomers and the terms used by those
with computer experience usually bear some relationship to the term used by the system. Such relationships are
usually resolved when queries are presented to local
operating system gurus. We can make use of their
knowledge in resolving these relationships, thereby
improving the utility of help environments. In effect, the
gurus should stipulate appropriate synonyms as well as
providing help material.
The need for synonyms is evident in most computing
environments that permit queries or expect any form of
information from users. For example, some CAL
environments perform script marking by searching for
keywords [7]. Such systems would be greatly improved if
they searched additionally for keyword synonyms.
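The idea of letting a help environment accept terms from other systems can be sketched as follows. This is only an illustration under stated assumptions: the pairing of systems, command names and root words is read from Table 1, while the function name `resolve_help_topic` and the lookup strategy are hypothetical and not part of the system described.

```python
# Root word and command name for "get rid of a file", per system (Table 1).
COMMANDS = {
    "UNIX": ("remove", "rm"),
    "VMS": ("delete", "del"),
    "Harris": ("eliminate", "el"),
    "CP/M": ("erase", "era"),
    "OS/360": ("decatalog", "decatalog"),
}


def resolve_help_topic(term, system):
    """Map any known root word or command name to the local command name.

    Returns None when the term is not a known synonym for the concept.
    """
    known = set()
    for root, command in COMMANDS.values():
        known.update((root, command))
    if term.lower() in known:
        return COMMANDS[system][1]
    return None


# A former CP/M user asking a UNIX help system about "erase".
print(resolve_help_topic("erase", "UNIX"))
```

A user who types 'help erase' on UNIX would then be directed to the material on rm, which is the kind of resolution a local guru would otherwise provide.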
1. RATIONALE
It is common practice to express any concept in different
terms. These terms (that is, words or phrases) usually bear
some relationship with each other. These relationships
include the following.
(1) Words with entirely different roots but of similar
meaning (synonyms) possibly in a specific context. For
example, via and through are synonyms of each other.
(2) Words with the same stem but used in different
senses. These can be conjugated forms of verbs, various
noun forms or any other permissible inflections in English
grammar. For example, successive and successor are
different inflections of succeed.
Although these relationships are commonly known,
most computing environments fail to exploit them. This
often frustrates users, who have to search for system-specific terminology before formulating queries. It would
be useful to exploit the cited relationships to improve the
interface to most computer applications.
able to formulate queries with greater freedom in
terminology. Course authors would have similar freedom
in making cross-references and other queries.
In addition to providing flexibility for the user, the
interface described here also incorporates a teaching
strategy. The system strives to familiarise users with the
terminology preferred by the domain expert or course
designer. The next section gives an overview of a CAL
and help environment on which the interface is
demonstrated.
2. A CAL AND HELP ENVIRONMENT
We describe an environment that meets the teaching
requirements of CAL in addition to providing on-line
help or consultancy in a given domain.
The environment consists of scripts for various subjects,
a script-writing aide, a script interpreter, and a library of
script-marking routines. The material for each subject is
kept in a separate directory that contains the following.
(1) A list of all topics or keywords of the subject - in a
file called KEYWORDS.
(2) Lessons and drills - each in a separate file and
containing:
(a) A list of the topics treated, each topic optionally
followed by
(i) an indication of the level (introductory,
ordinary or advanced) at which the topic is
treated in the lesson;
(ii) an abstract of the information about the
topic in the lesson;
(iii) an optional list of set-up commands that
may be executed if more information about
the topic is demanded by the user.
(b) A list of prerequisite topics, each indicating the
depth of knowledge assumed.
(c) The text for the lesson.
(d) A list of set-up commands to be executed if any
drill or special program is to be run as part of the
lesson.
(e) An optional list of tasks which are considered as
part of the lesson.
(3) Tasks - each in a separate file and containing the
following.
(a) A list of the topics tested, each indicating the level
at which it is tested (as above).
(b) A list of prerequisite topics and indications of the
depth of knowledge assumed but not tested.
(c) The text for the task.
(d) A list of set-up commands to be executed if any
drill or special program is to be run as part of the
task.
(e) The data, if any, which a user is supposed to edit,
transform or otherwise process.
(f) The set-up commands to be executed before the
user takes control.
(g) The evaluating commands to be executed after the
user has finished the task, to decide whether the
task has been carried out successfully.
(h) An optional list of possible successor lessons.
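The contents of a lesson file and a task file listed above can be pictured as records. The sketch below is one possible in-memory representation; all class and field names are hypothetical, chosen only to mirror the items of the list, and the on-disk notation of the real environment is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LEVELS = ("introductory", "ordinary", "advanced")


@dataclass
class TopicRef:
    """A topic named in a lesson or task, with the level at which it is treated."""
    name: str
    level: str = "ordinary"
    abstract: str = ""
    setup_commands: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")


@dataclass
class Lesson:
    topics: List[TopicRef]                                    # item (a)
    prerequisites: List[TopicRef]                             # item (b)
    text: str                                                 # item (c)
    setup_commands: List[str] = field(default_factory=list)   # item (d)
    tasks: List[str] = field(default_factory=list)            # item (e)


@dataclass
class Task:
    topics: List[TopicRef]                                         # item (a)
    prerequisites: List[TopicRef]                                  # item (b)
    text: str                                                      # item (c)
    setup_commands: List[str] = field(default_factory=list)        # item (d)
    data: Optional[str] = None                                     # item (e)
    pre_commands: List[str] = field(default_factory=list)          # item (f)
    evaluating_commands: List[str] = field(default_factory=list)   # item (g)
    successors: List[str] = field(default_factory=list)            # item (h)


lesson = Lesson(
    topics=[TopicRef("file removal", level="introductory")],
    prerequisites=[TopicRef("files")],
    text="Text of the lesson goes here.",
)
```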
2.2. Restriction in terminology
While building lessons and tasks, the course author is
required to refer to topics in identical terms to those in
the topics file. The same is required of users in
the queries and answers they give to the system. These
restrictions inhibit the usage of terms other than those
listed. This can frustrate users who are convinced that
they are using correct concepts or those who have used
the same terminology elsewhere.
We now describe how a flexible interface can be built
on top of an application program. This, we believe, will
result in a more user-friendly interface and will also bring
to light some of the inherent facilities of the original
environment. Such facilities would otherwise be shielded
from frustrated users.
3. A SYNONYM INTERFACE
The first task is to provide synonyms for any interface.
Who provides all the synonyms for a given environment?
A possible solution is to provide tools to enable the
course author to furnish the system with the synonyms
of each keyword [14]. Each synonym obtained this way is
surely reliable. However, providing all the synonyms for
each keyword is an arduous and boundless task. We
cannot expect course authors to add this to their already
tedious responsibilities.
3.1 Synonym generation
There are several synonym dictionaries which can be
helpful in synonym generation [1, 2, 3]. These dictionaries
group terms of similar meaning together. Synonym-generation programs can be written around
on-line versions of such dictionaries.
2.1 Components of the environment
Lessons and tasks can be added or edited at any time.
The script-writing aide builds index files to facilitate
searching; checks each item and reports any errors; and
gives advice on any inadequacy diagnosed.
The script interpreter presents lessons and supervises
tasks. It uses the information above in addition to a
profile for each user. Each user profile maintains an
up-to-date record of the attributes assigned to the user on
a given subject. These attributes include: a categorisation
of the user; a list of misjudgements already made about
the user; a vector of performance on tasks relative to
lessons on all the topics of a subject; and a quantisation
of the knowledge demonstrated by the user thus far. The
profile aids in the choice of material for the user at any
point. The interpreter also makes queries on behalf of the
user and allows the user to consult any aspect of the
subject. The terms used in any query are matched against
the keywords in the topics file to determine the topic of
discourse. The topic of discourse is used to select
potential lessons to deal with the query. The user profile
is then used to select the most relevant among the
potential lessons. Tasks are selected on similar criteria.
The notation used in the system is based on that of
the UNIX LEARN environment [10]. The incorporation of
a user profile, the use of information-retrieval techniques,
and some marking routines originate from Gwei's
previous work on CAL at Aston [7].
The computing facilities at the Nottingham University
Computer Science department include an on-line version
of Roget's Thesaurus supplied by the publishers,
Longmans [2]. We have implemented a synonym-generation program based on this thesaurus [8]. Thus synonyms
to terms can be obtained on-line, through our roget
command [5].
3.1.1 Description of the roget command
3.1.1.1 Searching mechanism
When roget is called with a non-numerical argument, a
search is carried out for the given inputs. In this search,
all terms derivable from the inputs by applying rules of
conjugations, suffix manipulation and phrase juxtapositions are taken into consideration. Minimal spelling
correction and conversion from British to American
spelling (and vice versa) are also incorporated. The rules
for suffix manipulation in roget are based on a list of
suffixes with associated pre-conditions for replacements
and padding. The entire list and conditions can be found
in Ref. 8. A few entries from the list are given in Table 2.
Table 2

Given     Unique       Possible               Minimum
suffix    substitute   suffixes               word length
-e/-y     -ab          -able -ability         4
-e/-y     -ie          -ier -ied              4
-e/-y     -in          -ing -iness            4
-e/-y     -io          -ion -ious             4
-e/-y     -iv          -ive -iving            4
-e/-y     -or          -ory -ors              4
-e/-y     -ou          -our -ous              4
-eive     -eipt        -eipt -eipted          6
-eive     -eit         -eits -eitive          6
-eive     -ept         -eption -eption        6
-eive     -ip          -ipient -ipience       6
-mit      -miss        -mission -missive      5
-orb      -orpt        -orption -orptive      5
-our      -or          -oration -orimetry     6
-ribe     -ript        -riptive -ription      6
-X        -et          -ction -ctive          4
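The way a suffix list of this kind could drive the search can be sketched as follows. The rows are a subset of Table 2; the reading of the columns (replace the given suffix by the substitute, provided the input word has at least the minimum length) is an assumption made for illustration, and `substitute_stems` is a hypothetical name rather than a routine of the roget program.

```python
# (given suffix, unique substitute, minimum word length): some rows of Table 2.
RULES = [
    ("eive", "eipt", 6),
    ("eive", "eit", 6),
    ("eive", "ept", 6),
    ("eive", "ip", 6),
    ("mit", "miss", 5),
    ("orb", "orpt", 5),
    ("ribe", "ript", 6),
]


def substitute_stems(word):
    """Return the alternative stems obtained by suffix replacement.

    Assumption: the minimum word length applies to the input word itself.
    """
    stems = []
    for given, substitute, min_len in RULES:
        if len(word) >= min_len and word.endswith(given):
            stems.append(word[: -len(given)] + substitute)
    return stems


print(substitute_stems("deceive"))
```

Each stem returned would then be used as a prefix when looking up terms, so that 'deceit' and 'deceitful' are reachable from 'deceive' through the stem 'deceit'.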
Table 3

                                  % given to the terms found
Suffix manipulation               No padded word   Resulting phrase
No suffix change                  100              50
Known suffix padded               75               20
Long suffix replaced              75               20
Short suffix (-e/y) replaced      30               15
Unknown suffix padded             8                5
Negative notion suffix padded     2                1
After the searching phase, the sum of the percentage
values for each paragraph forms a measure for that
paragraph. The measures and the headwords for all
paragraphs considered are displayed to enable the user to
make a choice. Continuing with the call above, the next
information would be:
Paragraph headwords are:
(a) para. 542 (225): 'Deception'
(b) para. 541 (120): 'Falsehood'
(c) para. 509 (80): 'Disappointment'
(d) para. 951 (50): 'Impurity'
(e) para. 18 (30): 'Similarity'
(f) para. 419 (30): 'Dimness'
(g) para. 445 (30): 'Appearance'
(h) para. 477 (30): 'Sophistry: false reasoning'
(i) para. 495 (30): 'Error'
(j) para. 525 (30): 'Concealment'
(k) para. 523 (15): 'Latency'
(l) para. 952 (13): 'Libertine'
(m) para. 544 (8): 'Dupe'
Choose a letter, (a) to (m).
Here paragraph 542 on 'Deception' is considered to be
the best (measure of 225). In any case, the system waits
for the user's choice.
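The computation of the paragraph measures can be sketched as below. The terms and paragraph numbers are those printed for 'roget deceive'. The percentage attached to each term is an assumption: the values were picked from Table 3 so that the sums reproduce the measures displayed above, and the paper does not state them term by term. The tie-breaking by paragraph number is likewise inferred from the displayed order.

```python
from collections import defaultdict

# term -> (assumed percentage value, paragraphs containing the term)
FOUND = {
    "deceive one's hopes": (50, [509]),
    "deceive one's spouse": (50, [951]),
    "deceived husband": (5, [952]),
    "deceived": (8, [544]),
    "deceiver": (8, [952]),
    "deceptive appearance": (15, [523]),
    "deceptive": (30, [18, 419, 445, 477, 495, 509, 541, 542]),
    "deceptiveness": (30, [542]),
    "deceit": (75, [542]),
    "deceitful": (30, [541, 542]),
    "deceitfully": (30, [541, 542]),
    "deceitfulness": (30, [525, 541, 542]),
}


def paragraph_measures(found):
    """Sum the percentage values of the terms found in each paragraph.

    Returns (paragraph, measure) pairs, best measure first; ties are
    ordered by paragraph number.
    """
    totals = defaultdict(int)
    for value, paragraphs in found.values():
        for paragraph in paragraphs:
            totals[paragraph] += value
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


for paragraph, measure in paragraph_measures(FOUND):
    print(f"para. {paragraph} ({measure})")
```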
Roget generates terms of related meaning to a given input
term. It can be called with a single term as parameter,
as in 'roget remove', or with a single number, as in 'roget
123'. The latter form gives the contents of paragraph 123
of the Thesaurus. This corresponds to contemporary
usage of the Thesaurus. The former looks up all terms
starting with the same stem as the given parameter
('remove', 'removable' and 'removed' in the above case),
and prints each one together with the numbers of the
paragraphs in which it occurs. It also prints the number
and headwords of each paragraph together with a
measure of how appropriate the paragraph is considered
to be. This information guides the user to choose a
desired paragraph. All synonyms to the searched terms
appearing in the chosen paragraph are printed. An input
phrase is treated in much the same way as an input word.
For example, 'roget "good time"' searches for all
phrases starting with 'good time' and 'good timing'.
Various options exist to allow users to obtain desired
results (see later).
Thus, the searching phase of a call such as 'roget
deceive' prints out the following:
'deceive one's hopes'      para. 509
'deceive one's spouse'     para. 951
'deceived husband'         para. 952
'deceived'                 para. 544
'deceiver'                 para. 952
'deceptive appearance'     para. 523
'deceptive'                paras 18, 419, 445, 477, 495, 509, 541, 542
'deceptiveness'            para. 542
'deceit'                   para. 542
'deceitful'                paras 541, 542
'deceitfully'              paras 541, 542
'deceitfulness'            paras 525, 541, 542
That is, each term found is printed together with the
numbers of all the paragraphs that contain the term. Each
term found is given a percentage value according to the
method of derivation. The values used were chosen
non-rigorously (by trial and error) and are given in
Table 3.
3.1.1.2 Options available with the roget command
Through flags, the roget command can be asked to:
defeat one's hopes
dash one's hopes
crush one's hopes
blight one's hopes
deceive one's hopes
betray one's hopes
deceive
delude
dazzle
cheat
cozen
con
swindle
sell
rook
do
do down
With this environment, synonyms to each keyword in the
topics file can be obtained by appropriate roget calls.
The results for all the keywords would then form the
vocabulary of the subject concerned. All queries to the
subject can then be checked against this vocabulary,
thereby determining the topic of discourse.
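Building the vocabulary of a subject and checking a query against it can be sketched as follows. The synonym lists are a few of the terms shown in this paper; the function names and the restriction to single-word matching are assumptions made to keep the example short.

```python
def build_vocabulary(synonyms_by_keyword):
    """Map every keyword and every one of its synonyms back to the keyword.

    If a synonym is listed under two keywords, the first keyword wins
    (an arbitrary choice for this sketch).
    """
    vocabulary = {}
    for keyword, synonyms in synonyms_by_keyword.items():
        vocabulary.setdefault(keyword.lower(), keyword)
        for synonym in synonyms:
            vocabulary.setdefault(synonym.lower(), keyword)
    return vocabulary


def topics_of_discourse(query, vocabulary):
    """Return the keywords referred to by the words of a query, in order."""
    topics = []
    for word in query.lower().split():
        keyword = vocabulary.get(word)
        if keyword is not None and keyword not in topics:
            topics.append(keyword)
    return topics


vocabulary = build_vocabulary({
    "remove": ["delete", "eliminate", "erase"],
    "deceive": ["delude", "cheat", "swindle"],
})
print(topics_of_discourse("how do I erase a file", vocabulary))
```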
3.1.2 A new role for the course author in synonym
provision
Synonyms obtained from the course author or other
experts in the domain would be more reliable than those
obtained by the method above. At least, most technical
terms would only be obtainable in this way. Experts in
a domain would be best informed of synonymous
relations among terms and their context dependency.
Our system capitalises on this asset by making the
following provisions:
(1) Each keyword in the topics file can be followed by
a set of synonyms, each synonym being delimited by a
3.1.3 Problems with this method of synonym generation
In spite of its potential, the method above has inherent
pitfalls. They include the following.
(1) Roget's Thesaurus is much too general-purpose.
The context considerations given above do not go far
enough to meet the requirement in very restricted and
specific technical contexts.
(2) Terms that are too technical or those that are
accepted acronyms in a technical domain would not be
found in the Thesaurus.
(3) Special-purpose synonym relationships would not
be found in Roget's Thesaurus.
4. CONSOLIDATING THE VOCABULARY
The total number of distinct terms that make up the
vocabulary of a course or subject could be very large. This
would involve excessive demands on storage. The
addition of synonyms further increases these demands.
Luckily, some terms in a given system may be redundant.
Redundant terms include:
(1) Words that do not add any more meaning to
sentences or phrases: such words may include definite
and indefinite articles, conjunctions and prepositions.
(2) Words easily derived from others: we need store
only a stem to represent all the words that can be derived
from it.
By storing a general collection of words belonging to
the first class, we can eliminate them from the vocabulary
of a given course. The course author can choose to
modify the general collection of such words to suit the
course. This new copy needs to be kept in a special file
(1) Give the same weight to each string found (original
or derived inputs) [-n].
(2) Consider only the paragraph(s) with the best
measure [-b].
(3) Choose a paragraph at random from those
considered [-r].
(4) Choose and print all the paragraphs considered
[-a].
(5) Choose synonyms of particular part(s) of speech
[-p<juxtaposed parts of speech>].
(6) Print output exactly as it appears in the thesaurus
[-v].
(7) Stop suffix manipulation and search for exact input
[-x and -xx] (-x allows phrases but -xx allows only exact
input).
(8) Prevent the padding of phrases but allow suffix
manipulation of input [-w].
(9) Consider only synonyms in a particular context.
This requires at least one other term in the same context
[-c followed by context term].
For example, the command 'roget -a -p verb deceive -c
cheat' requests all synonyms to deceive but only those
which are of part of speech verb and in the same context
as cheat. It prints out the following:
pair of parentheses ('(' and ')'). The first entry for each
keyword (which is not within parentheses) is taken as the
most acceptable form of the keyword. All other entries
facilitate access to materials. The subject-building package calls roget with each keyword. All other entries
for the keyword are taken as context parameters. The
results are added to the entries of the keyword concerned.
(2) Interactively, the course author can add synonyms
to any keyword. The course author can type two or more
terms separated by '=' to indicate equivalence. The
SIMILARITY, a measure of the co-occurrence of
words in two terms [6, 13], is obtained for each term given by
the course author and each topic in the topics file. For
example, the SIMILARITY of 'computer logic' and
'computer programming logic' is 2/3, while that of 'logic'
and 'computer logic' is 1/2, but the SIMILARITY of
'logic' and 'computer programming logic' is 1/3. If any
of the typed terms has an acceptably high SIMILARITY
to a topic already in the system, the other terms are
then added as entries of the same topic.
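The three SIMILARITY values quoted above are consistent with a simple word-overlap ratio: the number of words the two terms share, divided by the number of distinct words in both. The sketch below uses that ratio as an assumption; the exact formula is not given here, and dividing by the length of the longer term would reproduce the same three values.

```python
from fractions import Fraction


def similarity(term_a, term_b):
    """Shared words divided by distinct words across both terms."""
    words_a = set(term_a.lower().split())
    words_b = set(term_b.lower().split())
    if not words_a or not words_b:
        return Fraction(0)
    return Fraction(len(words_a & words_b), len(words_a | words_b))


print(similarity("computer logic", "computer programming logic"))  # 2/3
print(similarity("logic", "computer logic"))                       # 1/2
print(similarity("logic", "computer programming logic"))           # 1/3
```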
The first option is highly recommended during the
early stages of course design. After a few lessons or tasks
have been built, the second option would save the
overhead of building the environment from scratch.
As the material of a course is being gathered, the course
author may realise that references are made to the topics
in terms not present in the topics file. Similarly, users
already making queries may be doing so in terminology
not known to the system. The rejected terms can be vetted
by the course author and added to the system using the
second option above.
like did, went, sold, destruction, receipt, indices, etc., which
undergo irregular changes in their derivation, are not
conflated together with their roots. On the other hand,
conditions may hold for a suffix to be wrongly removed.
Such suffixes could be meant to introduce entirely
different concepts. For example, in the system implemented, relative and relativity are conflated together. This
can potentially lead to confusion in a domain like
theoretical physics. In the light of such shortcomings, our
environment makes the following cautionary provisions.
(1) The course author is advised to give other irregular
forms of words as synonyms to their normal forms (e.g.
'indices' needs to be included as a synonym to 'index').
(2) Anything within a pair of double quotes (" ") is
left untouched. This enables experienced users to prevent
conflation when the meaning would otherwise be lost. In the
example above, specific reference to relativity should
always be double quoted.
(3) A stop list can be incorporated into a course. That
is, the course author may give a file ('nostrip') of words
that should stay unstripped within the course.
5. CONCLUSIONS AND FUTURE
CONSIDERATIONS
Most of the aspects of the interface have been discussed
together with their drawbacks. Advantages that accrue
from our routines include the following.
(a) Users and course authors have a greater freedom
in the terminology they use. There is less need for exact
citation in queries, references or answers.
(b) It is now possible to navigate through any text in
search of referenced keywords. This is achieved by
breaking the text into phrases. The valid separators for
these phrases include punctuation marks and conjunctions such as 'and', 'or' and 'by'. Each phrase so
obtained is matched against the synonyms to the given
keywords. If the SIMILARITY of the phrase and the best
match is above some predefined cut-off value, the
particular keyword is reported as found; otherwise, the
phrase is ignored. This proves useful in two areas.
(i) Checking on lesson and task scripts to help the
teacher. We wish to help the teacher by checking that all
keywords occurring in the text are either topics being
taught in this lesson, or prerequisites to this lesson. We
navigate through the text eliminating any keywords
found which occur in the list of topics or prerequisites for
this lesson. Any remaining keywords could be simply
reported for further action by the teacher. To be more
sophisticated we need to determine from the context
whether this keyword is occurring as a prerequisite (its
meaning is being assumed) or as a topic (its meaning is
being expounded upon). If backward reasoning (see
Jackson [9]) is applied to this unassigned keyword, and a
keyword representing a topic is reached, the unassigned
keyword should be assigned to the list of prerequisites.
If no backward reasoning can be achieved, the unassigned keyword represents a topic being taught, and
the teacher must take appropriate action.
(ii) Script marking. It is already possible to navigate
through a user's answer in search of the keywords of a
model answer (a marking scheme).
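The navigation through a text in search of keywords, as described under (b), can be sketched as follows. Several details are assumptions made for illustration: the set of punctuation marks, the cut-off of 0.5, the word-overlap ratio standing in for SIMILARITY, and the names `split_phrases` and `find_keywords`.

```python
import re

# Punctuation marks and the conjunctions named in the text act as separators.
SEPARATORS = re.compile(r"[.,;:!?()]|\b(?:and|or|by)\b", re.IGNORECASE)


def split_phrases(text):
    """Break a text into phrases at the separators, dropping empty pieces."""
    return [piece.strip() for piece in SEPARATORS.split(text) if piece.strip()]


def similarity(term_a, term_b):
    """Shared words divided by distinct words (assumed form of SIMILARITY)."""
    a, b = set(term_a.lower().split()), set(term_b.lower().split())
    return len(a & b) / len(a | b) if a and b else 0.0


def find_keywords(text, synonyms_by_keyword, cutoff=0.5):
    """Report each keyword whose best-matching entry exceeds the cut-off."""
    found = []
    for phrase in split_phrases(text):
        best_keyword, best_score = None, 0.0
        for keyword, synonyms in synonyms_by_keyword.items():
            for candidate in [keyword, *synonyms]:
                score = similarity(phrase, candidate)
                if score > best_score:
                    best_keyword, best_score = keyword, score
        if best_keyword is not None and best_score > cutoff and best_keyword not in found:
            found.append(best_keyword)
    return found


entries = {
    "computer programming logic": ["computer logic"],
    "remove": ["file removal", "delete"],
}
print(find_keywords("Revise computer logic, and file removal by tomorrow.", entries))
```

For script checking, the keywords returned would be compared with the topics and prerequisites declared for the lesson; for script marking, with the keywords of the model answer.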
Some aspects discussed still need further effort or
rethinking. They include the following.
(a) Eliminating some words from terms (Section 4)
can totally change the meaning of a term. This can lead
to wrong interpretation. For example, 'the ship in the
ocean' and 'a ship by the ocean' both come to 'ship
ocean'.
(b) The elimination of some suffixes can also lead to
a change in meaning, thus leading to inaccurate
conclusions. For example, something that is removable
need not be removed. This difference disappears after the
words are stripped.
Other aspects also need consideration for better
performance. They include:
(a) Expansion of the user model (profile) is necessary.
Even though this aspect has taken a low key in the
discussion, it is already playing a major part in the
interface. Without it, the system would pour out vast
amounts of information to users regardless of their
experience. This concept needs further consideration to
include aspects of time, and change in subject material.
(b) The conflation process can be extended and
modified to cater for prefixes as well as suffixes. Words
like aboriginal, aforementioned, befriend, enact, microcomputer, predefined and subgroup, for example, can be treated
as forms of their respective stems (original, mention,
friend, act, computer, define and group in these cases). The
system can be extended to include notions of the prefixes
to cater for these possibilities. More thought is needed
here than with suffixes. Unlike suffixes, which usually
introduce synonyms to the original root, some prefixes
can change a root into its antonym (as in anti-, dis-, un-,
etc.). Similar considerations can be made on suffixes like
-less.
(c) The process of navigating through text needs to be
explored further in the light of Jackson [9]. This will
improve on the facilities for course design.
of that subject's directory. This is used in eliminating
redundant words while building up the vocabulary of the
course. All queries and answers go through a similar
procedure.
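The elimination of redundant words can be sketched as follows. The word list is a small illustrative sample of articles, conjunctions and prepositions, not the general collection actually used, and `strip_redundant` is a hypothetical name.

```python
# A small sample of redundant words; a course could supply its own copy.
REDUNDANT = {"a", "an", "the", "and", "or", "in", "by", "of", "on", "to"}


def strip_redundant(term, redundant=REDUNDANT):
    """Drop redundant words from a term, keeping the others in order."""
    return " ".join(word for word in term.lower().split() if word not in redundant)


print(strip_redundant("the ship in the ocean"))
print(strip_redundant("a ship by the ocean"))
```

Both calls reduce their input to 'ship ocean', which shows the saving in storage and also the loss of meaning that this procedure can cause.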
The second class calls for a method of converting all
forms of a word into the root word or any unique and
consistent representation (that is, conflation). This calls
for a profound understanding of how the numerous
forms of words are derived.
Several linguists have tackled this problem using
various suffix-stripping categories [4, 11, 12]. The strategy
used is often dictated by the purpose for which the
stripping is needed. In some applications it might be
useful to keep a stem dictionary on the system to aid
stripping. In other applications, a suffix list might be
sufficient. When the requirement is consistency (as
happens here), a list of suffixes together with the criteria
under which each may be removed is expedient. This
approach, first adopted by Porter [12], is used in our system.
For the reader's convenience, a summary of the algorithm
used by Porter together with minor additions is presented
in the Appendix. Further details on the theory and
workings of the algorithm can be found in Porter's
original paper [12].
Porter's algorithm is simple, yet its performance is
comparable with much more sophisticated algorithms.
Like most others, however, it is far from being perfect.
The system takes care only of words derived from others
using the suffixes and the conditions stipulated. Words
Acknowledgements
We wish to express our appreciation to: Longmans for
supplying us with an on-line version of Roget's
Thesaurus; Dave Allsopp, who brought forth the need
for an option for exact match during the searching phase
in roget; William Shu, who was helpful in the
implementation of the lexical analyser for parsing input;
William Armitage (our local UNIX guru), Julian Onions
and various members of our Computer Science group
who contributed in one form or another. Special thanks
to Dr Ann Lomax, who made very useful comments on
the original draft of this paper.
REFERENCES
9. P. Jackson, Towards a theory of topics. Computers and
Education 8 (1), 21-26 (1984).
10. B. W. Kernighan and M. E. Lesk, LEARN - computer-aided instruction on UNIX. In UNIX Programmer's
Manual, Bell Laboratories, Murray Hill, NJ (1979).
11. J. B. Lovins, Development of a stemming algorithm.
Mechanical Translation and Computational Linguistics 11
(2), 22-31 (1968).
12. M. F. Porter, An algorithm for suffix stripping. Program
14 (3), 130-137 (1980).
13. C. Van Rijsbergen, Information Retrieval. Butterworths,
London (1979).
14. R. Wilensky, Y. Arens and D. Chin, Talking to UNIX in
English: an overview of UC. Communications of the ACM
27 (6), 574-593 (1984).
APPENDIX: SUMMARY OF ALGORITHM
FOR SUFFIX STRIPPING
Notation

Symbol(s)    Meaning                    Example                           Interpretation
"..."        Literal                    "s"                               The letter s
{}           Instance of definition     {stem}                            Some letter combination
=            Logically equal            {c1} = {c2}                       The two consonants are the same
!=           Logically unequal          {c} != "s"                        The consonant is not s
&            Logical and                ({c1} = {c2}) & ({c1} = "t")      Both consonants are t's
|            Logical or                 ({c} = "s") | ({c} = "t")         The consonant is s or t
->           If LHS conditions hold,    {stem}"s" -> {stem}               If a stem is terminated by "s",
             expression changes to RHS                                    the "s" is removed
=> {label}   If you get this far, do the operations starting at {label}
::=          LHS item defined by RHS expression
≡            Indicates logical equivalence
-            Indicates an ordinal range
[]           Indicates options in a range, e.g. [a-z] ≡ any letter
*            Means zero or more occurrences of the preceding item
Definitions

stem ::= [a-z]*               (i.e. zero or more letters)
stem1 ≡ stem2 ≡ stem
v ::= [aeiou]                 (i.e. any vowel)
v1 ≡ v2 ≡ v
c ::= [b-df-hj-np-tv-xz]      (i.e. any consonant)
c1 ≡ c2 ≡ c
"y" ≡ {v} if "y" is preceded by a consonant
    ≡ {c} otherwise

Mix

The function m(stem) is a measure of the vowel/consonant
mix in stem. This measure is obtained by counting only
the occurrences of a vowel (v) followed by a consonant (c)
in a stem. The following table includes some examples:

m(stem)   Examples of stems
0         my, bye, by, a, tree, me, you
1         type, aye, your, buy, quest
2         replay, conquest, types
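The measure m(stem) can be sketched directly from the definitions above: a vowel is one of a, e, i, o, u, or a "y" preceded by a consonant, and m counts the places where a vowel is immediately followed by a consonant. The function names are illustrative only; the stems used in the loop are the examples tabulated above.

```python
VOWELS = set("aeiou")


def is_vowel(stem, i):
    """True if stem[i] is a vowel; "y" counts only when preceded by a consonant."""
    ch = stem[i]
    if ch in VOWELS:
        return True
    if ch == "y":
        return i > 0 and not is_vowel(stem, i - 1)
    return False


def m(stem):
    """Count the occurrences of a vowel immediately followed by a consonant."""
    count = 0
    for i in range(len(stem) - 1):
        if is_vowel(stem, i) and not is_vowel(stem, i + 1):
            count += 1
    return count


for stem in ["tree", "by", "type", "buy", "quest", "replay", "conquest"]:
    print(stem, m(stem))
```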
1. Webster's Synonyms and Antonyms. Barnes & Noble
Books, New York (1962).
2. Roget's Thesaurus. Harlow, Longman Green (1982).
3. George Crabb, Crabb's English Synonyms. London,
Routledge & Kegan Paul.
4. J. L. Dawson, Suffix removal and word conflation. ALLC
Bulletin, pp. 33-46 (1974).
5. E. Foxley, UNIX for Super-users. Reading, Mass., Addison-Wesley (1985).
6. Gosta Grahne, Adaptive features of a CAL system based
on information retrieval. Computers and Education 6,
99-104 (1982).
7. G. M. Gwei, Towards automatic teaching. MSc. Thesis,
University of Aston in Birmingham (1983).
8. G. M. Gwei, The Roget Environment (Synonym Generation). Internal report, Computer Science Group, Nottingham University (1984).
The algorithm: only one rule can be activated within any
group

1. Cater for suffixes that create noun plurals (or present
singular tenses):
{stem}"sses" -> {stem}"ss"
{stem}"ies" -> {stem}"i"
{stem}"ss" -> {stem}"ss"
{stem}"s" -> {stem}

2. Cater for suffixes that create past participles, simple
past, and continuous tenses:
{stem}"eed" & (m({stem}) > 0) -> {stem}"ee"
({stem1}{v}{stem2})"ed" -> ({stem1}{v}{stem2}) => 2b
({stem1}{v}{stem2})"ing" -> ({stem1}{v}{stem2}) => 2b

2b. Tidy up effect from any of the last two operations:
{stem}"at" -> {stem}"ate"
{stem}"bl" -> {stem}"ble"
{stem}"iz" -> {stem}"ize"
* {stem}"is" -> {stem}"ise"
({stem}{c1}{c2}) & ({c1} = {c2}) & ({c1} != "s") & ({c1} != "z") & ({c1} != "l") -> {stem}{c1}
({stem}{c1}{v}{c2}) & (m({stem}) = 0) & ({c2} != "w") & ({c2} != "x") & ({c2} != "y") -> {stem}{c1}{v}{c2}"e"

3. Cater for words ending in "y" to make them
consistent with the results of step 1:
({stem1}{v}{stem2})"y" -> ({stem1}{v}{stem2})"i"

4. Cater for suffixes that create noun formations from
root words:
{stem}"ational" & (m({stem}) > 0) -> {stem}"ate"
{stem}"tional" & (m({stem}) > 0) -> {stem}"tion"
{stem}"enci" & (m({stem}) > 0) -> {stem}"ence"
{stem}"anci" & (m({stem}) > 0) -> {stem}"ance"
{stem}"izer" & (m({stem}) > 0) -> {stem}"ize"
* {stem}"iser" & (m({stem}) > 0) -> {stem}"ise"
{stem}"abli" & (m({stem}) > 0) -> {stem}"able"
{stem}"alli" & (m({stem}) > 0) -> {stem}"al"
{stem}"entli" & (m({stem}) > 0) -> {stem}"ent"
{stem}"eli" & (m({stem}) > 0) -> {stem}"e"
{stem}"ousli" & (m({stem}) > 0) -> {stem}"ous"
{stem}"ization" & (m({stem}) > 0) -> {stem}"ization"
* {stem}"isation" & (m({stem}) > 0) -> {stem}"isation"
{stem}"ation" & (m({stem}) > 0) -> {stem}"ate"
{stem}"ator" & (m({stem}) > 0) -> {stem}"ate"
{stem}"alism" & (m({stem}) > 0) -> {stem}"al"
{stem}"iveness" & (m({stem}) > 0) -> {stem}"ive"
{stem}"fulness" & (m({stem}) > 0) -> {stem}"ful"
{stem}"ousness" & (m({stem}) > 0) -> {stem}"ous"
{stem}"aliti" & (m({stem}) > 0) -> {stem}"al"
{stem}"iviti" & (m({stem}) > 0) -> {stem}"ive"
{stem}"biliti" & (m({stem}) > 0) -> {stem}"ble"

5. Cater for suffixes that change words into adjectival,
noun, and other forms:
{stem}"alize" & (m({stem}) > 0) -> {stem}"al"
{stem}"ative" & (m({stem}) > 0) -> {stem}
* {stem}"cept" & (m({stem}) > 0) -> {stem}"ceive"
* {stem}"cess" & (m({stem}) > 0) -> {stem}"ceed"
* {stem}"empt" & (m({stem}) > 0) -> {stem}"eem"
{stem}"ful" & (m({stem}) > 0) -> {stem}
{stem}"ical" & (m({stem}) > 0) -> {stem}"ic"
{stem}"icate" & (m({stem}) > 0) -> {stem}"ic"
{stem}"iciti" & (m({stem}) > 0) -> {stem}"ic"
{stem}"luble" & (m({stem}) > 0) -> {stem}"lve"
{stem}"lute" & (m({stem}) > 0) -> {stem}"lve"
{stem}"miss" & (m({stem}) > 0) -> {stem}"mit"
{stem}"ness" & (m({stem}) > 0) -> {stem}
{stem}"orpt" & (m({stem}) > 0) -> {stem}"orb"
{stem}"ript" & (m({stem}) > 0) -> {stem}"ribe"
{stem}"ular" & (m({stem}) > 0) -> {stem}"le"
{stem}"umpt" & (m({stem}) > 0) -> {stem}"ume"
{stem}"vis" & (m({stem}) > 0) -> {stem}"vide"

6. Similar reasons to step 5:
{stem}"al" & (m({stem}) > 1) -> {stem}
{stem}"ance" & (m({stem}) > 1) -> {stem}
{stem}"ence" & (m({stem}) > 1) -> {stem}
{stem}"er" & (m({stem}) > 1) -> {stem}
{stem}"ic" & (m({stem}) > 1) -> {stem}
{stem}"able" & (m({stem}) > 1) -> {stem}
{stem}"ible" & (m({stem}) > 1) -> {stem}
{stem}"ant" & (m({stem}) > 1) -> {stem}
{stem}"ement" & (m({stem}) > 1) -> {stem}
{stem}"ment" & (m({stem}) > 1) -> {stem}
{stem}"ent" & (m({stem}) > 1) -> {stem}
({stem}{c})"ion" & (m({stem}{c}) > 1) & (({c} = "s") | ({c} = "t")) -> {stem}{c}
{stem}"ou" & (m({stem}) > 1) -> {stem}
{stem}"ism" & (m({stem}) > 1) -> {stem}
{stem}"ate" & (m({stem}) > 1) -> {stem}
{stem}"iti" & (m({stem}) > 1) -> {stem}
{stem}"ous" & (m({stem}) > 1) -> {stem}
{stem}"ive" & (m({stem}) > 1) -> {stem}
{stem}"ize" & (m({stem}) > 1) -> {stem}
* {stem}"ise" & (m({stem}) > 1) -> {stem}

7. General tidying up (first stage):
{stem}"e" & (m({stem}) > 1) -> {stem}
({stem}{v1}{v2}{c})"e" & (m({stem}) = 0) -> {stem}{v1}{v2}{c}

8. General tidying up (last stage):
({stem}{c1}{c2}) & ({c1} = {c2}) & ({c1} = "l") & (m({stem}{c1}) > 1) -> {stem}{c1}

Note: All rules marked with '*' are local additions (not
given in Porter's paper). They conflate words like
divisible, admission, deceptive and successor into divide,
admit, deceive and succeed respectively. However, these
steps were added much later and have not been fully
tested.
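The statement that only one rule can be activated within any group can be illustrated with the rules of step 1, which carry no m(stem) conditions. The sketch below tries the rules in the order listed and stops at the first suffix that matches; the data layout and the name `apply_group` are assumptions, and the sample words are ordinary English words chosen for the illustration.

```python
# Step 1 rules as (suffix, replacement) pairs, in the order listed.
STEP_1 = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def apply_group(word, rules):
    """Apply the first rule whose suffix matches; at most one rule fires."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for word in ["caresses", "ponies", "caress", "cats", "tree"]:
    print(word, "->", apply_group(word, STEP_1))
```

Listing "ss" before "s" is what keeps a word such as 'caress' intact: the "ss" rule fires first and leaves the word unchanged, so the "s" rule never gets the chance to remove the final letter.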