CS4025: Morphology and the Lexicon
Words
The Lexicon
Morphology
Reading: J&M (chapter 3 in both editions)
Types of Words
Open-class, content words:
Nouns (shoe, tablecloth, cause, E-insertion…)
Verbs (see, walk, cause, forget, promise, …)
Adjectives (big, small, reliable, three-legged, …)
Adverbs (quickly, well, reliably, …)
Function (closed-class) words
• Prepositions (to, from, on, …), determiners (a, the, an, …), etc.
See J&M, sect 8.1
Lexicons
Lexicons are databases of word information.
“Dictionary” of NLP system
A good lexicon is critical to performance
» “the system with the bigger lexicon always wins”
Dictionary: Child
The following is, of course, written for humans:
child \'chi-(*)ld\ n pl children [ME, fr. OE cild] 1a: an unborn or recently
born person 1b: dialect: a female infant 2a: a young person of either
sex esp. between infancy and youth 2b: a childlike or childish person
2c: a person not yet of age...
» [From Webster’s on-line dictionary]
On the other hand, WordNet is an ontology of words for
human and computer use.
Word Information
An NLP system needs to know:
» Spelling
» Category and subcategory
» Inflections (plurals, past tense, etc.)
» What the word corresponds to in the DB or KB
» Statistical information
  • Frequency, contexts in which it appears
» Maybe pronunciation
» Probably not derivation
Example: Person
Person
» Category: noun
» Subcategory: count noun
» Inflections: plural = people
» Database correspondence: person class
» Frequency: 0.03%
Word Meaning
What do these words mean?
» He told a lie
» The temperature fell in the afternoon
Many subtleties, difficult to represent
Most NLP systems ignore subtleties, and use very
simple or coarse models of meaning
Digression: approaches to word meaning
1. Create a network of all possible word senses, with
   links between them (e.g. for hyponym, antonym). A
   word then has a number of these possible senses
   (WordNet) – this is an expensive undertaking.
2. Try to decompose word senses into complex
   expressions involving primitive concepts (Schank,
   Jackendoff) – only possible in limited sense areas.
In either case, the senses/concepts need to be related
to domain objects (e.g. database fields)
[See later lecture on Semantics]
Digression: approaches to word meaning
3. Alternatively, model word meaning statistically:
   – The meaning of a word is some function of the
     contexts it appears in (word vectors):
     • beer = [pint:0.1, glass:0.2, ale:0.1, drink:0.2, …]
     • Usually word vectors are compressed using
       neural networks to create a representation
       where each word is represented by, say, 500
       numbers (a vector in 500-dimensional space)
   – In vector space one can reason about meaning!
Word2Vec (similarity of words)
http://deeplearning4j.org/word2vec.html
Similarity of words (you can calculate cosine between
vectors)
Words most similar to “Sweden”:
If you cluster similar words, you can create dictionary-like “word
senses”
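A toy illustration of the cosine measure mentioned above, using tiny hand-made context vectors (the words, dimensions, and weights are invented for the example; real word2vec vectors have hundreds of learned dimensions):

```python
import math

def cosine(u, v):
    # cosine similarity = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

# hypothetical context weights over [pint, glass, ale, spring]
beer = [0.1, 0.2, 0.1, 0.2]
ale = [0.1, 0.1, 0.2, 0.2]
water = [0.4, 0.0, 0.0, 0.1]

print(cosine(beer, ale))    # near 1: similar contexts
print(cosine(beer, water))  # smaller: less similar contexts
```

Words that share contexts get vectors pointing in similar directions, so their cosine is high; that is the basis for clustering them into “word senses”.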
Word2Vec (analogies)
http://deeplearning4j.org/word2vec.html
Analogies (King is to Queen as Man is to ?) as arithmetic:
King – Queen + Man =?
[woman, Attempted abduction, teenager, girl]
house:roof :: castle:?
[dome, bell_tower, spire, crenellations, turrets]
knee:leg :: elbow:?
[forearm, arm, ulna_bone]
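The analogy arithmetic can be sketched with toy vectors. Everything here is invented for illustration: two hand-made dimensions (roughly “royalty” and “maleness”) chosen so the king/queen/man/woman relation holds, plus a distractor word:

```python
# hypothetical 2-d "word vectors": [royalty, maleness]
vecs = {
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
    "apple": [0.2, 0.5],  # distractor
}

def nearest(target, exclude):
    # nearest neighbour by squared Euclidean distance
    # (word2vec tools usually rank by cosine instead)
    dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, target))
    return min((w for w in vecs if w not in exclude), key=lambda w: dist(vecs[w]))

# king - man + woman  (the same relation as the slide's King – Queen + Man)
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
print(nearest(target, {"king", "man", "woman"}))  # queen
```

As in the real systems, the query words themselves are excluded from the answer list.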
Word2Vec (differences)
http://deeplearning4j.org/word2vec.html
Calculating Difference of word vectors:
Iraq – Violence = ?
Jordan
Human - Animal = ?
Ethics
President - Power = ?
Prime Minister
Library - Books = ?
Hall
2. Decompose meaning into primitives
KILL =
CAUSE TO die =
CAUSE TO NOT LIVE
THROW =
project WITH hand =
(CAUSE to MOVE) WITH BODYPART
General inference rules can be formulated in terms of
CAUSE, NOT, WITH etc.
Lexicon Structure
Object-oriented representation
» Noun, Verb, etc are classes
» Individual words are instances
» Slots (data members) hold information.
Use of inheritance
» Members of a class inherit properties (e.g. values of slots,
applicable rules)
» Multiple inheritance is common
Example: Lexicon Structure
Word
» Noun
» Adjective
» Verb
  – Intransitive: sleep
  – Transitive: eat, have
  – Bitransitive: give
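One plausible object-oriented encoding of this hierarchy (the slot names, the placement of Bitransitive directly under Verb, and the default-plural rule are assumptions for illustration):

```python
class Word:
    def __init__(self, spelling, frequency=None):
        self.spelling = spelling      # slot: spelling
        self.frequency = frequency    # slot: statistical information

class Noun(Word):
    def __init__(self, spelling, plural=None, **kw):
        super().__init__(spelling, **kw)
        # irregular plural stored in a slot; otherwise inherit the default rule
        self.plural = plural or spelling + "s"

class Adjective(Word):
    pass

class Verb(Word):
    pass

class Intransitive(Verb):
    pass

class Transitive(Verb):
    pass

class Bitransitive(Verb):
    pass

person = Noun("person", plural="people", frequency=0.0003)
give = Bitransitive("give")
print(person.plural)           # people (irregular, from the instance slot)
print(Noun("dog").plural)      # dogs (inherited default rule)
print(isinstance(give, Verb))  # True: properties of Verb apply to give
```

Instances inherit slot values and applicable rules from their classes, so only exceptional information (like plural = people) needs to be stated per word.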
Detailed Lexical information
Morphology
Variations of a root form of a word
» prefixes, suffixes
Inflectional morph - same core meaning
» plurals, past tense, superlatives
» part of speech unchanged
Derivational morph - change meaning
» prefix re- means do again: reheat, resit
» suffix -er means one who: teacher, baker
» part of speech changed
» e.g. antidisestablishmentarianism
Noun Inflections
Nouns are inflected to form plurals, usually by adding
s
Example
» chair - Tom has one chair
» chairs - John has 2 chairs
Also possessive inflection with ‘s
» The boy’s mother
Adjective Inflections
Adjectives are inflected to form comparative (er) and
superlative (est) forms.
Example:
» fast - A Sun is fast.
» faster - A Sun is faster than a PC.
» fastest - The Cray is the fastest computer
Verb Inflections
Tense and aspect
» infinitive (root) - I like to walk to the store
» past (ed) - I walked to the store
» past participle (ed or en) - I have walked to the store
» present participle (ing) - I am walking to the store
Agreement with subject
» pres/3sing (s) - John walks to the store
Syntactic Aside: Verb Group
Verb group is the main verb plus helping words
(auxiliaries).
Encodes information such as tense
John will watch TV (future – add will)
John watches TV (present - +s form of verb for
third-person singular subjects)
John is watching TV (progressive – form of BE
verb, plus +ing form of verb)
John watched TV (past – use +ed form of verb)
Syntactic Aside: Verb group
Negation
Usually add “not” after first word of verb group
John will not watch TV
Exception: add “do not” before 1-word VG
Inflections on do, use infinitive form of main
verb
John does not watch TV vs
John watches not TV
Exception to exception: use first strategy if verb is
form of BE
John is not happy vs
John does not be happy
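The negation rules above can be sketched as a small function. This is a minimal sketch: the list of BE forms, the caller-supplied infinitive, and the present-tense-only handling of "do"/"does" (no "did") are simplifying assumptions.

```python
BE_FORMS = {"am", "is", "are", "was", "were", "be"}

def negate(verb_group, infinitive=None):
    # usual strategy: add "not" after the first word of the verb group
    words = verb_group.split()
    first = words[0]
    if len(words) > 1 or first in BE_FORMS:
        return " ".join([first, "not"] + words[1:])
    # exception: 1-word group (other than BE) takes "do not";
    # "do" carries the inflection, main verb reverts to its infinitive
    do = "does" if first.endswith("s") else "do"
    return do + " not " + (infinitive or first)

print(negate("will watch"))        # will not watch
print(negate("watches", "watch"))  # does not watch, not "watches not"
print(negate("is"))                # is not (the exception to the exception)
```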
Spelling Changes
Some spelling changes are automatically made when
adding a suffix to a word.
Delete final “e” when ending starts with a vowel
» write + ing = writing, not writeing
Change final “y” to “i”
» fry + ed = fried, not fryed
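These two spelling rules can be sketched as a suffix-attachment function (a minimal sketch; the extra condition blocking y→i before i-initial suffixes anticipates the "frying" case, and consonant doubling etc. are ignored):

```python
VOWELS = set("aeiou")

def add_suffix(root, suffix):
    # rule 1: delete final "e" when the ending starts with a vowel
    if root.endswith("e") and suffix[0] in VOWELS:
        return root[:-1] + suffix          # write + ing -> writing
    # rule 2: change final "y" to "i" (after a consonant, not before "i")
    if root.endswith("y") and root[-2] not in VOWELS and suffix[0] != "i":
        return root[:-1] + "i" + suffix    # fry + ed -> fried
    return root + suffix

print(add_suffix("write", "ing"))  # writing, not writeing
print(add_suffix("fry", "ed"))     # fried, not fryed
print(add_suffix("fry", "ing"))    # frying (y kept before i)
```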
Irregular forms
Some words have irregular forms that must be looked
up in a dictionary
» plural of child is children, not childs
» past of eat is ate, not eated
Irregular forms are usually maintained when a prefix
is added
» past of rethink is rethought
Rules for spelling variants
Example: plural
Usually add “s” (dogs)
But add “es” if base noun ends in certain letters
(boxes, guesses)
Also change final “y” to “i” (try → tries)
But not if y is preceded by a vowel
buy → buys, not buies
Of course many special cases that need to be
enumerated for a language
children (vs childs), people (vs persons), etc
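The plural rules above, with the irregular cases looked up first, can be sketched as (the particular ending list and the irregular table entries are illustrative, not exhaustive):

```python
IRREGULAR_PLURALS = {"child": "children", "person": "people"}
VOWELS = set("aeiou")

def plural(noun):
    # special cases must be enumerated and checked before the rules
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    # add "es" after certain endings
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                 # box -> boxes, guess -> guesses
    # y -> i, but only after a consonant
    if noun.endswith("y") and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"           # try -> tries
    return noun + "s"                      # dog -> dogs, buy -> buys

print(plural("dog"), plural("box"), plural("try"), plural("buy"), plural("child"))
# dogs boxes tries buys children
```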
NL Tasks
Morphological analysis
» Replace inflected forms by root+inflection
  – The children ate apples
    becomes
  – The child+s eat+ed apple+s
Morphological generation
» Replace root+inflection by inflected forms
  – The child+s eat+ed apple+s
    becomes
  – The children ate apples
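A toy morphological analyser for the example sentence; the irregular table and the short suffix list are illustrative assumptions (a real analyser would also undo spelling changes and validate roots against a lexicon):

```python
# irregular forms must be looked up, not computed
IRREGULARS = {"children": "child+s", "ate": "eat+ed", "people": "person+s"}

def analyse(word):
    """Map an inflected form to root+inflection (minimal sketch)."""
    if word in IRREGULARS:
        return IRREGULARS[word]
    for suffix in ("ing", "ed", "s"):
        # length guard so short words like "sing" are not stripped bare
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + "+" + suffix
    return word

print(" ".join(analyse(w) for w in "the children ate apples".split()))
# the child+s eat+ed apple+s
```

Generation is the same mapping run in the opposite direction, from root+inflection back to surface forms.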
Implementation – 1: stemming
Suffix stripping - 2 pages of code with rules like:
– if the word ends in 'ed', remove the 'ed'
– if the word ends in 'ing', remove the 'ing'
– if the word ends in 'ly', remove the 'ly'
Example: Porter stemmer (1980), widely used in Information Retrieval

  suffix | replacement | examples
  ed     | null        | plastered -> plaster, bled -> bl

(many exceptions needed, hence the two pages of code!)
Dictionary of special cases
1500 special case rules (Sussex morpha)
More complex processing is needed for languages with
complex morphology.
Real Morphology (Turkish)
Uygarlastiramadiklarimizdanmissinizcasina
» uygar - civilized
» +las +BEC (become)
» +tir +CAUS (cause to)
» +ama +NEGABLE (not able)
» +dik +PPART (past participle)
» +lar +PL (plural)
» +imiz +P1PL (first person plural - we)
» +dan +ABL (ablative - from/among)
» +mis +PAST
» +siniz +2PL (second person plural - you)
» +casina +ASIF (as if)
» “(behaving) as if you are among those whom we could not
  cause to become civilized”
Implementation 2: analysis & generation
Express possible word analyses as simple
concatenations of morphemes, e.g. “in+probable+ly”
(can express allowable combinations via a finite state
automaton)
Represent the principles of a particular spelling
change (e.g. “in+p -> imp”) in a mapping between
this and the surface form which enforces this but
leaves everything else unchanged
Mappings can be implemented by finite state
transducers, which can efficiently test correctness.
Morphology as tape-tape mapping
TAPE 1 (lexical): i n + p r o b a b l e + l y
TAPE 2 (surface): i m p r o b a b l y
(the “+” boundary symbols and deleted letters map to the empty string)
Different (partial) mappings involved:
• n_to_m: knows when to legally transform n to m
• y_to_i: knows when to legally transform y to i
  – play → plays, but dry → dries
• …
Finite State Automata
FSA can recognise or generate valid sequences
The red ball
The ball
The big red ball
…
[FSA diagram: states S0 to S2, with S2 the designated “end” state]
Think of a tape with part-of-speech symbols “det adj adj adj adj noun”
and a tape reader which changes state with each symbol. If, at the end
of the tape, the reader is in state S2, it is a valid Noun Phrase.
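The NP recogniser just described can be sketched as a transition table (the state numbering, with S2 as the accepting end state, and the tag names are assumptions about the missing diagram):

```python
# FSA for a simple noun phrase: det adj* noun
# state 0: start; state 1: after det (loops on adj); state 2: end state
TRANSITIONS = {
    (0, "det"): 1,
    (1, "adj"): 1,
    (1, "noun"): 2,
}

def accepts(tags):
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:       # no legal transition: reject
            return False
    return state == 2           # valid NP iff we finish in the end state

print(accepts(["det", "noun"]))                      # True:  "the ball"
print(accepts(["det", "adj", "adj", "noun"]))        # True:  "the big red ball"
print(accepts(["adj", "noun"]))                      # False: no determiner
```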
Finite State Transducers (FSTs)
A finite state transducer is
like a finite state automaton,
except that it accepts two tapes,
rather than one.
Each transition has a label a:b
where a is a symbol to appear
on the first tape and b on the second (e.g. d:100).
FSTs can be used to express a mapping between the
first tape and the second. (letter to ASCII code here)
FSTs: J&M (2nd Edition) section 3.4
Morphology of Negation
Consider the following negations:
inaccurate inadequate imbalanced incompetent
indefinite infinite immature impatience impossible insane
The negative forms are constructed by adding the abstract morpheme:
iN to the positive forms. N is realized as either n or m.
Rule 1: N -> m if followed by one of b, m, or p
Rule 2: N -> n (if rule 1 does not apply)
Possible FST
Rule 1: N -> m if followed by one of b, m, or p
Rule 2: N -> n (if rule 1 does not apply)
[FST diagram: from start state S0, which loops on X:X, the transition
N:m leads to S2 with transitions m:m, p:p, b:b, and the transition
N:n leads to S1; X is shorthand here for any character that is
not m, p or b]
FST – N to m before B = {b, m, p}
(Accepts a pair of tapes, and can be used to generate
one from the other)
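Rules 1 and 2 can be sketched as a simple transducer-like function. This is a sketch, not a true FST: it uses one symbol of lookahead where an FST would instead guess the output for N and verify it on the following transition.

```python
B = set("bmp")  # the triggering class from Rule 1

def realize(lexical):
    """Map a lexical form containing the abstract morpheme N to its surface form."""
    out = []
    for i, ch in enumerate(lexical):
        if ch == "N":
            nxt = lexical[i + 1] if i + 1 < len(lexical) else ""
            out.append("m" if nxt in B else "n")  # Rule 1, else Rule 2
        else:
            out.append(ch)  # everything else is copied unchanged (X:X)
    return "".join(out)

print(realize("iNpossible"))  # impossible
print(realize("iNbalanced"))  # imbalanced
print(realize("iNaccurate"))  # inaccurate
```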
Parallel FSTs for morphology
(Can compile a set of
parallel FSTs into one
giant FST)
Summary
The lexicon is a vital part of an NLP system
Lexicons need to be organised properly to ease
creation and maintenance (e.g. object oriented)
Various information needs to be stored about a word
Words belong to classes and change in form
according to rules of morphology (2 kinds)
Simple analysis of regular morphology is quite easy
for English (Porter stemmer)
Other languages or more complete coverage may
require more sophisticated techniques (FSTs)