Morphology, Lexicons, Sentiment Lexicons

CS4025: Morphology and the Lexicon
Words
 The Lexicon
 Morphology
Reading: J&M (chapter 3 in both editions)
Types of Words
Open-class, content words:

Nouns (shoe, tablecloth, cause, E-insertion…)

Verbs (see, walk, cause, forget, promise, …)

Adjectives (big, small, reliable, three-legged, …)

Adverbs (quickly, well, reliably, …)
Function (closed-class) words
• Prepositions (to, from, on, ...), determiners (a, the, an, …) etc
See J&M, sect 8.1
Lexicons
 Lexicons are databases of word information
 The “dictionary” of an NLP system
 A good lexicon is critical to performance
» “the system with the bigger lexicon always wins”
Dictionary: Child
The following is, of course, written for humans:

child \'chi-(*)ld\ n pl children [ME, fr. OE cild] 1a: an unborn or recently
born person 1b: dialect: a female infant 2a: a young person of either
sex esp. between infancy and youth 2b: a childlike or childish person
2c: a person not yet of age...
» [From Webster’s on-line dictionary]
On the other hand, WordNet is an Ontology of words for
human and computer use.
Word Information
An NLP system needs to know
» Spelling
» Category and subcategory
» Inflections (plurals, past, etc)
» What the word corresponds to in DB or KB
» Statistical information
• Frequency, contexts in which it appears
» maybe pronunciation
» probably not derivation
Example: Person
 Person
» Category: noun
» Subcategory: count noun
» Inflections: plural = people
» Database correspondence: person class
» Frequency: .03%
Word Meaning
 What do these words mean?
» He told a lie
» The temperature fell in the afternoon
 Many subtleties, difficult to represent
 Most NLP systems ignore subtleties, and use very
simple or coarse models of meaning
Digression: approaches to word meaning
1. Create a network of all possible word senses, with
links between them (e.g. for hyponym, antonym). A
word then has a number of these possible senses
(WordNet) – this is an expensive undertaking.
2. Try to decompose word senses into complex
expressions involving primitive concepts (Schank,
Jackendoff) – only possible in limited sense areas.
In either case, the senses/concepts need to be related
to domain objects (e.g. database fields)
[See later lecture on Semantics]
Digression: approaches to word meaning
3. Alternatively, model word meaning statistically:
– The meaning of a word is some function of the
contexts in which it appears (word vectors):
• beer = [pint:0.1, glass:0.2, ale:0.1, drink:0.2, …]
• Usually word vectors are compressed using
neural networks to create a representation
where each word is represented by 500
numbers (a vector in 500-dimensional space)
– In vector space one can reason about meaning!
Word2Vec (similarity of words)
http://deeplearning4j.org/word2vec.html
 Similarity of words (you can calculate the cosine between
vectors)
 Words most similar to “Sweden”:
 If you cluster similar words, you can create dictionary-like “word
senses”
Word2Vec (analogies)
http://deeplearning4j.org/word2vec.html
Analogies (King is to Queen as Man is to ?) as arithmetic:
 King – Queen + Man = ?
» [woman, Attempted abduction, teenager, girl]
 house:roof :: castle:?
» [dome, bell_tower, spire, crenellations, turrets]
 knee:leg :: elbow:?
» [forearm, arm, ulna_bone]
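The analogy arithmetic above can be sketched in a few lines. The four-dimensional toy vectors and the tiny vocabulary are invented for illustration; real word2vec vectors are learned from large corpora and have hundreds of dimensions.

```python
import math

# Invented toy vectors: the numbers are chosen by hand so that the
# "royalty" and "gender" directions are visible, purely to illustrate.
vectors = {
    "king":  [0.9, 0.8, 0.1, 0.2],
    "queen": [0.9, 0.1, 0.8, 0.2],
    "man":   [0.1, 0.9, 0.1, 0.3],
    "woman": [0.1, 0.1, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Solve a : b :: c : ? by finding the word whose vector is
    closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    target = [vectors[b][i] - vectors[a][i] + vectors[c][i]
              for i in range(len(vectors[a]))]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))  # queen
```

The same arithmetic also covers the "differences" slide: subtracting one vector from another and looking for the nearest remaining word.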
Word2Vec (differences)
http://deeplearning4j.org/word2vec.html
Calculating the difference of word vectors:
 Iraq – Violence = ?
» Jordan
 Human – Animal = ?
» Ethics
 President – Power = ?
» Prime Minister
 Library – Books = ?
» Hall
2. Decompose meaning into primitives
KILL =
CAUSE TO die =
CAUSE TO NOT LIVE
THROW =
project WITH hand =
(CAUSE to MOVE) WITH BODYPART
General inference rules can be formulated in terms of
CAUSE, NOT, WITH etc.
Lexicon Structure
 Object-oriented representation
» Noun, Verb, etc are classes
» Individual words are instances
» Slots (data members) hold information.
 Use of inheritance
» Members of a class inherit properties (e.g. values of slots,
applicable rules)
» Multiple inheritance is common
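The object-oriented organisation above might be sketched as follows; the class names and slots are illustrative assumptions, not a real lexicon API:

```python
# A minimal object-oriented lexicon sketch: classes for categories,
# instances for individual words, slots for word information.
class Word:
    def __init__(self, spelling, frequency=0.0):
        self.spelling = spelling      # slot: spelling
        self.frequency = frequency    # slot: statistical information

class Noun(Word):
    def __init__(self, spelling, plural=None, **kw):
        super().__init__(spelling, **kw)
        # An irregular plural is stored explicitly; otherwise the
        # instance inherits the default "+s" rule from the class.
        self.plural = plural if plural else spelling + "s"

class Verb(Word):
    pass

class TransitiveVerb(Verb):   # subclasses refine the subcategory
    pass

person = Noun("person", plural="people", frequency=0.0003)
chair = Noun("chair")
print(person.plural, chair.plural)  # people chairs
```

Inheritance keeps the lexicon maintainable: regular behaviour lives once on the class, and only exceptions are stored on individual words.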
Example: Lexicon Structure
Word
├── Noun
├── Verb
│   ├── Intransitive: sleep
│   ├── Transitive: eat, have
│   └── Bitransitive: give
└── Adjective
(the individual words at the leaves carry the detailed lexical information)
Morphology
 Variations of a root form of a word
» prefixes, suffixes
 Inflectional morph - same core meaning
» plurals, past tense, superlatives
» part of speech unchanged
 Derivational morph - change meaning
» prefix re means do again: reheat, resit
» suffix er means one who: teacher, baker
» part of speech changed
» e.g. Antidisestablishmentarianism
Noun Inflections
 Nouns are inflected to form plurals, usually by adding s
 Example
» chair - Tom has one chair
» chairs - John has 2 chairs
 Also possessive inflection with ‘s
» The boy’s mother
Adjective Inflections
 Adjectives are inflected to form comparative (er) and
superlative (est) forms.
 Example:
» fast - A Sun is fast.
» faster - A Sun is faster than a PC.
» fastest - The Cray is the fastest computer
Verb Inflections
 Tense and aspect
» infinitive (root) - I like to walk to the store
» past (ed) - I walked to the store
» past participle (ed or en) - I have walked to the store
» present participle (ing) - I am walking to the store.
 Agreement with subject
» pres/3sing (s) - John walks to the store
Syntactic Aside: Verb Group
 Verb group is the main verb plus helping words (auxiliaries).
 Encodes information such as tense
 John will watch TV (future – add will)
 John watches TV (present – +s form of verb for
third-person singular subjects)
 John is watching TV (progressive – form of BE
verb, plus +ing form of verb)
 John watched TV (past – use +ed form of verb)
Syntactic Aside: Verb group
 Negation
 Usually add “not” after first word of verb group
» John will not watch TV
 Exception: add “do not” before a 1-word VG
» Inflections go on do; use infinitive form of main verb
» John does not watch TV vs *John watches not TV
 Exception to exception: use first strategy if verb is a
form of BE
» John is not happy vs *John does not be happy
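The three negation rules can be sketched as a toy function. The auxiliary list and the do-support morphology below are simplified assumptions, not a full grammar:

```python
# Assumed word lists for the sketch; a real system would consult its
# lexicon for the category of each word.
BE_FORMS = {"am", "is", "are", "was", "were"}

def negate_verb_group(words):
    """words: a verb group as a list of tokens, e.g. ['will', 'watch'].
    Applies: (1) insert 'not' after the first word of a multi-word VG
    or a form of BE; (2) otherwise use do-support."""
    first = words[0]
    if len(words) > 1 or first in BE_FORMS:
        return [first, "not"] + words[1:]
    # One-word group, not a form of BE: do-support. Toy morphology:
    # strip a 3rd-person-singular ending to recover the infinitive.
    if first.endswith("es"):
        return ["does", "not", first[:-2]]
    if first.endswith("s"):
        return ["does", "not", first[:-1]]
    return ["do", "not", first]

print(negate_verb_group(["will", "watch"]))  # ['will', 'not', 'watch']
print(negate_verb_group(["watches"]))        # ['does', 'not', 'watch']
print(negate_verb_group(["is"]))             # ['is', 'not']
```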
Spelling Changes
 Some spelling changes are automatically made when
adding a suffix to a word.
 Delete final “e” when the ending starts with a vowel
» write + ing = writing, not writeing
 Change final “y” to “i”
» fry + ed = fried, not fryed
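These two rules can be sketched directly; the sketch deliberately ignores other changes such as consonant doubling (run + ing -> running):

```python
VOWELS = set("aeiou")

def add_suffix(root, suffix):
    """Apply the two spelling rules above when attaching a suffix."""
    # Rule 1: delete final "e" when the ending starts with a vowel.
    if root.endswith("e") and suffix[0] in VOWELS:
        return root[:-1] + suffix
    # Rule 2: change final "y" to "i" (here only before "ed", and only
    # when the y follows a consonant: fry -> fried, but play -> played).
    if root.endswith("y") and root[-2] not in VOWELS and suffix == "ed":
        return root[:-1] + "i" + suffix
    return root + suffix

print(add_suffix("write", "ing"))  # writing
print(add_suffix("fry", "ed"))     # fried
print(add_suffix("walk", "ed"))    # walked
```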
Irregular forms
 Some words have irregular forms that must be looked
up in a dictionary
» plural of child is children, not childs
» past of eat is ate, not eated
 Irregular forms are usually maintained when a prefix
is added
» past of rethink is rethought
Rules for spelling variants
 Example: plural
 Usually add “s” (dogs)
 But add “es” if base noun ends in certain letters
(boxes, guesses)
 Also change final “y” to “i” (try → tries)
» But not if y is preceded by a vowel:
buy → buys, not buies
 Of course many special cases that need to be
enumerated for a language
» children (vs childs), people (vs persons), etc
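These plural rules can be sketched as a function, with a small table of enumerated special cases (the lists below are illustrative, not exhaustive):

```python
IRREGULAR_PLURALS = {"child": "children", "person": "people", "man": "men"}
VOWELS = set("aeiou")
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")

def pluralize(noun):
    """Apply the plural rules above, checking special cases first."""
    if noun in IRREGULAR_PLURALS:                # enumerated exceptions
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(SIBILANT_ENDINGS):          # box -> boxes, guess -> guesses
        return noun + "es"
    if noun.endswith("y") and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"                 # try -> tries
    return noun + "s"                            # dog -> dogs, buy -> buys

for n in ["dog", "box", "try", "buy", "child"]:
    print(n, "->", pluralize(n))
```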
NL Tasks
 Morphological analysis
» Replace inflected forms by root+inflection
– The children ate apples
becomes
– The child+s eat+ed apple+s
 Morphological generation
» Replace root+inflection by inflected forms
– The child+s eat+ed apple+s
becomes
– The children ate apples
Implementation – 1: stemming
 Suffix stripping - 2 pages of code with rules like:
– if the word ends in 'ed', remove the 'ed'
– if the word ends in 'ing', remove the 'ing'
– if the word ends in 'ly', remove the 'ly'
 Example: Porter stemmer (1979) in Information Retrieval

suffix | replacement | examples
ed     | null        | plastered -> plaster, bled -> bl

(many exceptions needed, hence the two pages of code!)
 Dictionary of special cases
» 1500 special case rules (Sussex morpha)
 More complex processing is needed for languages with
complex morphology.
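A minimal suffix stripper in this spirit is shown below. It is deliberately naive: it reproduces the bled -> bl error from the table, which is exactly why real stemmers need pages of exception rules. The suffix list and special-case table are illustrative assumptions:

```python
# Suffixes tried longest-first so "walking" matches "ing", not "s".
SUFFIXES = ["ing", "ly", "ed", "es", "s"]
# Dictionary of special cases, consulted before any rule applies.
SPECIAL_CASES = {"children": "child", "people": "person", "ate": "eat"}

def stem(word):
    if word in SPECIAL_CASES:
        return SPECIAL_CASES[word]
    for suffix in SUFFIXES:
        # Require a little material before the suffix, so very short
        # words like "red" or "is" are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

print(stem("plastered"))  # plaster
print(stem("walking"))    # walk
print(stem("bled"))       # bl  <- the over-stripping error noted above
```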
Real Morphology (Turkish)
 Uygarlastiramadiklarimizdanmissinizcasina
» uygar - civilized
» +las +BEC (become)
» +tir +CAUS (cause to)
» +ama +NEGABLE (not able)
» +dik +PPART (past tense)
» +lar +PL (plural)
» +imiz +P1PL (first person plural - we)
» +dan +ABL
» +mis +PAST
» +siniz +2PL
» +casina +ASIF
» “(behaving) as if you are among those whom we could not
cause to become civilized”
Implementation 2: analysis & generation
 Express possible word analyses as simple
concatenations of morphemes, e.g. “in+probable+ly”
(can express allowable combinations via a finite state
automaton)
 Represent the principles of a particular spelling
change (e.g. “in+p -> imp”) in a mapping between
this and the surface form which enforces this but
leaves everything else unchanged
 Mappings can be implemented by finite state
transducers, which can efficiently test correctness.
Morphology as tape-tape mapping
TAPE 1 (lexical form): i n + p r o b a b l e + l y
TAPE 2 (surface form): i m 0 p r o b a b l 0 0 0 y
(0 = the empty string; tape 2 reads “improbably”)
Different (partial) mappings involved:
• n_to_m: knows when to legally transform n to m
• y_to_i: knows when to legally transform y to i
– play → plays, but dry → dries
•…
Finite State Automata
 FSA can recognise or generate valid sequences
» The red ball
» The ball
» The big red ball
» …
 Think of a tape with part-of-speech symbols “det adj adj adj adj noun”
and a tape reader which changes state with each symbol. If at the end
of the tape, the reader is in state S2 (the designated “end” state), it is a
valid Noun Phrase.
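The noun-phrase automaton described above can be sketched as a transition table; the exact wiring of states S0-S2 is an assumption about the pictured diagram:

```python
# FSA for the pattern "det adj* noun": a determiner, any number of
# adjectives, then a noun, accepting in the designated end state S2.
TRANSITIONS = {
    ("S0", "det"): "S1",
    ("S1", "adj"): "S1",   # loop: any number of adjectives
    ("S1", "noun"): "S2",
}
END_STATE = "S2"

def is_noun_phrase(tags):
    """Read a tape of part-of-speech symbols, changing state on each;
    the tape is a valid NP if we finish in the end state."""
    state = "S0"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:          # no legal transition: reject
            return False
    return state == END_STATE

print(is_noun_phrase(["det", "noun"]))                # True  (the ball)
print(is_noun_phrase(["det", "adj", "adj", "noun"]))  # True  (the big red ball)
print(is_noun_phrase(["adj", "noun"]))                # False
```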
Finite State Transducers (FSTs)
 A finite state transducer is like a finite state automaton,
except that it accepts two tapes rather than one.
 Each transition has a label a:b, where a is a symbol to appear
on the first tape and b on the second (e.g. d:100).
 FSTs can be used to express a mapping between the
first tape and the second (here, letter to ASCII code).
FSTs: J&M (2nd Edition) section 3.4
Morphology of Negation
Consider the following negations:

inaccurate inadequate imbalanced incompetent

indefinite infinite immature impatience impossible insane
The negative forms are constructed by adding the abstract morpheme
iN to the positive forms. N is realized as either n or m.
Rule 1: N -> m if followed by one of b, m, or p
Rule 2: N -> n (if rule 1 does not apply)
Possible FST
Rule 1: N -> m if followed by one of b, m, or p
Rule 2: N -> n (if rule 1 does not apply)
States S0 (start), S1, S2, with transitions:
» S0 --X:X--> S0 (copy ordinary characters unchanged)
» S0 --N:m--> S2, where S2 then requires one of m:m, p:p, b:b
» S0 --N:n--> S1 (when rule 1 does not apply)
X is shorthand here for any character that is not m, p or b
FST – n to m before B ={b,m,p}
(Accepts a pair of tapes, and can be used to generate
one from the other)
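The two rules can also be implemented directly, treating the abstract morpheme N as a symbol to rewrite while copying every other character unchanged, which mirrors the tape-to-tape behaviour of the FST:

```python
def realise(lexical):
    """Map a lexical string containing the abstract morpheme N to its
    surface form: N -> m before b, m, or p (rule 1), else N -> n (rule 2)."""
    out = []
    for i, ch in enumerate(lexical):
        if ch == "N":
            nxt = lexical[i + 1] if i + 1 < len(lexical) else ""
            out.append("m" if nxt in {"b", "m", "p"} else "n")
        else:
            out.append(ch)   # the X:X case: copy everything else
    return "".join(out)

print(realise("iNbalanced"))  # imbalanced
print(realise("iNaccurate"))  # inaccurate
print(realise("iNpossible"))  # impossible
```

A full two-level system would compose many such mappings (n_to_m, y_to_i, e-deletion, …), which is what compiling parallel FSTs into one giant FST achieves.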
Parallel FSTs for morphology
(Can compile a set of
parallel FSTs into one
giant FST)
Summary
 The lexicon is a vital part of an NLP system
 Lexicons need to be organised properly to ease
creation and maintenance (e.g. object oriented)
 Various information needs to be stored about a word
 Words belong to classes and change in form
according to rules of morphology (2 kinds)
 Simple analysis of regular morphology is quite easy
for English (Porter stemmer)
 Other languages or more complete coverage may
require more sophisticated techniques (FSTs)