
A group structure for strings:
Towards a learning algorithm for morphophonology
John A. Goldsmith
July 11, 2015
Abstract
In this paper, we define a group structure over strings and note briefly that by applying this computation to words, we obtain major steps towards a method for identifying allomorphy and learning morphophonemics. First order differences among a set of words forming a paradigm identify morphs, while second order differences identify allomorphy. When this allomorphy appears at a morpheme boundary, it can in a wide range of cases be identified as morphophonology.1
1 A group structure for strings
1.1 Defining difference operators
We are familiar with difference operators such as subtraction and division. We say that n = a − b iff n + b = a, and we say that n = a/b iff b × n = a (an approach known to the Greeks). We thus create new objects (negative numbers and rational numbers) as a way of generalizing the operations of addition and multiplication. (However, this method does not necessarily increase the universe of objects; this is particularly clear when we look at multiplication modulo a prime: if p = 5, then 1/1 = 1, 1/2 = 3, 1/3 = 2, and 1/4 = 4.)
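These reciprocals are easy to verify by brute force; the following short Python check (an added illustration, not part of the original computation) simply searches for the n with b × n ≡ 1 (mod p).

```python
# A quick check of the reciprocals modulo 5 cited above:
# 1/b (mod p) is the n with b * n ≡ 1 (mod p).
p = 5
for b in range(1, p):
    inverse = next(n for n in range(1, p) if (b * n) % p == 1)
    print(f"1/{b} = {inverse} (mod {p})")
# prints: 1/1 = 1, 1/2 = 3, 1/3 = 2, 1/4 = 4
```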
Our goal is to define two binary operators on strings (each of which has two directions: left-to-right and right-to-left; from a linguistic point of view, it is natural to call left-to-right operators suffixal, and right-to-left operators prefixal). One is (string) difference, and the other is (string) commonality. To make the discussion simpler, I will assume here that all of the discussion involves suffixal operators.
string S    string T    ∆R(S,T)          ∆L(S,T)           C_R(S,T)   C_L(S,T)
jumped      jumping     ed/ing           jumped/jumping    jump       ∅
jump        jumping     ∅/ing            jump/jumping      jump       ∅
walk        jump        walk/jump        walk/jump         ∅          ∅
walked      jumped      walked/jumped    walk/jump         ∅          ed
remind      mind        remind/mind      re/∅              ∅          mind

Figure 1: Some examples of string difference
1.2 Defining string difference
It is common to define a semi-group structure for concatenation of symbols in an alphabet A. This simply
assumes a finite set of symbols, which we may indicate mnemonically as a..z or A and which we call letters,
plus a binary concatenation operator represented by • when explicit, and simply by juxtaposition when
we do not need to be explicit. Concatenation is associative and not commutative. The set of all finite
concatenations of elements of A is A∗ , which includes a null string, the concatenation of no letters.
1 Thanks to Paul Goldsmith-Pinkham and Jason Riggle for helpful comments on this paper. This paper appeared as a 2011
technical report in the CS Department at the University of Chicago, and is available from the CS server/web page.
We add to this structure a null element for the concatenation operator, indicated ∅; A∗, together with ∅, forms a monoid with the concatenation operator.
In addition, we augment the alphabet by adding for each letter in A an inverse: the inverse of a will be indicated a−1, that of b as b−1, and so on: a • a−1 = aa−1 = a−1a = ∅. The set of all inverses of the letters in A is noted as A−1. We will use the capital script A to indicate the original alphabet A augmented with both the null symbol ∅ and the set of all inverses A−1. The set of strings over this augmented alphabet, with the concatenation operator •, is thus homomorphic to the free group over the original alphabet A.
We remind the reader that (ab)−1 = b−1a−1; thus (cats)−1 = s−1(cat)−1 = s−1t−1a−1c−1. It follows that (cat)−1cats = s and (cats)−1cat = s−1. Similarly, (walked)−1walking = (ed)−1ing.
The relation a • a−1 = ∅ imposes a group structure on the set of all strings from A, one in which
the two strings abb−1 c and ac are identified with the same element. In general, we will refer to
an element of a string group with the shortest string associated with it (i.e., with ac rather than
with abb−1 c), but not always; this point will arise in connection with the definition of a regular
paradigm, just below.
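As a concrete illustration of this reduction, here is a minimal Python sketch (the representation and the function names are my own, not the paper's) that stores a string over the augmented alphabet as a list of letter/exponent pairs and cancels adjacent a • a−1 pairs; it reproduces the (cats)−1cat and (cat)−1cats examples above.

```python
# A string over the augmented alphabet is held as a list of (letter, exponent) pairs,
# exponent +1 for an ordinary letter and -1 for its inverse.

def word(s, exponent=1):
    """Turn an ordinary string into a signed word; exponent=-1 builds its inverse."""
    pairs = [(ch, exponent) for ch in s]
    return pairs if exponent == 1 else list(reversed(pairs))   # (ab)^-1 = b^-1 a^-1

def reduce_word(w):
    """Cancel adjacent letter/inverse pairs: ... a a^-1 ... -> ..."""
    out = []
    for letter, exp in w:
        if out and out[-1][0] == letter and out[-1][1] == -exp:
            out.pop()                      # a . a^-1 = the null string
        else:
            out.append((letter, exp))
    return out

def concat(*words):
    """Concatenate signed words and reduce the result."""
    return reduce_word([pair for w in words for pair in w])

def show(w):
    return "∅" if not w else "".join(l if e == 1 else l + "⁻¹" for l, e in w)

print(show(concat(word("cats", -1), word("cat"))))   # s⁻¹
print(show(concat(word("cat", -1), word("cats"))))   # s
```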
Generalized concatenation operator. It will become convenient below to generalize the binary concatenation operator • to a ternary operator σ, whose first argument is the string we will modify, whose second argument is the position at which modification begins, and whose third argument is the string that will be added. We define σ(s; 0, t) as t • s and σ(s; −1, t) as s • t. This should be interpreted as follows: in σ(s; 0, t), s is taken as a string to which we will do something: we add the string t at the position in s specified by the second parameter of σ. The string-initial position of a string is position 0, while the string-final position (following traditional computer science notation) is position -1.2
It will often be convenient to talk about the map created by lambda abstraction out of the first argument
position of generalized concatenation—that is, the map that concatenates the string t into another string at
some position α. Using ⋄ to mark the operand, we write: σ(α, t)(⋄) ≡ σ(⋄; α, t).
We may then speak of the composition of two maps φ, π (where φ, π map from A to A): φ ◦ π() is the
map that sends s to φ(π(s)).
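The operator σ and the maps obtained by abstracting over its first argument can be written down directly. The sketch below is mine and handles only the insertion of proper words (no inverse letters); it follows the convention that position 0 is string-initial and position -1 is string-final.

```python
def sigma(s, position, t):
    """Insert the string t into the string s at the given position
    (0 = string-initial, -1 = string-final, following the text's convention)."""
    slot = position if position >= 0 else position + len(s) + 1   # -1 -> len(s), etc.
    return s[:slot] + t + s[slot:]

def sigma_map(position, t):
    """Abstraction over the first argument: the map x -> σ(x; position, t)."""
    return lambda x: sigma(x, position, t)

def compose(phi, pi):
    """(φ ∘ π)(s) = φ(π(s))."""
    return lambda s: phi(pi(s))

print(sigma("jump", 0, "re"))     # rejump   (insertion at the beginning)
print(sigma("jump", -1, "ing"))   # jumping  (insertion at the end)

add_ing = sigma_map(-1, "ing")
print(compose(add_ing, sigma_map(0, "re"))("jump"))   # rejumping
```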
Proper words. Symbols from A and those from A−1 will play a slightly different role in some of what
follows, and it will be convenient to have a term to refer to strings strictly from A∗ ; we will call these strings
proper words.
Defining left- and right- string difference. Our first goal is to define the difference of two strings; we initially define a left-difference and a right-difference in the natural way. The right-difference of s and t is defined as t−1 • s = σ(s; 0, t−1), and for convenience written as ∆R(s, t) or (s/t)R. The left-difference of s and t is defined as s • t−1, and written ∆L(s, t) or (s/t)L. See Figure 1 for examples that clarify this notation.
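For proper words the reduced differences and commonalities can be computed by cancelling the shared prefix or suffix. In the sketch below (mine), differences are printed in the fraction notation of Figure 1, and C_R and C_L are taken to be the shared prefix and the shared suffix respectively.

```python
def common_prefix(s, t):
    k = 0
    while k < min(len(s), len(t)) and s[k] == t[k]:
        k += 1
    return s[:k]

def common_suffix(s, t):
    return common_prefix(s[::-1], t[::-1])[::-1]

def delta_R(s, t):
    """Right-difference t⁻¹·s, reduced: the shared prefix cancels, leaving suffix residues."""
    k = len(common_prefix(s, t))
    return f"{s[k:] or '∅'}/{t[k:] or '∅'}"

def delta_L(s, t):
    """Left-difference s·t⁻¹, reduced: the shared suffix cancels, leaving prefix residues."""
    k = len(common_suffix(s, t))
    return f"{s[:len(s) - k] or '∅'}/{t[:len(t) - k] or '∅'}"

for s, t in [("jumped", "jumping"), ("walked", "jumped"), ("remind", "mind")]:
    print(s, t, delta_R(s, t), delta_L(s, t),
          common_prefix(s, t) or "∅", common_suffix(s, t) or "∅")
# jumped jumping ed/ing jumped/jumping jump ∅
# walked jumped walked/jumped walk/jump ∅ ed
# remind mind remind/mind re/∅ ∅ mind
```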
The set of all right differences among the strings in A∗ is a subset of the strings over the augmented alphabet, and consists of (A−1)∗ • A∗; it is natural to write this as (A∗/A∗)R. Restating this slightly, the set of right-string differences consists of all strings composed of two substrings: first, a string from the monoid generated by A−1, and then a string from the monoid generated by A.
However, at a more abstract level, it is natural to think of the right-difference of s and t as a function which maps t to s. If the right-difference of jumped and jumping is ed/ing, then it is reasonable to think of ed/ing as acting as a map from jumping to jumped (just as the fraction 3/5 is closely related to the multiplicative map from 5 to 3).
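Read as a map, a fraction such as ed/ing simply strips a final ing and attaches ed; the few lines below (my own illustration) make that reading explicit.

```python
def as_map(numerator, denominator):
    """The map corresponding to numerator/denominator: replace a final
    occurrence of the denominator with the numerator."""
    def apply(word):
        if not word.endswith(denominator):
            raise ValueError(f"{word!r} does not end in {denominator!r}")
        return word[:len(word) - len(denominator)] + numerator
    return apply

ed_over_ing = as_map("ed", "ing")
print(ed_over_ing("jumping"))   # jumped
print(ed_over_ing("walking"))   # walked
```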
1.3 Paradigms and difference arrays
We define a paradigm simply as a list of proper words. In actual applications, we have in mind lists such as walk, walks, walked, walking or am, is, are, was, were, be. If P is such a list, the number of words that compose it is its length, and we refer to the kth item in the list as P[k].
We define a right- (resp., left-) difference array between two lists of proper words, P and Q, each of length n, as the n × n array D whose (i, j)th element is the right- (resp., left-) difference of P[i] and Q[j].
2 More generally, if −|s| − 1 ≤ n ≤ |s|, then σ(s; n, t) is defined as the insertion of t into s at the nth position in s, and if
0 ≤ n ≤ |s|, then σ(s; n, t) = σ(s; n − |s| − 1, t). We will not use negative positions other than -1 very much.
We will primarily be interested in cases of self-difference, i.e., the difference between a paradigm and itself. In Figures 2 and 4 we see two examples of this, one from the English verb jump (the pattern called a weak verb in English), and another from part of the paradigm of a Spanish -ar verb.
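The array in Figure 2 can be generated mechanically. In the sketch below (mine), each cell is the reduced right-difference of the corresponding pair of words; the off-diagonal cells come out exactly as in Figure 2, while the diagonal cells reduce all the way to ∅/∅.

```python
def common_prefix(s, t):
    k = 0
    while k < min(len(s), len(t)) and s[k] == t[k]:
        k += 1
    return s[:k]

def delta_R(s, t):
    k = len(common_prefix(s, t))
    return f"{s[k:] or '∅'}/{t[k:] or '∅'}"

def difference_array(P, Q):
    """The array D with D[i][j] = ∆R(P[i], Q[j])."""
    return [[delta_R(p, q) for q in Q] for p in P]

paradigm = ["jump", "jumps", "jumped", "jumping"]
for row in difference_array(paradigm, paradigm):
    print("  ".join(f"{cell:7}" for cell in row))
# ∅/∅      ∅/s      ∅/ed     ∅/ing
# s/∅      ∅/∅      s/ed     s/ing
# ed/∅     ed/s     ∅/∅      ed/ing
# ing/∅    ing/s    ing/ed   ∅/∅
```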
           jump      jumps     jumped    jumping
jump       ∅/∅       ∅/s       ∅/ed      ∅/ing
jumps      s/∅       s/s       s/ed      s/ing
jumped     ed/∅      ed/s      ed/ed     ed/ing
jumping    ing/∅     ing/s     ing/ed    ing/ing

Figure 2: An array of first order differences
In like manner, we generate arrays of commonalities. We note that the diagonal elements are by definition the same as the corresponding words of the word list, while in the simple case presented here, all the other entries are the same (jump). We will leave out the diagonal elements hereafter.
By definition, the difference arrays are skew-symmetric (corresponding elements across the major diagonal are reciprocals of each other), and the commonality array is symmetric.
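Both arrays and both symmetry claims can be checked mechanically. The sketch below (mine) builds the commonality array of Figure 3 as an array of shared prefixes and asserts the symmetry of the commonalities and the skew-symmetry of the differences.

```python
def common_prefix(s, t):
    k = 0
    while k < min(len(s), len(t)) and s[k] == t[k]:
        k += 1
    return s[:k]

def delta_R(s, t):
    k = len(common_prefix(s, t))
    return (s[k:], t[k:])                 # (numerator, denominator)

paradigm = ["jump", "jumps", "jumped", "jumping"]
n = len(paradigm)
commonality = [[common_prefix(p, q) for q in paradigm] for p in paradigm]
difference  = [[delta_R(p, q) for q in paradigm] for p in paradigm]

# the commonality array is symmetric ...
assert all(commonality[i][j] == commonality[j][i] for i in range(n) for j in range(n))
# ... and the difference array is skew-symmetric: D[i][j] is the reciprocal of D[j][i]
assert all(difference[i][j] == difference[j][i][::-1] for i in range(n) for j in range(n))

for row in commonality:
    print("  ".join(f"{cell:8}" for cell in row))
# every off-diagonal cell is jump; each diagonal cell is the word itself
```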
           jump      jumps     jumped    jumping
jump       jump      jump      jump      jump
jumps      jump      jumps     jump      jump
jumped     jump      jump      jumped    jump
jumping    jump      jump      jump      jumping

(residual suffixes: ∅, s, ed, ing)

Figure 3: Corresponding array of commonalities
            hablar    hablo     hablas    habla     hablamos  hablan    hablé     hable     hables
hablar      ·         ar/o      r/s       r/∅       r/mos     r/n       ar/é      ar/e      ar/es
hablo       o/ar      ·         o/as      o/a       o/amos    o/an      o/é       o/e       o/es
hablas      s/r       as/o      ·         s/∅       s/mos     s/n       as/é      as/e      as/es
habla       ∅/r       a/o       ∅/s       ·         ∅/mos     ∅/n       a/é       a/e       a/es
hablamos    mos/r     amos/o    mos/s     mos/∅     ·         mos/n     amos/é    amos/e    amos/es
hablan      n/r       an/o      n/s       n/∅       n/mos     ·         an/é      an/e      an/es
hablé       é/ar      é/o       é/as      é/a       é/amos    é/an      ·         é/e       é/es
hable       e/ar      e/o       e/as      e/a       e/amos    e/an      e/é       ·         ∅/s
hables      es/ar     es/o      es/as     es/a      es/amos   es/an     es/é      s/∅       ·

(· marks the omitted diagonal)

Figure 4: An array of first order differences: Spanish a-stems
            hablar    hablo     hablas    habla     hablamos  hablan    hablé     hable     hables
hablar      ·         habl      habla     habla     habla     habla     habl      habl      habl
hablo       habl      ·         habl      habl      habl      habl      habl      habl      habl
hablas      habla     habl      ·         habla     habla     habla     habl      habl      habl
habla       habla     habl      habla     ·         habla     habla     habl      habl      habl
hablamos    habla     habl      habla     habla     ·         habla     habl      habl      habl
hablan      habla     habl      habla     habla     habla     ·         habl      habl      habl
hablé       habl      habl      habl      habl      habl      habl      ·         hable     hable
hable       habl      habl      habl      habl      habl      habl      hable     ·         hable
hables      habl      habl      habl      habl      habl      habl      hable     hable     ·

(· marks the omitted diagonal)

Figure 5: An array of commonalities: Spanish a-stems
1.4 Regular paradigms
A regular paradigm is one for which the self-difference array has a single, common numerator in each row, and a single denominator in each column, and in which all of the commonalities are identical (off the major diagonal, of course). We see that the paradigm {jump, jumps, jumped, jumping} is regular, but the Spanish paradigm {hablar, hablo, hablas, ...} is not.
The intuition that this definition is intended to capture is that a paradigm is regular if the words that form it can be consistently divided into a stem and a suffix. Why does the condition of regularity fail in a case like that of Spanish hablar? For the nine cases shown here, five cases show the vowel a following the initial sequence habl, while two show e, one shows é, and one shows o. The evidence suggests (correctly) that the stem ought to contain a—but it does not always. One could cut the verb stem after habl, and have a consistent set of suffixes, but there would be a lot of suffixes then that all begin with a.
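The regularity condition can be restated as a small test over the reduced self-difference array and the commonalities; the Python sketch below is my own reading of the definition, and it accepts the jump paradigm while rejecting the Spanish hablar paradigm.

```python
def common_prefix(s, t):
    k = 0
    while k < min(len(s), len(t)) and s[k] == t[k]:
        k += 1
    return s[:k]

def delta_R(s, t):
    k = len(common_prefix(s, t))
    return (s[k:], t[k:])                 # (numerator, denominator), reduced

def is_regular(paradigm):
    n = len(paradigm)
    cells = {(i, j): delta_R(paradigm[i], paradigm[j])
             for i in range(n) for j in range(n) if i != j}
    one_numerator_per_row = all(
        len({cells[i, j][0] for j in range(n) if j != i}) == 1 for i in range(n))
    one_denominator_per_column = all(
        len({cells[i, j][1] for i in range(n) if i != j}) == 1 for j in range(n))
    stems = {common_prefix(paradigm[i], paradigm[j]) for (i, j) in cells}
    return one_numerator_per_row and one_denominator_per_column and len(stems) == 1

print(is_regular(["jump", "jumps", "jumped", "jumping"]))     # True
print(is_regular(["hablar", "hablo", "hablas", "habla", "hablamos",
                  "hablan", "hablé", "hable", "hables"]))     # False
```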
The paradigm for a verb such as jump in English, if we include the agentive form jumper, is given in Figure 8.
Regular verbal patterns:   jump, jumped, jumping, jumps;    walk, walked, walking, walks
e-final verbal pattern:    move, moved, moving, moves;      love, loved, loving, loves;      hate, hated, hating, hates
s-final pattern:           push, pushed, pushing, pushes;   miss, missed, missing, misses;   veto, vetoed, vetoing, vetoes
C-doubling pattern:        tap, tapped, tapping, taps;      slit, slitted, slitting, slits;  nag, nagged, nagging, nags
y-final pattern:           try, tried, trying, tries;       cry, cried, crying, cries;       lie*, lied, lying, lies

Figure 6: Some related paradigms
[Figure 7 presents the self-difference arrays for the verb patterns of Figure 6, illustrated with jump, move, push, try, and slit. Each array is built as in Figure 2, and beneath each array the suffix alternants of the pattern are collected: for example d~ed and e~∅ for move, es~s for push, y~∅, ies~s, ied~d and ing~ying for try, and ed~ted and ing~ting for slit.]

Figure 7: 5 arrays of self-differences for verb patterns
[Figure 8 gives the self-difference and commonality arrays for the paradigm jump, jumps, jumped, jumper, jumping. Because jumped and jumper share the prefix jumpe, the fully reduced array contains the entries d/r and r/d, and the commonality jumpe appears alongside jump, so the paradigm is not regular. The figure then gives the equivalent regular analysis, in which the same cells are written with the longer representatives ed/er and er/ed, every off-diagonal commonality is jump, and the suffixes are ∅, s, ed, er, ing.]

Figure 8: Non-regular paradigm, and equivalent regular paradigm
1.5 Morphs arise from first order differences
A decision on how to indicate first order differences is ipso facto a decision on how to analyze a word into morphs. The decision to indicate left-differences is a decision to find prefixes; the decision to indicate right-differences is a decision to find suffixes.
2 Second order differences: it starts to get interesting
We are now in a position to compare the self-difference arrays corresponding to different paradigms.3 This comparison is accomplished by defining a second-order difference, a difference between two items in the first-order difference array, that is, a difference between two items in (A−1)∗ • A∗, such as a difference between ing/s and ing/es.
As we noted above, we can think of objects such as ing/s either as (1) strings in A∗ — more specifically, in (A−1)∗ • A∗ ⊂ A∗ — or (2) as maps from sets of strings to sets of strings. It is the latter interpretation which is really interesting. As a map, ing/s is an object that takes a string ending in s and returns the same string with the s replaced by ing. Intuitively, then, to change a map of the form f() = a/b into c/d, we create a map that first removes a final d, adds a final b, applies the function f(), removes the final a created by f(), and then adds a final c.
We can rephrase this more clearly in terms of operators:

    ((c/d) / (a/b)) f() = σ(−1, c) σ(−1, a−1) f σ(−1, b) σ(−1, d−1) = ∆L(c, a) f ∆L(d, b) = (c/a) f (b/d)        (1)
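Equation (1) can be implemented directly as a composition of end-of-string operators; in the sketch below (mine), the map ing/s is rebuilt as the map ing/es in exactly this way.

```python
def append(t):                      # σ(−1, t): attach t at the string-final position
    return lambda s: s + t

def strip(t):                       # σ(−1, t⁻¹): cancel a final t
    def g(s):
        assert s.endswith(t), f"{s!r} does not end in {t!r}"
        return s[:len(s) - len(t)]
    return g

def compose(*maps):                 # compose(f, g, h)(s) = f(g(h(s)))
    def composed(s):
        for m in reversed(maps):
            s = m(s)
        return s
    return composed

def fraction(a, b):                 # the map a/b: strip a final b, attach a
    return compose(append(a), strip(b))

def convert(f, a, b, c, d):         # rebuild f = a/b as the map c/d, per equation (1)
    return compose(append(c), strip(a), f, append(b), strip(d))

f = fraction("ing", "s")                   # ing/s : jumps -> jumping
g = convert(f, "ing", "s", "ing", "es")    # ing/es, built out of f
print(f("jumps"))                          # jumping
print(g("pushes"))                         # pushing
```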
Still, these matrices are quite similar to one another. We can formalize that observation if we extend the notion of string difference we defined just above. We extend the definition of ∆R to Σ∗ × Σ∗ in this way:

    ∆R(s, t) = σ(s; 0, t−1)        (2)

so that, for two first-order differences a/b and c/d,

    ∆R(a/b, c/d) = σ(−1, a) σ(−1, c−1) σ(−1, c) σ(−1, d−1) σ(−1, d) σ(−1, b−1) f = ∆L(a, c) / ∆L(b, d)        (3)
If we define ∆L on a matrix as the item-wise application of that operation to the individual members, then we can express the difference between two such arrays in this way (where we indicate ∅/∅ with a blank). See Figures 7 and 8 on the next two pages.
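Comparing two self-difference arrays cell by cell, with each comparison reduced as in equation (3) to a left-difference of the numerators and a left-difference of the denominators, can be sketched as follows; the code and the choice of the jump and try paradigms are mine.

```python
def common_prefix(s, t):
    k = 0
    while k < min(len(s), len(t)) and s[k] == t[k]:
        k += 1
    return s[:k]

def common_suffix(s, t):
    return common_prefix(s[::-1], t[::-1])[::-1]

def delta_R(s, t):                  # reduced right-difference: (numerator, denominator)
    k = len(common_prefix(s, t))
    return (s[k:], t[k:])

def delta_L(s, t):                  # reduced left-difference: (numerator, denominator)
    k = len(common_suffix(s, t))
    return (s[:len(s) - k], t[:len(t) - k])

def self_difference(P):
    return {(i, j): delta_R(P[i], P[j])
            for i in range(len(P)) for j in range(len(P)) if i != j}

def second_order(P, Q):
    """Cell-wise comparison of two self-difference arrays (paradigms aligned by index)."""
    DP, DQ = self_difference(P), self_difference(Q)
    return {ij: (delta_L(DP[ij][0], DQ[ij][0]), delta_L(DP[ij][1], DQ[ij][1])) for ij in DP}

jump = ["jump", "jumps", "jumped", "jumping"]
try_ = ["try", "tries", "tried", "trying"]
for cell, value in sorted(second_order(jump, try_).items()):
    print(cell, value)
# cells such as ((0, 1), (('', 'y'), ('', 'ie'))) and ((2, 3), (('', 'i'), ('', 'y')))
# isolate the y ~ i(e) alternation between the two paradigms
```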
3 Brief reminder from information theory about why we care about collapsing cases
Morphology treats the items in the lexicon of a language (finite or infinite; let's assume finite to make the math easier). Any given analysis divides the lexicon up into a certain number of subgroups. If there are n subgroups, each equally likely, in a lexicon of size V (V for vocabulary), then marking each word costs −log2 (1/n) = log2 n bits. If the groups are not equally likely, and the ith group has ni members, then marking a word as being in that group costs −log2 (ni/V) = log2 (V/ni). Each word in the ith group needs to be marked, and all of those markings together cost ni × log2 (V/ni). If we can collapse two subgroups analytically, then we save a lot of bits. How many? If the two groups are equal-sized, then we save 1 bit for each item.
Why? Suppose we have two groups, g1 and g2, of 100 words each out of a vocabulary of 1000 words. Each item in those two groups is marked in the lexicon at a cost of log2 (1000/100) ≈ 3.32 bits; 200 such words cost us 200 × 3.32 bits = 664 bits. If they were all treated as part of a single category, the cost of pointing to the larger category would be −log2 (200/1000) = 2.32 bits, so we would pay a total of 200 × 2.32 = 464 bits, for a total saving of 200 bits. We actually compute how complex an analysis is. And the morphological analysis that Linguistica provides can be made "cheaper" by decreasing the number of distinct patterns it contains, by adding a (morpho)phonology component after the morphology.
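The arithmetic of this example can be checked in a few lines (an added verification, nothing more):

```python
from math import log2

V = 1000                                   # vocabulary size
separate = 2 * (100 * log2(V / 100))       # two groups of 100 words, marked separately
merged   = 200 * log2(V / 200)             # a single merged group of 200 words
print(round(separate), round(merged), round(separate - merged))   # 664 464 200
```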
[Figure 9 shows differences of differences comparing the self-difference array of jump with those of move, slit, push, and try, over the four slots 1. ∅, 2. s, 3. ed, 4. ing. The jump:move and jump:push panels isolate an e ~ ∅ alternation, the jump:slit panel a t ~ ∅ (consonant-doubling) alternation, and the jump:try panel the y ~ i(e) alternation.]

Figure 9: Difference of differences: English verb
2.1 Hungarian
See Figure 10 below.
2.2 Spanish
See Figure 11 below.
3 Conclusion
Let P be a sequence of words (think P[aradigm]) of length n.
We define the quotient P ÷ Q of two sequences P, Q of the same length n as an n × n matrix, where

    P ÷ Q (i, j) ≡ ∆L(p_i, q_j)

In particular,

    P ÷ P (i, j) ≡ ∆L(p_i, p_j)

We may then compare two paradigms as the second difference:

    ▽(P, Q) ≡ (P ÷ P) ÷ (Q ÷ Q)

Many morphophonological changes emerge as the second difference of sets ('paradigms') of words.
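A compact sketch of this comparison (my own code; the first-order arrays are built with the suffixal right-difference used in Figures 2 and 4, and the cells are compared with left-differences as in equation (3)), applied to the Hungarian possessive paradigms of Figure 10:

```python
def common_prefix(s, t):
    k = 0
    while k < min(len(s), len(t)) and s[k] == t[k]:
        k += 1
    return s[:k]

def common_suffix(s, t):
    return common_prefix(s[::-1], t[::-1])[::-1]

def delta_R(s, t):
    k = len(common_prefix(s, t))
    return (s[k:], t[k:])

def delta_L(s, t):
    k = len(common_suffix(s, t))
    return (s[:len(s) - k], t[:len(t) - k])

def self_quotient(P):                       # P ÷ P, off the diagonal
    return {(i, j): delta_R(P[i], P[j])
            for i in range(len(P)) for j in range(len(P)) if i != j}

def nabla(P, Q):                            # ▽(P, Q) = (P ÷ P) ÷ (Q ÷ Q), cell by cell
    DP, DQ = self_quotient(P), self_quotient(Q)
    return {ij: (delta_L(DP[ij][0], DQ[ij][0]), delta_L(DP[ij][1], DQ[ij][1])) for ij in DP}

ember = ["emberem", "embered", "embere", "emberünk", "emberetek", "emberük"]
dog   = ["dögöm", "dögöd", "döge", "dögünk", "dögötek", "dögük"]

residues = {r for pair in nabla(ember, dog).values() for r in pair if r != ("", "")}
print(residues)
# the only letters left standing are e and ö: the vowel-harmony alternation of Figure 10
```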
[Figure 10 gives the self-difference arrays for the Hungarian possessive paradigms emberem, embered, embere, emberünk, emberetek, emberük and dögöm, dögöd, döge, dögünk, dögötek, dögük, followed by their differences of differences. All that survives in the second-order array is the alternation between e and ö, i.e., vowel harmony, here treated in a commutative free group.]

Figure 10: Hungarian vowel harmony: commutative free group
[Figure 11 gives the self-difference arrays for the Spanish paradigms hablar, hablo, hablas, habla, hablamos, hablan, hablé, hable, hables and buscar, busco, buscas, busca, buscamos, buscan, busqué, busque, busques, followed by their difference of differences. What survives in the second-order array is the alternation between c and qu in the forms busqué, busque, busques, i.e., the Spanish orthographic c ~ qu alternation before a front vowel.]

Figure 11: Difference of differences: Spanish verb