A cross-linguistic quantitative study of homophony

Jinyun Ke (Dr.)
English Language Institute, University of Michigan
Abstract:
Homophony is ubiquitous across languages. It is an important source of
ambiguity, which is a distinctive feature of human language. There have been,
however, few quantitative investigations of questions such as “do languages
have similar degrees of homophony?” and “can the degree of homophony in a
language be predicted?”. We report a preliminary attempt to answer these
questions. We measure the degree of homophony of two sets of languages, one
including twenty Chinese dialects and the other including three Germanic
languages. It is found that there exists a strong correlation between the
degree of homophony and the number of occurring syllable types (which can be
taken as an estimation of the size of the phonological resource of a language),
or the number of monosyllabic words in the lexicon. Furthermore, the
distributional properties of homophony reflect some self-organization
characteristics of language as a system, as illustrated by two pieces of
evidence: the first is the correlation between the degree of homophony and the
degree of disyllabification in Chinese dialects, and the second is the
observation from some languages that homophone pairs tend to belong to different
grammatical classes, suggesting that language self-organizes in a way that
decreases the chances of ambiguity.
1. Introduction
Human language is full of ambiguity, a ubiquitous phenomenon in which one form can be
interpreted as having more than one meaning (i.e. one-to-many mapping). Homophony is an important
source of ambiguity. It refers to two or more words, called “homophones”, “having the
same sound, but differing in meaning or derivation”, according to the Oxford English Dictionary
(OED). Here are some examples of homophones: “chu1 shi4”1 for “出事” (“accident”), “出示”
(“show”) and “出世” (“be born”) in Chinese; “sight”, “site” and “cite” in English; “père”, “pair”,
and “paire” in French; “das” and “daß”, “man” and “Mann” in German. Extending the definition
from words to more general forms, /s/ (including its variants /z/ and /ɪz/) in English can also be
considered homophonous, as the same form has two functions: one forms the plural of nouns,
and the other forms the third-person singular present tense of verbs.
Homophony is ubiquitous across languages. There are two main sources of homophones: sound
change and language contact. Most homophones arise as the result of phonological merger,
a type of sound change which is very common in languages. Words become homophonous once
the phonetic distinction that kept them apart is lost. In English, for example, “meat” in
Middle English was pronounced similarly to “mate” in modern English, but after the Great Vowel
Shift it became homophonous with “meet” due to vowel raising, though the written forms still retain
the distinction. Chinese is a classic instance in which numerous homophones have come from
sound change. In modern Chinese dialects, especially in northern dialects where many mergers
have occurred historically, we can find many homophonous monosyllabic morphemes which
were distinct in earlier stages. For example, in Mandarin, “急” (“anxious”), “疾” (“illness”) and
“即” (“immediately”) are homophones with the same pronunciation (ji2) due to the loss of the
final consonants “p, t, k” in monosyllables, while these morphemes remain distinct in southern
dialects where such sound changes did not happen.
1 In this paper, the pronunciations of Chinese characters are given in pinyin spelling, with the tone number following the syllable.
Lexical borrowing is another main source of homophones. Languages are constantly in contact
with other languages, which results in lexical borrowing to various extents. Very often the
pronunciation of foreign words does not fit the phonology of the borrowing language, and these words
are adjusted accordingly by its native speakers. Occasionally the
borrowed word may collide with some existing words, or even some words borrowed earlier from
other languages, and homophones are thus created. For example, in English the two words
“sheik” and “chic” are homophones with the same pronunciation [ʃiːk], but they entered English at
different times: according to the OED, the former was borrowed from the Arabic word
“shaikh” in the sixteenth century, while the latter is a French loan word from the middle of the
nineteenth century.
It has been a common belief that all languages have homophony in different degrees (Antilla,
1989: 184). However, so far there have been few attempts to examine the degrees of homophony
quantitatively, not to mention cross-language comparison. There have been some lists of
homophones, e.g., Higgins (1995) for English, and some case studies on the history of individual
homophones or near-homophones, e.g., Bloomfield (1933: 396-398) and Malkiel (1979).
However, there has been little discussion of the relation between homophony and other parts of
the language system. Intuitively, if a language has a large phonological resource to exploit, i.e. a
large number of sounds to make up words, the language may have fewer homophones. Such
a testable hypothesis has remained largely unexplored.
In this paper, we report a preliminary study to address the above hypothesis. We measure the
degree of homophony on two sets of available data: one includes twenty Chinese dialects and the
other includes three Germanic languages. Section 2 introduces these measures and the results.
Section 3 examines the relation between the degree of homophony and the size of the
phonological resource, and Section 4 proposes two hypotheses to predict the degree of
homophony and shows that the degree of monosyllabicity (i.e. the number of monosyllabic words
in the lexicon) is a better indicator for the degree of homophony than the number of occurring
syllable types.
While the emergence of homophony is considered unavoidable, the existence of homophony
does not seem to be totally random, but instead reflects some characteristics of self-organization in a
language. In Section 5, we will show two pieces of evidence in this regard. The first is the
correlation between the degree of homophony and disyllabification in Chinese dialects, and
the second is the observation from some languages that homophone pairs tend to belong to different
grammatical classes, suggesting that language self-organizes in a way that decreases the chances of
ambiguity.
2. Cross-language comparison of the degree of homophony
There are several difficulties in obtaining a quantitative measure of the degree of homophony and
making cross-language comparisons. First, as the lexicon of a language is basically an open set
and keeps evolving, whether a given word has a homophone or not depends heavily on the size of
the lexicon that is searched for homophones. A lexicon with more entries and with many ancient
words included will certainly be more likely to include more homophones. Therefore, it is
important to ensure the comparability of the lexicons from which homophones are extracted. We
will demonstrate how this problem of comparability is dealt with for two sets of data. The first is
a set of Chinese dialects, for which the pronunciations of the same set of Chinese monosyllabic
morphemes are available. The second set of data comes from CELEX, a large lexical database for
three Germanic languages. The details of the two sets of data are described in Sections 2.1
and 2.2.
The second difficulty in deriving a homophone list for a language is a long-standing problem for
both lexicologists and semanticists, namely how to distinguish homophony from polysemy. To
avoid the difficulty of differentiating polysemes from true homophones, in the study of the three
Germanic languages, we restrict the scope of our analyses to homophones which have different
orthographic forms. We assume that when two homophones are spelled differently, the
chance that they share the same etymology is very small2.
Following the above principle of polyseme pruning, we exclude homonyms such as “(river)
bank” and “(financial) bank”. Words such as “work” as a noun and “work” as a verb,
which have the same meaning but are used as different parts of speech, are not considered
homophones either. We further exclude those pairs which are different inflectional forms of one
lemma, because these words in fact refer to the same meaning even though they are distinguished
in gender, number or tense. For example, in Dutch “raad” and “raadt” are two inflectional
forms of the same verb “to guess”, and are pronounced the same. Such pairs are excluded from
our homophone lists. The homophone lists derived from the above criteria underestimate the
degree of homophony, and it is not yet clear how serious this underestimation is. Nevertheless, this
restricted definition of homophony enables us to obtain an estimate of the lower bound of the
degree of homophony in these languages. More importantly, these restricted, but explicit, criteria
enable cross-language comparison.
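The restricted criteria above can be sketched in code. This is only an illustration under stated assumptions, not the procedure actually run on CELEX: the function name `words_with_homophones`, the tuple layout (spelling, pronunciation, lemma) and the toy pronunciation strings are all hypothetical.

```python
from collections import defaultdict

def words_with_homophones(entries):
    """Return spellings that have a homophone under the restricted criteria.

    entries: iterable of (spelling, pronunciation, lemma) tuples.
    Two entries count as homophones only if they share a pronunciation
    but differ in BOTH spelling and lemma.
    """
    by_sound = defaultdict(list)
    for spelling, pronunciation, lemma in entries:
        by_sound[pronunciation].append((spelling, lemma))

    found = set()
    for group in by_sound.values():
        for spelling, lemma in group:
            # A partner must differ in spelling (prunes polysemes/homonyms)
            # and in lemma (prunes inflectional variants).
            if any(s != spelling and l != lemma for s, l in group):
                found.add(spelling)
    return found

# Toy entries; pronunciation strings are placeholders, not CELEX codes.
toy_entries = [
    ("sight", "sait", "sight"),
    ("site", "sait", "site"),
    ("cite", "sait", "cite"),
    ("bank", "bank", "bank (river)"),
    ("bank", "bank", "bank (financial)"),  # same spelling: excluded
    ("raad", "rat", "raden"),
    ("raadt", "rat", "raden"),             # same lemma: excluded
]
print(sorted(words_with_homophones(toy_entries)))  # ['cite', 'sight', 'site']
```

Requiring a difference in both spelling and lemma is what makes the resulting count a lower bound: any true homophones that happen to be spelled alike are dropped along with the polysemes.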
2.1. Degrees of homophony in Chinese dialects
The data for Chinese dialects come from the Dictionary on Computer (DOC), which is an
electronic database of the phonological systems of Chinese languages. It is one of the earliest
computer databases of languages, first developed in the research group led by Prof. William S-Y
Wang at the University of California at Berkeley in 1966, and has been upgraded and maintained
through the years (Wang, 1969a; Cheng, 1998). DOC has been a fertile database with historical
depth and geographic breadth, having been used in many studies of sound changes in historical
Chinese and Chinese dialects (Wang, 1977). These studies constitute the empirical basis for the
launch of the theory of lexical diffusion (Wang, 1969b; Chen & Wang, 1975).
2 We note that there are still words which have different spellings but actually come from the same origin, such as “check” and
“cheque” in English.
DOC includes the pronunciation of over 2,700 monosyllabic morphemes (or “Chinese
characters”, to be more accurate) in 20 modern Chinese dialects (Beijing, Jinan, Xi’an, Taiyuan,
Wuhan, Chengdu, Yangzhou, Suzhou, Wenzhou, Hefei, Changsha, Shuangfeng, Nanchang,
Meixian, Guangzhou, Yangjiang, Xiamen, Jian’ou, Chaozhou, Fuzhou3), two ancient Chinese
rhyme books (Guang Yun of the 7th century, and Zhongyuan Yinyun of the 14th century) and
corresponding borrowings in Japanese and Korean4.
We use the DOC data to measure the degree of homophony in the 20 Chinese dialects. One
advantage of these data is that the semantic range is approximately the same for all dialects,
since all dialects use the same set of characters or morphemes. While the definition of
“wordness” in Chinese is highly problematic (it is hard to decide whether a combination of
morphemes is a word or a phrase), examining monosyllabic morphemes avoids this
problem and enables us to carry out comparisons across dialects. We note, however, that the
obtained measure is only valid for monosyllabic morphemes, and does not reflect the complete
situation of homophony in modern dialects, as some of these monosyllabic morphemes cannot be
used as free morphemes and there are many polysyllabic words in the contemporary dialects.
Nevertheless, these are the data to which we have had convenient access so far, and the obtained measure
may provide at least some preliminary comparisons, which can be extended to a better coverage
later when more data on the lexicons of modern dialects are available.
Table 1 gives the numbers of entries for each dialect in DOC (there are many characters having
multiple pronunciations, therefore the numbers of entries vary across dialects and all are more
than the numbers of characters, i.e. 2700), and the syllable inventory, i.e., the number of syllable
types occurring in these morphemes (Syl) (tone included).
The table also shows the results of three measures of the degrees of homophony: (1) the number
of homophone sets (HomoSet), and the proportion of homophone sets in the total number of sets
of morphemes (PropSet); (2) the number of homophone pairs (HomoPair), and the proportion of
homophone pairs in the total number of pairs (PropPair); (3) the average number of homophones
per syllable (AverHomo). The three measures of the degrees of homophony are carried out as
follows.
Table 1. The degrees of homophony in 20 modern dialects (sorted according to the number of
occurring syllables (Syl)).

Dialect     Entries  Syl   HomoSet  PropSet  HomoPair  PropPair  AverHomo per Syl
Taiyuan     3933     828   580      0.70     14581     0.0019    4.75
Wuhan       3947     870   625      0.72     13412     0.0017    4.54
Chengdu     3838     938   657      0.70     11769     0.0016    4.09
Yangzhou    3766     947   642      0.68     11673     0.0016    3.98
Hefei       3693     976   661      0.68     10782     0.0016    3.78
Changsha    4174     981   653      0.67     13548     0.0016    4.26
Suzhou      3967     999   644      0.64     12077     0.0015    3.97
Shuangfeng  4020     1001  672      0.67     10802     0.0013    4.02
Wenzhou     4108     1048  682      0.65     13587     0.0016    3.92
Jinan       3853     1063  732      0.69     9855      0.0013    3.62
Xi’an       3875     1084  745      0.69     9397      0.0013    3.57
Nanchang    3842     1111  732      0.66     8828      0.0012    3.46
Beijing     4111     1125  757      0.67     10564     0.0013    3.66
Jian’ou     4181     1241  780      0.63     10154     0.0012    3.37
Meixian     3848     1304  785      0.60     7539      0.0010    2.95
Yangjiang   3682     1319  800      0.61     6485      0.0010    2.79
Guangzhou   3773     1367  812      0.59     6143      0.0009    2.76
Fuzhou      4398     1413  867      0.61     8639      0.0009    3.11
Chaozhou    4193     1759  919      0.52     5977      0.0007    2.38
Xiamen      5000     1855  993      0.54     8664      0.0007    2.93

3 Among the 20 dialects, the data of 17 dialects are from 汉语方音字汇 (Han4yu3 Fang1yin1 Zi4hui4, “A collection of Character
Pronunciation in Chinese Dialects”, abbreviated as Zihui 1989).
4 Japanese and Korean have had heavy contacts with Chinese, and there are many Chinese borrowing words in these two
languages. In Japanese there are two main layers of borrowings, called Kan-on and Go-on readings respectively.
1) HomoSet: the number of syllables which have homophones (the 4th column in Table 1). For
example, HomoSet(Beijing)=757 and HomoSet(Guangzhou)=812, i.e., Guangzhou has more
homophone sets than Beijing. However, the raw count must be normalized by the total number of
syllables. The number of syllables with homophones divided by the
total number of occurring syllables, i.e., HomoSet/Syl, gives a more indicative measure, namely,
PropSet (the 5th column). By this measure Beijing has a higher degree of homophony than Guangzhou
(PropSet(Beijing)=0.67 and PropSet(Guangzhou)=0.59), which is more consistent with our
expectation. Moreover, there is a strong and significant negative correlation between the degree
of homophony and the number of syllables, as shown in Fig. 1. The Pearson correlation test
gives Corr(PropSet,Syl)=-0.90 (p<0.001).
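As a check on the direction and strength of this relationship, the Pearson coefficient can be recomputed from the Syl and PropSet columns of Table 1. The sketch below is illustrative only: the helper `pearson` is our own, and because Table 1 reports PropSet rounded to two decimals, the result need not match the reported value exactly.

```python
import math

# Syl and PropSet columns of Table 1, in table order (Taiyuan ... Xiamen).
SYL = [828, 870, 938, 947, 976, 981, 999, 1001, 1048, 1063,
       1084, 1111, 1125, 1241, 1304, 1319, 1367, 1413, 1759, 1855]
PROP_SET = [0.70, 0.72, 0.70, 0.68, 0.68, 0.67, 0.64, 0.67, 0.65, 0.69,
            0.69, 0.66, 0.67, 0.63, 0.60, 0.61, 0.59, 0.61, 0.52, 0.54]

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(SYL, PROP_SET)
print(f"Corr(PropSet, Syl) = {r:.2f}")  # strongly negative
```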
Fig. 1. Correlation between the size of the syllable inventory (x-axis: number of syllables) and the
degree of homophony in terms of proportion of homophone sets (y-axis: PropSet). The fitted power
function is $y = 9.1x^{-0.38}$, with $R^2 = 0.90$.
2) HomoPair: the number of pairs of homophones (the 6th column). HomoPair(Beijing)=10564
and HomoPair(Guangzhou)=6143. After normalization, dividing HomoPair by the total number
of pairs of morphemes, we obtain the proportion of homophone pairs as another measure of the
degree of homophony, namely, PropPair (the 7th column). As with PropSet,
Beijing has a higher degree of homophony than Guangzhou (PropPair(Beijing)=0.0013, and
PropPair(Guangzhou)=0.0009). For PropPair, we see an even stronger negative correlation
between the degree of homophony and the size of the syllable inventory: Corr(PropPair,Syl)=-0.96 (p<0.001), as shown in Fig. 2.
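The power functions reported in the figures have the form $y = a x^{b}$. One common way to obtain such a fit is ordinary least squares on log-transformed values; whether the fits in this paper were produced this way is an assumption. The sketch below applies that method to the rounded Syl and PropPair columns of Table 1, so the coefficients it yields are only indicative.

```python
import math

# Syl and PropPair columns of Table 1, in table order (Taiyuan ... Xiamen).
SYL = [828, 870, 938, 947, 976, 981, 999, 1001, 1048, 1063,
       1084, 1111, 1125, 1241, 1304, 1319, 1367, 1413, 1759, 1855]
PROP_PAIR = [0.0019, 0.0017, 0.0016, 0.0016, 0.0016, 0.0016, 0.0015,
             0.0013, 0.0016, 0.0013, 0.0013, 0.0012, 0.0013, 0.0012,
             0.0010, 0.0010, 0.0009, 0.0009, 0.0007, 0.0007]

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b on log-log axes; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    b = sxy / sxx                       # exponent = slope in log-log space
    a = math.exp(mean_y - b * mean_x)   # coefficient from the intercept
    return a, b

a, b = fit_power_law(SYL, PROP_PAIR)
print(f"PropPair ~ {a:.3g} * Syl^{b:.2f}")
```

A negative exponent corresponds to the negative correlation described above: the more syllable types a dialect has, the smaller its proportion of homophone pairs.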
Fig. 2. Correlation between the size of the syllable inventory (x-axis: number of syllables) and the
degree of homophony in terms of proportion of homophone pairs (y-axis: PropPair). The fitted power
function is $y = 16.9x^{-1.35}$, with $R^2 = 0.96$.
3) AverHomo: the average number of homophones per syllable (the 8th column). Again we find
that the AverHomo has a high negative correlation with Syl: Corr(AverHomo, Syl)=-0.85
(p<0.001), as shown in Fig. 3.
Fig. 3. Correlation between the size of the syllable inventory (x-axis: number of syllables) and the
degree of homophony in terms of average number of homophones per syllable (y-axis: AverHomo).
The fitted power function is $y = 807.3x^{-0.77}$, with $R^2 = 0.85$.
The three different measures discussed above all exhibit a high negative correlation between the
size of the syllable inventory and the degree of homophony. These convergent results from
different indices suggest the robustness of the measurement of the degree of homophony. The
correlations show that the more syllable types a language has, the smaller the degree of
homophony. This correlation conforms to our intuition about the relationship between the size of
the phonological resource and the degree of homophony. However, this finding is not supported
by the second set of data to be discussed below.
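The three measures can be summarized in a short sketch. It rests on assumptions that should be read as such: entries are taken to be (morpheme, syllable) pairs with tone included in the syllable, a homophone pair is any two entries sharing a syllable, the total number of pairs is n(n-1)/2 for n entries, and AverHomo is taken as Entries/Syl (which reproduces, for example, 3933/828 = 4.75 for Taiyuan in Table 1). The function name is hypothetical.

```python
from collections import Counter

def homophony_measures(entries):
    """Compute the Table 1 measures from (morpheme, syllable) pairs."""
    counts = Counter(syllable for _, syllable in entries)
    n_entries = len(entries)
    n_syl = len(counts)
    # Syllables shared by at least two entries.
    homo_set = sum(1 for c in counts.values() if c >= 2)
    # Pairs of entries sharing a syllable: c choose 2 per syllable.
    homo_pair = sum(c * (c - 1) // 2 for c in counts.values())
    total_pairs = n_entries * (n_entries - 1) // 2
    return {
        "Entries": n_entries,
        "Syl": n_syl,
        "HomoSet": homo_set,
        "PropSet": homo_set / n_syl,
        "HomoPair": homo_pair,
        "PropPair": homo_pair / total_pairs,
        "AverHomo": n_entries / n_syl,
    }

# Toy data using the Mandarin examples from the Introduction.
toy = [("急", "ji2"), ("疾", "ji2"), ("即", "ji2"),
       ("出", "chu1"), ("事", "shi4"), ("示", "shi4")]
print(homophony_measures(toy))
```

For the toy data, three syllable types cover six entries: two of the three syllables have homophones (PropSet = 2/3), and 4 of the 15 possible pairs are homophone pairs.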
2.2. Degrees of homophony in three Germanic languages
We examine another set of languages for which we have available data for extracting
homophones in large lexicons. The data are provided by CELEX, which is an electronic lexical
database developed by the Dutch Centre for Lexical Information of the Max Planck Institute for
Psycholinguistics (Baayen et al., 1995). The database contains lexical information including
spelling, pronunciation, morphological structure, syntactic information (part of speech and
subcategorization) and corpus frequency for three Germanic languages: Dutch, English
and German.
Table 2 gives some information about the database, including the sizes of the wordform lexicon
and lemma lexicon before and after processing5 for the three languages, as well as the size of the
corpora from which the frequency information is obtained. From the table we see that the three
lexicons are not comparable in either of the two types of lexicon. For the lemma lexicon, Dutch
has more than twice as many lemmata as the other two languages; for the wordform lexicon, the word
count in English is less than half of that in the other two6. To deal with the comparability problem
between different lexicons, we decided to consider only the 5000 most frequent words and
carry out the comparison along frequency bands. It is assumed that the few thousand most
frequent words should be relatively stable, regardless of the size of the corpora, provided that the
corpora are both sufficiently large and from similar genres.
Table 2. A summary of the three lexicons from CELEX.

         lemma types                      wordform types                   corpus size
         pre-processing  post-processing  pre-processing  post-processing
Dutch    124,136         122,400          381,292         313,270          42.38m
English  52,447          41,535           160,595         77,031           17.9m
German   51,728          51,728           365,530         321,081          6.0m
5 It was found that there are some repeated entries in the lexicons for the three languages. Therefore, we cleaned
the lexicons to remove the repeated items.
6 The wordform lexicon includes words like “walk”, “walked” and “walking” as individual items, while the lemma lexicon excludes
inflected word forms; in the above case, only “walk” is included. The
ratio between the number of word forms and the number of lemmata gives a rough idea of how the three languages differ in the
complexity of their inflectional morphology. The average number of word forms per lemma in German is much higher
(321081/51728=6.2) than that of Dutch (313270/122400=2.6) and English (77031/41535=1.9); that is, German words have
more inflectional forms on average.
For each of the three languages, we first sort the words in the order of their frequency. Then we
check for each of the first 5000 words if it has a homophone/homophones in the whole word list7,
according to our restricted criteria of homophony stated above.
The first 5000 frequent words are then grouped into 14 frequency bands in a decreasing order of
frequencies. The size of the first 10 bands increases by 100 words for each band. In other words,
the first band includes the first 100 words, the second band includes the first 200 words, and so
on. After the 10th band, the sizes of bands increase by 1000 words, i.e. the 11th band includes the
first 1000 words, and the 12th band includes the first 2000 words. For each band, we count the
number of words which have at least one homophone. Fig. 4 shows the degree of homophony in
the 14 frequency bands, i.e. cumulative proportion of words which have homophones in the given
frequency bands, in the three languages.
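The cumulative computation over frequency bands can be sketched as follows. The cutoffs used here (100, 200, ..., 1000, then 2000, ..., 5000) are an assumption: they are one reading of the description that yields exactly 14 cumulative bands. The function names are hypothetical, and the set of words having homophones is taken as given.

```python
def band_cutoffs():
    """Assumed cumulative cutoffs: 100..1000 by 100, then 2000..5000 by 1000."""
    return list(range(100, 1001, 100)) + list(range(2000, 5001, 1000))

def cumulative_homophony(ranked_words, has_homophone, cutoffs=None):
    """Proportion of words with a homophone among the top-k words.

    ranked_words: words sorted by decreasing frequency.
    has_homophone: set of words with at least one homophone in the
        whole word list (under whatever criteria are in force).
    Returns a list of (cutoff, proportion) pairs, one per band.
    """
    cutoffs = cutoffs or band_cutoffs()
    result = []
    hits = 0
    start = 0
    for cutoff in cutoffs:
        end = min(cutoff, len(ranked_words))
        # Only scan the words added since the previous cutoff.
        hits += sum(1 for w in ranked_words[start:end] if w in has_homophone)
        start = end
        result.append((cutoff, hits / end if end else 0.0))
    return result

# Synthetic illustration: 35 of the top 100 words have homophones, none after.
ranked = [f"w{i}" for i in range(5000)]
curve = cumulative_homophony(ranked, set(ranked[:35]))
print(curve[0], curve[-1])
```

Because each band contains all earlier bands, a concentration of homophones among the most frequent words shows up as a curve that starts high and then declines as the denominator grows.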
Fig. 4. Degree of homophony (y-axis) in the first 5000 frequent words (x-axis: frequency band) in
three Germanic languages (English, Dutch and German).
We can make a few interesting observations from the above figure. First, the degrees of homophony in
the first several frequency bands are all much higher than in later frequency bands in the three
languages. In the first frequency band (the first 100 words), 35% of the words in English have
homophones, compared with 11% in Dutch and 16% in German. In English, among the 35 words having homophones,
32 belong to the closed-class vocabulary, i.e. function words, such as the articles “the”
and “a”, the prepositions “to” and “in”, and the conjunctions “but” and “or”. In fact, in the
three languages, over 90% of the 100 most frequent words are such function
words. It remains to be seen whether it is true for other languages that there exists a large degree
of homophony in the most frequent words, and whether these homophone pairs are mostly
closed-class words. Furthermore, we find that most of the homophones are monosyllabic words.
We therefore posit that there exists a correlation between the degree of homophony and
monosyllabicity. We will examine this in more detail in a later section.
Second, while there are more homophones in high frequency bands than in low frequency bands,
the degree of homophony starts to level off at a certain value after the 12th frequency band, which
includes 2000 words. This suggests that if we want to compare the degrees of homophony among
different languages, we may only need to examine the high frequency word list to a certain extent,
say, up to the first 2000 words. We see from Fig. 4 that English has the highest degree of
homophony (about 10%) while Dutch and German have similar, smaller degrees (about 4%) as
the level-off value. Why do these three languages have such differences? In the Chinese dialects
shown above, we find a negative correlation between the number of syllables and the degree of
homophony. Is there such a correlation in the Germanic languages as well? Can we predict the
degree of homophony based on this parameter, i.e. the number of syllable types? The following is
a preliminary attempt to answer these questions.
7 We note that this way of searching for homophones in the whole word list is still dependent on the size of the lexicon, but
requiring the first member of each pair to be among the 5000 most frequent words at least provides some comparability, as those pairs in which both
members are infrequent are excluded.
3. Homophony and phonological resource
The capacity to handle a large number of words is considered a defining characteristic of the
human species (Deacon, 1997). Most words in a language are arbitrary associations between
forms and meanings, which are expressions of the Saussurean Sign (de Saussure, 1910/1983).
While the number of meanings seems to be infinite, only a small number of them are lexicalized,
and others are expressed by combinations of words. The forms of the lexical items are built by
choosing from a finite set of units, which we call “phonological resource”. The size and
characteristics of the components of this finite set of phonological resource should affect the
degree of homophony. In the following, we begin with explaining how to measure the
phonological resource in a language.
The “phonological resource” refers to the number of possible distinctive forms a language can
make use of to construct words or morphemes. It depends not only on the number of sounds, but
also on the ways that the sounds are combined together, i.e., the phonotactic constraints.
Languages differ a lot in both dimensions.
Though the number of sounds that humans can make is infinitely large as articulation exploits a
continuous space in the vocal tract, the actual number of sound categories (called “segments” or
“phonemes”) which are used in distinguishing meanings by any individual language, is very
limited. According to the UCLA Phonological Segment Inventory Database (UPSID) (Maddieson
& Precoda, 1990), the maximum number of segments in an extant language is 141 (a Khoisan
language !Xũ). The average number of segments among languages, however, is much smaller. In
the UPSID, the average size of the segment inventory is only about 31.
Segments are organized into syllables, and words are constructed by concatenation of syllables.
An ordinary syllable consists of an obligatory vowel, and optional preceding and following
consonants. Different types of combination of consonants and vowels in one syllable constitute
different canonical forms, such as (V), (CV), (CVC), (CVCC), (CCVC), and so on. Different
languages vary a lot in the number and complexity of legitimate canonical forms. For example,
Germanic languages allow large consonant clusters: in English, (CCCVCCC) as in “scripts”
and (CCVCCCC) as in “glimpsed”; and (CCCVCCCC) as in “abstractst” in Dutch and “strolchst” in
German. In contrast, the most complex canonical form in Chinese dialects is only (CGVN) (“G”
standing for “glide”, and “N” for “nasal”), such as “liang” in Putonghua.
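The mapping from a syllable to its canonical form can be illustrated with a minimal sketch. It assumes a simplified transcription with one character per segment and a caller-supplied vowel set; the transcriptions "skrIpts" and "glImpst" and the function names are illustrative and are not taken from CELEX.

```python
from collections import Counter

def cv_skeleton(segments, vowels):
    """Map a sequence of segments to its C/V canonical form."""
    return "".join("V" if seg in vowels else "C" for seg in segments)

def canonical_forms(syllables, vowels):
    """Count how many syllables fall under each canonical form."""
    return Counter(cv_skeleton(syl, vowels) for syl in syllables)

# Simplified one-character-per-segment transcriptions (illustrative only).
VOWELS = {"I"}
print(cv_skeleton("skrIpts", VOWELS))  # CCCVCCC
print(cv_skeleton("glImpst", VOWELS))  # CCVCCCC
print(len(canonical_forms(["skrIpts", "glImpst", "sIt", "pIt"], VOWELS)))  # 3
```

The number of distinct keys returned by `canonical_forms` corresponds to the number of canonical forms attested in a given syllable list.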
Table 3 lists the number of consonants and vowels and the number of canonical forms in the three
Germanic languages. To ensure a valid comparison between the three languages, the criteria for
determining the numbers of phonemes are important. While the determination of phonemes in a
language often has non-unique solutions (Chao, 1934), we adopt the systems used in CELEX,
which include larger numbers of segments than commonly agreed analyses. For example, there are 24,
instead of 20, vowels in English, due to the inclusion of four nasalized vowels which only occur
in foreign words. We adopt these systems for the sake of simplicity of comparison. Moreover, as
will be shown below, the frequency of syllable types follows a power-law
distribution, and a large proportion of syllable types only occur once or twice; therefore, the syllable
types with foreign sounds have a similar status to other infrequent syllable types, and
including these foreign sounds in the segment inventory is just like including other infrequent
sounds.
Table 3. Segment inventories, number of canonical forms, occurring syllable types, possible and
actual CV combinations, and CV exploitation rates in three Germanic languages.

         consonants  vowels  canonical  occurring  possible  actual  CV exploitation
                             forms      syllables  CVs       CVs     rate
Dutch    23          21      35         9,031      483       254     53%
English  24          24      41         9,570      576       412     72%
German   25          34      33         4,225      850       217     26%
From the inventory of consonants and vowels and the legitimate canonical forms, we see that the
relations between the three variables are complex. English has fewer segments than German, but
many more types of canonical forms, which might suggest that languages trade off the number of
segments against the ways of combining segments so as to achieve a
similar size of phonological resource. However, this hypothesis does not hold when Dutch
and English are compared: Dutch has fewer segments than English, but also fewer canonical
forms, though the difference is small. Since we only have three languages and these languages
are very similar due to their close genetic relationships, it is hard to make more meaningful
inferences.
The number of segments and the types of canonical forms may provide a measure of the potential
phonological resource in a language. However, each language has a set of specific phonotactic
constraints, resulting in many systematic gaps, such as no *[tl-] and *[dl-], as well as many
accidental gaps such as no *[krIp] and *[blIk] in English. Therefore it is hardly possible to have
an accurate estimate of the number of syllable types based on only the number of consonants and
vowels, and the number of canonical forms. Jespersen (1933: 623) estimates the number of
possible syllable types in English as more than 158,000, when systematic gaps are excluded.
According to our calculation, however, the number of occurring syllable types in the CELEX
English lexicon is only 9,570. If we assume that CELEX has included a representative number of
syllable types as its lexicon size is sufficiently large (77,031 word forms), we may obtain a rough
estimate of the exploitation rate of the phonological resource in English, based on the above two
values. Taking the ratio between our number (9,570) and Jespersen’s estimate (158,000), we
find that the exploitation rate is only about 6%. This shows that the usable phonological
resource is far from being fully employed.
We may obtain another measure of the exploitation rate of the phonological resource by examining
CV combinations only. The number of possible CV types can be estimated by taking the full combinations
of all consonants and vowels. As shown in Table 3, the possible CV combinations are far from
being fully utilized either. English has the highest rate (72%), and German the lowest (26%).
In fact, German has a larger segment inventory than Dutch and English, but German and Dutch
have a similar number of actual CV combinations, while English has about twice as many as the other
two. This implies that English has fewer phonotactic constraints than German and Dutch, at least
in the CV combinations.
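The exploitation rates follow directly from the counts in Table 3: the number of possible CV types is the product of the consonant and vowel inventories, and the rate is the ratio of actual to possible types. The short sketch below recomputes them, together with the syllable-level rate based on Jespersen's estimate; the dictionary layout and function name are our own.

```python
# Counts taken from Table 3.
INVENTORIES = {
    "Dutch":   {"consonants": 23, "vowels": 21, "actual_cv": 254},
    "English": {"consonants": 24, "vowels": 24, "actual_cv": 412},
    "German":  {"consonants": 25, "vowels": 34, "actual_cv": 217},
}

def cv_exploitation(consonants, vowels, actual_cv):
    """Return (possible CV types, proportion of them actually occurring)."""
    possible = consonants * vowels
    return possible, actual_cv / possible

for language, inv in INVENTORIES.items():
    possible, rate = cv_exploitation(**inv)
    print(f"{language}: {inv['actual_cv']}/{possible} = {rate:.0%}")

# Syllable-level exploitation in English: occurring syllable types in CELEX
# divided by Jespersen's estimate of possible syllable types.
syllable_rate = 9570 / 158000
print(f"English syllables: {syllable_rate:.0%}")
```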
Furthermore, we find that the syllables are not utilized in a uniform way. Some syllables appear
very frequently, such as in English [lI] (appearing 2850 times), [rI] (2016) and [ə] (1916), while a
large proportion of the syllables (about 44%) only occur in one or two words. German and Dutch
have similar characteristics. The three most frequent syllables in German are [gə] (4405), [tə]
(3349) and [tən] (2845); and in Dutch they are [də] (17848), [tə] (12220) and [xə] (9899).
Figures 5, 6 and 7 show the distribution of the frequency of syllable types in the three languages.
All three curves can be fitted by similar power functions ($\mathrm{prob}(f) = C f^{\alpha}$, with
$\alpha_{eng} = -1.6$, $\alpha_{dut} = -1.3$ and $\alpha_{ger} = -1.6$), which appear as straight lines in the log-log plane. A power-law
distribution is often considered a reflection of the presence of self-organization in the system
(xxx). The distributional characteristic of syllable frequencies suggests that self-organization may
be present in the organization of the lexicon.
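To make the form $\mathrm{prob}(f) = C f^{\alpha}$ concrete, the sketch below builds the distribution of syllable-type frequencies from a table of counts and estimates $\alpha$ as the slope of a straight line in the log-log plane. Least squares on the logged values is a simple estimator adopted here for illustration; the fitting method behind the reported exponents is not stated in the text, and the function names and synthetic counts are hypothetical.

```python
import math
from collections import Counter

def frequency_distribution(syllable_counts):
    """Map each frequency f to the proportion of syllable types occurring f times.

    syllable_counts: dict from syllable type to its number of occurrences.
    """
    freq_of_freq = Counter(syllable_counts.values())
    n_types = len(syllable_counts)
    return {f: k / n_types for f, k in freq_of_freq.items()}

def estimate_alpha(distribution):
    """Slope of log(prob) against log(f), i.e. the power-law exponent."""
    points = [(math.log(f), math.log(p)) for f, p in distribution.items()]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Synthetic counts in which the number of types with frequency f is 64 / f**2.
synthetic = {}
for f, n_types in [(1, 64), (2, 16), (4, 4), (8, 1)]:
    for i in range(n_types):
        synthetic[f"syl_{f}_{i}"] = f

alpha = estimate_alpha(frequency_distribution(synthetic))
print(f"alpha = {alpha:.2f}")  # -2.00 for this synthetic example
```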
Fig. 5. Distribution of the frequency of syllable types in the English lexicon (log-log axes; x-axis:
frequency of syllables in the English lexicon, y-axis: probability). The solid line is the
curve for the actual distribution, and the dotted line is the fitted curve with a power law.
Fig. 6. Distribution of the frequency of syllable types in the Dutch lexicon (log-log axes; x-axis:
frequency of syllables in the Dutch lexicon, y-axis: probability).
Fig. 7. Distribution of the frequency of syllable types in the German lexicon (log-log axes; x-axis:
frequency of syllables in the German lexicon, y-axis: probability).
4. Predicting the degree of homophony
As shown above, languages have a small exploitation rate of the possible phonological resource.
The number of possible syllable types does not give a representative measure of the phonological
resource in actual use. Instead, the number of actually occurring syllable types in the
contemporary lexicon may serve as a better index. Having chosen this index, we propose the
following hypothesis for the relation between the degree of homophony and the phonological
resource:
(1) Hypothesis-I: A larger number of syllable types (Syl) predicts a smaller degree of
homophony.
The hypothesis seems straightforward. If there are more distinctive forms for constructing words,
then the chance to have two words with the same forms, i.e., homophones, should be smaller. We
recall that in our earlier analyses of the degree of homophony in Chinese dialects, we do observe
a significant correlation between the degree of homophony and the size of the syllable inventory,
as reflected in Figures 1, 2 and 3.
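The intuition behind Hypothesis-I can be illustrated with a toy model. Assume, purely for illustration, that each of N monosyllabic words is assigned independently and uniformly to one of Syl syllable types; real lexicons are far from uniform, so the numbers have no empirical standing. Under that assumption, the chance that a given word shares its form with at least one other word is 1 - (1 - 1/Syl)^(N-1). The function name and the lexicon size used in the usage note are hypothetical.

```python
def expected_homophone_share(n_words: int, n_syllables: int) -> float:
    """Chance that a word has at least one homophone under uniform random assignment.

    Toy model only: assumes independent, equally likely syllable types.
    """
    return 1.0 - (1.0 - 1.0 / n_syllables) ** (n_words - 1)
```

For a fixed hypothetical lexicon of 3000 monosyllabic words, `expected_homophone_share(3000, 828)` is larger than `expected_homophone_share(3000, 1855)`, which is the direction Hypothesis-I predicts.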
However, when we compare the degree of homophony with Syl in the three Germanic languages, this
correlation does not hold: English has the largest Syl (9570) and the largest degree of
homophony (10% in the list of the 5000 most frequent words, as shown in Fig. 4), while German has a
much smaller Syl (4225) than English but also a smaller degree of homophony (4%). How can we
explain the inconsistency between the observations from these two sets of data? One explanation
is that in the Chinese data only monosyllabic morphemes are examined, and many of
them are not real “words” in actual language use (see the later section on
disyllabification in Chinese), while the Germanic data come from real
lexicons in which words have different lengths. Intuitively, longer words are less likely to have
homophones. Of two languages with syllable inventories of the same size, the one with
more long words is expected to have fewer homophones.
It has been shown that languages differ greatly in mean word length, and that there is a high negative
correlation between mean word length and the size of the segment inventory (Nettle, 1995;
Nettle, 1998; Nettle, 1999). Fig. 8 shows the correlation for ten languages, taken from Nettle
(1999).
[Figure 8 plot: mean word length (y-axis) against number of segments (x-axis) for Hawaiian, Georgian, Italian, Turkish, Hausa, Mandarin, Tamasheq, !Kung, Vute and Thai; fitted power function $y = 17.28x^{-0.30}$, $R^2 = 0.67$.]
Fig. 8: Relation between the size of segment inventory and the word mean length in ten
languages. The curve-fitting power function and the R-squared value are shown in the
figure. Adapted from Nettle (1999:146).
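To make the inverse relation in Fig. 8 concrete, the fitted power function shown in the figure can be evaluated directly. The function name below is hypothetical, and the curve describes only the ten languages in the figure; it is not a general law.

```python
def predicted_mean_word_length(n_segments: float) -> float:
    """Evaluate the power function reported in Fig. 8: y = 17.28 * x ** -0.30."""
    return 17.28 * n_segments ** -0.30
```

Because the exponent is negative, a larger segment inventory always yields a shorter predicted mean word length.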
As mentioned earlier, longer words are less likely to have homophones. In fact, in the
homophone list of the three languages compiled from CELEX, we find that most of the
homophone pairs are monosyllabic words. Therefore, we propose Hypothesis-II for predicting the
degree of homophony, as stated below:
(2) Hypothesis-II: A larger number of monosyllabic words (MonoW) would predict a higher
degree of homophony.
We analyze MonoW in the three Germanic languages in their lists of the 5000 most frequent words, in the
same way as the degree of homophony was analyzed in Section 2.2. As shown in Fig. 9,
the proportion of monosyllabic words among the 100 most frequent words is very high in all
three languages, especially English and Dutch, both over 80%. But the values of MonoW drop
quickly and stabilize at different levels: English has a much higher proportion of monosyllabic
words (32%) than Dutch (20%) and German (14%). When we examine the correlations between
the degrees of homophony and the values of MonoW across frequency bands, we find that
they are very high in all three languages: 0.99, 0.96 and 0.98 for
English, Dutch and German respectively. Tsou (1976) made a similar observation that many
homophones are monosyllabic. He predicted that “in disyllabic or polysyllabic morphemes the
probability for homophony is decreased geometrically” (ibid:75). Our data support this prediction:
among the 1000 most frequent words in German, 44 of the monosyllabic words have homophones,
while only 14 disyllabic and polysyllabic words do; in English, all the 135 sets of
homophones are monosyllabic words.
[Figure 9 plot: degree of monosyllabicity (y-axis, 0.00 to 1.00) against frequency band (x-axis, 0 to 5000) for English (Eng), Dutch (Dut) and German (Ger).]
Fig. 9: Degree of monosyllabicity in the first 5000 words in English, Dutch and German.
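The band-wise comparison of MonoW and the degree of homophony reported above can be sketched as follows. The data layout is an assumption: a frequency-ranked list of (number of syllables, has a homophone) tuples, with both proportions taken cumulatively over the first n words. The function names and the band size are hypothetical, and the exact banding used in Section 2.2 may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def band_profile(words, band_size=100):
    """Cumulative proportions of monosyllabic and homophonous words per band.

    words: frequency-ranked list of (n_syllables, has_homophone) tuples.
    Returns two lists with one value per completed band.
    """
    mono, homo = [], []
    n_mono = n_homo = 0
    for rank, (n_syllables, has_homophone) in enumerate(words, start=1):
        n_mono += 1 if n_syllables == 1 else 0
        n_homo += 1 if has_homophone else 0
        if rank % band_size == 0:
            mono.append(n_mono / rank)
            homo.append(n_homo / rank)
    return mono, homo
```

Calling `pearson(*band_profile(words))` on such a list gives one correlation per language, comparable in kind to the values reported above.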
Each language has its own characteristic for monosyllabicity. The number of monosyllabic words
in a language not only depends on the size of the phonological resource, such as the number of
occurring syllable types (Syl), but also on other aspects of the language system, such as the
complexity of the morphological system. We expect that if a language has a large number of
morphological processes, either inflectional or derivational, the words are likely to be longer, and
therefore the language tends to have a smaller proportion of monosyllabic words in its lexicon.
5. Self-organization in homophony
Do homophones cause ambiguity and confusion in daily communication? One answer to this
question from common sense is that homophones do not usually affect communication, because
the context, such as the neighboring words, helps to disambiguate. However, there still exist
situations where contexts do not provide enough information for immediate comprehension, and
misunderstanding persists for a while until enough information is obtained.
Moreover, many psycholinguistic experiments show that there are differences in the processing of
words with and without homophones. For instance, in an experiment where subjects were asked
to verify whether the referent of a word was a member of a specified semantic category,
false positive rates were higher for words that were homophonous with a category
member than for orthographically similar non-homophones. For example, “rows” was more
likely than “robs” to be misclassified as a flower (Van Orden, 1987). It has also been found that
words with homophonous partners typically take longer to process than those without homophonous
partners in lexical decision experiments (e.g. Ferrand & Grainger, 2003).
The homophone interference effect found in these experimental situations may appear
insignificant or negligible with respect to actual communication. And if any
confusion is caused by the presence of homophony, the listener is the most affected; how will this
effect on the listener affect the fate of the troubling homophones? There are two conceivable
routes. First, the confusion in the listener may affect the speaker. When the listener asks for
clarification, the speaker will realize that some miscommunication is going on, and
will need extra effort to find new ways to repair and clarify the situation.
Second, the confusion caused by the homophony may prompt the listener to avoid
using the homophone in the same context in his own speech. Though these may be
spurious processes, they may have significant effects in the long run.
Therefore, some words which cause problems will face a de-selection pressure and consequently
will be used less and less. This is attested in the many cases of words which have fallen into disuse
because they are homophonous with taboo words (Bloomfield, 1933; Stimson, 1966). We
consider this a self-organization process in language, and expect that the effects of such
homophony avoidance may be detectable in the synchronic distribution of homophones. In the
following we consider two phenomena as evidence of the self-organization process.
5.1 Disyllabification in Chinese
The first piece of evidence is the disyllabification phenomenon in the history of Chinese, which has been
extensively discussed (e.g. Guo, 1938; Lü, 1963; Dai, 1990; Feng 1995; Duanmu, 2000). It is
generally believed that monosyllabic words were the majority in ancient Chinese, but that the
number of disyllabic words has increased greatly over time. In modern Chinese dialects,
monosyllabic words make up only a small proportion, and the majority of words are disyllabic. For
instance, in Putonghua, in the frequent word list, only 29% are monosyllabic words. Many words
which were monosyllabic in earlier times have become disyllabic. For example, “父” (fu4),
which means “father”, is now only used in disyllabic words, such as “父亲” (fu4 qin1) in
Putonghua; “睛” (jing1) (“eye” or “eyeball”) has to be embedded in disyllabic compound words,
such as “眼睛” (yan3 jing1) (“eye(s)”).
It is widely accepted that there has been a disyllabification process in the history of Chinese,
despite the various arguments on when this process started and how extensive it has been (cf.
Kennedy 1951/1964). Why this process happened is more
controversial (Feng 1995; Duanmu, 2000). Homophony avoidance has been proposed as an
explanation, i.e. monosyllabic words were disyllabified in order to avoid the confusion caused
by homophonous words. An additional monosyllabic word is used to disambiguate an ambiguous
word, and the collocation of the two words gradually becomes a fixed expression, and later may
become a lexicalized word (see footnote 8). Meanwhile, there are several other hypotheses explaining the
disyllabification process, such as the speech-tempo constraint (Guo, 1938), grammatical
considerations (Li, 1990), morphologization (Dai, 1990), the stress constraint (Lu & Duanmu, 1991;
Duanmu, 2000), and the prosodic constraint (Feng, 1995). The arguments for and against these competing
hypotheses will not be dealt with in this paper. Our main concern here is to provide one piece of
evidence for the homophony avoidance hypothesis.
The homophony avoidance hypothesis predicts the following correlations: a smaller
phonological inventory implies a larger degree of homophony in monosyllabic morphemes, and
consequently a larger degree of disyllabification. This study
examines this prediction. Lü (1963) speculated about the possible relation between the size of the
syllable inventory and the degree of disyllabification, comparing the northern and southern
dialects: “because Cantonese has a larger syllable inventory than Putonghua, there should be
fewer disyllabic words in Cantonese than in Putonghua” (p.440). There are ample examples of
words which have been disyllabified in Putonghua but are still monosyllabic in Cantonese; for
instance, “蚊” (“wen2”) (“mosquito”) cannot be used as a free morpheme and has to combine
with a suffix to form “蚊子” (“wen2 zi”) in Putonghua, while in Cantonese this morpheme is still
used as a monosyllabic word. So far, however, there has been no systematic way to compare the dialects
quantitatively. In the following we report a study which supports the above hypothesis.
We use the dialect dictionary 汉语方言词汇 (Han4yu3 Fang1yan2 Ci2hui4, “A Collection of
Words in Chinese Dialects”, henceforth Cihui (1995)) to estimate the degree of disyllabification
in different dialects. The Cihui gives a list of corresponding words for 1236 lexemes in 20
Chinese dialects. We first count the number of monosyllabic words in the whole list, and
calculate the proportion of monosyllabic words, denoted as PropMono. The degree of
disyllabification (PropDisy1) is estimated as 1-PropMono. Table 4 gives the degrees of
disyllabification of the 20 Chinese dialects, as well as the number of syllable types and the
degrees of homophony (PropSet) which have been shown in Table 1.
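The first estimate described above can be sketched in a few lines. The input is assumed to be, for one dialect, the number of syllables of the word given for each lexeme in the Cihui list; the function name and this data layout are hypothetical.

```python
def prop_disy1(syllables_per_lexeme):
    """Degree of disyllabification estimated as 1 - PropMono for one dialect.

    syllables_per_lexeme: one syllable count per lexeme in the word list.
    """
    n_mono = sum(1 for n in syllables_per_lexeme if n == 1)
    prop_mono = n_mono / len(syllables_per_lexeme)
    return 1.0 - prop_mono
```

For example, a toy list with two monosyllabic words out of five, `prop_disy1([1, 2, 2, 1, 3])`, gives 0.6 up to floating-point rounding.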
Table 4. Comparison of degrees of homophony and degrees of disyllabification in 20 Chinese
modern dialects
Dialect      Syl    Homophony (PropSet)   PropDisy1   PropDisy2
Taiyuan      828    0.70                  0.60        0.40
Wuhan        870    0.72                  0.62        0.40
Chengdu      938    0.70                  0.62        0.42
Yangzhou     947    0.68                  0.61        0.40
Hefei        976    0.68                  0.61        0.40
Changsha     981    0.67                  0.62        0.41
Suzhou       999    0.64                  0.61        0.40
Shuangfeng   1001   0.67                  0.63        0.43
Wenzhou      1048   0.65                  0.53        0.31
Ji’nan       1063   0.69                  0.59        0.36
Xi’an        1084   0.69                  0.61        0.41
Nanchang     1111   0.66                  0.60        0.38
Beijing      1125   0.67                  0.62        0.41
Jian’ou      1241   0.63                  0.55        0.31
Meixian      1304   0.60                  0.60        0.39
Yangjiang    1319   0.61                  0.51        0.24
Guangzhou    1367   0.59                  0.50        0.24
Fuzhou       1413   0.61                  0.51        0.25
Chaozhou     1759   0.52                  0.50        0.23
Xiamen       1855   0.54                  0.54        0.29

Footnote 8. In English, there are similar phenomena to the disyllabification in Chinese. For example, in some areas in the United States,
there has been a sound change merging [ɛ] and [ɪ], which results in pairs of homophones such as “pen” and “pin”. It is found that
these two words are expressed by adding a modifier, for example: “ink pen”, and “stick pin” in order to eliminate the possible
confusion. Also, to differentiate the second person plural pronoun and the second person singular, the expression ‘you all’ is often
used to indicate the plural meaning. These examples show how ambiguity avoidance leads to fixed collocations of individual
words. Though so far these words have not become lexical items yet, they may become lexicalized later.
We find there is a significantly high negative correlation between the size of the syllable
inventory and the degree of disyllabification: Corr(Syl, PropDisy1)=-0.74 (p<0.001), which
supports Lü (1963). Also, there is a high positive correlation between the degree of homophony
and the degree of disyllabification: Corr(PropSet, PropDisy1)=0.76. Fig. 10 and Fig. 11 show
the relation between the two pairs of variables and the curve-fitting functions.
[Figure 10 plot: degree of disyllabification (y-axis) against number of syllables (x-axis); labeled points include beijing, taiyuan, xiamen, fuzhou, guangzhou and chaozhou; fitted power function $y = 82.9x^{-0.78}$, $R^2 = 0.60$.]
Fig. 10. Sizes of syllable inventory versus degrees of disyllabification in 20 Chinese dialects.
[Figure 11 plot: degree of disyllabification (y-axis) against degree of homophony (PercSet) (x-axis); labeled points include beijing, taiyuan, xiamen, guangzhou, chaozhou and fuzhou; fitted power function $y = 0.82x^{1.96}$, $R^2 = 0.60$.]
Fig. 11. Degrees of homophony versus degrees of disyllabification in 20 Chinese dialects.
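The two correlations reported for PropDisy1 can be recomputed from the Syl, PropSet and PropDisy1 columns of Table 4, as in the sketch below. The values are transcribed from the table in its row order (Taiyuan to Xiamen); the variable and function names are hypothetical. Because the tabulated values are rounded to two decimals, the output need not match the reported coefficients exactly.

```python
import math

# Columns of Table 4, in the table's row order (Taiyuan ... Xiamen).
syl = [828, 870, 938, 947, 976, 981, 999, 1001, 1048, 1063,
       1084, 1111, 1125, 1241, 1304, 1319, 1367, 1413, 1759, 1855]
prop_set = [0.70, 0.72, 0.70, 0.68, 0.68, 0.67, 0.64, 0.67, 0.65, 0.69,
            0.69, 0.66, 0.67, 0.63, 0.60, 0.61, 0.59, 0.61, 0.52, 0.54]
prop_disy1 = [0.60, 0.62, 0.62, 0.61, 0.61, 0.62, 0.61, 0.63, 0.53, 0.59,
              0.61, 0.60, 0.62, 0.55, 0.60, 0.51, 0.50, 0.51, 0.50, 0.54]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_syl_disy = pearson(syl, prop_disy1)        # expected to be negative
r_homo_disy = pearson(prop_set, prop_disy1)  # expected to be positive
print(round(r_syl_disy, 2), round(r_homo_disy, 2))
```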
The above method is liable to
overestimate the degree of disyllabification, because the list contains some polysyllabic
words which were never monosyllabic in the first place and are polysyllabic in all modern
dialects, such as “玻璃” (“bo1 li”) (“glass”). These words did not go through a
“disyllabification” process from monosyllabic to disyllabic, and therefore should not be included
in the estimation. We therefore modify the measure by taking into account only those lexemes
which are expressed by a monosyllabic word in at least one dialect, on the assumption that
the original form for the meaning is likely to have been monosyllabic at earlier stages and has been
retained in at least one dialect (we assume that a disyllabic word rarely becomes
monosyllabic again). Thus we obtain a better measure of the degree of disyllabification,
PropDisy2, as shown in the 5th column of Table 4. We again calculate the correlations of Syl and
the degree of homophony with the degree of disyllabification. The correlations are
higher than those using PropDisy1: Corr(Syl, PropDisy2)=-0.76 (p<0.001); and
Corr(PropSet, PropDisy2)=0.78.
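The refined measure can be sketched as a filter followed by the same proportion. The nested-dictionary layout, the function name and the toy entries are assumptions made for illustration; only the rule itself (keep a lexeme if at least one dialect expresses it with a monosyllabic word) comes from the text.

```python
def prop_disy2(table, dialect):
    """Degree of disyllabification over lexemes that are monosyllabic somewhere.

    table: {lexeme: {dialect_name: n_syllables}}, a hypothetical layout.
    """
    # Keep only lexemes expressed by a monosyllabic word in at least one dialect.
    eligible = [forms for forms in table.values() if min(forms.values()) == 1]
    n_mono = sum(1 for forms in eligible if forms[dialect] == 1)
    return 1.0 - n_mono / len(eligible)


# Toy data: syllable counts are illustrative, not taken from the Cihui.
toy = {
    "mosquito": {"Beijing": 2, "Guangzhou": 1},
    "glass": {"Beijing": 2, "Guangzhou": 2},   # never monosyllabic, so excluded
    "eat": {"Beijing": 1, "Guangzhou": 1},
}
```

With this toy table, `prop_disy2(toy, "Beijing")` is 0.5 and `prop_disy2(toy, "Guangzhou")` is 0.0, since “glass” is excluded from both calculations.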
These high correlations provide a strong argument for the homophony avoidance hypothesis,
because such a correlation is hard to explain by other proposals for disyllabification, such as those
based on prosodic or stress constraints. No argument has been made
that the prosodic or stress constraint is related to the size of the syllable inventory. However,
we do not argue that homophony avoidance is the only, or the most important,
mechanism behind disyllabification. We view disyllabification as the result of several
mechanisms, of which homophony avoidance is only one. There may have been several stages
of disyllabification, with different mechanisms at work. At the initial stage of the increase in
disyllabic words, the homophony avoidance mechanism may have played an important role. Once the
disyllabic prosodic structure is well established in the language, new lexical items are more likely
to be disyllabic. This may account for the continuous increase in the number of disyllabic words
in the last 100 years, especially words borrowed from other cultures (Masini, 1993).
5.2 Grammatical differentiation between homophones
A pair of homophones sharing the same grammatical class is more likely to cause confusion
than a pair belonging to different grammatical classes. If so, we may expect to see more pairs of
homophones in different grammatical classes than in the same class. Kelly & Ragade (2000)
carried out statistical analyses of English homophones to test this hypothesis. They found
no statistical bias against homophone pairs having the same grammatical class. However,
many words belong to more than one grammatical class. When the frequency effect is taken into
account, the most frequently used grammatical class of a word is statistically biased not to be the same
as that of the word’s homophone(s). For example, the homophones “weight” and
“wait” can each be a noun or a verb, but considering only the most frequent usage, one is a
noun and the other is a verb.
Kelly and Ragade carried out two statistical tests. The first is called “existence constraint test”.
Among the 502 homophone pairs in their list, it is found that there are 139 pairs of words which
are from the same grammatical class. In order to test if the distribution of homophone pairs is
random, Monte Carlo experiments were run for fifty cycles of random word pairings. In these
cycles, each word in the analysis was randomly paired with another word. It turns out that the
mean number of pairs with the same grammatical class from these 50 cycles is 140.1. There is no
statistical difference between the estimated value (140.1) and the observed value (139) as shown
by t-test. The results are summarized in the first row of Table 5.
Table 5. Homophone pairs are differentiated in usage in terms of grammatical classes when
frequency is taken into account. A summary of the findings given in Kelly & Ragade
(2000).
Test                        Same-class homophone pairs / total pairs   Mean of 50 random-pairing cycles (SD)   t-test, homophone vs. random pairs
Existence constraint test   139/502                                    140.1 (9.47)                            0.84 (p>0.30)
Frequency constraint test   73/253                                     84 (5.94)                               13.09 (p<0.0001)
The second test is called the “frequency constraint test”. A subset of 253 homophone pairs
from the original 502 pairs was prepared, in which each word was marked only by its most
frequently used grammatical class. There are 73 pairs with the same grammatical
class. The Monte Carlo experiments were performed again on this set of data, and the estimated
mean number of random pairs of words with the same grammatical class was 84.
The t-test shows that this value from random pairs is significantly different from the observed
value, which suggests that homophone pairs tend to fall into different grammatical classes.
As Kelly and Ragade explain, “this restriction seems reasonable if one assumes that the usage
frequency of a word will be depressed if it has a greater chance of impairing comprehension”.
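One possible reading of the random-pairing procedure is sketched below: in each cycle the words are shuffled and paired off in order, and the pairs whose two members carry the same grammatical class label are counted. The function names, the shuffle-then-pair scheme and the fixed seed are assumptions; the implementation used by Kelly and Ragade is not described here.

```python
import random
import statistics


def count_same_class(pairs):
    """Number of pairs whose two members have the same grammatical class."""
    return sum(1 for first, second in pairs if first == second)


def monte_carlo_same_class(classes, cycles=50, seed=0):
    """Mean and standard deviation of same-class counts over random pairings.

    classes: one grammatical class label per word in the analysis
             (an even number of words is assumed).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(cycles):
        shuffled = list(classes)
        rng.shuffle(shuffled)
        # Pair neighbouring words in the shuffled order.
        pairs = zip(shuffled[0::2], shuffled[1::2])
        counts.append(count_same_class(pairs))
    return statistics.mean(counts), statistics.stdev(counts)
```

The returned mean and standard deviation correspond in kind to the random-pairing column of Table 5, and can then be compared with the number of same-class pairs observed among the real homophones.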
We apply the same tests to the three Germanic homophone lists extracted from CELEX. The
homophone pairs are compiled from what have been obtained in the analyses in Section 2.2, i.e.
each pair has at least one word in the first 5000 frequent word list. The statistical results of the
Monte Carlo experiments are shown in Tables 6, 7 and 8 for English, Dutch and German
respectively.
The results for English are the opposite of what Kelly and Ragade have shown. The
existence constraint test shows that the whole set of homophone pairs differentiate in
grammatical classes in a way significantly different from random data; but the frequency constraint
test does not support the differentiation, as fewer random pairs of words (203 on average) than
homophone pairs (207) share the same grammatical class. The homophone data we compiled are different
from what Kelly and Ragade used, and it is not yet clear what accounts for this discrepancy.
However, in Dutch and German, both tests show grammatical differentiation with statistical
significance.
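The t values in Tables 5 to 8 appear consistent with a one-sample t statistic that compares the mean of the 50 random-pairing cycles with the observed number of same-class homophone pairs. This is an inference from the tabulated numbers, not a procedure stated in the text, and the function name is hypothetical.

```python
import math


def t_from_summary(random_mean, random_sd, observed, cycles=50):
    """One-sample t statistic: (random mean - observed) / (SD / sqrt(cycles)).

    Assumed form of the test; it is not spelled out in the source.
    """
    return (random_mean - observed) / (random_sd / math.sqrt(cycles))


# Frequency constraint test of Table 5: mean 84, SD 5.94, observed 73.
print(round(t_from_summary(84, 5.94, 73), 2))
```

Under this reading, a negative value, as in the English frequency constraint test, means that the homophone pairs share a grammatical class more often than random pairs do. Small deviations from the tabulated t values are to be expected, since the tabulated means are rounded.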
Table 6. Statistical results of the differentiation of homophone pairs in usage in terms of
grammatical class in English.
Test                        Same-class homophone pairs / total pairs   Mean of 50 random-pairing cycles (SD)   t-test, homophone vs. random pairs
Existence constraint test   410/1125                                   426 (13.5)                              8.6 (p<0.001)
Frequency constraint test   207/448                                    203 (10.6)                              -2.7 (p<0.005)
Table 7. Statistical results of the differentiation of homophone pairs in usage in terms of
grammatical class in Dutch.
Test                        Same-class homophone pairs / total pairs   Mean of 50 random-pairing cycles (SD)   t-test, homophone vs. random pairs
Existence constraint test   180/499                                    186 (11.4)                              3.6 (p<0.001)
Frequency constraint test   55/175                                     71 (6.7)                                16.4 (p<0.0001)
Table 8. Statistical results of the differentiation of homophone pairs in usage in terms of
grammatical class in German.
Test                        Same-class homophone pairs / total pairs   Mean of 50 random-pairing cycles (SD)   t-test, homophone vs. random pairs
Existence constraint test   52/183                                     59 (5.2)                                9.5 (p<0.001)
Frequency constraint test   44/125                                     65 (6.9)                                21.8 (p<0.001)
6. Conclusions and discussions
Homophony appears as a by-product of incessant sound change and language contact. Its
membership is in continuous flux, as new homophones arise and old homophones disappear.
However, the existence and distribution of homophones are not random. The degree of
homophony in a language may be predictable to some extent. Our study has shown that it is
correlated with some parameters of the language, such as the size of the phonological resources,
or the degree of monosyllabicity in the lexicon. There are several possible ways to measure the
size of the phonological resources, for instance, the size of the segment inventory, the number of
possible syllable types, the number of syllable types actually occurring, and so on. The number of
occurring syllables shows a strong negative correlation with the degree of homophony in the 20
Chinese dialects, i.e. the more syllables a language has, the fewer monosyllabic homophones it has.
However, this correlation does not hold in the three Germanic languages. Instead, the number of
monosyllabic words shows a high correlation with the degree of homophony, as the majority
of homophones in these languages are monosyllabic.
While the degree of homophony is possibly predictable, the synchronic distribution of
homophones may also exhibit certain characteristics which reflect the nature of language as a
self-organizing system. Language evolves in a way that ensures efficient communication. When
some words often cause ambiguity and confusion, such as words which are homophonous with
taboo words, they will disappear or change (for instance, monosyllabic words may be
disyllabified). Also, the statistical results for the Germanic languages suggest that homophone
pairs tend to differentiate in grammatical classes, so as to decrease the possibility of ambiguity in
communication, though this may not necessarily be true in all languages.
Self-organization has been recognized as a universal mechanism for the evolution of complex
systems, in parallel to Darwinian natural selection (Kauffman, 1995). Language is such a self-organizing
complex system (Lindblom et al., 1984; Köhler, 1994; Steels, 1998; de Boer, 2001).
Köhler (1994) has proposed a general framework called “synergetic linguistics” to examine the
various self-organizing features in language. Similar to what we have shown in Section 2.1, where
the degree of homophony is shown to be a function of the number of syllable types, he has
proposed several hypotheses, for example that the lexicon size is a function of the number of meanings to
be coded and of the mean polysemy, and that the word length is a function of lexicon size, of
redundancy, of the phonological inventory size, and of frequency.
However, how does the self-organization process progress to result in these various features? The
systemic view, which considers language only as an abstract self-contained system, will not
provide us with a successful framework to look for answers to this question. There is no such
“language system” which actually exists and self-organizes. Nor does any individual speaker aim
to organize his language as an efficient system, for instance, to construct an optimal lexicon, to
put homophones in different grammatical classes, to restrict monosyllabic words or homophones
within a certain degree, etc. The statistical distribution of grammatical differentiation in
homophone pairs and the disyllabification process should be explained as the long-term effect of
language evolution at the population level, through the iterative local interactions among
individual speakers and listeners (Ke, 2004).
While it is hardly possible for empirical studies to examine the long-term effect of such a
dynamic process, computational modelling may provide us with a useful tool to investigate these
questions. With models, experimentation under controlled conditions can be carried out to show
the cumulative effect of long-term local interactions among individuals in a language
community, and to study the effect of various parameters. This methodology has received
growing attention in the study of language evolution in recent years (Cangelosi & Parisi, 2001;
Wang et al. 2004). One immediate direction for future work on homophony is to simulate the
evolution of homophones, from rise to fall, to see whether and how the features of their synchronic
distribution can emerge in the model. This may be a fruitful area to explore.
References
Antilla, R. (1989). Historical and Comparative Linguistics. John Benjamins, Amsterdam/Philadelphia, 2nd
edition.
Baayen, R.H., Piepenbrock, R. & Gulikers, L. (1995). The CELEX Lexical Database (Release 2) [CD-ROM]. Philadelphia, PA: Linguistic Data Consortium, University of Pennsylvania [Distributor].
Bloomfield, L. (1933). Language. The University of Chicago Press.
Cangelosi, A. and Parisi, D. (2001). Simulating the Evolution of Language (ed.). Springer-Verlag, London.
Chao, Y. R. (1934). The non-uniqueness of phonemic solutions of phonetic systems. Bulletin of the
Institute of History and Philology, Academia Sinica, 4 (4):363-397. Reprinted in Readings in
Linguistics, ed. Martin Joos, p38-54.
Chen, M. Y.-C. and Wang, W. S.-Y. (1975). Sound change: actuation and implementation. Language,
51(1):255-281.
Cheng, C-C. (1998). Quantification for understanding language cognition, in Quantitative and
Computational Studies on the Chinese Language, 15-30, ed Benjamin K. T'sou, Tom B. Y. Lai,
Samuel W. K. Chan, and William S-Y. Wang, City University of Hong Kong Press.
Cihui (1964/1995). 汉语方言词汇 (Han4yu3 Fang1yan2 Ci2hui4, “A Collection of Words in Chinese
Dialects”). Beijing Daxue Zhongguo Yuyan Wenxue Xi Yuyanxue Jiaoyanshi, Yuwen Chuban She (北
京大学中国语言文学系语言学教研室, 语文出版社).
Dai, X. J. (1990). Historical morphonologization of syntactic words: Evidence from Chinese derived verbs.
Diachronica, 7(1):9–46.
Deacon, T. (1997). The Symbolic Species. W. Norton and Co., New York.
de Boer, B. (2001). The Origins of Vowel Systems. Oxford University Press.
de Saussure, F. (1910/1983). Course in General Linguistics (Cours de Linguistique Generale). Open Court,
LaSalle, IL.
Duanmu, S. (2000). The Phonology of Standard Chinese. Oxford University Press.
Feng, S-L. (1995). Prosodic Structure and Prosodically Constrained Syntax in Chinese. PhD dissertation
University Microfilms, Inc.
Ferrand, L. and Grainger, J. (2003). Homophone interference effects in visual word recognition. The
Quarterly Journal of Experimental Psychology: Section A, 56(3):403 – 419.
Ferrer, R. and Solé, R. V. (2001). Two regimes in the frequency of words and the origin of complex
lexicons: Zipf's law revisited. Journal of Quantitative Linguistics.
Guo, S. (1938). Zhongguo yuci zhi tanxing zuoyong (中国语词之弹性作用 The elastic function of
Chinese word length). Yanjing Xuebao (燕京学报), 24.
Higgins, J. (1995). Quantifying English homophones and minimal pairs, in Studies in General and English
Phonetics: Essays in Honor of Professor J.D. O'Connor. 326-334. Routledge.
Li, N. (1990). Dongci fenlei yanjiu shuolue (动词分类研究说略 A note on the categorization of verbs).
Zhongguo Yuwen (中国语文), 4:248–257.
Lindblom, B., MacNeilage, P., and Studdert-Kennedy, M. (1984). Self-organizing processes and the
explanation of language universals. In Butterworth, B., Bernard, C., and Dahl, O., editors,
Explanations for Language Universals, pages 181–203. Walter de Gruyter & Co.
Lu, B. and Duanmu, S. (1991). A case study of the relation between rhythm and syntax in Chinese. In The
Third North America Conference on Chinese Linguistics. Ithaca.
Lü, S. X. (1963). Xiandai hanyu shuangyinjie wenti chutan (现代汉语双音节问题初探 An enquiry to the
question of disyllabification in Chinese). Zhongguo Yuwen (中国语文), 1:11–23.
Jespersen, O. (1933). Monosyllabism in English, Biennial Lecture on English Philology, in the British
Academy, Nov. 6. Reprinted in Selected writings of Otto Jespersen, Tokyo: Senjo Publishing, 617-641.
Kauffman, S. A. (1995). At Home in the Universe: the Search for Laws of Self-organization and
Complexity. New York: Oxford University Press.
Ke, J-Y. (2004). Self-organization and Language Evolution: System, Population and Individual.
Unpublished PhD thesis, City University of Hong Kong.
Kelly, M. H. and Ragade, A. (2000). Grammatical relationships between homonyms: effects on language
comprehension and the structure of the English vocabulary. Unpublished manuscript.
http://www.sas.upenn.edu/~kellym/homonym.html.
Kennedy, G. A. (1951/1964). The Monosyllabic Myth, Journal of the American Oriental Society, 71.3:
161-166, 1951. reprinted in Selected works of George A. Kennedy, edited by Tien-yi Li. 104-118.
New Haven, Conn. : Far Eastern Publications, Yale University.
Köhler, Reinhard. (1994). Synergetic Linguistics. In: Asher, R.E.: The Encyclopedia of Language and
Linguistics. Oxford, New York, Seoul, Tokyo: Pergamon Press, S. 4454-4455.
Maddieson, I. and Precoda, K. (1990). Updating UPSID. Journal of the Acoustical Society of America,
Suppl. 1, Vol. 86, S19.
Malkiel, Y. (1979). Problems in the diachronic differentiation of near-homophones, Language, 55:1-36.
Masini, F. (1993). The Formation of Modern Chinese Lexicon and Its Evolution toward a National
Language: The Period from 1840 to 1898. Journal of Chinese Linguistics, California.
Nettle, D. (1995). Segmental inventory size, word length, and communication efficiency. Linguistics,
33:359–367.
Nettle, D. (1998). Coevolution of phonology and the lexicon in twelve languages of West Africa. Journal
of Quantitative Linguistics, 5(3):240–245.
Nettle, D. (1999). Linguistic Diversity. Oxford University Press, Oxford.
Steels, L. (1998). Synthesizing the origins of language and meaning using coevolution, self-organization
and level formation. In Hurford, J. R., Studdert-Kennedy, M., and Knight, C., editors, Approaches to
the Evolution of Language: Social and Cognitive Bases, pages 384–404. Cambridge University Press,
Cambridge.
Stimson, H. (1966). A tabu word in the Peking dialect. Language. 42.285-294.
Tsou, B. K. (1976). Homophony and internal change in Chinese. Computational Analysis of Asian &
African Languages, 3:67–86.
Van Orden, G. C. (1987). A rows is a rose: Spelling, sound, and reading. Memory and Cognition, 15:181–
198.
Wang, W. S.-Y. (1969a). Project DOC: Its methodological basis. Journal of the American Oriental Society,
90:57–66.
Wang, W. S.-Y. (1969b). Competing changes as a cause of residue. Language, 45(1):9–25.
Wang, W. S.-Y. (1977). The Lexicon in Phonological Change (ed.). Mouton, The Hague.
Wang, W. S.-Y., Ke, J.-Y., and Minett, J. W. (2004). Computer modeling of language evolution. In Huang,
C.-R. and Lenders, W., editors, Computational Linguistics and Beyond: Perspectives at the Beginning
of the 21st Century. Frontiers in Linguistics (I). Language and Linguistics, Academia Sinica, Taipei.
Zihui (1962/1989). Hanyu Fangyin Zihui (汉语方音字汇 A collection of Character Pronunciation in
Chinese Dialects). Beijing Daxue Zhongguo Yuyan Wenxue Xi Yuyanxue Jiaoyanshi, Wenzi Gaige
Chubanshe (北京大学中国语言文学系语言学教研室,文字改革出版社)1989 2nd edition.
Acknowledgements
I would like to thank Profs. Reinhard Köhler, William S-Y Wang and Chin-Chuan Cheng, and the
members of the former Language Engineering Laboratory of City University of Hong Kong for their
helpful discussions. Also, I am thankful to Volker Dollun, Dinoj Surendran, Lolke Van-Der-Veen
and Feng Wang for their help in this study. Special thanks are due to Dr. Christophe Coupé and the
support of Laboratoire Dynamique du Langage, Institut des Sciences de l'Homme in Lyon, France.