
Signs & Sounds: Phonaesthematic Association of
Japanese Mimetics Using Artificial Neural
Networks
Luke Smith '06
Swarthmore College Department of Linguistics
K. D. Harrison & D. J. Napoli Advising
April 26, 2006
Uttering a word is like striking a note on the keyboard of imagination.
-Wittgenstein
Abstract
The Japanese mimetic lexicon displays a number of phonaesthematic regularities associating phonological features with semantic characteristics. Artificial Neural Networks can be used to good effect in mapping phonological patterns onto semantic features. This paper demonstrates a proof-of-concept by training an ANN to associate phonological features with meaning to a large degree of success. Cluster analysis reveals that activation patterns for tokens matching the same phonaesthematic pattern group together. Larger corpora and more sophisticated techniques are required for further research.
Mimetics
Saussure famously argued that the linguistic sign is completely arbitrary in nature. According to orthodox linguistic theory, a word's phonological representation bears no relationship to its semantic referent. The existence of mimetics poses a problem for this analysis. Mimetic words are either sound-symbolic or in some other way a mimesis of a physical or psychological phenomenon. The mimetic domain does indeed display a correspondence between sign and signified. Specifically, in Japanese, associations exist between the phonological features of mimetics and the semantic properties of the phenomena they represent.
These representations are in a sense no less arbitrary than those made in other parts
of the lexicon. Phonological symbolism simply displays a finer granularity of arbitrary
representation than that of the non-mimetic lexicon. However, there may be limitations
on the space of possible mimetic representations. Consider what might be called "the Tinkerbell phenomenon." Tinkerbell, a pint-sized fairy who appears to us as an ethereal glimmer, speaks with the high-pitched ringing of a tiny bell. The ubiquitous sound-effect for the glinting of light is a sharp, high sound. Could Tinkerbell plausibly be associated with a deep, rumbling boom? Even in a radically different social and linguistic context, this seems impossible. Although the sounds in the play are not linguistic representations proper, this question should give us pause before we so easily assume that linguistic signs are entirely arbitrary.
The Japanese Mimetic Lexicon
The Japanese lexicon is stratified into four distinct sub-lexica: Yamato (native) words,
Sino-Japanese words and compounds, non-Chinese loans, and mimetics. The Japanese
mimetic lexicon is extremely rich in comparison with that of English; not limited to sound-symbolism, it contains thousands of words describing a wide variety of phenomena. Speakers classify these into four types: giseigo (sounds of living things), giongo (sounds of the inanimate world), gitaigo (mimesis of physical states or events), and gijougo (mimesis of psychological states or events) (see Table 1). Korean has a similarly rich lexicon. (Shibatani 1990)
Japanese has a rich lexicon of mimetics that represent visual, psychological, and a variety of abstract phenomena. Among these are pika-pika ("shining") and nobi-nobi ("feel at ease"). Japanese mimetics typically fill an adverbial role, as found in sentences like ame ga zaa-zaa futteiru ("The rain is falling zaa-zaa" or "The rain is falling hard"). They can also be adjectival or even verb-like when used with a light verb such as the copula (da) or "do" (suru). Thus, the phrase kyoro-kyoro suru means "to look about helplessly."
giseigo (sounds of living things): wan-wan (dog's bark), kero-kero (frog's croak), nya-nya (cat's meow)
giongo (sounds of the inanimate world): gujya-gujya (slushy, sloppy), kara-kara (dry rustling), baki-baki (snapping, breaking)
gitaigo (mimesis of physical states or events): pika-pika (shining), neba-neba (sticky, greasy), kunya-kunya (soft and cuddly)
gijougo (mimesis of psychological states or events): mura-mura (irresistible temptation), sobo-sobo (dispirited), wasa-wasa (nervous, restless)

Table 1: Tokens from the mimetic lexicon by type
Mimetics are often (but not always) at least partially reduplicative in their structure. The prototypical mimetic is one like petya-petya ("talk like birds chirping"), which has a fully reduplicated CVCV-CVCV structure. Another, related token is petya-kutya, "to talk (in an annoying way) as a crowd talks." Ivanova (2002) observes that the first, fully-reduplicated word neutrally describes a phenomenon, while the second, partially-reduplicated version applies a subjective judgement to the activity described. In addition to fully- and partially-reduplicated forms, a third class of non-reduplicative mimetics exists. These make use of fixed endings with a set semantic association onto which only a single mora is prefixed. They include words like haQ-kiri "clearly" and kuQ-kiri "distinctly."¹

¹This paper uses a modified version of Kunrei romanization for Japanese; Q denotes a geminate, and N a moraic nasal.
Regularities in the mimetic lexicon are best understood in terms of mora. Mora are
sub-morphological units of uniform length in Japanese. They often, but not always, correspond to syllables. A mora can be (1) a CV string in which the vowel is not long, (2) a CyV string, where 'y' denotes palatalization (e.g. kyo in to.kyo.u [Tokyo]), (3) a single vowel, which may stand alone or lengthen another vowel, (4) a nasal (in restricted distribution), or (5) a geminate (also restricted). The mimetic petya-petya, for example, is divided into four mora: pe.tya.pe.tya. All of the mimetics discussed in this paper are of this typical four-mora type.
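Under the romanization used here, mora boundaries are written with periods, so a token's mora structure can be recovered mechanically. The following is a minimal sketch (mine, not part of the thesis code; the function name mora_type is illustrative) that classifies each mora of a dot-segmented mimetic into the five types just listed:

# A minimal sketch (assumed, not from the thesis) classifying each mora of a
# dot-segmented mimetic into the five types listed above, using the paper's
# romanization (Q = geminate, N = moraic nasal).
VOWELS = "aiueo"

def mora_type(mora):
    if mora == "Q":
        return "geminate"
    if mora == "N":
        return "moraic nasal"
    if all(seg in VOWELS for seg in mora):
        return "vowel"
    if "y" in mora[1:]:
        return "CyV (palatalized)"
    return "CV"

print([(m, mora_type(m)) for m in "pe.tya.pe.tya".split(".")])
# [('pe', 'CV'), ('tya', 'CyV (palatalized)'), ('pe', 'CV'), ('tya', 'CyV (palatalized)')]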
Japanese mimetics are used to describe everything from a person's manner of walking to the distribution of checks in a restaurant. They are a vital, daily part of spoken Japanese and are necessary for basic understanding of the language. In addition to lexical knowledge, Japanese speakers have access to a regular system of mimetic construction, and mimetics have at least a limited productivity. In other words, the distribution of phonemes in the mimetic lexicon is not random. Certain phonological features are associated with semantic "features" of mimetics. Perhaps most prominently, voicing is associated with what might be called "intensity"; pouring rain is described with zaa-zaa, while a soft shower is described by a devoiced version of the same mimetic: saa-saa; baki-baki describes the breaking of larger objects than does paki-paki. Other associations exist; velars, for instance, correspond with quick, decisive movement or action. No comprehensive account exists, however, of all associations between phonological features in mimetics and the features of their semantic interpretation.
Approaches to Classification
One approach to discovering these associations might be to interview native speakers and test their intuitions. This, however, is problematic in several ways. First, native speakers are not particularly attuned to the relation between phonology and semantics in mimetics. Their production takes place largely outside of consciousness. Second, such a methodology lacks a rigorously quantitative approach that might shed light on the relative strengths of semantic associations and how conflicts between them are resolved. Finally, interviewing native speakers is time-consuming and difficult. A great deal of research would be required to uncover a complete account of Japanese mimetics.
Another possible method for discovering phono-semantic associations is a brute-force
human analysis of the lexicon. This approach could certainly be sufficiently quantitative,
but the flood of raw numeric data might make recognizing relevant patterns extremely
difficult. Such an analysis could mean countless hours wasted for little gain in predictive
capability.
The problems involved with the methods above motivate a third approach: analysis by Artificial Neural Network (ANN). ANNs are computer-simulated networks designed to discover, with minimal intervention, salient patterns in noisy data. They are modeled on the neural structure of the human brain, and they excel in extracting relevant associations from data which is often inconsistent, difficult to visualize, and overwhelming. ANNs are employed effectively in such tasks as face recognition, artificial speech production, and interpretation of robot vision. ANNs are well-suited to the task of discovering the phono-semantic associations in the Japanese mimetic lexicon.
The lexicon in question is simultaneously vast and intricate, and it has proven resistant to previous attempts at systematic classification. Hamano (1998) attempted to directly associate phonemes in specific word-positions with semantic features. Although some associations can be made, they are not robust and many counterexamples are attested. Ivanova (2002) suggests a more nuanced approach, which takes the phonaesthematic pattern as its basic unit of analysis. The word phonaesthematic describes "the presence of a sequence of phonemes shared by words with some perceived common element in meaning," e.g. glimmer, glisten, glint, and glamour. By examining sub-morphological phonaesthematic sequences instead of individual phonemes, a more robust analysis of the Japanese mimetic lexicon can be developed.
The phonaesthematic nature of the regularities in the lexicon does not present a problem for ANN analysis. Although the direct inputs to the network will consist exclusively
of an ordered set of phonemes identified by their features, the internal layer of the network provides a medium in which to build associations between specific sequences of
phonemes or even abstract relationships such as complete or partial reduplication. An
ANN should theoretically be capable of identifying any relevant pattern in the data provided.
Ivanova identifies a staggering number of robust and well-attested phonaesthematic
regularities in the lexicon. These, with some additions and modifications, appear in Appendix A. ANN analysis could provide an interesting point of comparison to Ivanova's
manual survey. It also might be more able to uncover a hierarchy of associations and shed
light on their interaction.
Crucially, Ivanova points out that tokens fitting multiple phonaesthematic patterns often display semantic characteristics of both. This points to a difficulty in the classification of mimetic tokens that ANNs are particularly suited to address. Each token is, in effect, placed on one or more dimensions of meaning. If each semantic feature with a phonaesthematic association is imagined in this way, any given token can be plotted in an n-dimensional space, where n is the number of semantic characteristics. This n-dimensional space would model the entire range of phonaesthematic possibility. Tokens with similar meanings would cluster close to each other, while opposites would be distant from one another.
This method, however, presents a problem. What if two phonaesthematic associations
are in conflict? For instance, initial mo carries the meaning of "murky," (Ivanova 2002)
while the ending Q-kiri indicates clarity. Could a word like mOQ-kiri exist in Japanese?
Speakers confirm that it could not (Terasaki, Okamoto 2006). The two semantic associations are in conflict. Perhaps they are not, however, so directly opposed that they could
be collapsed into a single semantic dimension. This means that there is a region in the
spatial model which is effectively off-limits. In a perfect model, every point in the theoretical space should correspond to a possible (but not necessarily attested) token in the
mimetic lexicon. What does this mean for our analysis? It may not be possible to predict
the outcome of all phonaesthematic conflicts and develop a set of semantic features which
will allow a perfect modeling of the lexicon's phonaesthematic space, or, alternatively, a
set of ideal meta-features might be required to obtain such a result.
Artificial Neural Networks
The ANN developed in this investigation takes as its inputs the phonological features
of a given mimetic from the lexicon. After training, it should be capable of associating
those phonological features with a given set of "semantic features" such as intensity and
others. Determining the relevant set of semantic features is the primary challenge of
this approach. If features with no phonological association are selected, no pattern can
be discovered in the data. Thus, this method necessitates an approach informed by the
intuition of native speakers and an analysis of the lexicon.
Training an ANN requires a large corpus of representative data tagged with the correct associations between a given input and the "correct" output. In the case of Japanese
mimetics, this consists of a list of as many mimetics as possible, their phonological features, and the semantic features with which those phonological features are to be associated. The phonological features make up the data for the input layer of the neural
network. The activations of the input layer are then multiplied by the network weights
of the ANN and passed to the "hidden" layer. The hidden layer again multiplies its input
by network weights as it passes its activations to the output layer. The nodes of this layer
correspond to the semantic features of a given mimetic.
Over time, the ANN adjusts the network weights between its layers to better reflect the associations between phonological and semantic features. It does this using an algorithm
Figure 1: Simplified view of an Artificial Neural Network (Wikipedia.org, Creative Commons Attribution license)
called "Back-Propagation," or BACKPROP. BACKPROP performs a stochastic gradient descent to minimize error in the network. (Meed en 2005)
Cluster Analysis
One problem with the ANN approach is that, while a network is capable of finding associations in the raw data, it often provides little insight into the structure of the system
being investigated. Although the network may predict semantic associations for a given
set of phonological features, it may reveal little about the underlying rules governing
such associations. In order to discover these, an analysis of the network weights must be
performed.
The network weights are the components of the ANN that store the associations between inputs and outputs. Each node in the network receives an input value between
zero and one. It then multiplies that value by a network weight and passes it to each
node in the next layer. Each node has a network weight associated with every node in the
next layer of the ANN. When it passes a value to another node, it multiplies that value
by a weight that is used exclusively for that other node. The process of adjusting these
weights in response to training data is what gives ANNs their associative power.
To test the performance of an ANN at a given task, two data sets are required. The
first is the training set, tagged with both the inputs (phonological features) and the correct
outputs (semantic features). The ANN adjusts its network weights in response to the
associations found in the training data set. Then, to evaluate the performance of the
network, a corpus of novel tokens of the same class is presented. The network predicts the
outputs associated with the inputs. If its accuracy does not exceed that of a random guess,
it is unlikely that any pattern of association exists between inputs and outputs. If the
ANN analysis of Japanese mimetics were successful, the network, given a novel token,
would predict the same semantic associations as a native speaker would upon hearing
the mimetic for the first time.
Even if the network can accurately predict outputs from inputs, its methods of association remain opaque. In order to extract information from the ANN, it is necessary to interpret the network weights. Superficially they are nothing more than a series of meaningless numerical values. To discover their significance, a cluster analysis can be performed. Cluster analysis models the network weights in an n-dimensional space, where n is the total number of network weights. This space corresponds to the n-dimensional space of phonaesthematic association discussed above. A set of inputs is fed into the network, possibly even the entire corpus. The activations of each node (the input multiplied by the network weight) are then plotted into this n-dimensional model. The distance between each set of activations is determined, and the associated inputs are grouped by their proximity to one another. The two closest inputs are grouped together under the input which was closest to both of them, and so on. The entire set of inputs can be grouped in this way into hierarchical clusters, revealing the underlying system of categorization that drives the association of input and output. Thus, we might expect that voiced mimetics would cluster apart from unvoiced mimetics, and that below this major association in the hierarchy, other clusters (around, say, velar-decisiveness) would form. Cluster analysis is a convenient way to discover which patterns are prominent and which are relatively weak.
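The grouping procedure can be sketched as follows (an assumed illustration, not the toolkit's own clustering routine; merging by centroid is one simple choice for representing a combined cluster, and the activation values are hypothetical):

import numpy as np

def cluster(names, activations):
    # each token's hidden-activation vector is a point; repeatedly merge the two
    # closest clusters, printing the minimum distance at each step, as in the
    # Figure 4 and Appendix D output
    clusters = [[n] for n in names]
    points = [np.asarray(a, dtype=float) for a in activations]
    while len(clusters) > 1:
        best = None
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                d = np.linalg.norm(points[i] - points[j])   # Euclidean distance
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        print("minimum distance = %.6f  %s  %s" % (d, clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        points[i] = (points[i] + points[j]) / 2.0   # represent the merged cluster by its centroid
        del clusters[j], points[j]
    return clusters[0]

# hypothetical hidden activations for the Trial One tokens:
cluster(["pi.ka.pi.ka", "sa.a.sa.a", "za.a.za.a", "ku.tya.gu.tya"],
        [[0.1, 0.2], [0.3, 0.6], [0.9, 0.8], [0.1, 0.25]])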
#  Mimetic         Gloss                 [VOICED.VELAR.FRICATIVE] x 4 mora   INTENSE
1  pi.ka.pi.ka     "shining"             [0.0.0][0.1.0][0.0.0][0.1.0]        0
2  sa.a.sa.a       "soft rain sound"     [0.0.1][1.0.0][0.0.1][1.0.0]        0
3  za.a.za.a       "hard rain sound"     [1.0.1][1.0.0][1.0.1][1.0.0]        1
4  ku.tya.gu.tya   "speak worthlessly"   [0.1.0][0.0.0][1.1.0][0.0.0]        0

Table 2: Trial one test corpus
Semantic Features
As mentioned above, determining the best set of semantic features to associate with
phonological features is the primary challenge of ANN classification of the mimetic lexicon. INTENSE is certainly one of these. Other relevant characteristics found in the literature include tautness, hardness, viscosity, weakness, friction, protrusion, largeness, even
emotionality, normativity and self-confidence. The only way to find the most effective
feature set is through trial and error.
Trial One: Feature Extraction
Trial One demonstrates the process through which ANNs extract relevant features from noisy data. A small four-token test corpus was used to demonstrate the network's basic ability to associate phonologically +VOICED mimetics with the +INTENSE semantic feature. In order to maintain a constant number of phonological inputs to the neural network, phonological assignments were made to each of the four mora in these typical four-mora mimetics. Mora were marked +VELAR if their initial consonant was a velar, and +FRICATIVE if their initial consonant was a fricative. Because all vowels are voiced, a mora was required to be entirely +VOICED (including the initial consonant) in order to be marked +VOICED.
Cycle  Mimetic         Error
1      pi.ka.pi.ka     0.208295724222
2      sa.a.sa.a       0.159071080922
3      za.a.za.a       0.420666643129
4      ku.tya.gu.tya   0.17328400539
5      pi.ka.pi.ka     0.140041028372
6      sa.a.sa.a       0.107125475597
7      za.a.za.a       0.496530943773
8      ku.tya.gu.tya   0.127372183711
9      pi.ka.pi.ka     0.105973608446
10     sa.a.sa.a       0.0836767780189
11     za.a.za.a       0.538912477283
12     ku.tya.gu.tya   0.104159019003
13     pi.ka.pi.ka     0.0884180378068
14     sa.a.sa.a       0.0716269472153
15     za.a.za.a       0.563691291134
16     ku.tya.gu.tya   0.091032369665

Table 3: Error reports for first sixteen cycles in Trial One
It should be made clear that the ANN is agnostic to the theoretical correctness of the feature assignment; as long as a relation is preserved between input and target output, it will discover it.
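For illustration, the per-mora encoding behind Table 2 can be written compactly as below (a sketch with names of my own choosing; the full 14-feature extractor actually used appears in Appendix C):

# A small illustrative sketch (not the thesis code) of the per-mora
# [VOICED, VELAR, FRICATIVE] encoding used for the Trial One corpus in Table 2.
VELARS = "kg"
FRICATIVES = "szSZT"
VOICED = "zbgr" + "aiueo" + "mn" + "N"   # every segment must be voiced for the mora to count

def encode_mora(mora):
    voiced = 1 if all(seg in VOICED for seg in mora) else 0
    velar = 1 if mora[0] in VELARS else 0
    fricative = 1 if mora[0] in FRICATIVES else 0
    return [voiced, velar, fricative]

def encode_token(token):
    # tokens are written with '.' between morae, e.g. "za.a.za.a"
    return [encode_mora(m) for m in token.split(".")]

print(encode_token("za.a.za.a"))   # [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 0, 0]]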
Table 2 gives the four-token test corpus, selected phonological features associated with each mora in each mimetic, and the intensity value associated with each mimetic. They are numbered in the sequence in which they were presented to the network. Table 3 gives the error reports of the network over the first sixteen cycles, through four presentations of each mimetic. The error rate is reported as the average difference between the target output values (in this case there is only one output value: zero or one, for -INTENSE and +INTENSE respectively) and the values produced in the network's output layer. As expected, error rates for the +INTENSE token zaa-zaa are much higher than those for the other three tokens in the first several hundred cycles; the error rate for saa-saa is also higher than for the two remaining tokens between cycles 100 and 1,000. The graph in Figure 2 shows a point-by-point error plot in which the error trajectories for zaa-zaa, saa-saa, and the other two (indistinguishable) tokens are clearly differentiated.
Figure 2: Intensity prediction error for four-token training corpus in Trial One
Figure 3 shows a closer view of error rates for eight passes through the training corpus from cycles 201 to 232. At this stage, the network has begun to misidentify +FRICATIVE as a predictor for +INTENSE. The peaks in the graph represent the high error reports for zaa-zaa, the +INTENSE token. The two low reports in each pass through the corpus are from pika-pika and kutya-gutya, the -FRICATIVE tokens. The reports immediately to the left of each peak are for saa-saa; the +FRICATIVE feature it shares with zaa-zaa is temporarily confusing the network.

The error rates converge over 1,000 training cycles as the network discovers that +VOICED is the relevant feature for predicting +INTENSE. The elevated error for saa-saa is due to the fact that it shares a feature, +FRICATIVE, with zaa-zaa. The network temporarily misidentifies +FRICATIVE in addition to +VOICED as a relevant feature for the prediction of +INTENSE. By cycle 1,000, however, the network has correctly identified +VOICED as the only relevant feature, and error for all tokens approaches zero.
Figure 3: Distinct error trajectories for tokens in Trial One training corpus
#  Mimetic        Gloss                INTENSE  Error
1  ti.bi.ti.bi    'little by little'   0        0.00159028974308
2  ka.ra.ka.ra    'dry rustling'       0        0.00154425726513
3  ga.ra.ga.ra    'clattering'         1        0.012309590643
4  su.be.su.be    'slick'              0        0.00490661539401

Table 4: Trial one evaluation corpus
At this point, the network has learned the correct associations for the corpus data. However, as any parent knows, just because a child can recite the text of her favorite book as the pages are turned does not mean that she can read. Table 4 demonstrates that the network has in fact learned a general method for distinguishing +INTENSE tokens: error was minimal on a novel evaluation corpus.

Another way of assessing the network's "understanding" of patterns in the corpus is through cluster analysis of its hidden activations. Figure 4 gives a cluster hierarchy of the training and evaluation corpora in Trial One. Clusters for both the training corpus and the evaluation corpus show the +INTENSE token (zaa-zaa and gara-gara respectively) farthest in Euclidean distance from the other tokens. In other words, distinct clusters of hidden activation patterns correspond to +INTENSE and -INTENSE.
Trial Two
Trial One demonstrates the ANN's basic capability and workings, but does not enhance our understanding of the mimetic lexicon. Trial Two will build on the methodology in Trial One to analyze a larger corpus for more complex phonaesthematic associations. As a step toward classifying a corpus representative of the entire mimetic lexicon, Trial Two demonstrates the network's ability to extract multiple patterns from a data set.

Ivanova (2002) provides a list of observed phonaesthemes and gives several examples of each. The phonaesthemes were analyzed and reworked in collaboration with
Training Corpus
minimum distance = 0.000383   ( to.ki.do.ki ) ( pi.ka.pi.ka )
minimum distance = 0.760665   ( pi.ka.pi.ka to.ki.do.ki ) ( sa.a.sa.a )
minimum distance = 1.710566   ( sa.a.sa.a pi.ka.pi.ka to.ki.do.ki ) ( za.a.za.a )
Resulting Tree =
|-----------------------------------------> za.a.za.a
|
|------------------> sa.a.sa.a
|
|-> pi.ka.pi.ka
|-> to.ki.do.ki

Evaluation corpus
minimum distance = 0.090262   ( ka.ra.ka.ra ) ( ti.bi.ti.bi )
minimum distance = 0.228090   ( su.be.su.be ) ( ti.bi.ti.bi ka.ra.ka.ra )
minimum distance = 1.236330   ( ti.bi.ti.bi ka.ra.ka.ra su.be.su.be ) ( ga.ra.ga.ra )
Resulting Tree =
|-----------------------------------------------> ga.ra.ga.ra
|
|----> ti.bi.ti.bi
|----> ka.ra.ka.ra
|---------> su.be.su.be

Figure 4: Cluster analysis of hidden activations in Trial One
Terasaki and further tokens were uncovered. Appendix A shows the final phonaesthematic clusters and their associated feature IDs used in the trial. The Trial Two corpus will consist of these 150 tokens. The network's performance on the corpus will be evaluated using Ivanova's and Terasaki's analyses as a reference.

Table 5 gives a complete listing of the phonological and semantic features presented to the network. The 14 phonological features are sufficient to distinguish any two possible tokens from one another. The semantic features listed, however, come nowhere near exhausting the set of possible (or even probable) features available to native speakers.
Certain stumbling blocks were encountered in the preparation of the corpus. First, mimetics are often similar in structure to words derived from reduplication in other parts of the lexicon. toki-doki (occasional) and betsu-betsu (separately) seem similar to mimetics but are in fact derived from reduplication of the non-mimetic words toki (time) and betu (separate) respectively. Some of these tokens exist on a continuum with mimetics, as laid out in Table 6.
Phonological features:
VELAR, PALATAL, NASAL, CORONAL, LABIAL, FRICATIVE, VOICED, MORAIC_N, GEMINATE,
V_HIGH, V_LOW, V_BACK, V_LABIAL, V_TENSE

Semantic features:
MURKY, EXEN, STICKY, NOSTRSS, UNSTEAD, FREE, DEFIC, NOSPACE, NOVIT, IDLE, BADSFT,
SHAKING, RSTLSS, LIQSWAY, DISORD, NOCONF, COWARD, IMPERT, BRISK, MEANDER, SFTUNR,
PRESS, TIMID, EXPROP, ANXIETY, SUDDEN, GENTLE, SUCCESS, CLEAR, EXGOOD, SUFF, MANY,
PLENTY, HEAVY, RELAX, UNATTR, FIDGET

Table 5: Phonological and semantic features in Trial Two
Token       Meaning                          Lexical Association   Classification
toki-doki   occasional                       toki ("time")         non-mimetic
tibi-tibi   little by little                 tiisai ("small")      mimetic
doki-doki   heavy pounding (of the heart)    [none]                mimetic

Table 6: Mimetics and Lexical Reduplication
In some cases of conflict the non-mimetic lexeme may contradict the phonaesthematic association. The token mote-mote, for instance, fits the pattern for both "murkiness" and "exceeding the proper amount," but its lexical association with holding ("possess," motsu) gives it the interpretation of "sexually attractive."
Another difficulty in dealing with the corpus is the slippery nature of the semantic content of the phonaesthemes themselves. For instance, Ivanova (2002) identifies the pattern in Table 7 as a phonaestheme corresponding to "slow action / idleness." Consultation with Terasaki (personal communication, 2006), however, revealed that tokens that Ivanova classified as exceptions to the phonaesthematic pattern are in fact part of the same phonaestheme. A more accurate characterization of the semantic association in question might be "in round drops, rolling." The token goro-goro means lolling about, as in rolling around on a futon or kicking around the house; oro-oro means to be mentally off-balance, with thoughts rolling about in the head at random. The tokens that Ivanova (2002) classified as exceptions (boro-boro, etc.) are in fact closer to the core, non-metaphorical meaning associated with the phonaestheme than the ones classified as representative. As the "slow action / idleness" phonaestheme demonstrates, the semantic associations of many mimetics are difficult to pin down with other language.

Despite these difficulties, the ANN in Trial Two successfully classified most of the tokens in the corpus with minimal error. Figure 5 shows the short-term error reports, and Figure 6 shows them over the full trial. There are some outliers whose error levels drop only slightly over time, but given enough training, the network would simply "memorize" these tokens and classify them correctly.
Pattern: (C) + oro + (C) + oro
Meaning: slow action; idleness
Tokens: doro-doro (pulpy, mushy, thick), oro-oro (thrown off balance), goro-goro (lie about idly)
Anomalous tokens: boro-boro (in big drops; worn-out), horo-horo (in small drops), para-para (in drops)

Table 7: Ivanova's problematic classification
Figure 5: First 100,000 cycles in Trial Two training corpus error
18
10000c
50000 100000 150000 .200000 2SOOO0 300000 3SOOOO 400000
' .....k tr:.!Ii! in'log
.. ...
,....-t
.N.t>
..,.................
y .........
_
Figure 6: Complete Trial Two error reports
Figure 7 shows error reports for specific novel tokens in the evaluation corpus. As anticipated, patterns represented with a large number of tokens in the corpus, such as gotya-gotya etc. for the NOCONF phonaestheme, were identified with a lower rate of error. Patterns with only one or two tokens in the training corpus, like betya-betya for the RSTLSS phonaestheme, were not identified at all, because the presence of only one token did not give the ANN an opportunity to distinguish the relevant pattern from the characteristics of its tokens. Cluster analysis, shown in Appendix D, reveals that some mimetics have been clustered by phonaestheme, but no robust pattern has emerged. A larger corpus might produce better results.
Programs and Conclusions
This paper demonstrates a successful proof-of-concept for phonaesthematic association of Japanese mimetics using Artificial Neural Networks. For future work, a corpus of perhaps one hundred times the size might be required. With such a corpus, it would be possible to specify "naive" semantic features such as wetness, speed, hardness, etc. and have the network search for patterns to associate with them. These features might work to break down the semantic associations into smaller, more common elements that would render the space of phonaesthemes in higher resolution.

The work also might benefit from more sophisticated computational methods. One way to speed up the learning of problematic tokens would be to use a Genetic Algorithm (GA). One problem often confronted in the training of neural networks is the phenomenon of the local minimum. If the error function of the network were mapped in a multi-dimensional space, we could imagine how, in its attempt to descend along the error gradient, the network might settle into a configuration that has a lower level of error than the surrounding area, but which does not represent the best possible solution. This is part of the reason why the ANN in Trial Two was unable to classify all of the tokens correctly.
Token            Error
mo.so.mo.so      0.998694319028
mu.ka.mu.ka      1.36113869166
ne.ba.ne.ba      1.01487529777
no.ko.no.ko      1.87014146777
yo.bo.yo.bo      1.99242816593
yu.Q.ku.ri       1.04669909089
ba.sa.ba.sa      2.49753137008e-05
gi.ti.gi.ti      1.12817480377
to.bo.to.bo      0.00214731167517
o.ro.o.ro        4.84269103649e-06
gu.jya.gu.jya    0.00299313717304
yu.ra.yu.ra      1.74301395223
hyo.ro.hyo.ro    0.0906405269696
de.bu.de.bu      0.00220005934501
go.tya.go.tya    1.4080120943e-05
su.go.su.go      1.30561197121
i.ji.i.ji        1.0004849271
nu.ke.nu.ke      0.981309841762
me.ki.me.ki      0.673205094683
u.ne.u.ne        9.58668169567e-05
fu.nya.fu.nya    7.1996645154e-06
do.si.do.si      0.364352304025
ga.si.ga.si      6.91751350252e-06
hi.so.hi.so      1.34500042008
bo.te.bo.te      0.000394612296337
mu.zu.mu.zu      0.699978662153
ga.Q.ta.ri       0.00198909687591
fu.N.wa.ri       0.000603075600391
ba.Q.ti.ri       0.000829279222434
ha.Q.ki.ri       0.853517334462
go.Q.po.ri       0.323952721098
ta.Q.pu.ri       0.00458525164778
ba.Q.sa.ri       1.70046316943
do.Q.si.ri       0.100842889473
go.Q.te.ri       0.000599389182655
o.Q.to.ri        0.00361815702284
be.tya.be.tya    1.97831760239
ji.ta.ba.ta      1.77028268892

Figure 7: Error reports on Trial Two evaluation corpus
In the worst cases, the network is like a marble sitting in a basin on top of a mountain. Some method is required to test configurations that are relatively distant from the best solution currently within reach.

In order to do this without sacrificing hard-won training to random reconfiguration, a Genetic Algorithm (GA) could be employed. A population of neural networks is instantiated, trained, and made to compete for survival. In each generation, a certain amount of random mutation is introduced. Low-performing networks are eliminated. New networks are produced through crossover, which is the "breeding" of successful networks to produce new networks that contain characteristics from both of them. In this way, the problem of local minima can be mitigated without giving up the ANN's ability to learn through experience.
Another technique of ANN analysis might also prove useful. It is possible to reverse the direction of the network's throughput (move activation from the output layer through to the input layer) and derive "paradigmatic" representations of the types the network is classifying. The ANN could be given a set of semantic features and made to produce the mimetic token that best represents them. In this way, new phonaesthemes could be more easily uncovered.
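One simple way to realize this idea, assuming the two-layer sketch given earlier, is to push a target semantic vector backward through the transposed weight matrices (an illustration of the concept, not an implementation from the thesis):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def paradigm_input(W1, W2, semantic_target):
    # propagate the desired semantic feature vector backward through the
    # transposed weights to obtain a rough input-space "prototype" of that meaning
    hidden = sigmoid(np.asarray(semantic_target, dtype=float) @ W2.T)
    inputs = sigmoid(hidden @ W1.T)
    return inputs   # one value per phonological input; the nearest attested token could then be looked up

# e.g. with the TinyBackpropNet sketch above:
# proto = paradigm_input(net.W1, net.W2, one_hot_semantic_vector)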
Overall, ANNs seem to be a promising tool for classifying mimetics and may prove even more useful than they have already in theoretical linguistics generally. They cannot compare, however, to the learning capacity of the human brain. A speaker can learn a new mimetic and integrate it into her phonaesthematic system after hearing it only once. What child requires 3000 exposures in order to become competent with a new vocabulary item? Even so, ANNs and related tools seem to open up a rich new vein of inquiry.

The above investigation shows that there is a real relationship between semantic characteristics and phonological representations in the Japanese mimetic lexicon. The function of their relation, however, is far from a bijection. Phonaesthematic associations do not constitute a one-to-one correspondence between sign and signified. Tools like Artificial Neural Networks, however, are the best lights we have in the murky world of natural language.
Appendix A: Ivanova/Terasaki Phonaesthemes

Patterns constraining the 1st & 3rd morae:

MURKY
Pattern: mo + CV + mo + CV or mo + x
Meaning: murkiness
Tokens: mogo-mogo (mumble), moso-moso (mumble), moQ-sari (sluggish), moya-moya (hazy, murky), moku-moku (send up great volumes of smoke)
Anomalous tokens: mori-mori (eat heartily; have thick muscles)

EXEN
Pattern: mu + CV + mu + CV or mu + x
Meaning: excessive energy, suppression
Tokens: muka-muka (retch, go mad), muku-muku (growing), muN-muN (stuffy, sultry), mura-mura (irresistible temptation)
Anomalous tokens: muki-muki (thick muscles)

STICKY
Pattern: ne + CV + ne + CV or ne + x
Meaning: stickiness, tenacity
Tokens: neba-neba (sticky, greasy), nechi-nechi (sticky, persistent), neQ-tori (clammy, viscous)

NOSTRS
Pattern: no + CV + no + CV or no + x
Meaning: slow action, lack of stress, anxiety, or uneasiness
Tokens: nobi-nobi (feel at ease, relaxed/relieved), noko-noko (nonchalantly), noro-noro (drag oneself, walk slowly)

UNSTEAD
Pattern: yo + CV + yo + CV
Meaning: unsteady, unreliable
Tokens: yobo-yobo (feeble, shaky), yoro-yoro (unstable, weak), yochi-yochi (toddle), yota-yota (totter)

FREE
Pattern: yu + CV + yu + CV or yu + x
Meaning: free of pressure, relaxing
Tokens: yuQ-kuri (slowly, at one's leisure), yu-rari (slowly), yusa-yusa (sway to and fro), yuQ-tari (slow), yuu-yuu (chill / relaxed)

Patterns constraining the vowel-mora:

DEFIC
Pattern: C + asa + C + asa
Meaning: disappointing appearance due to some deficiency
Tokens: basa-basa (friable, crumbly, disheveled), pasa-pasa (dry), kasa-kasa (dry, rough), wasa-wasa (nervous, restless)

NOSPACE
Pattern: g/k + V + ti + g/k + V + ti
Meaning: lack of fluidity and space
Tokens: giti-giti (very tight), bati-bati (heavy, brittle), pati-pati (brittle), kati-kati (frozen hard, dried up completely), koti-koti (tense, stiff, frozen hard)
Anomalous tokens: moti-moti (mushy)

NOVIT
Pattern: C + obo + C + obo
Meaning: lack of vitality and energy
Tokens: syobo-syobo (dispirited, despondent, bleary-eyed), tobo-tobo (trudge wearily), yobo-yobo (feeble, shaky, unsteady)

IDLE
Pattern: (C) + oro + (C) + oro
Meaning: slow action, idleness
Tokens: doro-doro (pulpy, mushy, thick), oro-oro (thrown off balance), goro-goro (lie about idly), boro-boro (in big drops; worn-out), horo-horo (in small drops), para-para (in drops)

BADSFT
Pattern: (C) + ujya + (C) + ujya
Meaning: unpleasant softness
Tokens: gujya-gujya (slushy, sloppy), ujya-ujya (wriggle)

SHAKING
Pattern: (C) + ura + (C) + ura
Meaning: shaking, swaying
Tokens: kura-kura (dizzy, reeling, spinning), yura-yura (quake, sway), mura-mura (unsteady), gura-gura (toppling), bura-bura (swaying, hanging out)
Anomalous tokens: sura-sura (smoothly, without a hitch)

RSTLSS
Pattern: (C) + yoro + (C) + yoro
Meaning: restless, unstable
Tokens: nyoro-nyoro (long and squirming), gyoro-gyoro (water dribbling), tyoro-tyoro (trickle, flicker), hyoro-hyoro (totter, toddle, stagger), kyoro-kyoro (look around restlessly, nervously)

Patterns constraining the 2nd & 4th morae:

LIQSWAY
Pattern: (C)V + bu + (C)V + bu
Meaning: a great amount of liquid or flesh swaying
Tokens: debu-debu (fat and flabby), gabu-gabu (guzzle/chug, slosh), zabu-zabu (splash)

DISORD
Pattern: (C)V + tya + (C)V + tya
Meaning: disorder, disappointing experience
Tokens: itya-itya (behave flirtatiously), bitya-bitya (spurting), gotya-gotya (worthless speech; messy, jumbled-up), kutya-kutya (crumpled)

NOCONF
Pattern: CV + go + CV + go
Meaning: lack of confidence
Tokens: mago-mago (at a loss, disorganized), mogo-mogo (mumble), sugo-sugo (disheartened, downcast)

COWARD
Pattern: (C)V + ji + (C)V + ji
Meaning: cowardice, bewilderment, lack of enthusiasm/initiative
Tokens: iji-iji (hesitantly, timidly), moji-moji (bashful, fidget), taji-taji (quail)
Anomalous tokens: maji-maji (to stare intently)

IMPERT
Pattern: CV + ke + CV + ke
Meaning: impertinent
Tokens: nuke-nuke (impudently, brazenly), zuke-zuke (bluntly, without reserve), tuke-tuke (rudely)

BRISK
Pattern: (C)V + ki + (C)V + ki
Meaning: brisk, dynamic action
Tokens: doki-doki ([heart] beat fast, throb), paki-paki (breaking), poki-poki (breaking), baki-baki (heavy breaking), bati-bati (heavy breaking), meki-meki (remarkably, rapidly), syaki-syaki (briskly, concisely)

MEANDER
Pattern: (C)V + ne + (C)V + ne
Meaning: meandering, winding
Tokens: kune-kune (sway, meander), une-une (winding, tortuous)

SFTUNR
Pattern: CV + nya + CV + nya
Meaning: soft, unreliable, unstable
Tokens: funya-funya (limply, flabby), kunya-kunya (soft and cuddly), munya-munya (mumble), gunya-gunya (melting, weakened)

PRESS
Pattern: CV + si + CV + si
Meaning: pressure
Tokens: dosi-dosi (hand over fist, without hesitation), gosi-gosi (scrub), basi-basi (whack), gasi-gasi (fierce scrubbing), gisi-gisi (squeeze, be packed), hisi-hisi (press, feel keenly)

TIMID
Pattern: CV + so + CV + so
Meaning: timid, retiring
Tokens: boso-boso (in a subdued voice; under one's breath), hiso-hiso (in whispers), koso-koso (secretly, stealthily), moso-moso (slow movement), goso-goso (rummage)

EXPROP
Pattern: CV + te + CV + te
Meaning: exceed the proper amount
Tokens: bote-bote (fat, obese), gote-gote (gaudy, heavy), kote-kote (thickly, lavishly), bate-bate (exhausted)
Anomalous tokens: mote-mote (sexually attractive)

ANXIETY
Pattern: (C)V + zu + (C)V + zu
Meaning: anxiety, impatience
Tokens: guzu-guzu (grumble), muzu-muzu (burning with desire), uzu-uzu (impatient, itching)

Patterns constraining the 2nd, 3rd & 4th morae:

SUDDEN
Pattern: CV + Q-tari
Meaning: sudden/abrupt action
Tokens: gaQ-tari (destroy suddenly), baQ-tari (run into, stop suddenly), paQ-tari (suddenly, abruptly)

GENTLE
Pattern: CV + N-wari
Meaning: gentleness
Tokens: fuN-wari (fluffy), jiN-wari (well up slowly), yaN-wari (gentle, mild)

SUCCESS
Pattern: CV + Q-tiri
Meaning: successful, positively evaluated, compact
Tokens: baQ-tiri (perfect), miQ-tiri (full, soft), ga-tiri (full, hard), gaQ-tiri (solid, steady, tight), kiQ-tiri (exact, perfect, tight)

CLEAR
Pattern: CV + Q-kiri
Meaning: clear-cut, exactly
Tokens: haQ-kiri (clearly, without any misunderstanding), kuQ-kiri (clearly, distinctly), nyoQ-kiri (stuck out), tyoQ-kiri (exactly)

EXGOOD
Pattern: CV + Q-pori
Meaning: exceed the proper amount (with positive connotation)
Tokens: gaQ-pori (make a pile of money), goQ-pori (in a large quantity), siQ-pori (wet through, tender, passionate), suQ-pori

SUFF
Pattern: CV + Q-puri
Meaning: sufficient amount
Tokens: doQ-puri (to the full, be up to one's neck in), gaQ-puri (drink up a large amount of liquid), taQ-puri (plenty, full of)

MANY
Pattern: CV + Q-sari
Meaning: many
Tokens: baQ-sari (make a drastic cut), guQ-sari (stab / plunge in completely), doQ-sari (heaps/loads of), fuQ-sari (abundant hair)

PLENTY
Pattern: CV + Q-siri
Meaning: plenty, packed
Tokens: doQ-siri (massive), giQ-siri (closely, tightly), biQ-siri (filled completely), zuQ-siri (heavy)

HEAVY
Pattern: CV + Q-teri
Meaning: heavy, fat
Tokens: goQ-teri (thick, stodgy), koQ-teri (filling, rich), poQ-teri (plump, chubby)

RELAX
Pattern: (C)V + Q-tori
Meaning: relaxing
Tokens: oQ-tori (composed, gentle [personality]), siQ-tori (quiet, calm), uQ-tori (enchanted)
Anomalous tokens: yoQ-tori (chill, relaxed)

Partially-reduplicative patterns:

UNATR
Pattern: C1V1 + tya + C2V2 + tya
Meaning: disorder, unattractiveness
Tokens: betya-kutya (gab), metya-kutya (unreasonable, incoherent), petya-kutya (chatter)

FIDGET
Pattern: (C1)V1 + ta + C2V2 + ta
Meaning: restlessness, fidgety
Tokens: ata-futa (hastily), dota-bata (make a fuss), jita-bata (flail)
Appendix B: Combined Trial Two Corpus
mo.go.mo.go      mumble
mo.Q.sa.ri       sluggish
mo.so.mo.so      slow movement
mo.ya.mo.ya      hazy, murky
mo.ku.mo.ku      send up lots of smoke
mo.ri.mo.ri      have thick muscles [e]
mu.ka.mu.ka      retch
mu.N.mu.N        stuffy, sultry
mu.ra.mu.ra      irresistible temptation
mu.ku.mu.ku      growing
mu.ki.mu.ki      have thick muscles
ne.ba.ne.ba      sticky, greasy
ne.ti.ne.ti      sticky, persistent
ne.Q.to.ri       clammy, viscous
no.bi.no.bi      feel at ease
no.ko.no.ko      nonchalantly
no.ro.no.ro      drag oneself
yo.bo.yo.bo      feeble, shaky
yo.ti.yo.ti      toddle
yo.ta.yo.ta      totter
yo.ro.yo.ro      unstable, weak
yu.Q.ku.ri       slowly, at leisure
yu.Q.ra.ri       slowly
yu.sa.yu.sa      sway to and fro
yu.Q.ta.ri       slowly
yu.u.yu.u        chill (not temperature)
ba.sa.ba.sa      crumbly, disheveled
ka.sa.ka.sa      dry, rough
wa.sa.wa.sa      nervous, restless
pa.sa.pa.sa      dry
gi.ti.gi.ti      very tight
ka.ti.ka.ti      frozen hard, dried up
ko.ti.ko.ti      tense, stiff, frozen
ba.ti.ba.ti      brittle (large)
pa.ti.pa.ti      brittle
so.bo.so.bo      dispirited
to.bo.to.bo      trudge wearily
yo.bo.yo.bo      feeble, shaky, unsteady
do.ro.do.ro      pulpy, mushy
o.ro.o.ro        thrown off balance
go.ro.go.ro      lie about idly
bo.ro.bo.ro      in big drops
ho.ro.ho.ro      in small drops
po.ro.po.ro      in drops
gu.jya.gu.jya    slushy, sloppy
u.jya.u.jya      wriggle
ku.ra.ku.ra      dizzy, reeling
u.ra.u.ra        gentle
yu.ra.yu.ra      quake, sway
su.ra.su.ra      smoothly, without a hitch
mu.ra.mu.ra      swaying
gu.ra.gu.ra      toppling
bu.ra.bu.ra      swaying, idling
tyo.ro.tyo.ro    trickle, flicker
hyo.ro.hyo.ro    totter, toddle, stagger
kyo.ro.kyo.ro    look around restlessly, nervously
nyo.ro.nyo.ro    long and squirming
gyo.ro.gyo.ro    dribbling
de.bu.de.bu      fat and flabby
ga.bu.ga.bu      guzzle, chug, slosh
za.bu.za.bu      splash
i.tya.i.tya      flirt
go.tya.go.tya    messy, jumbled
ku.tya.ku.tya    crumpled
bi.tya.bi.tya    spurting
go.tya.go.tya    speak worthlessly
ma.go.ma.go      at a loss, disorganized
su.go.su.go      disheartened, downcast
i.ji.i.ji        hesitantly, timidly
mo.ji.mo.ji      bashful, fidget
ta.ji.ta.ji      quail
nu.ke.nu.ke      impudently, brazenly
zu.ke.zu.ke      bluntly, without reserve
tsu.ke.tsu.ke    rudely
do.ki.do.ki      beat fast, throb
me.ki.me.ki      remarkably, rapidly
sya.ki.sya.ki    briskly, concisely
pa.ki.pa.ki      breaking
ba.ki.ba.ki      heavy breaking
po.ki.po.ki      breaking
ku.ne.ku.ne      sway, meander
u.ne.u.ne        winding, tortuous
fu.nya.fu.nya    limply, flabby
ku.nya.ku.nya    soft and cuddly
mu.nya.mu.nya    mumble
gu.nya.gu.nya    soft, weakened
do.si.do.si      hand over fist, without hesitation
gi.si.gi.si      squeeze, be packed
hi.si.hi.si      press, feel keenly
go.si.go.si      scrub
ba.si.ba.si      whack
ga.si.ga.si      fierce scrubbing
bo.so.bo.so      in a subdued voice
hi.so.hi.so      in whispers
ko.so.ko.so      secretly, stealthily
mo.so.mo.so      slow movement
go.so.go.so      rummage
bo.te.bo.te      fat, obese
go.te.go.te      gaudy, heavy
ko.te.ko.te      thickly, lavishly
ba.te.ba.te      exhausted
mo.te.mo.te      sexy [e]
gu.zu.gu.zu      grumble
mu.zu.mu.zu      burning with desire
u.zu.u.zu        impatient, itching
ga.Q.ta.ri       destroy suddenly
ba.Q.ta.ri       run into, stop suddenly
pa.Q.ta.ri       suddenly, abruptly
fu.N.wa.ri       fluffy
ji.N.wa.ri       well up slowly
ya.N.wa.ri       gently, mild
ba.Q.ti.ri       perfect
ga.Q.ti.ri       solid, steady, tight
ki.Q.ti.ri       exactly, perfect, tight
ha.Q.ki.ri       clearly, without misunderstanding
ku.Q.ki.ri       clearly, distinctly
nyo.Q.ki.ri      stuck out
tyo.Q.ki.ri      exactly
ga.Q.po.ri       make a pile of money
go.Q.po.ri       in a large quantity
si.Q.po.ri       wet through, tender and passionate
do.Q.pu.ri       to the full, be up to one's neck
ga.Q.pu.ri       drink up a lot of liquid
ta.Q.pu.ri       plenty, full of
su.Q.pu.ri       fits perfectly
ba.Q.sa.ri       make a drastic cut
do.Q.sa.ri       heaps / loads of
fu.Q.sa.ri       abundant hair
gu.Q.sa.ri       heaps, loads of
do.Q.si.ri       massive
gi.Q.si.ri       closely, tightly
zu.Q.si.ri       heavy
go.Q.te.ri       thick, stodgy
ko.Q.te.ri       filling, rich
po.Q.te.ri       plump, chubby
o.Q.to.ri        composed, gentle
si.Q.to.ri       quiet, calm
u.Q.to.ri        enchanted
yo.Q.to.ri       laid back
ne.Q.to.ri       clammy, viscous
be.Q.to.ri       sticky, tight [e]
ji.Q.to.ri       damp, moist [e]
be.tya.ku.tya    gab
me.tya.me.tya    unreasonable, incoherent
pe.tya.pe.tya    chatter
a.ta.fu.ta       hastily
do.ta.do.ta      make a fuss
ji.ta.ba.ta      flail
Appendix C: Original Code

This research is greatly indebted to Lisa Meeden and Doug Blank of Swarthmore and Bryn Mawr Colleges and their colleagues for their work on Pyro (Python Robotics). The complete source is available at http://www.pyrobotics.org. See Blank, D.S., Kumar, D., Meeden, L. (2005).
import sys
import string

# Network and loadNetworkFromFile are provided by the Pyro (Python Robotics)
# back-propagation ("conx") module; the exact import path is assumed here.
from pyro.brain.conx import Network, loadNetworkFromFile

VOWELS = "aiueo"
VELARS = "kg"
NASALS = "mn"
FRICATIVES = "szSZT"
PALATALS = "y"
CORONALS = "stzdnyr"
LABIALS = "mbwp"
V_HIGH = "iu"
V_LOW = "a"
V_BACK = "aou"
V_LABIAL = "ou"
V_TENSE = "iuoe"
MORAIC_N = "N"
GEMINATE = "Q"
VOICED = "zbgr" + VOWELS + NASALS + MORAIC_N

def firstPosFeature(string, fset):
    # 1 if the mora-initial segment belongs to the feature set
    if fset.find(string[0]) == -1:
        return 0
    else:
        return 1

def secondPosFeature(string, fset):
    # 1 if the mora-final segment (the vowel) belongs to the feature set
    if fset.find(string[len(string) - 1]) == -1:
        return 0
    else:
        return 1

def floatingFeature(string, fset):
    # 1 if any segment of the mora belongs to the feature set
    for c in fset:
        if string.find(c) != -1:
            return 1
    return 0

def moraFeatures(string):
    """
    FEATURE ORDER:
    velar palatal nasal coronal labial fricative
    voiced moraic_nasal geminate vowel_high vowel_low
    vowel_back vowel_labial vowel_tense (14)
    """
    features = []
    features = features + [firstPosFeature(string, VELARS)]
    features = features + [floatingFeature(string, PALATALS)]
    features = features + [firstPosFeature(string, NASALS)]
    features = features + [firstPosFeature(string, CORONALS)]
    features = features + [firstPosFeature(string, LABIALS)]
    features = features + [floatingFeature(string, FRICATIVES)]
    features = features + [firstPosFeature(string, VOICED)]
    features = features + [firstPosFeature(string, MORAIC_N)]
    features = features + [firstPosFeature(string, GEMINATE)]
    features = features + [secondPosFeature(string, V_HIGH)]
    features = features + [secondPosFeature(string, V_LOW)]
    features = features + [secondPosFeature(string, V_BACK)]
    features = features + [secondPosFeature(string, V_LABIAL)]
    features = features + [secondPosFeature(string, V_TENSE)]
    # The three-feature (voicing/velar/fricative) encoding from Trial One is
    # retained below but disabled in favor of the full 14-feature set:
    # if VOICED.find(string[0]) == -1:
    #     features = features + [0]
    # else:
    #     features = features + [1]
    # if VELARS.find(string[0]) == -1:
    #     features = features + [0]
    # else:
    #     features = features + [1]
    # if FRICATIVES.find(string[0]) == -1:
    #     features = features + [0]
    # else:
    #     features = features + [1]
    return features

def featurize(string):
    # a token is written with '.' between morae, e.g. "pe.tya.pe.tya"
    moraList = string.split('.')
    features = []
    for mora in moraList:
        features = features + moraFeatures(mora)
    return features

def phonaesthize(n, fname, j, dValue, oFile):
    # retrieve the tagged corpus
    tdata_h = open(fname)
    tdata = tdata_h.readlines()
    # debug printing
    print "read from file:"
    print tdata
    k = 0
    # if we're dumping hidden activations for cluster analysis
    if dValue == 1:
        activationsFile = open("a_dump.dat", "w")
        namesFile = open("a_names.dat", "w")
    if oFile != "":
        oHandle = open(oFile, "w")
    # master training loop
    eList = []
    while k < j:
        k = k + 1
        for line in tdata:
            pFeatures = line.split('\t')[0]
            sList = line.split()
            sFeatures = []
            # hack atoi for sFeatures
            i = 1
            while i < len(sList):
                sFeatures = sFeatures + [string.atoi(sList[i])]
                i = i + 1
            current = featurize(pFeatures)
            results = sFeatures
            # copy to layers
            n.getLayer('input').copyActivations(current)
            n.getLayer('output').copyTargets(results)
            # if we're dumping activations
            if dValue == 1:
                aList = n.getLayer('hidden').getActivationsList()
                for v in aList:
                    activationsFile.write(str(v))
                activationsFile.write("\n")
                namesFile.write(pFeatures)
                namesFile.write("\n")
            # step network
            (error, correct, count) = n.step()
            # record error
            print error
            eList = eList + [error]
            # if we're outputting to a file
            if oFile != "":
                oHandle.write(str(error))
                oHandle.write("\n")
        if k == j - 1:
            aError = 0
            for v in eList:
                aError = aError + v
            aError = aError / len(eList)
            print "FINAL AVERAGE ERROR:"
            print aError
    if dValue == 1:
        activationsFile.close()
        namesFile.close()

if __name__ == '__main__':
    # USAGE:
    # python gitaigo.py trainingfile cycles header
    #   [-s nfile] [-l nfile] [-e efile] [-d] [-o outfile]
    if sys.argv[3] == '1':
        print "\n... GITAIGO v1.2 ..."
        print "... Luke Smith 2006 ..."
        print "... Swarthmore College Linguistics ...\n"
    # if we're loading a saved network from a file...
    try:
        fsIndex = sys.argv.index("-l") + 1
        n = loadNetworkFromFile(sys.argv[fsIndex])
        print "loaded network from file:"
        print sys.argv[fsIndex]
    except(ValueError):
        n = Network()
        n.addThreeLayers(56, 60, 37)
    n.setEpsilon(0.25)
    n.setMomentum(0.1)
    n.setBatch(0)
    n.setReportRate(100)
    n.setLayerVerification(0)  # turn off automatic error checking
    # check for activations dump
    try:
        if sys.argv.index("-d") > 0:
            dValue = 1
    except(ValueError):
        dValue = 0
    # check for output dump
    oFile = ""
    try:
        if sys.argv.index("-o") > 0:
            oValue = 1
            oFile = sys.argv[sys.argv.index("-o") + 1]
    except(ValueError):
        oValue = 0
    phonaesthize(n, sys.argv[1], string.atoi(sys.argv[2]), dValue, oFile)
    # if we're saving the trained network to a file...
    try:
        fsIndex = sys.argv.index("-s") + 1
        n.saveNetworkToFile(sys.argv[fsIndex])
    except(ValueError):
        pass
Appendix D: Trial Two Cluster Analysis
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
minimum
distance
distance
distance
distance
distance -:
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance a
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance
distance'"
distance'"
distance
distance
distance
distance
distance
distance
distance
distance
distance distance
distance
distance
distance
distance
distance
distance
distance
distance distance
distance
distance
distance
distance '"
distance
distance
distance
distance
distance
0.000000
0.000000
0.000009
0.000012
0.000014
0.000017
0.000019
0.00002l
0.000033
0.000055
0.000057
0.000057
0.000061
0.000064
0.000074
0.000082
0.000089
0.000112
0.000116
0.000120
.0.000146
0.000149
0.000212
0.000231
0.000248
0.000255
0.000276
0.000294
0.000308
0.000;382
0.000409
0.000430
0.000465
0.000475
0.000505
0.000526
0.000556
0.000606
0.000710
0.000734
0.000794
0.000864
0.000907
0.000959
0.001050
0.001120
0.001220
0.001222
0.001303
0.001425
0.001539
0.001552
0·.001677
0.001772
0.002093
0.002527
0.002778
0.002845
0.002980
0.003189
0.003206
0.003401
0.003418
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
(
mu.ra.mu.ra )
yu.Q.ta.ri )
ku.tya.ku.tya
( yu.Q.ta.ri mu.ra.mu.ra
do.ta.do.ta )
bi.tya.bi.tya
gu.ra.gu.ra )
( mu.ku.mu.ku )
zu.Q.si.ri )
( gi.si.gi.si )
ka.ti.ka.ti )
( pa.sa.pa.sa )
ko.Q.te.ri )
( ku.ra.ku.ra )
ku.Q.ki.ri )
( ki.Q.ti.ri )
g1.s1.g1.si zu.Q.sLri )
( mu •. ku.mu.ku gu.ra.gu.ra )
mu.nya.mu.nya )
( t5u.ke.tsu.ke )
tyo.Q.ki.ri )
( gu.nya.gu.nya )
be.Q.to.ri )
( yo.ti.yo.ti )
bi~tya.bi.tya do.ta.do.ta )
( mu.ku.mu.ku gu.ra.gu.ra gi.5i.gi.si
ba.ki.ba.ki )
( so.bo.so.bo )
ku.ra.ku.ra ko.Q.te.ri )
( su.ra.su.ra
yo.ro.yo.ro )
( mo.go.mo.go )
yu.u.yu.u )
yu.sa.yu.sa )
mo.te.mo.te
( pa.ti.pa.ti
tyo.ro.tyo.ro
( ka.sa.ka.sa
ya.N.wa.ri )
( ku.ne.ku.ne )
mu.ku.mu.ku gu.ra.gu.ra gi.si.gi.si zu.Q.5i.ri bi.tya.bi.tya do.ta.do.ta
sU.ra.su.ra ku.ra.ku.ra ko.Q.te.ri )
( yu.Q.ta.ri mU.ra.mu.ra ku.t]
pa.ti.pa.ti mo.te.mo.te )
( po.ki.po.ki )
bu.ra.bu.ra mu.ku.mu.ku gu.ra.gu.ra gi.si.gi.si zu.Q.si.ri bi.tya.bi.tya (
yu.Q.ta.ri mU.ra.mu.ra ku.tya.ku.tya su.ra.su.ra ku.ra.ku.ra ko.Q.te.ri )
yo.bo.yo.bo )
( mu.ki.mu.ki )
mo.ji.mo.ji )
( u.jya.u.jya )
ga.Q.po.ri )
( gyo.ro.gyo.ro )
t5u.ke.tsu.ke mu.nya.mu.nya )
( go.tya.go.tya )
pe.tya.pe.tya )
( ma.go.ma.go )
fu.Q.sa.ri )
( U.zU.U.zu )
po.ki.po.ki pa.ti.pa.ti mo.te.mo.te )
so.bo.so.bo ba.ki.ba.ki
go.tya.go.tya tsu.ke.tsu.ke mu.nya.mu.nya
( u.ra.u.ra )
ne.Q.to.ri )
( nyo.Q.ki.ri )
si.Q.po.ri )
( ji.N.wa.ri )
gu.nya.gu.nya tyo.Q.ki.ri bu.ra.bu.ra mu.ku.mu.ku gu.ra.gu.ra gi.si.gi.si
ka.sa.ka.sa tyo.ro.tyo.ro )
( ne.Q.to.ri )
so.bo.so.bo ba.ki.ba.ki po.ki.po.ki pa.ti.pa.ti mo.te.mo.te )
( boo
ki.Q.ti.ri ku.Q.ki.ri )
( u.ra.u.ra go.tya.go.tya tsu.ke.tsu.ke mu.
gyo.ro.gyo.ro ga.Q.po.ri )
( no.bi.no.bi )
sya.ki.sya.ki )
( yu.Q.ra.ri )
gi.Q.si.ri )
( kyo.ro.kyo.ro )
ku.ne.ku.ne ya.N.wa.ri )
( ko.ti.ko.ti
su.Q.pu.ri )
( mo.ya.mo.ya
yo.ti.yo.ti be.Q.to.ri )
( ne.Q.to.ri ka.sa.ka.sa tyo.ro.tyo.ro )
ga.Q.ti.ri )
( za.bu.za.bu )
ta.ji.ta.ji )
( mo.Q.sa.ri )
yu.Q.ra.ri sya.ki.sya.ki )
( mu.ki.mu.ki yo.bo.yo.bo )
ko.ti.ko.ti ku.ne.ku.ne ya.N.wa.ri )
( u.jya.u.jya mo.ji.mo.ji )
mo.go.mo.go yo.ro.yo.ro yu.Q.ta.ri mu.ra.mu.ra ku.tya.ku.tya su.ra.su.ra
ba.te.ba.te )
( mu.N.mu.N )
ne.Q.to.ri ka.sa.ka.sa tyo.ro.tyo.ro yo.ti.yo.ti be.Q.to.ri )
( do.
go.5i.go.si )
( hi.si.hi.si )
ga.Q.pu.ri )
( ba.Q.ta.ri )
no.bi.no.bi gyo.ro.gyo.ro ga.Q.po.ri
( nO.ro.no.ro )
po.Q.te.ri )
( mu.ra.mu.ra )
do.ki.do.ki ne.Q.to.ri ka.sa.ka.sa tyo.ro.tyo.ro yo.ti.yo.ti be.Q.to.ri
ji.N.wa.ri si.Q.po.ri )
( mo.Q.sa.ri ta.ji.ta.ji )
kyo.ro.kyo.ro gi.Q.si.ri )
( mu.ki.mu.ki yo.bo.yo.bo yu.Q.ra.ri Syi
U.zU.U.zu fu.Q.sa.ri )
( yu.sa.yu.sa yu.u.yu.u )
po.ro.po.ro )
( bo.ro.bo.ro )
u.jya.u.jya mo.ji.mo.ji ko.ti.ko.ti ku.ne.ku.ne ya.N.wa.ri
{ bO.f
pa.Q.ta.ri )
( ba.si.ba.si )
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance ~
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
minimum distance
Resulting Tree =
( mo.so.mo.so )
( kO.te.ko.te
( mo.ku.mo.ku
( ma.go.ma.go pe.tya.pe.tya )
( ho.ro.ho.ro )
( za.bu.za.bu ga.Q.ti.ri )
hi.si.hi.si go.si.go.si )
( u.Q.to.ri
( ba.Q.ta.ri ga.Q.pu.ri )
( yo.Q.to.ri )
( mo.ri.mo.ri )
( gu.zu.gu.zu )
( nyo.ro.nyo.ro )
( ko.so.ko.so )
( u.ra.u.ra go.tya.go.tya tsu.ke.tsu.ke mu.nya.mu.nya ki.Q.ti.ri ku.Q.ki.ri
( do.Q.sa.ri )
( mu.ra.rnu.ra po.Q.te.ri )
( mu.ki.rnu.ki yo.bo,yo.bo yu.Q.ra.ri sya.ki.sya.ki kyo.ro.kyo.ro gi.Q.si.ri
( ba.ti.ba.ti )
( mu.N.mu.N ba.te.ba.te ).
( bo.so.bo.so so.bo.so.bo ba.l
( ho.ro.ho.ro za.bu;za.bu ga.Q.ti.ri )
( bo.ro.bo.ro po.ro.po.ro )
( mo.so.mo.so ko.te.ko.te )
( ne.ti.ne.ti )
( me~tya.me.tya )
do.Q.sa.ri mu.ra.mu.ra po.Q.i
( hi.si.hi.si go.si.go.si u.Q.to.ri )
( si.Q.to.ri )
( a.ta.fu.ta )
( ba.Q.ta.ri ga.Q.pu.ri yo.Q.to.ri )
( nyo.Q.ki.ri ne.Q.to.ri )
( pa.sa.pa.sa ka.ti.ka.ti )
( do.Q.pu.ri )
( nO.ro.no.ro no.bi.no.bi gyo.ro.gyo.ro ga.Q.po.ri do.ki.do.ki ne.Q.to.ri k.
( yo.ta.yo.ta mu.ki.mu.ki yo.bo.yo.bo yu.Q.ra.ri sya.ki.sya.ki kyo.ro.kyo.r(
( ga.bu.ga.bu )
( mo.ri.mo.ri gu.zu.gu.zu )
( zu.ke.zu.ke )
( go.te.go.te )
( ba.si.ba.si pa.Q.ta.ri )
( pa.sa.pa.sa ka.ti.ka.ti do.Q.pu.ri )
( nyo.ro.nyo.ro ko.so.ko.so
( ba.ti.ba.ti mU.N.mu.N ba.te.ba.te )
( bo.ro.bo.ro po.ro.po.ro mo,
( ga.bu.ga.bu mo.ri.mo.ri gu.zu.gu.zu )
( go.ro.go.ro )
( yu.sa.yu.sa yu.u.yu.u u.zu.u.zu fu.Q.sa.ri )
( ne.ti.ne.ti me.tya.me.tya )
( si.Q.to.ri a.ta.fu.ta )
( wa.sa.wa.sa )
( gu.Q.sa.ri )
( nyo.ro.nyo.ro ko.1
( mo.Q.sa.ri ta.ji.ta.ji ji.N.wa.ri si.Q.po.ri )
( pa.k:
( ba.si.ba.si pa.Q.ta.ri pa.sa.pa.sa ka.ti.ka.ti do.Q.pu.ri )
( do.Q.sl
( ba.Q.ta.ri ga.Q.pu.ri yo.Q.to.ri nyo.Q.ki.ri ne.Q.to.ri
( go.so.go.so )
( zu.ke.zu.ke go.te.go.te )
( mo.ya.mo. --.. 5U.
( ne.ti.ne.ti me.tya.me.tya si.Q.to.ri a.ta.fu.ta )
,Q.i
( do.Q.sa.ri mU.ra.mu.ra po.Q.te.ri hi.sLhLsi gO.5i.gO.5i u.Q.to.r~
0.003654
0.003872
0.004173
0.004222
0.004790
0.005105
0.005478
0.005564
0.005745
0.005890
0.006480
0.007183
0.009678
0.009957
0.010037
0.010299
0.011167
0.011349
0.012010
0.012103
0.012453
0.013591
0.015558
0.016346
0.021130
0.021733
0.022283
0.023636
0.025616
0.027316
0.027425
0.030766
0.041142
0.045747
0.052773
0.058806
0.060869
0.071151
0.075812
0.100507
0.102745
0.131916
0.135562
0.171335
0.412362
0.841598
r
(
(
(
(
(
(
(
(
(
(
(
(
1->
(
nyo.ro.nyo.ro ko.so.ko.so ba.ti.ba.ti mu.N.mu.N ba.te.ba.te mo.Q.sa.· , tao
pa.kLpa.ki ba.sLba.si pa.Q.ta.ri pa.sa.pa.sa ka;tLka.ti do.Q.pu.(
mo.ku.mo.ku ma.go.ma.go pe.tya.pe.tya nO.ro.no.ro no.bi.no.bi gyo.r~_~yo.,
ji.Q.to.ri )
( ku.nya.ku.nya )
mo.ya.mo.ya su.Q.pu.ri ne.ti.ne.ti me.tya.me.tya si.Q.to.ri a.ta.fu.ta )
wa.sa.wa.sa gu.Q.sa.ri mo.ya.mo.ya su.Q.pu.ri ne.ti.ne.ti rne.tya.me.tya s:
bo.ro.bo.ro po.ro.po.ro mo.so.mo.so ko.te.ko.te ga.bu.ga.bu mo.ri.mo.ri ~
go.so.go.so zu.ke.zu.ke go.te.go.te bo.ro.bo.ro po.ro.po.ro mo.so.mo.so k(
ku.nya.ku.nya ji.Q.to.ri )
( i.tya.i.tya )
go.ro.go.ro yu.sa~yu.sa yu.u.yu.u u.zu.u.zu fu.Q.5a.ri do.Q.sa.ri mu.ra.ml
do.ro.do.ro yo.ta.yo.ta mu.ki.mu.ki yo.bo.yo.bo yu.Q.ra.ri sya.ki.sya.ki I
i.tya.i.tya ku.nya.ku.nya ji.Q.to.ri do.ro.do.ro yo.ta.yo.ta mu.ki.mu.ki 1
bo.so.bo.so
I-I _1-> so.bo.so.bo
I I-I 1-> ba.kLba.ki
I 1_1-> po.ki.po.ki
I-I
1_1-> pa.ti.pa.ti
I I
1-> mo.te.mo.te
I 1 _1-> u.jya.u.jya
I 1-) 1-> mo.ji.mo.ji
1--1 1_1-> ko.ti.ko.ti
I I
1_1-> ku.ne.ku.ne
I I
1 LI->
1-:> ya.N.wa.ri
ho.ro.ho.ro
1_1-> za.bu.za.bu
---I
1-> ga.Q.ti.ri
_1-> nyo.ro.nyo.ro
) I-I 1-> ko.so.ko.so
I I 1_1-> ba .ti.ba.ti
I I 1_)-> mu.N.mu.N
1--1
1-> ba.te.ba.te
I _1-> mo.Q.sa.ri
1
1
1-------------------1
I
I
I
I
I
1
1---
-I
]
]
I
I
I
I
I
I
I
I
I
I
I
H 1-> ta.ji. tao ji
1_1-> ji.N.wa.ri
1-> si.Q.po.ri
_1-> go.so.go.so
1_1-> zu.ke.zu.ke
1-> go.te.go.te
_1-> bo.ro.bo.ro
I- 1 1-> po. ro . po . ro
I 1_1-> mo.so.mo.so
1--1 1-> ko.te.ko.te
I 1_1-> ga.bu.ga.bu
---I
1_1-> mo.ri.mo.ri
I
1-> gu.zu.gu.zu
I 1-> pa.ki.pa.ki
1--1 _1-> ba.si.ba.si
I-I 1-> pa.Q.ta.ri
I _1-> pa.sa.pa.sa
I-I 1-> ka.ti.ka.ti
1-> do.Q.pu.ri
1---> i.tya.i.tya
1_1--> ku.nya.ku.nya
1-->
ji.Q.to.ri
-> do.ro.do.ro
1--- -> yo.ta.yo.ta
I
_1-> mu.ki.mu.ki
I
I-I 1-> yo.bo.yo.bo
1
1 U-> yu.Q.ra.ri
----I
-I
1-> sya.ki.sya.ki
I
1
1_1-> kyo. ro • kyo. ro
1-------------------1
1
1-> gi.Q.si.ri
I
1
__ 1-> wa.sa.wa.sa
I
1
1 1-> gu.Q.sa.ri
1
1---1 _1-> mo.ya.mo.ya
1
I 1 1-> su.Q.pu.ri
I
1--1 _1-> ne.ti.ne.ti
I
I-I 1-> me.tya.me.tya
I
1_1-> si.Q.to.ri
I
1-> a.ta.fu.ta
I
1-> go.ro.go.ro
I
1
I-I _1-> yu.sa.yu.sa
1---------1
I I-I 1-> yu.u.yu.u
I
1 1_1-> u.zu.u.zu
I
I
1-> fu.Q.sa.ri
I
1--1
_1-> do.Q.sa.ri
I
I I 1 1_ -> mu.ra.mu.ra
I
I I I-I
-> po.Q.te.ri
I
I I I I _ -> hi.si.hi.si
I
I I I I-I ~> go.si.go.si
I
I I -I
1-> u. Q• to . r i
I
I
I
_1-> ba.Q.ta.ri
1
I
I I-I 1-> ga.Q.pu.ri
I
1
I-I I -> yo. Q • to . r i
1----1
1_1-> nyo.Q.ki.ri
I
1-> ne.Q.to.ri
I _1-> mo.ku.mo.ku
I I 1_1-> ma.go.ma.go
I I ]-> pe.tya.pe.tya
I 1 _I -> no. ro • no . ro
I I I 1_1-> no.bi.no.bi
I I I 1_1-> gyo.ro.gyo.ro
] I I-I
1-> ga.Q.po.ri
1--1 I I 1-> do.kLdo.ki
1 I I -I _1-> ne. Q • to • r i
I 1 1 1 1_1-> ka.sa.ka.sa
1 I I-I
1-> tyo.ro.tyo.ro
I I
U-> yo.ti.yo.ti
I]
1-> be.Q.to.ri
I-I
I
_1->
u.ra.u.ra
go.tya.go.tya
I I-I I~I-> tsu.ke.tsu.ke
I1I
1-> mu.nya.mu.nya
I 1 U-> kLQ.tLri
I I 1-> ku.Q.ki.ri
I 1 _1-> mo.go.mo.go
I-I 1 1-> yo.ro.yo.ro
I I-I
_1-> yu.Q.ta.ri
I I I I-I 1-> mU'.ra.mu.ra
I I I-I 1-> ku.tya.ku.tya
I I 1_1-> su.ra.su.ra
I-I
1_1-> ku.ra.ku.ra
I
1-> ko.Q.te.ri
I _1-> gu.nya.gu.nya
I I 1-> tyo.Q.kLri
I-I 1-> bu.ra.bu.ra
I I
_1-> mu.ku.mu.ku
I-I I-I 1-> gu.ra.gu.ra
I I 1_1-> gi.si.gi.si
I-I
1-> zu.Q.si.ri
'-1-> bLtya.bLtya
1-> do.ta.do.ta
I 1_1->
References

[1] HAMANO, S., The Sound-Symbolic System of Japanese (CSLI Publications, 1998)

[2] IVANOVA, G., "On the Relation between Sound, Word Structure and Meaning in Japanese Mimetic Words," Iconicity in Language, http://www.trismegistos.com/IconicityInLanguage/ (Utsunomiya University, Japan, 20 Jan. 2002)

[3] BLANK, D.S., KUMAR, D., MEEDEN, L., and YANCO, H., "The Pyro Toolkit for AI and Robotics," AI Magazine [available at http://pyrorobotics.org/?page=PyroPublications] (2005)

[4] OKAMOTO, N., Personal Correspondence (Tochigi Prefecture, Japan, 19 Apr. 2006)

[5] SHIBATANI, M., The Languages of Japan (Cambridge [England]; New York: Cambridge University Press, 1990)

[6] TERASAKI, M., Personal Communication (Swarthmore College, Swarthmore, PA, 25 Apr. 2006)