
Reading and Writing: An Interdisciplinary Journal 16: 41–59, 2003.
© 2003 Kluwer Academic Publishers. Printed in the Netherlands.
Phonology: An emergent consequence of memory constraints and
sensory input
FRANCISCO LACERDA
Department of Linguistics, Stockholm University, Stockholm, Sweden
Abstract. This paper presents a theoretical model that attempts to account for the early
stages of language acquisition in terms of interaction between biological constraints and
input characteristics. The model uses the implications of stochastic representations of the
sensory input in a volatile and limited memory. It is argued that phonological structure
is a consequence of limited memory resources under the pressure of ecologically relevant
multi-sensory information.
Key words: Emergent phonology, Language acquisition, Self-organizing processes
Introduction
Speech communication is probably the most complex natural communication
system that humans possess. However, while recognizing the complexity of
the process, one is at the same time struck by the apparent ease with which
children develop the speech communication ability and the adults’ efficiency
in using speech to communicate with each other. In the face of the puzzling
discrepancy between the complex structure of the speech communication
process and the spontaneous character of the language acquisition process,
the notion that language was an innate human capacity emerged in the 1960s
as a reaction to the strict behaviorist suggestion that language acquisition
might be explained in terms of stimulus-response mechanisms.
Nevertheless, whereas language is apparently a human activity that is
unparalleled in other species, dismissing language acquisition from the
linguistic agenda with the assumption that language is an innate human
capacity (Chomsky, 1975) probably does not do justice to the explanatory
power that biological and ecological factors may bring into the debate of the
language acquisition issue. Therefore, the current paper sketches a model of
the early stages of the language acquisition process, which, albeit crudely,
attempts to draw attention to how elementary, “purposeless” events may,
in time, lead to emergent structures that are mainly determined by the
constraints of the learning system itself, and its environment.
The current attempt to argue for phonological structure as an emergent
consequence of memory constraints and sensory input will be organized as
follows. First, to set the stage for the argument, a quick review of the nativist
perspective will be presented in order to expose the challenges that have
to be faced by the present alternative approach. General considerations on
how information may be spontaneously represented by living organisms will
then be sketched as a broad model of representation of sensory information.
Finally, the model will be used to argue that the stored information becomes
inevitably structured as a consequence of sensory exposure, the infant’s biases
and limitations in memory resources.
The nativist perspective
The core of the nativist attitude towards language acquisition is that
language is a genetically determined process, developing spontaneously,
“given minimal conditions of exposure and care” (Chomsky, 1975: 147). Not
surprisingly, the programmatic nativist attitude is to dismiss the problems of
language acquisition as falling outside the scope of linguistics. If humans are
born with a “universal grammar,” then the focus of research may be directed
to how this universal schema is tuned to specific ambient languages.
To be sure, the nativist argument feeds on numerous observations that
humans tend to develop the ability to communicate with each other, overcoming a vast range of adverse conditions. This is the case of deaf children
who spontaneously develop sign language in spite of being integrated in
exclusively verbal communities. Obviously, the necessity to communicate
is an inestimable driving force, capable of putting young humans on the
communication trail, even against all odds. With this in mind, Steven Pinker’s
statement that “The belief that Motherese is essential for language development is part of the same mentality that sends yuppies to ‘learning centers’
to buy little mittens with bull’s eyes to help their babies to find their hands
sooner” (Pinker, 1994: 40, italics added) is perfectly sound. Clearly, language
acquisition can unfold even in the absence of motherese, as dramatically
demonstrated by the cases of forced deprivation of language exposure during
the first years of life (Davis, 1947). In fact, as Gleitman and Newport (1995)
point out, evidence from language deprivation cases and other special conditions of language development, such as those occurring in congenitally blind
or deaf children, makes a strong case for the importance of both biological components and adequate interaction with appropriate environmental
conditions. Whereas it certainly is easy to accept their conclusion, the next
issue becomes how innate knowledge may interact with ambient language
constraints, from the onset of post-natal life. Chomsky’s famous argument of
the “poverty of stimulus” strongly suggests that the young language learner
EMERGENT PHONOLOGY
must have innate linguistic knowledge to be able to make sense of the noisy
and sparse linguistic input that infants and children are exposed to. But is
the speech input available to the young language learner really poor and
noisy? Does the infant need innate specific linguistic knowledge to acquire
its ambient language, or can language acquisition be seen as an inevitable emergent consequence of non-specific biases and exposure to ambient
language?
Modeling general aspects of stimulus representation
Let us start off by presenting a very broad theoretical model of how structure may arise through interaction between available organic structures and
external variables. The goal is to suggest how “chaotic processes,” in the
sense of “purely random,” disordered and meaningless events, may in fact
lead to structured outcomes because of the interaction between a system’s
random input and the system’s own history (Dennett, 1995).
Representing external variables
An organism’s representation of its external world is necessarily constrained
by its ability to map physical stimuli into internal states. In a very broad sense,
this mapping may include anything from an unspecific global process through
which the organism’s internal state is modified in response to external stimuli
to specialized representations achieved after specific processing of
the input. External stimuli that are not intense enough to harm the organism’s internal
structure can be represented by incremental global changes in the organism’s
internal states. Such internal changes constitute a form of automatic encoding
of the organism’s exposure to the stimuli. For instance, a bacterium may
react to light or to a chemical agent by changing form or moving to another
location. It achieves an internal representation of the stimulus and may
produce a response based on that representation. At the same time, representing the stimulus involves changes in its detailed internal structure and
thus that particular representation becomes part of the bacterium’s specific
life history.1 In this sense, representing external information is constantly
part of an organism’s natural interaction with its environment. And because
the organism’s detailed internal structure is affected by exposure to external
stimuli, the status of this structure is also an implicit and automatic record of
the organism’s life history. Admittedly, the overwhelming part of the encoded
information is unspecific and most often non-retrievable to explicit form.
Nevertheless, this purposeless information represents crucial general implicit
knowledge since it encodes the organism’s life experience and can be used to
plan or guide future actions (in the sense proposed by Tulving, 1998).
In complex organisms, the fundamental aspects of the representation
process are bound to be identical to those of simpler organisms, as a
direct consequence of evolutionary tinkering (Jacob, 1982). The overall
complexity arises from the large number of parallel elementary processes,
their mutual interaction along with their adaptive response to stimulation
(that is also present in the simpler organisms). For instance, the sheer addition of elementary representation processes leads to exponentially growing
combinatorial possibilities. In addition, the mutual interaction among these
elementary processes and the plasticity of the individual components tend to
result in specialized processing structures capable of more efficient mapping
of the external information along certain dimensions. For example, instead of
a general sensitivity to vibrations conveyed by a single nerve fiber, complex
organisms develop ears, i.e., higher-level specialized systems capable of
providing a more detailed analysis of the vibrations occurring in their ambient
world. The organism is able to add frequency information to the lower-level
representation of vibration amplitude alone. This more detailed processing
increases the representation capacity dramatically because each of the representation levels available along the initial dimension will now be sub-divided
into the number of representation levels available along the new dimension.
To be specific, an estimate of the total number of differences between tones
that the human auditory system might be able to detect can be derived
from the difference limens along the intensity and the frequency dimensions. If the auditory system could represent only intensity, the estimate
is about one hundred detectable level differences. However, when the estimates for intensity and frequency are combined the result indicates that about
340,000 tones might be represented in the human frequency × intensity
space (Stevens & Davis, 1938/1983: 152). Interestingly, as noted by Stevens
and Davis (1938/1983), “when the total number of distinguishable colors
is deduced from the known number of DL’s for hue, brightness, and saturation, the result is of the same order of magnitude” (p. 152). Thus, for
each added dimension, the number of representation levels in the representation space is multiplied by the number of levels that the new dimension
contributes, immediately increasing the complexity of the representation space. An estimate of the total audio-visual resolution capacity, based on
the DL’s reported by Stevens and Davis, yields 115,600,000,000 potentially
distinguishable audio-visual events! Including in this exercise the number of
distinguishable olfactory, gustative and tactile events, based on the respective
DL’s, the total will obviously increase very quickly. Still another source of
complexity is the interdependence between the different sensory dimensions.
This type of interdependence introduces non-linearity and time dependency
in the representation behavior, but in a plastic system such as the sensory representation systems found in complex organisms, interdependencies and plasticity
will actually lead to the emergence of structure in the system, making the
system more specialized by reducing its degrees of freedom.
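The multiplicative growth described above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that the visual estimate equals the auditory estimate of 340,000, which the text supports only as "the same order of magnitude", and the function name is hypothetical.

```python
# Illustrative arithmetic only: how independent sensory dimensions multiply
# the number of distinguishable representation levels.

AUDITORY_TONES = 340_000  # frequency x intensity estimate quoted in the text

# Assumption: the visual count is set equal to the auditory count, since the
# text only states that it is "of the same order of magnitude".
VISUAL_COLORS = 340_000


def combined_levels(*levels_per_dimension: int) -> int:
    """Multiply the number of levels contributed by each dimension."""
    total = 1
    for levels in levels_per_dimension:
        total *= levels
    return total


audio_visual_events = combined_levels(AUDITORY_TONES, VISUAL_COLORS)
print(f"{audio_visual_events:,}")  # prints 115,600,000,000
```

Under this assumption the product reproduces the audio-visual figure quoted in the text, which suggests that the estimate was obtained by squaring the auditory count.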
At first sight, given the variance of the external stimuli and the huge
representation resources available in this space, it seems unlikely that the
information represented might be structured without the help of pre-existent
sorting mechanisms.2 In the context of language acquisition, for instance,
something like a Universal Grammar seems to be needed in order to make
sense of the noisy speech input and its poor information content. But can
innate mechanisms be, in fact, manifestations of the ontogenetic evolution of general-purpose biological resources under the pressure of stimulus
exposure? Again, how poorly specified is the speech input that the young
language learner is exposed to and how can structure arise in the absence of
pre-wired linguistic knowledge?
Structure emerges from random processes
A common spontaneous attitude towards the variance observed in natural
phenomena is to treat it as an unwanted and meaningless disturbance of underlying deterministic processes. While this may be a useful strategy to focus
on the core of the phenomena under study, there is a clear risk of missing very
relevant information that is conveyed by the variance itself. In fact, the
structure of the variance associated with natural phenomena is an extremely
powerful source of information as demonstrated, for instance, by the inferential power of the analysis of variance: the possibility of drawing conclusions
about a population in general, based on the variance structure observed in
a specific (random) sample of that population. Specifically, such analysis
of variance permits quantifying the risk involved in generalizing from the
sample to the population.
Another common spontaneous attitude is to view samples as timeless sets
of data,3 leading to a dramatic mismatch with the ecologically relevant reality.
Since all the sensory dimensions are available simultaneously as the organism
automatically represents natural events, external events are represented by
sequences of points (actually continuous trajectories) in the representation
space. Activity represented by a point in this space therefore encodes an
observed relationship among the sensory dimensions, at a certain time. As
time passes, the level of the activity at a point decreases unless reactivated by
new activity represented at the same coordinates or specifically reactivated. In
other words, the coordinates of a point in this n-dimensional representation
space encode an instance of a specific relationship between the n sensory
dimensions represented in this space. Because of the continuous decay of the
stored activity, only representations that are frequently updated will tend to be
maintained in this space. Also, coordinates shared by different points indicate
that the events they represent share the sensory characteristics represented
by the shared coordinates. Because events in the outside world are highly
variable and the representation space is huge, the likelihood for two unrelated
events to be represented in the same location (i.e., the chance of two random
points having the same coordinates in the n-dimensional space) is extremely
small. Therefore, if two events lead to representations that land on the same
location, they implicitly convey the important information that they are very
likely related to each other.
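The decay and reactivation of stored activity can be made concrete with a small sketch. The text does not specify the form of the decay, so exponential decay with a fixed half-life is an assumption here, as are the class name, the coordinates and the time units, all of which are hypothetical.

```python
# A minimal sketch of a volatile memory, assuming exponential decay.
# The decay law, half-life, coordinates and time units are all hypothetical.


class DecayingMemory:
    def __init__(self, half_life: float):
        self.half_life = half_life
        self.traces = {}  # coordinates -> (activity, time of last update)

    def activity(self, coords, now: float) -> float:
        """Activity remaining at `coords` at time `now`."""
        if coords not in self.traces:
            return 0.0
        level, last_update = self.traces[coords]
        return level * 0.5 ** ((now - last_update) / self.half_life)

    def expose(self, coords, now: float, amount: float = 1.0) -> None:
        """Add new activity on top of whatever has not yet decayed."""
        self.traces[coords] = (self.activity(coords, now) + amount, now)


def collision_probability(levels_per_dimension: int, n_dimensions: int) -> float:
    """Chance that two independent, uniformly random events share a cell."""
    return 1.0 / (levels_per_dimension ** n_dimensions)


memory = DecayingMemory(half_life=10.0)
memory.expose((3, 7), now=0.0)   # recurring event
memory.expose((1, 2), now=0.0)   # one-off event
memory.expose((3, 7), now=10.0)  # the recurring event is reactivated

recurring = memory.activity((3, 7), now=20.0)
one_off = memory.activity((1, 2), now=20.0)
print(recurring, one_off)
print(collision_probability(levels_per_dimension=100, n_dimensions=6))
```

In this sketch the reactivated trace stays well above the one-off trace, and the collision probability shrinks geometrically with the number of dimensions, which is why a coincidence of coordinates is informative.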
In fact, given the variance of the input and the available representation
resources, the likelihood that two events would be represented on the same
location is practically zero, given the time frame of a living organism. Thus,
when all the dimensions of the space are considered there will be virtually no
clusters of representations because the events that are being represented will
tend to differ in some detail and, as all the details are represented, the whole
representation will appear as a blend of scattered random points. However,
the chances of disclosing possible structures in this representation space
increase dramatically if it is possible to “look” at the represented events from
different viewpoints and “angles” from which several points representing
events will appear to be clustered.4 Interestingly, reducing the dimensionality
of the representation resources may expose hidden structures.
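The claim that a lower-dimensional view can expose hidden structure can be illustrated with simulated events. In the sketch below, all events agree on two hypothetical sensory dimensions and vary at random on the remaining ones; the number of events, dimensions and levels are arbitrary choices made for illustration.

```python
import random
from collections import Counter

# Illustrative simulation: related events share some coordinates but differ
# in irrelevant detail. All sizes and values below are hypothetical.


def make_events(n_events, n_dims, levels, shared, rng):
    """Events agree on the first len(shared) coordinates, random elsewhere."""
    events = []
    for _ in range(n_events):
        free = [rng.randrange(levels) for _ in range(n_dims - len(shared))]
        events.append(tuple(shared) + tuple(free))
    return events


def largest_cluster(events, dims):
    """Size of the largest group of events that coincide on `dims`."""
    projected = Counter(tuple(event[d] for d in dims) for event in events)
    return max(projected.values())


rng = random.Random(0)
events = make_events(n_events=50, n_dims=8, levels=100, shared=(12, 40), rng=rng)

full_view = largest_cluster(events, dims=range(8))  # all details considered
reduced_view = largest_cluster(events, dims=(0, 1))  # two dimensions only
print(full_view, reduced_view)
```

Viewed in all eight dimensions the events are almost certainly scattered over distinct cells, whereas the projection onto the two shared dimensions places all fifty events in a single cluster.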
General aspects of the language acquisition process
Having sketched a general model of how external events may be represented
by complex organisms, the challenge is now to apply the model to the specific
case of language acquisition in infants.
Poverty of stimulus is a reasonable argument when only the speech signal
is considered to be the external component of the process of spoken language
acquisition. Out of its multi-sensory context and from a strict behaviorist
perspective, the speech input alone would very likely be insufficient to
account for the language acquisition process within the normal time frame
and in the absence of pre-programmed linguistic knowledge. But in the
infant’s ecological setting, events are multi-dimensional (multi-sensory), the
infant does interact with the environment and the young language learner has
relatively limited representation “needs.” In this scenario, the model sketched
above critically diminishes the significance of the “poverty of stimulus”: if
the stimulus is indeed poor, its lack of variance leads to a rapid association
of the involved external variables; if the stimulus is after all rich enough, its
effective linguistic use under adult-infant interaction will expose “spurious
variance” and enable the infant to single out the principal components of the
representation space.
Initial knowledge
The classical study by Eimas, Siqueland, Jusczyk and Vigorito (1971),
showing that one- and four-month-old infants categorized the /ba/-/pa/ VOT
continuum in adult-like fashion, provided strong experimental indication that
a “linguistic mode” might be “part of the biological makeup of the organism”
(Eimas et al., 1971: 306) but this view was subsequently abandoned after
Kuhl and her collaborators’ demonstrations of categorical perception also in
chinchillas and macaques (Kuhl & Miller, 1975; Kuhl & Padden, 1982, 1983).
Current accounts of the infant’s initial propensity to focus on speech sounds
are less dogmatic as to what mechanisms may underlie the observed infant
behavior and Jusczyk (1997), for instance, suggested recently that “dedicated,
hard-wired, specialized speech-processing mechanisms” (p. 78) do not have
to be necessarily involved in the development of speech perception during
the first year of life. Indeed, the experimental evidence suggests that the
newborn infant orients towards speech, in particular the mother’s speech,
because of prenatal exposure to speech stimuli rather than by hard-wired
specialized processing mechanisms (de Casper & Fifer, 1980; de Casper &
Prescott, 1984; Greenough & Alcantara, 1993; Turkewitz, 1993; Ecklund-Flores & Turkewitz, 1996). The findings in the speech domain are paralleled
by observations of perinatal olfactory preference for the mother’s amniotic
liquid in which the fetus was immersed during its prenatal life (Varendi,
Porter & Winberg, 1996, 1997).
But if the newborn’s auditory and olfactory preferences are linked to the
memory of prenatal exposures it is reasonable to expect that newborns would
also show preference for the mother’s non-voluntary sounds, like sounds
caused by bowel movements. Is it possible to account for an overall preference for speech sounds in this noisy scenario? In fact, according to the
general model, the physiological correlates of speech production (breathing
rhythm, diaphragm tension, hormonal discharges associated with the alertness required to speak, etc.) may be sufficient to help single out speech
because of its natural correlation with other sensory dimensions that can
also be perceived by the fetus. Of course, this speculative account is only
relevant in the context of immediate post-birth preference for speech. Even
if an initial bias towards language is likely to be advantageous in launching
the newborn infant into its ecological setting, a normal infant who lacks that
initial bias will, in a normal linguistic environment, nevertheless quickly
focus its attention on speech. In all likelihood, the spoken language will
become an inescapably salient acoustic component of the multi-sensory flood
to which the infant is exposed, since it is consistently and widely used in the
infant’s normal environment.
The recent demonstration by Ramus, Hauser, Miller, Morris and Mehler
(2000) that cotton-top tamarin monkeys parallel human newborns in their
ability to pick up specific prosodic cues from speech sequences clearly
suggests that sensitivity to prosodic properties of speech does not require
innate linguistic capacity.
In this scenario, the newborn infant’s initial bias towards speech is due
not to an innate propensity, as suggested by the universal grammar, but to an
epiphenomenon created by the interaction between available neurophysiological and anatomic structures, on the one hand, and the statistical properties
of the pre-birth multi-sensory exposure, on the other.
Post-natal development in a multidimensional perspective
The young infant’s language acquisition process is obviously influenced by
a number of endogenous and exogenous factors. For instance, aspects like
the infant’s anatomic and physiological development must have a relatively
direct impact on the infant’s capacity to produce the speech sounds used in
the ambient language; the infant’s auditory capacity will largely determine the
characteristics of the speech sounds that the infant will discriminate successfully; the nature of the speech input to which the infant is exposed will be
an exogenous component through which the phonetic characteristics of the
ambient language become accessible to the infant; the adult ability to interact
with the young infant and fine-tune to the infant’s needs and expectations is
also likely to be an important exogenous component of the language development process. To appreciate the role that components like these may have in
the early stages of the language acquisition process, let us select some of the
aspects that are expected to have significant impact on the predictions of the
current model.
The infant’s vocalic domain
To estimate the domain of the acoustic output produced by the infant during
its first months of life, an acoustic model of the vocal tract was implemented using anatomic data available from comparative anatomy (Bosma,
1975; Aronson, 1990). One of the most conspicuous differences between the
vocal tract anatomy of the adult and the young infant is the proportion of the
pharyngeal tract to the oral cavity. In the newborn infant the larynx is at the
level of the 3rd cervical vertebra and the pharynx is therefore extremely short.
As the infant matures, the pharynx length increases dramatically during the
Figure 1. Larynx’s position relative to the cervical vertebrae, as a function of age (data from
Aronson, 1990).
first years of age. By about five years of age the relation of the pharyngeal
length to the oral cavity has practically reached adult proportions, although
the larynx will continue to descend throughout life (Figure 1). The acoustic
model was developed according to Fant’s acoustic theory of speech (Fant,
1960). It describes the vocal tract as a series of 20 tubes with regions of
articulatory mobility displaced to reflect the infant’s anatomy. For convenient
comparison with typical adult values, the formants were computed as if the
infant’s vocal tract were 17.5 cm long. The conversion between the adult-based values and the actual formant values for an infant was assumed to be
approximately linear.
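As a rough point of reference for the 17.5 cm normalization, the resonances of a single uniform tube closed at the glottis and open at the lips follow the quarter-wavelength formula F_n = (2n - 1)c / 4L. The sketch below uses that textbook formula, not the 20-tube model described above. The speed of sound, the infant tract length and the function names are assumptions introduced for illustration, and the rescaling simply applies the linear length ratio.

```python
# A minimal sketch, assuming a single uniform tube (neutral vowel), not the
# 20-tube model used in the paper. Constants marked "assumed" are not from
# the source.

SPEED_OF_SOUND_CM_S = 35_000.0  # assumed speed of sound in warm, moist air
REFERENCE_LENGTH_CM = 17.5      # adult-equivalent length used in the text


def neutral_tube_formants(length_cm, n_formants=4, c=SPEED_OF_SOUND_CM_S):
    """Quarter-wavelength resonances of a uniform closed-open tube, in Hz."""
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, n_formants + 1)]


def rescale_to_infant(adult_based_hz, infant_length_cm,
                      reference_length_cm=REFERENCE_LENGTH_CM):
    """Linear conversion from adult-based formants to a shorter tract."""
    factor = reference_length_cm / infant_length_cm
    return [f * factor for f in adult_based_hz]


adult_based = neutral_tube_formants(REFERENCE_LENGTH_CM)
# Hypothetical infant tract length, chosen only because it halves 17.5 cm.
infant = rescale_to_infant(adult_based, infant_length_cm=8.75)
print(adult_based)  # [500.0, 1500.0, 2500.0, 3500.0]
print(infant)       # [1000.0, 3000.0, 5000.0, 7000.0]
```

For a uniform tube this inverse-length scaling is exact; for the articulated 20-tube model it would hold only approximately, in line with the approximately linear conversion assumed above.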
Not surprisingly, the results indicate that opening and closing the jaw with
the tongue resting on the jaw mainly affects F1. The first formant rises quickly
as a consequence of the initial jaw openings but all the other formants tend
to remain unchanged. This acoustic result is depicted in Figure 2a, where
the formant trajectory in the F1 × F2 plane is sampled at constant time
intervals, for a uniform opening gesture. The corresponding stylized spectrogram, showing the trajectories of the first four formants, is displayed in
Figure 2b. According to this computation, the infant would tend to produce a
series of central vowels differing mainly in vowel height, a prediction that is
compatible with experimental observation (e.g. Davis & MacNeilage, 1990;
MacNeilage & Davis, 2000).
Figure 2. F1 and F2 values resulting from opening the jaw with uniform opening and closing
speed. Note that the opening gesture affects mainly F1 and leaves F2 at approximately the
value of a central vowel. This gesture results in a sequence that sounds roughly like a closant,
followed by a vowel that becomes increasingly open, as the jaw is lowered. (a) Trajectory on
the F1 × F2 plane. (b) Trajectory in a stylized spectrogram (frequency, in Hz vs. time, on an
arbitrary scale).
Figure 3. Same as Figure 2 but raising of the tongue dorsum towards the velum of the infant.
Note that because of the non-linear transformation between the infant’s and the adult’s vocal
tract, this movement results in a sound sequence evolving from a schwa vowel to an approximately pharyngeal consonant. (a) Trajectory on the F1 × F2 plane. (b) Trajectory in a stylized
spectrogram (frequency, in Hz vs. time, on an arbitrary scale).
A closure gesture, corresponding roughly to a velar place of articulation
by reference to the infant’s articulatory structures (a constriction at about
1/4 of the vocal tract length), generates formant movements resembling a
vowel + uvular or a vowel + pharyngeal sequence (see Figures 3a, b). Correspondingly, articulatory gestures engaging the young infant’s tongue dorsum
would result in adult equivalents of velar articulations. From this perspective,
the common notion that the infant’s babbling is initially characterized by
pharyngeal and velar sounds (Figures 4a, b) clearly gains a coherent acoustic-articulatory explanation. The infant may, in fact, be activating the same
structures that the adult uses to produce some of the most frequent speech
sounds but because of anatomic differences, the resulting vocalizations sound
as if they had places of articulation further back in the vocal tract.
Figure 4. Same as Figure 2, but raising of the tongue towards the hard palate. This movement
results in a sound sequence evolving from a schwa vowel towards a velar consonant. (a)
Trajectory on the F1 × F2 plane. (b) Trajectory in a stylized spectrogram (frequency, in Hz
vs. time, on an arbitrary scale).
Adult feedback
Adult feedback in response to the infant’s vocalizations is, in terms of the
emergent perspective presented here, an important component of the language
acquisition process. Although, as discussed above, the vocal output produced
by the infant does not necessarily involve adult-like articulatory-acoustic
correspondences, adult listeners often tend to interpret the infant’s vocalizations in terms of speech sounds used in their ambient language. This adult
interpretation can therefore be seen as a systematic bias (a “phonological
filter,” e.g., Sundberg, 1998) that effectively structures the infant’s phonetic
variations (Routh, 1967). In other words, by providing feedback to the infant’s
spontaneous utterances, adults may help the infant to establish equivalence
classes between babbled utterances and adult speech sound categories.
An experimental study of the feedback spontaneously provided by adult
listeners when listening to babbled utterances was reported by Lacerda and
Ichijima (1995). They asked Japanese and Swedish adult listeners (phonetic
students) to estimate the tongue positions used by infants when producing a
series of babbled utterances. When the adult judgments were sorted according
to the age at which the babbled utterances had been produced, the adult
judgments of tongue height were surprisingly consistent for all the ages but
the frontness judgments were consistent only for the late babbling. Interestingly, the outcome of this listening experiment is also compatible with the
acoustic-articulatory predictions of high-low dominance, derived above.
In terms of the general representation model, Lacerda and Ichijima’s
(1995) results suggest that adults may spontaneously provide more consistent
feedback regarding height than frontness, a feedback that may elicit the
infant’s bias towards the height contrasts that can easily be produced by the
opening and closing gestures during vocalization. The adult feedback does
not have to be explicit, in a Skinnerian fashion, nor does it have to be as repetitive as statistical learning per se would require. In real-life situations, the
adult essentially reinterprets the infant’s utterances, lending a (fuzzy) structure and a (fuzzy) meaning to them, in cognitive-constructivistic terms.
But this is not a process that demands a long series of repeated exposures to
stimulus-response contingencies (Kelly, 1963). On the one hand there is an
overall quality attached to the feedback (a sort of paralinguistic emotional
validation); on the other hand the low likelihood of two unrelated events
leading to the same representation lends high significance to a couple of
similar occurrences. Besides, both the infant and the adult generate models of
reality involving these contingencies, on the basis of rather little information.
Obviously, jumping to conclusions before gathering enough statistical data is
always a risky business. However, given the range of ecological settings in
which language acquisition develops, perhaps the risks may not be very high
after all and, at any rate, worth taking to gain communicative competence.
Anisotropies in the infant perceptual space
Experimental evidence from speech perception research with infants has
shown that the perceptual space of the infant is altered by exposure to
language (Kuhl, Williams, Lacerda, Stevens & Lindblom, 1992), suggesting
that young infants may organize vowel perception around vowel prototypes,
as described by Kuhl’s Native Language Magnet (NLM) Theory. The vowel
prototype acts as a magnet that attracts neighboring vowel representations
towards it. Kuhl’s suggestion is that the perceptual space becomes structured
because of the warping in the neighborhood of the vowel prototypes, as a
consequence of exposure to the ambient language.5
Whereas the genesis of the vowel prototypes may be a matter of discussion
(e.g., Frieda, Walley, Flege & Sloane, 1999; Lacerda, 1995), it is possible
that the initial structure of the infant’s perceptual space is affected by other
types of anisotropies. For instance, in line with the acoustic-articulatory
observations, the infant’s ability to discriminate vowel contrasts also seems
to favor distinctions of vowel height, relative to frontness, as indicated by
experimental results obtained by the infant speech perception research group
at Stockholm University. Both 2–3-month-old and 6–12-month-old infants,
who were respectively tested with the High-Amplitude Sucking technique
and with the Head-Turn technique, demonstrated better discrimination performance for vowel contrasts along F1 than along F2, for a set of synthetic vowels
differing only in F1 or in F2 (Lacerda, 1993, 1994). The stimuli had equal
differences, in Bark, along F1 and F2. In addition, to avoid providing intensity
cues correlated with F1, the vowel stimuli that the older infants listened to
were generated by a parallel speech synthesizer (Fant, 1960).
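To illustrate what equal differences in Bark along F1 and F2 imply in Hz, the sketch below converts between the two scales. The text does not say which Bark formula was used in the experiments; Traunmüller's analytical approximation is assumed here, and the base vowel and the step size are hypothetical values.

```python
# A minimal sketch of building stimuli with equal Bark steps along F1 and F2.
# Assumption: Traunmueller's approximation of the Bark scale; the paper does
# not state which conversion was used. Base vowel and step are hypothetical.


def hz_to_bark(frequency_hz: float) -> float:
    return 26.81 * frequency_hz / (1960.0 + frequency_hz) - 0.53


def bark_to_hz(bark: float) -> float:
    # Algebraic inverse of hz_to_bark.
    return 1960.0 * (bark + 0.53) / (26.28 - bark)


def shift_in_bark(frequency_hz: float, delta_bark: float) -> float:
    """Move a frequency by a given number of Bark and return the result in Hz."""
    return bark_to_hz(hz_to_bark(frequency_hz) + delta_bark)


BASE_F1_HZ, BASE_F2_HZ = 500.0, 1500.0  # hypothetical reference vowel
STEP_BARK = 1.0                         # hypothetical contrast size

f1_variant = (shift_in_bark(BASE_F1_HZ, STEP_BARK), BASE_F2_HZ)
f2_variant = (BASE_F1_HZ, shift_in_bark(BASE_F2_HZ, STEP_BARK))

f1_step_hz = f1_variant[0] - BASE_F1_HZ
f2_step_hz = f2_variant[1] - BASE_F2_HZ
print(round(f1_step_hz), round(f2_step_hz))
```

With these values the same one-Bark step corresponds to a considerably larger change in Hz along F2 than along F1, which is why the contrasts are equated on the Bark scale and not in Hz.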
In summary, the infant’s ability to produce and perceive vowel-like sounds
along with the adult’s interpretation of infant babbling, suggest that the young
infant may tend to favor vowel contrasts along the height dimension (contrasts
in F1) rather than along the front-back dimension (contrasts in F2).6 Clearly,
biases of this sort are likely to have a long-term shaping effect on the infant’s
articulatory and perceptual representation of vowels.
Modeling the emergence of linguistic structure
This section will briefly review Lacerda and Lindblom’s model (Lacerda
& Lindblom, 1997, 1998; Lacerda, 1998) illustrating how unstructured
representations converge towards “implicit categories,” that are specified by
small persistent statistical regularities. In line with the notion of representation sketched above, the model assumes that acoustic input gives rise to
activity at a point in the representation space. The coordinates of the point
in the representation space encode the sensory input generated both by the
acoustic signal itself and by all the other sensory inputs simultaneous with
the acoustic signal. The activity level7 at a point in this space at a specific
time is the system’s memory of the encoded event. The initial activity in the
representation space is assumed to be zero everywhere. The activity generated
by external stimuli is mapped onto the appropriate coordinates of the representation space and added to the activity level that might have been elicited
by previous exposure mapping onto those coordinates.
As stated earlier, the representation space is extremely vast. The immense
representation resources associated with a high resolution of representation,
together with the variance of natural external stimuli, lead to an extremely low likelihood of mapping two representations onto the same coordinates of the
representation space. In other words, with unconstrained representation
resources, the system will tend to represent the details associated with every
single external stimulus but fail to capture the implicit overall structure of the
stimuli in general. However, when the mapping of external stimuli is affected
by memory diffusion (Edelman, 1987), or by sensory smearing, the situation
becomes radically different because the system now performs a long-term
running average of the activity levels generated by the external stimuli. This
running average automatically captures part of the structure implicit in the
external stimuli (Elman, 1999).
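The contrast between unconstrained and diffuse memory can be sketched numerically. The example below is only an illustration under stated assumptions: it uses a hypothetical one-dimensional representation space in arbitrary units, a Gaussian smearing function, and parameter values (number of stimuli, spread, grid step) that are not taken from the model.

```python
import math
import random

random.seed(0)

# Hypothetical stimuli scattered around one underlying value (arbitrary units).
stimuli = [random.gauss(5.0, 0.5) for _ in range(200)]

# Unconstrained memory: each stimulus is stored at its own high-resolution
# coordinate, so activity only accumulates if two stimuli coincide exactly.
exact = {}
for x in stimuli:
    key = round(x, 9)
    exact[key] = exact.get(key, 0.0) + 1.0

# Diffuse memory: each stimulus spreads Gaussian activity over its neighbours,
# so the accumulated activity behaves like a running average of the input.
SIGMA = 0.3                             # assumed width of the diffusion
grid = [i * 0.1 for i in range(101)]    # coarse grid from 0.0 to 10.0
diffuse = [0.0] * len(grid)
for x in stimuli:
    for i, g in enumerate(grid):
        diffuse[i] += math.exp(-((g - x) ** 2) / (2 * SIGMA ** 2))

peak_index = max(range(len(grid)), key=diffuse.__getitem__)
peak_position = grid[peak_index]

print("largest activity without diffusion:", max(exact.values()))
print("largest activity with diffusion:", round(max(diffuse), 1))
print("location of the diffuse peak:", round(peak_position, 1))
```

Without diffusion the stored activity stays flat, one unit per event, and reveals nothing about where the stimuli cluster. With diffusion a single peak builds up near the value around which the stimuli were generated.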
In Lacerda and Lindblom’s model the stimuli were vowels leading to two-dimensional representations on the F1 × F2 plane and the memory diffusion
was then described as a two-dimensional Gaussian distribution centered at
the “stimulus coordinates.” The activity levels were made proportional to the
duration of the stimuli.8
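As a compact restatement of this description, the accumulated activity can be written as a sum of Gaussian contributions weighted by stimulus duration. The notation is an assumption introduced here for illustration: A is the activity level, (F_{1,i}, F_{2,i}) and d_i are the coordinates and duration of the i-th stimulus, N is the number of stimuli, and a single isotropic width σ is assumed because the text does not specify the shape of the distribution beyond its being Gaussian.

```latex
A(F_1, F_2) \;\propto\; \sum_{i=1}^{N} d_i \,
  \exp\!\left( -\,\frac{(F_1 - F_{1,i})^2 + (F_2 - F_{2,i})^2}{2\sigma^2} \right)
```

With σ close to zero each term collapses onto its own stimulus coordinates and no overlap arises, whereas a larger σ lets neighbouring stimuli reinforce each other.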
Figure 5. Representation of the areas assigned to /a/, /i/ and /u/ given a decision threshold
of 0.01 (for details, see Lacerda & Lindblom, 1997). The left panel shows the assignments
in terms of plateaux where each category is represented by a given plateau height. The right
panel displays the same information, seen from the top.
The model was applied to a set of 100 vowels, with no prior explicit
knowledge of the type of stimuli to which it was exposed. The simulated
acoustic input consisted of the two formants, corresponding to the sound
being “heard,” along with another dimension corresponding to a random variable, associated with the formant values. This random variable was named
“label” but it is, in fact, not a label in its proper sense. Rather it is a variable that represents circumstantial sensory information, co-occurring with
the formant information. Such circumstantial information tends, during the
early stages of language acquisition, to be statistically related to the speech
information, not deterministically related. For instance, an adult introduces a
teddy bear to an infant by showing the toy and saying its name. According
to the present model, the infant may register the acoustic information corresponding to the sentences produced by the adult but may be staring at a light
source behind the bear. In such a case, the light source, not the bear, will be
represented along with the acoustic information. What the model predicts is
that although several “wrong labels” like this may be stored, in the long run,
the sentences referring to the bear will tend to appear along with the visual
information representing the bear and this consistency will eventually enable
the infant to single out the common denominator between the acoustic and the
visual information: sound strings involving “bear” and something looking
like a teddy bear seen from a variety of angles and contexts.
To model the variance of the natural world, these “labels” were drawn
from a random variable that could assume the value of any of a number of
different categories, with the only constraint being that the probability of
the “intended” category, i.e., the category from which the formant values
had in fact been drawn, was slightly higher than those of the competing
categories. Thus, although the “labels” were determined and limited a priori
to make sure that the model converges within practical computational time,
their random character actually captures the real-life “implicit labeling.” The
computations carried out by Lacerda and Lindblom (1997) indicate that, in
spite of the “wrong” “label”-formant associations, the majority of those
associations locally correspond to the intended associations.
In other words, the model learns to associate labels with certain areas of the
F1 × F2 plane by simply using the “label”-formant association with the
highest activity level in that location. This is illustrated in Figure 5, where
an arbitrary decision threshold was used. In biological systems, the local
dominance of a certain type of label will tend to unbalance the system and
further enhance that dominance, driving the system towards specialized
behavior (Zohary, Celebrini, Britten & Newsome, 1994). Without constraints,
like memory diffusion or sensory smearing, the structure of the representation
space tends to disappear because in the absence of local overlap between the
representations of the stimuli, every event will be unique and recency will be
the only determinant of the activity levels.
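The labeling procedure described above can be sketched as a small simulation. This is not the original implementation: the category centres, the spread of the formant values, the label probabilities, the diffusion width and the normalisation of the threshold by the number of stimuli are all assumptions made for illustration. In addition, 900 stimuli are used instead of the 100 vowels of the original computations, so that the outcome of the sketch does not depend on the particular random draw.

```python
import math
import random

random.seed(1)

# Hypothetical category centres (F1, F2) in Bark; illustrative values only.
CENTRES = {"a": (7.0, 10.5), "i": (3.0, 14.0), "u": (3.5, 7.0)}
LABELS = list(CENTRES)
P_INTENDED = 0.5    # intended label: 0.5; each competing label: 0.25 (assumed)
SIGMA = 0.8         # assumed width of the memory diffusion, in Bark
THRESHOLD = 0.01    # decision threshold on activity per stimulus (assumed scaling)
N_STIMULI = 900     # larger than the 100 vowels in the text, for stability


def draw_stimulus():
    """Formants from a random category plus a noisy co-occurring 'label'."""
    intended = random.choice(LABELS)
    f1 = random.gauss(CENTRES[intended][0], 0.5)
    f2 = random.gauss(CENTRES[intended][1], 0.5)
    rivals = [name for name in LABELS if name != intended]
    if random.random() < P_INTENDED:
        label = intended
    else:
        label = random.choice(rivals)
    return f1, f2, label


stimuli = [draw_stimulus() for _ in range(N_STIMULI)]


def activities(f1, f2):
    """Activity of each label at (f1, f2) after Gaussian memory diffusion."""
    act = dict.fromkeys(LABELS, 0.0)
    for s1, s2, label in stimuli:
        d2 = (f1 - s1) ** 2 + (f2 - s2) ** 2
        act[label] += math.exp(-d2 / (2 * SIGMA ** 2))
    return act


def classify(f1, f2):
    """Label with the highest local activity, or None below the threshold."""
    act = activities(f1, f2)
    best = max(act, key=act.get)
    if act[best] / len(stimuli) < THRESHOLD:
        return None
    return best


for name, (f1, f2) in CENTRES.items():
    print(name, "->", classify(f1, f2))
print("remote point ->", classify(12.0, 2.0))
```

Although about half of the stored labels are “wrong” in this sketch, the intended label still dominates the activity near each category centre, while locations far from any stimulus remain unassigned.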
Conclusion
In general, any correlation along any of the involved dimensions can be used
to establish a “labeling relationship” between two sensory inputs (the conventional stimulus and the sensory input representing its label). For instance, the
infant learning the word “mama” stores all the available information associated with the word. According to the model, as the infant hears the word
“mama,” it also stores other available circumstantial information, i.e., not
only the details of the voice speaking (Locke, 1996), but also the image of
the mother, her smell, her taste, etc., because these sensory inputs are simultaneously available. Of all these simultaneous sensory inputs, those that are
statistically associated with the word may eventually emerge as (unintended)
labels of the very word. Infants are good at picking up statistical relationships between events (e.g., Saffran, Aslin & Newport, 1996) and therefore,
in the long run, even relationships between the olfactory, visual and gustatory
representations of the word “mama” will emerge as reciprocal labels.
According to the present model, young language learners will probably
start by storing acoustic information corresponding to the global characteristics of the speech they are exposed to. This has been shown by de Casper and
his colleagues, as well as, indirectly, by the infant’s preferences for motherese
(Fernald & Kuhl, 1987). As the number of stored multi-sensory representations increases, more finely detailed relationships between the acoustic and the
other sensory inputs emerge spontaneously from the available correlations
between sensory dimensions (Lindblom, 1992). But this succession of correlations of more and more detailed subsets of the sound string stops when the
non-auditory components no longer offer information that must be correlated
with finer sound sub-strings. According to the model proposed here, this is
probably why detailed phonological awareness tends to emerge in response
to the demands posed by sophisticated word games or, more commonly, in
association with the acquisition of reading and writing abilities. In general,
however, the mechanisms underlying the emergence of phonological structure may be essentially the same as those involved in syntax (Anward &
Lindblom, 2000). Indeed, because language’s combinatorial principles apply
in fundamentally the same way to sentences, words and increasingly detailed
parts of words, correlation between different kinds of sensory information
may be a pervasive structuring component at any of these levels.
Acknowledgements
The author is indebted to Amanda Walley, an anonymous reviewer, and Ulla
Sundberg for their comments on an earlier version of this paper. Research was
supported by The Bank of Sweden Tercentenary Foundation (Grant 94-0435)
and by Stockholm University.
Notes
1. In fact, repeated exposure to a stimulus must actually elicit different detailed responses,
since new responses inevitably interact with the representations caused by the organism’s
early history. Incidentally, because living organisms must continuously repair themselves
so as not to succumb to the second law of thermodynamics, they are in a sense
under continuous evolution, even in the absence of repeated exposure to explicit external
stimuli.
2. The complexity of the mutual interactions of the sensory channels is also a structuring
component.
3. Obviously, Markov models or ANOVA models with repeated measures do use an implicit
time dimension but still tend to portray data sets as static arrays.
4. What is described here is essentially the basis of principal component analysis.
5. See Frieda et al. (1999) for a discussion of the phenomenon from the perspective of adult
vowel perception, and Lacerda (1995) for a discussion of the genesis of the phenomenon.
6. Incidentally, it may be noted that natural vowel systems also tend to exploit more
vowel height contrasts than frontness contrasts as the number of vowels in the system
increases (Liljencrants & Lindblom, 1972; Lindblom & Maddieson, 1988). In addition,
front-back contrasts in natural vowel systems do generally involve rounding of the back
vowels, as if the perceptual salience of front-back contrasts conveyed by F2 alone needs
to be enhanced by a general lowering of all the formants.
7. “Activity” is, in fact, an extra dimension in the representation space.
8. This proportionality would not be necessary if the stimuli were represented by series of
pairs of F1 and F2 values, sampled at a given sampling frequency because in that case the
cumulative activity levels would implicitly be linked to the stimuli durations.
References
Anward, J. & Lindblom, B. (2000). On the rapid perceptual processing of speech: From
signal information to phonetic knowledge. Proceedings of the International Symposium
on Language Processing and Interpreting, Stockholm University, Stockholm, February,
1997. http://lab1.isp.su.se/iis/Anward-Lindblom.PDF.
Aronson, A. (1990). Clinical voice disorders. New York: Thieme.
Bosma, J. (1975). Anatomic and physiologic development of the speech apparatus. In D.
Tower (Ed.), The nervous system, vol. 3: Human communication and its disorders. New
York: Raven Press.
Chomsky, N. (1975). Reflections on language. Glasgow: William Collins Sons.
Davis, B. & MacNeilage, P. (1990). Acquisition of correct vowel production: A quantitative
case study. Journal of Speech and Hearing Research, 33, 16–27.
Davis, K. (1947). Final note on a case of extreme social isolation. American Journal of
Sociology, 52, 432–437.
De Casper, A. & Fifer, W. (1980). Of human bonding: Newborns prefer their mothers’ voices.
Science, 208, 1174–1176.
De Casper, A. & Prescott, P. (1984). Human newborns’ perception of male voices: Preference,
discrimination, and reinforcing value. Developmental Psychobiology, 17, 481–491.
Dennett, D. (1995). Darwin’s dangerous idea: Evolution and the meanings of life. New York:
Touchstone.
Ecklund-Flores, L. & Turkewitz, G. (1996). Asymmetric headturning to speech and nonspeech
in human newborns. Developmental Psychobiology, 29, 205–217.
Edelman, G. (1987). Neural darwinism: The theory of neuronal group selection. New York:
Basic Books.
Eimas, P., Siqueland, E., Jusczyk, P. & Vigorito, J. (1971). Speech perception in infants.
Science, 171, 303–306.
Elman, J. (1999). The emergence of language: A conspiracy theory. In B. MacWhinney (Ed.),
The emergence of language (pp. 1–27). Mahwah, New Jersey: Erlbaum.
Fant, G. (1960). Acoustic theory of speech production. The Hague, The Netherlands: Mouton.
Fernald, A. & Kuhl, P. (1987). Acoustic determinants of infant preference of motherese
speech. Infant Behavior and Development, 10, 279–293.
Frieda, E., Walley, A., Flege, J. & Sloane, M. (1999). Adults’ perception of native and
nonnative vowels: Implications for the perceptual magnet effect. Perception and Psychophysics, 61, 561–577.
Gleitman, L. & Newport, E. (1995). The invention of language by children: Environmental
and biological influences on the acquisition of language. In L. Gleitman, M. Liberman &
D. Osherson (Eds.), Language, vol. 1: An invitation to cognitive science. Cambridge: MIT
Press.
Greenough, W. & Alcantara, A. (1993). The roles of experience in different developmental
information stage processes. In B. de Boysson-Bardies, S. de Schonen, P. Jusczyk,
P. McNeilage & J. Morton (Eds.), Developmental neurocognition: Speech and face
processing in the first year of life (pp. 3–16). Dordrecht, The Netherlands: Kluwer
Academic Publishers.
Jacob, F. (1982). The possible and the actual. Seattle: University of Washington Press.
Jusczyk, P. (1997). The discovery of spoken language. Cambridge: MIT Press.
Kelly, G. (1963). A theory of personality: The psychology of personal constructs. New York:
W.W. Norton.
Kuhl, P. & Miller, J. (1975). Speech perception by the chinchilla: Voiced-voiceless distinction
in alveolar-plosive consonants. Science, 190, 69–72.
Kuhl, P. & Padden, D. (1982). Enhanced discriminability at the phonetic boundaries for the
voicing feature in macaques. Perception and Psychophysics, 32, 542–550.
Kuhl, P. & Padden, D. (1983). Enhanced discriminability at the phonetic boundaries for the
place feature in macaques. Journal of the Acoustical Society of America, 73, 1003–1010.
Kuhl, P., Williams, K., Lacerda, F., Stevens, K. & Lindblom, B. (1992). Linguistic experience
alters phonetic perception in infants by 6 months of age. Science, 255, 606–608.
Lacerda, F. (1993). Sonority contrasts dominate young infants’ vowel perception. PERILUS
XVII, 55–63, Stockholm University.
Lacerda, F. (1994). The asymmetric structure of the infant’s perceptual vowel space. Journal
of the Acoustical Society of America, 95, 3016 (A).
Lacerda, F. (1995). The perceptual magnet-effect: An emergent consequence of exemplar-based phonetic memory. In K. Elenius & P. Branderud (Eds.), Proceedings of the
international congress of phonetic sciences 95, Vol. 2 (pp. 140–147). Stockholm: ICPhS.
Lacerda, F. (1998). An exemplar-based account of emergent phonetic categories. Journal of
the Acoustical Society of America, 103, 2980–2981.
Lacerda, F. & Ichijima, T. (1995). Adult judgements of infant vocalizations. In K. Elenius &
P. Branderud (Eds.), Proceedings of the International Congress of Phonetic Sciences 95,
Vol. 1 (pp. 142–145). Stockholm: ICPhS.
Lacerda, F. & Lindblom, B. (1997). Modeling the early stages of language acquisition. In Å.
Olofsson & S. Strömqvist (Eds.), Cross-linguistic studies of dyslexia and early language
development (pp. 14–33). Brussels: European Commission/COST A8.
Lacerda, F. & Lindblom, B. (1998). Some remarks on Tallal’s transform in the light of emergent phonology. In C. von Euler, I. Lundberg & R. Llinás (Eds.), Basic mechanisms in
cognition and language (pp. 263–283). Amsterdam: Elsevier.
Liljencrants, J. & Lindblom, B. (1972). Numerical simulation of vowel quality systems: The
role of perceptual contrast. Language, 48, 839–862.
Lindblom, B. & Maddieson, I. (1988). Phonetic universals in consonant systems. In L.M.
Hyman & C.N. Li (Eds.), Language, speech and mind: Studies in honor of Victoria
Fromkin (pp. 62–78). London: Routledge.
Lindblom, B. (1992). Phonological units as adaptive emergents of lexical development. In
C.A. Ferguson, L. Menn & C. Stoel-Gammon (Eds.), Phonological development (pp. 131–
163). Timonium, Maryland: York Press.
Locke, J. (1996). Why do infants begin to talk? Language as an unintended consequence.
Journal of Child Language, 23, 251–268.
MacNeilage, P. & Davis, B. (2000). On the origin of internal structure of word forms. Science,
288, 527–531.
Pinker, S. (1994). The language instinct. New York: Morrow.
Ramus, F., Hauser, M., Miller, C., Morris, D. & Mehler, J. (2000). Language discrimination
by human newborns and by cotton-top tamarin monkeys. Science, 288, 349–351.
Routh, D. (1967). Conditioning of vocal response differentiation in infant. Developmental
Psychology, 1, 219–226.
Saffran, J., Aslin, R. & Newport, E. (1996). Statistical learning by 8-month-old infants.
Science, 274, 1926–1928.
Stevens, S. & Davis, H. (1938). Hearing, its psychology and physiology. New York: John
Wiley.
Sundberg, U. (1998). Mother tongue – Phonetic aspects of infant-directed speech. Unpublished Ph.D. thesis, PERILUS XXI, Stockholm University.
Tulving, E. (1998). Neurocognitive processes of human memory. In C. von Euler, I. Lundberg & R. Llinás (Eds.), Basic mechanisms in cognition and language (pp. 263–283).
Amsterdam: Elsevier.
Turkewitz, G. (1993). The origins of differential hemispheric strategies for information
processing in the relationships between voice and face perception. In B. de Boysson-Bardies, S. de Schonen, P. Jusczyk, P. McNeilage & J. Morton (Eds.), Developmental
neurocognition: Speech and face processing in the first year of life (pp. 165–170).
Dordrecht, The Netherlands: Kluwer Academic Publishers.
Varendi, H., Porter, R. & Winberg, J. (1996). Attractiveness of amniotic fluid odor: Evidence
of prenatal olfactory learning? Acta Paediatrica, 85, 1223–1227.
Varendi, H., Porter, R. & Winberg, J. (1997). Natural odor preferences of newborn infants
change over time. Acta Paediatrica, 86, 985–990.
Zohary, E., Celebrini, S., Britten, K. & Newsome, W. (1994). Neuronal plasticity that underlies
improvement in perceptual performance. Science, 263, 1289–1292.
Address for correspondence: Francisco Lacerda, Department of Linguistics, Stockholm
University, SE-106 91 Stockholm, Sweden
Phone: +46-8-162341; Fax: +46-8-155389; E-mail: [email protected]